Article

FFMN: Fast Fitting Mesh Network for Monocular 3D Human Reconstruction in Live-Line Work Scenarios

1 Guangzhou Power Supply Bureau, Guangdong Power Grid Co., Ltd., Guangzhou 510510, China
2 College of Electrical and Information Engineering, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(12), 2362; https://doi.org/10.3390/electronics14122362
Submission received: 29 April 2025 / Revised: 30 May 2025 / Accepted: 2 June 2025 / Published: 9 June 2025

Abstract

In live-line power distribution operations, the 3D pose and action recognition of workers is of critical significance for safety assurance and intelligent monitoring. We propose the fast fitting mesh network (FFMN), a novel neural network architecture for fast, fitting-based parametric 3D human reconstruction from monocular images in live-line work scenarios. FFMN employs convolutional neural networks to extract feature information from input images and adopts an optimization strategy for inverse problems, reprojecting keypoints from the human model onto feature maps to acquire feedback. A transformer-based updater module then adjusts the model to better align with the person in the image. Unlike conventional regression or recurrent models, FFMN trains faster, uses fewer parameters, and achieves shorter inference times. Moreover, FFMN demonstrates a significant inference speed advantage (latency = 15 ms) on the 3DPW and Human3.6M datasets while maintaining competitive accuracy (MPJPE < 50 mm), highlighting its high practicability in real-world applications.

1. Introduction

Recovering a 3D human mesh from a single image is a core problem in computer vision and computer graphics. The objective is to infer a 3D human model, characterized by realistic geometry and articulated kinematics, as represented by parametric models such as SMPL [1] and SMPL-X [2] from a 2D image. This technology plays a vital role in applications including augmented reality, virtual try-on, and sports analysis. However, challenges such as depth ambiguity inherent in monocular imagery, a wide range of complex poses [3], and occlusions [4] between the human body and the surrounding environment make it extremely difficult to reconstruct high-precision and physically plausible human meshes from single-view images.
Traditional 3D human modeling technologies [5] heavily rely on multi-view imaging or depth sensors [6], requiring specialized equipment (e.g., synchronized multi-camera arrays or LiDAR) and suffering from time-consuming processes and complex deployment. For instance, multi-view reconstruction demands synchronized multi-angle image inputs, yet the intricate environments of live-line power distribution sites—such as densely packed power towers and substation equipment—make it impractical to deploy multi-camera systems or carry professional sensors. This results in high reconstruction latency (typically exceeding 500 ms), failing to meet real-time safety monitoring requirements for live-line operations. In contrast, single-view 3D human reconstruction technology [7] directly addresses the synergistic demands of real-time performance, precision, and cost-effectiveness in power distribution safety operations. By reconstructing 3D human models in real time from monocular visual data (with MPJPE below 60 mm), pose tracking is enabled for workers in high-risk scenarios (e.g., safety distance alerts during insulator replacement or conductor repair) while eliminating reliance on expensive multi-view hardware systems. With its millisecond-level modeling capability (latency under 50 ms), this technology not only enhances dynamic risk awareness for personnel but also integrates with digital twin platforms to drive intelligent upgrades in power grid maintenance, such as improving equipment inspection efficiency by approximately 40%. Aligned with the strategic goals of “human–machine collaboration and data-driven intelligence” in China’s New Power System initiative, it provides critical technical support for the digital transformation of grid infrastructure, bridging the gap between operational safety and next-generation smart grid development.
Early approaches primarily relied on optimization-based strategies using parametric human models. For example, SMPLify (2016) [2] iteratively aligned detected 2D keypoints with the SMPL model; however, this method depends heavily on handcrafted priors—such as pose regularization and collision constraints—and suffers from low computational efficiency. With the advent of deep learning, researchers began to explore end-to-end networks that directly regress the parameters of the human model. Representative works include HMR (2018) [8], which was the first to integrate adversarial training into the 3D human reconstruction process by merging image features with model parameter priors to improve prediction stability, and SPIN (2019) [9], which combines optimization and regression frameworks by employing a neural network to predict initial parameters, thereby accelerating the traditional optimization procedure.
In this paper, we propose an architecture for 3D human reconstruction: fast fitting mesh network (FFMN). FFMN reformulates the regression problem as a learning-to-optimize task, and thanks to its lightweight network design, it achieves faster training and higher inference speeds while maintaining competitive accuracy compared to existing frameworks.
FFMN is inspired by ReFit [10], but it differs in several important aspects. (1) FFMN employs a more lightweight backbone—an improved version of YOLO [11]—to extract image features efficiently. (2) It utilizes a transformer module to control keypoints, which decouples the recurrent structure and allows keypoint reprojection to occur only once. (3) A self-attention module is integrated into the keypoint reprojection process, enabling the model to focus on feature information in the vicinity of the keypoints. This novel design facilitates the rapid fitting of a 3D human mesh (Figure 1).
For model fitting, the objective function computes the L2 distance between the reprojected points and the detected keypoints [12,13]. In FFMN, each reprojected keypoint queries a window on the feature map, where the window encodes first-order information corresponding to the derivative of the L2 objective. The extracted features are then passed to a transformer-based update module to compute the necessary updates for the human model.
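For concreteness, a standard form of this fitting objective and its first-order term can be written as follows (the paper does not spell out the exact expression, so the symbols here are illustrative):
$E(\Theta) = \sum_{k=1}^{K} \left\| \pi\left( J_k(\Theta) \right) - \hat{x}_k \right\|_2^2, \qquad \nabla_{\Theta} E = 2 \sum_{k=1}^{K} \left( \pi\left( J_k(\Theta) \right) - \hat{x}_k \right)^{\top} \frac{\partial \pi\left( J_k(\Theta) \right)}{\partial \Theta}$
where $\Theta$ collects the SMPL and camera parameters, $J_k(\Theta)$ is the $k$-th model keypoint, $\pi$ is the camera projection, and $\hat{x}_k$ is the detected 2D keypoint. Because each gradient term depends only on the local residual around the reprojected point $\pi(J_k(\Theta))$, a small feature window centered at that point carries exactly the first-order information the updater needs.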
The transformer update module of FFMN is designed to learn decoupled update rules for three groups of parameters. These include a hidden state vector for pose ($\mathbb{R}^{B \times 24 \times h}$), a shape vector ($\mathbb{R}^{B \times h}$), and a camera parameter vector ($\mathbb{R}^{B \times h}$). Each of these three parameter groups is first encoded with positional encodings before being processed in batch matrix operations, facilitating efficient training and inference.
Unlike previous architectures, FFMN leverages reprojection-based query feedback rather than using a static global feature vector [14]. Moreover, in contrast to conventional learning-to-optimize techniques that rely on explicit skeleton keypoint detection, FFMN utilizes learned features for end-to-end training, further enhancing its efficiency and performance.
The main contributions of this work are as follows:
  • We propose FFMN, a novel fast fitting mesh network for monocular 3D human reconstruction, specifically designed for real-time applications in live-line work scenarios.
  • FFMN introduces a lightweight backbone based on an improved YOLO architecture for efficient multi-scale feature extraction, and a transformer-based update module that decouples the optimization of pose, shape, and camera parameters, achieving rapid and accurate mesh fitting through parallel processing.
  • A keypoint reprojection feedback mechanism is developed, which leverages mocap marker-based features and dropout regularization to enhance robustness against occlusions and improve fitting accuracy.

2. Related Work

Monocular 3D human mesh recovery is one of the core tasks in computer vision, aiming to reconstruct 3D human models with plausible poses and shapes from a single RGB image. With the widespread adoption of parametric human models (e.g., SMPL [1], SMPL-X [2]), two principal technical routes—optimization-based and regression-based—have emerged, which have further spawned hybrid methods, novel network architectures, and extensions to multiple scenarios.
Optimization-based approaches were prominent in the early work. For instance, SMPLify [2] optimized the SMPL parameters via gradient descent to align the detected 2D keypoints, relying on manually designed priors (such as pose regularization and collision penalties) to avoid implausible outputs. However, this method is sensitive to noise in 2D detections. Subsequent work improved robustness by incorporating multimodal cues: HMR-2D [8] combined DensePose and semantic segmentation, while Bodynet [15] integrated volumetric features and contour alignment to reinforce shape constraints. Moreover, to reduce the dependency on handcrafted priors, some studies incorporated deep learning techniques [16,17,18] into the optimization process.
End-to-end regression-based methods directly map images to 3D mesh parameters or vertex coordinates through deep neural networks, thereby eliminating iterative optimization and enabling efficient inference. HMR [8] pioneered this approach by employing adversarial training to regress the SMPL rotation parameters (represented in the axis-angle format) and leveraged unpaired 3D data to mitigate the scarcity of annotated data. Subsequent studies improved rotation representations (e.g., 6D continuous vectors [19] and rotation matrices [20]) to address the discontinuity inherent in the axis-angle representation. ProHMR [21] further modeled the distribution of pose parameters using conditional normalization flows, supporting multi-hypothesis generation, while Sengupta et al. [22] employed an autoregressive model to encode the hierarchical dependencies among joint rotations [23], thereby enhancing pose consistency.
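As a concrete illustration of the rotation representations mentioned above, the sketch below shows the commonly used mapping from a continuous 6D vector to a rotation matrix via Gram–Schmidt orthogonalization; it is a generic PyTorch sketch of the representation, not code taken from any of the cited methods.

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(x6d: torch.Tensor) -> torch.Tensor:
    """Map a (..., 6) continuous rotation representation to (..., 3, 3) rotation matrices."""
    a1, a2 = x6d[..., :3], x6d[..., 3:]
    b1 = F.normalize(a1, dim=-1)                      # first orthonormal basis vector
    b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1    # remove the component along b1
    b2 = F.normalize(b2, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)                  # completes a right-handed frame
    return torch.stack((b1, b2, b3), dim=-1)          # columns are the basis vectors

# Example: a batch of 24 joint rotations predicted as 6D vectors.
rots = rot6d_to_matrix(torch.randn(1, 24, 6))         # -> (1, 24, 3, 3)
```

Because this map is continuous, unlike axis-angle or quaternion outputs with sign ambiguity, it is easier for a regression network to learn.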
Hybrid iterative methods combine the strengths of optimization and regression, progressively refining predictions through multi-step iterations to balance data-driven flexibility and physical plausibility [24]. ReFit [10] introduced an iterative update module based on GRUs that decouples the parameter update streams—each GRU unit is responsible for adjusting specific joint or shape parameters—and employs a feedback mechanism to extract local error signals from image features. Its multi-view extension improves consistency through cross-view parameter averaging. Similarly, PyMAF [19] constructed a multi-scale feature pyramid and, starting from a coarse prediction, projected the mesh onto high-resolution feature maps to extract local cues for progressively refining joint alignment.
In contrast, our proposed FFMN is based on a hybrid optimization and regression approach that directly outputs predictions without requiring multiple iterations. By leveraging a lightweight backbone and decoupled outputs, FFMN achieves strong real-time performance. The use of keypoint reprojection to update joint and shape parameters further enhances its accuracy.

3. Methods

Given an input image of a person, our objective is to predict the parameters of the SMPL human model. In this process, FFMN’s backbone network first extracts feature representations from the image (Section 3.1), which are then processed by a multilayer perceptron to obtain a coarse-grained representation of the human model. These coarse outputs are reprojected onto the feature space for comparative evaluation (Section 3.2) and subsequently updated by transformer and regression perceptron layers to yield a refined, fine-grained human model (Section 3.3). An overview of the method is illustrated in Figure 2. We provide an overview of the key steps involved in the FFMN method, as illustrated in Algorithm 1.
Algorithm 1 FFMN Algorithm Overview
  • Input: Monocular image $I$
  • Output: Refined SMPL parameters $\sigma'$ (pose and shape)
  •     1. Extract multi-scale features $F = f(I)$ using the backbone network
  •     2. Predict the initial SMPL parameters $\sigma = \mathrm{MLP}(F)$
  •     3. Derive the model keypoints $x_1, \ldots, x_K$ from $\sigma$ and reproject them onto $F$
  •     4. For each keypoint $k = 1, \ldots, K$: extract the local feedback $f_k = F(x_k)$
  •     5. Compute the update features $F' = T(f_1, \ldots, f_K, \sigma)$, where $T$ is the transformer update module
  •     6. Regress the refined parameters $\sigma' = \mathrm{Regression}(F')$
  • Return: Refined SMPL parameters $\sigma'$
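A minimal PyTorch-style sketch of this pipeline is shown below. The submodules, the helper functions project_keypoints and sample_windows, and the residual form of the final update are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FFMNSketch(nn.Module):
    """Skeleton of the coarse-to-fine FFMN pipeline in Algorithm 1 (submodules are placeholders)."""
    def __init__(self, backbone, init_mlp, updater, regressor, radius: int = 3):
        super().__init__()
        self.backbone, self.init_mlp = backbone, init_mlp
        self.updater, self.regressor = updater, regressor
        self.radius = radius

    def forward(self, image, c_bbox, s_bbox):
        feats = self.backbone(image)                          # multi-scale image features F
        sigma = self.init_mlp(feats.mean(dim=(2, 3)))         # coarse SMPL estimate from pooled features
        kpts2d = project_keypoints(sigma, c_bbox, s_bbox)     # reproject model keypoints onto F (hypothetical helper)
        f_k = sample_windows(feats, kpts2d, self.radius)      # per-keypoint feedback windows (hypothetical helper)
        feedback = torch.cat([f_k.flatten(1), sigma, c_bbox, s_bbox[:, None]], dim=1)
        hidden = self.updater(feedback, sigma)                # transformer update module (Section 3.3)
        return sigma + self.regressor(hidden)                 # refined parameters (residual update is an assumption)
```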

3.1. Feature Extraction

The proposed lightweight feature extraction network is an improved design based on the YOLO [11] framework, incorporating a multi-scale feature fusion mechanism along with attention enhancement modules. The overall architecture consists of an efficient multi-feature extraction layer and a hierarchical feature fusion pathway, and a re-parameterization design balances model accuracy against inference speed, as shown in Figure 3.
The multi-feature extraction layer employs a three-stage hierarchical structure that progressively downsamples the input to construct a multi-scale feature space: (1) Shallow Feature Extraction. This stage consists of a 5-layer convolution module that includes 2 standard 3 × 3 convolution layers (with stride = 2) and 3 C3k2 bottleneck modules, producing a 512-dimensional feature map. It preserves rich spatial details, which are critical for the preliminary localization of human contours. (2) Mid-level Feature Extraction. Two stages of downsampling are employed to obtain a 1024-dimensional feature map. Residual-connected C3k2 modules are used in this stage to enhance feature reuse, thereby improving the robustness of the extracted features. (3) Deep Semantic Extraction. An SPPF (Spatial Pyramid Pooling Fast) module is introduced to expand the receptive field, coupled with a dual-channel C2PSA attention module to amplify keypoint-responsive features. The final output of this stage is a 1024-dimensional feature map enriched with high-level semantic information.
Notably, to achieve a lightweight design, the network utilizes dynamic downsampling and feature compression. Specifically, strided convolutions are employed in lieu of pooling operations to reduce information loss while improving hardware parallel efficiency. At the network’s end, global average pooling compresses the spatial feature map into a compact 512-dimensional vector, which serves as an efficient representation for subsequent pose classification tasks.
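The snippet below sketches this three-stage hierarchy in PyTorch under simplifying assumptions: the C3k2 and C2PSA modules are replaced by plain strided convolution blocks and a lightweight SPPF-style pooling block, and the final 512-dimensional vector is obtained by global average pooling followed by a linear layer; only the channel widths (512/1024/1024/512) follow the description above.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=3, s=1):
    """Convolution + BatchNorm + SiLU, the basic block used throughout this sketch."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class SPPFLite(nn.Module):
    """Simplified SPPF: repeated max-pooling widens the receptive field, then a 1x1 conv fuses."""
    def __init__(self, c):
        super().__init__()
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)
        self.fuse = conv_bn_act(4 * c, c, k=1)

    def forward(self, x):
        p1 = self.pool(x); p2 = self.pool(p1); p3 = self.pool(p2)
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))

class BackboneSketch(nn.Module):
    """Shallow (512-d), mid-level (1024-d), and deep semantic (1024-d) stages, plus a 512-d pooled vector."""
    def __init__(self):
        super().__init__()
        self.shallow = nn.Sequential(conv_bn_act(3, 128, s=2), conv_bn_act(128, 512, s=2))
        self.mid = nn.Sequential(conv_bn_act(512, 1024, s=2), conv_bn_act(1024, 1024, s=2))
        self.deep = nn.Sequential(SPPFLite(1024), conv_bn_act(1024, 1024, k=1))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1024, 512))

    def forward(self, x):
        f_shallow = self.shallow(x)        # spatial detail for preliminary contour localization
        f_mid = self.mid(f_shallow)        # strided convolutions instead of pooling
        f_deep = self.deep(f_mid)          # enlarged receptive field, high-level semantics
        return f_deep, self.head(f_deep)   # feature map F and compact 512-d vector
```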

3.2. Reprojection Feedback

Based on the preliminary estimates obtained in Section 3.1, the keypoints derived from the SMPL model are reprojected onto the feature map F to retrieve spatial features. In this design, each keypoint is mapped to a single feature channel, resulting in K channels corresponding to K keypoints. The rationale is to enable the feature extraction module to capture distinctive information for each keypoint.
Given a reprojected point $x_k = (u, v)$ (where $u$ and $v$ denote the pixel coordinates), we extract a window of features centered at $x_k$ according to the following formulation:
$f_k = \left\{ F_k(x) \;:\; \left\| x - x_k \right\|_{\infty} \le r \right\}$
where $r = 3$ pixels represents the radius of the window. By integrating the feedback from all keypoints and concatenating it with the current estimated parameters $\sigma$, as well as the bounding box center $c_{bbox}$ and scale $s_{bbox}$, the final feedback vector is constructed as follows:
$f = \left[ f_1, \ldots, f_K, \sigma, c_{bbox}, s_{bbox} \right]$
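A possible implementation of the per-keypoint window lookup in Equation (1) is sketched below; the zero padding, coordinate rounding, and explicit Python loops are simplifications for clarity rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def sample_keypoint_windows(feat: torch.Tensor, kpts: torch.Tensor, r: int = 3) -> torch.Tensor:
    """Gather a (2r+1) x (2r+1) window of channel-k features around each reprojected keypoint.

    feat: (B, K, H, W) feature map with one channel per keypoint.
    kpts: (B, K, 2) reprojected (u, v) pixel coordinates in crop space.
    returns: (B, K, (2r+1)**2) flattened per-keypoint feedback features f_k.
    """
    B, K, H, W = feat.shape
    padded = F.pad(feat, (r, r, r, r))                 # zero-pad so edge windows stay in bounds
    rows = []
    for b in range(B):
        per_kpt = []
        for k in range(K):
            u = int(kpts[b, k, 0].round().clamp(0, W - 1)) + r   # shift into padded coordinates
            v = int(kpts[b, k, 1].round().clamp(0, H - 1)) + r
            per_kpt.append(padded[b, k, v - r: v + r + 1, u - r: u + r + 1].reshape(-1))
        rows.append(torch.stack(per_kpt))
    return torch.stack(rows)                           # (B, K, (2r+1)**2)
```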
It is important to note that each channel in the keypoint feature map does not directly detect a keypoint; rather, it learns features associated with a particular keypoint. There are three types of keypoints available: semantic keypoints (K = 24), mocap markers (K = 67), and evenly sampled mesh vertices (K = 231). Our objective is to design a lightweight model without compromising accuracy; therefore, the mocap markers are selected, because they provide more reliable pose and shape information than semantic keypoints and contain less redundancy than mesh vertices. We adopt the same mocap marker definitions as in AMASS [25], where the markers are defined on the mesh surface by choosing the nearest vertices.
In current monocular human regression methods, a common paradigm is to use a square crop centered on the human body as the model input, under the assumption that the optical axis of the imaging system passes through the geometric center of the crop. However, in practical scenarios, input data often originate from full-frame images with varying optical centers. This systematic discrepancy between the assumed condition and the actual data acquisition can significantly affect the accuracy of the rotation matrix estimation.
In contrast to CLIFF [26], we adjust the reprojected points from the full frame back to the cropped image. Specifically, the transformation is defined as follows:
$x_{2D}^{crop} = \left( x_{2D}^{full} - c_{bbox} \right) / s_{bbox}$
where $c_{bbox}$ and $s_{bbox}$ denote the center and scale of the bounding box corresponding to the person. Although this operation might appear trivial, it confers two distinct advantages. Firstly, it normalizes the scale of the reprojected points, ensuring that the reprojection error remains independent of the person's size in the original image. Secondly, since each reprojected keypoint generates a feedback signal, we employ dropout [27] to mitigate over-reliance on the global keypoint signals and to enhance the robustness of the network at test time when some keypoints are occluded. Specifically, a dropout rate of 0.25 was applied during training, meaning that each keypoint's feedback has a 25% chance of being set to zero.
Our approach shares conceptual similarities with spatial dropout [28] in terms of regularization, as both methods suppress specific feature responses to improve model generalizability. However, there is a fundamental implementation difference: spatial dropout conducts random channel-wise deactivation (typically with a dropout rate of 0.2–0.5 [28]), whereas the keypoint feedback dropout proposed here implements a global, spatial-wise all-or-nothing strategy. In practice, when a keypoint’s confidence falls below an adaptive threshold, the entire gradient backpropagation path for that keypoint is completely terminated.
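A compact sketch of the crop adjustment in Equation (3) and the all-or-nothing keypoint feedback dropout is given below; the tensor shapes are assumptions, and the random mask mirrors the "set to zero with probability 0.25" behaviour described above (without the rescaling used in standard dropout).

```python
import torch

def adjust_to_crop(x2d_full: torch.Tensor, c_bbox: torch.Tensor, s_bbox: torch.Tensor) -> torch.Tensor:
    """Map full-frame reprojections into normalized crop coordinates: (x_full - c_bbox) / s_bbox.

    x2d_full: (B, K, 2) reprojected points, c_bbox: (B, 2) box centers, s_bbox: (B,) box scales.
    """
    return (x2d_full - c_bbox[:, None, :]) / s_bbox[:, None, None]

def keypoint_feedback_dropout(f_k: torch.Tensor, p: float = 0.25, training: bool = True) -> torch.Tensor:
    """Zero out each keypoint's entire feedback window with probability p (all-or-nothing),
    rather than dropping individual feature channels as in spatial dropout.

    f_k: (B, K, D) per-keypoint feedback features.
    """
    if not training or p <= 0.0:
        return f_k
    keep = (torch.rand(f_k.shape[:2], device=f_k.device) > p).to(f_k.dtype)   # (B, K) keep mask
    return f_k * keep.unsqueeze(-1)
```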

3.3. Transformer Update Module

The update module takes the feedback signal f as input (refer to Equation (2)), which integrates feature-layer information with the coarse-grained human model data. It employs parallel transformer encoder branches to separately process three key streams—human pose, shape, and camera parameters—thereby achieving a collaborative update of multi-dimensional state information.
The module's input comprises the hidden state vectors $h_{pose} \in \mathbb{R}^{B \times 24 \times h}$, $h_{shape} \in \mathbb{R}^{B \times h}$, and $h_{cam} \in \mathbb{R}^{B \times h}$, along with the current frame feature layers (which are obtained by concatenating information related to localization, pose, shape, and camera parameters). The core component consists of three independently designed transformer processing branches (illustrated in Figure 4):
Pose Encoding Branch: This branch uses a transformer structure with 24-dimensional joint positional encodings. Initially, $h_{pose}$ is linearly projected from $\mathbb{R}^{B \times 24 \times 32}$ to $\mathbb{R}^{B \times 24 \times 256}$ and then added to the spatially expanded frame features. Next, positional encodings based on trigonometric functions [29] are injected. An 8-head self-attention mechanism is used to model spatial dependencies at the joint level, and the branch eventually outputs the new $h_{pose} \in \mathbb{R}^{B \times 24 \times h}$.
Shape Encoding Branch: A single-token processing architecture is designed for the shape stream. Here, $h_{shape}$ is first projected to $\mathbb{R}^{B \times 1 \times 256}$ and then fused with the frame features. A layer-normalized transformer encoder is applied to capture the global representation of the shape parameters. After a subsequent linear layer reduces the dimensionality, the branch outputs a refined shape representation, the new $h_{shape} \in \mathbb{R}^{B \times 32}$.
Camera Parameter Branch: This branch is similar in structure to the shape branch but maintains its own independent hidden state space. The feature interaction incorporates residual connections to ensure numerical stability during the optimization of camera extrinsics. Additionally, a multi-head attention mechanism effectively models the implicit correlations among camera parameters.
Each branch produces an intermediate representation of unified dimensions, and LayerNorm is applied after the transformer layers to standardize the outputs. This design enables the independent learning of each modality while facilitating cross-modal information exchange by sharing the lower-level frame features. Experimental results demonstrate that the inclusion of positional encoding reduces the pose estimation error. While merging the three modality-specific branches (pose, shape, and camera) through shared weights reduces model complexity, cross-modal interference increases the MPJPE error compared to isolated modality processing, as shown in Section 4.3.
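The sketch below illustrates one possible realization of the three parallel branches in PyTorch. The stated dimensions (hidden size h = 32, 256-d model width, 8 heads, 24 pose tokens) follow the text; the learned positional embedding (the paper uses trigonometric encodings), the single encoder layer per branch, and the exact fusion of the frame features are simplifying assumptions.

```python
import torch
import torch.nn as nn

class UpdateModuleSketch(nn.Module):
    """Three parallel transformer branches that update pose, shape, and camera hidden states."""
    def __init__(self, feedback_dim: int, h: int = 32, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.feed_proj = nn.Linear(feedback_dim, d_model)
        self.pose_in, self.shape_in, self.cam_in = nn.Linear(h, d_model), nn.Linear(h, d_model), nn.Linear(h, d_model)
        make_enc = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True, norm_first=True)
        self.pose_enc, self.shape_enc, self.cam_enc = make_enc(), make_enc(), make_enc()
        self.pose_pos = nn.Parameter(torch.zeros(1, 24, d_model))    # joint positional encoding (learned here)
        self.pose_out, self.shape_out, self.cam_out = nn.Linear(d_model, h), nn.Linear(d_model, h), nn.Linear(d_model, h)

    def forward(self, feedback, h_pose, h_shape, h_cam):
        f = self.feed_proj(feedback)[:, None, :]                     # shared frame features, (B, 1, d_model)
        pose = self.pose_in(h_pose) + f + self.pose_pos              # (B, 24, d_model) joint tokens
        shape = self.shape_in(h_shape)[:, None, :] + f               # single shape token
        cam = self.cam_in(h_cam)[:, None, :] + f                     # single camera token
        return (self.pose_out(self.pose_enc(pose)),                  # new h_pose, (B, 24, h)
                self.shape_out(self.shape_enc(shape)).squeeze(1),    # new h_shape, (B, h)
                self.cam_out(self.cam_enc(cam)).squeeze(1))          # new h_cam, (B, h)
```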

3.4. Train FFMN

Three-Dimensional Joint Loss. During training, this loss function quantifies the discrepancy between predicted and ground-truth joint coordinates using the L1 norm. The 3D joint regression loss $\mathcal{L}_{3DJ}$ is formulated as follows:
$\mathcal{L}_{3DJ} = \frac{1}{K} \left( \left\| \hat{J}_{3D} - \bar{J}_{3D} \right\|_1 + \left\| \hat{J}'_{3D} - \bar{J}_{3D} \right\|_1 \right)$
where $\bar{J}_{3D}$ denotes the ground-truth 3D joint coordinates, $\hat{J}_{3D}$ represents the fine-grained predicted 3D joints, and $\hat{J}'_{3D}$ is derived from the coarse-grained SMPL mesh via linear regression. Here, K corresponds to the number of body joints (K = 24).
Two-Dimensional Joint Projection Loss. To align the reconstructed 3D human mesh with the input 2D image, we adopt the weakly supervised projection loss methodology established in prior works [8,10,30]. The model predicts weak-perspective camera parameters $s$, $t$ from camera-aware features extracted by the transformer encoder, where $s \in \mathbb{R}$ is a scaling factor and $t \in \mathbb{R}^2$ denotes the 2D translation vector. Using the estimated camera model, we orthographically project both the predicted 3D joints $\hat{J}_{3D}$ and their coarse-grained variant $\hat{J}'_{3D}$ onto the 2D image plane:
$\hat{J}_{2D} = s \cdot \Pi\left( \hat{J}_{3D} \right) + t, \qquad \hat{J}'_{2D} = s \cdot \Pi\left( \hat{J}'_{3D} \right) + t$
where $\Pi$ implements the orthographic projection via the matrix $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}^{\top} \in \mathbb{R}^{3 \times 2}$. The projection loss $\mathcal{L}_{2DJ}$ enforces consistency between the projected joints and the ground-truth 2D annotations $\bar{J}_{2D} \in \mathbb{R}^{K \times 2}$ using the L1 norm:
$\mathcal{L}_{2DJ} = \frac{1}{K} \left( \left\| \hat{J}_{2D} - \bar{J}_{2D} \right\|_1 + \left\| \hat{J}'_{2D} - \bar{J}_{2D} \right\|_1 \right)$
This dual-projection strategy constrains both the initial 3D pose estimation and its refined counterpart, enhancing robustness to camera parameter variations under weak 2D supervision.
SMPL Refinement Loss. This loss regularizes the deviation between the coarse-grained SMPL parameters $\sigma$ and their fine-grained counterparts $\sigma'$ optimized through the transformer-based refinement module:
$\mathcal{L}_{SMPL} = \left\| \sigma - \sigma' \right\|_2^2$
Total Loss. Inspired by the iterative refinement paradigm in RAFT [31], we employ a multi-task learning framework combining 3D and 2D supervisory signals from diverse training datasets. The composite loss L T o t a l is defined as follows:
$\mathcal{L}_{Total} = \lambda_{3D} \mathcal{L}_{3DJ} + \lambda_{2D} \mathcal{L}_{2DJ} + \lambda_{SMPL} \mathcal{L}_{SMPL}$
where $\lambda_{3D}$, $\lambda_{2D}$, and $\lambda_{SMPL}$ are empirically tuned fixed weighting coefficients that balance the loss components.
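A compact sketch of the composite loss is given below. The per-sample averaging, tensor shapes, and unit weighting coefficients are illustrative assumptions; the L1/L2 terms and the weak-perspective projection follow Equations (4)–(8).

```python
import torch

def ffmn_losses(j3d_fine, j3d_coarse, j3d_gt, j2d_gt, sigma_fine, sigma_coarse, s, t,
                w3d: float = 1.0, w2d: float = 1.0, wsmpl: float = 1.0):
    """Composite FFMN training loss; the weights are placeholders for the tuned coefficients.

    j3d_*: (B, K, 3) joints, j2d_gt: (B, K, 2) annotations, sigma_*: (B, P) SMPL parameter vectors,
    s: (B, 1) weak-perspective scale, t: (B, 2) image-plane translation.
    """
    K = j3d_gt.shape[1]

    def l1(a, b):                        # per-joint L1 error, averaged over the batch
        return (a - b).abs().sum(dim=(1, 2)).mean() / K

    def project(j3d):                    # orthographic projection: keep (x, y), then scale and shift
        return s[:, :, None] * j3d[..., :2] + t[:, None, :]

    l_3dj = l1(j3d_fine, j3d_gt) + l1(j3d_coarse, j3d_gt)
    l_2dj = l1(project(j3d_fine), j2d_gt) + l1(project(j3d_coarse), j2d_gt)
    l_smpl = ((sigma_fine - sigma_coarse) ** 2).sum(dim=1).mean()
    return w3d * l_3dj + w2d * l_2dj + wsmpl * l_smpl
```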

4. Experiments

Datasets. We trained our FFMN model on 3DPW [32], Human3.6M [33], MPI-INF-3DHP [34], COCO [35], and MPII [36], where SMPL pseudo-labels from EFT [16] were utilized for COCO and MPII datasets. The evaluation was conducted on 3DPW and Human3.6M validation sets using two metrics: MPJPE (mean per-joint position error) and PA-MPJPE (Procrustes-aligned MPJPE).
Implementation Details. FFMN was trained for 40 epochs on the combined dataset, with periodic validation on a subset of the 3DPW validation set. We employed the Adam optimizer with a learning rate of 1 × 10−4 and batch size 64. Input images were resized to 256 × 256 resolution, and Pascal-style occlusion augmentation [37] was applied. All experiments were conducted on a PC equipped with dual RTX 3090 GPUs and 128 GB RAM.
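The reported setup translates into a training loop along the following lines; the data loader keys, the loss function signature, and the absence of a learning rate schedule are assumptions, while the optimizer, learning rate, epoch count, and batch size match the text (the batch size is fixed when constructing the loader).

```python
import torch

def train_ffmn(model, loader, loss_fn, epochs: int = 40, lr: float = 1e-4, device: str = "cuda"):
    """Train FFMN with Adam at lr 1e-4 for 40 epochs; `loader` is built with batch_size=64."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for batch in loader:
            images = batch["image"].to(device)                 # 256x256 crops, occlusion-augmented
            preds = model(images, batch["c_bbox"].to(device), batch["s_bbox"].to(device))
            loss = loss_fn(preds, batch)                       # composite loss from Section 3.4
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```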

4.1. Quantitative Evaluation

The quantitative results are presented in Table 1, which compares our method with prior works on 3DPW and Human3.6M. To enable comprehensive analysis, we additionally report parameter counts and per-image inference time.
To further validate the effectiveness of FFMN, we compare it with recent state-of-the-art methods on the 3DPW and Human3.6M benchmarks. As shown in Table 1, FFMN achieves a favorable balance between accuracy and efficiency. Specifically, FFMN attains a PA-MPJPE of 43.9 mm on 3DPW and 32.7 mm on Human3.6M, which is competitive with or better than most existing approaches. Notably, FFMN’s inference time is significantly lower than that of ReFit [10] and other iterative optimization-based methods, making it suitable for real-time applications.
Compared to regression-based models such as HMR [8], PyMAF [19], and CLIFF [26], FFMN demonstrates improved accuracy while maintaining a lightweight architecture. Although ReFit achieves slightly better accuracy, its computational cost is much higher, with inference times exceeding 200 ms per image. In contrast, FFMN processes each image in just 15 ms, enabling real-time deployment in safety-critical scenarios such as live-line work monitoring.
These results highlight the practical advantages of FFMN: it delivers competitive accuracy with a fraction of the computational resources required by previous methods, thus bridging the gap between research and real-world deployment for monocular 3D human mesh recovery.

4.2. Qualitative Evaluation

We present qualitative results across multiple datasets in Figure 5, categorized into three occlusion levels: unoccluded, partially occluded, and heavily occluded. The results include both the original image alignment and multi-view visualizations of 3D human meshes rendered in virtual environments using Open3D, demonstrating pose consistency across perspectives.
In the unoccluded scenarios, FFMN accurately reconstructs the 3D human mesh with precise alignment to the subject’s body contours and articulated joints, even in challenging poses such as squatting or arm extension. For partially occluded cases, such as when limbs are blocked by equipment or other objects, the model maintains plausible mesh predictions by leveraging contextual cues and robust feature extraction, resulting in anatomically reasonable reconstructions. In heavily occluded situations, where large portions of the body are not visible, FFMN still produces coherent mesh structures, indicating strong generalization and the effectiveness of the feedback dropout mechanism in handling missing keypoint information.
To validate temporal coherence, we evaluate sequential human motion reconstructed by our FFMN model using continuous image sequences. As shown in Figure 6, eight consecutive frames from the MHHI dataset [41] are processed to generate temporally smooth 3D human animations. The results exhibit natural motion transitions in 3D space, confirming our model's capability for dynamic human mesh recovery.
In particular, FFMN demonstrates strong temporal consistency across frames, with reconstructed meshes maintaining anatomical plausibility and smooth articulation even during rapid or complex movements. As shown in Table 2, the model effectively suppresses jitter and discontinuities that often arise in frame-by-frame predictions, thanks to its robust feature extraction and feedback mechanisms. This temporal stability is crucial for downstream applications such as motion analysis, behavior recognition, and real-time safety monitoring, where reliable tracking of human pose and shape over time is essential.

4.3. Ablation Study

We conduct ablation experiments on the core components of FFMN. All models are trained with identical datasets and evaluated on 3DPW [32]. The time and parameter counts in Table 3 correspond to the overhead of each module during inference.
Backbone Architecture. The backbone network critically impacts performance. As shown in Table 3, our Pose* network (adapted from YOLO [11]) achieves 1.8% lower MPJPE error and 2.8× faster inference speed compared to the widely used HRNet baseline. This demonstrates Pose*’s superior feature extraction capability and computational efficiency.
Keypoint Reprojection. Our results confirm that employing keypoint reprojection for supervision enhances accuracy [26]. The proposed full-frame adjusted reprojection extends this benefit to the feedback stage. Optimal performance is achieved when the camera model faithfully participates in all image formation stages.
Feedback Dropout. Implementing dropout ($p = 0.25$) during feedback feature aggregation yields a 5.5 mm MPJPE improvement. This regularization prevents co-adaptation of keypoint signals and enhances robustness to occlusions, as evidenced by improved performance in partially occluded scenarios.
Update Module. Our experiments compare transformer and GRU modules for iterative prediction updates. While the GRU-based variant achieves marginally higher accuracy (3.0 mm lower MPJPE), its inference latency exceeds that of the transformer module by an order of magnitude. When the three branches share transformer weights, the parameter count is reduced, but performance degrades because the preliminary SMPL estimates introduce interference when features from heterogeneous dimensions are fused and updated together. Given the critical requirement for real-time performance in practical applications, we prioritize computational efficiency and ultimately select the transformer architecture. This design choice sacrifices about 3 mm of MPJPE accuracy to achieve roughly 10× faster inference, ensuring viable deployment in latency-sensitive systems.

5. Conclusions

This paper presents the fast fitting mesh network (FFMN), a monocular 3D human mesh reconstruction framework based on parametric human models such as SMPL. FFMN integrates an optimization–regression hybrid strategy for end-to-end learning: it leverages transformer-based self-attention mechanisms to model long-range dependencies between global image features and human joints for direct parameter regression, while simultaneously incorporating a differentiable fitting module to refine outputs through implicit geometric constraints. The framework achieves real-time inference through an efficient parallel computation architecture that processes multi-scale features. Experimental results demonstrate that FFMN maintains reconstruction accuracy comparable to state-of-the-art methods (e.g., CLIFF [26], ReFit [10]) on the challenging 3DPW dataset, while achieving a substantial improvement in inference speed. This work bridges the efficiency–accuracy gap in monocular mesh recovery, offering practical value for real-world applications requiring high-speed 3D human analysis.
In the context of distribution network safety monitoring, FFMN enables real-time 3D pose estimation of workers from monocular images, overcoming the limitations of traditional multi-view or sensor-based systems that are difficult to deploy in complex power grid environments. By providing accurate and efficient human mesh recovery (MPJPE < 50 mm), FFMN supports dynamic risk assessment, safety distance alerts, and intelligent operation in live-line work scenarios, thus offering practical value for the digital transformation and intelligent upgrade of power grid maintenance.

Limitations and Future Work

The current implementation has several limitations. Firstly, the model struggles with complex pose variations, particularly in occluded or truncated human bodies. Secondly, the computational cost remains relatively high for edge device deployment despite the optimizations. Future work will focus on addressing these issues by (1) incorporating more robust feature extraction mechanisms to handle occlusions and (2) exploring model quantization and pruning techniques for further efficiency improvements.

Author Contributions

Methodology, G.L. (Guokai Liang) and J.Z.; experiments, G.L. (Guokai Liang) and Z.L.; validation, G.L. (Guocheng Lin), F.Y. and J.L.; original draft preparation, G.L. (Guokai Liang), X.X. and P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of Southern Power Grid, China (030100KC23110071), and the Natural Science Funds of Hunan Province, China (2025JJ50335).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Authors Guokai Liang, Jie Zhou, Fan Yang, Guocheng Lin, Jiajian Luo, Xin Xie and Peng Zhang were employed by the company Guangdong Power Grid Co., Ltd. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries; Association for Computing Machinery: New York, NY, USA, 2023; Volume 2, pp. 851–866. [Google Scholar]
  2. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10975–10985. [Google Scholar]
  3. Xu, H.; Bazavan, E.G.; Zanfir, A.; Freeman, W.T.; Sukthankar, R.; Sminchisescu, C. Ghum & ghuml: Generative 3d human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6184–6193. [Google Scholar]
  4. Kocabas, M.; Huang, C.H.P.; Hilliges, O.; Black, M.J. PARE: Part attention regressor for 3D human body estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11127–11137. [Google Scholar]
  5. Liu, Y.; Qiu, C.; Zhang, Z. Deep learning for 3d human pose estimation and mesh recovery: A survey. Neurocomputing 2024, 596, 128049. [Google Scholar] [CrossRef]
  6. Jang, Y.; Jeong, I.; Younesi Heravi, M.; Sarkar, S.; Shin, H.; Ahn, Y. Multi-camera-based human activity recognition for human–robot collaboration in construction. Sensors 2023, 23, 6997. [Google Scholar] [CrossRef]
  7. Tian, Y.; Zhang, H.; Liu, Y.; Wang, L. Recovering 3d human mesh from monocular images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15406–15425. [Google Scholar] [CrossRef]
  8. Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7122–7131. [Google Scholar]
  9. Kolotouros, N.; Pavlakos, G.; Black, M.J.; Daniilidis, K. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2252–2261. [Google Scholar]
  10. Wang, Y.; Daniilidis, K. Refit: Recurrent fitting network for 3d human recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14644–14654. [Google Scholar]
  11. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  12. Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 561–578. [Google Scholar]
  13. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974. [Google Scholar] [CrossRef] [PubMed]
  14. Zanfir, A.; Bazavan, E.G.; Zanfir, M.; Freeman, W.T.; Sukthankar, R.; Sminchisescu, C. Neural descent for visual 3d human pose and shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14484–14493. [Google Scholar]
  15. Varol, G.; Ceylan, D.; Russell, B.; Yang, J.; Yumer, E.; Laptev, I.; Schmid, C. Bodynet: Volumetric inference of 3d human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 20–36. [Google Scholar]
  16. Joo, H.; Neverova, N.; Vedaldi, A. Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; IEEE: New York, NY, USA, 2021; pp. 42–52. [Google Scholar]
  17. Song, J.; Chen, X.; Hilliges, O. Human body model fitting by learned gradient descent. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 744–760. [Google Scholar]
  18. Chen, F.; Choi, J. ATGT3D: Animatable Texture Generation and Tracking for 3D Avatars. Electronics 2024, 13, 4562. [Google Scholar] [CrossRef]
  19. Zhang, H.; Tian, Y.; Zhou, X.; Ouyang, W.; Liu, Y.; Wang, L.; Sun, Z. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11446–11456. [Google Scholar]
  20. Omran, M.; Lassner, C.; Pons-Moll, G.; Gehler, P.; Schiele, B. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; IEEE: New York, NY, USA, 2018; pp. 484–494. [Google Scholar]
  21. Kolotouros, N.; Pavlakos, G.; Jayaraman, D.; Daniilidis, K. Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11605–11614. [Google Scholar]
  22. Sengupta, A.; Budvytis, I.; Cipolla, R. Probabilistic 3D human shape and pose estimation from multiple unconstrained images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 16094–16104. [Google Scholar]
  23. Ren, Y.; Zhou, M.; Zhou, P.; Wang, S.; Liu, Y.; Geng, G.; Li, K.; Cao, X. Enhanced Multi-Scale Attention-Driven 3D Human Reconstruction from Single Image. Electronics 2024, 13, 4264. [Google Scholar] [CrossRef]
  24. Ren, Y.; Zhou, M.; Wang, Y.; Feng, L.; Zhu, Q.; Li, K.; Geng, G. Implicit 3D Human Reconstruction Guided by Parametric Models and Normal Maps. J. Imaging 2024, 10, 133. [Google Scholar] [CrossRef] [PubMed]
  25. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5442–5451. [Google Scholar]
  26. Li, Z.; Liu, J.; Zhang, Z.; Xu, S.; Yan, Y. Cliff: Carrying location information in full frames into human pose and shape estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 590–606. [Google Scholar]
  27. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  28. Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 648–656. [Google Scholar]
  29. Kocabas, M.; Athanasiou, N.; Black, M.J. Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5253–5263. [Google Scholar]
  30. Dwivedi, S.K.; Athanasiou, N.; Kocabas, M.; Black, M.J. Learning to regress bodies from images using differentiable semantic rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11250–11259. [Google Scholar]
  31. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 402–419. [Google Scholar]
  32. Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
  33. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  34. Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3d human pose estimation in the wild using improved cnn supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; IEEE: New York, NY, USA, 2017; pp. 506–516. [Google Scholar]
  35. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, part v 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  36. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
  37. Wang, A.; Kortylewski, A.; Yuille, A. Nemo: Neural mesh models of contrastive features for robust 3d pose estimation. arXiv 2021, arXiv:2101.12378. [Google Scholar]
  38. Li, J.; Xu, C.; Chen, Z.; Bian, S.; Yang, L.; Lu, C. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3383–3393. [Google Scholar]
  39. Shetty, K.; Birkhold, A.; Jaganathan, S.; Strobel, N.; Kowarschik, M.; Maier, A.; Egger, B. Pliks: A pseudo-linear inverse kinematic solver for 3d human body estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 574–584. [Google Scholar]
  40. Chen, M.; Zhou, Y.; Jian, W.; Wan, P.; Wang, Z. Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery. arXiv 2023, arXiv:2311.09543. [Google Scholar]
  41. Li, K.; Jiao, N.; Liu, Y.; Wang, Y.; Yang, J. Shape and Pose Estimation for Closely Interacting Persons Using Multi-view Images. Comput. Graph. Forum 2018, 37, 361–371. [Google Scholar] [CrossRef]
Figure 1. Overview. FFMN employs a 3D human mesh network to reconstruct both the pose and spatial position of subjects in monocular images. The backbone network extracts semantic features from the input image, while the transformer module disentangles high-level features to separately model body shape, joint rotations, and global orientation.
Figure 2. The FFMN reconstructs 3D human meshes from monocular images through a coarse-to-fine pipeline. A backbone network first extracts semantic features to generate an initial SMPL body model, while a transformer module refines these estimates by disentangling pose and shape features.
Figure 3. The backbone architecture processes input images through a hierarchical feature extraction pipeline. On the left branch, convolutional layers progressively extract multi-scale visual patterns, while the Sampling Unit performs spatial downsampling and multi-resolution feature fusion. This dual-path design outputs two complementary components: high-level semantic features and initialized pose features.
Figure 4. MHSA: Multi-Head Self-Attention; MHCA: Multi-Head Cross-Attention. The transformer update module processes the feedback signal f through three parallel branches, each dedicated to a specific parameter group: pose, shape, and camera parameters. Each branch employs a transformer encoder to capture long-range dependencies and interactions among the respective parameters.
Figure 5. Demonstration of FFMN’s performance in live-line maintenance scenarios with complex backgrounds and occlusions. The first row displays raw input images showing workers interacting with high-voltage equipment. The second row illustrates the 3D human mesh overlay on 2D images, where the reconstructed mesh accurately aligns with challenging poses (e.g., overhead arm extensions, torso twists). The third row provides 3D visualizations (rendered via Open3D [14]), demonstrating cross-view pose consistency and anatomically plausible limb articulations.
Figure 6. Sequential human motion reconstruction results from the MHHI dataset [41].
Table 1. Evaluation results of each model on 3DPW and H36M datasets.
| Architecture | Time (ms) | Params (M) | 3DPW MPJPE | 3DPW PA-MPJPE | Human3.6M MPJPE | Human3.6M PA-MPJPE |
|---|---|---|---|---|---|---|
| HMR [8] | 35.3 | 28.8 | 130.0 | 76.7 | 88.0 | 56.8 |
| PyMAF [19] | 28.7 | 45.1 | 92.8 | 58.9 | 57.7 | 40.5 |
| PARE [4] | 63.7 | 32.8 | 82.0 | 50.9 | - | - |
| HybrIK [38] | 30.5 | 27.6 | 74.1 | 45.0 | 55.4 | 33.6 |
| CLIFF [26] | 36.6 | 26.9 | 73.5 | 44.3 | 52.6 | 35.0 |
| PLIKS [39] | 122.9 | 180.9 | 60.5 | 38.5 | 47.0 | 34.5 |
| TAR [40] | - | - | 62.7 | 40.6 | 45.6 | 33.3 |
| ReFit [10] | 233.0 | 74.9 | 65.8 | 41.0 | 48.4 | 32.2 |
| FFMN (ours) | 15.0 | 53.9 | 70.8 ± 1.2 | 43.9 ± 0.7 | 48.6 ± 0.3 | 32.7 ± 0.2 |
Table 2. Evaluation of state-of-the-art image-based methods and FFMN on MHHI.
| Method | MHHI [41] Mean | MHHI [41] Std |
|---|---|---|
| PARE [4] | 30.48 | 11.54 |
| CLIFF [26] | 30.23 | 12.17 |
| ReFit [10] | 29.78 | 11.07 |
| FFMN (ours) | 29.82 | 9.87 |
Table 3. Ablation of model designs.
| Experiment | Method | Time (ms) | Params (M) | 3DPW MPJPE | 3DPW PA-MPJPE |
|---|---|---|---|---|---|
| Backbone | HRNet | 25.6 | 71.1 | 72.1 | 44.5 |
| | **Pose\*** | 8.9 | 46.0 | 70.8 | 43.9 |
| Keypoints Reprojection | Semantic | 2.7 | 0.23 | 72.4 | 44.0 |
| | Dense | 5.4 | 0.76 | 75.4 | 44.3 |
| | **Mocap Markers** | 3.3 | 0.31 | 70.8 | 43.9 |
| Feedback Dropout | No Dropout | - | - | 76.3 | 45.1 |
| | 0.15 | - | - | 73.1 | 44.5 |
| | **0.25** | - | - | 70.8 | 43.9 |
| Update Module | GRU | 51.4 | 1.1 | 67.8 | 41.5 |
| | Transformer(c) | 4.9 | 0.3 | 72.2 | 44.1 |
| | **Transformer** | 5.0 | 0.7 | 70.8 | 43.9 |
The bold option is used for the final model.
