1. Introduction
The field of 3D body reconstruction is crucial in various industries, such as virtual reality [1], augmented reality [2], and biomedical applications [3]. This process involves reconstructing a detailed 3D model of the human body from a single 2D image [4], a task fraught with challenges due to the ill-posed nature of obtaining three-dimensional data from two-dimensional representations [5]. This complexity arises because multiple 3D points can project identically onto a 2D plane, introducing significant ambiguity in the reconstruction process. Despite these hurdles, advances in computational techniques continue to improve the precision and reliability of 3D body reconstruction [6], broadening its applicability in fields such as medical imaging [7], gaming [8], and robotics [9].
In particular, reconstructing an accurate 3D human form from imperfect and incomplete data requires advanced handling of non-rigid body dynamics and articulated joint movements [10]. Recent breakthroughs in deep learning have enabled more sophisticated end-to-end reconstructions of human shapes [11], including detailed meshes that capture complex articulations. Historically, deep neural networks have faced challenges, such as producing rugged, blurred, or distorted meshes [12]. However, models such as Skinned Multi-Person Linear (SMPL) [13] and its extension, SMPL eXpressive (SMPL-X) [14], have revolutionized this domain. These models offer streamlined representations of 3D human figures. When integrated with deep learning techniques, they facilitate the extraction of robust image features and the regression of accurate body shape and pose parameters from standard RGB images. This enhanced methodology pushes the limits of what is possible in 3D human body point cloud reconstruction, promising increasingly accurate and versatile applications [15].
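To make this parameterization concrete, the sketch below expands a handful of shape and pose coefficients into a full 3D mesh using the publicly available smplx Python package (an assumption of this example; the SMPL-X model files must be downloaded separately, and "models/" is a placeholder path). In a reconstruction pipeline, these parameters would be regressed from image features rather than set to zeros.

import torch
import smplx  # assumes the smplx reference implementation is installed

# SMPL-X maps low-dimensional shape (betas) and pose parameters to a
# full mesh; model files are not bundled and must be obtained separately.
model = smplx.create("models/", model_type="smplx",
                     gender="neutral", use_pca=False)

betas = torch.zeros(1, 10)        # shape coefficients
body_pose = torch.zeros(1, 63)    # 21 body joints x 3 axis-angle values
output = model(betas=betas, body_pose=body_pose, return_verts=True)
print(output.vertices.shape)      # (1, 10475, 3): vertices of the SMPL-X mesh
print(output.joints.shape)        # 3D joint locations derived from the mesh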
Various technologies are used to estimate human pose [16], including RGB cameras [17], thermal cameras [18], hyperspectral cameras [19], and IR-UWB (ultra-wideband) radar [20]. Although infrared cameras excel at object detection under low-light conditions, particularly for objects with higher temperatures, they perform poorly with cooler objects. RGB cameras, in turn, are challenged by low-light environments, and operating them in the NIR spectrum can be costly. Single-pixel imaging (SPI) systems present a viable alternative [21] that overcomes many limitations of conventional and thermal cameras in dim settings. SPI systems capture images by measuring the light reflected from an object with a single-pixel detector, even in the infrared spectral bands [22]. Combined with deep learning, SPI can reconstruct high-quality images from sparse measurements. It is particularly effective for capturing 3D human body point clouds in challenging scenarios such as night-time surveillance or rescue operations, establishing SPI as a leading method for detailed and accurate pose detection under low-light conditions. Deep learning techniques are highly recommended in such scenarios, and the Vision Transformer (ViT) [23] is emerging as a robust approach to handling low-resolution data.
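As a rough illustration of this idea, the sketch below attaches a coarse pose-classification head to a pre-trained ViT and runs it on upscaled SPI-like inputs. It is a minimal example, not the configuration used in this work: the class count K = 8, the torchvision ViT-B/16 backbone, and the random input tensor are all illustrative assumptions.

import torch
import torch.nn.functional as F
from torchvision.models import vit_b_16

K = 8  # assumed number of coarse pose classes (illustrative)
model = vit_b_16(weights="IMAGENET1K_V1")           # pre-trained backbone
model.heads = torch.nn.Linear(model.hidden_dim, K)  # replace classifier head

spi_batch = torch.rand(4, 1, 64, 64)                # stand-in SPI reconstructions
x = spi_batch.repeat(1, 3, 1, 1)                    # ViT expects 3 channels
x = F.interpolate(x, size=224, mode="bilinear")     # ViT-B/16 input resolution
logits = model(x)
print(logits.shape)                                 # torch.Size([4, 8])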
A key advantage of SPI over conventional cameras is its ability to capture data in the NIR spectrum, significantly enhancing 3D point cloud imaging [24]. NIR imaging performs exceptionally well under low-light conditions, providing superior clarity for object detection and tracking, an essential feature for advanced rescue applications [25]. When combined with time-of-flight (TOF) sensors, SPI technology produces highly accurate 2D and 3D environmental images [26]. This integration enhances the depth of the data, offering detailed insight into the spatial arrangement and motion of objects within a scene. Furthermore, SPI is not limited to low-light scenarios; it remains highly effective in challenging environments, such as those with dust or fog, where traditional cameras struggle, ensuring consistent and reliable 3D data acquisition under various conditions.
This study presents an SPI system with active illumination in the near-infrared (NIR) wavelength range of 850–1500 nm. The system utilizes a single InGaAs photodetector to capture NIR-SPI images, which are the foundation for generating 3D human meshes based on point clouds for depth perception. In addition, it incorporates a ViT-based classification model to determine the initial pose used to initialize an SMPL-X-based reconstruction framework. Our approach processes point clouds to generate high-fidelity 3D human meshes, using a self-supervised learning framework [27] to enhance accuracy. We introduce a novel 3D human pose reconstruction method that surpasses previous state-of-the-art techniques by integrating a probabilistic human model with depth information. This innovation achieves a new benchmark in accuracy, outperforming ViT-based models and setting a new standard for handling missing data and low-resolution scenarios. The final output is a detailed and precise 3D representation of the human pose in the SMPL-X model, providing an accurate reconstruction of human figures in three dimensions. Therefore, in this work, we propose the following:
Exploring the capability of single-pixel imaging for generating 3D human pose point clouds from low-resolution 2D images.
Implementing the Vision Transformer (ViT) model on SPI images to define a pre-trained SMPL-X human pose model.
Testing a self-supervised deep learning model for 3D reconstruction from the point cloud.
This work addresses the challenging task of predicting 3D human poses from low-resolution 2D images, which can be applied to sensors used in rescue operations where lighting conditions are difficult.
3. Single-Pixel Image Reconstruction
The single-pixel imaging (SPI) technique [21,22] reconstructs images by measuring correlated intensity on a detector that lacks spatial resolution. SPI cameras employ spatial light modulators (SLMs), such as digital micro-mirror devices (DMDs), to generate spatially structured light patterns (resembling Hadamard patterns) that interrogate a scene. These cameras function according to two main architectures, structured detection and structured illumination, as illustrated in Figure 1.
In structured detection, an object is illuminated by a light source, and the reflected light is subsequently modulated by a spatial light modulator (SLM) before being measured by a single-pixel (bucket) detector. In contrast, in structured illumination, represented by the modulation pattern $P_i$, the SLM is used to spatially modulate incident light before it illuminates the object $O$. The light reflected from the object is captured by the bucket detector and subsequently converted into an electrical signal $S_i$, as defined in Equation (1):

$S_i = \beta \sum_{x,y} P_i(x,y)\, O(x,y)$.    (1)
Here, $\beta$ represents a constant factor influenced by the optoelectronic characteristics of the photodetector. The electrical signal generated by the photodetector is derived from the correlation between the spatial pattern of the light and the light reflected from the object. The corresponding sequence of electrical signals is produced by projecting a series of these spatial patterns. These signals are then computationally processed to reconstruct the image. Specifically, the image $O$ is reconstructed from the captured signal vector $S = (S_1, \ldots, S_M)$ and the corresponding patterns $P_i$, stacked row-wise into a measurement matrix $\Phi$, as described in Equation (2) [22]:

$\hat{O} = \arg\min_{O} \|O\|_0 \quad \text{subject to} \quad \|S - \Phi O\|_2 \leq \epsilon$,    (2)

where $\epsilon$ is an error threshold and $O$ is treated as a vectorized image.
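The forward model of Equation (1) is straightforward to simulate. The sketch below measures a synthetic 16 × 16 object with random binary patterns (β = 1) and forms a simple differential-correlation estimate as a baseline; solving Equation (2) properly is the job of the OMP algorithm in Section 3.3. All values are illustrative.

import numpy as np

rng = np.random.default_rng(42)
n = 16
O = np.zeros((n, n))
O[4:12, 6:10] = 1.0                              # simple synthetic test object

M = 4 * n * n                                    # oversampled measurement count
P = rng.integers(0, 2, size=(M, n, n)).astype(float)

S = np.array([np.sum(Pi * O) for Pi in P])       # Equation (1) with beta = 1

# Baseline estimate: differential correlation of signals and patterns
O_hat = np.mean((S - S.mean())[:, None, None] * P, axis=0)
O_hat = (O_hat - O_hat.min()) / (O_hat.max() - O_hat.min())

print(np.corrcoef(O.ravel(), O_hat.ravel())[0, 1])  # typically close to 1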
In this study, we used an array of 32 × 32 near-infrared LEDs (NIR-LEDs) emitting at a peak wavelength of 1550 nm to generate Hadamard-like patterns through active illumination. This wavelength was chosen to minimize the effect of water scattering and absorption. The LED array is positioned perpendicular to the lens's focal plane, allowing the light patterns to be projected toward infinity. However, due to the dimensions of the LED array, the effective projection range is limited to 0.3–2 m.
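A minimal sketch of how such Hadamard-like patterns can be produced in software is shown below: the rows of a Hadamard matrix are reshaped into 32 × 32 binary masks. The driver that maps each mask onto the physical LED array is omitted, and SciPy is assumed.

import numpy as np
from scipy.linalg import hadamard

n = 32
H = hadamard(n * n)                        # (1024, 1024); needs power-of-two order
patterns = ((H + 1) // 2)                  # map {-1, +1} -> {0, 1}: LEDs on/off
patterns = patterns.reshape(-1, n, n).astype(np.uint8)
print(patterns.shape)                      # (1024, 32, 32): one mask per measurement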
3.1. Fusion Strategy for Enhancing TOF and Single-Pixel Imaging Resolution
To fuse TOF imaging with low-resolution SPI and generate a high-resolution image, we employ a deep-learning-based super-resolution (SR) model (see Figure 2b), which takes advantage of the complementary strengths of both modalities. The TOF depth map and the SPI intensity image are initially pre-processed and spatially registered. The TOF modality delivers precise structural depth information, while the SPI system contributes high-contrast intensity details. A dual-branch CNN extracts spatial features separately from TOF and SPI, with the TOF branch focusing on structural information (low-frequency components) and the SPI branch enhancing fine details and textures (high-frequency components). The extracted features are subsequently concatenated and passed through a fusion network, which learns to compute optimal blending weights to preserve essential structural information. To further enhance resolution, a Fast Super-Resolution Convolutional Neural Network (FSRCNN) [26] is applied to upscale the fused low-resolution image, reconstructing high-quality textures and spatial details. Finally, a refinement layer removes artifacts and enhances sharpness, producing a high-resolution fused image with improved texture, depth consistency, and spatial clarity.
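The sketch below illustrates this dual-branch design in PyTorch. It is a minimal example under assumed dimensions: the layer widths, kernel sizes, and the single-deconvolution FSRCNN-style upscaler are placeholders for the actual network, and the refinement layer is folded into the upscaling head.

import torch
import torch.nn as nn

class DualBranchFusionSR(nn.Module):
    """Sketch: two CNN branches (TOF structure, SPI texture), a fusion
    network over concatenated features, and a 2x FSRCNN-style upscaler."""

    def __init__(self, feat=32):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.tof_branch = branch()    # structural / low-frequency features
        self.spi_branch = branch()    # texture / high-frequency features
        self.fusion = nn.Sequential(  # learns blending weights
            nn.Conv2d(2 * feat, feat, 1), nn.ReLU())
        self.upscale = nn.Sequential( # FSRCNN-style mapping + deconvolution
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, 1, 4, stride=2, padding=1))

    def forward(self, tof, spi):
        f = torch.cat([self.tof_branch(tof), self.spi_branch(spi)], dim=1)
        return self.upscale(self.fusion(f))

net = DualBranchFusionSR()
tof = torch.rand(1, 1, 32, 32)   # registered TOF depth map
spi = torch.rand(1, 1, 32, 32)   # reconstructed SPI intensity image
print(net(tof, spi).shape)       # torch.Size([1, 1, 64, 64]): 2x fused output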
3.2. SPI Camera
Our study introduces structured illumination to improve image quality in difficult lighting conditions, such as strong backlighting and stray light interference. We utilized a time-of-flight (TOF) system operating at a wavelength of 850 nm and an InGaAs photodiode as the bucket detector, operating at 1550 nm. Our proposed architecture, named NIR-SPI, comprises two primary components. The first component involves fundamental elements based on the single-pixel imaging principle, including an InGaAs photodetector (specifically the Thorlabs FGA015 diode operating at 1550 nm), a series of NIR-LEDs for light emission, a TOF system, and an analog-to-digital converter (ADC). For a visual depiction of this setup, see Figure 2a. The second component includes a subsystem designed to process the electrical output of the bucket detector. The ADC digitizes the electrical signal, and the data is processed by an embedded system-on-module (SoM) [52], notably a GPU-equipped Jetson Xavier NX, also illustrated in Figure 2a. This SoM is tasked with generating Hadamard-like patterns and processing the digitized data from the ADC. The OMP-GPU algorithm (Algorithm 1) runs on the SoM to generate the 2D images. The processing duration of each phase of the 2D image reconstruction, and further details of the SPI camera, are documented in the related literature.
3.3. 2D Reconstruction Algorithm
We initiated the process by acquiring and digitizing the electrical signal $S$ through an analog-to-digital conversion mechanism (ADC). This step involved projecting the scene with a Hadamard pattern matrix $\Phi$, resulting in a vector of signals $S$ (refer to Equation (1)). We then applied the Orthogonal Matching Pursuit (OMP) algorithm (see Algorithm 1) to derive the image $O$ (refer to Equation (2)). Our goal was to satisfy the condition $\|S - \Phi x\|_2 \leq \epsilon$, where $x$ is the sparse representation of the image [22]. To enhance the computational efficiency of the 2D SPI image reconstruction algorithm, matrix inversion was carried out using the Cholesky decomposition technique, as described in the literature [53,54]. This approach required us to compute the symmetric, positive-definite Gram matrix, denoted $G = \Phi_\Lambda^T \Phi_\Lambda$ [55], where $\Lambda$ is the set of selected atoms. Furthermore, an initial projection $p = \Phi^T S$ was performed (see Algorithm 1, line 3) to support the implementation of the Cholesky method.
Algorithm 1: OMP-GPU algorithm [55]. Input: patterns Φ, input signal S, error threshold ε. Output: sparse representation x that fulfills the relation ‖S − Φx‖₂ ≤ ε.

 1: procedure OMP-GPU(Φ, S, ε)
 2:     set: L = [1], k = 1, Λ = ∅
 3:     set: x = 0, r = S, δ = ‖S‖₂, p = ΦᵀS          ▹ initial projection
 4:     while δ > ε do
 5:         λ_k = argmax_j |φ_jᵀ r|                    ▹ finding the new atom
 6:         if k > 1 then
 7:             solve L w = Φ_Λᵀ φ_{λ_k} for w         ▹ solver
 8:             L ← [L 0; wᵀ √(1 − wᵀw)]               ▹ update of Cholesky
 9:         end if
10:         Λ ← Λ ∪ {λ_k}
11:         solve L Lᵀ x_Λ = p_Λ for x_Λ               ▹ solver
12:         β = Φ_Λ x_Λ                                ▹ matrix-sparse-vector product for each path
13:         x(Λ) = x_Λ
14:         r = S − β                                  ▹ calculate error
15:         δ = ‖r‖₂                                   ▹ calculate norm
16:         k = k + 1                                  ▹ increasing iteration
17:     end while
18:     return x
19: end procedure
The matrix G can be decomposed into two triangular factors through Cholesky decomposition, as expressed in Equation (3):

$G = \Phi_\Lambda^T \Phi_\Lambda = L L^T, \qquad L_k = \begin{bmatrix} L_{k-1} & 0 \\ w^T & \sqrt{1 - w^T w} \end{bmatrix}$,    (3)

where $L$ denotes the lower-triangular Cholesky factor [56], updated incrementally as new atoms are selected (refer to Algorithm 1, line 8), and $w$ is obtained by solving $L w = \Phi_\Lambda^T \phi_{\lambda_k}$ (see Algorithm 1, line 7) [55]. To recover the sparse coefficients, we establish the system $G x_\Lambda = p_\Lambda$, which is addressed as a pair of triangular systems, $L z = p_\Lambda$ followed by $L^T x_\Lambda = z$ (see Algorithm 1, line 11). For reconstructing the image, which involves transforming the recovered vector into an N × N matrix through a reshape operation, a stopping criterion is established by comparing the norm of the residual with a threshold $\epsilon$ (see Algorithm 1, line 15), the residual itself being obtained from quantities already computed during the update (see Algorithm 1, lines 12–14). To improve algorithm efficiency, we deploy it in the Compute Unified Device Architecture (CUDA) to facilitate parallel processing in the reconstruction task [57] (see Algorithm 1).
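For reference, the following is a compact CPU-side sketch of the Cholesky-based recursion of Algorithm 1, written in NumPy rather than CUDA. The unit-norm atoms, tolerance, and toy measurement setup are illustrative assumptions, not the parameters of the deployed OMP-GPU implementation.

import numpy as np

def omp_cholesky(Phi, S, eps=1e-6):
    """Recover sparse x with S ~ Phi @ x; the Gram matrix of the selected
    atoms is factorized incrementally, so no full inversion is needed."""
    x = np.zeros(Phi.shape[1])
    x_sup = np.zeros(0)
    support = []                       # Lambda: indices of selected atoms
    L = None                           # Cholesky factor of the Gram matrix
    p = Phi.T @ S                      # initial projection (line 3)
    r = S.copy()

    while np.linalg.norm(r) > eps and len(support) < len(S):
        lam = int(np.argmax(np.abs(Phi.T @ r)))         # new atom (line 5)
        if support:                                     # rank-1 update (lines 7-8)
            v = Phi[:, support].T @ Phi[:, lam]
            w = np.linalg.solve(L, v)                   # solve L w = v
            d = np.sqrt(max(Phi[:, lam] @ Phi[:, lam] - w @ w, 1e-12))
            L = np.block([[L, np.zeros((L.shape[0], 1))],
                          [w[None, :], np.array([[d]])]])
        else:
            L = np.array([[np.linalg.norm(Phi[:, lam])]])
        support.append(lam)
        # solve L L^T x_sup = p_sup via two triangular solves (line 11)
        x_sup = np.linalg.solve(L.T, np.linalg.solve(L, p[support]))
        r = S - Phi[:, support] @ x_sup                 # residual (lines 12-14)

    x[support] = x_sup                                  # scatter coefficients
    return x

# Toy usage: recover a 3-sparse vector from 64 random measurements.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 256))
Phi /= np.linalg.norm(Phi, axis=0)                      # unit-norm atoms
x_true = np.zeros(256)
x_true[[10, 100, 200]] = [1.0, -0.5, 2.0]
x_hat = omp_cholesky(Phi, Phi @ x_true)
print(np.allclose(x_hat, x_true, atol=1e-6))            # True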
To generate the final 2D image, we first obtain the SPI image using the method described in Algorithm 1 and then integrate it with post-processed depth data from the TOF system. This depth data is improved through a normalization technique. The FSRCNN method, as detailed in [26], fuses the initial input image with data from the TOF system. This fusion process upscales the image to twice its initial resolution, producing a final output image with dimensions of 64 × 64 pixels. The complete system, illustrated in Figure 2b, details the processing pipeline of the proposed NIR-SPI vision system: it starts with a low-resolution SPI image, processes it through the FSRCNN network [26], and combines it with data from the TOF system.
3.4. SPI Acquisition Protocol
In developing the SPI camera, we focused on two crucial parameters essential for capturing SPI images: the detector exposure time ($t_{exp}$) and the frequency of pattern projection ($f_p$). We used a theoretical model of the NIR-SPI system [58] to determine the appropriate exposure time. This model accounts for factors such as maximum measurement distance, scattering effects, and the correlation between photon incidence on the sensor and the noise threshold. We established the exposure time $t_{exp}$ to range between 80 and 120 μs, optimal for measurement distances of 0.3 to 1 m. From this exposure time, we derived that the sampling frequency of the ADC must be at least 60 kHz. The pattern projection frequency is set according to Equation (4) [59], using a parameter that evaluates the efficiency at the individual pixel level. The ideal configuration occurs when this parameter equals F (where F denotes the actual sensor pixel count), which facilitates the highest ADC measurement rate at the lowest sensor resolution (125 MHz). This setup significantly improves the signal-to-noise ratio in outdoor conditions. When the design condition falls below this value, the pattern frequency limits the resolution of the measurement. In contrast, when it is exceeded, the pattern generation frequency, defined within the range of $f_p$ = 40 kHz, can achieve a measurement rate three times faster.
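As a back-of-the-envelope illustration of these settings (assuming a complete Hadamard basis is projected without compressive subsampling), a 32 × 32 array implies N = 32 × 32 = 1024 patterns per frame, so at $f_p$ = 40 kHz a full acquisition takes

$T_{frame} = N / f_p = 1024 / 40{,}000\ \text{Hz} \approx 25.6\ \text{ms}$,

that is, on the order of 39 full-basis frames per second; measuring only a subset of patterns and reconstructing with OMP (Section 3.3) can shorten this proportionally.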
7. Conclusions
This work introduces a novel methodology that leverages NIR-SPI imaging to generate 3D human models through point cloud generation and the integration of ViT classification in a self-supervised learning framework. The approach has demonstrated strong performance in creating accurate 3D reconstructions; however, accurately localizing the hands remains challenging because of the inherently low contrast of the NIR-SPI images. Despite this limitation, both qualitative and quantitative assessments (see Table 3) indicate that detection of the core body retains high accuracy in the estimation of 3D poses, highlighting the potential of the method for single-image modeling of the human body, even at low resolutions (Figure 8).
The experimental results show that our model achieves a V2V error of 36.1 mm, placing it between IPNet (28.2 mm) and VIBE (57.29 mm). Although our method does not produce the lowest V2V error, it significantly outperforms competing approaches in mean per-joint position error (MPJPE), achieving a minimum deviation of approximately 39 mm. This suggests that, although our model may exhibit lower precision in vertex placement relative to IPNet, it demonstrates greater accuracy in estimating joint positions. PTF, which reports the second-lowest MPJPE (41.1 mm), also performs well, but our model maintains a slight advantage. These findings are highly pertinent to application domains such as motion capture and rescue operations, where precise joint localization is essential for high-fidelity performance and realism. The relatively higher V2V error indicates potential areas for refinement, such as enhanced vertex placement strategies or the integration of more detailed surface geometry features. Nevertheless, the method achieves a well-calibrated trade-off between vertex-level and joint-level accuracy, offering a robust and versatile solution for a wide range of human body estimation and tracking applications. Furthermore, it represents a significant advancement in the expressive reconstruction of body, hand, and facial features using NIR-SPI imaging, particularly under low-resolution conditions (see Table 4 and Table 5). Future research will aim to improve vertex-level accuracy while preserving or further improving joint-position precision.
Although laboratory results for the NIR-SPI system are promising, real-world outdoor deployment introduces additional complexities, such as variable weather conditions, background clutter, and dynamic lighting. To overcome these challenges, future work will focus on adaptive calibration strategies, real-time environmental compensation algorithms, and advanced noise filtering techniques. Further efforts will involve integrating machine learning–driven correction mechanisms and implementing hardware-level optimizations to ensure robust and reliable system performance in diverse outdoor environments.