Article

Posture Estimation from Tactile Signals Using a Masked Forward Diffusion Model

1 School of Computer Science and Mathematics, Kingston University, London KT1 2EE, UK
2 Department of Applied and Human Sciences, Kingston University, London KT1 2EE, UK
3 Tangi0 Ltd. (TG0), 73-75 Upper Richmond Road, London SW15 2SR, UK
* Author to whom correspondence should be addressed.
Sensors 2025, 25(16), 4926; https://doi.org/10.3390/s25164926
Submission received: 18 June 2025 / Revised: 20 July 2025 / Accepted: 25 July 2025 / Published: 9 August 2025
(This article belongs to the Section Physical Sensors)

Abstract

Utilizing tactile sensors embedded in intelligent mats is an attractive non-intrusive approach for human motion analysis. Interpreting tactile pressure 2D maps for accurate posture estimation poses significant challenges, such as dealing with data sparsity, noise interference, and the complexity of mapping pressure signals. Our approach introduces a novel dual-diffusion signal enhancement (DDSE) architecture that leverages tactile pressure measurements from an intelligent pressure mat for precise prediction of 3D body joint positions, using a diffusion model to enhance pressure data quality and a convolutional-transformer neural network architecture for accurate pose estimation. Additionally, we collected the pressure-to-posture inference technology (PPIT) dataset, which pairs pressure signals organized as a 2D array with motion capture data. Our proposed method has been rigorously evaluated on this dataset, demonstrating superior accuracy in comparison to state-of-the-art methods.

1. Introduction

Accurate estimation and understanding of human poses are essential in fitness applications, enabling real-time feedback, performance tracking, and personalized user experiences [1]. Traditionally, human pose estimation is solved using sensors or reflectors attached to the body or vision-based techniques. The former methods are intrusive and require time, effort, and special knowledge. The latter methods may be affected by occlusions, motion blur, etc., causing inaccuracies in the final outcome [2], and also require setting up the camera at an appropriate angle. Additionally, as concerns about user privacy grow, the demand for non-vision-based systems intensifies, leading researchers to explore alternative modalities. Tactile-based 3D human pose estimation (HPE) aims to recover human 3D poses using tactile interactions between humans and the ground [3]. It has a wide range of potential applications such as augmented reality [4], robotics [5,6,7,8], sports analysis [9], the film industry [10], etc.
Recently, innovative pressure-tactile sensor arrays [11] have been developed to detect human movements [12] and recognize postures [13]. Although these studies have demonstrated the potential of using pressure images for pose estimation, they typically restrict their analyses to poses that involve significant body contact with the sensing surfaces [14]. In reality, high-quality tactile [15] information is often unavailable, especially since the obtained pressure maps frequently contain noise. Current methods tend to process these pressure maps manually or through hard-coded algorithms to estimate poses, limiting the scalability and versatility of pose estimation across diverse scenarios. Consequently, these approaches are confined to specific types of poses [3]. Moreover, the generated heat maps create blobs at human-ground contact points, making it challenging to distinguish between the feet of a single person or to differentiate between two individuals [16]. Achieving noise-free pressure maps is crucial for these systems to accurately predict human posture, particularly in applications like sports analytics, where minor variations in body movement can significantly affect the probability of winning [17]. Deriving 3D poses from sparse pressure imprints poses significant challenges for accurate motion capture, especially with minimal sensor contact, due to incomplete data and pressure map noise.
To date, the ability to extend tactile information from minimal contact areas to model detailed 3D human poses across a broad range of activities remains a formidable task. In addition, current methods [16] rely on synchronized tactile and visual frames to train models for human postures, facing limitations in precision due to the inherent ambiguity in interpreting visual data for complex poses, the computational and accuracy challenges of triangulating 3D keypoints from 2D detections, and the susceptibility to errors from occlusions and varying environmental conditions. These approaches are further constrained by the dependency on the initial accuracy of 2D keypoints extracted from RGB images, the computational intensity of the optimization processes, and the potential for over-smoothing in the application of 3D Gaussian filters, all of which can significantly affect the reliability and applicability of the pose estimation models.
In this paper, we introduce an improved approach to tactile-based 3D human pose estimation. At the heart of our approach is a meticulously designed pressure mat embedded with a vast array of tactile sensors. These sensors capture real-time pressure signals when a person interacts with the mat, reflecting various postures and movements as shown in Figure 1. To tackle the shortcomings of visual frames in 3D keypoint ground truth generation, we propose a novel approach of using synchronized tactile signals and 3D keypoints as ground truth of critical body keypoints, offering a more accurate and direct mapping of human postures. While training requires both tactile carpet and body sensors, inference relies solely on the tactile sensor, making the system non-intrusive in practical use. In addition, this method leverages the high-resolution and unambiguous nature of motion capture data, circumventing the limitations of visual data interpretation and computational inefficiencies, thus enabling precise and scalable pose estimation across a wider range of activities and conditions. By employing a denoising diffusion model to generate noise-free pressure heatmaps, we predict the body’s 3D coordinates through a convolutional-transformer neural network, showcasing outstanding accuracy. Our approach offers a novel perspective on pose estimation, leveraging tactile information to overcome challenges posed by visual obstructions, thus presenting an unobtrusive and reliable method for interpreting human actions and interactions. The contributions of our work are as follows:
  • A novel dual-diffusion signal enhancement (DDSE) architecture that adopts dual-forward diffusion processes. The noisy pressure signal and its associated morphological mask are each processed through their own forward diffusion pathways. At each step, features from both diffusion channels are fused and exploited by the reverse diffusion process to denoise the tactile information.
  • A novel contour detection and alignment (CDA) layer, which integrates signals from dual-forward diffusion processes using spatial-pooling-based cross attention, significantly enhances spatial resolution by leveraging temporal information to enrich feature integration and refines contour detection from step-generated images.
  • A pressure-to-posture inference technology (PPIT) dataset that combines tactile pressure maps with motion-captured data. This innovative motion-captured dataset addresses the challenges associated with image-based keypoint generation, thereby providing highly accurate ground truth for 3D keypoints.
The rest of this paper is organized as follows. Section 2 reviews related works. The proposed architecture is described in detail in the methodology Section 3. Section 4 provides experimental results and an ablation study. Finally, Section 5 presents the conclusion and scope for further research.

2. Related Work

2.1. Human Pose Estimation Using Tactile Sensing

Human pose estimation has advanced rapidly, with applications in interactive technologies, physical activity monitoring, augmented reality, gaming, sports analytics, and rehabilitation [18,19,20,21,22,23,24]. Traditional methods relied on probabilistic frameworks to analyze static images and infer relationships between body joints [25], while recent advancements have introduced deep learning techniques leveraging 3D supervisory signals, adversarial training, and multi-camera systems to address occlusions and ambiguities in 2D-to-3D inference [26,27,28]. Complementing these approaches, tactile sensing has emerged as a promising alternative, utilizing pressure-sensitive elements to capture complex pressure distribution patterns [29]. Tactile systems have been integrated into wearable devices like gloves and shoes [30] and non-wearable solutions such as smart beds and floors [31], demonstrating potential in activity recognition and motion analysis. Advanced techniques such as capacitive sensing [32], resistive and optical sensing, and piezoelectric materials have enabled tasks like walking pattern analysis, dynamic motion monitoring, and human localization [33]. Additionally, deep learning models, such as LeNet, combined with large-area fabric pressure sensor arrays, have successfully classified sitting postures with high accuracy [34], while pressure-sensing mats have been employed to infer 3D human pose and shape during rest [13]. However, despite these strides, tactile sensing systems have primarily focused on activity recognition and basic pose estimation, leaving the opportunity to innovate methods that integrate tactile data with alternative modalities, aiming to move beyond recognition toward accurate estimation of 3D human skeletons. Moreover, while hardware advancements have significantly improved tactile sensing, a gap remains in developing machine learning models tailored to high-resolution tactile datasets for accurate 3D human pose estimation.

2.2. Human Pose Estimation Systems

Advancements in motion capture for pose estimation have introduced varied techniques, each with its merits and limitations. Ref. [35] presents a single-view approach that uses exponential maps for tracking, but faces challenges with occlusions and complex movements. Ref. [36] employs Kinect for 3D estimation, which is limited by depth sensor resolution. Ref. [37] integrates physics for realistic motions from monocular videos, but demands high computational resources and precise conditions. Ref. [38] enhances motion capture with a balanced feedback mechanism, showing promise in controlled settings, but is limited in complex environments. Markerless capture methods, such as those by [39], use optical flow and multi-view sequences for detailed motion without physical markers, requiring extensive setup. Ref. [40] proposes a real-time algorithm using calibrated webcams that faces difficulties with occlusion, while Ref. [41] introduces a space-time shape approach that offers novelty but lacks generalizability to unpredictable movements. The prevailing gap in the literature is the lack of methodologies that leverage motion capture as a ground truth for estimating poses from tactile signals. Current strategies focus on visual and depth data, overlooking the potential of tactile information to provide a complementary and possibly more nuanced understanding of human movement.

2.3. Diffusion Models

Diffusion models, grounded in a probabilistic framework, iteratively transform noisy data into clean signals, making them particularly effective for denoising applications, including pressure signal analysis [42]. Among the neural backbones used within diffusion pipelines, the U-Net’s symmetric encoder–decoder architecture with skip connections that merge feature maps of equal resolution effectively preserves fine spatial detail while retaining global context [43]. Accordingly, U-Net variants have become the default denoising core in many state-of-the-art diffusion frameworks, including those applied to pressure-signal reconstruction. DeepDeblur [44] and MPRNet [45] adopt convolutional architectures with distinct focuses, as follows: DeepDeblur uses multi-scale CNNs to refine image details at various scales, while MPRNet combines parallel feature extraction with multi-stage reconstruction for handling complex motion blur. In contrast, HINET [46] and Stripformer [47] introduce novel structures aimed at balancing computational efficiency and performance. HINET leverages a half-instance normalization block to maintain speed and accuracy, and Stripformer utilizes hybrid transformers to handle dynamic scenes by capturing strip-based tokens. Diffusion-based approaches, such as HI-Diff [48] and DID [49], focus on iterative refinement using hierarchical diffusion processes or learned noise distributions, whereas SI-DDPM-FMO [50] and Swintormer [51] enhance restoration through feature map optimization or adaptive attention mechanisms integrating convolutional and transformer-based models. While these methods exhibit diverse strategies, they share a common challenge: capturing unwanted noise and struggling to denoise sparse input images due to architectural constraints. Their reliance on a single forward diffusion process often leads to misalignment between synthesized distributions and target results, underscoring the potential of dual-forward diffusion processes for improved accuracy and robustness.

2.4. Datasets for Tactile-Based HPE

Several datasets have been developed for tactile-based HPE, each providing unique insights into human posture recognition. Weibing et al. [34] introduced a dataset that captures sitting postures using a pressure sensor array on chairs, focusing on various common sitting positions to enhance posture recognition accuracy. Henry et al. [13] utilized a pressure mapping system to collect data on different standing and sitting postures, aiming to improve the classification of body positions in dynamic environments. Luo et al. [16] developed a dataset that integrates tactile signals from intelligent carpets to estimate 3D human poses, capturing various activities and providing a comprehensive view of posture dynamics. Chen et al. [52] focused on creating a dataset that combines tactile data with visual information to enhance the accuracy of posture estimation in diverse scenarios.
However, the currently available datasets often rely on camera-based systems, which raise privacy concerns and may not ensure the anonymity of individuals during data collection. The requirement for tactile pressure mats, paired with corresponding motion capture (MoCap) keypoint data, is crucial for achieving high accuracy in pose estimation, as it allows the integration of detailed physical interaction information with the precise spatial positioning of body parts.

3. Methodology

3.1. Pressure to Posture Estimation

Employing a carpet embedded with tactile sensors for monitoring human activities ensures privacy, which is lacking in camera-based systems. However, this method’s lower resolution compared to visual recordings presents challenges, notably the introduction of noise in the collected pressure data. Although the traditional approach of combining information from consecutive frames, used in [52], reduces the sparsity of a single pressure map, it fails to capture the context of activities specific to a frame. The following limitation was empirically observed: distinctive actions confined to a subset of frames can be obscured or overshadowed by information in the later frames, inadvertently introducing unwanted noise. The pose in each frame may differ significantly, and concatenating consecutive frames risks blending these unique poses into an averaged representation, losing crucial temporal details. Consequently, such an approach risks missing crucial details, further compounding the noise issue with the inclusion of numerous frames.
The architecture detailed in Figure 2 for estimating posture from pressure maps involves a comprehensive two-stage process. Initially, in the first stage, as elaborated in Section 3.3, a sparse and noisy pressure signal collected from a pressure mat undergoes a forward and reverse denoising process in a diffusion model to generate a dense, noise-free pressure map, effectively eliminating the need to manually combine multiple pressure images for denoising. To further refine the learning process, two streams of the forward-noising process are employed. The second, masked stream incorporates a refined mask that serves as an attention mechanism, ensuring the model focuses on target pressure blobs. This additional stream mitigates the model’s tendency to learn noisy signals and helps avoid misinterpreting similar-looking noise as meaningful data, thereby enhancing the accuracy and robustness of the segmentation process. Refining noisy tactile signals represents significant progress over traditional tactile-based posture estimation techniques. Subsequently, in the second stage described in Section 3.4, the denoised tactile signal is fed into a transformer-convolution-based neural network, using 3D keypoints from MoCap data as labels for accurate human pose prediction.

3.2. Problem Definition

The primary objective of this study is to estimate the 3D posture keypoints $\hat{C}(t)$, which represent the human body pose at step $t$, from the sparse and noisy pressure signals $P(t)$ acquired from a tactile mat. Mathematically,
$$\hat{C}(t) = f(P(t)).$$
To address this, a two-stage framework is proposed:
1. Stage 1: Denoise $P(t)$ to reconstruct $P_{\text{true}}(t)$, thereby reducing noise and enhancing the input signal.
2. Stage 2: Use the denoised signal $P_{\text{true}}(t)$ to estimate $\hat{C}(t)$ with high precision.

3.3. Stage 1: Dual-Diffusion Signal Enhancement (DDSE)

3.3.1. Sparse/Noisy Pressure Signal

Let the pressure signal $P$ be acquired directly from the tactile mat. This signal is often compromised by various forms of noise $N$ due to environmental factors, sensor inaccuracies, or other disturbances. The aim of Stage 1 of the methodology, depicted in the upper part of Figure 2, is to extract the actual pressure $P_{\text{true}}$ applied by a person moving or standing on the mat:
$$P = P_{\text{true}} + N$$

3.3.2. Pressure Signal Forward Diffusion (PSFD)

The pressure signal undergoes a forward diffusion process over $D$ steps, gradually adding noise. At each step $t$, the signal $P_{t-1}$ from the previous step is combined with Gaussian noise $\epsilon_t$. The factor $\alpha_t$, which is a time-dependent scalar, controls the proportion of the original signal and the noise in the current signal $P_t$. $\alpha_t$ is defined as a monotonically decreasing function of $t$, calculated using a linear decay schedule, ensuring that the influence of the original signal decreases while the noise contribution increases as the process progresses. This schedule was selected to ensure smooth transitions across steps and to balance the gradual addition of noise throughout the forward diffusion process.
Let $P_t$ be the pressure signal at step $t$, $\alpha_t$ be the factor controlling the noise level as previously discussed, and $\epsilon_t$ be the Gaussian noise added at each step; then, mathematically, we have the following:
$$P_t = \sqrt{\alpha_t}\, P_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I), \quad t = 1, 2, \dots, D$$
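To make the forward-noising rule concrete, the following is a minimal sketch of one PSFD step, assuming the square-root mixing rule above and a linearly decaying $\alpha_t$ schedule; the schedule endpoints, step count, and tensor shape are illustrative assumptions rather than values from the paper. The same rule is later applied to the refined mask in RMFD (Section 3.3.4).

```python
# Illustrative sketch (not the authors' code) of one forward diffusion step.
import torch

def linear_alpha_schedule(num_steps: int, start: float = 0.9999, end: float = 0.98) -> torch.Tensor:
    """Monotonically decreasing alpha_t over D steps (assumed endpoints)."""
    return torch.linspace(start, end, num_steps)

def forward_diffuse_step(p_prev: torch.Tensor, alpha_t: float) -> torch.Tensor:
    """Mix the previous signal with fresh Gaussian noise eps_t ~ N(0, I)."""
    eps = torch.randn_like(p_prev)
    return (alpha_t ** 0.5) * p_prev + ((1.0 - alpha_t) ** 0.5) * eps

# Usage: progressively noise one tactile frame over D steps.
D = 1000
alphas = linear_alpha_schedule(D)
p = torch.rand(1, 1, 496, 298)                 # (batch, channel, H, W) pressure map
for t in range(D):
    p = forward_diffuse_step(p, alphas[t].item())
```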

3.3.3. Refined Mask Generation

In analyzing pressure signals, a critical step is creating a refined mask that accurately differentiates the foreground (areas with actual pressure) from the background (non-pressure areas). The initial mask is directly derived from the input tactile signal. As depicted in the upper part of Figure 2–Stage 1, a binary mask is generated by thresholding the input ( t = 0 ). Furthermore, it undergoes a morphological opening operation to remove noise. The opening operation, combining three successive erosions followed by three dilations with a 5 × 5 kernel, effectively eliminates small noise-related blobs within the pressure signal mask without significantly affecting the larger, significant pressure areas.
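As an illustration, a minimal sketch of this mask-refinement step is given below, using OpenCV-style morphology; the threshold value is an assumption since the paper does not specify it.

```python
# Sketch of refined-mask generation: threshold, then a morphological opening
# (three erosions followed by three dilations with a 5x5 kernel).
import cv2
import numpy as np

def refined_mask(pressure: np.ndarray, thresh: float = 0.05) -> np.ndarray:
    binary = (pressure > thresh).astype(np.uint8)   # binary foreground mask at t = 0
    kernel = np.ones((5, 5), np.uint8)
    eroded = cv2.erode(binary, kernel, iterations=3)
    opened = cv2.dilate(eroded, kernel, iterations=3)
    return opened                                   # small noise blobs removed
```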

3.3.4. Refined Soft Mask Forward Diffusion (RMFD)

Parallel to the pressure signal, the refined mask undergoes a forward diffusion process. Initially, the mask $M$ is binary, but as Gaussian noise is incrementally added over $D$ steps, it transitions into a soft mask. This process mirrors the transformation applied to the pressure signal, ensuring synchronization between the two during the reverse denoising stage. It ensures that the features and contours extracted from the soft mask can be effectively aligned and applied to the corresponding stages of the pressure signal. If $M_t$ is the mask at step $t$, then mathematically, we have the following:
$$M_t = \sqrt{\alpha_t}\, M_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I), \quad t = 1, 2, \dots, D$$

3.3.5. Reverse Denoising Process

In our novel reverse denoising process, we implement a U-Net [43] model that is modified with a key innovative contour detection and alignment (CDA) [53] layer, as shown in Figure 3. This layer, composed of pyramidal pooling (PP) with atrous convolutions [54] at rates 2, 4, 6, and 8, and a cross-attention mechanism (CA) [55], is positioned at the U-Net model’s entrance. Its core function is to merge contour features from the masked signal into the tactile signal, thus enhancing the initial input for the U-Net. The significance of the CDA layer is that it effectively integrates contour features from the masked signal into the corresponding noisy pressure signal, enhancing the model’s denoising efficacy.
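For concreteness, the sketch below outlines a CDA-style layer in PyTorch, assuming single-head cross-attention over flattened spatial tokens and assumed channel sizes; it illustrates the idea (atrous pyramidal pooling on the mask stream followed by cross-attention with the pressure stream) rather than the authors' implementation.

```python
# Simplified CDA-style layer: contour features from the mask stream attend to
# the noisy pressure features and are fused back as a residual.
import torch
import torch.nn as nn

class CDALayer(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Pyramidal pooling via parallel atrous convolutions (rates 2, 4, 6, 8).
        self.atrous = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in (2, 4, 6, 8)
        ])
        self.fuse = nn.Conv2d(4 * channels, channels, 1)
        # Cross-attention: mask-derived contours act as queries over pressure features.
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=1, batch_first=True)

    def forward(self, pressure_feat: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = pressure_feat.shape
        contour = self.fuse(torch.cat([conv(mask_feat) for conv in self.atrous], dim=1))
        q = contour.flatten(2).transpose(1, 2)          # (B, H*W, C) queries from contours
        kv = pressure_feat.flatten(2).transpose(1, 2)   # keys/values from noisy pressure
        attended, _ = self.attn(q, kv, kv)
        fused = attended.transpose(1, 2).reshape(b, c, h, w)
        return pressure_feat + fused                    # contour-aligned input for the U-Net
```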
As detailed in Section 3.3.2, the pressure signal P t and the mask M t undergo forward diffusion, accumulating noise incrementally over D steps. The noisy images resulting from this process at each step form the input to the CDA layer. This procedural approach allows the CDA layer to precisely estimate and integrate contour information relevant to the current noise state at each step. The CDA layer’s ability to focus on high-contour areas within the masked tactile image and align these features with the noisy pressure signal aids the U-Net model in more accurately predicting and extracting noise from the pressure signal. Simultaneously, it refines the pressure signal’s features based on the contour information from the mask, thereby enhancing the reconstruction of the original pressure signal for more accuracy. The mathematical representation of this denoising process is as follows:
$$\hat{P}_{t-1} = U_\theta(\hat{P}_t, M_t, t)$$
where $\hat{P}_t$ denotes the estimated pressure signal at step $t$, $U_\theta$ symbolizes the U-Net model with parameters $\theta$, and $M_t$ represents the mask at step $t$, enhanced by the CDA layer. The process iterates through the steps, progressively refining the signal quality at each stage.
The output of the reverse denoising is $P_{\text{true}}$, the denoised pressure signal. This streamlined approach, combining the functionalities of the CDA layer and the U-Net model, not only diminishes noise but also sharpens and defines tactile contours, leading to an improved reconstruction of the pressure signal with each successive iteration.
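A hedged sketch of this iterative reverse loop is given below; `unet` stands for the CDA-equipped U-Net $U_\theta$, and the per-step masks are assumed to be available from the RMFD stream.

```python
# Sketch of the reverse denoising loop: apply U_theta from step D down to 1.
import torch

@torch.no_grad()
def reverse_denoise(unet, p_noisy, masks, num_steps):
    """Iteratively recover the denoised pressure signal P_true."""
    p_hat = p_noisy
    for t in range(num_steps, 0, -1):
        t_idx = torch.full((p_hat.shape[0],), t, device=p_hat.device, dtype=torch.long)
        p_hat = unet(p_hat, masks[t - 1], t_idx)   # \hat{P}_{t-1} = U_theta(\hat{P}_t, M_t, t)
    return p_hat                                   # denoised pressure map P_true
```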

3.4. Stage 2: 3D Pose Prediction Transformer (3DPPT)

3.4.1. Transformer Encoder

The denoised tactile signal $P_{\text{true}}$ is subsequently passed through a transformer encoder. The encoder consists of layers of layer normalization (LN) and self-attention (SA) mechanisms, which refine the features for precise posture estimation. The mathematical operation within the transformer encoder can be expressed as follows:
$$T(t) = \text{TransformerEncoder}\big(\text{LN}(\text{SA}(P_{\text{true}}))\big)$$
where $T(t)$ represents the encoded feature set prepared for the decoding stage.
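A minimal PyTorch sketch of such an encoder is shown below; the patch size, embedding dimension, and depth are assumptions for illustration.

```python
# Sketch: embed the denoised pressure map into tokens and encode them with a
# standard transformer encoder.
import torch
import torch.nn as nn

class TactileEncoder(nn.Module):
    def __init__(self, patch: int = 16, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        self.to_tokens = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # patchify the map
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, p_true: torch.Tensor) -> torch.Tensor:
        tokens = self.to_tokens(p_true).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens)                                 # encoded features T(t)
```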

3.4.2. Decoder Stage

Finally, the decoder stage comprises a series of deconvolution (Deconv), batch normalization (BN), and rectified linear unit (ReLU) layers, which upsample and normalize the features before making the final keypoint predictions. The decoder operation is mathematically described by the following:
$$K(t) = \text{ReLU}\big(\text{BN}(\text{Deconv}(T(t)))\big)$$
The keypoint prediction layer maps the processed features from the decoder to 3D pose keypoints during training. This regression step transforms the feature map into specific keypoint predictions, which are represented as follows:
$$\hat{C}(t) = \text{KeypointPredictor}(K(t))$$
where $\hat{C}(t)$ represents the predicted pose keypoints at time $t$, and $K(t)$ is the feature map from the decoder. The decoder output is first flattened into a 1D vector to ensure compatibility with the fully connected layer. The fully connected layer applies a linear transformation to map the flattened tactile image features to the 3D pose keypoints:
$$\hat{C}(t) = W K_{\text{flat}}(t) + b$$
Here, $W$ is the weight matrix that defines the mapping from the high-dimensional features to the keypoints, and $b$ is the bias vector added to the output. The final output $\hat{C}(t)$ is a 12-dimensional vector representing the predicted 3D pose keypoints.
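The sketch below illustrates a decoder and keypoint head of this form; the deconvolution width, token-grid shape, and use of a lazily initialized linear layer are assumptions, with the 12-dimensional output following the description above.

```python
# Sketch of the decoder (Deconv -> BN -> ReLU) and the flatten + fully
# connected keypoint head, \hat{C}(t) = W * K_flat(t) + b.
import torch
import torch.nn as nn

class KeypointDecoder(nn.Module):
    def __init__(self, dim: int = 256, grid: tuple = (31, 18), out_dim: int = 12):
        super().__init__()
        self.grid = grid                              # assumed token grid (H/patch, W/patch)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        self.head = nn.LazyLinear(out_dim)            # linear map W x + b to keypoints

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, d, *self.grid)   # back to a 2D feature map
        k = self.deconv(feat)                                     # K(t)
        return self.head(k.flatten(1))                            # predicted keypoints \hat{C}(t)
```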

4. Experimental Evaluation

4.1. Dataset

We introduce the novel PPIT dataset to assess the effectiveness of the proposed method. The PPIT dataset is an extensive collection of synchronized tactile signal frames and 3D pose keypoints, designed to enable human pose estimation through foot pressure measurements. Tactile signal frames (resolution 496 × 298) are captured using TG0 Advanced Pressure Mats [56], designed and manufactured by TG0. Each 60 cm × 30 cm module implements an approximately 15 × 8 grid of capacitive tactels set on a 40 mm pitch lattice. Two identical sensing layers are stacked; the upper layer records binary contact-area information, while the lower layer measures normal pressure, providing 240 raw channels (120 area + 120 pressure) per module. Tiling the six modules in a 3 × 2 arrangement yields an effective sensing area of 1.80 m × 0.60 m (45 × 16 tactels). The pressure layer resolves forces up to 15 kPa with a sensitivity of 0.1 kPa. The TG0 Advanced Pressure Mat is designed for real-time applications, streaming data at 60 Hz with onboard calibration and a low-latency API. The capacitive sensing surface demonstrates negligible residual deformation and recovers within a few milliseconds under normal body-weight loads. The mat reliably captures quasi-static and moderately dynamic movements (e.g., walking, posture transitions), while very fast ballistic motions may require higher sampling rates or complementary sensors.
The ground truth, in the form of 3D pose keypoints, is obtained through a motion capture system comprising nine high-resolution Arqus-700 cameras [57], which track three markers attached to each of 12 key body points. Each camera features a 26 MP sensor capable of capturing detailed motion at 200 Hz with an impressive 3D resolution of 0.3 mm.
The dataset consists of 12 distinct activities, each performed by a volunteer. Activities range from various postures like squatting and goddess yoga positions to dynamic movements such as forearm plank transitions and push-ups; see Table 1. Each action was performed for approximately one minute. The dataset contains 60,000 tactile signal frames. We separated two activities (seating and squatting) from the 12 to serve as the validation dataset. The model was trained on the remaining 10 poses, which were further split into 80% for training and 20% for testing. The recorded tactile frames and pose data were synchronized based on their timestamps, ensuring that the tactile frame P ( t ) corresponds to the 3D pose keypoints K ( t ) at the same moment t. Our dataset is the first to offer tactile signals paired with accurate motion capture data. A wide range of activities, the corresponding pressure signals, and the motion-captured skeleton are shown in Table 1.

4.2. Experimental Protocol

The Stage 1 dual-forward diffusion and Stage 2 3D-pose-prediction transformer models were trained using mean squared error (MSE) loss and optimized with the Adam optimizer. They were trained on a GPU workstation featuring 128 GB of memory, an Intel Xeon W-2155 processor (Intel Corporation; Santa Clara, CA, USA), and NVIDIA Quadro RTX 8000 48 GB graphics cards (NVIDIA Corporation; Santa Clara, CA, USA), housed in a Lenovo ThinkStation P520 chassis (Lenovo Group Ltd.; Beijing, China). No other similar datasets with paired 3D poses are currently available, so validation was performed only on the PPIT dataset. We conducted experimental evaluations of the two stages separately. Stage 1, evaluation of pressure signal restoration (Section 4.5), assesses the denoising and signal restoration capabilities of the model. Stage 2, posture prediction evaluation (Section 4.6), compares the model’s ability to predict 3D poses.

4.3. Stage 1 Evaluation Metrics

4.3.1. Peak Signal-to-Noise Ratio (PSNR)

The peak signal-to-noise ratio (PSNR) metric is used for measuring the quality of a reconstructed or processed image compared to its original (reference) version. PSNR, measured in decibels (dB), is based on the error between corresponding pixels in the two images and is quantified using the mean squared error (MSE). The ratio essentially compares the image’s maximum possible pixel value (peak signal) to the power of its noise (distortion or error). High PSNR values indicate lower distortion and, thus, higher image quality.
$$\text{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)$$
where $\text{MAX}_I$ is the maximum possible pixel value of the image (e.g., 255 for 8-bit images) and MSE is the mean squared error between the original and the processed image.
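For reference, a small NumPy sketch of this computation (assuming 8-bit images) is:

```python
# PSNR in decibels between a reference and a restored image.
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```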

4.3.2. Structural Similarity (SSIM)

Structural similarity (SSIM) is a metric used for measuring the similarity between two images. Unlike PSNR, which primarily focuses on pixel-wise errors, SSIM considers changes in structural information, luminance, and contrast. The SSIM index aims to provide a more perceptually relevant measure by accounting for the fact that the human visual system is highly adapted for extracting structural information from a visual scene. SSIM is a unitless metric ranging from 0 to 1, where a higher value (closer to 1) implies greater similarity between the compared images.
$$\text{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $x$ and $y$ are the two images being compared, $\mu_x$ and $\mu_y$ are their average pixel values, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is the covariance, and $c_1$, $c_2$ are constants to stabilize the division.
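In practice, SSIM can be computed with scikit-image, as in this minimal sketch (assuming 8-bit grayscale frames):

```python
# SSIM via scikit-image; data_range must match the pixel value range.
from skimage.metrics import structural_similarity as ssim
import numpy as np

def ssim_score(original: np.ndarray, restored: np.ndarray) -> float:
    return ssim(original, restored, data_range=255)
```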

4.3.3. Learned Perceptual Image Patch Similarity (LPIPS)

Learned perceptual image patch similarity (LPIPS) is a perceptual metric that evaluates image similarity by computing the distance between feature representations extracted from a pre-trained neural network, providing a measure aligned with human visual perception. LPIPS is a unitless metric ranging from 0 to 1, where lower values indicate greater perceptual similarity.
$$\text{LPIPS}(x,y) = \sum_{l} w_l \left\lVert \phi_l(x) - \phi_l(y) \right\rVert_2^2$$
where $x$ and $y$ are the two images being compared, $\phi_l(x)$ and $\phi_l(y)$ represent the feature maps from layer $l$ of the network for images $x$ and $y$, $w_l$ is the weight for layer $l$, and $\lVert\cdot\rVert_2$ is the Euclidean norm.
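LPIPS is commonly computed with the reference `lpips` package; a minimal sketch (assuming 3-channel tensors scaled to [-1, 1]) is:

```python
# LPIPS perceptual distance using a pretrained AlexNet feature extractor.
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")
x = torch.rand(1, 3, 256, 256) * 2 - 1            # placeholder images in [-1, 1]
y = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(x, y)                          # lower = more perceptually similar
```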

4.4. Stage 2 Evaluation Metrics

4.4.1. Mean per Joint Position Error (MPJPE)

Mean per joint position error (MPJPE) measures the average Euclidean distance, in millimeters (mm), between the predicted and ground truth positions of various joints in the human body. It is a key indicator of the accuracy of a pose estimation model, with a lower MPJPE value indicating higher accuracy.
$$\text{MPJPE} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{J} \left\lVert P_{ij} - G_{ij} \right\rVert_2$$
where $N$ is the number of samples, $J$ is the number of joints, $P_{ij}$ is the predicted position of the $j$th joint in the $i$th sample, and $G_{ij}$ is the corresponding ground truth position.
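A NumPy sketch of this formula, assuming prediction and ground-truth arrays of shape (N, J, 3) in millimetres, is given below. (Many works additionally normalize by the number of joints $J$; here the summation follows the formula as stated.)

```python
# MPJPE: per-joint Euclidean errors, summed over joints and averaged over samples.
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    per_joint_err = np.linalg.norm(pred - gt, axis=-1)   # (N, J) distances
    return float(per_joint_err.sum(axis=1).mean())
```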

4.4.2. Average Keypoint Localization Error of Whole Body (AKLEB)

The average keypoint localization error of whole body (AKLEB), measured in millimeters (mm), offers a holistic assessment of whole-body pose estimation accuracy, contrasting with MPJPE’s focus on joint localization. AKLEB evaluates the localization accuracy of all key body parts, providing a comprehensive measure of a model’s ability to capture the body’s pose nuances. A lower AKLEB indicates a more accurate pose estimation across the entire body, which is crucial for detailed and precise body movement analysis.
$$\text{AKLEB}_d = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{J}\sum_{j=1}^{J} \left| P_{ij}^{d} - G_{ij}^{d} \right|$$
where $N$ is the number of samples, $J$ is the total number of joints across the whole body, $P_{ij}^{d}$ is the predicted coordinate for dimension $d \in \{X, Y, Z\}$ of the $j$th joint in the $i$th sample, and $G_{ij}^{d}$ is the corresponding ground truth.
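Similarly, a NumPy sketch of AKLEB along one coordinate dimension (with the same assumed (N, J, 3) array layout) is:

```python
# AKLEB for one dimension: mean absolute error over all samples and joints.
import numpy as np

def akleb(pred: np.ndarray, gt: np.ndarray, dim: int) -> float:
    """Mean absolute keypoint error along one dimension (0=X, 1=Y, 2=Z)."""
    return float(np.abs(pred[..., dim] - gt[..., dim]).mean())
```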

4.5. Evaluation of Pressure Signal Restoration

The efficacy of image restoration methods is differentiated through a meticulous examination of performance metrics, including PSNR, SSIM, mean absolute error (MAE), and LPIPS, applied to the tactile signal frames of the PPIT dataset. The proposed methodology demonstrates superior restoration capability against state-of-the-art methods, as evidenced by the quantitative metrics tabulated in Table 2. Among the evaluated methods, diffusion models like SI-DDPM-FMO and Swintormer lead in performance, with our method surpassing all others in PSNR (36.24) and SSIM (0.873) while demonstrating the lowest MAE (0.045) and LPIPS (0.109). SI-DDPM-FMO struggles with motion blur overlapping background elements, while Swintormer faces challenges with varied blurring scenarios due to a small compression ratio in the latent space, and DID might have inconsistencies in exposure and white balancing. Our method overcomes these limitations by leveraging dual-forward diffusion processes for enhanced noise and blur handling, along with the CDA layer for improved contour detection and alignment, ensuring better adaptation to complex scenarios and delivering superior image restoration outcomes.

Computational Efficiency Comparison

In our image restoration technique comparison as shown in Table 3, our method not only achieves the highest PSNR of 36.24 dB but also demonstrates an optimal balance between computational efficiency and image quality restoration. The parameter count, indicating the trainable parameters within the architecture, stands at 135.41 million, while the multiply-accumulate operations (MACs) reflect our model’s complexity. This is comparable to DID and Swintormer, but they do not match our method’s PSNR, underscoring its superior balance of computational demand and restoration quality. Specifying trainable parameters alongside MACs highlights our model’s efficient design in achieving state-of-the-art image restoration.

4.6. Posture Prediction Evaluation

To conduct a fair comparison, we use the PPIT dataset to evaluate the posture prediction methods; their validation results are shown in Table 4. This approach is crucial given the inherent differences in the architecture and intended applications of these methods, as well as the variation in the datasets originally applied. In adapting methods such as [34], tactile input signals are mapped to posture classifications using fully convolutional networks (FCNs). We replaced the classification layer with a regression head. The transformer encoder in [52], which originally encoded temporal features for SMPL [59] parameter prediction, was modified to output keypoint embeddings. Additionally, the loss function, which used SMPL loss and reconstruction loss, was replaced with MSE for keypoint regression.
The results in Table 4 demonstrate the superior performance of our method compared to existing approaches. Our method achieves the lowest mean per joint position error (MPJPE) of 48.41 mm, outperforming Luo et al. [16], the next best performer, by a significant margin of 13.41 mm. Similarly, in terms of average keypoint localization error of whole body (AKLEB) coordinates, our method consistently produces the lowest errors across all dimensions (X, Y, and Z), highlighting its robustness in predicting 3D poses. Our method is the only tactile-based approach that produces comparable results to modern image-based pose estimation methods [60], despite the fundamentally different sensing modality.
The methods by [16,52] excel in 3D human pose estimation from tactile signals through adversarial learning, attention, and CNNs, navigating the challenges of noisy data. Our technique outperforms these by achieving the lowest MPJPE and AKLEB in all dimensions, showcasing superior accuracy in pose estimation. The key distinction of our method lies in its advanced handling of the intrinsic limitations posed by tactile pressure sensors, which produce noisy and sparse data. Unlike traditional approaches that rely on stitching consecutive pressure images to compile information, sacrificing temporal resolution in the process and adding unnecessary noise, our dual-diffusion and the CDA layer improve the quality of the input data. Additionally, while these methods generate ground truth through multiple processes, such as triangulation from 2D to 3D keypoints, which reduces the accuracy of the ground truth, we capture ground truth using motion capture, avoiding any processing and providing accurate 3D keypoints. Furthermore, we implement pyramidal pooling before utilizing a transformer-convolution architecture for keypoint prediction. This innovative strategy significantly boosts our method’s performance, enabling it to outshine conventional tactile-based methods in accuracy and efficiency.

4.7. Ablation Study

An ablation study was conducted to assess the impact of various novel architectural elements on mean per joint position error (MPJPE) illustrated in Table 5. The best-performing combination included PSFD, RMFD, PP, and CA elements, which resulted in the lowest MPJPE of 4.4382. Passing the features from PSFD and RMFD directly to CA improves accuracy; however, refining the features through atrous convolutions in PP yields the lowest MPJPE.

5. Conclusions

In this paper, we made significant strides in advancing tactile-based 3D human pose estimation systems that leverage tactile information through a high-density tactile carpet and corresponding motion-captured poses as ground truth. Our dual-diffusion signal enhancement (DDSE) model is utilized to restore the tactile signal in the pressure images, representing the pressure information at a higher temporal and spatial resolution. Furthermore, it uses a transformer-convolution architecture for posture prediction, outperforming state-of-the-art methods. A key contribution is our unique PPIT dataset, which combines tactile pressure maps with motion-captured data and is the first to address the challenges of image-based keypoint generation with highly accurate ground truth. Our approach has been rigorously validated on the PPIT dataset, showcasing its exceptional ability to capture the nuances of human movement with high accuracy and reliability. Our method’s validation showcases its superiority in key metrics, promising significant advances in fields like human–computer interaction and assistive technologies. In future work, we will address the issue of users with musculoskeletal asymmetries (e.g., pelvic tilt, flat feet, scoliosis, injuries) by extending our dataset with a wider range of participants, and sophisticated data augmentation methods [61,62,63]. By transforming tactile data into accurate pose predictions, our research introduces a groundbreaking approach to human pose estimation. This reliable, non-intrusive method surpasses conventional visual systems, establishing a benchmark for future tactile-based 3D human pose estimation.

Author Contributions

Formal analysis: S.K. and B.N.; funding acquisition: D.M.; investigation: S.K., B.N., Y.L., and D.M.; data collection: J.B. and Y.L.; methodology: S.K.; project administration: D.M. and L.G.; resources: D.M., J.B., and Y.L.; software: S.K. and B.N.; supervision: D.M., L.G. and Y.L.; validation: D.M.; writing—original draft: S.K. and B.N.; writing—review and editing: B.N., D.M., J.B., and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by Innovate UK (Pressure-to-Posture Inference Technology, Ref: 10033778). We thank Kingston University, London, for providing resources such as the motion capture lab and supporting data collection. We also acknowledge TG0 as the industrial partner that provided their patented polymer-based sensor technology.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Kingston University Research Ethics Committee (approval number 3253, 25 May 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/tg0uk/PPIT_database, accessed on 5 November 2023.

Conflicts of Interest

Authors Ying Liu and Liucheng Guo are employed by the company Tangi0 Ltd. (TG0). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Gong, J.; Foo, L.G.; Fan, Z.; Ke, Q.; Rahmani, H.; Liu, J. DiffPose: Toward More Reliable 3D Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 13041–13051. [Google Scholar]
  2. Liu, H.; He, J.-Y.; Cheng, Z.-Q.; Xiang, W.; Yang, Q.; Chai, W.; Wang, G.; Bao, X.; Luo, B.; Geng, Y. Posynda: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation. In Proceedings of the 31st ACM International Conference on Multimedia (ACMMM), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5542–5551. [Google Scholar]
  3. Clever, H.M.; Kapusta, A.; Park, D.; Erickson, Z.; Chitalia, Y.; Kemp, C.C. 3D Human Pose Estimation on a Configurable Bed from a Pressure Image. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 54–61. [Google Scholar]
  4. Marchand, E.; Uchiyama, H.; Spindler, F. Pose Estimation for Augmented Reality: A Hands-On Survey. IEEE Trans. Vis. Comput. Graph. 2016, 22, 2633–2651. [Google Scholar] [CrossRef]
  5. Kachole, S.; Alkendi, Y.; Baghaei Naeini, F.; Makris, D.; Zweiri, Y. Asynchronous Events-Based Panoptic Segmentation Using Graph Mixer Neural Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, BC, Canada, 18–22 June 2023; pp. 4083–4092. [Google Scholar]
  6. Kachole, S.; Sajwani, H.; Baghaei Naeini, F.; Makris, D.; Zweiri, Y. Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 399–415. [Google Scholar]
  7. Kachole, S.; Mahakal, M.; Bhagwatkar, A. 3 Dimensional Welding SPM/Path Tracker. Int. J. Des. Manuf. Technol. 2016, 7, 19–23. [Google Scholar]
  8. Takalkar, M.; Kakarparthy, V.; Khan, I.R. Design & Development of TIG Welding—Special Purpose Machine. Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) 2017, 5, 1344–1351. [Google Scholar]
  9. Sharma, P.; Shah, B.B.; Prakash, C. A Pilot Study on Human Pose Estimation for Sports Analysis. In Pattern Recognition and Data Analysis with Applications; Lecture Notes in Electrical Engineering; Springer: Singapore, 2022; Volume 888, pp. 533–544. [Google Scholar]
  10. Seguin, G.; Alahari, K.; Sivic, J.; Laptev, I. Pose Estimation and Segmentation of Multiple People in Stereoscopic Movies. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1643–1655. [Google Scholar] [CrossRef]
  11. Ran, X.; Wang, C.; Xiao, Y.; Gao, X.; Zhu, Z.; Chen, B. A Portable Sitting Posture Monitoring System Based on a Pressure Sensor Array and Machine Learning. Sens. Actuators A Phys. 2021, 331, 112900. [Google Scholar] [CrossRef]
  12. Lee, S.-H.; Joo, H.-T.; Chung, I.; Park, D.; Choi, Y.; Kim, K.-J. A Novel Approach for Virtual Locomotion Gesture Classification: Self-Teaching Vision Transformer for a Carpet-Type Tactile Sensor. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Sydney, Australia, 16–20 October 2023; pp. 369–370. [Google Scholar]
  13. Clever, H.M.; Erickson, Z.; Kapusta, A.; Turk, G.; Liu, K.; Kemp, C.C. Bodies at Rest: 3D Human Pose and Shape Estimation from a Pressure Image Using Synthetic Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6215–6224. [Google Scholar]
  14. Clever, H.M.; Grady, P.L.; Turk, G.; Kemp, C.C. Body Pressure-Inferring Body Pose and Contact Pressure from a Depth Image. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 137–153. [Google Scholar] [CrossRef]
  15. Baghaei Naeini, F.; Kachole, S.; Makris, D.; Zweiri, Y.H. Event Augmentation for Contact Force Measurements. IEEE Access 2022, 10, 123651–123660. [Google Scholar] [CrossRef]
  16. Luo, Y.; Li, Y.; Foshey, M.; Shou, W.; Sharma, P.; Palacios, T.; Torralba, A.; Matusik, W. Intelligent Carpet: Inferring 3D Human Pose from Tactile Signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11255–11265. [Google Scholar]
  17. Badiola-Bengoa, A.; Mendez-Zorrilla, A. A Systematic Review of the Application of Camera-Based Human Pose Estimation in the Field of Sport and Physical Exercise. Sensors 2021, 21, 5996. [Google Scholar] [CrossRef]
  18. Wu, C.-H.; Wu, T.-C.; Lin, W.-B. Exploration of Applying Pose Estimation Techniques in Table Tennis. Appl. Sci. 2023, 13, 1896. [Google Scholar] [CrossRef]
  19. Baumgartner, T.; Klatt, S. Monocular 3D Human Pose Estimation for Sports Broadcasts Using Partial Sports Field Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 5108–5117. [Google Scholar]
  20. Bhat, N.N.; Sameri, J.; Struye, J.; Vega, M.T.; Berkvens, R.; Famaey, J. Multi-Modal Pose Estimation in XR Applications Leveraging Integrated Sensing and Communication. In Proceedings of the 1st ACM Workshop on Mobile Immersive Computing, Networking, and Systems, New York, NY, USA, 8 October 2023; pp. 261–267. [Google Scholar]
  21. Ohri, A.; Agrawal, S.; Chaudhary, G.S. On-Device Realtime Pose Estimation & Correction. Int. J. Adv. Eng. Manag. (IJAEM) 2021, 3, 7. [Google Scholar]
  22. Boda, P.; Ramadevi, Y. Predicting Pedestrian Behavior at Zebra Crossings Using Bottom-Up Pose Estimation and Deep Learning. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 527–544. [Google Scholar]
  23. Anvari, T.; Park, K.; Kim, G. Upper Body Pose Estimation Using Deep Learning for a Virtual Reality Avatar. Appl. Sci. 2023, 13, 2460. [Google Scholar] [CrossRef]
  24. Zhao, Y.; Guo, T. Xihe: A 3D Vision-Based Lighting Estimation Framework for Mobile Augmented Reality. In Proceedings of the 19th ACM International Conference on Mobile Systems, Applications and Services (MobiSys), Virtual, 24–28 June 2021; pp. 28–40. [Google Scholar]
  25. Dong, X.; Wang, X.; Li, B.; Wang, H.; Chen, G.; Cai, M. YH-Pose: Human Pose Estimation in Complex Coal Mine Scenarios. Eng. Appl. Artif. Intell. 2024, 127, 107338. [Google Scholar] [CrossRef]
  26. Maskeliūnas, R.; Kulikajevas, A.; Damaševičius, R.; Griškevičius, J.; Adomavičienė, A. Biomac3D: 2D-to-3D Human Pose Analysis Model for Tele-Rehabilitation Based on Pareto Optimized Deep-Learning Architecture. Appl. Sci. 2023, 13, 1116. [Google Scholar] [CrossRef]
  27. Mehraban, S.; Adeli, V.; Taati, B. MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 6920–6930. [Google Scholar]
  28. Lupión, M.; Polo-Rodríguez, A.; Medina-Quero, J.; Sanjuan, J.F.; Ortigosa, P.M. 3D Human Pose Estimation from Multi-View Thermal Vision Sensors. Inf. Fusion 2024, 104, 102154. [Google Scholar] [CrossRef]
  29. Li, W.; Sun, C.; Yuan, W.; Gu, W.; Cui, Z.; Chen, W. Smart Mat System with Pressure Sensor Array for Unobtrusive Sleep Monitoring. In Proceedings of the 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju, Republic of Korea, 11–15 July 2017; pp. 177–180. [Google Scholar]
  30. Ozioko, O.; Dahiya, R. Smart Tactile Gloves for Haptic Interaction, Communication, and Rehabilitation. Adv. Intell. Syst. 2022, 4, 2100091. [Google Scholar] [CrossRef]
  31. Song, Y.; Guo, S.; Xiao, S.; Zhao, X. Unconstrained Identification of the Positions of Chest and Abdomen and Detection of Respiratory Motions in Sleep by Using a Bed Size Tactile Sensor Sheet. IEEE Sens. J. 2023, 23, 16276–16286. [Google Scholar] [CrossRef]
  32. Pagoli, A.; Chapelle, F.; Corrales-Ramon, J.-A.; Mezouar, Y.; Lapusta, Y. Large-Area and Low-Cost Force/Tactile Capacitive Sensor for Soft Robotic Applications. Sensors 2022, 22, 4083. [Google Scholar] [CrossRef] [PubMed]
  33. Moro, F.; Hardy, E.; Fain, B.; Dalgaty, T.; Clémençon, P.; De Prà, A.; Esmanhotto, E.; Castellani, N.; Blard, F.; Gardien, F. Neuromorphic Object Localization Using Resistive Memories and Ultrasonic Transducers. Nat. Commun. 2022, 13, 3506. [Google Scholar] [CrossRef]
  34. Zhong, W.; Xu, H.; Ke, Y.; Ming, X.; Jiang, H.; Li, M.; Wang, D. Accurate and Efficient Sitting Posture Recognition and Human-Machine Interaction Device Based on Fabric Pressure Sensor Array and Neural Network. Adv. Mater. Technol. 2024, 9, 2301579. [Google Scholar] [CrossRef]
  35. Zhang, Y.; Li, Z.; An, L.; Li, M.; Yu, T.; Liu, Y. Lightweight Multi-Person Total Motion Capture Using Sparse Multi-View Cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 5560–5569. [Google Scholar]
  36. Marusic, A.; Nguyen, S.M.; Tapus, A. Evaluating Kinect, OpenPose, and BlazePose for Human Body Movement Analysis on a Low Back Pain Physical Rehabilitation Dataset. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), Stockholm, Sweden, 13–16 March 2023; pp. 587–591. [Google Scholar]
  37. Yang, G.; Yang, S.; Zhang, J.Z.; Manchester, Z.; Ramanan, D. PPR: Physically Plausible Reconstruction from Monocular Videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3914–3924. [Google Scholar]
  38. Popescu, M.; Mronga, D.; Bergonzani, I.; Kumar, S.; Kirchner, F. Experimental Investigations into Using Motion Capture State Feedback for Real-Time Control of a Humanoid Robot. Sensors 2022, 22, 9853. [Google Scholar] [CrossRef] [PubMed]
  39. Agethen, P.; Otto, M.; Mengel, S.; Rukzio, E. Using Marker-Less Motion Capture Systems for Walk Path Analysis in Paced Assembly Flow Lines. Procedia CIRP 2016, 54, 152–157. [Google Scholar] [CrossRef]
  40. Michoud, B.; Guillou, E.; Bouakaz, S. Real-Time and Markerless Full-Body Human Motion Capture. In Actes du Groupe de Travail Animation et Simulation (GTAS’07); Association Française d’Informatique Graphique (AFIG): Lyon, France, 2007; pp. 1–11. [Google Scholar]
  41. Sofianos, T.; Sampieri, A.; Franco, L.; Galasso, F. Space-Time-Separable Graph Convolutional Network for Pose Forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11209–11218. [Google Scholar]
  42. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 6840–6851. [Google Scholar]
  43. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Lect. Notes Comput. Sci. 2015, 9351, 234–241. [Google Scholar]
  44. Nah, S.; Hyun Kim, T.; Mu Lee, K. Deep Multi-Scale Convolutional Neural Network for Dynamic Scene Deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3883–3891. [Google Scholar]
  45. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Multi-Stage Progressive Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 14821–14831. [Google Scholar]
  46. Chen, L.; Lu, X.; Zhang, J.; Chu, X.; Chen, C. HiNet: Half Instance Normalization Network for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 182–192. [Google Scholar]
47. Tsai, F.-J.; Peng, Y.-T.; Lin, Y.-Y.; Tsai, C.-C.; Lin, C.-W. Stripformer: Strip Transformer for Fast Image Deblurring. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 146–162.
48. Chen, Z.; Zhang, Y.; Liu, D.; Xia, B.; Gu, J.; Kong, L.; Yuan, X. Hierarchical Integration Diffusion Model for Realistic Image Deblurring. arXiv 2023, arXiv:2305.12966.
49. Nguyen, C.M.; Chan, E.R.; Bergman, A.W.; Wetzstein, G. Diffusion in the Dark: A Diffusion Model for Low-Light Text Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 4146–4157.
50. Spetlik, R.; Rozumnyi, D.; Matas, J. Single-Image Deblurring, Trajectory, and Shape Recovery of Fast Moving Objects with Denoising Diffusion Probabilistic Models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 6857–6866.
51. Chen, K.; Liu, Y. Efficient Image Deblurring Networks Based on Diffusion Models. arXiv 2024, arXiv:2401.05907.
52. Chen, W.; Hu, Y.; Song, W.; Liu, Y.; Torralba, A.; Matusik, W. CAvatar: Real-Time Human Activity Mesh Reconstruction via Tactile Carpets. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 7, 1–24.
53. Kachole, S.; Huang, X.; Baghaei Naeini, F.; Muthusamy, R.; Makris, D.; Zweiri, Y. Bimodal SegNet: Fused Instance Segmentation Using Events and RGB Frames. Pattern Recognit. 2024, 149, 110215.
54. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
55. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
56. TG0. Advanced Pressure Mat Demonstrator. Available online: https://www.tg0.co.uk/demonstrators/advanced-pressure-mat (accessed on 17 July 2025).
57. Qualisys. Qualisys-Advanced Motion Capture Systems. Available online: https://www.qualisys.com/ (accessed on 1 February 2024).
58. Kupyn, O.; Budzan, V.; Mykhailych, M.; Mishkin, D.; Matas, J. DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8183–8192.
59. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. 2015, 34, 1–16.
60. Guo, Y.; Gao, T.; Dong, A.; Jiang, X.; Zhu, Z.; Wang, F. A Survey of the State of the Art in Monocular 3D Human Pose Estimation: Methods, Benchmarks, and Challenges. Sensors 2025, 25, 2409.
61. Li, Z.; Yu, C.; Liang, C.; Shi, Y. PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-Based Motion Capture. arXiv 2024, arXiv:2409.14101.
62. Ray, L.S.S.; Rey, V.F.; Zhou, B.; Suh, S.; Lukowicz, P. PressureTransferNet: Human Attribute Guided Dynamic Ground Pressure Profile Transfer Using 3D Simulated Pressure Maps. arXiv 2023, arXiv:2308.00538.
63. Chandrasekaran, M.; Francik, J.; Makris, D. Enhancing Gait Recognition: Data Augmentation via Physics-Based Biomechanical Simulation. In Computer Vision–ECCV 2024 Workshops; Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15635.
Figure 1. Overview of the ‘pressure to posture’ technology, which uses tactile pressure data to predict body posture. A subject fitted with motion capture markers performs exercises on a pressure-sensitive mat while posture data are collected. The first row shows the subject in a plank position, the second row the corresponding pressure distribution maps from the tactile mat, and the third row the posture predictions produced by the proposed neural network. The last three rows repeat this sequence for walking.
Figure 2. Proposed posture estimation framework with two stages. Stage 1 utilizes dual-forward diffusion for the noisy pressure signal and morphological mask, integrating features at each step via the CDA layer and performing denoising through reverse diffusion. Stage 2 employs a transformer-convolution neural network for 3D keypoint estimation.
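To make the data flow of Figure 2 concrete, the following PyTorch sketch shows how a two-stage enhance-then-regress pipeline of this kind can be wired together at inference time. The module definitions, tensor shapes (a 32 × 32 pressure frame, 17 output joints), and the number of reverse steps are illustrative assumptions and do not reproduce the proposed DDSE configuration.

```python
# Minimal, illustrative sketch of the two-stage pipeline in Figure 2 (PyTorch).
# Shapes, channel counts, 17 joints, and 8 refinement steps are assumed values,
# not the configuration reported in the paper.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the Stage 1 reverse-diffusion network: predicts a cleaned
    pressure map from the noisy map and its morphological mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, noisy, mask):
        return self.net(torch.cat([noisy, mask], dim=1))

class KeypointHead(nn.Module):
    """Stand-in for the Stage 2 transformer-convolution keypoint regressor."""
    def __init__(self, n_joints=17, d_model=64):
        super().__init__()
        self.n_joints = n_joints
        self.backbone = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.out = nn.Linear(d_model, n_joints * 3)
    def forward(self, pressure):
        f = self.backbone(pressure)                # (B, C, H', W')
        tokens = f.flatten(2).transpose(1, 2)      # (B, H'*W', C)
        pooled = self.encoder(tokens).mean(dim=1)  # (B, C)
        return self.out(pooled).view(-1, self.n_joints, 3)

def infer(noisy_map, mask, denoiser, head, steps=8):
    """Simplified enhance-then-regress pass (iterative refinement as a crude
    stand-in for reverse diffusion)."""
    x = noisy_map
    for _ in range(steps):
        x = denoiser(x, mask)
    return head(x)

if __name__ == "__main__":
    noisy = torch.rand(1, 1, 32, 32)   # assumed 32x32 pressure frame
    mask = (noisy > 0.5).float()       # assumed binary morphological mask
    joints = infer(noisy, mask, TinyDenoiser(), KeypointHead())
    print(joints.shape)                # torch.Size([1, 17, 3])
```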
Figure 3. Contour detection and alignment (CDA) layer.
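Figure 3 names the contour detection and alignment (CDA) layer, which is described in the main text. Purely as an illustration of the kind of mask and contour information such a layer operates on, the snippet below derives a binary morphological mask and the outer contours of the contact regions from a raw pressure frame using OpenCV; the threshold and kernel size are arbitrary assumptions, and this is not the CDA layer itself.

```python
# Illustrative only: deriving a binary morphological mask and outer contours
# from a single pressure frame. This is NOT the paper's CDA layer; the
# threshold and kernel size are arbitrary assumptions. Requires OpenCV >= 4.
import cv2
import numpy as np

def mask_and_contours(pressure, thresh=0.1, kernel_size=3):
    """pressure: float array in [0, 1], shape (H, W)."""
    frame = (pressure * 255).astype(np.uint8)
    # Binarise, then close small gaps between contact blobs.
    _, mask = cv2.threshold(frame, int(thresh * 255), 255, cv2.THRESH_BINARY)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Outer contours of the contact regions (feet, hands, etc.).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return mask, contours

if __name__ == "__main__":
    demo = np.zeros((32, 32), dtype=np.float32)
    demo[5:10, 5:9] = 0.8          # a synthetic contact blob
    mask, contours = mask_and_contours(demo)
    print(mask.shape, len(contours))
```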
Table 1. Visualization of the PPIT dataset.

Pose Title                             | Pose    | Pressure Map | Motion-Captured Skeleton
Squat                                  | [image] | [image]      | [image]
Stay in Goddess pose                   | [image] | [image]      | [image]
Extend legs                            | [image] | [image]      | [image]
Stay Standing                          | [image] | [image]      | [image]
Bend upper body                        | [image] | [image]      | [image]
Standing wide-legged and forward fold  | [image] | [image]      | [image]
Plank                                  | [image] | [image]      | [image]
Right Lunge                            | [image] | [image]      | [image]
Walking                                | [image] | [image]      | [image]
Sit                                    | [image] | [image]      | [image]
Sit Up                                 | [image] | [image]      | [image]
Left Lunge                             | [image] | [image]      | [image]
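Each row of Table 1 pairs a pose title with its pressure map and motion-captured skeleton. A minimal loader for such pairs is sketched below; the per-sample .npz layout (a 'pressure' frame and a 'joints' array) is a hypothetical convention used for illustration, not the released PPIT format.

```python
# Hypothetical loader for pressure-map / skeleton pairs such as those in Table 1.
# The .npz layout ('pressure': HxW float map, 'joints': Jx3 positions) is an
# assumed convention for illustration, not the actual PPIT release format.
from pathlib import Path
import numpy as np
import torch
from torch.utils.data import Dataset

class PressurePoseDataset(Dataset):
    def __init__(self, root):
        self.files = sorted(Path(root).glob("*.npz"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = np.load(self.files[idx])
        pressure = torch.from_numpy(sample["pressure"]).float().unsqueeze(0)  # (1, H, W)
        joints = torch.from_numpy(sample["joints"]).float()                   # (J, 3)
        return pressure, joints
```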
Table 2. Comparison of state-of-the-art image restoration methods on the PPIT dataset.

Method             | PSNR (dB) ↑ | SSIM ↑ | MAE (mm) ↓ | LPIPS ↓
DeblurGAN [58]     | 23.95       | 0.614  | 0.057      | 0.315
DeepDeblur [44]    | 24.06       | 0.621  | 0.055      | 0.347
MPRNet [45]        | 26.48       | 0.758  | 0.054      | 0.348
HINET [46]         | 29.61       | 0.745  | 0.053      | 0.231
Stripformer [47]   | 30.34       | 0.734  | 0.052      | 0.214
Hi Diff [48]       | 34.71       | 0.714  | 0.050      | 0.271
DID [49]           | 35.5        | 0.842  | 0.051      | 0.201
SI-DDPM-FMO [50]   | 35.66       | 0.862  | 0.048      | 0.116
Swintormer [51]    | 35.68       | 0.821  | 0.049      | 0.013
Ours               | 36.24       | 0.873  | 0.045      | 0.109

Notes: Bold denotes the best value in each column; underline denotes the second-best; the green-shaded row highlights our method. ↑ means higher is better; ↓ means lower is better. PSNR is reported in dB; MAE in mm; SSIM and LPIPS are unitless.
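For reference, the sketch below shows how PSNR, SSIM, and a mean absolute error can be computed for a pair of enhanced and ground-truth pressure maps; LPIPS additionally requires a learned network (e.g., the lpips Python package) and is omitted. The array shapes and the [0, 1] data range are assumptions.

```python
# Minimal sketch of full-reference metrics of the kind reported in Table 2 for
# a single pair of pressure maps. Assumes float arrays in [0, 1]; LPIPS (a
# learned perceptual metric) would need the 'lpips' package and is omitted.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(pred, target, data_range=1.0):
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((64, 64)).astype(np.float32)
    pred = np.clip(target + 0.05 * rng.standard_normal((64, 64)).astype(np.float32), 0, 1)
    print(f"PSNR: {psnr(pred, target):.2f} dB")
    print(f"SSIM: {ssim(target, pred, data_range=1.0):.3f}")
    print(f"MAE:  {mae(pred, target):.4f}")
```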
Table 3. Model complexity comparison. Multiply-accumulate operations (MACs) are estimated for a 256 × 256 input. Our method outperforms existing baselines, achieving state-of-the-art quality while remaining computationally efficient.

Method             | Param (M) | MACs (G) | PSNR (dB) ↑
Stripformer [47]   | 36.13     | 18.7     | 30.34
Hi Diff [48]       | 85.17     | 130.35   | 34.71
DID [49]           | 128.2     | 36.52    | 35.5
SI-DDPM-FMO [50]   | 131.53    | 15.43    | 35.68
Swintormer [51]    | 154.89    | 8.02     | 35.66
Ours               | 135.4     | 17.05    | 36.24

Notes: Bold denotes the best value in each column; underline denotes the second-best; the green-shaded row highlights our method.
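The complexity figures in Table 3 combine a direct parameter count with MAC estimates for a 256 × 256 input, which are typically obtained with a profiling tool. The sketch below illustrates this; the toy model and the use of the thop package are assumptions made only for illustration.

```python
# Illustrative sketch of how complexity figures like those in Table 3 can be
# estimated: parameter count directly from the model, MACs via a profiler for
# a 256x256 input. The toy model and the 'thop' package are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(                   # stand-in model, not the proposed network
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params: {params_m:.2f} M")

try:
    from thop import profile             # pip install thop
    macs, _ = profile(model, inputs=(torch.randn(1, 1, 256, 256),))
    print(f"MACs: {macs / 1e9:.2f} G")
except ImportError:
    print("Install 'thop' (or a similar profiler) to estimate MACs.")
```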
Table 4. Pose prediction evaluation.

Method                | MPJPE (mm) | AKLEB (mm)
                      |            | X      | Y      | Z
Weibing et al. [34]   | 78.25      | 92.71  | 83.52  | 81.23
Luo et al. [16]       | 61.82      | 81.52  | 68.38  | 61.79
Wenqiang et al. [52]  | 65.21      | 74.62  | 65.15  | 59.65
Ours (Stage 1 + 2)    | 48.41      | 73.75  | 63.8   | 56.93

Notes: Bold denotes the best value in each column; underline denotes the second-best; the green-shaded row highlights our method.
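MPJPE in Table 4 is the mean Euclidean distance between predicted and ground-truth joint positions, and per-axis errors can be reported alongside it. A minimal NumPy sketch is given below; the array shapes and units are assumptions, and the exact definition of AKLEB is given in the main text rather than reproduced here.

```python
# Minimal sketch of pose metrics of the kind reported in Table 4. MPJPE is the
# mean Euclidean distance over joints; per-axis mean absolute errors are also
# shown. Shapes (N frames, J joints, xyz in mm) are assumptions; the precise
# AKLEB definition is given in the main text and is not reproduced here.
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (N, J, 3) joint positions in mm."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def per_axis_mae(pred, gt):
    """Mean absolute error along x, y, z, each in mm."""
    return np.mean(np.abs(pred - gt), axis=(0, 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(100, 17, 3)) * 100.0        # synthetic ground truth
    pred = gt + rng.normal(size=gt.shape) * 50.0      # synthetic predictions
    print(f"MPJPE: {mpjpe(pred, gt):.2f} mm")
    print("Per-axis MAE (x, y, z):", np.round(per_axis_mae(pred, gt), 2))
```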
Table 5. Evaluation of the novel elements of the architecture.

Novel Elements of Architecture | MPJPE
PSFD                           | 16.7904
PSFD + RMFD + CA               | 6.7812
PSFD + RMFD + CA + PP          | 4.4382

Notes: The green-shaded row highlights the best-performing combination of elements in the proposed architecture.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
