1. Introduction
With the rapid advancement of logistics automation and smart transportation, unattended truck scale systems have been widely deployed in logistics, manufacturing, and related industries, significantly enhancing weighing efficiency and intelligent management [1]. However, the lack of effective supervision mechanisms has led to frequent incidents of weighing fraud, undermining system reliability. Traditional video playback-based detection methods suffer from time lags and omission risks, limiting their effectiveness in real-time fraud prevention. Therefore, there is an urgent need for intelligent recognition technologies to enable timely and accurate anti-fraud monitoring. Human pose estimation, a key technique for identifying human intervention, offers a promising solution by leveraging advances in artificial intelligence and machine learning to enhance the intelligent supervision capabilities of unattended truck scale systems.
Conventional anti-cheating measures primarily involve additional hardware installations and enhanced manual monitoring, such as infrared detectors, sensor-encrypted communications, and RFID technologies to counteract common fraudulent behaviors [2,3,4]. With the evolution of deep learning, data-driven intelligent prevention and control systems based on image recognition have emerged as a promising research direction, substantially improving the monitoring of abnormal human activities or intrusions into truck scale areas [5,6,7]. To further enhance detection precision, advanced methods based on human contour variations, 3D skeleton modeling, biomechanical features, and graph-structured behavior analysis have been proposed [8,9,10,11,12,13]. Meanwhile, background segmentation and foreground comparison techniques, through methods like background subtraction, edge detection, and connected component analysis, have effectively improved the accuracy of non-vehicle object detection [14,15,16,17].
Human pose estimation, a pivotal research area in computer vision, has been extensively applied in human–computer interaction, sports analytics, virtual reality, and rehabilitation [18,19,20]. While traditional visual sensor-based approaches have achieved initial success, challenges such as illumination changes, occlusions, and complex motion recognition persist [21]. Advances in deep learning and sensing technologies have introduced depth image-based joint detection methods, significantly improving accuracy by integrating implicit and explicit localization strategies [22]. In dynamic environments, pose estimation frameworks that incorporate body part detection and multi-dimensional feature extraction have improved the robustness of action recognition [19,20]. Additionally, non-invasive posture detection based on fiber Bragg grating (FBG) sensors has shown promising applications in rehabilitation and virtual interaction [23], and open-source systems like OpenPose have demonstrated strong resilience under occlusions and non-frontal viewing conditions [21]. Moreover, gait recognition methods utilizing spatiotemporal feature modulation have achieved high accuracy on large-scale datasets [24], while skeleton extraction and dynamic modeling have supported specific applications such as basketball training monitoring [25]. Nevertheless, achieving high-precision and efficient flexible joint detection in complex multi-person scenes remains a significant challenge, driving research toward attention mechanisms, hard example mining, lightweight architectures, and hybrid inference strategies [26,27,28,29,30,31,32].
In summary, despite substantial progress in human pose estimation, challenges remain in achieving accurate detection under complex environments, especially when dealing with small-scale targets and the need for lightweight, real-time deployment in practical applications. These limitations hinder the development of intelligent systems that support automation and energy-efficient operation in industrial contexts. Motivated by the demand for sustainable and intelligent monitoring solutions, especially in unattended truck weighing systems, this study proposes a human joint point detection framework designed to enhance operational efficiency and reduce manual intervention. The main contributions of this work are as follows:
- A channel hourglass convolution module is proposed to symmetrically compress and recover channel dimensions, improving feature extraction while reducing model complexity, striking a balance between accuracy and efficiency for edge deployment.
- A hybrid attention mechanism integrating spatial and channel attention with depthwise separable convolution is designed, significantly enhancing robustness and representational capacity in complex environments.
- A multi-target pose estimation system for truck scale anti-cheating scenarios is constructed, combining cascade detection and lightweight strategies to achieve real-time performance and strong generalization, demonstrating broad potential for applications in intelligent transportation and industrial supervision.
2. Background
BlazePose is a lightweight convolutional neural network architecture developed by Google, specifically designed for real-time pose estimation on mobile terminals. It primarily targets single-person pose estimation tasks and can achieve a stable inference speed exceeding 30 frames per second on mobile devices such as smartphones. The model efficiently detects 33 keypoint coordinates of the human body and demonstrates a certain degree of robustness against occlusion. In terms of performance, the BlazePose Heavy model achieves a PCK@0.2 of 84.1% on the AR dataset and 77.6% on the Yoga dataset, where PCK@0.2 refers to the Percentage of Correct Keypoints within 20% of the torso length. These results reflect its strong capability in human pose estimation under various conditions.
The BlazePose network comprises 24 convolutional modules that combine depthwise separable convolutions, standard convolutions, max pooling, and upsampling operations. Residual connections are employed to enhance feature propagation and network representational capacity. The network takes RGB images as input and produces two output branches: the heatmap branch, which generates probability heatmaps for human joints, and the coordinate regression branch, which directly regresses the precise joint coordinates. The overall architecture is illustrated in Figure 1a, where different convolutional modules are distinguished by color, and the internal structure of each module is detailed in Figure 1b.
Early approaches to human keypoint detection often relied on direct coordinate regression, which typically adds a fully connected layer before the output layer to directly predict joint coordinates in an end-to-end fashion. This method offers fast training and inference speeds; however, it tends to overfit to specific datasets and lacks strong spatial generalization capabilities. The heatmap regression approach was subsequently introduced. In this method, the location of each joint in the training image is represented by a 2D Gaussian probability distribution centered at the joint position, which can be formally expressed as follows:
$$G(x, y) = \exp\left(-\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2}\right) \quad (1)$$
where $(x, y)$ denotes the pixel coordinates of the feature map, $(x_k, y_k)$ represents the pixel coordinates of the annotated joint location, and $\sigma$ is a hyperparameter that controls the width of the Gaussian distribution, i.e., how quickly the confidence decays away from the joint point.
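For illustration, a minimal NumPy sketch of Equation (1) is given below, assuming a 64 × 64 heatmap resolution and σ = 2 (both values chosen here for demonstration, not taken from the BlazePose configuration):

```python
import numpy as np

def gaussian_heatmap(xk: float, yk: float, size: int = 64, sigma: float = 2.0) -> np.ndarray:
    """Render the 2D Gaussian target heatmap of Equation (1) for one annotated joint."""
    xs = np.arange(size)
    x, y = np.meshgrid(xs, xs)  # pixel coordinate grid of the feature map
    return np.exp(-((x - xk) ** 2 + (y - yk) ** 2) / (2.0 * sigma ** 2))

# Target heatmap for a joint annotated at pixel (20, 31); confidence peaks at 1.0 there.
target = gaussian_heatmap(20, 31)
print(target.shape, target.max())  # (64, 64) 1.0
```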
Compared with direct coordinate regression, heatmap regression provides more spatially guided supervision, enabling the network to converge toward target joint positions more effectively. It helps to better optimize the network weights, thereby improving the inference accuracy. Moreover, heatmap regression is compatible with fully convolutional network designs, which significantly reduces the number of training parameters and improves training efficiency.
From an architectural perspective, the BlazePose network integrates both heatmap regression and coordinate regression approaches. During training, the convolutional modules associated with the coordinate regression branch are initially frozen. The annotated joint coordinates from the dataset are used to generate Gaussian heatmaps, which serve as the target outputs. The loss between the predicted heatmap and the ground-truth heatmap is then computed, and the network parameters are updated through backpropagation. Once the gradient descent of the heatmap branch stabilizes, that is, once the loss gradient approaches zero, the training of this branch is considered complete.
Next, the network freezes the parameters of the heatmap branch, keeps the weights of the backbone network unchanged, and begins training the coordinate regression branch. As illustrated in Figure 1a, the four dotted lines indicate that gradients from the coordinate regression branch do not propagate back to the backbone network. This gradient-blocking mechanism allows each branch to focus on its respective subtask during training. Such a separation not only helps to improve the quality of heatmap predictions but also enhances the accuracy of the final coordinate regression. The final output of the coordinate regression branch is a 51 × 1 feature vector, representing the x and y pixel coordinates of 17 key points along with their corresponding confidence scores p.
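The two-stage schedule and the gradient blocking can be expressed compactly in Keras; the sketch below is a toy stand-in, with `backbone`, `heatmap_branch`, and `coord_branch` as illustrative placeholder layers rather than the actual BlazePose modules, and `tf.stop_gradient` playing the role of the dotted connections in Figure 1a:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input((256, 256, 3))
feats = layers.Conv2D(32, 3, padding="same", activation="relu", name="backbone")(inp)

# Heatmap branch: fully convolutional, trained in stage 1.
heatmaps = layers.Conv2D(17, 1, activation="sigmoid", name="heatmap_branch")(feats)

# Coordinate branch: gradients are blocked from reaching the backbone.
blocked = layers.Lambda(tf.stop_gradient)(feats)
coords = layers.Dense(51, name="coord_branch")(layers.GlobalAveragePooling2D()(blocked))

model = Model(inp, [heatmaps, coords])

# Stage 1: freeze the coordinate branch, train the heatmap branch until its loss stabilizes.
model.get_layer("coord_branch").trainable = False
# ... compile with the heatmap loss and fit ...

# Stage 2: freeze the heatmap branch and backbone, then train coordinate regression.
model.get_layer("coord_branch").trainable = True
model.get_layer("heatmap_branch").trainable = False
model.get_layer("backbone").trainable = False
# ... recompile with the coordinate loss and fit ...
```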
During the network training process, the loss function comprises two components, each corresponding to the optimization of the heatmap branch and the coordinate regression branch, respectively. For the heatmap branch, the binary cross-entropy (BCE) loss function is employed to quantify the discrepancy between the predicted heatmap and the true heatmap, and its expression is as follows:
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[h_i \log \hat{h}_i + (1 - h_i)\log\left(1 - \hat{h}_i\right)\right] \quad (2)$$
where $h_i$ represents the true value of the Gaussian heatmap generated according to the annotation, $\hat{h}_i$ represents the predicted value during the training process, and $N$ represents the number of feature pixels.
For the coordinate regression branch, the Huber loss function is adopted to measure the prediction error of joint coordinates, and is defined as follows:
$$L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2, & |y - \hat{y}| \le \delta \\ \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right), & \text{otherwise} \end{cases} \quad (3)$$
where $y$ represents the annotated ground-truth coordinate, $\hat{y}$ represents the predicted coordinate during training, and $\delta$ is a hyperparameter representing the error threshold.
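Both branch losses map onto standard Keras primitives; a minimal sketch, with δ = 1.0 assumed purely for demonstration:

```python
import tensorflow as tf

heatmap_loss = tf.keras.losses.BinaryCrossentropy()  # Equation (2), averaged over feature pixels
coord_loss = tf.keras.losses.Huber(delta=1.0)        # Equation (3); delta is the error threshold

# Toy check on a few heatmap pixels and one (x, y) coordinate pair.
h_true = tf.constant([[0.0, 1.0, 0.2]]); h_pred = tf.constant([[0.1, 0.9, 0.3]])
c_true = tf.constant([[12.0, 40.0]]);    c_pred = tf.constant([[11.0, 43.5]])
print(float(heatmap_loss(h_true, h_pred)), float(coord_loss(c_true, c_pred)))
```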
The variation in feature map dimensions during forward propagation in the BlazePose network is illustrated in Figure 2. The heatmap branch adopts a stacked hourglass convolutional architecture, which progressively refines low-level visual cues into high-level semantic features as the network deepens. Notably, salient joint-related features may emerge in intermediate layers rather than solely in the final output layer. Therefore, relying solely on the final layer risks omitting critical information necessary for accurate keypoint detection.
To address this issue, the stacked hourglass structure enhances feature extraction for human keypoint detection. Specifically, it performs downsampling via convolutional layers and upsampling through nearest-neighbor interpolation during forward propagation. During upsampling, features from the corresponding resolution levels in the downsampling path are fused, enabling effective multi-scale feature integration. Compared with a traditional serial network, the hourglass structure improves the recognition accuracy of individual joints by reusing whole-body joint information. This strategy enables the network to extract joint features at different levels more comprehensively, thereby significantly improving detection accuracy.
From a structural perspective, the keypoint inference network adopts a lightweight design inspired by MobileNetV1, extensively employing depthwise separable convolutions to replace conventional standard convolutions. This operation decomposes standard convolution into two sequential steps: (1) depthwise convolution, which applies a spatial filter independently to each input channel, preserving the channel count; (2) pointwise convolution, which uses a 1 × 1 kernel to linearly combine features across channels for inter-channel information fusion, as illustrated in Figure 3.
This strategy markedly enhances computational efficiency while preserving model performance. For instance, in the conv2-1 module of Figure 1b, depthwise separable convolution reduces the parameter count by approximately 84.72% and the computational cost by 85.12%, compared to standard convolution under the same input–output feature dimensions.
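The saving is straightforward to reproduce. Under an assumed 24-in/24-out 3 × 3 configuration with biases disabled (the actual conv2-1 dimensions are not restated here), the parameter ratio works out to the quoted 84.72%:

```python
import tensorflow as tf
from tensorflow.keras import layers

cin, cout, k = 24, 24, 3  # assumed channel/kernel configuration

standard = tf.keras.Sequential([
    tf.keras.Input((64, 64, cin)),
    layers.Conv2D(cout, k, padding="same", use_bias=False),
])
separable = tf.keras.Sequential([
    tf.keras.Input((64, 64, cin)),
    layers.DepthwiseConv2D(k, padding="same", use_bias=False),  # step 1: per-channel spatial filter
    layers.Conv2D(cout, 1, use_bias=False),                     # step 2: 1x1 pointwise channel fusion
])

p_std, p_sep = standard.count_params(), separable.count_params()
print(p_std, p_sep, f"{100 * (1 - p_sep / p_std):.2f}%")  # 5184 792 84.72%
```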
In addition, BlazePose extensively incorporates residual structures, such as the Block4 and Block8 modules shown in Figure 1b. By introducing a residual learning mechanism, these structures effectively mitigate the degradation issues commonly encountered in deep neural networks. A typical residual block consists of multiple convolutional layers, batch normalization, and activation functions, allowing the input to bypass intermediate layers via shortcut connections. This design helps alleviate gradient vanishing or exploding, thereby enhancing training stability in deep architectures.
The traditional BlazePose algorithm is evaluated in terms of both detection speed and pose estimation accuracy. On the server platform used in this study, the average single-person video detection frame rate reaches 20.60 fps. For accuracy evaluation, common metrics include the PCK and the mean absolute error (MAE).
The PCK indicator mainly measures the accuracy of the model in predicting joint points within different tolerance ranges. The higher the value, the more accurate the joint point positioning and the better the overall performance. Its definition is as follows:
$$\mathrm{PCK} = \frac{1}{N + \varepsilon}\sum_{i=1}^{N} \mathbb{1}\left(\left\|p_i - g_i\right\|_2 < T\right) \quad (4)$$
where $p_i$ represents the predicted pixel coordinates, $g_i$ denotes the annotated pixel coordinates, $N$ is the number of recognized joint points, and $\varepsilon$ is a small constant to prevent division by zero. $T$ is the pixel coordinate error threshold, defined as 0.1 times the pixel distance between the left and right shoulders. The indicator function $\mathbb{1}(\cdot)$ equals 1 if the error is less than $T$, and 0 otherwise.
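A direct NumPy transcription of Equation (4), with predictions and annotations taken as (N, 2) pixel-coordinate arrays:

```python
import numpy as np

def pck(pred, gt, l_shoulder, r_shoulder, ratio=0.1, eps=1e-6):
    """Equation (4): fraction of joints whose pixel error falls below the threshold T."""
    t = ratio * np.linalg.norm(l_shoulder - r_shoulder)  # T = 0.1 x shoulder pixel distance
    errors = np.linalg.norm(pred - gt, axis=1)           # per-joint Euclidean pixel error
    return float(np.sum(errors < t) / (len(errors) + eps))  # eps guards against N = 0

# Toy example: 3 joints, shoulders 100 px apart (T = 10 px) -> 2 of 3 joints within tolerance.
pred = np.array([[50.0, 52.0], [80.0, 80.0], [120.0, 135.0]])
gt = np.array([[50.0, 50.0], [85.0, 88.0], [120.0, 120.0]])
print(pck(pred, gt, np.array([40.0, 60.0]), np.array([140.0, 60.0])))  # ~0.667
```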
Another commonly used evaluation metric is the MAE, which quantifies the average absolute difference between the predicted and ground-truth joint coordinates. A lower MAE indicates that the predicted values are closer to the actual positions, reflecting higher localization accuracy and reduced prediction variance.
3. Method
3.1. Improvement of Top-Down Multi-Person Joint Detection Method
Human joint detection was initially developed for single-person scenarios, with models such as BlazePose specifically designed for such tasks. However, real-world applications often involve complex scenes containing multiple human subjects simultaneously. Experimental results show that when multiple individuals appear in the frame, BlazePose tends to erroneously assign keypoints across different persons, leading to a substantial increase in detection errors, as illustrated in Figure 4.
This degradation in performance is primarily attributed to the limited adaptability of traditional single-person detection algorithms in multi-person environments. Occlusions, interactions, and diverse human poses significantly increase the complexity of both joint detection and model design. As a result, the original single-person detection strategy suffers from poor generalization, often leading to false positives and missed detections. In contrast, single-person detection benefits from clearer task boundaries and a more stable target structure. When only one human subject is present, the number and types of keypoints are fixed, and their spatial relationships are predictable, enabling high-precision detection with relatively simple network architectures. In this study, the joint detection algorithm is designed to recognize 16 types of human keypoints, as shown in Figure 5.
To facilitate accurate joint detection in multi-person settings, a top-down pose estimation strategy is adopted. This approach first employs object detection algorithms to locate all human targets in an image, followed by the application of a single-person joint detection model to each cropped region individually.
This method effectively isolates and processes the pose features of each person, thereby enhancing both detection accuracy and robustness, particularly in dense or occluded scenes. Specifically, YOLOv5 serves as the underlying object detection framework, extracting bounding boxes of human regions. Each local image region is then input into the BlazePose network for individual pose estimation, ultimately achieving accurate multi-person joint detection, as depicted in Figure 6.
Deploying the proposed multi-person joint detection algorithm on a server revealed that, with five individuals present in a scene, the system achieved an average processing speed of 16.7 fps when only object detection was performed. However, the inclusion of the joint detection module reduced the frame rate significantly to 3.5 fps. This indicates that joint detection becomes the primary performance bottleneck as the number of detected individuals increases.
To mitigate this issue and improve computational efficiency, a discriminative detection strategy was designed based on the behavioral patterns relevant to cheating behavior in truck scale monitoring. In this strategy, joint detection is selectively applied only to individuals whose bounding boxes intersect with predefined high-risk zones. This approach effectively eliminates redundant pose analysis for individuals far from the truck scale or unrelated to cheating behavior, thereby reducing computational overhead. As a result, system speed is significantly improved while maintaining real-time detection capability.
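A minimal sketch of this gating logic is shown below, assuming axis-aligned (x1, y1, x2, y2) boxes for both detections and the predefined high-risk zones; only overlapping persons are forwarded to the joint detector:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def intersects(box: Box, zone: Box) -> bool:
    """True if a person's bounding box overlaps a high-risk zone."""
    return not (box[2] < zone[0] or box[0] > zone[2] or
                box[3] < zone[1] or box[1] > zone[3])

def select_for_pose(boxes: List[Box], zones: List[Box]) -> List[Box]:
    """Apply joint detection only to persons near the truck scale."""
    return [b for b in boxes if any(intersects(b, z) for z in zones)]

# Example: one person on the scale platform (kept), one far away (skipped).
zones = [(300.0, 200.0, 700.0, 600.0)]
persons = [(320.0, 250.0, 420.0, 550.0), (50.0, 60.0, 120.0, 260.0)]
print(select_for_pose(persons, zones))  # [(320.0, 250.0, 420.0, 550.0)]
```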
Further analysis revealed that in the five-person scenario, the computation time for joint detection was approximately 3.77 times greater than that for target detection, occupying the majority of computing resources and contributing significantly to system delays. Therefore, optimizing the joint detection process through lightweight model design becomes a critical strategy for enhancing overall system responsiveness.
3.2. Improvements in Lightweighting Methods
The objective of lightweight network design is to reduce model parameters and computational complexity while maintaining detection accuracy. Common metrics for evaluating lightweight models include floating-point operations (FLOPs) and parameter count (Params), both of which reflect the computational resources required during inference.
In the original BlazePose network, the heatmap inference component involves 885,559 parameters and 1,754,264 FLOPs. In contrast, during the pixel coordinate regression process, the model involves 3,286,992 parameters and 6,460,266 FLOPs, resulting in significantly higher computational demands. Layer-wise analysis reveals that the convolution modules conv15, conv6, and conv14a are the major contributors to this complexity, as shown in Table 1. Therefore, the lightweight redesign in this study focuses on compressing and optimizing these modules while preserving critical feature extraction capabilities.
To address the structural redundancy identified in the three convolution modules discussed above, this paper proposes four lightweight optimization strategies. These strategies aim to effectively reduce network complexity and computational overhead while preserving model accuracy.
- 1.
Reducing the number of residual block cascades n in the Block8 module.
The Block8 module is composed of two parts: the first compresses the spatial dimensions of the feature maps through a combination of depthwise separable convolution and max-pooling, while the second enhances feature representation by stacking multiple depthwise separable residual units. Taking the conv15 module as an example, it contains n = 8 residual units, which contribute approximately 87.5% of the module’s overall parameters and computational cost. Therefore, reducing n moderately can significantly alleviate computational burden without sacrificing performance.
- 2.
Improving pointwise convolutions in the Block8 module to grouped convolutions.
Although depthwise separable convolution significantly reduces computation, its 1 × 1 pointwise convolution component still introduces substantial parameter overhead, especially when the number of channels is large. To address this, two grouped convolution strategies are proposed:
Grouped Full Convolution: The input feature channels are divided into g groups, each of which is convolved independently. A subsequent channel shuffle operation restores inter-group information exchange. This structure, illustrated in Figure 7a, reduces the number of pointwise convolution parameters by approximately a factor of g.
Grouped Half Convolution: Inspired by ShuffleNet V2, this method splits the input channels into two parts. One part undergoes depthwise separable convolution, while the other bypasses computation and is concatenated with the first along the channel dimension. A final channel shuffle step improves inter-channel feature interaction. This design, shown in Figure 7b, further reduces parameter count while retaining expressive capability.
- 3.
Improvement of the GhostNet-based residual connection structure in the Block8 module.
To reduce the computational overhead of pointwise convolution on small spatial but high-channel features in residual connections, GhostNet decomposes the convolution into two parts: one uses standard convolutional kernels to generate primary feature maps, while the other employs simple linear operations (e.g., depthwise convolutions) to produce additional “ghost” feature maps [33]. These two outputs are then combined to form a feature map matching the input dimensions. The structure of Ghost convolution is illustrated in Figure 8.
- 4.
Improvement of the channel hourglass residual connection structure in the Block8 module.
Drawing inspiration from the hourglass network architecture, this paper proposes a channel-wise hourglass-shaped convolutional structure that performs a symmetrical transformation along the channel dimension. Specifically, the number of channels in the input feature map is first gradually reduced through 1 × 1 pointwise convolutions within the depthwise separable convolution. Subsequently, a mirrored 1 × 1 convolution is employed to restore the number of channels to its original size, forming a compression–recovery hourglass pattern. To enhance the network’s representational capacity and mitigate gradient vanishing during training, the residual connection is preserved throughout this process. The connection is applied symmetrically between the input and output feature maps with matching channel dimensions, ensuring effective feature fusion. The detailed structural design is illustrated in Figure 9; a minimal code sketch of the grouped and hourglass variants is given after this list.
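As referenced above, the following Keras sketch illustrates strategies 2 and 4 on the (None, 8, 8, 288) feature maps discussed later; the group count g = 4 and compression ratio of 4 are assumptions for demonstration, not the tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_shuffle(x, groups: int):
    """Re-interleave channels after grouped convolution to restore cross-group mixing."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((h, w, groups, c // groups))(x)
    x = layers.Permute((1, 2, 4, 3))(x)
    return layers.Reshape((h, w, c))(x)

def grouped_pointwise(x, filters: int, groups: int = 4):
    """Strategy 2 (grouped full convolution): 1x1 convolution in g independent groups."""
    x = layers.Conv2D(filters, 1, groups=groups, use_bias=False)(x)
    return channel_shuffle(x, groups)

def channel_hourglass_block(x, ratio: int = 4):
    """Strategy 4: symmetrically compress and recover channels around a depthwise step."""
    c = x.shape[-1]
    y = layers.Conv2D(c // ratio, 1, use_bias=False)(x)  # 1x1 compression
    y = layers.DepthwiseConv2D(3, padding="same")(y)     # per-channel spatial filtering
    y = layers.Conv2D(c, 1, use_bias=False)(y)           # mirrored 1x1 recovery
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([x, y]))           # symmetric residual connection

inp = layers.Input((8, 8, 288))
out = channel_hourglass_block(grouped_pointwise(inp, 288))
print(tf.keras.Model(inp, out).count_params())
```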
3.3. Improved Recognition Accuracy in Complex Environments
Following the lightweight optimization of BlazePose, the enhanced model was deployed in truck scale monitoring scenarios for validation. Experimental testing revealed that under complex environmental conditions such as low lighting or when individuals occupy only a small pixel area, the person detection module was still able to accurately localize human targets. However, the joint point detection module frequently failed to correctly output keypoints under these circumstances, as shown in Figure 10.
To address this issue, this study integrates attention mechanisms into the lightweight architecture. These mechanisms selectively focus on informative regions or channels, enhancing the model’s ability to capture key features under challenging conditions. For the task of human joint point detection, incorporating attention modules mitigates the degradation or loss of critical features, enhancing both detection accuracy and reliability in real-world applications.
In this work, both channel and spatial attention mechanisms are designed and integrated into the lightweight improved network. The channel attention mechanism strengthens the representation of important feature channels, while spatial attention enhances the extraction of discriminative local features. The combined effect of these mechanisms is expected to significantly improve joint point detection performance without incurring substantial computational overhead.
For channel attention, ECANet is employed in place of traditional SENet. While SENet uses global average pooling and fully connected layers to generate channel weights, its dimensionality reduction can lead to information loss. ECANet overcomes this by applying a dynamically sized 1D convolution to capture local cross-channel interactions, thereby avoiding the need for fully connected layers. The structure is shown in Figure 11, and the kernel size k is determined by Equation (5):
$$k = \left|\frac{\log_2 C}{\gamma} + \frac{b}{\gamma}\right|_{\mathrm{odd}} \quad (5)$$
where $\gamma$ = 2, $b$ = 1, $C$ is the number of channels, and $|\cdot|_{\mathrm{odd}}$ denotes rounding to the nearest odd integer.
ECANet utilizes a one-dimensional convolutional kernel that traverses the channel dimension to capture local inter-channel dependencies, with dynamic weights generated via a Sigmoid activation. Unlike SENet, ECANet removes the fully connected layers, significantly reducing model parameters. This lightweight design enhances both computational efficiency and generalization.
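A compact sketch of the module follows; the rounding-to-odd step mirrors the standard ECANet formulation, and the rest is a 1D convolution across the channel descriptor followed by a Sigmoid gate:

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Equation (5): adaptive 1D kernel size, rounded to the nearest odd integer."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

def eca(x):
    """ECANet channel attention: no fully connected layers, only a 1D convolution."""
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)      # per-channel descriptor
    w = layers.Reshape((c, 1))(w)               # treat channels as a 1D sequence
    w = layers.Conv1D(1, eca_kernel_size(c), padding="same", use_bias=False)(w)
    w = layers.Activation("sigmoid")(w)
    w = layers.Reshape((1, 1, c))(w)
    return layers.Multiply()([x, w])            # reweight channels

print(eca_kernel_size(288))  # 5 for C = 288
```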
Beyond channel-wise attention, spatial features are refined using a spatial attention mechanism inspired by the CBAM architecture, as illustrated in Figure 12. The module applies global average pooling (GAP) and global max pooling (GMP) along the channel axis to extract spatial context. The fused outputs generate a position-sensitive attention map that assigns adaptive weights to spatial locations, enhancing focus on key regions while suppressing background noise. This improves feature quality and model efficiency.
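A matching sketch of this spatial attention module is given below; the 7 × 7 convolution kernel is an assumption borrowed from the original CBAM design:

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x, kernel: int = 7):
    """CBAM-style spatial attention: pool along the channel axis, learn a 2D weight map."""
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)  # GAP over channels
    mx = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)    # GMP over channels
    w = layers.Concatenate()([avg, mx])
    w = layers.Conv2D(1, kernel, padding="same", activation="sigmoid")(w)        # position weights
    return layers.Multiply()([x, w])  # emphasize key regions, suppress background
```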
To leverage the depthwise separable convolutions in the BlazePose network, a spatial attention module is introduced immediately after the depthwise convolution layer to enhance salient spatial features and suppress irrelevant information. Following the pointwise convolution, which fuses channel information, a channel attention module is applied to strengthen key channel representations. This combined strategy enables simultaneous enhancement of spatial and channel features. The architecture of this fused attention mechanism is shown in Figure 13.
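Putting the pieces together, the fused block interleaves the two attention modules with the two halves of the depthwise separable convolution. The sketch below reuses the `spatial_attention` and `eca` helpers defined in the preceding sketches and is meant only to show the ordering:

```python
from tensorflow.keras import layers

def attentive_separable_conv(x, filters: int):
    """Depthwise separable convolution fused with the hybrid attention mechanism."""
    x = layers.DepthwiseConv2D(3, padding="same")(x)  # per-channel spatial filtering
    x = spatial_attention(x)                          # highlight salient spatial features
    x = layers.Conv2D(filters, 1)(x)                  # pointwise channel fusion
    return eca(x)                                     # strengthen key channel representations
```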
Although attention mechanisms have shown strong performance in image recognition tasks, prior studies often integrate them extensively into classical architectures such as ResNet and VGG to boost overall accuracy. However, indiscriminate insertion of attention modules throughout deep convolutional networks can substantially increase model complexity and computational overhead. It may also amplify noise in the data, adversely affecting accuracy and training stability. Furthermore, the effectiveness of attention mechanisms is highly architecture-dependent, and no standardized integration strategy currently exists.
To address these issues, this study proposes a targeted attention integration strategy tailored to the multi-branch architecture used in human joint point detection. The backbone network extracts semantic features, while the heatmap and coordinate regression branches capture and refine global and local cues through multi-level feature fusion. Each branch plays a distinct role, contributing jointly to the enhancement of detection accuracy. To determine optimal insertion points for attention modules, a series of ablation studies are conducted to systematically evaluate the effects of different integration strategies on model performance.
4. Results and Discussion
4.1. The Experimental Results of Lightweight Improvement
In this study, the MPII dataset was employed for training and validation. To enable a preliminary comparison of various network structures, approximately 1/10 of the dataset (consisting of 2521 images and their corresponding JSON annotation files) was randomly selected and subsequently divided into training and validation sets in a 4:1 ratio. During the training process, a batch size of 16 was used, and the Adam optimizer was adopted. Both the heatmap and coordinate regression branches were initialized with a learning rate of 0.001, incorporating a dynamic adjustment strategy: if no reduction in the training and validation losses was observed over five consecutive epochs, the learning rate was reduced to 10% of its previous value. An early stopping mechanism was also applied, terminating training if validation loss variation remained below 0.0001 for 30 consecutive epochs to prevent overfitting.
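This schedule maps directly onto standard Keras callbacks; a sketch under the stated hyperparameters, assuming validation loss is the monitored quantity:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

callbacks = [
    # Reduce the learning rate to 10% of its value after 5 epochs without improvement.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
    # Stop if validation loss varies by less than 0.0001 for 30 consecutive epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0.0001, patience=30),
]
# model.fit(train_ds, validation_data=val_ds, batch_size=16, callbacks=callbacks, ...)
```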
Upon completion of heatmap branch training, the learned weights were employed to initialize the coordinate regression branch, which was subsequently trained under identical configurations. Since the goal of this experiment is to verify the effectiveness of network structure improvements rather than to achieve state-of-the-art performance, a reduced version of the training data was used, resulting in relatively higher MAE and slightly lower PCK values. Experimental results are presented in Table 2. The first row reports the baseline performance of the original BlazePose network, followed by rows showing the results of different improvement strategies. Values in parentheses indicate the percentage change relative to the baseline. Frame rate (FPS) metrics were measured in a single-person test scenario, with bold font indicating the best performance for each metric.
To reduce model parameters and computational complexity, a straightforward method is to decrease the number of residual blocks. However, this inevitably reduces the number of convolutional layers, thereby weakening feature extraction and leading to a marked decline in accuracy metrics.
To improve efficiency without sacrificing performance, this study introduces grouped convolution to optimize the pointwise convolution module within the depthwise separable convolution framework. While conventional pointwise convolution primarily facilitates information fusion across channels, grouped convolution partitions channels into separate groups for independent processing. Although this significantly improves computational efficiency, it limits inter-channel information exchange. Among the two improvement strategies evaluated, the grouped full convolution approach enhances high-level semantic feature extraction but compromises shallow feature retention, resulting in diminished joint point regression accuracy. In contrast, the grouped half convolution strategy preserves half of the original channels during the pointwise convolution process, thereby maintaining critical shallow features such as edges and textures. This approach reduces outliers in joint regression and yields better MAE performance with higher inference speed. Comparative analysis shows that grouped half convolution offers a more favorable balance between accuracy and efficiency.
Although the GhostNet-based modification achieves notable improvements in inference speed, its performance on accuracy metrics is relatively limited. This may be attributed to the fact that GhostNet reduces the number of parameters by compressing channel dimensions through pointwise convolutions, which can, in certain cases, compromise the extraction of fine-grained features critical for keypoint detection. Small-scale, high-channel feature maps typically contain rich local information, and the redundant feature generation mechanism of GhostNet may fail to adequately preserve these essential details.
Beyond simple lightweighting strategies such as reducing residual blocks or introducing grouped convolutions, this paper proposes a novel channel hourglass convolution structure to further reduce computational cost while maintaining model expressiveness. By dynamically adjusting channel configuration in the pointwise convolution module, this structure compresses and restores feature channels without changing network depth. It significantly lowers parameter count and computation while improving both training efficiency and inference speed.
In addition, the proposed channel hourglass structure enhances multi-scale feature extraction and fusion, thereby improving the accuracy of feature representation. The integration of scalable skip connections facilitates effective transmission and fusion of cross-layer features, which helps mitigate issues such as gradient vanishing. Moreover, this design addresses the limitations of GhostNet in preserving fine-grained information, leading to improved model accuracy, training stability, and generalization performance. Experimental results demonstrate that the channel hourglass structure achieves an optimal balance between accuracy and efficiency among the evaluated lightweight strategies, making it a central innovation of this work.
In the BlazePose architecture, Block8-type modules are widely distributed across both the backbone network and the coordinate regression branch. In this work, we selectively applied lightweight modifications to the three Block8 modules with the highest parameter counts. Comparative experiments demonstrate that the proposed channel hourglass structure leads to consistent improvements in both accuracy and computational efficiency. To further investigate where this structure yields the most benefit and whether it should be universally applied to all Block8 modules, we conducted additional ablation studies. The results, presented in Table 3, reveal that integrating the channel hourglass structure into conv6, conv14-a, and conv15 independently improves both inference accuracy and speed, with the combined configuration outperforming any single deployment.
Notably, these three modules are located at the output stages of either the backbone or the coordinate regression branch, and their feature map dimensions are all identical: (None, 8, 8, 288). This observation suggests that the channel bottleneck design is particularly effective when operating on high-dimensional feature maps. By performing channel-wise compression and recovery, the structure significantly reduces parameter count and FLOPs while enhancing feature selection and fusion. This design facilitates the extraction of more abstract and discriminative representations, supports better generalization by mitigating overfitting risks, and improves gradient flow through a form of architectural regularization. As such, the channel bottleneck strategy serves as an effective means to enhance both the representational power and deployment efficiency of deep networks.
The improved network was trained on the complete MPII dataset, and the comparative performance results are presented in Figure 14. Experimental findings demonstrate that, relative to the original BlazePose network, the proposed channel hourglass convolution structure reduces MAE by 1.77% and improves PCK by 7.38%. In actual deployment scenarios, when five individuals are present within the video frame, the enhanced model maintains a real-time detection frame rate of 4.0 fps. Furthermore, experimental observations indicate that to sustain an average system frame rate of 5 fps, the number of detected individuals within the scene should be limited to four or fewer.
Building upon these results, this study adopts the channel hourglass convolution structure as the final lightweight optimization solution. By dynamically adjusting the number of pointwise convolution kernels, this design achieves an initial reduction followed by an expansion of feature channels, thereby substantially decreasing model parameters and computational complexity without altering the depth of the convolutional layers. This compression–expansion scheme enhances hierarchical feature extraction and multi-scale fusion. When combined with symmetric skip connections, it further strengthens feature representation and improves joint detection accuracy. Owing to its architectural simplicity and flexibility, the proposed structure also exhibits strong generalizability across different network architectures, underscoring its potential as a robust and scalable solution for lightweight model design.
4.2. The Experimental Results of Improved Attention Mechanism
To validate the superiority of the proposed fusion of depthwise separable convolution with a hybrid attention mechanism (integrating spatial and channel attention), a series of controlled experiments were conducted. Specifically, comparative groups were designed by reversing the fusion order of the depthwise separable convolution and attention modules, including configurations where the depthwise convolution was followed by a cascaded channel attention mechanism and where the pointwise convolution was followed by a cascaded spatial attention mechanism. Additionally, ablation experiments were conducted to individually evaluate the hybrid attention mechanism, the channel attention mechanism, and the spatial attention mechanism. Table 4 summarizes their performance across different feature scales, with the best results highlighted in bold.
As shown in Table 4, the fusion of depthwise separable convolution with the hybrid attention mechanism consistently outperforms alternative strategies across various feature scales. To further evaluate the effect of various integration strategies, ablation experiments were conducted, and the results are summarized in Table 5, where bolded values indicate performance gains attributed to embedding the hybrid attention mechanism at specific network layers.
The results in Table 5 demonstrate that the incorporation of the proposed hybrid attention mechanism introduces only a marginal and acceptable increase in model parameters and computational overhead. Meanwhile, embedding the hybrid attention mechanism into seven targeted modules of the network yields a notable improvement in overall recognition accuracy. The distribution of these seven modules within the network architecture is illustrated by the orange regions in Figure 2, primarily located at the initial stages of the channel hourglass convolution structure, within the downsampling pathways, and across corresponding bypass convolution layers. Notably, integrating the hybrid attention mechanism during the early downsampling stages of the heatmap branch (particularly at 128 × 128 and 32 × 32 feature scales) further enhances model performance. This finding suggests that early-stage attention helps the network better focus on salient features under complex backgrounds or noisy environments, thereby improving its ability to extract critical information.
However, as features propagate into deeper stages of the network, the fusion and refinement of information become increasingly sufficient. Consequently, introducing additional attention mechanisms at these stages does not yield further performance gains and may, in fact, disrupt the balanced feature representations by either attenuating or excessively amplifying specific features.
Moreover, since the training of the coordinate regression branch is based on the pretrained weights of the heatmap network, gradient backpropagation to the backbone network is disabled during its training. As a result, the backbone structure cannot be further optimized through the coordinate regression branch. Under these constraints, enhancing the feature extraction capabilities of the bypass convolution layers within the hourglass structure (specifically, conv-13b and conv-14b) in the coordinate regression branch proves effective for improving the final positioning performance of the model.
Experimental results further reveal that although integrating the hybrid attention mechanism across multiple modules generally leads to performance improvement, excessive deployment of attention mechanisms can occasionally cause performance degradation. As shown in Table 6, the addition of attention mechanisms to certain modules failed to produce performance gains and, in some cases, even resulted in slight declines. Based on these controlled experiments, the final model adopts a selective attention integration strategy, embedding the hybrid attention mechanism only into the conv8-b module. This targeted deployment achieves a more efficient and effective enhancement of recognition accuracy.
Building upon the lightweight model proposed in this study, a hybrid attention mechanism was further integrated into the conv8-b convolution module, and the model was trained on the complete MPII dataset until convergence. The comparative performance results, as shown in Figure 15, indicate that the improved model achieves notable gains over the original network, with MAE reduced by 2.12% and PCK increased by 8.00%.
Since the MPII dataset predominantly consists of uniform scenes, it does not adequately reflect the advantages of our proposed algorithm under challenging conditions such as low-light environments or small-scale targets. To address this limitation, we conducted supplementary experiments using a self-collected dataset specifically designed for real-world truck scale scenarios. The evaluation set includes a total of 510 images encompassing various complex conditions, including shadowed environments, nighttime low-light settings, multi-person scenes, and small-target instances.
The goal of the test is to detect all human keypoints present in the scenes. We compared the performance of the proposed improved algorithm with BlazePose, Lite-HRNet-18, and MobileNetV2 (all trained on the MPII dataset using their respective pretrained models). The comparison results are summarized in Table 7. Specifically, the following metrics are reported:
- Miss Detection: failure to detect a person present in the scene.
- Misalignment Detection: the person is detected, but keypoints are noticeably misplaced.
- Accurate Detection: all keypoints of all persons are correctly detected.
- Average Precision (AP): overall accuracy of keypoint recognition.
This real-world test demonstrates the robustness and practical applicability of our method in deployment scenarios beyond the controlled MPII dataset.
To further evaluate the practical effectiveness of the proposed improvements, both the lightweight-enhanced model and the model incorporating the hybrid attention mechanism were applied to inference scenarios, particularly under challenging conditions such as poor lighting and small target pixel areas. The comparative results are illustrated in Figure 16. Compared to the original model, the recognition performance is significantly enhanced, thereby verifying the robustness and practical applicability of the proposed approach.
4.3. The Comparison of Lightweight and Accurate Models
To better align with the experimental setups of existing methods, we relaxed the evaluation metric from the stricter PCK@0.1 to the more widely used PCK@0.5. We further conducted a per-joint accuracy analysis across seven key joint types and calculated the overall mean accuracy (mean). In addition to accuracy, we measured the real-time performance in terms of FPS using our system on a real-world single-person truck scale monitoring video.
Among the compared methods, the first six are categorized as large networks, which focus primarily on detection accuracy but involve a large number of parameters and suffer from lower inference speed in deployment scenarios. The latter four are small networks, designed with lightweight architectures to achieve higher inference speed at the cost of slightly reduced keypoint accuracy. These include recent state-of-the-art models such as MobileNetV2-based pose estimation networks and HRNet-Lite, which are widely used in mobile and embedded applications.
To comprehensively evaluate both accuracy and efficiency, we adopted a normalized composite scoring method. Specifically, the accuracy score was normalized using a 100% detection accuracy baseline, and the speed score was normalized using a baseline of 30 FPS (the frame rate of the original video). A weighting factor of α = 0.5 was used to balance the importance of accuracy and speed in the final score. The composite score for each model is listed in Table 8.
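Since the exact combination rule is not spelled out above, the following sketch shows one plausible reading of the composite score: a weighted sum of the two normalized terms with α = 0.5:

```python
def composite_score(accuracy: float, fps: float, alpha: float = 0.5,
                    acc_base: float = 100.0, fps_base: float = 30.0) -> float:
    """Normalized accuracy/speed score; alpha balances the two terms."""
    return alpha * (accuracy / acc_base) + (1 - alpha) * (min(fps, fps_base) / fps_base)

# Example: a model at 85.0 mean PCK@0.5 running at 20.6 fps scores ~0.768.
print(round(composite_score(85.0, 20.6), 3))
```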
As shown in the results, our proposed method achieves a competitive balance between accuracy and efficiency, ranking among the top methods in terms of overall composite score. This highlights the practical value and deployment potential of the proposed approach in real-time intelligent truck scale supervision scenarios.