Abstract
For garment manufacturing, an efficient and precise assessment of ergonomics is vital to prevent work-related musculoskeletal disorders. This study develops a computer vision-based algorithm for fast and accurate risk analysis. Specifically, we introduced SE and CBAM attention mechanisms into the YOLO network and combined the optimized detector with the HRNet architecture to improve the accuracy of human pose recognition. This approach effectively addresses common interferences in garment production environments, such as fabric accumulation, equipment occlusion, and complex hand movements, while significantly enhancing the accuracy of human detection. On the COCO dataset, it increased mAP and recall by 4.43% and 5.99%, respectively, over YOLOv8. Furthermore, by analyzing key postural features from worker videos of cutting, sewing, and pressing, we achieved a quantified ergonomic risk assessment. Experimental results indicate that the RULA scores calculated using this algorithm are highly consistent with expert evaluations, remain stable across samples, and accurately reflect the dynamic changes in ergonomic risk levels across different processes. It is important to note that the validation was based on a pilot study involving a limited number of workers and task types, meaning that the findings primarily demonstrate feasibility rather than full-scale generalizability. Even so, the algorithm outperforms existing lightweight solutions and can be deployed in real time on edge devices within factories, providing a low-cost ergonomic monitoring tool for the garment manufacturing industry. This helps prevent and reduce musculoskeletal injuries among workers.
1. Introduction
Fabric cutting, sewing assembly, and garment pressing are the basic production activities in the garment manufacturing business that put workers at risk of developing work-related musculoskeletal disorders (WMSDs). These problems result from the long-term impact of repeated actions, sustained uncomfortable postures, and prolonged labor in ergonomically unfriendly settings. Specifically, when cutting curves, cloth cutters frequently bend and twist their wrists and torsos [1]. Then, the sewing machine operator sews the cut fabric together. Sewing operators often sit at their workstations and lean forward to sew for long periods of time. Common ergonomic difficulties with sewing machines include forward head protrusion, elbows over shoulder height, and a curved back. Long-term use of these positions can result in neck, back, and other issues. Garment pressing operators adjust and press the cloth to improve garment fit, applying pressure to the wrists and torso. The risk of WMSDs increases with each year of service as discomfort increases. Therefore, given the significance of garment manufacturing in the industrial environment, garment worker ergonomic assessments are useful. Such approaches can enhance working conditions, reduce WMSDs, increase productivity and quality, and support industrial growth. However, most research on this population has been conducted through cross-sectional surveys [2], with a conspicuous absence of long-term cohort data and evaluations of intervention effectiveness.
Existing ergonomic assessment methods for WMSDs focus primarily on posture analysis. These approaches assess joint angles, duration, and repeat frequency of work postures to detect potential occupational risk factors. It normally consists of two parts: posture measurement instruments and risk assessment analysis. There are two types of posture measurement tools: nonvisual and visual. Nonvisual methods leverage wearable sensors to capture high-precision human motion data. For instance, Zhang et al. [3] used data from IMU sensors and recorded posture levels to find body positions that could cause muscle and bone disorders. They also measured how often and how long these postures were held to determine risk levels. Yu et al. [4] employed surface electromyography (sEMG) technology to simulate repetitive manual material handling tasks performed by workers. Their goal was to study how muscle tiredness changes when the body rotates to different angles, and how this connects to WMSDs. Weston et al. [5] tried a new method for recognizing postures. They used NIRS sensors to check oxygen levels in muscles. This helps show how much load a particular posture puts on muscles. However, all these methods need people to wear multiple devices for long periods to obtain full data [6]. It can limit natural movement and cause discomfort. During demanding tasks, it may lower work efficiency and raise privacy concerns [7]. Also, it is hard to always place sensors in the exact same way. This creates technical problems like timing errors and unstable signals [8]. Because of these issues, wearable sensors are mostly used in labs instead of real factories.
Visual-based methods, such as depth cameras, enable real-time detection and tracking of 3D skeletal models for dynamic posture monitoring. Li et al. [9] used Kinect v2 to collect skeleton data from construction workers. Their system recognizes postures instantly and checks physical load. It provides continuous warnings instead of needing human supervisors. Similarly, Zhou et al. [10] applied Kinect v2 to study human motions. Their approach uses machine learning to classify postures and determine risk levels. However, depth cameras have a limited working distance. For example, Kinect v2 works best within 4.5 m. This narrow range restricts their coverage in large work areas. They also use a lot of power and have relatively low image quality, making them less suitable for some industrial uses. Regular RGB cameras have a wider field of view. They can capture multiple workers across big spaces without interrupting work. Yang et al. [11] used a Bag-of-Features method to recognize movements from video. Ding et al. [12] built a CNN-LSTM deep learning model that detects risky behavior automatically. These studies indicate that computer vision can accurately estimate human body posture. This technology has great potential in assessing ergonomic risks in the workplace.
With the rapid development of deep-learning–based object detection, YOLO architectures have become widely used for real-time human monitoring in industrial environments. Recent work has demonstrated that incorporating attention mechanisms into YOLO can significantly enhance feature discrimination under occlusion, cluttered backgrounds, and small-scale target conditions. For example, Wang et al. integrated a channel–spatial attention mechanism into the C3 module of YOLOv5 and reported notable improvements in detecting small objects while maintaining real-time performance [13]. In parallel, the emergence of computer-vision–based ergonomic risk assessment (ERA) has shown that 2D human pose estimation combined with scoring systems such as RULA can achieve accuracy comparable to expert evaluations in real workplaces, as demonstrated by Agostinelli et al. [14]. Furthermore, a recent comprehensive review highlighted that pose-estimation-driven ERA has become a key direction in occupational health research but also pointed out that most existing systems are developed for general industrial or office settings and lack domain-specific adaptation for tasks involving fine hand–arm movements and frequent occlusions [15]. These findings underscore the need for a specialized, attention-enhanced, and ergonomics-oriented vision system tailored to the garment manufacturing environment.
However, despite the progress made by these existing methods, the complexity of the clothing manufacturing environment increases the challenge of human pose estimation. In sewing, the machine and piles of cloth often block the view of the lower body and hand joints. Because of this, the small, precise hand and arm movements used in sewing can be hard to detect correctly. Moreover, the fast-paced, assembly-line nature of garment production imposes stringent real-time requirements on posture assessment systems, which must rapidly and accurately identify ergonomically risky movements to enable timely intervention. But traditional ergonomic assessments or models with low frame rate are inadequate for the requirements of garment production floors. Therefore, it is important to design a task-specific integration scheme that can handle these garment-industry characteristics.
To address these limitations, the main contributions of this work are threefold:
(1) We design a scene-adaptive attention-enhanced detection framework (YOLO-SE-CBAM) specifically optimized for occlusion patterns, textile clutter, and upper-limb visibility challenges unique to garment-manufacturing environments.
(2) We introduce a wrist-oriented keypoint extension strategy for HRNet, enabling accurate ergonomic scoring in tasks—such as sewing—where wrist deviation is a dominant contributor to WMSDs.
(3) We propose a task-specific pose-to-RULA fusion pipeline, forming the first end-to-end system tailored for garment manufacturing that maps 2D pose estimation to ergonomic risk levels derived from real workstation behaviors.
These contributions highlight that although the individual deep-learning modules are well-known, the novelty of this study lies in the task-driven integration and ergonomic applicability in real garment-production scenarios, which has not been previously explored.
In light of these challenges, this paper proposes the combined YOLO-SE-CBAM-HRNet model. By enhancing YOLO’s feature extraction capabilities, the model optimizes occlusion detection in complex garment manufacturing scenarios, while simultaneously leveraging HRNet’s high-resolution advantages to enable precise human joint localization. This approach addresses the subjectivity and inefficiency of traditional RULA-based ergonomic assessment methods, which rely on manual observation and scoring. It can also adapt to video information input from ordinary cameras, providing a practical WMSDs risk-assessment solution for the garment industry. Unlike prior generic pipelines, this study provides the first integration specifically designed for the visual and ergonomic characteristics of garment-production tasks. The model implements an end-to-end workflow covering video input, pose estimation, angle calculation, RULA scoring, and risk warning, and its effectiveness has been verified in real-world scenarios.
2. Related Work
2.1. Real-Time Object Detection
Object detection is one of the main jobs in computer vision. Its function is to classify and locate objects in images. In recent years, deep learning-driven object detection technology has made breakthrough progress and is widely applied in real-time scenarios such as autonomous driving, robotic vision, and video surveillance [16]. Current object detection typically can be categorized into two-stage (e.g., R-CNN series [17,18,19]) approaches and one-stage (e.g., SSD [20], YOLO series [21,22,23]) approaches. Two-stage methods require region proposals and a secondary classification, resulting in high computational costs and long inference times, making it difficult to meet real-time requirements. In contrast, one-stage methods achieve efficient processing through end-to-end regression, becoming the core solution for industrial-grade real-time detection.
Among them, the YOLO series is a prominent mainstream solution. From the classic architectures of YOLOv1–v3 [21,22,23], through the incorporation of CSPNet, data augmentation, and an enhanced PAN in YOLOv4–v5 [24,25], to further innovations in YOLOv6–v11 [26,27,28,29,30,31] such as BiC/SimCSPSPPF, anchor-free detection, C2f/C3k2 modules, and spatial attention mechanisms, these advancements consistently strive to balance accuracy and speed. For instance, YOLOv8s achieves a 44.9% mAP on the COCO dataset with a single-frame inference time of merely 1.2 milliseconds on an A100 GPU.
To enhance feature extraction capabilities for object detection in complex scenarios (e.g., occlusion, small objects), attention mechanisms have become a mainstream optimization approach. For instance, SO-YOLOv8 [32] introduces a saliency attention module, which increases the mean Average Precision (mAP) by 1% and detection accuracy by 3% with only a 1.2% rise in floating-point operations (FLOPs). Jing et al. [33] adopted a dual-dimensional (channel and spatial) attention calibration method and integrated the CBAM module, elevating the overall Average Precision (AP) from 43.6% to 44.9% while enhancing model robustness across different Intersection-over-Union (IoU) thresholds. Attention techniques are proven effective. Yet, their adoption remains limited for real-time factory use.
Inspired by the above methods, and considering the problems of contextual noise interference and the masking of discriminative features in occlusion scenarios, this paper deeply integrates attention mechanisms with YOLOv8. Specifically, the SE module [34] uses a “squeeze-and-excitation” mechanism to adaptively calibrate the weights of channel features, enhancing the feature response in target regions such as human torsos and suppressing background noise. Meanwhile, the CBAM module [35], which calibrates channel and spatial attention simultaneously, further answers the questions of “which features are important” and “where the features are located” at the same time, enabling more accurate localization of human targets in occlusion scenarios.
2.2. Human Pose Estimation
Human pose estimation aims to locate human body parts and construct the human body (such as the skeletal structure) from input data like images and videos, and it has broad application prospects in fields such as ergonomic assessment [36]. Based on dimensionality, human pose estimation is primarily categorized into two types: 2D human pose estimation [37,38,39,40] and 3D human pose estimation [41,42]. Among these, 2D methods have become the mainstream choice for real-time applications due to their high computational efficiency and low hardware requirements.
Depending on the number of participants, 2D human pose estimation can be divided into single-person pose estimation and multi-person pose estimation. Single-person pose estimation includes regression-based methods and heatmap-based methods. Regression-based methods, such as DeepPose [37], use AlexNet as the base network and directly regress keypoint coordinates through fully connected layers. These methods have a simple workflow and low hardware requirements, making them suitable for scenarios with high real-time demands. However, their accuracy is limited, their generalization ability is poor, and they struggle to handle complex pose variations. Therefore, mainstream methods typically generate heatmaps and select peak positions as keypoints. For example, HRNet [40] estimates keypoints by maintaining high-resolution representations throughout the process. It starts with a high-resolution subnet and gradually adds subnets with resolutions from high to low, forming multiple stages, while connecting multiple resolution subnets in parallel. Through repeated multi-scale fusion, HRNet can generate rich high-resolution representations, resulting in more accurate keypoint heatmap predictions. Compared to regression-based methods, heatmap-based methods avoid the difficulty of directly regressing coordinate points and can naturally handle uncertainty and ambiguity in pose estimation, but they require higher hardware capabilities.
Multi-person pose estimation is more challenging than single-person scenarios, as it needs to address issues like human occlusion and overlap, primarily adopting top-down and bottom-up strategies. The top-down approach is a two-stage algorithm: the first stage uses object detection methods to identify human subjects and annotate bounding boxes around them; then, it performs skeletal joint detection on the annotated regions, with representative algorithms such as Mask R-CNN [19] and AlphaPose [43]. The bottom-up approach is an end-to-end algorithm: it first extracts all human skeletal joints and then determines their ownership to human subjects based on the associations between joints, with representative algorithms like OpenPose [38] and HigherHRNet [44]. The top-down way is simpler because it looks at each person on their own. It is not confused by having many people in a picture. That is why its results are more reliable. However, as the number of detected objects increases, the computational complexity of this method grows linearly, leading to reduced efficiency in crowded scenes. Additionally, the accuracy of human detection is crucial for the reliability of pose estimation results. The computational complexity of bottom-up approaches does not significantly increase with the number of people, making them more efficient for processing large-scale crowds. However, accurately aggregating joints becomes challenging in densely packed or severely occluded scenarios. Compared to top-down methods, their accuracy may slightly decrease. Even though bottom-up and transformer-based methods achieve high detection accuracy in crowded scenes, their large model sizes make them difficult to deploy on edge devices and perform poorly in detecting small objects.
Therefore, bottom-up methods are generally unsuitable for real-time application scenarios [45]. In contrast, while top-down methods are susceptible to detection accuracy and processing speed as the number of detected objects increases, they perform better in moderately sparse scenes in terms of both detection accuracy and speed. Moreover, by improving the object detector, it is often possible to enhance the model’s detection accuracy for small objects and significantly reduce parameter sizes, making this approach more edge-device friendly [46]. Therefore, we propose a top-down 2D human pose recognition algorithm to achieve fast and cost-effective musculoskeletal disease risk identification.
2.3. Ergonomics Assessment Methods
Multiple traditional measurement methods exist in the field of ergonomic risk assessment, such as the Ovako Working Posture Analysis System (OWAS), Rapid Upper Limb Assessment (RULA), Rapid Entire Body Assessment (REBA), and the NIOSH lifting formula. These methods record workers’ postural angles through video and other means for real-time or subsequent analysis. They have been preliminarily applied in ergonomic assessments within the garment industry and have revealed core risk characteristics [10], as shown in Table 1. Mahendran et al. [47] used RULA and REBA to find that 32% of workers in large garment units and 43.9% in small units reported WMSDs, with most workers exhibiting moderate risk for developing MSDs. Su et al. [48] employed REBA to assess MSD risk levels in sewing and cutting processes within the garment industry, identifying high-risk positions in the wrist, forearm, upper arm, trunk, and neck.
Table 1.
Traditional Ergonomics Evaluation Methods.
Traditional ergonomic assessments have several advantages, like being cheap and easy to learn. But they also have issues. First, the process takes too long and needs lots of data work. This can lead to mistakes and slow down evaluations. Second, the assessor’s personal opinion can affect the results, making them less accurate and reliable.
To solve these problems, researchers are now combining these traditional methods with computer vision. This new approach helps fix the issues of manual checks. It also makes assessments more automatic and fairer. For instance, Su et al. used computer vision to estimate and detect body movement angles. They then generated a risk score dataset for sewing machine operators based on REBA criteria, providing a foundation for automated assessment [48].
However, this integrated approach depends heavily on the accuracy of computer vision in human posture recognition. If posture recognition accuracy is insufficient, it may lead to biased assessment results. To tackle this issue, this paper enhances the accuracy of human pose recognition in complex work environments by integrating an attention module, thereby facilitating effective ergonomic risk assessment.
3. Methods
The core of our methods lies in improving the computer vision-based human pose estimation algorithm to enhance the accuracy of human pose estimation in complex scenarios, thereby enabling a preliminary ergonomic risk assessment. This section provides a detailed exposition of the proposed methodology for ergonomic risk assessment in sewing workshops, encompassing an integrated framework that combines object detection, human pose estimation, and ergonomic scoring. The pipeline comprises three core modules: the YOLO-SE-CBAM model for human detection, the HRNet model for keypoint localization, and the RULA method for ergonomic scoring. Each component has undergone rigorous validation using performance metrics. The following sections explain each part in detail.
3.1. YOLO-SE-CBAM
In clothing workshops, busy backgrounds often cause problems. Stacked textiles and equipment can hide workers or be mistaken for people. This leads to missed detections or false alarms. The standard YOLOv8 model works fast but does not pay special attention to human shapes. It may miss people in crowded sewing areas.
In our work, we added SE and CBAM modules to YOLOv8. This created a new model called YOLO-SE-CBAM. The improved system now finds people more accurately in messy environments. It also handles background distractions much better.
3.1.1. YOLOv8 Architecture
Figure 1 illustrates the YOLOv8 model. Its structure has three main parts: the Backbone, the Neck, and the Detection Head. The Backbone uses multiple layers with batch normalization and SiLU activation. This helps gradient flow better than the older ReLU function. A key improvement in YOLOv8 is the C2f module, which replaces the C3 module from YOLOv5. The C2f has more branches and skip connections. This gives richer feature details and better gradient flow, without needing much more computing power. It works especially well for spotting small objects. At the end of the Backbone, the SPPF module processes feature maps from different scales. It combines multi-level spatial information, which helps the model detect objects of various sizes. The Neck uses a PAN-FPN design. It mixes features from different Backbone layers. It samples features both top-down and bottom-up. Finally, it provides the Head with three separate feature maps made for objects of different sizes. The Head works without anchor boxes. It splits the job into two separate tasks: classifying objects and drawing bounding boxes. This separation reduces interference between the tasks and helps the model detect more accurately [49]. We chose the YOLOv8s model because it is lightweight. The model receives images sized at 640 by 640 pixels. It only keeps detections identified as “person”. This provides clean, focused areas for the next step of pose estimation. It also helps reduce noise from unnecessary boxes when people are partly hidden.
Figure 1.
YOLOv8 Architecture.
The detection head outputs the body bounding boxes, called p_boxes. Each box follows the format [x1, y1, x2, y2], where (x1, y1) marks the box’s top-left corner and (x2, y2) marks its bottom-right corner. This lets us separate and standardize the image regions containing people, ensuring that the next stage, which uses HRNet to find body keypoints, works on a stable and consistent input.
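To make the hand-off between detection and pose estimation concrete, the sketch below shows one way the “person”-only filtering and the [x1, y1, x2, y2] box format could be implemented. It is a minimal illustration rather than the authors’ code; the tuple layout of a raw detection and the person class id of 0 (COCO ordering) are assumptions.

```python
from typing import List, Tuple

# Assumed raw detection format: (x1, y1, x2, y2, confidence, class_id)
Detection = Tuple[float, float, float, float, float, int]

def extract_person_boxes(detections: List[Detection],
                         conf_thresh: float = 0.5,
                         person_class_id: int = 0) -> List[List[float]]:
    """Keep only confident 'person' detections and return p_boxes in [x1, y1, x2, y2] form."""
    p_boxes = []
    for x1, y1, x2, y2, conf, cls in detections:
        if cls == person_class_id and conf >= conf_thresh:
            p_boxes.append([x1, y1, x2, y2])
    return p_boxes
```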
3.1.2. Attention Mechanisms
There are several common kinds of attention mechanisms. These include spatial attention, channel attention, and mixes of both. After studying our specific situation, we picked a solution that uses SE and CBAM modules. Using these two together helps us obtain the best outcome.
(1) SE: The Squeeze and Excitation (SE) module is a channel attention mechanism. It consists of two main parts: the Squeeze operation and the Excitation operation. This design allows the network to dynamically adjust the importance of different feature channels. At the same time, it suppresses features that are less useful for the task. The structure of the SE module is shown in Figure 2. SE recalibrates only the channels, yet workers within the workshop occlude one another, necessitating explicit spatial emphasis on critical regions. Consequently, CBAM is introduced to concurrently model both channels and spatial relationships.
Figure 2.
The structure of the SE module [34].
Following the structural description, the mathematical formulation of the SE module is summarized as follows.
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the Squeeze operation applies global average pooling to aggregate spatial information into a channel descriptor:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j),$$
where $x_c(i, j)$ is the value at spatial location $(i, j)$ in channel $c$; $C$ denotes the number of channels; $H$ and $W$ represent the height and width of the feature map, respectively; and $z_c$ is the aggregated descriptor for channel $c$.
The Excitation operation then uses a two-layer fully connected gating mechanism to generate channel-wise importance weights:
$$s = \sigma\!\left(W_2\,\delta(W_1 z)\right),$$
where $z = [z_1, \ldots, z_C]$ is the channel descriptor vector; $W_1$ and $W_2$ are learnable weight matrices; $\delta$ denotes the ReLU activation; $\sigma$ denotes the Sigmoid function; and $s$ represents the learned channel-wise importance weights.
Finally, the input feature map is rescaled through channel-wise multiplication:
$$\tilde{x}_c = s_c \cdot x_c,$$
where $s_c$ is the learned importance weight for channel $c$ and $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_C]$ denotes the recalibrated output feature map.
This formulation highlights that SE adaptively adjusts the response of each channel based on global contextual information, allowing the network to emphasize feature channels most relevant to human detection under occlusion-prone and visually complex workshop environments.
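As a concrete reference for the squeeze, excitation, and rescaling steps above, the following is a minimal PyTorch sketch of an SE block. It mirrors the published SE design [34]; the reduction ratio of 16 is an assumed default rather than a value reported in this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (sketch following the equations above)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # global average pooling (squeeze)
        self.excite = nn.Sequential(                       # two-layer gating mechanism (excitation)
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)         # channel descriptor z, shape (B, C)
        s = self.excite(z).view(b, c, 1, 1)    # channel-wise importance weights s
        return x * s                            # channel-wise recalibration
```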
(2) CBAM: As shown in Figure 3, the Convolutional Block Attention Module (CBAM) contains two parts that work in order. These are the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). It helps the model choose and combine important features more flexibly. This improves how the model assigns importance during learning. It also becomes better at noticing small objects in images. The SE module has a limitation. It uses only average pooling, which can miss some important local details. The CAM module works differently. It uses both average pooling and max pooling together. Average pooling gathers overall information, while max pooling finds distinct features. They share the same MLP weights to learn about channels. Their outputs are added together to create a final channel attention map.
Figure 3.
This is a detailed depiction of the attention module used in this study [35]. Specifically, (a) represents the Convolutional Block Attention Module (CBAM), which consists of two sub-modules: (b) Channel Attention Module (CAM) and (c) Spatial Attention Module (SAM).
For an input feature map $F \in \mathbb{R}^{C \times H \times W}$, CAM first applies global average pooling and max pooling along the spatial dimensions:
$$F^{c}_{avg} = \mathrm{AvgPool}(F), \qquad F^{c}_{max} = \mathrm{MaxPool}(F).$$
Both descriptors are fed into a shared MLP, and their outputs are added to produce the channel attention map:
$$M_c(F) = \sigma\!\left(\mathrm{MLP}(F^{c}_{avg}) + \mathrm{MLP}(F^{c}_{max})\right).$$
The recalibrated feature map is then obtained via channel-wise multiplication:
$$F' = M_c(F) \otimes F,$$
where $F^{c}_{avg}$ and $F^{c}_{max}$ denote the average-pooled and max-pooled channel descriptors; MLP denotes the shared fully connected layers; $\sigma$ is the Sigmoid function; $M_c(F)$ represents the channel attention weights; and $\otimes$ denotes channel-wise multiplication.
The SAM module uses the spatial connections between features. It finds the most important areas in an image. First, it creates two 2D maps using average and max pooling across the channels. Then, it combines these maps into one spatial feature map. This map includes information from all channels. A 7 × 7 convolution layer processes it, and a Sigmoid function creates the final 2D spatial attention map.
Given the output of CAM, $F'$, SAM computes two spatial descriptors by applying average pooling and max pooling across the channel dimension:
$$F^{s}_{avg} = \mathrm{AvgPool}_{ch}(F'), \qquad F^{s}_{max} = \mathrm{MaxPool}_{ch}(F').$$
These two feature maps are concatenated:
$$\left[F^{s}_{avg};\, F^{s}_{max}\right],$$
and passed through a $7 \times 7$ convolution followed by a Sigmoid activation to obtain the spatial attention map:
$$M_s(F') = \sigma\!\left(f^{7 \times 7}\!\left(\left[F^{s}_{avg};\, F^{s}_{max}\right]\right)\right).$$
The final refined feature map is obtained as follows:
$$F'' = M_s(F') \otimes F',$$
where $F^{s}_{avg}$ and $F^{s}_{max}$ represent spatial descriptors; $f^{7 \times 7}$ is a convolutional layer with a $7 \times 7$ kernel; $\sigma$ is Sigmoid; $M_s(F')$ is the spatial attention map; and $F''$ is the final feature refined in both channel and spatial dimensions.
CBAM works in a serial manner. First, channel attention adjusts the feature map’s channels. Then, spatial attention adjusts its spatial areas. This process creates a feature map that is refined in both channel and spatial aspects. By combining CAM and SAM, CBAM makes the model understand images better. It strengthens key channel features and focuses on important spatial areas. This allows the model to capture a complete set of visual information.
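The serial channel-then-spatial refinement can be summarized in a short PyTorch sketch of CBAM, given below. It follows the standard CBAM formulation [35] (a shared MLP for the channel branch and a 7 × 7 convolution for the spatial branch); the reduction ratio and module names are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                                  # shared MLP (1x1 convs)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))    # average-pooled descriptor
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))     # max-pooled descriptor
        return torch.sigmoid(avg + mx)                              # channel attention M_c

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)                    # channel-wise average map
        mx, _ = torch.max(x, dim=1, keepdim=True)                   # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial attention M_s

class CBAM(nn.Module):
    """Serial channel-then-spatial attention, as described above (sketch)."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.cam(x)   # F' = M_c(F) ⊗ F
        x = x * self.sam(x)   # F'' = M_s(F') ⊗ F'
        return x
```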
3.1.3. The Overall Framework
This research built an object detection model called YOLO-SE-CBAM. The structure is shown in Figure 4.
Figure 4.
SE-CBAM Embedded in YOLOv8 Position.
In the YOLOv8 framework, the SE and CBAM modules are inserted after the main convolutional layers to strengthen feature representation. While the backbone extracts semantic information progressively, SE recalibrates channel responses and CBAM refines both channel- and spatial-level attention, suppressing background noise and highlighting posture-relevant regions. This enhancement improves human detection robustness under common factory challenges such as occlusions, variable lighting, and cluttered scenes, while preserving the real-time efficiency required in industrial applications. These improvements provide a stronger feature basis for subsequent keypoint detection and ergonomic analysis.
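A minimal sketch of how such attention modules could be appended after an existing backbone stage is shown below; it reuses the SEBlock and CBAM classes sketched earlier. The wrapper function and the placeholder stage are hypothetical, and the actual insertion points in this work follow Figure 4.

```python
import torch.nn as nn

def attach_attention(stage: nn.Module, channels: int) -> nn.Module:
    """Wrap an existing backbone stage (e.g., a C2f block) with SE followed by CBAM."""
    # The stage keeps its original weights; attention is simply appended after it.
    return nn.Sequential(stage, SEBlock(channels), CBAM(channels))
```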
3.2. HRNet with Keypoint Detection
HRNet analyzes human poses using the person regions provided by YOLOv8. The input is prepared by first expanding each detection box about its center by a factor of 1.4; the expanded box is then scaled to a fixed size of 288 × 384 pixels, which is used for both training and inference of the HRNet model. The input images first pass through convolutional layers and max pooling to generate initial feature maps, forming the first high-resolution branch. Subsequently, low-resolution branches are added sequentially. The model becomes stronger at extracting features by running multiple branches in parallel, with each branch attending to a different level of detail. During each branch-extraction stage, the number of channels is adjusted through repeatedly stacked Bottleneck layers while the feature-map size is maintained, ensuring that high-resolution information remains uncompressed. Upon completing the initial high-resolution branch and channel adjustment, the network enters the branch-expansion phase: it proceeds to the first Transition layer, which generates two new feature branches based on the current high-resolution feature map, one downsampled by a factor of 4 and another downsampled by a factor of 8. Subsequently, the network proceeds to the next round of Stage feature extraction. Upon completion, Transition2 directly takes the 8× downsampled feature map from the existing branches as input and generates a new 16× downsampled feature branch through a convolutional layer. At this point, the number of parallel branches within the network increases again, forming a richer multi-scale structure comprising the high-resolution branch, the 4× downsampled branch, the 8× downsampled branch, and the 16× downsampled branch. At the feature-processing level, the high-resolution branch retains precise coordinate information for joint keypoints, while the low-resolution branches (such as the 16× downsampled branch) leverage larger receptive fields to capture broader relationships among human body parts. The network ultimately feeds the highest-resolution feature map into its output layer. Through 1 × 1 convolutions, the feature channels are mapped onto heatmaps for key body points (e.g., shoulders, elbows, knees) with a heatmap size of 72 × 96 (width × height, scaled to one-quarter of the input size). Each pixel value in a heatmap represents the probability that the corresponding position is the target keypoint. By decoding the coordinates of the maximum-value location in each heatmap, the final predicted result for each human joint is obtained [50].
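To illustrate the pre- and post-processing around HRNet described above, the following sketch shows the 1.4× box expansion with resizing to 288 × 384 and the heatmap-peak decoding back to input-image coordinates. The function names and the OpenCV/NumPy usage are illustrative assumptions; only the numeric settings (1.4× expansion, 288 × 384 input, quarter-resolution heatmaps) come from the text.

```python
import cv2
import numpy as np

def crop_person_roi(frame: np.ndarray, box, scale: float = 1.4,
                    out_size=(288, 384)) -> np.ndarray:
    """Expand a p_box around its center by `scale` and resize it to the HRNet input size."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    img_h, img_w = frame.shape[:2]
    nx1, ny1 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    nx2, ny2 = min(int(cx + w / 2), img_w), min(int(cy + h / 2), img_h)
    roi = frame[ny1:ny2, nx1:nx2]
    return cv2.resize(roi, out_size)            # out_size is (width=288, height=384)

def decode_heatmaps(heatmaps: np.ndarray, stride: int = 4) -> np.ndarray:
    """Take the per-joint argmax of each heatmap and map it back to input-image pixels."""
    num_joints, hm_h, hm_w = heatmaps.shape     # e.g., (17, 96, 72)
    coords = np.zeros((num_joints, 2), dtype=np.float32)
    for j in range(num_joints):
        idx = np.argmax(heatmaps[j])
        y, x = divmod(idx, hm_w)
        coords[j] = (x * stride, y * stride)    # heatmaps are 1/4 of the input resolution
    return coords
```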
To ensure transparent and reproducible posture assessment, this study explicitly formulates the computation of joint angles required for RULA scoring. All angles $\theta$ are obtained from COCO keypoint coordinates (Figure 5), using the standard vector-based formulation:
$$\theta = \arccos\!\left(\frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert\,\lVert \mathbf{v}_2 \rVert}\right),$$
where the dot product and Euclidean norms of the two segment vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ are computed directly from the COCO keypoint coordinates shown in Figure 5. Based on this formulation, each RULA-related angle is derived from anatomically meaningful keypoint combinations as follows (a minimal implementation sketch is given after the angle definitions):
Figure 5.
(a) COCO 17 Human Key Points; (b) COCO Hand Key Points.
Upper-arm angle: derived from the shoulder–elbow vector (left arm: left shoulder → left elbow; right arm: right shoulder → right elbow) relative to the vertical reference axis.
Lower-arm angle: derived from the elbow–wrist vector (left arm: left elbow → left wrist; right arm: right elbow → right wrist) relative to the corresponding upper-arm vector.
Wrist angle: computed from the wrist–hand-reference vector (left wrist → left middle-finger base; right wrist → right middle-finger base) relative to the lower-arm segment. The additional keypoint (middle-finger base) provides a stable reference for quantifying wrist deviation during sewing operations, where fine wrist rotations frequently occur.
Neck angle: computed from the vector connecting the averaged shoulder position to the upper-trunk region, relative to the vertical axis.
Trunk angle: computed using the vector from the averaged hip keypoint to the averaged shoulder keypoint, referenced against the vertical direction.
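The sketch below illustrates the vector-based angle computation referenced above, using the arccos formula and, as an example, the right upper-arm angle measured against the vertical axis. The keypoint dictionary keys are hypothetical labels for the COCO joints and are not identifiers defined in this paper.

```python
import numpy as np

def joint_angle(v1: np.ndarray, v2: np.ndarray) -> float:
    """Angle (degrees) between two 2D segment vectors, per the arccos formula above."""
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def right_upper_arm_angle(kpts: dict) -> float:
    """Example: right upper-arm deviation from the vertical reference axis."""
    shoulder = np.asarray(kpts["right_shoulder"], dtype=float)   # hypothetical key names
    elbow = np.asarray(kpts["right_elbow"], dtype=float)
    vertical = np.array([0.0, 1.0])                               # image y-axis points downward
    return joint_angle(elbow - shoulder, vertical)
```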
All computed angles are then mapped to the corresponding posture categories and scoring rules defined in Table 2 and Table 3, ensuring full alignment with the original RULA assessment criteria and enabling transparent reproduction of the ergonomic evaluation pipeline.
Table 2.
Body Part Angle Scoring in RULA.
Table 3.
WMSDs Risk Assessment Score Sheet.
To support more detailed analysis of wrist posture, an additional hand keypoint was incorporated into the pose-estimation process. This was achieved by extending the keypoint annotation scheme used during training and generating a corresponding heatmap using the same Gaussian-based encoding method applied to the original joints. In the network implementation, the adjustment was confined to the output layer responsible for heatmap prediction, ensuring that the backbone architecture and multi-resolution fusion mechanism of HRNet remained unchanged. During training, the newly added keypoint was optimized together with the existing ones under the same loss function, and during inference, its location was obtained through the standard heatmap-peak decoding procedure. This integration approach provides finer hand-related posture information while maintaining full compatibility with the original HRNet design.
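For clarity, the Gaussian-based heatmap encoding mentioned above can be sketched as follows. The sigma value and array shapes are illustrative assumptions; the key point is that the added middle-finger-base keypoint simply contributes one extra target channel alongside the original joints.

```python
import numpy as np

def gaussian_heatmap(width: int, height: int, cx: float, cy: float,
                     sigma: float = 2.0) -> np.ndarray:
    """2D Gaussian target heatmap centered at a keypoint location (heatmap coordinates)."""
    xs = np.arange(width, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)[:, None]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# The added middle-finger-base keypoint receives one more such target channel,
# e.g., a target tensor of shape (18, 96, 72) instead of (17, 96, 72).
```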
Table 4 shows that, compared to other models trained on the COCO dataset, HRNet-W48 achieves higher average precision values on the COCO test-dev set, falling only slightly below the RLE and TokenPose models. However, when considering overall performance, HRNet-W48 demonstrates greater advantages. RLE (Residual Log-likelihood Estimation) is fundamentally a regression-based approach that models the distribution of keypoint outputs; its network lacks the intuitive, readily interpretable heatmap-based mechanism for feature extraction and keypoint localization that HRNet possesses. TokenPose, while a novel approach based on the Transformer architecture with unique advantages in handling long-range dependencies, suffers from the Transformer’s inherently black-box nature. In industrial real-time monitoring scenarios, its lack of interpretability may hinder the rapid identification of the causes of model misclassification. Second, neither RLE nor TokenPose is designed to address frequent challenges in textile manufacturing, such as fabric patterns or machines that block the view, which makes their joint detection less reliable than HRNet’s in this setting. In real-time monitoring, HRNet and YOLO-SE-CBAM work well together: HRNet’s multi-scale feature fusion allows it to handle a variety of workshop situations consistently.
Table 4.
Comparisons on The COCO Dataset.
HRNet (W48) has been widely used as a strong baseline on COCO and MPII human pose-estimation benchmarks [57], which supports its selection as a stable backbone for ergonomic angle computation. As shown in Table 4, smaller HRNet variants led to reduced keypoint-detection accuracy under partial occlusion, whereas the objective of this work is to verify the correctness of the detection–pose–scoring pipeline rather than to explore architectural variations within HRNet. Therefore, ablation analysis was focused on YOLO-based attention enhancements, where performance differences were expected to be more meaningful for the downstream ergonomic-scoring task.
As illustrated in Figure 5, in order to measure the wrist angle for RULA scoring, this study adds a new keypoint. The method uses the 17 standard keypoints from MS COCO 2017 plus one more: the base of the middle finger. This additional keypoint corresponds to hand keypoint 9 (the middle-finger base) in the COCO hand-keypoint annotation scheme.
3.3. Rapid Upper Limb Assessment
The Rapid Upper Limb Assessment (RULA) is a method for assessing the risks associated with poor arm and neck posture at work; Lynn McAtamney and Nigel Corlett [60] introduced it in 1993. The RULA approach evaluates the angles of the upper and lower arms, wrists, neck, and back. It first assigns a number to each body part depending on its position. The arm and wrist numbers are then combined to form Score A, followed by the neck and back values in Score B. To locate Score C, first check Score A in Table 2 and Score B in Table 3.
Then, add them together to reach a final number between one and seven. Score C defines action priority: 1–2 (low risk, ignore), 3–4 (medium, consider modifications), 5–6 (high, act soon), and 7 (extremely high, act now).
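A minimal sketch of the final mapping from the RULA grand score (Score C) to the action categories listed above is given below; the category wording is paraphrased from the action levels stated in this section.

```python
def rula_action_level(score_c: int) -> str:
    """Map the final RULA grand score (1-7) to the action categories listed above."""
    if score_c <= 2:
        return "low risk (no action required)"
    if score_c <= 4:
        return "medium risk (consider modifications)"
    if score_c <= 6:
        return "high risk (act soon)"
    return "extremely high risk (act now)"
```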
The experimental validation in this study was conducted as an initial feasibility assessment. A small set of representative static frames was extracted from longer operational videos to ensure that the selected images captured typical postures within cutting, sewing, and pressing tasks. These frames were annotated and evaluated by a focused expert group familiar with garment manufacturing ergonomics. The goal of this validation was not to provide full large-scale statistical generalization but rather to verify the functional correctness of the proposed detection–estimation–scoring pipeline under real workshop conditions.
3.4. Performance Metrics
This study uses several measures to check the detection performance of YOLO-SE-CBAM and to ensure that the experimental results are reliable. Commonly used evaluation metrics include Intersection over Union (IoU), Precision, Recall, and mAP, calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (3)$$
$$\mathrm{IoU} = \frac{\lvert B_a \cap B_b \rvert}{\lvert B_a \cup B_b \rvert} \quad (4)$$
TP denotes the number of correctly detected positive samples, FP denotes the number of negative samples misclassified as positive, and FN denotes the number of missed positive samples (i.e., those incorrectly classified as negative). The AP value denotes the area under the P–R curve. In Formula (3), the mAP value is obtained by averaging the APs of all categories, where $N$ represents the total number of detected categories. In this experiment, a higher mAP value indicates better detection performance and higher recognition accuracy of the algorithm. $B_a$ denotes the area of the predicted bounding box, and $B_b$ denotes the area of the ground-truth bounding box. The IoU ratio reflects the degree of overlap between the predicted and ground-truth boxes; a higher IoU value indicates greater prediction accuracy. The mAP at IoU threshold 0.5 (mAP@0.5) denotes the metric calculated using non-maximum suppression (NMS) with an IoU threshold ≥0.5 [61].
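As a reference implementation of the box-overlap and counting-based metrics defined above, a short sketch is given below. It computes IoU for a pair of [x1, y1, x2, y2] boxes and precision/recall from TP/FP/FN counts; the epsilon terms are added only for numerical safety and are not part of the formal definitions.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from detection counts, per Formulas (1) and (2)."""
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return precision, recall
```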
4. Experiments and Results
4.1. Video Capture and Ethics
This study conducted ergonomic risk assessments on three primary processes in the sewing manufacturing industry: fabric cutting, sewing, and pressing. Video capture involved 3 master craftsmen for fabric cutting, 7 for sewing, and 4 for pressing. Based on posture estimation, all research participants were able-bodied. Ethical protocols were strictly followed throughout this study: All participants were required to sign informed consent forms, where they were fully informed of this study’s purpose, data usage, and the measures implemented to ensure data confidentiality and personal privacy. Additionally, three ergonomics experts were recruited; their manual assessment results will serve as a benchmark to verify whether the proposed method can achieve performance comparable to or exceeding that of human evaluators.
4.2. Implementation Details
All related code and algorithms run on a computer with an Intel(R) Xeon(R) E5-2680 v3 CPU (Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4060 Ti GPU (Santa Clara, CA, USA) under Windows 10. The object detection training used the MS COCO 2017 dataset, which contains about 118,287 training images covering 80 object categories; the dataset is rich in human poses and provides a large number of images. We first split the whole dataset into a 90% training-plus-validation set and a 10% test set, and then further split the former into 90% training and 10% validation. This yields an 81% training set and a 9% validation set as training data, with the 10% test set used as test data. The training uses the VOC dataset format, with labels saved in XML format to preserve object coordinates and category information. All grayscale images are automatically converted into RGB images to fit the model input format. The input image size is fixed to 640 × 640 (640 is a multiple of 32, the downsampling factor of YOLOv8). Mosaic augmentation is enabled, and MixUp augmentation is further applied with a probability of 50%. This strategy is used in the first 70% of the epochs to balance data diversity and fidelity to the true distribution.
The training adopts a phased optimization strategy. In the freezing phase, the pretrained backbone network outputs basic features, which alleviates the feature-learning instability caused by random parameter initialization. A larger batch size accelerates training; each epoch consists of 50 iterations, with 32 samples loaded into memory per iteration. In the unfrozen phase, the backbone network is tuned to human detection and all parameters are fine-tuned to improve the accuracy of human target recognition; it is trained for 300 epochs with a batch size of 16. This design not only reduces GPU memory occupation during full-parameter training but also increases the number of parameters that can be updated in each step, providing more flexibility. We adopt SGD as the optimizer with an initial learning rate of 1 × 10−2; the minimum learning rate is set to 1% of the initial learning rate, the momentum to 0.937, and the weight decay to 5 × 10−4. A cosine annealing schedule is used for learning-rate decay to make the adjustment gentler. To ensure the reliability and traceability of the training process, weight files are saved every 10 epochs. The mean Average Precision (mAP) is adopted as the evaluation index, which plays a key role in dynamically monitoring training convergence and avoiding overfitting. In this study, mAP refers specifically to mAP@0.5 (AP50), following the COCO-style IoU threshold of 0.5 for object detection evaluation. To ensure experimental repeatability and efficiency, the parameter-initialization and random-augmentation modules were configured with a fixed random seed (seed = 11), and multi-threaded data loading (num_workers = 4) was enabled to parallelize data loading and preprocessing, improving training efficiency.
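The optimizer and learning-rate settings described above can be assembled as in the following PyTorch sketch. The function name is hypothetical and the model/dataset are passed in as arguments; only the hyperparameters (SGD, lr = 1e-2, momentum = 0.937, weight decay = 5e-4, cosine annealing to 1% of the initial rate, 300 epochs, batch size 16, seed 11, num_workers = 4) are taken from the text.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def build_training_setup(model: torch.nn.Module, train_dataset: Dataset,
                         epochs: int = 300, batch_size: int = 16):
    """Assemble the optimizer, cosine LR schedule, and data loader per the settings above."""
    torch.manual_seed(11)                                        # fixed seed for repeatability
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.937, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-2 * 0.01)            # minimum LR = 1% of initial LR
    loader = DataLoader(train_dataset, batch_size=batch_size,
                        shuffle=True, num_workers=4)
    return optimizer, scheduler, loader
```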
To evaluate the practical deployability of the proposed pipeline, we additionally measured the runtime performance of both the detection and pose-estimation components under the same hardware configuration. The inference speed was computed using the full YOLO–HRNet pipeline, and the resulting FPS values are reported in Table 5 together with parameters and FLOPs. These measurements provide a clearer assessment of real-time feasibility on representative industrial hardware.
Table 5.
Ablation Study Results of Four Different Object Detection Models on the COCO Dataset.
4.3. Results on the COCO Dataset
An ablation study was conducted on YOLOv8-SE-CBAM, with the results shown in Table 5 and Figure 6. In addition to accuracy metrics, Table 5 also reports the number of parameters, FLOPs, and inference time of each model variant. All FPS values reported in Table 5 were measured on the same hardware configuration described in Section 4.2 (Intel Xeon E5-2680 v3 CPU and NVIDIA RTX 4060 Ti GPU). These results provide a comprehensive view of the computational cost associated with SE and CBAM, enabling a clearer assessment of the accuracy–efficiency trade-off for real-time deployment.
Figure 6.
Precision-Recall (PR) Curves of Four Different Object Detection Models.
As demonstrated by the precision-recall (PR) curves in Figure 6, the original YOLOv8 model exhibits a precision approaching 1.0 at low recall levels, indicating high accuracy for predictions with strong confidence. However, as recall increases, precision rapidly decreases to about 0.1 in the high-recall region (recall > 0.8); that is, when more objects are detected, there are more false positives. Compared with YOLOv8, the precision of YOLOv8-SE is better in the medium-to-high recall range (recall > 0.6). For example, at recall = 0.8, its precision is much higher than that of YOLOv8.
Compared with YOLOv8, YOLOv8-CBAM is more accurate in the whole recall range. When the recall is low, the accuracy of YOLOv8-CBAM is similar to YOLOv8; when the recall is medium (for example, recall = 0.5) or high (for example, recall = 0.8), the accuracy of YOLOv8-CBAM is higher. Because Convolutional Block Attention Module (CBAM) combines channel attention and spatial attention, it can focus more accurately on features and decrease more false positives when more objects are detected.
YOLOv8-SE-CBAM obtains the best performance among all models. When the recall is low, its precision is close to 1.0; at recall = 0.6, its precision is close to 0.9; and at high recall, its precision is much higher than that of the other three models. The highest F1-score achieved by the YOLOv8-SE-CBAM model, which integrates both SE and CBAM attention mechanisms, signifies its superior capability in balancing detection precision and recall under challenging industrial environments. The combined SE and CBAM attention mechanisms enhance the model’s feature selection and focusing ability, allowing it to handle occlusions and complex postures effectively, so YOLOv8-SE-CBAM is the most effective at balancing recall and precision.
As shown in Table 5, introducing SE or CBAM into the YOLOv8 baseline results in only a marginal increase in parameters (from 9.831 M to 9.832 M) and no change in FLOPs (23.4 G), confirming that both modules are lightweight additions. Despite this minimal computational overhead, both attention mechanisms noticeably improve inference speed. Specifically, YOLOv8-SE achieves the highest FPS (87.36), demonstrating that SE’s channel-wise recalibration enhances feature selectivity without imposing extra latency. CBAM, which includes both channel and spatial attention, slightly increases computational complexity but still delivers a high runtime performance of 82.53 FPS. When SE and CBAM are combined, the resulting model reaches 82.84 FPS—maintaining real-time capability while achieving the highest accuracy (AP = 83.09%). Here, AP refers to AP@0.5 (IoU = 0.5) to ensure consistency in performance evaluation across all model variants.
From Table 6 and Figure 7, we compared the proposed YOLOv8-SE-CBAM model with other object detection models. The precision of YOLOv8-SE-CBAM is 91.13%, the recall is 66.08%, the F1 score is 77.0%, and the AP is 83.09%. Although its recall is slightly lower, its precision is substantially higher than that of EFF-YOLO [62], which obtains 77.8% precision and 67.2% recall. It outperforms all YOLOv5s variants [63]—including YOLOv5 (73.82% precision, 67.05% F1), YOLOv5-SE (72.64% precision, 64.99% F1), YOLOv5-CBAM (70.66% precision, 65.04% F1), and YOLOv5-STP (74.92% precision, 67.9% F1)—across all metrics. It also achieves a higher AP than the advanced YOLOv5 model [64], which reached 79% AP. These results confirm that integrating SE and CBAM into YOLOv8 strengthens discriminative feature extraction, improving precision, recall, F1 score, and AP in object detection tasks.
Table 6.
Performance Comparison of Different Models for Human Detection on the COCO Dataset.
Figure 7.
F1 Curves of Four Different Object Detection Models.
With SE and CBAM attention modules enhancing its feature extraction capability, YOLO-SE-CBAM enables high-precision human detection in complex workshop environments that involve equipment occlusion, lighting variations, and worker overlap. Its 91.13% precision effectively filters out interfering objects (e.g., machinery, materials), providing clean human regions of interest (ROIs) for subsequent pose analysis. HRNet, leveraging its full-path high-resolution feature representation architecture, accurately extracts key human keypoints (e.g., neck, shoulders, elbows, hips) and calculates joint angles for common worker postures. This fully meets RULA’s accuracy requirements for quantifying angles of critical body regions. YOLO-SE-CBAM-HRNet supports rapid deployment: it needs no additional training-data annotation for specific workshop scenarios or task types and depends only on the general-purpose COCO human-detection pre-trained model.
4.4. Algorithm Performance and Visualization
To qualitatively demonstrate the effectiveness of our proposed YOLO-SE-CBAM-HRNet model in real garment production environments, Figure 8 presents the pose estimation results on workers during the three primary processes.
Figure 8.
Pose estimation results in garment manufacturing processes.
The output images confirm that our method reliably identifies joint locations and reconstructs body positions. It accurately estimates the right-side postures frequently adopted by operators during production. This performance remains consistent despite typical factory obstructions like fabric piles in sewing or machinery blocking the view during pressing. These findings verify the approach’s practical effectiveness in real workshop environments, establishing a trustworthy basis for follow-up ergonomic evaluation.
Beyond qualitative visualization, the end-to-end runtime of the complete YOLO–SE-CBAM–HRNet–RULA pipeline was also evaluated under continuous video input. On the hardware platform described in Section 4.2 (Intel Xeon E5-2680 v3 CPU and RTX 4060 Ti GPU), the system processed 640 × 640 input frames at approximately 35.3 FPS, including person detection, region-of-interest cropping, multi-scale pose estimation, and RULA angle computation. This frame rate satisfies the real-time requirement for workshop-level monitoring (typically ≥ 30 FPS) and indicates that the full detection–pose–scoring workflow can operate stably during garment production tasks.
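For reproducibility, the frame-rate measurement reported here can be obtained with a simple wall-clock loop over the full pipeline, as sketched below. The `process_frame` callable is a hypothetical stand-in for the detection–cropping–pose–RULA chain.

```python
import time

def measure_fps(process_frame, frames) -> float:
    """End-to-end throughput: frames processed per second by the full pipeline."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)      # detection -> ROI crop -> pose estimation -> RULA scoring
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```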
4.5. Establishment of Expert Standards and Comparison with RULA Algorithm Scoring
This study extracted 10 static single-frame images with non-repetitive postures from each of the cutting, sewing, and pressing processes. A fixed panel of three experts was invited to score these 30 samples. A multidimensional metric system was established to distinguish between inter-expert consistency (benchmark validation) and algorithm-expert consistency (model performance):
Intraclass correlation coefficients (ICC) measure numerical consistency. For inter-expert reliability, ICC(3,1) two-way mixed-effects models (suitable for a fixed expert panel and random samples) were employed; ICC(A,1) assessed agreement between the algorithm scores and the experts’ average scores. ICC emphasizes numerical consistency, with values >0.85 indicating high agreement.
Cohen’s Kappa focuses on categorical agreement (risk grades 1–4), independent of numerical bias and sensitive to category mismatches (e.g., misclassifying grade 2 as grade 3). A Kappa value above 0.8 shows excellent agreement in the ratings.
We use Mean Absolute Error (MAE) and 95% Limits of Agreement (95% LoA) to measure the scoring differences. MAE tells us the average size of the errors. The 95% LoA shows the range where most errors fall. A smaller range means the agreement is more consistent.
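The three agreement measures described above can be computed with standard tools, as in the sketch below. It assumes NumPy and scikit-learn are available and that the inputs passed to Cohen’s kappa are the integer risk grades rather than continuous scores.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_metrics(system_scores, expert_scores):
    """MAE, Bland-Altman 95% limits of agreement, and Cohen's kappa on risk grades."""
    s = np.asarray(system_scores, dtype=float)
    e = np.asarray(expert_scores, dtype=float)
    diff = s - e
    mae = np.mean(np.abs(diff))                          # average absolute scoring error
    loa = (diff.mean() - 1.96 * diff.std(ddof=1),        # lower 95% limit of agreement
           diff.mean() + 1.96 * diff.std(ddof=1))        # upper 95% limit of agreement
    kappa = cohen_kappa_score(s.astype(int), e.astype(int))  # categorical agreement
    return mae, loa, kappa
```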
Table 7 shows that the ICC for ScoreA across the three experts is 0.951, indicating nearly perfect consistency in scoring. The ICC for ScoreB is 0.89, and Cronbach’s Alpha is 0.959. The ICC for ScoreB is lower than that for ScoreA, meaning the numerical consistency of the experts’ judgments for ScoreB is slightly worse than for ScoreA. The experts evaluated the trunk angle differently: some experts scored samples with angles greater than 180° as 4, while the others scored samples rotated slightly less than 180° as 2. However, Cronbach’s Alpha shows that all experts applied the same scoring criteria (no systematic bias, only slight numerical differences). Reliability coefficients based on the overall scoring were above 0.85, which also suggests that the three experts rated with high consistency, so the experts’ mean score could be used as the benchmark against which our system’s mean score is compared.
Table 7.
Results of Expert and System Scoring Reliability Analysis.
Table 7 presents the scores given by the three experts and our method. Cohen’s kappa coefficient was used to further investigate classification consistency. All three scoring groups had high numerical consistency, with only slight differences among Score A, Score B, and Score C. However, in terms of classification consistency and bias stability, there were some differences among them. The ICC between the system and the three experts for Score A was 0.917, indicating high numerical agreement between the system and the experts. However, the Cohen’s kappa coefficient of Score A was 0.808, while those of Score B and Score C were higher, meaning the classification accuracy of Score A was slightly lower than that of Score B and Score C. Score A achieved a Cohen’s kappa coefficient of 0.808, while Score B reached 0.849. This advantage stems not from smaller fluctuations in the B score (2 or 4 points), but from fewer classification mismatches between the system and the expert assessments. For example, in trunk-angle evaluations, although system scores may differ from expert scores by 2 or 4 points, such mismatches occur less frequently. In contrast, Score A exhibited more frequent classification mismatches. A typical case occurs when a worker’s hand faces the camera with the back of the hand visible: the system’s classification of the wrist more frequently deviates from the category assigned by the experts, as shown in Figure 9. Additionally, score fluctuations across grades 1, 2, and 3 for Score A involved more instances where the system and the experts assigned different grades, ultimately leading to a lower kappa coefficient.
Figure 9.
Average of Expert and System Scores. (A: SCORE A, B: SCORE B, C: SCORE C).
Score C achieved an ICC of 0.93 and a Cohen’s kappa coefficient of 0.831, indicating high degrees of both numerical and categorical consistency between the system and the experts. Regarding mean absolute error (MAE) and 95% limits of agreement (95% LoA), MAE followed the gradient Group C (0.1556) < Group A (0.2222) < Group B (0.3111). The 95% LoA results revealed that Group C had the most concentrated deviation distribution, while Group B had the most dispersed. This confirms that for the Group C scores, the algorithm exhibits the smallest average deviation from the expert average and the most stable consistency, while Group B shows the highest variability in the evaluation results. In summary, the methodology proposed in this study demonstrates high reliability and stability, particularly for Score C. It accurately reflects the ergonomic risk grades of different processes for garment workers, thereby providing an effective tool for WMSDs risk assessment in sewing workshops.
Although the numerical and categorical agreements were generally high, the misclassifications exhibited a characteristic distribution across the three production processes. In cutting, most discrepancies arose from borderline trunk flexion angles, where rapid forward bending during material handling led to occasional score shifts between adjacent RULA categories. In sewing, misclassifications were primarily associated with fine wrist posture variations, which are more difficult for the algorithm to detect due to frequent occlusions caused by fabric, sewing tools, and hand–material interactions. In pressing, errors were mostly related to rapid upper-arm elevation during repetitive lifting motions, causing transient deviations in upper-limb angle estimation. Importantly, these misclassifications were generally small in magnitude—typically within one RULA level—and did not alter the overall risk classification for most samples. This distribution shows that the model performs robustly across all processes while being more sensitive to subtle or transient postural changes.
To further verify that the RULA angle computation is supported by sufficiently accurate pose information, we additionally evaluated the HRNet-W48 keypoint detector on the COCO keypoint validation split. The assessment followed the official OKS-based metric, where the Object Keypoint Similarity (OKS) measures keypoint accuracy by comparing predicted locations with ground truth while accounting for person scale and keypoint visibility. Under this protocol, HRNet-W48 achieved an OKS-AP of 74.9%, which is aligned with the commonly reported performance of this architecture. This level of keypoint localization accuracy indicates that major joints—including shoulders, elbows, hips, and wrists—are detected with sufficient precision to support stable downstream angle estimation. The inclusion of this supplementary analysis enhances the credibility of the ergonomic scoring outcomes presented in Table 7.
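For reference, the OKS quantity underlying the reported OKS-AP can be written out directly. The sketch below follows the published COCO definition rather than the evaluation code used in this study; keypoint coordinates, visibility flags, and per-keypoint constants are illustrative inputs.

```python
# Object Keypoint Similarity (OKS), following the COCO keypoint definition.
import numpy as np

def oks(pred, gt, visibility, area, kappas):
    """pred, gt: (K, 2) keypoint coordinates in pixels; visibility: (K,) flags;
    area: ground-truth person segment area; kappas: (K,) per-keypoint constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared pixel distances
    e = d2 / (2.0 * area * kappas ** 2 + 1e-12)  # scale- and keypoint-normalized error
    vis = visibility > 0                          # only labeled keypoints contribute
    return float(np.sum(np.exp(-e)[vis]) / max(vis.sum(), 1))
```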
5. Comprehensive Analysis of Garment Manufacturing Processes
We extracted representative process videos (30 FPS) from recorded cutting, sewing, and pressing operations for ergonomic analysis:
As shown in Figure 10a, during the cutting process, when the workers used the electric cutter to trim the fabric, the risk grades varied with the distance between the fabric and the body. When the fabric is close to the body, the fluctuation in the Score A value is caused by small changes in arm posture while moving the machine, but the overall risk grade remains at Grade 3. When the fabric is farther from the worker, the angle between the worker’s torso and the vertical direction stays below 160° (minimum 121.8°) and the arm remains straight and raised, so the range of motion increases greatly. During this process, as the upper arm angle increases, the score changes from Grade 3 to Grade 4; at the same time, when the torso angle score reaches Grade 4, the overall risk grade is Grade 4.
Figure 10.
(a) The Cutting Phase; (b) The Pressing Phase.
As shown in Figure 10b, during the pressing process, we captured the worker folding and flattening a collar. When the worker lifted the pressing material, the angle of her upper arm was 63.1°. From frames 120 to 360, the worker laid the pressing material on the collar; from frames 360 to 600, the worker picked up the iron and repeatedly pressed the collar, during which the upper arm angle fluctuated between 15.4° and 51.7° because of the back-and-forth movement of the arm. At this time, Score A was 2–3 points and the risk grade was evaluated as 3. After the ironed collars were produced, when the worker placed the collars, the upper arm angle increased and the torso bent forward, so the risk grade rose to Grade 4.
As shown in Figure 11, during the sewing stage, a female worker picked fabric, sewed seams, and finished garments. In video frames 0–750, Score A fluctuated frequently between 2 and 4 points (Grade 3) because the worker had to make small adjustments with the arms and wrists (Grades 2–3). Score B fluctuated with large amplitude and high frequency between 3 and 6 points because the worker had to pick and place materials and prepare fabric before sewing; when picking and placing materials, her torso tilted forward and backward to relieve the lumbar region, causing large fluctuations in the risk grade. Later, in frames 1530–1620, the worker placed the sewn garment beside her body, producing large fluctuations in arm abduction. The angle between the upper arm and forearm increased while the torso and neck risk grades did not change, which together yielded a risk grade of 5.
Figure 11.
The Sewing Phase.
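The per-frame scores discussed above are derived from joint angles measured on the 2D keypoints. The following is an illustrative sketch, assuming COCO-style keypoints from HRNet, of how upper-arm and torso angles can be computed before mapping them to RULA scores; the coordinates are hypothetical, and the angle convention here (0° = upright) is a simplification of the reference used by the system.

```python
# Illustrative 2D joint-angle computation from pose keypoints (not the authors' exact code).
import numpy as np

def angle_to_vertical(p_top, p_bottom):
    """Angle (degrees) between the segment p_top -> p_bottom and the downward
    image vertical; image y grows downward, so (0, 1) points down."""
    v = np.asarray(p_bottom, float) - np.asarray(p_top, float)
    down = np.array([0.0, 1.0])
    cos = np.dot(v, down) / (np.linalg.norm(v) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical keypoints (pixel coordinates) for one frame.
shoulder, elbow, hip = (410, 220), (455, 300), (405, 390)

upper_arm_angle = angle_to_vertical(shoulder, elbow)  # upper-arm deviation from vertical
torso_angle = angle_to_vertical(shoulder, hip)        # trunk lean from vertical
print(f"upper arm: {upper_arm_angle:.1f} deg, torso: {torso_angle:.1f} deg")
```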
After scoring selected video samples, the risk grade distribution across different processes and body parts was analyzed, as shown in Figure 12. Risk grades were coded as 1–4 based on the assessment scores, with key findings detailed below:
Figure 12.
Duration of Risk Grades for Each Body Part.
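The duration statistics in Figure 12 follow directly from the per-frame grades. A minimal sketch of this tally is shown below, assuming one risk grade per frame for a single body part; the grade sequence and frame count are made up for illustration.

```python
# Illustrative tally of risk-grade durations from per-frame grades.
import numpy as np

frame_grades = np.array([1, 1, 2, 3, 3, 3, 2, 2, 4, 3, 2, 1])  # one grade per frame
fps = 30                                                        # video frame rate

for grade in (1, 2, 3, 4):
    frames = int((frame_grades == grade).sum())
    share = 100.0 * frames / len(frame_grades)
    print(f"Grade {grade}: {frames / fps:.2f} s ({share:.1f}% of the clip)")
```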
In the cutting process, the proportions of Grade 3 and Grade 4 risk postures of the upper arm are 41.30% and 6.19%, together nearly 47.5%. This shows that, while operating a cutting machine, the upper arm frequently deviates from the natural relaxed range of about 15°, within which muscle and joint effort is minimal; this deviation ultimately leads to a high proportion of medium-to-high-risk postures. For the forearm, Grade 2 risk postures account for 76.24%, and there are no Grade 3 or Grade 4 risks. This distribution arises because the forearm must coordinate with the upper arm to make wide-range dynamic angle adjustments, moving the fabric closer to or farther from the cutting machine or controlling the machine during cutting. These upper limb adjustments cause the forearm to deviate slightly from the optimal posture, but not severely enough to reach Grade 3 or above. For the torso, Grade 3 and Grade 4 risk postures account for 31.97% and 17.06%, together nearly 50%. This is because workers lean forward to assist the force applied by the upper limb during cutting, pulling the torso away from its natural posture; the spine is under least stress when it stays upright in its physiological curvature, so this deviation produces a high proportion of medium-to-high risk.
In the sewing operation, Grade 1 risk postures of the upper arm account for 82.35%. Because sewing requires precise upper limb postures, the upper arm angle is usually kept within a low-risk range, so the proportion of Grade 1 risk is relatively high. The proportions of Grade 1 and Grade 2 risk postures of the forearm are 60% and 40.8%, respectively, a relatively balanced split. This is because the forearm must swing repetitively to feed fabric, and precise angular control of the forearm is needed to keep the sewing smooth; the repetitive action pulls the forearm away from the optimal posture, but not severely enough to reach Grade 3 or above. For the trunk, Grade 4 risk postures account for 32.99%. This is because workers lean backward beyond 180° to relieve back pain during breaks from sewing; this excessive trunk extension places great stress on the torso and produces a high proportion of high-risk load. Meanwhile, the long duration of fabric feeding causes the torso to deviate from the optimal posture and lean forward, so the proportion of Grade 2 risk is 53.64%.
In the pressing process, Grade 1 upper arm postures accounted for 44.35% and Grade 2 for 42.61%; together the two grades exceeded 87%, meaning that arm swing was small and the posture was relatively relaxed and stable, so the risk was low. The proportion of Grade 3 wrist postures was 42.61%, indicating that wrist twisting angles were relatively large; the repetitive motion during pressing was also considerable, which often causes hand and wrist pain. In addition, because the head remained in a downward position for long periods to observe the pressing area, the vast majority of neck postures were Grade 3. For the torso, the proportion of Grade 4 risk was 29.57% and that of Grade 2 risk was 65.65%; the repetitive movement of leaning forward to press and then returning to an upright position was the main cause of the torso risks.
In all three processes, wrist risk grades were mainly Grade 2 and Grade 3. This is because cutting, sewing, and ironing all force the wrist out of its natural, relaxed position: while cutting, workers have to grip the cutting machine; while sewing, small angle adjustments are constantly required; while ironing, the iron must be manipulated. Because the wrist rarely stays in a neutral angular position in any of the three processes, wrist risks are concentrated in these grades. The most common neck risk was Grade 3, with Grades 1 and 2 together accounting for less than 3%, because the neck must be bent downward continuously to observe the details of the process while working.
Garment manufacturing workers are required to perform repetitive work in the forearm, torso, and wrist, which makes them more susceptible to WMSDs, chronic muscle strain, tendonitis, and degenerative diseases of the cervical and lumbar spine. Further ergonomic interventions can be carried out to reduce the occurrence of high-risk postures. For example, reasonable workstation layout and standardized postural training can be used to solve problems such as excessive trunk forward bending and excessive upper arm flexion [65].
6. Discussion
6.1. Technical Advantages
The YOLO-SE-CBAM-HRNet method shows clear strengths in assessing worker movements during clothing assembly. Sewing tasks often include handling multiple fabric layers, blocked views from machinery, and small, detailed hand motions. Common detection and posture analysis tools struggle to identify important body points due to limited feature capture. The SE part adjusts channel weights to emphasize key posture features. CBAM then pinpoints critical zones such as hands and arms in both channel and spatial views. Together, they reduce noise from hidden sections and boost the precision of locating people. Tests indicate that adding SE and CBAM increases mAP by 4.43% and recall by 5.99% over a basic model without these attention parts. This enhancement significantly improves the representational quality of human-related features, leading to more precise and reliable keypoint localization. HRNet keeps high-definition feature maps during the entire process. It also combines different scale data to add fine details. This approach fixes the problem of fuzzy keypoint positions seen in older networks that use heavy downsampling. As a result, it supports the later steps of calculating joint angles for RULA ratings.
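The channel-reweighting idea described above is compact enough to show directly. The sketch below is a minimal, generic Squeeze-and-Excitation block in PyTorch following Hu et al.’s design, not the exact module integrated into our YOLO backbone; the reduction ratio is an illustrative default. CBAM extends this idea by adding a spatial attention stage after the channel stage.

```python
# Minimal Squeeze-and-Excitation block (generic sketch, not the integrated module).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                            # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight channels of the input feature map

# Usage: se = SEBlock(256); y = se(torch.randn(1, 256, 32, 32))
```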
6.2. Research Limitations
This research uses 2D images to detect key body points, whereas real body positions exist in three-dimensional space. Depth is often misjudged in flat images because of the camera angle. For instance, during fabric cutting, an arm abducted 30 degrees sideways looks only slightly bent when seen from the side, so the system gives a low score of 1–2 points even though the actual angle is large. In pressing work, a 30° forward bend and a 30° sideways lean look similar in 2D images, which can lead the system to judge the angle incorrectly. The absence of depth cues introduces structural ambiguity and limits the accuracy of posture assessment. Future studies should include 3D posture analysis to make assessments more accurate.
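The depth ambiguity can be illustrated with a toy geometric example. Under a simple orthographic projection and a made-up world frame (y pointing up, z pointing toward the camera), an arm abducted 30° toward the camera projects to an almost vertical segment in the image, so the apparent 2D angle is near zero; this is a schematic illustration, not the camera model used by the system.

```python
# Toy illustration of out-of-plane motion collapsing under 2D projection.
import numpy as np

true_abduction = 30.0                                    # degrees, out of the image plane
arm_3d = np.array([0.0,                                  # lateral (in-plane) component
                   -np.cos(np.radians(true_abduction)),  # downward component (y is up)
                   np.sin(np.radians(true_abduction))])  # depth component (toward camera)

arm_2d = arm_3d[:2]                                      # orthographic projection drops depth
down = np.array([0.0, -1.0])
cos = arm_2d @ down / (np.linalg.norm(arm_2d) + 1e-9)
apparent = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(f"true abduction: {true_abduction} deg, apparent 2D angle: {apparent:.1f} deg")
```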
Adding a reference point at the root of the middle finger helps measure wrist angles more accurately. However, when workers face their palms toward the camera during sewing, key points such as the wrist and finger base may be located inaccurately, and the system can mistake this hand position for a bent posture, giving 2–3 points. Human experts can tell from the task type and hand motion that the wrist is actually relaxed and would assign 1–2 points. The system also calculates body angles very precisely from the hip and shoulder points provided by HRNet, for example measuring exactly 185.2°; following the rules, it gives 4 points when the torso leans back beyond 180°. Humans, by contrast, rely on visual estimation and cannot easily distinguish 180° from 185°, so they often rate a slight 185.2° backward lean as a normal posture and give 1–2 points. This scoring difference stems from the two approaches: exact computational measurement versus human visual judgment.
In addition to the factors discussed above, discrepancies between human assessments and algorithmic angle estimation are also influenced by differences in evaluation philosophy. Ergonomists tend to make holistic judgments that consider task type, body movement patterns, and work context, whereas the 2D estimation module evaluates posture strictly based on geometric relationships. This often leads to small but systematic deviations. To reduce such inconsistencies in real deployment, the system adopts a simple calibration procedure that standardizes camera placement, working distance, and viewing angle, helping to minimize bias introduced by perspective distortion. Moreover, ergonomic scoring methods such as RULA inherently incorporate tolerance margins for joint angles, meaning that slight numerical differences rarely change the final risk category. As a result, even when human and algorithmic measurements diverge by a few degrees, the practical impact on risk-level assignment remains limited. These considerations suggest that the system can operate reliably in real manufacturing environments as long as basic calibration is maintained and posture scoring is interpreted within the appropriate tolerance range.
Moreover, the present study did not include comparisons with recent pose-estimation models such as RTMPose, ViTPose, DWPose, or other MMPose pipelines. These models typically rely on transformer-based architectures or multi-stage tracking mechanisms that incur higher computational cost and are less suited for lightweight edge-device deployment, which is a primary consideration for workshop-level ergonomic monitoring. Benchmarking against these models will be performed in future work to further strengthen the generalizability of the proposed framework.
Although real-time inference was achieved on the tested hardware, the experiments were conducted on pre-recorded videos, meaning that camera vibration, worker turnover, rapid hand motions, dense object interactions, and strong illumination changes were not systematically stress-tested. In addition, power consumption, long-duration thermal stability, and frame-rate robustness—which are critical for edge-device deployment—were not examined. These aspects will be addressed in future work through extended real-time experiments in continuously operating workshop environments.
6.3. Outlook
The existing data only records basic routine operations across three processes: cutting, sewing, and pressing (e.g., straight cutting with a cutting machine, flat sewing with a sewing machine, and back-and-forth pressing with an iron). It does not include extreme movements within the workflow that are prone to high risks, such as bending down to the ground to pick up stacked fabrics (trunk forward tilt angle ≤ 110°), standing on tiptoes to reach high material shelves (upper arm elevation angle ≥ 120°), or kneeling on one knee to arrange fabrics on the floor (coordinated twisting of lower limbs and torso). Although these actions occur relatively infrequently, each instance imposes substantially higher biomechanical loads than routine tasks. Such peak-load movements are well-recognized contributors to work-related musculoskeletal disorders and require considerable physical effort, which accelerates fatigue and reduces operational efficiency [66]. Their omission creates blind spots in the algorithm’s assessment of peak risks. The full garment manufacturing process encompasses a complete chain, including cutting-sewing-pressing-packaging-handling. Current research only covers the first three core processes, excluding equally high-risk operations such as packaging (e.g., folding garments, boxing, and bagging) and handling (e.g., transferring fabric bundles, finished product boxes).
The current dataset was collected under semi-controlled workshop conditions that included moderate background clutter and partial occlusion but did not fully capture the most congested or heavily obstructed environments found in some garment-production settings. Future work will expand the dataset to include scenes from more challenging real-world conditions, such as densely packed workstations, severe occlusions caused by machinery and stacked textiles, diverse lighting environments, and variations in worker clothing and fabric types. In addition, multi-angle video capture from different production lines will be incorporated to ensure broader coverage of operational contexts. These expanded data sources will support further refinement of the detection and pose-estimation modules, enhancing their robustness and generalizability in highly cluttered factory environments.
Existing research has not conducted an in-depth analysis of material heterogeneity and process variability in garment manufacturing. On one hand, it fails to consider the impact of fabric material (e.g., heavy denim, light silk, stretch knits) and size (e.g., large fabric sheets versus small cut pieces) on work posture and load. For instance, cutting heavy denim requires greater upper arm force and causes more pronounced wrist angle deviation, yet existing algorithms rely solely on posture angle scores without adjusting risk grades for material properties. On the other hand, specialized tasks within the same process (e.g., flat seaming, overlocking, and bar tacking in sewing) were not distinguished. Different specialized tasks involve distinct equipment operation methods and posture emphases (e.g., bar tacking requires frequent fabric rotation and increased trunk twisting). Future research should use data from these situations to better understand muscle and skeleton injuries in clothing production.
7. Conclusions
This study aims to solve the problems of low efficiency and strong subjectivity in the manual ergonomic evaluation of garment manufacturing workers. It uses the YOLO-SE-CBAM-HRNet algorithm to automate the ergonomic evaluation of three basic garment manufacturing processes: cutting, sewing, and pressing. Using computer vision, it automatically extracts workers’ key postural features (such as upper arm angle, wrist angle, and torso angle) from operation videos and quantitatively evaluates their ergonomic level based on these features. The reliability of the YOLO-SE-CBAM-HRNet-based evaluation was examined by benchmarking its RULA-derived scores against expert assessments. The results demonstrate strong consistency and stability between the algorithm and human experts, confirming its ability to track dynamic ergonomic risks across different processes. It should be noted, however, that the present validation was conducted as a pilot study with a limited number of workers and a restricted set of task scenarios. Consequently, the results should be interpreted as demonstrating initial feasibility rather than full-scale generalizability across the entire garment manufacturing industry. Expanding participant diversity and process variability will be necessary in future work to ensure broader applicability. This study also identifies major ergonomic risk factors in garment production through analysis of the three key processes. These findings support the use of computer vision as a reliable, intelligent tool for ergonomic monitoring in the garment industry.
This study identifies critical ergonomic risk factors inherent in garment production: tasks such as cutting and material handling lead to pronounced trunk forward bending and upper arm flexion, while sewing involves sustained non-neutral wrist postures and prolonged trunk flexion. Pressing tasks further impose repetitive upper-limb loading and sustained trunk flexion, which are known contributors to cumulative musculoskeletal strain. These findings highlight the upper arms, wrists, and torso as the most vulnerable body regions, underpinning the risk of work-related musculoskeletal disorders such as muscle strain, tendonitis, and spinal degeneration.
Overall, this study significantly advances the existing knowledge base in ergonomic risk management for the garment manufacturing sector. It not only deepens understanding of the specific ergonomic risk factors inherent in core garment production processes but also provides a feasible and reliable intelligent assessment tool for industrial applications [67]. The findings offer data-driven decision support for workplace redesign. This includes optimizing workbench height, rearranging equipment, and adding assistive tools to boost safety and efficiency. The YOLO-SE-CBAM-HRNet algorithm can also be used in other apparel production processes. The resulting data can provide broader support for the intelligent transformation of ergonomic management in the garment manufacturing industry.
Author Contributions
Data curation, formal analysis, and investigation, writing—original draft preparation, Y.T.; Conceptualization, methodology, review and editing, supervision, Z.Y.; Supervision, Z.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Informed Consent Statement
Informed consent was obtained from all subjects involved in this study.
Data Availability Statement
The source code of this study is available upon request. However, the garment manufacturing workshop videos used in data collection are not publicly accessible, as this is necessary to protect the privacy and personal information of the workers involved in this research.
Acknowledgments
We thank Jiachuan Ning for his contributions to this work.
Conflicts of Interest
There are no conflicts of interest or competing interests with regard to this study.
Abbreviations
The following abbreviations are used in this manuscript:
| WMSDs | Work-related musculoskeletal disorders |
| SE | Squeeze and Excitation |
| CBAM | Convolutional Block Attention Module |
| HRNet | High-Resolution Network |
| RULA | Rapid Upper Limb Assessment |
| sEMG | surface Electromyography |
| NIRS | Near-Infrared Spectroscopy |
| CNN | Convolutional Neural Network |
| LSTM | Long Short-Term Memory |
| R-CNN | Region-CNN |
| SSD | Single Shot Multibox Detector |
| CSPNet | Cross Stage Partial Network |
| SPPF | Spatial Pyramid Pooling-Fast |
| mAP | mean Average Precision |
| AP | Average Precision |
| IoU | Intersection-over-Union |
| OWAS | Ovako Working Posture Analysis System |
| REBA | Rapid Entire Body Assessment |
| NIOSH | National Institute for Occupational Safety and Health |
| OCRA | Occupational Repetitive Actions |
| MSDs | Musculoskeletal Disorders |
| PAN | Path Aggregation Network |
| FPN | Feature Pyramid Network |
| ROIs | Regions of Interest |
| CAM | Channel Attention Module |
| SAM | Spatial Attention Module |
| TP | True Positive |
| FP | False Positive |
| FN | False Negative |
| NMS | Non-Maximum Suppression |
| SGD | Stochastic Gradient Descent |
| MAE | Mean Absolute Error |
| LoA | Limit of Agreement |
References
- Ahmad, A.; Ahmad, A.; Javed, I.; Jaffri, N.R.; Abrar, U.; Hussain, A. Investigation of ergonomic working conditions of sewing and cutting machine operators of clothing industry. Ind. Textila 2021, 72, 309–314. [Google Scholar] [CrossRef]
- Javed, I.; Ahmad, A.; Nukman, Y.; Ghazilla, R.A.B.; Dawal, S.Z.M.; Rahman, N.I.A. Occupational health and Safety for Sewing Machine Operators in Pakistan: Revealing Hazards in Small and Medium Garment Enterprises. J. Sci. Ind. Res. 2025, 84, 901–913. [Google Scholar] [CrossRef]
- Zhang, H.; Yan, X.Z.; Li, H. Ergonomic posture recognition using 3D view-invariant features from single ordinary camera. Autom. Constr. 2018, 94, 1–10. [Google Scholar] [CrossRef]
- Yu, N.; Hong, L.; Guo, J. Analysis of upper-limb muscle fatigue in the process of rotary handling. Int. J. Ind. Ergon. 2021, 83, 103109. [Google Scholar] [CrossRef]
- Weston, E.B.; Alizadeh, M.; Hani, H.; Knapik, G.G.; Souchereau, R.A.; Marras, W.S. A physiological and biomechanical investigation of three passive upper-extremity exoskeletons during simulated overhead work. Ergonomics 2022, 65, 105–117. [Google Scholar] [CrossRef]
- Zhou, C.M.; Xu, B.H.; Xu, X.; Kaner, J. Exploring the creation of multi-modal soundscapes in the indoor environment: A study of stimulus modality and scene type affecting physiological recovery. J. Build. Eng. 2025, 111, 113327. [Google Scholar] [CrossRef]
- Tian, Y.Y.; Chen, J.Y.; Kim, J.I.; Kim, J. Lightweight deep learning framework for recognizing construction workers’ activities based on simplified node combinations. Autom. Constr. 2024, 158, 105236. [Google Scholar] [CrossRef]
- Babangida, A.A.; Caraballo-Arias, Y.; Decataldo, F.; Violante, F.S. Advancing Occupational Medicine through Wearable Technology: A Review of Sensor Systems for Biomechanical Risk Assessment and Work-Related Musculoskeletal Disorder Prevention. Acs Sens. 2025, 10, 5410–5432. [Google Scholar] [CrossRef]
- Li, H.Y.; Liu, M.J.; Deng, Y.C.; Ou, Z.B.; Talebian, N.; Skitmore, M.; Ge, Y. Automated Kinect-based posture evaluation method for work-related musculoskeletal disorders of construction workers. J. Asian Archit. Build. Eng. 2025, 24, 1731–1743. [Google Scholar] [CrossRef]
- Zhou, D.; Chen, C.Z.; Guo, Z.Y.; Zhou, Q.D.; Song, D.W.; Hao, A.M. A real-time posture assessment system based on motion capture data for manual maintenance and assembly processes. Int. J. Adv. Manuf. Technol. 2024, 131, 1397–1411. [Google Scholar] [CrossRef]
- Yang, J.; Shi, Z.K.; Wu, Z.Y. Vision-based action recognition of construction workers using dense trajectories. Adv. Eng. Inform. 2016, 30, 327–336. [Google Scholar] [CrossRef]
- Ding, L.Y.; Fang, W.L.; Luo, H.B.; Love, P.E.D.; Zhong, B.T.; Ouyang, X. A deep hybrid learning model to detect unsafe behavior: Integrating convolution neural networks and long short-term memory. Autom. Constr. 2018, 86, 118–124. [Google Scholar] [CrossRef]
- Wang, Z.; Men, S.; Bai, Y.; Yuan, Y.; Wang, J.; Wang, K.; Zhang, L. Improved Small Object Detection Algorithm CRL-YOLOv5. Sensors 2024, 24, 6437. [Google Scholar] [CrossRef]
- Agostinelli, T.; Generosi, A.; Ceccacci, S.; Bevilacqua, A.; Mengarelli, F.; Fioretti, S.; Verdini, M.; Cacciagrano, D.; Moccia, S.; Cenci, A. Validation of Computer Vision-Based Ergonomic Risk Assessment Tools for Real Manufacturing Environments. Sci. Rep. 2024, 14, 27785. [Google Scholar] [CrossRef]
- Deshpande, U.U.; Araujo, S.D.C.S.; Deshpande, S.; Mohan, R.E. A Review of Machine Learning Techniques for Ergonomic Risk Assessment Based on Human Pose Estimation. Discov. Artif. Intell. 2025, 5, 287. [Google Scholar] [CrossRef]
- Zou, Z.X.; Chen, K.Y.; Shi, Z.W.; Guo, Y.H.; Ye, J.P. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- He, K.M.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.-Y.; Mark Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Jocher, G. Ultralytics YOLOv5; Ultralytics: Frederick, MD, USA, 2020. [Google Scholar]
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8; Ultralytics: Frederick, MD, USA, 2023. [Google Scholar]
- Wang, C.Y.; Yeh, J.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
- Jocher, G. Ultralytics YOLO11; Ultralytics: Frederick, MD, USA, 2024. [Google Scholar]
- Giri, K.J. SO-YOLOv8: A novel deep learning-based approach for small object detection with YOLO beyond COCO. Expert Syst. Appl. 2025, 280, 127447. [Google Scholar]
- Jing, X.H.; Liu, X.S.; Liu, B.L. Composite Backbone Small Object Detection Based on Context and Multi-Scale Information with Attention Mechanism. Mathematics 2024, 12, 622. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Woo, S.H.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Zheng, C.; Wu, W.H.; Chen, C.; Yang, T.J.N.; Zhu, S.J.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based Human Pose Estimation: A Survey. Acm Comput. Surv. 2024, 56, 11. [Google Scholar] [CrossRef]
- Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
- Xiao, B.; Wu, H.P.; Wei, Y.C. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Wang, J.D.; Sun, K.; Cheng, T.H.; Jiang, B.R.; Deng, C.R.; Zhao, Y.; Liu, D.; Mu, Y.D.; Tan, M.K.; Wang, X.G.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
- Chen, W.; Gu, D.L.; Ke, J.T. Real-time ergonomic risk assessment in construction using a co-learning-powered 3D human pose estimation model. Comput. Aided Civ. Infrastruct. Eng. 2024, 39, 1337–1353. [Google Scholar] [CrossRef]
- Jeong, S.O.; Kook, J. CREBAS: Computer-Based REBA Evaluation System for Wood Manufacturers Using MediaPipe. Appl. Sci. 2023, 13, 938. [Google Scholar] [CrossRef]
- Fang, H.S.; Li, J.F.; Tang, H.Y.; Xu, C.; Zhu, H.Y.; Xiu, Y.L.; Li, Y.L.; Lu, C.W. AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7157–7173. [Google Scholar] [CrossRef] [PubMed]
- Cheng, B.W.; Xiao, B.; Wang, J.D.; Shi, H.H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020. [Google Scholar]
- Dong, C.G.; Tang, Y.H.; Zhang, L.Y. MDA-YOLO Person: A 2D human pose estimation model based on YOLO detection framework. Clust. Comput. J. Netw. Softw. Tools Appl. 2024, 27, 12323–12340. [Google Scholar] [CrossRef]
- Liu, L.; Sun, Y.G.; Li, Y.Y.; Liu, Y.H. A hybrid human fall detection method based on modified YOLOv8s and AlphaPose. Sci. Rep. 2025, 15, 2636. [Google Scholar] [CrossRef]
- Mahendran, S.; Tiwari, R.R. Prevalence of work-related musculoskeletal disorders and quality of life assessment among garment workers in Tiruppur district, Tamil Nadu. Int. J. Occup. Saf. Ergon. 2024, 30, 146–152. [Google Scholar] [CrossRef]
- Su, J.M.; Chang, J.H.; Indrayani, N.L.D.; Wang, C.J. Machine learning approach to determine the decision rules in ergonomic assessment of working posture in sewing machine operators. J. Saf. Res. 2023, 87, 15–26. [Google Scholar] [CrossRef]
- Tahir, N.U.; Long, Z.; Zhang, Z.P.; Asim, M.; Elaffendi, M. PVswin-YOLOv8s: UAV-Based Pedestrian and Vehicle Detection for Traffic Management in Smart Cities Using Improved YOLOv8. Drones 2024, 8, 84. [Google Scholar] [CrossRef]
- Li, R.; Yan, A.; Yang, S.Q.; He, D.; Zeng, X.; Liu, H.Y. Human Pose Estimation Based on Efficient and Lightweight High-Resolution Network (EL-HRNet). Sensors 2024, 24, 396. [Google Scholar] [CrossRef]
- Newell, A.; Huang, Z.; Deng, J. Associative Embedding: End-to-End Learning for Joint Detection and Grouping. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Papandreou, G.; Zhu, T.; Chen, L.C.; Gidaris, S.; Tompson, J.; Murphy, K. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Kocabas, M.; Karagoz, S.; Akbas, E. MultiPoseNet: Fast Multi-Person Pose Estimation Using Pose Residual Network. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards Accurate Multi-person Pose Estimation in the Wild. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Sun, X.; Xiao, B.; Wei, F.Y.; Liang, S.; Wei, Y.C. Integral Human Pose Regression. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Chen, Y.L.; Wang, Z.C.; Peng, Y.X.; Zhang, Z.Q.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J.D.; Soc, I.C. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Li, J.F.; Bian, S.Y.; Zeng, A.L.; Wang, C.; Pang, B.; Liu, W.T.; Lu, C.W. Human Pose Regression with Residual Log-likelihood Estimation. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021. [Google Scholar]
- Li, Y.J.; Zhang, S.K.; Wang, Z.C.; Yang, S.; Yang, W.K.; Xia, S.T.; Zhou, E.J. TokenPose: Learning Keypoint Tokens for Human Pose Estimation. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021. [Google Scholar]
- McAtamney, L.; Nigel Corlett, E. RULA: A survey method for the investigation of work-related upper limb disorders. Appl. Ergon. 1993, 24, 91–99. [Google Scholar] [CrossRef]
- Yuan, Z.; Fang, W.; Zhao, Y.M.; Sheng, V.S. Research of Insect Recognition Based on Improved YOLOv5. J. Artif. Intell. 2021, 3, 145–152. [Google Scholar] [CrossRef]
- Hu, M.; Zhang, Y.R.; Jiao, T.; Xue, H.J.; Wu, X.; Luo, J.G.; Han, S.P.; Lv, H. An Enhanced Feature-Fusion Network for Small-Scale Pedestrian Detection on Edge Devices. Sensors 2024, 24, 7308. [Google Scholar] [CrossRef]
- Liu, C.; Li, Y.Y.; Gu, J.J.; Lou, Y.Q.; Shen, T. Enhancing Pedestrian Target Recognition in Open Community Multi-Scene Spaces Using the Yolo-Stp Network. In Proceedings of the 5th International-Society-for-Photogrammetry-and-Remote-Sensing (ISPRS) Geospatial Week (GSW), Cairo, Egypt, 2–7 September 2023. [Google Scholar]
- Gawande, U.; Hajari, K.; Golhar, Y. Novel person detection and suspicious activity recognition using enhanced YOLOv5 and motion feature map. Artif. Intell. Rev. 2024, 57, 16. [Google Scholar] [CrossRef]
- Tondre, S.; Deshmukh, T. Guidelines to sewing machine workstation design for improving working posture of sewing operator. Int. J. Ind. Ergon. 2019, 71, 37–46. [Google Scholar] [CrossRef]
- Yu, N.; Guo, J.; Hong, L.; Wu, P.W.; Li, J. Study on Fatigue of Workers in the Row Drilling Operation of Furniture Manufacturing Based on Operational Energy Efficiency Analysis. In Proceedings of the 18th International Conference on Man-Machine-Environment System Engineering (MMESE), MMESE Comm China, Nanjing, China, 20–22 October 2018. [Google Scholar]
- Yue, X.Y.; Xiong, X.Q.; Xu, X.T.; Zhang, M. Big data for furniture intelligent manufacturing: Conceptual framework, technologies, applications, and challenges. Int. J. Adv. Manuf. Technol. 2024, 132, 5231–5247. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).