Article

Improving Safety in High-Altitude Work: Semantic Segmentation of Safety Harnesses with CEMFormer

College of Electronics and Information Engineering, Shanghai University of Electric Power, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(11), 1449; https://doi.org/10.3390/sym16111449
Submission received: 7 October 2024 / Revised: 27 October 2024 / Accepted: 28 October 2024 / Published: 1 November 2024
(This article belongs to the Section Computer)

Abstract

The symmetry between production efficiency and safety is a crucial aspect of industrial operations. To enhance the identification of proper safety harness use by workers at height, this study introduces a machine vision approach as a substitute for manual supervision. By focusing on the safety rope that connects the worker to an anchor point, we propose a semantic segmentation mask annotation principle to evaluate proper harness use. We introduce CEMFormer, a novel semantic segmentation model utilizing ConvNeXt as the backbone, which surpasses the traditional ResNet in accuracy. Efficient Multi-Scale Attention (EMA) is incorporated to optimize channel weights and integrate spatial information. Mask2Former serves as the segmentation head, enhanced by Poly Loss for classification and Log-Cosh Dice Loss for mask loss, thereby improving training efficiency. Experimental results indicate that CEMFormer achieves a mean accuracy of 92.31%, surpassing the baseline and five state-of-the-art models. Ablation studies underscore the contribution of each component to the model’s accuracy, demonstrating the effectiveness of the proposed approach in ensuring worker safety.

1. Introduction

In industries such as power and construction, workers frequently need to perform tasks at heights. Such work poses a risk of falls, which may result in considerable harm to workers’ health and safety. To mitigate these risks, regulations mandate the use of personal protective equipment, such as safety harnesses. However, workers may wear safety harnesses improperly for various reasons. Among those injured due to falls, the proportion of workers not wearing safety harnesses correctly is notably high, reaching up to 95.9%, as reported by Anantharaman et al. [1]. To better safeguard workers’ health and safety, continuous monitoring is crucial to ensure proper use of safety harnesses during work at heights and to promptly issue reminders or warnings upon detecting incorrect usage.
Currently, monitoring at engineering sites can be managed by installing surveillance cameras and aggregating the captured video data, enabling a small number of personnel to supervise a large number of workers. Devices such as loudspeakers or radios can be used to issue timely reminders or alerts to workers. However, continuously monitoring multiple screens and observing numerous workers’ conditions places a significant mental burden on supervisors, which may lead to a gradual decline in efficiency and an increase in erroneous judgments.
There exists a symmetry between production efficiency and production safety. While the application of artificial intelligence in production to enhance efficiency is significant and holds great potential, it is equally important to explore the use of AI for safety supervision. In this context, replacing or supplementing human supervision of workers’ safety conditions with machine vision offers a viable alternative. In recent years, advancements in deep learning have led to the gradual application of neural network models across various industries, resulting in continuous improvements in the accuracy of machine vision tasks. Semantic segmentation is a task within machine vision that aims to classify each pixel in an image. It provides pixel-level delineation of object boundaries, precisely outlining their shapes. Given the complex backgrounds in high-altitude work environments, where workers and safety harnesses are relatively small and vary in shape, semantic segmentation is particularly suitable for this scenario.
Previously, we employed a deep learning-based object detection approach to individually detect ground workers, aerial workers, and safety harnesses [2]. Given that safety harnesses are relatively small objects for bounding-box-based object detection, which hinders model learning, and considering the potential to improve dataset annotations, we re-annotated the dataset using semantic segmentation masks. Additionally, we introduced a new class for safety ropes, representing the part of the harness that is secured between the body and the anchor point, to distinguish between correct usage and incorrect usage when the harness is worn but not secured. With the new dataset, we explored the potential of semantic segmentation for harness detection and proposed a novel model, CEMFormer. The primary contributions of this paper are as follows:
  • Given that safety ropes are often overlooked by workers and researchers due to their small scale, we propose a new annotation principle by adding safety ropes as a new category in the dataset and utilizing masks for precise annotation.
  • ConvNeXt is utilized as the backbone; as a next-generation convolutional neural network (CNN) with Transformer-like characteristics, it demonstrates improved accuracy over its predecessor, ResNet.
  • Efficient Multi-Scale Attention (EMA) is integrated at the end of the backbone, combining channel and spatial information in the feature maps to provide the segmentation head with enriched feature representations.
  • Mask2Former is utilized as the segmentation head; it is a versatile architecture designed for various image segmentation tasks, demonstrating excellent performance on our new dataset.
  • Poly Loss is employed as the classification loss, with parameters adjusted according to the dataset to improve the model’s accuracy during training.
  • Log-Cosh Dice Loss is utilized as part of the mask loss to address gradient issues associated with the original Dice Loss, leading to more effective training.
CEMFormer was compared to its baseline and five state-of-the-art (SOTA) models. Ablation studies were conducted to demonstrate the effectiveness of the proposed performance improvements.
In summary, our approach, which incorporates simplified annotations to enhance attention to safety ropes, establishes new detection criteria to protect workers from fall hazards. We have experimentally validated the effectiveness of our annotation method and demonstrated the efficacy of semantic segmentation in detecting safety ropes. This foundational work paves the way for future research aimed at ensuring worker safety.

2. Related Works

2.1. Vision-Based Safety Monitoring

For the critical issue of production safety, numerous researchers have applied the rapidly advancing machine vision technology to safety monitoring. Xiong and Tang [3] employed a lightweight OpenPose model for pose estimation to identify various parts of the human body. They treated the head and upper body as local attention regions for detecting safety helmets and vests and utilized a shallow CNN-based classifier to determine whether the corresponding safety gear was being worn. Khan et al. [4] utilized Mask R-CNN for instance segmentation of workers and scaffolding, subsequently analyzing the presence and use of safety outriggers on scaffolds through an object association detection module to assess worker safety. Gong’s team [5] enhanced YOLOv3 with random erasing data augmentation to detect workers on offshore platforms, applied a regional multi-person pose estimation algorithm to locate keypoints, and then utilized an improved ResNet50 model for deep transfer learning to determine whether workers were wearing helmets and work clothes.
Riaz et al. [6] combined Unet and Swin Transformers to detect helmets and vests, capturing high-quality feature representations. Subsequently, a segmentation self-attention mechanism was employed for the precise localization of personal protective equipment (PPE). Shi and colleagues [7] integrated global attention, a bidirectional feature pyramid, SimC2f, and GhostConv into YOLOv8n for detecting helmets, goggles, vests, and masks, achieving an accuracy improvement with a 13.6% reduction in computational cost. Ludwika and Rifai [8] developed a dataset containing seven types of PPE—helmets, lab coats, safety shoes, masks, protective goggles, earmuffs, and gloves—and two states of proper or improper use for each. They evaluated YOLOv4, YOLOv5, and YOLOv6 on this dataset, with YOLOv5 (trained for 100 epochs) demonstrating the best performance among the three models.
Zhang et al. [9] developed a dataset for six types of PPE used in substations and utilized an improved YOLOv8n, incorporating multi-scale channel attention, EC2f, GhostConv, and adaptive spatial feature fusion modules for PPE detection. The enhanced model achieved a 2.4% increase in accuracy while reducing computational cost by 7.3%. Zaidi’s team [10] utilized YOLOv8 to detect helmets, boots, and vests worn by workers, then employed a rule-based module to verify compliance based on detection results and coordinates, followed by a temporal analysis module to reduce false positives in dynamic environments. Sanjeewani et al. [11] extracted features using MobileNet V2 and conducted detection using SSD, achieving real-time PPE detection on edge devices. Although the proposed model was less accurate than six contemporary models, it demonstrated the highest recall and speed. Chen and Demachi [12] utilized OpenPose for human pose estimation, YOLOv3 for helmet and mask detection, and subsequently compared the PPE and human keypoint positions to determine whether nuclear plant workers were properly wearing PPE. Tang et al. [13] utilized Faster R-CNN to detect workers and PPE, such as helmets and safety harnesses, followed by a human–object interaction model to determine PPE compliance. Although safety harnesses, a key component of fall protection systems, were mentioned, detection results for them were not provided.
The aforementioned studies evaluate worker safety using machine vision methods. Among these, Khan et al. [4] specifically focused on scaffolding used by workers, whereas the other studies concentrated primarily on common PPE. PPE, such as safety helmets, is suitable for various engineering sites, providing protection to workers by preventing head injuries to some extent, thereby contributing to their safety.

2.2. Safety Harness Monitoring Based on Deep Learning

According to [1], falls are the most common cause of severe injuries in the construction industry. Furthermore, in the power industry, which is of particular interest here, falls are also the most frequent cause of fatalities [14]. Safety harnesses are effective in reducing casualties resulting from falls; however, research on safety harnesses remains relatively limited compared to other PPE, such as helmets. Ma et al. [15] applied channel pruning and layer pruning to YOLOv4, enabling the model to run on mobile devices for real-time detection of helmets, reflective vests, and safety harnesses. Additionally, fine-tuning was performed to minimize the impact of pruning on model accuracy. Chen and Demachi [16] adopted an approach similar to that in [12] to determine whether workers were wearing PPE correctly. Unlike [12], they included goggles and safety harnesses as detection targets in the construction site context, addressing potential hazards different from those in nuclear power plants.
Fang’s team [17] utilized Faster R-CNN to detect worker positions, subsequently feeding the detected regions into a deep CNN to identify whether workers were wearing safety harnesses. Although the studies in [15,16,17] detected safety harnesses, they treated the harness as a whole, merely determining its presence without considering whether it was being used correctly. Specifically, they did not consider whether the harness was properly secured, namely, whether the safety rope was connected to a fixed anchor point, which is crucial for providing protection in the event of a fall.
Chern et al. [18] initially utilized YOLOv5 to detect workers and PPE at construction sites, followed by recognizing the working context of the workers and analyzing whether they were using the appropriate PPE correctly. They proposed two methods for working context recognition: a depth estimation model and a semantic segmentation model for work scenes. Although they considered the detailed components of the safety harness, their segmentation approach was excessively fine-grained, potentially leading to recognition errors when parts of the harness were occluded and not fully detectable. Fang’s team [19] utilized object detection and tracking algorithms to identify workers and windows from video footage, then employed a spatial interaction-based classifier to determine whether a worker was transitioning from an indoor to an outdoor high-altitude scenario. Subsequently, they detected PPE for workers entering high-altitude environments and issued an alarm if PPE was not properly worn. The study noted that detecting the safety harness, safety rope, and anchor point as a whole resulted in excessively large bounding boxes, which included a significant amount of irrelevant background, introducing noise and interference when extracting features.
Li and colleagues [20] extracted images from videos, detected worker and PPE positions using YOLOv5, and identified keypoints of workers’ bodies using OpenPose. They then employed a 1D CNN to compare keypoints with PPE positions to determine whether workers were properly wearing helmets and attaching harness hooks to anchor points. However, they did not consider other parts of the safety harness or the possibility of hook occlusion, which could impact the accuracy of their method.

2.3. Deep Learning-Based Semantic Segmentation

In 2015, Long et al. [21] introduced Fully Convolutional Networks (FCNs) for semantic segmentation based on classification networks, marking the beginning of pixel-wise classification as the primary method for semantic segmentation and ushering in an era of deep learning-based approaches. In the same year, U-Net [22] effectively utilized contextual information with its distinctive U-shaped architecture, becoming an early application of FCNs that not only achieved high accuracy in medical image segmentation but also inspired many subsequent models.
Subsequently, as a representative of CNNs, DeepLab proposed by Chen et al. [23] employed atrous convolution to increase the receptive field and introduced atrous spatial pyramid pooling to capture objects and context at multiple scales, evolving into a series of models. Mask R-CNN [24] expanded Faster R-CNN into a multi-task model with a mask segmentation branch, achieving high precision in instance-level segmentation.
After Dosovitskiy’s team [25] integrated Transformers into the field of computer vision, numerous semantic segmentation algorithms based on Transformers emerged. Among them, Mask2Former [26] is an outstanding semantic segmentation algorithm that has been widely applied and improved upon in various domains since its inception.
For example, Zhang et al. [27] developed Mask2Former+ based on Mask2Former to identify rock movement cracks, enabling real-time acquisition of clear crack outlines and outperforming the classical U-Net and DeepLabV3+. García and colleagues [28] fine-tuned Mask2Former, initialized with a backbone pre-trained on the Cityscapes dataset, for segmentation of photovoltaic panels, achieving an accuracy of 98.38% and surpassing eight existing models. Li et al. [29] expanded the lychee branch dataset using denoising diffusion probabilistic models and trained Mask2Former on the augmented dataset for segmentation in outdoor backgrounds. They employed specified thresholds in the Hue, Saturation, and Value color space to prevent Mask2Former from extracting incorrect regions, ultimately generating multi-view images of lychees with Wonder3D and achieving an average pixel accuracy of 85.82%. Guo’s team [30] proposed IQ2Former, which integrated a query scenario module into Mask2Former to enhance scene adaptability and strengthened feature extraction through a query attention module, resulting in high-quality semantic segmentation of remote sensing images that outperformed the baseline and six other models.
The widespread use of Mask2Former demonstrates its effectiveness and generalizability; however, in the field of safety harness segmentation, the large-scale differences between masks necessitate further improvements.

3. Method

This section initially introduces the overall structure of CEMFormer. Subsequently, we provide a detailed explanation of the new criteria established for safety harness and safety rope detection, as well as modifications made to the original dataset based on these criteria. We also describe the model’s backbone, ConvNeXt; the integrated Efficient Multi-Scale Attention (EMA) mechanism; the detection head, Mask2Former; and the improvements made to the loss function within the detection head. The overall structure of the model is illustrated in Figure 1.
A three-channel image with a size of 1024 × 1024 pixels (3 × 1024 × 1024) is input into CEMFormer. After passing through Conv1 and Layer Norm1, a feature map of size 96 × 256 × 256 is obtained. The feature map is then processed by the ConvNeXt block, retaining its dimensions and serving as the backbone for outputting multi-scale feature maps. As the feature map passes through Downsample Modules 1, 2, and 3—each consisting of Layer Norm and a 2D convolution with a kernel size of 2 × 2—the feature map sizes change to 192 × 128 × 128, 384 × 64 × 64, and 768 × 32 × 32, respectively. The feature maps output from the backbone are fed into the EMA module for cross-dimensional interaction before being used as input for Mask2Former.
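To make these feature-map sizes concrete, the following PyTorch sketch reproduces the stem (Conv1 + Layer Norm1) and the three Downsample Modules with the channel widths stated above (96, 192, 384, 768). The LayerNorm2d helper is an illustrative assumption rather than the exact implementation used in CEMFormer.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    # Layer normalization over the channel dimension of an NCHW feature map
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)          # NCHW -> NHWC
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)       # NHWC -> NCHW

# Stem: Conv1 (4 x 4 kernel, stride 4) followed by Layer Norm1
stem = nn.Sequential(nn.Conv2d(3, 96, kernel_size=4, stride=4), LayerNorm2d(96))

def downsample_module(c_in, c_out):
    # Downsample Module: LayerNorm + 2 x 2 convolution with stride 2 halves H and W
    return nn.Sequential(LayerNorm2d(c_in), nn.Conv2d(c_in, c_out, kernel_size=2, stride=2))

x  = torch.randn(1, 3, 1024, 1024)
f1 = stem(x)                               # 1 x  96 x 256 x 256
f2 = downsample_module(96, 192)(f1)        # 1 x 192 x 128 x 128
f3 = downsample_module(192, 384)(f2)       # 1 x 384 x  64 x  64
f4 = downsample_module(384, 768)(f3)       # 1 x 768 x  32 x  32
```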
The multi-scale feature maps are fed into Mask2Former, both into the Pixel Decoder for multi-scale feature fusion and pixel-level prediction and into the Transformer Decoder for single-scale predictions. In the Pixel Decoder, the multi-scale feature maps are unified to 256 channels via convolution and, after positional encoding, are fed into the DETR Encoder to capture global context. The fused feature map is then obtained via nearest-neighbor interpolation and is subsequently used for mask prediction. In the Transformer Decoder, each scale of the feature map is fed into a different Decoder layer, where class and mask predictions are made together with the input query features, which are associated with positional encoding.
In Figure 1, the three convolution modules utilize different kernel sizes: Conv1 utilizes a 4 × 4 kernel, Conv2 utilizes a 1 × 1 kernel, and Conv3 utilizes a 3 × 3 kernel. Both the Downsample Modules and Conv Modules employ 2D convolutions. The Positional Encoding module uses sinusoidal positional encoding. The modifications to the dataset, along with the details of the ConvNeXt Block, EMA, Decoder, and DETR Encoder in the model, as well as the changes to the loss function, will be introduced in the following subsections.

3.1. New Dataset

The original dataset was annotated using detection bounding boxes for ground-level workers (Ground), workers at height (Offground), and safety harnesses (Safebelt). An example of the annotations, along with an enlarged view of a local region, is presented in Figure 2.
In Figure 2a, the worker is wearing the safety harness without securing the safety rope, indicating improper use of the harness. The annotations in this dataset indicate only the presence of the safety harness and whether the worker is wearing it; however, they cannot determine whether the harness is being used correctly. Consequently, this limitation hinders the accurate assessment of the worker’s safety risk.
To address this issue, we propose a new annotation principle. For workers at height, the portion of the safety harness that extends beyond the worker’s outline and is secured to the anchor point is annotated as Saferope, while the remaining parts are still annotated as Safebelt. For workers on the ground, all portions of the safety harness are labeled as Safebelt. Considering the frequent occurrence of occlusion, we opted not to further subdivide the harness, allowing the model to learn to identify categories even when a significant portion is occluded, and to avoid cases where the finer structures are completely hidden. Annotations are created using polygonal masks to minimize noise and facilitate model learning. Any areas not manually annotated are automatically labeled as background. An example of the new dataset annotations, along with an enlarged view of a local region, is presented in Figure 3, where the background represents unannotated areas.

3.2. ConvNeXt

ConvNeXt was used as the backbone to extract features from the images. ConvNeXt, introduced by Liu et al. [31], is a modernized CNN based on ResNet [32]. As an outstanding backbone that has emerged in recent years, it has been widely utilized and validated in various fields, including skin disease classification [33], peptide-binding affinity prediction [34], identification of winter wheat seedlings [35], and rolling bearing defect diagnosis [36]. The structure of the ConvNeXt Block, the fundamental module of ConvNeXt depicted in Figure 1, is presented in Figure 4.
In the ConvNeXt Block, the depthwise convolution utilizes a 7 × 7 kernel without altering the dimensions of the feature map. The other two convolution layers utilize a 1 × 1 kernel: Conv1 expands the number of channels in the feature map by four times, while Conv2 restores the expanded channels to their original size.
Compared to ResNet, ConvNeXt integrates design elements inspired by Transformers, resulting in the following adjustments:
  • The number of blocks in each stage has been adjusted from (3, 4, 6, 3) to (3, 3, 9, 3), where each block represents the ConvNeXt Block depicted in Figure 4. This implies that the basic structure of the ConvNeXt Block is repeated three, three, nine, and three times in ConvNeXt Blocks 1, 2, 3, and 4, respectively.
  • The convolution layer that processes the input image, Conv1 in Figure 1, has been adjusted from a 7 × 7 kernel with a stride of 2 to a 4 × 4 kernel with a stride of 4.
  • Depthwise convolution is employed in the ConvNeXt Block in place of standard convolution.
  • An inverted bottleneck structure is employed, wherein the feature map in the ConvNeXt Block first undergoes depthwise convolution without altering the number of channels, followed by two regular convolutions that sequentially expand the channels by four times and restore them.
  • A large convolution kernel is employed in the ConvNeXt Block, specifically a 7 × 7 kernel for depthwise convolution.
  • GELU is used in place of ReLU as the activation function, and the number of activation functions within the model has been reduced.
  • Batch normalization is replaced with layer normalization, and the use of normalization layers is reduced.
  • Downsampling is introduced between stages, as depicted in Downsample in Figure 1.
With these modifications, ConvNeXt, as a CNN, integrates advanced design features from Transformers while simplifying its structure, thereby improving accuracy compared to the baseline ResNet.
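As a concrete reference for the block structure in Figure 4, the sketch below implements the ConvNeXt Block described above: a 7 × 7 depthwise convolution, layer normalization, a 1 × 1 expansion by a factor of four, GELU, a 1 × 1 projection back, and a residual connection. Optional details of the public ConvNeXt implementation, such as layer scale and stochastic depth, are omitted here.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    # Minimal sketch of the ConvNeXt Block (no layer scale / stochastic depth)
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # 7x7 depthwise conv
        self.norm = nn.LayerNorm(dim)            # layer normalization instead of batch normalization
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # Conv1: 1x1 conv (as linear), expands channels x4
        self.act = nn.GELU()                     # GELU instead of ReLU
        self.pwconv2 = nn.Linear(4 * dim, dim)   # Conv2: 1x1 conv, restores the channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x                      # residual connection
```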

3.3. Efficient Multi-Scale Attention (EMA)

In our previous research, we integrated the Efficient Multi-Scale Attention (EMA) proposed by Ouyang et al. [37] into an object detection network, which led to improved accuracy. At the conclusion of [37], the authors suggested that EMA could also be applied to tasks such as semantic segmentation. Inspired by this, we reintroduced EMA to improve feature extraction. Additionally, following the recommendation in [38], EMA was placed at the end of the backbone. Figure 5 presents the internal structure of EMA.
After the feature map is input into EMA, it is divided into n groups along the channel (C) dimension, and each group is subsequently split into two branches for computation. In this study, n is set to 8. Conv1 utilizes a 3 × 3 kernel, forming the “3 × 3 branch”, while Conv2 utilizes a 1 × 1 kernel, forming the “1 × 1 branch”. In the 1 × 1 branch, the feature map undergoes symmetric average pooling along both the width (W) and height (H) directions, effectively capturing positional information along these two orthogonal directions. The pooled results are concatenated along the W (H) dimension, resulting in a size of C//n × (W + H), and subsequently passed through a convolution. Conv2 does not alter the dimensions of the feature map, thereby ensuring no reduction in dimensionality and preventing information loss. The output of the convolution is split back into two vectors along the concatenated direction, each passing through a sigmoid function to approximate a two-dimensional binomial distribution. These vectors are subsequently multiplied to facilitate inter-channel interaction.
The 3 × 3 branch utilizes Conv1 to capture multi-scale features, after which it enters the cross-spatial learning module. In cross-spatial learning, the outputs of the two branches pass through a symmetric structure consisting of 2D global average pooling for spatial information encoding, a softmax function, and multiplication of the resulting vectors to generate two spatial attention maps. At the end of EMA, the two spatial attention maps are summed, and a sigmoid function is applied to map the output features according to the attention weights.
Throughout the entire EMA, these symmetric structures aid in capturing spatial semantic information and facilitating inter-channel interaction, thereby providing pixel-level global context for the feature map. Furthermore, the number of channels remains unchanged, preventing information loss due to channel reduction.
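The sketch below follows the EMA structure described above (channel grouping, a 1 × 1 branch with directional average pooling, a 3 × 3 branch, and cross-spatial learning). It mirrors the reference implementation of Ouyang et al. [37] as we understand it, so details such as the group normalization step should be checked against the original code before reuse.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    # Efficient Multi-Scale Attention, grouped along the channel dimension (factor = n groups)
    def __init__(self, channels: int, factor: int = 8):
        super().__init__()
        self.groups = factor
        assert channels % self.groups == 0
        c = channels // self.groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))        # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average pooling along W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average pooling along H
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)  # 1x1 branch
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)  # 3x3 branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        g = x.reshape(b * self.groups, -1, h, w)                       # (b*n, C//n, H, W)
        x_h = self.pool_h(g)                                           # (b*n, C//n, H, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)                       # (b*n, C//n, W, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))                # concat -> C//n x (H+W)
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())  # 1x1 branch output
        x2 = self.conv3x3(g)                                           # 3x3 branch output
        # Cross-spatial learning: pooled+softmax vectors of one branch weight the other branch
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (torch.matmul(x11, x12) + torch.matmul(x21, x22)).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, c, h, w)             # channel count unchanged
```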

3.4. Mask2Former

Mask2Former, introduced by Cheng et al. [26], aims to provide a flexible, general-purpose, and high-performance architecture for various image segmentation tasks, including semantic segmentation and instance segmentation. The overall structure of Mask2Former is presented in Figure 1, where the DETR Encoder in the Pixel Decoder and the Decoder in the Transformer Decoder share a similar structure, imparting a sense of symmetry to Mask2Former’s design. DETR was introduced by Carion’s team [39], and the structure of the DETR Encoder is depicted in Figure 6.
After the feature map enters the DETR Encoder, it is divided into three components: Key (K), Query (Q), and Value (V). K and V are associated with positional encoding, and these three components are then used as inputs to the multi-head self-attention (MHSA) module. The output of the MHSA is added to the feature map and undergoes layer normalization, after which the resulting feature map is split into two branches. One branch passes through a feed-forward network (FFN) and is then added to the other branch, followed by another layer normalization before output. The Pixel Decoder captures global information via the DETR Encoder, which is subsequently used for the final pixel-level mask prediction.
The Transformer Decoder of Mask2Former primarily consists of three Decoder blocks, and the structure of each Decoder block is depicted in Figure 7.
The structure of the Decoder is similar to that of the DETR Encoder, but it incorporates masked attention along with corresponding layer normalization, serving as a substitute for cross-attention. This adjustment directs the model’s attention more specifically to the foreground.
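To illustrate the masked-attention idea, the following sketch (our own simplified formulation, not the authors’ code) converts the mask predictions from the previous Decoder layer into an additive attention bias so that each query attends only to its predicted foreground region.

```python
import torch

def masked_attention_bias(mask_logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """mask_logits: (B, Q, H, W) mask predictions from the previous Decoder layer.

    Returns a (B, Q, H*W) additive bias: 0 at predicted foreground locations and -inf at
    background locations, so the softmax in the attention layer ignores the background.
    Full implementations also handle the case where a query predicts no foreground at all.
    """
    background = mask_logits.sigmoid().flatten(2) < threshold   # True where background
    bias = torch.zeros_like(mask_logits.flatten(2))
    bias.masked_fill_(background, float("-inf"))
    return bias   # added to the query-key attention logits before the softmax
```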

3.5. Loss Function

The loss function of Mask2Former is represented by Equation (1).
$L = \lambda_{ce} L_{ce} + \lambda_{dice} L_{dice} + \lambda_{cls} L_{cls}$   (1)
Here, $L_{ce}$ represents the cross-entropy loss, $L_{dice}$ denotes the Dice Loss, and $L_{cls}$ refers to the classification loss. The weights $\lambda_{ce}$, $\lambda_{dice}$, and $\lambda_{cls}$ are set to 5, 5, and 2, respectively. The term $\lambda_{ce} L_{ce} + \lambda_{dice} L_{dice}$ constitutes the mask loss. $L_{ce}$ is a binary cross-entropy loss, defined by Equation (2).
$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log(\sigma(p_i)) + (1 - y_i)\log(1 - \sigma(p_i)) \right]$   (2)
In this equation, $y_i$ represents the ground-truth value of sample $i$, which is either 0 or 1, $\sigma(p_i)$ is the prediction for sample $i$ after the sigmoid function, and $N$ is the total number of samples.
$L_{cls}$ uses the standard cross-entropy loss, defined by Equation (3).
$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(p_{i,c})$   (3)
In this equation, $y_{i,c}$ represents the ground-truth value of sample $i$ for class $c$, $p_{i,c}$ is the probability that sample $i$ is predicted as class $c$, $C$ is the total number of classes, and $N$ is the total number of samples. Leng et al. [40] observed that, when the cross-entropy loss is viewed through its Taylor expansion as a weighted sum of polynomial terms, the default polynomial coefficients are not optimal for every task under gradient-descent optimization. They proposed that adjusting only the first polynomial coefficient can achieve significant gains with minimal effort, leading to a new loss function, Poly Loss, expressed in Equation (4).
$L_{poly} = \frac{1}{N}\sum_{i=1}^{N}\left[ -\sum_{c=1}^{C} y_{i,c}\log(p_{i,c}) + \varepsilon\left(1 - \sum_{c=1}^{C} p_{i,c}\, y_{i,c}\right) \right]$   (4)
In this equation, $\varepsilon$ is a hyperparameter that adjusts the first polynomial coefficient. We used Poly Loss instead of cross-entropy loss as the classification loss for the model and found that setting $\varepsilon$ to −1 effectively improved the model’s accuracy.
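As a minimal sketch, assuming softmax class probabilities and integer pixel labels, Poly Loss can be written as the cross-entropy term plus an $\varepsilon$-weighted $(1 - p_t)$ term:

```python
import torch
import torch.nn.functional as F

def poly_loss(logits: torch.Tensor, targets: torch.Tensor, epsilon: float = -1.0) -> torch.Tensor:
    # logits: (N, C) class scores; targets: (N,) integer class labels
    ce = F.cross_entropy(logits, targets, reduction="none")                    # -log(p_t) per sample
    pt = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)  # p_t per sample
    return (ce + epsilon * (1.0 - pt)).mean()                                  # Equation (4), eps = -1
```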
$L_{dice}$ uses the Dice Loss [41], defined by Equation (5).
$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} p_i y_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} y_i^2}$   (5)
In this equation, $p_i$ represents the predicted value of pixel $i$, $y_i$ is the ground-truth value of pixel $i$, and $N$ is the total number of pixels. Jadon [42] argued that, due to the non-convexity of the Dice coefficient, Dice Loss cannot achieve optimal results. Building on Dice Loss, he proposed Log-Cosh Dice Loss, which retains the symmetry of Dice while being smooth, thereby facilitating gradient computation. The expression for Log-Cosh Dice Loss is presented in Equation (6).
$L_{lcdce} = \log(\cosh(L_{dice}))$   (6)
We used Log-Cosh Dice Loss to replace the original Dice Loss. The final expression for our loss function is given by Equation (7).
$L = \lambda_{ce} L_{ce} + \lambda_{lcdce} L_{lcdce} + \lambda_{poly} L_{poly}$   (7)
Following [26], we retained the coefficients from the original loss function, with $\lambda_{ce}$, $\lambda_{lcdce}$, and $\lambda_{poly}$ set to 5, 5, and 2, respectively.
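A minimal sketch of the mask-loss terms in Equations (5) and (6), assuming flattened per-pixel sigmoid probabilities and binary ground truth; the small eps term is an assumption added for numerical stability:

```python
import torch

def dice_loss(probs: torch.Tensor, targets: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # probs, targets: flattened (N,) per-pixel predictions in [0, 1] and binary labels
    numerator = 2.0 * (probs * targets).sum()
    denominator = (probs ** 2).sum() + (targets ** 2).sum() + eps
    return 1.0 - numerator / denominator                        # Equation (5)

def log_cosh_dice_loss(probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return torch.log(torch.cosh(dice_loss(probs, targets)))     # Equation (6)
```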

4. Experiments

4.1. Dataset

The dataset used in this study consists of images from a power plant in Guangdong, China, with annotated categories including ground-level workers (Ground), workers at height (Offground), safety harnesses worn by workers (Safebelt), safety ropes secured at anchor points (Saferope), and other regions (Background). The annotation principles were introduced in Section 3.1, and an example of the annotations is presented in Figure 3. The dataset contains a total of 2016 images, with 1413 used for the training set, 198 for the validation set, and the remaining 404 for the test set. All models mentioned in this paper were trained on the training set, validated on the validation set, and evaluated on the test set for final results.

4.2. Evaluation Metrics

Rather than focusing on whether the predicted mask shape closely matches the ground truth, our primary concern is whether the model can accurately classify each pixel—specifically, whether it can determine from the visible portion of the safety rope in the image if it is being used correctly, thereby assessing whether the worker is properly secured. Therefore, Accuracy (Acc) and mean Accuracy (mAcc) are used as evaluation metrics for the model. Their respective expressions are provided in Equations (8) and (9).
$Acc_i = \frac{TP_i}{TP_i + FN_i}$   (8)
$mAcc = \frac{1}{C}\sum_{i=1}^{C} Acc_i$   (9)
In these equations, $Acc_i$ represents the Accuracy of class $i$, $TP_i$ is the number of pixels correctly predicted by the model for class $i$, $FN_i$ is the number of pixels of class $i$ incorrectly predicted as other classes, and $C$ is the total number of classes.
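A minimal sketch of how Acc and mAcc in Equations (8) and (9) can be computed from predicted and ground-truth label maps; this hypothetical helper is illustrative and not the mmsegmentation evaluation code:

```python
import numpy as np

def per_class_accuracy(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    # pred, gt: integer label maps of identical shape
    accs = []
    for c in range(num_classes):
        gt_c = (gt == c)
        total = gt_c.sum()                           # TP_c + FN_c: pixels whose ground truth is class c
        tp = np.logical_and(pred == c, gt_c).sum()   # pixels of class c predicted correctly
        accs.append(tp / total if total > 0 else np.nan)
    return accs, float(np.nanmean(accs))             # per-class Acc and mAcc
```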

4.3. Environment and Parameter Settings

All experiments in this study were conducted on a 64-bit Windows 11 operating system, featuring an NVIDIA GeForce RTX 4090 D GPU and an AMD Ryzen 7 7800X3D CPU. Python 3.8, PyTorch 2.0.1, and mmsegmentation 1.2.2 were used as the experimental environment. All models were trained for 200 epochs, with batch sizes ranging from two to sixteen depending on GPU memory usage. For the first 10 epochs, the learning rate was set to 0.1, and a polynomial learning rate scheduler was subsequently used to gradually reduce the learning rate to zero. The input image size for the models was set to 1024 × 1024.
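The schedule above can be sketched as a constant learning rate during the warmup epochs followed by polynomial decay toward zero. The decay power of 0.9 and the constant (rather than linearly ramped) warmup are assumptions, not values stated in the text:

```python
def poly_lr(base_lr: float, epoch: int, warmup_epochs: int = 10,
            total_epochs: int = 200, power: float = 0.9) -> float:
    # Constant learning rate during warmup, then polynomial decay toward zero
    if epoch < warmup_epochs:
        return base_lr
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * (1.0 - progress) ** power
```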

4.4. Results

CEMFormer was compared with its baseline, Mask2Former (using ResNet50 as the backbone), as well as five SOTA models: PIDNet [43], SegNext [44], PoolFormer [45], ConvNeXt (with Uper Head [46] as the segmentation head), and DDRNet [47]. The comparison results are presented in Table 1, where the variants of all models are also listed.
The mAcc of CEMFormer is 92.31%, surpassing PIDNet, SegNext, PoolFormer, ConvNeXt, DDRNet, and the baseline by 3.96%, 5.35%, 5.00%, 3.72%, 28.89%, and 3.15%, respectively. For individual class accuracies, compared to the baseline, our model shows decreases of 0.21%, 1.67%, and 0.49% for the Background, Offground, and Ground classes, respectively. However, it achieves improvements of 4.59% and 13.54% in the Safebelt and Saferope classes. This trade-off is considered worthwhile, particularly since the Acc for the Saferope class reaches 91.68%, which is our primary focus. Except for DDRNet, all other SOTA models achieved an mAcc above 85%, indicating the effectiveness of our annotation principle. The Time column indicates the average inference time per image for each model on the test set. Our model averages 119.1 ms per image, matching SegNext, second only to PIDNet (95.2 ms, 23.9 ms faster), and outperforming the other four models, including the baseline. However, achieving real-time detection will require targeted lightweight optimizations. Example images and the semantic segmentation results obtained using CEMFormer are presented in Figure 8 and Figure 9, respectively.
Figure 8 and Figure 9 demonstrate that, after training on the dataset with the new annotation principle, our model effectively distinguishes between workers in different positions, safety harnesses, and safety ropes, achieving good segmentation performance.

4.5. Ablation Study

To assess the effectiveness of all the improvements made to the model in enhancing its accuracy, an ablation study was conducted. The ablation study results are presented in Table 2. In the table, “√” indicates that the model includes the corresponding component in the experiment, whereas “×” indicates that the model does not include the component, using the corresponding structure from the baseline instead.
When comparing Experiments 1 and 2 in Table 2, replacing Log-Cosh Dice Loss with the baseline Dice Loss resulted in a 2.29% decrease in mAcc. This decrease is primarily attributed to the decline in Acc for the Offground, Safebelt, and Saferope classes, which dropped by 6.36%, 4.87%, and 2.81%, respectively. This suggests that the smoother Log-Cosh Dice Loss offers an advantage during training. When comparing Experiments 2 and 3, reverting Poly Loss to cross-entropy loss reduced mAcc by 0.3%. Cross-entropy loss exhibited advantages for the Background, Offground, and Safebelt classes, while Poly Loss significantly improved classification accuracy for Saferope by 8.69%, representing a substantial improvement for identifying safety ropes.
Similarly, removing EMA from Experiment 3 resulted in a significant loss in Acc, particularly for the Saferope class, which dropped from 80.18% to 77.05% in Experiment 4, representing a decrease of 3.13%. This reduction is attributed to the absence of the enriched feature maps provided by EMA, making it more challenging to segment small-scale objects such as safety ropes. Lastly, although ConvNeXt, as a backbone incorporating numerous advanced designs, does not offer a distinct advantage, it is still worth using compared to the classic ResNet. When comparing Experiments 4 and 5, removing ConvNeXt resulted in changes in Acc for individual classes within 2%, with an overall mAcc decrease of 0.28%.

5. Conclusions

This paper proposes the following new annotation principle: for workers at height, the portion of the safety harness that extends from the anchor point to the worker’s outline is categorized as Saferope, and the dataset is re-annotated accordingly. Based on this new dataset, the semantic segmentation model CEMFormer is proposed. The proposed model employs ConvNeXt as the backbone, Mask2Former as the segmentation head, and integrates EMA between the backbone and the head. Poly Loss is employed for classification loss in the segmentation head, while binary cross-entropy loss and Log-Cosh Dice Loss are employed for mask loss. Experimental results indicate that the proposed model outperforms both the baseline and SOTA models. Ablation studies demonstrate the contribution of each enhancement to model accuracy.
Nevertheless, there is still room for improvement in our approach. The dataset only includes annotations for workers and safety harnesses, without labeling other objects, such as platforms, utility poles, sky, and ground. Although the model achieved high accuracy under these conditions, the dataset does not facilitate learning the worker’s position in relation to the ground or height, which may limit further improvements in accuracy. Therefore, enhancing the dataset will be prioritized. Additionally, semantic segmentation tends to use a single mask to cover multiple interacting objects, which is suboptimal for scenarios involving multiple workers collaborating. The previous object detection algorithm will be used as a complement to CEMFormer—specifically, to detect large-scale objects such as human bodies to determine each worker’s location, while delegating smaller-scale objects, such as safety harnesses, to CEMFormer. Lastly, since this study focused on exploring the feasibility of the new annotation principle, model parameters and computational complexity were not considered. Evaluating computational cost and lightweighting the model will be prioritized in our future research.

Author Contributions

Methodology, D.L.; software, Q.Z.; validation, Q.Z.; resources, D.L.; data curation, Q.Z.; writing—original draft preparation, Q.Z.; writing—review and editing, D.L.; visualization, Q.Z.; supervision, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy and confidentiality concerns, as the dataset used in this study contains a substantial number of identifiable human images.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Anantharaman, V.; Zuhary, T.; Ying, H.; Krishnamurthy, N. Characteristics of injuries resulting from falls from height in the construction industry. Singap. Med. J. 2023, 64, 237. [Google Scholar] [CrossRef] [PubMed]
  2. Zhou, Q.; Liu, D.; An, K. ESE-YOLOv8: A Novel Object Detection Algorithm for Safety Belt Detection during Working at Heights. Entropy 2024, 26, 591. [Google Scholar] [CrossRef] [PubMed]
  3. Xiong, R.; Tang, P. Pose Guided Anchoring for Detecting Proper Use of Personal Protective Equipment. Autom. Constr. 2021, 130, 103828. [Google Scholar] [CrossRef]
  4. Khan, N.; Saleem, M.R.; Lee, D.; Park, M.-W.; Park, C. Utilizing Safety Rule Correlation for Mobile Scaffolds Monitoring Leveraging Deep Convolution Neural Networks. Comput. Ind. 2021, 129, 103448. [Google Scholar] [CrossRef]
  5. Gong, F.; Ji, X.; Gong, W.; Yuan, X.; Gong, C. Deep Learning Based Protective Equipment Detection on Offshore Drilling Platform. Symmetry 2021, 13, 954. [Google Scholar] [CrossRef]
  6. Riaz, M.; He, J.; Xie, K.; Alsagri, H.S.; Moqurrab, S.A.; Alhakbani, H.A.A.; Obidallah, W.J. Enhancing Workplace Safety: PPE_Swin—A Robust Swin Transformer Approach for Automated Personal Protective Equipment Detection. Electronics 2023, 12, 4675. [Google Scholar] [CrossRef]
  7. Shi, C.; Zhu, D.; Shen, J.; Zheng, Y.; Zhou, C. GBSG-YOLOv8n: A Model for Enhanced Personal Protective Equipment Detection in Industrial Environments. Electronics 2023, 12, 4628. [Google Scholar] [CrossRef]
  8. Ludwika, A.S.; Rifai, A.P. Deep Learning for Detection of Proper Utilization and Adequacy of Personal Protective Equipment in Manufacturing Teaching Laboratories. Safety 2024, 10, 26. [Google Scholar] [CrossRef]
  9. Zhang, H.; Mu, C.; Ma, X.; Guo, X.; Hu, C. MEAG-YOLO: A Novel Approach for the Accurate Detection of Personal Protective Equipment in Substations. Appl. Sci. 2024, 14, 4766. [Google Scholar] [CrossRef]
  10. Zaidi, S.F.A.; Yang, J.; Abbas, M.S.; Hussain, R.; Lee, D.; Park, C. Vision-Based Construction Safety Monitoring Utilizing Temporal Analysis to Reduce False Alarms. Buildings 2024, 14, 1878. [Google Scholar] [CrossRef]
  11. Sanjeewani, P.; Neuber, G.; Fitzgerald, J.; Chandrasena, N.; Potums, S.; Alavi, A.; Lane, C. Real-Time Personal Protective Equipment Non-Compliance Recognition on AI Edge Cameras. Electronics 2024, 13, 2990. [Google Scholar] [CrossRef]
  12. Chen, S.; Demachi, K. A Vision-Based Approach for Ensuring Proper Use of Personal Protective Equipment (PPE) in Decommissioning of Fukushima Daiichi Nuclear Power Station. Appl. Sci. 2020, 10, 5129. [Google Scholar] [CrossRef]
  13. Tang, S.; Roberts, D.; Golparvar-Fard, M. Human-Object Interaction Recognition for Automatic Construction Site Safety Inspection. Autom. Constr. 2020, 120, 103356. [Google Scholar] [CrossRef]
  14. 2020 Research on the Development Status of China’s Electric Power Industry and Analysis of Accident Casualties, with Hunan Having the Highest Casualty Numbers. Available online: https://www.huaon.com/channel/trend/699986.html (accessed on 17 September 2024).
  15. Ma, L.; Li, X.; Dai, X.; Guan, Z.; Lu, Y. A Combined Detection Algorithm for Personal Protective Equipment Based on Lightweight YOLOv4 Model. Wirel. Commun. Mob. Comput. 2022, 3574588. [Google Scholar] [CrossRef]
  16. Chen, S.; Demachi, K. Towards On-Site Hazards Identification of Improper Use of Personal Protective Equipment Using Deep Learning-Based Geometric Relationships and Hierarchical Scene Graph. Autom. Constr. 2021, 125, 103619. [Google Scholar] [CrossRef]
  17. Fang, W.; Ding, L.; Luo, H.; Love, P.E.D. Falls from Heights: A Computer Vision-Based Approach for Safety Harness Detection. Autom. Constr. 2018, 91, 53–61. [Google Scholar] [CrossRef]
  18. Chern, W.-C.; Hyeon, J.; Nguyen, T.V.; Asari, V.K.; Kim, H. Context-Aware Safety Assessment System for Far-Field Monitoring. Autom. Constr. 2023, 149, 104779. [Google Scholar] [CrossRef]
  19. Fang, Q.; Li, H.; Luo, H.; Ding, L.; Luo, X.; Li, C. Computer Vision Aided Inspection on Falling Prevention Measures for Steeplejacks in an Aerial Environment. Autom. Constr. 2018, 93, 148–164. [Google Scholar] [CrossRef]
  20. Li, J.; Zhao, X.; Zhou, G.; Zhang, M. Standardized Use Inspection of Workers’ Personal Protective Equipment Based on Deep Learning. Saf. Sci. 2022, 150, 105689. [Google Scholar] [CrossRef]
  21. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  22. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  23. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  26. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  27. Zhang, J.; Song, Y.; Ren, K.; Liu, Y.; Yue, Z. Mutual feedback between Mask2former and crack information under dynamic rock fractures. Theor. Appl. Fract. Mech. 2024, 133, 104602. [Google Scholar] [CrossRef]
  28. García, G.; Aparcedo, A.; Nayak, G.K.; Ahmed, T.; Shah, M.; Li, M. Generalized deep learning model for photovoltaic module segmentation from satellite and aerial imagery. Solar Energy 2024, 274, 112539. [Google Scholar] [CrossRef]
  29. Li, Y.; Wang, J.; Liang, M.; Song, H.; Liao, J.; Lan, Y. A Novel Two-Stage Approach for Automatic Extraction and Multi-View Generation of Litchis. Agriculture 2024, 14, 1046. [Google Scholar] [CrossRef]
  30. Guo, S.; Yang, Q.; Xiang, S.; Wang, S.; Wang, X. Mask2Former with Improved Query for Semantic Segmentation in Remote-Sensing Images. Mathematics 2024, 12, 765. [Google Scholar] [CrossRef]
  31. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Hao, S.; Zhang, L.; Jiang, Y.; Wang, J.; Ji, Z.; Zhao, L.; Ganchev, I. ConvNeXt-ST-AFF: A Novel Skin Disease Classification Model Based on Fusion of ConvNeXt and Swin Transformer. IEEE Access 2023, 11, 117460–117473. [Google Scholar] [CrossRef]
  34. Zhang, L.; Song, W.; Zhu, T.; Liu, Y.; Chen, W.; Cao, Y. ConvNeXt-MHC: Improving MHC–Peptide Affinity Prediction by Structure-Derived Degenerate Coding and the ConvNeXt Model. Brief. Bioinform. 2024, 25, bbae133. [Google Scholar] [CrossRef]
  35. Liu, C.; Yin, Y.; Qian, R.; Wang, S.; Xia, J.; Zhang, J.; Zhao, L. Enhanced Winter Wheat Seedling Classification and Identification Using the SETFL-ConvNeXt Model: Addressing Overfitting and Optimizing Training Strategies. Agronomy 2024, 14, 1914. [Google Scholar] [CrossRef]
  36. Zhao, Y.; Liang, Q.; Tian, Z. ConvNeXt-BiGRU Rolling Bearing Fault Detection Based on Attention Mechanism. In International Conference on Intelligent Computing; Springer Nature: Singapore, 2024; pp. 66–76. [Google Scholar]
  37. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–9 June 2023. [Google Scholar]
  38. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-Neck by Gsconv: A Lightweight-Design for Real-Time Detector Architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  39. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  40. Leng, Z.; Tan, M.; Liu, C.; Cubuk, E.D.; Shi, X.; Cheng, S.; Anguelov, D. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions. arXiv 2022, arXiv:2204.12511. [Google Scholar]
  41. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 565–571. [Google Scholar]
  42. Jadon, S. A Survey of Loss Functions for Semantic Segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Via del Mar, Chile, 27–29 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar]
  43. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539. [Google Scholar]
  44. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNext: Rethinking Convolutional Attention Design for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  45. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer Is Actually What You Need for Vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
  46. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  47. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3448–3460. [Google Scholar] [CrossRef]
Figure 1. The overall structure of CEMFormer.
Figure 2. An example of annotations in the original dataset: (a) the original size; (b) an enlarged view of the local region.
Figure 3. An example of annotations in the new dataset: (a) an example of an unsecured safety rope; (b) an example of a secured safety rope; (c) an enlarged view of (b).
Figure 4. The structure of the ConvNeXt Block.
Figure 5. The structure of the Efficient Multi-Scale Attention (EMA) block.
Figure 6. The structure of the DETR Encoder.
Figure 7. The structure of the Decoder block.
Figure 8. An example image.
Figure 9. An example semantic segmentation result.
Table 1. A model performance comparison.

| Model | Acc (Background) | Acc (Offground) | Acc (Ground) | Acc (Safebelt) | Acc (Saferope) | mAcc | Time |
|---|---|---|---|---|---|---|---|
| PIDNet-L | 99.76% | 92.54% | 88.34% | 86.65% | 74.46% | 88.35% | 95.2 ms |
| SegNext-MSCAN-L | 99.64% | 92.56% | 88.09% | 83.85% | 70.64% | 86.96% | 119.1 ms |
| PoolFormer-m36 | 99.76% | 92.51% | 88.54% | 83.14% | 72.57% | 87.31% | 151.4 ms |
| ConvNeXt-T | 99.77% | 93.54% | 89.04% | 84.97% | 75.63% | 88.59% | 124.2 ms |
| DDRNet-23 | 99.80% | 67.52% | 55.18% | 55.84% | 38.77% | 63.42% | 154.3 ms |
| Mask2Former | 99.72% | 92.58% | 89.82% | 85.55% | 78.14% | 89.16% | 122.7 ms |
| CEMFormer (Ours) | 99.51% | 90.91% | 89.33% | 90.14% | 91.68% | 92.31% | 119.1 ms |
Table 2. An ablation study of CEMFormer.

| Experiment | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| ConvNeXt | √ | √ | √ | √ | × |
| EMA | √ | √ | √ | × | × |
| Poly Loss | √ | √ | × | × | × |
| Log-Cosh Dice Loss | √ | × | × | × | × |
| Acc (Background) | 99.51% | 99.55% | 99.76% | 99.69% | 99.72% |
| Acc (Offground) | 90.91% | 84.55% | 92.51% | 92.33% | 92.58% |
| Acc (Ground) | 89.33% | 91.85% | 90.72% | 91.53% | 89.82% |
| Acc (Safebelt) | 90.14% | 85.27% | 85.43% | 86.61% | 85.55% |
| Acc (Saferope) | 91.68% | 88.87% | 80.18% | 77.05% | 78.14% |
| mAcc | 92.31% | 90.02% | 89.72% | 89.44% | 89.16% |