Article

YOLO-DFAN: Effective High-Altitude Safety Belt Detection Network

School of Integrated Circuits, Anhui University, Hefei 230039, China
*
Author to whom correspondence should be addressed.
Future Internet 2022, 14(12), 349; https://doi.org/10.3390/fi14120349
Submission received: 6 November 2022 / Revised: 18 November 2022 / Accepted: 20 November 2022 / Published: 23 November 2022

Abstract

This paper proposes the You Only Look Once (YOLO) dependency fusing attention network (DFAN) detection algorithm, which is improved from the lightweight network YOLOv4-tiny. It combines the speed advantage of traditional lightweight networks with the precision advantage of traditional heavyweight networks, making it well suited to the real-time detection of high-altitude safety belts on embedded equipment. In response to the difficulty of extracting the features of an object with a low effective pixel ratio—that is, an object with a low ratio of actual area to detection anchor area—in the YOLOv4-tiny network, we make three major improvements to the baseline network: the first is introducing the atrous spatial pyramid pooling network after CSPDarkNet-tiny extracts features; the second is proposing the DFAN; and the third is introducing the path aggregation network (PANet) to replace the feature pyramid network (FPN) of the original network and fusing it with the DFAN. According to the experimental results on the high-altitude safety belt dataset, YOLO-DFAN improves the accuracy by 5.13% compared with the original network, and its detection speed meets the real-time demand. The algorithm also exhibits a good improvement on the Pascal voc07+12 dataset.

1. Introduction

In the modern era of countless high-rise buildings, many industries have an increasing demand for high-altitude work. However, the modernization of high-altitude work itself has progressed slowly, and this contradiction has become increasingly prominent. Numerous high-altitude workers have been involved in accidents without protective measures, resulting in serious injuries and even deaths, and falls from height have been the “number one killer” on construction sites for many years [1]. Given this, high-altitude safety belt detection methods based on deep learning [2] have been applied and developed to promptly verify the safety protection of workers at high altitudes [3,4,5,6].
Thus far, mainstream safety supervision is still at the manual inspection stage, which is time-consuming and laborious. Object detection [7] is one of several fundamental tasks in computer vision. In recent years, the rapid development of deep learning has caused object detection algorithms to shift gradually from traditional algorithms based on manual features to algorithms based on deep neural networks [8]. Deep learning combined with high-altitude safety belt detection for workers can meet modern safety supervision needs. The main deep-learning-based detection algorithms [9,10,11,12] include Faster R-CNN, Mask R-CNN, SSD, You Only Look Once (YOLO), and others.
Regarding the deep learning detection algorithm for high-altitude safety belts, Fang Weili et al. proposed a two-stage deep convolutional network to detect whether workers wear safety belts at work [13]. The method uses two network models: the two-stage Faster R-CNN model is applied to detect the worker image, and the CNN model is used to identify whether the worker is wearing a high-altitude safety belt. Shanti et al. proposed another example: a training network model using the one-stage detection network YOLOv3, which aims to detect whether the worker in the input image is wearing a high-altitude safety belt [14].
Although some of these networks have achieved good results in high-altitude safety belt detection, they are large, and their hardware demand is too high for practical deployment, which hinders the high-altitude safety belt detection task—a task that demands lightweight, real-time networks. From the perspectives of detection speed and practical application, detection based on a lightweight network is more suitable for high-altitude safety belt detection. As an object with a low effective pixel ratio, a high-altitude safety belt belongs to a class of objects with a low ratio of actual area to detection anchor area, characterized by a scattered distribution in the image, unfocused features, and easy confusion with the background. Therefore, it is difficult for a detection network to extract its effective features. Moreover, the human body occupies only part of the input image, and the high-altitude safety belt covers only part of the human body. These problems make it difficult for traditional detection networks to extract the features of high-altitude safety belts.
This paper proposes the YOLO dependency fusing attention network (DFAN) detection algorithm based on the YOLOv4-tiny detection algorithm to solve the major problems of detection speed and high-altitude safety belt feature extraction. YOLOv4-tiny is a lightweight network with an extremely fast detection speed, but it does not do a good job of detecting high-altitude safety belts; thus, we propose YOLO-DFAN. After features are extracted by the lightweight backbone network CSPDarkNet-tiny of YOLOv4-tiny, YOLO-DFAN introduces an atrous spatial pyramid pooling network (ASPPNet) [15] to capture contextual information and fuse it on multiple scales. This method increases the receptive field of the features while enriching the extracted high-altitude safety belt spatial feature information and obtaining a more refined feature representation of the network. Afterward, the idea of an attention mechanism is introduced [16]. The DFAN is innovatively proposed to optimize the representation of the feature layer in multiple dimensions by encoding the long-range dependency information for three directions to focus adequate network attention on the features of the high-altitude safety belt region. Finally, the path aggregation network (PANet) [17] is introduced to replace the FPN of the original network [18], and it fuses with DFAN to enhance the feature representation of the network in the region of interest of the object on multiple scales.
Based on the experimental results for the high-altitude safety belt dataset, the proposed method maintains the real-time detection speed and improves accuracy compared with the original network. It can supervise whether workers wear safety belts in the working scene at high altitudes in real time, playing an effective role in supervision. To further validate the algorithm’s effectiveness, we also conducted experiments on the Pascal voc07+12 dataset, and the results revealed a significant enhancement effect of the algorithm.
This paper aims to propose a high-altitude safety belt detection network that balances speed and accuracy. Therefore, the YOLO-DFAN network is innovatively proposed, combining the speed advantage of a lightweight network with the precision advantage of a heavyweight network. The contributions of this paper include the following three points: First, ASPPNet is introduced into YOLOv4-tiny to fuse spatial feature information at different proportions. It also increases the features’ receptive field, enabling the network to acquire a more refined feature representation and alleviating the inadequate, unrefined feature extraction caused by the shallow depth of the original lightweight network. Second, this paper innovatively proposes the DFAN, which simultaneously encodes attention maps for the channel, x-, and y-directions with three branches to obtain the long-range information in these three directions. Furthermore, it fuses the k-nearest neighbors in each direction to obtain strong dependency relationships. It then combines the attention maps of the three directions, containing the long-range dependency information, to adjust the feature layers complementarily and capture a richer and more accurate region of interest of the object. Third, PANet is introduced to replace the FPN module of the original network and is combined with the attention network DFAN, which improves the utilization of deep and shallow feature information and enhances the feature representation of the region of interest at different scales.
The rest of this paper is structured as follows: Section 2 introduces related works, including object detection and attention mechanisms. Section 3 illustrates the overall structure of the proposed YOLO-DFAN algorithm. Section 4 focuses on the experimental validation of the detection results of the dataset and qualitative analysis of the heatmap to verify the effectiveness of the proposed algorithm. Finally, the conclusions are summarized in Section 5.

2. Related Work

Relevant prior work includes studies of object detection and attention mechanisms.

2.1. Object Detection

As one of the core problems in computer vision [19], object detection has been challenging due to the different appearances, shapes, and poses of various objects and the interference of factors such as illumination and occlusion during imaging. In recent years, the rapid development of deep learning has caused object detection algorithms to gradually shift from traditional algorithms based on manual features to detection algorithms based on deep neural networks [20].
Deep-learning-based object detection algorithms are primarily divided into two categories: one-stage and two-stage. Common two-stage object detection algorithms [21], such as the R-CNN series, proceed in two steps: first, region generation produces preselected boxes (i.e., region proposals) that may contain objects; then, the samples are classified using a CNN. Common one-stage object detection algorithms [22], such as YOLO and SSD, do not need the region proposal stage and directly generate the class probabilities and coordinate values of the objects, so the final detection results can be obtained after a single pass.
Deep learning networks and CNNs are constantly evolving in object detection, which involves the issue of the number of parameters and the computational effort when designing CNNs. Lightweight networks are created, aiming to reduce the number of parameters and the model complexity while maintaining accuracy as much as possible. Common algorithms—such as YOLOv3-tiny, YOLOv4-tiny, and SSD-tiny—are implemented by introducing lightweight feature extraction networks to replace the backbone in the baseline network. The common deep learning detection algorithms have difficulty in extracting the features of high-altitude safety belts because they are objects with a low effective pixel ratio. Therefore, we introduced the attention mechanism to help the detection network extract their features.

2.2. Attention Mechanism

As a module in neural networks to optimize the training effect of the network, attention is popular and widely used due to its flexibility, lightness, and effectiveness. Initially, the attention model was primarily used for machine translation, but its concept has a key place in deep learning and has become an important part of CNNs. The attention mechanism in neural networks can be regarded as a dynamic weight-adjustment process based on the input image features, which can help network models predict the potential features of interest and has been commonly used in such tasks as image classification, object detection, semantic segmentation, and 3D vision [23]. Common attention [24,25,26,27] mechanisms are used, such as SENet, CBAM, ECANet, and coordinate attention. This paper innovatively proposes a new attention network based on coordinate attention.

3. Proposed Method

In this section, the structure of the baseline network is first reviewed in Section 3.1, and the proposed improved algorithm is introduced. In Section 3.2, ASPPNet is presented. Section 3.3 describes the new proposed DFAN. The combination of the DFAN with the multipath fusion PANet is presented in Section 3.4.

3.1. YOLO-DFAN

YOLOv4-tiny consists of three parts: CSPDarkNet53-tiny, the FPN, and the detection head. As the backbone network, CSPDarkNet53-tiny is responsible for feature extraction. It consists of 15 convolutional layers, organized into two deep residual convolutional (CBL) units and three ResBlock residual modules, as depicted in Figure 1. Each CBL unit consists of a convolutional layer, a batch normalization layer, and an activation function—the leaky rectified linear unit. Each ResBlock combines four CBL units with residual nesting and then performs max pooling. Through these processes, the ResBlocks solve the network degradation problem while increasing the network depth, alleviating gradient dispersion, and making data transfer smoother. The FPN is the feature fusion module used to construct a multiscale feature pyramid, and the YOLO head finally integrates and predicts the features extracted from the whole network.
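To make these building blocks concrete, below is a minimal PyTorch sketch of a CBL unit and a CSP-style ResBlock consistent with the description above and with common YOLOv4-tiny implementations; the exact channel routing and module names are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """CBL unit: convolution + batch normalization + leaky ReLU."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResBlock(nn.Module):
    """CSP-style residual block: four CBL units with residual nesting,
    followed by max pooling (a sketch of the structure in Figure 1)."""
    def __init__(self, ch):
        super().__init__()
        self.cbl1 = CBL(ch, ch)
        self.cbl2 = CBL(ch // 2, ch // 2)
        self.cbl3 = CBL(ch // 2, ch // 2)
        self.cbl4 = CBL(ch, ch, k=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.cbl1(x)
        route = x
        x = x[:, x.shape[1] // 2:, :, :]            # route half of the channels
        x = self.cbl2(x)
        inner = x
        x = self.cbl3(x)
        x = self.cbl4(torch.cat([x, inner], dim=1))
        feat = x                                    # branch feature (e.g., FEAT2)
        return self.pool(torch.cat([route, x], dim=1)), feat
```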
As a simplified version of YOLOv4, YOLOv4-tiny is a lightweight model with its parameters reduced to 6 million. Thus, this model can improve the speed with a slight loss of accuracy and meet high-altitude safety belt detection time requirements. However, given that the feature extraction of YOLOv4-tiny is insufficiently rich and fine, and that the scale-matching range is insufficiently diverse, a YOLO-DFAN network based on the original network is proposed in this paper.
YOLO-DFAN is an improved version of YOLOv4-tiny, as illustrated in Figure 2. After the backbone network outputs a 13 × 13 feature layer, ASPPNet is introduced to capture contextual information more densely at multiple proportions, improving the features’ receptive field, giving the acquired features a more refined representation, and enriching the extracted contextual features. By adding a DFAN module after the second output of the backbone and after the output of ASPPNet, the model effectively adjusts the extracted shallow and deep information, respectively, to obtain more effective feature information. The two parts of the features are then input into the multipath PANet fused with the DFAN. The features on both paths are aggregated by attention-containing upsampling and downsampling, enhancing the utilization of deep and shallow information and improving the network’s ability to represent the features of interest for multiscale and small objects. Finally, the two parts of the features after multipath aggregation are input into the detection heads for feature integration and detection.

3.2. Atrous Spatial Pyramid Pooling Network

In traditional convolutional and pooling layers, a high-altitude safety belt image loses some contextual feature information, and the correlation between local and overall spatial features is weakened; ASPPNet—introduced after the 13 × 13 feature layer extracted by the YOLOv4-tiny backbone network—aims to solve these problems. As depicted in Figure 3, this network fuses multiscale contextual information to enhance the receptive field of the features and enrich the feature information extracted by the network. Using multiple dilated convolutions, ASPPNet convolves the input features at different dilation rates to capture contextual information at different scales. Dilated convolution differs from normal convolution in that it introduces a sampling rate: it increases the interval of the convolutional kernel to enlarge the receptive field without changing the resolution of the output feature map and achieves fine-grained to coarse-grained multiscale spatial fusion. Thus, more fine-grained features can be acquired to enhance the representability of the object in the feature layer.
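As an illustration of the dilated-convolution idea, the following is a minimal ASPP-style sketch in PyTorch: parallel 3 × 3 convolutions with different dilation rates see different receptive fields at the same resolution and are concatenated before a 1 × 1 fusion convolution. The dilation rates and channel counts are assumptions; the paper does not list the exact configuration.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated convolutions at several rates, concatenated and
    fused by a 1x1 convolution (a sketch, not the authors' exact ASPPNet)."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # each branch captures context at a different scale; padding = dilation
        # keeps the 13 x 13 resolution unchanged
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# usage on the 13 x 13 backbone output (channel count assumed)
feat = torch.randn(1, 512, 13, 13)
print(ASPP(512, 512)(feat).shape)  # torch.Size([1, 512, 13, 13])
```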

3.3. Dependency Fusing Attention Network

The structure of the DFAN proposed in this paper is illustrated in Figure 4.
We encoded channel direction dependencies without dimensionality reduction in the first branch:
$$C_{map} = \sigma\big(\varphi_1(g_1(x)) + \varphi_2(g_2(x))\big),$$
The input feature map is first passed into the global average pooling layer $g_1(x)$ to capture the global spatial information. The global max pooling layer $g_2(x)$ is used to encode the features of the most significant part, compensating for the softly encoded average-pooled features. Then, the channel information of the k-nearest neighbors is aggregated by the one-dimensional (1D) convolutional layer $\varphi(x)$ with kernel size $K$. The activation function $\sigma(x)$ activates a self-gating mechanism so that information with a higher response produces comparatively larger weights. The weight of each channel is as follows:
$$\omega_c^{(C,1,1)}(i) = \sigma\left(\sum_{l=1}^{K} W_1^l \left(\frac{1}{wh}\sum_{j=1,k=1}^{h,w} X_{i,j,k}\right)_l + \sum_{l=1}^{K} W_2^l \left(\max_{1\le j\le h,\,1\le k\le w} X_{i,j,k}\right)_l\right),$$
where $\sigma(x)$ is the sigmoid activation function, $W$ denotes the parameter matrix of the 1D convolution, and $\left(\frac{1}{wh}\sum_{j=1,k=1}^{h,w} X_{i,j,k}\right)_l$ denotes the information of the $(i+l-1)$th channel after spatial aggregation. These weights emphasize the more useful channel features (i.e., the whole branch improves the usefulness of the feature channels by encoding the interdependence between the channels of the features).
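A minimal PyTorch sketch of this channel branch, following the two equations above (global average and max pooling, a 1D convolution of kernel size K over the channel axis for each, summed and passed through a sigmoid); the kernel size K = 3 is an assumed value.

```python
import torch
import torch.nn as nn

class ChannelBranch(nn.Module):
    """Channel-direction branch sketch: C_map = sigmoid(phi1(GAP(x)) + phi2(GMP(x)))."""
    def __init__(self, k=3):
        super().__init__()
        self.conv_avg = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_max = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3)).view(b, 1, c)     # g1: global average pooling
        mx = x.amax(dim=(2, 3)).view(b, 1, c)      # g2: global max pooling
        w = torch.sigmoid(self.conv_avg(avg) + self.conv_max(mx))
        return w.view(b, c, 1, 1)                  # channel attention map C_map
```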
For the second branch, we used two spatial extents of pooling kernels (H, 1) or (1, W) to encode each channel along the horizontal and vertical directions, respectively. These two feature maps contain direction-specific information and are separately encoded by the activation function into two attention mappings, each of which captures the long-range information of the input feature map in one spatial direction. Afterward, a 1D convolution is separately added to fuse the spatial linkage information with the k-nearest neighbors in the x-direction and the spatial linkage information with the k-nearest neighbors in the y-direction. In this way, the long-range vectors in both directions acquire rich spatial dependencies and capture more spatial information around the object feature region, which can have a scoring influence on the region of interest. The encoded attention map is obtained as follows:
$$H_{map} = \sigma_1\big(\varphi_1(\theta(g_1(X)))\big),$$
$$W_{map} = \sigma_2\big(\varphi_2(\theta(g_2(X)))\big),$$
where $\sigma(x)$ denotes the sigmoid function; $\varphi(x)$ is the 1D convolutional function; $\theta(x)$ represents the intermediate concatenation, regularization, activation, and split operations; and $g(x)$ denotes the global average pooling layer. The feature mapping information of the two vectors is as follows:
$$\omega_h^i = \sigma\left(\sum_{l=1}^{K} W_1^l \left(\alpha\Big(\frac{1}{w}\sum_{j=1}^{w} X_{c,i,j}\Big)\right)_l\right),$$
$$\omega_w^i = \sigma\left(\sum_{l=1}^{K} W_2^l \left(\alpha\Big(\frac{1}{h}\sum_{j=1}^{h} X_{c,j,i}\Big)\right)_l\right),$$
where $\left(\alpha\big(\frac{1}{w}\sum_{j=1}^{w} X_{c,i,j}\big)\right)_l$ denotes the information value at the $(i+l-1)$th position of the spatial vector in the h-direction, and $\left(\alpha\big(\frac{1}{h}\sum_{j=1}^{h} X_{c,j,i}\big)\right)_l$ is the information value at the $(i+l-1)$th position of the spatial vector in the w-direction. By encoding the two long-range spatial vectors in this nearest-neighbor-fusing fashion, more location information about the object region—information that contributes to scoring the region of interest—is preserved in the generated spatial attention maps, focusing the network on a richer and more accurate localization of the region of interest of the objects.
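A corresponding sketch of the two directional branches: each channel is average-pooled along one spatial direction, and a 1D convolution of kernel size K fuses neighboring positions before the sigmoid. The intermediate θ(x) operations are simplified away here, and K = 3 is assumed, so this is an interpretation of the equations above rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialBranches(nn.Module):
    """Directional branches sketch: per-channel row and column descriptors,
    each refined by a shared 1D convolution and a sigmoid (H_map, W_map)."""
    def __init__(self, k=3):
        super().__init__()
        self.conv_h = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_w = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, h, w = x.shape
        vec_h = x.mean(dim=3).reshape(b * c, 1, h)   # (1, W) pooling: one value per row
        vec_w = x.mean(dim=2).reshape(b * c, 1, w)   # (H, 1) pooling: one value per column
        h_map = torch.sigmoid(self.conv_h(vec_h)).reshape(b, c, h, 1)
        w_map = torch.sigmoid(self.conv_w(vec_w)).reshape(b, c, 1, w)
        return h_map, w_map
```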
Finally, we employed the dot product of the encoded channel and spatial maps to obtain a 3D attention map; each element of this attention map reflects whether the feature of the object is present in the corresponding channel, row, and column. The attention map enhances the representation of the region of interest of the objects through complementary extraction and optimization of the input feature map. This encoding approach enables a richer and more accurate localization of the region of interest of the object and helps the model achieve better detection results.
In coordinate attention, the authors embedded the spatial long-range feature encoding information together with channel relations [27]. Guided by this idea of encoding long-distance spatial information, this study eliminated the embedding of channel relations and added a 1D convolution to the long-distance information in the x- and y-directions to fuse the contextual linkage information between the k-nearest neighbors and to enhance the dependency between information. In this way, we captured the spatial information around the object feature region that affects the score of the region of interest. In another branch, we encoded channel dependencies without reducing the dimensionality [22]. Finally, a 3D attention map was obtained from the dot product of the attention maps in the three directions, which optimizes the feature map complementarily in multiple dimensions.
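The fusion step can be sketched as a simple broadcast multiplication of the three maps produced by the branch sketches above, rescaling the input features with the resulting 3D attention map (an interpretation of Figure 4, assuming the map shapes noted in the comments).

```python
import torch

def fuse_attention(x, c_map, h_map, w_map):
    """Combine the channel map (B, C, 1, 1) and the directional maps
    (B, C, H, 1) and (B, C, 1, W) into a 3D attention map by broadcast
    multiplication, then rescale the input feature map x (B, C, H, W)."""
    attn = c_map * h_map * w_map   # broadcasts to (B, C, H, W)
    return x * attn
```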

3.4. PANet with DFAN

According to Figure 1, the two output layers of the backbone network in YOLOv4-tiny—FEAT1 and FEAT2—pass through the FPN structure to fuse the feature information before entering the prediction head. However, the FPN structure of the network is too rudimentary: it only stacks the upsampled FEAT1 layer with the FEAT2 layer. This structure leads to the overall underutilization of deep and shallow detail information, making the network ineffective at detecting objects in complex scenes and small-object scenes. Given this, we introduced PANet as a multipath feature fusion network, as depicted in Figure 2. Using two paths—bottom-up and top-down feature fusion—we shortened the information paths between low-level and top-level features. Moreover, we enhanced the entire feature hierarchy by intermingling low-level feature information with high-level feature information, enhancing both. We then added DFANs in both the top-down and bottom-up paths to enhance the representation of features in the region of interest at different scales.
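A compact PyTorch sketch of this two-path fusion is given below: FEAT2 (shallow, e.g. 26 × 26) and FEAT1 (deep, 13 × 13) are fused top-down by upsampling and then bottom-up by a strided convolution, with an attention module on each path. The channel counts, the convolution layout, and the use of nn.Identity as a stand-in for the DFAN are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PANetDFAN(nn.Module):
    """Sketch of PANet-style top-down and bottom-up fusion with a pluggable
    attention module (replace nn.Identity with a DFAN implementation)."""
    def __init__(self, c2=256, c1=512, attn=nn.Identity):
        super().__init__()
        self.lateral = nn.Conv2d(c1, c2, 1)                  # align deep channels
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.attn_td = attn()                                # top-down path attention
        self.attn_bu = attn()                                # bottom-up path attention
        self.smooth_td = nn.Conv2d(c2 * 2, c2, 3, padding=1)
        self.down = nn.Conv2d(c2, c2, 3, stride=2, padding=1)
        self.smooth_bu = nn.Conv2d(c2 * 2, c1, 3, padding=1)

    def forward(self, feat2, feat1):                         # (B,c2,26,26), (B,c1,13,13)
        # top-down: upsample the deep feature and fuse it into the shallow one
        td = self.smooth_td(torch.cat(
            [feat2, self.attn_td(self.up(self.lateral(feat1)))], dim=1))
        # bottom-up: downsample the fused shallow feature back into the deep one
        bu = self.smooth_bu(torch.cat(
            [self.lateral(feat1), self.attn_bu(self.down(td))], dim=1))
        return td, bu                                        # inputs to the two YOLO heads

td, bu = PANetDFAN()(torch.randn(1, 256, 26, 26), torch.randn(1, 512, 13, 13))
print(td.shape, bu.shape)  # torch.Size([1, 256, 26, 26]) torch.Size([1, 512, 13, 13])
```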

4. Results

4.1. Environmental and Experimental Settings

The experimental platform consisted of Ubuntu 18.04, CUDA 10.1, Python 3.6, an Nvidia GTX 2080Ti graphics card, and an Intel i7-9700KF processor @ 3.60 GHz. The initial learning rate for model training was set to 0.001. The Adam optimizer with weight decay was used to alleviate overfitting, and the model was trained for 80 epochs.
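A minimal sketch of this training configuration in PyTorch is shown below; the weight-decay value, input size, and the placeholder model are assumptions, since the paper states only the Adam optimizer, an initial learning rate of 0.001, weight decay, and 80 epochs.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3, padding=1)    # stand-in for the YOLO-DFAN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

for epoch in range(80):                          # 80 training epochs
    # one pass over the high-altitude safety belt training set would go here
    optimizer.zero_grad()
    out = model(torch.randn(2, 3, 416, 416))     # 416 x 416 is an assumed input size
    loss = out.mean()                            # dummy loss for the sketch
    loss.backward()
    optimizer.step()
```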

4.2. Dataset

Because almost no high-quality high-altitude safety belt dataset is currently publicly available, we used a combination of two acquisition methods: images downloaded from the Internet and photographs taken at actual sites. The collected and produced images were clear and of high resolution. The dataset contained nearly 6000 homemade pictures, which were labeled using the labelImg tool; the label category “belt” covered the whole body area of a worker wearing a safety belt. Most of the dataset pictures were collected from actual electric power maintenance sites and construction site scenes; thus, they have important reference significance for practical high-altitude safety belt detection scenarios.
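Since labelImg writes Pascal VOC-style XML annotations by default, a short sketch of how one annotation file could be read back is given below; the file path and the assumption that the category name is stored as "belt" are illustrative.

```python
import xml.etree.ElementTree as ET

def load_belt_boxes(xml_path):
    """Return (xmin, ymin, xmax, ymax) boxes for the 'belt' category from
    one labelImg (Pascal VOC) annotation file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        if obj.findtext("name") != "belt":
            continue
        bb = obj.find("bndbox")
        boxes.append(tuple(int(float(bb.findtext(tag)))
                           for tag in ("xmin", "ymin", "xmax", "ymax")))
    return boxes
```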

4.3. Results and Discussion

We introduced a heatmap [28,29] to observe how the model extracts high-altitude safety belt features and selected samples for qualitative analysis to verify the effectiveness of the YOLO-DFAN algorithm. The baseline network and YOLO-DFAN were employed to predict and generate heatmaps for the three samples presented in Figure 5, Figure 6 and Figure 7. Figure 5 shows the detection results for a sample that was correctly predicted by both networks. According to Figure 5a,b, both models correctly predicted the high-altitude safety belts. Comparing the heatmaps in Figure 5c,d, the feature extraction of the baseline network was relatively rough: its attention focuses more on the lower part of the human body and less on the central area of the safety belt. The improved network was more precise in feature extraction; according to Figure 5d, its region of interest covers the entire high-altitude safety belt region and locates the belt more accurately.
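For reference, heatmaps of this kind can be produced with a Grad-CAM-style procedure in the spirit of [29]; the sketch below weights a chosen feature layer by the gradient of a scalar detection score. The layer choice and the score function are assumptions, as the paper does not specify how its heatmaps were generated.

```python
import torch

def grad_cam(model, feature_layer, image, score_fn):
    """Grad-CAM-style heatmap sketch: channel-wise gradient weights applied
    to the hooked feature maps, summed and rectified into a (B, H, W) map."""
    feats, grads = [], []
    h1 = feature_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feature_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = score_fn(model(image))                # scalar score for the belt class
    score.backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    cam = torch.relu((weights * feats[0]).sum(dim=1))
    return cam / (cam.max() + 1e-6)               # normalized heatmap
```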
Figure 6 presents the detection results for a worker not wearing a high-altitude safety belt. According to Figure 6a,b, observing the results of the baseline network and YOLO-DFAN, the baseline network made a false detection, whereas YOLO-DFAN correctly predicted that no high-altitude safety belt was worn. Comparing the heatmaps of the two sample outputs in Figure 6c,d, due to the overlap between the worker on the left and the power facility, the baseline network mistakenly focused on the pocket area on the thigh of the worker on the left, resulting in false detection.
Figure 7 shows the detection results for a scene with smaller objects and a more complicated background. The leftmost of the three workers is not wearing a high-altitude safety belt, while the middle and right workers are wearing safety belts. According to Figure 7a,b, the baseline network missed the safety belt status of the rightmost worker, whereas YOLO-DFAN correctly detected the safety belt status of all three workers. Comparing the generated heatmaps in Figure 7c,d, the baseline network extracted the features of the middle safety belt wearer only broadly and roughly and failed to extract the features of the safety belt worn by the rightmost worker. In contrast, YOLO-DFAN refined the feature extraction of the safety belt worn by the middle worker while accurately locating the region of interest of the safety belt worn by the rightmost worker.
To demonstrate the performance advantage of our algorithm, we compared different evaluation indices for different algorithms on the test set, as shown in Table 1. From Table 1, we can draw the following conclusions: Compared with YOLOv4-tiny, YOLO-DFAN improved R from 67.33% to 71.29%, mAP from 74.03% to 79.16%, F1 from 0.76 to 0.80, and P from 84.44% to 84.56%. These gains mainly benefit from the ASPP module refining the object features extracted by the backbone, the proposed DFAN attention network enhancing the model’s ability to extract regions of interest, and the introduction of PANet strengthening the network’s ability to detect multiscale objects. Compared with EfficientDet, YOLO-DFAN also showed better performance. Compared with YOLOv4, YOLO-DFAN has a tremendous advantage in speed, although its accuracy is inferior; the main reason is that YOLOv4 applies the full CSPDarkNet network, strengthening the feature representation but slowing detection. In conclusion, the proposed method balances the precision defects of traditional lightweight networks and the speed defects of traditional heavyweight networks in high-altitude safety belt detection, and it completes object detection with good performance.
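For completeness, the precision, recall, and F1 values in Table 1 follow the standard definitions; a small helper computing them from detection counts is sketched below (the counting of true/false positives at a given IoU and confidence threshold is assumed to happen upstream).

```python
def precision_recall_f1(tp, fp, fn):
    """Standard P, R, and F1 from true positive, false positive, and
    false negative counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```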
We conducted ablation experiments on the high-altitude safety belt dataset to verify the effectiveness of the DFAN module and each of the other modules introduced in the paper, as follows (Table 2):
In the ablation experiments, we evaluated multiple attention modules and each module used in this paper. The proposed DFAN enhanced high-altitude safety belt detection more than the other small attention modules—SENet, CBAM, ECANet, and coordinate attention—verifying the effectiveness of the constructed DFAN. The experimental results of ASPPNet and PANet, each introduced separately into the baseline network, verify the effectiveness of these modules as well. Overall, the ablation experiments confirm the effectiveness of the modules introduced and proposed in this paper. We further validated the algorithm by running YOLO-DFAN on the Pascal voc07+12 dataset; as can be seen from Figure 8, the YOLO-DFAN algorithm also performs well on this dataset.

5. Conclusions

In this paper, the YOLO-DFAN algorithm is proposed to solve the problem of the low accuracy of the original lightweight algorithm YOLOv4-tiny: although YOLOv4-tiny is fast, its detection accuracy needs to be improved. First, YOLO-DFAN introduces ASPPNet to reduce the imprecise feature extraction caused by the shallow network depth of lightweight networks. Second, YOLO-DFAN uses the newly proposed DFAN, which combines multidimensional long-range dependency information, to capture a more effective region of interest of objects. Finally, YOLO-DFAN integrates the attention mechanism with PANet to improve the ability to focus on the features of interest of objects at different scales. Therefore, the improved structure proposed in this paper effectively alleviates the poor and rough extraction of object features and the unsatisfactory detection of multiscale and small objects in the lightweight network YOLOv4-tiny. In conclusion, while maintaining real-time detection speed, the accuracy of the algorithm increased from 74.03% to 79.16% compared with the baseline network, at a cost of 0.0016 s per image, meeting the needs of modern high-altitude safety belt detection tasks. Moreover, the algorithm displayed good improvement on the Pascal voc07+12 dataset. Compared with the heavyweight algorithm YOLOv4, YOLO-DFAN had a great advantage in speed, although its detection accuracy was not as good. The experimental results reveal that the improved methods strengthen the detection effect. The next step will be to expand the high-altitude safety belt dataset, further improve the detection accuracy while maintaining speed, and try to detect other objects with low effective pixel ratios.

Author Contributions

Writing—original draft, W.Y. and X.W.; Writing—review & editing, W.Y. and S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable; the study does not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bai, X.P.; Zhao, Y.H. A novel method for occupational safety risk analysis of high-altitude fall accident in architecture construction engineering. J. Asian Archit. Build. Eng. 2021, 20, 314–325.
  2. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695.
  3. Cheng, R.; He, X.W.; Zheng, Z.L.; Wang, Z.T. Multi-Scale Safety Helmet Detection Based on SAS-YOLOv3-Tiny. Appl. Sci. 2021, 11, 3652.
  4. Liao, M.H.; Lyu, P.Y.; He, M.H.; Yao, C.; Wu, W.H.; Bai, X. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 532–548.
  5. Liu, J.M.; Chen, H.; Wang, Y. Multi-Source Remote Sensing Image Fusion for Ship Target Detection and Recognition. Remote Sens. 2021, 13, 4852.
  6. Yao, Z.X.; Song, X.P.; Zhao, L.; Yin, Y.H. Real-time method for traffic sign detection and recognition based on YOLOv3-tiny with multiscale feature extraction. Proc. Inst. Mech. Eng. Part D-J. Automob. Eng. 2021, 235, 1978–1991.
  7. Fu, K.; Chang, Z.; Zhang, Y.; Xu, G.; Zhang, K.; Sun, X. Rotation-aware and multi-scale convolutional neural network for object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 161, 294–308.
  8. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377.
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  10. He, K.M.; Gkioxari, G.; Dollár, P. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
  11. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 779–788.
  13. Fang, W.; Ding, L.; Luo, H.; Love, P.E.D. Falls from heights: A computer vision-based approach for safety harness detection. Autom. Constr. 2018, 91, 53–61.
  14. Shanti, M.Z.; Cho, C.-S.; Byon, Y.-J.; Yeun, C.Y.; Kim, T.-Y.; Kim, S.-K.; Altunaiji, A. A Novel Implementation of an AI-Based Smart Construction Safety Inspection Protocol in the UAE. IEEE Access 2021, 9, 166603–166616.
  15. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
  16. Ling, H.; Wu, J.; Huang, J.; Chen, J.; Li, P. Attention-based convolutional neural network for deep face recognition. Multimed. Tools Appl. 2020, 79, 5595–5616.
  17. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
  18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  19. O’Mahony, N.; Campbell, S.; Carvalho, A.; Harapanahalli, S.; Hernandez, G.V.; Krpalkova, L.; Riordan, D.; Walsh, J. Deep Learning vs. Traditional Computer Vision. In Proceedings of the Computer Vision Conference (CVC), Las Vegas, NV, USA, 25–26 April 2019; pp. 128–144.
  20. Shrestha, A.; Mahmood, A. Review of Deep Learning Algorithms and Architectures. IEEE Access 2019, 7, 53040–53065.
  21. Duan, Z.; Li, S.; Hu, J.; Yang, J.; Wang, Z. Review of Deep Learning Based Object Detection Methods and Their Mainstream Frameworks. Laser Optoelectron. Prog. 2020, 57, 120005.
  22. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
  23. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
  24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
  25. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  26. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
  27. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717.
  28. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 2921–2929.
  29. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
Figure 1. YOLOv4-tiny network structure.
Figure 2. YOLO-DFAN network structure.
Figure 3. Atrous spatial pyramid pooling network (ASPPNet) structure.
Figure 4. Dependency fusing attention network (DFAN) structure.
Figure 5. Results of correctly predicted samples: (a) prediction of YOLOv4-tiny; (b) prediction of YOLO-DFAN; (c) heatmap of YOLOv4-tiny; (d) heatmap of YOLO-DFAN.
Figure 6. Results of original network misdetection samples: (a) prediction of YOLOv4-tiny; (b) prediction of YOLO-DFAN; (c) heatmap of YOLOv4-tiny; (d) heatmap of YOLO-DFAN.
Figure 7. Results of the original network’s missed samples: (a) prediction of YOLOv4-tiny; (b) prediction of YOLO-DFAN; (c) heatmap of YOLOv4-tiny; (d) heatmap of YOLO-DFAN.
Figure 8. Results for YOLOv4-tiny and YOLO-DFAN on the Pascal voc07+12 dataset.
Table 1. Belt detection effects of different networks.
Detection Network | mAP (%) | F1 | R (%) | P (%) | Rate (s·picture−1)
YOLOv4-tiny | 74.03 | 0.76 | 67.33 | 84.44 | 0.0039
EfficientDet | 72.36 | 0.77 | 64.29 | 85.36 | 0.0031
YOLOv4 | 80.28 | 0.82 | 72.28 | 86.05 | 0.0365
YOLO-DFAN | 79.16 | 0.80 | 71.29 | 84.56 | 0.0055
Table 2. Ablation experiments.
Algorithm | mAP (%)
YOLOv4-tiny | 74.03
+SE | 75.28
+CBAM | 75.96
+ECA | 74.98
+CA | 76.24
+DFAN | 77.00
+ASPP | 76.23
+PANET | 75.67
+ASPP+DFAN+PANET | 79.16
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
