Article

FLE-YOLO: A Faster, Lighter, and More Efficient Strategy for Autonomous Tower Crane Hook Detection

by Xin Hu 1, Xiyu Wang 1, Yashu Chang 1, Jian Xiao 2,*, Hongliang Cheng 2 and Firdaousse Abdelhad 1
1 School of Energy and Electrical Engineering, Chang’an University, Xi’an 710064, China
2 School of Electronics and Control Engineering, Chang’an University, Xi’an 710064, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5364; https://doi.org/10.3390/app15105364
Submission received: 21 February 2025 / Revised: 6 May 2025 / Accepted: 9 May 2025 / Published: 11 May 2025

Abstract

To address the complexities of crane hook operating environments, the challenges that large-scale object detection algorithms face on edge devices, and issues such as frame-rate mismatch causing image delays, this paper proposes a faster, lighter, and more efficient object detection algorithm called FLE-YOLO. First, FasterNet is used as the backbone for feature extraction, and the Triplet Attention mechanism is integrated to emphasize target information effectively while keeping the network lightweight. In addition, the Slim-neck module is introduced in the neck connection layer, using the lightweight convolutional network GSConv to further streamline the network structure without compromising recognition accuracy. Finally, the Dyhead module is employed in the head section to unify multiple attention operations and improve robustness against small objects and complex backgrounds. Experimental evaluations on the public datasets VOC2012 and COCO2017 demonstrate the effectiveness of the proposed algorithm in terms of lightweight design and detection accuracy. Evaluations were also conducted on images of crane hooks captured under complex operating conditions. The results show that, compared to the original algorithm, the proposed approach reduces computational complexity to 19.4 GFLOPs, increases FPS to 142.857 f/s, and reaches a precision of 97.3%. The AP50 reaches 98.3%, a 0.6% improvement. Finally, testing at a construction site successfully enabled the identification and tracking of hooks, ensuring the safety and efficiency of tower crane operations.

1. Introduction

As urbanization progresses, the demand for housing is increasing, and high-rise buildings have become widely used architectural structures in the construction industry [1,2,3]. In some modern construction projects, the use of multiple cranes to accomplish challenging tasks is becoming increasingly common [4]. As important equipment in the construction of high-rise buildings, tower cranes are widely used in various construction scenarios and provide an effective means of vertical and horizontal transportation for materials in high-rise and super high-rise buildings [5]. As a core component in the operation of tower cranes, the position and status of the hook directly affect construction safety and the accuracy of material transportation. In China, there were a total of 157 tower crane accidents from 2019 to 2022, resulting in 224 deaths [6]. The types of accidents mainly involved falling objects, crushing collisions, and electric shocks. Therefore, real-time and accurate detection and tracking of tower crane hooks are crucial for ensuring construction safety.
In the traditional construction process, tower crane operators can only obtain limited information about the construction site through visual observation or restricted means such as video surveillance and intercom systems [7]. This not only reduces construction efficiency but also increases safety hazards. With the development of video surveillance technology and sensor technology, technologies such as panoramic bird’s-eye view imaging [8], pose estimation [9], sensor positioning and angle detection [10], and edge extraction algorithms [11] have been applied to tower crane hook monitoring. These technologies effectively reduce the safety risks at construction sites and improve work efficiency. However, due to high equipment installation costs, limited detection scenarios, high algorithm complexity, and difficulties in deployment, the application scope of these technologies is difficult to expand.
In recent years, the advancement of CNNs (Convolutional Neural Networks) has led to the emergence of deep learning-based object detection technologies as a prominent research focus for detecting tower crane hooks. Xiong et al. [12] introduced a bridge crane sway angle detection system based on the YOLOv3 algorithm. This system uses the algorithm to determine the crane hook’s position and calculates its swing angle by combining the hoisting rope length and hook deviation, enabling closed-loop anti-sway control. However, this approach is hampered by slow detection speed and limited angles for hook detection, resulting in incomplete hook information. Liang et al. [13] presented a Transformer-guided object detection algorithm that enhances YOLOv5. This method integrates a Transformer structure into the backbone and a BIFPN (Bidirectional Feature Pyramid Network) module into the neck structure, effectively capturing the global and semantic information of small targets. As a result, this approach significantly reduces the false detection rate and achieves precise localization of the crane hook. However, the collected dataset of crane hooks is relatively limited, and the model size has not been optimized, presenting challenges for deployment on edge devices. Lu et al. [14] developed a tower crane hook fall hazard identification method based on an improved YOLOv5, combining the SENet (Squeeze and Excitation Networks) attention mechanism, the GIoU (Generalized Intersection over Union) loss function, and a layer for detecting small targets. This approach notably enhances detection accuracy and alleviates issues such as missed and false detections of crane hooks. Nevertheless, some frame omissions still occur in the detection process, and further improvements are required to achieve real-time detection. Focusing on the problem of missed and false detections when the YOLO algorithm is used to recognize small target objects, Sun et al. [15] proposed a small target recognition model, which improves the detection of small objects like tower crane hooks by adding a 160 × 160 detection layer, improving channel attention, and strengthening the characteristics of small target objects. Pang et al. [16] proposed an improved RT-DETR network for small object detection on tower cranes, which effectively improves detection accuracy and speed by integrating multi-structure reparameterization and optimizing the convolution modules and training loss functions. Pei et al. [17] proposed the SGD-YOLOv5 model, which utilizes space-to-depth convolution and introduces a global attention mechanism to capture global information. Decoupling the detection head operation helps overcome the problem of spatial misalignment.
Due to factors such as the small size, fast movement speed, and complex construction environment of tower crane hooks [18], and in order to reduce investment costs, improve the effectiveness of detecting tower crane hooks in complex scenarios, and enhance detection speed while minimizing deployment difficulties, this paper proposes a faster, lighter, and more efficient object detection algorithm called FLE-YOLO. The model introduces the FasterNet [19] network, the Triplet Attention [20] mechanism, the Slim-neck [21] module, and the Dyhead [22] structure into the backbone, neck connection layer, and head of the detector, respectively, which reduces the computational cost while maintaining detection accuracy. The main contributions of this paper are listed below:
(1)
A novel object detection model FLE-YOLO is proposed, which adopts the FasterNet lightweight backbone network and introduces the Triplet Attention module. By focusing on the triplet interaction of spatial dimensions H, W, and channel dimension C, the feature expression of key areas of the hook is strengthened. Simultaneously, Slimneck is utilized to embed VoV-GSCSP and GSConv into the neck structure for bottom-up and top-down feature fusion. Introducing Dyhead into the head module, its unified structure of spatial, scale, and task perception can comprehensively learn target features, balancing detection accuracy and lightness.
(2)
To validate the effectiveness of each module, heat maps are generated to assess the impact of the Triplet Attention and Dyhead attention on crane hooks. The color distribution in the FLE-YOLO heat maps is more concentrated in the hook area than that in the original algorithm. Additionally, ablation studies are conducted to evaluate the contribution of each module within the entire network using performance metrics.
(3)
Extensive experiments on the large datasets COCO2017 and VOC2012 demonstrate significant improvements in the computational complexity, parameter count, and detection speed of the FLE-YOLO detector compared to the original network. Further experiments on a collected tower crane hook dataset show that, compared to the original algorithm, computational complexity is reduced to 19.4 GFLOPs, detection speed is increased to 142.857 f/s, precision reaches 97.3% (unchanged), AP50 reaches 98.3% (an increase of 0.6%), and the parameter count reaches 7.588 M, a reduction of 3.538 M.
(4)
During the testing phase at the construction site, a visual detection and tracking scheme was proposed. This involved establishing a visual monitoring interface and implementing efficient identification and tracking of hooks, ultimately ensuring both the efficiency and safety of tower crane operations.
The remainder of this paper is organized as listed below: Section 2 includes the object detection algorithms and tower crane safety state monitoring systems, emphasizing the necessity of this work; Section 3 describes the methodology for collecting tower crane hook datasets; Section 4 describes the basic principles of the FLE-YOLO object detection algorithm; Section 5 presents the experimental results; and Section 6 provides a summary.

2. Techniques and Approaches

2.1. Object Detection Algorithms

The object detection algorithms are fundamental and widely studied topics in computer vision, aiming to automatically detect objects of interest in images or videos and label their positions [23,24]. In recent years, target detection technology has been widely applied in various domains, including traffic monitoring for vehicle tracking, robot vision, unmanned driving, drone inspection, and safety engineering status monitoring.
R. Girshick et al. [25] introduced the R-CNN algorithm in 2014, which first applied convolutional neural networks to object detection. However, the detection speed was slow, and real-time detection was not possible. In 2015, the emergence of Fast R-CNN [26] improved object detection by introducing ROI (Region Of Interest) pooling layers. Shortly after, Faster R-CNN [27] significantly reduced computation time by introducing the RPN (Region Proposal Network) module, thereby improving detection efficiency. In 2016, J. Redmon et al. [28] proposed the YOLO algorithm, which divided the image into grid cells and directly predicted the target categories and bounding boxes in each cell using a convolutional neural network. It transformed the object detection task into a regression problem and had a concise and efficient framework, with fast detection speed and wide applicability. Subsequently, the YOLO series of object detection algorithms flourished [29]. In addition to the R-CNN and YOLO series, other deep learning object detection algorithms, including SSD (Single Shot MultiBox Detector) [30] and RetinaNet [31], have also achieved good detection results. In recent years, researchers have been dedicated to further optimizing the efficiency and accuracy of deep learning-based object detection algorithms. Some works focus on improving network architectures, such as EfficientDet [32] and CenterNet [33]. Others aim to address challenges such as small object detection and occlusion, such as Cascade R-CNN [34] and CornerNet [35]. To sum up, the advantages and disadvantages of the object detection algorithms mentioned above are summarized in Table 1.

2.2. Tower Crane Safety Status Monitoring System

Tower cranes primarily consist of foundational structures, lifting mechanisms, slewing systems, and support systems. Enhancing the characteristics of the crane itself and collecting and early warning of operational parameters are crucial aspects of tower crane safety monitoring systems.
To address the issue of predicting the luffing angle response within a narrow range during the operation of dual tower cranes, Zhou et al. [36] effectively determined the lower and upper bounds of the interval luffing-angle response vector for dual crane systems by modeling the uncertain interval structural parameters with a dynamics-based interval model. This approach successfully mitigated collision risks in dual tower crane systems. Zhang et al. [37] designed an adaptive integral sliding mode control method for payload anti-sway in tower cranes affected by external disturbances, achieving continuous, jitter-free control of crane operations. Zhou et al. [38] introduced an approach for detecting and measuring structural surface cracks based on unmanned aerial vehicle (UAV) images, addressing the challenge of inaccessible areas for crack detection in large crane structures. This approach enables automatic detection and measurement of surface cracks on cranes in complex backgrounds, enhancing the safety and reliability of crane operations.
Tower crane monitoring has evolved from traditional methods to fully electronic digital systems [39]. While safety monitoring methods based on sensors and crane characteristics are valuable, they have limited information acquisition capabilities, making it challenging to promptly obtain comprehensive site information. Consequently, integrating video monitoring modules into tower crane safety systems has become crucial. However, some construction sites still rely on general network cameras, which offer basic monitoring without automatic tracking of the hook or camera auto-zoom, resulting in limited intelligence [40]. Previous research has explored ultrasonic sensors [41], variable-focus industrial cameras [42], Digital twin [43], and machine learning algorithms [44] to capture crane hook positions and visual information in surveillance videos, enabling effective hook tracking.
This paper introduces the FLE-YOLO object detector, which is distinguished by its ability to efficiently capture hook position information against complex backgrounds in surveillance footage, as well as its lightweight network structure and reduced computational complexity. This lays the groundwork for deploying the algorithm on hardware devices and automating crane hook tracking in surveillance footage. Table 2 summarizes the safety status monitoring methods of tower cranes based on different aspects.

3. Dataset Processing

3.1. Collection of Tower Crane Hook Datasets

The acquisition of tower crane hook images is sourced from China Gansu Construction Investment Group Machinery Company, which possesses a comprehensive range of tower crane equipment and provides leasing services for over a hundred sets of tower crane equipment, covering various construction sites in Gansu province. The collected datasets of tower crane data exhibit broad representativeness. The dataset was compiled using smartphones and spherical cameras to capture images of tower crane hooks from various angles and in different settings. It includes hook images with different angles and scenes, with backgrounds such as the ground, construction debris, sky, suspended objects, buildings, and trees under natural lighting conditions, as shown in Figure 1.
It also includes hook images under dim lighting, conditions with obstructions, long-distance little targets (high-altitude operations with a crane arm length of about a hundred meters), and nighttime illumination conditions, as shown in Figure 2; the timestamps in the upper-left corner of (a) and the lower-right corner of (e) follow the format “YYYY-MM-DD Weekday HH:MM:SS” in China Standard Time (UTC + 8). The dataset adequately captures the state characteristics of the hook under complex environmental factors. Additionally, images of construction sites without hook elements are captured as negative samples to enhance robustness and aid in achieving training balance for the model.

3.2. Dataset Construction

To evaluate the generalization of this model, 1526 images were selected from the captured samples as the sample dataset. The dataset was split into a training set, validation set, and test set following a 9:1:1 proportion. The dataset was augmented by performing operations such as resizing, random cropping, deformation, rotation, flipping, pixel shifting, and adding noise to the images, resulting in an expanded dataset of 10,682 images. Additionally, the Mosaic data augmentation method [45] from the YOLOv8 source code was applied. The LabelImg annotation tool was utilized to annotate the hooks in VOC format [46], which were then converted to YOLO format during training. Figure 3 shows an example of the image augmentation used in this paper.
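As a minimal illustration of the augmentation operations listed above (not the authors' exact pipeline; the parameter ranges are assumptions for the sketch, and bounding boxes would need to be transformed alongside the images), the geometric and photometric transformations can be reproduced with OpenCV and NumPy roughly as follows:

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    """Apply the augmentations described above: resize, rotation,
    flip, pixel shift, and additive noise (random cropping omitted)."""
    h, w = image.shape[:2]

    # Random rescale and resize back (simulates scale variation).
    scale = np.random.uniform(0.8, 1.2)
    image = cv2.resize(image, (int(w * scale), int(h * scale)))
    image = cv2.resize(image, (w, h))

    # Random rotation about the image center.
    angle = np.random.uniform(-15, 15)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, m, (w, h))

    # Random horizontal flip.
    if np.random.rand() < 0.5:
        image = cv2.flip(image, 1)

    # Random pixel shift (translation).
    tx, ty = np.random.randint(-20, 21, size=2)
    m = np.float32([[1, 0, tx], [0, 1, ty]])
    image = cv2.warpAffine(image, m, (w, h))

    # Additive Gaussian noise.
    noise = np.random.normal(0, 8, image.shape).astype(np.float32)
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```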

4. The Proposed Method

4.1. Backbone Feature Extraction Network

4.1.1. Adding FasterNet Backbone

FasterNet [19] improves the computational speed, FLOPS (floating-point operations per second), of the model while reducing the computational complexity, FLOPs (floating-point operations), of the network by introducing simple, fast, and efficient convolutional modules called PConv and PWConv. The relationship between the two and the detection latency is illustrated by Equation (1). The backbone network achieves low latency during detection by reducing FLOPs while also increasing FLOPS.
$$\mathrm{Latency} = \frac{\mathrm{FLOPs}}{\mathrm{FLOPS}} \quad (1)$$
As illustrated in Figure 4, the PConv module can simultaneously reduce computational redundancy and memory access, where the * denotes the convolution operation. It applies a conventional Conv for spatial feature extraction only to the input channels before the dashed line (the $c_p$ part) in the figure, while keeping the remaining channels unchanged. Therefore, for a given input tensor, the FLOPs of PConv can be calculated using the following formula:
$$h \times w \times k^2 \times c_p^2 \quad (2)$$
where h is the input tensor’s height, w is the input tensor’s width, k is the filter’s size, and $c_p$ represents the number of channels used for spatial feature extraction.
Additionally, PConv features a smaller memory access footprint, thereby enhancing the computational speed FLOPS of the model, as demonstrated in (3).
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \quad (3)$$
Since PConv only performs convolution operations on a subset of channels, to fully and effectively utilize information from all channels, a point-wise convolution, PWConv (Point-Wise Convolution), is further appended after PConv.
The overall architecture of the FasterNet backbone network is illustrated in Figure 5. It consists of four hierarchical stages, each preceded by an embedding layer or a merging layer, which perform spatial down sampling and channel expansion. Multiple FasterNet blocks are arranged within each stage. Each block begins with a PConv layer and continues with two PWConv layers. They are combined in an inverted residual block, where the middle layer has an increased number of channels, and shortcut connections are placed to reuse input features. Finally, global average pooling, a 1 × 1 convolutional layer, and fully connected layers are employed collectively for feature transformation and classification.
FasterNet encompasses models of different sizes suited to distinct tasks, and in this paper, FasterNet_t0 is selected as the backbone of FLE-YOLO.
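For reference, a minimal PyTorch sketch of the PConv + PWConv pattern described above is given below. This reflects our reading of [19] rather than the authors' implementation; the partial ratio of 1/4 and the expansion factor of 2 are assumptions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: apply a k x k conv to the first c_p channels
    and pass the remaining channels through untouched."""
    def __init__(self, channels: int, kernel_size: int = 3, partial_ratio: float = 0.25):
        super().__init__()
        self.c_p = int(channels * partial_ratio)
        self.conv = nn.Conv2d(self.c_p, self.c_p, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_conv, x_rest = torch.split(x, [self.c_p, x.shape[1] - self.c_p], dim=1)
        return torch.cat((self.conv(x_conv), x_rest), dim=1)

class FasterNetBlock(nn.Module):
    """PConv followed by two point-wise (1 x 1) convolutions arranged as an
    inverted residual block with a shortcut connection, as in Figure 5."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.pwconv = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pwconv(self.pconv(x))
```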

4.1.2. Triplet Attention Mechanism

The attention mechanism’s essence lies in efficiently allocating information processing resources [47]. In computer vision, the role of attention mechanism is to optimize traditional visual search methods, allowing selective adjustments to the processing of visual input by the network [48]. It enables the capture of important information while ignoring irrelevant factors in the scene, providing more convenience for data processing.
In order to reinforce the focus of the detector on crucial objects within the visual scene while keeping the algorithm lightweight, this paper introduces the Triplet Attention mechanism [20]. This attention mechanism can be effectively combined with the enhanced FasterNet backbone network as an auxiliary module. Specifically, it is positioned as the final layer of the backbone network, following the SPPF module.
The model structure of Triplet Attention is illustrated in Figure 6. The structure includes three parallel branches for a given input tensor $I \in \mathbb{R}^{c \times h \times w}$. The first and second branches of Triplet Attention combine channel attention with spatial attention, establishing cross-dimensional information interaction between the h and c dimensions, as well as the w and c dimensions. This enables the preservation of a significant amount of channel information. The third branch, similar to the attention mechanism CBAM (Convolutional Block Attention Module) [49], constructs spatial attention by establishing interaction between the h and w dimensions. By averaging and aggregating the outputs of the three branches, the final output is obtained.
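A compact PyTorch sketch of the three-branch interaction described above is shown below; this is a paraphrase of [20] rather than the authors' code, and the 7 × 7 kernel size of the attention gate is an assumption.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along the leading (channel-like) dim."""
    def forward(self, x):
        return torch.cat((x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)), dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three parallel branches capture (c, w), (c, h), and (h, w) interactions
    by rotating the tensor before a shared gate pattern; outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.gate_cw = AttentionGate()  # pools over h, attends over (c, w)
        self.gate_ch = AttentionGate()  # pools over w, attends over (h, c)
        self.gate_hw = AttentionGate()  # plain spatial attention over (h, w)

    def forward(self, x):  # x: (N, C, H, W)
        x_cw = self.gate_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        x_ch = self.gate_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        x_hw = self.gate_hw(x)
        return (x_cw + x_ch + x_hw) / 3.0
```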

4.2. Neck Feature Fusion Network

Slim-Neck

The Slim-neck [21] architecture is built on the lightweight convolutional GSConv module. As depicted in Figure 7a, GSConv first downsamples the input through a 1 × 1 ordinary convolution. Subsequently, lightweight feature extraction is performed using DWConv, a depth-wise separable convolution. The results of these two convolutions are then concatenated, followed by a shuffle operation. This operation splits the feature maps along the channel dimension into two parts and reorders the dimension sequence, thereby interleaving the channels produced by the two previous convolutions. Backbone networks primarily composed of convolutional neural network modules tend to compress the input images in height and width and expand them in channels, thereby partially losing semantic information within the images. However, GSConv effectively preserves the semantic information of the lower layers by deepening the network structure.
Building upon GSConv, the GSBottleneck module is introduced by performing Shortcut operations between two GSConv layers and the input consecutively, as illustrated in Figure 7b.
Furthermore, employing a one-shot aggregation approach, Figure 7c illustrates the architecture of the devised cross-level partial network VoV-GSCSP module. Initially, the input is divided into two parts, each undergoing a 1 × 1 convolution for down sampling to halve the channel count. Subsequently, one part is passed through a GSBottleneck module, while the other part undergoes concatenation with it. Finally, an output is obtained through a 1 × 1 convolution operation.
According to the requirements of the network structure, the Slim-neck structure is flexibly composed by combining the three components. Furthermore, in order to reduce data flow resistance and shorten inference time, this paper opts to introduce this module into the neck section of FLE-YOLO. This decision is made based on the observation that when the feature maps reach the neck structure, there is no longer a need for size and channel transformations, thereby maximizing the lightweight network structure while retaining essential information.
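A minimal PyTorch sketch of GSConv as described above is given below (our reading of [21]; the kernel sizes, activation, and exact interleaving in the channel shuffle are assumptions):

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Half the output channels come from an ordinary convolution, half from a
    depth-wise convolution of that result; the two halves are concatenated and
    interleaved by a channel shuffle, as sketched in Figure 7a."""
    def __init__(self, c_in: int, c_out: int, kernel_size: int = 1, stride: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(c_half, c_half, 3, 1, 1, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.conv(x)
        y2 = self.dwconv(y1)
        y = torch.cat((y1, y2), dim=1)
        # Channel shuffle: interleave the two halves channel by channel.
        n, c, h, w = y.shape
        return y.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
```

The GSBottleneck and VoV-GSCSP modules in Figure 7b,c are then assembled from this building block together with shortcut, split, and concatenation operations.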

4.3. Head Prediction Network

Dynamic Head

The head of object detection, serving as a vital constituent of the detector, is responsible for producing the network’s output. An effective head structure possesses scale awareness, spatial awareness, and task-perception capabilities. Dyhead [22], functioning as a dynamic detection head, adeptly integrates these elements, as depicted in Figure 8.
In Figure 8, * represents the convolution operation. Attention mechanisms are deployed separately on each specific dimension of the features: the scale-aware attention module ($\pi_L$) is deployed on the level dimension, the spatial-aware attention module ($\pi_S$) is deployed on the spatial dimension, and the task-aware attention module ($\pi_C$) is deployed on the channel dimension.
$$W(\gamma) = \pi_C\big(\pi_S\big(\pi_L(\gamma) \cdot \gamma\big) \cdot \gamma\big) \cdot \gamma \quad (4)$$
As shown in Equation (4), given a feature tensor $\gamma \in \mathbb{R}^{L \times S \times C}$, the conventional attention mechanism is transformed into three consecutive, functionally different attentions, each of which focuses only on the corresponding information. This reduces the dimensionality handled by each attention while also enhancing the focus on target objects. Figure 9 illustrates the model structure of Dyhead.
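The following schematic sketch illustrates only the sequential composition in Equation (4). Note that the real Dyhead implements the spatial-aware attention with deformable convolution and the task-aware attention with a dynamic-ReLU-style module; both are replaced here by simple sigmoid-gated placeholders, so this is an illustration of the chaining, not of Dyhead itself.

```python
import torch
import torch.nn as nn

class DyHeadBlockSketch(nn.Module):
    """Illustrative only: chains scale-, spatial-, and task-aware attention
    over a feature tensor of shape (N, L, S, C), where L is the number of
    pyramid levels, S = H * W, and C is the channel count."""
    def __init__(self, channels: int):
        super().__init__()
        # pi_L: one weight per pyramid level from spatially pooled features.
        self.scale_fc = nn.Linear(channels, 1)
        # pi_S placeholder: per-position weight (Dyhead uses deformable conv).
        self.spatial_fc = nn.Linear(channels, 1)
        # pi_C placeholder: per-channel gate (Dyhead uses a dynamic-ReLU module).
        self.task_fc = nn.Linear(channels, channels)

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (N, L, S, C)
        pi_l = torch.sigmoid(self.scale_fc(f.mean(dim=2)))     # (N, L, 1)
        f = f * pi_l.unsqueeze(2)                               # scale-aware
        pi_s = torch.sigmoid(self.spatial_fc(f))                # (N, L, S, 1)
        f = f * pi_s                                            # spatial-aware
        pi_c = torch.sigmoid(self.task_fc(f.mean(dim=(1, 2))))  # (N, C)
        return f * pi_c.unsqueeze(1).unsqueeze(1)               # task-aware
```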

4.4. The Coordinate Positioning and Tracking of Crane Hooks

In the process of tower crane operations, the hooks are in motion. To ensure that the hooks are visible in the camera footage for operational purposes, after identifying the hooks in the camera footage, subsequent tracking of the hooks is achieved through gimbal control of the camera. By calculating the difference between the center position of the hook detection box and the center of the camera frame, the spherical camera’s rotation is controlled to ensure that the hook remains at the center of the camera frame, thereby achieving intelligent hook tracking.
Step 1: Calculation of the center coordinates of the crane hook.
The target detection algorithm provides the coordinates of the top-left and bottom-right corners of the detection box. By calculating the coordinates of the center point of the detection box, it is approximated as the center point coordinates of the crane hook. Equations (5) and (6) illustrate the method used in this paper to compute the center point coordinates.
$$\mathrm{center\_point\_x} = \mathrm{left} + (\mathrm{right} - \mathrm{left})/2 \quad (5)$$
$$\mathrm{center\_point\_y} = \mathrm{top} + (\mathrm{bottom} - \mathrm{top})/2 \quad (6)$$
In the equations, (center_point_x, center_point_y) represents the approximate 2D coordinates of the center point of the crane hook in the frame, (top, left) denotes the 2D coordinates of the top-left corner of the detection box in the frame, and (bottom, right) represents the 2D coordinates of the bottom-right corner of the detection box in the frame.
As shown in Figure 10, the red dot within the detection box represents the central point, abbreviated as CP (Center Point). The values in parentheses denote the two-dimensional coordinates representing the center point of the crane hook within the frame.
Step 2: Calculation of the difference.
The image.shape attribute of the OpenCV library is used to acquire the height and width of the spherical camera’s captured image, denoted as height and width, respectively. The coordinates of the center point of the camera frame are calculated as shown in Equations (7) and (8). The difference between the center point coordinates of the crane hook and the center point coordinates of the camera frame is expressed in Equations (9) and (10).
$$\mathrm{Center\_x} = \mathrm{width}/2 \quad (7)$$
$$\mathrm{Center\_y} = \mathrm{height}/2 \quad (8)$$
$$\Delta x = \mathrm{Center\_x} - \mathrm{center\_point\_x} \quad (9)$$
$$\Delta y = \mathrm{Center\_y} - \mathrm{center\_point\_y} \quad (10)$$
In the equations, (Center_x, Center_y) represents the coordinates of the center point of the camera frame, while Δx and Δy denote the differences between the horizontal and vertical coordinates of the crane hook’s center point and the camera frame’s center point, respectively.
Step 3: Pan-Tilt Control for Automatic Tracking of the Crane Hook.
In this work, we employed a spherical camera from Hikvision to capture images of the crane hook. Based on the control protocol, corresponding instructions are continuously transmitted to the serial port via the transmission of coordinate differences, enabling automated tracking of the crane hook. Equation (11) illustrates the scheduling strategy for the camera.
$$T = \begin{cases} 1, & \Delta x > 0 \\ 2, & \Delta x < 0 \\ 3, & \Delta y > 0 \\ 4, & \Delta y < 0 \\ 0, & \text{else} \end{cases} \quad (11)$$
T denotes the direction of control for the pan-tilt unit. If T equals 1, it instructs the pan-tilt unit to move horizontally to the left; if T equals 2, it directs the pan-tilt unit to move horizontally to the right. Similarly, if T equals 3, it commands the pan-tilt unit to move vertically upwards; if T equals 4, it instructs the pan-tilt unit to move vertically downwards. When T equals 0, it indicates that the pan-tilt unit should remain stationary. Through the camera scheduling strategy, the pan-tilt unit’s movements are judiciously adjusted to achieve automatic tracking of the crane hook.
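Putting Equations (5)–(11) together, the per-frame tracking logic can be sketched as follows. The detection box is assumed to come from FLE-YOLO, the frame is a NumPy array whose shape attribute provides height and width (as with OpenCV images), and the mapping from T to a serial-port command is omitted because the Hikvision pan-tilt protocol is not reproduced here.

```python
import numpy as np

def track_hook(frame: np.ndarray, detection) -> int:
    """detection = (left, top, right, bottom) from the detector;
    returns the pan-tilt command T defined in Equation (11)."""
    # Step 1: center of the hook detection box, Equations (5) and (6).
    left, top, right, bottom = detection
    center_point_x = left + (right - left) / 2
    center_point_y = top + (bottom - top) / 2

    # Step 2: center of the camera frame and the offsets, Equations (7)-(10).
    height, width = frame.shape[:2]
    center_x, center_y = width / 2, height / 2
    dx = center_x - center_point_x
    dy = center_y - center_point_y

    # Step 3: scheduling strategy T, Equation (11); the cases are checked in
    # the order listed there (in practice a dead zone would reduce jitter).
    if dx > 0:
        return 1   # pan left
    if dx < 0:
        return 2   # pan right
    if dy > 0:
        return 3   # tilt up
    if dy < 0:
        return 4   # tilt down
    return 0       # stationary

if __name__ == "__main__":
    # Synthetic 1080p frame and a dummy detection box, for illustration only.
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
    print(track_hook(frame, (800.0, 400.0, 900.0, 520.0)))
```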
To sum up, the proposed algorithm in this paper is referred to as the FLE-YOLO algorithm, and its network architecture is depicted in Figure 11.
Initially, image preprocessing and standardization of the input image dimensions are conducted at the input end. Given that the images are in color and consist of three RGB channels, the network input is resized to 640 × 640 × 3. Subsequently, the images undergo feature extraction using the FasterNet backbone network. Through multiple layers of PConv and PWConv convolution, along with five down-sampling stages, feature maps of three different dimensions, namely 80 × 80 × 256, 40 × 40 × 512, and 20 × 20 × 1024, are obtained. These feature maps then pass through global pooling and the Triplet Attention layer before entering the neck structure for subsequent feature fusion. Within the neck structure, the original C2f module is replaced with the VoV-GSCSP module from the Slim-neck architecture while retaining the original bottom-up and top-down feature fusion. This replacement enhances the acquisition of richer semantic information from lower-level images and captures more detailed information from higher-level images. Finally, the Dyhead structure unifies scale perception, spatial perception, and task perception of the feature maps, generating detection results for the three different-sized feature maps.

5. Experimental Results and Discussion

5.1. Hardware Configuration and Hyper-Parameter Settings

The software and hardware configuration for this experiment is shown in Table 3.
During the training process, to maintain training stability and optimize resource utilization given the memory of the graphics card on the experimental device, the batch size was set to 4, and the number of worker threads for loading data was set to 1 to avoid “out of memory” issues during training. The SGD optimizer was configured with a learning rate of 0.01 and a momentum of 0.937. Training on the VOC2012, COCO2017, and hook datasets used in this paper ran for 100 epochs each. During training, GPU memory usage was about 7.5 GB and GPU utilization was about 75%.
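Since the model builds on the YOLOv8 code base, the hyper-parameters above map onto a training call roughly as follows. This is a sketch using the Ultralytics training API; the model and dataset YAML file names are placeholders, and the FLE-YOLO modules would need to be registered in the code base for such a configuration to load.

```python
from ultralytics import YOLO

# Placeholder model definition assumed to contain the FLE-YOLO modules.
model = YOLO("fle-yolo.yaml")

model.train(
    data="hook_dataset.yaml",  # placeholder dataset configuration
    epochs=100,                # 100 rounds, as used for all three datasets
    imgsz=640,                 # network input resized to 640 x 640
    batch=4,                   # limited by the GPU memory of the test machine
    workers=1,                 # single data-loading thread to avoid OOM issues
    optimizer="SGD",
    lr0=0.01,                  # initial learning rate
    momentum=0.937,
)
```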

5.2. Comparative Analysis of Experimental Results

5.2.1. Regarding the Attention Mechanism

This article introduces two attention-based modules, Triplet Attention and Dyhead, which play vital roles within the backbone network and head structure, respectively. These modules are designed to enhance the overall performance of the system by addressing different aspects of attention. By employing them, the extraction of crucial information about tower crane hooks within an image becomes multidimensional. Consequently, this opens up new avenues for detecting hooks under challenging conditions, including complex backgrounds, small target sizes, and instances of occluded hooks. Notably, the proposed modules leverage high levels of attention to prioritize relevant target information while effectively discarding extraneous and less significant details.
To visually showcase the benefits of Triplet Attention and Dyhead in terms of attention, this article employs the CAM (Class Activation Map) method [50]. The output results of the YOLOv8 and FLE-YOLO are observed in the form of heat maps, as shown in Figure 12.
Based on the analysis of three representative cases, it can be inferred from the figures that the enhanced network successfully filters out certain non-target elements in the background. Additionally, it exhibits a heightened focus on the crane hooks in the images, as indicated by a more conspicuous coverage in the heat map.

5.2.2. Performance Comparison of Different Networks

In order to evaluate the performance of the improved models in terms of detection accuracy, detection speed, and lightweight design, this study selects five metrics: P (precision), AP50 (average precision at 50% IoU), FPS (frames per second), FLOPs (floating-point operations), and parameter count. These metrics serve as the standards for assessment.
To assess the generalizability of the model, the first experiments were conducted on two extensively used large public datasets: COCO2017, which consists of 80 categories, and VOC2012, which consists of 20 categories. The performances of the FLE-YOLO and YOLOv8 networks on these datasets were evaluated using the five metrics, as shown in Table 4.
During model validation, the batch size was uniformly set to 32, and the weights from the different network trainings were assessed. The experimental results show that FLE-YOLO performs best in terms of precision, mAP50, FLOPs, and parameters on the VOC2012 dataset, with precision increased by 2% and 1% and mAP50 increased by 4.8% and 3.3%, respectively. In inference speed, FPS is highest for YOLOv12s, while FLE-YOLO is 3.832 f/s faster than YOLOv8s. The computational complexity and parameter count decreased by 9.1 G and 1.9 G, and by 2.244 M and 0.349 M, respectively. On the COCO2017 dataset, compared to the original algorithm, FLE-YOLO maintained similar levels of precision and mAP50, with detection speed increased by 3.764 f/s, though slightly lower than YOLOv12s. In terms of computational cost, FLE-YOLO performs better, with computational complexity and parameters reduced by 9 G and 4.5 G, and by 2.248 M and 0.452 M, respectively.
From the experimental results, as the number of categories increases, the performance of precision and mAP50 shows a slight downward trend, but it can still maintain a basically consistent detection effect with the original algorithm. The occurrence of this phenomenon may be due to the fact that as the number of target categories increases, the feature space becomes more complex, requiring more complex models to capture more category features. It is worth noting that the improved algorithm has outstanding advantages in its lightweight design compared to the original algorithm. Therefore, the improved algorithm may exhibit better performance for various evaluation indicators on single or few category datasets.
As shown in Table 5, the performances of various networks on the crane hook dataset developed for this research were evaluated. The experimental results indicate that FLE-YOLO generally outperforms the other single-stage models. Compared with YOLOv8, it maintains the same precision while improving all other indicators: mAP50 increased by 0.6%, and detection speed increased by 47.619 f/s. The computational complexity and parameter count decreased by 9 GFLOPs and 3.537 M, respectively. This indicates that the FLE-YOLO algorithm effectively inherits the contributions of the various improved modules over the original YOLOv8s algorithm. Compared with other state-of-the-art algorithms of the same scale within the YOLO series, including YOLOv9s, YOLOv10s, YOLOv11s, and YOLOv12s, FLE-YOLO shows the best performance in terms of precision, AP50, and FLOPs. Precision increased by 0.7%, 0.3%, 1.1%, and 2.2%, while mAP50 increased by 3.1%, 1.2%, 1.9%, and 2.3%. FLOPs, which measure model complexity, decreased by 7.3 G, 2 G, 1.9 G, and 1.8 G, respectively. This demonstrates that FLE-YOLO performs well in tower crane hook detection with lower model complexity. Compared to the best algorithms in the series, the remaining metrics show only minor differences, indicating high practicality and applicability.
Given that the dataset used in this study is a single category crane hook dataset, the improved network exhibited significant enhancements in both detection performance and lightweight design. Real-time and effective detection of the hook in the frame provides accurate two-dimensional positioning coordinates for the camera gimbal, ensuring effective tracking of the hook from the remote-control room.

5.2.3. Ablation Experiment

In order to assess the necessity and effectiveness of each improvement module, we conducted ablation experiments to evaluate their impact on different evaluation metrics. This allowed us to determine the specific role of each module within the algorithm. The comparative results of the ablation experiments can be found in Table 6.
After replacing the backbone network with FasterNet, the precision decreased by 2.5%, indicating comparable performance to the original network; the AP50 metric declined by 0.7%, while the FPS improved by 56.27 f/s. Furthermore, computational complexity dropped by 6.7 GFLOPs, along with a 2.51 M reduction in parameter count. The number of parameters decreased the most, mainly due to the effect of PConv convolution and its variants, which allow the model to extract features through partial convolution. These results suggest that the network is lightweight on the given datasets.
To compensate for the accuracy drop caused by parameter reduction, the Triplet Attention mechanism is introduced to balance this issue. It improves accuracy by 1.0% without significantly increasing the number of parameters, as it optimizes feature weight allocation through cross-attention across the different dimensions (H, W, C) rather than relying on a large stack of convolutional kernels. By introducing the attention mechanism, the focus of the network on hooks is notably heightened, enhancing its ability to capture global information, which keeps AP50 roughly stable. Additionally, detection speed increases by 97.06 f/s, while computational complexity and parameter size remain consistent with the original algorithm. Therefore, Triplet Attention is an almost parameter-free attention mechanism that compensates for the accuracy loss caused by adopting the parameterized FasterNet backbone.
In general, by employing Slim-neck to replace the neck structure, an enhancement of 0.2% in precision is observed, while AP50 is consistent with the benchmark model. Additionally, the group convolution and depth-wise separable convolution operations of GSConv in Slim-neck effectively reduce computational complexity and workload, accelerating image inference: detection speed increases by 41.74 f/s, accompanied by a decrease of 3.3 GFLOPs in computation and a decrease of 0.86 M in parameter count. These improvements across the evaluation metrics highlight the effectiveness of the GSConv convolution.
Furthermore, the Dyhead head module is not optimal in terms of model complexity and processing speed, but it is introduced mainly to balance the lightweight effect of FasterNet. When the head was replaced solely with Dyhead, precision increased by 0.4%. However, AP50 exhibited a slight decline of 1.4%, which remained within an acceptable range. The FPS improved notably, by 29.76 f/s, while the computational complexity and parameter count were reduced by 0.3 GFLOPs and 0.275 M, respectively. These findings collectively demonstrate the value of each improvement introduced in the algorithm.

5.3. Experimental Systems and Testing in Construction Sites

5.3.1. The Construction of Experimental Platforms

This paper proposes a scheme for hook detection and tracking, using the TC5613A tower crane with a lifting height reaching 40 m, a jib span of 56 m, a minimum working amplitude of 2 m, and a maximum working amplitude of 56 m. The spherical camera is positioned adjacent to the driver’s cabin, facing the jib for monitoring and recording. The camera layout and the monitoring range of the hook are shown in Figure 13. The imaging plane of the camera is parallel to the tower body, with the base of the tower serving as the origin of the world reference frame.
By connecting to the wireless local area network deployed within the tower crane construction area, clear video monitoring footage of critical sections of the tower crane can be viewed from the upper computer’s web interface.
In order to comprehensively monitor the operational information of tower cranes during remote operations, observe the working environment of the tower crane in real time, and promptly identify any safety hazards, this system established a total of four visualization monitoring areas, as illustrated in Figure 14; the timestamp in the upper-left corner follows the format “YYYY-MM-DD Weekday HH:MM:SS” in China Standard Time (UTC + 8). These areas include the monitoring screen demonstrating the implementation of hook detection and automatic tracking, shown in monitoring area I; the monitoring screen for the tower crane hoisting mechanism, displayed in monitoring area II; the overhead view monitoring screen of the hook, depicted in monitoring area III; and the monitoring screen of the tower crane’s cabin, presented in monitoring area IV.
Figure 15 shows the flowchart of the hook detection and tracking experiment, which illustrates the steps and key links of the entire experimental process in order to demonstrate the experimental design and execution more intuitively. First, the camera video stream is obtained and the current frame is preprocessed. The frame is then fed into the FLE-YOLO algorithm to obtain the detection box, confidence, and coordinates of the hook. If no hook is detected, the next frame is retrieved from the video stream for detection. If a hook is detected, the two-dimensional coordinates of the hook are output, the difference between these coordinates and the center point of the current image is written to the serial port, tracking is implemented through camera scheduling strategy T, and the above steps are repeated.

5.3.2. Visualization Effect of Object Detection Algorithm

As shown in Figure 16, a comparison of detection performance between the YOLO series algorithms and the algorithm proposed in this paper under various conditions within the system is presented.
It is demonstrated through detection results that the FLE-YOLO algorithm proposed is applicable to images of hooks under different backgrounds, lighting conditions, and angles, accurately detecting the two-dimensional position information of the hook. Furthermore, compared with other algorithms in the same series, it effectively addresses issues of missed detection, false detection, and over-detection under conditions such as darkness, blurriness, dimness, complex backgrounds, and exposure (indicated by blue ellipses and rectangular textboxes in the figure). Moreover, the detection confidence of this model is higher than that of other algorithms.

5.3.3. The Implementation of Automatic Hook Tracking

This study centers on the target detection of tower crane hooks and transmits hook position information to a spherical camera for corresponding automatic tracking. As illustrated in Figure 17, the tracking effect in the experimental field is demonstrated.
The figure demonstrates the tracking effects under vertical and horizontal tracking. With an increase in frames, it becomes evident that the camera accurately captures the position of the hook during vertical or horizontal operations of the tower crane. By rotating the camera, the hook remains consistently within the camera frame and as close to the center of the frame as possible. Tower crane operators on the ground or in the cabin utilize a PC or Pad to observe and operate the crane remotely, achieving unmanned operation of the tower crane, as shown in Figure 18. The interface is the tower crane information management system interface, and the right side of the interface shows the visual monitoring interface. The lower part of the interface displays real-time measurements, with the first row showing Temperature, Humidity, Illuminance, Wind Speed, Tilt Angle X, and Tilt Angle Y. The second row displays Tilt Angle Z, Height, Rotation Angle, Amplitude, and Measurement Value.

6. Conclusions

In the context of safety monitoring for unmanned tower cranes, this study addresses the recognition and tracking of crane hooks amid the varied challenges inherent to construction sites. Building on object detection algorithms as the primary framework, we introduce FLE-YOLO, an algorithm engineered to be both lightweight and effective in operation.
FLE-YOLO effectively improves the backbone, neck, and head of the detection model. The experimental results indicate significant improvements achieved by the algorithm from a lightweight perspective on public datasets. On the tower crane hook dataset collected in this paper, compared to the original algorithm, the precision rate reached 97.3% and remained consistent; the AP50 reached 98.3%, an increase of 0.6 percentage points; the detection speed FPS reached 142.857, an increase of 47.619; the computational complexity reached 19.4 GFLOPs, a reduction of 10 GFLOPs, and the parameter count reached 7.588 M, a reduction of 3.538 M. By deploying the algorithm on construction sites, effective detection and automatic tracking of tower crane hooks under complex conditions have been achieved.
The subsequent phase of this research endeavor entails the comprehensive establishment of a tower crane safety monitoring system. This will be achieved by incorporating a tower crane safety monitoring module based on multi-sensor data fusion, enabling a holistic evaluation of the operational status of the tower crane. This evaluation encompasses various parameters, including but not limited to height, wind speed, rotation angle, and others. These data points will be synergistically integrated with the tower crane hook recognition and tracking module proposed in this paper with the overarching objective of enhancing the safety and operational efficiency of tower crane functionalities.

Author Contributions

X.H.: Conceptualization, Methodology. X.W.: Data curation, Writing—Original draft. Y.C.: Writing—original draft, Software. J.X.: Supervision, project administration. H.C.: Validation, Resources. F.A.: Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Key Industrial Chain Project in Xi’an City (23ZDCYJSGG0013-2023), Shaanxi Qin Chuang yuan “Scientist + Engineer” Team Construction (2024QCY-KXJ-161), and the Xianyang City Key Research and Development Program Project (L2024-ZDYF-ZDYF-GY-0004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available at: https://github.com/sugar-fifty-doge/FLE-YOLO (accessed on 6 May 2025).

Acknowledgments

We sincerely thank the anonymous reviewers and editors for their valuable suggestions and opinions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Han, Z.; Weng, W.; Zhao, Q.; Ma, X.; Liu, Q.; Huang, Q. Investigation on an Integrated Evacuation Route Planning Method Based on Real-Time Data Acquisition for High-Rise Building Fire. IEEE Trans. Intell. Transp. Syst. 2013, 14, 782–795. [Google Scholar] [CrossRef]
  2. Xiong, H.; Xiong, Q.; Zhou, B.; Abbas, N.; Kong, Q.; Yuan, C. Field vibration evaluation and dynamics estimation of a super high-rise building under typhoon conditions: Data-model dual driven. J. Civ. Struct. Health Monit. 2023, 13, 235–249. [Google Scholar] [CrossRef]
  3. Hung, W.-H.; Kang, S.-C. Configurable model for real-time crane erection visualization. Adv. Eng. Softw. 2013, 65, 1–11. [Google Scholar] [CrossRef]
  4. Gutierrez, R.; Magallon, M.; Hernandez, D.C. Vision-Based System for 3D Tower Crane Monitoring. IEEE Sens. J. 2021, 21, 11935–11945. [Google Scholar] [CrossRef]
  5. Tong, Z.; Wu, W.; Guo, B.; Zhang, J.; He, Y. Research on vibration damping model of flat-head tower crane system based on particle damping vibration absorber. J. Braz. Soc. Mech. Sci. Eng. 2023, 45, 557. [Google Scholar] [CrossRef]
  6. Zhang, D. Statistical analysis of safety accident cases of tower cranes from 2014 to 2022. Constr. Saf. 2025, 40, 81–85. [Google Scholar]
  7. Chen, Y.; Zeng, Q.; Zheng, X.; Shao, B.; Jin, L. Safety supervision of tower crane operation on construction sites: An evolutionary game analysis. Saf. Sci. 2022, 152, 105578. [Google Scholar] [CrossRef]
  8. Shapira, A.; Rosenfeld, Y.; Mizrahi, I. Vision System for Tower Cranes. J. Constr. Eng. Manag. 2008, 134, 320–332. [Google Scholar] [CrossRef]
  9. Zhang, M.; Zhang, Y.; Ji, B.; Ma, C.; Cheng, X. Adaptive sway reduction for tower crane systems with varying cable lengths. Autom. Constr. 2020, 119, 103342. [Google Scholar] [CrossRef]
  10. Zhong, D.; Lv, H.; Han, J.; Wei, Q. A Practical Application Combining Wireless Sensor Networks and Internet of Things: Safety Management System for Tower Crane Groups. Sensors 2014, 14, 13794–13814. [Google Scholar] [CrossRef]
  11. Postigo, J.A.; Garaigordobil, A.; Ansola, R.; Canales, J. Topology optimization of Shell–Infill structures with enhanced edge-detection and coating thickness control. Adv. Eng. Softw. 2024, 189, 103587. [Google Scholar] [CrossRef]
  12. Xiong, X.; Zhang, Y.; Zhou, Q.; Zhao, J. Swing angle detection system of bridge crane based on YOLOv3. Hoisting Conveying Mach. 2021, 4, 30–33. [Google Scholar] [CrossRef]
  13. Liang, G.; Li, X.; Rao, Y.; Yang, L.; Shang, B. A Transformer Guides YOLOv5 to Identify Illegal Operation of High-altitude Hooks. Electr. Eng. 2023, 10, 1–4. [Google Scholar] [CrossRef]
  14. Lu, X.; Sun, X.; Tian, Z.; Wang, Y. Research on Dangerous Area Identification Method of Tower Crane Based on Improved YOLOv5s. Water Power 2023, 49, 68–77. [Google Scholar] [CrossRef]
  15. Sun, X.; Lu, X.; Wang, Y.; He, T.; Tian, Z. Development and Application of Small Object Visual Recognition Algorithm in Assisting Safety Management of Tower Cranes. Buildings 2024, 14, 3728. [Google Scholar] [CrossRef]
  16. Pang, Y.; Li, Z.; Liu, W.; Li, T.; Wang, N. Small target detection model in overlooking scenes on tower cranes based on improved real-time detection Transformer. J. Comput. Appl. 2024, 44, 3922–3929. [Google Scholar]
  17. Pei, J.; Wu, X.; Liu, X.; Gao, L.; Yu, S.; Zheng, X. SGD-YOLOv5: A Small Object Detection Model for Complex Industrial Environments. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–10. [Google Scholar]
  18. Xia, J.; Ouyang, H.; Li, S. Fixed-time observer-based back-stepping controller design for tower cranes with mismatched disturbance. Nonlinear Dyn. 2023, 111, 355–367. [Google Scholar] [CrossRef]
  19. Chen, J.; Kao, S.-h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  20. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 3138–3147. [Google Scholar]
  21. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  22. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  23. Li, J.; Chen, J.; Sheng, B.; Li, P.; Yang, P.; Feng, D.D.; Qi, J. Automatic Detection and Classification System of Domestic Waste via Multimodel Cascaded Convolutional Neural Network. IEEE Trans. Ind. Inform. 2022, 18, 163–173. [Google Scholar] [CrossRef]
  24. Nguyen, D.H.; Abdel Wahab, M. Damage detection in slab structures based on two-dimensional curvature mode shape method and Faster R-CNN. Adv. Eng. Softw. 2023, 176, 103371. [Google Scholar] [CrossRef]
  25. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  27. Jiang, Q.; Jia, M.; Bi, L.; Zhuang, Z.; Gao, K. Development of a core feature identification application based on the Faster R-CNN algorithm. Eng. Appl. Artif. Intell. 2022, 115, 105200. [Google Scholar] [CrossRef]
  28. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  29. Li, Y.-l.; Feng, Y.; Zhou, M.-l.; Xiong, X.-c.; Wang, Y.-h.; Qiang, B.-h. DMA-YOLO: Multi-scale object detection method with attention mechanism for aerial images. Vis. Comput. 2024, 40, 4505–4518. [Google Scholar] [CrossRef]
  30. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2016, arXiv:1512.02325v5. [Google Scholar]
  31. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  32. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10778–10787. [Google Scholar]
  33. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar] [CrossRef]
  34. Kumar, D.; Zhang, X. Improving More Instance Segmentation and Better Object Detection in Remote Sensing Imagery Based on Cascade Mask R-CNN. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4672–4675. [Google Scholar]
  35. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [Google Scholar] [CrossRef]
  36. Zhou, B.; Zi, B.; Qian, S. Dynamics-based nonsingular interval model and luffing angular response field analysis of the DACS with narrowly bounded uncertainty. Nonlinear Dyn. 2017, 90, 2599–2626. [Google Scholar] [CrossRef]
  37. Zhang, M.; Zhang, Y.; Ouyang, H.; Ma, C.; Cheng, X. Adaptive integral sliding mode control with payload sway reduction for 4-DOF tower crane systems. Nonlinear Dyn. 2020, 99, 2727–2741. [Google Scholar] [CrossRef]
  38. Zhou, Q.; Ding, S.; Qing, G.; Hu, J. UAV vision detection method for crane surface cracks based on Faster R-CNN and image segmentation. J. Civ. Struct. Health Monit. 2022, 12, 845–855. [Google Scholar] [CrossRef]
  39. He, J.; He, Z.; Wang, W.; Wang, S.; Li, Z. Visual system design of tower crane based on improved YOLO. In Proceedings of the CNIOT’23: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things, Xiamen, China, 27 July 2023; pp. 798–802. [Google Scholar]
  40. Sun, H.; Dong, Y.; Liu, Z.; Sun, L.; Yu, H. Design and implementation of a smart site supervision system based on Internet of Things. In Proceedings of the 2023 IEEE 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 26–28 May 2023; pp. 1331–1337. [Google Scholar]
  41. Sleiman, J.-P.; Zankoul, E.; Khoury, H.; Hamzeh, F. Sensor-Based Planning Tool for Tower Crane Anti-Collision Monitoring on Construction Sites. In Proceedings of the Construction Research Congress, San Juan, Puerto Rico, 31 May–2 June 2016; pp. 2624–2632. [Google Scholar] [CrossRef]
  42. Jiang, W.; Ding, L.; Zhou, C. Digital twin: Stability analysis for tower crane hoisting safety with a scale model. Autom. Constr. 2022, 138, 104257. [Google Scholar] [CrossRef]
  43. Yu, J.-L.; Zhou, R.-F.; Miao, M.-X.; Huang, H.-Q. An Application of Artificial Neural Networks in Crane Operation Status Monitoring. In Proceedings of the 2015 Chinese Intelligent Automation Conference, Fuzhou, China, 8–10 May 2015; pp. 223–231. [Google Scholar]
  44. Wang, J.; Zhang, Q.; Yang, B.; Zhang, B. Vision-Based Automated Recognition and 3D Localization Framework for Tower Cranes Using Far-Field Cameras. Sensors 2023, 23, 4851. [Google Scholar] [CrossRef] [PubMed]
  45. Wu, D.; Jiang, S.; Zhao, E.; Liu, Y.; Zhu, H.; Wang, W.; Wang, R. Detection of Camellia oleifera Fruit in Complex Scenes by Using YOLOv7 and Data Augmentation. Appl. Sci. 2022, 12, 11318. [Google Scholar] [CrossRef]
  46. Xiong, J.; Wu, J.; Tang, M.; Xiong, P.; Huang, Y.; Guo, H. Combining YOLO and background subtraction for small dynamic target detection. Vis. Comput. 2024, 41, 481–490. [Google Scholar] [CrossRef]
  47. Zhang, G.; Tian, Y.; Hao, J.; Zhang, J. A Mongolian-Chinese neural machine translation model based on Transformer’s two-branch gating structure. In Proceedings of the 2022 4th International Conference on Intelligent Information Processing (IIP), Guangzhou, China, 14–16 October 2022; pp. 374–377. [Google Scholar]
  48. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  49. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  50. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
Figure 1. Crane hook dataset with different backgrounds under natural lighting. (a) Ground; (b) Construction site debris; (c) Sky; (d) Suspended object; (e) Building; (f) Trees.
Figure 2. Crane hook dataset under other complex conditions. (a) Dim lighting; (b) Conditions with obstructions 1; (c) Conditions with obstructions 2; (d) Long-distance small targets 1; (e) Nighttime illumination conditions; (f) Long-distance small targets 2; (g) Negative samples 1; (h) Negative samples 2.
Figure 3. Data augmentation examples. (a) Original image; (b) Size scaling; (c) Random cropping; (d) Deformation; (e) Rotation; (f) Mosaic data augmentation; (g) Flipping; (h) Pixel shifting; (i) Gaussian noise; (j) Salt-and-pepper noise.
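For readers who want to reproduce augmentations of the kind shown in Figure 3, the sketch below illustrates three of them (Gaussian noise, salt-and-pepper noise, and flipping) with plain NumPy. It is a minimal illustration rather than the pipeline used in this work; the function names and noise parameters are assumptions, and geometric augmentations such as flipping additionally require the bounding-box labels to be transformed.

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 image of shape (H, W, C)."""
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(img: np.ndarray, amount: float = 0.02) -> np.ndarray:
    """Set a random fraction of pixels to black (pepper) or white (salt)."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0          # pepper
    out[mask > 1 - amount / 2] = 255    # salt
    return out

def horizontal_flip(img: np.ndarray) -> np.ndarray:
    """Mirror the image left-to-right; box labels must be flipped accordingly."""
    return img[:, ::-1].copy()

# Example on a synthetic 640 x 640 RGB frame (a stand-in for a hook image).
frame = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
augmented = [add_gaussian_noise(frame), add_salt_pepper_noise(frame), horizontal_flip(frame)]
```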
Figure 4. PConv Structure.
Figure 5. Overall Architecture of FasterNet Backbone Network.
Figure 6. Triplet Attention Model Architecture.
Figure 7. Slim-neck Connection Layer. (a) GSConv; (b) GSBottleneck; (c) VoV-GSCSP.
Figure 8. Dyhead organically unifies three perceptual states.
Figure 9. The architecture of the Dyhead.
Figure 10. The two-dimensional coordinate positioning of a tower crane hook.
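As a minimal sketch of the idea behind Figure 10 (not the implementation used in the paper), the two-dimensional coordinate of the hook can be taken as the centre of its detected bounding box and, if convenient, re-expressed relative to the image centre. The box values, frame size, and helper names below are hypothetical.

```python
from typing import Tuple

def hook_pixel_center(box_xyxy: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return the (u, v) pixel centre of a detected hook bounding box."""
    x1, y1, x2, y2 = box_xyxy
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def to_image_centered(u: float, v: float, img_w: int, img_h: int) -> Tuple[float, float]:
    """Re-express pixel coordinates relative to the image centre (x right, y down)."""
    return u - img_w / 2.0, v - img_h / 2.0

# Hypothetical detection box and frame size, for illustration only.
u, v = hook_pixel_center((812.0, 430.5, 905.0, 540.0))
x, y = to_image_centered(u, v, img_w=1920, img_h=1080)
print(f"hook centre: pixel ({u:.1f}, {v:.1f}), image-centred ({x:.1f}, {y:.1f})")
```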
Figure 11. The architecture of FLE-YOLO.
Figure 12. Comparison of heat map results. (a) Original image; (b) YOLOv8 heatmap; (c) FLE-YOLO heatmap.
Figure 13. Camera installation location.
Figure 14. Visual monitoring interface. (a) Monitoring Area I; (b) Monitoring Area II; (c) Monitoring Area III; (d) Monitoring Area IV.
Figure 15. Flowchart of the hook detection and tracking experiment with the FLE-YOLO algorithm.
Figure 16. Visualization results of tower crane hook detection using different algorithms.
Figure 17. The presentation of tracking results.
Figure 18. Implementation of remote operation. (a) Human-computer interaction interface; (b) Remote-control visualization operation.
Table 1. Summary of object detection algorithms.

Algorithm | Backbone | Advantage | Weakness | Applicable Scene | FPS
R-CNN [25] | AlexNet / VGG16 | Combines a CNN with the candidate-box method | Detection is slow and time-consuming, and the image input size is fixed | Object detection | 0.03 / 0.5
Fast R-CNN [26] | VGG16 | Uses ROI Pooling to extract features and save time | The candidate-region selection method is computationally complicated | Object detection | 7
Faster R-CNN [27] | VGG16 / ResNet | Replaces region proposals with an RPN, speeding up training and improving accuracy | The model is complex | Object detection | 7 / 5
YOLO [28] | Darknet | The YOLO series is real-time, simple, efficient, and widely used | Low accuracy on small targets, difficulty with complex backgrounds, and an initial model too complex for resource-limited devices | Object detection, real-time video analysis | 46
SSD [30] | VGG16 | Multi-scale anchor-box detection, efficient | The model is difficult to converge and its detection accuracy is low | Multi-scale object detection, real-time video analysis | 59
RetinaNet [31] | ResNet | Addresses class imbalance through Focal Loss | Sample imbalance arises during dense sample training | Multi-scale object detection | 5.4
CornerNet [35] | Hourglass-52/104 | Anchor-free; detection is recast as corner detection with good localization performance | Internal features are missed, and corner keypoints of the same object must be grouped, which raises computational complexity | Object detection | 300
CenterNet [33] | Hourglass-52/104 | Anchor-free; detection is recast as a keypoint triplet with fast inference | Relies on pre- and post-processing; center-point localization of small targets is not accurate enough, and handling of complex scenes is limited | Object detection | 270
Cascade R-CNN [34] | ResNet + FPN | The cascade structure improves the detection effect | Large computational overhead | Object detection | 0.41
EfficientDet D1–D7 [32] | EfficientNet | Compound scaling and a BiFPN give excellent multi-scale detection performance | The model is complex and the training time is long | Mobile devices, embedded devices | 24
Table 2. Summary of Safety Monitoring Research on Tower Cranes.

Classification | Approach | Application Scenario | Advantage | Weakness
Sensor-based safety monitoring | Dynamic modeling with uncertain interval parameters | Prediction of the pitch (luffing) response in double-tower operation | Precisely determines the upper and lower bounds of the amplitude-angle response to reduce collision risk | Only applicable to narrow sites with two tower cranes, and the calculation is complicated
Sensor-based safety monitoring | Adaptive integral sliding-mode control | Anti-sway control of the tower crane load | Continuous, chatter-free control with strong anti-interference ability | The control algorithm is complex and the computational burden is heavy
Sensor-based safety monitoring | Ultrasonic sensor positioning | Real-time monitoring of the hook position | Real-time and easy to deploy | Strongly affected by environmental interference; limited accuracy
Vision-based safety monitoring | UAV image detection | Crack detection on the crane surface | Can detect cracks in hard-to-reach areas against complex backgrounds | Relies on drone equipment and is costly
Vision-based safety monitoring | Variable-focus industrial camera | Video surveillance and zoomed hook tracking | High precision and adaptable to different scenes | Expensive
Vision-based safety monitoring | Digital twin technology | Crane status monitoring | Visualization and real-time state reflection | Complex to implement and reliant on multiple data sources
Vision-based safety monitoring | Vision-based algorithms | Real-time identification and automatic positioning of the crane hook | High-precision crane attitude estimation that meets construction requirements | Geometric features are hard to extract when the boom causes large occlusion, which degrades detection
Vision-based safety monitoring | FLE-YOLO (ours) | Real-time detection and tracking of the crane hook | Lightweight network, efficient small-target detection, suitable for hardware deployment | Detection performance degrades under extreme occlusion
Table 3. The experimental hardware and software configuration.

Name | Experimental Configuration
Operating system | Windows 11 / Ubuntu 22.04.1
Deep learning framework | PyTorch 1.13.1
CPU | Intel(R) Core(TM) i7-13700H
GPU | NVIDIA GeForce RTX 3090 (24 GB) × 2
Programming language | Python 3.8
CUDA | 11.6
Platform | PyCharm 2022
Optimizer | SGD
Batch size | 4
Epochs | 100
Learning rate | 0.01
Momentum | 0.937
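For context, the training hyperparameters in Table 3 correspond roughly to an Ultralytics-style call such as the sketch below. This is an assumed reconstruction, not the authors' script; the model and dataset configuration file names are hypothetical.

```python
from ultralytics import YOLO

# Hypothetical model configuration file for the modified network; the file name is illustrative.
model = YOLO("fle-yolo.yaml")

# Hyperparameters mirror Table 3; the dataset config path and device list are assumptions.
model.train(
    data="hook_dataset.yaml",  # hypothetical dataset definition
    epochs=100,
    batch=4,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    device=[0, 1],             # two RTX 3090 GPUs, as listed in Table 3
)
```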
Table 4. Comparison of performance between FLE-YOLO and YOLOv8 on public datasets.

Dataset | Categories | Model | P | mAP50 | FPS | FLOPs/G | Parameters/M
VOC2012 | 20 | YOLOv8 | 0.781 | 0.741 | 111.11 | 28.5 | 11.133
VOC2012 | 20 | YOLOv12s | 0.791 | 0.756 | 208.33 | 21.3 | 9.238
VOC2012 | 20 | FLE-YOLO | 0.801 | 0.789 | 114.942 | 19.4 | 8.889
COCO2017 | 80 | YOLOv8 | 0.663 | 0.5776 | 102.618 | 28.6 | 11.157
COCO2017 | 80 | YOLOv12s | 0.689 | 0.606 | 208.33 | 24.1 | 9.261
COCO2017 | 80 | FLE-YOLO | 0.672 | 0.5777 | 106.382 | 19.6 | 8.809
Note: the bold data represent the optimal values.
Table 5. Results Comparison of Ablation Experiments.

Model | Backbone | P | mAP50 | FPS | FLOPs/G | Parameters/M
YOLOv3-spp | Darknet-53 | 0.949 | 0.945 | 29.069 | 293.1 | 104.71
YOLOv5s | CSP-Darknet-53 | 0.957 | 0.935 | 101.01 | 23.8 | 9.112
YOLOv6s | RepVGG | 0.966 | 0.946 | 62.5 | 44 | 16.297
YOLOv8s | C2f-sppf-Darknet-53 | 0.973 | 0.977 | 95.238 | 28.4 | 11.125
YOLOv9s | GELAN | 0.966 | 0.952 | 136.986 | 26.7 | 7.167
YOLOv10s | CSP-Darknet-53 | 0.970 | 0.971 | 208.333 | 21.4 | 7.218
YOLOv11s | C3K2 | 0.962 | 0.964 | 161.29 | 21.3 | 9.413
YOLOv12s | R-ELAN | 0.951 | 0.960 | 147.058 | 21.2 | 9.231
FLE-YOLO | FasterNet | 0.973 | 0.983 | 142.857 | 19.4 | 7.588
Note: The bold data represent the optimal values.
Table 6. Performance Comparison of Different Networks.

FasterNet | Triplet Attention | Slim-Neck | Dyhead | P | AP50 | FPS | FLOPs/G | Parameters/M
- | - | - | - | 0.973 | 0.977 | 95.238 | 28.4 | 11.125
√ | - | - | - | 0.948 | 0.970 | 151.515 | 21.7 | 8.616
- | √ | - | - | 0.983 | 0.974 | 192.307 | 28.4 | 11.126
- | - | √ | - | 0.975 | 0.977 | 136.98 | 25.1 | 10.265
- | - | - | √ | 0.977 | 0.963 | 125 | 28.1 | 10.851
√ | √ | √ | √ | 0.973 | 0.983 | 142.041 | 19.4 | 7.588
Note: The bolded data represent the optimal values. √ and - indicate whether an improvement module is included.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
