This chapter develops a high-precision obstacle detection model tailored to the complex operating environment of cranes. To address common industrial challenges such as uneven lighting, noise interference, and large variations in obstacle scale, an enhanced YOLOv5 model is proposed. Using the lightweight YOLOv5s as the baseline framework, image enhancement and denoising preprocessing are introduced at the data input stage. The parameter-free SimAM attention mechanism is then embedded in the feature extraction stage of the network, and the original loss function is replaced with EIoU at the output stage. Together, these optimizations define the architecture of the enhanced model.
2.1. YOLOv5 Model
In the field of crane obstacle detection, the YOLO algorithm uses fully convolutional neural networks to transform object detection into a regression problem, which significantly improves detection speed. YOLOv5 is an efficient object detection framework developed by Ultralytics (Frederick, MD, USA), built on the PyTorch (Version: PyTorch 1.11.0) deep learning library for easy deployment across various devices [16,17,18]. The YOLOv5 network architecture comprises three main components: the backbone network, the feature fusion network, and the head network, as illustrated in Figure 1 (drawn in Visio). The backbone network primarily consists of CBS, C3, and SPPF modules. The CBS module first performs a 2D convolution, then applies a Batch Normalization (BN) layer that accelerates training convergence through normalization, and finally applies the SiLU activation function, which introduces nonlinearity to enhance the network’s fitting capability. The main branch of the C3 module extracts deep features through multiple Bottleneck layers, while the residual branch passes the raw features through directly. Features from both branches are then concatenated and fused. SPPF performs pooling operations on the input features at multiple scales and concatenates the pooled results. The feature fusion network bidirectionally concatenates shallow detail features and deep semantic features from the backbone network, so that the fused features carry both detail and semantic information. The head network predicts object locations and categories.
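As a rough illustration of the pooling-and-concatenation step in the SPPF module, the NumPy sketch below chains three stride-1 max-pooling operations and concatenates their outputs with the input, so that the successive outputs cover progressively larger neighbourhoods. The 5 × 5 kernel and the names `max_pool_same` and `sppf_core` are assumptions made for illustration, and the convolutions that normally surround this step in a full network are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def max_pool_same(x, k=5):
    """k x k max pooling with stride 1 and 'same' padding on a (C, H, W) array."""
    x = np.asarray(x, dtype=np.float64)
    p = k // 2
    # Pad with -inf so that padded cells never win the max.
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))  # (C, H, W, k, k)
    return windows.max(axis=(-2, -1))


def sppf_core(x, k=5):
    """Chain three poolings and concatenate them with the input along channels."""
    y1 = max_pool_same(x, k)   # covers a 5 x 5 neighbourhood
    y2 = max_pool_same(y1, k)  # covers a 9 x 9 neighbourhood
    y3 = max_pool_same(y2, k)  # covers a 13 x 13 neighbourhood
    return np.concatenate([np.asarray(x, dtype=np.float64), y1, y2, y3], axis=0)
```

For a feature map with C channels, `sppf_core` returns 4C channels at the same spatial resolution.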
2.2. Preprocessing Optimization
The crane operation scenario features a complex background, and obstacle edges are often blurred by lighting variations and shooting distance. Images captured in industrial environments are also susceptible to Gaussian noise and sensor thermal noise. Moreover, crane operations at night or in poorly lit workshops face dynamically changing lighting conditions. All of these factors can compromise the accuracy of subsequent target detection.
To enhance the model’s ability to extract features of key obstacle contours, this study proposes a multi-stage image preprocessing pipeline. First, the Sobel operator is employed for image sharpening. Compared with other edge detection operators, the Sobel operator incorporates a weighted smoothing mechanism, which effectively suppresses noise amplification while calculating horizontal and vertical gradients [19,20,21]. By integrating local gradient information with neighborhood pixel weights, this method significantly enhances the boundary contrast between obstacles and the background, helping the detection network capture the geometric structure of objects. Second, the bilateral filtering algorithm is selected for adaptive denoising. Through a nonlinear combination, this algorithm considers both the spatial proximity and the intensity similarity of pixels. Its core advantage is that it smooths textures in flat regions while preserving edges with sharp intensity changes, thereby removing environmental noise while keeping the structural features of obstacles as intact as possible. This avoids the edge blurring caused by traditional linear filtering and is better suited to small target detection [22,23]. Finally, the unsupervised learning-based Enlighten-GAN network is introduced for low-light image enhancement. Unlike traditional methods that rely on paired training data, this network adopts a generative adversarial architecture and achieves adaptive enhancement without paired data through a U-Net-based generator and a global–local dual discriminator structure [24,25].
The generator guides the illumination distribution using a self-attention mechanism, as shown in Figure 2, while the discriminator is trained adversarially on global images and locally cropped patches to ensure that the enhanced images have balanced overall brightness without local overexposure or artifacts. After passing through this preprocessing pipeline, the input images show significantly improved clarity and contrast, noise is effectively suppressed, and edge features are preserved, providing high-quality input for subsequent feature extraction by the YOLOv5s model.
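The first two stages of the pipeline can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: sharpening is approximated by adding the Sobel gradient magnitude back to a grayscale image, and the parameters `strength`, `radius`, `sigma_space`, and `sigma_range` are hypothetical values chosen for the example. The learned low-light enhancement stage is not reproduced here.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3 x 3 kernel using edge-replicated padding."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_sharpen(img, strength=0.5):
    """Emphasise edges by adding the Sobel gradient magnitude to the image."""
    img = np.asarray(img, dtype=np.float64)
    magnitude = np.hypot(filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y))
    return np.clip(img + strength * magnitude, 0.0, 255.0)


def bilateral_filter(img, radius=2, sigma_space=2.0, sigma_range=25.0):
    """Weight each neighbour by spatial closeness and by intensity similarity."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    weighted_sum = np.zeros((h, w))
    weight_total = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_space ** 2))
            similar = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_range ** 2))
            weight = spatial * similar
            weighted_sum += weight * shifted
            weight_total += weight
    # The centre pixel always contributes weight 1, so the total is never zero.
    return weighted_sum / weight_total
```

Because the similarity weight collapses across a strong intensity step, pixels on opposite sides of an obstacle boundary barely influence each other, which is the edge-preserving behaviour described above.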
2.3. Introduction of Attention Mechanisms
Attention mechanisms are inspired by human characteristics, mimicking our ability to focus on the subject within an image while paying little attention to its background. Introducing attention mechanisms allows the model to concentrate more on the objects to be identified, thereby optimizing the network’s detection performance.
Current mainstream attention mechanisms include self-attention, multi-head attention, and convolutional attention. Among these, self-attention is the most commonly used, generating attention weights by calculating correlations between different positions in the input data. In visual model networks, adding attention mechanisms assigns corresponding weights to different regions within an image [26]. Regions with higher weights receive greater attention during model training and detection, while regions with lower weights receive less attention. Based on this principle, visual model networks can be optimized, improving performance metrics such as mAP@0.5 and precision.
(1) SE Attention Mechanism
Global average pooling compresses each channel of the feature map into a single value. Two fully connected layers then learn the weights between channels—the attention scores—which are finally mapped between 0 and 1 via a Sigmoid function. These scores adjust the channel responses of the original feature map.
For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the SE module first obtains the channel descriptor $z \in \mathbb{R}^{C}$ via global average pooling. It then derives the weights $s$ through a fully connected layer and activation function. The computation process can be expressed as follows:

$$s = \sigma\left(F_{fc}(z; W)\right)$$

where $F_{fc}$ denotes the fully connected layer, $\sigma$ represents the Sigmoid activation function, and $W$ signifies the weight parameters of the fully connected layer. Finally, the learned weights $s$ are multiplied channel-wise with the original feature map $X$ to obtain the weighted feature map, thereby achieving recalibration of features across different channels.
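The squeeze, excitation, and channel-wise rescaling steps described above can be sketched in NumPy as follows. The ReLU between the two fully connected layers and the bottleneck shapes of `w1` and `w2` follow the usual SE design and are assumptions here, as are the function names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(x, w1, w2):
    """Apply SE recalibration to a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is an
    assumed channel reduction ratio.
    """
    z = x.mean(axis=(1, 2))           # squeeze: one value per channel
    hidden = np.maximum(w1 @ z, 0.0)  # first fully connected layer + ReLU
    s = sigmoid(w2 @ hidden)          # second layer + Sigmoid, scores in (0, 1)
    return x * s[:, None, None], s    # rescale every channel by its score
```

Each channel of the output is the corresponding input channel multiplied by a single scalar, so spatial structure within a channel is left unchanged.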
(2) SimAM Attention Mechanism
SimAM extracts the importance of neurons by constructing an energy function. Its core idea is based on the local self-similarity of images, generating attention weights by calculating the similarity between each pixel in the feature map and its neighboring pixels [27]. The SimAM calculation formula can be expressed as follows:

$$w_i = \frac{1}{k} \sum_{j \in N_i} s(f_i, f_j)$$

where $w_i$ is the attention weight for pixel $i$, $k$ is the normalization constant, $N_i$ is the set of neighboring pixels for pixel $i$, and $s(f_i, f_j)$ is the similarity metric between pixel $i$ and pixel $j$, typically represented as the negative Euclidean distance: $s(f_i, f_j) = -\lVert f_i - f_j \rVert_2$.
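For illustration, a parameter-free weighting in the spirit of SimAM can be sketched in NumPy as below. Note the assumption: the sketch uses the closed-form energy solution common in public SimAM implementations, in which each activation is compared with the mean and variance of its own channel, rather than the neighbourhood-similarity expression given above. The regularisation constant `e_lambda` is likewise an assumed value.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def simam(x, e_lambda=1e-4):
    """Parameter-free attention on a (C, H, W) feature map."""
    _, h, w = x.shape
    n = h * w - 1
    # Squared deviation of every activation from its channel mean.
    d = (x - x.mean(axis=(1, 2), keepdims=True)) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n
    # Inverse energy: activations that stand out from their channel score higher.
    inv_energy = d / (4.0 * (var + e_lambda)) + 0.5
    return x * sigmoid(inv_energy)
```

No learnable weights are involved, which is why the mechanism adds no parameters to the network.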
(3) CBAM Attention Mechanism
CBAM enhances the feature representation ability of convolutional neural networks by combining channel attention and spatial attention. The output of its channel attention module can be calculated using the following formula:

$$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$$

where $F$ is the input feature map, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote global average pooling and max pooling operations, respectively, $\mathrm{MLP}$ represents a multilayer perceptron, and $\sigma$ denotes the Sigmoid activation function. The output of the spatial attention module is computed via the following formula:

$$M_s(F) = \sigma\left(f^{7 \times 7}\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right)$$

where $f^{7 \times 7}$ denotes a convolution with a 7 × 7 kernel, used to learn spatial attention weights from the concatenated average-pooled and max-pooled feature maps.
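The two CBAM branches can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the shared MLP is taken to be two bias-free layers with a ReLU in between, the spatial branch uses zero padding, and all function names and weight shapes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """Channel scores for a (C, H, W) map; w1: (C // r, C), w2: (C, C // r)."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # shared two-layer MLP
    avg = x.mean(axis=(1, 2))                # global average pooling
    mx = x.max(axis=(1, 2))                  # global max pooling
    return sigmoid(mlp(avg) + mlp(mx))       # shape (C,)


def spatial_attention(x, kernel):
    """Spatial scores for a (C, H, W) map; kernel has shape (2, k, k)."""
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    h, w = x.shape[1:]
    padded = np.pad(stacked, ((0, 0), (p, p), (p, p)))   # zero padding
    response = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            patch = padded[:, dy:dy + h, dx:dx + w]
            response += (kernel[:, dy, dx, None, None] * patch).sum(axis=0)
    return sigmoid(response)                              # shape (H, W)


def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]
```

The channel branch yields one score per channel and the spatial branch one score per location, so the two are applied by broadcasting along different axes.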
2.4. Improved Loss Function
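During the training of the YOLOv5 network model, the bounding-box loss is calculated as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}}{c^{2}} + \frac{v^{2}}{(1 - IoU) + v}$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$

where $L_{CIoU}$ is the corresponding loss function, $IoU$ is the intersection-over-union ratio between anchor boxes and target boxes, $\rho$ is the distance between the center points of anchor boxes and target boxes, $c$ is the diagonal length of the smallest box enclosing both the anchor box and the target box, $v$ is the parameter used to judge the difference in aspect ratio between anchor boxes and target boxes, $w^{gt}$ and $h^{gt}$ are the width and height of target boxes, and $w$ and $h$ are the width and height of anchor boxes.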
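As a worked illustration, the sketch below evaluates this kind of loss for a single pair of boxes in plain Python. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates, writes the aspect-ratio penalty as v² / ((1 − IoU) + v), and adds a small `eps` to avoid division by zero; these conventions and the function name `ciou_loss` are choices made for the example.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU-style loss for two boxes in (x1, y1, x2, y2) format."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # Overlap term.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)

    # Squared distance between the two box centres.
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    aspect = v * v / ((1.0 - iou) + v + eps)

    return 1.0 - iou + rho2 / c2 + aspect
```

For two disjoint unit squares, `ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))`, the overlap term contributes 1 and the center-distance term contributes 8/18, so the loss still provides a useful signal when the boxes do not overlap.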
(1) Alpha-IoU Loss Function
The Alpha-IoU loss function is an extension of the traditional IoU loss function. It introduces an adjustable parameter $\alpha$ to modulate the gradient of the loss function, thereby accelerating model training convergence. The principle of the Alpha-IoU loss function can be expressed as follows:

$$L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha}$$

where $IoU$ is the intersection-over-union ratio between the predicted bounding box and the ground truth bounding box, and $\alpha$ is a parameter greater than zero that controls the gradient of the loss function. With $\alpha > 1$, the magnitude of the gradient with respect to IoU, $\alpha \cdot IoU^{\alpha - 1}$, grows as IoU increases, which accelerates model convergence in high-IoU regions and can improve performance in practical applications.
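The effect of the exponent can be checked with a few lines of plain Python. The sketch assumes boxes in `(x1, y1, x2, y2)` format, and the default `alpha=3.0` is only an illustrative value; the helper `grad_magnitude` is a hypothetical name for the derivative magnitude of the loss with respect to IoU.

```python
def iou(box, gt):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_box + area_gt - inter)


def alpha_iou_loss(box, gt, alpha=3.0):
    """Alpha-IoU loss: 1 - IoU ** alpha."""
    return 1.0 - iou(box, gt) ** alpha


def grad_magnitude(iou_value, alpha=3.0):
    """|d(loss) / d(IoU)| = alpha * IoU ** (alpha - 1)."""
    return alpha * iou_value ** (alpha - 1.0)
```

With `alpha=1.0` the function reduces to the ordinary IoU loss, and with `alpha=3.0` the gradient magnitude at IoU = 0.9 is more than three times that at IoU = 0.5.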
(2) EIoU Loss Function
The EIoU loss function calculates the loss by considering the overlap area, the center point distance, and the differences in width and height between the predicted box and the ground truth box [28]. Its formula can be expressed as follows:

$$L_{EIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_w^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_h^{2}}$$

where $IoU$ is the intersection-over-union ratio between the predicted and ground truth boxes, $\rho(b, b^{gt})$ denotes the Euclidean distance between the centers of the predicted and ground truth boxes, $c$ is the diagonal length of the minimum bounding region encompassing both boxes, $w, h$ and $w^{gt}, h^{gt}$ represent the width and height of the predicted and ground truth boxes, respectively, and $C_w$ and $C_h$ are the width and height of that minimum bounding region, respectively.
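A minimal plain-Python sketch of this loss for a single pair of boxes is given below. It assumes `(x1, y1, x2, y2)` corner coordinates and a small `eps` for numerical safety; the function name `eiou_loss` is chosen for the example.

```python
def eiou_loss(box, gt, eps=1e-9):
    """EIoU loss for two boxes in (x1, y1, x2, y2) format."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # Overlap term.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)

    # Width and height of the smallest box enclosing both.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])

    # Squared distance between the two box centres.
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2

    centre_term = rho2 / (cw ** 2 + ch ** 2 + eps)
    width_term = (w - wg) ** 2 / (cw ** 2 + eps)
    height_term = (h - hg) ** 2 / (ch ** 2 + eps)
    return 1.0 - iou + centre_term + width_term + height_term
```

For `eiou_loss((0, 0, 2, 2), (0, 0, 4, 2))` the four terms are 0.5 (overlap), 0.05 (center distance), 0.25 (width), and 0 (height), giving 0.8. Width and height errors are penalised separately rather than through a single aspect-ratio term.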