Article

LGM-YOLO: A Context-Aware Multi-Scale YOLO-Based Network for Automated Structural Defect Detection

Chuanqi Liu, Yi Huang, Zaiyou Zhao, Wenjing Geng and Tianhong Luo
1 Chongqing Special Equipment Inspection and Research Institute, Chongqing 401121, China
2 Key Laboratory of Electromechanical Equipment Security in Western Complex Environment, State Administration for Market Regulation, Chongqing 401121, China
3 Key Laboratory of Optoelectronic Technology & Systems, International R & D Center of Micro-Nano Systems and New Materials Technology, Chongqing University, Ministry of Education, Chongqing 400044, China
4 School of Intelligent Manufacturing, Chongqing University of Arts and Sciences, Chongqing 402160, China
5 Department of Information and Intelligence Engineering, Chongqing City Vocational College, Chongqing 402160, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(8), 2411; https://doi.org/10.3390/pr13082411
Submission received: 24 April 2025 / Revised: 22 June 2025 / Accepted: 25 July 2025 / Published: 29 July 2025

Abstract

Ensuring the structural safety of steel trusses in escalators is critical for the reliable operation of vertical transportation systems. While manual inspection remains widely used, its dependence on human judgment leads to extended cycle times and variable defect-recognition rates, making it less reliable for identifying subtle surface imperfections. To address these limitations, a novel context-aware, multi-scale deep learning framework based on the YOLOv5 architecture is proposed, specifically designed for automated structural defect detection in escalator steel trusses. First, a method called GIES is proposed to synthesize pseudo-multi-channel representations from single-channel grayscale images, which enhances the network’s channel-wise representation and mitigates issues arising from image noise and defocus blur. To further improve detection performance, a context enhancement pipeline is developed, consisting of a local feature module (LFM) for capturing fine-grained surface details and a global context module (GCM) for modeling large-scale structural deformations. In addition, a multi-scale feature fusion module (MSFM) is employed to effectively integrate spatial features across various resolutions, enabling the detection of defects with diverse sizes and complexities. Comprehensive testing on the NEU-DET and GC10-DET datasets reveals that the proposed method achieves 79.8% mAP on NEU-DET and 68.1% mAP on GC10-DET, outperforming the baseline YOLOv5s by 8.0% and 2.7%, respectively. Although challenges remain in identifying extremely fine defects such as crazing, the proposed approach offers improved accuracy while maintaining real-time inference speed. These results indicate the potential of the method for intelligent visual inspection in structural health monitoring and industrial safety applications.

1. Introduction

Escalator truss structures are essential load-bearing components that ensure the operational safety and stability of escalators. Typically constructed from steel beams and connectors, these structures must withstand continuous dynamic loads and harsh environmental conditions [1,2,3]. Structural surface defects including corrosion, deformation, and cracks can significantly degrade the mechanical performance of steel components, posing serious safety risks and increasing maintenance costs. As such, automated visual detection of these features is paramount for predictive maintenance in escalator systems, where undetected defects may lead to catastrophic failure [4,5,6].
To meet these critical inspection requirements, the field has witnessed an evolution from manual to automated detection methodologies. The development of steel defect inspection systems has progressed from traditional image processing methods [7,8,9] to contemporary deep learning architectures [10,11,12]. Early studies employed handcrafted features such as scale-invariant feature transform [13], local binary patterns (LBP) [14], Gabor filters [15], and wavelet transforms [16], often in conjunction with classifiers like support vector machines (SVM) [17] or decision trees [18]. While these methods perform well for simple defect types, their effectiveness diminishes in the presence of complex textures or variable lighting conditions. In contrast, convolutional neural networks (CNNs) [19] exhibit superior capacity for autonomously learning discriminative feature representations from unprocessed image data. In particular, single-stage detection architectures like the YOLO series have emerged as predominant solutions, achieving an optimal balance between inference speed and detection accuracy, advantageous for real-time industrial quality control applications [20].
Despite the advancements in CNN-based methods, detecting steel surface defects in industrial environments remains challenging due to cluttered backgrounds, diverse defect scales, and varying defect shapes. Conventional YOLO (you only look once) models often struggle with these conditions, as they primarily rely on single-scale feature extraction. Two-stage detectors like Faster R-CNN [21] offer improved accuracy through their region proposal mechanisms but typically sacrifice real-time performance. Consequently, recent research has focused on enhancing YOLO-based models with multi-scale feature extraction, attention mechanisms, and contextual reasoning to address these challenges.
To improve performance on small and complex defect detection tasks, several studies have proposed enhancements to the YOLO framework. For example, Guo et al. [22] introduced MSFT-YOLO, integrating transformer-based modules and multi-scale feature fusion for improved detection in cluttered industrial settings. Li et al. [23] developed a simulation-based training method incorporating efficient multi-scale attention and C3DX modules, along with optimized loss functions to enhance robustness. However, these models still face limitations in effectively capturing both fine-grained details and global contextual features, which are critical for identifying a wide range of surface defects under varying conditions.
Motivated by the need for more comprehensive feature representation, this study proposes LGM-YOLO, a novel context-enhanced detection model tailored for steel surface defect inspection. This comprises a local feature module (LFM) that captures fine textures and edges critical for identifying micro-cracks and scratches, a global context module (GCM) that models spatial relationships and suppresses background noise, and a multi-scale feature module (MSFM) that enables robust detection across varying defect sizes. This sequential feature enhancement pipeline ensures both local sensitivity and global awareness, while preserving real-time detection performance.
This work makes the following key contributions to the field:
(1)
A dedicated LFM for extracting fine-grained features from steel surfaces is developed to enhance detection sensitivity to defects such as micro-cracks and surface scratches.
(2)
A context-aware GCM is proposed to capture large-scale spatial patterns, effectively separating genuine imperfections from substrate textures and imaging artifacts through learned pattern discrimination.
(3)
An adaptive feature fusion mechanism MSFM is established to integrate features across multiple resolutions, allowing accurate detection of both small and large defects while preserving spatial details.
(4)
A novel method named the grayscale image enhancing strategy (GIES) is designed to integrate adaptive preprocessing and specialized data augmentation to optimize grayscale feature extraction and enhance neural network detection performance.
The rest of the study is organized as follows. Section 2 provides an overview of the YOLOv5 object detection algorithm. In Section 3, we present a detailed description of the proposed model. Section 4 covers our dataset, experimental setup, and evaluation methods. Comparative analyses between the proposed method and state-of-the-art approaches are systematically conducted in Section 5. Finally, Section 6 summarizes the key findings and suggests potential research extensions.

2. Related Work

2.1. Overview of the YOLOv5

YOLO has emerged as a paradigm-shifting approach in object detection, primarily owing to its unified architecture, real-time inference capability, and relatively simple loss design. Through successive iterations, the YOLO series has introduced improvements in detection accuracy, speed, and robustness. Among these, YOLOv5, a highly optimized extension of YOLOv3, achieves a favorable balance between accuracy and inference efficiency, particularly in its lightweight variant YOLOv5s, as illustrated in Figure 1, which has become a widely used baseline in real-time industrial applications.
As an end-to-end trainable detector, YOLOv5 processes both recognition and localization tasks through parallel prediction heads in one network pass, eliminating the computational overhead of traditional region proposal mechanisms. Compared with previous versions, YOLOv5 introduces several enhancements, including deeper backbone structures, optimized loss functions, and improved training strategies. These modifications enable better generalization and higher performance, particularly in detecting small-scale objects and addressing complex visual scenes.
The YOLOv5 architecture is composed of three core modules: a backbone module responsible for extracting features, a neck component that combines multi-scale features, and a head section for generating detection outputs. Before processing by the network, input images are enhanced through various data augmentation methods including Mosaic augmentation, random size adjustments, cropping operations, rotational transformations, and perspective modifications. These preprocessing techniques significantly expand the variety of training data while strengthening the model’s adaptability to diverse input conditions.
A critical component of YOLOv5’s performance lies in its neck, which combines the feature pyramid network (FPN) with the path aggregation network (PAN). The FPN passes rich semantic features from deep layers to shallow layers, while the PAN reinforces low-level spatial details. This dual-path fusion strategy allows YOLOv5 to maintain high detection accuracy across object scales, providing a strong foundation for further enhancement.
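To make this dual-path flow concrete, the sketch below traces three backbone scales through a simplified FPN + PAN neck. It is a schematic under stated assumptions, not YOLOv5’s exact neck: the real model interleaves C3 blocks and fuses by channel concatenation, whereas this sketch uses additive fusion and plain convolutions for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FpnPanNeck(nn.Module):
    """Simplified FPN + PAN fusion over three backbone scales (illustrative)."""
    def __init__(self, c3_ch: int, c4_ch: int, c5_ch: int, out_ch: int = 256):
        super().__init__()
        self.l3 = nn.Conv2d(c3_ch, out_ch, 1)  # lateral 1x1 convs align channels
        self.l4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.l5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.d3 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)  # PAN downsample
        self.d4 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        # FPN top-down path: deep semantics enrich shallow layers
        p5 = self.l5(c5)
        p4 = self.l4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.l3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # PAN bottom-up path: low-level spatial detail is reinforced downward
        n3 = p3
        n4 = p4 + self.d3(n3)
        n5 = p5 + self.d4(n4)
        return n3, n4, n5  # multi-scale features for the detection heads
```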

2.2. YOLOv5 Detectors for Steel Surfaces

Conventional YOLO-based detectors typically employ a coupled detection head with a filter size of (L + C + T), where L denotes bounding box coordinates (x, y, w, h), C represents the object confidence score, and T corresponds to the class probability. While this structure offers a lightweight and efficient solution, it often falls short in scenarios involving small or subtle defect patterns, which are common in steel surface inspection tasks.
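As a worked example of that filter size, the snippet below computes the per-scale output channels of a coupled head for a hypothetical configuration with three anchors per scale and the six NEU-DET defect classes; the anchor count is an assumption for illustration.

```python
# Per-anchor prediction vector: L box values + C objectness + T class scores.
num_box = 4            # L: (x, y, w, h)
num_obj = 1            # C: object confidence score
num_classes = 6        # T: e.g., the six NEU-DET defect categories
anchors_per_scale = 3  # assumed; YOLOv5 uses 3 anchors per detection scale

out_channels = anchors_per_scale * (num_box + num_obj + num_classes)
print(out_channels)    # 33 output channels at each detection scale
```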
To overcome these limitations, numerous studies have proposed enhancements to the YOLOv5 architecture by integrating advanced attention mechanisms, feature fusion modules, and transformer-based components. For instance, Le et al. [24] introduced coordinate attention and a BiFPN structure into YOLOv5, along with transformer-based refinement, achieving a 5.3% improvement in recall while maintaining a 95 FPS inference speed. Similarly, Li et al. [25] developed DA-YOLOv5, a domain-adaptive variant utilizing transfer learning to improve generalization across domains, demonstrating strong performance in magnetic tile defect detection.
In response to the persistent challenge of detecting small-scale targets, Zhao et al. [26] proposed a refined YOLOv5 variant that integrates depthwise convolution and K-means anchor clustering for optimized feature localization. Wu et al. [27] explored the combination of YOLOv5 and R-FCN for small target detection in remote sensing images, effectively leveraging multi-scale anchors and deep feature hierarchies. In safety-critical applications, such as helmet detection, Sadiq et al. [27] proposed FD-YOLOv5, incorporating a fuzzy-based image enhancement module to improve accuracy in real-world, cluttered environments.
In the specific domain of industrial defect detection, Wang et al. [28] proposed a YOLOv5-based approach tailored for steel-surface-defect recognition, featuring a multi-scale explore block to enhance feature representation across varying defect sizes. Building upon this, Mi et al. [29] introduced TLGM-YOLO, which combines a small-target detection layer, BiFPN, and DC3 modules to boost multi-scale perception and adaptive fusion, yielding high accuracy in detecting surface defects on hot-rolled steel strips.
Despite these promising improvements, existing YOLO-based detectors continue to face challenges in complex industrial environments, especially in detecting micro-defects such as hairline cracks, inclusions, and corrosion, where both fine local features and contextual cues are critical. This underscores the need for a more sophisticated detection framework that can jointly capture local textures and global spatial dependencies, while maintaining real-time inference capability.

3. Proposed Method

3.1. Overview of LGM-YOLO Architecture

Figure 2 provides a high-level overview of the proposed LGM-YOLO architecture. The model is built upon the YOLOv5s backbone, with three custom modules—LFM, GCM, and MSFM—embedded at different stages of the feature extraction pipeline. Specifically, LFM is introduced in the early layers to enhance fine-grained features, GCM is placed after the neck to incorporate global contextual information, and MSFM is used just before the prediction head to strengthen multi-scale representation. The arrows in the diagram represent the direction of feature flow, with each block corresponding to a distinct functional unit. This modular design allows for progressive refinement of features at different scales, ultimately improving detection performance across varied defect types.
Specifically, the key modules of the proposed model are described as follows:
(1)
LFM: This module enhances local feature interactions by first applying a 1 × 1 convolution to reduce feature dimensionality, followed by a 3 × 3 convolution to refine spatial information. The module employs batch normalization coupled with ReLU activation functions to enhance training stability and optimize feature learning.
(2)
GCM: To incorporate global contextual information, GCM utilizes an adaptive global pooling layer to capture long-range dependencies. A 1 × 1 convolution, batch normalization, and sigmoid activation are applied to generate an attention map, which is then used to recalibrate the input features dynamically.
(3)
MSFM: To enhance multi-scale feature extraction, MSFM employs parallel convolutional branches with 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernels. The feature maps are merged along the channel dimension, then processed through batch normalization and ReLU activation, enabling the model to capture both fine-grained details and large-scale context information effectively.
By integrating these modules into YOLOv5s, the proposed LGM-YOLO significantly improves multi-scale feature learning, spatial representation, and global context awareness, leading to enhanced detection accuracy while maintaining real-time performance.

3.2. LFM: Enhancing Small Defect Detection

In deep learning-based feature extraction, the LFM module, illustrated in Figure 3, efficiently enhances feature transformation through a structured sequence of convolutional and normalization layers. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$ (channels C, height H, width W), the forward propagation of LFM is defined as follows:
Channel compression and feature transformation (1 × 1 convolution):
$X_1 = W_1 \times X + b_1$ (1)
where $W_1 \in \mathbb{R}^{\frac{C}{2} \times C}$ is a learnable weight matrix that reduces the channel dimension while preserving essential information, $X$ is the input feature map from the previous layer, and $b_1$ is the bias term.
Normalization and nonlinear activation are as follows:
$X_2 = \mathrm{ReLU}(\mathrm{BN}(X_1))$ (2)
where batch normalization (BN) stabilizes training by normalizing the feature distribution, and ReLU activation introduces non-linearity to enhance feature representation.
Local feature enhancement (3 × 3 convolution):
$X_3 = W_2 \times X_2 + b_2$ (3)
where $W_2 \in \mathbb{R}^{C \times \frac{C}{2}}$ restores the channel dimension and enhances local receptive fields to improve feature learning.
Final normalization and output are as follows:
$Y = \mathrm{BN}(X_3)$ (4)
which further balances the feature distribution, improving gradient flow during training.
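Putting Equations (1)–(4) together, a minimal PyTorch sketch of the LFM is given below. This reflects our reading of the module: the class layout, layer names, and the absence of a residual connection are assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn

class LFM(nn.Module):
    """Local feature module sketch: 1x1 compression, BN + ReLU,
    3x3 local enhancement, final BN (Eqs. (1)-(4))."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2
        self.compress = nn.Conv2d(channels, mid, kernel_size=1)           # Eq. (1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                                  # Eq. (2)
        self.enhance = nn.Conv2d(mid, channels, kernel_size=3, padding=1) # Eq. (3)
        self.bn2 = nn.BatchNorm2d(channels)                               # Eq. (4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x2 = self.act(self.bn1(self.compress(x)))
        return self.bn2(self.enhance(x2))
```

For example, `LFM(256)(torch.randn(16, 256, 80, 80))` returns a tensor of the same shape, so the module can be dropped into the backbone without altering downstream layers.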

3.3. GCM: Capturing Large-Scale Information

Figure 4 illustrates the architecture of the GCM. The input feature map first undergoes global average pooling, which generates a spatially compressed vector encoding the global distribution of features. This vector is then processed through a 1 × 1 convolution, batch normalization (BN), and a sigmoid activation function to produce a channel-wise attention map, which recalibrates the significance of each feature channel. As depicted in the final block of the figure, the module concludes with an element-wise multiplication between the attention map and the original input features, effectively emphasizing context-aware features.
As shown in Figure 4, the GCM is a lightweight yet powerful attention mechanism that enriches feature representations by integrating global contextual information. It refines feature maps by adaptively reweighting channels based on a global descriptor. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the transformation is formulated below.
To capture global spatial information, we perform global average pooling over the spatial dimensions as follows:
$G = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X(i, j)$ (5)
where $G \in \mathbb{R}^{C \times 1 \times 1}$ is a compressed representation of the input feature map.
Channel-wise transformation (1 × 1 convolution, normalization, and activation):
$\hat{G} = \sigma(\mathrm{BN}(W \times G))$ (6)
where $W \in \mathbb{R}^{C \times C}$ is a learnable transformation matrix implemented as a 1 × 1 convolution, BN stabilizes training, and the sigmoid activation $\sigma$ normalizes the response between 0 and 1, enabling soft attention across channels.
Context-based feature recalibration:
$Y = X \odot \hat{G}$ (7)
where element-wise multiplication ($\odot$) applies the learned channel-wise attention weights to the input feature map, selectively enhancing or suppressing feature channels.
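A compact PyTorch rendering of Equations (5)–(7) follows as a sketch; the 1 × 1 convolution stands in for the learnable matrix $W$, and the layer naming is ours.

```python
import torch
import torch.nn as nn

class GCM(nn.Module):
    """Global context module sketch: global average pooling, 1x1 conv,
    BN, sigmoid gate, channel-wise recalibration (Eqs. (5)-(7))."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # Eq. (5): C x 1 x 1
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)  # W as a 1x1 conv
        self.bn = nn.BatchNorm2d(channels)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate(self.bn(self.fc(self.pool(x))))           # Eq. (6)
        return x * g                                            # Eq. (7): broadcast multiply
```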

3.4. MSFM: Integrating Features for Accurate Detection

Figure 5 illustrates the internal structure of the MSFM module. The input feature map is simultaneously processed by four parallel convolution branches with kernel sizes of 1 × 1, 3 × 3, 5 × 5, and 7 × 7, respectively. Each branch captures features at a different scale: the 1 × 1 convolution captures fine-grained pixel-level features, while the 7 × 7 convolution extracts high-level contextual information over a wider area. After independent processing, the outputs of all branches are concatenated along the channel dimension to form a unified multi-scale representation. This combined feature map is then normalized via batch normalization and passed through a ReLU activation layer. The architecture of the MSFM includes the following:
A 1 × 1 convolution: the branch captures the most localized and fine-grained features with minimal computational overhead. It is effective for detecting small objects or fine details in the input image.
A 3 × 3 convolution: the standard convolutional kernel that strikes a balance between capturing local and global information. This branch plays an essential role in processing mid-level features.
A 5 × 5 convolution: the larger kernel allows the network to capture more extended local context, helping with medium and larger object detection.
A 7 × 7 convolution: the largest kernel in the MSFM captures more significant contextual information, which is useful for detecting large objects or understanding the broader context in an image.
All the outputs from these parallel convolutional branches are concatenated along the channel dimension, creating a richer and more diverse feature representation. After obtaining the multi-scale features from each branch, the concatenated feature map undergoes batch normalization to ensure stable training and reduce internal covariate shift. ReLU activation is applied to introduce non-linearity, ensuring the module learns complex relationships between features.
By integrating features from different scales, the MSFM enables the model to focus on different spatial resolutions simultaneously, improving the model’s ability to handle objects of various sizes within the same image.
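The sketch below assembles the four parallel branches described above. Splitting the output width evenly across branches (channels // 4 each, so concatenation restores the input width) is our assumption; the paper specifies only the kernel sizes and the fusion order.

```python
import torch
import torch.nn as nn

class MSFM(nn.Module):
    """Multi-scale feature module sketch: parallel 1/3/5/7 convolutions,
    channel concatenation, then BN + ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        branch = channels // 4  # assumed even split across the four branches
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, branch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7))
        self.bn = nn.BatchNorm2d(branch * 4)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([b(x) for b in self.branches], dim=1)  # fuse all scales
        return self.act(self.bn(y))
```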

3.5. GIES: Grayscale Image Enhancing Strategy

Figure 6a illustrates the conventional data pipeline, including the application of data augmentation techniques. Initially, color images are typically retrieved using image processing libraries such as OpenCV or APIs provided by deep learning frameworks [30]. After retrieval, the images undergo a series of preprocessing steps to ensure compatibility with the model’s input requirements. These steps generally include basic operations such as resizing, cropping, and normalization.
To improve the model’s ability to generalize, the training dataset is enriched through systematic augmentation techniques after initial preprocessing. These methods create synthetic variations of the original images by applying both spatial and photometric modifications. Spatial transformations consist of horizontal or vertical flips, arbitrary rotations, random crops, size variations, positional shifts, and slight perturbations. Photometric adjustments encompass noise injection, blur effects, along with comprehensive color space manipulations including hue shifting, illumination changes, chroma adjustments, tonal redistribution, and color temperature modifications to improve the model’s robustness.
For grayscale images, conversion to RGB format can be performed using a gray-to-RGB transformation, as shown in Figure 6b. However, the associated computational overhead often outweighs its benefits in practical scenarios. As a result, single-channel input is frequently adopted in steel surface inspection tasks for the sake of efficiency, as exhibited in Figure 6c. Nevertheless, this strategy inevitably reduces the richness of input information, which may adversely affect the model’s recognition performance.
To address the information sparsity of single-channel grayscale inputs, we propose GIES, which transforms grayscale images into three semantically distinct channels (Figure 6d). Channel 1 applies a 3 × 3 mean filter to suppress noise, Channel 2 applies the same mean filtering, and Channel 3 retains the original grayscale image. This multi-channel representation enriches feature diversity while maintaining computational efficiency. Subsequent augmentations include geometric transformations and defect-specific simulations, enhancing the model’s robustness to industrial imaging variations.
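A minimal OpenCV/NumPy sketch of this channel construction is shown below, following the channel assignments stated above (the text assigns the same 3 × 3 mean filter to both Channels 1 and 2); the function name and the example file path are ours.

```python
import cv2
import numpy as np

def gies(gray: np.ndarray) -> np.ndarray:
    """Build a pseudo-multi-channel input from a single-channel image (Figure 6d)."""
    smoothed = cv2.blur(gray, (3, 3))  # Channel 1: 3x3 mean filter suppresses noise
    c2 = smoothed                      # Channel 2: same filtering, per the text
    c3 = gray                          # Channel 3: original grayscale preserved
    return np.stack([smoothed, c2, c3], axis=-1)  # H x W x 3 array for the network

# Example usage on a grayscale steel-surface image (path is illustrative):
# img = cv2.imread("neu_det/scratches_1.jpg", cv2.IMREAD_GRAYSCALE)
# x = gies(img)
```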

4. Experiments and Analysis

To comprehensively evaluate the proposed LGM-YOLO model, we conducted experiments on two publicly available steel-surface-defect datasets: NEU-DET and GC10-DET. Both datasets exhibit real-world industrial challenges, including uneven illumination, low-contrast defects, and cluttered backgrounds.

4.1. Dataset Description

(a) NEU-DET
In this section, we systematically assess the detection capabilities of the proposed LGM-YOLO through comprehensive benchmarking on the NEU-DET steel defect dataset [31]. This dataset is widely adopted for defect identification in industrial environments, specifically targeting metal-surface defects, and it provides a challenging testbed for object detection due to the variety of defect types and their variability in size, shape, and appearance.
The dataset covers six common surface defects found on hot-rolled steel strips: crazing, inclusion, patches, pitted surfaces, rolled-in scales, and scratches. It contains 1800 grayscale images in total, with 300 images per defect type. Each image may contain one or more defects, making it suitable for testing the model’s capability to detect multiple defects within a single image. Example images of the defect categories are shown in Figure 7.
(b) GC10-DET
To further evaluate the proposed model, the GC10-DET dataset was employed. Comprising 2300 high-resolution images, it comprehensively covers ten distinct steel plate surface defect types, including various punching holes (Pu), weld lines (Wl), crescent gaps (Cg), water spots (Ws), oil spots (Os), silk spots (Ss), inclusions (In), rolled pits (Rp), creases (Cr), and waist folding (Wf). As illustrated in Figure 8, the dataset features richly annotated defect samples.

4.2. Evaluation Metrics

In the field of target detection, mean average precision (mAP) is frequently adopted as the core evaluation metric. Additionally, common metrics include precision (P), recall (R), and average precision (AP).
Intersection over union (IoU) quantifies the overlap degree between the predicted bounding box and the ground truth box. Its value ranges from 0 to 1, where a larger IoU signifies a closer alignment. In this study, a detection is considered valid if the IoU exceeds 0.45, and invalid if it is lower than that threshold.
Confidence represents the probability that the predicted box belongs to a certain class, also ranging from 0 to 1. A prediction is classified as positive when its confidence score is above 0.25; otherwise, it is regarded as negative.
Precision serves as a metric to evaluate prediction accuracy, representing the ratio of true positive samples to all predicted positive samples. A lower precision value implies a higher probability of false detections. Precision is defined as Equation (8).
$P = \frac{TP}{TP + FP}$ (8)
Recall measures the likelihood of correctly identifying a positive sample. A lower recall indicates a higher chance of missed detections by the algorithm. Recall is computed using Equation (9).
$R = \frac{TP}{TP + FN}$ (9)
Average precision (AP) is defined as the integral of the precision–recall curve, serving as a comprehensive indicator of a model’s overall performance. It is calculated using Equation (10).
$AP = \int_{0}^{1} P(R)\, dR$ (10)
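For reference, a plain NumPy sketch of these metrics is given below. It approximates the integral of Equation (10) with a simple Riemann sum over the ranked detections rather than the interpolated sampling used by some YOLO evaluation scripts, and the function names are ours.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(is_tp, conf, n_gt):
    """AP for one class (Eqs. (8)-(10)), given per-detection true-positive
    flags, confidence scores, and the number of ground-truth boxes."""
    order = np.argsort(-np.asarray(conf, dtype=float))  # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)              # Eq. (8): TP / (TP + FP)
    recall = cum_tp / n_gt                              # Eq. (9): TP / (TP + FN)
    # Eq. (10): area under the precision-recall curve (Riemann sum)
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))
```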

4.3. Details of the Implementation

The model was implemented using PyTorch 2.5.1 and trained on an NVIDIA RTX 3060 GPU (12 GB VRAM; NVIDIA, Santa Clara, CA, USA) with CUDA 12.1. Input images were resized to 640 × 640 pixels, and standard data augmentations, including flipping, rotation, random cropping, and Gaussian noise, were applied in addition to the proposed grayscale image enhancing strategy (GIES).
For the optimization process, we opted for the stochastic gradient descent (SGD) optimizer. It was configured with an initial learning rate of 0.01, a momentum value of 0.937, and a weight decay parameter set to 0.0005. The model underwent training with a batch size of 16 for a total of 100 epochs. To facilitate the learning rate adjustment, a cosine annealing schedule was implemented, which enabled a smooth and gradual reduction in the learning rate over the training course.
Regarding the dataset utilization, it was partitioned into three subsets: 70% for the training set, 15% for the validation set, and the remaining 15% for the testing set. The loss function combines CIoU loss for bounding box regression with binary cross-entropy loss for objectness and class prediction, following the default YOLOv5 training pipeline.
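A minimal sketch of this optimization setup is shown below; `model` (assumed to return a scalar training loss) and `train_loader` (assumed to yield augmented batches of 16) stand in for the full YOLOv5 training pipeline.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
# Cosine annealing smoothly decays the learning rate over the 100 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for images, targets in train_loader:  # batch size 16, 640 x 640 inputs
        loss = model(images, targets)     # CIoU + BCE losses (assumed wrapper)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```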

5. Results and Discussion

The proposed LGM-YOLO model was evaluated and compared against several state-of-the-art models, including YOLOv3-SPP, YOLOv4, YOLOv5, Faster R-CNN, and RetinaNet, to assess its performance in surface-defect detection.

5.1. Quantitative Analysis: Comparison with Existing Models

Due to the varying complexity of the six steel-defect types, detection performance differs significantly across algorithms. Table 1 presents a comparative analysis of contemporary defect-detection approaches evaluated on NEU-DET.
The proposed LGM-YOLO model achieves a mAP of 79.8%, outperforming YOLOv5s (71.8%) by 8.0 percentage points. Faster R-CNN and RetinaNet both yield mAPs of 74.4%, while YOLOv3-SPP and YOLOv4 achieve lower scores of 64.3% and 70.4%, respectively.
At the category level, LGM-YOLO demonstrates notable improvements. It achieves 65.3% accuracy for crazing (Cr) defects, which remain challenging due to their fine-grained, overlapping characteristics. In contrast, the model excels in detecting inclusion (In) and patches (Pa), attaining accuracies of 86.7% and 96.2%, respectively, surpassing all baselines. Superior performance is also observed in detecting rolled-in scale (Rs), pitted surface (Ps), and scratches (Sc), reflecting the model’s robustness across defect types of varying scales and complexity.
From Table 2, our model achieves 100% detection accuracy for Cg (crescent gap), showcasing exceptional capability in identifying this type of defect. Additionally, it exhibits remarkable precision in detecting Pu and Wl at 96.6% and 99.1%, respectively, significantly outperforming the other algorithms. Although its accuracy is relatively lower for certain defect categories such as Wf (61.5%), the model demonstrates significant advantages in balancing detection accuracy with adaptability to complex defects.
The proposed model achieves a more pronounced improvement of 8.0% mAP on NEU-DET compared to 2.7% on GC10-DET, with this performance gap primarily attributed to fundamental differences in dataset characteristics. NEU-DET’s balanced class distribution (300 images per defect category) enables more uniform learning and evaluation across all defect types, while GC10-DET’s inherent class imbalance (ranging from 85 samples for rare defects like Rp to 513 samples for more common Wl) introduces bias in overall performance metrics.

5.2. Ablation Study

Ablation experiments were performed to assess the impact of each proposed module, as tabulated in Table 3. Experiment 1, the unmodified YOLOv5s baseline, achieved mAP values of 73.3% on NEU-DET and 65.4% on GC10-DET for steel-defect detection. Experiment 4, which adds GIES alone, improved mAP by approximately 0.4% on both datasets (to 73.7% on NEU-DET and 65.8% on GC10-DET), demonstrating its effectiveness in enhancing grayscale input representations. Adding the LFM in Experiment 2 raised the mAP to 73.6% on NEU-DET and 66.3% on GC10-DET, and the comparison between Experiments 4 and 6 further highlights its role in enhancing fine-grained feature extraction for defects like micro-cracks. The GCM in Experiment 3 improved mAP to 73.8% on NEU-DET and 66.2% on GC10-DET, a 0.5% gain over the baseline that reflects its capability to suppress background noise via global contextual modeling. When the MSFM was incorporated in Experiment 5, a more substantial improvement was observed, with mAP reaching 75.6% on NEU-DET and 67.2% on GC10-DET, confirming its contribution to multi-scale feature fusion.
Experiment 8, which integrates all three modules (LFM, GCM, and MSFM) alongside GIES, achieved the highest performance with 79.8% mAP on NEU-DET and 68.1% on GC10-DET, demonstrating their complementary benefits. The sequential integration of these modules creates a hierarchical feature-learning mechanism that progressively refines local edges, global structures, and multi-scale semantics, achieving a 6.5% mAP improvement over the Experiment 1 baseline on NEU-DET and outperforming individual module additions across both datasets.

5.3. The Results of Confusion Matrix

As illustrated in Figure 9, the confusion matrix offers a detailed quantitative analysis of LGM-YOLO’s classification performance across all six surface-defect categories in the NEU-DET benchmark. Overall, the model exhibits strong detection capabilities, as reflected by the generally low false positive and false negative rates across most defect categories. However, performance discrepancies among the different classes highlight the inherent complexities and visual similarities of specific defect types.
In particular, the Cr class suffers from a relatively high misclassification rate of 0.35, with the majority of incorrect predictions attributed to the Rs class. This trend suggests that Cr and Rs defects share similar local feature patterns, leading to confusion during classification. A reciprocal misclassification is also observed for the Rs class, which not only confuses it with Cr but also with other defect types, indicating substantial feature overlap and the limitations of current feature discriminability.
The Ps class achieves a high correct classification rate of 0.87 but still experiences some confusion with the In and Sc classes, which collectively account for 13% of its misclassified instances. This is likely due to overlapping surface textures that challenge the model’s ability to establish distinct boundaries between these categories. In contrast, the Pa and Sc classes report the highest classification accuracies, with correct classification rates of 0.96 and 0.92, respectively. Their relatively unique and distinguishable visual characteristics contribute to the model’s robust performance for these classes.
Despite the limitations observed in detecting Cr defects, the LGM-YOLO model demonstrates strong generalization ability and robustness for the majority of defect types. The high diagonal values in the confusion matrix affirm its effectiveness in most classification tasks, while the darker matrix entries further emphasize areas of higher precision. These results confirm the model’s potential for practical deployment in automated visual inspection scenarios within industrial environments.
In the confusion matrix results on the GC10-DET dataset, as shown in Figure 10, the proposed LGM-YOLO outperforms traditional methods in fine-grained and multi-scale defect detection. The LFM improves sensitivity to micro-defects like fine Cr, while the MSFM achieves 0.75–0.78 accuracy for mixed-texture defects and corrosion pits, surpassing YOLOv4 and Faster R-CNN. With global context modeling via GCM, it reduces misclassifications for similar textures (e.g., Class 2–9 confusion at 6%).

5.4. Limitations and Future Works

The proposed model’s performance on this dataset establishes a baseline for real-time defect detection in analogous industrial scenarios. Although broader validation on larger or more diverse datasets is beyond the current scope, the findings herein provide a critical foundation for addressing challenges specific to steel truss inspection, such as grayscale imaging and multi-scale defect recognition.
Despite achieving a competitive mAP of 79.8% on NEU-DET and 68.1% on GC10-DET, the proposed LGM-YOLO has several identifiable limitations. First, the parallel multi-scale convolutional branches (1 × 1 to 7 × 7 kernels) in the MSFM exhibit computational load imbalance, resulting in suboptimal hardware utilization. Future work will focus on developing dynamic branch scheduling strategies to optimize computational efficiency. Second, while the model excels at detecting large defects, its accuracy on fine-grained defects like crazing remains constrained. We plan to enhance micro-defect detection by exploring multi-level semantic feature fusion, which integrates low-level edge features with high-level contextual information. Additionally, the current GIES strategy relies on manually tuned preprocessing parameters; we intend to replace these with adaptive parameter calculation methods based on sample statistics to improve generalization. Finally, for practical industrial deployment, we will address key challenges such as efficient data coordination for distributed camera systems and the establishment of standardized protocols for real-time defect alerts and maintenance integration, which will be our primary focus moving forward.

6. Conclusions

In this study, we proposed the LGM-YOLO model for steel-surface-defect detection and validated its effectiveness through comprehensive evaluations on the NEU-DET and GC10-DET datasets. The model incorporates three key modules—LFM, GCM, and MSFM—to enhance feature extraction and fusion across multiple scales. In addition, GIES is adopted to enrich the information content of the input images. Experimental results demonstrate that LGM-YOLO achieves an mAP of 79.8% on NEU-DET and 68.1% on GC10-DET, outperforming several state-of-the-art models, including YOLOv5s, Faster R-CNN, and RetinaNet. Ablation studies confirm that each of the proposed modules contributes positively to performance improvements, with the combination of all three modules yielding the highest accuracy.
In summary, LGM-YOLO effectively harmonizes precision and computational efficiency, positioning it as a viable approach for real-time industrial defect detection. Moving forward, subsequent research endeavors will prioritize enhancing the model’s efficacy in identifying small and intricate defects. Additionally, investigations into its applicability across diverse industrial inspection scenarios will be conducted, aiming to broaden its practical utility and adaptability.

Author Contributions

C.L.: writing—review and editing, supervision, project administration, funding acquisition, formal analysis. Y.H.: writing—review and editing, project administration, investigation, formal analysis. Z.Z.: writing—review and editing, writing—original draft, visualization. W.G.: visualization, validation, software, data curation, conceptualization. T.L.: validation, supervision, resources, investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Special Project for Performance Incentive and Guidance of Research Institutions in Chongqing under Grant CSTB2023JXJL-YFX0007, National Natural Science Foundation of China (No. 52405580), Innovation and Development Joint Fund Projects of Chongqing Natural Science Foundation (KCSTB2024NSCOLZX0128), and General Projects of Chongqing Natural Science Foundation (2024NSCQ-MSX3194).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Entezami, A.; Shariatmadar, H.; Karamodin, A. Data-driven damage diagnosis under environmental and operational variability by novel statistical pattern recognition methods. Struct. Health Monit. 2019, 18, 1416–1443. [Google Scholar] [CrossRef]
  2. Novozhenin, S.; Vystrchil, M.; Bogdanova, K. Analysis of the mathematical modelling results of displacements and deformations induced by the construction of the escalator tunnel of “Mining Institute” station in Saint Petersburg. J. Phys. Conf. Ser. 2020, 1661, 012105. [Google Scholar] [CrossRef]
  3. Payawal, J.M.G.; Kim, D.-K. Image-based structural health monitoring: A systematic review. Appl. Sci. 2023, 13, 968. [Google Scholar] [CrossRef]
  4. Xiang, S.; Li, P.; Luo, J.; Qin, Y. Micro transfer learning mechanism for cross-domain equipment RUL prediction. IEEE Trans. Autom. Sci. Eng. 2024, 22, 1460–1470. [Google Scholar] [CrossRef]
  5. Luo, Q.; Fang, X.; Liu, L.; Yang, C.; Sun, Y. Automated visual defect detection for flat steel surface: A survey. IEEE Trans. Instrum. Meas. 2020, 69, 626–644. [Google Scholar] [CrossRef]
  6. Wang, H.; Li, Z.; Wang, H. Few-shot steel surface defect detection. IEEE Trans. Instrum. Meas. 2021, 71, 5003912. [Google Scholar] [CrossRef]
  7. Xue, B.; Wu, Z. Key technologies of steel plate surface defect detection system based on artificial intelligence machine vision. Wirel. Commun. Mob. Comput. 2021, 2021, 5553470. [Google Scholar] [CrossRef]
  8. González-Hidalgo, M.; Massanet, S.; Mir, A.; Ruiz-Aguilera, D. Improving salt and pepper noise removal using a fuzzy mathematical morphology-based filter. Appl. Soft Comput. 2018, 63, 167–180. [Google Scholar] [CrossRef]
  9. Sony, S.; Dunphy, K.; Sadhu, A.; Capretz, M. A systematic review of convolutional neural network-based structural condition assessment techniques. Eng. Struct. 2021, 226, 111347. [Google Scholar] [CrossRef]
  10. Xiang, S.; Zheng, X.; Miao, J.; Qin, Y.; Li, P.; Hou, J.; Ilolov, M. Dynamic Self-Learning Neural Network and its Application for Rotating Equipment RUL Prediction. IEEE Internet Things J. 2024, 12, 12257–12266. [Google Scholar] [CrossRef]
  11. Fu, G.; Sun, P.; Zhu, W.; Yang, J.; Cao, Y.; Yang, M.Y.; Cao, Y. A deep-learning-based approach for fast and robust steel surface defects classification. Opt. Lasers Eng. 2019, 121, 397–405. [Google Scholar] [CrossRef]
  12. Huang, X.; Liu, Z.; Zhang, X.; Kang, J.; Zhang, M.; Guo, Y. Surface damage detection for steel wire ropes using deep learning and computer vision techniques. Measurement 2020, 161, 107843. [Google Scholar] [CrossRef]
  13. Guo, J.; Chen, H.; Liu, B.; Xu, F. A system and method for person identification and positioning incorporating object edge detection and scale-invariant feature transformation. Measurement 2023, 223, 113759. [Google Scholar] [CrossRef]
  14. Zaghdoudi, R.; Bouguettaya, A.; Boudiaf, A. Steel surface defect recognition using classifier combination. Int. J. Adv. Manuf. Technol. 2024, 132, 3489–3505. [Google Scholar] [CrossRef]
  15. Xu, W.; Gu, J.; Zhao, Y.; Yuan, M. Texture Extraction of Steel Surface Defects Using Adaptive Optimized Gabor Filter with Improved Genetic Algorithm. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; IEEE: New York, NY, USA, 2022; pp. 584–588. [Google Scholar]
  16. Saadatmorad, M.; Talookolaei, R.-A.J.; Pashaei, M.-H.; Khatir, S.; Wahab, M.A. Pearson correlation and discrete wavelet transform for crack identification in steel beams. Mathematics 2022, 10, 2689. [Google Scholar] [CrossRef]
  17. Boudiaf, A.; Benlahmidi, S.; Harrar, K.; Zaghdoudi, R. Classification of surface defects on steel strip images using convolution neural network and support vector machine. J. Fail. Anal. Prev. 2022, 22, 531–541. [Google Scholar] [CrossRef]
  18. Takalo-Mattila, J.; Heiskanen, M.; Kyllönen, V.; Määttä, L.; Bogdanoff, A. Explainable steel quality prediction system based on gradient boosting decision trees. IEEE Access 2022, 10, 68099–68110. [Google Scholar] [CrossRef]
  19. Liu, H.; Wang, D.; Xu, K.; Zhou, P.; Zhou, D. Lightweight convolutional neural network for counting densely piled steel bars. Autom. Constr. 2023, 146, 104692. [Google Scholar] [CrossRef]
  20. Huang, J.; Zhang, X.; Jia, L.; Zhou, Y. A high-speed YOLO detection model for steel surface defects with the channel residual convolution and fusion-distribution. Meas. Sci. Technol. 2024, 35, 105410. [Google Scholar] [CrossRef]
  21. Su, J.; Yi, H.; Ling, L.; Shu, A.; Lu, E.; Jiao, Y.; Wang, S. Multi-object surface roughness grade detection based on Faster R-CNN. Meas. Sci. Technol. 2022, 34, 015012. [Google Scholar] [CrossRef]
  22. Guo, Z.; Wang, C.; Yang, G.; Huang, Z.; Li, G. Msft-yolo: Improved yolov5 based on transformer for detecting defects of steel surface. Sensors 2022, 22, 3467. [Google Scholar] [CrossRef]
  23. Li, L.; Zhang, R.; Xie, T.; He, Y.; Zhou, H.; Zhang, Y. Experimental design of steel surface defect detection based on MSFE-YOLO—An improved YOLOV5 algorithm with multi-scale feature extraction. Electronics 2024, 13, 3783. [Google Scholar] [CrossRef]
  24. Le, H.F.; Zhang, L.J.; Liu, Y.X. Surface defect detection of industrial parts based on YOLOv5. IEEE Access 2022, 10, 130784–130794. [Google Scholar] [CrossRef]
  25. Li, C.; Yan, H.; Qian, X.; Zhu, S.; Zhu, P.; Liao, C.; Tian, H.; Li, X.; Wang, X.; Li, X. A domain adaptation YOLOv5 model for industrial defect inspection. Measurement 2023, 213, 112725. [Google Scholar] [CrossRef]
  26. Zhao, Y.; Shi, Y.; Wang, Z. The improved YOLOv5 algorithm and its application in small target detection. In Proceedings of the International Conference on Intelligent Robotics and Applications, Harbin, China, 1–3 August 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 679–688. [Google Scholar]
  27. Sadiq, M.; Masood, S.; Pal, O. FD-YOLOv5: A fuzzy image enhancement based robust object detection model for safety helmet detection. Int. J. Fuzzy Syst. 2022, 24, 2600–2616. [Google Scholar] [CrossRef]
  28. Wang, L.; Liu, X.; Ma, J.; Su, W.; Li, H. Real-time steel surface defect detection with improved multi-scale YOLO-v5. Processes 2023, 11, 1357. [Google Scholar] [CrossRef]
  29. Mi, Z.; Gao, Y.; Xu, X.; Tang, J. Steel strip surface defect detection based on multiscale feature sensing and adaptive feature fusion. AIP Adv. 2024, 14, 045005. [Google Scholar] [CrossRef]
  30. Wan, D.; Lu, R.; Hu, B.; Yin, J.; Shen, S.; Lang, X. YOLO-MIF: Improved YOLOv8 with Multi-Information fusion for object detection in Gray-Scale images. Adv. Eng. Inform. 2024, 62, 102709. [Google Scholar] [CrossRef]
  31. Song, K.; Yan, Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 2013, 285, 858–864. [Google Scholar] [CrossRef]
  32. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  33. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  34. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Figure 1. Architecture of the YOLOv5s object detection algorithm.
Figure 2. Architecture of the proposed LGM-YOLO model: LFM and GCM integrated into the backbone, MSFM embedded in the neck.
Figure 3. The structure diagram of LFM for fine-grained defect enhancement.
Figure 4. The structure diagram of GCM for global spatial dependency.
Figure 5. The structure diagram of MSFM.
Figure 6. Grayscale image enhancing strategy: (a) traditional RGB pipeline, (b) grayscale-to-RGB conversion, (c) single-channel grayscale input, (d) proposed GIES pipeline.
Figure 7. Annotated example images from the NEU-DET dataset with category-wise colored bounding boxes: Cr (red), In (green), Ps (blue), Sc (yellow), Rs (purple), Pa (orange).
Figure 8. Annotated example images from the GC10-DET dataset with category-wise colored bounding boxes: Pu (red), Wl (green), Cg (blue), Ws (yellow), Os (magenta), Ss (orange), In (cyan), Rp (neon green), Cr (sky blue), Wf (light purple).
Figure 9. The confusion matrix results on NEU-DET.
Figure 10. The confusion matrix results on GC10-DET.
Table 1. Quantitative evaluation of surface-defect-detection models on NEU-DET.
| Method | mAP/% | Cr/% | In/% | Pa/% | Rs/% | Ps/% | Sc/% | FPS | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv3-SPP [32] | 64.3 | 26.8 | 71.1 | 87.0 | 63.1 | 51.5 | 86.5 | 47.3 | 155.5 |
| YOLOv4 | 70.4 | 48.3 | 75.2 | 81.5 | 54.2 | 79.3 | 83.6 | 56.4 | 29.9 |
| YOLOv5s | 71.8 | 50.9 | 76.5 | 82.2 | 55.6 | 81.1 | 84.5 | 106.7 | 15.8 |
| Faster R-CNN [33] | 74.4 | 49.3 | 81.4 | 84.7 | 62.2 | 79.6 | 89.2 | 24.0 | 91.3 |
| RetinaNet [34] | 74.4 | 49.3 | 81.4 | 84.7 | 62.2 | 79.6 | 89.2 | 48.2 | 83.2 |
| Proposed model | 79.8 | 65.3 | 86.7 | 96.2 | 73.4 | 87.2 | 92.1 | 91.7 | 22.1 |
Table 2. Quantitative evaluation of surface-defect-detection models on GC10-DET.
| Method | mAP/% | Pu/% | Wl/% | Cg/% | Ws/% | Os/% | Ss/% | In/% | Rp/% | Cr/% | Wf/% | FPS | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv3-SPP [32] | 60.6 | 96.5 | 82.5 | 96.8 | 75.5 | 57.4 | 48.4 | 26.4 | 22.1 | 20.6 | 79.8 | 56.2 | 155.5 |
| YOLOv4 | 61.2 | 90.4 | 89.8 | 93.9 | 62.6 | 59.4 | 48.3 | 23.6 | 17.7 | 37.6 | 88.2 | 54.4 | 29.9 |
| YOLOv5s | 65.2 | 96.5 | 93.6 | 96.2 | 77.5 | 62.8 | 59.1 | 23.3 | 33.5 | 40.2 | 69.1 | 73.3 | 15.8 |
| Faster R-CNN [33] | 60.8 | 82.2 | 78.0 | 95.4 | 69.2 | 57.7 | 58.3 | 24.8 | 29.2 | 30.7 | 82.6 | 17.9 | 91.3 |
| RetinaNet [34] | 59.9 | 92.4 | 88.4 | 94.5 | 74.1 | 54.5 | 54.4 | 28.7 | 15.5 | 21.4 | 75.1 | 19.3 | 83.2 |
| Proposed model | 68.1 | 96.6 | 99.1 | 100 | 75.3 | 78.3 | 65.2 | 59.3 | 31.3 | 56.3 | 61.5 | 36.7 | 22.1 |
Table 3. The results of ablation experiment.
| Experiment Number | Model | LFM | GCM | MSFM | GIES | NEU-DET mAP/% | GC10-DET mAP/% |
|---|---|---|---|---|---|---|---|
| 1 | YOLOv5s | | | | | 73.3 | 65.4 |
| 2 | YOLOv5s | ✓ | | | | 73.6 | 66.3 |
| 3 | YOLOv5s | | ✓ | | | 73.8 | 66.2 |
| 4 | YOLOv5s | | | | ✓ | 73.7 | 65.8 |
| 5 | YOLOv5s | | | ✓ | | 75.6 | 67.2 |
| 6 | YOLOv5s | ✓ | | | ✓ | 76.1 | 67.1 |
| 7 | YOLOv5s | | | | | 76.4 | 66.9 |
| 8 | YOLOv5s | ✓ | ✓ | ✓ | ✓ | 79.8 | 68.1 |
