SLA-YOLO—Enhancing YOLO for Tiny Defect Detection in Industrial Defect Scenes

Lyu, Yanxia; Wang, Xinqi; Jin, Chenyu; Wei, Yuanhong; Sun, Zhenyu

doi:10.3390/math14111973

by Yanxia Lyu ¹,²,*, Xinqi Wang ¹, Chenyu Jin ¹, Yuanhong Wei ³ and Zhenyu Sun ³

¹ School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China

² Hebei Key Laboratory of Marine Perception Network and Data Processing, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China

³ School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(11), 1973; https://doi.org/10.3390/math14111973

Submission received: 13 April 2026 / Revised: 27 May 2026 / Accepted: 1 June 2026 / Published: 3 June 2026

(This article belongs to the Special Issue Mathematical Methods for Image Processing and Computer Vision)

Abstract

In recent years, the YOLO series has become a widely adopted framework for real-time object detection because of its favorable balance between detection accuracy and inference efficiency. Nevertheless, accurately recognizing and localizing tiny defects in industrial inspection remains challenging. The difficulty arises mainly from the extremely small scale of defect targets, low image contrast, and the limited feature representation capability of conventional models under uniform backgrounds. To address these issues through optimized feature modeling, we develop a dedicated framework for tiny defect detection, termed SLA-YOLO. The main contributions of this work are as follows. First, we adopt a slicing-based processing strategy inspired by the SAHI framework, referred to as Image Slicing Processing (ISP) in this work, and extend it to both the training and inference stages. This design enlarges the relative scale of tiny defects within local regions, improving detection sensitivity and data diversity without introducing additional model complexity. Second, we introduce a Large Receptive-Field Selective Context (LRSC) module. By leveraging selective convolution kernels with large receptive fields, this module adaptively captures contextual information around critical defect regions and refines scale-dependent feature representations. Third, we incorporate a Transformer-based High-level Feature Enhancement (THFE) module to improve global dependency modeling in high-level semantic representations, thereby enhancing feature discriminability for complex defect patterns. Experimental results on the CCB defect dataset show that SLA-YOLO improves mAP@50:95 by 2.7% and mAP@50 by 3.3%. In addition, the proposed method demonstrates strong generalization capability on other tiny object detection tasks.

Keywords: tiny defect detection; industrial surface inspection; feature representation optimization; image slicing processing; large receptive-field context modeling; multi-scale feature fusion

MSC: 68T45

1. Introduction

Industrial surface defect detection is a fundamental task in quality control from the perspective of visual data analysis [1]. With the continuous improvement of manufacturing precision, the identification of tiny defects increasingly relies on effective feature representation and mathematical modeling, particularly for materials such as glass and ceramics. In real production environments, product surfaces often exhibit small imperfections—such as scratches or bubbles—that arise from inherent material properties or variations in the manufacturing process. Although these defects exist at an extremely small scale, they can still reduce product reliability and, in some cases, introduce potential safety risks [2]. Therefore, fast and accurate identification of such defects is of great importance for ensuring robust quality assessment.

However, detecting tiny defects in industrial scenarios with uniform backgrounds remains highly challenging from the perspective of feature representation [1]. These defects are typically characterized by extremely small scale, low contrast, and blurred boundaries, which make them prone to being overlooked or to having their discriminative information weakened during the multi-stage downsampling process of convolutional neural networks [3,4]. At the same time, surface images of industrial products usually present uniform backgrounds, making it difficult for models to effectively separate subtle defect-related signals from the background, thereby increasing the risk of missed detections. As illustrated in Figure 1, tiny defect regions in images with uniform backgrounds can be easily suppressed by surrounding background information, substantially increasing the likelihood of false negatives. In addition, industrial inspection tasks usually impose strict real-time requirements, which further increases the difficulty of achieving efficient model optimization.

Traditional industrial surface defect detection has largely relied on manual inspection and conventional machine vision techniques. Nevertheless, both approaches exhibit inherent limitations in practical applications. Manual inspection is inherently subjective and is difficult to perform under real-time constraints. Conventional machine vision methods rely heavily on hand-crafted feature extraction schemes, which not only increases implementation complexity but also limits the generalization capability of the overall framework [1,5]. In recent years, the rapid development of deep learning has provided new perspectives for surface defect detection. Early two-stage object detection frameworks, such as R-CNN and Fast R-CNN [6,7,8], laid the foundation for modern object detection methods. Subsequently, general-purpose object detection frameworks, including Faster R-CNN [8], SSD [9], RetinaNet, and YOLO [10], have been widely introduced into industrial inspection tasks. In particular, the YOLO series, owing to its end-to-end architecture and favorable balance between inference speed and detection accuracy, has gradually become a mainstream and efficient detection framework in industrial scenarios, and numerous recent studies have further improved YOLO-based methods for industrial surface defect detection through structural enhancement and optimization strategies [11,12].

However, when applied to tiny surface defect detection in industrial environments with uniform backgrounds, the YOLO series still exhibits several critical limitations. First, YOLO employs multi-stage downsampling mechanisms to extract high-level semantic features, which inevitably leads to the loss of high-frequency spatial information associated with tiny defects, thereby degrading detection accuracy [3]. Second, in images characterized by uniform backgrounds and blurred defect boundaries, existing detection frameworks still exhibit limited capability in effectively modeling structural features within low-contrast regions, resulting in increased likelihood of missed detections. Finally, accurate recognition of tiny defects often depends on modeling the relationship between critical defect regions and their surrounding contextual information. Nevertheless, the standard detection head in the YOLO series lacks dynamic contextual modeling and adaptive feature fusion capability, which restricts its performance in such scenarios.

The main contributions of this work are summarized as follows:

- To address the difficulty of effectively modeling local feature representations in high-resolution images containing tiny defects, we introduce a SAHI-inspired Image Slicing Processing (ISP) strategy, which enhances local regions during both training and inference processes and improves the model’s sensitivity to small-scale structures.
- We introduce the LRSC module before the detection head, which adaptively adjusts the receptive field according to object scale, thereby improving contextual feature modeling for tiny defects.
- We incorporate the THFE module between the backbone and neck networks to enhance high-level feature representations through positional encoding and attention-based global interaction, thereby improving the model’s discriminative ability for tiny defect detection.

Extensive experiments on multiple datasets demonstrate that the proposed SLA-YOLO framework improves detection accuracy for tiny defects while maintaining real-time performance, showing strong generalization capability and practical applicability in industrial scenarios.

2. Related Work

2.1. Data Augmentation

In the task of tiny surface defect detection for industrial products, the extremely small scale of defects makes them highly susceptible to information loss during repeated downsampling in conventional detection models, which consequently leads to missed detections. As a crucial strategy for improving model generalization and robustness, data augmentation techniques have long played an important role in small object detection. Traditional data augmentation methods mainly rely on geometric transformations, such as image deformation, rotation, scaling, random cropping, and translation [13,14]. These fundamental augmentation strategies, as explored in early deep learning studies [15,16], substantially improve sample diversity and help alleviate overfitting in visual recognition tasks [15,16,17]. More recently, advanced augmentation strategies, including MixUp, CutMix, and Mosaic augmentation [18,19], have been widely adopted in object detection frameworks to enrich sample diversity and improve the representation ability for small-scale targets. However, although these methods improve sample diversity and model generalization, their capability in explicitly enhancing the perceptual saliency of tiny surface defects under complex industrial backgrounds remains limited. Kisantal et al. [20] proposed a copy–paste strategy that increases the density of tiny objects in specific regions by duplicating them together with their surrounding local context, thereby making them more visible in images. Chen et al. [21] introduced an adaptive resampling method within the RRNet framework, where a semantic segmentation network is incorporated to perform structured enhancement on target regions, which helps alleviate background mismatch and scale imbalance.

Because tiny objects are often sparsely distributed in the original images, the Mosaic data augmentation strategy introduced in YOLOv4 [20,22] significantly increases object quantity and density through multi-image stitching. Building on this idea, Chen et al. [23] further proposed a lightweight image stitching strategy based on equal-sized image concatenation, which improves sample diversity while keeping the computational overhead under control. Although these augmentation strategies can improve detection performance during training, they are still largely confined to the training stage and offer limited support during inference [24]. In addition, they do not provide explicit enhancement for local target perception. To overcome these limitations, we propose an image slicing-based data augmentation strategy that can be applied in both training and inference to improve overall detection performance [25].

2.2. Contextual Modeling

Detecting tiny surface defects in industrial applications remains highly challenging. These defects are not only extremely small but also exhibit low contrast, weak texture, and blurred boundaries, often with considerable scale variations. Consequently, methods relying on fixed local receptive fields struggle to extract sufficient discriminative features, frequently leading to missed detections or false positives. The situation worsens in scenarios with uniform backgrounds—such as glass and ceramic production. In these cases, the visual difference between the anomaly and the normal surface is so subtle that the already limited local features of tiny defects are easily obscured by the surrounding uniform background.

Incorporating contextual information has long been recognized as an effective way to boost tiny defect detection. Early work by Chen and Gupta [26] explored context modeling at both image and object levels, which noticeably improved recall for small objects. Building on this idea, Chen et al. [27] directly embedded contextual cues into the R-CNN framework. Their Context-AlexNet successfully combined local and global semantic relationships without overly complicating the network, enhancing detection robustness.

Later, Cai et al. [28] extended the role of context with a multi-scale region proposal mechanism, allowing the model to dynamically capture relevant information depending on object scale. In a similar vein, CoupleNet [29] adopted a dual-branch structure with Local and Global FCNs. By separately processing fine-grained local details and broader contextual cues before merging them, Zhu et al. [29] achieved highly competitive results in small object detection tasks.

However, most existing methods rely on static architectures with fixed receptive fields, limiting their adaptability to varying contexts across different defect regions. This constraint is particularly problematic for tiny defects on industrial surfaces with uniform backgrounds, where models often struggle to generalize and hit an accuracy bottleneck. To bridge this gap, we introduce a dynamic, adaptive context enhancement module before the detection head to sharpen contextual perception and boost overall performance.

2.3. The YOLO Family and Its Variants

Efficiency and precision are non-negotiable in domains like autonomous driving and industrial quality inspection. This demand has positioned the YOLO series as a leading one-stage framework, primarily for its ability to strike a practical balance between inference speed and accuracy. Since the debut of YOLOv1, the architecture has seen continuous refinement. Early iterations like YOLOv2 and YOLOv3 enhanced performance by deepening networks and integrating multi-scale feature fusion. YOLOv4 then pushed these boundaries further, introducing Cross-Stage Partial (CSP) connections and Path Aggregation Networks (PANs) to fundamentally strengthen feature representation.

As the series evolved, the focus shifted toward real-world deployment and optimization efficiency. YOLOv5, for instance, leaned into engineering-oriented refinements, making the model remarkably flexible for lightweight deployment without sacrificing accuracy. YOLOv6 pushed this further by introducing a decoupled head and adaptive anchor mechanisms, specifically tailoring the architecture for edge computing [30]. With the arrival of YOLOv7, the E-ELAN structure and dynamic label assignment strategies were brought in to strike a finer balance between computational efficiency and performance across multi-scale tasks.

The more recent YOLOv8 marks a significant shift toward an anchor-free design, utilizing a lightweight C2F module and a more streamlined decoding strategy. These architectural choices allow the model to maintain a robust balance between inference speed and generalization. This versatility has made YOLOv8 a staple in industrial defect detection, where it is now widely considered a foundational framework for modern inspection tasks.

Within this evolution, YOLOv5 warrants particular emphasis; owing to its lightweight architecture and exceptional engineering versatility, it has been extensively deployed as a core baseline in numerous latency-critical scenarios. For instance, in the domain of agricultural monitoring, Wang et al. [31] utilized YOLOv5 to facilitate efficient crop detection and growth status analysis, validating its real-time processing robustness in complex natural environments. Furthermore, in industrial safety and Personal Protective Equipment (PPE) compliance, Nazli et al. [32] developed a YOLOv5-based monitoring system for identifying safety gear usage, achieving high-precision online detection with minimal latency. These practical applications not only substantiate the engineering advantages of YOLOv5 in balancing detection performance with computational efficiency but also underscore the industry’s pressing demand for lightweight yet high-performance detection architectures—a trajectory that the SLA-YOLO framework proposed in this study aims to advance.

Modern YOLO iterations increasingly grapple with the trade-off between architectural depth and efficiency. YOLOv9, for instance, pursues better feature modeling through Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN). However, these gains in representation can come at the cost of a heavier computational load.

YOLOv10 takes a different direction by removing non-maximum suppression (NMS) from inference through consistent dual label assignments and by redesigning individual model components for efficiency. Although this design helps alleviate computational pressure to some extent, the resulting detection accuracy may still show slight fluctuations [33].

To bolster feature representation, YOLOv11 introduces C3k2 blocks together with an attention-based C2PSA module. Despite these architectural improvements, the model can still exhibit relatively high latency during real-world deployment.

Most recently, YOLOv12 adopts an attention-centric design built around area attention and residual efficient layer aggregation (R-ELAN) blocks. While it performs well on high-resolution data, the resulting computational overhead makes it difficult to adopt on high-speed industrial lines, especially when the priority is detecting tiny defects.

Addressing feature attenuation and scale issues has sparked several YOLO-based refinements for tiny defect perception. A prime example is STC-YOLO [34], which swaps standard downsampling for a more deliberate strategy, paired with a dedicated tiny-object head and multi-head attention. These adjustments allow the network to catch the subtle visual cues that simpler models often miss, significantly boosting accuracy in complex scenes like traffic monitoring. Attention mechanisms have also become a staple for reinforcing feature representation in tiny object detection [35,36,37]. For instance, Shen et al. [38] integrated deformable convolutions (DCN-C2F) and self-calibrated attention into YOLOv8, enabling the model to adaptively perceive tiny objects even with irregular shapes. Along similar lines, Wang et al. [39] paired BiFormer attention with a multi-scale fusion strategy, yielding a notable 7.7% jump in detection accuracy.

To address tiny object detection in industrial scenarios, researchers have proposed a series of task-oriented YOLO variants. For instance, AF-YOLO employs adaptive feature enhancement to improve the discriminability of low-contrast regions, while DGYOLOv8 balances multi-scale feature optimization with a lightweight design to maintain real-time performance. However, these methods primarily rely on full-image modeling and lack mechanisms to explicitly reconstruct the resolution of extremely small targets.

Parallel to CNN advancements, hybrid frameworks combining CNNs and Transformers have emerged [40]. Methods like RT-DETR leverage stacked Transformer encoders and query-based end-to-end paradigms to capture global dependencies. Despite their strong performance in complex scenes, these architectures demand substantial computational resources and complex pipeline reconstructions, severely restricting their deployment in high-speed, resource-constrained industrial inspections.

Beyond architectural shifts, recent studies have introduced advanced paradigms to improve feature robustness under challenging conditions, such as low contrast and structural degradation. Techniques like Turbidity–Similarity (TS) Decoupling [41] isolate image turbidity from target features, while LSNet [42] and ECFFNet [43] utilize multi-scale semantic interactions and contextual fusion to enhance fine-grained perception. Nevertheless, these strategies often depend on intricate feature reconstruction or multi-branch modeling, which inevitably increases computational overhead and hinders high-frequency real-time application.

Overall, while existing methods advance cross-scale feature fusion and attention mechanisms, a critical gap remains: the lack of a unified, lightweight mechanism that simultaneously preserves extremely small target resolution, leverages local context, and models intra-scale semantics. In industrial surface inspections (e.g., glass and ceramics), the primary hurdles are the microscopic scale of defects, feature attenuation from repeated downsampling, and the coexistence of weak textures with blurred boundaries. To overcome these bottlenecks, we propose SLA-YOLO, a lightweight framework built upon the YOLOv8s architecture. Without relying on multimodal inputs or heavy multi-stage pipelines, SLA-YOLO utilizes structured image slicing and lightweight intra-scale semantic interaction to achieve efficient and accurate single-modal RGB detection of tiny defects.

3. Methods

This section outlines the overall architecture of SLA-YOLO for detecting tiny surface defects in industrial settings with uniform backgrounds, highlighting its main design components. The framework consists of three core parts: the ISP module, a backbone network enhanced with the THFE module, and a neck network incorporating the LRSC module. Building on these, the detection head handles the final localization and classification of tiny defects. The complete architecture is depicted in Figure 2.

3.1. Image Slicing Processing

In industrial inspection scenarios where backgrounds are uniform, tiny surface defects often occupy just a handful of pixels. These defects usually exhibit low contrast and blurred edges, making them highly susceptible to feature loss during repeated downsampling in deep networks. Moreover, the surrounding background tends to be highly homogeneous and redundant. Feeding the entire image directly into the model can further dilute defect-specific features while increasing computational load, which in turn reduces detection efficiency.

To tackle these challenges, we introduce a SAHI-inspired Image Slicing Processing (ISP) strategy. As shown in Figure 3, the module splits the input image I into several partially overlapping slices of a predefined fixed size. Crucially, this slicing does not alter the actual physical size of the defects. Instead, it effectively boosts their relative size within each slice, enabling the model to detect and capture tiny defects more reliably.

Let the input image have spatial dimensions $H \times W$, and let each slice have a size of $m \times n$. Denote the overlap ratio between adjacent slices as $a$, and the horizontal and vertical strides as $s_{x}$ and $s_{y}$, respectively. To prevent defect regions located near slice boundaries from being truncated, the stride is computed by considering the overlap ratio, which is formulated as follows:

$$s_{x} = m \cdot (1 - a) \tag{1}$$

$$s_{y} = n \cdot (1 - a) \tag{2}$$

To ensure that adjacent slices overlap at their boundary regions, enabling complete capture of defects that span across slices, the input image $I$ is divided into $k$ equally sized and partially overlapping slices $P_{1}, P_{2}, \dots, P_{k}$. The number of slices $k$ is computed as:

$$k = \frac{H - m}{s_{x}} \cdot \frac{W - n}{s_{y}} \tag{3}$$
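As a rough illustration of Eqs. (1)–(3), the following Python sketch enumerates slice positions for a given image size, slice size, and overlap ratio. The function names (`slice_origins`, `slice_boxes`), the 1920 × 1080 image, the 640 × 640 slice, and the overlap ratio of 0.2 are assumptions chosen for illustration and are not taken from the paper. The sketch also rounds the slice count up and clamps the last slice to the image border so that every pixel is covered; this is a common convention and may differ from Eq. (3) as printed.

```python
import math

def slice_origins(length, window, overlap):
    """Start offsets of overlapping windows that cover [0, length)."""
    if window >= length:
        return [0]
    # Stride from the overlap ratio, as in Eqs. (1) and (2).
    stride = max(1, round(window * (1 - overlap)))
    # Assumption: round up and add one so the last window reaches the border.
    count = math.ceil((length - window) / stride) + 1
    # Clamp the final window so it ends exactly at the image border.
    return [min(i * stride, length - window) for i in range(count)]

def slice_boxes(height, width, slice_h, slice_w, overlap):
    """Return (x0, y0, x1, y1) for every slice, row by row."""
    ys = slice_origins(height, slice_h, overlap)
    xs = slice_origins(width, slice_w, overlap)
    return [(x, y, x + slice_w, y + slice_h) for y in ys for x in xs]

# Hypothetical example values, not the paper's settings.
boxes = slice_boxes(height=1080, width=1920, slice_h=640, slice_w=640, overlap=0.2)
print(len(boxes))   # 8
print(boxes[0])     # (0, 0, 640, 640)
print(boxes[-1])    # (1280, 440, 1920, 1080)
```

With these example values the stride is 512 pixels in both directions, and the image is covered by 4 columns and 2 rows of slices.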

After the slicing process, the edge details and texture patterns of tiny defects become significantly more prominent in the corresponding feature maps, effectively alleviating the gradual disappearance of defect-related features during deep convolutional operations.

During inference, we adopt a slicing-aided hyper inference (SAHI) strategy [44]. Specifically, the input image is first divided into slices, and each slice is processed independently. For the detection results generated from these slices, coordinate mapping is performed based on their spatial positions in the original image to restore the prediction boxes to the global image coordinate system. On this basis, non-maximum suppression (NMS) is applied to eliminate redundant boxes in overlapping regions, retaining only the results with the highest confidence. Finally, the processed detection results are merged and restored to the complete image.
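The merging step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the SAHI library's implementation: the detection tuple layout `(x0, y0, x1, y1, score, cls)`, the function names, the class-aware greedy NMS, and the IoU threshold of 0.5 are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1, ...) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_slice_detections(per_slice, iou_thr=0.5):
    """per_slice: list of ((x_off, y_off), [(x0, y0, x1, y1, score, cls), ...])."""
    # Step 1: shift slice-local boxes into the coordinates of the full image.
    dets = [
        (x0 + ox, y0 + oy, x1 + ox, y1 + oy, score, cls)
        for (ox, oy), slice_dets in per_slice
        for x0, y0, x1, y1, score, cls in slice_dets
    ]
    # Step 2: greedy NMS, keeping the highest-scoring box among same-class overlaps.
    dets.sort(key=lambda d: d[4], reverse=True)
    kept = []
    for det in dets:
        if all(det[5] != k[5] or iou(det, k) < iou_thr for k in kept):
            kept.append(det)
    return kept

# Hypothetical detections: the same defect appears in two overlapping slices.
per_slice = [
    ((0, 0), [(600, 100, 620, 120, 0.9, 0)]),
    ((512, 0), [(88, 100, 108, 120, 0.8, 0), (300, 300, 310, 310, 0.7, 1)]),
]
merged = merge_slice_detections(per_slice)
print(merged)
```

In this example the two slice-level boxes for the first defect map to the same global box, so only the higher-confidence one survives, while the second defect is kept unchanged.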

Owing to the highly uniform background in industrial scenarios, detection results across different slices exhibit high consistency, enabling the aforementioned fusion process to stably and effectively suppress redundant boxes, thereby enhancing the recall performance for tiny defects. In addition, this strategy avoids the redundant computation caused by direct inference on the full image, effectively reducing computational resource consumption while ensuring detection accuracy [45,46].

It is worth noting that the ISP strategy follows the SAHI (Slicing Aided Hyper Inference) framework and extends it from inference to both training and inference stages, thereby ensuring consistent feature distributions and improving the model’s sensitivity to tiny defects.

3.2. THFE in Backbone

The backbone serves as the core feature extraction component in SLA-YOLO. It mainly consists of a series of convolutional layers and C2F modules, which generate multi-level feature maps to effectively capture representations at different scales. However, when performing tiny defect detection in environments with uniform backgrounds, high-level features tend to suffer from blurred positional information and insufficient multi-scale representation after repeated convolution and downsampling operations. To address this issue, we introduce a Transformer-based feature enhancement module at the high-level semantic stage of the backbone to improve the representation quality of deep features. This module aims to enhance global dependency modeling and positional awareness for tiny defect detection scenarios. The architecture of this module is illustrated in Figure 4.

Unlike recent CNN–Transformer hybrid detection methods that introduce multi-layer Transformer encoders and reconstruct the overall detection pipeline (e.g., RT-DETR), this work retains the original detection framework. Instead, a lightweight integration strategy is adopted by introducing a single-layer attention enhancement module at the high-level semantic feature stage, with the aim of improving feature representation while keeping the computational overhead under control [47,48].

Let the input feature be $x \in \mathbb{R}^{H \times W \times C}$. After multiple convolutional operations, spatial position information gradually becomes ambiguous. To compensate for this loss, the THFE module first employs a two-dimensional sine-cosine positional encoding function to generate a positional embedding vector $\mathrm{pos}$:

$$\mathrm{pos} = f_{\text{2D-sincos}}(x) \tag{4}$$

where $f_{\text{2D-sincos}}$ denotes the 2D sinusoidal positional embedding function.

Subsequently, the input feature $x$ is flattened through a function $f$, yielding a reshaped representation $x'$. The flattened feature $x'$ is then combined with the positional embedding $\mathrm{pos}$ to generate the query and key vectors containing explicit spatial information:

$$x' = f(x) \tag{5}$$

$$q = k = x' + \mathrm{pos} \tag{6}$$

This design ensures that the attention mechanism not only relies on feature similarity but also incorporates spatial position cues, thereby preventing the positional signals of tiny defects from being weakened during the downsampling process [49].
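To make Eqs. (4)–(6) concrete, the NumPy sketch below builds a 2D sine-cosine embedding and adds it to the flattened feature. The paper does not give the exact encoding, so the layout used here (one quarter of the channels each for sin and cos of the horizontal and vertical positions, with a temperature of 10000) is an assumption borrowed from common Transformer-detector practice, and the function name `sincos_2d` is hypothetical.

```python
import numpy as np

def sincos_2d(height, width, channels, temperature=10000.0):
    """2D sine-cosine positional embedding of shape (height * width, channels)."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    quarter = channels // 4
    # Geometric frequency schedule, one frequency per channel group entry.
    omega = 1.0 / temperature ** (np.arange(quarter) / quarter)
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    x_ang = xs.reshape(-1, 1) * omega   # (H*W, C/4)
    y_ang = ys.reshape(-1, 1) * omega   # (H*W, C/4)
    return np.concatenate(
        [np.sin(x_ang), np.cos(x_ang), np.sin(y_ang), np.cos(y_ang)], axis=1
    )

# Hypothetical feature-map size for illustration.
H, W, C = 4, 5, 8
x = np.random.default_rng(0).normal(size=(H, W, C))
pos = sincos_2d(H, W, C)        # Eq. (4): depends only on the shape of x
x_flat = x.reshape(H * W, C)    # Eq. (5): flatten spatial positions into tokens
q = k = x_flat + pos            # Eq. (6): queries and keys carry position
print(pos.shape, q.shape)       # (20, 8) (20, 8)
```

Because the embedding depends only on position, two tokens with identical features but different locations receive different queries and keys, which is what lets the attention step use spatial cues.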

Next, the query vector $q$, key vector $k$, and value vector $x'$ are fed into a multi-head attention module $f_{ma}$, which highlights potential tiny defect regions while suppressing redundant background information. The attention-enhanced feature $x_{ma}$ is obtained as:

$$x_{ma} = f_{ma}(q, k, x') \tag{7}$$

Unlike general-purpose Vision Transformers, the proposed attention mechanism is specialized for tiny defect detection, addressing sparse features and background redundancy. By prioritizing single-scale selective enhancement over global representations, it strengthens defect-related responses and improves target discriminability.

Although the attention mechanism effectively emphasizes potential defect regions, its nonlinear representation capacity is limited. Therefore, we further introduce a feed-forward neural network (FFN) to perform nonlinear transformation and enhance feature discriminability. Specifically, the flattened feature $x'$ and the attention-enhanced feature $x_{ma}$ are jointly fed into the FFN:

$$y = f_{ffn}(x', x_{ma}) \tag{8}$$
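The attention and FFN steps in Eqs. (7) and (8) can be sketched with NumPy as follows. The paper does not specify how $x'$ and $x_{ma}$ are combined inside the FFN, so this sketch assumes the standard Transformer encoder layout (residual addition, layer normalization, two-layer MLP with ReLU). The random weights, the two attention heads, the hidden width of 16, and all function names are hypothetical, and the positional embedding is replaced by a random stand-in to keep the example self-contained.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(z, eps=1e-5):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def multi_head_attention(q, k, v, p, heads):
    n, c = q.shape
    d = c // heads
    split = lambda t: t.reshape(n, heads, d).transpose(1, 0, 2)  # (heads, N, d)
    qh, kh, vh = split(q @ p["wq"]), split(k @ p["wk"]), split(v @ p["wv"])
    attn = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(d))      # (heads, N, N)
    out = (attn @ vh).transpose(1, 0, 2).reshape(n, c)           # merge heads
    return out @ p["wo"], attn

def thfe_tokens(x_flat, pos, p, heads=2):
    q = k = x_flat + pos                                          # Eq. (6)
    x_ma, attn = multi_head_attention(q, k, x_flat, p, heads)     # Eq. (7)
    h = layer_norm(x_flat + x_ma)             # assumed residual + normalization
    ffn = np.maximum(h @ p["w1"], 0.0) @ p["w2"]                  # two-layer MLP
    return layer_norm(h + ffn), attn                              # Eq. (8)

rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
x_flat = rng.normal(size=(H * W, C))
pos = 0.1 * rng.normal(size=(H * W, C))       # stand-in for the sin-cos embedding
shapes = {"wq": (C, C), "wk": (C, C), "wv": (C, C), "wo": (C, C),
          "w1": (C, 16), "w2": (16, C)}
p = {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}
y, attn = thfe_tokens(x_flat, pos, p)
y_map = y.reshape(H, W, C)                    # tokens back to a spatial map
print(y.shape, attn.shape, y_map.shape)       # (20, 8) (2, 20, 20) (4, 5, 8)
```

Note that the values come from $x'$ without the positional term, matching Eq. (7), so position influences where each token attends but is not mixed into the aggregated content.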

The output vector $y$ is then reshaped back to its spatial structure via function $f$, yielding the enhanced high-level feature representation $y'$:

$$y' = f(y) \tag{9}$$

Finally, the feature map processed by convolution, the C2F module, and the attention mechanism is further refined through the SPPF module to perform multi-scale feature aggregation, thereby efficiently capturing information at different spatial scales [22,50]:

$$Y = f_{sppf}(y') \tag{10}$$

The SPPF module leverages multi-scale pooling operations to effectively alleviate the limitation of single-scale feature representation.
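The pooling part of an SPPF block can be illustrated with a short NumPy sketch. It assumes the commonly used form in which three stride-1 max-pooling layers with a 5 × 5 kernel are applied in series and their outputs are concatenated with the input; the 1 × 1 convolutions that usually surround the pooling chain are omitted, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_same(x, k=5):
    """Stride-1 max pooling with 'same' padding on an (H, W, C) array."""
    p = k // 2
    padded = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    windows = sliding_window_view(padded, (k, k), axis=(0, 1))  # (H, W, C, k, k)
    return windows.max(axis=(-2, -1))

def sppf_pooling(x, k=5):
    """Serial pooling chain; outputs are concatenated along the channel axis."""
    p1 = max_pool_same(x, k)
    p2 = max_pool_same(p1, k)
    p3 = max_pool_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=-1)

# A single active pixel shows how far each pooling stage spreads a response.
x = np.zeros((13, 13, 1))
x[6, 6, 0] = 1.0
out = sppf_pooling(x)
print(out.shape)                                     # (13, 13, 4)
print([int(out[..., i].sum()) for i in range(4)])    # [1, 25, 81, 169]
```

The printed sums show that the three serial stages respond over 5 × 5, 9 × 9, and 13 × 13 neighborhoods, which is how the chain aggregates context at several spatial scales while reusing one small kernel.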

Overall, the THFE module enhances high-level feature sensitivity to tiny defects through positional compensation, attention-based interaction, and multi-scale aggregation. Under uniform background conditions, this mechanism not only suppresses noise introduced by redundant background information but also improves the global discriminability of defect regions, thereby providing a strong foundation for subsequent feature refinement in the neck network.

3.3. Large Receptive-Field Selective Context Module

The core function of the neck network is to further integrate and refine the multi-scale features generated by the backbone, thereby providing high-quality feature representations for the detection head. However, in tiny surface defect detection tasks for industrial products with uniform backgrounds, targets are typically extremely small and exhibit low contrast. Under such conditions, relying solely on local texture and morphological features is often insufficient for accurate recognition. Instead, the discrimination of such defects largely depends on modeling their relationships with surrounding contextual information. Notably, different defect types exhibit varying degrees of dependence on contextual regions [51]. A receptive field that is too small may fail to capture sufficient contextual cues, whereas an excessively large receptive field may introduce redundant background noise, leading to feature dilution and an increased risk of false positives. Therefore, adaptively selecting an appropriate receptive field range during the feature fusion stage becomes a critical factor in improving detection accuracy [52].

To address this issue, we design the LRSC module and place it before the detection head to enhance contextual modeling capability. As illustrated in Figure 5, the proposed module integrates large receptive-field convolution and selective context modeling mechanisms into the C2F structure, enabling the network to dynamically select contextual information at different scales. Through this adaptive receptive field selection strategy, the fused features are effectively enhanced, thereby improving the model’s ability to distinguish tiny defects from uniform background regions.

Specifically, considering that different defect types require varying ranges of contextual information, the input feature map $X'$ is first processed by two depthwise convolution operations with different kernel sizes and dilation rates to obtain multi-scale feature representations $U_{i}$:

$$U_{i} = f_{i}^{dw}(X') \tag{11}$$

where $f_{i}^{dw}(\cdot)$ denotes a depthwise convolution operation with kernel size $k_{i}$ and dilation rate $d_{i}$. Large convolution kernels facilitate the capture of broader contextual information, while dilated convolution expands the receptive field without significantly increasing parameter count or computational cost.
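The trade-off between receptive field and parameter count can be checked with a short calculation. The kernel size and dilation rate below are hypothetical values chosen for illustration; the paper's actual settings for $k_{i}$ and $d_{i}$ are not stated in this section.

```latex
% Spatial extent covered by one dilated kernel of size k and dilation d
\begin{align*}
k_{\mathrm{eff}} &= k + (k - 1)(d - 1) \\
% Hypothetical example: k = 7, d = 3
k_{\mathrm{eff}} &= 7 + 6 \cdot 2 = 19 \\
% Depthwise weights per channel stay at k^2, independent of d
k^{2} &= 49 \quad \text{versus} \quad 19^{2} = 361 \text{ for a dense } 19 \times 19 \text{ kernel}
\end{align*}
```

Under these assumed values, a dilated 7 × 7 depthwise kernel spans a 19 × 19 neighborhood with roughly one seventh of the weights that a dense kernel of the same extent would need.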

Subsequently, a $1 \times 1$ convolution layer is applied to compress the channel dimension and enhance inter-channel information interaction, yielding refined multi-scale feature representations $U_{i}'$:

$$U_{i}' = f_{i}^{1 \times 1}(U_{i}), \quad i = 1, 2 \tag{12}$$

where $f_{i}^{1 \times 1}(\cdot)$ represents the $1 \times 1$ convolution operation. The multi-scale features $U_{i}'$ are then concatenated along the channel dimension to fuse contextual representations at different scales:

$$U' = [U_{1}'; U_{2}'] \tag{13}$$

To better distinguish global statistical information from locally salient responses, average pooling and max pooling operations are separately applied to $U'$ to extract complementary spatial features:

$$SA_{avg} = P_{avg}(U') \tag{14}$$

$$SA_{\max} = P_{\max}(U') \tag{15}$$

where $P_{avg}(\cdot)$ and $P_{\max}(\cdot)$ denote average pooling and max pooling operations, respectively. The combination of average and max pooling not only preserves global background statistics but also enhances the response to tiny defect regions.

To further enrich spatial feature representation, the pooled features $SA_{avg}$ and $SA_{\max}$ are concatenated and fused via a convolution operation $f(\cdot)$, generating a spatial attention map:

$$SA = f([SA_{avg}; SA_{\max}]) \tag{16}$$

The resulting spatial attention map provides a basis for subsequent adaptive weight assignment. To enable adaptive selection of convolutional features at different scales, each spatial attention map $SA_{i}$ is processed using a sigmoid activation function:

$$SA_{i}' = \sigma(SA_{i}) \tag{17}$$

where $\sigma(\cdot)$ denotes the sigmoid activation function. This operation produces spatial selection weights corresponding to each large kernel branch. Finally, the weighted multi-scale features are combined through element-wise summation and further fused via convolution to obtain the attention-enhanced feature map $S$. The final output $Y$ is generated by performing element-wise multiplication between the attention-enhanced feature and the input feature:

$$S = f\left(\sum_{i = 1}^{2} SA_{i}' \cdot U_{i}'\right), \quad Y = X' \cdot S \tag{18}$$
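The selection step in Eqs. (13)–(18) can be traced on a tiny example. The following is a minimal pure-Python sketch, not the authors' implementation, and it rests on several assumptions that the text does not confirm: the pooling in Eqs. (14)–(15) is taken across the channel axis at each spatial position, the convolution in Eq. (16) is reduced to a 2 × 2 mixing of the two pooled values (the hypothetical `w_mix`), the final fusion convolution in Eq. (18) is treated as identity, and both branches keep the same channel count as the input.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def spatial_select(x, u1, u2, w_mix):
    """Toy version of Eqs. (13)-(18) on nested lists shaped [C][H][W].

    x      : input feature X'
    u1, u2 : refined branch features U_1', U_2' (same shape as x here)
    w_mix  : 2x2 weights standing in for the convolution f in Eq. (16);
             row i maps (avg, max) to the attention value of branch i.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    stacked = u1 + u2  # Eq. (13): concatenation along the channel axis
    y = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for r in range(height):
        for c in range(width):
            column = [ch[r][c] for ch in stacked]
            sa_avg = sum(column) / len(column)  # Eq. (14), assumed channel-wise
            sa_max = max(column)                # Eq. (15), assumed channel-wise
            # Eqs. (16)-(17): one attention value per branch, squashed to (0, 1)
            a1 = sigmoid(w_mix[0][0] * sa_avg + w_mix[0][1] * sa_max)
            a2 = sigmoid(w_mix[1][0] * sa_avg + w_mix[1][1] * sa_max)
            for k in range(channels):
                # Eq. (18): weighted sum of the branches, then gate the input.
                # The final fusion convolution is taken as identity here.
                s = a1 * u1[k][r][c] + a2 * u2[k][r][c]
                y[k][r][c] = x[k][r][c] * s
    return y


# Illustrative values only: C = 1, H = 2, W = 2.
x = [[[1.0, 2.0], [3.0, 4.0]]]
u1 = [[[0.5, 0.0], [1.0, -1.0]]]
u2 = [[[0.2, 0.4], [0.0, 2.0]]]
y = spatial_select(x, u1, u2, w_mix=[[1.0, 0.0], [0.0, 1.0]])
```

Because the attention values vary per pixel rather than per channel, two positions in the same channel can favor different branches, which is the spatial (as opposed to channel-wise) selection the text describes.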

In this structure, the large-kernel convolution component decomposes a single large convolution kernel into a series of progressively dilated depthwise convolutions. This design preserves the model’s capability to capture multi-scale receptive field representations while significantly reducing the number of parameters. Meanwhile, the spatial attention mechanism adaptively aggregates information from multiple receptive fields along the spatial dimension. Compared with attention methods that perform selection solely along the channel dimension, the proposed mechanism is better suited to defect detection under uniform background conditions. When background information is highly homogeneous, the distinguishing characteristics of defects primarily lie in localized spatial contextual relationships rather than channel-wise feature variations.
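To make the parameter saving concrete, the sketch below compares a stack of dilated depthwise convolutions with a single depthwise kernel that covers the same receptive field. The kernel sizes, dilation rates, and channel count are hypothetical values chosen for illustration, since the text does not list the configuration used in LRSC. The receptive-field formula assumes stride-1 convolutions applied one after another.

```python
def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied one after another.

    layers: list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf


def depthwise_params(layers, channels):
    """Weight count of depthwise convolutions (one k x k filter per channel)."""
    return sum(channels * k * k for k, _ in layers)


# Hypothetical configuration, chosen only for illustration.
layers = [(5, 1), (7, 3)]
channels = 64

rf = stacked_receptive_field(layers)             # 1 + 4*1 + 6*3 = 23
decomposed = depthwise_params(layers, channels)  # 64 * (25 + 49) = 4736
single = depthwise_params([(rf, 1)], channels)   # 64 * 23 * 23 = 33856
print(rf, decomposed, single)
```

Under these assumed values, the decomposed form reaches a 23 × 23 receptive field with roughly one seventh of the weights of an equivalent single depthwise kernel.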

Existing methods differ in how they model receptive fields. Selective convolution approaches such as SKNet perform kernel selection along the channel dimension, while methods such as LSKNet [53] rely on large-kernel decomposition to expand the receptive field. In contrast, LRSC places greater emphasis on lightweight multi-receptive-field contextual modeling in the spatial dimension. This design is more aligned with the characteristics of uniform-background defect detection, where discriminative information primarily depends on local spatial contextual relationships.

Within the SLA-YOLO architecture, we incorporate the LRSC module before the detection head. This module allows the network to dynamically capture large receptive-field contextual information before producing the final detection features. Experimental results show that this design noticeably boosts both recall and detection accuracy when dealing with extremely small, low-contrast defects. In doing so, it helps address the limited contextual modeling capability of the YOLO series.

4. Experiments

To thoroughly assess the effectiveness and generalization of the proposed SLA-YOLO framework for tiny surface defect detection in industrial settings with uniform backgrounds, we conducted systematic experiments on two representative public datasets as well as one self-constructed dataset. Given the limited scale of industrial defect datasets, performing a strict large-scale cross-validation protocol is often impractical. Therefore, a three-fold repeated data split strategy under a unified training setup was adopted to improve the reliability and stability of the experimental results. All reported results correspond to the average performance over the three runs, thereby reducing the influence of random initialization and data partition fluctuations [54].

The experimental setup was guided by three central research questions:

1. Can the proposed framework deliver noticeable performance gains in scenarios characterized by extremely small scale, low contrast, and blurred boundaries?
2. Do the core modules (ISP, THFE, and LRSC) function as intended and fully achieve their design objectives?
3. While preserving real-time performance, how well does the model generalize and remain practically feasible across different industrial application scenarios?

4.1. Datasets

CCB: This dataset is a high-resolution industrial glass defect collection built specifically for this study. The images were captured by our research team using industrial-grade cameras and professional optical imaging equipment in real production settings. It includes 872 images, each standardized to a resolution of 5120 × 5120 pixels. The dataset is divided into 741 training images and 131 testing images. It covers three representative defect types: bubbles, broken-skin bubbles, and linear defects. These defects are extremely small, often low in contrast, weakly textured, and bounded by blurred edges, which makes the dataset particularly challenging and well-suited for evaluating the perception ability of tiny defect detection models. The CCB dataset mainly serves to test the practical applicability of the proposed method in industrial scenarios where the background is uniform.

S-ODv2 (SeaDronesSee-Object Detection v2): This public dataset contains high-resolution RGB images captured by unmanned aerial vehicles in maritime rescue scenarios. It includes 14,227 images, divided into a training set of 8930, a validation set of 1547, and a test set of 3750 images. Six categories are annotated: ignored, swimmer, boat, jetski, life_saving_appliances, and buoy, with the ignored category excluded during evaluation [33]. The dataset poses several challenges. Targets are usually very small, and the sea surface background is fairly uniform, often with strong sunlight reflections. In addition, objects are sparsely distributed, making detection more difficult. Therefore, S-ODv2 is widely used to assess the robustness and generalization of models in natural small object detection scenarios.

Mini-COCO: This dataset is built from the MS COCO 2017 dataset and focuses on the nine categories containing the smallest objects: sports ball, traffic light, baseball glove, bottle, mouse, frisbee, handbag, remote, and book. In total, it includes 36,136 images, with 28,908 used for training and 7228 for validation. Mini-COCO is mainly used to test a model’s generalization ability in typical tiny object detection scenarios, ensuring that the proposed method does not simply overfit to specific industrial datasets.

4.2. Implementation Details

All experiments were conducted using the PyTorch 2.0.1 deep learning framework. Training and evaluation were performed on a workstation equipped with two NVIDIA A100 80GB GPUs. To ensure fairness and reproducibility, all models were trained from scratch without using pretrained weights.

Given the limited scale of industrial defect datasets, a repeated three-fold data partition strategy was adopted instead of strict cross-validation to improve the reliability of the experimental evaluation, and the mean performance across the three runs was reported as the final result. Regarding the slicing strategy, consistency was maintained between the training and inference stages to avoid additional domain shifts caused by inconsistent input scale distributions, thereby ensuring stable and consistent feature learning.
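The repeated-partition protocol can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: it assumes each run reshuffles the full CCB image pool with a different seed and keeps the 741/131 split sizes reported in Section 4.1. The helper names `repeated_splits` and `average_metrics` are hypothetical.

```python
import random
from statistics import mean


def repeated_splits(num_images, num_train, seeds):
    """Yield (train_ids, test_ids) for each seed using a fresh random shuffle."""
    for seed in seeds:
        ids = list(range(num_images))
        random.Random(seed).shuffle(ids)
        yield ids[:num_train], ids[num_train:]


def average_metrics(runs):
    """Average each metric over runs; `runs` is a list of dicts with equal keys."""
    return {name: mean(run[name] for run in runs) for name in runs[0]}


# Three partitions of the 872 CCB images (741 train / 131 test each).
splits = list(repeated_splits(num_images=872, num_train=741, seeds=(0, 1, 2)))
```

Each split would drive one train/evaluate cycle, and the per-run metric dictionaries would then be passed to `average_metrics` to obtain the reported mean.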

For training, dataset-specific adjustments were made to match each dataset’s characteristics. For S-ODv2 and Mini-COCO, training ran for 300 epochs with a batch size of 32. Due to the higher resolution and increased difficulty of detecting tiny defects in CCB, epochs were extended to 400 while keeping the batch size at 32, balancing convergence stability and computational efficiency. Other hyperparameters—including learning rate schedules, weight decay, and optimizer settings—remained consistent with default settings of mainstream comparison models, ensuring a unified experimental baseline.

During image preprocessing, dataset-specific ISP strategies were employed to handle differences in resolution. Specifically, the network input resolution was unified to $640 \times 640$. To avoid the collapse of tiny defect features caused by direct scaling of high-resolution images, the ISP strategy first generates local patches, which are then resized to the input dimensions for detection. For CCB, images were cropped into $3200 \times 3200$ patches with a 0.2 overlap ratio to prevent tiny defects from being split at patch edges, thereby reducing missed detections. For S-ODv2, patches were $2400 \times 2400$ with the same overlap, balancing object integrity and computational efficiency. On CCB, this strategy increases the relative scale of tiny targets by reducing the downsampling ratio from $8 \times$ (full-image scaling, 5120/640) to $5 \times$ (3200/640), which helps preserve critical structural and weak texture information after resizing to $640 \times 640$. Mini-COCO, with its lower resolution, was used directly without slicing to evaluate the general feature extraction capability of the THFE and LRSC modules.
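The slicing geometry for CCB can be reproduced with a short sketch. The window placement below follows the SAHI-style convention of shifting the last window back so that it ends at the image border. That convention is an assumption about the implementation, and the function names are hypothetical.

```python
def slice_starts(length, patch, overlap_ratio):
    """Start offsets of overlapping patches along one image axis.

    The last patch is shifted back so that it ends exactly at the border.
    """
    if patch >= length:
        return [0]
    step = patch - round(patch * overlap_ratio)
    if step <= 0:
        raise ValueError("overlap_ratio must leave a positive step")
    starts, pos = [], 0
    while pos + patch < length:
        starts.append(pos)
        pos += step
    starts.append(length - patch)
    return starts


def slice_boxes(width, height, patch, overlap_ratio):
    """All (x0, y0, x1, y1) patch windows for an image."""
    return [(x, y, x + patch, y + patch)
            for y in slice_starts(height, patch, overlap_ratio)
            for x in slice_starts(width, patch, overlap_ratio)]


def to_global(box, window):
    """Map a box given in patch pixels back to full-image coordinates."""
    x0, y0 = window[0], window[1]
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)


windows = slice_boxes(5120, 5120, patch=3200, overlap_ratio=0.2)
print(len(windows), windows[-1])  # 4 (1920, 1920, 5120, 5120)
print(5120 / 640, 3200 / 640)     # 8.0 5.0
```

Under this convention a 5120 × 5120 image yields four windows. Because the last window is shifted to the border, the actual overlap between neighboring windows (1280 pixels) is larger than the nominal 0.2 ratio. Boxes predicted on the resized 640 × 640 patch would first need to be scaled by 5 before `to_global` is applied.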

Model performance was evaluated through a comprehensive analysis of both detection accuracy and computational efficiency. In terms of accuracy, we utilized Recall (R), mean Average Precision at IoU 0.5 (mAP@50), mAP@75, and mAP@50:95. Specifically, Recall measures the overall coverage capability of the model; mAP@50 evaluates baseline detection performance; mAP@75 measures localization precision under high IoU constraints; and mAP@50:95, the standard COCO metric, provides a comprehensive assessment of detection capability across varying localization strictness. Regarding computational efficiency, parameters (Params, M), floating-point operations (FLOPs, G), and frames per second (FPS) were introduced as evaluation metrics to measure computational complexity and practical deployment efficiency. Params reflects the model size, FLOPs measures theoretical computational overhead, and FPS reflects the response speed during actual inference, providing a more realistic evaluation of the model’s viability in industrial deployment scenarios.
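The IoU thresholds behind these metrics are especially strict for tiny targets. The sketch below is a minimal illustration with made-up box coordinates, not an evaluation script. It computes the IoU between a prediction and a ground-truth box and lists which of the ten thresholds used by mAP@50:95 (0.50 to 0.95 in steps of 0.05) the prediction would satisfy.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds in percent: 50, 55, ..., 95 (ten values, as in mAP@50:95).
THRESHOLDS = list(range(50, 100, 5))


def thresholds_passed(pred, truth):
    """Thresholds (in percent) at which `pred` counts as a match for `truth`."""
    score = iou(pred, truth) * 100
    return [t for t in THRESHOLDS if score >= t]


truth = (0, 0, 8, 8)      # an 8 x 8 pixel target (illustrative)
shift1 = (1, 0, 9, 8)     # prediction offset by one pixel horizontally
shift2 = (2, 2, 10, 10)   # prediction offset by two pixels on both axes

print(thresholds_passed(shift1, truth))  # [50, 55, 60, 65, 70, 75]
print(thresholds_passed(shift2, truth))  # []
```

For an 8 × 8 target, a one-pixel offset already fails the thresholds above 0.75, and a two-pixel diagonal offset (IoU of about 0.39) fails even mAP@50. This is why the stricter metrics are informative for tiny defect localization.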

4.3. Ablation Study

To quantitatively assess the contribution of each core component to tiny defect detection, we carried out systematic ablation experiments on representative datasets within a unified detection framework. The ISP, THFE, and LRSC modules were gradually incorporated into the baseline model. By comparing detection outcomes across different combinations of these modules, we could better understand how each component helps—whether by improving the perception of tiny defects, enhancing contextual modeling, or boosting overall detection accuracy.

4.3.1. Ablation Study of ISP

In high-resolution or large-scale datasets like CCB and S-ODv2, tiny defects often occupy only a small fraction of the entire image. During feature extraction, repeated downsampling can easily erase these fine-grained details, weakening the ability of the model to spot such subtle defects. To counter this, we employ an image slicing strategy (denoted as ISP), which is derived from the SAHI paradigm and extended to both training and inference. ISP slices high-resolution images into smaller patches and recomposes them, thereby increasing the relative scale of tiny defects within each slice and reducing the feature dilution caused by global downsampling.

To ensure consistency of feature distribution between training and inference, we extend the SAHI-style slicing strategy to the training stage and apply the same slicing and recomposition process during inference. Figure 6 shows the visualization results on the CCB dataset before and after the introduction of ISP. For clearer visual inspection of tiny defects such as scratches, bubbles, and broken-skin bubbles, enlarged local views of the detected defect regions are additionally provided within the corresponding images, while images without predicted bounding boxes indicate missed detections. After incorporating this module, the model yielded more complete localization for tiny defects, with a noticeable reduction in both missed detections and false positives.

The quantitative results are summarized in Table 1. After integrating ISP, the model achieved substantial improvements in Recall and mean Average Precision (mAP) on both the CCB and S-ODv2 datasets. The gains in mAP@50 and mAP@50:95 are especially notable, demonstrating the effectiveness of the slicing strategy for high-resolution tiny defect detection. We attribute these gains mainly to enhanced local modeling and a reduced mismatch between the training and inference feature distributions. It is worth noting that, to avoid potential underfitting caused by insufficient training samples when the CCB dataset was not sliced, the early stopping mechanism was disabled in this experiment (patience = 0) to ensure sufficient convergence during training.

4.3.2. Ablation Study of THFE

From a methodological perspective, the proposed THFE differs from Transformer-based architectures like RT-DETR. While RT-DETR relies on stacked multi-layer encoders for global dependencies—increasing architectural depth and complexity—THFE introduces only a single lightweight Transformer layer at the high-level feature stage. This allows it to function as a local feature enhancement unit rather than a full framework redesign [55], making it ideal for balancing efficiency and accuracy. The quantitative results in Table 2 demonstrate that THFE consistently improves Recall and mAP across all three datasets, confirming its effectiveness in enhancing global semantic representations for better classification and localization.

From a computational perspective, THFE’s cost stems from self-attention, but because it is applied only to high-level maps (1/32 spatial resolution), the computational overhead remains manageable. Compared to multi-layer designs, this approach significantly shortens the computational path. As shown in Table 2, introducing THFE results in only marginal increases in Params and FLOPs; for instance, in the S-ODv2 dataset, the parameter count only increases by 2.09 M and FLOPs by 0.9 G, while the inference speed remains high at over 210 FPS. This validates that THFE enhances performance without introducing significant computational bottlenecks.
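The effect of applying self-attention only at 1/32 resolution can be checked with simple arithmetic. The sketch below assumes the $640 \times 640$ input stated in Section 4.2 and a hypothetical channel width of 256. It counts only the two token-by-token matrix products of a single attention layer, so the figures are rough indicators rather than the FLOPs reported in Table 2.

```python
def attention_cost(input_size, stride, dim):
    """Token count and rough multiply-accumulate count of one self-attention layer.

    Counts only the two N x N x dim products (QK^T and attention-times-V);
    projections, softmax, and multi-head bookkeeping are ignored.
    """
    side = input_size // stride
    tokens = side * side
    macs = 2 * tokens * tokens * dim
    return tokens, macs


dim = 256  # hypothetical channel width
for stride in (8, 16, 32):
    tokens, macs = attention_cost(640, stride, dim)
    print(stride, tokens, macs)
```

At stride 32 the feature map has 20 × 20 = 400 tokens, whereas at stride 8 it would have 6400. Since the attention products grow with the square of the token count, the stride-8 map would cost 256 times more under this estimate.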

4.3.3. Ablation Study of LRSC

The LRSC module adaptively adjusts the receptive field according to the scale variations of tiny defects, allowing the network to perceive contextual information at multiple scales in a dynamic manner. This mechanism is particularly important for detecting tiny defects with low contrast, blurred boundaries, and significant scale variations, as it improves the model’s adaptability in complex scenarios and increases its sensitivity to subtle defect cues.

Table 2 shows that, after the LRSC module was added, the model achieved consistent mAP improvements on all datasets. The gains were particularly clear on CCB, which represents a more challenging industrial scenario, and on Mini-COCO, a general tiny object benchmark. On these two datasets, the model showed better localization and classification performance, indicating that dynamic contextual modeling is effective for tiny defect detection.

A closer look at the ablation results in Table 2 also shows that adding either the THFE module or the LRSC module alone can bring stable improvements over the baseline. When both modules are jointly integrated, the model achieves further gains in Recall and mAP, and the overall performance significantly surpasses that of single-module configurations and the baseline model. This observation indicates that the high-level semantic enhancement provided by the THFE module and the dynamic contextual modeling capability introduced by the LRSC module exhibit strong complementarity in tiny defect detection under uniform-background conditions. Their synergistic interaction effectively overcomes the performance limitations of individual modules and plays a crucial role in improving overall detection accuracy.

4.4. Main Results

4.4.1. Baseline Model Selection

To comprehensively evaluate the performance and technical advantages of the proposed SLA-YOLO model in tiny object detection tasks, several representative state-of-the-art detectors from recent years were selected as baseline models for comparison. The selected methods cover a range of detection paradigms, including two-stage and one-stage detectors, anchor-based and anchor-free designs, as well as CNN-based and Transformer-based architectures. This helps ensure that the comparative evaluation is sufficiently comprehensive, representative, and reliable.

Among the single-stage detectors, YOLOv5, YOLOv8s, YOLOv9, YOLOv10, and YOLOv11 were selected as representative models from the YOLO family. This series has continued to evolve since YOLOv1, with steady optimizations in network structures, feature fusion, and detection heads. While YOLOv5–YOLOv8 focus on engineering optimization and lightweight design, YOLOv9–YOLOv11 further improve feature decoupling and multi-scale modeling to enhance representation capability and inference efficiency. These methods are widely applied in real-time industrial scenarios and reflect the performance upper bound of current one-stage frameworks. Additionally, YOLOv12 was included as a latest-generation model that strengthens multi-task joint modeling, though it still relies primarily on full-image feature modeling, which presents limitations in information preservation for extremely small scales. Furthermore, YOLOX, an anchor-free single-stage detector, was included. Unlike conventional YOLO variants, YOLOX adopts innovative strategies for label assignment and sample modeling, showing strong adaptability to small-object scenarios.

Within the anchor-free paradigm, FCOS and DETR were included as comparison methods. FCOS performs object regression through a fully convolutional architecture, representing a typical technical route for lightweight anchor-free detection. By contrast, DETR introduces a Transformer-based architecture, enabling end-to-end detection through global feature modeling and representing a forward-looking exploration of long-range dependency modeling.

To further examine performance differences across paradigms, two classical two-stage detectors, Mask R-CNN and Cascade R-CNN, were included. Mask R-CNN enhances feature representation via an instance segmentation branch, whereas Cascade R-CNN improves localization accuracy through multi-stage regression. Both methods offer stable performance and have long served as vital benchmarks.

To supplement the analysis of Transformer-based methods, RT-DETR and Deformable DETR were introduced. RT-DETR strengthens global modeling via a query-based end-to-end framework, while Deformable DETR improves multi-scale interaction efficiency through deformable attention mechanisms. However, such methods often involve higher computational costs, limiting their application in real-time industrial scenarios. Both methods use official implementations with a ResNet-50 backbone to ensure consistency and fairness.

All comparison methods were implemented using open-source versions and trained from scratch under a unified experimental environment and hyperparameter settings, without any pretrained weights, ensuring fairness and reproducibility. Additionally, a consistent training control strategy was applied, including the same early stopping setting (patience = 0, i.e., early stopping disabled) on the CCB dataset to ensure comparability. Considering the limited scale of industrial defect datasets, a repeated three-fold data partition strategy was adopted instead of strict cross-validation, and the mean performance over three runs was reported to reduce the influence of data partitioning and random initialization. In the ablation study, YOLOv8s served as the baseline to progressively integrate the ISP, THFE, and LRSC modules, systematically analyzing the mechanism of each design. Finally, SLA-YOLO was compared against 12 mainstream methods to verify its advancement and generalization capability in tiny object detection.

4.4.2. Comparative Results on Three Datasets

Results on CCB. Table 3 reports the quantitative comparison results of all competing methods on the CCB dataset. The results show that the proposed SLA-YOLO achieves 61.8% Recall and 68.5% mAP@50, surpassing all competing methods by a clear margin. This improvement effectively reduces both missed detections and false detections in tiny defect detection. In addition, under stricter evaluation metrics, including mAP@75 and mAP@50:95, SLA-YOLO consistently maintains a clear performance advantage and ranks first among all compared methods.

These results indicate that, in industrial glass surface inspection scenarios characterized by high resolution, low contrast, and weak texture, SLA-YOLO can better preserve the discriminative feature information of tiny defects, thereby substantially reducing both missed detections and false positives. Compared with various YOLO-series models, anchor-free detectors, and two-stage detection frameworks, SLA-YOLO shows more stable performance in multi-scale feature modeling and the preservation of local discriminative features. These findings further confirm the effectiveness and adaptability of the proposed image slicing enhancement mechanism, semantic feature interaction module, and dynamic contextual modeling strategy for tiny defect detection on industrial glass surfaces.

Results on S-ODv2. Table 4 reports the results of different methods on the S-ODv2 dataset. This dataset contains multiple categories of tiny objects within the same background, creating a more complex setting and placing higher demands on the generalization ability and environmental robustness of detection models. The results show that SLA-YOLO achieves the best or tied-best performance on four key metrics, namely Recall, mAP@50, mAP@75, and mAP@50:95, and overall outperforms existing mainstream detection methods. In particular, its stronger performance under stricter metrics such as mAP@75 and mAP@50:95 suggests that SLA-YOLO not only maintains stable recall for tiny objects, but also provides more reliable localization and classification. These results further indicate that, in complex scenarios involving multi-scale object coexistence and sparse target distribution, the proposed model can more effectively capture both the local structural information of tiny objects and the global contextual relationships among them, thereby supporting more accurate tiny object detection.

Results on Mini-COCO. Table 5 shows the experimental results of all compared methods on the Mini-COCO dataset. This dataset represents a typical general tiny object detection scenario and is mainly used to evaluate how well detection models can generalize across different domains and structures. The results indicate that SLA-YOLO outperforms the other methods across multiple key metrics, especially under stricter measures such as mAP@75 and mAP@50:95, where it maintains a consistent and noticeable performance edge. These results also suggest that SLA-YOLO is not tied to the specific data distribution of industrial settings. Even in more diverse and complex general tiny object detection tasks, the model shows strong adaptability and stable performance. It effectively captures discriminative features, models contextual information efficiently, and handles scale variations robustly, confirming its strong generalization capability. Overall, these findings further support the soundness of the proposed architecture and highlight the method’s broad applicability.

Across all three datasets, SLA-YOLO consistently delivers meaningful performance gains in a variety of tiny object detection scenarios, ranging from industrial settings such as CCB, to natural scenes in S-ODv2, to more general datasets such as Mini-COCO. The model not only improves standard detection accuracy but also keeps localization precise and classification reliable under stricter evaluation criteria with higher IoU thresholds. This suggests that the improvements do not stem from overfitting to a specific dataset or scene type, but rather from a feature representation strategy for tiny objects that is built into the model architecture.

The above experimental findings validate the following three key conclusions:

(1) The local high-resolution perception mechanism based on image slicing (the ISP strategy) effectively alleviates the information attenuation of tiny object features caused by repeated downsampling operations, thereby enhancing the model’s ability to capture fine-grained local features of tiny defects.
(2) The semantic-level feature enhancement mechanism (the THFE module) improves the representation capability of discriminative structural patterns in tiny defects, which consequently enhances the accuracy of both object category recognition and spatial localization.
(3) The dynamic context modeling mechanism (the LRSC module) significantly enhances the stability of detection performance in scenarios characterized by low contrast and uniform backgrounds, thereby improving the robustness of the model in complex environments.

In summary, while maintaining efficient inference speed and a lightweight architectural design, SLA-YOLO significantly improves the accuracy and robustness of tiny defect detection, demonstrating strong potential for practical industrial deployment as well as promising cross-scenario generalization capability.

5. Conclusions

This study addresses the challenging visual perception task of detecting tiny surface defects on industrial products under uniform background conditions, and proposes an efficient detection framework, SLA-YOLO, which achieves both high detection accuracy and real-time inference capability. The proposed framework tackles several key challenges in tiny object detection through a three-level architectural design. To prevent discriminative features of tiny defects from being lost during repeated downsampling of high-resolution images, we incorporate an image slicing strategy (denoted as ISP), derived from the SAHI paradigm and extended to both training and inference. This strategy establishes a consistent “slice–reassemble–fuse” process across training and inference, improving the perception of local tiny defects by enhancing their relative scale and reducing feature dilution during downsampling. To handle the strong contextual dependency of tiny defects and the scarcity of local discriminative cues, we introduce the LRSC module at the front of the detection head. With its dynamic receptive-field mechanism, the model can adaptively capture multi-scale contextual information, significantly improving feature extraction for tiny defects that are low contrast and have blurred boundaries. To address redundancy and inconsistency in high-level semantic features, the THFE module is introduced between the backbone and the neck. By performing attention-based global dependency modeling, the module improves the consistency and category separability of high-level semantic representations.

Extensive experiments across multiple datasets confirm the effectiveness of the proposed framework. On the industrial tiny defect dataset, CCB, SLA-YOLO consistently outperforms nine representative detection methods, including recent approaches such as YOLOv11 and DETR, reaching a Recall of 61.8% and an mAP@50 of 68.5%.

On the natural-scene tiny object dataset (S-ODv2) and the general-purpose Mini-COCO benchmark, SLA-YOLO continues to show strong generalization and structural robustness. These findings suggest that the performance improvements are not merely a result of overfitting to a particular dataset or scenario. Rather, they arise from the design choices in feature representation, contextual modeling, and scale adaptability, which together allow the model to reliably detect tiny objects across different domains and scenarios.

From a practical deployment perspective, SLA-YOLO balances a lightweight design, efficient inference, and high detection accuracy while maintaining strong real-time performance. This makes the framework well-suited for the demanding conditions of industrial production, where high precision, low latency, and stable long-term operation are essential. In this way, the proposed approach offers a practical solution for detecting tiny defects in high-resolution images with uniform backgrounds. In addition, given its modular design and feature modeling approach, the work provides scalable architectural insights and useful technical guidance for future research on tiny object detection in more complex industrial vision scenarios.

Author Contributions

Conceptualization, Y.L. and Z.S.; methodology, Y.L. and Z.S.; software, X.W., C.J. and Z.S.; validation, X.W., C.J. and Y.W.; formal analysis, Y.L. and Z.S.; investigation, Y.L., X.W., C.J. and Z.S.; resources, Y.L. and Z.S.; data curation, Y.L. and X.W.; writing—original draft preparation, X.W., C.J. and Y.W.; writing—review and editing, Y.L., X.W. and Z.S.; visualization, X.W., Z.S. and Y.W.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported in part by the National Natural Science Foundation of China under Grant No. 72371067, the National Science Foundation of Hebei Province under Grant No. F2025501043, and the Science Research Project of Hebei Education Department under Grant No. QN2024167.

Data Availability Statement

The CCB dataset constructed in this study is publicly available to facilitate reproducibility and further research. The complete dataset, including images and annotations, can be accessed at: https://github.com/shaoqilyx/CCB_GlassDefectDataset (accessed on 12 January 2026). The code developed in this study, including the implementation and training scripts, is also publicly available at: https://github.com/20191844308/SLA-YOLO (accessed on 12 January 2026). The S-ODv2 and Mini-COCO datasets are publicly available at: https://github.com/ChenyuJin1-cloud/AF-YOLO.git (accessed on 12 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cheng, Y.; Cao, Y.; Yao, H.; Luo, W.; Jiang, C.; Zhang, H.; Shen, W. A comprehensive survey for real-world industrial surface defect detection: Challenges, approaches, and prospects. J. Manuf. Syst. 2026, 84, 152–172.
2. Sun, P.; Hua, C.; Ding, W.; Hua, C.; Liu, P.; Lei, Z. Ceramic tableware surface defect detection based on deep learning. Eng. Appl. Artif. Intell. 2025, 141, 109723.
3. Nautiyal, R.; Deshmukh, M. Tiny object detection: An in-depth survey of techniques, challenges, and future directions. Digit. Signal Process. 2026, 174, 105995.
4. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 3791–3798.
5. Deng, R. A Review of the Applications of Machine Vision in Industrial Surface Defect Detection. J. Artif. Intell. Pract. 2025, 8, 144–151.
6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
7. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
11. Zhu, G.; Qi, H.; Lv, K. DGYOLOv8: An enhanced model for steel surface defect detection based on YOLOv8. Mathematics 2025, 13, 831.
12. Liu, Y.; Fan, G.; Zhang, H.; Xiao, D. Defect detection algorithm of galvanized sheet based on S-C-B-YOLO. Mathematics 2026, 14, 110.
13. Buslaev, A.; Parinov, A.; Khvedchenya, E.; Iglovikov, V.I.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentation. Information 2020, 11, 125.
14. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale Object Detection in Remote Sensing Imagery with Convolutional Neural Networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22.
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NeurIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114.
16. Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; Fergus, R. Regularization of neural networks using dropconnect. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1058–1066.
17. Su, X.; Chang, L.; Shen, J.; Cheng, Y. Data Augmentation Techniques for Deep Learning-Based Object Detection: A Comprehensive Survey. J. Vis. Commun. Image Represent. 2023, 90, 103724.
18. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
19. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 6023–6032.
20. Kisantal, M. Augmentation for Small Object Detection. arXiv 2019, arXiv:1902.07296.
21. Chen, C.; Zhang, Y.; Lv, Q.; Wei, S.; Wang, X.; Sun, X.; Dong, J. RRNet: A hybrid detector for object detection in drone-captured images. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 100–108.
22. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
23. Chen, Y.; Zhang, P.; Li, Z.L.Y.; Zhang, X.; Meng, G. Feedback-driven data provider for object detection. arXiv 2020, arXiv:2004.12432.
24. Suárez-Ramírez, J.; Santana-Cedrés, D.; Monzón, N. DAHI: A Fast and Efficient Density Aided Hyper Inference Technique for Large Scene Object Detection. Pattern Recognit. 2025, 171, 112228.
25. De Ridder, V.; Dey, B.; Blanco, V.; Halder, S.; Van Waeyenberge, B. Improved Defect Detection and Classification Method for Advanced IC Nodes by Using Slicing Aided Hyper Inference with Refinement Strategy. arXiv 2023, arXiv:2311.11439.
26. Chen, X.; Gupta, A. Spatial memory for context reasoning in object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4086–4096.
27. Chen, C.; Liu, M.Y.; Tuzel, O.; Xiao, J. R-CNN for small object detection. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016, Revised Selected Papers, Part V; Springer International Publishing: Cham, Switzerland, 2017; pp. 214–230.
28. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part IV; Springer International Publishing: Cham, Switzerland, 2016; pp. 354–370.
29. Zhu, Y.; Zhao, C.; Wang, J.; Zhao, X.; Wu, Y.; Lu, H. Couplenet: Coupling global structure with local parts for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4126–4134.
30. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
31. Wang, A.; Peng, T.; Cao, H.; Xu, Y.; Wei, X.; Cui, B. TIA-YOLOv5: An improved YOLOv5 network for real-time detection of crop and weed in the field. Front. Plant Sci. 2022, 13, 1091655.
32. Nazli, N.A.N.M.; Sabri, N.; Aminuddin, R.; Ibrahim, S.; Yusof, S.; Nasir, S.D.N.M. A real-time system for detecting personal protective equipment compliance using deep learning model YOLOv5. Procedia Comput. Sci. 2024, 245, 647–656.
33. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 2260–2270.
34. Lai, H.; Chen, L.; Liu, W.; Yan, Z.; Ye, S. STCYOLO: Small object detection network for traffic signs in complex environments. Sensors 2023, 23, 5307.
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Computer Vision–ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; pp. 3–19.
36. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
37. Wang, H.; Yang, H.; Chen, H.; Wang, J.; Zhou, X.; Xu, Y. A remote sensing image target detection algorithm based on improved YOLOv8. Appl. Sci. 2024, 14, 1557.
38. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-based object detection method for remote sensing images. IEEE Access 2023, 11, 125122–125137.
39. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
40. Lv, Y.; Dong, G.; Song, X. MicroDETR: DETR with frequency-spatial aware and cross-scale fusion for tiny object detection. Pattern Recognit. 2026, 172, 113747.
41. Zhou, W.; Tang, B.; Cong, R.; Jiang, Q. Turbidity-Similarity Decoupling: Feature-Consistent Mutual Learning for Underwater Salient Object Detection. IEEE Trans. Image Process. 2026, 35, 495–510.
42. Zhou, W.; Zhu, Y.; Lei, J.; Yang, R.; Yu, L. LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images. IEEE Trans. Image Process. 2023, 32, 1329–1340.
43. Zhou, W.; Guo, Q.; Lei, J.; Yu, L.; Hwang, J.N. ECFFNet: Effective and Consistent Feature Fusion Network for RGB-T Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1224–1235.
44. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 966–970.
45. Aldubaikhi, A.; Patel, S. Advancements in Small-Object Detection (2023–2025): Approaches, Datasets, Benchmarks, Applications, and Practical Guidance. Appl. Sci. 2025, 15, 11882.
46. Hua, W.; Chen, Q. A survey of small object detection based on deep learning in aerial images. Artif. Intell. Rev. 2025, 58, 162.
47. Gao, Y.; Gao, Q.; Shao, L.; Wang, X.; Liu, L. HFI-Former: High-Frequency Interaction Transformer for Robust Scene Text Detection. Information 2026, 17, 365.
48. Li, Y.; Shen, L. A Frequency Domain-Enhanced Transformer for Nighttime Object Detection. Sensors 2025, 25, 3673.
49. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41.
50. Peng, X.; Jiang, H. A Review of Small Object Detection Based on Deep Learning. In Proceedings of the 2025 2nd International Conference on Big Data Analytics and Artificial Intelligence Application (BDAIA ’25), New York, NY, USA, 28–30 November 2025; pp. 89–96.
51. Jamali, M.; Davidsson, P.; Khoshkangini, R.; Ljungqvist, M.G.; Mihailescu, R.C. Context in object detection: A systematic literature review. Artif. Intell. Rev. 2025, 58, 175.
52. Wang, Z.; Chen, Y.; Gu, Y.; Liu, J.; Zhu, X.; He, M. The evolution of object detection from CNNs to transformers and multi-modal fusion. Sci. Rep. 2026, 16, 7517.
53. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 10–17 October 2023; pp. 16794–16805.
54. Lema, D.G.; Sánchez-González, L.; Usamentiaga, R.; delaCalle, F.J. Benchmarking Deep Learning Models for Surface Defect Detection: A Reproducible and Statistically-Rigorous Approach. J. Intell. Manuf. 2025; in press.
55. Lyu, Y.; Liu, Y.; Zhao, Q.; Hao, Z.; Song, X. SFSIN: A Lightweight Model for Remote Sensing Image Super-Resolution with Strip-like Feature Superpixel Interaction Network. Mathematics 2025, 13, 1720.
56. Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Singapore, 5–7 July 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
57. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 1030–1040.
58. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
59. Ultralytics. YOLO11 Documentation. 2024. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 8 May 2026).
60. Ultralytics. YOLO12 Documentation. 2025. Available online: https://docs.ultralytics.com/ (accessed on 8 May 2026).
61. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
62. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 19–25 June 2024; pp. 16965–16974.
63. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
64. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229.
65. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
66. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
67. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.

Figure 1. Example defect samples from the CCB dataset.

Figure 2. Overall architecture of the proposed SLA-YOLO framework.

Figure 3. Workflow of the ISP module.

Figure 4. Architecture of the proposed THFE module.

Figure 5. Structural design of the LRSC module.

Figure 6. Visual comparison of defect detection results on the CCB dataset.
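
The slicing idea behind the ISP module (Figure 3) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `slice_windows` and `to_full_image`, the 640-pixel tile size, and the 20% overlap are assumptions chosen only for illustration. The sketch covers an image with overlapping windows, so that a tiny defect occupies a larger share of each window, and shows how a box predicted inside a window would be shifted back to full-image coordinates.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def slice_windows(width: int, height: int,
                  tile: int = 640, overlap: float = 0.2) -> List[Window]:
    """Cover a width x height image with overlapping tile x tile windows.

    The tile size and overlap ratio are illustrative assumptions.
    """
    if tile <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("tile must be positive and overlap in [0, 1)")
    stride = max(1, int(tile * (1.0 - overlap)))

    def starts(extent: int) -> List[int]:
        # An image no larger than one tile yields a single window.
        if extent <= tile:
            return [0]
        positions = list(range(0, extent - tile, stride))
        positions.append(extent - tile)  # last window sits flush with the border
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def to_full_image(box: Window, window: Window) -> Window:
    """Shift a box predicted inside a window back to full-image coordinates."""
    bx0, by0, bx1, by1 = box
    wx0, wy0, _, _ = window
    return (bx0 + wx0, by0 + wy0, bx1 + wx0, by1 + wy0)


windows = slice_windows(2000, 1500)
print(len(windows), windows[0], windows[-1])
```

In a full pipeline, the boxes collected from all windows would still need to be merged, for example with non-maximum suppression, because the overlap regions can produce duplicate detections.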

Table 1. Ablation study of the ISP module on the CCB and S-ODv2 datasets.

Dataset	Method	R	mAP@50	mAP@75	mAP@50:95
CCB	YOLOv8s	0.358	0.330	0.104	0.157
CCB	YOLOv8s+ISP	0.591	0.657	0.371	0.388
S-ODv2	YOLOv8s	0.611	0.683	0.434	0.419
S-ODv2	YOLOv8s+ISP	0.737	0.796	0.513	0.501
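
As a quick reading aid for Table 1, the short script below computes the absolute gain that ISP brings to each metric. The numbers are copied from the table; the names `TABLE1`, `METRICS`, and `isp_gain` are introduced here only for illustration. On CCB the gain in mAP@50:95 works out to 0.231, against 0.082 on S-ODv2.

```python
METRICS = ("R", "mAP@50", "mAP@75", "mAP@50:95")

# Values copied from Table 1.
TABLE1 = {
    "CCB": {
        "YOLOv8s": (0.358, 0.330, 0.104, 0.157),
        "YOLOv8s+ISP": (0.591, 0.657, 0.371, 0.388),
    },
    "S-ODv2": {
        "YOLOv8s": (0.611, 0.683, 0.434, 0.419),
        "YOLOv8s+ISP": (0.737, 0.796, 0.513, 0.501),
    },
}


def isp_gain(dataset: str) -> dict:
    """Absolute difference (with ISP minus without) for every metric."""
    base = TABLE1[dataset]["YOLOv8s"]
    sliced = TABLE1[dataset]["YOLOv8s+ISP"]
    return {m: round(s - b, 3) for m, b, s in zip(METRICS, base, sliced)}


for name in TABLE1:
    print(name, isp_gain(name))
```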

Table 2. Ablation results of the THFE and LRSC modules.

Dataset	THFE	LRSC	R	mAP@50	mAP@75	mAP@50:95	Params (M)	FLOPs (G)	FPS
S-ODv2			0.737	0.796	0.513	0.501	11.13	28.4	263.16
	✓		0.745	0.800	0.526	0.510	13.22	29.3	217.39
		✓	0.746	0.801	0.524	0.509	11.71	29.6	285.71
	✓	✓	0.747	0.804	0.541	0.516	15.13	30.4	222.22
CH-Glass			0.614	0.652	0.375	0.371	11.13	28.4	98.04
	✓		0.626	0.670	0.390	0.396	13.22	29.3	77.52
		✓	0.614	0.676	0.394	0.390	11.71	29.6	142.86
	✓	✓	0.618	0.685	0.407	0.398	15.13	30.4	149.25
Mini-COCO			0.524	0.565	0.418	0.377	11.13	28.4	454.55
	✓		0.533	0.572	0.419	0.380	13.22	29.3	333.33
		✓	0.516	0.568	0.419	0.379	11.71	29.6	434.78
	✓	✓	0.532	0.576	0.424	0.383	15.13	30.4	416.67

Note: Bold values indicate the best performance for each metric within the corresponding dataset.
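
The cost columns of Table 2 can be summarized as overhead relative to the baseline. The sketch below uses only the Params and FLOPs values listed in the table, which are identical across the three datasets; the names `COST` and `overhead` are illustrative. With both modules enabled, the parameter count rises by 4.00 M (about 35.9%), while FLOPs rise by 2.0 G (about 7.0%).

```python
# (Params in M, FLOPs in G) copied from Table 2.
COST = {
    "baseline": (11.13, 28.4),
    "+THFE": (13.22, 29.3),
    "+LRSC": (11.71, 29.6),
    "+THFE+LRSC": (15.13, 30.4),
}


def overhead(variant: str) -> dict:
    """Absolute and relative cost increase of a variant over the baseline."""
    base_p, base_f = COST["baseline"]
    p, f = COST[variant]
    return {
        "params_M": round(p - base_p, 2),
        "params_pct": round(100 * (p - base_p) / base_p, 1),
        "flops_G": round(f - base_f, 1),
        "flops_pct": round(100 * (f - base_f) / base_f, 1),
    }


for variant in ("+THFE", "+LRSC", "+THFE+LRSC"):
    print(variant, overhead(variant))
```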

Table 3. Comparative results of SLA-YOLO on the CCB dataset.

Method	R	mAP@50	mAP@75	mAP@50:95	Params (M)	FLOPs (G)	FPS
YOLOv5 [22]	0.574	0.650	0.376	0.370	7.2	16.5	170.12
YOLOv8s [56]	0.614	0.652	0.375	0.371	8.7	28.6	98.04
YOLOv9 [57]	0.590	0.627	0.361	0.357	8.6	27.9	100.45
YOLOv10 [58]	0.616	0.669	0.355	0.366	8.9	29.4	95.30
YOLOv11 [59]	0.480	0.503	0.234	0.269	9.5	31.2	90.15
YOLOv12 [60]	0.615	0.653	0.386	0.382	9.0	30.0	94.20
YOLOX [61]	0.331	0.555	0.179	0.245	9.0	26.8	104.50
RT-DETR [62]	0.392	0.417	0.240	0.237	42.0	80.0	35.12
Deformable DETR [63]	0.475	0.505	0.291	0.287	40.0	78.0	22.40
DETR [64]	0.460	0.595	0.253	0.294	41.0	86.0	18.50
FCOS [65]	0.314	0.508	0.162	0.235	32.0	54.0	55.60
Mask R-CNN [66]	0.362	0.542	0.366	0.318	44.0	178.0	12.40
Cascade R-CNN [67]	0.392	0.536	0.364	0.335	68.0	240.0	8.25
SLA-YOLO (ours)	0.618	0.685	0.407	0.398	15.13	30.4	149.25

Note: Bold values indicate the best performance for each metric within the corresponding dataset.
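
The abstract reports gains of 3.3% in mAP@50 and 2.7% in mAP@50:95 on the CCB dataset. Assuming YOLOv8s is the reference model for that claim, the script below recomputes the differences from Table 3 and shows that they are absolute gains in percentage points, not relative increases. The names `TABLE3`, `points_over`, and `strongest_competitor` are illustrative.

```python
# (mAP@50, mAP@50:95) copied from Table 3 (CCB dataset).
TABLE3 = {
    "YOLOv5": (0.650, 0.370),
    "YOLOv8s": (0.652, 0.371),
    "YOLOv9": (0.627, 0.357),
    "YOLOv10": (0.669, 0.366),
    "YOLOv11": (0.503, 0.269),
    "YOLOv12": (0.653, 0.382),
    "YOLOX": (0.555, 0.245),
    "RT-DETR": (0.417, 0.237),
    "Deformable DETR": (0.505, 0.287),
    "DETR": (0.595, 0.294),
    "FCOS": (0.508, 0.235),
    "Mask R-CNN": (0.542, 0.318),
    "Cascade R-CNN": (0.536, 0.335),
    "SLA-YOLO": (0.685, 0.398),
}


def points_over(reference: str, ours: str = "SLA-YOLO") -> tuple:
    """Absolute gain in percentage points for (mAP@50, mAP@50:95)."""
    return tuple(round(100 * (o - r), 1)
                 for o, r in zip(TABLE3[ours], TABLE3[reference]))


def strongest_competitor(metric_index: int, ours: str = "SLA-YOLO") -> str:
    """Best-scoring method other than ours for the given metric column."""
    others = {k: v for k, v in TABLE3.items() if k != ours}
    return max(others, key=lambda k: others[k][metric_index])


print("vs YOLOv8s:", points_over("YOLOv8s"))
for idx, metric in enumerate(("mAP@50", "mAP@50:95")):
    rival = strongest_competitor(idx)
    print(metric, "strongest competitor:", rival, points_over(rival)[idx])
```

The same helper also gives the margin over the strongest competing method per metric, which is a stricter comparison than the margin over the baseline alone.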

Table 4. Comparative results of SLA-YOLO on the S-ODv2 dataset.

Method	R	mAP@50	mAP@75	mAP@50:95	Params (M)	FLOPs (G)	FPS
YOLOv5 [22]	0.629	0.685	0.400	0.400	9.1	23.9	284.15
YOLOv8s [56]	0.737	0.796	0.513	0.501	11.1	28.7	263.16
YOLOv9 [57]	0.708	0.765	0.487	0.469	5.16	13.41	342.10
YOLOv10 [58]	0.748	0.795	0.498	0.491	8.0	24.5	308.20
YOLOv11 [59]	0.592	0.648	0.394	0.389	9.4	21.3	325.45
YOLOv12 [60]	0.738	0.798	0.514	0.501	6.72	15.47	331.18
YOLOX [61]	0.482	0.796	0.379	0.419	5.0	17.1	255.40
RT-DETR [62]	0.472	0.509	0.265	0.280	42.8	130.5	78.50
Deformable DETR [63]	0.567	0.616	0.349	0.359	39.8	173.0	45.12
DETR [64]	0.522	0.717	0.379	0.382	41.6	85.8	32.40
Mask R-CNN [66]	0.444	0.600	0.420	0.390	44.0	178.0	21.50
Cascade R-CNN [67]	0.464	0.641	0.429	0.406	69.3	243.0	14.20
SLA-YOLO (ours)	0.747	0.804	0.541	0.516	15.13	30.4	222.22

Note: Bold values indicate the best performance for each metric within the corresponding dataset.

Table 5. Comparative results of SLA-YOLO on the Mini-COCO dataset.

Method	R	mAP@50	mAP@75	mAP@50:95	Params (M)	FLOPs (G)	FPS
YOLOv5 [22]	0.528	0.561	0.378	0.351	9.1	23.9	510.20
YOLOv8s [56]	0.524	0.565	0.418	0.377	11.1	28.7	454.55
YOLOv9 [57]	0.504	0.543	0.402	0.362	5.16	13.41	590.35
YOLOv10 [58]	0.509	0.546	0.395	0.361	8.0	24.5	530.15
YOLOv11 [59]	0.519	0.550	0.404	0.367	9.4	21.3	545.60
YOLOv12 [60]	0.525	0.566	0.419	0.378	6.72	15.47	575.40
YOLOX [61]	0.510	0.560	0.389	0.358	5.0	17.1	550.25
RT-DETR [62]	0.335	0.361	0.267	0.241	42.8	130.5	150.35
Deformable DETR [63]	0.406	0.437	0.324	0.292	39.8	173.0	110.45
DETR [64]	0.403	0.459	0.170	0.214	41.6	85.8	100.20
Mask R-CNN [66]	0.485	0.563	0.384	0.352	44.0	178.0	50.15
Cascade R-CNN [67]	0.459	0.559	0.374	0.341	69.3	243.0	35.50
SLA-YOLO (ours)	0.532	0.576	0.424	0.383	15.13	30.4	416.67

Note: Bold values indicate the best performance for each metric within the corresponding dataset.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

