3.1. Label-Guided Optimal Tiling Method
The original dataset consists of high-resolution images (ranging from 3648 × 5472 to 6000 × 8000), posing significant challenges for defect detection. High resolution introduces excessive irrelevant pixels, while complex backgrounds generate numerous distracting features that hinder model learning—particularly for small targets. Additionally, resizing these images during preprocessing may lead to substantial feature loss.
To address these challenges, we designed the label-guided optimal tiling (LGOT) method. Based on the original data annotations, this algorithm employs an optimal strategy called Corner Optimal Cover (COC) to cut the original image into multiple sub-images. Specifically, the algorithm aligns the four corners of each sub-image with the four corners of every annotated bounding box in the original image to find the optimal cropping solution. This strategy has four practical effects: (1) it increases the relative scale of small targets within each tile; (2) it reduces resizing distortion because each tile is closer to the model input size (see
Figure 2 for a comparison); (3) it reduces redundant background content in the training set; and (4) unlike random cropping or exhaustive sliding-window tiling, it generates tiles based on annotations, which avoids redundant crops.
The complete comparison between the COC strategy and sliding-window tiling is shown in
Figure 3. Compared with sequential sliding-window tiling, COC uses annotation-guided corner alignment to generate candidate tiles and selects those with the highest coverage score, defined as the number of labeled boxes covered by a tile. This reduces redundant tiling when targets are sparse.
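The corner alignment and coverage scoring described above can be sketched as follows. This is an illustrative reading of COC, not the paper's implementation; the function names and the fixed tile size are our assumptions.

```python
def generate_candidates(box, tile_w, tile_h):
    """Candidate tiles whose corners align with the four corners of an
    annotated box (x1, y1, x2, y2), per the COC idea."""
    x1, y1, x2, y2 = box
    return [
        (x1, y1, x1 + tile_w, y1 + tile_h),  # tile top-left on box top-left
        (x2 - tile_w, y1, x2, y1 + tile_h),  # aligned at box top-right
        (x1, y2 - tile_h, x1 + tile_w, y2),  # aligned at box bottom-left
        (x2 - tile_w, y2 - tile_h, x2, y2),  # aligned at box bottom-right
    ]

def covers(tile, box):
    """True if the tile fully contains the labeled box."""
    tx1, ty1, tx2, ty2 = tile
    x1, y1, x2, y2 = box
    return tx1 <= x1 and ty1 <= y1 and tx2 >= x2 and ty2 >= y2

def coverage_score(tile, boxes):
    """Coverage score: number of labeled boxes fully covered by the tile."""
    return sum(covers(tile, b) for b in boxes)

def best_tile(boxes, tile_w, tile_h):
    """Pick the corner-aligned candidate with the highest coverage score."""
    candidates = [t for b in boxes for t in generate_candidates(b, tile_w, tile_h)]
    return max(candidates, key=lambda t: coverage_score(t, boxes))
```

In contrast, a sliding window would emit tiles at fixed strides regardless of where the labels sit, which is what produces the redundant crops mentioned above.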
To further illustrate the first benefit, we randomly selected an example showing the pixel proportion before and after cropping (
Figure 4), where
Figure 4a shows the original image and
Figure 4b–d show cropped sub-images. As observed, the relative area of insulators and defects increases significantly after cropping. This occurs because, while the bounding boxes remain unchanged, the overall image size is reduced, thereby increasing the relative prominence of the targets.
To further inspect the effect on feature attention, we apply Grad-CAM to two representative YOLO baselines; one example is shown in
Figure 5.
Concerning dataset expansion, our method generates multiple sub-images from each original image. The size of each sub-image is determined by Equation (
1), where D denotes the base dimension, typically set as a multiple of 32 for compatibility with the network's downsampling layers; D can also be adjusted to specific task requirements. The scaling factor n can take any positive integer value as long as n × D remains smaller than the shorter edge of the original image.
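The constraint above (tile side n × D, with D a multiple of 32 and n × D below the shorter image edge) can be enumerated directly; the default D = 640 below is our illustrative choice, not a value from the paper.

```python
def valid_tile_sizes(short_edge, D=640):
    """All sub-image sizes n * D (n = 1, 2, ...) that stay below the
    image's shorter edge. D is assumed to be a multiple of 32 so the
    tiles remain compatible with the network's downsampling layers."""
    assert D % 32 == 0, "D should be a multiple of 32"
    sizes = []
    n = 1
    while n * D < short_edge:
        sizes.append(n * D)
        n += 1
    return sizes
```

For a 3648 × 5472 image, for example, `valid_tile_sizes(3648)` yields the admissible square tile sizes up to the shorter edge.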
3.2. Semi-Decoupled Prior-Driven Detect Head
We argue that, for the specific task of detecting small defects in high-resolution images of power transmission lines, the detection head is the critical component that determines both the performance ceiling and the efficiency bottleneck of the model. Whether during training or inference, all multi-scale features extracted by the backbone and neck ultimately enter the detection head. For small targets, even after feature extraction, the signals reaching the detection head remain extremely weak and have a low signal-to-noise ratio. The bottleneck may therefore lie not in whether features can be extracted, but in how to make accurate decisions from weak features.
As we mentioned in
Section 2, numerous studies have conducted in-depth explorations of advanced feature enhancement, extraction, and separation. However, while these works focus on enhancing the backbone and neck, they mostly rely on generic detection head designs. Lacking task-specific design, such heads may not perform optimally on specific tasks. Taking the original detection head of YOLO11 as an example: to ensure flexibility and universality, the YOLO11 Head introduces a fully decoupled structure and adopts an anchor-free design along with Distribution Focal Loss (DFL). While this complexity raises its ceiling in general scenarios, it also introduces significant parameter and computational overhead in tasks with relatively fixed target types (e.g., insulator detection and other industrial applications), making the model harder to train and slower to infer. Our experimental results (see
Section 4) also confirm this: for domain-specific tasks, a simpler, more prior-informed design may be more effective.
We revisited the modern design of the YOLO11 detection head, retained its parallel branching approach, and significantly lightened it. Considering that the length-to-width ratio of insulators and their defects is relatively fixed, we introduced an evolutionary algorithm to generate anchor priors, making the learning process more stable and reliable. The overall architecture of the improved SDPD-YOLO model is illustrated in
Figure 6.
To be more specific, as shown in
Figure 7, unlike the YOLO11 Detect Head, the SDPD-Head resolves the inherent conflict between classification and regression by designing parallel, dedicated feature-extraction branches for the two tasks. Each branch consists of a 3 × 3 Depthwise Convolution (DWConv). Without significantly increasing the number of parameters, DWConv extracts spatial features focused on contours and edges for the regression task, and semantic features focused on texture and material for the classification task. After the two branches extract their respective specialized features, we do not use two independent prediction layers as in a fully decoupled head. Instead, we concatenate the two feature maps along the channel dimension, and this fused, more comprehensive feature map is fed into a unified 1 × 1 prediction convolution to generate the final prediction containing all information. This process is shown in
Figure 8. This structure maintains simplicity, enabling efficient information exchange between the features of the two tasks before the final decision.
Using two DWConv branches and a shared
prediction layer reduces the parameter count compared with the fully decoupled YOLO11 Head, as summarized in Equations (
2) and (
3).
Here, N is the number of classes, R is the maximum regression bin in DFL, K is the kernel size, and C_reg and C_cls are the intermediate channel sizes of the regression and classification branches, respectively. Therefore, the SDPD head uses fewer parameters than the original YOLO11 Head under the same channel settings.
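Since Equations (2) and (3) give the exact counts, the sketch below is only rough bookkeeping of why the semi-decoupled layout saves parameters: depthwise 3 × 3 branches cost K²·C each instead of K²·C² for a full convolution, and a single shared 1 × 1 prediction layer replaces two. The layer shapes here are our assumptions, not the paper's equations.

```python
def dwconv_params(c, k=3):
    # depthwise conv: k*k weights per channel (biases ignored for brevity)
    return c * k * k

def conv_params(c_in, c_out, k=1):
    # standard conv: full channel mixing
    return c_in * c_out * k * k

def sdpd_head_params(c_in, n_cls, reg_bins):
    """Two parallel 3x3 DWConv branches + one shared 1x1 prediction layer
    over the concatenated features (our reading of the SDPD-Head)."""
    branches = 2 * dwconv_params(c_in)                       # reg + cls DWConv
    shared = conv_params(2 * c_in, n_cls + 4 * reg_bins, 1)  # unified prediction
    return branches + shared

def decoupled_head_params(c_in, n_cls, reg_bins):
    """A fully decoupled layout: independent 3x3 stacks and separate
    1x1 prediction layers per task (illustrative, not YOLO11's exact head)."""
    reg = conv_params(c_in, c_in, 3) + conv_params(c_in, 4 * reg_bins, 1)
    cls = conv_params(c_in, c_in, 3) + conv_params(c_in, n_cls, 1)
    return reg + cls
```

Even at modest channel widths the gap is large, because the decoupled head pays the quadratic K²·C² cost twice.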
On the other hand, modern detection heads (e.g., YOLOv8 Head [
19]) typically employ anchor-free designs for versatile size adaptation. However, in industrial scenarios like insulator defect detection, targets exhibit consistent aspect ratios. Leveraging anchor priors here accelerates model learning of target features. To achieve this, we propose an evolutionary algorithm (shown in
Figure 9) to generate optimized anchor priors for initializing training.
First, the method randomly generates an initial anchor group within the width and height ranges of all target bounding boxes in the dataset. The initial group (generation g = 0) is defined as G_0 = {A_1, A_2, ..., A_P}, in which every candidate A_i = {a_1, a_2, ..., a_K} contains a series of anchors, each represented by its width and height, a_j = (w_j, h_j).
Subsequently, during the evolutionary generation phase, the algorithm employs a fitness function f(A_i) to evaluate the quality of each candidate anchor set A_i. This function calculates the average, over all N ground-truth bounding boxes b_n in the dataset, of the maximum Intersection over Union (IoU) between b_n and the candidate anchors a_j:

f(A_i) = (1/N) · Σ_{n=1}^{N} max_j IoU(b_n, a_j)
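Under the usual convention that anchors and ground-truth boxes are compared by width and height only (centres aligned), this fitness can be sketched in Python as follows; the function names are ours.

```python
def iou_wh(box_wh, anchor_wh):
    """IoU of two boxes sharing the same centre, given (width, height) only."""
    w1, h1 = box_wh
    w2, h2 = anchor_wh
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def fitness(anchors, gt_whs):
    """Average, over all ground-truth boxes, of the best IoU with any anchor."""
    return sum(max(iou_wh(gt, a) for a in anchors) for gt in gt_whs) / len(gt_whs)
```

A perfect anchor set (one anchor per distinct ground-truth shape) scores 1.0; poorly matched aspect ratios pull the score down.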
In each evolution iteration, the method selects M candidates according to a fitness-proportional probability distribution:

P(A_i) = f(A_i) / Σ_m f(A_m)
It then executes the Crossing Operation on the M selected candidates, pairing them as follows:

a_child = λ · a_p + (1 − λ) · a_q

in which a is an anchor of a candidate A, and λ is a random number uniformly distributed in the interval [0, 1]. The Mutation Operation is then performed on each anchor a of the offspring obtained from the Crossing Operation:

a' = a · (1 + ε)
Here, ε is a random number drawn from a normal distribution with mean 0 and standard deviation σ (default σ = 0.1). Finally, these steps are repeated until the maximum number of iterations is reached and the fitness converges, or the recall threshold is met.
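One generation of the loop described above can be sketched as follows. This follows our reading of the selection, crossing, and mutation steps; the fitness measure and all names are illustrative.

```python
import random

def anchor_fitness(anchors, gt_whs):
    """Mean, over ground-truth (w, h) pairs, of the best same-centre IoU
    with any anchor (a common anchor-fitness measure)."""
    def iou(a, b):
        inter = min(a[0], b[0]) * min(a[1], b[1])
        return inter / (a[0] * a[1] + b[0] * b[1] - inter)
    return sum(max(iou(gt, a) for a in anchors) for gt in gt_whs) / len(gt_whs)

def evolve_step(population, gt_whs, sigma=0.1):
    """One generation: fitness-proportional selection, blend crossover
    of paired parents, and Gaussian mutation of anchor widths/heights."""
    scores = [anchor_fitness(cand, gt_whs) for cand in population]
    total = sum(scores)
    parents = random.choices(population, weights=[s / total for s in scores],
                             k=len(population))
    offspring = []
    for p1, p2 in zip(parents[0::2], parents[1::2]):
        lam = random.random()  # lambda ~ U[0, 1]
        child = [(lam * w1 + (1 - lam) * w2, lam * h1 + (1 - lam) * h2)
                 for (w1, h1), (w2, h2) in zip(p1, p2)]
        # mutation: scale each anchor by (1 + eps), eps ~ N(0, sigma)
        child = [(w * (1 + random.gauss(0, sigma)),
                  h * (1 + random.gauss(0, sigma))) for w, h in child]
        offspring.append(child)
    return offspring
```

Iterating `evolve_step` until the fitness plateaus yields the anchor priors used to initialize training.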
In summary, although replacing the fully decoupled head with a semi-decoupled design significantly reduces computational overhead and parameter count, it may limit representational capacity. To mitigate this issue, we introduce a prior-driven paradigm that leverages pre-generated optimal anchor boxes to provide explicit scale guidance, enabling more stable and effective spatial feature learning, particularly for microscopic defects. Notably, the only architectural difference between the proposed SDPD-YOLO and the baseline YOLO11 lies in the detection head. Therefore, the effectiveness of this design is validated by the comparative experiments presented in
Section 4.
3.3. Inference-Time Adaptive Tiling
To reduce feature loss at inference, we propose inference-time adaptive tiling (ITAT). ITAT uses a two-stage procedure: a coarse detector proposes Regions of Interest (ROIs) on a downscaled image, and a fine detector performs tiling-based inference only within these ROIs at higher resolution. This reduces redundant computation on background regions compared with full-image sliding-window inference (shown in
Figure 10).
Stage 1: ROI proposal. We use a lightweight coarse detector (a downscaled SDPD-YOLO) to generate candidate ROIs on the input HR image. The goal is to identify regions that may contain insulators or defects, so that high-resolution tiling is applied only to these regions.
Stage 2: Sub-image inference within ROIs. In the ITAT engine, the coarse selection model first predicts bounding boxes B = (x, y, w, h) corresponding to potential defect regions in the global image, where the width and height are denoted as w and h, respectively. Because the edge features of small defects are prone to information loss under tight cropping, and reliable detection requires sufficient surrounding context, an adaptive edge expansion mechanism is employed to generate the final Region of Interest (ROI). An expansion factor α is therefore applied to enlarge the initially predicted bounding box, and the coordinates of the resulting high-resolution ROI are computed as follows:

x' = x − ((α − 1)/2) · w,   y' = y − ((α − 1)/2) · h,
w' = α · w,   h' = α · h,

where (x', y') is the coordinate of the upper-left corner of the ROI and (w', h') is its size. Within each ROI, the fine detector then performs sliding-window inference and maps local predictions back to the global coordinate system, followed by non-maximum suppression (NMS) to remove duplicate boxes.
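A minimal sketch of this centre-preserving expansion follows. The clamping to image bounds and the default α = 1.5 are our additions for illustration, not values stated in the text.

```python
def expand_roi(x, y, w, h, alpha=1.5, img_w=None, img_h=None):
    """Enlarge a predicted box (top-left x, y, size w, h) by factor alpha,
    keeping its centre fixed, to preserve context around small defects."""
    x2 = x - (alpha - 1) / 2 * w
    y2 = y - (alpha - 1) / 2 * h
    w2, h2 = alpha * w, alpha * h
    if img_w is not None and img_h is not None:
        # keep the expanded ROI inside the image (our addition)
        x2 = max(0.0, min(x2, img_w - w2))
        y2 = max(0.0, min(y2, img_h - h2))
    return x2, y2, w2, h2
```

For example, a 50 × 40 box at (100, 100) with α = 1.5 expands to a 75 × 60 ROI whose top-left moves to (87.5, 90).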
Through this two-stage inference mechanism, ITAT effectively combines coarse global understanding with precise local inference, improving both the accuracy and the efficiency of small-object detection in high-resolution images.