Article

IM-DETR: DETR with Mix-Encoder for Industrial Scenarios

1 School of Computer Science, Chongqing University, Chongqing 400030, China
2 Chongqing Tsingshan Industrial Co., Ltd., Chongqing 402776, China
3 State Key Laboratory of Mechanical Transmissions, Chongqing University, Chongqing 400044, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3345; https://doi.org/10.3390/app16073345
Submission received: 4 February 2026 / Revised: 8 March 2026 / Accepted: 12 March 2026 / Published: 30 March 2026
(This article belongs to the Special Issue Advanced Computer Vision Technologies and Applications)

Abstract

Industrial defect detection is a fundamental task in intelligent manufacturing, yet existing object detection methods often struggle with the characteristics of industrial defects, such as small size, irregular shapes, and complex visual backgrounds. Moreover, most detection models are designed primarily for natural image datasets, resulting in limited robustness when deployed in real-world industrial environments. To address these challenges, this research focuses on industrial defect detection and presents contributions at both the dataset and method levels. First, two real-world industrial defect datasets collected from actual production lines are introduced, namely, the Stator Housing Defect Dataset and the Cover Plate Silicone Defect Dataset, which cover representative inspection scenarios with distinct defect characteristics. Second, we propose a detection transformer with a mixed encoder for industrial scenarios (IM-DETR). By integrating heterogeneous multi-scale feature representations, the proposed framework jointly enhances local detail sensitivity and global contextual reasoning without relying on complex post-processing. Extensive experiments on the proposed industrial datasets demonstrate that IM-DETR consistently outperforms existing state-of-the-art detection methods, particularly in scenarios involving small defects, complex backgrounds, and appearance ambiguity, validating the effectiveness and robustness of the proposed approach.

1. Introduction

Industrial visual inspection plays a critical role in modern intelligent manufacturing, where automatic defect detection is essential for ensuring stability and quality [1,2]. In industrial scenarios such as electric drive assembly lines and sealing processes, defects often exhibit small sizes, weak visual contrast, irregular shapes, and strong dependency on manufacturing conditions, posing substantial challenges to existing object detection algorithms [3]. Moreover, complex backgrounds [4], illumination variations [5], and frequent occlusions [6] further aggravate the difficulty of reliable defect detection in real-world industrial environments [7].
Deep learning-based object detectors have achieved remarkable progress on generic vision benchmarks in recent years [8,9]. However, their direct deployment in industrial scenarios remains nontrivial. Most detectors are designed and evaluated on natural image datasets, where object appearances and scene distributions differ substantially from those encountered in industrial production environments [10]. In contrast, industrial defect datasets are typically proprietary, fragmented, and lacking in standardized benchmarks, which restricts reproducibility, fair comparison, and methodological advancement in this field [11]. Consequently, a gap remains between current state-of-the-art (SOTA) methods and the practical requirements of industrial defect inspection. Transformer-based end-to-end detectors, exemplified by DETR-style architectures [12], have recently emerged as promising alternatives to conventional CNN-based detectors. By removing manually designed elements, including anchors and non-maximum suppression, DETR models offer a more streamlined detection pipeline and improved global reasoning capability. Nevertheless, their application in industrial defect detection is still underexplored. Standard DETR architectures are computationally expensive and insufficiently tailored to industrial characteristics, and their feature encoding mechanisms often fail to adequately capture both the fine-grained local defect patterns and the multi-scale structural information that are crucial in industrial scenes.
To address the above challenges, this paper focuses on industrial defect detection and makes both dataset-level and method-level contributions. We introduce two real-world industrial defect datasets collected from actual production lines, covering representative and practically relevant inspection tasks. Specifically, the first dataset targets stator housing defect detection, where defects are typically subtle, spatially distributed, and strongly influenced by machining and assembly processes. The second dataset focuses on cover plate silicone defect detection, which involves irregular deformation, discontinuities, and adhesion-related anomalies that are difficult to model via generic detection frameworks. Together, these datasets provide valuable benchmarks for evaluating detection algorithms under realistic industrial conditions, where complex backgrounds, small defects, and high reliability requirements pose significant challenges for existing detection frameworks. Building upon these datasets, we propose a detection transformer with a mixed encoder for industrial scenarios (IM-DETR). The proposed IM-DETR incorporates a mix-encoder that effectively integrates heterogeneous feature representations, enabling the model to exploit local detail sensitivity and global contextual reasoning jointly. By tailoring the encoding strategy to industrial defect characteristics, the proposed approach improves robustness and detection stability in complex production environments while maintaining an end-to-end detection paradigm. It is worth noting that, in industrial inspection systems, even moderate improvements in detection accuracy or robustness can significantly reduce missed defects and production risks. Therefore, the relevance of the proposed method lies not only in numerical performance gains but also in its ability to provide more reliable feature representation and stable detection behavior under challenging real-world conditions. 
The main contributions of this work can be summarized as follows:
  • Two real-world industrial defect datasets are constructed from actual production lines, namely, the Stator Housing Defect Dataset (SHDD) and the Cover Plate Silicone Defect Dataset (CPSDD). These datasets cover representative industrial inspection scenarios with distinct defect characteristics, including subtle small-scale defects and irregular large-scale deformations. They provide realistic benchmarks for evaluating detection algorithms under practical industrial conditions involving complex backgrounds, subtle defect patterns, and strict reliability requirements.
  • We propose IM-DETR with a mix-encoder, a transformer-based end-to-end defect detection framework specifically designed for industrial scenarios. The proposed mix-encoder integrates heterogeneous multi-scale feature representations, enabling the model to jointly capture fine-grained local details and global contextual dependencies, thereby improving detection robustness and stability for challenging industrial defects.
  • Extensive experiments conducted on the constructed industrial datasets demonstrate that IM-DETR achieves consistent performance improvements over representative CNN- and transformer-based detectors. The proposed framework shows strong effectiveness in handling subtle defects, complex backgrounds, and appearance ambiguity, highlighting its practical relevance for real-world industrial inspection tasks.

2. Related Work

In the early stages, object detection was predominantly addressed through convolutional neural network-based approaches, with region-based and single-stage detectors being the most prominent [8,13,14]. The R-CNN family pioneered a two-stage paradigm, generating candidate regions and then classifying them. R-CNN [15] and its successors, Fast R-CNN [16] and Faster R-CNN [17], progressively enhanced efficiency by sharing convolutional features and introducing the region proposal network (RPN) [18] for end-to-end training. These methods achieve strong localization accuracy but remain computationally expensive. To enable real-time detection, single-stage detectors such as YOLO [19] reformulated detection as a unified regression problem. Subsequent YOLO variants [20,21,22] incorporated multi-scale feature prediction, anchor-based or anchor-free designs, and improved loss functions to enhance performance across object scales. These CNN-based detectors provide a balance of accuracy and speed and remain widely used in practical applications.
In recent years, a shift toward transformer-based architectures has led to a new paradigm in object detection [23]. DETR (Detection Transformer) [12,24] frames object detection as a direct set prediction task via a transformer encoder–decoder, eliminating hand-crafted components such as anchor boxes and non-maximum suppression (NMS). While exhibiting strong global reasoning, DETR is hampered by slow convergence and poor performance on small objects. To address these limitations, Deformable DETR [25] incorporates multi-scale deformable attention, which enhances training efficiency and small-object detection accuracy. Several subsequent variants further enhance DETR's practicality, including Conditional DETR [26] and DAB-DETR [27], which accelerate convergence, and RT-DETR [28], which targets real-time performance by optimizing transformer computation and feature fusion. DN-DETR [29] introduces a query denoising mechanism to accelerate training convergence, and DINO [30] builds upon DN-DETR by incorporating contrastive denoising and mixed query selection. These DETR-based methods combine global context modeling with increasingly efficient designs, making transformer detectors competitive with CNN-based approaches in both accuracy and speed.
Despite notable advances in CNN- and transformer-based detectors, industrial defect detection still faces unresolved challenges. Existing methods are largely designed for natural images and often fail under the extreme scale imbalance, weak contrast, irregular defect shapes, and strong background interference common in industrial scenes. While DETR-style models enhance global reasoning, their feature encoders are not tailored to jointly capture fine-grained local defect cues and structured industrial context, leading to suboptimal performance on small or ambiguous defects. In addition, limited and non-standardized industrial datasets hinder fair evaluation and practical deployment. These limitations highlight the need for a detection framework better aligned with the characteristics of industrial inspection scenarios and supported by realistic benchmark data.

3. Methodology

3.1. Framework

Industrial defect detection differs fundamentally from natural image detection due to extreme scale imbalance, weak contrast, and structured industrial backgrounds. Standard CNN-based detectors emphasize local convolutional patterns, while vanilla DETR-style models mainly rely on global self-attention, yet neither explicitly addresses the joint modeling of fine-grained defect cues and structured industrial context. To tackle these challenges, we propose IM-DETR, an end-to-end detection framework specifically reformulated for industrial inspection.
As illustrated in Figure 1, IM-DETR consists of a backbone for multi-scale feature extraction, a task-oriented mixed encoder for structured multi-scale representation learning, and a transformer decoder for prediction. The backbone extracts hierarchical feature maps $\{S_3, S_4, S_5\}$ that capture complementary spatial and semantic information across resolutions.
The core innovation lies in the proposed mixed encoder, which explicitly decouples intra-scale semantic enhancement from cross-scale spatial refinement. Selective self-attention is applied to high-level features to strengthen global structural reasoning, followed by progressive cross-scale fusion with gated residual reweighting to preserve fine-grained defect details. The resulting unified representation $O$ is fed into the transformer decoder, which directly predicts defect categories and bounding boxes via bipartite matching in a fully end-to-end manner, without post-processing steps such as NMS.

3.2. Backbone Architecture

IM-DETR adopts a ResNet-50 backbone pretrained on ImageNet. The feature maps from the last three stages are extracted to construct multi-scale representations:
$S_3 \in \mathbb{R}^{C_3 \times H/8 \times W/8}, \quad S_4 \in \mathbb{R}^{C_4 \times H/16 \times W/16}, \quad S_5 \in \mathbb{R}^{C_5 \times H/32 \times W/32},$
where $C_3 = 512$, $C_4 = 1024$, and $C_5 = 2048$. These feature levels correspond to strides $\{8, 16, 32\}$, enabling the model to preserve spatial details for small defects while capturing high-level semantic context for large or irregular defects.

3.3. Mix Encoder for Multi-Scale Defect Enhancement

The mixed encoder is a core component designed to handle the diverse sizes of industrial defects, which range from tiny imperfections to large-scale structural anomalies. It employs a hybrid architecture that combines attention mechanisms and convolutional operations to process multi-scale features effectively. The encoder operates on features extracted from the backbone's last three stages, denoted as $\{S_3, S_4, S_5\}$, which capture different levels of semantic information and spatial detail essential for defect detection.
The encoder first performs intra-scale feature interaction by applying self-attention mechanisms selectively to high-level features. This approach focuses computational resources on features with richer semantic content, avoiding redundant processing of lower-level features. The attention mechanism captures long-range dependencies and conceptual relationships between defect patterns, improving the model’s capacity to identify complex defects. The process can be formalized as follows:
$F = \mathrm{Transformer}(\mathrm{Flatten}(S_5)),$
where the flattened high-level features $S_5$ undergo self-attention transformation to produce enhanced representations $F$.
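A minimal sketch of this intra-scale step, assuming a 1×1 projection to a 256-dimensional token space and a standard PyTorch transformer encoder (the layer count and dimensions are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class IntraScaleEncoder(nn.Module):
    """Flatten S5 into tokens and apply self-attention (the equation for F)."""
    def __init__(self, in_ch=2048, d_model=256, nhead=8, num_layers=1):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=1)  # channel reduce
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, s5):
        b, _, h, w = s5.shape
        tokens = self.proj(s5).flatten(2).transpose(1, 2)  # (B, H*W, d_model)
        f = self.encoder(tokens)                           # global attention
        return f.transpose(1, 2).reshape(b, -1, h, w)      # back to a 2-D map

f = IntraScaleEncoder()(torch.randn(1, 2048, 20, 20))      # -> (1, 256, 20, 20)
```

Because attention runs only over the stride-32 map, the token sequence stays short, which is the efficiency point made in Section 3.5.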
The cross-scale fusion process is implemented through a multistage operation that progressively integrates features from different resolution levels. The fusion begins with the highest-level feature F 5 and systematically incorporates lower-level features through upsampling and transformation operations. The complete fusion mechanism can be mathematically described as follows:
$M_4 = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{1\times 1}(S_4) + \mathrm{Upsample}(\mathrm{Conv}_{1\times 1}(F))\big)$
$M_3 = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{1\times 1}(S_3) + \mathrm{Upsample}(\mathrm{Conv}_{1\times 1}(M_4))\big)$
where $\mathrm{Upsample}(\cdot)$ denotes bilinear interpolation with a scale factor of 2, $\mathrm{Conv}_{1\times 1}(\cdot)$ performs channel dimension adjustment, and $\mathrm{Conv}_{3\times 3}(\cdot)$ handles spatial feature integration. The fusion process ensures that high-level semantic information from $F$ is effectively propagated to enhance the spatial details preserved in $S_3$ and $S_4$.
The fusion mechanism incorporates residual connections and gating mechanisms to preserve informative features during multi-scale integration. Specifically, for each fusion stage $i \in \{3, 4\}$, the output feature map $\hat{M}_i$ is obtained by adaptively combining the original backbone feature and the fused feature from the higher level:
$\hat{M}_i = \alpha_i \cdot S_i + (1 - \alpha_i) \cdot M_i,$
where $S_i \in \mathbb{R}^{C \times H_i \times W_i}$ denotes the backbone feature at scale $i$, $M_i \in \mathbb{R}^{C \times H_i \times W_i}$ denotes the intermediate feature obtained after upsampling and convolutional fusion from the higher-level feature, and $\alpha_i \in [0, 1]$ is a learnable scalar parameter. Here, $H_i$ and $W_i$ represent the spatial resolution at scale $i$, and channel dimensions are aligned via $1 \times 1$ convolutions. This gated residual fusion enables the model to dynamically balance fine-grained spatial details and high-level semantic information, which is particularly important for industrial defects with large variations in scale and appearance.
Finally, the output $O \in \mathbb{R}^{C \times H_3 \times W_3}$ serves as a unified multi-scale representation, combining fine-grained spatial details and high-level semantic information, and is optimized for accurate defect detection:
$O = \mathrm{ReLU}\big(\mathrm{BatchNorm}(\hat{M}_3)\big).$
For industrial defect detection, the mixed encoder incorporates specialized enhancements, including scale-adaptive weighting mechanisms that emphasize features most relevant to specific defect types. To enhance gradient flow and stabilize training, the architecture additionally incorporates residual connections and normalization layers.
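The fusion stages above can be sketched as follows. The FPN-style ordering (1×1 lateral projection, addition with the upsampled higher-level feature, 3×3 smoothing, then the gated residual) and the sigmoid parameterization keeping $\alpha_i$ in [0, 1] are implementation assumptions, not the paper's verified design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionStage(nn.Module):
    """One cross-scale fusion stage producing M_i and the gated M_hat_i."""
    def __init__(self, low_ch, d_model=256):
        super().__init__()
        self.lateral = nn.Conv2d(low_ch, d_model, 1)   # align S_i channels
        self.reduce = nn.Conv2d(d_model, d_model, 1)   # 1x1 on higher level
        self.smooth = nn.Conv2d(d_model, d_model, 3, padding=1)
        self.gate = nn.Parameter(torch.zeros(1))       # alpha_i = sigmoid(gate)

    def forward(self, s_low, f_high):
        s = self.lateral(s_low)                        # channel-aligned S_i
        up = F.interpolate(self.reduce(f_high), scale_factor=2,
                           mode="bilinear", align_corners=False)
        m = self.smooth(s + up)                        # fused feature M_i
        alpha = torch.sigmoid(self.gate)               # keeps alpha in [0, 1]
        return alpha * s + (1 - alpha) * m             # gated residual M_hat_i

stage4 = GatedFusionStage(low_ch=1024)
stage3 = GatedFusionStage(low_ch=512)
f  = torch.randn(1, 256, 20, 20)                      # attention-enhanced S5
m4 = stage4(torch.randn(1, 1024, 40, 40), f)          # -> (1, 256, 40, 40)
m3 = stage3(torch.randn(1, 512, 80, 80), m4)          # -> (1, 256, 80, 80)
out = nn.Sequential(nn.BatchNorm2d(256), nn.ReLU())   # O = ReLU(BN(M_hat_3))
o = out(m3)
```
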

3.4. Loss Function Design

The loss function is designed to jointly optimize defect localization and classification through a multitask formulation that combines several components:
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{box}} + \mathcal{L}_{\mathrm{giou}},$
where $\mathcal{L}_{\mathrm{cls}}$ represents the classification loss, $\mathcal{L}_{\mathrm{box}}$ denotes the localization loss, and $\mathcal{L}_{\mathrm{giou}}$ is the GIoU loss. This formulation ensures that the model jointly learns to classify defects correctly and localize them precisely, which is vital for industrial quality control, where false positives or negatives can have significant consequences. During training, the loss is optimized end-to-end, with bipartite matching used to assign predictions to ground-truth annotations.
The classification loss L cls uses a focal loss variant to handle class imbalance, which is common in defect detection, where negative samples (no defects) may dominate. It is expressed as:
$\mathcal{L}_{\mathrm{cls}} = -\alpha (1 - p_t)^{\gamma} \log(p_t),$
where $\alpha$ and $\gamma$ are adjustable hyperparameters, and $p_t$ denotes the estimated probability for the true class. This loss emphasizes hard examples and down-weights well-classified ones, improving robustness for rare defects. For localization, the bounding box loss $\mathcal{L}_{\mathrm{box}}$ employs L1 regression:
$\mathcal{L}_{\mathrm{box}} = \lVert \hat{b} - b \rVert_1,$
where $b = (x, y, w, h)$ and $\hat{b} = (\hat{x}, \hat{y}, \hat{w}, \hat{h})$ denote the ground-truth and predicted bounding boxes, respectively. Additionally, the GIoU loss $\mathcal{L}_{\mathrm{giou}}$ enhances spatial consistency:
$\mathcal{L}_{\mathrm{giou}} = 1 - \mathrm{GIoU}(\hat{b}, b),$
which measures the overlap between boxes and helps in handling occluded or overlapping defects. In our experiments, this composite loss improves both classification accuracy and localization precision. It also avoids the need for complex post-processing, in keeping with the framework's end-to-end paradigm.

3.5. Computational Complexity Analysis

We analyze the computational complexity of IM-DETR with respect to the input feature resolutions. Let the spatial size of the highest-level feature map $S_5$ be $H_5 \times W_5$, and denote $N = H_5 W_5$ as the number of tokens after flattening.
Intra-scale Self-Attention. The transformer module is applied only to the highest-level feature $S_5$. The complexity of self-attention is $O(N^2 C)$, where $C$ is the channel dimension. Since $S_5$ has a stride of 32, $N$ is significantly smaller than the token counts of the higher-resolution feature maps. Compared to vanilla DETR encoders that apply attention across larger multi-scale token sets, our design reduces quadratic complexity by restricting attention to semantically compact high-level representations.
Cross-Scale Fusion. The progressive fusion operations consist of $1 \times 1$ and $3 \times 3$ convolutions and bilinear upsampling. These operations introduce a computational complexity of $O(H_i W_i C^2)$ per scale, which scales linearly with spatial resolution. Since fusion is performed only across three levels, the additional overhead remains modest compared to the backbone computation.
Overall Complexity. The overall complexity of IM-DETR can therefore be summarized as:
$O(N^2 C) + \sum_{i=3}^{4} O(H_i W_i C^2),$
where the quadratic term originates from high-level self-attention and the linear terms correspond to multi-scale convolutional fusion. By decoupling semantic abstraction (quadratic complexity on low-resolution features) from spatial refinement (linear complexity on higher-resolution features), IM-DETR achieves an efficient balance between global reasoning capability and computational cost.
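To make the token-count argument concrete, a back-of-envelope calculation for an assumed 640×640 input:

```python
# Token counts per feature level for an assumed 640x640 input.
H = W = 640
n5 = (H // 32) * (W // 32)                             # S5 only: 400 tokens
n_all = sum((H // s) * (W // s) for s in (8, 16, 32))  # all scales: 8400 tokens

# Quadratic attention cost scales with the square of the token count, so
# restricting attention to S5 shrinks the attention term by (8400/400)^2.
attn_saving = (n_all / n5) ** 2
print(n5, n_all, attn_saving)   # 400 8400 441.0
```
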
In practice, the mixed encoder introduces only a modest increase in FLOPs compared to the baseline DETR architecture while providing consistent performance improvements, demonstrating favorable efficiency–accuracy trade-offs for industrial inspection scenarios.

4. Results

4.1. Experimental Settings

Datasets. To assess the practical utility of our method, we curated two in-house datasets: SHDD and CPSDD. The SHDD dataset, which addresses metal surface anomalies, comprises 600 samples, split into 480 for training and 120 for evaluation. This dataset is characterized by defects with subtle visual features and contains exclusively small- and medium-sized objects without any large-scale instances, which poses a challenge for fine-grained detection. The CPSDD addresses quality control in sealing processes and comprises 291 images, with 232 allocated for training and 59 for validation. Unlike SHDD, this dataset involves irregular deformation and continuity issues, and it contains primarily medium- and large-sized objects, with no small objects present. Together, the two datasets exhibit distinct object scale distributions, ranging from tiny scratches to large sealant overflows, providing a comprehensive benchmark for evaluating the generalizability of our method across different industrial inspection tasks.
To provide a more comprehensive understanding of the specific challenges in these industrial scenarios, detailed dataset statistics are summarized in Table 1. The statistics include acquisition parameters (image resolution), instance counts for the single-class defect targets, and bounding box size distributions defined by the standard COCO metric: Small: area $< 32^2$; Medium: $32^2 \le$ area $< 96^2$; Large: area $\ge 96^2$. The extreme contrast in box size distributions between the two datasets underlines their complementary nature as benchmarks.
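For reference, the COCO size bucketing used in Table 1 corresponds to the following standard thresholds:

```python
def coco_size_bucket(area: float) -> str:
    """Standard COCO object-size buckets (areas in squared pixels)."""
    if area < 32 ** 2:        # area < 1024
        return "small"
    if area < 96 ** 2:        # 1024 <= area < 9216
        return "medium"
    return "large"             # area >= 9216

print(coco_size_bucket(30 * 30),
      coco_size_bucket(64 * 64),
      coco_size_bucket(120 * 120))
# small medium large
```
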
Training Details and Evaluation Metrics. IM-DETR is trained for 72 epochs using the AdamW optimizer. The choice of 72 epochs is based on the convergence behavior of transformer-based detectors, which typically require longer training to achieve stable bipartite matching optimization. AdamW is employed for its decoupled weight decay, which provides more stable convergence for transformer architectures than traditional SGD. The initial learning rate is set to $1 \times 10^{-4}$ for the transformer layers and $1 \times 10^{-5}$ for the backbone, with a weight decay of $1 \times 10^{-4}$. The number of object queries is fixed at 300. To ensure the reproducibility of the results and the statistical validity of the comparisons, we set a fixed random seed for all experiments. This eliminates fluctuations caused by random weight initialization and data shuffling, ensuring that the observed performance gains are consistent and attributable to the proposed method. All experiments are implemented using the PyTorch 2.0.1 deep learning framework and conducted on a single NVIDIA GeForce RTX 4090 GPU. Following the standard COCO protocol [31], we evaluate performance via the mean AP across IoU thresholds from 0.5 to 0.95. We further report the fixed-threshold scores $AP_{50}$ and $AP_{75}$, alongside the scale-specific metrics $AP_S$, $AP_M$, and $AP_L$, to account for defect size variations.
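The fixed-seed setup described above can be sketched as follows; the seed value and the exact set of libraries seeded are assumptions, not details reported in the paper:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed all common sources of randomness for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)          # CPU generator
    torch.cuda.manual_seed_all(seed)  # all CUDA devices, if present

set_seed(42)
```

For fully deterministic behavior on GPU, `torch.use_deterministic_algorithms(True)` and a deterministic cuDNN configuration would additionally be required, at some cost in speed.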

4.2. Comparison Experiments

We benchmark the proposed method against SOTA detectors, including Faster R-CNN [17], Cascade R-CNN [32], FCOS [33], CenterNet [34], Deformable DETR [25], and DINO [30]. Furthermore, we incorporate the YOLO series, ranging from v3u to v12 [35,36,37,38,39,40,41] for a comprehensive comparison. These baseline methods cover a wide range of architectures, from classic two-stage anchor-based detectors and single-stage anchor-free models to the highly optimized CNN-based YOLO series and recent end-to-end transformer-based approaches. Evaluating our method against this diverse set of architectures ensures a comprehensive understanding of its relative strengths in handling specific industrial challenges, such as severe scale variation and background interference. The quantitative results for the SHDD and CPSDD benchmarks are summarized in Table 2 and Table 3, respectively.
Performance on SHDD. As detailed in Table 2, the SHDD dataset poses difficulties due to small, low-contrast defects. IM-DETR achieves an AP of 23.2%, surpassing all other evaluated architectures. Specifically, it improves over Deformable DETR by 10.3% AP and over DINO by 7.2% AP, and it outperforms the recent YOLOv12m by 3.3%. Deformable DETR relies on sparse spatial sampling, which frequently misses subtle and tiny visual cues. In contrast, our approach preserves high-frequency details through dense multi-scale fusion, making it more effective for fine-grained anomaly detection.
In addition to detection accuracy, we evaluate the computational efficiency of the models. As shown in Table 2, IM-DETR operates at 95 FPS with a computational cost of 132 GFLOPs. Compared to DINO, which runs at 10 FPS with 179 GFLOPs, our method achieves a significant speedup while reducing computational overhead. Furthermore, while YOLOv12m operates with 68 GFLOPs, our method delivers a higher inference speed of 95 FPS and a superior AP of 23.2% compared to the 19.9% achieved by YOLOv12m. This demonstrates that IM-DETR offers a more advantageous balance between accuracy and real-time performance, which is critical for deployment in high-speed industrial production lines.
From a practical deployment perspective, this favorable balance between computational cost and accuracy is highly valuable. Industrial manufacturing lines often operate under strict latency constraints and rely on cost-effective edge computing devices rather than high-end server GPUs. Achieving 95 FPS implies that IM-DETR can process high-resolution images in real time without causing bottlenecks in the production workflow, fulfilling the fundamental requirements of automated optical inspection systems.
A key advantage of our method lies in its ability to detect fine-grained anomalies. In terms of small object detection ($AP_S$), our method achieves 22.6%, well above the 10.8% of Faster R-CNN and the 12.7% of Deformable DETR. Even compared with the highly optimized YOLOv11m, which scores 21.0% on $AP_S$, our approach maintains higher performance. These gains are attributable to the mix-encoder, which preserves the high-frequency details vital for spotting tiny defects that generic detectors typically lose during downsampling. Beyond extremely small defects, our model also demonstrates strong capability on medium-scale anomalies. As shown in Table 2, IM-DETR achieves an $AP_M$ of 53.3%, outperforming YOLOv8m (46.9%) and YOLOv9m (46.1%) by a significant margin. This balanced performance across object scales further confirms that the proposed feature enhancement strategy does not bias the model toward a single defect size but improves the overall feature representation.
Performance on CPSDD. Table 3 presents the results for the CPSDD benchmark, which focuses on sealant defects with irregular deformations and larger scales. Our method continues to demonstrate robustness and achieves an AP of 52.9%. This performance surpasses the second-best YOLOv5m by 2.6% and DINO by 4.6%. Furthermore, it outperforms Deformable DETR by a substantial margin of 21.2%. The sparse keypoint sampling of deformable attention is insufficient for delineating the amorphous and continuous boundaries of sealant overflows. IM-DETR overcomes these geometric limitations through its dynamic feature integration, adapting flexibly to irregular defect shapes.
Notably, our model exhibits effective localization capability, as evidenced by an $AP_{50}$ of 87.6% and an $AP_{75}$ of 54.1%. These metrics reflect the framework's dual capability in delivering reliable classification alongside accurate bounding box regression, even for amorphous sealant overflows. Furthermore, in the detection of large objects ($AP_L$), our method reaches 58.1%, outperforming Faster R-CNN at 53.9% and YOLOv12m at 56.0%. Traditional anchor-based methods like Faster R-CNN struggle to adapt their rigid anchor boxes to the highly variable aspect ratios of sealant deformations. Although the YOLO series provides strong real-time performance, these models still exhibit fluctuations when delineating large, irregular object boundaries. This consistency across both datasets, where one is dominated by tiny, subtle features while the other involves large, irregular shapes, confirms the generalizability of our architecture and its suitability for diverse industrial inspection tasks.
Performance on VisDrone2019. To evaluate the generalization capability of the proposed IM-DETR beyond specific industrial domains, we conducted additional experiments on the public VisDrone2019 dataset [44]. This dataset is characterized by a high density of small objects, occlusion, and complex background interference, sharing critical characteristics with challenging industrial inspection tasks. As presented in Table 4, our method achieves an mAP of 29.5%, outperforming state-of-the-art methods such as Deformable DETR at 25.5%, DINO at 26.8%, and the recent YOLOv10m at 29.1%. The consistent performance across object scales shows that the proposed feature enhancement strategy remains highly competitive against transformer-based detectors that employ deformable multi-scale encoders. This result is particularly notable given the drastic viewpoint variations and extreme scale shifts inherent to drone-captured imagery, indicating that the model does not overfit to specific viewing angles or object orientations.
The proposed framework exhibits clear improvements in detecting small-scale objects, a key capability for industrial quality control. In terms of $AP_S$, our model achieves 19.5%, higher than the 17.2% of Deformable DETR, the 17.5% of DINO, and the 16.1% of Faster R-CNN, and matching the highly optimized YOLOv8m. Furthermore, our method achieves 58.6% in $AP_L$, demonstrating its versatility across object scales. A robust $AP_M$ of 42.0% and a solid $AP_{50}$ of 49.9% indicate that our architecture maintains consistent localization accuracy even when target objects are densely packed or partially obscured by background clutter. These results on a public benchmark validate that the mix-encoder's feature enhancement is not limited to proprietary industrial datasets but extends to general scenarios with similar visual challenges.

4.3. Ablation Study

To verify the effectiveness of the multi-scale feature fusion strategy in the mix-encoder, we conducted an ablation study on the SHDD dataset. We analyzed the impact of different feature combinations extracted from the backbone, specifically examining the contributions of the high-resolution features ($S_3$), mid-level features ($S_4$), and high-level semantic features ($S_5$). Selecting these three stages strikes a necessary balance between maintaining sufficient pixel granularity for tiny defects and keeping computational overhead within practical limits for actual deployment.
As presented in Table 5, using only the high-resolution feature S 3 yields a baseline AP of 21.0%; while this scale captures fine-grained details, the lack of semantic context limits overall performance. The relatively low AP75 of 9.1% in this single-scale setting shows that, without deep semantic guidance, the model cannot draw tight bounding boxes around complex defect shapes. Adding the mid-level feature S 4 improves the AP to 21.9%, demonstrating the benefit of incorporating broader contextual information; this brings an immediate improvement in basic object discovery, pushing the AP50 from 55.7% to 59.3%. However, without the deepest semantic anchor, the intermediate features slightly confuse medium-scale predictions, causing a temporary drop in APM from 52.5% to 49.2%. The best performance is achieved by integrating all three scales ( S 3 , S 4 , and S 5 ), which further improves the AP to 23.2%.
Notably, the complete fusion strategy significantly enhances the detection of small defects: APS increases from 20.2% with only S 3 to 22.6% with the full configuration. This monotonic increase confirms that the mix-encoder effectively leverages the semantic strength of deep features to guide the localization of small defects within the high-resolution maps. The inclusion of S 5 acts as a macro-level contextual filter that resolves the earlier scale ambiguity, recovering APM to its peak of 53.3%. Furthermore, AP75 improves from 9.1% to 11.5%, indicating that multi-scale fusion improves not only detection recall but also the precision of the predicted bounding boxes. This ablation therefore confirms that combining detailed textures with abstract semantics is a practical necessity for handling realistic defect variations, ensuring that no single feature level becomes a bottleneck in bounding box regression.
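To make the fusion principle concrete, the sketch below illustrates generic top-down multi-scale fusion, in which a deeper, more semantic map is upsampled and added residually to a shallower, more detailed map. Plain Python lists stand in for feature tensors; this is a simplified illustration of the general idea, not the mix-encoder's actual implementation, and the function names are ours.

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 2D grid (rows x cols)."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse(shallow, deep):
    """Residually add the upsampled deep (semantic) map to the shallow (detailed) map."""
    up = upsample2x(deep)
    return [[s + u for s, u in zip(srow, urow)]
            for srow, urow in zip(shallow, up)]

# Toy pyramid: S5 is the deepest 1x1 map, S4 is 2x2, S3 is 4x4.
s5 = [[1.0]]
s4 = fuse([[0.5, 0.5], [0.5, 0.5]], s5)       # S4 enriched by S5 semantics
s3 = fuse([[0.0] * 4 for _ in range(4)], s4)  # S3 enriched in turn
```

The cascade mirrors the ablation: dropping the deepest map removes the semantic signal from every shallower scale, which is consistent with the APM dip observed when S 5 is absent.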

4.4. Visualization

Figure 2 presents the detection results of the models on SHDD. The defects in this dataset are typically extremely small and low-contrast, making them difficult to detect. The comparison images show that the YOLOv12m model misses some small targets. In contrast, IM-DETR accurately detects the defects that the baseline fails to capture, indicating improved sensitivity to subtle defect patterns. The proposed mix-encoder is primarily responsible for this increased sensitivity, since it preserves the high-frequency spatial features that are essential for identifying minute flaws. By using deep semantic features to guide localization within high-resolution maps, our multi-scale fusion approach avoids the feature vanishing issue that generic detectors typically suffer during downsampling.
Figure 3 further assesses model performance on the CPSDD dataset, which primarily consists of large-scale sealant defects with irregular shapes. As shown in row (b), although YOLOv12m can identify the approximate locations of the defects, the generated bounding boxes are often too loose and fail to align precisely with the defect edges. In comparison, the detection boxes generated by IM-DETR are tighter and adaptively cover complex deformed regions. This precise boundary alignment is enabled by the gated residual fusion mechanism, which dynamically balances spatial details and high-level semantics, allowing the model to accurately delineate full defect regions even for irregular sealant overflows and fuzzy boundaries. Therefore, the proposed framework not only excels at small object detection but also offers significant advantages in handling large defects with variable morphologies in industrial scenarios.
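For intuition about the gating idea, the scalar sketch below assumes a common form of gated residual fusion, a sigmoid-gated convex blend of the two streams. This is an assumption for illustration only: the names `gated_fuse` and `sigmoid` are ours, and the actual mechanism in IM-DETR operates on feature channels and may differ in form.

```python
import math

def sigmoid(x: float) -> float:
    """Squash a real-valued gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(shallow: float, deep: float, gate_logit: float) -> float:
    """Convex blend of spatial detail (shallow) and semantics (deep).

    In a real network, gate_logit would be predicted per location or per
    channel by a small learned layer, letting the model lean on edges for
    boundary alignment and on semantics for region identity.
    """
    g = sigmoid(gate_logit)
    return (1.0 - g) * shallow + g * deep
```

A neutral gate (logit 0) averages the two streams, while a strongly positive logit shifts the output toward the semantic stream, which is the behavior that helps the boxes hug fuzzy sealant boundaries.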

4.5. Error Analysis and Limitations

Visualization of failure cases. To comprehensively evaluate the boundaries of our proposed method, we first analyze typical failure cases on both datasets, as visually illustrated in Figure 4 and Figure 5.
Error analysis discussion. Based on the visualizations above, we dissect the specific causes of detection errors. As shown in Figure 4, the failure modes on the SHDD dataset fall primarily into three categories. First, the model occasionally exhibits oversensitivity, confusing strong background noise or edge reflections with actual defects and producing false positives (left column). Second, for extremely tiny and low-contrast defects, downsampling causes severe feature vanishing, resulting in false negatives (middle column). Third, when the morphology of a defect closely resembles the inherent high-frequency vertical striations of the background, the model struggles to differentiate them, causing camouflage-induced missed detections (right column).
In contrast, on the CPSDD dataset (Figure 5), the errors mainly manifest as inaccurate bounding box regression and semantic ambiguity. When a large defective area transitions gradually from a normal honeycomb pattern to a severely exposed state, the model tends to anchor only on the most visually extreme sub-region, failing to encompass the entire ambiguous transitional boundary (left column). Furthermore, for defect clusters, while human annotators aggregate spatially correlated degradation into a single semantic instance, the model rigidly isolates only the absolute anomalies, revealing a limitation in holistic contextual comprehension (right column).
Quantitative breakdown of limitations and efficiency. To further quantify our failure modes, we analyze performance degradation across defect sizes and IoU thresholds. On the SHDD dataset, the model achieves an overall AP of 23.2%, but performance drops to 22.6% for small defects (APS) compared with 53.3% for medium defects (APM). This gap shows that while the mix-encoder improves fine-grained detection over baselines such as YOLOv12m, ultra-small anomalies remain a primary source of missed detections. Regarding localization strictness, performance declines from 60.8% at AP50 to 11.5% at AP75, reflecting the inherent difficulty of generating extremely tight bounding boxes around subtle, low-contrast defect boundaries. A similar trend is observed on the CPSDD dataset, where AP50 reaches 87.6% but drops to 54.1% at AP75, highlighting the ongoing challenge of precisely delimiting irregular sealant overflows.
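The AP50-to-AP75 gap traces directly to the IoU matching criterion. The snippet below uses standard IoU arithmetic (not code from our pipeline) to show how a modestly shifted prediction passes the 0.5 threshold yet fails the stricter 0.75 threshold:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (0.0, 0.0, 10.0, 10.0)
pred = (2.0, 0.0, 12.0, 10.0)  # prediction shifted right by 2 px
overlap = iou(gt, pred)        # 80 / 120: counted as a hit at AP50 but a miss at AP75
```

For a 10-pixel defect, a 2-pixel localization error already drops the IoU to about 0.67, which explains why low-contrast boundaries depress AP75 far more than AP50.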
Regarding computational efficiency, the processing time is predominantly determined by the static computational graph of the model rather than the specific characteristics of the defects within a given image, such as size, quantity, or overlap. Operating at 132 GFLOPs, IM-DETR processes images at a constant 95 FPS, ensuring stable real-time performance irrespective of whether an image contains a single large anomaly or multiple tiny, dense defects.
Qualitative discussion of limitations. These observations demonstrate that while IM-DETR achieves robust overall capabilities, its feature representation remains constrained when dealing with vanishing features of extremely low contrast, heavily camouflaged defects, and the semantic aggregation of ambiguous boundaries. Future work will focus on integrating frequency-domain decoupling techniques and advanced context-aggregation modules to address these specific bottlenecks and further improve industrial applicability.
We conducted comprehensive experimental validation, including the quantitative error analysis in Section 4.5, the computational complexity breakdown in Section 3.5, and the ablation experiments in Section 4.3, all of which illustrate the methodological contributions of this work. We objectively analyzed both the performance advantages and the applicability boundaries of the proposed IM-DETR architecture, confirming that the model achieves a practical balance between detection accuracy and real-time inference efficiency. Stable and consistent improvements on the dedicated industrial datasets SHDD and CPSDD, as well as on the public VisDrone2019 dataset, verify the engineering practicality of the proposed method. Together, these validations demonstrate the feasibility and robustness of deploying our framework in complex real-world scenarios and its applicability to actual industrial automated inspection systems.

5. Conclusions

In this study, we develop IM-DETR with a mix-encoder to address the challenges of industrial defect detection in practical manufacturing settings. Considering the complexity of real production environments, two real-world industrial defect datasets are constructed from actual production lines, covering stator housing defects and cover plate silicone defects. These datasets provide representative and challenging benchmarks with distinct defect characteristics, scale variations, and background interference. Building upon these datasets, a transformer-based framework equipped with a mix-encoder is proposed to integrate heterogeneous multi-scale feature representations, enabling effective modeling of both subtle local defect cues and broader contextual information in a unified end-to-end manner without relying on complex post-processing strategies. The mix-encoder enhances feature interaction across scales, improving the model's capability to capture fine-grained defect details while preserving global structural consistency. Extensive experimental results on the proposed industrial datasets demonstrate that IM-DETR consistently outperforms existing state-of-the-art detection methods, particularly in scenarios involving small defects, complex backgrounds, and appearance ambiguity. These results further validate the effectiveness and robustness of the proposed architecture under practical industrial conditions. Overall, the study demonstrates that jointly leveraging industrial-focused dataset development and a tailored DETR-style architecture is an effective and practical strategy for real-world industrial defect detection.

Author Contributions

Conceptualization, S.L. and Y.F.; methodology, D.W. and Z.Z.; software, J.W. and X.W. (Xiangdong Wang); validation, S.L., Y.F., X.W. (Xuekai Wei), J.Y., W.X. and Y.Q.; formal analysis, D.W. and Z.Z.; investigation, X.W. (Xiangdong Wang) and J.W.; resources, S.L.; data curation, S.L., D.W., Z.Z., J.W. and X.W. (Xiangdong Wang); writing—original draft preparation, D.W., Z.Z., J.W. and X.W. (Xiangdong Wang); writing—review and editing, S.L., Y.F., H.W., X.W. (Xuekai Wei), J.Y., W.X. and Y.Q.; visualization, J.W. and X.W. (Xiangdong Wang); supervision, Y.F.; project administration, S.L., H.W. and Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Chongqing New YC Project under Grant CSTB2024YCJH-KYXM0126; the Fundamental Research Funds for the Central Universities of Ministry of Education of China under Grant 2025CDJZKZCQ-11; the General Program of the Natural Science Foundation of Chongqing under Grant CSTB2024NSCQ-MSX0479; the Chongqing Postdoctoral Foundation Special Support Program under Grant 2023CQBSHTB3119; the China Postdoctoral Science Foundation under Grant 2024MD754244; the Government of Canada's New Frontiers in Research Fund (NFRF) (NFRFE-2021–00913); and the Postdoctoral Talent Special Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data in the datasets were provided by Chongqing Tsingshan Industrial Co., Ltd., in accordance with confidentiality and licensing agreements. Owing to commercial privacy and proprietary restrictions, the datasets are not publicly available but may be made available upon reasonable request and with the explicit permission of the company.

Conflicts of Interest

Authors Shiyou Liu and Haibing Wang were employed by the company Chongqing Tsingshan Industrial Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chen, Y.; Ding, Y.; Zhao, F.; Zhang, E.; Wu, Z.; Shao, L. Surface defect detection methods for industrial products: A review. Appl. Sci. 2021, 11, 7657. [Google Scholar] [CrossRef]
  2. Yang, B.; Zhang, X.; Zhang, J.; Luo, J.; Zhou, M.; Pi, Y. EFLNet: Enhancing feature learning network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5906511. [Google Scholar] [CrossRef]
  3. Saberironaghi, A.; Ren, J.; El-Gindy, M. Defect detection methods for industrial products using deep learning techniques: A review. Algorithms 2023, 16, 95. [Google Scholar] [CrossRef]
  4. Shen, W.; Zhou, M.; Luo, J.; Li, Z.; Kwong, S. Graph-Represented Distribution Similarity Index for Full-Reference Image Quality Assessment. IEEE Trans. Image Process. 2024, 33, 3075–3089. [Google Scholar] [CrossRef]
  5. Zhou, M.; Zhao, X.; Luo, F.; Luo, J.; Pu, H.; Xiang, T. Robust RGB-T Tracking via Adaptive Modality Weight Correlation Filters and Cross-modality Learning. ACM Trans. Multimedia Comput. Commun. Appl. 2023, 20, 95. [Google Scholar] [CrossRef]
  6. Song, J.; Zhou, M.; Luo, J.; Pu, H.; Feng, Y.; Wei, X.; Jia, W. Boundary-Aware Feature Fusion with Dual-Stream Attention for Remote Sensing Small Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5600213. [Google Scholar] [CrossRef]
  7. Zhang, Z.; Zhou, M.; Wan, H.; Li, M.; Li, G.; Han, D. IDD-Net: Industrial defect detection method based on Deep-Learning. Eng. Appl. Artif. Intell. 2023, 123, 106390. [Google Scholar] [CrossRef]
  8. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  9. Wang, G.; Li, W.; Zhou, M.; Zhu, H.; Yang, G.; Yap, C.H. 4D foetal cardiac ultrasound image detection based on deep learning with weakly supervised localisation for rapid diagnosis of evolving hypoplastic left heart syndrome. CAAI Trans. Intell. Technol. 2024, 9, 1485–1499. [Google Scholar] [CrossRef]
  10. Yang, J.; Li, S.; Wang, Z.; Dong, H.; Wang, J.; Tang, S. Using deep learning to detect defects in manufacturing: A comprehensive survey and current challenges. Materials 2020, 13, 5755. [Google Scholar] [CrossRef]
  11. Liu, Y.; Zhang, C.; Dong, X. A survey of real-time surface defect inspection methods based on deep learning. Artif. Intell. Rev. 2023, 56, 12131–12170. [Google Scholar] [CrossRef]
  12. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  13. Qiang, B.; Chen, R.; Zhou, M.; Pang, Y.; Zhai, Y.; Yang, M. Convolutional neural networks-based object detection algorithm by jointing semantic segmentation for images. Sensors 2020, 20, 5080. [Google Scholar] [CrossRef]
  14. Wang, K.; Zhou, M.; Lin, Q.; Niu, G.; Zhang, X. Geometry-Guided Point Generation for 3D Object Detection. IEEE Signal Process. Lett. 2025, 32, 136–140. [Google Scholar] [CrossRef]
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar]
  18. Xie, X.; Cheng, G.; Wang, J.; Li, K.; Yao, X.; Han, J. Oriented R-CNN and beyond. Int. J. Comput. Vis. 2024, 132, 2420–2442. [Google Scholar] [CrossRef]
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  20. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  21. Pan, W.; Chen, J.; Lv, B.; Peng, L. Optimization and application of improved YOLOv9s-UI for underwater object detection. Appl. Sci. 2024, 14, 7162. [Google Scholar]
  22. Li, J.; Feng, Y.; Shao, Y.; Liu, F. IDP-YOLOV9: Improvement of Object Detection Model in Severe Weather Scenarios from Drone Perspective. Appl. Sci. 2024, 14, 5277. [Google Scholar] [CrossRef]
  23. Zheng, D.; Dong, W.; Hu, H.; Chen, X.; Wang, Y. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6674–6683. [Google Scholar]
  24. Zhang, H.; Ma, Z.; Li, X. RS-DETR: An improved remote sensing object detection model based on RT-DETR. Appl. Sci. 2024, 14, 10331. [Google Scholar] [CrossRef]
  25. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
  26. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3651–3660. [Google Scholar]
  27. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
  28. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  29. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  30. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.; Shum, H.-Y. DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–19. [Google Scholar]
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  32. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  33. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef]
  34. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  35. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  36. Jocher, G. Ultralytics YOLOv5. 2020. [CrossRef]
  37. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 December 2025).
  38. Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  39. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  40. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 December 2025).
  41. Tian, Y.; Ye, Q.; Doermann, D. YOLO12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  42. Qiao, S.; Chen, L.C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar]
  43. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  44. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  45. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  46. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
  47. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. Autoassign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496. [Google Scholar] [CrossRef]
  48. Zhu, J.; Wang, X.; Liu, Y.; Ji, Q.; Zhao, Z.; Wang, S. UavTinyDet: Tiny object detection in UAV scenes. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; pp. 195–200. [Google Scholar]
  49. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 10–17 October 2021; pp. 13039–13048. [Google Scholar]
  50. Li, Y.; Wang, Y.; Ma, Z.; Wang, X.; Tang, Y. Sod-Uav: Small Object Detection For Unmanned Aerial Vehicle Images Via Improved Yolov7. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 7610–7614. [Google Scholar] [CrossRef]
Figure 1. Framework of the proposed method.
Figure 2. Qualitative comparison results on SHDD. Each column shows the same sample under different annotations: (a) Ground truth. (b) YOLOv12m. (c) Ours. The examples highlight typical cases with tiny and low-contrast defects, where YOLOv12m may miss some targets, while IM-DETR detects more defects.
Figure 3. Qualitative comparison results on CPSDD. This dataset features defects with irregular deformations and sealant overflow. (a) Ground truth annotations. (b) YOLOv12m predictions. (c) Ours.
Figure 4. Typical failure cases on the SHDD dataset. From left to right: (1) False positives caused by background interference; (2) False negatives on extreme ultra-small targets due to feature vanishing; (3) Camouflaged defects overlooked by the model.
Figure 5. Typical failure cases on the CPSDD dataset. From left to right: (1) partial detection on large defects with ambiguous transition boundaries; (2) conservative detection failing to semantically aggregate clustered defective regions.
Table 1. Statistical details of the SHDD and CPSDD datasets, including acquisition resolutions, bounding box size distributions, and defect instance counts.
Metric                 | Category                      | SHDD        | CPSDD
Acquisition Parameters | Image Resolution              | 1000 × 1000 | 2432 × 2048
Box Size Distribution  | Small (area < 32²)            | 1431        | 5
                       | Medium (32² ≤ area < 96²)     | 42          | 92
                       | Large (area ≥ 96²)            | 0           | 310
Instance Counts        | Defect Targets (Single-Class) | 1473        | 407
Table 2. Quantitative analysis of the SHDD dataset. All the experiments utilize the standard data split for optimization and assessment. Bold entries denote the highest scores.
Method               | GFLOPs | FPS | AP   | AP50 | AP75 | APS  | APM
Faster R-CNN [17]    | 134    | 19  | 11.3 | 34.1 | 3.6  | 10.8 | 35.4
Cascade R-CNN [32]   | 186    | 15  | 14.2 | 42.0 | 5.6  | 13.7 | 38.4
DetectoRS [42]       | 263    | 9   | 13.0 | 40.6 | 5.0  | 12.6 | 33.0
CenterNet [34]       | 123    | 24  | 17.9 | 51.0 | 5.8  | 17.7 | 31.2
ATSS [43]            | 126    | 23  | 14.8 | 44.7 | 5.4  | 14.9 | 26.6
FCOS [33]            | 123    | 21  | 15.6 | 44.2 | 6.1  | 15.9 | 31.6
Deformable DETR [25] | 126    | 12  | 12.9 | 37.7 | 5.3  | 12.7 | 33.8
DINO [30]            | 179    | 10  | 16.0 | 46.3 | 5.8  | 15.7 | 32.3
YOLOv3u [35]         | 283    | 112 | 20.6 | 50.6 | 12.3 | 20.0 | 45.2
YOLOv5m [36]         | 64     | 124 | 20.6 | 52.7 | 11.5 | 20.4 | 40.6
YOLOv8m [37]         | 79     | 133 | 20.8 | 49.5 | 11.9 | 20.5 | 46.9
YOLOv9m [38]         | 76     | 66  | 20.3 | 48.8 | 12.0 | 20.1 | 46.1
YOLOv10m [39]        | 64     | 74  | 19.5 | 48.6 | 11.4 | 19.3 | 35.1
YOLOv11m [40]        | 68     | 98  | 21.1 | 51.1 | 14.0 | 21.0 | 42.2
YOLOv12m [41]        | 68     | 68  | 19.9 | 48.9 | 10.9 | 19.7 | 38.9
Ours                 | 132    | 95  | 23.2 | 60.8 | 11.5 | 22.6 | 53.3
Table 3. Benchmarking results for CPSDD. Our architecture has distinct advantages in identifying medium- and large-scale anomalies typical of this domain. Bold entries denote the highest scores.
Method               | AP   | AP50 | AP75 | APM | APL
Faster R-CNN [17]    | 48.5 | 83.9 | 47.5 | 2.6 | 53.9
Cascade R-CNN [32]   | 47.0 | 84.5 | 49.3 | 0.5 | 52.7
DetectoRS [42]       | 46.1 | 85.5 | 43.7 | 1.3 | 51.5
CenterNet [34]       | 44.3 | 78.9 | 45.6 | 0.0 | 50.0
ATSS [43]            | 44.2 | 80.3 | 40.6 | 0.1 | 49.7
FCOS [33]            | 43.2 | 81.0 | 38.4 | 0.0 | 48.9
Deformable DETR [25] | 31.7 | 59.9 | 30.6 | 0.0 | 35.8
DINO [30]            | 48.3 | 79.3 | 50.9 | 1.3 | 54.5
YOLOv3u [35]         | 47.9 | 83.8 | 45.3 | 1.5 | 53.4
YOLOv5m [36]         | 50.3 | 82.0 | 57.4 | 2.9 | 56.0
YOLOv8m [37]         | 48.2 | 81.8 | 49.4 | 2.2 | 53.6
YOLOv9m [38]         | 48.9 | 80.3 | 51.3 | 7.2 | 54.6
YOLOv10m [39]        | 48.9 | 77.7 | 52.4 | 2.2 | 54.3
YOLOv11m [40]        | 48.9 | 80.0 | 53.8 | 1.9 | 54.7
YOLOv12m [41]        | 49.9 | 82.3 | 51.1 | 0.5 | 56.0
Ours                 | 52.9 | 87.6 | 54.1 | 7.2 | 58.1
Table 4. Benchmarking results on VisDrone2019. Our architecture has distinct advantages. We trained on the train set and validated/tested on the val set. Bold entries denote the highest scores.
Method               | mAP  | AP50 | AP75 | APS  | APM  | APL
Faster R-CNN [17]    | 24.5 | 42.5 | 24.6 | 16.1 | 35.9 | 36.5
RetinaNet [45]       | 22.1 | 36.8 | 23.0 | 10.4 | 35.7 | 46.0
Cascade R-CNN [32]   | 25.6 | 43.1 | 26.2 | 16.4 | 37.4 | 41.4
ClusDet [46]         | 28.4 | 53.2 | 26.4 | 19.1 | 40.8 | 54.4
CenterNet [34]       | 21.4 | 36.1 | 21.8 | 12.6 | 31.7 | 38.3
ATSS [43]            | 27.6 | 45.5 | 28.5 | 18.1 | 39.2 | 42.1
FCOS [33]            | 23.7 | 39.9 | 24.6 | 14.6 | 34.4 | 42.0
AutoAssign [47]      | 23.2 | 43.5 | 21.8 | 15.2 | 33.3 | 40.9
UavTinyDet [48]      | 24.8 | 41.2 | 24.9 | 14.7 | 37.1 | 52.2
Deformable DETR [25] | 25.5 | 44.1 | 24.9 | 17.2 | 35.5 | 41.1
DINO [30]            | 26.8 | 44.2 | 28.9 | 17.5 | 37.3 | 41.3
YOLOF [49]           | 15.1 | 26.3 | 15.4 | 6.1  | 25.2 | 32.4
YOLOv7m [20]         | 27.7 | 45.9 | 28.1 | 17.7 | 40.0 | 57.5
YOLOv8m [37]         | 28.5 | 49.2 | 28.3 | 19.5 | 40.3 | 46.5
YOLOv10m [39]        | 29.1 | 49.6 | 29.0 | 19.9 | 41.2 | 47.2
SOD-UAV [50]         | 26.3 | 45.7 | 26.8 | 15.6 | 37.6 | 47.6
Ours                 | 29.5 | 49.9 | 29.4 | 19.5 | 42.0 | 58.6
Table 5. Ablation study on feature scales using the SHDD dataset. The experiments utilize the same data split. Bold entries denote the highest scores.
S3 | S4 | S5 | AP   | AP50 | AP75 | APS  | APM
✓  |    |    | 21.0 | 55.7 | 9.1  | 20.2 | 52.5
✓  | ✓  |    | 21.9 | 59.3 | 8.5  | 21.3 | 49.2
✓  | ✓  | ✓  | 23.2 | 60.8 | 11.5 | 22.6 | 53.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
