IM-DETR: DETR with Mix-Encoder for Industrial Scenarios
Abstract
1. Introduction
- Two real-world industrial defect datasets are constructed from actual production lines, namely, the Stator Housing Defect Dataset (SHDD) and the Cover Plate Silicone Defect Dataset (CPSDD). These datasets cover representative industrial inspection scenarios with distinct defect characteristics, including subtle small-scale defects and irregular large-scale deformations. They provide realistic benchmarks for evaluating detection algorithms under practical industrial conditions involving complex backgrounds, subtle defect patterns, and strict reliability requirements.
- We propose IM-DETR with a mix-encoder, a transformer-based end-to-end defect detection framework specifically designed for industrial scenarios. The proposed mix-encoder integrates heterogeneous multi-scale feature representations, enabling the model to jointly capture fine-grained local details and global contextual dependencies, thereby improving detection robustness and stability for challenging industrial defects.
- Extensive experiments conducted on the constructed industrial datasets demonstrate that IM-DETR achieves consistent performance improvements over representative CNN- and transformer-based detectors. The proposed framework shows strong effectiveness in handling subtle defects, complex backgrounds, and appearance ambiguity, highlighting its practical relevance for real-world industrial inspection tasks.
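The multi-scale fusion idea behind the mix-encoder can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the single-head attention, the 3×3 mean filter standing in for a convolutional local branch, and the nearest-neighbor upsampling are simplified stand-ins, and every function name is illustrative. The sketch only shows the general pattern of applying a global (attention) branch on the coarsest scale, local filtering on the finer scales, and fusing top-down.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_branch(feat):
    """Plain single-head self-attention over the flattened coarse map."""
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T                     # (hw, c)
    attn = softmax(tokens @ tokens.T / np.sqrt(c), axis=-1)
    return (attn @ tokens).T.reshape(c, h, w)

def local_branch(feat):
    """3x3 mean filter as a stand-in for a local convolutional branch."""
    c, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(feat)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def upsample2x(feat):
    """Nearest-neighbor 2x spatial upsampling."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def mix_encoder(s3, s4, s5):
    """Fuse global context from the coarsest map (S5) into the finer
    S4/S3 maps top-down; shapes: s5 (c,h,w), s4 (c,2h,2w), s3 (c,4h,4w)."""
    g5 = global_branch(s5)
    f4 = local_branch(s4) + upsample2x(g5)
    f3 = local_branch(s3) + upsample2x(f4)
    return f3, f4, g5
```

In a real detector the three inputs would be backbone feature maps and the branches learned modules; here the point is only the heterogeneous local/global pairing across scales.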
2. Related Work
3. Methodology
3.1. Framework
3.2. Backbone Architecture
3.3. Mix Encoder for Multi-Scale Defect Enhancement
3.4. Loss Function Design
3.5. Computational Complexity Analysis
4. Results
4.1. Experimental Settings
4.2. Comparison Experiments
4.3. Ablation Study
4.4. Visualization
4.5. Error Analysis and Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chen, Y.; Ding, Y.; Zhao, F.; Zhang, E.; Wu, Z.; Shao, L. Surface defect detection methods for industrial products: A review. Appl. Sci. 2021, 11, 7657. [Google Scholar] [CrossRef]
- Yang, B.; Zhang, X.; Zhang, J.; Luo, J.; Zhou, M.; Pi, Y. EFLNet: Enhancing feature learning network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5906511. [Google Scholar] [CrossRef]
- Saberironaghi, A.; Ren, J.; El-Gindy, M. Defect detection methods for industrial products using deep learning techniques: A review. Algorithms 2023, 16, 95. [Google Scholar] [CrossRef]
- Shen, W.; Zhou, M.; Luo, J.; Li, Z.; Kwong, S. Graph-Represented Distribution Similarity Index for Full-Reference Image Quality Assessment. IEEE Trans. Image Process. 2024, 33, 3075–3089. [Google Scholar] [CrossRef]
- Zhou, M.; Zhao, X.; Luo, F.; Luo, J.; Pu, H.; Xiang, T. Robust RGB-T Tracking via Adaptive Modality Weight Correlation Filters and Cross-modality Learning. ACM Trans. Multimedia Comput. Commun. Appl. 2023, 20, 95. [Google Scholar] [CrossRef]
- Song, J.; Zhou, M.; Luo, J.; Pu, H.; Feng, Y.; Wei, X.; Jia, W. Boundary-Aware Feature Fusion with Dual-Stream Attention for Remote Sensing Small Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5600213. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhou, M.; Wan, H.; Li, M.; Li, G.; Han, D. IDD-Net: Industrial defect detection method based on Deep-Learning. Eng. Appl. Artif. Intell. 2023, 123, 106390. [Google Scholar] [CrossRef]
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Wang, G.; Li, W.; Zhou, M.; Zhu, H.; Yang, G.; Yap, C.H. 4D foetal cardiac ultrasound image detection based on deep learning with weakly supervised localisation for rapid diagnosis of evolving hypoplastic left heart syndrome. CAAI Trans. Intell. Technol. 2024, 9, 1485–1499. [Google Scholar] [CrossRef]
- Yang, J.; Li, S.; Wang, Z.; Dong, H.; Wang, J.; Tang, S. Using deep learning to detect defects in manufacturing: A comprehensive survey and current challenges. Materials 2020, 13, 5755. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, C.; Dong, X. A survey of real-time surface defect inspection methods based on deep learning. Artif. Intell. Rev. 2023, 56, 12131–12170. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Qiang, B.; Chen, R.; Zhou, M.; Pang, Y.; Zhai, Y.; Yang, M. Convolutional neural networks-based object detection algorithm by jointing semantic segmentation for images. Sensors 2020, 20, 5080. [Google Scholar] [CrossRef]
- Wang, K.; Zhou, M.; Lin, Q.; Niu, G.; Zhang, X. Geometry-Guided Point Generation for 3D Object Detection. IEEE Signal Process. Lett. 2025, 32, 136–140. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar]
- Xie, X.; Cheng, G.; Wang, J.; Li, K.; Yao, X.; Han, J. Oriented R-CNN and beyond. Int. J. Comput. Vis. 2024, 132, 2420–2442. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Pan, W.; Chen, J.; Lv, B.; Peng, L. Optimization and application of improved YOLOv9s-UI for underwater object detection. Appl. Sci. 2024, 14, 7162. [Google Scholar]
- Li, J.; Feng, Y.; Shao, Y.; Liu, F. IDP-YOLOV9: Improvement of Object Detection Model in Severe Weather Scenarios from Drone Perspective. Appl. Sci. 2024, 14, 5277. [Google Scholar] [CrossRef]
- Zheng, D.; Dong, W.; Hu, H.; Chen, X.; Wang, Y. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6674–6683. [Google Scholar]
- Zhang, H.; Ma, Z.; Li, X. RS-DETR: An improved remote sensing object detection model based on RT-DETR. Appl. Sci. 2024, 14, 10331. [Google Scholar] [CrossRef]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3651–3660. [Google Scholar]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.; Shum, H.-Y. DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–19. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933. [Google Scholar] [CrossRef]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 11 December 2025).
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 December 2025).
- Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
- Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 December 2025).
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
- Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
- Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
- Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496. [Google Scholar] [CrossRef]
- Zhu, J.; Wang, X.; Liu, Y.; Ji, Q.; Zhao, Z.; Wang, S. UavTinyDet: Tiny object detection in UAV scenes. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; pp. 195–200. [Google Scholar]
- Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13039–13048. [Google Scholar]
- Li, Y.; Wang, Y.; Ma, Z.; Wang, X.; Tang, Y. SOD-UAV: Small object detection for unmanned aerial vehicle images via improved YOLOv7. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 7610–7614. [Google Scholar] [CrossRef]
| Metric | Category | SHDD | CPSDD |
|---|---|---|---|
| Acquisition Parameters | Image Resolution | | |
| Box Size Distribution | Small (area < 32²) | 1431 | 5 |
| | Medium (32² ≤ area ≤ 96²) | 42 | 92 |
| | Large (area > 96²) | 0 | 310 |
| Instance Counts | Defect Targets (Single-Class) | 1473 | 407 |
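The size buckets in the dataset statistics appear to follow the COCO convention (small: area < 32², medium: 32² ≤ area ≤ 96², large: area > 96²); since the exact thresholds are not stated here, the values below are an assumption. A small helper makes the bucketing explicit:

```python
def size_bucket(area, small_max=32 ** 2, large_min=96 ** 2):
    """Classify a box by pixel area using COCO-style thresholds (assumed)."""
    if area < small_max:
        return "small"
    if area <= large_min:
        return "medium"
    return "large"

def bucket_counts(areas):
    """Count boxes per size bucket, as in the dataset-statistics table."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for a in areas:
        counts[size_bucket(a)] += 1
    return counts
```

Under these thresholds, nearly all SHDD defects (1431 of 1473) fall in the small bucket, while CPSDD is dominated by medium and large instances, which motivates evaluating APS on SHDD and APM/APL on CPSDD.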
| Method | GFLOPs | FPS | AP | AP50 | AP75 | APS | APM |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [17] | 134 | 19 | 11.3 | 34.1 | 3.6 | 10.8 | 35.4 |
| Cascade R-CNN [32] | 186 | 15 | 14.2 | 42.0 | 5.6 | 13.7 | 38.4 |
| DetectoRS [42] | 263 | 9 | 13.0 | 40.6 | 5.0 | 12.6 | 33.0 |
| CenterNet [34] | 123 | 24 | 17.9 | 51.0 | 5.8 | 17.7 | 31.2 |
| ATSS [43] | 126 | 23 | 14.8 | 44.7 | 5.4 | 14.9 | 26.6 |
| FCOS [33] | 123 | 21 | 15.6 | 44.2 | 6.1 | 15.9 | 31.6 |
| Deformable DETR [25] | 126 | 12 | 12.9 | 37.7 | 5.3 | 12.7 | 33.8 |
| DINO [30] | 179 | 10 | 16.0 | 46.3 | 5.8 | 15.7 | 32.3 |
| YOLOv3u [35] | 283 | 112 | 20.6 | 50.6 | 12.3 | 20.0 | 45.2 |
| YOLOv5m [36] | 64 | 124 | 20.6 | 52.7 | 11.5 | 20.4 | 40.6 |
| YOLOv8m [37] | 79 | 133 | 20.8 | 49.5 | 11.9 | 20.5 | 46.9 |
| YOLOv9m [38] | 76 | 66 | 20.3 | 48.8 | 12.0 | 20.1 | 46.1 |
| YOLOv10m [39] | 64 | 74 | 19.5 | 48.6 | 11.4 | 19.3 | 35.1 |
| YOLOv11m [40] | 68 | 98 | 21.1 | 51.1 | 14.0 | 21.0 | 42.2 |
| YOLOv12m [41] | 68 | 68 | 19.9 | 48.9 | 10.9 | 19.7 | 38.9 |
| Ours | 132 | 95 | 23.2 | 60.8 | 11.5 | 22.6 | 53.3 |
| Method | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|
| Faster R-CNN [17] | 48.5 | 83.9 | 47.5 | 2.6 | 53.9 |
| Cascade R-CNN [32] | 47.0 | 84.5 | 49.3 | 0.5 | 52.7 |
| DetectoRS [42] | 46.1 | 85.5 | 43.7 | 1.3 | 51.5 |
| CenterNet [34] | 44.3 | 78.9 | 45.6 | 0.0 | 50.0 |
| ATSS [43] | 44.2 | 80.3 | 40.6 | 0.1 | 49.7 |
| FCOS [33] | 43.2 | 81.0 | 38.4 | 0.0 | 48.9 |
| Deformable DETR [25] | 31.7 | 59.9 | 30.6 | 0.0 | 35.8 |
| DINO [30] | 48.3 | 79.3 | 50.9 | 1.3 | 54.5 |
| YOLOv3u [35] | 47.9 | 83.8 | 45.3 | 1.5 | 53.4 |
| YOLOv5m [36] | 50.3 | 82.0 | 57.4 | 2.9 | 56.0 |
| YOLOv8m [37] | 48.2 | 81.8 | 49.4 | 2.2 | 53.6 |
| YOLOv9m [38] | 48.9 | 80.3 | 51.3 | 7.2 | 54.6 |
| YOLOv10m [39] | 48.9 | 77.7 | 52.4 | 2.2 | 54.3 |
| YOLOv11m [40] | 48.9 | 80.0 | 53.8 | 1.9 | 54.7 |
| YOLOv12m [41] | 49.9 | 82.3 | 51.1 | 0.5 | 56.0 |
| Ours | 52.9 | 87.6 | 54.1 | 7.2 | 58.1 |
| Method | mAP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| Faster R-CNN [17] | 24.5 | 42.5 | 24.6 | 16.1 | 35.9 | 36.5 |
| RetinaNet [45] | 22.1 | 36.8 | 23.0 | 10.4 | 35.7 | 46.0 |
| Cascade R-CNN [32] | 25.6 | 43.1 | 26.2 | 16.4 | 37.4 | 41.4 |
| ClusDet [46] | 28.4 | 53.2 | 26.4 | 19.1 | 40.8 | 54.4 |
| CenterNet [34] | 21.4 | 36.1 | 21.8 | 12.6 | 31.7 | 38.3 |
| ATSS [43] | 27.6 | 45.5 | 28.5 | 18.1 | 39.2 | 42.1 |
| FCOS [33] | 23.7 | 39.9 | 24.6 | 14.6 | 34.4 | 42.0 |
| AutoAssign [47] | 23.2 | 43.5 | 21.8 | 15.2 | 33.3 | 40.9 |
| UavTinyDet [48] | 24.8 | 41.2 | 24.9 | 14.7 | 37.1 | 52.2 |
| Deformable DETR [25] | 25.5 | 44.1 | 24.9 | 17.2 | 35.5 | 41.1 |
| DINO [30] | 26.8 | 44.2 | 28.9 | 17.5 | 37.3 | 41.3 |
| YOLOF [49] | 15.1 | 26.3 | 15.4 | 6.1 | 25.2 | 32.4 |
| YOLOv7m [20] | 27.7 | 45.9 | 28.1 | 17.7 | 40.0 | 57.5 |
| YOLOv8m [37] | 28.5 | 49.2 | 28.3 | 19.5 | 40.3 | 46.5 |
| YOLOv10m [39] | 29.1 | 49.6 | 29.0 | 19.9 | 41.2 | 47.2 |
| SOD-UAV [50] | 26.3 | 45.7 | 26.8 | 15.6 | 37.6 | 47.6 |
| Ours | 29.5 | 49.9 | 29.4 | 19.5 | 42.0 | 58.6 |
| S3 | S4 | S5 | AP | AP50 | AP75 | APS | APM |
|---|---|---|---|---|---|---|---|
| | | ✓ | 21.0 | 55.7 | 9.1 | 20.2 | 52.5 |
| | ✓ | ✓ | 21.9 | 59.3 | 8.5 | 21.3 | 49.2 |
| ✓ | ✓ | ✓ | 23.2 | 60.8 | 11.5 | 22.6 | 53.3 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Liu, S.; Feng, Y.; Wang, D.; Zhou, Z.; Wang, H.; Wu, J.; Wang, X.; Wei, X.; Yan, J.; Xian, W.; et al. IM-DETR: DETR with Mix-Encoder for Industrial Scenarios. Appl. Sci. 2026, 16, 3345. https://doi.org/10.3390/app16073345

