ERZA-DETR: A Deep Learning-Based Detection Transformer with Enhanced Relational-Zone Aggregation for WCE Lesion Detection
Abstract
1. Introduction
- We propose ERZA-DETR, a detection transformer framework for wireless capsule endoscopy (WCE) lesion detection that incorporates frequency-domain enhancement and structure-aware feature modeling. The proposed architecture is designed to better cope with the complex texture and spectral characteristics of WCE imagery.
- We introduce three complementary modules to enhance representation learning within the detection transformer. The Dual-Band Adaptive Fourier Spectral module (DBFS) performs adaptive frequency-domain modulation to refine spectral responses under complex illumination conditions. The Fused Dual-scale Gated Convolution module (FD-gConv) improves multi-scale feature aggregation for small lesion representation. The Graph-Linked Embedding at Semantic Scales module (GLES) models structural relationships on semantic feature maps through coordinate-gated graph aggregation.
- By integrating these three modules into a unified framework, ERZA-DETR forms an efficient detection architecture that improves lesion representation and sensitivity to subtle, low-contrast lesions while maintaining real-time inference, providing a clinically viable solution for automated WCE lesion detection.
- Extensive experiments on the SEE-AI WCE dataset validate the effectiveness of the proposed design. ERZA-DETR demonstrates superior detection capability compared with strong baselines and recent detectors, particularly in challenging scenarios involving small or visually subtle lesions.
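The frequency-domain modulation idea behind DBFS can be illustrated with a minimal sketch: split the centered 2D spectrum of a feature map into a low-frequency and a high-frequency band at a normalized cutoff radius, and scale each band with its own gain. The function name, fixed cutoff, and scalar gains below are illustrative assumptions for a single-channel map; the paper's DBFS module learns its modulation adaptively.

```python
import numpy as np

def dual_band_fourier_modulation(x, radius=0.25, low_gain=1.0, high_gain=1.5):
    """Modulate a 2D feature map in the Fourier domain.

    Splits the centered spectrum into a low-frequency band (inside the
    normalized `radius`) and a high-frequency band (outside it), scales
    each band by its own gain, and transforms back. A hand-rolled sketch
    of dual-band spectral filtering, not the paper's exact DBFS module.
    """
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))

    # Normalized radial distance of each frequency bin from the center.
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt(((yy - h / 2) / (h / 2)) ** 2 + ((xx - w / 2) / (w / 2)) ** 2)
    low_band = dist <= radius

    # Scale each band independently, then invert the transform.
    spec = np.where(low_band, spec * low_gain, spec * high_gain)
    return np.fft.ifft2(np.fft.ifftshift(spec)).real

# Example: boosting the high band sharpens fine texture in a feature map.
img = np.random.default_rng(0).random((64, 64))
out = dual_band_fourier_modulation(img, radius=0.25, low_gain=1.0, high_gain=1.5)
```

With both gains set to 1.0 the transform is an identity round-trip, which makes the band split easy to sanity-check; the cutoff radius plays the role examined in the paper's frequency radius sensitivity analysis (Section 5.1).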
2. Related Work
2.1. Object Detection in WCE
2.2. Transformer-Based Detection and Hybrid Frameworks
2.3. Frequency Domain and Graph Learning in Vision
3. Materials and Methods
3.1. Dual-Band Adaptive Fourier Spectral Module (DBFS)
3.2. Fused Dual-Scale Gated Convolutional Module (FD-gConv)
3.2.1. Value Pathway with Depthwise Convolution
3.2.2. Gate Pathway with Channel Modulation
3.2.3. Residual Feature Fusion
3.3. Graph-Linked Embedding at Semantic Scales Module (GLES)
3.3.1. Graph Topology Construction
3.3.2. Coordinate-Aware Node Modulation
3.3.3. Structure-Guided Graph Aggregation
4. Experiments
4.1. Dataset
4.2. Evaluation Metrics
4.3. Implementation Details
5. Results and Discussion
5.1. Frequency Radius Sensitivity Analysis
5.2. Clinical Imbalance Analysis
5.3. Comparison with State-of-the-Art Methods
5.4. Ablation Studies and Analysis
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Iddan, G.; Meron, G.; Glukhovsky, A.; Swain, P. Wireless capsule endoscopy. Nature 2000, 405, 417. [Google Scholar] [CrossRef] [PubMed]
- Pennazio, M.; Rondonotti, E.; Despott, E.J.; Dray, X.; Keuchel, M.; Moreels, T.; Sanders, D.S.; Spada, C.; Carretero, C.; Valdivia, P.C.; et al. Small-bowel capsule endoscopy and device-assisted enteroscopy for diagnosis and treatment of small-bowel disorders: European Society of Gastrointestinal Endoscopy (ESGE) Guideline–Update 2022. Endoscopy 2023, 55, 58–95. [Google Scholar] [CrossRef] [PubMed]
- Zha, B.; Cai, A.; Wang, G. Diagnostic Accuracy of Artificial Intelligence in Endoscopy: Umbrella Review. JMIR Med. Inform. 2024, 12, e56361. [Google Scholar] [CrossRef] [PubMed]
- Trasolini, R.; Byrne, M.F. Artificial intelligence and deep learning for small bowel capsule endoscopy. Dig. Endosc. 2021, 33, 290–297. [Google Scholar] [CrossRef]
- Cao, Q.; Deng, R.; Pan, Y.; Liu, R.; Chen, Y.; Gong, G.; Zou, J.; Yang, H.; Han, D. Robotic wireless capsule endoscopy: Recent advances and upcoming technologies. Nat. Commun. 2024, 15, 4597. [Google Scholar] [CrossRef]
- Tontini, G.E.; Vecchi, M.; Neurath, M.F.; Neumann, H. Advanced endoscopic imaging techniques in Crohn’s disease. J. Crohn’s Colitis 2014, 8, 261–269. [Google Scholar] [CrossRef]
- Son, G.; Eo, T.; An, J.; Oh, D.J.; Shin, Y.; Rha, H.; Kim, Y.J.; Lim, Y.J.; Hwang, D. Small bowel detection for wireless capsule endoscopy using convolutional neural networks with temporal filtering. Diagnostics 2022, 12, 1858. [Google Scholar] [CrossRef]
- Zhao, X.; Fang, C.; Gao, F.; Fan, D.J.; Lin, X.; Li, G. Deep transformers for fast small intestine grounding in capsule endoscope video. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI); IEEE: Piscataway, NJ, USA, 2021; pp. 150–154. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
- Ye, S.; Meng, Q.; Zhang, S.; Wang, H. Multi-Scale Feature Fusion Network Model for Wireless Capsule Endoscopic Intestinal Lesion Detection. Comput. Mater. Contin. 2025, 82, 2043–2059. [Google Scholar] [CrossRef]
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2961–2969. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768. [Google Scholar]
- Vieira, P.M.; Freitas, N.R.; Lima, V.B.; Costa, D.; Rolanda, C.; Lima, C.S. Multi-pathology detection and lesion localization in WCE videos by using the instance segmentation approach. Artif. Intell. Med. 2021, 119, 102141. [Google Scholar] [CrossRef]
- Alam, M.J.; Rashid, R.B.; Fattah, S.A.; Saquib, M. RAT-CapsNet: A deep learning network utilizing attention and regional information for abnormality detection in wireless capsule endoscopy. IEEE J. Transl. Eng. Health Med. 2022, 10, 1–8. [Google Scholar] [CrossRef] [PubMed]
- Jain, S.; Seal, A.; Ojha, A.; Yazidi, A.; Bures, J.; Tacheci, I.; Krejcar, O. A deep CNN model for anomaly detection and localization in wireless capsule endoscopy images. Comput. Biol. Med. 2021, 137, 104789. [Google Scholar] [CrossRef] [PubMed]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar] [CrossRef]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar]
- Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar] [CrossRef]
- Hosain, A.S.; Islam, M.; Mehedi, M.H.K.; Kabir, I.E.; Khan, Z.T. Gastrointestinal disorder detection with a transformer based approach. In Proceedings of the 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON); IEEE: Piscataway, NJ, USA, 2022; pp. 280–285. [Google Scholar]
- Habe, T.T.; Haataja, K.; Toivanen, P. Precision enhancement in wireless capsule endoscopy: A novel transformer-based approach for real-time video object detection. Front. Artif. Intell. 2025, 8, 1529814. [Google Scholar] [CrossRef]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
- Nam, J.H.; Syazwany, N.S.; Kim, S.J.; Lee, S.C. Modality-agnostic domain generalizable medical image segmentation by multi-frequency in multi-scale attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 11480–11491. [Google Scholar]
- Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: Piscataway, NJ, USA, 2025; pp. 30178–30188. [Google Scholar]
- Liu, Y.; Wang, J.; Huang, C.; Wang, Y.; Xu, Y. CIGAR: Cross-modality graph reasoning for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 23776–23786. [Google Scholar]
- Im, J.; Nam, J.; Park, N.; Lee, H.; Park, S. EGTR: Extracting graph from transformer for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 24229–24238. [Google Scholar]
- Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733. [Google Scholar] [CrossRef]
- Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 6896–6904. [Google Scholar]
- Chiley, V.; Thangarasa, V.; Gupta, A.; Samar, A.; Hestness, J.; DeCoste, D. RevBiFPN: The fully reversible bidirectional feature pyramid network. Proc. Mach. Learn. Syst. 2023, 5, 625–645. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar] [CrossRef]
- Chen, Y.; You, J.; He, J.; Lin, Y.; Peng, Y.; Wu, C.; Zhu, Y. SP-GNN: Learning structure and position information from graphs. Neural Netw. 2023, 161, 505–514. [Google Scholar] [CrossRef]
- You, J.; Ying, R.; Leskovec, J. Position-aware graph neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 7134–7143. [Google Scholar]
- Yokote, A.; Umeno, J.; Kawasaki, K.; Fujioka, S.; Fuyuno, Y.; Matsuno, Y.; Yoshida, Y.; Imazu, N.; Miyazono, S.; Moriyama, T.; et al. Small bowel capsule endoscopy examination and open access database with artificial intelligence: The SEE-artificial intelligence project. DEN Open 2024, 4, e258. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]

| Hyperparameter | Value |
|---|---|
| Input size | 640 × 640 |
| Batch size | 16 |
| Training epochs | 150 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Initial learning rate | 0.0001 |
| Weight decay | 0.0001 |
| Data augmentation | Random photometric distortion, random zoom-out, random IoU crop, random horizontal flip |
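The optimizer settings in the table follow the standard AdamW update with decoupled weight decay. As a reference, a single AdamW step with the table's hyperparameters can be sketched in plain NumPy; this mirrors the textbook update rule and is not the authors' training code.

```python
import numpy as np

# Training hyperparameters taken from the table above.
CFG = {"lr": 1e-4, "betas": (0.9, 0.999), "weight_decay": 1e-4, "eps": 1e-8}

def adamw_step(param, grad, m, v, t, cfg=CFG):
    """One AdamW update (decoupled weight decay).

    Returns the updated parameter and the running first/second moment
    estimates. `t` is the 1-based step count used for bias correction.
    """
    b1, b2 = cfg["betas"]
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    # Weight decay is applied directly to the parameter, decoupled from
    # the gradient-based step (this is what distinguishes AdamW from Adam).
    param = param - cfg["lr"] * (m_hat / (np.sqrt(v_hat) + cfg["eps"])
                                 + cfg["weight_decay"] * param)
    return param, m, v

# Example: one step on a scalar parameter with a unit gradient.
p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, 1.0, m, v, t=1)
```

In practice this is what `torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)` computes per parameter.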
| Model | Angiodysplasia | Erosion | Stenosis | Lymphangiectasia | Lymph Follicle | SMT | Polyp-like | Bleeding | Erythema | Diverticulum | Foreign Body | Vein |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RT-DETR | 0.708 | 0.718 | 0.904 | 0.770 | 0.554 | 0.690 | 0.753 | 0.726 | 0.698 | 0.469 | 0.724 | 0.849 |
| RT-DETRv2 | 0.713 | 0.723 | 0.908 | 0.775 | 0.560 | 0.695 | 0.758 | 0.731 | 0.704 | 0.473 | 0.730 | 0.852 |
| ERZA-DETR (Ours) | 0.742 | 0.755 | 0.932 | 0.806 | 0.660 | 0.725 | 0.789 | 0.762 | 0.733 | 0.509 | 0.766 | 0.873 |
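A model-level score is the unweighted macro-average of the per-class values in this table (assuming the table reports per-class AP@50, which the metric names below presume). For the ERZA-DETR row this averages to roughly 0.754, consistent with the 0.755 mAP@50 reported in the overall comparison up to per-class rounding:

```python
# Per-class values for ERZA-DETR copied from the table above.
AP50 = {
    "angiodysplasia": 0.742, "erosion": 0.755, "stenosis": 0.932,
    "lymphangiectasia": 0.806, "lymph follicle": 0.660, "SMT": 0.725,
    "polyp-like": 0.789, "bleeding": 0.762, "erythema": 0.733,
    "diverticulum": 0.509, "foreign body": 0.766, "vein": 0.873,
}

# Macro-average: every class contributes equally, regardless of how many
# instances it has -- which is why rare classes like diverticulum matter.
mean_ap50 = sum(AP50.values()) / len(AP50)
print(f"macro-averaged AP@50: {mean_ap50:.3f}")  # prints 0.754
```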
| Model | Params (M) | GFLOPs | Epoch | mAP | mAP@50 |
|---|---|---|---|---|---|
| Faster R-CNN [12] | 41.2 | 127.0 | 500 | 0.316 | 0.565 |
| SSD [39] | 26.3 | 342.8 | 500 | 0.293 | 0.532 |
| YOLOv8-L | 43.7 | 165.2 | 300 | 0.416 | 0.727 |
| YOLOv11-L | 25.3 | 86.9 | 300 | 0.421 | 0.742 |
| YOLOv12-L [11] | 26.4 | 88.9 | 300 | 0.454 | 0.729 |
| YOLOv13-L [30] | 27.6 | 89.0 | 300 | 0.474 | 0.743 |
| DETR [18] | 40.0 | 187.1 | 500 | 0.359 | 0.617 |
| Deformable DETR [19] | 40.2 | 172.5 | 150 | 0.385 | 0.648 |
| RT-DETR-R50 [20] | 42.7 | 137.7 | 150 | 0.422 | 0.718 |
| RT-DETRv2-R50 [21] | 42.7 | 137.7 | 150 | 0.434 | 0.723 |
| ERZA-DETR (Ours) | 37.4 | 110.5 | 150 | 0.454 | 0.755 |
| DBFS | FD-gConv | GLES | mAP | mAP@50 |
|---|---|---|---|---|
| - | - | - | 0.434 | 0.723 |
| ✓ | - | - | 0.437 | 0.734 |
| - | ✓ | - | 0.438 | 0.741 |
| - | - | ✓ | 0.437 | 0.735 |
| ✓ | ✓ | - | 0.440 | 0.745 |
| ✓ | - | ✓ | 0.439 | 0.742 |
| - | ✓ | ✓ | 0.440 | 0.747 |
| ✓ | ✓ | ✓ | 0.454 | 0.755 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ye, S.; Ma, H.; Zhang, Z.; Li, L. ERZA-DETR: A Deep Learning-Based Detection Transformer with Enhanced Relational-Zone Aggregation for WCE Lesion Detection. Algorithms 2026, 19, 268. https://doi.org/10.3390/a19040268
