AERIS-ED: A Novel Efficient Attention Riser for Multi-Scale Object Detection in Remote Sensing
Abstract
1. Introduction
1.1. Overview
1.2. Related Works
- Scale-aware feature enrichment and attention/context aggregation (e.g., lightweight modular necks, content-interaction units, class-aware losses), and
- Multimodal integration (RGB–IR, Optical–SAR) that leverages complementary cues through fusion or translation.
1.3. Motivation and Main Contributions
- Latency-Reducing EA Design: By replacing the conventional 3 × 3 dense mixing with a combination of 1 × 1 dimensionality reduction, depthwise operations, and linearized attention at the P3–P4 levels, we achieve approximately 31% lower latency and 45% higher FPS than the YOLOv12s baseline.
- Hybrid Channel–Spatial Attention Mechanism: We designed an enhanced EA module that jointly exploits channel-based and spatial attention pathways. This dual-path design was developed to capture critical spatial relationships and channel dependencies that are often overlooked by single-path attention approaches.
- Multi-Scale Feature Enrichment Framework: We developed a multi-scale detection framework that handles objects across size categories through attention-guided feature fusion. The design targets a core difficulty of remote sensing imagery, where objects within a single scene often differ considerably in scale.
- Scale-Aware Selective Attention Architecture: Instead of applying attention mechanisms uniformly across every feature stage, we introduced a focused placement strategy where Efficient Attention (EA) is incorporated solely at the P3 and P4 levels of the feature pyramid. By concentrating attention at these scales, the model effectively tackles the persistent difficulties in detecting small- and medium-sized objects, while simultaneously maintaining high computational efficiency.
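The latency-reducing design in the first contribution rests on one identity: applying softmax to queries and keys separately lets KᵀV be computed first, replacing the O(n²) token-by-token attention map with a small d × d context matrix. The following NumPy sketch illustrates generic linearized ("efficient") attention in this style; it is an illustration of the technique, not the authors' exact implementation (shapes and dimensions are hypothetical):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """Linearized attention: softmax(K) over tokens is aggregated into a
    (d_k, d_v) context matrix first, so cost is O(n * d_k * d_v) in the
    number of tokens n instead of O(n^2 * d)."""
    q = softmax(q, axis=-1)   # normalize each query over channels
    k = softmax(k, axis=0)    # normalize keys over the n tokens
    context = k.T @ v         # (d_k, d_v) global context, no n x n map
    return q @ context        # (n, d_v)

# e.g. a P3 feature map flattened to 64 x 64 = 4096 tokens
n, dk, dv = 64 * 64, 32, 32
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for d in (dk, dk, dv))
out = efficient_attention(q, k, v)   # shape (4096, 32)
```

Because the context matrix is tiny relative to the token count at P3/P4 resolutions, this is where the reported latency savings come from.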
2. Background Theories
2.1. Architectural Overview
2.2. The Efficient Attention (EA) Module
3. Experimental Results
3.1. Dataset Description
3.1.1. MAR20
3.1.2. VEDAI
3.2. Performance Evaluation Indicators
3.3. Comparative Results
3.3.1. Comparative Analysis of YOLOv12s and AERIS-ED on Different Datasets
3.3.2. Comparison of AERIS-ED Against Other Models
3.3.3. Scale-Based Performance Analysis
3.3.4. Comparison of Existing Studies
3.3.5. Ablation Study
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Full Form |
|---|---|
| AERIS-ED | Attention-Enhanced Real-time Intelligence System for Efficient Detection |
| C3 | Cross Stage Partial with three convolutions |
| EA | Efficient Attention |
| MAR20 | Military Aircraft Recognition 2020 |
| VEDAI | Vehicle Detection in Aerial Imagery |
| mAP | mean Average Precision |
| IoU | Intersection over Union |
| RS | Remote Sensing |
| APs | Average Precision for Small objects (area smaller than 32² pixels) |
| APm | Average Precision for Medium-scale objects (area between 32² and 96² pixels) |
| APl | Average Precision for Large objects (area exceeding 96² pixels) |
| SPPF | Spatial Pyramid Pooling Fast |
| FPS | Frames Per Second |
| TP | true positives |
| FP | false positives |
| FN | false negatives |
| PR | Precision–Recall |
| GSD | ground sampling distance |
| IR | infrared |
| SAR | Synthetic Aperture Radar |
| DDU | Downsample Difference Upsample |
| PNOC | PNOC attention module |
| SFEG | SFEG |
| CFFDNet | CFFDNet |
| DFSC | DFSC |
| GWF | GWF units |
| R2CD | region-internal content interaction |
| CDFL | class-aware discrete focal loss |
| CS3 | cross-shaped sampling space |
| FFCA | focused feature context aggregation |
| ODCLayer | context enhancement via multi-dimensional dynamic convolution |
| MFFM | multi-scale feature fusion |
| RGAM | relevance-guided attention module |
| CBC | class-balance–aware correction |
| ECL | enhanced contrastive learning |
| MDCM | MDCM |
| HFSHM | restructures high-level features through a split-and-shuffle mechanism |
| PMT | PMT module |
| IV-gate | IV-gate module |
| CFFIM | CFFIM module |
| FEM | FEM |
| FFM | FFM |
| SCAM | SCAM |
| CBAM | Convolutional Block Attention Module |
| GAP | global average pooling |
| AUC-PR | area under the PR curve |
| ROC | Receiver Operating Characteristic curve |
| GenAI | Generative Artificial Intelligence (AI-based language model) |
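Several abbreviations above (IoU, APs/APm/APl) follow the standard COCO evaluation conventions: IoU measures box overlap, and objects are bucketed by pixel area at the 32² and 96² thresholds. A minimal sketch, with hypothetical helper names `iou` and `size_bucket`:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def size_bucket(box):
    """COCO size class: small < 32^2 <= medium < 96^2 <= large (pixel area)."""
    a = (box[2] - box[0]) * (box[3] - box[1])
    return "small" if a < 32**2 else "medium" if a < 96**2 else "large"

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
print(size_bucket((0, 0, 20, 20)))          # "small" (400 < 1024)
```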
References
- Acikgoz, H.; Korkmaz, D.; Talan, T. An Automated Diagnosis of Parkinson’s Disease from MRI Scans Based on Enhanced Residual Dense Network with Attention Mechanism. J. Imaging Inform. Med. 2025, 38, 1935–1949. [Google Scholar] [CrossRef]
- Aydin, A.; Avaroğlu, E. Contact classification for human-robot interaction with densely connected convolutional neural network and convolutional block attention module. Signal Image Video Process. 2024, 18, 4363–4374. [Google Scholar] [CrossRef]
- Fu, Q.; Tao, X.; Deng, W.; Liu, H. Image Detection Network Based on Enhanced Small Target Recognition Details and Its Application in Fine Granularity. Appl. Sci. 2024, 14, 4857. [Google Scholar] [CrossRef]
- Aydın, A.; Talan, T.; Aktürk, C. Vision-Based Amateur Drone Detection: Performance Analysis of New Approaches in Deep Learning. Acta Infologica 2023, 7, 308–316. [Google Scholar] [CrossRef]
- Acikgoz, H. An automatic detection model for cracks in photovoltaic cells based on electroluminescence imaging using improved YOLOv7. Signal Image Video Process. 2024, 18, 625–635. [Google Scholar] [CrossRef]
- Aydin, A.; Salur, M.U.; Aydin, İ. Fine-tuning convolutional neural network based railway damage detection. In Proceedings of the IEEE EUROCON 2021-19th International Conference on Smart Technologies, Lviv, Ukraine, 6–8 July 2021; pp. 216–221. Available online: https://ieeexplore.ieee.org/abstract/document/9535585/ (accessed on 15 September 2025).
- Dikici, B.; Bekciogullari, M.F.; Acikgoz, H.; Ozbay, S. A lightweight and improved you only look once model using GhostWise convolution and attention mechanism for accurate plant disease detection. Eng. Appl. Artif. Intell. 2025, 161, 112163. [Google Scholar] [CrossRef]
- Kahveci, S.; Avaroğlu, E. An Adaptive Underwater Image Enhancement Framework Combining Structural Detail Enhancement and Unsupervised Deep Fusion. Appl. Sci. 2025, 15, 7883. [Google Scholar] [CrossRef]
- Hu, J.; Li, Y.; Zhi, X.; Shi, T.; Zhang, W. Complementarity-Aware Feature Fusion for Aircraft Detection via Unpaired Opt2SAR Image Translation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–19. [Google Scholar] [CrossRef]
- Xu, X.; Chen, Z.; Zhang, X.; Wang, G. Context-Aware Content Interaction: Grasp Subtle Clues for Fine-Grained Aircraft Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5641319. [Google Scholar] [CrossRef]
- Jiang, H.; Luo, T.; Peng, H.; Zhang, G. MFCANet: Multiscale Feature Context Aggregation Network for Oriented Object Detection in Remote-Sensing Images. IEEE Access 2024, 12, 45986–46001. [Google Scholar] [CrossRef]
- Wang, Y.; Chen, H.; Zhang, Y.; Li, G. Relevance Pooling Guidance and Class-Balanced Feature Enhancement for Fine-Grained Oriented Object Detection in Remote Sensing Images. Remote Sens. 2024, 16, 3494. [Google Scholar] [CrossRef]
- Zhao, W.; Zhao, Z.; Xu, M.; Ding, Y.; Gong, J. Differential multimodal fusion algorithm for remote sensing object detection through multi-branch feature extraction. Expert Syst. Appl. 2025, 265, 125826. [Google Scholar] [CrossRef]
- Wang, Z.; Li, S.; Huang, K. Cross-Modal Adaptation for Object Detection in Infrared Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2025, 22, 7000805. [Google Scholar] [CrossRef]
- Nie, J.; Sun, H.; Sun, X.; Ni, L.; Gao, L. Cross-Modal Feature Fusion and Interaction Strategy for CNN-Transformer-Based Object Detection in Visual and Infrared Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5000405. [Google Scholar] [CrossRef]
- Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
- Zhang, Q.; Zhu, Y.; Cordeiro, F.R.; Chen, Q. PSSCL: A progressive sample selection framework with contrastive loss designed for noisy labels. Pattern Recognit. 2025, 161, 111284. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
- Yu, W.Q.; Cheng, G.; Wang, M.J.; Yao, Y.Q.; Xie, X.X.; Yao, X.W.; Han, J.W. MAR20: A benchmark for military aircraft recognition in remote sensing images. Natl. Remote Sens. Bull. 2024, 27, 2688–2696. [Google Scholar] [CrossRef]
- Razakarivony, S.; Jurie, F. Vehicle Detection in Aerial Imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2015. Available online: https://hal.science/hal-01122605 (accessed on 9 September 2025). [CrossRef]
- Wu, J.; Zhao, F.; Yao, G.; Jin, Z. FGA-YOLO: A one-stage and high-precision detector designed for fine-grained aircraft recognition. Neurocomputing 2025, 618, 129067. [Google Scholar] [CrossRef]
- Liu, K.; Xu, Z.; Liu, Y.; Xu, G. Military Aircraft Recognition Method Based on Attention Mechanism in Remote Sensing Images. IET Image Process. 2025, 19, e70069. [Google Scholar] [CrossRef]
- Wan, H.; Nurmamat, P.; Chen, J.; Cao, Y.; Wang, S.; Zhang, Y.; Huang, Z. Fine-Grained Aircraft Recognition Based on Dynamic Feature Synthesis and Contrastive Learning. Remote Sens. 2025, 17, 768. [Google Scholar] [CrossRef]
- Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN From ViT Perspective. arXiv 2024, arXiv:2307.09283. [Google Scholar] [CrossRef]
- Yue, C.; Zhang, Y.; Yan, J.; Luo, Z.; Liu, Y.; Guo, P. Diffusion Mechanism and Knowledge Distillation Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4408314. [Google Scholar] [CrossRef]
- Cao, Y.; Guo, L.; Xiong, F.; Kuang, L.; Han, X. Physical-Simulation-Based Dynamic Template Matching Method for Remote Sensing Small Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
- Sun, X.; Yu, Y.; Cheng, Q. Low-rank multimodal remote sensing object detection with frequency filtering experts. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5637114. Available online: https://ieeexplore.ieee.org/abstract/document/10643097/ (accessed on 10 September 2025). [CrossRef]
- Zhao, P.; Ye, X.; Du, Z. Object detection in multispectral remote sensing images based on cross-modal cross-attention. Sensors 2024, 24, 4098. [Google Scholar] [CrossRef]
| Stage | Output Scale | Channels (≈) | Modules |
|---|---|---|---|
| Backbone | /8 (P3 base) | 184 | Conv + C3 |
| Backbone | /16 (P4 base) | 360 | Conv + C3 |
| Backbone | /32 (P5 base) | 720 | Conv + C3 + SPPF |
| Neck | /8 (P3) | 360 | PANet + C3 + EA@P3 |
| Neck | /16 (P4) | 544 | PANet + C3 + EA@P4 |
| Neck | /32 (P5) | 720 | PANet + C3 |
| Head | P3, P4, P5 | Number of classes | Detect |
| Dataset | Model | Precision | Recall | mAP@0.5 | mAP@[0.5:0.95] | Inference (ms/img) | FPS |
|---|---|---|---|---|---|---|---|
| MAR20 | YOLOv12s | 0.901 | 0.874 | 0.913 | 0.502 | 5.4 | 185 |
| MAR20 | AERIS-ED | 0.934 | 0.896 | 0.951 | 0.541 | 3.8 | 263 |
| VEDAI | YOLOv12s | 0.791 | 0.755 | 0.772 | 0.488 | 5.5 | 181 |
| VEDAI | AERIS-ED | 0.842 | 0.792 | 0.830 | 0.532 | 3.8 | 263 |
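For the speed columns above, FPS is simply the reciprocal of per-image latency (assuming single-image batches), and the headline "≈31% lower latency, ≈45% higher FPS" claim follows from the 5.5 ms → 3.8 ms reduction. A quick arithmetic check:

```python
def fps(ms_per_img):
    """Frames per second from per-image latency in milliseconds."""
    return 1000.0 / ms_per_img

print(round(fps(3.8)))                       # 263
print(round(fps(5.5)))                       # 182
latency_drop = (5.5 - 3.8) / 5.5             # fraction of latency removed
fps_gain = (fps(3.8) - fps(5.5)) / fps(5.5)  # relative throughput gain
print(f"{latency_drop:.0%}, {fps_gain:.0%}") # 31%, 45%
```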
| Dataset | Model | mAP@0.5 | mAP@[0.5:0.95] | Precision (P) | Recall (R) | Inference (ms/img) | FPS |
|---|---|---|---|---|---|---|---|
| VEDAI | AERIS-ED | 0.830 | 0.532 | 0.756 | 0.771 | 3.8 | 263 |
| VEDAI | YOLOv5s | 0.715 | 0.430 | 0.674 | 0.717 | 3.2 | 312 |
| VEDAI | YOLOv12s | 0.773 | 0.497 | 0.751 | 0.710 | 5.5 | 182 |
| VEDAI | YOLOv7 | 0.620 | 0.341 | 0.581 | 0.669 | 3.3 | 303 |
| VEDAI | Faster R-CNN | 0.731 | 0.422 | 0.500 | 0.555 | 52.1 | 19.2 |
| VEDAI | RetinaNet | 0.752 | 0.430 | 0.450 | 0.597 | 35.7 | 28.0 |
| MAR20 | AERIS-ED | 0.951 | 0.735 | 0.906 | 0.919 | 3.8 | 263 |
| MAR20 | YOLOv5s | 0.886 | 0.656 | 0.764 | 0.847 | 3.2 | 312 |
| MAR20 | YOLOv12s | 0.900 | 0.683 | 0.834 | 0.840 | 5.5 | 182 |
| MAR20 | YOLOv7 | 0.899 | 0.670 | 0.818 | 0.860 | 3.3 | 303 |
| MAR20 | Faster R-CNN | 0.799 | 0.565 | 0.578 | 0.677 | 52.1 | 19.2 |
| MAR20 | RetinaNet | 0.580 | 0.380 | 0.410 | 0.679 | 35.7 | 28.0 |
| Dataset | Model | mAP@0.5 | mAP@[0.5:0.95] | APs | APm | APl |
|---|---|---|---|---|---|---|
| MAR20 | AERIS-ED | 0.951 | 0.735 | 0.483 | 0.715 | 0.714 |
| MAR20 | YOLOv12s | 0.900 | 0.683 | 0.261 | 0.651 | 0.722 |
| MAR20 | YOLOv7 | 0.899 | 0.670 | 0.445 | 0.652 | 0.646 |
| MAR20 | YOLOv5s | 0.886 | 0.656 | 0.450 | 0.632 | 0.627 |
| MAR20 | Faster R-CNN | 0.799 | 0.565 | 0.345 | 0.544 | 0.540 |
| MAR20 | RetinaNet | 0.580 | 0.380 | 0.332 | 0.502 | 0.511 |
| VEDAI | AERIS-ED | 0.830 | 0.532 | 0.484 | 0.395 | – |
| VEDAI | YOLOv12s | 0.773 | 0.497 | 0.454 | 0.305 | – |
| VEDAI | YOLOv7 | 0.620 | 0.341 | 0.328 | 0.145 | – |
| VEDAI | YOLOv5s | 0.715 | 0.430 | 0.441 | 0.361 | – |
| VEDAI | Faster R-CNN | 0.731 | 0.422 | 0.439 | 0.270 | – |
| VEDAI | RetinaNet | 0.752 | 0.430 | 0.189 | 0.204 | – |
| Dataset | Model | mAP@0.5 | mAP@[0.5:0.95] | Precision (P) | Recall (R) | APs |
|---|---|---|---|---|---|---|
| MAR20 | AERIS-ED | 0.951 | 0.735 | 0.906 | 0.919 | 0.483 |
| MAR20 | OPT2SAR [9] | 0.876 | 0.637 | – | – | – |
| MAR20 | C2IDet [10] | – | 0.842 | 0.766 | 0.914 | – |
| MAR20 | FGA-YOLO [21] | 0.913 | 0.696 | – | – | – |
| MAR20 | RT-DETR [22] | 0.762 | 0.571 | 0.822 | 0.769 | – |
| MAR20 | Oriented R-CNN + LGF + FHM + CLM [23] | 0.897 | – | – | – | – |
| MAR20 | ReDet [23] | 0.867 | – | – | – | – |
| MAR20 | RTMDet [23] | 0.879 | – | – | – | – |
| VEDAI | AERIS-ED | 0.830 | 0.532 | 0.756 | 0.771 | 0.484 |
| VEDAI | CM-YOLO [24] | 0.725 | – | – | – | – |
| VEDAI | FFCA-YOLO [16] | 0.748 | 0.448 | – | – | 0.446 |
| VEDAI | DKDNet [24,25] | 0.779 | 0.492 | 0.824 | 0.706 | – |
| VEDAI | DTMSI-Net [26] | 0.765 | – | 0.795 | 0.756 | – |
| VEDAI | LF-MDet [27] | 0.808 | – | – | – | – |
| VEDAI | RT-DETR [28] | 0.520 | 0.327 | – | – | – |
| VEDAI | CFDet | 0.685 | 0.428 | – | – | – |
| Dataset | EA@P3 | EA@P4 | Model | mAP@0.5 | APs | APm | APl | Inference (ms/img) | FPS | Δ mAP@0.5 (%) | Δ APs | Δ APm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAR20 | × | × | Baseline | 0.900 | 0.261 | 0.651 | 0.722 | 5.5 | 182 | — | — | — |
| MAR20 | ✓ | × | AERIS-E3 | 0.946 | 0.475 | 0.702 | 0.706 | 3.8 | 263 | +0.046 (5.11% ↑) | +0.214 | +0.051 |
| MAR20 | × | ✓ | AERIS-E4 | 0.942 | 0.452 | 0.708 | 0.710 | 3.8 | 263 | +0.042 (4.67% ↑) | +0.191 | +0.057 |
| MAR20 | ✓ | ✓ | AERIS-ED | 0.951 | 0.483 | 0.715 | 0.714 | 3.8 | 263 | +0.051 (5.67% ↑) | +0.222 | +0.064 |
| VEDAI | × | × | Baseline | 0.773 | 0.454 | 0.305 | – | 5.5 | 182 | — | — | — |
| VEDAI | ✓ | × | AERIS-E3 | 0.797 | 0.550 | 0.327 | – | 4.0 | 250 | +0.024 (3.10% ↑) | +0.096 | +0.022 |
| VEDAI | × | ✓ | AERIS-E4 | 0.807 | 0.526 | 0.392 | – | 4.0 | 250 | +0.034 (4.40% ↑) | +0.072 | +0.087 |
| VEDAI | ✓ | ✓ | AERIS-ED | 0.830 | 0.562 | 0.448 | – | 3.8 | 263 | +0.057 (7.37% ↑) | +0.108 | +0.143 |
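In the ablation table, the parenthesized percentages in the Δ mAP@0.5 column are the absolute gain divided by the baseline score. They can be reproduced with a few lines (values taken from the table; the helper name `delta` is illustrative):

```python
def delta(new, base):
    """Absolute and relative (percent) improvement over a baseline."""
    abs_gain = new - base
    rel_gain = abs_gain / base * 100.0
    return abs_gain, rel_gain

for name, new, base in [("AERIS-E3 / MAR20", 0.946, 0.900),
                        ("AERIS-ED / MAR20", 0.951, 0.900),
                        ("AERIS-ED / VEDAI", 0.830, 0.773)]:
    a, r = delta(new, base)
    print(f"{name}: +{a:.3f} ({r:.2f}% up)")
```

Running this reproduces the table's +0.046 (5.11%), +0.051 (5.67%), and +0.057 (7.37%) entries.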
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Aydin, A.; Avaroğlu, E. AERIS-ED: A Novel Efficient Attention Riser for Multi-Scale Object Detection in Remote Sensing. Appl. Sci. 2025, 15, 12223. https://doi.org/10.3390/app152212223