CAMS-AI: A Coarse-to-Fine Framework for Efficient Small Object Detection in High-Resolution Images
Highlights
- We propose a “coarse-to-fine” detection framework, CAMS-AI, which intelligently identifies and focuses on target regions using an RPN and DBSCAN clustering, thereby avoiding redundant computation over large expanses of irrelevant background.
- Experiments demonstrate that, compared to the SAHI method, the CAMS-AI framework achieves an 8- to 10-fold increase in end-to-end inference speed (End-to-End FPS) at the cost of only a small, acceptable drop in accuracy (2–4 percentage points in mAP50–95).
- The framework successfully overcomes the severe efficiency bottlenecks of existing slicing methods in sparse-target scenarios (e.g., wide-area grasslands), providing a highly efficient and practical solution for high-resolution, wide-area livestock monitoring.
- CAMS-AI’s “intelligent focus” paradigm offers a general solution that extends to other fields facing the “high-resolution, large field-of-view, sparse-target” challenge, such as remote sensing and UAV inspection.
Abstract
1. Introduction
- (1) We propose an efficient coarse-to-fine detection framework. CAMS-AI first uses a Region Proposal Network to rapidly lock onto potential target areas and then performs streamlined processing only within these regions, avoiding wasted computation on large background areas.
- (2) We construct a novel multi-level slicing strategy. Our three-level adaptive strategy, comprising centered slicing, expanded slicing, and grid slicing, targets ROIs of different sizes and densities, achieving precise allocation of computational resources.
- (3) We achieve a balance between accuracy and speed. Extensive experiments demonstrate that the CAMS-AI framework achieves inference speeds nearly 10 times faster than SAHI while attaining comparable accuracy, providing a practical solution for efficient detection in high-resolution, small-target scenarios.
2. Materials and Methods
2.1. Dataset Details
2.1.1. Study Area
2.1.2. Target Class and Size Distribution
2.2. Dataset Processing
2.2.1. Data Processing Strategy
2.2.2. Full-Resolution Dataset
2.2.3. Sliced Dataset
2.3. Methods
- Coarse-Grained Candidate Localization: Performing a rapid, preliminary detection sweep on the full-resolution image to recall all potential targets.
- Adaptive ROI Generation: Using the DBSCAN clustering algorithm to organize the discrete, preliminarily detected targets into several compact ROIs.
- Multi-Level Slicing and Fine-Grained Detection: Applying a multi-level slicing strategy to crop the generated ROIs and then invoking a high-performance model for precise detection.
- Result Fusion and Post-Processing: Merging the detection results from both the coarse and fine stages and applying Non-Maximum Suppression (NMS) to obtain the final output. A code sketch of the whole pipeline follows this list.
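The sketch below illustrates these four stages end to end. It is a minimal, hypothetical reconstruction rather than the authors' released code: `coarse_model` and `fine_model` stand in for the RPN-based coarse detector and the fine-grained detector, `slice_fn` for the multi-level slicer sketched in Section 2.3.3, and `min_samples = 1` is an assumed DBSCAN setting; only `eps = 170` is taken from the parameter study in Section 3.3.2.

```python
# Minimal sketch of the CAMS-AI coarse-to-fine pipeline (illustrative only).
import numpy as np
from sklearn.cluster import DBSCAN


def nms(boxes, scores, iou_thr=0.5):
    """Greedy Non-Maximum Suppression over xyxy boxes; returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                  * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou <= iou_thr]
    return keep


def rois_from_detections(boxes, eps=170, min_samples=1):
    """Stage 2: cluster detection centers with DBSCAN; one tight ROI per cluster."""
    centers = (boxes[:, :2] + boxes[:, 2:4]) / 2.0
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    return np.array([[boxes[labels == l, 0].min(), boxes[labels == l, 1].min(),
                      boxes[labels == l, 2].max(), boxes[labels == l, 3].max()]
                     for l in np.unique(labels)])


def detect(image, coarse_model, fine_model, slice_fn, iou_thr=0.5):
    """Stages 1-4 for one high-resolution image. The two model callables are
    assumed to return (boxes_xyxy, scores) as NumPy arrays."""
    boxes, scores = coarse_model(image)          # stage 1: coarse localization
    if len(boxes) == 0:
        return boxes, scores                     # empty scene: nothing to refine
    all_b, all_s = [boxes], [scores]
    for roi in rois_from_detections(boxes):      # stage 2: ROI generation
        for sx1, sy1, sx2, sy2 in slice_fn(roi): # stage 3: multi-level slicing
            ox, oy = max(int(sx1), 0), max(int(sy1), 0)
            crop = image[oy:int(sy2), ox:int(sx2)]
            b, s = fine_model(crop)              # stage 3: fine-grained detection
            all_b.append(b + np.array([ox, oy, ox, oy]))  # back to image coords
            all_s.append(s)
    boxes = np.vstack(all_b)                     # stage 4: fusion + NMS
    scores = np.concatenate(all_s)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```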

2.3.1. Coarse-Grained Candidate Localization
2.3.2. Region of Interest (ROI) Generation
2.3.3. Multi-Level Slicing Strategy
- Center Slice Strategy: When the ROI’s dimensions are less than or equal to the standard slice size S, we adopt the center slice strategy. Let the center point of the ROI be (cx, cy). The single generated slice’s coordinates are (x1, y1, x2, y2) = (cx − S/2, cy − S/2, cx + S/2, cy + S/2). This is the most efficient option and provides sufficient contextual information for small target clusters.
- Expand Slice Strategy: When the ROI’s dimensions are larger than the standard slice size but smaller than the expansion threshold, we adopt the expand-to-square slice strategy. Let Dmax be the maximum edge length of the ROI, with the center point still (cx, cy). The single generated slice’s coordinates are (x1, y1, x2, y2) = (cx − Dmax/2, cy − Dmax/2, cx + Dmax/2, cy + Dmax/2). This strategy generates the smallest square slice that fully encloses the original ROI, avoiding the highly overlapping slices caused by slight size deviations beyond the threshold. For example, for a 641 × 600 ROI, the system generates a single 641 × 641 slice and feeds it directly to the model. Thanks to the model’s scale robustness, the accuracy of this single detection is nearly indistinguishable from processing two highly overlapping 640 × 640 slices, while the computational overhead is halved. This strategy fundamentally resolves the redundant-slicing problem at boundary dimensions.
- Grid Slice Strategy: If a standard grid slice is applied to a large ROI where only one long edge exceeds the threshold, it may lose context on the short edge that could contain valid information. Therefore, this study proposes a hybrid slicing strategy combining centering and grid slicing (see the sketch after this list). First, define the slice step size P = S (1 − O), where O is the overlap ratio. A set of starting offsets is then calculated for the X-axis and Y-axis separately. For any given axis (using the X-axis as an example, with its starting point in the ROI as x1 and its size as WR), the set of offsets OffsetX is {x1 + WR/2 − S/2} when WR ≤ S (a single centered window), and {x1 + kP | k = 0, 1, …} ∪ {x1 + WR − S} when WR > S, with the last offset clamped so the final window ends at the ROI boundary. The calculation for OffsetY is identical. The final generated set of slices Cgrid is the Cartesian product of the two offset sets: Cgrid = OffsetX × OffsetY. This performs sliding-window slicing along the length and width of the ROI at a fixed overlap rate, ensuring complete coverage of the entire area while keeping each slice’s dimensions within a range the model can process efficiently.
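As an illustration, the three strategies can be combined into one dispatch function. The sketch below uses our own symbols (S for the standard slice size, O for the overlap ratio, T for the expand threshold); S = 640 follows the 640 × 640 example above, while O = 0.2 and T = 960 are assumed defaults, not the tuned values from Section 3.3.3.

```python
def slice_roi(roi, S=640, O=0.2, T=960):
    """Multi-level slicing sketch: returns xyxy slice boxes for one ROI.
    Clipping/padding at image borders is omitted for brevity."""
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    d_max = max(w, h)

    # Center slice: the ROI fits inside one standard slice.
    if d_max <= S:
        return [(cx - S / 2, cy - S / 2, cx + S / 2, cy + S / 2)]

    # Expand slice: the smallest square slice that still encloses the ROI.
    if d_max <= T:
        half = d_max / 2.0
        return [(cx - half, cy - half, cx + half, cy + half)]

    # Grid slice: sliding-window offsets per axis; a short axis gets one
    # centered window instead of a full grid.
    P = S * (1 - O)                          # slice step size

    def offsets(start, size):
        if size <= S:                        # short axis: single centered window
            return [start + size / 2.0 - S / 2.0]
        offs, pos = [], start
        while pos + S < start + size:        # regular grid steps
            offs.append(pos)
            pos += P
        offs.append(start + size - S)        # clamp last window to the ROI edge
        return offs

    return [(ox, oy, ox + S, oy + S)
            for ox in offsets(x1, w) for oy in offsets(y1, h)]
```

Under these settings, the 641 × 600 ROI from the example above takes the expand branch and yields a single 641 × 641 slice, while an ROI with only one edge beyond T gets a sliding grid along that edge and a single centered window along the other.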
2.3.4. Fine-Grained Detection and Post-Processing
3. Results
3.1. Experimental Details
3.2. Evaluation Metrics
- Accuracy Metrics: This study uses the industry-standard MS COCO evaluation metrics to measure the detection accuracy of the models. These metrics assess the model’s localization and classification capabilities from multiple dimensions. The main evaluation metrics used in this paper are mean Average Precision (mAP), including mAP50, mAP50–95, and mAPsmall, as well as Average Recall (AR), reported as AR100.
- Efficiency Metrics: Standard model speed evaluations measure only the model’s forward inference time and therefore cannot accurately gauge a framework’s practical performance in a slicing scenario. To address this, this paper defines and adopts a more practical efficiency metric: End-to-End FPS (end-to-end frames per second). This metric measures the actual processing speed of the entire detection pipeline in frames per second. Unlike traditional FPS, the timing for End-to-End FPS starts when a raw high-resolution image is input and ends when the final, post-processed detection results are output. Its timing range explicitly includes all steps: clustering and ROI generation, preprocessing (slice generation), model inference, and post-processing (coordinate transformation, result fusion, and Non-Maximum Suppression, NMS). This metric therefore reflects, more realistically and fairly, the actual operational efficiency of different methods in deployed applications. Two short measurement sketches follow this list.
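To make both metric families concrete, two short sketches follow; neither is the authors' evaluation code. The first shows how the reported COCO metrics (mAP50, mAP50–95, mAPsmall, AR100) map onto the standard pycocotools summary vector; the annotation and result file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/test.json")               # ground truth (placeholder path)
coco_dt = coco_gt.loadRes("results/detections.json")  # detections (placeholder path)

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()

map50_95 = ev.stats[0]   # mAP averaged over IoU 0.50:0.95
map50 = ev.stats[1]      # mAP at IoU 0.50
map_small = ev.stats[3]  # mAP for small objects (area < 32^2)
ar100 = ev.stats[8]      # AR with up to 100 detections per image
```

The second measures End-to-End FPS exactly as defined above: the timer wraps the whole pipeline call, so clustering, slicing, inference, and post-processing all fall inside the timed region. `detect_fn` stands for the hypothetical pipeline sketch from Section 2.3.

```python
import time

def end_to_end_fps(images, detect_fn, warmup=5):
    """Frames per second over the full pipeline, from raw image to final boxes."""
    for img in images[:warmup]:            # warm-up runs, excluded from timing
        detect_fn(img)
    start = time.perf_counter()
    for img in images:
        detect_fn(img)                     # clustering + slicing + inference + NMS
    return len(images) / (time.perf_counter() - start)
```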
3.3. Parameter Settings
3.3.1. Negative Sample Ratio
3.3.2. DBSCAN Parameters
3.3.3. Expand Threshold Parameters
3.4. Experimental Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Fraser, M.D.; Vallin, H.E.; Roberts, B.P. Animal Board Invited Review: Grassland-Based Livestock Farming and Biodiversity. Animal 2022, 16, 100671.
- Stevens, N.; Bond, W.; Feurdean, A.; Lehmann, C.E.R. Grassy Ecosystems in the Anthropocene. Annu. Rev. Environ. Resour. 2022, 47, 261–289.
- Piipponen, J.; Jalava, M.; De Leeuw, J.; Rizayeva, A.; Godde, C.; Cramer, G.; Herrero, M.; Kummu, M. Global Trends in Grassland Carrying Capacity and Relative Stocking Density of Livestock. Glob. Change Biol. 2022, 28, 3902–3919.
- Adam, M.; Song, J.; Yu, W.; Li, Q. Deep Learning Approaches for Automatic Livestock Detection in UAV Imagery: State-of-the-Art and Future Directions. Future Internet 2025, 17, 431.
- Chen, A.; Jacob, M.; Shoshani, G.; Charter, M. Using Computer Vision, Image Analysis and UAVs for the Automatic Recognition and Counting of Common Cranes (Grus Grus). J. Environ. Manag. 2023, 328, 116948.
- Fang, C.; Li, C.; Yang, P.; Kong, S.; Han, Y.; Huang, X.; Niu, J. Enhancing Livestock Detection: An Efficient Model Based on YOLOv8. Appl. Sci. 2024, 14, 4809.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. ISBN 978-3-319-46447-3.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision–ECCV 2020; Springer: Cham, Switzerland, 2020.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
- Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214.
- Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488.
- Wei, W.; Cheng, Y.; He, J.; Zhu, X. A Review of Small Object Detection Based on Deep Learning. Neural Comput. Appl. 2024, 36, 6283–6303.
- Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Moghaddam, M.E. Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications. arXiv 2025, arXiv:2503.20516.
- Biggs, D.R.; Theart, R.P.; Schreve, K. Sub-Window Inference: A Novel Approach for Improved Sheep Counting in High-Density Aerial Images. Comput. Electron. Agric. 2024, 225, 109271.
- Akyon, F.C.; Onur Altinuc, S.; Temizel, A. Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970.
- Mazen, F.M.A.; Shaker, Y. Small Object Detection in Complex Images: Evaluation of Faster R-CNN and Slicing Aided Hyper Inference. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 951–960.
- Fotouhi, F.; Menke, K.; Prestholt, A.; Gupta, A.; Carroll, M.E.; Yang, H.-J.; Skidmore, E.J.; O’Neal, M.; Merchant, N.; Das, S.K.; et al. Persistent Monitoring of Insect-Pests on Sticky Traps through Hierarchical Transfer Learning and Slicing-Aided Hyper Inference. Front. Plant Sci. 2024, 15, 1484587.
- Pereira, A.; Santos, C.; Aguiar, M.; Welfer, D.; Dias, M.; Ribeiro, M. Improved Detection of Fundus Lesions Using YOLOR-CSP Architecture and Slicing Aided Hyper Inference. IEEE Lat. Am. Trans. 2023, 21, 806–813.
- Zhorif, N.N.; Anandyto, R.K.; Rusyadi, A.U.; Irwansyah, E. Implementation of Slicing Aided Hyper Inference (SAHI) in YOLOv8 to Counting Oil Palm Trees Using High-Resolution Aerial Imagery Data. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 869–874.
- Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural Object Detection with You Only Look Once (YOLO) Algorithm: A Bibliometric and Systematic Literature Review. Comput. Electron. Agric. 2024, 223, 109090.
- Zhang, H.; Hao, C.; Song, W.; Jiang, B.; Li, B. Adaptive Slicing-Aided Hyper Inference for Small Object Detection in High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 1249.
- Nguyen, N.-D.; Do, T.; Ngo, T.D.; Le, D.-D. An Evaluation of Deep Learning Methods for Small Object Detection. J. Electr. Comput. Eng. 2020, 2020, 3189691.
- Tong, K.; Wu, Y. Deep Learning-Based Detection from the Perspective of Small or Tiny Objects: A Survey. Image Vis. Comput. 2022, 123, 104471.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
- Ismail, Z.H.; Chun, A.K.K.; Shapiai Razak, M.I. Efficient Herd–Outlier Detection in Livestock Monitoring System Based on Density–Based Spatial Clustering. IEEE Access 2019, 7, 175062–175070.
- Graziani, M.; Lompech, T.; Müller, H.; Depeursinge, A.; Andrearczyk, V. On the Scale Invariance in State of the Art CNNs Trained on ImageNet. Mach. Learn. Knowl. Extr. 2021, 3, 374–391.
| Locations | Longitude, Latitude | Number |
|---|---|---|
| Wulagai Bulinquan | 118.810122, 45.800929 | 243 |
| Wulagai Hechang Reservoir | 119.240714, 46.249838 | 217 |
| Locations | Class | Small (Area < 32²) | Medium (32² ≤ Area < 96²) | Large (Area ≥ 96²) |
|---|---|---|---|---|
| Wulagai Bulinquan | Cattle | 877 | 1650 | 0 |
| Wulagai Bulinquan | Horse | 41 | 0 | 0 |
| Wulagai Bulinquan | Sheep | 1546 | 44 | 0 |
| Wulagai Hechang Reservoir | Cattle | 5043 | 1389 | 0 |
| Wulagai Hechang Reservoir | Horse | 4449 | 1584 | 0 |
| Wulagai Hechang Reservoir | Sheep | 179 | 0 | 0 |
| Total | | 12,135 | 4667 | 0 |
| Statistics | Tiny (rs < 0.03) | Small (0.03 ≤ rs < 0.1) | Medium (0.1 ≤ rs < 0.3) | Large (rs ≥ 0.3) |
|---|---|---|---|---|
| Number | 16,624 | 178 | 0 | 0 |
| Proportion | 98.94% | 1.06% | 0% | 0% |
| Use | Locations | Train | Valid | Test | Total |
|---|---|---|---|---|---|
| Train | Wulagai Bulinquan | 154 | 39 | | 360 |
| Train | Wulagai Hechang Reservoir | 133 | 34 | | |
| Test | Wulagai Bulinquan | | | 50 | 100 |
| Test | Wulagai Hechang Reservoir | | | 50 | |
| Locations | Type | Train | Valid | Test |
|---|---|---|---|---|
| Wulagai Bulinquan | Positive | 576 | 72 | 73 |
| Wulagai Bulinquan | Negative | 115 | 14 | 0 |
| Wulagai Hechang Reservoir | Positive | 915 | 114 | 115 |
| Wulagai Hechang Reservoir | Negative | 182 | 22 | 24 |
| Total | | 1788 | 222 | 227 |
| Type | Details |
|---|---|
| Hardware | |
| CPU | Intel(R) Xeon(R) Platinum 8352V @ 2.10 GHz (16 vCPUs) |
| Memory | 32 GB |
| GPU | NVIDIA GeForce RTX 4090 (24 GB) |
| Software | |
| OS | Ubuntu 22.04 |
| CUDA | 12.1 |
| Python | 3.8.10 |
| PyTorch | 1.11.0 |
| mmcv | 2.0.1 |
| mmdet | 3.1.0 |
| mmengine | 0.8.3 |
| Ultralytics | 8.0.201 |
| Eps | mAP50 | mAP50–95 | Average Time per Image (s) |
|---|---|---|---|
| 140 | 0.793 | 0.612 | 0.0746 |
| 150 | 0.792 | 0.611 | 0.0736 |
| 160 | 0.795 | 0.614 | 0.0749 |
| 170 | 0.810 | 0.622 | 0.0711 |
| 180 | 0.804 | 0.620 | 0.0736 |
| 190 | 0.799 | 0.617 | 0.0720 |
| 200 | 0.799 | 0.619 | 0.0712 |
| 210 | 0.800 | 0.618 | 0.0753 |
| Model | Method | mAP50 | mAP50–95 | mAPsmall | AR100 | End-to-End FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN | Baseline | 0.061 | 0.025 | 0.007 | 0.031 | 27.23 |
| Faster R-CNN | SAHI | 0.824 | 0.590 | 0.572 | 0.657 | 1.48 |
| Faster R-CNN | CAMS-AI | 0.781 | 0.559 | 0.547 | 0.619 | 12.44 |
| YOLOv8n | Baseline | 0.064 | 0.029 | 0.016 | 0.036 | 31.31 |
| YOLOv8n | SAHI | 0.794 | 0.602 | 0.590 | 0.643 | 1.53 |
| YOLOv8n | CAMS-AI | 0.750 | 0.570 | 0.552 | 0.605 | 15.63 |
| YOLOv11n | Baseline | 0.052 | 0.025 | 0.016 | 0.031 | 26.76 |
| YOLOv11n | SAHI | 0.846 | 0.610 | 0.594 | 0.650 | 1.51 |
| YOLOv11n | CAMS-AI | 0.763 | 0.571 | 0.557 | 0.623 | 14.97 |
| RT-DETR | Baseline | 0.211 | 0.139 | 0.125 | 0.180 | 26.50 |
| RT-DETR | SAHI | 0.853 | 0.645 | 0.629 | 0.694 | 1.43 |
| RT-DETR | CAMS-AI | 0.810 | 0.622 | 0.609 | 0.674 | 14.69 |