Low-Latency Autonomous Surveillance in Defense Environments: A Hybrid RTSP-WebRTC Architecture with YOLOv11
Abstract
1. Introduction
2. Related Work and Technology Analysis
2.1. Video Transmission Protocols
2.2. Proposed IMS Architecture
2.3. Artificial Intelligence Frameworks and Datasets
3. System Design and Architecture
3.1. Technological Analysis Phase
3.2. Architectural Design Phase
3.3. Integrated Implementation Phase
3.4. Experimental Validation Phase
3.5. Evaluation Protocol
3.5.1. Transmission Subsystem Evaluation
3.5.2. AI Subsystem Evaluation
3.6. Data Preparation: VisDrone
- High object density,
- Abundance of small targets,
- Variations in perspective and illumination.
3.7. YOLOv11 Model Training Configuration
4. Experimental Methodology
4.1. Training Environment and Dataset
4.2. Training Configuration and Results
4.3. Quantitative Analysis of Inference Latency
4.4. Hardware and Network Setup
- Gateway: 192.168.40.1 (latency to router: 14–68 ms, with sporadic peaks up to ~1000 ms).
- Active Connections:
  - Source (Drone): IP 192.168.40.61 connected to the MediaMTX RTSP port (8554).
  - Server (Host): IP 192.168.40.27 listening on RTMP (1936), RTSP (8554), and WebRTC/WHEP (8889).
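The reachability of the listening services above can be verified with a short TCP probe before starting a capture session. This is an illustrative helper, not part of MediaMTX or any library; the host IP and port numbers are taken from the testbed description.

```python
import socket

def check_ports(host: str, ports: list[int], timeout: float = 2.0) -> dict[int, bool]:
    """Return {port: reachable} by attempting a TCP connect to each port."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

# Ports from the testbed above: RTMP ingest (1936), RTSP (8554), WebRTC/WHEP (8889).
SERVICES = {"rtmp": 1936, "rtsp": 8554, "whep": 8889}
```

A call such as `check_ports("192.168.40.27", list(SERVICES.values()))` yields a per-port boolean map that can be logged before each experiment run.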
5. Results and Discussion
5.1. YOLOv11 Model Performance Analysis
- Pixel scarcity (Small Object Problem): In aerial imagery acquired from UAV platforms, objects such as bicycles or tricycles occupy only a tiny fraction of the frame compared with larger vehicles such as buses. As a consequence, the network progressively loses discriminative semantic features for these small objects in the deeper layers of the architecture.
- Inter-class ambiguity: The relatively low AP values for vans (0.382) and trucks (0.305) indicate that the model has difficulty differentiating functionally and visually similar vehicle types when observed from a top-down or oblique aerial perspective. In a military or defense context, this suggests potential challenges in reliably distinguishing civilian transport vehicles from light tactical or support vehicles without additional retraining on domain-specific defense datasets.
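The pixel-scarcity point can be made concrete with simple arithmetic: a small target covers a negligible share of the frame, and after downsampling to the deepest feature map of a YOLO-style backbone (stride 32) it spans less than one cell. The helper names and the example object size below are illustrative, not drawn from the dataset.

```python
def pixel_fraction(obj_w: int, obj_h: int, img_w: int, img_h: int) -> float:
    """Fraction of image area covered by the object's bounding box."""
    return (obj_w * obj_h) / (img_w * img_h)

def cells_at_stride(obj_w: int, obj_h: int, stride: int = 32) -> tuple[float, float]:
    """Object extent, in feature-map cells, after downsampling by `stride`."""
    return obj_w / stride, obj_h / stride

# A hypothetical ~24x12 px bicycle in a 1920x1080 frame:
frac = pixel_fraction(24, 12, 1920, 1080)   # ~0.014% of the image
cells = cells_at_stride(24, 12)             # (0.75, 0.375): under one stride-32 cell
```

At stride 32 such an object contributes to less than a single activation, which is why its semantic signal fades in deeper layers.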
5.2. Comparative Analysis with Existing Solutions
6. Conclusions
6.1. Future Work
6.2. Limitations
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Protocol | Latency Range (ms) | Browser Support | Bandwidth Efficiency | Scalability | Security | Primary Use Cases | Key Strengths | Notable Limitations |
|---|---|---|---|---|---|---|---|---|
| JPEG-over-WebSocket | 1500–2500 | Native (WebSocket API) | Low | Poor | TLS-dependent | Surveillance feeds, simple monitoring | Minimal implementation complexity; no specialized player required; direct browser integration | High CPU overhead; inefficient compression; limited concurrent streams |
| RTSP | 200–600 | None (requires transcoding) | High | Moderate | Optional (RTSPS) | IP cameras, professional broadcasting | Industry standard for camera integration; low-latency delivery; mature protocol | Firewall traversal issues; requires gateway for web delivery; complex NAT handling |
| RTMP | 500–2000 | Deprecated (Flash EOL 2020) | Moderate | Moderate | Optional (RTMPS) | Legacy streaming, video ingestion | Established server ecosystem; reliable stream delivery; low setup complexity | Browser support eliminated; declining platform support; limited modern codec support |
| HLS | 2000–5000 | Universal (HTML5 native) | High | Excellent | AES encryption | VOD, live events, mobile streaming | CDN-optimized delivery; adaptive bitrate streaming; cross-platform compatibility | Segment-based latency overhead; initial buffering delay; not suitable for real-time interaction |
| WebRTC | 100–500 | Native (modern browsers) | Moderate | Good (with SFU/MCU) | Mandatory (DTLS-SRTP) | Video conferencing, real-time collaboration | Sub-second latency; peer-to-peer capable; built-in encryption | Complex signaling requirements; infrastructure costs (TURN/STUN); bandwidth-intensive |
| WHEP | 150–500 | Native (HTTP/WebRTC) | Moderate | Excellent | Mandatory (inherited from WebRTC) | Low-latency broadcasting, sports streaming | Standardized egress (playback) protocol; HTTP-based infrastructure; combines WebRTC speed with HTTP scalability | Limited production deployments; emerging tooling ecosystem; specification still evolving (IETF draft) |
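WHEP's appeal in the table comes from reducing WebRTC session setup to a single HTTP exchange: the player POSTs an SDP offer and, per the IETF draft, receives a 201 Created response carrying the SDP answer and a `Location` header naming the session resource. The sketch below models only this signaling step; the helper names (`build_whep_request`, `parse_whep_response`) are ours, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class WhepRequest:
    method: str
    url: str
    headers: dict
    body: str

def build_whep_request(endpoint: str, sdp_offer: str) -> WhepRequest:
    """WHEP playback setup is a single HTTP POST carrying the SDP offer."""
    return WhepRequest(
        method="POST",
        url=endpoint,
        headers={"Content-Type": "application/sdp"},
        body=sdp_offer,
    )

def parse_whep_response(status: int, headers: dict, body: str) -> tuple[str, str]:
    """On 201 Created, the body is the SDP answer and Location names the session."""
    if status != 201:
        raise RuntimeError(f"WHEP endpoint returned {status}")
    return headers["Location"], body
```

The session resource URL returned in `Location` is what the client later DELETEs to tear the session down, which is why WHEP scales over ordinary HTTP infrastructure.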
| Framework | Latest Version (2025) | Release Date | GitHub Stars (approx.) | License | Python Support | Main Strengths | Main Weaknesses | Role in IMS |
|---|---|---|---|---|---|---|---|---|
| TensorFlow | 2.20.0 | Aug 2025 | 192 k | Apache 2.0 | ≥3.9 | Mature ecosystem; good tooling; deployment to edge devices | Heavier graph model; less flexible for experimental research | Considered for deployment; not selected as main training framework |
| PyTorch | 2.9.1 | Nov 2025 | 94 k | BSD-3 | ≥3.10 | High flexibility; strong research community; easy experimentation | Slightly less tooling for production in some cases | Main training framework for YOLOv11 on VisDrone |
| OpenCV | 4.13.0 | Dec 2025 | 85.6 k | Apache 2.0 | ≥3.7 | Rich set of classical vision algorithms; efficient C++ backend | Not a deep-learning framework by itself | Support library for preprocessing and classical vision tasks |
| DeepStream | 8.0/7.1 | 2025 | ~1.8 k | Proprietary (NVIDIA) | ≥3.8 | Optimized multi-stream inference on NVIDIA GPUs; tight GStreamer integration | Hardware/vendor specific; higher initial complexity | Considered as future path for large-scale optimized deployment |
| Dataset | Images | Objects | Categories | Avg. Objects/Image | Aerial Context | Main Strengths | Main Weaknesses | Role in IMS |
|---|---|---|---|---|---|---|---|---|
| COCO | 118 k (train) | 860 k | 80 | 7.3 | Limited (mostly ground-level) | General-purpose objects in everyday scenes | Not designed for aerial surveillance; ground-level perspective | Useful for pretraining; not sufficient for aerial surveillance |
| xView | 1127 | 1 M+ | 60 | ~900 | High | Satellite and aerial imagery at large scale | Different scale and sensor characteristics; primarily satellite imagery | Relevant for wide-area detection but with different scale and sensor characteristics |
| DOTA | 2806 | 188 k | 15 | 67 | High | Aerial scenes with oriented objects | Oriented bounding boxes (OBB) format; focused on large objects | Good for aerial detection, focused on oriented bounding boxes |
| AI-TOD | 28,036 | 700 k+ | 8 | ~25 | High | Tiny objects in aerial images; emphasis on small-scale detection | Limited scene diversity; primarily focused on tiny object challenge | Valuable for small-object detection research |
| VisDrone | 10,209 | 2.6 M | 10 | 256 | High | Urban/peri-urban scenes captured from UAVs; high-density annotations | Challenging lighting and occlusion conditions | Most aligned with IMS requirements; selected as main dataset for fine-tuning |
| YOLO Family | Representative Model | Key Improvements | Typical Use in IMS Context | Remarks |
|---|---|---|---|---|
| YOLOv5 | YOLOv5s/m | First widely adopted Ultralytics family; strong baseline for real-time detection; mature tooling. | Baseline experiments and early prototypes of IMS. | Good balance accuracy/speed but less efficient and less expressive than later families. |
| YOLOv8 | YOLOv8m | Anchor-free design, improved head and loss functions; better performance on small objects. | Candidate for improved accuracy on dense aerial scenes. | Higher accuracy than v5 at similar or slightly higher computational cost. |
| YOLOv9 | YOLOv9c | Enhanced training strategies and architectural refinements; focus on accuracy. | Exploratory reference for high-accuracy configurations. | Oriented towards benchmarks; may be heavy for multi-stream real-time IMS deployment. |
| YOLOv10 | YOLOv10m | Optimisations focused on efficiency and latency; improved deployment characteristics. | Intermediate option when inference resources are limited but accuracy remains important. | Promising trade-off but less integrated into the current flow of IMS experimentation. |
| YOLOv11 | YOLOv11m/n | Latest Ultralytics family with refinements in backbone, neck and training recipes; strong small-object performance. | Main IMS model for real-time deployment on VisDrone-like flows. | Offers the best compromise between accuracy and speed in the evaluated context. |
| Model | Size (Pixels) | mAP val 50–95 | Speed CPU ONNX (ms) | Speed T4 TensorRT10 (ms) | Params (M) | FLOPs (B) |
|---|---|---|---|---|---|---|
| YOLO11n | 640 | 39.5 | 56.1 ± 0.8 | 1.5 ± 0.0 | 2.6 | 6.5 |
| YOLO11s | 640 | 47.0 | 90.0 ± 1.2 | 2.5 ± 0.0 | 9.4 | 21.5 |
| YOLO11m | 640 | 51.5 | 183.2 ± 2.0 | 4.7 ± 0.1 | 20.1 | 68.0 |
| YOLO11l | 640 | 53.4 | 238.6 ± 1.4 | 6.2 ± 0.1 | 25.3 | 86.9 |
| YOLO11x | 640 | 54.7 | 462.8 ± 6.7 | 11.3 ± 0.2 | 56.9 | 194.9 |
| Category | Details |
|---|---|
| Training hyperparameters | imgsz = 640; batch = 16; epochs = 100; cache = RAM (alternative: disk cache for determinism); optimizer = SGD (auto); lr = 0.01; momentum = 0.9 |
| Validation metrics | mAP@0.5 = 0.3360; mAP@0.5:0.95 = 0.1954; Precision ≈ 0.438; Recall ≈ 0.338 |
| Training time | ≈3.5 h (Tesla T4) |
| Ultralytics val timing (per image, T4) | preprocess ≈ 1.8 ms; inference ≈ 3.5 ms; postprocess ≈ 3.1 ms |
| Export for deployment | ONNX export (opset = 22): best.onnx (~10.1 MB) |
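The per-image validation timings in the table imply a single-stream throughput ceiling on the T4: the pipeline budget is the sum of the three stages and sustainable throughput is its reciprocal. A quick check of that arithmetic:

```python
# Per-image Ultralytics val timings on the T4, from the table above.
PREPROCESS_MS, INFERENCE_MS, POSTPROCESS_MS = 1.8, 3.5, 3.1

total_ms = PREPROCESS_MS + INFERENCE_MS + POSTPROCESS_MS   # 8.4 ms/image
fps = 1000.0 / total_ms                                    # ~119 frames/s
```

This headroom (well above a 30 FPS source) is why the inference stage is not the latency bottleneck in the end-to-end measurements.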
| Item | Value |
|---|---|
| Python | 3.12.12 |
| Ultralytics | 8.3.227 |
| PyTorch | 2.8.0+cu126 |
| GPU | Tesla T4 (~15 GB VRAM) |
| Dataset | VisDrone2019-DET (YOLO format) |
| Split | Train = 6471; Val = 548; Test = 1610 images |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Castro-Castaño, J.J.; Chirán-Alpala, W.E.; Giraldo-Martínez, G.A.; Ortega-Pabón, J.D.; Rodríguez-Amézquita, E.C.; Gallego-Franco, D.F.; Garcés-Gómez, Y.A. Low-Latency Autonomous Surveillance in Defense Environments: A Hybrid RTSP-WebRTC Architecture with YOLOv11. Computers 2026, 15, 62. https://doi.org/10.3390/computers15010062

