Article

Low-Latency Autonomous Surveillance in Defense Environments: A Hybrid RTSP-WebRTC Architecture with YOLOv11

by Juan José Castro-Castaño 1, William Efrén Chirán-Alpala 1, Guillermo Alfonso Giraldo-Martínez 1, José David Ortega-Pabón 1, Edison Camilo Rodríguez-Amézquita 1, Diego Ferney Gallego-Franco 1 and Yeison Alberto Garcés-Gómez 2,*

1 Colombian Aerospace Force Command, Bogota 111321, Colombia
2 School of Engineering and Architecture, Universidad Católica de Manizales, Manizales 170001, Colombia
* Author to whom correspondence should be addressed.
Computers 2026, 15(1), 62; https://doi.org/10.3390/computers15010062
Submission received: 14 December 2025 / Revised: 12 January 2026 / Accepted: 13 January 2026 / Published: 16 January 2026

Abstract

This article presents the Intelligent Monitoring System (IMS), an AI-assisted, low-latency surveillance platform designed for defense environments. The study addresses the need for real-time autonomous situational awareness by integrating high-speed video transmission with advanced computer vision analytics in constrained network settings. The IMS employs a hybrid transmission architecture based on RTSP for ingestion and WHEP/WebRTC for distribution, orchestrated via MediaMTX, with the objective of achieving end-to-end latencies below one second. The methodology includes a comparative evaluation of video streaming protocols (JPEG-over-WebSocket, HLS, WebRTC, etc.) and AI frameworks, alongside the modular architectural design and prolonged experimental validation. The detection module integrates YOLOv11 models fine-tuned on the VisDrone dataset to optimize performance for small objects, aerial views, and dense scenes. Experimental results, obtained through over 300 h of operational tests using IP cameras and aerial platforms, confirmed the stability and performance of the chosen architecture, maintaining latencies close to 500 ms. The YOLOv11 family was adopted as the primary detection framework, providing an effective trade-off between accuracy and inference performance in real-time scenarios. The YOLOv11n model was trained and validated on a Tesla T4 GPU, and YOLOv11m will be validated on the target platform in subsequent experiments. The findings demonstrate the technical viability and operational relevance of the IMS as a core component for autonomous surveillance systems in defense, satisfying strict requirements for speed, stability, and robust detection of vehicles and pedestrians.

1. Introduction

Technological transformations in the contemporary military domain have significantly reshaped how nations perceive, process, and respond to emerging threats [1,2,3]. As a result, the development of autonomous surveillance and early reaction capabilities against aerial and ground threats has become a priority for modern defense forces. The increased use of small-sized unmanned aerial vehicles (UAVs), the proliferation of IP cameras in critical infrastructure, and the necessity to operate within closed networks or with latency restrictions pose new challenges for command-and-control systems [4,5,6,7,8]. In this context, it is essential to have platforms that integrate low-latency video transmission, artificial intelligence (AI)-based analytics, and integrated visualization tools, in order to support tactical decision-making in near real-time [9,10,11].
The Intelligent Monitoring System (IMS) emerges in response to these operational needs, proposing a modular architecture capable of integrating mobile cameras and drone fleets into a unified monitoring environment. Within this context, the IMS is conceived as the core intelligent surveillance module, responsible for receiving video streams from different sources, transmitting them with latencies compatible with tactical operation, executing computer vision models for automatic threat detection, and presenting the information to operators through a web interface.
Designing a system of this nature requires jointly addressing several engineering challenges: the selection of transmission protocols that enable low latency while maintaining browser compatibility, the correct choice of AI frameworks and architectures capable of operating in real-time on small objects viewed from aerial platforms, and the integration of all these components into a robust and scalable architecture. Additionally, it is necessary to experimentally validate the design decisions through prolonged testing campaigns that reproduce real operational conditions.
This work presents the design and implementation of the IMS, emphasizing two main axes: (i) the transmission architecture based on RTSP for video ingestion and WHEP/WebRTC for its distribution, managed by a MediaMTX server; and (ii) the artificial intelligence subsystem, built upon the YOLOv11 model family trained and fine-tuned on the VisDrone dataset for urban and peri-urban environments [12]. The methodology followed, the comparative evaluation of technologies, the resulting architecture, and the experimental results obtained over more than 300 h of operation are described.
The main contributions of this study are as follows: (1) the definition of a low-latency transmission architecture based on WHEP/WebRTC and MediaMTX, suitable for defense environments with network restrictions; (2) the integration of a real-time detection module based on YOLOv11 trained on VisDrone, capable of operating on high-resolution video streams; and (3) the experimental validation of the IMS in representative scenarios of the project, with metrics for latency, stability, and AI performance [13,14,15,16,17].

2. Related Work and Technology Analysis

2.1. Video Transmission Protocols

The video transmission subsystem is responsible for ensuring that streams originating from IP cameras and aerial platforms reach the operator and the AI module with latencies compatible with tactical decision-making. To meet this, a comparative evaluation of multiple video transmission protocols was conducted. The results are summarized in Table 1.
The protocols evaluated included: JPEG-over-WebSocket, RTSP, RTMP, HLS, WebRTC, and WHEP. The comparison considers both functional aspects (browser compatibility, encryption support, ease of integration) and quantitative performance metrics (end-to-end latency, stability, CPU usage, and bandwidth consumption).
The experiments showed that solutions based on sending JPEG images over WebSocket, although simple to implement, exhibited intensive CPU usage on both the client and server sides, in addition to inefficient bandwidth consumption for HD resolutions. Segmented protocols like HLS offer high scalability but introduce latencies typically exceeding 2–3 s, making them incompatible with the tactical surveillance requirements of the IMS. RTMP, on the other hand, has become obsolete in modern browsers and is limited by the need for specific servers and plugins.
RTSP was established as a robust option for ingestion from cameras and field encoders thanks to its wide adoption in video surveillance devices and its stability over IP networks. However, its lack of native browser compatibility and operation outside of HTTP restrict its use as the final delivery protocol to the web interface. In contrast, WebRTC and its WHEP extension allow for the combination of latencies on the order of 300–500 ms with operation over HTTP(s), integrated encryption, native compatibility with modern browsers, and the ability to adapt to network variations.
As a result of this analysis, the IMS adopts a hybrid architecture where RTSP is used as the ingestion protocol from IP cameras and aerial platforms, while WHEP/WebRTC, served through MediaMTX, is employed as the main distribution protocol toward the web interface and operational clients. MediaMTX acts as the central multiprotocol server, receiving RTSP streams, transcoding them when necessary, and exposing them via WHEP endpoints. This configuration simultaneously satisfies the requirements for low latency, scalability, and compatibility with the network restrictions present in defense infrastructure.
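A minimal MediaMTX configuration consistent with this hybrid design might look as follows. This is an illustrative sketch, not the project's actual configuration: the path name and camera address are placeholders, and only the options relevant to the RTSP-in/WHEP-out pattern are shown (the port numbers match those reported in Section 4.4).

```yaml
# mediamtx.yml — illustrative fragment (not the deployed configuration)
rtspAddress: :8554        # RTSP ingestion from IP cameras and field encoders
webrtcAddress: :8889      # WHEP/WebRTC distribution toward browser clients

paths:
  source1:
    # Pull an RTSP stream from a camera (address is a placeholder)
    source: rtsp://192.168.40.61:554/stream
    sourceOnDemand: no    # keep the stream active so the AI tap can read it
```

With such a configuration, the same `source1` path is simultaneously readable by browsers at the WHEP endpoint and by the inference service over RTSP, without either consumer connecting to the camera directly.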
During long-duration tests spanning hundreds of hours of operation, the proposed architecture maintained stable streaming performance without service interruptions, demonstrating that the selected protocol stack and server components can sustain continuous monitoring scenarios. These results, together with the low-latency delivery enabled by WebRTC/WHEP, provide the technological foundation for the IMS transmission subsystem.

2.2. Proposed IMS Architecture

This subsection consolidates the findings from the previous technology analysis into the final proposed Integrated Monitoring System (IMS) architecture. The IMS is conceived as a modular, low-latency pipeline that ingests heterogeneous real-time video sources (e.g., IP cameras and UAV payload streams), performs decoding and AI-based object detection, and distributes the processed outputs to live monitoring and post-mission components.
As illustrated in Figure 1, the architecture is organized around four main layers: (i) video and telemetry producers, (ii) a real-time transport and ingestion layer (media gateway/server), (iii) a processing layer for decoding, buffering and inference (YOLO-based detection), and (iv) consumer applications for live visualization, alerting and storage.
Operationally, the end-to-end flow is as follows: (1) producers publish streams using RTSP/RTMP; (2) the media gateway ingests and re-serves streams via WebRTC/WHEP (and optionally HLS) for browser consumption; (3) an inference microservice taps the same stream, executes YOLO-based detection and (4) client applications display live video with overlays.

2.3. Artificial Intelligence Frameworks and Datasets

The Artificial Intelligence subsystem constitutes the cognitive core of the IMS, responsible for autonomously analyzing video streams in real-time to identify people, vehicles, and potential threats in complex environments. Within the scope of the project, this module must operate on high-resolution streams (1080 p) at rates of at least 25 FPS, maintaining a controlled false positive rate and preserving the ability to detect small objects observed from aerial platforms.
A comparative evaluation of artificial vision frameworks (TensorFlow, PyTorch, OpenCV, and DeepStream) and of several training datasets (COCO, xView, DOTA, AI-TOD, and VisDrone) was conducted (summarized in Table 2). TensorFlow stood out for the maturity of its ecosystem and its deployment capabilities on embedded devices; however, recent research and flexibility in model definition were predominantly concentrated in PyTorch. OpenCV was used as a support library for preprocessing and classical vision operations, while DeepStream and GStreamer were considered as a future optimization path for high-density deployments.
Regarding datasets, COCO and xView provide a solid foundation for object detection in ground scenes but do not adequately represent the aerial conditions and relative sizes of the objects of interest in the system. Conversely, VisDrone contains images captured from Unmanned Aerial Vehicles (UAVs) in urban and peri-urban contexts, featuring a high presence of small objects and high target density, making it the most suitable basis for training models intended for aerial surveillance (see Table 3). Based on this, a fine-tuning of YOLOv11 models was performed to adapt them to the system’s operational categories and conditions.

3. System Design and Architecture

The development of the Intelligent Monitoring System (IMS) was structured under an experimental and comparative methodology, aimed at integrating real-time video transmission technologies with AI models for automatic threat detection. The methodology was organized into four main phases: (i) technological analysis, (ii) architectural design, (iii) integrated implementation, and (iv) experimental validation in operational scenarios.

3.1. Technological Analysis Phase

During this phase, a systematic evaluation of video transmission protocols and servers (JPEG-over-WebSocket, RTSP, RTMP, HLS, WebRTC, and its WHEP extension) was conducted, along with an assessment of various computer vision frameworks and architectures (TensorFlow, PyTorch, OpenCV, DeepStream, YOLOv5, YOLOv8, and YOLOv11, among others). A comparative summary of these YOLO generations and their relevance to the proposed system is provided in Table 4. This review combined the study of specifications and technical literature with controlled testing campaigns, considering metrics such as end-to-end latency, transmission stability, CPU consumption, and bandwidth, in addition to detection accuracy, inference speed, and hardware requirements for the AI models.

3.2. Architectural Design Phase

This phase focused on defining a robust, modular architecture for the Intelligent Monitoring System (IMS). This architecture, illustrated in Figure 1 (which shows the overall view and highlights the role of MediaMTX for both RTSP stream ingestion and the distribution of processed streams to the frontend), is composed of four interconnected subsystems.
The system begins with the Video Ingestion Subsystem, which integrates incoming video sources such as IP cameras and RTSP streams originating from UAVs or field encoders. These raw streams are then managed by the Central Media Server (MediaMTX). MediaMTX is a core component responsible for multiprotocol management; it registers the incoming RTSP flows and performs the necessary conversion to WHEP/WebRTC protocols for consumption by the visualization interface through a web browser.
The analytic backbone is provided by the Artificial Intelligence Subsystem, which functions as a real-time detection engine. This subsystem executes the YOLOv11 models, which were specifically trained and optimized on the VisDrone dataset, and is deployed on dedicated GPU/CPU hardware for efficient processing. Finally, the processed information, including real-time video and automatic detections, is presented to the operators via the Visualization Interface, an operational web application designed for deployment in specialized defense environments.
Figure 1 illustrates the proposed IMS architecture, which deliberately separates the real-time video delivery path (operator-oriented) from the AI analytics path (detection-oriented), using MediaMTX as the central distribution core. Video sources (e.g., a drone and an IP camera) deliver RTSP streams to a Video Capture Service, which centralizes acquisition and stabilizes inputs (e.g., reconnection handling, credential management, and stream normalization). Each source is then published to MediaMTX (e.g., source1, source2), where the media server maps them to internal channels (e.g., channel1, channel2) and redistributes them to multiple consumers without requiring each component to connect directly to the original devices.
From these channels, MediaMTX exposes two complementary outputs. For the IMS Frontend, video is delivered via WebRTC using WHEP, enabling browser-native playback with low end-to-end latency. In parallel, the AI module consumes the same channels through RTSP to decode frames and run inference. This split is intentional: AI processing runs as a parallel pipeline that must not block or degrade the primary operator video stream, allowing inference to be scaled or tuned independently.
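The non-blocking split described above can be sketched in pure Python: the operator-facing video path never waits on inference, because the AI tap keeps only the most recent frame and drops stale ones when the detector falls behind. This is a common latest-frame-buffer pattern; the IMS's actual implementation details are not given in the paper.

```python
import queue

class FrameTap:
    """Latest-frame buffer between the video path and the AI path.

    The producer (video pipeline) never blocks; the consumer (detector)
    always receives the freshest frame, and frames that arrive while
    inference is busy are silently discarded.
    """

    def __init__(self):
        self._q = queue.Queue(maxsize=1)
        self.dropped = 0  # frames discarded because inference lagged

    def publish(self, frame):
        """Producer side: non-blocking by design."""
        try:
            self._q.put_nowait(frame)
        except queue.Full:
            try:
                self._q.get_nowait()  # evict the stale frame
                self.dropped += 1
            except queue.Empty:
                pass
            self._q.put_nowait(frame)

    def latest(self, timeout=1.0):
        """Consumer side: block briefly until a frame is available."""
        return self._q.get(timeout=timeout)
```

Because the buffer holds at most one frame, a slow detector degrades only its own effective frame rate, never the operator's live stream.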

3.3. Integrated Implementation Phase

The integrated implementation phase addressed the construction of the transmission and processing pipelines: RTSP streams from cameras and aerial platforms are registered in MediaMTX, which then exposes WHEP/WebRTC endpoints toward the IMS. Simultaneously, these same streams are forwarded to the detection engine for real-time analysis. This orchestration is complemented by event logging services and metadata storage which allow for the historical traceability of detections and their subsequent exploitation using analytical tools.

3.4. Experimental Validation Phase

For the validation of the IMS, continuous testing campaigns exceeding 300 h were executed, utilizing video streams that simulate tactical scenarios with multiple cameras, UAVs, and variable network conditions. We systematically measured end-to-end latency, packet loss rate, media server stability, as well as AI model performance metrics (mAP@50, mAP@50:95, precision, recall, and FPS) for different YOLOv11 variants. These experimental findings substantiate the design decisions presented in the subsequent sections.
Regarding the AI subsystem, detection metrics were calculated on the VisDrone validation set using fixed inference configurations (image size, confidence threshold, and IoU for NMS), which ensures comparability between model variants. Standard metrics in object detection tasks were considered, including precision, recall, mAP@0.5, and mAP@0.5:0.95. The F1–confidence, precision–confidence, and precision–recall curves were used to analyze the system’s behavior under different decision thresholds, while confusion matrices and label distribution allowed us to study the main sources of error and the impact of class imbalance.
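As a reference for how these metrics are grounded, the following minimal sketch shows the IoU computation underlying the mAP@0.5 criterion: a detection of the correct class counts as a true positive when its IoU with a ground-truth box reaches 0.5. Boxes use (x1, y1, x2, y2) pixel coordinates; this is illustrative, not the evaluation code used in the study.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, thr=0.5):
    """Matching rule behind mAP@0.5 (same-class matching assumed)."""
    return iou(pred_box, gt_box) >= thr
```

mAP@0.5:0.95 repeats the same matching over IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averages the resulting APs, which penalizes imprecise localization more strongly.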
Additionally, the stability of the media server and the system’s resource consumption were monitored, recording events such as reconnection attempts, session drops, and CPU/memory utilization on the transmission nodes. These indicators assess not only the instantaneous performance but also the robustness of the IMS against real operational conditions in defense environments.

3.5. Evaluation Protocol

The evaluation was divided into two complementary axes: video transmission and artificial intelligence. Operational validation tests involving human subjects were conducted in strict compliance with privacy standards. All video data captured during these tests were processed in real-time and immediately anonymized; no personally identifiable information (PII) or facial features of civilians were stored or retained in the final datasets.

3.5.1. Transmission Subsystem Evaluation

In the first case, end-to-end latency was defined as the temporal difference between the instant of frame generation at the camera or field encoder and its visualization in the IMS web interface. To estimate this magnitude, timestamps recorded at the system’s endpoints were used, and measurements were repeated during prolonged operational sessions to capture the impact of variations in available bandwidth, jitter, and packet loss.
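The measurement procedure above reduces to simple timestamp arithmetic, sketched below. Function and field names are hypothetical; in practice the capture timestamp was embedded at the source and compared against the display clock, with samples aggregated over prolonged sessions.

```python
import statistics

def end_to_end_latency_ms(capture_ts_ms, display_ts_ms):
    """Latency of one frame: visualization time minus generation time (ms).
    Assumes the endpoint clocks are synchronized."""
    return display_ts_ms - capture_ts_ms

def summarize(samples_ms):
    """Aggregate repeated measurements from a prolonged session."""
    ordered = sorted(samples_ms)
    return {
        "mean_ms": statistics.mean(ordered),
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
        "max_ms": ordered[-1],
    }
```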

3.5.2. AI Subsystem Evaluation

During training, losses associated with bounding box, classification, and DFL were continuously monitored, along with mAP@0.5 and mAP@0.5:0.95 metrics on the validation set. This information was used to select the epoch with the best trade-off between precision and generalization, and to objectively compare the behavior of the different YOLOv11 variants for potential deployment in the IMS.
Optimization schemes and learning rate scheduling common in the YOLO family were utilized, featuring an initial rate in the order of 10−2 and gradual decay strategies across epochs. Training commenced from weights previously trained on general datasets, which allowed us to leverage already learned low-level features and focus the fine-tuning on the specific aerial domain characteristics of VisDrone.
Figure 2 shows representative examples of images from the VisDrone dataset used in the training phase, designated VIS1–VIS2. These images illustrate the high density of annotated objects, the simultaneous presence of multiple classes (vehicles, pedestrians, motorized bicycles, bicycles, tricycles, among others), and the variation in lighting conditions and viewing angles. This type of scene highlights the complexity of the aerial detection problem and justifies the necessity of employing state-of-the-art models such as YOLOv11.

3.6. Data Preparation: VisDrone

Data preparation constituted a critical stage for the overall performance of the system. The VisDrone dataset was utilized for training and validation. This dataset consists of images captured from UAV platforms operating in urban and peri-urban environments and is characterized by the following properties:
  • High object density,
  • Abundance of small targets,
  • Variations in perspective and illumination.
The official partitions (train, validation, test) were preserved to avoid data leakage. The original categories (pedestrian, people, car, van, truck, bus, motor, bicycle, tricycle, and awning-tricycle) were maintained to allow for comparisons with related research.
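Preparing VisDrone for YOLO training requires converting its annotation format (absolute top-left box coordinates plus a category index) into YOLO's normalized center-based format. The sketch below follows the common conversion, in which VisDrone categories 1–10 map to YOLO classes 0–9 and the auxiliary categories 0 ("ignored regions") and 11 ("others") are dropped; it is illustrative rather than the project's preprocessing code.

```python
def visdrone_to_yolo(line, img_w, img_h):
    """Convert one VisDrone annotation line
    (<left>,<top>,<w>,<h>,<score>,<category>,<trunc>,<occl>)
    to a YOLO label line '<class> <cx> <cy> <w> <h>' with coordinates
    normalized to [0, 1], or None for ignored regions / 'others'."""
    left, top, w, h, _score, cat = [float(v) for v in line.split(",")[:6]]
    cat = int(cat)
    if cat in (0, 11):  # 0 = ignored region, 11 = others
        return None
    cx = (left + w / 2) / img_w
    cy = (top + h / 2) / img_h
    return f"{cat - 1} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"
```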

3.7. YOLOv11 Model Training Configuration

A preliminary training experiment was conducted using the YOLO11n variant as a baseline detector for the IMS. This choice was motivated by its low computational footprint, which enables an initial characterization of accuracy and throughput on the VisDrone domain and supports early integration in a real-time streaming pipeline. Larger variants (YOLO11s/YOLO11m/YOLO11l) are reserved for subsequent experiments, with a particular focus on YOLO11m as a candidate to improve detection performance while preserving real-time constraints. A detailed comparison of the computational resources and expected accuracy for each variant is presented in Table 5.
During the training phase, data augmentation techniques provided by the YOLOv11 implementation were applied, including geometric transformations (horizontal flips, random cropping) and photometric transformations (variations in brightness and contrast), as well as composition schemes like mosaic. These operations aim to increase the variability of the samples seen by the model and enhance its generalization capability against changes in illumination, perspective, and object density in real aerial scenes.
We worked with the standard dataset partitioning (training, validation, and testing) defined by the VisDrone authors, avoiding the mixing of sequences or scenes between partitions to prevent information leakage. The original annotations, which include categories such as pedestrian, people, car, van, truck, bus, motor, bicycle, tricycle, and awning-tricycle, were maintained to respect the dataset’s taxonomy and facilitate comparison with related work.
Training was performed with the Ultralytics implementation on GPU hardware for 100 epochs. The input resolution was fixed at 640 × 640 (imgsz = 640) to balance small-object detection performance and per-frame processing cost. The batch size was set to 16 (batch = 16), and caching was enabled in RAM (cache = RAM), noting that disk caching can be used to reduce potential non-determinism. The optimizer was selected automatically by Ultralytics (SGD with lr = 0.01 and momentum = 0.9). Standard YOLO augmentations were enabled (including geometric and photometric transforms and mosaic composition) to improve generalization under illumination, viewpoint, and density variations.
The official VisDrone2019-DET partitioning was preserved to avoid information leakage across splits. The dataset was used in YOLO format with the following split sizes: Train 6471 images, Validation 548 images, and Test 1610 images. The original category taxonomy (pedestrian, people, car, van, truck, bus, motor, bicycle, tricycle, and awning-tricycle) was maintained to facilitate comparison with related work.
Evaluation metrics include mAP@0.5, mAP@0.5:0.95, precision, and recall. For the trained YOLO11n model, Ultralytics reported per-image processing times on the Tesla T4 during validation of approximately: preprocess 1.8 ms, inference 3.5 ms, and postprocess 3.1 ms. These values provide an initial estimate of the incremental latency introduced by the AI stage when inserted into the IMS pipeline (Δt_AI ≈ preprocess + inference + postprocess ≈ 8.4 ms per frame), excluding buffering and scheduling overhead. For deployment, the best checkpoint (best.pt) was exported to ONNX (opset 22), resulting in a compact model file (best.onnx) of approximately 10.1 MB.
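The reported per-frame timings imply the following incremental AI latency and a rough upper bound on sustainable frame rate; this is a back-of-the-envelope check that, as noted above, ignores buffering and scheduling overhead.

```python
# Per-image stage times reported by Ultralytics on the Tesla T4 (ms).
PREPROCESS_MS, INFERENCE_MS, POSTPROCESS_MS = 1.8, 3.5, 3.1

def ai_stage_latency_ms():
    """Incremental latency added by the AI stage per frame (Δt_AI)."""
    return PREPROCESS_MS + INFERENCE_MS + POSTPROCESS_MS

def max_throughput_fps():
    """Upper bound on frame rate if the AI stage were the only bottleneck."""
    return 1000.0 / ai_stage_latency_ms()
```

At roughly 8.4 ms per frame the AI stage alone could sustain well over 100 FPS, comfortably above the 25 FPS requirement stated in Section 2.3; the end-to-end overhead measured in Section 4.3 is therefore dominated by the surrounding pipeline rather than by inference itself.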
Based on the preliminary baseline and the expected accuracy–latency trade-off, YOLO11m is considered as the primary candidate for the final deployment stage; however, its incremental inference latency on the Tesla T4 must be explicitly quantified in the complete end-to-end pipeline. This work therefore reports measured results for YOLO11n on the Tesla T4 GPU.
In the final implementation, the YOLO11 model is deployed as an inference service integrated to the video streams managed by MediaMTX. Each stream is decoded, preprocessed, and fed into the model, which generates bounding boxes, classes, and confidence scores in real-time.

4. Experimental Methodology

4.1. Training Environment and Dataset

All training experiments reported in this work were conducted using the Ultralytics implementation of the YOLOv11 family on a Tesla T4 GPU. The specific training hyperparameters, software versions, and hardware specifications used for the baseline experiments are detailed in Table 6. YOLO11m (and larger variants) is planned for a subsequent training cycle once the end-to-end latency budget is fully characterized and validated on the same hardware.

4.2. Training Configuration and Results

The baseline model was trained for 100 epochs with an input size fixed at 640 × 640 to balance small-object sensitivity in aerial scenes against per-frame computational cost. Batch size was set to 16 to maximize GPU utilization while avoiding out-of-memory errors. Ultralytics automatic optimizer selection resolved to SGD (lr = 0.01, momentum = 0.9).

4.3. Quantitative Analysis of Inference Latency

To accurately quantify the impact of AI-based analytics on the system’s end-to-end latency, a comparative experiment was conducted under controlled network conditions. Two operational scenarios were evaluated: (i) a baseline scenario using pure WebRTC transmission without inference, and (ii) an AI-enabled scenario performing real-time object detection. In both cases, the video source was a DJI Mini 4 Pro drone transmitting at 720 p resolution, maintaining a consistent encoding pipeline towards the IMS frontend.
In the baseline scenario (no inference), the video stream was delivered directly to the web interface via MediaMTX. The end-to-end latency was estimated by comparing the timestamp embedded in the source video against the system clock at the moment of visualization. The measured latency was approximately 267 ms, which is consistent with an optimized WebRTC pipeline and falls well within the sub-second operational target.
Subsequently, the AI analytics module was enabled, executing the YOLO11n model trained on VisDrone. This module processed the stream in real-time on a MacBook Air (M2 processor) to simulate a tactical ground control station (GCS) environment. The inference pipeline operated on a 640 × 480 stream at 30 FPS using the ultrafast preset to minimize encoding delay. Under these conditions, the measured end-to-end latency increased to approximately 1039 ms.
The additional latency introduced by the AI subsystem, ΔL_AI, can be defined as

ΔL_AI = L_(WebRTC+AI) − L_WebRTC ≈ 1039 ms − 267 ms ≈ 772 ms
It is important to note that this 772 ms overhead is not solely driven by the per-frame inference time (which is low for YOLO11n) but reflects the cumulative latency of the complete processing pipeline, including frame extraction, CPU-based decoding, buffering strategies, and the synchronization of metadata with the video stream. These results demonstrate that even without dedicated server-grade GPU acceleration at the edge, the system remains functional with a latency close to 1 s, providing a viable trade-off for portable deployment scenarios.

4.4. Hardware and Network Setup

The experimental validation was conducted using a dual-environment approach to distinguish between model training requirements and operational deployment capabilities.
Training Environment: Model training and baseline validation were executed on a high-performance server equipped with an NVIDIA Tesla T4 GPU (15 GB VRAM), utilizing the software stack described previously (Python 3.12, Ultralytics 8.3, PyTorch 2.8). This environment was used to generate the weights and calculate the mAP metrics presented in Section 5.
Operational Testbed (Edge Scenario): To validate the system’s performance in a portable scenario typical of field operations, the end-to-end latency and stability tests were conducted on a MacBook Air equipped with the Apple M2 chip (8-core CPU, 10-core GPU, 8 GB Unified Memory). This setup acted as the host for the MediaMTX server, the AI inference service, and the visualization frontend.
Network Configuration: The connectivity tests utilized a local wireless network (IEEE 802.11ac) to emulate a constrained tactical link. The specific network parameters during the latency experiments were:
  • Gateway: 192.168.40.1 (Latency to router: 14–68 ms, with sporadic peaks up to ~1000 ms).
  • Active Connections:
    Source (Drone): IP 192.168.40.61 connected to the MediaMTX RTSP port (8554).
    Server (Host): IP 192.168.40.27 listening on RTMP (1936), RTSP (8554), and WebRTC/WHEP (8889).
This configuration ensures that the reported results reflect the performance of the IMS in a realistic, non-idealized network environment suitable for “on-the-move” defense applications.

5. Results and Discussion

To evaluate the performance of the IMS, extensive testing campaigns were conducted, exceeding 300 h of operation, utilizing a combination of fixed IP cameras and streams originating from UAVs in urban and peri-urban scenarios. The experiments were designed to stress both the transmission subsystem and the AI module by varying illumination conditions, object density, and network parameters such as available bandwidth, latency, and packet loss.
From a transmission perspective, the architecture based on RTSP + WHEP/WebRTC managed by MediaMTX consistently maintained end-to-end latencies on the order of 500 ms for HD streams at 30 FPS, even in the presence of moderate variations in bandwidth and jitter. The stability of the media server was maintained without unplanned restarts throughout the testing. In comparison, alternative configurations based on HLS or JPEG image transmission via WebSocket exhibited significantly higher latencies and less efficient resource utilization.
Following integration of the AI subsystem, the YOLOv11n baseline detector was evaluated to characterize the trade-off between detection accuracy and processing cost within the target operational environment. Future experiments will extend this evaluation to larger variants (YOLO11s and YOLO11m) to determine whether the accuracy improvements justify the additional inference latency within the overall IMS latency budget.
At the system integration level, the IMS successfully rendered detection results in real time on the web interface, overlaying object bounding boxes and class labels directly onto the live video stream.
Overall, the results demonstrate that the combination of the proposed transmission architecture and the YOLOv11-based AI module satisfies the defined requirements for latency, accuracy, and operational stability. These requirements were to ensure timely threat detection within sub-second constraints, robust performance in scenes containing small and densely packed objects captured from UAVs, and continuous operation over constrained networks typical of military environments. Meeting these criteria validates the IMS as a viable platform for real-time situational awareness, tactical decision support, and deployment as a core component of an autonomous defense surveillance system.

5.1. YOLOv11 Model Performance Analysis

The confusion matrices, in both absolute (Figure 3a) and normalized (Figure 3b) form, indicate that the dominant sources of confusion are visually similar vehicle categories (e.g., car, van, and bus) and small pedestrians mistaken for background. These observations suggest that incorporating additional domain-specific training data and applying targeted fine-tuning strategies for underrepresented classes could further enhance the robustness of the detection module in future iterations of the IMS.
The class-wise precision–recall curves enable a more granular analysis of the model’s performance characteristics. The highest detection performance is achieved for the car, bus, van, and motor categories, with Average Precision (AP) values ranging approximately from 0.37 to 0.76, whereas categories such as bicycle and awning-tricycle exhibit substantially lower AP values (below 0.15). This trend closely mirrors the label distribution of the training dataset, in which classes such as car, pedestrian, and motor account for the majority of annotated instances, while tricycle, awning-tricycle, and bus remain comparatively underrepresented.
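For readers reproducing these figures, each AP value is the area under that class's precision–recall curve. A minimal sketch of the computation (all-point interpolation, as in COCO/VisDrone-style evaluators) is shown below; the two toy curves are invented stand-ins for a strong (car-like) and a weak (bicycle-like) class, not the actual measurements.

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve (all-point interpolation).

    Inputs must be sorted by increasing recall.
    """
    # Make precision monotonically non-increasing from right to left.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    # Sum rectangles between successive recall points.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap


# Toy curves: a class reaching high recall at high precision scores far
# better than one whose recall saturates early, mirroring the car vs.
# bicycle gap discussed in the text.
strong = average_precision([0.2, 0.5, 0.8], [0.90, 0.85, 0.70])
weak = average_precision([0.05, 0.10, 0.15], [0.50, 0.30, 0.20])
```

The weak class's AP is capped by its maximum recall: no matter how precise its few detections are, the unreached recall range contributes zero area.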
The per-class results further highlight a pronounced disparity in Average Precision (AP) between majority and minority classes. While frequent and structurally well-defined targets such as car and bus achieve AP values of 0.756 and 0.477, respectively, classes with lower representation and more complex geometries, such as bicycle (0.092) and awning-tricycle (0.113), exhibit markedly inferior performance. From a tactical perspective, this degradation is not random but systematic, attributable primarily to two factors:
  • Pixel scarcity (Small Object Problem): In aerial imagery acquired from UAV platforms, objects such as bicycles or tricycles occupy only a very small fraction of the image compared to larger vehicles like buses. As a consequence, the neural network progressively loses discriminative semantic features for these small objects in deeper layers of the architecture.
  • Inter-class ambiguity: The relatively low AP values observed for vans (0.382) and trucks (0.305) indicate that the model experiences difficulty in differentiating between functionally and visually similar vehicle types when observed from a top-down or oblique aerial perspective. In a military or defense context, this limitation suggests potential challenges in reliably distinguishing civilian transport vehicles from light tactical or support vehicles without additional retraining using domain-specific defense datasets.
The F1–confidence curve shows that the global F1 peaks at approximately 0.37 for a confidence threshold close to 0.17. This operating point represents a balanced trade-off between false alarms and missed detections: raising the threshold improves precision but markedly degrades recall, whereas very low thresholds maximize detection coverage at the expense of more false positives. The unusually low optimal threshold indicates that the system operates in a low-certainty regime: to ensure that potential threats are not missed (high recall), the operator must tolerate a higher rate of false positives (lower precision). In a defense environment, the operational cost of a false negative (failing to detect a threat) far exceeds that of a false positive. The IMS is therefore intentionally calibrated to be hypersensitive (threshold 0.17), delegating the final confirmation step to the human operator, which in turn reinforces the need for low-latency transmission to support rapid visual verification.
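The threshold calibration described above amounts to sweeping the confidence axis and maximizing F1 = 2PR/(P+R). The sketch below does exactly that; the precision/recall values are illustrative numbers shaped like the reported curve (recall collapsing as the threshold rises), not the measured data.

```python
def best_f1_threshold(thresholds, precisions, recalls):
    """Pick the confidence threshold that maximizes F1 = 2PR / (P + R)."""
    best_t, best_f1 = None, -1.0
    for t, p, r in zip(thresholds, precisions, recalls):
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Illustrative sweep: precision rises and recall collapses with the
# threshold, so F1 peaks at a low threshold, as observed for YOLO11n.
ts = [0.05, 0.17, 0.40, 0.70]
ps = [0.30, 0.42, 0.55, 0.70]
rs = [0.38, 0.33, 0.20, 0.08]
t_star, f1_star = best_f1_threshold(ts, ps, rs)
```

In a deployed system the same sweep would be run on validation detections, and the chosen threshold could be biased below the F1 optimum when the cost of a missed threat outweighs extra false alarms.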
Regarding the global metrics, the training of YOLO11n reaches maximum values on the order of 0.45 in precision, 0.35 in recall, 0.335 in mAP@0.5, and 0.194 in mAP@0.5:0.95. These peaks are reached around epoch 80 and reflect the difficulty of the VisDrone dataset, characterized by small objects, high target density, and frequent occlusions. Nevertheless, the results confirm that even this lightweight variant of YOLOv11 captures detailed patterns in aerial scenes, with the larger variants expected to improve accuracy at the cost of higher computational consumption.
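The gap between mAP@0.5 (0.335) and mAP@0.5:0.95 (0.194) is driven by localization strictness: the latter averages AP over IoU thresholds from 0.5 to 0.95, so a detection that counts as a hit at IoU 0.5 can become a miss at 0.75. A small worked example with hypothetical boxes:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# A 10 px prediction shifted by 3 px against a 10 px ground-truth box:
gt = (0.0, 0.0, 10.0, 10.0)
pred = (3.0, 0.0, 13.0, 10.0)
score = iou(gt, pred)  # 70 / 130, between the 0.5 and 0.75 IoU thresholds
```

For the tiny objects that dominate VisDrone, a shift of a few pixels is a large fraction of the box, which is why the strict-IoU metric degrades so much faster for small classes.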
The training curves showed a steady, monotonic decrease in the box, classification, and DFL losses in both training and validation; for example, the box loss drops from values close to 1.8 in the early epochs to around 1.25 at the end of training, while the classification loss decreases from approximately 2.4 to 0.85. The absence of divergence between the training and validation curves indicates a stable optimization process: the model is not suffering from overfitting but rather from domain underfitting. It has correctly learned general visual characteristics (edges, textures), but the current architecture and the VisDrone data are insufficient to resolve the high density and occlusions of the “difficult” classes. This observation suggests that further improvements are unlikely to come from tuning training hyperparameters (which are already stable) and will instead require a proprietary dataset containing defense-specific classes (e.g., uniformed soldiers, armed vehicles) to correct the observed semantic imbalance.
In addition to the configuration selected for deployment in the IMS, an auxiliary experiment was conducted with the YOLO11n variant to characterize the detection performance achievable on the VisDrone dataset. The model was trained for 100 epochs, tracking the evolution of the training and validation losses as well as the global precision, recall, and mAP metrics.
Furthermore, a qualitative analysis was conducted on samples from the validation set to visually compare the reference annotations with the predictions generated by YOLO11n. Figure 4 presents representative scenes featuring dense traffic and semi-pedestrian environments, where the model successfully localizes a high number of small objects while maintaining reasonable class consistency. The occasional discrepancies between labels and predictions reveal systematic error patterns (e.g., confusion between bicycle and motor, or between different types of light vehicles) that will inform future annotation strategies and fine-tuning campaigns.
Unlike commercial UAV surveillance platforms, such as DJI FlightHub, which operate as closed ecosystems and frequently rely on cloud-based services with proprietary protocols [18,19,20], the IMS proposes a fully open and deployable on-premise architecture. While commercial solutions limit interoperability and pose data sovereignty risks in defense contexts, the use of MediaMTX and the WHEP standard in our proposal guarantees low-latency transmission (<500 ms) accessible from any modern web browser without requiring specialized client software.
Furthermore, in contrast to prior academic approaches that often rely on legacy protocols such as RTMP or HLS, the IMS demonstrates that it is feasible to integrate computationally intensive YOLOv11 inference into a real-time streaming pipeline, effectively balancing computational load with the tactical requirement for immediate response.

5.2. Comparative Analysis with Existing Solutions

Unlike commercial UAV surveillance platforms that often operate as closed ecosystems or rely on heavy cloud-based offloading, which introduces latency and connectivity risks [21], the IMS proposes a fully open architecture deployable on-premise. While previous architectures for crowd surveillance have successfully utilized Mobile Edge Computing (MEC) to offload processing tasks [21], such dependence on external infrastructure is not always feasible in tactical defense scenarios where autonomous operation is critical.
Regarding transmission protocols, our comparative results align with findings by Bacco et al. [22], who demonstrated that WebRTC significantly optimizes performance on power-constrained platforms compared to traditional streaming methods. By adopting a WebRTC/WHEP pipeline, the IMS achieves end-to-end latencies consistent with these operational requirements, avoiding the delays inherent in legacy protocols like HLS or RTMP observed in older implementations.
Furthermore, regarding the AI subsystem, prior benchmarks on the VisDrone dataset using YOLOv5 and Tiny-YOLO variants highlighted the trade-off between inference speed and small-object detection accuracy [23,24]. The IMS builds upon these baselines by integrating the YOLOv11 family, which our results indicate provides a superior balance—maintaining real-time inference capabilities while improving detection metrics for difficult classes (e.g., bicycles, tricycles) compared to the lighter architectures referenced in [4].

6. Conclusions

6.1. Future Work

Future work will deepen the optimization of inference in PyTorch, expand the datasets with proprietary, defense-domain information, and explore the integration of behavior-analysis techniques and sensor fusion for the early detection of complex threats.
Specifically, future work includes the construction of a proprietary dataset aligned with operational scenarios, based on capture and labeling campaigns at military bases and critical infrastructure perimeters. This dataset will enable the incorporation of defense-specific categories and improve the representation of infrequent classes. Furthermore, the integration of behavior analysis modules (e.g., loitering or object abandonment detection) and multi-object tracking techniques is foreseen, as well as the exploration of multi-sensor information fusion to enhance robustness against occlusions and adverse environmental conditions.
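As a sketch of how a loitering rule from the planned behavior-analysis module could sit on top of multi-object tracking output, the illustrative function below flags tracks that dwell inside a monitored zone. All names, the observation format, and the dwell threshold are hypothetical, not part of the current IMS.

```python
from collections import defaultdict


def loitering_tracks(observations, zone, dwell_frames):
    """Flag track IDs that stay inside a rectangular zone for at least
    `dwell_frames` consecutive frames.

    observations: iterable of (frame_idx, track_id, cx, cy), in frame order,
                  where (cx, cy) is the detection's center point.
    zone: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = zone
    streak = defaultdict(int)  # consecutive in-zone frames per track
    flagged = set()
    for _, tid, cx, cy in observations:
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            streak[tid] += 1
            if streak[tid] >= dwell_frames:
                flagged.add(tid)
        else:
            streak[tid] = 0  # leaving the zone resets the dwell counter
    return flagged


# Track 1 lingers inside the zone for 90 frames (~3 s at 30 FPS);
# track 2 stays outside and is never flagged.
obs = [(f, 1, 50, 50) for f in range(90)] + [(f, 2, 400, 10) for f in range(90)]
alerts = loitering_tracks(obs, zone=(0, 0, 100, 100), dwell_frames=60)
```

A production rule would consume the IDs emitted by a tracker (e.g., ByteTrack-style association over YOLO detections) and express dwell time in seconds rather than frames.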

6.2. Limitations

In addition to the results obtained, it is important to acknowledge the main limitations of this work. First, the VisDrone domain is centered on urban and peri-urban civilian environments; therefore, certain categories and movement patterns characteristic of military scenarios are not sufficiently represented. Second, the class distribution within the dataset is significantly imbalanced, with a high concentration of samples in categories such as car, pedestrian, and motor, and a comparatively small number of instances for classes such as bicycle, tricycle, and awning-tricycle. This imbalance is reflected in more modest Average Precision (AP) metrics for the latter categories.

Author Contributions

Conceptualization, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A. and D.F.G.-F.; methodology, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A. and Y.A.G.-G.; software, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A., D.F.G.-F. and Y.A.G.-G.; validation, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A., D.F.G.-F. and Y.A.G.-G.; formal analysis, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P. and E.C.R.-A.; investigation, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A. and D.F.G.-F.; resources, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P. and E.C.R.-A.; data curation, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P. and E.C.R.-A.; writing—original draft preparation, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A., D.F.G.-F. and Y.A.G.-G.; writing—review and editing, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A. and Y.A.G.-G.; visualization, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A. and Y.A.G.-G.; supervision, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., E.C.R.-A. and Y.A.G.-G.; project administration, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., D.F.G.-F. and E.C.R.-A.; funding acquisition, J.J.C.-C., W.E.C.-A., G.A.G.-M., J.D.O.-P., D.F.G.-F. and E.C.R.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Colombian Aerospace Force Command. Grant: ACTI 05: Studies and research for the development of a prototype system with video analytics capabilities for drone command and control PHASE 1: Integration of commercial drones.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the observational nature of the research, which involved the analysis of anonymized video streams in public spaces without interaction with the subjects.

Data Availability Statement

Restrictions apply to the availability of these data. Data was obtained from operational defense tests and are available from the authors with the permission of the Colombian Aerospace Force Command due to privacy and security protocols.

Acknowledgments

During the preparation of this manuscript/study, the author(s) used DeepL.com for the purposes of reviewing the final draft of the text. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schuur, P.C. An Analytics-Based Framework for Military Technology Adoption and Combat Strategy. Decis. Anal. J. 2025, 15, 100586.
  2. Li, Y.; Zhang, L.; Wang, K.; Xu, L.; Gulliver, T.A. Underwater Acoustic Intelligent Spectrum Sensing with Multimodal Data Fusion: An Mul-YOLO Approach. Future Gener. Comput. Syst. 2025, 173, 107880.
  3. Nandal, P.; Bohra, N.; Mann, P.; Das, N.N. YOLOv11 with Transformer Attention for Real-Time Monitoring of Ships: A Federated Learning Approach for Maritime Surveillance. Results Eng. 2025, 27, 106297.
  4. Jang, S.; Kim, C.; Nam, H.; Kim, D.; Kim, D.; Lee, K.; Kim, K.H. Demand-Driven Standardization for Multirotor UAVs. Aerosp. Sci. Technol. 2026, 168, 111223.
  5. Ran, W.; Nantogma, S.; Zhang, S.; Xu, Y. Bio-Inspired UAV Swarm Operation Approach towards Decentralized Aerial Electronic Defense. Appl. Soft Comput. 2025, 177, 113136.
  6. Zhang, W.; Wang, T.; Li, Y. Trajectory Planning for Multiple UAVs in Three-Dimensional Suppression of Enemy Air Defense Missions. Int. J. Transp. Sci. Technol. 2025.
  7. Deng, L.; Kang, D.; Liu, Q. AirSentinel-YFSNet: Scale-Reconstruction Enhanced YOLOv8 for UAV Intrusion Defense. Results Eng. 2025, 28, 107359.
  8. Zhu, X.; Zhu, X.; Yan, R.; Peng, R. Optimal Routing, Aborting and Hitting Strategies of UAVs Executing Hitting the Targets Considering the Defense Range of Targets. Reliab. Eng. Syst. Saf. 2021, 215, 107811.
  9. Obaid, L.; Hamad, K.; Al-Ruzouq, R.; Dabous, S.A.; Ismail, K.; Alotaibi, E. State-of-the-Art Review of Unmanned Aerial Vehicles (UAVs) and Artificial Intelligence (AI) for Traffic and Safety Analyses: Recent Progress, Applications, Challenges, and Opportunities. Transp. Res. Interdiscip. Perspect. 2025, 33, 101591.
  10. Shen, Z.; Zhang, H.; Bian, L.; Zhou, L.; Tian, Q.; Ge, Y. AI-Powered UAV Remote Sensing for Drought Stress Phenotyping: Automated Chlorophyll Estimation in Individual Plants Using Deep Learning and Instance Segmentation. Expert Syst. Appl. 2026, 299, 130141.
  11. Catala-Roman, P.; Segura-Garcia, J.; Dura, E.; Navarro-Camba, E.A.; Alcaraz-Calero, J.M.; Garcia-Pineda, M. AI-Based Autonomous UAV Swarm System for Weed Detection and Treatment: Enhancing Organic Orange Orchard Efficiency with Agriculture 5.0. Internet Things 2024, 28, 101418.
  12. Liu, L.; Meng, L.; Li, A.; Lv, Y.; Zhao, B. PD-YOLOv11: A Power Distribution Enabled YOLOv11 Algorithm for Power Transmission Tower Component Detection in UAV Inspection. Alex. Eng. J. 2025, 131, 312–324.
  13. Liu, L.; Meng, L.; Li, X.; Liu, J.; Bi, J. WCD-YOLOv11: A Lightweight YOLOv11 Model for the Real-Time Image Processing in UAV. Alex. Eng. J. 2025, 133, 73–88.
  14. Nayeem, N.I.; Mahbuba, S.; Disha, S.I.; Buiyan, M.R.H.; Rahman, S.; Abdullah-Al-Wadud, M.; Uddin, J. A YOLOv11-Based Deep Learning Framework for Multi-Class Human Action Recognition. Comput. Mater. Contin. 2025, 85, 1541–1557.
  15. Wang, H.; Zhang, Y.; Zhu, C. DAFPN-YOLO: An Improved UAV-Based Object Detection Algorithm Based on YOLOv8s. Comput. Mater. Contin. 2025, 83, 1929–1949.
  16. Liang, Y.; Yang, L.; Sun, S.; Li, Z.; Shi, Y.; Zhang, Z.; Zhang, H.; Li, Z.; Zhou, L.; Zhang, Z.; et al. YOLOv11-RAH: A Recurrent Attention-Enhanced Edge Intelligence Network for UAV-Based Power Transmission Line Insulator Inspection. Int. J. Intell. Netw. 2025, 6, 244–252.
  17. Lu, S.; Zhao, H.; Zhang, E.; Zhao, Y.; Zhang, Y.; Zhang, Z. IMV-YOLO: Infrared Multi-Angle Vehicle Real-Time Detection Network Based YOLOv11 for Adverse Weather Conditions. Int. J. Intell. Comput. Cybern. 2025, 18, 731–758.
  18. Xie, S.; Deng, G.; Lin, B.; Jing, W.; Li, Y.; Zhao, X. Real-Time Object Detection from UAV Inspection Videos by Combining YOLOv5s and DeepStream. Sensors 2024, 24, 3862.
  19. Patel, U.; Tanwar, S.; Nair, A. Performance Analysis of Video On-Demand and Live Video Streaming Using Cloud Based Services. Scalable Comput. Pract. Exp. 2020, 21, 479–496.
  20. Wang, J.; Feng, Z.; Chen, Z.; George, S.; Bala, M.; Pillai, P.; Yang, S.W.; Satyanarayanan, M. Bandwidth-Efficient Live Video Analytics for Drones via Edge Computing. In Proceedings of the 2018 3rd ACM/IEEE Symposium on Edge Computing (SEC 2018), Bellevue, WA, USA, 25–27 October 2018; pp. 159–173.
  21. Motlagh, N.H.; Bagaa, M.; Taleb, T. UAV-Based IoT Platform: A Crowd Surveillance Use Case. IEEE Commun. Mag. 2017, 55, 128–134.
  22. Bacco, M.; Catena, M.; De Cola, T.; Gotta, A.; Tonellotto, N. Performance Analysis of WebRTC-Based Video Streaming Over Power Constrained Platforms. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 9–13 December 2018; pp. 1–7.
  23. Gunawan, T.S.; Ismail, I.M.M.; Kartiwi, M.; Ismail, N. Performance Comparison of Various YOLO Architectures on Object Detection of UAV Images. In Proceedings of the 8th IEEE International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA 2022), Melaka, Malaysia, 27–28 September 2022; pp. 257–261.
  24. Jawaharlalnehru, A.; Sambandham, T.; Sekar, V.; Ravikumar, D.; Loganathan, V.; Kannadasan, R.; Khan, A.A.; Wechtaisong, C.; Haq, M.A.; Alhussen, A.; et al. Target Object Detection from Unmanned Aerial Vehicle (UAV) Images Based on Improved YOLO Algorithm. Electronics 2022, 11, 2343.
Figure 1. Overall view of the IMS architecture, highlighting the use of MediaMTX for both RTSP stream ingestion and the distribution of processed streams to the frontend.
Figure 2. VIS1–VIS2. These images illustrate the high density of annotated objects, the simultaneous presence of multiple classes (vehicles, pedestrians, motorized bicycles, bicycles, tricycles, among others), and the variation in lighting conditions and viewing angles.
Figure 3. Confusion matrix: (a) Absolute confusion matrix (instance count) of YOLO11n on VisDrone; (b) Normalized confusion matrix of YOLO11n on VisDrone, evidencing the main confusions between classes.
Figure 4. VIS3–VIS4. Qualitative comparison between reference annotations and YOLO11n predictions on the VisDrone validation set, illustrating the model’s behavior in scenes with high object density and different capture conditions.
Table 1. Comparative Evaluation of Real-Time Video Transmission Protocols.
| Protocol | Latency Range (ms) | Browser Support | Bandwidth Efficiency | Scalability | Security | Primary Use Cases | Key Strengths | Notable Limitations |
|---|---|---|---|---|---|---|---|---|
| JPEG-over-WebSocket | 1500–2500 | Native (WebSocket API) | Low | Poor | TLS-dependent | Surveillance feeds, simple monitoring | Minimal implementation complexity; no specialized player required; direct browser integration | High CPU overhead; inefficient compression; limited concurrent streams |
| RTSP | 200–600 | None (requires transcoding) | High | Moderate | Optional (RTSPS) | IP cameras, professional broadcasting | Industry-standard for camera integration; low-latency delivery; mature protocol | Firewall traversal issues; requires gateway for web delivery; complex NAT handling |
| RTMP | 500–2000 | Deprecated (Flash EOL 2020) | Moderate | Moderate | Optional (RTMPS) | Legacy streaming, video ingestion | Established server ecosystem; reliable stream delivery; low setup complexity | Browser support eliminated; declining platform support; limited modern codec support |
| HLS | 2000–5000 | Universal (HTML5 native) | High | Excellent | AES encryption | VOD, live events, mobile streaming | CDN-optimized delivery; adaptive bitrate streaming; cross-platform compatibility | Segment-based latency overhead; initial buffering delay; not suitable for real-time interaction |
| WebRTC | 100–500 | Native (modern browsers) | Moderate | Good (with SFU/MCU) | Mandatory (DTLS-SRTP) | Video conferencing, real-time collaboration | Sub-second latency; peer-to-peer capable; built-in encryption | Complex signaling requirements; infrastructure costs (TURN/STUN); bandwidth-intensive |
| WHEP | 150–500 | Native (HTTP/WebRTC) | Moderate | Excellent | Mandatory (inherited from WebRTC) | Low-latency broadcasting, sports streaming | Standardized ingestion protocol; HTTP-based infrastructure; combines WebRTC speed with HTTP scalability | Limited production deployments; emerging tooling ecosystem; specification still evolving (IETF draft) |
Table 2. Comparative Evaluation of Artificial Vision Frameworks for the IMS AI Subsystem.
| Framework | Latest Version (2025) | Release Date | GitHub Stars | License | Python Support | Main Strengths | Main Weaknesses | Role in IMS |
|---|---|---|---|---|---|---|---|---|
| TensorFlow | 2.20.0 | Aug 2025 | 192 k | Apache 2.0 | ≥3.9 | Mature ecosystem; good tooling; deployment to edge devices | Heavier graph model; less flexible for experimental research | Considered for deployment; not selected as main training framework |
| PyTorch | 2.9.1 | Nov 2025 | 94 k | BSD-3 | ≥3.10 | High flexibility; strong research community; easy experimentation | Slightly less tooling for production in some cases | Main training framework for YOLOv11 on VisDrone |
| OpenCV | 4.13.0 | Dec 2025 | 85.6 k | Apache 2.0 | ≥3.7 | Rich set of classical vision algorithms; efficient C++ backend | Not a deep-learning framework by itself | Support library for preprocessing and classical vision tasks |
| DeepStream | 8.0/7.1 | 2025 | ~1.8 k | Proprietary (NVIDIA) | ≥3.8 | Optimized multi-stream inference on NVIDIA GPUs; tight GStreamer integration | Hardware/vendor specific; higher initial complexity | Considered as future path for large-scale optimized deployment |
Table 3. Comparative Evaluation of Datasets for Object Detection in IMS Aerial Environments.
| Dataset | Images | Objects | Categories | Avg. Objects/Image | Aerial Context | Main Strengths | Main Weaknesses | Role in IMS |
|---|---|---|---|---|---|---|---|---|
| COCO | 118 k (train) | 860 k | 80 | 7.3 | Limited (mostly ground-level) | General-purpose objects in everyday scenes | Not designed for aerial surveillance; ground-level perspective | Useful for pretraining; not sufficient for aerial surveillance |
| xView | 1127 | 1 M+ | 60 | ~900 | High | Satellite and aerial imagery at large scale | Different scale and sensor characteristics; primarily satellite imagery | Relevant for wide-area detection but with different scale and sensor characteristics |
| DOTA | 2806 | 188 k | 15 | 67 | High | Aerial scenes with oriented objects | Oriented bounding box (OBB) format; focused on large objects | Good for aerial detection, focused on oriented bounding boxes |
| AI-TOD | 28,036 | 700 k+ | 8 | ~25 | High | Tiny objects in aerial images; emphasis on small-scale detection | Limited scene diversity; primarily focused on tiny object challenge | Valuable for small-object detection research |
| VisDrone | 10,209 | 2.6 M | 10 | 256 | High | Urban/peri-urban scenes captured from UAVs; high-density annotations | Challenging lighting and occlusion conditions | Most aligned with IMS requirements; selected as main dataset for fine-tuning |
Table 4. Evolution of YOLO Families and their Suitability for the IMS Context.
| YOLO Family | Representative Model | Key Improvements | Typical Use in IMS Context | Remarks |
|---|---|---|---|---|
| YOLOv5 | YOLOv5s/m | First widely adopted Ultralytics family; strong baseline for real-time detection; mature tooling. | Baseline experiments and early prototypes of IMS. | Good accuracy/speed balance but less efficient and less expressive than later families. |
| YOLOv8 | YOLOv8m | Anchor-free design, improved head and loss functions; better performance on small objects. | Candidate for improved accuracy on dense aerial scenes. | Higher accuracy than v5 at similar or slightly higher computational cost. |
| YOLOv9 | YOLOv9c | Enhanced training strategies and architectural refinements; focus on accuracy. | Exploratory reference for high-accuracy configurations. | Oriented towards benchmarks; may be heavy for multi-stream real-time IMS deployment. |
| YOLOv10 | YOLOv10m | Optimisations focused on efficiency and latency; improved deployment characteristics. | Intermediate option when inference resources are limited but accuracy remains important. | Promising trade-off but less integrated into the current flow of IMS experimentation. |
| YOLOv11 | YOLOv11m/n | Latest Ultralytics family with refinements in backbone, neck and training recipes; strong small-object performance. | Main IMS model for real-time deployment on VisDrone-like flows. | Offers the best compromise between accuracy and speed in the evaluated context. |
Table 5. Performance Comparison of YOLOv11 Variants: Accuracy vs. Inference Cost.
| Model | Size (pixels) | mAP@0.5:0.95 (val) | Speed, CPU ONNX (ms) | Speed, T4 TensorRT10 (ms) | Params (M) | FLOPs (B) |
|---|---|---|---|---|---|---|
| YOLO11x | 640 | 54.7 | 462.8 ± 6.7 | 11.3 ± 0.2 | 56.9 | 194.9 |
| YOLO11s | 640 | 47.0 | 90.0 ± 1.2 | 2.5 ± 0.0 | 9.4 | 21.5 |
| YOLO11n | 640 | 39.5 | 56.1 ± 0.8 | 1.5 ± 0.0 | 2.6 | 6.5 |
| YOLO11m | 640 | 51.5 | 183.2 ± 2.0 | 4.7 ± 0.1 | 20.1 | 68.0 |
| YOLO11l | 640 | 53.4 | 238.6 ± 1.4 | 6.2 ± 0.1 | 25.3 | 86.9 |
Table 6. Training environment and dataset used for the preliminary YOLO11n baseline.
| Category | Details |
|---|---|
| Training hyperparameters | imgsz = 640; batch = 16; epochs = 100; cache = RAM (alternative: disk cache for determinism); optimizer = SGD (auto); lr = 0.01; momentum = 0.9 |
| Validation metrics | mAP@0.5 = 0.3360; mAP@0.5:0.95 = 0.1954; Precision ≈ 0.438; Recall ≈ 0.338 |
| Training time | ≈3.5 h (Tesla T4) |
| Ultralytics val timing (per image, T4) | preprocess ≈ 1.8 ms; inference ≈ 3.5 ms; postprocess ≈ 3.1 ms |
| Export for deployment | ONNX export (opset = 22): best.onnx (~10.1 MB) |

| Item | Value |
|---|---|
| Python | 3.12.12 |
| Ultralytics | 8.3.227 |
| PyTorch | 2.8.0 + cu126 |
| GPU | Tesla T4 (~15 GB VRAM) |
| Dataset | VisDrone2019-DET (YOLO format) |
| Split | Train = 6471; Val = 548; Test = 1610 images |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Castro-Castaño, J.J.; Chirán-Alpala, W.E.; Giraldo-Martínez, G.A.; Ortega-Pabón, J.D.; Rodríguez-Amézquita, E.C.; Gallego-Franco, D.F.; Garcés-Gómez, Y.A. Low-Latency Autonomous Surveillance in Defense Environments: A Hybrid RTSP-WebRTC Architecture with YOLOv11. Computers 2026, 15, 62. https://doi.org/10.3390/computers15010062


