Search Results (80)

Search Parameters:
Keywords = multi-camera surveillance system

25 pages, 3673 KB  
Systematic Review
Recent Advances in Multi-Camera Computer Vision for Industry 4.0 and Smart Cities: A Systematic Review
by Carlos Julio Fierro-Silva, Carolina Del-Valle-Soto, Samih M. Mostafa and José Varela-Aldás
Algorithms 2026, 19(4), 249; https://doi.org/10.3390/a19040249 - 25 Mar 2026
Viewed by 837
Abstract
The rapid deployment of surveillance cameras in urban, industrial, and domestic environments has intensified the need for intelligent systems capable of analyzing video streams beyond the limitations of single-camera setups. Unlike traditional single-camera approaches, multi-camera systems expand spatial coverage, reduce blind spots, and enable consistent tracking of people and objects across non-overlapping views, thereby improving robustness against occlusions and viewpoint changes. This article presents a comprehensive review of multi-camera vision systems published between 2020 and 2025, covering application domains including public security and biometrics, intelligent transportation, smart cities and IoT, healthcare monitoring, precision agriculture, industry and robotics, pan–tilt–zoom (PTZ) camera networks, and emerging areas such as retail and forensic analysis. The review synthesizes predominant technical approaches, including deep-learning-based detection, multi-target multi-camera tracking (MTMCT), re-identification (Re-ID), spatiotemporal fusion, and edge computing architectures. Persistent challenges are identified, particularly in inter-camera data association, scalability, computational efficiency, privacy preservation, and dataset availability. Emerging trends such as distributed edge AI, cooperative camera networks, and active perception are discussed to outline future research directions toward scalable, privacy-aware, and intelligent multi-camera infrastructures.
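
To make the review's recurring theme of inter-camera data association concrete, here is a minimal sketch of the Re-ID matching step that MTMCT pipelines share: appearance embeddings from one camera are matched against a gallery from another by cosine similarity. The embedding dimension and acceptance threshold are illustrative assumptions, not values from the review.

```python
import numpy as np

def match_across_cameras(query_embs, gallery_embs, threshold=0.6):
    """Return the gallery index for each query, or -1 if no match clears the threshold."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sim = q @ g.T                                  # pairwise cosine similarity
    best = sim.argmax(axis=1)
    return np.where(sim[np.arange(len(q)), best] >= threshold, best, -1)

# Example: 3 detections from camera A against a 5-identity gallery from camera B
rng = np.random.default_rng(0)
print(match_across_cameras(rng.normal(size=(3, 128)), rng.normal(size=(5, 128))))
```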

35 pages, 6720 KB  
Article
Vision-Based Vehicle State and Behavior Analysis for Aircraft Stand Safety
by Ke Tang, Liang Zeng, Tianxiong Zhang, Di Zhu, Wenjie Liu and Xinping Zhu
Sensors 2026, 26(6), 1821; https://doi.org/10.3390/s26061821 - 13 Mar 2026
Viewed by 410
Abstract
With the continuous elevation of aviation safety standards, accurate monitoring of ground support vehicles in aircraft stand areas has become a critical task for enhancing overall aircraft stand operational safety. Given the limitations of existing surface movement radar and multi-camera surveillance systems in terms of cost, deployment complexity, and coverage, this paper proposes a lightweight vision-based framework for vehicle state perception and spatiotemporal behavior analysis oriented toward aircraft stand safety. Leveraging existing fixed monocular monitoring resources in the stand area, the framework first establishes a precise mapping from image pixel coordinates to the physical plane through self-calibration and homography transformation utilizing scene line features, thereby achieving unified spatial measurement of vehicle targets. Subsequently, it integrates an improved lightweight YOLO detector (incorporating Ghost modules and CBAM for noise suppression) with the ByteTrack tracking algorithm to enable stable extraction of vehicle trajectories under complex occlusion conditions. Finally, by combining functional zone division within the stand, a semantic map is constructed, and a behavior analysis method based on a spatiotemporal finite state machine is proposed. This method performs joint reasoning by fusing multi-dimensional constraints including position, zone, and time, enabling automatic detection of abnormal behaviors such as “intrusion into restricted areas” and “abnormal stop.” Quantitative evaluations demonstrate the framework’s efficacy: it achieves an average physical localization error (RMSE) of 0.32 m, and the improved detection model reaches an accuracy (mAP@50) of 90.4% for ground support vehicles. In tests simulating typical violation scenarios, the system achieved high recall (96.0%) and precision (95.8%) rates in detecting ‘area intrusion’ and ‘abnormal stop’ violations, respectively. These results, achieved using only existing surveillance cameras, validate its potential as a cost-effective and easily deployable tool to augment existing safety monitoring systems for airport ground operations.
(This article belongs to the Special Issue Intelligent Sensing and Control Technology for Unmanned Vehicles)
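
The pixel-to-plane mapping described in this abstract can be sketched with standard OpenCV calls: given four (or more) image-to-ground correspondences, a homography projects any pixel onto the physical plane. The correspondence values below are illustrative assumptions, not the paper's calibration.

```python
import cv2
import numpy as np

img_pts = np.float32([[420, 710], [1490, 695], [1180, 380], [600, 385]])  # pixels
world_pts = np.float32([[0, 0], [20, 0], [20, 35], [0, 35]])              # metres on the stand plane

H, _ = cv2.findHomography(img_pts, world_pts)

def pixel_to_plane(u, v):
    """Project an image point onto the physical ground plane (metres)."""
    pt = cv2.perspectiveTransform(np.float32([[[u, v]]]), H)
    return pt[0, 0]

print(pixel_to_plane(900, 540))  # e.g. a tracked vehicle's footprint point
```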

27 pages, 2940 KB  
Article
A Unified Framework for Vehicle Detection, Tracking, and Counting Across Ground and Aerial Views Using Knowledge Distillation with YOLOv10-S
by Md Rezaul Karim Khan and Naphtali Rishe
Remote Sens. 2026, 18(5), 842; https://doi.org/10.3390/rs18050842 - 9 Mar 2026
Viewed by 695
Abstract
Accurate and reliable vehicle detection, tracking, and counting across different surveillance platforms are fundamental requirements for developing smart Traffic Management Systems (TMS) and promoting sustainable urban mobility. Recent advances in both ground-level surveillance and remote sensing using deep learning have opened new opportunities for extracting detailed vehicular information from high-resolution aerial and surveillance video data. This research presents a unified, real-time vehicle analysis framework that integrates lightweight deep learning–based detection, robust multi-object tracking, and trajectory-driven counting within a single modular pipeline. The proposed framework employs the “You Only Look Once” detector YOLOv10-S as the detection backbone and enhances its robustness through supervision-level knowledge distillation without introducing any architectural modifications. Temporal consistency is enforced using an observation-centric multi-object tracking algorithm (OC-SORT), enabling stable identity preservation under camera motion and dense traffic conditions. Vehicle counting is performed using a trajectory-based virtual gate strategy, reducing duplicate counts and improving counting reliability. Comprehensive experiments conducted on the UA-DETRAC and VisDrone benchmarks show that the proposed framework effectively balances detection performance, tracking robustness, counting accuracy, and real-time efficiency in both ground-based and aerial surveillance settings. Furthermore, cross-dataset evaluations under direct train–test transfer highlight the inherent challenges of domain shift while showing that knowledge distillation consistently improves robustness in detection, tracking identity consistency, and vehicle counting. Overall, this framework enables effective real-world traffic monitoring by adopting a scalable and practical system design, where reliability is prioritized over architectural complexity.
(This article belongs to the Section Urban Remote Sensing)
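
The trajectory-based virtual gate strategy mentioned above reduces to a segment-intersection test: a track is counted once when it crosses the gate, regardless of how many frames it is detected in. A minimal sketch, with illustrative gate and track coordinates:

```python
def side(p, q, r):
    """Signed area test: which side of the line p->q the point r lies on."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def crossed(p0, p1, a, b):
    """True if the track step p0->p1 intersects the gate segment a->b."""
    return side(a, b, p0) * side(a, b, p1) < 0 and side(p0, p1, a) * side(p0, p1, b) < 0

def count_vehicles(tracks, gate):
    counted = set()
    for tid, traj in tracks.items():
        for p0, p1 in zip(traj, traj[1:]):
            if tid not in counted and crossed(p0, p1, *gate):
                counted.add(tid)           # each track ID is counted at most once
    return len(counted)

gate = ((0, 5), (10, 5))
tracks = {1: [(2, 1), (2, 4), (3, 8)],     # crosses the gate
          2: [(6, 2), (6, 4)]}             # stays on one side
print(count_vehicles(tracks, gate))        # -> 1
```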

18 pages, 2558 KB  
Article
Evaluating a Multi-Camera Markerless System for Capturing Basketball-Specific Movements: An Exploration Using 25 Hz Video Streams
by Zhaoyu Li, Zhenbin Tan, Wen Zheng, Ganling Yang, Junye Tao, Mingxin Zhang and Xiao Xu
Sensors 2026, 26(5), 1689; https://doi.org/10.3390/s26051689 - 7 Mar 2026
Viewed by 635
Abstract
Markerless motion capture (MMC) provides a non-invasive alternative for motion analysis; however, its validity at the standard frame rate of 25 Hz commonly used in broadcast and surveillance applications remains to be established. This study evaluated the performance of a 25 Hz multi-camera MMC workflow using consumer-grade cameras for capturing basketball-specific movements. Three highly trained male athletes completed seven tasks, including sprinting and simulated sport-specific skills, while being synchronously recorded by six MMC cameras (DJI Action 5 Pro, 25 fps) and a 10-camera Vicon system (25 Hz). Kinematic data were processed using an RTMDet–RTMPose pipeline and low-pass filtered at 6 Hz. Waveform validity was assessed using Pearson’s correlation coefficient (r) and the root mean square error (RMSE). The displacement magnitudes of 12 joints showed excellent agreement (r = 0.916–0.994; median nRMSE = 0.54–1.32%), indicating robust trajectory reconstruction. In contrast, agreement decreased for derivative variables: velocity (r = 0.583–0.867) and acceleration (r = 0.232–0.677) were highly sensitive to the low sampling rate and numerical differentiation. Although a 25 Hz configuration is insufficient for high-precision impact analysis, it provides acceptable accuracy for macroscopic displacement tracking and external-load quantification in resource-constrained training environments. Future optimization should prioritize temporal synchronization to improve the reliability of derivative variables.
(This article belongs to the Special Issue Multi-Sensor Systems for Object Tracking—2nd Edition)
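
The validity metrics reported here (Pearson's r and range-normalised RMSE), and the reason derivative variables degrade at 25 Hz, can be reproduced on synthetic signals in a few lines; the signals below are stand-ins, not the study's data.

```python
import numpy as np

def waveform_agreement(x, ref):
    r = np.corrcoef(x, ref)[0, 1]
    nrmse = np.sqrt(np.mean((x - ref) ** 2)) / (ref.max() - ref.min())
    return r, nrmse

fs = 25.0                                  # Hz, the frame rate under test
t = np.arange(0, 4, 1 / fs)
ref = np.sin(2 * np.pi * 1.5 * t)          # "true" joint displacement
mmc = ref + np.random.default_rng(1).normal(0, 0.02, t.size)  # markerless estimate

print(waveform_agreement(mmc, ref))          # displacement agrees closely
vel_mmc = np.gradient(mmc, 1 / fs)           # numerical differentiation...
vel_ref = np.gradient(ref, 1 / fs)
print(waveform_agreement(vel_mmc, vel_ref))  # ...amplifies noise at low sampling rates
```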

24 pages, 6103 KB  
Article
Enhancing Alarm Localization in Multi-Window Map Interfaces with Spatialized Auditory Cues: An Eye-Tracking Study
by Jing Zhang, Xiaoyu Zhu, Wenzhe Tang, Weijia Ge, Yong Zhang and Jing Li
ISPRS Int. J. Geo-Inf. 2026, 15(2), 69; https://doi.org/10.3390/ijgi15020069 - 6 Feb 2026
Viewed by 563
Abstract
Modern geo-information platforms commonly adopt multi-window map interfaces that integrate heterogeneous data, such as dynamic maps and live camera feeds. These interfaces impose high cognitive load and slow spatial event detection. Operators must rapidly locate the source of visual alarms, a task often leading to delays under high visual workload. To address this challenge, this study investigated whether spatialized auditory cues can improve alarm localization in such complex monitoring interfaces. A controlled experiment with 24 participants used a within-subjects design to test factors of auditory spatial cueing (none, binaural, monaural), display dynamics (dynamic, static), and interface complexity (4, 8, 12 panes). Behavioral and eye-tracking data measured detection accuracy, efficiency, and gaze patterns. Results showed that dynamic displays and high interface complexity impaired performance, indicating increased cognitive load. In contrast, monaural lateralized auditory alarms substantially improved detection efficiency and mitigated visual overload. Interaction analyses revealed that binaural cues reduced the performance costs of dynamic displays, whereas monaural cues compensated for high-density layouts. These findings demonstrate that spatialized auditory alarms effectively support spatiotemporal situational awareness and improve operator performance in high-load geo-surveillance systems. The study offers empirical and practical implications for designing cognitively ergonomic, multimodal interfaces that move beyond purely visual alarm designs.

15 pages, 1607 KB  
Article
Using Steganography and Artificial Neural Network for Data Forensic Validation and Counter Image Deepfakes
by Matimu Caswell Nkuna, Ebenezer Esenogho and Ahmed Ali
Computers 2026, 15(1), 61; https://doi.org/10.3390/computers15010061 - 15 Jan 2026
Viewed by 772
Abstract
The merging of the Internet of Things (IoT) and Artificial Intelligence (AI) advances has intensified challenges related to data authenticity and security. These advancements necessitate a multi-layered security approach to ensure the security, reliability, and integrity of critical infrastructure and intelligent surveillance systems. This paper proposes a two-layered security approach that combines a discrete cosine transform least significant bit 2 (DCT-LSB-2) with artificial neural networks (ANNs) for data forensic validation and mitigating deepfakes. The proposed model encodes validation codes within the LSBs of cover images captured by an IoT camera on the sender side, leveraging the DCT approach to enhance the resilience against steganalysis. On the receiver side, a reverse DCT-LSB-2 process decodes the embedded validation code, which is subjected to authenticity verification by a pre-trained ANN model. The ANN validates the integrity of the decoded code and ensures that only device-originated, untampered images are accepted. The proposed framework achieved an average SSIM of 0.9927 across the entire investigated embedding capacity, ranging from 0 to 1.988 bpp. DCT-LSB-2 showed a stable Peak Signal-to-Noise Ratio (average 42.44 dB) under various evaluated payloads ranging from 0 to 100 kB. The proposed model achieved a resilient and robust multi-layered data forensic validation system.
(This article belongs to the Special Issue Multimedia Data and Network Security)
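
A minimal sketch of the transform-domain embedding idea (not the paper's exact DCT-LSB-2 scheme): write validation bits into the two least-significant bits of a rounded mid-frequency DCT coefficient per 8×8 block. The block and coefficient choices are illustrative assumptions.

```python
import numpy as np
import cv2

COEF = (3, 2)  # a mid-frequency position inside each 8x8 DCT block (assumed)

def embed(gray, bits):
    """Hide `bits` two-at-a-time in one DCT coefficient per 8x8 block (illustrative)."""
    out, k = gray.astype(np.float32), 0
    for y in range(0, gray.shape[0] - 7, 8):
        for x in range(0, gray.shape[1] - 7, 8):
            if k + 2 > len(bits):
                return out
            block = cv2.dct(out[y:y + 8, x:x + 8])
            c = int(round(float(block[COEF])))
            block[COEF] = (c & ~3) | (bits[k] << 1) | bits[k + 1]  # rewrite the 2 LSBs
            out[y:y + 8, x:x + 8] = cv2.idct(block)
            k += 2
    return out

img = np.random.default_rng(2).integers(0, 256, (64, 64)).astype(np.uint8)
stego = embed(img, [1, 0, 1, 1, 0, 0, 1, 0])  # e.g. one byte of a validation code
```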

37 pages, 7157 KB  
Article
Research on Pedestrian Dynamics and Its Environmental Factors in a Jiangnan Water Town Integrating Video-Based Trajectory Data and Machine Learning
by Hongshi Cao, Zhengwei Xia, Ruidi Wang, Chenpeng Xu, Wenqi Miao and Shengyang Xing
Buildings 2025, 15(21), 3996; https://doi.org/10.3390/buildings15213996 - 5 Nov 2025
Cited by 1 | Viewed by 1507
Abstract
Jiangnan water towns, as distinctive cultural landscapes in China, are confronting the dual challenge of surging tourist flows and imbalances in spatial distribution. Research on pedestrian dynamics has so far offered narrow coverage of influencing factors and limited insight into underlying mechanisms, falling short of a systemic perspective and an interpretable theoretical framework. This study uses Nanxun Ancient Town as a case study to address this gap. Pedestrian trajectories were captured using temporarily installed closed-circuit television (CCTV) cameras within the scenic area and extracted using the YOLOv8 object detection algorithm. These data were then integrated with quantified environmental indicators and analyzed through Random Forest regression with SHapley Additive exPlanations (SHAP) interpretation, enabling quantitative and interpretable exploration of pedestrian dynamics. The results indicate nonlinear and context-dependent effects of environmental factors on pedestrian dynamics and that tourist flows are jointly shaped by multi-level, multi-type factors and their interrelations, producing complex and adaptive impact pathways. First, within this enclosed scenic area, spatial morphology—such as lane width, ground height, and walking distance to entrances—imposes fundamental constraints on global crowd distributions and movement patterns, whereas spatial accessibility does not display its usual salience in this context. Second, perceptual and functional attributes—including visual attractiveness, shading, and commercial points of interest—cultivate local “visiting atmospheres” through place imagery, perceived comfort, and commercial activity. Finally, nodal elements—such as signboards, temporary vendors, and public service facilities—produce multi-scale, site-centered effects that anchor and perturb flows and reinforce lingering, backtracking, and clustering at bridgeheads, squares, and comparable nodes. This study advances a shift from static and global description to a mechanism-oriented explanatory framework and clarifies the differentiated roles and linkages among environmental factors by integrating video-based trajectory analytics with machine learning interpretation. This framework demonstrates the applicability of surveillance and computer vision techniques for studying pedestrian dynamics in small-scale heritage settings, and offers practical guidance for heritage conservation and sustainable tourism management in similar historic environments.
(This article belongs to the Section Architectural Design, Urban Science, and Real Estate)
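
The Random Forest + SHAP pipeline described above is straightforward to reproduce in outline; the feature names and synthetic target here are illustrative assumptions, not the study's indicators.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 3))                     # e.g. lane_width, shading, poi_density
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.1, 300)  # pedestrian-flow proxy

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)              # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)
print(shap_values.shape)                           # (300, 3): per-sample, per-feature attribution
```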

31 pages, 1455 KB  
Article
A User-Centric Context-Aware Framework for Real-Time Optimisation of Multimedia Data Privacy Protection, and Information Retention Within Multimodal AI Systems
by Ndricim Topalli and Atta Badii
Sensors 2025, 25(19), 6105; https://doi.org/10.3390/s25196105 - 3 Oct 2025
Cited by 2 | Viewed by 2099 | Correction
Abstract
The increasing use of AI systems for face, object, action, scene, and emotion recognition raises significant privacy risks, particularly when processing Personally Identifiable Information (PII). Current privacy-preserving methods lack adaptability to users’ preferences and contextual requirements, and obfuscate user faces uniformly. This research proposes a user-centric, context-aware, and ontology-driven privacy protection framework that dynamically adjusts privacy decisions based on user-defined preferences, entity sensitivity, and contextual information. The framework integrates state-of-the-art recognition models for recognising faces, objects, scenes, actions, and emotions in real time on data acquired from vision sensors (e.g., cameras). Privacy decisions are directed by a contextual ontology based on Contextual Integrity theory, which classifies entities into private, semi-private, or public categories. Adaptive privacy levels are enforced through obfuscation techniques and a multi-level privacy model that supports user-defined red lines (e.g., “always hide logos”). The framework also proposes a Re-Identifiability Index (RII) using soft biometric features such as gait, hairstyle, clothing, skin tone, age, and gender, to mitigate identity leakage and to support fallback protection when face recognition fails. The experimental evaluation relied on sensor-captured datasets, which replicate real-world image sensors such as surveillance cameras. User studies confirmed that the framework was effective, with over 85.2% of participants rating the obfuscation operations as highly effective, and the remaining 14.8% stating that obfuscation was adequately effective. Amongst these, 71.4% considered the balance between privacy protection and usability very satisfactory and 28% found it satisfactory. GPU acceleration was deployed to enable real-time performance of these models by reducing frame processing time from 1200 ms (CPU) to 198 ms. This ontology-driven framework employs user-defined red lines, contextual reasoning, and dual metrics (RII/IVI) to dynamically balance privacy protection with scene intelligibility. Unlike current anonymisation methods, the framework provides a real-time, user-centric, and GDPR-compliant method that operationalises privacy-by-design while preserving scene intelligibility. These features make the framework appropriate for a variety of real-world applications including healthcare, surveillance, and social media.
(This article belongs to the Section Intelligent Sensors)
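
The Re-Identifiability Index is described as aggregating soft biometric cues; one plausible minimal form is a weighted sum over detected cues. The weights below are illustrative assumptions, not the paper's calibrated values.

```python
# Hypothetical cue weights (assumed, not from the paper); they sum to 1.0
WEIGHTS = {"gait": 0.25, "hairstyle": 0.15, "clothing": 0.25,
           "skin_tone": 0.10, "age": 0.10, "gender": 0.15}

def rii(detected):
    """detected: dict cue -> detection confidence in [0, 1]; returns an index in [0, 1]."""
    return sum(WEIGHTS[c] * detected.get(c, 0.0) for c in WEIGHTS)

# Face hidden, but gait and clothing still visible:
print(rii({"gait": 0.9, "clothing": 0.8, "gender": 0.7}))  # -> 0.53
```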

26 pages, 9360 KB  
Article
Multi-Agent Hierarchical Reinforcement Learning for PTZ Camera Control and Visual Enhancement
by Zhonglin Yang, Huanyu Liu, Hao Fang, Junbao Li and Yutong Jiang
Electronics 2025, 14(19), 3825; https://doi.org/10.3390/electronics14193825 - 26 Sep 2025
Cited by 2 | Viewed by 1582
Abstract
Border surveillance, as a critical component of national security, places increasingly stringent demands on the target perception capabilities of video monitoring systems, especially in wide-area and complex environments. To address the limitations of existing systems in low-confidence target detection and multi-camera collaboration, this paper proposes a novel visual enhancement method for cooperative control of multiple PTZ (Pan–Tilt–Zoom) cameras based on hierarchical reinforcement learning. The proposed approach establishes a hierarchical framework composed of a Global Planner Agent (GPA) and multiple Local Executor Agents (LEAs). The GPA is responsible for global target assignment, while the LEAs perform fine-grained visual enhancement operations based on the assigned targets. To effectively model the spatial relationships among multiple targets and the perceptual topology of the cameras, a graph-based joint state space is constructed. Furthermore, a graph neural network is employed to extract high-level features, enabling efficient information sharing and collaborative decision-making among cameras. Experimental results in simulation environments demonstrate the superiority of the proposed method in terms of target coverage and visual enhancement performance. Hardware experiments further validate the feasibility and robustness of the approach in real-world scenarios. This study provides an effective solution for multi-camera cooperative surveillance in complex environments.
(This article belongs to the Section Artificial Intelligence)
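
Stripped of the learning components, the Global Planner Agent's role is a target-to-camera assignment problem. A minimal non-learned sketch using greedy assignment under a per-camera capacity (in the paper the scores would come from the GNN; here they are random placeholders):

```python
import numpy as np

def assign_targets(score, capacity=2):
    """score[i, j]: value of camera i covering target j; returns target -> camera."""
    load = np.zeros(score.shape[0], dtype=int)
    assignment = {}
    # Greedy over camera-target pairs, best score first
    for i, j in sorted(np.ndindex(*score.shape), key=lambda ij: -score[ij]):
        if j not in assignment and load[i] < capacity:
            assignment[j] = i
            load[i] += 1
    return assignment

score = np.random.default_rng(4).uniform(size=(3, 5))  # 3 PTZ cameras, 5 targets
print(assign_targets(score))
```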

17 pages, 2566 KB  
Article
Secure and Decentralized Hybrid Multi-Face Recognition for IoT Applications
by Erëza Abdullahu, Holger Wache and Marco Piangerelli
Sensors 2025, 25(18), 5880; https://doi.org/10.3390/s25185880 - 19 Sep 2025
Cited by 4 | Viewed by 2146
Abstract
The proliferation of smart environments and Internet of Things (IoT) applications has intensified the demand for efficient, privacy-preserving multi-face recognition systems. Conventional centralized systems suffer from latency, scalability, and security vulnerabilities. This paper presents a practical hybrid multi-face recognition framework designed for decentralized IoT deployments. Our approach leverages a pre-trained Convolutional Neural Network (VGG16) for robust feature extraction and a Support Vector Machine (SVM) for lightweight classification, enabling real-time recognition on resource-constrained devices such as IoT cameras and Raspberry Pi boards. The purpose of this work is to demonstrate the feasibility and effectiveness of a lightweight hybrid system for decentralized multi-face recognition, specifically tailored to the constraints and requirements of IoT applications. The system is validated on a custom dataset of 20 subjects collected under varied lighting conditions and facial expressions, achieving an average accuracy exceeding 95% while simultaneously recognizing multiple faces. Experimental results demonstrate the system’s potential for real-world applications in surveillance, access control, and smart home environments. The proposed architecture minimizes computational load, reduces dependency on centralized servers, and enhances privacy, offering a promising step toward scalable edge AI solutions.
(This article belongs to the Special Issue Secure and Decentralised IoT Systems)
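
The hybrid pipeline named in this abstract (frozen VGG16 features feeding a linear SVM) can be sketched directly with Keras and scikit-learn; the synthetic face crops and identity labels below are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.svm import SVC

backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-d features

def embed(images):
    """images: (n, 224, 224, 3) uint8 face crops -> (n, 512) feature vectors."""
    return backbone.predict(preprocess_input(images.astype("float32")), verbose=0)

rng = np.random.default_rng(5)
X = embed(rng.integers(0, 256, (8, 224, 224, 3)))   # stand-in face crops
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])              # four enrolled identities
clf = SVC(kernel="linear", probability=True).fit(X, y)
print(clf.predict(X[:2]))
```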

25 pages, 4706 KB  
Article
Transfer Learning-Based Distance-Adaptive Global Soft Biometrics Prediction in Surveillance
by Sonjoy Ranjon Das, Henry Onilude, Bilal Hassan, Preeti Patel and Karim Ouazzane
Electronics 2025, 14(18), 3719; https://doi.org/10.3390/electronics14183719 - 19 Sep 2025
Cited by 3 | Viewed by 789
Abstract
Soft biometric prediction—including age, gender, and ethnicity—is critical in surveillance applications, yet often suffers from performance degradation as the subject-to-camera distance increases. This study hypothesizes that embedding distance-awareness into the training process can mitigate such degradation and enhance model generalization across varying visual conditions. We propose a distance-adaptive, multi-task deep learning framework built upon EfficientNetB3, augmented with task-specific heads and trained progressively across four distance intervals (4 m to 10 m). A weighted composite loss function is employed to balance classification and regression objectives. The model is evaluated on a hybrid dataset combining the Front-View Gait (FVG) and MMV annotated pedestrian datasets, totaling over 19,000 samples. Experimental results demonstrate that the framework achieves up to 95% gender classification accuracy at 4 m and retains 85% accuracy at 10 m. Ethnicity prediction maintains an accuracy above 65%, while age estimation achieves a mean absolute error (MAE) ranging from 1.1 to 1.5 years. These findings validate the model’s robustness across distances and its superiority over conventional static learning approaches. Despite challenges such as computational overhead and annotation demands, the proposed approach offers a scalable and real-time-capable solution for distance-resilient biometric systems.
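
The weighted composite loss balancing classification and regression heads can be sketched as follows; the task weights and head sizes are illustrative assumptions (1536 is EfficientNetB3's feature width), not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Heads(nn.Module):
    def __init__(self, feat_dim=1536):              # EfficientNetB3 feature size
        super().__init__()
        self.gender = nn.Linear(feat_dim, 2)
        self.ethnicity = nn.Linear(feat_dim, 5)     # assumed number of classes
        self.age = nn.Linear(feat_dim, 1)

    def forward(self, f):
        return self.gender(f), self.ethnicity(f), self.age(f).squeeze(-1)

def composite_loss(out, target, w=(1.0, 1.0, 0.5)):  # task weights are assumptions
    g, e, a = out
    return (w[0] * nn.functional.cross_entropy(g, target["gender"])
            + w[1] * nn.functional.cross_entropy(e, target["ethnicity"])
            + w[2] * nn.functional.l1_loss(a, target["age"]))

heads = Heads()
f = torch.randn(4, 1536)                            # backbone features for a batch
target = {"gender": torch.tensor([0, 1, 0, 1]),
          "ethnicity": torch.tensor([2, 0, 4, 1]),
          "age": torch.tensor([24.0, 31.0, 19.0, 45.0])}
print(composite_loss(heads(f), target))
```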

25 pages, 27717 KB  
Article
MCS-Sim: A Photo-Realistic Simulator for Multi-Camera UAV Visual Perception Research
by Qiming Qi, Guoyan Wang, Yonglei Pan, Hongqi Fan and Biao Li
Drones 2025, 9(9), 656; https://doi.org/10.3390/drones9090656 - 18 Sep 2025
Cited by 2 | Viewed by 2709
Abstract
Multi-camera systems (MCSs) are pivotal in aviation surveillance and autonomous navigation due to their wide coverage and high-resolution sensing. However, challenges such as complex setup, time-consuming data acquisition, and costly testing hinder research progress. To address these challenges, we introduce MCS-Sim, a photo-realistic MCS simulator for UAV visual perception research. MCS-Sim integrates vision sensor configurations, vehicle dynamics, and dynamic scenes, enabling rapid virtual prototyping and multi-task dataset generation. It supports dense flow estimation, 3D reconstruction, visual simultaneous localization and mapping, object detection, and tracking. With a hardware-in-loop interface, MCS-Sim facilitates closed-loop simulation for system validation. Experiments demonstrate its effectiveness in synthetic dataset generation, visual perception algorithm testing, and closed-loop simulation. Here we show that MCS-Sim significantly advances multi-camera UAV visual perception research, offering a versatile platform for future innovations.

24 pages, 5065 KB  
Article
Benchmark Dataset and Deep Model for Monocular Camera Calibration from Single Highway Images
by Wentao Zhang, Wei Jia and Wei Li
Sensors 2025, 25(18), 5815; https://doi.org/10.3390/s25185815 - 18 Sep 2025
Viewed by 1426
Abstract
Single-image-based camera auto-calibration holds significant value for improving perception efficiency in traffic surveillance systems. However, existing approaches face dual challenges: scarcity of real-world datasets and poor adaptability to multi-view scenarios. This paper presents a systematic solution framework. First, we constructed a large-scale synthetic dataset containing 36 highway scenarios using the CARLA 0.9.15 simulation engine, generating approximately 336,000 virtual frames with precise calibration parameters. The dataset achieves statistical consistency with real-world scenes by incorporating diverse view distributions, complex weather conditions, and varied road geometries. Second, we developed DeepCalib, a deep calibration network that explicitly models perspective projection features through the triplet attention mechanism. This network simultaneously achieves road direction vanishing point localization and camera pose estimation using only a single image. Finally, we adopted a progressive learning paradigm: robust pre-training on synthetic data establishes universal feature representations in the first stage, followed by fine-tuning on real-world datasets in the second stage to enhance practical adaptability. Experimental results indicate that DeepCalib attains an average calibration precision of 89.6%. Compared to conventional multi-stage algorithms, our method achieves a single-frame processing speed of 10 frames per second, showing robust adaptability to dynamic calibration tasks across diverse surveillance views.
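
The geometry that links a road-direction vanishing point to camera pose can be made explicit: under a zero-roll pinhole model with known focal length, the vanishing point fixes tilt and pan. This standard derivation is our illustration of the underlying relationship, not DeepCalib's internals; the pixel values are assumptions.

```python
import numpy as np

def pose_from_vanishing_point(vp, f, pp):
    """Road-direction vanishing point (pixels) + focal length (pixels) -> tilt, pan (deg)."""
    u, v = vp[0] - pp[0], vp[1] - pp[1]     # coordinates relative to the principal point
    tilt = np.arctan2(-v, f)                # VP above the centre -> positive downward tilt
    pan = np.arctan2(u * np.cos(tilt), f)   # road direction relative to the optical axis
    return np.degrees(tilt), np.degrees(pan)

print(pose_from_vanishing_point(vp=(980, 420), f=1400.0, pp=(960, 540)))
```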

25 pages, 19177 KB  
Article
Multimodal UAV Target Detection Method Based on Acousto-Optical Hybridization
by Tianlun He, Jiayu Hou and Da Chen
Drones 2025, 9(9), 627; https://doi.org/10.3390/drones9090627 - 5 Sep 2025
Cited by 4 | Viewed by 4452
Abstract
Urban unmanned aerial vehicle (UAV) surveillance faces significant obstacles due to visual obstructions, inadequate lighting, small target dimensions, and acoustic signal interference caused by environmental noise and multipath propagation. To address these issues, this study proposes a multimodal detection framework that integrates an efficient YOLOv11-based visual detection module—trained on a comprehensive dataset containing over 50,000 UAV images—with a Capon beamforming-based acoustic imaging system using a 144-element spiral-arm microphone array. Adaptive compensation strategies are implemented to improve the robustness of each sensing modality, while detection results are validated through intersection-over-union and angular deviation metrics. The angular validation is accomplished by mapping acoustic direction-of-arrival estimations onto the camera image plane using established calibration parameters. Experimental evaluation reveals that the fusion system achieves outstanding performance under optimal conditions, exceeding 99% accuracy. However, its principal advantage becomes evident in challenging environments where individual modalities exhibit considerable limitations. The fusion approach demonstrates substantial performance improvements across three critical scenarios. In low-light conditions, the fusion system achieves 78% accuracy, significantly outperforming vision-only methods which attain only 25% accuracy. Under occlusion scenarios, the fusion system maintains 99% accuracy while vision-only performance drops dramatically to 9.75%, though acoustic-only detection remains highly effective at 99%. In multi-target detection scenarios, the fusion system reaches 96.8% accuracy, bridging the performance gap between vision-only systems at 99% and acoustic-only systems at 54%, where acoustic intensity variations limit detection capability. These experimental findings validate the effectiveness of the complementary fusion strategy and establish the system’s practical value for urban airspace monitoring applications.
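
The angular-deviation validation described above amounts to projecting the acoustic direction-of-arrival into the image with a pinhole model and measuring the angle to the visual detection's ray. A minimal sketch with assumed intrinsics (the paper additionally scores intersection-over-union):

```python
import numpy as np

def doa_to_pixel(az_deg, el_deg, f, cx, cy):
    """Camera-aligned azimuth/elevation (deg) -> pixel coordinates (pinhole model)."""
    az, el = np.radians([az_deg, el_deg])
    d = np.array([np.cos(el) * np.sin(az), -np.sin(el), np.cos(el) * np.cos(az)])
    return cx + f * d[0] / d[2], cy + f * d[1] / d[2]

def angular_deviation(off_a, off_b, f):
    """Angle (deg) between viewing rays, given pixel offsets from the principal point."""
    a, b = np.array([*off_a, f]), np.array([*off_b, f])
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

acoustic_px = doa_to_pixel(5.0, 8.0, f=1200.0, cx=960, cy=540)
visual_px = (1071, 370)  # centre of the visual detection box, say
print(angular_deviation(np.subtract(acoustic_px, (960, 540)),
                        np.subtract(visual_px, (960, 540)), f=1200.0))
```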

16 pages, 2127 KB  
Article
VIPS: Learning-View-Invariant Feature for Person Search
by Hexu Wang, Wenlong Luo, Wei Wu, Fei Xie, Jindong Liu, Jing Li and Shizhou Zhang
Sensors 2025, 25(17), 5362; https://doi.org/10.3390/s25175362 - 29 Aug 2025
Viewed by 1033
Abstract
Unmanned aerial vehicles (UAVs) have become indispensable tools for surveillance, enabled by their ability to capture multi-perspective imagery in dynamic environments. Among critical UAV-based tasks, cross-platform person search—detecting and identifying individuals across distributed camera networks—presents unique challenges. Severe viewpoint variations, occlusions, and cluttered backgrounds in UAV-captured data degrade the performance of conventional discriminative models, which struggle to maintain robustness under such geometric and semantic disparities. To address this, we propose view-invariant person search (VIPS), a novel two-stage framework combining Faster R-CNN with a view-invariant re-identification (VIReID) module. Unlike conventional discriminative models, VIPS leverages the semantic flexibility of large vision–language models (VLMs) and adopts a two-stage training strategy to decouple and align text-based ID descriptors and visual features, enabling robust cross-view matching through shared semantic embeddings. To mitigate noise from occlusions and cluttered UAV-captured backgrounds, we introduce a learnable mask generator for feature purification. Furthermore, drawing from vision–language models, we design view prompts to explicitly encode perspective shifts into feature representations, enhancing adaptability to UAV-induced viewpoint changes. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, with ablation studies validating the efficacy of each component. Beyond technical advancements, this work highlights the potential of VLM-derived semantic alignment for UAV applications, offering insights for future research in real-time UAV-based surveillance systems.
(This article belongs to the Section Remote Sensors)
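
The text-visual alignment that VIPS builds on can be sketched as a symmetric InfoNCE objective that pulls each identity's text descriptor toward its image embedding in a shared space; the dimensions and temperature are illustrative assumptions, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/text pairs sit on the logits diagonal."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(img))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

img_emb = torch.randn(8, 512)   # visual features for 8 identities
txt_emb = torch.randn(8, 512)   # features of their text ID descriptors
print(alignment_loss(img_emb, txt_emb))
```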
