Article

MCS-Sim: A Photo-Realistic Simulator for Multi-Camera UAV Visual Perception Research

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(9), 656; https://doi.org/10.3390/drones9090656
Submission received: 23 July 2025 / Revised: 4 September 2025 / Accepted: 16 September 2025 / Published: 18 September 2025

Abstract

Multi-camera systems (MCSs) are pivotal in aviation surveillance and autonomous navigation due to their wide coverage and high-resolution sensing. However, challenges such as complex setup, time-consuming data acquisition, and costly testing hinder research progress. To address these, we introduce MCS-Sim, a photo-realistic MCS simulator for UAV visual perception research. MCS-Sim integrates vision sensor configurations, vehicle dynamics, and dynamic scenes, enabling rapid virtual prototyping and multi-task dataset generation. It supports dense flow estimation, 3D reconstruction, visual simultaneous localization and mapping, object detection, and tracking. With a hardware-in-loop interface, MCS-Sim facilitates closed-loop simulation for system validation. Experiments demonstrate its effectiveness in synthetic dataset generation, visual perception algorithm testing, and closed-loop simulation. These results indicate that MCS-Sim offers a versatile platform that can significantly advance multi-camera UAV visual perception research and support future innovations.

1. Introduction

The intelligent autonomy of unmanned aerial vehicles (UAVs) represents a leading trend in modern technology, enabling tasks such as reconnaissance, surveillance, disaster relief, transportation delivery, and virtual reality (VR) applications. Within VR ecosystems, UAV-mounted imaging systems enable dynamic viewpoint control through several key technologies [1]. Specifically, trajectory planning with optimized flight paths ensures stable image acquisition while avoiding occlusions [2]. Additionally, multi-UAV coordination allows synchronized drone swarms to provide panoramic coverage through distributed camera arrays [3]. These implementations address the core requirements of VR systems: broad field-of-view (FOV) coverage, high resolution, and motion consistency. Beyond VR applications, effective UAV operation across various domains fundamentally relies on robust environmental perception capabilities, including accurate self-localization and environmental mapping [4]. This foundation enables advanced capabilities such as navigation, obstacle avoidance, and path planning. Vision sensors dominate scene perception tasks due to their low cost and high information density. However, enhanced perception demands broader FOV, higher resolution, and precise sensing.
Monocular and binocular stereo cameras dominate current vision sensors, yet their FOV is inherently limited. While fisheye cameras expand the FOV, they face a trade-off between effective resolution and FOV due to diffraction limits. To address demands for high resolution, wide FOV, scale depth perception, and motion sensitivity, multi-aperture vision systems emerged in the early 20th century [5]. These systems, termed artificial compound eye systems (ACESs) and light field cameras, are classified as multi-ocular sensors alongside unstructured multi-camera systems. Applications span computational imaging [6], simultaneous localization and mapping (SLAM) [7,8,9], and autonomous driving [10]. Except for two types of light field cameras with unique light field acquisition models [5], most multi-ocular sensors can be regarded as multi-camera systems (MCSs).
These MCSs integrate three or more cameras with tilted optical axes. Their non-parallel alignment expands the FOV; meanwhile, the baseline distances enable precise depth measurement for nearby objects. Additionally, the multi-view design counters two major limitations of traditional monocular and stereo cameras: lighting fluctuations and occlusions. This enhancement improves UAVs’ 3D perception and boosts operational performance in complex, wide-area scenarios.
Currently, research on MCSs faces challenges at three levels: (1) Physical level: configuration design and entity preparation are time-consuming and expensive, necessitating simulation tools for validation. (2) Algorithm level: algorithm research relies on the input of physical information, but dataset scarcity limits progress. (3) Application level: intricate system testing introduces safety hazards and high expenses. To overcome these barriers and accelerate research on multi-camera UAV visual perception, it is crucial to explore simulation tools that can compensate for the limitations of physical systems.
In the field of embodied intelligence, using simulators to synthesize datasets for large-scale algorithm training, or providing closed-loop interfaces for online training and testing, has become a widely adopted practice. Similar principles extend to computer vision tasks, where algorithms increasingly leverage large-scale and diverse synthetic datasets to achieve robust generalization to real-world scenarios. For instance, DROID-SLAM [11] demonstrated exceptional accuracy and robustness in real-world testing after being trained exclusively on the synthetic dataset TartanAir [12]. The foundation model Depth Anything V2 [13] further exemplifies this paradigm: it first employs synthetic data to train a high-capacity teacher model; then, it generates pseudo-labels for unlabeled real images to enable knowledge distillation for efficient student models, ultimately achieving state-of-the-art zero-shot generalization. These successes suggest the viability of applying analogous approaches to multi-camera perception in UAV applications.
In this work, we introduce MCS-Sim, a novel photo-realistic simulator designed for multi-camera UAV visual perception research. The introduced framework integrates six key modules to address critical requirements in autonomous perception development: (1) MCS simulation, (2) vehicle dynamics modeling, (3) movable object simulation, (4) structured synthetic dataset generation, (5) diversified scene configuration, and (6) closed-loop testing capabilities. Our contributions include the following:
  • A modular simulator architecture is introduced that extends conventional monocular simulation paradigms by supporting both geometrically regular ACESs and arbitrary multi-camera configurations. This design enables rapid virtual prototyping of MCSs.
  • A structured synthetic dataset generation pipeline is integrated to output multi-camera sequences with multi-modal pixel-wise ground truth (e.g., depth, semantic labels, optical flow, and surface normal). This facilitates visual perception research including dense motion flow estimation, 3D reconstruction, visual SLAM, object recognition, and tracking.
  • A hardware-in-the-loop (HIL) interface is developed to enable end-to-end system testing. This capability is demonstrated through a representative case: real-time ground target tracking using UAV electro-optical payloads, where simulated sensor feedback directly interfaces with embedded perception algorithms for closed-loop control validation.
While AI-driven autonomous agents and other advanced features could further enhance the simulator, time and cost constraints defer these enhancements to future work beyond the scope of this study.
This paper is organized as follows. In Section 2, prior work related to our approach is reviewed. Section 3 provides a detailed introduction of the simulator. In Section 4, key functions mentioned above are verified through simulation experiments. Finally, Section 5 concludes our work and proposes future directions.

2. Current Research and Projects in Open-Source UAV Simulators

In this section, related work is divided into four distinct categories: (1) vision-oriented simulators, (2) physics-based simulators, (3) neural closed-loop simulators, and (4) generative closed-loop simulators. Each category is detailed below.
  • Vision-oriented simulators. Within the computer vision community, there are numerous simulators that utilize 3D engines [14] to create synthetic datasets for researchers’ specific needs. The core of vision-oriented simulators lies in the generation of photo-realistic image data, coupled with precise ground truth annotations, tailored for various visual tasks. CARLA [15] and AirSim [16] are open-source simulation toolkits developed on the Unreal Engine 4 (UE4), providing high-fidelity physical and visual simulations. They facilitate the cost-effective creation of vast amounts of training data, inclusive of depth maps and semantic masks, which in turn accelerates the development of AI models in simulations involving autonomous vehicle maneuvers or UAV flights. UnrealCV [17], a UE4 plugin, simplifies virtual world creation for vision researchers. It automates the generation of RGB images, object masks, depth maps, and surface normals. However, using UnrealCV requires significant programming effort in both vision tasks and UE4 environment configuration. Sim4CV [18] streamlines this process by eliminating the need for complex UE4 game development. It provides an integrated solution for vision-based tracking and driving simulations. Another UE4-based tool, UnrealGT [19], automatically augments scenes with contextually placed objects to enhance training data variability. More recently, GRADE [20], built on Nvidia Isaac Sim, introduced dynamic elements like pedestrians and animals into simulations. This enables the generation of dynamic SLAM datasets. For vision-oriented simulators, while constructing high-fidelity scenes remains labor-intensive, the resultant data is kept free from geometric errors, ensuring ground truth annotations with perfect accuracy.
  • Physics-based simulators. Physics-based simulators prioritize realistic physical interactions between agents and environments, driving research in embodied AI and robotics. These simulators have sparked extensive research in various domains, including navigation (e.g., Habitat [21] and iGibson [22]), robotic manipulation (e.g., SAPIEN [23]), and embodied learning (e.g., AI2-THOR [24], TDW [25]). Recent advancements include LEGENT [26] and HAZARD [27], which integrate Large Language Models (LLMs) and Large Multimodal Models (LMMs), alongside multi-agent simulators like Habitat 3.0 [28] and UnrealZoo [29]. Notably, Genesis [30] introduces generative physics engines for ultra-fast, differentiable simulations with physically realistic accuracy. However, traditional simulators rooted in geometric modeling still exhibit greater robustness.
  • Neural closed-loop simulators. Neural closed-loop simulators, inspired by Neural Radiance Fields (NeRFs) [31] and 3D Gaussian Splatting (3DGS) [32], reconstruct real-world environments for autonomous driving applications. These novel simulators leverage NeRF (e.g., MARS [33], UniSim [34], NeuRAD [35], and NeuroNCAP [36]) and 3DGS (e.g., HUGSIM [37]) to achieve efficient scene construction with texture and lighting fidelity close to real-world scenes. However, challenges persist under extreme behavioral deviations or viewpoint changes, where synthesized views exhibit artifacts like blurred shadows, reducing performance relative to traditional simulators based on 3D engines.
  • Generative closed-loop simulators. Since 2024, the rapid evolution of video diffusion models, such as GAIA-1 [38], DriveDreamer [39], Vista [40], and GenAD [41], has shown great promise in generating photo-realistic driving videos. Generative closed-loop simulators, exemplified by DriveArena [42], belong to the category of world models [43] and represent the evolutionary frontier of simulation technology. However, generative closed-loop simulators face limitations in spatial and temporal consistencies, as well as computational demands.
Each existing simulator has distinct advantages and drawbacks, as shown in Table 1. Vision-oriented and physics-based simulators leverage mature graphical pipelines to achieve geometrically precise rendering, maintaining spatio-temporal consistency while providing ground-truth pixel-wise annotations. These simulators typically operate in real time using consumer-grade GPU-equipped laptops. In contrast, neural and generative closed-loop simulators, which have emerged only in the past two years, produce high-fidelity image quality through volume rendering techniques. However, they suffer from inherent geometric inaccuracies leading to rendering artifacts and spatio-temporal inconsistencies, with compromised pixel-wise annotation accuracy. Moreover, their computational demands necessitate high-performance workstations, often struggling to maintain real-time operation due to hardware limitations.
The proliferation of MCS configurations has exposed a critical research gap: the absence of specialized simulators for multi-camera UAV visual perception research. To advance this field, such a simulator must satisfy dual requirements. From an economic perspective, it should enable easy deployment and efficient operation. Technically, it must generate photo-realistic images with error-free annotations and provide closed-loop testing capabilities. The ground truth annotations should encompass multi-modal pixel-wise outputs to support multiple vision tasks, including dense pixel-wise matching, 3D reconstruction, visual SLAM, object recognition, and tracking. Therefore, MCS-Sim is introduced to address these requirements. It follows the design paradigm of vision-oriented simulators, with functional and performance comparisons presented in Table 1. It extends support from monocular vision to arbitrary MCS configurations. Additionally, it integrates a multi-modal dataset generation pipeline with enhanced annotation capabilities and provides a HIL closed-loop interface for validation.

3. System Design of MCS-Sim

MCS-Sim adopts a video game-like design for easy installation without extra dependencies, ensuring user-friendly operation. The simulator features an intuitive graphical user interface (GUI) that simplifies adjustment of all critical settings. Researchers can modify configuration files directly or via C++ scripting for advanced customization. Flight environments are expansive and detailed, containing multiple aerial vehicles (two UAVs, a small aircraft, and a freely mobile object) across a 50 km × 50 km map. This map includes diverse scenarios: towns, airports, warehouses, grasslands, forests, mountains, coastlines, oceans, and underwater zones. The closed-loop interfaces, which encompass image injection and serial communication, enable bidirectional data streams between external devices or programs and the simulator.

3.1. Overview of Simulator Architecture

The overview of the simulator’s architecture is presented in Figure 1. The core functionality of the software lies in its ability to customize MCS configurations, orchestrate vehicle movements, set up dynamic objects, and generate and record ground truth annotations for vision perception research. It supports closed-loop simulation for photoelectric data processing and flight control via a HIL interface. Developed utilizing a 3D engine, the simulator comprises various components, including a simulation setting module, an MCS model module, a UAV control module, a third-person visual display, a first-person view display, an image data generation module, a data recording module, a visual model library, and communication interfaces.
  • 3D engine. The 3D engine offers a realistic graphics rendering pipeline and an integrated real-time physics engine. MCS-Sim employs a commercial 3D engine, UNIGINE 2.7.2 Sim, known for its 64-bit coordinate precision and advanced rendering capabilities. While popular 3D engines [14] like Unreal Engine 4 and Unity 2019.3 utilize 32-bit systems, UNIGINE’s 64-bit architecture enables superior coordinate accuracy and wide-area visualization, which is critical for aerospace simulations. This engine integrates a physics-based rendering pipeline that produces realistic visuals while maintaining precise spatial computations, outperforming 32-bit alternatives in large-scale scene modeling. The selection of UNIGINE aligns with recent advancements in 3D simulation technology, where high-performance engines now support complex multisensory applications through integrated physics systems and visual fidelity.
  • Simulation setting module. The simulation settings module provides a user interface for configuring simulation parameters. It allows for the specification of vehicle types and motion modes, MCS configuration, data recording options, coordinate systems, and environmental effects. Developed with Qt Creator 4.0.3, this module integrates a GUI design that simplifies parameter input while establishing real-time communication between the interface and 3D engine core through Socket protocols.
  • MCS model module. The MCS model module streamlines the MCS configuration through three customizable input methods: preset templates, GUI-based adjustments, or Excel file imports. These settings are automatically converted into a unified model containing critical parameters, including MCS type, camera quantity, and both extrinsic and intrinsic parameters for each camera. Then, the corresponding light field acquisition model is constructed. This modular design ensures flexibility for researchers while maintaining parameter consistency across simulation workflows.
  • UAV control module. The UAV control module enables movement simulation via two modes: (1) direct manual operation using keyboard and mouse control, and (2) automated control through real-time motion control data received from external programs or devices, which drives a UAV and updates its pose state accordingly.
  • Third-person visual and first-person view display. The third-person visual display tracks vehicle movements using a spectator camera controlled via the PlayerPersecutor programming interface in UNIGINE. For the first-person view display, the simulator renders real-time image feeds captured by individual cameras. Each image feed is generated through the WidgetSpriteViewport programming interface in UNIGINE and displayed in dedicated interface zones, providing simultaneous multi-view monitoring.
  • Image data generation module. The image data generation module is used to generate RGB images tailored for MCS, alongside depth maps, segmentation masks, surface normals, optical flow, and other ground truth data. This module is mainly implemented by parsing the rendering pipeline.
  • Data recording module. The data recording module generates synthetic datasets for MCS-based 3D vision tasks, including single-shot images, videos, and metadata such as vehicle poses, dynamic object trajectories, and MCS calibration parameters.
  • Visual model library. The visual model library contains scene models with adjustable environmental effects for rendering, plus vehicle and object models. Scene models and their environmental effects are created using the scene editor of UNIGINE, with parameters modifiable through script adjustments. Vehicle and object models can also be designed in third-party software like 3ds MAX and imported into the simulator.
  • Communication interfaces. The communication interface enables two-way interaction between the simulator and external programs or physical devices through (1) image injection, which inputs image data into the photoelectric data processing device, and (2) serial communication, transmitting real-time simulation status and flight control parameters.
The basic workflow of the simulation process comprises the following steps: (1) configuration of simulation settings and MCS modeling; (2) 3D engine execution for visual rendering; (3) UAV control or closed-loop simulation; (4) configuration modification; (5) dataset generation and recording.
The following subsections introduce several key design strategies essential to our methodology.

3.2. Multi-Camera System Modeling

As introduced in Section 1, MCSs are classified into two categories: ACESs and unstructured MCSs. Our prior work [44] developed simulation software for ACESs, supporting preset templates, GUI-based adjustments, and Excel file imports. This simulator now extends its capabilities to accommodate unstructured MCS configurations. Table 2 summarizes the relevant parameters for MCS modeling, including both configuration variables and the ground-truth intrinsic/extrinsic parameters of individual cameras.
For modeling the imaging parameters of individual cameras, all cameras adopt the pinhole model characterized by four physical parameters: focal length $f_c$ with its error variance $\sigma_{f_c}$, image plane width $d_c$, pixel aspect ratio $\gamma$, and image resolution $N_{u,c} \times N_{v,c}$, where $c \in \{0, 1, \ldots, N_c - 1\}$ represents the camera index. Given the optical parameters $(f, d, \gamma, N_u, N_v)$ for camera $c$, its intrinsic matrix is recorded as follows:
$$K = \begin{bmatrix} f_u & 0 & o_u \\ 0 & f_v & o_v \\ 0 & 0 & 1 \end{bmatrix},$$
where $f_u = f \cdot N_u / d$ and $f_v = f \cdot N_u / (d \cdot \gamma)$ define the pixel focal lengths, while $o_u = (N_u - 1)/2$ and $o_v = (N_v - 1)/2$ specify the principal point coordinates. The configured optical parameters $(f + \delta f_c, d, \gamma, N_u, N_v)$ are passed to the UNIGINE API for automated image generation, augmented by an optical error parameter vector $\mathbf{w}_{\mathrm{optical\_errors}}$ encompassing color saturation, color gamma, lens smudges, exposure adaptation, depth-of-field effects, and motion blur artifacts, which are dynamically applied through dedicated API invocations to achieve physically plausible rendering. It should be noted that the graphics engine simulates optical phenomena without strict adherence to physical laws. Geometric distortion, while requiring synchronous pixel-wise ground-truth annotation generation, is implemented as an offline post-processing step to avoid real-time rendering frame rate degradation.
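For concreteness, a minimal sketch of how these pinhole intrinsics can be assembled from the configured optical parameters is given below; it uses NumPy, and the function name and the Gaussian focal-length perturbation are illustrative assumptions rather than MCS-Sim's internal implementation.

```python
import numpy as np

def intrinsic_matrix(f, d, gamma, N_u, N_v, sigma_f=0.0, rng=None):
    """Build the pinhole intrinsic matrix K for one camera.

    f       : focal length (mm)
    d       : image plane width (mm)
    gamma   : pixel aspect ratio
    N_u,N_v : image resolution in pixels
    sigma_f : optional focal-length error std. dev. (mm), applied as delta_f
    """
    rng = rng or np.random.default_rng()
    f_noisy = f + (rng.normal(0.0, sigma_f) if sigma_f > 0 else 0.0)

    f_u = f_noisy * N_u / d            # pixel focal length along u
    f_v = f_noisy * N_u / (d * gamma)  # pixel focal length along v
    o_u = (N_u - 1) / 2.0              # principal point at the image centre
    o_v = (N_v - 1) / 2.0

    return np.array([[f_u, 0.0, o_u],
                     [0.0, f_v, o_v],
                     [0.0, 0.0, 1.0]])

# Example: a 4.8 mm lens on a 6.4 mm-wide sensor at 1280x720 (illustrative values)
K = intrinsic_matrix(f=4.8, d=6.4, gamma=1.0, N_u=1280, N_v=720)
```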
The relevant coordinate systems for MCS modeling are defined as Figure 2 shows. The MCS coordinate frame is usually centered at its geometric origin and denoted as $F_s$. This establishes coordinate system conversions between the world coordinates $F_w$, MCS coordinates $F_s$, and camera coordinates $F_c$ via the equation below:
$$\tilde{X}_w = T_s^w \cdot T_c^s \cdot \tilde{X}_c,$$
where $\tilde{X}_w$ and $\tilde{X}_c$ represent homogeneous world and camera coordinates. The transformation matrix $T_s^w$ denotes the pose of the MCS, and the extrinsic matrix $T_c^s$ of camera $c$ combines a rotation matrix $R_c^s \in \mathrm{SO}(3)$ and a translation vector $t_c^s = (X_c^s, Y_c^s, Z_c^s)^\top$ derived from the installation offsets.
When configuring the MCS setup, the orientation of camera $c$ is represented by the axis-angle $(n_{x,c}, n_{y,c}, n_{z,c})$, while translational deviations are denoted as $(X_c^s, Y_c^s, Z_c^s)$. The error variances of these setting parameters are $\sigma_{\varphi_c^s}$, $\sigma_{\theta_c^s}$, $\sigma_{\phi_c^s}$, $\sigma_{X_c^s}$, $\sigma_{Y_c^s}$, and $\sigma_{Z_c^s}$, respectively. The extrinsic matrix $T_c^s$ for each camera is derived from $(n_{x,c}, n_{y,c}, n_{z,c}, X_c^s, Y_c^s, Z_c^s)$ and subsequently recorded together with $T_s^w$ during simulation. If calibration error parameters $(\sigma_{\varphi_c^s}, \sigma_{\theta_c^s}, \sigma_{\phi_c^s}, \sigma_{X_c^s}, \sigma_{Y_c^s}, \sigma_{Z_c^s})$ are configured, the axis-angle representation is converted to Euler angles (where $\varphi_c^s$, $\theta_c^s$, and $\phi_c^s$ correspond to azimuth, pitch, and roll, respectively), yielding the perturbed angles $(\varphi_c^s + \delta\varphi_c^s, \theta_c^s + \delta\theta_c^s, \phi_c^s + \delta\phi_c^s)$. Similarly, the translational deviations are updated to $(X_c^s + \delta X_c^s, Y_c^s + \delta Y_c^s, Z_c^s + \delta Z_c^s)$. Finally, the corresponding application programming interfaces (APIs) in UNIGINE are invoked to execute the graphical rendering operations according to these perturbed angles and offsets.
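The conversion from the configured axis-angle orientation and translation offsets to a perturbed extrinsic matrix $T_c^s$ can be sketched as follows, assuming a ZYX (azimuth-pitch-roll) Euler convention and Gaussian calibration errors; the function name and the SciPy usage are ours, not MCS-Sim's API.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def extrinsic_with_errors(axis_angle, offset, sigma_angles=(0, 0, 0),
                          sigma_offsets=(0, 0, 0), rng=None):
    """Build a perturbed 4x4 extrinsic matrix T_c^s for one camera.

    axis_angle    : (n_x, n_y, n_z) rotation vector (rad)
    offset        : (X, Y, Z) translation of camera c in the MCS frame (m)
    sigma_angles  : std. dev. of azimuth/pitch/roll calibration errors (rad)
    sigma_offsets : std. dev. of translational calibration errors (m)
    """
    rng = rng or np.random.default_rng()

    # Axis-angle -> Euler angles (azimuth, pitch, roll), then perturb.
    yaw, pitch, roll = R.from_rotvec(axis_angle).as_euler("ZYX")
    yaw   += rng.normal(0.0, sigma_angles[0])
    pitch += rng.normal(0.0, sigma_angles[1])
    roll  += rng.normal(0.0, sigma_angles[2])

    # Perturb the translational offsets.
    t = np.asarray(offset, dtype=float) + rng.normal(0.0, sigma_offsets, size=3)

    T = np.eye(4)
    T[:3, :3] = R.from_euler("ZYX", [yaw, pitch, roll]).as_matrix()
    T[:3, 3] = t
    return T

# Chained transform of the equation above (homogeneous coordinates):
# X_w = T_s_w @ T_c_s @ X_c
```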

3.3. Image Data Generation and Recording

To generate MCS-related datasets for vision perception research, three key methodologies are addressed: dynamic scene construction, graphic rendering pipeline parsing, and synchronous data recording.
The dynamic scene is built using the scene editor of UNIGINE, integrating terrain, vegetation, architecture, roads, water systems, sky, movable vehicles, various objects, and a spectrum of environmental effects (e.g., lighting, rain, clouds, wind, water motion) as outlined in [44]. Environmental effects are scripted for dynamic control. Dynamic objects employ three modeling approaches: 3D software-animated effects (e.g., helicopter rotors), editor-defined motion paths, and script-driven route management with pose updates. For segmentation annotations, each object’s auxiliary pass color is predefined in the scene editor. As an illustrative case, only moving cars are individually segmented based on their distinct color attributes.
In UNIGINE, cameras are created via the Camera and Viewport programming interfaces. Following the MCS model, camera poses in the world coordinate system are calculated through coordinate transformations, combining the vehicle’s pose and the MCS installation parameters. Each viewport is driven by a corresponding Camera to render image data using the Unified UNIGINE Shader Language (UUSL) scripted pipeline, extracting depth (from the native depth buffer, converted to linear values), world-aligned normals (from the normal buffer), and optical flow (from the velocity buffer) from the deferred G-buffer pass, as well as segmentation masks from the auxiliary pass. The final RGB images are captured via the same interfaces as the last stage of the rendering pipeline.
Synchronous data recording follows the pipeline in Algorithm 1. Raw data, including G-buffers (native depth, normal, velocity), auxiliary passes, and RGB images, are first captured from the GPU to RAM. Processed data is then saved to disk by a parallel thread, so that RAM-to-disk write latency does not stall the rendering loop. Vehicle and object poses are recorded at each simulation step, alongside MCS configurations, automating dataset generation for visual perception research. During simulation, the rendering frame rate dynamically changes with the scale of geometry in the scene. For each simulation snapshot, all captured camera images are precisely synchronized in a single rendering step, and the corresponding physical timestamp is recorded. If temporal alignment errors need to be simulated, only a single offline processing step is required: by setting equal sampling intervals to generate simulation timestamps, the temporal alignment error is defined as the deviation between these timestamps and the physical timestamps.
Algorithm 1 Image data generation and synchronous recording process.
 1: while True do
 2:     Update MCS Pose;
 3:     Display first-person view using WidgetSpriteViewport API;
 4:     if data recording is required then
 5:         Parse rendering pipeline using Camera and Viewport APIs;
 6:         Write UUSL scripts to
 7:             (1) Extract native depth buffer and convert it to linear depth;
 8:             (2) Extract normal buffer and unify surface normal coordinate system;
 9:             (3) Extract velocity buffer and convert velocity map to optical flow;
10:             (4) Extract auxiliary render pass and generate segmentation mask;
11:         Transfer these image data from GPU into RAM using Camera and Viewport APIs;
12:         if image datasets in RAM are not empty then
13:             Start a standalone thread;
14:             while thread is active do
15:                 Save image data from RAM to hard disk;
16:             end while
17:         else
18:             Wait for standalone thread cleanup;
19:         end if
20:         Record poses of vehicle and dynamic objects;
21:         if recording termination is requested then
22:             Record MCS configuration settings;
23:         else
24:             continue
25:         end if
26:     else
27:         continue
28:     end if
29: end while
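As a complement to the offline temporal-alignment-error step described before Algorithm 1, the sketch below compares the recorded physical timestamps against an equally spaced simulation clock; the variable names and the noisy 30 Hz example are illustrative assumptions, not part of the recording module itself.

```python
import numpy as np

def temporal_alignment_errors(physical_ts, interval):
    """Offline step: compare recorded (frame-rate-dependent) physical timestamps
    against an ideal, equally spaced simulation clock.

    physical_ts : per-snapshot timestamps recorded during simulation (s)
    interval    : nominal sampling interval of the simulation clock (s)
    """
    physical_ts = np.asarray(physical_ts, dtype=float)
    sim_ts = physical_ts[0] + interval * np.arange(len(physical_ts))
    return sim_ts - physical_ts   # temporal alignment error per snapshot

# Example: a render loop that drifts around a nominal 30 Hz
recorded = np.cumsum(np.full(10, 1 / 30.0) + np.random.normal(0, 1e-3, 10))
errors = temporal_alignment_errors(recorded, interval=1 / 30.0)
```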

3.4. UAV Motion Control and Closed-Loop Simulation

UAV motion control is achieved through two modes: (1) manual human–machine interaction through keyboard and mouse interfaces, and (2) external data-driven kinematic propulsion. For manual operation, flight commands are configured via designated keys, with aerial vehicle dynamics (fixed-wing and quadrotor models) integrated into the simulator. Real-time kinematic state propagation is computed using UNIGINE’s native physics engine, which processes user inputs through a force-feedback loop mechanism that accounts for inertial properties and environmental interactions. Alternatively, the simulator supports external data injection protocols where motion profiles (e.g., from JSBSim flight dynamics models or MATLAB/Simulink control algorithms) override internal kinematic calculations.
A critical challenge in achieving closed-loop simulation lies in coordinating data flow among software components, external programs, and hardware entities [45]. Efficient communication is ensured via RS422 serial ports for bidirectional hardware-simulator exchange, Socket interfaces for UNIGINE-Qt data transfer, and a dedicated image injection interface [45] for the photoelectric data processing device. Robust communication protocols standardize data frames, composed of header, length, type, content, and checksum fields, to maintain data integrity. Temporal synchronization is enforced using the Network Time Protocol [46], guaranteeing that all simulation components operate according to a common logical clock.
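A minimal sketch of such a data frame is shown below; the sync word, field widths, and byte-wise checksum are illustrative assumptions, not the exact protocol deployed with MCS-Sim.

```python
import struct

FRAME_HEADER = 0xAA55          # assumed 2-byte sync word
MSG_UAV_STATE = 0x01           # assumed message type identifier

def pack_frame(msg_type: int, payload: bytes) -> bytes:
    """Pack header | length | type | content | checksum into one frame.

    The checksum here is a simple byte-wise sum modulo 256; a real
    deployment might use a CRC instead.
    """
    body = struct.pack("<HBB", FRAME_HEADER, len(payload), msg_type) + payload
    checksum = sum(body) & 0xFF
    return body + bytes([checksum])

# Example: a UAV pose message (x, y, z, yaw, pitch, roll as float32)
payload = struct.pack("<6f", 10.0, 5.0, 200.0, 0.1, 0.0, 0.0)
frame = pack_frame(MSG_UAV_STATE, payload)
```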

4. Functional Validation Through Simulation

In this section, the main functions of MCS-Sim are validated through three simulation approaches: (1) synthetic dataset generation using a customized MCS configuration; (2) visual perception algorithms testing; (3) closed-loop simulation through HIL.
For visual perception algorithm testing, the generated dataset samples are used to benchmark several representative visual perception algorithms (e.g., optical flow, scene flow, depth estimation, visual SLAM). The purpose is to demonstrate the practicality of the simulator, not to introduce new algorithms.
For closed-loop simulation, firstly, the image injection function is verified by simulating multichannel image capture from an ACES and performing real-time stitching on the photoelectric data processing device. Secondly, end-to-end system verification is demonstrated via a practical HIL case: real-time ground target tracking using UAV electro-optical payloads. Simulated sensor feedback directly interfaces with embedded perception algorithms for closed-loop control validation.

4.1. Synthetic Dataset Generation

During image data generation, RGB images and multi-modal ground truth data (depth maps, segmentation masks, surface normals, optical flow) are extracted by parsing the rendering pipeline and post-processing. Camera and dynamic object poses are synchronously recorded. These raw data form the basis for deriving task-specific annotations for visual perception. For example, the stereo disparity is calculated using depth maps and camera calibration parameters. LiDAR point clouds are created by sampling depth values from virtual cameras with synthetic noise. Inertial measurement unit (IMU) data is generated from pose changes, incorporating noise models. Occlusion masks are defined using geometric constraints. While similar approaches exist in synthetic datasets (e.g., TartanAir [12], Spring [47]), producing these annotations significantly increases development effort. MCS-Sim prioritizes adaptability: its modular design allows for rapid reconfiguration of MCS and other simulation settings to match experimental requirements. To maintain focus, raw unprocessed datasets generated through our pipeline are enough for testing some representative visual perception algorithms.
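As one example of deriving task-specific annotations from the raw outputs, the sketch below converts a ground-truth depth map into a stereo disparity map for a rectified pair such as cameras 0 and 3, using the standard relation d = f_u · B / Z; the array names and example values are illustrative.

```python
import numpy as np

def depth_to_disparity(depth, f_u, baseline, max_depth=250.0):
    """Convert a ground-truth depth map (metres) into stereo disparity (pixels).

    depth    : HxW array of linear depth values from the simulator
    f_u      : horizontal pixel focal length of the rectified pair
    baseline : stereo baseline in metres (0.5 m for cameras 0 and 3)
    """
    disparity = np.zeros_like(depth, dtype=np.float32)
    valid = (depth > 0) & (depth < max_depth)          # mask sky / far points
    disparity[valid] = f_u * baseline / depth[valid]   # d = f_u * B / Z
    return disparity, valid

# Example with the configuration of Section 4.1 (baseline 0.5 m, illustrative f_u)
depth = np.random.uniform(1.0, 300.0, size=(720, 1280)).astype(np.float32)
disp, valid = depth_to_disparity(depth, f_u=960.0, baseline=0.5)
```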
Specifically, an unstructured MCS comprising four cameras is customized as depicted in Figure 3, with the detailed configuration settings outlined in Table 3. These cameras lie on a single plane, with cameras 0 through 2 forming a simple cylindrical ACES with a radius of 30 cm. In addition, cameras 0 and 3 form a binocular stereo pair with a baseline of 50 cm. The combination of the two types of structures enables this MCS configuration to cover most autonomous driving visual perception and stereo vision applications. Following the convention established by other synthetic datasets, we omit calibration and temporal misalignment errors in this configuration. This approach ensures test results can be accurately compared against ground-truth data.
According to the virtual MCS, a sample dataset with six sequences is generated using MCS-Sim. This dataset encompasses three scenarios: uptown, warehouse, and forest environments, as depicted in Figure 4. Each scenario includes sequences with forward and loop motion patterns. The lighting changes in these scenes are slightly different, and a rainy day environment (Warehouse2) is also set up. The demo dataset is available for download at https://github.com/qqm-666/MCS-Sim (accessed on 1 September 2025). A more extensive dataset will be released at the same repository in the future.

4.2. Visual Perception Algorithms Testing

In this subsection, the dataset samples generated as described in Section 4.1 are used to benchmark a selection of representative visual perception algorithms. For general applicability, two typical methods from each of three domains (optical flow and scene flow, depth estimation, and visual SLAM) are subjected to zero-shot evaluation. This evaluation aims to demonstrate the practical utility of the simulator, rather than to introduce novel algorithms.

4.2.1. Optical Flow and Scene Flow Estimation

Optical flow represents 2D pixel motion between consecutive image frames, while scene flow describes 3D displacement of points in spatial scenes. Both are vital for tasks like object detection, object tracking, action recognition, facial expression analysis, among others. Dense motion flow estimation has been applied in autonomous driving and intelligent robot systems. Two representative methods, RAFT [48] (Optical flow) and RAFT-3D [49] (Scene flow), were evaluated using the generated dataset.
For RAFT [48], the authors’ provided fine-tuned model (raft-kitti.pth) was used to evaluate it. Metrics computed included End Point Error (EPE or EPE2D) and F1-all, with results shown in Table 4. During testing, the number of iterations was set to 24 for the recurrent unit, and only the optical flow for camera 0 was estimated. Figure 5 demonstrates qualitative optical flow predictions from our dataset.
The model demonstrates strong generalization to unseen synthetic scenarios despite lacking dataset-specific training. For the Forest1 test sequence, performance drops (EPE is 3.182, F1-all is 21.281%) due to dense tree proximity, which creates substantial pixel displacement and matching ambiguity. Other sequences show better results. Weather adaptability is evident in the rainy condition of Warehouse2, maintaining consistent metrics with Warehouse1. However, optical flow estimation at object boundaries proves problematic in wooded environments (Forest1 and Forest2), caused by repetitive leaf textures and absence of dataset fine-tuning.
RAFT-3D [49] is configured with bi-Laplacian smoothing, 16 iterations for the recurrent unit, and a fine-tuned model (raft3d_kitti.pth) as recommended by the authors. Cameras 0 and 3 are selected as binocular stereo inputs, separated by a 50 cm baseline. Stereo matching is performed via GA-Net [50] to generate disparity maps, followed by RAFT-3D inference. Valid pixels are defined as those with depth less than 250 m and optical flow less than 250 pixels. Evaluation metrics align with RAFT, including EPE2D and F1-all, with an added EPE3D. Threshold metrics $\delta_{3D} < 5$ m and $\delta_{3D} < 10$ m measure the fraction of pixels within the specified error bounds. Results are summarized in Table 4, with qualitative comparisons in Figure 5.
Table 4 shows that RAFT-3D outperforms RAFT in optical flow accuracy across most sequences, as indicated by bold values. This improvement stems primarily from incorporating binocular disparity, except in the Forest2 sequence. The most pronounced gains occur in Forest1, where low-altitude forest traversal creates texture-poor RGB imagery that challenges optical flow algorithms. For RAFT-3D, depth information enhances structural awareness, thereby improving accuracy. However, Figure 5 reveals significant sky-region flow errors in RAFT-3D, though these pixels are excluded from evaluation. This arises from the inaccurate sky matching of GA-Net [50], traceable to pre-training datasets (Flyingchairs [51], Flyingthings3D [52]) lacking sky content. Fine-tuning on KITTI [53,54] datasets, which also lack sky annotations, fails to resolve this limitation.
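For reference, the end-point-error metrics reported above can be computed as in the following sketch; the 3-pixel/5% outlier rule follows the common KITTI convention, and the array names are illustrative.

```python
import numpy as np

def flow_metrics(flow_pred, flow_gt, valid=None):
    """Compute EPE (average end-point error) and F1-all for 2D optical flow.

    flow_pred, flow_gt : HxWx2 arrays of (u, v) displacements in pixels
    valid              : optional HxW boolean mask of evaluated pixels
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel end-point error
    mag = np.linalg.norm(flow_gt, axis=-1)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)

    epe = err[valid].mean()
    # KITTI-style outliers: error > 3 px AND > 5% of the ground-truth magnitude
    outlier = (err > 3.0) & (err > 0.05 * mag)
    f1_all = 100.0 * outlier[valid].mean()
    return epe, f1_all
```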

4.2.2. Depth Estimation

Vision-based 3D perception offers a cost-effective alternative to expensive depth sensors like LiDAR by leveraging rich semantic information. Depth estimation acts as a critical link between 2D image and 3D environmental understanding, driving progress in downstream applications. MCS-based egocentric depth estimation particularly benefits 3D reconstruction and autonomous driving sensing. Two surround MCS methods are evaluated using the generated dataset: SurroundDepth [55] and R3D3 [56].
SurroundDepth [55] is evaluated using the source code with identical settings as the DDAD dataset [57] implementation. Fine-tuned models (ddad_scale/encoder.pth and ddad_scale/depth.pth) are directly applied. During inference, the batch size is set to 4, processing four synchronized camera inputs per frame. Table 5 presents quantitative results within a 100 m range, evaluated via Absolute Relative Error (AbsRel), Squared Relative Error (SqRel), Root Mean Squared Error (RMSE), and accuracy with threshold $\delta < 1.25$. Qualitative comparisons on our dataset are shown in Figure 6.
For R3D3 [56], the fine-tuned models offered by the authors (completion_ddad.ckpt and r3d3_finetuned.ckpt) are used with all other parameters fixed. Each sequence starts with 3 warmup frames, applying a mean flow threshold of 1.75 for filtering. The co-visibility graph is configured with $\Delta t_{\mathrm{intra}} = 3$, $r_{\mathrm{intra}} = 2$, $\Delta t_{\mathrm{inter}} = 2$, and $r_{\mathrm{inter}} = 2$. Evaluation follows the same metrics as DDAD, with scale-aware results within 100 m shown in Table 5. Qualitative comparisons appear in Figure 6.
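The depth metrics reported in Table 5 are standard; a minimal sketch of their computation over valid pixels within 100 m is given below, with illustrative array names.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=100.0):
    """AbsRel, SqRel, RMSE and threshold accuracy (delta < 1.25) for depth maps."""
    valid = (gt > 0) & (gt < max_depth)
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta = np.maximum(pred / gt, gt / pred)
    a1 = np.mean(delta < 1.25)          # accuracy with threshold 1.25
    return abs_rel, sq_rel, rmse, a1
```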
SurroundDepth demonstrates better overall accuracy than R3D3. However, both methods exhibit reduced precision when benchmarked against the results of their original papers, revealing limited generalization. Figure 6 shows weakened edge perception for both methods, particularly in repetitive-texture vegetation areas. This stems from their self-supervised training relying on sparse LiDAR data rather than the dense pixel-wise annotations provided by MCS-Sim. The training datasets, DDAD [57] and nuScenes [58], focus on road scenes, differing significantly from the diverse objects and overhead squint perspectives of the generated dataset. These domain discrepancies contribute to the observed performance degradation.

4.2.3. Visual SLAM

SLAM aims to construct a map of an unknown environment and precisely locate sensors within this map, with a heightened emphasis on real-time functionality. Visual SLAM solutions, in which the main sensors are cameras, hold significant appeal due to the low cost of cameras and their sensitivity to texture information. This facilitates robust and precise self-positioning, as well as the perception of 3D structure. For the generated dataset, two representative methods, ORB-SLAM2 [59] and Stereo DSO [60], are selected as the subjects for this testing. Additionally, the localization implementation of the multi-camera algorithm R3D3 [56] is similar to that of a visual odometry system, except that it lacks real-time capability. Its trajectory is also included in the evaluation.
ORB-SLAM2 [59] is an open-source visual SLAM framework supporting monocular, stereo, and RGB-D cameras. It includes loop closure, relocalization, and map reuse, requiring descriptor computation for feature points with re-projection error optimization. In contrast, Stereo DSO [60] optimizes photometric errors, eliminating descriptor calculations for faster processing. Both methods represent conventional visual SLAM approaches. In our tests, a stereoscopic camera is formed from cameras 0 and 3 (baseline 50 cm), adopting default parameters from the source code. The performance evaluation employs Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). Both ATE and RPE use only the translation component, and their units are in meters. Quantitative results appear in Table 6, with trajectory visualizations in Figure 7.
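For completeness, the translation-only ATE and RPE used here can be sketched as follows, assuming the estimated trajectory has already been time-associated and aligned to the ground truth (e.g., via a Umeyama fit); poses are N×4×4 homogeneous matrices and the names are illustrative.

```python
import numpy as np

def ate_rmse(traj_est, traj_gt):
    """Absolute trajectory error (translation only, metres), after alignment."""
    diff = traj_est[:, :3, 3] - traj_gt[:, :3, 3]
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

def rpe_rmse(traj_est, traj_gt, delta=1):
    """Relative pose error (translation only, metres) over a fixed frame gap."""
    errs = []
    for i in range(len(traj_gt) - delta):
        rel_gt = np.linalg.inv(traj_gt[i]) @ traj_gt[i + delta]
        rel_est = np.linalg.inv(traj_est[i]) @ traj_est[i + delta]
        err = np.linalg.inv(rel_gt) @ rel_est
        errs.append(np.linalg.norm(err[:3, 3]))
    return np.sqrt(np.mean(np.square(errs)))
```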
Under conventional motion patterns with small viewpoint changes (Uptown1, Warehouse1, Uptown2, and Forest2), all three algorithms can track the trajectory completely. Among traditional methods, ORB-SLAM2 [59] generally achieves higher accuracy than Stereo DSO [60] on the nonlinear trajectory Uptown2 due to its loop-closure and relocalization modules. An exception occurs in Forest2, where repetitive vegetation textures lack distinct features, causing ORB-SLAM2 to fail. Here, Stereo DSO [60] provides significantly better accuracy when loop-closure mechanisms are ineffective (Uptown1, Warehouse1, and Forest2). Notably, Stereo DSO requires a longer initialization period than ORB-SLAM2, as observed in repetitive-texture and rainy scenes (Forest1, Forest2, and Warehouse2). R3D3 [56] exhibits larger-scale errors and fails to leverage the advantages of MCS.
Traditional binocular SLAM methods outperform R3D3 [56] in new scenes, primarily due to the generalization limits of deep learning models and the restricted FOV overlap in our MCS configuration. However, in challenging environments, such as rainy conditions (Warehouse2) and dense forests (Forest1), R3D3 [56] demonstrates greater robustness, maintaining positioning accuracy and full trajectory tracking. This advantage emerges when R3D3 [56] uses dense flow fields to refine inter-frame and intra-frame correspondences, compensating for sparse feature limitations in traditional SLAM. Additionally, the training data includes complex environments and diverse motion patterns with significant view changes.

4.3. Closed-Loop Simulation

To implement closed-loop simulation for visual perception algorithms through HIL, the image injection interface is a crucial element. As a means of functional verification, wide-FOV imaging is simulated by capturing multichannel images from a virtual ACES and performing real-time stitching on the photoelectric data processing device. Furthermore, closed-loop system testing is demonstrated through a paradigmatic example of real-time ground target tracking using UAV electro-optical payloads. Simulated vision sensor feedback interfaces directly with embedded perception algorithms for closed-loop control validation. The two simulation examples are presented to demonstrate functionality validation, rather than to evaluate specific algorithms or software implementations.
Taking the closed-loop simulation of a UAV system as an example, its key information flow is illustrated in Figure 8. In the figure, solid lines represent the data stream between hardware entities, solid-arrow dashed lines denote the serial communication data stream between the simulator and hardware entities, and hollow-arrow dashed lines indicate the image injection data stream. The simulated visual sensor injects rendered images in real time into the photoelectric data processing device. Simultaneously, it sends status parameters to both the photoelectric data processing device and the flight control device. After calculations by the embedded visual perception algorithm, the photoelectric data processing device feeds the processed information back to the flight control device and the ground station. It also transmits the video stream back to the ground station for display. The ground station sends operational control commands to the flight control device. Subsequently, the flight control device performs flight control calculations based on the control commands from the ground station, the measured position, attitude, and velocity of the UAV body, as well as the status of the visual sensor. It then outputs a control message to the simulator to update the pose of the UAV and its vision sensor. Additionally, the photoelectric data processing device sends a control message to the simulator to update the status of the vision sensor.

4.3.1. Functional Verification of Image Injection

MCS-based computational imaging has emerged as a critical method for wide FOV, high-resolution image reconstruction in photoelectric systems [61]. This approach overcomes the inherent limitation of the space bandwidth product in monocular imaging systems and demonstrates proven effectiveness in aerial applications. As an illustrative example, the calibration of both intrinsic and extrinsic parameters for a spherical ACES (Figure 9a) is undertaken using a random noise calibration pattern [62]. The ACES configuration (Table 7) is then input into MCS-Sim for simulation.
In this example, the objective is to verify the image injection function. The ACES-specific image stitching algorithm is pre-installed on the hardware entity. The simulator generates ACES image data and combines them into a single image with a resolution of 1920 × 1080 in real time (Figure 9b). This combined image is then immediately injected into the hardware entity. Each image set undergoes seamless stitching on-chip, producing a wide FOV, high-resolution composite image, as depicted in Figure 9c. The wide FOV image is then sent back to the ground station for display.

4.3.2. Closed-Loop System Testing

Closed-loop simulation creates a real-time feedback loop between virtual environments and tested systems. To validate the closed-loop simulation capabilities of the simulator, we conducted a comprehensive simulation of a small fixed-wing UAV equipped with a monocular camera, performing target recognition, tracking, and motion following.
Prior to testing, the photoelectric data processing device is loaded with the following vision algorithms: YOLOv3 [63] for target detection, SORT [64] for data association, and GOTURN [65] for single target tracking. Initially, YOLOv3 and SORT work in tandem to detect multiple objects, with SORT correlating YOLOv3 detection results frame by frame. Subsequently, GOTURN is employed to track the target that SORT identifies with the highest correlation confidence.
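Conceptually, the detect-associate-track chain running on the photoelectric data processing device can be summarized as in the sketch below; detect_yolov3, SortTracker, and GoturnTracker are hypothetical wrappers standing in for the embedded implementations, so this illustrates the control flow rather than the deployed code.

```python
def track_ground_target(frames, detect_yolov3, SortTracker, GoturnTracker):
    """Illustrative control flow: multi-object detection + association,
    then single-target tracking of the most confident track."""
    sort = SortTracker()
    goturn, locked = None, None

    for frame in frames:
        if locked is None:
            detections = detect_yolov3(frame)       # [(box, score, class), ...]
            tracks = sort.update(detections)        # frame-by-frame association
            if tracks:
                # Lock onto the track with the highest correlation confidence.
                locked = max(tracks, key=lambda t: t.confidence)
                goturn = GoturnTracker(frame, locked.box)
        else:
            box = goturn.update(frame)              # single-target tracking
            yield box                               # fed back to flight control
```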
The flight control is managed on a hardware entity, which adjusts the flight path based on the output from the photoelectric data processing device and inputs the control instructions into the simulator for further simulation. During the simulation, the UAV operates at an altitude of 200 m with a flight speed of 25 m/s. Upon recognizing the ground target (using an aircraft as an example), it initiates target tracking and modifies its flight path to follow the target accordingly, as shown in Figure 10.

5. Conclusions and Future Work

We introduce a photo-realistic simulator, MCS-Sim, designed to address challenges in multi-camera UAV visual perception research. This simulator integrates multi-camera system simulation, UAV motion simulation, dynamic scene simulation, synthetic dataset generation, and a closed-loop testing interface into a unified platform. Consequently, it supports the prototype design of multi-camera systems, the preparation of synthetic datasets, and the closed-loop verification of the system. The functionality of MCS-Sim was verified through simulation experiments, demonstrating its significance in advancing multi-camera UAV visual perception research.
The future work will focus on three key directions to establish MCS-Sim as a versatile platform for next-generation AI research. (1) Enhancing existing functionalities: We will incorporate supported capabilities, such as dense flow estimation, 3D reconstruction, visual SLAM, object detection, and tracking, into the simulator to better facilitate algorithm verification processes. Additionally, we will expand the vehicle models, scene libraries, physical simulation fidelity, and software optimization to enhance the simulator’s capabilities and expand its application scope. (2) Developing novel capabilities: We will introduce support for autonomous movable entities and embodied AI systems, along with automated dataset generation pipelines. These additions will broaden the applicability of MCS-Sim to a wider range of research scenarios. (3) Integrating cutting-edge techniques: We will integrate advanced techniques such as neural rendering (e.g., NeRF, 3DGS) and comprehensive world models to advance the simulator’s capabilities. However, these advancements will require addressing challenges related to computational costs.

Author Contributions

Conceptualization, G.W. and H.F.; methodology, Q.Q.; software, Q.Q.; validation, Q.Q. and Y.P.; formal analysis, B.L.; investigation, Q.Q.; resources, G.W.; data curation, Q.Q.; writing—original draft preparation, Q.Q.; writing—review and editing, G.W.; visualization, Q.Q.; supervision, H.F.; project administration, G.W.; funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Outstanding Youth Science Fund Project of National Natural Science Foundation of China (62303478) and High-level Talent Introduction Program (202401-YJRC-XX-010).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Stanga, C.; Banfi, F.; Roascio, S. Enhancing Building Archaeology: Drawing, UAV Photogrammetry and Scan-to-BIM-to-VR Process of Ancient Roman Ruins. Drones 2023, 7, 521. [Google Scholar] [CrossRef]
  2. Mourtzis, D.; Angelopoulos, J.; Panopoulos, N. Unmanned Aerial Vehicle (UAV) path planning and control assisted by Augmented Reality (AR): The case of indoor drones. Int. J. Prod. Res. 2023, 62, 3361–3382. [Google Scholar] [CrossRef]
  3. Tang, X.W.; Huang, Y.; Shi, Y.; Wu, Q. MUL-VR: Multi-UAV Collaborative Layered Visual Perception and Transmission for Virtual Reality. IEEE Trans. Wirel. Commun. 2025, 24, 2734–2749. [Google Scholar] [CrossRef]
  4. Zhuang, L.; Zhong, X.; Xu, L.; Tian, C.; Yu, W. Visual SLAM for Unmanned Aerial Vehicles: Localization and Perception. Sensors 2024, 24, 2980. [Google Scholar] [CrossRef] [PubMed]
  5. Qi, Q.; Fu, R.; Shao, Z.; Wang, P.; Fan, H. Multi-aperture optical imaging systems and their mathematical light field acquisition models. Front. Inf. Technol. Electron. Eng. 2022, 23, 823–844. [Google Scholar] [CrossRef]
  6. Liu, F.; Wu, X.; Zhao, L.; Duan, J.; Li, J.; Shao, X. Research Progress of Wide Field and High Resolution Computational Optical Imaging System. Laser Optoelectron. Prog. 2021, 58, 1811001. [Google Scholar] [CrossRef]
  7. Urban, S.; Hinz, S. MultiCol-SLAM—A Modular Real-Time Multi-Camera SLAM System. arXiv 2016, arXiv:1610.07336. [Google Scholar] [CrossRef]
  8. Kaveti, P.; Vaidyanathan, S.N.; Chelvan, A.T.; Singh, H. Design and Evaluation of a Generic Visual SLAM Framework for Multi Camera Systems. IEEE Robot. Autom. Lett. 2023, 8, 7368–7375. [Google Scholar] [CrossRef]
  9. Yu, H.; Wang, J.; He, Y.; Yang, W.; Xia, G.S. MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras. arXiv 2024, arXiv:2412.03146. [Google Scholar] [CrossRef]
  10. Tian, Q.; Tan, X.; Xie, Y.; Ma, L. DrivingForward: Feed-forward 3D Gaussian Splatting for Driving Scene Reconstruction from Flexible Surround-view Input. arXiv 2024, arXiv:2409.12753. [Google Scholar] [CrossRef]
  11. Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021. [Google Scholar]
  12. Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. TartanAir: A Dataset to Push the Limits of Visual SLAM. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4909–4916. [Google Scholar] [CrossRef]
  13. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar] [CrossRef]
  14. Zarrad, A. Game Engine Solutions. In Simulation and Gaming; InTech: Riyadh, Saudi Arabia, 2018; pp. 75–87. [Google Scholar] [CrossRef]
  15. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Conference on Robot Learning (CoRL 2017), Mountain View, CA, USA, 13–15 November 2017. [Google Scholar] [CrossRef]
  16. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics; Hutter, M., Siegwart, R., Eds.; Springer: Cham, Switzerland, 2017; pp. 621–635. [Google Scholar] [CrossRef]
  17. Qiu, W.; Zhong, F.; Zhang, Y.; Qiao, S.; Xiao, Z.; Kim, T.S.; Wang, Y. UnrealCV: Virtual Worlds for Computer Vision. In Proceedings of the 25th ACM international conference on Multimedia, MM ’17, Mountain View, CA, USA, 23–27 October 2017; pp. 1221–1224. [Google Scholar] [CrossRef]
  18. Müller, M.; Casser, V.; Lahoud, J.; Smith, N.; Ghanem, B. Sim4CV: A Photo-Realistic Simulator for Computer Vision Applications. Int. J. Comput. Vis. 2018, 126, 902–919. [Google Scholar] [CrossRef]
  19. Pollok, T.; Junglas, L.; Ruf, B.; Schumann, A. UnrealGT: Using Unreal Engine to Generate Ground Truth Datasets. In Advances in Visual Computing; Springer: Cham, Switzerland, 2019; pp. 670–682. [Google Scholar] [CrossRef]
  20. Bonetto, E.; Xu, C.; Ahmad, A. GRADE: Generating Realistic Animated Dynamic Environments for Robotics Research. arXiv 2023, arXiv:2303.04466. [Google Scholar] [CrossRef]
  21. Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A Platform for Embodied AI Research. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9338–9346. [Google Scholar] [CrossRef]
  22. Xia, F.; Shen, W.B.; Li, C.; Kasimbeg, P.; Tchapmi, M.E.; Toshev, A.; Martin-Martin, R.; Savarese, S. Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments. IEEE Robot. Autom. Lett. 2020, 5, 713–720. [Google Scholar] [CrossRef]
  23. Xiang, F.; Qin, Y.; Mo, K.; Xia, Y.; Zhu, H.; Liu, F.; Liu, M.; Jiang, H.; Yuan, Y.; Wang, H.; et al. SAPIEN: A SimulAted Part-Based Interactive Environment. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11094–11104. [Google Scholar] [CrossRef]
  24. Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Deitke, M.; Ehsani, K.; Gordon, D.; Zhu, Y.; et al. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv 2017, arXiv:1712.05474. [Google Scholar] [CrossRef]
  25. Gan, C.; Schwartz, J.; Alter, S.; Mrowca, D.; Schrimpf, M.; Traer, J.; De Freitas, J.; Kubilius, J.; Bhandwaldar, A.; Haber, N.; et al. ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), San Diego, CA, USA, 6–14 December 2020. Track on Datasets and Benchmarks. [Google Scholar] [CrossRef]
  26. Cheng, Z.; Wang, Z.; Hu, J.; Hu, S.; Liu, A.; Tu, Y.; Li, P.; Shi, L.; Liu, Z.; Sun, M. LEGENT: Open Platform for Embodied Agents. arXiv 2024, arXiv:2404.18243. [Google Scholar] [CrossRef]
  27. Zhou, Q.; Chen, S.; Wang, Y.; Xu, H.; Du, W.; Zhang, H.; Du, Y.; Tenenbaum, J.B.; Gan, C. HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024. [Google Scholar] [CrossRef]
  28. Puig, X.; Undersander, E.; Szot, A.; Cote, M.D.; Yang, T.Y.; Partsey, R.; Desai, R.; Clegg, A.W.; Hlavac, M.; Min, S.Y.; et al. Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots. arXiv 2023, arXiv:2310.13724. [Google Scholar] [CrossRef]
  29. Zhong, F.; Wu, K.; Wang, C.; Chen, H.; Ci, H.; Li, Z.; Wang, Y. UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI. arXiv 2024, arXiv:2412.20977. [Google Scholar] [CrossRef]
  30. Zhou, X.; Qiao, Y.; Xu, Z.; Wang, T.H.; Chen, Z.; Zheng, J.; Xiong, Z.; Wang, Y.; Zhang, M.; Ma, P.; et al. Genesis: A Universal and Generative Physics Engine for Robotics and Beyond; Technical Report; Carnegie Mellon University: Pittsburgh, PA, USA, 2024; Available online: https://github.com/Genesis-Embodied-AI/Genesis (accessed on 1 September 2025).
31. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
32. Kerbl, B.; Kopanas, G.; Leimkuehler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  33. Wu, Z.; Liu, T.; Luo, L.; Zhong, Z.; Chen, J.; Xiao, H.; Hou, C.; Lou, H.; Chen, Y.; Yang, R.; et al. MARS: An Instance-Aware, Modular and Realistic Simulator for Autonomous Driving. In Artificial Intelligence; Springer Nature Singapore: Singapore, 2024; pp. 3–15. [Google Scholar] [CrossRef]
  34. Yang, Z.; Chen, Y.; Wang, J.; Manivasagam, S.; Ma, W.C.; Yang, A.J.; Urtasun, R. UniSim: A Neural Closed-Loop Sensor Simulator. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar] [CrossRef]
  35. Tonderski, A.; Lindström, C.; Hess, G.; Ljungbergh, W.; Svensson, L.; Petersson, C. NeuRAD: Neural Rendering for Autonomous Driving. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14895–14904. [Google Scholar] [CrossRef]
  36. Ljungbergh, W.; Tonderski, A.; Johnander, J.; Caesar, H.; Åström, K.; Felsberg, M.; Petersson, C. NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving. arXiv 2024, arXiv:2404.07762. [Google Scholar] [CrossRef]
  37. Zhou, H.; Lin, L.; Wang, J.; Lu, Y.; Bai, D.; Liu, B.; Wang, Y.; Geiger, A.; Liao, Y. HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving. arXiv 2024, arXiv:2412.01718. [Google Scholar] [CrossRef]
  38. Hu, A.; Russell, L.; Yeo, H.; Murez, Z.; Fedoseev, G.; Kendall, A.; Shotton, J.; Corrado, G. GAIA-1: A Generative World Model for Autonomous Driving. arXiv 2023, arXiv:2309.17080. [Google Scholar] [CrossRef]
  39. Wang, X.; Zhu, Z.; Huang, G.; Chen, X.; Zhu, J.; Lu, J. DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving. arXiv 2023, arXiv:2309.09777. [Google Scholar] [CrossRef]
  40. Gao, S.; Yang, J.; Chen, L.; Chitta, K.; Qiu, Y.; Geiger, A.; Zhang, J.; Li, H. Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. arXiv 2024, arXiv:2405.17398. [Google Scholar] [CrossRef]
  41. Yang, J.; Gao, S.; Qiu, Y.; Chen, L.; Li, T.; Dai, B.; Chitta, K.; Wu, P.; Zeng, J.; Luo, P.; et al. Generalized Predictive Model for Autonomous Driving. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14662–14672. [Google Scholar] [CrossRef]
  42. Yang, X.; Wen, L.; Ma, Y.; Mei, J.; Li, X.; Wei, T.; Lei, W.; Fu, D.; Cai, P.; Dou, M.; et al. DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving. arXiv 2024, arXiv:2408.00415. [Google Scholar] [CrossRef]
  43. Tu, S.; Zhou, X.; Liang, D.; Jiang, X.; Zhang, Y.; Li, X.; Bai, X. The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey. arXiv 2025, arXiv:2502.10498. [Google Scholar] [CrossRef]
  44. Qi, Q.; Fu, R.; Wang, P.; Wang, M.; Fan, H. Design of Optical Compound Eye Simulation Software for Small Aircraft Applications. J. Syst. Simul. 2022, 34, 1999–2008. [Google Scholar] [CrossRef]
  45. Qi, Q.; Fu, R.; Lü, M.; Wang, P.; Li, C.; Fan, H. Design of Small Aircraft Optical Compound Eye Simulation Test System. Aero Weapon. 2022, 29, 66–72. [Google Scholar] [CrossRef]
  46. Yamashita, T.; Ono, S. A statistical method for time synchronization of computer clocks with precisely frequency-synchronized oscillators. In Proceedings of the 18th International Conference on Distributed Computing Systems (Cat. No.98CB36183), ICDCS-98, Amsterdam, The Netherlands, 26–29 May 1998; pp. 32–39. [Google Scholar] [CrossRef]
  47. Mehl, L.; Schmalfuss, J.; Jahedi, A.; Nalivayko, Y.; Bruhn, A. Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4981–4991. [Google Scholar] [CrossRef]
  48. Teed, Z.; Deng, J. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In Computer Vision—ECCV 2020; Springer International Publishing: Glasgow, UK, 2020; pp. 402–419. [Google Scholar] [CrossRef]
49. Teed, Z.; Deng, J. RAFT-3D: Scene Flow using Rigid-Motion Embeddings. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 8371–8380. [Google Scholar] [CrossRef]
50. Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H. GA-Net: Guided Aggregation Net for End-To-End Stereo Matching. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  51. Ilg, E.; Saikia, T.; Keuper, M.; Brox, T. Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation. In Computer Vision—ECCV 2018; Springer International Publishing: Munich, Germany, 2018; pp. 626–643. [Google Scholar] [CrossRef]
  52. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar] [CrossRef]
  53. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  54. Menze, M.; Geiger, A. Object Scene Flow for Autonomous Vehicles. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar] [CrossRef]
  55. Wei, Y.; Zhao, L.; Zheng, W.; Zhu, Z.; Rao, Y.; Huang, G.; Lu, J.; Zhou, J. SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation. In Proceedings of the 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022. [Google Scholar] [CrossRef]
  56. Schmied, A.; Fischer, T.; Danelljan, M.; Pollefeys, M.; Yu, F. R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3193–3203. [Google Scholar] [CrossRef]
  57. Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D Packing for Self-Supervised Monocular Depth Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2482–2491. [Google Scholar] [CrossRef]
  58. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  59. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  60. Wang, R.; Schworer, M.; Cremers, D. Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
61. Wu, F.; Wang, B.; Jingya, Q.; Cao, M.; Sang, Y.; Li, S.; Zhang, Y.; Chen, Q.; Zuo, C. A review of airborne multi-aperture panoramic image compositing. Acta Aeronaut. Astronaut. Sin. 2025, 46, 630525. [Google Scholar] [CrossRef]
  62. Li, D.; Wang, G.; Liu, J.; Fan, H.; Li, B. Joint Internal and External Parameters Calibration of Optical Compound Eye Based on Random Noise Calibration Pattern. J. Electron. Inf. Technol. 2024, 46, 2898–2907. [Google Scholar] [CrossRef]
  63. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
64. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar] [CrossRef]
  65. Held, D.; Thrun, S.; Savarese, S. Learning to Track at 100 FPS with Deep Regression Networks. In Computer Vision—ECCV 2016; Springer International Publishing: Amsterdam, The Netherlands, 2016; pp. 749–765. [Google Scholar] [CrossRef]
Figure 1. Overview of the MCS-Sim architecture. Developed using the UNIGINE engine, the simulator comprises several key components: a simulation setting module, an MCS model module, a UAV control module, a third-person visual display, a first-person view display, an image data generation module, a data recording module, a visual model library, and communication interfaces.
Figure 2. Definition of the relevant coordinate systems for MCS modeling. This establishes right-handed transformations between the world coordinates $F_w$, the MCS coordinates $F_s$, and the camera coordinates $F_c$.
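For completeness, one common way to compose these frames is sketched below. It assumes that $\mathbf{T}_c^{s}$ (Table 2) denotes the pose of camera $c$ in the MCS frame and introduces a hypothetical $\mathbf{T}_s^{w}$ for the pose of the MCS in the world frame; MCS-Sim's internal convention may differ.

```latex
% Hedged sketch: mapping a homogeneous world point X_w into camera c's frame,
% assuming T_c^s and T_s^w are poses (camera-in-MCS and MCS-in-world).
\[
  \mathbf{X}_c \;=\; \left(\mathbf{T}_c^{s}\right)^{-1}\left(\mathbf{T}_s^{w}\right)^{-1}\mathbf{X}_w,
  \qquad \mathbf{T}_c^{s},\,\mathbf{T}_s^{w} \in \mathrm{SE}(3).
\]
```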
Figure 3. The customized MCS comprising four cameras. The cameras lie on a single plane, with cameras 0 through 2 forming a cylindrical ACES with a radius of 30 cm; cameras 0 and 3 additionally act as a binocular stereo camera with a baseline of 50 cm.
Figure 4. Six sequences are generated based on Table 3. (a) Uptown1, Warehouse1, and Forest1 depict forward motion, while Uptown2, Warehouse2, and Forest2 depict loop motion; for the instance segmentation masks, only moving ground vehicles are shown as examples. (b) Pixel-wise ground-truth annotations (e.g., Uptown2).
Figure 5. Visual comparison of RAFT [48] and RAFT-3D [49] on our dataset. For each scene, the figures are arranged from left to right as follows: the RGB image, the ground truth, the optical flow prediction from RAFT, the 2D projection of the scene flow prediction from RAFT-3D, and the twist field $(\boldsymbol{\phi}, \boldsymbol{\tau}) = \log_{\mathrm{SE}(3)}(\mathbf{T})$ of RAFT-3D.
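As a note on notation, the twist field is assumed here to follow the standard SE(3) logarithm used by RAFT-3D, with $\boldsymbol{\phi}$ and $\boldsymbol{\tau}$ the rotational and translational components of the twist:

```latex
% Standard SE(3) log/exp pair behind the twist-field notation (assumed convention).
\[
  (\boldsymbol{\phi}, \boldsymbol{\tau}) = \log_{\mathrm{SE}(3)}(\mathbf{T}),
  \qquad
  \mathbf{T} = \exp\!\begin{pmatrix} [\boldsymbol{\phi}]_{\times} & \boldsymbol{\tau} \\ \mathbf{0}^{\top} & 0 \end{pmatrix} \in \mathrm{SE}(3),
\]
```

where $[\boldsymbol{\phi}]_{\times}$ denotes the skew-symmetric matrix of $\boldsymbol{\phi}$.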
Figure 6. Visual comparison of SurroundDepth [55] and R3D3 [56] on our dataset. The depth visualization range spans from 0 to 100 m.
Figure 7. Estimated trajectories of ORB-SLAM2, Stereo DSO, and R3D3 on the generated dataset.
Figure 8. The key information flow of closed-loop simulation.
Figure 9. Verification of the image injection function using image stitching as an example: (a) Physical prototype of the spherical ACES. (b) Image array generated by the simulator according to Table 7. (c) Video stream received at the ground station.
Figure 10. Closed-loop system testing. A simulation is conducted for a small fixed-wing UAV system. The UAV is equipped with a monocular camera and performs target recognition, tracking, and motion following. (a) The third-person visual display captured in the simulator; (b,c) the corresponding displays for target recognition (b) and tracking (c) captured at the ground station.
Table 1. Comparison of typical simulators.
Simulator | Vision Sensor | Image Quality | Base Engine | Scalability | Pixel-Wise Annotation | Hardware Requirement | Frame Rate
MCS-Sim (ours)S|C|AP|CUNIGINED|S|N|FL⩾30FPS
CARLA [15]MUED|S
AirSim [16]UED|S
Sim4CV [18]UED|S
GRADE [20]NISD|S|N|F
Habitat [21]MS|CHabitat-SimD|SL⩾30FPS
TDW [25]P|CUnity-L
LEGENT [26]S|CUnity-S
HAZARD [27]S|CUnityD|SW
MARS [33]S(H)|C(NeRF)-(D)|(S)W<30FPS
NeuRAD [35](NeRF)-(D)<30FPS
NeuroNCAP [36](NeRF)--<30FPS
HUGSIM [37](3DGS)-(D)|(S)|(F)⩾30FPS
DriveArena [42]S(H)|ICLimSim|(SD)--W⩾30FPS
: Vision-oriented simulator; : Physics-based simulator; : Neural closed-loop simulator; : Generative closed-loop simulator. Vision sensor: M: Monocular camera; S: Surround MCS; C: ACES; A: Arbitrary MCS. Image quality: S: Simple; P: Photo-realistic; H: High-fidelity; C: Spatio-temporal consistency; IC: Spatio-temporal inconsistency; ( ) denotes that it contains visual aliasing. Base Engine: UE: Unreal Engine; NIS: Nvidia Isaac Sim; SD: Stable Diffusion; ( ) denotes algorithm implementation. Scalability: The check mark denotes support for secondary development and the addition of more sensor types. Pixel-wise annotation: D: Depth map; S: Semantic segmentation; N: Surface normal; F: Optical flow; ( ) denotes that it contains errors. Hardware requirement: L: Laptop with a consumer-grade GPU; S: Connected to remote servers for training and deployment; W: Workstation with advanced GPUs. Frame rate: Average frame rate of integrated graphics rendering and physics computation.
Table 2. Relevant parameters pertaining to the configuration settings of MCS.
Type | Configuration | Variables for Rendering | Ground-Truth Recording
Number of cameras | $N_c$ | |
Optical settings for camera $c$ | $f_c, d_c, \gamma, N_{u,c}, N_{v,c}$ ($\sigma_{f_c}$, $w_{\mathrm{optical\_errors}}$) | $f_c + \delta f_c, d_c, \gamma, N_{u,c}, N_{v,c}$ ($w_{\mathrm{optical\_errors}}$) | $K_c$
Pose settings for camera $c$ | $n_{x,c}, n_{y,c}, n_{z,c}, X_c^s, Y_c^s, Z_c^s$ ($\sigma_{\varphi_c^s}, \sigma_{\theta_c^s}, \sigma_{\phi_c^s}, \sigma_{X_c^s}, \sigma_{Y_c^s}, \sigma_{Z_c^s}$) | $\varphi_c^s + \delta\varphi_c^s, \theta_c^s + \delta\theta_c^s, \phi_c^s + \delta\phi_c^s, X_c^s + \delta X_c^s, Y_c^s + \delta Y_c^s, Z_c^s + \delta Z_c^s$ | $T_c^s$
Configuration: These parameters are used for the rapid entry of configuration. Variables for rendering: These variables are used for graphical rendering execution via application programming interfaces. Ground-truth recording: These parameters are recorded as ground truth. $f_c$, $\sigma_{f_c}$: focal length and its error variance; $d_c$: image plane width; $\gamma$: pixel aspect ratio; $N_{u,c}$, $N_{v,c}$: image resolution; $w_{\mathrm{optical\_errors}}$: optical error parameter vector; $K_c$: intrinsic matrix. $n_{x,c}, n_{y,c}, n_{z,c}$ and $\varphi_c^s, \theta_c^s, \phi_c^s$: rotation parameters (axis-angle and Euler angles); $X_c^s, Y_c^s, Z_c^s$: translational deviations; $\sigma_{\varphi_c^s}, \sigma_{\theta_c^s}, \sigma_{\phi_c^s}, \sigma_{X_c^s}, \sigma_{Y_c^s}, \sigma_{Z_c^s}$: error variances; $T_c^s$: extrinsic matrix. ( ) denotes optional error terms; $\delta_{*}$ represents a random perturbation generated from the variance $\sigma_{*}$.
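To make the role of these parameters concrete, the following minimal sketch (our illustration, not the simulator's actual code) shows how the intrinsic matrix $K_c$ could be assembled from $(f_c, d_c, \gamma, N_{u,c}, N_{v,c})$ and how an optional focal-length perturbation $\delta f_c \sim \mathcal{N}(0, \sigma_{f_c}^2)$ might be drawn. The centred principal point and the aspect-ratio handling are assumptions.

```python
# Minimal sketch (assumptions): assembling the intrinsic matrix K_c of Table 2
# from the configuration parameters, and sampling an optional focal-length error.
import numpy as np

def intrinsic_matrix(f_mm, d_mm, gamma, n_u, n_v):
    """Pinhole intrinsics with an assumed centred principal point.

    f_mm  : focal length f_c in millimetres
    d_mm  : image plane width d_c in millimetres
    gamma : pixel aspect ratio
    n_u, n_v : image resolution N_{u,c} x N_{v,c}
    """
    f_u = f_mm * n_u / d_mm            # focal length in horizontal pixels
    f_v = f_u * gamma                  # vertical focal length scaled by aspect ratio (assumed)
    return np.array([[f_u, 0.0, n_u / 2.0],
                     [0.0, f_v, n_v / 2.0],
                     [0.0, 0.0, 1.0]])

def perturb(value, sigma, rng=None):
    """Return value + delta, with delta ~ N(0, sigma^2), as for the optional error terms."""
    if rng is None:
        rng = np.random.default_rng()
    return value + rng.normal(0.0, sigma)

# Example: camera 0 of the customized MCS in Table 3 (20 mm focal length,
# 20 mm image plane width, 720 x 640 pixels), with a hypothetical 0.1 mm focal error.
K0 = intrinsic_matrix(perturb(20.0, 0.1), 20.0, 1.0, 720, 640)
print(K0)
```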
Table 3. Detailed configuration settings of the customized MCS.
Camera | Focal Length (mm) | Image Plane Width (mm) | Image Resolution | Orientation (Axis-Angle) | Position Offset (mm)
0 | 20 | 20 | 720 × 640 | (0, 0, 0) | (0, 0, 300)
1 | | | | (0, 0.524, 0) | (150, 0, 259.808)
2 | | | | (0, 1.047, 0) | (259.808, 0, 150)
3 | | | | (0, 0, 0) | (500, 0, 300)
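As a worked check of the geometry in Table 3, the sketch below reproduces the orientation and position columns of the cylindrical sub-array: cameras 0 through 2 are each rotated about the y-axis by a multiple of 30° and placed on a 300 mm (30 cm) circle. The axis convention is our assumption; the printed values simply match the table.

```python
# Minimal sketch reproducing the cylindrical placement of cameras 0-2 in Table 3.
import math

radius_mm = 300.0
for i in range(3):
    angle = math.radians(30.0 * i)       # 0, 0.524, 1.047 rad (axis-angle about y, assumed)
    x = radius_mm * math.sin(angle)      # 0, 150, 259.808 mm
    z = radius_mm * math.cos(angle)      # 300, 259.808, 150 mm
    print(f"camera {i}: axis-angle (0, {angle:.3f}, 0), offset ({x:.3f}, 0, {z:.3f}) mm")
```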
Table 4. Experimental results of RAFT and RAFT-3D.
Sequence | RAFT EPE2D | RAFT F1-All | RAFT-3D EPE2D | RAFT-3D F1-All | RAFT-3D EPE3D | RAFT-3D δ_3D < 5 m | RAFT-3D δ_3D < 10 m
Uptown1 | 0.595 | 2.265% | 0.590 | 1.323% | 15.126 | 86.341% | 91.232%
Uptown2 | 1.210 | 3.716% | 1.252 | 3.454% | 9.424 | 88.079% | 92.366%
Warehouse1 | 1.688 | 2.491% | 0.687 | 2.223% | 2.334 | 94.284% | 98.557%
Warehouse2 | 1.525 | 4.757% | 1.453 | 4.097% | 3.130 | 95.479% | 98.171%
Forest1 | 3.182 | 21.281% | 1.929 | 8.464% | 27.215 | 70.993% | 76.922%
Forest2 | 0.924 | 4.935% | 1.062 | 5.989% | 5.506 | 75.326% | 86.422%
The projected optical flow of RAFT-3D is compared with that of RAFT. Bold font highlights the optimal result under the same evaluation metric. EPE2D is measured in pixels and EPE3D in meters.
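For reference, the metrics in Table 4 can be computed as sketched below. The 3 px / 5% outlier rule for F1-All follows the common KITTI convention, and the thresholds passed to the δ_3D helper correspond to the 5 m and 10 m columns; whether the paper uses exactly these definitions is an assumption.

```python
# Minimal sketch of the flow metrics in Table 4 (KITTI-style definitions assumed).
import numpy as np

def epe2d(flow_pred, flow_gt):
    """Mean 2D endpoint error in pixels; flows are (H, W, 2) arrays."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def f1_all(flow_pred, flow_gt):
    """Fraction of pixels whose error exceeds both 3 px and 5% of the GT magnitude."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    outlier = (err > 3.0) & (err > 0.05 * mag)
    return outlier.mean()

def delta_3d(points_pred, points_gt, threshold_m):
    """Fraction of points whose 3D endpoint error is below threshold_m (e.g. 5 or 10 m)."""
    err = np.linalg.norm(points_pred - points_gt, axis=-1)
    return (err < threshold_m).mean()
```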
Table 5. Experimental results of SurroundDepth and R3D3.
Sequence | AbsRel (SD / R3D3) | SqRel (SD / R3D3) | RMSE (SD / R3D3) | δ < 1.25 (SD / R3D3)
Uptown1 | 0.288 / 0.295 | 4.147 / 4.173 | 14.654 / 11.443 | 0.496 / 0.495
Uptown2 | 0.259 / 0.431 | 4.549 / 10.286 | 15.916 / 17.376 | 0.568 / 0.361
Warehouse1 | 0.225 / 0.437 | 3.718 / 12.699 | 14.770 / 20.153 | 0.597 / 0.397
Warehouse2 | 0.243 / 0.440 | 2.778 / 9.038 | 10.942 / 13.064 | 0.585 / 0.426
Forest1 | 0.303 / 0.424 | 4.677 / 9.303 | 15.017 / 12.561 | 0.521 / 0.404
Forest2 | 0.203 / 0.479 | 3.230 / 15.439 | 13.687 / 25.950 | 0.600 / 0.257
SurroundDepth is abbreviated as SD. Bold font highlights the optimal term under the same evaluation metric. The unit of RMSE is meters.
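The depth metrics reported in Table 5 follow the standard monocular-depth definitions. The sketch below illustrates them; the valid-depth mask and the 0 to 100 m range (taken from the caption of Figure 6) are assumptions about the evaluation setup.

```python
# Minimal sketch of the depth metrics in Table 5 (Eigen-style definitions).
import numpy as np

def depth_metrics(pred, gt, max_depth=100.0):
    """pred, gt: arrays of predicted and ground-truth depth in meters."""
    mask = (gt > 0) & (gt < max_depth)          # assumed valid-depth mask
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta_125 = np.mean(ratio < 1.25)           # accuracy under threshold 1.25
    return abs_rel, sq_rel, rmse, delta_125
```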
Table 6. Evaluation results of ORB-SLAM2, Stereo DSO, and R3D3.
Sequence | ATE (ORB-SLAM2 / Stereo DSO / R3D3) | RPE (ORB-SLAM2 / Stereo DSO / R3D3)
Uptown1 | 0.180 / 0.142 / 9.657 | 0.012 / 0.007 / 0.085
Uptown2 | 0.369 / 1.690 / 7.250 | 0.022 / 0.144 / 0.165
Warehouse1 | 0.124 / 0.019 / 6.935 | 0.047 / 0.005 / 0.154
Warehouse2 | (1.786) / (4.748) / 1.835 | (0.182) / (0.731) / 0.095
Forest1 | (0.116) / (0.517) / 3.615 | (0.152) / (0.185) / 0.048
Forest2 | 0.336 / 0.086 / 7.676 | 0.035 / 0.032 / 0.082
Parentheses ( ) denote an unexpected termination of the tracking process, with only valid frames considered for evaluation. Bold font highlights the optimal result, underlined font indicates the suboptimal result, and continuous, uninterrupted tracking is preferred over tracking that is disrupted.
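The trajectory metrics in Table 6 can be illustrated with the sketch below: ATE as the RMSE of positions after a rigid SVD-based alignment and RPE as the RMSE of frame-to-frame relative translations. Whether the evaluation uses exactly these variants (and which alignment) is an assumption.

```python
# Minimal sketch of ATE and RPE as used in Table 6 (assumed protocol).
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: (N, 3) arrays of associated camera positions."""
    est_c = est_xyz - est_xyz.mean(axis=0)
    gt_c = gt_xyz - gt_xyz.mean(axis=0)
    u, _, vt = np.linalg.svd(est_c.T @ gt_c)     # cross-covariance SVD (Kabsch)
    r = (u @ vt).T                               # rotation aligning est to gt; reflection check omitted
    aligned = est_c @ r.T + gt_xyz.mean(axis=0)
    return np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1)))

def rpe_rmse(est_xyz, gt_xyz, delta=1):
    """Relative pose error over a fixed frame offset (translation component only)."""
    d_est = est_xyz[delta:] - est_xyz[:-delta]
    d_gt = gt_xyz[delta:] - gt_xyz[:-delta]
    return np.sqrt(np.mean(np.sum((d_est - d_gt) ** 2, axis=1)))
```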
Table 7. Detailed configuration settings of the spherical ACES.
Camera | Focal Length (mm) | Image Plane Width (mm) | Image Resolution | Roll Angle (°) | Pitch Angle (°) | Radius (mm)
0 | 3.240 | 5.568 | 640 × 360 | 0.0 | 0.0 | 170.0
1 | | | | 0.0 | 56.2 |
2 | | | | 41.93 | 55.0 |
3 | | | | 88.87 | 53.5 |
4 | | | | 132.52 | 53.0 |
5 | | | | 183.82 | 53.0 |
6 | | | | 223.16 | 52.0 |
7 | | | | 268.42 | 54.5 |
8 | | | | 314.32 | 55.0 |
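As an illustration of how the spherical layout in Table 7 parameterizes camera poses, the sketch below converts a roll/pitch pair and the 170 mm radius into a rotation and a camera centre on the sphere. The rotation order and reference axis are assumptions, since the table does not specify them.

```python
# Minimal sketch (assumptions): one plausible mapping from the roll/pitch angles
# and radius of Table 7 to camera poses on the spherical ACES.
import numpy as np

def sphere_camera_pose(roll_deg, pitch_deg, radius_mm):
    roll, pitch = np.radians([roll_deg, pitch_deg])
    # rotate the +z "outward" axis by pitch about x, then roll about z (assumed order)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    r = rz @ rx
    position = r @ np.array([0.0, 0.0, radius_mm])   # camera centre on the sphere
    return r, position

r0, p0 = sphere_camera_pose(0.0, 0.0, 170.0)    # camera 0: centre at (0, 0, 170) mm
r1, p1 = sphere_camera_pose(0.0, 56.2, 170.0)   # camera 1
```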