Article

Cyber-Physical System for Terminal Infrastructure Monitoring: A Depth-Free Registration Framework via Geometric-Model Fusion

1 School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Second Research Institute of the Civil Aviation Administration of China, Chengdu 610041, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13079; https://doi.org/10.3390/app152413079
Submission received: 12 November 2025 / Revised: 1 December 2025 / Accepted: 3 December 2025 / Published: 11 December 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The monitoring and security of large-scale terminal infrastructures represent a critical application domain for industrial cyber-physical systems. However, real-time 3D visualization in such environments faces significant challenges from dense crowds, specular reflections, and complex architectural layouts. This paper presents a cyber-physical system for terminal infrastructure monitoring, underpinned by a novel, depth-free camera registration framework. At its core, the system establishes explicit geometric mappings across four coordinate systems (world, 3D model, camera, image), leveraging known installation parameters to eliminate dependency on depth sensors. Dynamic inconsistencies are resolved through a multi-stage layout refinement process, enabling robust operation under terminal-specific challenges. The framework maintains real-time performance at over 25 FPS when processing 16 concurrent video streams on commercial hardware. Extensive evaluations demonstrate a 44.9% reduction in registration error compared to state-of-the-art methods, validating the system’s practicality for enhancing situational awareness and security in large-scale, dynamic terminals.

1. Introduction

Dynamic 3D visualization has emerged as a cornerstone technology for augmented reality, intelligent surveillance, and smart infrastructure management [1,2,3]. By fusing real-time video streams with 3D models, this paradigm enables precise spatiotemporal localization of dynamic objects within a reconstructed scene. As illustrated in Figure 1, our proposed framework advances beyond simulated representations to achieve photorealistic dynamic registration in operational environments.
The fusion of video imagery with 3D models, a paradigm established in the early 2000s [4], has enabled widespread outdoor applications such as Google Street View and immersive sports broadcasts [5], as exemplified in Figure 2. Recent years have witnessed significant breakthroughs driven by deep learning, including hierarchical semantic decoupling for complex geometry generation [6], spatiotemporal attention mechanisms for large-scale synthesis [7], and integrated BEV-Gaussian frameworks for multi-view simulation [8]. Parallel efforts continue in geometric methods, such as linear feature-based architectural reconstruction [9] and highly adaptive scene matching algorithms [10].
However, despite these advancements, the translation of this technology to critical indoor environments—particularly airport terminals—presents a formidable and understudied challenge. Airport terminals introduce unique complexities that distinguish them from both outdoor scenes and conventional indoor spaces, including dense crowds, severe occlusions, and pervasive specular reflections. These challenges create a significant gap between the capabilities of current outdoor-oriented or simulation-heavy methods and the demands of robust, real-time terminal visualization.
Airport terminal environments present a set of distinct and compounded challenges that require specialized solutions:
  • Geometric Complexity: Modern indoor architecture frequently violates Manhattan world assumptions, featuring curved walls, non-orthogonal intersections, and irregular layouts that confuse traditional geometric methods [11,12,13].
  • Dynamic Elements: High pedestrian density creates severe, time-varying occlusions, while moving crowds and changing objects break the static scene assumptions used by most registration algorithms [14,15,16].
  • Optical Complications: Reflective surfaces (glass walls, polished floors, metallic fixtures) create specular reflections that mislead depth estimation and feature matching algorithms, causing systematic errors in registration [17,18].
  • Computational Constraints: Real-time applications require processing speeds that often exceed the capabilities of computationally intensive methods, creating a trade-off between accuracy and speed [14,19,20].
The field of indoor 3D registration continues to evolve, with innovations including reverse projection for real-time texture mapping [21], single-image reconstruction leveraging known dimensions [22], and spherical coordinate conversion of point clouds [23]. Further advances incorporate feature disentanglement through convolutional sparse coding [24,25,26,27] alongside deformation field estimation techniques [28,29]. Crucially, these methods universally suffer from dependency on expensive depth sensors or high-precision point clouds, coupled with computational complexity that precludes real-time performance in complex indoor environments.
Our framework addresses these terminal-specific challenges through a fundamental insight: managed airport spaces offer a powerful but underutilized geometric prior in the form of stable, pre-calibrated camera poses. We transform this prior into a robust, depth-free registration paradigm, specifically designed for terminal environments and effectively bypassing the limitations of depth sensors and unreliable feature matching. The key contributions of this work are as follows:
1. A Terminal-Tailored Multi-Coordinate Framework: We introduce a geometrically-grounded framework that explicitly links the world, 3D model, camera, and image coordinate systems. This formulation leverages known camera parameters to establish accurate spatial mappings specifically optimized for terminal environments, eliminating dependency on error-prone depth sensors.
2. A Multi-Stage Layout Refinement Algorithm: To resolve spatial inconsistencies and prevent cumulative drift, we propose a progressive optimization algorithm. It refines the initial layout projection through iterative hypothesis generation and validation at structural junctions, ensuring high geometric fidelity.
3. Demonstrated Real-Time Performance in Terminal Environments: We validate the framework’s practical efficacy by achieving real-time processing of 16 concurrent video streams at over 25 FPS on commercial hardware. The system maintains superior robustness in extreme scenarios characterized by dense crowds, heavy occlusions, and strong specular reflections.
4. Comprehensive Terminal Benchmarking and Deployment: Through extensive evaluation on a challenging airport terminal dataset, our method demonstrates a 44.9% reduction in registration error compared to state-of-the-art methods. The system’s practicality is further confirmed by its successful, long-term deployment in a live international airport environment.

2. Related Works and Motivation

2.1. Traditional Registration Methods

Early indoor registration methods predominantly relied on geometric primitives under Manhattan world assumptions, with seminal contributions including vanishing point estimation for layout inference [30] and depth classification via constrained Manhattan categories [31]. While this assumption continues to be utilized to simplify fundamental problem constraints [32], it is inherently violated by dynamic objects commonly found in airport terminals [33]. Classical approaches employing local feature descriptors such as SIFT, SURF [34], and ORB [35] have been extensively explored for image matching. Subsequent advancements, including semantic-geometric dual-constraint frameworks leveraging YOLO for dynamic object identification [36], aim to address these inherent limitations. However, significant challenges persist in terminal environments, including the failure of static scene assumptions, conflicts between computational efficiency and real-time requirements, and poor performance in low-texture regions characteristic of modern airport architecture.

2.2. SLAM and Real-Time Systems

State-of-the-art SLAM systems (e.g., ORB-SLAM2 [14], RTAB-Map [37]) suffer from fundamental limitations arising from their reliance on static-world assumptions, leading to catastrophic failures in dynamic terminal environments. In crowded airport settings, feature points on moving passengers induce erroneous motion parallax interpretation and subsequent tracking drift through false geometric constraints. Recent mitigation attempts—including DS-SLAM semantic segmentation [38], Mix-VPR for SURF descriptors [39], and dynamic feature elimination [40]—address specific aspects of this problem but incur prohibitive computational overhead. Crucially, these approaches remain vulnerable to the dense, non-linear motion patterns characteristic of crowded indoor spaces, limiting their practical deployment in operational airport terminals.

2.3. Deep Learning Approaches

Recent deep learning methods have significantly advanced indoor scene understanding capabilities. End-to-end layout estimation is achieved by approaches such as RoomNet [41] and Flat2-Layout [42]. Subsequent methods, including AtlantaNet [43], relax the strict Manhattan world assumption, while transformer-based models like RoomFormer [44] further enhance reconstruction accuracy. Modern networks (e.g., AdaBins [45], Plane-RCNN [46]) demonstrate strong capabilities in estimating depth and planar surfaces from single images. However, these methods remain susceptible to the inherent limitations of image formation processes, require extensive domain-specific annotated data for training, and exhibit limited generalization to unseen environments—particularly challenging airport terminals with unique architectural features and lighting conditions.

2.4. Motivation

Existing indoor layout estimation methods fundamentally fail in airport terminal scenarios. As evidenced in Figure 3, vanishing-point-based approaches [30,45,46] incorrectly classify multi-plane structures into single planes (green lines), exhibiting catastrophic failure in non-cuboid spaces. Moreover, spurious features are extracted from reflected regions, which causes serious errors in the layout estimation. Meanwhile, deep learning solutions (e.g., AtlantaNet [43] in Figure 4) suffer from specular interference, where depth maps mirror reflected scenes, causing erroneous plane merging.
Crucially, both paradigms require high-precision depth, an impractical constraint given terminals’ reflective surfaces. However, airport surveillance systems provide a unique advantage: known camera poses obtained through calibration. We thus propose a paradigm shift: leverage camera intrinsics and initial poses to project 3D layouts, bypassing error-prone depth estimation entirely. This enables robust registration through geometric reprojection without depth sensors.

3. Proposed Method

Our framework builds upon the coordinate transformation pipeline commonly employed in computer graphics systems, but extends it to address the unique requirements of real-world airport surveillance. This section first establishes four coordinate systems and computes their correspondences. It then estimates the 2D layout of the terminal scene. Finally, it introduces an optimization procedure that refines the coarse layout.

3.1. Method Overview

Figure 5 illustrates the framework’s core architecture, where four interdependent coordinate systems—world (reference frame), 3D model, camera, and image—are geometrically linked through precisely defined transformation matrices.
Building upon the established coordinate systems, we project 3D spatial annotations onto the 2D image plane to generate initial layout candidates. For each candidate, multiple geometric hypotheses are generated and scored to identify the optimal 2D layout. This refined layout directly guides image partitioning for texture analysis while ensuring spatial coherence.

3.2. Four-Coordinate System Construction

Figure 6 illustrates the coordinate system architecture. The world coordinate system $O_W$ serves as the reference, with its Z axis chosen perpendicular to the real-world ground. The model coordinate system $O_M$ is constructed with its origin aligned to that of $O_W$. The camera coordinate systems are $O_C = \{O_{C_i}\}$, $i = 1, 2, \ldots, n$, where $O_{C_i}$ is the coordinate system of camera $C_i$. The image coordinate systems are $O_I = \{O_{I_i}\}$, $i = 1, 2, \ldots, n$, where $O_{I_i}$ is the image coordinate system of the corresponding camera.
During model coordinate system construction, we define the origin to coincide with the world coordinate system origin. The 3D model is scaled to real-world dimensions via the parameter $s$, establishing the transformation:
$$s \cdot O_M = O_W$$
The position of camera $C_i$ in world coordinates is $O_W^{C_i} = (x_W^{C_i}, y_W^{C_i}, z_W^{C_i})$, where $(x_W^{C_i}, y_W^{C_i})$ is the position of $C_i$ on the floor plane of the terminal in the real world and $z_W^{C_i}$ is the height of $C_i$. The transformation between the camera coordinates of $C_i$ and the world coordinates can be represented as follows:
$$O_{C_i} = R_{C_i}^{W} \cdot O_W^{C_i} + t_{C_i}^{W}$$
where $R_{C_i}^{W}$ is the rotation matrix and $t_{C_i}^{W}$ is the displacement vector of the initial pose of camera $C_i$ in $O_W$. Both $R_{C_i}^{W}$ and $t_{C_i}^{W}$ are obtained by calibration. Because the selected calibration templates are perpendicular to the ground, only one 90° rotation around the X or Y axis is required with respect to the world coordinate system.
The final relationship, between $O_{I_i}$ and $O_{C_i}$, is determined by the camera intrinsic matrix $K_i$:
$$z \cdot O_{I_i} = K_i \cdot O_{C_i}$$
where $K_i$ is obtained by camera calibration and $z$ is the depth of the pixel. This establishes the correspondence between every coordinate system and the world coordinate system.
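The chain of three transformations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the intrinsic matrix, camera position, scale factor, and test point are all hypothetical values, and the translation is derived from the camera position with the usual pinhole convention $t = -R\,C$, which is an assumption not stated in the text.

```python
import numpy as np

def rotation_x(deg):
    """Rotation matrix for a rotation of `deg` degrees about the X axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def model_to_world(p_model, s):
    """s * O_M = O_W: scale model coordinates to real-world units."""
    return s * np.asarray(p_model, dtype=float)

def world_to_camera(p_world, R, t):
    """O_C = R * O_W + t."""
    return R @ np.asarray(p_world, dtype=float) + t

def camera_to_image(p_cam, K):
    """z * O_I = K * O_C: returns the pixel (u, v) and the depth z."""
    p = K @ p_cam
    z = p[2]
    return p[:2] / z, z

# Hypothetical calibration values, chosen only for illustration.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = rotation_x(90.0)                   # single 90-degree rotation about X
cam_pos = np.array([0.0, 0.0, 3.0])    # camera mounted 3 m above the floor
t = -R @ cam_pos                       # assumed convention: t = -R * C

p_world = model_to_world([0.5, 2.5, 0.5], s=2.0)
p_cam = world_to_camera(p_world, R, t)
pixel, depth = camera_to_image(p_cam, K)
print(pixel, depth)
```

With these values the world Z axis (up) maps to the negative image Y direction, so a point below the camera height lands below the image centre.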

3.3. Initial Layout Estimation

The layout of the world scene is denoted $L_W = \{(s_k, e_k)\}$, $k = 0, 1, 2, \ldots$, where $s_k$ and $e_k$ are the end points of line segment $k$. For a camera $C_i$, its layout structure in world coordinates is $L_W^{C_i} \subseteq L_W$, which satisfies the following condition:
$$L_W^{C_i} = \left\{ (s_k, e_k) \;\middle|\; 0 \le (s_k - t_i) \cdot r_i \le d_i \ \text{and} \ 0 \le (e_k - t_i) \cdot r_i \le d_i \right\}$$
where $d_i$ represents the farthest distance of the camera field of view, $t_i$ is the translation vector of camera $C_i$, and $r_i$ represents the direction of the camera’s central axis.
The corresponding 2D layout $l_i$ in the image coordinate system is then obtained by projecting $L_W^{C_i}$:
$$l_i = \frac{1}{z} \cdot K_i \cdot \left( R_{C_i}^{W} \cdot L_W^{C_i} + t_{C_i}^{W} \right)$$
where $z$ is the depth of the corresponding point in the camera coordinate system. This depth can be obtained from the CAD drawings.
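A minimal sketch of this selection and projection step is given below. It assumes that $t_i$ in the range condition can be read as the camera position in world coordinates (named `cam_pos` here), that $r_i$ is normalized to unit length, and that the segments, intrinsics, and pose are hypothetical values chosen for illustration.

```python
import numpy as np

def visible_segments(segments, cam_pos, axis, d_max):
    """Keep segments whose two end points both project onto the camera
    axis at a distance in [0, d_max]."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    kept = []
    for s_k, e_k in segments:
        ds = np.dot(np.asarray(s_k, dtype=float) - cam_pos, axis)
        de = np.dot(np.asarray(e_k, dtype=float) - cam_pos, axis)
        if 0.0 <= ds <= d_max and 0.0 <= de <= d_max:
            kept.append((s_k, e_k))
    return kept

def project_segments(segments, K, R, t):
    """Project every end point with l = (1/z) * K * (R * X + t)."""
    projected = []
    for s_k, e_k in segments:
        pts = []
        for X in (s_k, e_k):
            p = K @ (R @ np.asarray(X, dtype=float) + t)
            pts.append(p[:2] / p[2])
        projected.append(tuple(pts))
    return projected

# Hypothetical scene: three floor-level segments and one camera.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])        # 90-degree rotation about X
cam_pos = np.array([0.0, 0.0, 3.0])
t = -R @ cam_pos                       # assumed convention: t = -R * C

world_layout = [
    ((-2.0, 10.0, 0.0), (2.0, 10.0, 0.0)),   # inside the viewing range
    ((-2.0, 30.0, 0.0), (2.0, 30.0, 0.0)),   # beyond d_max
    ((0.0, -4.0, 0.0), (0.0, 4.0, 0.0)),     # starts behind the camera
]
local = visible_segments(world_layout, cam_pos, axis=(0.0, 1.0, 0.0), d_max=20.0)
lines = project_segments(local, K, R, t)
print(lines)
```

The range test only bounds the distance along the central axis; a full implementation would presumably also clip against the lateral field of view, which the condition above does not cover.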
In practical deployments, cameras frequently undergo pure rotational motion with negligible translation ($t = 0$), particularly during field-of-view adjustments. As shown in Figure 7, our 2D layout estimation leverages Kneip’s geometric constraints [47] to solve for the rotation matrix $\Delta R$ independently of translation. Given corresponding 3D point pairs $\{x_i, x_i'\}$ satisfying the Kneip constraints, $\Delta R$ is computed from:
$$x_i' \times \Delta R \, x_i = 0$$
When a pure rotation occurs, the layout is updated. The updated layout $l_i$ is:
$$l_i = \frac{1}{z} \cdot K_i \cdot \left( R_{C_i}^{W} \cdot \Delta R \cdot L_W^{C_i} + t_{C_i}^{W} \right).$$
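One simple way to recover a rotation that satisfies the cross-product constraint is a least-squares fit over the point pairs. The sketch below uses the standard SVD-based orthogonal Procrustes (Kabsch) solution as a stand-in; it is not Kneip's solver from [47], and the test rotation and direction vectors are hypothetical.

```python
import numpy as np

def estimate_rotation(x, x_prime):
    """Least-squares rotation dR with x_prime[i] ~ dR @ x[i],
    computed with the SVD-based Kabsch solution."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    H = x.T @ x_prime                      # 3x3 sum of x_i * x_i'^T
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T)) # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T

# Hypothetical ground-truth rotation: 10 degrees about the Z axis.
a = np.deg2rad(10.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])

x = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
x = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit direction vectors
x_prime = x @ R_true.T                              # x_i' = R_true * x_i

dR = estimate_rotation(x, x_prime)
residual = np.cross(x_prime, x @ dR.T)              # x_i' x (dR * x_i)
print(np.abs(residual).max())
```

At least two non-collinear direction pairs are needed for the fit to be well determined; with noisy correspondences the residual is small but no longer zero.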

3.4. Multi-Stage Layout Refinement

In 2D images, layout line segments represent edges where two image planes meet, exhibiting a consistent appearance on both sides. However, when projected into 3D space, this consistency is lost due to inconsistencies in the homography matrix arising from imprecise depth estimation. As shown in Figure 8, this manifests as misaligned edge pixels on different 3D planes—a direct consequence of reprojection errors in template meshes. To resolve this fundamental misalignment problem, we introduce a hypothesis-scoring mechanism that identifies the optimal layout solution through geometric consistency optimization.
Given an initial 2D layout estimation l i , we first extract line segment intersections as keypoints and generate multiple geometric hypotheses within their neighborhood. Each hypothesis undergoes a two-stage evaluation: computing per-line scores followed by aggregation into a direction/length-invariant descriptor vector. This scoring mechanism enables the selection of the locally optimal hypothesis, with the global layout constructed as the precise union of these locally optimized solutions, ensuring geometric consistency across the entire scene.
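The generate-and-select loop can be sketched as follows. This is an illustration under stated assumptions: hypotheses are taken to be integer pixel positions in a square window around the keypoint, lower scores are taken to be better (as stated for Table 1), and `toy_score` with its junction location is a hypothetical placeholder for the per-line scoring described in the text.

```python
import numpy as np

def generate_hypotheses(keypoint, kernel=11):
    """All integer pixel positions in a kernel x kernel window
    centred on the keypoint."""
    u0, v0 = keypoint
    r = kernel // 2
    return [(u0 + du, v0 + dv)
            for dv in range(-r, r + 1)
            for du in range(-r, r + 1)]

def select_best(hypotheses, score_fn):
    """Return the hypothesis with the lowest score, and that score."""
    scores = [score_fn(h) for h in hypotheses]
    best = int(np.argmin(scores))
    return hypotheses[best], scores[best]

# Hypothetical stand-in for the real scoring: distance to a junction
# that is assumed to lie a few pixels away from the projected keypoint.
true_junction = np.array([1203, 368])

def toy_score(h):
    return float(np.linalg.norm(np.array(h) - true_junction))

hyps = generate_hypotheses((1200, 370), kernel=11)
best, best_score = select_best(hyps, toy_score)
print(len(hyps), best, best_score)
```

An exhaustive search over the window is shown only for clarity; its cost grows with the square of the kernel size, which is the trade-off behind the kernel-size choice.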
For a line segment comprising $n$ pixels, we construct scale-invariant regions along its axis as blocks of fixed dimensions ($h \times w$), illustrated in Figure 9. The number of regions per segment is $2n/w$, where the factor 2 accounts for bidirectional sampling perpendicular to the segment orientation.
For each region, we use Harris corners to describe its pixels and eigenvalues to represent the geometric properties of the Harris corner features. We accumulate the Harris parameters to form the feature of the region:
$$V_{ij} = \left[ V_{ij}^{R}, V_{ij}^{L}, V_{ij}^{U}, V_{ij}^{D} \right]^{T}$$
$$V_{ij}^{R} = \sum_{\lambda_1 > 0} \lambda_1, \quad V_{ij}^{L} = \sum_{\lambda_1 < 0} \lambda_1, \quad V_{ij}^{U} = \sum_{\lambda_2 > 0} \lambda_2, \quad V_{ij}^{D} = \sum_{\lambda_2 < 0} \lambda_2$$
where $V_{ij}^{k}$ is the accumulated geometric property of the Harris corner features in region $(i, j)$, $k \in \{R, L, U, D\}$ indexes the four directions of that region, and $\lambda_1$ and $\lambda_2$ represent the gradient of a pixel in two orthogonal directions.
The similarity of line segments is then measured by the Euclidean distance $d$, and the score is computed as the ratio of similar regions to the total number of regions, where the threshold $t_h$ is set to 0.8 empirically:
$$s = \frac{w}{2n} \sum_{i=1}^{w} \operatorname{sgn}(d - t_h)$$
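The region descriptor and the score can be sketched as below. Several points are assumptions rather than details from the text: the per-pixel values $\lambda_1$ and $\lambda_2$ are passed in as plain arrays, the per-region distances $d$ are assumed to be computed elsewhere, and the sum is taken over all supplied regions. The numbers in the example are hypothetical.

```python
import numpy as np

def region_feature(lam1, lam2):
    """V = [V_R, V_L, V_U, V_D]: sums of the positive and negative
    parts of lambda_1 and lambda_2 over the pixels of one region."""
    lam1 = np.asarray(lam1, dtype=float)
    lam2 = np.asarray(lam2, dtype=float)
    return np.array([lam1[lam1 > 0].sum(),
                     lam1[lam1 < 0].sum(),
                     lam2[lam2 > 0].sum(),
                     lam2[lam2 < 0].sum()])

def segment_score(distances, n, w, th=0.8):
    """s = (w / 2n) * sum(sgn(d - th)) over the supplied region distances.
    Regions with d < th contribute -1, so lower scores mean more
    similar regions."""
    d = np.asarray(distances, dtype=float)
    return float((w / (2.0 * n)) * np.sign(d - th).sum())

# Hypothetical per-pixel values for one region.
V = region_feature(lam1=[1.0, -2.0, 3.0], lam2=[-1.0, -1.0, 4.0])

# Hypothetical segment: n = 8 pixels, w = 4, hence 2n/w = 4 regions.
score = segment_score([0.1, 0.2, 0.9, 0.3], n=8, w=4)
print(V, score)
```

Three of the four regions fall below the threshold, giving a negative score, which matches the convention that lower scores indicate better geometric consistency.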

3.5. Real-Time Registration

Our texture mapping pipeline projects 2D video textures onto the 3D model through a sequential process beginning with 3D layout recovery via inverse projection:
$$L_{M_i} = \frac{1}{s} \left( R_{C_i}^{W} \right)^{-1} \left( K_i^{-1} \, Z \cdot l_i - t_{C_i}^{W} \right)$$
where $L_{M_i}$ is the recovered 3D layout, $s$ the scaling factor, $R_{C_i}^{W}$ the camera rotation, $K_i$ the intrinsics, $Z$ the depth, and $t_{C_i}^{W}$ the translation. The layout boundaries $L_{M_i}$ then segment source images into distinct texture regions, which are finally mapped onto corresponding 3D model faces to generate photorealistic visualizations with guaranteed geometric consistency.
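A quick round-trip check makes the inverse projection concrete: projecting a model point forward and then applying the inverse with the known depth should return the original point. The sketch below assumes hypothetical intrinsics, pose, and scale, and treats the image point as a homogeneous pixel $(u, v, 1)$.

```python
import numpy as np

def project(p_model, K, R, t, s):
    """Forward path: model -> world -> camera -> image."""
    p = K @ (R @ (s * np.asarray(p_model, dtype=float)) + t)
    z = p[2]
    return p[:2] / z, z

def back_project(pixel, Z, K, R, t, s):
    """Inverse path: L_M = (1/s) * R^-1 * (K^-1 * Z * l - t)."""
    l_h = np.array([pixel[0], pixel[1], 1.0])   # homogeneous pixel
    return (1.0 / s) * np.linalg.inv(R) @ (np.linalg.inv(K) @ (Z * l_h) - t)

# Hypothetical calibration values.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
t = np.array([0.0, 3.0, 0.0])
s = 2.0

p_model = np.array([0.5, 2.5, 0.5])
pixel, Z = project(p_model, K, R, t, s)
recovered = back_project(pixel, Z, K, R, t, s)
print(pixel, Z, recovered)
```

The recovery is exact only when the depth is known, which is why the method takes it from the CAD model rather than estimating it from the image.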

3.6. General Applicability

Although the proposed framework is specifically designed and validated in the context of airport terminals, its underlying principles are generally applicable to a broad class of managed indoor spaces with similar characteristics. Such spaces include, but are not limited to, shopping malls, transportation hubs (e.g., train stations and subway stations), and large office buildings. These environments typically share the following key features that our method leverages:
  • Stable camera installations: Cameras are fixed, and their positions can be pre-surveyed, providing reliable geometric priors.
  • Structural layout consistency: The 3D model of the environment is often available or can be constructed, and the layout remains relatively static over time.
  • Similar challenges: They exhibit dynamic occlusions from high pedestrian traffic, reflective surfaces, and complex lighting conditions.
By capitalizing on these common characteristics, our method can be seamlessly adapted to these environments without fundamental modifications.

4. Experiment

4.1. Data

The proposed framework was evaluated on the Airport Terminal Collection, a dedicated dataset comprising 60 min of HD video (1920 × 1080 at 30 fps) from 25 calibrated cameras in an operational terminal. The dataset captures challenging scenarios, including dense crowds, specular reflections, and dynamic lighting, in environments such as corridors and security checkpoints. To enhance evaluation transparency and reproducibility, Figure 10 shows representative visual samples: synchronized frames from three strategically selected cameras covering distinct terminal areas.
Given the method’s requirement for known camera parameters and 3D models—unavailable in standard public benchmarks—this focused evaluation provides appropriate validation for the target application domain.

4.2. Evaluation Metrics

This carefully curated collection provides the first benchmark for evaluating registration methods in complex terminal environments. Performance is assessed along three dimensions:
  • Registration Accuracy: Assessed via the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), measured in pixels. The ground truth is defined by the precise projection of the authoritative 3D CAD model onto the image plane using the calibrated camera parameters, ensuring a geometrically meaningful and reproducible benchmark.
  • Computational Efficiency: Evaluated based on the processing time per frame (milliseconds) and throughput (frames per second, FPS).
  • Robustness: Quantified by measuring performance degradation under challenging conditions, including increasing crowd density, dynamic lighting variations, and the presence of reflective surfaces [10,48].
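For reference, the two accuracy metrics can be computed as in the sketch below. It assumes that the per-point error is the Euclidean pixel distance between a projected layout point and its ground-truth counterpart; the text does not spell out this detail, and the sample points are hypothetical.

```python
import numpy as np

def registration_errors(pred, gt):
    """Per-point Euclidean pixel distance between predicted and
    ground-truth image points (both of shape N x 2)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=1)

def rmse(errors):
    """Root mean square error."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error."""
    return float(np.mean(np.abs(errors)))

# Hypothetical layout points, in pixels.
gt = [[100.0, 200.0], [400.0, 300.0]]
pred = [[100.0, 200.0], [403.0, 304.0]]

e = registration_errors(pred, gt)
print(rmse(e), mae(e))
```

RMSE weights large deviations more heavily than MAE, so reporting both helps separate a few gross misregistrations from a uniform small offset.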

4.3. Implementation Details

Experimental Environment. All experiments were conducted on a consistent hardware and software platform. The hardware configuration consisted of an Intel Core i7-4770K CPU (Intel Corporation, Santa Clara, CA, USA) operating at 3.5 GHz with 8 cores, an NVIDIA T4 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 16 GB of VRAM (used to accelerate the rendering pipeline and for inference with deep learning baselines), and a 1 TB NVMe SSD for fast dataset access. The software environment was based on Ubuntu 20.04 LTS, with computations accelerated using CUDA 11.4 and cuDNN 8.2.
3D Model and Camera Pose Acquisition. The 3D scene model was constructed from official CAD drawings provided by the airport authority. Furthermore, the camera intrinsic matrices and initial pose parameters are obtainable, with the latter registered from the terminal’s GIS map. Due to the sensitive nature of the airport’s full layout, a localized 3D model was utilized for our experiments. This approach successfully validates the feasibility of our proposed method while adhering to privacy and security protocols.
In scenarios where camera installation parameters were not previously acquired, a multi-step calibration procedure was implemented using a high-accuracy checkerboard pattern following established computer vision protocols. The calibration process introduces uncertainties primarily in two domains: initial camera pose estimation and lens distortion coefficients. The quality of calibration was rigorously quantified through re-projection error analysis, yielding an average error of 0.15 pixels in our experimental setup—a metric that represents the practical uncertainty level of our system. For comprehensive distortion correction, the Brown-Conrady model was employed to effectively characterize both radial and tangential distortion components. Notably, the integrated multi-stage layout refinement algorithm demonstrates inherent robustness to minor residual errors that persist post-calibration. This architectural consideration ensures sustained registration accuracy under operational conditions where perfect calibration is often unattainable.
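The Brown-Conrady model mentioned above maps ideal normalized image coordinates to distorted ones through radial and tangential terms. The sketch below follows the coefficient ordering commonly used in OpenCV-style calibration ($k_1, k_2, k_3$ radial; $p_1, p_2$ tangential); the coefficient values are hypothetical and are not those of the deployed cameras.

```python
import numpy as np

def distort(xy, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Apply Brown-Conrady distortion to normalized image coordinates.

    xy: array of shape (N, 2) or (2,), in normalized camera coordinates.
    """
    xy = np.atleast_2d(np.asarray(xy, dtype=float))
    x, y = xy[:, 0], xy[:, 1]
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return np.stack([x_d, y_d], axis=1)

# Hypothetical coefficients for illustration.
ideal = np.array([[0.5, 0.0]])
distorted = distort(ideal, k1=0.1, p1=0.01)
unchanged = distort(ideal)
print(distorted, unchanged)
```

Correcting an image requires the inverse mapping, which has no closed form and is usually solved iteratively or through a precomputed lookup table.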
From a Single Image to 3D Layout: An Estimation Pipeline. To better elucidate our workflow, this section provides a comprehensive top-down walkthrough of the experimental setup. Figure 11 shows the selected test scene monitored by two surveillance cameras within the terminal environment. We focus on camera i with known intrinsic parameters and initial pose.
Figure 12 visualizes the positioning of camera i within the 3D model reconstructed from CAD drawings, configured with a 60° field of view and 1920 × 1080 resolution.
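Given a field of view and a resolution, an ideal pinhole intrinsic matrix follows directly from $f = (W/2) / \tan(\mathrm{FOV}/2)$. The sketch below assumes that the stated field of view is horizontal, that pixels are square, and that the principal point lies at the image centre; none of these is confirmed by the text, and a calibrated $K_i$ would replace this estimate in practice.

```python
import numpy as np

def intrinsics_from_fov(width, height, hfov_deg):
    """Ideal pinhole intrinsics from image size and horizontal FOV."""
    f = (width / 2.0) / np.tan(np.deg2rad(hfov_deg) / 2.0)
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

K = intrinsics_from_fov(1920, 1080, 60.0)
print(K)
```

For a 1920-pixel-wide image and a 60° horizontal field of view this gives a focal length of roughly 1663 pixels.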
The coordinate transformation process follows three key steps, illustrated in the figures: (1) world-to-camera transformation, which determines the camera pose in the world coordinate system; (2) 3D layout extraction, which extracts the local 3D layout from the camera’s spatial neighborhood; and (3) projection to the image plane, which projects the 3D layout onto the 2D image plane.
Figure 13 shows the resulting 3D layout for camera i, with structural elements representing line segments from the 3D model. Figure 14 provides an overhead view of this local 3D layout, clearly displaying spatial relationships. This integrated visualization demonstrates how our four-coordinate system functions as a unified spatial representation framework.
Data types. Our framework was implemented in C++ with CUDA acceleration for the rendering pipeline. All floating-point computations were performed using 32-bit single-precision arithmetic by default. This choice is made to achieve real-time performance when processing multiple high-resolution video streams. The use of single-precision floats strikes a balance between numerical precision and the computational efficiency required by our target application.

4.4. Analysis of the Layout Refinement Process

This section provides a detailed analysis of the proposed multi-stage layout refinement algorithm, demonstrating its effectiveness in resolving the edge inconsistencies inherent in the initial geometric projection (Figure 15).
Optimization Procedure and Convergence. The hypothesis scoring mechanism is demonstrated using a representative layout centered at the key point (1200, 370). A subset of the corresponding layout hypotheses and their evaluation scores is presented in Table 1, where lower scores indicate superior geometric consistency. The optimal layout configuration is determined via gradient descent, initialized from point S and converging to a local minimum. In our experiments, the optimization rarely stalled at poor local minima or saddle points, and only rare instances required minor parameter adjustments. The optimization trajectory, visualized in the 3D error space in Figure 16, illustrates the smooth and stable convergence of the refinement process and supports the use of gradient descent for this step.
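The descent over the error surface can be illustrated with a small numerical example. Everything specific here is hypothetical: the quadratic `toy_error` stands in for the real hypothesis score, its minimum location is invented, and the step size and iteration count are arbitrary. Only the start point (1200, 370) is taken from the text.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-3):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g

def gradient_descent(f, start, lr=0.1, steps=200):
    """Plain gradient descent; returns the final point and the path."""
    x = np.asarray(start, dtype=float).copy()
    path = [x.copy()]
    for _ in range(steps):
        x = x - lr * numerical_gradient(f, x)
        path.append(x.copy())
    return x, np.array(path)

# Hypothetical smooth error surface with its minimum at (1203, 368).
def toy_error(x):
    return float((x[0] - 1203.0) ** 2 + 2.0 * (x[1] - 368.0) ** 2)

x_best, path = gradient_descent(toy_error, start=(1200.0, 370.0))
print(x_best, toy_error(x_best))
```

On a convex surface like this one the descent converges from any start; the real score surface may have several minima, in which case the result depends on the initialization.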
Parameter Selection. The kernel size used in hypothesis generation presents a trade-off between refinement quality and computational cost. We evaluated this trade-off empirically, with results summarized in Figure 17. The analysis shows that performance stabilizes at a kernel size of 11 × 11, which was subsequently adopted for all experiments to balance accuracy and efficiency.

4.5. Experiment Result

Performance of the proposed method. This section presents quantitative validation of our framework’s performance. Table 2 reports the latency of all critical stages: real-time processing and rendering (including GPU data upload and texture mapping) and layout inference (initial estimation and optimal refinement). The results demonstrate consistent compliance with real-time constraints (25+ FPS at 16-stream loads), validating the practical viability of the framework for dynamic scene applications.
As demonstrated in Figure 18, our method achieves robust performance across diverse experimental scenarios: Scenario 1 (conventional environments) and Scenario 2 (high dynamic range conditions). Quantitative results confirm superior layout estimation accuracy, with an RMSE of 10.3 ± 2.2 px and an MAE of 9.8 ± 1.8 px, while maintaining real-time layout inference at 46.5 FPS. This consistently high performance validates the method’s adaptability to varying environmental constraints.
A six-month deployment in an international airport using 16 synchronized cameras confirmed the system’s viability (Figure 19). This implementation yielded significant benefits: 60% hardware cost reduction through depth sensor elimination and maintenance simplification from stable camera calibration requiring no frequent recalibration.
Analysis of numerical precision and computational efficiency. To quantify the impact of data types on speed and accuracy, we conducted a controlled experiment comparing 32-bit single-precision and 64-bit double-precision floating-point representations. The results, summarized in Table 3, reveal the inherent trade-off between registration accuracy and computational performance.
As evidenced by the data, employing double precision yields only a negligible improvement in registration accuracy (a reduction of only 0.2 px in RMSE). However, this comes at a significant computational cost: a 62% increase in processing time per frame and a 75% increase in memory consumption. The minimal gain in accuracy does not justify the substantial performance penalty for a real-time system. Therefore, single-precision floating-point arithmetic is the appropriate choice for our application, ensuring both high frame rates and sufficient geometric accuracy.
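The effect of floating-point width on a single projection can be checked directly. The sketch below projects one hypothetical point in both precisions and compares the resulting pixel coordinates; it isolates rounding in the projection alone and does not reproduce the full-pipeline figures of Table 3.

```python
import numpy as np

def project(X, K, R, t, dtype):
    """Project a world point with all arithmetic carried out in `dtype`."""
    K, R, t, X = (np.asarray(a, dtype=dtype) for a in (K, R, t, X))
    p = K @ (R @ X + t)
    return p[:2] / p[2]

# Hypothetical calibration and test point.
K = [[1000.0, 0.0, 960.0],
     [0.0, 1000.0, 540.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
t = [0.0, 3.0, 0.0]
X = [1.2345, 7.891, 0.456]

p32 = project(X, K, R, t, np.float32)
p64 = project(X, K, R, t, np.float64)
diff = float(np.max(np.abs(p32.astype(np.float64) - p64)))
print(p32, p64, diff)
```

For pixel coordinates on the order of a thousand, single-precision rounding in this one step stays far below a hundredth of a pixel, well under the registration errors reported for the overall system.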
Robustness to calibration. To evaluate the framework’s dependence on calibration accuracy—a critical consideration for practical deployment—we conducted a systematic analysis under controlled parameter perturbations. This investigation serves as a proxy for assessing performance in scenarios where calibration may be suboptimal.
Our experimental approach introduced graded errors along three parameter dimensions: positional displacement (along the X, Y, and Z axes), orientation deviation (in pitch, yaw, and roll), and focal length inaccuracy. The results, summarized in Table 4, indicate that the framework maintains functional performance even under substantial parameter errors, exhibiting graceful degradation rather than catastrophic failure.
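A perturbation study of this kind can be sketched by projecting the same points with reference and perturbed parameters and measuring the pixel displacement. The example below varies only the focal length; the calibration values, test point, and error levels are hypothetical and do not reproduce the settings behind Table 4.

```python
import numpy as np

def project(points, K, R, t):
    """Project N x 3 world points to N x 2 pixels."""
    cam = points @ R.T + t
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

def mean_pixel_error(points, ref, pert):
    """Mean pixel displacement between a reference calibration and a
    perturbed one; ref and pert are (K, R, t) tuples."""
    a = project(points, *ref)
    b = project(points, *pert)
    return float(np.mean(np.linalg.norm(b - a, axis=1)))

def scale_focal(K, rel_err):
    """Return a copy of K with both focal lengths scaled by 1 + rel_err."""
    K2 = K.copy()
    K2[0, 0] *= 1.0 + rel_err
    K2[1, 1] *= 1.0 + rel_err
    return K2

# Hypothetical reference calibration and a single test point.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
t = np.array([0.0, 3.0, 0.0])
points = np.array([[1.0, 5.0, 1.0]])

errors = [mean_pixel_error(points, (K, R, t), (scale_focal(K, e), R, t))
          for e in (0.0, 0.01, 0.02)]
print(errors)
```

Position and orientation errors can be injected the same way by perturbing `t` or composing `R` with a small rotation before projecting.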
The multi-stage refinement algorithm exhibits robust tolerance to calibration inaccuracies through iterative geometric validation. This ensures practical deployment viability while maintaining the fundamental requirement for initial parameter estimates.

5. Experimental Results and Comparison

This section presents a comprehensive comparative analysis between the proposed method and a range of state-of-the-art approaches. In addition to established methods, including the feature-based ORB-SLAM2 [14] and LSD-SLAM, the deep learning-based PlaneRCNN [46] and RoomNet [41], and the RGB-D system RTAB-Map [37], we have expanded our evaluation to include recently published methods, namely depth consistency loss (DCL) [49] and single-image reconstruction (SIR) leveraging known dimensions [22]. All comparative methods were evaluated under standardized hardware and software conditions. Each baseline underwent rigorous parameter optimization specific to our airport terminal dataset, with reported results averaged over multiple runs to ensure statistical reliability and fairness in comparison.

5.1. Quantitative Results

Registration accuracy is quantified as the pixel-level reprojection error between the projections of 3D structures from the estimated camera poses and the ground truth. Results in Table 5 derive from scenes with regular layouts. All baselines failed in challenging scenarios (e.g., complex geometries or severe reflections), precluding their inclusion in quantitative comparison.
Traditional SLAM methods, including ORB-SLAM2 and LSD-SLAM, failed completely due to the pure rotational motion in the scenes—a degenerate case for parallax-dependent depth estimation. The RGB-D method RTAB-Map established a reasonable performance baseline. Among learning-based approaches, RoomNet suffered from a significant domain gap when transitioning from cuboid rooms to large open spaces, while PlaneRCNN and DCL achieved the strongest baseline performance but degraded in environments with long corridors and large glass facades.
Our framework overcomes these limitations by relying on architectural priors rather than data-driven assumptions. In Table 5, it reduces RMSE by 44.9% relative to PlaneRCNN (18.7 px to 10.3 px) and by 27.5% relative to the lowest-error baseline, DCL (14.2 px). It is also more accurate than the RGB-D baseline RTAB-Map (16.5 px) without using depth sensors, and it attains the lowest error among the evaluated methods on our airport terminal dataset.
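The relative reductions can be recomputed from the RMSE column of Table 5. The sketch below uses only the mean RMSE values reported there; the dictionary and function names are illustrative.

```python
# Mean RMSE values (px) copied from Table 5.
baseline_rmse_px = {
    "PlaneRCNN": 18.7,
    "RoomNet": 31.5,
    "RTAB-Map (RGB-D)": 16.5,
    "DCL": 14.2,
    "SIR": 19.5,
}
ours_rmse_px = 10.3

def relative_reduction(baseline, ours):
    """Percentage reduction of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

reductions = {name: round(relative_reduction(v, ours_rmse_px), 1)
              for name, v in baseline_rmse_px.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% lower RMSE")
```

The 44.9% figure corresponds to the PlaneRCNN row; the reduction against the other baselines differs because their RMSE values differ.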

5.2. Robustness Results

Our framework remains stable in challenging conditions where conventional methods fail. Its accuracy advantage of more than 44.9% over deep learning approaches stems from its minimal sensitivity to dynamic objects. The model-driven design remains stable under severe specular reflections, where depth estimation fails, and under dynamic lighting and textureless conditions, where sensor-dependent techniques degrade.
The visualization pipeline (layout calculation, texture mapping, rendering) scales well with the number of streams: processing latency is approximately 3 ms (about 330 FPS) for a single stream, 8 ms (131.6 FPS) for 4 streams, 17 ms (58.6 FPS) for 8 concurrent streams, and 40 ms (25.4 FPS) for 16 streams (Table 6). This enables high-density surveillance at a per-frame latency of about 40 ms.
Our geometry-driven approach also generalizes across domains, showing lower performance degradation when deployed in unseen environments such as shopping malls, offices, and metro stations after airport-specific training. This contrasts sharply with deep learning counterparts, which suffer accuracy drops of more than 44.9% due to domain shift. The key advantage is that the method does not overfit to appearance, so it can be deployed immediately, without retraining, in new managed spaces for which architectural models are available.
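The relation between per-frame latency and throughput is simply FPS = 1000 / latency (ms). The sketch below applies it to the approximate latencies in Table 6. The implied values differ slightly from the reported throughput, presumably because the tabulated latencies are rounded; that explanation is an assumption, not something stated in the table.

```python
def fps_from_latency_ms(latency_ms):
    """Throughput implied by a per-frame processing latency in milliseconds."""
    return 1000.0 / latency_ms

# (streams, approximate latency in ms, reported FPS) copied from Table 6.
table6 = [(1, 3, 330.0), (4, 8, 131.6), (8, 17, 58.6), (16, 40, 25.4)]

for streams, latency_ms, reported_fps in table6:
    implied = fps_from_latency_ms(latency_ms)
    print(f"{streams:>2} streams: {latency_ms} ms -> "
          f"{implied:.1f} FPS implied, {reported_fps} FPS reported")
```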

5.3. Visualized Results

For a visual comparison, we juxtapose our method with both a state-of-the-art deep learning approach and a traditional method across diverse scenarios. As shown in Figure 20, the comparison spans normal enclosed spaces (scenes 1 and 3), a high-dynamic-range environment (scene 2), and a scene with prominent reflective surfaces (scene 4). The results reveal fundamental limitations of existing approaches: the traditional LSD-based method with vanishing point estimation fails in open-plan corridors because of the architectural layout, while the modern deep learning pipeline (combining MSeg segmentation [50], DCL [49], and GeoLayout aggregation [51]) degrades significantly under highlights and reflections. In contrast, our method maintains visual accuracy and geometric coherence across all four scenes.
The accuracy of layout estimation critically determines 3D visualization quality. As shown in Figures 20–22, traditional and deep learning methods produce severe artifacts due to geometric estimation errors, resulting in distorted registrations. In contrast, our framework maintains spatial coherence and a low reprojection error (Table 5) across all tested conditions.

6. Ablation Experiment

To validate the contribution of each component in our framework, we conducted a series of ablation studies and evaluated them both qualitatively and quantitatively. The results clearly demonstrate the incremental improvements brought by our core innovations.
The baseline configuration employs only the Geometric Projection (GP) component, which directly projects the 3D model onto the image plane using calibrated camera parameters. While computationally efficient, this approach suffers from noticeable misalignment due to calibration inaccuracies and scene dynamics.
We then introduce the Coordinate Framework (CF), which establishes explicit geometric linkages across the four coordinate systems (world, model, camera, and image). This provides a more robust spatial reasoning foundation, reducing error propagation.
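The chain of coordinate systems can be sketched as a sequence of small transforms from the 3D model to pixel coordinates. This is a minimal pinhole-camera sketch under stated assumptions, not the framework's implementation: the function names, the similarity transform between model and world frames, and all numeric installation parameters are hypothetical.

```python
import numpy as np

def model_to_world(X_m, scale, R_mw, t_mw):
    """Similarity transform from the 3D model frame to the world frame."""
    return scale * (R_mw @ np.asarray(X_m, dtype=float)) + t_mw

def world_to_camera(X_w, R_wc, C_w):
    """Rigid transform; C_w is the camera centre in world coordinates."""
    return R_wc @ (X_w - C_w)

def camera_to_image(X_c, focal_m):
    """Perspective division onto the image plane (metric units)."""
    return focal_m * X_c[:2] / X_c[2]

def image_to_pixel(x_img, pixel_size_m, principal_px):
    """Convert metric image-plane coordinates to pixel coordinates (u, v)."""
    return x_img / pixel_size_m + principal_px

# Hypothetical installation parameters (illustrative values only).
R_identity = np.eye(3)
t_mw = np.array([1.0, 2.0, 0.0])         # model origin in the world frame
C_w = np.array([1.0, 2.0, 0.0])          # camera centre in the world frame
focal_m = 0.004                          # 4 mm lens
pixel_size_m = 2e-6                      # 2 micrometre pixels
principal_px = np.array([960.0, 540.0])  # centre of a 1920x1080 frame

X_m = np.array([0.5, 0.25, 5.0])         # a vertex of the 3D model
X_w = model_to_world(X_m, 1.0, R_identity, t_mw)
X_c = world_to_camera(X_w, R_identity, C_w)
uv = image_to_pixel(camera_to_image(X_c, focal_m), pixel_size_m, principal_px)
print(uv)
```

Keeping each transform explicit makes it possible to check intermediate results frame by frame, which is where calibration errors would otherwise propagate unnoticed.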
The complete system integrates Multi-stage Refinement (MF), which iteratively optimizes the layout through hypothesis generation and gradient descent-based selection. This component specifically addresses the residual errors that remain after coordinate transformation.
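The selection step can be illustrated on the hypothesis scores listed in Table 1. The sketch below uses a discrete steepest-descent walk over the scored grid as a simple stand-in for the gradient descent-based selection; the function name `greedy_descent`, the start cell, and the 8-neighbour step rule are assumptions. The row and column labels are copied from Table 1, whose header does not state which axis is u and which is v, so they are kept as plain labels.

```python
import numpy as np

# Row labels, column labels, and scores copied from Table 1 (lower is better).
row_labels = [390, 380, 375, 370, 365, 360, 350]
col_labels = [1250, 1200, 1170, 1150, 1130, 1110]
scores = np.array([
    [0.4595, 0.4424, 0.4456, 0.4547, 0.4977, 0.4781],
    [0.4848, 0.3688, 0.3187, 0.3031, 0.2091, 0.1960],
    [0.5922, 0.4417, 0.3382, 0.2313, 0.1668, 0.1322],
    [0.5387, 0.5388, 0.4039, 0.2970, 0.2293, 0.2032],
    [0.5726, 0.5237, 0.3660, 0.3497, 0.3627, 0.3510],
    [0.6371, 0.5315, 0.3822, 0.4356, 0.3549, 0.3855],
    [0.5888, 0.4070, 0.4337, 0.4422, 0.5015, 0.4996],
])

def greedy_descent(scores, start):
    """Move to the lowest-scoring 8-neighbour until no neighbour improves."""
    r, c = start
    while True:
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < scores.shape[0] and 0 <= nc < scores.shape[1]
                if inside and scores[nr, nc] < scores[best]:
                    best = (nr, nc)
        if best == (r, c):
            return r, c
        r, c = best

r, c = greedy_descent(scores, start=(0, 0))
best_hypothesis = (row_labels[r], col_labels[c], float(scores[r, c]))
print(best_hypothesis)
```

On this grid the walk ends at the cell with the lowest score in the table; a descent of this kind can stop in a local minimum on less regular score surfaces, which is one reason several hypotheses are generated.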
The quantitative impact of each component is summarized in Table 7. Introducing CF lowers the geometric consistency score from 0.451 to 0.362, a 19.7% improvement over the GP baseline. Adding MF lowers the score further to 0.184, which is a 49.2% improvement over the GP + CF configuration and an overall improvement of 59.2% over the GP baseline, resolving most spatial inconsistencies. The complete system runs at 21.5 ms per frame and therefore remains real-time.
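The improvement percentages follow directly from the scores in Table 7. The worked arithmetic below uses only those three values; the third line, which is relative to the GP + CF configuration, is derived here and is not a column of the table.

```latex
\begin{align}
\text{GP} \rightarrow \text{GP + CF}: \quad & \frac{0.451 - 0.362}{0.451} \approx 19.7\% \\
\text{GP} \rightarrow \text{GP + CF + MF}: \quad & \frac{0.451 - 0.184}{0.451} \approx 59.2\% \\
\text{GP + CF} \rightarrow \text{GP + CF + MF}: \quad & \frac{0.362 - 0.184}{0.362} \approx 49.2\%
\end{align}
```

Both tabulated percentages are measured against the GP baseline, so they are not additive.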

7. Discussion and Limitations

Despite its strong performance, our framework has several limitations that define its current scope of applicability. The primary constraint is its reliance on known camera installation parameters, which restricts its use to managed environments where such information is available. Although the method is generally robust, its performance degrades under extreme changes, such as a large rotational displacement of the camera between successive frames, which can lead to registration failure. Furthermore, the framework depends on the availability of 3D architectural models, although these are becoming increasingly standard in modern building information modeling (BIM) systems [52,53]. Consequently, our method is best suited to managed indoor environments such as airports, shopping centers, and office buildings, particularly for surveillance, monitoring, and infrastructure-aware augmented reality or navigation systems.

8. Conclusions

This paper introduces a coordinate-system-based framework for real-time 3D registration in dynamic, complex indoor environments. By establishing a rigorous geometric linkage across four coordinate systems and leveraging the often-overlooked prior of known camera poses, our method reduces registration error (RMSE) by 44.9% relative to the PlaneRCNN baseline while eliminating the need for depth sensors. The proposed solution integrates multi-stage geometric refinement for robustness against calibration errors and environmental dynamics, delivers real-time performance at over 25 FPS across 16 concurrent video streams, and has been validated through a six-month deployment in a live airport terminal. While the current implementation requires pre-surveyed camera parameters and a 3D model, it presents a practical and robust solution for large-scale surveillance and augmented reality in managed indoor spaces. Future work will focus on relaxing these prerequisites by exploring autonomous camera calibration and improving resilience to more extreme motion dynamics.

Author Contributions

W.D.: conceptualization, methodology, data curation, writing—original draft preparation and validation; J.C.: methodology, conceptualization; Q.L.: writing—reviewing and editing; C.W.: software, validation; M.L.: writing—editing. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the NNSFC and CAAC: U2133211; the CAAC Safety Capacity Building Project: ASSA2023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The airport terminal dataset (ATC) utilized in this study is not publicly available due to strict privacy and security protocols governing the surveillance footage. The data contains sensitive information about public spaces and operational infrastructure. However, the data may be made available to qualified researchers for non-commercial research purposes upon reasonable request. Access is contingent upon the signing of a formal data confidentiality and use agreement. Inquiries regarding the data access procedure should be directed to the corresponding author via email at dangwanli@caacsri.com.

Acknowledgments

We thank Yinchuan Hedong International Airport for their essential support in data collection and in situ validation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ATC: Airport Terminal Collection
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
FPS: Frames per Second
FOV: Field of View
CAD: Computer-Aided Design
BIM: Building Information Modeling
SLAM: Simultaneous Localization and Mapping
GPU: Graphics Processing Unit
CPU: Central Processing Unit
GIS: Geographic Information System
DLT: Direct Linear Transformation

References

  1. Jianping, Y. Spatial Visualization by Realistic 3D Views. Eng. Des. Graph. J. 2008, 72, 28–38.
  2. Zou, J. Airport security comprehensive monitoring management system are discussed. Netw. Secur. Technol. Appl. 2014, 2014, 146–147.
  3. Sebe, I.; Hu, J.; You, S.; Neumann, U. 3D video surveillance with Augmented Virtual Environments. In Proceedings of the IWVS ’03: First ACM SIGMM International Workshop on Video Surveillance, Berkeley, CA, USA, 2–8 November 2003.
  4. Zhong, Z.; You, J.; Yang, J.; Zhou, Y.; Wu, W. Method for 3D Scene Structure Modeling and Camera Registration from Single Image. US Patent 9,942,535, 10 April 2018.
  5. Sawhney, H.S.; Arpa, A.; Kumar, R.; Samarasekera, S.; Aggarwal, M.; Hsu, S.; Nister, D.; Hanna, K. Video flashlights: Real time rendering of multiple videos for immersive model visualization. In Proceedings of the ACM International Conference Proceeding Series, Honolulu, HI, USA, 7–11 May 2002; Volume 28, pp. 157–168.
  6. Zhang, J.; Li, X.; Wan, Z.; Wang, C.; Liao, J. Text2nerf: Text-driven 3d scene generation with neural radiance fields. IEEE Trans. Vis. Comput. Graph. 2024, 30, 7749–7762.
  7. Jiang, Y.; Shen, Z.; Hong, Y.; Guo, C.; Wu, Y.; Zhang, Y.; Yu, J.; Xu, L. Robust dual Gaussian splatting for immersive human-centric volumetric videos. ACM Trans. Graph. (TOG) 2024, 43, 334.
  8. Gao, R.; Chen, K.; Li, Z.; Hong, L.; Li, Z.; Xu, Q. Magicdrive3d: Controllable 3d generation for any-view rendering in street scenes. arXiv 2024, arXiv:2405.14475.
  9. Jang, H.; Baek, S.; Joo, K. Structure-from-Motion is Dead: Long Live Structure-from-Motion! In Proceedings of the ICCV, Paris, France, 1–6 October 2023.
  10. Krishnan, D.L.; Kannan, K.; Muthaiah, R.; Nalluri, M.R. Evaluation of metrics and a dynamic thresholding strategy for high precision single sensor scene matching applications. Multimed. Tools Appl. 2021, 80, 18803–18820.
  11. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  12. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417.
  13. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
  14. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
  15. Liu, F.; Gleicher, M.; Jin, H.; Agarwala, A. Content-preserving warps for 3D video stabilization. ACM Trans. Graph. (ToG) 2009, 28, 1–9.
  16. Kumar, R.; Sawhney, H.S.; Hsu, S.; Samarasekera, S. Pose estimation, model refinement, and enhanced visualization using video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, SC, USA, 15 June 2000.
  17. Ding, L.; Goshtasby, A.; Satter, M. Volume image registration by template matching. Image Vis. Comput. 2001, 19, 821–832.
  18. Whelan, T.; Salas-Moreno, R.F.; Glocker, B.; Davison, A.J.; Leutenegger, S. ElasticFusion: Real-time dense SLAM and light source estimation. Int. J. Robot. Res. 2016, 35, 1697–1716.
  19. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849.
  20. Modersitzki, J. Numerical Methods for Image Registration; Oxford University Press: New York, NY, USA, 2004.
  21. Lim, A.X.W.; Ng, L.H.X.; Griffin, C.; Kryer, N.; Baghernezhad, F. Reverse projection: Real-time local space texture mapping. arXiv 2024, arXiv:2401.05593.
  22. Lima da Silva, I.N.; Fernandes, É.; Gonzaga, H. Interactive 3D reconstruction and DLT camera calibration: A manual registration approach. South. J. Sci. 2024, 32, 1–10.
  23. He, Y.; Li, G.; Shao, Y.; Wang, J.; Chen, Y.; Liu, S. A point cloud compression framework via spherical projection. In Proceedings of the 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), Macau, China, 1–4 December 2020; pp. 62–65.
  24. Deng, X.; Dragotti, P.L. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3333–3348.
  25. Song, X.; Chao, H.; Xu, X.; Guo, H.; Xu, S.; Turkbey, B.; Wood, B.J.; Sanford, T.; Wang, G.; Yan, P. Cross-modal attention for multi-modal image registration. Med. Image Anal. 2022, 82, 102612.
  26. Deng, X.; Liu, E.; Li, S.; Duan, Y.; Xu, M. Interpretable multi-modal image registration network based on disentangled convolutional sparse coding. IEEE Trans. Image Process. 2023, 32, 1078–1091.
  27. Wen, K.; Xie, B.; Duan, B.; Yan, Y. MambaReg: Mamba-Based Disentangled Convolutional Sparse Coding for Unsupervised Deformable Multi-Modal Image Registration. arXiv 2024, arXiv:2411.01399.
  28. Sang, Y.; McNitt-Gray, M.; Yang, Y.; Cao, M.; Low, D.; Ruan, D. Target-oriented deep learning-based image registration with individualized test-time adaptation. Med. Phys. 2023, 50, 7016–7026.
  29. Sha, Q.; Sun, K.; Jiang, C.; Xu, M.; Xue, Z.; Cao, X.; Shen, D. Detail-preserving image warping by enforcing smooth image sampling. Neural Netw. 2024, 178, 106426.
  30. Lee, D.C.; Hebert, M.; Kanade, T. Geometric reasoning for single image structure recovery. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2136–2143.
  31. Nedovic, V.; Smeulders, A.W.; Redert, A.; Geusebroek, J.M. Depth information by stage classification. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8.
  32. Zhu, P.; Guo, Y. 3D indoor reconstruction using Kinect sensor with locality constraint. Int. J. Model. Identif. Control 2023, 42, 46–53.
  33. Mei, J.; Zuo, T.; Song, D. Highly dynamic visual SLAM dense map construction based on indoor environments. IEEE Access 2024, 12, 38717–38731.
  34. Meskine, F.; Mezouar, O. A Rigid Image Registration by Combined Local Features and Genetic Algorithms. Appl. Comput. Syst. 2023, 28, 252–257.
  35. Qiu, J.; Li, H.; Cao, H.; Zhai, X.; Liu, X.; Sang, M.; Yu, K.; Sun, Y.; Yang, Y.; Tan, P. RA-MMIR: Multi-modal image registration by robust adaptive variation attention gauge field. Inf. Fusion 2024, 105, 102215.
  36. Wang, X.; Zheng, S.; Lin, X.; Zhu, F. Improving RGB-D SLAM accuracy in dynamic environments based on semantic and geometric constraints. Measurement 2023, 217, 113084.
  37. Labbé, M.; Michaud, F. RTAB-Map as an open-source LIDAR and visual simultaneous localization and mapping library for large-scale and long-term online operation. J. Field Robot. 2019, 36, 416–446.
  38. Cheng, Y.; Wang, G.; Wang, M.; Liu, F. DS-SLAM: A semantic visual SLAM towards dynamic environments. IEEE/ASME Trans. Mechatronics 2021, 26, 1832–1843.
  39. Shi, Y.; Li, R.; Shi, Y.; Liang, S. A Robust and Lightweight Loop Closure Detection Approach for Challenging Environments. Drones 2024, 8, 322.
  40. Zhang, R.; Zhang, X. Geometric constraint-based and improved yolov5 semantic slam for dynamic scenes. ISPRS Int. J. Geo-Inf. 2023, 12, 211.
  41. Lee, C.Y.; Badrinarayanan, V.; Malisiewicz, T.; Rabinovich, A. Roomnet: End-to-end room layout estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4865–4874.
  42. Hsiao, C.W.; Sun, C.; Sun, M.; Chen, H.T. Flat2layout: Flat representation for estimating layout of general room types. arXiv 2019, arXiv:1905.12571.
  43. Yang, C.W.; Yang, Y.H.; Lin, Y.Y. AtlantaNet: Inferring room layouts from a single image in atlanta world. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 416–432.
  44. Gao, K.; Ji, H.; Wu, J.; Lin, D.; Dai, Q. RoomFormer: Room-cross-view-transformer for 3D room layout reconstruction. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 55–71.
  45. Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4009–4018.
  46. Liu, C.; Kim, K.; Gu, J.; Furukawa, Y.; Kautz, J. Planercnn: 3D plane detection and reconstruction from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4450–4459.
  47. Kneip, L.; Siegwart, R.; Pollefeys, M. Finding the exact rotation between two images independently of the translation. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 696–709.
  48. Zhang, Z.; Scaramuzza, D. A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7244–7251.
  49. Han, C.; Lv, C.; Kou, Q.; Jiang, H.; Cheng, D. DCL-depth: Monocular depth estimation network based on iam and depth consistency loss. Multimed. Tools Appl. 2025, 84, 4773–4787.
  50. Lambert, J.; Liu, Z.; Sener, O.; Hays, J.; Koltun, V. MSeg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2879–2888.
  51. Stekovic, S.; Hampali, S.; Rad, M.; Sarkar, S.D.; Fraundorfer, F.; Lepetit, V. General 3D room layout from a single view by render-and-compare. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 187–203.
  52. Dorninger, P.; Pfeifer, N. A comprehensive automated 3D approach for building extraction, reconstruction, and regularization from airborne laser scanning point clouds. Sensors 2008, 8, 7323–7343.
  53. Cheng, L.; Gong, J.; Li, M.; Liu, Y. 3D building model reconstruction from multi-view aerial imagery and LIDAR data. Photogramm. Eng. Remote Sens. 2011, 77, 125–139.
Figure 1. Examples of two rendering methods. (a,b) Data simulation models, which are limited by data collection accuracy and may not represent the real situation. (c) Our proposed 3D realistic visualization, which displays real-time personnel location relationships without relying on high-precision data acquisition.
Figure 2. Applications of real-time 3D visualization. (a) Video Flashlights for immersive model visualization. (b) 3D Immersive View in Google Street View, created from texture registration of real-world imagery.
Figure 3. Failure cases of traditional layout estimation in terminal scenes. (a,b) Non-box structures. (c,d) Reflective and texture-less areas.
Figure 4. Failure of deep learning-based layout estimation. (a) Erroneous depth map in a reflective scene; different colors indicate relative depth. (b) Incorrectly fitted planar graph resulting from the flawed depth map; different colors represent distinct planar instances recovered from the noisy data.
Figure 5. The pipeline of our proposed real-time registration method.
Figure 6. The relationship between the four coordinate systems: World ($O_W$), Camera ($O_{C_i}$), Image ($O_{I_i}$), and Pixel ($O_{uv}$).
Figure 7. (a) Top view showing the different FOVs of one camera in two situations (after rotation). (b) 3D view showing the relationship between the 3D structure and the camera's FOV. We compute the intersection of the 3D structure's edges with the camera's FOV. $L_i$ denotes an edge of the 3D structure within the camera's FOV; $s_i$ and $e_i$ denote the start point and end point of line segment $L_i$.
Figure 8. Illustration of re-projection inconsistency. (a) Input image with two texture regions. (b) The re-projection template mesh. (c) The final re-projected mesh in 3D space, showing misalignment at the edge.
Figure 9. The pixel scale-invariant regions defined along a line segment for descriptor computation. Arrows represent local gradient orientations.
Figure 10. Sample data from different perspectives. (a) Normal enclosed spaces. (b) High-dynamic-range environments. (c) Slight reflection scene.
Figure 11. Two surveillance cameras within the terminal environment.
Figure 12. The positioning of camera i within the 3D model.
Figure 13. The resulting 3D layout for camera i.
Figure 14. An overhead view of the local 3D layout.
Figure 15. Registration result using only the initial layout estimation (the baseline GP component), showing inconsistencies at texture boundaries.
Figure 16. Visualization of the gradient descent process for refining the 2D layout estimation. The (u,v) plane represents the image coordinates, and the z-axis represents the evaluation score, which the process seeks to minimize.
Figure 17. Empirical analysis of kernel size impact on (a) visual registration quality and (b) computational latency.
Figure 18. The test result of our method: (a) experimental scenarios, (b) the layout estimation results, (c) registration result.
Figure 19. Real-time 3D visualization effects of a central international airport terminal hall involving six synchronized cameras.
Figure 20. Comparison of layout estimation results for four different scenes. From left to right: input image, traditional method, deep learning method, our method. (a) Scene 1. (b) Scene 2. (c) Scene 3. (d) Scene 4.
Figure 21. Comparison of 3D visualization effects for a normal scene. The inaccurate layouts from traditional and deep learning methods lead to distorted registrations.
Figure 22. Comparison of 3D visualization effects for a challenging scene with reflections. The deep learning method produces severe deformations, while our method remains robust.
Table 1. The score of layout hypothesis.
| s/(u,v) | 1250 | 1200 | 1170 | 1150 | 1130 | 1110 |
|---|---|---|---|---|---|---|
| 390 | 0.4595 | 0.4424 | 0.4456 | 0.4547 | 0.4977 | 0.4781 |
| 380 | 0.4848 | 0.3688 | 0.3187 | 0.3031 | 0.2091 | 0.1960 |
| 375 | 0.5922 | 0.4417 | 0.3382 | 0.2313 | 0.1668 | 0.1322 |
| 370 | 0.5387 | 0.5388 | 0.4039 | 0.2970 | 0.2293 | 0.2032 |
| 365 | 0.5726 | 0.5237 | 0.3660 | 0.3497 | 0.3627 | 0.3510 |
| 360 | 0.6371 | 0.5315 | 0.3822 | 0.4356 | 0.3549 | 0.3855 |
| 350 | 0.5888 | 0.4070 | 0.4337 | 0.4422 | 0.5015 | 0.4996 |
Table 2. Computational timings of framework components.
| Processing Stage | Subprocedure | Time (ms) |
|---|---|---|
| Real-time Update and Processing | Overall processing and rendering | <50 |
| | Data update and GPU uploading | <20 |
| | Texture mapping and transformation | <20 |
| Layout Inference | Initial estimation | <10 |
| | Optimal solution | 20–30 |
Table 3. Comparison of computational performance and registration accuracy using different floating-point precisions.
| Precision | RMSE (px) | MAE (px) | Average Time (ms) | Peak Memory (GB) |
|---|---|---|---|---|
| 32-bit (Single) | 10.3 ± 2.2 | 9.8 ± 1.8 | 21.5 | 1.2 |
| 64-bit (Double) | 10.1 ± 2.1 | 9.6 ± 1.7 | 34.8 | 2.1 |
Table 4. Performance under calibrated parameter perturbations.
| Perturbation Level | RMSE (px) | Success Rate | Degradation |
|---|---|---|---|
| No perturbation | 10.3 | 98% | Baseline |
| +3% position, +1.5° orientation | 11.9 | 94% | Minor |
| +5% position, +2° orientation | 13.5 | 87% | Moderate |
| +8% position, +3° orientation | 16.8 | 72% | Significant |
Table 5. Quantitative comparison on regular layout scenes from the Airport Terminal Collection dataset, evaluated by reprojection error.
| Method | RMSE (px) ↓ | MAE (px) ↓ | Average Time (ms) ↓ | FPS ↑ |
|---|---|---|---|---|
| ORB-SLAM2 a | / | / | / | / |
| LSD-SLAM a | / | / | / | / |
| LSD | 476.50 ± 3.5 | 403.22 ± 3.2 | ~20 | 50 |
| PlaneRCNN | 18.7 ± 3.8 | 14.5 ± 2.8 | ~300 | 3.3 |
| RoomNet | 31.5 ± 4.2 | 27.2 ± 3.5 | ~100 | 10.5 |
| RTAB-Map (RGB-D) | 16.5 ± 3.1 | 12.8 ± 2.4 | ~38.7 | 25.8 |
| DCL | 14.2 ± 2.8 | 11.3 ± 2.1 | 45.2 | 22.1 |
| SIR | 19.5 ± 3.5 | 15.7 ± 2.9 | 280 | 3.6 |
| Ours | 10.3 ± 2.2 | 9.8 ± 1.8 | ~21.5 | 46.5 |
a Failed to initialize or lost tracking in most sequences; results are not reported.
Table 6. Performance of the Layout-Based Analysis and Rendering Stage.
| #Streams | Avg. Processing Time/Frame (ms) | Throughput (FPS) |
|---|---|---|
| 1 | ~3 | ~330 |
| 4 | ~8 | ~131.6 |
| 8 | ~17 | ~58.6 |
| 16 | ~40 | ~25.4 |
Table 7. Ablation study of framework components. GP: geometric projection, CF: coordinate framework, MF: multi-stage refinement. Score denotes the geometric consistency loss, with lower values being better.
| Configuration | Score ↓ | Time (ms) ↓ | Improvement |
|---|---|---|---|
| GP | 0.451 | 1.4 | Baseline |
| + CF | 0.362 | 4.2 | +19.7% |
| + CF + MF (Ours) | 0.184 | 21.5 | +59.2% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dang, W.; Cheng, J.; Wang, C.; Luo, Q.; Li, M. Cyber-Physical System for Terminal Infrastructure Monitoring: A Depth-Free Registration Framework via Geometric-Model Fusion. Appl. Sci. 2025, 15, 13079. https://doi.org/10.3390/app152413079
