4.2. Digital Twin Modeling Pipeline
The DT models are constructed from heterogeneous data sources. We import detailed 3D geometry of cranes, vehicles, and yard infrastructure from design data (e.g., BIM/IFC models, laser scans, or CityGML maps) and synchronize dynamic state from sensors. To balance fidelity and performance, we employ a tiered level-of-detail (LOD) strategy. High-fidelity LOD500 meshes are used for dynamic equipment (quay cranes, yard cranes, AGVs) where precise kinematics (trajectories, hoist motion, collision geometry) matter. In contrast, static structures such as container stacks, buildings, and roads use simplified LOD300–400 meshes, capturing overall layout with far fewer polygons. In deployment, these LODs can switch based on distance: for example, distant yard cranes may render a coarse LOD, while cranes within 50–100 m use a detailed model. Typical triangle budgets are ≤50 k for LOD400 stack models versus ∼200 k for LOD500 crane models.
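The distance-based LOD switching described above can be sketched as follows. The tier activation distances (75 m and 150 m) are illustrative assumptions consistent with the 50–100 m range quoted in the text, and the triangle budgets echo the ∼200 k and ≤50 k figures; this is not the deployed Unity logic.

```python
# Hypothetical LOD tiers: (name, activation distance in m, triangle budget).
# Budgets follow the ~200 k (LOD500 crane) and <=50 k (LOD400 stack)
# figures quoted in the text; the 75/150 m cutoffs are illustrative.
LOD_TIERS = [
    ("LOD500", 0.0, 200_000),    # full kinematic and collision detail
    ("LOD400", 75.0, 50_000),    # simplified mesh for mid-range objects
    ("LOD300", 150.0, 20_000),   # coarse layout proxy for distant objects
]

def select_lod(distance_m: float) -> str:
    """Return the coarsest tier whose activation distance has been reached."""
    chosen = LOD_TIERS[0][0]
    for name, min_dist, _budget in LOD_TIERS:
        if distance_m >= min_dist:
            chosen = name
    return chosen
```

In a renderer, the same policy would typically be evaluated per frame against the camera-to-object distance, with hysteresis added to avoid tier flickering near the cutoffs.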
Virtual-to-real alignment is achieved through spatial anchoring. Large environmental features (roads, yard boundaries, building footprints) are treated as contextual anchors that align the Unity world coordinate frame to the actual port site. For instance, we align Unity’s ENU (east–north–up) axes to surveyed control points on the terminal, empirically achieving sub-30 mm placement error of holograms. The HoloLens also uses its spatial mapping capability for occlusion: real objects (captured in the device’s depth mesh) correctly obscure virtual content, and view frustum culling avoids rendering objects outside the user’s field of view.
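The control-point alignment can be illustrated with a minimal 2-D rigid fit: given the Unity ground-plane coordinates of surveyed anchors and their site ENU coordinates, a closed-form Procrustes solution recovers the rotation and translation. This is a sketch of the general technique under our own naming and example data, not the deployed anchoring code.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares 2-D rotation + translation mapping src points onto dst
    (closed-form orthogonal Procrustes solution in the plane)."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    s_cos = s_sin = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - csx, sy - csy   # centered source point
        bx, by = dx - cdx, dy - cdy   # centered target point
        s_cos += ax * bx + ay * by
        s_sin += ax * by - ay * bx
    theta = math.atan2(s_sin, s_cos)  # best-fit rotation angle
    c, s = math.cos(theta), math.sin(theta)
    tx = cdx - (c * csx - s * csy)    # translation after rotating centroid
    ty = cdy - (s * csx + c * csy)
    return theta, (tx, ty)

def apply_transform(theta, t, p):
    """Map a Unity ground-plane point into site ENU coordinates."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1])
```

In practice, the fit residuals against additional check points would quantify placement error of the kind reported above (sub-30 mm).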
All models are implemented in Unity3D, which supports real-time rendering and native HoloLens integration. We use Unity’s Universal Render Pipeline (URP) for efficiency. Performance optimizations include GPU instancing and draw-call batching to reduce overhead, mesh simplification for LODs, and view-dependent frustum culling. These measures ensure a stable frame rate: for example, LODs and batching keep rendering fluid even with thousands of stack objects, as confirmed by runtime profiling. The end result (illustrated in Figure 4) is a multi-scale 3D twin that is both accurate for critical equipment and lightweight enough for HoloLens rendering.
The Microsoft HoloLens 2 device was selected as the MR interface for its robust capabilities tailored to industrial applications. It employs inside-out tracking with environmental cameras and an inertial measurement unit (IMU) for self-relocalization without external markers, enabling stable spatial anchoring of holograms in large-scale environments such as container terminals. Integrated hand-tracking and microphone arrays support intuitive gesture and voice-based control, while see-through waveguide displays allow operators to maintain situational awareness of physical equipment during overlay visualization. The native Unity–MRTK integration further streamlines the development of spatially aware, real-time interactive applications.
4.3. Data Middleware and Communication Protocol Stack
This section details the operational logic of middleware data fusion and MR command processing, which together form the core of the proposed DT–MR integration framework. The subsequent scenario-based validation further demonstrates the reproducibility and practical applicability of these mechanisms. The middleware layer unifies heterogeneous data streams and synchronizes them for consistent use within the DT and MR interface. It performs protocol translation, filtering, and time alignment so that all components share a coherent and up-to-date event stream. The architecture supports multiple protocols: for example, low-level PLCs and SCADA nodes expose real-time tags via OPC UA clients, while high-rate telemetry such as AGV positions and sensor feeds is disseminated using MQTT publish/subscribe. A RESTful API is provided for on-demand queries or historical data retrieval (e.g., overview dashboards). Each protocol is selected for its respective strengths: OPC UA ensures industrial-grade reliability for device I/O, MQTT provides low-latency state broadcasting, and REST supports flexible interoperability with web services. Together, these mechanisms enable seamless and dependable data fusion across terminal subsystems.
4.3.1. Protocol Boundaries
Field equipment and control systems (e.g., cranes, drives) use OPC UA servers to expose live data; the middleware ingests these via OPC UA clients. MQTT is used internally for event dissemination: e.g., a crane’s hoist position might be published on topic ACT/QC/QC01/hoist_position whenever it changes. For configuration or historical queries, the management console uses HTTP REST requests to the middleware. This separation (OPC UA for device I/O, MQTT for messaging, REST for request/response) lets each protocol operate in its area of strength.
4.3.2. Topic/Node Naming
We enforce a semantic naming convention for clarity. MQTT topics follow the form ACT/<asset_type>/<asset_id>/<signal>, e.g., ACT/AGV/AGV07/position. OPC UA node IDs mirror this hierarchy in the address space (for example, ns=2;s=AGV07.Position.X for an AGV’s X coordinate). Such structured naming (type, ID, signal) makes it easy to extend the system to new equipment and to subscribe to specific data streams without ambiguity.
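The naming convention can be made machine-checkable with a small helper. The regular expression and the simplified NodeId mapping below are our illustrative assumptions (the deployed address space also encodes per-axis components such as .X, which this sketch omits).

```python
import re

# Pattern for the ACT/<asset_type>/<asset_id>/<signal> convention;
# asset IDs like "AGV07" or "QC01" are letters followed by digits.
TOPIC_RE = re.compile(
    r"^ACT/(?P<asset_type>[A-Z]+)/(?P<asset_id>[A-Z]+\d+)/(?P<signal>\w+)$"
)

def make_topic(asset_type: str, asset_id: str, signal: str) -> str:
    """Build an MQTT topic following the semantic naming convention."""
    return f"ACT/{asset_type}/{asset_id}/{signal}"

def parse_topic(topic: str) -> dict:
    """Split a topic into its (type, ID, signal) parts, rejecting malformed names."""
    m = TOPIC_RE.match(topic)
    if not m:
        raise ValueError(f"non-conforming topic: {topic}")
    return m.groupdict()

def to_opcua_node(parts: dict) -> str:
    """Mirror the topic hierarchy as an OPC UA string NodeId
    (namespace index 2 assumed, per-axis suffixes omitted)."""
    signal = parts["signal"].title().replace("_", "")
    return f"ns=2;s={parts['asset_id']}.{signal}"
```

Validating every published topic against such a pattern keeps new equipment extensions consistent and makes subscriptions unambiguous.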
4.3.3. Time Synchronization
All systems share a common time base to prevent timestamp drift. We deploy network time protocols such as NTP or IEEE 1588 PTP across controllers, servers, and even the HoloLens device (Microsoft, Redmond, WA, USA). In a precision-critical environment, we enable PTP (as it can achieve sub-microsecond synchronization on local networks). In practice, each data packet is tagged with a synchronized timestamp, and any clock offsets are corrected at the middleware. By maintaining synchronized clocks on all nodes, the middleware can fuse data from different sources (TOS, ECS, sensors) with millisecond precision, ensuring the DT reflects the true simultaneous state of the terminal.
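The offset correction applied at the middleware follows the standard NTP-style two-way exchange; a minimal sketch (function names are ours) is:

```python
def clock_offset_ms(t1: float, t2: float, t3: float, t4: float) -> float:
    """NTP-style clock offset estimate in ms.
    t1/t4: client send/receive times (client clock);
    t2/t3: server receive/send times (server clock).
    Positive result means the server clock is ahead of the client clock."""
    return ((t2 - t1) + (t3 - t4)) / 2.0

def correct_timestamp(ts_ms: float, offset_ms: float) -> float:
    """Map a remote timestamp onto the middleware's common time base."""
    return ts_ms - offset_ms
```

PTP achieves far tighter bounds via hardware timestamping, but the same offset-correction principle applies when the middleware tags each packet onto the shared time base.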
4.3.4. Data Frequency, Buffering, and Playback
Data streams have different update rates: e.g., system status heartbeats at ∼1 Hz, vehicle/robot motions at 10–20 Hz, and alarms/events as they occur. The middleware enforces appropriate throttling or smoothing. For example, we may ignore position updates that change by <50 mm within 50 ms to reduce jitter, or interpolate missing points for intermittent data. Each stream is briefly buffered (on the order of 100 ms) to align asynchronous inputs; timestamps and sequence numbers are used to reorder out-of-order messages. To handle packet loss or downtime, we enable MQTT QoS = 1 (at-least-once delivery) for critical topics and log all messages in a replay buffer. This allows the system to “replay” missed data after reconnection, preserving continuity of the twin. In summary, the middleware ensures that data flows at the needed rates with minimal latency (meeting the ≤100 ms and ≥10 k events/s targets) while providing reliability through buffering and replay mechanisms.
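The jitter filter and reordering steps can be sketched as follows, using the <50 mm within 50 ms rule quoted above; the class and field names are illustrative, not the middleware implementation.

```python
import math

class PositionThrottle:
    """Suppress position updates that move less than min_dist (m) within
    min_interval (s), matching the <50 mm / 50 ms rule in the text."""

    def __init__(self, min_dist: float = 0.05, min_interval: float = 0.05):
        self.min_dist = min_dist
        self.min_interval = min_interval
        self.last_t = None
        self.last_xy = None

    def accept(self, t: float, x: float, y: float) -> bool:
        if self.last_t is None:
            self.last_t, self.last_xy = t, (x, y)
            return True
        moved = math.hypot(x - self.last_xy[0], y - self.last_xy[1])
        if t - self.last_t < self.min_interval and moved < self.min_dist:
            return False  # jitter: drop without updating state
        self.last_t, self.last_xy = t, (x, y)
        return True

def reorder(messages):
    """Restore order of out-of-order messages using timestamps, then
    sequence numbers as a tie-breaker (both carried by each message)."""
    return sorted(messages, key=lambda m: (m["ts"], m["seq"]))
```

A real pipeline would apply `reorder` over the ~100 ms alignment buffer before forwarding events, and log everything into the replay buffer for post-reconnection playback.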
Figure 5 summarizes the middleware-centric integration pipeline, highlighting the protocol conversion, data alignment, and filtering steps that ensure the standardized data stream meets the defined performance requirements.
To systematically present the communication strategy, Table 1 summarizes the roles, data interaction modes, and application scenarios of the core protocols that enable real-time and reliable data exchange across the physical, middleware, DT, and MR layers of the system.
To ensure secure and controlled data exchange, all middleware communication channels—OPC UA, MQTT, REST API, and WebSocket—are protected using TLS encryption and authentication mechanisms. The system employs identity-based access control: different user roles (operator, supervisor, administrator) are assigned distinct permissions for telemetry subscription, configuration access, and command execution. Token-based verification and credential management are enforced at the middleware layer to prevent unauthorized data publishing or command injection. All interactions are logged for traceability, and no personally identifiable information (PII) is stored or transmitted within the framework, ensuring compliance with privacy and security principles.
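The role gating can be sketched as a permission matrix with audit logging. The specific role-to-permission assignments below are assumptions for exposition: the text defines the three roles and the three permission classes, but not their exact mapping.

```python
# Hypothetical permission matrix for the roles described in the text.
# (The text names the roles and permission classes; this assignment is ours.)
PERMISSIONS = {
    "operator":      {"subscribe_telemetry"},
    "supervisor":    {"subscribe_telemetry", "read_config"},
    "administrator": {"subscribe_telemetry", "read_config", "execute_command"},
}

AUDIT_LOG = []  # every attempt is logged for traceability

def authorize(role: str, action: str, token_valid: bool) -> bool:
    """Gate an action on token validity and role permissions, logging the attempt."""
    granted = token_valid and action in PERMISSIONS.get(role, set())
    AUDIT_LOG.append({"role": role, "action": action, "granted": granted})
    return granted
```

In the deployed middleware this check sits behind TLS-protected channels, so a forged publisher without a valid token fails before the role check is even reached.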
4.3.5. Event-Driven Operational Logic and Pseudocode Organization
To enhance reproducibility, the detailed pseudocode of each functional interface is provided in the subsequent section, colocated with the corresponding figures illustrating the DT–MR operational workflows. These listings specify the real-time data fusion routines in the middleware and the MR-based control workflows with their associated safety interlocks. This organization allows each interface (multi-port selection, equipment panel, scale-control, and Follow-Me mode) to be directly understood alongside its corresponding figure.
Section 4.6 further discusses system-level performance and safety guarantees, while Section 5 demonstrates scenario-based execution invoking these routines.
Beyond their colocated presentation, these routines are coordinated under a unified event-driven logic: middleware streams trigger data updates in the twin, while operator actions in MR propagate as control intents subject to permission checks and interlocks. Each routine is thus not standalone, but part of a closed feedback loop (data → twin → MR → command → execution → logging). This ensures that the pseudocode listings are both modular for clarity and cohesive for reproducibility.
4.4. MR Interaction Design
The Mixed Reality interface supports multimodal interaction via gaze, gesture, and voice [34]. We implement a simple state machine: when the user’s gaze rests on a holographic object for ∼2 s, the object is highlighted; an Air Tap gesture then “selects” it for operation. Once selected, the user can issue voice commands or use a floating menu to control the object (e.g., “Start crane”, “Open AGV dashboard”). Critical commands use a confirmation step: for example, a voice command is not executed until the user explicitly confirms (via a secondary gesture or verbal “Yes”) to avoid false triggers. Unrecognized or unsafe commands prompt an audible error and visual alert, and the system requires re-issuance to proceed.
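The gaze–tap–voice flow corresponds to a small finite-state machine. The sketch below encodes the ∼2 s dwell and the confirmation step described above; the state and method names are ours, not the Unity/MRTK implementation.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()           # no object in focus
    HIGHLIGHTED = auto()    # gaze dwell reached ~2 s
    SELECTED = auto()       # Air Tap on highlighted object
    AWAIT_CONFIRM = auto()  # critical voice command pending confirmation

DWELL_S = 2.0  # gaze dwell threshold from the text

class SelectionFSM:
    def __init__(self):
        self.state = State.IDLE
        self.gaze_t = 0.0

    def on_gaze(self, dt: float):
        """Accumulate dwell time; highlight once the threshold is reached."""
        if self.state is State.IDLE:
            self.gaze_t += dt
            if self.gaze_t >= DWELL_S:
                self.state = State.HIGHLIGHTED

    def on_gaze_lost(self):
        if self.state is State.HIGHLIGHTED:
            self.state = State.IDLE
        self.gaze_t = 0.0

    def on_air_tap(self):
        if self.state is State.HIGHLIGHTED:
            self.state = State.SELECTED

    def on_voice(self, command: str, critical: bool) -> str:
        if self.state is not State.SELECTED:
            return "rejected"  # nothing selected: audible error + alert
        if critical:
            self.state = State.AWAIT_CONFIRM
            return "awaiting_confirmation"
        return "executed"

    def on_confirm(self, yes: bool) -> str:
        """Secondary gesture or verbal 'Yes'/'No' resolves a pending command."""
        if self.state is not State.AWAIT_CONFIRM:
            return "rejected"
        self.state = State.SELECTED
        return "executed" if yes else "cancelled"
```

Modeling the interaction as an explicit FSM makes the rejection paths (unselected object, unconfirmed critical command) testable independently of the rendering layer.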
The UI layout follows task flow. A translucent “Follow-Me” status panel remains centered in the user’s view, showing the selected asset’s key data. Contextual panels pop up near relevant equipment (e.g., a container stack status panel appears next to the stack). Alarms trigger prominent alerts: if a safety alarm occurs, a full-window warning panel appears along with audio cues, and navigation is temporarily disabled until the user acknowledges it. The design emphasizes minimal movement (teleportation and scale controls help the user reposition in the large scene) while ensuring that essential controls (on holographic buttons or through voice) are always at the user’s disposal.
4.5. Design-to-Requirement Traceability
Each design decision directly maps to the requirements identified earlier. Table 2 summarizes this mapping: for instance, the visualization requirements (≤5 cm spatial error, ≥60 fps) are met by the LOD-based modeling strategy. Specifically, using high-detail LOD500 models for moving assets and LOD300–400 for static layout ensures geometric accuracy for key objects and efficient rendering. The data integration requirements (≤100 ms latency, ≥10,000 events/s throughput) are fulfilled by the middleware’s protocol translation and asynchronous streaming. Likewise, the interaction requirements (sub-50 ms command response, high recognition accuracy) are addressed by the multimodal MR interface and optimized input handling. In each case, the component introduced in Section 4 has a clear rationale: e.g., Section 4.2’s modeling pipeline yields the needed spatial precision, and Section 4.3’s middleware provides the needed real-time synchronization. The end-to-end mapping (detailed in Table 2) thus shows that all functional and performance specifications have been systematically incorporated into the design. This traceability ensures that in the implementation (Section 5), we can verify that the system actually meets the quantitative targets.
4.6. Runtime Performance and Safety Interlocks
At run time, total latency is decomposed into three stages (sensing → middleware → rendering). For instance, OPC UA polling contributes approximately 20 ms, MQTT network transit about 5 ms, middleware data alignment around 10 ms, and Unity rendering roughly 15 ms, depending on scene complexity. To ensure reproducibility, system performance was benchmarked under controlled conditions using a workstation (Intel Core i9-13900K CPU (Intel, Santa Clara, CA, USA), 64 GB RAM, NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA)) and a Microsoft HoloLens 2 device (Microsoft, Redmond, WA, USA) connected through a local 5 GHz wireless network.
We continuously monitored performance using the Unity Profiler and custom diagnostics that recorded frame rates, script update times, and message queue lengths. If any metric exceeded a threshold (e.g., FPS dropped below 50 or queue backlog grew), the system flagged a warning for inspection.
Table 3 summarizes the measured runtime performance under typical loads and indicates the observed trend as data load increases. The results confirm that the system consistently meets the defined thresholds of ≤100 ms latency, ≥60 fps rendering, and ≥10 k events/s throughput with only moderate variation across load levels.
The observed variance in latency and frame rate mainly results from dynamic scene complexity (e.g., the number of AGVs or quay cranes rendered in view) and inherent fluctuations in wireless network conditions. Across 30 repeated trials (each lasting 5 min), no significant frame drop or message backlog was observed. These measurements substantiate that the integrated middleware and visualization pipeline achieve real-time responsiveness under realistic operational loads and provide transparent evidence supporting the stated technical performance.
Alongside performance monitoring, the runtime safety interlocks described below remained active throughout all benchmark trials, ensuring operational integrity and fault containment during stress testing. Safety is enforced by multiple interlocks. We define three control modes: Read-only (pure display), Advisory (suggest commands), and Command (full control). Only in Command mode can MR actions generate actual equipment commands; in other modes the user can view status but not issue motions. All commands undergo permission checks and logging: before executing, the system verifies the user’s role and intent. Critical commands require double confirmation, e.g., “Start crane QC01” must be spoken and then explicitly confirmed by a gesture or secondary command. Every MR-originated command (with timestamp, user identity, and parameters) is audit-logged for accountability. An emergency-stop (and rollback) function is bound to a voice keyword and virtual button: triggering it immediately halts or undoes any in-flight MR command. Geofencing rules further constrain actions; for instance, remote operations of AGVs cannot direct them outside predefined safe zones. Collectively, these measures (mode gating, confirmations, permission enforcement, logging, and an e-stop fallback) create a secure feedback loop between the MR interface and the traditional control console, minimizing any risk of accidental or unauthorized operations. These runtime safeguards remained fully operational throughout all performance trials, providing a reliable foundation for the cross-scale validation framework described next.
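The interlock chain (mode gating, double confirmation, geofencing, e-stop, audit logging) can be sketched as a single dispatch gate; the safe-zone rectangle, class names, and method signatures below are illustrative assumptions, not the deployed control logic.

```python
from enum import Enum

class Mode(Enum):
    READ_ONLY = "read-only"  # pure display
    ADVISORY = "advisory"    # suggest commands only
    COMMAND = "command"      # full control

# Hypothetical geofence: (x_min, y_min, x_max, y_max) in metres.
SAFE_ZONE = (0.0, 0.0, 500.0, 300.0)

class Interlock:
    def __init__(self, mode: Mode):
        self.mode = mode
        self.estopped = False
        self.log = []  # audit trail of every MR-originated command

    def emergency_stop(self):
        """Voice keyword / virtual button: halt all in-flight commands."""
        self.estopped = True

    def dispatch(self, user_role, confirmed, target_xy=None, critical=False):
        """Gate an MR-originated command through the interlock chain.
        Returns (granted, list of rejection reasons)."""
        reasons = []
        if self.estopped:
            reasons.append("e-stop active")
        if self.mode is not Mode.COMMAND:
            reasons.append(f"mode is {self.mode.value}")
        if critical and not confirmed:
            reasons.append("double confirmation required")
        if target_xy is not None:
            x, y = target_xy
            x0, y0, x1, y1 = SAFE_ZONE
            if not (x0 <= x <= x1 and y0 <= y <= y1):
                reasons.append("target outside geofence")
        granted = not reasons
        self.log.append({"role": user_role, "granted": granted,
                         "reasons": reasons})
        return granted, reasons
```

Collecting all rejection reasons (rather than failing on the first) mirrors the logging requirement: the audit trail records exactly which interlock blocked a command.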
From a human-factors perspective, sustained MR operation requires careful consideration of visual ergonomics. To alleviate the visual fatigue commonly caused by the vergence–accommodation conflict (VAC) in stereoscopic displays, the HoloLens 2 device used in this study employs a fixed focal plane at approximately 2.0 m. All holographic panels and overlays in our DT–MR framework—such as equipment status, alerts, and control menus—are rendered within this consistent focal depth. This configuration minimizes accommodation effort and stabilizes depth perception, allowing operators to maintain focus and situational awareness during extended monitoring sessions. By integrating optical ergonomics into the system design, the framework complements the latency and performance optimizations discussed above, ensuring that both system responsiveness and operator comfort are sustained during long-duration use in real-world supervisory contexts.
4.7. Cross-Scale Mapping and Validation Framework
To bridge realistic port operations and the laboratory-scale DT–MR system, a cross-scale system validation and calibration path was established (Figure 6). The same HoloLens 2 device described in Section 4.2 was used for cross-scale validation, providing both visualization and bidirectional control channels. The Qingdao Port scenario serves as a reference context, informing (a) the taxonomy and spatial hierarchy of port assets, (b) the definition of key operational states (e.g., loading, idle, transferring), and (c) the design of exemplar events for scenario validation. The 1:30 physical test platform reproduces essential terminal components—including quay cranes, yard cranes, and AGVs—with embedded sensing and control modules, preserving the kinematic and operational logic of a full-scale automated terminal. This setup provides measurable ground truth for system calibration and mapping validation, ensuring the geometric and temporal consistency of the DT–MR representation.
The Unity-based DT mirrors the platform in real time through the geometric mapping chain defined in Section 4.2, linking IFC coordinates to Unity via the ENU transformation, while the AR/MR interface (HoloLens) delivers in situ visualization and bidirectional command feedback. This framework ensures that while the contextual validity of the system is derived from real-world port operations, its technical performance is rigorously quantified under controlled, reproducible laboratory conditions. All measurements were conducted within the controlled indoor test platform to minimize external illumination and vibration effects. The alignment accuracy was further verified by overlaying holographic objects onto surveyed physical references within the HoloLens view, confirming visual coincidence within the measured tolerance.