1. Introduction
The manufacturing paradigm is undergoing a fundamental shift from mass production to High-Mix Low-Volume (HMLV) systems, driven by growing consumer demand for customized products and the need for agile supply chains [1]. In HMLV environments, a single production line must handle diverse product variants simultaneously, which increasingly requires heterogeneous robot fleets composed of manipulators with different kinematic structures, payload capacities, and operational ranges from multiple vendors [2]. However, each manufacturer provides proprietary programming languages, such as RAPID for ABB, KRL for KUKA, and URScript for Universal Robots, creating a fragmented ecosystem in which process changeovers often require engineers to rewrite control code for each robot individually [3,4,5]. Existing interoperability technologies partly alleviate this burden at the levels of communication and data exchange, but they do not fully resolve the problem of generating executable, behavior-consistent vendor-specific robot programs from a shared task intent. As a result, frequent product changes in HMLV production continue to impose substantial engineering effort, downtime, and integration costs.
Large Language Models (LLMs) have recently emerged as a promising approach for translating natural language instructions into robot control logic [6,7]. By allowing users to specify tasks at a higher semantic level, LLMs offer a potential path toward more flexible robot programming in multi-product production environments. However, direct deployment of LLM-generated robot code in industrial settings remains challenging, particularly when the target system consists of heterogeneous robots with different physical capabilities and vendor-specific execution models. Beyond platform fragmentation, generated code may contain physically invalid instructions, such as spatial targets inconsistent with the actual environment or motion parameters that exceed the selected robot’s dynamic limits. This phenomenon, which has been discussed in recent robotics and AI literature as physical hallucination [8,9,10], is especially problematic in real-world robotic deployment because it can lead not only to task failure but also to unsafe motion and equipment risk.
Prior studies have demonstrated the feasibility of language-driven robot programming in fixed APIs, single-platform embodiments, or simulation-centered environments [6,7,11]. Meanwhile, Digital Twin (DT)-based approaches have shown strong potential for simulation, monitoring, and pre-deployment validation in industrial automation [12,13,14,15]. However, relatively limited attention has been paid to how LLM-generated robot programs can be systematically verified and adapted before execution in heterogeneous real-world robot systems. In particular, a unified framework that combines structured task representation, runtime environment grounding, robot suitability assessment, and pre-execution dynamics validation has not been sufficiently explored.
To address this gap, this paper proposes a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. The framework first transforms natural-language task requirements into a structured intermediate task representation that preserves task-level intent and execution semantics across vendor-specific implementations, thereby supporting behavioral consistency even when low-level syntax, coordinate representations, and motion primitives differ. Spatial references are then grounded using runtime Digital Twin information, and an appropriate robot is selected according to task-level and physical feasibility requirements. Before execution, the generated motion sequence is analyzed through Recursive Newton-Euler Algorithm (RNEA)-based dynamics validation, and when infeasible conditions are detected, a global scaling factor is automatically applied to adjust motion parameters within safe limits. Vendor-specific executable code is generated only after these verification and adaptation steps are completed.
The main contributions of this paper are as follows:
We propose a Digital Twin-integrated verification framework for LLM-based robot code generation in heterogeneous robot systems, targeting physical hallucination and pre-execution safety assurance.
We develop a validation pipeline that combines runtime spatial grounding, robot selection, RNEA-based torque analysis, and adaptive global motion scaling to detect and correct physically infeasible execution conditions before deployment.
We employ a structured intermediate task representation that supports transformation from natural-language instructions into verification-ready robot task specifications across heterogeneous platforms.
We empirically validate the proposed framework across a heterogeneous robot pool, showing that RNEA-based adaptive scaling achieves full feasibility coverage where unscaled execution fails and that the task-aware robot selection mechanism outperforms payload-only baselines in lightweight load scenarios.
3. Methodology
3.1. Framework Overview
This paper proposes a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. The objective of the framework is not only to translate natural-language task descriptions into vendor-specific robot programs, but also to verify whether the generated task can be safely and feasibly executed by a selected robot embodiment before deployment.
As illustrated in Figure 1, the proposed framework consists of five stages. First, a natural-language task request is transformed into a structured intermediate task representation that preserves task-level intent while remaining independent of vendor-specific syntax. Second, spatially ambiguous task elements are grounded using runtime Digital Twin information so that execution parameters reflect the actual environment rather than coordinates inferred during initial language generation. Third, the framework evaluates candidate robots in the heterogeneous robot pool and selects an embodiment that is physically suitable for the grounded task. Fourth, the grounded motion is analyzed through Recursive Newton-Euler Algorithm (RNEA)-based dynamics validation to determine whether execution exceeds the torque limits of the selected robot. Finally, when infeasible conditions are detected, a global scaling factor is applied to reduce motion intensity within safe bounds before vendor-specific code is generated.
From a system perspective, the proposed method treats verification as an integral part of code generation rather than as a post hoc testing step. In this way, task transformation, runtime grounding, embodiment selection, and dynamic feasibility assurance are integrated into a single pre-execution workflow.
It is important to clarify the role of the LLM within this architecture. The LLM functions as a structured translation engine rather than a safety decision-maker: it converts natural-language task instructions into verification-ready intermediate specifications, and later renders verified specifications into vendor-specific executable syntax. Safety-critical decisions—including spatial grounding, robot selection, and dynamic feasibility assurance—are made entirely by the deterministic modules in the downstream pipeline. This separation of language-based translation from physics-based verification is a central design principle of the proposed framework.
3.2. Structured Intermediate Task Representation
To support consistent downstream verification across heterogeneous robot platforms, the framework first converts natural-language task descriptions into a structured intermediate task representation. In this paper, the representation is employed as a verification-ready interface between language input and robot-specific execution, rather than as a standalone language-design contribution.
The representation captures the essential elements of task execution, including action type, target entity, motion intent, task ordering, and execution constraints. By explicitly structuring these elements, the framework reduces ambiguity in downstream processing and provides a common interface for Digital Twin grounding, robot selection, and dynamics validation. This is particularly important in heterogeneous robot environments, where the same task intent must later be instantiated in different robot languages and execution models.
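As a minimal sketch of how such a representation could be held in code, the following Python dataclasses encode the elements listed above (action type, target entity, motion intent, task ordering, and execution constraints). The type names `TaskStep` and `TaskSpec`, all field names, and the example values are hypothetical; they are not the framework's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskStep:
    """One step of a platform-independent task (hypothetical schema)."""
    order: int                       # task ordering
    action: str                      # action type, e.g. "MoveLinear"
    target: str                      # target entity, kept as a symbolic name
    intent: str = "transfer"         # motion intent
    constraints: dict = field(default_factory=dict)  # execution constraints
    pose: Optional[tuple] = None     # left empty until Digital Twin grounding


@dataclass
class TaskSpec:
    steps: list

    def ordered(self):
        """Steps sorted by their declared execution order."""
        return sorted(self.steps, key=lambda s: s.order)

    def ungrounded(self):
        """Steps whose spatial parameters have not been bound yet."""
        return [s for s in self.ordered() if s.pose is None]


spec = TaskSpec(steps=[
    TaskStep(order=2, action="MoveLinear", target="tray_A",
             constraints={"max_speed": 0.25}),
    TaskStep(order=1, action="MoveLinear", target="part_1"),
])
```

Keeping the pose field empty at this stage mirrors the idea that spatial parameters are bound later, so the same structure can be passed unchanged to grounding, selection, and validation.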
A key requirement of the representation is the preservation of task-level semantics across platforms. In this work, behavioral consistency refers to preserving the intended execution semantics of a task even when low-level syntax, coordinate conventions, and motion primitives differ across vendors. For this reason, task information is stored in a platform-independent form wherever possible, so that feasibility can be assessed before the final program is instantiated in a vendor-specific language.
Accordingly, the intermediate representation functions as the common substrate of the proposed framework. It enables stable transformation from natural-language instructions into machine-processable task specifications and provides the structured information required by the verification modules described in the following subsections.
The natural-language-to-TDL transformation was implemented using Gemini 2.5 Pro (Google DeepMind, London, UK) with retrieval-augmented prompting, without task-specific fine-tuning. The LLM functions as a structured translation engine that converts user natural-language instructions into verification-ready TDL specifications by following explicit grammar rules and parameter-preservation constraints provided in retrieved prompt context documents. The retrieved context consisted of four categories of structured documents: TDL grammar and command definitions, vendor-specific mapping tables, parameter-preservation constraints, and output-format requirements. Each prompt specification comprised approximately 250–300 lines of structured guidance. No per-task manual customization was applied during experimental evaluation; human effort was concentrated in the initial preparation of grammar specifications and mapping rules.
A representative prompt structure is as follows:
[Role] You are an expert robot programming assistant specializing in TDL-based task representation.
[Task] Convert the natural-language instruction into a verification-ready TDL specification preserving all task-critical parameters.
[Constraints] Use only defined TDL commands. Leave pose fields as semantic placeholders for Digital Twin grounding. Output only the final TDL script in the required format.
Importantly, the LLM is not responsible for execution safety. Physical feasibility assurance is entirely handled by the downstream Digital Twin grounding and RNEA-based validation modules described in Sections 3.3, 3.4, 3.5, and 3.6. This separation ensures that safety-critical decisions are made analytically rather than by language model inference.
3.3. Runtime Spatial Grounding with Digital Twin Information
One of the major risks in LLM-based robot code generation is that spatial parameters may be generated without sufficient reference to the actual execution environment. A generated instruction may appear semantically plausible at the language level while still being spatially inconsistent with the current object location, workspace boundary, or obstacle configuration. To mitigate this problem, the proposed framework grounds spatially dependent task elements using runtime Digital Twin information.
Instead of resolving all coordinates during the initial language generation stage, the framework defers spatial instantiation until environment information becomes available from the Digital Twin. In this process, symbolic references in the intermediate task representation, such as object targets, approach directions, or placement regions, are converted into executable spatial parameters using the current environment state. The Digital Twin provides the object pose, workspace geometry, and relevant environmental conditions required for this conversion.
Let the symbolic task specification be denoted by $T_{\mathrm{sym}}$, and let the runtime Digital Twin state be denoted by $S_{\mathrm{DT}}$. The grounding function can then be expressed as

$$T_g = G(T_{\mathrm{sym}}, S_{\mathrm{DT}}) \tag{1}$$

where $T_g$ is the grounded task specification used for robot selection and motion validation. Equation (1) indicates that executable task parameters are not determined solely by the language model but are instantiated through the current environment state.
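The grounding step can be sketched as a function that binds each symbolic target to the pose currently reported by the Digital Twin. This is an illustration under stated assumptions only: the `SymbolicStep` type, the `ground` function, the dictionary-based Digital Twin state, and the object names and poses are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SymbolicStep:
    action: str
    target: str                    # symbolic reference, e.g. "part_1"
    pose: Optional[tuple] = None   # unresolved until grounding


def ground(task, dt_state):
    """Return a grounded copy of the task using the current Digital Twin state.

    dt_state is assumed to map object names to (x, y, z) positions.
    """
    grounded = []
    for step in task:
        if step.target not in dt_state:
            # Refuse to guess a coordinate the Digital Twin does not report.
            raise KeyError(f"target '{step.target}' not found in Digital Twin state")
        grounded.append(replace(step, pose=dt_state[step.target]))
    return grounded


dt_state = {"part_1": (0.42, -0.10, 0.05), "tray_A": (0.60, 0.25, 0.02)}
task = [SymbolicStep("MoveLinear", "part_1"), SymbolicStep("MoveLinear", "tray_A")]
grounded_task = ground(task, dt_state)
```

The symbolic task is left unmodified, so the same specification can be grounded again if the environment state changes before execution.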
This grounding process constitutes the first layer of hallucination mitigation in the proposed framework, complemented by dynamics validation at the second layer. By deferring environment-dependent parameter binding until runtime, the framework reduces coordinate-level physical hallucination while ensuring that downstream feasibility analysis is performed on motion parameters reflecting the actual execution context.
3.4. Feasibility-Aware Robot Selection
3.4.1. Overall Fitness Formulation
After the task has been structurally represented and grounded in the runtime environment, the framework selects a suitable robot from the heterogeneous robot pool. The purpose of this step is not merely to identify a robot that can nominally perform the task, but to select an embodiment that is appropriate for subsequent safe execution under grounded task conditions.
The fitness score of robot i is computed as a weighted sum:

$$F_i = \omega_p S_{p,i} + \omega_r S_{r,i} + \omega_d S_{d,i}$$

where $S_{p,i}$, $S_{r,i}$, and $S_{d,i}$ denote the payload, reach, and DoF fitness scores, respectively. The default weights are set to $\omega_p = 0.6$ and $\omega_r = \omega_d = 0.2$.
Before scoring, hard feasibility constraints are applied: any robot failing to satisfy payload, reach, or DoF requirements is excluded from consideration regardless of partial scores. This prevents physically infeasible robots from entering the ranking stage.
The weights reflect the physical priority hierarchy in HMLV manufacturing environments. Payload capacity is treated as a near-hard constraint—a robot incapable of handling the required load renders the task physically infeasible regardless of other attributes—and is therefore assigned the highest weight (ωp = 0.6). Reach and DoF govern task flexibility rather than feasibility and are weighted equally at lower values (ωr = ωd = 0.2). These values were adopted as default engineering settings motivated by the task priority structure of HMLV manufacturing rather than as universally optimal parameters; future work may explore data-driven weight calibration for specific production contexts.
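The filtering and weighting logic described above can be sketched as follows. The default weights are taken from the text; the `Robot` and `TaskRequirement` types, the robot names, and the numeric specifications are hypothetical, and the component scores are treated as given inputs in [0, 1] rather than computed here.

```python
from dataclasses import dataclass

# Default weights stated in the text: payload 0.6, reach 0.2, DoF 0.2.
WEIGHTS = {"payload": 0.6, "reach": 0.2, "dof": 0.2}


@dataclass
class Robot:
    name: str
    payload_kg: float
    reach_m: float
    dof: int


@dataclass
class TaskRequirement:
    payload_kg: float
    reach_m: float
    dof: int


def is_feasible(robot, task):
    """Hard constraints: any shortfall excludes the robot before scoring."""
    return (robot.payload_kg >= task.payload_kg
            and robot.reach_m >= task.reach_m
            and robot.dof >= task.dof)


def fitness(s_payload, s_reach, s_dof, w=WEIGHTS):
    """Weighted sum of the three component scores."""
    return w["payload"] * s_payload + w["reach"] * s_reach + w["dof"] * s_dof


task = TaskRequirement(payload_kg=4.0, reach_m=0.7, dof=6)
pool = [Robot("small_arm", 3.0, 0.5, 6), Robot("mid_arm", 5.0, 0.85, 6)]
candidates = [r.name for r in pool if is_feasible(r, task)]
```

Applying the hard filter first means a robot with an insufficient payload can never be ranked, however well it scores on reach or DoF.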
3.4.2. Payload Score: Tri-Modal Design
A central design feature of the selection module is the payload fitness score, which adopts a tri-modal design based on the payload ratio R, that is, the robot’s rated payload divided by the payload required by the task.
The payload score is defined piecewise as follows:
In the Gaussian mode, the score peaks at α = 1.2, which represents the design-optimal safety margin, and σ = 0.2 controls sensitivity around this point. In the Log-Penalty mode (R > 3.0), the score decreases on a log scale, strongly penalizing over-specification while preserving non-zero scores across the full over-capacity region. Here, β is a penalty coefficient controlling the decay rate in the over-specification region. In this study, β = 0.8 is adopted so that robots with excessive payload margins receive progressively lower scores, thereby discouraging unnecessary resource over-allocation even when task execution remains physically feasible. This property prevents the discriminability collapse observed in linear normalization methods when all feasible robots are substantially over-specified relative to the task requirement.
3.4.3. Reach and DoF Scores
The reach score rewards robots that satisfy the workspace requirement while discouraging unnecessary oversizing. More specifically, the reach score is assigned according to whether the robot’s maximum reach satisfies the grounded task workspace requirement and the degree of excess reach beyond the required threshold. The DoF score applies a discrete penalty based on kinematic appropriateness: an exact DoF match receives the highest score (1.0), whereas redundant DoF configurations receive a reduced score (0.8) due to increased inverse-kinematics complexity. Robots with insufficient DoF are filtered out by the hard feasibility constraints before the scoring stage.
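The discrete DoF rule stated above is simple enough to sketch directly. The function name `dof_score` is hypothetical; the score values 1.0 and 0.8 are those given in the text, and insufficient DoF is treated as an error because such robots are assumed to have been removed by the hard feasibility constraints.

```python
def dof_score(robot_dof: int, required_dof: int) -> float:
    """Discrete DoF fitness: exact match scores 1.0, redundant DoF scores 0.8."""
    if robot_dof < required_dof:
        # Should not occur after hard-constraint filtering.
        raise ValueError("insufficient DoF: robot should have been filtered out")
    return 1.0 if robot_dof == required_dof else 0.8


# A 6-DoF task evaluated against a 6-DoF and a 7-DoF robot.
scores = {dof: dof_score(dof, required_dof=6) for dof in (6, 7)}
```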
3.4.4. Final Robot Selection
The grounded task specification $T_g$ is evaluated against all candidate robots in the pool, and the robot with the highest total score is selected:

$$r^{*} = \arg\max_{i} F_i(T_g)$$

where $F_i$ is the total fitness score of robot i and the maximization runs over the robots that satisfy the hard feasibility constraints. The selected robot is then passed to the downstream dynamics validation stage. This ordering is important because dynamic feasibility is embodiment-dependent: the same grounded task may be safe for one robot and infeasible for another depending on torque limits, geometry, and motion profile.
3.5. RNEA-Based Dynamics Validation
Even when a task is semantically correct and a candidate robot is selected, the resulting motion may still be dynamically infeasible. For example, required joint torques may exceed the limits of the selected robot due to an aggressive motion profile, insufficient payload margin, or an unfavorable configuration. To address this risk, the proposed framework performs pre-execution dynamics validation before vendor-specific code is finalized.
The validation step is based on the Recursive Newton-Euler Algorithm (RNEA), which computes required joint torques through forward and backward recursion with O(n) complexity, where n is the number of joints. Let the selected robot trajectory be represented by the joint state variables $q(t)$, $\dot{q}(t)$, and $\ddot{q}(t)$. The required torque vector τ(t) satisfies the rigid-body dynamics equation

$$\tau = M(q)\,\ddot{q} + C(q, \dot{q})\,\dot{q} + g(q)$$

where M(q) is the inertia matrix, $C(q, \dot{q})\,\dot{q}$ is the Coriolis and centrifugal term, and g(q) is the gravitational torque vector.
The grounded motion is classified as dynamically feasible if $\lvert \tau_j(t) \rvert \le \tau_j^{\max}$ for every joint j throughout execution, where $\tau_j^{\max}$ is the torque limit of joint j. Otherwise, it is classified as infeasible and passed to the adaptation stage. Semantic task plausibility does not guarantee dynamic safety: a generated motion command may appear correct from the perspective of task logic while still violating the physical constraints of the robot expected to execute it.
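A full RNEA implementation does not fit in a short example, so the sketch below uses a deliberately simplified stand-in: a single revolute joint carrying a point mass on a massless link. In that case the Coriolis and centrifugal term vanishes and the rigid-body equation reduces to an inertial term plus a gravity term. The function names, the mass, link length, torque limit, and trajectory samples are all hypothetical; only the per-joint torque-limit check follows the text.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def required_torque(q, qdd, mass, length):
    """tau = M * qdd + g(q) for a point mass on a massless link.

    q is the joint angle measured from the horizontal, in radians.
    With a single joint there is no Coriolis or centrifugal contribution.
    """
    inertia = mass * length ** 2
    gravity = mass * G * length * math.cos(q)
    return inertia * qdd + gravity


def max_torque_ratio(trajectory, mass, length, tau_max):
    """Largest |tau| / tau_max over all (q, qdd) samples of the trajectory."""
    return max(abs(required_torque(q, qdd, mass, length)) / tau_max
               for q, qdd in trajectory)


# Samples: holding horizontally, accelerating while horizontal,
# and accelerating while vertical.
trajectory = [(0.0, 0.0), (0.0, 4.0), (math.pi / 2, 4.0)]
ratio = max_torque_ratio(trajectory, mass=2.0, length=0.5, tau_max=10.0)
feasible = ratio <= 1.0
```

In this example the static pose alone is feasible, but the accelerated horizontal sample pushes the ratio above 1, which is the kind of case that would be handed to the adaptation stage.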
3.6. Adaptive Parameter Scaling
When predicted torque exceeds the admissible limit, the framework does not immediately reject the generated task. Instead, it computes a global scaling factor SF to reduce motion intensity while preserving the trajectory shape. The scaling factor is defined as

$$SF = \min\left(1,\ \min_{j \in J} \frac{\eta\,\tau_j^{\max}}{\max_t \lvert \tau_j(t) \rvert}\right) \tag{9}$$

where $\eta$ = 0.9 is a safety margin, J is the set of joints, $\tau_j(t)$ is the RNEA-computed torque of joint j, and $\tau_j^{\max}$ is the torque limit of joint j. The scale factor satisfies SF ≤ 1.0 and is applied uniformly to all velocity and acceleration parameters. Operationally, reducing the execution speed and acceleration through the global scaling factor is equivalent to stretching the trajectory in time while preserving its spatial path.
The value η = 0.9 is motivated by two complementary considerations. From an operational standpoint, Doosan Robotics specifies in its official programming manual that workpiece weight must not exceed rated payload with a 10% margin [28], reflecting industry-established practice for collaborative robot deployment. From a modeling uncertainty standpoint, KUKA’s LoadDataDetermination documentation reports that mass determination accuracy for high-payload robots is typically within 10% of rated payload [29], indicating that parameter uncertainty of this magnitude propagates into RNEA-computed torque predictions. The value was therefore adopted as a conservative engineering margin motivated by industrial practice and inherent rigid-body modeling uncertainty, rather than as a formal guarantee of physical safety under all operating conditions.
If the scaled trajectory satisfies the torque constraints, the motion proceeds to final code generation. Otherwise, the task is reported as infeasible under the selected robot and execution condition.
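The adaptation step can be sketched as follows, assuming the scaling factor takes the form of the most restrictive ratio between the margin-adjusted torque limit and the peak predicted torque, capped at 1.0. That form, the function names, and the numeric values are assumptions for illustration; only η = 0.9, the cap SF ≤ 1.0, and the uniform application to velocity and acceleration come from the text.

```python
def scaling_factor(peak_torques, torque_limits, eta=0.9):
    """Global scaling factor over all joints (assumed form, see lead-in).

    peak_torques[j] is the largest predicted |tau_j| along the trajectory,
    torque_limits[j] is the torque limit of joint j.
    """
    ratios = [eta * limit / peak
              for peak, limit in zip(peak_torques, torque_limits)
              if peak > 0]
    return min(1.0, min(ratios, default=1.0))


def scale_motion(velocity, acceleration, sf):
    """Apply the same factor to the velocity and acceleration parameters."""
    return velocity * sf, acceleration * sf


# Joint 0 sits exactly at the margin-adjusted limit; joint 1 is overloaded.
sf = scaling_factor(peak_torques=[45.0, 120.0], torque_limits=[50.0, 100.0])
scaled_velocity, scaled_acceleration = scale_motion(0.5, 2.0, sf)
# The scaled motion would then be passed through the dynamics check again
# before code generation, as described in the text.
```

Because a single factor is applied to every motion parameter, the most heavily loaded joint determines how much the whole motion is slowed.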
3.7. Vendor-Specific Code Generation
Once the task has passed the verification stages, the framework generates vendor-specific control code for the selected robot platform. At this point, the task has already been represented, grounded, assigned to a robot embodiment, and checked for dynamic feasibility. Accordingly, the purpose of this final stage is not to determine correctness, but to instantiate a verified task into the syntax and command structure required by the target robot language.
The vendor-specific code generation in Equation (10) was also implemented using Gemini 2.5 Pro with retrieval-augmented prompting. At this stage, the LLM receives the verified TDL specification and applies vendor-specific mapping rules, which encode command correspondences such as MoveLinear → movel() for Doosan DRL together with corresponding mappings for KUKA KRL and ABB RAPID, to produce syntactically correct executable code. Because the TDL has already passed all verification stages at this point, code generation reduces to a syntax-constrained rendering task in which no safety-critical decisions remain; the structural and physical validity of the task has already been assured by the upstream pipeline.
Let $C_v$ denote the vendor-specific code for vendor v. The final code generation step can be represented as

$$C_v = \Phi_v(T_g, r^{*}, SF) \tag{10}$$

where $T_g$ is the grounded task specification, $r^{*}$ is the selected robot, and SF is the validated scaling factor. Here $\Phi_v$ denotes the vendor-specific code generation function that maps the verified task specification to the target robot language. Equation (10) indicates that code generation is performed only after the task has already passed environment grounding, embodiment selection, and feasibility validation.
The mapping process converts the structured intermediate task representation into platform-specific program elements, including motion commands, coordinate declarations, procedure structure, and I/O operations. This ordering reflects the main design principle of the proposed framework: executable syntax alone is not sufficient evidence of safe deployability. In the proposed method, vendor-specific code is generated only after the task has passed Digital Twin grounding and embodiment-aware physical feasibility verification.
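The rendering step can be illustrated with a small template-based sketch. In the framework this step is performed by the LLM with retrieved mapping rules; the deterministic table below only shows the idea of mapping a verified step to vendor syntax. Only the correspondence MoveLinear → movel for Doosan DRL comes from the text. The `COMMAND_MAP` table, the `render` function, the step dictionary, and in particular the argument layout inside the template are illustrative assumptions that would need to be checked against the vendor's language reference.

```python
# Hypothetical mapping table; only MoveLinear -> movel (Doosan DRL) is from the text.
COMMAND_MAP = {
    "doosan_drl": {
        "MoveLinear": "movel(posx({pose}), vel={vel}, acc={acc})",
    },
}


def render(step, vendor, sf):
    """Render one verified step as a vendor-specific command string.

    The validated scaling factor sf is applied to velocity and acceleration
    before the values are written into the template.
    """
    template = COMMAND_MAP[vendor][step["action"]]
    pose = ", ".join(f"{value:g}" for value in step["pose"])
    return template.format(
        pose=pose,
        vel=round(step["vel"] * sf, 3),
        acc=round(step["acc"] * sf, 3),
    )


step = {"action": "MoveLinear", "pose": (400, -100, 50, 0, 180, 0),
        "vel": 200.0, "acc": 400.0}
line = render(step, vendor="doosan_drl", sf=0.75)
```

An action or vendor missing from the table raises a `KeyError` rather than producing a guessed command, which keeps unverified syntax out of the output.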
5. Results and Discussion
5.1. Intermediate Representation Reliability
The first experiment evaluated whether the proposed pipeline can reliably transform natural-language task descriptions into structured task specifications suitable for downstream verification. As shown in Figure 2a, the intermediate representation generation stage maintained perfect syntactic integrity for both simple and complex commands, indicating that the generated task structures were consistently machine-parsable and suitable for subsequent processing. Back-Translation Fidelity (BTF), however, decreased from simple to complex commands, suggesting that semantic compression occurs more frequently in multi-step or conditionally structured instructions than in single-action requests.
For the adversarial subset, the framework behaved differently. As shown in Figure 2b, 70% of adversarial inputs were blocked by the safety filter prior to execution. These were cases in which physically impossible conditions, such as impossible payload demands or targets outside the defined workspace, were detectable through explicit safety-rule checking without requiring dynamics computation. The remaining 30% were classified as physically ambiguous and forwarded to the downstream RNEA-based validation layer, where they were identified as infeasible and blocked prior to execution. This two-stage behavior is consistent with the design of the proposed framework: unsafe or clearly infeasible commands are filtered as early as possible, whereas ambiguous or partially plausible commands are further evaluated through physics-aware verification. A known limitation of this design is the potential for false positives at the filtering stage: valid but structurally complex commands may occasionally be blocked before reaching dynamics validation. The current filter is intentionally conservative to prioritize safety; future work may explore adaptive thresholding to reduce over-blocking while preserving the safety-oriented behavior of the current pipeline.
Overall, these results indicate that the representation stage provides sufficiently stable structured input for the downstream grounding and verification modules, while also highlighting that representation reliability alone does not guarantee safe deployment. Accordingly, the following subsections examine whether the grounded task can be assigned to an appropriate robot embodiment and safely executed under embodiment-aware physical validation.
5.2. Robot Selection Performance
The robot selection results demonstrate that the proposed feasibility-aware scoring mechanism provides more task-consistent embodiment assignment than conventional baselines. As shown in Figure 3a, the proposed method remained close to the design-optimal payload ratio α = 1.2 across the standard task scenarios A–F. In these scenarios, the proposed method and Greedy showed similar behavior, whereas TOPSIS consistently favored higher-capacity robots and therefore produced much larger payload ratios. Linear Weighted Sum remained close to the target margin in most standard cases but did so without preserving meaningful discrimination in the over-specification regime.
The differences become more pronounced in the ultra-lightweight stress-test tasks G and H. As shown in Figure 3b, both TOPSIS and Linear WS selected extremely over-capable robots, reaching 25× and 50× payload ratios in the most extreme cases. By contrast, the proposed method and Greedy remained within a substantially lower range. This indicates that the proposed selection mechanism is robust not only under ordinary operating conditions, but also under edge cases in which all feasible candidates are heavily over-specified.
However, Figure 3 also shows that the proposed method does not always yield the lowest payload ratio among feasible robots. In Task G, Greedy selects a robot closer to the target payload margin than the proposed method. This is expected because Greedy optimizes only for minimum feasible payload, whereas the proposed method jointly considers payload, reach, and DoF suitability. The goal of the proposed selector is therefore not to minimize payload ratio in isolation, but to provide a more balanced embodiment assignment within the overall verification pipeline.
The reason for this behavior becomes clearer in Figure 4. The proposed tri-modal payload score peaks at α = 1.2 in the Gaussian region and transitions to a Log-Penalty decay for R > 3.0, thereby preserving non-zero ranked scores throughout the over-specification region. By contrast, Linear WS hard-clips to zero beyond its threshold, eliminating discrimination when all feasible candidates exceed the same upper region. This explains why the proposed method remains informative in Tasks G and H whereas conventional linear normalization collapses structurally.
Taken together, these results suggest that the robot selection module contributes to the framework not by dominating all baselines in every scenario, but by maintaining task-consistent and discriminative embodiment assignment across both ordinary and edge-case conditions.
5.3. Dynamics Validation and Adaptive Scaling
The dynamics validation results confirm that semantic task plausibility does not guarantee embodiment-level execution safety. As shown in Figure 5, several scenarios that were feasible at the task and embodiment-selection levels still exceeded the actuator torque limits when evaluated using the RNEA-based dynamics model. For UR5e, the Critical scenario produced a maximum torque ratio of 1.224 before scaling. For Franka Panda, both the Heavy and Critical scenarios exceeded the feasibility threshold, reaching torque ratios of 1.873 and 2.773, respectively.
Applying the adaptive scaling rule in Equation (9) restored feasibility in all analytically recoverable overload cases. For UR5e, the Critical scenario was brought below the torque threshold using SF = 0.816. For Franka Panda, the Heavy and Critical scenarios required SF = 0.534 and SF = 0.361, respectively. As shown in Figure 5, all scaled cases fell below the feasibility limit after temporal adjustment.
These results directly support the main safety claim of the proposed framework. A task may appear executable after language interpretation and embodiment assignment yet still violate actuator-level constraints once the generated motion profile is evaluated dynamically. The RNEA-based validation layer reveals this hidden infeasibility, while the adaptive scaling module converts recoverable overload cases into safe executable motions. In this sense, scaling is not a minor refinement step, but an essential feasibility-recovery mechanism within the verification pipeline.
5.4. Ablation Study Results
The ablation study further clarifies the contribution of the main selection and validation modules. Removing the payload score caused the largest degradation in robot assignment efficiency, increasing the average payload ratio substantially and confirming that payload-aware scoring is the dominant factor in keeping the selected embodiment near the desired safety margin. Removing the DoF score led to misallocation in the 7-DoF scenario, showing that kinematic appropriateness cannot be inferred from payload and reach alone.
The role of the Log-Penalty term is more selective but equally important. Removing the Log-Penalty yields little change in the standard tasks because the optimal solutions in A–F all lie in the Gaussian region, where the selection behavior remains effectively unchanged. By contrast, in Tasks G and H all feasible candidates lie in the over-specification regime, and the Log-Penalty becomes the only mechanism that preserves ranked non-zero scores. This explains why the ablation produces little visible difference in ordinary cases but causes discriminability collapse in the ultra-lightweight stress-test scenarios.
On the dynamics-validation side, removing adaptive scaling reduced the safe execution rate from 100% to 50%, indicating that validation alone is insufficient when the goal is not merely to detect overload but also to recover executable motions. Removing the safety margin weakened the conservativeness of the corrected trajectory, reducing the robustness buffer intended to account for the simulation-to-reality gap. The “w/o RNEA” condition showed no detected overloads, but this should not be interpreted as evidence of safety; rather, it indicates that the framework lost the analytical mechanism required to detect embodiment-specific dynamic infeasibility in the first place.
Taken together, these ablation results show that the proposed framework derives its effectiveness not from a single dominant component, but from the interaction of representation, selection, validation, and adaptation modules.
5.5. Integrated Discussion
Taken together, the results validate the proposed framework as a pre-execution verification pipeline for LLM-based robot programming in heterogeneous robot systems. The intermediate representation stage provides stable structured input, the robot selection stage improves embodiment assignment under both standard and stress-test conditions, and the dynamics validation stage prevents semantically plausible but physically unsafe motions from being deployed without prior correction. In this sense, the framework should be understood not merely as a code generation chain, but as a deployment-oriented verification architecture.
A useful interpretation of the framework is that it mitigates physical hallucination at two complementary levels. The first level is spatial grounding: by deferring coordinate instantiation until runtime Digital Twin information becomes available, the framework reduces the risk of environment-inconsistent targets. The second level is dynamic feasibility: by validating grounded motions against embodiment-specific torque constraints and adapting them through scaling, the framework prevents semantically plausible but physically unsafe trajectories from being executed directly. Together, these two layers form the core safety logic of the proposed framework.
The results also highlight an important implication for industrial robot programming. In heterogeneous environments, nominal embodiment suitability and execution-time physical safety cannot be treated as equivalent criteria. A robot that appears appropriate at the task-allocation stage may still become unsafe under the generated motion profile, while some initially infeasible motions may be recovered through controlled temporal scaling. This separation of concerns helps explain why multi-vendor robot programming should be approached not only as a code-generation problem, but as a verification-centered deployment problem.
Several limitations remain. First, analytical RNEA validation was performed only for UR5e and Franka Panda because reliable inertial parameter models were available for these robots. Extending closed-form torque analysis to the remaining robots would require manufacturer-disclosed inertial data. Second, the current validation framework focuses primarily on torque feasibility and does not yet incorporate other safety factors such as collision avoidance, compliance behavior, thermal limits, or uncertainty-aware control margins. Third, no direct comparison was conducted against raw vendor-specific code generated directly from natural language without the intermediate task representation. Such a comparison would require a controlled baseline capable of generating syntactically valid vendor-specific code across multiple robot platforms without task abstraction, which in practice depends heavily on platform-specific prompt engineering and code-generation heuristics outside the current framework scope. This comparison is therefore left as an important direction for future work.
Fourth, the proposed framework assumes that the Digital Twin provides sufficiently accurate spatial grounding for pre-execution verification. In practice, Digital Twins of real HMLV environments are subject to sim-to-reality discrepancies arising from sensor noise, calibration drift, and communication latency. The safety margin η = 0.9 partially accounts for these uncertainties by conservatively bounding RNEA-computed torques below hardware limits. Nevertheless, high-frequency dynamics and transient disturbances not captured by rigid-body models remain a known limitation. Future integration of real-time DT synchronization and sensor fusion would further close this gap.
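As a rough illustration of what this margin buys, consider a simplified error model that is assumed here and not stated in the paper: the true joint torque exceeds the RNEA prediction by at most a relative factor ε. The symbols τ_i^RNEA, τ_i^true, τ_{i,max}, and ε are introduced only for this illustration.

```latex
|\tau_i^{\mathrm{RNEA}}| \le \eta\,\tau_{i,\max}
\quad\text{and}\quad
|\tau_i^{\mathrm{true}}| \le (1+\varepsilon)\,|\tau_i^{\mathrm{RNEA}}|
\;\Longrightarrow\;
|\tau_i^{\mathrm{true}}| \le (1+\varepsilon)\,\eta\,\tau_{i,\max} \le \tau_{i,\max}
\quad\text{whenever}\quad
\varepsilon \le \frac{1}{\eta} - 1 = \frac{1}{0.9} - 1 \approx 0.111 .
```

Under this assumption, η = 0.9 absorbs a torque underestimate of up to roughly 11%. Larger deviations, such as transient disturbances outside the rigid-body model, are not covered by the margin alone.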
Fifth, all experimental validation in this study was conducted in simulation. Although PyBullet-based dynamics simulation and RNEA-based analytical verification provide systematic pre-execution feasibility assessment, they do not replace physical robot experiments. Real-world deployment involves additional sources of uncertainty—including actuator nonlinearity, joint friction, cable routing effects, and environmental perturbations—that are not fully captured by rigid-body models. Validation on a physical robot platform is identified as an immediate next step to confirm that the proposed framework’s safety margins and scaling behavior transfer reliably to real hardware.
Sixth, the safety filter used in the early-stage adversarial blocking step is intentionally conservative and may occasionally produce false positives, that is, valid but structurally complex commands that are rejected before reaching the RNEA-based dynamics validation layer. As noted in Section 5.1, this represents a deliberate trade-off between computational efficiency and coverage. Future work may explore adaptive thresholding or learned filtering criteria to reduce over-blocking while preserving the safety-oriented behavior of the pipeline.
6. Conclusions
This paper proposed a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. Rather than treating natural-language robot programming as a pure code-generation problem, the proposed framework incorporated a structured intermediate task representation, runtime spatial grounding, feasibility-aware robot selection, RNEA-based dynamics validation, and adaptive motion scaling into a unified pre-execution pipeline. In this architecture, vendor-specific code generation is positioned as the final realization stage of a verification-centered workflow rather than as the primary source of execution correctness.
The experimental results supported the effectiveness of this approach at multiple levels. The intermediate representation stage provided sufficiently stable structured input for downstream grounding and verification, while the robot selection module maintained task-consistent and discriminative embodiment assignment across both standard and ultra-lightweight stress-test scenarios. The dynamics validation results further showed that semantically plausible motions may still violate actuator-level constraints under a selected embodiment and that adaptive scaling can recover analytically feasible overload cases. In particular, the safe execution rate improved from 50% to 100% when the full validation-and-scaling pipeline was applied.
These findings suggest that the key challenge in multi-vendor robot programming is not only how to generate syntactically executable code but also how to ensure that the generated behavior is grounded in the runtime environment, assigned to an appropriate embodiment, and verified against physical execution constraints before deployment. From this perspective, the main contribution of the present work is to frame LLM-based heterogeneous robot programming as a verification-centered deployment problem rather than a language-to-code translation problem alone.
Several directions remain for future work. First, analytical dynamics validation should be extended to a broader set of industrial robots as reliable inertial parameter models become available. Second, the current framework focuses primarily on torque feasibility and can be expanded to include additional safety factors such as collision risk, compliance behavior, thermal constraints, and uncertainty-aware control margins. Third, future studies should investigate controlled direct-generation baselines for vendor-specific code without intermediate task abstraction in order to further clarify the comparative value of verification-ready task representations in heterogeneous industrial environments.