LLM-Based Adaptive Control Code Generation Framework with Digital Twin-Integrated Verification for Heterogeneous Robot Systems

Lee, Young-Hoon; Nam, Taemin; Cho, Deun-Sol; Kim, Won-Tae

doi:10.3390/app16083883

¹ Major of Future Convergence Engineering, School of Computer Science and Engineering, Korea University of Technology and Education, Cheonan-si 31253, Republic of Korea

² Department of Computer Science and Engineering, Korea University of Technology and Education, Cheonan-si 31253, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3883; https://doi.org/10.3390/app16083883

Submission received: 20 March 2026 / Revised: 9 April 2026 / Accepted: 13 April 2026 / Published: 16 April 2026

(This article belongs to the Special Issue Digital Twin and IoT, 2nd Edition)

Abstract

High-Mix Low-Volume (HMLV) manufacturing increasingly relies on heterogeneous robot fleets, but automatic generation of vendor-specific robot control code remains difficult due to platform fragmentation and safety-critical feasibility constraints. Although recent Large Language Model (LLM)-based approaches have shown promise for translating natural language into robot programs, they remain largely limited to single-platform or simulation-oriented settings and are vulnerable to physical hallucination, including spatially inconsistent commands and dynamically infeasible motions. This paper proposes a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. The framework uses a structured intermediate task representation to support runtime spatial grounding, robot selection, pre-execution dynamics validation, and adaptive motion scaling before vendor-specific code generation and execution. Evaluation on 170 task-description scenarios and eight robot selection tasks showed improved ranking discriminability in lightweight stress cases where conventional baselines exhibited limited separation. In addition, adaptive dynamics scaling enabled safe execution in all analytically verified test cases, compared with 50% without scaling. These results suggest that Digital Twin-grounded verification and adaptive feasibility control can improve the reliability of LLM-based multi-vendor robot programming and help mitigate physical hallucination in heterogeneous robot systems.

Keywords: heterogeneous robots; LLM-based code generation; physical hallucination; Digital Twin; dynamics validation; adaptive scaling; HMLV manufacturing

1. Introduction

The manufacturing paradigm is undergoing a fundamental shift from mass production to High-Mix Low-Volume (HMLV) systems, driven by growing consumer demand for customized products and the need for agile supply chains [1]. In HMLV environments, a single production line must handle diverse product variants simultaneously, which increasingly requires heterogeneous robot fleets composed of manipulators with different kinematic structures, payload capacities, and operational ranges from multiple vendors [2]. However, each manufacturer provides proprietary programming languages—such as RAPID for ABB, KRL for KUKA, and URScript for Universal Robots—creating a fragmented ecosystem in which process changeovers often require engineers to rewrite control code for each robot individually [3,4,5]. Existing interoperability technologies partly alleviate this burden at the levels of communication and data exchange, but they do not fully resolve the problem of generating executable, behavior-consistent vendor-specific robot programs from a shared task intent. As a result, frequent product changes in HMLV production continue to impose substantial engineering effort, downtime, and integration costs.

Large Language Models (LLMs) have recently emerged as a promising approach for translating natural language instructions into robot control logic [6,7]. By allowing users to specify tasks at a higher semantic level, LLMs offer a potential path toward more flexible robot programming in multi-product production environments. However, direct deployment of LLM-generated robot code in industrial settings remains challenging, particularly when the target system consists of heterogeneous robots with different physical capabilities and vendor-specific execution models. Beyond platform fragmentation, generated code may contain physically invalid instructions, such as spatial targets inconsistent with the actual environment or motion parameters that exceed the selected robot’s dynamic limits. This phenomenon, which has been discussed in recent robotics and AI literature as physical hallucination [8,9,10], is especially problematic in real-world robotic deployment because it can lead not only to task failure but also to unsafe motion and equipment risk.

Prior studies have demonstrated the feasibility of language-driven robot programming in fixed APIs, single-platform embodiments, or simulation-centered environments [6,7,11]. Meanwhile, Digital Twin (DT)-based approaches have shown strong potential for simulation, monitoring, and pre-deployment validation in industrial automation [12,13,14,15]. However, relatively limited attention has been paid to how LLM-generated robot programs can be systematically verified and adapted before execution in heterogeneous real-world robot systems. In particular, a unified framework that combines structured task representation, runtime environment grounding, robot suitability assessment, and pre-execution dynamics validation has not been sufficiently explored.

To address this gap, this paper proposes a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. The framework first transforms natural-language task requirements into a structured intermediate task representation that preserves task-level intent and execution semantics across vendor-specific implementations, thereby supporting behavioral consistency even when low-level syntax, coordinate representations, and motion primitives differ. Spatial references are then grounded using runtime Digital Twin information, and an appropriate robot is selected according to task-level and physical feasibility requirements. Before execution, the generated motion sequence is analyzed through Recursive Newton-Euler Algorithm (RNEA)-based dynamics validation, and when infeasible conditions are detected, a global scaling factor is automatically applied to adjust motion parameters within safe limits. Vendor-specific executable code is generated only after these verification and adaptation steps are completed.

The main contributions of this paper are as follows:

We propose a Digital Twin-integrated verification framework for LLM-based robot code generation in heterogeneous robot systems, targeting physical hallucination and pre-execution safety assurance.
We develop a validation pipeline that combines runtime spatial grounding, robot selection, RNEA-based torque analysis, and adaptive global motion scaling to detect and correct physically infeasible execution conditions before deployment.
We employ a structured intermediate task representation that supports transformation from natural-language instructions into verification-ready robot task specifications across heterogeneous platforms.
We empirically validate the proposed framework across a heterogeneous robot pool, showing that RNEA-based adaptive scaling achieves full feasibility coverage where unscaled execution fails and that the task-aware robot selection mechanism outperforms payload-only baselines in lightweight load scenarios.

2. Related Work

2.1. Robot Programming and Interoperability in Multi-Vendor Systems

Industrial robot programming has traditionally relied on online teaching via teach pendants and offline programming (OLP) [16], both of which require substantial per-platform effort because each robot manufacturer defines its own proprietary coordinate conventions, motion commands, and I/O interfaces [2]. In heterogeneous industrial environments, this fragmentation makes process changeovers costly and time-consuming, as equivalent task logic must often be reimplemented separately for each robot platform.

To reduce this burden, a variety of interoperability-oriented technologies have been introduced. Middleware frameworks such as ROS2 [17] standardize inter-process communication through publish-subscribe architectures, while OPC UA [18] and AutomationML [19] support vendor-neutral data exchange and system integration. PLCopen [20] further provides standardized motion control function blocks for industrial automation. Although these approaches improve communication interoperability and data-model compatibility, they do not directly support transformation of a shared task specification into executable vendor-specific robot programs while preserving behavioral consistency across platforms.

In parallel, robot assignment in multi-robot environments has been studied using multi-criteria decision-making (MCDM) methods such as TOPSIS and Linear Weighted Sum, which rank candidate robots according to factors such as payload, reach, and kinematic compatibility [21]. These methods are useful for structured comparison among candidate embodiments, but their discriminability may degrade when multiple robots similarly satisfy nominal task requirements, particularly in lightweight task scenarios. More importantly, such approaches address robot selection as a ranking problem, rather than as part of a unified framework for verified multi-vendor task realization. Consequently, the problem of translating a common task intent into behavior-consistent, vendor-specific, and physically feasible execution remains insufficiently addressed.

2.2. LLM-Based Robot Code Generation

Recent studies have explored the use of Large Language Models (LLMs) for translating natural language instructions into robot control logic. Liang et al. [6] proposed Code-as-Policies, demonstrating that LLMs can generate executable Python code that invokes predefined robot APIs. While the approach showed strong compositional generalization, it assumed a fixed API structure tied to a single platform and did not address hardware-grounded feasibility constraints. Singh et al. [7] introduced ProgPrompt, which structures prompts using programmatic environment representations, but its focus is primarily on high-level task planning rather than vendor-specific motion code generation across heterogeneous robots.

Other studies have similarly shown the promise of language-driven robot programming while remaining constrained in embodiment scope. Vemprala et al. [11] demonstrated prompt engineering strategies for robot control code generation, but validation was limited to single-platform settings. Zitkovich et al. [22] introduced RT-2, showing generalization to previously unseen instructions through vision-language-action modeling; however, the resulting policy remains closely tied to the embodiment and execution context represented in training. Overall, existing work supports the feasibility of language-conditioned robot programming, but a systematic framework for generating vendor-specific control code from a unified task specification while ensuring runtime grounding and physical feasibility across heterogeneous platforms has not yet been demonstrated.

2.3. Verification, Validation, and Physical Feasibility for Robot Control

Verification and validation of robot control code have traditionally been addressed through formal methods, simulation-based testing, and model-based analysis. Formal approaches such as model checking [23] provide mathematically rigorous correctness guarantees, but their applicability is often limited in complex real-world robotic systems due to scalability challenges and the difficulty of representing continuous dynamics and rich environmental interactions [24]. In practice, simulation-based verification using commercial tools such as RobotStudio [25] and KUKA.Sim [26] has been more widely adopted for pre-deployment validation of robot programs.

For dynamic feasibility analysis in such pre-execution settings, Recursive Newton-Euler Algorithm (RNEA)-based torque computation has been widely used in robot dynamics and offline trajectory evaluation [27]. Owing to its computational efficiency and suitability for embodiment-specific torque estimation, RNEA provides a practical basis for determining whether a planned motion can be executed within the dynamic limits of a selected robot. At the same time, Digital Twin technology has extended simulation-based verification by enabling bidirectional synchronization between physical systems and their virtual counterparts [12]. Such DT-based approaches have been applied to predictive maintenance [13], production scheduling [14], and robot cell configuration validation [15], thereby improving the realism and responsiveness of pre-deployment analysis.

However, existing verification and validation frameworks have largely been developed for manually authored programs or deterministic control logic, where program structure and task intent can be assumed to be internally consistent. This assumption becomes weaker in LLM-based robot programming, where generated instructions may be semantically plausible while still being spatially inconsistent with the runtime environment, mismatched to the selected robot embodiment, or dynamically infeasible under actual torque constraints. As a result, verifying LLM-generated robot code requires not only conventional simulation or logic checking, but also runtime grounding and embodiment-aware physical feasibility assessment. These gaps motivate the Digital Twin-grounded pre-execution verification and dynamics validation framework proposed in this work.

3. Methodology

3.1. Framework Overview

This paper proposes a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. The objective of the framework is not only to translate natural-language task descriptions into vendor-specific robot programs, but also to verify whether the generated task can be safely and feasibly executed by a selected robot embodiment before deployment.

As illustrated in Figure 1, the proposed framework consists of five stages. First, a natural-language task request is transformed into a structured intermediate task representation that preserves task-level intent while remaining independent of vendor-specific syntax. Second, spatially ambiguous task elements are grounded using runtime Digital Twin information so that execution parameters reflect the actual environment rather than coordinates inferred during initial language generation. Third, the framework evaluates candidate robots in the heterogeneous robot pool and selects an embodiment that is physically suitable for the grounded task. Fourth, the grounded motion $T_{g}$ is analyzed through Recursive Newton-Euler Algorithm (RNEA)-based dynamics validation to determine whether execution exceeds the torque limits of the selected robot. Finally, when infeasible conditions are detected, a global scaling factor is applied to reduce motion intensity within safe bounds before vendor-specific code is generated.

From a system perspective, the proposed method treats verification as an integral part of code generation rather than as a post hoc testing step. In this way, task transformation, runtime grounding, embodiment selection, and dynamic feasibility assurance are integrated into a single pre-execution workflow.

It is important to clarify the role of the LLM within this architecture. The LLM functions as a structured translation engine rather than a safety decision-maker: it converts natural-language task instructions into verification-ready intermediate specifications, and later renders verified specifications into vendor-specific executable syntax. Safety-critical decisions—including spatial grounding, robot selection, and dynamic feasibility assurance—are made entirely by the deterministic modules in the downstream pipeline. This separation of language-based translation from physics-based verification is a central design principle of the proposed framework.

3.2. Structured Intermediate Task Representation

To support consistent downstream verification across heterogeneous robot platforms, the framework first converts natural-language task descriptions into a structured intermediate task representation. In this paper, the representation is employed as a verification-ready interface between language input and robot-specific execution, rather than as a standalone language-design contribution.

The representation captures the essential elements of task execution, including action type, target entity, motion intent, task ordering, and execution constraints. By explicitly structuring these elements, the framework reduces ambiguity in downstream processing and provides a common interface for Digital Twin grounding, robot selection, and dynamics validation. This is particularly important in heterogeneous robot environments, where the same task intent must later be instantiated in different robot languages and execution models.
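As an illustration only, the elements listed above could be held in a small platform-independent record. The class and field names below (`TaskStep`, `TaskSpec`, `pose_ref`, and so on) are hypothetical and do not reproduce the actual grammar of the intermediate representation, which is not given in this section.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskStep:
    """One task step, kept free of vendor-specific syntax (hypothetical schema)."""
    order: int                      # task ordering
    action: str                     # action type, e.g. "MoveLinear"
    target: str                     # target entity, referenced symbolically
    intent: str                     # motion intent, e.g. "approach" or "place"
    pose_ref: Optional[str] = None  # semantic placeholder, resolved later from the Digital Twin
    constraints: dict = field(default_factory=dict)  # execution constraints


@dataclass
class TaskSpec:
    """Collection of steps plus a task-level payload requirement."""
    steps: list
    payload_kg: float

    def unresolved_refs(self):
        """Return the symbolic pose references, in execution order, that still need grounding."""
        ordered = sorted(self.steps, key=lambda s: s.order)
        return [s.pose_ref for s in ordered if s.pose_ref is not None]


# Illustrative task: names and numbers are made up for this sketch.
spec = TaskSpec(
    steps=[
        TaskStep(2, "MoveLinear", "tray_A", "place", pose_ref="tray_A.slot_1",
                 constraints={"max_speed_mm_s": 250}),
        TaskStep(1, "MoveLinear", "part_7", "approach", pose_ref="part_7.top"),
    ],
    payload_kg=2.0,
)
print(spec.unresolved_refs())
```

Keeping poses as named placeholders rather than coordinates is what lets the later grounding stage fill them in from the current environment state.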

A key requirement of the representation is the preservation of task-level semantics across platforms. In this work, behavioral consistency refers to preserving the intended execution semantics of a task even when low-level syntax, coordinate conventions, and motion primitives differ across vendors. For this reason, task information is stored in a platform-independent form wherever possible, so that feasibility can be assessed before the final program is instantiated in a vendor-specific language.

Accordingly, the intermediate representation functions as the common substrate of the proposed framework. It enables stable transformation from natural-language instructions into machine-processable task specifications and provides the structured information required by the verification modules described in the following subsections.

The transformation from natural language into the intermediate representation, referred to below as TDL, was implemented using Gemini 2.5 Pro (Google DeepMind, London, UK) with retrieval-augmented prompting, without task-specific fine-tuning. The LLM functions as a structured translation engine that converts user natural-language instructions into verification-ready TDL specifications by following explicit grammar rules and parameter-preservation constraints provided in retrieved prompt context documents. The retrieved context consisted of four categories of structured documents: TDL grammar and command definitions, vendor-specific mapping tables, parameter-preservation constraints, and output-format requirements. Each prompt specification comprised approximately 250–300 lines of structured guidance. No per-task manual customization was applied during experimental evaluation; human effort was concentrated in the initial preparation of grammar specifications and mapping rules.

A representative prompt structure is as follows:

[Role] You are an expert robot programming assistant specializing in TDL-based task representation.
[Task] Convert the natural-language instruction into a verification-ready TDL specification preserving all task-critical parameters.
[Constraints] Use only defined TDL commands. Leave pose fields as semantic placeholders for Digital Twin grounding. Output only the final TDL script in the required format.

Importantly, the LLM is not responsible for execution safety. Physical feasibility assurance is handled entirely by the downstream Digital Twin grounding and RNEA-based validation modules described in Sections 3.3–3.6. This separation ensures that safety-critical decisions are made analytically rather than by language model inference.

3.3. Runtime Spatial Grounding with Digital Twin Information

One of the major risks in LLM-based robot code generation is that spatial parameters may be generated without sufficient reference to the actual execution environment. A generated instruction may appear semantically plausible at the language level while still being spatially inconsistent with the current object location, workspace boundary, or obstacle configuration. To mitigate this problem, the proposed framework grounds spatially dependent task elements using runtime Digital Twin information.

Instead of resolving all coordinates during the initial language generation stage, the framework defers spatial instantiation until environment information becomes available from the Digital Twin. In this process, symbolic references in the intermediate task representation, such as object targets, approach directions, or placement regions, are converted into executable spatial parameters using the current environment state. The Digital Twin provides the object pose, workspace geometry, and relevant environmental conditions required for this conversion.

Let the symbolic task specification be denoted by $T_{s}$, and let the runtime Digital Twin state be denoted by $D_{T}$. The grounding function can then be expressed as

$$T_{g} = G(T_{s}, D_{T}), \tag{1}$$

where $T_{g}$ is the grounded task specification used for robot selection and motion validation. Equation (1) indicates that executable task parameters are not determined solely by the language model but are instantiated through the current environment state.

This grounding process constitutes the first layer of hallucination mitigation in the proposed framework, complemented by dynamics validation at the second layer. By deferring environment-dependent parameter binding until runtime, the framework reduces coordinate-level physical hallucination while ensuring that downstream feasibility analysis is performed on motion parameters reflecting the actual execution context.
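The grounding step of Equation (1) can be sketched as a lookup that binds each symbolic reference to a pose reported by the Digital Twin. The data layout below (a `poses` dictionary and an axis-aligned `workspace` box) is an assumption made for this sketch, and the workspace test is a simplified stand-in for whatever geometry checks the actual system performs.

```python
def ground(task_steps, twin_state):
    """Sketch of T_g = G(T_s, D_T): bind symbolic pose references to Digital Twin poses.

    task_steps: list of dicts, each with a symbolic "pose_ref" (hypothetical format).
    twin_state: dict with "poses" (reference name -> (x, y, z) in metres) and
                "workspace" (pair of (min, max) corner tuples). Both keys are assumed.
    """
    lower, upper = twin_state["workspace"]
    grounded = []
    for step in task_steps:
        ref = step["pose_ref"]
        if ref not in twin_state["poses"]:
            raise KeyError(f"no Digital Twin pose for reference {ref!r}")
        pose = twin_state["poses"][ref]
        # Reject targets outside the workspace instead of passing them downstream.
        if not all(lo <= c <= hi for c, lo, hi in zip(pose, lower, upper)):
            raise ValueError(f"{ref!r} at {pose} lies outside the workspace")
        grounded.append({**step, "pose": pose})  # the symbolic input is left unmodified
    return grounded


# Illustrative environment state and task; all values are made up.
twin = {
    "poses": {"part_7.top": (0.40, 0.10, 0.25)},
    "workspace": ((-0.8, -0.8, 0.0), (0.8, 0.8, 1.2)),
}
t_s = [{"action": "MoveLinear", "pose_ref": "part_7.top"}]
t_g = ground(t_s, twin)
print(t_g[0]["pose"])
```

Because the coordinates come from `twin_state` rather than from the language model, a stale or invented coordinate cannot reach the later stages through this path.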

3.4. Feasibility-Aware Robot Selection

3.4.1. Overall Fitness Formulation

After the task has been structurally represented and grounded in the runtime environment, the framework selects a suitable robot from the heterogeneous robot pool. The purpose of this step is not merely to identify a robot that can nominally perform the task, but to select an embodiment that is appropriate for subsequent safe execution under grounded task conditions.

The fitness score of robot $i$ is computed as a weighted sum:

$$S_{\mathrm{total}}(i) = \omega_{p} S_{p}(i) + \omega_{r} S_{r}(i) + \omega_{d} S_{d}(i), \tag{2}$$

where $S_{p}(i)$, $S_{r}(i)$, and $S_{d}(i)$ denote the payload, reach, and DoF fitness scores, respectively. The default weights are set to $\omega_{p} = 0.6$, $\omega_{r} = 0.2$, and $\omega_{d} = 0.2$.

Before scoring, hard feasibility constraints are applied: any robot failing to satisfy payload, reach, or DoF requirements is excluded from consideration regardless of partial scores. This prevents physically infeasible robots from entering the ranking stage.

The weights reflect the physical priority hierarchy in HMLV manufacturing environments. Payload capacity is treated as a near-hard constraint—a robot incapable of handling the required load renders the task physically infeasible regardless of other attributes—and is therefore assigned the highest weight (ω_p = 0.6). Reach and DoF govern task flexibility rather than feasibility and are weighted equally at lower values (ω_r = ω_d = 0.2). These values were adopted as default engineering settings motivated by the task priority structure of HMLV manufacturing rather than as universally optimal parameters; future work may explore data-driven weight calibration for specific production contexts.
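As a quick numerical illustration of Equation (2), suppose a robot that passes the hard constraints has component scores $S_{p} = 0.9$, $S_{r} = 1.0$, and $S_{d} = 0.8$. These three values are hypothetical and are not taken from the experiments. With the default weights, the weighted sum evaluates to:

```latex
\omega_{p} S_{p} + \omega_{r} S_{r} + \omega_{d} S_{d}
  = 0.6(0.9) + 0.2(1.0) + 0.2(0.8)
  = 0.54 + 0.20 + 0.16
  = 0.90
```

The payload term contributes more than half of the total here, which is the intended effect of the 0.6 weight.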

3.4.2. Payload Score: Tri-Modal Design

A central design feature of the selection module is the payload fitness score, which adopts a tri-modal design based on the payload ratio

$$R = \frac{\text{robot payload}}{\text{task payload}}. \tag{3}$$

The payload score is defined piecewise as follows:

$$S_{p}(R) = 0, \quad \text{if } R < 1.0 \tag{4}$$

$$S_{p}(R) = \exp\left(-\frac{(R - \alpha)^{2}}{2\sigma^{2}}\right), \quad \text{if } 1.0 \leq R \leq 3.0 \tag{5}$$

$$S_{p}(R) = \exp\left(-\beta \ln(R/\alpha)\right), \quad \text{if } R > 3.0 \tag{6}$$

In the Gaussian mode, the score peaks at α = 1.2, which represents the design-optimal safety margin, and σ = 0.2 controls sensitivity around this point. In the Log-Penalty mode (R > 3.0), the score decreases on a log scale, strongly penalizing over-specification while preserving non-zero scores across the full over-capacity region. Here, β is a penalty coefficient controlling the decay rate in the over-specification region. In this study, β = 0.8 is adopted so that robots with excessive payload margins receive progressively lower scores, thereby discouraging unnecessary resource over-allocation even when task execution remains physically feasible. This property prevents the discriminability collapse observed in linear normalization methods when all feasible robots are substantially over-specified relative to the task requirement.
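The piecewise definition in Equations (3)–(6) translates directly into a short function. The sketch below uses the parameter values stated in the text; the function name and the input check on the task payload are additions made for this example.

```python
import math

ALPHA = 1.2  # payload ratio at which the Gaussian mode peaks (from the text)
SIGMA = 0.2  # sensitivity around the peak (from the text)
BETA = 0.8   # decay coefficient in the over-specification region (from the text)


def payload_score(robot_payload_kg, task_payload_kg):
    """Tri-modal payload score S_p(R) following Equations (3)-(6)."""
    if task_payload_kg <= 0:
        raise ValueError("task payload must be positive")
    r = robot_payload_kg / task_payload_kg           # Eq. (3)
    if r < 1.0:                                      # Eq. (4): robot cannot carry the load
        return 0.0
    if r <= 3.0:                                     # Eq. (5): Gaussian mode
        return math.exp(-((r - ALPHA) ** 2) / (2.0 * SIGMA ** 2))
    return math.exp(-BETA * math.log(r / ALPHA))     # Eq. (6): same as (r / ALPHA) ** -BETA


# Illustrative payload pairs in kg; not taken from the experiments.
for robot_kg, task_kg in [(3.0, 5.0), (6.0, 5.0), (25.0, 2.0)]:
    print(robot_kg, task_kg, round(payload_score(robot_kg, task_kg), 4))
```

Since $\exp(-\beta \ln(R/\alpha)) = (R/\alpha)^{-\beta}$, the third branch is a power-law decay, which is why the score stays strictly positive however large the payload margin becomes.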

3.4.3. Reach and DoF Scores

The reach score rewards robots that satisfy the workspace requirement while discouraging unnecessary oversizing. More specifically, the reach score is assigned according to whether the robot’s maximum reach satisfies the grounded task workspace requirement and the degree of excess reach beyond the required threshold. The DoF score applies a discrete penalty based on kinematic appropriateness: an exact DoF match receives the highest score (1.0), whereas redundant DoF configurations receive a reduced score (0.8) due to increased inverse-kinematics complexity. Robots with insufficient DoF are filtered out by the hard feasibility constraints before the scoring stage.

3.4.4. Final Robot Selection

The grounded task specification $T_{g}$ is evaluated against all candidate robots in the pool, and the robot with the highest total score is selected:

$$i^{*} = \arg\max_{i} S_{\mathrm{total}}(i). \tag{7}$$

The selected robot $i^{*}$ is then passed to the downstream dynamics validation stage. This ordering is important because dynamic feasibility is embodiment-dependent: the same grounded task may be safe for one robot and infeasible for another depending on torque limits, geometry, and motion profile.
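The selection procedure of this subsection (hard feasibility filter, weighted score from Equation (2), then the maximum from Equation (7)) can be sketched as follows. The payload and DoF scores follow the text. The reach score is an assumption, because no closed form is given for it; the ratio used here is only a placeholder that equals 1.0 at an exact match and decreases with excess reach. The robot names and specifications are likewise made up and are not the entries of Table 1.

```python
import math

WEIGHTS = {"payload": 0.6, "reach": 0.2, "dof": 0.2}  # default weights from Eq. (2)


def payload_score(ratio, alpha=1.2, sigma=0.2, beta=0.8):
    """Eqs. (4)-(6) as a function of the payload ratio R."""
    if ratio < 1.0:
        return 0.0
    if ratio <= 3.0:
        return math.exp(-((ratio - alpha) ** 2) / (2.0 * sigma ** 2))
    return math.exp(-beta * math.log(ratio / alpha))


def reach_score(robot_reach_m, required_reach_m):
    """ASSUMPTION: placeholder form, required reach divided by available reach."""
    return required_reach_m / robot_reach_m


def dof_score(robot_dof, required_dof):
    """1.0 for an exact DoF match, 0.8 for redundant DoF (values from the text)."""
    return 1.0 if robot_dof == required_dof else 0.8


def select_robot(robots, task):
    """Filter by hard constraints, score the survivors, return (name, score) or None."""
    best = None
    for name, spec in robots.items():
        # Hard constraints: drop robots that fail payload, reach, or DoF.
        if (spec["payload"] < task["payload"] or spec["reach"] < task["reach"]
                or spec["dof"] < task["dof"]):
            continue
        total = (WEIGHTS["payload"] * payload_score(spec["payload"] / task["payload"])
                 + WEIGHTS["reach"] * reach_score(spec["reach"], task["reach"])
                 + WEIGHTS["dof"] * dof_score(spec["dof"], task["dof"]))
        if best is None or total > best[1]:
            best = (name, total)
    return best


# Hypothetical pool and task (payload in kg, reach in m).
robots = {
    "arm_small": {"payload": 3.0, "reach": 0.6, "dof": 6},
    "arm_mid": {"payload": 6.0, "reach": 0.9, "dof": 6},
    "arm_large": {"payload": 25.0, "reach": 1.8, "dof": 7},
}
task = {"payload": 5.0, "reach": 0.8, "dof": 6}
print(select_robot(robots, task))
```

In this made-up pool the smallest arm is removed by the hard payload constraint, and the mid-sized arm outranks the large one because its payload ratio sits at the peak of the Gaussian mode.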

3.5. RNEA-Based Dynamics Validation

Even when a task is semantically correct and a candidate robot is selected, the resulting motion may still be dynamically infeasible. For example, required joint torques may exceed the limits of the selected robot due to an aggressive motion profile, insufficient payload margin, or an unfavorable configuration. To address this risk, the proposed framework performs pre-execution dynamics validation before vendor-specific code is finalized.

The validation step is based on the Recursive Newton-Euler Algorithm (RNEA), which computes required joint torques through forward and backward recursion with O(n) complexity. Let the selected robot trajectory be represented by the joint state variables $q$, $\dot{q}$, and $\ddot{q}$. The required torque vector $\tau(t)$ satisfies the rigid-body dynamics equation

$$\tau(t) = M(q)\ddot{q} + C(q, \dot{q})\dot{q} + g(q), \tag{8}$$

where $M(q)$ is the inertia matrix, $C(q, \dot{q})$ is the Coriolis and centrifugal term, and $g(q)$ is the gravitational torque vector.

The grounded motion is considered dynamically feasible if $|\tau_{j}(t)| \leq \tau_{\max, j}$ for every joint throughout execution. If this condition is satisfied, the motion is classified as feasible. Otherwise, it is classified as infeasible and passed to the adaptation stage. Semantic task plausibility does not guarantee dynamic safety: a generated motion command may appear correct from the perspective of task logic while still violating the physical constraints of the robot expected to execute it.
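The feasibility condition itself is a simple check over joints and time samples once the torques are known. The sketch below implements that check. To stay self-contained it does not run RNEA; instead it produces torques from a one-joint toy model (a point mass on a massless link), for which Equation (8) reduces to $\tau = m l^{2} \ddot{q} + m g l \cos q$ with no Coriolis term. The toy model, its parameter values, and the function names are assumptions of this example.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def single_link_torque(q, qdd, mass, length):
    """Required torque for a point mass on a massless link, q measured from horizontal.

    Toy stand-in for RNEA: Eq. (8) with one joint, where M = m * l^2, the
    Coriolis term vanishes, and g(q) = m * g * l * cos(q).
    """
    return mass * length ** 2 * qdd + mass * G * length * math.cos(q)


def is_feasible(torque_samples, torque_limits):
    """Check |tau_j(t)| <= tau_max_j for every joint j and every time sample t.

    torque_samples: list over time of per-joint torque lists.
    torque_limits:  per-joint limits tau_max_j.
    """
    return all(abs(tau) <= limit
               for sample in torque_samples
               for tau, limit in zip(sample, torque_limits))


# 2 kg at 0.5 m, accelerating at 4 rad/s^2 through three sampled angles (illustrative).
samples = [[single_link_torque(q, 4.0, 2.0, 0.5)] for q in (0.0, 0.3, 0.6)]
print(is_feasible(samples, [15.0]), is_feasible(samples, [10.0]))
```

For a real manipulator the `samples` list would be filled by an RNEA implementation evaluated along the planned trajectory; the check function would stay the same.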

3.6. Adaptive Parameter Scaling

When predicted torque exceeds the admissible limit, the framework does not immediately reject the generated task. Instead, it computes a global scaling factor $SF$ to reduce motion intensity while preserving the trajectory shape. The scaling factor is defined as

$$SF = \frac{\eta}{\max_{j \in J,\, t}\left(\frac{|\tau_{j}(t)|}{\tau_{\max, j}}\right)}, \tag{9}$$

where $J$ is the set of joints, $\tau_{\max, j}$ is the torque limit of joint $j$, and $\eta = 0.9$ is a safety margin. The margin is motivated by two complementary considerations. From an operational standpoint, Doosan Robotics specifies in its official programming manual that workpiece weight must not exceed rated payload with a 10% margin [28], reflecting industry-established practice for collaborative robot deployment. From a modeling uncertainty standpoint, KUKA’s LoadDataDetermination documentation reports that mass determination accuracy for high-payload robots is typically within 10% of rated payload [29], indicating that parameter uncertainty of this magnitude propagates into RNEA-computed torque predictions. The value $\eta = 0.9$ was therefore adopted as a conservative engineering margin motivated by industrial practice and inherent rigid-body modeling uncertainty, rather than as a formal guarantee of physical safety under all operating conditions.

The scaling factor satisfies $SF \leq 1.0$ and is applied uniformly to all velocity and acceleration parameters. Operationally, reducing the execution speed and acceleration through the global scaling factor is equivalent to stretching the trajectory in time while preserving its spatial path.

If the scaled trajectory satisfies the torque constraints, the motion proceeds to final code generation. Otherwise, the task is reported as infeasible under the selected robot and execution condition.
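The scaling rule of Equation (9) can be sketched in a few lines. The function names and the parameter dictionary are hypothetical. Returning 1.0 for motions that already satisfy the torque limits is an assumption of this sketch, made because the text applies scaling only when a limit is exceeded.

```python
def scaling_factor(torque_samples, torque_limits, eta=0.9):
    """SF = eta / max over joints and time of |tau_j(t)| / tau_max_j  (Eq. 9)."""
    worst = max(abs(tau) / limit
                for sample in torque_samples
                for tau, limit in zip(sample, torque_limits))
    if worst <= 1.0:
        return 1.0  # ASSUMPTION: already-feasible motions are left unscaled
    return eta / worst


def scale_motion(params, sf):
    """Apply SF uniformly to the velocity and acceleration parameters, as the text describes."""
    return {"vel": params["vel"] * sf, "acc": params["acc"] * sf}


# Two joints, two time samples (illustrative torques in N*m).
samples = [[30.0, 12.0], [45.0, 9.0]]
limits = [30.0, 15.0]
sf = scaling_factor(samples, limits)   # worst ratio is 45 / 30 = 1.5, so SF = 0.9 / 1.5
scaled = scale_motion({"vel": 250.0, "acc": 1000.0}, sf)
print(sf, scaled)
```

In this example the worst-case ratio is 1.5, giving $SF = 0.9 / 1.5 = 0.6$. The scaled motion would then be re-evaluated against the torque limits before code generation, as the paragraph above states.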

3.7. Vendor-Specific Code Generation

Once the task has passed the verification stages, the framework generates vendor-specific control code for the selected robot platform. At this point, the task has already been represented, grounded, assigned to a robot embodiment, and checked for dynamic feasibility. Accordingly, the purpose of this final stage is not to determine correctness, but to instantiate a verified task into the syntax and command structure required by the target robot language.

The vendor-specific code generation in Equation (10) was also implemented using Gemini 2.5 Pro with retrieval-augmented prompting. At this stage, the LLM receives the verified TDL specification and applies vendor-specific mapping rules, which encode command correspondences such as MoveLinear → movel() for Doosan DRL and corresponding mappings for KUKA KRL and ABB RAPID, to produce syntactically correct executable code. Since the TDL has already passed all verification stages at this point, code generation reduces to a syntax-constrained rendering task in which no safety-critical decisions remain; the structural and physical validity of the task has already been assured by the upstream pipeline.

Let $C_{v}$ denote the vendor-specific code for vendor $v$. The final code generation step can be represented as

$$C_{v} = F(T_{g}, i^{*}, SF), \tag{10}$$

where $T_{g}$ is the grounded task specification, $i^{*}$ is the selected robot, and $SF$ is the validated scaling factor. Here $F(\cdot)$ denotes the vendor-specific code generation function that maps the verified task specification to the target robot language. Equation (10) indicates that code generation is performed only after the task has already passed environment grounding, embodiment selection, and feasibility validation.

The mapping process converts the structured intermediate task representation into platform-specific program elements, including motion commands, coordinate declarations, procedure structure, and I/O operations. This ordering reflects the main design principle of the proposed framework: executable syntax alone is not sufficient evidence of safe deployability. In the proposed method, vendor-specific code is generated only after the task has passed Digital Twin grounding and embodiment-aware physical feasibility verification.
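The mapping-table part of this stage can be illustrated with a small lookup. In the framework the final rendering is performed by the LLM, so the deterministic function below is only a stand-in for the table lookup and the application of the scaling factor. Only the MoveLinear → movel() correspondence for Doosan DRL comes from the text; the `LIN` and `MoveL` entries are the commonly documented linear-motion primitives of KRL and RAPID and are included as assumptions, and the record format is hypothetical.

```python
# Command correspondences for a linear move. Only the Doosan DRL entry is stated
# in the text; the KRL and RAPID names are assumptions added for this sketch.
LINEAR_PRIMITIVE = {
    "doosan_drl": "movel",
    "kuka_krl": "LIN",
    "abb_rapid": "MoveL",
}


def scaled_linear_move(vendor, pose, vel, acc, sf):
    """Return a record for one verified MoveLinear step (hypothetical format).

    The record names the target primitive and carries the SF-scaled motion
    parameters; turning it into final vendor syntax is left to the rendering stage.
    """
    if vendor not in LINEAR_PRIMITIVE:
        raise ValueError(f"no mapping rule for vendor {vendor!r}")
    if not 0.0 < sf <= 1.0:
        raise ValueError("scaling factor must lie in (0, 1]")
    return {
        "primitive": LINEAR_PRIMITIVE[vendor],
        "pose": tuple(pose),
        "vel": vel * sf,
        "acc": acc * sf,
    }


# Illustrative pose and motion parameters; units are left to the caller.
record = scaled_linear_move("doosan_drl", (400, 100, 250, 0, 180, 0), 250.0, 1000.0, 0.5)
print(record)
```

Separating the lookup from the syntax rendering keeps the scaled parameters in one auditable place, whichever vendor language is produced afterwards.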

4. Experimental Setup

4.1. Simulation Environment

Experiments were conducted in a physics-engine-based simulation environment to evaluate the proposed framework under heterogeneous robot execution conditions. PyBullet (v3.2.7) was used as the rigid-body dynamics simulator, configured with gravitational acceleration of −9.81 m/s² and a simulation frequency of 240 Hz. All six robots in the candidate pool were tested in simulation using manufacturer-provided URDF models. In this study, the term Digital Twin refers to a synchronized virtual environment used for runtime spatial grounding and pre-execution verification, rather than a fully bidirectional industrial DT system with real-time physical–virtual synchronization. The experimental evaluation is based on simulation-based and partially analytical validation rather than on fully validated shop-floor deployment.

To complement simulation-based validation, analytical RNEA verification was additionally performed for UR5e and Franka Panda using validated Denavit–Hartenberg parameter models from the Robotics Toolbox for Python (v1.1.1). These two robots were selected for analytical verification because peer-reviewed inertial parameter models are publicly available, enabling closed-form torque evaluation consistent with the dynamics formulation in Equation (8). For the remaining four robots, PyBullet URDF-based simulation served as the practical validation mechanism, as manufacturer-disclosed inertial parameters required for closed-form RNEA were not available for these platforms. All stochastic experiments used a fixed random seed of 42 to ensure reproducibility.

4.2. Robot Platforms

The evaluation used a heterogeneous robot pool consisting of six manipulators spanning payload capacities from 3 kg to 25 kg. The set was constructed to cover lightweight collaborative robots, medium-payload manipulators, and a heavy-duty industrial robot, thereby reflecting the embodiment diversity expected in HMLV manufacturing environments. Table 1 summarizes the robot specifications used in the experiments.

4.3. Evaluation Scenarios and Metrics

The experiments were designed to evaluate the proposed framework at three levels:

reliability of the intermediate task representation produced from natural-language input;
effectiveness of the feasibility-aware robot selection mechanism;
effectiveness of RNEA-based dynamics validation and adaptive scaling for safe execution assurance.

4.3.1. Evaluation of Intermediate Representation Reliability

To evaluate the reliability of natural language-to-representation conversion, 170 test scenarios were constructed across three complexity categories. Simple commands (80 cases) consisted of single-action instructions intended to test basic parsing stability. Complex commands (80 cases) included conditional and iterative structures in order to test whether logical causality and ordering constraints were preserved in the generated representation. Adversarial or out-of-domain commands (10 cases) included physically infeasible or clearly abnormal requests to evaluate robustness of the safety filtering stage.

Three metrics were used. First, Syntactic Integrity Rate measures whether the generated representation can be parsed without error and directly executed in simulation. Second, Semantic Alignment Rate evaluates whether the generated representation preserves the intent of the original command through an automated back-translation protocol. In this process, the generated representation is converted back into natural language using an LLM-augmented reverse pass, and semantic similarity between the reconstructed text and the original instruction is computed using a hybrid TF-IDF and keyword-matching method. Third, Safety Filtering Rate measures the proportion of adversarial inputs correctly blocked prior to execution. A rule-based template-matching baseline was included for the semantic alignment experiment as a lower-bound reference.
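The hybrid similarity used in the back-translation protocol can be sketched as a weighted combination of a TF-IDF cosine score and a keyword-retention score. The paper does not specify the tokenizer, the IDF variant, the keyword list, or the weighting, so everything below (the function names, the smoothed IDF computed over the two texts only, and the default weight `w_tfidf=0.5`) is an assumption made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(text_a, text_b):
    """Cosine similarity of TF-IDF vectors built over the two texts only."""
    counts = [Counter(tokenize(text_a)), Counter(tokenize(text_b))]
    vocab = set(counts[0]) | set(counts[1])
    n_docs = len(counts)
    # Smoothed IDF (assumption) so that shared terms keep a positive weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in c for c in counts))) + 1.0
        for term in vocab
    }
    vec_a = [counts[0][term] * idf[term] for term in vocab]
    vec_b = [counts[1][term] * idf[term] for term in vocab]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm > 0 else 0.0


def keyword_match(original, reconstructed, keywords):
    """Fraction of task keywords found in the original that survive back-translation."""
    expected = {k.lower() for k in keywords} & set(tokenize(original))
    if not expected:
        return 1.0  # nothing to preserve
    return len(expected & set(tokenize(reconstructed))) / len(expected)


def hybrid_similarity(original, reconstructed, keywords, w_tfidf=0.5):
    """Weighted mix of TF-IDF cosine and keyword retention (the weight is an assumption)."""
    return (w_tfidf * tfidf_cosine(original, reconstructed)
            + (1.0 - w_tfidf) * keyword_match(original, reconstructed, keywords))


if __name__ == "__main__":
    original = "pick the red box and place it on the tray"
    back_translated = "pick the box"
    print(hybrid_similarity(original, back_translated, ["pick", "box", "place", "tray"]))
```

A back-translation that drops the placement step keeps only half of the keywords, so its score falls below that of a faithful reconstruction; this is the kind of semantic compression the metric is meant to expose.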

Although this evaluation concerns the upstream representation stage, its role in this paper is not to establish the representation itself as the main contribution. Rather, it serves to verify whether the structured intermediate specification used in Section 3.2 provides sufficiently stable input for the downstream grounding and verification pipeline.

4.3.2. Robot Selection Evaluation

To evaluate the feasibility-aware robot selection method defined in Equations (2)–(7), eight task scenarios were designed, as summarized in Table 2, and compared against four baselines: TOPSIS, Linear Weighted Sum (Linear WS), Greedy selection based on minimum feasible payload, and Random selection from the feasible candidate set. Scenario payload requirements were distributed across the full nominal capacity range of the robot pool, from 0.5 kg to 20.8 kg, so that the experiments would cover both ordinary and stress-test conditions rather than clustering around a single robot specification.

Tasks A–F represent standard manufacturing scenarios with payloads between 2.0 kg and 20.8 kg and with both 6-DoF and 7-DoF requirements. Tasks G and H represent ultra-lightweight handling scenarios with payloads of 1.0 kg and 0.5 kg, respectively. These two scenarios were intentionally designed to expose discriminability failure in conventional methods when all feasible robots are substantially over-specified. In particular, Task G places all candidates in the R > 1.8 region, while Task H places all candidates in the R > 3.0 region, thereby directly testing the necessity of the Log-Penalty mode in Equation (6). Hard feasibility constraints on payload, reach, and DoF were applied to all methods prior to scoring.

The composition of the robot pool also supports direct evaluation of the tri-modal payload score in Equations (3)–(6) because it includes both near-optimal candidates around the target margin α = 1.2 and strongly over-specified candidates in the R > 3.0 region where the Log-Penalty mode becomes active.

The primary evaluation metric for robot selection was the deviation from the design-optimal safety margin, ∣R − α∣, where α = 1.2 is the target ratio used in Equation (5). This metric was chosen because the proposed method does not minimize raw payload ratio toward 1.0, but instead seeks a moderate safety margin that balances feasibility and resource efficiency. Additional summary measures included average payload ratio and match rate with the proposed method across all scenarios. These metrics were selected to align directly with the design rationale of the tri-modal payload score and to support interpretation of the stress-test scenarios in Section 5.2.
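The deviation metric and the Greedy baseline are simple enough to state in code. The sketch below assumes that the payload ratio is R = rated capacity / task payload, which is consistent with the ratios reported for the lightweight tasks; the robot names and the intermediate capacities in `pool` are hypothetical placeholders, since only the 3 kg and 25 kg endpoints of the pool are stated in this section.

```python
ALPHA = 1.2  # target payload ratio from the text


def payload_ratio(capacity_kg, task_payload_kg):
    """R = rated capacity / task payload (definition assumed from context)."""
    return capacity_kg / task_payload_kg


def margin_deviation(capacity_kg, task_payload_kg, alpha=ALPHA):
    """Deviation from the design-optimal safety margin, |R - alpha|."""
    return abs(payload_ratio(capacity_kg, task_payload_kg) - alpha)


def greedy_select(capacities, task_payload_kg):
    """Greedy baseline: the smallest-capacity robot that can still carry the payload."""
    feasible = {name: c for name, c in capacities.items() if c >= task_payload_kg}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)


if __name__ == "__main__":
    # Hypothetical pool; only the 3 kg and 25 kg endpoints come from the text.
    pool = {"robot_3kg": 3.0, "robot_5kg": 5.0, "robot_10kg": 10.0, "robot_25kg": 25.0}
    for payload in (2.0, 0.5):
        choice = greedy_select(pool, payload)
        print(payload, choice, round(margin_deviation(pool[choice], payload), 3))
```

With a 0.5 kg payload even the smallest robot in this placeholder pool sits at R = 6, so every candidate lies deep in the over-specification region and the deviation is large for all of them.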

4.3.3. Dynamics Validation Evaluation

To evaluate the pre-execution dynamics validation stage defined in Equation (8) and the adaptive scaling rule in Equation (9), all six robots were first tested in PyBullet across four payload scenarios: Light (0.5 kg), Medium (2.0 kg), Heavy (4.0 kg), and Critical (6.0 kg). For UR5e and Franka Panda, analytical RNEA verification was additionally performed using validated DH parameter models, with 60 random joint configurations generated per scenario under seed 42 in order to estimate worst-case torque demand. The remaining four robots were assessed through URDF-based physics simulation only.

The two primary evaluation metrics were maximum torque ratio, defined as the maximum of $|\tau_{j}(t)| / \tau_{\max,j}$ over all joints and sampled time steps, and safe execution rate, defined as the proportion of tested conditions that satisfied the torque feasibility requirement after validation and, if needed, adaptive scaling. Reported quantitative results in Section 5.3 focus on the analytically validated UR5e and Panda cases because these permit direct and precise interpretation of the RNEA-based feasibility condition.
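Both metrics reduce to a few lines once joint torques have been sampled. The sketch below is illustrative only: the torque values are made up, the function names are hypothetical, and the feasibility threshold `limit=1.0` is an assumption (a tighter bound could be substituted to represent a safety margin).

```python
def max_torque_ratio(torques, torque_limits):
    """Maximum of |tau_j(t)| / tau_max_j over all joints j and samples t.

    torques[t][j] is the torque of joint j at sample t; torque_limits[j] > 0.
    """
    return max(
        abs(tau) / limit
        for sample in torques
        for tau, limit in zip(sample, torque_limits)
    )


def safe_execution_rate(ratios, limit=1.0):
    """Share of tested conditions whose maximum torque ratio stays within the limit."""
    return sum(r <= limit for r in ratios) / len(ratios)


if __name__ == "__main__":
    # Made-up two-joint example with two time samples (N·m).
    torques = [[10.0, -30.0], [20.0, 45.0]]
    limits = [50.0, 40.0]
    print(max_torque_ratio(torques, limits))
    print(safe_execution_rate([0.5, 0.8, 1.224, 0.9]))
```

Taking the absolute value before dividing matters because a joint can be overloaded in either direction of rotation.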

This evaluation directly tests the main safety claim of the proposed framework: semantically plausible generated motions should not be assumed physically executable, and adaptive feasibility control should be applied when Equation (8) predicts torque overload under the selected embodiment.

4.3.4. Ablation Study Design

An ablation study was conducted to isolate the contribution of the major selection and validation modules. For robot selection, four conditions were compared: (A) Full System, (B) without Log-Penalty, (C) without Payload Score, and (D) without DoF Score. These settings were chosen to separate the effect of the over-specification control term in Equation (6), the dominant role of payload-aware scoring, and the influence of kinematic appropriateness in multi-criteria ranking.

For dynamics validation, four corresponding conditions were examined: (A) Full System, (B) without Safety Margin (η = 1.0), (C) without Adaptive Scaling, and (D) without RNEA. These comparisons isolate the contributions of the simulation-to-reality buffer, the scaling mechanism in Equation (9), and the torque-analysis stage itself. Together, these ablations were designed to test whether each component contributes independently to safe and discriminative pre-execution decision making.

5. Results and Discussion

5.1. Intermediate Representation Reliability

The first experiment evaluated whether the proposed pipeline can reliably transform natural-language task descriptions into structured task specifications suitable for downstream verification. As shown in Figure 2a, the intermediate representation generation stage maintained perfect syntactic integrity for both simple and complex commands, indicating that the generated task structures were consistently machine-parsable and suitable for subsequent processing. Back-Translation Fidelity (BTF), the score obtained from the back-translation protocol described in Section 4.3.1, decreased from simple to complex commands, however, suggesting that semantic compression occurs more frequently in multi-step or conditionally structured instructions than in single-action requests.

The adversarial subset behaved differently. As shown in Figure 2b, 70% of adversarial inputs were blocked by the safety filter prior to execution. These were cases in which physically impossible conditions, such as impossible payload demands or targets outside the defined workspace, were detectable through explicit safety-rule checking without requiring dynamics computation. The remaining 30% were classified as physically ambiguous and forwarded to the downstream RNEA-based dynamics validation layer, where they were identified as infeasible and blocked prior to execution. This two-stage behavior is consistent with the design of the proposed framework: unsafe or clearly infeasible commands are filtered as early as possible, whereas ambiguous or partially plausible commands are further evaluated through physics-aware verification. A known limitation of this design is the potential for false positives at the filtering stage: valid but structurally complex commands may occasionally be blocked before reaching dynamics validation. The current filter is intentionally conservative to prioritize safety; future work may explore adaptive thresholding to reduce over-blocking while preserving the safety-oriented behavior of the current pipeline.
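The two-stage handling of adversarial inputs can be sketched as a cheap rule check followed by a dynamics check that only runs when the rules pass. The example is a simplified illustration: the `Request` fields, the two rules, the return strings, and the stubbed `dynamics_feasible` callback are assumptions, and the actual filter rules used in the experiments are not listed in this section.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical task request after parsing."""
    payload_kg: float
    target_distance_m: float  # distance of the target from the robot base


def rule_filter(req: Request, max_payload_kg: float, max_reach_m: float) -> Optional[str]:
    """Stage 1: explicit safety rules that need no dynamics computation."""
    if req.payload_kg > max_payload_kg:
        return "payload exceeds capacity"
    if req.target_distance_m > max_reach_m:
        return "target outside workspace"
    return None


def gate(req: Request, max_payload_kg: float, max_reach_m: float,
         dynamics_feasible: Callable[[Request], bool]) -> str:
    """Run the rule filter first; forward to dynamics validation only if it passes."""
    reason = rule_filter(req, max_payload_kg, max_reach_m)
    if reason is not None:
        return f"blocked_by_filter: {reason}"
    if not dynamics_feasible(req):
        return "blocked_by_dynamics"
    return "accepted"


if __name__ == "__main__":
    # Stub standing in for the RNEA-based check (assumption for the demo).
    def stub_dynamics(req: Request) -> bool:
        return req.payload_kg * req.target_distance_m < 3.0

    for req in (Request(40.0, 0.5), Request(4.5, 0.8), Request(1.0, 0.5)):
        print(gate(req, max_payload_kg=5.0, max_reach_m=0.85,
                   dynamics_feasible=stub_dynamics))
```

The second request passes both explicit rules yet is rejected by the dynamics stub, which corresponds to the physically ambiguous inputs that only the downstream validation layer can catch.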

Overall, these results indicate that the representation stage provides sufficiently stable structured input for the downstream grounding and verification modules, while also highlighting that representation reliability alone does not guarantee safe deployment. Accordingly, the following subsections examine whether the grounded task can be assigned to an appropriate robot embodiment and safely executed under embodiment-aware physical validation.

5.2. Robot Selection Performance

The robot selection results demonstrate that the proposed feasibility-aware scoring mechanism provides more task-consistent embodiment assignment than conventional baselines. As shown in Figure 3a, the proposed method remained close to the design-optimal payload ratio α = 1.2 across the standard task scenarios A–F. In these scenarios, the proposed method and Greedy showed similar behavior, whereas TOPSIS consistently favored higher-capacity robots and therefore produced much larger payload ratios. Linear Weighted Sum remained close to the target margin in most standard cases but did so without preserving meaningful discrimination in the over-specification regime.

The differences become more pronounced in the ultra-lightweight stress-test tasks G and H. As shown in Figure 3b, both TOPSIS and Linear WS selected extremely over-capable robots, reaching 25× and 50× payload ratios in the most extreme cases. By contrast, the proposed method and Greedy remained within a substantially lower range. This indicates that the proposed selection mechanism is robust not only under ordinary operating conditions, but also under edge cases in which all feasible candidates are heavily over-specified.

However, Figure 3 also shows that the proposed method does not always yield the lowest payload ratio among feasible robots. In Task G, Greedy selects a robot closer to the target payload margin than the proposed method. This is expected because Greedy optimizes only for minimum feasible payload, whereas the proposed method jointly considers payload, reach, and DoF suitability. The goal of the proposed selector is therefore not to minimize payload ratio in isolation, but to provide a more balanced embodiment assignment within the overall verification pipeline.

The reason for this behavior becomes clearer in Figure 4. The proposed tri-modal payload score peaks at α = 1.2 in the Gaussian region and transitions to a Log-Penalty decay for R > 3.0, thereby preserving non-zero ranked scores throughout the over-specification region. By contrast, Linear WS hard-clips to zero beyond its threshold, eliminating discrimination when all feasible candidates exceed the same upper region. This explains why the proposed method remains informative in Tasks G and H whereas conventional linear normalization collapses structurally.
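The qualitative difference between the two score shapes can be reproduced with a toy model. Equations (3)–(6) are not restated in this section, so the functional forms below are assumptions chosen only to match the described behavior: a Gaussian peaking at α = 1.2, a logarithmic decay beyond R = 3.0 that never reaches zero, and a linear score that is clipped to zero beyond the same threshold. The width `SIGMA`, the treatment of R < 1 as infeasible, and both function names are hypothetical.

```python
import math

ALPHA = 1.2   # peak of the payload score (from the text)
R_LOG = 3.0   # start of the Log-Penalty region (from the text)
SIGMA = 0.8   # Gaussian width (assumption)


def _gaussian(r):
    return math.exp(-((r - ALPHA) ** 2) / (2.0 * SIGMA ** 2))


def trimodal_score(r):
    """Toy three-regime score: infeasible, Gaussian, then log decay (assumed forms)."""
    if r < 1.0:
        return 0.0  # assumed: capacity below the payload is infeasible
    if r <= R_LOG:
        return _gaussian(r)
    # Continuous at R_LOG and strictly positive for any finite r.
    return _gaussian(R_LOG) / (1.0 + math.log(r / R_LOG))


def linear_clipped_score(r, threshold=R_LOG):
    """Toy linear score that hard-clips to zero beyond the threshold (assumed form)."""
    if r < 1.0:
        return 0.0
    return max(0.0, 1.0 - abs(r - ALPHA) / (threshold - ALPHA))


if __name__ == "__main__":
    for r in (1.2, 2.0, 6.0, 25.0, 50.0):
        print(r, round(trimodal_score(r), 4), linear_clipped_score(r))
```

In this toy model every candidate with R above the threshold receives the same linear score of zero, so a ranking among them is impossible, while the log-decay branch still orders R = 6 ahead of R = 25 and R = 50.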

Taken together, these results suggest that the robot selection module contributes to the framework not by dominating all baselines in every scenario, but by maintaining task-consistent and discriminative embodiment assignment across both ordinary and edge-case conditions.

5.3. Dynamics Validation and Adaptive Scaling

The dynamics validation results confirm that semantic task plausibility does not guarantee embodiment-level execution safety. As shown in Figure 5, several scenarios that were feasible at the task and embodiment-selection levels still exceeded the actuator torque limits when evaluated using the RNEA-based dynamics model. For UR5e, the Critical scenario produced a maximum torque ratio of 1.224 before scaling. For Franka Panda, both the Heavy and Critical scenarios exceeded the feasibility threshold, reaching torque ratios of 1.873 and 2.773, respectively.

Applying the adaptive scaling rule in Equation (9) restored feasibility in all analytically recoverable overload cases. In UR5e, the Critical scenario was brought below the torque threshold using SF = 0.816. In Franka Panda, the Heavy and Critical scenarios required SF = 0.534 and SF = 0.361, respectively. As shown in Figure 5, all scaled cases fell below the feasibility limit after temporal adjustment.
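A quick arithmetic check on the reported numbers shows that each scaling factor is close to the reciprocal of the corresponding pre-scaling torque ratio. This is only an observation about the three reported value pairs, computed below; it is not a restatement of Equation (9), which is defined outside this section, and the helper name `reciprocal_gap` is introduced here for illustration.

```python
# (pre-scaling maximum torque ratio, reported scaling factor SF), as given in the text
reported = {
    ("UR5e", "Critical"): (1.224, 0.816),
    ("Franka Panda", "Heavy"): (1.873, 0.534),
    ("Franka Panda", "Critical"): (2.773, 0.361),
}


def reciprocal_gap(ratio, sf):
    """Absolute difference between 1 / ratio and the reported scaling factor."""
    return abs(1.0 / ratio - sf)


if __name__ == "__main__":
    for (robot, scenario), (ratio, sf) in reported.items():
        print(f"{robot:13s} {scenario:9s} 1/ratio={1.0 / ratio:.4f} "
              f"SF={sf:.3f} gap={reciprocal_gap(ratio, sf):.4f}")
```

The gaps are on the order of 0.001 or smaller, which is comparable to the rounding of the three-decimal values reported above.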

These results directly support the main safety claim of the proposed framework. A task may appear executable after language interpretation and embodiment assignment yet still violate actuator-level constraints once the generated motion profile is evaluated dynamically. The RNEA-based validation layer reveals this hidden infeasibility, while the adaptive scaling module converts recoverable overload cases into safe executable motions. In this sense, scaling is not a minor refinement step, but an essential feasibility-recovery mechanism within the verification pipeline.

5.4. Ablation Study Results

The ablation study further clarifies the contribution of the main selection and validation modules. Removing the payload score caused the largest degradation in robot assignment efficiency, increasing the average payload ratio substantially and confirming that payload-aware scoring is the dominant factor in keeping the selected embodiment near the desired safety margin. Removing the DoF score led to misallocation in the 7-DoF scenario, showing that kinematic appropriateness cannot be inferred from payload and reach alone.

The role of the Log-Penalty term is more selective but equally important. Removing the Log-Penalty yields little change in the standard tasks because the optimal solutions in A–F all lie in the Gaussian region, where the selection behavior remains effectively unchanged. By contrast, in Tasks G and H all feasible candidates lie in the over-specification regime, and the Log-Penalty becomes the only mechanism that preserves ranked non-zero scores. This explains why the ablation produces little visible difference in ordinary cases but causes discriminability collapse in the ultra-lightweight stress-test scenarios.

On the dynamics-validation side, removing adaptive scaling reduced the safe execution rate from 100% to 50%, indicating that validation alone is insufficient when the goal is not merely to detect overload but also to recover executable motions. Removing the safety margin weakened the conservativeness of the corrected trajectory, reducing the robustness buffer intended to account for the simulation-to-reality gap. The “w/o RNEA” condition showed no detected overloads, but this should not be interpreted as evidence of safety; rather, it indicates that the framework lost the analytical mechanism required to detect embodiment-specific dynamic infeasibility in the first place.

Taken together, these ablation results show that the proposed framework derives its effectiveness not from a single dominant component, but from the interaction of representation, selection, validation, and adaptation modules.

5.5. Integrated Discussion

Taken together, the results validate the proposed framework as a pre-execution verification pipeline for LLM-based robot programming in heterogeneous robot systems. The intermediate representation stage provides stable structured input, the robot selection stage improves embodiment assignment under both standard and stress-test conditions, and the dynamics validation stage prevents semantically plausible but physically unsafe motions from being deployed without prior correction. In this sense, the framework should be understood not merely as a code generation chain, but as a deployment-oriented verification architecture.

A useful interpretation of the framework is that it mitigates physical hallucination at two complementary levels. The first level is spatial grounding: by deferring coordinate instantiation until runtime Digital Twin information becomes available, the framework reduces the risk of environment-inconsistent targets. The second level is dynamic feasibility: by validating grounded motions against embodiment-specific torque constraints and adapting them through scaling, the framework prevents semantically plausible but physically unsafe trajectories from being executed directly. Together, these two layers form the core safety logic of the proposed framework.

The results also highlight an important implication for industrial robot programming. In heterogeneous environments, nominal embodiment suitability and execution-time physical safety cannot be treated as equivalent criteria. A robot that appears appropriate at the task-allocation stage may still become unsafe under the generated motion profile, while some initially infeasible motions may be recovered through controlled temporal scaling. This separation of concerns helps explain why multi-vendor robot programming should be approached not only as a code-generation problem, but as a verification-centered deployment problem.

Several limitations remain. First, analytical RNEA validation was performed only for UR5e and Franka Panda because reliable inertial parameter models were available for these robots. Extending closed-form torque analysis to the remaining robots would require manufacturer-disclosed inertial data. Second, the current validation framework focuses primarily on torque feasibility and does not yet incorporate other safety factors such as collision avoidance, compliance behavior, thermal limits, or uncertainty-aware control margins. Third, no direct comparison was conducted against raw vendor-specific code generated directly from natural language without the intermediate task representation. Such a comparison would require a controlled baseline capable of generating syntactically valid vendor-specific code across multiple robot platforms without task abstraction, which in practice depends heavily on platform-specific prompt engineering and code-generation heuristics outside the current framework scope. This comparison is therefore left as an important direction for future work.

Fourth, the proposed framework assumes that the Digital Twin provides sufficiently accurate spatial grounding for pre-execution verification. In practice, Digital Twins of real HMLV environments are subject to sim-to-reality discrepancies arising from sensor noise, calibration drift, and communication latency. The safety margin η = 0.9 partially accounts for these uncertainties by conservatively bounding RNEA-computed torques below hardware limits. Nevertheless, high-frequency dynamics and transient disturbances not captured by rigid-body models remain a known limitation. Future integration of real-time DT synchronization and sensor fusion would further close this gap.

Fifth, all experimental validation in this study was conducted in simulation. Although PyBullet-based dynamics simulation and RNEA-based analytical verification provide systematic pre-execution feasibility assessment, they do not replace physical robot experiments. Real-world deployment involves additional sources of uncertainty—including actuator nonlinearity, joint friction, cable routing effects, and environmental perturbations—that are not fully captured by rigid-body models. Validation on a physical robot platform is identified as an immediate next step to confirm that the proposed framework’s safety margins and scaling behavior transfer reliably to real hardware.

Sixth, the current safety filter at the early-stage adversarial blocking step is intentionally conservative, which may occasionally produce false positives—valid but structurally complex commands that are rejected before reaching the RNEA-based dynamics validation layer. As noted in Section 5.1, this represents a deliberate trade-off between computational efficiency and coverage. Future work may explore adaptive thresholding or learned filtering criteria to reduce over-blocking while preserving the safety-oriented behavior of the pipeline.

6. Conclusions

This paper proposed a Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. Rather than treating natural-language robot programming as a pure code-generation problem, the proposed framework incorporated a structured intermediate task representation, runtime spatial grounding, feasibility-aware robot selection, RNEA-based dynamics validation, and adaptive motion scaling into a unified pre-execution pipeline. In this architecture, vendor-specific code generation is positioned as the final realization stage of a verification-centered workflow rather than as the primary source of execution correctness.

The experimental results supported the effectiveness of this approach at multiple levels. The intermediate representation stage provided sufficiently stable structured input for downstream grounding and verification, while the robot selection module maintained task-consistent and discriminative embodiment assignment across both standard and ultra-lightweight stress-test scenarios. The dynamics validation results further showed that semantically plausible motions may still violate actuator-level constraints under a selected embodiment and that adaptive scaling can recover analytically feasible overload cases. In particular, safe execution coverage improved from 50% to 100% when the full validation-and-scaling pipeline was applied.

These findings suggest that the key challenge in multi-vendor robot programming is not only how to generate syntactically executable code but how to ensure that generated behavior is grounded in the runtime environment, assigned to an appropriate embodiment, and verified against physical execution constraints before deployment. From this perspective, the main contribution of the present work is to frame LLM-based heterogeneous robot programming as a verification-centered deployment problem rather than a language-to-code translation problem alone.

Several directions remain for future work. First, analytical dynamics validation should be extended to a broader set of industrial robots as reliable inertial parameter models become available. Second, the current framework focuses primarily on torque feasibility and can be expanded to include additional safety factors such as collision risk, compliance behavior, thermal constraints, and uncertainty-aware control margins. Third, future studies should investigate controlled direct-generation baselines for vendor-specific code without intermediate task abstraction in order to further clarify the comparative value of verification-ready task representations in heterogeneous industrial environments.

Author Contributions

Conceptualization, Y.-H.L. and T.N.; methodology, Y.-H.L. and T.N.; software, Y.-H.L.; validation, Y.-H.L.; formal analysis, Y.-H.L.; investigation, Y.-H.L.; writing—original draft preparation, Y.-H.L.; writing—review and editing, D.-S.C. and W.-T.K.; visualization, Y.-H.L.; supervision, W.-T.K.; project administration, W.-T.K.; funding acquisition, W.-T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2026-RS-2021-II211816) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2026-25490993). The APC was funded by the Institute for Information & Communications Technology Planning & Evaluation (IITP).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are contained within the article. No new external dataset was generated.

Conflicts of Interest

The authors declare no conflict of interest.

Correction Statement

This article has been republished with a minor correction to the Funding statement. This change does not affect the scientific content of the article.

Figure 1. Digital Twin-integrated verification framework for adaptive control code generation in heterogeneous robot systems. Natural-language input is transformed into a structured task representation, grounded using runtime Digital Twin information, validated for dynamic feasibility, and converted into verified executable robot code.

Figure 2. Reliability of the intermediate representation across scenario categories and adversarial filtering outcomes. (a) Syntactic Integrity Rate and Back-Translation Fidelity (BTF) for Simple, Complex, Adversarial, and Overall cases; (b) proportion of adversarial inputs blocked by the safety filter or passed to downstream dynamics validation.

Figure 3. Payload ratio comparison across standard and ultra-lightweight robot selection scenarios. (a) Standard tasks A–F; (b) stress-test tasks G–H, with the dashed line indicating the target safety margin α = 1.2.

Figure 4. Payload-score behavior of the proposed tri-modal scoring function and Linear Weighted Sum as a function of payload ratio R. The proposed method preserves non-zero ranked scores in the over-specification region, whereas Linear WS collapses to zero beyond its threshold.

Figure 5. Maximum torque ratio before and after adaptive scaling for analytically verified robot platforms. (a) UR5e; (b) Franka Panda, with the dashed line indicating the feasibility limit at ratio = 1.0.

Table 1. Specifications of the heterogeneous robot pool.

Robot	Manufacturer	Payload (kg)	DoF	Reach (m)	Max Vel. (rad/s)	Category
Franka Panda	Franka Emika	3.0	7	0.855	2.175	Lightweight cobot
UR5e	Universal Robots	5.0	6	0.850	3.14	Lightweight cobot
Doosan M0609	Doosan Robotics	6.0	6	0.900	3.92	Medium cobot
Doosan M1013	Doosan Robotics	10.0	6	1.300	4.01	Medium cobot
KUKA iiwa14	KUKA	14.0	7	0.820	1.71	Medium-heavy
Doosan H2515	Doosan Robotics	25.0	6	1.500	3.49	Heavy industrial

Table 2. Robot selection task scenarios.

Task	Payload (kg)	Reach (m)	DoF	Category
A	2.5	0.60	6	Standard
B	4.2	0.70	6	Standard
C	8.3	1.00	6	Standard
D	12.0	0.70	6	Standard
E	20.8	1.30	6	Standard
F	2.0	0.70	6	Standard
G	1.0	0.50	6	Ultra-lightweight
H	0.5	0.40	6	Ultra-lightweight
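
The two tables can be cross-checked with a short script. The following is a minimal sketch, not the tri-modal scoring function of the framework: it only computes a payload ratio for each robot and task pair and flags whether that ratio reaches the target safety margin α = 1.2 shown in Figure 3. It assumes that the payload ratio R is the robot's rated payload divided by the task payload, and that a robot is a candidate when R ≥ 1 and its reach and DoF cover the task requirement. The names `Robot`, `Task`, `payload_ratio`, and `candidates` are hypothetical.

```python
from dataclasses import dataclass

ALPHA = 1.2  # target safety margin alpha (Figure 3)


@dataclass(frozen=True)
class Robot:
    name: str
    payload_kg: float
    dof: int
    reach_m: float


@dataclass(frozen=True)
class Task:
    payload_kg: float
    reach_m: float
    dof: int


# Values copied from Table 1 (maximum velocity is not needed here).
ROBOTS = [
    Robot("Franka Panda", 3.0, 7, 0.855),
    Robot("UR5e", 5.0, 6, 0.850),
    Robot("Doosan M0609", 6.0, 6, 0.900),
    Robot("Doosan M1013", 10.0, 6, 1.300),
    Robot("KUKA iiwa14", 14.0, 7, 0.820),
    Robot("Doosan H2515", 25.0, 6, 1.500),
]

# Values copied from Table 2.
TASKS = {
    "A": Task(2.5, 0.60, 6),
    "B": Task(4.2, 0.70, 6),
    "C": Task(8.3, 1.00, 6),
    "D": Task(12.0, 0.70, 6),
    "E": Task(20.8, 1.30, 6),
    "F": Task(2.0, 0.70, 6),
    "G": Task(1.0, 0.50, 6),
    "H": Task(0.5, 0.40, 6),
}


def payload_ratio(robot: Robot, task: Task) -> float:
    """Assumed definition: rated robot payload divided by task payload."""
    return robot.payload_kg / task.payload_kg


def candidates(task: Task, alpha: float = ALPHA):
    """Return (name, ratio, meets_margin) for robots that cover the task.

    Assumed hard constraints: ratio >= 1, reach >= required reach,
    DoF >= required DoF. Results are sorted by ascending ratio, so the
    least over-specified robot comes first.
    """
    rows = []
    for robot in ROBOTS:
        ratio = payload_ratio(robot, task)
        if ratio >= 1.0 and robot.reach_m >= task.reach_m and robot.dof >= task.dof:
            rows.append((robot.name, ratio, ratio >= alpha))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for label, task in TASKS.items():
        for name, ratio, ok in candidates(task):
            print(f"Task {label}: {name:13s} R = {ratio:5.2f} meets alpha: {ok}")
```

Under these assumptions, every robot in the pool is a candidate for the ultra-lightweight task H, with ratios ranging from 6 to 50. This is the over-specification region that Figure 4 refers to.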
