Semantic–Physical Sensor Fusion for Safe Physical Human–Robot Interaction in Dual-Arm Rehabilitation

Zhu, Disha; Wang, Xuefeng; Shang, Shaomei

doi:10.3390/s26051510


by Disha Zhu ¹, Xuefeng Wang ² and Shaomei Shang ¹,*

¹ School of Nursing, Peking University, Beijing 100871, China

² School of Advanced Manufacturing and Robotics, Peking University, Beijing 100871, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(5), 1510; https://doi.org/10.3390/s26051510

Submission received: 18 January 2026 / Revised: 13 February 2026 / Accepted: 14 February 2026 / Published: 27 February 2026

(This article belongs to the Section Biomedical Sensors)


Abstract

Safe physical human–robot interaction (pHRI) in rehabilitation requires reliable perception and low-latency decision making under heterogeneous and unreliable sensor inputs. This paper presents a multimodal sensor-fusion-based safety framework that integrates physical state estimation, semantic information fusion, and an edge-deployed large language model (LLM) for real-time pHRI safety control. A dynamics-based virtual sensing method is introduced to estimate internal joint torques from external force–torque measurements, achieving a normalized mean absolute error of 18.5% in real-world experiments. An asynchronous semantic state pool with a time-to-live mechanism is designed to fuse visual, force, posture, and human semantic cues while maintaining robustness to sensor delays and dropouts. Based on structured multimodal tokens, an instruction-tuned edge LLM outputs discrete safety decisions that are further mapped to continuous compliant control parameters. The framework is trained using a hybrid dataset consisting of limited real-world samples and LLM-augmented synthetic data, and evaluated on unseen real and mixed-condition scenarios. Experimental results show reliable detection of safety-critical events with a low emergency misdetection rate, while maintaining an end-to-end decision latency of approximately 223 ms on edge hardware. Real-world experiments on a rehabilitation robot demonstrate effective responses to impacts, user instability, and visual occlusions, indicating the practical applicability of the proposed approach for real-time pHRI safety monitoring.

Keywords:

rehabilitation robotics; physical human–robot interaction; large language models; multimodal fusion; state estimation

1. Introduction

Multi-modal sensory perception is critical for intelligent robotics. This includes vision, force, and physiological monitoring. Industrial robots operate in structured environments, but service robots must interpret complex human behaviors and safety constraints during physical Human–Robot Interaction (pHRI) [1,2]. This requirement is critical in physical rehabilitation. Here, the robot acts as a cognitive partner rather than a simple mechanical actuator [3,4,5,6]. It must understand the user’s physiological state and intent [7,8,9,10,11].

However, current robotic implementations lag behind these requirements [12,13]. Rehabilitation robots offer high precision and adaptive feedback [14,15,16,17]. Although mechanical solutions such as Series Elastic Actuators have addressed physical compliance to some extent [18,19,20,21,22], robots still struggle to adapt cognitively in unstructured home environments. Raw data from a single sensor cannot provide robust state estimation [23,24,25]. To address this limitation, various multi-sensor fusion techniques have been investigated. Traditional approaches use probabilistic methods, such as Kalman Filters, to merge kinematic and dynamic data [26,27]. More recent studies employ deep neural networks to fuse visual and haptic modalities [28,29,30]. However, these methods typically focus on low-level feature integration and lack the high-level reasoning capabilities required to interpret complex user behaviors in unstructured scenarios. This creates bottlenecks at two levels: physical perception and semantic understanding.

At the physical perception level, external haptic feedback does not accurately reflect the internal human physiological state. The mapping from sensor readings to joint torques is non-linear [31]. In dynamic rehabilitation, the interaction force combines human joint effort with system dynamics [32,33,34]. For example, different limb configurations result in different internal torques even if the external contact force is identical. High force readings may result from gravity rather than active resistance. Direct reliance on raw force thresholds causes misinterpretations. Therefore, we need a model-based physical observer (inverse dynamics) to decouple disturbances and recover the objective intrinsic state [35,36].

At the semantic level, kinematic modalities like vision suffer from ambiguity without physical constraints [2,37]. Visual algorithms capture skeletal poses but struggle to distinguish geometrically similar scenarios. For instance, “reclining on a chair” and “falling onto the floor” look similar. Both show a horizontal trunk and lowered center of mass [38]. Without cross-modal verification from other sensors, the vision system itself cannot correctly distinguish the scenario [39,40,41].

As the rehabilitation ecosystem expands, the integration of various sensors, such as wearable devices and environmental IoT devices, leads to exponential growth in semantic information combinations. Traditional finite state machines encounter the problem of “combinatorial explosion” when facing multimodal combinations, requiring the writing of thousands of if-else rules, and appear particularly fragile when dealing with undefined edge cases [42,43]. To address this issue, the control framework needs to introduce Large Language Models (LLMs) as the reasoning core [44,45]. Using the LLM’s semantic space compression and zero-shot generalization capabilities, LLMs can interpret heterogeneous sensor data and generate adaptive control strategies for complex unseen scenarios, handling all unknown logical combinations with a single network [46].

To solve this problem, we propose a multi-level sensor fusion dual-arm rehabilitation robot framework. This framework unifies the “Physical Fusion” of vision and haptics with the “Semantic Fusion” of Large Language Models [47]. The “Physical Fusion” establishes a method for observing human internal torque based on multi-sensor perception. The “Semantic Fusion” drives the adaptive adjustment of control hyperparameters [48,49,50].

We constructed a system with two 6-DoF robotic arms, depth cameras, and force sensors. Through inverse dynamics, the system calculates the human joint internal torque. Our experiments show an average estimation error of 18.5% for this physiological parameter. For the cognitive core, we fine-tuned a Qwen3-1.7B model via LoRA. This enables the robot to make comprehensive judgments based on multi-source information in rehabilitation scenarios and output structured control commands. Experiments demonstrate that the fine-tuned model achieves an accuracy of 98.5% on the test set with an inference speed of about 4.48 Hz.

In the proposed framework, these two fusion layers are not independent. The physical fusion layer acts as a Feature Extractor. Raw sensor data contains noise from gravity and robot inertia, which leads to poor LLM generalization. By resolving the normalized human internal torque first, we provide the LLM with configuration-independent physical semantics. This allows the LLM to focus on logical reasoning (e.g., distinguishing between a spasm and a fall), thereby achieving robust generalization capabilities even in unseen scenarios.

It is worth noting that while we validate our framework using a dual-arm robot for knee rehabilitation, the proposed method is not limited to this specific configuration. The choice of a redundant dual-arm system is strategic; it addresses the composite rehabilitation needs prevalent in the elderly population, where patients often require therapy across multiple joints rather than a single site [51,52]. Instead of deploying multiple specialized machines, this system serves as a proof-of-concept for “software-defined rehabilitation”, enabling a single general-purpose robot to adapt to various anatomical targets and therapeutic protocols through software reconfiguration. Furthermore, the knee rehabilitation task represents a high-torque, dynamically complex scenario, serving as a rigorous benchmark to demonstrate the robustness of our semantic–physical fusion framework, which can be readily extended to other robotic systems and rehabilitation tasks.

The remainder of this paper is organized as follows. Section 2 details the system architecture. Section 3 describes the physical modeling fusion method. Section 4 introduces the LLM-based semantic fusion. Section 5 presents the pHRI control and validation. Section 6 concludes the paper.

2. System Architecture and Implementation

The system has two main parts: a rehabilitation robot and a prosthetic verification platform. As shown in Figure 1, the rehabilitation robot contains three modules: mobility, actuation, and perception-computation.

The mobility and actuation modules form the robot’s body. We built the mobile base on the HEXMAN-MARK1 platform. It uses Mecanum wheels with integrated odometry and Inertial Measurement Units (IMU) for movement. The dual-arm system uses two 6-Degree-of-Freedom (6-DoF) robotic arms with a spherical-wrist configuration. High-torque density brushless motors (DM4340P and DM4310, Maxon Motor, Tagerwilen, Switzerland) drive the joints. These motors support Quasi-Direct Drive (QDD) control to ensure whole-body compliance. For command execution, we used an STM32-based data acquisition card (STMicroelectronics, Geneva, Switzerland). It communicates via an independent CAN FD bus (Bosch; Gerlingen, Germany) at 250 Hz to maintain stable torque loop control.

The multi-modal perception and computing system includes the following components:

1. Visual System: An Orbbec 336L RGB-D depth camera (Orbbec, Shenzhen, China) is mounted on the robot’s head to capture environmental context. It streams video at a resolution of $1920 \times 1080$ (30 fps), with depth data synchronized via USB to the host for real-time skeletal tracking using YOLO-POSE.
2. Haptic System: Two XJC-XD60EC 6-DoF force/torque sensors (XJC Electronics, Shenzhen, China) are installed at the end-effectors to capture interaction forces. Crucially, these sensors communicate via an EtherCAT bus (EtherCAT Technology Group, Nuremberg, Germany) with a dedicated master station (ZMC60E), ensuring strictly deterministic data acquisition at 1 kHz with minimal jitter.
3. Computing Core: The central processing unit is a mobile workstation equipped with an NVIDIA GeForce RTX 5060 Laptop GPU. This platform acts as the sensor fusion hub, aggregating data from heterogeneous interfaces, including USB (vision), UDP (force sensors via EtherCAT bridge), CAN (mobile base), and CAN FD (robotic arms). The system architecture adopts a hybrid computing strategy; the CPU handles high-frequency dynamics and control tasks, while the GPU provides sufficient CUDA cores to support the real-time inference of the fine-tuned Large Language Model (LLM) alongside other lightweight networks such as YOLO-POSE.

We developed a robotic prosthetic verification setup to test the algorithms (Figure 2). A QDD brushless servo motor (DM6006, CAN bus at 100 Hz) actuates the knee joint of the prosthetic limb to simulate human joint impedance. We 3D-printed the limb structure using resin and mounted it on a fixed frame. We placed a force sensor (identical to the robot’s sensor) in the knee joint. This sensor measures the ground truth internal torque $τ_{limb - internal - real}$.

Figure 3 illustrates the hardware integration and data flow framework, where visual, haptic, and auxiliary signals are synchronized and fused. A dual-path processing strategy is adopted; the semantic path updates a state pool for LLM reasoning, while the physical path combines kinematic and dynamic data to estimate human joint torques. In the physical formulation, the interaction forces measured by the robot’s end-effectors are denoted by $F_{left}$ and $F_{right}$, and the robotic joint torques by $τ_{robo}$. The limb state is characterized by joint angles $θ_{limb}$ and angular velocities $ω_{limb}$ relative to the support points $s$. Based on these parameters, the system distinguishes the external torque applied by the robot, $τ_{limb - external}$, from the user’s physiological internal torque, $τ_{limb - internal}$.

We split the data flow based on signal type. Facial Expression Recognition (FER) extracts emotional states (e.g., ‘Pain’, ‘Normal’) for the semantic pool. Human Pose Estimation (HPE) has two roles. It sends geometric features to the inverse dynamics model, and it registers high-level pose descriptions (e.g., ‘Sitting’, ‘Lying’) in the semantic pool. Similarly, force sensor signals drive the dynamics observer for low-level control. At the same time, interpreted interaction events (e.g., ‘Stable’, ‘Impact’) provide context for the semantic decision layer.

3. Physical Fusion Layer: Dynamics-Based Torque Observation

The primary goal of the physical sensing module is to estimate the user’s internal joint torque. This metric reflects the true physiological state. In dynamic rehabilitation, raw force sensor readings at the end-effector contain noise from mechanical dynamics, such as inertial forces and gravitational components. These factors distort the measurement of active human effort. To remove these disturbances, we applied a “Physical Fusion” strategy. We modeled the human limb (or the verification platform) as a non-uniform two-link system. We determined the mass distribution parameters through preliminary experiments.

3.1. Dynamics of the Coupled System

We model the human limb (or the verification prosthetic platform) as a two-link system to separate physiological intent from mechanical measurements. We parameterized the geometric configuration and mass distribution for the dynamic analysis.

Figure 4 shows the system’s spatial definitions. $p_{i}$, $c_{i}$, $m_{i}$, and $s_{i}$ denote the position vectors of the link endpoints (joints), centers of mass, the masses of the links, and sensor interaction points, respectively. Additionally, $F_{i}$ denotes the interaction forces acting on the limb segments.

We define the generalized coordinates as the joint configuration vector $q_{limb} = [θ_{1}, θ_{2}]^{⊤}$, which corresponds to the hip and knee angles. Using the Euler–Lagrange equation, we express the dynamic model as:

$M_{limb}(q_{limb}) \ddot{q}_{limb} + C_{limb}(q_{limb}, \dot{q}_{limb}) \dot{q}_{limb} + G_{limb}(q_{limb}) = τ_{limb - external} + τ_{limb - internal}$

(1)

$M_{limb}$ is the inertia matrix, $C_{limb}$ represents Coriolis and centrifugal effects, and $G_{limb}$ is the gravity vector. The right-hand side decomposes the total torque into external mechanical interaction ($τ_{limb - external}$) and the user’s internal physiological torque ($τ_{limb - internal}$).

We identified the dynamic parameters ($m_{1}, m_{2}, \dots$) offline. This allows us to calculate the inertial and gravitational components in real time. We observe the internal torque by subtracting the passive dynamics and external loads from the total system dynamics.

3.2. Synthesizing the Internal Torque Observer

We use an explicit geometric formulation based on the kinematic chain in Figure 4. By analyzing the force vectors and lever arms relative to the joint frames, the generalized external torque $τ_{limb - external} \in \mathbb{R}^{2}$ is derived as the vector of scalar moments acting along the joint axes (z-axis):

$τ_{limb - external} = \begin{bmatrix} (s_{1} \times F_{1} + (p_{1} + s_{2}) \times F_{2})_{z} \\ (s_{2} \times F_{2})_{z} \end{bmatrix}$

(2)

where $(\cdot)_{z}$ denotes the scalar projection onto the joint rotation axis. This formulation ensures that the dimensions of the external torque vector align with the generalized coordinates $q_{limb}$.

After reconstructing the load terms, we apply the inverse dynamics model to the observer. By rearranging the dynamic equilibrium equation, we solve for the user’s internal torque:

$τ_{limb - internal} = M_{limb}(q_{limb}) \ddot{q}_{limb} + C_{limb}(q_{limb}, \dot{q}_{limb}) \dot{q}_{limb} + G_{limb}(q_{limb}) - τ_{limb - external}$

(3)

This calculated variable $τ_{limb - internal}$ acts as the “Physical Token”. It represents the normalized limb torque decoupled from the robot’s mechanical influence.
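Equations (2) and (3) can be sketched numerically for a planar two-link limb. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses the textbook planar two-link rigid-body model, with $θ_{1}$ measured from the horizontal and $θ_{2}$ relative to the first link, and it assumes all lever arms and forces are expressed in one common planar frame. The `LimbParams` fields and every numeric value are hypothetical placeholders, not identified parameters from the paper.

```python
import math
from dataclasses import dataclass

G_ACC = 9.81  # gravitational acceleration, m/s^2


@dataclass(frozen=True)
class LimbParams:
    """Hypothetical two-link parameters (placeholders, not values from the paper)."""
    m1: float = 7.0    # thigh mass, kg
    m2: float = 3.5    # shank mass, kg
    l1: float = 0.40   # thigh length (hip to knee), m
    lc1: float = 0.17  # hip to thigh centre of mass, m
    lc2: float = 0.19  # knee to shank centre of mass, m
    I1: float = 0.10   # thigh inertia about its centre of mass, kg m^2
    I2: float = 0.05   # shank inertia about its centre of mass, kg m^2


def cross_z(a, b):
    """z-component of the cross product of two planar vectors."""
    return a[0] * b[1] - a[1] * b[0]


def external_torque(s1, F1, p1, s2, F2):
    """Equation (2): joint-axis moments produced by the two contact forces."""
    lever2 = (p1[0] + s2[0], p1[1] + s2[1])  # hip -> contact point on link 2
    return (cross_z(s1, F1) + cross_z(lever2, F2), cross_z(s2, F2))


def internal_torque(q, dq, ddq, tau_ext, p=LimbParams()):
    """Equation (3): tau_int = M*ddq + C*dq + G - tau_ext for the two-link model."""
    th1, th2 = q
    c2, s2 = math.cos(th2), math.sin(th2)
    # Inertia matrix entries (symmetric 2x2)
    m11 = p.m1 * p.lc1**2 + p.I1 + p.I2 + p.m2 * (p.l1**2 + p.lc2**2 + 2 * p.l1 * p.lc2 * c2)
    m12 = p.I2 + p.m2 * (p.lc2**2 + p.l1 * p.lc2 * c2)
    m22 = p.I2 + p.m2 * p.lc2**2
    # Coriolis and centrifugal torques, i.e. the product C(q, dq) * dq
    h = -p.m2 * p.l1 * p.lc2 * s2
    cor1 = h * (2 * dq[0] * dq[1] + dq[1] ** 2)
    cor2 = -h * dq[0] ** 2
    # Gravity vector
    g2 = p.m2 * p.lc2 * G_ACC * math.cos(th1 + th2)
    g1 = (p.m1 * p.lc1 + p.m2 * p.l1) * G_ACC * math.cos(th1) + g2
    tau1 = m11 * ddq[0] + m12 * ddq[1] + cor1 + g1 - tau_ext[0]
    tau2 = m12 * ddq[0] + m22 * ddq[1] + cor2 + g2 - tau_ext[1]
    return (tau1, tau2)
```

In a static pose with no contact forces the observer output reduces to the gravity vector, which is a convenient sanity check before feeding it real sensor data.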

Quantitative Accuracy Analysis

To validate the algorithm, we built a physical verification setup. A dual-arm robot manipulates an instrumented prosthetic leg to simulate passive rehabilitation trajectories, as shown in Figure 5.

We use the Normalized Mean Absolute Error (NMAE) as the primary metric. It is the mean absolute deviation normalized by the peak-to-peak (PP) amplitude of the ground truth signal:

$NMAE = \frac{1}{n \cdot PP_{τ}} \sum_{i = 1}^{n} | τ_{ground\_truth, i} - \hat{τ}_{observed, i} |$

(4)

where $τ_{ground\_truth, i}$ is the ground truth torque measured by the prosthesis’s internal sensors, and $\hat{τ}_{observed, i}$ is the observer’s estimate. $n$ is the total number of sampling points, and $PP_{τ}$ denotes the peak-to-peak amplitude of the ground truth torque signal.
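Equation (4) translates directly into a few lines of code. The following is a minimal sketch; the function name `nmae` and the input checks are illustrative choices, not part of the original work.

```python
def nmae(ground_truth, observed):
    """Normalized mean absolute error: MAE divided by the peak-to-peak
    amplitude of the ground-truth signal, as in Equation (4)."""
    if len(ground_truth) != len(observed) or not ground_truth:
        raise ValueError("signals must be non-empty and of equal length")
    peak_to_peak = max(ground_truth) - min(ground_truth)
    if peak_to_peak == 0:
        raise ValueError("ground truth is constant; NMAE is undefined")
    mae = sum(abs(g - o) for g, o in zip(ground_truth, observed)) / len(ground_truth)
    return mae / peak_to_peak
```

For example, a ground-truth signal `[0.0, 1.0, 2.0]` and an estimate `[0.0, 1.0, 1.0]` have a mean absolute error of 1/3 and a peak-to-peak amplitude of 2, so the NMAE is 1/6.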

Figure 6 shows the comparison results. The observer outputs (orange) track the ground truth data (blue) despite structural uncertainties.

Figure 6 presents the results for a typical rehabilitation trajectory with a period $T = 3$ s and an amplitude $A = 0.2$ rad. We conducted 24 experiments with different periods (T = 3 s, 4 s, 5 s, 6 s) and amplitudes (A = 0.1 rad, 0.2 rad, 0.3 rad). The observer achieved a mean absolute error (MAE) of 0.336 Nm and an average NMAE of 18.5% for robotic prosthesis torque estimation. This discrepancy comes from four main physical factors:

1. Unmodeled Actuator Friction: The simplified dynamic model does not fully capture non-linear actuator characteristics, such as dry friction and stiction in the motor and gearbox.
2. Kinematic Calibration Residuals: The kinematic parameters of the dual-arm robot have finite calibration accuracy. This introduces errors in the geometric Jacobian and gravity compensation terms.
3. Contact Instability: Non-rigid contact and micro-slippage at the robot–prosthesis contact point cause transient deviations in the estimated force application point. This leads to errors in the lever arm calculation.
4. Localization and Calibration Uncertainties: Leg tracking uses ArUco markers. Precise alignment is affected by cumulative errors: noise in visual pose estimation (approx. 0.036 rad MAE), calibration inaccuracies in the robot base frame, and geometric differences between the marker’s theoretical position and its actual placement.

The dual-arm robot used in this study is a laboratory-developed prototype. While its kinematic calibration exhibits finite residuals compared to high-precision industrial manipulators, the system’s consistent performance across multiple trials demonstrates that the method is not fragile to these hardware-level imperfections. For human subjects, inertial parameters were estimated using standard anthropometric scaling laws (similar to OpenSim), in contrast to the experimental identification used for the verification platform. Given that rehabilitation exercises typically involve low angular velocities and accelerations, the dynamic behavior described in Equation (3) is dominated by the gravitational vector $G_{limb}$ rather than the higher-order inertial or Coriolis terms. Since $G_{limb}$ depends linearly on the mass parameters, any estimation errors in the anthropometric data result in bounded, linear offsets in $τ_{limb - internal}$, preventing the divergent instability often seen in high-speed control.

In terms of accuracy, most existing work validates torque estimation using OpenSim simulations [53], which inherently bypasses real-world sensor noise and transmission non-linearities. While physical sensing approaches like wearable ultrasound can achieve approximately 7.6% error under ideal conditions [54], achieving 18.5% NMAE on a full dynamic robotic platform represents a robust baseline for real-world interaction. Human motor control exhibits high variability, and the system’s goal is reliable semantic classification (e.g., distinguishing spasms from resistance) rather than surgical force tracking. Since the torque disparity between normal motions and hazardous anomalies is substantial, the current accuracy falls well within physiological tolerance for safety-critical decision making. These results confirm that the model-based observer effectively decouples passive mechanical effects (e.g., limb gravity) from active joint torque to recover the user’s true internal state, with potential for further accuracy gains through industrial-grade calibration.

4. Semantic-Physical Fusion and Intelligent Decision

This section describes the system’s intelligence layer. We propose an asynchronous fusion framework that separates sensor sampling rates from decision-making frequencies. This allows a Fine-Tuned Large Language Model to generate robust control decisions on edge hardware.

4.1. Asynchronous Semantic State Pool

In rehabilitation scenarios, sensors operate at different frequencies. Force sensors run at 1 kHz, visual pose estimation at 30 Hz, motors at 250 Hz, and other monitors update sparsely. To handle this, we designed a Dynamic Semantic State Pool on the central computer (Figure 7).

The State Pool stores the latest semantic status of all subsystems as key–value pairs. The process includes:

1. Signal Discretization and Tokenization: Raw sensor data are continuous and high-frequency, which is unsuitable for direct LLM ingestion. We employ a sliding-window statistical method to convert these signals into semantic tokens:

Physical Tokens (from Section 3): The internal torque $τ_{limb - internal}$ derived in Equation (3) is processed to quantify interaction stability. We calculate the sliding variance $σ_{τ_{error}}^{2}$ over a window $W = 50$ ms:

$σ_{τ_{error}}^{2}(t) = \frac{1}{W} \sum_{k = t - W}^{t} (τ_{limb - internal}(k) - \bar{τ})^{2}$

(5)

where $W = 50$ ms is the window size, and $\bar{τ}$ represents the moving average of the internal torque within this window. The variance is mapped to discrete tokens based on thresholds $\{δ_{1}, δ_{2}\}$ calibrated on a 30-trial pilot dataset:

$Token_{phy} = \begin{cases} STABLE, & \text{if } σ_{τ_{error}}^{2} \leq δ_{1} \\ TREMOR, & \text{if } δ_{1} < σ_{τ_{error}}^{2} \leq δ_{2} \\ IMPACT, & \text{if } σ_{τ_{error}}^{2} > δ_{2} \end{cases}$

(6)

Specifically, $δ_{1}$ is set to $μ_{noise} + 3 σ_{noise}$ (3-sigma rule) to filter baseline noise, where $μ_{noise}$ and $σ_{noise}$ are the mean and standard deviation of the sensor noise measured during static calibration. $δ_{2}$ is selected to maximize the margin between oscillatory tremors and high-energy impact spikes.
Human Pose Tokens: Keypoint coordinates from YOLO-POSE are analyzed for geometric anomalies. For instance, when the vertical difference between the hip and head keypoints falls below a calibrated threshold indicative of a supine posture, the token Visual: USER_LYING_DOWN is pushed to the pool.
Facial Expression Tokens: The vision system runs a lightweight FER module to classify user emotions. When the confidence score for a critical state (e.g., ‘Pain’ or ‘Fatigue’) exceeds a threshold of $0.8$ , a corresponding semantic token (e.g., Face: PAIN) is generated and updated in the state pool.
Instruction Tokens: User commands (via voice or text interface) are parsed into intent tokens. For example, the command “Stop moving” immediately triggers a high-priority Cmd: STOP token, overriding other behaviors.

2. Asynchronous Update: Each sensor module pushes updates to the pool independently. The LLM acts as a consumer, querying the current snapshot of the pool for inference.

3. Lifecycle Management: To prevent stale data from influencing decisions (e.g., if the camera is occluded), a time-to-live (TTL) mechanism is implemented. Any state token not updated within 5.0 s is automatically invalidated and removed from the pool.

This mechanism merges high-frequency physical data with low-frequency semantic context without blocking.
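The three steps above can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: `torque_token` applies the variance thresholds of Equation (6) to one window of torque samples, and `SemanticStatePool` is a hypothetical key–value store whose entries expire after the 5.0 s TTL. The class and method names, the lock-based thread safety, the injectable clock, and the threshold values used in the usage example are all assumptions made for illustration.

```python
import threading
import time
from statistics import pvariance


def torque_token(window, delta1, delta2):
    """Map the variance of one torque window to a token, following Equation (6)."""
    variance = pvariance(list(window))  # population variance around the window mean
    if variance <= delta1:
        return "STABLE"
    if variance <= delta2:
        return "TREMOR"
    return "IMPACT"


class SemanticStatePool:
    """Key-value pool of semantic tokens; entries older than ttl_s are dropped."""

    def __init__(self, ttl_s=5.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock            # injectable so expiry can be tested without sleeping
        self._lock = threading.Lock()  # sensor threads push concurrently
        self._entries = {}             # key -> (token, timestamp)

    def push(self, key, token):
        """Called by each sensor module at its own rate."""
        with self._lock:
            self._entries[key] = (token, self._clock())

    def snapshot(self):
        """Called by the consumer; returns only tokens that are still fresh."""
        with self._lock:
            now = self._clock()
            self._entries = {
                key: (token, stamp)
                for key, (token, stamp) in self._entries.items()
                if now - stamp <= self._ttl_s
            }
            return {key: token for key, (token, _) in self._entries.items()}
```

A producer thread would call, for instance, `pool.push("Force", torque_token(samples, delta1, delta2))` once per window, while the decision loop calls `pool.snapshot()` whenever it is ready for the next inference. Neither side waits on the other beyond the brief lock.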

4.2. Edge-Native Instruction-Tuned Model

Real-time robotic control requires structured outputs and deterministic latency. General cloud LLMs often fail these requirements due to network variability. Therefore, we deployed the Qwen3-1.7B Large Language Model on the local hardware (NVIDIA RTX 5060 Laptop GPU).

We used Instruction Fine-Tuning instead of prompt engineering. This provides two advantages:

1. Structured Output Reliability: General LLMs typically produce verbose “chain-of-thought” narratives. Through fine-tuning, our model is constrained to output only the control action codes (“0”, “1”, or “2”), eliminating the need for complex post-processing regex parsing and ensuring machine-readable stability.
2. Low-Latency Inference: By optimizing the model scale (1.7B parameters) and executing locally, we achieved an average inference latency of approximately 223 ms on the laptop. Including sensor acquisition and bus communication, the total control loop latency is controlled within 240 ms. Since rehabilitation scenarios primarily involve physical human–robot interaction frequencies below 1 Hz (period $> 1$ s), this response speed falls well within the safety margins for smooth control.

The fine-tuned model serves as a function with semantic understanding capabilities and generalization ability, mapping the chaotic multimodal state space of the rehabilitation environment to a discrete, safe action space with structured outputs.

4.3. Data Augmentation and Training Pipeline

High-quality, domain-specific data is scarce in robotic rehabilitation. To solve this, we used a “Seed-Augment-Train” strategy. We used a foundation model to synthesize a robust dataset.

4.3.1. Dataset Construction

The process involved three stages:

1. Expert Seed Creation: We manually curated a small set of approximately 20 “Instruction-Input-Output” seed samples covering typical rehabilitation scenarios (e.g., normal training, fatigue indications, sudden impacts, falls).
2. LLM-Driven Augmentation: Using a large-scale foundation model (Qwen3-Max), we generated 2000 diverse training samples based on the seed patterns. The prompt constrained the generator to introduce variations in semantic phrasing (e.g., rephrasing “Seat: not detected” to “Seat: user stood up”) while maintaining strict logical consistency with the safety protocols.
3. Standardization: The augmented data was formatted into a strict JSON structure for Supervised Fine-Tuning (SFT).

We standardized the data format to maintain consistency between training and inference. Figure 8 shows a representative sample.

4.3.2. Fine-Tuning Implementation

We fine-tuned Qwen3-1.7B on the synthetic dataset using Low-Rank Adaptation (LoRA). We applied LoRA modules to the linear layers in both the attention mechanisms and feed-forward networks. Table 1 details the hyperparameter configuration.

We selected a scaling ratio of $α / r = 4$. This configuration strengthens the weight updates for the rehabilitation data, helping the model learn the semantic–action mapping while avoiding the instability of full-parameter fine-tuning.
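For context, the general LoRA update rule shows what this ratio controls. The expression below is the standard LoRA formulation rather than a detail taken from Table 1; the symbols $W_{0}$ (frozen pretrained weight), $B$ and $A$ (trainable low-rank factors), and the dimensions $d$ and $k$ are introduced here only for illustration.

```latex
W = W_{0} + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

With $α / r = 4$, the low-rank product $BA$ is multiplied by 4 before it is added to the frozen weight, so the adapter's contribution is amplified relative to a ratio of 1.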

4.3.3. Quantitative Evaluation

We evaluated the fine-tuned model on a held-out test set of 2000 samples. The model achieved an accuracy of 98.5% (1970/2000).

Inference Speed:

The average inference latency on the local edge GPU was 0.223 s. Beyond average performance, tail latency is critical for safety guarantees to prevent sporadic delays. The system achieved a P95 latency of 0.227 s and a P99 latency of 0.255 s. This distribution confirms that even in outlier cases, the control loop maintains a reaction speed within the acceptable safety margin. In the context of physical human–robot interaction (pHRI), this latency is comparable to the human physiological haptic response time, which typically ranges from 198 ms to 313 ms [55]. Furthermore, since the LLM-based cognitive core is utilized for high-level control strategy switching (e.g., transition from active to passive mode) rather than low-level real-time trajectory servoing, this delay remains transparent to the user and does not compromise the stability of the interaction.
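Summary statistics of this kind can be computed from logged per-inference latencies as sketched below. The function name `latency_summary` is hypothetical, and the percentile method (linear interpolation via the standard library's `statistics.quantiles`) is an assumption, since the text does not state how P95 and P99 were obtained.

```python
from statistics import mean, quantiles


def latency_summary(latencies_s):
    """Mean, P95, P99 and equivalent decision rate for a list of latencies in seconds."""
    cuts = quantiles(latencies_s, n=100, method="inclusive")  # 99 percentile cut points
    average = mean(latencies_s)
    return {
        "mean_s": average,
        "p95_s": cuts[94],           # 95th percentile
        "p99_s": cuts[98],           # 99th percentile
        "rate_hz": 1.0 / average,    # decisions per second at the mean latency
    }
```

The reported mean of 0.223 s corresponds to 1 / 0.223 ≈ 4.48 decisions per second, which matches the inference speed quoted in the Introduction.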

Error Analysis for Safety (Action 2):

We analyzed the confusion matrix directly to assess safety risks:

False Positives: Out of 1121 samples predicted as “Stop”, 4 were actually “Soft” cases.
False Negatives: Out of 1129 actual “Stop” cases, 12 were misclassified. All 12 were labeled as “Soft” (Action 1), and none were labeled as “Continue” (Action 0).

The data (Figure 9 and Table 2) indicates that while the model is not perfectly accurate, the errors in this test set were confined to adjacent safety levels.
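The counts above determine the precision and recall of the “Stop” class. The sketch below derives them from the four reported numbers; the function name and the returned field names are illustrative, and the derived rates are arithmetic consequences of those counts rather than figures quoted in the text.

```python
def stop_class_metrics(predicted_stop, actual_stop, false_positives, false_negatives):
    """Derive true positives, precision, recall and miss rate for one class."""
    true_positives = actual_stop - false_negatives
    if true_positives != predicted_stop - false_positives:
        raise ValueError("counts are inconsistent with each other")
    return {
        "true_positives": true_positives,
        "precision": true_positives / predicted_stop,
        "recall": true_positives / actual_stop,
        "miss_rate": false_negatives / actual_stop,
    }


# Counts reported for Action 2 ("Stop") on the 2000-sample test set
metrics = stop_class_metrics(
    predicted_stop=1121, actual_stop=1129, false_positives=4, false_negatives=12
)
```

Both routes give 1117 true positives (1129 − 12 and 1121 − 4), so the reported counts are mutually consistent. The emergency miss rate is 12/1129, roughly 1.1%, and every missed case fell to the adjacent “Soft” level.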

4.3.4. Supplementary Experiments on Small-Sample Data

We performed two additional checks using limited datasets to observe model behavior on non-synthetic data.

1. Expert Seed Data (N = 20)

We tested the model on the 20 initial manual seed samples as shown in Figure 10. The accuracy was 90.0% (18/20).

The 2 errors were: one “Continue” classified as “Soft”, and one “Continue” classified as “Stop”.
In this specific set, all 10 actual “Stop” samples were correctly classified.

2. Unseen Vocabulary (N = 10)

We tested 10 samples containing terms not present in the training set (e.g., blooding, sleep, crying, scared, playing game). The performance is visualized in Figure 11, where the accuracy was 80.0% (8/10).

The two errors were conservative (Continue classified as Soft or Soft classified as Stop).
High-risk terms such as blooding and scared were mapped to the “Stop” action in these instances.

5. Semantic–Physical Fusion Interactive Control

The final stage maps the discrete semantic decisions (Action Tokens) from the LLM to continuous robot motion parameters. To ensure safety, the system modulates the reference trajectory parameters: Amplitude (A) and Period (T). The reference trajectory for the rehabilitation task is a sinusoidal motion:

$θ_{ref}(t) = A \cdot \sin\left(\frac{2 π}{T} t\right) + θ_{bias}$

(7)

where $θ_{bias}$ is the initial joint offset. The control framework adjusts A and T based on the decision output. Specifically, $θ_{ref}$ denotes the desired angular position of the human joint. To execute this motion, the control system maps $θ_{ref}$ to the Cartesian coordinates of the interaction points $s_{1}$ and $s_{2}$ (as defined in Figure 4) via the limb’s forward kinematics. The robot’s end-effectors are commanded to track these points while maintaining an orientation perpendicular to the limb segments. This kinematic mapping and the associated SE(3) transformations are computed in real-time using the Pinocchio library, ensuring precise coordination between the dual-arm system and the target limb configuration.

5.1. Action Space Definition

The LLM processes semantic information to generate control commands. We defined three discrete action modes for this experiment, although the framework can support larger action spaces.

Action 0 (Continue Training): The robot executes the standard rehabilitation protocol.

$A \to A_{default}, T \to T_{default}$

(8)

where $A_{default}$ and $T_{default}$ are the standard amplitude and period parameters preset for the rehabilitation session. The robot maintains a standard rhythm (e.g., $T = 3$ s) and full range of motion.
Action 1 (Soft Mitigation): Triggered when the LLM determines that the situation is not dangerous but the user requires lower training intensity based on comprehensive semantic understanding. The objective is to dampen the interaction energy without stopping the training.

$A^{'} = γ_{A} \cdot A_{default} (0 < γ_{A} < 1)$

(9)

$T^{'} = γ_{T} \cdot T_{default} (γ_{T} > 1)$

(10)

In our experiments, we set $γ_{A} = 0.5$ and $γ_{T} = 1.6$. This results in a smaller and slower motion profile, allowing the user to recover control and reducing the risk of resonance with tremors.
Action 2 (Emergency Stop): Triggered when the LLM determines that the user has an emergency risk requiring immediate cessation. The system immediately prioritizes safety by freezing the motion at the current position:

$A \to 0, θ_{bias} \to θ_{current}$

(11)

By resetting the bias to the current angle ($θ_{current}$) while zeroing the amplitude, the trajectory is effectively flattened at the exact moment of the event, preventing the robot from dragging the user back to the center or causing secondary injury.
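The mapping in Equations (8)–(11) can be sketched as a small function from an action token to trajectory parameters. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `TrajectoryParams` and `apply_action` are hypothetical, and rejecting unknown tokens with an exception is a choice made for the example. The scale factors 0.5 and 1.6 are the values reported above.

```python
from dataclasses import dataclass, replace

GAMMA_A = 0.5  # amplitude scale reported for Action 1 (Eq. 9)
GAMMA_T = 1.6  # period scale reported for Action 1 (Eq. 10)


@dataclass(frozen=True)
class TrajectoryParams:
    amplitude: float  # A
    period: float     # T, in seconds
    bias: float       # theta_bias


def apply_action(action, default, theta_current):
    """Map a discrete action token (0, 1, 2) to trajectory parameters."""
    if action == 0:
        # Continue Training: keep the preset amplitude and period (Eq. 8).
        return default
    if action == 1:
        # Soft Mitigation: smaller and slower motion (Eqs. 9 and 10).
        return replace(default,
                       amplitude=GAMMA_A * default.amplitude,
                       period=GAMMA_T * default.period)
    if action == 2:
        # Emergency Stop: zero amplitude, hold the current angle (Eq. 11).
        return replace(default, amplitude=0.0, bias=theta_current)
    raise ValueError(f"unknown action token: {action!r}")


# Hypothetical defaults: A = 0.4 rad, T = 3 s, bias = 0.2 rad.
default = TrajectoryParams(amplitude=0.4, period=3.0, bias=0.2)
soft = apply_action(1, default, theta_current=0.35)
stop = apply_action(2, default, theta_current=0.35)
```

With these example defaults, Soft Mitigation halves the amplitude and stretches the period from 3 s to about 4.8 s, while Emergency Stop yields a constant reference equal to the current angle. The sketch does not address phase continuity when T changes mid-motion; the text does not describe how that transition is handled.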

5.2. Experimental Setup

We conducted experiments involving a healthy subject simulating various rehabilitation scenarios. The hardware setup included:

Integrated Robot Platform: A dual-arm upper limb rehabilitation robot integrated with 6-axis Force/Torque sensors and a built-in RGB-D camera for capturing multi-modal user states (Force, Facial, Posture).
Computational Unit (Laptop): A laptop serving as the central processing station. It executes the visual recognition algorithms and hosts the fine-tuned LLM to perform real-time semantic reasoning and decision generation.
Seat Sensor: A binary pressure switch placed on the seat, providing definitive “On-Seat” or “Off-Seat” signals.

We designed four experimental scenarios (Table 3) to evaluate the system’s performance under conflicting or ambiguous conditions. These cases test how the LLM integrates diverse inputs (human posture, internal limb torque $τ_{limb-internal}$, facial expression, and seat status) to select the appropriate safety strategy.

5.3. Results and Analysis

We analyzed the system’s decision making and control performance across the four designed scenarios. To verify repeatability, each experimental case was conducted with three trials. The system exhibited consistent behavior across all repetitions, correctly identifying safety levels and executing corresponding control actions in every trial. For clarity and conciseness, the results presented in the following figures correspond to a representative trial from each case. The results demonstrate how the fine-tuned LLM effectively maps multi-modal semantic tokens to appropriate safety protocols.

5.3.1. Adaptive Control and Ambiguity Resolution (Cases A and B)

Figure 12 details the system’s response to physiological disturbances (Case A). The force sensor detected high torque fluctuations (Tremor), while other sensors remained stable. The system interpreted this as muscle spasms. Consequently, Action 1 (Soft Mitigation) activated, reducing trajectory amplitude to accommodate the involuntary movements.

Case B presents an ambiguous condition: the user leans back, producing a “Lying” posture token and a torque impact, while the seat sensor still reports On Seat. In a traditional rule-based system, a “Lying” signal combined with a torque spike would typically trigger a fall alarm. However, as illustrated in Figure 13, the LLM selected Action 0 (Continue). This result highlights an emergent “black-box” behavior: the model appears to have implicitly correlated the torque transient with the benign postural shift (leaning back). Instead of treating the raw torque spike as a spasm, it filtered the disturbance as a mechanical side effect of the voluntary movement, prioritizing the context (On Seat) over the instantaneous sensor fluctuation.

5.3.2. Critical Safety Interventions (Cases C and D)

Case C (Figure 14) presents a confirmed hazard. Unlike Case B, the visual “Lying” token coincided with a Seat: Off signal. This cross-validation confirmed a Fall Event. The model identified the high-risk pattern and executed Action 2 (Emergency Stop), instantly latching the position to prevent secondary injury.

Case D (Figure 15) tests the impact of facial semantics on safety logic. The physical signals (Tremor + On Seat) matched Case A exactly. However, the visual module detected a Pain expression. Although the user remained seated, the system recognized that tremor accompanied by pain indicates acute distress rather than simple fatigue. It escalated the response from Action 1 (Soft Mitigation) to Action 2 (Emergency Stop), overriding the standard adaptive protocol.

5.4. Discussion: Contextual Reasoning and Safety Prioritization

The results indicate that the LLM-based control loop enables dynamic responses beyond rigid thresholds. We analyze this performance through two key mechanisms:

Resolving Contextual Ambiguity (Case B vs. C): Standard binary detectors often flag any “Lying” posture as a fall. The contrast between Case B and Case C demonstrates how the system differentiates context. By validating the visual input against the seat sensor, the model separates a dangerous fall (Lying + Off Seat) from a benign rest break (Lying + On Seat). This significantly reduces false alarms without compromising sensitivity to actual hazards. Crucially, this continuous pose monitoring supports flexible, pose-agnostic rehabilitation (e.g., bed-side vs. seated) suitable for unstructured home environments, rather than enforcing the rigid, static setups typical of clinical protocols.

Prioritizing Human States over Physical Signals (Case A vs. D): Comparing Case A and Case D illustrates the hierarchy of inputs. From a physical-signal perspective, both cases present identical torque disturbances (Tremor) while the user is seated. A conventional admittance controller would likely react identically to both by simply reducing the stiffness. However, the detection of “Pain” in Case D forces an Emergency Stop. The system prioritizes the semantic “Pain” signal over the physical “Tremor” signal, aligning the control strategy with clinical safety requirements rather than with signal processing logic alone.

In conclusion, the fusion strategy maps sensor data to semantic tokens, allowing the safety protocols to adapt to the broader context of the user’s state.
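The contrast drawn in this discussion can be made concrete with a small comparison. The sketch below encodes the four cases of Table 3 together with the reported LLM decisions, and sets them against a deliberately naive rule-based baseline. The baseline is a hypothetical construction for illustration only (any “Lying” posture with a torque disturbance is a fall; any other disturbance triggers softening); it is not a system evaluated in the paper.

```python
# Cases and LLM decisions as listed in Table 3.
CASES = {
    "A": {"posture": "Sitting", "face": "Normal", "seat": "On",
          "disturbance": "Tremor", "llm_action": 1},
    "B": {"posture": "Lying", "face": "Normal", "seat": "On",
          "disturbance": "Impact", "llm_action": 0},
    "C": {"posture": "Lying", "face": "Normal", "seat": "Off",
          "disturbance": "Impact", "llm_action": 2},
    "D": {"posture": "Sitting", "face": "Pain", "seat": "On",
          "disturbance": "Tremor", "llm_action": 2},
}


def naive_rule_baseline(obs):
    """Hypothetical rule set that ignores seat status and facial expression."""
    if obs["posture"] == "Lying" and obs["disturbance"] is not None:
        return 2  # treat every "Lying" + torque spike as a fall
    if obs["disturbance"] is not None:
        return 1  # soften on any other torque disturbance
    return 0


# Cases where the naive rules and the reported LLM decision differ.
disagreements = [name for name, obs in CASES.items()
                 if naive_rule_baseline(obs) != obs["llm_action"]]
```

Under these assumed rules, the baseline differs from the reported decisions in Case B (a false fall alarm) and Case D (a missed escalation), which are the two cases the discussion attributes to seat-status and facial-expression context.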

6. Conclusions and Future Work

In this paper, we presented a Semantic–Physical Sensor Fusion framework for dual-arm rehabilitation robots. The core novelty lies in bridging the gap between high-level intent understanding and low-level real-time control by unifying a dynamics observer with a Large Language Model (LLM).

Our experiments demonstrated that the “Seed-Augment-Train” pipeline is an effective strategy for adapting general-purpose LLMs to niche rehabilitation tasks. The fine-tuned model achieved 98.5% accuracy in intent detection and showed capability in handling unseen expressions. Crucially, the dynamics observer functioned as a reliable “virtual sensor,” enabling the system to detect torque anomalies even when visual data was occluded. On the execution side, the system successfully shifted from passive impedance adjustment to active trajectory modulation, triggering appropriate responses, from soft mitigation to emergency stops, with a control loop latency of approximately 0.223 s on a laptop.

Despite these promising results, several limitations remain to be addressed in future work:

1. Dynamics Calibration: While the current observer effectively detects relative anomalies (e.g., collisions), the absolute accuracy of torque estimation needs improvement through more rigorous parameter identification.
2. Continuous Modulation: We aim to move beyond the current discrete action space (Action 0/1/2) to continuous modulation coefficients. This would allow the LLM to generate smoother, non-binary assistance strategies based on the context.
3. Hierarchical Architecture: To balance reasoning depth with reaction speed, we plan to investigate a “Fast-Slow” architecture: employing Vision-Language-Action (VLA) models for complex, low-frequency reasoning, while retaining a lightweight Edge-LLM for high-frequency safety reflexes.

Author Contributions

Conceptualization, D.Z. and S.S.; methodology, D.Z. and X.W.; software, D.Z.; validation, D.Z. and S.S.; formal analysis, D.Z. and X.W.; investigation, D.Z.; resources, S.S.; data curation, D.Z.; writing—original draft preparation, D.Z.; writing—review and editing, X.W. and S.S.; visualization, D.Z.; supervision, S.S.; project administration, S.S.; funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, Grant/Award Number: 82372584.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Biomedical Ethics Committee of Peking University (protocol code IRB00001052-23046). This study is a sub-project registered with the Chinese Clinical Trial Registry (registration number ChiCTR2300073480).

Informed Consent Statement

Informed consent was obtained from the subject involved in the study.

Data Availability Statement

The data presented in this study are available in this article.

Acknowledgments

The authors would like to thank the members of the VitalLab (School of Nursing) and the HomeLab (School of Advanced Manufacturing and Robotics) at Peking University for their support. During the preparation of this manuscript, the authors used ChatGLM v1.1.0 for language polishing, as well as for assistance with LaTeX formatting and debugging. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.


Figure 1. Key components of the rehabilitation robot system.

Figure 2. Key components of the robotic prosthetic verification platform.

Figure 3. Operational diagram and data flow of the integrated system.

Figure 4. Kinematic model and vector parameterization of the two-link limb system.

Figure 5. Experimental scenario: a dual-arm robot manipulating the instrumented prosthesis to validate internal torque estimation.

Figure 6. Estimation accuracy: comparison between the observed limb torque (orange) and the ground truth sensor data (blue).

Figure 7. Data flow architecture of the asynchronous semantic state pool. Heterogeneous sensor data (visual, physical, user input) are independently processed into discrete semantic tokens and aggregated into a unified state snapshot for the LLM.

Figure 8. Visualization of the prompt structure. The input aggregates multimodal semantic tokens, while the instruction enforces a standardized action space.

Figure 9. Confusion matrix on the held-out test set (N = 2000).

Figure 10. Performance on manually curated seed data (N = 20).

Figure 11. Performance on out-of-distribution vocabulary (N = 10).

Figure 12. Adaptive compliance response to joint torque tremor (Case A).

Figure 13. Accurate state determination via multi-modal semantic fusion (Case B).

Figure 14. Immediate emergency stop triggered by a confirmed fall event (Case C).

Figure 15. Safety intervention triggered by pain detection during tremor of joint torque (Case D).

Table 1. Hyperparameter settings for LoRA fine-tuning.

Parameter	Value
Base Model	Qwen3-1.7B
LoRA Rank (r)	8
LoRA Alpha (α)	32
Target Modules	q_proj, k_proj, v_proj, o_proj, gate, up, down
Dropout	0.1
Learning Rate	$3 \times 10^{-4}$
Batch Size	16
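
For reference, the settings in Table 1 can be written as a plain configuration mapping. This is a sketch only: the dictionary keys are hypothetical names, not those of any particular training library, and the target module names are copied as listed in the table. The derived scaling factor assumes the standard LoRA formulation, in which the low-rank update is multiplied by α/r.

```python
# Table 1 values as a plain dictionary (key names are illustrative).
lora_config = {
    "base_model": "Qwen3-1.7B",
    "rank": 8,
    "alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate", "up", "down"],
    "dropout": 0.1,
    "learning_rate": 3e-4,
    "batch_size": 16,
}

# Standard LoRA scales the low-rank update by alpha / r (assumed here).
scaling = lora_config["alpha"] / lora_config["rank"]
```

With α = 32 and r = 8, this assumed scaling factor is 4.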

Table 2. Classification performance report (N = 2000).

Class	Precision	Recall	F1-Score	Support
0 (Continue)	0.93	0.70	0.80	40
1 (Soft)	0.97	0.99	0.98	831
2 (Stop)	1.00	0.99	0.99	1129
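
As a consistency check on Table 2, the F1 scores can be recomputed from the listed precision and recall using F1 = 2PR/(P + R), and an approximate overall accuracy can be estimated from recall weighted by support. The sketch below uses only the rounded values printed in the table, so the accuracy estimate is approximate; the variable names are illustrative.

```python
# (precision, recall, reported F1, support) per class, from Table 2.
report = {
    "0 (Continue)": (0.93, 0.70, 0.80, 40),
    "1 (Soft)": (0.97, 0.99, 0.98, 831),
    "2 (Stop)": (1.00, 0.99, 0.99, 1129),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Recompute F1 per class, rounded to two decimals like the table.
recomputed_f1 = {name: round(f1_score(p, r), 2)
                 for name, (p, r, _, _) in report.items()}

# Recall * support approximates the number of correct predictions per class.
total = sum(n for _, _, _, n in report.values())
approx_accuracy = sum(r * n for _, r, _, n in report.values()) / total
```

The recomputed F1 values match the table to two decimals, and the support counts sum to the stated N = 2000. The recall-weighted estimate comes out at roughly 98%, in line with the 98.5% accuracy reported in the conclusions, given that the table's recall values are rounded.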

Table 3. Experimental case design and LLM decisions.

Case	HPE	FER	Seat	Torque Disturbance	Action
A	Sitting	Normal	On	Yes (Tremor)	1 (Soft)
B	Lying	Normal	On	Yes (Impact)	0 (Continue)
C	Lying	Normal	Off	Yes (Impact)	2 (Stop)
D	Sitting	Pain	On	Yes (Tremor)	2 (Stop)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
