Deep Reinforcement Learning for Temperature Control of a Two-Way SMA-Actuated Tendon-Driven Gripper
Abstract
1. Introduction
2. Materials and Methods
2.1. Mechanical Design of the SMA Gripper
2.2. Thermal Model and Simulation Environment
2.3. Deep Q-Learning Controller Design
2.3.1. State and Action Space Definition
2.3.2. Reward Function Design
- Proportional Error Penalty (rerror): A continuous, shaped reward guides the agent towards the setpoint. The penalty is proportional to the square of the error et, encouraging aggressive action when far from the target and finer control when close: rerror = −β·et², where β = 0.1.
- Overshoot Penalty (rovershoot): To teach the agent to anticipate the system’s thermal inertia, a large, discrete penalty is applied if the temperature exceeds the setpoint by a small margin (1 °C). This severe punishment effectively discourages overshooting.
- Safety temperature zones (rsafety): To ensure the physical integrity of the SMA actuator and enforce a safe operational envelope, a high-priority safety constraint mechanism was implemented. We defined a valid operating temperature range between 25 °C and 80 °C.
- Goal Achievement Reward (rgoal): A large, positive reward is given when the agent successfully navigates the temperature into a narrow tolerance band (±0.5 °C) around the setpoint. This condition defines a successful outcome and terminates the episode.
- Action Switching Penalty (rswitching): To promote a smooth and stable control policy, a small penalty is applied whenever the agent changes its action from the previous time step. This discourages inefficient, high-frequency oscillations (chattering) when the system is near a steady state. It also reduces fan switching, preventing the fan from turning on and off frequently.
- Fan Activation Bonus (rcooling_action): A reward-shaping strategy was employed to accelerate the learning process. Specifically, the agent was incentivized to explore the active cooling action. This was implemented as a small, discrete reward bonus of +2 whenever the ‘FAN ON’ action (action 11) was executed, guiding the agent to discover its effectiveness in controlling the temperature and preventing overshoot.
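The reward terms above can be combined into a single scalar signal. The sketch below is illustrative only: the β = 0.1 coefficient, the ±0.5 °C goal band, the 1 °C overshoot margin, the 25–80 °C safety range, and the +2 fan bonus come from the text, while the remaining penalty and reward magnitudes (and terminating on a safety violation) are placeholder assumptions.

```python
def compute_reward(T, T_set, action, prev_action):
    """Illustrative combination of the reward-shaping terms described above.
    Magnitudes not stated in the text are placeholder assumptions."""
    FAN_ON = 11                      # index of the active-cooling action
    e = T_set - T                    # current temperature error
    r = -0.1 * e ** 2                # r_error: squared-error penalty, beta = 0.1
    done = False

    if T > T_set + 1.0:              # r_overshoot: discrete overshoot penalty
        r -= 50.0                    # magnitude assumed
    if not (25.0 <= T <= 80.0):      # r_safety: outside the safe envelope
        r -= 100.0                   # magnitude assumed
        done = True                  # terminating here is an assumption
    if abs(e) <= 0.5:                # r_goal: inside the tolerance band
        r += 100.0                   # magnitude assumed
        done = True                  # episode ends on success (from the text)
    if action != prev_action:        # r_switching: discourage chattering
        r -= 0.5                     # magnitude assumed
    if action == FAN_ON:             # r_cooling_action: fan exploration bonus
        r += 2.0                     # +2 bonus from the text
    return r, done
```

Keeping the safety and goal terms large relative to the shaped error penalty preserves their intended priority ordering regardless of the exact values chosen.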
2.3.3. Hyperparameter Selection
2.3.4. Training in the Simulation Environment
2.3.5. Evaluation in Simulation
- Overshoot Elimination via Anticipatory Control: A critical achievement of the learned policy is the near-complete elimination of temperature overshoot. As seen at each upward step in the setpoint, the agent intelligently deactivates the heater before reaching the target. This “braking” behavior shows that the agent successfully learned to account for the system’s thermal inertia, a sign of an advanced control strategy. This behavior was a direct result of the large, discrete penalty for overshooting in the reward function.
- High Stability and Smooth Control: The policy is remarkably stable once the setpoint is reached. The high-frequency oscillations or “chattering” observed in earlier training stages were successfully eliminated. The agent learned to maintain a steady temperature using low-power PWM actions, which is a direct consequence of the penalty for frequent action switching. This results in a smoother, more efficient, and more hardware-friendly control approach.
3. Experiments and Results
3.1. Deployment and Evaluation on Physical Hardware
3.2. Temperature Control Evaluation
3.3. Experiment of Grasping Objects with Different Shapes, Sizes, and Weights
3.3.1. Experiment Setup of Grasping Objects with Different Shapes, Sizes, and Weights
- The target object is placed between the two gripper jaws.
- The temperature setpoint is gradually increased from 35 °C until the gripper can securely hold the object for at least two minutes, ensuring steady-state stability.
- A photograph of the grasping action is captured, and the corresponding temperature is recorded for the successful grasping event.
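The protocol above amounts to a setpoint ramp with a hold test at each level. The following sketch is a minimal illustration; the `can_hold` callback, the 1 °C step, and the 80 °C ceiling are assumptions, with `can_hold(T)` standing in for commanding the controller to T, waiting for steady state, and verifying a secure two-minute grasp.

```python
def find_grasp_temperature(can_hold, t_start=35.0, t_max=80.0, step=1.0):
    """Ramp the setpoint up from 35 C until the gripper securely holds the
    object for the required two-minute steady-state window.
    Step size and ceiling are illustrative assumptions."""
    T = t_start
    while T <= t_max:
        if can_hold(T):              # steady-state hold check at setpoint T
            return T                 # record temperature of successful grasp
        T += step
    return None                      # object could not be held within range
```

For example, an object that needs roughly 42 °C of actuation heat would be reported as soon as the ramp first reaches a setpoint at which the hold test passes.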
3.3.2. Grasping Tests with Various Target Objects
4. Discussion
4.1. Assessment of the Proposed Control Strategy
4.2. Evaluation of the Gripper Design
4.3. Limitations of the Sim-to-Real Pipeline
5. Conclusions
- First, a discrete 12-action control space was designed and trained within a high-fidelity thermal simulation to enable robust Sim-to-Real transfer. Experimental results demonstrate that the learned DRL policy can be directly deployed on physical hardware without additional fine-tuning, confirming the effectiveness of the discrete action formulation in mitigating model mismatch and hardware uncertainties.
- Second, the proposed controller achieves high-precision temperature regulation across a wide operating range (35–70 °C). In the low-temperature regime (<50 °C), a mean steady-state error of approximately 0.26 °C was achieved. At higher operating temperatures (50–70 °C), the steady-state error increased slightly to approximately 0.41 °C due to enhanced heat dissipation, yet remained well within acceptable limits for SMA-based actuation. These results were further validated through non-contact thermal imaging, which showed close agreement between the global thermal field and local thermistor measurements.
- Finally, the practical applicability of the proposed framework was verified through grasping experiments involving objects of different sizes, weights, and fragilities. By accurately regulating the SMA temperature, the gripper was able to generate appropriate grasping forces without damaging delicate objects or losing heavy payloads. Overall, the results demonstrate that the proposed Sim-to-Real DRL-based control strategy provides a reliable and scalable solution for precise temperature-controlled SMA actuation in soft robotic grippers.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ding, Q.; Chen, J.; Yan, W.; Yan, K.; Kyme, A.; Cheng, S.S. A High-Performance Modular SMA Actuator with Fast Heating and Active Cooling for Medical Robotics. IEEE/ASME Trans. Mechatron. 2022, 27, 5902–5913.
- Do, P.T.; Le, Q.N.; Luong, Q.V.; Kim, H.-H.; Park, H.-M.; Kim, Y.-J. Tendon-Driven Gripper with Variable Stiffness Joint and Water-Cooled SMA Springs. Actuators 2023, 12, 160.
- Shao, S.; Sun, B.; Ding, Q.; Yan, W.; Zheng, W.; Yan, K.; Hong, Y.; Cheng, S.S. Design, Modeling, and Control of a Compact SMA-Actuated MR-Conditional Steerable Neurosurgical Robot. IEEE Robot. Autom. Lett. 2020, 5, 1381–1388.
- An, X.; Cui, Y.; Sun, H.; Shao, Q.; Zhao, H. Active-Cooling-in-the-Loop Controller Design and Implementation for an SMA-Driven Soft Robotic Tentacle. IEEE Trans. Robot. 2023, 39, 2325–2341.
- Lexcellent, C.; Leclercq, S.; Gabry, B.; Bourbon, G. The Two Way Shape Memory Effect of Shape Memory Alloys: An Experimental Study and a Phenomenological Model. Int. J. Plast. 2000, 16, 1155–1168.
- Cho, M.; Kim, S. Structural Morphing Using Two-Way Shape Memory Effect of SMA. Int. J. Solids Struct. 2005, 42, 1759–1776.
- Gur, S.; Frantziskonis, G.N.; Muralidharan, K. Atomistic Simulation of Shape Memory Effect (SME) and Superelasticity (SE) in Nano-Porous NiTi Shape Memory Alloy (SMA). Comput. Mater. Sci. 2018, 152, 28–37.
- Kumar Patel, S.; Swain, B.; Roshan, R.; Sahu, N.K.; Behera, A. A Brief Review of Shape Memory Effects and Fabrication Processes of NiTi Shape Memory Alloys. Mater. Today Proc. 2020, 33, 5552–5556.
- Wang, W.; Xiang, Y.; Yu, J.; Yang, L. Development and Prospect of Smart Materials and Structures for Aerospace Sensing Systems and Applications. Sensors 2023, 23, 1545.
- Costanza, G.; Tata, M.E. Shape Memory Alloys for Aerospace, Recent Developments, and New Applications: A Short Review. Materials 2020, 13, 1856.
- Chau, E.T.F.; Friend, C.M.; Allen, D.M.; Hora, J.; Webster, J.R. A Technical and Economic Appraisal of Shape Memory Alloys for Aerospace Applications. Mater. Sci. Eng. A 2006, 438–440, 589–592.
- Chaudhary, K.; Haribhakta, V.K.; Jadhav, P.V. A Review of Shape Memory Alloys in MEMS Devices and Biomedical Applications. Mater. Today Proc. 2024, S2214785324002943.
- Bouchareb, N.; Fellah, M.; Hezil, N.; Guesmi, A.; Khezami, L. A Review of Nitinol Shape Memory Alloys for Biomedical Applications: Advancements and Biocompatibility. JOM 2025, 78, 140–167.
- Xu, L.; Wagner, R.J.; Liu, S.; He, Q.; Li, T.; Pan, W.; Feng, Y.; Feng, H.; Meng, Q.; Zou, X.; et al. Locomotion of an Untethered, Worm-Inspired Soft Robot Driven by a Shape-Memory Alloy Skeleton. Sci. Rep. 2022, 12, 12392.
- Xu, Y.; Zhuo, J.; Fan, M.; Li, X.; Cao, X.; Ruan, D.; Cao, H.; Zhou, F.; Wong, T.; Li, T. A Bioinspired Shape Memory Alloy Based Soft Robotic System for Deep-Sea Exploration. Adv. Intell. Syst. 2024, 6, 2300699.
- Kang, G.; Zhang, H.; Ma, Z.; Ren, Y.; Cui, L.; Yu, K. Large Thermal Hysteresis in a Single-Phase NiTiNb Shape Memory Alloy. Scr. Mater. 2022, 212, 114574.
- Zhang, C.; Chen, X.; Hubert, O.; He, Y. Characterization on Thermal Hysteresis of Shape Memory Alloys via Macroscopic Interface Propagation. Materialia 2024, 33, 102038.
- Roman, R.-C.; Precup, R.-E.; Preitl, S.; Szedlak-Stinean, A.-I.; Bojan-Dragos, C.-A.; Hedrea, E.L.; Petriu, E.M. PI Controller Tuning via Data-Driven Algorithms for Shape Memory Alloy Systems. IFAC-PapersOnLine 2022, 55, 181–186.
- Ruth, D.J.S.; Sohn, J.-W.; Dhanalakshmi, K.; Choi, S.-B. Control Aspects of Shape Memory Alloys in Robotics Applications: A Review over the Last Decade. Sensors 2022, 22, 4860.
- Ma, B.; Liu, H.; Hao, L. Model-Free Adaptive Sliding Mode Control of Parallel Platform Actuated by Shape Memory Alloys. IEEE Access 2025, 13, 160845–160854.
- Khan, A.M.; Bijalwan, V.; Baek, H.; Shin, B.; Kim, Y. Dynamic High-Gain Observer Approach with Sliding Mode Control for an Arc-Shaped Shape Memory Alloy Compliant Actuator. Microsyst. Technol. 2024, 30, 1593–1600.
- Li, J.; Pi, Y. Fuzzy Time Delay Algorithms for Position Control of Soft Robot Actuated by Shape Memory Alloy. Int. J. Control Autom. Syst. 2021, 19, 2203–2212.
- Ali, H.F.M.; Kim, Y.; Le, Q.H.; Shin, B. Modeling and Control of Two DOF Shape Memory Alloy Actuators with Applications. Microsyst. Technol. 2022, 28, 2305–2314.
- Morales, E.F.; Murrieta-Cid, R.; Becerra, I.; Esquivel-Basaldua, M.A. A Survey on Deep Learning and Deep Reinforcement Learning in Robotics with a Tutorial on Deep Reinforcement Learning. Intell. Serv. Robot. 2021, 14, 773–805.
- Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to Train Your Robot with Deep Reinforcement Learning: Lessons We Have Learned. Int. J. Robot. Res. 2021, 40, 698–721.
- Chai, R.; Niu, H.; Carrasco, J.; Arvin, F.; Yin, H.; Lennox, B. Design and Experimental Validation of Deep Reinforcement Learning-Based Fast Trajectory Planning and Control for Mobile Robot in Unknown Environment. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5778–5792.
- Zhu, K.; Zhang, T. Deep Reinforcement Learning Based Mobile Robot Navigation: A Review. Tsinghua Sci. Technol. 2021, 26, 674–691.
- Chaoui, H.; Gualous, H.; Boulon, L.; Kelouwani, S. Deep Reinforcement Learning Energy Management System for Multiple Battery Based Electric Vehicles. In Proceedings of the 2018 IEEE Vehicle Power and Propulsion Conference (VPPC), Chicago, IL, USA, 27–30 August 2018; IEEE: Chicago, IL, USA; pp. 1–6.
| Order | Action | Control Signal | Detail |
|---|---|---|---|
| 1 | OFF (PWM 0%) | Natural cooling | Natural air cooling |
| 2 | PWM 10% | 10% heating power | Low PWM action |
| 3 | PWM 20% | 20% heating power | Low PWM action |
| 4 | PWM 30% | 30% heating power | Low PWM action |
| 5 | PWM 40% | 40% heating power | Intermediate PWM action |
| 6 | PWM 50% | 50% heating power | Intermediate PWM action |
| 7 | PWM 60% | 60% heating power | Intermediate PWM action |
| 8 | PWM 70% | 70% heating power | High PWM action |
| 9 | PWM 80% | 80% heating power | High PWM action |
| 10 | PWM 90% | 90% heating power | High PWM action |
| 11 | PWM 100% | 100% heating power | High PWM action |
| 12 | Fan On | Active cooling | Force air cooling |
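The 12-action table can be decoded into actuator commands as follows. This is a sketch under the assumption of 0-based action indices (0–10 for heater PWM in 10% steps, 11 for the fan); the function name and return convention are illustrative, not from the paper.

```python
def decode_action(action: int):
    """Map a discrete action index (0-11) to a (heater_duty, fan_on) pair.
    Indices 0-10 select heater PWM duty from 0% to 100% in 10% steps;
    index 11 switches the heater off and enables forced-air cooling."""
    if not 0 <= action <= 11:
        raise ValueError(f"action must be in 0..11, got {action}")
    if action == 11:                 # 'Fan On': active cooling
        return 0.0, True
    return action * 0.10, False      # heater duty cycle, fan off
```

A discrete mapping of this kind keeps the policy output directly executable on hardware PWM channels, which is part of what makes the Sim-to-Real transfer straightforward.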
| Category | Hyperparameter | Value |
|---|---|---|
| Network Architecture | Input layer size | 2 |
| | Hidden layers | 2 layers × 128 neurons |
| | Output layer size | 12 |
| Training Algorithm | Learning rate (α) | 0.0001 |
| | Discount factor (γ) | 0.99 |
| | Epsilon (initial) | 0.9 |
| | Epsilon (final) | 0.2 |
| | Epsilon decay rate | 0.997 |
| | Batch size | 50,000 |
| | Target network update frequency | 10 episodes |
| Training Schedule | Number of episodes | 500 |
| | Max steps per episode | 300 |
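The network architecture and exploration schedule in the table can be sketched as below. The weight initialization, the 2-element state layout, and per-episode (rather than per-step) epsilon decay are assumptions; only the layer sizes (2 → 128 → 128 → 12), ReLU hidden activations, and the 0.9 → 0.2 decay at rate 0.997 come from the table.

```python
import numpy as np

def init_qnet(rng, sizes=(2, 128, 128, 12)):
    """Initialize a Q-network matching the table: 2 inputs, two 128-neuron
    hidden layers, 12 Q-value outputs (He-style init is an assumption)."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, state):
    """Forward pass: ReLU hidden layers, linear output layer."""
    h = np.asarray(state, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # ReLU hidden activations
    W, b = params[-1]
    return h @ W + b                     # one Q-value per discrete action

def epsilon_schedule(episode, eps_start=0.9, eps_end=0.2, decay=0.997):
    """Exploration rate after a given number of episodes, decayed
    exponentially and clipped at the final value from the table."""
    return max(eps_end, eps_start * decay ** episode)
```

Note that with these values epsilon only approaches the 0.2 floor near the end of the 500-episode schedule, so the agent retains substantial exploration throughout training.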
| Target | Weight [g] | Dimensions (w × h × t) [mm] | Temperature [°C] |
|---|---|---|---|
| A small pastry | 18 | 23 × 23 × 8 | 35 |
| A grape | 13 | 20 × 18 × 17 | 35 |
| A small spool of thread | 20 | 22 × 20 × 20 | 35 |
| A metal adhesive bottle | 75 | 90 × 40 × 16 | 69 |
| A plug socket | 57 | 60 × 45 × 45 | 42 |
| An acetone bottle | 100 | 100 × 36 × 36 | 78 |
| Abbreviation | Full Term |
|---|---|
| SMA | Shape Memory Alloy |
| PWM | Pulse Width Modulation |
| DRL | Deep Reinforcement Learning |
| SME | Shape Memory Effect |
| PID | Proportional-Integral-Derivative |
| SMC | Sliding Mode Control |
| FLC | Fuzzy Logic Control |
| DQN | Deep Q-Network |
| DQL | Deep Q-Learning |
| DNN | Deep Neural Network |
| ReLU | Rectified Linear Unit |
| ADC | Analog-to-Digital Converter |
| UART | Universal Asynchronous Receiver/Transmitter |
| CDC | Communications Device Class |
| TWSME | Two-way shape memory effect |
| NiTi | Nickel–Titanium |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Do, P.T.; Le, Q.N.; Park, H.; Kim, H.; Shim, S.; Park, K.; Kim, Y. Deep Reinforcement Learning for Temperature Control of a Two-Way SMA-Actuated Tendon-Driven Gripper. Actuators 2026, 15, 37. https://doi.org/10.3390/act15010037

