Investigations into the Design and Implementation of Reinforcement Learning Using Deep Learning Neural Networks
Abstract
1. Introduction
- We developed an accurate simplified CCS model in a state-space representation. Based on the input–output measurement dataset of the open-loop multi-input multi-output (MIMO) CCS extended nonlinear model, which has 39 states, three inputs, and two outputs, a delayed fourth-order MIMO autoregressive moving average with exogenous input (ARMAX) polynomial model in the complex z-domain was obtained. This ARMAX model was then converted into a linearized MIMO CCS state-space model with four states, three inputs, and two outputs;
- We designed and tuned a standard PID controller and also a standard model predictive control (MPC) for comparison purposes;
- We built two RL DLNN controllers connected in series in a forward path with the MIMO CCS simplified model represented in state space, for temperature control inside the evaporator subsystem and liquid refrigerant level control within the condenser subsystem;
- We rigorously evaluated the tracking performance for both RL DLNNs and compared it with the results obtained using classic PID or MPC controllers.
2. Related Work and Preliminaries on HVAC Systems
2.1. Literature Review of HVAC Control Systems
2.2. Preliminaries—Adopted CCS Model and Its Implementation
2.2.1. MIMO Centrifugal Chiller Modeling Assumptions
2.2.2. Case Study—MIMO Centrifugal Chiller System Assumptions and Decomposition
2.2.3. Open-Loop MIMO Centrifugal Chiller System MATLAB Simulink Extended Model Diagram
2.2.4. Case Study—The Data-Driven ARMAX Model for MIMO Centrifugal Chiller Subsystems in Discrete-Time State-Space Representation
3. Traditional Controllers: PID and MPC Closed-Loop Control Strategies
3.1. DTI MIMO Centrifugal Chiller Closed-Loop System Control in State-Space Representation
3.2. PID MIMO Centrifugal Chiller Closed-Loop System Control in Extended State-Space Representation
3.3. Digital PID Control of MIMO Centrifugal Chiller Closed-Loop System in Extended State-Space Representation (39 States)
3.4. Model Predictive Control Based on Centrifugal Chiller MIMO State-Space Representation
4. Design and Implementation of Reinforcement Learning Using Deep Learning Neural Networks-MIMO CCS Closed-Loop Control Strategies
4.1. Reinforcement Learning Closed-Loop Control Strategy Using Deep Learning Neural Network MIMO Centrifugal Chiller Plant Model Represented in State Space
4.1.1. Reinforcement Learning Process—Description
Algorithm 1: RL Deep Learning NN [17]
- RL Agent states: observations $s_t$;
- RL Agent actions: $a_t$;
- Define the optimal policy given by the Bellman equation [23]:
  $\pi^*(s) = \arg\max_a Q^*(s, a)$, (26)
  and $Q^*(s, a) = \mathbb{E}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big]$ is the optimal state-action value function;
- Initialize the hyperparameters: learning rate $\alpha$, discount factor $\gamma$, and exploration parameter $\varepsilon$.
- for t in range(epoch) do
  - Calculate the action according to the $\varepsilon$-greedy policy:
    $a_t = \arg\max_a Q(s_t, a)$ with probability $1 - \varepsilon$, otherwise a random action; (27)
  - Send the action $a_t$ into the agent environment;
  - Evaluate the next value of the observation according to Equation (20) and [23]:
    $s_{t+1} = f(s_t, a_t)$; (28)
  - The agent senses the next observation $s_{t+1}$ and the reward given by Equations (21) and (22) (the reward could be a deterministic function $r(s_t, a_t)$ or stochastic);
  - Update the new action-value function at the state-action pair $(s_t, a_t)$ according to Equation (23);
  - Evaluate the estimation using the loss function in terms of RMSE, MSE, and MAE, as defined in [17].
- end
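The loop above can be illustrated with a minimal tabular sketch. This is not the controller used in this work (the CCS agents are TD3 agents with deep critics and actors); it is only a hedged MATLAB illustration of the ε-greedy action selection (27) and the Bellman update, in which the state/action discretization and the environment function `chillerStep` are hypothetical placeholders.

```matlab
% Minimal tabular Q-learning sketch of Algorithm 1 (illustrative only).
% "chillerStep" is a hypothetical environment function returning the next
% discretized observation and the reward; it is not part of the paper's code.
nS = 50; nA = 11;                        % assumed discretized state/action counts
Q  = zeros(nS, nA);                      % action-value table Q(s,a)
alpha = 0.1; gamma = 0.995; eps = 0.1;   % learning rate, discount factor, exploration
s = 1;                                   % initial (discretized) observation
for t = 1:1000
    if rand < eps
        a = randi(nA);                   % explore: random action
    else
        [~, a] = max(Q(s, :));           % exploit: greedy action, Eq. (27)
    end
    [sNext, r] = chillerStep(s, a);      % environment transition and reward
    target  = r + gamma * max(Q(sNext, :));          % Bellman target, Eq. (26)
    Q(s, a) = Q(s, a) + alpha * (target - Q(s, a));  % action-value update, Eq. (23)
    s = sNext;
end
```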
4.1.2. Reinforcement Learning Workflow
4.2. Deep Learning Neural Network
4.2.1. DLNN Architectures, Components, Algorithms and Applications
4.2.2. NN Table Models
- Confusion Matrix: This type of table is used to evaluate the performance of a classification model. It shows the numbers of true positives, true negatives, false positives, and false negatives. A generic example is presented as follows:

| Actual/Predicted | Positive | Negative |
|---|---|---|
| Actual | 60 | 20 |
| Predicted | 10 | 25 |

- Weight Matrix: In each NN, the weights are critical parameters that transform the input data within the network. A weight matrix holds the weights between layers of neurons; a simple matrix might look like this (a short MATLAB sketch after this list shows how such a matrix maps inputs to activations):

| Neuron k | Neuron k + 1 | Neuron k + 2 |
|---|---|---|
| 0.2 | 0.3 | −0.1 |
| −0.3 | 0.5 | 0.2 |

- Activation Table: This type of table illustrates the activation values of the neurons in a NN for a given input, as shown below:

| Input k | Input k + 1 | Neuron k activation | Neuron k + 1 activation |
|---|---|---|---|
| 0.4 | 0.2 | 0.5 | 0.5 |
| 0.3 | 0.7 | 0.2 | 0.8 |

- Loss Table: This table tracks the loss values during NN training, valuable information for understanding how well the model learns over time; an illustrative example is presented below:

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.32 | 0.35 |
| 2 | 0.15 | 0.2 |
| 3 | 0.25 | 0.25 |
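The weight matrix and activation table above are linked by a simple forward pass. The following minimal MATLAB sketch uses illustrative values only (the generic examples above, not the trained networks from this work) and shows how multiplying an input batch by a weight matrix and applying a ReLU nonlinearity yields an activation table of the kind shown.

```matlab
% Forward pass linking a weight matrix to an activation table (illustration).
W = [ 0.2  0.3 -0.1;        % weights from two inputs to three neurons (table above)
     -0.3  0.5  0.2];
X = [0.4 0.2;               % two input samples (rows): Input k, Input k + 1
     0.3 0.7];
A = max(0, X * W);          % ReLU activations: rows = samples, columns = neurons
disp(A)
```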
4.3. Reinforcement Learning Deep Learning Neural Networks Closed-Loop Control Strategies Applied to a Centrifugal Chiller System
4.3.1. Generate Reward Function from the Cost and Constraint Specifications Defined in an MPC Object
- Standard linear bounds for the output variables (OVs) and manipulated variables (MVs);
- Scale factors as specified for the OVs and MVs;
- Standard cost weights;
- The cost component, calculated according to the corresponding MPC cost equations;
- The penalty component for violation of the linear bound constraints, with the following sub-components:
  - Penalty function weight (specified as a nonnegative value);
  - The step or quadratic penalty method chosen to calculate the exteriorPenalty;
  - The Pmv value, set to 0 if the RL Agent action specification has appropriate "LowerLimit" and "UpperLimit" values;
- Penalty functions (see the MATLAB sketch after the next list).
- The observations comprised the reference signals, the output variables (evaporator temperature and condenser liquid refrigerant level), and their integrals, as shown in Figure 9j;
- The temperature and level output signals were normalized by multiplying them by the gain [1/10 1/52];
- The two control actions were limited to the ranges [0, 1.1] and [0, 1], respectively;
- The sample time was Ts = 0.1 s; in order to capture the full evolution of the dynamics for both plant outputs, the total simulation time was set to Tsim = 60 s.
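A minimal MATLAB sketch of the configuration described above is given below, using standard MPC Toolbox and Reinforcement Learning Toolbox calls (mpc, generateRewardFunction, rlNumericSpec, rlSimulinkEnv). The discrete plant model sysd, the prediction/control horizons, the observation dimension, and the Simulink model name are assumptions for illustration, not the authors' exact settings.

```matlab
% (a) Reward function generated from the MPC cost and constraint specifications
Ts = 0.1; p = 10; m = 2;                       % sample time and horizons (assumed)
mpcobj = mpc(sysd, Ts, p, m);                  % sysd: discrete simplified CCS model (placeholder)
mpcobj.MV(1).Min = 0;  mpcobj.MV(1).Max = 1.1; % actuator bounds from the text
mpcobj.MV(2).Min = 0;  mpcobj.MV(2).Max = 1;
generateRewardFunction(mpcobj)                 % emits a reward function with cost and exteriorPenalty terms

% (b) Observation/action specifications and Simulink environment
obsInfo = rlNumericSpec([6 1]);                % references, normalized outputs (gain [1/10 1/52]), integrals
actInfo = rlNumericSpec([2 1], "LowerLimit", [0; 0], "UpperLimit", [1.1; 1]);
Tsim = 60;                                     % s, total simulation time
env = rlSimulinkEnv("CCS_RL_MPC", "CCS_RL_MPC/RL Agent", obsInfo, actInfo);  % hypothetical model name
```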
4.3.2. Reinforcement Learning Deep Learning Neural Network Control Strategy—Generate a Reward Function from the Step Response Specifications of the MIMO Centrifugal Chiller Simplified Model in State-Space Representation
5. Traditional and Advanced Intelligent Closed-Loop Control Strategies—MATLAB Simulink Simulation Results
5.1. Traditional Closed-Loop Control Strategies
5.1.1. DTI Closed-Loop Control
5.1.2. PID Closed-Loop Control—Centrifugal Chiller Extended Model (39 States)
5.1.3. Digital PID Control of MIMO Centrifugal Chiller ANFIS Model
5.1.4. Model Predictive Control of MIMO Centrifugal Chiller Nonlinear Extended Model in State-Space Representation (39 States) with Input Constraints
5.2. Advanced Reinforcement Learning Using Deep Learning Neural Network Control Strategies
5.2.1. Reinforcement Learning Deep Learning Neural Network Control Strategies–Generate Reward Function from MPC of MIMO Centrifugal Chiller Simplified Model in State-Space Representation
5.2.2. Reinforcement Learning Deep Learning Neural Network Control Strategies—Generate Reward Function from Step Response Specifications of a MIMO Centrifugal Chiller Simplified Model in State-Space Representation
6. Discussion
6.1. Conventional Control Strategies
6.1.1. DTI Controllers
6.1.2. PID Control MIMO Centrifugal Chiller System Extended Model with 39 States
6.1.3. Model Predictive Control of MIMO Centrifugal Chiller Simplified Model in State-Space Representation
- settling time Ts = 24.6 s, rise time Tr = 7.6 s, an overshoot of σmax = 6.75%, zero steady-state error, and excellent disturbance rejection for evaporator temperature control;
- settling time Ts = 7.2 s, rise time Tr = 7.2 s, no overshoot, high tracking accuracy, and significant disturbance rejection for the liquid refrigerant level inside the condenser subsystem (a short MATLAB sketch after this list shows how such indices can be extracted from a logged response).
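Performance indices of this kind can be extracted from a recorded closed-loop response with MATLAB's stepinfo; in the sketch below, y, t, and yFinal are placeholder names for the logged output, time vector, and reference value, not variables from the authors' scripts.

```matlab
% Extract settling time, rise time, and overshoot from a recorded response.
S = stepinfo(y, t, yFinal, "SettlingTimeThreshold", 0.02);  % 2% settling band
Ts_settle = S.SettlingTime;   % e.g., 24.6 s for the evaporator temperature loop
Tr_rise   = S.RiseTime;       % e.g., 7.6 s
sigmaMax  = S.Overshoot;      % percent overshoot, e.g., 6.75%
```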
6.2. Advanced Intelligent Closed-Loop Neural Control Strategies
Digital PID Control of MIMO CCS MISO ANFIS Models
6.3. Advanced Reinforcement Learning Deep Learning Neural Networks Control Strategies
6.3.1. Reinforcement Learning Deep Learning Neural Network Control Strategies—Generating a Reward Function from the MPC of MIMO Centrifugal Chiller Simplified Model in State-Space Representation
6.3.2. Reinforcement Learning Deep Learning Neural Network Control Strategies—Generating a Reward Function from the Step Response Specifications of the MIMO Centrifugal Chiller Simplified Model in State-Space Representation
6.4. Considerations Regarding the RL DLNN Control Strategies’ Applicability
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| PID | Proportional integral derivative |
| DTI | Discrete-time integrator |
| HVAC | Heating, ventilation, and air conditioning |
| CCS | Centrifugal chiller system |
| MIMO | Multi-input multi-output |
| SISO | Single-input single-output |
| MISO | Multi-input single-output |
| ARMAX | Autoregressive moving average with exogenous input |
| ANFIS | Adaptive neural fuzzy inference system |
| IAE | Integral absolute error |
| ITAE | Integral time absolute error |
| ISE | Integral square error |
| ITSE | Integral time-weighted square error |
Appendix A
Appendix B
Appendix B.1. Algorithms
Algorithm A1. Generate Reward Function Based on MPC Specifications for a MIMO Centrifugal Chiller System (MPC RL DLNN)
Step 1. Plant dynamics (state space).
Step 2. Create an MPC object:
Step 2.1. Specify the sampling time (Ts), prediction horizon (p), and control horizon (m);
Step 2.2. Create the MPC object;
Step 2.3. Open the Simulink model of the plant and MPC controller;
Step 2.4. Specify the minimum and maximum values for the manipulated variables (MVs);
Step 2.5. Specify nominal values for the outputs;
Step 2.6. Specify nominal values for all inputs;
Step 2.7. Specify the output tuning weights;
Step 2.8. Define the MPC object using the structure format for output variables, tuning weights, and manipulated variables.
Step 3. Generate the reward function:
Step 3.1. Open the reward function in the Simulink model and append a new MATLAB function
r = rewardFunctionMPC(y, refy, mv, refmv, lastmv).
Step 4. Create an RL environment (Ts, Tsim, observation and action specifications).
Step 5. Create an RL agent (random seed for reproducibility, layer graph object—connect and define the network path, plot criticNet).
Step 6. Define actorNet, plot it, and create a deterministic actor function.
Step 7. Specify the agent, critic optimizer, and actor optimizer options.
Step 8. Create the TD3 agent.
Step 9. Train the TD3 agent (options, evaluation, random seed, training function).
Step 10. Validate the closed-loop controller.
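A condensed MATLAB sketch of Algorithm A1 is given below. It uses standard MPC Toolbox and Reinforcement Learning Toolbox functions (mpc, generateRewardFunction, rlSimulinkEnv, rlTD3Agent, rlTrainingOptions, train, sim); the identified state-space matrices, horizons, default agent networks, and model names are placeholders, so this is an assumed outline rather than the authors' exact implementation.

```matlab
Ts = 0.1; Tsim = 60;
plant = ss(A, B, C, D);                          % Step 1: identified 4-state CCS model (matrices not reproduced here)
mpcobj = mpc(c2d(plant, Ts), Ts, 10, 2);         % Step 2: MPC object (p = 10, m = 2 assumed)
generateRewardFunction(mpcobj)                   % Step 3: reward template from MPC costs and constraints
obsInfo = rlNumericSpec([6 1]);                  % Step 4: observation/action specifications
actInfo = rlNumericSpec([2 1], "LowerLimit", [0; 0], "UpperLimit", [1.1; 1]);
env = rlSimulinkEnv("mdl_RL_MPC", "mdl_RL_MPC/RL Agent", obsInfo, actInfo);   % hypothetical model name
rng(0)                                           % Step 5: reproducibility
agentOpts = rlTD3AgentOptions("SampleTime", Ts, "DiscountFactor", 0.995, ...
    "MiniBatchSize", 256, "ExperienceBufferLength", 1e6);          % Step 7: agent options
agent = rlTD3Agent(obsInfo, actInfo, agentOpts); % Steps 5-8: TD3 agent with default critic/actor networks
trainOpts = rlTrainingOptions("MaxEpisodes", 1000, "MaxStepsPerEpisode", Tsim/Ts, ...
    "ScoreAveragingWindowLength", 20);           % Step 9: training options
trainingStats = train(agent, env, trainOpts);    % Step 9: train the TD3 agent
experience = sim(env, agent, rlSimulationOptions("MaxSteps", Tsim/Ts));  % Step 10: closed-loop validation
```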
Algorithm A2. Generate Reward Function from a Step Response Block for a MIMO Centrifugal Chiller System (RL DLNN)
Step 1. MIMO ARMAX centrifugal chiller model dynamics (state space):
Step 1.1. Initialize the model;
Step 1.2. Open the Simulink model.
Step 2. Open the BLk1 and BLk2 step response blocks.
Step 3. Generate the reward functions Vfb1 and Vfb2.
Step 4. Combine the observation and action specifications.
Step 5. Create an RL environment.
Step 6. Create a Simulink environment interface.
Step 7. Create a learning environment.
Step 8. Create the reinforcement learning environment for RL Agent1 and RL Agent2 (TD3 agents):
Step 8.1. Set the random reproducibility seed;
Step 8.2. Define the network path;
Step 8.3. Create a layer graph object for criticNet;
Step 8.4. Connect all network layers.
Step 9. Plot the critic network structure:
Step 9.1. Convert the network to a dlnetwork;
Step 9.2. Convert the critic functions for the TD3 agents;
Step 9.3. Define actorNet;
Step 9.4. Plot actorNet;
Step 9.5. Create a deterministic actor function;
Step 9.6. Specify the agent options.
Step 10. Create the TD3 agents.
Step 11. Train the TD3 agents.
Step 12. Closed-loop simulation.
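The critic and actor networks used by the two TD3 agents in Algorithm A2 can be sketched in MATLAB as follows, with layer sizes taken from Appendix B.2.2; the layer names, the second agent's construction (identical to the first), and any option values beyond those listed in the appendix are assumptions rather than the authors' exact code.

```matlab
obsInfo = rlNumericSpec([6 1]);                                     % 6 observations per agent
actInfo = rlNumericSpec([1 1], "LowerLimit", 0, "UpperLimit", 1);   % 1 action per agent

% Critic: observation and action paths joined by a concatenation layer
obsPath = [featureInputLayer(6, "Name", "obs")
           fullyConnectedLayer(128)
           reluLayer
           fullyConnectedLayer(64, "Name", "obsOut")];
actPath = [featureInputLayer(1, "Name", "act")
           fullyConnectedLayer(8, "Name", "actOut")];
common  = [concatenationLayer(1, 2, "Name", "concat")
           reluLayer
           fullyConnectedLayer(1, "Name", "qValue")];
criticLG = layerGraph(obsPath);
criticLG = addLayers(criticLG, actPath);
criticLG = addLayers(criticLG, common);
criticLG = connectLayers(criticLG, "obsOut", "concat/in1");
criticLG = connectLayers(criticLG, "actOut", "concat/in2");
criticNet = dlnetwork(criticLG, Initialize=false);
% Initialize separately so the two critics start from different weights
critic1 = rlQValueFunction(initialize(criticNet), obsInfo, actInfo, ...
    "ObservationInputNames", "obs", "ActionInputNames", "act");
critic2 = rlQValueFunction(initialize(criticNet), obsInfo, actInfo, ...
    "ObservationInputNames", "obs", "ActionInputNames", "act");

% Actor: 128-64-1 fully connected network with two ReLU layers
actorNet = dlnetwork([featureInputLayer(6)
                      fullyConnectedLayer(128)
                      reluLayer
                      fullyConnectedLayer(64)
                      reluLayer
                      fullyConnectedLayer(1)]);
actor  = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
agent1 = rlTD3Agent(actor, [critic1 critic2], rlTD3AgentOptions("SampleTime", 0.1));
% RL Agent2 (condenser level) is built the same way with its own networks.
```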
Appendix B.2
Appendix B.2.1. DLNN Table for Reinforcement Learning Based on MPC Specifications
| Parameter Name | Value/Description |
|---|---|
| RL Algorithm | TD3 RL agent. Description: the TD3 agent parametrizes a Q-value function. NN architecture: a NN with two inputs (one for the observation and one for the action; see the layer graphs for both actors) models the parametrized Q-value function within both critics. Metric: expected cumulative long-term reward. |
| DNN Architecture | Critic network: one concatenation layer (1,2) and two ReLU layers; one fully connected layer of size 8 on the action path (* see the layer graphs for the Actor Critic and actorNet). actorNet: three fully connected layers of sizes 128, 64, and 2 (number of actions); two ReLU layers. |
| Reward Function | Derived from the MPC model verification block |
| Training Episodes | 1000 |
| Test Episodes | 1 |
| Maximum Steps per Episode | 600 (Tsim/Ts = 60/0.1) |
| Stop Training Criterion | Evaluation statistic: −0.2 |
| Score Averaging Window Length | 20 |
| Number of Episodes for Evaluator | 1 |
| Evaluation Frequency | Every 50 episodes |
| Discount Factor (γ) | 0.995 |
| Exploration Model | Noise: std = 100, exponential decay rate = 1 × 10⁻⁵, minimum value reached = 1 × 10⁻³ |
| Learning Rate | 0.001 |
| Batch Size | 256 |
| Replay Buffer Size | 1 × 10⁶ |
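The hyperparameters in the table above map onto MATLAB's rlTD3AgentOptions roughly as in the sketch below; the option names are the toolbox's, the values are the ones reported in the table, and anything not listed there (such as wrapping the learning rate in rlOptimizerOptions) is an assumption.

```matlab
agentOpts = rlTD3AgentOptions( ...
    "SampleTime",             0.1, ...
    "DiscountFactor",         0.995, ...
    "MiniBatchSize",          256, ...
    "ExperienceBufferLength", 1e6, ...
    "ActorOptimizerOptions",  rlOptimizerOptions("LearnRate", 1e-3), ...
    "CriticOptimizerOptions", rlOptimizerOptions("LearnRate", 1e-3));
agentOpts.ExplorationModel.StandardDeviation          = 100;    % exploration noise std
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;   % exponential decay rate
agentOpts.ExplorationModel.StandardDeviationMin       = 1e-3;   % minimum value reached
```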
Appendix B.2.2. DLNN Table for Reinforcement Learning Based on Step Response Specifications
| Parameter Name | Value/Description |
|---|---|
| RL Algorithm | Two TD3 agents: RL Agent1 (evaporator temperature) and RL Agent2 (condenser level). Description: two parametrized Q-value function approximators estimate the value of the policy (expected cumulative long-term reward as metric). NN architecture: two NNs with two inputs for each agent (observation and action; see the layer graphs for the Actor Critic and actorNet) model the parametrized Q-value function within both critics. A critic function object is created to encapsulate each critic by wrapping around the critic deep neural network. To make sure the critics have different initial weights, each network is initialized before being used to create the critics. |
| DNN Architecture | DLNN for each Actor Critic: one input layer of size 6 (number of observations), one input layer of size 1 (number of actions), three fully connected layers on the main path with output sizes 128, 64, and 1; one concatenation layer (1,2) and two ReLU layers; one fully connected layer of size 8 on the action path (* see the layer graphs for the Actor Critic and actorNet). DLNN for actorNet: one input layer of size 6 (number of observations or features); three fully connected layers of sizes 128, 64, and 1 (number of actions); two ReLU layers. |
| Reward Function | Two reward functions derived from the step responses of two model verification blocks, one for the evaporator temperature and the other for the condenser level. |
| Training Episodes | 200 |
| Test Episodes | 1 |
| Maximum Steps per Episode | 200 (Tsim/Ts = 20/0.1) |
| Stop Training Criterion | Average reward: [1 1] |
| Score Averaging Window Length | 20 |
| Number of Episodes for Evaluator | 1 |
| Evaluation Frequency | Every 10 episodes |
| Discount Factor (γ) | 0.995 |
| Exploration Model | Noise: std = 100, exponential decay rate = 1 × 10⁻⁵, minimum decay value reached = 1 × 10⁻³ |
| Learning Rate | 0.001 |
| Batch Size | 256 |
| Replay Buffer Size | 1 × 10⁶ |
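For completeness, a hedged sketch of how the training settings in this table could be expressed with rlTrainingOptions is shown below; it assumes the two TD3 agents and the Simulink environment from Algorithm A2 already exist, and that the per-agent stopping thresholds are passed as a vector, one entry per agent.

```matlab
trainOpts = rlTrainingOptions( ...
    "MaxEpisodes",                200, ...
    "MaxStepsPerEpisode",         200, ...             % Tsim/Ts = 20/0.1
    "ScoreAveragingWindowLength", 20, ...
    "StopTrainingCriteria",       "AverageReward", ...
    "StopTrainingValue",          [1 1]);              % one threshold per agent
trainingStats = train([agent1, agent2], env, trainOpts);   % train both agents against the Simulink environment
```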
References
1. Li, P.; Li, Y.; Seem, J.E. Modelica Based Dynamic Modeling of Water-Cooled Centrifugal Chillers. In Proceedings of the International Refrigeration and Air Conditioning Conference, West Lafayette, IN, USA, 12–15 July 2010; Purdue University: West Lafayette, IN, USA, 2010; pp. 1–8. Available online: https://docs.lib.purdue.edu/iracc/1091 (accessed on 10 June 2024).
2. Popovic, P.; Shapiro, H.N. Modeling Study of a Centrifugal Compressor. ASHRAE Trans. 1998, 104 Pt 2, 121.
3. Ning, M. Neural Network Based Optimal Control of HVAC&R Systems. Ph.D. Thesis, Department of Building, Civil and Environmental Engineering, Concordia University, Montreal, QC, Canada, 2008.
4. Wong, S.P.W.; Wang, S.K. System Simulation of the Performance of a Centrifugal Chiller Using a Shell-and-Tube-Type Water-Cooled Condenser and R-11 as Refrigerant. ASHRAE Trans. 1989, 95, 445–454.
5. Wang, S.; Wang, J.; Burnet, J. Mechanistic Model of a Centrifugal Chiller to Study HVAC Dynamics. Build. Serv. Eng. Res. Technol. 2000, 21, 73–83.
6. Browne, M.W.; Bansal, P.K. Steady-State Model of Centrifugal Liquid Chillers. Int. J. Refrig. 1998, 21, 343–358.
7. Li, S.; Zaheeruddin, M. A Model and Multi-Mode Control of a Centrifugal Chiller System: A Computer Simulation Study. Int. J. Air-Cond. Refrig. 2019, 27, 1950031.
8. Tudoroiu, R.E.; Zaheeruddin, M.; Radu, S.M.; Budescu, D.D.; Tudoroiu, N. The Implementation and the Design of a Hybrid Digital PI Control Strategy Based on MISO Adaptive Neural Network Fuzzy Inference System Models–A MIMO Centrifugal Chiller Case Study. In Machine Learning Paradigms. Learning and Analytics in Intelligent Systems; Tsihrintzis, G., Jain, L., Eds.; Springer: Cham, Switzerland, 2020; Volume 18.
9. Zadeh, L.A. Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems-Selected Papers by Lotfi Zadeh. In Advances in Fuzzy Systems—Applications and Theory; Klir, G.J., Yuan, B., Eds.; State University of New York at Binghamton: Binghamton, NY, USA, 1996; Volume 6, 840p.
10. Haykin, S. Neural Networks and Learning Machines, 3rd ed.; Horton, M.J., Mars, D., Opaluch, W., Disanno, S., Eds.; Pearson Education, Inc.: Upper Saddle River, NJ, USA, 2009.
11. Zurada, J.M. Introduction to Artificial Neural Systems, 1st ed.; Davis, G., Quadrata, P.L., Eds.; West Publishing Company: St. Paul, MN, USA, 1992.
12. Ljung, L. System Identification Toolbox Getting Started Guide; The MathWorks, Inc.: Natick, MA, USA, 2023; MATLAB R2023a. Available online: https://control.dii.unisi.it/sysid/book/ident_gs.pdf (accessed on 12 June 2024).
13. Walia, N.; Singh, H.; Sharma, A. ANFIS: Adaptive Neuro-Fuzzy Inference System-A Survey. Int. J. Comput. Appl. 2015, 123, 32–38.
14. Jang, R. ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685.
15. Almabrok, A.; Psarakis, M.; Dounis, A. Fast Tuning of the PID Controller in An HVAC System Using the Big Bang–Big Crunch Algorithm and FPGA Technology. Algorithms 2018, 11, 146.
16. MathWorks Support Documentation-Help Center. Estimate ARMAX or ARMA Model. Available online: https://www.mathworks.com/help/ident/ref/armax.html (accessed on 12 June 2024).
17. Namdari, A.; Samani, M.A.; Durrani, T.S. Lithium-Ion Battery Prognostics through Reinforcement Learning Based on Entropy Measures. Algorithms 2022, 15, 393.
18. MathWorks Support Documentation-Help Center. Reinforcement Learning Using Deep Neural Networks. Available online: https://www.mathworks.com/help/deeplearning/ug/reinforcement-learning-using-deep-neural-networks.html (accessed on 10 September 2024).
19. MathWorks Support Documentation-Help Center. Reinforcement Learning for Control Systems Applications. Available online: https://www.mathworks.com/help/reinforcement-learning/ug/reinforcement-learning-for-control-systems-applications.html (accessed on 10 September 2024).
20. MathWorks Support Documentation-Help Center. Generate Reward Function from a Model Predictive Controller for a Servomotor. Available online: https://www.mathworks.com/help/reinforcement-learning/ug/generate-reward-fcn-from-mpc-for-servomotor.html (accessed on 10 September 2024).
21. MathWorks Support Documentation-Help Center. Generate Reward Function from a Model Verification Block for a Water Tank System. Available online: https://www.mathworks.com/help/reinforcement-learning/ug/generate-reward-fcn-from-verification-block-for-watertank.html (accessed on 20 September 2024).
22. Uzunovic, T.; Zunic, E.; Badnjevic, A.; Miokovic, I.; Konjicija, S. Implementation of Digital PID Controller. In Proceedings of the IEEE 33rd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 24–28 May 2010; Volume 33, pp. 1357–1361.
23. Grosse, R.; Maddison, C.; Bae, J.; Pitis, S. Introduction to Machine Learning, Lecture 11: Reinforcement Learning; University of Toronto: Toronto, ON, Canada, Fall 2020. Available online: https://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/slides/lec11.pdf (accessed on 23 February 2025).
24. Abdel-Jaber, H.; Devassy, D.; Al Salam, A.; Hidaytallah, L.; El-Amir, M. A Review of Deep Learning Algorithms and Their Applications in Healthcare. Algorithms 2022, 15, 71.
25. Kranthi Kumar, P.; Detroja, K.P. Design of Reinforcement Learning Based PI Controller for Nonlinear Multivariable System. In Proceedings of the 2023 European Control Conference (ECC), Bucharest, Romania, 13–16 June 2023; pp. 1–6.
26. You, W.; Yang, G.; Chu, J.; Ju, C. Deep Reinforcement Learning-Based Proportional–Integral Control for Dual-Active-Bridge Converter. Neural Comput. Appl. 2023, 35, 17953–17966.
27. MathWorks Support Documentation-Help Center. Resolving Low-Level Graphics Issues. Available online: https://www.mathworks.com/help/matlab/creating-plots/resolving-low-level-graphics-issues.html (accessed on 20 September 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).