Improvement of Refrigeration Efficiency by Combining Reinforcement Learning with a Coarse Model

It is paramount to improve operational conversion efficiency in air-conditioning refrigeration. Control efficiency for model-based methods relies heavily on the accuracy of the mechanism model, while data-driven methods face the challenge of extracting information beyond the limited collected data. In this study, a novel hybrid approach is presented that integrates a data-driven method with a coarse model. Specifically, reinforcement learning is used to explore and exploit the conversion efficiency of the refrigeration system, and a coarse model is utilized to evaluate the reward, by which the requirement on model accuracy is reduced and the model information is better used. The proposed approach is implemented in a hierarchical control strategy divided into a process level and a loop level. Simulation on a test bed shows that the proposed approach achieves better refrigeration conversion efficiency than conventional methods.


Background and Motivation
Along with a comfortable environment, air conditioning brings the problem of energy consumption. According to the U.S. Energy Information Administration, residential and commercial buildings account for 20.1% of global energy consumption. Of this amount, the buildings' heating, ventilation, and air conditioning (HVAC) systems can account for up to 50% of total building energy demand [1,2]. To minimize the overall energy input or operating cost while maintaining satisfactory indoor thermal comfort and a healthy environment, many kinds of control algorithms have been applied to HVAC. In [3], model-based monitoring of the occupants' thermal state by predictive control was proposed. In [4], supervisory and optimal controls were classified into model-based methods, which use physical models, gray-box models, and black-box models, and a model-free category, which uses expert systems and pure learning approaches. In [5], existing methods were grouped into hard control (HC), such as basic controls involving PID control, optimal control, nonlinear control, robust or H∞ control, and adaptive control, and soft control (SC), involving neural networks (NNs), fuzzy logic (FL), genetic algorithms (GAs), and other evolutionary methods. Moreover, hybrid control resulting from the fusion of SC and HC may achieve better performance.
Most of today's air-conditioning systems work on the principle of compression refrigeration, in which refrigeration is the core of the air-conditioning system [6]. According to refrigeration theory, the theoretical optimal conversion efficiency is used in the design of air conditioning based on the specified loads. In practice, however, performance degradation is inevitable due to inconsistencies between design requirements and operational conditions. For example, a public place needs only a small refrigerating output to respond to the light load at night, whereas it needs a large refrigerating output for a varying flow of people. A conventional control algorithm will result in low operational conversion efficiency under such time-varying and uncertain conditions, because the efficiency of compression refrigeration is confined to the working point and affected by the loads.
With the development of modern electronic and measurement technologies, such as supervisory control and data acquisition (SCADA) and smart sensors, the refrigeration system can obtain abundant data which provide information about the current refrigeration states and the operational environment. Data-driven methods are therefore promising for improving the conversion efficiency by directly using these data. Reinforcement learning (RL), a well-known data-driven method regarded as a model-free supervisory control method in [5] and as soft control (SC) in [6], is about learning from interaction how to behave in order to achieve a goal. The basic idea of reinforcement learning is simply to capture the most important aspects of an agent: the sensation, the action, and the goal [7]. A reward rule is then employed to encourage the agent to seek the goal by adjusting its action through exploration and exploitation. Finally, the agent obtains the optimal action to adapt to its surroundings by trial and error. In this way, RL can achieve the optimal action based on the current states and environment. Along with its incredible success in games [8], RL has attracted great interest in various industries, such as robotics [9,10], fault detection [11], and fault-tolerant control [12,13].
RL has been introduced to HVAC and some excellent results have been published recently. Baeksuk Chu et al. proposed an actor-critic architecture and a natural gradient method, with efficiency improved by the recursive least-squares (RLS) method, aiming at two objectives: maintaining an adequate level of pollutants and minimizing power consumption [14]. Pedro Fazenda et al. followed the idea of multi-objective optimal supervisory control and studied the application of a discrete and continuous reinforcement-learning-based supervisory control approach, which actively learned how to appropriately schedule thermostat temperature set-points [15]. In [16], a multi-grid method of Q-learning was addressed to handle the problem that online reinforcement learning often suffers from slow convergence and faults in its early stages. In [17], a learning architecture was proposed for an intelligent thermostat that comprised a number of different learning methods, each of which contributed to creating a complete autonomous thermostat capable of controlling an HVAC system, and a novel state-action space formalism was proposed to enable RL to succeed. In [18], a ventilated facade with PCM was controlled using a reinforcement learning algorithm. In [19], a reinforcement learning control (RLC) approach was presented for optimal control of low-exergy buildings. An improved reinforcement learning controller was designed in [20] to obtain an optimal control strategy for blinds and lights, which provided a personalized service by introducing subject perceptions of the surroundings, gathered by a novel interface, as the feedback signal. In [21], a model-free actor-critic reinforcement learning (RL) controller was addressed using a variant of artificial recurrent neural networks called long short-term memory (LSTM) networks.
A model-free Q-learning method was addressed in [22] that made optimal control decisions for HVAC and window systems to minimize both energy consumption and thermal discomfort. Recently, the latest RL methods, for example, deep reinforcement learning control [23,24], the regularized fitted Q-iteration approach [25], proximal actor-critic [26], and auto-encoders [27,28], have appeared in air conditioning to minimize both energy consumption and thermal discomfort. However, none of these studies involve time-varying and uncertain loads.
RL is classified into online learning and offline learning according to the source of training data [29]. Online learning makes sense in the refrigeration system as a way of overcoming the time-varying and uncertain loads. There are two challenges: (1) new states are obtained only after the actions have acted on the system, which is a risk because no one knows the results of these actions in advance; (2) conventional RL trains by trial and error, a brute-force way of handling an unknown environment but one with low efficiency, since it exploits little information between states and actions. In this study, a coarse model is proposed to help solve both problems. We consider a coarse model based on the fact that humans have grasped the principles of refrigeration conversion. It is possible to build a coarse model that captures the relations of the state variables according to these principles, based on the figures and charts accumulated through a large number of experiences. On the one hand, it brings more information between states and actions to speed up the learning process. On the other hand, it can forecast the effects of actions as a simulation. We give up the accurate model because of the system complexity and surrounding disturbances. In addition, RL only needs the reward as a basis for action adjustment in the process of seeking the goal. The reward can be obtained from the tendency of the performance indicator, which reduces the requirement on model accuracy. As a result, a coarse model is sufficient for RL. Based on a coarse model, a computer simulation can estimate the next states instead of the real system, which reduces the risk of unknown states after a trial action.
Motivated by the above discussions, a novel approach combining reinforcement learning with a coarse model is proposed to improve the conversion efficiency of refrigeration. We take the conversion efficiency as the goal of reinforcement learning, the measurement data of refrigeration as the sensation of the agent, and the control valve of refrigeration as the action of the agent, and learn by the Q-algorithm. The advantages are summarized as follows:

1.
This is an online control strategy with data-driven control and learning simultaneously which reduces the requirement of model accuracy.

2.
It has the ability to get the optimal action by exploration and exploitation to achieve a better conversion efficiency which fits the current operating conditions.

3.
The risk to the system in the learning process, which is complex and unknown in advance, is mostly prevented by the hierarchical control of the process level, which makes use of the existing knowledge of refrigeration, and the loop level, which implements the tracking control.
The remaining parts of this paper are organized as follows: the aim of the study and the preliminaries on a compression refrigerating system are introduced in Section 2. The fundamentals of reinforcement learning algorithms are reviewed in Section 3. Based on Sections 2 and 3, a reinforcement learning control algorithm for compression refrigerating systems is proposed in Section 4, and the case studies are addressed in Section 5. The paper ends with conclusions in Section 6.

Aim of the Research
For a compression refrigerating system, the compressor drives the refrigerant in a circulatory flow to transfer heat from the cryogenic medium to the high-temperature medium via a phase change of the refrigerant. In this process, the temperature is the key indicator of efficiency, being a comprehensive result of the interaction of mass, pressure, process, and loads, so the coefficient of performance (COP) at time k is defined as

COP(k) = T_c(k) / (T_c(k) − T_e(k)) (1)

where T_e and T_c are the evaporation temperature and the condensation temperature at sampling instant k, respectively. Our aim is to maximize the coefficient of performance of the refrigeration cycle during a period of time l:

max (1/l) Σ_{k=1}^{l} COP(k) (2)
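As an illustration, the instantaneous COP and its average over a period can be computed as below. The specific COP form T_c/(T_c − T_e) with temperatures in °C is an assumption inferred from the magnitudes reported later (average COP near 1.27), so treat this as a sketch rather than the test bed's exact computation.

```python
def cop(t_c, t_e):
    # Instantaneous coefficient of performance at one sampling instant,
    # assuming the form T_c / (T_c - T_e) with temperatures in deg C.
    return t_c / (t_c - t_e)

def mean_cop(t_c_series, t_e_series):
    # Average COP over a period of l samples -- the quantity to maximize.
    pairs = list(zip(t_c_series, t_e_series))
    return sum(cop(tc, te) for tc, te in pairs) / len(pairs)

# Illustrative values: Tc = 42 deg C, Te = 7 deg C
print(cop(42.0, 7.0))  # -> 1.2
```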

Principle of Compression Refrigerating System
A compression refrigerating system is always made up of four elements: the compressor, the evaporator, the condenser, and the throttling mechanism. The compressor provides power for the circulation of the refrigerant. The evaporator implements the heat exchange between the refrigerant and the chilled water. The condenser completes the heat exchange between the refrigerant and the cooling water or the air-cooling system. The throttling mechanism (mainly an electronic expansion valve) adjusts the mass flow rate of the refrigerant in the circulation. Here, the principle of each element is reviewed for completeness according to [30][31][32][33]; more details can be found in [30][31][32][33] and the references therein.

Principle of the Evaporator
The scheme of the evaporator process is depicted in Figure 1. The inner region of the evaporator is divided into a two-phase zone, where the refrigerant exists as a combination of liquid and gas, and a superheated zone, where the liquid refrigerant evaporates into gas by absorbing heat and finally becomes superheated steam. The refrigerant moves from the two-phase zone to the superheated zone. An average void fraction is used to express the proportion of gas in the two-phase zone, defined as the integral of the void fraction along the pipeline direction, where γ is the void fraction correlation with the form of formula (4):

γ = 1 / (1 + ((1 − x)/x)(ρ_g/ρ_l)S) (4)

in which x is the vapor quality, ρ_g and ρ_l are the density of saturated steam and the density of saturated liquid, respectively, kg/m³, and S is the slip ratio.
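The slip-ratio void fraction correlation and its average along the pipeline can be sketched as below; the densities used are rough R134a-like values, assumed for the example only.

```python
def void_fraction(x, rho_g, rho_l, s):
    # Void fraction gamma at vapor quality x (slip-ratio correlation):
    # gamma = 1 / (1 + ((1 - x) / x) * (rho_g / rho_l) * S)
    return 1.0 / (1.0 + ((1.0 - x) / x) * (rho_g / rho_l) * s)

def mean_void_fraction(qualities, rho_g, rho_l, s):
    # Average void fraction: the integral along the pipeline approximated
    # by a mean over sampled vapor qualities.
    vals = [void_fraction(x, rho_g, rho_l, s) for x in qualities]
    return sum(vals) / len(vals)

# Rough R134a-like densities, for illustration only
print(round(void_fraction(0.5, 14.0, 1295.0, 1.0), 4))
```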
According to the laws of mass, momentum, and energy conservation, one obtains the state equation of the evaporator as formula (5):

E(x_e) ẋ_e = f(x_e, u_e) (5)

The meanings of the states and parameters are given in Table 1.

Principle of the Condenser
The condenser inner region is divided into a two-phase zone, a superheated zone, and a supercooled zone. For each zone, partial differential equations have been built according to the laws of conservation of mass, momentum, and energy, based on the properties of the refrigerants and empirical equations. Therefore, the state equation of the condenser, obtained by integrating along the pipeline direction, is given as follows:

E(x_c) ẋ_c = f(x_c, u_c) (6)

The meanings of the states and parameters are provided in Table 1.

Principle of the Compressor
The refrigerant mass flow rate at the compressor outlet is

ṁ_com = f η_vol V_com ρ_g (7)

where ṁ_com is the refrigerant mass flow rate at the compressor outlet, kg/s, f is the working frequency, Hz, V_com is the theoretical gas delivery determined by the producer, m³/s, ρ_g is the density of the refrigerant at the entrance of the compressor, kg/m³, and η_vol is the volume efficiency of the compressor, which follows formula (8), where P_c and P_e are the pressures of condensation and evaporation, respectively, and k is the polytropic exponent.
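Formula (7) can be sketched in code. Since the exact form of formula (8) is not reproduced above, a commonly used polytropic volume-efficiency expression η_vol = a − b(P_c/P_e)^(1/k) is assumed here, with illustrative constants a and b rather than identified values.

```python
def eta_vol(p_c, p_e, a=1.0, b=0.05, k=1.18):
    # Volume efficiency of the compressor; an assumed common polytropic
    # form eta_vol = a - b * (P_c / P_e) ** (1 / k), with illustrative
    # constants a, b and polytropic exponent k.
    return a - b * (p_c / p_e) ** (1.0 / k)

def compressor_mass_flow(f, p_c, p_e, v_com, rho_g):
    # Formula (7): m_com = f * eta_vol * V_com * rho_g  (kg/s), with
    # working frequency f (Hz), theoretical gas delivery V_com, and
    # refrigerant density rho_g at the compressor entrance.
    return f * eta_vol(p_c, p_e) * v_com * rho_g

print(round(compressor_mass_flow(50.0, 1.0e6, 0.35e6, 1.0e-4, 14.0), 4))
```

A higher pressure ratio P_c/P_e lowers the volume efficiency, which matches the qualitative behavior of reciprocating and scroll compressors.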

Principle of the Electronic Expansion Valve
The mass flow of the electronic expansion valve follows formula (9), where C_v and k are the flow coefficient and the opening of the expansion valve, respectively, and ρ_l is the density of the refrigerant at the entrance of the expansion valve, kg/m³.
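A sketch of formula (9). The exact correlation is not reproduced above, so the usual orifice-flow form ṁ_e = C_v·k·√(2 ρ_l ΔP) with ΔP = P_c − P_e is assumed here.

```python
import math

def eev_mass_flow(c_v, opening, rho_l, p_c, p_e):
    # Expansion-valve mass flow, assumed orifice form:
    # m_e = C_v * k * sqrt(2 * rho_l * (P_c - P_e)), kg/s,
    # with flow coefficient C_v and valve opening k in [0, 1].
    return c_v * opening * math.sqrt(2.0 * rho_l * (p_c - p_e))

# Opening the valve further increases the refrigerant mass flow:
print(eev_mass_flow(1.0e-6, 0.8, 1295.0, 1.0e6, 0.35e6) >
      eev_mass_flow(1.0e-6, 0.4, 1295.0, 1.0e6, 0.35e6))  # -> True
```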

Reinforcement Learning
Reinforcement learning has received much attention during the past decades [34,35]; its basics are reviewed as follows. Consider the Markov decision process MDP (X, A, P, R), where X is a set of states and A is a set of actions or controls. The transition probabilities P : X × A × X → [0, 1] represent, for each state x ∈ X and action a ∈ A, the conditional probability P(x(k+1), x(k), a(k)) = Pr{x(k+1) | x(k), a(k)} of transitioning to state x(k+1) ∈ X, given that the MDP is in state x(k) and takes action a(k). The cost function R : X × A × X → R gives the expected immediate cost R_k(x(k+1), x(k), a(k)) paid after the transition to state x(k+1) ∈ X, given that the MDP starts in state x(k) ∈ X and takes action a(k) ∈ U. The value of a policy V^π_k(x(k)) is defined as the conditional expected value of the future cost

V^π_k(x(k)) = E_π{ Σ_{i=k}^{k+T} γ^{i−k} R_i }, R_i ∈ R (10)

when starting in state x(k) at time k and following policy π(x, a). The target is to find the optimal actions that maximize the value of the future cost V^π_k(x) with T = ∞. The value function V^π_k(x) for the policy π(x, a) satisfies the Bellman equation

V^π_k(x(k)) = Σ_a π(x, a) Σ_{x(k+1)} P(x(k+1), x(k), a(k)) [R_k(x(k+1), x(k), a(k)) + γ V^π_k(x(k+1))] (11)

For a deterministic system, Σ_a π_k(x, a) Σ_{x(k+1)} P(x(k+1), x(k), a(k)) = 1, so the optimal actions can be gained by alternating the policy evaluation (12) and the policy improvement (13):

V_k(x(k)) = R_k(x(k+1), x(k), a(k)) + γ V_k(x(k+1)) (12)

π_k(x, a) = argmax_π [R_k(x(k+1), x(k), a(k)) + γ V_k(x(k+1))] (13)

where γ is a discount factor, 0 ≤ γ < 1, to ensure convergence.
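The alternation of policy evaluation (12) and policy improvement (13) can be illustrated by value iteration on a tiny deterministic MDP; the states, transitions, and rewards below are invented for the example and have nothing to do with the refrigeration model.

```python
# Value iteration on a tiny deterministic MDP:
# V(x) = max_a [ R(x,a) + gamma * V(next(x,a)) ]
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
NEXT = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}     # deterministic transitions
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

V = {x: 0.0 for x in STATES}
for _ in range(200):  # sweep until numerically convergent
    V = {x: max(REWARD[x, a] + GAMMA * V[NEXT[x, a]] for a in ACTIONS)
         for x in STATES}

# Greedy policy improvement on the converged value function
policy = {x: max(ACTIONS, key=lambda a: REWARD[x, a] + GAMMA * V[NEXT[x, a]])
          for x in STATES}
print(policy)  # -> {0: 1, 1: 1}
```

Both states learn to choose action 1, which leads toward the self-loop with the highest discounted return (V(1) converges to 2/(1 − 0.9) = 20).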

Q-Algorithm
In fact, formulas (12) and (13) cannot be solved directly because they need the information of V_k(x(k+1)), which no one knows. The Q-algorithm proposed by Watkins [34] provides an effective solution by substituting the function Q. The Q-algorithm defines the evaluation function Q(x(k), a(k)) as the maximum discounted cumulative reward that can be achieved starting from state x(k) with a(k) as the first action:

Q(x(k), a(k)) = R_k(x(k+1), x(k), a(k)) + γ V*(x(k+1)) (14)

where the state x(k+1) results from x(k) and a(k), and V*(x(k+1)) is the optimum of V_k(x(k+1)) beginning with x(k+1) and following the optimal actions. Denoting the optimum of Q as Q*, one has

V*(x(k)) = max_a Q*(x(k), a) (15)

where the superscript * expresses the optimal values. It is seen from formula (15) that Q*(x(k), a(k)) is equivalent to V*(x(k)) under the same action. Therefore, the optimal actions a*(k) can be obtained by the value iteration

Q(x(k), a(k)) ← (1 − α) Q(x(k), a(k)) + α [R_k(x(k+1), x(k), a(k)) + γ max_{a(k+1)} Q(x(k+1), a(k+1))] (16)

where α is a learning rate, 0 ≤ α < 1, and the state x(k+1) is obtained at the next sampling time k + 1. By this iteration, Q finally converges to a steady state through continuous adjustment of the action a(k+1), and one obtains the corresponding control a*(k).
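The value iteration (16) is a one-line update on a Q-table; a minimal sketch follows, where keying the table by (state, action) pairs is an implementation choice for the example.

```python
def q_update(q, x, a, r, x_next, actions, alpha=0.95, gamma=0.9):
    # Formula (16):
    # Q(x,a) <- (1 - alpha) * Q(x,a) + alpha * (r + gamma * max_a' Q(x',a'))
    best_next = max(q.get((x_next, a2), 0.0) for a2 in actions)
    q[(x, a)] = (1.0 - alpha) * q.get((x, a), 0.0) + alpha * (r + gamma * best_next)
    return q[(x, a)]

q = {}
print(q_update(q, 0, 0, 1.0, 1, actions=[0, 1]))  # -> 0.95
```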
A proof of the strict convergence of the Q-value iteration (16) was given by Watkins [35], whose theorem is rewritten here as Lemma 1.
Define n i (x, a) as the index of the ith time that action a is tried in state x.

The Proposed Approach
The scheme of the proposed approach is depicted in Figure 2. It is a hierarchical control integrating the process level and the loop level. The process level targets the conversion efficiency of the refrigerating system by adjusting the action variables, whose optimum a* is the reference of the loop level. The loop level, as a basic control circuit, implements the reference control by adjusting the control variables u, whose elements are the compressor frequency and the electronic expansion valve opening, respectively. The state vector of the process is

x = [T_e, T_w, T_c, l_e, l_c, T_wa]^T (18)

where l_e is the length of the two-phase zone in the evaporator, m, and T_wa is the tube wall temperature of the condenser. It is important to point out that the temperature states can be measured directly by sensors, while the length states can be obtained by soft sensing with the degree of superheat.
The action vector of the process is

a = [m_e, m_com]^T (19)

where m_e and m_com are the refrigerant mass flows of the expansion valve and of the compressor, which are controlled by the opening of the expansion valve k and the frequency of the compressor f, according to formulas (9) and (7), respectively.
The cost function is selected as the goal, i.e., the coefficient of performance, achieved under an action a(k) ∈ U. As a result, the optimal actions a*(k) can be obtained by the Q-algorithm following formulas (16) and (17).

Exploitation and Exploration
RL obtains the optimal action according to the reward from interaction with the environment. There are two modes: exploitation, which seeks the optimal action according to the best Q-value rewards, and exploration, which selects actions randomly with a certain probability in order to escape the local optima of exploitation or to find new, unknown optima. The ε-greedy method is used to carry out the exploitation and the exploration in the form of formula (21):

a(k) = argmax_{a∈U} Q(x(k), a) with probability p = ε; a(k) = rand(U) with probability p = 1 − ε (21)

where ε is a pre-set probability (usually a large-probability event), p is the probability of action selection, and U is the action space.
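The ε-greedy selection can be sketched as below, following this paper's convention that ε is the (large) probability of exploiting the best-known action.

```python
import random

def epsilon_greedy(q, x, actions, eps=0.9, rng=random):
    # With probability eps exploit the action with the best Q-value;
    # with probability 1 - eps explore a random action from U.
    if rng.random() < eps:
        return max(actions, key=lambda a: q.get((x, a), 0.0))
    return rng.choice(actions)

q = {(0, 1): 5.0}
print(epsilon_greedy(q, 0, [0, 1], eps=1.0))  # always exploits -> 1
```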

Procedure of the Proposed Algorithm
Based on the selection of the action vector, the state vector, and the cost function, the procedure of the proposed algorithm, which begins with the state vector x(k), is summarized as Procedure 1.
Step 1: Select an action u(k) randomly.
Step 3: Get the new state x(k + 1) according to the coarse model and compute value function according to formula (10).
Step 6: Repeat steps 1 to 5 until convergence.
Step 7: Get the optimal a*(k) according to the best reward.
Step 8: Apply a*(k) to the refrigeration system as the reference of the loop level and then get x(k + 1) under the control u.
Step 9: Replace x(k) by x(k + 1) and go to step 1.
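A compact sketch of Procedure 1 follows. The episode length, toy coarse model, and reward below are illustrative assumptions; in the paper, the reward would come from the COP evaluated on the coarse model of the refrigeration system.

```python
import random

def plan_action(x, actions, coarse_model, reward, episodes=50, horizon=20,
                alpha=0.95, gamma=0.9, eps=0.9):
    # Train a Q-table by rolling the coarse model forward from the measured
    # state x (instead of acting on the real plant), then return the greedy
    # action a* that is passed to the loop level as a reference.
    q = {}
    for _ in range(episodes):
        s = x
        for _ in range(horizon):                  # finite episode, not T = inf
            if random.random() < eps and any((s, a) in q for a in actions):
                a = max(actions, key=lambda u: q.get((s, u), 0.0))  # exploit
            else:
                a = random.choice(actions)                          # explore
            s_next = coarse_model(s, a)           # simulated next state
            r = reward(s_next)                    # reward from the coarse model
            best = max(q.get((s_next, u), 0.0) for u in actions)
            q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (r + gamma * best)
            s = s_next
    return max(actions, key=lambda a: q.get((x, a), 0.0))

# Toy check: next state equals the action, and state 1 is rewarded,
# so the planner should learn to pick action 1.
random.seed(0)
print(plan_action(0, [0, 1], lambda s, a: a, lambda s: 1.0 if s == 1 else 0.0))  # -> 1
```

Because the rollouts run on the coarse model rather than the plant, trial actions carry no physical risk, which is the point of step 3 above.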

Loop Level
Loop control technology is very mature, and a simple PID is proposed to achieve this function for each loop, whose framework is shown in Figures 3 and 4. Here m_com and m_e are the reference variables which come from the process level by RL. u1 and u2 are the control variables which act on the system through the associated actuators that are parts of the compression refrigeration system. y1 and y2 are the system outputs obtained by the sensors.
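The loop level can be sketched as a discrete PID tracking the references from the process level; the gains and the first-order plant below are illustrative, not the test bed's identified dynamics.

```python
class PID:
    """Discrete PID controller for one loop (e.g., tracking m_com via the
    compressor frequency, or m_e via the valve opening)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = None

    def step(self, reference, measurement):
        # Positional PID law: u = Kp*e + Ki*integral(e) + Kd*de/dt
        err = reference - measurement
        self.integral += err * self.dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# Track a unit reference on an illustrative first-order plant dy/dt = u - y
pid, y, dt = PID(2.0, 0.5, 0.0, 0.1), 0.0, 0.1
for _ in range(500):
    y += dt * (pid.step(1.0, y) - y)
print(round(y, 3))  # settles near the reference
```

The integral term removes the steady-state tracking error, which is why a plain PI setting suffices for a slow plant of this kind.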

Case Studies
Our test bed of a compression refrigeration system is seen in Figure 5, and its structure is depicted in Figure 6 [36,37]. The liquid refrigerant stored in the tank goes into the heat interexchanger through the desiccator, where it gasifies by absorbing heat. The gas is driven back to the condenser by the compressor and becomes liquid in the storage tank. The exchanged cool air goes into the compound air-conditioning plant to adjust the environment. The main devices and their specifications are listed in Table 2. The temperature was obtained by Pt100 temperature transmitters with ranges of 0-30 °C, 0-60 °C, and 0-100 °C and an accuracy of ±0.1 °C (class A). The temperatures of the evaporator and condenser were limited within the ranges of −5-15 °C and 30-50 °C, respectively. The pressure was obtained by pressure transmitters with ranges of 0-2.5 MPa and 0-1.6 MPa and an accuracy of 0.25%. Electromagnetic flowmeters with an accuracy of 0.5% were added to measure m_com and m_e, which are the action variables of the process level. The traditional coefficient of performance (COP) is obtained in the stable state of the compression refrigeration system. However, the loads are often uncertain, which keeps the compression refrigeration system in a transient process. As a result, a 30 kW tunable stainless-steel electric heater was applied to simulate a thermal load in the range from 22.5 to 30 kW, and the sampling time was selected as 1 min. In the experiment, we adjusted the electric heater by changing the resistance value through rotating an adjustment button. In this way, we simulated the load climb by running the adjustment button in one direction and the load uncertainty by randomly changing the direction.

Effect of Model Accuracy on Performance
A coarse model was built based on the principle of compression refrigeration at the process level, where the variables T_e, T_w, T_c, l_e, l_c, T_wa, T_a, m_e, and m_com have the same meanings as in Table 1. The coefficients c_0-c_7 were obtained from the collected data by system identification [33]. The values are seen in Table 3. Table 3. The value of parameters in the coarse model [33].
Taking the values from Table 3 as the criterion (called the first coarse model), we built the second coarse model by changing the value of coefficient c_0 from −1672 to −1500, a change of 4.58%, and the value of coefficient c_4 from 456 to 450. The control algorithm follows Procedure 1. The evolution of the states, the instantaneous efficiency, and the action variables are seen in Figures 7-9. The refrigerant circulates between gas and liquid with endothermic and exothermic reactions. The evaporation temperature is the external temperature of the refrigerant during evaporation in the evaporator, which is usually pre-set as a constant according to the environmental requirement. It is seen from Figure 7 that the evaporation temperature Te (of either the first coarse model (blue curve) or the second coarse model) fluctuated around 7 °C. The condensing temperature is prone to be affected by the working pressure: more energy exchange produces higher pressure, which increases the condensing temperature. It is seen from Figure 7 that the condensing temperature Tc (of either the first coarse model (blue curve) or the second coarse model) rose slowly to meet the requirement of the loads from 22.5 to 30 kW. Different from the conventional coefficient of performance (COP), which is defined in the steady state, our COP is extended to the whole process, including both the transient and the steady state. The transient often prevails due to the resistance changes. Therefore, the mean of the COP during a certain period of time is meaningful according to our definition. The curve of instantaneous efficiency in Figure 8 shows the differences between the coarse models on the system. There was an instantaneous error between the first coarse model (blue curve) and the second coarse model (green curve). However, the average efficiencies of the two coarse models during 200 min were 1.2681 and 1.2685, respectively, an error of 0.03%.
We calculated the residual to evaluate the robustness of both coarse models. First, we made a polynomial fit by the least-squares method using cubic curves for both models (no significant error change beyond this order) according to the standard procedure [36] and obtained the estimation ĈOP,
where the coefficients are shown in Table 4.
The norm of the residual is

σ = ( Σ_{k=1}^{l} (COP(k) − ĈOP(k))² )^{1/2}

where l is the number of examples. The corresponding norms of the residuals are σ1 = 0.60575 and σ2 = 0.67583, respectively, which means the coarse model has good robustness, since the residual norms are close. As a result, we can evaluate the effect of model accuracy on the efficiency performance by comparing the COP results while changing the parameter values. The mean change of 0.03% against the parameter change of 4.58% indicates only a small impact among the coarse models. Figure 9 gives the action variables m_e and m_com. It can be seen that there were noticeable fluctuations in m_e and m_com, due to the RL implementation and the load variations. We adopted the classical table mode to map the relation between the states and the actions. However, limited by our computer's capability (Intel(R) Core(TM)2 Duo CPU E7300 @2.66 GHz, 4.00 GB RAM), the actions were divided into 1K pieces, which causes some error between the acquired actions and the ideal actions. The errors were somewhat enlarged in the process of evaluating the action value because we made each episode last 200 steps instead of T = ∞. Another important factor is the influence of the load variations. The final actions of the RL algorithm were determined by the initial states and the current loads. The load variation was simulated by sequentially rotating the adjustment button, which led to continuous transitions.
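The residual evaluation can be sketched as below; the cubic coefficients d0..d3 are placeholders for the Table 4 values, which are not reproduced here.

```python
def cop_hat(k, d):
    # Cubic least-squares fit: COP_hat(k) = d0 + d1*k + d2*k^2 + d3*k^3
    return d[0] + d[1] * k + d[2] * k ** 2 + d[3] * k ** 3

def residual_norm(measured, fitted):
    # Euclidean norm of the residual COP - COP_hat over l samples
    return sum((m - f) ** 2 for m, f in zip(measured, fitted)) ** 0.5

d = [1.25, 0.0, 0.0, 0.0]          # placeholder coefficients, not Table 4's
measured = [1.20, 1.30, 1.25]
fitted = [cop_hat(k, d) for k in range(len(measured))]
print(round(residual_norm(measured, fitted), 4))  # -> 0.0707
```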

Comparison with the Conventional Approach
There is no existing method dealing with the coefficient of performance (COP) under the temperature changes caused by load variation. The predominant approach in HVAC is to keep the outlet temperature matching the loads by automatically adjusting the flow with an electronic expansion valve while maintaining a constant compressor speed. We compared the proposed method with this conventional approach. Figures 10-12 show the evolution of the states Te and Tc, the instantaneous efficiency, and the control variables. The blue curves and the red curves represent the results of the proposed approach and the conventional approach, respectively. It is seen from Figure 11 that Tc and Te with the conventional approach and the proposed approach show a similar tendency to meet the requirement of the loads during the test period. Te with the proposed approach shows large fluctuations because the coarse model gives a rough estimation of the states, while Te with the conventional approach shows a smooth curve because it is under classical control. Tc has small fluctuations because its accuracy is better. There is a small fluctuation of instantaneous efficiency for the conventional approach because it tries to keep the temperature bias between inlet and outlet by taking control. Our proposed method relaxes the control of temperature within the range of limitation, so the instantaneous efficiency formulated from Te and Tc shows bigger fluctuations, with sometimes low efficiency and sometimes high efficiency. We tried to reduce this fluctuation by adjusting the parameters. First, we changed the pre-set probability ε from 0.7, the general value for the trial-and-error method, to 0.9. However, there was no improvement in fluctuation or efficiency.
Considering the real-time requirement, we did not decrease ε below 0.7, which would mean that more than 30% of the searches were used to explore; in fact, since the new states can be computed from the coarse model, a high exploration probability is not needed. Instead, we tried to increase the value of ε and obtained a good COP (1.34% better than the conventional method) even when setting ε to 0.99, though there was no improvement in the fluctuation. We also tried to adjust the learning rate α, which is a compromise between the speed and the accuracy of training: if it is too small, the training takes more time; if it is too large, the training loses accuracy. There is no theoretical basis for its choice so far in reinforcement learning, but an empirical value of 0.95-0.98 is common. The simulation showed that the fluctuations were still present; a representative result is shown in Figure 11. The instantaneous efficiency in Figure 11 is hard to read directly. However, as this is an energy conversion process, we are more concerned with the refrigeration efficiency over a period of time, so the average efficiency was proposed as the metric. The average efficiency with the conventional approach (red curve) and with the proposed approach (blue curve) was 1.2513 and 1.2681, respectively. The proposed approach thus raises the efficiency by 1.34% over the conventional approach on average during the period of 200 min.
The control variables are shown in Figure 12. As pointed out in Section 4.2, the controller parameters of PID1 and PID2 should be predesigned. The noticeable fluctuations in our proposed method cause difficulties in regulating the PID parameters. Fortunately, the refrigeration plant is a slow second-order system, and constant PID parameters have a wide feasible scope around the working point. The parameter values are prepared group by group in a table based on previous experiments, with the system stability taken into account. The controller therefore directly calls the group of parameter values from this preparatory table to track the action references provided by the process level, which fluctuate due to the action partition error and the load variation.

Conclusions
In this paper, we have discussed an online data-driven approach to improve the conversion efficiency of a refrigeration system under load variation. A reinforcement learning approach, which has the ability to search an unknown environment, is proposed to find the optimal actions based on the online data at the process level; its risk to the system, whose training process is complex and unknown in advance, is largely avoided by running the trials on the computer simulation. A coarse model is developed to evaluate the action value, which reduces the dependence of traditional control methods on model accuracy. Finally, the actions are applied as the pre-set variables by implementing the single-loop control. The simulation shows that the proposed algorithm is better than the conventional methods in the conversion efficiency of the refrigeration system from the viewpoint of the average, although larger fluctuations are noticeable. How to reduce this fluctuation is our further study.