1. Introduction
Traditional diesel fuel, widely used in transportation, power generation, and industrial applications, has significant negative impacts on climate change and human health due to particulate matter and nitrogen oxide (NOx) emissions [1]. As traditional fossil fuels become scarce and environmental concerns grow, biodiesel, a sustainable and renewable alternative fuel, has attracted increasing attention. Produced from vegetable oils and animal fats, biodiesel is biodegradable, has low toxicity, and produces lower emissions, mitigating the environmental problems associated with fossil fuels [2].
In biodiesel production, the final yield is a critical controlled variable that is intricately coupled with other process variables such as feed flow rate, feed concentration, and feed temperature [3]. In most industrial processes, the setpoints of the crude oil flow rate and catalyst flow rate are adjusted based on existing production and research experience to indirectly track the desired final yield. However, due to the nonlinearity, long time delays, and multivariable nature of biodiesel production, such empirical settings often fail to maintain a stable yield, leading to degraded product quality and increased production costs. Therefore, in-depth research on this production process and the design of efficient process control strategies are of great significance for improving biodiesel yield, reducing energy consumption, and minimizing waste [4].
Traditional control methods for biodiesel production, including fuzzy control [5,6] and model predictive control (MPC) [7], have been widely studied. For instance, Manimaran et al. [8] proposed an MPC-based method for controlling the transesterification process, optimizing the controller parameters with genetic algorithms. Gupta et al. [9] employed a two-layer control framework combining constrained iterative learning control and explicit MPC to address yield fluctuations under disturbances. However, owing to their fixed parameters and lack of adaptability, traditional methods struggle to handle the complex and dynamic nature of biodiesel production, limiting the achievable yield. Deep reinforcement learning, which learns the system dynamics through continuous interaction with the environment and adapts to varying reaction conditions, offers a promising solution for such complex systems.
Mnih et al. [10] introduced the deep Q-network (DQN) algorithm, which combined deep neural networks with reinforcement learning. By using convolutional neural networks to approximate the Q-function, DQN achieved human-level performance in a variety of Atari games with discrete action spaces. Lillicrap et al. [11] proposed the deep deterministic policy gradient (DDPG) algorithm, which extended DQN to continuous action spaces by incorporating techniques such as experience replay and target Q-networks to enhance stability. Building upon the actor–critic framework, Fujimoto et al. [12] introduced the twin delayed deep deterministic policy gradient (TD3) algorithm, comprising an actor network, two critic networks, and their respective target networks. The critic networks learn the Q-function, while the actor network updates the policy under the guidance of the critics. The training process optimizes the parameters of the actor, critic, and target networks, with delayed updates for improved stability. Compared to earlier reinforcement learning algorithms, TD3 has demonstrated significant advantages in complex industrial control tasks.
In recent years, deep reinforcement learning algorithms have shown great promise in addressing process control challenges. Powell et al. [13] proposed a reinforcement learning-based real-time optimization method, embedding optimal decisions in a neural network for steady-state optimization and employing hybrid training to enhance real-time processing capabilities. Peng et al. [14] combined PID control with the deep deterministic policy gradient (DDPG) algorithm to address water level control at low power levels in nuclear power plants, highlighting the feasibility and effectiveness of deep reinforcement learning-based control methods in industrial applications. Panjapornpon et al. [15] applied a DDPG-based control method to pH and liquid level control in a continuous stirred-tank reactor (CSTR), demonstrating superior tracking speed and stability compared to traditional controllers. Chowdhury et al. [16] employed entropy maximization to improve the twin delayed deep deterministic policy gradient (TD3) algorithm for PID tuning, demonstrating its effectiveness in controlling the temperature and feed flow rate of a CSTR. However, these studies have primarily applied deep reinforcement learning to simplified chemical processes, while its application to complex processes such as biodiesel production remains underexplored. In biodiesel production, the coupling relationships between variables are complex and dynamic, making it difficult for traditional control methods to achieve optimal performance. Deep reinforcement learning algorithms, which learn from historical data and system states, can make optimal control decisions, leading to more efficient production.
In this paper, a deep reinforcement learning algorithm, TD3, is proposed for controlling biodiesel production. First, a TD3 controller is designed based on the dynamic model of biodiesel production. Subsequently, a deep reinforcement learning environment and a reward function are constructed using the model and a neural network. After training, an optimized control policy is obtained. Finally, simulation experiments are conducted to compare the tracking performance of the proposed method with traditional control algorithms and other reinforcement learning algorithms.
2. Physical System
To evaluate advanced fault detection and diagnosis methods as well as control and optimization strategies, the benchmark model BDsim [17] is employed. The model is capable of simulating the impact of critical process variables, including temperature and catalyst flow rate, on product yield, and it supports dynamic simulations for validating the performance of the TD3 control algorithm. The biodiesel production process, based on the transesterification of oil and methanol, comprises a reactor, a separator, a washer, and a dryer, as illustrated in Figure 1. The experiment is based on the following assumptions:
- (1) The raw materials are natural products with constant composition;
- (2) The catalyst concentration is assumed to be constant throughout the reaction;
- (3) Any side reactions, such as saponification, are neglected.
Raw materials and methanol are fed into the reactor, where they are thoroughly mixed through agitation. Under specific temperature conditions and in the presence of a basic catalyst, the triglycerides (TG) react with methanol (M) via transesterification to produce fatty acid esters (E) and glycerol (G) as a byproduct. The reaction can be represented as follows:
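Assuming the standard base-catalyzed, stepwise mechanism (each step consuming one mole of methanol and releasing one mole of ester), the sequence can be sketched as

\[
\mathrm{TG} + \mathrm{M} \rightleftharpoons \mathrm{DG} + \mathrm{E}, \qquad
\mathrm{DG} + \mathrm{M} \rightleftharpoons \mathrm{MG} + \mathrm{E}, \qquad
\mathrm{MG} + \mathrm{M} \rightleftharpoons \mathrm{G} + \mathrm{E},
\]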
where TG, DG, MG, M, E, and G represent triglycerides, diglycerides, monoglycerides, methanol, ester, and glycerol, respectively. Because the reaction does not reach completion, the mixture leaving the reactor contains not only the desired products but also a small amount of unreacted material, and must therefore be separated. To facilitate separation, the mixture is first cooled in a heat exchanger before entering the separator unit. Gravity settling is then employed to separate the immiscible glycerol (G) and ester (E). In the separator, the mixture splits into two phases: the denser G settles at the bottom as the heavy phase, while the lighter E forms the upper light phase. The overflowing light phase is collected and sent to a washer, where it is washed with water to remove methanol, catalyst, and other residual impurities. The final product, commercial biodiesel, has a minimum purity of 96.5% [18].
2.1. Reactor
For simplicity of the model equations, the set of chemical species is defined as {TG, DG, MG, M, E, G}. Starting from the material and energy balances, the differential equations for the reactor are given by Equations (1) and (2), with the total amount of substance inside the reactor obtained by summing the component holdups. The quantities appearing in these equations are the concentrations of the various species in the reactor, the feed flow rates of methanol and oil, the reaction rates of the individual components, the reactor volume, the reactor temperature, the inlet temperatures of methanol and oil, and the specific heat capacity and heat of reaction. The specific values of these parameters can be found in [18].
Finally, the molar flow rate leaving the reactor unit is obtained from these balances. The numerical values of the parameters are provided in Appendix A.
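For illustration only, and in generic CSTR notation rather than the exact symbols of the BDsim model, balances of the kind underlying Equations (1) and (2) take the form

\[
\frac{dC_i}{dt} = \frac{F_{\mathrm{M}}\,C_{i,\mathrm{M}} + F_{\mathrm{TG}}\,C_{i,\mathrm{TG}} - F_{\mathrm{out}}\,C_i}{V} + r_i ,
\qquad i \in \{\mathrm{TG, DG, MG, M, E, G}\},
\]
\[
\rho c_p V \frac{dT}{dt} = F_{\mathrm{M}}\rho_{\mathrm{M}} c_{p,\mathrm{M}}\,(T_{\mathrm{M}} - T) + F_{\mathrm{TG}}\rho_{\mathrm{TG}} c_{p,\mathrm{TG}}\,(T_{\mathrm{TG}} - T) + (-\Delta H_r)\, r V ,
\]

where C_i denotes a species concentration, F a feed or outlet flow rate, r_i a component reaction rate, V the reactor volume, T the reactor temperature, and ΔH_r the heat of reaction.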
2.2. Heat Exchanger
The heat exchanger reduces the temperature of the reaction mixture to the desired separation temperature by removing energy. Since no reaction occurs within the heat exchanger, the outlet stream has the same composition as the inlet stream, as shown in Equation (3); the additional quantity appearing in Equation (3) is the heat removed in the heat exchanger.
2.3. Decanter
From the molar mass balances of the light and heavy phases, equations for the compositions of the light- and heavy-phase mixtures in the separator are obtained; the specific expressions are given in Equations (4) and (5). The quantities appearing in these equations are the concentrations of the various species in the light and heavy phases, the flow rate of the mixture leaving the reactor, the molar amount of the mixture, and the split fraction, defined as the ratio of the concentration of each substance in the light phase to its concentration before separation. The mixture entering the separator is split asymmetrically: owing to the strong affinity between TG, DG, MG, and esters, all triglycerides are separated into the light phase with a split fraction of 1, while glycerol (G) and part of the methanol (M) are separated into the heavy phase. The overflow of the light phase over the baffle maintains a constant total liquid height within the separator, equal to the sum of the light- and heavy-phase heights. The height of the light phase is given by Equation (6), and the height of the heavy phase is calculated by Equation (7), in which the cross-sectional area of the separator appears. Assuming an ideal environment in which heat exchange between the mixture inside the separator and the surroundings is negligible, the expression for the separator temperature is derived from the energy balance, as shown in Equation (8).
2.4. Washer + Dryer
In industrial processes, the unpurified biodiesel produced by separation may contain residual methanol, glycerol, and sodium salts. Since these impurities are water soluble, washing and drying steps are employed to remove them, leaving only E and residual TG, DG, and MG. The formula for the final yield of biodiesel is therefore given by Equation (9), in which the quantities are the mole fractions of the components in the mixture, their molar masses, and the final yield. Both excessively high and excessively low reaction temperatures significantly affect the reaction rate and final yield. Therefore, the reactor temperature and the final yield are chosen as the controlled variables in this paper. Multiple variables in the reaction process, such as temperature and flow rate, are highly coupled. Considering these interactions, the oil temperature, methanol flow rate, and oil flow rate are selected as the manipulated variables.
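As a sketch of what Equation (9) expresses, and using generic symbols that are not necessarily the paper's notation, the yield can be read as the ester mass fraction of the washed and dried stream:

\[
Y = \frac{x_{\mathrm{E}} M_{\mathrm{E}}}{x_{\mathrm{E}} M_{\mathrm{E}} + x_{\mathrm{TG}} M_{\mathrm{TG}} + x_{\mathrm{DG}} M_{\mathrm{DG}} + x_{\mathrm{MG}} M_{\mathrm{MG}}},
\]

where x_i is the mole fraction of component i and M_i its molar mass.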
3. Controller Design
3.1. Markov Decision Process
Reinforcement learning studies the interaction between an agent and an environment, in which the agent learns an optimal policy in order to make sequential decisions and maximize a cumulative reward [19]. The reinforcement learning problem is typically modeled as a Markov decision process (MDP), defined by a tuple (S, A, P, R), where S is the state space, A is the action space, P is the state transition probability, and R is the reward function. At each time step t, the agent selects an action a_t in state s_t according to its policy π, receives a reward r_t, and transitions to a new state s_{t+1} based on the transition probability [20]. This process continues until the environment terminates, as illustrated in Figure 2.
The agent’s actions are determined by the policy π(a|s), which defines a probability distribution over the action space; for stochastic policies, π(a|s) gives the probability of selecting action a in state s.
The value of the action taken by policy π in state s_t is evaluated through the Q-value function. In Equation (10), Q^π(s, a) represents the expected cumulative return obtained by starting from state s, executing action a, and thereafter following policy π; this is referred to as the state-action value function. In Equation (11), V^π(s) represents the expected cumulative return obtained by starting from state s and following policy π, which is referred to as the state value function.
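In their textbook form, with γ the discount factor and r_{t+k+1} the reward received k steps after time t, these two functions can be written as

\[
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a\right],
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s\right],
\]

which is the form that Equations (10) and (11) follow up to notation.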
3.2. TD3 Algorithm
The twin delayed deep deterministic policy gradient (TD3) algorithm is a state-of-the-art reinforcement learning algorithm that addresses the overestimation of Q-values arising in the deep deterministic policy gradient (DDPG) algorithm. By employing twin critic networks, TD3 effectively mitigates maximization bias, enhancing stability and convergence, especially in continuous action spaces.
To estimate the Q-value of the next state-action pair, TD3 takes the minimum of the two Q-value estimates as the target value, as shown in Equation (12):
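In the standard TD3 formulation [12], which also incorporates the target policy smoothing described below, this target takes the form

\[
y = r + \gamma \min_{i=1,2} Q_{\theta_i'}\!\big(s',\, \pi_{\phi'}(s') + \epsilon\big),
\qquad \epsilon \sim \operatorname{clip}\!\big(\mathcal{N}(0,\sigma),\, -c,\, c\big),
\]

with ε a clipped Gaussian smoothing noise of standard deviation σ and bound c,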
where γ is the discount factor applied to future rewards. By updating the actor network less frequently than the critic networks, TD3 avoids the oscillations that frequent updates can cause in complex industrial environments: the delayed update strategy updates the critic networks several times before updating the actor network parameters, so that the actor is updated only after the critics have become relatively stable [21]. The actor parameters are updated along the deterministic policy gradient toward a local maximum of the expected return, and the target networks track the learned networks through soft updates to reduce variance. The parameter update formula is as follows:
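A common way to write this soft (Polyak) update of the target networks is

\[
\theta_i' \leftarrow \tau\,\theta_i + (1-\tau)\,\theta_i', \qquad
\phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi'.
\]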
In the equation, φ and θ_i are the parameters of the actor (action) network and the critic (evaluation) networks, respectively, and τ is the soft update parameter.
Additionally, DDPG is prone to overfitting to sharp peaks in the value estimate. When updating the critic network, the target is susceptible to function approximation errors, which increases estimation variance and reduces accuracy. TD3 mitigates this effect by introducing target policy smoothing, a regularization technique that adds clipped noise to the target action to reduce the impact of function approximation errors and enhance estimation accuracy, thereby preventing instability caused by drastic input fluctuations in industrial processes.
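To make these mechanisms concrete, the following is a minimal, illustrative PyTorch-style sketch of one TD3 update step. It is not the implementation used in this work; the network sizes, noise parameters, and function names are assumptions. It shows the clipped double-Q target, target policy smoothing, the delayed actor update, and the soft target update.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

def td3_update(batch, actor, actor_t, critics, critics_t,
               actor_opt, critic_opts, step,
               gamma=0.99, tau=0.005, policy_noise=0.2,
               noise_clip=0.5, policy_delay=2):
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped noise on the target action (Eq. (12)).
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_t(next_state) + noise).clamp(-actor.max_action, actor.max_action)
        # Clipped double-Q: take the minimum of the two target critics.
        q1_t = critics_t[0](next_state, next_action)
        q2_t = critics_t[1](next_state, next_action)
        y = reward + gamma * (1.0 - done) * torch.min(q1_t, q2_t)

    # Update both critics toward the shared target y.
    for critic, opt in zip(critics, critic_opts):
        loss = F.mse_loss(critic(state, action), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Delayed policy update and soft (Polyak) target updates (Eq. (13)).
    if step % policy_delay == 0:
        actor_loss = -critics[0](state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_t in zip([actor, *critics], [actor_t, *critics_t]):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```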
3.3. Control System Framework
In a control system, the agent serves as the controller, while the environment encompasses external factors such as the industrial process and sensor data; the objective is to learn an optimal control policy [22,23]. As depicted in Figure 3, the agent interacts with a biodiesel production environment. In deep reinforcement learning, both the state space and the action space are pivotal, and in contrast to many benchmark tasks, the action space in biodiesel production is continuous. The state space comprises all potential environmental states; Table 1 outlines the state space for this deep reinforcement learning environment. The state is a two-dimensional vector whose components are the reactor temperature and the final yield.
In deep reinforcement learning environments, action spaces are typically categorized as discrete or continuous [24]. Given the nature of the biodiesel production process, the action space for this problem is continuous. The action is a three-dimensional vector whose components correspond to the inlet oil temperature, the inlet oil flow rate, and the inlet alcohol flow rate, respectively. The specific value ranges for these actions are detailed in Table 2.
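As an illustration of how these spaces might be declared in a Gym-style simulation environment, a minimal sketch follows; the numerical bounds are placeholders standing in for the actual ranges of Tables 1 and 2.

```python
import numpy as np
from gymnasium import spaces

# State: [reactor temperature, final yield]; Action: [inlet oil temperature,
# inlet oil flow rate, inlet alcohol flow rate]. All bounds are illustrative only.
observation_space = spaces.Box(
    low=np.array([300.0, 0.0]),
    high=np.array([400.0, 1.0]),
    dtype=np.float64,
)
action_space = spaces.Box(
    low=np.array([300.0, 0.1, 0.1]),
    high=np.array([360.0, 5.0, 5.0]),
    dtype=np.float64,
)
```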
3.4. Reward Function
In deep reinforcement learning algorithms, designing a reward function that aligns with the actual production process is crucial for guiding the agent to learn quickly and efficiently to achieve control objectives. In biodiesel production, process variables are monitored in real time to obtain dynamic data, which are then fed into the TD3 algorithm. The algorithm evaluates the outcome of the current action using a reward function and generates an optimized control output to adjust parameters such as temperature and flow rate, thereby controlling the production process. The reward function is designed based on the deviation between the current state and the desired value, ensuring that the production process always converges to the setpoint.
Concretely, the reward is built from the error between the current state and the desired value and is divided into three components: a primary reward and two auxiliary rewards. The primary reward is calculated from the errors between the setpoints and the actual process variables, that is, the deviations of the reactor temperature and of the yield from their respective setpoints. An auxiliary reward is provided when both errors fall below their thresholds. The corresponding reward function is expressed in Equation (14). In the equation, the reward scaling factor guides the agent's learning direction, the error terms measure the deviation of the current state values from the desired setpoints, and the error thresholds define the tolerance bands. A relatively large reward is given when both errors are smaller than their respective thresholds; otherwise, only the smaller reward is granted. When the maximum error threshold is exceeded, the agent receives a large penalty.
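A minimal Python sketch of a reward with this structure is given below; the scaling factor, thresholds, bonus, and penalty values are placeholders rather than the constants of Equation (14).

```python
def reward(temp, yield_frac, temp_sp, yield_sp,
           k=1.0, temp_tol=0.5, yield_tol=0.005,
           temp_max=5.0, yield_max=0.05,
           bonus=10.0, penalty=-10.0):
    """Shaped reward: scaled tracking error, a bonus when both errors are
    within tolerance, and a penalty when either error is too large."""
    e_temp = abs(temp - temp_sp)
    e_yield = abs(yield_frac - yield_sp)

    if e_temp > temp_max or e_yield > yield_max:
        return penalty                          # large penalty outside the allowed band
    if e_temp < temp_tol and e_yield < yield_tol:
        return bonus                            # large auxiliary reward near both setpoints
    # Primary reward: the (smaller) negative scaled tracking error.
    return -k * (e_temp / temp_max + e_yield / yield_max)
```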
4. Simulation Experiments and Results
4.1. Experimental Setup
The raw material used in this experiment is palm oil. As the transesterification reaction is highly sensitive to reactor temperature, the primary control objectives are to maintain the reaction temperature and optimize biodiesel yield to ensure complete conversion. The recovered biodiesel product is expected to meet product specifications after methanol recovery.
In this simulation model, the feed oil temperature, feed oil flow rate, and methanol flow rate are the manipulated variables, while the heat-exchanger duty and the methanol temperature are treated as disturbance variables. The controlled variables are the reactor temperature and the yield. The control system is designed to track the setpoints of these variables under fluctuating disturbances while ensuring that the manipulated variables remain within their specified ranges; the disturbance variables are likewise confined to defined ranges.
Simulations are conducted to verify the control performance of the TD3 algorithm, and the results are compared with those obtained using traditional PID and NMPC controllers, as well as the DDPG algorithm. For a fair comparison, the hyperparameters of DDPG and TD3 are kept identical, with the key parameters listed in Table 3.
The reward trends for each episode during the training of DDPG and TD3 are shown in Figure 4. Observing the reward curves of the different deep reinforcement learning algorithms during training shows that TD3 converges the fastest, reaching the optimal solution 150–200 episodes earlier than DDPG and converging within about 150 episodes. This is because the single critic network in DDPG accumulates approximation error during the iterative training of the agent, biasing the value estimates, which slows training and risks convergence to local optima. Estimating the value of an action with two critic networks simultaneously and taking the minimum of their outputs improves the accuracy of the action-value estimate, mitigating the slow training and unstable policies caused by accumulated estimation error.
4.2. Simulation Experiment of Expectation Tracking
The first set of experiments was designed to evaluate the robustness and dynamic adjustment capability of the control algorithms in a biodiesel production process. The experimental results are shown in
Figure 5.
The results indicate that, owing to initial process instability, both the yield (Figure 5a) and the reactor temperature (Figure 5b) exhibited small fluctuations around their setpoints. As the control algorithms took effect, the system gradually stabilized and the yield converged to the setpoint. At 160 min, the yield setpoint was changed to 97% (m/m), driving the system away from the steady state. The TD3 algorithm adjusted the control actions according to the new system state, bringing the yield and reactor temperature back to the desired values.
Figure 6 illustrates the variations in the inlet oil temperature, alcohol flow rate, and oil flow rate under TD3 control. To achieve the new yield setpoint, the algorithm reduced the oil flow rate and increased the alcohol flow rate, thereby adjusting the alcohol-to-oil ratio. Moreover, to maintain a suitable reaction temperature, the controller adjusted the inlet oil temperature to compensate for the temperature fluctuations caused by the introduction of large amounts of cold feedstock.
Comparing the control performance of the different algorithms, TD3 outperforms both DDPG and the traditional control methods in terms of settling time and overshoot. The superior stability of TD3 can be attributed to its rapid convergence, achieved through the twin critic networks and delayed updates, which enable the control system to adapt quickly to environmental changes.
4.3. Simulation Experiment Under Step-Type Noise Input
This section experimentally verifies the ability of the production system to reach a stable state rapidly under significant external disturbances while satisfying the imposed upper and lower constraints. In the experiment, the alcohol temperature among the disturbance variables was held constant, and a step change was introduced in the disturbance input at a specific time; the noise input variation is shown in Figure 7. The different control algorithms adjusted their inputs according to the changed system state caused by the disturbance, compensating for the noise and returning the reactor temperature to the setpoint.
Under the step-change conditions, the regulation of the final yield and the reactor temperature by TD3 and the other control algorithms is shown in Figure 8. As shown in Figure 8a,b, TD3 exhibits the fastest recovery and the smallest fluctuations after the disturbance, outperforming the other algorithms. Under the corrective action of the controller, the control system gradually converges to a steady state.
Table 4 presents the control performance indices for the reactor temperature and the final yield from Figure 8a,b. Compared with PID, NMPC, and DDPG, TD3 shows significant advantages in control accuracy and stability. Overall system stability is reflected by the MSE and IAE, while control robustness is indicated by the MAE; smaller values of MSE, IAE, and MAE represent better performance. The specific calculation methods for MSE, IAE, and MAE are detailed in Appendix B.
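For reference, these indices follow their standard definitions over the tracking error e(t) between a controlled variable and its setpoint, with N sampled errors e_k and sampling interval Δt; Appendix B may state them in an equivalent form:

\[
\mathrm{MSE} = \frac{1}{N}\sum_{k=1}^{N} e_k^{2}, \qquad
\mathrm{MAE} = \frac{1}{N}\sum_{k=1}^{N} \lvert e_k \rvert, \qquad
\mathrm{IAE} = \int_{0}^{T} \lvert e(t) \rvert\, dt \approx \sum_{k=1}^{N} \lvert e_k \rvert\, \Delta t .
\]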
Among all the evaluated metrics (MSE, MAE, IAE), the TD3 algorithm consistently demonstrated the best performance in both reactor temperature control and yield control. These results indicate that TD3 offers superior control performance when faced with disturbances, effectively reducing errors and enhancing system stability in both temperature and yield control. This highlights the potential of TD3 as a highly efficient control strategy for complex industrial processes such as biodiesel production.
5. Discussion
This study proposes a novel adaptive optimization control method based on the twin delayed deep deterministic policy gradient (TD3) algorithm to address the challenges of multivariable coupling, nonlinearity, and time delay in biodiesel production. By constructing a simplified biodiesel production model and designing a TD3 controller, simulation experiments demonstrated that the TD3-based control strategy can respond quickly, effectively suppressing fluctuations in temperature and yield and ensuring the stable operation of the production system. Compared to other reinforcement learning algorithms and traditional control methods, TD3 exhibits superior stability and a faster response. The introduction of twin critic networks and delayed updates in TD3 significantly enhances the stability and responsiveness of the control policy, showcasing its potential for complex industrial processes. By leveraging experience replay to learn from past interactions, TD3 can adapt to changing environments and explore new control strategies while exploiting existing ones.
While BDsim serves as a convenient benchmark model due to its simplified assumptions, its applicability may be compromised when dealing with highly variable feedstocks such as waste oil. Although TD3’s twin critic network improves convergence, it relies heavily on computational resources and ample interaction data for effective policy learning. To address these limitations, BDsim’s application can be enhanced by incorporating more real-world experimental data and data-driven models. The substantial computational requirements of TD3 can be mitigated by pre-training and transferring the model to target applications.
When applied to real-world production processes, TD3-based control algorithms can dynamically adjust process parameters in real time, enabling more precise control over raw material and catalyst usage, ensuring product quality stability and reducing production costs. In practical chemical reaction processes, a single control strategy is often insufficient. Combining traditional control methods (such as PID control) with reinforcement learning to form a hybrid control strategy can enhance TD3's performance under constrained conditions; this hybrid control strategy can be applied to other complex industrial control processes.