Article

Bayesian Deep Reinforcement Learning for Operational Optimization of a Fluid Catalytic Cracking Unit

1 Huzhou Key Laboratory of Intelligent Sensing and Optimal Control for Industrial Systems, School of Engineering, Huzhou University, Huzhou 313000, China
2 Zhejiang Key Laboratory for Industrial Solid Waste Thermal Hydrolysis Technology and Intelligent Equipment, Huzhou University, Huzhou 313000, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(5), 1352; https://doi.org/10.3390/pr13051352
Submission received: 22 March 2025 / Revised: 23 April 2025 / Accepted: 26 April 2025 / Published: 28 April 2025
(This article belongs to the Special Issue Machine Learning Optimization of Chemical Processes)

Abstract

Emerging machine learning techniques provide great opportunities for the optimal operation of chemical systems. This paper presents a Bayesian deep reinforcement learning method for the optimization of a fluid catalytic cracking (FCC) unit, a key process in the petroleum refining industry. Unlike traditional reinforcement learning (RL) methods that use deterministic network weights, Bayesian neural networks are incorporated to represent the RL agent. The Bayesian treatment is integrated with the primal-dual method to handle the process constraints. Simulated experiments on the FCC unit show that the proposed algorithm achieves more stable control performance and higher economic profits, especially under parameter fluctuations and external disturbances.

1. Introduction

In chemical processes, traditional process control primarily relies on feedback control techniques, typically Proportional–Integral–Derivative (PID) controllers and Model Predictive Control (MPC) [1]. These methods are widely used owing to their concise structure, strong interpretability, and extensive tuning methods. However, for complex industrial processes, such as those with significant time delays and changing dynamics, they often yield suboptimal results in highly uncertain environments.
The rapid development of machine learning technologies offers new solutions for process control. Being data-driven, machine learning methods can automatically extract key patterns from massive process data and build more accurate prediction and decision models. Broadly, machine learning paradigms encompass supervised learning, unsupervised learning, and reinforcement learning (RL) [2]. For decision-making problems, RL has demonstrated significant potential for operating complex chemical processes, such as batch reactor temperature control [3,4], distillation column product purity regulation [5,6], and polymer reactor quality control [7,8]. Moreover, RL has been employed for process design, optimizing unit operation arrangements and evaluating improved process schemes through iteration; see the recent literature on absorption-stripping processes [9], energy system design [10], unit operation design [11], and separation processes [12]. However, it is worth noting that the use of RL for static process design remains controversial in some recent scientific discussions.
Safe reinforcement learning (SRL) is the branch of RL that deals with constrained problems. Operational constraints are common in the chemical industry, and SRL approaches have therefore received notable attention. Like common RL, however, the existing SRL methods face challenges that severely restrict their widespread application in high-risk, highly uncertain chemical processes. Firstly, deterministic neural networks are employed in the agents, which cannot accurately quantify system uncertainties stemming from, for example, parameter fluctuations and external disturbances. Secondly, only point estimates are typically given, which tend to ignore potential risks owing to the stochastic nature of the uncertainties. Consequently, the robustness of the RL agent is unsatisfactory.
Bayesian reinforcement learning (BRL) has significant advantages in decision-making for uncertain systems. Based on this understanding, this research proposes a new Bayesian deep safe reinforcement learning framework for optimizing complex chemical processes, with application to a fluid catalytic cracking (FCC) unit. FCC is a core unit in the petroleum refining industry, which exhibits highly nonlinear characteristics, strong multi-variable coupling, and significant parameter uncertainties. Although the existing temperature control structures [13] and recently developed advanced control and optimization approaches [14,15,16] have achieved some good results for FCC problems, these solutions may still not be optimal in the presence of model uncertainties and time-varying system characteristics. In this paper, we rely on the machine learning methodologies and make the following contributions:
  • We propose a BRL method for the operation problem of the FCC unit. Unlike traditional RL methods that employ deterministic networks, we utilize Bayesian neural networks (BNNs) to represent the RL agent, effectively capturing the uncertainties in FCC.
  • We adopt a primal-dual method to handle the process constraints, ensuring the optimality of the control policy while satisfying the safety requirements, which is new within the BRL framework.
  • Extensive simulations are conducted to investigate the dynamic responses of FCC under different operating conditions. Compared with traditional deterministic gradient methods, the proposed approach achieved improved economic profits and more stable control performance.
The remainder of this paper is structured as follows: Section 2 introduces the Constrained Markov Decision Process (CMDP); Section 3 elaborates on the proposed Bayesian SRL method; Section 4 presents the operation problem of FCC and the implementation details of the SRL scheme; Section 5 describes the results, and Section 6 concludes this paper.

2. Constrained Markov Decision Process

In this paper, we perform operational optimization of chemical processes using SRL. The decision-maker, referred to as the agent in SRL, interacts with the environment by outputting actions according to some rule (the policy), such that the received cumulative rewards are maximized while the process constraints are satisfied (cumulative costs are kept below a threshold). SRL is typically tackled within the theoretical framework of the Constrained Markov Decision Process (CMDP) [17], defined by the state space $\mathcal{S}$, the action space $\mathcal{A}$, the state transition probability $P(s_{t+1} \mid s_t, a_t)$ denoting the probability of transiting from state $s_t$ to $s_{t+1}$ given action $a_t$, the reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, the cost function $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ representing the price for the constraint, the safety threshold $d$, and the discount factor $\gamma \in (0, 1]$.
Suppose that the agent starts at an initial state $s_0$ drawn from an initial-state distribution. At each time step $t$, it observes the current state $s_t$ and selects an action $a_t \sim \pi(\cdot \mid s_t)$ according to the policy $\pi$. The agent receives an immediate reward $r_t = r(s_t, a_t)$ and an immediate cost $c_t = c(s_t, a_t)$. After executing the action, the environment transitions to a new state $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. This interaction procedure continues and generates a trajectory $\mathcal{T} = \{s_0, a_0, r_0, c_0, s_1, a_1, r_1, c_1, \ldots\}$. The agent's cumulative discounted reward and cost are $Z_\pi^R = \mathbb{E}_{\mathcal{T} \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ and $Z_\pi^C = \mathbb{E}_{\mathcal{T} \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t c_t\right]$, respectively.
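For readers who prefer code, the following minimal NumPy sketch rolls out a policy in a CMDP-style environment and evaluates the discounted cumulative reward and cost defined above. The env and policy interfaces (reset, step returning reward and cost) are assumptions for illustration only, not the implementation used in this paper.

```python
import numpy as np

def discounted_returns(rewards, costs, gamma=0.99):
    """Discounted cumulative reward Z^R and cost Z^C of a single trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ np.asarray(rewards)), float(discounts @ np.asarray(costs))

def rollout(env, policy, horizon=200, gamma=0.99):
    """Generate one trajectory {s_0, a_0, r_0, c_0, ...} and return (Z^R, Z^C).

    Assumed interface: env.reset() -> s, env.step(a) -> (s', r, c, done),
    policy(s) -> a. These names are placeholders.
    """
    s = env.reset()
    rewards, costs = [], []
    for _ in range(horizon):
        a = policy(s)                 # a_t ~ pi(. | s_t)
        s, r, c, done = env.step(a)   # environment returns reward and cost signals
        rewards.append(r)
        costs.append(c)
        if done:
            break
    return discounted_returns(rewards, costs, gamma)
```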
The agent aims to learn an optimal safe policy $\pi^*$ that maximizes the expected cumulative reward $Z_\pi^R$ while satisfying the constraint $Z_\pi^C \le d$. This optimization problem can be formally expressed as
$$\pi^* = \arg\max_\pi Z_\pi^R \quad \text{s.t.} \quad Z_\pi^C \le d \qquad (1)$$
The CMDP problem (1) can typically be solved using the primal-dual method [18] by transforming it into the following dual form:
$$\pi^* = \arg\max_\pi \min_{I \ge 0} \; Z_\pi^R - I \left(Z_\pi^C - d\right) \qquad (2)$$
where $I \ge 0$ is the Lagrangian multiplier. The primal-dual method alternately performs gradient ascent on $\pi$ and gradient descent on $I$.
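Schematically, one alternating primal-dual iteration can be sketched as follows (a sketch under the assumption that gradient estimates of $Z_\pi^R$ and $Z_\pi^C$ with respect to the policy parameters are available; it is not the exact update used later in the paper):

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, z_c_hat, d,
                     lr_theta=1e-3, lr_lam=1e-2):
    """One alternating primal-dual update for the Lagrangian Z^R - I * (Z^C - d).

    theta       : policy parameters
    lam         : Lagrangian multiplier I >= 0
    grad_reward : estimated gradient of Z^R w.r.t. theta
    grad_cost   : estimated gradient of Z^C w.r.t. theta
    z_c_hat, d  : estimated cumulative cost and the safety threshold
    """
    # Primal step: gradient ascent on the Lagrangian with respect to the policy.
    theta = theta + lr_theta * (np.asarray(grad_reward) - lam * np.asarray(grad_cost))
    # Dual step: gradient descent with respect to I, projected back onto I >= 0,
    # so I grows while the constraint Z^C <= d is violated.
    lam = max(0.0, lam + lr_lam * (z_c_hat - d))
    return theta, lam
```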

3. Bayesian Primal-Dual Deep Deterministic Policy Gradient

The core of the actor–critic RL method lies in evaluating policy effectiveness through value functions. Value functions quantify the long-term rewards obtained from specific actions in certain states, primarily including the state value function $V(s)$ and the state–action value function $Q(s, a)$, and are considered effective for problems with continuous state and action spaces, such as chemical process control. However, applying these methods to FCC units presents several challenges. First, in uncertain environments, the value functions (critic) may provide inaccurate feedback, causing the actor to update in the wrong direction and leading to potentially low-reward samples [19]. While there might be a probability of achieving higher rewards, this often comes at the cost of constraint violations, which is absolutely unacceptable in actual chemical production. Furthermore, chemical reactions typically have irreversible characteristics, meaning that early incorrect decisions may lead to irreparable consequences. Therefore, building a value network that can accurately evaluate the environment becomes particularly important.
Given the advantages of Bayesian deep learning in handling uncertainties, we propose a natural extension: using BNNs as function approximators to model the complex nonlinear relationships between states and values [20].

3.1. Bayesian Neural Network

Neural networks define a mapping function $f_\omega: x \mapsto y$ from input space to output space, where $\omega$ are the parameterized weights. In traditional neural networks, $\omega$ are deterministic parameters, whereas the BNN employs probabilistic weights to quantify prediction uncertainty. As shown in Figure 1, $\omega$ in BNNs are random variables rather than fixed values. In SRL, the input $x$ corresponds to state–action pairs $(s, a)$ and the output $y$ corresponds to Q-values $Z$; thus, our mapping function can be represented as $f_\omega: (s, a) \mapsto Z$. Given the dataset $\mathcal{D}$ (the evidence), we first specify a prior distribution for the weights (typically Gaussian) and then perform Bayesian inference based on the likelihood function $p(Z \mid s, a, \omega)$ to obtain the posterior distribution of the weights:
$$p(\omega \mid \mathcal{D}) = \frac{p(Z \mid s, a; \omega)\, p(\omega)}{\int p(Z \mid s, a; \omega)\, p(\omega)\, d\omega}$$
Computing the posterior distribution p ( ω | D ) of BNN is a challenging problem. Various approximate inference methods have been proposed, including MCMC sampling [21], variational inference [22], expectation propagation [23], and Monte Carlo dropout approximation [24]. Notably, dropout has demonstrated excellent performance as a variational Bayesian method across multiple tasks, from classification to active learning, and will be deployed in this paper.
Based on the obtained posterior distribution, the expected Q-value $\mathbb{E}[Z(s, a)]$ for a new state–action pair $(s, a)$ can be estimated through MC sampling:
$$\mathbb{E}[Z(s, a)] = \int f_\omega(s, a)\, p(\omega \mid \mathcal{D})\, d\omega \approx \frac{1}{M} \sum_{m=1}^{M} f_{\omega_m}(s, a)$$
where $M$ is the number of samples and $\omega_m \sim p(\omega \mid \mathcal{D})$.
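In practice, the integral is replaced by averaging stochastic forward passes of the value network with dropout kept active, one pass per weight sample. A minimal sketch is given below; the q_forward callable is an assumed stand-in for the dropout critic, not a function from any particular library.

```python
import numpy as np

def mc_dropout_q(q_forward, s, a, n_samples=50):
    """Estimate E[Z(s, a)] and its spread from M stochastic forward passes.

    q_forward(s, a) must evaluate the value network with dropout active, so each
    call corresponds to one weight sample omega_m ~ p(omega | D).
    """
    samples = np.array([q_forward(s, a) for _ in range(n_samples)])
    return samples.mean(), samples.std()   # posterior mean and uncertainty estimate
```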

3.2. Variational Inference in BNN with α -Divergences

To simplify the computation of the posterior distribution $p(\omega \mid \mathcal{D})$, variational inference introduces a variational distribution $q_\theta(\omega)$ as an approximation of $p(\omega \mid \mathcal{D})$. $q_\theta(\omega)$ is parameterized by $\theta$ and typically takes the form of a Gaussian distribution, $q_\theta(\omega) = \mathcal{N}(\omega \mid \mu, \sigma)$ with $\theta = \{\mu, \sigma\}$. A well-known approximation method is to minimize the Kullback–Leibler (KL) divergence between these two distributions as follows:
$$\begin{aligned}
\arg\min_\theta \mathrm{KL}\left[q_\theta(\omega) \,\|\, p(\omega \mid \mathcal{D})\right] &= \arg\min_\theta \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid \mathcal{D})}\, d\omega \\
&= \arg\min_\theta \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega)\, p(\mathcal{D} \mid \omega) / p(\mathcal{D})}\, d\omega \\
&= \arg\min_\theta \underbrace{\mathrm{KL}\left[q_\theta(\omega) \,\|\, p(\omega)\right] - \mathbb{E}_{q_\theta(\omega)}\left[\log p(\mathcal{D} \mid \omega)\right]}_{-\mathrm{ELBO}(q)} + \underbrace{\log p(\mathcal{D})}_{\text{const.}}
\end{aligned}$$
where the first bracketed term on the right-hand side of the last equality is the negative of the evidence lower bound (ELBO), so minimizing the KL divergence is equivalent to maximizing the ELBO.
Although the optimization method described above is theoretically simple and intuitive, it still faces significant computational complexity challenges in practical implementation. Dropout can serve as an alternative approach for Bayesian approximation [25]. Dropout can be interpreted as a particular form of variational inference that introduces noise into the feature space, which maps to uncertainty in the network parameter space. Specifically, during training, dropout randomly "drops" some neurons by applying stochastic masks to the network weights, thereby approximating the probability distribution over parameters. At test time, dropout is kept active and multiple stochastic forward passes are averaged to approximate Bayesian inference. This method can effectively capture model uncertainty while maintaining computational efficiency.
On the other hand, dropout variational inference tends to underestimate model uncertainty. This issue stems from the asymmetric penalty mechanism in KL divergence minimization: penalties are applied when the approximate posterior distribution q θ ( ω ) assigns non-zero values in regions where the true posterior distribution p ( ω | D ) is zero. However, no penalties are imposed when q θ ( ω ) is zero in regions where p ( ω | D ) has high probability.
To address this limitation, inspired by [26], we adopt the α -divergence as an alternative measure:
$$D_\alpha\left[p \,\|\, q\right] = \frac{1}{\alpha(1-\alpha)} \left(1 - \int p(\omega \mid \mathcal{D})^{\alpha}\, q_\theta(\omega)^{1-\alpha}\, d\omega\right)$$
As illustrated in Figure 2, the parameter α influences the characteristics of approximate distributions. In principle, the method is a generalization with the tunable parameter α . When α takes a large positive value, the approximate distribution q θ ( ω ) tends to encompass multiple modes of the target distribution p ( ω | D ) . Conversely, as  α approaches negative infinity (assuming finite divergence), q θ ( ω ) focuses on the dominant mode with the highest probability [27].
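This mass-covering versus mode-seeking behaviour can be checked numerically. The sketch below is illustrative only: the bimodal "posterior" and the two candidate approximations are fabricated for the demonstration. For $\alpha$ close to 1 the wide, mass-covering approximation yields the smaller divergence, whereas for negative $\alpha$ the narrow, single-mode approximation is strongly preferred.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def alpha_divergence(p, q, x, alpha):
    """Grid approximation of D_alpha[p||q] = (1 - int p^a q^(1-a) dx) / (a(1-a))."""
    dx = x[1] - x[0]
    integral = float(np.sum(p ** alpha * q ** (1.0 - alpha)) * dx)
    return (1.0 - integral) / (alpha * (1.0 - alpha))

x = np.linspace(-10.0, 10.0, 4001)
p = 0.5 * gaussian(x, -2.0, 0.7) + 0.5 * gaussian(x, 3.0, 0.7)  # bimodal "posterior"
q_wide = gaussian(x, 0.5, 3.0)      # mass-covering candidate
q_narrow = gaussian(x, 3.0, 0.7)    # mode-seeking candidate

for alpha in (0.95, -1.0):          # alpha = 0 and 1 are singular limits, skipped here
    print(alpha,
          alpha_divergence(p, q_wide, x, alpha),
          alpha_divergence(p, q_narrow, x, alpha))
```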
Several effective α -divergence minimization techniques, such as Black-Box α -Divergence Minimization (BB- α ) [27] and dropout BB- α [26], have reportedly achieved significant results in practical applications. Notably, BB- α , as a black-box method, can be directly applied to probability models with complex structures. This characteristic is particularly important in cases where traditional methods such as variational Bayes, expectation propagation (EP), and Power EP encounter difficulties when processing energy functions. First, let us review the traditional BB- α function.
$$\mathcal{L}_\alpha\left(q_\theta(\omega)\right) = -\frac{1}{\alpha} \sum_t \log \mathbb{E}_{\omega \sim q_\theta(\omega)}\left[\left(\frac{p(y_t \mid x_t; \omega)\, p_0(\omega)^{1/T}}{q_\theta(\omega)^{1/T}}\right)^{\alpha}\right] \qquad (9)$$
Due to the intractability of its general expectation form, and to improve computational efficiency and better adapt to gradient-based optimization, we approximate the expectation by drawing samples from the distribution $q_\theta(\omega)$ and reformulate the objective function (9) using the MC method:
$$\mathcal{L}_\alpha^{MC} = -\frac{1}{\alpha} \log \mathbb{E}_{\omega \sim q_\theta(\omega)}\left[\left(\frac{p(y \mid x; \omega)\, p_0(\omega)}{q_\theta(\omega)}\right)^{\alpha}\right]$$
$$\approx -\frac{1}{\alpha} \sum_t \log \mathbb{E}_{\omega \sim q_\theta(\omega)}\left[\exp\left(\alpha \log p(y_t \mid x_t; \omega) + \frac{\alpha}{T}\left[\log p_0(\omega) - \log q_\theta(\omega)\right]\right)\right]$$
$$\approx -\frac{1}{\alpha} \sum_t \log \sum_{m=1}^{M} \exp\left(\alpha \log p(y_t \mid x_t; \omega_m) + \frac{\alpha}{T}\left[\log p_0(\omega_m) - \log q_\theta(\omega_m)\right]\right)$$
Given a loss function $l(\cdot)$, we define the un-normalized likelihood term $p(y \mid x, \omega) \propto \exp\left(-l(y, f_\omega(x))\right)$ [28]. The term $\log p_0(\omega_m)$ represents the log prior, equivalent to an $\ell_1$ or $\ell_2$ regularization [29]. This yields the following minimization objective:
$$\mathcal{L}_\alpha^{MC} \approx -\frac{1}{\alpha} \sum_t \log \sum_{m=1}^{M} \exp\left(-\alpha\, l\left(y_t, f_{\omega_m}(x_t)\right) + \frac{\alpha}{T}\left[\log p_0(\omega_m) - \log q_\theta(\omega_m)\right]\right) \qquad (13)$$
The loss $l(y, f_\omega(x))$ can be the cross-entropy for classification tasks. In our value regression task, however, it is the scaled mean squared error $l(y, f_\omega(s, a)) = \frac{\beta}{2}\left\|y - f_\omega(s, a)\right\|_2^2$, with the corresponding likelihood $y \sim \mathcal{N}\left(y; f_\omega(s, a), \beta^{-1}\sigma\right)$ [26]. By reformulating the energy function (13), we derive a new objective function called $\alpha$-BNN, which approximates the marginal likelihood with high precision and is compatible with mainstream objective functions in deep learning.
The new optimization objective of α -BNN is
$$\mathcal{L}_\alpha^{MC}\left(q_\theta(\omega)\right) = -\frac{1}{\alpha} \sum_t \log \sum_{m=1}^{M} \exp\left(-\frac{\alpha \beta}{2}\left\|y_t - f_{\omega_m}(s, a)\right\|_2^2\right) + \frac{N D}{2} \log \beta + \sum_i \frac{p_i \left\|H_i\right\|_2^2}{2} \qquad (14)$$
where $\beta$ represents the precision parameter, the weight samples $\hat{\omega}_m \sim q_\theta(\omega)$ are obtained through masked dropout, and $\{f_{\omega_m}(s, a)\}_{m=1}^{M}$ is the set of forward propagations obtained by performing $M$ stochastic sampling passes on the input $(s, a)$. $D$ and $p_i$ denote the dropout rate and retention rate of the $i$-th layer, respectively, $N$ is the batch size, and $H$ represents the network parameters without dropout.
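A compact sketch of the data term of this MC objective for value regression is given below. The dropout/L2 regularization terms in (14), which depend on the network weights themselves, are omitted, and the shapes and hyper-parameter values are illustrative only.

```python
import numpy as np

def alpha_bnn_loss(y, preds, alpha=0.95, beta=0.92):
    """MC estimate of the alpha-BNN regression objective (data term of Eq. (14)).

    y     : (N,)   regression targets (e.g., Q-value targets)
    preds : (M, N) M stochastic dropout forward passes for the same mini-batch
    """
    energy = -alpha * 0.5 * beta * (preds - y[None, :]) ** 2   # (M, N)
    # Numerically stable log-sum-exp over the M weight samples, per data point.
    e_max = energy.max(axis=0)
    log_sum = e_max + np.log(np.exp(energy - e_max[None, :]).sum(axis=0))
    return float(-(1.0 / alpha) * log_sum.sum())

# Toy usage with fabricated numbers.
rng = np.random.default_rng(0)
y = rng.normal(size=64)
preds = y[None, :] + 0.1 * rng.normal(size=(50, 64))
print(alpha_bnn_loss(y, preds))
```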

4. The Fluid Catalytic Cracking Process

4.1. Process Descriptions

The FCC process consists of three stages: reaction, product separation, and catalyst regeneration (Figure 3). Crude oil mixes with a hot recycled catalyst from the regenerator on the feed side and enters the riser. The catalyst’s heat vaporizes the oil and enables cracking. In the riser, heavy oil cracks into lighter hydrocarbons, mainly gasoline, which are separated from the catalyst and sent to the fractionator. The multicomponent catalyst typically contains acid USHY zeolite, active alumina matrix, inert matrix (kaolin), binder, and additives. During the reaction, coke deposits deactivate the catalyst, so it is returned to the regenerator to burn off the coke and restore activity. In the regenerator, air is blown from the bottom to fluidize the catalyst, mixing it well with air in the dense bed. Makeup catalyst and withdrawal flows compensate for permanent catalyst losses. The regenerated catalyst is recycled to mix with fresh crude oil feed. A cyclone at the regenerator top separates and collects solid catalyst from the flue gas [30].
The core mechanism of catalytic cracking is the $\beta$-scission reaction, which generates olefins and new carbocations by cleaving carbon–carbon bonds at the $\beta$ position. During this process, carbocations undergo chain reactions with alkane molecules, continuously producing short-chain olefins. Typical reaction pathways include $\mathrm{C_{16}H_{34} \rightarrow C_8H_{18} + C_8H_{16}}$ and $\mathrm{C_8H_{18} \rightarrow C_4H_{10} + C_4H_8}$, among many others. Given the extreme complexity of the cracking reaction network, its intrinsic mechanism remains incompletely elucidated, making the construction of a comprehensive first-principles model extremely challenging. Currently, FCC models predominantly employ hybrid approaches combining reaction mechanisms with empirical rules, with lumped models being the most practical and widely used. This research adopts the three-lumped reactor model [30]. The model abstracts FCC reactions as transitions between three virtual lumped components, gas oil (F) cracking to generate gasoline (G) and light gases/coke (L), with the conversion pathways $F \rightarrow G$, $G \rightarrow L$, and $F \rightarrow L$.
The riser model is approximated as steady-state ordinary differential equations (quasi-steady state), while the regenerator is described by differential equations derived from material and energy balance relationships. For detailed derivation and explanation of these equations, readers are referred to Guan et al. [30]. Table 1 lists the material prices for FCC components used in economic profit calculations, while Table 2 and Table 3 list the key process variables and model parameters.
To facilitate the subsequent transformation of the FCC problem into the required reinforcement learning framework, the following key constraints and economic objective function need to be clearly defined.
The input capacity constraints are
$$6000~\mathrm{kg/min} \le F_{sc} \le 24{,}000~\mathrm{kg/min} \qquad (15)$$
$$0 \le F_a \le 3600~\mathrm{kg/min} \qquad (16)$$
The metallurgical limit for the cyclone temperature is
$$T_{cy} \le 1000~\mathrm{K} \qquad (17)$$
and the metallurgical limits for the riser inlet and outlet temperatures are
$$T_{ri0} \le 1000~\mathrm{K} \qquad (18)$$
$$T_{ri1} \le 1000~\mathrm{K} \qquad (19)$$
The economic profit function $J$ [USD/min] is defined as
$$J = p_{gl} F_{gl} + p_{gs} F_{gs} + p_{ugo} F_{ugo} - p_{ugo} F_{oil} \qquad (20)$$
and the overall operational objective can be formulated as solving the following problem:
$$\max_{F_{sc},\, F_a} \; J = p_{gl} F_{gl} + p_{gs} F_{gs} + p_{ugo} F_{ugo} - p_{ugo} F_{oil} \quad \text{s.t. constraints (15)–(19)} \qquad (21)$$
We consider the process parameters $[k_o, k_c, k_{com}, \sigma_2, h_1, h_2, E_{cb}/R]^T$ as uncertain (see Table 1 for nominal values). Under nominal conditions, nonlinear programming optimization yields the optimal inputs $F_{sc} = 17{,}189$ kg/min and $F_a = 1461.7$ kg/min, with a maximum profit of 44.96 USD/min. However, this solution only applies to nominal conditions. During actual operation, uncertain disturbances cause continuous changes in operating conditions, potentially making pre-determined schemes suboptimal or infeasible [30].

4.2. Formulating as a SRL Problem

The FCC process under consideration is described as a two-degree-of-freedom system ( F s c and F a ) [31,32], whose objective is to maximize the economic profit while satisfying constraints (15)–(19). Given the complex dynamic characteristics, multiple operational constraints, and clear economic objectives of the FCC process, it is handled by applying SRL in this study. As previously described, we formalize the FCC control problem as a CMDP, with its core components detailed below.
(1) Agent: The agent is the decision-making intelligence that achieves optimal control under complex operating conditions through continuous interactive learning. During each sampling period $t$, it acquires key state parameters from the sensors (observations). The interaction between the agent and the FCC process generates a control trajectory $\mathcal{T}$ with a total of $T$ periods.
Let $X \in \{R, C\}$ and $x \in \{r, c\}$, and let the state–action value function $G_{\pi, I}(s, a; \omega)$ characterize the expected economic profit obtained when executing policy $\pi$ and control action $a_t$ in state $s_t$ while satisfying the operational constraints. We solve for the optimal policy
$$\pi^* = \arg\max_\pi \underbrace{\left[Z_\pi^R(s, a; \omega) - I\, Z_\pi^C(s, a; \omega)\right]}_{G_{\pi, I}(s, a; \omega)}$$
Based on this, the FCC unit implements policy π * for optimal control. To achieve maximum cumulative profit throughout the entire process, the agent must continuously learn and refine the mapping relationship between system states and optimal control actions through ongoing interaction with the controlled object.
(2) State: The state vector $s_t$ is the feedback signal to the SRL agent, reflecting the set of process parameters of the FCC environment after executing control command $a_{t-1}$ at sampling period $t-1$. The state space is defined as follows:
$$s_t = [t, C_{rc}, O_d, T_{rg}, T_{ri1}, y_{f1}, y_{gl}, T_{ri0}, J, T_{cy}] \in \mathcal{S}$$
where $t$ denotes the time step within the training period. The detailed parameters of the state space $\mathcal{S}$ are listed in Table 1 and Table 2.
(3) Action: Given state $s_t$, the corresponding control action $a_t$ taken by the SRL agent at time step $t$ is defined as
$$a_t = [a_t^{F_{sc}}, a_t^{F_a}] \in \mathcal{A}$$
where $a_t^{F_{sc}} \in [6000, 24{,}000]$ kg/min represents the catalyst circulation rate and $a_t^{F_a} \in [0, 3600]$ kg/min represents the regenerator air flow rate.
(4) Environment and State Transition: Once the action $a_t$ is issued, the agent interacts with the fluid catalytic cracking environment to obtain the next state $s_{t+1}$ and receives the reward $r_t$ and cost $c_t$. This interaction can be represented by the mapping function $[s_{t+1}, r_t, c_t] = f_{FCC}(s_t, a_t)$, which is constrained by operational rules and governed by the relevant physical laws to ensure the feasibility of the action selected at each time step.
(5) Reward and Cost: In Formula (20), the terms represent product revenue and raw material cost, respectively, where $p$ denotes the material price listed in Table 3. The overall operational objective is therefore formulated as maximizing the cumulative rewards
$$r_t = p_{gl} F_{gl} + p_{gs} F_{gs} + p_{ugo} F_{ugo} - p_{ugo} F_{oil}$$
Based on the characteristics and constraints of the FCC process (15)–(19), the cost signals are designed as
$$\begin{cases}
c_{T_{cy},t} = \lambda_1 \left(T_{cy,t} - 1000\right), & \text{if } T_{cy,t} > 1000 \\
c_{T_{ri0},t} = \lambda_2 \left(T_{ri0,t} - 1000\right), & \text{if } T_{ri0,t} > 1000 \\
c_{T_{ri1},t} = \lambda_3 \left(T_{ri1,t} - 1000\right), & \text{if } T_{ri1,t} > 1000 \\
c_{F_{sc},t}^{1} = \lambda_4 \left(F_{sc,t} - 24{,}000\right), & \text{if } F_{sc,t} > 24{,}000 \\
c_{F_{sc},t}^{2} = \lambda_5 \left(6000 - F_{sc,t}\right), & \text{if } F_{sc,t} < 6000 \\
c_{F_a,t} = \lambda_6 \left(F_{a,t} - 3600\right), & \text{if } F_{a,t} > 3600 \\
0, & \text{otherwise}
\end{cases}$$
In addition, a soft penalty on deviations from the operating set-points is included:
$$c_{t,1} = \lambda_7 \left|T_{ri1} - 782.8\right| + \lambda_8 \left|T_{cy} - 999.5\right|$$
The total cost signal is
$$c_t = c_{T_{cy},t} + c_{T_{ri0},t} + c_{T_{ri1},t} + c_{F_{sc},t}^{1} + c_{F_{sc},t}^{2} + c_{F_a,t} + c_{t,1}$$
where the $\lambda_i$ are tunable weighting constants.
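The reward and cost construction above can be summarized in a short sketch. Prices and flow symbols follow Tables 2 and 3; the λ weights and the function signatures are illustrative assumptions, not the paper's implementation.

```python
def fcc_reward(F_gl, F_gs, F_ugo, F_oil=2438.0,
               p_gl=0.14, p_gs=0.132, p_ugo=0.088):
    """Immediate economic reward r_t [USD/min]: product revenue minus feed cost."""
    return p_gl * F_gl + p_gs * F_gs + p_ugo * F_ugo - p_ugo * F_oil

def fcc_cost(T_cy, T_ri0, T_ri1, F_sc, F_a, lam=(1.0,) * 8):
    """Cost signal c_t built from constraints (15)-(19) plus set-point deviations."""
    c = 0.0
    c += lam[0] * max(0.0, T_cy - 1000.0)      # cyclone temperature limit (17)
    c += lam[1] * max(0.0, T_ri0 - 1000.0)     # riser inlet temperature limit (18)
    c += lam[2] * max(0.0, T_ri1 - 1000.0)     # riser outlet temperature limit (19)
    c += lam[3] * max(0.0, F_sc - 24000.0)     # catalyst circulation upper bound (15)
    c += lam[4] * max(0.0, 6000.0 - F_sc)      # catalyst circulation lower bound (15)
    c += lam[5] * max(0.0, F_a - 3600.0)       # air flow upper bound (16)
    c += lam[6] * abs(T_ri1 - 782.8)           # deviation from riser set-point
    c += lam[7] * abs(T_cy - 999.5)            # deviation from cyclone set-point
    return c
```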

5. Simulation Results

5.1. Training Algorithms

We carry out comparative experiments between the primal-dual version of the standard DDPG [33] (PD3PG) and the Bayesian PD3PG (BPD3PG). Given that the application scenario involves a multi-dimensional continuous action space, the traditional Deep Q-Network (DQN) [34] and Deterministic Policy Gradient (DPG) [35] algorithms cannot address such challenges directly. As mentioned previously, we employ the primal-dual method to solve the CMDP problem. Although traditional reinforcement learning algorithms such as DDPG can also attempt to solve this problem, the constraints present in practical applications make their direct application infeasible.
Our network is based on the actor–critic architecture of DDPG, with two modifications made through the primal-dual method. First, we add a cost critic alongside the existing reward critic to estimate the expected cumulative cost. Second, the actor is updated to maximize the reward Q-value while accounting for the cost Q-value through the Lagrangian term. Regarding the implementation details, we constructed six deep neural networks (DNNs) for the online and target actors, reward critics, and cost critics. To ensure the comparability and reproducibility of the experiments, PD3PG and BPD3PG adopt identical structural configurations (more details of the parameters are given in Table 4 and Table 5 and Algorithm 1). We employ $\alpha$-Bayesian neural networks ($\alpha$-BNNs) as the Q-value functions. Following the aforementioned loss calculation strategy, our goal is to accurately capture the uncertainty estimates of the Q-value functions. The fitting target for the $\alpha$-BNN Q-function is
$$y_t = x_t + \gamma\, Z^{X\prime}\!\left(s_{t+1}, \pi_{\phi'}(s_{t+1})\right), \quad x_t \in \{r_t, c_t\}$$
where $\pi_{\phi'}$ is the target policy and $Z^{X\prime}$ is the corresponding target critic network adjusted through a soft update mechanism. We update the BPD3PG actor using the MC dropout posterior mean of the Bayesian value function distribution in place of a deterministic estimate:
$$\nabla_\phi J_{\pi_\phi} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\phi \pi_\phi(s)\, \nabla_a G_{\pi_\phi, I}(s, a; \omega_m)\Big|_{s=s_t,\, a=\pi_\phi(s_t)} \qquad (29)$$
where $G_{\pi_\phi, I}(s, a; \omega) = Z_\pi^R(s, a; \omega) - I\, Z_\pi^C(s, a; \omega)$. Similarly, the Lagrangian multiplier $I$ is updated by minimizing the dual loss $J_I$, whose sampled gradient is
$$\nabla_I J_I \approx -\frac{1}{M} \sum_{m=1}^{M} \left(Z_\pi^C(s, a; \omega_m) - d\right)\Big|_{s=s_t,\, a=\pi_\phi(s_t)} \qquad (30)$$
so that descending this gradient increases $I$ whenever the estimated cumulative cost exceeds the threshold $d$, with $I$ projected back onto $I \ge 0$.
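The two updates can be sketched as follows, with the M MC-dropout evaluations of the reward and cost critics supplied as arrays. The function names and interfaces are illustrative, not the exact implementation.

```python
import numpy as np

def lagrangian_q_posterior_mean(q_r_samples, q_c_samples, lam):
    """Posterior mean of G = Z^R - I * Z^C over M MC-dropout weight samples.
    The actor ascends the gradient of this quantity with respect to its action."""
    return float(np.mean(np.asarray(q_r_samples) - lam * np.asarray(q_c_samples)))

def update_multiplier(lam, q_c_samples, d, lr=1e-3):
    """Dual update corresponding to (30): raise I when the estimated cumulative
    cost exceeds the threshold d, then project back onto I >= 0."""
    violation = float(np.mean(q_c_samples)) - d
    return max(0.0, lam + lr * violation)
```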
The training processes of BPD3PG and PD3PG are compared in Figure 4, where one observes that BPD3PG outperforms PD3PG with a higher reward and a lower cost. Furthermore, as seen in Figure 4, by fitting the $\alpha$-BNN value function and using the posterior mean in the policy update, BPD3PG demonstrates faster convergence and improved stability, particularly in disturbed settings. This improvement can be attributed to the stronger exploration ability of BPD3PG when faced with high uncertainty during the early stages of learning.
Algorithm 1 BPD3PG
Input: Initial networks $Z_\pi^R$, $Z_\pi^C$, $\pi_\phi$
Input: Target parameters: $\{Z_\pi^{R\prime}, Z_\pi^{C\prime}, \pi_{\phi'}\} \leftarrow \{Z_\pi^R, Z_\pi^C, \pi_\phi\}$
Input: Initial replay buffer $\mathcal{X}$ and Lagrangian multiplier $I$
  1:   for each episode do
  2:         for each time step do
  3:               $a_t = \pi_\phi(s_t)$
  4:               $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; observe $r_t$ and $c_t$
  5:               $\mathcal{X} \leftarrow \mathcal{X} \cup \{(s_t, a_t, r_t, c_t, s_{t+1})\}$
  6:         end for
  7:         for each gradient step do
  8:               Sample a mini-batch of experience from the replay buffer $\mathcal{X}$
  9:               Update the reward and cost critic networks by minimizing the loss function (14)
10:              Update the Lagrangian multiplier using (30)
11:              Update the actor network using (29)
12:              Update the target networks via soft update [33]
13:        end for
14:  end for
Output: $Z_\pi^R$, $Z_\pi^C$, $\pi_\phi$, $I$
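For orientation, the loop in Algorithm 1 can be written as the following Python skeleton (reusing update_multiplier from the earlier sketch). All objects (env, actor, critics, buffer) and their methods are placeholders for the reader's own implementation; only the control flow mirrors the algorithm.

```python
def train_bpd3pg(env, actor, reward_critic, cost_critic, buffer,
                 episodes=2000, gamma=0.99, d=0.0, lam=1.0):
    """Skeleton of Algorithm 1 (BPD3PG); interfaces are assumed, not prescribed."""
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                                   # data-collection phase
            a = actor.act(s)                              # a_t = pi_phi(s_t)
            s_next, r, c, done = env.step(a)
            buffer.add(s, a, r, c, s_next)
            s = s_next
        for _ in range(buffer.gradient_steps):            # learning phase
            batch = buffer.sample()
            reward_critic.update(batch, gamma)            # alpha-BNN loss, Eq. (14)
            cost_critic.update(batch, gamma)
            lam = update_multiplier(lam,                  # dual step, Eq. (30)
                                    cost_critic.mc_q(batch), d)
            actor.update(batch, reward_critic,            # primal step, Eq. (29)
                         cost_critic, lam)
            actor.soft_update_targets()                   # soft target update [33]
    return actor, lam
```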

5.2. Experimental Results

As shown in Figure 3, the traditional control structure is based on the Hicks framework, which uses the pairings $F_{sc}$–$T_{ri1}$ and $F_a$–$T_{cy}$. According to [13], this pairing provides optimal controllability. In this study, the RL agent is additionally configured to optimize the control signals, as shown in Figure 3. As an active constraint variable, $T_{cy}$ requires special attention in both deep reinforcement learning control schemes.
The simulation experiments are arranged as follows: Initially, the FCC process operates under nominal conditions. To simulate the parameter fluctuations typically encountered in industrial processes, two disturbance scenarios, d 1 and d 2 , are introduced, representing random variations in key model parameters within ± 20 % of their nominal values.
Specifically, the parameter vector $[k_o, k_c, k_{com}, \sigma_2, h_1, h_2, E_{cb}/R]^T$ (refer to Table 1) under nominal conditions is
$$d_0 = [962{,}000,\ 0.01897,\ 29.338,\ 0.006244,\ 521{,}150,\ 245,\ 158.6]^T$$
At $t = 1000$ min, a disturbance scenario
$$d_1 = [743{,}903.8231,\ 0.01666,\ 31.5,\ 0.0070,\ 497{,}882,\ 274.8,\ 162.6]^T$$
is introduced smoothly via a 500 min ramp function. At $t = 3000$ min, the scenario switches to another value
$$d_2 = [995{,}340.7075,\ 0.01663,\ 28.8,\ 0.00665,\ 537{,}683,\ 196.8,\ 163.5]^T$$
We compared the performance of two SRL-based control methods, BPD3PG and PD3PG, under both undisturbed and disturbed conditions:
  • Undisturbed condition: The system operates in an ideal control environment. The manipulated variables $F_{sc}$ and $F_a$ are precisely regulated by the controller, and the key parameters change according to disturbance scenarios $d_1$ and $d_2$, but no random disturbances or measurement noises are introduced.
  • Disturbed condition: In addition to the key parameter changes following disturbance scenarios $d_1$ and $d_2$, the system is also subject to random disturbances. Specifically, the manipulated variables become $F_{sc} + \delta_{sc}$ and $F_a + \delta_a$, and the measured variables become $T_{cy} + \eta_{cy}$ and $T_{ri1} + \eta_{ri1}$, where $\delta \sim \mathcal{N}(0, 20)$ and the measurement noise power is $\eta = 0.1^2$.
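A small helper for the disturbed scenario might look as follows. This is illustrative only: the parameter 20 is treated here as a variance and the measurement noise power $0.1^2$ as a variance as well, since the text does not make the interpretation explicit.

```python
import numpy as np

rng = np.random.default_rng(42)

def disturb(F_sc, F_a, T_cy, T_ri1, act_var=20.0, meas_power=0.1 ** 2):
    """Add actuator disturbances and measurement noise for the disturbed scenario."""
    F_sc_d = F_sc + rng.normal(0.0, np.sqrt(act_var))       # delta_sc ~ N(0, 20)
    F_a_d = F_a + rng.normal(0.0, np.sqrt(act_var))         # delta_a  ~ N(0, 20)
    T_cy_m = T_cy + rng.normal(0.0, np.sqrt(meas_power))    # eta_cy
    T_ri1_m = T_ri1 + rng.normal(0.0, np.sqrt(meas_power))  # eta_ri1
    return F_sc_d, F_a_d, T_cy_m, T_ri1_m
```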
The dynamic tracking performances are shown in Figure 5 and Figure 6, respectively. Overall, both controllers track their respective controlled variables reasonably well under the influence of disturbances. Moreover, the active constraint $T_{cy}$ remains close to 999.5 K for most of the time, effectively meeting the requirement $T_{cy} \le 1000$ K. An exception occurs around 3000 min, when the introduction of disturbance $d_2$ causes $T_{cy}$ to exceed the limit significantly. However, compared to PD3PG, BPD3PG exhibits greater robustness: through continuous interaction with the environment, trial and error, and feedback learning, BPD3PG returns $T_{cy}$ to within the feasible range shortly after 3000 min. In the other control loop, the reactor temperature $T_{ri1}$ is stabilized near the set-point of 782.8 K by manipulating $F_{sc}$.
In terms of economic performance, Figure 7 shows the profit function $J$ under both disturbed and undisturbed conditions for the two methods using the Hicks control structure. The BPD3PG method generally achieves higher economic profits, especially when disturbances are present, highlighting its superior robustness. Under nominal conditions (0–1000 min), the economic profits of both methods are nearly identical (the nominal optimum), which is expected because no particular adaptation is needed in either case. However, in the $d_1$ scenario (1000–3000 min), the profit of BPD3PG ($J_{undisturbed}^{BPD3PG} = 42.11$ / $J_{disturbed}^{BPD3PG} = 42.05$) is significantly higher than that of PD3PG ($J_{undisturbed}^{PD3PG} = 41.87$ / $J_{disturbed}^{PD3PG} = 41.74$). For the $d_2$ scenario, the final settlement results show that the economic profit of BPD3PG ($J_{undisturbed}^{BPD3PG} = 49.31$ / $J_{disturbed}^{BPD3PG} = 49.33$) consistently outperforms PD3PG ($J_{undisturbed}^{PD3PG} = 48.94$ / $J_{disturbed}^{PD3PG} = 48.44$). Moreover, compared with the Hicks control structure implemented with PI controllers, our BPD3PG method achieved notably higher profits ($J_{disturbed}^{Hicks} = 48.96$ / $J_{disturbed}^{BPD3PG} = 49.33$), demonstrating the economic advantages of our approach.
PD3PG shows a larger profit difference between disturbed and undisturbed conditions, indicating weaker robustness. In contrast, the profit difference for BPD3PG between noisy and undisturbed conditions is smaller, with some fluctuation, and, overall, BPD3PG consistently outperforms both PD3PG and the traditional control approach (Table 6).

6. Conclusions

This paper presented a Bayesian deep reinforcement learning method for the optimization of chemical processes, called the Bayesian Primal-Dual Deep Deterministic Policy Gradient (BPD3PG) method, which was successfully applied to a fluid catalytic cracking (FCC) unit. BPD3PG employs Bayesian neural networks for value function approximation, which effectively captures the uncertainties in FCC processes and alleviates the value-function overestimation problem inherent in traditional deterministic deep reinforcement learning. Furthermore, by utilizing the primal-dual method to handle process constraints, the approach ensures the operational feasibility of the system.
The simulation experiments on the FCC unit validated the superior performance of the BPD3PG method. The BPD3PG significantly outperformed traditional PI controllers, as well as deterministic policy gradient methods, achieving higher economic profits (Table 6) and maintaining more stable process control performance, especially when facing significant disturbance scenarios that could potentially cause controller overreaction.

Author Contributions

Software, J.Q. and L.Y.; Data curation, J.Z. and J.J.; Writing—review & editing, J.Q. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62373147), Zhejiang Provincial Natural Science Foundation of China (LY24F030007), Huzhou Key Laboratory of Intelligent Sensing and Optimal Control for Industrial Systems (2022-17), and Postgraduate Research and Innovation Project of Huzhou University (2025KYCX83).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bloor, M.; Ahmed, A.; Kotecha, N.; Mercangöz, M.; Tsay, C.; Chanona, E.A.D.R. Control-Informed Reinforcement Learning for Chemical Processes. Ind. Eng. Chem. Res. 2025, 64, 4966–4978. [Google Scholar] [CrossRef] [PubMed]
  2. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed Machine Learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
  3. Byun, H.E.; Kim, B.; Lee, J.H. Embedding Active Learning in Batch-to-batch Optimization Using Reinforcement Learning. Automatica 2023, 157, 111260. [Google Scholar] [CrossRef]
  4. Oh, T.H.; Park, H.M.; Kim, J.W.; Lee, J.M. Integration of reinforcement learning and model predictive control to optimize semi-batch bioreactor. AIChE J. 2022, 68, e17658. [Google Scholar] [CrossRef]
  5. Syauqi, A.; Kim, H.; Lim, H. Optimizing Olefin Purification: An Artificial Intelligence-Based Process-Conscious PI Controller Tuning for Double Dividing Wall Column Distillation. Chem. Eng. J. 2024, 500, 156645. [Google Scholar] [CrossRef]
  6. Petukhov, A.N.; Shablykin, D.N.; Trubyanov, M.M.; Atlaskin, A.A.; Zarubin, D.M.; Vorotyntsev, A.V.; Stepanova, E.A.; Smorodin, K.A.; Kazarina, O.V.; Petukhova, A.N.; et al. A hybrid batch distillation/membrane process for high purification part 2: Removing of heavy impurities from xenon extracted from natural gas. Sep. Purif. Technol. 2022, 294, 121230. [Google Scholar] [CrossRef]
  7. Singh, V.; Kodamana, H. Reinforcement Learning Based Control of Batch Polymerisation Processes. IFAC-PapersOnLine 2020, 53, 667–672. [Google Scholar] [CrossRef]
  8. Hartlieb, M. Photo-iniferter RAFT polymerization. Macromol. Rapid Commun. 2022, 43, 2100514. [Google Scholar] [CrossRef]
  9. Chen, J.; Wang, F. Cost Reduction of CO2 Capture Processes Using Reinforcement Learning Based Iterative Design: A Pilot-Scale Absorption–stripping System. Sep. Purif. Technol. 2013, 122, 149–158. [Google Scholar] [CrossRef]
  10. Perera, A.T.D.; Wickramasinghe, P.U.; Nik, V.M.; Scartezzini, J.L. Introducing Reinforcement Learning to the Energy System Design Process. Appl. Energy 2020, 262, 114580. [Google Scholar] [CrossRef]
  11. Sachio, S.; Mowbray, M.; Papathanasiou, M.M.; del Rio-Chanona, E.A.; Petsagkourakis, P. Integrating Process Design and Control Using Reinforcement Learning. Chem. Eng. Res. Des. 2021, 183, 160–169. [Google Scholar] [CrossRef]
  12. Kim, S.; Jang, M.G.; Kim, J.K. Process Design and Optimization of Single Mixed-Refrigerant Processes with the Application of Deep Reinforcement Learning. Appl. Therm. Eng. 2023, 223, 120038. [Google Scholar] [CrossRef]
  13. Hicks, R.; Worrell, G.; Durney, R. Atlantic seeks improved control; studies analog-digital models. Oil Gas J. 1966, 24, 97. [Google Scholar]
  14. Boum, A.T.; Latifi, A.; Corriou, J.P. Model predictive control of a fluid catalytic cracking unit. In Proceedings of the 2013 International Conference on Process Control (PC), Strbske Pleso, Slovakia, 18–21 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 335–340. [Google Scholar]
  15. Skogestad, S. Plantwide control: The search for the self-optimizing control structure. J. Process Control 2000, 10, 487–507. [Google Scholar] [CrossRef]
  16. Ye, L.; Cao, Y.; Yuan, X. Global approximation of self-optimizing controlled variables with average loss minimization. Ind. Eng. Chem. Res. 2015, 54, 12040–12053. [Google Scholar] [CrossRef]
  17. Altman, E. Constrained Markov Decision Processes; Chapman and Hall/CRC: Boca Raton, FL, USA, 1999. [Google Scholar]
  18. Ji, J.; Zhou, J.; Zhang, B.; Dai, J.; Pan, X.; Sun, R.; Huang, W.; Geng, Y.; Liu, M.; Yang, Y. OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research. J. Mach. Learn. Res. 2024, 25, 1–6. [Google Scholar]
  19. Yoo, H.; Kim, B.; Kim, J.W.; Lee, J.H. Reinforcement Learning Based Optimal Control of Batch Processes Using Monte-Carlo Deep Deterministic Policy Gradient with Phase Segmentation. Comput. Chem. Eng. 2021, 144, 107133. [Google Scholar] [CrossRef]
  20. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  21. Neal, R.M. Bayesian Learning for Neural Networks; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 118. [Google Scholar]
  22. Graves, A. Practical variational inference for neural networks. Adv. Neural Inf. Process. Syst. 2011, 24, 2348–2356. [Google Scholar]
  23. Minka, T.P. Expectation propagation for approximate Bayesian inference. arXiv 2013, arXiv:1301.2294. [Google Scholar]
  24. Gal, Y.; McAllister, R.; Rasmussen, C.E. Improving PILCO with Bayesian neural network dynamics models. In Proceedings of the Data-Efficient Machine Learning Workshop, ICML, New York, NY, USA, 24 June 2016; Volume 4, p. 25. [Google Scholar]
  25. Henderson, P.; Doan, T.; Islam, R.; Meger, D. Bayesian Policy Gradients via Alpha Divergence Dropout Inference. In Proceedings of the NIPS Bayesian Deep Learning Workshop, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Li, Y.; Gal, Y. Dropout Inference in Bayesian Neural Networks with Alpha-divergences. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2052–2061. [Google Scholar]
  27. Hernandez-Lobato, J.; Li, Y.; Rowland, M.; Bui, T.; Hernández-Lobato, D.; Turner, R. Black-box alpha divergence minimization. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1511–1520. [Google Scholar]
  28. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. In Predicting Structured Data; MIT Press: Cambridge, MA, USA, 2006; Volume 1. [Google Scholar]
  29. Liu, X.; Sun, S. Alpha-divergence Minimization with Mixed Variational Posterior for Bayesian Neural Networks and Its Robustness Against Adversarial Examples. Neurocomputing 2020, 423, 427–434. [Google Scholar] [CrossRef]
  30. Guan, H.; Ye, L.; Shen, F.; Song, Z. Economic Operation of a Fluid Catalytic Cracking Process Using Self-Optimizing Control and Reconfiguration. J. Taiwan Inst. Chem. Eng. 2019, 96, 104–113. [Google Scholar] [CrossRef]
  31. Loeblein, C.; Perkins, J. Structural design for on-line process optimization: II. Application to a simulated FCC. AIChE J. 1999, 45, 1030–1040. [Google Scholar] [CrossRef]
  32. Hovd, M.; Skogestad, S. Procedure for regulatory control structure selection with application to the FCC process. AIChE J. 1993, 39, 1938–1953. [Google Scholar] [CrossRef]
  33. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  35. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 387–395. [Google Scholar]
Figure 1. (1) Traditional neural network; (2) Bayesian neural network.
Figure 2. An illustration of approximating distributions by α-divergence minimization. Here, p and q shown in the graphs are un-normalized probability densities.
Figure 3. The fluid catalytic cracking unit with RL agent.
Figure 4. Training curves of BPD3PG and PD3PG. The shaded region represents half a standard deviation of the average evaluation over 5 trials; curves are smoothed uniformly for visual clarity.
Figure 5. Optimal trajectories after RL training and dynamic simulation of the Hicks control structure: loop 1 ($F_a$–$T_{cy}$) and loop 2 ($F_{sc}$–$T_{ri1}$) (undisturbed).
Figure 6. Optimal trajectories after RL training and dynamic simulation of the Hicks control structure: loop 1 ($F_a$–$T_{cy}$) and loop 2 ($F_{sc}$–$T_{ri1}$) (disturbed).
Figure 7. Dynamic trajectories of the economic function J (disturbed and undisturbed).
Table 1. Model parameters of the FCC unit.
Variable | Description | Value and Unit
E_cb | Activation energy for coke burning reaction | 158.6 kJ/mol
F_oil | Mass flow rate of gas oil feed | 2438.0 kg/min
F_gl | Gasoline yield factor of catalyst | 1.0 kg/min
k_c | Rate constant for catalytic coke formation | 0.01897 s^0.5
k_com | Rate constant for coke burning | 29.338 min^-1
k_o | Rate constant for gas oil cracking | 962,000 s^-1
T_a | Temperature of air to regenerator | 320.0 K
T_oil | Temperature of gas oil feed | 420.0 K
σ_2 | CO2/CO dependence on the temperature | 0.006244 K^-1
h_1, h_2 | Parameters for approximating ΔH | 521,150.0, 245.0
Table 2. Process variables of the FCC unit.
Variable | Description | Unit
F_a | Mass flow rate of air to regenerator | kg/min
F_sc | Mass flow rate of spent catalyst | kg/min
T_rg | Temperature of catalyst in regenerator dense bed | K
T_cy | Temperature of cyclone | K
T_ri0 | Temperature of catalyst and gas oil mixture at riser inlet | K
T_ri1 | Temperature of catalyst and gas oil mixture at riser outlet | K
y_f1 | Weight fraction of gas oil in product | -
y_gl | Weight fraction of gasoline in product | -
Table 3. Material prices for FCC components.
Price | Component | Value
p_gl | gasoline | 0.14 USD/kg
p_gs | light gases | 0.132 USD/kg
p_ugo | unconverted gas oil | 0.088 USD/kg
Table 4. Hyper-parameters of the proposed BPD3PG and PD3PG.
Parameter | Value
Hidden layers | 64-128-64
Batch size | 64
Time steps | 9 × 10^6
Episodes | 2000
MC samples (only for BPD3PG) | 50
Dropout rate (only for BPD3PG) | 0.995
Actor learning rate | 0.0001
Reward/cost critic learning rate | 0.0001
Discount factor | 0.99
α-divergence (only for BPD3PG) | 0.95
KeepProp (only for BPD3PG) | 0.95
β (only for BPD3PG) | 0.92
Table 5. Hardware details.
Parameter | Version
Computer | Windows 10
CPU | i5-12400F 2.50 GHz
RAM | 32.0 GB
GPU | NVIDIA GeForce RTX 4060 Ti
TensorFlow | 2.2.0
Python | 3.8
Table 6. Performance improvement of BPD3PG in the final settlement results.
Comparison | Condition | J (USD/min) | Improvement (%)
BPD3PG vs. Hicks | Undisturbed | 49.31 vs. 49.11 | 0.41% ↑
BPD3PG vs. Hicks | Disturbed | 49.33 vs. 48.96 | 0.76% ↑
BPD3PG vs. PD3PG | Undisturbed | 49.31 vs. 48.94 | 0.76% ↑
BPD3PG vs. PD3PG | Disturbed | 49.33 vs. 48.44 | 1.84% ↑

Qin, J.; Ye, L.; Zheng, J.; Jin, J. Bayesian Deep Reinforcement Learning for Operational Optimization of a Fluid Catalytic Cracking Unit. Processes 2025, 13, 1352. https://doi.org/10.3390/pr13051352
