Article

Bayesian Deep Reinforcement Learning for Operational Optimization of a Fluid Catalytic Cracking Unit

1 Huzhou Key Laboratory of Intelligent Sensing and Optimal Control for Industrial Systems, School of Engineering, Huzhou University, Huzhou 313000, China
2 Zhejiang Key Laboratory for Industrial Solid Waste Thermal Hydrolysis Technology and Intelligent Equipment, Huzhou University, Huzhou 313000, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(5), 1352; https://doi.org/10.3390/pr13051352
Submission received: 22 March 2025 / Revised: 23 April 2025 / Accepted: 26 April 2025 / Published: 28 April 2025
(This article belongs to the Special Issue Machine Learning Optimization of Chemical Processes)

Abstract

Emerging machine learning techniques provide great opportunities for the optimal operation of chemical systems. This paper presents a Bayesian deep reinforcement learning method for the optimization of a fluid catalytic cracking (FCC) unit, a key process in the petroleum refining industry. Unlike traditional reinforcement learning (RL) methods that use deterministic network weights, Bayesian neural networks are incorporated to represent the RL agent. The Bayesian treatment is integrated with the primal-dual method to handle the process constraints. Simulated experiments on the FCC unit show that the proposed algorithm achieves more stable control performance and higher economic profits, especially under parameter fluctuations and external disturbances.

1. Introduction

In chemical processes, traditional process control primarily relies on feedback control techniques, typically Proportional–Integral–Derivative (PID) controllers and Model Predictive Control (MPC) [1]. These methods are widely used owing to their concise structure, strong interpretability, and extensive tuning methods. However, for complex industrial processes, such as those with significant time delays and changing dynamics, they often yield suboptimal results in highly uncertain environments.
The rapid development of machine learning technologies offers new solutions for process control. Being data-driven, machine learning methods can automatically extract key patterns from massive process data and build more accurate prediction and decision models. Broadly, machine learning paradigms encompass supervised learning, unsupervised learning, and reinforcement learning (RL) [2]. For decision-making problems, RL has demonstrated significant potential for operating complex chemical processes, such as batch reactor temperature control [3,4], distillation column product purity regulation [5,6], and polymer reactor quality control [7,8]. Moreover, RL has been employed for process design, optimizing unit operation arrangements and evaluating improved process schemes through iteration; see the recent literature on absorption-stripping processes [9], energy system design [10], unit operation design [11], and separation processes [12]. However, it is worth noting that the use of RL for static process design remains controversial in some recent scientific discussions.
Safe reinforcement learning (SRL) is the branch of RL that deals with constrained problems. Operational constraints are common in the chemical industry, and SRL approaches have therefore received notable attention. Like common RL, however, the existing SRL methods face challenges that severely restrict their widespread application in high-risk, highly uncertain chemical processes. Firstly, deterministic neural networks are employed in the agents, which cannot accurately quantify system uncertainties stemming from, for example, parameter fluctuations and external disturbances. Secondly, only point estimates are typically given, which tend to ignore potential risks owing to the stochastic nature of the uncertainties. Consequently, the robustness of the RL agent is unsatisfactory.
Bayesian reinforcement learning (BRL) has significant advantages in decision-making for uncertain systems. Based on this understanding, this research proposes a new Bayesian deep safe reinforcement learning framework for optimizing complex chemical processes, with application to a fluid catalytic cracking (FCC) unit. FCC is a core unit in the petroleum refining industry, which exhibits highly nonlinear characteristics, strong multi-variable coupling, and significant parameter uncertainties. Although the existing temperature control structures [13] and recently developed advanced control and optimization approaches [14,15,16] have achieved some good results for FCC problems, these solutions may still not be optimal in the presence of model uncertainties and time-varying system characteristics. In this paper, we rely on the machine learning methodologies and make the following contributions:
  • We propose a BRL method for the operation problem of the FCC unit. Unlike traditional RL methods that employ deterministic networks, we utilize Bayesian neural networks (BNNs) to represent the RL agent, effectively capturing the uncertainties in FCC.
  • We adopt a primal-dual method to handle the process constraints, ensuring the optimality of the control policy while satisfying the safety requirements, which is new within the BRL framework.
  • Extensive simulations are conducted to investigate the dynamic responses of FCC under different operating conditions. Compared with traditional deterministic gradient methods, the proposed approach achieved improved economic profits and more stable control performance.
The remainder of this paper is structured as follows: Section 2 introduces the Constrained Markov Decision Process (CMDP); Section 3 elaborates on the proposed Bayesian SRL method; Section 4 presents the operation problem of FCC and the implementation details of the SRL scheme; Section 5 describes the results, and Section 6 concludes this paper.

2. Constrained Markov Decision Process

In this paper, we perform operational optimization of chemical processes using SRL. The decision-maker, referred to as the agent in SRL, interacts with the environment by outputting actions according to some rule (the policy), such that the received cumulative rewards are maximized while the process constraints are satisfied (cumulative costs are kept below a threshold). SRL is typically tackled within the theoretical framework of the Constrained Markov Decision Process (CMDP) [17], defined by the state space $\mathcal{S}$, the action space $\mathcal{A}$, the state transition probability $P(s_{t+1} \mid s_t, a_t)$ denoting the probability of transiting from state $s_t$ to $s_{t+1}$ given action $a_t$, the reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, the cost function $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ representing the price for the constraint, the safety threshold $d$, and the discount factor $\gamma \in (0, 1]$.
Suppose that the agent starts at an initial state $s_0$ drawn from an initial-state distribution. At each time step $t$, it observes the current state $s_t$ and selects an action $a_t \sim \pi(\cdot \mid s_t)$ according to the policy $\pi$. The agent receives an immediate reward $r_t = r(s_t, a_t)$ and an immediate cost $c_t = c(s_t, a_t)$. After executing the action, the environment transitions to a new state $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. This interaction procedure continues and generates a trajectory $\mathcal{T} = \{s_0, a_0, r_0, c_0, s_1, a_1, r_1, c_1, \ldots\}$. The agent's cumulative discounted reward and cost are $Z_\pi^R = \mathbb{E}_{\mathcal{T} \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ and $Z_\pi^C = \mathbb{E}_{\mathcal{T} \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t c_t\right]$, respectively.
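For readers who prefer code, the following minimal NumPy sketch rolls out a policy in a CMDP-style environment and evaluates the discounted cumulative reward and cost defined above. The env and policy interfaces (reset, step returning reward and cost) are assumptions for illustration only, not the implementation used in this paper.

```python
import numpy as np

def discounted_returns(rewards, costs, gamma=0.99):
    """Discounted cumulative reward Z^R and cost Z^C of a single trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ np.asarray(rewards)), float(discounts @ np.asarray(costs))

def rollout(env, policy, horizon=200, gamma=0.99):
    """Generate one trajectory {s_0, a_0, r_0, c_0, ...} and return (Z^R, Z^C).

    Assumed interface: env.reset() -> s, env.step(a) -> (s', r, c, done),
    policy(s) -> a. These names are placeholders.
    """
    s = env.reset()
    rewards, costs = [], []
    for _ in range(horizon):
        a = policy(s)                 # a_t ~ pi(. | s_t)
        s, r, c, done = env.step(a)   # environment returns reward and cost signals
        rewards.append(r)
        costs.append(c)
        if done:
            break
    return discounted_returns(rewards, costs, gamma)
```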
The agent aims to learn an optimal safe policy $\pi^*$ that maximizes the expected cumulative reward $Z_\pi^R$ while satisfying the constraint $Z_\pi^C \le d$. This optimization problem can be formally expressed as
$$\pi^* = \arg\max_\pi Z_\pi^R \quad \text{s.t.} \quad Z_\pi^C \le d \qquad (1)$$
The CMDP problem (1) can typically be solved using the primal-dual method [18] by transforming it into the following dual form:
$$\pi^* = \arg\max_\pi \min_{I \ge 0} \; Z_\pi^R - I \left(Z_\pi^C - d\right) \qquad (2)$$
where $I \ge 0$ is the Lagrangian multiplier. The primal-dual method alternately performs gradient ascent on $\pi$ and gradient descent on $I$.
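Schematically, one alternating primal-dual iteration can be sketched as follows (a sketch under the assumption that gradient estimates of $Z_\pi^R$ and $Z_\pi^C$ with respect to the policy parameters are available; it is not the exact update used later in the paper):

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, z_c_hat, d,
                     lr_theta=1e-3, lr_lam=1e-2):
    """One alternating primal-dual update for the Lagrangian Z^R - I * (Z^C - d).

    theta       : policy parameters
    lam         : Lagrangian multiplier I >= 0
    grad_reward : estimated gradient of Z^R w.r.t. theta
    grad_cost   : estimated gradient of Z^C w.r.t. theta
    z_c_hat, d  : estimated cumulative cost and the safety threshold
    """
    # Primal step: gradient ascent on the Lagrangian with respect to the policy.
    theta = theta + lr_theta * (np.asarray(grad_reward) - lam * np.asarray(grad_cost))
    # Dual step: gradient descent with respect to I, projected back onto I >= 0,
    # so I grows while the constraint Z^C <= d is violated.
    lam = max(0.0, lam + lr_lam * (z_c_hat - d))
    return theta, lam
```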

3. Bayesian Primal-Dual Deep Deterministic Policy Gradient

The core of the actor–critic RL method lies in evaluating policy effectiveness through value functions. Value functions quantify the long-term rewards obtained from specific actions in certain states, primarily including the state value function $V(s)$ and the state–action value function $Q(s, a)$, and are considered effective for problems with continuous state and action spaces, such as chemical process control. However, applying these methods to FCC units presents several challenges. First, in uncertain environments, the value functions (critic) may provide inaccurate feedback, causing the actor to update in the wrong direction and leading to potentially low-reward samples [19]. While there might be a probability of achieving higher rewards, this often comes at the cost of constraint violations, which is absolutely unacceptable in actual chemical production. Furthermore, chemical reactions typically have irreversible characteristics, meaning that early incorrect decisions may lead to irreparable consequences. Therefore, building a value network that can accurately evaluate the environment becomes particularly important.
Given the advantages of Bayesian deep learning in handling uncertainties, we propose a natural extension: using BNNs as function approximators to model the complex nonlinear relationships between states and values [20].

3.1. Bayesian Neural Network

Neural networks define a mapping function $f_\omega: x \mapsto y$ from input space to output space, where $\omega$ are the parameterized weights. In traditional neural networks, $\omega$ are deterministic parameters, whereas the BNN employs probabilistic weights to quantify prediction uncertainty. As shown in Figure 1, $\omega$ in BNNs are random variables rather than fixed values. In SRL, the input $x$ corresponds to state–action pairs $(s, a)$ and the output $y$ corresponds to Q-values $Z$; thus, our mapping function can be represented as $f_\omega: (s, a) \mapsto Z$. Given the dataset $\mathcal{D}$ (the evidence), we first specify a prior distribution for the weights (typically Gaussian) and then perform Bayesian inference based on the likelihood function $p(Z \mid s, a, \omega)$ to obtain the posterior distribution of the weights:
$$p(\omega \mid \mathcal{D}) = \frac{p(Z \mid s, a; \omega)\, p(\omega)}{\int p(Z \mid s, a; \omega)\, p(\omega)\, d\omega}$$
Computing the posterior distribution p ( ω | D ) of BNN is a challenging problem. Various approximate inference methods have been proposed, including MCMC sampling [21], variational inference [22], expectation propagation [23], and Monte Carlo dropout approximation [24]. Notably, dropout has demonstrated excellent performance as a variational Bayesian method across multiple tasks, from classification to active learning, and will be deployed in this paper.
Based on the obtained posterior distribution, the expected Q-value $\mathbb{E}[Z(s, a)]$ for a new state–action pair $(s, a)$ can be estimated through MC sampling:
$$\mathbb{E}[Z(s, a)] = \int f_\omega(s, a)\, p(\omega \mid \mathcal{D})\, d\omega \approx \frac{1}{M} \sum_{m=1}^{M} f_{\omega_m}(s, a)$$
where $M$ is the number of samples and $\omega_m \sim p(\omega \mid \mathcal{D})$.
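In practice, the integral is replaced by averaging stochastic forward passes of the value network with dropout kept active, one pass per weight sample. A minimal sketch is given below; the q_forward callable is an assumed stand-in for the dropout critic, not a function from any particular library.

```python
import numpy as np

def mc_dropout_q(q_forward, s, a, n_samples=50):
    """Estimate E[Z(s, a)] and its spread from M stochastic forward passes.

    q_forward(s, a) must evaluate the value network with dropout active, so each
    call corresponds to one weight sample omega_m ~ p(omega | D).
    """
    samples = np.array([q_forward(s, a) for _ in range(n_samples)])
    return samples.mean(), samples.std()   # posterior mean and uncertainty estimate
```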

3.2. Variational Inference in BNN with α -Divergences

To simplify the computation of the posterior distribution $p(\omega \mid \mathcal{D})$, variational inference introduces a variational distribution $q_\theta(\omega)$ as an approximation of $p(\omega \mid \mathcal{D})$. $q_\theta(\omega)$ is parameterized by $\theta$ and typically takes the form of a Gaussian distribution, $q_\theta(\omega) = \mathcal{N}(\omega \mid \mu, \sigma)$ with $\theta = \{\mu, \sigma\}$. A well-known approximation method is to minimize the Kullback–Leibler (KL) divergence between these two distributions as follows:
$$\begin{aligned}
\arg\min_\theta \mathrm{KL}\left[q_\theta(\omega) \,\|\, p(\omega \mid \mathcal{D})\right] &= \arg\min_\theta \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid \mathcal{D})}\, d\omega \\
&= \arg\min_\theta \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega)\, p(\mathcal{D} \mid \omega) / p(\mathcal{D})}\, d\omega \\
&= \arg\min_\theta \underbrace{\mathrm{KL}\left[q_\theta(\omega) \,\|\, p(\omega)\right] - \mathbb{E}_{q_\theta(\omega)}\left[\log p(\mathcal{D} \mid \omega)\right]}_{-\mathrm{ELBO}(q)} + \underbrace{\log p(\mathcal{D})}_{\text{const.}}
\end{aligned}$$
where the first bracketed term on the right-hand side of the last equality is the negative of the evidence lower bound (ELBO), so minimizing the KL divergence is equivalent to maximizing the ELBO.
Although the optimization method described above is theoretically simple and intuitive, it still faces significant computational complexity challenges in practical implementation. Dropout can serve as an alternative approach for Bayesian approximation [25]. Dropout can be interpreted as a particular form of variational inference that introduces noise into the feature space, which maps to uncertainty in the network parameter space. Specifically, during training, dropout randomly "drops" some neurons by applying stochastic masks to the network weights, thereby approximating the probability distribution over parameters. At test time, dropout is kept active and multiple stochastic forward passes are averaged to approximate Bayesian inference. This method can effectively capture model uncertainty while maintaining computational efficiency.
On the other hand, dropout variational inference tends to underestimate model uncertainty. This issue stems from the asymmetric penalty mechanism in KL divergence minimization: penalties are applied when the approximate posterior distribution q θ ( ω ) assigns non-zero values in regions where the true posterior distribution p ( ω | D ) is zero. However, no penalties are imposed when q θ ( ω ) is zero in regions where p ( ω | D ) has high probability.
To address this limitation, inspired by [26], we adopt the α -divergence as an alternative measure:
$$D_\alpha\left[p \,\|\, q\right] = \frac{1}{\alpha(1-\alpha)} \left(1 - \int p(\omega \mid \mathcal{D})^{\alpha}\, q_\theta(\omega)^{1-\alpha}\, d\omega\right)$$
As illustrated in Figure 2, the parameter α influences the characteristics of approximate distributions. In principle, the method is a generalization with the tunable parameter α . When α takes a large positive value, the approximate distribution q θ ( ω ) tends to encompass multiple modes of the target distribution p ( ω | D ) . Conversely, as  α approaches negative infinity (assuming finite divergence), q θ ( ω ) focuses on the dominant mode with the highest probability [27].
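This mass-covering versus mode-seeking behaviour can be checked numerically. The sketch below is illustrative only: the bimodal "posterior" and the two candidate approximations are fabricated for the demonstration. For $\alpha$ close to 1 the wide, mass-covering approximation yields the smaller divergence, whereas for negative $\alpha$ the narrow, single-mode approximation is strongly preferred.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def alpha_divergence(p, q, x, alpha):
    """Grid approximation of D_alpha[p||q] = (1 - int p^a q^(1-a) dx) / (a(1-a))."""
    dx = x[1] - x[0]
    integral = float(np.sum(p ** alpha * q ** (1.0 - alpha)) * dx)
    return (1.0 - integral) / (alpha * (1.0 - alpha))

x = np.linspace(-10.0, 10.0, 4001)
p = 0.5 * gaussian(x, -2.0, 0.7) + 0.5 * gaussian(x, 3.0, 0.7)  # bimodal "posterior"
q_wide = gaussian(x, 0.5, 3.0)      # mass-covering candidate
q_narrow = gaussian(x, 3.0, 0.7)    # mode-seeking candidate

for alpha in (0.95, -1.0):          # alpha = 0 and 1 are singular limits, skipped here
    print(alpha,
          alpha_divergence(p, q_wide, x, alpha),
          alpha_divergence(p, q_narrow, x, alpha))
```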
Several effective α -divergence minimization techniques, such as Black-Box α -Divergence Minimization (BB- α ) [27] and dropout BB- α [26], have reportedly achieved significant results in practical applications. Notably, BB- α , as a black-box method, can be directly applied to probability models with complex structures. This characteristic is particularly important in cases where traditional methods such as variational Bayes, expectation propagation (EP), and Power EP encounter difficulties when processing energy functions. First, let us review the traditional BB- α function.
$$\mathcal{L}_\alpha\left(q_\theta(\omega)\right) = -\frac{1}{\alpha} \sum_t \log \mathbb{E}_{\omega \sim q_\theta(\omega)}\left[\left(\frac{p(y_t \mid x_t; \omega)\, p_0(\omega)^{1/T}}{q_\theta(\omega)^{1/T}}\right)^{\alpha}\right] \qquad (9)$$
Due to the intractability of its general expectation form, and to improve computational efficiency and better adapt to gradient-based optimization, we approximate the expectation by drawing samples from the distribution $q_\theta(\omega)$ and reformulate the objective function (9) using the MC method:
$$\mathcal{L}_\alpha^{MC} = -\frac{1}{\alpha} \log \mathbb{E}_{\omega \sim q_\theta(\omega)}\left[\left(\frac{p(y \mid x; \omega)\, p_0(\omega)}{q_\theta(\omega)}\right)^{\alpha}\right]$$
$$\approx -\frac{1}{\alpha} \sum_t \log \mathbb{E}_{\omega \sim q_\theta(\omega)}\left[\exp\left(\alpha \log p(y_t \mid x_t; \omega) + \frac{\alpha}{T}\left[\log p_0(\omega) - \log q_\theta(\omega)\right]\right)\right]$$
$$\approx -\frac{1}{\alpha} \sum_t \log \sum_{m=1}^{M} \exp\left(\alpha \log p(y_t \mid x_t; \omega_m) + \frac{\alpha}{T}\left[\log p_0(\omega_m) - \log q_\theta(\omega_m)\right]\right)$$
Given a loss function $l(\cdot)$, we define the un-normalized likelihood term $p(y \mid x, \omega) \propto \exp\left(-l(y, f_\omega(x))\right)$ [28]. The term $\log p_0(\omega_m)$ represents the log prior, equivalent to an $\ell_1$ or $\ell_2$ regularization [29]. This yields the following minimization objective:
$$\mathcal{L}_\alpha^{MC} \approx -\frac{1}{\alpha} \sum_t \log \sum_{m=1}^{M} \exp\left(-\alpha\, l\left(y_t, f_{\omega_m}(x_t)\right) + \frac{\alpha}{T}\left[\log p_0(\omega_m) - \log q_\theta(\omega_m)\right]\right) \qquad (13)$$
The loss $l(y, f_\omega(x))$ can be the cross-entropy for classification tasks. In our value regression task, however, it is the scaled mean squared error $l(y, f_\omega(s, a)) = \frac{\beta}{2}\left\|y - f_\omega(s, a)\right\|_2^2$, with the corresponding likelihood $y \sim \mathcal{N}\left(y; f_\omega(s, a), \beta^{-1}\sigma\right)$ [26]. By reformulating the energy function (13), we derive a new objective function called $\alpha$-BNN, which approximates the marginal likelihood with high precision and is compatible with mainstream objective functions in deep learning.
The new optimization objective of α -BNN is
$$\mathcal{L}_\alpha^{MC}\left(q_\theta(\omega)\right) = -\frac{1}{\alpha} \sum_t \log \sum_{m=1}^{M} \exp\left(-\frac{\alpha \beta}{2}\left\|y_t - f_{\omega_m}(s, a)\right\|_2^2\right) + \frac{N D}{2} \log \beta + \sum_i \frac{p_i \left\|H_i\right\|_2^2}{2} \qquad (14)$$
where $\beta$ represents the precision parameter, the weight samples $\hat{\omega}_m \sim q_\theta(\omega)$ are obtained through masked dropout, and $\{f_{\omega_m}(s, a)\}_{m=1}^{M}$ is the set of forward propagations obtained by performing $M$ stochastic sampling passes on the input $(s, a)$. $D$ and $p_i$ denote the dropout rate and retention rate of the $i$-th layer, respectively, $N$ is the batch size, and $H$ represents the network parameters without dropout.
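A compact sketch of the data term of this MC objective for value regression is given below. The dropout/L2 regularization terms in (14), which depend on the network weights themselves, are omitted, and the shapes and hyper-parameter values are illustrative only.

```python
import numpy as np

def alpha_bnn_loss(y, preds, alpha=0.95, beta=0.92):
    """MC estimate of the alpha-BNN regression objective (data term of Eq. (14)).

    y     : (N,)   regression targets (e.g., Q-value targets)
    preds : (M, N) M stochastic dropout forward passes for the same mini-batch
    """
    energy = -alpha * 0.5 * beta * (preds - y[None, :]) ** 2   # (M, N)
    # Numerically stable log-sum-exp over the M weight samples, per data point.
    e_max = energy.max(axis=0)
    log_sum = e_max + np.log(np.exp(energy - e_max[None, :]).sum(axis=0))
    return float(-(1.0 / alpha) * log_sum.sum())

# Toy usage with fabricated numbers.
rng = np.random.default_rng(0)
y = rng.normal(size=64)
preds = y[None, :] + 0.1 * rng.normal(size=(50, 64))
print(alpha_bnn_loss(y, preds))
```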

4. The Fluid Catalytic Cracking Process

4.1. Process Descriptions

The FCC process consists of three stages: reaction, product separation, and catalyst regeneration (Figure 3). Crude oil mixes with a hot recycled catalyst from the regenerator on the feed side and enters the riser. The catalyst’s heat vaporizes the oil and enables cracking. In the riser, heavy oil cracks into lighter hydrocarbons, mainly gasoline, which are separated from the catalyst and sent to the fractionator. The multicomponent catalyst typically contains acid USHY zeolite, active alumina matrix, inert matrix (kaolin), binder, and additives. During the reaction, coke deposits deactivate the catalyst, so it is returned to the regenerator to burn off the coke and restore activity. In the regenerator, air is blown from the bottom to fluidize the catalyst, mixing it well with air in the dense bed. Makeup catalyst and withdrawal flows compensate for permanent catalyst losses. The regenerated catalyst is recycled to mix with fresh crude oil feed. A cyclone at the regenerator top separates and collects solid catalyst from the flue gas [30].
The core mechanism of catalytic cracking is the $\beta$-scission reaction, which generates olefins and new carbocations by cleaving carbon–carbon bonds at the $\beta$ position. During this process, carbocations undergo chain reactions with alkane molecules, continuously producing short-chain olefins. Typical reaction pathways include $\mathrm{C_{16}H_{34} \rightarrow C_8H_{18} + C_8H_{16}}$ and $\mathrm{C_8H_{18} \rightarrow C_4H_{10} + C_4H_8}$, among many others. Given the extreme complexity of the cracking reaction network, its intrinsic mechanism remains incompletely elucidated, making the construction of a comprehensive first-principles model extremely challenging. Currently, FCC models predominantly employ hybrid approaches combining reaction mechanisms with empirical rules, with lumped models being the most practical and widely used. This research adopts the three-lumped reactor model [30]. The model abstracts FCC reactions as transitions between three virtual lumped components, gas oil (F) cracking to generate gasoline (G) and light gases/coke (L), with the conversion pathways $F \rightarrow G$, $G \rightarrow L$, and $F \rightarrow L$.
The riser model is approximated as steady-state ordinary differential equations (quasi-steady state), while the regenerator is described by differential equations derived from material and energy balance relationships. For detailed derivation and explanation of these equations, readers are referred to Guan et al. [30]. Table 1 lists the material prices for FCC components used in economic profit calculations, while Table 2 and Table 3 list the key process variables and model parameters.
To facilitate the subsequent transformation of the FCC problem into the required reinforcement learning framework, the following key constraints and economic objective function need to be clearly defined.
The input capacity constraints are
$$6000~\mathrm{kg/min} \le F_{sc} \le 24{,}000~\mathrm{kg/min} \qquad (15)$$
$$0 \le F_a \le 3600~\mathrm{kg/min} \qquad (16)$$
The metallurgical limit for the cyclone temperature is
$$T_{cy} \le 1000~\mathrm{K} \qquad (17)$$
and the metallurgical limits for the riser inlet and outlet temperatures are
$$T_{ri0} \le 1000~\mathrm{K} \qquad (18)$$
$$T_{ri1} \le 1000~\mathrm{K} \qquad (19)$$
The economic profit function $J$ [USD/min] is defined as
$$J = p_{gl} F_{gl} + p_{gs} F_{gs} + p_{ugo} F_{ugo} - p_{ugo} F_{oil} \qquad (20)$$
and the overall operational objective can be formulated as solving the following problem:
$$\max_{F_{sc},\, F_a} \; J = p_{gl} F_{gl} + p_{gs} F_{gs} + p_{ugo} F_{ugo} - p_{ugo} F_{oil} \quad \text{s.t. constraints (15)–(19)} \qquad (21)$$
We consider the process parameters $[k_o, k_c, k_{com}, \sigma_2, h_1, h_2, E_{cb}/R]^T$ as uncertain (see Table 1 for nominal values). Under nominal conditions, nonlinear programming optimization yields the optimal inputs $F_{sc} = 17{,}189$ kg/min and $F_a = 1461.7$ kg/min, with a maximum profit of 44.96 USD/min. However, this solution only applies to nominal conditions. During actual operation, uncertain disturbances cause continuous changes in operating conditions, potentially making pre-determined schemes suboptimal or infeasible [30].

4.2. Formulating as a SRL Problem

The FCC process under consideration is described as a two-degree-of-freedom system ( F s c and F a ) [31,32], whose objective is to maximize the economic profit while satisfying constraints (15)–(19). Given the complex dynamic characteristics, multiple operational constraints, and clear economic objectives of the FCC process, it is handled by applying SRL in this study. As previously described, we formalize the FCC control problem as a CMDP, with its core components detailed below.
(1) Agent: The agent is the decision-making intelligence that achieves optimal control under complex operating conditions through continuous interactive learning. During each sampling period $t$, it acquires key state parameters from the sensors (observations). The interaction between the agent and the FCC process generates a control trajectory $\mathcal{T}$ with a total of $T$ periods.
Let $X \in \{R, C\}$ and $x \in \{r, c\}$, and let the state–action value function $G_{\pi, I}(s, a; \omega)$ characterize the expected economic profit obtained when executing policy $\pi$ and control action $a_t$ in state $s_t$ while satisfying the operational constraints. We solve for the optimal policy
$$\pi^* = \arg\max_\pi \underbrace{\left[Z_\pi^R(s, a; \omega) - I\, Z_\pi^C(s, a; \omega)\right]}_{G_{\pi, I}(s, a; \omega)}$$
Based on this, the FCC unit implements policy π * for optimal control. To achieve maximum cumulative profit throughout the entire process, the agent must continuously learn and refine the mapping relationship between system states and optimal control actions through ongoing interaction with the controlled object.
(2) State: The state vector $s_t$ is the feedback signal to the SRL agent, reflecting the set of process parameters of the FCC environment after executing control command $a_{t-1}$ at sampling period $t-1$. The state space is defined as follows:
$$s_t = [t, C_{rc}, O_d, T_{rg}, T_{ri1}, y_{f1}, y_{gl}, T_{ri0}, J, T_{cy}] \in \mathcal{S}$$
where $t$ denotes the time step within the training period. The detailed parameters of the state space $\mathcal{S}$ are listed in Table 1 and Table 2.
(3) Action: Given state $s_t$, the corresponding control action $a_t$ taken by the SRL agent at time step $t$ is defined as
$$a_t = [a_t^{F_{sc}}, a_t^{F_a}] \in \mathcal{A}$$
where $a_t^{F_{sc}} \in [6000, 24{,}000]$ kg/min represents the catalyst circulation rate and $a_t^{F_a} \in [0, 3600]$ kg/min represents the regenerator air flow rate.
(4) Environment and State Transition: Once the action $a_t$ is issued, the agent interacts with the fluid catalytic cracking environment to obtain the next state $s_{t+1}$ and receives the reward $r_t$ and cost $c_t$. This interaction can be represented by the mapping function $[s_{t+1}, r_t, c_t] = f_{FCC}(s_t, a_t)$, which is constrained by operational rules and governed by the relevant physical laws to ensure the feasibility of the action selected at each time step.
(5) Reward and Cost: In Formula (20), the terms represent product revenue and raw material cost, respectively, where $p$ denotes the material price listed in Table 3. The overall operational objective is therefore formulated as maximizing the cumulative rewards
$$r_t = p_{gl} F_{gl} + p_{gs} F_{gs} + p_{ugo} F_{ugo} - p_{ugo} F_{oil}$$
Based on the characteristics and constraints of the FCC process (15)–(19), the cost signals are designed as
$$\begin{cases}
c_{T_{cy},t} = \lambda_1 \left(T_{cy,t} - 1000\right), & \text{if } T_{cy,t} > 1000 \\
c_{T_{ri0},t} = \lambda_2 \left(T_{ri0,t} - 1000\right), & \text{if } T_{ri0,t} > 1000 \\
c_{T_{ri1},t} = \lambda_3 \left(T_{ri1,t} - 1000\right), & \text{if } T_{ri1,t} > 1000 \\
c_{F_{sc},t}^{1} = \lambda_4 \left(F_{sc,t} - 24{,}000\right), & \text{if } F_{sc,t} > 24{,}000 \\
c_{F_{sc},t}^{2} = \lambda_5 \left(6000 - F_{sc,t}\right), & \text{if } F_{sc,t} < 6000 \\
c_{F_a,t} = \lambda_6 \left(F_{a,t} - 3600\right), & \text{if } F_{a,t} > 3600 \\
0, & \text{otherwise}
\end{cases}$$
In addition, a soft penalty on deviations from the operating set-points is included:
$$c_{t,1} = \lambda_7 \left|T_{ri1} - 782.8\right| + \lambda_8 \left|T_{cy} - 999.5\right|$$
The total cost signal is
$$c_t = c_{T_{cy},t} + c_{T_{ri0},t} + c_{T_{ri1},t} + c_{F_{sc},t}^{1} + c_{F_{sc},t}^{2} + c_{F_a,t} + c_{t,1}$$
where the $\lambda_i$ are tunable weighting constants.
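The reward and cost construction above can be summarized in a short sketch. Prices and flow symbols follow Tables 2 and 3; the λ weights and the function signatures are illustrative assumptions, not the paper's implementation.

```python
def fcc_reward(F_gl, F_gs, F_ugo, F_oil=2438.0,
               p_gl=0.14, p_gs=0.132, p_ugo=0.088):
    """Immediate economic reward r_t [USD/min]: product revenue minus feed cost."""
    return p_gl * F_gl + p_gs * F_gs + p_ugo * F_ugo - p_ugo * F_oil

def fcc_cost(T_cy, T_ri0, T_ri1, F_sc, F_a, lam=(1.0,) * 8):
    """Cost signal c_t built from constraints (15)-(19) plus set-point deviations."""
    c = 0.0
    c += lam[0] * max(0.0, T_cy - 1000.0)      # cyclone temperature limit (17)
    c += lam[1] * max(0.0, T_ri0 - 1000.0)     # riser inlet temperature limit (18)
    c += lam[2] * max(0.0, T_ri1 - 1000.0)     # riser outlet temperature limit (19)
    c += lam[3] * max(0.0, F_sc - 24000.0)     # catalyst circulation upper bound (15)
    c += lam[4] * max(0.0, 6000.0 - F_sc)      # catalyst circulation lower bound (15)
    c += lam[5] * max(0.0, F_a - 3600.0)       # air flow upper bound (16)
    c += lam[6] * abs(T_ri1 - 782.8)           # deviation from riser set-point
    c += lam[7] * abs(T_cy - 999.5)            # deviation from cyclone set-point
    return c
```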

5. Simulation Results

5.1. Training Algorithms

We carry out comparative experiments between the primal-dual version of the standard DDPG [33] (PD3PG) and the Bayesian PD3PG (BPD3PG). Given that the application scenario involves a multi-dimensional continuous action space, the traditional Deep Q-Network (DQN) [34] and Deterministic Policy Gradient (DPG) [35] algorithms cannot address such challenges directly. As mentioned previously, we employ the primal-dual method to solve the CMDP problem. Although traditional reinforcement learning algorithms such as DDPG can also attempt to solve this problem, the constraints present in practical applications make their direct application infeasible.
Our network is based on the actor–critic architecture of DDPG, with two modifications made through the primal-dual method. First, we add a cost critic alongside the existing reward critic to estimate the expected cumulative cost. Second, the actor is updated to maximize the reward Q-value while accounting for the cost Q-value through the Lagrangian term. Regarding the implementation details, we constructed six deep neural networks (DNNs) for the online and target actors, reward critics, and cost critics. To ensure the comparability and reproducibility of the experiments, PD3PG and BPD3PG adopt identical structural configurations (more details of the parameters are given in Table 4 and Table 5 and Algorithm 1). We employ $\alpha$-Bayesian neural networks ($\alpha$-BNNs) as the Q-value functions. Following the aforementioned loss calculation strategy, our goal is to accurately capture the uncertainty estimates of the Q-value functions. The fitting target for the $\alpha$-BNN Q-function is
$$y_t = x_t + \gamma\, Z^{X\prime}\!\left(s_{t+1}, \pi_{\phi'}(s_{t+1})\right), \quad x_t \in \{r_t, c_t\}$$
where $\pi_{\phi'}$ is the target policy and $Z^{X\prime}$ is the corresponding target critic network adjusted through a soft update mechanism. We update the BPD3PG actor using the MC dropout posterior mean of the Bayesian value function distribution in place of a deterministic estimate:
$$\nabla_\phi J_{\pi_\phi} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\phi \pi_\phi(s)\, \nabla_a G_{\pi_\phi, I}(s, a; \omega_m)\Big|_{s=s_t,\, a=\pi_\phi(s_t)} \qquad (29)$$
where $G_{\pi_\phi, I}(s, a; \omega) = Z_\pi^R(s, a; \omega) - I\, Z_\pi^C(s, a; \omega)$. Similarly, the Lagrangian multiplier $I$ is updated by minimizing the dual loss $J_I$, whose sampled gradient is
$$\nabla_I J_I \approx -\frac{1}{M} \sum_{m=1}^{M} \left(Z_\pi^C(s, a; \omega_m) - d\right)\Big|_{s=s_t,\, a=\pi_\phi(s_t)} \qquad (30)$$
so that descending this gradient increases $I$ whenever the estimated cumulative cost exceeds the threshold $d$, with $I$ projected back onto $I \ge 0$.
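The two updates can be sketched as follows, with the M MC-dropout evaluations of the reward and cost critics supplied as arrays. The function names and interfaces are illustrative, not the exact implementation.

```python
import numpy as np

def lagrangian_q_posterior_mean(q_r_samples, q_c_samples, lam):
    """Posterior mean of G = Z^R - I * Z^C over M MC-dropout weight samples.
    The actor ascends the gradient of this quantity with respect to its action."""
    return float(np.mean(np.asarray(q_r_samples) - lam * np.asarray(q_c_samples)))

def update_multiplier(lam, q_c_samples, d, lr=1e-3):
    """Dual update corresponding to (30): raise I when the estimated cumulative
    cost exceeds the threshold d, then project back onto I >= 0."""
    violation = float(np.mean(q_c_samples)) - d
    return max(0.0, lam + lr * violation)
```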
The training processes of BPD3PG and PD3PG are compared in Figure 4, where one observes that BPD3PG outperforms PD3PG with a higher reward and a lower cost. Furthermore, as seen in Figure 4, by fitting the $\alpha$-BNN value function and using the posterior mean in the policy update, BPD3PG demonstrates faster convergence and improved stability, particularly in disturbed settings. This improvement can be attributed to the stronger exploration ability of BPD3PG when faced with high uncertainty during the early stages of learning.
Algorithm 1 BPD3PG
Input: Initial networks $Z_\pi^R$, $Z_\pi^C$, $\pi_\phi$
Input: Target parameters: $\{Z_\pi^{R\prime}, Z_\pi^{C\prime}, \pi_{\phi'}\} \leftarrow \{Z_\pi^R, Z_\pi^C, \pi_\phi\}$
Input: Initial replay buffer $\mathcal{X}$ and Lagrangian multiplier $I$
  1:   for each episode do
  2:         for each time step do
  3:               $a_t = \pi_\phi(s_t)$
  4:               $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; observe $r_t$ and $c_t$
  5:               $\mathcal{X} \leftarrow \mathcal{X} \cup \{(s_t, a_t, r_t, c_t, s_{t+1})\}$
  6:         end for
  7:         for each gradient step do
  8:               Sample a mini-batch of experience from the replay buffer $\mathcal{X}$
  9:               Update the reward and cost critic networks by minimizing the loss function (14)
10:              Update the Lagrangian multiplier using (30)
11:              Update the actor network using (29)
12:              Update the target networks via soft update [33]
13:        end for
14:  end for
Output: $Z_\pi^R$, $Z_\pi^C$, $\pi_\phi$, $I$
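For orientation, the loop in Algorithm 1 can be written as the following Python skeleton (reusing update_multiplier from the earlier sketch). All objects (env, actor, critics, buffer) and their methods are placeholders for the reader's own implementation; only the control flow mirrors the algorithm.

```python
def train_bpd3pg(env, actor, reward_critic, cost_critic, buffer,
                 episodes=2000, gamma=0.99, d=0.0, lam=1.0):
    """Skeleton of Algorithm 1 (BPD3PG); interfaces are assumed, not prescribed."""
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                                   # data-collection phase
            a = actor.act(s)                              # a_t = pi_phi(s_t)
            s_next, r, c, done = env.step(a)
            buffer.add(s, a, r, c, s_next)
            s = s_next
        for _ in range(buffer.gradient_steps):            # learning phase
            batch = buffer.sample()
            reward_critic.update(batch, gamma)            # alpha-BNN loss, Eq. (14)
            cost_critic.update(batch, gamma)
            lam = update_multiplier(lam,                  # dual step, Eq. (30)
                                    cost_critic.mc_q(batch), d)
            actor.update(batch, reward_critic,            # primal step, Eq. (29)
                         cost_critic, lam)
            actor.soft_update_targets()                   # soft target update [33]
    return actor, lam
```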

5.2. Experimental Results

As shown in Figure 3, the traditional control structure is based on the Hicks framework, which uses the pairings $F_{sc}$–$T_{ri1}$ and $F_a$–$T_{cy}$. According to [13], this pairing provides optimal controllability. In this study, the RL agent is additionally configured to optimize the control signals, as shown in Figure 3. As an active constraint variable, $T_{cy}$ requires special attention in both deep reinforcement learning control schemes.
The simulation experiments are arranged as follows: Initially, the FCC process operates under nominal conditions. To simulate the parameter fluctuations typically encountered in industrial processes, two disturbance scenarios, d 1 and d 2 , are introduced, representing random variations in key model parameters within ± 20 % of their nominal values.
Specifically, the parameter vector $[k_o, k_c, k_{com}, \sigma_2, h_1, h_2, E_{cb}/R]^T$ (refer to Table 1) under nominal conditions is
$$d_0 = [962{,}000,\ 0.01897,\ 29.338,\ 0.006244,\ 521{,}150,\ 245,\ 158.6]^T$$
At $t = 1000$ min, a disturbance scenario
$$d_1 = [743{,}903.8231,\ 0.01666,\ 31.5,\ 0.0070,\ 497{,}882,\ 274.8,\ 162.6]^T$$
is introduced smoothly via a 500 min ramp function. At $t = 3000$ min, the scenario switches to another value
$$d_2 = [995{,}340.7075,\ 0.01663,\ 28.8,\ 0.00665,\ 537{,}683,\ 196.8,\ 163.5]^T$$
We compared the performance of two SRL-based control methods, BPD3PG and PD3PG, under both undisturbed and disturbed conditions:
  • Undisturbed condition: The system operates in an ideal control environment. The manipulated variables $F_{sc}$ and $F_a$ are precisely regulated by the controller, and the key parameters change according to disturbance scenarios $d_1$ and $d_2$, but no random disturbances or measurement noises are introduced.
  • Disturbed condition: In addition to the key parameter changes following disturbance scenarios $d_1$ and $d_2$, the system is also subject to random disturbances. Specifically, the manipulated variables become $F_{sc} + \delta_{sc}$ and $F_a + \delta_a$, and the measured variables become $T_{cy} + \eta_{cy}$ and $T_{ri1} + \eta_{ri1}$, where $\delta \sim \mathcal{N}(0, 20)$ and the measurement noise power is $\eta = 0.1^2$.
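A small helper for the disturbed scenario might look as follows. This is illustrative only: the parameter 20 is treated here as a variance and the measurement noise power $0.1^2$ as a variance as well, since the text does not make the interpretation explicit.

```python
import numpy as np

rng = np.random.default_rng(42)

def disturb(F_sc, F_a, T_cy, T_ri1, act_var=20.0, meas_power=0.1 ** 2):
    """Add actuator disturbances and measurement noise for the disturbed scenario."""
    F_sc_d = F_sc + rng.normal(0.0, np.sqrt(act_var))       # delta_sc ~ N(0, 20)
    F_a_d = F_a + rng.normal(0.0, np.sqrt(act_var))         # delta_a  ~ N(0, 20)
    T_cy_m = T_cy + rng.normal(0.0, np.sqrt(meas_power))    # eta_cy
    T_ri1_m = T_ri1 + rng.normal(0.0, np.sqrt(meas_power))  # eta_ri1
    return F_sc_d, F_a_d, T_cy_m, T_ri1_m
```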
The dynamic tracking performances are shown in Figure 5 and Figure 6, respectively. Overall, both controllers track their respective controlled variables reasonably well under the influence of disturbances. Moreover, the active constraint $T_{cy}$ remains close to 999.5 K for most of the time, effectively meeting the requirement $T_{cy} \le 1000$ K. An exception occurs around 3000 min, when the introduction of disturbance $d_2$ causes $T_{cy}$ to exceed the limit significantly. However, compared to PD3PG, BPD3PG exhibits greater robustness: through continuous interaction with the environment, trial and error, and feedback learning, BPD3PG returns $T_{cy}$ to within the feasible range shortly after 3000 min. In the other control loop, the reactor temperature $T_{ri1}$ is stabilized near the set-point of 782.8 K by manipulating $F_{sc}$.
In terms of economic performance, Figure 7 shows the profit function $J$ under both disturbed and undisturbed conditions for the two methods using the Hicks control structure. The BPD3PG method generally achieves higher economic profits, especially when disturbances are present, highlighting its superior robustness. Under nominal conditions (0–1000 min), the economic profits of both methods are nearly identical (the nominal optimum), which is expected because no particular adaptation is needed in either case. However, in the $d_1$ scenario (1000–3000 min), the profit of BPD3PG ($J_{undisturbed}^{BPD3PG} = 42.11$ / $J_{disturbed}^{BPD3PG} = 42.05$) is significantly higher than that of PD3PG ($J_{undisturbed}^{PD3PG} = 41.87$ / $J_{disturbed}^{PD3PG} = 41.74$). For the $d_2$ scenario, the final settlement results show that the economic profit of BPD3PG ($J_{undisturbed}^{BPD3PG} = 49.31$ / $J_{disturbed}^{BPD3PG} = 49.33$) consistently outperforms PD3PG ($J_{undisturbed}^{PD3PG} = 48.94$ / $J_{disturbed}^{PD3PG} = 48.44$). Moreover, compared with the Hicks control structure implemented with PI controllers, our BPD3PG method achieved notably higher profits ($J_{disturbed}^{Hicks} = 48.96$ / $J_{disturbed}^{BPD3PG} = 49.33$), demonstrating the economic advantages of our approach.
PD3PG shows a larger profit difference between disturbed and undisturbed conditions, indicating weaker robustness. In contrast, the profit difference for BPD3PG between noisy and undisturbed conditions is smaller, with some fluctuation, and, overall, BPD3PG consistently outperforms both PD3PG and the traditional control approach (Table 6).

6. Conclusions

This paper presented a Bayesian deep reinforcement learning method for the optimization of chemical processes, called the Bayesian Primal-Dual Deep Deterministic Policy Gradient (BPD3PG) method, which was successfully applied to a fluid catalytic cracking (FCC) unit. BPD3PG employs Bayesian neural networks for value function approximation, which effectively captures the uncertainties in FCC processes and alleviates the value-function overestimation problem inherent in traditional deterministic deep reinforcement learning. Furthermore, by utilizing the primal-dual method to handle process constraints, the approach ensures the operational feasibility of the system.
The simulation experiments on the FCC unit validated the superior performance of the BPD3PG method. The BPD3PG significantly outperformed traditional PI controllers, as well as deterministic policy gradient methods, achieving higher economic profits (Table 6) and maintaining more stable process control performance, especially when facing significant disturbance scenarios that could potentially cause controller overreaction.

Author Contributions

Software, J.Q. and L.Y.; Data curation, J.Z. and J.J.; Writing—review & editing, J.Q. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62373147), Zhejiang Provincial Natural Science Foundation of China (LY24F030007), Huzhou Key Laboratory of Intelligent Sensing and Optimal Control for Industrial Systems (2022-17), and Postgraduate Research and Innovation Project of Huzhou University (2025KYCX83).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bloor, M.; Ahmed, A.; Kotecha, N.; Mercangöz, M.; Tsay, C.; Chanona, E.A.D.R. Control-Informed Reinforcement Learning for Chemical Processes. Ind. Eng. Chem. Res. 2025, 64, 4966–4978. [Google Scholar] [CrossRef] [PubMed]
  2. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed Machine Learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
  3. Byun, H.E.; Kim, B.; Lee, J.H. Embedding Active Learning in Batch-to-batch Optimization Using Reinforcement Learning. Automatica 2023, 157, 111260. [Google Scholar] [CrossRef]
  4. Oh, T.H.; Park, H.M.; Kim, J.W.; Lee, J.M. Integration of reinforcement learning and model predictive control to optimize semi-batch bioreactor. AIChE J. 2022, 68, e17658. [Google Scholar] [CrossRef]
  5. Syauqi, A.; Kim, H.; Lim, H. Optimizing Olefin Purification: An Artificial Intelligence-Based Process-Conscious PI Controller Tuning for Double Dividing Wall Column Distillation. Chem. Eng. J. 2024, 500, 156645. [Google Scholar] [CrossRef]
  6. Petukhov, A.N.; Shablykin, D.N.; Trubyanov, M.M.; Atlaskin, A.A.; Zarubin, D.M.; Vorotyntsev, A.V.; Stepanova, E.A.; Smorodin, K.A.; Kazarina, O.V.; Petukhova, A.N.; et al. A hybrid batch distillation/membrane process for high purification part 2: Removing of heavy impurities from xenon extracted from natural gas. Sep. Purif. Technol. 2022, 294, 121230. [Google Scholar] [CrossRef]
  7. Singh, V.; Kodamana, H. Reinforcement Learning Based Control of Batch Polymerisation Processes. IFAC-PapersOnLine 2020, 53, 667–672. [Google Scholar] [CrossRef]
  8. Hartlieb, M. Photo-iniferter RAFT polymerization. Macromol. Rapid Commun. 2022, 43, 2100514. [Google Scholar] [CrossRef]
  9. Chen, J.; Wang, F. Cost Reduction of CO2 Capture Processes Using Reinforcement Learning Based Iterative Design: A Pilot-Scale Absorption–stripping System. Sep. Purif. Technol. 2013, 122, 149–158. [Google Scholar] [CrossRef]
  10. Perera, A.T.D.; Wickramasinghe, P.U.; Nik, V.M.; Scartezzini, J.L. Introducing Reinforcement Learning to the Energy System Design Process. Appl. Energy 2020, 262, 114580. [Google Scholar] [CrossRef]
  11. Sachio, S.; Mowbray, M.; Papathanasiou, M.M.; del Rio-Chanona, E.A.; Petsagkourakis, P. Integrating Process Design and Control Using Reinforcement Learning. Chem. Eng. Res. Des. 2021, 183, 160–169. [Google Scholar] [CrossRef]
  12. Kim, S.; Jang, M.G.; Kim, J.K. Process Design and Optimization of Single Mixed-Refrigerant Processes with the Application of Deep Reinforcement Learning. Appl. Therm. Eng. 2023, 223, 120038. [Google Scholar] [CrossRef]
  13. Hicks, R.; Worrell, G.; Durney, R. Atlantic seeks improved control; studies analog-digital models. Oil Gas J. 1966, 24, 97. [Google Scholar]
  14. Boum, A.T.; Latifi, A.; Corriou, J.P. Model predictive control of a fluid catalytic cracking unit. In Proceedings of the 2013 International Conference on Process Control (PC), Strbske Pleso, Slovakia, 18–21 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 335–340. [Google Scholar]
  15. Skogestad, S. Plantwide control: The search for the self-optimizing control structure. J. Process Control 2000, 10, 487–507. [Google Scholar] [CrossRef]
  16. Ye, L.; Cao, Y.; Yuan, X. Global approximation of self-optimizing controlled variables with average loss minimization. Ind. Eng. Chem. Res. 2015, 54, 12040–12053. [Google Scholar] [CrossRef]
  17. Altman, E. Constrained Markov Decision Processes; Chapman and Hall/CRC: Boca Raton, FL, USA, 1999. [Google Scholar]
  18. Ji, J.; Zhou, J.; Zhang, B.; Dai, J.; Pan, X.; Sun, R.; Huang, W.; Geng, Y.; Liu, M.; Yang, Y. OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research. J. Mach. Learn. Res. 2024, 25, 1–6. [Google Scholar]
  19. Yoo, H.; Kim, B.; Kim, J.W.; Lee, J.H. Reinforcement Learning Based Optimal Control of Batch Processes Using Monte-Carlo Deep Deterministic Policy Gradient with Phase Segmentation. Comput. Chem. Eng. 2021, 144, 107133. [Google Scholar] [CrossRef]
  20. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  21. Neal, R.M. Bayesian Learning for Neural Networks; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 118. [Google Scholar]
  22. Graves, A. Practical variational inference for neural networks. Adv. Neural Inf. Process. Syst. 2011, 24, 2348–2356. [Google Scholar]
  23. Minka, T.P. Expectation propagation for approximate Bayesian inference. arXiv 2013, arXiv:1301.2294. [Google Scholar]
  24. Gal, Y.; McAllister, R.; Rasmussen, C.E. Improving PILCO with Bayesian neural network dynamics models. In Proceedings of the Data-Efficient Machine Learning Workshop, ICML, New York, NY, USA, 24 June 2016; Volume 4, p. 25. [Google Scholar]
  25. Henderson, P.; Doan, T.; Islam, R.; Meger, D. Bayesian Policy Gradients via Alpha Divergence Dropout Inference. In Proceedings of the NIPS Bayesian Deep Learning Workshop, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Li, Y.; Gal, Y. Dropout Inference in Bayesian Neural Networks with Alpha-divergences. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2052–2061. [Google Scholar]
  27. Hernandez-Lobato, J.; Li, Y.; Rowland, M.; Bui, T.; Hernández-Lobato, D.; Turner, R. Black-box alpha divergence minimization. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1511–1520. [Google Scholar]
  28. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. In Predicting Structured Data; MIT Press: Cambridge, MA, USA, 2006; Volume 1. [Google Scholar]
  29. Liu, X.; Sun, S. Alpha-divergence Minimization with Mixed Variational Posterior for Bayesian Neural Networks and Its Robustness Against Adversarial Examples. Neurocomputing 2020, 423, 427–434. [Google Scholar] [CrossRef]
  30. Guan, H.; Ye, L.; Shen, F.; Song, Z. Economic Operation of a Fluid Catalytic Cracking Process Using Self-Optimizing Control and Reconfiguration. J. Taiwan Inst. Chem. Eng. 2019, 96, 104–113. [Google Scholar] [CrossRef]
  31. Loeblein, C.; Perkins, J. Structural design for on-line process optimization: II. Application to a simulated FCC. AIChE J. 1999, 45, 1030–1040. [Google Scholar] [CrossRef]
  32. Hovd, M.; Skogestad, S. Procedure for regulatory control structure selection with application to the FCC process. AIChE J. 1993, 39, 1938–1953. [Google Scholar] [CrossRef]
  33. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  35. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 387–395. [Google Scholar]
Figure 1. (1) Traditional neural network; (2) Bayesian neural network.
Figure 2. An illustration of approximating distributions by α-divergence minimization. Here, p and q shown in the graphs are un-normalized probability densities.
Figure 3. The fluid catalytic cracking unit with RL agent.
Figure 4. Training curves of BPD3PG and PD3PG. The shaded region represents half a standard deviation of the average evaluation over 5 trials; curves are smoothed uniformly for visual clarity.
Figure 5. Optimal trajectories after RL training and dynamic simulation of the Hicks control structure: loop 1 ($F_a$–$T_{cy}$) and loop 2 ($F_{sc}$–$T_{ri1}$) (undisturbed).
Figure 6. Optimal trajectories after RL training and dynamic simulation of the Hicks control structure: loop 1 ($F_a$–$T_{cy}$) and loop 2 ($F_{sc}$–$T_{ri1}$) (disturbed).
Figure 7. Dynamic trajectories of the economic function J (disturbed and undisturbed).
Table 1. Model parameters of the FCC unit.
Variable | Description | Value and Unit
E_cb | Activation energy for coke burning reaction | 158.6 kJ/mol
F_oil | Mass flow rate of gas oil feed | 2438.0 kg/min
F_gl | Gasoline yield factor of catalyst | 1.0 kg/min
k_c | Rate constant for catalytic coke formation | 0.01897 s^0.5
k_com | Rate constant for coke burning | 29.338 min^-1
k_o | Rate constant for gas oil cracking | 962,000 s^-1
T_a | Temperature of air to regenerator | 320.0 K
T_oil | Temperature of gas oil feed | 420.0 K
σ_2 | CO2/CO dependence on the temperature | 0.006244 K^-1
h_1, h_2 | Parameters for approximating ΔH | 521,150.0, 245.0
Table 2. Process variables of the FCC unit.
Variable | Description | Unit
F_a | Mass flow rate of air to regenerator | kg/min
F_sc | Mass flow rate of spent catalyst | kg/min
T_rg | Temperature of catalyst in regenerator dense bed | K
T_cy | Temperature of cyclone | K
T_ri0 | Temperature of catalyst and gas oil mixture at riser inlet | K
T_ri1 | Temperature of catalyst and gas oil mixture at riser outlet | K
y_f1 | Weight fraction of gas oil in product | -
y_gl | Weight fraction of gasoline in product | -
Table 3. Material prices for FCC components.
Price | Component | Value
p_gl | gasoline | 0.14 USD/kg
p_gs | light gases | 0.132 USD/kg
p_ugo | unconverted gas oil | 0.088 USD/kg
Table 4. Hyper-parameters of the proposed BPD3PG and PD3PG.
Parameter | Value
Hidden layers | 64-128-64
Batch size | 64
Time steps | 9 × 10^6
Episodes | 2000
MC samples (only for BPD3PG) | 50
Dropout rate (only for BPD3PG) | 0.995
Actor learning rate | 0.0001
Reward/cost critic learning rate | 0.0001
Discount factor | 0.99
α-divergence (only for BPD3PG) | 0.95
KeepProp (only for BPD3PG) | 0.95
β (only for BPD3PG) | 0.92
Table 5. Hardware details.
Parameter | Version
Computer | Windows 10
CPU | i5-12400F 2.50 GHz
RAM | 32.0 GB
GPU | NVIDIA GeForce RTX 4060 Ti
TensorFlow | 2.2.0
Python | 3.8
Table 6. Performance improvement of BPD3PG in the final settlement results.
Comparison | Condition | J (USD/min) | Improvement (%)
BPD3PG vs. Hicks | Undisturbed | 49.31 vs. 49.11 | 0.41% ↑
BPD3PG vs. Hicks | Disturbed | 49.33 vs. 48.96 | 0.76% ↑
BPD3PG vs. PD3PG | Undisturbed | 49.31 vs. 48.94 | 0.76% ↑
BPD3PG vs. PD3PG | Disturbed | 49.33 vs. 48.44 | 1.84% ↑

Qin, J.; Ye, L.; Zheng, J.; Jin, J. Bayesian Deep Reinforcement Learning for Operational Optimization of a Fluid Catalytic Cracking Unit. Processes 2025, 13, 1352. https://doi.org/10.3390/pr13051352
