This section presents a discussion of the studies identified through the PRISMA-based systematic literature review. The works are organized by industry sector, with an emphasis on the application of various reinforcement learning (RL) algorithms to address specific problem-solving tasks within the process industry.
6.1. Chemical Industry
This section discusses the application of reinforcement learning (RL) within the chemical process industry, highlighting its growing role in addressing complex problems such as reactor control, raw material scheduling, fault diagnosis, and process safety. RL’s adaptability to high-dimensional, nonlinear, and uncertain environments (coupled with its compatibility with both discrete and continuous decision spaces) makes it particularly well-suited for chemical process control.
Across diverse applications, three main categories of RL emerge: value-based methods (e.g., Q-learning, DQN), policy-based methods (e.g., PPO, A2C), and hybrid actor–critic approaches (e.g., DDPG, TD3, SAC). These approaches are applied in both simulation and real-time environments, revealing several methodological trends such as hybridization with traditional control techniques, the rise of multi-agent systems, and increased focus on sample efficiency and safety guarantees.
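To make the value-based family concrete, the following is a minimal sketch of the tabular Q-learning update that underlies methods such as Q-learning and, with a neural approximator, DQN. The two-action toy problem, the state and action labels, and the hyperparameter values are illustrative assumptions and are not taken from any of the reviewed studies.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Bootstrap from the best action value available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the TD target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy transition: the labels below are placeholders only.
actions = ["hold", "inject"]
q_table = defaultdict(float)  # all action values start at zero
new_value = q_learning_update(q_table, "hot", "inject", reward=1.0,
                              next_state="nominal", actions=actions)
```

Policy-based and actor–critic methods replace the explicit maximization over actions with a parameterized policy, which is what allows them to handle continuous action spaces.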
In reactor control, several studies focus on continuous stirred tank reactors (CSTRs), which serve as benchmarks for nonlinear system dynamics. Zhang et al. [53] introduce a Robust Safe Model-Based RL (RSMB-RL) approach to control a CSTR under state constraints and disturbances. Using a slack function to reformulate constraints and a composite learning rule, this hybrid method operates in continuous action spaces and delivers safe, stable control policies without prior system knowledge. Similarly, ref. [54] applies a multi-agent Twin-Delayed Deep Deterministic Policy Gradient (TD3) approach for multiloop CSTR control, showing promising disturbance rejection in continuous control settings.
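For readers less familiar with TD3, its critic target in the standard single-agent formulation (not the specific multi-agent variant of ref. [54]) can be written as follows for a non-terminal transition $(s, a, r, s')$. The symbols $\pi_{\phi'}$ (target actor), $Q_{\theta_i'}$ (target critics), $\sigma$, $c$, $a_{\min}$, and $a_{\max}$ are generic notation introduced here for illustration.

```latex
\begin{aligned}
\tilde{a} &= \operatorname{clip}\!\big(\pi_{\phi'}(s') + \operatorname{clip}(\epsilon, -c, c),\; a_{\min},\; a_{\max}\big),
\qquad \epsilon \sim \mathcal{N}(0, \sigma^{2}),\\
y &= r + \gamma \min_{i \in \{1, 2\}} Q_{\theta_i'}(s', \tilde{a}).
\end{aligned}
```

Taking the minimum over two critics counteracts value overestimation, and the clipped noise on the target action smooths the value estimate, two properties that are plausibly helpful for disturbance rejection in continuous control.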
Other reactor-focused works include [55], where Q-learning augmented by Gaussian processes is applied to semi-batch reactor control. The study focuses on data efficiency, requiring only 100 trajectory evaluations to develop a safe, near-optimal policy. Another study [56] uses Deep Q-Learning (DQN) for discrete decision making (specifically, binary actions such as cold diluent injection in a styrene batch reactor) to prevent thermal runaway. Here, explainable RL techniques (e.g., decision trees, Shapley values) are introduced to enhance transparency, reflecting an emerging trend toward interpretable AI in safety-critical applications.
Scheduling problems, often modeled with discrete action spaces and under uncertainty, are a recurring application of RL. Rangel-Martinez and Ricardez-Sandoval [57] utilize Deep Recurrent Q-Networks (DRQN) in a partially observable batch plant environment with zero-wait constraints. By incorporating recurrent neural networks and flexible observation windows, the study bridges temporal dependencies with limited system visibility. In a complementary line, ref. [58] addresses scheduling in uncertain environments using a hybrid RLeRO framework, combining Robust Optimization (RO) with Advantage Actor-Critic (A2C). The inclusion of RO improves policy stability and prevents convergence to local optima. Wu et al. [59] also explore scheduling via a modified Proximal Policy Optimization (PPO), showing better convergence and higher returns compared to conventional policy gradients due to improved state function design.
Hubbs et al. [60] and Bougie et al. [61] further explore scheduling in single-stage multiproduct reactors and VAM plants. Hubbs et al. [60] adopt A2C in a simulation environment to generate schedules resilient to demand uncertainties, while Bougie et al. [61] introduce a decentralized PPO-based strategy, decomposing policies across agents. This method proves advantageous for sample efficiency and robustness in complex systems. Notably, ref. [62] also explores controller-guided self-supervised learning (CGS) in VAM plant control, showcasing how legacy control knowledge can improve RL training efficiency.
Applications in fault diagnosis and parameter estimation further illustrate RL’s flexibility. Kim and Lee [63] pioneered the use of associative RL to evolve Fuzzy Cognitive Maps (FCMs) for automated fault detection in tank–pipe systems. This early effort in model adaptation and knowledge integration foreshadows more recent techniques. For instance, ref. [64] applies Deep Deterministic Policy Gradient (DDPG) to online parameter estimation in the hydrogenation of acetylene, offering a nonintrusive, continuous control solution that avoids perturbing the system. Similarly, ref. [65] integrates DDPG with Economic Model Predictive Control (EMPC) to correct model mismatches in CSTRs, achieving better yield through parameter updates guided by state comparison.
Exploration of real-time control is exemplified in [66], where Asynchronous Advantage Actor-Critic (A3C) is used for direct control of a hybrid three-tank system, demonstrating real-time deployment and effective parallel learning strategies. This work validates the benefit of pre-training in simulation, addressing a critical gap in bridging simulated and physical environments. In a related domain, Alhazmi et al. [67] showcase DDPG for maintaining optimal conditions in chemical reaction networks, emphasizing the importance of reward design and observation space for success in complex control tasks.
Several studies highlight the hybridization of RL with traditional control and optimization frameworks. Zanon et al. [68] combine RL with Nonlinear Model Predictive Control (NMPC) for an evaporation process, using Real-Time Iteration (RTI) to reduce computational cost while maintaining stability. Oh [69] extends this analysis by comparing DRL algorithms (DDPG, TD3, SAC) with data-driven MPC across various systems, including SMB and bioreactors.
Neuro-evolutionary methods also make notable contributions. Conradie and Aldrich [70] introduce Symbiotic Adaptive Neuro-Evolution (SANE) for bioreactor design and control, evolving specialized neurons for improved economic operation under uncertainty. Conradie et al. [71] build on this with the Symbiotic Memetic Neuro-Evolution (SMNE) algorithm, blending evolutionary strategies with particle swarm optimization for efficient policy learning in nonlinear systems. These methods emphasize robustness and adaptability, particularly in early-stage simulation.
While these trends highlight clear methodological strengths—such as the dominance of actor–critic methods in nonlinear continuous control and the effectiveness of value-based methods in discrete scheduling—several challenges remain. In particular, transferring policies trained in simulation to real plants continues to lack standardized procedures, and hybrid RL–MPC schemes depend heavily on accurate surrogate or mechanistic models. These limitations help explain why most applications remain confined to simulation despite methodological maturity.
In conclusion, reinforcement learning in the chemical process industry is advancing toward mature, hybridized frameworks capable of solving both continuous and discrete control problems across various technical domains. With improvements in interpretability, real-time performance, and integration with domain-specific knowledge, RL is positioned to play a critical role in the next generation of intelligent chemical process systems. The most significant elements from this subsection are summarized in Table 1.
6.2. Steel Industry
This section discusses the application of reinforcement learning (RL) within the steel industry, focusing on diverse technical contexts such as flatness control, energy management, maintenance scheduling, and quality assurance. A key thematic link across these applications is the alignment of RL algorithm design with the nature of the problem—whether defined by discrete scheduling decisions or continuous control requirements—and the growing hybridization of RL with traditional methods to enhance robustness and industrial applicability.
Flatness control in strip and cold rolling has emerged as a major focus, typically involving continuous action spaces. Peng et al. [72] and Deng et al. [73] adopt model-free algorithms such as DDPG and TD3 (both off-policy) and PPO (on-policy), integrating ensemble learning to improve stability and reduce the impact of local optima. These approaches operate in simulation and outperform traditional PI controllers. Similarly, ref. [74] proposes Stable Q-Learning (SQL), an offline, value-based method with a Q-ensemble to ensure robustness in the absence of simulators, offering a practical solution for industrial datasets.
In contrast, discrete decision-making problems, such as maintenance and scheduling, are tackled using value-based methods. Ferreira Neto et al. [75] employ a DDQN algorithm within a simulation of a steel shredder maintenance process, achieving significant cost savings by replacing time-based policies. Likewise, ref. [76] addresses blast furnace gas tank scheduling with DQN, demonstrating improved convergence and operational safety by discretizing actions and pre-training on historical data. Jeong et al. [77] utilize Q-learning to identify optimal hot forging parameters by mapping instability domains, with validation through experimental observation, thus illustrating the capacity of model-free, tabular methods in low-dimensional optimization problems.
Policy-based and hybrid methods also see increasing use in multi-agent or hierarchical environments. Wang et al. [78] apply an Actor–Critic approach to integrated energy system optimization, managing continuous variables while balancing global and subsystem indices. Cho et al. [79] introduce REINFORCE within a graph neural network-based policy for dynamic crane scheduling, demonstrating flexibility and real-time adaptability. Che et al. [80] propose a novel hybrid approach, DRL-MOEA, where PPO is embedded within a multi-objective evolutionary algorithm (MOEA) to enhance scheduling of air separation units, merging the global search of MOEAs with the local policy optimization of PPO.
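As an illustration of the value-based methods used for discrete maintenance and scheduling decisions, the sketch below computes a Double DQN bootstrap target. It assumes that "DDQN" denotes Double DQN; the function name, the three-action example, and all numerical values are hypothetical and do not reproduce the setup of ref. [75].

```python
def double_dqn_target(reward, next_q_online, next_q_target,
                      gamma=0.99, done=False):
    """Return the Double DQN target for one transition.

    next_q_online / next_q_target hold the action values of the next state
    as estimated by the online and target networks, respectively.
    """
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # The online network selects the action ...
    best_action = max(range(len(next_q_online)),
                      key=next_q_online.__getitem__)
    # ... and the target network evaluates it, which reduces overestimation.
    return reward + gamma * next_q_target[best_action]


# Hypothetical maintenance example with three actions (e.g., wait, inspect, replace).
target = double_dqn_target(reward=-1.0,
                           next_q_online=[0.2, 0.9, 0.5],
                           next_q_target=[0.3, 0.4, 0.8],
                           gamma=0.9)
```

Decoupling action selection from action evaluation is the only difference from the plain DQN target, which would instead take the maximum of `next_q_target` directly.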
Reinforcement learning is also being integrated into quality monitoring systems. Zhang et al. [81] address surface defect detection through a dual-agent RL framework (DuAK), combining knowledge graph reasoning with path exploration in high-dimensional environments. This value-based framework improves defect traceability and performance on real and benchmark datasets, reflecting an emerging use of RL in knowledge-based diagnostic applications.
Across steel applications, ensemble-enhanced PPO/DDPG methods show strong performance in continuous flatness control, while Q-learning variants remain effective for discrete maintenance and scheduling tasks. However, these advances are counterbalanced by persistent challenges; offline RL methods still face robustness issues when exposed to high-variance industrial conditions, and hybrid evolutionary–RL schedulers require substantial tuning to ensure stability in large-scale steel production operations. The most significant elements from this subsection are summarized in Table 2.
6.3. Oil and Gas Industries
This section discusses the application of reinforcement learning (RL) within the oil and gas industries where it supports critical tasks such as production scheduling, flow control, interface tracking, soft sensor development, and fault diagnosis. These applications span both discrete and continuous decision spaces and increasingly incorporate hybrid or adaptive RL strategies tailored to complex, nonlinear environments.
In continuous control contexts, ref. [82] applies Deep Deterministic Policy Gradient (DDPG) to regulate flow levels in a three-phase separator, achieving improved separation efficiency through simulation using CFD models. Similarly, ref. [83] employs Proximal Policy Optimization (PPO) to solve large-scale refinery production scheduling problems. By dividing the scheduling horizon into time slots and initializing the policy with operational knowledge, the RL agent efficiently coordinates local decisions for global optimization, outperforming traditional solvers. These applications demonstrate RL’s scalability and suitability for high-dimensional, continuous action spaces.
In contrast, ref. [84] addresses discrete production scheduling by using a model-based Markov Decision Process (MDP) to dynamically adjust genetic algorithm parameters. This hybrid method, NSGAeRL, improves solution convergence and quality over standard GAs. Also tackling a discrete decision space, ref. [85] applies asynchronous actor–critic algorithms (A3CS/A3CF) to autonomously select cross-domain data for soft sensor design in an SAGD process, allowing performance-driven modeling even with limited target-domain data.
Policy-value hybrids dominate in model-free control applications. Dogru et al. [86] utilize A3C for froth-middlings interface tracking in oil sands processing, overcoming visual occlusion challenges by integrating sensor feedback with learned visual representations. Similarly, ref. [87] adopts an online policy iteration strategy from adaptive dynamic programming to develop a fault-tolerant controller for offshore steel platforms. This approach estimates actuator faults and disturbances in real time, enhancing control stability under uncertain dynamics.
These developments show that actor–critic architectures and hybrid RL–optimization schemes are particularly promising for high-dimensional refinery scheduling and multiphase flow control. Yet, methodological barriers remain; many solutions depend on extensive domain knowledge for reward shaping or initialization, and policy generalization across changing reservoir or process conditions remains insufficiently validated.
The most significant elements from this subsection are summarized in Table 3.
6.4. Food and Beverage Industry
This section discusses the application of reinforcement learning (RL) within the food and beverage industry, focusing on diverse technical contexts such as production scheduling, thermal sterilization, concentration control, and product inspection. Across these applications, a variety of RL approaches, including value-based methods (Q-learning, DQN), hybrid methods (fuzzy approximations), and multi-agent strategies, are explored to address both discrete and continuous decision spaces.
In concentration control of a continuous stirred tank reactor (CSTR), ref. [88] applied Q-learning with fuzzy function approximation to overcome challenges posed by continuous state-action spaces, demonstrating superior performance over traditional PID control in simulation. Similarly, ref. [89] implemented a model-free Q-learning controller to manage temperature profiles during thermal sterilization in batch processes, emphasizing RL’s suitability for uncertain, nonlinear systems. In broader supply chain management, ref. [90] employed Multi-Agent RL (MARL) within an agent-based simulation to optimize procurement, production, and distribution for an ice cream manufacturer, with digital twin integration under development. This approach highlights the advantages of hybridizing RL with MILP and simulation frameworks. Barthwal et al. [91] offer a comprehensive review of RL’s role in food industry contexts such as robotic navigation and safety inspection, emphasizing Q-learning and DQN for discrete decision making. A common trend emerges in the early-stage but expanding implementation of actor–critic and hybridized RL models tailored to complex, real-world industrial environments.
Overall, value-based and fuzzy-approximation RL controllers perform well in nonlinear thermal and concentration processes, while MARL strategies show early promise in integrated supply-chain control. Nonetheless, the sector still faces significant challenges in obtaining sufficiently rich datasets and validating RL policies in highly variable production environments, which limits broader industrial deployment.
The most significant elements from this subsection are summarized in Table 4.
6.5. Mining Industry
This section discusses the application of reinforcement learning (RL) within the mining industry, focusing on diverse technical challenges such as flotation control, ash content optimization, and equipment alignment. These applications span both continuous control (e.g., flotation and dense medium separation, DMS) and discrete decision making (e.g., transfer point alignment), with a notable trend toward policy-based and hybrid RL strategies.
In flotation control, ref. [92] applied Adaptive Dynamic Programming in an actor–critic configuration to optimize recovery and grade without requiring precise process models, demonstrating efficacy in simulation. Zheng et al. [93] extended this line of work by proposing a Hybrid Model-Based RL (HMBRL) algorithm combining fuzzy inference, ensemble critics, and physical models, outperforming PPO, SAC, and MBPO in predictive accuracy and sample efficiency. Similarly, ref. [94] optimized the DMS process using an actor–critic method integrated with a layered model-based controller, enabling online setpoint updates and outperforming PI+MPC in simulation. Addressing equipment alignment, ref. [95] implemented PPO and SAC to synchronize a Bucket Wheel Excavator, Belt Wagon, and Hopper Car, highlighting RL’s capacity for real-time adaptation in discrete, complex environments.
It is worth noting that additional reinforcement learning applications in mineral processing exist beyond those retrieved by our PRISMA-defined search strategy. For example, ref. [96] implements a PPO-based controller for a conventional continuous thickener, improving underflow concentration stability and demonstrating RL’s potential for complex solid–liquid separation units. This study did not appear under the predefined query strings and therefore did not enter the PRISMA screening flow; it is mentioned here solely as relevant external context and is not included in the tables or quantitative analysis.
In the reviewed studies, actor–critic and hybrid model-based methods outperform traditional PI+MPC benchmarks in simulation, particularly in flotation and DMS control. However, the mining sector continues to lack large-scale operational validations, and model-based hybrids depend on physical or fuzzy surrogates whose accuracy may degrade under volatile ore characteristics and plant disturbances.
The most significant elements from this subsection are summarized in Table 5.
6.6. Pharmaceutical Industry
This section discusses the application of reinforcement learning (RL) within the pharmaceutical industry, where high costs and development times (averaging USD 200 million and 10–15 years per innovative drug) drive the need for more efficient solutions. RL has emerged as a valuable tool in this context, addressing complex tasks in drug design, bioprocess control, and materials discovery.
In continuous biopharmaceutical manufacturing, ref. [97] applied model-free RL using Monte Carlo simulations to optimize process chromatography under uncertain dynamics, highlighting its advantage in scenarios lacking explicit process models. For small molecule drug discovery, ref. [98] explored value-based (Q-learning, SARSA) and policy-based (PPO) methods to accelerate early-stage design, reduce costs, and enhance chemical space exploration. While RL-generated compounds have entered clinical trials, limitations persist in synthesis feasibility and data quality. Kim et al. [99] used PPO in RL-guided combinatorial chemistry (RL-CC) for HIV drug synthesis, showing how sequential fragment selection via RL can target extreme properties and scale effectively. Across cases, policy-based approaches like PPO dominate for their stability and adaptability. The application of RL in both discrete molecular generation and continuous control underscores its growing role in transforming pharmaceutical innovation through data-driven, cost-saving methodologies.
Across applications, policy-based algorithms such as PPO consistently demonstrate robustness for both combinatorial drug design and nonlinear bioprocess control. Yet, practical limitations persist; RL-generated molecules often face synthesis constraints, and bioprocess optimizers lack real-world validation due to safety and cost barriers. These gaps underline the need for domain-informed reward structures and hybrid workflows that combine RL with established pharmaceutical design methodologies.
The most significant elements from this subsection are summarized in Table 6.
6.7. Semiconductor Industry
This section discusses the application of reinforcement learning (RL) within the semiconductor industry, focusing on run-to-run (R2R, RtR) control in Chemical Mechanical Polishing (CMP) processes. CMP, a critical yet variable-intensive process, has motivated the integration of RL to enhance robustness and adaptability. Ma et al. [100] employ a hybrid strategy, using TD3 (an actor–critic deep RL algorithm for continuous action spaces) to dynamically adjust weights in a double exponentially weighted moving average (dEWMA) controller, enabling improved disturbance rejection and tracking in a simulated environment. Similarly addressing CMP variability, ref. [101] applies structured control network DDPG (SCN-DDPG), a structured actor–critic deep RL variant, incorporating dual networks to estimate system states and define control policies for precise material removal. This approach, operating in a simulated and test-validated setting, emphasizes model-free learning and nonlinear policy representation. In contrast, ref. [102] addresses discrete decision making, such as polish time selection, through a hybrid DQN-RtR controller. The value-based DQN component governs actions when measurement data is unavailable, while the RtR-EWMA handles periods with valid feedback. Collectively, these studies exhibit a trend toward hybridized control architectures, leveraging actor–critic frameworks and DRL for both continuous and discrete control tasks. Hybrid RL–RtR controllers show strong advantages in handling CMP drift, variability, and missing metrology data, with TD3 and DQN-based designs outperforming classical EWMA baselines. Even so, challenges remain; policy transfer between tools or product lines is rarely demonstrated, and the stability of RL-driven RtR decisions under process drifts still requires systematic evaluation.
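The idea of letting an RL agent tune the weights of a dEWMA run-to-run controller can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of ref. [100]: the controller uses one common textbook form of the dEWMA update for a linear model y = a + D + b·u, the toy plant and its drift are invented, and `heuristic_weights` is a hypothetical stand-in for a trained TD3 actor.

```python
class DEWMAController:
    """Double-EWMA run-to-run controller for the model y = a + D + b * u."""

    def __init__(self, target, gain, intercept=0.0, drift=0.0):
        self.target = target        # desired process output
        self.gain = gain            # b: assumed estimate of the process gain
        self.intercept = intercept  # a: running estimate of the offset
        self.drift = drift          # D: running estimate of the per-run drift

    def recipe(self):
        """Input that hits the target according to the current estimates."""
        return (self.target - self.intercept - self.drift) / self.gain

    def update(self, u_prev, y_meas, lam1, lam2):
        """Update both EWMA estimates with externally supplied weights."""
        residual = y_meas - self.gain * u_prev
        prev_intercept = self.intercept
        self.intercept = lam1 * residual + (1.0 - lam1) * prev_intercept
        self.drift = (lam2 * (residual - prev_intercept)
                      + (1.0 - lam2) * self.drift)
        return self.recipe()


def heuristic_weights(error):
    """Hypothetical stand-in for an RL actor: larger weights after large errors."""
    return (0.7, 0.4) if abs(error) > 0.5 else (0.3, 0.1)


def simulate(runs=20):
    """Run the controller on an invented drifting plant; return tracking errors."""
    ctrl = DEWMAController(target=10.0, gain=2.0)
    u = ctrl.recipe()
    errors = []
    for k in range(runs):
        y = 1.0 + 0.05 * k + 2.0 * u  # toy plant: offset 1.0, drift 0.05 per run
        error = y - ctrl.target
        errors.append(error)
        lam1, lam2 = heuristic_weights(error)
        u = ctrl.update(u, y, lam1, lam2)
    return errors


errors = simulate()
```

In an RL-based design, the weight-selection function would be replaced by the learned policy, with the tracking error (and possibly its history) as the observation and the two weights as the continuous action.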
The most significant elements from this subsection are summarized in Table 7.
6.10. Convergence of RL and the Process Industries
Figure 7 summarizes the cross-industry patterns identified throughout this section. The diagram abstracts common structural features observed across chemical, steel, oil and gas, and food and beverage applications. Across these sectors, reinforcement learning (RL) is shaped by shared operational constraints—safety, product quality, economic performance, and regulatory compliance—which give rise to consistent industrial requirements such as constraint handling, interpretability, data efficiency, and robustness. These requirements, in turn, influence RL design choices, including the preference for hybrid architectures, the adoption of digital-twin-enabled sim-to-real pipelines, and algorithm selection aligned with the nature of the decision task (discrete scheduling, continuous control, or distributed coordination). Collectively, these elements indicate an ongoing shift from isolated proofs of concept toward verifiable, explainable, and deployment-oriented industrial RL architectures.
To operationalize the dual contextual–taxonomic framework outlined in the introduction, Table 10 summarizes how the major RL algorithm families are distributed across the industrial sectors considered in this review. For clarity, policy-based and actor–critic methods are displayed as separate categories, allowing the table to capture sector-level granularity that is not visible in more abstract taxonomies. This choice makes the cross-industry prominence of actor–critic designs particularly evident, a pattern consistent with their suitability for continuous control. It also shows how value-based and hybrid approaches cluster around scheduling, optimization, and safety-constrained tasks.
To articulate the conceptual mechanisms linking industrial constraints with RL design strategies, we introduce below a set of derived propositions that formalize the cross-industry regularities observed in this review.
1. Industries operating under strong safety, quality, or regulatory constraints favor RL methods with explicit mechanisms for constraint handling and verifiable behavior. This is reflected in the widespread use of hybrid RL–MPC/EMPC frameworks, model-based safety layers, and reward functions that enforce operational limits.
2. Continuous-process industries tend to converge toward actor–critic architectures due to their compatibility with continuous action spaces and fine-grained actuator control. Evidence appears in reactor control, run-to-run semiconductor manufacturing, steel flatness control, flotation circuits, and energy-generation systems.
3. Discrete sequencing or scheduling problems predominantly rely on value-based RL methods. Q-learning variants and DQN derivatives are common in steel maintenance scheduling, refinery and batch-plant scheduling, and food-processing logistics, reflecting their efficiency in finite action spaces.
4. Systems requiring coordination across distributed assets adopt multi-agent or hierarchical RL variants. This pattern emerges in multiloop CSTRs, supply-chain coordination, VAM-plant operations, and multi-unit material-handling systems.
5. When experimentation is costly or disruptive, industries give preference to digital twins, offline RL, and sim-to-real transfer pipelines. This trend is shared across chemical, oil and gas, mining, and energy systems where plant access is constrained or disturbances carry operational risk.
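The propositions above can be read as a rough selection heuristic. The sketch below encodes them as a small lookup function; the `TaskProfile` fields, the function name, and the returned strings are illustrative assumptions introduced here, and the output should be treated as a starting point for discussion rather than a prescriptive rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Coarse description of an industrial decision task (hypothetical schema)."""
    continuous_actions: bool
    distributed_assets: bool = False
    hard_constraints: bool = False
    costly_experiments: bool = False


def suggest_design(task):
    """Map a task profile to design hints following propositions 1-5."""
    hints = []
    # Propositions 2 and 3: the action space drives the algorithm family.
    if task.continuous_actions:
        hints.append("actor-critic (e.g., DDPG, TD3, SAC)")
    else:
        hints.append("value-based (e.g., Q-learning, DQN variants)")
    # Proposition 4: coordination across distributed assets.
    if task.distributed_assets:
        hints.append("multi-agent or hierarchical RL")
    # Proposition 1: explicit constraint handling.
    if task.hard_constraints:
        hints.append("hybrid RL-MPC/EMPC or model-based safety layer")
    # Proposition 5: limited plant access.
    if task.costly_experiments:
        hints.append("digital twin, offline RL, or sim-to-real transfer")
    return hints


# Hypothetical example: a constrained continuous reactor-control task.
reactor_hints = suggest_design(
    TaskProfile(continuous_actions=True, hard_constraints=True))
```

For instance, a discrete scheduling task with limited plant access would instead yield a value-based family combined with an offline or simulation-based training pipeline.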