Systematic Review of Reinforcement Learning in Process Industries: A Contextual and Taxonomic Approach

Paz Ramos, Marco Antonio; Busboom, Axel

doi:10.3390/app152412904


by Marco Antonio Paz Ramos * and Axel Busboom *

Department of Engineering and Management, Munich University of Applied Sciences, 80335 Munich, Germany

* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(24), 12904; https://doi.org/10.3390/app152412904

Submission received: 12 November 2025 / Revised: 28 November 2025 / Accepted: 4 December 2025 / Published: 7 December 2025

(This article belongs to the Special Issue Applications of Artificial Intelligence in Industry 4.0/5.0: Innovations, Challenges, and Future Directions)


Abstract

The process industry (PI) plays a vital role in the global economy and faces mounting pressure to enhance sustainability, operational agility, and resource efficiency amid tightening regulatory and market demands. Although artificial intelligence (AI) has been explored in this domain for decades, its adoption in industrial practice remains limited. Recently, machine learning (ML) has gained momentum, particularly when integrated with core PI systems such as process control, instrumentation, quality management, and enterprise platforms. Among ML techniques, reinforcement learning (RL) has emerged as a promising approach to tackle complex operational challenges. In contrast to conventional data-driven methods that focus on prediction or classification, RL directly addresses sequential decision making under uncertainty, a defining characteristic of dynamic process operations. Given RL’s growing relevance, this study conducts a systematic literature review to evaluate its current applications in the PI, assess methodological developments, and identify barriers to broader industrial adoption. The review follows the PRISMA methodology, a structured framework for identifying, screening, and selecting relevant publications. This approach ensures alignment with a clearly defined research question and minimizes bias, focusing on studies that demonstrate meaningful industrial applications of RL. The findings reveal that RL is transitioning from a theoretical construct to a practical tool, particularly in the chemical sector and for tasks such as process control and scheduling. Methodological maturity is improving, with algorithm selection increasingly tailored to problem-specific requirements and a trend toward hybrid models that integrate RL with established control strategies. However, most implementations remain confined to simulated environments, underscoring the need for real-world deployment, safety assurances, and improved interpretability. 
Overall, RL exhibits the potential to serve as a foundational component of next-generation smart manufacturing systems.

Keywords: reinforcement learning; process industries; process manufacturing; industrial digitalization; machine learning; smart process manufacturing

1. Introduction

The process industry is a cornerstone of the global economy, deeply embedded in worldwide supply chains. Over the centuries, it has undergone continuous transformation, driven by scientific and technological advances arising from successive industrial revolutions [1,2]. Despite these advancements, the inherent complexity of process operations continues to pose significant challenges. Many systems are too intricate to be fully captured by traditional analytical techniques, such as first-principles modeling, and their behavior is increasingly influenced by variable raw materials, energy constraints, and sustainability requirements.

Within this broader context, process industries (such as chemical, petrochemical, and pharmaceutical production) are characterized by complex multiscale dynamics, nonlinearities, long time delays, and heterogeneous data sources. These characteristics often lead to highly coupled decision-making problems that extend beyond process control, encompassing areas such as scheduling, energy management, and logistics optimization. Traditional model-based control and optimization methods, while mature and widely adopted, face limitations in scalability and adaptability under such challenging operating conditions. Consequently, reinforcement learning (RL) has emerged as a promising paradigm capable of learning effective decision policies directly from data and interaction, thereby addressing both the dynamic and combinatorial nature of process-industry problems.

Within automatic control, Model Predictive Control (MPC) has long been the dominant paradigm for constrained multivariable control in the process industries. However, MPC performance depends on the availability of accurate first-principles or empirical models, which are often expensive to maintain and difficult to generalize across varying operating conditions. RL offers a complementary and increasingly practical alternative. Unlike classical methods, RL encompasses both model-free and model-based formulations. Hybrid approaches are becoming feasible using digital twins, that is, high-fidelity simulation models that provide safe environments for RL training, bridging the gap between model-based design and data-driven learning. In such settings, RL acts as an adaptive layer that can learn from synthetic experience, mitigating the risks and costs associated with real-plant experimentation [3].

In recent years, RL applications have expanded beyond closed-loop control to broader decision-making layers, such as production scheduling, demand response, and maintenance optimization. These advances reflect a shift from isolated control improvements toward integrated operational intelligence, where learning agents interact with digital representations of industrial systems to optimize performance holistically. This growing body of work reflects a steady evolution from theoretical feasibility studies toward practical, domain-oriented implementations.

Three recent survey papers have examined reinforcement learning (RL) in industrial and process-control contexts. Among them, Nian, Liu, and Huang [4] provide a widely cited tutorial-style introduction that explains RL fundamentals and illustrates their use in process control through conceptual and algorithmic comparisons. Faria et al. [5] focus on integration aspects such as offline training, transfer learning, and imitation learning, while Dogru et al. [6] extend the discussion across multiple layers of the process-systems hierarchy, including scheduling and supply-chain functions.

While these surveys have significantly advanced the understanding of RL in industrial environments, they differ from the present work in three ways. First, they are not systematic reviews following, for example, the PRISMA reporting guidelines. Second, they emphasize tutorial or conceptual exposition, whereas the present work synthesizes applications across sectors and tasks using a dual contextual–taxonomic analytical framework. Third, these earlier surveys cover literature up to 2020–2022, whereas the present review includes developments through 2024, including emerging trends such as hybrid RL, the rise of actor–critic methods, and digital twin-enabled training pipelines. Accordingly, this study complements prior surveys by offering a structured, PRISMA-based synthesis that maps RL methods to specific industrial contexts and highlights cross-sector patterns as well as application gaps not captured by earlier non-systematic surveys.

Building on these distinctions, our central research question can be formulated as follows: How is reinforcement learning applied in the process industries? To this end, we conduct a systematic literature review guided by the PRISMA methodology (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), ensuring transparency, reproducibility, and methodological rigor in the identification, screening, eligibility assessment, and inclusion of relevant studies [7]. The review introduces a dual contextual (sector- and task-oriented) and taxonomic (algorithm- and integration-oriented) analytical framework to systematically map RL applications across the process industries. In doing so, it extends earlier surveys through an updated and organized synthesis that links methodological advances in RL to practical industrial adoption pathways, offering new perspectives on the maturity, challenges, and trends shaping RL-driven smart manufacturing systems.

2. Process Industry

The process industries are a foundational pillar of global industrial and economic development. In 2022, the process industries in the United States contributed approximately 5.3% to the national GDP [8]—an amount comparable to the entire GDP of countries such as Spain or Indonesia. Discussing the process industry as a distinct sector is essential for understanding its complex and pivotal role at the intersection of technology, economy, and development.

2.1. Taxonomic Overview of the Process Industry

Figure 1 presents a synthesized taxonomy derived from multiple sources, offering a hierarchical perspective from broad economic activities to specific process industries.

Manufacturing can be divided into Process Manufacturing and Discrete Manufacturing. Discrete manufacturing typically involves the production of countable, distinct items, whereas process manufacturing focuses on the production of measurable, indistinct products [9].

Alternatively, ref. [10] posits that manufacturing comprises two main pillars: Process Industries and Assembly-based Industries. Ref. [11] further distinguishes these based on input materials, noting that process industries utilize raw materials or ingredients, while assembly industries rely on components.

According to [12], process industries are characterized by operations such as chemical reactions, mixing, blending, extrusion, sheet forming, slitting, baking, and annealing. The resulting products may be solids (e.g., rolls, spools, sheets, tubes), powders, pellets, liquids or gases, packaged in various forms. These products often serve as inputs for other manufacturing processes, with examples including paints, processed foods and beverages, paper goods, plastic films, fibers, carpets, glass, and ceramics. Ref. [13] identifies key sectors within the process industry, such as petroleum, chemical engineering, steel, nonferrous metals, and construction materials.

The Association for Supply Chain Management (ASCM) dictionary defines process manufacturing as “production that adds value by mixing, separating, forming, and/or performing chemical reactions. It may be done in either batch or continuous mode” [14]. Thus, the terms Process Industry and Process Manufacturing are often used interchangeably, with the former being more prevalent in the literature.

Figure 1. Taxonomic classification of the process industry [9,10,11,12,14,15].

2.2. Technological Drivers of Innovation in the Process Industry

Since their inception, process industries have relied on enabling technologies to ensure stability, reliability, and efficiency in production systems. A landmark of this technological evolution was Watt’s flyball governor—an early feedback device that enabled automatic control of steam engines during the first industrial revolution. Its conceptual influence extended far beyond mechanics, inspiring Maxwell’s 1868 theoretical formulation that founded modern control engineering [16,17].

The integration of control engineering with process industries in the 1920s marked the emergence of process control as a distinct discipline [18]. Instrumentation—encompassing measurement, actuation, and control infrastructure—evolved into industrial instrumentation, supporting systems such as Programmable Logic Controllers (PLCs), Supervisory Control and Data Acquisition (SCADA), and Distributed Control Systems (DCS). More recently, soft sensors have emerged, using software algorithms to estimate unmeasured process variables [19].

Before the digital era, only technologies physically embedded in the process could be adopted. For instance, the Proportional-Integral-Derivative (PID) control law, developed in the early 20th century, was not widely implemented until analog pneumatic computing matured decades later [20]. The advent of digital computing broadened the technological scope from purely physical systems to cybernetic and eventually cyber-physical systems.

This evolution enabled the incorporation of multi-domain technologies (Figure 2). With the advent of Industry 4.0, platforms such as Manufacturing Execution Systems (MES), Quality Management Systems (QMS), Supply Chain Planning (SCP), and Enterprise Resource Planning (ERP) have gained relevance in process environments [21,22,23].

Emerging technologies including Big Data analytics, cloud computing, cybersecurity, the Industrial Internet of Things (IIoT), and digital twins are increasingly integrated into process industries [24]. Among these, Artificial Intelligence (AI) and its subset, Machine Learning (ML), stand out for their potential to enhance and interconnect existing systems.

2.3. Process Industry Challenges and the Opportunity for Reinforcement Learning

Despite its technological maturity, the process industry continues to face challenges that constrain further optimization. Complex multivariable dynamics, nonlinear behavior, and operational uncertainty coexist with safety and quality requirements. At the same time, sustainability goals and economic pressures demand greater efficiency, flexibility, and resource utilization [11,24,25].

Traditional process control and optimization strategies—while highly effective under well-characterized regimes—often rely on extensive modeling, periodic recalibration, and conservative safety margins. As operational conditions evolve and data availability increases, there is a growing need for adaptive, data-driven methods that complement model-based approaches.

Against this backdrop, Reinforcement Learning emerges not as a disruptive replacement but as a natural extension of the technological trajectory of the process industry. Its ability to learn control and decision strategies from interaction aligns with the realities of modern plants, where digital instrumentation, simulation environments, and historical data are readily available. Building upon existing control hierarchies and digital infrastructures, RL offers a feasible and cost-effective pathway toward adaptive optimization and informed decision support across both operational and business layers.

3. Strategic Positioning of RL in AI

Reinforcement learning (RL) has evolved from a niche academic pursuit to an important pillar of artificial intelligence (AI). The following sections explore RL’s position within the broader AI landscape and examine key milestones that have propelled its growth over the past decade.

3.1. Reinforcement Learning in the AI Landscape

Machine learning has emerged as a central subfield of artificial intelligence. As depicted in Figure 3, it encompasses four primary learning paradigms: (1) supervised learning, (2) unsupervised learning, (3) semi-supervised learning, and (4) reinforcement learning [26]. Supervised learning involves training models on labeled datasets; unsupervised learning identifies patterns within unlabeled data; semi-supervised learning combines both labeled and unlabeled data; and reinforcement learning enables agents to determine optimal action sequences through interactions with their environment [27].

Reinforcement learning operates on a fundamentally different paradigm from supervised and unsupervised learning. While the latter are dataset-driven and designed to extract patterns from static information, RL is rooted in an interactive process of sequential decision making and learning from environmental feedback.

Unlike other AI domains with identifiable seminal works, RL has evolved through the integration of diverse disciplines, notably optimal control theory and behavioral psychology. This interdisciplinary foundation is evident in foundational surveys such as [28] and subsequent reviews and analyses [29].

3.2. Key Milestones in Reinforcement Learning

Reinforcement learning has gained considerable momentum over the past decade, as reflected in both research activity and public interest, with a notable inflection point around 2016. Figure 4, which illustrates global search trends for the term, shows a significant uptick beginning in that period.

This trend aligns with foundational innovations such as the Deep Q-Network (DQN) [30], the Deterministic Policy Gradient (DPG) [31], and the Deep Deterministic Policy Gradient (DDPG) [32]. These methods expanded RL’s capabilities, with DPG and DDPG in particular handling continuous action spaces, thereby broadening its applicability.

A pivotal moment contributing to this heightened visibility was the 2016 match between Go champion Lee Sedol and AlphaGo, a system developed by DeepMind Technologies, a company founded in 2010 and acquired by Google in 2014 [33]. AlphaGo showcased the power of combining supervised learning with RL. The subsequent development of AlphaGo Zero, which learned solely through deep reinforcement learning, further demonstrated the technology’s capabilities [34]. These events garnered widespread media attention, sparking public discourse on AI’s capabilities and ethical dimensions [35,36].

This growing interest is mirrored in academic output. A ScienceDirect query for “Reinforcement Learning” reveals 45,381 results as of December 2024, with annual publications transitioning from linear to exponential growth after 2016, as shown in Figure 5.

In summary, the past decade has marked a transformative period for reinforcement learning. This accelerated expansion was driven by a synergy of factors: groundbreaking research exemplified by AlphaGo, a surge in academic and public engagement, and crucial enabling technologies. The wide availability of open-source libraries like TensorFlow and PyTorch, standardized environments such as OpenAI Gym, and a substantive increase in affordable computing power via GPUs and TPUs were instrumental. This convergence of theory, application, and accessible tools has solidified RL’s position within modern AI, poised for continued growth.

4. RL Architectures Relevant to This Review

This section provides a condensed, taxonomy-oriented overview of reinforcement learning (RL) architectures relevant to the studies analyzed in this review. The structure follows the widely used OpenAI taxonomy [37], with adjustments to ensure coherence with algorithmic classes prevalent in process-industry applications. The goal is not to provide a tutorial, but a compact reference framework that clarifies how RL families relate to the industrial patterns identified throughout the paper.

Reinforcement learning is formulated on a Markov Decision Process (MDP), defined by the tuple $(S, A, P, R, \gamma)$ [38], where:

$S$ is the state space;
$A$ is the action space (discrete or continuous);
$P(s' \mid s, a)$ is the transition probability;
$R(s, a, s')$ is the reward;
$\gamma \in [0, 1]$ is the discount factor.

The agent seeks a policy $\pi_{\theta}(a \mid s)$ that maximizes the expected discounted return

$$G_{t} = \sum_{k=0}^{\infty} \gamma^{k} R(S_{t+k}, A_{t+k}, S_{t+k+1}), \tag{1}$$

where $G_{t}$ is the cumulative reward from time $t$ onward.
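As a small illustration of Equation (1), the following sketch computes the discounted return for a finite reward sequence. It is a minimal example under stated assumptions: the infinite sum is truncated to the length of the list, and the function name `discounted_return` and the reward values are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Finite-horizon version of Equation (1): sum over k of gamma^k * r_{t+k}."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Accumulate backwards, using the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards (not taken from any reviewed study).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward recursion avoids computing powers of the discount factor explicitly and mirrors the way returns are usually accumulated over a recorded episode.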

4.1. Value-Based Algorithms

Value-based algorithms learn an action-value function $Q(s, a)$, from which a greedy or $\epsilon$-greedy policy is derived. These methods dominate industrial problems with discrete decisions such as scheduling, sequencing, and fault mitigation. A fundamental update rule is the Q-learning recursion [39]

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \tag{2}$$

where $\alpha$ is the learning rate, $r = R(s, a, s')$ is the reward, and the TD target is $r + \gamma \max_{a'} Q(s', a')$.

Representative algorithms include Q-learning, Deep Q-Networks (DQN) [40], DRQN [41], and SARSA [42]. These methods are widely adopted in steel scheduling, batch sequencing, and discrete decision systems in the chemical and food sectors.
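The Q-learning recursion in Equation (2) can be sketched for a tabular setting as follows. This is an illustrative example only: the state labels, the action names `hold` and `switch`, and the numerical values are hypothetical, and the table is a plain dictionary rather than any particular library structure.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one step of Equation (2) to the table q in place; return the TD error."""
    # TD target: immediate reward plus discounted value of the best next action.
    td_target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    td_error = td_target - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


actions = ["hold", "switch"]      # hypothetical discrete decisions
q = defaultdict(float)            # unseen (state, action) pairs start at 0.0
q[("s1", "hold")] = 1.0           # assumed prior estimate, for illustration
err = q_learning_update(q, "s0", "switch", r=0.5, s_next="s1", actions=actions)
print(err, q[("s0", "switch")])   # TD error 1.4, updated value close to 0.14
```

Deep variants such as DQN replace the table with a neural network, but the target inside the brackets of Equation (2) keeps the same form.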

4.2. Policy-Based and Actor–Critic Algorithms

Policy-based methods directly optimize a differentiable stochastic policy $\pi_{\theta}(a \mid s)$ through gradient ascent, where $\theta$ denotes the policy parameters. The general update rule follows the Policy Gradient Theorem

$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) \, A^{\pi}(s, a) \right], \tag{3}$$

where $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ is the advantage function.

Actor–critic methods augment policy gradients with a learned value function $V^{\pi}(s)$ or $Q^{\pi}(s, a)$ to reduce variance and improve sample efficiency. Examples include AC, A2C/A3C [43], TRPO [44], and PPO [45].

These architectures dominate continuous control in process industries, including reactor temperature control, multiloop supervision, and semiconductor run-to-run (R2R) optimization.
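To make Equation (3) concrete, the sketch below evaluates a single-sample policy-gradient estimate for a softmax policy over a small discrete action set in one fixed state. The softmax parameterization, the function names, and the advantage value are assumptions made for illustration; in an actor–critic method the advantage would come from a learned critic rather than being supplied by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_sample(logits, action, advantage):
    """Single-sample estimate of Equation (3) with respect to the logits.

    For a softmax policy, the gradient of log pi(a) with respect to the
    logits is onehot(a) - pi.
    """
    probs = softmax(logits)
    return [((1.0 if i == action else 0.0) - p) * advantage
            for i, p in enumerate(probs)]


logits = [0.0, 0.0]                                   # uniform initial policy
grad = policy_gradient_sample(logits, action=1, advantage=2.0)
print(grad)                                           # [-1.0, 1.0]

# One gradient-ascent step with an illustrative learning rate.
lr = 0.1
logits = [th + lr * g for th, g in zip(logits, grad)]
print(softmax(logits))                                # action 1 becomes more likely
```

A positive advantage raises the probability of the sampled action and lowers the others, which is the mechanism that TRPO and PPO then constrain to keep updates stable.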

4.3. Hybrid Actor–Critic Algorithms

Hybrid algorithms combine actor–critic principles with off-policy sampling or architectural enhancements such as twin critics or entropy regularization. They excel in high-dimensional continuous systems typical of process control.

DDPG [32] uses deterministic policies

$$\nabla_{\theta} J(\mu_{\theta}) \approx \mathbb{E}\left[ \nabla_{a} Q(s, a) \big|_{a = \mu_{\theta}(s)} \cdot \nabla_{\theta} \mu_{\theta}(s) \right], \tag{4}$$

where $\mu_{\theta}(s)$ is the deterministic actor and $Q$ is an off-policy critic.

TD3 [46] improves stability through clipped double Q-learning and target policy smoothing.

SAC [47] introduces an entropy-augmented policy objective

$$J(\pi_{\theta}) = \mathbb{E}_{s, a}\left[ Q(s, a) + \alpha_{\mathrm{temp}} H(\pi_{\theta}(\cdot \mid s)) \right], \tag{5}$$

where $H$ is Shannon entropy and $\alpha_{\mathrm{temp}}$ regulates the exploration–exploitation balance.

These hybrid algorithms dominate in chemical CSTR control, steel strip flatness control, semiconductor processes, and nonlinear multivariable plants.
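The two TD3 mechanisms named above, clipped double Q-learning and target policy smoothing, can be sketched as a critic target computation for a scalar action. This is a simplified illustration rather than a full TD3 implementation: the target actor and the two target critics are passed in as plain callables, and the noise scale, clipping range, and action bounds are illustrative assumptions.

```python
import random


def td3_target(r, s_next, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0, rng=random):
    """Critic target for one transition with a scalar action."""
    # Target policy smoothing: perturb the target action with clipped noise,
    # then clip the result back into the admissible action range.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    a_next = max(a_low, min(a_high, target_actor(s_next) + noise))
    # Clipped double Q-learning: use the smaller of the two critic estimates
    # to counteract overestimation of action values.
    q_min = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * (0.0 if done else 1.0) * q_min


# Hypothetical stand-ins for trained networks, for illustration only.
actor = lambda s: 0.5
q1 = lambda s, a: 2.0
q2 = lambda s, a: 1.0
print(td3_target(1.0, None, False, actor, q1, q2, gamma=0.9, noise_std=0.0))  # about 1.9
```

Taking the minimum of the two critics is what distinguishes this target from the one used in DDPG, where a single critic is evaluated at the unperturbed target action.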

4.4. Model-Based RL

Model-based RL (MBRL) learns explicit approximations of environment dynamics

$$\hat{P}(s' \mid s, a), \quad \hat{R}(s, a, s'), \tag{6}$$

where $\hat{P}$ and $\hat{R}$ denote learned approximations of the transition and reward models. These learned models are then used in two complementary ways:

1. Planning: simulating rollouts and planning action sequences (e.g., MPC-like strategies).
2. Data generation: producing synthetic trajectories to train a model-free RL method (e.g., MBPO [48]).

Recurrent model-based approaches (RSMB-RL) [49] incorporate temporal memory via latent-state models. Model-based RL is particularly relevant in process industries due to limited data availability, safety constraints, and the widespread use of digital twins.
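The planning use of a learned model can be illustrated with a deliberately small sketch: a scalar linear model is fitted to recorded transitions by least squares, and a random-shooting planner then selects the first action of the lowest-cost simulated rollout. Everything here is an assumption made for illustration, including the linear model class, the quadratic setpoint-tracking cost, the action bounds, and the synthetic "plant" used to generate data. Deterministic dynamics stand in for the probabilistic model of Equation (6).

```python
import random


def fit_linear_model(transitions):
    """Least-squares fit of s' = a*s + b*u from (s, u, s') tuples (scalar case)."""
    sxx = sum(s * s for s, u, _ in transitions)
    sxu = sum(s * u for s, u, _ in transitions)
    suu = sum(u * u for s, u, _ in transitions)
    sxy = sum(s * y for s, _, y in transitions)
    suy = sum(u * y for _, u, y in transitions)
    det = sxx * suu - sxu * sxu          # determinant of the normal equations
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return a, b


def plan_random_shooting(model, s0, setpoint, horizon=5, candidates=200, rng=None):
    """Return the first action of the best sampled sequence under the learned model."""
    rng = rng or random.Random(0)
    a, b = model
    best_cost, best_first = float("inf"), 0.0
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, cost = s0, 0.0
        for u in seq:
            s = a * s + b * u            # rollout inside the learned model
            cost += (s - setpoint) ** 2  # quadratic tracking cost
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


# Hypothetical plant used only to generate training data: s' = 0.9*s + 0.5*u.
data = [(s, u, 0.9 * s + 0.5 * u) for s in (-1.0, 0.0, 1.0) for u in (-1.0, 0.5, 1.0)]
model = fit_linear_model(data)
print(model, plan_random_shooting(model, s0=0.0, setpoint=1.0))
```

Applying only the first action and replanning at the next step gives the MPC-like receding-horizon behaviour mentioned above; the same learned model could instead generate synthetic transitions for a model-free learner.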

4.5. Evolutionary RL

Evolutionary RL replaces gradient-based optimization with mutation–selection mechanisms. A population of policies is perturbed and evaluated via a fitness function, where $\theta$ denotes the policy parameters:

$$\theta_{k+1} = \theta_{k} + \sigma \frac{1}{N} \sum_{i=1}^{N} F(\theta_{k} + \sigma \epsilon_{i}) \, \epsilon_{i}, \tag{7}$$

where $\epsilon_{i}$ are perturbations, $\sigma$ controls exploration, and $F(\cdot)$ measures fitness [50]. These methods are robust to discontinuous dynamics and are useful in industrial process optimization and design, especially when gradients are unreliable.
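Equation (7) translates almost line by line into code. The sketch below follows the update exactly as written, with Gaussian perturbations assumed for the $\epsilon_{i}$; the fitness function and all numerical settings are hypothetical. Practical implementations often add a separate step size and normalize the fitness values, which is omitted here to stay close to the equation.

```python
import random


def es_step(theta, fitness, sigma=0.1, n=50, rng=None):
    """One update of Equation (7) with standard Gaussian perturbations."""
    rng = rng or random.Random(0)
    dim = len(theta)
    weighted = [0.0] * dim
    for _ in range(n):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Evaluate the perturbed policy parameters; no gradient of F is needed.
        f = fitness([t + sigma * e for t, e in zip(theta, eps)])
        for j in range(dim):
            weighted[j] += f * eps[j]
    return [t + sigma * w / n for t, w in zip(theta, weighted)]


def fitness(theta):
    """Hypothetical fitness: higher when parameters are closer to 1.0."""
    return -sum((t - 1.0) ** 2 for t in theta)


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(100):
    theta = es_step(theta, fitness, sigma=0.1, n=50, rng=rng)
print(theta)  # noisy estimate; no convergence guarantee is implied
```

Because only fitness evaluations are required, the same loop works when the fitness comes from a non-differentiable simulator, which is the situation the text describes.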

4.6. Multi-Agent RL (MARL)

MARL extends RL to systems with multiple interacting agents. The gradient of a typical cooperative MARL objective is

$$\nabla_{\theta} J = \mathbb{E}\left[ \sum_{i=1}^{n} \nabla_{\theta_{i}} \log \pi_{\theta_{i}}(a_{i} \mid o_{i}) \, A_{i}^{\pi_{i}} \right], \tag{8}$$

where $o_{i}$ and $a_{i}$ are the observations and actions of agent $i$, and $A_{i}^{\pi_{i}}$ is its advantage under its own policy (often derived from a centralized critic in CTDE). Two dominant paradigms are:

Independent Learners: each agent treats the others as part of the environment.
CTDE (Centralized Training, Decentralized Execution): agents use global information during training and only local observations during execution (e.g., MADDPG [51], QMIX [52]).

MARL aligns naturally with distributed industrial systems such as multiloop CSTR networks, coordinated VAM-plant units, supply-chain operations, and multi-asset process coordination.
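The per-agent structure of Equation (8) can be sketched for agents with softmax policies. In this illustrative example each agent's logits are assumed to be already evaluated at its own local observation, and the advantages are supplied by hand; in a CTDE scheme they would come from a centralized critic. All names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def marl_policy_gradients(agent_logits, actions, advantages):
    """Single-sample terms of Equation (8), one gradient vector per agent."""
    grads = []
    for logits, a, adv in zip(agent_logits, actions, advantages):
        probs = softmax(logits)
        # Each agent differentiates only its own log-policy (decentralized actors).
        grads.append([((1.0 if i == a else 0.0) - p) * adv
                      for i, p in enumerate(probs)])
    return grads


# Two agents, two actions each; advantages are assumed to be given.
grads = marl_policy_gradients([[0.0, 0.0], [0.0, 0.0]], actions=[0, 1],
                              advantages=[1.0, 2.0])
print(grads)  # [[0.5, -0.5], [-1.0, 1.0]]
```

With independent learners each advantage would be estimated from the agent's own experience, whereas under CTDE the same code would receive advantages computed from joint information during training.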

5. Methodology

This study adopts the PRISMA systematic review methodology [7] to ensure transparency, reproducibility, and consistency in the identification and assessment of literature on the application of reinforcement learning in the process industry. The use of PRISMA is particularly relevant in this context due to the heterogeneous terminology and cross-disciplinary nature of reinforcement learning research, where applications may appear dispersed across control engineering, operations research, manufacturing systems, and artificial intelligence literature.

5.1. Search Strategy

The literature search was conducted in two major scientific databases relevant to engineering and industrial applications: ScienceDirect (Elsevier) and IEEE Xplore (IEEE). The primary search string used in both databases was:

"Reinforcement Learning" AND ("Process Industries" OR "Process Manufacturing")

To ensure completeness and to capture domain-specific terminology, additional title-focused searches were executed using combinations such as “Refinery”, “Petrochemical”, “Chemical”, “Steel”, “Mining”, “Pharma*”, “Food”, and “Materials”. The choice of ScienceDirect (Elsevier) and IEEE Xplore (IEEE) reflects a focused strategy to cover the main publication ecosystems where reinforcement learning research relevant to process industries is disseminated. ScienceDirect concentrates high-impact journals in chemical engineering, process systems engineering, and industrial digitalization, while IEEE Xplore includes both established journals and early-stage work in control and automation frequently published in IEEE conferences. In the final dataset, 51 of the 62 included studies (82.3%) are journal articles and 11 (17.7%) are conference papers, indicating a predominantly journal-based evidence base.

5.2. Screening and Selection

The combined database search yielded 192 records (139 from ScienceDirect and 53 from IEEE Xplore). As shown in the PRISMA flow diagram (Figure 6), two duplicate records were removed before screening, resulting in 190 unique records. Inclusion was restricted to peer-reviewed journal articles and conference papers. Sources such as book chapters, editorial summaries, and other non-peer-reviewed materials (25 sources) were excluded.

A relevance screening based on titles and abstracts led to the removal of studies not aligned with the research scope (27 exclusions), as well as those where key terms (e.g., “mining”) referred to unrelated domains (23 exclusions). The remaining 115 articles underwent full-text eligibility assessment, during which 36 studies were excluded for not addressing process industries, 5 for insufficient focus on the topic, and 12 for lacking substantive use of reinforcement learning. A total of 62 studies met all criteria and were included in the review.
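The record counts reported in Sections 5.1 and 5.2 can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are chosen for illustration.

```python
# Records identified per database (Section 5.2).
identified = {"ScienceDirect": 139, "IEEE Xplore": 53}
total = sum(identified.values())                 # 192 records
after_duplicates = total - 2                     # 190 unique records
after_type_filter = after_duplicates - 25        # non-peer-reviewed sources removed
after_screening = after_type_filter - 27 - 23    # out of scope, unrelated key terms
included = after_screening - 36 - 5 - 12         # full-text exclusions

print(total, after_duplicates, after_screening, included)  # 192 190 115 62

# Publication-type shares of the included studies (Section 5.1).
journal, conference = 51, 11
print(round(100 * journal / included, 1), round(100 * conference / included, 1))  # 82.3 17.7
```

The stages are consistent with one another: 115 records reach full-text assessment and 62 studies remain after the three full-text exclusion criteria.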

5.3. Data Extraction

For each included study, metadata and thematic characteristics were systematically extracted to support comparative analysis. The extracted dimensions included:

Publication metadata (title, year, type, country of corresponding author);
Industrial sector and associated process domain;
Problem addressed (e.g., control, optimization, scheduling, fault management);
Reinforcement learning methodology employed (algorithm family, architecture, and implementation style);
Technological context (e.g., simulation-based, hybrid digital twin, on-line deployment).

This structured extraction enables the mapping of reinforcement learning methods to industrial use cases, supporting the systematic interpretation of trends and gaps across sectors.
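One possible way to represent the extracted dimensions is a small record type, sketched below. The class and field names are hypothetical and do not reflect the authors' actual extraction tooling. The three example entries use only attributes stated in Section 6.1, and fields not given there are left as `None`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StudyRecord:
    """One row of the extraction table (hypothetical schema)."""
    reference: str
    sector: str
    process_domain: str
    problem: str
    algorithm: str
    algorithm_family: str
    context: Optional[str] = None           # e.g., simulation or on-line deployment
    year: Optional[int] = None
    publication_type: Optional[str] = None
    country: Optional[str] = None


records = [
    StudyRecord("[56]", "chemical", "styrene batch reactor",
                "thermal runaway prevention", "DQN", "value-based"),
    StudyRecord("[60]", "chemical", "single-stage multiproduct reactor",
                "scheduling", "A2C", "policy-based/actor-critic", context="simulation"),
    StudyRecord("[66]", "chemical", "hybrid three-tank system",
                "direct control", "A3C", "policy-based/actor-critic",
                context="real-time deployment"),
]

# Cross-tabulating by algorithm family is the kind of mapping the extraction supports.
print(Counter(r.algorithm_family for r in records))
```

Keeping the sector, problem, and algorithm family as separate fields makes it straightforward to group studies along either the contextual or the taxonomic axis.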

5.4. Limitations

This review has two scope-related limitations. First, the search was restricted to English-language publications, which may omit relevant studies published in other languages. Second, the use of two databases (ScienceDirect and IEEE Xplore) may not fully capture reinforcement learning research appearing in interdisciplinary outlets outside industrial and process-systems domains. These constraints reflect a balance between relevance and coverage and should be considered when interpreting the breadth of the reviewed literature. Additionally, the studies included in the review vary in methodological rigor and validation depth, meaning that the aggregated trends reported here should be interpreted as qualitative patterns rather than statistically uniform evidence across all sources.

6. Discussion on the Convergence of Reinforcement Learning and the Process Industry

This section presents a discussion of the studies identified through the PRISMA-based systematic literature review. The works are organized by industry sector, with an emphasis on the application of various reinforcement learning (RL) algorithms to address specific problem-solving tasks within the process industry.

6.1. Chemical Industry

This section discusses the application of reinforcement learning (RL) within the chemical process industry, highlighting its growing role in addressing complex problems such as reactor control, raw material scheduling, fault diagnosis, and process safety. RL’s adaptability to high-dimensional, nonlinear, and uncertain environments (coupled with its compatibility with both discrete and continuous decision spaces) makes it particularly well-suited for chemical process control.

Across diverse applications, three main categories of RL emerge: value-based methods (e.g., Q-learning, DQN), policy-based methods (e.g., PPO, A2C), and hybrid actor–critic approaches (e.g., DDPG, TD3, SAC). These approaches are applied in both simulation and real-time environments, revealing several methodological trends such as hybridization with traditional control techniques, the rise of multi-agent systems, and increased focus on sample efficiency and safety guarantees.

In reactor control, several studies focus on continuous stirred tank reactors (CSTRs), which serve as benchmarks for nonlinear system dynamics. Zhang et al. [53] introduce a Robust Safe Model-Based RL (RSMB-RL) approach to control a CSTR under state constraints and disturbances. Using a slack function to reformulate constraints and a composite learning rule, this hybrid method operates in continuous action spaces and delivers safe, stable control policies without prior system knowledge. Similarly, ref. [54] applies a multi-agent Twin-Delayed Deep Deterministic Policy Gradient (TD3) approach for multiloop CSTR control, showing promising disturbance rejection in continuous control settings.

Other reactor-focused works include [55], where Q-learning augmented by Gaussian processes is applied to semi-batch reactor control. The study focuses on data efficiency, requiring only 100 trajectory evaluations to develop a safe, near-optimal policy. Another study [56] uses a Deep Q-Network (DQN) for discrete decision making (specifically, binary actions such as cold diluent injection in a styrene batch reactor) to prevent thermal runaway. Here, explainable RL techniques (e.g., decision trees, Shapley values) are introduced to enhance transparency, reflecting an emerging trend toward interpretable AI in safety-critical applications.

Scheduling problems, often modeled with discrete action spaces and under uncertainty, are a recurring application of RL. Rangel-Martinez and Ricardez-Sandoval [57] utilize Deep Recurrent Q-Networks (DRQN) in a partially observable batch plant environment with zero-wait constraints. By incorporating recurrent neural networks and flexible observation windows, the study bridges temporal dependencies with limited system visibility. In a complementary line, ref. [58] addresses scheduling in uncertain environments using a hybrid RLeRO framework, combining Robust Optimization (RO) with Advantage Actor-Critic (A2C). The inclusion of RO improves policy stability and prevents convergence to local optima. Wu et al. [59] also explore scheduling via a modified Proximal Policy Optimization (PPO), showing better convergence and higher returns compared to conventional policy gradients due to improved state function design.

Hubbs et al. [60] and Bougie et al. [61] further explore scheduling in single-stage multiproduct reactors and vinyl acetate monomer (VAM) plants. Hubbs et al. [60] adopt A2C in a simulation environment to generate schedules resilient to demand uncertainties, while Bougie et al. [61] introduce a decentralized PPO-based strategy, decomposing policies across agents. This method proves advantageous for sample efficiency and robustness in complex systems. Notably, ref. [62] also explores controller-guided self-supervised learning (CGS) in VAM plant control, showcasing how legacy control knowledge can improve RL training efficiency.

Applications in fault diagnosis and parameter estimation further illustrate RL’s flexibility. Kim and Lee [63] pioneered the use of associative RL to evolve Fuzzy Cognitive Maps (FCMs) for automated fault detection in tank–pipe systems. This early effort in model adaptation and knowledge integration foreshadows more recent techniques. For instance, ref. [64] applies Deep Deterministic Policy Gradient (DDPG) to online parameter estimation in the hydrogenation of acetylene, offering a nonintrusive, continuous control solution that avoids perturbing the system. Similarly, ref. [65] integrates DDPG with Economic Model Predictive Control (EMPC) to correct model mismatches in CSTRs, achieving better yield through parameter updates guided by state comparison.

Exploration of real-time control is exemplified in [66] where Asynchronous Advantage Actor-Critic (A3C) is used for direct control of a hybrid three-tank system, demonstrating real-time deployment and effective parallel learning strategies. This work validates the benefit of pre-training in simulation, addressing a critical gap in bridging simulated and physical environments. In a related domain, Alhazmi et al. [67] showcase DDPG for maintaining optimal conditions in chemical reaction networks, emphasizing the importance of reward design and observation space for success in complex control tasks.

Several studies highlight the hybridization of RL with traditional control and optimization frameworks. Zanon et al. [68] combine RL with Nonlinear Model Predictive Control (NMPC) for an evaporation process, using Real-Time Iteration (RTI) to reduce computational cost while maintaining stability. Oh [69] extends this analysis by comparing DRL algorithms (DDPG, TD3, SAC) with data-driven MPC across various systems, including SMB and bioreactors.

Neuro-evolutionary methods also make notable contributions. Conradie and Aldrich [70] introduce Symbiotic Adaptive Neuro-Evolution (SANE) for bioreactor design and control, evolving specialized neurons for improved economic operation under uncertainty. Conradie et al. [71] build on this with the Symbiotic Memetic Neuro-Evolution (SMNE) algorithm, blending evolutionary strategies with particle swarm optimization for efficient policy learning in nonlinear systems. These methods emphasize robustness and adaptability, particularly in early-stage simulation.

While these trends highlight clear methodological strengths—such as the dominance of actor–critic methods in nonlinear continuous control and the effectiveness of value-based methods in discrete scheduling—several challenges remain. In particular, transferring policies trained in simulation to real plants continues to lack standardized procedures, and hybrid RL–MPC schemes depend heavily on accurate surrogate or mechanistic models. These limitations help explain why most applications remain confined to simulation despite methodological maturity.

In conclusion, reinforcement learning in the chemical process industry is advancing toward mature, hybridized frameworks capable of solving both continuous and discrete control problems across various technical domains. With improvements in interpretability, real-time performance, and integration with domain-specific knowledge, RL is positioned to play a critical role in the next generation of intelligent chemical process systems. The most significant elements from this subsection are summarized in Table 1.

6.2. Steel Industry

This section discusses the application of reinforcement learning (RL) within the steel industry, focusing on diverse technical contexts such as flatness control, energy management, maintenance scheduling, and quality assurance. A key thematic link across these applications is the alignment of RL algorithm design with the nature of the problem—whether defined by discrete scheduling decisions or continuous control requirements—and the growing hybridization of RL with traditional methods to enhance robustness and industrial applicability.

Flatness control in strip and cold rolling has emerged as a major focus, typically involving continuous action spaces. Peng et al. [72] and Deng et al. [73] adopt model-free algorithms such as DDPG, TD3, and PPO, integrating ensemble learning to improve stability and reduce the impact of local optima. These approaches operate in simulation and outperform traditional PI controllers. Similarly, ref. [74] proposes Stable Q-Learning (SQL), an offline, value-based method with a Q-ensemble to ensure robustness in the absence of simulators, offering a practical solution for industrial datasets.

In contrast, discrete decision-making problems, such as maintenance and scheduling, are tackled using value-based methods. Ferreira Neto et al. [75] employ a DDQN algorithm within a simulation of a steel shredder maintenance process, achieving significant cost savings by replacing time-based policies. Likewise, ref. [76] addresses blast furnace gas tank scheduling with DQN, demonstrating improved convergence and operational safety by discretizing actions and pre-training on historical data. Jeong et al. [77] utilize Q-learning to identify optimal hot forging parameters by mapping instability domains, with validation through experimental observation, thus illustrating the capacity of model-free, tabular methods in low-dimensional optimization problems.

Policy-based and hybrid methods also see increasing use in multi-agent or hierarchical environments. Wang et al. [78] apply an Actor–Critic approach to integrated energy system optimization, managing continuous variables while balancing global and subsystem indices. Cho et al. [79] introduce REINFORCE within a graph neural network-based policy for dynamic crane scheduling, demonstrating flexibility and real-time adaptability. Che et al. [80] propose a novel hybrid approach, DRL-MOEA, where PPO is embedded within an evolutionary algorithm to enhance scheduling of air separation units, merging the global search of MOEAs with the local policy optimization of PPO.
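The value-based methods cited above for maintenance and gas-tank scheduling all build on the temporal-difference Q-learning update. The following is a minimal tabular sketch on a hypothetical three-state machine-wear problem. The states, costs, and transition probabilities are illustrative assumptions and are not taken from [75,76,77].

```python
import random

# Hypothetical toy maintenance MDP (all numbers are illustrative assumptions).
# States: wear level 0 (new), 1 (worn), 2 (failed). Actions: run or maintain.
N_STATES, N_ACTIONS = 3, 2
RUN, MAINTAIN = 0, 1


def step(state, action, rng):
    """Return (next_state, reward) for the toy machine."""
    if action == MAINTAIN:
        return 0, -2.0                      # pay maintenance cost, machine restored
    if state == 2:
        return 2, -10.0                     # running a failed machine is costly
    next_state = state + 1 if rng.random() < 0.5 else state
    return next_state, 1.0                  # production revenue


def q_learning(episodes=2000, horizon=20, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # Epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda x: q[s][x])
            s2, r = step(s, a, rng)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q


q_table = q_learning()
# Greedy policy: best action index for each wear level
policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(N_STATES)]
```

In this toy setting the learned greedy policy replaces a fixed maintenance interval with a condition-based rule, which is the kind of shift away from time-based policies described for [75]. Deep variants such as DQN and DDQN replace the table with a neural network but keep the same update target.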

Reinforcement learning is also being integrated into quality monitoring systems. Zhang et al. [81] address surface defect detection through a dual-agent RL framework (DuAK), combining knowledge graph reasoning with path exploration in high-dimensional environments. This value-based framework improves defect traceability and performance on real and benchmark datasets, reflecting an emerging use of RL in knowledge-based diagnostic applications.

Across steel applications, ensemble-enhanced PPO/DDPG methods show strong performance in continuous flatness control, while Q-learning variants remain effective for discrete maintenance and scheduling tasks. However, these advances are counterbalanced by persistent challenges; offline RL methods still face robustness issues when exposed to high-variance industrial conditions, and hybrid evolutionary–RL schedulers require substantial tuning to ensure stability in large-scale steel production operations. The most significant elements from this subsection are summarized in Table 2.

6.3. Oil and Gas Industries

This section discusses the application of reinforcement learning (RL) within the oil and gas industries where it supports critical tasks such as production scheduling, flow control, interface tracking, soft sensor development, and fault diagnosis. These applications span both discrete and continuous decision spaces and increasingly incorporate hybrid or adaptive RL strategies tailored to complex, nonlinear environments.

In continuous control contexts, ref. [82] applies Deep Deterministic Policy Gradient (DDPG) to regulate flow levels in a three-phase separator, achieving improved separation efficiency through simulation using CFD models. Similarly, ref. [83] employs Proximal Policy Optimization (PPO) to solve large-scale refinery production scheduling problems. By dividing the scheduling horizon into time slots and initializing the policy with operational knowledge, the RL agent efficiently coordinates local decisions for global optimization, outperforming traditional solvers. These applications demonstrate RL’s scalability and suitability for high-dimensional, continuous action spaces.

In contrast, ref. [84] addresses discrete production scheduling by using a model-based Markov Decision Process (MDP) to dynamically adjust genetic algorithm parameters. This hybrid method, NSGAeRL, improves solution convergence and quality over standard GAs. Also tackling a discrete decision space, ref. [85] applies asynchronous actor–critic algorithms (A3CS/A3CF) to autonomously select cross-domain data for soft sensor design in an SAGD process, allowing performance-driven modeling even with limited target-domain data.

Policy-value hybrids dominate in model-free control applications. Dogru et al. [86] utilize A3C for froth-middlings interface tracking in oil sands processing, overcoming visual occlusion challenges by integrating sensor feedback with learned visual representations. Similarly, ref. [87] adopts an online policy iteration strategy from adaptive dynamic programming to develop a fault-tolerant controller for offshore steel platforms. This approach estimates actuator faults and disturbances in real time, enhancing control stability under uncertain dynamics.

These developments show that actor–critic architectures and hybrid RL–optimization schemes are particularly promising for high-dimensional refinery scheduling and multiphase flow control. Yet, methodological barriers remain: many solutions depend on extensive domain knowledge for reward shaping or initialization, and policy generalization across changing reservoir or process conditions remains insufficiently validated.

The most significant elements from this subsection are summarized in Table 3.

6.4. Food and Beverage Industry

This section discusses the application of reinforcement learning (RL) within the food and beverage industry, focusing on diverse technical contexts such as production scheduling, thermal sterilization, concentration control, and product inspection. Across these applications, a variety of RL approaches, including value-based methods (Q-learning, DQN), hybrid methods (fuzzy approximations), and multi-agent strategies, are explored to address both discrete and continuous decision spaces. In concentration control of a Continuous Stirred Tank Reactor, ref. [88] applied Q-learning with fuzzy function approximation to overcome challenges posed by continuous state-action spaces, demonstrating superior performance over traditional PID control in simulation. Similarly, ref. [89] implemented a model-free Q-learning controller to manage temperature profiles during thermal sterilization in batch processes, emphasizing RL’s suitability for uncertain, nonlinear systems. In broader supply chain management, ref. [90] employed Multi-Agent RL (MARL) within an agent-based simulation to optimize procurement, production, and distribution for an ice cream manufacturer, with digital twin integration under development. This approach highlights the advantages of hybridizing RL with MILP and simulation frameworks. Barthwal et al. [91] offer a comprehensive review of RL’s role in food industry contexts such as robotic navigation and safety inspection, emphasizing Q-learning and DQN for discrete decision making. A common trend emerges in the early-stage but expanding implementation of actor–critic and hybridized RL models tailored to complex, real-world industrial environments.

Overall, value-based and fuzzy-approximation RL controllers perform well in nonlinear thermal and concentration processes, while MARL strategies show early promise in integrated supply-chain control. Nonetheless, the sector still faces significant challenges in obtaining sufficiently rich datasets and validating RL policies in highly variable production environments, which limits broader industrial deployment.

The most significant elements from this subsection are summarized in Table 4.

6.5. Mining Industry

This section discusses the application of reinforcement learning (RL) within the mining industry, focusing on diverse technical challenges such as flotation control, ash content optimization, and equipment alignment. These applications span both continuous control (e.g., flotation and dense medium separation, DMS) and discrete decision making (e.g., transfer point alignment), with a notable trend toward policy-based and hybrid RL strategies. In flotation control, ref. [92] applied Adaptive Dynamic Programming in an actor–critic configuration to optimize recovery and grade without requiring precise process models, demonstrating efficacy in simulation. Zheng et al. [93] extended this line of work by proposing a Hybrid Model-Based RL (HMBRL) algorithm combining fuzzy inference, ensemble critics, and physical models, outperforming PPO, SAC, and MBPO in predictive accuracy and sample efficiency. Similarly, ref. [94] optimized the DMS process using an actor–critic method integrated with a layered model-based controller, enabling online setpoint updates and outperforming PI+MPC in simulation. Addressing equipment alignment, ref. [95] implemented PPO and SAC to synchronize a Bucket Wheel Excavator, Belt Wagon, and Hopper Car, highlighting RL’s capacity for real-time adaptation in discrete, complex environments.

It is worth noting that additional reinforcement learning applications in mineral processing exist beyond those retrieved by our PRISMA-defined search strategy. For example, ref. [96] implements a PPO-based controller for a conventional continuous thickener, improving underflow concentration stability and demonstrating RL’s potential for complex solid–liquid separation units. This study did not appear under the predefined query strings and therefore did not enter the PRISMA screening flow; it is mentioned here solely as relevant external context and is not included in the tables or quantitative analysis.

Actor–critic and hybrid model-based methods clearly outperform traditional PI+MPC benchmarks in simulation, particularly in flotation and DMS control. However, the mining sector continues to lack large-scale operational validations, and model-based hybrids depend on physical or fuzzy surrogates whose accuracy may degrade under volatile ore characteristics and plant disturbances.

The most significant elements from this subsection are summarized in Table 5.

6.6. Pharmaceutical Industry

This section discusses the application of reinforcement learning (RL) within the pharmaceutical industry where high costs and development times—averaging USD 200 million and 10–15 years per innovative drug—drive the need for more efficient solutions. RL has emerged as a valuable tool in this context, addressing complex tasks in drug design, bioprocess control, and materials discovery. In continuous biopharmaceutical manufacturing, ref. [97] applied model-free RL using Monte Carlo simulations to optimize process chromatography under uncertain dynamics, highlighting its advantage in scenarios lacking explicit process models. For small molecule drug discovery, ref. [98] explored value-based (Q-learning, SARSA) and policy-based (PPO) methods to accelerate early-stage design, reduce costs, and enhance chemical space exploration. While RL-generated compounds have entered clinical trials, limitations persist in synthesis feasibility and data quality. Kim et al. [99] used PPO in RL-guided combinatorial chemistry (RL-CC) for HIV drug synthesis, showing how sequential fragment selection via RL can target extreme properties and scale effectively. Across cases, policy-based approaches like PPO dominate for their stability and adaptability. The application of RL in both discrete molecular generation and continuous control underscores its growing role in transforming pharmaceutical innovation through data-driven, cost-saving methodologies.

Across applications, policy-based algorithms such as PPO consistently demonstrate robustness for both combinatorial drug design and nonlinear bioprocess control. Yet, practical limitations persist; RL-generated molecules often face synthesis constraints, and bioprocess optimizers lack real-world validation due to safety and cost barriers. These gaps underline the need for domain-informed reward structures and hybrid workflows that combine RL with established pharmaceutical design methodologies.

The most significant elements from this subsection are summarized in Table 6.

6.7. Semiconductor Industry

This section discusses the application of reinforcement learning (RL) within the semiconductor industry, focusing on run-to-run (R2R, RtR) control in Chemical Mechanical Polishing (CMP) processes. CMP, a critical yet variable-intensive process, has motivated the integration of RL to enhance robustness and adaptability. Ma et al. [100] employ a hybrid strategy, using TD3, an off-policy actor–critic deep RL algorithm for continuous action spaces, to dynamically adjust weights in a double exponentially weighted moving average (dEWMA) controller, enabling improved disturbance rejection and tracking in a simulated environment. Similarly addressing CMP variability, ref. [101] applies structured control network DDPG (SCN-DDPG), a structured, policy-based deep RL variant, incorporating dual networks to estimate system states and define control policies for precise material removal. This approach, operating in a simulated and test-validated setting, emphasizes model-free learning and nonlinear policy representation. In contrast, ref. [102] addresses discrete decision making, such as polish time selection, through a hybrid DQN-RtR controller. The value-based DQN component governs actions when measurement data is unavailable, while the RtR-EWMA handles periods with valid feedback. Collectively, these studies exhibit a trend toward hybridized control architectures, leveraging actor–critic frameworks and DRL for both continuous and discrete control tasks.

Hybrid RL–RtR controllers show strong advantages in handling CMP drift, variability, and missing metrology data, with TD3 and DQN-based designs outperforming classical EWMA baselines. Even so, challenges remain: policy transfer between tools or product lines is rarely demonstrated, and the stability of RL-driven RtR decisions under process drifts still requires systematic evaluation.
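To make the RL-tuned dEWMA idea of [100] more concrete, the sketch below implements a textbook double-EWMA run-to-run controller for a linear process with drift. The two weights are exposed as per-run inputs so that an external policy (for example, a TD3 actor) could set them. The process model, gain, drift rate, and fixed weights are illustrative assumptions and do not reproduce the setup of [100].

```python
def dewma_step(y, u, a_prev, d_prev, b, target, lam1, lam2):
    """One double-EWMA run-to-run update for the model y = a + b*u + drift.

    lam1 and lam2 are the EWMA weights. In an RL-tuned scheme they would be
    chosen per run by the agent; here they are simply passed in.
    """
    a = lam1 * (y - b * u) + (1.0 - lam1) * a_prev            # intercept estimate
    d = lam2 * (y - b * u - a_prev) + (1.0 - lam2) * d_prev   # drift estimate
    u_next = (target - a - d) / b                             # recipe for next run
    return u_next, a, d


def simulate(runs=200, lam1=0.3, lam2=0.3):
    # Hypothetical process: intercept 2.0, gain 1.0, linear drift of 0.05 per run.
    alpha_true, beta_true, drift = 2.0, 1.0, 0.05
    b, target = 1.0, 10.0            # model gain (assumed known here) and target
    a_est, d_est = 0.0, 0.0
    u = (target - a_est) / b         # initial recipe from the uninformed model
    errors = []
    for k in range(1, runs + 1):
        y = alpha_true + beta_true * u + drift * k   # noise-free for clarity
        errors.append(y - target)
        u, a_est, d_est = dewma_step(y, u, a_est, d_est, b, target, lam1, lam2)
    return errors


errors = simulate()
```

With fixed weights the controller removes the offset caused by a steady linear drift. The motivation for letting an agent adjust `lam1` and `lam2` is that the best trade-off between fast drift tracking and noise rejection changes as process conditions change.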

The most significant elements from this subsection are summarized in Table 7.

6.8. Power Generation Industry

This section discusses the application of reinforcement learning within the power generation industry, focusing on control challenges in steam turbines and Organic Rankine Cycle (ORC) systems. Both [103,104] adopt Proximal Policy Optimization (PPO), a policy-based, actor–critic algorithm valued for its balance of sample efficiency and implementation simplicity. Lin et al. [103] address superheat control in ORC systems under fluctuating waste heat conditions, using a hybrid PPO approach in a Sim2Real simulation framework. The method improves adaptability, safety, and control performance without requiring precise system models. In contrast, ref. [104] enhances PI controller tuning for steam turbines using Multi-Objective RL, where PPO optimizes switching among controller settings via gain scheduling. These studies reveal a trend toward continuous control applications and hybridized RL strategies that combine data-driven adaptability with traditional control robustness.
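Part of PPO's implementation simplicity comes from its clipped surrogate objective, which limits how far a policy update can move away from the policy that collected the data. A minimal pure-Python sketch is given below. The function name and sample numbers are illustrative, and the clip range of 0.2 is a commonly used default rather than a value reported in [103,104].

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)].

    ratios     -- probability ratios pi_new(a|s) / pi_old(a|s)
    advantages -- advantage estimates for the same samples
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)


# A large ratio with a positive advantage earns no extra credit beyond 1 + eps,
# while a large ratio with a negative advantage is still fully penalised.
gain_capped = ppo_clip_objective([1.5], [2.0])     # min(1.5 * 2.0, 1.2 * 2.0)
loss_kept = ppo_clip_objective([1.5], [-2.0])      # min(1.5 * -2.0, 1.2 * -2.0)
```

The asymmetry in the two sample calls is the point of the design: updates that would improve the objective are capped, whereas updates that would worsen it are not hidden by the clipping.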

These studies confirm PPO’s suitability for nonlinear turbine and ORC control under fluctuating operational conditions, particularly when combined with transfer learning or gain-scheduling strategies. Nonetheless, the strong dependency on simulation environments and the scarcity of real-plant validations highlight ongoing challenges in generalizing RL-based controllers to rapidly changing energy-generation contexts.

The most significant elements from this subsection are summarized in Table 8.

6.9. Textile Industry

This section discusses the application of reinforcement learning within the textile industry, focusing on optimizing complex chemical processes. He et al. [105] formulate textile process optimization—specifically for ozonation—as a Markov decision process and apply Deep Q-Networks (DQN), a value-based RL method suitable for discrete action spaces and high-dimensional data. The approach, currently in case study validation, integrates empirical data and expert knowledge to manage multi-criteria decisions. This work highlights a methodological trend toward leveraging DQN’s scalability to handle realistic, large-scale industrial process optimization.

DQN provides a scalable approach for handling the discrete, multi-criteria nature of textile chemical decision making. However, practical application is still limited by the need for high-quality empirical data and by uncertainties in transferring case-study policies to diverse textile processes, which remain open challenges for broader adoption.

The most significant elements from this subsection are summarized in Table 9.

6.10. Convergence of RL and the Process Industries

Figure 7 summarizes the cross-industry patterns identified throughout this section. The diagram abstracts common structural features observed across chemical, steel, oil and gas, and food and beverage applications. Across these sectors, reinforcement learning (RL) is shaped by shared operational constraints—safety, product quality, economic performance, and regulatory compliance—which give rise to consistent industrial requirements such as constraint handling, interpretability, data efficiency, and robustness. These requirements, in turn, influence RL design choices, including the preference for hybrid architectures, the adoption of digital-twin-enabled sim-to-real pipelines, and algorithm selection aligned with the nature of the decision task (discrete scheduling, continuous control, or distributed coordination). Collectively, these elements indicate an ongoing shift from isolated proofs of concept toward verifiable, explainable, and deployment-oriented industrial RL architectures.

To operationalize the dual contextual–taxonomic framework outlined in the introduction, Table 10 summarizes how the major RL algorithm families are distributed across the industrial sectors considered in this review. For clarity, policy-based and actor–critic methods are displayed as separate categories, allowing the table to capture sector-level granularity that is not visible in more abstract taxonomies. This choice makes the cross-industry prominence of actor–critic designs particularly evident—a pattern consistent with their suitability for continuous control. It also shows how value-based and hybrid approaches cluster around scheduling, optimization, and safety-constrained tasks.

To articulate the conceptual mechanisms linking industrial constraints with RL design strategies, we introduce below a set of derived propositions that formalize the cross-industry regularities observed in this review.

1. Industries operating under strong safety, quality, or regulatory constraints favor RL methods with explicit mechanisms for constraint handling and verifiable behavior. This is reflected in the widespread use of hybrid RL–MPC/EMPC frameworks, model-based safety layers, and reward functions that enforce operational limits.
2. Continuous-process industries tend to converge toward actor–critic architectures due to their compatibility with continuous action spaces and fine-grained actuator control. Evidence appears in reactor control, run-to-run semiconductor manufacturing, steel flatness control, flotation circuits, and energy-generation systems.
3. Discrete sequencing or scheduling problems predominantly rely on value-based RL methods. Q-learning variants and DQN derivatives are common in steel maintenance scheduling, refinery and batch-plant scheduling, and food-processing logistics, reflecting their efficiency in finite action spaces.
4. Systems requiring coordination across distributed assets adopt multi-agent or hierarchical RL variants. This pattern emerges in multiloop CSTRs, supply-chain coordination, VAM-plant operations, and multi-unit material-handling systems.
5. When experimentation is costly or disruptive, industries give preference to digital twins, offline RL, and sim-to-real transfer pipelines. This trend is shared across chemical, oil and gas, mining, and energy systems where plant access is constrained or disturbances carry operational risk.
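Read together, the five propositions amount to a rule of thumb for shortlisting RL design choices from task characteristics. The sketch below encodes that reading as a small lookup. The class, field names, and returned labels are hypothetical conveniences, and the function is a heuristic summary of the propositions rather than a validated selection procedure.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    # Hypothetical descriptors of an industrial decision task
    continuous_actions: bool
    safety_critical: bool
    distributed_assets: bool
    costly_experimentation: bool


def shortlist_rl_design(task: TaskProfile) -> list[str]:
    """Map task traits to the design choices named in the five propositions."""
    choices = []
    if task.safety_critical:                      # Proposition 1
        choices.append("hybrid RL-MPC or safety layer")
    if task.continuous_actions:                   # Proposition 2
        choices.append("actor-critic")
    else:                                         # Proposition 3
        choices.append("value-based")
    if task.distributed_assets:                   # Proposition 4
        choices.append("multi-agent or hierarchical RL")
    if task.costly_experimentation:               # Proposition 5
        choices.append("digital twin, offline RL, or sim-to-real")
    return choices


# Example: a safety-critical reactor loop where plant trials are expensive
reactor = TaskProfile(
    continuous_actions=True,
    safety_critical=True,
    distributed_assets=False,
    costly_experimentation=True,
)
reactor_choices = shortlist_rl_design(reactor)
```

The propositions describe tendencies in the reviewed literature, so a lookup of this kind is best treated as a starting point for discussion and not as a substitute for task-specific analysis.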

7. Statistical Highlights

As shown above, the application of reinforcement learning in the chemical process industry is widespread. Specifically, 30.6% of the 62 selected reports are situated within this industrial context. Other industries that are significantly represented in the review include the steel, oil and gas, food, and mining sectors (Figure 8).

Several process industries—most notably the chemical sector—have long-established infrastructures in advanced process control, process simulation, and digital instrumentation. These technological foundations facilitate early experimentation with reinforcement learning methods and partly explain the strong representation of chemical-process applications observed in the reviewed literature.

Beyond sector representation, process control stands out as the primary use case for reinforcement learning (RL), accounting for 51.6% of the selected studies (Figure 9). The use of RL for scheduling in process industries is also notable, representing 20.9% of the cases. Finally, the combined application of RL and digital twins (DTs) is an emerging approach, increasingly present in the literature [3], with 6.4% representation.

Among the selected reports, 73.6% of the articles situated within the chemical industry apply reinforcement learning (RL) to process control. In contrast, within the steel industry, the most significant intersection is with scheduling, which accounts for 40% of the studies.
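The percentages quoted in this section can be related back to approximate study counts. The short calculation below back-computes those counts from the reported shares and the total of 62 selected reports. The resulting integers are inferred by rounding and should be read as approximations, not as counts taken from the underlying dataset.

```python
TOTAL_STUDIES = 62  # number of selected reports stated in the review

# Shares reported in the text, as fractions of all selected studies
reported_shares = {
    "chemical industry": 0.306,
    "process control": 0.516,
    "scheduling": 0.209,
    "RL with digital twins": 0.064,
}


def implied_count(share, total=TOTAL_STUDIES):
    """Nearest whole number of studies consistent with a reported share."""
    return round(share * total)


implied = {name: implied_count(s) for name, s in reported_shares.items()}

# The 73.6% figure is a share of chemical-industry studies, not of all 62.
chemical_control = implied_count(0.736, implied["chemical industry"])
```

Keeping the denominator explicit matters here: the 73.6% and 40% figures are conditional on a single sector, so they cannot be compared directly with the shares computed over all 62 studies.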

The countries of affiliation of the first author were also quantified (Figure 10), revealing a broad distribution across 22 different countries. As of 2025, approximately 58.7% of the global population resides in Asia. Therefore, the fact that 54.8% of the studies selected via the PRISMA methodology originate from Asian institutions may appear intuitively consistent. Nevertheless, two points are worth highlighting: first, the notable penetration of RL in industrial applications across Asia; and second, the increasingly global and evenly distributed presence of reinforcement learning applications within the process industry.

In addition, Figure 11 presents a categorical count of the types of reinforcement learning applied in the selected reports. The combined category “Policy-Based/Actor–Critic” appears in approximately one out of every three studies, making it the most represented algorithmic family in the review and reflecting its central role in continuous-control tasks. It is followed by value-based RL and hybrid RL approaches, the latter used almost exclusively in conjunction with process control. Of particular note is the strong presence of cross-cutting approaches (e.g., Evolutionary RL), which combine reinforcement learning with other optimization paradigms to enhance robustness, exploration, or parallel search.

To complement this algorithm-level distribution, Figure 12 summarizes how the four high-level RL families—value-based, policy-based/actor–critic, hybrid, and cross-cutting—are distributed across the major process-industry sectors. This underlines the qualitative patterns highlighted earlier, e.g., the strong presence of value-based RL in steel scheduling, the prevalence of mixed policy-gradient/actor–critic and hybrid strategies in nonlinear continuous-control domains, and the use of hybrid or model-supported methods in mining applications involving multi-stage operations.

8. Conclusions

This review set out to answer the central research question: How is reinforcement learning applied in process industries? The systematic evidence shows that RL is being deployed primarily in continuous-process control, discrete scheduling, and hybrid decision-making frameworks supported by digital twins. Across the 62 included studies, RL is used not as a generic technique but as an application-driven tool whose algorithmic families align closely with industrial requirements such as constraint handling, robustness, and sample-efficient learning. This application-oriented mapping—integrating contextual sectors and taxonomic RL families—provides a structured response to the research question and frames the trends synthesized in this review.

8.1. Key Findings and Architectural Trends

The analysis reveals a deliberate mapping of RL architectures to industrial tasks:

Control-task dominance: The field is overwhelmingly focused on process control (51.6% overall), with this intersection being most pronounced in the chemical industry (73.6% of chemical studies). This focus necessitates reliable algorithms for continuous, high-dimensional spaces.
Algorithmic specialization for continuous control: The strong demand for continuous control is met by the Policy-Based/Actor–Critic family (the most common type, ≈33% of studies), reflecting its suitability for high-dimensional continuous control. Sector-wise dominance of pure actor–critic methods appears only when disaggregated taxonomically (as in Table 10), while the statistical aggregation reports them jointly due to their methodological overlap.
Value-based methods for discrete tasks: Conversely, specialized challenges such as scheduling (40% of steel studies) are predominantly addressed by value-based RL methods, reinforcing the efficiency of finite-action algorithms for discrete sequencing and complex sequential decision making.
Pragmatism of hybrid architectures: The necessity for robustness, constraint handling, and safety drives the adoption of hybrid models. These approaches are used almost exclusively in process-control domains, demonstrating a pragmatic convergence with established methods to provide additional structure and verifiable guarantees.
Contextual alignment: The contextual–taxonomic analysis (Table 10) clarifies how RL families align with industrial tasks; actor–critic methods dominate continuous-process sectors, while value-based RL remains central to discrete sequencing and scheduling. This sector-level alignment complements the aggregated statistics presented in Figure 11 and Figure 12.
Addressing structural complexity: The presence of multi-agent (MARL) and cross-cutting approaches (e.g., evolutionary RL) reflects the need to move beyond simple RL paradigms to handle structural complexities like distributed assets and parallelism.

8.2. Current Challenges and Priority Actions for Industrial Adoption

Despite these advancements, a persistent challenge remains: most RL applications within the process industry are validated primarily in simulated environments, underscoring the existing Sim2Real gap in real-world deployment. The emergence of digital twin (DT)-based approaches (6.4%), however, points to promising pathways for addressing this issue by creating safe, high-fidelity testing platforms. To accelerate industrial adoption, the following priority actions emerge from the review:

1. Closing the Sim2Real gap (deployment and validation): Advance from simulation-only evaluation toward controlled pilot deployments on physical systems.
2. Algorithmic guarantees: Develop safety- and stability-aware RL formulations that provide verifiable behavior under industrial constraints.
3. Sample efficiency: Design training pipelines, especially DT-enabled ones, that reduce dependence on real-plant data while ensuring transferability.
4. Explainable AI (XAI): Develop interpretable RL components to support operator trust and decision transparency.
5. Standardization: Establish benchmark problems and reporting practices tailored for process-industry RL to enable systematic comparison and progress tracking.
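As one illustration of what a constraint-aware hybrid formulation might look like, the following sketch wraps a learned correction around a conventional PI baseline and enforces hard limits before anything reaches the actuator. It is an assumption-laden toy example, not a method from the reviewed studies: the gains, limits, and function names are hypothetical, and the RL output is a placeholder number rather than the result of a trained policy.

```python
def pi_baseline(error, integral, kp=2.0, ki=0.5, dt=1.0):
    """Discrete PI law; returns (control signal, updated integral)."""
    integral = integral + error * dt
    return kp * error + ki * integral, integral


def safe_hybrid_action(u_base, u_rl, u_prev, u_min, u_max, max_step, max_residual):
    """Add a bounded RL residual to the baseline, then enforce hard limits.

    Assumes u_prev already lies within [u_min, u_max].
    """
    # Bound the learned correction so the baseline controller keeps authority.
    residual = min(max(u_rl, -max_residual), max_residual)
    u = u_base + residual
    # Rate limit relative to the previously applied action.
    u = min(max(u, u_prev - max_step), u_prev + max_step)
    # Absolute actuator limits.
    return min(max(u, u_min), u_max)


# Hypothetical numbers: a PI step followed by an aggressive RL proposal.
u_base, integral = pi_baseline(error=0.4, integral=0.0)
u = safe_hybrid_action(
    u_base, u_rl=5.0, u_prev=0.5,
    u_min=0.0, u_max=1.0, max_step=0.2, max_residual=0.3,
)
```

Because every limit is applied outside the learned component, the bounds hold regardless of what the policy proposes. This is only the simplest form of guarantee; stability or state-constraint guarantees would require considerably more structure than shown here.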

In conclusion, reinforcement learning is firmly establishing its role as a foundational technology for the next generation of smart manufacturing. Its trajectory suggests a transformative potential, moving beyond isolated optimizations to enable more autonomous, resilient, and efficient operations across the entire process industry landscape.

Author Contributions

Conceptualization, M.A.P.R. and A.B.; methodology, M.A.P.R. and A.B.; data curation, M.A.P.R.; formal analysis, M.A.P.R. and A.B.; investigation, M.A.P.R. and A.B.; writing—original draft preparation, M.A.P.R. and A.B.; writing—review and editing, M.A.P.R. and A.B.; visualization, M.A.P.R.; supervision, A.B.; project administration, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Bavarian Ministry of Economic Affairs, Regional Development and Energy (StMWi) under Grant DIK0397/03.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

A2C	Advantage Actor–Critic
A3C	Asynchronous Advantage Actor–Critic
AC	Actor–Critic (family of RL algorithms)
ADP	Approximate Dynamic Programming
AI	Artificial Intelligence
ASCM	Association for Supply Chain Management
CFD	Computational Fluid Dynamics
CSTR	Continuous Stirred-Tank Reactor
CTDE	Centralized Training with Decentralized Execution
DDPG	Deep Deterministic Policy Gradient
DDQN	Double Deep Q-Network
dEWMA	Double Exponentially Weighted Moving Average
DCS	Distributed Control System
DPG	Deterministic Policy Gradient
DQN	Deep Q-Network
DRL	Deep Reinforcement Learning
DRQN	Deep Recurrent Q-Network
DT	Digital Twin
DuAK	Dual-Agent Knowledge-based Reinforcement Learning framework
ERP	Enterprise Resource Planning
EMPC	Economic Model Predictive Control
FCM	Fuzzy Cognitive Map
GA	Genetic Algorithm
GDP	Gross Domestic Product
GPU	Graphics Processing Unit
GNN	Graph Neural Network
GRU	Gated Recurrent Unit (a variant of RNN architecture)
HIV	Human Immunodeficiency Virus
IIoT	Industrial Internet of Things
LSTM	Long Short-Term Memory (a type of Recurrent Neural Network unit)
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
MARL	Multi-Agent Reinforcement Learning
MBPO	Model-Based Policy Optimization
MBRL	Model-Based Reinforcement Learning
MFRL	Model-Free Reinforcement Learning
MILP	Mixed-Integer Linear Programming
MOEA	Multi-Objective Evolutionary Algorithm
MPC	Model Predictive Control
MDP	Markov Decision Process
NMPC	Nonlinear Model Predictive Control
NSGAeRL	Non-dominated Sorting Genetic Algorithm embedded Reinforcement Learning
ORC	Organic Rankine Cycle
PG	Policy Gradient
PI	Process Industry
PID	Proportional–Integral–Derivative (control)
PLC	Programmable Logic Controller
PoC	Proof of Concept
POMDPs	Partially Observable Markov Decision Processes
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QMS	Quality Management System
QMIX	Monotonic Value Function Factorisation for Multi-Agent Reinforcement Learning
RL	Reinforcement Learning
RLeRO	Reinforcement Learning enhanced Reference Optimization
R2R/RtR	Run-to-Run or Real-time Regulation (equivalent industrial control paradigms)
RSMB-RL	Robust Safe Model-Based Reinforcement Learning
RO	Robust Optimization
SAC	Soft Actor–Critic
SAGD	Steam-Assisted Gravity Drainage
SARSA	State–Action–Reward–State–Action (on-policy RL algorithm)
SCADA	Supervisory Control and Data Acquisition
SCP	Supply Chain Planning
SMB	Simulated Moving Bed
SMNE	Swarm-Based Mixed Neuro-Evolution
SQL	Stable (or Safe) Q-Learning
TD	Temporal Difference
TD3	Twin Delayed Deep Deterministic Policy Gradient
TRPO	Trust Region Policy Optimization
TPU	Tensor Processing Unit
VAM	Vinyl Acetate Monomer

References

1. Ge, W.; Guo, L.; Li, J. Toward greener and smarter process industries. Engineering 2017, 3, 152–153.
2. Groumpos, P.P. A critical historical and scientific overview of all industrial revolutions. IFAC-PapersOnLine 2021, 54, 464–471.
3. Paz-Ramos, M.A.; Busboom, A. Integration of Digital Twins with Reinforcement Learning in Industry: A Systematic Review. In Proceedings of the 2025 IEEE 30th International Conference on Emerging Technologies and Factory Automation (ETFA), Porto, Portugal, 9–12 September 2025; pp. 1–8.
4. Nian, R.; Liu, J.; Huang, B. A review on reinforcement learning: Introduction and applications in industrial process control. Comput. Chem. Eng. 2020, 139, 106886.
5. Faria, R.D.R.; Capron, B.D.O.; Secchi, A.R.; De Souza, M.B.J. Where reinforcement learning meets process control: Review and guidelines. Processes 2022, 10, 2311.
6. Dogru, O.; Xie, J.; Prakash, O.; Chiplunkar, R.; Soesanto, J.; Chen, H.; Velswamy, K.; Ibrahim, F.; Huang, B. Reinforcement learning in process industries: Review and perspective. IEEE/CAA J. Autom. Sin. 2024, 11, 283–300.
7. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
8. Ezell, S.J.; Atkinson, R.D. Annual Report on the U.S. Manufacturing Economy: 2024; Technical Report NIST AMS 600-16; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
9. Lyons, A.C.; Vidamour, K.; Jain, R.; Sutherland, M. Developing an understanding of lean thinking in process industries. Prod. Plan. Control 2011, 24, 475–494.
10. Kuwashima, K.; Fujimoto, T. Redefining the characteristics of process-industries: A design theory approach. J. Eng. Technol. Manag. 2023, 68, 101748.
11. Lager, T. Managing Innovation & Technology in the Process Industries: Current practices and future perspectives. Procedia Eng. 2016, 138, 459–471.
12. King, P.L.; Kroeger, D.R.; Foster, J.B.; Williams, N.; Proctor, W. Making cereal not cars. Ind. Eng. 2008, 40, 34–37.
13. Qian, F.; Zhong, W.; Du, W. Fundamental theories and key technologies for smart and optimal manufacturing in the process industry. Engineering 2017, 3, 154–160.
14. Pittman, P.H.; Awater, J.E. Process Manufacturing. In ASCM Supply Chain Dictionary, 17th ed.; Association for Supply Chain Management: Chicago, IL, USA, 2022.
15. United Nations Statistics Division. International Standard Industrial Classification of All Economic Activities (ISIC), Rev. 4; United Nations: New York, NY, USA, 2008; Available online: https://unstats.un.org/unsd/publication/seriesm/seriesm_4rev4e.pdf (accessed on 2 December 2025).
16. Bennett, S. A brief history of automatic control. IEEE Control Syst. Mag. 1996, 16, 17–25.
17. Maxwell, J.C. I. On Governors. Proc. R. Soc. Lond. 1868, 16, 270–283.
18. Skogestad, S. Advanced control using decomposition and simple elements. Annu. Rev. Control 2023, 56, 100903.
19. Yeo, W.S.; Saptoro, A.; Kumar, P.; Kano, M. Just-in-time based soft sensors for process industries: A status report and recommendations. J. Process Control 2023, 128, 103025.
20. Bennett, S. The past of PID controllers. Annu. Rev. Control 2001, 25, 43–53.
21. Rockwell Automation. 10th Annual State of Smart Manufacturing Report; Rockwell Automation: Milwaukee, WI, USA, 2025; Available online: https://www.rockwellautomation.com/en-us/capabilities/digital-transformation/state-of-smart-manufacturing.html (accessed on 2 December 2025).
22. Lu, H.; Guo, L.; Azimi, M.; Huang, K. Oil and Gas 4.0 era: A systematic review and outlook. Comput. Ind. 2019, 111, 68–90.
23. Yang, T.; Yi, X.; Lu, S.; Johansson, K.H.; Chai, T. Intelligent manufacturing for the process industry driven by industrial artificial intelligence. Engineering 2021, 7, 1224–1230.
24. Pietrasik, M.; Wilbik, A.; Grefen, P. The enabling technologies for digitalization in the chemical process industry. Digit. Chem. Eng. 2024, 12, 100161.
25. Brennan, D. Process Industry Economics: Principles, Concepts and Applications; Elsevier: Amsterdam, The Netherlands, 2020.
26. Shyam, R.; Singh, R. A taxonomy of machine learning techniques. J. Adv. Robot. 2021, 8, 18–25.
27. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Global Edition; Pearson: London, UK, 2021.
28. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285.
29. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
30. Mnih, V. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
31. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395.
32. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
33. Gibbs, S. Google buys UK artificial intelligence startup DeepMind for £400m. The Guardian, 27 January 2014. Available online: https://www.theguardian.com/technology/2014/jan/27/google-acquires-uk-artificial-intelligence-startup-deepmind (accessed on 2 December 2025).
34. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hassabis, D. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359.
35. Sang-Hun, C. Google’s computer program beats Lee Sedol in Go tournament. The New York Times, 16 March 2016. Available online: https://www.nytimes.com/2016/03/16/world/asia/korea-alphago-vs-lee-sedol-go.html (accessed on 2 December 2025).
36. Kissinger, H. How the Enlightenment Ends. The Atlantic, 1 June 2018. Available online: https://www.theatlantic.com/magazine/archive/2018/06/henry-kissinger-ai-could-mean-the-end-of-human-history/559124/ (accessed on 2 December 2025).
37. Achiam, J. A Taxonomy of RL Algorithms. 2018. Available online: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html (accessed on 2 December 2025).
38. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
39. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989.
40. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
41. Hausknecht, M.J.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the NIPS Deep Reinforcement Learning Workshop, Montreal, QC, Canada, 11–12 December 2015; p. 141.
42. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report CUED/F-INFENG/TR 166; Cambridge University Engineering Department: Cambridge, UK, 1994.
43. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
44. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.I.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897.
45. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
46. Fujimoto, S.; Hoof, H.v.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
47. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
48. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to Trust Your Model: Model-Based Policy Optimization. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 16, p. 12487.
49. Ha, D.; Schmidhuber, J. World Models. arXiv 2018, arXiv:1803.10122.
50. Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; Sutskever, I. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv 2017, arXiv:1703.03864.
51. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 1–10, pp. 6380–6391.
52. Rashid, T.; Samvelyan, M.; Schroeder de Witt, C.; Farquhar, G.; Foerster, J.N.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304.
53. Zhang, H.; Zhao, C.; Ding, J. Robust safe reinforcement learning control of unknown continuous-time nonlinear systems with state constraints and disturbances. J. Process Control 2023, 128, 103028.
54. Yifei, Y.; Lakshminarayanan, S. Multi-agent reinforcement learning system for multiloop control of chemical processes. In Proceedings of the 2022 IEEE International Symposium on Advanced Control of Industrial Processes (AdCONIP), Vancouver, BC, Canada, 7–9 August 2022; pp. 48–53.
55. Savage, T.; Zhang, D.; Mowbray, M.; Río Chanona, E.A.D. Model-free safe reinforcement learning for chemical processes using Gaussian processes. IFAC-PapersOnLine 2021, 54, 504–509.
56. Szatmári, K.; Horváth, G.; Németh, S.; Bai, W.; Kummer, A. Resilience-based explainable reinforcement learning in chemical process safety. Comput. Chem. Eng. 2024, 191, 108849.
57. Rangel-Martinez, D.; Ricardez-Sandoval, L.A. A recurrent reinforcement learning strategy for optimal scheduling of partially observable job-shop and flow-shop batch chemical plants under uncertainty. Comput. Chem. Eng. 2024, 188, 108748.
58. Lee, C.Y.; Huang, Y.T.; Chen, P.J. Robust-optimization-guiding deep reinforcement learning for chemical material production scheduling. Comput. Chem. Eng. 2024, 187, 108745.
59. Wu, Z.; Wang, Y.; Jia, L. A dynamic chemical production scheduling method based on reinforcement learning. In Proceedings of the 2022 China Automation Congress (CAC), Xiamen, China, 25–27 November 2022; pp. 4841–4846.
60. Hubbs, C.D.; Li, C.; Sahinidis, N.V.; Grossmann, I.E.; Wassick, J.M. A deep reinforcement learning approach for chemical production scheduling. Comput. Chem. Eng. 2020, 141, 106982.
61. Bougie, N.; Onishi, T.; Tsuruoka, Y. Local control is all you need: Decentralizing and coordinating reinforcement learning for large-scale process control. In Proceedings of the 61st Annual Conference of the Society of Instrument and Control Engineers (SICE), Kumamoto, Japan, 6–9 September 2022; pp. 468–474.
62. Bougie, N.; Onishi, T.; Tsuruoka, Y. Data-efficient reinforcement learning from controller guidance with integrated self-supervision for process control. IFAC-PapersOnLine 2022, 55, 863–868.
63. Kim, S.H.; Lee, K.S. A study on the development of robust fault diagnostic system based on neuro-fuzzy scheme. IFAC Proc. Vol. 1998, 31, 173–178.
64. Alhazmi, K.; Sarathy, S.M. Nonintrusive parameter adaptation of chemical process models with reinforcement learning. J. Process Control 2023, 123, 87–95.
65. Alhazmi, K.; Albalawi, F.; Sarathy, S.M. A reinforcement learning-based economic model predictive control framework for autonomous operation of chemical reactors. Chem. Eng. J. 2022, 428, 130993.
66. Dogru, O.; Wieczorek, N.; Velswamy, K.; Ibrahim, F.; Huang, B. Online reinforcement learning for a continuous space system with experimental validation. J. Process Control 2021, 104, 86–100.
67. Alhazmi, K.; Sarathy, S.M. Continuous control of complex chemical reaction network with reinforcement learning. In Proceedings of the 2020 European Control Conference (ECC), St. Petersburg, Russia, 12–15 May 2020; pp. 1066–1068.
68. Zanon, M.; Kungurtsev, V.; Gros, S. Reinforcement learning based on real-time iteration NMPC. IFAC-PapersOnLine 2020, 53, 5213–5218.
69. Oh, T.H. Quantitative comparison of reinforcement learning and data-driven model predictive control for chemical and biological processes. Comput. Chem. Eng. 2024, 181, 108558.
70. Conradie, A.v.E.; Aldrich, C. Development of neurocontrollers with evolutionary reinforcement learning. Comput. Chem. Eng. 2005, 30, 1–17.
71. Conradie, A.E.; Miikkulainen, R.; Aldrich, C. Intelligent process control utilising symbiotic memetic neuro-evolution. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No.02TH8600), Honolulu, HI, USA, 12–17 May 2002; Volume 1, pp. 623–628.
72. Peng, W.; Lei, J.; Ding, C.; Yue, C.; Ma, G.; Sun, J.; Zhang, D. A novel deep ensemble reinforcement learning based control method for strip flatness in cold rolling steel industry. Eng. Appl. Artif. Intell. 2024, 134, 108695.
73. Deng, J.; Sierla, S.; Sun, J.; Vyatkin, V. Reinforcement learning for industrial process control: A case study in flatness control in steel industry. Comput. Ind. 2022, 143, 103748.
74. Deng, J.; Sierla, S.; Sun, J.; Vyatkin, V. Offline reinforcement learning for industrial process control: A case study from steel industry. Inf. Sci. (Ny) 2023, 632, 221–231.
75. Ferreira Neto, W.A.; Virgínio Cavalcante, C.A.; Do, P. Deep reinforcement learning for maintenance optimization of a scrap-based steel production line. Reliab. Eng. Syst. Saf. 2024, 249, 110199.
76. Zhang, T.; Zhou, F.; Zhao, J.; Wang, W. Deep reinforcement learning for secondary energy scheduling in steel industry. In Proceedings of the 2020 2nd International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–25 October 2020; pp. 1–5.
77. Jeong, H.Y.; Park, J.; Kim, Y.; Shin, S.Y.; Kim, N. Processing parameters optimization in hot forging of AISI 4340 steel using instability map and reinforcement learning. J. Mater. Res. Technol. 2023, 23, 1995–2009.
78. Wang, Z.; Wang, L.; Han, Z.; Zhao, J. Multi-index evaluation based reinforcement learning method for cyclic optimization of multiple energy utilization in steel industry. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 5766–5771.
79. Cho, S.H.; Shin, W.J.; Ahn, J.; Joo, S.; Kim, H.J. Dynamic crane scheduling with reinforcement learning for a steel coil warehouse. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; Volume 33, pp. 16545–16552.
80. Che, G.; Zhang, Y.; Tang, L.; Zhao, S. A deep reinforcement learning based multi-objective optimization for the scheduling of oxygen production system in integrated iron and steel plants. Appl. Energy 2023, 345, 121332.
81. Zhang, Y.; Wang, H.; Shen, W.; Peng, G. DuAK: Reinforcement learning-based knowledge graph reasoning for steel surface defect detection. IEEE Trans. Autom. Sci. Eng. 2025, 22, 557–569.
82. Abdullah, Z.A.K.; Ranjbar, F.; Zare, V.; Homod, R.Z. Unlocking optimal performance and flow level control of three-phase separator based on reinforcement learning: A case study in Basra refinery. Therm. Sci. Eng. Prog. 2024, 55, 102885.
83. Chen, Y.; Ding, J.; Chen, Q. A reinforcement learning based large-scale refinery production scheduling algorithm. IEEE Trans. Autom. Sci. Eng. 2024, 21, 6041–6055.
84. Lee, C.Y.; Ho, C.Y.; Hung, Y.H.; Deng, Y.W. Multi-objective genetic algorithm embedded with reinforcement learning for petrochemical melt-flow-index production scheduling. Appl. Soft Comput. 2024, 159, 111630.
85. Xie, J.; Dogru, O.; Huang, B.; Godwaldt, C.; Willms, B. Reinforcement learning for soft sensor design through autonomous cross-domain data selection. Comput. Chem. Eng. 2023, 173, 108209.
86. Dogru, O.; Velswamy, K.; Huang, B. Actor–critic reinforcement learning and application in developing computer-vision-based interface tracking. Engineering 2021, 7, 1248–1261.
87. Ziaei, A.; Kharrati, H.; Rahimi, A. Fault-tolerant control for nonlinear offshore steel jacket platforms based on reinforcement learning. Ocean Eng. 2022, 246, 110247.
88. Supriya, M.; Srilatha, K.; Smitha, S.P.; Oommen, S.; Hemantha, C.; Sharath, N. Q-learning based reinforcement learning controller for concentration control of food preparation. In Proceedings of the 2023 5th International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 3–5 August 2023; pp. 872–877.
89. Syafiie, S.; Vilas, C.; Garcia, M.R.; Tadeo, F.; Alonso, A.A.; Martinez, E. Intelligent control based on reinforcement learning for batch thermal sterilization of canned foods. IFAC Proc. Vol. 2008, 41, 3568–3573.
90. Maheshwari, P.; Kamble, S.; Belhadi, A.; Venkatesh, M.; Abedin, M.Z. Digital twin-driven real-time planning, monitoring, and controlling in food supply chains. Technol. Forecast. Soc. Chang. 2023, 195, 122799.
91. Barthwal, R.; Kathuria, D.; Joshi, S.; Kaler, R.S.S.; Singh, N. New trends in the development and application of artificial intelligence in food processing. Innov. Food Sci. Emerg. Technol. 2024, 92, 103600.
92. Sun, B.; le Roux, J.D.; Jämsä-Jounela, S.L.; Craig, I.K. Optimal control of a rougher flotation cell using adaptive dynamic programming. IFAC-PapersOnLine 2018, 51, 24–29.
93. Zheng, J.; Jia, R.; Liu, S.; He, D.; Li, K.; Wang, F. Sample-efficient reinforcement learning with knowledge-embedded hybrid model for optimal control of mining industry. Expert Syst. Appl. 2024, 254, 124402.
94. Dai, W.; Li, T.; Zhang, L.; Jia, Y.; Yan, H. Multi-rate layered operational optimal control for large-scale industrial processes. IEEE Trans. Industr. Inform. 2022, 18, 4749–4761.
95. Fidencio, A.X.; Glasmachers, T.; Naro, D. Application of reinforcement learning to a mining system. In Proceedings of the 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 21–23 January 2021; pp. 000111–000118.
96. Silva, J.R.; Euzébio, T.A.M.; Braga, M.F. Control of conventional continuous thickeners via proximal policy optimization. Miner. Eng. 2024, 214, 108761.
97. Khuat, T.T.; Bassett, R.; Otte, E.; Grevis-James, A.; Gabrys, B. Applications of machine learning in antibody discovery, process development, manufacturing and formulation: Current trends, challenges, and opportunities. Comput. Chem. Eng. 2024, 182, 108585.
98. Lv, Q.; Zhou, F.; Liu, X.; Zhi, L. Artificial intelligence in small molecule drug discovery from 2018 to 2023: Does it really work? Bioorg. Chem. 2023, 141, 106894.
99. Kim, H.; Choi, H.; Kang, D.; Lee, W.B.; Na, J. Materials discovery with extreme properties via reinforcement learning-guided combinatorial chemistry. Chem. Sci. 2024, 15, 7908–7925.
100. Ma, Z.; Pan, T.; Tian, J. Deep reinforcement learning optimized double exponentially weighted moving average controller for chemical mechanical polishing processes. Chem. Eng. Res. Des. 2023, 197, 419–433.
101. Yu, J.; Guo, P. Run-to-run control of chemical mechanical polishing process based on deep reinforcement learning. IEEE Trans. Semicond. Manuf. 2020, 33, 454–465.
102. Tsen, A.Y.D.; Chen, T.L. Reinforcement learning chemical-mechanical polishing run-to-run controller. In Proceedings of the 2023 IEEE 5th Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 27–29 October 2023; pp. 732–735.
103. Lin, R.; Luo, Y.; Wu, X.; Chen, J.; Huang, B.; Su, H.; Xie, L. Surrogate empowered Sim2Real transfer of deep reinforcement learning for ORC superheat control. Appl. Energy 2024, 356, 122310.
104. Kumar P, K.; Detroja, K.P. Gain Scheduled PI controller design using Multi-Objective Reinforcement Learning. IFAC-PapersOnLine 2024, 58, 132–137.
105. He, Z.; Tran, K.P.; Thomassey, S.; Zeng, X.; Xu, J.; Yi, C. A deep reinforcement learning based multi-criteria decision support system for optimizing textile chemical process. Comput. Ind. 2021, 125, 103373.

Figure 2. Concomitant technologies in the process industry.

Figure 3. Taxonomic placement of Reinforcement Learning within AI.

Figure 4. Search trend for the term “Reinforcement Learning” according to Google Trends.

Figure 5. Number of articles per year in ScienceDirect for the query “Reinforcement Learning” (basic search).

Figure 6. PRISMA flow diagram for this literature review.

Figure 7. Cross-industry convergence model linking process-industry requirements with RL design strategies and deployment trajectories.

Figure 8. Industry type of the processes in the selected studies.

Figure 9. Technological disciplines through which RL was applied.

Figure 10. Country of institutional affiliation of the first author.

Figure 11. Types of RL algorithms used in the selected reports.

Figure 12. Distribution of RL algorithm categories (value-based, policy-based, hybrid, and cross-cutting) across process-industry sectors.

Table 1. Summary table for the chemical industry *.

Author	Application Area	RL Type	Main Contribution
Dogru, 2021a [66]	Real-time tank control	A3C (Actor-Critic)	Online A2C with multi-trajectory learning, robust real-time performance
Lee, 2024a [58]	Scheduling of chemical reactors	A2C + Robust Optimization	RLeRO framework to avoid local optima
Hubbs, 2020 [60]	Reactor scheduling	A2C	Real-time scheduling via DRL, outperforming MILP
Alhazmi, 2023 [64]	Parameter estimation in chemical processes	DDPG	Nonintrusive offline RL parameter estimator
Alhazmi, 2022 [65]	Online economic control of reactors	DDPG + EMPC	Online model correction using RL-enhanced EMPC
Alhazmi, 2020 [67]	Complex chemical reaction networks	DDPG	Demonstrates DDPG’s suitability for integrated chemical processes
Yifei, 2022 [54]	Multi-loop CSTR control	Multi-Agent TD3	Demonstrates MARL for multiloop process control
Oh, 2024 [69]	CSTR, SMB, bioreactor control	DDPG, TD3, SAC	Comparative study of DRL vs. MPC for process tasks
Wu, 2022 [59]	Production scheduling under uncertainty	PPO, A2C	Enhanced state function for order urgency and stability
Bougie, 2022a [61]	Process control in VAM plant	PPO, SAC, A2C	Multi-agent PPO with message passing and shared actions
Szatmári, 2024 [56]	Thermal runaway prevention in reactors	Deep Q-Learning (DQN)	Explainable RL with decision trees and Shapley values
Rangel-Martinez, 2024 [57]	Batch plant scheduling under uncertainty	Deep Recurrent Q-Learning	DRQN with observation window and sub-reward shaping
Savage, 2021 [55]	Semi-batch reactor control	Q-Learning + Gaussian Proc.	Low-data policy generation with GP-based Q-value estimation
Zhang, 2023 [53]	Constrained process control (CSTR)	Model-Based Safe RL (RSMB-RL)	Data-driven slack function, safe/stable policy without known dynamics
Zanon, 2020 [68]	Evaporation process optimization	RL + NMPC (Model-Based)	Integrates NMPC with RL using RTI to improve efficiency
Bougie, 2022b [62]	VAM plant process control	Model-Free, Self-Supervised	Controller-guided exploration with self-supervision for sparse reward learning
Kim, 1998 [63]	Fault diagnosis in tank-pipe system	Associative RL	Self-learning fuzzy cognitive maps (FCMs)
Conradie, 2005 [70]	Bioreactor optimization	Neuro-Evolution (SANE)	Combines control and design with evolutionary RL
Conradie, 2002 [71]	Bioreactor control	SMNE (Neuro-Evolution)	Combines evolutionary and swarm learning