1. Introduction
In recent years, the application of neural networks (NNs) in decision-making has become a common topic in fields such as logistics, power distribution, and robotics. Supervised training algorithms can determine a suitable mapping from a set of input and output data. Other approaches, such as reinforcement learning methods, train NNs to play games, for example, where input data, such as an image frame, is mapped to discrete actions, such as go up or go down.
Despite their broad use in decision-making, the potential of neural networks to act as controller schedulers, i.e., deciding which controller to use in a dynamic system, remains largely unexplored. In robotics, tasks such as locomotion or manipulation often involve distinct operational phases or environmental changes. These variations mean that a single controller is usually insufficient, creating a need for a higher-level scheduler to manage multiple specialized control strategies.
This work is motivated by the self-righting problem of a quadrupedal robot. The self-righting enables a robot to recover from a fall. Established strategies often use machine learning frameworks to determine the joint references for a low-level controller [
1] or employ a trained NN to manage the entire task [
2]. In a previous study [
3], a Reference Governor Control implemented using Model Predictive Control (RGC-MPC) was employed for the stand-up phase of the robot. Since MPC strategies rely on a dynamic model, and during the task, this model and its constraints change due to varying contact points, more than one MPC optimization problem is required to complete the task. Therefore, a method is needed to schedule these optimization problems to ensure the robot can recover safely. Thus, this paper investigates the following research question: Can a machine learning algorithm effectively train a neural network to derive a scheduling rule for controller selection based solely on the system’s proprioceptive states?
To address this question, this work develops a methodology that utilizes machine learning algorithms to train a neural network to select a controller for a dynamic system, employing a curriculum approach. To prevent the NN from choosing only one control index, a mode preference term is added to the reward/fitness function to penalize or reward the NN’s output. This work also presents a comparative study of machine learning frameworks for training neural networks, with a focus on deep learning and evolutionary strategies. The main goal of this paper is to obtain a simple NN policy, trained efficiently, that is capable of stabilizing an inverted pendulum.
This paper is organized as follows.
Section 2 presents the state-of-the-art of multiple controllers scheduled by neural networks.
Section 3 defines the core problem and the proposed methods for its solution.
Section 4 describes the methodology used to solve the problem of scheduling the controller using machine learning approaches. The results are presented and discussed in
Section 5. Finally,
Section 6 provides the findings of this work.
This paper uses the following notation: bold capital letters represent matrices, and bold lowercase letters represent vectors. Italic letters denote time-variant scalars, while non-italic, roman letters denote constants. The neural network weights or parameters, as well as the current and following system states, use symbols consistent with the relevant literature.
2. State of the Art
Researchers have utilized neural networks for controlling dynamic systems for a long time. In the 1990s, researchers published studies comparing the capabilities of neural networks with the traditional control methods available at the time [
4,
5]. Some key strengths of NNs highlighted in these studies include their ability to approximate nonlinear systems, a high degree of fault tolerance due to their intrinsic parallel distributed processing, their capacity to learn and adapt, and their natural ability to handle Multiple-Input–Multiple-Output (MIMO) systems. These characteristics make NNs suitable for control applications such as generating black-box models of dynamic systems.
Furthermore, in [
4,
5], the use of neural networks for supervised control is highlighted for direct inverse control, internal model control, predictive control, optimal control, adaptive control, gain scheduling, and filtering predictions. However, the authors [
4,
5] also pointed out challenges associated with using NNs in control applications, such as the lack of a rigorous stability theory for neural network-based control systems, difficulties in selecting network architecture, generalization problems, and robustness issues.
A specific and challenging application of NNs in control is their use as a high-level controller scheduler. In this case, the neural network is used to select an operational mode to control a dynamic system. The NN can be interpreted as a hierarchical controller responsible for scheduling specific control methods.
It is essential to distinguish this high-level scheduling approach from other common uses of NNs in control. One popular area is knowledge transfer, where a neural network is trained to mimic the behavior of a complex, computationally expensive controller, such as an MPC [
6]. The goal is to replace the original controller with a fast NN approximation. Another area is adaptive control, where NNs are often used to tune the parameters of an existing controller (like a PID) in real-time [
7,
8] or to learn and compensate for model uncertainties and parametric errors [
9].
The framework proposed in this work is fundamentally different from both of these. It is not knowledge transfer, as the NN is not trained to mimic any single expert. Furthermore, it is not a typical adaptive controller, as it does not modify the gains or parameters of a low-level controller. Instead, the NN functions purely as a high-level scheduler that selects one entire, pre-defined control law from a set of distinct, available options.
For systems that exhibit more complex behavior with significant nonlinearities, a single NN may struggle to learn the control behavior. In this sense, [
10] employs smaller controllers that are individually pre-trained to perform optimally over different parts of the system’s nonlinear behavior. In the online phase, a moving average estimates the root mean square (RMS) of the reference signal to select the desired NN. Reference [
11] proposes the Multi-Expert Learning Architecture (MELA), a hierarchical framework that learns to generate adaptive skills from a group of pre-trained expert policies. The core of their solution is a Gating Neural Network (GNN) that dynamically synthesizes a new, unified control policy in real-time by calculating a weighted average of the network parameters—the weights and biases—of all available experts. It is worth noting that both [
10,
11] employ neural networks as the underlying expert policies. In contrast, the present work uses a neural network to select from a set of distinct, predefined control strategies.
A hierarchical control framework for a rotary inverted pendulum is presented in [
12]. This architecture utilizes two collaborating neural networks trained by reinforcement learning agents: a low-level controller and a high-level supervisor. The low-level controller is trained by a Deep Deterministic Policy Gradient (DDPG) agent, which employs a 13-layer neural network to manage the continuous torque application. This NN learns to operate in two distinct modes: an energetic swing-up action to raise the pendulum and a fine-tuned stabilization action to maintain its upright position. A Proximal Policy Optimization (PPO) agent trains a high-level supervisor NN, which learns the discrete task of selecting when to switch the control objective from swing-up to stabilization, occurring when the pendulum reaches a near-vertical region. The authors validate this DDPG-PPO architecture against a conventional PID controller. The proposed method exhibits superior performance, stabilizing the pendulum more effectively and reducing the maximum angle deviation compared to the PID controller.
In [
12], the neural network training is closely tied to the system model. Consequently, if the model lacks sufficient fidelity, the sim-to-real transfer of the controller can be problematic. To mitigate this, modeling inaccuracies can be incorporated into the training process to enhance the NN’s robustness; however, this approach typically slows the convergence rate.
Training neural networks for scheduling linear controllers is discussed in [
13,
14]. In their approach, each controller has an associated NN that predicts the future cost of applying its corresponding controller, allowing a function to select the one with the minimum predicted cost. This approach is the one that most closely addresses the central question of this work.
The neural network training occurs in two stages. First, during an offline phase, the system is placed in random initial conditions distributed over the state space. Then, the system is simulated for each controller, gathering the state evolution into a dataset. Using this dataset, the cost-to-go is estimated as
$$J_i(\mathbf{x}_0) = \sum_{k=0}^{N} \gamma^{k}\,\ell\!\left(\mathbf{x}_k^{(i)}\right),$$
where $\gamma$ is a discount factor, $\mathbf{x}_N^{(i)}$ is the final state using the i-th controller starting from the initial state $\mathbf{x}_0$, and $\ell(\cdot)$ is a stage-cost function. Using the dataset, the neural network is trained to minimize the error between the actual value and the one predicted by the NN. A total of 5000 random trajectories were evaluated for this purpose [
13,
14].
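The offline stage described above can be sketched as follows. This is a hedged illustration, not the authors' code: `stage_cost` and the trajectory source are placeholders, and the backward recursion is mathematically equivalent to the discounted sum above.

```python
import numpy as np

def cost_to_go_targets(states, gamma, stage_cost):
    """Discounted cost-to-go J_k = sum_{j=k}^{N} gamma^(j-k) * l(x_j)
    for every step k of one trajectory, computed with a backward pass."""
    costs = np.array([stage_cost(x) for x in states], dtype=float)
    J = np.zeros_like(costs)
    running = 0.0
    for k in range(len(costs) - 1, -1, -1):
        running = costs[k] + gamma * running  # J_k = l(x_k) + gamma * J_{k+1}
        J[k] = running
    return J
```

Each controller's NN would then be trained by regression on pairs of visited states and these targets.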
In the second stage, with these pre-trained neural networks, training continues online; however, the paper does not clarify whether the switching rule is applied during this phase or whether new trajectories are only used to validate the NN predictions.
The main difference between the two works lies in this selection function. In [
13], a min function is used to schedule the controllers, and the solution is explored on a cart inverted pendulum (CIP) system. In contrast, [
14] employs a more elaborate switching control to avoid the chattering observed in the previous work; in addition to the CIP, a submerged vessel is also used for testing the solution.
The Control Scheduling Neural Network (CSNN) described in [
13,
14] allows model mismatch to be addressed by the low-level controllers at a high frequency, leaving the NN responsible only for selecting the appropriate control strategy. A drawback of this method is the need to evaluate the neural network output at each sampling instant. This process can be computationally expensive, particularly for large network architectures or systems with numerous control strategies. Furthermore, this approach requires using a single performance index for all controllers to ensure a fair comparison when selecting the optimal control action at time step k. This requirement implies that all candidate controllers must operate within the same state–space neighborhood.
The review of the state-of-the-art reveals two primary approaches for NN scheduling, each with significant drawbacks. RL-based methods [
12] are heavily model-dependent. In contrast, cost-to-go methods [
13] are limited by the need for a single, universal performance index, forcing all candidate controllers to operate within the same state–space neighborhood.
This work presents a novel framework for creating a controller scheduler using machine learning, specifically designed to address these limitations. The primary objective is to avoid using an energy function, as in [
13], thereby removing the constraint that limits controller architectures to operating in the same neighborhood. To achieve this, this study uses a policy to choose the controller instead of having the NN compute the control action directly. This approach can accelerate the training process, as the system dynamics are already known, and well-established controllers for this problem exist in the literature. The switching behavior is achieved through a mode preference term added to the reward/fitness function, where the coefficients can be tuned heuristically based on the designer’s knowledge. The methodology used in this study is presented in
Section 3.
4. Experimental Methodology
A cart-pendulum system is used to test the main hypothesis. However, a different model is adopted instead of the one presented in [
13]. This decision is based on the authors stating that the pendulum mass is concentrated at the rod tip, whereas the dynamic equations in their paper indicate a mass distributed along the rod length. Another issue is that neither the sample time nor the solver is mentioned. Furthermore, the authors demonstrate that for an initial condition of
rad (with other states at zero), the LQR is unstable, while the SM and VF controllers can stabilize the system. However, the initial tests revealed that none of the controllers could stabilize the system due to the force constraint. Therefore, the present work adopts the dynamics formulation from [
47] as
where the states are the cart position x and the rod angle. The remaining parameters are the mass of the cart; the friction coefficient between the cart and the horizontal bar; the pendulum damping coefficient d; the rod mass; the rod moment of inertia J; the location l of the rod mass; and the gravitational acceleration g.
Figure 3 shows the system representation. The system constants are shown in
Table A1 in
Appendix A.
Equation (
7) can be arranged in the state–space format, with the following state vector:
therefore, the dynamics are rewritten as
with
The system is subjected to the following constraints:
A Python class was implemented to simulate the system, where Equation (
7) is updated using the Forward Euler integration method with a 1 ms step. Additionally, to address the
x constraint, a soft-wall dynamic was implemented using a spring–damper model. The decision to use a Python implementation is based on the fact that the class can be easily wrapped into the Gymnasium framework [
48], which is known for having numerous testing environments for ML algorithms and the ability to create custom ones. Another aspect is that the Gymnasium framework is compatible with the SB3 framework.
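A minimal sketch of such a simulator class is given below; it is not the authors' implementation. The dynamics function, state ordering (cart position in x[0], cart velocity in x[1]), and wall parameters are assumptions for illustration.

```python
import numpy as np

class CartPendulumSim:
    """Forward Euler simulator with a spring-damper 'soft wall'
    enforcing the cart-position constraint (all parameters assumed)."""

    def __init__(self, f, dt=1e-3, x_lim=1.0, k_wall=1e4, c_wall=1e2):
        self.f = f            # continuous dynamics: x_dot = f(x, u)
        self.dt = dt          # integration step (1 ms, as in the text)
        self.x_lim = x_lim    # assumed cart-position limit
        self.k_wall = k_wall  # assumed wall spring stiffness
        self.c_wall = c_wall  # assumed wall damping

    def wall_force(self, pos, vel):
        """Spring-damper reaction when the cart penetrates a wall."""
        if pos > self.x_lim:
            return -self.k_wall * (pos - self.x_lim) - self.c_wall * vel
        if pos < -self.x_lim:
            return -self.k_wall * (pos + self.x_lim) - self.c_wall * vel
        return 0.0

    def step(self, x, u):
        """One Forward Euler step: x_{k+1} = x_k + dt * f(x_k, u_eff)."""
        u_eff = u + self.wall_force(x[0], x[1])
        return x + self.dt * np.asarray(self.f(x, u_eff))
```

A class of this shape can then be wrapped into a custom Gymnasium environment by exposing `reset` and `step` methods in the Gymnasium API.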
For the control set, the same approach as [
13] is used, with three controllers: a Linear Quadratic Regulator (LQR), a Velocity Feedback (VF) controller, and a Sliding Mode (SM) controller. The first is responsible for bringing the system to the origin; it was tuned for situations where the system has small angles and velocities. The VF (also an LQR-type controller) was designed to bring the system to small velocities in less than one second, without considering the cart position. Lastly, the SM is designed to bring the rod to small angles as quickly as possible. All the controllers are updated at a frequency of 100 Hz, ten times slower than the simulation's integration rate.
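The control set above can be sketched with a common interface that the scheduler selects from. The gains, sliding-surface parameters, and force limit below are illustrative placeholders, not the tuned values from the paper.

```python
import numpy as np

# Placeholder gains for state [x, x_dot, theta, theta_dot] (assumed ordering)
K_LQR = np.array([-1.0, -1.7, 18.0, 3.4])   # full-state gain (illustrative)
K_VF  = np.array([ 0.0, -2.0,  0.0, 1.5])   # velocity terms only (no x, theta)
F_MAX = 10.0                                 # assumed force constraint

def lqr(x):                       # drives all states to the origin
    return -K_LQR @ x

def velocity_feedback(x):         # damps velocities, ignores positions
    return -K_VF @ x

def sliding_mode(x, lam=5.0, eta=8.0):
    s = x[3] + lam * x[2]         # sliding surface on the angle dynamics
    return -eta * np.tanh(10.0 * s)   # smooth sign() to limit chattering

controllers = [lqr, velocity_feedback, sliding_mode]

def apply(idx, x):
    """Saturated control action of the selected controller."""
    return float(np.clip(controllers[idx](x), -F_MAX, F_MAX))
```

The scheduler only has to produce the index `idx` at 100 Hz; the selected control law then runs unchanged.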
The switching policies were derived for the control set
using the CSNN, PPO, A2C, DQN, CEM, and CMA-ES frameworks. For the discrete action space, the reward function is written as
Here, one term provides a survival reward, another penalizes high-energy states, and a third rewards progress toward lower-energy configurations. Two further terms encourage central positioning and upright posture, respectively, while the last term promotes convergence to the origin. The central-position and upright-posture terms are given by the Gaussian shaping function in Equation (13), whose positive width constant is chosen from the argument value at which the shaping decays to a prescribed level. Lastly, the preference mode term is defined in Equation (14).
In Equation (14), the soft probability that the controller of index i is the most suitable choice at time t is obtained from a softmax over controller scores. A small constant is included for numerical stability, and a shift constant ensures that the best choice contributes positively. To obtain these probabilities, it is first necessary to define the score vector, whose components for the LQR, VF, and SM controllers are derived as follows:
Using these score terms, and adjusting the constant values of Equation (16), it is possible to determine the state–space region where each controller has the most importance.
Figure 4 shows the scores for the LQR, VF, and SM.
From the scores, the softmax probability for each controller is
Therefore, since i is the current controller, its probability value is used in Equation (19). It is important to highlight that if, at time t, controller i does not have the highest score, then the final value of the preference term is negative; that is, the policy receives a penalty for choosing a non-suitable controller. Conversely, if it is the best choice, the score receives a bonus. Without the preference mode score, the trained policy tends to predict only one controller index. For the system studied, a strong tendency to select the VF controller was observed, likely because this controller can at least bring the pendulum rod upward in most scenarios.
For comparative purposes, continuous action–space policies were derived for the PPO, A2C, and CEM frameworks. In the continuous action–space case, the reward function is provided by
where the state-dependent terms correspond to Equation (11) without the mode-preference term, and an additional coefficient penalizes control effort. The weight values of the reward function and the constants of the preference mode function are provided in
Appendix C in
Table A2 and
Table A3, respectively.
For the SB3 and ES frameworks, a function
is used to convert from the system states
to the neural network states
, as shown in
Figure 1. For the cart pendulum, it is defined as
Using the proposed transformation simplifies the NN training, since the states are normalized. The network output is either a continuous control force or, for the discrete action space, a controller index; in the latter case, the control index is selected using a softmax in the last layer. For the CSNN method, to predict the cost of each controller, each NN takes all the pendulum states as input and produces a single output, the predicted cost-to-go.
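One plausible form of such a transform is sketched below; the exact expression and bounds from the paper's equation are not reproduced here, so the normalization constants are assumptions. Mapping the angle through sine and cosine keeps the NN input bounded and free of the 2π discontinuity.

```python
import numpy as np

X_MAX, V_MAX, W_MAX = 1.0, 3.0, 6.0   # assumed normalization bounds

def T(x):
    """Map the system state [x, x_dot, theta, theta_dot]
    to normalized neural network inputs."""
    pos, vel, th, om = x
    return np.array([pos / X_MAX, vel / V_MAX,
                     np.sin(th), np.cos(th), om / W_MAX])
```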
A curriculum-learning framework was developed to train the models. A radius-based method was employed to determine the initial conditions of the system states. The difficulty is minimal at the beginning of training, with initial conditions close to the system’s origin. The reward values increase as the neural networks learn to cope with environmental variations. In evolutionary strategy frameworks, the difficulty rises slowly over the generations.
In the SB3 framework, the training provides learning feedback from which sequences of successful episodes can be identified. The difficulty is also increased when the number of successful environments exceeds a predefined threshold. As the difficulty increases, the initial conditions are placed farther from the center, making the task progressively more challenging. The difficulty factor also scales the reward function weights to normalize the learning signal, ensuring the agent is not penalized for the longer time horizons required by more challenging tasks.
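The radius-based curriculum described in the two paragraphs above can be sketched as follows; the sampling bounds, success threshold, and difficulty step are illustrative, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed per-state radii for [x, x_dot, theta, theta_dot]
R_MAX = np.array([0.5, 1.0, 0.8, 2.0])

def sample_initial_condition(difficulty):
    """Uniform sample in a box around the origin scaled by
    difficulty in [0, 1]; difficulty 0 starts at the origin."""
    return rng.uniform(-1.0, 1.0, size=4) * difficulty * R_MAX

def update_difficulty(difficulty, success_rate, threshold=0.9, step=0.05):
    """Raise the difficulty once enough episodes succeed."""
    if success_rate > threshold:
        difficulty = min(1.0, difficulty + step)
    return difficulty
```

In the SB3 case the success rate would come from the training callbacks; for the evolutionary strategies, difficulty can simply be ramped over generations.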
To test the different scheduling algorithms, a common experimental testbed was developed based on the cart-pendulum problem. We first defined a set of three distinct controllers (
) to be scheduled: an LQR, a VF, and an SM controller, each designed for a different operating region. We then designed a unified, state-based reward function (Equation (
11)) that includes the mode preference term (Equation (
14)) to encourage practical switching. Finally, all policies were trained using a curriculum learning approach to progressively increase the task difficulty for both the machine learning algorithms (PPO, A2C, and DQN) and the evolutionary strategies (CEM, CMA-ES).
5. Results and Discussion
Considering the scheduling architecture presented in
Section 3 and the experimental methodology outlined in
Section 4, the NN models were trained to control the inverted pendulum on a cart. All developed scripts are available in the repository in [
49].
This section is organized as follows.
Section 5.1 details the final training parameters and neural network architectures.
Section 5.2 presents the comparative results of all frameworks. Finally,
Section 5.3 provides a discussion and summary of the key findings.
5.1. Training Setup
The neural network structure used for each framework is itemized in
Table 1.
For the PPO, A2C, and DQN algorithms, a fixed budget of time steps was used to train the agents in both the continuous and discrete action spaces. For the CEM and CMA-ES methods, the number of generations was determined empirically for the discrete and continuous cases. These values were selected by observing the learning metrics until the reward stabilized and the standard deviation of the population’s weights converged.
Regarding the CSNN, the cost-to-go in Equation (
1) was evaluated using the chosen discount factor and stage cost. The system states were initialized randomly and simulated for 2.0 s.
5.2. Training and Performance Analysis
The learning metrics for the DQN and CEM are shown in
Figure 5, while the metrics for the other methods are presented in
Appendix D.
For the DQN, as shown in
Figure 5a, the mean reward remains constant throughout the learning process, a result of the curriculum learning strategy. At the beginning of the training, the task’s difficulty is low. Consequently, the pole generally remains upward even when an action is chosen randomly from any of the controllers (due to a high exploration rate). Although the pole angle tends toward zero, the high variance in the reward (shaded gray area) occurs because the chosen controller may not always guide the cart to the center. As training progresses, the task difficulty increases, causing the loss to rise until it stabilizes around the midpoint of the process (
Figure 5c). Nevertheless, despite the increasing loss, the reward remains constant, indicating that the neural network learns to handle new situations not encountered during the initial stages of training.
Regarding the CEM, the results in
Figure 5b illustrate that, despite using the curriculum approach, the fitness value in the first generation is close to zero and can even be negative, mainly because the neural network weights are initialized randomly, causing the policy to change controllers frequently and leading the system to instability. As the generations progress, the policies become more deterministic. This convergence reduces the variance among the NN weights (
Figure 5d), resulting in better selections of the controller index for each scenario.
To compare the performance of all the methods analyzed, a set of 350 random initial conditions (ICs) was created and used to test each framework. For a better comparison across the different action spaces, a cost function was defined as
$$J = \sum_{k} \left(\mathbf{x}_k^{\top}\mathbf{Q}\,\mathbf{x}_k + u_k^{\top}\mathbf{R}\,u_k\right),$$
where $\mathbf{Q}$ and $\mathbf{R}$ are weight matrices for the states and control effort. In this analysis, the lower the cost, the better the system behavior.
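Assuming the conventional quadratic form with weight matrices Q and R (the paper's specific values are not reproduced), the comparison cost can be computed as:

```python
import numpy as np

def trajectory_cost(xs, us, Q, R):
    """Quadratic cost J = sum_k (x_k^T Q x_k + u_k^T R u_k)
    over a recorded trajectory of states xs and inputs us."""
    J = 0.0
    for x, u in zip(xs, us):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        u = np.atleast_1d(np.asarray(u, dtype=float))
        J += float(x @ Q @ x + u @ R @ u)
    return J
```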
Table 2 summarizes the results.
Based on the mean cost data analysis in
Table 2, the discrete action frameworks exhibit superior performance compared to their continuous counterparts. Despite using curriculum learning, the results show that, within the trained interval, the continuous-action PPO, A2C, and CEM were unable to learn the task. The primary hypothesis for this observation is that the neural networks are trapped in a local minimum and cannot escape from it. Further exploration of the hyperparameters for A2C and PPO could aid this learning process; however, it is not a trivial task, as it is time-consuming and requires careful consideration. For the CEM, an alternative approach is to increase the number of generations, allowing the best individuals to dominate the population.
The DQN, CEM, PPO, and CMA-ES frameworks show similar results within the discrete group. Examining the cost through their respective confidence intervals (DQN: [115.52, 193.92]; CEM: [129.05, 210.49]; PPO: [131.75, 213.72]; CMA-ES: [156.30, 213.72]) reveals a considerable overlap, indicating that there is no statistical evidence to claim that one method is more cost-efficient than another. The key distinction between these methods lies in their training efficiency: the CEM framework completed training approximately 6.5 times faster than the DQN approach and 9.90 times faster than PPO. This result highlights CEM as a compelling alternative to more conventional machine learning-based frameworks, as it combines comparable control performance with significantly reduced computational demands. Compared to other methods from the SB3 framework, A2C presents the lowest results but still achieves a considerably high success rate. Further fine-tuning of the hyperparameters for A2C and the other SB3 frameworks could improve these results.
The training results for the CSNN method are presented in
Figure 6.
Figure 6b (LQR),
Figure 6d (VF), and
Figure 6f (SM) show the root mean square error (RMSE) between the actual and predicted values, as well as the training loss. In all cases, the RMSE is less than 0.1 for both training and validation data (30.0% of the initial dataset).
Figure 6 on the right demonstrates the NNs’ predictive accuracy on a random initial trajectory. The networks’ predictions closely match the true cost-to-go, achieving low Mean Absolute Error (MAE) and high correlation for all three controllers: LQR (
Figure 6a; MAE: 0.14, Corr: 0.99), VF (
Figure 6b; MAE: 0.0055, Corr: 0.98), and SM (
Figure 6c; MAE: 0.059, Corr: 0.99).
Although the NNs effectively predict the controller cost, the overall results show poor performance for scheduling the controllers. The main hypothesis for this discrepancy in the results from [
13] is the lack of information about the sample time. To evaluate the cost-to-go, the stage cost must be multiplied by a discount factor at every sample. If a larger sample time, T, is used, some information is lost; conversely, if a small one is used, the cost tends to zero too rapidly. Therefore, the sample time is crucial information for implementing the CSNN strategy correctly.
Regarding the failure cases, 46 initial conditions were identified in which none of the frameworks could control the system. These conditions can be grouped into four categories: (i) contradictory dynamics, where the cart’s velocity direction opposes the pole’s lean; (ii) highly difficult starting conditions, characterized by large angular deviations (
) combined with non-negligible initial velocities (
or
); (iii) extreme initial velocities, defined as
; and (iv) high-energy states, where the total energy exceeds
. It is essential to mention that the failures are related to the force constraint: if there are no constraints, all controllers can bring the system to the origin.
Figure 7a illustrates the distribution of the complete set of initial conditions across the defined categories.
Figure 7b then isolates the subset of ICs for which all frameworks failed, showing the classification for only these unsuccessful cases.
Figure 8 shows the cart-pendulum state behavior using the DQN, CEM, and CSNN approaches to select the controller. The initial state is
, and all other states are zero, the same initial condition (IC) used in [
14]. Based on a heuristic analysis, the policy is expected first to choose the SM controller due to the large tilt angle, then the VF controller to reduce the cart velocity, and lastly the LQR to bring the pendulum to the system’s origin. It is possible to observe that both the DQN and CEM were able to stabilize the system at the origin (
Figure 8a,b). The main difference lies in the number of control switches: while the NN evolved by the CEM made only three changes (SM, SM to VF, and VF to LQR), the neural network trained by the DQN algorithm made 54 changes to the control index, a behavior that is not desired, as it can lead the system to an unstable situation. For the CSNN, the control architecture made 11 changes but was unable to control the system, as shown in
Figure 8c. As mentioned earlier, despite accurately predicting the cost-to-go, the switch rule was unable to select the correct controller.
Another aspect studied in this work is the system’s behavior when noise is added. In this case, another batch of random initial conditions was generated near the system’s origin to ensure that all controllers could stabilize the cart-pendulum system. With this configuration, the system’s mean cost was measured for each case studied. Then, white noise was added to the simulation using the same seed under the same conditions, and the mean cost was measured again. The cost variation percentage was evaluated from these two quantities.
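The cost variation percentage is a direct computation from the two measured mean costs:

```python
def cost_variation_pct(J_nominal, J_noise):
    """Percentage change of the mean cost when noise is added."""
    return 100.0 * (J_noise - J_nominal) / J_nominal
```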
Table 3 summarizes the results.
As expected, the cost grows in all situations when noise is added to the system. The discrete frameworks continue to have the lowest costs; however, the cost variation is of the same order for all methods, except for the CSNN and the continuous action–space PPO. The CEM was the framework whose performance deteriorated the most. Despite these cost increases, no failures were observed due to the addition of noise.
Another metric studied was the average time required for the policy to evaluate the output, also presented in
Table 3, together with its confidence interval. Considering that each policy update must occur within the controller’s 10 ms sample period, all frameworks can be utilized in a real-world application. The fastest update was observed using the CEM framework, where the update took less than 1.00% of the sample time.
As expected, due to its complexity and the number of NNs to be updated, the CSNN takes longer than the other frameworks. In this case, the update time consumes almost 10.00% of the sample period. Therefore, this is not a trivial amount of time and must be taken into consideration in the controller’s design.
5.3. Discussion Summary
The analysis confirms that both reinforcement learning and evolutionary strategies can successfully train a neural network policy for controller scheduling. The results yield three key findings. First, the discrete action–space frameworks (DQN, CEM, PPO, CMA-ES) significantly outperformed their continuous–space counterparts, which largely failed to converge. Second, while the top discrete methods showed statistically similar cost performance, the evolutionary-based CEM framework was approximately an order of magnitude faster to train than RL-based methods, such as PPO and DQN. Third, the quality of the resulting policy varied significantly: the CEM-trained policy was stable and logical.
In contrast, the DQN policy exhibited high-frequency “chattering,” making it less practical for a physical system. Finally, the baseline CSNN method failed to perform, highlighting a critical dependency on implementation details not provided in the original literature. These findings collectively suggest that for this scheduling problem, an evolutionary approach, such as CEM, offers the best trade-off between final performance, training efficiency, and policy stability.
6. Conclusions
This work investigated the possibility of using neural networks as control schedulers. The investigation focused on analyzing existing machine learning algorithms capable of handling systems with continuous action spaces. To do so, the Proximal Policy Optimization (PPO), Advantage Actor–Critic (A2C), and Deep Q-Network (DQN) algorithms from the Stable-Baselines3 library were investigated. Additionally, two evolutionary strategies were tested: the Cross-Entropy Method (CEM) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Lastly, a Control Scheduling Neural Network (CSNN) was implemented and tested. All training algorithms were evaluated on a cart-inverted pendulum system.
The results show that policies trained by machine learning algorithms can choose the correct controller based solely on the system’s states. The PPO, DQN, CEM, and CMA-ES frameworks achieved comparable performance in a pure cost analysis. The main difference was the computational time spent on training: the CEM was the fastest algorithm, yielding results in under two hours, a crucial metric, as it enables rapid testing compared to the others. The CSNN showed poor results, mainly due to the lack of implementation details in [
13]. A more thorough investigation could yield better results.
Another aspect studied in this paper is the comparison with purely neural network control schemes, where the NN generates the control force for the system. A cost comparison revealed that the discrete action space approach outperformed the continuous action space approach.
For future work, it is essential to investigate how the number of states and controllers affects the training process. Another aspect to be explored is the use of different control algorithms, such as a swing-up controller, to handle situations where a lack of control action leads to catastrophic scenarios. In this case, the new controller could bring the pendulum upward, and the other controllers could stabilize the system. Furthermore, it is essential to investigate the possibility of hybrid methods, such as using CEM for initial training and then adding the resulting NN to PPO or DQN algorithms for fine-tuning.