Chaotic van der Pol Oscillator Control Algorithm Comparison

The damped van der Pol oscillator is a chaotic non-linear system. Small perturbations in initial conditions may result in wildly different trajectories. Controlling, or forcing, the behavior of a van der Pol oscillator is difficult to achieve through traditional adaptive control methods. Connecting two van der Pol oscillators together, where the output of one oscillator, the driver, drives the behavior of its partner, the responder, is a proven technique for controlling the van der Pol oscillator. Deterministic artificial intelligence is a feedforward and feedback control method that leverages the known physics of the van der Pol system to learn optimal system parameters for the forcing function. We assessed the performance of deterministic artificial intelligence employing three different online parameter estimation algorithms. Our evaluation criterion is mean absolute error between the target trajectory and the response oscillator trajectory over time. Two algorithms performed better than the benchmark; we discuss the conditions under which each performs best. Recursive least squares with exponential forgetting had the lowest mean absolute error overall, with a 2.46% reduction in error compared to the baseline, feedforward without deterministic artificial intelligence. While least mean squares with normalized gradient adaptation had worse initial error in the first 10% of the simulation, after that point it exhibited consistently lower error. Over the last 90% of the simulation, deterministic artificial intelligence with least mean squares with normalized gradient adaptation achieved a 48.7% reduction in mean absolute error compared to baseline.


Introduction
The concept of chaos theory can be illustrated through the butterfly effect: the flapping of a butterfly's wings can create a small gust of air that ultimately leads to a storm on the other side of the world. Chaos theory refers to systems that are bounded, recurrent, and highly sensitive to initial conditions. See Figure 1 for an example of a chaotic system. For example, weather patterns are chaotic systems that can be influenced by small changes in initial conditions such as temperature, pressure, and wind patterns. This sensitivity makes predicting the weather challenging, even with the help of advanced computers.
Figure 1. The strange attractor of van der Pol and Duffing mixed type equation [1]. Image credit Yapparina, CC0, via Wikimedia Commons, used in accordance with image use policy [2].
In 2000, Boccaletti reviewed the major ideas involved in the control of chaos and proposed two methods, the Ott-Grebogi-Yorke (OGY) method and the adaptive method, both seeking to bring a trajectory to a small neighborhood of a desired location and to stabilize desired chaotic orbits embedded in a chaotic attractor, including a review of relevant experimental applications of the techniques [3]. Ott, Grebogi, and Yorke observed that the infinite number of unstable periodic orbits typically embedded in a chaotic attractor could be taken advantage of for the purpose of achieving control by applying only very small perturbations [4]. Song et al. proposed combining feedback control with the OGY method [5], and this trend of utilizing feedforward followed by application of feedback might be considered canonical. Back in 1992, Pyragas had already offered continuous control of chaos by self-controlling feedback [5].
Adaptive methods as offered by Slotine and Li [6] strictly rely on a feedforward implementation augmented by feedback adaptation of the feedforward dynamics, but also utilize elements of classical feedback adaptation (e.g., the M.I.T. rule [7]) to adapt classical feedback control gains and adapt the desired trajectory to eliminate tracking errors. Following kinetic improvements to Slotine's approach by Fossen [8,9], in 2017, Cooper and Heidlauf [10] proposed extracting the feedforward elements of Slotine's approach for controlling chaos of the famous van der Pol oscillator [11], which has grown to become a benchmark system for representing chaotic systems such as relaxation oscillators [12], frequency de-multiplication [13], heartbeats [14], and non-linear electric oscillators in general [15].
Cooper and Heidlauf's techniques were initially paired with linear feedback control (linear quadratic optimal control), but the combination of feedback and feedforward proved ineffective. Smeresky and Rizzo offered 2-norm optimal feedback using the pseudoinverse in 2020 [16], and the combination proved effective for controlling highly non-linear, coupled (non-chaotic) Euler's moment equations. The combination of feedforward and optimal feedback learning is labeled deterministic artificial intelligence. Zhai directly compared stochastic artificial intelligence methods, neural networks and physics-informed deep learning, in [17,18].
Recursive least squares was originally discovered by Gauss in the 1800s, but the work lay unused until Plackett rediscovered it and applied it to signal processing [19]. Several methods have emerged since, extending and resolving weaknesses of the recursive least squares method. Several of these, such as the addition of exponential forgetting and posterior residuals and their follow-on techniques, are discussed by Åström and Wittenmark in their textbook on adaptive control [20]. However, these works often fail to consider the cost required to determine the control input dependence of the system. In order to excite the system so that the response can be observed, and a model fit to it, resources such as fuel or electric power must be expended. In addition, the convergence rate per unit cost may be a deciding factor when choosing an approach. A controller must understand the system it is controlling before it is able to drive it to a desired state.
A secondary challenge is the inherent non-linearity of many systems present in real-world applications. Such systems cannot be written using simple linear models with respect to the system states and require different analysis techniques. An overview of existing analysis methods can be found in [21]. However, as mentioned before, these techniques do not attempt to minimize the cost of system identification. It is demonstrated in this work that an excitation controller based on optimal learning techniques and self-awareness statements provides a significantly higher accuracy per unit cost than other excitation signals. Optimal learning and self-awareness are components of deterministic artificial intelligence, which seeks to create intelligent systems that derive results through conditions restricted to obey physical models rather than purely stochastically through bulk data.
In this manuscript, we focus on the non-linear damped van der Pol oscillator. See Figure 2 for an example of damped van der Pol oscillator chaotic behavior. The van der Pol oscillator is named for Balthazar van der Pol, who conducted research on oscillatory systems for applications in vacuum tubes [12]. Scientists have used van der Pol oscillators for multiple other applications, such as modeling action potentials and countering electromagnetic pulse attacks. Adaptive control theory offers a means of stabilizing the van der Pol system through synchronization of a drive and response system.
Cooper and Heidlauf showed that the van der Pol system can be forced to asymptotically follow a desired trajectory in [10], as shown in Figure 3. They accomplished this by using a van der Pol oscillator, the driver, to force another van der Pol oscillator, the responder. We refer to this technique as feedforward van der Pol.
Smeresky and Rizzo proposed an improvement on feedforward van der Pol in [16] using deterministic artificial intelligence. The deterministic artificial intelligence method combines non-linear adaptive control and physics-based control methods. It is based on self-awareness statements enhanced by optimal feedback, which is reparametrized into a standard regression form. Online estimation methods such as recursive least squares estimate the system parameters for this optimal feedback. Adaptive algorithms for online estimation have one goal: obtain the best estimate of the coefficients such that the output signal converges to the input signal. The error between output and input is minimized using stochastic and deterministic methods. Three adaptive methods of calculating θ̂ were compared: a Kalman filter, recursive least squares with exponential forgetting (RLS-EF), and least mean squares with a normalized gradient (NLMS) [22]. Kalman filters and normalized gradients are both stochastic methods [22,23], while recursive least squares with exponential forgetting is a deterministic method [24].
Kalman filters offer an efficient computational solution to least squares that can estimate the state of the system in the past, present, and future even when the parameters of the system are unknown. The filter has two main steps, prediction and correction, that occur iteratively in a cycle. In the prediction or time update step, it projects the error covariance estimates and current state forward in time to obtain the next time step's a priori estimates. In the correction or measurement update step, the algorithm incorporates new observations into the a priori estimate to improve the a posteriori estimate. In other words, it uses the actual measurements to adjust the projected estimate [25].
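The predict/correct cycle described above can be sketched for the parameter-estimation case, where the "state" being estimated is the constant parameter vector θ. The following Python sketch is illustrative only (the paper's implementation uses SIMULINK® blocks); the process-noise variance q and measurement variance r are assumed tuning values, not values from the paper.

```python
import numpy as np

def kalman_step(theta, P, phi, y, q=1e-6, r=1.0):
    """One predict/correct cycle for constant-parameter estimation.

    State model: theta_k = theta_{k-1} + w (random walk, covariance q*I).
    Measurement model: y_k = phi_k . theta_k + v (variance r).
    """
    n = theta.size
    # Prediction (time update): parameters assumed constant, covariance grows.
    P = P + q * np.eye(n)
    # Correction (measurement update): blend the a priori estimate
    # with the new observation, weighted by the Kalman gain.
    S = phi @ P @ phi + r          # innovation variance (scalar measurement)
    K = P @ phi / S                # Kalman gain
    theta = theta + K * (y - phi @ theta)
    P = P - np.outer(K, phi) @ P
    return theta, P
```

Iterating this update over a stream of regressor/measurement pairs drives θ̂ toward the parameters that best explain the observations.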
Recursive least squares is a method for online estimation of linear models that recursively updates estimates of the model parameters using new data points. Recursive least squares with exponential forgetting is a type of recursive least squares algorithm that uses an exponential forgetting factor to forget older data points and give more weight to newer data points. This can be used to adapt the model to changing data more quickly.
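A minimal sketch of the RLS-EF update follows, assuming a scalar measurement model y_k = φ_kᵀθ + v_k; the forgetting factor λ = 0.98 here is an illustrative default, not the value used in the paper's SIMULINK® block.

```python
import numpy as np

def rls_ef_step(theta, P, phi, y, lam=0.98):
    """Recursive least squares with exponential forgetting factor lam.

    Samples of age j are down-weighted by lam**j, so newer data dominate
    and the estimate can track drifting parameters; lam = 1 recovers
    ordinary recursive least squares.
    """
    Pphi = P @ phi
    k = Pphi / (lam + phi @ Pphi)          # gain vector
    theta = theta + k * (y - phi @ theta)  # correct estimate with the residual
    P = (P - np.outer(k, Pphi)) / lam      # discount old information
    return theta, P
```

Dividing the covariance by λ at each step is what keeps the algorithm responsive: the information matrix never saturates, so new data points retain influence indefinitely.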
Least mean squares is a search algorithm based on gradient search that is known for its computational simplicity and stable behavior in discrete models with finite precision [26]. The normalized gradient least mean squares method differs from least mean squares in that it has a time-varying step size to update the adaptive filter coefficients [24].
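The time-varying step size that distinguishes NLMS from plain LMS can be seen in a few lines; this Python sketch is illustrative, with the step size μ = 0.5 an assumed value rather than one from the paper.

```python
import numpy as np

def nlms_step(theta, phi, y, mu=0.5, eps=1e-8):
    """Normalized gradient (NLMS) update with a time-varying effective step.

    The raw LMS step mu is divided by the instantaneous regressor power
    phi.phi, which keeps the update stable regardless of input scale;
    eps guards against division by zero for vanishing regressors.
    """
    e = y - phi @ theta                                # a priori output error
    theta = theta + (mu / (eps + phi @ phi)) * e * phi  # normalized gradient step
    return theta
```

Note the contrast with RLS-EF: NLMS carries no covariance matrix, so each update is O(n) rather than O(n²), at the cost of slower initial convergence.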
This manuscript compares the performance of deterministic artificial intelligence for non-linear adaptive control of a van der Pol oscillator when using three different online estimation methods: a Kalman filter, recursive least squares with exponential forgetting (RLS-EF), and normalized gradient least mean squares (NLMS).

Novelties Presented
Previous scholarship [16] has evaluated the performance of deterministic artificial intelligence algorithms on systems such as the van der Pol oscillator, amongst others. Comparative methods that have been analyzed include deep learning with artificial neural networks (ANN) and physics-informed neural networks (PINN) [17,18].
Building upon [16-18], this manuscript evaluates the performance of three online estimation algorithms for use in adaptive control with deterministic artificial intelligence: a Kalman filter, recursive least squares with exponential forgetting, and normalized gradient least mean squares. This paper evaluates the algorithms on one system only, the van der Pol oscillator. These specific algorithms have not previously been used for deterministic artificial intelligence. Moreover, the algorithms are already implemented in SIMULINK®, facilitating adoption and future research.

1.
Using a Kalman filter for online estimation in a deterministic AI architecture exhibited worse performance than the Cooper-Heidlauf baseline [10]. Both recursive least squares with exponential forgetting and least mean squares performed better than baseline. Figures 4-6 show the implementation of the forced van der Pol oscillator simulation in SIMULINK® with modular forcing function components. These figures in conjunction with MATLAB® code in Appendix A facilitate extension of results.

2.
An optimal approach to online estimation for deterministic artificial intelligence may involve using more than one algorithm. The authors found that one algorithm may exhibit superior performance when the error between predicted chaotic trajectories and observed trajectories is greatest, while another may have lower error when the system approaches the limit cycle, or steady state. The authors propose this as a direction for future research.

Figure 6. The forcing function block, F(t), adding DAI feedback as derived in [10,16]. In the depicted simulation, the Estimator block utilizes one of three methods to calculate θ̂: Kalman filter, recursive least squares with exponential forgetting, and normalized gradient least mean squares.

Materials and Methods
The van der Pol chaotic oscillator is defined by Equation (1):

ẍ + α(x² − β)ẋ + µx = F(t),    (1)

where F(t) is the forcing function and α, µ, and β define the system parameters, collectively referred to as θ. With deterministic artificial intelligence, the system parameters are estimated at each time step. Figure 4 is a SIMULINK® model that represents the system. The online estimation methods analyzed in this paper require two inputs: the regressors Φ_true observed from the system, and the forcing function from the previous time step, F(t − 1). The regressors are defined in Equation (2). Given the system inputs Φ_true and F(t − 1), the estimator performs regression to calculate the estimated system parameters, θ̂. These are used in conjunction with self-awareness statements to calculate F(t).
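As a concrete sketch of the plant, the oscillator can be integrated with a fixed-step fourth-order Runge-Kutta scheme, mirroring the fixed-step Runge-Kutta solver used in the Results section. This Python sketch assumes the generalized form ẍ + α(x² − β)ẋ + µx = F(t) with the ground-truth parameters θ = (5, 1, 1) quoted later in the paper; the initial condition is an arbitrary illustrative choice.

```python
import numpy as np

# Generalized van der Pol parameters theta = (alpha, mu, beta); the
# ground-truth setting reported in the Results section is alpha = 5,
# mu = beta = 1, which yields the classic chaos-prone oscillator.
ALPHA, MU, BETA = 5.0, 1.0, 1.0

def vdp_rhs(state, t, F=lambda t: 0.0):
    """x'' + alpha*(x**2 - beta)*x' + mu*x = F(t) as a first-order system."""
    x, xdot = state
    xddot = F(t) - ALPHA * (x**2 - BETA) * xdot - MU * x
    return np.array([xdot, xddot])

def rk4_step(state, t, dt, F=lambda t: 0.0):
    """Classic fourth-order Runge-Kutta step (fixed step, as in the simulations)."""
    k1 = vdp_rhs(state, t, F)
    k2 = vdp_rhs(state + 0.5 * dt * k1, t + 0.5 * dt, F)
    k3 = vdp_rhs(state + 0.5 * dt * k2, t + 0.5 * dt, F)
    k4 = vdp_rhs(state + dt * k3, t + dt, F)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

With F(t) = 0 the trajectory spirals out from the unstable origin onto the oscillator's limit cycle, which is the behavior the forcing function is designed to override.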
Self-awareness statements are formulated using the desired trajectory x_d, given in Equation (3), and represent the known physics of the van der Pol system. All together, we have Equation (4). See Figures 5 and 6 for the SIMULINK® architectures of the forcing function with feedforward only and feedforward with deterministic artificial intelligence.

Results
We used the Runge-Kutta MATLAB® 2022b solver in SIMULINK® with a step size of 0.01 and a simulation time of 100 s to run the model in Figure 4 given different forcing functions:
1. Unforced
2. Uniform noise
3. Sine wave
4. Feedforward (drive-response) [10]
5. Deterministic artificial intelligence (DAI)-Kalman filter estimator
6. Deterministic artificial intelligence (DAI)-recursive least squares with exponential forgetting estimator
7. Deterministic artificial intelligence (DAI)-normalized gradient least mean squares estimator

For the deterministic artificial intelligence methods, we initialized each element of θ̂ (α, µ, and β) as 0.1; the ground truth value was θ = [5, 1, 1]ᵀ. Despite the inaccuracy of the initial estimates, all deterministic artificial intelligence methods succeeded in iteratively estimating the truth values, with θ̂ converging to θ.
We analyzed the error between the trajectory of the responder van der Pol oscillator and the desired trajectory, a circle of radius 5. Code used to generate results and figures is included in Appendix A. The first three methods did not employ feedforward or deterministic artificial intelligence feedback. Both the unforced oscillator and the uniform noise forced oscillator exhibited the chaotic trajectory characteristic of damped van der Pol oscillators, visualized at the center of Figure 7. The sine wave forced the responder oscillator to deviate from the typical van der Pol trajectory, but the resulting trajectory was still chaotic. These results are shown in Tables 1 and 2 and plotted in Figure 8. Method 4, feedforward using the drive-response methodology in [10], successfully forced the responder oscillator to follow the target trajectory. Methods 5-7 built upon method 4, adding deterministic artificial intelligence feedback as described in [16] to estimate and learn the system parameters.

Figure 8. Error of x calculated over time for methods 4-7, in order. Note that the error curve of method 7, normalized gradient least mean squares, is distinct from the others. Mean absolute error (MAE) is used in assessing model performance because it handles both positive and negative error swings.
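For reference, the evaluation metric itself is straightforward; a minimal Python sketch of MAE over a sampled trajectory (the paper's analysis code is in Appendix A):

```python
import numpy as np

def mean_absolute_error(actual, desired):
    """MAE over a sampled trajectory: positive and negative deviations
    contribute symmetrically, unlike a plain mean of signed errors."""
    actual = np.asarray(actual, dtype=float)
    desired = np.asarray(desired, dtype=float)
    return float(np.mean(np.abs(actual - desired)))
```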

Discussion
A van der Pol oscillator tends toward a limit cycle, an invariant set determined by the initial conditions. This occurs when both sides of Equation (1) balance; in other words, the oscillator will, after some time, converge to a fixed trajectory. The goal of [10] was to command a van der Pol oscillator such that, no matter the initial conditions, it would asymptotically approach the same limit cycle. Smeresky and Rizzo went a step further in [16] by showing that deterministic artificial intelligence can improve the ability of the forcing function, F(t), to force the responder oscillator to follow a desired trajectory.
We build upon the results of [16] by comparing three different deterministic artificial intelligence system parameter estimators: a Kalman filter, recursive least squares with exponential forgetting (RLS-EF), and normalized gradient least mean squares (NLMS). We additionally compared these methods to forcing functions that do not use deterministic artificial intelligence, such as uniform noise, a sine wave, and the feedforward function in [10].
For all methods with feedforward (methods 4-7), the responder van der Pol oscillator had significant error in the beginning that decreased as the oscillator stabilized to the desired trajectory. This error approached an asymptote around zero.
Over the length of the entire simulation (10,000 time steps for 100 s at a sample time of 0.01), the deterministic artificial intelligence estimator with the lowest mean absolute error (MAE) was recursive least squares with exponential forgetting, with a 2.41% and 2.01% reduction in mean absolute error for x and ẋ, respectively, compared to feedforward alone. See the first two results columns of Table 3 for percent improvement. We found that deterministic artificial intelligence with the Kalman filter and normalized gradient least mean squares had worse performance than feedforward alone. These results suggested that the Kalman filter and normalized gradient least mean squares estimators should be avoided in favor of recursive least squares with exponential forgetting. However, we noticed that deterministic artificial intelligence using the normalized gradient least mean squares estimator exhibited decreased mean absolute error as the system approached the limit cycle. See the last two results columns of Table 3 for percent improvement after t = 1000. Deterministic artificial intelligence with normalized gradient least mean squares had 48.7% and 19.3% less mean absolute error for x and ẋ, respectively, compared to feedforward alone, and exhibited a 47.7% and 17.7% reduction in mean absolute error for x and ẋ compared to deterministic artificial intelligence with recursive least squares with exponential forgetting.

Conclusions
Simulation experiments indicate that the recursive least squares with exponential forgetting algorithm for parameter estimation is superior for the first 1000 time steps of the simulation, after which normalized gradient least mean squares exhibits superior performance. One algorithm may be preferable over the other depending on the intended application. For instance, because recursive least squares with exponential forgetting has minimal error when the system is far from the limit cycle, it may be better suited for systems where frequent external disturbances are expected.

Future Research
As a next step for future research, both recursive least squares with exponential forgetting and normalized least mean squares techniques could be used together, with the algorithm using recursive least squares with exponential forgetting for the first t time steps to estimate system parameters θ and then switching over to normalized gradient least mean squares once the system has found stability. Another research opportunity would be adding logic for the model to switch between different estimation methods when an external disturbance to the system is introduced.
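The proposed switching scheme can be sketched in a few lines. This Python sketch is a hypothetical illustration, not the paper's implementation: the class name, the hard switch index t_switch (motivated by the observed crossover near t = 1000), and the gains λ and μ are all assumed choices.

```python
import numpy as np

class SwitchingEstimator:
    """Hypothetical hybrid: RLS-EF during the transient, NLMS thereafter.

    Both update rules are standard; the switch index t_switch is a
    tunable assumption rather than a value derived in the paper.
    """
    def __init__(self, n, t_switch=1000, lam=0.98, mu=0.5):
        self.theta = np.zeros(n)        # parameter estimate theta-hat
        self.P = 100.0 * np.eye(n)      # RLS covariance
        self.t_switch, self.lam, self.mu, self.k = t_switch, lam, mu, 0

    def update(self, phi, y):
        err = y - phi @ self.theta
        if self.k < self.t_switch:
            # Transient phase: RLS with exponential forgetting.
            Pphi = self.P @ phi
            g = Pphi / (self.lam + phi @ Pphi)
            self.theta = self.theta + g * err
            self.P = (self.P - np.outer(g, Pphi)) / self.lam
        else:
            # Near the limit cycle: normalized gradient least mean squares.
            self.theta = self.theta + self.mu * err * phi / (1e-8 + phi @ phi)
        self.k += 1
        return self.theta
```

Detecting an external disturbance (e.g., a jump in the residual err) and resetting k to re-enter the RLS-EF phase would implement the disturbance-switching logic described above.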
More broadly, the use of hierarchical reinforcement learning for switching between estimation algorithms could be explored. Reinforcement learning involves training an agent to learn a policy in an environment, most often a simulation. The agent takes actions in the environment and receives rewards (or punishments) based on the outcome. The agent's policy is optimized to maximize cumulative rewards [17,18].
Hierarchical reinforcement learning involves decomposing a long-horizon task into subtasks. The higher-level policy will learn to perform a task by orchestrating lower-level policies to execute the subtasks. For instance, we can consider the task of driving an autonomous car to a store. The task can be broken into different subtasks, such as drive in the lines, signal with the blinkers, and pull into the parking lot. The high-level policy would execute the principal task by breaking it down into smaller tasks and orchestrating lower-level policies to complete them [17,18].
An interesting area of future research could leverage the overall concept of hierarchical reinforcement learning for online estimation to improve deterministic artificial intelligence performance. An agent would train a high-level policy to drive the system. Its goal would be minimizing error between the predicted and actual van der Pol trajectory in a simulation. The lower-level policies of the agent would, in this case, be the different online estimation methods, such as recursive least squares with exponential forgetting and normalized least mean squares. These methods would comprise a policy bank, or zoo, that the high-level policy could call upon to perform online estimation. The authors hypothesize that the high-level policy would learn a novel algorithm for optimal switching between online estimation methods. In this future research, additional online estimation methods could be explored and added to the lower-level policy bank.
Additionally, instead of using online estimation algorithms, such as recursive least squares with exponential forgetting, as low-level policies for the hierarchical reinforcement learning algorithm, low-level policies could instead be learned through reinforcement learning. The authors would find the behavior of such trained policies of great interest. Fawzi et al. show in [26] how reinforcement learning can independently discover optimal algorithms for tasks such as matrix multiplication. The reinforcement learning algorithms in this case discovered improvements on state-of-the-art algorithms such as Strassen's two-level algorithm, which has been recognized as the optimal algorithm for multiplication of 4 × 4 matrices for the past 50 years [26]. Future research into this domain may discover a new state-of-the-art algorithm for adaptive control, which may produce even better results within the deterministic AI framework outlined in the body of this work. Such an algorithm may have more general applicability to adaptive control theory.
A final future direction may be meta-learning, where a controller studies a related set of problems and generalizes to efficiently solve novel, similar problems. Systems that have well-understood dynamics, such as the van der Pol oscillator, are good targets for meta-learning. Ref. [27] outlines a methodology for meta-learning through reinforcement learning. Model-based information, such as self-awareness statements, is used during training but is unavailable during execution. Consequently, a generalized policy is learned that can function in different stochastic environments. One potential benefit of this is that the model may be resistant to external interference.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.