Pseudo Random Number Generation through Reinforcement Learning and Recurrent Neural Networks

A Pseudo-Random Number Generator (PRNG) is any algorithm generating a sequence of numbers approximating properties of random numbers. These numbers are widely employed in mid-level cryptography and in software applications. Test suites are used to evaluate the quality of PRNGs by checking statistical properties of the generated sequences, which are commonly represented bit by bit. This paper proposes a Reinforcement Learning (RL) approach to the task of generating PRNGs from scratch by learning a policy to solve a partially observable Markov Decision Process (MDP), where the full state is the period of the generated sequence and the observation at each time step is the last sequence of bits appended to that state. We use a Long Short-Term Memory (LSTM) architecture to model the temporal relationship between observations at different time steps, tasking the LSTM memory with extracting significant features of the hidden portion of the MDP's states. We show that modeling a PRNG with a partially observable MDP and an LSTM architecture substantially improves the results of the fully observable feedforward RL approach introduced in previous work.


Introduction
Generating random numbers is an important task in cryptography, and more generally in computer science. Random numbers are used in several applications, whenever producing an unpredictable result is desirable: for instance in games, gambling, encryption algorithms, statistical sampling, computer simulation and modeling, and many others.
An algorithm generating a sequence of numbers approximating properties of random numbers is called a Pseudo-Random Number Generator (PRNG). A sequence generated by a PRNG is "pseudo-random" in the sense that it is generated by a deterministic function: the function can be extremely complex, but the same input will always give the same sequence. The input of a PRNG is called the seed, and to improve randomness the seed itself can be drawn from a probability distribution. Of course, assuming that drawing from the probability distribution is implemented with an algorithm, everything is still deterministic at a lower level, and the sequence will repeat after a fixed, unknown number of digits, called the period of the PRNG.
While truly random number sequences are better suited to certain applications where true randomness is necessary - for instance, cryptography - they can be very expensive to generate. In most applications, a pseudo-random number sequence is good enough.
The quality of a PRNG is measured by the randomness of the generated sequences. The randomness of a specific sequence can be estimated by running a statistical test suite. In this paper, the National Institute of Standards and Technology (NIST) statistical test suite for random and pseudo-random number generators [Bas+10] is used to validate the PRNG.
Neural networks have been used for predicting the output of an existing generator, that is, to break the key of a cryptographic system. There have also been limited attempts at generating PRNGs using neural networks by exploiting their structure and internal dynamics. For example, the authors of [DPR12] use recurrent neural network dynamics to generate pseudo-random numbers. In [Hug07], the authors use the dynamics of a feedforward neural network with random orthogonal weight matrices to generate pseudo-random numbers. Neuronal plasticity is used in [Abd94] instead. In [DKM18], a generative adversarial network approach to the task is presented, exploiting an input source of randomness, such as an existing PRNG or a true random number generator.
A PRNG usually generates the sequence incrementally: it starts from the seed at time t = 0 to generate the first number of the sequence at time t = 1, then the second at t = 2, and so on. Thus, it is naturally modeled by a deterministic Markov Decision Process (MDP), where the state space, action space and rewards can be chosen in several ways. A Deep Reinforcement Learning (DRL) pipeline can then be used on this MDP to train a PRNG agent. This DRL approach was first used in [PP20] with promising results; see the modeling details in Section 2.2. It is a probabilistic approach that generates pseudo-random numbers with a "variable period", because the learned policy will generally be stochastic. This is a feature of this RL approach.
However, the MDP formulation in [PP20] has an action set with size growing linearly with the length of the sequence.This is a severe limiting factor, because when the action set is above a certain size it becomes very difficult, if not impossible, for an agent to explore the action space in a reasonable time.
In this paper, we overcome the above limitation with a different MDP formulation, using a partially observable state. By observing only the last part of the sequence, and using the hidden state of a Long Short-Term Memory (LSTM) neural network to extract important features of the full state, we significantly improve the results of [PP20]; see Section 3.
The code for this article can be found in the GitHub repository [Pas].

Methodology
The main idea is quite natural: since a PRNG builds the random sequence incrementally, we model it as an agent in a suitable MDP, and use DRL to train the agent. Several "hyperparameters" of this DRL pipeline must be chosen: a good notion of states and actions, a reward whose maximization gives sequences as close as possible to truly random sequences, and a DRL algorithm to train the agent. In this section, we first introduce the DRL notions used in the paper (Section 2.1); we then describe the fully observable MDP used in [PP20] and the partially observable MDP formulation used in this paper (Section 2.2); Section 2.3 describes the reward function used for the MDP; Section 2.4 describes the recurrent neural network architecture used to model the environment state; finally, Section 2.5 describes the software framework used for the experiments.

Reinforcement Learning (RL)
For a comprehensive and thorough introduction to RL, we strongly suggest reading Sections 1.1 to 1.6 of [SB18].
RL is learning what to do in order to accumulate as much reinforcement as possible during the course of actions. This very general description, known as the RL problem, can be framed as a sequential decision-making problem as follows.
Assume an agent is interacting with an environment. When the agent is in a certain situation - a state - it has several options - called actions. After each action, the environment will take the agent to a next state, and will provide it with a numerical reward, where the pair "state, reward" may possibly be drawn from a joint probability distribution, called the model or the dynamics of the environment. The agent will choose actions according to a certain strategy, called a policy in the RL setting. The RL problem can then be stated as finding a policy maximizing the expected value of the total reward accumulated during the agent-environment interaction.
To formalize the above description, see Figure 1 representing the agent-environment interaction.
At each time step t, the agent receives a state S t ∈ S from the environment, and then selects an action A t ∈ A. The environment answers with a numerical reward R t+1 ∈ R ⊂ ℝ and a next state S t+1 . This interaction gives rise to a trajectory of random variables: S 0 , A 0 , R 1 , S 1 , A 1 , R 2 , . . . In the case of interest to us, S, A and R are finite sets. Thus, the environment answers the action A t = a executed in the state S t = s with a pair R t+1 = r, S t+1 = s' drawn from a discrete probability distribution p on S × R, the model of the environment. Note that p(·, ·|s, a) is a probability distribution for every state-action pair s, a.
Figure 1: ©[SB18, figure 3.1] The agent-environment interaction at discrete time steps t = 0, 1, 2, . . . . At each time step t, the agent uses the state S t ∈ S given by the environment to select an action A t ∈ A; the environment answers with a reward R t+1 ∈ R ⊂ ℝ and a next state S t+1 , yielding the trajectory S 0 , A 0 , R 1 , S 1 , A 1 , R 2 , . . .
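As a concrete illustration, the agent-environment loop of Figure 1 can be sketched as follows. This is a minimal toy example, not the paper's code: `toy_env_step` and `uniform_policy` are hypothetical stand-ins for the dynamics p and the policy π.

```python
import random

def toy_env_step(state, action):
    """Return (reward, next_state) from a made-up dynamics p(s', r | s, a)."""
    next_state = (state + action) % 2          # deterministic toy transition
    reward = 1.0 if next_state == 1 else 0.0   # reward depends on the next state
    return reward, next_state

def run_episode(policy, horizon=5, seed=0):
    """Collect the trajectory (S_0, A_0, R_1), (S_1, A_1, R_2), ..."""
    rng = random.Random(seed)
    trajectory = []
    state = 0                                   # S_0
    for t in range(horizon):
        action = policy(state, rng)             # A_t ~ pi(.|S_t)
        reward, next_state = toy_env_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

# A uniform random policy over the action set {0, 1}.
uniform_policy = lambda s, rng: rng.choice([0, 1])
traj = run_episode(uniform_policy)
```

The environment is fed only the last action and answers from the current state alone, which is exactly the Markov assumption discussed below.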
Figure 1 implicitly assumes that the joint probability distribution of S t+1 , R t+1 depends on the past only via S t and A t . In fact, the environment is fed only with the last action, and with no other data from the history. This means that, for a fixed policy, the corresponding stochastic process {S t } is Markov. This gives the name Markov Decision Process (MDP) to the data (S, A, R, p). Moreover, it is a time-homogeneous Markov process, because p does not depend on t. In certain problems the agent can see only a portion of the full state, called an observation. In this case, we say that the MDP is partially observable. Observations are usually not Markov, because the unobserved portion of the state can contain relevant information for the future. In this paper we model a PRNG as an agent in a partially observable MDP.
When the agent experiences a trajectory starting at time t, it accumulates a discounted return G t := R t+1 + γ R t+2 + γ^2 R t+3 + · · ·. The return G t is a random variable, whose probability distribution depends not only on the model p, but also on how the agent chooses actions in a certain state s. Choices of actions are encoded by the policy, i.e. a discrete probability distribution π on A: π(a|s) := π(a, s) := Pr(A t = a|S t = s). A discount factor γ < 1 is used mainly when rewards far in the future are less and less reliable or important, or in continuing tasks, that is, when the trajectories do not decompose naturally into episodes. Since the partially observable MDP formulation we use is episodic (see Section 2.2), in this paper we use γ = 1.
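The recursion G t = R t+1 + γ G t+1 makes the return easy to compute backwards over a list of collected rewards. A minimal sketch:

```python
def returns(rewards, gamma=1.0):
    """Discounted returns [G_0, ..., G_{T-1}] from rewards [R_1, ..., R_T]."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g            # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(g)
    return list(reversed(out))
```

With γ = 1, as used in this paper, each G t is simply the sum of all future rewards in the episode.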
The average return from a state s, that is, the average total reward the agent can accumulate starting from s, represents how good the state s is for an agent following the policy π, and it is called the state-value function: v π (s) := E π [G t |S t = s]. Likewise, one can define the action-value function (also known as the quality or q-value), encoding how good it is to choose an action a from s and then follow the policy π: q π (s, a) := E π [G t |S t = s, A t = a]. Since the return G t is recursively given by R t+1 + γ G t+1 , the RL problem has an optimal substructure, expressed by recursive equations for v * and q * . If an accurate description of the dynamics p of the environment is available and if one can store all states in memory, then iterative dynamic programming techniques can be used, and an approximate solution v * or q * to the Bellman optimality equations can be found. From v * or q * one can then easily recover an optimal policy: for instance, π * (s) := argmax a∈A q * (s, a) is a deterministic optimal policy. However, in most problems we have only partial knowledge of the dynamics, if any. This can be overcome by sampling trajectories instead of computing a true expectation. Moreover, in most problems there are far too many states to store in memory, or even to visit just once. In this case, the estimate of q π (s, a) must be stored in a parametric function approximator q π (s, a; w), where w is a parameter vector of dimension much lower than |S × A|. Due to their high representational power, deep neural networks are nowadays widely used as approximators in RL: the combination of deep neural networks with RL is called Deep Reinforcement Learning (DRL).
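The policy extraction step π * (s) := argmax a∈A q * (s, a) can be sketched on a hypothetical tabular q-function (the dictionary `q` below is a made-up example, not related to the PRNG task):

```python
def greedy_policy(q, actions):
    """pi*(s) = argmax_a q(s, a), for a q-table keyed by (state, action)."""
    states = {s for (s, _) in q}
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
```

In large state spaces this table is replaced by the parametric approximator q π (s, a; w) described above.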
Iterative dynamic programming techniques can then be approximated, giving a family of algorithms known as Generalized Policy Iteration algorithms. They work by sampling trajectories to obtain estimates of the true values E π [G t |S t = s, A t = a], and using supervised learning to find the optimal parameter vector w for q π (s, a; w). This estimated q-value is used to find a policy π' better than π, and iterating over this evaluation-improvement loop usually gives an approximate solution to the RL problem.
Generalized Policy Iteration is value-based, because it uses a value function as a proxy for the optimal policy. A completely different approach to the RL problem is given by Policy Gradient (PG) algorithms. They estimate the policy π(a|s; θ) directly, without using a value function. The parameter vector θ t at time t is modified to maximize a suitable scalar performance function J(θ), with the gradient ascent update rule: θ t+1 := θ t + α ∇J(θ t ). Here the learning rate α is the step size of the gradient ascent algorithm, determining how much we try to improve the policy at each update, and ∇J(θ t ) is any estimate of the performance gradient ∇J(θ) of the policy. Different choices for the estimator correspond to different PG algorithms. The vanilla choice for the estimator ∇J(θ t ) is given by the Policy Gradient Theorem, leading to an algorithm called REINFORCE and to its baselined derivatives; see for instance [SB18, Section 13.2 and forward]. Unfortunately, vanilla PG algorithms can be very sensitive to the learning rate, and a single update with a large α can spoil the performance of the policy learned so far. Moreover, the variance of the Monte Carlo estimate is high, and a huge number of episodes is required for convergence. For this reason, several alternatives for ∇J(θ t ) have been researched.
In this paper we use the best performing algorithm in [PP20], a PG algorithm called Proximal Policy Optimization (PPO), described in [Sch+17] and considered state-of-the-art in PG methods.PPO tries to take the biggest possible improvement step on a policy using the data it currently has, without stepping too far and making the performance collapse.
The PPO used in this paper is an instance of PPO-Clip as described by OpenAI at [Ope], with only partially shared value and policy heads. More details on the neural network architecture are given in Section 2.4. We use a clip ratio of 0.2, and early stopping: if the mean KL-divergence of the new policy from the old one grows beyond a threshold, training is stopped for the policy, while it continues for the value function. We used a threshold of 1.5 · K, with K = 1e−2.
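The clipped surrogate at the core of PPO-Clip can be sketched for a single sample as follows. This is a simplified scalar version for illustration only: real implementations average over a mini-batch and add value and entropy terms.

```python
def ppo_clip_objective(pi_new, pi_old, advantage, epsilon=0.2):
    """Clipped surrogate for one (state, action) sample.

    pi_new, pi_old: probabilities of the taken action under the new/old policy.
    advantage: the advantage estimate A_t for that sample.
    """
    ratio = pi_new / pi_old                          # probability ratio r_t(theta)
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # Taking the min removes any incentive to push the ratio outside
    # [1 - epsilon, 1 + epsilon] in the direction suggested by the advantage.
    return min(ratio * advantage, clipped * advantage)
```

This is what keeps the improvement step "not too far": large ratios stop contributing extra objective value once they leave the clipping interval.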
To reduce the variance, we used Generalized Advantage Estimation as in [Sch+15] to estimate the advantage, with γ = 1 and λ = 0.95. Saved rewards R t were normalized with respect to when they were collected (called rewards-to-go in [Ope]).
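A minimal sketch of Generalized Advantage Estimation, with the γ = 1, λ = 0.95 setting used in this paper (a simplified version of the estimator in [Sch+15], not the actual implementation from the repository):

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """GAE advantages; values has length len(rewards) + 1 (bootstrap value)."""
    advantages, a = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        a = delta + gamma * lam * a                             # GAE recursion
        advantages.append(a)
    return list(reversed(advantages))
```

λ interpolates between the low-variance one-step TD error (λ = 0) and the high-variance Monte Carlo advantage (λ = 1).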

Modeling a PRNG as a MDP
In [PP20] we modeled a PRNG as a fully observable MDP in the following way. Choose a bit length for the sequence, say B. The state space is given by all possible bit sequences of length B: S := {0, 1}^B. Action 1 n is the action of setting the n-th bit to 1, and 0 n is the action of setting the n-th bit to 0. The action space is then: A := {0 n , 1 n : n = 1, . . . , B}. This finite MDP formulation has |S| = 2^B and |A| = 2B, and was called the Binary Formulation (BF) in [PP20].
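The Binary Formulation can be sketched as follows, encoding action b n as the pair (n, b) (a minimal illustration of the state and action spaces, not the paper's environment code):

```python
# States are bit sequences of length B; action (n, b) sets the n-th bit to b.
B = 8
initial_state = [0] * B                                  # one of the 2**B states
actions = [(n, b) for n in range(B) for b in (0, 1)]     # |A| = 2 * B

def bf_step(state, action):
    """Apply a BF action: set the n-th bit of the state to b."""
    n, b = action
    next_state = list(state)
    next_state[n] = b
    return next_state
```

Note how `actions` grows linearly with B, which is exactly the limitation discussed below.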
The main problem with BF is that the size 2B of the action set grows linearly with the length B of the sequence. Above a certain size, it is no longer possible to learn a policy with a high average score. If the size is big enough, no policy can be learned at all, because it is almost impossible for an agent to explore such a huge action space in a reasonable time. Consider that for PRNGs a sequence of 1000 bits is quite short, while for an RL problem 2000 actions are way too many.
In this paper we overcome this limitation of BF by hiding a portion of the full pseudo-random sequence, letting the agent see and act only on the last N bits.This removes the correlation between the final length of the sequence and the number of actions, at the cost of introducing a temporal dependency among states, breaking Markovianity and making the resulting MDP partially observable.This new problem is solved by approximating the hidden portion of the state with the hidden state of a recurrent neural network with memory, see Section 2.4 for details.We call this new approach Recurrent Formulation (RF).
Let N ∈ ℕ be the number of bits at the end of the full sequence that we want to expose; in other words, let the observation space be the set O := {0, 1}^N. We want the agent to be able to freely change the last N bits, that is, the action space A coincides with O. We also fix a predetermined temporal horizon T for episodes. This means that the size of the action space is |A| = 2^N for a generated sequence of T · N bits. Both |A| and the length of S T depend on N, but they do not depend on each other. The length of the generated sequence can then be increased by increasing the horizon T, leaving the action space size constant. Clearly, this formulation can work only if the policy approximator π(·|·; θ) can preserve some information from the time series given by past observations. Recurrent neural networks can model memory, and for this reason are one of the possible approaches to processing time series and, more generally, temporal relationships among data. This is very useful in RL environments where, from the point of view of the agent, the Markov property does not hold, as is typically the case in many partially observable environments. This approach was suggested in [DUS12]; see [Wan+18; Cha19] for examples of use cases far different from ours.
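The Recurrent Formulation step can be sketched as follows: at each step the agent emits N bits, which are appended to the hidden full sequence, and only the last N bits come back as the observation (a minimal illustration, not the paper's environment code):

```python
N = 5  # bits exposed to the agent at each step

def rf_step(full_sequence, action_bits):
    """Append the N chosen bits; the agent never sees full_sequence directly."""
    assert len(action_bits) == N
    full_sequence = full_sequence + list(action_bits)
    observation = full_sequence[-N:]     # the only portion the agent observes
    return full_sequence, observation
```

Here |A| = 2^N regardless of the horizon T, so the final length T · N can grow by increasing T while the action space stays fixed.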
State-of-the-art recurrent neural networks for this kind of problem are Gated Recurrent Unit (GRU) networks and Long Short-Term Memory (LSTM) networks. Typically, GRU performs better than LSTM, but for this particular formulation we experienced good performance with LSTM. Thus, we use LSTM layers to approximate the policy network; see Section 2.4 for details on the neural network architecture.

Reward and NIST Test Suite
The NIST statistical test suite for random and pseudo-random number generators is the most popular application for testing the randomness of sequences of bits. It was published as the result of a comprehensive theoretical and experimental analysis, and may be considered the state-of-the-art in randomness testing for cryptographic and non-cryptographic applications. The test suite became a standard stage in assessing the output of PRNGs shortly after its publication.
The NIST test suite is based on statistical hypothesis testing and contains a set of statistical tests specially designed to assess different properties of pseudo-random number sequences. Each test computes a test statistic, a function of the input sequence. This value is then used to calculate a P-value that summarizes the strength of the evidence for the sequence being random. For more details, see [Bas+10].
If S t is the sequence produced by the agent at time t, the NIST test suite can be used to compute the average P-value of all eligible tests run on S t . Some tests return multiple statistic values: in that case, their average is taken. If a test fails, its value in the average is set to zero. Some tests are not eligible on sequences that are too short, and in this case they are not considered for the average. This average avg NIST (S t ) is used at the end T of each episode as a reward function for the MDP:
R t := avg NIST (S t ) if t = T, and R t := 0 otherwise.
This is the same reward strategy used in [PP20]. Note that, since P-values are probabilities, rewards belong to [0, 1], and that the accuracy of the NIST test suite grows with the length of the tested sequence.
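The averaging and the sparse reward can be sketched as follows. The test objects here are hypothetical stand-ins with `eligible` and `run` methods; the real battery is the NIST suite run via the NistRng framework (Section 2.5).

```python
def avg_nist(sequence, tests):
    """Average P-value over eligible tests; a failed test contributes zero."""
    scores = []
    for test in tests:
        if not test.eligible(sequence):
            continue                      # non-eligible tests are skipped
        p_value, passed = test.run(sequence)
        scores.append(p_value if passed else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

def reward(sequence, t, horizon, tests):
    # Sparse reward: non-zero only at the final step t = T of the episode.
    return avg_nist(sequence, tests) if t == horizon else 0.0
```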

Neural Network Architecture
PPO is an actor-critic algorithm. This means that it requires a neural network for the policy (the actor) and another neural network for the value function (the critic). Moreover, since RF is a partially observable MDP formulation, we need a way to retain as much information as possible from previous observations, without exponentially increasing the size of the state space. We solve this problem with LSTM layers [HS97].
The neural network used for RF starts with two LSTM layers, with bias = 1 for the forget gate. After the LSTM layers, the network splits into two subnetworks, one for the policy and one for the value function. The policy subnetwork has three stacked dense layers with 256, 128 and 64 neurons respectively, all with ReLU activation. After this, a dense layer with 2^N neurons provides preferences for the actions, which are turned into probabilities by a softmax activation. This is the policy head. The value function subnetwork starts exactly as the policy one (but with different weights): three stacked dense layers with 256, 128 and 64 neurons respectively, all with ReLU activation. At the end, a dense layer with 1 neuron and no activation outputs the state value. This is the value head. All layers have Xavier initialization. For additional details, refer to the GitHub repository [Pas].
BF uses two different neural networks for the policy and the value function. The policy network has three stacked dense layers with 256, 512 and 256 neurons respectively, all with ReLU activation. After this, the policy head has the same structure as in RF: a dense layer with one neuron per action and softmax activation. The value function network is analogous: three stacked dense layers with 256, 512 and 256 neurons respectively, with ReLU activation. After this, the value function head is the same as in RF: a dense layer with 1 neuron and no activation. All layers have Xavier initialization. For additional details, refer to the GitHub repository [Pas].

Framework
The framework used for the RL algorithms is USienaRL. This framework allows for environment, agent and interface definition using a preset of configurable models. While agents and environments are direct implementations of what is described in RL theory, interfaces are specific to this implementation. Under this framework, an interface is a system used to convert environment states to agent observations, and to encode agent actions into the environment. This makes it possible to define agents operating on different spaces while keeping the same environment. By default, an interface is defined as a pass-through, i.e. a fully observable state where agent actions have a direct effect on the environment. The two variants of BF are defined using different interfaces in the implementation. Specifically, the wanderer interface masks out all actions resulting in a bit being set to its current value, while the baseline uses a simple pass-through interface.
The NIST test battery is run with another framework, called NistRng. This framework allows us to easily run a customizable battery of statistical tests over a given sequence. The framework also computes which tests are applicable to a given sequence, e.g. depending on its length. Applicable tests for a certain sequence are called eligible tests. Each test returns a value and a flag stating whether the test was passed by the sequence. If a test is not eligible with respect to a certain sequence, it cannot be run and is skipped.

Results
Our experiments consist of multiple sets of training processes of various BF and RF agents. The goal of these experiments is to measure the performance of the new RF agents and compare it with the results achieved by BF agents.
We consider 3 different agents, all trained with the PPO-Clip algorithm described in Section 2.1 and the actor-critic neural networks described in Section 2.4.
The agent based on the formulation RF introduced in this paper is called π RF , and similarly we denote by π BF the agent based on BF, which we use as a baseline. We also introduce a third agent, a variation of π BF denoted by πBF and called the "wanderer": this agent is forced to move as much as possible within the environment by masking out all actions that would keep it in the same state. In other words, at time step t the agent πBF is forbidden from setting a bit to the value it had at the previous time step t − 1.
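The wanderer masking can be sketched as follows, reusing the (n, b) encoding of BF actions (a minimal illustration; the actual masking lives in the USienaRL interface, Section 2.5):

```python
def wanderer_mask(state):
    """Allowed wanderer actions: only those flipping the current n-th bit.

    For each position n, the action (n, state[n]) is masked out, leaving
    exactly the B actions (n, 1 - state[n]) that change the state.
    """
    return [(n, 1 - state[n]) for n in range(len(state))]
```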
In the experiments, we optimize a PPO-Clip loss with GAE advantage estimation. Separate policy and value losses are optimized by Adam with mini-batches of 32 experiences drawn randomly from a buffer. The buffer is filled by 500 episodes for π BF and πBF , and by 1000 episodes for π RF . Once the buffer is full, a training epoch is performed: thus, for instance, an agent π BF building sequences of length B = 200 with episodes of length T = 100 will start training when the buffer is filled with 500 · 100 = 50,000 experiences, and the epoch will end after ⌊50,000/32⌋ = 1562 training steps. Learning rates are 3e−4 for the policy and 1e−3 for the value function. At the end of the epoch, the buffer is emptied, and the RL pipeline goes on by experiencing new episodes.
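The buffer arithmetic above can be written as a quick sanity check:

```python
def training_steps_per_epoch(episodes, horizon, batch_size=32):
    """Mini-batch steps in one epoch: floor(episodes * horizon / batch_size)."""
    experiences = episodes * horizon
    return experiences // batch_size

# 500 episodes of T = 100 steps: 50,000 experiences, 1562 mini-batches of 32.
steps = training_steps_per_epoch(500, 100)
```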
We present experimental results in the form of plots. As in [PP20], the performance metric used is the average total reward across sets of epochs called volleys. In this paper, a volley is made of two epochs, that is, the plots represent a moving average over a window of two epochs. For very short sequences of 80 bits, πBF has better performance at the end of training, yet very similar performance at the beginning. Making the sequence longer, B = 200 bits, allows the wanderer agent to perform much better than the baseline: starting from slightly different performance at the beginning, πBF reaches a much higher average total reward at the end of the training process. This trend is confirmed with sequences of B = 400 bits.
Figures 5, 6 and 7 describe three experiments with π RF . In this case, the length of the trajectory T directly influences the length of the sequence generated by π RF . We keep T = 100 constant across π RF experiments. This value was chosen experimentally as giving the best performance while keeping inference time acceptable. To obtain sequences comparable with the ones generated by BF agents, we append N = 2, N = 5 and N = 10 bits respectively, resulting in final sequences of length 2 · 100 = 200 bits, 5 · 100 = 500 bits and 10 · 100 = 1000 bits respectively. The seed S 0 of the PRNG is drawn from a standard multivariate normal distribution of dimension N = 2, N = 5 and N = 10 respectively.
For N = 2, training is not successful. We do not have a clear explanation for this fact. A possible reason is that we are trying to represent the non-observable portion of the state with a very low-dimensional input. However, this is not completely convincing, because in theory the input from every time step is preserved in the hidden state of the LSTM. This issue could be related to the difficulties that recurrent neural networks have shown with gradient descent optimizers; see [BSF94].
For N = 5, that is, sequences of 500 bits, π RF vastly outperforms π BF and, albeit by a narrower margin, πBF with sequences of 200 bits. Thus, we obtain better performance with 150% more bits in the sequence. For N = 10, that is, sequences of 1000 bits, π RF vastly outperforms π BF with B = 400. The wanderer agent in this case performs similarly, but π RF still produces sequences that are 150% longer than those produced by πBF .
We now want to test the hypothesis that, in order to generate PRNGs by DRL, modeling the MDP as in RF is much better than modeling it as in BF.To this aim, we compare the average total rewards per episode of random agents operating in RF and in the baseline BF.
We call these agents ρ RF and ρ BF respectively. This comparison is performed to exclude the possibility that PPO simply performs better with RF while other algorithms would perform better with BF. Using a random agent removes the algorithm from the equation. Figures 8a, 8b, 9a, 9b, and 10a, 10b describe three comparisons between the random agents producing short, medium and long sequences respectively. As before, short means 80 bits in BF and 200 bits in RF, medium means 200 bits in BF and 500 bits in RF, and long means 400 bits in BF and 1000 bits in RF. In each comparison, the average total reward of ρ RF is higher than that of ρ BF . From these results, we conclude that RF is a better formulation than BF for the PRNG task.
Finally, Figures 11a, 11b and 11c graphically represent three sequences of 1000 bits generated by the same agent π RF after training, with NIST scores 0.4, 0.43 and 0.53 respectively. The 1000 bits are stacked in 40 rows and 25 columns, then ones are converted to 10 × 10 white squares and zeros to 10 × 10 black squares, and the resulting image is smoothed.
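The rendering step (before smoothing) can be sketched as follows (a minimal illustration of the grid construction, not the paper's plotting code):

```python
def bits_to_image(bits, rows=40, cols=25, block=10):
    """Stack bits into a rows x cols grid, expanding each bit into a
    block x block square: 1 -> white (255), 0 -> black (0)."""
    assert len(bits) == rows * cols
    grid = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
    image = []
    for row in grid:
        pixel_row = []
        for bit in row:
            pixel_row.extend([255 if bit else 0] * block)
        image.extend([pixel_row] * block)   # repeat the row block times
    return image                            # (rows*block) x (cols*block) matrix
```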

Conclusions and Future Work
In this paper we introduce a novel partially observable MDP modeling the task of generating PRNGs from scratch using DRL, denoted by RF for Recurrent Formulation. RF improves our previous MDP modeling in [PP20] because it makes the action space size independent of the length of the generated sequence.
Experiments show that RF agents trained with PPO obtain at the same time a higher average NIST score and longer sequences, thus improving on [PP20] in two different ways. We use a PPO instance with the hidden state of an LSTM to encode significant features of the unobserved portion of the sequence: as far as we know, this is an original idea for PRNG generation.
Experiments with a random agent show that RF is a better MDP modeling when compared with the binary formulation BF of [PP20]: when RF is compared with BF, it obtains a higher average NIST score with longer sequences.
All this means that RF scales better to PRNGs with longer periods. However, while RF is a serious improvement over BF, the action space size grows as 2^N, where N is the bit-length of the appended sequence. Since DRL does not scale well to large discrete action sets (see for instance [Zah+18]), this is a limitation of RF. In our experiments, we found that N > 10 is not feasible for RF. Moreover, the vanishing gradient is an obstacle to increasing the episode length T with recurrent neural networks, as shown in [BSF94]. In our experiments we were unable to train RF with T = 200, so we consider T = 100 an upper bound for RF with PPO and the LSTM architecture described in Section 2.4. Since the PRNG period is T · N, we can say that RF with all the hyperparameters described in this paper does not scale well beyond 1000 bits.
Another problem of this approach is the sparse reward, which in general makes it difficult to train an RL agent.
The above remarks lead to the following list of possible future improvements and research topics:
• Devise a different, less sparse, reward function.
• Devise an algorithm using RF modified with a continuous action space. In this way we could append sequences of any length without increasing the size of A. Some preliminary experiments show that this approach could work, but at the time of writing it still presents some performance issues. This is work in progress.
• RF could be used to intelligently stack generated periods. For example, one could train a policy to stack sequences generated by one or multiple PRNGs, even with different sequence lengths. This would generate PRNGs with very long periods without a noticeable drop in quality. This approach could use a mixture of state-of-the-art PRNGs, DRL generators like the ones seen in this paper, and so on. A challenge of this approach is how to measure the change in the NIST score when appending a random sequence to another.
• The upper bound given by the episode length T could be overcome, or at least mitigated, by other architectures capable of maintaining memory for longer than recurrent neural networks, such as attention models.

Figure 2: Experiment on BF with B = 80. The learning curve is different and the average total reward is better with πBF . Volleys are composed of 1000 episodes each and the fixed length of each trajectory is T = 40 steps.

Figure 3: Experiment on BF with B = 200. Despite the similar learning curves, there is a huge difference in the achieved average total reward per episode between πBF and π BF at the end of the training process. Volleys are composed of 1000 episodes each and the fixed length of each trajectory is T = 100 steps.

Figure 4: Experiment on BF with B = 400. The difference in the achieved average total reward per episode between πBF and π BF is similar to the case with B = 200, while the learning curve is different. Volleys are composed of 1000 episodes each and the fixed length of each trajectory is T = 200 steps.

Figure 5: Average total rewards during training of π RF with N = 2 and T = 100 steps. Volleys are composed of 2000 episodes each.

Figure 6: Average total rewards during training of π RF with N = 5 and T = 100 steps. Volleys are composed of 2000 episodes each.

Figure 7: Average total rewards during training of π RF with N = 10 and T = 100 steps. Volleys are composed of 2000 episodes each.

Figure 8: Average total rewards of a random agent on BF and RF for short sequences.

Figure 9: Average total rewards of a random agent on BF and RF for medium sequences.

Figure 10: Average total rewards of a random agent on BF and RF for long sequences.

Figure 11: A graphical representation of 3 sequences of 1000 bits generated by the same trained π RF , with their NIST scores. Images are obtained by stacking the 1000 bits in 40 rows and 25 columns; ones are then converted to 10 × 10 white squares and zeros to 10 × 10 black squares, and the resulting image is smoothed.