# Exploration with Multiple Random ε-Buffers in Off-Policy Deep Reinforcement Learning


## Abstract


## 1. Introduction

We store each experience in multiple random ε-buffers, R_1 and R_2. Note that we can replay one trajectory with only one goal, assuming that we exploit an off-policy RL algorithm such as DQN [9,10] or DDPG [16]. When the procedure passes through multiple ε-buffers, the trade-off between policy optimization and Q-learning can be made solid and strong thanks to the deep neural network in DQN [9,10] or DDPG [16].

## 2. Background

RL is modeled as a Markov decision process consisting of a set of states, S, a set of actions, A, an initial state distribution, p(s_0), a reward function, r: S × A → ℝ, transition probabilities, p(s_{t+1}|s_t, a_t), and a discount factor, γ ∈ [0,1] [1]. A deterministic policy maps states to actions, π: S → A. At each time step, t, the agent selects an action according to the current state, a_t = π(s_t), and receives the reward r_t = r(s_t, a_t) [1]. The new state of the environment is sampled from the distribution p(·|s_t, a_t). The discounted cumulative reward is the return, R_t = ∑_{t'=t}^{T} γ^{t'−t} r_{t'}, where T is the time step at which the agent's simulation terminates [1]. The goal is to maximize the expected return E_{s_0}[R_0|s_0]. The action-value function, called the Q-function, is defined as Q^π(s_t, a_t) = E[R_t|s_t, a_t]. An optimal policy satisfies the Bellman equation:

Q*(s,a) = E_{s'∼p(·|s,a)}[r + γ max_{a'} Q*(s',a') | s,a], (1)

and iterating this equation converges to the Q-function with the optimal action values [1]. The Bellman equation is used with a function approximator that estimates the Q-function, Q(s,a;θ) ≈ Q*(s,a) [1]. This can be a linear-function approximation, but it can also be nonlinear, such as a deep neural network [9,10]. Therefore, we exploit a deep neural network as the function approximator and refer to the network approximation with weights θ as the Q-function [9,10]. The Q-network can be trained by minimizing a sequence of loss functions, L_i(θ_i), which changes at each iteration i [1,9,10]:

L_i(θ_i) = E_{s,a}[(y_i − Q(s,a;θ_i))^2], (2)

where y_i = E_{s'∼p(·|s,a)}[r + γ max_{a'} Q(s',a';θ_{i−1}) | s,a] is the target for iteration i. The loss function can be differentiated with respect to the weights, giving the gradient [1,9,10]:

∇_{θ_i} L_i(θ_i) = E_{s,a;s'}[(r + γ max_{a'} Q(s',a';θ_{i−1}) − Q(s,a;θ_i)) ∇_{θ_i} Q(s,a;θ_i)]. (3)

Rather than computing the full expectations in this gradient, it is computationally expedient to optimize the loss function by stochastic gradient descent. Q-learning with such a function approximator [17,18] is model-free and off-policy: it learns about the greedy strategy a = argmax_a Q(s,a;θ) while following an ε-greedy behavior distribution that takes the greedy action with probability 1 − ε and a random action with probability ε, balancing exploration and exploitation [17,18].

DQN is a well-known, model-free RL algorithm for discrete action spaces. In DQN [9,10], we construct a deep neural network, Q, that approximates Q* and define the greedy policy π_Q(s) = argmax_{a∈A} Q(s,a) [9,10]. The behavior policy is ε-greedy: it takes a random action with probability ε and the action π_Q(s) with probability 1 − ε. Each episode uses the ε-greedy policy with respect to the current network Q. The tuples (s_t, a_t, r_t, s_{t+1}) are stored in the replay buffer, and each new episode contributes data for neural network training [9,10]. The deep neural network is trained by gradient descent on the loss L over tuples sampled from random episodes in the replay buffer, encouraging Q to satisfy the Bellman equation [9,10]. The target y_t is computed by a separate target network whose weights change more slowly than those of the main deep neural network, which stabilizes the optimization process; the target network's weights are periodically set to the current weights of the main network. The DQN algorithm [9,10] is presented in Algorithm 1.

**Algorithm 1** Deep Q-learning with Experience Replay

```
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample a random mini-batch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                    for terminal φ_{j+1}
        or  y_j = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ)     for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2
    end for
end for
```
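The target computation in Algorithm 1 can be sketched as a minimal NumPy routine. This is an illustrative helper, not the paper's implementation; `gamma`, the toy mini-batch values, and the precomputed next-state Q-values are all assumptions for the example.

```python
import numpy as np

def dqn_targets(rewards, q_next, terminal, gamma=0.99):
    """Compute y_j = r_j for terminal transitions,
    else y_j = r_j + gamma * max_a' Q(s', a')."""
    # Highest Q-value over actions for each next state in the mini-batch.
    max_q_next = q_next.max(axis=1)
    # Terminal transitions bootstrap nothing: y_j reduces to r_j.
    return rewards + gamma * max_q_next * (1.0 - terminal)

# Toy mini-batch: two transitions, two actions each.
rewards = np.array([1.0, 0.5])
q_next = np.array([[0.2, 0.8],
                   [0.1, 0.3]])
terminal = np.array([0.0, 1.0])  # the second transition ends the episode
print(dqn_targets(rewards, q_next, terminal))  # [1.792 0.5  ]
```

The `(1 − terminal)` mask implements the two-case "Set y_j" line of the algorithm in a single vectorized expression.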

DDPG [16] is a model-free RL algorithm for continuous action spaces. It maintains an actor, a deterministic policy π, and a critic that approximates Q^π, the action-value function of the actor [16]. Each episode uses a noisy version of the target policy, π_b(s) = π(s) + Ɲ(0,1) [16]. The critic is trained in the same way as the Q-function in DQN [9,10]. The target y_t is computed using the output of the actor, i.e., y_t = r_t + γQ(s_{t+1}, π(s_{t+1})), and the actor's deep neural network is trained by gradient descent on the loss L_a = −E_s Q(s, π(s)) [2,16], with samples drawn from the replay buffer of random episodes. The gradient of L_a is back-propagated through both the critic and the actor [2,16]. The DDPG algorithm [16] is shown in Algorithm 2.

**Algorithm 2** Deep Deterministic Policy Gradient

```
Randomly initialize critic Q(s, a|θ^Q) and actor μ(s|θ^μ) with weights θ^Q and θ^μ
Initialize target networks Q' and μ' with weights θ^{Q'} ← θ^Q, θ^{μ'} ← θ^μ
Initialize replay memory R
for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1, T do
        Select action a_t = μ(s_t|θ^μ) + N_t according to the current policy and exploration noise
        Execute action a_t and observe reward r_t and new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random mini-batch of transitions (s_j, a_j, r_j, s_{j+1}) from R
        Set y_j = r_j + γ Q'(s_{j+1}, μ'(s_{j+1}|θ^{μ'})|θ^{Q'})
        Update the critic by minimizing the loss: L = (1/n) Σ_j (y_j − Q(s_j, a_j|θ^Q))^2
        Update the actor using the sampled policy gradient:
            ∇_{θ^μ} J ≈ (1/n) Σ_j ∇_a Q(s, a|θ^Q)|_{s=s_j, a=μ(s_j)} ∇_{θ^μ} μ(s|θ^μ)|_{s_j}
        Update the targets: θ^{Q'} ← τθ^Q + (1−τ)θ^{Q'},  θ^{μ'} ← τθ^μ + (1−τ)θ^{μ'}
    end for
end for
```
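The soft target update at the end of Algorithm 2, θ^{Q'} ← τθ^Q + (1 − τ)θ^{Q'}, can be sketched in plain NumPy; the function name and the value of τ below are illustrative, not part of the original algorithm's code.

```python
import numpy as np

def soft_update(target_weights, main_weights, tau=0.001):
    """Move each target parameter a small step tau toward the main network:
    theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * tw
            for w, tw in zip(main_weights, target_weights)]

main = [np.ones((2, 2))]      # stand-in for the main network's weight list
target = [np.zeros((2, 2))]   # stand-in for the target network's weight list
target = soft_update(target, main, tau=0.1)
print(target[0][0, 0])  # 0.1
```

Because τ is small, the target network trails the main network slowly, which is exactly the stabilizing effect the text attributes to the slowly changing target.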

## 3. Multiple Random ε-Buffers

#### 3.1. Proposed Off-Policy Algorithm

We propose replaying each experience with multiple random ε-buffers, R_1 and R_2. For better exploration, the agent evaluates a target policy π(a|s) to compute V^π(s) or Q^π(s,a) while following a behavior policy μ(a|s), where {s_1, a_1, r_2, …, s_T} ∼ μ. The agent learns about multiple policies while following one policy; that is, it can learn about the optimal policy while following an exploratory policy. The target policy π is greedy with respect to Q(s,a), π(s_{t+1}) = argmax_{a'} Q(s_{t+1}, a'), while the behavior policy μ is ε-greedy with respect to Q(s,a), so the learning target is R_{t+1} + γ max_{a'} Q(s_{t+1}, a'). When an agent follows a greedy policy, convergence is rapid; however, because of a lack of exploration, the agent settles into a local optimum. Many other works address this global–local dilemma [2,9,10,16]. A remarkable recent work, HER [8], provides additional replay buffers with the original goal and a subset of other goals without any complicated reward function. Off-policy RL algorithms, such as DQN [9,10] or DDPG [2,16], reuse past experience gathered with random ε-greedy actions in replay buffers for better exploration. The ER buffers discard old memories at each step while sampling the buffer randomly to update the DQN agent. Therefore, ER helps to break temporal correlations and increase data usage [11,12].

In our approach, each transition is stored in both buffers, R_1 and R_2. We can replay one trajectory with only one goal, assuming that we exploit an off-policy RL algorithm such as DQN [9,10] or DDPG [2,16]. While the procedure passes through multiple random ε-buffers, the balance between the variance and bias of the policy can remain solid and strong due to the training of the deep neural network.

With multiple random ε-buffers, the agent can learn about the greedy policy, π(s_{t+1}) = argmax_{a'} Q(s_{t+1}, a'), while following an exploratory policy [1]. Policy optimization [2,16] is more stable than Q-learning [9,10]; however, Q-learning [9,10] is more sample-efficient than policy optimization [2,16]. Therefore, we follow this lead to strengthen the advances in model-free RL. DDPG [16] is actually based on an on-policy view [1]; however, it can create a good trade-off between policy optimization [2,16] and Q-learning [9,10]. Algorithms 3 and 4 display the DQN and DDPG algorithms, respectively, with multiple random ε-buffers.

**Algorithm 3** Deep Q-learning with multiple random ε-buffers

```
Initialize replay memories R_1 and R_2
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize sequence s_1 and φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t
        Set s_{t+1} = s_t, a_t and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in R_1    // standard experience buffer
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in R_2    // second experience buffer
        Sample a random mini-batch of transitions (φ_j, a_j, r_j, φ_{j+1}) from either R_1 or R_2
        Set y_j = r_j                                       for terminal φ_{j+1}
        or  y_j = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ)        for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2
    end for
    Sample a random mini-batch of transitions from either R_1 or R_2
end for
```
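The buffer handling in Algorithm 3 — store every transition in both R_1 and R_2, then draw each mini-batch from one buffer chosen at random — can be sketched with the standard library. The class name, buffer capacity, and batch size are illustrative assumptions.

```python
import random
from collections import deque

class MultiRandomBuffers:
    """Two replay buffers: every transition goes into both, and each
    mini-batch is sampled from one buffer picked uniformly at random."""

    def __init__(self, capacity=10000):
        self.r1 = deque(maxlen=capacity)  # standard experience buffer
        self.r2 = deque(maxlen=capacity)  # second experience buffer

    def store(self, transition):
        # Store (s, a, r, s') in both R_1 and R_2, as in Algorithm 3.
        self.r1.append(transition)
        self.r2.append(transition)

    def sample(self, batch_size):
        # Choose either R_1 or R_2, then draw a random mini-batch from it.
        buffer = random.choice([self.r1, self.r2])
        return random.sample(buffer, min(batch_size, len(buffer)))

buffers = MultiRandomBuffers()
for t in range(100):
    buffers.store((t, 0, 1.0, t + 1))  # dummy (s, a, r, s') tuples
batch = buffers.sample(32)
print(len(batch))  # 32
```

With identical capacities and a shared `store`, the two buffers hold the same data here; the sketch only isolates the control flow of storing twice and sampling from a randomly chosen buffer.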

#### 3.2. Algorithm Description

#### 3.2.1. DQN with Multiple Random ε-Buffers

1. Initialize the replay memories R_1 and R_2 and Q(s,a), and start the process from a random state, s.
2. Initialize the sequence with the start state, s.
3. The agent learns the greedy policy max_a Q((s_t, a); θ) and follows another policy with probability ε.
4. For exploration, each experience tuple (state s, action a, reward r, new state s') is stored in both R_1 and R_2.
5. Sample a random mini-batch of transitions from either R_1 or R_2.
6. Update the weights by performing a gradient descent step on (r_j + γ max_{a'} Q(φ_{j+1}, a'; θ) − Q(φ_j, a_j; θ))^2 for a target DQN with multiple random ε-buffers.
7. Steps 3–6 are repeated for training.
8. For the next episode, sample a random mini-batch of transitions from either R_1 or R_2.
9. Steps 2–8 are repeated for training.

#### 3.2.2. DDPG with Multiple Random ε-Buffers

1. Initialize the critic deep network Q(s, a|θ^Q) and actor deep network μ(s|θ^μ).
2. Initialize the replay memories R_1 and R_2.
3. The agent selects an action according to the current policy plus exploration noise, a_t = μ(s_t|θ^μ) + Ɲ_t.
4. For exploration, each experience tuple (state s, action a, reward r, new state s') is stored in both R_1 and R_2.
5. A mini-batch of transitions is randomly sampled from either R_1 or R_2.
6. The critic network is updated by minimizing the loss L = (1/n) Σ_j (r_j + γQ'(s_{j+1}, μ'(s_{j+1}|θ^{μ'})|θ^{Q'}) − Q(s_j, a_j|θ^Q))^2.
7. The actor network is updated by the sampled policy gradient ∇_{θ^μ} J ≈ (1/n) Σ_j ∇_a Q(s, a|θ^Q)|_{s=s_j, a=μ(s_j)} ∇_{θ^μ} μ(s|θ^μ)|_{s_j} for a target DDPG with multiple random ε-buffers.
8. Update the target networks: θ^{Q'} ← τθ^Q + (1 − τ)θ^{Q'} and θ^{μ'} ← τθ^μ + (1 − τ)θ^{μ'}.
9. Steps 3–8 are repeated for training.
10. For the next episode, a mini-batch of transitions is randomly sampled from either R_1 or R_2.
11. Steps 2–10 are repeated for training.

**Algorithm 4** Deep Deterministic Policy Gradient with multiple random ε-buffers

```
Initialize critic Q(s, a|θ^Q) and actor μ(s|θ^μ) with weights θ^Q and θ^μ
Initialize target networks Q' and μ' with weights θ^{Q'} ← θ^Q, θ^{μ'} ← θ^μ
Initialize replay memories R_1 and R_2
for episode = 1, M do
    Initialize observation s_1 and a random process N for action exploration
    for t = 1, T do
        Select action a_t = μ(s_t|θ^μ) + N_t according to the current policy and exploration noise
        Execute action a_t and observe reward r_t and new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R_1    // standard experience buffer
        Store transition (s_t, a_t, r_t, s_{t+1}) in R_2    // second experience buffer
        Sample a random mini-batch of transitions (s_j, a_j, r_j, s_{j+1}) from either R_1 or R_2
        Set y_j = r_j + γ Q'(s_{j+1}, μ'(s_{j+1}|θ^{μ'})|θ^{Q'})
        Update the critic by minimizing the loss: L = (1/n) Σ_j (y_j − Q(s_j, a_j|θ^Q))^2
        Update the actor using the sampled policy gradient:
            ∇_{θ^μ} J ≈ (1/n) Σ_j ∇_a Q(s, a|θ^Q)|_{s=s_j, a=μ(s_j)} ∇_{θ^μ} μ(s|θ^μ)|_{s_j}
        Update the targets: θ^{Q'} ← τθ^Q + (1−τ)θ^{Q'},  θ^{μ'} ← τθ^μ + (1−τ)θ^{μ'}
    end for
    Sample a random mini-batch of transitions from either R_1 or R_2
end for
```
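The critic loss in Algorithm 4, L = (1/n) Σ_j (y_j − Q(s_j, a_j|θ^Q))², is a mean squared TD error and can be checked numerically. The targets and Q-values below are toy numbers, not results from our experiments.

```python
import numpy as np

def critic_loss(y, q):
    """Mean squared TD error over a mini-batch:
    L = (1/n) * sum_j (y_j - Q(s_j, a_j))^2."""
    return float(np.mean((y - q) ** 2))

y = np.array([1.0, 2.0, 3.0])  # targets y_j from the target networks
q = np.array([0.5, 2.0, 2.0])  # current critic estimates Q(s_j, a_j)
print(critic_loss(y, q))  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```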

## 4. Evaluation and Results

The agent stores each experience tuple, e_t = (s_t, a_t, r_t, s_{t+1}), in multiple random ε-buffers, R_1 and R_2. Our emulators from OpenAI Gym [19] can apply mini-batch updates from R_1 and R_2. Once the multiple random experience replay memories are filled, the agent's actions in the emulator follow the ε-greedy policy. For DQN with multiple random ε-buffers, for enhanced exploration, we also follow the target Q-network of the breakthrough research in [9,10]. The Q-learning agent calculates the TD error with the current estimated Q-value [9,10]. The optimized action-value function obeys an important identity known as the Bellman equation [9,10]. According to this equation, the TD target is the reward for an action in the current state plus the discounted highest Q-value of the next state [9,10]. For DDPG with multiple random ε-buffers, for enhanced exploration, we also follow the critic and actor networks in [2,16]. Because the Q-learning in DDPG can exploit the deterministic policy, argmax_a Q, our proposed method can remain off-policy [1].

#### 4.1. CartPole-V0

#### 4.2. MountainCar-V0

#### 4.3. Pendulum-V0

The reward of Pendulum-V0 is −(θ² + 0.1 × θ̇² + 0.001 × action²), where θ is normalized between −π and +π. Therefore, the lowest reward is −(π² + 0.1 × 8² + 0.001 × 2²) = −16.2736044, and the highest reward is 0 [25]. The goal is to remain at zero angle (vertical) with the least rotational velocity and least effort. There is no episode termination other than reaching the maximum number of steps [24,25].
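The reward bound above can be verified numerically; the worst-case arguments `theta = π`, `theta_dot = 8`, and `torque = 2` follow the Pendulum-V0 limits quoted in the text.

```python
import math

def pendulum_cost(theta, theta_dot, torque):
    """Pendulum-V0 reward: -(theta^2 + 0.1*theta_dot^2 + 0.001*torque^2)."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

# Worst case: pendulum fully inverted, max angular velocity, max torque.
lowest = pendulum_cost(math.pi, 8.0, 2.0)
print(round(lowest, 7))  # -16.2736044

# Best case: upright, motionless, zero effort.
print(pendulum_cost(0.0, 0.0, 0.0) == 0.0)  # True
```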

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv
**2018**, arXiv:1801.01290. [Google Scholar] - Kim, C.; Park, J. Designing online network intrusion detection using deep auto-encoder Q-learning. Comput. Electr. Eng.
**2019**, 79, 106460. [Google Scholar] [CrossRef] - Park, J.; Salim, M.M.; Jo, J.; Sicato, J.C.S.; Rathore, S.; Park, J.H. CIoT-Net: A scalable cognitive IoT based smart city network architecture. Hum. Cent. Comput. Inf. Sci.
**2019**, 9, 29. [Google Scholar] [CrossRef] - Sun, Y.; Tan, W. A trust-aware task allocation method using deep q-learning for uncertain mobile crowdsourcing. Hum. Cent. Comput. Inf. Sci.
**2019**, 9, 25. [Google Scholar] [CrossRef] [Green Version] - Kwon, B.-W.; Sharma, P.K.; Park, J.-H. CCTV-Based Multi-Factor Authentication System. J. Inf. Process. Syst.
**2019**, 15, 904–919. [Google Scholar] [CrossRef] - Srilakshmi, N.; Sangaiah, A.K. Selection of Machine Learning Techniques for Network Lifetime Parameters and Synchronization Issues in Wireless Networks. J. Inf. Process. Syst.
**2019**, 15, 833–852. [Google Scholar] [CrossRef] - Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight Experience Replay. arXiv
**2017**, arXiv:1707.01495. [Google Scholar] - Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv
**2013**, arXiv:1312.5602. [Google Scholar] - Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Driessche, G.V.D.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of go with deep neural networks and tree search. Nature
**2016**, 529, 484–489. [Google Scholar] [CrossRef] [PubMed] - Cobbe, K.; Klimov, O.; Hesse, C.; Kim, T.; Schulman, J. Quantifying Generalization in Reinforcement Learning. arXiv
**2019**, arXiv:1812.02341. [Google Scholar] - Liu, R.; Zou, J. The Effects of Memory Replay in Reinforcement Learning. arXiv
**2017**, arXiv:1710.06574. [Google Scholar] - Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv
**2016**, arXiv:1511.05952. [Google Scholar] - Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.; Chen, X.; Asfour, T.; Abbeel, P.; Andrychowicz, M. Parameter Space Noise for Exploration. arXiv
**2018**, arXiv:1706.01905. [Google Scholar] - OpenReview.net. Available online: https://openreview.net/forum?id=ByBAl2eAZ (accessed on 16 February 2018).
- Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv
**2016**, arXiv:1509.02971. [Google Scholar] - FreeCodeCamp. Available online: https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682 (accessed on 5 July 2018).
- RL—DQN Deep Q-network. Available online: https://medium.com/@jonathan_hui/rl-dqn-deep-q-network-e207751f7ae4 (accessed on 17 July 2018).
- OpenAI Gym. Available online: https://gym.openai.com (accessed on 28 May 2016).
- Cart-Pole-V0. Available online: https://github.com/openai/gym/wiki/Cart-Pole-v0 (accessed on 24 June 2019).
- Cart-Pole-DQN. Available online: https://github.com/rlcode/reinforcement-learning-kr/blob/master/2-cartpole/1-dqn/cartpole_dqn.py (accessed on 8 July 2017).
- MountainCar-V0. Available online: https://github.com/openai/gym/wiki/MountainCar-v0 (accessed on 4 May 2019).
- MountainCar-V0-DQN. Available online: https://github.com/shivaverma/OpenAIGym/blob/master/mountain-car/MountainCar-v0.py (accessed on 2 April 2019).
- Pendulum-V0. Available online: https://github.com/openai/gym/wiki/Pendulum-v0 (accessed on 31 May 2019).
- Pendulum-V0-DDPG. Available online: https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py (accessed on 26 October 2019).
- Tensorflow. Available online: https://github.com/tensorflow/tensorflow (accessed on 31 October 2019).
- Keras Documentation. Available online: https://keras.io/ (accessed on 14 October 2019).

**Figure 2.** (**a**) The environment of CartPole-V0 and (**b**) the actions taken by the agent of CartPole-V0 [20].

**Figure 3.**An average result between DQN with random buffers and DQN with multiple random ε-buffers in CartPole-V0.

**Figure 4.**One of the best results between DQN with random buffers and DQN with multiple random ε-buffers in CartPole-V0.

**Figure 5.**One of the worst results between DQN with random buffers and DQN with multiple random ε-buffers in CartPole-V0.

**Figure 6.** (**a**) The environment of MountainCar-V0 and (**b**) the actions taken by the agent of MountainCar-V0 [22].

**Figure 7.**Maximum number of rounds, i.e., around 100, between DQN with random buffers and DQN with multiple random ε-buffers in MountainCar-V0.

**Figure 8.**Maximum number of rounds, i.e., around 200, between DQN with random buffers and DQN with multiple random ε-buffers in MountainCar-V0.

**Figure 9.**The maximum number of rounds, i.e., around 300, between DQN with random buffers and DQN with multiple random ε-buffers in MountainCar-V0.

**Figure 10.** (**a**) The environment of Pendulum-V0 and (**b**) the actions taken by the agent of Pendulum-V0 [24].

**Figure 11.** Maximum number of rounds, i.e., around 100, between DDPG with random buffers and DDPG with multiple random ε-buffers in Pendulum-V0.

**Figure 12.** Maximum number of rounds, i.e., around 200, between DDPG with random buffers and DDPG with multiple random ε-buffers in Pendulum-V0.

**Figure 13.** Maximum number of rounds, i.e., around 300, between DDPG with random buffers and DDPG with multiple random ε-buffers in Pendulum-V0.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kim, C.; Park, J.
Exploration with Multiple Random ε-Buffers in Off-Policy Deep Reinforcement Learning. *Symmetry* **2019**, *11*, 1352.
https://doi.org/10.3390/sym11111352
