Applied Sciences
  • Editor’s Choice
  • Review
  • Open Access

14 February 2023

Reinforcement Learning in Game Industry—Review, Prospects and Challenges

MLV Research Group, Department of Computer Science, International Hellenic University, 65404 Kavala, Greece
* Author to whom correspondence should be addressed.

Abstract

This article focuses on the recent advances in the field of reinforcement learning (RL) as well as the present state-of-the-art applications in games. First, we give a general panorama of RL, while at the same time underlining the way it has progressed to its current degree of application. Moreover, we conduct a keyword analysis of the literature on deep learning (DL) and reinforcement learning in order to analyze the extent to which the scientific study is based on games such as ATARI, Chess, and Go. Finally, we explore a range of public data to identify a unified framework and trends for the present and future of this sector (RL in games). Our work leads us to conclude that deep RL accounts for roughly 25.1% of the DL literature, and a sizable amount of this literature focuses on RL applications in the game domain, paving the road for newer and more sophisticated algorithms capable of surpassing human performance.

1. Introduction

Deep learning (DL) algorithms were established in 2006 [1] and have been extensively utilized by many researchers and industries in subsequent years. Ever since the impressive breakthrough on the ImageNet [2] classification challenge in 2012, the successes of supervised deep learning have continued to pile up. Many researchers have started utilizing this new and capable family of algorithms to solve a wide range of new tasks, including ways to successfully learn intelligent behaviors in reward-driven complex dynamic problems. The agent-environment interaction, expressed through observation, action, and reward channels, is the necessary and sufficient condition for characterizing a problem as an object of reinforcement learning (RL). Learning environments can be characterized as Markov decision problems [3], as they satisfy the Markov property, allowing RL algorithms to be applied. From this family of environments, games could not be absent. In a game-based environment, inputs (the game world), actions (game controls), and the evaluation criteria (game score) are usually known and simulated. With the rise of DL and extended computational capability, classic RL algorithms from the 1990s [4,5] could now solve exponentially more complex tasks such as games [6] over time, traversing huge decision spaces. This new generation of algorithms, which exploits graphical processing unit (GPU) batch computations, reward/punishment costs, as well as the immense computational capabilities of today's machines, is called deep reinforcement learning (DRL) [7]. In [8,9], neuroevolutionary approaches that apply directly to pixel data were demonstrated. A year later, the development of the Deep-Q-Network (DQN) by Google was the most noticeable breakthrough in the era of DRL. This novel algorithm could recognize and learn behaviors directly from pixels in an unknown environment [10].
However, there were some issues at the early stages of development due to the instability of neural networks when acting as approximation functions (correlated inputs, oscillating policies, large gradients, etc.) that were solved through the natural development of the wider DL field. For example, correlation between the inputs to a machine learning (ML) model can affect the training process, leading to underfitting or overfitting in many cases. Other issues have also been addressed: policy degradation, which can arise from value-function overestimation, by using multiple value functions [11,12], and large gradients, by subtracting a previously learned baseline [13,14]. Other approaches to these instabilities include trust region algorithms, such as trust region policy optimization (TRPO) [15] or proximal policy optimization (PPO) [16], where the policy is updated under various constraints.
In the beginning, experiments on ATARI 2600 games were followed by a wide range of testing on more challenging games (DOTA2, Starcraft, Chess, Go, etc.). These experiments proved that DQN-type algorithms could score higher than any of the classic RL algorithms, surpassing the professional human players who were paid to play the Atari game titles [17].
The above are solid intuitive clues as to why DRL is inextricably linked with the game domain. In this study, this exact relation is analyzed and studied in depth. Therefore, our study's contributions are the following:
  • Conduct an in–depth publication analysis via keyword analysis on existing literature data;
  • Present key studies and publications that presented breakthroughs that addressed important issues in RL, such as policy degradation, exploding gradients, and general training instabilities;
  • Draw important inferences regarding the growing nature of published research in DRL and its various domains;
  • Analyze research on games in the field of DRL;
  • Survey keystone DRL research done in game–based environments.
Our publication analysis [18] has shown that specific terms and keywords such as "reinforcement learning" and "game" usually pair together. For the period from 2012 to 2022, of the 92,367 publications containing "reinforcement learning", over 25% also contained the keyword "game" or referred to a game. Research has also shown that most of these RL publications [19,20] review or use "deep learning" type techniques in games, with the majority being deep reinforcement learning (DRL) algorithms (recent value-based, policy gradient, and model-based DRL methods).
The rest of this paper is organized as follows: Section 2 provides the necessary theoretical background for understanding popular RL systems and their essential sub-units. Section 3 describes the analysis performed on the Scopus, Google Scholar, and Dimensions publication data. Section 4 presents the development of DRL applications in game-based environments, along with their impact on the overall domain of RL and how subsequent research has helped overcome the limitations faced by its predecessors. Additionally, we briefly present some notable studies published in recent years, ranked by the number of citations they received. Finally, Section 5 concludes the paper with suggestions for overcoming the obstacles to research and possible solution directions for developing DRL in game-based environments.

2. Theoretical Background

2.1. Markov Decision Process

A Markov decision process (MDP) is a mathematical decision-making model in which the effect of an action is partly random and partly under the control of the decision maker. Based on the definitions, we can mathematically formulate the probability of transitioning from state $S_t$ to state $S_{t+1}$ as:
$P_{ss'}^{a} = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
while the immediate reward can be formulated as:
$R_a(s, s') = \mathbb{E}(r_{t+1} \mid S_{t+1} = s', S_t = s, A_t = a)$
where $S$ is the set of states, $A$ the set of actions, $a$ the selected action in state $s$ at time-step $t$, and $R_a(s, s')$ the reward obtained by transitioning from state $s$ to state $s'$ with the selected action $a$. Simply put, $P_{ss'}^{a}$ is the probability of transitioning from the current state $s$ to the next state $s'$ by taking action $a \in A$ at time-step $t$. MDPs can be either deterministic or stochastic. In a deterministic MDP, as in every deterministic system, from state $s$ we can only transition to a unique state $s'$, whereas in a stochastic MDP, $s$ can lead to any possible state $s'$. The MDP is a structural element of RL since, in these problems, an agent must decide its next action based on its current state. In particular, when this step is repeated over time-steps $t$, the problem can be expressed as an MDP.
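For illustration, the transition and reward dynamics above can be sketched in a few lines of Python; the two-state environment, its action names, and its rewards are invented for this example and are not taken from the paper:

```python
import random

# Hypothetical two-state MDP. P[s][a] maps each next state to its
# transition probability P(s' | s, a); R gives the immediate reward
# for a (s, a, s') transition, defaulting to 0.
P = {
    "s0": {"stay": {"s0": 1.0},                 # deterministic transition
           "move": {"s0": 0.2, "s1": 0.8}},     # stochastic transition
    "s1": {"stay": {"s1": 1.0},
           "move": {"s0": 0.8, "s1": 0.2}},
}
R = {("s0", "move", "s1"): 1.0}

def step(state, action, rng=random.random):
    """Sample S_{t+1} from P(s' | s, a) and return (next_state, reward)."""
    u, cumulative = rng(), 0.0
    for next_state, prob in P[state][action].items():
        cumulative += prob
        if u < cumulative:
            return next_state, R.get((state, action, next_state), 0.0)
    return next_state, R.get((state, action, next_state), 0.0)  # round-off guard
```

Calling `step("s0", "move")` repeatedly reproduces the stochastic branching: roughly 80% of the samples land in `s1` with reward 1.0, the rest stay in `s0`.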

2.1.1. Reinforcement Learning

In most RL algorithms, the agent obtains a model of the environment or at least some basic state transition sequences, as is depicted in Figure 1. In a similar model, the agent can interact with the environment by selecting a set of actions that alter the environment’s state, producing new states along the way. The structural components of RL are:
Figure 1. Basic RL model.
  • The discrete time-steps $t$;
  • The state space $S$, with state $S_t$ at time-step $t$;
  • The set of actions $A$, with action $A_t$ at time-step $t$;
  • The policy function $\pi(\cdot)$;
  • A reward function $R_a(S_t, S_{t+1})$ for action $A_t$, transitioning from state $S_t$ to $S_{t+1}$;
  • The state-value function $V(s)$ and the action-value function $Q(s, a)$.

2.1.2. Policy

In MDPs, the policy $\pi(\cdot)$ is mathematically defined as a function mapping a state $s \in S$ to an action $a \in A_s$:
$\pi : S \to A$
In a deterministic policy, the function $\pi(\cdot)$ is defined as $\pi : S \to A$; for each state $s \in S$ there corresponds exactly one action $a \in A_s$, whereas in a stochastic policy, the function $\pi(\cdot)$ is derived from the probability distribution $\pi(a \mid s)$. In the latter case, the agent chooses one action $a \in A_s$ based on the distribution $\pi(a \mid s)$. During the training phase, the agent can act based on two policies:
  • Exploration policy: The agent acts randomly to explore the state and the reward spaces.
  • Exploitation policy: The agent acts based on preexisting knowledge.
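The two training-phase policies are commonly combined in an $\epsilon$-greedy rule: explore with probability $\epsilon$, exploit otherwise. A minimal sketch follows; the state and action names and the value of $\epsilon$ are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random.random):
    """epsilon-greedy action selection: explore with probability epsilon,
    otherwise exploit the highest-valued action under Q[(state, action)]."""
    if rng() < epsilon:
        return random.choice(actions)                          # exploration policy
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploitation policy
```

With `epsilon=0` the rule degenerates to pure exploitation; with `epsilon=1` it becomes pure exploration.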

2.1.3. State-Value Function

The state-value function in MDPs is defined as $V : S \to \mathbb{R}$. To formulate it mathematically, it is important to also define:
  • $\gamma$, the reward discount factor, which takes values between zero and one; the smaller $\gamma$ is, the more the agent favors immediate rewards;
  • $G_t$, the sum of rewards (discounted by $\gamma$) from state $S_t$ until the end of the episode;
  • $\mathbb{E}_\pi$, the expected value of the rewards, given that the agent starts from state $s$ and acts based on policy $\pi(\cdot)$.
$V^{\pi}(s) = \mathbb{E}_\pi\{G_t \mid S_t = s\} = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s\right]$
where $k$ indexes the steps until the end of the episode.
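As a minimal illustration of the discounted return $G_t$, the finite-episode sum can be computed directly; the reward sequences below are invented for the example:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * r_{t+k+1}, for a finite episode
    whose rewards r_{t+1}, r_{t+2}, ... are given in order."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

For example, three unit rewards discounted with $\gamma = 0.5$ give $1 + 0.5 + 0.25 = 1.75$, showing how smaller $\gamma$ shrinks the weight of later rewards.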

2.1.4. Action-Value Function

The action-value function gives the expected sum of rewards, discounted by $\gamma$, that an agent can obtain by taking action $a$ from state $s$ and thereafter following policy $\pi(\cdot)$. It is defined as $Q : S \times A \to \mathbb{R}$:
$Q^{\pi}(s, a) = \mathbb{E}_\pi\{G_t \mid S_t = s, A_t = a\} = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
Taking the above function into account, the state value under a greedy deterministic policy is:
$V(s) = \max_a \{Q(s, a)\}$
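The greedy relation $V(s) = \max_a Q(s, a)$ can be read directly off a table of action values; a small illustrative sketch (the table contents are invented):

```python
def greedy_state_value(Q, state, actions):
    """V(s) = max over a of Q(s, a): the value of state s under a
    greedy deterministic policy derived from the action-value table Q."""
    return max(Q.get((state, a), 0.0) for a in actions)
```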

2.1.5. Bellman Equations

Richard Bellman introduced a set of equations that help in solving MDPs [13]. They are extensively used both in RL and in the majority of algorithms for solving game-based problems. To provide a proper mathematical definition of these equations, we must determine the transition probability $P$ and the expected reward $r$:
$P_{ss'}^{a} = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
and
$r(s, a) = \mathbb{E}(r_{t+1} \mid S_{t+1} = s', S_t = s, A_t = a)$
With the above additions, we can reformulate both the state-value and action-value functions as:
$V^{\pi}(s) = \sum_a \pi(a \mid s) \sum_{s'} P_{ss'}^{a} \left[ r(s, a) + \gamma V^{\pi}(s') \right]$
and
$Q^{\pi}(s, a) = \sum_{s'} P_{ss'}^{a} \left[ r(s, a) + \gamma \sum_{a'} \pi(a' \mid s') Q^{\pi}(s', a') \right].$
The importance of the Bellman equations lies in the fact that they let us express the value of one state in terms of the value of another: if we know the value of $S_{t+1}$, we can calculate the value of $S_t$. Thus, starting from a random initialization of the value function and repeatedly applying the Bellman equation, we can evaluate the action-value function for all possible states in $S$ and thereby compute a new policy $\pi_{new}(\cdot)$. This policy-based recursion process can be seen in Figure 2.
Figure 2. Policy–based recursion [21].
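The recursive application of the Bellman backup described above can be sketched on a toy problem; the three-state chain, its rewards, and the number of sweeps are illustrative assumptions, not taken from the paper:

```python
# Hypothetical deterministic three-state chain s0 -> s1 -> s2 (terminal),
# with reward 1.0 on entering s2. Repeatedly applying the Bellman backup
# V(s) <- max over a of [ r(s, a) + gamma * V(s') ] from an arbitrary
# initialization converges to the optimal state values.
gamma = 0.9
transitions = {  # (state, action) -> (next_state, reward)
    ("s0", "right"): ("s1", 0.0),
    ("s1", "right"): ("s2", 1.0),
}
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}  # arbitrary (zero) initialization

for _ in range(50):            # sweeps until (near) convergence
    for s in ("s0", "s1"):     # s2 is terminal; its value stays 0
        V[s] = max(rew + gamma * V[nxt]
                   for (st, _a), (nxt, rew) in transitions.items() if st == s)
```

After the sweeps, $V(s_1) = 1.0$ (the reward for entering $s_2$) and $V(s_0) = \gamma \cdot 1.0 = 0.9$, exactly as the recursion "value of $S_t$ from value of $S_{t+1}$" predicts.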

2.1.6. Best Policy

The best policy satisfies the Bellman equations and returns the maximum sum of rewards, discounted over time by $\gamma$. The best policy $\pi^*$ of an MDP can be defined by:
$Q^{\pi^*}(s, a) \ge Q^{\pi}(s, a), \quad \forall s \in S, \; a \in A.$

2.1.7. Policy Evaluation and Policy Improvement

Policy evaluation and improvement can be accomplished using dynamic programming algorithms, based on a value function. If the environment is known, a system of linear equations can be set up and solved [22]. However, in this case, linear programming can be computationally expensive. Another family of RL algorithms that plays an important role in improving the above policies is the Monte Carlo method [23]. In contrast with linear programming, this method does not require extensive information about the complete environment; instead, it utilizes samples of state sequences, actions, and rewards. As far as the solution of the RL problem is concerned, the method estimates values by averaging the observed returns.
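A minimal first-visit Monte Carlo evaluation, averaging the observed returns per state, might look as follows; the episode format used here (pairs of state and the reward received on leaving it) is an assumption made for this sketch:

```python
def mc_evaluate(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation. Each episode is a list of
    (state, reward) pairs, where reward is the reward received after leaving
    that state; V(s) is the average first-visit return G_t observed for s."""
    returns = {}
    for episode in episodes:
        G, first_visit_return = 0.0, {}
        for state, reward in reversed(episode):   # accumulate returns backwards
            G = reward + gamma * G
            first_visit_return[state] = G         # earlier visits overwrite later ones
        for state, G_t in first_visit_return.items():
            returns.setdefault(state, []).append(G_t)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Feeding in two sampled episodes is enough to see the averaging at work: a state whose observed returns were 0 and 1 gets the value 0.5.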

2.1.8. Temporal Difference Learning

Temporal difference learning (TDL) [24] is one of RL's most popular and innovative concepts. It is a hybrid of dynamic programming and Monte Carlo simulation. Like dynamic programming, TDL adjusts its reward predictions based on estimates the agent has already learned. Furthermore, it does not require substantial knowledge about the agent's surroundings, but merely sequences of interactions, similar to Monte Carlo approaches. Because TDL approaches offer the aforementioned benefits over other methods, they are widely employed in RL algorithms [17]. As a result, they are an effective instrument for determining the appropriate policy. They can be employed with minimal computational expense, in the proper context, and, most significantly, with only one equation [25]. The simplest TDL approximation, TD(0), uses the following update rule:
$V(S_t) \leftarrow V(S_t) + lr \left[ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$
where $0 < lr < 1$ is the learning rate, $r_{t+1} + \gamma V(S_{t+1})$ is the temporal-difference target, and $r_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the temporal-difference error.
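A single TD(0) backup can be written directly from the update rule; the function name, states, and hyperparameter values here are illustrative:

```python
def td0_update(V, s, r, s_next, lr=0.1, gamma=0.9):
    """One TD(0) backup: move V(S_t) toward the target r_{t+1} + gamma * V(S_{t+1})."""
    td_target = r + gamma * V[s_next]   # the temporal-difference target
    td_error = td_target - V[s]         # the temporal-difference error
    V[s] += lr * td_error
    return V[s]
```

For instance, with `V = {"a": 0.0, "b": 1.0}`, `lr=0.5`, and `gamma=1.0`, a transition from `"a"` to `"b"` with reward 0.5 moves `V["a"]` halfway toward the target 1.5, i.e., to 0.75.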

2.1.9. Q–Learning Algorithm

One of the most well-known temporal difference (TD) algorithms is Q-learning (see Algorithm 1) [26,27]. Q-learning is an off-policy algorithm; therefore, the behavior policy does not have to coincide with the policy being evaluated and updated. It uses the following update rule:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right].$
This algorithm approximates the optimal function $Q^*$ independently of the policy that the agent follows, as $\gamma \max_a Q(S_{t+1}, a)$ refers to the best action the agent can perform at state $S_{t+1}$.
Algorithm 1 Q-learning algorithm.
1: initialize $Q(S_t, A_t)$
2: for every episode do
3:   observe state $S_t$
4:   while $S_t$ is not terminal do
5:     select action $A_t$ and evaluate $Q$
6:     take action $A_t$
7:     observe $r_{t+1}$, $S_{t+1}$
8:     $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
9:     $S_t \leftarrow S_{t+1}$
10:   end while
11: end for
So far, all these methods require intense memory allocation to work correctly. Specifically, we must store an entry for every state $S_t$ and action $A_t$. Such a solution is infeasible in real-world applications where the state space is vast. This is why value functions need to be approximated by other types of functions, such as parametric ones [28]. Therefore:
$Q(s, a) \approx Q_{\theta}(s, a).$
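Algorithm 1 can be sketched as a tabular implementation on a toy chain environment; the environment, hyperparameters, and episode count below are illustrative choices, not taken from the original study:

```python
import random

random.seed(0)

# Illustrative toy chain: states 0..3; moving right from state 2 reaches the
# terminal state 3 and yields reward 1.0; all other transitions give 0.
N_STATES, ACTIONS, TERMINAL = 4, (0, 1), 3   # action 0 = left, 1 = right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def env_step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(TERMINAL, s + 1)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

def greedy(s):
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])  # break ties randomly

for _ in range(2000):                                 # episodes
    s = 0
    while s != TERMINAL:
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
        s_next, r = env_step(s, a)
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Q-learning update
        s = s_next
```

After training, the greedy policy derived from the table moves right from every state, and $Q(2, \text{right})$ converges to 1.0 while values further from the goal shrink by factors of $\gamma$, exactly as the update rule predicts.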

4. Benchmarks and Comparisons

For over a decade, judging by publications in RL, a great number of companies, including Google's DeepMind, Microsoft (Figure 13 and Figure 14), and a few others, have been researching the best algorithms to beat the most popular games. Comparisons usually take place on ATARI and board games, due to the absence of an accurate metric for comparing the outcomes of two intelligent agents in a strategy or MOBA game [28]. The following benchmarks refer to the results of the DQN algorithm, which combines Q-learning with a deep neural network, in ATARI games, as published by DeepMind. The study also contains a human performance indicator for comparison.
Figure 13. Performance of the DQN agent in ATARI games [19].
Figure 14. Performance of the Dueling architecture in ATARI games, in comparison with the Prioritized Double DQN [14].
The next agent after DQN, which outperformed everything that preceded it (including DQN), was the double DQN (DDQN). This deep RL agent, as analyzed in the previous section, uses two identical neural network models: one learns during experience replay, just as DQN does, while the other is a copy of the first model from the previous episode. The next diagram shows that the double DQN performs better than DQN on the same set of ATARI games.
As we can see from Figure 15, in some games, including Atlantis, Tennis, and Space Invaders, DDQN improved the maximum achievable scores by up to 296%. On average, DDQN performed 63% better than DQN, which was the biggest leap in performance between two intelligent deep RL agents playing a video game. The next model that came into production, also by Google's DeepMind, was the dueling DQN. This architecture achieved better scores in some games, but the improvement over DQN and double DQN was less noticeable.
Figure 15. Performance of the Dueling architecture in ATARI games, in comparison with the single DQN [14].

5. Discussion

Although game-based environments provide an easy way of mitigating the difficulty of building a comprehensive and interactive environment, much of the cutting-edge research has focused either on a single complex environment, such as AlphaGo, or on a cohort of simpler systems, such as the ATARI games. This raises an important question as to whether the field of DRL as a whole is progressing toward general intelligence. The authors strongly feel that generalizability across multiple complex environments, and a focus on agents reusing past experience, should be the paramount goals of current research, rather than merely producing good results in some simple settings.
The study supported the notion that there has been a clear trend of "game"-related articles using RL or deep RL over the last decade. In most studies, RL or deep RL is employed to situate the intelligent agent in a game rather than in a raw simulation. These patterns repeat themselves over certain games, periods, or genres. By focusing on the algorithms that outperformed their predecessors, we hope to delineate their use in some essential gaming contexts.
Various algorithms have recently demonstrated promising performance in managing complicated multi-centric decision-making, and these can be used to solve real-world challenges. By enriching the simulated surroundings, the success achieved in racing games may be transferred to self-driving cars. On the other hand, the practical uses of deep RL are still in their infancy, because a simulation cannot perfectly replicate complicated real-world settings. Although real-life testing is possible to some extent, it can be dangerous without safety precautions.

Author Contributions

Conceptualization, K.S. and G.A.P.; methodology, K.S. and G.K.S.; validation, K.S., G.K.S. and G.A.P.; formal analysis, K.S.; investigation, K.S. and G.K.S.; resources, K.S.; data curation, K.S.; writing (original draft preparation), K.S.; writing (review and editing), K.S., G.K.S. and G.A.P.; visualization, K.S. and G.K.S.; supervision, G.K.S. and G.A.P.; project administration, G.K.S. and G.A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This work was supported by the MPhil program “Advanced Technologies in Informatics and Computers”, hosted by the Department of Computer Science, International Hellenic University, Kavala, Greece.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Broy, M. Software engineering from auxiliary to key technology. In Software Pioneers; Springer: Berlin/Germany, 2011; pp. 10–13. [Google Scholar]
  2. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. ImageNet: A large–scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  3. Duy, T.; Sato, Y.; Inoguchi, Y. Performance evaluation of a green scheduling algorithm for energy savings in cloud computing. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Ph.D. Forum (IPDPSW), Atlanta, GA, USA, 19–23 April 2010; pp. 1–8. [Google Scholar]
  4. Prevost, J.; Nagothu, K.; Kelley, B.; Jamshidi, M. Prediction of cloud data center networks loads using stochastic and neural models. In Proceedings of the 2011 6th International Conference on System of Systems Engineering, Albuquerque, NM, USA, 27–30 June 2011; pp. 276–281. [Google Scholar]
  5. Zhang, J.; Xie, N.; Zhang, X.; Yue, K.; Li, W.; Kumar, D. Machine learning based resource allocation of cloud computing in auction. Comput. Mater. Contin. 2018, 56, 123–135. [Google Scholar]
  6. Yang, R.; Ouyang, X.; Chen, Y.; Townend, P.; Xu, J. Intelligent resource scheduling at scale: A machine learning perspective. In Proceedings of the 2018 IEEE Symposium on Service–Oriented System Engineering, Bamberg, Germany, 26–29 March 2018; pp. 132–141. [Google Scholar]
  7. Islam, S.; Keung, J.; Lee, K.; Liu, A. Empirical prediction models for adaptive resource provisioning in the cloud. Future Gener. Comput. Syst. 2012, 28, 155–162. [Google Scholar] [CrossRef]
  8. Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The Arcade Learning Environment: An Evaluation Platform for General Agents. arXiv 2012, arXiv:1207.4708. [Google Scholar] [CrossRef]
  9. Hausknecht, M.; Lehman, J.; Miikkulainen, R.; Stone, P. A Neuroevolution Approach to General Atari Game Playing. IEEE Trans. Comput. Intell. Games 2014, 6, 355–366. [Google Scholar] [CrossRef]
  10. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. A Brief Survey of Deep Reinforcement Learning. arXiv 2017, arXiv:1708.05866. [Google Scholar] [CrossRef]
  11. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor–Critic Methods. In Proceedings of the 2018 International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. arXiv:1802.09477. [Google Scholar]
  12. Anschel, O.; Baram, N.; Shimkin, N. Averaged–DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 176–185. [Google Scholar]
  13. Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q–learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  14. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
  15. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 2015 International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar] [CrossRef]
  16. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  17. Hausknecht, M.; Stone, P. Deep recurrent q–learning for partially observable mdps. In Proceedings of the 2015 AAAI Fall Symposium Series, Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
  18. Zhao, Z. Variants of Bellman equation on reinforcement learning problems. In Proceedings of the 2nd International Conference on Artificial Intelligence, Automation, and High–Performance Computing (AIAHPC 2022), Zhuhai, China, 25–27 February 2022; Zhu, L., Ed.; International Society for Optics and Photonics, SPIE: Zhuhai, China, 2022; Volume 12348, p. 132. [Google Scholar] [CrossRef]
  19. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G. Human–level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  20. Khan, M.A.M.; Khan, M.R.J.; Tooshil, A.; Sikder, N.; Mahmud, M.A.P.; Kouzani, A.Z.; Nahid, A.A. A Systematic Review on Reinforcement Learning–Based Robotics Within the Last Decade. IEEE Access 2020, 8, 176598–176623. [Google Scholar] [CrossRef]
  21. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  22. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  23. Lin, L.J. Reinforcement Learning for Robots Using Neural Networks. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, January 1993. [Google Scholar]
  24. Fortunato, M.; Azar, M.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O. Noisy networks for exploration. arXiv 2017, arXiv:1706.10295. [Google Scholar]
  25. Osband, I.; Blundell, C.; Pritzel, A.; Roy, B. Deep Exploration via Bootstrapped Dqn. Advances in Neural Information Processing Systems. 2016. Available online: https://proceedings.neurips.cc/paper/2016/file/8d8818c8e140c64c743113f563cf750f-Paper.pdf (accessed on 3 January 2023).
  26. Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I. Deep q–learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  27. Andrew, A. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, Adaptive Computation and Machine Learning Series; MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
  28. Roy, B. An analysis of temporal–difference learning with function approximation. Autom. Control. IEEE Trans. 1997, 42, 674–690. [Google Scholar]
  29. Kelly, S.; Heywood, M.I. Emergent Tangled Graph Representations for Atari Game Playing Agents. In Proceedings of the Genetic Programming; McDermott, J., Castelli, M., Sekanina, L., Haasdijk, E., García–Sánchez, P., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 64–79. [Google Scholar]
  30. Wilson, D.G.; Cussat–Blanc, S.; Luga, H.; Miller, J.F. Evolving Simple Programs for Playing Atari Games. In Proceedings of the Genetic and Evolutionary Computation Conference, Kyoto, Japan, 15–19 July 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 229–236. [Google Scholar] [CrossRef]
  31. Smith, R.J.; Heywood, M.I. Scaling Tangled Program Graphs to Visual Reinforcement Learning in ViZDoom. In Proceedings of the Genetic Programming; Castelli, M., Sekanina, L., Zhang, M., Cagnoni, S., García–Sánchez, P., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 135–150. [Google Scholar]
  32. Smith, R.J.; Heywood, M.I. Evolving Dota 2 Shadow Fiend Bots Using Genetic Programming with External Memory. In Proceedings of the Genetic and Evolutionary Computation Conference, Prague, Czech Republic, 13–17 July 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 179–187. [Google Scholar] [CrossRef]
  33. Zhao, D.; Wang, H.; Shao, K.; Zhu, Y. Deep reinforcement learning with experience replay based on sarsa. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence, Athens, Greece, 6–9 December 2016; pp. 1–6. [Google Scholar]
  34. Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J. Starcraft ii: A new challenge for reinforcement learning. arXiv 2017, arXiv:1708.04782. [Google Scholar]
  35. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395. [Google Scholar]
  36. Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  37. Melnik, A.; Fleer, S.; Schilling, M.; Ritter, H. Modularization of end–to–end learning: Case study in arcade games. arXiv 2019, arXiv:1901.09895. [Google Scholar]
  38. Polvara, R.; Patacchiola, M.; Sharma, S.; Wan, J.; Manning, A.; Sutton, R.; Cangelosi, A. Autonomous quadrotor landing using deep reinforcement learning. arXiv 2017, arXiv:1709.03339. [Google Scholar]
  39. Pan, X.; You, Y.; Wang, Z.; Lu, C. Virtual to real reinforcement learning for autonomous driving. arXiv 2017, arXiv:1704.03952. [Google Scholar]
  40. Loiacono, D.; Cardamone, L.; Lanzi, P. Simulated car racing championship: Competition software manual. arXiv 2013, arXiv:1304.1672. [Google Scholar]
  41. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  42. Synnaeve, G.; Nardelli, N.; Auvolat, A.; Chintala, S.; Lacroix, T.; Lin, Z.; Richoux, F.; Usunier, N. Torchcraft: A library for machine learning research on real–time strategy games. arXiv 2016, arXiv:1611.00625. [Google Scholar]
  43. Peng, P.; Wen, Y.; Yang, Y.; Yuan, Q.; Tang, Z.; Long, H.; Wang, J. Multiagent bidirectionally–coordinated nets: Emergence of humanlevel coordination in learning to play starcraft combat games. arXiv 2017, arXiv:1703.10069. [Google Scholar]
  44. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi–agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  45. Logothetis, N. The ins and outs of fmri signals. Nat. Neurosci. 2007, 10, 1230–1232. [Google Scholar] [CrossRef] [PubMed]
  46. Oh, J.; Singh, S.; Lee, H.; Kohli, P. Zero–shot task generalization with multi–task deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2661–2670. [Google Scholar]
  47. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym, 2016. arXiv 2017, arXiv:1606.01540. [Google Scholar]
  48. Xiong, Y.; Chen, H.; Zhao, M.; An, B. Hogrider: Champion agent of microsoft malmo collaborative ai challenge. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  49. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T. Mastering chess and shogi by self–play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815. [Google Scholar]
  50. David, O.; Netanyahu, N.; Wolf, L. Deepchess: End–to–end deep neural network for automatic learning in chess. In Proceedings of the International Conference on Artificial Neural Networks, Barcelona, Spain, 6–9 September 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 88–96. [Google Scholar]
  51. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T. A general reinforcement learning algorithm that masters chess, shogi, and go through self–play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef] [PubMed]
  52. Justesen, N.; Torrado, R.; Bontrager, P.; Khalifa, A.; Togelius, J.; Risi, S. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv 2018, arXiv:1806.10729. [Google Scholar]
  53. Choudhary, A. A Hands-On Introduction to Deep Q-Learning Using OpenAI Gym in Python. Anal Vidhya. 2019. Available online: https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/ (accessed on 21 October 2022).
  54. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  55. OpenAI; Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
  56. Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count–based exploration and intrinsic motivation. Adv. Neural Inf. Process. Syst. 2016, 29, 1479–1487. [Google Scholar]
  57. Hochreiter, S.; Schmidhuber, J. Long short–term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  58. Such, F.; Madhavan, V.; Conti, E.; Lehman, J.; Stanley, K.; Clune, J. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv 2017, arXiv:1712.06567. [Google Scholar]
  59. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. arXiv 2015, arXiv:1506.03134. [Google Scholar]
  60. Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; Teh, Y.W. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. In Proceedings of the International Conference on Machine Learning, Stockholmsmässan, Sweden, 15–19 July 2018. [Google Scholar]
  61. Ammanabrolu, P.; Riedl, M.O. Playing Text–Adventure Games with Graph–Based Deep Reinforcement Learning. arXiv 2018, arXiv:1812.01628. [Google Scholar]
  62. Adolphs, L.; Hofmann, T. LeDeepChef: Deep Reinforcement Learning Agent for Families of Text–Based Games. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019. [Google Scholar]
  63. Brown, N.; Bakhtin, A.; Lerer, A.; Gong, Q. Combining Deep Reinforcement Learning and Search for Imperfect–Information Games. arXiv 2020, arXiv:2007.13544. [Google Scholar]
  64. Ye, D.; Liu, Z.; Sun, M.; Shi, B.; Zhao, P.; Wu, H.; Yu, H.; Yang, S.; Wu, X.; Guo, Q.; et al. Mastering Complex Control in MOBA Games with Deep Reinforcement Learning. Proc. AAAI Conf. Artif. Intell. 2020, 34, 6672–6679. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
