# Brain-Inspired Agents for Quantum Reinforcement Learning


## Abstract


## 1. Introduction

## 2. Background

#### 2.1. Spiking Neural Networks

#### 2.2. Long Short-Term Memory

- Cell state: The network’s current long-term memory;
- Hidden state: The output from the preceding time step;
- The input data in the present time step.

- **Forget gate**: Decides which information in the cell state remains important, considering both the previous hidden state and the new input data. The neural network implementing this gate produces an output close to 0 when the corresponding input is considered unimportant and close to 1 otherwise; to achieve this, it employs the sigmoid activation function. The gate's output then undergoes pointwise multiplication with the previous cell state, so components of the cell state identified as insignificant are multiplied by a value approaching 0 and have reduced influence on subsequent steps. In summary, the forget gate determines which portions of the long-term memory should be disregarded (given less weight) based on the previous hidden state and the new input data;
- **Input gate**: Determines how new information is integrated into the network's long-term memory (cell state), considering the prior hidden state and the incoming data. The same inputs are utilized, but now with a hyperbolic tangent as the activation function. This layer learns to blend the previous hidden state with the incoming data, producing an updated memory vector that encapsulates the new input within the context of the previous hidden state; it indicates the extent to which each component of the cell state should be updated based on the new information. The hyperbolic tangent is used deliberately here, owing to its output range of [−1, 1]: negative values are needed to attenuate the impact of specific components;
- **Output gate**: Decides the new hidden state by incorporating the newly updated cell state, the prior hidden state, and the new input data. This hidden state must contain the necessary information while avoiding the inclusion of everything learned. To achieve this, we employ the sigmoid function.

- The initial step determines which information to discard or preserve at the current time step. This is done with the sigmoid function, which examines both the preceding hidden state ${h}_{t-1}$ and the present input ${x}_{t}$ (concatenated into ${v}_{t}$):$${f}_{t}=\sigma ({W}_{f}\cdot {v}_{t}+{b}_{f})$$
- In this step, the memory cell content is updated by choosing new information to store in the cell state. The input gate comprises two components: a sigmoid function and a hyperbolic tangent (tanh). The sigmoid layer decides which values to update: a value of 1 indicates no change, while a value of 0 results in exclusion. A tanh layer then generates a vector of new candidate values, weighting each value by its significance within the range from −1 to 1. These two components are combined to update the state:$$\begin{array}{c}\hfill {i}_{t}=\sigma ({W}_{i}\cdot {v}_{t}+{b}_{i})\\ \hfill {\tilde{C}}_{t}=\mathrm{tanh}({W}_{c}\cdot {v}_{t}+{b}_{c})\end{array}$$
- The third step updates the previous cell state ${C}_{t-1}$ to the new cell state ${C}_{t}$ through two operations: forgetting irrelevant information by scaling the previous state by ${f}_{t}$, and incorporating new information from the candidate ${\tilde{C}}_{t}$:$${C}_{t}={f}_{t}\cdot {C}_{t-1}+{i}_{t}\cdot {\tilde{C}}_{t}$$
- Finally, the output is calculated through a two-step process. First, a sigmoid layer determines which aspects of the cell state are pertinent for transmission to the output:$${o}_{t}=\sigma ({W}_{o}\cdot {v}_{t}+{b}_{o})$$The cell state then undergoes processing via a tanh layer to normalize values between −1 and 1, followed by multiplication with the output of the sigmoid gate:$${h}_{t}=\mathrm{tanh}\left({C}_{t}\right)\cdot {o}_{t}$$
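
The four steps above can be sketched with plain NumPy (a minimal illustration, not the paper's implementation; the weight shapes and the concatenation ${v}_{t}=[{h}_{t-1},{x}_{t}]$ are assumptions consistent with the equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, C_prev, W, b):
    """One LSTM step computing f_t, i_t, C~_t, C_t, o_t, and h_t."""
    v_t = np.concatenate([h_prev, x_t])        # v_t = [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ v_t + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ v_t + b["i"])       # input gate
    C_tilde = np.tanh(W["c"] @ v_t + b["c"])   # candidate memory vector
    C_t = f_t * C_prev + i_t * C_tilde         # updated cell state
    o_t = sigmoid(W["o"] @ v_t + b["o"])       # output gate
    h_t = np.tanh(C_t) * o_t                   # new hidden state
    return h_t, C_t

# Tiny example: hidden size 3, input size 2
rng = np.random.default_rng(0)
hidden, inp = 3, 2
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, C = lstm_cell(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), W, b)
```

Because ${h}_{t}=\mathrm{tanh}({C}_{t})\cdot {o}_{t}$ multiplies two bounded factors, every component of the hidden state stays in (−1, 1).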

#### 2.3. Deep Reinforcement Learning

**Figure 5.**

**QLSTM architecture.** The input data pass through an initial classical layer, which produces an output of the concatenated size formed by the input dimension and the hidden size. This output then passes through a second classical layer, which outputs a vector with as many components as the number of qubits expected by the QLSTM cell, whose architecture is detailed in Figure 7. Subsequently, this output is received by another classical layer that transforms it to the hidden size. Finally, a last classical layer transforms it into the expected output dimension.
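
The layer sizes described in the caption can be traced with a small shape-flow sketch (an illustration of the data flow only; the `linear` helper and the dimensions 17/25/5/10, taken from the experimental tables later in the paper, are assumptions, and the quantum cell is a placeholder):

```python
import numpy as np

def linear(in_dim, out_dim):
    """Hypothetical stand-in for a trainable dense layer (weights zeroed)."""
    W = np.zeros((out_dim, in_dim))
    return lambda x: W @ x

input_dim, hidden, n_qubits, output_dim = 17, 25, 5, 10   # assumed sizes
to_concat = linear(input_dim, input_dim + hidden)   # 1st classical layer: concat size
to_qubits = linear(input_dim + hidden, n_qubits)    # 2nd: match QLSTM qubit count
qlstm_cell = lambda q: np.zeros(n_qubits)           # placeholder for the quantum cell
to_hidden = linear(n_qubits, hidden)                # back to the hidden size
to_output = linear(hidden, output_dim)              # final expected output

x = np.zeros(input_dim)
y = to_output(to_hidden(qlstm_cell(to_qubits(to_concat(x)))))
```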

**Figure 6.**

**General reinforcement learning diagram [51]**. At time t, the agent perceives state ${s}_{t}$ and, based on this state, selects an action ${a}_{t}$. The environment then transitions to a new state ${s}_{t+1}$ and provides the agent with this new state along with a reward ${r}_{t+1}$.
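
The interaction loop in the diagram can be written out directly (a toy sketch with a made-up environment, not the HVAC setting used later; `CoinEnv` and the trivial policy are illustrative assumptions):

```python
import random

class CoinEnv:
    """Toy environment: the state is a coin flip; action a_t is rewarded
    when it matches the state s_t the agent observed."""
    def reset(self):
        self.state = random.randint(0, 1)
        return self.state
    def step(self, action):
        reward = 1.0 if action == self.state else 0.0  # r_{t+1}
        self.state = random.randint(0, 1)              # transition to s_{t+1}
        return self.state, reward

env = CoinEnv()
s = env.reset()
total = 0.0
for t in range(100):
    a = s                  # trivial policy: echo the observed state
    s, r = env.step(a)     # environment returns s_{t+1} and r_{t+1}
    total += r             # accumulate return (undiscounted here)
```

Since the policy always echoes the observed state, every step is rewarded and the accumulated return is 100.0.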

#### 2.4. Quantum Neural Networks

**Figure 7.**

**General VQC schema.** The dashed gray line delineates the process carried out within a quantum processing unit (QPU), while the dashed blue line illustrates the processes executed on a CPU.

- **Pre-processing (CPU)**: Initial classical data preprocessing, including normalization and scaling;
- **Quantum embedding (QPU)**: Encoding classical data into quantum states through parameterized quantum gates. Various encoding techniques exist, such as angle encoding (also known as tensor product encoding) and amplitude encoding, among others [65];
- **Variational layer (QPU)**: This layer embodies the functionality of quantum neural networks through rotations and entanglement gates with trainable parameters, which are optimized using classical algorithms;
- **Measurement process (QPU/CPU)**: Measuring the quantum state and decoding it to derive the expected output. The selection of observables employed in this process is critical for achieving optimal performance;
- **Post-processing (CPU)**: Transformation of QPU outputs before feeding them back to the user and integrating them into the cost function during the learning phase;
- **Learning (CPU)**: Computation of the cost function and optimization of ansatz parameters using classical optimization algorithms, such as Adam or SGD. Gradient-free methods, such as SPSA or COBYLA, can also estimate parameter updates.
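
The embedding, variational, and measurement stages can be sketched with a tiny two-qubit statevector simulation (an illustrative toy, not the circuit used in the paper; the single RY layer, the CNOT, and the Pauli-Z observables are assumptions):

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
Z = np.diag([1.0, -1.0])

def vqc(x, params):
    """Angle-encode 2 features, apply a trainable RY layer plus a CNOT,
    then measure the <Z> expectation on each qubit."""
    state = np.zeros(4); state[0] = 1.0                     # start in |00>
    state = np.kron(ry(x[0]), ry(x[1])) @ state             # quantum embedding
    state = np.kron(ry(params[0]), ry(params[1])) @ state   # variational layer
    state = CNOT @ state                                    # entanglement
    z0 = state @ (np.kron(Z, np.eye(2)) @ state)            # <Z> on qubit 0
    z1 = state @ (np.kron(np.eye(2), Z) @ state)            # <Z> on qubit 1
    return np.array([z0, z1])

out = vqc(np.array([0.3, -0.7]), np.array([0.1, 0.2]))
```

The returned expectations lie in [−1, 1] and would feed the classical post-processing and learning stages; in practice, the gradients of `params` are estimated on a CPU (e.g. with the parameter-shift rule) and updated with Adam or SGD.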

## 3. Methodology

## 4. Experimentation

#### 4.1. Problem Statement

The simulated building has a floor area of approximately 465 m^{2} (5000 feet^{2}).

- **State space:** The state space encompasses 17 attributes; 14 are detailed in Table 1, while the remaining 3 are reserved in case new customized features need to be added;
- **Action space:** The action space comprises a collection of 10 discrete actions, as outlined in Table 2. The temperature bounds for heating and cooling are [12, 23.5] and [21.5, 40], respectively;
- **Reward function:** The reward function is multi-objective: energy consumption and thermal discomfort are both normalized and added together with different weights. The reward value is consistently non-positive, so optimal behavior yields a cumulative reward of 0. Two temperature comfort ranges are defined, one for the summer period and another for the winter period. The weight of each term in the reward allows the importance of each aspect to be adjusted when environments are evaluated. Finally, the reward function is customizable and can be integrated into the environment.
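
A weighted, non-positive reward of this shape can be sketched as follows (a hypothetical formulation; the weights, the comfort range, and the normalization constants are illustrative assumptions, not the paper's exact function):

```python
def reward(power_w, temp_c, comfort_range=(20.0, 23.5),
           w_energy=0.5, w_comfort=0.5, max_power_w=10_000.0):
    """Multi-objective reward: normalized energy use and thermal discomfort,
    weighted and negated so that optimal behavior yields 0."""
    energy_term = min(power_w / max_power_w, 1.0)           # normalized consumption
    low, high = comfort_range
    violation = max(low - temp_c, 0.0) + max(temp_c - high, 0.0)
    comfort_term = min(violation / 10.0, 1.0)               # normalized discomfort
    return -(w_energy * energy_term + w_comfort * comfort_term)
```

With zero consumption and an in-range temperature the reward is exactly 0; any energy use or comfort violation makes it negative, which matches the statement that the cumulative reward of an optimal policy is 0. Swapping `comfort_range` per season would implement the summer/winter distinction.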

#### 4.2. Experimental Settings

#### 4.3. Results

## 5. Discussion

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

A2C | Advantage Actor–Critic |

BPTT | Backpropagation through time |

BG | Basal Ganglia |

CPU | Central processing unit |

DQN | Deep Q-network |

DRL | Deep reinforcement learning |

HDC | Hyperdimensional computing |

HVAC | Heating, ventilation, air-conditioning |

KPI | Key performance indicator |

LIF | Leaky integrate-and-fire neuron |

LSTM | Long short-term memory |

LTS | Long-term store |

mPFC | Medial prefrontal cortex |

MDP | Markov decision process |

MLP | Multilayer perceptron |

NISQ | Noisy intermediate-scale quantum |

PFC | Prefrontal cortex |

QC | Quantum computing |

QLIF | Quantum leaky integrate-and-fire neuron |

QLSTM | Quantum long short-term memory |

QML | Quantum machine learning |

QNN | Quantum neural network |

QSNN | Quantum spiking neural network |

QPU | Quantum processing unit |

QRL | Quantum reinforcement learning |

RNN | Recurrent neural network |

RL | Reinforcement learning |

SNN | Spiking neural network |

S-R | Stimulus-response |

STS | Short-term store |

vmPFC | Ventromedial prefrontal cortex |

VQC | Variational quantum circuit |

## References

- Zhao, L.; Zhang, L.; Wu, Z.; Chen, Y.; Dai, H.; Yu, X.; Liu, Z.; Zhang, T.; Hu, X.; Jiang, X.; et al. When brain-inspired AI meets AGI. Meta-Radiology
**2023**, 1, 100005. [Google Scholar] [CrossRef] - Fan, C.; Yao, L.; Zhang, J.; Zhen, Z.; Wu, X. Advanced Reinforcement Learning and Its Connections with Brain Neuroscience. Research
**2023**, 6, 0064. [Google Scholar] [CrossRef] - Domenech, P.; Rheims, S.; Koechlin, E. Neural mechanisms resolving exploitation-exploration dilemmas in the medial prefrontal cortex. Science
**2020**, 369, eabb0184. [Google Scholar] [CrossRef] - Baram, A.B.; Muller, T.H.; Nili, H.; Garvert, M.M.; Behrens, T.E.J. Entorhinal and ventromedial prefrontal cortices abstract and generalize the structure of reinforcement learning problems. Neuron
**2021**, 109, 713–723.e7. [Google Scholar] [CrossRef] - Bogacz, R.; Larsen, T. Integration of Reinforcement Learning and Optimal Decision-Making Theories of the Basal Ganglia. Neural Comput.
**2011**, 23, 817–851. [Google Scholar] [CrossRef] - Houk, J.; Adams, J.; Barto, A. A Model of How the Basal Ganglia Generate and Use Neural Signals that Predict Reinforcement. Model. Inf. Process. Basal Ganglia
**1995**, 13. [Google Scholar] [CrossRef] - Joel, D.; Niv, Y.; Ruppin, E. Actor–critic models of the basal ganglia: New anatomical and computational perspectives. Neural Netw.
**2002**, 15, 535–547. [Google Scholar] [CrossRef] - Collins, A.G.E.; Frank, M.J. Opponent actor learning (OpAL): Modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive. Psychol. Rev.
**2014**, 121, 337–366. [Google Scholar] [CrossRef] - Maia, T.V.; Frank, M.J. From reinforcement learning models to psychiatric and neurological disorders. Nat. Neurosci.
**2011**, 14, 154–162. [Google Scholar] [CrossRef] - Maia, T.V. Reinforcement learning, conditioning, and the brain: Successes and challenges. Cogn. Affect. Behav. Neurosci.
**2009**, 9, 343–364. [Google Scholar] [CrossRef] - O’Doherty, J.; Dayan, P.; Schultz, J.; Deichmann, R.; Friston, K.; Dolan, R.J. Dissociable Roles of Ventral and Dorsal Striatum in Instrumental Conditioning. Science
**2004**, 304, 452–454. [Google Scholar] [CrossRef] - Chalmers, E.; Contreras, E.B.; Robertson, B.; Luczak, A.; Gruber, A. Context-switching and adaptation: Brain-inspired mechanisms for handling environmental changes. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 3522–3529. [Google Scholar] [CrossRef]
- Robertazzi, F.; Vissani, M.; Schillaci, G.; Falotico, E. Brain-inspired meta-reinforcement learning cognitive control in conflictual inhibition decision-making task for artificial agents. Neural Netw.
**2022**, 154, 283–302. [Google Scholar] [CrossRef] - Zhao, Z.; Zhao, F.; Zhao, Y.; Zeng, Y.; Sun, Y. A brain-inspired theory of mind spiking neural network improves multi-agent cooperation and competition. Patterns
**2023**, 4, 100775. [Google Scholar] [CrossRef] - Zhang, K.; Lin, X.; Li, M. Graph attention reinforcement learning with flexible matching policies for multi-depot vehicle routing problems. Phys. A Stat. Mech. Appl.
**2023**, 611, 128451. [Google Scholar] [CrossRef] - Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv
**2019**. [Google Scholar] [CrossRef] - Rezayi, S.; Dai, H.; Liu, Z.; Wu, Z.; Hebbar, A.; Burns, A.H.; Zhao, L.; Zhu, D.; Li, Q.; Liu, W.; et al. ClinicalRadioBERT: Knowledge-Infused Few Shot Learning for Clinical Notes Named Entity Recognition. In Proceedings of the Machine Learning in Medical Imaging; Lian, C., Cao, X., Rekik, I., Xu, X., Cui, Z., Eds.; Springer: Cham, Switzerland, 2022; pp. 269–278. [Google Scholar]
- Liu, Z.; He, X.; Liu, L.; Liu, T.; Zhai, X. Context Matters: A Strategy to Pre-train Language Model for Science Education. In Proceedings of the Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky; Wang, N., Rebolledo-Mendez, G., Dimitrova, V., Matsuda, N., Santos, O.C., Eds.; Springer: Cham, Switzerland, 2023; pp. 666–674. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv
**2020**. [Google Scholar] [CrossRef] - Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv
**2021**. [Google Scholar] [CrossRef] - Aïmeur, E.; Brassard, G.; Gambs, S. Quantum speed-up for unsupervised learning. Machine Learn.
**2013**, 90, 261–287. [Google Scholar] [CrossRef] - Schuld, M.; Bocharov, A.; Svore, K.M.; Wiebe, N. Circuit-centric quantum classifiers. Phys. Rev. A
**2020**, 101, 032308. [Google Scholar] [CrossRef] - Wiebe, N.; Kapoor, A.; Svore, K.M. Quantum Nearest-Neighbor Algorithms for Machine Learning. Quantum Inf. Comput.
**2015**, 15, 318–358. [Google Scholar] - Anguita, D.; Ridella, S.; Rivieccio, F.; Zunino, R. Quantum optimization for training support vector machines. Neural Netw.
**2003**, 16, 763–770. [Google Scholar] [CrossRef] - Andrés, E.; Cuéllar, M.P.; Navarro, G. On the Use of Quantum Reinforcement Learning in Energy-Efficiency Scenarios. Energies
**2022**, 15, 6034. [Google Scholar] [CrossRef] - Andrés, E.; Cuéllar, M.P.; Navarro, G. Efficient Dimensionality Reduction Strategies for Quantum Reinforcement Learning. IEEE Access
**2023**, 11, 104534–104553. [Google Scholar] [CrossRef] - Busemeyer, J.R.; Bruza, P.D. Quantum Models of Cognition and Decision; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
- Li, J.A.; Dong, D.; Wei, Z.; Liu, Y.; Pan, Y.; Nori, F.; Zhang, X. Quantum reinforcement learning during human decision-making. Nat. Hum. Behav.
**2020**, 4, 294–307. [Google Scholar] [CrossRef] - Miller, E.K.; Cohen, J.D. An Integrative Theory of Prefrontal Cortex Function. Annu. Rev. Neurosci.
**2001**, 24, 167–202. [Google Scholar] [CrossRef] - Atkinson, R.; Shiffrin, R. Human Memory: A Proposed System and its Control Processes. Psychol. Learn. Motiv.
**1968**, 2, 89–195. [Google Scholar] [CrossRef] - Andersen, P. The Hippocampus Book; Oxford University Press: Oxford, UK, 2007. [Google Scholar]
- Olton, D.S.; Becker, J.T.; Handelmann, G.E. Hippocampus, space, and memory. Behav. Brain Sci.
**1979**, 2, 313–322. [Google Scholar] [CrossRef] - Eshraghian, J.K.; Ward, M.; Neftci, E.; Wang, X.; Lenz, G.; Dwivedi, G.; Bennamoun, M.; Jeong, D.S.; Lu, W.D. Training Spiking Neural Networks Using Lessons from Deep Learning. Proc. IEEE
**2021**, 111, 1016–1054. [Google Scholar] [CrossRef] - McCloskey, M.; Cohen, N.J. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychol. Learn. Motiv.
**1989**, 24, 109–165. [Google Scholar] [CrossRef] - Raman, N.S.; Devraj, A.M.; Barooah, P.; Meyn, S.P. Reinforcement Learning for Control of Building HVAC Systems. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020; pp. 2326–2332. [Google Scholar] [CrossRef]
- Wang, Y.; Velswamy, K.; Huang, B. A Long-Short Term Memory Recurrent Neural Network Based Reinforcement Learning Controller for Office Heating Ventilation and Air Conditioning Systems. Processes
**2017**, 5, 46. [Google Scholar] [CrossRef] - Fu, Q.; Han, Z.; Chen, J.; Lu, Y.; Wu, H.; Wang, Y. Applications of reinforcement learning for building energy efficiency control: A review. J. Build. Eng.
**2022**, 50, 104165. [Google Scholar] [CrossRef] - Hebb, D. The Organization of Behavior: A Neuropsychological Theory; Taylor & Francis: Abingdon, UK, 2005. [Google Scholar]
- Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S.R.; Masquelier, T.; Maida, A. Deep learning in spiking neural networks. Neural Netw.
**2019**, 111, 47–63. [Google Scholar] [CrossRef] - Lobo, J.L.; Del Ser, J.; Bifet, A.; Kasabov, N. Spiking Neural Networks and online learning: An overview and perspectives. Neural Netw.
**2020**, 121, 88–100. [Google Scholar] [CrossRef] - Lapicque, L. Recherches quantitatives sur l’excitation electrique des nerfs. J. Physiol. Paris
**1907**, 9, 620–635. [Google Scholar] - Zou, Z.; Alimohamadi, H.; Zakeri, A.; Imani, F.; Kim, Y.; Najafi, M.H.; Imani, M. Memory-inspired spiking hyperdimensional network for robust online learning. Sci. Rep.
**2022**, 12, 7641. [Google Scholar] [CrossRef] - Kumarasinghe, K.; Kasabov, N.; Taylor, D. Brain-inspired spiking neural networks for decoding and understanding muscle activity and kinematics from electroencephalography signals during hand movements. Sci. Rep.
**2021**, 11, 2486. [Google Scholar] [CrossRef] - Banino, A.; Barry, C.; Uria, B.; Blundell, C.; Lillicrap, T.; Mirowski, P.; Pritzel, A.; Chadwick, M.J.; Degris, T.; Modayil, J.; et al. Vector-based navigation using grid-like representations in artificial agents. Nature
**2018**, 557, 429–433. [Google Scholar] [CrossRef] - Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks
**1994**, 5, 157–166. [Google Scholar] [CrossRef] - Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Trans. Pattern Anal. Mach. Intell.
**2009**, 31, 855–868. [Google Scholar] [CrossRef] - Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. Off. J. Int. Neural Netw. Soc.
**2005**, 18, 602–610. [Google Scholar] [CrossRef] - Hochreiter, S.; Schmidhuber, J. LSTM can solve hard long time lag problems. Adv. Neural Inf. Process. Syst.
**1996**, 9, 473–479. [Google Scholar] - Triche, A.; Maida, A.S.; Kumar, A. Exploration in neo-Hebbian reinforcement learning: Computational approaches to the exploration–exploitation balance with bio-inspired neural networks. Neural Netw.
**2022**, 151, 16–33. [Google Scholar] [CrossRef] - Dong, H.; Ding, Z.; Zhang, S.; Yuan, H.; Zhang, H.; Zhang, J.; Huang, Y.; Yu, T.; Zhang, H.; Huang, R. Deep Reinforcement Learning: Fundamentals, Research, and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
- Sutton, R.S.; Barto, A.G. The Reinforcement Learning Problem. In Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; pp. 51–85. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature
**2015**, 518, 529–533. [Google Scholar] [CrossRef] - Shao, K.; Zhao, D.; Zhu, Y.; Zhang, Q. Visual Navigation with Actor-Critic Deep Reinforcement Learning. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Macaluso, A.; Clissa, L.; Lodi, S.; Sartori, C. A Variational Algorithm for Quantum Neural Networks. In Proceedings of the Computational Science—ICCS 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 591–604. [Google Scholar]
- Benedetti, M.; Lloyd, E.; Sack, S.; Fiorentini, M. Parameterized quantum circuits as machine learning models. Quantum Sci. Technol.
**2019**, 4, 043001. [Google Scholar] [CrossRef] - Zhao, C.; Gao, X.S. QDNN: Deep neural networks with quantum layers. Quantum Mach. Intell.
**2021**, 3, 15. [Google Scholar] [CrossRef] - Lu, B.; Liu, L.; Song, J.Y.; Wen, K.; Wang, C. Recent progress on coherent computation based on quantum squeezing. AAPPS Bull.
**2023**, 33, 7. [Google Scholar] [CrossRef] - Hou, X.; Zhou, G.; Li, Q.; Jin, S.; Wang, X. A duplication-free quantum neural network for universal approximation. Sci. China Physics Mech. Astron.
**2023**, 66, 270362. [Google Scholar] [CrossRef] - Zhao, M.; Chen, Y.; Liu, Q.; Wu, S. Quantifying direct associations between variables. Fundam. Res.
**2023**. [Google Scholar] [CrossRef] - Zhou, Z.-r.; Li, H.; Long, G.L. Variational quantum algorithm for node embedding. Fundam. Res.
**2023**. [Google Scholar] [CrossRef] - Ding, L.; Wang, H.; Wang, Y.; Wang, S. Based on Quantum Topological Stabilizer Color Code Morphism Neural Network Decoder. Quantum Eng.
**2022**, 2022, 9638108. [Google Scholar] [CrossRef] - Tian, J.; Sun, X.; Du, Y.; Zhao, S.; Liu, Q.; Zhang, K.; Yi, W.; Huang, W.; Wang, C.; Wu, X.; et al. Recent Advances for Quantum Neural Networks in Generative Learning. IEEE Trans. Pattern Anal. Mach. Intell.
**2023**, 45, 12321–12340. [Google Scholar] [CrossRef] - Jeswal, S.K.; Chakraverty, S. Recent Developments and Applications in Quantum Neural Network: A Review. Arch. Comput. Methods Eng.
**2019**, 26, 793–807. [Google Scholar] [CrossRef] - Wittek, P. Quantum Machine Learning: What Quantum Computing Means to Data Mining; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
- Weigold, M.; Barzen, J.; Leymann, F.; Salm, M. Expanding Data Encoding Patterns For Quantum Algorithms. In Proceedings of the 2021 IEEE 18th International Conference on Software Architecture Companion (ICSA-C), Stuttgart, Germany, 22–26 March 2021; pp. 95–101. [Google Scholar]
- Zenke, F.; Poole, B.; Ganguli, S. Continual Learning Through Synaptic Intelligence. Proc. Mach. Learn. Res.
**2017**, 70, 3987–3995. [Google Scholar] [PubMed] - Rusu, A.A.; Rabinowitz, N.C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; Hadsell, R. Progressive Neural Networks. arXiv
**2016**. [Google Scholar] [CrossRef] - Shin, H.; Lee, J.K.; Kim, J.; Kim, J. Continual Learning with Deep Generative Replay. arXiv
**2017**. [Google Scholar] [CrossRef] - Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.C.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. arXiv
**2016**. [Google Scholar] [CrossRef] - Crawley, D.; Pedersen, C.; Lawrie, L.; Winkelmann, F. EnergyPlus: Energy Simulation Program. Ashrae J.
**2000**, 42, 49–56. [Google Scholar] - Mattsson, S.E.; Elmqvist, H. Modelica—An International Effort to Design the Next Generation Modeling Language. Ifac Proc. Vol.
**1997**, 30, 151–155. [Google Scholar] [CrossRef] - Zhang, Z.; Lam, K.P. Practical Implementation and Evaluation of Deep Reinforcement Learning Control for a Radiant Heating System. In Proceedings of the 5th Conference on Systems for Built Environments, BuildSys ’18, New York, NY, USA, 7–8 November 2018; pp. 148–157. [Google Scholar] [CrossRef]
- Jiménez-Raboso, J.; Campoy-Nieves, A.; Manjavacas-Lucas, A.; Gómez-Romero, J.; Molina-Solana, M. Sinergym: A Building Simulation and Control Framework for Training Reinforcement Learning Agents. In Proceedings of the 8th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, New York, NY, USA, 17–18 November 2021; pp. 319–323. [Google Scholar] [CrossRef]
- Scharnhorst, P.; Schubnel, B.; Fernández Bandera, C.; Salom, J.; Taddeo, P.; Boegli, M.; Gorecki, T.; Stauffer, Y.; Peppas, A.; Politi, C. Energym: A Building Model Library for Controller Benchmarking. Appl. Sci.
**2021**, 11, 3518. [Google Scholar] [CrossRef] - Hill, F.; Lampinen, A.; Schneider, R.; Clark, S.; Botvinick, M.; McClelland, J.L.; Santoro, A. Environmental drivers of systematicity and generalization in a situated agent. arXiv
**2020**. [Google Scholar] [CrossRef] - Lake, B.M.; Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv
**2018**. [Google Scholar] [CrossRef] - Botvinick, M.; Wang, J.X.; Dabney, W.; Miller, K.J.; Kurth-Nelson, Z. Deep Reinforcement Learning and Its Neuroscientific Implications. Neuron
**2020**, 107, 603–616. [Google Scholar] [CrossRef]

**Figure 1.** The standard structure of a neuron includes a cell body, also known as the soma, housing the nucleus and other organelles; dendrites, which are slender, branched extensions that receive synaptic input from neighboring neurons; an axon; and synaptic terminals.

**Figure 2.** The leaky integrate-and-fire neuron model [33] involves an insulating lipid bilayer membrane that separates the interior and exterior environments. Gated ion channels enable the diffusion of charge carriers, such as Na^{+}, through the membrane (**a**). This neuron function is often modeled using an RC circuit; when the membrane potential exceeds the threshold, a spike is generated (**b**). Input spikes are transmitted to the neuron body via dendritic branches, and accumulation of sufficient excitation triggers spike emission at the output (**c**). A simulation depicts the membrane potential $U\left(t\right)$ reaching a threshold of $\theta =0.5$ V, leading to the generation of output spikes (**d**).
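
The leaky integrate-and-fire dynamics in Figure 2 can be simulated in a few lines (a minimal discrete-time sketch; the decay factor `beta`, the threshold, and the reset-to-zero rule are common modeling assumptions, not parameters from the paper):

```python
import numpy as np

def lif_simulate(current, beta=0.9, threshold=0.5):
    """Discrete-time LIF neuron: the membrane potential U leaks by a factor
    beta each step, integrates the input current, and emits a spike
    (then resets) whenever U crosses the threshold."""
    U, spikes, trace = 0.0, [], []
    for I in current:
        U = beta * U + I          # leaky integration of the input
        if U > threshold:
            spikes.append(1)      # spike emitted
            U = 0.0               # reset membrane potential
        else:
            spikes.append(0)
        trace.append(U)
    return np.array(spikes), np.array(trace)

# Constant input current of 0.2 for 20 time steps
spikes, trace = lif_simulate(np.full(20, 0.2), threshold=0.5)
```

With this constant input the potential climbs for three steps (0.2, 0.38, 0.542), crosses the threshold, resets, and repeats, producing a regular spike train.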

**Figure 3.**

**SNN pipeline.** The input data for an SNN may undergo transformation into a firing rate or alternative encodings to produce spikes. Subsequently, the model is trained to accurately predict the correct class by employing encoding strategies, such as the highest firing rate or firing first, amongst other options.

**Figure 4.**

**LSTM cell architecture.** Featuring three essential gates (forget, input, and output gates). The $tanh$ and $\sigma $ blocks symbolize the hyperbolic tangent and sigmoid activation functions, respectively. ${x}_{t}$ represents the input at time t, ${h}_{t}$ denotes the hidden state, and ${c}_{t}$ signifies the cell state. The symbols ⊗ and ⊕ denote element-wise multiplication and addition, respectively.

**Figure 8.**

**QNN architecture.** This architecture involves several steps. First, classical data are normalized as a preprocessing step, which is necessary for the subsequent encoding strategy. Then, the amplitude encoding algorithm is applied, using only ${\log }_{2}N$ qubits, where N corresponds to the number of independent variables or the space dimension in the case of RL. This process generates the corresponding quantum state, which is then fed into the ansatz circuit. Finally, the resulting state vector of this quantum system undergoes post-processing through a classical linear layer, which transforms the dimension ${2}^{n}$, where n is the number of qubits, into the expected output dimension.
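
The amplitude encoding step can be sketched as follows (a minimal illustration of the idea only: the input is zero-padded to the next power of two and normalized so its entries form valid quantum amplitudes):

```python
import numpy as np

def amplitude_encode(x):
    """Map a classical vector of length N onto the amplitudes of
    ceil(log2 N) qubits: pad with zeros to a power of two, then
    normalize to unit length so the result is a valid quantum state."""
    n_qubits = max(1, int(np.ceil(np.log2(len(x)))))
    padded = np.zeros(2 ** n_qubits)
    padded[: len(x)] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return padded / norm, n_qubits

# 3 features fit in 2 qubits (4 amplitudes)
state, n = amplitude_encode(np.array([3.0, 0.0, 4.0]))
```

This is why the post-processing layer in the caption maps a ${2}^{n}$-dimensional state vector back to the output dimension: the quantum state always lives in a power-of-two space.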

**Figure 9.**

**QSNN architecture.** This architecture involves an encoding step in which classical data are translated into spikes. The network comprises two LIFs orchestrated by VQCs and is trained using gradient descent. Various encoding strategies can be utilized, including the highest firing rate or firing first, among others.

**Figure 10.**

**QLSTM cell.** Each VQC box follows the structure outlined in Figure 7. The $\sigma $ and $tanh$ blocks denote the sigmoid and the hyperbolic tangent activation functions, respectively. ${x}_{t}$ represents the input at time t, ${h}_{t}$ denotes the hidden state, and ${c}_{t}$ signifies the cell state. Symbols ⊗ and ⊕ signify element-wise multiplication and addition, respectively.

**Figure 13.** Boxplots depicting the distribution of average total reward achieved by the quantum models (QNN, QSNN, QLSTM, and QSNN–QLSTM) and the classical models (MLP, SNN, and LSTM).

**Figure 14.** Learning curves obtained from seven models. Three classical models: MLP, SNN, and LSTM. Four quantum models: QNN, QSNN, QLSTM, and the novel brain-inspired QSNN–QLSTM model.

**Table 1.** The observation space consists of 14 variables, with an additional 3 empty variables reserved for specific problem requirements if necessary.

Name | Units |
---|---|
Site-Outdoor-Air-DryBulb-Temperature | °C |
Site-Outdoor-Air-Relative-Humidity | % |
Site-Wind-Speed | m/s |
Site-Wind-Direction | degrees from north |
Site-Diffuse-Solar-Radiation-Rate-per-Area | W/m^{2} |
Site-Direct-Solar-Radiation-Rate-per-Area | W/m^{2} |
Zone-Thermostat-Heating-Setpoint-Temperature | °C |
Zone-Thermostat-Cooling-Setpoint-Temperature | °C |
Zone-Air-Temperature | °C |
Zone-Air-Relative-Humidity | % |
Zone-People-Occupant-Count | count |
Environmental-Impact-Total-CO_{2}-Emissions-Carbon-Equivalent-Mass | kg |
Facility-Total-HVAC-Electricity-Demand-Rate | W |
Total-Electricity-HVAC | W |

**Table 2.** Discrete actions and their corresponding heating and cooling target temperatures (°C).

Name | Heating Target Temperature | Cooling Target Temperature |
---|---|---|
0 | 13 | 37 |
1 | 14 | 34 |
2 | 15 | 32 |
3 | 16 | 30 |
4 | 17 | 30 |
5 | 18 | 30 |
6 | 19 | 27 |
7 | 20 | 26 |
8 | 21 | 25 |
9 | 21 | 24 |

**Table 3.** Configuration of classical models: hyperparameters and settings for the multilayer perceptron, spiking neural network, and long short-term memory models, respectively.

 | MLP | SNN | LSTM |
---|---|---|---|
Optimizer | Adam (lr = ${10}^{-4}$) | Adam (lr = ${10}^{-3}$) | Adam (lr = ${10}^{-3}$) |
Batch Size | 32 | 16 | 32 |
BetaEntropy | 0.01 | 0.01 | 0.01 |
Discount Factor | 0.98 | 0.98 | 0.98 |
Steps | - | 15 | - |
Hidden | - | 15 | 25 |
Actor Layers | Linear[17, 450], ReLU, Linear[450, 10] | Linear[17, 15], Lif1, Lif2, Linear[15, 10] | LSTM(17, 25, layers = 5), Linear[25, 10] |
Critic Layers | Linear[17, 450], ReLU, Linear[450, 1] | Linear[17, 15], Lif1, Lif2, Linear[15, 1] | LSTM(17, 25, layers = 5), Linear[25, 1] |

**Table 4.** Configuration of quantum models: hyperparameters and settings for the quantum neural network, quantum spiking neural network, quantum long short-term memory, and the novel model composed of the combination of the last two networks.

 | QNN | QSNN | QLSTM | QSNN–QLSTM |
---|---|---|---|---|
Optimizer | Adam (lr = ${10}^{-2}$) | Adam (lr = ${10}^{-3}$) | Adam (lr = ${10}^{-3}$) | Adam (QSNN.parameters, lr = ${10}^{-2}$), Adam (QLSTM.parameters, lr = ${10}^{-2}$) |
Batch Size | 128 | 16 | 128 | 128 |
Pre-processing | Normalization | Normalization | - | Normalization |
Post-processing | - | - | - | - |
BetaEntropy | 0.01 | 0.01 | 0.01 | 0.01 |
Discount Factor | 0.98 | 0.98 | 0.98 | 0.98 |
Steps | - | 15 | 15 | 15 |
Hidden | - | 15 | 25 | 15 (QSNN), 125 (QLSTM) |
Actor Layers | 5 QNN, ReLU, Linear[${2}^{N}$, 10] | Linear[17, 15], 15 QSNN, Linear[15, 10] | Linear[17, 42], Linear[42, 5], 4 VQCs, Linear[25, 5], Linear[5, 25], Linear[25, 10] | 5 QSNN, 5 QLSTM |
Critic Layers | 5 QNN, ReLU, Linear[${2}^{N}$, 1] | Linear[17, 15], 15 QSNN, Linear[15, 1] | Linear[17, 42], Linear[42, 5], 4 VQCs, Linear[25, 5], Linear[5, 25], Linear[25, 1] | 5 QSNN, 5 QLSTM |
Qubits | 5 | 5 | 5 | 5 |
Encoding Strategy | Amplitude Encoding | Amplitude Encoding | Angle Encoding | Amplitude Encoding |

**Table 5.** Results obtained by classical and quantum models. Column 1: classical models followed by their quantum versions and the novel quantum model proposed; Column 2: average total reward post-training; Column 3: best total reward post-training; Column 4: worst total reward post-training; Column 5: total reward (evaluation with deterministic policy); Column 6: computational time in seconds.

Model | Average Tot. Reward | Best Tot. Reward | Worst Tot. Reward | Test Reward | Time (s) |
---|---|---|---|---|---|
MLP | −13.75 | −9.67 | −14.88 | −13.25 | 297.8 |
SNN | −11.05 | −9.79 | −12.15 | −13.31 | 307.6 |
LSTM | −10.56 | −9.43 | −12.11 | −12.67 | 302.8 |
QNN | −10.92 | −9.90 | −11.75 | −12.87 | 326.1 |
QSNN | −11.12 | −9.89 | −12.39 | −13.32 | 467.4 |
QLSTM | −10.72 | −10.14 | −11.14 | −12.19 | 335.42 |
QSNN–QLSTM | −9.40 | −8.26 | −12.73 | −11.83 | 962.5 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Andrés, E.; Cuéllar, M.P.; Navarro, G.
Brain-Inspired Agents for Quantum Reinforcement Learning. *Mathematics* **2024**, *12*, 1230.
https://doi.org/10.3390/math12081230
