
Gradient-Free Neural Network Training via Synaptic-Level Reinforcement Learning

Aman Bhargava, Mohammad R. Rezaei and Milad Lankarany
Division of Engineering Science, University of Toronto, Toronto, ON M5S 2E4, Canada
Division of Clinical and Computational Neuroscience, Krembil Brain Institute, University Health Network, Toronto, ON M5G 1L7, Canada
Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S 2E4, Canada
KITE, Toronto Rehabilitation Institute, University Health Network, Toronto, ON M5G 1L7, Canada
Department of Physiology, University of Toronto, Toronto, ON M5S 2E4, Canada
Author to whom correspondence should be addressed.
AppliedMath 2022, 2(2), 185-195;
Submission received: 27 December 2021 / Accepted: 31 March 2022 / Published: 12 April 2022


An ongoing challenge in neural information processing is the following question: how do neurons adjust their connectivity to improve network-level task performance over time (i.e., actualize learning)? It is widely believed that there is a consistent, synaptic-level learning mechanism in specific brain regions, such as the basal ganglia, that actualizes learning. However, the exact nature of this mechanism remains unclear. Here, we investigate the use of universal synaptic-level algorithms in training connectionist models. Specifically, we propose an algorithm based on reinforcement learning (RL) to generate and apply a simple, biologically inspired synaptic-level learning policy for neural networks. In this algorithm, the action space for each synapse in the network consists of a small increase, decrease, or null action on the connection strength. To test our algorithm, we applied it to a multilayer perceptron (MLP) neural network model. This algorithm yields a static synaptic learning policy that enables the simultaneous training of over 20,000 parameters (i.e., synapses) and consistent learning convergence when applied to simulated decision boundary matching and optical character recognition tasks. The trained networks yield character-recognition performance comparable to identically shaped networks trained with gradient descent. The approach has two significant advantages in comparison to traditional gradient-descent-based optimization methods. First, the robustness of our novel method and its lack of reliance on gradient computations open the door to new techniques for training difficult-to-differentiate artificial neural networks, such as spiking neural networks (SNNs) and recurrent neural networks (RNNs). Second, the method’s simplicity provides a unique opportunity for further development of local information-driven multiagent connectionist models for machine intelligence analogous to cellular automata.

1. Introduction

Understanding how biological neurons in the brain adjust their connectivity to actualize learning has major consequences for neuroscience and machine learning (ML) alike [1,2]. In neuroscience, it would improve the understanding of fundamental requirements for effective neural computation [3,4,5]. For ML, it would inform alternative neural network training methods to traditional gradient descent and backpropagation methods [6,7]. However, this problem has proven challenging when investigated via analysis and modelling of biological neuron behavior as well as via the generation of computational models of more abstracted connectionist systems [8,9,10,11]. Here, we briefly introduce efforts to date from biologically and computationally motivated lines of research on this problem.
Biologically motivated neuroscientific studies in this research area frequently focus on pathways between the cortex, basal ganglia, and thalamus (known as CBGT pathways) [12,13]. It is understood that this brain region is deeply involved with action selection in decision-making tasks, and that connectivity adjusts over time to maximize a dopaminergic reward signal [5,14,15]. The major unanswered question on the mechanism behind this phenomenon is: how is “credit” assigned to any one neuron or synapse in order to inform subsequent adjustments in connectivity [4]? Since the reward signal, action selection, and pathway connectivity changes are spatially and temporally distant from each other, researchers have struggled to find an effective, cohesive solution to this credit-assignment problem. Studies are further complicated by the difficulty of simulating large networks of biologically realistic neuron models [16]. Over time, research in this area has broadly converged on the hypothesis that some consistent synaptic-level RL algorithm is employed in these brain regions to actualize learning [4,5,15,17].
Computationally-motivated studies on abstracted connectionist models in this area apply RL or related techniques to train neural network models [8,9,10,11]. Within these studies, neurons are generally framed as RL agents in a partially observable Markov decision process (POMDP) with a reward signal based on network performance on a task. The works differ in terms of their formulation of the agents’ state spaces, and reward schemas [8] to incentivize biological realism. However, key issues were observed in the art that either place the “weight” of the learning task’s challenge on existing non-biologically feasible techniques and/or misformulate each neuron’s action space. For example, ref. [8] models each neuron as a deep Q-learning neural network optimized using gradient descent (policy gradient) methods [18,19]. Both [8,10] also frame the action space for each neuron as “fire” or “not fire” rather than as a set of actions to alter synaptic connectivity, an approach that is unsupported by the neuroscientific and modern machine learning literature. Meanwhile, [8,9,10,11] all train separate policies for each neuron/weight rather than training the same policy to be applied by all synapses, as suggested by the neuroscientific literature discussed above [4,5,14,15]. In general, moderate success in network training was achieved in these computational works.
We propose an algorithm wherein a universal synaptic-level reinforcement learning policy is trained to optimize a network-level reward signal. The same policy and reward signal are applied at each synapse, and each synapse’s state consists of locally available information (in this case, previous actions and rewards). The key design changes introduced in this work are as follows:
  • Frame the fundamental RL agent as the synapse rather than the neuron.
  • Train and apply the same synaptic RL policy on all synapses.
  • Set the action space for each synapse to consist of a small increment, a small decrement, and null action on the synapse weight.
  • Represent synapse state as the last n synapse actions and rewards.
  • Use a universal binary reward at each time step representing whether MLP training loss increased or decreased between the two-most-recent iterations.
The biological feasibility of the learning rule is increased through this formulation. Regarding the first and third changes listed above, there is a large body of neuroscientific evidence to suggest that changes in synaptic connectivity govern learning [5,15,20,21], not a neuron-specific “fire/not fire” policy. While the literature suggests a larger and potentially more complicated synapse state space [4], reward signals and previous adjustments are within the realm of biological feasibility as they rely on local information and a very short memory buffer [22,23]. This formulation also makes the problem computationally and statistically feasible, particularly for large networks. Training and applying the same synaptic policy for all synapses provides more data to inform the single policy, which leads to a higher chance of convergence and a lower chance of overfitting for a given policy form [19,24]. Choosing the synapse as the fundamental reinforcement learning agent also simplifies the action space as biological neurons can have thousands of synapses [25], and MLPs trained on standard tasks such as optical character recognition can have an arbitrarily high number of synapses per neuron depending on the hidden layer(s) size. A naïve approach for updating neuron connectivity would yield an action space for each neuron with dimensionality on the same order as the layer size.
We find that this simple formulation enables surprisingly effective MLP training, particularly when a static (learned) synaptic learning policy is applied. This static policy enables the simultaneous training of over 20,000 parameters and converges consistently for random decision-boundary matching and optical character-recognition (OCR) tasks on the notMNIST dataset [26]. The learned networks also produce character-recognition performance on par with identically shaped networks trained with gradient descent. Specifically, OCR MLPs with 0 and 32 hidden units were trained five times using both the proposed synaptic RL method and gradient descent. The 0 hidden-unit synaptic RL-trained MLPs yielded a mean final validation accuracy of 88.28 ± 0.41% while the 0 hidden-unit gradient descent-trained MLPs yielded 86.42 ± 0.22% final validation accuracy. The 32 hidden-unit synaptic RL-trained MLPs had a mean validation accuracy of 88.45 ± 0.6% while the 32 hidden-unit gradient descent-trained MLPs had 89.56 ± 0.52% mean validation accuracy.

2. Materials and Methods

We propose a reinforcement-learning method that can be applied to the generation of a synaptic-level policy for training MLPs. While MLPs pale in comparison to the complexity of biological neuronal networks, they offer an easily simulated abstract connectionist model that shares some high-level properties with biological neurons [27]. The form of the proposed synaptic-level learning algorithm is informed by several hypotheses about biological neurons. Since biological neurons are single cells with limited computational capacity [28], we hypothesize that each synapse in a biological neural network applies a relatively simple policy to adjust its connectivity over time. This policy takes into account information including global reward signals (e.g., dopaminergic signals) and the synapse’s own past changes in connectivity to inform subsequent changes in connectivity. We suspect that the policy applied by each synapse is roughly the same in a given brain region, and that differences in behaviour and connectivity arise from differences in local information, not from the application of an entirely different policy. This policy results in the maximization of a reward signal over time.
Given these hypotheses, we frame the problem as a POMDP with synapses as the fundamental reinforcement learning (RL) agents. In this work, we apply the temporal difference learning update equations to learn the Q-function for training the synapses of MLP neural networks. We test the applicability and generalizability of this method on simulated data and optical character recognition tasks.
Multilayer perceptron notation from [29] is followed in this work.

2.1. Synaptic Reinforcement Learning

Gradient-based approaches for inferring MLP parameters $w^{(i,j)}$ require propagation of error signals backward throughout the network and are not observed in biological neural networks [2]. We frame this problem as a multiagent RL problem in a POMDP as follows: Each synapse is treated as an RL agent that executes the same policy. This policy maps synapse state to actions (i.e., to alter the synapse weight). The temporal difference update equation is applied to deduce the Q-function such that total reward is maximized over time [19,24].
At discrete time steps, we define the following for the proposed synaptic RL model:
  • Actions: Each synapse can either increment, decrement, or maintain its value by some small synaptic learning rate $\alpha_s > 0$.
    $$a_t \in A = \{-\alpha_s, 0, +\alpha_s\}$$
    Weight update for a given synapse $k$ in neuron $i$ and layer $j$ is thus given as:
    $$w_k^{(i,j)} \leftarrow w_k^{(i,j)} + w_k^{(i,j)}.a_t$$
  • Reward: Reward is defined in terms of the training loss at the previous two time steps. Decreased loss is rewarded while increased or equivalent loss is penalized.
    $$R_t = \operatorname{sign}(L_{t-1} - L_t) = \begin{cases} +1 & \text{if } L_{t-1} > L_t \\ -1 & \text{if } L_{t-1} \le L_t \end{cases}$$
  • State: Each synapse $w_k^{(i,j)}$ “remembers” its previous actions and the previous two global network rewards.
    $$w_k^{(i,j)}.s_t = \{a_{t-1}, a_{t-2}, R_{t-1}, R_{t-2}, \dots\}$$
    In this work, the previous two action-and-reward pairs were used as the synapse state after empirical testing.
  • Policy: The Q function gives the expected total reward of a given state–action pair $(s_t, a)$ assuming that all future actions correspond to the highest Q-value for the given future state [24]. The epsilon-greedy synaptic policy $\pi(s_t)$ returns the action $a \in A$ with the highest Q-value with probability $(1 - \epsilon)$. Otherwise, a random action is returned [19]. The epsilon-greedy method was selected to add stochasticity to the system, a property that appears to benefit biological information processing [30,31].
    $$Q(s_t, a) = R_t + \gamma \max_{a'} Q(s_{t+1}, a')$$
    $$\pi(s_t) = \begin{cases} \arg\max_{a \in A} Q(s_t, a) & \text{with } \Pr = 1 - \epsilon \\ \text{random-uniform}(a \in A) & \text{with } \Pr = \epsilon \end{cases}$$
    where $\gamma \in [0, 1]$ is the discount for future reward and $\epsilon \in [0, 1]$ is the “exploration probability” of the policy. If the Q function is accurate, then $\pi(s_t)$ will return the optimal action $a_t^*$ subject to discount factor $\gamma$. Since the state and action spaces in this formulation have low dimensionality, the Q function (and by extension the policy $\pi$) can be implemented as lookup tables of finite size.
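As a rough illustration of the definitions above, the following Python sketch encodes the action set, the binary reward, the synapse state, and the epsilon-greedy policy as a lookup table. It is not the authors' Julia implementation. The names (`ACTIONS`, `reward`, `make_q_table`, `epsilon_greedy`), the zero initialization of the table, the tie-breaking rule, and the choice to store actions as indices inside the state are all assumptions made for illustration.

```python
import itertools
import random

ALPHA_S = 1e-4                           # synaptic learning rate (value quoted for the OCR experiments)
ACTIONS = (-ALPHA_S, 0.0, +ALPHA_S)      # decrement, null action, increment
ACTION_IDS = (0, 1, 2)                   # indices into ACTIONS (assumed encoding)
REWARDS = (-1, +1)                       # binary global reward

def reward(prev_loss, curr_loss):
    """R_t = +1 if the loss decreased, -1 if it increased or stayed the same."""
    return 1 if prev_loss > curr_loss else -1

# A state is (a_{t-1}, a_{t-2}, R_{t-1}, R_{t-2}): the last two actions and rewards.
STATES = list(itertools.product(ACTION_IDS, ACTION_IDS, REWARDS, REWARDS))

def make_q_table():
    """Lookup table over all (state, action) pairs, initialized to zero (assumption)."""
    return {(s, a): 0.0 for s in STATES for a in ACTION_IDS}

def epsilon_greedy(q, state, epsilon, rng):
    """Return a uniformly random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTION_IDS)
    # Ties are broken in favour of the lowest action index (an arbitrary choice).
    return max(ACTION_IDS, key=lambda a: q[(state, a)])
```

With two remembered actions and two remembered rewards, this encoding has 3 × 3 × 2 × 2 = 36 states and 36 × 3 = 108 table entries, which shows why a finite lookup table is sufficient here.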

2.2. Training

In this study, Q-value learning is generally a two-fold process where neural network parameters w are trained at the same time as the Q function and, by extension, policy π . After a policy π has been generated, it can also be applied statically. During policy training, the Q-values are updated using the following temporal difference learning update equation for Q-learning [19]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_q \left[ R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$
where $\alpha_q > 0$ is the Q-learning rate. The training pseudocode is detailed in Algorithm 1 with links to the source code for the experiments in Supplementary Materials.
Algorithm 1: Synaptic RL Training Algorithm.
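The training procedure described in this section can be sketched in Python as follows. This is one plausible reading of the text, not a transcription of Algorithm 1 or of the authors' Julia code. The toy linear model and its squared-error loss stand in for an MLP, and the function names, the initial synapse state, the tie-breaking rule, and every hyperparameter default are hypothetical. The sketch shows only the structure of joint policy and weight training (shared Q table, one global reward, one temporal difference update per synapse) and makes no claim about convergence.

```python
import random

N_ACTIONS = 3                    # 0: decrement, 1: null action, 2: increment
DIRECTION = (-1.0, 0.0, 1.0)     # multiplied by alpha_s to obtain the weight change

def mse(w, data):
    """Mean squared error of a linear model y ~ w . x (toy stand-in for an MLP loss)."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in data) / len(data)

def train(data, n_weights, iters=200, alpha_s=0.01, alpha_q=0.1,
          gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_weights
    q = {}                                   # shared Q table; missing entries count as 0
    states = [(1, 1, -1, -1)] * n_weights    # assumed initial (a1, a2, R1, R2) history
    prev_loss = mse(w, data)
    losses = [prev_loss]
    for _ in range(iters):
        # 1. Every synapse picks an action from the same epsilon-greedy policy.
        actions = []
        for k in range(n_weights):
            s = states[k]
            if rng.random() < epsilon:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda i: q.get((s, i), 0.0))
            actions.append(a)
            w[k] += DIRECTION[a] * alpha_s
        # 2. One global binary reward from the change in training loss.
        loss = mse(w, data)
        r = 1 if prev_loss > loss else -1
        # 3. Temporal difference update of the shared table, once per synapse,
        #    followed by shifting that synapse's action/reward history.
        for k in range(n_weights):
            s, a = states[k], actions[k]
            s_next = (a, s[0], r, s[2])
            best_next = max(q.get((s_next, i), 0.0) for i in range(N_ACTIONS))
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha_q * (r + gamma * best_next - old)
            states[k] = s_next
        prev_loss = loss
        losses.append(loss)
    return w, q, losses
```

Applying a policy statically, as described above, would correspond to skipping step 3 and reusing a previously learned `q` table unchanged.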

3. Results

To determine the efficacy of the proposed synaptic RL technique for MLP training, learning tasks of increasing complexity and dimensionality are employed, starting with random 2D decision-boundary matching. Once a static synaptic learning policy is trained on this task, it is reapplied to a different decision-boundary task as well as optical character recognition tasks with different network shapes and activation functions.
MLP binary classification matching on $\mathbb{R}^2$:
Nonlinear 2D decision boundaries are generated by randomly instantiating MLPs with two input neurons, 100 hidden units, and one output neuron. If the output neuron produces a value greater than zero, the point is classified in the positive class, and vice versa. Synapse weights for the “target” MLP are uniformly sampled in the range $[-1, 1]$ and bias terms are set to 0. Data vector values are uniformly sampled in the range $[-10, 10]$. 2000 data vectors are used in training. The goal of this matching task is to train another MLP of the same size to produce the same classifications as the “target” MLP on the random data vectors (i.e., match the decision boundary). Two experiments are highlighted in Figure 1 and Figure 2. Figure 1 shows results from simultaneous policy and network training while Figure 2 shows results from applying the policy learned in Figure 1 to a new decision-boundary-matching problem. All hyperparameters are consistent between the two experiments.
When the synaptic RL policy was trained simultaneously with the network, convergence reliably occurred within 10,000 iterations (Figure 1). When a static synaptic RL policy from an MLP matching task was reapplied, convergence occurred slightly more quickly and smoothly (Figure 2).
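The data-generation step of the matching task can be sketched as below. The layer sizes, weight range, zero biases, input range, and number of data vectors follow the description above. The hidden activation function is not stated for this task, so `tanh` is used purely as an assumption, and the helper names (`make_target_mlp`, `classify`, `make_dataset`, `agreement`) are hypothetical.

```python
import math
import random

def make_target_mlp(rng, n_in=2, n_hidden=100):
    """Random 'target' MLP: weights uniform in [-1, 1], all biases fixed at 0."""
    w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    return w1, w2

def classify(mlp, x):
    """Positive class (1) if the single output neuron is greater than zero, else 0."""
    w1, w2 = mlp
    # tanh is an assumed activation; the text does not specify one for this task.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    out = sum(w * h for w, h in zip(w2, hidden))
    return 1 if out > 0 else 0

def make_dataset(rng, mlp, n=2000):
    """Sample n points uniformly from [-10, 10]^2 and label them with the target MLP."""
    xs = [(rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0)) for _ in range(n)]
    ys = [classify(mlp, x) for x in xs]
    return xs, ys

def agreement(mlp_a, mlp_b, xs):
    """Fraction of points on which two MLPs produce the same classification."""
    return sum(classify(mlp_a, x) == classify(mlp_b, x) for x in xs) / len(xs)
```

A trained network that has matched the decision boundary would reach an `agreement` close to 1.0 with the target on the sampled points.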
Optical character recognition on the notMNIST dataset:
The notMNIST dataset [26] is composed of 10 classes of 28 × 28 greyscale images of typeset characters in a variety of fonts. There are 18,720 labeled images in the dataset, and one-hot encoding [29] is used for the training labels. Cross entropy loss [32] and ReLU activation functions [33] are used for all trials, and a 75–25% randomized train-validation split is used for each experiment. Networks with 0 and 32 hidden units are trained using both the proposed synaptic RL method and with gradient descent. Hyperparameters for both training methods are manually tuned. The static policy generated in Figure 1 is applied in all synaptic RL OCR trials. All experiments are run five times with validation accuracy statistics shown in Table 1. The estimated runtime for each individual experiment run, along with the number of trainable parameters, is reported in Table 1 as well. Experiments are run on an AMD Ryzen 9 5900X 4th-generation processor.
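For orientation, the split sizes and approximate network sizes implied by this setup can be worked out as follows. These figures are derived here from the stated image size, class count, and dataset size. They are not taken from Table 1, and whether bias terms are counted as trainable parameters is an assumption, so both variants are shown. The function name `mlp_param_count` is hypothetical.

```python
def mlp_param_count(layer_sizes, include_bias=True):
    """Number of weights (and optionally biases) in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out
        if include_bias:
            total += n_out
    return total

N_INPUT = 28 * 28        # one input per pixel of a 28 x 28 image
N_CLASSES = 10           # one-hot output
N_IMAGES = 18_720

n_train = N_IMAGES * 3 // 4      # 75% training split  -> 14040
n_val = N_IMAGES - n_train       # 25% validation split -> 4680

no_hidden = mlp_param_count([N_INPUT, N_CLASSES])          # 7850 (7840 without biases)
hidden_32 = mlp_param_count([N_INPUT, 32, N_CLASSES])      # 25450 (25408 without biases)
```

Either way of counting puts the 32 hidden-unit network above 20,000 parameters, in line with the figure quoted elsewhere in the paper.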
To improve runtime for synaptic-level reinforcement learning experiments, minibatching is used, wherein a random subset of the training data is reselected at regular intervals. Batches composed of 5000 training data samples are reselected every 5000 iterations for all synaptic-reinforcement-learning experiments. This approach is founded on the premise that prediction error on a sufficiently large subset of the training dataset likely reflects prediction error on the full set (as in stochastic gradient descent) [29]. The exploration probability is $\epsilon = 0.1$ and the synaptic learning rate is $\alpha_s = 0.0001$. For the 32 hidden-unit experiments, the learning rate $\alpha_s$ is reduced to 0.00005 (i.e., reduced by half) at iteration 60,000 to promote network convergence. The 32 hidden-unit networks are trained for 500,000 iterations and the 0 hidden-unit networks are trained for 200,000 iterations.
For the gradient-descent experiments, batch gradient descent is employed with learning rate $\alpha = 0.1$. No regularization, momentum, or mini-batching is used. The duration of training and the learning rate are tuned such that the models train to completion (i.e., plateauing validation loss and accuracy). A comparison of training and validation loss and accuracy plots from the 32 hidden-unit experiments can be seen in Figure 3.
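The batch reselection and learning-rate schedule described in this paragraph can be expressed as two small helpers. This is a sketch under assumptions: the function names are hypothetical, sampling without replacement is assumed, and the halving is assumed to take effect from iteration 60,000 onward (it applies only to the 32 hidden-unit experiments).

```python
import random

def synaptic_lr(iteration, base=1e-4, halve_at=60_000):
    """Synaptic learning rate alpha_s, halved from `halve_at` onward."""
    return base / 2 if iteration >= halve_at else base

def batch_indices(iteration, n_train, rng, current, batch_size=5000, every=5000):
    """Reselect a random training subset every `every` iterations, else keep the current one."""
    if current is None or iteration % every == 0:
        # Sampling without replacement is an assumption; the text only says "random subset".
        return rng.sample(range(n_train), batch_size)
    return current
```

Inside a training loop, `batch_indices` would be called once per iteration with the previously returned batch, so that the loss (and therefore the reward) is evaluated on the same 5000 samples until the next reselection.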
Taking the difference between mean validation accuracy for each method indicates that the 32 hidden-unit synaptic RL method has 1.11 ± 0.79% lower validation accuracy than the 32 hidden-unit MLPs trained using gradient descent. The 0 hidden-unit synaptic RL method has 1.86 ± 0.47% higher validation accuracy relative to the same size MLPs trained using gradient descent.
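The quoted differences can be checked with a few lines of arithmetic. The text does not state how the uncertainties of the differences were obtained. The sketch below assumes that the two uncertainties are combined in quadrature, which reproduces the reported values; the function name `diff_with_error` is hypothetical.

```python
import math

def diff_with_error(mean_a, err_a, mean_b, err_b):
    """Difference of two means with uncertainties combined in quadrature (assumed method)."""
    return mean_a - mean_b, math.sqrt(err_a ** 2 + err_b ** 2)

# 32 hidden units: gradient descent minus synaptic RL
d32, e32 = diff_with_error(89.56, 0.52, 88.45, 0.6)
# 0 hidden units: synaptic RL minus gradient descent
d0, e0 = diff_with_error(88.28, 0.41, 86.42, 0.22)

print(round(d32, 2), round(e32, 2))   # 1.11 0.79
print(round(d0, 2), round(e0, 2))     # 1.86 0.47
```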

4. Discussion

In this paper, we developed an algorithm to produce and apply a simple, effective synaptic-level learning policy for MLP training. Reapplying the static policy generated by training MLPs on a random decision-boundary-matching task resulted in networks reliably converging for previously unseen 2-dimensional decision boundaries and optical character recognition tasks. Optical character recognition validation accuracy was also comparable with, if not slightly higher than, neural networks trained using conventional gradient-descent learning methods (Table 1). Overall, the static synaptic learning policy was robust in the face of new activation functions, new tasks, and new network structures. The simplicity, effectiveness, and robustness of this approach for training MLPs support the hypothesis that simple, universal synaptic-level algorithms can enable learning in large connectionist models. The results from this investigation are particularly salient when compared to previous computational work in this area as the learned, static RL policy was more effective in general than the adaptive one.
Of note is the fact that the 0 hidden-unit experiments yielded superior validation accuracy (1.86 ± 0.47% higher) for the synaptic RL method as compared to gradient descent. We suspect that this may be due to the minibatching approach used for all synaptic RL experiments. Minibatching may have helped the synaptic RL method escape local optima to which batch gradient descent was susceptible, much as it does in the stochastic gradient-descent algorithm on which the approach is based [29]. Further work is warranted to understand this performance difference, including direct comparison of the synaptic RL algorithm with stochastic gradient-descent methods.
A key limitation of the algorithm is that the synaptic RL models required many iterations and relatively small synaptic learning rates to converge reliably. The overall time complexity for training MLPs on optical character-recognition tasks was higher for the proposed algorithm as compared to gradient descent methods. Since MLPs are easily differentiable, it is logical to rely upon techniques that take advantage of this valuable information to train parameters efficiently. Therefore, we do not propose this algorithm as a replacement for conventional training algorithms, but rather as a demonstration of the effectiveness of simple synaptic-level learning rules in connectionist models. As well, while this study took significant inspiration from neuroscientific literature, MLPs are incredibly simplistic models of the true behaviour of biological neurons [21], notably lacking significant time-dependence. Further research is required to determine the extent to which the results obtained are applicable to the credit-assignment problem in biological neurons.
Future directions may include investigations on the applicability of this method to other connectionist models such as recurrent neural networks (RNNs) [29,34], convolutional neural networks (CNNs) [35], and spiking neural networks (SNNs) [36]. It is conceivable that the policy generated in this paper could apply directly to RNNs since they are feedforward neural networks (MLPs) with additional edges that forward-propagate to subsequent time steps [37]. While this complicates back-propagation, the proposed method would remain the same as the only nonlocal information required in each synapse is the binary reward signal. It is of interest to determine the applicability of the static policy proposed in this work to CNNs as there is substantially more structure imposed on the application of kernels and these kernel “synapses” generally coexist with a set of fully connected terminal layers (MLP) [38,39,40]. Since 2D convolutions can be expressed as matrix multiplications with circulant matrices [41], it may be that the proposed algorithm would generalize to the CNN architecture. Applying these findings to SNNs may require significant changes as the large number of iterations required for MLP training would be difficult to achieve for compute-intensive SNNs [36]. The neuroscience literature would suggest that time-dependent local information features such as pre- and postsynaptic activation timing (as in spike-timing-dependent synaptic plasticity) would be a valuable addition to the state space of each synapse [22,42]. Applications in unsupervised learning (e.g., autoencoders) are also of strong interest.
Incorporation of additional information for use in the synaptic-reinforcement learning policy on MLPs may further enhance training speed and reliability. Additional testing with different datasets and a more extensive hyperparameter search would also aid in further understanding the broader applicability of the algorithm. Techniques such as momentum [29] typically applied to gradient descent optimization methods may also be applicable in the design of even more effective synaptic-reinforcement learning algorithms. In-depth comparison to works that attempt to create biologically-feasible approximation methods for gradient optimization would also help to shed light on the nature of the proposed algorithm and its similarities to existing methods [6,7].
The algorithm presented in this paper demonstrates the effectiveness of framing the training of MLPs as a multiagent, single-policy reinforcement-learning problem. While gradient-based methods are well established for MLPs, our approach removes the requirement for easy differentiation of the network and the loss function. For the future development of machine learning and machine intelligence systems, this is not an insignificant constraint. After all, the primary example of “true intelligence” available to us is the brain—a structure that does not appear to be driven directly by backpropagation-like gradient optimization methods. It is therefore highly valuable to investigate alternative optimization methods for connectionist models.
Furthermore, much of the prior computational work on framing neural network training as a multiagent reinforcement-learning problem struggled with computational efficiency [8,9,10,11]. Neuron connectivity and policy updates take a long time and are difficult to run at large scale on modern hardware, especially in methods that train a different policy for each neuron. The proposed algorithm offers a low-memory, low-complexity single-policy synaptic learning method that shows end performance comparable to gradient-based methods in training MLPs.
The synapse policy generated and implemented in this work bears resemblance to the cellular automata studied extensively by Wolfram, Conway, Dennett, and others [43,44,45]. One of the major results in Wolfram’s A New Kind of Science is the fact that complex, irreducible behaviour can result from the application of simple rules [43]. In addition to being complex, the behavior exhibited by the static MLP synapse model in this study is effective in actualizing a high-level goal (MLP training), agnostic of network topology, activation function, and task. While the particular task of MLP training may be “reducible” in the sense that gradient computations can enable estimation of near-optimal MLP parameters quickly, Wolfram’s work suggests that other potentially useful models for machine intelligence may not have such a “short cut”.

Supplementary Materials

Julia implementations of this synaptic RL algorithm and the following experiments can be found at (accessed on 26 December 2021).

Author Contributions

Conceptualization, methodology, software, writing—original draft A.B.; supervision, writing—review and editing, M.R.R. and M.L. All authors have read and agreed to the published version of the manuscript.


Funding

This project was partially supported by Lankarany’s NSERC Discovery Grant (RGPIN-2020-05868) and MITACS (IT19240).

Data Availability Statement

The notMNIST dataset can be found here: (accessed on 22 May 2021).


Acknowledgments

We thank Emre R. Alca for the use of his computing infrastructure to run the final experiments in this work.

Conflicts of Interest

The authors declare no conflict of interest.


References

  1. Adolphs, R. The unsolved problems of neuroscience. Trends Cogn. Sci. 2015, 19, 173–175.
  2. Whittington, J.C.; Bogacz, R. Theories of Error Back-Propagation in the Brain. Trends Cogn. Sci. 2019, 23, 235–250.
  3. Su, L.; Gomez, R.; Bowman, H. Analysing neurobiological models using communicating automata. Form. Asp. Comput. 2014, 26, 1169–1204.
  4. Rubin, J.E.; Vich, C.; Clapp, M.; Noneman, K.; Verstynen, T. The credit assignment problem in cortico-basal ganglia-thalamic networks: A review, a problem and a possible solution. Eur. J. Neurosci. 2021, 53, 2234–2253.
  5. Mink, J.W. The basal ganglia: Focused selection and inhibition of competing motor programs. Prog. Neurobiol. 1996, 50, 381–425.
  6. Lillicrap, T.P.; Cownden, D.; Tweed, D.B.; Akerman, C.J. Random synaptic feedback weights support error backpropagation for deep learning. Nat. Commun. 2016, 7, 13276.
  7. Lansdell, B.J.; Prakash, P.R.; Kording, K.P. Learning to solve the credit assignment problem. arXiv 2019, arXiv:1906.00889.
  8. Ott, J. Giving Up Control: Neurons as Reinforcement Learning Agents. arXiv 2020, arXiv:2003.11642.
  9. Wang, Z.; Cai, M. Reinforcement Learning applied to Single Neuron. arXiv 2015, arXiv:1505.04150.
  10. Chalk, M.; Tkacik, G.; Marre, O. Training and inferring neural network function with multi-agent reinforcement learning. bioRxiv 2020, 598086.
  11. Ohsawa, S.; Akuzawa, K.; Matsushima, T.; Bezerra, G.; Iwasawa, Y.; Kajino, H.; Takenaka, S.; Matsuo, Y. Neuron as an Agent. In Proceedings of the ICLR 2018: International Conference on Learning Representations 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
  12. Gold, J.I.; Shadlen, M.N. The Neural Basis of Decision Making. Annu. Rev. Neurosci. 2007, 30, 535–574.
  13. Parent, A.; Hazrati, L.N. Functional anatomy of the basal ganglia. I. The cortico-basal ganglia-thalamo-cortical loop. Brain Res. Rev. 1995, 20, 91–127.
  14. Daw, N.D.; O’Doherty, J.P.; Dayan, P.; Seymour, B.; Dolan, R.J. Cortical substrates for exploratory decisions in humans. Nature 2006, 441, 876–879.
  15. Lee, A.M.; Tai, L.H.; Zador, A.; Wilbrecht, L. Between the primate and ‘reptilian’ brain: Rodent models demonstrate the role of corticostriatal circuits in decision making. Neuroscience 2015, 296, 66–74.
  16. D’Angelo, E.; Solinas, S.; Garrido, J.; Casellato, C.; Pedrocchi, A.; Mapelli, J.; Gandolfi, D.; Prestori, F. Realistic modeling of neurons and networks: Towards brain simulation. Funct. Neurol. 2013, 28, 153.
  17. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
  18. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540.
  19. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  20. Citri, A.; Malenka, R.C. Synaptic plasticity: Multiple forms, functions, and mechanisms. Neuropsychopharmacology 2008, 33, 18–41.
  21. Eluyode, O.; Akomolafe, D.T. Comparative study of biological and artificial neural networks. Eur. J. Appl. Eng. Sci. Res. 2013, 2, 36–46.
  22. Caporale, N.; Dan, Y. Spike timing–dependent plasticity: A Hebbian learning rule. Annu. Rev. Neurosci. 2008, 31, 25–46.
  23. Bengio, Y.; Bengio, S.; Cloutier, J. Learning a Synaptic Learning Rule. Available online: (accessed on 1 March 2021).
  24. Russell, S. Artificial Intelligence: A Modern Approach; Prentice Hall: Upper Saddle River, NJ, USA, 2010.
  25. Hawkins, J.; Ahmad, S. Why Neurons Have Thousands of Synapses, a Theory of Sequence Memory in Neocortex. Front. Neural Circuits 2016, 10, 23.
  26. Bulatov, Y. notMNIST Dataset. 8 September 2011. Available online: (accessed on 15 February 2021).
  27. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408.
  28. McKenna, T.M.; Davis, J.L.; Zornetzer, S.F. Single Neuron Computation; Academic Press: Cambridge, MA, USA, 2014.
  29. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: (accessed on 15 February 2021).
  30. Kadmon, J.; Timcheck, J.; Ganguli, S. Predictive coding in balanced neural networks with noise, chaos and delays. Adv. Neural Inf. Process. Syst. 2020, 33, 16677–16688.
  31. Hunsberger, E.; Scott, M.; Eliasmith, C. The competing benefits of noise and heterogeneity in neural coding. Neural Comput. 2014, 26, 1600–1623.
  32. Abu-Mostafa, Y.S.; Magdon-Ismail, M.; Lin, H.T. Learning from Data; AMLBook: New York, NY, USA, 2012; Volume 4.
  33. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 2020, 48, 1875–1897.
  34. Rezaei, M.R.; Gillespie, A.K.; Guidera, J.A.; Nazari, B.; Sadri, S.; Frank, L.M.; Eden, U.T.; Yousefi, A. A comparison study of point-process filter and deep learning performance in estimating rat position using an ensemble of place cells. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 4732–4735.
  35. Aloysius, N.; Geetha, M. A review on deep convolutional neural networks. In Proceedings of the 2017 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 6–8 April 2017; pp. 0588–0592.
  36. Pfeiffer, M.; Pfeil, T. Deep learning with spiking neurons: Opportunities and challenges. Front. Neurosci. 2018, 12, 774.
  37. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A critical review of recurrent neural networks for sequence learning. arXiv 2015, arXiv:1506.00019.
  38. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449.
  39. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  40. Hasanzadeh, N.; Rezaei, M.; Faraz, S.; Popovic, M.R.; Lankarany, M. Necessary conditions for reliable propagation of slowly time-varying firing rate. Front. Comput. Neurosci. 2020, 14, 64. [Google Scholar] [CrossRef]
  41. Ding, C.; Liao, S.; Wang, Y.; Li, Z.; Liu, N.; Zhuo, Y.; Wang, C.; Qian, X.; Bai, Y.; Yuan, G.; et al. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, 14–18 October 2017; pp. 395–408. [Google Scholar]
  42. Rezaei, M.R.; Popovic, M.R.; Lankarany, M. A Time-Varying Information Measure for Tracking Dynamics of Neural Codes in a Neural Ensemble. Entropy 2020, 22, 880. [Google Scholar] [CrossRef] [PubMed]
  43. Wolfram, S. A New Kind of Science; Wolfram Media: Champaign, IL, USA, 2002. [Google Scholar]
  44. Conway, J. The game of life. Sci. Am. 1970, 223, 4. [Google Scholar]
  45. Dennett, D.C. Real Patterns. J. Philos. 1991, 88, 27–51. [Google Scholar] [CrossRef]
Figure 1. Adaptive policy applied to the 2D decision-boundary-matching task. The ground-truth decision boundary is shown in (A), while the learned decision boundary is in (B). Loss and accuracy scores per training iteration are shown in (C,D), respectively. Exploration probability ϵ is set to 25%, Q-learning rate α_q = 0.01, synaptic learning rate α_s = 0.001, and future reward discount γ = 0.9. tanh activation functions were used in both the trained and target networks. Final accuracy is 95.9%, while final mean Euclidean loss across the dataset is 0.004023.
Figure 2. Static synaptic policy applied to the 2D decision-boundary-matching task. The ground-truth decision boundary is shown in (A), while the learned decision boundary is in (B). Loss and accuracy scores per training iteration are shown in (C,D), respectively. Exploration probability ϵ is set to 25%, synaptic learning rate α_s = 0.001, and future reward discount γ = 0.9. tanh activation functions were used in both the trained and target networks. Final accuracy is 97.7%, while final mean Euclidean loss across the dataset is 0.003411.
Figure 3. Training curves from single trials from the 32 hidden-unit synaptic RL and gradient descent experiments.
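The hyperparameters named in the Figure 1 caption (ϵ, α_q, α_s, γ) can be read against a generic one-step tabular Q-learning loop. The sketch below is only an illustration of how such quantities typically interact, not the authors' implementation: the three-action set follows the abstract, while the state label, the reward value, and the function names (`choose_action`, `apply_action`, `q_update`) are hypothetical placeholders.

```python
import random

ACTIONS = (-1, 0, 1)   # decrease, null, increase on a connection strength
EPSILON = 0.25         # exploration probability (Figure 1 caption)
ALPHA_Q = 0.01         # Q-learning rate (Figure 1 caption)
ALPHA_S = 0.001        # synaptic learning rate (Figure 1 caption)
GAMMA = 0.9            # future reward discount (Figure 1 caption)


def choose_action(q_table, state, rng):
    """Epsilon-greedy choice among the three synaptic actions."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


def apply_action(weight, action):
    """Nudge one connection strength by the synaptic learning rate."""
    return weight + ALPHA_S * action


def q_update(q_table, state, action, reward, next_state):
    """Generic one-step tabular Q-learning update (assumed form)."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA_Q * (reward + GAMMA * best_next - old)


rng = random.Random(0)
q_table = {}
state = "hypothetical-state"   # placeholder; the real state encoding is not shown here
action = choose_action(q_table, state, rng)
weight = apply_action(0.5, action)
q_update(q_table, state, action, reward=1.0, next_state=state)  # reward is made up
```

Under this reading, a static policy such as the one in Figure 2 would correspond to holding `q_table` fixed and skipping `q_update`, which would be consistent with the Figure 2 caption listing no Q-learning rate.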
Table 1. notMNIST OCR validation accuracy comparison.

| Experiment | Min | Max | Mean | Stdev | Est. Runtime | Params |
|---|---|---|---|---|---|---|
| Syn. RL 32 HU | 87.88% | 89.12% | 88.45% | 0.60% | 5.5 h | 25,450 |
| Grad. Desc. 32 HU | 88.57% | 90.00% | 89.56% | 0.52% | 20 min | 25,450 |
| Syn. RL 0 HU | 87.65% | 88.84% | 88.28% | 0.41% | 1.5 h | 7850 |
| Grad. Desc. 0 HU | 86.11% | 86.73% | 86.42% | 0.22% | 5 min | 7850 |
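The parameter counts in Table 1 can be checked with a short calculation. The sketch below assumes fully connected layers with one bias per unit, 28 × 28 = 784 input pixels, and 10 output classes. These dimensions are not stated in this section and are assumptions about the notMNIST setup, and `mlp_param_count` is a hypothetical helper name.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


N_INPUTS = 28 * 28   # assumed notMNIST image size
N_CLASSES = 10       # assumed number of letter classes

no_hidden = mlp_param_count([N_INPUTS, N_CLASSES])       # "0 HU" rows
hidden_32 = mlp_param_count([N_INPUTS, 32, N_CLASSES])   # "32 HU" rows
print(no_hidden, hidden_32)  # prints: 7850 25450
```

Under these assumptions the counts match the Params column: 784 × 10 + 10 = 7850 with no hidden layer, and (784 × 32 + 32) + (32 × 10 + 10) = 25,450 with 32 hidden units.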
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
