Exploring the Effects of Caputo Fractional Derivative in Spiking Neural Network Training

Fractional calculus is an emerging topic in artificial neural network training, especially when using gradient-based methods. This paper brings the idea of fractional derivatives to spiking neural network training using Caputo derivative-based gradient calculation. We focus on conducting an extensive investigation of performance improvements via a case study of small-scale networks using derivative orders in the unit interval. With particle swarm optimization we provide an example of handling the derivative order as an optimizable hyperparameter to find viable values for it. Using multiple benchmark datasets we empirically show that there is no single generally optimal derivative order; rather, this value is data-dependent. However, statistics show that a range of derivative orders can be determined where the Caputo derivative outperforms first-order gradient descent with high confidence. Improvements in convergence speed and training time are also examined and explained by the reformulation of the Caputo derivative-based training as an adaptive weight normalization technique.


Introduction
Neural networks have been a part of machine learning for decades now [1][2][3][4]. They evolved from implementing elementary logical gates, through solving everyday, monotonous tasks such as character recognition, to exceeding human performance in complex intellectual tasks such as chess or Go. However, these static artificial neural networks (ANNs) still operate in a digital manner, so they work well on time-multiplexed machines such as CPUs and GPUs. The next step in their evolution came with the first spiking neural networks (SNNs). SNNs are analog-valued, dynamic simulations of biological neurons based on more complex mathematical representations than ANNs. They transmit information between neurons via strictly timed spikes rather than real values; thus, they resemble the behavior of human brains and natural intelligence.
Along with their practical data encoding, the novelty of SNNs lies in their specific hardware architecture. In the last few years, several artificial neural arrays have been designed to work with various SNN models and training methods. When it comes to the energy efficiency of these neuromorphic processors, all designs report a huge improvement, with three or four orders of magnitude less energy consumed than a regular CPU or GPU. Even modern TPUs cannot achieve as low training and inference energies as specifically designed SNN processors; therefore, SNN technology favors energy-efficient applications [5][6][7]. Recently developed memristive approaches might widen this gap even more [8].
While ANN-to-SNN conversion-based training has been a hot topic in the last few years [9,10], direct SNN training should also be considered as an alternative. Direct SNN training is mostly based on local learning rules, which are native operations on a neuromorphic computing device. These are mostly variants of Spike-Time Dependent Plasticity (STDP) learning [11,12] and gradient learning. STDP-like learning rules are local in the sense that weight updates depend only on the activity of the connected pre- and postsynaptic neurons. Our main contributions are the following.
• We show that Caputo derivative-based training outperforms classic first-order derivative-based optimization in terms of categorization accuracy and convergence speed.
• Using particle swarm optimization [31] we search for near-optimal derivative orders for specific datasets and investigate if there is a generally suitable value for the derivative order which is viable for multiple datasets.
• We discuss the possible reformulation of Caputo-derivative-based learning as an adaptive weight normalization, which introduces a degree of sparsity to the network architecture.
With this case study, we wish to provide empirically suitable values for Caputo derivative orders for SNN training based on a large number of model evaluations. We also aim to encourage discussion and future research of more complex fractional-order derivative-based SNN architectures, which can achieve state-of-the-art performance due to a higher number of parameters and more complex information flow.
The structure of this paper is as follows. Section 2 introduces the Tempotron learning rule used in our case study and the mathematical background of the Caputo derivative. Section 3 presents the algorithms used for SNN simulation and Caputron, a novel Caputo derivative-based Tempotron optimizer. Section 4 contains various tests on standard benchmark datasets and summarizes our experiments and comparisons. Section 5 draws conclusions; possible future work is also presented there.

Fractional Derivatives in Tempotron Learning
The Tempotron SNN model was introduced in Nature by [17]. The model uses spiketime-based information representation. Rather than solving a differential equation, this SNN model has a kernel function, which is an ordinary function derived from neural dynamics. This kernel function takes the elapsed times from every presynaptic spike as arguments, to calculate the postsynaptic potentials (PSPs). The weighted sum of these gives the momentary membrane potential of the postsynaptic neuron. The kernel function is given by Equation (1).
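Equation (1) is the double-exponential kernel of the original Tempotron model, written here with the symbols defined below:

```latex
K(t - t_i) = V_0 \left( e^{-(t - t_i)/\tau_M} - e^{-(t - t_i)/\tau_S} \right), \qquad t \ge t_i ,
```

with K(t − t_i) = 0 for t < t_i.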
where K(.) is the kernel function, V 0 is the normalization coefficient, t is the current simulation time, t i is the presynaptic spiketime, τ M and τ S are the membrane and synaptic time constants, respectively. A representation of the kernel function with common time constants and unit normalization coefficient is shown in Figure 1. The postsynaptic neuron's potential is then calculated according to Equation (2).
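Summing the weighted kernel contributions of all presynaptic spikes gives Equation (2):

```latex
V_j(t) = \sum_i w_{ji} \sum_{t_i} K(t - t_i) + V_{\mathrm{rest}}
```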
where V j (t) denotes the membrane potential of postsynaptic neuron j, w ji is the synaptic weight between presynaptic neuron i and postsynaptic neuron j, t i is the series of spiketimes of the presynaptic neuron i, and V rest is the equilibrium potential of the neuron. Tempotron is trained in a classical gradient-based manner, based on the cost function's first order partial derivative with respect to the weights. A learning rate is also applied when calculating the weight updates. The main idea behind the Tempotron cost function is the following. If there was a desired spike, but the neuron was inactive, the error is the difference between the firing threshold and the maximal membrane potential. If there is an undesired neuron activation, the error is the voltage difference by which the maximal membrane potential exceeds the firing threshold. The cost function's definition is given by Equation (3).
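Written out by the cases described above, Equation (3) is (the case labels are ours):

```latex
E(V_j) =
\begin{cases}
V_{\mathrm{thr}} - V_j(t^{\max}_j), & \text{desired spike missing,} \\
V_j(t^{\max}_j) - V_{\mathrm{thr}}, & \text{undesired spike elicited,} \\
0, & \text{otherwise.}
\end{cases}
```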
The above mentioned derivative of this cost function and the weight update rules are as follows (Equations (4) and (5)).
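Consistent with the cases of the cost function, Equation (4) is (the sign convention is assumed to match the descriptions in the text):

```latex
\frac{\partial E}{\partial w_{ji}} =
\begin{cases}
-\sum_{t_i < t^{\max}_j} K(t^{\max}_j - t_i), & \text{desired spike missing,} \\
+\sum_{t_i < t^{\max}_j} K(t^{\max}_j - t_i), & \text{undesired spike elicited,} \\
0, & \text{otherwise.}
\end{cases}
```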
where the cases are the same as in Equation (3). t max j here is the simulation time at which the membrane potential of output neuron j had the highest value.
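The weight update of Equation (5) is then the gradient step scaled by the learning rate:

```latex
\Delta w_{ji} = -\lambda \, \frac{\partial E}{\partial w_{ji}}
```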
where ∆w ji is the weight change and λ is the learning rate. A schematic of the forward and backward pass of two neurons in the Tempotron model is depicted in Figure 2. Here T denotes the discrete simulation time steps, and T i and T j are the spike times of neurons i and j, respectively. Neurons form two layers: the presynaptic one is denoted by I, the postsynaptic layer is denoted by J, and the weight parameter of the synapse between neurons i and j is described by w ji . K(T − T i ) is the postsynaptic potential derived from presynaptic spike T i . V j (T max j ) is the maximal membrane potential of neuron j at time step T max j . E(V j ) is the Tempotron loss calculated from the output spikes, target spike times, and maximal membrane potentials. ∂E/∂w ji is the partial derivative of the loss with respect to the weight parameter. The weights are updated based on this derivative, the formulation of which depends on the derivative order, as it can be either 1 or a fractional Caputo derivative order between 0 and 1. Inspired by results in ANN research [25,26], this paper aims to investigate the effect of replacing the first-order partial derivative with a fractional-order Caputo derivative.

Caputo Derivative
In the following section, we introduce the specified fractional-order derivative we used for training our spiking neural network.
There have been several definitions of fractional-order derivatives, such as the Grünwald-Letnikov or the Riemann-Liouville definition. In this paper, we utilize the most promising one based on ANN studies: the fractional-order derivative by Caputo. The main mathematical reason for this choice is that differential equations with the Caputo derivative take the same initial values as those with the ordinary first-order derivative. Another useful advantage is that the Caputo derivative of a constant is always 0.
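This constant property is expressed by Equation (6):

```latex
{}^{C}_{c}D^{\alpha}_{t} F = 0 ,
```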
where F is a constant function and c D α t is the Caputo derivative operator of order α over the interval [c, t].
The Caputo fractional-order derivative of order α is defined in Equation (7).
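In its general form, with n − 1 < α < n, the definition reads:

```latex
{}^{C}_{c}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(n - \alpha)} \int_{c}^{t} \frac{f^{(n)}(\tau)}{(t - \tau)^{\alpha - n + 1}} \, d\tau
```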
where α is the fractional derivative order, n is the upper integer neighbor of α, Γ(.) is the Gamma function and function f (.) is differentiable on [c, t] interval. This Caputo derivative is valid on the same [c, t] interval. Simplifying Equation (7) is possible with the restriction of the derivative order. Equation (8) presents the situation where α is between 0 and 1.
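For 0 < α < 1 we have n = 1, so the definition reduces to Equation (8):

```latex
{}^{C}_{c}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(1 - \alpha)} \int_{c}^{t} \frac{f'(\tau)}{(t - \tau)^{\alpha}} \, d\tau
```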
It is important to emphasize that if α is −1, then according to the definition the Caputo derivative becomes the first integral; furthermore, if α is 1, then we obtain the first integer-order derivative of the original function. Since even with the simplified version of Caputo derivative calculation there is a wide range of α values to choose from, this paper aims to give an empirical overview of the most suitable values for the derivative order and to investigate the existence of data-independent optima based on experimental results.

Proposed Algorithm
In this section, the proposed algorithms for SNN training and evaluation are presented. First, the used SNN layers are described. These are spiking neurons simulated at discrete time steps only. Then the Caputron (Caputo derivative-based Tempotron) optimizer is introduced. After these, an artificial swarm intelligence-based hyperparameter-optimizer is presented, which we used for derivative order selection.

Proposed Spiking Neuron Model
As neuromorphic processors are not widely available, for SNN simulation we used a CPU-based architecture. This means using analog values for simulation is impossible. This implies the need for numerical differential equation solvers, to find the exact moment of threshold crossings. In this work, some simplifications are used, to eliminate the need for these computationally demanding operations, while maintaining relatively low error rates. The idea of these simplifications was experimentally validated in previous works [32,33].
The simulation (e.g., membrane potentials, PSPs, spikes) is evaluated at discrete, equidistant time steps only. Threshold value comparison and spike generation only happen at these steps.
Each input is presented to the network for several time steps constantly. Input layers take the role of spike generation in an integrate-and-fire way. After the preset number of time steps elapsed, the neural potentials are reset to their default values, while the weights are kept.
The information carried by spikes is interpreted in a first-to-spike manner; therefore, only one spike per neuron is allowed. Membrane potentials are saved for later evaluation and are not recalculated in the upcoming time steps.
In this paper, we only consider single-label classification problems. This means that the first spike in the output layer may halt the simulation with only a marginal increment in error rates, thus saving time and computational power.
For training, we built two shallow (two-layer) models, one for the UCI datasets, and one for the MNIST benchmark dataset. The output layers of these neural networks are the same Leaky Integrate-and-Fire Layers. For processing the inputs different input layers are used, Gaussian Receptive Field Layer for the UCI datasets and Integrate-and-Fire Layer for the MNIST dataset.

Leaky Integrate-and-Fire Layer
This layer consists of a column of leaky integrate-and-fire neurons. These neurons use the kernel function defined in the original Tempotron paper [17] which is presented in Section 2.1. The effect of each presynaptic spike, after a quick jump, slowly decreases over time, and the effects of presynaptic spikes are accumulated for each postsynaptic neuron. The former characteristic makes the layer "Leaky", and the latter implies the integrate-and-fire naming.
Membrane potentials are updated in a discretized manner similarly to Equation (2). The above-mentioned simplification of one spike per neuron makes it easier to detect spikes too. The direction of threshold crossing can only be rising, so only a simple comparison is needed with the threshold potential, to determine if a spike is generated or not. Equation (9) describes the base mechanisms of this process.
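With the one-spike-per-presynaptic-neuron simplification, the mechanism of Equation (9) can be sketched as the discretized potential together with the threshold test (a reconstruction consistent with the symbol definitions below; the published form may differ):

```latex
V_j(T) = V_{\mathrm{rest}} + \sum_i w_{ji} \, K(T - T_i), \qquad \text{spike if } V_j(T) \ge V_{\mathrm{thr}}
```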
where T is the discrete simulation time, which follows a series of time steps with equal differences ∆T, and T i is the time step at which the i-th presynaptic neuron elicits a spike.
As an enhancement we introduced a layer-wide winner-take-all mechanism for this layer, meaning that the first neuron to spike suppresses other neurons which will not be able to spike. Spikes in the same simulation time step are not differentiated, thus if no neuron fired before, multiple "first spikes" can occur. Winner-take-all is observed to enable neural networks to solve optimization problems much more efficiently [34]. This mechanism is implemented in a computationally beneficial way, meaning the first output spike will interrupt the simulation.
This layer is used as the output layer in the two-layer models.

Gaussian Receptive Field Layer
This layer consists of a column of at least two integrate-and-fire neurons, which are excited by positive Gaussian receptive fields over a real-valued input dimension. For every input dimension, the minimal and maximal input values are given for all expected inputs. The first two neurons' receptive field maxima are assigned to the minimum and maximum of the input dimension. The other receptive fields are arranged equally between the first two. Each receptive field has a deviation equal to the distance between two neighboring receptive field centers. Each receptive field is evaluated over an input value every time step, and the result is added to the membrane potential of the corresponding neuron. A representation of the receptive fields of a 4-neuron layer, scaled by 0.3, with input minimum −1 and input maximum 1 is pictured in Figure 3. Generated spike timings can be directly derived from the excitation of each receptive field; therefore, they are characteristic of a small interval of the input values. This layer is used for spike generation, receiving the inputs directly from every dimension of the real-valued input data. Of course, this is a lossy representation of these input dimensions; however, it is easy to control the resolution of this spike generation method via the number of neurons assigned to each dimension and the resolution of simulation time steps.
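The excitation computed by this layer for one input dimension can be sketched in NumPy (the function name and the default 0.3 scale are ours, the latter taken from the experiments below):

```python
import numpy as np

def gaussian_receptive_fields(x, n_neurons, x_min, x_max, scale=0.3):
    """Excitation of n_neurons Gaussian receptive fields for a scalar input x.

    The first and last centers sit on x_min and x_max, the rest are spaced
    equally in between; the deviation equals the neighboring center distance.
    """
    centers = np.linspace(x_min, x_max, n_neurons)   # receptive field maxima
    sigma = centers[1] - centers[0]                  # deviation = center spacing
    return scale * np.exp(-0.5 * ((x - centers) / sigma) ** 2)
```

Each time step, this vector is added to the membrane potentials of the corresponding neurons, so inputs close to a field's center make that neuron fire early.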

Integrate-and-Fire Layer
This layer consists of a column of integrate-and-fire neurons. These neurons are excited by constant input values, which are scaled down enough to create meaningful results over several time steps. These input values are added to the membrane potential every time step until the firing threshold is reached. This structure was used to generate spikes from serialized images in the authors' previous work with success [32].
In this paper, this layer is utilized in training on the MNIST [35] database, where the serialized pixel lightness values are fed to the network.
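A minimal sketch of this spike generation for one pixel (the function name is ours; the 1/1536 scaling is the one used for MNIST later in the paper):

```python
def if_spike_step(pixel_value, v_thr=1.0, scale=1.0 / 1536.0, max_steps=50):
    """First time step at which a constant-input integrate-and-fire neuron
    crosses v_thr; returns None if it never fires within max_steps."""
    v = 0.0
    drive = pixel_value * scale          # constant excitation per time step
    for step in range(1, max_steps + 1):
        v += drive                       # integrate the constant input
        if v >= v_thr:
            return step                  # brighter pixels fire earlier
    return None
```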

Caputron Learning
As described in Section 2, Caputo derivative can be utilized to calculate the partial derivative of the Tempotron cost function with respect to synaptic weights. To acquire the equation by which each weight could be updated using this novel Caputo-based Tempotron (Caputron) optimizer, first the Caputo derivative of the cost function should be considered over a restricted interval of the weight values [c, w ji ], and with a derivative order between 0 and 1, similarly to Equation (8).
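Applied to the cost function over the weight interval, Equation (10) reads:

```latex
{}^{C}_{c}D^{\alpha}_{w_{ji}} E(V_j) = \frac{1}{\Gamma(1 - \alpha)} \int_{c}^{w_{ji}} \frac{1}{(w_{ji} - s)^{\alpha}} \, \frac{\partial E}{\partial w_{ji}}(s) \, ds
```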
where notations are similar to Equation (8), but c is the minimum of weight values corresponding to the j-th postsynaptic neuron, w ji is the weight between the i-th presynaptic and j-th postsynaptic neuron. Here, V j = V j (w j0 , . . . , w jN ) is a function of weights (w j0 , . . . , w jN ) belonging to postsynaptic neuron j, where N is the neuron count of the presynaptic layer.
The first order partial derivative in Equation (10) is constant with respect to weight values, and is defined in Equation (4). With the simplifications and restrictions defined at the beginning of this section, the first order derivative of the cost function only has to take one spike into account, thus the sum has only one member.
After simplification, the following formula is acquired.
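Since the first-order derivative is constant over the integration interval, the integral evaluates in closed form, giving Equation (11):

```latex
{}^{C}_{c}D^{\alpha}_{w_{ji}} E(V_j) = \Psi_j \, K(t^{\max}_j - t_i) \, \frac{(w_{ji} - c)^{1 - \alpha}}{\Gamma(2 - \alpha)}
```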
where Ψ j is the sign correction coefficient derived from the cost value of postsynaptic neuron j. Ψ j has the same effect as the signs in Equation (4), it is negative when expected output spike did not occur, positive when an unexpected spike occurred, and zero in any other case. In conclusion the update rule for a single weight with Caputron learning, taking into account that each neuron can only spike once in this architecture, is defined in Equation (12).
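The single-weight Caputron update of Equation (12) is then (the sign is assumed to mirror the gradient step of Gradient Descent):

```latex
\Delta w_{ji} = -\lambda \, \Psi_j \, K(t^{\max}_j - t_i) \, \frac{(w_{ji} - c)^{1 - \alpha}}{\Gamma(2 - \alpha)}
```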
For efficient weight updates Equation (12) must be reformulated with matrix operations. Using the definition of a single weight update, some matrices and vectors should be constructed to obtain the final formula of mass weight updates.
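With the matrices defined below, the mass update of Equation (13) can be written as (the broadcasting vector 1_N of ones is our notation):

```latex
\Delta W = -\frac{\lambda}{\Gamma(2 - \alpha)} \left( \Psi \mathbf{1}_N^{\top} \right) \odot K^{\max} \odot \left( W - C \mathbf{1}_N^{\top} \right)^{\circ (1 - \alpha)}
```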
where N and M are the number of neurons in the pre- and postsynaptic layers, respectively. W is the M × N matrix of synaptic weights, Ψ is the M × 1 vector of sign correction coefficients for each postsynaptic neuron, and K max is the M × N matrix of kernel function values, where K max [j, i] is K(t max j − t i ). If no t i spike happened before t max j was reached, K max [j, i] is set to 0. Vector C contains the minimal weight values for every postsynaptic neuron (i.e., it consists of the minima of each row in matrix W); this vector is M × 1 in dimensions. Symbols ⊙ and (.) ∘ denote the element-wise multiplication and element-wise power operators, respectively. This definition of the Caputron optimizer can be used with shallow (two-layer) spiking neural networks, where information is represented in a time-to-first-spike or spike order-based manner. A weight update of Equation (13) is performed after each training step when the simulation stops. Maximal membrane potential and PSP values are stored and updated along with the simulation time, during the evaluation of each time step. The data processing diagram of these networks is depicted in Figure 4.
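A minimal NumPy sketch of this mass weight update (names are ours; the sign of Ψ is assumed to follow the conventions described for Equation (4), so the update carries a leading minus sign):

```python
import numpy as np
from math import gamma

def caputron_update(W, Psi, K_max, alpha, lr):
    """One Caputron weight update in matrix form.

    W     : (M, N) synaptic weights
    Psi   : (M,)   sign correction coefficients per postsynaptic neuron
    K_max : (M, N) kernel values K(t_max_j - t_i), 0 where no spike occurred
    """
    C = W.min(axis=1, keepdims=True)        # per-row minimal weights (vector C)
    norm = (W - C) ** (1.0 - alpha)         # element-wise power: adaptive normalization
    dW = -lr * Psi[:, None] * K_max * norm / gamma(2.0 - alpha)
    return W + dW
```

Setting α = 1 makes the normalization term 1 everywhere, so the update collapses to plain Gradient Descent, while for α < 1 the row-minimal weight receives no update at all.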

Particle Swarm Optimization of Derivative Order
As is visible from Section 3.2, the Caputron optimizer highly depends on the derivative order used for calculations. To observe the impact of this parameter's value on the accuracy of the whole model, a simple hyperparameter optimizer is utilized. As there are infinitely many values for this α parameter and each evaluation is costly, an intelligent optimizer should be used. Rather than using a grid or gradient-based algorithm, the authors propose to use a particle swarm optimizer (PSO) [31] here, while leaving other hyperparameters (like learning rate or network dimensions) unchanged.
Particle swarm optimization is a swarm method that uses a set of search points in the parameter space, often referred to as individuals or swarm members. These individuals start with a random velocity at a random or pre-assigned point in the parameter space. In each step, these individuals are evaluated, and their fitness value is stored. After evaluation, the velocity of each individual is changed. They accelerate towards their personal best and towards the global best of the whole swarm with random coefficients. Using this new velocity, a simple step is performed, adding each velocity vector to the last position of the corresponding individual. These steps are described by Equations (14) and (15), originally published in [31].
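These update rules are:

```latex
\begin{aligned}
v_i^{t+1} &= v_i^{t} + U(0, c_p)\,(p_i - x_i^{t}) + U(0, c_g)\,(g - x_i^{t}) \\
x_i^{t+1} &= x_i^{t} + v_i^{t+1}
\end{aligned}
```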
Here, x i and v i are the position and velocity of a swarm member in the search space. U(0, c p ) and U(0, c g ) are samples from uniform distributions whose upper limits are given by the two constants c p and c g , which correspond to the accelerations towards the personal best p i of the swarm member and the global best g. Upper indices denote discrete time steps t and t + 1.
PSO was successfully used for numerous parameter optimization problems in the last few decades, with recent achievements ranging from engineering applications, such as distributed generator integration of smart cities [36] to federated learning aggregation [37] and large-scale hyperparameter optimization of deep learning networks [38].
To observe the effect of the α parameter and seek its optimal value, this PSO optimizer has a one-dimensional search space bounded by a minimal and a maximal value. Individuals (swarm members) are bound between these values and cannot leave the predefined interval. To obtain the best fitting initialization, starting velocities are chosen randomly, and the population is equally distributed over the above-mentioned interval, including one initial search point on each boundary value. The pseudocode of this hyperparameter search is described by Algorithm 1.

Algorithm 1 PSO-based derivative order search
    Bound swarm members by α min and α max
 5: while member index < population size do    ▷ Evaluation
 6:     while trial index < maximum trials do
 7:         Reinitialize SNN with α derivative order
 8:         Train network for preset number of epochs
 9:         Save maximal test accuracy
10:     end while
11:     Fitness ← maximal test accuracy averaged over trials
12: end while
13: end while
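The search described above can be sketched in NumPy. This is a sketch under assumptions: the fitness callable stands in for the averaged test accuracy obtained by retraining the SNN, and the starting-velocity range is our choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_alpha_search(fitness, n=8, a_min=0.001, a_max=1.0,
                     c_p=0.3, c_g=0.2, generations=20):
    """One-dimensional PSO over the derivative order alpha."""
    x = np.linspace(a_min, a_max, n)        # population spread equally, bounds included
    v = rng.uniform(-0.05, 0.05, n)         # random starting velocities (assumed range)
    p, p_fit = x.copy(), np.array([fitness(a) for a in x])
    g = p[p_fit.argmax()]                   # global best position
    for _ in range(generations):
        v += rng.uniform(0, c_p, n) * (p - x) + rng.uniform(0, c_g, n) * (g - x)
        x = np.clip(x + v, a_min, a_max)    # members cannot leave the interval
        fit = np.array([fitness(a) for a in x])
        better = fit > p_fit
        p[better], p_fit[better] = x[better], fit[better]
        g = p[p_fit.argmax()]
    return g
```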

Experimental Results
Experiments were carried out on a desktop environment with regular multi-core CPUs (Intel i5-8300H@3.90 GHz) used for training. The networks were constructed in Python using NumPy [39] and SciPy [40] for mathematical operations, and Seaborn for visualization.
For evaluating model performances we used categorization accuracy. We compared the results on the test set. To make accuracy values stable and less initialization-dependent, when performing PSO-based derivative order parameter search we take averages of test accuracy values of multiple trainings with different initializations. For measuring convergence speed we take the number of epochs needed to reach the best performing model state during training. When performing PSO-based derivative order parameter search we average these values as well for trainings of the same derivative order.

UCI Dataset Results
Using the novel Caputron learning rule several tests were carried out. First, three UCI [41] datasets were examined, to compare Caputron optimizer and first order Gradient Descent. The datasets we used were the same as the ones used by [25] for ANN training. These datasets are Iris (4 input features, 3 classes, 150 samples), Liver (6 input features, 2 classes), and Sonar (60 input features, 2 classes, 208 samples), all three are categorization problems.
For finding the optimal derivative order values a 20 generations long particle swarm optimization [31] was carried out for each dataset separately. To cover the search space, initially, the swarm members were distributed equally between 0.001 and 1 along the derivative order axis. The swarm had 8 individuals in it. The coefficients for accelerating towards the global best were randomly drawn from a uniform distribution between 0 and 0.2, while coefficients for accelerating towards personal best results were drawn from a uniform distribution between 0 and 0.3. Each dataset was split into a train and test subset with 40% of test data. The learning rate was 0.05, for every evaluation.
A two-layer model was constructed with a Gaussian receptive field layer as an input layer and a leaky integrate-and-fire layer as an output layer with a weight matrix between them. The input layer had 15 neurons per dimension, while the output layer's neuron count was equal to the number of classes the dataset had. The Gaussian receptive fields were scaled down by 0.3. The threshold potential and normalization coefficient (V 0 ) was 1 for all neurons, while the resting potential was set to 0. The membrane and synaptic time constants were 15 ms and 5 ms, respectively.
Each α parameter evaluation of an individual returned a score of average best validation accuracy, where these best validation accuracies were accumulated from 10 different trainings over the same dataset, where the weight matrices were randomly reinitialized, drawn from a uniform distribution from 0 to 1. Each training lasted for 30 epochs from which the best one's validation accuracy was passed on for averaging. Each input was presented to the neural network for 50 equitemporal steps between 0 ms and 15 ms.
It is important to note that when setting exactly α = 1, the optimizer was changed to a simple, first integer-order derivative-based Gradient Descent optimizer, described in Section 2.1. Although Caputron with derivative order α → 1 should theoretically behave like classical Gradient Descent, to avoid any mid-computation imprecision the algorithm was changed to solely use the mathematical expression from Equation (5).
After the first optimizations over all three datasets, another one was carried out, to let the PSO work more on the interval with the best results. The second intention behind another hyperparameter optimization was, to see if there is a clear convergence to any given constants for all the datasets, or if the existence of this optimal derivative order is not that trivial. With only the interval lower boundary changed to 0.65, all the other parameters remained the same.
Results are presented in Figures 5-7 and Table 1. For each dataset, all derivative orders evaluated by the two PSO runs are plotted, with the average test accuracies. The average training epochs needed to reach Gradient Descent's accuracy for each evaluation during the second run are also presented as a third diagram. Alpha values whose average test accuracy was below Gradient Descent's test accuracy were excluded from this diagram. Horizontal dashed lines indicate the performance of Gradient Descent, while in the third subplot for each dataset shades of gray and the star marker refer to the best final test accuracy of the given training. Table 1 summarizes the best average accuracy values with the best derivative order parameters. As an indicator of improvement, the ratio of erroneously classified test cases was reduced by the percentage shown in the table when Caputron was used instead of Gradient Descent. Additionally, the percentage of training epochs taken by Caputron until reaching Gradient Descent's accuracy (with respect to epochs taken by Gradient Descent) is also presented.
For every dataset, Caputron was more accurate, and in most cases it provided faster convergence too, or at least there is a derivative order where the convergence was faster for the same test accuracy reached by the first order optimizer. It is clear that no overall optimal alpha value was found; it is most likely data-dependent. However, the derivative orders where Caputron outperforms Gradient Descent are mostly between 0.7 and 0.9 for all observed benchmark tests. This recommended interval is a safe parameter choice, as the second, refined PSO searches point out. For all three datasets together, 355 alpha values were observed in the [0.7, 0.9] interval; 94.6% of these observed derivative orders resulted in improved accuracy compared to first order Gradient Descent. Therefore it can be concluded that using this interval as a rule of thumb for derivative order selection is a safe choice based on our experimental results.
These results of our SNN experiments are similar to the ANN training results of [25]. Those ANN experiments found that the suitable range for Caputo derivative orders is between 6/9 and 8/9.
Although our shallow Caputron optimized model cannot compete with more complex, deep architectures, it outperforms SpikeProp-based brain-inspired SNN architectures on the Iris dataset [42] despite having only two layers compared to the 3-layer architecture of the compared SpikeProp models. The performance increases from 96.1% classification accuracy to 97.33% in the case of our approach.

MNIST Results
Caputron learning of a shallow SNN has also been tested on the MNIST [35] benchmark dataset of handwritten digits, which contains 60,000 images for training and 10,000 images for testing. These grayscale 8 bit images were scaled down by 1536, so the maximal input value was 1/6. PSO was not utilized here; instead, an equally spread population of 21 search points was used, with evaluation similar to that of the UCI datasets. The only difference is the number of reinitializations, which in this case is halved, using only 5 different random weight initializations to determine an average best epoch accuracy.
A two-layer model was constructed here as well with an integrate-and-fire and a leaky integrate-and-fire layer in this order. The input layer has one neuron for each pixel, resulting in an input size of 784, as MNIST images are 28 × 28 images. The output layer contains a neuron for every decimal digit, thus the output layer size is 10. Individual neuron parameters remain the same.
Results are presented in Figure 8 and Table 1. Similarly to the UCI datasets, Caputron performed better than Gradient Descent. The ideal interval of the derivative order remained the same, [0.7, 0.9]. Training times for the best derivative order 0.8 and the first order derivative can be compared, with an average training time of 2830 s for Caputron and 2910 s for Gradient Descent, resulting in a 2.8% drop in training time when using Caputron. As the first spikes with the winner-take-all mechanism introduced in Section 3.1.1 would halt the simulation, we also examined training times with this mechanism turned off. The results showed the same tendency, with Caputron taking 3588 s and Gradient Descent taking 3691 s on average; the same 2.8% drop in training time can be observed here. To investigate even further, we compared the ∆W calculation speed of both optimizers for the weight matrix used by MNIST. For a million randomized updates, Gradient Descent achieved an average weight update time of 20.85 µs, while Caputron took 198.33 µs to perform a full weight matrix update. Although Caputron is slower in weight updates, when it comes to network simulation Caputron regains its lead over Gradient Descent, with earlier fire times resulting in sparser and slightly faster computations.
When evaluating the MNIST database, information about the improvements of Caputo derivative-based ANN training is available. We observe a closely located peak of test accuracy at derivative order 0.8, whereas [25] reports 0.78 as their best candidate. Thus, it can be concluded that both ANN and SNN architectures have nearly the same ideal value for Caputo derivative-based training; while SNN test accuracies are lower than those achieved by their ANN counterparts, the error reduction rates are in a similar range, around 5-10% depending on network size.
Compared to the reimplementation of the original Tempotron [17] (denoted by Gradient Descent accuracy in Table 1), our model performed slightly better on the MNIST dataset, reaching 89.54% test accuracy over the original 88.96%, but this improvement still falls short of the performance of modern architectures. To reach state-of-the-art performance on MNIST, more complex deep neural networks have to be used with refined temporal simulation. Such models achieve 97.9% testing categorical accuracy on the digit dataset without convolution [43]. We expect improvements of the same magnitude over these results when applying fractional-order optimization; however, to the best of our knowledge, no research on this has been conducted yet.

Inherent Adaptive Weight Normalization
To this point, experiments indicate a significant performance increase without providing insights into the underlying causes. To address the question of why Caputron outperforms standard Gradient Descent, Equation (12) has to be reformulated. For this single spike per neuron scenario, substituting the sums from Equation (5) yields

∆w_ji^Caputron = ∆w_ji^GD · c(w_ji), c(w_ji) = (w_ji − w_i,min)^(1−α) / Γ(2 − α),

where ∆w_ji^GD is the weight change caused by standard Gradient Descent, and c(w_ji) is an adaptive weight normalization function of the difference between the weight and the minimal weight value corresponding to the same postsynaptic neuron.
The limit of this weight normalization coefficient as α → 1 is 1, thus indicating the theoretical equivalence of Gradient Descent and integer first-order Caputron learning. In this case, no adaptive weight normalization happens.
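Assuming the normalization coefficient takes the truncated-Caputo form c(w) = (w − w_min)^(1−α)/Γ(2−α) (the exact expression depends on Equations (5) and (12), which lie outside this section), a minimal sketch illustrates both the damping/amplification behavior and the α → 1 limit:

```python
from math import gamma

def norm_coeff(w, w_min, alpha):
    # assumed adaptive weight-normalization coefficient of Caputron
    return (w - w_min) ** (1.0 - alpha) / gamma(2.0 - alpha)

# for alpha < 1: damped near the minimal weight, amplified far from it
print(norm_coeff(0.1, 0.0, 0.8))  # < 1
print(norm_coeff(5.0, 0.0, 0.8))  # > 1

# as alpha -> 1 the coefficient tends to 1 for any w > w_min,
# recovering plain Gradient Descent
for a in (0.9, 0.99, 0.999):
    print(round(norm_coeff(5.0, 0.0, a), 3))
```

Note that at w = w_min the coefficient is exactly zero for α < 1, which is why the minimal weight is effectively frozen within a training step.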
The behavior of this weight normalization coefficient is visualized in Figure 9. Caputron multiplicatively reduces the value of weight changes near the minimal weight value, so the minimal weight of each training step is hardly ever changed, while the weight changes of much higher weights are multiplicatively increased. For α values other than 1 these functions have no upper bound; however, the shrinking loss changes mostly prevent the updates from exploding. The order of the fractional derivative controls the curvature of the weight normalization and sets the point where the coefficient reaches the value of 1. This effect makes it harder for inhibitory synapses to change, or eventually become excitatory, while excitatory synapses can converge in larger steps, thus reaching the needed amplification value if they remain excitatory, or turning into inhibitory synapses if their value is updated to be negative. This enforces a certain level of sparsity concerning inhibitory and excitatory synapses.
Since the information from the network is extracted in a "first-to-spike" manner, even when no negative weights (i.e., no inhibitory synapses) are present in the network, lower weight values are more stable. This results in the same effect of insignificant synapses changing slowly, while significant, excitatory connections are easier to change.

Conclusions
As the experiments in Section 4 have demonstrated, the use of the fractional-order Caputo derivative increases the performance of shallow Tempotron SNNs. The Caputo derivative was successfully applied to the weight updates of shallow Tempotron SNNs with one spike per neuron, creating the novel method of Caputron learning. Training models on different datasets indicates that the high-performance range of the derivative order is between 0.7 and 0.9 for the benchmark datasets observed in this paper. While the results are not comparable to those of the latest deep architectures, our aim of verifying the effectiveness of Caputron learning over Gradient Descent was achieved. Not only is Caputron learning better than first-order Gradient Descent at increasing classification accuracy, but faster convergence is also observed in most cases. Based on the results, we propose replacing classic first-order derivative-based Gradient Descent with the fractional-order Caputo derivative with an order in the range [0.7, 0.9], as it led to higher accuracy in approximately 95.3% of the examined test cases with a marginal increase, or even a decrease, in training time.
Handling α as an optimizable hyperparameter for Caputron training is also recommended, as the preferred derivative orders are data-dependent.
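Since the preferred derivative order is data-dependent, the particle swarm optimization used in this paper to tune α can be sketched as a minimal one-dimensional PSO. The swarm sizes, coefficients, and the toy objective below are hypothetical stand-ins; a real score function would train the network for a candidate α and return validation accuracy.

```python
import random

def pso_alpha(score_fn, n_particles=8, n_iters=20, seed=0):
    """Minimal 1-D particle swarm over the derivative order alpha in (0, 1).

    score_fn(alpha) is assumed to return a score to maximize, e.g.
    validation accuracy of a Caputron-trained network.
    """
    rng = random.Random(seed)
    pos = [rng.uniform(0.05, 0.95) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest, pbest_val = pos[:], [score_fn(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5            # standard PSO coefficients (assumption)
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            # keep alpha strictly inside the unit interval
            pos[i] = min(0.999, max(0.001, pos[i] + vel[i]))
            val = score_fn(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val

# toy stand-in objective peaking at alpha = 0.8
best, _ = pso_alpha(lambda a: -(a - 0.8) ** 2)
print(round(best, 2))
```

Because each score evaluation involves a full training run, a small swarm and few iterations, as sketched here, keep the hyperparameter search tractable.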
The problem of backpropagation, a wider range of possible α and learning rate values, the handling of multiple spikes per neuron, and learning method enhancements such as momentum, AdaDelta, or attention [44], which have proved successful in applied deep learning [45,46], are to be addressed in future work.
Although two-layer spiking neural architectures have been utilized in multiple real-world applications [18,19,32], we would like to promote the adaptation of fractional-order derivative-based optimization for more complex state-of-the-art architectures as well. These architectures require further theoretical investigation and a wide range of experiments. Such research on event-based, fractional derivative optimized SNN architectures is currently being conducted by the authors.

Data Availability Statement: Supporting code is available at: https://github.com/nata108/Caputron (accessed on 2 July 2022).

Conflicts of Interest:
The authors declare no conflict of interest.