Agreeing to Stop: Reliable Latency-Adaptive Decision Making via Ensembles of Spiking Neural Networks

Spiking neural networks (SNNs) are recurrent models that can leverage sparsity in input time series to efficiently carry out tasks such as classification. Additional efficiency gains can be obtained if decisions are taken as early as possible as a function of the complexity of the input time series. The decision on when to stop inference and produce a decision must rely on an estimate of the current accuracy of the decision. Prior work demonstrated the use of conformal prediction (CP) as a principled way to quantify uncertainty and support adaptive-latency decisions in SNNs. In this paper, we propose to enhance the uncertainty quantification capabilities of SNNs by implementing ensemble models for the purpose of improving the reliability of stopping decisions. Intuitively, an ensemble of multiple models can decide when to stop more reliably by selecting times at which most models agree that the current accuracy level is sufficient. The proposed method relies on different forms of information pooling from ensemble models and offers theoretical reliability guarantees. We specifically show that variational inference-based ensembles with p-variable pooling significantly reduce the average latency of state-of-the-art methods while maintaining reliability guarantees.


I. INTRODUCTION
Context: With the advent of large language models, sequence models are currently among the most studied machine learning techniques. Unlike methods based on conventional neural networks, such as transformers, spiking neural networks (SNNs) process time series with the prime objective of optimizing energy efficiency, particularly in the presence of sparse inputs [1][2][3]. The energy consumption of an SNN depends on the number of spikes generated internally by the constituent spiking neurons [4], and inference energy can be further reduced if decisions are taken as early as possible as a function of the complexity of the input time series [5].
The decision on when to stop inference and produce a decision must rely on an estimate of the current accuracy of the decision, as stopping too early may cause unacceptable drops in accuracy.
The delay-adaptive rule proposed in [5] uses the SNN's output confidence levels to estimate the true accuracy, while reference [6] determined the stopping time via a separate policy network.
SNN models, like their conventional neural network counterparts, tend to be poorly calibrated, producing overconfident decisions [7] (see also Fig. 1 in [8]). As a consequence, the schemes in [5,6] do not offer any reliability guarantee at the stopping time. To address this problem, recent work [8] demonstrated the use of conformal prediction (CP) [9][10][11][12] as a principled way to quantify uncertainty and support adaptive-latency decisions in SNNs.
In the SpikeCP method introduced in [8], the SNN produces set predictions consisting of a subset of the set of all possible outputs. For instance, given as input an electroencephalography (EEG) or electrocardiography (ECG) time series, a set predictor determines a set of plausible conditions that a doctor may need to test for. Accordingly, for many applications, set predictors provide actionable information, while also offering an inherent measure of uncertainty in the form of the size of the predicted set [9]. SpikeCP leverages the theoretical properties of CP to define reliable stopping rules based on the size of the predicted set.
Motivation: Predictive uncertainty can be decomposed into aleatoric uncertainty, which refers to the inherent randomness of the data-generation mechanism, and epistemic uncertainty, which arises due to the limited knowledge that can be extracted from a finite data set [13,14]. While aleatoric uncertainty is captured by individual machine learning models, like SNNs, epistemic uncertainty is typically accounted for by using ensembles of models. In particular, epistemic uncertainty is quantified by gauging the level of disagreement among the models in the ensemble [13,14]. By relying on conventional SNN models, SpikeCP does not attempt to quantify epistemic uncertainty, focusing only on aleatoric uncertainty quantification. The application of Bayesian learning and model ensembling as means to quantify epistemic uncertainty in SNNs was investigated in [15][16][17], showing improvements in standard calibration metrics.
In this paper, we propose to enhance the uncertainty quantification capabilities of SpikeCP by implementing ensemble SNN models for the purpose of improving the reliability of stopping decisions. Intuitively, an ensemble of multiple models can decide when to stop more reliably by selecting times at which most models agree that the current accuracy level is sufficient.
The proposed method relies on tailored information pooling strategies across the models in the ensemble that preserve the theoretical guarantees of CP and SpikeCP.

Main contributions:
The main contributions of this work are summarized as follows.
• We propose a novel ensemble-based SNN model that can reliably decide when to stop, producing set predictions with coverage guarantees and with an average latency that is significantly lower than the state of the art.
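• We introduce two ways of pooling information across the models in the ensemble, namely confidence merging and p-variable merging. In both cases, the resulting set predictors satisfy theoretical reliability guarantees.
• Experiments show that VI-based ensembles with p-variable merging (PM) significantly reduce the average latency of state-of-the-art methods, while maintaining reliability guarantees.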
Organization: The remainder of the paper is organized as follows. Section II presents the problem, and Section III reviews DC-SNN, while Section IV introduces the proposed framework.
Section V describes the experimental setting and results.

II. PROBLEM DEFINITION
In this paper, we study adaptive-latency multi-class classification for time series via SNNs [5,6,8]. As illustrated in Fig. 1, unlike prior work [5,6,8], we propose to enhance the reliability of stopping decisions by explicitly accounting for epistemic uncertainty when deciding whether to stop or to continue processing the input. The end goal is to produce reliable set predictions with complexity and latency tailored to the difficulty of each example. In this section, we start by defining the problem and performance metrics.

A. Multi-Class Classification with SNNs
We wish to classify a vector time series $x = x_1, x_2, \ldots$, with $N \times 1$ time samples $x_t = [x_{t,1}, \ldots, x_{t,N}]$, into $C$ classes using an SNN model. The entries of the input vector $x_t$ can be arbitrary, although typical SNN implementations assume binary inputs [20]. As shown in Fig. 1, based on the time samples $x^t = (x_1, \ldots, x_t)$ observed so far, at any time $t$, the $C$ read-out neurons of the SNN produce the $C \times 1$ binary vector $y_t = [y_{t,1}, \ldots, y_{t,C}]$, with entries equal to 1 representing spikes.
Internally, an SNN model can be viewed as a recurrent neural network (RNN) with binary activations. Its operation is defined by a vector $\theta$ of synaptic weights, which determines the response of each spiking neuron to incoming spikes. As in most existing art and implementations, we adopt a standard spike response model (SRM) [21] for the spiking neurons.
Decisions based on the outputs of the $C$ read-out neurons are typically made by rate decoding [22]. In rate decoding, at each time $t$, the SNN maintains a spike count vector $r(x^t) = [r_1(x^t), \ldots, r_C(x^t)]$, in which the $c$th entry counts the number of spikes emitted so far by read-out neuron $c$. A normalized measure of confidence can then be obtained via the softmax function as [22]
$$f_c(x^t) = \frac{e^{r_c(x^t)}}{\sum_{c'=1}^{C} e^{r_{c'}(x^t)}} \tag{2}$$
for each class $c$. Conversely, the loss assigned by the SNN model to label $c$ for input $x^t$ is given by the log-loss
$$s_c(x^t) = -\log f_c(x^t). \tag{3}$$
The general goal of this work is to make reliable classification decisions at the earliest possible time $t$ on the basis of the confidence levels (2), or equivalently of the losses (3), produced by SNN classifiers.
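As a minimal sketch of rate decoding, the snippet below maps a spike count vector to the softmax confidence levels (2) and the log-losses (3). The function name `rate_decode` and the example spike counts are illustrative assumptions, not part of the original work.

```python
import math

def rate_decode(spike_counts):
    """Return (confidences, losses) for one vector of read-out spike counts.

    confidences[c] is the softmax of the counts, as in (2);
    losses[c] is the log-loss -log(confidences[c]), as in (3).
    """
    m = max(spike_counts)  # subtracting the max leaves the softmax unchanged
    exps = [math.exp(r - m) for r in spike_counts]
    z = sum(exps)
    conf = [e / z for e in exps]
    loss = [-math.log(f) for f in conf]
    return conf, loss

# Hypothetical spike counts for C = 3 read-out neurons at some time t.
conf, loss = rate_decode([5, 2, 1])
```

A class with more accumulated spikes receives a larger confidence and hence a smaller loss, which is the ordering the stopping rules discussed later rely on.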

B. Ensemble Inference and Learning for SNNs
Conventional SNN models consist of a single SNN making decisions on the basis of the confidence levels (2), or of the losses (3), at a fixed time $t = T$. Neuroscience has long explored the connection between networks of spiking neurons and Bayesian reasoning [23], and the recent work [15] has explored the advantages of Bayesian learning and model ensembling in terms of uncertainty quantification for SNN classifiers. In this work, we leverage the enhanced uncertainty quantification capabilities of ensemble models to improve the reliability of adaptive-latency decision making via SNN models.
As illustrated in Fig. 1, in the considered setting, $K$ pre-trained SNN classifiers are used in parallel on an input sequence $x_1, x_2, \ldots$. The operation of each $k$th SNN classifier is defined by a vector $\theta_k$ of synaptic weights as explained in the previous subsection. We specifically consider two design methods for the ensembles, namely deep ensembles (DE) [19] and Bayesian learning via variational inference (VI) [14].
In DE, the $K$ models are obtained by running conventional SNN training methods based on surrogate gradient [24] with $K$ independent weight initializations, with each weight drawn in an independent and identically distributed (i.i.d.) manner from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$ for some fixed variance $\sigma^2$. In contrast, in VI, assuming an i.i.d. Gaussian prior distribution $\mathcal{N}(0, \sigma^2)$ for the model parameter vector $\theta$, one optimizes over a variational posterior distribution $\mathcal{N}(\mu, \zeta^2)$ parameterized by the mean vector $\mu$ and by a diagonal covariance matrix with diagonal elements given by the vector $\zeta^2$. The optimization is done by using gradient descent via the reparameterization trick [15]. At inference time, the $K$ models are generated by sampling the weight vectors $\theta_k$ from the optimized distribution $\mathcal{N}(\mu, \zeta^2)$.
With DE, generating the K models in the ensemble requires retraining from scratch, while this can be done by simply drawing Gaussian variables in the case of VI.Therefore, with DE, the ensemble should be practically shared across many input test sequences, while for VI it is possible to draw new ensembles more frequently, possibly even for each new input.
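The inference-time sampling step for VI can be sketched as follows, assuming the optimized mean vector and the per-weight standard deviations are already available as plain lists. The function name `sample_ensemble`, the flat-list weight representation, and the example values are assumptions made for illustration only.

```python
import random

def sample_ensemble(mu, zeta, K, seed=0):
    """Draw K weight vectors theta_k ~ N(mu, diag(zeta^2)).

    mu   : list of posterior means, one per synaptic weight
    zeta : list of posterior standard deviations (same length as mu)
    Each draw uses theta = mu + zeta * eps with eps ~ N(0, 1), the same
    form that the reparameterization trick uses during training.
    """
    rng = random.Random(seed)
    return [
        [m + z * rng.gauss(0.0, 1.0) for m, z in zip(mu, zeta)]
        for _ in range(K)
    ]

# Hypothetical posterior over three weights, and an ensemble of K = 4 models.
ensemble = sample_ensemble(mu=[0.0, 1.0, -1.0], zeta=[0.1, 0.1, 0.1], K=4)
```

Because a new ensemble only costs K Gaussian draws per weight, it can be redrawn as often as desired, whereas a DE ensemble requires K separate training runs.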

C. Set Prediction and Delay-Adaptivity
As mentioned, we focus on delay-adaptive classifiers in which the time at which a decision is made is a function of the input $x$ through the vector $f$ produced by the read-out neurons. Intuitively, when the model confidence is high enough, the classifier can produce a decision. We denote as $T_s(x)$ the time at which a decision is made for input $x$. Furthermore, we allow the decision to be in the form of a subset $\Gamma(x) \subseteq \{1, \ldots, C\}$ of the $C$ labels [9]. As mentioned in Sec. I, set decisions provide actionable information in many applications of interest, such as robotics, medical diagnosis, and language modelling, and they provide a measure of uncertainty via the predicted set's size $|\Gamma(x)|$ [9].
The performance of the classifier is measured in terms of reliability and latency. A predictive set $\Gamma(x)$ is said to be reliable if the probability that the correct label $c$ is included in the set is no smaller than a pre-determined target accuracy $p_{\text{targ}}$, i.e.,
$$\Pr(c \in \Gamma(x)) \geq p_{\text{targ}}, \tag{4}$$
where the probability is taken with respect to the distribution of the test example $(x, c)$, as well as of the calibration data to be discussed next. The latency of the set prediction is defined as
$$\mathbb{E}[T_s(x)], \tag{5}$$
where the expectation is taken over the same distribution as for (4).
The models are assumed to be pre-trained, and we assume access to a separate calibration data set $\mathcal{D}^{\text{cal}} = \{(x[i], c[i])\}_{i=1}^{|\mathcal{D}^{\text{cal}}|}$, with examples $(x[i], c[i])$ generated i.i.d. from the same distribution followed by the test example $(x, c)$ [8,9]. As we will discuss in the next section, calibration data is used to optimize the process of deciding when to stop so as to guarantee the reliability requirement (4).
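In experiments, the two metrics are estimated empirically over a test set. The sketch below computes the fraction of test examples whose predicted set contains the true label, and the average stopping time normalized by T. The function name `empirical_metrics` and the toy inputs are hypothetical.

```python
def empirical_metrics(pred_sets, labels, stop_times, T):
    """Empirical counterparts of the reliability and latency metrics.

    pred_sets  : list of predicted label sets, one per test example
    labels     : list of true labels
    stop_times : list of stopping times T_s(x), each in {1, ..., T}
    Returns (coverage, normalized_latency).
    """
    n = len(labels)
    coverage = sum(1 for s, c in zip(pred_sets, labels) if c in s) / n
    normalized_latency = sum(stop_times) / (n * T)
    return coverage, normalized_latency

# Toy example with three test inputs and T = 80 time steps.
coverage, latency = empirical_metrics(
    pred_sets=[[0, 1], [2], [1]],
    labels=[1, 2, 0],
    stop_times=[20, 40, 80],
    T=80,
)
```

Here two of the three sets contain the true label, and the stopping times average to 140/3 out of 80 steps.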

III. ENSEMBLE-BASED ADAPTIVE POINT CLASSIFICATION VIA SNNS
In this section, we first review dynamic-confidence SNN (DC-SNN), a point predictor for delay-adaptive SNN classification [5], and then introduce the ensemble-based version.

A. DC-SNN
DC-SNN produces a decision at the first time $t$ for which the maximum confidence level across all possible classes is no smaller than a fixed target confidence level $p_{\text{th}} \in (0, 1)$. Accordingly, the stopping time is given by
$$T_s(x) = \min\Big\{t \in \{1, \ldots, T\} : \max_{c \in \{1, \ldots, C\}} f_c(x^t) \geq p_{\text{th}}\Big\} \tag{6}$$
if there is a time $t < T$ that satisfies the constraint; and $T_s(x) = T$ otherwise. The rationale for this approach is that, by (6), if $T_s(x) < T$, the classifier has a confidence level no smaller than $p_{\text{th}}$ on the decision
$$\hat{c}(x) = \arg\max_{c \in \{1, \ldots, C\}} f_c(x^{T_s(x)}). \tag{7}$$
If the SNN classifier is well calibrated, the confidence level coincides with the true accuracy of the decision given by the class $\arg\max_{c} f_c(x^t)$ at all times $t$. Therefore, setting the target confidence level $p_{\text{th}}$ to be equal to the target accuracy $p_{\text{targ}}$, i.e., $p_{\text{th}} = p_{\text{targ}}$, guarantees a zero, or negative, reliability gap for the adaptive decision (7) when $T_s(x) < T$. However, the assumption of calibration is typically not valid. To address this problem, reference [5] introduced a solution based on the use of a calibration data set.
Specifically, DC-SNN evaluates the empirical accuracy of the decision (7) on the calibration data, i.e.,
$$\hat{A}^{\text{cal}}(p_{\text{th}}) = \frac{1}{|\mathcal{D}^{\text{cal}}|} \sum_{i=1}^{|\mathcal{D}^{\text{cal}}|} \mathbb{1}\big(\hat{c}(x[i]) = c[i]\big), \tag{8}$$
where $\mathbb{1}(\cdot)$ is the indicator function, for a grid of possible values of the target confidence level $p_{\text{th}}$. Then, it chooses the minimum value $p_{\text{th}}$ that ensures the inequality $\hat{A}^{\text{cal}}(p_{\text{th}}) \geq p_{\text{targ}}$, so that the calibration accuracy is no smaller than the target accuracy level $p_{\text{targ}}$; or the smallest value $p_{\text{th}}$ that maximizes $\hat{A}^{\text{cal}}(p_{\text{th}})$ if the constraint $\hat{A}^{\text{cal}}(p_{\text{th}}) \geq p_{\text{targ}}$ cannot be met.
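The DC-SNN procedure described above can be sketched in a few lines, assuming the confidence vectors at times 1 to T have already been computed for each input. The names `dc_snn_decide` and `calibrate_threshold`, the threshold grid, and the toy calibration data are assumptions for illustration; the comparison with the threshold uses "no smaller than", following the rationale given in the text.

```python
def dc_snn_decide(conf_seq, p_th):
    """Return (stopping time, decided class) for one input.

    conf_seq[t - 1] is the confidence vector at time t = 1, ..., T.
    Stops at the first time the top confidence reaches p_th, else at T.
    """
    T = len(conf_seq)
    for t, f in enumerate(conf_seq, start=1):
        if max(f) >= p_th or t == T:
            return t, f.index(max(f))

def calibrate_threshold(cal_seqs, cal_labels, p_targ, grid):
    """Pick the smallest p_th on the grid whose calibration accuracy
    reaches p_targ; otherwise the smallest p_th maximizing accuracy."""
    best_p, best_acc = None, -1.0
    for p_th in sorted(grid):
        hits = sum(
            1 for seq, c in zip(cal_seqs, cal_labels)
            if dc_snn_decide(seq, p_th)[1] == c
        )
        acc = hits / len(cal_labels)
        if acc >= p_targ:
            return p_th, acc
        if acc > best_acc:  # strict '>' keeps the smallest maximizer
            best_p, best_acc = p_th, acc
    return best_p, best_acc

# Toy calibration set: two inputs, C = 2 classes, T = 3 time steps.
seq_a = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]       # true label 1
seq_b = [[0.9, 0.1], [0.95, 0.05], [0.97, 0.03]]   # true label 0
p_th, acc = calibrate_threshold([seq_a, seq_b], [1, 0], p_targ=1.0,
                                grid=[0.5, 0.75, 0.9])
```

In this toy case the threshold 0.5 stops the first input too early and misclassifies it, so the procedure moves up to 0.75, where both calibration inputs are classified correctly.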

B. Ensemble-based DC-SNN
Following Sec. II-B, one can directly extend DC-SNN to implement approximate Bayesian learning by means of VI and DE methods. Accordingly, at inference time, a decision is made on the basis of $K$ SNN models from a trained ensemble, which is fixed in the case of DE and randomly generated for VI. In this subsection, we briefly describe the decision procedure for a Bayesian version of DC-SNN.
Given some input $x$, each $k$th model produces a confidence value $f_c^k(x^t)$ for the pair $(x^t, c)$. Implementing standard Bayesian model averaging, the confidence values $f_c^k(x^t)$, $k = 1, \ldots, K$, for all models are then pooled by averaging as
$$f_c(x^t) = \frac{1}{K} \sum_{k=1}^{K} f_c^k(x^t). \tag{9}$$
The ensemble probability $f_c(x^t)$ in (9) is finally applied in (6) and (7) to obtain the final decision.

IV. ENSEMBLE-BASED ADAPTIVE SET CLASSIFICATION VIA SNNS
In this section, we introduce ensemble-based SpikeCP, a novel framework for delay-adaptive classification that wraps around any pre-trained ensemble of SNN classifiers, including ensembles obtained via DE and VI. We propose two implementations corresponding to different ways of pooling information across the $K$ models in the ensemble.

A. SpikeCP
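We first review SpikeCP [8], which applies to a single SNN model, i.e., with $K = 1$. The presentation here, unlike in [8], adopts the language of p-variables (see, e.g., [12,25]) in order to facilitate the extension to ensemble models.
SpikeCP fixes a pre-determined set of checkpoint times $\mathcal{T}_s \subseteq \{1, \ldots, T\}$ at which inference may stop to produce a decision. The information available to determine whether to stop or not consists of the losses $\{s_c(x^t)\}_{c=1}^{C}$ in (3) for the current input $x^t$, as well as the corresponding losses $s_{c[i]}(x^t[i])$ for the calibration data points indexed by $i = 1, \ldots, |\mathcal{D}^{\text{cal}}|$. For each class $c$, SpikeCP computes the quantity
$$p_c(x^t) = \frac{1 + \sum_{i=1}^{|\mathcal{D}^{\text{cal}}|} \mathbb{1}\big(s_c(x^t) \leq s_{c[i]}(x^t[i])\big)}{|\mathcal{D}^{\text{cal}}| + 1}, \tag{10}$$
where $\mathbb{1}(\cdot)$ equals 1 if the argument is true and 0 otherwise. The quantity (10) corresponds, approximately, to the fraction of calibration data points whose loss is no smaller than the loss for label $c$ when assigned to the current test input $x^t$. The corrections by 1 at numerator and denominator are required to guarantee the following property, which follows from the standard theory of CP [26, Proposition 1].
Theorem 4.1: Let $\mathcal{D}^{t,\text{cal}} = \{(x^t[i], c[i])\}_{i=1}^{|\mathcal{D}^{\text{cal}}|}$ be the calibration data set with samples up to time $t$, and define as $H_c^t$ the hypothesis that the pair $(x^t, c)$ and the calibration data $\mathcal{D}^{t,\text{cal}}$ are i.i.d. The quantity (10) is a p-variable for null hypothesis $H_c^t$, i.e., we have the conditional probability
$$\Pr\big(p_c(x^t) \leq \alpha \mid H_c^t\big) \leq \alpha \tag{11}$$
for all $\alpha \in (0, 1)$, where the probability is taken over the distribution of test and calibration data.
At each checkpoint $t \in \mathcal{T}_s$, SpikeCP constructs a predictive set by including all classes $c$ with p-variable larger than a threshold $\alpha$, i.e.,
$$\Gamma(x^t) = \{c \in \{1, \ldots, C\} : p_c(x^t) > \alpha\}. \tag{12}$$
By (11), the probability that the set (12) does not include the true test label $c$ is no larger than $\alpha$, or equivalently [26, Proposition 1]
$$\Pr\big(c \in \Gamma(x^t)\big) \geq 1 - \alpha. \tag{13}$$
Accordingly, SpikeCP sets $\alpha = (1 - p_{\text{targ}})/|\mathcal{T}_s|$ so that, by a union bound over the $|\mathcal{T}_s|$ checkpoints, the coverage condition (13) yields the reliability condition (4) irrespective of which checkpoint is selected. As detailed in [8], this is a form of Bonferroni correction.
SpikeCP stops inference at the first checkpoint $T_s(x)$ for which the size of the predicted set is no larger than a target set size $I_{\text{th}}$, so the stopping time is given by
$$T_s(x) = \min\{t \in \mathcal{T}_s : |\Gamma(x^t)| \leq I_{\text{th}}\}. \tag{14}$$
The threshold $I_{\text{th}}$ is a design choice that is dictated by the desired informativeness of the resulting set predictor. For any threshold $I_{\text{th}}$, by construction, SpikeCP satisfies the reliability property (4) [8, Theorem 1].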
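The SpikeCP steps reviewed in this subsection (p-variable, predictive set, Bonferroni-corrected threshold, and stopping rule) can be sketched as follows. Two choices in this sketch are assumptions rather than statements from the source: the stopping test uses a set size no larger than the target size, and inference falls back to the last checkpoint if no earlier checkpoint qualifies. All function names and toy numbers are hypothetical.

```python
def p_variable(test_loss, cal_losses):
    """Fraction-style p-variable with the +1 corrections:
    (1 + #{i : test_loss <= cal_losses[i]}) / (|D_cal| + 1)."""
    count = sum(1 for s in cal_losses if test_loss <= s)
    return (1 + count) / (len(cal_losses) + 1)

def predictive_set(test_losses, cal_losses, alpha):
    """Keep every class whose p-variable exceeds alpha."""
    return [c for c, s in enumerate(test_losses)
            if p_variable(s, cal_losses) > alpha]

def spikecp(test_losses_per_ckpt, cal_losses_per_ckpt, p_targ, i_th):
    """Return (checkpoint index, predicted set) at the stopping point.

    test_losses_per_ckpt[j][c] : loss of class c at checkpoint j
    cal_losses_per_ckpt[j][i]  : loss of the true label of calibration
                                 example i at checkpoint j
    """
    n_ckpt = len(test_losses_per_ckpt)
    alpha = (1.0 - p_targ) / n_ckpt  # Bonferroni correction over checkpoints
    for j in range(n_ckpt):
        gamma = predictive_set(test_losses_per_ckpt[j],
                               cal_losses_per_ckpt[j], alpha)
        # Assumed rule: stop once the set is small enough, or at the end.
        if len(gamma) <= i_th or j == n_ckpt - 1:
            return j, gamma

# Toy example: C = 3 classes, 2 checkpoints, 4 calibration points.
cal = [0.1, 0.2, 0.3, 0.4]
stop_index, gamma = spikecp(
    test_losses_per_ckpt=[[0.05, 0.25, 5.0], [0.05, 3.0, 5.0]],
    cal_losses_per_ckpt=[cal, cal],
    p_targ=0.5,
    i_th=1,
)
```

At the first checkpoint two classes remain plausible, so inference continues; at the second checkpoint only class 0 survives and the procedure stops.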

B. Ensemble-based SpikeCP with Confidence Merging
In the proposed ensemble-SNN architecture in Fig. 1, each SNN classifier, parameterized by $\theta_k$, produces a different confidence level $f_c^k(x^t)$ via (2), or correspondingly a different loss $s_c^k(x^t)$ via (3), for each class $c$ given an input $x^t$. In this paper, we study and compare two combining mechanisms.
First, in order to produce a confidence level for each possible label $c$, the confidence levels output by the $K$ models in the ensemble can be combined using the generalized mean [28]
$$f_c(x^t) = \left(\frac{1}{K} \sum_{k=1}^{K} \big(f_c^k(x^t)\big)^r\right)^{1/r} \tag{15}$$
for some integer $r \in [-\infty, +\infty]$. When $r = 1$, the ensemble probability (15) reduces to standard model averaging. Other values of $r$ may in practice be advantageous, e.g., to enhance robustness [29,30], with the maximum operation recovered for $r = \infty$ and the minimum operation obtained with $r = -\infty$. The probability (15) is used to calculate the score via (3), which is then directly used in (10) and (12) to determine the set predictor. Note that the same combination in (15) is also applied to calibration data. By the same arguments as for SpikeCP, this approach guarantees the reliability condition (4) by setting $\alpha = (1 - p_{\text{targ}})/|\mathcal{T}_s|$.
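A minimal sketch of confidence merging via the generalized mean is given below for a single class, with the per-model confidences passed as a list. The function name `generalized_mean` is hypothetical, and the handling of r = 0 as the geometric-mean limit is an added convention that the text does not discuss.

```python
import math

def generalized_mean(values, r):
    """Generalized (power) mean of positive values with exponent r.

    r = 1 gives the arithmetic mean (standard model averaging),
    r = +inf the maximum, and r = -inf the minimum.
    """
    K = len(values)
    if r == math.inf:
        return max(values)
    if r == -math.inf:
        return min(values)
    if r == 0:  # assumed convention: limit as r -> 0 (geometric mean)
        return math.exp(sum(math.log(v) for v in values) / K)
    return (sum(v ** r for v in values) / K) ** (1.0 / r)

# Hypothetical confidences assigned to one class by K = 3 models.
f_k = [0.2, 0.4, 0.6]
merged_avg = generalized_mean(f_k, 1)
merged_max = generalized_mean(f_k, math.inf)
merged_min = generalized_mean(f_k, -math.inf)
```

The merged confidence would then be converted into a loss via the log-loss and used exactly as in single-model SpikeCP, with the same merging applied to the calibration examples.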

C. Ensemble-based SpikeCP with P-Variable Merging
Given the reliance of the predicted set (12) on p-variables, merging directly the confidence levels may be suboptimal [31]. Accordingly, in this subsection, we explore the idea of pooling directly the p-variables, rather than combining confidence levels. To this end, we first calculate the losses for the calibration set by using the $k$th model as
$$s_{c[i]}^k(x^t[i]) = -\log f_{c[i]}^k(x^t[i]), \quad i = 1, \ldots, |\mathcal{D}^{\text{cal}}|.$$
Then, for a test input $x^t$, we evaluate the p-variable (10) for the $k$th model as
$$p_c^k(x^t) = \frac{1 + \sum_{i=1}^{|\mathcal{D}^{\text{cal}}|} \mathbb{1}\big(s_c^k(x^t) \leq s_{c[i]}^k(x^t[i])\big)}{|\mathcal{D}^{\text{cal}}| + 1}. \tag{16}$$
The p-variables $\{p_c^k(x^t)\}_{k=1}^{K}$ are then pooled by using any p-merging function $F(\cdot)$, as defined next.
Definition 4.1 ([32, 33]): A function $F : [0, 1]^K \to [0, \infty)$ is said to be a p-merging function if, when the inputs are p-variables, the output is also a p-variable, i.e., we have the inequality
$$\Pr\big(F(p_1, \ldots, p_K) \leq \alpha\big) \leq \alpha \quad \text{for all } \alpha \in (0, 1), \tag{17}$$
where the probability is taken over the joint distribution of the $K$ input p-variables.
Using the merged p-value generated as
$$p_c(x^t) = F\big(p_c^1(x^t), \ldots, p_c^K(x^t)\big) \tag{18}$$
for any p-merging function $F(\cdot)$, the predictive set can be constructed by following (12). By definition of p-merging function, the resulting set predictor also satisfies the reliability condition (4).
In the experiments reported in the next section, we focus on the class of p-merging functions of the form [33]
$$F(p_1, \ldots, p_K) = a_r \left(\frac{1}{K} \sum_{k=1}^{K} p_k^r\right)^{1/r}, \tag{19}$$
where $a_r$ is a constant chosen so as to ensure (17) as specified in [33, Table 1]. For example, setting $r = -\infty$, and correspondingly $a_r = K$, yields the p-merging function $F(p_1, \ldots, p_K) = K \min(p_1, \ldots, p_K)$, while setting $r = \infty$ with $a_\infty = 1$ yields $F(p_1, \ldots, p_K) = \max(p_1, \ldots, p_K)$.
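The p-merging family used in the experiments can be sketched as a scaled generalized mean of the per-model p-variables. In this sketch the constant `a_r` is supplied by the caller, since its value for general r comes from [33, Table 1] and is not reproduced here; only the cases stated in the text (r = -inf with a_r = K, r = +inf with a_r = 1, and a_r = K^(1/r) as used in the experiments) are exercised. The function name `p_merge` and the example p-values are hypothetical.

```python
import math

def p_merge(p_values, r, a_r):
    """Merge K p-variables as a_r times their generalized mean of order r.

    The caller is responsible for choosing a_r so that the output is a
    valid p-variable; this function does not check validity.
    """
    K = len(p_values)
    if r == math.inf:
        mean = max(p_values)
    elif r == -math.inf:
        mean = min(p_values)
    else:
        mean = (sum(p ** r for p in p_values) / K) ** (1.0 / r)
    return a_r * mean

# Hypothetical p-variables produced by K = 3 models for one class.
p_k = [0.1, 0.3, 0.5]
K = len(p_k)
bonferroni = p_merge(p_k, -math.inf, a_r=K)       # K * min(p_1, ..., p_K)
maximum = p_merge(p_k, math.inf, a_r=1)           # max(p_1, ..., p_K)
large_r = p_merge(p_k, 45, a_r=K ** (1.0 / 45))   # setting used with r = 45
```

With a large exponent such as r = 45 and a_r = K^(1/r), the merged value is numerically very close to the maximum of the p-variables, so a class is excluded from the set only when all models assign it a small p-variable.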

V. EXPERIMENTS
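For numerical evaluations, we consider the standard DVS128 Gesture dataset [34], the MNIST-DVS dataset [35], and the CIFAR-10 dataset. The first dataset represents a video recognition task, and the latter two represent image classification tasks. The calibration data set $\mathcal{D}^{\text{cal}}$ is obtained by randomly sampling $|\mathcal{D}^{\text{cal}}| = 50$ examples from the test set, with the rest used for training, which is done via the surrogate gradient method [24]. The length of the time series is $T = 80$ samples, and we fix the set of possible checkpoints as $\mathcal{T}_s = \{20, 40, 60, 80\}$, and the target set size to $I_{\text{th}} = 3$. The target accuracy $p_{\text{targ}}$ is set to 0.9.
We compare the performance of ensemble-based SpikeCP using DE or VI equipped with confidence merging (CM) or p-variable merging (PM) and ensemble-based DC-SNN. For DE, we follow standard random initialization made available by PyTorch, while for VI we set the prior distribution to have variance 0.03. The parameter $r$ in (15) for CM is set to 1, yielding standard model averaging [15]; while $r$ in (19) for PM is set to $r = 45$, with $a_r = K^{1/r}$ following [33, Table 1], based on the numerical minimization of latency on a held-out data set. The results are averaged over 50 different realizations of calibration and test data sets, and the ensemble size $K$ is set to 6. For a fair comparison, we apply the stopping rule defined in Sec. III to obtain the stopping time, and use a top-3 predictor to produce a set $\Gamma_d(x)$ for ensemble-based DC-SNN.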

A. MNIST-DVS Dataset
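The MNIST-DVS dataset contains time series recorded from a DVS camera that is shown moving handwritten digits from "0" to "9" on a screen. The data set contains 8,000 training examples, as well as 2,000 examples used for calibration and testing. For this experiment, we adopt a fully connected SNN with one hidden layer having 1,000 neurons.
Fig. 2 reports accuracy, namely $\Pr(c \in \Gamma_d(x))$ for ensemble-based DC-SNN and $\Pr(c \in \Gamma(x))$ for ensemble-based SpikeCP, and normalized latency $\mathbb{E}[T_s(x)]/T$ as a function of the target accuracy $p_{\text{targ}}$. Ensemble-based DC-SNN increases the decision latency as the target probability $p_{\text{targ}}$ increases, in order to meet the reliability condition. However, a reliable decision is only attained by DC-SNN when $p_{\text{targ}}$ is small. In contrast, ensemble-based SpikeCP is always reliable, irrespective of the target accuracy. Furthermore, ensemble-based SpikeCP using VI and PM requires smaller latency to achieve the target accuracy.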
In Fig. 3, we show the accuracy and normalized latency as a function of the ensemble size.
With a larger ensemble size, both ensemble-based DC-SNN and SpikeCP exhibit reduced latency in reaching a final decision.While SpikeCP maintains its reliability guarantee, DC-SNN falls short of achieving the target accuracy.
To explore the impact of the hyper-parameter $r$ in (15) and (19) for ensemble-based SpikeCP, we show in Fig. 4 the accuracy and normalized latency as a function of $r$.

B. DVS128 Gesture Dataset
The DVS128 Gesture data set collects videos from a DVS camera that is shown an actor performing one of 11 different gestures under three different illumination conditions. We divide each time series into $T = 80$ time intervals, integrating the discrete samples within each interval to obtain a (continuous-valued) time sample [36]. The dataset contains 1,176 training examples and 288 test examples, from which 50 examples are chosen to serve as calibration data. The SNN architecture is constructed using a convolutional layer, encompassing batch normalization and a max-pooling layer, as well as a fully-connected layer as described in [36].
In Fig. 5, we show the accuracy, given by the probability $\Pr(c \in \Gamma(x))$ in (4), and the average decision latency as a function of the ensemble size $K$ on the DVS128 Gesture dataset. The performance of ensemble-based DC-SNN is similar to that on the MNIST-DVS dataset, failing to meet the target accuracy. To highlight the performance of ensemble-based SpikeCP, we omit the performance of DC-SNN here. Confirming their theoretical properties, all ensemble-based SpikeCP schemes meet the target accuracy $p_{\text{targ}} = 0.9$. Furthermore, the average latency decreases with the ensemble size $K$, providing substantial improvements as compared to the original SpikeCP scheme with $K = 1$ [8].
VI methods tend to have a better performance in terms of latency, showcasing the benefits of VI as a more principled approach for Bayesian learning. Finally, PM generally yields smaller latency values as compared to CM, indicating that merging p-variables offers a more efficient information pooling strategy.

C. CIFAR-10 Dataset
The CIFAR-10 dataset consists of 60,000 32×32 color images that are divided into 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. We use $|\mathcal{D}^{\text{cal}}| = 50$ calibration samples, which are obtained by randomly selecting 50 data points from the test set. We adopt a ResNet-18 architecture in which conventional neurons are replaced with SRM neurons [36]. Each example is repeatedly presented to the SNN $T = 80$ times.
In Fig. 6, we show the accuracy $\Pr(c \in \Gamma(x))$ and normalized latency $\mathbb{E}[T_s(x)]/T$ as a function of the ensemble size $K$ on the CIFAR-10 dataset for ensemble-based SpikeCP. As per our theory, SpikeCP can guarantee the reliability condition with all information pooling schemes.
Furthermore, VI with PM produces the best performance in terms of latency.

at which most models agree that the current accuracy level is sufficient. Our proposed approach relies on information pooling from ensemble models and provides theoretical reliability guarantees.
Future work may consider applications of the proposed method to domains such as wireless communications, in which reliability and latency are essential performance criteria [22].

Fig. 1. In the proposed system, an ensemble of $K$ SNN models processes an input $x$, agreeing on when to stop in order to make a classification decision. Each $k$th SNN model produces a score $p_c^k$ for every candidate class $c = 1, \ldots, C$. The scores are combined to determine in an adaptive way whether to stop inference or to continue processing the input.
and the CIFAR-10 dataset. The first dataset represents a video recognition task, and the latter two represent image classification tasks. The calibration data set $\mathcal{D}^{\text{cal}}$ is obtained by randomly sampling $|\mathcal{D}^{\text{cal}}| = 50$ examples from the test set, with the rest used for training,

Fig. 2 reports accuracy, namely $\Pr(c \in \Gamma_d(x))$ for ensemble-based DC-SNN and $\Pr(c \in \Gamma(x))$ for ensemble-based SpikeCP, and normalized latency $\mathbb{E}[T_s(x)]/T$ as a function of the target accuracy $p_{\text{targ}}$. Ensemble-based DC-SNN increases the decision latency as the target probability $p_{\text{targ}}$ increases, in order to meet the reliability condition. However, the reliable decision is only

Fig. 5. Accuracy $\Pr(c \in \Gamma(x))$ and normalized latency $\mathbb{E}[T_s(x)]/T$ as a function of the ensemble size $K$ for the DVS128 Gesture dataset.
corresponds, approximately, to the fraction of calibration data points whose loss is no smaller than the loss for label $c$ when assigned to the current test input $x^t$. The corrections by 1 at numerator and denominator are required to guarantee the following property, which follows from the standard theory of CP [26, Proposition 1].
Theorem 4.1: Let $\mathcal{D}^{t,\text{cal}} = \{(x^t[i], c[i])\}_{i=1}^{|\mathcal{D}^{\text{cal}}|}$ be the calibration data set with samples up to time $t$, and define as $H_c^t$ the hypothesis that the pair $(x^t, c)$ and the calibration data $\mathcal{D}^{t,\text{cal}}$ are i.i.d. The quantity (10) is a p-variable for null hypothesis $H_c^t$, i.e., we have the conditional probability
$1/r$ following [33, Table 1], based on the numerical minimization of latency on a held-out data set. The results are averaged over 50 different realizations of calibration and test data sets, and the ensemble size $K$ is set to 6. For a fair comparison, we apply the stopping rule defined in Sec. III to obtain the stopping time, and use a top-3 predictor to produce a set $\Gamma_d(x)$ for ensemble-based DC-SNN.