Performance Comparison of Machine Learning Across Metal, Cuda, and Software-Based Neuromorphic Simulation

Saini, Ryan; Andreopoulos, William B.

doi:10.3390/inventions11030055

by Ryan Saini and William B. Andreopoulos *

Department of Computer Science, San Jose State University, San Jose, CA 95112, USA

* Author to whom correspondence should be addressed.

Inventions 2026, 11(3), 55; https://doi.org/10.3390/inventions11030055

Submission received: 1 November 2025 / Revised: 29 April 2026 / Accepted: 27 May 2026 / Published: 4 June 2026

(This article belongs to the Section Inventions and Innovation in Electrical Engineering/Energy/Communications)

Abstract

Machine learning’s computational demands necessitate optimal performance and utilization across diverse hardware architectures. This research compares convolutional spiking neural networks (CSNNs, a form of simulated neuromorphic computing) and conventional CNNs on an Apple Silicon M3 Pro with Metal Performance Shaders (MPS) and an NVIDIA RTX 3070 GPU with CUDA. We run CSNNs and traditional CNNs on two datasets (frame-based CIFAR-10 and sequential event-based DVS) to evaluate the suitability of neural network architectures and platforms for different data problems. For both CSNNs and traditional CNNs, Apple Silicon with MPS delivers better energy efficiency but longer processing times for training and inference. NVIDIA with CUDA offers faster computation in training and inference at higher energy costs for CNNs. For CSNNs, frame-based data (CIFAR-10) significantly degraded performance when proper temporal encoding was absent, while event-based data (DVS) proved more naturally suited to the CSNN architecture than frame-based inputs, though CNNs still achieved higher empirical accuracy in the reported experiments. CSNNs also performed better on Apple Silicon (with MPS) for the sequential event-based data. RAM utilization patterns favored Apple Silicon (with MPS) across both data experiments. The CSNN architecture demanded more memory than the CNN, regardless of platform and dataset. NVIDIA (with CUDA) was less energy efficient for CSNNs than Apple Silicon (with MPS). We also compared how the number of time steps affects accuracy and energy consumption across hardware platforms, finding that higher accuracy correlates with higher energy costs as time steps increase; the accuracy-energy relation seems linear for frame-based data, while for event-based data the energy consumption remains stable before increasing at higher time steps. Our cross-platform performance analysis of spiking and regular neural network architectures highlights the importance of matching platform-architecture combinations to a dataset and application requirements.

Keywords:

machine learning; Apple Silicon; CUDA; Metal Performance Shaders (MPS); neuromorphic; spiking neural networks; NVIDIA

1. Introduction

In 2025, machine learning (ML) has become a popular commodity attracting attention from both individuals and major companies. ML has driven innovation in nearly all industries, including healthcare, entertainment, and technology. Its growing influence has in turn led to a higher demand for performance on complex and computationally heavy tasks [1]. Users often seek ways to optimize their systems for running these resource-intensive tasks. As the drive for deeper and more sophisticated neural networks grows, the platforms that power these systems have become an important factor in performance and energy efficiency. For researchers, developers, and organizations, it is important to select the right combination of hardware platform and neural network architecture to balance performance, cost, and energy consumption.

This research provides a comprehensive cross-platform and cross-model architecture performance analysis. Apple Silicon with Metal is a popular system-on-chip design with a unified memory architecture. NVIDIA GPUs with CUDA are the most popular industry standard for dedicated graphics processing units with unique parallel computing capabilities.

Lastly, neuromorphic computing is a new computing model that is inspired by the brain and fundamentally different from the current von Neumann architectures [2]. Neuromorphic systems are particularly well-suited for processing temporal information, including event-based data, which represents the time-dependent information that spiking neurons require to function effectively.

Our main objective is to provide a systematic comparison of these platforms across both traditional neural networks and spiking neural networks to see how different platforms handle different model architectures [3]. We hypothesize there is a tradeoff between accuracy, training time, memory utilization, and energy consumption, which differs based on the dataset, hardware platform and neural network architecture; in particular, spiking neural networks that simulate neuromorphic computing may be a better fit than classic CNNs for particular datasets and hardware platforms. Additionally, the time steps in spiking neural networks may relate to a tradeoff of accuracy, memory utilization, and energy consumption.

The metrics we explore include accuracy, training time, memory utilization, and energy consumption, with additional metrics depending on the experiment. We also aim to identify potential advantages of simulated neuromorphic computing approaches for various applications.

While there are extensive studies on these hardware platforms using specific neural network architectures, there is no comprehensive comparative analysis across multiple platforms and model architectures. Current studies either focus on one platform for multiple networks, compare multiple platforms for traditional network architectures, or use neuromorphic approaches in isolation from common computing platforms [4,5,6,7,8].

Our research aims to address this gap by providing a complete framework that spans several platforms along with model architecture variations for neural networks. Additionally, the research explores two data modalities in the experiments: frame-based and event-based (sequential with time encoding). An important consideration of the research is the inclusion of power consumption metrics, which are often overlooked despite their importance for real-world applications such as mobile computing.

2. Materials and Methods

This section describes the experimental structure, which comprises evaluating simulated neuromorphic architectures with Convolutional Spiking Neural Networks (CSNNs) and comparing them against traditional CNNs on two datasets (frame-based CIFAR-10 and event-based DVS) on two different hardware platforms (NVIDIA with CUDA and Apple with MPS). In total, four model-experiment configurations are considered. The primary goal of this profiling analysis is to evaluate various machine learning performance metrics (accuracy, RAM, GPU allocated memory, GPU reserved memory, power, total watt-hours, total time, inference time) across different hardware platforms and ML model architectures, rather than measuring the overall performance of the entire software implementation.

2.1. Experimental Setup

The performance of these models was measured on two devices: (1) a 2023 MacBook Pro with an Apple M3 Pro chip, 18 GB of unified memory, a 12-core CPU (6 performance cores and 6 efficiency cores), and an 18-core GPU with Metal 3 support, running macOS Sequoia 15.2; and (2) a custom-built Windows PC featuring an Intel Core i7-13700K processor, ASUS Prime Z790-P motherboard, NVIDIA RTX 3070 GPU with CUDA 12.8, and 32 GB of DDR5 6000 MHz RAM, running Windows 11 Version 24H2 (OS Build: 26100.3775, Update ID: KB5055523).

To simulate SNNs, the snnTorch library (v0.9.4) was used on both devices. Several other neuromorphic computing libraries were considered for the study, including SpikingJelly (v0.0.0.0.14) [9] and BindsNET (v0.2.7) [10]. snnTorch was selected due to its comprehensive GPU acceleration support across different hardware platforms. Specifically, snnTorch leverages PyTorch’s (v2.5.1) native GPU acceleration capabilities, enabling it to utilize Apple’s Metal Performance Shaders backend for GPU training acceleration on Apple Silicon as well as CUDA for NVIDIA GPUs. This cross-platform GPU acceleration was crucial for ensuring consistent performance evaluation across both test systems, as SpikingJelly’s GPU acceleration on Apple Silicon appears to be more limited compared to its CUDA support.

The methodology for the performance analysis involved implementing profiling tools to measure computational efficiency, memory usage, and energy consumption across model architectures. Power consumption was monitored through a PowerMonitor class that samples GPU power draw at 0.5 s intervals using hardware-specific approaches: the powermetrics command line tool on Apple Silicon and nvidia-smi queries on NVIDIA GPUs. Python’s psutil library (v6.1.1) was used for tracking RAM usage. To retrieve the allocated and reserved memory for Apple Silicon (with MPS), we used torch.mps.current_allocated_memory() and torch.mps.driver_allocated_memory(). To retrieve the same for NVIDIA’s CUDA, we used torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). The profiling metrics were collected during the model training and testing phases and visualized with Matplotlib (v3.10.0) and Seaborn (v0.13.2).
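The fixed-interval sampling described above can be sketched as follows. This is a minimal illustration and not the PowerMonitor class used in the study: the function names are hypothetical, only the NVIDIA path is shown, and the reader function is injectable so the loop can be exercised without a GPU. The nvidia-smi flags used are the standard query options for power draw.

```python
import subprocess
import time
from typing import Callable, List


def parse_power_output(text: str) -> float:
    """Parse the first line of nvidia-smi CSV output (no header, no units) as watts."""
    return float(text.strip().splitlines()[0])


def read_nvidia_power_watts() -> float:
    """Query the current power draw of the first GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_power_output(result.stdout)


def sample_power(read_fn: Callable[[], float], n_samples: int,
                 interval_s: float = 0.5) -> List[float]:
    """Collect n_samples readings from read_fn, spaced interval_s seconds apart."""
    samples = []
    for i in range(n_samples):
        samples.append(read_fn())
        if i < n_samples - 1:  # no need to wait after the final reading
            time.sleep(interval_s)
    return samples
```

On a machine with an NVIDIA GPU, `sample_power(read_nvidia_power_watts, 10)` would collect ten readings over roughly 4.5 s; an equivalent reader for Apple Silicon would wrap powermetrics instead.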

No fixed random seed was set across any experiments; stochastic variation arising from weight initialization and data ordering is instead captured empirically by repeating each primary experiment three times and reporting the mean ± standard deviation across runs. The supplementary time step analysis was similarly repeated five times per configuration. The observed differences between architectures and platforms substantially exceed this run-to-run variability in all primary experiment results.
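As an illustration of the reporting convention, the mean ± standard deviation across repeated runs could be computed as in the sketch below. The accuracy values are made up for the example, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) for a list of per-run metrics."""
    mean = statistics.mean(values)
    # statistics.stdev uses the n - 1 denominator; it needs at least two values.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


# Hypothetical accuracies from three repeated runs (not values from the study).
mean_acc, std_acc = summarize_runs([83.0, 84.0, 85.0])
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```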

2.2. Datasets

Experiment 1 (CIFAR-10) utilized frame-based images while experiment 2 (DVSGesture) utilized event-based data. Frame-based data represent traditional image formats, where complete frames are captured at fixed time intervals. Each pixel represents the intensity values at specific spatial locations. Event-based data capture pixel level changes in brightness as they occur in time. Specifically, each event contains spatial coordinates, timestamp, and polarity (brightness increase or decrease).

2.2.1. Experiment 1 (CIFAR-10)

The dataset used in experiment 1 is the CIFAR-10 dataset. This dataset contains 60,000 color images that are evenly distributed between 10 distinct classes [11]. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class consists of 6000 images. Each image is a 32 × 32 pixel RGB color photograph. This is a popular dataset for computer vision tasks as the data provides sufficient detail for object recognition while maintaining computational efficiency.

The dataset is split 83.3%/16.7%, with 50,000 training images and 10,000 test images: 5000 training images and 1000 test images per class. The classes are mutually exclusive, so no image belongs to more than one category. The low resolution requires a model to extract meaningful features from limited pixel information, which in turn assesses its ability to generalize visual concepts instead of just memorizing pixel patterns.

Identical pre-processing was applied for both models in our experiment. For the training set, data augmentation consisted of random horizontal flips and random crops with a padding of 4 pixels to enhance the model’s ability to generalize. Following this, the images were converted to tensors and normalized using the dataset’s channel-wise mean (0.4914, 0.4822, 0.4465) and standard deviation (0.2023, 0.1994, 0.2010).
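The channel-wise normalization amounts to subtracting the mean and dividing by the standard deviation for each color channel. The sketch below shows this for a single pixel in plain Python; it assumes pixel values have already been scaled to the range [0, 1] during tensor conversion, and the function name is illustrative rather than taken from the study's code.

```python
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb):
    """Normalize one (R, G, B) pixel with values in [0, 1], channel by channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, CIFAR10_MEAN, CIFAR10_STD))


# A pixel equal to the dataset mean maps to zero in every channel.
print(normalize_pixel(CIFAR10_MEAN))
```

After this step each channel has approximately zero mean and unit variance over the training set, which tends to make optimization better conditioned.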

2.2.2. Experiment 2 (DVSGesture)

Experiment 2 used the DVSGesture dataset from IBM. These data are captured with DVS cameras that detect changes in brightness at each pixel independently, resulting in asynchronous event streams. The dataset was loaded and preprocessed with the Tonic library [12].

There are 11 gesture-based classes which are numbered from 0 to 10. These classes are: hand clapping, right hand wave, left hand wave, right arm clockwise, right arm counter clockwise, left arm clockwise, left arm counter clockwise, arm roll, air drums, air guitar and other gestures. There are 29 individuals that have performed the gestures, with the data being collected under three different illumination settings for variability. There are 1077 training samples and 264 testing samples. The data is contained in events represented by [t, x, y, p] tuples where:

t is the timestamp of the event,
x, y are the spatial coordinates (128 × 128 resolution),
p is the polarity (0 for off, 1 for on).

This structure provides high temporal resolution and high dynamic range, making it possible to capture the dynamics of fast movements and varied lighting conditions, respectively. Only the pixels that change generate events, which makes the data sparse. Lastly, the data has low latency since the events are generated asynchronously rather than at fixed frame intervals.

The raw event streams were preprocessed with two transformations from the Tonic library. The first transformation denoised the data by removing sensor noise using a temporal filter that eliminates isolated events, that is, events with no neighboring events within a 10,000 microsecond window. The second transformation was event binning, which converted the continuous event stream into a sequence of discrete frames using time binning.

Time binning converts the events into temporal bins with discrete time steps, creating a representation with dimensions [2, time_steps (16), height, width]. Finally, the data is permuted so that the temporal dimension is the primary sequence dimension for use in the models. The resulting order is [time_steps (16), 2, height, width].
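To make the binning step concrete, the sketch below accumulates [t, x, y, p] events into equal-duration time bins and returns event counts already arranged in the time-first order [time_steps, 2, height, width]. It is a simplified pure-Python illustration under the assumption of equal-width bins over the sample's duration; it is not the Tonic implementation, and the function name is hypothetical.

```python
def bin_events(events, n_bins, height, width):
    """Count events per (time bin, polarity, y, x).

    events: iterable of (t, x, y, p) with integer timestamps and p in {0, 1}.
    Returns a nested list shaped [n_bins][2][height][width].
    """
    events = list(events)
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(n_bins)]
    if not events:
        return frames
    t_min = min(e[0] for e in events)
    t_max = max(e[0] for e in events)
    span = max(t_max - t_min, 1)  # avoid division by zero for a single timestamp
    for t, x, y, p in events:
        # Map the timestamp to a bin index; the last timestamp is clamped into the final bin.
        b = min((t - t_min) * n_bins // span, n_bins - 1)
        frames[b][p][y][x] += 1
    return frames


# Tiny example: three events on a 2 x 2 sensor, split into two time bins.
demo = bin_events([(0, 0, 0, 1), (50, 1, 0, 0), (100, 1, 1, 1)],
                  n_bins=2, height=2, width=2)
```

For DVSGesture the call would use n_bins=16 and height=width=128, giving one two-channel frame per time step.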

2.3. Training Protocol

2.3.1. Experiment 1 (CIFAR-10)

The CIFAR-10 data used here are frame-based, as described previously. The two models used for this experiment are ResNet-18 and a convolutional spiking neural network (CSNN). The ResNet-18 model is a well established CNN architecture introduced in Deep Residual Learning for Image Recognition [13]. The CSNN model is based on the spiking neural network framework introduced in snnTorch: spiking neural networks in Python [14], which enables the training of deep spiking architectures using surrogate gradient methods. The hyperparameters of the two models are compared in Table 1.

The ResNet-18 model was optimized using the Adam optimizer with an initial learning rate of 0.001 and a weight decay of 0.0005 to mitigate overfitting. Additionally, a step learning rate scheduler was implemented which reduced the learning rate by a factor of 0.1 every 10 epochs. Cross-entropy loss was employed as the optimization criterion.

The CSNN’s training configuration also utilized the Adam optimizer, with a learning rate of 0.0005 and a weight decay of 0.0001. A step learning rate scheduler with a factor of 0.3 every 10 epochs was applied. Cross-entropy loss was used as the optimization criterion, as for the ResNet-18 model. The CSNN shared the same batch size and number of epochs as the ResNet-18 model, 128 and 30 respectively.
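The effect of a step schedule can be written in closed form: the learning rate at a given epoch is the initial rate multiplied by the factor once for every completed step interval. The sketch below assumes the stated factors (0.1 and 0.3) are multiplicative, as in PyTorch's StepLR; the function name is illustrative.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at a zero-indexed epoch under a multiplicative step schedule."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rates at the start of each 10-epoch stage of a 30-epoch run.
for name, lr0, gamma in [("ResNet-18", 0.001, 0.1), ("CSNN", 0.0005, 0.3)]:
    stages = [step_lr(lr0, gamma, 10, e) for e in (0, 10, 20)]
    print(name, ["%.2e" % lr for lr in stages])
```

Under this assumption, the ResNet-18 learning rate drops by two orders of magnitude over 30 epochs, while the CSNN rate decays more gently, to 9% of its initial value.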

2.3.2. Experiment 2 (DVSGesture)

The DVSGesture data used here are event-based (sequential with time encoding), as described previously. The two models used for this experiment are a convolutional neural network (CNN) and a convolutional spiking neural network (CSNN). The CNN is a traditional feedforward convolutional architecture commonly used for image classification tasks [15]. The hyperparameters of the two models are compared in Table 2.

The CNN was optimized using the Adam optimizer with a learning rate of 0.001 and cross-entropy loss. Training used a batch size of 16 for 30 epochs.

The CSNN also utilized the Adam optimizer with a learning rate of 0.001 and cross-entropy loss, as for the CNN model. It likewise used a batch size of 16 and trained for 30 epochs.

2.4. Model Design

2.4.1. Experiment 1 (CIFAR-10)

ResNet-18

The ResNet-18 model in this experiment was implemented with PyTorch and consists of 18 layers with residual connections. The ResNet-18 variant from torchvision.models was chosen because it offers a good balance between depth and computational efficiency. To ensure compatibility with the CIFAR-10 dataset, the final fully connected layer of the original ResNet-18 model is replaced with a new linear layer that maps to the 10 classes corresponding to the dataset’s image categories.

Pretrained weights are disabled so that the model learns features from scratch specific to the CIFAR-10 domain, instead of reusing features learned from other domains such as ImageNet. Training ran for 30 epochs with a batch size of 128, which strikes a balance between computational efficiency and stable gradient updates for the given hardware.

Convolutional Spiking Neural Network (CSNN)

The second model used is our convolutional spiking neural network. It combines the temporal processing capabilities of spiking neurons along with the spatial feature extraction capabilities of convolutional neural networks [16,17].

This model consists of three parts: convolutional blocks, a pooling layer, and a fully connected layer. Each convolutional block has three sub-parts: a convolutional layer, batch normalization, and a leaky integrate-and-fire (LIF) neuron layer [18]. The first block starts with a 2D convolution that takes a 3-channel RGB input and produces 64 feature maps while preserving spatial resolution with a 3 × 3 kernel, padding 1, and stride 1. Batch normalization is then applied to stabilize and speed up training. The LIF neuron layer follows, using a decay rate (beta) of 0.95 along with a surrogate gradient function for backpropagation. This LIF layer allows the network to encode temporal information across time steps, enabling efficient processing of dynamic input patterns. The number of time steps for this network is set to 3. This block structure is repeated across four blocks, with the number of feature channels doubling per block: 64, 128, 256, 512. All blocks after the first use a stride of 2 to reduce the spatial dimensions. The final feature maps undergo global average pooling to a 1 × 1 spatial size and are flattened. They are then passed to a fully connected layer with ten output neurons, one per class.
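The role of the decay rate can be illustrated with a single LIF neuron updated over a few time steps. The sketch below is a simplified textbook formulation, not snnTorch's internal implementation: the firing threshold of 1.0 and the reset-by-subtraction rule are assumptions chosen for illustration, and the constant input of 0.6 is arbitrary.

```python
def lif_step(mem, inp, beta=0.95, threshold=1.0):
    """One discrete-time LIF update: leak, integrate, fire, then reset by subtraction."""
    mem = beta * mem + inp                 # leaky integration of the input current
    spike = 1 if mem > threshold else 0    # fire when the membrane potential crosses threshold
    mem = mem - spike * threshold          # subtract the threshold after a spike
    return spike, mem


def run_lif(inputs, beta=0.95, threshold=1.0):
    """Run one neuron over a sequence of inputs and return its spike train."""
    mem, spikes = 0.0, []
    for inp in inputs:
        spike, mem = lif_step(mem, inp, beta, threshold)
        spikes.append(spike)
    return spikes


# Three time steps (as in the CIFAR-10 CSNN) with a constant input of 0.6.
print(run_lif([0.6, 0.6, 0.6]))  # [0, 1, 0]
```

With beta close to 1 the membrane potential retains most of its value between steps, so sub-threshold inputs can accumulate into a spike at a later time step.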

The CSNN’s architecture is shown in Table 3.

2.4.2. Experiment 2 (DVSGesture)

Convolutional Neural Network (CNN)

Our first model in this experiment is a convolutional neural network engineered to handle the spatiotemporal nature of the data. There are two main components for this architecture, the first being the convolutional blocks, which are essentially the channel/feature extraction backbone. The second component is the classification head.

There are four convolutional blocks, and each block contains a convolutional layer, batch normalization, and a ReLU activation. The first block begins with 2 input channels and expands to 32 feature maps using 3 × 3 convolutions with padding to preserve spatial dimensions. The feature depth increases from 32 to 64, then to 128, and finally to 256 features over the next three blocks. Moreover, a 2 × 2 max pooling operation is included in blocks two through four to halve the spatial dimensions in each block.

Following the convolutional blocks, the spatial feature maps are flattened to a vector of 65,536 features (256 × 16 × 16). The classification head consists of a fully connected layer that reduces the feature dimensionality from 65,536 to 512. After this reduction, there is a ReLU activation and dropout with a probability of 0.5 to prevent overfitting. Finally, the last fully connected layer maps the features to 11 output neurons, one per gesture class.
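The flattened size of 65,536 follows from the 128 × 128 input resolution and the three pooling stages. The short check below traces the spatial size through the four blocks; it assumes padding of 1 and stride of 1 for every 3 × 3 convolution, which is consistent with the statement that the convolutions preserve spatial dimensions.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Output size of non-overlapping max pooling along one spatial dimension."""
    return size // kernel


size = 128                      # DVSGesture frames are 128 x 128
for block in range(1, 5):
    size = conv_out(size)       # 3 x 3 convolution keeps the spatial size
    if block >= 2:              # blocks two through four halve it
        size = pool_out(size)

flat_features = 256 * size * size
print(size, flat_features)      # 16 65536
```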

The temporal processing is applied in the architecture as each time step in the sequence is independently processed through the convolutional blocks. It is important to note that prior to the classification, features from all time steps are aggregated with mean pooling so that the temporal dynamics can be captured from the gestures.

The CNN’s architecture is shown in Table 4.

Convolutional Spiking Neural Network (CSNN)

The second model in experiment 2 is a convolutional spiking neural network. This architecture accounts for the temporal dynamics within neuromorphic data by processing 16 discrete time steps sequentially. It has two components: the convolutional blocks and the classification head. The key difference from the CNN, in both components, is that the activation function is replaced with a spiking LIF neuron layer. The LIF neuron layers all use a decay rate of 0.5 and an arctangent surrogate gradient function with alpha set to 2.0 to enable backpropagation through the network [19,20].
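A surrogate gradient keeps the hard threshold in the forward pass and substitutes a smooth derivative in the backward pass. The sketch below uses one commonly cited arctangent form, the derivative of (1/π)·arctan(π·α·u/2) with respect to u, where u is the membrane potential minus the threshold. Whether this matches the library implementation used in the study term for term is an assumption, and the function names are illustrative.

```python
import math


def heaviside(u):
    """Forward pass: emit a spike when the shifted membrane potential is positive."""
    return 1.0 if u > 0 else 0.0


def atan_surrogate_grad(u, alpha=2.0):
    """Backward pass: smooth stand-in for the derivative of the Heaviside step."""
    return (alpha / 2.0) / (1.0 + (math.pi / 2.0 * alpha * u) ** 2)


# The surrogate peaks at the threshold (u = 0) and decays symmetrically away from it.
for u in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(f"u={u:+.2f}  spike={heaviside(u):.0f}  grad={atan_surrogate_grad(u):.4f}")
```

In this form a larger alpha gives a taller, narrower peak, concentrating the gradient on neurons whose membrane potential is close to the threshold.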

There are a total of four convolutional blocks that each contain a convolutional layer, batch normalization, and a spiking LIF neuron layer. Within the first block the 2 input channels are transformed to 32 feature maps with a 3 × 3 convolution with padding. This is expanded to 64, 128, and 256 feature maps through blocks two through four. Blocks two through four also contain a max pooling operation that reduces spatial dimension by half each block, resulting in a final map size of 16 × 16.

These blocks are followed by a two-layer fully connected classification head. The 256 feature maps of size 16 × 16 are first flattened into a 65,536-dimensional vector, which the first fully connected layer projects linearly onto 512 spiking neurons. The LIF neurons here use the same leaky dynamics employed in the convolutional layers, integrating information across time steps. The final fully connected layer maps the 512 features produced by the prior layer to 11 output neurons, one per class. These outputs are also processed through LIF neurons. The head maintains spiking neuron dynamics throughout, so the final prediction is based on the accumulation of output spikes across all time steps. This is a rate coding scheme in which the total spike count of each output neuron determines the network’s confidence in each gesture class.
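The rate-coded readout can be sketched in a few lines: output spikes are summed per class over all time steps, and the class with the highest count is the prediction. The spike train below is made up for illustration (three time steps, three classes) rather than taken from the experiments, and the function name is hypothetical.

```python
def predict_from_spikes(spike_train):
    """Rate-coded readout.

    spike_train: list over time steps; each entry is a list of 0/1 spikes, one per class.
    Returns (predicted class index, per-class spike counts).
    """
    n_classes = len(spike_train[0])
    counts = [sum(step[c] for step in spike_train) for c in range(n_classes)]
    return counts.index(max(counts)), counts


# Hypothetical output spikes for 3 time steps and 3 classes.
pred, counts = predict_from_spikes([[0, 1, 0],
                                    [0, 1, 1],
                                    [1, 1, 0]])
print(pred, counts)  # 1 [1, 3, 1]
```

In the DVSGesture CSNN the same readout would run over 16 time steps and 11 output neurons.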

The CSNN’s architecture is shown in Table 5.

3. Results and Discussion

This section is divided into two subsections to isolate the results of each experiment. The following Section 3.3 aims to help bring all the information together for a complete analysis.

3.1. Experiment 1 (CIFAR-10)

The results from this experiment vary depending on the model and platform used, as shown in Table 6, where the average values represent the average amount used per epoch. Experiments were repeated three times, and the results reported in the tables reflect the mean ± standard deviation across runs. Memory-related measurements are reported as single values, as they showed negligible variation between runs. Total energy is calculated using the trapezoidal rule for numerical integration. By averaging adjacent power readings and multiplying by the time interval between them (0.5 s), we approximate the area under the power curve. This gives the energy in joules (watt-seconds), which can be converted to watt-hours for more intuitive interpretation. Because the power measurements are sampled at discrete intervals (0.5 s), short-term fluctuations between samples may be missed. Consequently, the calculated total energy and average power are approximate and may slightly underestimate or overestimate the true continuous power consumption.
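The trapezoidal energy estimate described above can be written as a short function. This is a minimal sketch assuming equally spaced samples; the function name and the power readings in the example are hypothetical.

```python
def energy_watt_hours(power_w, dt_s=0.5):
    """Integrate equally spaced power samples (watts) with the trapezoidal rule.

    Each pair of adjacent readings contributes their average times dt_s (joules);
    the total is divided by 3600 to convert joules to watt-hours.
    """
    joules = sum((a + b) / 2.0 * dt_s for a, b in zip(power_w, power_w[1:]))
    return joules / 3600.0


# Hypothetical readings taken 0.5 s apart (not values from the study).
readings = [10.0, 12.0, 11.0, 10.0]
print(energy_watt_hours(readings))  # 16.5 J expressed in watt-hours
```

A constant draw of 10 W sampled for one hour (7201 readings) yields 10 Wh, which is a quick sanity check on the unit conversion.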

It is evident that the ResNet-18 CNN yielded higher accuracy and lower computing times than its CSNN counterpart on both devices. ResNet-18 achieved 83.84% accuracy on Apple Silicon (with MPS) and 84.09% on NVIDIA (with CUDA), while the CSNN reached 72.51% on Apple Silicon (with MPS) and 71.45% on NVIDIA (with CUDA). Inference took longer both with the CSNN architecture and on the Apple Silicon (with MPS) platform, so the combination of NVIDIA (with CUDA) and ResNet-18 was the fastest for inference. The marginally better accuracy and speed observed on NVIDIA’s CUDA for ResNet-18 may suggest that CUDA is better optimized or that NVIDIA hardware offers better floating-point precision. The accuracy, F1, precision, and recall all yield similar results on both platforms for each model (see Appendix A for per-class F1 scores). That is, ResNet-18 and the CSNN each had similar model metrics across platforms, as shown in Figures 1 and 2 and Figures 3 and 4, respectively.

Memory utilization revealed that Apple Silicon (with MPS) consumed significantly less RAM than NVIDIA (with CUDA), approximately 33–55% less. Both models allocated slightly more GPU memory on Apple Silicon than on NVIDIA, and the same holds for GPU reserved memory. Additionally, the GPU allocated memory is lower for the CSNN and higher for the ResNet. Conversely, the GPU reserved memory is much higher with the CSNN architecture.

Power efficiency differed markedly across both hardware and models. NVIDIA’s CUDA consumed substantially more power than Apple Silicon for both architectures, with an over 8× increase for the CSNN and an over 5× increase for the ResNet. Further, NVIDIA’s CUDA also used over 3× more total watt-hours for both models compared to Apple Silicon (with MPS).

The CSNN took much longer than ResNet-18 on both platforms. Compared to the ResNet-18 completion time, it took 3.65× longer on Apple Silicon (with MPS) and 1.94× longer on NVIDIA (with CUDA). This highlights a tradeoff between the platforms: Apple Silicon offers better energy efficiency, but at the cost of longer processing times.

3.2. Experiment 2 (DVSGesture)

Experiment 2 (DVSGesture) yielded significant performance variations between both the models and the devices, as shown in Table 7, where the average values again represent the average amount used per epoch. The total energy is calculated the same way as in experiment 1. The CNN architecture achieved higher accuracy than the CSNN on both devices, scoring 83.36% on Apple Silicon (with MPS) and 89.02% on NVIDIA (with CUDA), while the CSNN reached 80.30% on Apple Silicon and 77.27% on NVIDIA. The CNN therefore scored better on NVIDIA (with CUDA), while the CSNN scored better on Apple Silicon (with MPS). Inference was again faster with the traditional neural network (the CNN in this experiment) than with the CSNN. Additionally, NVIDIA (with CUDA) had faster inference for the CNN, but this time Apple Silicon (with MPS) ran CSNN inference faster.

The accuracy, F1, precision, and recall were similar on both platforms for each model, as shown in Figure 5, Figure 6, Figure 7 and Figure 8. The “Right Hand Wave” class performed the best on every single model, while the worst class varied between models (see Appendix A for per-class F1 scores).

Memory utilization shows that both models consumed less RAM on Apple Silicon (with MPS) than on NVIDIA (with CUDA): 12.91% less for the CNN and 9.67% less for the CSNN. The GPU allocated memory remained consistent across both platforms for the two models. Further, the GPU reserved memory showed similar usage across the two devices for each architecture. However, regardless of the platform, the CSNN required more memory.

There was a large difference in power efficiency in this experiment. NVIDIA (with CUDA) consumed 13× more power than Apple Silicon (with MPS) on the CNN, using 139.11 W compared to 10.67 W. Per epoch, NVIDIA (with CUDA) actually used less power on the CSNN than on the CNN, whereas Apple Silicon used more. Additionally, the total energy consumption shows that the CSNN architecture needed more energy. NVIDIA (with CUDA) needed far more energy for the experiment than Apple Silicon (with MPS): about 5.25× more for the CNN and about 16× more for the CSNN. Moreover, NVIDIA (with CUDA) required about 5.4× more energy for the CSNN than for the CNN, whereas Apple Silicon (with MPS) required only about 7 more watt-hours for the CSNN than for the CNN.

The CNN run completed about 2.65× faster on NVIDIA (with CUDA) than on Apple Silicon (with MPS). This advantage is reversed for the CSNN model, where Apple Silicon delivers a 2.07× speedup over NVIDIA. This may suggest that while NVIDIA is optimized for conventional neural networks with frame-based data, it may struggle with event-based data on SNN-type models. Overall, the CNN computed faster than the CSNN regardless of the device.

3.3. Discussion

The results of both experiments provided insights regarding architecture and data modalities.

Experiment 1 had both models process frame-based data, leading ResNet-18 to have superior performance, specifically in terms of accuracy. Because the frame-based data lack the temporal encoding present in event-based (sequential) data, the CSNN could not exploit its spike-based processing capabilities, leading to sub-optimal performance.

On the other hand, in experiment 2 (DVSGesture) the CSNNs achieved accuracy close to that of the CNNs because the event-based data provided temporal information that the CSNN could utilize. It is important to distinguish here between conceptual suitability and empirical superiority: while CNNs still achieved higher accuracy in the reported experiments, CSNNs are fundamentally better matched to event-based data than to frame-based data, given the shared sparse, asynchronous, and temporally structured nature of both the data and the network’s processing paradigm. This performance gap likely reflects the relative maturity of CNN training pipelines and optimization techniques rather than a fundamental ceiling for CSNNs.

On the hardware side, Apple Silicon (with MPS) was more power efficient than NVIDIA (with CUDA) across all tasks. In exchange, it yielded marginally lower accuracy and longer runtimes for the ResNet-18 model in experiment 1 and for the CNN in experiment 2. Although Apple Silicon (with MPS) achieved higher accuracy with the CSNN model in both experiments, in experiment 1 it ran 1236 s longer than NVIDIA (with CUDA) for that model. In experiment 2, Apple Silicon (with MPS) achieved better performance for the CSNN than NVIDIA (with CUDA) while using significantly less power and time. This suggests differences in how the hardware accelerators handle the temporal dynamics and sparse activation patterns of spiking neural networks. For standard CNNs, Apple Silicon (with MPS) demonstrated much better power utilization at the cost of a slower runtime. The inverse was shown with NVIDIA (with CUDA).

An important point of discussion is why the CSNN consumed more energy than the other networks, especially on NVIDIA with CUDA. Several factors can explain this:

First, there is a software simulation overhead: the CSNNs rely on software simulation of spiking behavior running on traditional GPUs designed for dense matrix operations. Dedicated neuromorphic hardware would allow us to see the promised energy savings, as such chips are designed for sparse, spike-driven computation. To model the typical components of an SNN, such as membrane potentials, synaptic dynamics, and spike generation, the simulation has to introduce additional operations, which in turn create computational overhead. These components are typically not natively supported by GPU architectures. In short, the GPU used here provides its general-purpose benefits, but the CSNN workload does not necessarily exploit the specialized optimizations that make GPUs fast at dense tensor operations such as matrix multiplication.
Second, the processor clock speed (GHz) likely has a significant impact on CSNN performance given the sequential nature of temporal processing in these networks. Because these networks run across multiple time steps, higher clock speeds would accelerate the computation of each time step and potentially reduce the overall runtime. This effect is more pronounced for CSNNs than for CNNs, as CSNNs have more sequential dependencies due to their temporal nature. The software simulation discussed above would also benefit from higher clock speeds, since the increased processor frequency would allow its additional operations to execute faster.
Third, temporal dynamics processing makes the network run across multiple time steps for a single input. A standard CNN passes an input through the network once, whereas the CSNN has to process each input across T time steps.
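To make the per-time-step cost concrete, the following is a minimal sketch of a single leaky integrate-and-fire neuron simulated in software. The reset-by-subtraction update rule, the threshold of 1.0, and the function name `lif_forward` are assumptions chosen for illustration; only beta = 0.95 is taken from Table 1, and this is not the implementation used in the experiments.

```python
def lif_forward(input_current, num_steps, beta=0.95, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron for num_steps time steps.

    Assumed update rule (reset by subtraction):
        U[t] = beta * U[t-1] + I - S[t-1] * threshold
    """
    membrane = 0.0
    spike = 0
    spikes = []
    state_updates = 0
    for _ in range(num_steps):
        # The state update runs at every step, whether or not a spike occurs.
        membrane = beta * membrane + input_current - spike * threshold
        spike = 1 if membrane > threshold else 0
        spikes.append(spike)
        state_updates += 1
    return spikes, state_updates


spikes, updates = lif_forward(input_current=0.3, num_steps=12)
print(spikes, updates)
```

The number of state updates equals the number of time steps regardless of how few spikes are emitted, which is the extra work a dense simulation performs for every neuron in every layer.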

Additionally, Apple Silicon (with MPS) consistently used less RAM than NVIDIA (with CUDA) in both experiments. The allocated and reserved GPU memory were similar on both platforms for each experiment and model, but NVIDIA (with CUDA) used less in most cases. In terms of architecture, the CSNN required larger GPU memory reservations, primarily because neuron states must be maintained across time steps.
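As a rough illustration of why neuron states add to memory demand, the sketch below counts the membrane values implied by the LIF output shapes in Table 5. It assumes 32-bit floats and that one membrane value per neuron is kept for every sample and time step (as training through time would require); both are assumptions, and the result is not a measurement from the experiments.

```python
from math import prod

# Output shapes of the LIF layers listed in Table 5 (DVSGesture CSNN).
LIF_OUTPUT_SHAPES = [
    (32, 128, 128),
    (64, 128, 128),
    (128, 64, 64),
    (256, 32, 32),
    (512,),
    (11,),
]


def membrane_state_bytes(shapes, batch_size, num_steps, bytes_per_value=4):
    """Bytes needed to hold one membrane value per neuron, sample, and step."""
    neurons = sum(prod(shape) for shape in shapes)
    return neurons * batch_size * num_steps * bytes_per_value


neurons_per_sample = sum(prod(shape) for shape in LIF_OUTPUT_SHAPES)
# Batch size 16 and 16 time steps are taken from Table 2.
total = membrane_state_bytes(LIF_OUTPUT_SHAPES, batch_size=16, num_steps=16)
print(neurons_per_sample, f"{total / 2**20:.0f} MiB")
```

Under these assumptions the membrane potentials alone occupy roughly 2300 MiB, before counting spikes, weights, gradients, or allocator overhead.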

3.4. Limitations

Several limitations of this study warrant acknowledgment, particularly regarding the cross-platform power measurement methodology.

As discussed in Section 2, power consumption was sampled using powermetrics on Apple Silicon and nvidia-smi on NVIDIA. While both tools measure power, they do not necessarily reflect equivalent scopes of power consumption. Powermetrics on Apple Silicon reports package-level power that encompasses the CPU, GPU, and Neural Engine as a unified system-on-chip, whereas nvidia-smi reports power draw for the discrete GPU only, excluding the host CPU and system memory subsystems [21,22]. This asymmetry means that the Apple Silicon measurements may capture a broader range of system activity than the NVIDIA measurements, and direct numerical comparisons of absolute energy figures across platforms should therefore be interpreted with caution.
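Whatever the measurement scope, sampled power readings have to be converted into an energy figure. One common way to do this is trapezoidal integration over the sample timestamps, sketched below. The function name `energy_wh` and the sample values are hypothetical, and this is not necessarily the procedure used to produce the reported watt-hour totals.

```python
def energy_wh(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule.

    Returns the energy in watt-hours.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0  # 1 Wh = 3600 J


# Hypothetical readings taken once per second (not from the experiments).
samples = [(0.0, 10.0), (1.0, 12.0), (2.0, 11.0), (3.0, 10.0)]
print(f"{energy_wh(samples):.6f} Wh")
```

Because the two tools report power for different hardware scopes, the same integration applied to each platform still yields figures that are not strictly comparable.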

Beyond measurement scope, the two platforms differ substantially in hardware architecture. Apple Silicon employs a System on a Chip (SoC) design that integrates the CPU, GPU, and additional components onto a single chip, paired with a unified memory architecture in which all processors share the same physical memory pool [3,23]. This stands in contrast to the NVIDIA configuration, which uses a discrete GPU with its own dedicated memory, connected to the host system over a PCIe interconnect. The SoC design is a contributing factor to why powermetrics captures a broader scope of power consumption than nvidia-smi: on such a design, it is not straightforward to isolate GPU power draw alone. Furthermore, Apple Silicon is built on an ARM-based RISC architecture optimized for performance per watt, which is reflected in its significantly lower measured energy consumption across all experiments. While this is a genuine architectural advantage, it also means the two platforms are not directly comparable in the way that two discrete GPUs of similar class might be. These architectural differences influence not only power draw but also memory bandwidth and the efficiency with which sparse activation patterns, a main characteristic of spiking neural networks, are handled. Observed differences in energy and runtime between platforms may therefore reflect these architectural distinctions as much as any intrinsic property of the models evaluated.

Finally, all spiking neural network experiments were conducted via software simulation on general-purpose GPU hardware rather than on dedicated neuromorphic processors. The energy figures reported here therefore represent simulated spiking behavior on hardware optimized for dense tensor operations, and should not be taken as indicative of the energy efficiency achievable on purpose-built neuromorphic hardware such as Intel Loihi or IBM TrueNorth.

3.5. Supplementary Analysis: Accuracy-Energy Trade-Off

We conducted a supplementary analysis to assess the interdependence between test accuracy and energy consumption in spiking neural networks, examining how specific accuracy improvements affect overall energy efficiency. As noted in prior work, SNN performance characteristics are intrinsically linked: increasing the number of time steps typically improves accuracy but correspondingly increases energy consumption [24]. Therefore, accuracy should not be evaluated in isolation but rather as a function of computational cost.

In these extended experiments, we systematically varied two key parameters for the SNN models: (1) the training duration (number of epochs), and (2) the number of time steps, as shown in Table 8. This allowed us to achieve higher accuracy and quantify the resulting impact on energy consumption metrics. The training duration was increased to 40 epochs for both SNN models. Each experiment was run five times consecutively, each time with a different time step configuration, to systematically evaluate the accuracy-energy trade-off. For experiment 1 (CIFAR-10), the number of time steps was set to 1, 3, 6, 9, and 12. For experiment 2 (DVSGesture), the number of time steps was set to 4, 8, 12, 16, and 20. All other hyperparameters remained identical to those used in the primary experiments (see Table 1 and Table 2).
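The sweep described above can be written down as a small configuration. This is a sketch only: the dictionary keys and the function name `sweep_runs` are hypothetical, while the epoch count and time step values are those stated in the text.

```python
EPOCHS = 40

# Time step settings per experiment, as listed in the text.
TIME_STEP_GRID = {
    "CIFAR-10": [1, 3, 6, 9, 12],
    "DVSGesture": [4, 8, 12, 16, 20],
}


def sweep_runs(grid, epochs):
    """Yield one run description per dataset and time step setting."""
    for dataset, step_values in grid.items():
        for num_steps in step_values:
            yield {"dataset": dataset, "num_steps": num_steps, "epochs": epochs}


runs = list(sweep_runs(TIME_STEP_GRID, EPOCHS))
for run in runs:
    print(run)
```

This enumerates ten runs in total, five per dataset, each trained for 40 epochs.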

Due to time and resource constraints, these supplementary experiments were conducted exclusively on Apple Silicon (with MPS) rather than across both hardware architectures used in the primary experiments. While this limits direct cross-platform comparison for the extended analysis, it provides valuable insight into the accuracy-energy trade-off inherent to SNN implementations.

3.5.1. CIFAR-10: Accuracy-Energy Trade-Off Analysis Across Time Steps

Table 9 presents the performance metrics across five different time step configurations. Increasing the number of time steps from 1 to 12 yielded a substantial accuracy improvement of 22.79 percentage points: the test accuracy rose from 59.03% to 81.82%. However, total energy consumption increased from 3.04 Wh to 48.27 Wh, representing a 15.9× increase.

Figure 9a illustrates the energy-accuracy trade-off curve, revealing diminishing returns as time steps increase. The steepest accuracy gain, 14.55 percentage points, occurs between 1 and 3 time steps. Subsequent time steps yield progressively smaller improvements. From 9 to 12 time steps, accuracy improved by only 0.69 percentage points despite consuming an additional 12.29 Wh of energy. The relationship between time steps and accuracy in Figure 9b shows a logarithmic growth pattern, with accuracy approaching an asymptote around 82%.
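The diminishing returns can be expressed as accuracy gained per additional time step, using only the figures quoted in this paragraph. The helper name `gain_per_step` is an illustrative assumption.

```python
def gain_per_step(accuracy_gain_pp, step_from, step_to):
    """Average accuracy gain (percentage points) per added time step."""
    return accuracy_gain_pp / (step_to - step_from)


early = gain_per_step(14.55, 1, 3)   # gain per step between 1 and 3 time steps
late = gain_per_step(0.69, 9, 12)    # gain per step between 9 and 12 time steps

# Marginal energy spent per percentage point between 9 and 12 time steps.
marginal_wh_per_pp = 12.29 / 0.69

print(f"early: {early:.3f} pp/step, late: {late:.3f} pp/step")
print(f"marginal cost from 9 to 12 steps: {marginal_wh_per_pp:.1f} Wh per pp")
```

On these figures, each added time step contributes about 7.3 percentage points early in the sweep but only about 0.23 percentage points at the end, where one further percentage point costs close to 18 Wh.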

The energy metrics in Figure 10 reveal the computational costs of increased temporal processing. Total energy consumption, shown in Figure 10a, grows nearly linearly with the number of time steps. Average power consumption, shown in Figure 10b, increased from 10.87 W at 1 time step to 17.87 W at 12 time steps, indicating higher sustained power draw with increased temporal processing.

The energy cost per percentage point of accuracy, from the final column in Table 9, ranged from 0.0514 Wh per % accuracy at 1 time step to 0.5900 Wh per % accuracy at 12 time steps. This represents an 11.5× increase in energy cost.

3.5.2. DVSGesture: Accuracy-Energy Trade-Off Analysis Across Time Steps

Table 10 presents the performance metrics across five different time step configurations for the DVSGesture experiment. The relationship between time steps and accuracy did not follow a consistent monotonic pattern. The 4 time step configuration achieved the highest test accuracy at 81.06%, while increasing the number of time steps to 8 and 12 resulted in decreased accuracy (73.86% and 74.62%, respectively). Accuracy recovered at 16 and 20 time steps, reaching 79.17% and 80.30%, respectively. Despite this non-monotonic accuracy trend, total energy consumption increased consistently from 5.03 Wh at 4 time steps to 25.13 Wh at 20 time steps, a 4.99× increase.

Figure 11a illustrates the energy-accuracy relationship. The 4 time step configuration represents the most efficient operating point, achieving 81.06% accuracy with only 5.03 Wh of energy consumption. The relationship between time steps and accuracy, shown in Figure 11b, exhibits an unexpected dip at the intermediate time step values of 8 and 12, suggesting potentially suboptimal training dynamics or temporal processing inefficiencies at these configurations.

The energy metrics in Figure 12 reveal the computational costs across configurations. Total energy consumption in Figure 12a demonstrates approximately linear growth from 4 to 16 time steps, with a notable jump at 20 time steps due to significantly increased training time (11,968.72 s at 20 time steps compared to 5214.28 s at 16 time steps). Average power consumption in Figure 12b remained relatively stable between 12.34 W and 12.80 W for time step configurations 4 through 16, but dropped to 7.35 W at 20 time steps. This drop is possibly due to thermal throttling or power management adjustments during the extended training period.

The energy efficiency metric ranged from 0.0621 Wh per % accuracy at 4 time steps to 0.3129 Wh per % accuracy at 20 time steps as presented in the final column of Table 10. This is a 5.04× decrease in energy efficiency. Notably, the 4 time step configuration demonstrated superior energy efficiency compared to all other configurations, achieving both the highest accuracy and lowest energy cost per percentage point.
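The endpoint comparisons reported for both sweeps can be recomputed from the rounded accuracy and energy values quoted in Sections 3.5.1 and 3.5.2. The following is a sketch; the function name `endpoint_summary` is an illustrative assumption, and because the inputs are rounded, the last digit may differ slightly from the reported ratios.

```python
def endpoint_summary(first, last):
    """Compare the first and last configurations of a time step sweep.

    Each argument is an (accuracy_percent, energy_wh) pair.
    """
    (acc_a, wh_a), (acc_b, wh_b) = first, last
    cost_a = wh_a / acc_a  # Wh per percentage point of accuracy
    cost_b = wh_b / acc_b
    return {
        "accuracy_change_pp": acc_b - acc_a,
        "energy_fold": wh_b / wh_a,
        "cost_fold": cost_b / cost_a,
    }


cifar10 = endpoint_summary((59.03, 3.04), (81.82, 48.27))     # 1 vs. 12 time steps
dvsgesture = endpoint_summary((81.06, 5.03), (80.30, 25.13))  # 4 vs. 20 time steps

for name, summary in (("CIFAR-10", cifar10), ("DVSGesture", dvsgesture)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

For CIFAR-10, energy grows roughly 15.9× for a gain of 22.79 percentage points, whereas for DVSGesture energy grows roughly 5× while accuracy ends slightly below its starting value.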

3.5.3. Discussion: Accuracy-Energy Trade-Off Analysis Across Time Steps

This analysis revealed critical insights into the relationship between temporal processing depth, accuracy, and energy consumption in CSNNs. By examining an array of different time steps and extending training to 40 epochs, these experiments addressed the fundamental interdependence between accuracy and computational cost in spiking neural networks.

The contrasting patterns between experiments 1 and 2 underscore the importance of data modality for CSNN performance and for SNNs as a whole. For frame-based data (experiment 1), achieving competitive accuracy required substantial temporal processing depth, with a clear logarithmic relationship showing diminishing returns at higher time step values. This highlights a challenge in deploying CSNNs on frame-based data: each additional time step incurs considerable energy penalties when operating on hardware not optimized for spike-based computation, yet sufficient temporal depth is necessary for reasonable accuracy.

Event-based DVSGesture data (experiment 2) demonstrated markedly different behavior, with optimal performance occurring at lower time step values. The non-monotonic accuracy pattern, with intermediate time steps (8 and 12) underperforming compared to lower (4) and higher (16 and 20) settings, suggests a fundamental mismatch between the temporal processing requirements of event-based and frame-based representations. This may indicate temporal aliasing or suboptimal integration periods for sparse event streams at intermediate configurations. In other words, the network at these intermediate time steps may be missing or focusing on the wrong temporal aspects of the gesture data. The superior efficiency at lower time steps demonstrates that event-based data can leverage the temporal processing capabilities of CSNNs more effectively than frame-based inputs, reinforcing the conclusion from the primary experiments that CSNNs are fundamentally better suited to event-based (sequential with time encoding) data.
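To illustrate how the number of time steps changes the integration period for an event stream, the sketch below splits a recording into equal-duration bins and counts the events in each. Equal-duration binning, the function name `bin_events`, and the timestamps are assumptions for illustration; the preprocessing actually applied to DVSGesture may differ.

```python
def bin_events(timestamps_us, num_bins):
    """Count events per bin after splitting the recording into equal spans."""
    if not timestamps_us:
        return [0] * num_bins
    start, end = min(timestamps_us), max(timestamps_us)
    span = max(end - start, 1)  # avoid division by zero for a single instant
    counts = [0] * num_bins
    for t in timestamps_us:
        # The final timestamp falls on the upper edge, so clamp it to the last bin.
        index = min((t - start) * num_bins // span, num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical timestamps in microseconds, forming three bursts of activity.
events = [0, 10, 20, 30, 500, 510, 520, 990, 1000]
print(bin_events(events, 4))
print(bin_events(events, 8))
```

With more bins, each time step integrates a shorter window, so the same bursts are spread over more, and often emptier, inputs to the network.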

The power consumption patterns also revealed hardware considerations beyond total energy. Average power as a function of time steps increased sublinearly in experiment 1 (CIFAR-10), while in experiment 2 (DVSGesture) it remained approximately constant before dropping sharply at 20 time steps. This drop in experiment 2 is likely attributable to thermal throttling during extended training with deeper temporal processing and increased epochs. Despite these variations in average power, total energy consumption continued to scale approximately linearly. This behavior is expected, as increasing the number of time steps linearly increases the duration of temporal computation in the CSNN, requiring repeated neuron state updates, spike propagation, and synaptic operations at each time step. Consequently, even when hardware power management mechanisms limit instantaneous power, the extended execution time dominates, resulting in an approximately linear increase in total energy consumption.
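The dominance of execution time over instantaneous power can be checked with the DVSGesture figures from Section 3.5.2, approximating energy as average power multiplied by duration. This is a rough sketch: the function name is an assumption, and because the exact average power at 16 time steps is not stated, both ends of the reported 12.34 W to 12.80 W range are used.

```python
def energy_from_average(avg_power_w, duration_s):
    """Approximate energy in Wh as average power times duration."""
    return avg_power_w * duration_s / 3600.0


# 16 time steps: 5214.28 s at an average power between 12.34 W and 12.80 W.
e16_low = energy_from_average(12.34, 5214.28)
e16_high = energy_from_average(12.80, 5214.28)

# 20 time steps: 11,968.72 s at a lower average power of 7.35 W.
e20 = energy_from_average(7.35, 11968.72)

print(f"16 steps: {e16_low:.2f} to {e16_high:.2f} Wh")
print(f"20 steps: {e20:.2f} Wh")
```

Even with average power about 40% lower, the 20 time step run uses more energy because it runs more than twice as long. The approximation gives about 24.4 Wh, close to but not identical to the reported 25.13 Wh.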

These findings emphasize that evaluating CSNN performance solely on accuracy or energy consumption in isolation provides an incomplete picture. The accuracy-energy trade-off must be considered holistically, with the optimal operating point depending on the specific application requirements, data modality, and available computational resources. For applications where energy efficiency is paramount, configurations with fewer time steps may be preferable even at the cost of some accuracy loss. For applications requiring maximum accuracy, the energy penalties of increased temporal processing must be factored into deployment decisions.
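One simple way to act on this trade-off is to pick the lowest-energy configuration that still meets an accuracy floor. The sketch below uses CIFAR-10 points stated in Section 3.5.1; the 9 time step values are derived from the reported differences (81.82 − 0.69 and 48.27 − 12.29) rather than read from Table 9, and the function name `pick_operating_point` is hypothetical.

```python
def pick_operating_point(configs, min_accuracy):
    """Return the lowest-energy configuration meeting an accuracy floor."""
    feasible = [c for c in configs if c["accuracy"] >= min_accuracy]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["energy_wh"])


CIFAR10_POINTS = [
    {"num_steps": 1, "accuracy": 59.03, "energy_wh": 3.04},
    # Derived from the text: 81.82 - 0.69 and 48.27 - 12.29.
    {"num_steps": 9, "accuracy": 81.13, "energy_wh": 35.98},
    {"num_steps": 12, "accuracy": 81.82, "energy_wh": 48.27},
]

for floor in (55.0, 80.0, 81.5, 90.0):
    print(floor, pick_operating_point(CIFAR10_POINTS, floor))
```

A floor of 80% selects 9 time steps, whereas raising it to 81.5% forces 12 time steps and about 12 Wh more energy for less than one additional percentage point.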

4. Conclusions

This cross-platform and cross-architecture study explored the performance of Apple Silicon (with MPS) as compared to NVIDIA (with CUDA), and how simulated neuromorphic computing approaches compare with traditional CNNs on different data types (sequential event-based vs. frame-based). The findings suggest that optimal platform-architecture combinations depend on specific application requirements. The DVSGesture experiment showed that sequential event-based data is better suited to the CSNN architecture, allowing it to compete effectively with traditional CNNs. On the other hand, the CIFAR-10 experiment showed that frame-based data significantly degrades the performance of spiking neural networks when proper temporal encoding is absent. One reason is that CSNNs are fundamentally better aligned with sequential time-based data and are conceptually a more natural fit for modeling event-based data than CNNs, even though CNNs achieved higher accuracy in the reported experiments. Additionally, CSNNs on sequential event-based data performed better on Apple Silicon (with MPS).

In terms of energy usage, the platform comparison showed a computational speed advantage for NVIDIA (with CUDA), but at the cost of lower energy efficiency, while Apple Silicon (with MPS) prioritized energy efficiency at the cost of longer run times. RAM utilization patterns favored Apple Silicon (with MPS) across both experiments; however, allocated GPU memory was usually lower for NVIDIA (with CUDA) than for Apple Silicon (with MPS). Further, the CSNN architecture demanded more memory than the CNN, regardless of the platform.

In addition, extended experiments were conducted to examine the impact of temporal depth on accuracy and energy consumption in CSNNs across both frame-based and event-based datasets. These experiments revealed that increasing the number of time steps introduces a pronounced accuracy–energy trade-off. For sequential event-based data (with time encoding), optimal performance was achieved at lower time step configurations, with higher temporal depths yielding inconsistent accuracy gains despite steadily increasing energy consumption. For frame-based data, accuracy improvements also exhibited diminishing returns as temporal depth increased. Total energy consumption scaled nearly linearly. The extended results reinforced that CSNN efficiency is not maximized by deeper temporal processing alone. While increasing the number of time steps could potentially yield higher accuracy, doing so incurs substantial computational and power cost. These findings strengthen the overall conclusion that CSNNs are fundamentally better aligned with sequential (time) event-based data and there is an optimal operating point where accuracy and energy consumption are balanced.

Our results highlight that evaluating spiking neural networks requires joint consideration of accuracy, execution time, and energy efficiency, particularly when targeting energy-constrained or real-time systems. Future work could explore advanced architectures for CSNNs, such as different neuron models or spike-timing rules, which can provide deeper insights. For the traditional networks, future work could explore a hybrid architecture that combines the best of both approaches, or compare against sequential models such as long short-term memory (LSTM) networks. For LSTMs, we would expect platform-specific performance patterns similar to those observed with CNNs, with NVIDIA (with CUDA) being faster and Apple Silicon (with MPS) being more energy efficient. The temporal processing nature of LSTMs may show heavier memory utilization patterns than the two networks studied here due to its unique architecture [25]. Future work could also extend the supplementary experiments by leveraging more powerful hardware to explore deeper temporal processing, which was limited on our MacBook Pro as performance deteriorated beyond 24 time steps. Additionally, increasing the number of training epochs to several hundred could provide further insights into the effects of prolonged training on accuracy, energy consumption, and overall CSNN performance.

Attention/transformer models could also be explored and would likely favor NVIDIA (with CUDA) heavily due to their computationally expensive need for matrix operations and parallel computation. Transformers could achieve better accuracy, but this would most likely come at higher costs in all performance metrics compared to CNNs and CSNNs [26]. Most importantly, future work could target implementations on neuromorphic hardware, such as Loihi 2, to fully and fairly assess the benefits of spiking neural networks that were not achieved in our GPU-based simulations [27].

Author Contributions

Conceptualization, R.S. and W.B.A.; methodology, R.S.; software (including model implementation and coding), R.S.; validation, R.S.; formal analysis, R.S.; investigation, R.S.; resources, R.S.; data curation, R.S.; writing—original draft preparation, R.S.; writing—review and editing, R.S. and W.B.A.; visualization, R.S.; supervision, W.B.A.; project administration, R.S. and W.B.A.; W.B.A. provided supervision, feedback on manuscript drafts, and guidance on neuromorphic technologies and related tools. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available. The datasets used are: (1) CIFAR-10 Dataset (Frame-based data): Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report, University of Toronto. Available online: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 29 January 2025) [11]. (2) DVS Gesture Dataset (Event-based data): Amir, A.; Taba, B.; Berg, D.; Melano, T.; McKinstry, J.; Di Nolfo, C.; et al. (2017). A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7243–7252). Available online: https://tonic.readthedocs.io/en/latest/generated/tonic.datasets.DVSGesture.html (accessed on 29 January 2025) [12].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. F1 Scores

The figures below show the F1 score achieved per class in experiments 1 (CIFAR-10) and 2 (DVSGesture).

Figure A1. CIFAR-10 ResNet-18 F1 scores per class on Apple Silicon.

Figure A2. CIFAR-10 CSNN F1 scores per class on Apple Silicon.

Figure A3. CIFAR-10 ResNet-18 F1 scores per class on NVIDIA’s CUDA.

Figure A4. CIFAR-10 CSNN F1 scores per class on NVIDIA’s CUDA.

Figure A5. DVSGesture CNN F1 scores per class on Apple Silicon.

Figure A6. DVSGesture CSNN F1 scores per class on Apple Silicon.

Figure A7. DVSGesture CNN F1 scores per class on NVIDIA’s CUDA.

Figure A8. DVSGesture CSNN F1 scores per class on NVIDIA’s CUDA.

References

1. Huang, L.; Joseph, A.D.; Nelson, B.; Rubinstein, B.I.; Tygar, J.D. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence; Association for Computing Machinery: New York, NY, USA, 2011; pp. 43–58.
2. Schuman, C.D.; Kulkarni, S.R.; Parsa, M.; Mitchell, J.P.; Date, P.; Kay, B. Publisher Correction: Opportunities for neuromorphic computing algorithms and applications. Nat. Comput. Sci. 2022, 2, 205.
3. Kasperek, D.; Antonowicz, P.; Baranowski, M.; Sokolowska, M.; Podpora, M. Comparison of the usability of Apple M2 and M1 processors for various machine learning tasks. Sensors 2023, 23, 5424.
4. Kok, D.; Kanaan, M. Fast Transformer Inference with Metal Performance Shaders. Available online: https://explosion.ai/blog/metal-performance-shaders (accessed on 8 January 2025).
5. Luo, Y.; Duraiswami, R. Canny Edge Detection on NVIDIA CUDA. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2008.
6. Gong, H.X.; Hao, L. Roberts edge detection algorithm based on GPU. J. Chem. Pharm. Res. 2014, 6, 1308–1314.
7. Parpart, G.; Risbud, S.; Kenyon, G.; Watkins, Y. Implementing and Benchmarking the Locally Competitive Algorithm on the Loihi 2 Neuromorphic Processor. In Proceedings of the 2023 International Conference on Neuromorphic Systems; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1–6.
8. Gomez, W.G.; Pignata, A.; Pignari, R.; Fra, V.; Macii, E.; Urgese, G. First steps towards micro-benchmarking the Lava-Loihi neuromorphic ecosystem. In 2023 IEEE 16th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC); IEEE: Piscataway, NJ, USA, 2023; pp. 462–469.
9. Fang, W.; Chen, Y.; Ding, J.; Yu, Z.; Masquelier, T.; Chen, D.; Huang, L.; Zhou, H.; Li, G.; Tian, Y. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Sci. Adv. 2023, 9, eadi1480.
10. Hazan, H.; Saunders, D.J.; Khan, H.; Patel, D.; Sanghavi, H.T.; Siegelmann, H.T.; Kozma, R. BindsNET: A Spiking Neural Networks Library Built on PyTorch. 2018. Available online: https://bindsnet-docs.readthedocs.io (accessed on 21 May 2025).
11. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009; Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 29 January 2025).
12. Amir, A.; Taba, B.; Berg, D.; Melano, T.; McKinstry, J.; Di Nolfo, C.; Nayak, T.; Andreopoulos, A.; Garreau, G.; Mendoza, M.; et al. A low power, fully event-based gesture recognition system. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 7243–7252.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
14. Eshraghian, J.K.; Ward, M.; Neftci, E.; Wang, X.; Lenz, G.; Dwivedi, G.; Bennamoun, M.; Jeong, D.S.; Lu, W.D. Training spiking neural networks using lessons from deep learning. Proc. IEEE 2023, 111, 1016–1054.
15. Taye, M.M. Theoretical Understanding of Convolutional Neural Network: Concepts, Architectures, Applications, Future Directions. Computation 2023, 11, 52.
16. Samadzadeh, A.; Far, F.S.T.; Javadi, A.; Nickabadi, A.; Chehreghani, M.H. Convolutional Spiking Neural Networks for Spatio-Temporal Feature Extraction. arXiv 2020, arXiv:2003.12346.
17. Deng, L.; Wu, Y.; Hu, X.; Liang, L.; Ding, Y.; Li, G.; Zhao, G.; Li, P.; Xie, Y. Rethinking the performance comparison between SNNS and ANNS. Neural Netw. 2020, 121, 294–307.
18. Fang, W. Leaky Integrate-and-Fire Spiking Neuron with Learnable Membrane Time Parameter. arXiv 2020, arXiv:2007.05785.
19. Patel, R.; Tripathy, S.; Sublett, Z.; An, S.; Patel, R. Using CSNNs to Perform Event-based Data Processing & Classification on ASL-DVS. arXiv 2024, arXiv:2408.00611.
20. Xu, Y.; Tang, G.; Yousefzadeh, A.; de Croon, G.; Sifalakis, M. Event-based Optical Flow on Neuromorphic Processor: ANN vs. SNN Comparison based on Activation Sparsification. arXiv 2024, arXiv:2407.20421.
21. Apple Inc. Powermetrics – Apple Developer Documentation. Available online: https://developer.apple.com/documentation/ (accessed on 12 February 2025).
22. NVIDIA Corporation. System Management Interface SMI. Available online: https://developer.nvidia.com/system-management-interface (accessed on 13 February 2025).
23. Blem, E.; Menon, J.; Sankaralingam, K. Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA); IEEE: Piscataway, NJ, USA, 2013; pp. 1–12.
24. Sorbaro, M.; Liu, Q.; Bortone, M.; Sheik, S. Optimizing the energy consumption of spiking neural networks for neuromorphic applications. Front. Neurosci. 2020, 14, 662.
25. Staudemeyer, R.C.; Morris, E.R. Understanding LSTM – A tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
27. Davies, M.; Srinivasa, N.; Lin, T.H.; Chinya, G.; Cao, Y.; Choday, S.H.; Dimou, G.; Joshi, P.; Imam, N.; Jain, S.; et al. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro 2018, 38, 82–99.

Figure 1. CIFAR-10 ResNet-18 performance evaluation on Apple Silicon: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 2. CIFAR-10 ResNet-18 performance evaluation on NVIDIA CUDA: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 3. CIFAR-10 CSNN performance evaluation on Apple Silicon: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 4. CIFAR-10 CSNN performance evaluation on NVIDIA CUDA: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 5. DVSGesture CNN performance evaluation on Apple Silicon: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 6. DVSGesture CNN performance evaluation on NVIDIA CUDA: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 7. DVSGesture CSNN performance evaluation on Apple Silicon: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 8. DVSGesture CSNN performance evaluation on NVIDIA CUDA: (a) Confusion matrix and per-class scores; (b) F1 score over time; (c) Power usage during a training epoch.

Figure 9. Accuracy analysis for CSNN on CIFAR-10: (a) Energy-accuracy trade-off, and (b) Time steps vs. test accuracy.

Figure 10. Energy efficiency metrics for CSNN on CIFAR-10: (a) Time steps vs. total energy consumption, and (b) Time steps vs. average power.

Figure 11. Accuracy analysis for CSNN on DVSGesture: (a) Energy-accuracy trade-off, and (b) Time steps vs. test accuracy.

Figure 12. Energy efficiency metrics for CSNN on DVSGesture: (a) Time steps vs. total energy consumption, and (b) Time steps vs. average power.

Table 1. Hyperparameter comparison between ResNet-18 and CSNN for CIFAR-10.

Hyperparameter	ResNet-18	CSNN (Spiking CNN)
Optimizer	Adam	Adam
Learning Rate (lr)	0.001	0.0005
Weight Decay	0.0005	0.001
Loss Function	CrossEntropyLoss	CrossEntropyLoss
Scheduler Type	StepLR	StepLR
Step Size (epochs)	10	10
Gamma (decay factor)	0.1	0.3
Batch Size	128	128
Activation Function	ReLU	N/A
Neuron Type	N/A	Leaky Integrate-and-Fire (beta = 0.95)
Time Steps	N/A	3
Surrogate Gradient	N/A	Fast Sigmoid (slope = 10)

Table 2. Hyperparameter comparison between CNN and CSNN for DVS Gesture Recognition.

Hyperparameter	CNN	CSNN (Spiking CNN)
Optimizer	Adam	Adam
Learning Rate (lr)	0.001	0.001
Loss Function	CrossEntropyLoss	CrossEntropyLoss
Batch Size	16	16
Number of Epochs	30	30
Time Steps	16	16
Temporal Aggregation	Mean Pooling	Spike Accumulation
Activation Function	ReLU	N/A
Dropout Rate	0.5	N/A
Neuron Type	N/A	Leaky Integrate-and-Fire (beta = 0.5)
Surrogate Gradient	N/A	Arctangent (alpha = 2.0)

Table 3. Architecture of Convolutional Spiking Neural Network (CSNN) for CIFAR-10 dataset.

Layer (Type)	Output Shape
Input (3 × 32 × 32 Image)	(3, 32, 32)
Convolutional Blocks (Feature Extraction)
Conv2d (3 → 64, 3 × 3, stride = 1, padding = 1)	(64, 32, 32)
BatchNorm2d	(64, 32, 32)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.95)	(64, 32, 32)
Conv2d (64 → 128, 3 × 3, stride = 2, padding = 1)	(128, 16, 16)
BatchNorm2d	(128, 16, 16)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.95)	(128, 16, 16)
Conv2d (128 → 256, 3 × 3, stride = 2, padding = 1)	(256, 8, 8)
BatchNorm2d	(256, 8, 8)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.95)	(256, 8, 8)
Conv2d (256 → 512, 3 × 3, stride = 2, padding = 1)	(512, 4, 4)
BatchNorm2d	(512, 4, 4)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.95)	(512, 4, 4)
Classification Head
AdaptiveAvgPool2d (1 × 1)	(512, 1, 1)
Flatten	(512)
Fully Connected (Linear: 512 → 10)	(10)

Table 4. Architecture of Convolutional Neural Network (CNN) for DVS Gesture Recognition.

Layer (Type)	Output Shape
Input (Time step from DVS events)	(2, 128, 128)
Convolutional Blocks (Feature Extraction)
Conv2d (2 → 32, 3 × 3, stride = 1, padding = 1)	(32, 128, 128)
BatchNorm2d	(32, 128, 128)
ReLU (inplace = True)	(32, 128, 128)
Conv2d (32 → 64, 3 × 3, stride = 1, padding = 1)	(64, 128, 128)
BatchNorm2d	(64, 128, 128)
ReLU (inplace = True)	(64, 128, 128)
MaxPool2d (2 × 2)	(64, 64, 64)
Conv2d (64 → 128, 3 × 3, stride = 1, padding = 1)	(128, 64, 64)
BatchNorm2d	(128, 64, 64)
ReLU (inplace = True)	(128, 64, 64)
MaxPool2d (2 × 2)	(128, 32, 32)
Conv2d (128 → 256, 3 × 3, stride = 1, padding = 1)	(256, 32, 32)
BatchNorm2d	(256, 32, 32)
ReLU (inplace = True)	(256, 32, 32)
MaxPool2d (2 × 2)	(256, 16, 16)
Flatten	(65,536)
Classification Head
Linear (65,536 → 512)	(512)
ReLU (inplace = True)	(512)
Dropout (p = 0.5)	(512)
Temporal Processing
Mean Pooling (across 16 time steps)	(512)
Linear (512 → 11)	(11)

Table 5. Architecture of Convolutional Spiking Neural Network (CSNN) for DVSGesture Recognition.

Layer (Type)	Output Shape
Input (2 × 128 × 128 DVS Event Data)	(2, 128, 128)
Convolutional Blocks (Feature Extraction)
Conv2d (2 → 32, 3 × 3, stride = 1, padding = 1)	(32, 128, 128)
BatchNorm2d	(32, 128, 128)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.5)	(32, 128, 128)
Conv2d (32 → 64, 3 × 3, stride = 1, padding = 1)	(64, 128, 128)
BatchNorm2d	(64, 128, 128)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.5)	(64, 128, 128)
MaxPool2d (2 × 2)	(64, 64, 64)
Conv2d (64 → 128, 3 × 3, stride = 1, padding = 1)	(128, 64, 64)
BatchNorm2d	(128, 64, 64)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.5)	(128, 64, 64)
MaxPool2d (2 × 2)	(128, 32, 32)
Conv2d (128 → 256, 3 × 3, stride = 1, padding = 1)	(256, 32, 32)
BatchNorm2d	(256, 32, 32)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.5)	(256, 32, 32)
MaxPool2d (2 × 2)	(256, 16, 16)
Flatten	(65,536)
Classification Head
Fully Connected (Linear: 65,536 → 512)	(512)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.5)	(512)
Fully Connected (Linear: 512 → 11)	(11)
Leaky Integrate-and-Fire (LIF) Neuron (beta = 0.5)	(11)
Output Layer
Output (Accumulated Spikes)	(11)
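
To illustrate what the LIF layers and the accumulated-spike output in Table 5 compute, the sketch below simulates a single LIF neuron in plain Python. Only beta = 0.5 comes from Table 5; the threshold of 1.0, the subtractive reset, the constant input of 0.6, and the 16 time steps are assumptions chosen for illustration, and the function names are hypothetical.

```python
def lif_step(current, membrane, beta=0.5, threshold=1.0):
    """Advance one LIF neuron by a single time step.

    beta is the membrane decay from Table 5; the threshold and the
    subtractive reset are assumptions made for this sketch.
    """
    membrane = beta * membrane + current       # leak, then integrate the input
    spike = 1 if membrane >= threshold else 0  # fire when the threshold is reached
    membrane -= spike * threshold              # soft reset after a spike
    return spike, membrane


def accumulate_spikes(currents, beta=0.5, threshold=1.0):
    """Sum output spikes over all time steps (accumulated-spike readout)."""
    membrane, total = 0.0, 0
    for current in currents:
        spike, membrane = lif_step(current, membrane, beta, threshold)
        total += spike
    return total


# Hypothetical constant input of 0.6 applied for 16 time steps.
spike_count = accumulate_spikes([0.6] * 16)
print(spike_count)
```

In a classifier with 11 output neurons, the same accumulation would be applied to each output neuron, and the class with the largest spike count would be taken as the prediction.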

Table 6. CIFAR-10 Performance Comparison of Models on Different Platforms. The Apple platform used MPS acceleration and the NVIDIA platform used CUDA acceleration.

Model	Platform	Test Acc Mean ± Std (%)	Test Acc Best (%)	Avg RAM (MB)	Avg GPU Alloc (MB)	Avg GPU Res (MB)	Avg GPU Power Mean ± Std (W)	Avg GPU Power Best (W)	Total Energy Mean ± Std (Wh)	Total Energy Best (Wh)	Total Time Mean ± Std (s)	Total Time Best (s)	Avg Inf Time Mean ± Std (s)	Avg Inf Time Best (s)
ResNet-18	Apple	83.2 ± 0.8	83.84	870	210	383	10.9 ± 0.3	10.58	1.57 ± 0.05	1.5233	541.65 ± 5.95	535.52	1.13 ± 0.02	1.12
ResNet-18	NVIDIA	83.3 ± 0.7	84.09	1353	189	268	60.5 ± 0.8	60.46	6.06 ± 0.06	6.0696	376.08 ± 5.97	375.34	1.10 ± 0.02	1.09
CSNN	Apple	71.6 ± 1.1	72.5	662	66	1130	16.5 ± 0.2	16.53	8.90 ± 0.01	8.9047	1968.84 ± 10.4	1959.05	4.57 ± 0.09	4.48
CSNN	NVIDIA	70.5 ± 0.8	71.45	1444	56	902	140.8 ± 1.2	140.05	27.82 ± 0.17	27.7701	726.76 ± 14.05	726.77	2.13 ± 0.02	2.14

Table 7. DVSGesture Performance Comparison of Models on Different Platforms. The Apple platform used MPS acceleration and the NVIDIA platform used CUDA acceleration.

Model	Platform	Test Acc Mean ± Std (%)	Test Acc Best (%)	Avg RAM (MB)	Avg GPU Alloc (MB)	Avg GPU Res (MB)	Avg GPU Power Mean ± Std (W)	Avg GPU Power Best (W)	Total Energy Mean ± Std (Wh)	Total Energy Best (Wh)	Total Time Mean ± Std (s)	Total Time Best (s)	Avg Inf Time Mean ± Std (s)	Avg Inf Time Best (s)
CNN	Apple	86.1 ± 0.2	86.36	10,860	590	7540	10.8 ± 0.2	10.67	8.21 ± 0.16	8.0568	2519.27 ± 11.94	2506.41	6.17 ± 0.02	6.15
CNN	NVIDIA	88.6 ± 0.5	89.02	12,470	570	7630	139.7 ± 3.3	139.11	42.27 ± 0.46	42.3308	948.14 ± 5.10	945.49	5.70 ± 0.02	5.68
CSNN	Apple	78.8 ± 1.3	80.30	13,070	750	10,800	12.5 ± 0.6	11.79	15.70 ± 1.32	14.2850	4102.48 ± 154.48	3959.34	13.57 ± 0.37	13.97
CSNN	NVIDIA	76.3 ± 1.1	77.27	14,470	720	10,330	95.5 ± 3.4	95.28	229.82 ± 1.78	228.3444	8197.01 ± 150.47	8190.09	15.00 ± 0.02	15.01
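
The platform trade-off in Tables 6 and 7 can be summarized as two ratios per model: how much more energy the NVIDIA runs consumed, and how much longer the Apple runs took. The sketch below computes both from the mean values reported in the tables; the dictionary layout and function name are illustrative assumptions, and the ratios ignore the reported standard deviations.

```python
# Mean total energy (Wh) and mean total time (s), copied from Tables 6 and 7.
results = {
    ("CIFAR-10", "ResNet-18"): {"Apple": (1.57, 541.65), "NVIDIA": (6.06, 376.08)},
    ("CIFAR-10", "CSNN"): {"Apple": (8.90, 1968.84), "NVIDIA": (27.82, 726.76)},
    ("DVSGesture", "CNN"): {"Apple": (8.21, 2519.27), "NVIDIA": (42.27, 948.14)},
    ("DVSGesture", "CSNN"): {"Apple": (15.70, 4102.48), "NVIDIA": (229.82, 8197.01)},
}


def platform_ratios(entry):
    """Return (NVIDIA/Apple energy ratio, Apple/NVIDIA time ratio)."""
    apple_wh, apple_s = entry["Apple"]
    nvidia_wh, nvidia_s = entry["NVIDIA"]
    return nvidia_wh / apple_wh, apple_s / nvidia_s


ratios = {key: platform_ratios(entry) for key, entry in results.items()}

for (dataset, model), (energy_ratio, time_ratio) in ratios.items():
    print(f"{dataset:10s} {model:9s} "
          f"energy NVIDIA/Apple = {energy_ratio:5.2f}  "
          f"time Apple/NVIDIA = {time_ratio:4.2f}")
```

A time ratio below 1 means the Apple run finished sooner; with the reported means, this occurs only for the CSNN on DVSGesture.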

Table 8. Extended experimental parameters for accuracy-energy trade-off analysis. All other hyperparameters remain identical to the primary experiments.

Parameter	Experiment 1 (CIFAR-10, CSNN)	Experiment 2 (DVSGesture, CSNN)
Number of Epochs	40	40
Time Steps (varied)	1, 3, 6, 9, 12	4, 8, 12, 16, 20
Platform	Apple Silicon (MPS)	Apple Silicon (MPS)
Other Hyperparameters	See Table 1	See Table 2

Table 9. CIFAR-10: Effect of Time Steps on Accuracy, Training Time, and Energy Consumption.

Time Steps	Test Accuracy (%)	Total Training Time (s)	Average Power (W)	Total Energy (Wh)	Energy per % Accuracy (Wh/%)
1	59.03	1042.96	10.87	3.0350	0.0514
3	73.58	2603.48	16.23	11.6248	0.1580
6	78.75	4968.75	17.28	23.7196	0.3012
9	81.13	7358.06	17.68	35.9838	0.4435
12	81.82	9759.23	17.87	48.2699	0.5900
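
The last column of Table 9 is the total energy divided by the test accuracy. The sketch below reproduces that column from the reported values and, as an additional illustration that is not part of the original analysis, computes the marginal energy spent per extra accuracy point between consecutive time-step settings. The function names are hypothetical.

```python
# (time steps, test accuracy %, total energy Wh), copied from Table 9.
rows = [
    (1, 59.03, 3.0350),
    (3, 73.58, 11.6248),
    (6, 78.75, 23.7196),
    (9, 81.13, 35.9838),
    (12, 81.82, 48.2699),
]


def energy_per_accuracy(rows):
    """Total energy divided by test accuracy, keyed by time steps."""
    return {steps: energy / acc for steps, acc, energy in rows}


def marginal_energy(rows):
    """Extra Wh per additional accuracy point between consecutive settings."""
    out = {}
    for (t0, a0, e0), (t1, a1, e1) in zip(rows, rows[1:]):
        out[(t0, t1)] = (e1 - e0) / (a1 - a0)
    return out


per_acc = energy_per_accuracy(rows)
marginal = marginal_energy(rows)

for steps, value in per_acc.items():
    print(f"T={steps:2d}: {value:.4f} Wh per % accuracy")
for (t0, t1), value in marginal.items():
    print(f"T={t0}->{t1}: {value:.2f} Wh per additional accuracy point")
```

The marginal values grow with each step up in time steps, which makes the diminishing accuracy return per unit of energy visible directly from the table.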

Table 10. DVSGesture: Effect of Time Steps on Accuracy, Training Time, and Energy Consumption.

Time Steps	Test Accuracy (%)	Total Training Time (s)	Average Power (W)	Total Energy (Wh)	Energy per % Accuracy (Wh/%)
4	81.06	1335.62	12.80	5.0337	0.0621
8	73.86	2609.71	12.34	9.6094	0.1301
12	74.62	3939.27	12.41	14.6002	0.1957
16	79.17	5214.28	12.39	19.3395	0.2443
20	80.30	11,968.72	7.35	25.1282	0.3129

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Saini, R.; Andreopoulos, W.B. Performance Comparison of Machine Learning Across Metal, Cuda, and Software-Based Neuromorphic Simulation. Inventions 2026, 11, 55. https://doi.org/10.3390/inventions11030055

