Article

A Reproducible Benchmarking Methodology for Machine Learning Hardware: Performance–Energy Trade-Offs from GPUs to Apple Silicon

by Oscar H. Sierra-Herrera 1,2, Mario Eduardo González Niño 2,3, Edwin Francis Cárdenas Correa 4, Jersson X. Leon-Medina 4 and Francesc Pozo 3,5,*
1 Grupo de Investigación I2E, Escuela de Ingeniería Electrónica, Facultad de Ingeniería, Universidad Pedagógica y Tecnológica de Colombia (UPTC), Avenida Central del Norte 39–115, Tunja 150003, Colombia
2 Grupo de Investigación DSP, Escuela de Ingeniería Electrónica, Facultad Seccional Sogamoso, Universidad Pedagógica y Tecnológica de Colombia, Carrera 11 No. 3-33, Sogamoso 152210, Colombia
3 Control, Data, and Artificial Intelligence (CoDAlab), Department of Mathematics, Escola d’Enginyeria de Barcelona Est (EEBE), Universitat Politècnica de Catalunya (UPC), Campus Diagonal–Besòs, Eduard Maristany 16, 08019 Barcelona, Spain
4 Escuela de Ingeniería Electromecánica, Facultad Seccional Duitama, Universidad Pedagógica y Tecnológica de Colombia, Carrera 18 con Calle 22, Duitama 150461, Colombia
5 Institute of Mathematics (IMTech), Universitat Politècnica de Catalunya (UPC), Pau Gargallo 14, 08028 Barcelona, Spain
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(5), 363; https://doi.org/10.3390/a19050363
Submission received: 24 March 2026 / Revised: 22 April 2026 / Accepted: 29 April 2026 / Published: 4 May 2026
(This article belongs to the Collection Feature Papers in Algorithms for Multidisciplinary Applications)

Abstract

While hardware selection is widely recognized as a key factor in machine learning performance, systematic and reproducible evaluation across heterogeneous and accessible platforms remains limited, particularly when jointly considering execution time, energy consumption, stability, and cost-efficiency. This work presents a unified and fully reproducible benchmarking framework for supervised learning, designed to enable controlled and comparable evaluation across diverse hardware environments. The proposed methodology enforces consistent training pipelines, fixed hyperparameter configurations, and repeated executions to ensure statistical reliability, while incorporating performance metrics such as execution time, power consumption, and energy usage, as well as performance-per-dollar. The framework is validated on a representative set of platforms, including CUDA-enabled GPUs, Apple Silicon (CPU/GPU), x86 processors, ARM-based embedded systems, and cloud-based environments, using convolutional, recurrent (RNN, LSTM, BiLSTM), and tree-based (XGBoost) models. The results reveal that hardware efficiency is strongly model-dependent. GPUs provide the highest computational performance and stability for parallel workloads, whereas Apple Silicon achieves superior energy efficiency with competitive execution times, particularly for recurrent architectures. The batch size analysis shows that performance can vary significantly depending on workload configuration, especially on CPU-based platforms, while epoch-based evaluation confirms that the measured performance reflects steady-state behavior rather than initialization overhead. In contrast, conventional CPUs and embedded systems exhibit significant scalability limitations for deep learning training, although they remain competitive for tree-based methods such as XGBoost, which demonstrates near hardware-independent predictive performance. These findings highlight the limitations of generalized hardware selection criteria and emphasize the need for model-aware and hardware-aware benchmarking. The proposed framework offers a practical and extensible foundation for reproducible, hardware-aware evaluation of machine learning systems, supporting informed decision-making in research, deployment, and cost-constrained scenarios.

1. Introduction

In recent years, machine learning (ML) has experienced rapid growth, driven by advances in algorithms, data availability, and computational power [1,2,3,4]. This expansion has intensified the need for hardware-aware benchmarks capable of reflecting real-world training conditions rather than idealized datacenter environments [5,6,7]. Researchers and developers now face the challenge of selecting platforms that balance training time, energy consumption, and reproducibility, all of which directly affect the practical feasibility and sustainability of ML workloads. However, studies that integrate these dimensions under experimental conditions representative of commonly available computing environments remain limited.
Several benchmarking initiatives have addressed ML performance from complementary perspectives. MLPerf [8] established a reference standard for comparing training workloads on high-end GPUs and TPUs under fixed datasets and target accuracy conditions. DAWNBench [9] introduced metrics centered on time-to-accuracy and cost, whereas DeepEdgeBench [10] and Geekbench ML [11] focused primarily on inference tasks on edge and mobile hardware. Although these initiatives have been fundamental for benchmarking standardization, their scope remains limited when the goal is to jointly analyze supervised training, energy consumption, and reproducibility across heterogeneous and widely accessible hardware.
In parallel, recent studies have highlighted the environmental implications of ML training. Works such as From Clicks to Carbon [12], Patterson et al. [13], and Green AI [14] show that the choice of hardware and model architecture has a substantial impact on energy use and carbon footprint. More recently, benchmarking-oriented studies have reinforced this perspective by explicitly comparing predictive performance and energy efficiency under controlled experimental settings [15,16]. As a result, the selection of computational environments has become not only a matter of performance but also of sustainability. However, these concerns have not yet been systematically translated into reproducible comparative frameworks capable of quantifying, under a unified experimental design, the relationship between computational performance, energy cost, and result stability across heterogeneous hardware platforms [17,18].
Recent supervised and deep learning studies have also addressed real industrial energy forecasting scenarios. In [19], a Temporal Fusion Transformer was proposed for electric load forecasting in a quicklime production plant, showing that model behavior and forecasting suitability depend strongly on the industrial operating context and the evaluation criteria considered. Such studies reinforce the relevance of assessing supervised learning models under realistic application conditions, although they do not focus on the influence of heterogeneous hardware on training performance, energy consumption, and reproducibility.
Despite these advances, most existing benchmarks remain centered on datacenter-scale accelerators or on specific model families. They frequently exclude power analysis or overlook widely used platforms such as consumer-grade x86 CPUs, ARM-based laptops, and low-power embedded devices such as the Raspberry Pi [20,21]. In addition, synthetic indicators such as FLOPS or memory bandwidth are insufficient to capture the interaction among software stack overhead, data pipeline constraints, and the computational behavior of different model architectures [8,9,10,11]. Recent studies have also explored deterministic comparisons of machine learning models and the characterization of deep learning workloads in edge-oriented scenarios [22,23], while other works have addressed performance improvements through high-performance computing and parallel strategies in predictive tasks [24]. However, these efforts do not yet provide a unified benchmark for supervised training that jointly evaluates execution time, energy usage, predictive accuracy, and reproducibility across heterogeneous and accessible hardware platforms. Consequently, a homogeneous evaluation that compares distinct families of supervised learning algorithms across accessible and heterogeneous platforms under consistent conditions is still lacking [5].
This gap also highlights the need for benchmarking methodologies based on fixed and reproducible experimental designs, in which datasets, model definitions, hyperparameter settings, and repetition schemes are kept constant in order to isolate platform-dependent effects and enable fair comparison across heterogeneous hardware environments [25,26,27]. In addition, cost-oriented benchmarking perspectives have shown that practical hardware evaluation should not be restricted to raw performance alone, but should also consider accessibility-related criteria such as economic efficiency and comparative cost-performance behavior [9,14,28].
To address this need, this work presents a unified and reproducible benchmarking framework for evaluating five representative supervised learning models (CNN [29,30], Simple RNN [31], RNN–LSTM [32], BiLSTM [33,34], and XGBoost [35]) across a set of accessible hardware platforms. The evaluated systems include a desktop computer with an RTX 5060 Ti GPU and Ryzen 7 7800X3D CPU, a laptop with a Ryzen 7 4800H and GTX 1660 Ti, an Apple M4 MacBook evaluated in both CPU and GPU configurations, a Raspberry Pi 5 based on an ARM SoC, and Google Colab in CPU/GPU configurations.
All models were trained using consistent code, datasets, and hyperparameter settings in order to isolate hardware-specific effects on training time, power usage, energy efficiency, and reproducibility. In addition, the proposed framework incorporates a basic performance-per-dollar perspective in order to account for the economic dimension of hardware accessibility [9,14,28].
This study seeks to overcome the fragmentation observed in the literature and provide a more consistent and reproducible comparative basis.
The results discussed in this article clearly reveal differentiated behaviors across platforms. Apple Silicon exhibited outstanding energy efficiency in recurrent and tree-based workloads, whereas CUDA-enabled GPUs delivered the fastest and most consistent training performance. By contrast, general-purpose CPUs and low-power ARM devices showed significant scalability limitations for deep learning tasks, although algorithms such as XGBoost remained competitive on CPU-based platforms.
Ultimately, this study provides a practical and transparent reference for researchers, educators, and practitioners who need to balance performance, energy efficiency, and reproducibility in supervised learning tasks. It also promotes a benchmarking perspective that incorporates energy and environmental criteria alongside traditional performance metrics, thereby contributing to more practical and accessible machine learning research. In this sense, the contribution of the article lies in providing systematic comparative evidence under a unified and reproducible benchmarking framework for supervised learning across heterogeneous and accessible hardware platforms. In addition, the proposed methodology extends conventional benchmarking by incorporating batch size sensitivity analysis, epoch-based stability validation, and performance-per-dollar evaluation, allowing a more comprehensive characterization of platform performance across different evaluation criteria.
The remainder of this article is organized as follows. Section 2 presents the methodology adopted in this study, including the hardware selection criteria, the description of the evaluated platforms, and the tested models. Section 3 reports and analyzes the experimental results obtained for each model, covering CNN [29,30], Simple RNN [31], RNN–LSTM [32], BiLSTM [33,34], and XGBoost [35]. Section 4 discusses the main findings of the study by comparing the behavior of the different platforms and models from the perspectives of performance, energy efficiency, and stability. Finally, Section 5 presents the conclusions of the work and outlines future research directions derived from the obtained results.

2. Materials and Methods

The methodology adopted in this study was designed to ensure a consistent, reproducible, and comparable evaluation of heterogeneous computational platforms in supervised learning tasks. To this end, the experimental design explicitly controlled the main factors influencing execution, including hardware characterization, software environment configuration, model definition and implementation, repeated experimental runs, and systematic metric collection.
This approach enables a direct comparison of performance and energy consumption across platforms. Figure 1 summarizes the general methodological workflow adopted in this study, starting from hardware selection and characterization, continuing with software configuration and controlled model execution, and ending with metric collection, statistical analysis, and cross-platform comparison. The experiments in this study were conducted under standard indoor operating conditions in Tunja, Colombia, a high-altitude location characterized by relatively stable ambient temperatures (approximately 15–20 °C) and high relative humidity levels (typically in the range of 80–90%).

2.1. Benchmarking Methodology Overview

This study adopts a unified benchmarking methodology to evaluate the influence of heterogeneous hardware platforms on supervised learning workloads. The methodology is designed to ensure consistency, reproducibility, and comparability across all experiments by maintaining identical datasets, model definitions, and baseline hyperparameter settings for each platform whenever hardware compatibility allows. In this sense, the fixed experimental design adopted in this work follows the logic of controlled and reproducible benchmarking, in which datasets, model definitions, hyperparameter settings, and repetition schemes are kept constant in order to isolate platform-dependent effects and ensure fair cross-platform comparison [25,26,27,28]. The evaluation considers representative platforms spanning desktop, laptop, cloud, and embedded environments, including x86 CPUs, NVIDIA CUDA-enabled GPUs, Apple Silicon, and ARM-based low-power systems.
However, while fixed hyperparameter configurations ensure strict comparability, they may not fully capture the peak performance capabilities of each hardware platform. This design also ensures that the selected configurations remain compatible with low-resource platforms, where memory availability can constrain feasible workload sizes. In particular, parameters such as batch size and memory allocation strongly influence hardware utilization, especially in heterogeneous architectures with different memory hierarchies and parallel execution models. Therefore, in addition to the controlled experimental setup, this work incorporates a complementary hardware-aware evaluation, where selected hyperparameters (primarily batch size) are systematically varied to identify near-optimal operating points for each platform. This dual approach allows both reproducible comparison and fair assessment of platform-specific performance. This is particularly important because suboptimal batch size configurations may significantly underestimate the effective performance of certain platforms, especially in CPU-based systems.
The complete benchmarking procedure is summarized in Algorithm 1. The workflow comprises four main stages: hardware selection, software environment configuration, model execution, and metric collection. First, hardware platforms are selected to represent distinct computational paradigms in terms of architecture, performance level, and accessibility. Second, each platform is configured with an appropriate software stack, including the required Python version and machine learning libraries. Third, the selected supervised learning models (CNN [29,30], Simple RNN [31], RNN–LSTM [32], BiLSTM [33,34], and XGBoost [35]) are trained under fixed experimental settings. Finally, the results are collected in terms of training time, power (or declared power usage when available), predictive performance, and energy consumption behavior.
For the hardware-aware evaluation, CNN and LSTM models are used as representative workloads to study batch size sensitivity. These models capture two fundamentally different computational patterns: highly parallel workloads (CNN) and sequential, dependency-bound workloads (LSTM). This selection allows analysis of how batch size affects both compute-bound and sequential models without introducing excessive experimental complexity, while still providing representative insights across model families.
As described in Algorithm 1, each model is executed on each platform under fixed conditions, and all experiments are repeated five times. In addition, experiments were conducted under different epoch configurations in order to evaluate the stability of the training process and verify whether the measured performance reflects steady-state behavior rather than initialization overhead. For the hardware-aware evaluation, multiple batch sizes are explored for each platform, and performance metrics are recorded to identify optimal configurations. To reduce confounding factors, all executions are performed from the command line, and average values are reported in the subsequent analysis. Random seeds are fixed whenever supported by the framework in order to improve reproducibility. This procedure enables a controlled comparison of hardware-dependent effects on different model families, ranging from convolution-based and recurrent architectures to tree-based learning methods.
Unlike conventional benchmarking approaches that focus primarily on execution time or accuracy, the proposed methodology integrates execution time, power consumption, energy estimation, and convergence behavior. Although cost is not directly part of the controlled experimental procedure, the collected performance metrics are later used to derive a performance-per-dollar indicator, enabling a complementary techno-economic interpretation of the results. Furthermore, by incorporating both controlled (fixed configuration) and hardware-aware (tuned configuration) evaluations, the methodology provides a more complete characterization of each platform, capturing both reproducibility and peak performance behavior. This enables a more comprehensive evaluation of hardware performance under realistic and controlled conditions.
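As a rough illustration of this post-processing step, the sketch below computes one plausible performance-per-dollar indicator, here assumed to be relative training speed (baseline time divided by measured time) per unit of approximate device price. This definition and all numeric values are illustrative assumptions, not the exact indicator or measurements reported later in this study.

```python
# Hypothetical post-processing sketch (assumed definition, not the paper's exact one):
# performance-per-dollar taken as relative training speed per USD of device price.
def performance_per_dollar(training_time_s, baseline_time_s, device_price_usd):
    relative_speed = baseline_time_s / training_time_s  # >1 means faster than the baseline
    return relative_speed / device_price_usd

# Example with made-up numbers: a device twice as slow as the baseline
# but four times cheaper scores higher per dollar.
print(performance_per_dollar(training_time_s=40.0, baseline_time_s=20.0, device_price_usd=500.0))
print(performance_per_dollar(training_time_s=20.0, baseline_time_s=20.0, device_price_usd=2000.0))
```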
Algorithm 1 Unified benchmarking procedure for heterogeneous hardware evaluation
Require: Set of hardware platforms P; set of models M; fixed datasets D; fixed hyperparameter configurations H; number of repetitions R = 5; batch size search space B
Ensure: Per-platform and per-model summary metrics for training time, predictive performance, power, and energy when available
 1: for all platforms p ∈ P do
 2:     Configure the software environment of platform p
 3:     for all models m ∈ M do
 4:         Load dataset D_m ∈ D corresponding to model m
 5:         Load fixed hyperparameter configuration H_m ∈ H
 6:         Initialize empty result containers for time, predictive metrics, power, and energy
 7:         Controlled evaluation (fixed configuration)
 8:         for r = 1 to R do
 9:             Set fixed random seeds whenever supported by the framework
10:             Initialize model m with configuration H_m
11:             Train model m on platform p
12:             Record training time
13:             Record predictive metric(s) associated with model m
14:             Record average power consumption when measurable
15:             if power data are available then
16:                 Compute energy consumption as E_{p,m,r} = P_{p,m,r} × T_{p,m,r}
17:             else
18:                 Mark energy consumption as unavailable
19:             end if
20:         end for
21:         Compute mean and standard deviation across the R repetitions for all recorded metrics
22:         Store summarized controlled results for platform p and model m
23:         Hardware-aware evaluation (batch size tuning, applied only to selected platforms)
24:         for all batch sizes b ∈ B do
25:             Modify hyperparameter configuration H_m with batch size b
26:             for r = 1 to R do
27:                 Initialize model m with updated configuration
28:                 Train model m on platform p
29:                 Record training time and relevant metrics
30:             end for
31:             Compute mean performance for batch size b
32:         end for
33:         Identify optimal batch size b* for platform p and model m (e.g., minimum training time or maximum throughput)
34:         Store optimal configuration results
35:     end for
36: end for
37: Compare all summarized results across platforms and model families under both controlled and optimized configurations
Note: E denotes energy (J), P average power (W), and T execution time (s).
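To make the measurement step concrete, the following minimal Python sketch reproduces the controlled-evaluation loop of Algorithm 1 for a single platform–model pair: repeated timed runs, optional energy estimation as E = P × T, and mean/standard-deviation aggregation. The run_training callable and the externally measured average power are placeholders and do not correspond to the exact interface of the scripts in the public repository.

```python
# Minimal sketch of the controlled-evaluation loop in Algorithm 1.
# `run_training` stands in for one platform-specific training run, and
# `avg_power_w` is the average power measured externally (e.g., with a
# power meter or nvidia-smi); both are assumptions for illustration.
import time
import statistics

def benchmark_run(run_training, avg_power_w=None, repetitions=5):
    times, energies = [], []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_training()                      # fixed dataset, model, and hyperparameters
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        if avg_power_w is not None:
            energies.append(avg_power_w * elapsed)   # E = P x T, in joules
    return {
        "time_mean_s": statistics.mean(times),
        "time_std_s": statistics.stdev(times),
        "energy_mean_j": statistics.mean(energies) if energies else None,
    }

# Example: a dummy CPU workload standing in for an actual training run.
print(benchmark_run(lambda: sum(i * i for i in range(10**6)), avg_power_w=35.0))
```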

2.2. Hardware Selection Criteria

The hardware platforms considered in this study were selected to reflect practical diversity, software compatibility, accessibility, and reproducibility in supervised learning experiments. The selection procedure is summarized in Algorithm 2. Rather than targeting the maximum achievable performance of each platform through hardware-specific tuning, the benchmark was designed to compare heterogeneous systems under a controlled and homogeneous experimental setting, using consistent model definitions, datasets, and training configurations whenever platform compatibility allowed. The selected hardware set should therefore be interpreted as representative rather than exhaustive, and the reported results are not intended to define a universal scaling law for other GPU or CPU families solely on the basis of CUDA-core or CPU-core counts. The benchmark includes representative systems from desktop, laptop, reference cloud, and embedded environments, covering x86 CPUs, NVIDIA CUDA-enabled GPUs, Apple Silicon, and ARM-based low-power devices.
Google Colab was included as a practical cloud-based reference environment because of its widespread use in academic, educational, and rapid prototyping settings. However, it is not treated as a strictly equivalent counterpart to dedicated local hardware, since it operates on shared and virtualized infrastructure with inherently variable resource allocation. Consequently, its results should be interpreted as an accessibility-oriented reference rather than as a fully controlled physical platform. AMD GPUs are excluded owing to inconsistent support across the software frameworks required in this study, particularly TensorFlow, Keras, and XGBoost, which could compromise reproducibility and comparability. By contrast, NVIDIA hardware benefits from mature CUDA and cuDNN ecosystems, providing a stable baseline for GPU-accelerated training.
The inclusion of ARM-based platforms recognizes their increasing relevance in energy-efficient and mobile computing. In particular, the Raspberry Pi 5 represents a low-power embedded platform, while Apple Silicon provides an example of an integrated ARM-based architecture with unified memory. Together, the selected platforms provide a representative basis for comparison under fixed experimental conditions, while still reflecting the diversity of hardware environments commonly available to researchers and practitioners.
Algorithm 2 Hardware platform selection procedure
Require: Candidate set of hardware platforms C
Ensure: Selected set of hardware platforms P
 1: Initialize P ← ∅
 2: for all candidate platforms c ∈ C do
 3:     if c provides architectural diversity then
 4:         if c is accessible for practical academic or research use then
 5:             if c supports the required software frameworks with sufficient stability then
 6:                 Add c to P
 7:             end if
 8:         end if
 9:     end if
10: end for
11: Include x86 CPU and NVIDIA GPU platforms in desktop and laptop configurations
12: Include one Apple Silicon platform with integrated GPU
13: Include one ARM-based low-power embedded platform
14: Include one cloud-based reference environment for remote experimentation and accessibility-oriented comparison
15: Exclude platforms with inconsistent support in TensorFlow, Keras, or XGBoost when such limitations may affect reproducibility
16: Preserve fixed experimental settings across platforms whenever feasible in order to prioritize methodological consistency over platform-specific peak optimization
17: return P
Based on these criteria, the final benchmark included a desktop computer, a laptop, an Apple Silicon MacBook, a Raspberry Pi 5, and Google Colab as the cloud-based reference environment. The technical specifications of these selected platforms are presented in the following subsection; however, the cloud-based results must be interpreted with additional caution because the underlying infrastructure is not fully controlled by the user.

2.3. Hardware Description

In this subsection, the hardware configurations employed throughout the study are described in detail. Table 1 summarizes the main technical specifications of each platform, including processor, graphics unit, memory, storage, and declared power characteristics. These configurations provide the experimental basis for the comparative analysis conducted in the subsequent sections.
The Apple M4 Air (Apple Inc., Cupertino, CA, USA) platform is included as an additional device within the Apple Silicon family. Unlike the other platforms, it is primarily used to complement the batch size sensitivity analysis rather than the fixed-configuration benchmark, and therefore it is not included in the main controlled comparison. Its inclusion enables the evaluation of performance scaling within the same architectural family under different power and thermal constraints, particularly considering its fanless design.

2.4. Evaluated Hardware Platforms

x86 Laptop with Dedicated GPU. This gaming-oriented laptop combines an AMD Ryzen 7 4800H processor (AMD, Santa Clara, CA, USA) with an NVIDIA GTX 1660 Ti mobile GPU (NVIDIA Corporation, Santa Clara, CA, USA), representing mainstream mobile hardware capable of handling both general-purpose computing and moderately demanding machine learning tasks. It serves as a practical benchmark for portable deep learning workflows.
x86 Desktop with Dedicated GPU. A high-performance desktop configuration featuring an AMD Ryzen 7 7800X3D processor (AMD, Santa Clara, CA, USA) and an NVIDIA RTX 5060 Ti (16 GB VRAM) (NVIDIA Corporation, Santa Clara, CA, USA). It provides the reference baseline for this study, combining strong CPU throughput, large cache, and sufficient GPU memory to train deep learning models efficiently. This setup balances raw compute power and bandwidth, ideal for testing both CPU-bound and GPU-accelerated algorithms.
Apple MacBook M4 Pro. An ARM-based System-on-Chip from Apple Inc. (Cupertino, CA, USA), integrating CPU, GPU, and NPU under a unified memory architecture. Apple Silicon exemplifies high energy efficiency and cross-platform compatibility within compact devices, offering insight into ARM+GPU performance using the Metal backend. Its inclusion reflects the growing relevance of Apple hardware in portable AI and data-science workflows.
Raspberry Pi 5. A low-power single-board computer developed by Raspberry Pi Ltd. (Cambridge, UK), based on a Broadcom BCM2712 processor (Broadcom Inc., Palo Alto, CA, USA) with a quad-core Cortex-A76 CPU, included to evaluate AI on the edge and supervised learning in constrained or embedded environments. Although limited in raw performance, it provides valuable information on efficiency, thermal stability, and performance-per-watt for edge AI and IoT applications.
Google Colab Free. A widely used cloud-based reference environment provided by Google LLC (Mountain View, CA, USA), offering shared Intel Xeon CPUs (Intel Corporation, Santa Clara, CA, USA) and NVIDIA GPUs (NVIDIA Corporation, Santa Clara, CA, USA) for educational, exploratory, and rapid prototyping purposes. Unlike the dedicated local platforms considered in this study, Google Colab operates on virtualized and dynamically allocated infrastructure, which may introduce variability in CPU availability, storage performance, GPU allocation, and runtime stability. Accordingly, its inclusion is intended to provide an accessibility-oriented reference for remote experimentation rather than a fully controlled physical hardware baseline. Despite these limitations, Colab remains practically relevant because it enables GPU-accelerated training without upfront hardware investment, thereby illustrating the trade-offs between accessibility, convenience, and methodological consistency.
Table 2 summarizes the software environment across all evaluated platforms, including operating systems, programming environments, and library versions. This transposed view facilitates direct comparison of software configurations, ensuring consistency and reproducibility in the experimental setup while also highlighting potential sources of variability across heterogeneous hardware. Version differences arise from the use of the stable, platform-compatible releases available for each system; although these software-stack differences may influence execution time, energy behavior, numerical consistency, and backend-level stability, all configurations correspond to optimized environments and preserve functional comparability across the benchmarking workflow.
Power consumption was measured using a P4400 low-cost consumer power meter, with an accuracy typically ranging between 0.2% and 2%. Measurements were obtained at the system level under consistent power supply conditions, using standard manufacturer-recommended configurations for each platform. For NVIDIA GPUs, nvidia-smi was used to sample GPU power draw during training. For CPU-based and Apple Silicon platforms, system-level power monitoring utilities were employed. In the case of the Raspberry Pi 5, power was estimated from the declared SoC envelope. For Google Colab, no power measurements were available because of the cloud-managed and user-non-transparent nature of the underlying infrastructure. Therefore, Colab-based results are restricted to runtime and predictive-performance indicators and should be interpreted as practical cloud-reference measurements rather than as directly equivalent energy benchmarks relative to dedicated local hardware. Energy consumption was computed as the product of the average measured power and the total training time for each run.
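For the NVIDIA platforms, the power-sampling step can be illustrated with the sketch below, which polls nvidia-smi in a background thread while the workload runs and then estimates energy as average power multiplied by elapsed time. The one-second sampling interval and the helper structure are assumptions for illustration rather than the exact implementation used in this study.

```python
# Illustrative sketch: sample NVIDIA GPU power draw with nvidia-smi while a
# training run executes, then estimate energy as average power times time.
import subprocess
import threading
import time

def _gpu_power_w():
    # Instantaneous board power draw in watts, as reported by nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"]
    )
    return float(out.decode().strip().splitlines()[0])

def measure_gpu_energy(run_training, interval_s=1.0):
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(_gpu_power_w())
            time.sleep(interval_s)

    t = threading.Thread(target=sampler, daemon=True)
    start = time.perf_counter()
    t.start()
    run_training()                              # the training workload under test
    elapsed = time.perf_counter() - start
    stop.set()
    t.join()
    avg_power = sum(samples) / len(samples) if samples else float("nan")
    return {"time_s": elapsed, "avg_power_w": avg_power, "energy_j": avg_power * elapsed}
```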

2.5. Tested Models

To benchmark performance and efficiency across heterogeneous platforms, five representative machine learning models were implemented: CNN [29,30], Simple RNN [31], RNN–LSTM [32], BiLSTM [33,34], and XGBoost [35]. The model selection and organization adopted in this study are summarized in Algorithm 3. These models were chosen to cover distinct computational patterns and architectural characteristics relevant to supervised learning, including convolution-based processing, sequential recurrent computation, gated temporal modeling, bidirectional sequence modeling, and tree-based learning. Together, they provide a broad basis for evaluating how different hardware platforms respond to diverse training workloads. Accordingly, the associated datasets were not intended to provide a common predictive benchmark across all model families, but rather to instantiate representative computational workloads for each learning paradigm. Therefore, dataset size, structure, and diversity are treated here as part of the workload definition, while remaining fixed across all hardware platforms within each experimental case.
The evaluated models differ substantially in their computational behavior. CNNs are highly parallelizable and typically benefit from GPU acceleration. Simple RNN and RNN–LSTM emphasize sequential dependencies and recurrent memory access, exposing latency-sensitive execution patterns. BiLSTM extends this behavior by incorporating bidirectional recurrent processing, thereby increasing the computational complexity of sequence modeling. In contrast, XGBoost represents a tree-based method whose execution characteristics are generally more favorable for CPU-based platforms. This diversity allows the benchmark to capture a wide range of runtime, energy, and stability behaviors across hardware architectures.
All experiments were executed using Python scripts from the command line in order to minimize software overhead and improve reproducibility. The use of Python also ensured compatibility with the main frameworks employed in this study, including TensorFlow, scikit-learn, and XGBoost. Each experiment was repeated five times, and average results were reported in the subsequent analysis. All training codes used in this study are publicly available in the GitHub repository https://github.com/OscarHSierra/cpu-gpu-training-benchmark (accessed on 28 April 2026).
The objective of this study is to compare computational performance rather than to maximize predictive accuracy. Accordingly, all evaluated techniques share a common training configuration to ensure consistency and fairness across platforms. Each model is executed on all target platforms, including GPU- and CPU-based systems as well as system-on-chip architectures. Given that the total number of training runs scales with both the number of evaluated models and the number of repetitions, memory and execution time become critical constraints, particularly for resource-limited devices. For this reason, the hyperparameter configuration is deliberately selected to ensure bounded and comparable computational workloads across all platforms. While this setup does not aim to achieve state-of-the-art accuracy, it enables a tractable and reproducible evaluation process.
A key limitation of this study is that the selected configurations prioritize computational tractability. In this context, batch size was fixed for each model family to ensure consistent and reproducible workloads across all platforms. For convolutional models (CNN), a batch size of 512 was selected to leverage data parallelism, while for recurrent models (RNN, LSTM, and BiLSTM), a smaller batch size of 64 was used due to their sequential nature and higher memory requirements per sample [36]. These values were selected as a compromise between computational efficiency and cross-platform feasibility, taking into account the memory constraints of low-resource platforms such as embedded devices (e.g., Raspberry Pi). Although these configurations do not necessarily correspond to the optimal batch size for each hardware architecture, they provide a fair and controlled baseline for comparison. Complementary experiments exploring batch size sensitivity are presented in Section 4.6.
The results do not reflect fully optimized training scenarios or state-of-the-art model performance. Model performance is not primarily assessed in terms of predictive accuracy. Instead, execution time, power consumption, and energy usage are emphasized, while the training loss is used as an indicator of convergence behavior. When applicable, classification accuracy is reported as a complementary metric. For classification settings where accuracy is reported, it is defined as
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(\hat{y}_i = y_i\right),$$
where $N$ is the number of evaluated samples, $\hat{y}_i$ is the predicted class, and $y_i$ is the corresponding target class. Accordingly, higher values indicate better agreement between predictions and reference labels. However, for sequence modeling tasks, such as the character-level RNN-based models, sparse categorical cross-entropy is used as the primary convergence indicator, as it more accurately reflects probabilistic learning dynamics in sequence prediction tasks [37,38]. For character-level sequential models, sparse categorical cross-entropy is defined as
$$\mathcal{L}_{\mathrm{SCCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}\left(y_i \mid x_i\right),$$
where $p_{\theta}(y_i \mid x_i)$ denotes the predicted probability assigned to the correct target class. Therefore, lower values indicate better probabilistic fit and more favorable convergence behavior under the same experimental conditions. Direct comparison of predictive metrics across models is not intended, as each model employs task-specific evaluation criteria aligned with its learning objective. Accordingly, predictive metrics are not treated as the primary basis for hardware ranking, but rather as complementary indicators used to verify convergence consistency across platforms under the same experimental configuration.
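For illustration, the two metrics defined above can be computed directly with NumPy as in the following minimal sketch; the numerical values are arbitrary examples.

```python
# Minimal NumPy sketch of the two metrics defined above: classification accuracy
# and sparse categorical cross-entropy (mean negative log-likelihood of the
# correct class). Values are illustrative only.
import numpy as np

def accuracy(y_true, y_pred):
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def sparse_categorical_cross_entropy(y_true, probs):
    # probs: (N, C) predicted class probabilities; y_true: (N,) integer labels
    p_correct = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(p_correct))

y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
print(accuracy(y_true, probs.argmax(axis=1)))           # 1.0
print(sparse_categorical_cross_entropy(y_true, probs))  # ~0.52
```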
Algorithm 3 Model selection and organization for hardware benchmarking
Require: Candidate set of supervised learning models M_c
Ensure: Final benchmark model set M
 1: Initialize M ← ∅
 2: for all candidate models m ∈ M_c do
 3:     if m represents a distinct computational pattern then
 4:         if m is relevant for supervised learning benchmarking then
 5:             if m can be implemented reproducibly across the evaluated platforms then
 6:                 Add m to M
 7:             end if
 8:         end if
 9:     end if
10: end for
11: Include one convolution-based model (CNN)
12: Include one simple recurrent model (Simple RNN)
13: Include one gated recurrent model (RNN–LSTM)
14: Include one bidirectional recurrent model (BiLSTM)
15: Include one tree-based model (XGBoost)
16: return M

2.5.1. CNN

A CNN [29,30] was implemented in TensorFlow 2.10.0 to classify CIFAR-10 [39], a widely used standard benchmark dataset for image classification. Table 3 presents the architecture of the CNN employed in this study. The model processes RGB images and is organized into three convolutional blocks with increasing depth (64, 128, and 256 filters). Each block comprises two convolutional layers with 3 × 3 kernels, same padding, and swish activation, followed by batch normalization to improve convergence stability. Spatial dimensionality is progressively reduced using max-pooling layers with a pool size of 2 × 2, while dropout regularization (0.3 and 0.4) is applied to mitigate overfitting.
After feature extraction, a global average pooling layer aggregates spatial information and reduces the number of parameters. The resulting representation is processed by two fully connected layers with 1024 and 512 units, respectively, both using swish activation and followed by dropout (0.5). The final dense layer produces 10 logits corresponding to the CIFAR-10 classes.
Table 4 summarizes the training configuration and hyperparameters. The model is trained on the CIFAR-10 dataset using the Adam optimizer and sparse categorical cross-entropy loss with logits, while accuracy is used as the evaluation metric. A batch size of 512 and a single training epoch are employed.
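A minimal TensorFlow/Keras sketch consistent with the architecture and training configuration described above is shown below. The assignment of the 0.3 and 0.4 dropout rates to specific blocks, the placement of batch normalization after each convolutional layer, and the input scaling are assumptions; the exact training scripts are available in the public repository.

```python
# Sketch of a CIFAR-10 CNN matching the description above: three conv blocks of
# 64/128/256 filters (3x3 kernels, swish, batch normalization, 2x2 max pooling,
# dropout), global average pooling, two dense layers, and 10 output logits.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, dropout_rate):
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", activation="swish")(x)
        x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    return layers.Dropout(dropout_rate)(x)

inputs = layers.Input(shape=(32, 32, 3))
x = conv_block(inputs, 64, 0.3)    # dropout assignment per block is an assumption
x = conv_block(x, 128, 0.3)
x = conv_block(x, 256, 0.4)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(1024, activation="swish")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(512, activation="swish")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10)(x)      # logits for the 10 CIFAR-10 classes

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
model.fit(x_train / 255.0, y_train, batch_size=512, epochs=1)  # input scaling assumed
```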
In this study, final accuracy and loss are interpreted primarily as consistency checks on the learning outcome across platforms, rather than as the main target variables of the benchmark.

2.5.2. Simple RNN

A character-level Simple RNN [31] was trained on the Shakespeare dataset [40], a widely used public reference dataset for character-level sequence modeling, to evaluate sequential computation performance.
The architecture includes an embedding layer followed by three stacked SimpleRNN layers with 1024 units each, and a dense output layer over the vocabulary. Training was performed for one epoch using the Adam optimizer and sparse categorical cross-entropy loss.
Table 5 summarizes the architecture of the proposed Simple RNN model. The network operates on input sequences of 100 characters encoded as integer indices. An embedding layer maps each character into a 256-dimensional continuous vector space. The core of the model consists of three stacked SimpleRNN layers, each configured to return full sequences, enabling the model to capture temporal dependencies across all time steps. Finally, a dense output layer with linear activation produces logits over the vocabulary space.
Table 6 presents the training configuration and hyperparameters used in this study. The model is trained using sequences of 100 characters and a batch size of 64, with a shuffle buffer of 10,000 samples to improve data randomness. The optimization process relies on the Adam optimizer, while the loss function is sparse categorical cross-entropy configured with logits. To ensure reproducibility, a fixed random seed (42) is used for Python, NumPy, and TensorFlow. Accordingly, the reported loss values are used in this study primarily as indicators of convergence consistency across platforms, rather than as the main basis for hardware ranking.
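The following Keras sketch illustrates the Simple RNN configuration summarized in Tables 5 and 6. The vocabulary size shown (65 characters) and the omission of the data-pipeline code are simplifying assumptions made for brevity.

```python
# Sketch of the character-level Simple RNN described above: a 256-dimensional
# embedding, three stacked SimpleRNN layers of 1024 units returning full
# sequences, and a dense projection producing logits over the vocabulary.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 65     # assumed size of the Shakespeare character vocabulary
SEQ_LEN = 100
EMBED_DIM = 256
RNN_UNITS = 1024

tf.keras.utils.set_random_seed(42)  # fixes Python, NumPy, and TensorFlow seeds

inputs = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
x = layers.SimpleRNN(RNN_UNITS, return_sequences=True)(x)
x = layers.SimpleRNN(RNN_UNITS, return_sequences=True)(x)
x = layers.SimpleRNN(RNN_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE)(x)   # logits over the character vocabulary

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
# Training uses (100-char input, next-char target) pairs, batch size 64,
# a shuffle buffer of 10,000 samples, and one epoch, as described above.
```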

2.5.3. RNN–LSTM

An RNN–LSTM [32] was trained on the Shakespeare dataset [40], a widely used public reference dataset for character-level sequence modeling, in order to evaluate gated recurrent computation under sequential training workloads.
Two stacked LSTM layers with 1024 units each replaced the SimpleRNN layers to mitigate vanishing-gradient effects and improve long-range sequence modeling. The model used identical preprocessing, optimizer, and loss functions as the Simple RNN baseline.
Table 7 presents the architecture of the RNN–LSTM model used to evaluate sequential computation performance. The model processes sequences of 100 integer-encoded characters and follows an embedding–recurrent–projection structure. An embedding layer maps the input tokens into a 256-dimensional continuous space, enabling a compact and expressive representation of the discrete vocabulary. The temporal modeling is performed using two stacked LSTM layers with 1024 units each, both configured to return full sequences, allowing the network to capture temporal dependencies at every time step. Finally, a dense output layer produces logits over the vocabulary space, enabling next-character prediction.
Table 8 summarizes the overall configuration and training setup of the RNN–LSTM model. The network is trained on a character-level representation of the Shakespeare dataset, where input sequences of 100 characters are used to predict the subsequent character at each time step. The training process follows the unified benchmarking protocol adopted in this study, employing a single training epoch, the Adam optimizer, and sparse categorical cross-entropy loss. This configuration ensures consistent computational conditions across all evaluated models, facilitating fair comparison across heterogeneous hardware platforms.
Table 9 details the training configuration and hyperparameters of the RNN–LSTM model. A batch size of 64 and a shuffle buffer of 10,000 samples are used to balance computational efficiency and input variability during training. The embedding dimension and LSTM size are set to 256 and 1024 units, respectively, across two recurrent layers. The output dimension matches the vocabulary size. To ensure experimental reliability, a fixed random seed (42) is applied across Python, NumPy, and TensorFlow. Additionally, GPU memory growth is enabled to prevent allocation issues, allowing stable execution across different hardware platforms, with training performed on GPU when available and otherwise on CPU.
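The GPU memory-growth setting mentioned above can be reproduced with the following TensorFlow sketch, which requests incremental GPU memory allocation and falls back to the CPU when no GPU is detected.

```python
# Sketch of the memory-growth setting: TensorFlow allocates GPU memory
# incrementally instead of reserving it all at once, which helps the same
# script run on GPUs with different memory sizes or fall back to the CPU.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Must be set before the GPU is initialized by any TensorFlow operation.
    tf.config.experimental.set_memory_growth(gpu, True)

device = "/GPU:0" if gpus else "/CPU:0"
print(f"Training will run on {device}")
```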

2.5.4. XGBoost

XGBoost [35] version 3.0.2 was benchmarked using a synthetic multiclass dataset with 200,000 samples, 1000 features, and 5 classes. Identical configurations were trained on both CPU and GPU backends to compare computational performance and energy efficiency. GPU execution employed the gpu_hist tree construction method together with the gpu_predictor backend. The dataset adopted for XGBoost was not intended as a standard public benchmark, but as a synthetic and controlled tabular workload designed to represent tree-based supervised learning under reproducible computational conditions.
Table 10 presents the training configuration and hyperparameters of the XGBoost model. A multi-class classification objective (multi:softmax) is used to directly predict discrete class labels across five categories. The model is configured with a maximum tree depth of 6 and a learning rate of 0.1, providing a balance between model capacity and convergence stability. The training process consists of 100 boosting rounds, enabling incremental refinement of the ensemble.
To support heterogeneous hardware evaluation, different tree construction methods are selected depending on the execution platform. The auto method is used for CPU-based training, while gpu_hist is employed for GPU acceleration, enabling efficient histogram-based gradient boosting. The evaluation metric during training is the multi-class logarithmic loss mlogloss, which provides a probabilistic measure of model performance. A fixed random seed (42) is used to ensure reproducibility across runs.
Table 11 summarizes the experimental setup. The model is evaluated on a synthetic dataset composed of 200,000 samples with 1000 features, of which 75 are informative. The dataset is structured as a five-class classification problem and is split into training and testing subsets using an 80%/20% ratio.
The model is trained for 100 boosting rounds, and overall classification accuracy is used as the evaluation metric for reporting results. To ensure consistency with the benchmarking protocol, fixed random seeds are applied across Python, NumPy, and XGBoost, enabling reproducible comparisons across different hardware platforms.
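A minimal sketch of this synthetic XGBoost workload, consistent with the configuration in Tables 10 and 11, is shown below. The use of scikit-learn's make_classification as the synthetic data generator is an assumption; the GPU runs described above would replace tree_method="auto" with "gpu_hist" (or the equivalent device="cuda" setting in newer XGBoost releases).

```python
# Sketch of the synthetic XGBoost benchmark: 200,000 samples, 1000 features
# (75 informative), 5 classes, an 80/20 split, and 100 boosting rounds with
# multi:softmax, max_depth 6, and learning rate 0.1. Reduce n_samples for a
# quick test; the full size is memory-intensive.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, n_features=1000,
                           n_informative=75, n_classes=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    "objective": "multi:softmax",
    "num_class": 5,
    "max_depth": 6,
    "eta": 0.1,
    "eval_metric": "mlogloss",
    "tree_method": "auto",      # "gpu_hist" was used for the GPU runs
    "seed": 42,
}
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
booster = xgb.train(params, dtrain, num_boost_round=100)

accuracy = float(np.mean(booster.predict(dtest) == y_te))
print(f"Test accuracy: {accuracy:.4f}")
```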

2.5.5. BiLSTM

A lightweight BiLSTM [33,34] architecture with embedding, bidirectional recurrent processing, and a dense classification layer was implemented to evaluate sequence modeling performance under bidirectional recurrent workloads. It was trained on synthetic token sequences (vocabulary size = 5000, sequence length = 100, number of classes = 5) for 10 epochs using Adam and categorical cross-entropy loss. The dataset used for the BiLSTM experiment was not introduced as a standard public benchmark, but as a synthetic and controlled workload designed to represent bidirectional sequence-processing behavior under reproducible computational conditions.
Table 12 presents the architecture of the Bidirectional LSTM (BiLSTM) model employed to capture contextual dependencies in sequential data. The model processes input sequences of fixed length and begins with an embedding layer that maps tokens from a vocabulary of 5000 elements into a 64-dimensional continuous space, enabling a compact representation of the input.
The core of the model consists of a bidirectional LSTM layer with 64 units, which processes the sequence in both forward and backward directions. This structure allows the model to incorporate past and future context simultaneously, improving its ability to capture complex temporal relationships compared to unidirectional recurrent architectures. The outputs from both directions are concatenated into a 128-dimensional representation.
A dropout layer with a rate of 0.5 is applied to reduce overfitting by randomly deactivating neurons during training. Finally, a fully connected layer with softmax activation produces probability distributions over the five output classes, enabling multi-class classification.
Table 13 summarizes the training configuration and hyperparameters of the BiLSTM model. The model is trained using the Adam optimizer with a learning rate of 0.001 and categorical cross-entropy as the loss function. A batch size of 64 and a total of 10 training epochs are used, allowing sufficient optimization while maintaining manageable computational cost within the benchmarking framework.
Input sequences consist of 100 tokens drawn from a vocabulary of 5000 elements, with an embedding dimension of 64. The bidirectional LSTM layer uses 64 units per direction, and a dropout rate of 0.5 is applied as the primary regularization mechanism. The model performs classification over five output classes and uses a validation split of 0.2 to monitor training behavior. To ensure experimental reliability, a fixed random seed (42) is applied.
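A compact Keras sketch of this BiLSTM workload is given below. The number of synthetic samples and the random generation of token sequences and labels are assumptions, since only the vocabulary size, sequence length, and number of classes are specified above.

```python
# Sketch of the BiLSTM workload: synthetic integer sequences (vocabulary 5000,
# length 100, 5 classes), a 64-dimensional embedding, a bidirectional LSTM with
# 64 units per direction, dropout 0.5, and a softmax classifier, trained with
# Adam (lr = 0.001) and categorical cross-entropy for 10 epochs.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

tf.keras.utils.set_random_seed(42)

VOCAB, SEQ_LEN, N_CLASSES = 5000, 100, 5
N_SAMPLES = 10_000  # assumed; not stated in the text

X = np.random.randint(0, VOCAB, size=(N_SAMPLES, SEQ_LEN))
y = tf.keras.utils.to_categorical(np.random.randint(0, N_CLASSES, N_SAMPLES), N_CLASSES)

model = models.Sequential([
    layers.Embedding(VOCAB, 64),
    layers.Bidirectional(layers.LSTM(64)),   # forward + backward -> 128 features
    layers.Dropout(0.5),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=64, epochs=10, validation_split=0.2)
```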
Batch size was fixed for each model family to ensure consistent and reproducible workloads across all platforms. For convolutional models (CNN), a batch size of 512 was selected to leverage data parallelism, while for recurrent models (RNN, LSTM, and BiLSTM), a smaller batch size of 64 was used due to their sequential nature and higher memory requirements per sample. The selected values were chosen as a compromise between computational efficiency and cross-platform feasibility, taking into account the memory constraints of low-resource platforms such as embedded devices (e.g., Raspberry Pi). This ensures that all experiments can be executed reliably across the full set of evaluated hardware. Although these configurations do not necessarily correspond to the optimal batch size for each architecture, they provide a fair and controlled baseline for comparison. Complementary experiments exploring batch size sensitivity were conducted to analyze hardware-aware performance and are discussed in Section 4.6. Models that do not rely on mini-batch training, such as XGBoost, were executed using their standard training procedures.

3. Results

This section presents the experimental results obtained for each of the evaluated models across the considered hardware platforms and discusses their implications in terms of runtime, energy-related behavior, stability, and predictive consistency. The corresponding tables summarize the quantitative results for each case. All reported values are presented as means obtained from repeated runs, together with their corresponding standard deviations. In the following analysis, runtime refers to total training time, and total energy consumption refers to the product of average power and training time when power data are available. Accuracy and loss are interpreted according to the role they play in each model family and are not used interchangeably across all experiments. In the result tables, the symbol “–” indicates values that were not available or not measured for the corresponding platform or configuration. For consistency and readability in presentation, time and power are reported to two decimal places, energy to one decimal place, and accuracy/loss values to four decimal places. These formatting decisions are purely editorial and do not modify the underlying measurement procedure or the comparative analysis.

3.1. Results for CNN

Table 14 summarizes the results obtained when training the CNN model across the evaluated hardware platforms. The reported metrics include training time, average power consumption, total energy use, accuracy, and loss.
The RTX 5060 Ti, adopted as the baseline platform (1.0×), achieved the shortest training time (11.70 s), except for the initial run, which required 64.14 s, most likely due to a software-related anomaly. The experiment was repeated several times, and the same issue persisted in the first execution. This behavior is also documented in the original table available in the GitHub repository. Therefore, the first measurement was excluded from the reported results, as it was considered an execution artifact rather than representative hardware behavior. Despite this issue, the RTX 5060 Ti remained the fastest platform, likely due to its modern architecture and larger memory capacity.
Following the RTX 5060 Ti, the GTX 1660 Ti (20.79 s, 1.78× slower) and the Apple M4 GPU (21.90 s, 1.87× slower) achieved comparable performance. The latter result indicates that Apple’s Metal backend can provide training performance close to that of a mid-range NVIDIA GPU, despite the absence of CUDA support. The Google Colab GPU T4 required 52.52 s (4.49× slower), reflecting the throughput limitations of the T4-class accelerator typically available in free cloud environments, although it remains a practical no-cost alternative for experimentation.
In contrast, CPU-only configurations exhibited substantially longer training times. The Apple M4 CPU (80.19 s, 6.85× slower), Ryzen 7 7800X3D (162.67 s, 13.90× slower), and Ryzen 7 4800H (277.21 s, 23.69× slower) all lagged far behind the GPU-based baseline. At the lowest performance tier, the Raspberry Pi 5 (1505.50 s, approximately 25 min, 128.65× slower) and the Google Colab CPU (1418.18 s, approximately 24 min, 121.19× slower) further confirmed the practical limitations of training CNN models without GPU acceleration.
In terms of variability, the RTX 5060 Ti, Apple M4 GPU, Apple M4 CPU, GTX 1660 Ti, and Ryzen 7 4800H exhibited the most stable behavior, with standard deviations below 2.6 s, indicating consistent execution across repeated runs. In contrast, the Google Colab GPU T4 and the Ryzen 7 7800X3D showed markedly higher variability, with standard deviations of 21.13 s and 35.38 s, respectively. This dispersion suggests that cloud-based environments and thermally sensitive desktop systems may introduce greater timing instability, either because of dynamic resource allocation in shared infrastructure or because of frequency throttling under sustained load.
Overall, the relatively low variance observed in the remaining platforms indicates that the experimental protocol was stable and reproducible, and that the performance differences reported across devices are mainly attributable to hardware and system-level characteristics—including compute architecture, memory bandwidth, cache organization, software backend efficiency, and thermal-power behavior—rather than random fluctuations or measurement noise.
Regarding predictive performance, all platforms converged to very similar values, with accuracy around 0.47–0.48 and loss around 1.41–1.43. This result indicates that the hardware platform primarily affected execution time and energy behavior, while the final model convergence remained essentially unchanged under the fixed training configuration.
Although the achieved accuracy values are relatively low compared to fully optimized models, this outcome is consistent with the experimental design. The training configuration was intentionally constrained to ensure short and comparable execution times across heterogeneous hardware platforms, enabling multiple repetitions of each experiment. As a result, the model is not expected to reach optimal predictive performance, but rather to provide a consistent and controlled workload for benchmarking purposes.
In this context, accuracy serves primarily as a validation metric to confirm that all platforms converge to similar solutions under identical conditions. The small variations observed across devices further indicate that the computational differences are not affecting the learning outcome, reinforcing that the reported performance differences are attributable to hardware characteristics rather than discrepancies in model training.
Figure 2 illustrates the energy consumption associated with CNN training across the evaluated platforms. The Apple M4 GPU was the most energy-efficient platform (1599.5 J), followed by the GTX 1660 Ti (2008.7 J) and the RTX 5060 Ti (2914.3 J). Although the RTX 5060 Ti was the fastest platform, it also required more total energy than the M4 GPU and GTX 1660 Ti, reflecting the classical trade-off between raw speed and energy efficiency. Desktop CPUs performed poorly from this perspective: the Ryzen 7 7800X3D (20,220.4 J) and Ryzen 7 4800H (23,557.5 J) consumed between five and eight times more energy than the GPU-based platforms while reaching similar predictive performance. A particularly illustrative case was the Raspberry Pi 5 (18,758.5 J), whose low power draw (12.46 W) did not prevent a high accumulated energy consumption because of its extremely long execution time. By contrast, the Apple M4 CPU (5456.0 J) showed comparatively good efficiency, consuming substantially less energy than the x86 CPUs, likely due to its lower power envelope and ARM-based system-level optimization.

3.2. Results for Simple RNN

Table 15 summarizes the results obtained when training the Simple RNN model across the evaluated hardware platforms. The model was trained for one epoch on the Shakespeare character-level dataset using the Adam optimizer. The optimization objective was the sparse categorical cross-entropy loss, which measures the negative log-likelihood of the correct next character given the preceding sequence. No explicit accuracy metric was computed, since character-level accuracy is not a sufficiently informative indicator of language modeling performance.
Modern GPUs, particularly the RTX 5060 Ti and Google Colab GPU T4, provided the most balanced performance for Simple RNN training, whereas the Apple M4 CPU emerged as a competitive alternative even without dedicated graphics acceleration. The RTX 5060 Ti established the baseline with a training time of 33.80 s, making it the fastest platform and the reference for relative comparisons. The Google Colab GPU T4 required 37.16 s, only 1.10× slower than the baseline, and therefore represented a strong cloud-based alternative. The GTX 1660 Ti required 72.50 s, approximately 2.14× slower than the RTX 5060 Ti, but remained suitable for medium-scale recurrent training workloads.
Among CPU-based platforms, the Apple M4 CPU stood out with a training time of 57.94 s, only 1.71× slower than the baseline and clearly ahead of the Ryzen 7 4800H and Ryzen 7 7800X3D. In comparison, the Ryzen 7 4800H required 395.47 s (11.70× slower), whereas the Ryzen 7 7800X3D required 609.69 s (18.04× slower). These results indicate that the ARM-based Apple CPU handled recurrent workloads more efficiently than the evaluated x86 CPUs under this experimental setup.
At the lowest performance tier were the Raspberry Pi 5 CPU and the Google Colab CPU, with training times of 1275.25 s and 1597.64 s, respectively, corresponding to approximately 21 and 27 min of execution. These values make both platforms impractical for efficient Simple RNN training. The most unfavorable result was observed on the Apple M4 GPU, which required 2264.77 s (about 38 min), equivalent to 67.01× slower than the baseline. This behavior suggests substantial inefficiencies in the Metal backend for recurrent operations, especially when compared with the much stronger CPU performance of the same platform.
In terms of variance, the RTX 5060 Ti, Google Colab GPU T4, GTX 1660 Ti, and Apple M4 CPU exhibited the most stable execution times, all with standard deviations below 1.1 s. By contrast, the Ryzen 7 4800H and, more notably, the Ryzen 7 7800X3D and Apple M4 GPU showed considerably higher dispersion, with standard deviations ranging from 37.78 s up to 430.56 s. This elevated variability may be related to dynamic frequency scaling, thermal throttling, or backend-level scheduling effects under sustained recurrent workloads. The Raspberry Pi 5 and Google Colab CPU showed moderate variability relative to their total runtime, indicating that instability was not the main factor behind their poor overall performance.
Regarding convergence, all platforms reached sparse categorical cross-entropy values in the range of approximately 3.44 to 3.70, indicating broadly consistent Simple RNN training behavior across hardware. Minor differences, such as the slightly higher loss values observed on the Apple M4 CPU, Raspberry Pi 5 CPU, and Google Colab CPU, may be attributable to numerical precision differences or the use of distinct low-level execution libraries, such as Metal versus CUDA. These variations remain within an acceptable range and suggest that differences in hardware performance did not compromise overall learning stability.
Although the reported loss values are relatively high compared to fully optimized language models, this behavior is consistent with the constrained training setup adopted in this study. The model was intentionally trained for a limited number of iterations to ensure manageable execution times across all hardware platforms, enabling repeated measurements under controlled conditions.
In this context, sparse categorical cross-entropy serves as a proxy for learning consistency rather than an absolute performance metric. The close agreement of loss values across platforms indicates that all devices converge to similar model states under identical training conditions. This observation reinforces that the differences observed in execution time, power consumption, and energy usage are attributable to hardware characteristics rather than discrepancies in the optimization process.
Figure 3 presents the energy consumption associated with Simple RNN training. The Apple M4 CPU was the most energy-efficient measured platform, requiring 3609.8 J, followed by the RTX 5060 Ti with 5659.5 J and the GTX 1660 Ti with 7611.1 J. In contrast, the Ryzen 7 4800H and Ryzen 7 7800X3D consumed 28,387.1 J and 73,882.7 J, respectively, making them substantially less favorable from an energy-efficiency perspective. Although the Raspberry Pi 5 exhibited low instantaneous power draw, it accumulated 14,410.4 J because of its very long training time.
The most extreme case was the Apple M4 GPU, which consumed 124,970.0 J. Its disproportionate combination of execution time and total energy strongly suggests that the Metal backend, at least under the tested software configuration, is not well optimized for Simple RNN training. This result contrasts sharply with the favorable performance of the Apple M4 CPU and highlights the importance of backend-specific behavior when evaluating recurrent workloads on heterogeneous hardware.

3.3. Results for RNN–LSTM

Table 16 summarizes the results obtained when training the RNN–LSTM model across the evaluated hardware platforms. The reported metrics include training time, average power consumption, total energy use, and sparse categorical cross-entropy loss.
The evaluation of the RNN–LSTM model revealed a markedly different pattern from that observed for the Simple RNN experiments. In this case, the performance gap between platforms was smaller, suggesting that LSTM-based workloads are better supported by current software backends and compilers. The RTX 5060 Ti established the baseline with the shortest training time (20.34 s, 1.0×). However, the Apple M4 GPU required only 24.59 s (1.21× slower), while the Apple M4 CPU completed training in 26.83 s (1.32× slower). These results indicate that Apple Silicon handled LSTM workloads much more efficiently than it did in the Simple RNN case, and that the Metal backend appears to be better suited for this type of recurrent architecture.
The GTX 1660 Ti and Google Colab GPU T4 also achieved competitive performance, with training times of 37.95 s and 34.48 s, corresponding to 1.87× and 1.70× slower execution than the baseline, respectively. The Google Colab CPU required 76.52 s (3.76× slower), which, although substantially slower than GPU-based platforms, remained within a usable range for moderate experimentation. By contrast, the Ryzen 7 4800H, Ryzen 7 7800X3D, and Raspberry Pi 5 exhibited very poor performance, with training times of 1196.21 s, 1556.77 s, and 2968.15 s, corresponding to 58.81×, 76.54×, and 145.93× slower execution than the RTX 5060 Ti. These runtimes make such platforms impractical for routine RNN–LSTM training.
In terms of variability, the most unstable platform was the Ryzen 7 7800X3D CPU, with a standard deviation of 345.83 s, followed by the Ryzen 7 4800H CPU (99.15 s) and the Raspberry Pi 5 CPU (11.50 s). All remaining platforms exhibited standard deviations below 1.31 s. This result suggests that the x86 CPU platforms were much more sensitive to runtime fluctuations, likely because of cache contention, dynamic frequency scaling, and thermal throttling under sustained workloads. By contrast, the Apple M4 CPU, Apple M4 GPU, RTX 5060 Ti, and Google Colab GPU T4 delivered highly consistent execution times, indicating that Apple Silicon and modern GPU platforms provide not only faster but also more stable performance for LSTM-based sequence modeling.
Regarding convergence, two distinct groups can be observed in terms of final loss values. The Apple M4 GPU, Apple M4 CPU, Google Colab GPU T4, Google Colab CPU, and Raspberry Pi 5 reached average losses around 3.18–3.19, whereas the RTX 5060 Ti, GTX 1660 Ti, Ryzen 7 7800X3D, and Ryzen 7 4800H converged to lower values around 2.84–2.85. Although this difference is moderate, it suggests some sensitivity to backend implementation, numerical precision, or low-level library behavior. Even so, the overall stability of the loss values across repeated runs indicates that the RNN–LSTM training process remained robust across all evaluated platforms.
The observed separation in convergence levels may reflect minor implementation-dependent effects, potentially related to numerical precision or backend-specific optimizations. Despite this difference, all platforms exhibit stable training behavior across repeated runs, indicating that the learning process remains consistent under the fixed experimental conditions. These results support that the observed performance differences are primarily associated with hardware characteristics rather than instability in the optimization process.
Figure 4 illustrates the energy consumption associated with RNN–LSTM training. The best results were obtained on Apple Silicon: the Apple M4 GPU required 2170.7 J and the Apple M4 CPU required 2230.6 J, both substantially lower than the RTX 5060 Ti, which consumed 5586.4 J despite being the fastest platform. The GTX 1660 Ti also showed a favorable balance between energy use and execution time, requiring 4495.8 J. In contrast, the Ryzen 7 7800X3D CPU (192,261.3 J), Ryzen 7 4800H CPU (84,763.3 J), and Raspberry Pi 5 CPU (33,955.7 J) were markedly inefficient, mainly because of their extremely long runtimes. These results reinforce the advantage of Apple Silicon for LSTM-based workloads when energy efficiency is prioritized alongside acceptable execution speed.

3.4. Results for XGBoost

Table 17 summarizes the results obtained when training the XGBoost model across the evaluated hardware platforms. The reported metrics include training time, average power consumption, total energy use, and classification accuracy. Loss is not reported, as the primary evaluation criterion in this experiment is the final predictive accuracy, which fully characterizes model performance for this supervised classification task.
In the case of XGBoost, the execution dynamics differed markedly from those observed in the neural network models, showing a much more balanced distribution of runtimes across hardware architectures. The RTX 5060 Ti achieved the shortest training time (16.66 s), establishing the reference baseline. The Google Colab GPU T4 (31.30 s, 1.88× slower) and the Apple M4 CPU (32.86 s, 1.97× slower) followed closely, indicating that XGBoost benefits less from massive GPU parallelism and remains highly efficient on modern CPU architectures.
The GTX 1660 Ti required 39.68 s (2.38× slower), maintaining acceptable performance but remaining behind the leading platforms. Among the x86 CPUs, the Ryzen 7 7800X3D required 76.09 s (4.57× slower), whereas the Ryzen 7 4800H required 347.28 s (20.84× slower). At the lowest end, the Google Colab CPU (753.83 s, 45.25× slower) and the Raspberry Pi 5 (1429.53 s, 85.81× slower) proved impractical for medium- or large-scale training because of their long execution times. The Apple M4 GPU was not included in this experiment because it was not supported by the available Metal backend.
From the perspective of energy efficiency, the Apple M4 CPU was the most favorable measured platform, requiring only 2234.5 J, even lower than the RTX 5060 Ti (2955.8 J), which nonetheless offered the best overall trade-off between speed and energy. The GTX 1660 Ti consumed 3983.7 J, which reduced its relative efficiency despite still providing moderate training speed. The Ryzen 7 7800X3D and Ryzen 7 4800H consumed 10,538.2 J and 22,211.2 J, respectively, confirming that prolonged XGBoost training on higher-power general-purpose CPUs is less efficient than on optimized GPU or ARM-based platforms. The Raspberry Pi 5, despite its low power draw (10.80 W), accumulated 15,438.6 J because of its extremely long runtime. As in previous experiments, no power or energy measurements were available for the Colab platforms, which limits direct comparison on this axis.
Figure 5 reinforces these trends. The Apple M4 CPU and RTX 5060 Ti provided the most favorable balance between execution time and energy use, while the GTX 1660 Ti occupied an intermediate position. In contrast, the Ryzen processors and the Raspberry Pi 5 required substantially more accumulated energy, mainly because of their longer execution times rather than extremely high instantaneous power.
In terms of variance, XGBoost exhibited very stable behavior across all platforms. The RTX 5060 Ti, Apple M4 CPU, and Google Colab GPU T4 showed the lowest variability, with standard deviations below 0.3 s, indicating highly consistent execution and deterministic workload scheduling. The GTX 1660 Ti showed only slightly higher dispersion (0.56 s), which remained negligible relative to its total runtime. Among the CPU-based systems, the Ryzen 7 4800H (11.06 s) and Ryzen 7 7800X3D (7.82 s) displayed moderate variance, likely associated with thermal effects or background process fluctuations during longer runs. The Raspberry Pi 5 (4.48 s) and Google Colab CPU (8.99 s) also showed low relative variance compared with their total runtime, suggesting stable behavior even in constrained or virtualized environments. Overall, the very small standard deviations confirm that XGBoost is a highly deterministic and reproducible algorithm, with limited sensitivity to stochastic or hardware-level variation during training.
Regarding predictive performance, all platforms converged to essentially identical accuracy values in the range of 0.815–0.819. The slight difference between GPU-based platforms (0.819) and CPU-based platforms (0.815) is consistent with the use of different tree construction methods (gpu_hist versus auto), which may produce marginally different histogram binning. This result highlights the strong numerical stability and experimental reliability of XGBoost across heterogeneous hardware platforms. Unlike neural network training, where floating-point precision or backend-specific implementations may introduce slight differences in convergence, XGBoost produced virtually identical results across all tested platforms.
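For illustration, the tree-method selection mentioned above can be expressed as follows. This is a minimal sketch using the scikit-learn-style XGBoost API with a synthetic placeholder dataset, not the exact benchmark pipeline; the study reports gpu_hist on CUDA GPUs and auto on CPUs, which on the XGBoost 2.0+ releases listed in Table 2 corresponds to the hist method combined with the device parameter.

```python
# Minimal sketch of per-platform tree-method selection in XGBoost. The
# synthetic dataset and hyperparameters are placeholders; on XGBoost >= 2.0,
# GPU training is requested with tree_method="hist" plus device="cuda".
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

use_gpu = False  # set True on CUDA-enabled platforms
clf = xgb.XGBClassifier(
    n_estimators=200,
    tree_method="hist",                   # histogram-based tree construction
    device="cuda" if use_gpu else "cpu",  # GPU vs. CPU execution
    random_state=42,
)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```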
Overall, the runtime analysis revealed behavior markedly different from that observed in deep neural networks. XGBoost is much better optimized for CPU execution and scales efficiently even on non-specialized hardware. Although the RTX 5060 Ti remained the fastest platform, the differences with the Apple M4 CPU and Google Colab GPU T4 were relatively small compared with the gaps observed in the CNN, Simple RNN, RNN–LSTM, and BiLSTM experiments. These results indicate that the hardware requirements of XGBoost are less dependent on specialized accelerators and more compatible with efficient CPU-based execution, making it a robust option for supervised learning tasks in environments without high-end GPU resources.

3.5. Results for BiLSTM

Table 18 summarizes the results obtained when training the BiLSTM model across the different hardware platforms. The reported metrics include training time, average power consumption, total energy use, accuracy, and loss.
The evaluation of the BiLSTM model highlights substantial differences in execution efficiency across the tested platforms. The RTX 5060 Ti established the performance baseline with the shortest training time (8.03 s, 1.0×), confirming its advantage for recurrent sequence modeling under GPU acceleration. The GTX 1660 Ti (14.20 s, 1.77× slower) and the Apple M4 GPU (13.91 s, 1.73× slower) achieved similar intermediate performance, while the Google Colab GPU T4 required 39.46 s (4.91× slower), remaining a practical cloud-based alternative for prototyping workloads.
Among CPU-based platforms, the Apple M4 CPU delivered the strongest result, completing training in 82.26 s (10.24× slower than the baseline). The Ryzen 7 7800X3D and Ryzen 7 4800H required 125.50 s and 207.91 s, corresponding to 15.63× and 25.89× slower execution, respectively. These results indicate that recurrent bidirectional sequence modeling can still be handled reasonably well on modern CPUs, although the performance gap relative to GPU-based platforms remains substantial.
At the lowest end of performance, the Google Colab CPU and the Raspberry Pi 5 required 469.48 s and 739.31 s, respectively, corresponding to 58.47× and 92.07× slower execution than the baseline. Such runtimes make these platforms unsuitable for efficient BiLSTM training beyond educational, low-cost, or baseline experimentation scenarios.
Figure 6 shows the energy consumption during BiLSTM training across the evaluated platforms, and the energy-efficiency analysis confirmed these differences. The Apple M4 GPU was the most energy-efficient platform, requiring only 408.8 J, followed by the GTX 1660 Ti (893.9 J) and the RTX 5060 Ti (1827.8 J). Although the RTX 5060 Ti was the fastest platform, it consumed more total energy than the two mid-tier GPU alternatives. The Apple M4 CPU also showed competitive efficiency at 2809.9 J, substantially lower than the Ryzen 7 7800X3D (13,760.9 J) and Ryzen 7 4800H (20,874.2 J). The Raspberry Pi 5, despite its low power draw, accumulated 7733.2 J because of its long runtime.
Training times also revealed meaningful stability differences across platforms. The RTX 5060 Ti and GTX 1660 Ti exhibited the lowest variability, with standard deviations below 0.2 s, indicating highly consistent performance across repeated runs. The Apple M4 GPU showed moderate dispersion (0.93 s), whereas the Google Colab GPU T4 and the Ryzen 7 7800X3D displayed higher variability, with standard deviations of 10.94 s and 5.21 s, respectively. The Raspberry Pi 5 and Google Colab CPU also showed noticeable dispersion, which is expected given their lower performance and less predictable runtime conditions. Nevertheless, the observed variability did not alter the overall ranking among platforms.
With respect to convergence, all platforms achieved accuracy values between approximately 0.22 and 0.33, and loss values between 1.52 and 1.80. The RTX 5060 Ti, GTX 1660 Ti, and the Ryzen CPUs reached the lowest loss values, all close to 1.52, with accuracy near 0.32. The Apple M4 GPU produced the least favorable convergence metrics, with a loss of 1.8010 and an accuracy of 0.2217, suggesting slight differences in training dynamics, potentially associated with backend implementation or numerical precision. Even so, the spread in convergence values remained relatively limited, indicating that hardware mainly affected speed and energy behavior rather than the overall ability of the model to learn.
Overall, BiLSTM training revealed a clear separation between dedicated GPU accelerators and general-purpose platforms. The RTX 5060 Ti defined the upper bound in speed, while the Apple M4 GPU emerged as the most energy-efficient alternative. CPU-only configurations were strongly penalized in execution time, particularly in low-power or shared-resource environments. These results are consistent with the broader trends observed in the other experiments: hardware acceleration and backend maturity remain decisive factors for efficient supervised learning, while model convergence remains comparatively stable across platforms.

3.6. Cross-Platform Performance

Figure 7 summarizes the relative training times of the evaluated models across all hardware platforms, normalized to the RTX 5060 Ti (1.0×). The X-axis represents model families, while the Y-axis lists the hardware platforms. Each cell indicates the relative training time with respect to the RTX 5060 Ti for that model. The color scale maps lighter tones to faster results and darker tones to slower executions.
This global representation confirms the trends observed in the detailed results: CUDA-enabled GPUs remain dominant in speed, with the RTX 5060 Ti establishing the reference baseline and the GTX 1660 Ti acting as a consistent mid-range performer. Apple Silicon exhibits competitive performance in RNN–LSTM, BiLSTM, and XGBoost, where the M4 CPU and M4 GPU approach GPU-class execution in some workloads. In contrast, x86 CPUs present a clear penalty in both time and energy, while the Raspberry Pi 5 serves mainly as an educational or low-cost baseline reference. Google Colab, in turn, should be interpreted as a practical cloud-based reference environment rather than as a strictly equivalent counterpart to dedicated local hardware, since its observed behavior may be influenced by shared-resource allocation, virtualization overhead, and provider-side scheduling. Overall, the heatmap reinforces that hardware performance is highly model-dependent.
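For reference, the normalization underlying Figure 7 is straightforward: each platform's mean training time is divided by the RTX 5060 Ti time for the same model, so the baseline maps to 1.0× and larger values indicate slower execution. The sketch below uses the Simple RNN means reported in Section 3.2.

```python
# Minimal sketch of the baseline normalization used in Figure 7, using the
# Simple RNN training times reported in Section 3.2 (seconds).
times_s = {
    "RTX 5060 Ti": 33.80,
    "Google Colab GPU T4": 37.16,
    "Apple M4 CPU": 57.94,
    "GTX 1660 Ti": 72.50,
}
baseline = times_s["RTX 5060 Ti"]
relative = {name: round(t / baseline, 2) for name, t in times_s.items()}
print(relative)  # {'RTX 5060 Ti': 1.0, ..., 'GTX 1660 Ti': 2.14}
```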

3.7. Batch Size Sensitivity Analysis

Additional experiments were performed by systematically varying the batch size across all platforms. Separate implementations were used for CPU and GPU to ensure reasonable training times while maintaining comparable workloads. All experiments were conducted using the same datasets as the original benchmarks presented earlier, and, under consistent experimental conditions, total training time was recorded for each configuration. To keep the analysis tractable while still capturing representative behaviors, only CNN and LSTM models were considered for batch size evaluation. These models represent two fundamentally different computational patterns: highly parallel workloads (CNN) and sequential, dependency-bound workloads (LSTM). As such, they provide meaningful insight into how batch size affects both compute-bound and sequential models across architectures.
The resulting figures reveal the dependence of performance on batch size for different hardware platforms and model types. Figure 8 shows that increasing batch size reduces training time across CPU platforms due to lower overhead per update. The Apple M4 Pro reaches optimal performance at batch 4096, after which performance degrades, likely due to memory and cache limitations. The M4 Air follows a similar trend with earlier degradation, while the Ryzen 7 exhibits less stable behavior, indicating sensitivity to memory hierarchy. Overall, CPU performance improves with batch size up to a hardware-dependent optimum.
Figure 9 shows that LSTM performance on CPUs improves gradually with batch size, reflecting the sequential nature of the model. The M4 Pro achieves the best performance at batch 2048 with stable behavior, while the M4 Air follows a similar trend with higher latency. The Ryzen 7 shows variability at small batch sizes, indicating inefficiencies in sequential workload handling. Overall, LSTM models benefit from moderate batch increases with limited scalability compared to CNNs.
Figure 10 shows that CNN performance on GPUs improves rapidly as batch size increases, followed by a clear saturation point. The RTX 5060 Ti reaches near-optimal performance at batch 128, while Apple GPUs require larger batch sizes (1024–2048) to reach similar regimes. Beyond this point, performance remains stable, indicating full utilization of parallel resources.
Figure 11 shows that LSTM performance on GPUs improves more gradually and stabilizes around batch sizes of 512–1024. Differences between platforms are smaller than in CNN workloads, indicating limited ability of recurrent models to exploit GPU parallelism due to their sequential nature.
The results demonstrate that the impact of batch size depends strongly on both the hardware and the model architecture. GPUs achieve significant gains for CNN workloads up to a saturation point, while CPUs benefit from larger batch sizes but degrade when memory limits are reached. For LSTM models, the effect is less pronounced due to their sequential nature. The experiments show that optimal batch size is not universal and varies across platforms and workloads. Therefore, using a single batch configuration may lead to suboptimal hardware utilization and biased comparisons. This highlights that, while the controlled benchmarking setup ensures fair and reproducible comparison, hardware-aware tuning can provide additional insight into the practical performance limits of each platform.
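The sweep described in this subsection can be reproduced with a short timing loop. The sketch below assumes TensorFlow/Keras and uses a small stand-in CNN on CIFAR-10 rather than the full benchmark models; only the measurement pattern (recording total training time for each batch-size configuration) reflects the protocol described above, and the single-epoch run is an assumption made for brevity.

```python
# Minimal sketch of a batch-size sweep. The build_model() factory is a small
# stand-in CNN, not the benchmark architecture; only the timing loop mirrors
# the protocol (training time recorded per batch-size configuration).
import time
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])

results = {}
for batch_size in [64, 128, 256, 512, 1024, 2048, 4096]:
    model = build_model()
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    start = time.perf_counter()
    model.fit(x_train, y_train, epochs=1, batch_size=batch_size, verbose=0)
    results[batch_size] = time.perf_counter() - start

for bs, t in results.items():
    print(f"batch {bs:5d}: {t:.2f} s")
```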
The optimal batch size for CPU platforms varies significantly across architectures, as shown in Table 19. The Apple M4 Pro consistently achieves the best performance, requiring large batch sizes (2048–4096) to maximize efficiency. This behavior is enabled by its unified memory architecture and high memory bandwidth, which allow efficient handling of large data blocks and reduced data movement overhead. The M4 Air follows a similar trend with slightly smaller optimal batches, likely due to thermal constraints and reduced sustained performance. In contrast, the Ryzen 7 reaches its best performance at much lower batch sizes, indicating earlier saturation. This behavior can be attributed to its cache hierarchy and memory access patterns, where larger batch sizes increase pressure on caches and memory bandwidth, leading to reduced efficiency. Overall, these results confirm that CPUs benefit from increasing batch size to amortize computational overhead, but the optimal configuration is strongly hardware-dependent and influenced by memory architecture.
For GPU platforms, optimal batch sizes are generally smaller than for CPUs and depend strongly on the underlying architecture, as shown in Table 20. The RTX 5060 Ti achieves optimal performance at relatively low batch sizes (128 for CNN, 1024 for LSTM), reflecting the efficiency of CUDA-based parallel execution, where thousands of threads can be effectively utilized even with moderate workloads. In contrast, Apple GPUs require larger batch sizes to reach peak performance. This behavior is related to higher kernel launch overhead and the need to amortize data movement and synchronization costs in the Metal backend. Additionally, while Apple GPUs benefit from unified memory, they exhibit lower raw parallel throughput compared to dedicated discrete GPUs, which limits scalability. These results highlight fundamental differences in how GPU architectures exploit parallelism: NVIDIA GPUs achieve high efficiency at smaller batch sizes due to mature software stacks and massive parallelism, while Apple GPUs require larger workloads to reach comparable utilization.
It is important to distinguish the behavior between CNN and LSTM models. CNN workloads exhibit high data parallelism, allowing both CPUs and GPUs to efficiently exploit larger batch sizes by increasing arithmetic intensity and reducing memory access overhead. This results in higher throughput and a clear optimal batch size, particularly evident on GPU platforms where parallel execution is maximized. In contrast, LSTM models are inherently sequential due to temporal dependencies, which limits parallelism and reduces the benefits of increasing batch size. As a result, performance improvements are more gradual and tend to saturate earlier, which explains why optimal batch sizes for LSTM are generally smaller or provide diminishing returns compared to CNNs.
Hardware architecture also shapes these differences. GPUs, especially NVIDIA architectures, benefit significantly in CNN workloads due to massive parallel execution (e.g., CUDA cores), while their advantage is reduced in LSTM models. Similarly, CPU performance is more sensitive to memory hierarchy and cache behavior, which affects CNN workloads more strongly than LSTM workloads. These observations explain the differences in optimal batch sizes across models and platforms reported in Table 19 and Table 20, and they show that comparisons based on a single fixed batch size can be misleading when evaluating heterogeneous architectures with different memory and parallel execution characteristics; hardware-aware tuning is therefore necessary to achieve fair and representative comparisons.

3.8. Training Stability Across Epochs

To further validate the reliability of the benchmarking methodology, additional experiments were conducted by varying the number of training epochs while keeping all other parameters fixed. This analysis aims to determine whether the measured performance is affected by initialization overhead or if it reflects steady-state behavior.
As shown in Figure 12, the training time per epoch remains nearly constant across different epoch configurations for both CNN and LSTM models. Although a slightly higher cost is observed for the first epoch due to initialization effects (e.g., memory allocation and kernel setup), this overhead rapidly diminishes and does not significantly impact subsequent iterations.
Furthermore, Figure 13 shows that throughput remains stable as the number of epochs increases, indicating that the system operates in a steady-state regime. This behavior is consistent across all evaluated platforms, with GPUs exhibiting lower variance due to higher levels of parallelism, while CPUs show slightly higher variability, particularly under more demanding workloads. These results confirm that the selected experimental configuration is sufficient to capture representative performance trends without requiring long training durations. In particular, the observed linear scaling of total training time with respect to the number of epochs and the stability of throughput demonstrate that the measurements are not dominated by initialization overhead, addressing a common concern in short-duration benchmarking.
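For reference, the per-epoch timing and throughput measurements summarized in Figures 12 and 13 can be collected with a simple callback. The sketch below assumes TensorFlow/Keras and uses a placeholder model and random data, so only the measurement pattern is representative of the procedure.

```python
# Minimal sketch of per-epoch timing for steady-state checks. The model and
# data are placeholders; only the measurement pattern is illustrated.
import time
import numpy as np
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Records wall-clock time for each training epoch."""
    def on_train_begin(self, logs=None):
        self.epoch_times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()
    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)

x = np.random.rand(10000, 32).astype("float32")
y = np.random.randint(0, 10, size=(10000,))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

timer = EpochTimer()
model.fit(x, y, epochs=5, batch_size=512, callbacks=[timer], verbose=0)

for i, t in enumerate(timer.epoch_times, start=1):
    print(f"epoch {i}: {t:.3f} s, throughput = {len(x) / t:.0f} samples/s")
```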

3.9. Performance per Dollar Analysis

To incorporate the economic dimension of accessibility, a performance-per-dollar metric was computed based on the normalized results shown in Figure 7. These values represent relative training time across platforms (where lower values indicate better performance).
First, the relative training time was converted into a performance metric by taking its inverse (i.e., faster platforms obtain higher scores). Then, for each platform, the performance values were averaged across all evaluated workloads, including CNN, RNN, LSTM, BiLSTM, and XGBoost, to obtain a single representative score. Finally, this score was divided by the hardware cost in USD to compute the performance-per-dollar metric.
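The three steps above can be summarized in a few lines of code. The sketch below uses hypothetical relative times and hardware prices as placeholders, since the actual inputs are the normalized values of Figure 7 and the costs listed in Table 21.

```python
# Minimal sketch of the performance-per-dollar computation. Relative times and
# USD prices below are hypothetical placeholders, not the study's values.
relative_time = {  # lower = faster; normalized to the RTX 5060 Ti baseline
    "Platform A": {"CNN": 1.0, "RNN": 1.0, "LSTM": 1.0, "BiLSTM": 1.0, "XGBoost": 1.0},
    "Platform B": {"CNN": 2.0, "RNN": 1.7, "LSTM": 1.3, "BiLSTM": 1.7, "XGBoost": 2.0},
}
price_usd = {"Platform A": 1500.0, "Platform B": 2000.0}  # hypothetical costs

for platform, rel in relative_time.items():
    performance = [1.0 / r for r in rel.values()]   # step 1: invert relative time
    score = sum(performance) / len(performance)      # step 2: average over workloads
    print(platform, round(score / price_usd[platform], 6))  # step 3: divide by cost
```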
The resulting values are summarized in Table 21. Higher values indicate better cost-efficiency, meaning more computational performance is obtained per unit cost. The MacBook Air M4 was not included in the initial benchmarking experiments due to its limited availability at the time of evaluation. However, it is included in this techno-economic analysis as a reference point for cost-efficiency, given its lower price and architectural similarity to the M4 Pro platform, which demonstrated strong performance across all workloads. The reported values for the M4 Air are derived from additional experiments conducted during the batch size analysis, where both the M4 Air and M4 Pro were evaluated under identical conditions. These results indicate that the M4 Air achieves approximately 50% of the performance of the M4 Pro across representative workloads. This difference can be attributed to hardware-level factors, including the higher number of CPU and GPU cores available in the M4 Pro, as well as its active cooling system, which allows sustained operation at higher power levels compared to the passively cooled MacBook Air. The resulting estimate was used to approximate the aggregated performance score of the M4 Air and is intended for qualitative comparison only.
The results indicate that the RTX 5060 Ti achieves both the highest absolute performance and the highest performance-per-dollar ratio among the evaluated platforms. While Apple Silicon devices remain competitive, a clear performance-per-dollar gap is observed, with the MacBook Pro M4 Pro and MacBook Air M4 achieving lower cost-efficiency compared to the desktop GPU system. From a practical standpoint, this suggests that high-end GPU platforms provide a more favorable return on investment for compute-intensive workloads. However, the difference is not sufficiently large to invalidate the use of consumer-grade devices. In particular, users who already own systems such as the MacBook Air M4 or MacBook Pro M4 Pro may not obtain proportional benefits from investing in dedicated GPU hardware unless their applications require significantly higher throughput or reduced training times. Therefore, while GPU-based systems maximize both raw performance and cost-efficiency, Apple Silicon platforms still represent a viable and efficient alternative for moderate workloads, where the additional performance of dedicated hardware may not justify the increased cost.

4. Discussion

The results obtained from the five evaluated techniques (CNN, Simple RNN, RNN–LSTM, BiLSTM, and XGBoost) allow the identification of consistent patterns regarding the impact of hardware on the training of supervised learning models. The comparative analysis across CUDA GPUs, Apple Silicon, x86 CPUs, and low-power devices such as the Raspberry Pi reveals clear differences not only in execution time, but also in energy efficiency, stability, and scalability.

4.1. Global Hardware Comparison

Table 22 summarizes the main findings derived from this comparative analysis.

4.2. Architecture and Hardware Behavior

Recurrent architectures exhibited lower computational efficiency due to their inherently sequential nature, which limits parallelization and reduces hardware utilization compared to convolutional models. This effect was particularly evident in Simple RNN and, to a lesser extent, in LSTM-based models, resulting in longer runtimes across all platforms. In contrast, CNN models benefited from high parallelism, achieving significantly better performance on GPU-based systems.
In turn, XGBoost demonstrated strong CPU efficiency and minimal dependence on GPU acceleration, maintaining consistent predictive performance across platforms.

4.3. Apple Silicon Performance

The Apple M4 GPU demonstrated outstanding energy efficiency across compatible workloads, often consuming substantially less energy than CUDA-enabled GPUs. Where execution was supported, performance remained competitive, particularly for RNN–LSTM and BiLSTM models.
The Apple M4 CPU delivered consistently strong results for CPU-oriented tasks, achieving favorable performance in recurrent models and XGBoost, while outperforming the evaluated x86 processors in both speed and energy efficiency.

4.4. CPU and Edge Limitations

The x86 CPUs showed a pronounced performance deficit in neural network workloads, often running one to two orders of magnitude slower than the RTX 5060 Ti and consuming significantly more energy. Only in XGBoost did they produce moderately acceptable runtimes.
In edge computing, the Raspberry Pi 5 exhibited extremely long training times, showing that low instantaneous power does not imply low total energy consumption. While suitable for inference or educational purposes, it is not practical for supervised training.

4.5. Energy Efficiency Trends

From an energy-efficiency perspective, Apple Silicon frequently provided the most favorable energy profile for neural-network workloads, especially in recurrent models. For XGBoost, the Apple M4 CPU was the most efficient platform, outperforming even GPUs. These findings demonstrate that hardware selection must be aligned with model characteristics.

4.6. Batch Size, Stability, and Cost-Efficiency

The additional experiments provide further insight into the role of workload configuration and economic factors in hardware evaluation. The batch size analysis demonstrated that performance can vary significantly depending on the selected configuration, particularly on CPU-based platforms, where suboptimal batch sizes can substantially underestimate achievable performance. While the controlled benchmarking setup ensures fair and reproducible comparison, hardware-aware tuning therefore provides complementary insight into platform-specific behavior. The evaluation across different epoch configurations showed that the training time per epoch remains stable, indicating that the reported results reflect steady-state behavior rather than initialization overhead and supporting the reliability of the benchmarking methodology. Finally, the performance-per-dollar analysis revealed that, although high-end GPU platforms provide the best absolute performance, their cost-efficiency advantage over consumer-grade devices such as Apple Silicon is moderate, suggesting that lower-cost platforms may offer a more balanced trade-off between performance and economic accessibility in many practical scenarios.

4.7. Practical Implications

GPU-based systems remain the most suitable option for compute-intensive deep learning workloads, particularly when minimizing training time is critical. However, Apple Silicon platforms provide a compelling alternative for moderate workloads, offering a favorable balance between performance and energy efficiency. Consequently, hardware selection should be guided by workload characteristics, resource constraints, and the required trade-off between performance and efficiency.

4.8. Limitations

This study has some limitations that should be considered when interpreting the results. The benchmark does not include alternative accelerators such as AMD GPUs or TPUs. In addition, simplified training configurations were adopted to ensure comparable execution times across platforms, so the reported results do not represent fully optimized training regimes.
Google Colab also introduces an additional limitation, since its shared and virtualized infrastructure is not fully controllable by the user. Therefore, Colab-based measurements should be interpreted as practical cloud-reference results rather than as strictly equivalent counterparts to those obtained on dedicated local hardware.
Finally, environmental variables such as room temperature and humidity were not instrumentally monitored or controlled. The experiments were conducted under normal indoor operating conditions, and no alarms, interruptions, or hardware-related operational failures were observed during the runs.

4.9. Final Remarks

Hardware selection for machine learning is a multi-dimensional problem involving trade-offs between execution time, energy efficiency, stability, and backend support. Model-aware benchmarking is therefore essential for informed decision-making.

5. Conclusions

This study presented a unified and reproducible benchmarking framework for evaluating supervised learning models across heterogeneous and accessible hardware platforms under controlled experimental conditions.
The main contribution of this work is to show that hardware evaluation for supervised learning must be treated as a multi-criteria problem, since runtime, energy efficiency, and stability do not necessarily favor the same platform. In this sense, the proposed framework provides a practical and reproducible basis for comparing heterogeneous systems.
The results showed that hardware suitability is strongly model-dependent. CUDA-enabled GPUs consistently delivered the fastest and most stable performance, with the RTX 5060 Ti establishing the overall runtime baseline. In addition, GPU-based platforms demonstrated stable behavior across different epoch configurations, confirming that the measured performance reflects steady-state operation.
The batch size analysis revealed that hardware performance is highly sensitive to workload configuration. In particular, CPU-based platforms exhibited strong dependence on batch size, where suboptimal configurations can significantly underestimate their performance. This highlights the importance of hardware-aware parameter tuning to ensure fair comparisons across heterogeneous systems.
From a techno-economic perspective, the performance-per-dollar analysis showed that RTX 5060 Ti provides both the highest absolute performance and the highest cost-efficiency. However, the advantage over Apple Silicon platforms is moderate rather than overwhelming, indicating that consumer-grade devices can still provide an efficient alternative for moderate workloads.
Overall, the study confirms that no single platform is universally optimal across all workloads: CUDA-enabled GPUs are preferable when maximum performance and stability are required, whereas Apple Silicon platforms offer a strong balance between performance, energy efficiency, and accessibility.
Future work should extend this benchmarking strategy to additional accelerators, including AMD GPUs, TPUs, NPUs, and emerging low-power AI hardware, as well as to longer training regimes and broader model families. It should also examine how batch size affects stability across platforms, particularly in terms of memory limits and failure behavior under demanding workloads, and develop standardized methodologies for evaluating inference performance in edge AI scenarios.

Author Contributions

Conceptualization, M.E.G.N. and O.H.S.-H.; methodology, M.E.G.N., O.H.S.-H., J.X.L.-M. and F.P.; software, M.E.G.N. and O.H.S.-H.; validation, M.E.G.N., O.H.S.-H., J.X.L.-M., E.F.C.C. and F.P.; formal analysis, M.E.G.N., O.H.S.-H. and F.P.; investigation, M.E.G.N., O.H.S.-H., J.X.L.-M. and F.P.; data curation, M.E.G.N. and O.H.S.-H.; writing—original draft preparation, M.E.G.N. and O.H.S.-H.; writing—review and editing, E.F.C.C. and F.P.; visualization, M.E.G.N. and O.H.S.-H.; supervision, E.F.C.C. and F.P.; project administration, F.P.; funding acquisition, F.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available. In addition, all training code used in this study is publicly accessible through the following GitHub repository: https://github.com/OscarHSierra/cpu-gpu-training-benchmark (accessed on 28 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ML: Machine Learning
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
BiLSTM: Bidirectional Long Short-Term Memory
GPU: Graphics Processing Unit
CPU: Central Processing Unit
TPU: Tensor Processing Unit
NPU: Neural Processing Unit
CUDA: Compute Unified Device Architecture
cuDNN: CUDA Deep Neural Network
ARM: Advanced RISC Machine
SoC: System-on-Chip
IoT: Internet of Things
RAM: Random Access Memory
SSD: Solid-State Drive
NVMe: Non-Volatile Memory Express
PCIe: Peripheral Component Interconnect Express
LPDDR5: Low-Power Double Data Rate 5
LPDDR4X: Low-Power Double Data Rate 4X
GDDR6: Graphics Double Data Rate 6
TDP: Thermal Design Power
ISA: Instruction Set Architecture
VM: Virtual Machine
FLOPS: Floating Point Operations Per Second
XLA: Accelerated Linear Algebra
XGBoost: Extreme Gradient Boosting
AI: Artificial Intelligence

Figure 1. General methodological workflow adopted for the comparative evaluation of heterogeneous computational platforms, including hardware selection and characterization, software environment configuration, controlled experimental setup, repeated model execution, metric collection, statistical analysis, and cross-platform comparison.
Figure 2. Energy consumption during CNN training across the evaluated platforms.
Figure 3. Energy consumption during Simple RNN training across the evaluated platforms.
Figure 4. Energy consumption during RNN–LSTM training across the evaluated platforms.
Figure 5. Energy consumption during XGBoost training across the evaluated platforms.
Figure 6. Energy consumption during BiLSTM training across the evaluated platforms.
Figure 7. Relative training times across platforms and models, normalized to RTX 5060 Ti (1.0×). The heatmap illustrates the performance gap between architectures.
Figure 8. Impact of batch size on training time for CNN models on CPU platforms.
Figure 9. Impact of batch size on training time for LSTM models on CPU platforms.
Figure 10. Impact of batch size on training time for CNN models on GPU platforms.
Figure 11. Impact of batch size on training time for LSTM models on GPU platforms.
Figure 12. Time per epoch vs. number of epochs for CNN and LSTM models.
Figure 13. Throughput vs. number of epochs for CNN and LSTM models.
Table 1. Hardware specifications of tested platforms.
Platform | Category | Specification
Laptop (Ryzen 7 4800H + GTX 1660 Ti) | CPU | AMD Ryzen 7 4800H (Zen 2, 7 nm), 8C/16T, 2.9–4.2 GHz, cache: L1 512 KB, L2 4 MB, L3 8 MB, ISA: x86-64
| GPU | NVIDIA GTX 1660 Ti Mobile (Turing), 1536 CUDA cores, 6 GB GDDR6, 192-bit, 288 GB/s, no Tensor/RT cores
| Memory | 24 GB DDR4-3200, dual channel, 51.2 GB/s
| Storage | 512 GB NVMe SSD
| Declared power | CPU TDP: 45 W, GPU TDP: ∼80 W
Desktop (Ryzen 7 7800X3D + RTX 5060 Ti) | CPU | AMD Ryzen 7 7800X3D (Zen 4, 5 nm), 8C/16T, 4.2–5.0 GHz, cache: L1 512 KB, L2 8 MB, L3 96 MB (3D V-Cache), ISA: x86-64
| GPU | NVIDIA RTX 5060 Ti (Blackwell), 6144 CUDA cores, 16 GB GDDR6, 256-bit, 512 GB/s, Tensor and RT cores
| Memory | 32 GB DDR5-6000, dual channel, 96 GB/s
| Storage | 1 TB PCIe Gen4 NVMe SSD
| Declared power | CPU TDP: 120 W, GPU TDP: ∼200 W
MacBook (Apple M4 Pro) | CPU | Apple M4 Pro (ARMv9.2-A, 3 nm), 14 cores (10P + 4E), ∼4.0 GHz, system cache ∼16 MB, ISA: ARMv9-A
| GPU | Apple 20-core GPU, unified memory architecture, 16-core Neural Engine (NPU)
| Memory | 16 GB LPDDR5 unified memory, 120 GB/s
| Storage | 1 TB NVMe SSD
| Declared power | CPU + GPU package: ∼30 W
MacBook (Apple M4 Air) | CPU | Apple M4 (ARMv9.2-A, 3 nm), 8 cores (4P + 4E), ∼4.0 GHz, system cache ∼12 MB, ISA: ARMv9-A
| GPU | Apple 10-core GPU, unified memory architecture, 16-core Neural Engine (NPU)
| Memory | 16 GB LPDDR5 unified memory, 100 GB/s
| Storage | 256 GB NVMe SSD
| Declared power | CPU + GPU package: ∼15–20 W (fanless design)
Raspberry Pi 5 | CPU | Broadcom BCM2712 (Cortex-A76, 16 nm), 4 cores at 2.4 GHz, cache: L1 128 KB/core, L2 512 KB/core, L3 2 MB, ISA: ARMv8-A64
| GPU | VideoCore VII (∼800 MHz), OpenGL ES 3.1, no CUDA/Tensor cores
| Memory | 8 GB LPDDR4X-4267, 68 GB/s
| Storage | 128 GB microSD UHS-I, optional NVMe (PCIe x1)
| Declared power | SoC power: ∼10 W
Google Colab (Free Tier) | CPU | Intel Xeon (Broadwell/Skylake), 2 vCPUs, ∼2.2–2.3 GHz, ISA: x86-64
| GPU | NVIDIA Tesla T4 (Turing), 2560 CUDA cores, 16 GB GDDR6, 320 GB/s, Tensor cores
| Memory | 12–16 GB RAM (shared virtual machine)
| Storage | Ephemeral VM disk (∼68 GB), cloud-managed
| Declared power | Cloud-managed (not user-exposed)
Table 2. Software environment for tested platforms.

Specification | Laptop (Ryzen 7 4800H + GTX 1660 Ti) | Desktop (Ryzen 7 7800X3D + RTX 5060 Ti) | Apple Silicon (MacBook Pro M4 Pro/MacBook Air M4) | Raspberry Pi 5 | Colab
Operating system | Windows 11 Home (Build 26100) | Windows 11 Pro (Build 26100) | macOS Sequoia | Raspberry Pi OS (Debian 12, Kernel 6.12) | Cloud VM
Python | 3.10.9 | 3.10.13 | 3.9.6 | 3.11.2 | 3.12.11
CUDA | CUDA 11.2/cuDNN 8.1 | CUDA 11.2/cuDNN 8.1 | Not applicable | Not applicable | CUDA 12.5.1/cuDNN 9
NumPy | 1.23.5 | 1.26.4 | 1.26.4 | 2.1.3 | 2.0.2
Keras | 2.10.0 | 2.10.0 | 3.10.0 | 3.10.0 | 3.10.0
TensorFlow | 2.10.0 | 2.10.0 | 2.16.2 | 2.19.0 | 2.19.0
Scikit-learn | 1.7.1 | 1.5.1 | 1.6.1 | 1.7.1 | 1.6.1
XGBoost | 3.0.2 | 3.0.2 | 2.1.4 | 3.0.3 | 3.0.4
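Reproducing this table on a given machine reduces to recording the interpreter and library versions in use. The following is a minimal sketch of such an environment snapshot (it assumes all listed packages are installed and is not part of the benchmarking framework itself):

```python
import platform

import numpy
import tensorflow
import keras
import sklearn
import xgboost

# Snapshot of the software stack on the current machine,
# mirroring the fields reported in Table 2.
env = {
    "python": platform.python_version(),
    "numpy": numpy.__version__,
    "tensorflow": tensorflow.__version__,
    "keras": keras.__version__,
    "scikit-learn": sklearn.__version__,
    "xgboost": xgboost.__version__,
    "gpus": [d.name for d in tensorflow.config.list_physical_devices("GPU")],
}
for name, value in env.items():
    print(f"{name}: {value}")
```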
Table 3. CNN architecture configuration.

# | Layer Type | Configuration/Parameters
1 | Input | Input shape: (32, 32, 3)
2 | Conv2D | 64 filters, 3 × 3 kernel, swish activation, padding = same
3 | BatchNormalization | axis = −1, momentum = 0.99, epsilon = 0.001
4 | Conv2D | 64 filters, 3 × 3 kernel, swish activation, padding = same
5 | BatchNormalization | Default parameters
6 | MaxPooling2D | Pool size: (2, 2)
7 | Dropout | Rate: 0.3
8 | Conv2D | 128 filters, 3 × 3 kernel, swish activation, padding = same
9 | BatchNormalization | Default parameters
10 | Conv2D | 128 filters, 3 × 3 kernel, swish activation, padding = same
11 | BatchNormalization | Default parameters
12 | MaxPooling2D | Pool size: (2, 2)
13 | Dropout | Rate: 0.4
14 | Conv2D | 256 filters, 3 × 3 kernel, swish activation, padding = same
15 | BatchNormalization | Default parameters
16 | Conv2D | 256 filters, 3 × 3 kernel, swish activation, padding = same
17 | BatchNormalization | Default parameters
18 | GlobalAveragePooling2D | –
19 | Dense | 1024 units, swish activation
20 | Dropout | Rate: 0.5
21 | Dense | 512 units, swish activation
22 | Dropout | Rate: 0.5
23 | Dense (output layer) | 10 units (logits, no activation)
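For reference, the layer stack in Table 3 can be expressed as a compact Keras sketch. The function name is ours, the built-in "swish" activation stands in for the custom implementation mentioned in Table 4, and the explicitly listed BatchNormalization parameters coincide with the Keras defaults:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(num_classes: int = 10) -> keras.Model:
    """CNN approximating the architecture of Table 3 (CIFAR-10 input)."""
    inputs = keras.Input(shape=(32, 32, 3))
    x = inputs
    # Two convolutional blocks with increasing width and dropout (layers 2-13).
    for filters, dropout in [(64, 0.3), (128, 0.4)]:
        x = layers.Conv2D(filters, 3, padding="same", activation="swish")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="swish")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
        x = layers.Dropout(dropout)(x)
    # Final convolutional block followed by global pooling (layers 14-18).
    x = layers.Conv2D(256, 3, padding="same", activation="swish")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(256, 3, padding="same", activation="swish")(x)
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    # Dense head (layers 19-23), output as logits.
    x = layers.Dense(1024, activation="swish")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(512, activation="swish")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes)(x)  # logits, no activation
    return keras.Model(inputs, outputs)
```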
Table 4. CNN training configuration parameters.

Parameter | Value
Dataset | CIFAR-10
Optimizer | Adam
Loss function | Sparse categorical cross-entropy (from_logits = True)
Metrics | Accuracy
Epochs | 1
Batch size | 512
Activation function | Swish (custom implementation)
Random seed | 42
Normalization | Batch normalization
Regularization | Dropout (0.3/0.4/0.5)
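A hedged sketch of the corresponding training run is shown below, reusing build_cnn from the previous listing. The [0, 1] scaling and the use of the test split for validation are assumptions, since Table 4 does not specify preprocessing:

```python
import random

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Fix seeds as in Table 4 (seed = 42) for Python, NumPy and TensorFlow.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# CIFAR-10 with simple [0, 1] scaling (preprocessing detail assumed here).
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = build_cnn()  # sketch from the previous listing
model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=1, batch_size=512,
          validation_data=(x_test, y_test))
```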
Table 5. Simple RNN model architecture.

# | Layer Type | Configuration/Parameters
1 | Input (implicit) | Sequence of integers with length 100
2 | Embedding | Input dimension = vocabulary size (≈65), output dimension = 256
3 | SimpleRNN | 1024 units, return_sequences = True
4 | SimpleRNN | 1024 units, return_sequences = True
5 | SimpleRNN | 1024 units, return_sequences = True
6 | Dense | Units = vocabulary size, linear activation (logits)
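A minimal Keras sketch of this stack (the helper name and default arguments are ours) is:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_simple_rnn(vocab_size: int, embedding_dim: int = 256,
                     rnn_units: int = 1024) -> keras.Model:
    """Three stacked SimpleRNN layers as listed in Table 5."""
    return keras.Sequential([
        layers.Embedding(vocab_size, embedding_dim),
        layers.SimpleRNN(rnn_units, return_sequences=True),
        layers.SimpleRNN(rnn_units, return_sequences=True),
        layers.SimpleRNN(rnn_units, return_sequences=True),
        layers.Dense(vocab_size),  # linear activation, i.e., logits per character
    ])
```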
Table 6. Training configuration and hyperparameters for the Simple RNN model.

Parameter | Value
Dataset | Shakespeare (character-level)
Sequence length | 100 characters
Batch size | 64
Shuffle buffer size | 10,000
Epochs | 1
Optimizer | Adam
Loss function | Sparse categorical cross-entropy (from_logits = True)
Embedding dimension | 256
Simple RNN units per layer | 1024
Number of Simple RNN layers | 3
Output units | Vocabulary size (≈65)
Reproducibility | Fixed random seed (42) for Python, NumPy, and TensorFlow
Hardware execution | CPU and GPU (when available)
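One possible realization of the data pipeline and training loop summarized in Table 6 is sketched below. The download URL and the exact windowing are assumptions; any plain-text copy of the corpus yields a character vocabulary of roughly 65 symbols:

```python
import numpy as np
import tensorflow as tf

# Character-level Shakespeare corpus (URL assumed; any plain-text copy works).
path = tf.keras.utils.get_file(
    "shakespeare.txt",
    "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt")
text = open(path, "rb").read().decode("utf-8")

vocab = sorted(set(text))                       # ~65 characters
char2id = {c: i for i, c in enumerate(vocab)}
ids = np.array([char2id[c] for c in text], dtype=np.int32)

seq_length, batch_size, buffer_size = 100, 64, 10_000
dataset = (tf.data.Dataset.from_tensor_slices(ids)
           .batch(seq_length + 1, drop_remainder=True)
           .map(lambda s: (s[:-1], s[1:]))      # next-character targets
           .shuffle(buffer_size)
           .batch(batch_size, drop_remainder=True))

model = build_simple_rnn(vocab_size=len(vocab))  # sketch from the previous listing
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dataset, epochs=1)
```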
Table 7. RNN–LSTM model architecture.

# | Layer Type | Configuration/Parameters
1 | Input (implicit) | Sequence of integers with length 100
2 | Embedding | Input dimension = vocabulary size (≈65), output dimension = 256
3 | LSTM | 1024 units, return_sequences = True
4 | LSTM | 1024 units, return_sequences = True
5 | Dense | Units = vocabulary size, linear activation (logits)
Table 8. Summary of RNN–LSTM model configuration.

Aspect | Description
Architecture | Embedding layer followed by two stacked LSTM layers (1024 units each) with return_sequences = True, and a final dense output layer producing logits over the vocabulary.
Dataset | Character-level Shakespeare corpus with input sequences of 100 characters. Targets correspond to the next character in the sequence.
Training setup | Model trained for a single epoch using the Adam optimizer and sparse categorical cross-entropy loss.
Reproducibility | Fixed random seeds for Python, NumPy, and TensorFlow to ensure deterministic behavior.
Table 9. Training configuration and hyperparameters for the RNN–LSTM model.

Parameter | Value
Dataset | Shakespeare (character-level)
Sequence length | 100 characters
Batch size | 64
Buffer size (shuffle) | 10,000
Epochs | 1
Optimizer | Adam
Loss function | Sparse categorical cross-entropy (from_logits = True)
Embedding dimension | 256
LSTM units per layer | 1024
Number of LSTM layers | 2
Output units | Vocabulary size (≈65)
Reproducibility | Random seed = 42 (Python, NumPy, TensorFlow)
Memory control (GPU) | set_memory_growth = True to avoid out-of-memory errors
Hardware execution | GPU (if available), otherwise CPU
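The memory-growth setting in Table 9 corresponds to the standard TensorFlow idiom shown below, here combined with a sketch of the two-layer LSTM stack from Table 7 (the helper name is ours; the same Shakespeare pipeline used for the Simple RNN applies):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Allocate GPU memory on demand (Table 9: set_memory_growth = True),
# instead of reserving the full GPU memory pool at start-up.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

def build_lstm(vocab_size: int, embedding_dim: int = 256,
               lstm_units: int = 1024) -> tf.keras.Model:
    """Two stacked LSTM layers as listed in Table 7."""
    return tf.keras.Sequential([
        layers.Embedding(vocab_size, embedding_dim),
        layers.LSTM(lstm_units, return_sequences=True),
        layers.LSTM(lstm_units, return_sequences=True),
        layers.Dense(vocab_size),  # logits over the character vocabulary
    ])
```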
Table 10. Training configuration parameters for the XGBoost model.

Parameter | Value
Objective function | multi:softmax
Number of classes | 5
Maximum tree depth | 6
Learning rate (η) | 0.1
Evaluation metric | mlogloss
Number of boosting rounds | 100
Random seed | 42
Tree method (CPU) | auto
Tree method (GPU) | gpu_hist
Predictor (GPU) | gpu_predictor
Table 11. Experimental setup for the XGBoost model.

Aspect | Details
Library | XGBoost 3.0.2
Dataset | Synthetic dataset with 200,000 samples and 1000 features
Informative features | 75
Number of classes | 5
Train/test split | 80%/20%
Tree method (CPU) | auto
Tree method (GPU) | gpu_hist
Boosting rounds | 100
Evaluation metric | Overall classification accuracy
Reproducibility | Fixed random seeds for Python, NumPy, and XGBoost
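A CPU-oriented sketch of this experiment is given below. The remaining make_classification arguments (e.g., feature redundancy and class separation) are assumptions, and the tree method is set to "hist" here; Table 10 uses "gpu_hist" on CUDA-capable platforms:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic multiclass problem matching Table 11.
X, y = make_classification(n_samples=200_000, n_features=1000, n_informative=75,
                           n_classes=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    "objective": "multi:softmax",
    "num_class": 5,
    "max_depth": 6,
    "eta": 0.1,
    "eval_metric": "mlogloss",
    "tree_method": "hist",   # Table 10 uses "gpu_hist" on CUDA platforms
    "seed": 42,
}
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
booster = xgb.train(params, dtrain, num_boost_round=100)

# multi:softmax returns predicted class indices directly.
pred = booster.predict(dtest).astype(int)
print("accuracy:", accuracy_score(y_te, pred))
```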
Table 12. BiLSTM model architecture.

# | Layer | Type/Output Shape | Parameters
1 | Input | Input layer, output shape: (None, sequence length) | 0
2 | Embedding | Embedding layer, output shape: (None, sequence length, 64) | 5000 × 64 = 320,000
3 | Bidirectional LSTM | Bidirectional LSTM with 64 units, output shape: (None, 128) | 66,560
4 | Dropout | Dropout layer with rate 0.5, output shape: (None, 128) | 0
5 | Dense | Fully connected layer with softmax activation, output shape: (None, 5) | 645
Table 13. Training hyperparameters for the BiLSTM model.

Hyperparameter | Value
Optimizer | Adam
Learning rate | 0.001
Loss function | Categorical cross-entropy
Batch size | 64
Number of epochs | 10
Sequence length | 100 tokens
Vocabulary size | 5000
Embedding dimension | 64
LSTM units | 64 (bidirectional)
Dropout rate | 0.5
Number of output classes | 5
Validation split | 0.2
Random seed | 42
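Tables 12 and 13 translate into the following Keras sketch. Data loading is omitted: x_train is assumed to hold integer token sequences of length 100 and y_train one-hot labels over 5 classes:

```python
from tensorflow import keras
from tensorflow.keras import layers

# BiLSTM classifier following Tables 12 and 13.
model = keras.Sequential([
    keras.Input(shape=(100,)),
    layers.Embedding(input_dim=5000, output_dim=64),
    layers.Bidirectional(layers.LSTM(64)),   # concatenated output of size 128
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=10, validation_split=0.2)
```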
Table 14. Performance metrics for CNN training across different hardware platforms.

Platform (CPU/GPU) | Time (s) | Power (W) | Energy (J) | Accuracy | Loss
RTX 5060 Ti GPU | 11.70 ± 0.14 | 249.04 ± 0.27 | 2914.3 | 0.4794 ± 0.0054 | 1.4152 ± 0.0094
GTX 1660 Ti GPU | 20.79 ± 2.28 | 119.98 ± 0.61 | 2008.7 | 0.4779 ± 0.0042 | 1.4155 ± 0.0049
Apple M4 GPU | 21.90 ± 1.72 | 73.02 ± 0.34 | 1599.5 | 0.4807 ± 0.0051 | 1.4136 ± 0.0129
Google Colab GPU T4 | 52.52 ± 21.13 | – | – | 0.4794 ± 0.0042 | 1.4177 ± 0.0090
Apple M4 CPU | 80.19 ± 2.52 | 68.04 ± 0.38 | 5456.0 | 0.4775 ± 0.0063 | 1.4199 ± 0.0159
Ryzen 7 7800X3D CPU | 162.67 ± 35.38 | 124.30 ± 5.36 | 20,220.4 | 0.4767 ± 0.0053 | 1.4278 ± 0.0141
Ryzen 7 4800H CPU | 277.21 ± 6.24 | 84.98 ± 0.79 | 23,557.5 | 0.4762 ± 0.0048 | 1.4245 ± 0.0125
Raspberry Pi 5 CPU | 1505.50 ± 30.17 | 12.46 ± 0.33 | 18,758.5 | 0.4774 ± 0.0061 | 1.4200 ± 0.0159
Google Colab CPU | 1418.18 ± 28.01 | – | – | 0.4770 ± 0.0006 | 1.4189 ± 0.0133
Values are reported as mean ± standard deviation. Time and power are standardized to two decimal places, energy to one decimal place, and accuracy and loss to four decimal places for consistency. Power and energy are not reported for Google Colab because the virtualized environment does not expose power measurements.
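The energy column appears to follow from multiplying the mean power by the mean training time, E ≈ P̄ · t̄. A quick consistency check on the RTX 5060 Ti row of Table 14:

```python
# Energy values are consistent with E = mean power * mean time (Table 14).
mean_time_s = 11.70      # RTX 5060 Ti CNN training time (s)
mean_power_w = 249.04    # RTX 5060 Ti average power draw (W)
energy_j = mean_power_w * mean_time_s
print(f"{energy_j:.1f} J")  # ~2913.8 J, close to the reported 2914.3 J
```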
Table 15. Performance metrics for Simple RNN training across different hardware platforms.

Platform (CPU/GPU) | Time (s) | Power (W) | Energy (J) | Sparse Categorical Cross-Entropy
RTX 5060 Ti GPU | 33.80 ± 0.58 | 167.46 ± 2.90 | 5659.5 | 3.4872 ± 0.0893
Google Colab GPU T4 | 37.16 ± 0.70 | – | – | 3.6357 ± 0.0480
GTX 1660 Ti GPU | 72.50 ± 0.65 | 104.98 ± 0.86 | 7611.1 | 3.5080 ± 0.0925
Apple M4 CPU | 57.94 ± 1.06 | 62.30 ± 1.26 | 3609.8 | 3.6933 ± 0.1199
Ryzen 7 7800X3D CPU | 609.69 ± 182.68 | 121.18 ± 6.85 | 73,882.7 | 3.4547 ± 0.1051
Ryzen 7 4800H CPU | 395.47 ± 37.78 | 71.78 ± 1.13 | 28,387.1 | 3.4371 ± 0.0638
Raspberry Pi 5 CPU | 1275.25 ± 8.80 | 11.30 ± 0.22 | 14,410.4 | 3.6959 ± 0.1233
Google Colab CPU | 1597.64 ± 37.35 | – | – | 3.6984 ± 0.1289
Apple M4 GPU | 2264.77 ± 430.56 | 55.18 ± 0.69 | 124,970.0 | 3.6357 ± 0.0481
Raspberry Pi 5 GPU | Not supported | Not supported | – | –
Values are reported as mean ± standard deviation. Time and power are standardized to two decimal places, energy to one decimal place, and loss to four decimal places for consistency.
Table 16. Performance metrics for RNN–LSTM training across different hardware platforms.

Platform (CPU/GPU) | Time (s) | Power (W) | Energy (J) | Sparse Categorical Cross-Entropy
Apple M4 GPU | 24.59 ± 0.06 | 88.26 ± 0.91 | 2170.7 | 3.1916 ± 0.0154
RTX 5060 Ti GPU | 20.34 ± 0.15 | 274.63 ± 0.46 | 5586.4 | 2.8503 ± 0.0502
Apple M4 CPU | 26.83 ± 1.31 | 83.14 ± 0.58 | 2230.6 | 3.1850 ± 0.0329
GTX 1660 Ti GPU | 37.95 ± 0.16 | 118.48 ± 0.67 | 4495.8 | 2.8409 ± 0.0458
Google Colab GPU T4 | 34.48 ± 0.64 | – | – | 3.1916 ± 0.0154
Google Colab CPU | 76.52 ± 0.94 | – | – | 3.1808 ± 0.0329
Ryzen 7 7800X3D CPU | 1556.77 ± 345.83 | 123.50 ± 9.03 | 192,261.3 | 2.8371 ± 0.0794
Ryzen 7 4800H CPU | 1196.21 ± 99.15 | 70.86 ± 0.91 | 84,763.3 | 2.8378 ± 0.0806
Raspberry Pi 5 CPU | 2968.15 ± 11.50 | 11.44 ± 0.22 | 33,955.7 | 3.1807 ± 0.0329
Values are reported as mean ± standard deviation. Time and power are standardized to two decimal places, energy to one decimal place, and loss to four decimal places for consistency.
Table 17. Performance metrics for XGBoost training across different hardware platforms.

Platform (CPU/GPU) | Time (s) | Power (W) | Energy (J) | Accuracy
RTX 5060 Ti GPU | 16.66 ± 0.04 | 177.40 ± 0.96 | 2955.8 | 0.8190 ± 0.0000
Apple M4 CPU | 32.86 ± 0.17 | 68.00 ± 0.35 | 2234.5 | 0.8150 ± 0.0000
Google Colab GPU T4 | 31.30 ± 0.25 | – | – | 0.8190 ± 0.0000
GTX 1660 Ti GPU | 39.68 ± 0.56 | 100.40 ± 0.86 | 3983.7 | 0.8190 ± 0.0000
Ryzen 7 7800X3D CPU | 76.09 ± 7.82 | 138.50 ± 1.78 | 10,538.2 | 0.8150 ± 0.0000
Ryzen 7 4800H CPU | 347.28 ± 11.06 | 63.96 ± 1.55 | 22,211.2 | 0.8150 ± 0.0000
Raspberry Pi 5 CPU | 1429.53 ± 4.48 | 10.80 ± 0.14 | 15,438.6 | 0.8150 ± 0.0000
Google Colab CPU | 753.83 ± 8.99 | – | – | 0.8150 ± 0.0000
Values are reported as mean ± standard deviation. Time and power are standardized to two decimal places, energy to one decimal place, and accuracy to four decimal places for consistency.
Table 18. Performance metrics for BiLSTM training across different hardware platforms.

Platform (CPU/GPU) | Time (s) | Power (W) | Energy (J) | Accuracy | Loss
RTX 5060 Ti GPU | 8.03 ± 0.06 | 227.56 ± 0.35 | 1827.8 | 0.3213 ± 0.0111 | 1.5309 ± 0.0114
GTX 1660 Ti GPU | 14.20 ± 0.17 | 62.94 ± 0.39 | 893.9 | 0.3186 ± 0.0160 | 1.5296 ± 0.0178
Apple M4 GPU | 13.91 ± 0.93 | 29.38 ± 0.78 | 408.8 | 0.2217 ± 0.0072 | 1.8010 ± 0.0375
Apple M4 CPU | 82.26 ± 0.08 | 34.16 ± 1.30 | 2809.9 | 0.2965 ± 0.0028 | 1.5542 ± 0.0017
Ryzen 7 4800H CPU | 207.91 ± 3.05 | 100.40 ± 0.44 | 20,874.2 | 0.3257 ± 0.0118 | 1.5214 ± 0.0141
Ryzen 7 7800X3D CPU | 125.50 ± 5.21 | 108.78 ± 2.07 | 13,760.9 | 0.3273 ± 0.0128 | 1.5198 ± 0.0161
Google Colab GPU T4 | 39.46 ± 10.94 | – | – | 0.2745 ± 0.0078 | 1.5692 ± 0.0075
Google Colab CPU | 469.48 ± 21.55 | – | – | 0.2909 ± 0.0308 | 1.5600 ± 0.0266
Raspberry Pi 5 CPU | 739.31 ± 19.03 | 10.46 ± 0.27 | 7733.2 | 0.2887 ± 0.0193 | 1.5589 ± 0.0185
Values are reported as mean ± standard deviation. Time and power are standardized to two decimal places, energy to one decimal place, and accuracy and loss to four decimal places for consistency.
Table 19. Best batch size performance for CPU platforms (mean ± std).

Platform | Model | Batch | Time (s) | Epoch (s) | Throughput (samples/s)
M4-Pro | CNN | 4096 | 14.72 ± 0.03 | 3.68 ± 0.01 | 13,589 ± 30
M4-Air | CNN | 2048 | 23.64 ± 0.57 | 5.91 ± 0.14 | 8464 ± 206
R7-7800X3D | CNN | 256 | 82.00 ± 0.22 | 16.40 ± 0.04 | 3049 ± 8
M4-Pro | LSTM | 2048 | 11.98 ± 0.03 | 0.299 ± 0.001 | 8350 ± 22
M4-Air | LSTM | 512 | 31.56 ± 0.00 | 0.789 ± 0.0001 | 3168 ± 0.4
R7-7800X3D | LSTM | 2048 | 29.30 ± 2.68 | 7.32 ± 0.67 | 344 ± 32
Table 20. Best batch size performance for GPU platforms (mean ± std).

Platform | Model | Batch | Time (s) | Epoch (s) | Throughput (samples/s)
RTX-5060Ti | CNN | 128 | 54.58 ± 0.56 | 1.82 ± 0.02 | 27,487 ± 281
M4-Pro | CNN | 2048 | 107.41 ± 0.28 | 3.58 ± 0.01 | 13,965 ± 36
M4-Air | CNN | 2048 | 182.39 ± 6.35 | 6.08 ± 0.21 | 8234 ± 281
RTX-5060Ti | LSTM | 1024 | 13.28 ± 0.04 | 0.332 ± 0.001 | 6168 ± 18
M4-Pro | LSTM | 1024 | 13.35 ± 0.01 | 0.310 ± 0.0002 | 7504 ± 5
M4-Air | LSTM | 512 | 29.20 ± 0.27 | 0.730 ± 0.007 | 3425 ± 31
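Throughput in Tables 19 and 20 is the ratio of samples processed per epoch to the epoch time. A minimal sketch, assuming the CNN epochs iterate over the full 50,000-image CIFAR-10 training split:

```python
def throughput(samples_per_epoch: int, epoch_time_s: float) -> float:
    """Samples processed per second during one training epoch."""
    return samples_per_epoch / epoch_time_s

# Example: the M4-Pro CNN epoch time from Table 19 recovers the reported
# ~13,589 samples/s up to rounding of the epoch time.
print(f"{throughput(50_000, 3.68):,.0f} samples/s")  # ~13,587
```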
Table 21. Performance-per-dollar comparison across platforms (higher is better).

Platform | Score | Price (USD) | Perf/$
Desktop (Ryzen 7 7800X3D + RTX 5060 Ti) | 1.000 | 3000 | 3.33 × 10⁻⁴
MacBook Pro (M4 Pro GPU) | 0.492 | 1999 | 2.46 × 10⁻⁴
MacBook Air (M4) | ∼0.25 | 1199 | ∼2.08 × 10⁻⁴
Raspberry Pi 5 | 0.0127 | 120 | 1.06 × 10⁻⁴
Google Colab | – | – | –
The MacBook Air (M4) score is estimated from relative performance trends with respect to the M4 Pro and is included for qualitative comparison only.
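The Perf/$ column follows from dividing each platform's normalized score by its approximate price:

```python
def perf_per_dollar(score: float, price_usd: float) -> float:
    """Normalized performance score per US dollar (higher is better)."""
    return score / price_usd

# Reproduces the Table 21 entries, e.g. the desktop reference and the M4 Pro:
print(f"{perf_per_dollar(1.000, 3000):.2e}")  # 3.33e-04
print(f"{perf_per_dollar(0.492, 1999):.2e}")  # 2.46e-04
```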
Table 22. Global summary of hardware performance and efficiency across all evaluated models.

Platform | Avg. Speed Rank | Energy Efficiency | Stability | Key Findings
RTX 5060 Ti (GPU) | 1 (fastest) | High | High | Reference platform; achieved the best runtimes and stable performance across all evaluated models.
GTX 1660 Ti (GPU) | 2–3 | Moderate–High | High | Strong mid-range performer; slightly less efficient than the RTX 5060 Ti, but reliable for CNN, recurrent, and XGBoost workloads.
Apple M4 Pro GPU | 3–4 * | Very high | Medium | Excellent energy efficiency; competitive in CNN, RNN–LSTM, and BiLSTM workloads, though less favorable in Simple RNN. Limited backend support prevented execution of selected models.
Apple M4 Pro CPU | 4 | High | High | Best CPU-class performer; competitive with GPUs in recurrent models and XGBoost, while maintaining strong energy efficiency and consistency.
Colab GPU | 4–5 | – | Low–Medium | Accessible cloud-reference environment for prototyping and remote experimentation; competitive in selected workloads, but affected by execution variability associated with shared and virtualized resource allocation.
Colab CPU | 6–7 | – | Medium | Free cloud-based CPU reference suitable mainly for lightweight experimentation; results should be interpreted with caution because runtime behavior depends on dynamically allocated shared infrastructure.
Ryzen 7 (4800H/7800X3D) | 6 | Low–Moderate | Medium | High energy consumption and slower runtimes; suitable primarily for CPU-oriented methods such as XGBoost.
Raspberry Pi 5 (CPU) | 7 (slowest) | Very low | Medium | Extremely limited performance; useful mainly for educational, exploratory, or edge-oriented baseline purposes.
Note. Rankings were derived from aggregated results of the CNN, Simple RNN, RNN–LSTM, BiLSTM, and XGBoost benchmarks. Energy efficiency and stability were qualitatively assessed from average energy values and observed variances; energy efficiency is not rated for Google Colab because power consumption could not be measured in the virtualized environment. * The Apple M4 GPU was unable to execute certain models due to incomplete support in the Metal backend. Google Colab results should be interpreted as indicative cloud-based reference measurements rather than as directly equivalent to those obtained on dedicated local hardware, owing to the shared and virtualized nature of the underlying infrastructure.