Article

Comparative Study on Energy Consumption of Neural Networks by Scaling of Weight-Memory Energy Versus Computing Energy for Implementing Low-Power Edge Intelligence

School of Electrical Engineering, Kookmin University, Seoul 02707, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2718; https://doi.org/10.3390/electronics14132718
Submission received: 5 June 2025 / Revised: 25 June 2025 / Accepted: 2 July 2025 / Published: 5 July 2025

Abstract

Energy consumption has emerged as a critical design constraint in deploying high-performance neural networks, especially on edge devices with limited power resources. In this paper, a comparative study is conducted for two prevalent deep learning paradigms—convolutional neural networks (CNNs), exemplified by ResNet18, and transformer-based large language models (LLMs), represented by GPT3-small, Llama-7B, and GPT3-175B. By analyzing how the scaling of memory energy versus computing energy affects the energy consumption of neural networks with different batch sizes (1, 4, 8, 16), it is shown that ResNet18 transitions from a memory energy-limited regime at low batch sizes to a computing energy-limited regime at higher batch sizes due to its extensive convolution operations. On the other hand, GPT-like models remain predominantly memory-bound, with large parameter tensors and frequent key–value (KV) cache lookups accounting for most of the total energy usage. Our results reveal that reducing weight-memory energy is particularly effective in transformer architectures, while improving multiply–accumulate (MAC) efficiency significantly benefits CNNs at higher workloads. We further highlight near-memory and in-memory computing approaches as promising strategies to lower data-transfer costs and enhance power efficiency in large-scale deployments. These findings offer actionable insights for architects and system designers aiming to optimize artificial intelligence (AI) performance under stringent energy budgets on battery-powered edge devices.

1. Introduction

As artificial intelligence (AI) technologies advance rapidly, neural networks are proving useful for a wide range of applications related to human cognition and intelligence [1,2,3,4]. The most representative neural network models are convolutional neural networks and transformers [5,6,7,8]. They are used to process spatial information such as images and sequential information such as natural language, respectively, which can be considered the most common capabilities of human cognition and intelligence.
To realize human-level cognition and intelligence, very powerful computing systems are needed. Figure 1 shows the entire hierarchy from the Internet of Things (IoT) to cloud computing centers via an edge intelligence layer [9]. Here, the IoT sensors collect massive amounts of data from natural environments, human societies, etc. The IoT actuators translate electrical signals into physical quantities, including motion, light, temperature, etc., to make human life more convenient and safer. The data collected from the IoT devices are finally used in the cloud layer, where very large AI models can be operated using the big data accumulated in data centers.
To make the entire computing system energy-efficient, an edge intelligence layer should be inserted between the IoT devices and the cloud layer, as shown in Figure 1 [10]. The edge layer between the IoT and cloud computing layers can significantly reduce the amount of data transferred directly from the IoT devices to the data center [11]. By doing so, the computing and communication energy of a data center can be decreased. Moreover, some applications must handle privacy-sensitive data. Edge hardware can enable data processing without exposing private information to external networks, which is particularly important in healthcare, financial services, and personal devices, where data privacy is paramount [11,12].
As mentioned earlier, edge intelligence is needed between the IoT devices and data centers. Most edge intelligence hardware must operate in energy-constrained environments, typically on battery power. In this case, low-power edge hardware is critical for extending battery lifetime. However, this demand for low power consumption in edge devices is at odds with large AI models, which deliver high-level intelligence but consume very large amounts of energy [13,14,15]. Figure 2a shows the relationship between energy consumption and an AI model’s size [16]. Here, we analyze various large language models (LLMs) in terms of energy consumption, which clearly demonstrates an exponential increase with respect to the number of model parameters [17,18].
Now we need to discuss how to reduce the energy consumption of AI models. Traditionally, computing energy consumption was reduced by supply voltage (VDD) scaling: switching power is proportional to VDD², so lowering VDD reduces power quadratically. However, VDD scaling has stalled for many years, so power reduction can no longer be obtained this way [19,20].
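For reference, the switching-power relation underlying this argument is the standard CMOS dynamic power expression shown below; the activity factor, load capacitance, and clock frequency are the usual textbook symbols rather than quantities defined in this paper.

```latex
P_{\mathrm{dyn}} = \alpha \, C_{L} \, V_{DD}^{2} \, f
% Example: reducing V_DD from 1.0 V to 0.5 V at fixed alpha, C_L, and f
% lowers P_dyn by a factor of (1.0 / 0.5)^2 = 4.
```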
To reduce power consumption without relying on VDD scaling, we need to consider the power consumption of the traditional computing architecture, known as the von Neumann architecture. Figure 2b shows a conceptual diagram of the von Neumann architecture, where the computing block is separated from the memory block. This separation can cause a large amount of power consumption because data movement between the two separate blocks continuously consumes energy. To overcome this problem, Figure 2c shows a conceptual diagram of near-memory computing, where the power consumption can be reduced by shortening the distance between the computing and memory blocks [21,22]. Furthermore, Figure 2d shows a diagram of the in-memory computing architecture, which performs both the computing and memory functions at the same site. By doing so, the energy consumption in Figure 2d can be lowered even further compared to Figure 2c [23,24,25,26].
As explained earlier, both the near-memory and in-memory computing techniques can reduce the energy consumption of AI models. In more detail, near-memory computing can reduce the data-movement energy needed for accessing weight memory, whereas in-memory computing can reduce both the computing energy and the weight-memory energy. Thus, to analyze the energy reduction attainable through near-memory and in-memory computing, we divide the AI model’s energy consumption into a weight-accessing part and a computing part in this paper.
To do so, we first characterize two types of neural network models, the convolutional neural network and the transformer, in terms of model size and the FLOPs computed in each model. As noted above, the two models process spatial information such as images and sequential information such as natural language, respectively, which can be considered the most common capabilities of human cognition and intelligence.
Here, FLOPs denotes the number of floating-point operations needed to run the AI model. A large model has a large number of parameters, so the energy consumed by accessing weight memory is correspondingly large. On the other hand, a large number of FLOPs implies a correspondingly large computing energy. Based on this analysis of model size and FLOPs, we calculate in the following section how much energy can be saved for two typical AI models, ResNet18 and the transformer.
Explaining the analyzed models in more detail, ResNet18 is suitable for handling vision information using convolution operations. The convolution operation performs very heavy computation using a small number of kernels, which is why the number of FLOPs per parameter of convolution-based models is larger than that of other AI models. On the contrary, transformer models need a large number of parameters for very large matrix–matrix multiplications. Consequently, the transformer model has a low number of FLOPs per parameter. To cover both types of AI models, computation-dominant and weight-dominant, we analyze the energy consumption of AI models by scaling the weight-memory energy versus the computing energy. The results show that the energy scaling of weight memory and computing can affect the AI model’s power consumption in various ways according to the model, operating mode, memory, etc.
To summarize how the energy scaling of weight memory versus computing affects the total energy consumption of various AI models, we plot FLOPs per parameter with a varying batch number. By doing so, we can distinguish the weight-energy-bound and MAC-energy-bound regions for the AI models. Here, MAC stands for the multiply–accumulate operations that are fundamental to running a neural network model. If an AI model operated with a certain batch number falls in the region bound by weight-accessing energy, the weight-memory energy is more dominant than the computing energy. On the other hand, if the AI model is in the computing-energy-bound region, reducing the computing energy is more important than saving the weight-memory energy. These findings can offer actionable insights for architects and system designers aiming to optimize AI performance under stringent energy budgets on battery-powered edge devices.
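A minimal sketch of this bookkeeping is shown below. It is not the program used in this paper; the energy constants follow the values assumed later in Section 2 (LPDDR5 weight access of about 4.5 pJ/bit and an INT8 MAC of about 0.23 pJ), and counting one MAC as two FLOPs is an additional simplifying assumption.

```python
# Classify a (model, batch) operating point as weight-energy bound or MAC-energy bound.

def dominant_term(num_params, flops_per_sample, batch,
                  bits_per_weight=8, e_weight_pj_per_bit=4.5, e_mac_pj=0.23):
    weight_pj = num_params * bits_per_weight * e_weight_pj_per_bit  # weights loaded once per batch
    mac_pj = (flops_per_sample / 2) * batch * e_mac_pj              # per-sample MACs scale with batch
    label = "weight-energy bound" if weight_pj > mac_pj else "MAC-energy bound"
    return label, round(weight_pj / 1e9, 2), round(mac_pj / 1e9, 2)  # energies reported in mJ

# ResNet18-like numbers: 11.7 M parameters, ~1.8 GFLOPs per image
print(dominant_term(11.7e6, 1.8e9, batch=1))   # weight energy ~2x MAC energy -> weight-energy bound
print(dominant_term(11.7e6, 1.8e9, batch=16))  # MAC energy dominates         -> MAC-energy bound
```

With the ResNet18-like numbers above, the batch = 1 case reproduces the roughly 2:1 weight-to-MAC energy ratio discussed in Section 3.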
Several previous studies have analyzed the energy consumption of neural networks [27,28]. These are based on real measurements on neural network hardware such as NVIDIA devices. More specifically, a study profiling the energy consumption of fully connected and convolutional layers led to simple but accurate energy models for edge inference tasks [27]. Other researchers have developed an energy consumption index to evaluate the energy efficiency of various deep learning architectures, including AlexNet, ResNet18, VGG16, EfficientNet-B3, ConvNeXt-T, and Swin Transformer, during both the training and inference phases, providing a standardized approach for assessment [29].
Unlike these previous studies, which obtained measurement results using hardware, this paper analyzes the energy consumption of AI models through bottom-up calculations using an elementwise analysis of weight-memory access and MAC (multiply–accumulate) computing. By doing so, the energy consumption trend can be analyzed and estimated as the energy scaling is applied differently to weight-memory access and computing. This kind of breakdown approach is very useful for estimating the future energy reductions that may be driven by near-memory and in-memory computing architectures.
In the next section, the methods used in this study are explained in detail. In Section 3, simulation results are presented in figures and tables, showing each AI model’s energy consumption for different batch numbers and different energy scaling of weight-memory access and MAC computing. In Section 4, we conclude this paper.

2. Methods

In this paper, two types of neural network models are considered for analyzing energy consumption: ResNet18 and GPT-family transformer models. Figure 3a illustrates a block diagram of ResNet18, which takes an input image and applies 64 convolution filters of size 7 × 7 with a stride of 2 at the first stage to rapidly reduce the spatial resolution. Afterward, multiple layers of 3 × 3 convolutions (filters 64→128→256→512) are applied sequentially. Each convolution layer is equipped with Batch Normalization, a ReLU activation function, etc. At the end, a pooling operation aggregates the feature map by channel, and the classification result is finally produced through a fully connected layer and a SoftMax layer.
ResNet18 is a simpler model than large language models (LLMs) in the GPT family, which can have hundreds of millions to billions of parameters. However, due to the nature of pixel-level convolutions, each convolution layer requires an enormous number of MAC (multiply–accumulate) operations. For example, with an input image of size 224 × 224, just passing through the first convolution layer entails many filters performing spatial multiply–add operations. As the input image is large and the number of channels grows from layer to layer, the total computational workload skyrockets. Consequently, ResNet18 needs about 1.8 GFLOPs, which represents a substantial amount of computation. In other words, despite having fewer parameters, the model has a large volume of operations, implying that much of the energy consumption could be dominated by computational costs.
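To make the scale of these convolutions concrete, the following back-of-the-envelope calculation (a sketch, not taken from the paper’s program) counts the MACs in the first ResNet18 layer described above; the 112 × 112 output size follows from the 224 × 224 input and the stride of 2, and a 3-channel RGB input is assumed.

```python
# MAC count for one convolution layer: out_h * out_w * out_channels * (k * k * in_channels).
# First ResNet18 layer: 224x224x3 input, 64 filters of 7x7, stride 2 -> ~112x112 output.

def conv_macs(out_h, out_w, out_ch, k, in_ch):
    return out_h * out_w * out_ch * k * k * in_ch

macs = conv_macs(112, 112, 64, 7, 3)
print(f"first conv layer: {macs / 1e6:.0f} M MACs (~{2 * macs / 1e6:.0f} M FLOPs)")
```

This single layer already accounts for roughly 0.24 GFLOPs of the approximately 1.8 GFLOPs consumed by the whole network.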
On the other hand, as shown in Figure 3b, GPT-based models (GPT3-small, Llama 7B, GPT3-175B, etc.) follow the transformer architecture designed to process sequential data. The sequential input data are first split into tokens, and the tokens are delivered to an embedding layer. Then, they are input into multiple transformer blocks. Each transformer block consists of (1) Layer Normalization, (2) Multi-Head Self-Attention, and (3) a feedforward module (MLP). A residual connection is applied at each layer for stable training. GPT models typically contain a very large number of parameters. For example, GPT3-175B has about 175 billion parameters, while Llama 7B already has around 6.7 billion parameters, which is far bigger in scale than CNN-based models [7,30].
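As a rough cross-check of these parameter counts, the sketch below applies a common rule of thumb to the hyperparameters listed in Table 1; it ignores biases, normalization parameters, and architecture-specific details (e.g., Llama’s gated MLP), so it only approximates the officially reported sizes.

```python
# Approximate parameter count of a decoder-only transformer:
#   ~12 * n_layer * d_model^2 for the attention and MLP weights, plus the token-embedding matrix.

def approx_params(n_layer, d_model, n_vocab=50257):
    return 12 * n_layer * d_model ** 2 + n_vocab * d_model

for name, n_layer, d_model in [("GPT3-small", 12, 768),
                               ("Llama-7B", 32, 4096),
                               ("GPT3-175B", 96, 12288)]:
    print(f"{name}: ~{approx_params(n_layer, d_model) / 1e9:.2f} B parameters")
```

The estimates come out at roughly 0.12 B, 6.6 B, and 175 B parameters, in line with the model sizes quoted above.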
Due to these architectural differences, the dominant factors in energy consumption diverge markedly between ResNet18 and GPT. In CNNs, the parameter count is relatively small, but there are many convolution operations, leading to a computationally intensive workload. Meanwhile, GPT frequently loads a huge number of parameters for each token, and as the sequence length grows, the KV cache access frequency also increases greatly. Therefore, in GPT-based models, memory access can become the main bottleneck in overall energy consumption. Specifically, even a model like Llama 7B already comprises billions of parameters, and in the case of GPT3-175B, the parameter count is more than an order of magnitude higher still, which inevitably inflates the energy costs related to memory.
Figure 3. (a) A block diagram of ResNet18 [6]. (b) A block diagram of a transformer decoder such as GPT3 [31].
To perform a comparative study on the energy consumption of moving weights and computing MAC operations, the numbers of model parameters and FLOPs are first calculated for various AI models. Here, ResNet18, GPT3-small, Llama-7B, and GPT3-175B are chosen for this analysis, as shown in Figure 4a [17,32,33]. GPT-3 Small, Llama-7B, and GPT-3 175B were chosen as representative small-, medium-, and large-scale LLMs. Because all three models employ a GPT-like single decoder block, their layer layouts and computational patterns are similar, enabling a fair comparison. Each model has been publicly released with full hyperparameter details provided in the relevant paper and repository, ensuring reproducibility and data accessibility. ResNet18, based on a convolutional neural network, has 11.7 M parameters and can recognize images by processing spatial information. GPT3-small, Llama-7B, and GPT3-175B are based on the transformer-decoder model, which can handle sequential information such as natural languages. There are 125 M, 7 B, and 175 B parameters for GPT3-small, Llama-7B, and GPT3-175B, respectively, as shown in Figure 4a. The numbers of FLOPs for the four models are shown in Figure 4b. The number of FLOPs of ResNet18 is as low as 1.8 G, while the numbers of FLOPs of GPT3-small and Llama 7B are 265 M and 13.2 G, respectively. On the contrary, the number of FLOPs reaches as high as 334 G for GPT3-175B [17,32,33].
The computational intensity of a neural network model can be defined as the number of FLOPs per parameter. For example, ResNet18 has a computational intensity as high as 155 because the model has a small number of parameters and a large number of FLOPs, as shown in Figure 4c. In contrast, the transformer-based models have much lower computational intensity than ResNet18. The intensity values shown in Figure 4c are 1.96, 2.09, and 2.04 for GPT3-small, Llama 7B, and GPT3-175B, respectively. One thing to note from Figure 4c is that the transformer-based neural networks perform far less computation per parameter than the CNN-based model.
Figure 4d indicates the energy consumption of loading weights per bit and the computing energy for performing one MAC operation. Here, FP16 and INT8 are considered for carrying out MAC operations. FP16 represents a floating-point number format composed of 16 bits. Similarly, INT8 is an integer number format composed of 8 bits. These are very common number formats used in most GPU and NPU chips [34]. In Figure 4d, the first column shows the weight-memory energy per bit for GDDR6 DRAMs [35]. The second column is the weight-memory energy per bit for LPDDR5 DRAMs [36]. The third and fourth columns represent the computing energy per FP16 and INT8 MAC, respectively [20,37].
To simulate the energy consumption of AI models, the total energy consumption is examined for different batch numbers while the weight-memory energy and MAC energy are varied in steps. To perform the energy analysis, we first calculate the number of model parameters and the FLOPs required to operate the AI models mentioned above. The energy calculation in this paper is performed using a Python 3.9.21 program that includes the energy models of MAC computing and weight-memory access. The neural network’s performance is verified using PyTorch 2.2.0 (+ CUDA 11.8) for the ResNet18 and transformer-based models. As mentioned earlier, two representative AI models, ResNet and the transformer, are simulated in this paper; other AI models can be considered to have similar characteristics to these two. The two models process spatial information such as images and sequential information such as natural languages, respectively. Here, the total energy consumption is assumed to be roughly composed of weight-memory energy and MAC computing energy [30,31,38,39]. For the weight-memory energy, the LPDDR5 device in Figure 4d is assumed to be used for loading the weights from the external DRAM [36]. In this case, the weight-memory energy for loading one bit is estimated to be 4.5 pJ. For the MAC computing energy, MAC-INT8 precision is used for the convolution-based ResNet18 model, since INT8 precision can provide sufficient performance for practical applications in most convolution-based models such as ResNet [40,41]. For the transformer-based models, MAC-FP16 is used to handle more complicated computations than INT8 allows. Assuming a 45 nm CMOS logic process and a VDD of 0.9 V, a MAC-INT8 operation consumes 0.23 pJ per MAC operation, whereas a MAC-FP16 operation consumes as much as 1.5 pJ per MAC operation [20,37]. Table 1 shows the hyperparameter values used in the transformer models, namely GPT-3 Small, Llama-7B, and GPT-3 175B.
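The sketch below summarizes this bottom-up energy model with the per-bit and per-MAC energies quoted above. It is a simplified illustration of the bookkeeping rather than the full Python program used for the results in Section 3: the KV cache and activation terms are omitted, and one MAC is counted as two FLOPs, which is an assumption.

```python
# Simplified bottom-up energy model (illustration only):
# total energy ~= weight-memory access energy + MAC computing energy.

E_WEIGHT_PJ_PER_BIT = 4.5                 # LPDDR5 DRAM access energy per bit
E_MAC_PJ = {"INT8": 0.23, "FP16": 1.5}    # 45 nm CMOS logic at VDD = 0.9 V

def energy_breakdown_mj(num_params, flops_per_sample, batch, precision, bits_per_weight):
    weight_pj = num_params * bits_per_weight * E_WEIGHT_PJ_PER_BIT   # weights loaded once per batch
    mac_pj = (flops_per_sample / 2) * batch * E_MAC_PJ[precision]    # ~2 FLOPs per MAC (assumption)
    return {"weight_mJ": weight_pj / 1e9,
            "mac_mJ": mac_pj / 1e9,
            "total_mJ": (weight_pj + mac_pj) / 1e9}

# ResNet18 with INT8 MACs and Llama-7B with FP16 MACs, both at batch = 1
print(energy_breakdown_mj(11.7e6, 1.8e9, 1, "INT8", bits_per_weight=8))
print(energy_breakdown_mj(7e9, 13.2e9, 1, "FP16", bits_per_weight=16))
```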

3. Results

Figure 5 and Figure 6 present the normalized total energy consumption to show, at a glance, how the overall energy consumption changes as the weight-memory or MAC operation energy is gradually decreased. In detail, Figure 5 demonstrates how the total energy consumption shifts (for batch sizes of 1, 4, 8, and 16) when the weight-memory energy per bit in ResNet18 is reduced step by step from the baseline (1) to 1/2, 1/4, 1/8, and 1/16. When the batch size is 1, the weight energy and computation energy are almost in a 2:1 ratio, so saving weight-memory energy immediately leads to a reduction in the total energy consumption. However, when the batch size is 4 or larger, there are more MAC operations to be computed during the convolution operations, which significantly increases the energy consumption due to MAC computation. As discussed earlier, the impact of computation grows in these scenarios, so merely reducing the weight-memory energy does not yield as much benefit as it does at batch = 1.
Figure 6 shows how the total energy consumption changes under the same batch size scenarios (1, 4, 8, 16) when the MAC operation energy is reduced stepwise from 1 to 1/2, 1/4, 1/8, and 1/16. Because convolution is the core operation in ResNet18, and because the number of MAC operations surges rapidly as the batch number increases, halving the MAC energy leads to a significant decline in total energy. Even though we are using INT8 precision, the overall computation volume is still very large. Notably, as the batch size grows (8, 16) and the parallelism in convolution operations intensifies, MAC energy’s proportion becomes even higher, making computation-focused optimization especially effective. Moreover, larger batch sizes also significantly increase activation memory usage (due to bigger intermediate feature maps), so lowering the MAC energy alone does not solve all energy problems. Nevertheless, since CNNs are typically computationally heavy, optimizing convolution through techniques such as Winograd transformations, FFT-based approaches, hardware parallelization, or further reducing precision below INT8 (e.g., INT4, INT2) can lead to substantial energy savings.
In summary, for ResNet18, the relative contributions of the weight-memory energy and MAC computing energy depend on the batch number. Overall, however, because CNNs involve a large volume of operations, they are likely to be computationally bound, making strategies to reduce the MAC energy pivotal for lowering the overall energy. This is especially relevant now that CNNs are increasingly deployed in battery-powered edge devices, prompting the active development of specialized hardware (NPU) and data-reuse strategies (e.g., sharing neighboring pixels) to maximize computing efficiency.
Now let us look at a transformer-based model, GPT (Llama 7B), and examine how the energy scaling of memory access versus MAC computation affects the total energy consumption. Figure 7, Figure 8 and Figure 9 show, respectively, the changes in the normalized total energy consumption by batch size (1, 4, 8, 16) when (1) weight-memory energy is reduced, (2) MAC operation energy is reduced, and (3) KV cache memory energy is reduced. Here, we assume FP16 precision for the MAC calculations in the GPT model instead of the INT8 precision used in ResNet18.
Llama 7B contains around 7 billion parameters, more than 500 times the 11.7 million parameters of ResNet18. Thus, for every inference or training pass, a vast number of weights must be loaded layer by layer. As shown in Figure 7, reducing the weight-memory energy to 1/2, 1/4, 1/8, or 1/16 reduces the total energy consumption dramatically regardless of batch size. This indicates that weight access or data movement is the strongest bottleneck in GPT-like architectures with large parameter counts [38]. Indeed, memory optimization techniques such as weight quantization (e.g., 8-bit or 4-bit), weight compression, or parameter sharing across layers (and KV caching) can significantly improve the energy efficiency of GPT models [42,43].
Because GPT models often have many more parameters than their overall FLOPs might suggest, MAC operations may not account for as large a fraction of the total energy as in CNNs. In Figure 8, lowering the MAC energy does reduce the total energy, but, unlike in CNNs, the absolute impact is not huge, even for larger batch sizes. Nevertheless, when the batch size reaches 8 or more, multi-head attention and the feedforward network process more tokens in parallel, increasing the computational load enough that MAC energy optimization does become meaningful. However, memory access (weights, KV cache) can still occupy a relatively larger portion of the total energy. This result shows that while computing energy efficiency remains valuable, reducing memory access—especially for weights and KV cache—is typically a higher priority for GPT-style models.
Because the GPT structure continually references past tokens, it uses a KV (key–value) cache to store them. As the sequence length grows, the KV cache itself grows in size, and the total number of cache accesses over the generated sequence grows quadratically. With a larger batch size, multiple sequences are processed in parallel, further boosting the cache access frequency. As shown in Figure 9, cutting the KV cache energy to 1/2, 1/4, 1/8, or 1/16 can yield a noticeable drop in the total energy, especially when the batch size is as large as 16. This confirms that the KV cache is not just a side component; it can be a major energy bottleneck. Existing methods to optimize KV cache usage include storing token embeddings or attention keys/values at a lower precision, discarding unneeded cached entries early, or adopting architectural approaches such as placing cache memory closer to the processor (near-memory) or embedding it directly into the processor (in-memory) to reduce data movement. In summary, GPT (Llama 7B) handles significantly more parameters than CNNs and also deals with a substantial amount of KV cache usage, making memory access a dominant factor in the total energy consumption. Although MAC computation can become more significant with bigger batch sizes or longer sequences, optimizing the loading and storage of weights and KV caches is ultimately the key to achieving energy efficiency for transformer-based models.
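For readers who want a feel for the cache volumes involved, the sketch below estimates the KV cache footprint per sequence from the Table 1 hyperparameters; FP16 storage and caching of the full context window are simplifying assumptions here rather than the paper’s exact accounting.

```python
# KV cache footprint per sequence, assuming FP16 storage and full-context caching:
#   bytes = 2 (K and V) * n_layer * seq_len * d_model * 2 bytes per value.

def kv_cache_mib(n_layer, d_model, seq_len, bytes_per_value=2):
    return 2 * n_layer * seq_len * d_model * bytes_per_value / 2 ** 20

# Llama-7B hyperparameters from Table 1: n_layer = 32, d_model = 4096, context = 2048 tokens
print(f"Llama-7B KV cache per sequence: {kv_cache_mib(32, 4096, 2048):.0f} MiB")
# With a batch of 16 sequences, the cached volume (and its access energy) grows by 16x.
```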
In Figure 10a,b, we compare the energy reduction rates for ResNet18 and the GPT-based Llama 7B model when lowering different energy components, namely weight-memory energy, MAC computing energy, and (for Llama 7B) KV cache energy.
Focusing first on ResNet18 in Figure 10a, we observe that at a small batch size (batch = 1), reducing the weight-memory energy yields a 61.7% reduction in the total energy consumption, whereas lowering the MAC computing energy decreases the overall energy by 30.3%. As the batch size grows to 4, weight-memory optimization still produces a larger effect (30.5%) compared to MAC-related reductions (18.2%), but this gap narrows because the share of computing increases with parallel convolution. At a batch size of 8, the situation shifts significantly in favor of MAC computing optimization, resulting in a 71.7% energy reduction, whereas weight-memory savings fall to 18.2%. At batch size = 16, the model becomes even more computationally bound, so reducing the MAC energy leads to a 79.4% decrease in the total energy consumption, with weight-memory savings dipping to 10.1%. These trends confirm that ResNet18 moves from a memory-bound state at small batch sizes to a computationally bound regime as the batch size increases.
Moving on to the Llama 7B transformer model in Figure 10b, we see a different picture because GPT models generally have many more parameters and must maintain a key–value (KV) cache for autoregressive attention. At batch size = 1, lowering the weight-memory energy produces a drastic 80% decrease, whereas cutting down the MAC computing energy only leads to a 1.6% decrease, and diminishing the KV cache energy yields a 12% decrease. Even as we move to batch size = 4, weight-memory optimization still dominates at 55.6%, with MAC and KV cache reductions of 4.5% and 33.4%, respectively. These results underscore that GPT models are far more sensitive to memory traffic than to raw computing costs. At batch size = 8, memory remains critical: decreasing the weight-memory energy yields a 39.5% reduction, reducing the KV cache energy leads to a 47.5% reduction, and MAC energy scaling results in a 6.5% decrease. At batch size = 16, memory dependencies remain paramount, with KV cache energy reductions offering 60.2% savings and weight-memory optimization leading to 25% savings, while MAC energy contributes only 8.2% savings. Altogether, these observations highlight that Llama 7B firmly remains in a memory-bound regime at all batch sizes, placing priority on either reducing weight-memory overhead or streamlining the KV cache mechanism to achieve the largest improvements in overall energy efficiency.
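These percentages can be reproduced directly from the normalized totals listed in Table 2; the short calculation below sketches that arithmetic for Llama 7B at batch = 1 (Table 2c–e) and is not the code used to generate Figure 10.

```python
# Reduction when one component is scaled to 1/16, from the normalized Table 2 totals:
#   reduction (%) = 100 * (1 - total at scale 1/16 / total at scale 1)

def reduction_pct(total_at_1_16, total_at_1=1.0):
    return 100 * (1 - total_at_1_16 / total_at_1)

# Llama 7B at batch = 1 (Table 2c-e, baseline total = 1)
print(f"weight-memory scaling: {reduction_pct(0.19962):.1f} %")  # ~80.0 %
print(f"MAC-energy scaling:    {reduction_pct(0.98348):.1f} %")  # ~1.7 % (quoted as 1.6 % above)
print(f"KV cache scaling:      {reduction_pct(0.87982):.1f} %")  # ~12.0 %
```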
To clarify this analysis, Figure 10c examines how FLOPs per parameter change with increasing batch size, helping to visually identify whether a model is memory-bound or computationally bound. In the small-batch-size regime, the model tends to rely heavily on parameter loading and activation data management at every step, thus making it prone to being memory energy-bound. In this case, strategies to minimize memory access energy are crucial. Typical approaches include weight quantization (reduced precision), maximizing data reuse (caching), and near-/in-memory computing designs that physically reduce data movement distances.
When moving to larger batch sizes, architectures such as CNNs (ResNet18) exhibit a surge in parallel convolution operations, leading to the dominance of computing energy. On the other hand, for GPT-based models, the growth in sequence length also leads to heavier attention computations and KV cache usage, so memory access remains a considerable burden. As a result, even at large batch sizes, GPT-like models may not become purely computing-bound but rather remain in a mixed region where both memory and computation are critical. This disparity stems from fundamental differences in how CNNs and GPTs process data: CNNs perform pixel-level convolutions for spatial information, leading to heavy computing costs, whereas GPTs handle sequential information with extensive parameters and cache access, incurring significant memory costs.
The numbers plotted in Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 are listed in Table 2. In more detail, Table 2a indicates the energy consumption of ResNet18 with the weight-memory energy scaled from 1 down to 1/16, as shown in Figure 5. Table 2b shows the energy consumption of ResNet18 with the MAC energy scaled from 1 down to 1/16, as shown in Figure 6. Table 2c indicates the energy consumption of Llama-7B with the weight-memory energy scaled from 1 down to 1/16, as shown in Figure 7. Table 2d shows the energy consumption of Llama-7B with the MAC energy scaled from 1 down to 1/16, as shown in Figure 8. Table 2e shows the energy consumption of Llama-7B with the KV cache memory energy scaled from 1 down to 1/16, as shown in Figure 9.
One more thing to discuss here is that other AI models, such as MobileNet, GoogLeNet, etc., can also be analyzed using the energy breakdown performed in this paper. A large portion of the energy consumption of AI models comes from accessing weight memory and computing MAC operations, so the same calculation method could be applied to estimate the energy consumption of other AI models. This study could be extended further in future work to cover a wide range of AI models from edge to cloud intelligence.
By scaling the weight-memory energy versus MAC computing energy, we were able to analyze and estimate the energy consumption trend of AI models in this section. From this analysis, we can further highlight near-memory and in-memory computing approaches as promising strategies to lower data-transfer costs and enhance power efficiency in large-scale deployments of AI models. These findings could be helpful for offering actionable insights for architects and system designers aiming to optimize AI performance under stringent energy budgets on battery-powered edge devices.

4. Conclusions

The energy consumption of big AI models has emerged as a critical design constraint in deploying high-performance neural networks, especially on edge devices whose energy resources are limited by battery capacity. In this paper, we performed a comparative study of two types of AI models: convolutional neural networks (CNNs), represented by ResNet18, and transformer-based large language models (LLMs), represented by GPT3-small, Llama-7B, and GPT3-175B. To achieve this, we first analyzed how the scaling of memory energy versus computing energy affects the total energy consumption of neural networks with different batch sizes (1, 4, 8, 16). As a result, it was shown that ResNet18 transitions from a memory energy-limited regime at low batch sizes to a computing energy-limited regime at higher batch sizes due to the increase in convolution operations with batch number. On the other hand, GPT-like models remain predominantly memory-bound, with large parameter tensors and frequent key–value (KV) cache lookups accounting for most of the total energy usage. From the energy analysis performed in this paper, we found that reducing the weight-memory energy is particularly effective in transformer architectures, while improving multiply–accumulate (MAC) efficiency significantly benefits CNNs at higher workloads. Moreover, it was highlighted in this paper that near-memory and in-memory computing could be considered promising strategies in the near future to lower data-transfer costs and enhance power efficiency in large-scale deployments. These results can offer helpful guidelines for architects and system designers who are aiming to optimize AI performance under stringent energy budgets on battery-powered edge devices.

Author Contributions

Conceptualization, K.-S.M.; Methodology, I.Y. and J.M.; Investigation, I.Y. and J.M.; Writing—original draft, I.Y.; Writing—review & editing, K.-S.M.; Supervision, K.-S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Research Foundation of Korea, grant numbers RS-2024-00401234, RS-2024-00406006, RS-2024-00395426, and RS-2024-12872969.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

Technical support for the CAD tools was supplied by IC Design Education Center (IDEC), Daejeon, Korea.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  2. Schmidhuber, J. Deep Learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  3. Jia, Y.H.; Si, Z.Z.; Ju, Z.T.; Feng, H.Y.; Zhang, J.H.; Yan, X.; Dai, C.Q. Convolutional-recurrent neural network for the prediction of formation and switching dynamics for multicolor solitons. Sci. China Physics, Mech. Astron. 2025, 68, 284211. [Google Scholar] [CrossRef]
  4. Wan, Y.; Wei, Q.; Sun, H.; Wu, H.; Zhou, Y.; Bi, C.; Li, J.; Li, L.; Liu, B.; Wang, D.; et al. Machine learning assisted biomimetic flexible SERS sensor from seashells for pesticide classification and concentration prediction. Chem. Eng. J. 2025, 507, 160813. [Google Scholar] [CrossRef]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  9. Xia, Q.; Ye, W.; Tao, Z.; Wu, J.; Li, Q. A survey of federated learning for edge computing: Research problems and solutions. High-Confidence Comput. 2021, 1, 100008. [Google Scholar] [CrossRef]
  10. Amin, S.U.; Hossain, M.S. Edge Intelligence and Internet of Things in Healthcare: A Survey. IEEE Access 2021, 9, 45–59. [Google Scholar] [CrossRef]
  11. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
  12. Aminifar, A.; Shokri, M.; Aminifar, A. Privacy-preserving edge federated learning for intelligent mobile-health systems. Futur. Gener. Comput. Syst. 2024, 161, 625–637. [Google Scholar] [CrossRef]
  13. Zheng, Y.; Chen, Y.; Qian, B.; Shi, X.; Shu, Y.; Chen, J. A Review on Edge Large Language Models: Design, Execution, and Applications. ACM Comput. Surv. 2025, 57, 1–35. [Google Scholar] [CrossRef]
  14. Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. Commun. ACM 2020, 63, 54–63. [Google Scholar] [CrossRef]
  15. Maliakel, P.J.; Ilager, S.; Brandic, I. Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings. arXiv 2025, arXiv:2501.08219. [Google Scholar]
  16. Li, Y.; Mughees, M.; Chen, Y.; Li, Y.R. The Unseen AI Disruptions for Power Grids: LLM-Induced Transients. arXiv 2024, arXiv:2409.11416. [Google Scholar]
  17. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  18. Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. arXiv 2019, arXiv:1906.02243. [Google Scholar]
  19. Dreslinski, R.G.; Wieckowski, M.; Blaauw, D.; Sylvester, D.; Mudge, T. Near-threshold computing: Reclaiming moore’s law through energy efficient integrated circuits. Proc. IEEE 2010, 98, 253–266. [Google Scholar] [CrossRef]
  20. Horowitz, M. Computing’s energy problem (and what we can do about it). Dig. Tech. Pap. 2014, 57, 10–14. [Google Scholar] [CrossRef]
  21. Ahn, J.; Hong, S.; Yoo, S.; Mutlu, O.; Choi, K. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the ACM/IEEE 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, 13–17 June 2015; pp. 105–117. [Google Scholar] [CrossRef]
  22. Singh, G.; Chelini, L.; Corda, S.; Awan, A.J.; Stuijk, S.; Jordans, R.; Corporaal, H.; Boonstra, A.J. Near-memory computing: Past, present, and future. Microprocess. Microsyst. 2019, 71, 102868. [Google Scholar] [CrossRef]
  23. Sheng, X.; Graves, C.E.; Kumar, S.; Li, X.; Buchanan, B.; Zheng, L.; Lam, S.; Li, C.; Strachan, J.P. Low-Conductance and Multilevel CMOS-Integrated Nanoscale Oxide Memristors. Adv. Electron. Mater. 2019, 5, 1800876. [Google Scholar] [CrossRef]
  24. Chi, P.; Li, S.; Xu, C.; Zhang, T.; Zhao, J.; Liu, Y.; Wang, Y.; Xie, Y. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. Proceedings of ACM/IEEE 43rd Annual International Symposium on Computer Architecture, Seoul, Republic of Korea, 18–22 June 2016; pp. 27–39. [Google Scholar] [CrossRef]
  25. Shafiee, A.; Nag, A.; Muralimanohar, N.; Balasubramonian, R.; Strachan, J.P.; Hu, M.; Williams, R.S.; Srikumar, V. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. ACM SIGARCH Comput. Archit. News 2016, 44, 14–26. [Google Scholar] [CrossRef]
  26. He, W.; Yin, S.; Kim, Y.; Sun, X.; Kim, J.; Yu, S.; et al. 2-Bit-Per-Cell RRAM-Based In-Memory Computing for Area-/Energy-Efficient Deep Learning. IEEE Solid-State Circuits Lett. 2020, 3, 194–197. [Google Scholar] [CrossRef]
  27. Lahmer, S.; Khoshsirat, A.; Rossi, M.; Zanella, A. Energy Consumption of Neural Networks on NVIDIA Edge Boards: An Empirical Model. In Proceedings of the 2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks, Torino, Italy, 19–23 September 2022; pp. 365–371. [Google Scholar] [CrossRef]
  28. Latif, I.; Newkirk, A.C.; Carbone, M.R.; Munir, A.; Lin, Y.; Koomey, J.; Yu, X.; Dong, Z. Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node. IEEE Access 2025, 13, 61740–61747. [Google Scholar] [CrossRef]
  29. Aquino-Brítez, S.; García-Sánchez, P.; Ortiz, A.; Aquino-Brítez, D. Towards an Energy Consumption Index for Deep Learning Models: A Comparative Analysis of Architectures, GPUs, and Measurement Tools. Sensors 2025, 25, 846. [Google Scholar] [CrossRef]
  30. Wolters, C. Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference. arXiv 2024, arXiv:2406.08413. [Google Scholar]
  31. Wu, Y.; Wang, Z.; Lu, W.D. PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers. NPJ Unconv. Comput. 2024, 1, 1. [Google Scholar] [CrossRef]
  32. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
  33. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  34. Johnson, J. Rethinking floating point for deep learning. arXiv 2018, arXiv:1811.01721v1. [Google Scholar]
  35. Samsung Electronics Co., Ltd. 8 Gb GDDR6 SGRAM (C-Die) Data Sheet, Rev. 1.0.; Samsung Electronics Co., Ltd.: Suwon-si, Gyeonggi-do, Republic of Korea, 2020; Available online: https://datasheet.lcsc.com/lcsc/2204251615_Samsung-K4Z80325BC-HC14_C2920181.pdf (accessed on 1 July 2025).
  36. Micron Technology, Inc. LPDDR5/LPDDR5X SDRAM Data Sheet, Rev. D.; Micron Technology, Inc.: Boise, ID, USA, 2022; pp. 1–30. Available online: https://www.mouser.com/datasheet/2/671/Micron_05092023_315b_441b_y4bm_ddp_qdp_8dp_non_aut-3175604.pdf (accessed on 1 July 2025).
  37. Jouppi, N.P.; Yoon, D.H.; Ashcraft, M.; Gottscho, M.; Jablin, T.B.; Kurian, G.; Laudon, J.; Li, S.; Ma, P.; Ma, X.; et al. Ten lessons from three generations shaped Google’s TPU v4i: Industrial product. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, Valencia, Spain, 14–18 June 2021; pp. 1–14. [Google Scholar] [CrossRef]
  38. Ivanov, A.; Dryden, N.; Ben-Nun, T.; Li, S.; Hoefler, T. Data Movement Is All You Need: A Case Study on Optimizing Transformers. arXiv 2020, arXiv:2007.00072. [Google Scholar]
  39. Yang, T.J.; Chen, Y.H.; Emer, J.; Sze, V. A method to estimate the energy consumption of deep neural networks. In Proceedings of the 2017 51st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 29 October–1 November 2017; pp. 1916–1920. [Google Scholar] [CrossRef]
  40. Jain, S.R.; Gural, A.; Wu, M.; Dick, C.H. Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks. In Proceedings of the 3rd Conference on Machine Learning and Systems (MLSys 2020), Austin, TX, USA, 6–8 March 2020; pp. 1–17. [Google Scholar]
  41. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713. [Google Scholar] [CrossRef]
  42. Luohe, S.; Hongyi, Z.; Yao, Y.; Zuchao, L.; Hai, Z. Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption. arXiv 2024, arXiv:2407.18003. [Google Scholar]
  43. Adnan, M.; Arunkumar, A.; Jain, G.; Nair, P.J.; Soloveychik, I.; Kamath, P. Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference. arXiv 2024, arXiv:2403.09054. [Google Scholar] [CrossRef]
Figure 1. A conceptual diagram of the hierarchy of computing hardware from IoT sensors to the cloud layer via the edge intelligence layer.
Figure 2. (a) The energy consumption of LLMs with an increasing number of AI model parameters [16]. (b) The von Neumann computing architecture with the separation of computing and memory blocks. (c) The near-memory computing architecture, where the distance between the computing and memory blocks can be shorter than in the traditional von Neumann architecture. (d) The in-memory computing architecture, where the computing can be performed in a memory array.
Figure 4. (a) The number of parameters for various neural network models. (b) The number of FLOPs for various neural network models when the number of batches is one. (c) The FLOPs per parameter for various neural network models when the number of batches is one. (d) The energy consumption of different memories and MAC operations.
Figure 5. The energy consumption of ResNet18 with the weight-memory energy scaled from 1 down to 1/16. Here, the numbers of batches are 1, 4, 8, and 16 in (a), (b), (c), and (d), respectively.
Figure 6. The energy consumption of ResNet18 with the MAC computing energy scaled from 1 down to 1/16. Here, the numbers of batches are 1, 4, 8, and 16 in (a), (b), (c), and (d), respectively.
Figure 7. The energy consumption of Llama-7B with the weight-memory energy scaled from 1 down to 1/16. Here, the numbers of batches are 1, 4, 8, and 16 in (a), (b), (c), and (d), respectively.
Figure 8. The energy consumption of Llama-7B with the MAC computing energy scaled from 1 down to 1/16. Here, the numbers of batches are 1, 4, 8, and 16 in (a), (b), (c), and (d), respectively.
Figure 9. The energy consumption of Llama-7B with the KV cache memory energy scaled from 1 down to 1/16. Here, the numbers of batches are 1, 4, 8, and 16 in (a), (b), (c), and (d), respectively.
Figure 10. (a) ResNet18’s total energy reduction for weight-memory vs. MAC energy scaling with batch sizes of 1, 4, 8, and 16. (b) Llama 7B’s energy reduction for weight-memory, MAC, and KV cache energy scaling with batch sizes of 1, 4, 8, and 16. (c) FLOPs per parameter vs. batch size, illustrating the transition from computing energy-bound to memory energy-bound regimes.
Table 1. The hyperparameters of transformer models such as GPT-3 Small, Llama-7B, and GPT-3 175B. These are vocabulary size (n_word), context window size (n_context), hidden dimension size (d_model), layer depth (n_layer), number of attention heads (n_head), and per-head dimension (d_head).
Model | n_word | n_context | d_model | n_layer | n_head | d_head
GPT3-small | 50,257 | 2048 | 768 | 12 | 12 | 64
Llama-7B | 50,257 | 2048 | 4096 | 32 | 32 | 128
GPT3-175B | 50,257 | 2048 | 12,288 | 96 | 96 | 128
Table 2. (a) The energy consumption of ResNet18 with the weight-memory energy scaled from 1/16 to 1. (b) The energy consumption of ResNet18 with the MAC computing energy scaled from 1/16 to 1. (c) The energy consumption of Llama-7B with the weight-memory energy scaled from 1/16 to 1. (d) The energy consumption of Llama-7B with the MAC computing energy scaled from 1/16 to 1. (e) The energy consumption of Llama-7B with the KV cache memory energy scaled from 1/16 to 1.
(a) Normalized weight-memory energy per bit (ResNet18)

Batches | Component | 1/16 | 1/8 | 1/4 | 1/2 | 1
1 | activation | 0.01693 | 0.01693 | 0.01693 | 0.01693 | 0.01693
1 | MAC | 0.32393 | 0.32393 | 0.32393 | 0.32393 | 0.32393
1 | weight | 0.0412 | 0.08239 | 0.16478 | 0.32957 | 0.65914
1 | total | 0.38206 | 0.42325 | 0.50564 | 0.67043 | 1
4 | activation | 0.06772 | 0.06772 | 0.06772 | 0.06772 | 0.06772
4 | MAC | 1.29574 | 1.29574 | 1.29574 | 1.29574 | 1.29574
4 | weight | 0.0412 | 0.08239 | 0.16478 | 0.32957 | 0.65914
4 | total | 1.40466 | 1.44585 | 1.52824 | 1.69303 | 2.02259
8 | activation | 0.13545 | 0.13545 | 0.13545 | 0.13545 | 0.13545
8 | MAC | 2.59147 | 2.59147 | 2.59147 | 2.59147 | 2.59147
8 | weight | 0.0412 | 0.08239 | 0.16478 | 0.32957 | 0.65914
8 | total | 2.76812 | 2.80931 | 2.8917 | 3.05649 | 3.38606
16 | activation | 0.2709 | 0.2709 | 0.2709 | 0.2709 | 0.2709
16 | MAC | 5.18294 | 5.18294 | 5.18294 | 5.18294 | 5.18294
16 | weight | 0.0412 | 0.08239 | 0.16478 | 0.32957 | 0.65914
16 | total | 5.49504 | 5.53623 | 5.61862 | 5.78341 | 6.11298

(b) Normalized MAC energy (ResNet18)

Batches | Component | 1/16 | 1/8 | 1/4 | 1/2 | 1
1 | activation | 0.01693 | 0.01693 | 0.01693 | 0.01693 | 0.01693
1 | MAC | 0.02025 | 0.04049 | 0.08098 | 0.16197 | 0.32393
1 | weight | 0.65914 | 0.65914 | 0.65914 | 0.65914 | 0.65914
1 | total | 0.69631 | 0.71656 | 0.75705 | 0.83803 | 1
4 | activation | 0.06772 | 0.06772 | 0.06772 | 0.06772 | 0.06772
4 | MAC | 0.08098 | 0.16197 | 0.32393 | 0.64787 | 1.29574
4 | weight | 0.65914 | 0.65914 | 0.65914 | 0.65914 | 0.65914
4 | total | 0.80784 | 0.88883 | 1.05079 | 1.37473 | 2.02259
8 | activation | 0.13545 | 0.13545 | 0.13545 | 0.13545 | 0.13545
8 | MAC | 0.16197 | 0.32393 | 0.64787 | 1.29574 | 2.59147
8 | weight | 0.65914 | 0.65914 | 0.65914 | 0.65914 | 0.65914
8 | total | 0.95656 | 1.11852 | 1.44246 | 2.09033 | 3.38606
16 | activation | 0.2709 | 0.2709 | 0.2709 | 0.2709 | 0.2709
16 | MAC | 0.32393 | 0.64787 | 1.29574 | 2.59147 | 5.18294
16 | weight | 0.65914 | 0.65914 | 0.65914 | 0.65914 | 0.65914
16 | total | 1.25397 | 1.57791 | 2.22578 | 3.52151 | 6.11298

(c) Normalized weight-memory energy per bit (Llama-7B)

Batches | Component | 1/16 | 1/8 | 1/4 | 1/2 | 1
1 | activation | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4
1 | MAC | 0.01763 | 0.01763 | 0.01763 | 0.01763 | 0.01763
1 | weight | 0.05336 | 0.10672 | 0.21343 | 0.42687 | 0.85374
1 | KV cache | 0.12819 | 0.12819 | 0.12819 | 0.12819 | 0.12819
1 | total | 0.19962 | 0.25298 | 0.3597 | 0.57313 | 1
4 | activation | 0.00179 | 0.00179 | 0.00179 | 0.00179 | 0.00179
4 | MAC | 0.0705 | 0.0705 | 0.0705 | 0.0705 | 0.0705
4 | weight | 0.05336 | 0.10672 | 0.21343 | 0.42687 | 0.85374
4 | KV cache | 0.51276 | 0.51276 | 0.51276 | 0.51276 | 0.51276
4 | total | 0.63841 | 0.69176 | 0.79848 | 1.01192 | 1.43879
8 | activation | 0.00358 | 0.00358 | 0.00358 | 0.00358 | 0.00358
8 | MAC | 0.14101 | 0.14101 | 0.14101 | 0.14101 | 0.14101
8 | weight | 0.05336 | 0.10672 | 0.21343 | 0.42687 | 0.85374
8 | KV cache | 1.02551 | 1.02551 | 1.02551 | 1.02551 | 1.02551
8 | total | 1.22345 | 1.27681 | 1.38353 | 1.59696 | 2.02383
16 | activation | 0.00715 | 0.00715 | 0.00715 | 0.00715 | 0.00715
16 | MAC | 0.28202 | 0.28202 | 0.28202 | 0.28202 | 0.28202
16 | weight | 0.05336 | 0.10672 | 0.21343 | 0.42687 | 0.85374
16 | KV cache | 2.05102 | 2.05102 | 2.05102 | 2.05102 | 2.05102
16 | total | 2.39355 | 2.44691 | 2.55363 | 2.76706 | 3.19393

(d) Normalized MAC energy (Llama-7B)

Batches | Component | 1/16 | 1/8 | 1/4 | 1/2 | 1
1 | activation | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4
1 | MAC | 0.0011 | 0.0022 | 0.00441 | 0.00881 | 0.01763
1 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
1 | KV cache | 0.12819 | 0.12819 | 0.12819 | 0.12819 | 0.12819
1 | total | 0.98348 | 0.98458 | 0.98678 | 0.99119 | 1
4 | activation | 0.00179 | 0.00179 | 0.00179 | 0.00179 | 0.00179
4 | MAC | 0.00441 | 0.00881 | 0.01763 | 0.03525 | 0.0705
4 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
4 | KV cache | 0.51276 | 0.51276 | 0.51276 | 0.51276 | 0.51276
4 | total | 1.37269 | 1.37709 | 1.38591 | 1.40353 | 1.43879
8 | activation | 0.00358 | 0.00358 | 0.00358 | 0.00358 | 0.00358
8 | MAC | 0.00881 | 0.01763 | 0.03525 | 0.0705 | 0.14101
8 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
8 | KV cache | 1.02551 | 1.02551 | 1.02551 | 1.02551 | 1.02551
8 | total | 1.89164 | 1.90045 | 1.91808 | 1.95333 | 2.02383
16 | activation | 0.00715 | 0.00715 | 0.00715 | 0.00715 | 0.00715
16 | MAC | 0.01763 | 0.03525 | 0.0705 | 0.14101 | 0.28202
16 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
16 | KV cache | 2.05102 | 2.05102 | 2.05102 | 2.05102 | 2.05102
16 | total | 2.92954 | 2.94717 | 2.98242 | 3.05292 | 3.19393

(e) Normalized KV cache energy per bit (Llama-7B)

Batches | Component | 1/16 | 1/8 | 1/4 | 1/2 | 1
1 | activation | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4 | 4.47059 × 10−4
1 | MAC | 0.01763 | 0.01763 | 0.01763 | 0.01763 | 0.01763
1 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
1 | KV cache | 0.00801 | 0.01602 | 0.03205 | 0.06409 | 0.12819
1 | total | 0.87982 | 0.88783 | 0.90386 | 0.93591 | 1
4 | activation | 0.00179 | 0.00179 | 0.00179 | 0.00179 | 0.00179
4 | MAC | 0.0705 | 0.0705 | 0.0705 | 0.0705 | 0.0705
4 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
4 | KV cache | 0.03205 | 0.06409 | 0.12819 | 0.25638 | 0.51276
4 | total | 0.95808 | 0.99012 | 1.05422 | 1.18241 | 1.43879
8 | activation | 0.00358 | 0.00358 | 0.00358 | 0.00358 | 0.00358
8 | MAC | 0.14101 | 0.14101 | 0.14101 | 0.14101 | 0.14101
8 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
8 | KV cache | 0.06409 | 0.12819 | 0.25638 | 0.51276 | 1.02551
8 | total | 1.06242 | 1.12651 | 1.2547 | 1.51108 | 2.02383
16 | activation | 0.00715 | 0.00715 | 0.00715 | 0.00715 | 0.00715
16 | MAC | 0.28202 | 0.28202 | 0.28202 | 0.28202 | 0.28202
16 | weight | 0.85374 | 0.85374 | 0.85374 | 0.85374 | 0.85374
16 | KV cache | 0.12819 | 0.25638 | 0.51276 | 1.02551 | 2.05102
16 | total | 1.2711 | 1.39928 | 1.65566 | 2.16842 | 3.19393
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

