Predicting GPU Training Energy Consumption in Data Centers Using Task Metadata via Symbolic Regression

Liao, Xiao; Li, Yiqian; Zhang, Shaofeng; Wei, Xianzheng; Hu, Jinlong

doi:10.3390/en19020448

by Xiao Liao ¹, Yiqian Li ¹, Shaofeng Zhang ¹, Xianzheng Wei ² and Jinlong Hu ²,*

¹ China Energy Engineering Group Guangdong Electric Power Design Institute Company Ltd., Guangzhou 510663, China
² School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.

Energies 2026, 19(2), 448; https://doi.org/10.3390/en19020448

Submission received: 29 October 2025 / Revised: 17 December 2025 / Accepted: 14 January 2026 / Published: 16 January 2026

Abstract

With the rapid advancement of artificial intelligence (AI) technology, training deep neural networks has become a core computational task that consumes significant energy in data centers. Researchers often employ various methods to estimate the energy usage of data center clusters or servers to enhance energy management and conservation efforts. However, accurately predicting the energy consumption and carbon footprint of a specific AI task throughout its entire lifecycle before execution remains challenging. In this paper, we explore the energy consumption characteristics of AI model training tasks and propose a simple yet effective method for predicting neural network training energy consumption. This approach leverages training task metadata and applies genetic programming-based symbolic regression to forecast energy consumption prior to executing training tasks, distinguishing it from time series forecasting of data center energy consumption. We have developed an AI training energy consumption environment using the A800 GPU and models from the ResNet{18, 34, 50, 101}, VGG16, MobileNet, ViT, and BERT families to collect data for experimentation and analysis. The experimental analysis of energy consumption reveals that the consumption curve exhibits waveform characteristics resembling square waves, with distinct peaks and valleys. The prediction experiments demonstrate that the proposed method performs well, achieving mean relative errors (MRE) of 2.67% for valley energy, 8.42% for valley duration, 5.16% for peak power, and 3.64% for peak duration. Our findings indicate that, within a specific data center, the energy consumption of AI training tasks follows a predictable pattern. Furthermore, our proposed method enables accurate prediction and calculation of power load before model training begins, without requiring extensive historical energy consumption data. This capability facilitates optimized energy-saving scheduling in data centers in advance, thereby advancing the vision of green AI.

Keywords: energy consumption prediction; deep neural network training; data center

1. Introduction

As digital transformation accelerates, technologies such as cloud computing, big data, and artificial intelligence (AI) are becoming increasingly widespread. Data centers, which serve as central hubs for data storage, processing, and analysis, are expanding in both size and complexity, leading to a significant rise in energy consumption. The proportion of global energy consumed by data centers relative to total worldwide energy consumption has been increasing annually, making them a major energy consumer that cannot be overlooked. Recently, AI has made remarkable advances in areas such as natural language processing and image recognition, with rapid progress in developing large-scale models. Training and operating these AI models require substantial computing power, further increasing energy demands in data centers [1,2]. According to the 2024 US Data Center Energy Use Report, US data centers consumed 176 TWh in 2023, representing 4.4% of the country’s total electricity use. This consumption is projected to increase to between 325 and 580 TWh by 2028 [3]. In China, nationwide data center energy consumption reached 150 TWh in 2023 [4] and increased to 166 TWh in 2024, representing 1.68% of the country’s total electricity use [5]. This high energy usage not only raises operational costs but also places significant strain on the global climate and environment. Therefore, reducing energy consumption in data centers and achieving sustainable, low-carbon operations have become urgent priorities worldwide.

To address these challenges, researchers have pursued several directions: optimizing data center operations to reduce energy consumption [6], making greater and better use of renewable energy sources [7], and designing more energy-efficient neural networks [8]. These efforts aim to achieve green, low-carbon data center operations and to fulfill the vision of green AI, which is concerned with the energy consumption of AI itself [9]. Understanding and predicting the power consumption of AI tasks is a key issue in all of these efforts.

GPU servers are the most energy-intensive IT equipment in data centers, particularly when handling AI workloads. These servers comprise various electronic components, including CPUs, GPUs, network interfaces, memory modules, and cooling fans. Each component affects the system’s overall performance and energy consumption. Among them, GPU power consumption dominates the total power usage of GPU servers, especially during highly parallel tasks such as deep neural network training. Moreover, the dynamic power consumption of GPUs is significantly higher than that of CPUs and other components, making GPUs the primary contributors to server power consumption [10,11,12]. Specifically, GPU power consumption accounts for up to 74% of the total energy usage of the server during the training of the Bidirectional Encoder Representations from Transformers (BERT) base model [13] on a single NVIDIA TITAN X GPU (12 GB) [14]. Therefore, predicting the GPU energy consumption of these servers can serve as an effective method for estimating the total energy consumption during neural network model training. Predicting the energy consumption of data center servers primarily involves three approaches: estimating server energy consumption based on energy consumption models [15,16]; forecasting server energy consumption sequences using historical energy consumption data [17]; and considering the maximum power consumption or Thermal Design Power (TDP) of hardware components such as Graphics Processing Units (GPUs) [18].

First, regarding energy consumption modeling of servers, the operational status of key resources (such as Central Processing Unit (CPU) utilization) correlates with overall system energy consumption during the execution of computing tasks, enabling real-time estimation of changes in system energy use. Energy consumption modeling involves establishing a mathematical model where input variables include state indicators of various system components, such as CPU utilization, memory throughput, network bandwidth, and disk read/write bytes. The model’s output is the system’s energy consumption. This approach is often referred to as the Performance Monitoring Counter (PMC)-based energy consumption modeling method, which obtains the operational status of the CPU, memory, disk, and other components through hardware-provided PMC performance counters [19], and estimates task energy consumption using regression techniques [20]. While the energy consumption model method monitors task execution status to accurately estimate system energy consumption in real time, it is challenging to predict the energy consumption of tasks before they run. These energy consumption models establish correlations between system component status indicators and energy consumption, allowing real-time estimation of system energy changes. They are well-suited for energy monitoring and resource scheduling during task execution but are limited in their ability to forecast the energy consumption of training tasks prior to execution.

Second, regarding the prediction of server energy consumption sequences based on historical time-series data, recent research has primarily employed machine learning and deep learning methods to forecast data center computing power load and energy consumption. For instance, Ref. [21] proposed a cost-sensitive tensor-quantized two-stage attention LSTM model that significantly improves the accuracy of server power prediction through causality-driven feature selection and tensor-quantized temporal embeddings. In [22], a prediction framework for overall data center power consumption was developed using deep neural networks, which first apply denoising and multi-layer autoencoder preprocessing, then model short-term and medium- to long-term power demand via multi-layer feedforward networks, outperforming traditional methods such as linear regression and support vector regression. Furthermore, Ref. [23] integrated convolutional layers, LSTM, and attention mechanisms into a deep architecture for predicting Power Usage Effectiveness (PUE), where convolution captures local temporal patterns, LSTM layers learn long-term dependencies, and the attention mechanism adaptively reweights features to improve energy-efficiency prediction. These time series methods typically depend on extensive historical energy data to predict energy usage in data centers. Their accuracy is affected by shifts in the distribution of the data being forecasted and the model’s capacity to generalize.

Symbolic regression (SR) has emerged as an effective modeling paradigm for deriving simple and interpretable mathematical expressions directly from observed data. SR has been successfully applied to modeling nonlinear behavior in power systems and building energy consumption [24,25], as well as more broadly to tasks such as energy prediction, power modeling, and physics-constrained modeling [26,27]. Recent studies demonstrate that hybrid SR methods maintain high accuracy under small-sample conditions in photovoltaic power prediction while producing formulas with clear physical meaning [28]. These studies illustrate that SR provides a robust and interpretable alternative to purely black-box time-series models in energy-related applications.

Third, the maximum power consumption, or TDP, of hardware components (such as GPUs) is also used to estimate the energy consumption of data center servers when designing energy and temperature management systems. However, for specific AI tasks, this energy consumption estimate is often inaccurate because energy usage dynamically varies with the servers’ operational status [29].

Moreover, metadata from AI training has been utilized to measure energy consumption and develop energy-efficient AI models. To reduce energy consumption without compromising AI model performance, researchers are actively investigating and developing energy-efficient AI systems that optimize both algorithms and hardware. In these efforts, metadata from model training tasks—such as model parameters and batch sizes—has been used to analyze its influence on GPU energy consumption during training. For example, Tripp et al. [30] analyzed energy consumption data from training a Multi-Layer Perceptron (MLP) model to explore how dataset size and network architecture impact energy use. They highlighted the significance of caching effects and proposed an energy-saving approach that integrates network design, algorithm development, and hardware optimization. Similarly, Yang et al. [31] developed an energy estimation framework to guide pruning by measuring the energy consumption of matrix multiplication operations in convolutional neural networks (CNNs), facilitating energy-focused model compression. These studies provide valuable insights for designing energy-efficient AI algorithms and hardware. Such techniques assist researchers in creating more energy-efficient neural networks, thereby reducing the overall energy required for AI training and inference in specific applications. Furthermore, only a few studies have examined power consumption patterns during deep learning training to predict power usage based on metadata, such as estimating average energy consumption during training and average energy per epoch using regression models [32].

However, detailed analyses of the energy consumption characteristics of neural network training tasks on GPUs in data centers remain scarce. Moreover, studies that predict the GPU energy consumption of AI training in data centers from task metadata are limited, particularly those employing symbolic regression to derive analytical expressions that explicitly link metadata to energy consumption.

In this article, we focus on predicting the energy consumption of artificial intelligence tasks involved in training neural network models within data centers. By analyzing the relationship between the configuration of model training tasks and their energy consumption, we propose a GPU energy consumption prediction scheme based on task metadata such as the number of training samples, model parameters, batch size, and epochs. This approach differs from traditional time series forecasting of data center energy consumption, which relies on historical time series data of energy consumption. We employ genetic programming-based symbolic regression to predict GPU energy consumption, enabling the estimation of server energy usage prior to executing deep neural network training tasks. To create a training environment and collect GPU energy consumption data, we train models from the Residual Network (ResNet) family [33], the Visual Geometry Group (VGG) family [34], the Vision Transformer (ViT) family [35], MobileNet [36], and the BERT family [13] on A800 GPUs. The results of predicting energy consumption curve characteristics demonstrate strong performance, with mean relative errors (MRE) of 2.67% for valley energy, 8.42% for valley duration, 5.16% for peak power, and 3.64% for peak duration.

Our contributions are summarized as follows:

(i): We analyze the relationship between the configuration of model training tasks and their energy consumption on GPUs, identifying consistent patterns that enable energy prediction across various settings.
(ii): We utilize metadata from AI training tasks to predict GPU energy consumption, enabling accurate estimation of server energy usage without requiring extensive historical energy data prior to task execution. This approach differs from conventional time-series energy prediction methods and supports flexible, time-independent optimization of energy-efficient scheduling for deep learning model training in data centers. It provides an alternative to traditional energy consumption prediction techniques, such as time-series forecasting, especially when sufficient historical energy data is unavailable to develop a reliable and generalizable predictive model.
(iii): We propose a genetic programming-based symbolic regression method to develop analytical formulas that explicitly link metadata to energy consumption. This approach delivers accurate predictions even with limited training data and produces interpretable equations that can be directly applied to task scheduling or energy management in data centers.

2. Analysis of Training Tasks and Energy Consumption

2.1. Overview of AI Model Training

Currently, data centers primarily utilize heterogeneous computing systems, including CPUs and GPUs, for training AI models. The GPU serves as a coprocessor, delivering essential computational power for AI training and inference thanks to its robust parallel processing capabilities and efficient memory architecture. Meanwhile, the CPU operates as a general-purpose processor, managing logic control, task scheduling, and complex decision-making tasks such as branch prediction and operating system management.

The entire process of AI model training primarily consists of two stages: data preparation and model training. These stages not only determine the model’s performance and convergence speed but also significantly impact overall energy consumption. During the data preparation stage, which precedes model training, tasks such as data cleaning, preprocessing, and partitioning the dataset into training, validation, and test sets must be completed. Data preprocessing operations—including data cleaning and interpolation, outlier detection and correction, feature transformation and encoding, data augmentation, pipeline optimization, dimensionality reduction, and sparsity enhancement—improve data quality and enhance the effectiveness of deep learning training. However, these operations also consume energy from the server [37,38,39]. Due to the variability of data preprocessing methods across different training tasks, this article primarily focuses on the power consumption of model training tasks, assuming the data is already prepared. The model training phase encompasses forward propagation, backpropagation, parameter updates, and distributed gradient synchronization. The computational demands of each step directly influence the continuous power output of the GPU, which typically accounts for the majority of total training energy consumption [30,40].

2.2. Energy Consumption Characteristics of Model Training

The execution of deep neural network training tasks depends on the model architecture, software and hardware environment, training configuration, and training data. These factors also influence the energy consumption of model training. In specific data centers, the software and hardware environment for model training is generally fixed, while the model architecture, training configuration, and training data continuously change to accommodate varying training task requirements.

From the perspective of model training computation, the training process involves a clearly repetitive and iterative calculation procedure. Consequently, the energy consumption during model training may display regular patterns and periodicity.

2.2.1. Epoch-Based Periodicity

The loss function of neural networks is mathematically non-convex, resulting in numerous local minima, saddle points, and flat regions. This non-convex optimization problem requires multiple iterations to solve effectively. Consequently, during neural network training, a single pass through the entire training dataset (one epoch) typically only reaches a temporary valley. Multiple epochs of gradient descent are necessary to gradually approach the global minimum or a better local minimum. In other words, from a temporal perspective, one epoch in model training represents a complete computational cycle. For example, if the training set contains 10,000 samples, one epoch means that all samples have been processed once, and the model parameters have been updated accordingly.

Between each epoch, additional data processing and calculations are typically performed. These include dynamically adjusting the learning rate to accelerate convergence, saving the best model during training, recording training and validation metrics for subsequent analysis, and shuffling the data to vary the order in each epoch, thereby reducing the model’s dependence on a specific data sequence. These operations generally require less computation and consume less energy but often involve coordination between the GPU and CPU, which can be time-consuming.

2.2.2. Batch-Based Periodicity

In each epoch, model training requires processing training samples in batches due to GPU memory limitations and to enhance computational efficiency through parallel computing and pipelining. Batch training is an effective approach for deep learning in engineering and practical applications. In this context, a batch in model training represents an iterative process. If an iteration is defined as the processing of a single batch, then the total number of iterations equals the total number of samples divided by the batch size, multiplied by the number of epochs. A batch is a subset of samples used as a single input to the model, and the batch size (Batch_size) refers to the number of samples processed in one batch. For example, if the training set contains 10,000 samples and the batch size is set to 100, then the 10,000 samples are divided into 100 batches.
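The iteration count described above can be written out as a short calculation. The following is a minimal sketch; the function name and the `drop_last` option (whether a final partial batch is discarded) are illustrative assumptions, not part of the original text.

```python
import math


def iteration_counts(num_samples: int, batch_size: int, num_epochs: int,
                     drop_last: bool = False):
    """Return (batches_per_epoch, total_iterations) for a training task."""
    if drop_last:
        # A final partial batch is discarded.
        batches_per_epoch = num_samples // batch_size
    else:
        # A final partial batch still counts as one iteration.
        batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * num_epochs


# The example from the text: 10,000 samples with a batch size of 100.
batches, total = iteration_counts(10_000, 100, num_epochs=5)
print(batches, total)  # 100 500
```

When the sample count is not a multiple of the batch size, the two conventions differ by one iteration per epoch.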

Between each batch, several additional calculations occur. For example, data is typically copied from external storage on the server to GPU memory before being processed on the GPU. In pipeline processing, data can be pre-copied to GPU memory to optimize performance. In distributed training, parameter synchronization is performed between different GPUs using specific synchronization mechanisms. For instance, in the Distributed Data Parallel (DDP) framework introduced in PyTorch version 1.1, synchronization is automatically triggered after the gradient calculation for each batch to ensure that all devices use consistent parameters for the next computation cycle. The cost of these operations primarily depends on the data size of each batch and, in distributed training, on the model size. In practice, their energy consumption and duration are generally small, so they produce no noticeable dips in the energy consumption curve.

2.2.3. The Main Calculation Operation in Model Training

In each batch of model training, the computational process primarily includes forward propagation, backward propagation, parameter updates and optimization, as well as gradient synchronization in distributed training. These processes typically account for the majority of the total energy consumption in deep learning.

Forward propagation: Forward propagation primarily involves linear transformations and nonlinear activation functions. Linear transformations include matrix multiplication and convolution operations, while nonlinear activations include ReLU, Sigmoid, and Tanh functions. During this stage, input data is progressively transformed, and features are extracted layer by layer through the neural network architecture. In complex network structures such as ResNet and Transformer, the number of floating-point operations (Flops) during forward propagation increases significantly, leading to some of the highest levels of GPU utilization [41].

Backpropagation: Backpropagation uses the chain rule to compute gradients in reverse order. This process requires storing intermediate activation values and performing matrix operations of comparable scale, which often maintains GPU power consumption at peak levels and significantly increases memory bandwidth usage and cache pressure [42].

Parameter Update and Optimization: Optimization algorithms, such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), involve additional operations, including calculating the gradient of the loss function with respect to the parameters, computing momentum vectors, and estimating variance. These operations increase computational load and memory access, which in turn affect energy consumption and latency. Research has shown that selecting an optimizer appropriate for the task can reduce energy consumption by up to 20% [43].

Gradient synchronization in distributed training: In distributed training, particularly when employing data-parallel strategies, gradient synchronization also consumes energy. Research indicates that factors such as network topology, bandwidth limitations, and parameter aggregation methods affect the overall energy efficiency of training [44].

In summary, the model training process is an iterative computation based on epochs and batches. Consequently, the energy consumption during training should exhibit regular or periodic patterns corresponding to the epochs. Various training configurations and data characteristics, such as batch size, data volume, dataset size, model architecture, number of layers, parameter count, and thread count, affect the periodicity of the training epochs and the associated energy consumption patterns to varying degrees. We analyze and validate these energy consumption characteristics in the experiments of Section 4.

3. Energy Consumption Prediction for Training Tasks

According to the analysis of server energy consumption characteristics in deep neural network training tasks, server energy consumption exhibits periodicity. The energy consumption sequence curve shows peaks and valleys influenced by the configuration parameters of the training task. In a given data center server hardware environment, server energy consumption is determined by the model architecture, the volume of training data, and the configuration of the training tasks.

In this article, we refer to the configuration information, model structure details, and training data characteristics of AI model training tasks collectively as training task metadata. This metadata specifically includes model architecture, model parameter size, training batch size, total number of samples, number of training epochs, number of computing nodes used, number of AI GPUs per node, data size of individual samples, and the number of threads.

We define the energy consumption prediction problem as follows: given a data center server hardware environment and a training task, the goal is to predict the energy consumption of that training task using task metadata collected from historical or experimental training tasks.

To address this problem, this paper proposes a method for predicting the energy consumption of model training tasks on data center servers, based on the regular patterns observed in training task energy usage. As illustrated in Figure 1, the process begins with the collection of metadata and energy consumption data from deep neural network training tasks. Using these data, a prediction model is developed to capture the time series characteristics of energy consumption for each type of model architecture. Finally, the energy consumption of a target training task is predicted based on its metadata.

As shown in Figure 1, the analysis of training task energy consumption characteristics first examines the energy consumption time series curve to determine whether it is periodic. For periodic training tasks, the shape of the curve within each epoch is then analyzed to determine whether it follows a clear pattern and, if so, what its specific shape characteristics are. The model structures whose training tasks produce linear or square-wave energy consumption curves are collected, and identical model structures are grouped into one category, because different model structures (such as ResNet and VGG) consume different amounts of energy during training [45]. An energy consumption prediction model is then constructed for each category from the metadata and energy data of training tasks with similar model structures. In this way, the server energy consumption prediction problem is reduced to three machine learning steps: building the energy consumption dataset of training tasks, constructing the energy consumption prediction model, and predicting the energy consumption of new training tasks.

By training models with different parameter combinations under the same model architecture and collecting their training task metadata and energy consumption, we obtain a metadata and energy consumption dataset for that architecture. A prediction model is then trained on this dataset to obtain the energy consumption prediction model for the architecture. For an energy consumption time series curve with a square-wave shape, the task is to predict the peak, the valley, and their corresponding durations. The peak represents the energy consumption during the model training computation phase, while the valley represents the energy consumption during non-training computations, such as saving the model between epochs. In this article, we use a concise but effective method, symbolic regression, to predict these four variables describing the energy consumption curve from training task metadata.
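To illustrate how four square-wave descriptors can be turned into a load estimate, the sketch below integrates an idealized square wave over all epochs. It assumes that both the peak and the valley are expressed as power levels in watts and that every epoch consists of exactly one peak segment followed by one valley segment. The function names and the numeric values are hypothetical and are not taken from the paper's measurements.

```python
def square_wave_energy_wh(peak_power_w: float, peak_duration_s: float,
                          valley_power_w: float, valley_duration_s: float,
                          num_epochs: int) -> float:
    """Energy (Wh) of an idealized square-wave power curve over all epochs."""
    # One epoch = one peak segment plus one valley segment (assumption).
    per_epoch_joules = (peak_power_w * peak_duration_s
                        + valley_power_w * valley_duration_s)
    return per_epoch_joules * num_epochs / 3600.0  # 1 Wh = 3600 J


def average_power_w(peak_power_w: float, peak_duration_s: float,
                    valley_power_w: float, valley_duration_s: float) -> float:
    """Time-weighted mean power (W) over one epoch of the square wave."""
    total_time = peak_duration_s + valley_duration_s
    return (peak_power_w * peak_duration_s
            + valley_power_w * valley_duration_s) / total_time


# Hypothetical task: 300 W for 120 s per epoch, then 80 W for 10 s, 10 epochs.
energy = square_wave_energy_wh(300.0, 120.0, 80.0, 10.0, num_epochs=10)
print(f"{energy:.2f} Wh")
```

If the valley is instead reported as an energy per epoch rather than a power level, the valley term would be added directly without multiplying by its duration.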

Because the dataset is small, we apply symbolic regression based on genetic programming (GP) for prediction. First, we use Pearson correlation for feature selection to remove irrelevant features and improve the search efficiency of the symbolic regression model. Symbolic regression is a regression analysis method that combines machine learning and symbolic computation: through heuristic search, it iteratively optimizes in the space of symbolic expressions and derives mathematical expressions that describe the relationship between inputs and outputs from data. The evolutionary mechanism of genetic programming enables symbolic regression to handle complex nonlinear relationships and high-dimensional data. By randomly generating an initial population and selecting the individuals with the highest fitness for reproduction in each generation, genetic programming can effectively explore the solution space and reduce the risk of getting stuck in local optima.
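The Pearson-based feature selection step could look roughly like the following sketch. The threshold of 0.3, the feature names, and the sample values are illustrative assumptions; the text does not state which cut-off was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.3):
    """Keep features whose absolute correlation with the target is large enough."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Hypothetical metadata columns and a hypothetical target (peak duration in s).
columns = {
    "data_size": [1000, 2000, 3000, 4000],
    "num_workers": [4, 4, 4, 4],  # constant in this toy sample
}
peak_duration = [10.0, 20.0, 30.0, 40.0]
kept, scores = select_features(columns, peak_duration)
print(kept, scores)
```

Pearson correlation only captures linear dependence, so a feature with a strong nonlinear effect can still score low; the threshold is best treated as a coarse filter.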

Specifically, the gplearn library [46] is employed to perform symbolic regression. Gplearn extends the scikit-learn machine learning library to implement genetic programming (GP) for symbolic regression. By simulating natural selection and genetic operations, mathematical expressions are automatically generated and optimized to best fit the data.
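To make the evolutionary search concrete, the following is a deliberately small, dependency-free sketch of genetic programming for symbolic regression. It is not gplearn and not the configuration used in this article (gplearn exposes this functionality through its scikit-learn-style `SymbolicRegressor` estimator). The operator set, mean-absolute-error fitness, tournament size, mutation scheme, and toy data are all assumptions chosen for brevity.

```python
import math
import random

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
SYMBOLS = {"add": "+", "sub": "-", "mul": "*"}


def random_expr(rng, n_vars, depth):
    """Grow a random expression tree with at most `depth` operator levels."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("var", rng.randrange(n_vars))
        return ("const", round(rng.uniform(-2.0, 2.0), 2))
    op = rng.choice(sorted(OPS))
    return (op, random_expr(rng, n_vars, depth - 1), random_expr(rng, n_vars, depth - 1))


def evaluate(expr, x):
    """Evaluate an expression tree on one feature vector x."""
    if expr[0] == "var":
        return x[expr[1]]
    if expr[0] == "const":
        return expr[1]
    return OPS[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))


def mae(expr, X, y):
    """Fitness: mean absolute error (lower is better); non-finite maps to inf."""
    err = sum(abs(evaluate(expr, x) - t) for x, t in zip(X, y)) / len(y)
    return err if math.isfinite(err) else float("inf")


def depth_of(expr):
    if expr[0] in ("var", "const"):
        return 0
    return 1 + max(depth_of(expr[1]), depth_of(expr[2]))


def mutate(expr, rng, n_vars):
    """Subtree mutation: walk down at random and replace one subtree."""
    if expr[0] in ("var", "const") or rng.random() < 0.3:
        return random_expr(rng, n_vars, 2)
    if rng.random() < 0.5:
        return (expr[0], mutate(expr[1], rng, n_vars), expr[2])
    return (expr[0], expr[1], mutate(expr[2], rng, n_vars))


def show(expr):
    if expr[0] == "var":
        return f"x{expr[1]}"
    if expr[0] == "const":
        return str(expr[1])
    return f"({show(expr[1])} {SYMBOLS[expr[0]]} {show(expr[2])})"


def evolve(X, y, generations=20, pop_size=100, seed=0, max_depth=6):
    """Return (best expression, best fitness per generation)."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    pop = [random_expr(rng, n_vars, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(((mae(e, X, y), e) for e in pop), key=lambda p: p[0])
        history.append(scored[0][0])
        nxt = [scored[0][1]]  # elitism: carry the best individual over unchanged
        while len(nxt) < pop_size:
            # Tournament selection: best of 5 randomly drawn individuals.
            _, parent = min(rng.sample(scored, 5), key=lambda p: p[0])
            child = mutate(parent, rng, n_vars)
            # Reject overly deep children to limit expression bloat.
            nxt.append(child if depth_of(child) <= max_depth else parent)
        pop = nxt
    best_err, best = min(((mae(e, X, y), e) for e in pop), key=lambda p: p[0])
    history.append(best_err)
    return best, history


# Toy data (hypothetical): two scaled metadata features and a synthetic target.
X = [[float(a), float(b)] for a in range(1, 5) for b in range(1, 4)]
y = [2.0 * a + a * b for a, b in X]
best, history = evolve(X, y)
print(show(best), history[-1])
```

Because the best individual is always carried over, the recorded error never increases between generations. A production setup would normally add crossover, a parsimony penalty, and protected division, which libraries such as gplearn provide.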

4. Experiments

We first conduct experiments analyzing the energy consumption characteristics and predicting energy usage on GPU servers to validate both the energy consumption characteristics during model training and the proposed energy consumption prediction scheme.

4.1. Experimental Environment and Setup

In this section, we conduct energy consumption experiments using ResNet and VGG series models as representative neural network architectures. The ResNet model [33] is a family of deep convolutional neural networks designed to address the challenges associated with training very deep networks. Its architecture enables the model to bypass one or more layers, facilitating the training of significantly deeper networks by mitigating the vanishing gradient problem. The VGG model [34], developed by the Visual Geometry Group at the University of Oxford, is another family of deep convolutional neural networks specifically designed for image classification and object detection tasks. The ViT model [35] is a pure transformer architecture that treats an image as a sequence of fixed-size 16 × 16 patches, each linearly embedded into a vector similar to a word token in natural language processing. The BERT model [13] is a deeply bidirectional, unsupervised language representation model that pre-trains a single Transformer encoder on large corpora of unlabeled text using two self-supervised learning objectives: masked language modeling and next sentence prediction.

The GPU server used for model training, based on NVIDIA’s HGX platform, is equipped with 2 × 18-core Intel Xeon Gold processors, 16 × 64 GB DDR4 memory modules, one 480 GB SATA SSD, two 3.84 TB NVMe SSDs, and four NVIDIA Tesla A800 GPUs. Each A800 GPU features 80 GB of memory and a 5120-bit memory interface. The A800 GPUs used in our experiments have a TDP of 400 watts each and are equipped with liquid cooling systems. The data center environment is maintained at approximately 22 degrees Celsius with a relative humidity of about 50%. The system runs CUDA version 12.1 on Rocky Linux 8.6.

The specific models used for training are ResNet50, ResNet101, VGG16, ViT, and BERT. The training dataset is ImageNet, and the optimizer employed is stochastic gradient descent (SGD). We trained the models in two environments: a single GPU and multiple GPUs. Multi-GPU training is performed on four GPUs using data parallel training.

For ImageNet data, we employ a preprocessing and data augmentation pipeline similar to that of standard ResNet training. Specifically, RandomResizedCrop(s) is first applied to randomly crop the input image and rescale it to a fixed resolution of s × s, where s ∈ {224, 512, 1024} is varied to investigate the effects of input resolution on performance and energy consumption. Subsequently, RandomHorizontalFlip is applied to improve the model’s robustness to changes in pose and perspective. Next, ToTensor converts the image to a tensor and linearly scales the pixel values from [0, 255] to [0, 1]. Finally, we normalize each channel using the official ImageNet channel statistics, that is, we compute (x − μ_c)/σ_c for each channel c, where μ = (0.485, 0.456, 0.406) and σ = (0.229, 0.224, 0.225). Data loading uses ImageFolder to read images from the ImageNet training set directory and sequentially samples a fixed number of images (dataset_size) to form a training subset, depending on the experimental settings.
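As a worked illustration of the last two preprocessing steps (scaling to [0, 1] and per-channel normalization), the following minimal sketch applies them to a single RGB pixel in plain Python. The function name normalize_pixel is hypothetical and is not part of the authors' pipeline, which relies on the framework transforms named above.

```python
# ImageNet channel statistics quoted in the text (R, G, B order).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD))


# A black pixel maps to -mean/std in every channel.
black = normalize_pixel((0, 0, 0))
```

Because the transformation is affine per channel, it changes the numeric range of the input but not the amount of computation per image, so it should not by itself affect the energy comparison across resolutions.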

During the model training process, we used custom shell scripts with the NVIDIA tool NVIDIA-SMI to collect GPU power consumption and utilization data every 100 milliseconds, recording the timestamp along with the corresponding task metadata.
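The collection step can be sketched as follows. The authors used custom shell scripts; this Python version is only an assumed equivalent that wraps the same NVIDIA-SMI tool. The helper names (build_command, parse_row, stream_samples) and the chosen query fields are illustrative, and the parser assumes every field is numeric (a GPU that reports "[N/A]" for power would need extra handling).

```python
import subprocess

# Fields requested from nvidia-smi; the selection is an assumption.
FIELDS = ["timestamp", "index", "power.draw", "utilization.gpu"]


def build_command(interval_ms=100):
    """Build an nvidia-smi command that prints one CSV row per GPU per interval."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "-lms", str(interval_ms),
    ]


def parse_row(line):
    """Turn one CSV row into a dict with numeric power (W) and utilization (%)."""
    timestamp, index, power, util = [part.strip() for part in line.split(",")]
    return {"timestamp": timestamp, "gpu": int(index),
            "power_w": float(power), "util_pct": float(util)}


def stream_samples(interval_ms=100):
    """Yield parsed samples until nvidia-smi exits or the consumer stops."""
    proc = subprocess.Popen(build_command(interval_ms),
                            stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            if line.strip():
                yield parse_row(line)
    finally:
        proc.terminate()
```

In use, each sample yielded by stream_samples() would be written to a log together with the task metadata (batch size, image size, and so on) so that power traces can later be joined with their configuration.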

4.2. Dataset Construction

According to the experimental setup described above, both single-GPU and multi-GPU (four GPUs) configurations were used to train VGG and ResNet models using the PyTorch framework. Model features were recorded, and information was collected, including task metadata such as batch size, runtime, and power consumption during model training. Separate datasets were constructed for single-GPU and multi-GPU scenarios. Different parameters for the model training tasks were set, as shown in Table 1, to obtain the energy consumption of the training tasks under various parameter configurations.

Among the parameters listed in the table, Batch_size refers to the number of samples processed by the model during each training iteration. Data_size represents the total number of samples in the training dataset. Num_epoch indicates the number of training epochs, i.e., complete passes through the dataset. Num_workers specifies the number of CPU threads used, which affects the efficiency of CPU operations such as CPU-GPU interaction and data storage. Image_size denotes the input image size used for training the models.

We record the specific parameters listed in Table 1 during the model training process and collect GPU power consumption every 100 milliseconds. Additionally, we calculate the model’s intrinsic properties, including the number of model parameters (Params) and Flops. Flops represents the number of floating-point operations and serves as a metric of the complexity of an algorithm or model. We use CalFlops [47] to calculate the Flops of the model. CalFlops is designed to compute the theoretical number of Flops, multiply-add operations, and parameters for various neural networks.

GPU power traces were collected using NVIDIA-SMI at a fixed sampling interval of 100 ms (10 Hz). To assess sampling aliasing, we compared power measurements taken at 100 ms (10 Hz) and 1 ms (1000 Hz) intervals. The mean difference between the 1 ms and 100 ms intervals was less than 1%, with a maximum difference below 3%. Based on this, we selected the 100 ms sampling interval for data collection. Additionally, since all four GPUs are located on the same host and share the OS-level time source, their power traces are synchronized with a sampling resolution of 100 ms.

To minimize noise in the GPU power data, we conducted at least three independent replicates and averaged the data point by point to obtain a stable average power trajectory. Additionally, within a single run, we applied a 3–5 point moving average to the raw data to smooth out power fluctuations occurring on the millisecond scale, thereby reducing noise caused by instantaneous spikes.
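The two denoising steps can be sketched as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, the edge handling of the moving average (a shrinking window) is an assumption, and the text does not state in which order smoothing and replicate averaging were applied.

```python
def pointwise_mean(runs):
    """Average several equally sampled power traces sample by sample."""
    length = min(len(run) for run in runs)  # truncate to the shortest run
    return [sum(run[i] for run in runs) / len(runs) for i in range(length)]


def moving_average(trace, window=3):
    """Centered moving average; the window shrinks at the edges of the trace."""
    half = window // 2
    smoothed = []
    for i in range(len(trace)):
        lo, hi = max(0, i - half), min(len(trace), i + half + 1)
        smoothed.append(sum(trace[lo:hi]) / (hi - lo))
    return smoothed


# Three replicate runs (W) sampled at the same instants.
replicates = [[100, 300, 100], [200, 300, 200], [300, 300, 300]]
stable_trace = pointwise_mean(replicates)
```

With a 100 ms sampling interval, a 3 to 5 point window spans 0.3 to 0.5 s, which is short compared with the epoch-scale peaks and valleys analyzed later, so the smoothing should leave that structure intact.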

The collected energy consumption data for training tasks is divided into two categories: single-GPU energy consumption data, which includes information from 241 differently configured model training tasks, and multi-GPU energy consumption data, which comprises data from 218 models with various configurations for four-GPU training tasks.

4.3. Analysis of Energy Consumption Characteristics

This section analyzes the impact of varying parameters, such as batch size, total data size, and the number of model layers, on energy consumption during model training.

Observation 1: The energy consumption of training is positively correlated with the batch size used during model training.

In a single-GPU training environment, the impact of different batch sizes on energy consumption during ResNet model training—under the same parameter settings (Model = ResNet50, Num_workers = 18, Image_size = 224 × 224, Dataset size = 100,000)—is illustrated in Figure 2a. We compared exponential, linear, logarithmic, power, and moving average functions to model energy consumption in the experiments presented in this paper, including subsequent experiments, to identify the best-performing function. The logarithmic function was found to be the best fit and was selected to model energy consumption during single-GPU training because GPU power usage increases at a progressively slower rate as batch size grows. The fitted logarithmic function is expressed as y = 15.692 ln(x) + 202.35. The goodness of fit, indicated by the coefficient of determination (R²) of 0.9455, demonstrates that the logarithmic model accurately captures the positive correlation between power consumption and batch size in single-GPU training.

In a multi-GPU setup, the impact of different batch sizes on energy consumption during ResNet model training—using consistent parameter settings (Model = ResNet50, Num_workers = 32, Image_size = 224 × 224, Dataset size = 100,000)—is illustrated in Figure 2b. The data were modeled using a logarithmic function described by the equation y = 171.95 ln(x) − 291. This model demonstrated an excellent fit in Figure 2b, with an R² value of 0.9906.

As illustrated in Figure 2, with identical parameter settings, the energy consumption for training tasks using the ResNet model is positively correlated with the batch size. Moreover, this relationship follows a logarithmic function: initially, the curve rises rapidly, almost linearly, then gradually levels off as the input values increase.
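A logarithmic fit of the form y = a ln(x) + b reduces to ordinary least squares on the transformed variable t = ln(x). The sketch below shows that calculation together with R²; it is an illustration under assumptions, not the authors' fitting code. The batch sizes are hypothetical, and the y values are generated from the reported single-GPU curve rather than taken from measurements.

```python
import math


def fit_log(xs, ys):
    """Least-squares fit of y = a*ln(x) + b; returns (a, b, r_squared)."""
    ts = [math.log(x) for x in xs]
    n = len(ts)
    t_mean, y_mean = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    a = sxy / sxx
    b = y_mean - a * t_mean
    ss_res = sum((y - (a * t + b)) ** 2 for t, y in zip(ts, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic points lying exactly on the reported curve (not measured data).
batch_sizes = [16, 32, 64, 128, 256]
power = [15.692 * math.log(x) + 202.35 for x in batch_sizes]
a, b, r2 = fit_log(batch_sizes, power)
```

Under this model, each doubling of the batch size adds a constant a·ln(2) to the predicted value (about 10.9 for a = 15.692), which is the "progressively slower" growth described above.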

Observation 2: The energy consumption of training is positively correlated with the size of the training data.

In a single-GPU training environment, the size of the input data affects energy consumption. Three different image sizes—64 × 64, 224 × 224, and 512 × 512—were selected. Under the same model parameter settings (Model = ResNet50, Num_worker = 18, Batch_size = 64, Dataset size = 100,000), the energy consumption of the training task varied with the input data size, as shown in Figure 3a. The data were modeled using a logarithmic function described by the equation y = 25.351 ln(x) + 131.94. This model demonstrated an excellent fit with an R² value of 0.9894.

In the case of multiple GPUs, varying input image sizes under the model parameter settings (Model = ResNet50, Num_worker = 32, Batch_size = 128, Dataset size = 200,000) affect energy consumption during model training, as shown in Figure 3b. The data were modeled using a logarithmic function described by the equation y = 54.05 ln(x) + 151.08. This model demonstrated an excellent fit with an R² value of 0.9447.

It can be observed that, under the same parameter settings, the energy consumption of the training task is positively correlated with the size of the input image.

Observation 3: The energy consumption of training is positively correlated with the number of model layers.

In the case of a single GPU, the number of model layers significantly affected energy consumption during training under the same configuration (Num_worker = 9, Batch_size = 32, Image_size = 224 × 224, Dataset size = 100,000), as shown in Figure 4a. The data were modeled using a logarithmic function described by the equation y = 23.14 ln(x) + 168.12. This model demonstrated an excellent fit with an R² value of 0.9884.

In the case of multiple GPUs, the impact of the number of model layers on energy consumption during model training under the same configuration (Num_worker = 32, Batch_size = 64, Image_size = 224 × 224, Dataset size = 100,000) is shown in Figure 4b. We chose a logarithmic function to model energy consumption. The fitted logarithmic function is expressed as y = 105.38 ln(x) + 40.699. This model demonstrated a relatively good fit with an R² value of 0.9503. Compared to single-GPU training, multi-GPU training offers greater capacity to process complex models with more layers, but it also consumes more energy.

Observation 4: The energy consumption of training tasks exhibits periodic behavior.

In single-GPU training, the model used is ResNet101 with an image size of 224 × 224. The batch size is set to 256, Num_worker is 18, and the dataset size is 200,000. The power consumption curve is shown in Figure 5.

In multi-card training, the model used is ResNet50, with four GPUs. The image size is 224 × 224, the batch size is set to 128, the Num_worker is 64, and the dataset size is 200,000. The power consumption curves for each GPU are shown in Figure 6.

As shown in Figure 5 and Figure 6, the model training process exhibits periodicity across training epochs. During single-GPU training, the power consumption curve is approximately a horizontal straight line. In contrast, during multi-GPU training, the power consumption curve resembles a square wave, featuring distinct peaks and valleys. In single-GPU training, there are only peaks and no valleys, primarily because model storage and other operations between epochs involve only extremely brief data interactions between the GPU and CPU. In multi-GPU training, however, model storage and synchronization between epochs require more time, and the power consumption during these operations is lower than during the model computation stage, resulting in valleys. These experimental results align with the previous analysis.
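One simple way to turn such a square-wave trace into the four quantities used later (mean peak power, mean valley power, and their durations) is to split the trace at a fixed power threshold. The paper does not describe how these statistics were extracted, so the sketch below is purely an assumed procedure; the function name, the threshold, and the toy trace are hypothetical.

```python
def summarize_square_wave(trace, threshold_w, dt=0.1):
    """Split a power trace into peak/valley runs using a fixed threshold.

    Returns mean power (W) and mean run duration (s) for each state.
    """
    runs = {"peak": [], "valley": []}
    current, samples = None, []
    for p in trace:
        state = "peak" if p >= threshold_w else "valley"
        if state != current and samples:
            runs[current].append(samples)  # close the previous run
            samples = []
        current = state
        samples.append(p)
    if samples:
        runs[current].append(samples)
    summary = {}
    for state, segs in runs.items():
        values = [p for seg in segs for p in seg]
        summary[state] = {
            "mean_power_w": sum(values) / len(values) if values else None,
            "mean_duration_s": dt * len(values) / len(segs) if segs else None,
        }
    return summary


# Toy trace: two peaks of 4 samples and two valleys of 2 samples at 100 ms.
toy = [300] * 4 + [60] * 2 + [300] * 4 + [60] * 2
stats = summarize_square_wave(toy, threshold_w=180)
```

For a single-GPU trace that never drops below the threshold, the valley entries come back as None, which matches the observation that no valleys appear in that setting.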

Furthermore, we conducted experiments to examine the effects of training task configuration on energy consumption in additional models, specifically ResNet34, VGG16, BERT, and ViT. The results, presented in Section S1 of the Supplementary Materials, are consistent with the findings reported above for ResNet50.

There are additional parameters in training tasks, such as optimizer type, learning rate, and regularization strategy, which affect the number of steps required for model convergence and have a slight impact on energy consumption during each epoch cycle [48,49]. Furthermore, we conducted experiments to evaluate the effects of various optimizers—including Adaptive Moment Estimation (Adam), Adam with Weight Decay (AdamW), Stochastic Gradient Descent (SGD), and Root Mean Square Propagation (RMSprop)—as well as different learning rates (0.1, 0.01, 0.001, and 0.0001) on energy consumption. The findings are presented in Tables S1 and S2 of the Supplementary Materials, showing a minor effect on energy consumption throughout each epoch cycle. However, this effect is less pronounced than those observed for batch size, data size, and the number of model layers.

4.4. Energy Consumption Prediction Experiment and Results

In the energy consumption prediction experiment, the prediction model is first trained on the multi-GPU energy consumption dataset. The task metadata is then used as input features to forecast the power consumption curve during training, including the average power consumption at peaks and valleys and their corresponding durations.

4.4.1. Experimental Setup for Predicting Energy Consumption

The dataset includes 459 model training records, comprising 241 single-GPU and 218 multi-GPU entries. It is divided into training, validation, and testing sets in an 8:1:1 ratio. The six features of the dataset include image size, total number of samples, parameter size, model computational complexity, training batch size, and CPU thread count. These features are normalized using the StandardScaler method.
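The split and normalization can be sketched as follows. This is a plain-Python illustration under assumptions: the shuffle seed and the rounding of the 8:1:1 boundaries are not given in the text, and the helper names are hypothetical. The scaler uses the population standard deviation and statistics computed on the training split only, which is the usual way to apply standard scaling without leaking test information.

```python
import random
import statistics


def split_811(records, seed=0):
    """Shuffle and split records into train/validation/test at 8:1:1."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_scaler(column):
    """Return (mean, std) of a training column, using the population std."""
    return statistics.fmean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std; constant columns are only centered."""
    return [(x - mean) / (std or 1.0) for x in column]


train, val, test = split_811(list(range(459)))
mean, std = fit_scaler([1.0, 2.0, 3.0, 4.0])
z = standardize([1.0, 2.0, 3.0, 4.0], mean, std)
```

With 459 records and this rounding, the split yields 367 training, 45 validation, and 47 test records.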

Mean Relative Error (MRE) and Mean Squared Error (MSE) are selected as the metrics for evaluating prediction errors. Given N pairs of true and predicted values,

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}    (1)

MRE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|    (2)

where y_i is the true value of the i-th sample and ŷ_i is the predicted value of the i-th sample.
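Both metrics translate directly into code. The following minimal sketch uses hypothetical function names and made-up values purely to show the arithmetic; note that MRE is undefined when a true value is zero, which does not arise for positive power and duration targets.

```python
def mse(y_true, y_pred):
    """Mean squared error, Equation (1)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mre(y_true, y_pred):
    """Mean relative error, Equation (2); true values must be non-zero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only: errors of 10 W on true values of 100 W and 200 W.
y_true = [100.0, 200.0]
y_pred = [110.0, 190.0]
```

Here the MSE is 100 and the MRE is 7.5%: the two absolute errors are equal, but the relative error weights the smaller target more heavily, which is why the two metrics can rank models differently.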

4.4.2. Feature Selection

For each prediction task, we use Pearson correlation and mutual information (MI) to select key features. Pearson correlation captures linear relationships, while MI reflects nonlinear dependencies, together encompassing most plausible associations between metadata and energy consumption indicators. We first rank and select features for each task based on Pearson correlation, then compare these selections with MI-based results, finding that the two methods yield highly consistent rankings. The final selected features are summarized in Table 2. Additionally, since Pearson and MI scores can be directly interpreted as indicators of feature importance, this facilitates verification, understanding, and deployment of the resulting models.

For MI-based selection, we estimate continuous MI using a k-nearest neighbors (kNN) based estimator. Let the input feature matrix be X ∈ ℝ^{n×d} and the target variable be y ∈ ℝ^n. For each target metric (e.g., average load power avg_load_power, data loading time load_time, training time train_time), we compute the MI between y and each candidate feature to quantify feature importance. During MI estimation, variables with a small set of integer values (num_worker, batch_size, imgsize, dataset_size) are treated as discrete dimensions, while the remaining features are treated as continuous. The kNN estimator uses k = 5 neighbors, and we fix the random seed (random_state = 42) to ensure reproducibility. For each target variable, MI is computed both in the original feature space and in the log-transformed space for heavy-tailed features. The features are then ranked in descending order of MI, yielding the most informative predictors for the power and time metrics.

4.4.3. Symbolic Regression and Comparison Models

(1): Symbolic Regression

We design the symbolic regression search space using a compact yet expressive set of operators. For arithmetic operations, we include addition, subtraction, multiplication, and protected division. These operators capture linear combinations, contrasts, interactions, and ratio/normalization relationships while preventing numerical errors when the denominator is zero. For nonlinear transformations, we employ square root, protected logarithm, absolute value, negation, and reciprocal functions. These allow us to model diminishing returns, log-shaped scaling, deviation magnitude, sign flexibility, and inverse relationships in a numerically safe manner. Additionally, we incorporate a custom sigmoid function (implemented via make_function) to represent saturation or threshold effects, as well as standard trigonometric functions (sin, cos, tan) to capture potential periodic or oscillatory patterns in power traces. The internal fitness function for symbolic regression is the mean absolute error (MAE). For external evaluation and comparison with baseline models, we report MSE and MRE.
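The "protected" operators mentioned above can be illustrated with scalar Python functions. The threshold of 0.001 and the fallback values (1 for division, 0 for the logarithm) follow a common genetic-programming convention and are assumptions here; the text does not state the exact definitions used, and the sigmoid below is only one plausible form of the custom function.

```python
import math

EPS = 1e-3  # assumed threshold below which an argument counts as zero


def protected_div(a, b):
    """a / b, falling back to 1.0 when the denominator is near zero."""
    return a / b if abs(b) > EPS else 1.0


def protected_log(x):
    """ln|x|, falling back to 0.0 when x is near zero."""
    return math.log(abs(x)) if abs(x) > EPS else 0.0


def protected_sqrt(x):
    """Square root of |x|, so negative inputs stay real-valued."""
    return math.sqrt(abs(x))


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)
```

The point of these definitions is closure: every candidate expression the search generates evaluates to a finite number on every input, so a poor candidate receives a poor fitness score instead of crashing the run.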

Search algorithm parameter settings in gplearn are as follows. Population size (population_size) = 15,000: each generation contains 15,000 candidate expressions, and a larger population facilitates exploration of complex search spaces. Generations = 50: the search runs for at most 50 generations. Stopping criteria = 0.01: the algorithm terminates early when the fitness of the best expression reaches 0.01.

During the genetic operation, the crossover probability is set to p_cross = 0.7, indicating a 70% chance of generating new expressions through crossover.

P_subtree_mutation = 0.1: Subtree mutation probability of 10% (replaces the entire sub-expression); P_hoist_mutation = 0.05: Hoist mutation probability of 5% (prevents expressions from becoming too deep by simplifying the structure through pruning); P_point_mutation = 0.1: Point mutation probability of 10% (local modification of a function or variable).
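The settings above can be collected into a single keyword dictionary. The sketch below uses the keyword names of gplearn's SymbolicRegressor; mapping the paper's names onto them (for example p_cross to p_crossover) is an assumption, as is the choice of built-in operator names. The custom sigmoid would have to be created separately and appended to the function set, which is not shown.

```python
# Keyword names follow gplearn's SymbolicRegressor; values are from the text.
GP_PARAMS = {
    "population_size": 15000,
    "generations": 50,
    "stopping_criteria": 0.01,
    "p_crossover": 0.7,
    "p_subtree_mutation": 0.1,
    "p_hoist_mutation": 0.05,
    "p_point_mutation": 0.1,
    "metric": "mean absolute error",
    # Built-in operators only; the custom sigmoid is not included here.
    "function_set": ("add", "sub", "mul", "div", "sqrt", "log",
                     "abs", "neg", "inv", "sin", "cos", "tan"),
}

GENETIC_OPS = ("p_crossover", "p_subtree_mutation",
               "p_hoist_mutation", "p_point_mutation")


def reproduction_probability(params):
    """Probability that a program is copied unchanged into the next generation."""
    total = sum(params[name] for name in GENETIC_OPS)
    if total > 1.0 + 1e-12:
        raise ValueError("genetic operation probabilities must sum to at most 1")
    return 1.0 - total
```

The four listed probabilities sum to 0.95, so under this reading the remaining 0.05 is the share of programs that pass to the next generation unmodified.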

(2): MLP Model

We selected an MLP model with two or three hidden layers as the comparison model. The output of the first hidden layer can be expressed as

h^{(1)} = \delta(W_{1} x + b_{1})    (3)

where x is the input vector, W₁ is the weight matrix from the input layer to the first hidden layer, b₁ is the bias vector of that hidden layer, and δ(·) is the activation function.

To ensure a fair comparison, we systematically tuned the hyperparameters of the MLP baseline using the training and validation splits. The input layer size was set to the number of selected features, and the output layer contained a single neuron for each regression target. We performed a grid search over the following hyperparameter space: Hidden layer configurations: (64, 64), (128, 64), (128, 128), (256, 128); Activation functions: ReLU, Tanh, LeakyReLU; Learning rates (Adam optimizer): {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³}; Batch sizes: {64, 128, 256}; L2 weight decay: {0.1 × 10⁻⁴, 1 × 10⁻³}. Each model was trained for up to 200 epochs with early stopping (patience = 20 epochs) based on the validation MSE. For each target variable (e.g., valley energy consumption, peak energy consumption, valley duration, peak duration), we selected the configuration that achieved the lowest validation MSE and reported the corresponding test MSE and MRE as the final MLP performance.
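The size of this search space is easy to enumerate. The sketch below builds the Cartesian product of the listed values and selects the best configuration given a scoring function. The weight decay values are copied as printed, and validation_mse is a hypothetical callable standing in for "train the MLP with this configuration and return its validation MSE".

```python
from itertools import product

GRID = {
    "hidden_layers": [(64, 64), (128, 64), (128, 128), (256, 128)],
    "activation": ["relu", "tanh", "leaky_relu"],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [64, 128, 256],
    "weight_decay": [0.1e-4, 1e-3],  # copied as printed in the text
}


def iter_configs(grid):
    """Yield one dict per point of the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(grid, validation_mse):
    """Return the configuration with the lowest validation MSE."""
    return min(iter_configs(grid), key=validation_mse)


n_configs = sum(1 for _ in iter_configs(GRID))
```

The grid contains 4 × 3 × 4 × 3 × 2 = 288 configurations, each of which is searched separately for every prediction target.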

(3): Linear, Ridge, Lasso, and Gradient Boosting

The linear model is an ordinary least squares regressor without explicit regularization, serving as the simplest baseline.

For Ridge and Lasso regression, we perform a grid search over the regularization strength on the training set using 5-fold cross-validation. The candidate values are 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻², 1 × 10⁻¹, 1, and 10. The best value, which minimizes the validation MSE, is selected separately for each of the four prediction targets.

The Gradient Boosting Regressor is configured as a tree-based ensemble using squared-error loss. We tune the learning rate (0.01, 0.05, 0.1), the number of trees (200, 400, 800), the maximum depth of each tree (3, 4, or 5), and the subsampling ratio (0.8 or 1.0), employing 5-fold cross-validation on the training set. The final hyperparameters for each target are fixed after this search, with no further manual adjustments made.

All baseline models are trained using the exact same metadata features and evaluated with the same metrics as the proposed SR model to ensure a fair and reproducible comparison.

4.4.4. Results of Energy Consumption Prediction Experiments

According to the energy consumption prediction scheme based on model training task metadata proposed in this paper, the energy consumption prediction for training tasks includes four components: average energy consumption during valleys, average duration of valleys, average power consumption during peaks, and average duration of peaks.

(1): Results of Feature Selection

The results of feature selection using MI and Pearson correlation are presented in Table 2, where the selected features are indicated in black font with an asterisk (*). We rank the MI and Pearson correlation values and select those larger than 0.3. For example, for the task of predicting valley energy consumption, we select the features Image_size, Data_size, and Num_workers. For the task of predicting peak energy consumption, we select the features Image_size, Params, Flops, and Batch_size.
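The Pearson half of this selection rule can be sketched in plain Python. Applying the 0.3 threshold to the absolute value of the coefficient is an assumption (the text says only "larger than 0.3"), and the feature columns below are invented for illustration rather than taken from Table 2.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(vx * vy)


def select_features(columns, target, threshold=0.3):
    """Keep features whose |r| with the target exceeds the threshold, ranked."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    kept = [name for name, s in scores.items() if s > threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


# Invented toy data: one informative feature and one unrelated feature.
columns = {"Batch_size": [16, 32, 64, 128], "Noise": [1, -1, -1, 1]}
target = [10, 20, 30, 45]
selected = select_features(columns, target)
```

In this toy case only Batch_size survives the threshold; the unrelated column has a correlation of roughly 0.1 with the target and is dropped.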

(2): Results of Energy Consumption Prediction

Based on the selected features, we compared SR with MLP, Linear Regression, Ridge Regression, Lasso Regression, and Gradient Boosting Regression. The results, presented as the mean and standard deviation of 20 predictions, are shown in Table 3 and demonstrate that SR achieves the best performance, with MREs of 2.59% ± 0.55% for valley energy, 8.25% ± 1.28% for valley duration, 3.71% ± 0.57% for peak power, and 3.68% ± 1.02% for peak duration.

The four formulas derived from the SR models are provided in Section S3 of the Supplementary Materials. For the peak energy consumption prediction formula, the expression tree reaches a maximum depth of 30 and comprises 152 nodes, giving a depth-to-node ratio of about 0.20. The formula primarily includes Params, Flops, Batch_size, and Image_size, while excluding Data_size, consistent with our Pearson and mutual information analyses for this metric. The expression frequently applies logarithmic, square root, and reciprocal transformations to Batch_size and Image_size, indicating saturating, sub-linear growth rather than simple linear scaling. Additionally, it includes terms such as sigmoid(Batch_size − Image_size), which represent threshold-like effects depending on the relative values of batch size and image size, along with ratio-based interactions involving Params and Flops that normalize by model complexity.

4.4.5. Analysis of Energy Consumption Prediction

(1): Impacts of parameters in the SR model

To analyze the effects of parameters in the genetic programming-based SR model, we conducted additional experiments examining the impacts of population size, number of generations, and crossover probability. The results are presented in Tables S3 and S4 of the Supplementary Materials. As the population size and number of generations increase, the model’s predictive performance improves. However, once the population size reaches 15,000 and the number of generations reaches 50, performance gains begin to plateau. The performance differences among crossover probabilities of 0.6, 0.7, and 0.8 are minimal. Comparatively, the optimal crossover probability is 0.7.

(2): Feature contribution in the SR model

To quantify the contribution of each metadata feature to the prediction results, we conducted an input perturbation-based feature importance analysis on the explicit expression of SR using the test set. The feature contributions in the SR model are presented in Table S5 of the Supplementary Materials. These experimental results are consistent with previous findings on feature selection for each prediction task based on MI or Pearson correlation, as shown in Table 2. For example, in valley energy consumption prediction, image size, data size, and number of workers all contribute, with data size having a relatively greater impact.
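An input perturbation analysis of this kind can be sketched as a permutation test: shuffle one feature at a time across the test set and record how much the error of the explicit expression grows. The paper does not specify the perturbation scheme, so the shuffling, the MAE criterion, the number of repeats, and the toy expression below are all assumptions made for illustration.

```python
import random


def mae(model, rows, targets):
    """Mean absolute error of model(row) against the targets."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def perturbation_importance(model, rows, targets, seed=0, repeats=10):
    """Average MAE increase when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mae(model, rows, targets)
    importance = {}
    for name in rows[0]:
        increases = []
        for _ in range(repeats):
            column = [r[name] for r in rows]
            rng.shuffle(column)
            perturbed = [dict(r, **{name: v}) for r, v in zip(rows, column)]
            increases.append(mae(model, perturbed, targets) - baseline)
        importance[name] = sum(increases) / repeats
    return importance


def toy_expression(row):
    """Hypothetical stand-in for an SR formula that ignores 'Unused'."""
    return 2.0 * row["Data_size"]


rows = [{"Data_size": float(i), "Unused": float(i % 3)} for i in range(20)]
targets = [toy_expression(r) for r in rows]
scores = perturbation_importance(toy_expression, rows, targets)
```

A feature that does not appear in the expression receives an importance of exactly zero, which makes this check a convenient cross-validation of the earlier feature selection.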

(3): Predicting energy consumption from different architectures

Furthermore, we conducted experiments to predict energy consumption using various models to evaluate their ability to forecast energy usage for previously unseen architectures. Table 4 presents the energy consumption predictions for ResNet18, ResNet34, ResNet50, ResNet101, ViT, and BERT using the SR model trained on ResNet50 data, as well as predictions for ViT and BERT based on ViT data. The results demonstrate that models with similar architectures achieve higher prediction accuracy, whereas those with significantly different structures (such as predicting BERT or ViT energy consumption from ResNet data) exhibit comparatively lower accuracy. These findings indicate that the model architecture plays a critical role in the accuracy of energy consumption predictions.

In addition, the prediction results for model training energy consumption on the NVIDIA RTX 2080 Ti are presented in Table S6 of the Supplementary Materials. The SR model achieved the best performance, with an MRE of 2.12% for valley energy, 5.52% for valley duration, 2.95% for peak power, and 2.25% for peak duration, which is comparatively better than the predicted performance on the A800. This indicates that our approach maintains strong predictive accuracy across different GPU types.

4.4.6. Example of Energy Consumption Prediction

A practical example of energy consumption prediction is shown in Figure 7. The cycle length of energy consumption is defined as follows:

\mathrm{CycleLength} = \mathrm{epoch} \times (f_{1} + f_{2})    (4)

where f₁ represents the predicted valley duration, f₂ represents the predicted peak duration, and epoch denotes the number of training epochs. The training task is configured as follows: model = ViT; number of GPUs = 4; batch size = 256; image size = 224 × 224; dataset size = 100,000; and number of workers = 64.

For comparison, each GPU in Figure 7 has a TDP of 400 watts. The predicted square-wave power pattern has a duty cycle of 96.6% and a total epoch duration of 188.99 s, of which the valley period lasts 6.41 s and the peak period lasts 182.58 s. During the valley, the power draw is 62.23 watts per GPU, while during the peak it increases to 291.06 watts per GPU. The MSE values for valley energy consumption, valley duration, peak power, and peak duration are 5.68, 219.25, 80.47, and 987.31, respectively, with corresponding MRE values of 2.67%, 8.42%, 5.16%, and 3.64%.
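The four predicted quantities are enough to reconstruct the square wave. The sketch below computes the duty cycle and samples a predicted per-GPU power trace from the ViT values quoted above. The function names are hypothetical, and starting each cycle with the peak phase is an assumption, since the ordering within an epoch is not stated.

```python
def duty_cycle(valley_s, peak_s):
    """Fraction of one epoch cycle spent in the high-power (peak) state."""
    return peak_s / (valley_s + peak_s)


def square_wave(valley_s, peak_s, valley_w, peak_w, epochs, dt=0.1):
    """Sample the predicted per-GPU power (W) every dt seconds.

    Each cycle is assumed to start with the peak phase.
    """
    cycle = valley_s + peak_s
    n = round(epochs * cycle / dt)
    return [peak_w if (i * dt) % cycle < peak_s else valley_w for i in range(n)]


# Predicted values for the ViT example in the text.
duty = duty_cycle(6.41, 182.58)
wave = square_wave(6.41, 182.58, 62.23, 291.06, epochs=2)
```

For these values the duty cycle is 182.58 / (6.41 + 182.58) ≈ 0.966, and the total length of the sampled trace corresponds to Equation (4) with epoch = 2.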

Additionally, we conducted three more examples of energy consumption prediction for the ResNet50, BERT, and MobileNet models, as shown in Figures S8–S10 in the Supplementary Materials, respectively. For instance, the results for BERT indicate that the square wave has a duty cycle of 96.58%, with a total epoch duration of 175.2 s. Specifically, the valley period lasts 5.98 s, and the peak period lasts 169.25 s. During the valley period, energy consumption is 67.01 watts per GPU, while during the peak period, it reaches 320.72 watts per GPU.

Under the above configuration of the ViT, we further predict the total energy consumption and carbon emissions for one day for a single server and for a data center with 1000 servers, as shown in Table 5. A single server with four GPUs consumes approximately 27.2 kWh per day when running the predicted cycles continuously, corresponding to roughly 13.6 kg of CO₂ at a grid carbon intensity of 0.5 kg CO₂/kWh. For a data center with 1000 such servers, this scales to about 27.2 MWh and 13.6 metric tons of CO₂ per day.
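These daily figures follow from the predicted square wave by simple arithmetic: take the time-weighted mean power per GPU, multiply by the number of GPUs and by 24 h, and apply the carbon intensity. The sketch below reproduces the calculation with hypothetical function names. It counts GPU power only, which is an assumption on our part, although it does reproduce the reported 27.2 kWh.

```python
def mean_power_w(valley_s, peak_s, valley_w, peak_w):
    """Time-weighted average power of one square-wave cycle, per GPU."""
    return (valley_s * valley_w + peak_s * peak_w) / (valley_s + peak_s)


def daily_energy_kwh(avg_power_w, gpus=4, hours=24.0):
    """Energy drawn by the GPUs of one server over the given number of hours."""
    return avg_power_w * gpus * hours / 1000.0


def carbon_kg(energy_kwh, intensity_kg_per_kwh=0.5):
    """CO2 emissions for a given energy use and grid carbon intensity."""
    return energy_kwh * intensity_kg_per_kwh


# Predicted ViT values from the text: valley 6.41 s at 62.23 W, peak 182.58 s at 291.06 W.
avg_w = mean_power_w(6.41, 182.58, 62.23, 291.06)
server_kwh = daily_energy_kwh(avg_w)          # one server, four GPUs
server_co2 = carbon_kg(server_kwh)            # kg CO2 per day
datacenter_mwh = 1000 * server_kwh / 1000.0   # 1000 servers, in MWh
```

The mean power works out to about 283.3 W per GPU, roughly 71% of the 400 W TDP, which illustrates why a TDP-based estimate would overstate the load for this task.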

5. Discussion

The metadata-driven prediction methodology presented here offers an alternative to traditional techniques for predicting energy consumption; however, several limitations should be considered.

(1): Unseen Model Architectures: If the target prediction originates from a model family not represented in the training data—such as very deep transformers, mixture-of-expert models, or architectures featuring new types of operators—the associated metadata (including Flops, parameter counts, and activation shapes) may fall far outside the range encountered during training. In these cases, the symbolic regressor must extrapolate, which can lead to inaccurate predictions of energy consumption and duration during periods of both peak and valley energy consumption.
(2): Extreme or Unusual Hyperparameter Settings: Hyperparameter configurations such as very large batch sizes, atypical image or sequence lengths, extremely large datasets, or uncommon combinations of GPU counts and worker numbers can cause the metadata distribution to differ significantly from that of the training data. In these out-of-distribution scenarios, the learned relationship between metadata and energy-time metrics may no longer hold, resulting in consistent underestimation or overestimation.
(3): Changing or Inconsistent Data Center Conditions: The predictor assumes relatively stable hardware and runtime environments, including GPU generation, cooling efficiency, dynamic voltage and frequency scaling (DVFS) policies, background interference, and job scheduling patterns. Using the model in a different data center or under rapidly changing conditions—such as thermal throttling or strict power limits—can alter the underlying power usage patterns and introduce bias.
(4): Distribution Shifts Over Time: As new models, software libraries, and kernel implementations emerge, the same metadata (for example, Flops or batch size) may correspond to different execution behaviors and power profiles. Without retraining or updates, a model based on historical data may gradually become outdated and fail to accurately represent current periodic and peak-valley energy patterns.

Several practical enhancements can help overcome these limitations and facilitate deployment in real-world data center environments.

(1): Calibration for new architectures: For model architectures not previously encountered, the current symbolic model can serve as a useful prior and be fine-tuned using a small number of calibration training runs, enabling adaptation to new architectures with minimal additional data. Specifically, this challenge can be addressed by constructing a larger energy consumption dataset using similar model architectures to obtain calibration data, or by experimentally training the target model architecture with a small amount of data and few epochs to obtain calibration energy consumption measurements.
(2): Periodic or online updates: Since hardware and software environments continuously evolve, the predictor can be updated periodically or incrementally by incorporating recent measurements. In practice, this may involve retraining the symbolic regressor on an expanded dataset at regular intervals.

6. Conclusions

Accurately forecasting the energy consumption of deep neural network training tasks is a complex challenge. Energy consumption models correlate system component status indicators with energy usage, enabling real-time estimation of changes in system energy consumption. These models are valuable for monitoring energy and scheduling resources during task execution; however, predicting energy consumption in advance remains difficult. Approaches that utilize historical time-series data employ deep learning techniques to predict energy consumption in data centers and can forecast energy usage sequences ahead of time. Nevertheless, their accuracy is affected by the unpredictability of computing tasks, the distribution of time-series training data, and the model’s generalization capability. Consequently, directly estimating energy consumption based on hardware’s maximum power consumption or TDP often proves inaccurate.

In this paper, we present a method for predicting energy consumption based on metadata from deep neural network training tasks. This approach overcomes the limitations of traditional prediction techniques that depend on historical time-series data or real-time performance monitoring. It utilizes task metadata, such as the number of model parameters, computational complexity, and input data size, as key input features to develop a simple yet effective model for forecasting energy consumption during training tasks. This model can predict energy consumption characteristics before task execution, thereby facilitating energy management and task scheduling in data centers that handle model training workloads. For example, by scheduling training tasks based on energy predictions, data centers can maintain steady overall power consumption and avoid excessive energy use. Moreover, this approach helps reduce peak demand on the grid, smooth out periods of low energy usage, and enhance the utilization of renewable energy sources.

For future work, we plan to further investigate the relationship between deep model training tasks and their energy consumption in various data centers. We will examine additional factors and variables that influence energy consumption and incorporate them into the metadata collection. Furthermore, we aim to enhance the models’ ability to generalize when forecasting energy consumption. Our ultimate goal is to implement this approach in real data center environments to evaluate its performance and reliability.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/en19020448/s1. Figure S1: Energy consumption of ResNet18 with different batch sizes; Figure S2: Energy consumption of VGG16 with different batch sizes; Figure S3: Energy consumption of ViT with different batch sizes in multiple GPUs; Figure S4: Energy consumption of ViT with different batch sizes in single GPU; Figure S5: Energy consumption of ViT with different image sizes in single GPU; Figure S6: Energy consumption of ViT with different image sizes in multiple GPUs; Figure S7: Energy consumption of BERT with different batch sizes in single GPU; Figure S8: Predicted energy consumption using a square wave for ResNet50 model; Figure S9: Predicted energy consumption using a square wave for BERT model; Figure S10: Predicted energy consumption using a square wave for MobileNet model; Table S1: Energy consumption with different optimizers; Table S2: Different learning rates on energy consumption; Table S3: Impact analysis of SR's parameters in prediction; Table S4: Impact of crossover probability on prediction accuracy; Table S5: Feature Contribution in SR Model; Table S6: Prediction Results of Model Training Energy Consumption on NVIDIA RTX 2080Ti (Mean ± Std).

Author Contributions

X.L.: Conceptualization, Methodology, Validation, Investigation, Writing—Review and Editing, and Supervision. Y.L.: Validation, Investigation, and Writing—Review and Editing. S.Z.: Validation, Investigation, and Writing—Review and Editing. X.W.: Methodology, Validation, Formal Analysis, Investigation, and Writing—Original Draft. J.H.: Conceptualization, Methodology, Validation, Investigation, Writing—Original Draft, Review and Editing, and Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank Lihao Deng from South China University of Technology for his contribution to the initial work.

Conflicts of Interest

Authors Xiao Liao, Yiqian Li and Shaofeng Zhang were employed by the company China Energy Engineering Group Guangdong Electric Power Design Institute Company Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

Adam	Adaptive Moment Estimation
AI	Artificial Intelligence
Batch_size	the number of samples processed by the model during each training iteration
CNN	Convolutional Neural Network
CPU	Central Processing Unit
Data_size	the total number of samples in the training data set
Flops	Floating-point operations
GPU	Graphics Processing Unit
Image_size	the input image size used for training the model
LSTM	Long Short-Term Memory
MLP	Multi-Layer Perceptron
Num_epoch	the number of training epochs
Num_workers	the number of CPU threads
Params	the size of the model parameters
PMC	Performance Monitoring Counter
PUE	Power Usage Effectiveness
ResNet	Residual Network
SGD	Stochastic Gradient Descent
TDP	Thermal Design Power
VGG	the networks from Visual Geometry Group

References

Xia, S.; Yang, Y.; Poon, J. Ensuring a Carbon-Neutral Future for Artificial Intelligence. Innov. Energy 2025, 2, 100071.
Setyo, Z.G.M.; Rijal, H.B.; Aqilah, N.; Abdullah, N. Energy Efficiency Measurement Method and Thermal Environment in Data Centers–A Literature Review. Energies 2025, 18, 3689.
Shehabi, A.; Smith, S.J.; Hubbard, A.; Newkirk, A.; Lei, N.; Siddik, M.A.B.; Holecek, B.; Koomey, J.; Masanet, E.; Sartor, D. 2024 United States Data Center Energy Usage Report; Lawrence Berkeley National Laboratory: Berkeley, CA, USA, 2024.
China Academy of Information and Communications Technology (CAICT). Green Computing Power Development Research Report (2024). Available online: http://www.caict.ac.cn/kxyj/qwfb/ztbg/202407/t20240711_486866.htm (accessed on 8 October 2025).
China Academy of Information and Communications Technology. Green Computing Power Development Research Report (2025). Available online: http://www.caict.ac.cn/kxyj/qwfb/ztbg/202507/t20250724_686833.htm (accessed on 8 October 2025).
Katal, A.; Dahiya, S.; Choudhury, T. Energy Efficiency in Cloud Computing Data Centers: A Survey on Software Technologies. Clust. Comput. 2023, 26, 1845–1875.
Gnibga, W.E.; Blavette, A.; Orgerie, A.-C. Renewable Energy in Data Centers: The Dilemma of Electrical Grid Dependency and Autonomy Costs. IEEE Trans. Sustain. Comput. 2024, 9, 315–328.
Tmamna, J.; Ayed, E.B.; Fourati, R.; Gogate, M.; Arslan, T.; Hussain, A.; Ayed, M.B. Pruning Deep Neural Networks for Green Energy-Efficient Models: A Survey. Cogn. Comput. 2024, 16, 2931–2952.
Gutierrez, M.; Moraga, M.Á.; Garcia, F.; Calero, C. Green IN Artificial Intelligence from a Software Perspective: State-of-the-Art and Green Decalogue. ACM Comput. Surv. 2024, 57, 1–30.
Ma, X.; Dong, M.; Zhong, L.; Deng, Z. Statistical Power Consumption Analysis and Modeling for GPU-Based Computing. In Proceedings of the ACM SOSP Workshop on Power Aware Computing and Systems (HotPower), Big Sky, MT, USA, 10 October 2009; Volume 1.
Kandiah, V.; Peverelle, S.; Khairy, M.; Pan, J.; Manjunath, A.; Rogers, T.G.; Aamodt, T.M.; Hardavellas, N. AccelWattch: A Power Modeling Framework for Modern GPUs. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, 18–22 October 2021; pp. 738–753.
Abe, Y.; Sasaki, H.; Peres, M.; Inoue, K.; Murakami, K.; Kato, S. Power and Performance Analysis of GPU-Accelerated Systems. In Proceedings of the 2012 Workshop on Power-Aware Computing and Systems (HotPower 12), Hollywood, CA, USA, 7 October 2012.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019.
Dodge, J.; Prewitt, T.; Tachet des Combes, R.; Odmark, E.; Schwartz, R.; Strubell, E.; Luccioni, A.S.; Smith, N.A.; DeCario, N.; Buchanan, W. Measuring the Carbon Intensity of AI in Cloud Instances. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 1877–1894.
Jin, C.; Bai, X.; Yang, C.; Mao, W.; Xu, X. A Review of Power Consumption Models of Servers in Data Centers. Appl. Energy 2020, 265, 114806.
Safari, A.; Sorouri, H.; Rahimi, A.; Oshnoei, A. A Systematic Review of Energy Efficiency Metrics for Optimizing Cloud Data Center Operations and Management. Electronics 2025, 14, 2214.
Wan, A.; Chang, Q.; Khalil, A.-B.; He, J. Short-Term Power Load Forecasting for Combined Heat and Power Using CNN-LSTM Enhanced by Attention Mechanism. Energy 2023, 282, 128274.
Cottier, B.; Rahman, R.; Fattorini, L.; Maslej, N.; Besiroglu, T.; Owen, D. The Rising Costs of Training Frontier AI Models. arXiv 2024, arXiv:2405.21015.
Ahmed, K.M.U.; Bollen, M.H.J.; Alvarez, M. A Review of Data Centers Energy Consumption and Reliability Modeling. IEEE Access 2021, 9, 152536–152563.
Moocheet, N.; Jaumard, B.; Thibault, P.; Eleftheriadis, L. A Sensor Predictive Model for Power Consumption Using Machine Learning. In Proceedings of the 2023 IEEE 12th International Conference on Cloud Networking (CloudNet), Hoboken, NJ, USA, 1–3 November 2023; pp. 238–246.
Shen, Z.; Liu, B.; Zhou, Q.; Liu, Z.; Xia, B.; Li, Y. Cost-Sensitive Tensor-Based Dual-Stage Attention LSTM with Feature Selection for Data Center Server Power Forecasting. ACM Trans. Intell. Syst. Technol. 2023, 14, 1–20.
Li, Y.; Hu, H.; Wen, Y.; Zhang, J. Learning-Based Power Prediction for Data Centre Operations via Deep Neural Networks. In Proceedings of the 5th International Workshop on Energy Efficient Data Centres, Waterloo, ON, Canada, 21–24 June 2016.
Shih, Y.-C.; Tamilarasan, S.; Chen, C.-S.; Zargar, O.A.; Kuan, Y.-D. Attention-Based Integrated Deep Neural Network Architecture for Predicting the Effectiveness of Data Center Power Usage. Int. J. Thermofluids 2024, 24, 100866.
Javadi, A.B.; Pong, P. A Review on Symbolic Regression in Power Systems: Methods, Applications, and Future Directions. arXiv 2025, arXiv:2504.04621. Available online: https://arxiv.org/abs/2504.04621 (accessed on 20 October 2025).
Rueda, R.; Cuéllar, M.P.; Molina-Solana, M.; Guo, Y.; Pegalajar, M.C. Generalised Regression Hypothesis Induction for Energy Consumption Forecasting. Energies 2019, 12, 1069.
Rueda, R.; Cuéllar, M.P.; Delgado, M.; Pegalajar, M.C. Preliminary Evaluation of Symbolic Regression Methods for Energy Consumption Modelling. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM), Porto, Portugal, 24–26 February 2017; Volume 1, pp. 39–49.
Eichhorn, S.; Mohapatra, A.; Goebel, C. PISR: Physics-Informed Symbolic Regression for Predicting Power System Voltage. In Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, Rotterdam, Netherlands, 17–20 June 2025; E-Energy ’25. pp. 92–107.
Trabelsi, M.; Massaoudi, M.; Chihi, I.; Sidhom, L.; Refaat, S.S.; Huang, T.; Oueslati, F.S. An Effective Hybrid Symbolic Regression–Deep Multilayer Perceptron Technique for PV Power Forecasting. Energies 2022, 15, 9008.
Dayarathna, M.; Wen, Y.; Fan, R. Data Center Energy Consumption Modeling: A Survey. IEEE Commun. Surv. Tutor. 2016, 18, 732–794.
Tripp, C.E.; Perr-Sauer, J.; Gafur, J.; Nag, A.; Purkayastha, A.; Zisman, S.; Bensen, E.A. Measuring the Energy Consumption and Efficiency of Deep Neural Networks: An Empirical Analysis and Design Recommendations. arXiv 2024, arXiv:2403.08151.
Yang, T.-J.; Chen, Y.-H.; Sze, V. Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
del Rey, S.; Martínez-Fernández, S.; Cruz, L.; Franch, X. How to use model architecture and training environment to estimate the energy consumption of DL training. arXiv 2024, arXiv:2307.05520.
He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Computer Science, Proceedings of the Computer Vision–ECCV, Amsterdam, Netherlands, 8–16 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 630–645.
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. Available online: https://arxiv.org/abs/1409.1556 (accessed on 20 October 2025).
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. Available online: https://arxiv.org/abs/2010.11929 (accessed on 20 October 2025).
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. Available online: https://arxiv.org/abs/1704.04861 (accessed on 20 October 2025).
Shin, J.; Moon, H.; Chun, C.-J.; Sim, T.; Kim, E.; Lee, S. Enhanced Data Processing and Machine Learning Techniques for Energy Consumption Forecasting. Electronics 2024, 13, 3885.
Isenko, A.; Mayer, R.; Jedele, J.; Jacobsen, H.-A. Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 1825–1839.
Perera-Lago, J.; Toscano-Duran, V.; Paluzo-Hidalgo, E.; Gonzalez-Diaz, R.; Gutiérrez-Naranjo, M.A.; Rucco, M. An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning. Open Res. Eur. 2024, 4, 101.
Alizadeh, N.; Castor, F. Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering–Software Engineering for AI, Lisbon, Portugal, 14–15 April 2024; pp. 134–139.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
del Rey, S.; Martínez-Fernández, S.; Cruz, L.; Franch, X. Do DL Models and Training Environments Have an Impact on Energy Consumption? In Proceedings of the 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Durres, Albania, 6–8 September 2023; pp. 150–158.
Yarally, T.; Cruz, L.; Feitosa, D.; Sallou, J.; van Deursen, A. Uncovering Energy-Efficient Practices in Deep Learning Training: Preliminary Steps Towards Green AI. In Proceedings of the 2023 IEEE/ACM 2nd International Conference on AI Engineering–Software Engineering for AI (CAIN), Melbourne, Australia, 15–16 May 2023; pp. 25–36.
Yu, M.; Tian, Y.; Ji, B.; Wu, C.; Rajan, H.; Liu, J. GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs. In Proceedings of the IEEE INFOCOM 2022–IEEE Conference on Computer Communications, Virtual Event, 2–5 May 2022; pp. 1569–1578.
Bridges, R.A.; Imam, N.; Mintz, T.M. Understanding GPU Power: A Survey of Profiling, Modeling, and Simulation Methods. ACM Comput. Surv. 2016, 49, 1–27.
gplearn. Available online: https://gplearn.readthedocs.io/en/stable/ (accessed on 16 August 2025).
Calflops. Available online: https://github.com/MrYxJ/calculate-flops.pytorch?tab=readme-ov-file (accessed on 16 August 2025).
You, J.; Chung, J.-W.; Chowdhury, M. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; pp. 119–139.
Gowda, S.N.; Hao, X.; Li, G.; Gowda, S.N.; Jin, X.; Sevilla-Lara, L. Watt for What: Rethinking Deep Learning’s Energy-Performance Relationship. In Computer Science, Proceedings of the Computer Vision–ECCV 2024 Workshops, Milan, Italy, 29 September–4 October 2024; Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 388–405.

Figure 1. Energy consumption prediction framework based on model training task metadata.

Figure 2. Batch size and energy consumption of training. (a) Batch size and energy consumption with single-GPU; (b) batch size and total energy consumption with four GPUs.

Figure 3. Relationship between input image size and energy consumption. (a) Input image size and energy consumption with single GPU; (b) input image size and energy consumption with multiple GPUs.

Figure 4. Relationship between layers of model and energy consumption. (a) Layers of model and energy consumption with single GPU; (b) layers of model and energy consumption with multiple GPUs.

Figure 5. Power consumption curve with single GPU.

Figure 6. Power consumption curve with four GPUs.

Figure 7. Predicted energy consumption using a square wave.
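
The square-wave approximation named in the Figure 7 caption can be sketched as follows. This is only an illustration under stated assumptions: the four predicted quantities are taken to be a valley power level, a valley duration, a peak power level, and a peak duration, each cycle is assumed to start with the valley, and the numeric values are made up rather than taken from the experiments.

```python
def square_wave(valley_w, valley_s, peak_w, peak_s, cycles, step_s=1.0):
    """Return a sampled power trace (W) that alternates valley and peak segments."""
    trace = []
    for _ in range(cycles):
        trace += [valley_w] * round(valley_s / step_s)
        trace += [peak_w] * round(peak_s / step_s)
    return trace


def energy_kwh(trace, step_s=1.0):
    """Sum power samples (W) times the step (s) and convert joules to kWh."""
    return sum(trace) * step_s / 3.6e6


# Hypothetical predictions: 100 W for 10 s, then 300 W for 50 s, for 3 cycles.
trace = square_wave(valley_w=100.0, valley_s=10.0, peak_w=300.0, peak_s=50.0, cycles=3)
total_kwh = energy_kwh(trace)
print(len(trace), total_kwh)
```

Because the trace is piecewise constant, the total energy is simply the sum of the rectangle areas, here 3 × (100 × 10 + 300 × 50) J = 48,000 J.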

Table 1. Parameter configuration for model training tasks.

Parameter	Configurations of Single GPU	Configurations of Multiple GPUs
Batch_size	[16, 32, 64, 128, 256]	[64, 128, 256, 512, 1024]
Data_size	[100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000]	[100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000]
Num_epoch	[1, 5, 10]	[1, 5, 10]
Num_workers	[9, 18]	[32, 64]
Image_size	[64 × 64, 224 × 224, 512 × 512]	[224 × 224, 512 × 512]
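
To give a sense of the size of the configuration space in Table 1, the sketch below enumerates every combination of the listed values. It assumes a full Cartesian product, which this section does not confirm, and it represents each image size by its side length (64 for 64 × 64); both choices are simplifications for illustration.

```python
from itertools import product

single_gpu = {
    "batch_size": [16, 32, 64, 128, 256],
    "data_size": list(range(100_000, 1_000_001, 100_000)),
    "num_epoch": [1, 5, 10],
    "num_workers": [9, 18],
    "image_size": [64, 224, 512],
}
# The multi-GPU column differs only in batch size, worker count, and image size.
multi_gpu = {
    **single_gpu,
    "batch_size": [64, 128, 256, 512, 1024],
    "num_workers": [32, 64],
    "image_size": [224, 512],
}


def configurations(grid):
    """Expand a dict of value lists into a list of per-run configuration dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


single_runs = configurations(single_gpu)
multi_runs = configurations(multi_gpu)
print(len(single_runs), len(multi_runs))
```

Under the full-product assumption this gives 900 single-GPU and 600 multi-GPU configurations per model; the number of runs actually collected may differ.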

Table 2. Feature selection for each predicting task (MI/Pearson correlation).

Features	Valley Energy Consumption	Valley Duration	Peak Energy Consumption	Peak Duration
Image_size	0.373/0.473 *	0.479/0.683 *	0.374/0.574 *	0.444/0.749 *
Data_size	0.374/0.524 *	0.523/0.597 *	0.040/0.249	0.523/0.750 *
Params	0.014/0.142	0.504/0.593 *	0.375/0.686 *	0.479/0.743 *
Flops	0.145/0.145	0.482/0.513 *	0.381/0.592 *	0.476/0.597 *
Batch_size	0.011/0.102	0.015/0.015	0.412/0.686 *	0.344/0.548 *
Num_workers	0.361/0.461 *	0.478/0.528 *	0.038/0.038	0.017/0.017

* indicates the selected features.
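
The correlation half of the screening in Table 2 can be illustrated with a short sketch. The data, the `select_features` helper, and the 0.45 cut-off are hypothetical, since the selection threshold is not stated here. The mutual information scores would need a separate estimator, for example `mutual_info_regression` from scikit-learn, which is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def select_features(features, target, threshold):
    """Keep features whose absolute correlation with the target reaches the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]


# Toy, made-up observations (not measurements from the paper).
features = {"data_size": [1, 2, 3, 4], "num_workers": [9, 18, 9, 18]}
target = [2.1, 3.9, 6.2, 8.0]
selected = select_features(features, target, threshold=0.45)
print(selected)
```

In this toy data the target grows almost linearly with `data_size`, so that feature is kept, while `num_workers` correlates only weakly (about 0.40) and is dropped.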

Table 3. Model training energy consumption prediction results (Mean ± Std).

Predicting Tasks	Model	MSE	MRE
Valley energy consumption	MLP	1692.462 ± 174.124	11.45% ± 1.48%
	SR	429.512 ± 20.541	2.59% ± 0.55%
	Linear Regression	2854.391 ± 261.578	18.43% ± 2.21%
	Ridge Regression	2594.307 ± 217.221	17.92% ± 1.82%
	Lasso Regression	2553.860 ± 225.193	17.51% ± 1.93%
	Gradient Boosting	3538.427 ± 155.627	24.15% ± 2.48%
Valley duration	MLP	19.942 ± 5.122	28.12% ± 2.41%
	SR	6.0152 ± 0.92	8.25% ± 1.28%
	Linear Regression	32.514 ± 4.88	41.32% ± 3.72%
	Ridge Regression	29.838 ± 4.52	38.91% ± 3.15%
	Lasso Regression	30.215 ± 4.30	39.54% ± 3.28%
	Gradient Boosting	25.443 ± 3.74	33.82% ± 2.67%
Peak energy consumption	MLP	1172.987 ± 166.21	4.62% ± 1.21%
	SR	1025.255 ± 102.51	3.71% ± 0.57%
	Linear Regression	1984.324 ± 241.88	7.91% ± 1.55%
	Ridge Regression	1852.911 ± 220.31	7.52% ± 1.38%
	Lasso Regression	1824.571 ± 207.95	7.44% ± 1.42%
	Gradient Boosting	1468.522 ± 188.22	5.93% ± 1.07%
Peak duration	MLP	1142.451 ± 127.16	15.98% ± 2.84%
	SR	231.453 ± 11.80	3.68% ± 1.02%
	Linear Regression	1624.773 ± 155.22	21.11% ± 3.15%
	Ridge Regression	1485.512 ± 143.44	19.74% ± 2.92%
	Lasso Regression	1492.301 ± 135.80	19.98% ± 3.01%
	Gradient Boosting	1294.225 ± 118.56	17.42% ± 2.48%
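
For readers who want to reproduce the two metrics reported in Table 3, a minimal sketch follows. It assumes the usual definitions: MSE as the mean of squared errors, and MRE as the mean of absolute errors divided by the absolute true value. This section does not restate the formulas, and the sample values are made up.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mre(actual, predicted):
    """Mean relative error: average of |actual - predicted| / |actual|."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical measured and predicted values in arbitrary units.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mse(actual, predicted), mre(actual, predicted))
```

MSE depends on the scale of the target, so its values are not comparable across the four prediction tasks, whereas MRE is unitless and can be read as a percentage (0.05 corresponds to 5%).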

Table 4. Predicting energy consumption from different trained models.

Trained Models (Source)	Predicted Models (Target)	Valley Energy MSE/MRE	Valley Duration MSE/MRE	Peak Energy MSE/MRE	Peak Duration MSE/MRE
ResNet50	ResNet50	88.47/2.63%	6.09/8.31%	1038.42/3.76%	234.17/3.74%
ResNet50	ResNet18	260.37/7.80%	18.47/13.28%	2976.15/10.74%	685.33/9.12%
ResNet50	ResNet34	255.18/7.65%	18.66/13.35%	2891.77/10.21%	662.48/8.71%
ResNet50	ResNet101	258.84/7.70%	17.73/12.65%	3042.63/10.58%	703.91/9.03%
ResNet50	ViT	339.12/9.88%	58.74/20.82%	5528.74/21.95%	1294.61/18.72%
ResNet50	BERT	323.47/9.66%	51.66/19.35%	5187.39/20.84%	1231.54/17.96%
ViT	ViT	198.64/5.21%	24.73/6.84%	2957.18/6.92%	736.05/6.47%
ViT	BERT	248.32/7.45%	38.91/12.87%	3894.17/13.68%	897.63/13.08%

Table 5. Estimated energy and carbon consumption for one day.

Scenario	Energy per Day (kWh)	CO₂ per Day (kg, 0.5 kg/kWh)
One server with 4 GPUs	≈27.2	≈13.6
Data center with 1000 servers	≈27,200	≈13,600
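
The arithmetic behind Table 5 can be checked with a few lines. The emission factor of 0.5 kg CO₂ per kWh and the per-server figure of about 27.2 kWh per day come from the table; the function name and its parameters are illustrative.

```python
EMISSION_FACTOR = 0.5  # kg CO2 per kWh, as given in the Table 5 header


def daily_footprint(energy_kwh_per_server, servers=1, factor=EMISSION_FACTOR):
    """Return (energy in kWh per day, CO2 in kg per day) for a group of servers."""
    energy = energy_kwh_per_server * servers
    return energy, energy * factor


one_server = daily_footprint(27.2)
data_center = daily_footprint(27.2, servers=1000)
print(one_server, data_center)
```

Both rows scale linearly: 27.2 kWh × 0.5 kg/kWh ≈ 13.6 kg for one server, and 1000 servers give about 27,200 kWh and 13,600 kg per day. The estimate assumes every server runs the same workload all day.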

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

