1. Introduction
As digital transformation accelerates, technologies such as cloud computing, big data, and artificial intelligence (AI) are becoming increasingly widespread. Data centers, which serve as central hubs for data storage, processing, and analysis, are expanding in both size and complexity, leading to a significant rise in energy consumption. The proportion of global energy consumed by data centers relative to total worldwide energy consumption has been increasing annually, making them a major energy consumer that cannot be overlooked. Recently, AI has made remarkable advances in areas such as natural language processing and image recognition, with rapid progress in developing large-scale models. Training and operating these AI models require substantial computing power, further increasing energy demands in data centers [
1,
2]. According to the 2024 US Data Center Energy Use Report, US data centers consumed 176 TWh in 2023, representing 4.4% of the country’s total electricity use. This consumption is projected to increase to between 325 and 580 TWh by 2028 [
3]. In China, nationwide data center energy consumption reached 150 TWh in 2023 [
4] and increased to 166 TWh in 2024, representing 1.68% of the country’s total electricity use [
5]. This high energy usage not only raises operational costs but also places significant strain on the global climate and environment. Therefore, reducing energy consumption in data centers and achieving sustainable, low-carbon operations have become urgent priorities worldwide.
To address these challenges, researchers have pursued several directions: optimizing data center operation to reduce energy consumption [6], making greater and better use of renewable energy sources [7], and designing more energy-efficient neural networks [8]. These efforts aim at green, low-carbon data center operation and at the vision of green AI, which takes the energy consumption of AI itself into account [9]. Across these directions, understanding and predicting the power consumption of AI tasks is a key issue.
GPU servers are the most energy-intensive IT equipment in data centers, particularly when handling AI workloads. These servers comprise various electronic components, including CPUs, GPUs, network interfaces, memory modules, and cooling fans. Each component affects the system’s overall performance and energy consumption. Among them, GPU power consumption dominates the total power usage of GPU servers, especially during highly parallel tasks such as deep neural network training. Moreover, the dynamic power consumption of GPUs is significantly higher than that of CPUs and other components, making GPUs the primary contributors to server power consumption [
10,
11,
12]. Specifically, GPU power consumption accounts for up to 74% of the total energy usage of the server during the training of the Bidirectional Encoder Representations from Transformers (BERT) base model [
13] on a single NVIDIA TITAN X GPU (12 GB) [
14]. Therefore, predicting the GPU energy consumption of these servers can serve as an effective method for estimating the total energy consumption during neural network model training. Predicting the energy consumption of data center servers primarily involves three approaches: estimating server energy consumption based on energy consumption models [
15,
16]; forecasting server energy consumption sequences using historical energy consumption data [
17]; and considering the maximum power consumption or Thermal Design Power (TDP) of hardware components such as Graphics Processing Units (GPUs) [
18].
First, regarding energy consumption modeling of servers, the operational status of key resources (such as Central Processing Unit (CPU) utilization) correlates with overall system energy consumption during the execution of computing tasks, enabling real-time estimation of changes in system energy use. Energy consumption modeling involves establishing a mathematical model where input variables include state indicators of various system components, such as CPU utilization, memory throughput, network bandwidth, and disk read/write bytes. The model’s output is the system’s energy consumption. This approach is often referred to as the Performance Monitoring Counter (PMC)-based energy consumption modeling method, which obtains the operational status of the CPU, memory, disk, and other components through hardware-provided PMC performance counters [
19], and estimates task energy consumption using regression techniques [
20]. Because these models correlate component status indicators with energy consumption, they can estimate system energy use accurately in real time and are well suited to energy monitoring and resource scheduling during task execution. However, they depend on indicators observed while a task is running, so they are limited in their ability to forecast the energy consumption of a training task before it executes.
Second, regarding the prediction of server energy consumption sequences based on historical time-series data, recent research has primarily employed machine learning and deep learning methods to forecast data center computing power load and energy consumption. For instance, Ref. [
21] proposed a cost-sensitive tensor-quantized two-stage attention LSTM model that significantly improves the accuracy of server power prediction through causality-driven feature selection and tensor-quantized temporal embeddings. In [
22], a prediction framework for overall data center power consumption was developed using deep neural networks, which first apply denoising and multi-layer autoencoder preprocessing, then model short-term and medium- to long-term power demand via multi-layer feedforward networks, outperforming traditional methods such as linear regression and support vector regression. Furthermore, Ref. [
23] integrated convolutional layers, LSTM, and attention mechanisms into a deep architecture for predicting Power Usage Effectiveness (PUE), where convolution captures local temporal patterns, LSTM layers learn long-term dependencies, and the attention mechanism adaptively reweights features to improve energy-efficiency prediction. These time series methods typically depend on extensive historical energy data to predict energy usage in data centers. Their accuracy is affected by shifts in the distribution of the data being forecasted and the model’s capacity to generalize.
Symbolic regression (SR) has emerged as an effective modeling paradigm for deriving simple and interpretable mathematical expressions directly from observed data. SR has been successfully applied to modeling nonlinear behavior in power systems and building energy consumption [
24,
25], as well as more broadly to tasks such as energy prediction, power modeling, and physics-constrained modeling [
26,
27]. Recent studies demonstrate that hybrid SR methods maintain high accuracy under small-sample conditions in photovoltaic power prediction while producing formulas with clear physical meaning [
28]. These studies illustrate that SR provides a robust and interpretable alternative to purely black-box time-series models in energy-related applications.
Third, the maximum power consumption, or TDP, of hardware components (such as GPUs) is also used to estimate the energy consumption of data center servers when designing energy and temperature management systems. However, for specific AI tasks, this energy consumption estimate is often inaccurate because energy usage dynamically varies with the servers’ operational status [
29].
Moreover, metadata from AI training has been utilized to measure energy consumption and develop energy-efficient AI models. To reduce energy consumption without compromising AI model performance, researchers are actively investigating and developing energy-efficient AI systems that optimize both algorithms and hardware. In these efforts, metadata from model training tasks—such as model parameters and batch sizes—has been used to analyze its influence on GPU energy consumption during training. For example, Tripp et al. [
30] analyzed energy consumption data from training a Multi-Layer Perceptron (MLP) model to explore how dataset size and network architecture impact energy use. They highlighted the significance of caching effects and proposed an energy-saving approach that integrates network design, algorithm development, and hardware optimization. Similarly, Yang et al. [
31] developed an energy estimation framework to guide pruning by measuring the energy consumption of matrix multiplication operations in convolutional neural networks (CNNs), facilitating energy-focused model compression. These studies provide valuable insights for designing energy-efficient AI algorithms and hardware. Such techniques assist researchers in creating more energy-efficient neural networks, thereby reducing the overall energy required for AI training and inference in specific applications. To date, only a few studies have examined power consumption patterns during deep learning training in order to predict power usage from metadata, for example by estimating the average energy consumption during training and the average energy per epoch with regression models [
32].
However, detailed analyses of the energy consumption characteristics of neural network training tasks on GPUs in data centers remain scarce. Moreover, studies that predict the GPU energy consumption of AI training in data centers from task metadata are limited, particularly those that employ symbolic regression to derive analytical expressions explicitly linking metadata to energy consumption.
In this article, we focus on predicting the energy consumption of artificial intelligence tasks involved in training neural network models within data centers. By analyzing the relationship between the configuration of model training tasks and their energy consumption, we propose a GPU energy consumption prediction scheme based on task metadata such as the number of training samples, model parameters, batch size, and epochs. This approach differs from traditional time series forecasting of data center energy consumption, which relies on historical time series data of energy consumption. We employ genetic programming-based symbolic regression to predict GPU energy consumption, enabling the estimation of server energy usage prior to executing deep neural network training tasks. To create a training environment and collect GPU energy consumption data, we utilize models from the Residual Network (ResNet) family [
33], the Visual Geometry Group (VGG) family [
34], the Vision Transformer (ViT) family [
35], the MobileNet [
36], and the BERT family [
13], all trained on NVIDIA A800 GPUs. The predictions of the energy consumption curve characteristics are accurate, with mean relative errors (MRE) of 2.67% for valley energy, 8.42% for valley duration, 3.64% for peak power, and 3.64% for peak duration.
Our contributions are summarized as follows:
- (i) We analyze the relationship between the configuration of model training tasks and their energy consumption on GPUs, identifying consistent patterns that enable energy prediction across various settings.
- (ii) We use metadata from AI training tasks to predict GPU energy consumption, enabling accurate estimation of server energy usage before task execution without requiring extensive historical energy data. Unlike conventional time-series energy prediction, this approach supports flexible, time-independent optimization of energy-efficient scheduling for deep learning model training in data centers, and it remains applicable when too little historical energy data is available to build a reliable and generalizable forecasting model.
- (iii) We propose a genetic programming-based symbolic regression method to develop analytical formulas that explicitly link metadata to energy consumption. This approach delivers accurate predictions even with limited training data and produces interpretable equations that can be directly applied to task scheduling or energy management in data centers.
3. Energy Consumption Prediction for Training Tasks
According to the analysis of server energy consumption characteristics in deep neural network training tasks, server energy consumption exhibits periodicity. The energy consumption sequence curve shows peaks and valleys influenced by the configuration parameters of the training task. In a given data center server hardware environment, server energy consumption is determined by the model architecture, the volume of training data, and the configuration of the training tasks.
In this article, we refer to the configuration information, model structure details, and training data characteristics of AI model training tasks collectively as training task metadata. This metadata specifically includes model architecture, model parameter size, training batch size, total number of samples, number of training epochs, number of computing nodes used, number of AI GPUs per node, data size of individual samples, and the number of threads.
We define the energy consumption prediction problem as follows: given a data center server hardware environment and a training task, the goal is to predict the energy consumption of that training task from its metadata, using a model built on metadata and energy data collected from historical or experimental training tasks.
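To make the inputs of this prediction problem concrete, the training task metadata listed above can be held in a small record type. The sketch below is illustrative only: the class name `TaskMetadata`, its field names, and the example values are assumptions chosen to mirror the list in the text, not identifiers from the authors' code.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TaskMetadata:
    """Hypothetical container for the metadata of one training task."""
    architecture: str        # e.g. "ResNet50"
    params_millions: float   # model parameter size
    batch_size: int
    num_samples: int         # total number of training samples
    num_epochs: int
    num_nodes: int           # computing nodes used
    gpus_per_node: int
    sample_size: int         # e.g. image side length in pixels
    num_workers: int         # data-loading threads

    def numeric_features(self) -> list:
        """Return the numeric fields, in declaration order, as a feature vector."""
        return [float(v) for v in asdict(self).values() if not isinstance(v, str)]


# Example values are made up for illustration.
task = TaskMetadata("ResNet50", 25.6, 128, 200_000, 10, 1, 4, 224, 32)
features = task.numeric_features()
```

Keeping the architecture as a separate string field fits the scheme described here, in which one prediction model is built per model architecture and only the numeric fields are passed to the regressor.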
To solve this problem, we propose a method for predicting the energy consumption of model training tasks on data center servers that exploits the regular patterns observed in training task energy usage. As illustrated in Figure 1, the process begins with the collection of metadata and energy consumption data from deep neural network training tasks. Using these data, a prediction model is developed to capture the time series characteristics of energy consumption for each type of model architecture. Finally, the energy consumption of a target training task is predicted from its metadata.
As shown in Figure 1, the analysis of energy consumption characteristics first examines the energy consumption time series of each training task to determine whether it is periodic. For periodic tasks, the curve within each epoch is then examined to determine whether it follows a clear pattern and what shape it takes. The model structures whose training tasks exhibit linear or square-wave energy consumption curves are collected, and tasks with the same model structure are grouped into one category, because models with different structures (such as ResNet and VGG) consume different amounts of energy during training [45]. An energy consumption prediction model is then constructed for each category from the metadata and energy data of tasks with that model structure. In this way, the server energy consumption prediction problem is reduced to three machine learning steps: constructing the energy consumption dataset of training tasks, building the energy consumption prediction model, and predicting the energy consumption of new training tasks.
By training models with different parameter combinations under the same model architecture and collecting their task metadata and energy consumption, we obtain a metadata-energy dataset for that architecture, on which the prediction model for the architecture is then trained. For an energy consumption curve with a square-wave shape, for example, we predict the peak, the valley, and their corresponding durations. The peak corresponds to the model training computation phase, while the valley corresponds to phases without training computation, such as saving the model between epochs. In this article, we use a concise but effective method, symbolic regression, to predict these four variables from the training task metadata.
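Once the four variables of a square-wave curve are known, a task-level energy estimate follows by simple arithmetic. The sketch below assumes that each epoch consists of exactly one peak phase followed by one valley phase and that the peak and valley values are average powers in watts; the function names and the numbers are hypothetical.

```python
def epoch_energy_joules(peak_w, peak_s, valley_w, valley_s):
    """Energy of one square-wave period: peak phase plus valley phase."""
    return peak_w * peak_s + valley_w * valley_s


def task_energy_kwh(peak_w, peak_s, valley_w, valley_s, num_epochs):
    """Total energy of a task made of num_epochs identical periods."""
    joules = num_epochs * epoch_energy_joules(peak_w, peak_s, valley_w, valley_s)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical values for one GPU: 300 W for 120 s, then 80 W for 15 s, 10 epochs.
energy = task_energy_kwh(300.0, 120.0, 80.0, 15.0, num_epochs=10)
```

For a multi-GPU task, the same calculation would be repeated per GPU and summed, under the additional assumption that the GPUs follow the same curve.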
Because the dataset is small, we apply symbolic regression based on genetic programming (GP) for prediction. First, we use Pearson correlation for feature selection to remove irrelevant features and improve the search efficiency of the symbolic regression model. Symbolic regression is a regression analysis method that combines machine learning and symbolic computation: through heuristic search, it iteratively optimizes candidates in the space of symbolic expressions and derives mathematical expressions that describe the relationship between inputs and outputs. The evolutionary mechanism of genetic programming enables symbolic regression to handle complex nonlinear relationships and high-dimensional data. By randomly generating an initial population and selecting the fittest individuals for reproduction in each generation, genetic programming explores the solution space effectively and is less prone to getting stuck in local optima.
Specifically, the gplearn library [
46] is employed to perform symbolic regression. Gplearn extends the scikit-learn machine learning library to implement genetic programming (GP) for symbolic regression. By simulating natural selection and genetic operations, mathematical expressions are automatically generated and optimized to best fit the data.
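The evolutionary loop described above (random initial population, fitness-based selection, reproduction) can be illustrated with a toy sketch that uses only the Python standard library. It is not gplearn and does not reproduce the authors' configuration: the function set, the mutation-only reproduction, the population size, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

# Function set: name -> (arity, callable). The protected log avoids domain errors.
FUNCS = {
    "add": (2, lambda a, b: a + b),
    "mul": (2, lambda a, b: a * b),
    "log": (1, lambda a: math.log(abs(a)) if abs(a) > 1e-9 else 0.0),
}


def random_tree(rng, n_features, depth):
    """Grow a random expression tree; leaves are features or constants."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("c", rng.uniform(-5.0, 5.0))
    name = rng.choice(sorted(FUNCS))
    arity = FUNCS[name][0]
    return (name,) + tuple(random_tree(rng, n_features, depth - 1) for _ in range(arity))


def evaluate(tree, row):
    """Evaluate an expression tree on one row of feature values."""
    kind = tree[0]
    if kind == "x":
        return row[tree[1]]
    if kind == "c":
        return tree[1]
    return FUNCS[kind][1](*(evaluate(child, row) for child in tree[1:]))


def mse(tree, X, y):
    """Mean squared error; non-finite results count as the worst possible fitness."""
    total = 0.0
    for row, target in zip(X, y):
        diff = evaluate(tree, row) - target
        total += diff * diff
    total /= len(y)
    return total if math.isfinite(total) else math.inf


def mutate(rng, tree, n_features):
    """Subtree mutation: replace one randomly chosen node with a fresh subtree."""
    if tree[0] in ("x", "c") or rng.random() < 0.3:
        return random_tree(rng, n_features, 2)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(rng, tree[i], n_features),) + tree[i + 1:]


def evolve(X, y, pop_size=100, generations=20, seed=0):
    """Evolve expression trees and return the fittest one found."""
    rng = random.Random(seed)
    n_features = len(X[0])
    pop = [random_tree(rng, n_features, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: mse(t, X, y))
        parents = pop[: pop_size // 4]  # selection: keep the fittest quarter
        children = [mutate(rng, rng.choice(parents), n_features)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=lambda t: mse(t, X, y))


# Synthetic data: one feature (a batch size) and a logarithmic target.
X = [[8], [16], [32], [64], [128], [256]]
y = [10.0 * math.log(row[0]) + 5.0 for row in X]
best = evolve(X, y)
```

Because the fittest quarter is carried over unchanged, the best error never increases from one generation to the next. A production library such as gplearn adds crossover, several mutation types, and parsimony control on top of this basic loop.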
4. Experiments
We first conduct experiments analyzing the energy consumption characteristics and predicting energy usage on GPU servers to validate both the energy consumption characteristics during model training and the proposed energy consumption prediction scheme.
4.1. Experimental Environment and Setup
In this section, we conduct energy consumption experiments using the ResNet, VGG, ViT, and BERT model families as representative neural network architectures. The ResNet model [33] is a family of deep convolutional neural networks designed to address the challenges of training very deep networks. Its skip (residual) connections allow the signal to bypass one or more layers, which mitigates the vanishing gradient problem and makes significantly deeper networks trainable. The VGG model [
34], developed by the Visual Geometry Group at the University of Oxford, is another family of deep convolutional neural networks specifically designed for image classification and object detection tasks. The ViT model [
35] is a pure transformer architecture that treats an image as a sequence of fixed-size 16 × 16 patches, each linearly embedded into a vector similar to a word token in natural language processing. The BERT model [
13] is a deeply bidirectional, unsupervised language representation model that pre-trains a single Transformer encoder on large corpora of unlabeled text using two self-supervised learning objectives: masked language modeling and next sentence prediction.
The GPU server used for model training, based on NVIDIA’s HGX platform, is equipped with 2 × 18-core Intel Xeon Gold processors, 16 × 64 GB DDR4 memory modules, one 480 GB SATA SSD, two 3.84 TB NVMe SSDs, and four NVIDIA Tesla A800 GPUs. Each A800 GPU features 80 GB of memory and a 5120-bit memory interface. The A800 GPUs used in our experiments have a TDP of 400 watts each and are equipped with liquid cooling systems. The data center environment is maintained at approximately 22 degrees Celsius with a relative humidity of about 50%. The system runs CUDA version 12.1 on Rocky Linux 8.6.
The specific models used for training are ResNet50, ResNet101, VGG16, ViT, and BERT. The training dataset is ImageNet, and the optimizer employed is stochastic gradient descent (SGD). We trained the models in two environments: a single GPU and multiple GPUs. Multi-GPU training is performed on four GPUs using data parallel training.
For ImageNet data, we employ a preprocessing and data augmentation pipeline similar to that of standard ResNet training. Specifically, RandomResizedCrop(s) is first applied to the input image to randomly crop it and rescale it to a fixed resolution of s × s, where s ∈ {224, 512, 1024} is varied to investigate the effect of input resolution on performance and energy consumption. Subsequently, RandomHorizontalFlip is applied to improve the model’s robustness to changes in pose and perspective. Next, ToTensor converts the image to a tensor and linearly scales the pixel values from [0, 255] to [0, 1]. Finally, we normalize the images with the channel-wise mean and standard deviation of the official ImageNet statistics, that is, we compute (x − μ_c)/σ_c for each channel c, where μ = (0.485, 0.456, 0.406) and σ = (0.229, 0.224, 0.225). Data loading uses ImageFolder to read images from the ImageNet training set directory and sequentially samples a fixed number (dataset_size) of images to form the training subset, depending on the experimental settings.
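The last two steps of this pipeline, scaling to [0, 1] and channel-wise normalization, amount to the arithmetic shown below. This is a minimal sketch on a single RGB pixel in plain Python; the helper name `normalize_pixel` is hypothetical, and the cropping and flipping steps are omitted.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale 8-bit RGB to [0, 1], then apply (x - mean_c) / std_c per channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
```

After normalization, a white pixel maps to positive values and a black pixel to negative values in every channel, which shows that the network input is roughly centered around zero.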
During the model training process, we used custom shell scripts with the NVIDIA tool NVIDIA-SMI to collect GPU power consumption and utilization data every 100 milliseconds, recording the timestamp along with the corresponding task metadata.
4.2. Dataset Construction
According to the experimental setup described above, both single-GPU and multi-GPU (four GPUs) configurations were used to train VGG and ResNet models using the PyTorch framework. Model features were recorded, and information was collected, including task metadata such as batch size, runtime, and power consumption during model training. Separate datasets were constructed for single-GPU and multi-GPU scenarios. Different parameters for the model training tasks were set, as shown in
Table 1, to obtain the energy consumption of the training tasks under various parameter configurations.
Among the parameters listed in the table, Batch_size refers to the number of samples processed by the model in each training iteration. Data_size represents the total number of samples in the training dataset. Num_epoch indicates the number of training epochs, that is, complete passes through the dataset. Num_workers specifies the number of CPU threads used, which affects the efficiency of CPU-side operations such as CPU-GPU interaction and data storage. Image_size denotes the input image size used for training the models.
We record the parameters listed in Table 1 during model training and collect GPU power consumption every 100 milliseconds. Additionally, we calculate the model’s intrinsic properties, including the size of the model parameters (Params) and the FLOPs. FLOPs denote the number of floating-point operations and serve as a metric of the computational complexity of an algorithm or model. We use CalFlops [47] to calculate the FLOPs of each model. CalFlops is designed to compute the theoretical number of FLOPs, multiply-add operations, and parameters for various neural networks.
GPU power traces were collected using NVIDIA-SMI at a fixed sampling interval of 100 ms (10 Hz). To assess sampling aliasing, we compared power measurements taken at 100 ms (10 Hz) and 1 ms (1000 Hz) intervals. The mean difference between the 1 ms and 100 ms intervals was less than 1%, with a maximum difference below 3%. Based on this, we selected the 100 ms sampling interval for data collection. Additionally, since all four GPUs are located on the same host and share the OS-level time source, their power traces are synchronized with a sampling resolution of 100 ms.
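A uniformly sampled power trace can be turned into energy by numerical integration, and the effect of a coarser sampling interval can be checked by downsampling. The sketch below uses a synthetic trace and hypothetical helper names; it illustrates the kind of comparison described above rather than the authors' measurement scripts.

```python
def energy_joules(power_w, interval_s=0.1):
    """Trapezoidal integration of a uniformly sampled power trace."""
    if len(power_w) < 2:
        return 0.0
    return sum((a + b) / 2.0 * interval_s for a, b in zip(power_w, power_w[1:]))


def downsample(power_w, factor):
    """Keep every factor-th sample, e.g. factor=100 maps 1 ms to 100 ms."""
    return power_w[::factor]


def mean_relative_difference(fine, coarse):
    """Relative difference between the mean powers of two traces."""
    mean_fine = sum(fine) / len(fine)
    mean_coarse = sum(coarse) / len(coarse)
    return abs(mean_fine - mean_coarse) / mean_fine


# Synthetic 2 s trace at 1 kHz: 1 s at 300 W followed by 1 s at 80 W.
trace_1ms = [300.0] * 1000 + [80.0] * 1000
trace_100ms = downsample(trace_1ms, 100)
difference = mean_relative_difference(trace_1ms, trace_100ms)
energy_100ms = energy_joules(trace_100ms, interval_s=0.1)
```

On a trace whose phases last much longer than the sampling interval, as in this synthetic example, the coarse and fine means coincide; short spikes between samples are what a real comparison would need to quantify.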
To minimize noise in the GPU power data, we conducted at least three independent replicates and averaged the data point by point to obtain a stable average power trajectory. Additionally, within a single run, we applied a 3–5 point moving average to the raw data to smooth out power fluctuations occurring on the millisecond scale, thereby reducing noise caused by instantaneous spikes.
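The two noise-reduction steps, point-by-point averaging over replicates and a short moving average, can be sketched as follows. The function names, the centered window, and the edge handling are assumptions; the text does not specify these details or the order in which the two steps are applied.

```python
def pointwise_mean(runs):
    """Average several traces sample by sample, truncated to the shortest run."""
    n = min(len(r) for r in runs)
    return [sum(r[i] for r in runs) / len(runs) for i in range(n)]


def moving_average(trace, window=3):
    """Centered moving average; the window shrinks at the edges of the trace."""
    half = window // 2
    out = []
    for i in range(len(trace)):
        lo, hi = max(0, i - half), min(len(trace), i + half + 1)
        out.append(sum(trace[lo:hi]) / (hi - lo))
    return out


# Three hypothetical replicates; the first contains a one-sample spike.
runs = [[100.0, 400.0, 100.0], [100.0, 100.0, 100.0], [100.0, 100.0, 100.0]]
averaged = pointwise_mean(runs)
smoothed = moving_average(averaged, window=3)
```

In this example the spike of 400 W is first reduced to 200 W by averaging across replicates and then spread over its neighbors by the moving average.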
The collected energy consumption data for training tasks is divided into two categories: single-GPU energy consumption data, which includes information from 241 differently configured model training tasks, and multi-GPU energy consumption data, which comprises data from 218 models with various configurations for four-GPU training tasks.
4.3. Analysis of Energy Consumption Characteristics
This section analyzes the impact of varying parameters, such as batch size, total data size, and the number of model layers, on energy consumption during model training.
In a single-GPU training environment, the impact of different batch sizes on energy consumption during ResNet model training—under the same parameter settings (Model = ResNet50, Num_workers = 18, Image_size = 224 × 224, Dataset size = 100,000)—is illustrated in
Figure 2a. We compared exponential, linear, logarithmic, power, and moving average functions to model energy consumption in the experiments presented in this paper, including subsequent experiments, to identify the best-performing function. The logarithmic function was found to be the best fit and was selected to model energy consumption during single-GPU training because GPU power usage increases at a progressively slower rate as batch size grows. The fitted logarithmic function is expressed as y = 15.692 ln(x) + 202.35. The goodness of fit, indicated by a coefficient of determination (R²) of 0.9455, demonstrates that the logarithmic model accurately captures the positive correlation between power consumption and batch size in single-GPU training.
In a multi-GPU setup, the impact of different batch sizes on energy consumption during ResNet model training—using consistent parameter settings (Model = ResNet50, Num_workers = 32, Image_size = 224 × 224, Dataset size = 100,000)—is illustrated in
Figure 2b. The data were modeled using a logarithmic function described by the equation y = 171.95 ln(x) − 291. This model demonstrated an excellent fit in Figure 2b, with an R² value of 0.9906.
As illustrated in
Figure 2, with identical parameter settings, the energy consumption for training tasks using the ResNet model is positively correlated with the batch size. Moreover, this relationship follows a logarithmic function: initially, the curve rises rapidly, almost linearly, then gradually levels off as the input values increase.
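A logarithmic fit of the form y = a ln(x) + b reduces to ordinary least squares on ln(x). The sketch below recovers the coefficients and R² from points generated with the single-GPU curve reported above; the batch sizes and the data points are synthetic assumptions, not the authors' measurements.

```python
import math


def fit_log(xs, ys):
    """Least-squares fit of y = a*ln(x) + b; returns (a, b, r_squared)."""
    u = [math.log(x) for x in xs]
    n = len(u)
    mu, my = sum(u) / n, sum(ys) / n
    sxx = sum((ui - mu) ** 2 for ui in u)
    sxy = sum((ui - mu) * (yi - my) for ui, yi in zip(u, ys))
    a = sxy / sxx
    b = my - a * mu
    ss_res = sum((yi - (a * ui + b)) ** 2 for ui, yi in zip(u, ys))
    ss_tot = sum((yi - my) ** 2 for yi in ys)
    return a, b, 1.0 - ss_res / ss_tot


# Hypothetical batch sizes; power values are generated from the reported curve.
batch_sizes = [8, 16, 32, 64, 128, 256]
power = [15.692 * math.log(bs) + 202.35 for bs in batch_sizes]
a, b, r2 = fit_log(batch_sizes, power)
```

With noise-free synthetic points the fit returns the generating coefficients and an R² of essentially 1; on measured data, the same routine would yield values such as those reported for Figure 2.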
In a single-GPU training environment, the size of the input data affects energy consumption. Three different image sizes—64 × 64, 224 × 224, and 512 × 512—were selected. Under the same model parameter settings (Model = ResNet50, Num_worker = 18, Batch_size = 64, Dataset size = 100,000), the energy consumption of the training task varied with the input data size, as shown in
Figure 3a. The data were modeled using a logarithmic function described by the equation y = 25.351 ln(x) + 131.94. This model demonstrated an excellent fit with an R² value of 0.9894.
In the case of multiple GPUs, varying input image sizes under the model parameter settings (Model = ResNet50, Num_worker = 32, Batch_size = 128, Dataset size = 200,000) affect energy consumption during model training, as shown in
Figure 3b. The data were modeled using a logarithmic function described by the equation y = 54.05 ln(x) + 151.08. This model demonstrated an excellent fit with an R² value of 0.9447.
It can be observed that, under the same parameter settings, the energy consumption of the training task is positively correlated with the size of the input image.
In the case of a single GPU, the number of model layers significantly affected energy consumption during training under the same configuration (Num_worker = 9, Batch_size = 32, Image_size = 224 × 224, Dataset size = 100,000), as shown in Figure 4a. The data were modeled using a logarithmic function described by the equation y = 23.14 ln(x) + 168.12. This model demonstrated an excellent fit with an R² value of 0.9884.
In the case of multiple GPUs, the impact of different model layer values on energy consumption during model training under the same configuration (Num_worker = 32, Batch_size = 64, Image_size = 224 × 224, Dataset size = 100,000) is shown in
Figure 4b. We chose a logarithmic function to model energy consumption. The fitted logarithmic function is expressed as y = 105.38 ln(x) + 40.699. This model demonstrated a relatively good fit with an R² value of 0.9503. Compared to single-GPU training, multi-GPU training offers greater capacity to process complex models with more layers, but it also consumes more energy.
In single-GPU training, the model used is ResNet101 with an image size of 224 × 224. The batch size is set to 256, Num_worker is 18, and the dataset size is 200,000. The power consumption curve is shown in
Figure 5.
In multi-GPU training, the model used is ResNet50, trained on four GPUs. The image size is 224 × 224, the batch size is set to 128, Num_worker is 64, and the dataset size is 200,000. The power consumption curves for each GPU are shown in
Figure 6.
As shown in Figure 5 and Figure 6, the model training process exhibits periodicity across training epochs. During single-GPU training, the power consumption curve is essentially a horizontal straight line, whereas during multi-GPU training it resembles a square wave with distinct peaks and valleys. In single-GPU training, model storage and other operations between epochs involve only extremely brief data interactions between the GPU and CPU, so no valleys appear between epochs. In multi-GPU training, however, model storage and synchronization between epochs take more time, and the power consumption during these operations is lower than during the model computation stage, resulting in valleys. These experimental results align with the previous analysis.
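The peaks and valleys of such a square-wave trace can be extracted by thresholding the power samples and grouping consecutive samples on the same side of the threshold. The sketch below is one simple way to do this; the threshold value, the function name, and the synthetic trace are assumptions, not the authors' procedure.

```python
def segment_square_wave(trace, threshold_w, interval_s=0.1):
    """Split a trace into (label, duration_s, mean_power_w) runs by thresholding."""
    segments = []
    start = 0
    for i in range(1, len(trace) + 1):
        # A segment ends at the end of the trace or when the side of the threshold changes.
        ended = i == len(trace) or (trace[i] >= threshold_w) != (trace[start] >= threshold_w)
        if ended:
            chunk = trace[start:i]
            label = "peak" if trace[start] >= threshold_w else "valley"
            segments.append((label, len(chunk) * interval_s, sum(chunk) / len(chunk)))
            start = i
    return segments


# Synthetic multi-GPU-like trace sampled every 100 ms: peak, short valley, peak.
trace = [300.0] * 5 + [80.0] * 2 + [300.0] * 5
segments = segment_square_wave(trace, threshold_w=190.0)
labels = [label for label, _, _ in segments]
```

A flat single-GPU trace passed through the same function yields a single peak segment, which matches the observation that no valleys appear between epochs in that case.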
Furthermore, we conducted experiments to examine the effects of training task configuration on energy consumption in additional models, specifically ResNet34, VGG16, BERT, and ViT. The results, presented in
Section S1 of the Supplementary Materials, are consistent with the findings reported above for ResNet50.
There are additional parameters in training tasks, such as optimizer type, learning rate, and regularization strategy, which affect the number of steps required for model convergence and have a slight impact on energy consumption during each epoch cycle [
48,
49]. Furthermore, we conducted experiments to evaluate the effects of various optimizers—including Adaptive Moment Estimation (Adam), Adam with Weight Decay (AdamW), Stochastic Gradient Descent (SGD), and Root Mean Square Propagation (RMSprop)—as well as different learning rates (0.1, 0.01, 0.001, and 0.0001) on energy consumption. The findings are presented in
Tables S1 and S2 of the Supplementary Materials, showing a minor effect on energy consumption throughout each epoch cycle. However, this effect is less pronounced than those observed for batch size, data size, and the number of model layers.
4.4. Energy Consumption Prediction Experiment and Results
In the energy consumption prediction experiment, the prediction model is first trained on the energy consumption dataset collected from multi-GPU training tasks. The task metadata are then used as input features to forecast the power consumption curve during training, namely the average power consumption at peaks and valleys and their corresponding durations.
4.4.1. Experimental Setup for Predicting Energy Consumption
The dataset includes 459 model training records, comprising 241 single-GPU and 218 multi-GPU entries. It is divided into training, validation, and testing sets in an 8:1:1 ratio. The six features of the dataset include image size, total number of samples, parameter size, model computational complexity, training batch size, and CPU thread count. These features are normalized using the StandardScaler method.
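The split and the standardization can be sketched in plain Python as follows. Fitting the scaler on the training portion only is an assumption made here to avoid leakage; the text does not state how the scaler was fitted, and the records below are synthetic placeholders.

```python
import random


def split_811(records, seed=0):
    """Shuffle and split records into train/validation/test at roughly 8:1:1."""
    items = list(records)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def fit_standard_scaler(rows):
    """Per-feature mean and population standard deviation (1.0 if constant)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    stds = [(sum((v - m) ** 2 for v in col) / n) ** 0.5 or 1.0
            for col, m in zip(zip(*rows), means)]
    return means, stds


def transform(rows, means, stds):
    """Standardize each feature: (value - mean) / std."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# 459 synthetic two-feature records standing in for the real dataset.
records = [[float(i), float(i % 7)] for i in range(459)]
train, val, test = split_811(records)
means, stds = fit_standard_scaler(train)
train_scaled = transform(train, means, stds)
```

With 459 records, this rounding scheme gives 367 training, 45 validation, and 47 test records.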
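Mean Relative Error (MRE) and Mean Squared Error (MSE) are selected as the metrics for evaluating prediction errors. Given
$N$ pairs of true and predicted values,
$$\mathrm{MRE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_i - \hat{y}_i\right|}{\left|y_i\right|} \times 100\%, \qquad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$
where
$y_i$ is the true value of the i-th sample, and
$\hat{y}_i$ is the predicted value of the i-th sample.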
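As a minimal sketch, the two metrics can be computed as below, assuming the standard definitions: MRE as the mean of |y_i − ŷ_i| / |y_i| (multiplied by 100 when reported as a percentage) and MSE as the mean of (y_i − ŷ_i)². The function names and sample values are illustrative only.

```python
def mean_relative_error(y_true, y_pred):
    """Mean of |y_i - yhat_i| / |y_i|; multiply by 100 for a percentage."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of (y_i - yhat_i)^2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values (not taken from the dataset).
y_true = [100.0, 200.0]
y_pred = [110.0, 190.0]
mre = mean_relative_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

Because MRE divides by the true value, it is scale-free and comparable across the power and duration targets, whereas MSE keeps the squared units of each target.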
4.4.2. Feature Selection
For each prediction task, we use Pearson correlation and mutual information (MI) to select key features. Pearson correlation captures linear relationships, while MI reflects nonlinear dependencies, together encompassing most plausible associations between metadata and energy consumption indicators. We first rank and select features for each task based on Pearson correlation, then compare these selections with MI-based results, finding that the two methods yield highly consistent rankings. The final selected features are summarized in
Table 2. Additionally, since Pearson and MI scores can be directly interpreted as indicators of feature importance, this facilitates verification, understanding, and deployment of the resulting models.
For MI-based selection, we estimate continuous MI using a k-nearest neighbors (kNN) based estimator. Let the input feature matrix be $X$ and the target variable be $y$. For each target metric (e.g., average load power avg_load_power, data loading time load_time, training time train_time), we compute the MI between $y$ and each candidate feature to quantify feature importance. During MI estimation, variables with a small set of integer values (num_worker, batch_size, imgsize, dataset_size) are treated as discrete dimensions, while the remaining features are treated as continuous. The kNN estimator uses $k$ neighbors, and we fix the random seed (random_state = 42) to ensure reproducibility. For each target variable, MI is computed both in the original feature space and in the log-transformed space for heavy-tailed features. The features are then ranked in descending order of MI, yielding the most informative predictors for the power and time metrics.
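The Pearson-based ranking step can be sketched as follows. This is an illustration under stated assumptions: the data are made up, the helper names are hypothetical, and taking the absolute value of the correlation before applying the 0.3 cut-off used in this paper is an assumption. The MI-based ranking would follow the same pattern with a kNN-based MI estimate in place of the correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold=0.3):
    """Rank features by |r| in descending order and keep those above the threshold."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score > threshold]


# Made-up example: peak power rises with batch size, not with worker count.
features = {
    "batch_size": [32, 64, 128, 256, 32, 64, 128, 256],
    "num_workers": [4, 4, 4, 4, 8, 8, 8, 8],
}
peak_power = [200, 220, 250, 290, 202, 221, 252, 288]
selected = select_features(features, peak_power)
```

In this toy case only batch_size passes the threshold, since the worker count is almost uncorrelated with the target.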
4.4.3. Symbolic Regression and Comparison Models
- (1)
Symbolic Regression
We design the symbolic regression search space using a compact yet expressive set of operators. For arithmetic operations, we include addition, subtraction, multiplication, and protected division. These operators capture linear combinations, contrasts, interactions, and ratio/normalization relationships while preventing numerical errors when the denominator is zero. For nonlinear transformations, we employ square root, protected logarithm, absolute value, negation, and reciprocal functions. These allow us to model diminishing returns, log-shaped scaling, deviation magnitude, sign flexibility, and inverse relationships in a numerically safe manner. Additionally, we incorporate a custom sigmoid function (implemented via make_function) to represent saturation or threshold effects, as well as standard trigonometric functions (sin, cos, tan) to capture potential periodic or oscillatory patterns in power traces. The internal fitness function for symbolic regression is the mean absolute error (MAE). For external evaluation and comparison with baseline models, we report MSE and MRE.
The search is run in gplearn with the following settings. The population size (population_size) is 15,000, so 15,000 candidate expressions are retained at each iteration; a larger population facilitates exploration of complex search spaces. The maximum number of generations (generations) is 50. The stopping criterion (stopping_criteria) is 0.01, meaning that the algorithm terminates early when the fitness of the best expression reaches 0.01.
For the genetic operations, the crossover probability is set to p_crossover = 0.7, indicating a 70% chance of generating new expressions through crossover.
The mutation probabilities are p_subtree_mutation = 0.1 (subtree mutation, which replaces an entire sub-expression), p_hoist_mutation = 0.05 (hoist mutation, which prevents expressions from becoming too deep by pruning and thereby simplifying the structure), and p_point_mutation = 0.1 (point mutation, a local modification of a function or variable).
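The configuration above can be written down as a short gplearn sketch. The parameter names follow gplearn's SymbolicRegressor signature, and the values are those reported in the text. The sigmoid implementation (including the clipping range) and the random seed are assumptions, since the paper only states that a custom sigmoid is registered via make_function. The gplearn imports are kept inside the builder so that the settings can be inspected without the library installed.

```python
# Search settings reported in the text, keyed by gplearn parameter name.
SR_PARAMS = {
    "population_size": 15000,
    "generations": 50,
    "stopping_criteria": 0.01,
    "p_crossover": 0.7,
    "p_subtree_mutation": 0.1,
    "p_hoist_mutation": 0.05,
    "p_point_mutation": 0.1,
    "metric": "mean absolute error",  # internal fitness (MAE)
}

# Built-in gplearn operators named in the text; gplearn's 'div', 'log',
# 'sqrt' and 'inv' are protected against invalid inputs.
BUILTIN_FUNCTIONS = ("add", "sub", "mul", "div", "sqrt", "log",
                     "abs", "neg", "inv", "sin", "cos", "tan")

# gplearn requires the four operation probabilities to sum to at most 1;
# the remainder is the share of programs copied unchanged (reproduction).
operator_total = sum(SR_PARAMS[k] for k in (
    "p_crossover", "p_subtree_mutation", "p_hoist_mutation", "p_point_mutation"))


def build_regressor(random_state=0):
    """Build the symbolic regressor (requires numpy and gplearn)."""
    import numpy as np
    from gplearn.functions import make_function
    from gplearn.genetic import SymbolicRegressor

    def _sigmoid(x):
        # Clipping avoids overflow in exp; the range is an assumption.
        return 1.0 / (1.0 + np.exp(-np.clip(x, -50.0, 50.0)))

    sigmoid = make_function(function=_sigmoid, name="sigmoid", arity=1)
    return SymbolicRegressor(function_set=BUILTIN_FUNCTIONS + (sigmoid,),
                             random_state=random_state, **SR_PARAMS)
```

The four probabilities sum to 0.95, so under these settings about 5% of each new generation would be carried over by reproduction.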
- (2)
MLP Model
We selected an MLP model with two or three hidden layers as the comparison model. The output of the first hidden layer can be expressed as
$$h_1 = \delta\left(W_1 x + b_1\right),$$
where $x$ is the input vector, $W_1$ is the weight matrix from the input layer to the hidden layer, and $b_1$ is the bias vector of the hidden layer. The symbol $\delta(\cdot)$ represents the activation function.
To ensure a fair comparison, we systematically tuned the hyperparameters of the MLP baseline using the training and validation splits. The input layer size was set to the number of selected features, and the output layer contained a single neuron for each regression target. We performed a grid search over the hidden layer configuration, the activation function (ReLU, Tanh, or LeakyReLU), the learning rate of the Adam optimizer, the batch size, and the L2 weight decay. Each model was trained for up to 200 epochs with early stopping (patience = 20 epochs) based on the validation MSE. For each target variable (e.g., valley energy consumption, peak energy consumption, valley duration, peak duration), we selected the configuration that achieved the lowest validation MSE and reported the corresponding test MSE and MRE as the final MLP performance.
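For concreteness, the first-hidden-layer computation of the MLP baseline, an activation applied to W1·x + b1, can be sketched in a few lines. The weights, bias, and input below are made-up numbers, and ReLU is used as one of the activation functions in the search space.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def hidden_layer(x, W1, b1, activation=relu):
    """Compute activation(W1 x + b1); W1 holds one row of weights per hidden unit."""
    pre_activation = [
        sum(w * xi for w, xi in zip(row, x)) + bias
        for row, bias in zip(W1, b1)
    ]
    return activation(pre_activation)


# Made-up example with three input features and two hidden units.
x = [1.0, 2.0, 3.0]
W1 = [[0.5, -1.0, 0.0],
      [1.0, 1.0, 1.0]]
b1 = [0.5, -1.0]
h1 = hidden_layer(x, W1, b1)
```

The first unit has a negative pre-activation and is zeroed by ReLU, while the second passes through unchanged.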
- (3)
Linear, Ridge, Lasso, and Gradient Boosting
The linear model is an ordinary least squares regressor without explicit regularization, serving as the simplest baseline.
For Ridge and Lasso regression, we perform a grid search over the regularization strength on the training set using 5-fold cross-validation. The candidate values are $10^{-4}$, $10^{-3}$, $10^{-2}$, $10^{-1}$, 1, and 10. The best value, which minimizes the validation MSE, is selected separately for each of the four prediction targets.
The Gradient Boosting Regressor is configured as a tree-based ensemble using squared-error loss. We tune the learning rate (0.01, 0.05, 0.1), the number of trees (200, 400, 800), the maximum depth of each tree (3, 4, or 5), and the subsampling ratio (0.8 or 1.0), employing 5-fold cross-validation on the training set. The final hyperparameters for each target are fixed after this search, with no further manual adjustments made.
All baseline models are trained using the exact same metadata features and evaluated with the same metrics as the proposed SR model to ensure a fair and reproducible comparison.
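To make the size of the baseline searches concrete, the grids described above can be enumerated as follows. The dictionary keys use scikit-learn's parameter names for GradientBoostingRegressor, which is an assumption about the implementation; the fit counts are derived here and exclude any final refit on the full training set.

```python
from itertools import product

# Gradient Boosting grid from the text (keys follow scikit-learn naming).
GBR_GRID = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 400, 800],
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 1.0],
}
# Regularization strengths searched for Ridge and for Lasso.
RIDGE_LASSO_ALPHAS = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
N_FOLDS = 5    # 5-fold cross-validation
N_TARGETS = 4  # valley power, valley duration, peak power, peak duration


def expand(grid):
    """Return every combination of grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


gbr_configs = expand(GBR_GRID)
gbr_fits = len(gbr_configs) * N_FOLDS * N_TARGETS
ridge_fits = len(RIDGE_LASSO_ALPHAS) * N_FOLDS * N_TARGETS
print(len(gbr_configs), gbr_fits, ridge_fits)
```

Under these assumptions, the Gradient Boosting search covers 54 configurations, which is far larger than the six-value search used for Ridge or Lasso.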
4.4.4. Results of Energy Consumption Prediction Experiments
Under the metadata-based prediction scheme proposed in this paper, the energy consumption prediction for a training task comprises four components: average power consumption during valleys, average duration of valleys, average power consumption during peaks, and average duration of peaks.
- (1)
Results of Feature Selection
The results of feature selection using MI and Pearson correlation are presented in
Table 2, where the selected features are indicated in bold font with an asterisk (*). We rank the features by their MI and Pearson correlation values and select those with values larger than 0.3. For example, for the task of predicting valley energy consumption, we select the features Image_size, Data_size, and Num_workers. For the task of predicting peak energy consumption, we select the features Image_size, Params, Flops, and Batch_size.
- (2)
Results of Energy Consumption Prediction
Based on the selected features, we compared SR with MLP, Linear Regression, Ridge Regression, Lasso Regression, and Gradient Boosting Regression. The results, presented as the mean and standard deviation of 20 predictions, are shown in
Table 3 and demonstrate that SR achieves the best performance, with MREs of 2.59% ± 0.55% for valley energy, 8.25% ± 1.28% for valley duration, 3.71% ± 0.57% for peak power, and 3.68% ± 1.02% for peak duration.
The four formulas derived from the SR models are provided in
Section S3 of the Supplementary Materials. For the peak energy consumption prediction formula, the expression tree reaches a maximum depth of 30 and comprises 152 nodes, giving a depth-to-node ratio of about 0.20. The formula primarily includes Params, Flops, Batch_size, and Image_size, while excluding Data_size, consistent with our Pearson and mutual information analyses for this metric. The expression frequently applies logarithmic, square root, and reciprocal transformations to Batch_size and Image_size, indicating saturating, sub-linear growth rather than simple linear scaling. Additionally, it includes terms such as sigmoid (Batch_size − Image_size), which represent threshold-like effects depending on the relative values of batch size and image size, along with ratio-based interactions that normalize by model complexity through Params and Flops.
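The depth and node-count statistics quoted above can be illustrated on a small expression tree. The nested-tuple representation, the convention that the root sits at depth 0, and the toy expression itself are assumptions made for illustration; the actual formulas are those in the Supplementary Materials.

```python
def node_count(tree):
    """Count operators and terminals in a nested-tuple expression tree."""
    if isinstance(tree, tuple):
        return 1 + sum(node_count(child) for child in tree[1:])
    return 1  # a terminal (feature name or constant)


def depth(tree):
    """Maximum depth of the tree, with the root at depth 0."""
    if isinstance(tree, tuple):
        return 1 + max(depth(child) for child in tree[1:])
    return 0


# Toy expression: sigmoid(Batch_size - Image_size) / log(Params)
toy = ("div",
       ("sigmoid", ("sub", "Batch_size", "Image_size")),
       ("log", "Params"))

toy_nodes = node_count(toy)
toy_depth = depth(toy)

# Ratio for the reported peak-power expression (depth 30, 152 nodes).
reported_ratio = 30 / 152
```

A high depth relative to the node count, as in the reported expression, suggests a chain of nested transformations rather than a wide, shallow sum of terms.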
4.4.5. Analysis of Energy Consumption Prediction
- (1)
Impacts of parameters in the SR model
To analyze the effects of parameters in the genetic programming-based SR model, we conducted additional experiments examining the impacts of population size, number of generations, and crossover probability. The results are presented in
Tables S3 and S4 of the Supplementary Materials. As the population size and number of generations increase, the model’s predictive performance improves. However, once the population size reaches 15,000 and the number of generations reaches 50, performance gains begin to plateau. The performance differences among crossover probabilities of 0.6, 0.7, and 0.8 are minimal, with 0.7 performing marginally best.
- (2)
Feature contribution in the SR model
To quantify the contribution of each metadata feature to the prediction results, we conducted an input perturbation-based feature importance analysis on the explicit expression of SR using the test set. The feature contributions in the SR model are presented in
Table S5 of the Supplementary Materials. These experimental results are consistent with previous findings on feature selection for each prediction task based on MI or Pearson correlation, as shown in
Table 2. For example, in valley energy consumption prediction, image size, data size, and number of workers all contribute, with data size having a relatively greater impact.
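One simple way to run a perturbation analysis on an explicit formula is sketched below. The paper does not specify the perturbation scheme, so scaling each feature by 10% and averaging the relative change in the prediction is an assumption, and the linear formula with its coefficients is a hypothetical stand-in for an SR expression.

```python
def perturbation_importance(formula, rows, feature_names, rel_step=0.10):
    """Mean relative change in the prediction when one feature is scaled by (1 + rel_step)."""
    baseline = [formula(**row) for row in rows]
    scores = {}
    for name in feature_names:
        changes = []
        for row, y0 in zip(rows, baseline):
            perturbed = dict(row)
            perturbed[name] = row[name] * (1.0 + rel_step)
            changes.append(abs(formula(**perturbed) - y0) / abs(y0))
        scores[name] = sum(changes) / len(changes)
    return scores


def toy_valley_power(image_size, data_size, num_workers):
    """Hypothetical explicit formula; the coefficients are made up."""
    return 0.1 * image_size + 0.0005 * data_size + 0.5 * num_workers


# Made-up test rows.
rows = [
    {"image_size": 224, "data_size": 100000, "num_workers": 64},
    {"image_size": 112, "data_size": 50000, "num_workers": 16},
]
scores = perturbation_importance(
    toy_valley_power, rows, ["image_size", "data_size", "num_workers"])
most_important = max(scores, key=scores.get)
```

Because the SR model is an explicit expression, such an analysis needs only repeated evaluation of the formula and no retraining.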
- (3)
Predicting energy consumption from different architectures
Furthermore, we conducted experiments across various models to evaluate the ability of the SR model to forecast energy usage for previously unseen architectures.
Table 4 presents the energy consumption predictions for ResNet18, ResNet34, ResNet50, ResNet101, ViT, and BERT using the SR model trained on ResNet50 data, as well as predictions for ViT and BERT based on ViT data. The results demonstrate that models with similar architectures achieve higher prediction accuracy, whereas those with significantly different structures (such as predicting BERT or ViT energy consumption from ResNet data) exhibit comparatively lower accuracy. These findings indicate that the model architecture plays a critical role in the accuracy of energy consumption predictions.
In addition, the prediction results for model training energy consumption on the NVIDIA RTX 2080 Ti are presented in
Table S6 of the Supplementary Materials. The SR model achieved the best performance, with an MRE of 2.12% for valley energy, 5.52% for valley duration, 2.95% for peak power, and 2.25% for peak duration, which is comparatively better than the predicted performance on the A800. This indicates that our approach maintains strong predictive accuracy across different GPU types.
4.4.6. Example of Energy Consumption Prediction
A practical example of energy consumption prediction is shown in
Figure 7. The cycle length of energy consumption is defined as follows:
$$T = \mathrm{epoch} \times \left(T_{\mathrm{valley}} + T_{\mathrm{peak}}\right),$$
where
$T_{\mathrm{valley}}$ represents the predicted valley duration,
$T_{\mathrm{peak}}$ represents the predicted peak duration, and epoch denotes the number of training epochs. The training task is configured as follows: model = ViT; number of GPUs = 4; batch size = 256; image size = 224 × 224; dataset size = 100,000; and number of workers = 64.
In
Figure 7, the TDP of each GPU (400 W) is shown for comparison. The predicted square-wave power pattern has a duty cycle of 96.6% with a total epoch duration of 188.9 s, where the valley period lasts 6.41 s and the peak period lasts 182.58 s. During the valley, power consumption is 62.23 W per GPU, while during the peak it increases to 291.06 W per GPU. The MSE values for valley energy consumption, valley duration, peak power, and peak duration are 5.68, 219.25, 80.47, and 987.31, respectively, with corresponding MRE values of 2.67%, 8.42%, 5.16%, and 3.64%.
Additionally, we provide three more examples of energy consumption prediction for the ResNet50, BERT, and MobileNet models, as shown in
Figures S8–S10 in the Supplementary Materials, respectively. For instance, the results for BERT indicate that the square wave has a duty cycle of 96.58%, with a total epoch duration of 175.2 s. Specifically, the valley period lasts 5.98 s, and the peak period lasts 169.25 s. During the valley period, power consumption is 67.01 W per GPU, while during the peak period, it reaches 320.72 W per GPU.
Under the above configuration of the ViT, we further predict the total energy consumption and carbon emissions for one day for a single server and for a data center with 1000 servers, as shown in
Table 5. A single server with four GPUs consumes approximately 27.2 kWh per day when running the predicted cycles continuously, corresponding to roughly 13.6 kg of CO₂ at a grid carbon intensity of 0.5 kg CO₂/kWh. For a data center with 1000 such servers, this scales to about 27.2 MWh and 13.6 metric tons of CO₂ per day.
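The arithmetic behind these figures can be reproduced with a short calculation. It is a sketch under explicit assumptions: only GPU power is counted, the predicted cycle repeats without interruption for 24 hours, and the total length of a run is taken as the number of epochs times the sum of the valley and peak durations. The function names are illustrative.

```python
def square_wave_summary(valley_s, peak_s, valley_w, peak_w,
                        gpus_per_server=4, carbon_kg_per_kwh=0.5):
    """Summarize a predicted two-level (valley/peak) power cycle."""
    cycle_s = valley_s + peak_s
    duty_cycle = peak_s / cycle_s
    # Time-weighted average power per GPU over one cycle, in watts.
    avg_gpu_w = (valley_s * valley_w + peak_s * peak_w) / cycle_s
    server_kwh_per_day = avg_gpu_w * gpus_per_server * 24 / 1000
    return {
        "cycle_s": cycle_s,
        "duty_cycle": duty_cycle,
        "avg_gpu_w": avg_gpu_w,
        "server_kwh_per_day": server_kwh_per_day,
        "server_co2_kg_per_day": server_kwh_per_day * carbon_kg_per_kwh,
    }


def total_training_seconds(epochs, valley_s, peak_s):
    """Total run length, assuming one valley and one peak per epoch."""
    return epochs * (valley_s + peak_s)


# Predicted values for the ViT configuration reported in the text.
vit = square_wave_summary(valley_s=6.41, peak_s=182.58,
                          valley_w=62.23, peak_w=291.06)
# Scaling to 1000 identical servers: kWh -> MWh and kg -> metric tons.
datacenter_mwh_per_day = vit["server_kwh_per_day"] * 1000 / 1000
datacenter_co2_t_per_day = vit["server_co2_kg_per_day"] * 1000 / 1000
```

The time-weighted average comes to about 283 W per GPU, well below the 400 W TDP, which is why a TDP-based estimate would overstate the daily energy of this task.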
6. Conclusions
Accurately forecasting the energy consumption of deep neural network training tasks is a complex challenge. Energy consumption models correlate system component status indicators with energy usage, enabling real-time estimation of changes in system energy consumption. These models are valuable for monitoring energy and scheduling resources during task execution; however, they cannot readily predict energy consumption in advance. Approaches based on historical time-series data employ deep learning techniques to predict energy consumption in data centers and can forecast energy usage sequences ahead of time. Nevertheless, their accuracy is affected by the unpredictability of computing tasks, the distribution of time-series training data, and the model’s generalization capability. Meanwhile, directly estimating energy consumption from the hardware’s maximum power consumption or TDP often proves inaccurate.
In this paper, we present a method for predicting energy consumption based on metadata from deep neural network training tasks. This approach overcomes the limitations of traditional prediction techniques that depend on historical time-series data or real-time performance monitoring. It utilizes task metadata (such as the number of model parameters, computational complexity, and input data size) as key input features to develop a simple yet effective model for forecasting energy consumption during training tasks. This model can predict energy consumption characteristics before task execution, thereby facilitating energy management and task scheduling in data centers that handle model training workloads. For example, by scheduling training tasks based on energy predictions, data centers can maintain steady overall power consumption and avoid excessive energy use. Moreover, this approach helps reduce peak demand on the grid, fill periods of low energy usage, and enhance the utilization of renewable energy sources.
For future work, we plan to further investigate the relationship between deep model training tasks and their energy consumption in various data centers. We will examine additional factors and variables that influence energy consumption and incorporate them into the metadata collection. Furthermore, we aim to enhance the models’ ability to generalize when forecasting energy consumption. Our ultimate goal is to implement this approach in real data center environments to evaluate its performance and reliability.