Article

Towards Optimal Supercomputer Energy Consumption Forecasting Method

IT4Innovations, VSB—Technical University of Ostrava, 17.listopadu 2172/15, 70833 Ostrava-Poruba, Czech Republic
Academic Editor: António M. Lopes
Mathematics 2021, 9(21), 2695; https://doi.org/10.3390/math9212695
Received: 14 September 2021 / Revised: 17 October 2021 / Accepted: 18 October 2021 / Published: 23 October 2021
(This article belongs to the Special Issue Dynamical Systems and Their Applications Methods)

Abstract

Accurate prediction methods are generally very computationally intensive, so they take a long time. Quick prediction methods, on the other hand, are not very accurate. Is it possible to design a prediction method that is both accurate and fast? In this paper, a new prediction method is proposed, based on the so-called random time-delay patterns, named the RTDP method. Using these random time-delay patterns, this method looks for the most important parts of the time series’ previous evolution, and uses them to predict its future development. When comparing the supercomputer infrastructure power consumption prediction with other commonly used prediction methods, this newly proposed RTDP method proved to be the most accurate and the second fastest.
Keywords: forecasting; prediction method; time series; random time-delay patterns; zeroth algorithm; machine learning; statistical; supercomputer power consumption; complex system

1. Introduction

The supercomputer infrastructure is a complex system in terms of its total power consumption. It is a system whose behavior depends on many factors, which may nonlinearly depend on each other, and the number of such factors is so great that modeling them all is computationally infeasible. The individual user tasks can cause consumption with a regular pattern, but, in combination, they generate a consumption pattern that is much less regular and, in some places, can seem almost chaotic.
A simplified example of such a pattern-merging is shown in Figure 1. In real traffic, more users are working on the supercomputer at the same time, so the total consumption is then the result of the combination of many such patterns. An example of the total power consumption of a real supercomputer infrastructure, measured over several days, is shown in Figure 2.
Since this is not a completely chaotic time series, its development can be partially predicted using appropriate forecasting methods. However, its complexity is so high that not all samples of this time series can successfully be used to predict its future evolution. Using all past values in the prediction of such a complex time series inevitably leads to overfitting. Of course, if too few values are used, the opposite (underfitting) will be the case, so the crucial task of any successful prediction method is to find the parts of the previous evolution of the predicted time series that most determine its character. Every prediction method has to deal with this problem.
Machine learning methods [1] handle this by creating a mathematical model, but building such a model takes time, so these methods may be too slow for fast real-time predictions. It is possible to reuse a mathematical model built on older data to save time, but this can lead to larger prediction errors. Statistical methods [2], on the other hand, work with parameters that describe the time series globally and are not sensitive to the fluctuations that may occasionally occur in this power consumption.
This paper presents a new nonlinear forecasting method that was designed to find the most significant parts of the previous time series evolution, and thus to produce forecasts very quickly, even for a seemingly chaotic time series.

2. Zeroth Algorithm

The reason the zeroth algorithm method is briefly introduced here is that the new prediction method is partly based on this simple method. This method uses the zeroth-order approximation of the time series dynamics [4]. Therefore, it is very fast but inaccurate. In the previous course of the predicted time series, this method looks for subsequences that are similar to the last subsequence. The forecast is then the arithmetic mean of the values that followed these similar subsequences in the past:
$\hat{y}_t = \frac{1}{|U_\varepsilon(x_{last})|} \sum_{x_k \in U_\varepsilon(x_{last})} x_{t-k}$,
$x_{last} = (x_{t-m\tau},\, x_{t-(m-1)\tau},\, \ldots,\, x_{t-2\tau},\, x_{t-\tau})$,
$x_k = (x_{t-k-m\tau},\, x_{t-k-(m-1)\tau},\, \ldots,\, x_{t-k-2\tau},\, x_{t-k-\tau})$,
where $x_{last}$ is the last subsequence, $x_k$ is a subsequence in the past, $m$ is their length, $\tau$ is the time delay, $\varepsilon$ is the radius of $U_\varepsilon$, and $|U_\varepsilon(x_{last})|$ is the number of similar subsequences in the past (those belonging to the neighborhood $U_\varepsilon$ of the last subsequence).
The principle of this method can be shown by a simple example. Suppose the predicted time series is x = ( x 1 , x 2 , , x 10 ) and its members have values:
x_1 = 1.046794,  x_2 = 1.049179,  x_3 = 1.039641,  x_4 = 1.046794,  x_5 = 1.049179,
x_6 = 1.042025,  x_7 = 1.030103,  x_8 = 1.061101,  x_9 = 1.046794,  x_10 = 1.056332.
For this example, the length of the searched similar subsequences m = 3 and the time delay τ = 2 will be chosen. Thus, if a prediction of the value x_11 is sought, the last subsequence will be x_last = (x_5, x_7, x_9) = (1.049179, 1.030103, 1.046794). For simplicity, the Manhattan norm will be used to measure the distance between subsequences. The distances of the previous subsequences x_k, namely (x_1, x_3, x_5), (x_2, x_4, x_6), (x_3, x_5, x_7), and (x_4, x_6, x_8), from the last subsequence x_last are then 0.014307, 0.021460, 0.045305, and 0.028614, respectively. If the radius of U_ε is chosen to be ε = 0.025, then (x_1, x_3, x_5) and (x_2, x_4, x_6) will be considered similar subsequences. The prediction of x_11 is then calculated as the arithmetic mean of the values of the predicted time series following these similar subsequences, which, in this case, are the values of x_7 and x_8. The result is therefore ŷ_11 = 1.045602.
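Assuming NumPy, this worked example can be reproduced with the following sketch (the function name and structure are illustrative, not the author's implementation):

```python
import numpy as np

def zeroth_predict(x, m, tau, eps):
    """Zeroth-algorithm forecast: the arithmetic mean of the values that
    followed past subsequences within distance eps of the last subsequence."""
    x = np.asarray(x, dtype=float)
    n = len(x)                                   # predicting the (n+1)-th value
    offsets = np.array([i * tau for i in range(m, 0, -1)])
    last = x[n - offsets]                        # (x_{t-m*tau}, ..., x_{t-tau})
    followers = []
    k = 1
    while n - k - m * tau >= 0:                  # all complete past subsequences x_k
        sub = x[n - k - offsets]
        if np.abs(sub - last).sum() <= eps:      # Manhattan norm, as in the example
            followers.append(x[n - k])           # the value that followed x_k
        k += 1
    return float(np.mean(followers))
```

Called with the ten values above and m = 3, τ = 2, ε = 0.025, it averages the followers x_7 and x_8 and returns 1.045602, matching the worked example.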

3. New Method

In brief, this new method attempts to find time delay patterns that, if used in the zeroth algorithm method on previous data, would result in the most accurate prediction. The structures of these patterns are randomly assembled, and the algorithm selects the most successful ones, which are used in the final calculation of the current prediction.
The principle of the new method is, therefore, partly based on the zeroth algorithm, but the subsequences x_last and x_k are defined using a random time-delay pattern (RTDP). The randomly generated RTDPs are then used to generate a multitude of estimated continuations of the predicted time series x. For one particular RTDP, the partial prediction value y and its estimated error rate ε are calculated as follows:
$y = x_{t-k_{min}}, \qquad \varepsilon = \varepsilon_{min}$,
$k_{min} = \arg\min_k \lVert x_{last} - x_k \rVert, \qquad \varepsilon_{min} = \lVert x_{last} - x_{k_{min}} \rVert$,
$x_{last} = (x_{t-\tau_m},\, x_{t-\tau_{m-1}},\, x_{t-\tau_{m-2}},\, \ldots,\, x_{t-\tau_2},\, x_{t-\tau_1})$,
$x_k = (x_{t-k-\tau_m},\, x_{t-k-\tau_{m-1}},\, x_{t-k-\tau_{m-2}},\, \ldots,\, x_{t-k-\tau_2},\, x_{t-k-\tau_1})$,
$RTDP = \{\tau_1, \tau_2, \tau_3, \ldots, \tau_m\}, \qquad \tau_i = \sum_{j=1}^{i} \delta_j, \qquad \delta_j \in_R \{1, 2, \ldots, \delta_{max}\}$,
where y is the partial prediction produced by this particular RTDP, equal to the value of the predicted time series x that follows the subsequence x_{k_min}, i.e., the subsequence most similar to x_last among all past subsequences x_k; ε is the estimated error rate of this partial prediction; ε_min is the norm of the distance between the last subsequence and the most similar one; RTDP is a random time-delay pattern; m is the length of the RTDPs; τ_i are the random time delays; and δ_j are the random time intervals.
The above procedure is repeated for N_p (number of patterns) RTDPs, and the final prediction is then calculated as the arithmetic mean of the N_msp (number of the most successful patterns) best partial predictions. Mathematically, it can be expressed as follows:
$\hat{y}_t = \frac{1}{N_{msp}} \sum_{i=1}^{N_{msp}} y_i, \qquad (y_i, \varepsilon_i) \in Y_\varepsilon, \qquad N_{msp} \le N_p$,
$Y_\varepsilon = \{(y_1, \varepsilon_1),\, (y_2, \varepsilon_2),\, \ldots,\, (y_{N_p}, \varepsilon_{N_p})\}, \qquad \varepsilon_j \le \varepsilon_{j+1}, \; \forall j \in \{1, 2, \ldots, N_p - 1\}$,
where ŷ_t is the final prediction of the value of the predicted time series x at time t, N_msp is the number of the most successful RTDPs, N_p is the total number of RTDPs, and Y_ε is the set of pairs (y, ε) produced by all N_p RTDPs, sorted in ascending order of the estimated error rates ε.
It is worth mentioning that the values of N_p, N_msp, m, and δ_max must be determined in advance. These are the parameters of this new method and fundamentally affect its accuracy and computational cost.
This new method will hereafter be referred to as the RTDP method and, for better illustration, it is written in pseudocode in Algorithm 1.
Algorithm 1 The RTDP method in pseudocode.
Require: x, m, δ_max, N_p, N_msp
Ensure: |x| > m·δ_max, N_msp ≤ N_p
    Y ← {}
    for i = 1 to N_p do                                ▹ tries N_p RTDPs
        δ ← random vector of length m containing random integers from 1 to δ_max
        τ ← (δ_1, δ_1 + δ_2, δ_1 + δ_2 + δ_3, ..., δ_1 + δ_2 + ... + δ_m)
        RTDP ← (τ_1, τ_2, τ_3, ..., τ_m)               ▹ the RTDP is the cumulative sum of δ
        for k = 1 to |x| − m·δ_max do                  ▹ goes through all possible subsequences x_k in x
            ε_k ← ‖x_last − x_k‖                       ▹ each distance ε_k between x_k and x_last is stored
        end for
        k_min ← argmin_k ε_k                           ▹ finds the k for which x_k is closest to x_last
        y_i ← y_best ← x_{t−k_min}                     ▹ assumed best prediction made by this RTDP
        ε_i ← ε_min ← ε_{k_min}                        ▹ distance of the closest subsequence x_{k_min}
        Y ← Y ∪ (y_i, ε_i)                             ▹ adds this RTDP's result to the overall result set
    end for
    Y_ε ← sort_ε(Y)                                    ▹ ranks all RTDP predictions by assumed accuracy ε
    ŷ_t ← (y_1 + y_2 + y_3 + ... + y_{N_msp}) / N_msp  ▹ averages the best N_msp predictions
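The pseudocode above translates directly into a short program. The following NumPy sketch is illustrative (names and structure are not the author's implementation):

```python
import numpy as np

def rtdp_predict(x, m, delta_max, n_p, n_msp, rng=None):
    """Sketch of Algorithm 1: try n_p random time-delay patterns (RTDPs),
    keep the n_msp with the smallest estimated error rates, and average
    their partial predictions."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    t = len(x)                                       # the value being predicted is x[t]
    results = []                                     # one (eps_i, y_i) pair per RTDP
    for _ in range(n_p):
        delta = rng.integers(1, delta_max + 1, size=m)   # random time intervals
        tau = np.cumsum(delta)                       # the RTDP: tau_1 < ... < tau_m
        last = x[t - tau[::-1]]                      # (x_{t-tau_m}, ..., x_{t-tau_1})
        best_eps, best_y = np.inf, None
        for k in range(1, len(x) - m * delta_max + 1):
            sub = x[t - k - tau[::-1]]               # past subsequence x_k
            eps_k = np.abs(sub - last).sum()         # Manhattan norm, as in the example
            if eps_k < best_eps:
                best_eps, best_y = eps_k, x[t - k]   # the value that followed x_k
        results.append((best_eps, best_y))
    results.sort(key=lambda pair: pair[0])           # rank by estimated error rate
    return float(np.mean([y for _, y in results[:n_msp]]))
```

Since the partial prediction of each RTDP is a past value of x, the final forecast always lies within the range of the observed series; the loop over patterns is embarrassingly parallel, as noted in the conclusions.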
For a better understanding of this method, it is also useful to demonstrate its principle with a simple example. Suppose the predicted time series is x = ( x 1 ,   x 2 ,   ,   x 20 ) and its members have values:
x_1 = 1.046794,   x_6 = 1.042025,   x_11 = 1.022949,   x_16 = 0.999104,
x_2 = 1.049179,   x_7 = 1.030103,   x_12 = 1.022949,   x_17 = 1.011027,
x_3 = 1.039641,   x_8 = 1.061101,   x_13 = 1.027718,   x_18 = 1.008642,
x_4 = 1.046794,   x_9 = 1.046794,   x_14 = 1.027718,   x_19 = 1.013411,
x_5 = 1.049179,   x_10 = 1.056332,  x_15 = 1.020565,   x_20 = 1.025334.
For this simple example, the following parameters will be chosen: m = 5, δ_max = 3, N_p = 5, N_msp = 2.
In Algorithm 1, the RTDPs are (for illustrative purposes) generated sequentially, one after the other; however, as will now be seen, they can also be generated in parallel. Based on the values of N_p and δ_max, five random vectors δ are generated:
( 2 ,   2 ,   1 ,   3 ,   3 )   ,   ( 1 ,   1 ,   2 ,   1 ,   1 )   ,   ( 3 ,   1 ,   3 ,   1 ,   2 )   ,   ( 3 ,   3 ,   3 ,   3 ,   1 )   ,   ( 2 ,   2 ,   3 ,   3 ,   3 )
and from them, five RTDPs are calculated as their cumulative sums:
( 2 ,   4 ,   5 ,   8 ,   11 )   ,   ( 1 ,   2 ,   4 ,   5 ,   6 )   ,   ( 3 ,   4 ,   7 ,   8 ,   10 )   ,   ( 3 ,   6 ,   9 ,   12 ,   13 )   ,   ( 2 ,   4 ,   7 ,   10 ,   13 ) .
Suppose a prediction of x_21 is sought; then the last subsequences based on these RTDPs are:
( x 10 ,   x 13 ,   x 16 ,   x 17 ,   x 19 )   ,   ( x 15 ,   x 16 ,   x 17 ,   x 19 ,   x 20 )   ,   ( x 11 ,   x 13 ,   x 14 ,   x 17 ,   x 18 ) ,
( x 8 ,   x 9 ,   x 12 ,   x 15 ,   x 18 )   ,   ( x 8 ,   x 11 ,   x 14 ,   x 17 ,   x 19 ) .
By iterating the values of k from 1 to 5 (|x| − m·δ_max = 20 − 5·3 = 5), the distances ε_k between all x_k and x_last are now calculated for each RTDP. For simplicity, the Manhattan norm can be used again, giving the following results for RTDP = (2, 4, 5, 8, 11):
k                 1          2          3          4          5
y_k = x_{21−k}    1.025334   1.013411   1.008642   1.011027   0.999104
ε_k               0.052459   0.050074   0.114456   0.081073   0.090611
These results show that the minimum distance for this RTDP is ε_min = 0.050074, so k_min = 2 and the assumed best prediction is y_best = 1.013411. This procedure is repeated for all RTDPs, and the best result from each is stored in Y, which, in this example, would look like this:
RTDP   (2,4,5,8,11)   (1,2,4,5,6)   (3,4,7,8,10)   (3,6,9,12,13)   (2,4,7,10,13)
y      1.013411       1.025334      1.025334       1.008642        1.013411
ε      0.050074       0.057228      0.052459       0.054843        0.059612
Finally, the N_msp best partial predictions y (those with the smallest ε) are taken and their arithmetic mean is calculated; in this example, the final resulting prediction is ŷ_t = (1.013411 + 1.025334)/2 = 1.0193725.
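The single-pattern step can be checked numerically. The short script below is a sketch that reproduces the distance table for the first RTDP, (2, 4, 5, 8, 11); the remaining patterns are handled identically.

```python
import numpy as np

# The 20 values of the example time series.
x = np.array([1.046794, 1.049179, 1.039641, 1.046794, 1.049179,
              1.042025, 1.030103, 1.061101, 1.046794, 1.056332,
              1.022949, 1.022949, 1.027718, 1.027718, 1.020565,
              0.999104, 1.011027, 1.008642, 1.013411, 1.025334])

tau = np.array([2, 4, 5, 8, 11])          # the first RTDP of the example
t = len(x)                                # predicting x_21 (0-based index 20)
last = x[t - tau[::-1]]                   # (x_10, x_13, x_16, x_17, x_19)
eps = [np.abs(x[t - k - tau[::-1]] - last).sum() for k in range(1, 6)]
k_min = int(np.argmin(eps)) + 1           # minimum distance at k = 2
y_best = x[t - k_min]                     # partial prediction 1.013411
```

Note that the printed series values are rounded to six decimal places, so the computed distances may differ from the table in the last digit.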

4. Comparison

The time series of power energy consumption, shown in Figure 2, was used to test the prediction using the RTDP method. To verify the competitiveness of this RTDP method, predictions of the same time series were also calculated using other common prediction methods.
Machine-learning methods such as extreme gradient boosting (XGB) [5], k-nearest neighbors (KNN) [6], random forest (RF) [7], and artificial neural networks (ANN) [8] were used. The ANN was used with two parameter settings: one faster and one more accurate. Statistical methods are represented here by probably the best-known one, the auto-regressive integrated moving average (ARIMA) [2] method, also in two parameter settings. For an interesting comparison, the zeroth algorithm method, on which the RTDP prediction method is based, was also added.
The parameters of the ARIMA(0,1,2) method were determined automatically by the auto.arima() function, and the parameters of the ARIMA(8,1,6) method were determined by the recommended procedure using the autocorrelation (ACF) and partial autocorrelation (PACF) functions. The optimal parameter values used for all other methods were found empirically. For the RTDP method, this search is shown in Figure 3. Table 1 summarizes the parameter values of all the compared methods.

5. Results

For all methods, the same number of previous samples was used to predict the following value: a sliding time window of 340 samples was created and each method attempted to predict the value of the 341st sample. By sliding this time window over the entire power energy consumption time series, the waveform of the prediction error for each method was obtained.
The sampling rate of the predicted time series used is one sample per minute, so 340 samples represent a time span of more than 5 hours. Over such a long period of time, power consumption trends should already be sufficiently evident. Of course, by using a longer time window, the predictions could be more accurate, but for the purposes of this comparison, this level of accuracy is sufficient.
From the prediction error waveforms, the moving root mean square error (RMSE) waveforms were calculated for smoothing purposes, using a 300-sample-wide moving window; they are shown in Figure 4. For each method, the overall RMSE was also calculated from its prediction error waveform, and a sorted summary of these total RMSEs is given in Table 2.
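The smoothing step can be expressed compactly: the moving RMSE is the square root of a moving average of the squared errors. The sketch below assumes `errors` holds the per-sample prediction-error series (variable names are illustrative):

```python
import numpy as np

def moving_rmse(errors, window=300):
    """RMSE over a sliding window: the square root of the moving
    mean of the squared prediction errors."""
    squared = np.asarray(errors, dtype=float) ** 2
    kernel = np.ones(window) / window
    return np.sqrt(np.convolve(squared, kernel, mode="valid"))
```

For an error series of length n, this yields n − window + 1 smoothed values, one per window position.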
The prediction calculations of the machine-learning methods were conducted using the R [9] software package caret [10], and the statistical method predictions were calculated using the R package forecast [11].
In the case of the machine learning methods used (XGB, ANN, RF, KNN), the default resampling method of the caret software package was used to split the data into training and test sets. This is a bootstrapping method that builds a test set from 25% of the input data. The nonlinear and statistical methods (Zeroth, RTDP, ARIMA) do not use this partitioning into training and test sets because they do not create a mathematical model that needs to be trained and then tested.
All calculations were performed on the same personal computer with an Intel Core i7-1065G7 processor (1.30–3.90 GHz) and 16 GB DDR4 RAM.

6. Conclusions and Future Work

In this paper, a new prediction method, named RTDP, was proposed. Using random time-delay patterns, this method tries to find the important parts of the previous evolution of the time series and predicts its future evolution on this basis.
Its competitiveness was proved by comparing the accuracy of its prediction of the supercomputer infrastructure consumption time series with the accuracy of the prediction of the same time series when calculated with other commonly used prediction methods. The new RTDP method is based on the old and simple zeroth algorithm method and, thanks to the modifications, has gained in accuracy and lost a little in speed compared to the original method.
The comparison results, shown in Figure 4 and summarized in Table 2, show that the new RTDP method, when used to predict the evolution of supercomputer infrastructure consumption, was the most accurate and the second fastest. This is an excellent result, but to test this method more comprehensively, it will be necessary to perform this comparison on various types of time series.
The development of a software package in which this method is effectively implemented is another appropriate future work. An advantage of this method is its easy parallelizability, since the calculations with individual RTDPs are independent of each other and can, therefore, run on different processor cores at the same time. It is reasonable to assume that exploiting this feature will further speed up the method, which may also have an impact on its accuracy, as more RTDPs can be tried in the same amount of time.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/math9212695/s1.

Funding

This work was funded by the Advanced Data Analysis and Simulations lab of IT4Innovations, VSB-Technical University of Ostrava, Czech Republic.

Data Availability Statement

Data is contained within the article or supplementary material.

Acknowledgments

This work was supported by The Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development, and Innovations project “e-INFRA CZ—LM2018140” and by SGC grant No. SP2020/137 “Dynamic system theory and its application in engineering”, VSB—Technical University of Ostrava, Czech Republic.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACF    Autocorrelation function
ANN    Artificial neural network
ARIMA  Auto-regressive integrated moving average
KNN    k-nearest neighbors
PACF   Partial autocorrelation function
RF     Random forest
RMSE   Root mean square error
RTDP   Random time-delay pattern
XGB    Extreme gradient boosting

References

  1. Bonaccorso, G. Machine Learning Algorithms; Packt Publishing Ltd.: Birmingham, UK, 2017.
  2. Brockwell, P.J.; Davis, R.A. Introduction to Time Series and Forecasting; Springer: Berlin/Heidelberg, Germany, 2016.
  3. IT4Innovations: Anselm, Salomon, DGX-2, and Barbora Supercomputer Clusters Located at IT4Innovations, National Supercomputing Center. Available online: https://www.it4i.cz/en (accessed on 31 August 2021).
  4. Kantz, H.; Schreiber, T. Nonlinear Time Series Analysis; Cambridge University Press: Cambridge, UK, 2003.
  5. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  6. Tomčala, J. Predictability and Entropy of Supercomputer Infrastructure Consumption. In Chaos and Complex Systems, Springer Proceedings in Complexity; Stavrinides, S., Ozer, M., Eds.; Springer: Cham, Switzerland, 2020; pp. 59–66.
  7. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282.
  8. Graupe, D. Principles of Artificial Neural Networks: Basic Designs to Deep Learning; World Scientific: Singapore, 2019.
  9. R Core Team. R: A Language and Environment for Statistical Computing. Available online: https://www.R-project.org (accessed on 31 August 2021).
  10. Kuhn, M. caret: Classification and Regression Training. Available online: https://CRAN.R-project.org/package=caret (accessed on 31 August 2021).
  11. Hyndman, R.; Athanasopoulos, G.; Bergmeir, C.; Caceres, G.; Chhay, L.; O’Hara-Wild, M.; Petropoulos, F.; Razbash, S.; Wang, E.; Yasmeen, F. forecast: Forecasting Functions for Time Series and Linear Models. Available online: https://pkg.robjhyndman.com/forecast (accessed on 31 August 2021).
Figure 1. A simplified example of consumption aggregation. Energy consumption of 3 nodes running different jobs at the same time. Regular patterns merged together still give a pattern, but a more complex one, with a smaller degree of predictability.
Figure 2. Power energy consumption time series. This is the normalized measured power from the infrastructure of the IT4Innovations [3] supercomputer. The measured time range is from 1:00 p.m., 2 November to 9:00 p.m., 5 November 2017.
Figure 3. Results of a series of predictions designed to empirically determine the optimal parameters of the RTDP method. The time series of power energy consumption, shown in Figure 2, was used to calculate these results. The numbers at the nodes represent the value of δ m a x . The number of patterns and the number of the most successful patterns were set to N p = 30 and N m s p = 21 for the whole series. The RTDP method is (in this case) most accurate when the parameters δ m a x = 5 and m = 25 are set.
Figure 4. Comparison of the prediction accuracy waveforms of the compared methods with those of the new RTDP prediction method. The moving RMSE was calculated as the RMSE over a 300-sample-wide moving window.
Table 1. Summary of the used parameter values of the compared methods.
Method          Parameters
ANN 1           η = 0.1, 3 layers of 15 neurons each, max_error = 0.01
ANN 2           η = 0.1, 3 layers of 15 neurons each, max_error = 0.02
ARIMA(0,1,2)    p = 0, d = 1, q = 2
ARIMA(8,1,6)    p = 8, d = 1, q = 6
KNN             k = 5, N = 40
RF              n_tree = 13, m_try = 19
RTDP            δ_max = 5, m = 25, N_p = 30, N_msp = 21
XGB             n_rounds = 22, η = 0.23, min_weight = 20, max_depth = 1, γ = 0
Zeroth          m = 31, τ = 1, ε = 0.151
Table 2. The ranked results are summarized here by the total RMSE and also by the total runtime taken to calculate the predictions of the entire time series of supercomputer power consumption.
Method          Total RMSE [-]      Method          Total Run-Time [s]
RTDP            0.02719             Zeroth          23
ARIMA(8,1,6)    0.02722             RTDP            42
ARIMA(0,1,2)    0.02738             ARIMA(0,1,2)    58
XGB             0.02773             KNN             3240
RF              0.02836             XGB             4515
Zeroth          0.03231             ARIMA(8,1,6)    4714
KNN             0.03350             RF              7250
ANN 1           0.03414             ANN 2           25,501
ANN 2           0.03841             ANN 1           56,549
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.