A Novel Hybrid Algorithm to Forecast Functional Time Series Based on Pattern Sequence Similarity with Application to Electricity Demand

Abstract: The forecasting of future values is a very challenging task. In almost all scientific disciplines, the analysis of time series provides useful information and even economic benefits. In this context, this paper proposes a novel hybrid algorithm to forecast functional time series with arbitrary prediction horizons. It integrates a well-known functional data clustering algorithm into a forecasting strategy based on pattern sequence similarity, which was originally developed for discrete time series. The new approach assumes that some patterns are repeated over time, and it attempts to discover them and evaluate their immediate future. Hence, the algorithm first applies a functional time series clustering algorithm, i.e., it assigns labels to every data unit (which may represent one hour, one day, or any arbitrary length). As a result, the time series is transformed into a sequence of labels. Later, it retrieves the sequence of labels occurring just before the sample that we want to forecast. This sequence is searched for within the historical data, and every time it is found, the sample immediately after is stored. Once the searching process is terminated, the output is generated by weighting all stored data. The performance of the approach has been tested on real-world datasets related to electricity demand and compared to other existing methods, reporting very promising results. Finally, a statistical significance test has been carried out to confirm the suitability of the selection of the compared methods. In conclusion, a novel algorithm to forecast functional time series is proposed, with very satisfactory results when assessed in the context of electricity demand.


Introduction
Functional Data Analysis (FDA) is a relatively recent field of statistics that has emerged in the last three decades with the aim of adapting statistical methodologies to functional data. Broadly interpreted, FDA deals with functional random variables. A comprehensive reference is found in [1], which reviews relevant methods for FDA, including smoothing, basis function expansion, principal component analysis, canonical correlation, regression analysis, and differential analysis. This paper is focused on Functional Time Series (FTS), which can be defined as a time sequence of functional observations [2]. Note that most of the issues found in FDA can be extended to FTS.
While there are numerous approaches dealing with discrete time series, few contributions pay attention to FTS [3]. In particular, the forecasting of FTS has not been extensively studied, as compared to other fields of research: this is the main motivation for proposing a novel forecasting algorithm.

Related Works
This section first reviews some relevant works in the field of FTS forecasting and, later, how PSF (Pattern Sequence-based Forecasting) and funHDDC (functional High-Dimensional Data Clustering) have been successfully applied to different domains. Furthermore, their impact in both the time series forecasting and functional data clustering domains is discussed. Note that a very interesting work comparing different time series forecasting methods can be found in [9].
Functional data can be found in numerous real-life applications from different fields of applied sciences. For instance, the study of demographic curves (mortality and fertility rates for each age observed every year) can be found in [10]. A short-term approach for traffic flow forecasting was introduced in [11]. The volatility in time series was studied by proposing a functional version of the ARCH model [12]. The Italian gas market was also forecasted by means of constrained FTS, as introduced in [13]. The analysis of the evolution of intraday particulate matter concentration was reported in [14]. Another FTS approach to forecast volatility in foreign exchange markets was introduced in [15]. Rainfall was forecasted with FTS in [16]. Recently, different models to forecast electricity demand by means of FTS were introduced in [17], with a locally-recurrent fuzzy neural architecture, in [18], with a random vector functional link network, and in [19], with a new Hilbertian ARMAX model. The same authors also addressed the challenging problem of forecasting electricity prices [2]. The forecasting of chaotic time series with functional link extreme learning ANFIS was proposed in [20].
The PSF algorithm was first published in 2011 [4]. It was developed to deal with discrete time series and proposed, for the first time, to use clustering methods to transform a time series into a sequence of values. Recently, a robust implementation in R was also published [5]. PSF has also been successfully applied in the context of wind speed [21], solar power [22], and water forecasting [23]. Since its first publication, several authors have proposed modifications for its improvement. In [24], the authors modified PSF to forecast outliers in time series. Later, Fujimoto et al. [25] chose a different clustering method, in order to find clusters with different shapes. In 2013, Koprinska et al. [26] combined PSF with artificial neural networks. Five modified PSF variants were jointly used in [27]. Another relevant modification can be found in [28], in which the authors suggested minor modifications to minimize the computation delay, and they argued against using a voting system to find the optimal number of clusters, given its expensive computational cost.
On the other hand, funHDDC [6] is a functional data clustering algorithm [29]. Earlier, in 2012, a clustering algorithm for functional data in low-dimensional subspace was introduced in [30]. By contrast, the authors in [31] proposed a method based on wavelets for detecting patterns and clusters in high-dimensional time-dependent functional data one year later. A method based on similar shapes of FTS was proposed for short-term load forecasting in [32]. A similar approach was proposed one year later, in 2014, making use of techniques for clustering FTS, as well [33]. An approach for clustering functional data on convex function spaces, in which the authors focused on cases with forms of the observations known in advance, can be found in [34]. Another interesting approach, based on the study of the depth measure for functional data, was introduced in [35]. A robust clustering algorithm for stationary FTS was introduced in [36]. Finally, current efforts are directed towards the development of clustering algorithms for multivariate functional data [7,37-39].
In light of the above reviewed papers, the development of a new hybrid algorithm for forecasting FTS is justified. It has been shown that PSF can achieve competitive results in a variety of fields. Moreover, FDA is an emerging field of research in which hardly any forecasting algorithms have been proposed.

Methodology
Hybridization is a very common strategy nowadays. In this sense, the proposed method combines two existing techniques (PSF for forecasting and funHDDC for functional clustering) in order to develop a new hybrid method, able to forecast functional time series while making the most of both of them. This section describes the methodology proposed to forecast FTS. The funPSF flowchart is illustrated in Figure 1, in which seven well-differentiated steps can be seen. These steps are now listed and detailed.

Step 1: Data normalization. It is well known that data can range over large intervals for some variables. In order to minimize this effect, data are usually normalized. Although there are several strategies, the one selected for funPSF is unity-based normalization, which brings all values into the range [0, 1]:

$\bar{X} = \frac{X - X_{min}}{X_{max} - X_{min}}$     (1)

where $\bar{X}$ denotes the normalized data, $X$ the original data, $X_{max}$ the maximum value, and $X_{min}$ the minimum value.
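As an illustration, Step 1 and its inverse (needed later, in Step 7) can be sketched in a few lines of Python; the function names and the sample values are ours, for illustration only:

```python
import numpy as np

def normalize(x):
    """Unity-based normalization: bring all values into [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalize(x_norm, x_min, x_max):
    """Invert the normalization (Step 7 of funPSF)."""
    return np.asarray(x_norm) * (x_max - x_min) + x_min

# Toy hourly demand values (MW)
demand = np.array([820.0, 1040.0, 1310.0, 980.0])
norm, lo, hi = normalize(demand)
# norm[0] is 0.0 (the minimum) and norm[2] is 1.0 (the maximum)
```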
Step 2: Selection of the optimal number of clusters, K. One of the main issues in clustering is the determination of the optimal number of partitions. There exist several indices; however, each of them searches for optimality according to a different criterion [40]. For this reason, in this work, a majority voting system is used that combines the results of four well-known methods: silhouette [41], Dunn [42], Davies-Bouldin [43], and the Bayesian Information Criterion (BIC) [44]. Their application can lead to two situations: either more than two indices (a majority) agree on the same K, and then this K is selected, or no majority is reached. When the second situation takes place, the second best values of the four indices are also considered (along with the first best values), and the K selected is the one pointed to by the majority of all the cases. If necessary, successively lower-ranked values can be analyzed until one K has more votes than the others.
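The voting rule described above can be sketched as follows; the function name is ours, the index values would in practice come from a cluster-validity library, and the example reproduces the 2-2 tie resolved later in the performance assessment:

```python
from collections import Counter

def vote_for_k(best, second_best):
    """
    Majority voting over cluster-validity indices.
    best / second_best map index name -> proposed number of clusters K.
    """
    first_round = Counter(best.values())
    k, votes = first_round.most_common(1)[0]
    if votes > len(best) // 2:  # a strict majority of indices agrees
        return k
    # Otherwise pool in the second best proposals and re-count.
    pooled = Counter(list(best.values()) + list(second_best.values()))
    return pooled.most_common(1)[0][0]

best = {"silhouette": 3, "dunn": 4, "davies_bouldin": 3, "bic": 4}
second = {"silhouette": 4, "dunn": 3, "davies_bouldin": 4, "bic": 6}
# Pooled votes: K=4 gets four, K=3 gets three, K=6 gets one -> K = 4.
```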
Step 3: Application of the funHDDC functional data clustering algorithm. Let $X_1, \ldots, X_n$ be an independent and identically distributed sample of multivariate curves. Under some regularity assumptions, these multivariate curves can be represented as an infinite linear combination of orthogonal eigenfunctions $f_j$:

$X_i(t) = \mu(t) + \sum_{j \geq 1} C_{ij} f_j(t)$     (2)

where $\mu$ is the mean curve and the $C_{ij}$ are uncorrelated random coefficients. This decomposition is known as the Karhunen-Loève expansion. It is worth noting that the computational cost of the Karhunen-Loève transform is low as long as the observed curves are approximated in a finite basis with a reasonable number of basis functions, which is the case in the present work.
In practice, the functional expressions of the observed curves are not known, and we only have access to discrete observations at a finite set of times. A common way to reconstruct the functional form of the data is to assume that each curve can be approximated by a linear combination of $R$ basis functions $(\phi_r)_{r=1,\ldots,R}$ (e.g., splines or Fourier):

$X_i(t) \approx \sum_{r=1}^{R} \gamma_{ir} \phi_r(t)$

One consequence is that only a finite number of eigenfunctions can be computed in Equation (2), and the computational cost of the Karhunen-Loève expansion is consequently low (more details in [6]). funHDDC [7] is a model-based clustering algorithm for either univariate or multivariate functional data, which estimates for each curve the cluster $Z_i \in \{1, \ldots, K\}$ to which it belongs. This algorithm relies on a Karhunen-Loève decomposition per cluster, in order to propose a fine representation of the curves, and on the assumption that, conditionally on the cluster membership $Z_i = k$, the curves' coefficients follow a Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$. Moreover, in order to provide parsimonious modeling, a specific parametrization of the variance matrix $\Sigma_k$ is considered.
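To make the basis-function reconstruction concrete, the following Python sketch fits a small Fourier basis to one simulated day of 24 hourly observations by least squares; the curve, noise level, and basis size are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fourier_design(t, n_basis):
    """Fourier basis (constant + sine/cosine pairs) evaluated at times t in [0, 1)."""
    cols = [np.ones_like(t)]
    for k in range(1, (n_basis - 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)[:, :n_basis]

# One simulated daily curve observed at 24 hourly points.
t = np.linspace(0, 1, 24, endpoint=False)
rng = np.random.default_rng(0)
curve = 100 + 20 * np.sin(2 * np.pi * t) + rng.normal(0, 1, 24)

# Least-squares fit of the basis coefficients (the gamma_ir in the text).
Phi = fourier_design(t, 5)
coefs, *_ = np.linalg.lstsq(Phi, curve, rcond=None)
smooth = Phi @ coefs  # functional representation of the raw observations
```

The fitted coefficients, rather than the raw hourly values, are what a functional clustering algorithm such as funHDDC operates on.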
The inference of the model parameters $\theta = (\pi_k, \mu_k, \Sigma_k)_k$ is done through an EM algorithm:

1. Initialize the cluster memberships.
2. Repeat until convergence:
E step: compute, for each curve $i$ and each cluster $k$, the posterior probability $t_{ik}$ of curve $i$ belonging to cluster $k$, given the current parameters;
M step: update the model parameters as follows: $\pi_k = \frac{1}{n} \sum_{i=1}^{n} t_{ik}$ and $\mu_k = \frac{\sum_{i=1}^{n} t_{ik} \gamma_i}{\sum_{i=1}^{n} t_{ik}}$, with $\Sigma_k$ updated as the corresponding $t_{ik}$-weighted covariance, where $\gamma_i = (\gamma_{i1}, \ldots, \gamma_{iR})^t$ and $W$ contains the inner products between the basis functions $(\phi_r)_r$.
It is worth noting that the EM algorithm exhibits some convergence issues [45] that may lead to the failure of the whole procedure. However, these issues are not expected to occur with time series with daily or seasonal patterns, which is the usual situation in which funPSF can be successfully applied. Figure 2 depicts an illustrative case, in which K = 4. It can be seen that once funHDDC is applied, the original time series is transformed into a sequence of labels: L = {1, 3, 3, 1, 3, 3, 1, 3, 3}. Furthermore, for simplicity, it is considered that each pattern comprises one day (24 h), since this is a very common situation in problems to which funPSF can be efficiently applied.

Step 4: Selection of the length of the pattern sequence or window, W. The optimal number of labels forming W is determined by minimizing the forecasting error when funPSF is applied to a training set. In practice, W is calculated by means of cross-validation, whose validity for evaluating time series prediction was recently discussed in [46]. The number of folds may vary according to the type of problem to be addressed. Figure 3 shows how W is determined when the training set is one year and 12 folds are considered. It can be appreciated that the initial dataset is split into twelve folds (this number is used for simplicity and for a better physical understanding, since each fold represents one month). Then, every fold is used in turn as test data, and the remaining eleven folds as training data. The test errors obtained when varying W from 1 to 10 are denoted as $e_{month}(W)$, for $W = 1, \ldots, 10$, and their averages over the folds as $\bar{e}_W$. Finally, the W that leads to the minimal average error is selected.
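The cross-validation loop of Step 4 can be sketched as follows; `error_fn` is a stand-in for running the forecaster with window `w` on a given train/test split, and all names are ours:

```python
import numpy as np

def select_window(labels, w_candidates, folds, error_fn):
    """
    Choose the window length W by cross-validation (Step 4).
    error_fn(train_labels, test_labels, w) is assumed to run the
    forecaster with window w and return its error on the test fold.
    """
    errors = {w: [] for w in w_candidates}
    for idx in np.array_split(np.arange(len(labels)), folds):
        held_out = set(int(i) for i in idx)
        test = [labels[i] for i in sorted(held_out)]
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        for w in w_candidates:
            errors[w].append(error_fn(train, test, w))
    mean_err = {w: float(np.mean(e)) for w, e in errors.items()}
    return min(mean_err, key=mean_err.get)  # W with the smallest average error
```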
Step 5: Search for the pattern sequence within the historical data. At this stage of the algorithm, the dataset has already been discretized and the length of the pattern sequence determined. Hence, the pattern sequence $S_W$, formed by the last (most recent) W labels, is retrieved. $S_W$ is then searched for within the historical data, and every time there is a match, the values of the next h samples are stored.
Step 6: Generation of the forecast. Finally, all the stored values are averaged, weighting their contribution according to their distance to the present. That is, retrieved values closer to the target h have higher weights than those that occurred further in the past. Mathematically, it can be expressed as follows:

$\hat{X} = \frac{\sum_{i=1}^{M} \frac{1}{d_i} X_i^m}{\sum_{i=1}^{M} \frac{1}{d_i}}$     (3)

where $\hat{X}$ is the forecasted sample, M is the number of matches encountered in the historical data, $X_i^m$ are the i-th actual h values existing after a match, and $d_i$ is the distance (or time elapsed) between the i-th match and the h values that we want to predict. Please note that Equation (3) does not use output feedback, and this could limit the accuracy of the predictions, as discussed in [47]. However, the original PSF and other published variants have shown prediction robustness despite this fact.

Figure 4 illustrates Steps 5 and 6. For this particular case, h = 24, W = 2, and the pattern sequence is $S_W = \{1, 3\}$. Two matches are depicted, and the weighted average of the values found just after the matches is the forecast itself.

Step 7: Data denormalization. Once the forecast is generated, this value must be denormalized, leading to the final forecast, X. This consists of inverting the transformation applied in Step 1:

$X = \hat{X} (X_{max} - X_{min}) + X_{min}$

Finally, Figure 5 shows the pseudocode for funPSF, where D denotes the dataset, K the number of clusters, L the labeled dataset such that $L = [L_1, L_2, \ldots, L_{d-2}, L_{d-1}]$, W the pattern sequence length or window, T the test set, and X(d) the forecasts for all data in T.
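Steps 5 and 6 can be sketched together as follows. This is an illustrative reimplementation, not the authors' R code; it assumes one label and one data unit per day and uses weights inversely proportional to the elapsed distance $d_i$:

```python
import numpy as np

def funpsf_forecast(labels, values, w):
    """
    Sketch of Steps 5-6: search the last w labels (the pattern
    sequence S_W) within the history and average the data units
    that followed each match, weighted by 1/d_i, where d_i is the
    time elapsed since the match.
    """
    pattern = labels[-w:]
    matches = []  # (distance to present, values of the following day)
    for i in range(len(labels) - w):  # excludes the pattern's own position
        if labels[i:i + w] == pattern:
            distance = len(labels) - (i + w)
            matches.append((distance, np.atleast_1d(values[i + w])))
    if not matches:
        raise ValueError("pattern sequence not found; try a smaller W")
    weights = np.array([1.0 / d for d, _ in matches])
    stacked = np.stack([v for _, v in matches])
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()
```

For a short label sequence in the style of Figure 2, e.g. {1, 3, 3, 1, 3, 3, 1, 3} with W = 2, the pattern {1, 3} matches twice, and the forecast is the inverse-distance weighted average of the two days that followed the matches.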

Results
The application of funPSF to real-world data is reported in this section. Section 4.1 describes the electricity demand dataset from Uruguay that was used. The quality measures applied to assess the funPSF performance are introduced in Section 4.2. Finally, the results achieved are shown and discussed in Section 4.4.

Data Description
This section describes the dataset that was used to evaluate the funPSF performance. It consists of hourly electricity demand data from Uruguay, covering 2007-2014 (70,128 samples). The full dataset is available upon request from the authors.
Data did not exhibit a significant number of outliers: the application of the test proposed in [48] showed that less than 0.01% of the values were outliers. This fact led to discarding methods particularly designed to deal with volatility or spike values.
In particular, Figure 6 illustrates the data distribution for each day of the week, as well as for the full dataset. The x-axis of all subfigures denotes the amount of electricity in MW, while the y-axis denotes the accumulated frequency of such values. Differences in mean and skewness can be appreciated across the days of the week. In general, working days were slightly left-skewed, whereas weekends were slightly right-skewed, which means that lower consumption was reported on weekends, as expected. This fact justified generating, at least, different models for different types of days.

Quality Measures
Given the nature of the data to be processed, three metrics were chosen to assess the performance of funPSF.
The first one was the mean absolute error (MAE), which is typically used in the electricity time series context. Operators are interested in the total amount of energy that must be generated, and that is the reason why they typically use absolute errors.
The second widely-used metric is the mean absolute percentage error (MAPE). This error provides information that is easier to interpret and compare.
Finally, the Root Mean Squared Error (RMSE) was also provided. The associated formulas are:

$MAE = \frac{1}{N} \sum_{h=1}^{N} |D_h - \hat{D}_h|$

$MAPE = \frac{100}{N} \sum_{h=1}^{N} \frac{|D_h - \hat{D}_h|}{D_h}$

$RMSE = \sqrt{\frac{1}{N} \sum_{h=1}^{N} (D_h - \hat{D}_h)^2}$

where $D_h$ denotes the actual demand, $\hat{D}_h$ the predicted demand for period h, and N the number of predicted periods.
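For completeness, the three metrics can be implemented directly; the following Python functions and toy values are illustrative, not part of the paper's code:

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(a - p)))

def mape(actual, predicted):
    """Mean absolute percentage error (in %)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(100 * np.mean(np.abs(a - p) / a))

def rmse(actual, predicted):
    """Root mean squared error."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Toy example: three hourly demands (MW) and their forecasts.
demand = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 380.0]
```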

Hardware and Software Resources
The experiments were carried out on the cluster physically located at the Data Science & Big Data Lab, Pablo de Olavide University. Each node forming the cluster shares the features listed in Tables 1 and 2.

Performance Assessment
To assess the performance of funPSF, the years 2013 and 2014 were selected as test sets. The remaining years (2008-2012) were used as training sets. The standard 70-30% data division into training and test was applied; in particular, five out of seven years were used for training, which represents 71.43%. As shown in the original PSF paper, the algorithm needs no more than two years to find patterns in the historical data to perform accurate forecasts. As a consequence, it was determined that five years was a suitable choice.
Another relevant issue is the prediction horizon, h, which was set to one day given the typology of the problem, since the electricity industry typically plans the amount of energy that needs to be generated one day in advance.
In Section 4.1, it is shown that different days may have different data distributions. For this reason, while configuring the experimental setup, it was decided to use two strategies. First, funPSF was applied to the entire dataset, generating one single model regardless of the day of the week it had to be forecasted. Second, seven different models were generated, one per day, in order to better discover pattern sequences associated with different days of the week and festivities. It will be referred to as the 7-funPSF model in the rest of the paper.
The first parameter that must be calculated is the number of clusters that funHDDC must create, K. Values for K from 2-10 were considered in this study. The application of the majority voting system proposed in Section 3 yielded K = 4. Table 3 summarizes the best and second best values individually proposed by all the indices. It can be seen that Dunn and BIC proposed K = 4 as the best value, whereas Davies-Bouldin and silhouette proposed K = 3. Since no K had more votes than the others, the second best values had to be observed. In this case, both Davies-Bouldin and silhouette also pointed to K = 4 as the best choice, whereas Dunn proposed K = 3 and BIC K = 6. Therefore, since K = 4 obtained four votes, it became the most voted K and was selected as the number of clusters to be created.

The second, and last, parameter required by funPSF is the length of the pattern sequence to be searched for, W. Values for W from 2-10 were considered in this study. The cross-validation approach applied to the training set revealed that W = 2 was the best value (MAE = 56.12 and MAPE = 3.98%).
Moreover, ANN and ARIMA models have been applied, both trained with the same training and test sets. The configuration found to be optimal was: (i) for the ANN, 24 input neurons, one hidden layer, backpropagation, and the sigmoid activation function; (ii) ARIMA(0, 1, 1). Table 4 shows a performance summary for all four considered methods, reporting both MAE and MAPE errors. It can be seen that 7-funPSF clearly outperformed all of them, obtaining MAE = 68.33 on average. This value represents a 28.07% improvement compared to funPSF. Furthermore, the improvement reached 35.80% and 53.37% compared to ANN and ARIMA, respectively. Analogously, 7-funPSF obtained the best MAPE: 5.63%. The relative improvements were 29.15%, 42.47%, and 58.00% when compared to funPSF, ANN, and ARIMA, respectively.

A thorough analysis of how MAE, MAPE, and RMSE behaved for all methods is reported in Tables 5-13. Table 5 shows how MAE was distributed hourly. It can be seen that 7-funPSF obtained lower errors for every hour and, overall, the error remained constant (except for Hours 19-21, where a moderate error increase can be seen). By contrast, ARIMA exhibited the worst behavior for 23 out of 24 h (all of them except for Hour 1). funPSF showed better performance than ANN in almost all the hours, except for Hours 8-13. However, ANN's performance was very poor during the nighttime (Hours 1-6).

Another relevant piece of information that operators need to know is how algorithms behave in seasonal terms. For this reason, Table 6 shows how MAE was distributed over the months. The behavior was similar for all studied methods: they reached the worst values in summer time (December and January) and at the beginning of winter (June). 7-funPSF, again, obtained the best results for every month, whereas ARIMA obtained the worst results for all months. ANN slightly outperformed funPSF only in May (104.27 versus 106.77).
The third analysis carried out consisted of evaluating how the algorithms behaved for different day typologies. Specifically, Table 7 analyzes MAE values distinguishing between regular days and festivities. For regular days, MAE was almost constant for all methods. However, funPSF reported the worst value for Mondays. This fact is due to the search strategy followed, in which pattern sequences incorporate changes from weekends to working days. Fortunately, this effect is completely removed in 7-funPSF. By contrast, the situation for festivities was different. First, the reported errors were considerably higher than those of regular days. Second, funPSF performed better than 7-funPSF. A possible justification is that, with 7-funPSF, the size of the training sets becomes smaller, and the forecasting of these few days should not be done using this strategy. Overall, ANN and ARIMA kept reporting worse results than funPSF and 7-funPSF. Consequently, the development of new models for forecasting these particular days is suggested.

Similar conclusions can be drawn for MAPE, but with slight nuances. Table 8 reports the MAPE hourly distribution. 7-funPSF achieved the best results for all 24 h, with constant error. However, the behavior of ANN and ARIMA was quite different, since their associated MAPE oscillated, combining small and high errors. Nevertheless, ARIMA reached the worst results for 21 out of 24 h, and ANN only slightly outperformed funPSF in four hours.
MAPE results in terms of months had no remarkable situations (see Table 9). 7-funPSF obtained the best results for all months; funPSF obtained the second best results for 11 out 12 months (December being the worst month, even worse than ANN and ARIMA); ARIMA reached the worst results for 11 out of 12 months; and ANN achieved results better than ARIMA, but worse than funPSF in 11 out of 12 months. Overall, funPSF, 7-funPSF, and ANN showed constant errors over time (no high fluctuations were reported throughout the different months).
MAPE differences between regular days and festivities are shown in Table 10. It is worth noting that 7-funPSF and funPSF obtained quite constant values for regular days and obtained the best and second best values for all considered measures. ANN and ARIMA showed higher fluctuations in their MAPE and reported the second worst and worst values for all days (except for two cases). As for festivities, again, funPSF showed better performance than the remaining methods on average. It reached the best value in five out of seven days; 7-funPSF only obtained better results in two cases. ANN and ARIMA alternated between the worst and second worst results. Broadly speaking, the reported errors for festivities were considerably larger than those of regular days, thus suggesting the development of particular models for these kinds of days.
Finally, Tables 11-13 report the accuracy in terms of RMSE. Almost the same conclusions can be drawn as with the previous metrics. Table 11 shows RMSE for every hour of the day. All methods exhibited two differentiated trends: from 4:00-21:00, RMSE was constantly increasing, whereas from 21:00-4:00, errors decreased. There were apparently no hours in which the algorithms worked particularly well; analogously, it seems that there were no hours in which the errors reached particularly high values. 7-funPSF obtained the best results in 21 out of 24 h. The remaining three best results were obtained by funPSF (from 1:00-3:00). Neither ANN nor ARIMA reached the smallest error at any hour.

Table 12 summarizes the estimated RMSE distributed over the months. As expected, 7-funPSF obtained the best results for all twelve months. Only its RMSE for May, November, and December exceeded 100, contrary to what the other methods reached: funPSF obtained RMSE of less than 100 in only three months, and ARIMA and ANN in only two. April and May were particularly difficult months to predict for ANN and ARIMA, with RMSE greater than 200. These values were quite unexpected, especially taking into account that the worst value for funPSF was less than 150. In any case, the behavior shown matched that of MAPE and MAE.

Table 13 summarizes RMSE disaggregated between regular days and festivities. It is especially remarkable that RMSE reached extremely high values for Tuesdays and Wednesdays when these days were festivities. This fact can be explained because they are central days of the week, and therefore, it is difficult to forecast such a pattern without external information. Overall, except for these two cases, the results were smooth, and no remarkable differences were reported between the best and worst values for all the methods.

Finally, a test of statistical significance was carried out.
First, the D'Agostino-Pearson test [49] concluded that data obtained for this study did not meet the criteria of normality (p < 0.005). For this reason, a non-parametric test was selected to validate statistically the differences in the efficiency for all models. The non-parametric process followed was described in [50] and involved the use of the Friedman test. After executing the test, the obtained p-value was less than 0.0001. Therefore, the null hypothesis was rejected, and it can be concluded that the differences in the models were statistically significant.
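For reference, the Friedman statistic used in this kind of test can be computed with a few lines of NumPy; this sketch ranks the methods within each block (e.g., each month), ignores ties for simplicity, and is not the exact implementation used in the paper:

```python
import numpy as np

def friedman_statistic(errors):
    """
    Friedman chi-square statistic for n blocks x k methods.
    Under the null hypothesis (all methods perform equally), the
    statistic follows a chi-square distribution with k - 1 degrees
    of freedom; large values lead to rejecting the null hypothesis.
    """
    errors = np.asarray(errors, float)
    n, k = errors.shape
    # Rank methods within each block (1 = smallest error); ties ignored.
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1
    mean_ranks = ranks.mean(axis=0)
    return 12 * n / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2) ** 2)

# If one method is always best and another always worst across 4 blocks,
# the statistic (here 8.0) exceeds the 5% critical value for 2 degrees
# of freedom (about 5.99), so the differences are significant.
```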

Conclusions
A novel algorithm to forecast functional time series was proposed in this paper, as a result of the gap found after reviewing the related literature. The main principle underlying the algorithm lies in the discovery of patterns within the historical data. To achieve such a goal, a functional clustering algorithm is first applied to the time series. This step transforms the time series into a sequence of labels, thus discretizing the original series. Later, a sequence of labels of arbitrary length, preceding the sample to be forecasted, is searched for. As the last step, the values occurring after matches between the pattern sequence and the historical data are averaged, with weights associated with their distance. Real-world data related to electricity demand were used to assess the performance of the algorithm. Very promising results were reported in comparison to other methods. It was also shown that the proposed approach obtained error values that were almost constant over time. The code is available for researchers in a ready-to-use R package (available upon acceptance). The main limitation of the proposed approach lies in its dependency on seasonal patterns within the data: for chaotic time series, or simply for time series with no clear daily or seasonal patterns, funPSF might not work as desired. Nevertheless, this method provides fast and accurate forecasts in the context of electricity demand.
Future works are directed towards the development of novel funPSF models particularly adapted to dealing with anomalous data. The handling of multivariate functional time series is also a challenging task that will be addressed. In this sense, clustering algorithms to analyze multivariate data, as well as more suitable metrics must be used to fulfil such a task.

Acknowledgments:
The authors also thank Prof. Jairo Cugliari for providing information about the datasets.

Conflicts of Interest:
The authors declare no conflict of interest.