Fusing Nature with Computational Science for Optimal Signal Extraction

Abstract: Fusing nature with computational science has proved to be of paramount importance, and researchers have shown growing enthusiasm for inventing and developing nature-inspired algorithms to solve complex problems across disciplines. Inevitably, these advances have accelerated the development of data science, where nature-inspired algorithms are changing the traditional way of processing data. This paper proposes a hybrid approach, SSA-GA, which incorporates the optimization merits of the genetic algorithm (GA) to advance Singular Spectrum Analysis (SSA). The approach further boosts SSA's forecasting performance via better and more efficient grouping. Given the performance of SSA-GA on 100 real time series from a variety of fields, the newly proposed approach proves to be computationally efficient and robust, with improved forecasting performance.


Introduction
The vigorous advancement of data science and computational technologies in recent decades has significantly altered the way interdisciplinary research is conducted. In turn, these interdisciplinary developments have injected novel ways of thinking and problem-solving capabilities back into the progression of computational algorithms. Scientists march on the path of seeking knowledge of everything encountered in life, and nature, which acts as the most inclusive home to all, always seems to have wise answers. Just as the phrase "let nature take its course" suggests, researchers also seek means to better appreciate the solutions nature may offer. It is not new for researchers to invent and implement nature-inspired algorithms as intelligent solutions to complex problems, and these achievements continue to bring breakthroughs across science and technology. A recent review of nature-inspired algorithms can be found in [1]. Among these, some well-established models include: neural networks [2], inspired by the mechanism of biological neural networks, which have been widely applied and developed into a large branch containing various computational architectures; swarm intelligence (SI) [3,4], which has contributed to intelligent advancements in both scientific and engineering domains and has spawned a wide spectrum of algorithms (e.g., the bat algorithm, ant colony optimization and the firefly algorithm) in recent decades [1]; and the genetic algorithm (GA) [5], inspired by the theory of natural evolution, which has promoted the trend of evolutionary algorithms and been widely applied to searching and optimization.
The list of nature-inspired algorithms goes on, with new ones added regularly. We do not intend to review them all here, but the breadth of their development and implementation certainly reflects the significance of seeking knowledge through the means offered by nature.
The full mathematical details of SSA will not be reproduced here. Instead, a brief summary of the process is presented below, mainly following [35].
In the decomposition stage, given a selected window length L, the one-dimensional time series is embedded into a multi-dimensional representation, forming a trajectory matrix; this is followed by SVD, which yields a small number of independent and interpretable components. The second stage, reconstruction, starts from the important "grouping" step. Briefly, this step aims to gather the components capturing different characteristics of the series, e.g., trend and seasonality, whilst leaving out those corresponding to noise. Lastly, the grouped components are transformed back into a one-dimensional time series, the signal, via diagonal averaging.
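As a rough illustration of these two stages, the following sketch (our own simplified implementation; the function name `ssa_reconstruct` and the noisy-sine example are illustrative assumptions, not the authors' code) embeds the series into a trajectory matrix, applies SVD, keeps a chosen group of components and diagonally averages back to a one-dimensional signal:

```python
import numpy as np

def ssa_reconstruct(y, L, components):
    """Reconstruct a signal from selected SSA components.

    y: 1-D series of length N; L: window length (2 <= L <= N/2);
    components: indices of the eigentriples kept in the grouping step.
    """
    N = len(y)
    K = N - L + 1
    # Embedding: build the L x K trajectory matrix from lagged copies of y.
    X = np.column_stack([y[i:i + L] for i in range(K)])
    # SVD of the trajectory matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Grouping: sum only the selected elementary matrices.
    X_hat = sum(s[j] * np.outer(U[:, j], Vt[j]) for j in components)
    # Diagonal averaging (Hankelisation) back to a 1-D series.
    rec = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):
        for k in range(K):
            rec[i + k] += X_hat[i, k]
            counts[i + k] += 1
    return rec / counts

# Noisy sine: the leading pair of components captures the oscillation.
t = np.linspace(0, 4 * np.pi, 200)
y = np.sin(t) + 0.3 * np.random.default_rng(0).normal(size=200)
signal = ssa_reconstruct(y, L=40, components=[0, 1])
```

For a dominant periodic component buried in noise, keeping the leading eigentriple pair typically recovers the oscillation with a much smaller error than the raw noisy series.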
A common technique in SSA's grouping stage is to choose the first r components to reconstruct the signal, where r is selected to minimize the in-sample Root Mean Square Error (RMSE) or the out-of-sample forecasting RMSE. Selecting the first r components stems from the common belief that later components, having smaller variances and higher frequencies, relate to noise in the time series.
Hassani et al. (2016) [35] proposed an alternative approach, SSA-CT, inspired by CT. They showed that using the first r components to reconstruct the signal does not necessarily produce the minimum RMSE. For a given window length L, SSA-CT considers all 2^L possible combinations of components to reconstruct the signal and then uses the combination that produces the minimum RMSE. Although SSA-CT can improve on basic SSA's results, checking all 2^L possible combinations is computationally expensive and time consuming.
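To see why the exhaustive search is costly, the following short sketch (the helper name `all_groupings` is our own) enumerates SSA-CT's search space, every subset of the L components, which doubles with each extra component:

```python
from itertools import chain, combinations

def all_groupings(L):
    """Enumerate every subset of the L components: SSA-CT's search space."""
    return chain.from_iterable(combinations(range(L), r) for r in range(L + 1))

# The space doubles with each extra component, so exhaustive search
# quickly becomes infeasible: 2**10 = 1024, 2**20 = 1,048,576, ...
assert sum(1 for _ in all_groupings(10)) == 2 ** 10
```

Even a modest window length of L = 30 already implies more than a billion candidate groupings, which motivates a guided search such as GA.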

SSA-GA
Consider the non-zero real-valued time series {y_t}, t = 1, ..., N. If the aim is to extract the signal from noise, all available data are used to calculate the RMSE. If the main aim is to forecast the time series, one may divide the series into two parts, using the first part as training data and the second part to evaluate out-of-sample forecasts.

1. Run a basic SSA on the training data and find the optimum grouping parameter r.
2. Define a chromosome C_i as a binary vector of length L, C_i = (c_i1, ..., c_iL), where c_ij = 1 if the jth component is included in the signal reconstruction and c_ij = 0 otherwise.

5. Build a population containing M chromosomes, C_1, ..., C_M. Generate K% (K > 70) of the chromosomes randomly (from a uniform distribution); this produces chromosomes C_1 to C_K. Add C_{K+1} = (0, 0, ..., 0) and C_{K+2} = (1, 1, ..., 1) to the population as extreme solutions. The rest of the population consists of copies of the basic SSA solution, i.e., chromosomes with c_j = 1 for j <= r and c_j = 0 otherwise, where r is the grouping parameter from basic SSA (step 1).
6. Use a binary crossover function to produce M offspring chromosomes. A simple crossover function produces offspring as follows:
   (a) Pair the chromosomes in the population randomly.
   (b) For a given pair of chromosomes C_i and C_j, generate a random integer d from {1, ..., L - 1} and produce two offspring by swapping their first d genes:
       First offspring = (c_i1, ..., c_id, c_j(d+1), ..., c_jL)
       Second offspring = (c_j1, ..., c_jd, c_i(d+1), ..., c_iL)

7. Produce a weight matrix W_i for each of the M + M chromosomes.
8. Reconstruct the signal for each weight matrix W_i.
9. For each chromosome, generate in-sample h-step-ahead forecasts and calculate the in-sample RMSE for all M + M chromosomes. Select the M chromosomes with the smallest RMSE as the new population.
10. Repeat steps 6 to 9 until the minimum RMSE in the population does not improve for several iterations.
11. Beginning with L = 2, repeat steps 1 to 10 for 2 <= L <= N/2 to find the window length L and grouping parameter that minimize the in-sample RMSE.
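The evolutionary loop of steps 5 to 10 can be sketched as follows. This is our own illustrative skeleton, not the authors' implementation: the fitness here is a toy stand-in (Hamming distance to a known "true" grouping) for the in-sample forecasting RMSE, and the name `run_ga` and the parameter defaults are assumptions:

```python
import random

def run_ga(fitness, L, M=20, r=2, generations=50, patience=5, seed=0):
    """Evolve binary grouping vectors of length L, minimising `fitness`
    (in SSA-GA the fitness would be the in-sample forecasting RMSE)."""
    rng = random.Random(seed)
    # Step 5: ~80% random chromosomes, plus the two extreme solutions
    # and copies of the basic SSA solution (first r components).
    pop = [[rng.randint(0, 1) for _ in range(L)] for _ in range(int(0.8 * M))]
    pop.append([0] * L)
    pop.append([1] * L)
    basic = [1] * r + [0] * (L - r)
    while len(pop) < M:
        pop.append(basic[:])
    best, stall = min(fitness(c) for c in pop), 0
    for _ in range(generations):
        # Step 6: single-point crossover on random pairs -> M offspring.
        offspring = []
        while len(offspring) < M:
            ci, cj = rng.sample(pop, 2)
            d = rng.randint(1, L - 1)
            offspring += [ci[:d] + cj[d:], cj[:d] + ci[d:]]
        # Steps 7-9: score parents and offspring, keep the best M.
        pop = sorted(pop + offspring[:M], key=fitness)[:M]
        # Step 10: stop when the best fitness stops improving.
        if fitness(pop[0]) < best:
            best, stall = fitness(pop[0]), 0
        else:
            stall += 1
            if stall >= patience:
                break
    return pop[0]

# Toy example: components 0, 2 and 5 carry the signal.
target = [1, 0, 1, 0, 0, 1, 0, 0]
best = run_ga(lambda c: sum(a != b for a, b in zip(c, target)), L=8)
```

Because the basic SSA solution is seeded into the population and selection is elitist, the returned grouping can never be worse (under the chosen fitness) than basic SSA's first-r grouping, mirroring the guarantee discussed below.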
Adding the basic SSA solution to the initial population in step 5 boosts the search speed and guarantees that the final grouping will be at least as accurate as basic SSA. SSA-GA, as described above, thus expedites SSA-CT's search for the minimum-RMSE solution while guaranteeing that the final solution is at least as good as basic SSA. It should be noted, however, that the minimum in-sample RMSE does not necessarily guarantee the minimum out-of-sample RMSE.

Empirical Results
We used a set of 100 real time series, with different sampling frequencies, normality, stationarity and skewness characteristics, to compare the accuracy of SSA-GA with basic SSA. The dataset was accessed through Data Market (http://datamarket.com (accessed on 12 January 2021)) and was previously employed by Ghodsi et al. [47] and Hassani et al. [36] to compare different SSA-based forecasting methods. Table 1 shows a description of each time series in the dataset. The name and description of each time series, along with the codes assigned to improve presentation, are presented in Table A1 in Appendix A. Table A2 presents descriptive statistics for all time series, including skewness statistics and results from the normality (Shapiro-Wilk) and stationarity (Augmented Dickey-Fuller) tests, to give the reader a rich understanding of the nature of the real data. As can be seen, the data come from fields including energy, finance, health, tourism, the housing market, crime, agriculture, economics, chemistry, ecology and production, to name a few. Figure 1 shows a selection of 9 of the 100 series used in this study.

For each time series, the out-of-sample forecasting RMSE was calculated using both basic SSA and SSA-GA, for very-short, short, long and very-long-term forecasting horizons (i.e., h = 1, 3, 6, 12). To compare the RMSEs of the two methods, we used the RRMSE, defined as the ratio of SSA-GA's RMSE to basic SSA's RMSE (i.e., RRMSE = RMSE_SSA-GA / RMSE_basicSSA). We also employed the Kolmogorov-Smirnov Predictive Accuracy (KSPA) test [48] to compare the accuracy of the two methods. Table A3 shows the RRMSEs and KSPA p-values for each time series, and descriptive summaries of the RRMSEs are given in Table 2. As can be seen, SSA-GA's results are not necessarily the same as basic SSA's. As mentioned before, SSA-GA's in-sample RMSE is always at least as good as basic SSA's; however, in-sample accuracy does not guarantee out-of-sample accuracy.
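These comparison measures can be sketched in a few lines; the helper names (`rrmse`, `kspa`) and the simulated error series are illustrative assumptions, not the paper's data, and the KS comparison below is a sketch in the spirit of the KSPA test rather than its exact formulation:

```python
import numpy as np
from scipy import stats

def rrmse(err_ga, err_basic):
    """RRMSE = RMSE(SSA-GA) / RMSE(basic SSA); values < 1 favour SSA-GA."""
    rmse = lambda e: np.sqrt(np.mean(np.square(e)))
    return rmse(err_ga) / rmse(err_basic)

def kspa(err_a, err_b):
    """Two-sample KS test on absolute forecast errors: a small p-value
    suggests the two error distributions differ."""
    return stats.ks_2samp(np.abs(err_a), np.abs(err_b)).pvalue

# Simulated out-of-sample errors for the two methods.
rng = np.random.default_rng(42)
e_basic = rng.normal(0, 1.0, 200)   # hypothetical basic-SSA errors
e_ga = rng.normal(0, 0.8, 200)      # hypothetical SSA-GA errors
ratio = rrmse(e_ga, e_basic)        # < 1 when SSA-GA is more accurate
pval = kspa(e_ga, e_basic)
```

In the study, this pair of statistics is computed per series and per horizon, and the RRMSE distribution is then summarised across the 100 series.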
In all cases where SSA-GA's result differs from basic SSA's, SSA-GA has better in-sample forecasting accuracy. However, as is evident from Table 2, this does not necessarily improve out-of-sample forecasting accuracy. Figure 2 shows that the mode of the RRMSEs across these 100 cases is less than 1 for all forecasting horizons. According to the results in Table 2 and Figure 2, SSA-GA and basic SSA do not dominate each other in out-of-sample forecasting accuracy. This could be the result of over-fitting in SSA-GA, since SSA-GA is always at least as accurate as basic SSA for in-sample forecasting.

To further investigate the accuracy of SSA-GA in forecasting time series with different characteristics, the Kruskal-Wallis test was employed to compare the RRMSEs of time series with different features; the results are given in Table 2. As the test results show, the sampling frequency, stationarity, normality and skewness of the time series do not affect the RRMSE significantly. In other words, the difference in accuracy between SSA-GA and basic SSA is not affected by these factors. According to these results, although SSA-GA has better in-sample forecasting accuracy, it may suffer from over-fitting in out-of-sample forecasting. Nevertheless, SSA-GA, as an advanced version of SSA-CT, can improve basic SSA's results while reducing SSA-CT's computational expense.
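The Kruskal-Wallis comparison of RRMSEs across series characteristics can be illustrated as follows; the grouped RRMSE values below are simulated purely for illustration (the group names and sizes are assumptions, not the study's data):

```python
import numpy as np
from scipy import stats

# Simulated RRMSE values grouped by one series characteristic
# (e.g., stationary vs non-stationary), mirroring the check that the
# characteristic does not shift the RRMSE distribution.
rng = np.random.default_rng(7)
rrmse_stationary = rng.normal(1.0, 0.1, 40)
rrmse_nonstationary = rng.normal(1.0, 0.1, 60)
stat, p = stats.kruskal(rrmse_stationary, rrmse_nonstationary)
# A large p-value means no evidence that the characteristic affects RRMSE.
```

The same test is repeated for each characteristic (sampling frequency, stationarity, normality, skewness) on the observed RRMSEs.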

Conclusions
Nature-inspired algorithms have shown remarkable performance in solving complex problems that traditional computational approaches fail or struggle to handle, as evidenced by their achievements across subjects in searching, forecasting, optimisation and signal extraction. Those who better appreciate the means of nature tend to better understand the natural mechanisms underlying the broad scale of science and technology. Given the emerging trend of fusing nature with computational science over the past decades, this paper joins the forces of SSA and GA to achieve more efficient and accurate forecasts.
To the best of our knowledge, this paper is the first to combine the powerful time series analysis technique SSA with the widely applied and well-established GA. This research follows the line of [35], in which the authors proposed the hybrid SSA-CT technique, employing CT to improve the grouping stage of basic SSA. As a developed version, SSA-GA adopts the optimization merits of GA to further improve the efficiency of grouping and to optimize the signal reconstruction. The performance of this newly proposed hybrid approach was verified on a collection of 100 time series covering a diverse range of subjects, and promising results were achieved, especially for in-sample reconstruction. To clearly demonstrate the comparison and critically evaluate performance, we employed the RMSE, RRMSE, KSPA test and Kruskal-Wallis test, giving a comprehensive investigation of SSA-GA in comparison with basic SSA. In general, with SSA-CT's computational efficiency much improved and a better grouping process, the signal reconstruction has been significantly improved, while the out-of-sample forecasting shows stable performance that is as robust as SSA-CT's. Considering that basic SSA is already a powerful tool for reconstruction and forecasting with outstanding performance, even small improvements and efficiency gains can represent large steps in processing data at scale. We recognise the potential over-fitting issue in out-of-sample forecasting, and addressing it will be one direction of our future research. Advanced versions of nature-inspired algorithms could also be explored, alone or jointly, to further improve one or more stages of SSA, as well as multivariate SSA.