Genetic Feature Selection Applied to KOSPI and Cryptocurrency Price Prediction

Abstract: Feature selection reduces the dimension of input variables by eliminating irrelevant features. We propose feature selection techniques based on a genetic algorithm, which is a metaheuristic inspired by the process of natural selection. We compare two types of feature selection for predicting a stock market index and cryptocurrency prices. The first method is a newly devised genetic filter involving a fitness function designed to increase the relevance between the target and the selected features and to decrease the redundancy among the selected features. The second method is a genetic wrapper, which finds better feature subsets related to the KOSPI by exploring the solution space more thoroughly. Both genetic feature selection methods improved the predictive performance of various regression functions. Our best model was applied to predict the KOSPI, cryptocurrency prices, and their respective trends after COVID-19.


Introduction
When using multidimensional data in the real world, the number of cases required to find the best feature subsets increases exponentially, and the problem of finding a globally optimal feature subset is NP-hard [1]. Rather than finding a global optimum by exploring the entire solution space, heuristic search techniques [2] are used to find a reasonable solution within a constrained time frame. In stock markets, a specific index is related to a number of other economic indicators; however, it is difficult to predict a stock index, which tends to be non-linear, uncertain, and irregular. There are two mainstream approaches to predicting a stock index: improving feature selection techniques and improving the regression models used for prediction. We take the former approach and predict the stock market index using various machine learning methods. This study is a new attempt to predict the KOSPI using various external variables rather than internal time series data. The predictive performance was improved through feature selection that chooses meaningful variables from among many external variables. We propose two new types of feature selection techniques using a genetic algorithm [3,4], which is a metaheuristic [5] method. The first technique is a genetic filter [6,7], and the second is a genetic wrapper [8,9]. In our genetic filter, a new fitness function was applied to overcome the disadvantages of traditional filter-based feature selection. In addition, our genetic wrapper can find the optimal feature subset by exploring the solution space more thoroughly. The remainder of the paper is organized as follows. The background is explained in Section 2. In Section 3, the operation and structure of our genetic algorithm for feature selection are introduced. Section 4 contains the results of KOSPI prediction using feature selection techniques with various machine learning methods.
In addition, our best model was applied to predict the KOSPI, cryptocurrency price, and their respective trends after COVID-19. Our conclusions are presented in Section 5.

Feature Selection
Machine learning algorithms can be constructed using either linear or non-linear models. Because the performance of machine learning is highly dependent on the quantity and quality of data, the ideal input data contain information that is neither excessive nor insufficient. Moreover, high-dimensional data may contain redundant or irrelevant features; thus, the latent space that effectively explains the target variable may be smaller than the original input space. Dimensionality reduction transforms data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains important properties of the original data. It finds a latent space by compressing the original data or removing noisy data. Feature selection [10] is a representative method for reducing the dimension of data. Filter methods use a simple but fast scoring function to select features, whereas wrapper methods use a predictive model to score a feature subset. Filter-based feature selection is suitable for ranking features by how relevant each feature is, rather than for deriving the best feature subset for the target data. Although filter-based feature selection is computationally efficient compared with wrapper methods, it may select redundant features when it does not consider the relationships between the selected features. In contrast, wrapper-based feature selection selects the feature subset that shows the best predictive accuracy. It requires significant time to train and test a new model for each feature subset; nonetheless, it usually provides prominent feature subsets for the particular learning model.

Genetic Algorithm
A genetic algorithm is one of the metaheuristic techniques for global optimization; it explores the solution space by imitating the evolutionary process of living things in the natural world. It is widely used for solving non-linear or otherwise intractable complex problems in fields such as engineering and natural science [11][12][13][14]. To find an optimal solution with a genetic algorithm, two things must be defined: the solution of the problem should be expressed in the form of a chromosome, and a fitness function has to be derived to evaluate each chromosome. This series of processes is similar to the process of confirming how well an entity adapts to its environment. Each generation consists of a population, which can be regarded as a set of chromosomes. Selection is performed based on the fitness of each chromosome, after which crossover, replacement, and mutation are performed. By repeating the above process, the generated solutions improve, and the solution space is searched until specific conditions are satisfied.

Stock Index Prediction
There have been various methods and frameworks for analyzing stock indices. Among these are portfolio theory [15] and the efficient market hypothesis [16], based on rational expectation theory, which assumes that economic agents are rational. On the contrary, studies of stock indices using behavioral finance theory [17] also exist. Many studies have attempted to analyze stock indices by combining data mining [18] with the above viewpoints. Tsai et al. [19] used optimized feature selection through a combination of a genetic algorithm, principal component analysis, and decision trees, and predicted stock prices using neural networks. Längkvist et al. [20] proposed a method that applies deep learning to multivariate time series data including stock indices, social media, transaction volume, market conditions, and political and economic factors. Zhang et al. [21] proposed a model that performs feature selection using minimum redundancy maximum relevance [22,23] on stock index data. Naik et al. [24] improved the performance of stock index prediction using the Boruta feature selection algorithm [25] with an artificial neural network [26]. Yuan et al. [27] compared the performance of stock index prediction models such as the support vector machine (SVM) [28], random forest [29], and an artificial neural network. Hu et al. [30] improved the performance of stock index prediction by improving Harris hawks optimization.

Encoding and Fitness
The initial task when using a genetic algorithm is to design an encoding scheme and a fitness function. The solution of the genetic algorithm is expressed in the form of a chromosome through an appropriate data structure, which is called encoding. In this study, encoding was conducted with a binary bit string indicating whether each feature is included or not. In the first experiment, a 264-bit string was used as a chromosome to predict the KOSPI, and in the second experiment, a 268-bit string was used to predict a cryptocurrency price. In a genetic algorithm, fitness is measured to evaluate how well an encoded chromosome solves the problem. The fitness is obtained from the implemented fitness function, and we used different fitness functions for the genetic filter and the genetic wrapper. The fitness of our genetic filter is a numerical value obtained by combining the correlations between the selected features, and the fitness of our genetic wrapper is the mean absolute error between the target values and the predicted values of the machine learning algorithms preceded by feature selection.
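As a minimal sketch (not the authors' code), the binary encoding can be expressed as follows; the helper names and the 50% initial selection probability are our own assumptions:

```python
import random

N_FEATURES = 264  # number of candidate features in the KOSPI experiment

def random_chromosome(n_features=N_FEATURES, p_select=0.5):
    """A chromosome is a binary string: bit i = 1 means feature i is selected."""
    return [1 if random.random() < p_select else 0 for _ in range(n_features)]

def selected_indices(chromosome):
    """Decode a chromosome into the indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]
```

For the cryptocurrency experiment, the same scheme applies with a 268-bit string.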

Selection
Selection is the process of choosing the parent chromosomes that generate offspring chromosomes in each generation. In this study, we used roulette wheel selection based on fitness. We set the selection probability of each chromosome in proportion to its fitness and then selected chromosomes randomly. This means that chromosomes with good fitness are more likely to be selected as parents, and chromosomes with relatively poor fitness are less likely to be selected.
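A standard roulette wheel selection can be sketched as follows (an illustrative implementation, not the authors'):

```python
import random

def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one chromosome with probability proportional to its fitness.

    Assumes non-negative fitness values; higher fitness means a higher
    selection probability.
    """
    total = sum(fitnesses)
    if total == 0:                      # degenerate case: uniform choice
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chrom
    return population[-1]               # guard against floating-point drift
```

Note that the genetic wrapper minimizes MAE, so a lower-is-better score would first be converted into a higher-is-better fitness, e.g. 1 / (1 + MAE).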

Crossover
Crossover is an operation that generates the offspring of the next generation by crossing the parental chromosomes obtained through selection. There are several methods of crossover; in this study, multi-point crossover was implemented. Multi-point crossover is an extension of one-point crossover, which randomly selects a point on the chromosomes and crosses them at that point; multi-point crossover uses two or more points. Indeed, a multi-point crossover with an even number of points has the effect of crossing circular chromosomes, because the first and last genes of the chromosomes are treated as adjacent. Because the degree of perturbation of multi-point crossover is larger than that of one-point crossover, a relatively wide solution space can be explored. However, strong perturbation may slow convergence, and multi-point crossover with an odd number of points may not maintain the uniform traits of the selected chromosomes. In this study, we treated chromosomes as circular, because they encode a list of features whose order carries no meaning. To increase the degree of perturbation moderately and to cross circular chromosomes effectively, we used a two-point crossover.
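The two-point crossover described above can be sketched as follows (an illustrative implementation; the function name and the use of Python lists are our own choices):

```python
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    """Swap the segment between two random cut points.

    Because the first and last genes are treated as adjacent, an even
    number of cut points preserves the chromosome's circular structure.
    """
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b
```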

Mutation and Replacement
Mutation is an operator that modifies genes of a chromosome to prevent premature convergence and increase the diversity of the population. A general mutation generates a random number between 0 and 1 for each gene on a chromosome; if the value is less than a threshold, the corresponding gene is arbitrarily modified. In this study, the mutation probability was set to 0.001. Replacement is an operator that replaces the chromosomes of the existing population with the offspring chromosomes produced by crossover and mutation. We applied replacement to exchange existing chromosomes with offspring chromosomes. Furthermore, we also applied elitism to retain the best chromosome of the previous population in the next generation (Figure 1).
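A sketch of per-gene mutation with the 0.001 rate and of elitist replacement described above (illustrative, not the authors' implementation):

```python
import random

MUTATION_RATE = 0.001  # per-gene mutation probability used in this study

def mutate(chromosome, rate=MUTATION_RATE, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def replace_with_elitism(population, fitnesses, offspring):
    """Replace the population with the offspring, but always carry over
    the best chromosome of the previous generation (elitism)."""
    best = max(zip(population, fitnesses), key=lambda pf: pf[1])[0]
    return [best] + offspring[:len(population) - 1]
```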

Genetic Filter
Filter-based feature selection [31][32][33] has the advantage of deriving feature subsets by identifying correlations between features within a relatively short time; however, it has the disadvantage that it may be difficult to quantify the relevance and redundancy of the selected features. In this study, a new fitness function was devised to emphasize the advantage and make up for the disadvantage. Equation (1) favors feature subsets that are highly correlated with the target variable and largely uncorrelated with each other:

fitness = Σ_i [IG(S_target, S_i) + F(S_target, S_i) + C(S_target, S_i)] − Σ_{i<j} [IG(S_i, S_j) + F(S_i, S_j) + C(S_i, S_j)], (1)

where the sums run over the n selected features, S_target is the target variable, and IG, F, and C refer to the information gain, F-statistic, and Pearson correlation coefficient (PCC), respectively. The fitness combines the information gain, F-statistic, and PCC to capture various correlations of a chromosome. Specifically, to calculate the fitness of a chromosome, the sum of the information gain, F-statistic, and PCC between the target data and each selected feature S_i is obtained; another sum is obtained for those between the selected features S_i and S_j; finally, the difference between the two summations gives the fitness of the chromosome. Figure 2 shows the flow diagram of our genetic filter.
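The relevance-minus-redundancy idea behind Equation (1) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: each pair is scored with the absolute PCC only (an F-statistic helper is included for completeness), whereas the paper additionally sums the information gain; all names are hypothetical:

```python
import numpy as np

def pairwise_f_stat(x, y):
    """F-statistic of a simple linear regression of y on x, derived from
    the Pearson correlation r as F = r^2 (n - 2) / (1 - r^2)."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return (r ** 2) * (n - 2) / (1 - r ** 2 + 1e-12)

def filter_fitness(X, y, selected, score=None):
    """Relevance-minus-redundancy fitness for one chromosome.

    X: (n_samples, n_features) array; y: target; selected: indices of
    the selected columns. The default pairwise score is |PCC|.
    """
    score = score or (lambda a, b: abs(np.corrcoef(a, b)[0, 1]))
    relevance = sum(score(X[:, i], y) for i in selected)
    redundancy = sum(score(X[:, i], X[:, j])
                     for a, i in enumerate(selected)
                     for j in selected[a + 1:])
    return relevance - redundancy
```

A feature identical to the target maximizes relevance, while selecting near-duplicate features is penalized through the redundancy term.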

Mutual Information
Mutual information [34] provides a numerical value quantifying the relationship between two random variables. Let I(X, Y) be the mutual information of random variables X and Y, P(X, Y) the probability that events X and Y occur simultaneously, and PMI(X, Y) the pointwise mutual information of the events X and Y, where PMI(X, Y) = log [ P(X, Y) / (P(X) P(Y)) ]. If the random variables are continuous, Equation (2) is satisfied:

I(X, Y) = ∫_Y ∫_X P(X, Y) PMI(X, Y) dX dY. (2)

In other words, the mutual information of variables X and Y is the sum (or integral) over all cases of the PMI weighted by the joint probability. The PMI compares the probability of the two events occurring at the same time with the product of their individual probabilities. When the mutual information is close to 0, X and Y are essentially unrelated.
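For discrete variables, the same quantity can be computed directly from the joint and marginal probabilities (an illustrative sketch; the continuous case in Equation (2) replaces the sums with integrals):

```python
import numpy as np

def mutual_information(x, y):
    """I(X;Y) = sum over (x, y) of P(x, y) * log[ P(x, y) / (P(x) P(y)) ]
    for discrete variables, i.e. the PMI weighted by the joint probability."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy == 0:
                continue  # zero-probability cells contribute nothing
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```

Independent variables give a value of 0, while two identical binary variables give log 2 (in nats).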

F-Test
Hypothesis testing methods for differences in sample variance include the chi-squared test and the F-test. The chi-squared test is applied when the population of a single sample follows a normal distribution and the population variance is known in advance; however, because the variance is generally not known in advance, the F-test is used when the population variance is unknown. The F-test is a statistical hypothesis test that determines whether or not the difference in variance between two samples is statistically significant.
We endeavored to include statistical significance between features by adding the F-statistic to the fitness of the genetic filter.
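A minimal sketch of the two-sample variance-ratio F-statistic (illustrative; the exact role of the F-statistic inside the fitness is given by Equation (1)):

```python
import numpy as np

def f_test_statistic(sample_a, sample_b):
    """F-statistic for comparing two sample variances: the ratio of the
    unbiased variance estimates, with the larger variance on top so F >= 1."""
    var_a = np.var(sample_a, ddof=1)
    var_b = np.var(sample_b, ddof=1)
    return max(var_a, var_b) / min(var_a, var_b)
```

Scaling a sample by a factor of 2 quadruples its variance, so the statistic against the original sample is exactly 4.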

Pearson Correlation Coefficient
In statistics, the Pearson correlation coefficient [35] quantifies the correlation between two variables X and Y. By the Cauchy-Schwarz inequality, it takes a value in [−1, 1]; it indicates no linear correlation when close to 0, positive linear correlation when close to 1, and negative linear correlation when close to −1.
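A direct implementation of the definition (illustrative; in practice a library routine such as `scipy.stats.pearsonr` would be used):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: the covariance of x and y divided by the
    product of their standard deviations; always lies in [-1, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
```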

Genetic Wrapper
While our genetic filter calculates fitness through the correlations between features, our genetic wrappers [36,37] use machine learning models to evaluate the fitness of each chromosome. Therefore, the computational time is longer than that of a genetic filter; however, the genetic wrapper tries to search for an optimal feature subset tailored to a particular learning algorithm. We used three machine learning models for our genetic wrapper. Figure 3 shows the flow diagram of our genetic wrapper.
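The wrapper's fitness evaluation can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a least-squares linear model replaces the paper's SVR, extra-trees, and GP regressors, and all names are hypothetical:

```python
import numpy as np

def wrapper_fitness(X_train, y_train, X_test, y_test, selected):
    """Wrapper fitness for one chromosome: train a model on the selected
    features only and return the mean absolute error on held-out data
    (lower is better)."""
    cols = list(selected)
    # Fit a linear model with an intercept on the selected columns only.
    A = np.column_stack([X_train[:, cols], np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([X_test[:, cols], np.ones(len(X_test))])
    return np.mean(np.abs(B @ coef - y_test))
```

A chromosome that selects an informative feature yields a much lower MAE than one that selects noise, which is exactly the signal the genetic search exploits.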

Support Vector Regression
Support vector regression (SVR) [38] refers to the use of an SVM to solve regression problems. The SVM is used for classification based on training data, but an ε-insensitive loss function is introduced in the regression model of the SVM to predict unknown real values. The goal of SVR therefore differs from that of the SVM: as shown in Figure 4, SVR minimizes the error outside the margin so that as many data points as possible lie within the margin.
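The ε-insensitive loss that distinguishes SVR from ordinary regression can be sketched as follows (illustrative; the ε value shown is an arbitrary example):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """The SVR loss: errors inside the epsilon-tube around the prediction
    cost nothing; only the part of |error| exceeding epsilon is penalized."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(residual - epsilon, 0.0)
```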

Extra-Trees Regression
The random forest is a representative ensemble model that assembles multiple decision trees trained on bootstrap samples to prevent overfitting; its general performance is higher than that of a single tree. Extra-trees [39] is a variant of the random forest model that increases randomness by selecting split thresholds at random when splitting a node. The importance of features evaluated by Extra-trees is higher than that evaluated by the random forest model; that is, Extra-trees evaluates features from a broad perspective. We used the feature selection results obtained using Extra-trees regression.
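The defining twist of Extra-trees, drawing split thresholds at random instead of searching for the best one, can be sketched as follows (an illustrative helper, not the library implementation):

```python
import random

def extra_tree_split(X, feature_subset, rng=random):
    """For each candidate feature, draw the cut point uniformly at random
    between that feature's minimum and maximum, instead of searching for
    the optimal threshold as an ordinary decision tree would."""
    splits = []
    for f in feature_subset:
        values = [row[f] for row in X]
        lo, hi = min(values), max(values)
        splits.append((f, rng.uniform(lo, hi)))
    return splits
```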

Gaussian Process Regression
Gaussian process (GP) regression [40,41] is a representative model of the Bayesian non-parametric methodology and is mainly used to solve regression problems. Assuming that f is a function that describes the input and output data, the GP assumes that the joint distribution of any finite set of f values follows a multivariate normal distribution. In general, the mean is assumed to be 0, and the covariance C is set by a kernel function. GP regression gives high prediction performance, allows probabilistic interpretation of the prediction results, and can be implemented with relatively simple matrix operations. Figure 5 shows that the deviation of the sampled functions at the given sample points is very small, whereas in unknown regions without samples, the predicted function values show a large variance. Finding the distribution of functions is the main point of GP regression. Since GP regression involves computationally expensive operations, various approximation algorithms have been devised.
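The GP posterior described above can be sketched directly from the standard equations (an illustrative zero-mean GP; the squared-exponential kernel and noise level are our own assumptions):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP, computed with the
    standard GP regression matrix equations."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)
    K_ss = rbf_kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)
```

As the text notes, the posterior variance is tiny at observed points and grows toward the prior variance far from the data.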

(Figure 5: samples from the prior distribution and samples from the posterior distribution.)

Experimental Setup
We first applied our genetic filter and genetic wrapper to KOSPI data; then, we compared the prediction results obtained using the machine learning models. The data for 12 years (from 2007 to 2018), which had 264 features including global economic indices, exchange rates, and commodity indices, were used (Figure 6). Because the Korean economy is very sensitive to external variables owing to its industrial structure, it is very important to grasp the trend of the global economy. Therefore, indicators of major countries and global economic indicators closely related to South Korea were selected. The various index data were preprocessed into three forms: index, net changes, and percentage changes (Figure 7). Missing data were filled by linear interpolation, and non-trading days were excluded based on the KOSPI. The test data were not affected by the training data during preprocessing and experimentation. SVR, extra-trees regression, and GP regression were applied to compare the performance of the preprocessed data with and without feature selection. Next, we selected the feature selection method and evaluation model that showed the best performance among them, and we conducted an experiment to predict the KOSPI in 2020 by adding the data corresponding to 2019 and 2020 to the 12-year data from 2007 to 2018. Consequently, we endeavored to verify whether or not our feature selection technique also explains the data after COVID-19 adequately, and we tested whether feature selection improved predictive performance. The last experiment changed the target data to cryptocurrency. Cryptocurrency is electronic information that is encrypted, distributed, and issued with blockchain technology and can be used as a currency in a certain network. Cryptocurrency was devised as a medium for the exchange of goods, that is, a means of payment. However, it also serves as an investment whose price is determined by supply and demand in the market through exchanges. Therefore, we conducted feature selection with the cryptocurrency price as the target to check whether cryptocurrency can be regarded as an economic indicator affected by the market.

Table 1 shows the parameters of our genetic filter. We trained and evaluated the data from 2007 to 2018 by dividing them into 20 intervals, as shown in Table A1 (see Appendix A). As mentioned in Section 4.1, all the variables of the data were preprocessed into three different forms: index, net changes, and percentage changes. Our genetic filter was applied to each dataset, and the results of applying SVR, extra-trees regression, and GP regression are shown in Tables A1-A3 (see Appendix A). The results of predicting net changes and percentage changes were converted back into original indices, and the mean absolute error (MAE) against the actual indices was derived. The results obtained without any feature selection were compared with those obtained by applying our genetic filter; our genetic filter showed an improved average MAE for all three types of preprocessed data. When the experimental results were classified by evaluation method, GP regression showed the best overall performance among SVR, extra-trees regression, and GP regression. When the results were classified by preprocessing type, predicting percentage changes and converting them into indices showed the least error. The experiment in which feature selection was performed on percentage changes with GP regression showed the best performance, and the average error was improved by approximately 32% compared with the case without feature selection. Table 2 shows the process by which our genetic algorithm selects features between 2015 and 2016: the number of features and the fitness of the best solution in each generation are shown.
The features frequently selected among the feature subsets obtained for each interval are shown in Table 3, which identifies the feature subset closely related to KOSPI.
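The percentage-change preprocessing and the conversion of predictions back into index levels for the MAE computation, as described above, can be sketched as follows (illustrative helpers with hypothetical names; whether predictions are compounded or applied to the previous day's actual value is an implementation choice):

```python
import numpy as np

def to_pct_changes(index):
    """Preprocess an index series into day-over-day percentage changes."""
    index = np.asarray(index, float)
    return (index[1:] - index[:-1]) / index[:-1] * 100.0

def pct_changes_to_index(first_value, pct_changes):
    """Convert predicted percentage changes back into index levels so the
    MAE against the actual index can be computed."""
    levels = [first_value]
    for pct in pct_changes:
        levels.append(levels[-1] * (1.0 + pct / 100.0))
    return np.array(levels[1:])

def mae(actual, predicted):
    """Mean absolute error between actual and predicted index levels."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))
```

Round-tripping the true changes through both helpers reconstructs the original index, so any remaining MAE comes from the prediction error alone.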

The terms in Table 2 follow Equation (1), and α* denotes the normalized value of α.

Similar to the application of the genetic filter in Section 4.2.1, the parameters of the genetic wrapper are the same as in Table 1, except for the number of generations. The intervals and types of data are also the same as in Section 4.2.1. Tables A1-A3 show the results of applying the genetic wrapper to each dataset in combination with SVR, extra-trees regression, and GP regression (see Appendix A). As before, the results of predicting net changes and percentage changes were converted back into original indices, and the MAE against the actual indices was derived. When we compared the results, our genetic wrapper showed a lower average MAE than the case without feature selection. Our genetic wrapper also showed better results than the genetic filter in all intervals. In particular, when we used GP regression with the percentage changes data, our genetic wrapper improved the error by approximately 39% compared with the results without feature selection. Therefore, based on the findings of this study, the best way to explain the KOSPI is to apply percentage changes data to a genetic wrapper combined with GP regression.

Prediction of KOSPI after COVID-19
As with the global financial crisis in 2008, the KOSPI could not avoid the impact of COVID-19 on the stock market in 2020 and showed significant fluctuations. Predicting situations in which the stock index fluctuates sharply during an economic crisis is important in the real world. We added the data for 2019-2020 to the existing 2007-2018 data, resulting in a total of 14 years of data, and tried to predict the KOSPI after COVID-19 in 2020 by training on the 13 years of data corresponding to 2007-2019. We applied the combination of the genetic wrapper and GP regression, which had shown the best performance in Sections 4.2.1 and 4.2.2, to the percentage changes data. Figure 8 shows the actual KOSPI, the results obtained with feature selection, and those obtained without feature selection. It was confirmed that GP regression on the selected features predicted the KOSPI after COVID-19 better, and with less spurious fluctuation, than GP regression without feature selection.
It is meaningful to predict the KOSPI itself, but from an actual investment point of view, predicting whether the stock index will rise or fall compared with the previous day may be of more interest. The optimization carried out in this study is genetic feature selection, which better predicts the numerical value of the target data. Additional experiments were carried out to see whether the predicted index data can also predict the direction of the stock index. We compared the prediction results derived from GP regression with the genetic wrapper and those without any feature selection on the percentage changes data. Each target value was post-processed into UP or DOWN, denoting upward and downward movement of the stock price, respectively. Table 4 shows the results of predicting the UP and DOWN of the KOSPI. The technique that predicted the KOSPI well in the previous section also predicted the actual UP and DOWN of the KOSPI relatively well. Although our optimization objective was not the UP or DOWN compared with the previous day, our feature selection could predict the direction of the KOSPI with relatively high accuracy.
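The UP/DOWN post-processing described above can be sketched as follows (illustrative helpers; mapping ties to DOWN is an arbitrary choice of ours):

```python
import numpy as np

def to_direction(index_series):
    """Post-process index levels into UP/DOWN versus the previous day."""
    x = np.asarray(index_series, float)
    return ["UP" if b > a else "DOWN" for a, b in zip(x[:-1], x[1:])]

def direction_accuracy(actual, predicted):
    """Fraction of days on which the predicted direction matches the
    actual direction."""
    a, p = to_direction(actual), to_direction(predicted)
    return sum(x == y for x, y in zip(a, p)) / len(a)
```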

Prediction of Cryptocurrency Price and Direction
Cryptocurrency [42,43], which advocates decentralization, seeks to serve as an independent and objective safe asset distinct from exchange rates and other economic indicators. However, unintentional artificial surges and plunges may occur and, as with other safe assets, fluctuations arise from changes in currency values such as interest rate increases, inflation, and deflation. Until now, we have used stock index data from the actual stock market, such as the KOSPI; in this section, however, feature selection was applied with cryptocurrency set as the target data. We tried to predict the daily prices and the UP and DOWN movements of Bitcoin. A total of 268 features, including the KOSPI data, were preprocessed in the same manner as in Section 4.2.3. The start of the data was set to 2013 because Bitcoin prices began fluctuating to a meaningful extent only from 2013. Bitcoin prices in 2020 were predicted by training on the 7-year data from 2013 to 2019. The results of predicting Bitcoin prices with the combination of the genetic wrapper and GP regression were compared with those obtained without feature selection. We converted the predicted percentage changes from the previous day back into Bitcoin prices and obtained the MAE against the actual Bitcoin prices. Figure 9 shows the actual Bitcoin prices, the results obtained with feature selection, and those obtained without feature selection. Bitcoin prices predicted without any feature selection showed considerable fluctuation in specific intervals, which means that the training did not proceed properly. However, when the genetic wrapper was applied, the predictions were similar to the actual Bitcoin prices and did not show considerable fluctuation. An additional experiment was carried out to determine whether our feature selection can adequately explain these fluctuations. Table 5 shows the results of predicting the direction of Bitcoin prices. The feature selection technique that predicted the KOSPI and Bitcoin prices well in the sections above also predicted the UP and DOWN of the Bitcoin prices relatively well in terms of precision, recall, F1-score, and accuracy. Although the purpose of our optimization was to accurately predict the Bitcoin prices, the actual UP and DOWN movements were also predicted quite accurately.

Conclusions
In this study, we proposed genetic feature selection techniques to predict the KOSPI and performed various experiments to predict the KOSPI using machine learning. Traditional feature selection techniques aim to create an improved model through dimensionality reduction of the data. We presented a new genetic filter that strengthens feature selection and reduces its shortcomings, as well as a new genetic wrapper that maximizes prediction performance. The three important findings of this study are as follows. First, a genetic filter and a genetic wrapper, combined with various statistical techniques and machine learning models, were applied to index, net changes, and percentage changes data. These combinations were compared, and the optimal form of the input data was percentage changes: by converting predicted percentage changes back into the original index, we created a better predictive model. Second, to overcome the disadvantages of traditional filter-based feature selection, we devised a new fitness function. Redundant features were removed, and the formula was designed to have high relevance with the target variable; thus, improved results were obtained across various evaluation functions. Third, the genetic wrapper that performed best on the 2007-2018 interval also produced meaningful results in predicting the KOSPI and cryptocurrency prices after COVID-19, which means that our stock index prediction model does not overfit past data. Our genetic filter reduced the MAE by approximately 32% when using GP regression and percentage changes data. When the genetic wrapper was applied, the results improved in all intervals compared with the genetic filter, and GP regression with the genetic wrapper showed the best result with an improvement of approximately 39%. Although the proposed genetic wrapper performs better than our genetic filter, it has the disadvantage of a long computation time, whereas the genetic filter runs much faster.

In the subsequent experiments, the genetic wrapper combined with GP regression, which showed the best result, was used to predict the KOSPI and cryptocurrency prices after COVID-19. We trained predictive models using 2007-2019 data and tested them with 2020 data. Our feature selection improved KOSPI predictions in the post-COVID era, and it also improved the prediction of the stock market direction in terms of accuracy and F1-score. Our final experiment predicted cryptocurrency prices after COVID-19, where our feature selection also improved the Bitcoin price predictions. As future work, we plan experiments to find better fitness combinations by applying a wider variety of statistical techniques in the genetic filter. In addition to the filter improvement, it will be necessary to apply various prediction models and to tune their hyperparameters. With respect to the wrapper, it will be necessary to reduce the computational cost without degrading prediction quality. Furthermore, it is promising to derive more meaningful models by applying ensemble methods over several predictors. Finally, we aim to predict various equities or assets, such as the US stock market, the Chinese stock market, Ethereum, and Ripple, using our genetic feature selection.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Results of Applying Genetic Feature Selection to Various Data
In this appendix, we provide the results of applying feature selection to the KOSPI, the net changes of the KOSPI, and the percentage changes of the KOSPI. Each table shows the MAE values of SVR, extra-trees regression, and GP regression.