Geometric Case Based Reasoning for Stock Market Prediction

Abstract: Case-based reasoning is a knowledge discovery technique that uses similar past problems to solve current new problems. It has been applied to many tasks, including the prediction of temporal variables, alongside learning techniques such as neural networks, genetic algorithms, and decision trees. This paper presents a geometric criterion for selecting similar cases that serve as exemplars for the target. The proposed technique, called geometric case-based reasoning, uses a shape distance method based on the number of sign changes of features relative to the target case, particularly when extracting nearest neighbors. This method thus overcomes a limitation of conventional case-based reasoning, which relies on Euclidean distance and does not consider how nearest neighbors resemble the target case in terms of changes between previous and current features in a time series. These concepts are investigated against the backdrop of a practical application involving the prediction of a stock market index. The results show that the proposed technique is significantly better than the random walk model at p < 0.01. However, it was not significantly better than the conventional CBR model in the hit rate measure and did not surpass the conventional CBR in the mean absolute percentage error.


Introduction
Traditionally, the prediction of stock markets has relied on statistical methods, including multivariate statistical methods, autoregressive integrated moving average (ARIMA), and autoregressive conditional heteroscedasticity (ARCH) models. Recently, deep learning and other knowledge-based techniques have been extensively applied to the task of predicting financial variables. Case-based reasoning (CBR) is one of the most popular methodologies in knowledge-based systems. CBR solves a new problem by recalling and reusing specific knowledge from past experiences [1]. Research on stock prediction using the CBR technique is lacking, but CBR is used for both research and practical applications in many areas, such as medical diagnosis, recommendation systems, and cybersecurity detection. This paper shows how CBR can be applied to stock prediction using a geometric criterion for selecting similar cases to serve as exemplars for the target. The proposed technique uses a shape distance method based on the number of sign changes of features relative to the target case, particularly when extracting nearest neighbors. This method thus overcomes a limitation of conventional CBR, which relies on Euclidean distance and does not consider how nearest neighbors resemble the target case in terms of changes between previous and current features in a time series. We investigated these concepts against the backdrop of a practical application involving the prediction of a stock market index.
The rest of this paper is organized as follows. Section 2 reviews CBR as a knowledge discovery technique. Section 3 introduces the proposed technique, which is called geometric CBR. Section 4 presents the case study. Section 5 discusses the results of the study. Finally, the concluding remarks are presented in Section 6.

Case-Based Reasoning
Case-based reasoning (CBR) is an approach for solving a new problem by remembering a previous similar situation and reusing information from, and knowledge of, that situation [2]. The concept assumes that similar problems have similar solutions, so CBR is well suited to practical domains that solve problems from real cases rather than from rules or general knowledge. Aamodt and Plaza [2] described the general CBR cycle as the following four processes:
1. RETRIEVE the most similar case or cases.
2. REUSE the information and knowledge in those cases to solve the problem.
3. REVISE the proposed solution.
4. RETAIN the parts of this experience likely to be useful for future problem solving.
According to this process, CBR solves a problem by retrieving one or more previous cases, reusing them to solve the problem, revising the potential solution based on the previous cases, and retaining the new experience by incorporating it into the existing case-base [2].
Conventional methods of prediction based on discrete logic usually seek the single best instance, or a weighted combination of a small number of neighbors, in the observational space. An intelligent learning algorithm should therefore consider a "virtual" or composite neighbor whose parameters are defined by some weighted combination of actual neighbors in the case set. In this way, the algorithm can use the knowledge reflected in a larger subset of the case set rather than the immediate collection of proximal neighbors [3][4][5][6][7][8][9][10][11]. The procedure for case reasoning using composite neighbors and the Euclidean distance method is presented in Figure 1.
One issue with conventional CBR is that many previously experienced cases must be retrieved. The conventional CBR technique tends to retrieve a fixed number of neighbors in the observational space: it always selects the same number of neighbors irrespective of the optimal number of similar neighbors for each target case. This fixed number of neighbors is a problem when some target cases should consider more similar cases while others should consider fewer. A further problem occurs when too many cases are equidistant from the target case. Conventional CBR therefore does not guarantee optimal similar neighbors for various target cases, which lowers predictability through deviation from the desired similar neighbors.
Chun and Park [8] suggested a model to find the optimal neighbors for each target case dynamically. Park et al. [10] suggested a new case extraction technique called statistical case-based reasoning (SCBR), which dynamically adapts the optimal number of neighbors by considering the distribution of distances between potential similar neighbors for each target case.
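As an illustrative sketch of the composite-neighbor idea discussed above, the following Python snippet blends retrieved neighbors into one "virtual" case. The inverse-distance weighting is an assumption chosen for illustration, not the specific scheme used in the cited works.

```python
import numpy as np

def composite_neighbor(neighbors, dists, eps=1e-8):
    """Blend retrieved neighbors into one virtual case.

    neighbors: (k, m) array of the k retrieved feature vectors.
    dists: length-k distances from the target to each neighbor.
    Each neighbor is weighted by the inverse of its distance, so
    closer cases contribute more to the composite case.
    """
    w = 1.0 / (np.asarray(dists, dtype=float) + eps)  # eps avoids division by zero
    w /= w.sum()                                      # normalize weights to sum to 1
    return w @ np.asarray(neighbors, dtype=float)

# Two equidistant neighbors yield their midpoint as the virtual case.
nbrs = np.array([[1.0, 2.0], [3.0, 4.0]])
print(composite_neighbor(nbrs, [1.0, 1.0]))  # [2. 3.]
```

With unequal distances, the nearer neighbor dominates the blend, which is the intuition behind using knowledge from a larger subset of the case set.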
Conventional CBR using Euclidean distance does not consider how nearest neighbors are similar to the target case in terms of changes between previous and current features in a time series. This paper proposes a new similarity measure, called shape distance, which compares the rise-and-fall signs of a target case and its candidate neighbors.

Methodology
In this paper, we propose a shape distance method that selects nearest neighbors according to the slope similarity between two cases. In financial forecasting, the hit rate of stock predictions can be an important decision tool for those who invest in the stock market. Conventional CBR has focused on numeric distance methods and does not consider how candidate cases resemble the target case in terms of the shapes of their features. We built on Kim and Kang's [42] distance measurement method for retrieving neighbors. Before explaining shape similarity, we first introduce the concept of numeric distances, which are used in conventional CBR methods.

Numeric Distances
The numeric distance refers to the conventional approach for determining the distance between two cases. It is based on the differences of multiple features (or instances) between the target case and the other learning cases in a stored case box. If one scalar target case x_i has m features (or instances), m_i = {a_1i, a_2i, a_3i, ..., a_mi}, then the distance between x_i and x_j is given by Equation (1):

d(x_i, x_j) = sqrt( Σ_{k=1}^{m} (a_ki − a_kj)² )    (1)

More specifically, if one vector target case X_i consists of n consecutive time series data and each consecutive datum has m features (or instances), then the feature distance between two cases X_i and X_j is given by Equation (2):

d(X_i, X_j) = sqrt( Σ_{t=1}^{n} Σ_{k=1}^{m} (a_kti − a_ktj)² )    (2)

Other distance metrics can be used, such as the Manhattan distance or the Gaussian distance. When the metric of choice is the standard Manhattan distance, the previous relationship becomes Equation (3):

d(x_i, x_j) = Σ_{k=1}^{m} |a_ki − a_kj|    (3)

In this paper, we use one vector target case x_i, which covers one day of stock information and has 4 features (or instances): the open, high, low, and closing stock prices.
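A minimal sketch of the numeric distances in Equations (1) and (3), assuming one-day cases with the paper's four features (open, high, low, close); the price values are illustrative.

```python
import numpy as np

def euclidean_case_distance(a_i, a_j):
    # Equation (1): square root of the summed squared feature differences.
    return np.sqrt(np.sum((np.asarray(a_i) - np.asarray(a_j)) ** 2))

def manhattan_case_distance(a_i, a_j):
    # Equation (3): sum of the absolute feature differences.
    return np.sum(np.abs(np.asarray(a_i) - np.asarray(a_j)))

# One-day cases: [open, high, low, close].
x_i = [100.0, 102.0, 99.0, 101.0]
x_j = [101.0, 103.0, 100.0, 100.0]
print(euclidean_case_distance(x_i, x_j))  # 2.0
print(manhattan_case_distance(x_i, x_j))  # 4.0
```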

Shape Distance Method
Time series data may be characterized by their pattern of behavior in terms of rises and falls. Thus, when selecting neighbors, the slopes of two regional trajectories may be compared to assess the similarity between two cases. The signs of rises and falls between previous and current stock prices provide a simple shape distance. To understand this new distance method, consider a scalar variable such as the price of a stock. Let x_t denote the price of a stock at a specific time t with m features (or instances), m_t = {a_1t, a_2t, a_3t, ..., a_mt}. If we consider n consecutive time series data, then x_t ≡ (x_{t−n+1}, ..., x_{t−1}, x_t)^T. Let s_t denote the sign of the change in observation x_t from the previous value x_{t−1}; we define the difference in successive observations as Equation (4):

s_t = sign(x_t − x_{t−1})    (4)

so that a sequence of s consecutive signs is S_t ≡ (s_{t−s+1}, ..., s_t)^T. If there are n cases in a stored case box and x_t is a target case, then the shape distance between the target case and a candidate case is defined as Equation (5):

d_i = Σ_{k=1}^{n} [1 − U(s_k · s'_k)]    (5)

where U denotes the unit step function: U equals 1 if the two slopes s_k and s'_k are both positive or both negative, and zero otherwise. When no matches in sign occur, the shape distance attains its maximum of d_i = n. Conversely, the shape distance reaches its minimum of d_i = 0 when all n pairs of slopes match. For simplicity of analysis, we consider a one-day case as the target case. Thus, x_t holds one day of information and has only 4 instances: the open, high, low, and closing stock prices. Figure 2 presents the procedure for selecting nearest neighbors using the shape distance.
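The shape distance of Equations (4) and (5) can be sketched as follows; the toy price sequences are illustrative.

```python
import numpy as np

def shape_distance(x_t, x_c):
    """Shape distance between a target case and a candidate case.

    Each case is a sequence of consecutive observations. Equation (4)
    gives the slope signs s_t = sign(x_t - x_{t-1}); Equation (5) counts
    the slope pairs whose signs disagree: 0 when every rise/fall matches,
    n (the number of slope pairs) when none do.
    """
    s_t = np.sign(np.diff(x_t))
    s_c = np.sign(np.diff(x_c))
    match = (s_t * s_c) > 0          # unit step U: 1 when both slopes share a sign
    return len(match) - int(match.sum())

target    = [1.0, 2.0, 1.5, 1.8]       # rise, fall, rise
candidate = [10.0, 12.0, 11.0, 11.5]   # rise, fall, rise -> all slopes match
print(shape_distance(target, candidate))            # 0
print(shape_distance(target, [3.0, 2.0, 2.5, 2.0])) # fall, rise, fall -> 3
```

A flat step (zero difference) has sign 0 and is counted as a mismatch, consistent with U being zero unless both slopes are strictly positive or strictly negative.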
Sustainability 2020, 12, x FOR PEER REVIEW

The Data
With this case study, we aimed to investigate the proposed technique's predictive performance in forecasting a stock market index. The case study involved the prediction of the Dow Jones Industrial Average Index (DJI) in two experiments. The first experiment used a large data set for the learning phase, comprising daily values starting from 29 January 1985; this is the largest dataset that can be obtained from Yahoo Finance using Python. The corresponding Python call is web.DataReader('^DJI', 'yahoo', start='1985-01-29', end='2020-03-20'). Therefore, for the first experiment, the learning phase drew on daily values beginning 29 January 1985.


Model Construction
An exploratory plot of the Dow Jones Industrial Average Index (DJI) is given in Figure 3. Other exploratory plots for the raw data series are shown in Figures 4-6. Figure 4 depicts the trajectory of the opening value of the DJI. Figure 5 plots the high value of the DJI, and Figure 6 displays the low value of the DJI.

In constructing the predictive model for the DJI, the input variables were first transformed. For financial variables, stationarity can often be obtained through a logarithmic and a differencing operation [9]. However, because the data used in the case study did not require eliminating the effects of measurement units among variables, only a differencing procedure was performed. For example, the opening value at time t (Open_t) was transformed into dOpen_t (Open_t − Open_{t−1}) through the differencing procedure. The other input variables, High_t, Low_t, and Close_t, were transformed into dHigh_t, dLow_t, and dClose_t, respectively. As shown in Figure 7, these variables were fed to the geometric CBR prediction engine to produce the predicted value of dClose_t. Finally, the predicted value of the closing price at t + 1 (pClose_{t+1}) was obtained through a de-transforming procedure, adding the predicted value of dClose_t to the previous actual closing price at t (Close_t). Figure 7 presents an overview of the pre-processing and post-processing used to produce prediction values with geometric CBR.
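The differencing and de-transforming steps can be sketched with pandas; the OHLC values below are illustrative, not data from the study.

```python
import pandas as pd

def difference_features(df):
    """Transform raw OHLC levels into one-step differences,
    e.g. dOpen_t = Open_t - Open_{t-1}; the first row is dropped
    because it has no predecessor."""
    return df.diff().dropna()

def detransform(prev_close, d_close_pred):
    """Recover the predicted closing level by adding the predicted
    difference back onto the last observed closing price."""
    return prev_close + d_close_pred

ohlc = pd.DataFrame({"Open":  [100.0, 102.0, 101.0],
                     "High":  [103.0, 104.0, 102.0],
                     "Low":   [ 99.0, 100.0,  99.5],
                     "Close": [102.0, 101.0, 100.0]})
d = difference_features(ohlc)
print(d["Close"].tolist())       # [-1.0, -1.0]
print(detransform(100.0, -1.0))  # 99.0
```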



Results
For much of the past century, the random walk model of stock prices served as a pillar of accepted wisdom in financial economics. One implication of the random walk model is that obvious patterns in the economy are already incorporated in the valuation of stocks and financial markets. This is the rationale behind technical analysis, which forecasts stock prices based solely on variables pertaining to the market. The performance results for the predictive models, namely the random walk (RW), a conventional CBR using the Euclidean distance method, and geometric CBR (GCBR) using the shape distance method, are presented in Tables 1 and 2. Table 1 summarizes the hit rates (the proportion of correct forecasts) on the test data; hit rates show how well GCBR predicted the direction of price changes for the closing prices of the DJI. Table 1 indicates that GCBR is significantly more accurate than the random walk model at p < 0.01. We also tested the null hypothesis H0 that the proposed GCBR does not produce more accurate performance than the conventional CBR method; the decision "Reject H0" in Table 1 therefore represents superior performance of GCBR over the conventional CBR.
When the dataset is small and relatively recent (2006+), GCBR is more accurate than CBR overall. However, the results showed that the outperformance of GCBR is statistically significant only when there is one nearest neighbor; for all other experiments, GCBR was not significantly better than the conventional CBR method. When the dataset was large (1985+), GCBR did not perform better than the conventional CBR method. This is because, when predicting with few similar incidents in the past, a larger dataset increases the possibility of using older, equidistant cases that may have lost their relevance. The results of the first experiment (1985+) implied that when the number of similar cases was in the range of 30 to 100, GCBR exhibited superior performance to CBR; with 75 nearest neighbors, GCBR performed better than CBR to a statistically significant degree, with p < 0.1. The significance tests examined whether there was a difference between two population proportions, using the statistic

z = (p_1 − p_2) / sqrt( p(1 − p)(1/n_1 + 1/n_2) )

where the p_i are sample proportions, the π_i are population proportions, the n_i are the sample sizes of the groups, and p is the pooled estimate of the proportion of successes across both groups, p = (n_1 p_1 + n_2 p_2)/(n_1 + n_2). *** indicates the z- and t-values for the 1985+ dataset when the number of nearest neighbors was 75. Prediction error is measured by the mean absolute percentage error, MAPE = (100/n) Σ |y_t − f_t| / |y_t|, where y represents the original series, f the forecast, and n the number of observations. ** Pairwise t-tests of the predictive models for the test phase; the comparison is based on the MAPE of the residuals.
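The two-proportion test described above can be computed as in this sketch. The sample sizes are hypothetical placeholders for illustration, since the test-set size is not restated here.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two sample proportions,
    using the pooled estimate p = (n1*p1 + n2*p2) / (n1 + n2)."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hit rates from the 2006+ experiment with 50 neighbors; the sample
# size of 800 test days is an assumed placeholder, not from the paper.
z = two_proportion_z(0.5570, 800, 0.4625, 800)
print(round(z, 3))
```

A z value above roughly 2.33 would reject equality of the proportions at p < 0.01 in a one-sided test.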
In the second experiment (2006+), GCBR was most accurate when the number of nearest neighbors was 50, with a hit rate of 0.5570 against 0.4625 for the random walk model. This result implied the possibility of GCBR being superior to conventional CBR in terms of hit rate. CBR performed better with the larger dataset (1985+), as it implements a numerical distance method in the learning phase. For GCBR, the first experiment implies that to enhance performance, the number of nearest neighbors must reach a certain level, in this case 30. It further implies that when two relevant cases found through CBR lie at the same distance, preferring the most recent case will enhance performance. Table 2 presents the MAPE results and the t-tests for differences in performance among the random walk, conventional CBR, and GCBR methods. The GCBR model does not appear to surpass the other models in this measure; its best MAPE, 0.853, occurred with 300 neighbors. With the large dataset, there were several settings in which GCBR performed better than CBR, but the difference was insignificant, and the two methods did not exhibit significant performance differences in terms of MAPE. We originally optimized the GCBR model for hit rate, so its MAPE performance was expected to be worse; even so, such results imply that shape-distance-based GCBR has the potential to enhance MAPE-based performance.
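The MAPE criterion used in Table 2 can be computed as in this sketch; the series values are illustrative.

```python
def mape(y, f):
    """Mean absolute percentage error: (100/n) * sum(|y_t - f_t| / |y_t|)."""
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(y, f)) / len(y)

actual   = [100.0, 102.0, 101.0]
forecast = [101.0, 101.0,  99.0]
print(round(mape(actual, forecast), 3))  # 1.32
```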

Concluding Remarks and Future Work
This paper proposed a shape distance method for selecting nearest neighbors in case-based reasoning. The concepts were investigated against the backdrop of a practical application involving the prediction of a stock market index. The results of the case study are summarized as follows:
• The proposed technique, GCBR, is significantly better than the random walk model at p < 0.01.
• Overall, GCBR is more accurate than conventional CBR models in terms of hit rate; however, the superiority was not statistically significant.
• GCBR was not found to surpass conventional CBR in terms of MAPE overall.
• GCBR outperformed conventional CBR in terms of MAPE when the number of nearest neighbors was small and the dataset was recent and smaller.
• When the dataset was larger, GCBR performed significantly more accurately than CBR when the number of nearest neighbors was 75.
The proposed method has the potential to improve predictability. In future research, we propose implementing the shape distance method over consecutive time series data when searching for nearest neighbors, which would further validate and improve the predictability of GCBR. A promising direction for future work would be to find optimal neighbors by combining the numeric distance and shape distance methods.

Conflicts of Interest:
The authors declare no conflict of interest.