Forecasting Model for Stock Market Based on Probabilistic Linguistic Logical Relationship and Distance Measurement

The fluctuation of the stock market has a symmetrical characteristic. To improve the performance of self-forecasting, it is crucial to summarize and accurately express internal fluctuation rules from the historical time series dataset. However, due to the influence of external interference factors, these internal rules are difficult to express by traditional mathematical models. In this paper, a novel forecasting model is proposed based on probabilistic linguistic logical relationships generated from a historical time series dataset. The proposed model introduces linguistic variables with positive and negative symmetrical judgements to represent the direction of stock market fluctuation. Meanwhile, daily fluctuation trends of a stock market are represented by a probabilistic linguistic term set, which consists of the daily status and its recent historical statuses. First, the historical time series of a stock market is transformed into a fluctuation time series (FTS) by the first-order difference transformation. Then, a fuzzy linguistic variable is employed to represent each value in the fluctuation time series, according to predefined intervals. Next, the left-hand sides of the fuzzy logical relationships between current statuses and their corresponding histories are expressed by probabilistic linguistic term sets, and similar ones are grouped to generate probabilistic linguistic logical relationships. Lastly, based on the probabilistic linguistic term set expression of the current status and the corresponding historical statuses, distance measurement is employed to find the most proper probabilistic linguistic logical relationship for future forecasting. To make the prediction accuracy of the model easy to compare with earlier work, this paper takes the closing price dataset of the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) as an example. Compared with the prediction results of previous studies, the proposed model has the advantages of stable prediction performance, simple model design, and an easy-to-understand platform. To test the performance of the model on other datasets, we also forecast the Shanghai Stock Exchange Composite Index (SHSECI) to demonstrate its universality.


Introduction
Accurate predictions of future fluctuations in financial data can help investors hedge against risk. However, because of the high noise, instability, and long-term unpredictability of the stock market, it is difficult to predict its future trend with absolute accuracy. Since each dataset has its own inherent fluctuation rules, researchers have put forward many models that predict the future by learning historical fluctuation laws, such as a regression analysis model [1], an autoregressive integrated moving average (ARIMA) model [2], an autoregressive conditional [...] fluctuation time series by calculating the difference between each data point and that of the previous day. The fluctuation time series is then fuzzified into a fuzzy-fluctuation time series. Next, probabilistic linguistic logical rules can be generated from traditional fuzzy logical relationships. Lastly, based on the probabilistic linguistic term set expression of the current status and the corresponding historical statuses, distance measurement is employed to find the most proper probabilistic linguistic logical relationship for future forecasting. The advantages of this model lie in the following aspects. (1) The combination of the symmetrical characteristic of linguistic variables and the symmetrical direction of stock market fluctuation makes it convenient to describe the state and evolution trend of stock market fluctuation. Meanwhile, the fuzziness of linguistic variables helps to reduce the interference of noise information. (2) Probabilistic linguistic logical rules can uncover the internal rules that actually exist in sequential data in an unsupervised fashion.

Definition of Fuzzy-Fluctuation Time Series (FFTS)
Fuzzy time series were proposed by Song and Chissom [5][6][7]. In this section, we propose the concept of the fuzzy-fluctuation time series (FFTS) based on existing fuzzy time series.

Definition 1 (Linguistic Fuzzy Set).
Let $U = \{u_1, u_2, \ldots, u_i, \ldots, u_m\}$ be a universe of discourse. A fuzzy set $Y = \{y_0, y_1, \ldots, y_\tau\}$ in $U$ can be defined by its membership function $\mu_Y : U \rightarrow [0, 1]$, where $\mu_Y(u_i)$ denotes the grade of membership of $u_i$.

Definition 2 (Fuzzy-Fluctuation Time Series).
Let $H(t)$ $(t = 1, 2, \ldots, T)$ be a time series of real numbers, where $T$ is the length of the time series. The fluctuation time series (FTS) $F(t)$ is defined by $F(t) = H(t) - H(t-1)$ $(t = 2, 3, \ldots, T)$. Each element of $F(t)$ can be represented by a linguistic fuzzy set $X(t)$ $(t = 2, 3, \ldots, T)$ as defined in Definition 1. $X(t)$ is then called a fuzzy-fluctuation time series (FFTS) generated from the original $F(t)$.
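Definition 2 can be illustrated with a minimal Python sketch. The first-order difference follows the definition directly. The fuzzification intervals, however, are an assumption made for illustration only: the text here states just that values are mapped "according to predefined intervals", so the sketch assumes five terms $s_0, \ldots, s_4$ with equal-width intervals centred on zero and scaled by an interval length `length`. The function names and the sample prices are hypothetical.

```python
from typing import List


def fluctuation_series(h: List[float]) -> List[float]:
    """First-order difference F(t) = H(t) - H(t-1) for t = 2, ..., T."""
    return [h[t] - h[t - 1] for t in range(1, len(h))]


def fuzzify(f: float, length: float) -> int:
    """Map one fluctuation value to the index of a term s_0..s_4.

    ASSUMPTION: five intervals with boundaries at -1.5, -0.5, 0.5 and 1.5
    times `length`; the exact intervals are not given in this section.
    """
    if f < -1.5 * length:
        return 0  # s_0: large decrease
    if f < -0.5 * length:
        return 1  # s_1: decrease
    if f < 0.5 * length:
        return 2  # s_2: roughly unchanged
    if f < 1.5 * length:
        return 3  # s_3: increase
    return 4      # s_4: large increase


# Hypothetical closing prices, not taken from TAIEX.
closes = [100.0, 160.0, 150.0, 30.0, 130.0]
fts = fluctuation_series(closes)
ffts = [fuzzify(f, length=60.0) for f in fts]
```

With these toy inputs the FTS is `[60.0, -10.0, -120.0, 100.0]` and, under the assumed intervals, the FFTS indices are `[3, 2, 0, 4]`.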

Concept of Probabilistic Linguistic
In 2016, Pang et al. [28] proposed the theory of the probabilistic linguistic term set (PLTS).

Definition 4 (Probabilistic Linguistic Term Set).
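Let $S = \{s_0, s_1, \ldots, s_\tau\}$ be a linguistic term set (LTS). A PLTS can be defined as:
$$L(p) = \left\{ s_\alpha\left(p^{(\alpha)}\right) \;\middle|\; s_\alpha \in S,\ p^{(\alpha)} \ge 0,\ \alpha = 0, 1, \ldots, \tau,\ \sum_{\alpha=0}^{\tau} p^{(\alpha)} \le 1 \right\}$$
where $s_\alpha(p^{(\alpha)})$ is called a probabilistic linguistic variable (PLV), which is the linguistic term $s_\alpha$ associated with the probability $p^{(\alpha)}$.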

Definition 5 (Conversion of Fuzzy Fluctuation Logical Relationship).
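Let $X(t-n), \ldots, X(t-2), X(t-1)$ be the LHS of an nth-order FFLR. It can be converted to a probabilistic linguistic term set (PLTS) as follows.
$$L(t) = \left\{ s_\alpha\left(p^{(\alpha)}\right) \;\middle|\; p^{(\alpha)} = \frac{1}{n} \sum_{i=1}^{n} w_i,\ \alpha = 0, 1, \ldots, \tau \right\}$$
where $w_i = 1$ if the subscript of $X(t-i)$ is equal to $\alpha$ and $w_i = 0$ otherwise. $L(t)$ is called the PLTS form of the LHS (PLHS) expression of an FFLR.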
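As a small illustration of Definition 5, the conversion amounts to counting how often each linguistic term appears in the LHS and dividing by the order n. The sketch below assumes five terms $s_0, \ldots, s_4$ and represents the PLTS by its probability vector, which is the simplified form used later in the paper; the function name `to_plhs` is hypothetical.

```python
from typing import List, Sequence


def to_plhs(lhs: Sequence[int], n_terms: int = 5) -> List[float]:
    """Convert the LHS of an nth-order FFLR into a PLHS probability vector.

    `lhs` holds the subscripts of X(t-n), ..., X(t-1); entry alpha of the
    result is the share of LHS elements whose subscript equals alpha.
    """
    counts = [0] * n_terms
    for alpha in lhs:
        counts[alpha] += 1
    return [c / len(lhs) for c in counts]


# LHS of the FFLR s3, s2, s2, s2, s4 -> s2 used in the empirical analysis.
plhs = to_plhs([3, 2, 2, 2, 4])  # -> [0.0, 0.0, 0.6, 0.2, 0.2]
```

Because only the term frequencies are kept, two LHSs that contain the same terms in a different order map to the same PLHS, which is what allows similar FFLRs to be grouped.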
where $f^*$ is the linguistic scale function, which can be defined as follows.
Evidently, if $r = 1$, Equation (4) reduces to the Hamming-Hausdorff distance, and if $r = 2$, it reduces to the Euclidean-Hausdorff distance.

Empirical Analysis
Step 2. Establishment of FFLRs for Historical Training Data. According to Definition 3, each $X(t)$ $(t = n+1, n+2, \ldots, T,\ n \ge 1)$ in the historical training dataset can establish an FFLR with its related historical data as $X(t-n), \ldots, X(t-2), X(t-1) \rightarrow X(t)$.
Step 3. Conversion of FFLRs to PLLRs. According to Definition 5, each LHS of the FFLRs can be expressed by a PLHS $L(t)$. Then, we can generate the $R(t)$ for each different $L(t)$, as described in Definition 6. Thus, the FFLRs for the historical training dataset are converted into PLLRs.
Step 4. Forecasting of the Future. For each observed point $H(i)$ in the test time series, use an $L(t)$ to represent its current and related historical statuses. Then, find the most similar $h_s(p)$ among the left-hand sides of the PLLRs generated in Step 3 by distance comparison, as described in Definition 7. According to the right-hand side of the selected PLLR, calculate its score $S(h_s(p))$, as described in Definition 8. The fluctuation value of the next point can be forecasted as $F'(i+1) = S(h_s(p)) \times len$. Lastly, the forecasting value can be calculated from the observed point $H(i)$ by $H'(i+1) = H(i) + F'(i+1)$.
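Steps 2 to 4 can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the authors' implementation. The distance and score functions of Definitions 7 and 8 are not reproduced in this section, so the sketch substitutes a normalized Minkowski distance of order r on the probability vectors and a probability-weighted term index centred on the middle term; both are stand-ins. All function names, the toy FFLRs, and the observed value 1000.0 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

PLTS = Tuple[float, ...]


def to_plts(terms: Sequence[int], n_terms: int = 5) -> PLTS:
    """Relative frequency of each term index s_0..s_{n_terms-1}."""
    counts = [0] * n_terms
    for alpha in terms:
        counts[alpha] += 1
    return tuple(c / len(terms) for c in counts)


def build_fflrs(ffts: Sequence[int], n: int) -> List[Tuple[List[int], int]]:
    """Step 2: pair X(t-n), ..., X(t-1) with X(t) for every valid t."""
    return [(list(ffts[t - n:t]), ffts[t]) for t in range(n, len(ffts))]


def build_pllrs(fflrs: Sequence[Tuple[Sequence[int], int]]) -> Dict[PLTS, PLTS]:
    """Step 3: group RHS terms by PLHS and express each group as a PLTS."""
    groups = defaultdict(list)
    for lhs, rhs in fflrs:
        groups[to_plts(lhs)].append(rhs)
    return {plhs: to_plts(rhs_terms) for plhs, rhs_terms in groups.items()}


def distance(p: PLTS, q: PLTS, r: float = 1.0) -> float:
    """ASSUMPTION: normalized Minkowski distance of order r (stand-in)."""
    return (sum(abs(a - b) ** r for a, b in zip(p, q)) / len(p)) ** (1.0 / r)


def score(plts: PLTS) -> float:
    """ASSUMPTION: probability-weighted term index, centred on the middle term."""
    mid = (len(plts) - 1) / 2
    return sum((alpha - mid) * p for alpha, p in enumerate(plts))


def forecast_next(h_i: float, current: PLTS, pllrs: Dict[PLTS, PLTS],
                  length: float, r: float = 1.0) -> float:
    """Step 4: pick the PLLR with the nearest PLHS, then H'(i+1) = H(i) + S * len."""
    best = min(pllrs, key=lambda plhs: distance(plhs, current, r))
    return h_i + score(pllrs[best]) * length


# Toy FFLRs (hypothetical), written as (LHS term indices, RHS term index).
fflrs = [([3, 2, 2, 2, 4], 4), ([2, 2, 2, 4, 3], 2),
         ([2, 0, 2, 3, 2], 2), ([0, 2, 2, 2, 3], 2),
         ([3, 2, 0, 2, 2], 0), ([2, 2, 3, 0, 2], 2)]
pllrs = build_pllrs(fflrs)
current = to_plts([2, 3, 2, 0, 2])
prediction = forecast_next(1000.0, current, pllrs, length=61.87)
```

In this toy example the six FFLRs collapse into two PLLRs, and the current status matches the PLHS (0.2, 0, 0.6, 0.2, 0) exactly, so its RHS (0.25, 0, 0.75, 0, 0) drives the forecast. Because the nearest rule is always selected, a forecast is produced even when no stored PLHS matches the current status exactly.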
Step 1: Calculate the fluctuation value for each data point in the historical training dataset of TAIEX2004. Then, calculate the overall mean of the fluctuation values of the training dataset for further fuzzification. In this case, the overall mean for the historical dataset of TAIEX2004 from January to October is len = 61.87. The historical training dataset can then be represented by an FFTS, as shown in Appendix A, Table A1.
Step 2: Build nth-order FFLRs according to the relationships between each data point and its historical fluctuations (as shown in Appendix A, Table A2). For convenience, the element $s_i$ is simplified to the number i in the expression of FFLRs.
Step 3: In order to convert the FFLRs to PLLRs, the LHSs of the FFLRs in Table A2 are converted to PLHSs. Then, the RHSs of the FFLRs are grouped by PLHS and expressed by probabilistic linguistic term sets. For example, the LHS of the FFLR $s_3, s_2, s_2, s_2, s_4 \rightarrow s_2$ can be represented by a probabilistic linguistic term set and simplified as (0, 0, 0.6, 0.2, 0.2). All RHSs with the same PLHS can be grouped to $(s_4, s_2, s_2, s_2, s_4, s_0, s_4, s_4, s_2)$, which can be further converted to a probabilistic linguistic term set and simplified as (1/9, 0, 4/9, 0, 4/9) ≈ (0.11, 0, 0.44, 0, 0.44). In this way, the FFLRs in Table A2 can be converted into PLLRs, as shown in Table 1.
[...] Table A1 from date 29 October to 22 October), which can be represented by the probabilistic linguistic term set (0.2, 0, 0.6, 0.2, 0). Then, by using the distance measurement method described in Definition 8 (where the parameter r is set to 1), the closest PLLR is (0.2, 0, 0.6, 0.2, 0) → (0.25, 0, 0.75, 0, 0). Therefore, the fuzzified forecasting fluctuation can be obtained from the score of the RHS of this PLLR.
The fuzzified forecast can be defuzzified by multiplying its score by len = 61.87 to obtain the forecasted fluctuation, which is then added to the observed value H(i), as described in Step 4 of the model. The other forecasting results are shown in Table 2 and Figure 3.

From the perspective of accuracy, a performance assessment can be carried out by comparing the forecasted values with the actual values. Several indicators have been verified to be useful for such comparisons, including the mean squared error (MSE), the root of the mean squared error (RMSE), the mean absolute error (MAE), and the mean percentage error (MPE). In the definitions of these indicators, t = 1, 2, . . . , n denotes the position in the series being compared, n denotes the length of the series, and forecast(t) and actual(t) denote the forecasted value and the actual value at position t, respectively. With respect to the proposed method for the 5th-order forecasting model, the mean squared error (MSE), the root of the mean squared error (RMSE), the mean absolute error (MAE), and the mean percentage error (MPE) are 3029.91, 55.04, 5927.36, and 38.87, respectively. The RMSE forecasting errors for different nth-order models are shown in Table 3.

Table 5 shows a comparison of the RMSEs of different methods for forecasting the TAIEX 1997-2005. Note: the best forecasting results are marked in bold.
From Table 5, we can see that the performance of the method presented in this paper is stable and acceptable. Although the proposed method is not the best for every time series, its average performance is the best. In fact, accuracy is not the only criterion for judging a forecasting method. One particularly attractive aspect of the proposed method is that it can be easily implemented on a computer because it does not need human intervention. Meanwhile, the introduction of a probabilistic linguistic term set makes the model easy to understand and makes it possible to employ a distance measurement method to find the most appropriate rules for forecasting. From that point of view, it has better universality for different types of time series.
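The error indicators used in the accuracy assessment above can be sketched as follows. MSE, RMSE, and MAE follow their standard definitions. The form of MPE is an assumption: the sketch takes it as the mean of the absolute error divided by the actual value, which may differ from the definition intended in the paper. The function name and the sample values are hypothetical.

```python
import math
from typing import Dict, Sequence


def error_metrics(forecast: Sequence[float], actual: Sequence[float]) -> Dict[str, float]:
    """Compare forecast(t) with actual(t) for t = 1, ..., n."""
    n = len(actual)
    errors = [f - a for f, a in zip(forecast, actual)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # ASSUMPTION: MPE as mean absolute error relative to the actual value.
    mpe = sum(abs(e) / a for e, a in zip(errors, actual)) / n
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MPE": mpe}


# Hypothetical values, not taken from TAIEX or SHSECI.
metrics = error_metrics([102.0, 198.0], [100.0, 200.0])
```

For any series, RMSE is the square root of MSE and MAE cannot exceed RMSE, which gives a quick consistency check on reported figures.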

Forecasting Shanghai Stock Exchange Composite Index
In order to verify the universality of the proposed model, the SHSECI (Shanghai Stock Exchange Composite Index) in China is also taken as an example. We forecast the SHSECI from 2007 to 2015 with the proposed method. The real datasets for each year are likewise divided into a training dataset and a testing dataset. The RMSEs of the forecasting results for the SHSECI from 2007 to 2015 are shown in Table 6. They show that the SHSECI stock market can be successfully forecasted by the proposed model.

Conclusions
Considering the symmetrical characteristics of stock market fluctuation, this paper proposes a novel forecasting model based on symmetrical linguistic variables. The proposed model extends traditional FFLRs to PLLRs. Through such a transformation, distance measurement between probabilistic linguistic term sets can be employed to optimize the expression of forecasting rules. The main strength of the proposed method is that it can retain more detailed fluctuation information while reducing overfitting information. Meanwhile, the distance comparison function makes it easy to handle new situations and avoids a lack of matching rules. This makes it more universal than traditional models. In addition, the ability to define the order and the linguistic terms makes it more flexible than many other methods. Compared with the prediction results of previous studies, the proposed model has the advantages of stable prediction performance, simple model design, and an easy-to-understand platform. The success of forecasting the SHSECI verifies the universality of the proposed model. This model uses the simplest form of probabilistic linguistic variables to express the fuzzy fluctuation states of a time series. In a realistic stock market, there are many disturbance factors concealing the internal fluctuation law of the time series. In order to reveal this law, it is important to select a suitable method that retains the information reflecting it and removes the influence of noise as far as possible. From this perspective, the selection of the fuzzification method needs to be discussed in depth. In this regard, the existing decision-making methods can provide useful information [45]. In a follow-up study, we can make full use of the latest research results in decision-making to further optimize the prediction method. At the same time, other external factors related to the fluctuation of the stock market will also be introduced to establish new models.