A New Method for Determining the Embedding Dimension of Financial Time Series Based on Manhattan Distance and Recurrence Quantification Analysis

Identifying the embedding dimension helps in reconstructing the phase space. However, it is difficult to calculate a proper embedding dimension for the dynamics underlying financial time series. In this Letter, we propose a new method for determining the embedding dimension based on the Manhattan distance and recurrence quantification analysis. By combining the advantages of these two tools, the new method calculates the embedding dimension stably, accurately and rigorously. It also performs well on chaotic time series with high-dimensional attractors.


Introduction
According to Takens' theorem [1], reconstructing the phase space of a one-dimensional financial time series requires a suitable time delay [2] and embedding dimension [3][4][5][6]. Phase space reconstruction, first proposed by Takens, can reveal the inherent dynamic characteristics of a time series once an appropriate time delay and embedding dimension are selected. It is therefore an important method for analyzing time series [7][8][9], and it has been widely used in fields such as chemistry, transportation and geology [10][11][12][13][14].
Determining the time delay and embedding dimension is thus a necessary step in phase space reconstruction. At present, the time delay is often determined by the average mutual information method, which has high accuracy and good applicability [15], while the embedding dimension was originally determined by the false nearest neighbor (FNN) method [16]. However, the FNN method contains subjective parameters and depends strongly on the number of data points. Cao later improved the principle of the algorithm [17]. Although Cao's method makes up for the shortcomings of FNN, it is still not stable or accurate enough in determining the embedding dimension. In addition, it cannot handle chaotic financial time series with high-dimensional attractors well.
Meanwhile, the recurrence plot [18][19][20] and recurrence quantification analysis (RQA) [21][22][23], which are related to the embedding dimension, have been successfully applied to analyzing the dynamic characteristics of complex systems, and have also been applied in fields such as economics and sociology [24][25][26][27][28]. Moreover, Zbilut showed that the measures of RQA are correlated with the embedding dimension [29]. However, because the methods for calculating the time delay and other parameters were imperfect, Zbilut found only a qualitative connection and did not calculate the quantitative relationship between the RQA measures and the embedding dimension. After verifying this connection through numerical experiments, we take an RQA measure as an aid in judging the embedding dimension of a time series. In addition, the Manhattan distance is applicable in high-dimensional space: it gives results similar to those of the Euclidean distance and the maximum norm, and takes less calculation time [30]. We therefore improved the algorithm based on FNN and Cao's method by using the Manhattan distance [31][32][33]. Through these two improvements, we obtain two synergistic measures for determining the embedding dimension, thereby establishing a new calculation method with strong robustness, accuracy and applicability.
The rest of this Letter is organized as follows. Section 2 introduces the methodology, which covers the recurrence plot with its quantification, the false nearest neighbor (FNN) method with Cao's modification, and the measures for determining the embedding dimension. Section 3 describes the empirical results. Section 4 offers the conclusions.
Here, $R_m^2(i, n(i, k))$ represents the squared distance between the reconstructed vector and its nearest neighbor when the embedding dimension is $m$, and $n(i, k)$ is an integer determined by $i$ and $k$. Given a threshold $R_{tol}$, a nearest neighbor is judged to be false when the quantity in Equation (3) exceeds $R_{tol}$, and the embedding dimension is chosen as the value of $m$ at which false neighbors essentially disappear. The disadvantage of this method is that the selection of $R_{tol}$ is subjective, which can easily make the result inaccurate. Later, Cao introduced $a(i, m)$ and $E1(m)$ and computed distances with the maximum norm. This mainly removes the subjectivity of parameter selection, but it exposes a new problem: because the distance between high-dimensional time series is nonlinear, the maximum norm is not suitable for high-dimensional time series. We therefore introduce the Manhattan distance and demonstrate through experiments its stability and accuracy in processing time series with a high embedding dimension. The Manhattan distance is the $L_1$ norm of the difference between two vectors. For vectors $c = (c_1, \ldots, c_n)$ and $d = (d_1, \ldots, d_n)$,

$$D(c, d) = \sum_{k=1}^{n} |c_k - d_k|,$$

so that for $n = 2$ the distance between the two vectors is $|c_1 - d_1| + |c_2 - d_2|$. Introducing the Manhattan distance into $a(i, m)$ gives the modified measure of our method, which is reported as E11 in the experiments.
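To illustrate how the Manhattan distance can enter Cao's construction, the following Python sketch computes a Cao-style E1(m) with a selectable norm. It assumes Cao's standard definitions: a(i, m) is the ratio of the nearest-neighbor distance in dimension m + 1 to that in dimension m, E(m) is its mean, and E1(m) = E(m + 1)/E(m). The function name `cao_e1`, its parameters, and the brute-force neighbor search are illustrative choices, not taken from this Letter, whose exact formula may differ.

```python
import numpy as np


def cao_e1(x, m_max, tau=1, norm="manhattan"):
    """Cao-style E1(m) for m = 1..m_max (hypothetical helper, not the authors' code).

    norm="manhattan" uses the L1 distance; any other value uses the maximum norm.
    """
    x = np.asarray(x, dtype=float)

    def mean_a(m):
        # Number of delay vectors that exist in both dimension m and m + 1.
        count = len(x) - m * tau
        # cols[k] holds the k-th coordinate x(i + k*tau) of every delay vector.
        cols = [x[k * tau : k * tau + count] for k in range(m + 1)]

        # Pairwise distances in dimension m, accumulated coordinate by coordinate.
        dist = np.zeros((count, count))
        for col in cols[:m]:
            d = np.abs(col[:, None] - col[None, :])
            dist = dist + d if norm == "manhattan" else np.maximum(dist, d)

        # Exclude self-matches and exact duplicates (zero distance).
        dist[dist == 0] = np.inf
        nn = dist.argmin(axis=1)
        d_m = dist[np.arange(count), nn]

        # Distance to the same neighbor after adding the (m + 1)-th coordinate.
        extra = np.abs(cols[m] - cols[m][nn])
        d_m1 = d_m + extra if norm == "manhattan" else np.maximum(d_m, extra)

        valid = np.isfinite(d_m)          # drop points without a usable neighbor
        return np.mean(d_m1[valid] / d_m[valid])

    e = np.array([mean_a(m) for m in range(1, m_max + 2)])
    return e[1:] / e[:-1]                 # E1(m) = E(m + 1) / E(m)
```

Under these assumptions, the embedding dimension would be read off as the value of m beyond which the returned sequence stops changing. The pairwise distance matrix needs memory quadratic in the series length, so the sketch suits series of a few thousand points at most.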

Indicator 2
To make the test more reliable, Kennel proposed Equations (7) and (8) as a further criterion. However, this criterion has shortcomings: for example, the results are strongly affected by the subjective choice of thresholds, and it cannot identify chaotic time series.
where $A_{tol}$ is a threshold, $R_A$ is the distance between the average value and the actual values, and $\bar{x}$ is the average value of the series. To overcome these shortcomings, Cao introduced a new measure, E2(m), which can distinguish deterministic signals from random signals. However, the above methods have limitations. Through experiments, we found that once the measure begins to stabilize, its changes become very small, so it is impossible to determine at which point the optimal embedding dimension is reached.

Meanwhile, Webber Jr found that the embedding dimension is related to the quantification of the recurrence plot. The recurrence plot is a two-dimensional representation that can reveal the inherent determinism, correlation, and periodicity of a time series. By selecting the time delay and embedding dimension, a time series $x(i)$, $i = 1, 2, \ldots, n$, is reconstructed as

$$V_i = \big(x(i),\, x(i + \tau),\, \ldots,\, x(i + (m - 1)\tau)\big), \quad i = 1, 2, \ldots, N,$$

where $V_i$ represents the $i$-th state, $N = n - (m - 1)\tau$ is the total number of reconstructed states, $m \geq 2$ is the embedding dimension, and $\tau \geq 1$ is the time delay.
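The reconstruction step can be sketched in a few lines of Python. This is a minimal illustration that assumes the series is held in a NumPy array; the function name `delay_embed` is hypothetical.

```python
import numpy as np


def delay_embed(x, m, tau):
    """Return the N x m matrix whose i-th row is (x(i), x(i+tau), ..., x(i+(m-1)tau))."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (m - 1) * tau      # N = n - (m - 1) * tau
    if n_vectors < 1:
        raise ValueError("series too short for this m and tau")
    # Column k is the series shifted by k * tau samples.
    return np.column_stack([x[k * tau : k * tau + n_vectors] for k in range(m)])


V = delay_embed(np.arange(1, 11), m=3, tau=2)
# V has 10 - (3 - 1) * 2 = 6 rows; the first row is (1, 3, 5) and the last is (6, 8, 10).
```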
The recurrence plot is drawn from a distance matrix. Its elements are defined by

$$R_{ij} = H\big(\varepsilon - \lVert V_i - V_j \rVert\big), \quad i, j = 1, 2, \ldots, N,$$

where $H(x)$ is the Heaviside function ($H(x) = 1$ if $x > 0$ and $H(x) = 0$ if $x < 0$), $\varepsilon$ is the threshold, $i$ is the row index, and $j$ is the column index. When $R_{ij} = 1$, the point is drawn as a black dot in the recurrence plot; otherwise, it is drawn as a white dot. The threshold is an empirical value. This process transforms the distance matrix into a 0-1 matrix, which is then visualized as a two-dimensional plot.
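The thresholding step can be sketched as follows. The sketch assumes the reconstructed states are the rows of a two-dimensional array and uses the Euclidean norm, which is an assumption, since the norm used for the recurrence plot is not stated here. The function name `recurrence_matrix` is hypothetical.

```python
import numpy as np


def recurrence_matrix(V, eps):
    """0-1 recurrence matrix: R[i, j] = 1 when ||V_i - V_j|| < eps (Euclidean norm assumed)."""
    V = np.asarray(V, dtype=float)           # shape (N, m), one state per row
    diff = V[:, None, :] - V[None, :, :]     # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))  # N x N distance matrix
    return (dist < eps).astype(int)          # Heaviside step applied to eps - dist


R = recurrence_matrix(np.array([[0.0], [0.1], [1.0]]), eps=0.5)
# The first two states are within 0.5 of each other; the third recurs only with itself.
```

The resulting matrix can be passed to any image-plotting routine to obtain the black-and-white recurrence plot.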
The determinism (DET) is obtained by quantifying the recurrence plot. It is the ratio of the recurrence points that form diagonal structures of length at least $l_{min}$ to the total number of recurrence points. When the dynamic behavior of two systems is weakly correlated or uncorrelated, only very short diagonal structures are produced. Its basic definition is

$$DET = \frac{\sum_{l = l_{min}}^{N} l\,P(l)}{\sum_{l = 1}^{N} l\,P(l)},$$

where $l_{min}$ is the minimum length of a diagonal structure in the recurrence plot (here $l_{min} = 2$), and $P(l)$ is the probability of a diagonal structure of length $l$ in the recurrence plot.
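A direct way to evaluate DET from a 0-1 recurrence matrix is to scan each diagonal and collect the lengths of its runs of ones. The sketch below does this in Python. It excludes the main diagonal (the line of identity), which is a common RQA convention but an assumption here, and the function name `determinism` is hypothetical.

```python
import numpy as np


def determinism(R, l_min=2):
    """Share of off-diagonal recurrence points lying on diagonal lines of length >= l_min."""
    R = np.asarray(R)
    n = R.shape[0]
    on_lines = 0   # recurrence points on diagonal lines with length >= l_min
    total = 0      # all recurrence points off the main diagonal
    for k in range(-(n - 1), n):
        if k == 0:
            continue                      # skip the line of identity (assumed convention)
        run = 0
        # The trailing 0 is a sentinel that closes a run reaching the matrix edge.
        for v in list(np.diagonal(R, offset=k)) + [0]:
            if v:
                run += 1
            else:
                total += run
                if run >= l_min:
                    on_lines += run
                run = 0
    return on_lines / total if total else 0.0


# Small symmetric example: one diagonal line of length 3 and one isolated point per triangle.
R_demo = np.eye(4, dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (0, 3)]:
    R_demo[i, j] = R_demo[j, i] = 1
det_demo = determinism(R_demo)   # 6 of the 8 off-diagonal points lie on lines, so 0.75
```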
After the experiments, we determined the relationship between DET and the embedding dimension, which is shown in Figure 1. Having confirmed the view of Joseph L. Webber Jr, we propose using DET to assist in determining the embedding dimension.

Experiment of Normal Examples
First, we demonstrate the accuracy of our method for determining the embedding dimension. To this end, we select data with a known minimum embedding dimension of 3.
The data are the x-component values of the Lorenz attractor with parameters σ = 10, r = 28, b = 8/3, the same as those considered in [9]. We numerically integrate the equations with an integration step of 0.01. The results are shown in Figure 2 and Table 1, and they confirm the accuracy of our method.
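For readers who wish to generate comparable data, the following Python sketch integrates the Lorenz equations with the stated parameters and step of 0.01. The fourth-order Runge-Kutta scheme, the initial state, and the number of discarded transient steps are assumptions made for illustration, since the integration scheme is not specified here. The function name `lorenz_x` is hypothetical.

```python
import numpy as np


def lorenz_x(n_steps, dt=0.01, sigma=10.0, r=28.0, b=8.0 / 3.0,
             state=(1.0, 1.0, 1.0), n_discard=1000):
    """x-component of the Lorenz system, integrated with classical RK4 (assumed scheme)."""

    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (r - z) - y, x * y - b * z])

    s = np.array(state, dtype=float)
    xs = np.empty(n_steps)
    for i in range(n_discard + n_steps):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        if i >= n_discard:                # keep samples only after the transient
            xs[i - n_discard] = s[0]
    return xs
```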
As the embedding dimension increases, our E11 measure gradually stabilizes. In addition, DET, which helps us to determine the embedding dimension, reaches its saturation value at an embedding dimension of 3, where E11 becomes stable. Therefore, our method can determine the embedding dimension of the time series more accurately.

Figure 2. The values of E1, E2 and E11 for data from the Lorenz attractor, which exhibits uncertain, unrepeatable, and unpredictable chaotic behavior. E1 and E2 are the measures of Cao's method. E11 is the measure of our method.

After discussing the feasibility of our improved method, we next compare it with the traditional FNN method. As in [9], we select data from the following equation:

$$t_{n+4} = \sin(t_n + 5) + \sin(2t_{n+1} + 5) + \sin(3t_{n+2} + 5) + \sin(4t_{n+3} + 5). \quad (15)$$

Rewriting it as the following prediction model shows that its embedding dimension is 4:

$$t_{n+4} = W(t_n, t_{n+1}, t_{n+2}, t_{n+3}) = \sin(t_n + 5) + \sin(2t_{n+1} + 5) + \sin(3t_{n+2} + 5) + \sin(4t_{n+3} + 5). \quad (16)$$

The results of the comparison experiments are shown in Figures 3 and 4 and Table 2.
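The series defined by Equation (15) can be generated by direct iteration. The Python sketch below assumes that the first term reads sin(t_n + 5), as the prediction model W(t_n, t_{n+1}, t_{n+2}, t_{n+3}) in Equation (16) suggests. The initial values and the number of discarded transient points are arbitrary illustrative choices, and the function name `sine_map_series` is hypothetical.

```python
import math


def sine_map_series(n_points, init=(0.1, 0.2, 0.3, 0.4), n_discard=100):
    """Iterate t_{n+4} = sin(t_n+5) + sin(2t_{n+1}+5) + sin(3t_{n+2}+5) + sin(4t_{n+3}+5)."""
    t = list(init)
    while len(t) < n_points + n_discard:
        # t[-4], t[-3], t[-2], t[-1] are t_n, t_{n+1}, t_{n+2}, t_{n+3}.
        t.append(math.sin(t[-4] + 5) + math.sin(2 * t[-3] + 5)
                 + math.sin(3 * t[-2] + 5) + math.sin(4 * t[-1] + 5))
    return t[n_discard:]                  # drop the initial transient


series = sine_map_series(1000)            # 1000 points, as in the comparison with FNN
```

Each new value depends on exactly the four preceding values, which is why the embedding dimension of this series is known to be 4 when the time delay is 1.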
In this example, the time delay is 1. Our E11 measure clearly becomes stable after the embedding dimension reaches 4. At the same time, the DET value exceeds 0.9 and tends towards 1. This demonstrates that our method identifies the embedding dimension of this time series well. In contrast, the FNN method with $A_{tol} = 3$ and $R_{tol} = 9$ cannot accurately identify the embedding dimension when faced with a time series of 1000 data points.

Next, we discuss the stability and applicability of the method for time series from a high-dimensional attractor. Until now, many methods have only been tested on time series from low-dimensional systems, so methods for determining the embedding dimension still need to be improved. As demonstrated in [10], Cao tested data from the Mackey-Glass delay-differential equation, which has a high-dimensional attractor, but the result was unsatisfactory.
As shown in Figure 5, there is a sudden dip at d = 15; however, Cao could not explain this phenomenon and hoped to investigate it in future work. To address this problem, we apply our measures E11 and DET from the modified method to the above Mackey-Glass series, and obtain the results in Figure 6 and Table 3.

Figure 5. Same as Figure 2, but the data come from the Mackey-Glass delay-differential equation.

Figure 6. The value of E11 for the same data as in Figure 5.

Cao's method has advantages over the traditional FNN method: it can identify whether the time series is random through the quantity E2. However, E2 is only useful for this identification.
Next, we explore the role of the E11 indicator in the method. We selected the CSI 500 Index from 2016 to 2021 as the experimental data. Considering the impact of the 2018 market adjustment on the embedding dimension of the time series, we divided the data into three parts (before, during and after 2018) and analyzed the embedding dimension of each part separately. The experimental results are shown in Figures 7-9. Before and after 2018, the minimum embedding dimension is around 5-6, while during 2018, due to the market adjustment, it is around 3-4. This is a change of 33-40% (from 6 to 4 is a decrease of 33%, and from 5 to 3 is a decrease of 40%), which indicates that the market adjustment has a significant impact on the embedding dimension; the embedding dimension of the whole time series is therefore not constant.

Next, we explore the advantages of the new method on the experimental data. In our experiments, we found that when E1 tends to stabilize, E2 does not help us to determine when the embedding dimension stops increasing. We used the daily closing values of the CSI 500 Index from 2016 to 2021, a chaotic time series, as the experimental data. The experimental results are shown in Figure 10. The quantity E1 becomes stable when the embedding dimension is 5 to 6; however, the quantity E2 tends to be stable once the embedding dimension exceeds 3.

We remedied this shortcoming by using the RQA quantity DET. The results of experiments on the above data are shown in Figure 11 and Table 4. As shown in Figure 11, the quantity E11 becomes stable when the embedding dimension reaches 4 to 6. From Table 4, the quantity DET exceeds 0.9 once the embedding dimension reaches 5, and the determinism becomes very high. Therefore, we can effectively determine the optimal embedding dimension of the time series through the quantity DET.

Figure 11. The value of E11 for the same data as in Figure 10.

Table 4. The value of DET for the same time series as in Figure 10.

Conclusions
To address the shortcomings of traditional methods for finding the embedding dimension of a time series, we propose a new and effective method that builds on them and apply it to financial time series. Experimental analysis shows that this method has advantages in accuracy and stability when applied to theoretical and financial time series with high-dimensional chaotic attractors. Financial systems are complex and full of systemic risk and nonlinear characteristics.
In further study, we hope that our new method will prove useful when nonlinear techniques are applied to more financial time series as well as artificial time series. Next, we will address non-stationary data in a stable way and analyze multi-scale and multi-domain financial time series, in order to reveal the dynamic characteristics of the financial system more accurately.