Composite Multiscale Partial Cross-Sample Entropy Analysis for Quantifying Intrinsic Similarity of Two Time Series Affected by Common External Factors

In this paper, we propose a new cross-sample entropy, namely the composite multiscale partial cross-sample entropy (CMPCSE), for quantifying the intrinsic similarity of two time series affected by common external factors. First, in order to test the validity of CMPCSE, we apply it to three sets of artificial data. Experimental results show that CMPCSE can accurately measure the intrinsic cross-sample entropy of two simultaneously recorded time series by removing the effects from the third time series. Then CMPCSE is employed to investigate the partial cross-sample entropy of Shanghai securities composite index (SSEC) and Shenzhen Stock Exchange Component Index (SZSE) by eliminating the effect of Hang Seng Index (HSI). Compared with the composite multiscale cross-sample entropy, the results obtained by CMPCSE show that SSEC and SZSE have stronger similarity. We believe that CMPCSE is an effective tool to study intrinsic similarity of two time series.


Introduction
Complex systems with interacting constituents exist in all aspects of nature and society, such as geophysics [1], solid state physics, climate systems, ecosystems, financial systems [2,3], and so forth. These complex systems constantly generate a large number of time signals. Fortunately, in recent decades, numerous creative methods have been proposed to explore the operating mechanisms of these complex systems. Among them, entropy-based methods are powerful modern analysis techniques. The concept of 'entropy' was first proposed by Clausius to deal with thermodynamic problems, and then Boltzmann gave a microscopic explanation from the perspective of statistical mechanics and proposed Boltzmann entropy. Gibbs proposed Gibbs entropy when dealing with uncertain systems. In 1948, Shannon introduced the concept of entropy into information theory and put forward Shannon entropy (information entropy) [4]. Shortly after that, Rényi extended it and proposed Rényi entropy [5]. In 1988, Tsallis gave a generalization of Boltzmann-Gibbs statistics and proposed Tsallis entropy [6]. Although Gibbs entropy and Shannon entropy have the same mathematical expression, Shannon entropy has a broader meaning than thermodynamic entropy, as all the basic laws of thermodynamics can be derived from information entropy [7]. Since information entropy was proposed, many entropy-based methods have been developed to explore the complexity of systems by studying the time series they generate [8,9].
When two time series are affected by common external factors, a result that does not account for the common third-party force may not reflect their intrinsic relationship [34][35][36]. Fortunately, Baba et al. [37] found that if the effects of the external factors on two time series are additive, the level of intrinsic cross-correlation between the two time series can be measured by the partial cross-correlation coefficient. In 2015, Yuan et al. [38] and Qian et al. [39] introduced partial cross-correlation analysis to deal with this kind of situation from different starting points.
Inspired by the above works, in this paper we propose the composite multiscale partial cross-sample entropy (CMPCSE) to measure the intrinsic similarity of two time series that are simultaneously affected by a common external factor. We first test CMPCSE on three sets of artificial data and find that it can reveal the intrinsic similarity of the time series generated by the models. We then apply it to a set of stock market indices.

Composite Multiscale Partial Cross-Sample Entropy
In this section, based on CMCSE [26], we propose a new method, the composite multiscale partial cross-sample entropy (CMPCSE), which can be used to quantify the intrinsic similarity of two time series linearly affected by a common external factor.
Consider two simultaneously recorded time series, $\{x(t) : t = 1, 2, ..., N\}$ and $\{y(t) : t = 1, 2, ..., N\}$, that are linearly affected by $\{z(t) : t = 1, 2, ..., N\}$. The main steps of CMPCSE are as follows:
Step 1: First we eliminate the effect of $z(t)$ on $x(t)$ and $y(t)$, respectively. The additive models for $x(t)$ and $y(t)$ can be given respectively as
$$x(t) = \beta_{x,0} + \beta_x z(t) + r_x(t), \qquad y(t) = \beta_{y,0} + \beta_y z(t) + r_y(t),$$
where $t = 1, 2, ..., N$. When using regression analysis to estimate the values of $r_x(t)$ and $r_y(t)$ in a window of length $s$, we use the idea in MF-TWXDFA [40] to remove the effect of the sequence $z(t)$ on $x(t)$ and $y(t)$ point by point as follows. For a given integer $s$ ($s \geq 2$), the points $j$ contained in the sliding window $MW_i$ corresponding to the point $i$ should satisfy $|i - j| \leq s$. For time series of different lengths, we take different values of $s$; usually the value of $s$ is determined by experience. Accordingly, the weight function of the geographically weighted regression model is
$$\omega_{ij} = \begin{cases} \left[1 - \left(\dfrac{i - j}{s}\right)^2\right]^2, & |i - j| \leq s, \\ 0, & \text{otherwise}. \end{cases}$$
In the window $MW_i$, we perform linear regression for $\{\omega_{ij} x_j\}$ on $\{z_j\}$ and for $\{\omega_{ij} y_j\}$ on $\{z_j\}$, respectively, and obtain the regression values $\hat{x}(z_i)$ and $\hat{y}(z_i)$ of $x(i)$ and $y(i)$. Then we get the corresponding estimates of $r_x(t)$ and $r_y(t)$:
$$\hat{r}_x(t) = x(t) - \hat{x}(z_t), \qquad \hat{r}_y(t) = y(t) - \hat{y}(z_t).$$
The normalized data of $\hat{r}_x(t)$ and $\hat{r}_y(t)$ are then defined as $r_x(t) = (\hat{r}_x(t) - \langle \hat{r}_x(t) \rangle)/\delta_{\hat{r}_x(t)}$ and $r_y(t) = (\hat{r}_y(t) - \langle \hat{r}_y(t) \rangle)/\delta_{\hat{r}_y(t)}$, respectively. Here $\langle \cdot \rangle$ and $\delta$ are the corresponding mean and standard deviation. Next, we calculate the CMCSE of $r_x(t)$ and $r_y(t)$.
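Step 2: Construct coarse-grained time series from the series $r_x(t)$ and $r_y(t)$ with the scale factor $\tau$, respectively. Then we get $\{u^{\tau}_k\}$ and $\{v^{\tau}_k\}$. The $j$-th point of the $k$-th coarse-grained time series at a scale factor of $\tau$ is defined as
$$u^{\tau}_k(j) = \frac{1}{\tau} \sum_{i=(j-1)\tau+k}^{j\tau+k-1} r_x(i), \qquad v^{\tau}_k(j) = \frac{1}{\tau} \sum_{i=(j-1)\tau+k}^{j\tau+k-1} r_y(i),$$
for $1 \leq k \leq \tau$ and all $j \geq 1$ such that $j\tau + k - 1 \leq N$. For scale one ($\tau = 1$), the time series $u^1_1$ and $v^1_1$ are the original series $r_x$ and $r_y$. For $\tau > 1$, Figures 1 and 2 show two more intuitive examples of the coarse-graining procedure.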
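Step 3: For each $k$, construct vector sequences of length $m$ from the two $k$-th coarse-grained time series:
$$\mathbf{u}^{\tau,m}_k(i) = \{u^{\tau}_k(i), u^{\tau}_k(i+1), ..., u^{\tau}_k(i+m-1)\}, \qquad \mathbf{v}^{\tau,m}_k(j) = \{v^{\tau}_k(j), v^{\tau}_k(j+1), ..., v^{\tau}_k(j+m-1)\},$$
where $1 \leq i, j \leq N_{\tau} - m$ and $N_{\tau}$ is the length of the coarse-grained series. Let ${}^{m}n^{\tau}_k(i)$ be the number of vectors $\mathbf{v}^{\tau,m}_k(j)$ whose distance from $\mathbf{u}^{\tau,m}_k(i)$, measured as the maximum absolute difference between corresponding components, is within the tolerance $r$. Then ${}^{m}n^{\tau}_k = \sum_i {}^{m}n^{\tau}_k(i)$ represents the total number of $m$-dimensional matched vector pairs and is obtained from the two $k$-th coarse-grained time series at a scale factor of $\tau$. Similarly, ${}^{m+1}n^{\tau}_k$ is the total number of matches of length $m + 1$. Finally, the CMPCSE is calculated with the equation
$$\mathrm{CMPCSE}(x, y : z, \tau, m, r) = \frac{1}{\tau^*} \sum_{k} \left( -\ln \frac{{}^{m+1}n^{\tau}_k}{{}^{m}n^{\tau}_k} \right),$$
where the sum runs over those $k \in \{1, ..., \tau\}$ for which $-\ln\left({}^{m+1}n^{\tau}_k / {}^{m}n^{\tau}_k\right)$ makes sense, and $\tau^*$ is the number of such $k$ at a scale factor $\tau$.
Figure 1. Schematic illustration of the coarse-grained procedure of composite multiscale partial cross-sample entropy (CMPCSE) when τ = 2. Modified from Reference [24].
Figure 2. Schematic illustration of the coarse-grained procedure of CMPCSE when τ = 3. Modified from Reference [24].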
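Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bisquare kernel used as the window weight, the use of ordinary weighted least squares inside each window, and the function names `bisquare_weights`, `remove_common_factor` and `composite_coarse_grain` are all assumptions made for this example.

```python
import numpy as np

def bisquare_weights(i, idx, s):
    """Bisquare kernel over the window |i - j| <= s (assumed weight function)."""
    return (1.0 - ((i - idx) / s) ** 2) ** 2

def remove_common_factor(x, z, s):
    """Step 1 sketch: pointwise weighted regression of x on z, then
    return the residual series normalized to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(x)
    resid = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - s), min(n, i + s + 1)
        w = bisquare_weights(i, np.arange(lo, hi), s)
        # Weighted least squares for x_j ~ a + b * z_j inside the window
        # (assumption: standard WLS, solved by scaling rows with sqrt(w)).
        design = np.column_stack([np.ones(hi - lo), z[lo:hi]])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], x[lo:hi] * sw, rcond=None)
        resid[i] = x[i] - (coef[0] + coef[1] * z[i])
    return (resid - resid.mean()) / resid.std()

def composite_coarse_grain(series, tau):
    """Step 2 sketch: return the tau coarse-grained series for one scale.
    Entry k (zero-based) starts averaging at offset k."""
    series = np.asarray(series, dtype=float)
    out = []
    for k in range(tau):
        n_blocks = (len(series) - k) // tau
        blocks = series[k:k + n_blocks * tau].reshape(n_blocks, tau)
        out.append(blocks.mean(axis=1))
    return out
```

With `s` chosen by experience, as in the text, `remove_common_factor(x, z, s)` and `remove_common_factor(y, z, s)` give the two normalized residual series, and `composite_coarse_grain(r, tau)` returns the list of coarse-grained series for a single scale factor.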
A more intuitive procedure of CMPCSE is shown in Figure 3.
In this paper, the entropies are calculated from scale 1 to 20, that is, $\tau = 1, 2, 3, ..., 20$. The cross-sample entropy of each pair of coarse-grained series is calculated with $m = 2$ and $r = r^*$, where $r^*$ is the value selected from the candidate set $\{0.05, 0.1, 0.15, ..., 0.95\}$ according to the criterion proposed by Lake et al. [16].
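The matching and averaging in Step 3 can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the comparison `<= r` is one reading of "within the tolerance", and the default `r=0.2` is a placeholder rather than the $r^*$ selected by the criterion of Lake et al. Coarse-grained series whose logarithm is undefined (zero matches) are skipped, so the average is taken over the remaining ones.

```python
import numpy as np

def _templates(series, length, count):
    """Stack the first `count` overlapping templates of the given length."""
    return np.stack([series[i:i + count] for i in range(length)], axis=1)

def count_matches(u, v, length, count, r):
    """Count template pairs (one from u, one from v) whose maximum
    absolute component difference is at most r."""
    tu = _templates(u, length, count)
    tv = _templates(v, length, count)
    total = 0
    for row in tu:
        total += int(np.sum(np.max(np.abs(tv - row), axis=1) <= r))
    return total

def cross_sample_entropy(u, v, m=2, r=0.2):
    """Return -ln(A/B), or None when the logarithm is undefined."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    count = min(len(u), len(v)) - m      # same template count for m and m + 1
    if count < 1:
        return None
    b = count_matches(u, v, m, count, r)
    a = count_matches(u, v, m + 1, count, r)
    if a == 0 or b == 0:
        return None
    return float(-np.log(a / b))

def composite_cross_sample_entropy(rx, ry, tau, m=2, r=0.2):
    """Average the cross-sample entropy over the tau coarse-grained pairs
    for which it is defined; return NaN if none is defined."""
    rx = np.asarray(rx, dtype=float)
    ry = np.asarray(ry, dtype=float)
    values = []
    for k in range(tau):
        n_blocks = (min(len(rx), len(ry)) - k) // tau
        u = rx[k:k + n_blocks * tau].reshape(n_blocks, tau).mean(axis=1)
        v = ry[k:k + n_blocks * tau].reshape(n_blocks, tau).mean(axis=1)
        value = cross_sample_entropy(u, v, m, r)
        if value is not None:
            values.append(value)
    return float(np.mean(values)) if values else float("nan")
```

Applying `composite_cross_sample_entropy` to the normalized residuals for each scale from 1 to 20 would give one entropy curve of the kind discussed in the following sections; the pairwise matching is quadratic in the series length, so this sketch favors clarity over speed.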
Figure 3. Flow chart of the CMPCSE algorithm.

Numerical Experiments for Artificial Time Series
In this section, we use an additive model of $x$ and $y$ as in Equation (9) to perform numerical simulations and verify the effectiveness of CMPCSE.
In the following numerical simulations, the series $r_x(t)$ and $r_y(t)$ are generated from bivariate fractional Brownian motions (BFBMs), the two-component ARFIMA process and multifractal binomial measures, respectively, and all the third-party interference series $z(t)$ are pink ($1/f$) noise generated by the DSP System Toolbox in MATLAB 2016. In the experiments, all results for sequences with random terms are averages over 100 repetitions with series length $N = 2^{12}$.
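For readers without MATLAB, the construction of a test pair can be approximated in Python. The sketch below is an assumption-laden stand-in, not the setup used in the paper: it replaces the DSP System Toolbox generator with simple FFT spectral shaping, and since the coefficients of Equation (9) are not reproduced here, `beta0` and `beta1` are hypothetical placeholders, as are the function names.

```python
import numpy as np

def pink_noise(n, rng):
    """Approximate 1/f noise by scaling white-noise Fourier amplitudes
    by f**-0.5, so that the power spectrum falls off roughly as 1/f."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    scale = np.zeros_like(freqs)      # zero at f = 0 removes the mean
    scale[1:] = freqs[1:] ** -0.5
    pink = np.fft.irfft(spectrum * scale, n=n)
    return pink / pink.std()

def add_common_factor(r, z, beta0=0.0, beta1=1.0):
    """Additive model: observed series = beta0 + beta1 * z + intrinsic series r.
    beta0 and beta1 are placeholders, not the values used in the paper."""
    return beta0 + beta1 * np.asarray(z, dtype=float) + np.asarray(r, dtype=float)

rng = np.random.default_rng(0)
n = 2 ** 12
z = pink_noise(n, rng)
rx = rng.standard_normal(n)           # stand-in for an intrinsic series
x = add_common_factor(rx, z)
```

Repeating such a construction with fresh random draws and averaging the resulting entropy curves mirrors the 100-repetition averaging described above.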

Bivariate Fractional Brownian Motion (BFBMs)
In this subsection, in order to test the performance of CMPCSE, we first use it to calculate the partial cross-sample entropy of BFBMs for two parameter sets of the above additive model (Equation (9)). Here $r_x$ and $r_y$ are the incremental series of the two components of BFBMs with Hurst indices $H_{r_x}$ and $H_{r_y}$. BFBMs have been studied extensively. It is known that a BFBM is a monofractal process and that the relationship $H_{r_x r_y} = (H_{r_x} + H_{r_y})/2$ holds [41][42][43]. Wei et al. studied the long-range power-law cross-correlations between $r_x$ and $r_y$ in 2017 [40]. In the simulations, we set: (left) $H_{r_x} = 0.6$, $H_{r_y} = 0.7$, $\rho = 0.7$; (right) $H_{r_x} = 0.6$, $H_{r_y} = 0.9$, $\rho = 0.7$; where $\rho$ is the cross-correlation coefficient between $r_x$ and $r_y$.
We apply the CMPCSE method to the series simulated by BFBMs and pink noise. Figure 4 shows the results for the series simulated by pink noise and BFBMs with (left) $H_{r_x} = 0.6$, $H_{r_y} = 0.7$, $\rho = 0.7$; (right) $H_{r_x} = 0.6$, $H_{r_y} = 0.9$, $\rho = 0.7$. Figure 4 shows that the entropy values of $x - y : z$ and $r_x - r_y$ are very close at all time scales, whereas there is an obvious discrepancy between the values of $x - y : z$ and $x - y$ except when the time scale equals 1. This indicates that, when $r_x$ and $r_y$ are simultaneously affected by the third-party factor $z$, the CMPCSE method can capture the intrinsic cross-sample entropy of $r_x$ and $r_y$ by eliminating the influence of $z$.
Figure 4. Entropy measures of $x - y$, $r_x - r_y$ and $x - y : z$ for BFBMs and pink noise (left: $H_{r_x} = 0.6$, $H_{r_y} = 0.7$; right: $H_{r_x} = 0.6$, $H_{r_y} = 0.9$).

TWO-Component ARFIMA Process
The ARFIMA process is a monofractal process [40] and is often used to model power-law auto-correlations in stochastic variables [44]. It is defined as follows:
$$g(t) = G(d, t) + \varepsilon_g(t), \tag{11}$$
where $d \in (0, 0.5)$ is a memory parameter, $\varepsilon_g$ is an independent and identically distributed Gaussian variable, and $G(d, t) = \sum_{n=1}^{\infty} a_n(d)\, g(t - n)$, in which $a_n(d)$ is the weight $a_n(d) = d\,\Gamma(n - d)/[\Gamma(1 - d)\,\Gamma(n + 1)]$. The Hurst index $H_{GG}$ is related to the memory parameter by $H_{GG} = 0.5 + d$ [45,46]. For the two-component ARFIMA processes discussed below, we take $G = X$ or $Y$. The two-component ARFIMA process is defined as follows [47]:
$$r_x(t) = W X(d_1, t) + (1 - W) Y(d_2, t) + \varepsilon_{r_x}(t),$$
$$r_y(t) = (1 - W) X(d_1, t) + W Y(d_2, t) + \varepsilon_{r_y}(t),$$
with $X(d_1, t) = \sum_{n=1}^{\infty} a_n(d_1)\, r_x(t - n)$ and $Y(d_2, t) = \sum_{n=1}^{\infty} a_n(d_2)\, r_y(t - n)$, where $W \in [0.5, 1]$ quantifies the coupling strength between the two processes $r_x(t)$ and $r_y(t)$. When $W = 1$, $r_x(t)$ and $r_y(t)$ are fully decoupled and become two separate ARFIMA processes as defined in Equation (11). The cross-correlation between $r_x(t)$ and $r_y(t)$ increases when $W$ decreases from 1 to 0.5 [47].
In our calculations, we choose $W = 0.8$ and the ARFIMA parameters $(d_1, d_2)$ as $d_1 = 0.1$, $d_2 = 0.2$ and $d_1 = 0.1$, $d_2 = 0.4$, respectively, and the two corresponding error terms $\varepsilon_{r_x}(t)$ and $\varepsilon_{r_y}(t)$ share one independent and identically distributed Gaussian variable with zero mean and unit variance. The CMPCSE method was applied to the series simulated by the two-component ARFIMA process and pink noise. Figure 5 also shows that the entropy values of $x - y : z$ and $r_x - r_y$ are very close at all time scales, whereas there is an obvious discrepancy between the values of $x - y : z$ and $x - y$ except when the time scale equals 1. This again means that, when $r_x$ and $r_y$ are simultaneously affected by the third-party factor $z$, one can use CMPCSE to obtain the intrinsic cross-sample entropy of $r_x$ and $r_y$.
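A two-component ARFIMA pair of this kind can be simulated with a truncated memory sum. The sketch below is only illustrative: truncating the infinite sum at `n_terms` lags, discarding an initial burn-in of the same length, and the function names are assumptions of this example. The weights use the recursion $a_1 = d$, $a_{n+1} = a_n (n - d)/(n + 1)$, which follows from the Gamma-function expression for $a_n(d)$, and both series share one Gaussian noise sequence as described above.

```python
import numpy as np

def arfima_weights(d, n_terms):
    """Weights a_n(d) = d * Gamma(n - d) / (Gamma(1 - d) * Gamma(n + 1)),
    computed recursively to avoid large Gamma values."""
    a = np.empty(n_terms)
    a[0] = d                                   # a_1 = d
    for n in range(1, n_terms):
        a[n] = a[n - 1] * (n - d) / (n + 1)    # a_{n+1} from a_n
    return a

def two_component_arfima(n, d1, d2, w, rng, n_terms=1000):
    """Simulate coupled series (r_x, r_y) with coupling w in [0.5, 1].
    The memory sum is truncated at n_terms lags (assumption)."""
    a1 = arfima_weights(d1, n_terms)
    a2 = arfima_weights(d2, n_terms)
    total = n + n_terms                        # extra samples used as burn-in
    eps = rng.standard_normal(total)           # shared noise term
    rx = np.zeros(total)
    ry = np.zeros(total)
    for t in range(total):
        k = min(t, n_terms)
        past_x = rx[t - k:t][::-1]             # r_x(t-1), r_x(t-2), ...
        past_y = ry[t - k:t][::-1]
        big_x = a1[:k] @ past_x
        big_y = a2[:k] @ past_y
        rx[t] = w * big_x + (1.0 - w) * big_y + eps[t]
        ry[t] = (1.0 - w) * big_x + w * big_y + eps[t]
    return rx[n_terms:], ry[n_terms:]
```

For instance, `two_component_arfima(2 ** 12, 0.1, 0.2, 0.8, np.random.default_rng(0))` would produce one pair with the first parameter set quoted above.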

Multifractal Binomial Measures
In this subsection, the series $r_x$ and $r_y$ to be tested come from binomial measures generated by the $p$-model with known analytic multifractal properties [40]. We combine them with pink noise to test the performance of CMPCSE. Each binomial measure or multifractal signal can be generated by iteration. We start with the iteration $k = 0$, where the data set $g(i)$ consists of one value, $g^{(0)}(1) = 1$. In the $k$-th iteration, the data set $\{g^{(k)}(i), i = 1, 2, ..., 2^k\}$ is obtained from $g^{(k)}(2i - 1) = p\, g^{(k-1)}(i)$ and $g^{(k)}(2i) = (1 - p)\, g^{(k-1)}(i)$. When $k \to \infty$, $g^{(k)}(i)$ approaches a binomial measure, and the scaling exponent function $H_{gg}(q)$ is
$$H_{gg}(q) = \frac{1}{q} - \frac{\log_2\left[p^q + (1 - p)^q\right]}{q}.$$
In our simulation, we iterated 12 times with $p_1 = 0.2$, $p_2 = 0.3$, $p_3 = 0.4$ and obtained three binomial measures $g_{p_1}(i)$, $g_{p_2}(i)$, $g_{p_3}(i)$. In the actual calculation, we set $r_x = \mathrm{diff}(g_p(i))$, where diff denotes the first-order difference. We present the CMCSE results of the series $x - y$ and $r_x - r_y$ and the CMPCSE results of $x - y : z$ in Figure 6 with $p_x = 0.2$, $p_y = 0.3$ and $p_x = 0.3$, $p_y = 0.4$. From the two panels in Figure 6, it is clear that the entropy values of $x - y : z$ and $r_x - r_y$ are very close at all time scales, whereas there is an obvious discrepancy between the values of $x - y : z$ and $x - y$. This also indicates that, when $r_x$ and $r_y$ are simultaneously affected by the third-party factor $z$, one can use the CMPCSE method to obtain the intrinsic cross-sample entropy of $r_x$ and $r_y$ by eliminating the influence of $z$ on $x$ and $y$.
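The iteration that builds a binomial measure is short enough to write out directly. The following sketch follows the recursion stated above; the function name `binomial_measure` is a hypothetical choice for this example.

```python
import numpy as np

def binomial_measure(p, n_iter):
    """Generate the p-model binomial measure after n_iter iterations.
    Each value g(i) splits into p * g(i) and (1 - p) * g(i)."""
    g = np.array([1.0])                  # iteration k = 0: a single value 1
    for _ in range(n_iter):
        nxt = np.empty(2 * len(g))
        nxt[0::2] = p * g                # g_k(2i - 1) in one-based indexing
        nxt[1::2] = (1.0 - p) * g        # g_k(2i) in one-based indexing
        g = nxt
    return g

# Intrinsic test series as the first-order difference of the measure.
r_x = np.diff(binomial_measure(0.2, 12))
r_y = np.diff(binomial_measure(0.3, 12))
```

Twelve iterations give $2^{12}$ values, so each differenced series has $2^{12} - 1$ points, and the measure itself always sums to 1.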

Application to Stock Market Index
In order to validate the applicability of the CMPCSE method to empirical time series, we then apply it to stock market indices. The analyzed data sets consist of three Chinese stock indices: the Shanghai Securities Composite Index (SSEC), the Shenzhen Stock Exchange Component Index (SZSE) and the Hang Seng Index (HSI). All the raw data were downloaded from https://finance.yahoo.com/. The daily closing data for the indices from 26 December 1999 to 17 July 2020 were used. Because the trading days in mainland China and Hong Kong differ, we exclude the data recorded on dates that are not common to all markets and then reconnect the remaining parts of the original series to obtain time series of the same length. As a result, the final length of the daily closing data is 5000.
In practice, we usually use normalized time series. Denoting the closing index on the $t$-th day as $x(t)$, the daily index return is defined by $g(t) = \ln(x(t)) - \ln(x(t - 1))$. Then the normalized daily return is defined as $R(t) = (g(t) - \langle g(t) \rangle)/\delta$, where $\langle g(t) \rangle$ and $\delta$ are the mean value and standard deviation of the series $g(t)$, respectively.
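The preprocessing of the index data can be sketched as follows. This is a minimal sketch under assumptions: each index is represented as a dictionary from ISO-formatted date strings to closing values, and the helper names `align_on_common_dates` and `normalized_returns` are hypothetical.

```python
import numpy as np

def align_on_common_dates(*series):
    """Keep only the dates present in every series.
    Each series is a dict mapping an ISO date string to a closing value."""
    common = sorted(set.intersection(*(set(s) for s in series)))
    aligned = [np.array([s[d] for d in common], dtype=float) for s in series]
    return common, aligned

def normalized_returns(close):
    """Daily log returns g(t) = ln x(t) - ln x(t - 1), normalized to
    zero mean and unit standard deviation."""
    close = np.asarray(close, dtype=float)
    g = np.diff(np.log(close))
    return (g - g.mean()) / g.std()
```

Aligning the three indices first and then calling `normalized_returns` on each aligned closing series yields return series of equal length, which is what the entropy calculation requires.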
In 2015, Shi and Shang studied the multiscale cross-correlation coefficient and multiscale cross-sample entropy between SSEC, SZSE and HSI [48]. Their results show that there is a strong correlation between the return data of SSEC and SZSE, and that both of them are only weakly correlated with HSI. Figure 7 compares the cross-sample entropy of the two return series SSEC and SZSE in two cases: including the influence of HSI (SSEC-SZSE) and excluding it (SSEC-SZSE:HSI). The entropy values of SSEC-SZSE are larger than those of SSEC-SZSE:HSI at all scales. Since a lower cross-sample entropy indicates a higher degree of similarity, this means that if the entropy values of SSEC-SZSE calculated by CMCSE are used to estimate the degree of similarity between SSEC and SZSE, the similarity between them will be underestimated. That is to say, the partial cross-sample entropy SSEC-SZSE:HSI gives a more reasonable picture of the actual synchronization between the two return series of SSEC and SZSE. We believe this result is reasonable: SSEC and SZSE are the two most important stock indices in mainland China, so their daily return data should be strongly synchronized, especially at large time scales.

Discussion and Conclusions
In this paper, we proposed CMPCSE for quantifying the intrinsic similarity of two time series affected by common external factors. First, we described the calculation process of CMPCSE in detail. Then, in order to test the validity of CMPCSE, we applied it to three sets of artificial data. These three sets of artificial data were constructed by linear superposition of BFBMs, the two-component ARFIMA process and multifractal binomial measures, respectively, with pink ($1/f$) noise. The results for each set of artificial data show that CMPCSE can accurately measure the intrinsic cross-sample entropy of two simultaneously recorded time series by removing the effects that come from pink noise. Finally, CMPCSE was employed to investigate the partial cross-sample entropy of SSEC and SZSE by eliminating the effect of HSI. Compared with the conclusion from CMCSE, the results from CMPCSE show that SSEC and SZSE have stronger similarity. Because SSEC and SZSE are the two most important stock indices in mainland China, they should be strongly consistent, especially at large time scales. We therefore think the result is reasonable and that it is necessary to consider partial cross-sample entropy when one wants to measure the similarity of SZSE and SSEC.
On the other hand, we must also note that the first step in the calculation of CMPCSE is crucial to its result. There may be other ways to eliminate the influence of the third party on the two time series under study. In our work, we adopted the idea from Reference [40], and satisfactory results were obtained in our artificial data examples. At the same time, we also noticed that when CMPCSE is used to study the linear combination of pink noise and the NBVP time series mentioned in Reference [26], constructed in the way described above, we cannot get satisfactory results. Therefore, we think that the way of eliminating the third-party influence used in this paper cannot achieve good results for sequences with violent oscillations. We expect to see better methods for dealing with such time series.
All in all, we think that partial cross-sample entropy analysis is necessary when one wants to measure the similarity of two time series affected by common external factors and that, at present, CMPCSE is a good choice.