Relationship between Continuum of Hurst Exponents of Noise-like Time Series and the Cantor Set

In this paper, we have modified the Detrended Fluctuation Analysis (DFA) using the ternary Cantor set. We propose a modification of the DFA algorithm, Cantor DFA (CDFA), which uses the Cantor set theory of base 3 as a scale for segment sizes in the DFA algorithm. An investigation of the phenomena generated from the proof using real-world time series based on the theory of the Cantor set is also conducted. This new approach helps reduce the overestimation problem of the Hurst exponent of DFA by comparing it with its inverse relationship with α of the Truncated Lévy Flight (TLF). CDFA is also able to correctly predict the memory behavior of time series.


Introduction
A ternary Cantor set is a set built by removing the middle part of a series when divided into three parts and repeating this process with the remaining shorter segments. It is the prototype of a fractal [1]. A fractal is a geometric object that has similar statistical properties to itself on all scales. If a fractal object is successively magnified, it looks similar or exactly like the original shape of the fractal. A similar pattern exhibited at increasingly smaller scales is often known in fractal mathematics as self-similarity [2,3]. In time series, self-similar phenomena describe the event in which the dependence in the time series decays more slowly than an exponential decay. Typically, it follows a power-like decay [4]. Scaling methods exist for quantifying the power-law exponent of the decay function such as Rescaled Range Analysis (R/S), Detrended Fluctuation Analysis (DFA) and the Truncated Lévy Flight (TLF).
The Rescaled Range Analysis (R/S) method by Hurst subdivides integrated time series into adjacent segment sizes and examines the range (R) of the integrated fluctuations. Then, a measure of dispersion, usually standard deviation (S), is determined as a function of segment size. A power law governs the approximate relationship between the Rescaled Range Analysis' statistic (R/S) and the segment size [5].
The Detrended Fluctuation Analysis (DFA) by Peng et al. (1994) is a technique that quantifies the same power-law exponent of the R/S method. Addressing difficulties in determining correct power-law exponents of the R/S method in non-stationary time series resulted in the introduction of the DFA. Unlike the R/S method, the DFA uses a local detrending approach (usually linear regression) in the segments of the integrated series. For time series with higher-order trends, polynomial fit replaces the linear regression approach of the DFA [6]. This provides its power-law exponents' protection against effects of nonstationarity and pollution of time series by external signals while eliminating spurious detection of long memory [7]. Empirical evidence has shown that the DFA performs well compared to other variance scaling methods including the R/S methods when estimating power-law exponents.
Usually, characterizing stochastic processes empirically requires the study of determining asymptotic probability density distributions (pdf) and temporal correlations. Brownian motion models the evolution of a particle's position over time with the assumption that the movement of the particle follows a diffusive process with Gaussian distribution. This model did not describe accurately real-world time series because kurtosis of associated pdf is greater than that of the Gaussian distribution [8,9]. The Truncated Lévy Flight (TLF) model originated as a means to address the difficulties of the Brownian motion for working in long-range correlation scales. The scaling exponent (0 < α ≤ 2) of the TLF measures the memory behavior in time series that follows a diffusion process with Gaussian and non-Gaussian distributions [10].
In [11], a clear comparison was made between DNA and economics by the authors, showing the underlining similarities that allow researchers to model seemingly different phenomena using the same or slightly modified models. In the same manner, these variance scaling models have the added advantage of being used to model long memory effects in different fields where stochastic processes occur [2,7]. Thus, be it DNA sequencing, financial markets, geophysical time series etc., scaling methods have been used to detect long/short memory behaviors.
Scaling approaches serve as means of characterizing the dependence of observations separated in time series dominated by stochastic properties. Applications with DFA have been done in DNA sequences [6,12,13], neural oscillations [14], detection of speech pathology [15], heartbeat fluctuation in different sleep stages [16], describing cloud breaking [17], gearbox fault diagnosis [18], analysis of fetal cardiac data [19], streamflow in the Yellow River Basin in China [20], evaluation near infrared spectra of green and roasted coffee samples [21], just to mention a few.
Empirical evidence has shown that the DFA has the tendency of overestimating the scaling exponent [2,22]. We have not come accross any literature at the moment that describes a definite approach in the segment division step of the DFA algorithm. However, we observe that estimates of power-law exponents are influenced by the scale of choice [23,24]. Our goal in this work is to propose a definite non-overlapping segment division approach in the DFA algorithm (CDFA) that utilizes the theory of the ternary Cantor set. We show that using this approach we are able to rightly determine the correct scaling exponent to detect the memory behavior of the time series as well as reducing the over-fitting nature of the DFA. This approach has the advantage of generalizing the segment division step in the DFA algorithm. The Hurst exponents obtained from the CDFA method are then compared with the exponents of the DFA and the TLF on real time series.
In Section 2, we present proof of the relationship between the continuum of Hurst exponents of the DFA and Cantor set. We also present the scaling methods TLF, DFA and CDFA in this section. In Section 3, we present results and discussions from our investigation noting that for noise-like time series, anti-persistence, white noise and persistence behavior in time series imply 0 ≤ H < 0.5, H = 0.5 and 0.5 < H ≤ 1 respectively. The overestimation of DFA's Hurst exponent decreasing with the Cantor scales is also discussed in this section. Section 4 concludes the paper.
The stability exponent α ∈ (0, 2] defines the asymptotic decay of the pdf. σ ∈ (0, ∞) measures dispersion. Skewness parameter β ∈ [−1, 1] measures asymmetry of the distribution. µ ∈ (−∞, ∞) is a scalar which determines the "location" or shift of the distribution. The sign x is the signum function of x ∈ R defined as sign(x) = x/|x|. The problem is that the variance of the distribution in (1) is finite but is not stable. This is because, large cut-off l results in slow convergence and a smaller cut-off l may result in abrupt tail [8]. In [25], the author generated a TLF to address the convergence problem by using a decreasing exponential cut-off function. Thus, the process in Equation (1) is truncated to obtain the TLF given by: for some normalizing constant c, stability exponent α ∈ (0, 2] and cut-off length l. The characteristic function of the TLF in Equation (2) is given by To determine the best scaling exponent (α) from characteristic equation in (3), we adjust the values of A, the cut-off parameter l and the scaling exponent α simultaneously to fit the characteristic function to the data.

Detrended Fluctuation Analysis (DFA)
Given the noise-like time series ψ, we find the integrated series to determine the Root Mean Squared Fluctuations (RMSF) from Equation (5) below A log-log plot of the RMSF against the series length s produces a directly proportional relation given by where H := Hurst exponent of the DFA and H min ≤ H ≤ H max [4].

Cantor Detrended Fluctuation Analysis (CDFA)
In this subsection, we prove that the subspace [H min , H max ] of Hurst exponents is homeomorphic to [0, 1] of the Cantor set. We also present an illustration of the Cantor set and the algorithm for the CDFA. 1] between the topological spaces of Hurst exponents of noise-like time series and the Cantor set is a homeomorphism if it has the following properties: the inverse function f −1 is continuous.
If two topological spaces admit a homeomorphism between them, we say they are homeomorphic: they are essentially the same topological space. Thus, Now, we need to prove that the map f is homeomorphic to the Cantor set.
Thus, the map f (H) is a bijection. The map f (H) is continuous at some value c in its domain if f (c) is defined, the limit of f as H approaches c exists and the function value of f at c equals the limit of f as H approaches c. The function f (c) is defined as The limit of f as H approaches c equals lim The left-and right-sided limits are equal from (11). Therefore, Hence we observe that the right hand side of Equation (10) is equal to right hand side of Equation (12). Thus, it follows that Thus, the map f is continuous at some value H = c for a differentiable fractal.
Interchanging H and y gives the inverse function of f (H).
The inverse map f −1 is continuous at some value s in its domain if f −1 (s) is defined, the limit of f −1 as H approaches s exists and the function value of f −1 at s equals the limit of f −1 as H approaches s. f −1 (s) is defined as The limit of f −1 as H approaches s equals lim ⇒ lim Since the right hand side of Equation (17) equals the right hand side of Equation (19) it implies that, lim

Illustration of the Cantor Set
In this subsection, we take real-world noise-like time series and remove middle thirds up to four (4) levels so that it is similar to the Cantor set. This phenomenon is depicted in Figure 1 [26]. It shows that the segments appear the same at different scales in successive magnifications of the Cantor set from levels C 0 to C 6 . C 0 depicts the original time series with no missing parts and C 6 represents the remaining time series after removing middle thirds for the sixth time. For the sake of experimentation, we limit our scope to levels from C 0 to C 3 .

Definition
The subset of intervals of the Cantor set is defined recursively as: The ternary Cantor set is defined as C = [0, 1] \ (∪ ∞ n=1 C n ). The level C 0 indicates the interval we begin with. For C 1 , [0, 1] is divided into 3 sub-intervals and the middle sub-interval 1 3 , 2 3 is removed. For C 2 , each of the remaining intervals from C 1 are divided into 3 sub-intervals and their middle sub-intervals 1 9 , 2 9 and 7 9 , 8 9 are removed. This procedure can continue indefinitely by removing open middle third sub-interval of each interval obtained in the previous level. Due to issues with the dimension of the Cantor sets (i.e., dimension of 0.631 < 1), we rescale the integrated series ψ t by dividing each observation by the maximum data point:

Algorithm of the CDFA
Here, we present a modification of the DFA algorithm called CDFA to generalize the segment division step of the DFA. The CDFA algorithm consists of four (4) main steps: 1. given the time series ψ t of length N, find the integrated series shifted by the mean < ψ >, 2. the cumulatively summed series Y j is then segmented into equal non-overlapping segments of various sizes ∆s. ∆s is based on the Cantor set theory scale (∆s = 3 n , n ≥ 0). The number of non-overlapping segments is calculated as: The Cantor set scaling function is computed for multiple segments to highlight both slow-and fast-evolving fluctuations that control the structure of the time series. 3. Root Mean Squared Fluctuation (RMSF) is computed for multiple scales of the integrated series: where j denotes the sample size of segments N ∆s . We compute RMSF from j = 1 to 2N ∆s not N ∆s . We sum from beginning to end and from end to beginning, then an average of the values is calculated so that every data point is considered. Conversely, the large segments interweave many local periods with both small and large fluctuations and therefore average out their differences in magnitude. 4. the least squares regression fit of F(∆s) versus the Cantor scales ∆s on a log-log scale produces the power-law notation computed for multiple scales: where H c := Hurst exponent of the CDFA which measures memory behavior in the noise-like time series.

Real Time Series
In Figure 2, the time series multifractal (upper panel), monofractal (middle panel) and white noise (lower panel) used in the experiment are noise-like biomedical time series with 8000 rescaled sample data points each [27]. The red trajectory depicts the random walk of the respective series. Observe that the fractal depicted by the multifractal time series at the peak looks very similar to the entire monofractal time series. Thus, comparing the series in the upper panel to the middle panel, the multifractal series has many fractals compared to the one for the monofractal series. We determine DFA's Hurst exponents for the remaining series after removing the middle thirds of each series at each level. It should be noted that the white noise time series has a structure independent of time with Hurst exponent close to H = 0.5 whereas noise-like monofractal and multifractal time series exhibit persistence behavior s.t. 0.5 < H ≤ 1.

Results
Below are tables of results of power-law exponents from the implementation of the DFA algorithm on the respective time series. Note from Figure 1  It should be noted that the same data points from partitioning correspond to the white noise, monofractal, and multifractal time series. The Hurst exponents also follow respectively. From Table 1, we observe closeness of the Hurst exponents of the white noise series to H = 0.5 for all levels from C 0 to C 3 . This confirms the phenomena that are exhibited in the fractal nature of the Cantor set in white noise time series. No matter how many sections of a white noise series are removed, the left-over series still exhibits similar characteristics as the whole white noise series.  Table 2 present Hurst exponents between 0.5 and 1 (0.5 < H ≤ 1) for long memory monofractal time series for levels C 0 , C 1 , C 2 and C 3 . The phenomena exhibited in the monofractal time series from the table above are similar to the fractal nature of the Cantor set. The series left behind after removing the middle thirds of the monofractal time series exhibits similar statistical properties as the whole. Table 3. DFA's Hurst exponents of multifractal time series.

Levels Hurst Exponents
Hurst exponents of the multifractal time series lie within the range 0.5 < H ≤ 1 for all levels C 0 , C 1 , C 2 and C 3 from Table 3. This illustrates the fractal phenomena depicted by the Cantor set where successive magnification of the Cantor produces a copy of itself. This can be seen in Figure 1. Thus, self-similar behavior persists after removing the middle thirds of the whole series up to the level C 3 . Results from Tables 1-3 confirm that successive magnification of noise-like time series shows a similar pattern at increasingly smaller scales. Thus, the statistical characteristics of part of noise-like series are similar to that of the whole. This phenomenon is commonly known in fractals as self-similarity. Figures 3-8 shows the log-log fits of RMSF and scales of the white noise, monofracal and multifractal bio-medical series using the DFA and the CDFA. The first two (2) plots (i.e., Figures 3 and 4) present fits of the white noise using the DFA and CDFA. The next two (2) plots (i.e., Figures 5 and 6) illustrate the fit of monofractal series using the DFA and CDFA. The last two(2) plots (i.e., Figures 7 and 8) show fits of the multifractal series using the DFA and the CDFA. Table 4 above has six (6) columns of results in total. The first column (H) represents the Hurst exponents of the DFA, the second column (H c ) denotes the Hurst exponents of the CDFA and the difference between the exponents in the first two columns are found in the third column. The column for α denotes the scaling exponents of the TLF. The last two columns represent the multiplication of the Hurst exponents of the DFA (H) and the scaling exponent of the TLF (α), as well as the multiplication of the Hurst exponents of the CDFA (H c ) and the scaling exponents of the TLF. Upon investigating Hurst exponents of white noise, monofractal and multifractal time series using the DFA and CDFA, we observe differences in their exponents, as shown in Table 4. Hurst exponent of white noise time series changes slightly but that of the monofractal and multifractal time series changes about 1%. The slight changes in the exponents are a result of subdividing the time series as multiples of 3 (ternary base) at each level using the CDFA. This helps to curb the problem of overestimation associated with DFA. Notwithstanding the differences between the exponents, they still depict the same processes modeled herein (i.e., noise-like time series).
The exponent of the white noise is close to 0.5 whereas that of the noise-like monofractal and multifractal series lie within the range 0.5 < H ≤ 1, depicting long-memory behavior.

Discussion
The results obtained from the Tables 1-4 suggest that segment size may not always be hard-coded in the DFA algorithm based on the length of the time series in question. Especially for time series with odd lengths, the process can be automated using the fractal phenomena of the Cantor set to obtain equal segment sizes and satisfactory Hurst exponents.
Furthermore, multiplying Hurst exponents of the DFA and CDFA with the scaling exponents (α) of the Truncated Lévy flight (TLF) suggests that H c is a better estimate. For the monofractal and multifractal noise-like time series, we observe that H c α is approximately equal to 1 while Hα exceeds 1. This deviates from the inverse relationship between the Hurst exponents and the scaling exponents of the TLF for Gaussian noise as discussed in the paper [4]. This highlights the overestimation of the Hurst exponent of the DFA approach that happens in practice.

Conclusions
In this work, we have proposed a modification to the DFA algorithm by utilizing the theory of the tenary Cantor set in the segment division step. The Cantor DFA (CDFA) has been compared to the α exponent of the truncated Lévy model and the Hurst exponent of the DFA. We have in addition proved that the interval of the Hurst exponent of the DFA is homeomorphic to the Cantor set. We confirm the results from the proof by illustrating the fractal phenomena exhibited by the Cantor set using real-world time series in Tables 1-3. Our results from numerical simulations show that the CDFA generates better estimates of Hurst exponents. The CDFA proposed in this work automates the segment sizes in the DFA algorithm using the number base 3 theory of the Cantor set, where the time series is divided into multiples of 3 at each level. This modification helps to curb the overestimation problem of the Hurst exponent (H) of the DFA by determining segment sizes based on the fractal phenomena depicted by the Cantor set while correctly predicting the memory behavior of the series in question.
The results are shown in Table 4 where the Hurst exponent of the CDFA is compared with that of the DFA and the scaling exponents (α) of the TLF. In [4], a relationship was established between the Hurst exponent of the DFA and the α exponent of the TLF. The CDFA is also shown to satisfy this relation, thus making it possible to extract the α exponent of the TLF from the Hurst exponent of the CDFA.
The CDFA approach can be applied to time series with odd lengths, time series whose lengths are not easily divisible by even numbers, time series whose lengths do not permit equal segmentation, etc. These kinds of series exist in several industries, including financial, geophysics, health and the like. Another application of the CDFA would be to act as a control model for the ordinary DFA to reduce the chances of overestimation of the Hurst exponent.
Since this is a modification of the DFA, there is the need to simulate CDFA with different data sets having varying characteristics for which the DFA has been shown to correctly detect their scaling behavior. An example will be DNA sequences, financial markets, etc., for further comparison of the model performance against the DFA.
For future work, we seek to investigate the robustness of the CDFA as stated earlier by simulating the model with data sets from different fields, including, but not limited to, DNA sequences, financial markets and geophysical data. In the case of "big data", we seek to extend the CDFA by "parallelizing" the sequential code of the CDFA (PCDFA) to improve its efficiency in simulation.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data used in this work can be accessed at https://www.ntnu.edu/ inb/geri/software.

Conflicts of Interest:
The authors declare no conflict of interest.