Change-Point Detection in Autoregressive Processes via the Cross-Entropy Method

: It is very often the case that at some moment a time series process abruptly changes its underlying structure and, therefore, it is very important to accurately detect such change-points. In this problem, which is called a change-point (or break-point) detection problem, we need to ﬁnd a method that divides the original nonstationary time series into a piecewise stationary segments. In this paper, we develop a ﬂexible method to estimate the unknown number and the locations of change-points in autoregressive time series. In order to ﬁnd the optimal value of a performance function, which is based on the Minimum Description Length principle, we develop a Cross-Entropy algorithm for the combinatorial optimization problem. Our numerical experiments show that the proposed approach is very efﬁcient in detecting multiple change-points when the underlying process has moderate to substantial variations in the mean and the autocorrelation coefﬁcient. We also apply the proposed method to real data of daily AUD/CNY exchange rate series from 2 January 2018 to 24 March 2020.


Introduction
Change-point detection in time series processes, or time series segmentation, has been discussed by many authors in the fields of statistics, computer science, and data mining for several decades [1]. Change-point detection has a broad range of applications, including financial time series analysis [2,3], econometrics [4,5], and science and engineering [6][7][8]. Change-point problems are traditionally divided into two large categories: posterior (off-line) and sequential (on-line) problems; see, for example, [9]. The former is often referred to as segmentation of a signal or time series when all data are observed, while the latter assumes a stream of data coming in real-time. In this paper, we focus on the time series segmentation problem.
A change-point is defined when at least one statistical parameter in a given time series process suddenly changes, which may be caused by the variation in the mean, trend, or other internal parameters. Following the assumption of [10] and [7], a non-stationary time series process is assumed to be segmented by change-points into a sequence of piecewise stationary processes. We are interested in modeling the non-stationary autoregressive AR (p) process and estimating the unknown number of segments and the locations of change points.
Many of the existing multiple change-point methods assume independence of observations. Because of this assumption, these methods tend to either underestimate or overestimate the number of change-points in autoregressive time series due to its endogenous dependent structure. For example, a stationary AR(1) process with moderate to strong autocorrelation can be confused with a change-point process with independent observations causing these segmentation methods to overestimate the number of structural breaks. References [7] and [36] took the dependence structure of autoregressive time series into account and developed a segmentation method to identify break-points using a genetic algorithm and a dynamic programming algorithm, respectively, with the objective function based on the minimum description length (MDL) information criterion. Another paper [4] implemented a strucchange algorithm designed for estimating structural changes in linear regression. Recently, [37] proposed a AR1seg approach for estimating change-points in the AR(1) model based on the assumption of a common autocorrelation coefficient for all segments. In this paper, we apply the Cross-Entropy (CE) method with the MDL principle to identify the number of and locations of change-points. In contrast to [35], we assume that observations from different segments may be generated from different AR(p i ) processes. The statistical features, such as the order of autoregressive processes, the autocorrelation parameter, and the mean of the homogeneous segment, can be estimated if the change-points are detected, but our focus is on the detection accuracy of identifying change-points. Through extensive numerical experiments, our method shows superior detection accuracy when the underlying autocorrelation structure changes.
The paper is organized as follows. Section 2 introduces the MDL-based objective function and compares it with other popular model selection criteria, and the detailed derivation is provided. In Section 3, we describe the proposed algorithm in detail. Section 4 shows the design and results of numerical experiments. Real data analysis is shown in Section 5. Lastly, Section 6 provides a general discussion.

Change-Point Detection Problem
Firstly, let us formulate the change-point detection problem. Let X = (X 1 , X 2 , . . . , X T ) denote a data sequence of length T. Assume there are N unknown change-points, so X is segmented into N + 1 segments by the set of locations τ = (τ 1 , τ 2 , . . . , τ N ), 0 = τ 0 < τ 1 < τ 2 < · · · < τ N < τ N+1 = T, where the i-th segment consists of observations (X τ i−1 +1 , . . . , X τ i ); i is numbered from 1 to N + 1. Next, we assume that the data sequence X is generated by piecewise stationary AR(p i ) processes, the model assumes that the observations within each segment are generated by a stationary AR(p i ) process with autocorrelation coefficients ρ i = (ρ i1 , . . . , ρ ip i ), mean δ i , and variance ν 2 i . The segments are assumed to be independent of each other. The parameters δ i , ν 2 i , and ρ i are not known in advance, whereas p i is known.
where the noise ε t,i ∼ i.i.d. Normal 0, ν 2 i , p = (p 1 , p 2 , . . . , p N+1 ), ρ = (ρ 1 , ρ 2 , . . . , ρ N+1 ), and the model allows for shifts in the mean, δ = (δ 1 , δ 2 , . . . , δ N+1 ); also define ν 2 = (ν 2 1 , . . . , ν 2 N+1 ), θ = (δ, ρ, p, ν 2 ). It is necessary to specify the unknown parameter N, the number of change-points. Theoretically, N could be any integer between 0 and T − 1, while in practice, we can set a maximum value for N, perhaps via the rough graphical inspection. Additionally, the change-point problem is characterized by N; if N is greater than 1, we call it multiple change-point problem; otherwise, the process is stationary. A model selection procedure is often used to determine the number of change-points; one can minimize an objective function of the form [38]: where the C is a cost function for a segment and P is a penalty term. In this paper, we consider the negative log-likelihood as the cost function by rewriting (2) as a penalized likelihood information criterion: Akaike's information criterion (AIC), Bayesian information criterion (BIC), and modified BIC are well-known information criteria. The significant difference between these information criteria is the amount of penalty added to each parameter of the model. The penalty terms for the AIC and the BIC are given below: The total number of parameters in the model is denoted by k. Thus, each parameter is penalized by the same amount. Because the AIC and BIC are not theoretically designed for the change-point problems, Zhang and Siegmund [39] proposed the modified BIC (mBIC) for determining the number of change-points in array-based comparative genomic hybridization data. The mBIC follows the classic BIC, while it differs in the penalty term. The general form that penalizes for model dimensions is given: In addition, Lavielle [40] proposed a penalty function with an adjustable shrinkage parameter β for the change-point problem: In this paper, we introduce the two-stage minimum description length (MDL) for our change-point problem based on model (1); the superior flexibility is one strength of MDL. It was proposed by Rissanen (1978), based on Kolmogorov's theory of algorithmic complexity and rooted in information theory. In developing MDL, Rissanen's idea was to select the model with an excellent fit to the observed data while generating the least description. The term "description length" is interchangeable with "code length," which is used to describe the complexity of each statistical model. In other words, the best model was defined by MDL as the one that can compress a vast amount of information by using less computer memory storage; the amount of computer memory is called code length, denoted by CL. Figure 1 summarizes the aforementioned penalized likelihood methods.

The Principle of Minimum Description Length
The MDL criterion can tailor penalties for parameters having different natures, while the other information criteria treat every parameter as imposing the same penalty; therefore, which parameter to be penalized and the amount to be encoded need to be formalized according to the general principles of Rissanen [41]: • For an integer parameter N, when its value is not bounded, approximately log 2 (N) bits are encoded. If N is limited by the upper bound value N u , then approximately log 2 (N u ) bits are needed.

•
To encode a real parameter, if it is estimated from T observations, 1 2 log 2 (T) bits are required.
Then, we use the framework set out by [7] to derive the description length based on model (1); the MDL consists of the following four components: 1. As mentioned above, the first part of two-stage MDL can be considered as a negative log-likelihood function of the whole process, by assuming that the segments are independent, and using the approximation in McLeod et al. [42] for the exact loglikelihood for the AR(p) model, the likelihood of segment i can be written as − T 2 log(ν 2 ) − 1 2 log(Γ s ) (see [42] for further details), whereν 2 is the Yule-Walker estimate, log(Γ s ) is the covariance determinant of first s observations. 2. For the number N, the penalty term is log 2 (N). Since it can be considered as an integer parameter without being bounded, the code length is log 2 (N). 3. The length for each of N + 1 segments cannot exceed T. This means that, according to the integer parameter principle, the lengths of the segments can be encoded with log 2 (T) bits. 4. p i represents the order of the AR process for each segment, which should be an integer. So the parameter can be encoded with log 2 (p i ) bits; see [7]. 5. For each segment i, i = 1, . . . , N + 1, we estimate three real-valued parameters. These are mean shift δ i ; autocorrelation coefficient ρ ij , where j = 1, . . . , p i ; and variance ν 2 i . Under the real parameter principle, each of these parameters can be computed from τ i − τ i−1 − 1 observations. Hence, each of them need 1 2 Thus, we can get the objective functions based on the MDL for the multiple change-points model: stochastic complexity of the model If there are no change-points in the model (N = 0), the MDL for a stationary AR(p) process will be

The Cross-Entropy Algorithm
In order to identify the locations of change-points in the problem-Equations (8) and (9), which can be considered a combinatorial optimization problem-we use the cross-entropy (CE) algorithm. The CE method was originally developed to estimate the rare-event probabilities as an adaptive importance sampling method, based on the Kullback-Leibler divergence (or cross-entropy) minimization [43]. The main idea of the CE method can be described by a two-step iterative procedure: 1.
Generate a sample from a probability distribution.

2.
In order to obtain a better sample, find the minimum cross-entropy distance between the sample distribution and a target distribution.
Rubinstein [44,45] then realized that the CE method can be used to solve combinatorial optimization problems. We recommend [46] for getting familiar with the principals and the applications of the CE method.
Firstly, we formulate the rare-event probability estimation problem: where F(C) is the objective (or score) function defined on a finite set C, α is a threshold, is the corresponding probability measure when C follows the probability density function (pdf) f (·, v) with a real parameter v, and I is an indicator function. The main idea of the CE method is to obtain the optimal pdf from a set of pdfs It allows us to generate a sequence { f (·, v)} by multi-level CE implementation to get the final optimal pdf f (·, v * ). During adaptive updating procedure, a family of {(α k , v k )} is generated and eventually converges to the optimal state (α * , v * ).
As the F(C) defined as (8) or (9), our goal is to find the minimum value of F(c), denoted by α * , Then, the two-step iterative procedure mentioned above can be written in the following way: 1. Updating of the level parameter α k . Order the performance score increasingly, . Let α k be the β-quantile of the ordered performance scores, which can be denoted as, 2. Updating the reference parameter v k . Given α k and v k−1 , we can derive v k from Equation (10), More formally, the generic CE optimization algorithm can be described as follows (Algorithm 1).

Algorithm 1 Basic CE optimization algorithm.
1: Specify the initial parameterv 0 of f (·, v). Let k = 1. 2: Generate a random sample C (1) , . . . , C (M) from f (·,v k−1 ). Evaluate and order the performance score, F (1) ≤ · · · ≤ F (M) , level parameter is estimated as α k = F ( βM ) . 3: Update the current parameter v k by solving the stochastic problem: Stop the process if a certain stopping criterion is met; otherwise allow k = k + 1 and return to step 2.
The detailed information about the properties of the CE algorithm as well as its diverse applications can be found in [43] and [44]. To date, the CE method has been modified and applied to different multiple change-point problems as a combinatorial optimization algorithm in [8,[30][31][32][33][34][35].
Let us now formulate the CE algorithm for the multiple change-point problem. Let C 1 , C 2 , . . . , C N be a sequence representing the locations of change-points. Since the location space is finite, we can carry out the CE combinatorial optimization algorithm, and generate M rows of random samples. So here we deal with the objects C (j) = (C (j) . . , M. The external parameters are N max , β, and ε; we apply the CE.AR algorithm at 1 ≤ N ≤ N max . The normal distribution is served as f (·, v). In practice, the users need to specify the external parameters. The main steps of our proposed CE.AR algorithm are as follows (Algorithm 2).
where J is the set of indices of the best performing samples. 6: Stopping criterion is max i (σ 2 (k),i ) < ε. 7: If the stopping criterion is met, so stop the process, then identify the combination of the positions of change-points and compare its performance score with the score of N = 0 from (9). If MDL N≥1 (·) > MDL N=0 (·), then there are no change-points. Estimate the parameters of the model for all obtained segments. If the stopping criterion is not met, set k = k + 1 and iterate from step 2.

Numerical Experiments
We provide three simulated examples to illustrate the proposed CE.AR method, compared with the current methods on offer in the R packages AR1seg [37] and strucchange [47]. In the simulation study, we aim to present a detailed evaluation of the estimated number of change-points through these three methods. The CE algorithm discussed above was implemented using the breakpoint package [8] with the external settings as follows: the elite proportion value β = 0.05 and the sample size M = 200.

Example 1: Simulated AR(1) Processes with No Change-Point
The null case is considered first with three standard AR(1) processes simulated with no change-point. Three sets of 100 AR(1) sequences are generated with each sequence of length T = 201. The autocorrelation coefficients were set to be 0.1, 0.5, and 0.9, respectively, for each set. The variance was set globally to be ν = 1, and the mean was set to be δ = 0. Table 1 shows that both strucchange and CE.AR perform well for all values of ρ, while AR1seg tends to slightly overestimate the number of change-points when the value autocorrelation parameter increases. The best performing method is the CE.AR method.

Example 2: Simulated AR(1) Processes with a Single Change-Point
In this example, we would like to understand how a change in either the mean or the autocorrelation coefficient affects the performances of the algorithms. To study these individual effects, we design two scenarios. In the first scenario, we assume that autocorrelation ρ remains the same in a time series, while the mean jumps at each segment; the second scenario shows the opposite situation: the mean is fixed and the autocorrelation coefficient varies across segments.

Example 2.1
In scenario 1, we generated three sets of 100 AR(1) sequences of length 201 with a fixed autocorrelation ρ = 0.5 and one abrupt change-point at position τ 1 = 100, which partitions a time series into two segments. The mean of first segment, δ 1 , equal 0, and then jumps from 0 to δ 2 = 1 or δ 2 = 2 or δ 2 = 3 at the second segment. For each set, three sequences of simulated piecewise AR(1) process are graphically demonstrated by Figure 2.  Table 2 shows that the detection accuracies of all the methods strucchange, AR1seg, and CE.AR improved while the value of δ increased. Generally, the best performing method was the strucchange package.  We use the histogram (see Figure 3) to illustrate the performance of the estimated single change-point location under three levels of mean shifts by looking at the plot vertically to compare three methods (a,b,c) under the same settings. Overall, all methods tend to more accurately identify the change-point's location as the mean shifts become more significant. By examining the case of δ 2 = 2 or 3, we can see that AR1seg identified a few change-points located at the beginning of the process, while the CE.AR method tended to seek locations around 100. Under the case of δ 2 = 1, all the methods perform poorly; the CE.AR method especially is insensitive to the small mean change.

Example 2.2
In scenario 2, we generated three sets of 100 AR(1) sequences of length 201 with a homogeneous mean δ = 0 and one abrupt change-point at position τ 1 = 100 caused by the variation in autocorrelation coefficient. Three settings are displayed in Table 3; in the first set, a weak positive correlation is followed by a moderate correlation; the change occurs from weak to strong, and from moderate to vigorous, in the second and third sets respectively. In general, the performance of strucchange is relatively better than AR1seg and CE.AR. AR1seg is not explicitly designed for this scenario since the autoregressive parameter changes between segments. It is interesting to see that CE.AR method performs very well only when ρ varies dramatically. When ρ does not change that much, the algorithm tends to underestimate the number of change-points, preferring a model with a smaller number of change-points.   Figure 4 displays that the estimation performance of each method under scenario 2. The strucchange shows better performance than the other two. Compared with scenario 1, when the autoregressive parameter ρ varies weakly or moderately, it is more difficult for the three methods to estimate the correct number of and the locations of change-points. Lastly, the histogram shows that the estimated location of strucchange has a broader spread than CE.AR method. In addition, we run more tests when ρ changes from negative to positive; three settings are displayed in Table 4. The detection rates of strucchange and CE.AR methods increased in comparison with the previous settings.  In Example 2, we have considered two scenarios of single change-point detection. The first scenario assumes that there are no coefficient changes while the mean changes between different segments. In the second scenario, the coefficient parameter ρ changes, but the mean is assumed to be 0. Individually, the coefficients are simulated to change weakly, moderately, and strongly; we have considered six cases, which are graphically illustrated in Figure 5 by plotting one time series out of the 100 simulated AR(1) sequences. The frequency charts are used to measure the performances of the three methods in terms of accuracy of estimating the location of the change-point. We found that the CE.AR method tends to concentrate on the accuracy of the change-point locations more than the other two methods, as long as the number of change-points is accurately estimated. The detection accuracies of three methods for estimating the number of change-points is summarized in Tables 2-4, which illustrate that the detection rates of all methods increase as the shift in the mean increases. From Table 4, it is evident that the CE.AR method performs better when ρ changes from −0.5 to 0.5, from −0.5 to 0.9, and from 0.1 to 0.9; strucchange generally performs better than CE.AR and AR1seg. Therefore, we can conclude that the CE.AR method is not sensitive to mild coefficient changes, and it is suitable for the situations in which either there is a dramatic change in the mean or the correlation coefficient.

Example 3: AR(1) Processes with Multiple Change-Points
In this example, we consider a case of multiple change-points assuming that both the mean and the autocorrelation coefficient may change at the same time.
Firstly, we generated one set of 100 AR(1) sequences of length 201 with three change-points located at τ 1 = 20, τ 2 = 60, and τ 3 = 120. For each segment we used the following parameters: the mean levels of δ 1 = 0, δ 2 = 3, δ 3 = 1, and δ 4 = 0; the autocorrelation coefficients of ρ 1 = 0.1, ρ 2 = 0.9, ρ 3 = 0.9, and ρ 4 = 0.5, respectively. Table 5 illustrates that none of the three methods could estimate the number of changes well enough. AR1seg performs the best under this test. A similar underestimation issue persists in the strucchange and CE.AR methods. The underestimation is less severe in the CE.AR method; strucchange shows the least accuracy in this case. Next, we generated another set of 100 AR(1) sequences of length 201 with three change-points at τ 1 = 20, τ 2 = 60, and τ 3 = 120. For each segment, the mean levels were δ 1 = 0, δ 2 = 3, δ 3 = 1, and δ 4 = 0, while the autocorrelation coefficients were ρ 1 = −0.1, ρ 2 = 0.9, ρ 3 = −0.9, and ρ 4 = 0.5. Table 6 shows that AR1seg serves as the best method for estimating the number of change-points under this setting. Both strucchange and CE.AR methods show the heavy underestimation, while the reason could be that they ignored the τ 1 . In order to evaluate the multiple segmentation performance, we introduce the Hausdorff metric, which measures the greatest distance between the estimated locations and true change-points. Table 7 shows the Hausdorff distances for each of the three methods, the first line presents the results for the first choice of the parameters, while the second line displays the distances for the second choice mentioned above.   Figure 6 shows the frequency of estimated locations for each method under the Table 6 scenario with at least one change-point detected. The AR1seg method tends to estimate the first change-point location accurately. The CE.AR1 method has 100% power to identify the second and third locations; strucchange is less capable of detecting the first change-point at τ 1 = 20, while it performs well at τ 2 = 60 and τ 3 = 120. It is noted that all the estimations obtained from CE.AR method cluster around 60 and 120 and tend to ignore the earliest location.

Discussion
In this section, we aim to recapitulate the discussion of the results of the numerical experiments. We will also apply the proposed method to real data analysis.

Simulation Results
The objective of the simulation study was to evaluate the performance of the three change-point detection methods (AR1seg and strucchange and the proposed CE.AR) under various scenarios looking at two aspects of the methods' detection accuracies: the number of estimated change-points and their locations. We use tables to summarize the estimated number of change-points, while histograms are used to measure the detection accuracy in change-point locations. The simulation study consists of three examples. In the first example, there is no change-point in the AR(1) process; the CE.AR method achieves 100% exact detection rate under each of the three choices of the autocorrelation coefficient, 0.1, 0.5, and 0.9. The second example includes two scenarios, in which we study how a change in either the mean or the autocorrelation coefficient affects the performances of the algorithms. In scenario 1, it can be seen that the detection accuracy of the CE.AR method significantly improves as the shift in the mean increases from 1 to 3. In scenario 2, the autocorrelation coefficients differ between adjacent segments; we have considered six cases, whose profiles are shown in Figure 5. The cases with weak changes in the autocorrelation coefficient, such as ρ 1 = −0.5, ρ 2 = 0.1; ρ 1 = 0.5, ρ 2 = 0.9; or ρ 1 = 0.1, ρ 2 = 0.5, remain a challenge for the CE.AR method; it showed a heavy under-segmentation. In the third example, we consider a combined effect of the mean shift and the autocorrelation coefficient variation on piecewise AR(1) with multiple change-points. The two scenarios in this example show that the CE.AR method tends to disregard the first small segment, even if the mean or the autocorrelation coefficient changes dramatically. The histograms and Hausdorff distance show a low variation in estimated locations of change-points for the CE.AR method.

Application
We use a new real data set to demonstrate the practical aspects of our algorithm. The data were downloaded from the Reserve Bank of Australia-the daily AUD/CNY exchange rate from 2nd January 2018 to 24th March 2020. Both ARIMA and AR models have been used in the literature to model different exchange rates [48][49][50]. If an ARIMA process is used to model an exchange rate, then a change-point is recorded as a change in either the coefficients of AR part, or the MA part, or both of them. Dealing with the parameter change in the MA part of ARIMA process will be a matter of our future research. Before applying the CE.AR method, the stationary test and pre-examination are needed; the first difference of AUD/CNY is non-stationary, and we can get the rough profile of this series by using auto.arima() function which is built in forecast package [51]. If there is no obvious change in autocorrelation coefficient but in the mean level, AR1seg or the similar methods are preferred. Figure 7 shows the estimation results obtained by the CE.AR and strucchange method. Since the actual number and the locations of change-points can not be identified in advance, we need to take both methods into account and seek for the agreement between them. The CE.AR method yields four possible change-points in 4 December 2018, 4 March 2019, 2 January 2020, and 9 September 2020, while strucchange" method presents three change-points that occurred in 3 May 2018, 28 February 2018, and 6 May 2019. Both methods have correctly identified the period from December of 2018 to May of 2019, and it seems that a regime started around 4 March 2019. According to the CE.AR method, the new regime started on 9 March 2020, so it managed to identify a very volatile period related to the COVID-19 pandemic.
If we bring the views from data itself to the real world, the cause of the changes may be the fallout from the China-United States trade war since 2018 or the uncertain climate in the Eurozone. The external shock has an impact on the Australian market, which has been reflected by the data and the estimated change points.
Hence, we suggest that the CE.AR algorithm would be a good tool for analyzing the daily AUD/CNY exchange rate; combined with the strucchange method, it will reveal more historical information. Forecasting the exchange rate based on the recent change-points will be done in the future research.

Conclusions
In this paper, we have proposed a new change-point detection algorithm, named CE.AR. The algorithm is able to detect change-points in piecewise AR(p) sequences, allowing variations in all parameters of the model. Our method involves two steps. Firstly, we build the MDL-based objective function to estimate the correct number of AR(p) segments. Then, we apply the Cross-Entropy algorithm for solving the combinatorial optimization problem. The numerical results show that, when the mean and the autoregression coefficient change moderately or strongly, the CE.AR method is very efficient at estimating multiple change-points. The Cross-Entropy algorithm also shows superior performance in detecting the locations of change-points. These two features make the proposed algorithm very useful from a practical point of view. We have demonstrated its usefulness by applying it to the latest real data set on the AUD/CNY exchange rate. The model considered here assumes an AR process; generalization to an ARIMA model is a matter for our further research. A limitation of the method comes from the under-segmentation when there is a small change in the mean or the autocorrelation coefficient. The improvement of detection accuracy for this problem needs to be taken into consideration in future research. In addition, it may be worth examining the impact of the lengths of the segments on the performance of the algorithm. Lastly, the question of whether our method would be helpful to improve the forecast's accuracy is also a topic for our future research.