Change Point Detection Using Penalized Multidegree Splines

Lee, Eun-Ji; Jhong, Jae-Hwan

doi:10.3390/axioms10040331

Open AccessArticle

Change Point Detection Using Penalized Multidegree Splines

by

Eun-Ji Lee

^†

and

Jae-Hwan Jhong

^*,†

Department of Information Statistics, Chungbuk National University, Cheongju 28644, Chungbuk, Korea

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Axioms 2021, 10(4), 331; https://doi.org/10.3390/axioms10040331

Submission received: 13 October 2021 / Revised: 24 November 2021 / Accepted: 29 November 2021 / Published: 1 December 2021

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

We consider a function estimation method with change point detection using truncated power spline basis and elastic-net-type

L^{1}

-norm penalty. The

L^{1}

-norm penalty controls the jump detection and smoothness depending on the value of the parameter. In terms of the proposed estimators, we introduce two computational algorithms for the Lagrangian dual problem (coordinate descent algorithm) and constrained convex optimization problem (an algorithm based on quadratic programming). Subsequently, we investigate the relationship between the two algorithms and compare them. Using both simulation and real data analysis, numerical studies are conducted to validate the performance of the proposed method.

Keywords:

change point detection; coordinate descent algorithm; elastic net; spline; quadratic programming

1. Introduction

A nonparametric function estimation is useful for estimating the actual relationship of data exhibiting nonlinear relationships. It is a statistical method for estimating the relationship between variables based on observed data, assuming that the actual function representing the relationships between input variables and response variables belongs to an infinite-dimensional parametric space. Representative examples, including kernel density estimation (chapter 1 of [1]), local polynomials [2], splines (pp. 118–180 of [3]) and curve estimation (chapter 2 of [4]), have been investigated previously.

A representative method used for function estimation is the basis function method. Among the basis function methods, splines are the most frequently used. In function estimation, an infinite number of parameters cannot be estimated using a finite amount of data. Therefore, function estimation is performed by introducing a function space and a basis function. If the function space and basis function are defined, then the shape of the function to be estimated using the basis function as a predictor variable can be expressed as a linear combination of the basis functions spanning the appropriate function space. Therefore, the function estimation problem can be regarded as estimating the regression coefficients.

Among the many basis functions used to interpolate data or fit smooth curves, the spline basis is the most typically used. The spline basis function is defined as a piecewise polynomial that is differentiable for every knot spacing. Its main underlying technology is truncated power splines (see chapter 3 of [5]). Additionally, it offers the advantages of simple construction and easy interpretation of the model parameters. However, in model with many overlapping intervals, basis functions are highly correlated. Therefore, several computational problems may occur when the number of knots increases significantly because the predictor matrix for the truncated power spline basis becomes dense. Consequently, incorrect fitting may occur (chapter 7 of [6]).

In the basis function methodology, the objective function to be optimized is a convex function. The coordinate descent algorithm (CDA) [7] is simple, efficient and useful for optimizing these objective functions. The concept of the algorithm is to optimize the solution of a problem involving the minimization (maximization) of a convex (concave) function for a multidimensional vector. Coefficient updates keep the remaining coefficients constant. The objective function is regarded as a one-dimensional function. Tseng [8] proved that the coordinate descent converges when a convex penalty term is non-differentiable but separable in each coordinate. This result implies that the coordinate-wise algorithms for the least absolute shrinkage and selection operator (lasso) [9], group lasso [10] and elastic net [11] converge to their optimal solutions. Another simpler method to obtain the optimal solution is to use quadratic programming (QP) [12]. QP is a type of nonlinear programming that obtains a mathematical optimization problem involving quadratic functions.

The selection of knots and the detection of change points in regression splines significantly affect the performance of the model. Knot selection and change point detection methods have been studied extensively. Osbone et al. [13] proposed an algorithm that allows an efficient computation of the lasso estimator for knot selection to be performed. Leitenstorfer and Tutz [14] considered the boosting technique to select variables in knot selection. Garton [15] proposed a method for selecting the number and position of knots using Gaussian and non-Gaussian data. Aminikhanhahi and Cook [16] proposed a change point detection method for time-series data. Meanwhile, Tibshirani et al. [17] proposed sparsity and smoothness using a fused lasso.

The main contributions of this study are as follows.

First, a new type of function estimator is established for change points in the nonparametric regression function estimation model. The proposed function estimator is defined by a linear combination of multidegree-based splines to simultaneously provide a change point detection and smoothing.

Second, two computational algorithms are introduced to address the constrained convex optimization problem (algorithm based on QP) and Lagrangian dual problem (CDA); moreover, the relationship between them is investigated. Numerical studies using both simulated and real datasets are provided to demonstrate the performance of the proposed method.

Herein, we present a new statistical learning methodology for modeling and analyzing data to address the change point detection problem. The proposed method involves two tuning parameters that control the penalty terms. We express the estimator as a linear combination of a polynomial and a truncated power-spline basis for zero and positive integer degrees, respectively. The coefficients are estimated using a CDA to minimize the residual sum of squares and QP. We consider three sets of simulation data to measure the performance of the proposed method and conduct an analysis on real data with changes or jump point(s).

This paper is organized as follows: In Section 2, we present a nonparametric regression model containing observed data and the penalized regression spline estimator. The process of updating the coefficients based on the CDA and the estimated coefficients obtained using QP is presented in Section 3. Additionally, the estimators obtained using the two methods are compared and their characteristics are introduced briefly. In Section 4, using the simulation and real data, we validate the performance of our proposed model. In Section 5, we summarize the conclusions of this study. All codes implemented using the software program R for numerical analysis are available as Supplementary Materials.

2. Model and Estimator

Consider a nonparametric regression model

y_{i} = f (x_{i}) + ε_{i} for i = 1, \dots, n,

(1)

where

{x_{i}}

’s

\in I \subset R

are the predictors and

{y_{i}}

’s

\in R

are responses. The model is nonparametric because it does not contain any parametric assumptions regarding f. Model (1) is expressed similarly as the conditional expectation

E (Y | X = x)

. In this case,

ε_{1}, \dots, ε_{n}

are only required to satisfy the property that the expected value is zero with positive variance

σ^{2}

and no distribution assumption is required. Without loss of generality, we write

I = [0, 1]

hereinafter for notational simplicity.

Because it is impossible to estimate an infinite number of parameters using a finite amount of data, a specific function space for estimation should be considered. In nonparametric estimation, it is typically assumed that f belongs to a massive class of functions [1]. In this study, because the goal is to achieve a smooth function estimation that can detect change point(s), the function space to which f belongs can be specified as follows.

Let

F_{K}

be the space of spline functions of mixed degrees 0 and positive integer

d > 0

with K interior knots

0 = t_{0} < t_{1} < \dots < t_{K} < t_{K + 1} = 1

. Therefore, any function

f \in F_{K}

can be expressed as

f (x; θ) = α_{0} + \sum_{j = 1}^{d} α_{j} x^{j} + \sum_{k = 1}^{K} \{β_{k} P_{0, k} (x) + γ_{k} P_{d, k} (x)\},

where

θ = (α, β, γ)

with

α = (α_{0}, α_{1}, \dots, α_{d})

,

β = (β_{1}, \dots, β_{K})

,

γ = (γ_{1}, \dots, γ_{K})

,

P_{0, k} (x) = \{\begin{matrix} 1 & if x > t_{k} \\ 0 & otherwise \end{matrix} and P_{d, k} (x) = {(x - t_{k})}_{+}^{d} for k = 1, \dots, K .

It is noteworthy that

{(a)}_{+}

is a function that assumes a real value of a if

a > 0

and 0 otherwise. We first select the number of the initial knots K for the splines and then place them at the same interval within

(0, 1)

. Generally, one may select a sufficiently large value of K—n, or

n / 2

, for example—to select data-adaptive knots.

The empirical risk of

θ

is defined as

R (θ) = \frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}; θ))}^{2} .

The penalized objective function to be minimized is expressed as

R^{λ, τ} (θ) = R (θ) + λ (τ {∥ β ∥}_{1} + (1 - τ) {∥ γ ∥}_{1})

(2)

for

λ > 0

,

τ \in [0, 1]

and

{∥ \cdot ∥}_{1}

, the

L^{1}

-norm. The penalty term is the concept of elastic-net penalty [11]. Similar to the lasso,

λ

reduces only

β

and

γ

. Furthermore, similar to an elastic net,

τ

is a tuning parameter that controls the relative penalty. If

τ

is of a higher value, then a greater weight is assigned to the

L^{1}

-norm of

β

, where the jump detection penalty is. By contrast, if

τ

is of a lower value, then a greater weight is assigned to the

L^{1}

-norm of

γ

, which is a penalty indicating the function smoothness.

Let

{\hat{θ}}^{λ, τ} = \underset{θ}{argmin} R^{λ, τ} (θ)

and define the multidegree spline estimator (MDSE) as

\hat{f} = f (\cdot; \hat{θ}) .

3. Implementation

Specifically in mathematical statistics fields, penalized regression problems can seek to optimize a multivariate quadratic function subject to linear constraints on the regression coefficients. To realize the proposed estimator, two computational methods are considered, i.e., the CDA and the algorithm for QP, using the quadprog package [18] in the R program. We describe the abovementioned two methods comprehensively and present their advantages and disadvantages. Both methods are implemented using the software program R.

3.1. CDA

The CDA optimizes the objective function with respect to a single coefficient at a time, iteratively cycling through all coefficients until convergence is reached. We adopt the CDA introduced by Jhong et al. (2017) to perform univariate smoothing for each coefficient. Because the objective function (2) is convex for

θ

, one can adopt the CDA to obtain the MDSE. This algorithm is inspired by the studies of Friedman et al. [19] and Jhong et al. [20]. However, it is a new attempt, in that the proposed algorithm is applied to the elastic net penalty considering two hyper parameters,

λ

and

τ

.

For

j = 0, 1, 2, \dots, d

and

k = 1, 2, \dots, K

, let

{\tilde{α}}_{j}

,

{\tilde{β}}_{k}

and

{\tilde{γ}}_{k}

denote the initial (or updated) values. Specifically,

\tilde{α} = ({\tilde{α}}_{0}, {\tilde{α}}_{1}, \dots, {\tilde{α}}_{d}) \in R^{1 + d}, \tilde{β} = ({\tilde{β}}_{1}, \dots, {\tilde{β}}_{K}) \in R^{K},

and

\tilde{γ} = ({\tilde{γ}}_{1}, \dots, {\tilde{γ}}_{K}) \in R^{K}

. For notational convenience, we use

\tilde{θ} = (\tilde{α}, \tilde{β}, \tilde{γ}) = ({\tilde{θ}}_{1}, \dots, {\tilde{θ}}_{J}) for J = 1 + d + 2 K .

Therefore, the coordinate descent update is written in the following forms:

{\tilde{θ}}_{j} \leftarrow \underset{θ_{j} \in R}{argmin} R^{λ, τ} ({\tilde{θ}}_{1}, \dots, {\tilde{θ}}_{j - 1}, θ_{j}, {\tilde{θ}}_{j + 1}, \dots, {\tilde{θ}}_{J}) for j = 1, \dots, J .

3.1.1. Updating $α$

The optimization problem for

α_{j}

,

j = 0, 1, \dots, d

has a quadratic form. Hence, we can solve this problem by transforming the objective function for

α_{j}

into a perfect square form. It is noteworthy that

x_{i}^{0} = 1

.

\begin{matrix} R^{λ, τ} ({\tilde{α}}_{0}, \dots, {\tilde{α}}_{j - 1}, α_{j}, {\tilde{α}}_{j + 1}, \dots, {\tilde{α}}_{d}, \tilde{β}, \tilde{γ}) \\ = \frac{1}{2 n} \sum_{i = 1}^{n} {\{y_{i} - \sum_{ℓ \neq j} {\tilde{α}}_{ℓ} x_{i}^{ℓ} - α_{j} x_{i}^{j} - \sum_{k = 1}^{K} ({\tilde{β}}_{k} P_{0, k} (x_{i}) + {\tilde{γ}}_{k} P_{d, k} (x_{i}))\}}^{2} \\ + λ τ \sum_{k = 1}^{K} | {\tilde{β}}_{k} | + λ (1 - τ) \sum_{k = 1}^{K} | {\tilde{γ}}_{k} | \\ = \frac{1}{2 n} \sum_{i = 1}^{n} {(r_{i, α}^{(j)} - α_{j} x_{i}^{j})}^{2} + (terms independent for α_{j}) \\ = \frac{\sum_{i = 1}^{n} {(x_{i}^{j})}^{2}}{2 n} {\{α_{j} - \frac{\sum_{i = 1}^{n} r_{i, α}^{(j)} x_{i}^{j}}{\sum_{i = 1}^{n} {(x_{i}^{j})}^{2}}\}}^{2} + (terms independent for α_{j}), \end{matrix}

where

r_{i, α}^{(j)} = y_{i} - \sum_{ℓ \neq j} {\tilde{α}}_{ℓ} x_{i}^{ℓ} - \sum_{k = 1}^{K} {\tilde{β}}_{k} P_{0, k} (x_{i}) - \sum_{k = 1}^{K} {\tilde{γ}}_{k} P_{d, k} (x_{i})

is a partial residual. Because

α_{j}

is not subject to penalization, one can update each coefficient systematically using the partial residual. Therefore, we update

{\tilde{α}}_{j} \leftarrow \frac{\sum_{i = 1}^{n} r_{i, α}^{(j)} x_{i}^{j}}{\sum_{i = 1}^{n} {(x_{i}^{j})}^{2}} for j = 0, 1, \dots, d .

(3)

3.1.2. Updating $β$ and $γ$

Unlike updating

α

, one cannot use only the partial residual to update

β

and

γ

, because the objective function contains the penalty terms. Hence, we used the soft-thresholding operator [19] to solve the lasso solutions. The soft-thresholding operator is defined as

S (y, λ) = \{\begin{matrix} y - λ & if y > λ \\ y + λ & if y < - λ \\ 0 & otherwise \end{matrix} for y \in R and λ > 0 .

We select

k \in {1, \dots, K}

and transform the objective function for

β_{k}

into a quadratic form. Subsequently,

\begin{matrix} R^{λ, τ} (\tilde{α}, {\tilde{β}}_{1}, \dots, {\tilde{β}}_{k - 1}, β_{k}, {\tilde{β}}_{k + 1}, \dots, {\tilde{β}}_{K}, \tilde{γ}) \\ = \frac{1}{2 n} \sum_{i = 1}^{n} {\{y_{i} - \sum_{j = 0}^{d} {\tilde{α}}_{j} x_{i}^{j} - \sum_{ℓ \neq k} {\tilde{β}}_{ℓ} P_{0, ℓ} (x_{i}) - β_{k} P_{0, k} (x_{i}) - \sum_{ℓ = 1}^{K} {\tilde{γ}}_{ℓ} P_{d, ℓ} (x_{i})\}}^{2} \\ + λ τ \sum_{ℓ \neq k} | {\tilde{β}}_{ℓ} | + λ τ | β_{k} | + λ (1 - τ) \sum_{ℓ = 1}^{K} | {\tilde{γ}}_{ℓ} | \\ = \frac{1}{2 n} \sum_{i = 1}^{n} {(r_{i, β}^{(k)} - β_{k} P_{0, k} (x_{i}))}^{2} + λ τ | β_{k} | + (terms independent for β_{k}), \end{matrix}

where

r_{i, β}^{(k)} = y_{i} - \sum_{j = 0}^{d} {\tilde{α}}_{j} x_{i}^{j} - \sum_{ℓ \neq k} \tilde{β_{ℓ}} P_{0, ℓ} (x_{i}) - \sum_{ℓ = 1}^{K} {\tilde{γ}}_{ℓ} P_{d, ℓ} (x_{i}) .

Finally, using the soft-thresholding operator, we can update

{\tilde{β}}_{k}

as follows:

{\tilde{β}}_{k} \leftarrow \frac{S (\frac{1}{n} \sum_{i = 1}^{n} r_{i, β}^{(k)} P_{0, k} (x_{i}), λ τ)}{\frac{1}{n} \sum_{i = 1}^{n} {[P_{0, k} (x_{i})]}^{2}} .

(4)

Similarly,

{\tilde{γ}}_{k}

is expressed as

{\tilde{γ}}_{k} \leftarrow \frac{S (\frac{1}{n} \sum_{i = 1}^{n} r_{i, γ}^{(k)} P_{d, k} (x_{i}), λ (1 - τ))}{\frac{1}{n} \sum_{i = 1}^{n} {[P_{d, k} (x_{i})]}^{2}} for k = 1, \dots, K .

(5)

3.1.3. Algorithm Details

Algorithm 1 shows the process of computing the proposed estimator using the CDA. We implement the code using the software program R. For the observed data, we create a basis matrix B to fit to our model.

B = [\begin{matrix} 1 & X^{1} & \dots & X^{d} & P_{0} & P_{d} \end{matrix}] \in R^{n \times (1 + d + 2 K)},

where

1 = [\begin{matrix} 1 & \dots & 1 \end{matrix}] \in R^{n}

,

X^{j} = [\begin{matrix} x_{1}^{j} & \dots & x_{n}^{j} \end{matrix}] \in R^{n}

for

j = 1, \dots d

.

P_{0} = [\begin{matrix} P_{0, 1} & \dots & P_{0, K} \end{matrix}] \in R^{n \times K}

, and

P_{d} = [\begin{matrix} P_{d, 1} & \dots & P_{d, K} \end{matrix}] \in R^{n \times K}

. After B is created, we select the tuning parameters. We compute

λ_{m a x}

, which is the smallest value that causes all coefficients to be zero, as follows:

λ_{m a x} = max_{k} (\frac{1}{n} | 〈 P_{0, k} (x), y 〉 |, \frac{1}{n} | 〈 P_{d, k} (x), y 〉 |),

where

〈 x, y 〉 = \sum_{i = 1}^{n} x_{i} y_{i}

for n-dimensional vectors x and y, respectively. Thereafter, we generate candidates for the two tuning parameters

λ

and

τ

, respectively.

For the case involving

λ

, the equally spaced candidates decrease from

log (λ_{m a x})

to

log (λ_{m a x} \times λ_{m i n_m a x_r a t i o})

for a maximum of

n_{λ}

times, which is the number of

λ

candidates.

λ_{m i n_m a x_r a t i o}

is the ratio of the maximum and minimum values of

λ

, which is a sufficiently low positive value. Subsequently, it is reconverted to the exponential scale and the last candidate value is replaced with a zero. Next, in the case of

τ

, we generate

n_{τ}

values from 0 to 1 at equal intervals.

Subsequently, we compute the value of the initial objective function as

\sum_{i = 1}^{n} y_{i}^{2} / 2 n

. Thereafter, we begin updating all the coefficients until that is achieved. Finally, we select the best model among

n_{λ} \times n_{τ}

models. In statistical data analysis, some representative information criteria exist, such as the Akaike information criterion (AIC; [21]), the Bayesian information criterion (BIC) and cross validation (CV). In general, the AIC and CV involve the selection of a model with a relatively large variance and small bias compared with the BIC. Because the purpose of this study is to accurately select significant change points, we use the BIC [22], which involves the selection of a model with a relatively small variance, defined as follows:

B I C = n log (2 \times R^{λ, τ} (\hat{θ})) + p log (n),

where p is the number of non-zero coefficients. We select the best model that minimized the BIC. For a detailed calculation of the degrees of freedom p in penalized regression model, see [9].

Algorithm 1: Coordinate descent algorithm (CDA).

3.2. Quadratic Programming

The disadvantage of the CDA is that the computational cost increases with the higher number of iterations. Therefore, we use a new algorithm, i.e., QP, to compute the proposed estimator. In QP, certain mathematical optimization problems involving quadratic functions are solved. In this regard, a multivariate quadratic function subject to linear constraints on the coefficients can be minimized. Hence, because the solution can be obtained without iterations, QP demonstrates computational advantage over the CDA.

The Lagrangian dual problem (2) is converted into the constraint optimization problem, such that

\hat{θ} = (\hat{α}, \hat{β}, \hat{γ}) = \underset{θ}{argmin} R (θ) {subject to τ | | β | |}_{1} + (1 - τ) {| | γ | |}_{1} \leq t,

where

t \geq 0

. For

u \in R

, let

u_{+} = \{\begin{matrix} u & if u \geq 0 \\ 0 & if u < 0 \end{matrix} and u_{-} = \{\begin{matrix} - u & if u \leq 0 \\ 0 & if u > 0 \end{matrix} .

Therefore,

u = u_{+} - u_{-}

and

| u | = u_{+} + u_{-}

. By reparameterizating

β_{k}

and

γ_{k}

,

β_{k} = {(β_{k})}_{+} - {(β_{k})}_{-}, | β_{k} | = {(β_{k})}_{+} + {(β_{k})}_{-},

γ_{k} = {(γ_{k})}_{+} - {(γ_{k})}_{-}, | γ_{k} | = {(γ_{k})}_{+} + {(γ_{k})}_{-}

Therefore, the optimization problem is equivalent to minimizing

\frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - \sum_{j = 0}^{d} α_{j} x_{i}^{j} - \sum_{k = 1}^{K} [\{{(β_{k})}_{+} - {(β_{k})}_{-}\} P_{0, k} (x_{i}) + \{{(γ_{k})}_{+} - {(γ_{k})}_{-}\} P_{d, k} (x_{i})])}^{2}

subject to

τ \{\sum_{k = 1}^{K} ({(β_{k})}_{+} + {(β_{k})}_{-})\} + (1 - τ) \{\sum_{k = 1}^{K} ({(γ_{k})}_{+} + {(γ_{k})}_{-})\} \leq t,

in addition to

4 \times K

non-negativity constraints

{(β_{k})}_{+} \geq 0, {(β_{k})}_{-} \geq 0, {(γ_{k})}_{+} \geq 0, {(γ_{k})}_{-} \geq 0 f o r k = 1, \dots, K .

Next, we define

Z = [\begin{matrix} 1 & X^{1} & \dots & X^{d} & P_{0} & - P_{0} & P_{d} & - P_{d} \end{matrix}] \in R^{n \times (1 + d + 4 K)}

and

b = [\begin{matrix} α & β_{+} & β_{-} & γ_{+} & γ_{-} \end{matrix}] \in R^{1 + d + 4 K},

where

β_{+} = {{(β_{k})}_{+}}

,

β_{-} = {{(β_{k})}_{-}}

,

γ_{+} = {{(γ_{k})}_{+}}

, and

γ_{-} = {{(γ_{k})}_{-}}

. Subsequently,

R (θ) = \frac{1}{2 n} {(y - Z b)}^{⊤} (y - Z b) = \frac{1}{2 n} y^{⊤} y - \frac{1}{n} y^{⊤} Z b + \frac{1}{2 n} b^{⊤} (Z^{⊤} Z) b .

Hence, the minimization problem is equivalent to

min_{b} (- \frac{1}{n} y^{⊤} Z b + \frac{1}{2 n} b^{⊤} (Z^{⊤} Z) b)

subject to

{[\begin{matrix} 0_{(1 + d) \times 1} & 0_{(1 + d) \times K} & 0_{(1 + d) \times K} & 0_{(1 + d) \times K} & 0_{(1 + d) \times K} \\ - τ_{K \times 1} & I_{K \times K} & 0_{K \times K} & 0_{K \times K} & 0_{K \times K} \\ - τ_{K \times 1} & 0_{K \times K} & I_{K \times K} & 0_{K \times K} & 0_{K \times K} \\ - {(1 - τ)}_{K \times 1} & 0_{K \times K} & 0_{K \times K} & I_{K \times K} & 0_{K \times K} \\ - {(1 - τ)}_{K \times 1} & 0_{K \times K} & 0_{K \times K} & 0_{K \times K} & I_{K \times K} \end{matrix}]}^{⊤} [\begin{matrix} α \\ β_{+} \\ β_{-} \\ γ_{+} \\ γ_{-} \end{matrix}] \geq [\begin{matrix} - t \\ 0_{K \times 1} \\ 0_{K \times 1} \\ 0_{K \times 1} \\ 0_{K \times 1} \end{matrix}],

where

τ_{K \times 1} = (τ, \dots, τ) \in R^{K}

and

{(1 - τ)}_{K \times 1} = ((1 - τ), \dots, (1 - τ)) \in R^{K}

. Hence, the problem can be computed via QP.

In R, the solve.QP function in the quadprog package [18] provides the solution for QP problems that involve minimizing

(- {dvec}^{⊤} x + 1 / 2 x^{⊤} Dmat x)

with respect to x with constraints

{Amat}^{⊤} x \geq bvec

. Dmat is the matrix that appears in the quadratic function to be minimized. dvec is the vector that appears in the quadratic function to be minimized. bvec is the vector that has the values of the constraints. Amat is the matrix that defines the constraints for minimizing the quadratic function. Therefore, in our case, the matrices and vectors defined above correspond to

Dmat \leftarrow \frac{1}{n} (Z^{⊤} Z), dvec \leftarrow \frac{1}{n} y^{⊤} Z, bvec \leftarrow [\begin{matrix} - t \\ 0_{4 K \times 1} \end{matrix}]

and

Amat \leftarrow [\begin{matrix} 0_{(1 + d) \times 1} & 0_{(1 + d) \times K} & 0_{(1 + d) \times K} & 0_{(1 + d) \times K} & 0_{(1 + d) \times K} \\ - τ_{K \times 1} & I_{K \times K} & 0_{K \times K} & 0_{K \times K} & 0_{K \times K} \\ - τ_{K \times 1} & 0_{K \times K} & I_{K \times K} & 0_{K \times K} & 0_{K \times K} \\ - {(1 - τ)}_{K \times 1} & 0_{K \times K} & 0_{K \times K} & I_{K \times K} & 0_{K \times K} \\ - {(1 - τ)}_{K \times 1} & 0_{K \times K} & 0_{K \times K} & 0_{K \times K} & I_{K \times K} \end{matrix}] .

It is noteworthy that a small positive vector is added to the diagonal entries of

\frac{1}{n} (Z^{⊤} Z)

to guarantee a positive definite matrix and prevent numerical computational issues.

Finally, the solution of b obtained via QP is defined as

\hat{b} = [\begin{matrix} \hat{α} & {\hat{β}}_{+} & {\hat{β}}_{-} & {\hat{γ}}_{+} & {\hat{γ}}_{-} \end{matrix}],

and the MDSE is expressed as

{\hat{β}}_{k} = {({\hat{β}}_{k})}_{+} - {({\hat{β}}_{k})}_{-}, and {\hat{γ}}_{k} = {({\hat{γ}}_{k})}_{+} - {({\hat{γ}}_{k})}_{-} for k = 1, \dots, K .

3.3. Comparison between CDA and QP

We introduce two computational algorithms to solve the constrained convex optimization problem (algorithm based on QP) and the Lagrangian dual problem (CDA); subsequently, we investigate the relationship between them.

First, we verify that the CDA estimator and the estimator obtained via QP are identical. Subsequently, a simulation is performed using the sample size

n = 300

with one true knot of 0.5, as shown in Figure 1.

In this simulation, the settings used for the CDA yield zero for all the initial coefficients, with

n_{λ} = 50

(the number of

λ

),

n_{τ} = 10

(the number of

τ

),

d = 1

(the non-zero degree),

ϵ = 1 \times 10^{- 10}

(stopping criterion) and

R = 5000

(maximum iteration). Using

λ

and

τ

obtained from the best CDA model, the corresponding tuning parameter t required for QP is obtained as follows:

t = λ τ \sum_{k = 1}^{K} | {\hat{β}}_{k} | + λ (1 - τ) \sum_{k = 1}^{K} | {\hat{γ}}_{k} | .

The above relation can be derived using the Karush–Kuhn–Tucker (KKT) optimality condition [23] based on the sub-gradient [24], since the

L^{1}

-norm penalty term cannot be differentiated with respect to the coefficients.

The remaining options are set to the same value and then are simulated. As shown in Figure 1, we confirm that both algorithms yield the same results. However, when the number of knots (K) increases, the CDA does not converge appropriately to the minimum under a limited iteration number (M) and the stopping criterion (

ϵ

), unlike QP. Hence, we performe the next simulation using only QP. It is noteworthy that a higher number K of interior knots indicates a higher probability of overfitting.

4. Numerical Analysis

4.1. Simulation

In this section, we analyze the performance of the proposed method based on simulated examples. From model (1), we generate the predictor as n sequences in the range

[0, 1]

.

ε_{i}

is generated from

N (0, σ^{2})

for

i = 1, \dots, n

and

σ

is set 0.25 for all examples.

Next, we present three examples, as follows.

Example 1.

Piecewise linear function with a single change point

\begin{matrix} f (x) & = \{\begin{matrix} 13 x & i f 0 \leq x < 0.333 \\ - 10 x + 3 & i f 0.333 \leq x \leq 1 \end{matrix} \end{matrix}

Example 2.

Piecewise linear function with two change points

\begin{matrix} f (x) & = \{\begin{matrix} 8 x & i f 0 \leq x < 0.333 \\ 8 x - 2 & i f 0.333 \leq x < 0.570 \\ 2.56 & i f 0.570 \leq x < 0.666 \\ - 10 x + 11.27 & i f 0.666 \leq x \leq 1 \end{matrix} \end{matrix}

Example 3.

Piecewise cubic function with a single change point

\begin{matrix} f (x) & = \{\begin{matrix} 20 x^{3} + 50 x^{2} & i f 0 \leq x < 0.333 \\ - 12 x^{2} + 15 & i f 0.333 \leq x \leq 1 \end{matrix} \end{matrix}

In Examples 1–2, a linear model with the jump(s) is considered. In Example 1, the true knot is 0.333, whereas they are 0.333 and 0.666 in Example 2. Example 3 is designed as a nonlinear function, where the true function is of quadratic and cubic form. Figure 2 shows the true functions of Examples 1–3.

We compare the proposed method with the fused lasso (FL) of Tibshirani et al. [17], the trend filtering (TF) of Kim et al. [25], the smoothing spline (SS) of Kim and Gu [26] and a fitting method for the structural model for a time series (ST) by Harvey [27].

The FL is a method that yields a solution path for the general fused lasso problem. The TF is the method of computing the solution path for the trend filtering problem of an arbitrary polynomial order. The SS is a method of fitting smoothing spline ANOVA models in Gaussian regression. The ST fits a structural model for a time series using the maximum likelihood. It is noteworthy that the FL, TF and SS estimators use nonparametric regression function estimation models. In particular, the TF and SS are function estimators based on linear combinations of spline basis functions, similar to ours. In addition, the FL is an estimator that specializes in change point detection as a zero-degree piecewise constant function.

The FL and TF estimators are provided in the genlasso package in R [28]. They require the use of cross-validation for parameter selection. In addition, the SS and ST can be used in R via the gss [29] and stats packages [30], respectively. To reduce the computational burden, we compute all the estimators using the default settings for each package.

We consider the mean squared error (MSE), mean absolute error (MAE) and maximum deviation (MXDV) criterion as loss functions that measure the discrepancy between the true function f and each function estimator

\hat{f}

. They are expressed as follows:

\begin{matrix} MSE (\hat{f}) = \frac{1}{n} \sum_{i = 1}^{n} {(f (x_{i}) - \hat{f} (x_{i}))}^{2}, MAE (\hat{f}) = \frac{1}{n} \sum_{i = 1}^{n} |f (x_{i}) - \hat{f} (x_{i})| \end{matrix}

\begin{matrix} and MXDV (\hat{f}) = max_{1 \leq i \leq n} |f (x_{i}) - \hat{f} (x_{i})| . \end{matrix}

Examples 1–2 are simulated based on

d = 1

, whereas Example 3 is simulated based on

d = 3

, with

K = 20

and

K = 50

, respectively. In addition, a simulation is performed by setting TF with orders 1 and 3 for linear and cubic functions, respectively. The simulation is repeated 100 times for each example based on sample sizes of 200, 300 and 500. Table 1, Table 2 and Table 3 show the experimental results of various scenarios for each of the examples.

As shown in Table 1, because QP not only can detect change points (or jumps), but can also fit piecewise polynomial regression, it demonstrates the best performance compared with the other methods. Comparing the QP results based on K, it is discovered that the best performance is obtained when

K = 20

. Figure 3 shows the plot for the results of Example 1 when

n = 500

. QP detects the change point and fits the linear regression accurately, particularly when

K = 20

. Although the performances of FL and TF with order 1 improve in general as the sample size n increases, it cannot detect the change point. Herein, “cannot detect the change point” means that, around the true knot, the plot has point(s) alone or successive.

Table 2 lists the results of Example 2, which is involved the same slope, two jumps and no jump at

x = 0.57

. It is observed that QP demonstrates the best performance when

K = 20

, not

K = 50

. Figure 4 shows the plot for Example 2 when

n = 500

. QP detects the change point and fits the linear regression accurately, particularly when

K = 20

. Although most performances of FL and TF with order 1 improve in general, as the sample size n increases, they cannot detect the change point. Herein, “cannot detect the change point” means that, around the true knot, the plot has point(s) alone or successive.

Table 3 lists the results of Example 3, which is involved one jump as well as quadratic and cubic forms. Similarly, the proposed QP estimator yieldes the best results.

K = 20

yieldes the best results in terms of the MSE and MAE, whereas

K = 50

in terms of the MXDV. Figure 5 shows the results of Example 3 for all methods when

n = 500

. As shown, even when it do not fit the polynomial regression line well at

K = 50

, it detects the point of change accurately at the true knot.

QP detects the change point and fits the linear regression accurately, particularly when

K = 20

. Although the performances of the FL and TF with order 1 improve as the sample size n increases, they cannot detect the change point. Herein, “cannot detect the change point” means that, around the true knot, the plot has point(s) alone or successive.

4.2. Real Data Analysis

The data obtained from Investing.com (last accessed date: 1 October 2021) the exchange rate change of the Icelandic króna (ISK) per US dollar (USD) from January 2004 to December 2015, measured as the monthly average.

The data contained 144 sets of observations associated with the subprime mortgage crisis, which occurred between 2007 and 2008. This crisis refers to a series of economic crises that began with the bankruptcy of the largest mortgage lenders in the United States and caused a credit crunch not only in the United States, but also in the international financial market. One of affected countries was Iceland, whose three largest banks were affected by the crisis in September 2008. To revive the economy, the Icelandic government has implemented various policies, including the nationalization of banks. Consequently, the ISK exchange rate per USD increased. In Figure 6, the top panel shows the starting point for the data. In January 2004, the observed value is 69.345; in March 2008, the observed value increases gradually to 75.805 and then increases significantly to 105.965 in September. In the last observation, performed in December 2016, a continuous increase of the value is observed (129.960).

Because the input variable is a date and not numeric data, 144 samples are equally spaced in the interval [0, 1] for the proposed QP implementation. A degree exceeding four can result in a more complicated interpretation and overfitting. Therefore, we apply

d = 1, 2

and 3.

Figure 6 shows the fitted results. The top panel of Figure 6 shows a scatterplot of the data. The remaining panels show the results of the QP model fitting when

d = 1, 2

and 3. It appears that the proposed MDSE based on the BIC detects a change point in September 2008 and fits the flow of the exchange rate well. Prior to 2008, no significant change is detected, but from January 2008, the exchange rate of ISK per USD increases gradually. Subsequently, beginning from September 2008, the exchange rate of ISK per USD increases significantly, as indicated by the vertical black dashed line in the figure (representing September 2008).

5. Conclusions

In this study, we develop a nonparametric regression function estimation method with change point detection. In order to provide a data-driven knot selection method, we consider elastic-net-type

L^{1}

-norm penalty for the estimating regression function. Once we expressed the estimator with a linear combination of truncated power splines, the coefficients were estimated by minimizing a penalized residual sum of squares. A new coordinate descent algorithm and an algorithm based on quadratic programming are introduced to handle penalty terms determined by the regression coefficients. In the numerical analysis, we only use an algorithm based on quadratic programming because of the high computational cost of the coordinate descent algorithm when the number of knots increases. To verify the performance of the proposed estimator (MDSE), in the three simulations, we compared MDSE with four methods; the fused lasso (FL), trend filtering (TF), smoothing spline (SS) and structural model for a times series (ST). As a result, in all simulations, the MDSE detected the change point(s) and fitted linear and cubic trends of the piecewise polynomial well. Furthermore, we performed an actual data analysis about the sharp change in exchange rate of the Icelandic króna per US dollar in September 2008 caused by the subprime mortgage crisis in the United States between 2007 and 2008. In conclusion, MDSE can accurately detect September 2008, the time when a sharp increase in the exchange rate occurred, and can also represent the exchange rate trend before and after September 2008 well.

To improve the proposed coordinate descent algorithm, one may apply the B-spline basis and total variation penalty terms. The B-spline basis function is a numerically superior alternative to the truncated power basis. Because it has small supports, the predictor matrix is sparse and the information matrix is banded (chapter 1 and 8 of [31]). The B-spline-based total variation penalty term is a generalized lasso problem [32,33] that is difficult to implement, but affords high computational power. These approaches will be investigated in future studies.

Supplementary Materials

The following are available at https://www.mdpi.com/article/10.3390/axioms10040331/s1, The supplementary files contain a data set and R code to perform the proposed method described in the article (implemented by the software R version 4.0.2.).

Author Contributions

The authors contributed equally to this study; methodology, E.-J.L. (numerical framework) and J.-H.J. (modeling and defining estimator); implementation, E.-J.L. (CDA), J.-H.J. (QP); writing, E.-J.L. (numerical analysis), J.-H.J. (model and estimator). All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2020R1G1A1A01100869).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are provided in the Supplementary Materials.

Acknowledgments

The authors wish to thank Jae-Kwon Oh for assistance with the numerical study.

Conflicts of Interest

The authors declare no conflict of interest.

References

Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
Fan, J.; Gijbels, I.; Hu, T.C.; Huang, L.S. A study of variable bandwidth selection for local polynomial regression. Stat. Sin. 1996, 6, 113–127. [Google Scholar]
Efromovich, S. Nonparametric Curve Estimation: Methods, Theory, and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
Green, P.J.; Silverman, B.W. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach; CRC Press: Boca Raton, FL, USA, 1993. [Google Scholar]
Massopust, P. Interpolation and Approximation with Splines and Fractals; Oxford University Press, Inc.: Oxford, UK, 2010. [Google Scholar]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112. [Google Scholar]
Wright, S.J. Coordinate descent algorithms. Math. Program. 2015, 151, 3–34. [Google Scholar] [CrossRef]
Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 2001, 109, 475–494. [Google Scholar] [CrossRef]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
Meier, L.; Van De Geer, S.; Bühlmann, P. The group lasso for logistic regression. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2008, 70, 53–71. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef]
Goldfarb, D.; Idnani, A. A numerically stable dual method for solving strictly convex quadratic programs. Math. Program. 1983, 27, 1–33. [Google Scholar] [CrossRef]
Osborne, M.; Presnell, B.; Turlach, B. Knot selection for regression splines via the lasso. Comput. Sci. Stat. 1998, 30, 44–49. [Google Scholar]
Leitenstorfer, F.; Tutz, G. Knot selection by boosting techniques. Comput. Stat. Data Anal. 2007, 51, 4605–4621. [Google Scholar] [CrossRef]
Garton, N.; Niemi, J.; Carriquiry, A. Knot selection in sparse Gaussian processes. arXiv 2020, arXiv:2002.09538. [Google Scholar]
Aminikhanghahi, S.; Cook, D.J. A survey of methods for time series change point detection. Knowl. Inf. Syst. 2017, 51, 339–367. [Google Scholar] [CrossRef] [PubMed]
Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2005, 67, 91–108. [Google Scholar] [CrossRef]
Turlach, B.A.; Weingessel, A. Quadprog: Functions to Solve Quadratic Programming Problems; R Package Version 1.5-8. 2019. Available online: https://cran.r-project.org/package=quadprog (accessed on 15 July 2021).
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed]
Jhong, J.H.; Koo, J.Y.; Lee, S.W. Penalized B-spline estimator for regression functions using total variation penalty. J. Stat. Plan. Inference 2017, 184, 77–93. [Google Scholar] [CrossRef]
Bozdogan, H. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 1987, 52, 345–370. [Google Scholar] [CrossRef]
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
Luenberger, D.G.; Ye, Y. Linear and Nonlinear Programming; Springer: Berlin/Heidelberg, Germany, 1984; Volume 2. [Google Scholar]
Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 2015. [Google Scholar]
Kim, S.J.; Koh, K.; Boyd, S.; Gorinevsky, D. ℓ₁ trend filtering. SIAM Rev. 2009, 51, 339–360. [Google Scholar] [CrossRef]
Kim, Y.J.; Gu, C. Smoothing spline Gaussian regression: More scalable computation via efficient approximation. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2004, 66, 337–356. [Google Scholar] [CrossRef]
Harvey, A.C. Forecasting, Structural Time Series Models and the Kalman Filter; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
Arnold, T.B.; Tibshirani, R.J. Efficient implementations of the generalized lasso dual path algorithm. J. Comput. Graph. Stat. 2016, 25, 1–27. [Google Scholar] [CrossRef]
Gu, C. Smoothing spline ANOVA models: R package gss. J. Stat. Softw. 2014, 58, 1–25. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2021. [Google Scholar]
De Boor, C.; De Boor, C. A Practical Guide to Splines; Springer: New York, NY, USA, 1978; Volume 27. [Google Scholar] [CrossRef]
Roth, V. The generalized LASSO. IEEE Trans. Neural Netw. 2004, 15, 16–28. [Google Scholar] [CrossRef] [PubMed]
Tibshirani, R.J.; Taylor, J. The solution path of the generalized lasso. Ann. Stat. 2011, 39, 1335–1371. [Google Scholar] [CrossRef]

Figure 1. Plot showing comparison between CDA and QP. Red circles and blue triangles represent CDA and QP fitted values, respectively.

Figure 2. Plot showing true function of Examples 1–3. Top-left panel is for Example 1, which has a true knot of 0.333. Top-right panel is for Example 2, which has two true knots of 0.333 and 0.666. Bottom-left panel is for Example 3, which exhibits a nonlinear form with a true knot of 0.333.