Article

A-Spline Regression for Fitting a Nonparametric Regression Function with Censored Data

1
Mugla Sitki Kocman University, Faculty of Science, Statistics, Muğla 48000, Turkey
2
Faculty of Science, Mathematics and Statistics, Brock University, Niagara Region, St. Catharines, ON L2S 3A1, Canada
*
Author to whom correspondence should be addressed.
Stats 2020, 3(2), 120-136; https://doi.org/10.3390/stats3020011
Submission received: 1 May 2020 / Revised: 21 May 2020 / Accepted: 26 May 2020 / Published: 29 May 2020

Abstract:
This paper aims to solve the problem of fitting a nonparametric regression function with right-censored data. In the literature, censorship in the response variable is generally handled by a synthetic data transformation based on the Kaplan–Meier estimator. In the context of synthetic data, there have been different studies on the estimation of right-censored nonparametric regression models based on smoothing splines, regression splines, kernel smoothing, local polynomials, and so on. It should be emphasized that synthetic data transformation manipulates the observations: it assigns zero values to the censored data points and inflates the magnitudes of the uncensored ones. Thus, an irregularly distributed dataset is obtained. We claim that adaptive spline (A-spline) regression can deal with this irregular dataset more easily than the smoothing techniques mentioned here, due to the freedom to determine the degree of the spline as well as the number and location of the knots. The theoretical properties of A-splines with synthetic data are detailed in this paper. Additionally, we support our claim with numerical studies, including a simulation study and a real-world data example.

1. Introduction

Let $(x_i, y_i)$, $1 \le i \le n$, be a sample of observations where the $x_i$'s are values of a one-dimensional covariate $x$ and the $y_i$'s denote the values of the completely observed response (lifetime) variable $y$. In medical studies such as clinical trials, $y$ is often subject to random right-censoring by a random variable $c$, with values $c_i$ representing the censorship times, i.e., patient withdrawal times. In this case, the observed response values at the design points $x_1, x_2, \ldots, x_n$ will be the $t_i$'s, defined as
$t_i = \min(y_i, c_i), \qquad \delta_i = \begin{cases} 1, & y_i \le c_i \ \text{(uncensored)} \\ 0, & y_i > c_i \ \text{(censored)} \end{cases}$  (1)
where the $\delta_i$'s are the values of the censoring indicator function that contains the censoring information. It should be noted that $y_i$, $c_i$, and $t_i$ have distribution functions $F_x(s) = P(y_i \le s)$, $G_x(s) = P(c_i \le s)$, and $H_x(s) = P(t_i \le s)$ for $s \in \mathbb{R}$, respectively. Additionally, we assume that the $y_i$'s and $c_i$'s are independent, which is a very common assumption in right-censored analysis (see [1,2]). Thus, the relationship between the distribution of $t$ and those of $(y, c)$ can be written as follows, in terms of the corresponding survival functions:
$1 - H_x(s) = \big(1 - F_x(s)\big) \cdot \big(1 - G_x(s)\big)$  (2)
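To make the censoring mechanism in Equation (1) concrete, the following sketch generates a right-censored sample. It is written in Python with NumPy (the paper's own code is in R), and the exponential lifetime and censoring distributions are illustrative assumptions, not part of the paper's design.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
y = rng.exponential(scale=2.0, size=n)   # true lifetimes (assumed distribution)
c = rng.exponential(scale=4.0, size=n)   # censoring times (assumed distribution)

t = np.minimum(y, c)                     # observed response: t_i = min(y_i, c_i)
delta = (y <= c).astype(int)             # delta_i = 1 if uncensored, 0 if censored

# a censored observation records the censoring time, not the lifetime
assert np.all(t[delta == 0] == c[delta == 0])
```

Under the independence assumption above, the survival function of `t` factorizes as in Equation (2).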
This paper considers the problem of fitting a nonparametric regression function with right-censored data. Based on the definition in Equation (1) and the assumption in Equation (2), the nonparametric regression model $y_i = f(x_i) + \varepsilon_i$ can be written as
$t_i = f(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n$  (3)
where $t = (t_1, t_2, \ldots, t_n)$ are the right-censored response values, $f(\cdot)$ is the smooth function to be estimated, and the $\varepsilon_i$'s are normally distributed random errors, $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$.
In the context of linear regression, the estimation with censored data is performed using the linear regression model proposed by [3]. Different estimators based on ordinary least squares for linear regression under right-censored data were introduced by [4,5,6,7]. In addition, some theoretical extensions are discussed by [8,9]. Note that all the methods discussed by the above-mentioned authors rest on the assumption of a linear relationship between the censored responses and the independent variables. In real-world applications, it cannot be known whether the relationship between the responses and the explanatory variables is linear. Although there are some procedures to test linearity, they cannot be applied directly to censored data because they were designed for uncensored data. In this scenario, a nonparametric regression model is widely preferred.
There are various studies in the literature on estimating the model in Equation (3). These existing approaches can be classified as spline-based methods, kernel smoothers, or local smoothing techniques. Spline-based techniques for right-censored data can be categorized as either smoothing splines ([10,11]) or regression splines [12]. In terms of estimating the model in Equation (3), the difference between them is that smoothing splines must use all unique data points as knots; because the fitted curve then tries to pass through both the inflated values and the zeros of the synthetic data, the variance of the model becomes large. In regression splines, the knot points can be chosen freely, and for this reason alone regression splines perform better than smoothing splines (see [12]). However, as is known, regression splines work with truncated power basis polynomials, which force the method to work with a fixed degree. Studies on kernel smoothers include [13,14]; research on local smoothing techniques can be found in [15,16]. In this study, an adaptive ridge estimator (or A-spline) based on a B-spline basis is introduced to estimate the model in Equation (3).
It is obvious that conventional regression estimators, whether nonparametric or not, cannot be used directly for modeling censored data. To solve this issue, three approaches are taken in the literature: Kaplan–Meier weights [4], synthetic data transformation ([17,18]), and data imputation techniques. This paper focuses on synthetic data transformation, which is the most widely used technique in the literature. The main contribution of this technique is that, based on the Kaplan–Meier estimator ($\hat G_x$) of the distribution of the censoring variable ($c_i$), the synthetic data and the completely observed response variable ($y_i$) have theoretically equal expected values, which can be expressed as $E(t_i^{\hat G}) \approx E(y_i)$; this is achieved by inflating the magnitudes of the uncensored observations and assigning zero values to the censored ones. Details about synthetic data transformation are given in Section 2.
The main motivation of this paper is to present a new nonparametric estimator that deals with synthetic response observations better than the existing approaches, whose restrictions when modeling synthetic data are indicated above. For this purpose, we introduce a modified A-spline estimator, which imposes no restrictions on the number of knots, the location of the knots, or the degree of the splines.
The A-spline proposed by [19] provides a sparse regression model that is easy to understand and interpret. A hallmark of the A-spline is that it can determine suitable knot points for B-splines by using adaptive ridge regression (see [20] for adaptive ridge regression), based on an approximation of the $L_0$-norm obtained by an iterative procedure (see [19,21] for more details).
In Section 2, our methodology is presented with a synthetic data transformation, a B-spline regression, an adaptive ridge approach, and, finally, a modified A-spline estimator for the nonparametric regression model based on the synthetic responses. We also give an algorithm to obtain the introduced estimator. Section 3 involves the statistical and asymptotic properties of the obtained estimator. A simulation study and real-world data application are given in Section 4 and Section 5, respectively. Finally, concluding remarks are presented in Section 6.

2. Materials and Methods

2.1. Synthetic Data Transformation

To account for right-censored data in the estimation procedure, an adjustment must be applied to the censored dataset; otherwise, the methods for estimating $f(x_i)$ cannot be applied directly. One of the most important reasons for this is that the right-censored response variable $t_i$ and the actual response variable $y_i$ have different expected values. As indicated in Section 1, synthetic data transformation is used to avoid this issue. It can be calculated simply as follows:
$t_{iG} = \dfrac{\delta_i\, t_i}{1 - G_x(t_i)}$  (4)
where $G_x(\cdot)$ is the distribution of the censoring variable $c_i$, mentioned in the previous section. Note that because the distribution $G_x$ is generally unknown, its Kaplan–Meier estimate $\hat G_x$ is used instead (see Koul et al., 1981), which can be formulated as
$1 - \hat G_x(s) = \prod_{j=1}^{n} \left( \dfrac{n - j}{n - j + 1} \right)^{I[t_{(j)} \le s,\ \delta_{(j)} = 0]}, \quad s \ge 0$  (5)
where the $t_{(j)}$'s denote the ordered response observations, $t_{(1)} \le t_{(2)} \le \cdots \le t_{(n)}$, and the $\delta_{(j)}$'s are the censoring indicators associated with the $t_{(j)}$'s. Note that some values of $t_i$ may be identical, which prevents a correct calculation of the Kaplan–Meier estimator because the ordering $t_{(1)} \le t_{(2)} \le \cdots \le t_{(n)}$ is then not unique; the Kaplan–Meier estimator, however, provides a way of ordering the $t_{(j)}$'s uniquely. In addition, it is a widely known property of the estimated distribution $\hat G_x(s)$ that it has jumps only at the censored data points (see Peterson, 1977, and Kaplan and Meier, 1958).
After obtaining $\hat G_x$, the transformation in Equation (4) can be rewritten as
$t_i^{\hat G} = \dfrac{\delta_i\, t_i}{1 - \hat G_x(t_i)}$  (6)
Thus, the model in Equation (3) can be written using the synthetic response variable $t_i^{\hat G}$ as follows:
$t_i^{\hat G} = f(x_i) + \varepsilon_i^{\hat G}, \quad \text{where } \varepsilon_i^{\hat G} = t_i^{\hat G} - f(x_i), \quad i = 1, 2, \ldots, n$  (7)
It is important to mention that the error terms ($\varepsilon_i^{\hat G}$), which depend on the synthetic data, are random variables for given $\hat G_x$. Accordingly, as $n \to \infty$, $E(\varepsilon_i^{\hat G}) \to 0$. Consequently, at each design point $x_i$, the mean of the distribution of the $t_i^{\hat G}$'s can be expressed as $E(t_i^{\hat G} \mid x_i) = f(x_i)$. In addition, Lemma 1 states that the synthetic and true response variables $t_i^{\hat G}$ and $y_i$ have identical expectations. Thus, the estimation of the smooth function $f(x_i)$ becomes a problem of estimating an expected value from the right-censored responses.
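A minimal sketch of Equations (5) and (6), in Python/NumPy rather than the R code used in the paper; the function names are ours, and the implementation is only meant to make the transformation concrete.

```python
import numpy as np

def km_censoring_survival(t, delta, s):
    """Kaplan-Meier estimate of 1 - G_x(s), the survival function of the
    censoring variable (Eq. (5)): jumps occur only at censored points."""
    order = np.argsort(t, kind="stable")
    t_ord, d_ord = t[order], delta[order]
    n = len(t)
    j = np.arange(1, n + 1)
    # exponent I[t_(j) <= s, delta_(j) = 0] selects censored points up to s
    expo = ((t_ord <= s) & (d_ord == 0)).astype(float)
    return float(np.prod(((n - j) / (n - j + 1.0)) ** expo))

def synthetic_response(t, delta):
    """Synthetic data t_i^G = delta_i * t_i / (1 - G_hat(t_i))  (Eq. (6)):
    censored points become zero, uncensored points are inflated."""
    out = np.zeros_like(t, dtype=float)
    unc = delta == 1
    surv = np.array([km_censoring_survival(t, delta, ti) for ti in t[unc]])
    out[unc] = t[unc] / surv
    return out
```

For example, with `t = [1, 2, 3]` and `delta = [0, 1, 1]`, the censored point maps to 0 while the two uncensored points are inflated by the factor $1/(1-\hat G_x(t_i))$.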
Lemma 1.
In a censorship context, the incomplete observations with the associated censoring indicator, $\{(t_i, \delta_i)\}_{i=1}^{n}$, are used to model the actual values of $y_i$ through the regression function $f(x_i)$. In this manner, if the distribution $G$ of the censoring variable is known, then the conditional expectation of the synthetic response can be expressed as $E[t_{iG} \mid x_i] = E[y_i \mid x_i] = f(x_i)$.
The proof of Lemma 1 is given in Appendix A. In order to achieve the goal of this study, the synthetic responses are modeled through a modified A-spline approach, which is formed by merging B-splines and the adaptive ridge penalty. Details are given in the next section.

2.2. B-Spline Approximation

Because our A-spline regression is constructed from B-splines, the necessary background is described in this section. Let $k = \{k_i\}$ be a non-decreasing sequence of knots given by
$k_0 \le k_1 \le \cdots \le k_m \le k_{m+1}$
where $m$ denotes the number of knots, the $k_i$'s are the knot points, and $(k_0, k_{m+1})$ are the boundary points, which are not counted as knots. In this context, a B-spline of degree $q \ge 0$ is a piecewise polynomial function with continuous derivatives up to order $(q-1)$ at each of the given knot points. From the properties of B-splines, $(m + 2q + 1)$ knots are needed for $(m + q)$ polynomial pieces. In this case, a B-spline is nonzero only on the interval $[k_i, k_{i+q+1}]$ $(i > 0)$. The $i$th B-spline of degree $q$ is denoted by $B_{i,q}(x)$, and it is calculated recursively as
$B_{i,q}(x_i) = \dfrac{x_i - k_i}{k_{i+q} - k_i}\, B_{i,q-1}(x_i) + \dfrac{k_{i+q+1} - x_i}{k_{i+q+1} - k_{i+1}}\, B_{i+1,q-1}(x_i), \quad q > 0$  (8)
To evaluate the recursive formula in Equation (8), see the algorithm described in [22]. Note that if $q = 0$, then $B_{i,0}(x_i) = I(k_i \le x_i < k_{i+1})$. Some fundamental properties of B-splines are:
  • A B-spline of degree $q$ consists of $(q+1)$ polynomial pieces, each of degree $q$.
  • Each polynomial piece is differentiable up to order $(q-1)$.
  • For a given $x_i$, $(q+1)$ B-splines are nonzero.
  • Each B-spline is positive on the interval determined by $(q+2)$ knot points.
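The recursion in Equation (8) can be sketched directly. This Python/NumPy version (the paper's computations are in R) evaluates a single basis function and is only meant to illustrate the recursion, not to be an efficient implementation.

```python
import numpy as np

def bspline_basis(i, q, knots, x):
    """Evaluate the i-th B-spline of degree q at points x (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    if q == 0:
        # degree-0 spline: indicator of the half-open knot interval
        return ((knots[i] <= x) & (x < knots[i + 1])).astype(float)
    left = np.zeros_like(x)
    den = knots[i + q] - knots[i]
    if den > 0:
        left = (x - knots[i]) / den * bspline_basis(i, q - 1, knots, x)
    right = np.zeros_like(x)
    den = knots[i + q + 1] - knots[i + 1]
    if den > 0:
        right = (knots[i + q + 1] - x) / den * bspline_basis(i + 1, q - 1, knots, x)
    return left + right
```

On interior points, the nonzero bases of a given degree sum to one (partition of unity), consistent with the properties listed above.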
From the information given above, a fitted smooth function $\hat f(x_i)$ for the synthetic data pairs $\{x_i, t_i^{\hat G}\}_{i=1}^{n}$ can be written as a linear combination of B-splines with $k$ knots:
$\hat f(x_i) = \sum_{j=1}^{k} \hat\alpha_j\, B_{j,q}(x_i)$  (9)
Equation (9) is useful only as a mathematical approximation. Note also that B-splines are widely used for estimating univariate nonparametric regression models (see [22] for details). From this, a minimization problem with a smoothness penalty emerges, written as follows:
$PSS(\alpha; \lambda) = \sum_{i=1}^{n} \Big\{ t_i^{\hat G} - \sum_{j=1}^{k} \alpha_j B_{j,q}(x_i) \Big\}^2 + \lambda \int_{x_{\min}}^{x_{\max}} \Big\{ \sum_{j=1}^{k} \alpha_j B''_{j,q}(x) \Big\}^2 dx$  (10)
where $\lambda > 0$ is the smoothing parameter that controls the smoothness of the estimated curve. Controlling the amount of the penalty plays a crucial role in the accuracy of the estimated model; this is very similar to the smoothing parameter described by [23]. In B-spline regression, one important issue is the order of the derivative used in the penalty term, because higher orders may expose computational problems. Choosing the number and positions of the knots is a very substantial decision in the minimization problem in Equation (10), especially for right-censored datasets.
In this paper, setting the locations and number of the knots is a primary aim because it has a direct relationship with the accuracy of the estimated model, as mentioned above. To provide a suitable solution for this issue, the adaptive ridge penalty proposed by [19] is used instead of the penalty term in Equation (10). Note also that the smoothing parameter is chosen by an improved $AIC_c$ criterion, as proposed by [24]. In the next section, the adaptive ridge penalty is introduced.

2.3. Adaptive Ridge

The adaptive ridge method offers the best tradeoff between goodness of fit (the first term of Equation (10)) and the number of knots, which yields a more parsimonious and powerful regression model. To achieve this, it starts with a large number of equally spaced knots and then modifies the penalty term using this number of knots.
Let a B-spline be defined on the knot points $k_1, k_2, \ldots, k_m$, and assume that for the $r$th knot, $\Delta^{q+1}(\alpha_r) = 0$. Then the knots are updated to $k_1, k_2, \ldots, k_{r-1}, k_{r+1}, \ldots, k_m$. Thus, the penalty changes from counting the overall number of knots to counting the number of nonzero $(q+1)$-order differences, given by
$\dfrac{\lambda}{2} \sum_{i=q+2}^{m} \big\| \Delta^{q+1}(\alpha_i) \big\|_0$  (11)
where $\|\Delta^{q+1}(\alpha_i)\|_0$ denotes the $L_0$-norm of the difference term $\Delta^{q+1}(\alpha_i)$: if $\Delta^{q+1}(\alpha_i) = 0$ then $\|\Delta^{q+1}(\alpha_i)\|_0 = 0$, and $\|\Delta^{q+1}(\alpha_i)\|_0 = 1$ otherwise. Here, $\lambda > 0$ is a smoothing parameter. The effect of this penalty term is that it deletes the $r$th knot and works on the merged interval $[k_{r-1}, k_r) \cup (k_r, k_{r+1}]$. Thus, the modeling process is completed using the remaining knot points.
Note that Equation (11) is not differentiable, which prevents a closed-form fitted model. The adaptive ridge method provides an approximation of the $L_0$-norm in Equation (11) (see [21] for a more detailed discussion). Its main idea is to use weights to approximate the $L_0$-norm. In this context, the penalized minimization criterion in Equation (10) is rewritten with a new weighted penalty as follows:
$WPSS(\alpha; \lambda) = \sum_{i=1}^{n} \Big\{ t_i^{\hat G} - \sum_{j=1}^{k} \alpha_j B_{j,q}(x_i) \Big\}^2 + \dfrac{\lambda}{2} \sum_{j=q+2}^{q+m+1} w_j \big[ \Delta^{q+1}(\alpha_j) \big]^2$  (12)
and the vector and matrix form of Equation (12) is
$WPSS(\alpha; \lambda) = \big\| t^{\hat G} - B\alpha \big\|_2^2 + \lambda\, \alpha^T D^T W D\, \alpha$  (13)
where $t^{\hat G}$ is the vector of synthetic response values, $W = \mathrm{diag}(w_j)$ with the $w_j$'s representing positive weights, and $D$ is the matrix form of the difference operator $\Delta^{q+1}$, defined recursively from the first difference as
$\Delta(\alpha_j) = \alpha_j - \alpha_{j-1} \quad \text{and} \quad \Delta^{q}(\alpha_j) = \Delta^{q-1}\big(\Delta(\alpha_j)\big)$  (14)
and $\alpha = \{\alpha_j\}_{j=1}^{q+m+1}$ is the vector of coefficients for the B-spline design matrix $B = (B_{j,q}(x_i))$, $1 \le i \le n$, $1 \le j \le q+m+1$, built from Equation (8). To be more explicit, the $i$th row of the matrix $B$ is
$B_i = \big[ B_{1,q}(x_i),\ B_{2,q}(x_i),\ \ldots,\ B_{(q+m+1),q}(x_i) \big]$
It should be noted that the $w_j$'s provide the approximation of the penalty term in Equation (12) to the $L_0$-norm; in the adaptive ridge procedure, the weights are also of crucial importance for choosing good knot locations. To approximate the $L_0$-norm, the weights are determined by an iterative process from the previous values of the coefficients $\alpha_j$, using the formula given in [19]:
$w_j = \big[ \big(\Delta^{q+1}(\alpha_j)\big)^2 + \gamma^2 \big]^{-1}, \quad \gamma > 0$  (15)
where $\gamma > 0$ is a constant, and the quality of the approximation $\|\Delta^{q+1}(\alpha_i)\|_0 \approx w_j \big(\Delta^{q+1}(\alpha_i)\big)^2$ depends on $\gamma$.
Remark 1.
As mentioned above, because the weights are determined by an iterative procedure using Equation (15), it is important to choose $\gamma > 0$ appropriately. If $|\Delta^{q+1}(\alpha_j)| \ll \gamma$, the obtained $w_j$'s may be extremely large while $\Delta^{q+1}(\alpha_j) \to 0$, and therefore the resulting penalty term satisfies $w_j (\Delta^{q+1}(\alpha_j))^2 \to 0$. However, if $|\Delta^{q+1}(\alpha_j)| \gg \gamma$, then the approximation $\|\Delta^{q+1}(\alpha_j)\|_0 \approx w_j (\Delta^{q+1}(\alpha_j))^2$ is realized. In this matter, [20] obtained the value $\gamma = 10^{-5}$ after some numerical computations, which can be accepted as a suitable value of $\gamma$.
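The weight update in Equation (15) is a one-liner once the $(q+1)$-order differences are available; a hedged Python/NumPy sketch (the function name is ours):

```python
import numpy as np

def adaptive_weights(alpha, q, gamma=1e-5):
    """w_j = [(Delta^{q+1} alpha_j)^2 + gamma^2]^{-1}  (Equation (15)).
    np.diff with n=q+1 computes the (q+1)-order differences."""
    d = np.diff(alpha, n=q + 1)
    return 1.0 / (d ** 2 + gamma ** 2)
```

Where the differences vanish (coefficients locally polynomial of degree $q$), the weights blow up to $1/\gamma^2$, pushing the corresponding penalized differences toward zero; this is exactly the knot-deletion mechanism described above.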
In the next section, the modified A-spline estimator is introduced based on the given adaptive-ridge penalty and synthetic response values.

2.4. Modified A-Spline Estimator

In this section, the estimation of the coefficient vector $\alpha$ is given and, for a more precise and detailed explanation, an algorithm is presented. A modified A-spline estimator for the right-censored nonparametric regression model in Equation (7) is obtained by minimizing Equation (13); the algebraic details are given in Appendix A.1. In this case, the vector of estimated coefficients of the A-spline regression ($\hat\alpha$) is computed by the formula
$\hat\alpha = \big( B^T B + \lambda D^T W D \big)^{-1} B^T t^{\hat G}$  (16)
where $\lambda D^T W D$ denotes the adaptive ridge penalty, which involves both the difference matrix $D$ and the weight matrix $W$. From here, fitted values for the model in Equation (7) can be obtained as
$E[t_i \mid x_i] \approx E[t_i^{\hat G} \mid x_i] = \hat f(x_i) = B\hat\alpha = H_A\, t^{\hat G}$  (17)
where $H_A = B(B^T B + \lambda D^T W D)^{-1} B^T$ is the hat matrix. It should be emphasized that, because of computational difficulties, the whole penalty term ($D^T W D$) is obtained by an iterative algorithm instead of computing the matrix $D$ explicitly, which is more efficient. The algorithm for our modified A-spline estimator is shown in Algorithm 1.
Algorithm 1. Algorithm for the modified A-spline estimator $\hat\alpha$.
Input: covariate $x_i$, synthetic responses $t_i^{\hat G}$, constant $\gamma = 10^{-5}$
Output: $\hat\alpha = (\hat\alpha_1, \hat\alpha_2, \ldots, \hat\alpha_{q+m+1})^T$
1: Begin
2: Set initial values $\alpha^{(0)} = 0_{q+m+1}$ and $W^{(0)} = I$ to start the iterative process
3: do until the weighted differences converge to the $L_0$-norm
4:   $\hat\alpha^{(s)} = (B^T B + \lambda D^T W^{(s-1)} D)^{-1} B^T t^{\hat G}$
5:   $w_j^{(s)} = \big[ (\Delta^{q+1}(\hat\alpha_j^{(s)}))^2 + \gamma^2 \big]^{-1}$
6:   $\hat\alpha = \hat\alpha^{(s)}$, $W^{(s)} = \mathrm{diag}(w_j^{(s)})$
7: end
8: Obtain the selected knots $k_s$ from the nonzero weighted differences $(\Delta^{q+1}(\hat\alpha))^2 W^{(s)}$
9: Return $\hat\alpha = (\hat\alpha_1, \hat\alpha_2, \ldots, \hat\alpha_{q+m+1})^T$
10: End
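Algorithm 1 can be condensed into a few lines. The sketch below (Python/NumPy; the original implementation is in R) builds the difference matrix $D$ explicitly for clarity, fixes the iteration count instead of testing convergence, and treats $\lambda$ as given rather than selected by the $AIC_c$ criterion.

```python
import numpy as np

def aspline_fit(B, t_synth, lam, q, gamma=1e-5, n_iter=40):
    """Iterative adaptive-ridge estimate of the B-spline coefficients
    (a sketch of Algorithm 1): alternate the ridge solve of Eq. (16)
    with the weight update of Eq. (15)."""
    p = B.shape[1]
    D = np.diff(np.eye(p), n=q + 1, axis=0)   # (q+1)-order difference matrix
    w = np.ones(D.shape[0])                   # W^(0) = I
    for _ in range(n_iter):
        penalty = D.T @ (w[:, None] * D)      # D^T W D
        alpha = np.linalg.solve(B.T @ B + lam * penalty, B.T @ t_synth)
        w = 1.0 / ((D @ alpha) ** 2 + gamma ** 2)
    return alpha
```

With `B` the design matrix of Equation (8) evaluated at the $x_i$ and `t_synth` the synthetic responses of Equation (6), `B @ aspline_fit(...)` gives the fitted curve of Equation (17).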

3. Statistical Properties of the Estimator

The A-spline estimator is a ridge-type estimator and is used in this paper for the estimation of the right-censored nonparametric regression model. Accordingly, the following expressions can be written for the random error terms of the model in Equation (3):
$E[\varepsilon_i \mid x_i] = 0, \qquad Var[\varepsilon_i \mid x_i] = \sigma_\varepsilon^2 I$  (18)
However, in this paper, because of censoring, the model in Equation (7), which involves synthetic responses, is used instead of that in Equation (3). In this case, the distributional properties in Equation (18) change, depending on Lemma 1, and are rewritten as follows:
$E[\varepsilon_i^{\hat G} \mid x_i] \approx 0, \qquad Var[\varepsilon_i^{\hat G} \mid x_i] = \sigma_{\varepsilon G}^2 I$  (19)
where $\varepsilon_i^{\hat G} = (t_i - \hat t_i^{\hat G})$, $\sigma_{\varepsilon G}^2$ is the variance of the right-censored nonparametric model based on the synthetic response variable, $I$ is the $n \times n$ identity matrix, and $\hat t_i^{\hat G}$ denotes the fitted values. It should be noted that the obtained estimator is the coefficient vector $\hat\alpha$; therefore, the quality of the model is measured partially through the bias and variance of $\hat\alpha$. In this context, following ordinary ridge regression, the variance–covariance matrix of $\hat\alpha$ can be approximated using $\sigma_{\varepsilon G}^2$ as follows:
$Cov(\hat\alpha) = \sigma_{\varepsilon G}^2\, \dfrac{1}{n}\, \big(B^T B + \lambda D^T W D\big)^{-1} \big(B^T B\big) \big(B^T B + \lambda D^T W D\big)^{-1}$  (20)
If $M_A = (B^T B + \lambda D^T W D)^{-1}$, then the covariance matrix of the fitted values of the model can be given by
$Cov(\hat f) = \sigma_{\varepsilon G}^2\, \dfrac{1}{n}\, \big(B M_A B^T\big) \big(B M_A B^T\big)^T$  (21)
Because $\sigma_{\varepsilon G}^2$ is generally unknown, it needs to be estimated as follows:
$\hat\sigma_{\varepsilon G}^2 = \dfrac{\| t - \hat f \|^2}{n - tr(H_A)}$  (22)
where $tr(\cdot)$ denotes the trace, i.e., the sum of the diagonal elements of a matrix. Additionally, the bias of $\hat\alpha$ is one of the quality measures for the estimated model. In order to calculate the bias, the conditional expectation of the estimator, $E[\hat\alpha \mid x_i]$, has to be obtained:
$E[\hat\alpha \mid x_i] = \big(B^T B + \lambda D^T W D\big)^{-1} B^T B\, \alpha$  (23)
Following on from Equation (23), the bias can be written as
$Bias(\hat\alpha) = E[\hat\alpha \mid x_i] - \alpha = \big[ (B^T B + \lambda D^T W D)^{-1} - (B^T B)^{-1} \big] B^T B\, \alpha = \big[ (B^T B + \lambda D^T W D)^{-1} B^T B - I \big] \alpha$  (24)
In this study, Equations (20)–(22) and (24) are used as quality measures to evaluate the performance of the estimated right-censored model. In addition, the mean squared error ($MSE$), commonly employed in the literature, is also used to measure the quality of the fitted model. It is obtained as follows:
$MSE(\hat f) = \dfrac{1}{n} \sum_{i=1}^{n} \big(t_i - \hat f_i\big)^2 = \| t - \hat f \|^2 / n$  (25)
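The error variance and MSE of Equations (22) and (25) are straightforward to compute given the fitted values and the hat matrix; a small Python/NumPy sketch (the function name is ours):

```python
import numpy as np

def quality_measures(t, f_hat, H_A):
    """sigma2_hat = ||t - f_hat||^2 / (n - tr(H_A))   (Equation (22))
       mse        = ||t - f_hat||^2 / n               (Equation (25))"""
    rss = float(np.sum((t - f_hat) ** 2))
    n = len(t)
    sigma2_hat = rss / (n - np.trace(H_A))
    mse = rss / n
    return sigma2_hat, mse
```

Note that $tr(H_A)$ plays the role of the effective degrees of freedom of the smoother, so $\hat\sigma_{\varepsilon G}^2$ is the residual sum of squares divided by the residual degrees of freedom.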

3.1. Extended Properties of the Estimator

The modified A-spline estimator introduced in this paper is a smoothing technique that allows for the optimal selection of basis functions, penalties, the number of knots, and the locations of the knots. It achieves this by using the adaptive (weighted) ridge penalty to approximate the $L_0$-norm.
In this section, some large sample properties of the modified A-spline estimator are given under right-censoring. It is worth noting that the theoretical properties of the A-spline estimator have not been deeply inspected in the literature. There have been some important studies about adaptive ridge estimators, such as [20,21,25]. This section provides some initial inferences about the A-spline estimator in a nonparametric context and under censorship conditions.
Before describing the asymptotic properties of the estimator, it should be emphasized that the flexibility of the A-spline estimator in the choice of penalty and knot points causes difficulties for theoretical inference. As is already known, the A-spline estimator is a specialized version of the P-splines proposed by [26]. Its major difference is that the A-spline changes the penalty term using iteratively obtained weights that approximate the $L_0$-norm. Because of this, some assumptions and inferences are derived from the known properties of P-splines.
The main function of the A-spline estimator is given in Equation (12), which can be rewritten as follows:
$WPSS(\alpha; \lambda) = \sum_{i=1}^{n} \Big\{ t_i^{\hat G} - \sum_{j=1}^{k} \alpha_j B_{j,q}(x_i) \Big\}^2 + \dfrac{\lambda}{2} \sum_{i=q+2}^{m} \big\| \Delta^{q+1}(\alpha_i) \big\|_\tau$  (26)
where $\|\cdot\|_\tau$ denotes the $\tau$-norm. To obtain substantial results, we assume in this study that $\tau \to 0$, because solving the exact $L_0$-norm problem requires complex calculations. Accordingly, minimizing Equation (26) has good potential both for estimating the $\alpha_j$'s and for determining the optimal knot points, i.e., model selection, for sufficiently large $\lambda > 0$. As is known from the literature, model selection with $\tau \to 0$ is realized by penalizing the nonzero parameters, which is a limiting case of the bridge estimator introduced by [27], given as
$\lim_{\tau \to 0} \sum_{i=q+2}^{m} \big\| \Delta^{q+1}(\alpha_i) \big\|_\tau = \sum_{i=q+2}^{m} I\big[ \Delta^{q+1}(\alpha_i) \neq 0 \big]$  (27)
For $\tau \ge 1$, the objective function in Equation (26) is convex, and its global minimum can be obtained easily by numerical algorithms. However, when $\tau \to 0$ or $\tau = 0$, the criterion in Equation (26) is no longer convex and its computation is non-trivial. In the $L_0$-norm context, there is no guarantee of reaching a global minimum; moreover, more than one local minimum may exist. Thus, the estimator has no unique closed-form solution and depends on the iterative process. In [21], it is shown that between 5 and 40 iterations provide reasonable convergence of the estimator to the true parameters.
Although the objective function in Equation (26) is non-convex, the asymptotic consistency of the estimator $\hat\alpha$ can still be investigated. In this case, the following condition is assumed:
$R_n = \dfrac{1}{k} \sum_{j=1}^{k} B_j B_j^T \to R$  (28)
where $R$ is a non-negative definite matrix; it is also assumed that
$\dfrac{1}{q+m+1} \max_{1 \le j \le k} B_j^T B_j \to 0$  (29)
In general, the explanatory variables contained in $B$ are scaled; accordingly, all diagonal elements of $R$ are equal to 1. Note that it must be assumed that $B_j^T B_j$ and $(B_j^T B_j)^{-1}$ are nonsingular, and consequently that $R$ is a full-rank matrix, in order to obtain identifiability. Using the conditions in Equations (28) and (29), the limiting behavior of the estimator $\hat\alpha$ can be observed by inspecting the asymptotic behavior of the minimization problem in Equation (12). To see the consistency of $\hat\alpha_n$, the following function is defined:
$U_n(\hat\alpha_n) = \sum_{i=1}^{n} \Big\{ t_i^{\hat G} - \sum_{j=1}^{k} \hat\alpha_{nj} B_{j,q}(x_i) \Big\}^2 + \dfrac{\lambda_n}{2} \sum_{i=q+2}^{m} \big\| \Delta^{q+1}(\hat\alpha_i) \big\|_\tau$  (30)
where $\hat\alpha_n$ is a consistent estimator provided that $\lambda_n = o(n)$. This result is confirmed by the following theorem:
Theorem 1.
If $R$ is a full-rank matrix and $\lambda_n / \sqrt{n} \to \lambda_0 \ge 0$, then $\hat\alpha_n \to_p \operatorname{argmin}(U)$, where
$U(\hat\alpha_n) = (\hat\alpha_n - \alpha)^T R\, (\hat\alpha_n - \alpha) + \lambda \sum_{i=q+2}^{m} \big\| \Delta^{q+1}(\alpha_i) \big\|_\tau$  (31)
Thus, for $\lambda_n = o(n)$, $\hat\alpha_n$ is a consistent estimator of $\alpha$. It could therefore be said that
$\hat\alpha_n \to_p \alpha \quad \text{as } n \to \infty$  (32)
Proof of Theorem 1 is given in the Appendix A.
Because $U_n$ is not convex when the norm degree $\tau \to 0$, some additional conditions are needed to ensure the accuracy of Equation (32). Accordingly, $\lambda_n = O(\sqrt{n})$ is essential for $\tau \to 0$. From that, if $\lambda_n / \sqrt{n} \to \lambda_0$ and $\tau \to 0$, then it can be written that

$\sqrt{n}\, (\hat\alpha_n - \alpha) \to_d R^{-1} J \sim N\big(0,\ \sigma^2 R^{-1}\big)$  (33)

where $J$ has a $N(0, \sigma_\varepsilon^2)$ distribution and its elements consist of the random error terms $\varepsilon_i$.

4. Simulation Study

In this section, a simulation study is carried out to examine the behavior of the modified A-spline estimator when estimating the right-censored nonparametric model. Before presenting the results of the simulation experiments, datasets for the different simulation combinations are generated using the "simcensdata" function in R, which can be accessed via this link: https://github.com/yilmazersin13/simcensdata-generating-randomly-right-censored-data. Our data generation procedure, with accompanying descriptions, is given in Table 1.
For this simulation study, within the scope of Step 1, $n_{obs} = (35, 100, 350)$, $n_{sim} = 1000$, and the censoring levels are $L = (5\%, 20\%, 40\%)$. The nonparametric covariate and random errors in Step 2 are generated as $x_i = \theta(i - \frac{1}{2})/n$ and $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$, where $\theta$ is a constant that determines the shape of the curve. Note that, in this study, two different types of function, $f_1$ and $f_2$, are used to test the introduced method under various conditions; they are illustrated in the panels of Figure 1.
Panels (a) and (b) of Figure 1 represent two different datasets formed from the nonlinear functions $f_1$ and $f_2$; the plots are drawn for $n = 100$ and $L = 20\%$. It should be noted that the optimal selection of the number and positions of the knots is extremely important for the functions represented in these panels. Under the synthetic data transformation, censored data points take zero values and uncensored points take values larger than their original magnitudes. In this case, deciding the properties of the knots is crucial.
From the data generation procedure given above, the right-censored nonparametric model can be written as follows:
$t_i = f_h(x_i) + \varepsilon_i, \quad i = 1, \ldots, n_{obs}, \quad h = 1, 2$  (34)
Then, to use the censoring information in the estimation process, a synthetic data transformation is applied, as in Equation (6). Therefore, the final model to be estimated in the simulation experiments is
$t_i^{\hat G} = f_h(x_i) + \varepsilon_i^{\hat G}, \quad i = 1, \ldots, n_{obs}, \quad h = 1, 2$  (35)
In this simulation study, for three sample sizes, three censoring levels, and two functions, 18 configurations are obtained. All the outcomes for the model in Equation (34) under these conditions are given in the following figures and tables.
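For illustration, one simulated right-censored sample along the lines of Steps 1 and 2 can be generated as follows. This Python/NumPy sketch stands in for the paper's R generator "simcensdata"; the sine test function, $\theta$, $\sigma_\varepsilon$, and the censoring distribution are our own assumptions, not the paper's $f_1$ or $f_2$.

```python
import numpy as np

rng = np.random.default_rng(2020)
n, theta, sigma = 100, 1.0, 0.2
x = theta * (np.arange(1, n + 1) - 0.5) / n        # x_i = theta * (i - 1/2) / n
f = np.sin(2 * np.pi * x)                           # assumed smooth test function
y = f + rng.normal(0.0, sigma, size=n)              # uncensored responses
c = rng.normal(np.quantile(y, 0.8), 0.5, size=n)    # censoring times, tuned to give
                                                    # a moderate censoring level
t = np.minimum(y, c)                                # observed (right-censored) responses
delta = (y <= c).astype(int)                        # censoring indicators
print("censoring level:", 1 - delta.mean())
```

In practice, the location and scale of the censoring distribution would be adjusted to hit a target level $L$ such as 5%, 20%, or 40%.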
Table 2 presents the scores of all the evaluation metrics for each simulation configuration. The results are inspected from three essential aspects of the estimation performance of the A-spline estimator: the effects of the sample size, the censoring level, and the shape of the data. For the first aspect, the table shows that $\hat\sigma_{\varepsilon G}$, $MSE$, and $Var(\hat\alpha)$ decrease as the sample size increases. This can be interpreted as practical evidence of the asymptotic convergence that is one of the main purposes of this simulation study, and the interpretation is consistent across all censoring levels. The censoring level naturally degrades the performance of the estimator, in contrast to the sample size; however, the sensitive point, which makes this paper significant, is how the estimator reacts to variation in the censoring level. Careful inspection of the scores shows no huge differences between low and high censoring levels, which can also be seen in the figures given below. This demonstrates that the A-spline estimator mitigates the effect of the censoring level on selecting the optimal knot points, as expected. Finally, two different function types are used in this paper: $f_1$ has a shape similar to a sine function and is not hard to capture for any smoothing technique, while $f_2$ is an almost linear function with one big peak, which is a challenge for the estimator, especially under censoring. The outcomes in Table 2 demonstrate this: although the errors for $\hat f_1$ are smaller than those for $\hat f_2$, the A-spline estimator shows satisfactory performance for both datasets.
Table 3 presents comparative results for the introduced modified A-spline estimator and the commonly used smoothing spline (SS) and regression spline (RS) methods; the best scores are indicated in bold. The results show that the modified A-spline estimator performs best overall. In addition, as mentioned in the introduction, RS attains smaller MSEs than SS. The more satisfying results of the introduced method can be attributed to its adaptive nature. A closer look at Table 3 reveals that, for $\hat{f}_2(x)$, the RS method produces attractive results when the censoring level is 40%, which is understandable given the shape of the function.
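The adaptive nature referred to here can be made concrete with a short sketch of the adaptive-ridge iteration behind knot selection, in the spirit of Goepp et al. [19]: every candidate knot is penalized, the ridge weights are repeatedly reset to $w_j = 1/(\alpha_j^2 + \epsilon^2)$ so that the quadratic penalty approximates an $L_0$ knot count, and knots whose coefficients collapse to zero are discarded. This is a simplified illustration, not the authors' implementation: it uses a truncated-power basis instead of B-splines, and all names and default values are ours.

```python
import numpy as np

def aspline_fit(x, y, n_knots=20, degree=3, lam=1.0, eps=1e-5, n_iter=30):
    """Adaptive-ridge sketch of A-spline knot selection.

    Truncated-power basis: a polynomial part plus one (x - kappa)_+^degree
    term per candidate knot; driving a knot coefficient to ~0 amounts to
    deleting that knot from the model.
    """
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    P = np.vander(x, degree + 1, increasing=True)        # 1, x, ..., x^degree
    T = np.clip(x[:, None] - knots[None, :], 0.0, None) ** degree
    B = np.hstack([P, T])
    pen = np.r_[np.zeros(degree + 1), np.ones(n_knots)]  # penalize knot terms only
    w = np.ones(B.shape[1])
    for _ in range(n_iter):
        A = B.T @ B + lam * np.diag(pen * w)
        alpha = np.linalg.solve(A, B.T @ y)
        w = 1.0 / (alpha ** 2 + eps ** 2)                # adaptive-ridge reweighting
    kept = knots[np.abs(alpha[degree + 1:]) > 1e-3]      # surviving knots
    return alpha, kept, B @ alpha
```

Because each surviving knot eventually costs roughly a constant amount of penalty regardless of its coefficient size, the iteration behaves like an $L_0$ selection rule rather than ordinary ridge shrinkage.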
Figure 2 shows how the modified A-spline estimator behaves when the sample size is exceedingly small, under the lowest and highest censoring levels. As discussed above, $f_1$ is easier to estimate than $f_2$, and Figure 2 makes this visible. In addition, the method remains successful even for an extremely small sample size $(n = 35)$. This is an important contribution of the method for right-censored data, because in medical datasets, and especially in clinical observations, many observations are frequently unobtainable.
Figure 3 presents the effect of the sample size while keeping the censoring level constant at 20%. Model I was obtained using $f_1$; similar fitted curves are obtained for $n = 100$ and $n = 350$, and both appear to represent the data well. The same inference holds for Model II. Both plots show that the fitted curves successfully model the right-censored data.
Figure 4 demonstrates how the method works under heavy censoring: fitted curves are shown for a moderate sample size at the lowest and highest censoring levels, 5% and 40%. As expected, the A-spline estimator handles the zero values produced by the synthetic data transformation, and the difference between the two panels is clearly visible. This observation is also supported by the results in Table 2.
Figure 5 depicts bar plots of the evaluation measures for both the estimated A-spline coefficients and the estimated model. In each panel, A1.5%, A1.20%, and A1.40% denote the scores for $L = 5\%$, $20\%$, and $40\%$, respectively, with $n = 35$; similarly, A2.5%, A2.20%, and A2.40% correspond to $n = 100$, and A3.5%, A3.20%, and A3.40% to $n = 350$. The top panels show the bias values: as in Table 2, the biases for the two models are very similar and, as expected, shrink in larger samples. The middle panels show the variances of the coefficients; the plots look alike for the two models, but because Model II is harder to estimate than Model I, its y-axis spans a noticeably wider range. The bottom panel shows the $MSE$ values of the estimated model and resembles the variance plots. Together, these plots demonstrate that the A-spline estimator overcomes the effect of censorship across the various evaluation metrics.
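As a rough sketch of how such summaries can be aggregated over the $n_{sim}$ repetitions: the exact formulas behind the reported $Bias(\hat{\alpha})$ and $Var(\hat{\alpha})$ are not restated in this section, so the averaging scheme below is our assumption, not the authors' code.

```python
import numpy as np

def simulation_summary(alpha_reps, alpha_true, fhat_reps, f_true):
    """Aggregate coefficient bias/variance and curve MSE over repetitions.

    alpha_reps: (n_sim, p) estimated coefficients, one row per repetition
    fhat_reps:  (n_sim, n) fitted curves, one row per repetition
    """
    bias = np.mean(alpha_reps.mean(axis=0) - alpha_true)  # average coefficient bias
    var = np.mean(alpha_reps.var(axis=0))                 # average coefficient variance
    mse = np.mean((fhat_reps - f_true) ** 2)              # MSE of the fitted curves
    return bias, var, mse
```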

5. Real Data Application

This section demonstrates the performance of the modified A-spline estimator on real right-censored data from colon cancer patients in İzmir. The dataset contains the survival times, the censoring indicator $(\delta_i)$, and the albumin values (albumin being the most common protein found in the blood) of the patients. To obtain a continuous response, the logarithms of the survival times are used as the response variable ($survival\ time$), and albumin is taken as the nonparametric covariate ($albumin$). The right-censored regression model is thus given by
$t_i = \log(survival\ time_i) = f(albumin_i) + \varepsilon_i, \quad 1 \le i \le 97$
Note, also, that because the $t_i$'s cannot be used directly in the estimation procedure, they are replaced by the synthetic responses shown in Equation (6). The model in Equation (36) is thus rewritten as
$t_i^{\hat{G}} = \log(survival\ time_i^{\hat{G}}) = f(albumin_i) + \varepsilon_i^{\hat{G}}$
The dataset contains records for 97 patients. However, the records of 32 of these patients are incomplete, containing right-censored observations; the remaining 65 patients are uncensored (deceased). Consequently, the censoring level of this dataset is $L = 32.98\%$. The outcomes calculated for the model in Equation (37) are given in the following table and figure.
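For completeness, a minimal pure-NumPy sketch of how such synthetic responses can be computed from observed pairs $(t_i, \delta_i)$. The step-function Kaplan–Meier estimator below treats observations one at a time (ties are not grouped) and evaluates $\bar{G}$ at $t_i$ rather than at $t_i^-$; the function names are ours, not from the paper.

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier survival curve of `times`, with `events` == 1 marking events.

    Sketch: observations are processed in sorted order without grouping ties,
    giving S(t) = prod over t_(j) <= t of (1 - d_j / n_j).
    """
    order = np.argsort(times, kind="stable")
    t_sorted, d_sorted = times[order], events[order]
    at_risk = len(t_sorted) - np.arange(len(t_sorted))   # n_j: still at risk
    surv = np.cumprod(1.0 - d_sorted / at_risk)

    def S(q):
        idx = np.searchsorted(t_sorted, np.asarray(q, dtype=float), side="right") - 1
        return np.where(idx < 0, 1.0, surv[np.clip(idx, 0, len(surv) - 1)])

    return S

def synthetic_responses(t, delta):
    """Koul-Susarla-Van Ryzin transform: t_i^G = delta_i * t_i / Gbar(t_i),
    with Gbar the KM survival function of the *censoring* distribution
    (its events are the censored cases, 1 - delta)."""
    Gbar = km_survival(t, 1 - delta)
    return delta * t / np.maximum(Gbar(t), 1e-10)        # floor guards division by ~0
```

Censored observations map to zero, while uncensored ones are inflated by $1/\bar{G}(t_i)$, which is exactly the irregular zero-augmented response pattern discussed throughout the paper.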
Table 4 summarizes the performance of the modified A-spline estimator. Note that the values of $\hat{\sigma}_{\varepsilon}^{\hat{G}}$, $MSE$, and $Var(\hat{\alpha})$ improve on the results of Aydın and Yılmaz (2018), who previously modeled right-censored data with regression splines. For a fairer comparison, results for the RS and SS methods are also included in the table; they are quite similar to the simulation results, and the A-spline again achieves the best scores, confirming the benefit of the introduced method. Additionally, Figure 6 shows that the dataset has an irregular shape, and this irregularity increases after the synthetic data transformation, as demonstrated by the blue dots in the figure. Despite this challenging setting, the A-spline fit appears to represent the data well.

6. Concluding Remarks

This paper demonstrates that a modified A-spline estimator can successfully estimate the right-censored nonparametric regression model, because it determines the penalty term through an adaptive procedure and works with only the optimal knot points. A simulation study and a real data example were carried out to assess the method, and the findings show that the modified A-spline estimator has merit for estimating right-censored data.
In the numerical examples, increasing the sample size improves the performance of the method, yielding results closer to the real observations; this can be seen in Figure 3, Figure 4 and Figure 5. Changes to the censoring level also influence the goodness of fit, and, as expected, performance deteriorates as the censoring level increases. However, the modified A-spline makes an important difference here: the main purpose of using this method is to diminish the effect of censorship on the modeling process, and most of our results show that it achieves this purpose (see, for example, Table 2). In the simulation study, two function types are used to generate the models. $f_1$ is a classic sinusoidal pattern that is not difficult for any smoothing method to estimate. $f_2$ is somewhat harder to handle, especially for smoothing techniques that use all data points as knots. For almost all simulation configurations, the modified A-spline estimator gives very close values across all evaluation metrics.
The real-world application uses a dataset of colon cancer patients, whose survival times are estimated from the albumin values in their blood; Figure 6 and Table 4 show the outcomes of this study. As mentioned above, the method performs well despite the unevenly scattered data points. The confidence interval given by the shaded region in Figure 6 appears wide because the synthetic data transformation places censored points (as zeros) far from the uncensored points. Considering these properties, the modified A-spline estimator can be regarded as a robust estimator for right-censored datasets, and we recommend it for modeling clinical datasets.

Author Contributions

Conceptualization, D.A. and S.E.A.; methodology, D.A. and S.E.A.; software, E.Y.; validation, D.A., S.E.A. and E.Y.; formal analysis, E.Y.; investigation, D.A. and E.Y.; resources, D.A.; data curation, E.Y.; writing—original draft preparation, D.A. and E.Y.; writing—review and editing, S.E.A.; visualization, D.A. and E.Y.; supervision, S.E.A.; project administration, S.E.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

We thank the editors and reviewers for their objective assessments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Proof of Lemma 1

Lemma 1 can be proven using the standard independence assumption between $c_i$ and $y_i$ given in Section 1. The proof proceeds as follows:
$$
E[t_i^G \mid x_i]
= E\left[\frac{\delta_i t_i}{1 - G(t_i)} \,\middle|\, x_i\right]
= E\left[\frac{\delta_i t_i}{\bar{G}(t_i)} \,\middle|\, x_i\right]
= E\left[\frac{I(y_i \le c_i)\,\min(y_i, c_i)}{\bar{G}(\min(y_i, c_i))} \,\middle|\, x_i\right]
= E\left[\frac{I(y_i \le c_i)\, y_i}{\bar{G}(y_i)} \,\middle|\, x_i\right]
= E\left[E\left[\frac{y_i}{\bar{G}(y_i)}\, I(y_i \le c_i) \,\middle|\, x_i, y_i\right] \middle|\, x_i\right]
= E\left[\frac{y_i\, \bar{G}(y_i)}{\bar{G}(y_i)} \,\middle|\, x_i\right]
= E(y_i \mid x_i) = f(x_i),
$$
where the fourth equality uses $t_i = y_i$ on $\{y_i \le c_i\}$ and the sixth uses $E[I(y_i \le c_i) \mid x_i, y_i] = \bar{G}(y_i)$, which follows from the independence of $c_i$ and $y_i$.
This completes the proof of Lemma 1. Note that because the censoring distribution $G$ is unknown, it is replaced by its Kaplan–Meier estimator $\hat{G}$ given in Equation (5).
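The expectation identity above can also be checked numerically. The sketch below uses a known censoring law (both distributions are our choices for illustration), so $\bar{G}$ is available in closed form and the algebra of Lemma 1 can be verified without the Kaplan–Meier step:

```python
import numpy as np

# y ~ Exp(1) (so E[y] = 1) and c ~ Exp(rate 0.5), independent; the censoring
# survival function is then known exactly: Gbar(t) = P(c > t) = exp(-0.5 * t).
rng = np.random.default_rng(42)
n = 200_000
y = rng.exponential(1.0, n)               # true lifetimes
c = rng.exponential(2.0, n)               # censoring times (scale 2 <=> rate 0.5)
t = np.minimum(y, c)
delta = (y <= c).astype(float)
synthetic = delta * t / np.exp(-0.5 * t)  # delta_i * t_i / Gbar(t_i)
print(round(synthetic.mean(), 2))         # close to E[y] = 1, as Lemma 1 asserts
```

Roughly a third of the observations are censored here, yet the mean of the synthetic responses recovers $E[y]$, which is precisely the unbiasedness the lemma establishes conditionally on $x_i$.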

Appendix A.2. Proof of Theorem 1

To validate Theorem 1, the following uniform convergence must be shown:
$$\sup_{\hat{\alpha}_n \in Q} \left| U_n(\hat{\alpha}_n) - U(\hat{\alpha}_n) - \sigma_{\varepsilon}^2 \right| \xrightarrow{p} 0$$
where $\sigma_{\varepsilon}^2$ is the variance of the model defined in Section 3.1 and $Q$ is a compact set in a metric space. Then, using Equations (28)–(32), it follows that
$$\hat{\alpha}_n \xrightarrow{p} \alpha, \quad \text{as } n \to \infty$$
(See [28], for more details).

References

  1. Stute, W. Consistent Estimation Under Random Censorship When Covariables Are Present. J. Multivar. Anal. 1993, 45, 89–103.
  2. Kaplan, E.L.; Meier, P. Nonparametric Estimation from Incomplete Observations. J. Am. Stat. Assoc. 1958, 53, 457–481.
  3. Cox, D.R. Regression Models and Life-Tables. J. R. Stat. Soc. Ser. B 1972, 34, 187–202.
  4. Miller, R.G. Least squares regression with censored data. Biometrika 1976, 63, 449–464.
  5. Buckley, J.; James, I. Linear regression with censored data. Biometrika 1979, 66, 429–436.
  6. Miller, R.; Halpern, J. Regression with censored data. Biometrika 1982, 69, 521–531.
  7. Jin, Z.; Lin, D.Y.; Ying, Z. On least-squares regression with censored data. Biometrika 2006, 93, 147–161.
  8. Ritov, Y. Estimation in a linear regression model with censored data. Ann. Stat. 1990, 18, 303–328.
  9. Lai, T.L.; Ying, Z. Estimating a distribution function with truncated and censored data. Ann. Stat. 1991, 19, 417–442.
  10. Köhler, M.; Máthé, K.; Pinter, M. Prediction from randomly right censored data. J. Multivar. Anal. 2002, 80, 73–100.
  11. Winter, S. Smoothing Spline Regression Estimates for Randomly Right Censored Data. Ph.D. Thesis, University of Stuttgart, Stuttgart, Germany, 2013.
  12. Aydin, D.; Yilmaz, E. Modified spline regression based on randomly right-censored data: A comparative study. Commun. Stat.-Simul. Comput. 2017, 1–25.
  13. El Ghouch, A.; van Keilegom, I. Non-parametric regression with dependent censored data. Scand. J. Stat. 2008, 35, 228–247.
  14. Aydın, D.; Yılmaz, E. Nonparametric regression with randomly right-censored data. Int. J. Math. Comput. Methods 2016, 1, 186–189.
  15. Kim, H.T.; Truong, Y.K. Nonparametric regression estimates with censored data: Local linear smoothers and their applications. Biometrics 1998, 54, 1434–1444.
  16. Peng, L.; Sun, S. Comparisons between local linear estimator and kernel smooth estimator for a smooth distribution based on MSE under right censoring. Commun. Stat.-Theory Methods 2007, 36, 297–312.
  17. Koul, H.; Susarla, V.; van Ryzin, J. Regression analysis with randomly right-censored data. Ann. Stat. 1981, 9, 1276–1288.
  18. Leurgans, S. Linear models, random censoring and synthetic data. Biometrika 1987, 74, 301–309.
  19. Goepp, V.; Bouaziz, O.; Nuel, G. Spline regression with automatic knot selection. arXiv 2018, arXiv:1808.01770.
  20. Frommlet, F.; Nuel, G. An adaptive ridge procedure for L0 regularization. PLoS ONE 2016, 11, e0148620.
  21. Rippe, R.C.A.; Meulman, J.J.; Eilers, P.H.C. Visualization of genomic changes by segmented smoothing using an L0 penalty. PLoS ONE 2012, 7, e38230.
  22. De Boor, C. A Practical Guide to Splines; Springer: New York, NY, USA, 1978.
  23. Reinsch, C.H. Smoothing by spline functions. Numer. Math. 1967, 10, 177–183.
  24. Hurvich, C.M.; Simonoff, J.; Tsai, C. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. R. Stat. Soc. Ser. B 1998, 60, 271–293.
  25. Eilers, P.H.C.; De Menezes, R.X. Quantile smoothing of array CGH data. Bioinformatics 2004, 21, 1146–1153.
  26. Eilers, P.H.; Marx, B.D. Flexible smoothing with B-splines and penalties. Stat. Sci. 1996, 11, 89–121.
  27. Frank, I.E.; Friedman, J.H. A statistical view of some chemometrics regression tools (with discussions). Technometrics 1993, 35, 109–148.
  28. Fu, W.; Knight, K. Asymptotics for lasso-type estimators. Ann. Stat. 2000, 28, 1356–1378.
Figure 1. Scatter plots of both censored data and incomplete response data points over the smooth functions to be estimated by A-splines.
Figure 2. Fitted curves to see the performance of the estimator for $n = 35$, $L = 5\%, 40\%$.
Figure 3. Fitted curves to see the performance of the estimator for $n = 100, 350$, $L = 20\%$.
Figure 4. Fitted curves to observe the quality of the estimates for $n = 100$, $L = 5\%, 40\%$.
Figure 5. Bar plots of the evaluation measures for each simulation configuration and their changes under different censoring levels.
Figure 6. Estimated model for cancer data by the A-spline estimator.
Table 1. Data generation procedure with explanations.
| Steps | Explanation |
| Step 1. Decide $n_{obs}$, $n_{sim}$, and $L$ | Sample size of the simulated dataset, number of repetitions, and censoring level, respectively |
| Step 2. Produce the $x_i$'s, $f(\cdot)$, and $\varepsilon_i$'s | Nonparametric covariate, true smooth function, and random error terms |
| Step 3. Obtain the $y_i$'s | Actual (complete, uncensored) data points |
| Step 4. Generate the $\delta_i$'s using a Bernoulli distribution | Values of the censoring indicator |
| Step 5. Obtain the $c_i$'s i.i.d. with the $y_i$'s | Censoring values that cut the actual lifetimes |
| Step 6. Find $t_i = \min(y_i, c_i)$ | Partly observed response values |
Table 2. Variances and biases of $\hat{\alpha}$, variance of the model ($\hat{\sigma}_{\varepsilon}^{\hat{G}}$), and $MSE$ of $\hat{f}$ for all simulation combinations.
The first four metric columns refer to $\hat{f}_1(x)$, the last four to $\hat{f}_2(x)$.

| $L$ | $n$ | $\hat{\sigma}_{\varepsilon}^{\hat{G}}$ | $MSE$ | $Bias(\hat{\alpha})$ | $Var(\hat{\alpha})$ | $\hat{\sigma}_{\varepsilon}^{\hat{G}}$ | $MSE$ | $Bias(\hat{\alpha})$ | $Var(\hat{\alpha})$ |
| 5% | 35 | 1.2151 | 1.2064 | 0.9971 | 0.0911 | 1.2151 | 1.2064 | 0.9971 | 0.1347 |
| | 100 | 1.1929 | 1.1957 | 0.9979 | 0.0215 | 1.1733 | 1.1804 | 0.9979 | 0.0212 |
| | 350 | 1.2089 | 1.2186 | 0.9982 | 0.0052 | 1.1750 | 1.1861 | 0.9982 | 0.0051 |
| 20% | 35 | 2.3410 | 2.5041 | 0.9971 | 0.1756 | 2.3410 | 2.5041 | 0.9971 | 0.2140 |
| | 100 | 2.3098 | 2.4858 | 0.9979 | 0.0417 | 2.3827 | 2.5881 | 0.9979 | 0.0430 |
| | 350 | 2.2610 | 2.4481 | 0.9982 | 0.0098 | 2.3724 | 2.5716 | 0.9982 | 0.1031 |
| 40% | 35 | 3.7614 | 4.3230 | 0.9971 | 0.2822 | 3.7614 | 4.3230 | 0.9971 | 0.3465 |
| | 100 | 3.5616 | 4.1446 | 0.9979 | 0.0643 | 4.4640 | 5.1566 | 0.9979 | 0.0806 |
| | 350 | 3.4012 | 3.9752 | 0.9982 | 0.0147 | 4.5578 | 5.2489 | 0.9982 | 0.1980 |
Table 3. MSE values for the A-spline, smoothing spline (SS), and regression spline (RS) methods to make comparisons.
The first three method columns refer to $\hat{f}_1(x)$, the last three to $\hat{f}_2(x)$.

| $L$ | $n$ | A-spline | RS | SS | A-spline | RS | SS |
| 5% | 35 | 1.2064 | 1.9622 | 1.9581 | 1.2064 | 1.6632 | 1.9527 |
| | 100 | 1.1957 | 1.2386 | 1.5765 | 1.1804 | 1.4327 | 1.7061 |
| | 350 | 1.2186 | 1.1975 | 1.2644 | 1.1861 | 1.2760 | 1.6197 |
| 20% | 35 | 2.5041 | 3.2386 | 3.6733 | 2.5041 | 2.9084 | 3.1748 |
| | 100 | 2.4858 | 3.1953 | 3.2851 | 2.5881 | 2.6006 | 2.7318 |
| | 350 | 2.4481 | 2.5459 | 2.7270 | 2.5716 | 2.5591 | 2.4059 |
| 40% | 35 | 4.3230 | 5.5459 | 5.5779 | 4.3230 | 4.5839 | 5.3468 |
| | 100 | 4.1446 | 4.6522 | 5.0570 | 5.1566 | 5.1013 | 5.1927 |
| | 350 | 3.9752 | 4.0440 | 4.0591 | 5.2489 | 4.9741 | 5.0198 |
Table 4. Outcomes for the estimated regression model for colon cancer data.
| | $\hat{\sigma}_{\varepsilon}^{\hat{G}}$ | $MSE$ | $Var(\hat{\alpha})$ | $L$ | $n$ |
| $\hat{f}(albumin)$ | 0.0051 | 0.0814 | 0.0037 | 32.98% | 97 |
| $\hat{f}_{RS}(albumin)$ | 0.0540 | 0.0821 | - | | |
| $\hat{f}_{SS}(albumin)$ | 0.0895 | 0.0863 | - | | |

