1. Introduction
Let $(x_i, y_i)$, $i = 1, \dots, n$, be a sample of observations where the $x_i$'s are values of a one-dimensional covariate $X$ and the $y_i$'s denote the values of the completely observed response (lifetime) variable $Y$. In medical studies such as clinical trials, $Y$ is often subject to random right-censoring by a random variable $C$ with values $c_i$ representing the censoring times, i.e., patient withdrawal times. In this case, the observed response values at the design points $x_i$ will be the $z_i$'s, defined as
$$z_i = \min(y_i, c_i), \qquad \delta_i = I(y_i \le c_i), \qquad i = 1, \dots, n, \tag{1}$$
where the $\delta_i$'s are the values of the censoring indicator function that contains the censoring information. It should be noted that $Y$, $C$, and $Z$ have distribution functions $F$, $G$, and $H$, respectively. Additionally, we assume that the $y_i$'s and $c_i$'s are independent, which is a very common assumption in right-censored analysis (see [1, 2]). Thus, the relationship between the distribution of $Z$ and those of ($Y$, $C$) can be written as follows, in terms of the corresponding survival functions:
$$1 - H(t) = [1 - F(t)]\,[1 - G(t)]. \tag{2}$$
This paper considers the problem of fitting a nonparametric regression function with right-censored data. Based on the condition in Equation (1) and the assumption in Equation (2), the nonparametric regression model can be written as
$$z_i = f(x_i) + \varepsilon_i, \qquad i = 1, \dots, n, \tag{3}$$
where the $z_i$'s are the right-censored response values, $f(\cdot)$ is the smooth function to be estimated, and the $\varepsilon_i$'s are normally distributed random errors, denoted as $\varepsilon_i \sim N(0, \sigma^2)$.
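The censoring mechanism described above can be sketched in a few lines of Python. This is only an illustration: the function name `observe` and the sample values are hypothetical, and the indicator convention (1 for an uncensored observation, 0 for a censored one) is the usual one, assumed here.

```python
def observe(y, c):
    """Return observed responses z_i = min(y_i, c_i) and indicators
    delta_i = 1 if y_i <= c_i (uncensored), else 0 (right-censored)."""
    z = [min(yi, ci) for yi, ci in zip(y, c)]
    delta = [1 if yi <= ci else 0 for yi, ci in zip(y, c)]
    return z, delta


# Hypothetical lifetimes and censoring (e.g., withdrawal) times.
y = [2.3, 5.1, 0.7, 4.4]
c = [3.0, 1.8, 2.5, 4.4]
z, delta = observe(y, c)
```

Only `z` and `delta` are available to the analyst; the second lifetime (5.1) is never seen because the patient leaves the study at time 1.8.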
In the context of linear regression, the estimation of censored data is performed using the linear regression model proposed by [3]. Different estimators based on ordinary least squares for linear regression under right-censored data were introduced by [4, 5, 6, 7]. In addition, some theoretical extensions are discussed by [8, 9]. Note, also, that all the methods discussed by the above-mentioned authors assume a linear relationship between the censored responses and the independent variables. In real-world applications, it cannot be known in advance whether the relationship between the responses and explanatory variables is linear. Although there are some procedures to test linearity, these cannot be applied directly to censored data because they were designed for uncensored data. In this scenario, a nonparametric regression model is widely preferred.
Several studies in the literature address the estimation of the model in Equation (3). The existing approaches can be classified as spline-based methods, kernel smoothers, or local smoothing techniques. Spline-based techniques for right-censored data can be categorized as either smoothing splines ([10, 11]) or regression splines [12]. In terms of estimating the model in Equation (3), the difference between the two is that smoothing splines have to use all unique data points as knots; as a result, the variance of the model is large because the fitted curve tries to pass through both the increased values and the zeros. In regression splines, knot points can be freely determined, and for this reason regression splines perform better than smoothing splines (see [12]). However, regression splines are based on truncated power basis polynomials, which force the method to work with a fixed degree. Studies about kernel smoothers include [13, 14]. Research on local smoothing techniques can be found in [15, 16]. In this study, an adaptive ridge estimator (or A-spline) based on a B-spline basis is introduced to estimate the model in Equation (3).
Conventional regression estimators, whether nonparametric or not, cannot be used directly for modeling censored data. Three approaches to this issue are taken in the literature: Kaplan–Meier weights [4], synthetic data transformation ([17, 18]), and data imputation techniques. This paper focuses on synthetic data transformation, which is the most widely used technique in the literature. The main contribution of this technique is that it provides theoretically equal expected values for both the synthetic data $y_{i\hat{G}}$ and the completely observed response variable $Y$, which can be expressed as $E[y_{iG} \mid x_i] = E[y_i \mid x_i] = f(x_i)$. The transformation is based on a Kaplan–Meier estimator ($\hat{G}$) of the distribution of the censoring variable $C$ and works by increasing the magnitudes of uncensored observations and assigning zero values to the censored ones. Details about synthetic data transformation are given in Section 2.
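To make the idea of inflating uncensored observations and zeroing censored ones concrete, the following sketch assumes the commonly used form of the transformation, in which each uncensored observation is divided by the Kaplan–Meier estimate of the censoring survival function at that point. The exact definition used in the paper is the one given in Section 2; the function names here are hypothetical, and ties between observations are not treated specially.

```python
def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate of 1 - G(z_i) for the censoring variable.

    A censored observation (delta_i = 0) counts as an "event" for C.
    Assumes no ties among the observed values.
    """
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    surv = [0.0] * n
    s = 1.0
    for rank, i in enumerate(order, start=1):
        if delta[i] == 0:
            s *= (n - rank) / (n - rank + 1)
        surv[i] = s
    return surv


def synthetic_responses(z, delta):
    """Inflate uncensored responses and set censored ones to zero."""
    surv = km_censoring_survival(z, delta)
    return [z[i] / surv[i] if delta[i] == 1 else 0.0 for i in range(len(z))]


z = [1.0, 2.0, 3.0, 4.0]
delta = [1, 0, 1, 1]
y_syn = synthetic_responses(z, delta)
```

In this toy sample the single censored point becomes zero, while the uncensored points observed after it are scaled up by the factor 3/2, which is the pattern of zeros and increased values that the estimator later has to handle.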
The main motivation of this paper is to present a new nonparametric estimator that deals with synthetic response observations better than existing approaches. As indicated above, all of the methods listed have some restrictions when modeling synthetic data. To address these problems, we introduce a modified A-spline estimator, which places no restrictions on the number of knots, the location of knots, or the degree of the splines.
The A-spline proposed by [19] provides a sparse regression model that is easy to understand and interpret. A hallmark of the A-spline is that it can determine suitable knot points for B-splines by using adaptive ridge regression (see [20] for adaptive ridge regression), based on the approximation of the $L_0$ norm with an iterative procedure (see [19, 21] for more details).
In Section 2, our methodology is presented with a synthetic data transformation, a B-spline regression, an adaptive ridge approach, and, finally, a modified A-spline estimator for the nonparametric regression model based on the synthetic responses. We also give an algorithm to obtain the introduced estimator. Section 3 covers the statistical and asymptotic properties of the obtained estimator. A simulation study and real-world data application are given in Section 4 and Section 5, respectively. Finally, concluding remarks are presented in Section 6.
3. Statistical Properties of the Estimator
The A-spline estimator is a ridge-type estimator and is used in this paper to estimate the right-censored nonparametric regression model. Accordingly, the expressions given below can be written using the random error terms of the model in Equation (3).
However, in this paper, because of censoring, instead of employing the model in Equation (3), that in Equation (7), which involves synthetic responses, is used. In this case, the distribution properties in Equation (18) are changed depending on Lemma 1 and are rewritten as follows:
where
,
is the variance of the right-censored nonparametric model based on the synthetic response variable,
is the
identity matrix, and
denotes fitted values. It should be noted that the obtained estimator is a vector of coefficients
, and therefore, the quality of the model is measured partially based on the bias and variance of
. In this context, from the ordinary ridge regression method, the variance–covariance matrix of
can be approximated by using
as follows:
If
, then the covariance matrix of the fitted values of the model can be given by
Because
is generally unknown, it needs to be estimated as follows:
where $\mathrm{tr}(\cdot)$ indicates the sum of the diagonal elements of a matrix. Additionally, the bias of
is one of the quality measurements for the estimated model. In order to calculate the bias, the conditional expected values of the estimator
have to be obtained by
Following on from Equation (23), the bias can be written as
In this study, Equations (20)–(22) and (24) are used as quality measures to evaluate the performance of the estimated right-censored model. In addition, the mean squared error (MSE), commonly employed in the literature, is also used for measuring the quality of the fitted model. It is obtained as follows:
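As a rough illustration of two of these quality measures, the sketch below computes the MSE in its usual form (the average squared difference between the true and fitted function values) and one common variance estimate for linear smoothers (residual sum of squares divided by the sample size minus the trace of the smoother matrix). Both forms are assumptions; the paper's own definitions are those in Equations (22) and (25). The function names and the input values are hypothetical.

```python
def mse(f_true, f_hat):
    """Mean squared error between true and fitted function values."""
    n = len(f_true)
    return sum((ft - fh) ** 2 for ft, fh in zip(f_true, f_hat)) / n


def residual_variance(y, fitted, trace_h):
    """Variance estimate RSS / (n - tr(H)), where tr(H) is the trace of
    the smoother (hat) matrix, supplied by the caller."""
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss / (len(y) - trace_h)


mse_value = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
var_value = residual_variance([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5], 2.0)
```

The trace is passed in rather than computed so that the sketch stays independent of how the smoother matrix is built.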
3.1. Extended Properties of the Estimator
The modified A-spline estimator introduced in this paper is a smoothing technique that allows for the optimal selection of basis functions, penalties, the number of knots, and the location of knots. It achieves this by using an adaptive (weighted) ridge penalty that approximates the $L_0$ norm.
In this section, some large-sample properties of the modified A-spline estimator are given under right-censoring. It is worth noting that the theoretical properties of the A-spline estimator have not been studied in depth in the literature. There have been some important studies about adaptive ridge estimators, such as [20, 21, 25]. This section provides some initial inferences about the A-spline estimator in a nonparametric context and under censoring.
Before we describe the asymptotic properties of the estimator, it should be emphasized that the flexibility of the A-spline estimator, which allows the choice of penalty and knot points, makes theoretical inference difficult. As is already known, the A-spline estimator is a specialized version of the P-splines proposed by [26]. Its major difference is that the A-spline changes the penalty terms using weights that are obtained iteratively so as to approximate the $L_0$-norm. Because of this, some assumptions and inferences are derived based on the known properties of P-splines.
The main function of the A-spline estimator is given in Equation (12), which can be rewritten as follows:
where
denotes the
–norm. To obtain substantial results, for this study, we assume that
because solving
-norm requires complex calculations. Accordingly, it can be said that minimizing Equation (26) has good potential for both estimating
’s and determining the optimal knot points, such as model selection for sufficiently large
. As is known from the literature, model selection with
is realized by penalizing non-zero parameters, which is a limiting case of the bridge estimation introduced by [
27] and given as
For
, the objective function in Equation (26) has a convex structure, and its global minimum can be obtained easily by using numerical algorithms. However, when
and
, the criterion in Equation (26) is no longer convex and its computation is non-trivial. In the $L_0$-norm context, there is no guarantee of reaching a global minimum. Moreover, more than one local minimum could exist. Thus, this estimator has no unique solution, and the result depends on the iterative process. In [21], it is shown that a minimum of 5 and a maximum of 40 iterations provide reasonable convergence of the estimator to the true parameters.
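The iterative reweighting idea can be illustrated on a deliberately simplified toy problem. The sketch below is not the modified A-spline itself: it assumes an identity design matrix and penalizes the coefficients directly rather than differences of B-spline coefficients, so that each weighted ridge step has a closed form. The weight update $w_j = 1/(a_j^2 + \varepsilon^2)$ is the usual adaptive ridge choice as we understand it from the adaptive ridge literature; the function name, the value of `eps`, and the data are hypothetical.

```python
def adaptive_ridge_identity(y, lam, eps=1e-5, n_iter=40):
    """Adaptive ridge for the toy criterion
        sum_j (y_j - a_j)^2 + lam * sum_j w_j * a_j^2,
    with weights w_j = 1 / (a_j^2 + eps^2) refreshed at every iteration,
    so that w_j * a_j^2 approximates the L0 indicator 1{a_j != 0}."""
    w = [1.0] * len(y)
    a = list(y)
    for _ in range(n_iter):
        # Weighted ridge step (closed form because the design is the identity).
        a = [yj / (1.0 + lam * wj) for yj, wj in zip(y, w)]
        # Weight update: small coefficients receive very large penalties.
        w = [1.0 / (aj * aj + eps * eps) for aj in a]
    return a


coef = adaptive_ridge_identity([5.0, 0.5, -4.0, 0.1], lam=1.0)
```

With these inputs the two large entries are only mildly shrunk, while the two small entries are driven to numerically zero values. This selection behavior is what lets the A-spline discard unnecessary knots. Because the criterion is non-convex, a different starting point or penalty value can lead to a different limit, in line with the remark above about local minima.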
When estimator
is inspected asymptotically, its asymptotic consistency can still be studied even though its objective function in Equation (26) is non-convex. In this case, the following condition is assumed:
where
is a non-negative definite matrix, and also assumed is that
In general, the obtained explanatory variables included by
are scaled. Accordingly, all of the diagonals of
are equal to 1. Note that it must be assumed that
and
are nonsingular matrices; consequently,
are full-rank matrices, which is needed for identifiability. Using the conditions in Equations (28) and (29), the limiting behavior of the estimator can be studied by examining the asymptotic behavior of the minimization problem in Equation (12). To establish the consistency of
, the function is given as
where
is a consistent estimator for
. This result is confirmed by the following theorem:
Theorem 1. If is a full rank-matrix and , then where Thus, and is a consistent estimator of . It could therefore be said that Because
is not convex due to the degree of norm
, and to ensure the accuracy of Equation (32), some additional notes are needed. Accordingly, it can be said that
is essential for
. From that, if
and
, then it can be written that
where
has a distribution
and its elements consist of the random error terms
’s.
4. Simulation Study
In this section, a simulation study is carried out to examine the behavior of the modified A-spline estimator when estimating the right-censored nonparametric model. Before the results are presented, datasets for the different simulation configurations are generated using the “simcensdata” function in the R software, which can be accessed via this link: https://github.com/yilmazersin13/simcensdata-generating-randomly-right-censored-data. Our data generation procedure, with accompanying descriptions, is given in Table 1.
For this simulation study, within the scope of Step 1, , , and the censoring levels . The nonparametric covariate and random errors in Step 2 are generated as and , where is a constant that determines the shape of the curve. Note that, in this study, two different types of function are used to test the introduced method under various conditions. These functions are given below with their formulations as follows:
Panel (a) and Panel (b) represent two different datasets that were formed based on nonlinear functions
and
. The plots of
Figure 1 are drawn for
and
. It should be noted that the optimal selection of the number and positions of knots is extremely important for the functions represented in these panels. In the context of synthetic data transformation, censored data points take zero values and uncensored points take values larger than those actually observed. In this case, the choice of the knots is crucial.
From the data generation procedure given above, the right-censored nonparametric model can be written as follows:
Then, to use the censoring information in the estimation process, a synthetic data transformation is applied, as in Equation (6). Therefore, the final model to be estimated in the simulation experiments is
In this simulation study, for three sample sizes, three censoring levels, and two functions, 18 configurations are obtained. All the outcomes for the model in Equation (34) under these conditions are given in the following figures and tables.
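The simulation design can be sketched as follows. This is not the “simcensdata” procedure: the sample sizes, the two test functions, the noise level, and the censoring mechanism below are all hypothetical stand-ins chosen only to show how three sample sizes, three censoring levels, and two functions yield 18 configurations, and how a target censoring level can be reached with censoring times that are independent of the responses given the covariate.

```python
import itertools
import math
import random
from statistics import NormalDist

SAMPLE_SIZES = [30, 100, 300]          # hypothetical placeholder values
CENSORING_LEVELS = [0.05, 0.20, 0.40]  # 5%, 20%, and 40% censoring
FUNCTIONS = {                          # hypothetical stand-ins for the two test functions
    "f1": lambda x: math.sin(2.0 * math.pi * x),
    "f2": lambda x: x + 2.0 * math.exp(-200.0 * (x - 0.5) ** 2),
}


def simulate(n, cens_level, f, sigma=0.5, seed=0):
    """Generate (x, z, delta) with an expected censoring proportion of cens_level."""
    rng = random.Random(seed)
    # y - c has noise N(0, 2 * sigma^2); the shift makes P(y > c) = cens_level.
    shift = math.sqrt(2.0) * sigma * NormalDist().inv_cdf(1.0 - cens_level)
    x = sorted(rng.random() for _ in range(n))
    y = [f(xi) + rng.gauss(0.0, sigma) for xi in x]
    c = [f(xi) + shift + rng.gauss(0.0, sigma) for xi in x]
    z = [min(yi, ci) for yi, ci in zip(y, c)]
    delta = [1 if yi <= ci else 0 for yi, ci in zip(y, c)]
    return x, z, delta


configs = list(itertools.product(SAMPLE_SIZES, CENSORING_LEVELS, FUNCTIONS))
x, z, delta = simulate(2000, 0.20, FUNCTIONS["f1"])
censored_share = 1.0 - sum(delta) / len(delta)
```

Each entry of `configs` is a (sample size, censoring level, function name) triple. In an actual study, every configuration would be replicated many times and the evaluation metrics averaged over the replications.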
Table 2 presents the scores of all the evaluation metrics for each of the simulation configurations. The results are examined from three aspects of the estimation performance of the A-spline estimator: the effects of the sample size, the censoring level, and the shape of the data. For the first aspect, it can be seen from the table that
and
decrease when the sample size increases. This can be interpreted as empirical support for the asymptotic convergence, which is one of the main purposes of this simulation study. This interpretation holds for all censoring levels. In contrast to the sample size, a higher censoring level naturally degrades the performance of the estimator; the important point here is how strongly the estimator reacts to variation in the censoring level. If the scores are inspected carefully, it can be seen that there are no large differences between low and high censoring levels, which can also be seen in the figures given below. This indicates that the A-spline estimator mitigates the effect of the censoring level on the selection of the optimal knot points, as expected. Finally, two different function types are used in this paper.
has a shape similar to that of a sine function and is not hard to capture for any smoothing technique.
is an almost linear function with one large peak; this is a challenge for the estimator, especially under censoring. The outcomes in
Table 2 demonstrate this. Although the results for
are smaller than those for
, it can be said that the A-spline estimator shows a satisfactory performance for both datasets.
Table 3 presents the comparative outcomes for the introduced modified A-spline estimator and the commonly used smoothing spline (SS) and regression spline (RS) methods. The best scores are indicated with bold colored text. As can be seen, the results indicate that the modified A-spline estimator shows the best overall performance. Additionally, as mentioned in the introduction section, RS has smaller MSEs than SS. From here, it can be said that the introduced method gives more satisfying results, which can be explained by its adaptive nature. If
Table 3 is inspected carefully, it can be seen that, for the results obtained from
, the RS method has attractive outcomes when the censoring level is
. This is understandable given the shape of the function.
Figure 2 aims to show how the modified A-spline estimator behaves when the sample size is exceedingly small, under various censoring levels. It is obvious that the estimation of
is easier than that of
, which is explained below.
Figure 2 shows this more clearly. In addition, it can be said that the method is successful for even extremely small sample sizes (
. This is an important contribution of this method for right-censored data because in medical datasets, and especially in clinical observations, large samples are frequently unobtainable.
Figure 3 presents the effects of sample sizes by keeping the censoring level constant at 20%. Model I was obtained using
; similar fitted curves are obtained for
and
, and these curves seem to be good representations of the data. This inference is also valid for Model II. Both plots show that the fitted curves successfully model right-censored data.
Figure 4 demonstrates how the method works under heavy censoring. To that end, fitted curves are shown for a moderate sample size together with the lowest and the highest censoring levels, 5% and 40%. As we expected, the A-spline estimator demonstrates its ability to handle data with zero values obtained by synthetic data transformation, and it can be clearly seen that there is a difference between the two graphs. This inference is also supported by the results in
Table 2.
Figure 5 depicts bar plots of the measurement tools for both the estimated A-spline coefficients and the estimated model. In each panel, A1.5%, A1.20%, and A1.40% denote the obtained scores of the evaluation metric for
, respectively, for
. In a similar manner, A2.5%, A2.20%, and A2.40% represent the scores for
and all the censoring levels, and A3.5%, A3.20%, and A3.40% denote the results for
for all the censoring levels. The top panels of
Figure 5 include bar plots for the bias values. As in
Table 2, it can be seen here that the biases for the two models are very similar and, as expected, become smaller in larger samples. The panels in the middle show bar plots for the variances of the coefficients. The plots appear similar for the two models, but, as has been said before, because the estimation of Model II is more difficult than that of Model I, the y-axis is significantly wider in scope. The panel at the bottom is drawn for the
values of the estimated model, and it is similar to the variance plots. Overall, these plots indicate that the A-spline estimator can estimate the model while overcoming the effect of censoring in terms of various evaluation metrics.
5. Real Data Application
This section demonstrates the performance of the modified A-spline estimator on real right-censored data. The dataset contains data from colon cancer patients in İzmir. It includes the survival times, censoring indicator
, and albumin (i.e., the most common protein found in the blood) values of patients. To provide continuity, the logarithms of the survival times are considered as a response variable (
), and albumin is taken as a nonparametric covariate (
. The right-censored regression model is thus given by
Note, also, that because
’s cannot be used directly in the estimation procedure, they have been replaced by the synthetic responses shown in Equation (6). The model in Equation (36) is thus rewritten as
The dataset contains information for 97 patients to be used for this analysis. However, the records of 32 of these patients are incomplete, containing right-censored observations; the data of the remaining 65 patients are uncensored (deceased). Consequently, in this dataset, the censoring level is approximately 33% (32 of 97). The outcomes calculated for the model in Equation (37) are given in the following table and figure.
Table 4 summarizes the performance of the modified A-spline estimator. Note that the values of
and
are better than the results of Aydın and Yilmaz (2018), who previously used regression splines to model right-censored data. In addition, to provide a fairer comparison, the results of the RS and SS methods are given in the table. As can be seen, the results are quite similar to the simulation results. Here, the A-spline gives the best score, which supports the benefit of the introduced method. Additionally, when the shape of the dataset is inspected from
Figure 6, it can be described as having an irregular shape, and it can be seen that this irregularity increases after synthetic data transformation, which is demonstrated by the blue dots in the figure. Despite this challenging case, the A-spline fit seems to represent the data well.
6. Concluding Remarks
This paper demonstrates that a modified A-spline estimator can be used to estimate the right-censored nonparametric regression model successfully. This is because it uses an adaptive procedure for determining the penalty term and works with only optimum knot points. A simulation study and real data example were carried out to demonstrate the performance of the method, and it can be seen from our findings that the modified A-spline estimator has merit for the estimation of right-censored data.
Across the numerical examples, increasing the sample size improves the performance of the method, which gives results closer to the real observations. This can be seen in
Figure 3,
Figure 4 and
Figure 5. Moreover, changes to the censoring level also influence the goodness of fit, and, as expected, when the censoring level increases, the performance of the method is negatively affected. However, there is an important difference here in terms of the modified A-spline. The main purpose of the usage of this method is to diminish the effect of censorship on the modeling process, and most of our results show that the introduced method achieves this purpose. For an example of these results, see
Table 2. In the simulation study, two different function types are used to generate the model.
is a typical sine-curve pattern and is not difficult to estimate for any smoothing method.
is somewhat more difficult to handle, especially for the smoothing techniques that use all data points as knots. In this paper, it can be seen that for almost all of the simulation configurations, the modified A-spline estimator gives very close values in terms of all evaluation metrics.
The real-world application uses the dataset of colon cancer patients. Their survival times are estimated by using albumin values in their blood.
Figure 6 and
Table 4 show the outcomes of this study. As mentioned above, our method performs well despite the irregularly scattered data points. The confidence interval given by the shaded region in
Figure 6 seems wide because synthetic data transformation puts censored points (as zeros) far from the uncensored points. Considering the mentioned properties, it can be said that the modified A-spline estimator can be regarded as a robust estimator for right-censored datasets. As a result of this study, we recommend the modified A-spline estimator for modeling clinical datasets.