A-Spline Regression for Fitting a Nonparametric Regression Function with

This paper aims to solve the problem of fitting a nonparametric regression function with right-censored data. In the literature, censoring in the response variable is generally handled by a synthetic data transformation based on the Kaplan-Meier estimator. In the context of synthetic data, there have been different studies on the estimation of right-censored nonparametric regression models based on smoothing splines, regression splines, kernel smoothing, local polynomials, and so on. It should be emphasized that synthetic data transformation manipulates the observations because it assigns zero values to censored data points and increases the size of the uncensored observations. Thus, an irregularly distributed dataset is obtained. We claim that adaptive spline (A-spline) regression has the potential to deal with this irregular dataset more easily than the smoothing techniques mentioned here, due to the freedom to determine the degree of the spline, as well as the number and location of the knots. The theoretical properties of A-splines with synthetic data are detailed in this paper. Additionally, we support our claim with numerical studies, including a simulation study and a real-world data example.


Let $\{(x_i, y_i),\ 1 \le i \le n\}$ be a sample of observations, where the $x_i$'s are values of a one-dimensional covariate $X$ and the $y_i$'s denote the values of the completely observed response (lifetime) variable $Y$. In medical studies such as clinical trials, $Y$ is often subject to random right-censoring by a random variable $C$ whose values $c_i$ represent the censoring times, e.g., patient withdrawal times. In this case, the observed response values at the design points $x_1, \ldots, x_n$ are the $z_i$'s, defined as

$$z_i = \min(y_i, c_i), \qquad \delta_i = I(y_i \le c_i), \tag{1.1}$$

where the $\delta_i$'s are the values of the censoring indicator function that contains the censoring information.
It should be noted that $Y$, $C$, and $Z$ have distribution functions $F$, $G$, and $H$ for $t \in \mathbb{R}$, respectively. Additionally, we assume that the $y_i$'s and $c_i$'s are independent, which is a very common assumption in right-censored analysis (see [1][2]). Thus, the relationship between the distribution of $Z$ and those of $(Y, C)$ can be written as follows, in terms of the corresponding survival functions:

$$1 - H(t) = \big(1 - F(t)\big)\big(1 - G(t)\big). \tag{1.2}$$

This paper considers the problem of fitting a nonparametric regression function with right-censored data. Based on the condition in Equation (1.1) and the assumption in Equation (1.2), the nonparametric regression model can be written as

$$z_i = f(x_i) + \varepsilon_i, \qquad 1 \le i \le n, \tag{1.3}$$

where $z_1, \ldots, z_n$ are the right-censored response values, $f(\cdot)$ is the smooth function to be estimated, and the $\varepsilon_i$'s are normally distributed random errors, denoted as $\varepsilon_i \sim N(0, \sigma^2)$. In the context of linear regression, the estimation of censored data is performed using the linear regression model proposed by [3]. Different estimators based on least squares for linear regression under right-censored data were introduced by [4], [5], [6], and [7]. In addition, some theoretical extensions are discussed by [8] and [9]. Note, also, that all the methods discussed by the above-mentioned authors assume a linear relationship between the censored responses and the independent variables. In real-world applications, it cannot be known whether the relationship between the responses and explanatory variables is linear. Although there are some procedures to test linearity, these cannot be applied directly to censored data because they were designed for uncensored data. In this scenario, a nonparametric regression model is widely preferred.
Several studies in the literature address the estimation of the model in Equation (1.3). The existing approaches can be classified as spline-based methods, kernel smoothers, or local smoothing techniques. Spline-based techniques for right-censored data can be categorized as either smoothing splines ([10] and [11]) or regression splines [12]. In terms of estimating the model in Equation (1.3), the difference between the two is that smoothing splines must use all unique data points as knots; as a result, the variance of the model can be large, because the fitted curve tries to pass through both the increased values and the zeros. In regression splines, knot points can be freely determined, and for this reason regression splines perform better than smoothing splines (see [12]). However, regression splines are based on truncated power basis polynomials, which force the method to work with a fixed degree. Studies on kernel smoothers include [13] and [14]. Research on local smoothing techniques can be found in [15] and [16]. In this study, an adaptive ridge estimator (or A-spline) based on a B-spline basis is introduced to estimate the model in Equation (1.3).
Conventional regression estimators, whether nonparametric or not, cannot be used directly for modeling censored data. Three approaches to this issue are taken in the literature: Kaplan-Meier weights [4], synthetic data transformation ([17], [18]), and data imputation techniques. This paper focuses on synthetic data transformation, which is the most widely used technique in the literature. The main contribution of this technique is that it provides theoretically equal expected values for the synthetic data and the completely observed response variable, based on the Kaplan-Meier estimator ($\hat{G}$) of the distribution of the censoring variable. This can be expressed as $E[z_{i\hat{G}}] \cong E[y_i]$ and is achieved by increasing the magnitudes of the uncensored observations and assigning zero values to the censored ones. Details about synthetic data transformation are given in Section 2.
The main motivation of this paper is to present a new nonparametric estimator that deals with synthetic response observations better than existing approaches. All of the methods given above have some restrictions when modeling synthetic data, as indicated above. Because of these problems, we introduce a modified A-spline estimator, which places no restrictions on the number of knots, the location of the knots, or the degree of the splines.
The A-spline proposed by [19] provides a sparse regression model that is easy to understand and interpret. A trademark of the A-spline is that it can determine suitable knot points for B-splines by using adaptive ridge regression (see [20] for adaptive ridge regression), based on the approximation of the $L_0$-norm with an iterative procedure (see [21] and [19] for more details).
In Section 2, our methodology is presented with a synthetic data transformation, a B-spline regression, an adaptive ridge approach, and, finally, a modified A-spline estimator for the nonparametric regression model based on the synthetic responses. We also give an algorithm to obtain the introduced estimator. Section 3 involves the statistical and asymptotic properties of the obtained estimator. A simulation study and real-world data application are given in Sections 4 and 5, respectively. Finally, concluding remarks are presented in Section 6.

Synthetic Data Transformation
To account for the right-censored data in the estimation procedure, an adjustment must be applied to the censored dataset. Otherwise, the methods for estimating $f$ cannot be applied directly. One of the most important reasons for this is that the right-censored response variable and the actual response variable have different expected values. As indicated in Section 1, synthetic data transformation is used to avoid this issue. It can be calculated simply as follows:

$$z_{iG} = \frac{\delta_i z_i}{1 - G(z_i)}, \tag{2.1}$$

where $G(\cdot)$ is the distribution of the censoring variable $C$, which is mentioned in the previous section. Note that because the distribution $G$ is generally unknown, its Kaplan-Meier estimate $\hat{G}$ is used instead (see Koul et al. 1981), which can be formulated as

$$1 - \hat{G}(t) = \prod_{i=1}^{n} \left(\frac{n-i}{n-i+1}\right)^{I[z_{(i)} \le t,\ \delta_{(i)} = 0]}, \qquad t \ge 0, \tag{2.2}$$

where the $z_{(i)}$'s denote the ordered values of the response observations, $z_{(1)} \le \cdots \le z_{(n)}$, and the $\delta_{(i)}$'s are the values of $\delta_i$ associated with the ordered $z_{(i)}$'s. Note that if the distribution $G$ is taken arbitrarily, some values of $z_i$ may be identical, which prevents the correct calculation of the Kaplan-Meier estimator. Therefore, the ordered values $z_{(1)} \le \cdots \le z_{(n)}$ might not be unique. It should be emphasized that the Kaplan-Meier estimator gives an opportunity for ordering the $z_{(i)}$'s uniquely. In addition, it is a widely known property of $\hat{G}$ that it has jumps only at censored data points (see Paterson, 1977, and Kaplan and Meier, 1958).

The proof of Lemma 2.1 is given in the Appendix. In order to achieve the goal of this study, synthetic responses are modeled through a modified A-spline approach, which is formed by merging B-splines and the adaptive ridge penalty. Details are given in the next section.
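The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `km_censoring_survival` and `synthetic_responses` are hypothetical, the observed values are assumed to have no ties, and the product form of the Kaplan-Meier estimate of the censoring distribution follows Koul et al. (1981).

```python
def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate of 1 - G at each observed z_i, where G is the
    distribution of the censoring variable. Assumes no ties among the z_i."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])  # indices of z in ascending order
    surv = [1.0] * n
    s = 1.0
    for rank, i in enumerate(order, start=1):
        if delta[i] == 0:  # the estimate of G jumps only at censored points
            s *= (n - rank) / (n - rank + 1)
        surv[i] = s
    return surv


def synthetic_responses(z, delta):
    """Synthetic responses: delta_i * z_i / (1 - G_hat(z_i)).
    Censored points become 0; uncensored points are inflated."""
    surv = km_censoring_survival(z, delta)
    return [zi / si if di == 1 else 0.0 for zi, di, si in zip(z, delta, surv)]


# Toy data (illustrative values only): 1 = uncensored, 0 = censored
z = [1.0, 2.0, 3.0, 4.0, 5.0]
delta = [1, 0, 1, 1, 0]
synthetic = synthetic_responses(z, delta)
print(synthetic)
```

In this toy example the censored points are set to zero, while every uncensored point that follows a censored one is scaled up, which is exactly the irregularity in the transformed data that the paper emphasizes.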

B-Spline Approximation
Because our A-spline regression is constructed from B-splines, the necessary background is described in this section. Let $\{t_j \mid j \in \mathbb{Z}\}$ be a nondecreasing sequence given by $t_0 < t_1 \le \cdots \le t_k < t_{k+1}$, where $k$ denotes the number of knots, the $t_j$'s are the knot points, and $t_0$, $t_{k+1}$ are the boundaries, which are not counted as knot points. In this context, the B-splines of degree $q$ are piecewise polynomial functions that have continuous derivatives up to order $q-1$ at each of the given knot points. From the properties of B-splines, $q+2$ knots are needed for $q+1$ polynomial pieces. In this case, a B-spline can be described as a spline that is non-zero on the interval $[t_j, t_{j+q+1}]$. The B-spline of degree $q$ is denoted by $B_{j,q}(x)$, and it is calculated by

$$B_{j,q}(x) = \frac{x - t_j}{t_{j+q} - t_j}\, B_{j,q-1}(x) + \frac{t_{j+q+1} - x}{t_{j+q+1} - t_{j+1}}\, B_{j+1,q-1}(x). \tag{2.5}$$

To solve the recursive formula in Equation (2.5), see the algorithm described in [22]. Note that if $q = 0$, then $B_{j,0}(x) = 1$ for $t_j \le x < t_{j+1}$ and $B_{j,0}(x) = 0$ otherwise. Some fundamental properties of B-splines are that:
• The B-spline consists of $q+1$ polynomial pieces, each of degree $q$.
• Each spline function must be differentiable up to order $q-1$.
• B-splines are nonzero only on the span of the $q+2$ given knots.
• Each B-spline is positive on the intervals determined by its knot points.
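The recursion for B-splines can be illustrated with a short Python sketch. It uses the standard Cox-de Boor recursion, starting from degree-0 indicator functions; the function name `bspline_basis` and the uniform knot vector are illustrative assumptions and are not taken from the algorithm in [22].

```python
def bspline_basis(x, knots, q):
    """Evaluate all B-splines of degree q at the point x on a non-decreasing
    knot vector, using the Cox-de Boor recursion.
    Returns len(knots) - q - 1 basis values."""
    # Degree 0: indicator of each half-open knot interval [t_j, t_{j+1})
    B = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
         for j in range(len(knots) - 1)]
    for d in range(1, q + 1):
        new = []
        for j in range(len(knots) - d - 1):
            left_den = knots[j + d] - knots[j]
            right_den = knots[j + d + 1] - knots[j + 1]
            # Terms with a zero denominator (repeated knots) are dropped
            left = (x - knots[j]) / left_den * B[j] if left_den > 0 else 0.0
            right = ((knots[j + d + 1] - x) / right_den * B[j + 1]
                     if right_den > 0 else 0.0)
            new.append(left + right)
        B = new
    return B


# Cubic B-splines (q = 3) on uniform knots 0, 1, ..., 7, evaluated at x = 3.5
values = bspline_basis(3.5, [0, 1, 2, 3, 4, 5, 6, 7], 3)
print(values, sum(values))
```

At $x = 3.5$ all four cubic B-splines overlap, each value is non-negative, and the values sum to one, which illustrates the local support and positivity properties listed above.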
From the information given above, a fitted smooth function for the synthetic data pairs $(x_i, z_{i\hat{G}})$ can be written as a linear combination of B-splines for $k$ knots by

$$\hat{f}(x) = \sum_{j=1}^{k+q+1} \alpha_j B_{j,q}(x). \tag{2.6}$$

Equation (2.6) is useful only for a mathematical approximation. Note, also, that B-splines are a widely used approximation for the estimation of a single-index (univariate) nonparametric regression model (see [22] for details). From this, a minimization problem emerges with a smoothness penalty written as follows:

$$\sum_{i=1}^{n} \Big\{ z_{i\hat{G}} - \sum_{j=1}^{k+q+1} \alpha_j B_{j,q}(x_i) \Big\}^2 + \lambda \int \big\{ f^{(m)}(x) \big\}^2 dx, \tag{2.7}$$

where $\lambda > 0$ is the smoothing parameter that controls the smoothness of the estimated curve and $m$ is the order of the derivative used in the penalty. Controlling the size of the penalty term plays a crucial role in the accuracy of the model estimation. This is very similar to the smoothing parameter described by [23]. In B-spline regression, one important issue is the order of the derivative chosen for the penalty term, because higher orders may cause computational problems. Choosing the number and positions of the knots is also a substantial decision in the minimization problem in Equation (2.7), especially for right-censored datasets.
In this paper, setting the locations and number of the knots is a primary aim because it has a direct relationship with the accuracy of the estimated model, as mentioned above. To provide a suitable solution for this issue, the adaptive ridge penalty proposed by [19] is used instead of the penalty term in Equation (2.7). Note, also, that the smoothing parameter is chosen by an improved criterion, as proposed by [24]. In the next section, the adaptive ridge penalty is introduced.

Adaptive Ridge
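The adaptive ridge method aims at the best tradeoff between the goodness of fit (the left part of Equation (2.7)) and the number of knots, which provides a more powerful regression model. To achieve this purpose, it starts from a large number of equally spaced knots and then modifies the penalty term by using this number of knots.
Let a B-spline be defined on the knot points $t_1, t_2, \ldots, t_k$, and assume that, for an interior knot $t_j$, the difference of order $q+1$ of the coefficients vanishes, $\Delta^{q+1}\alpha_j = 0$. Then the given knots are updated as $t_1, \ldots, t_{j-1}, t_{j+1}, \ldots, t_k$. Thus, the penalty term changes from the overall number of knots to the number of non-zero differences of order $q+1$, given by

$$\lambda \sum_{j=q+2}^{k+q+1} \big\| \Delta^{q+1}\alpha_j \big\|_0, \tag{2.8}$$

where $\|\Delta^{q+1}\alpha_j\|_0$ denotes the $L_0$-norm of the difference term $\Delta^{q+1}\alpha_j$, which means that if $\Delta^{q+1}\alpha_j = 0$ then $\|\Delta^{q+1}\alpha_j\|_0 = 0$, and $\|\Delta^{q+1}\alpha_j\|_0 = 1$ otherwise. Here, $\lambda$ is a smoothing parameter. The point of this penalty term is that it deletes the knot $t_j$ and works with the merged interval $[t_{j-1}, t_j] \cup [t_j, t_{j+1}]$. Thus, the modeling process is completed using the remaining knot points.
Note that Equation (2.8) is not differentiable, which prevents the fitted model from being obtained directly. The adaptive ridge method provides an approximation of the $L_0$-norm given in Equation (2.8) (see [21] for a more detailed discussion). The main idea of the adaptive ridge method is to use weights to approximate the $L_0$-norm. In this context, the penalized minimization criterion in Equation (2.7) is rewritten by using a new weighted penalty as follows:

$$\sum_{i=1}^{n} \Big\{ z_{i\hat{G}} - \sum_{j=1}^{k+q+1} \alpha_j B_{j,q}(x_i) \Big\}^2 + \lambda \sum_{j=q+2}^{k+q+1} w_j \big( \Delta^{q+1}\alpha_j \big)^2, \tag{2.9}$$

and the vector and matrix form of Equation (2.9) is

$$\big\| \mathbf{Z}_{\hat{G}} - \mathbf{B}\boldsymbol{\alpha} \big\|^2 + \lambda\, \boldsymbol{\alpha}^{\top} \mathbf{D}^{\top} \mathbf{W} \mathbf{D}\, \boldsymbol{\alpha}, \tag{2.10}$$

where $\mathbf{Z}_{\hat{G}}$ is the vector of the synthetic response values, $\mathbf{W} = \mathrm{diag}(w_j)$ with the $w_j$'s representing positive weights, and $\mathbf{D}$ is the matrix form of the difference operator $\Delta^{q+1}$, which is built from the first difference operator as $\Delta\alpha_j = \alpha_j - \alpha_{j-1}$ and $\Delta^{q+1}\alpha_j = \Delta^{q}(\Delta\alpha_j)$. Finally, $\boldsymbol{\alpha}$ is the vector of coefficients for the B-spline design matrix $\mathbf{B}$.
It should be noted that the $w_j$'s make the penalty term in Equation (2.9) approximate the $L_0$-norm; in the adaptive ridge procedure, the weights are therefore of crucial importance for choosing suitable knot locations. To approximate the $L_0$-norm, the weights are determined by an iterative process from the previous values of the coefficients $\alpha_j$, using the formula given in [19]:

$$w_j = \Big( \big( \Delta^{q+1}\alpha_j \big)^2 + \varepsilon^2 \Big)^{-1},$$

where $\varepsilon > 0$ is a constant, and it can be seen that the approximation $w_j (\Delta^{q+1}\alpha_j)^2 \approx \|\Delta^{q+1}\alpha_j\|_0$ depends on $\varepsilon$. In the next section, the modified A-spline estimator is introduced based on the given adaptive ridge penalty and synthetic response values.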
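A short worked step may help show why the weights mimic an $L_0$ penalty. It assumes the usual adaptive ridge update from [19], $w_j = ((\Delta^{q+1}\alpha_j)^2 + \varepsilon^2)^{-1}$ with a small constant $\varepsilon > 0$, evaluated at the current coefficients:

```latex
w_j \,(\Delta^{q+1}\alpha_j)^2
  = \frac{(\Delta^{q+1}\alpha_j)^2}{(\Delta^{q+1}\alpha_j)^2 + \varepsilon^2}
  \approx
  \begin{cases}
    0, & |\Delta^{q+1}\alpha_j| \ll \varepsilon, \\[2pt]
    1, & |\Delta^{q+1}\alpha_j| \gg \varepsilon .
  \end{cases}
```

Each weighted term therefore behaves like an indicator of a non-zero difference, so the weighted sum roughly counts the knots that are kept, and a smaller $\varepsilon$ gives a sharper approximation.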

Modified A-Spline Estimator
In this section, the estimated coefficient vector is given, and, to provide a more precise and detailed explanation, an algorithm is presented. A modified A-spline estimator for the right-censored nonparametric regression model in Equation (2.4) is obtained by minimizing Equation (2.10) after some algebraic operations that are given in Appendix A1. In this case, the vector of the estimated coefficients of the A-spline regression is computed by using the formula

$$\hat{\boldsymbol{\alpha}} = \big( \mathbf{B}^{\top}\mathbf{B} + \lambda \mathbf{P} \big)^{-1} \mathbf{B}^{\top} \mathbf{Z}_{\hat{G}},$$

where $\mathbf{P} = \mathbf{D}^{\top}\mathbf{W}\mathbf{D}$ denotes the adaptive ridge penalty, which involves both the difference matrix $\mathbf{D}$ and the weight matrix $\mathbf{W}$. From here, fitted values for the model in Equation (2.4) can be obtained as

$$\hat{\mathbf{f}} = \mathbf{B}\hat{\boldsymbol{\alpha}} = \mathbf{H}\mathbf{Z}_{\hat{G}}, \qquad \mathbf{H} = \mathbf{B}\big( \mathbf{B}^{\top}\mathbf{B} + \lambda \mathbf{P} \big)^{-1}\mathbf{B}^{\top},$$

where $\mathbf{H}$ is a hat matrix. It should be emphasized that, because of computational difficulties, the values of the matrix $\mathbf{W}$ are not calculated directly; instead, the whole penalty term is obtained by an iterative algorithm, which is the most efficient method. The algorithm for our modified A-spline estimator is shown in Algorithm 1.
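The iterative idea behind the estimator can be sketched in pure Python under strong simplifying assumptions, which are not the setting of the paper: the basis is taken as the identity (a degree-0 spline with a knot at every point), so the penalty uses first-order differences and the normal equations are tridiagonal. The names `solve_tridiagonal` and `adaptive_ridge`, the threshold 0.99, and the values `eps=1e-5` and `n_iter=40` are illustrative choices; Algorithm 1 itself is not reproduced here.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. lower[i] multiplies x[i-1] and upper[i] multiplies
    x[i+1] in row i (lower[0] and upper[-1] are unused)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def adaptive_ridge(y, lam, eps=1e-5, n_iter=40):
    """Minimise sum (y_i - a_i)^2 + lam * sum w_j (a_{j+1} - a_j)^2 and
    re-estimate the weights as w_j = 1 / ((a_{j+1} - a_j)^2 + eps^2)."""
    n = len(y)
    w = [1.0] * (n - 1)  # start from an ordinary ridge penalty
    alpha = list(y)
    for _ in range(n_iter):
        # Build I + lam * D'WD, which is tridiagonal for first differences
        diag, lower, upper = [1.0] * n, [0.0] * n, [0.0] * n
        for j in range(n - 1):
            diag[j] += lam * w[j]
            diag[j + 1] += lam * w[j]
            upper[j] = -lam * w[j]
            lower[j + 1] = -lam * w[j]
        alpha = solve_tridiagonal(lower, diag, upper, y)
        w = [1.0 / ((alpha[j + 1] - alpha[j]) ** 2 + eps ** 2)
             for j in range(n - 1)]
    # A difference is "kept" when its weighted square is close to 1
    kept = [j for j in range(n - 1)
            if w[j] * (alpha[j + 1] - alpha[j]) ** 2 > 0.99]
    return alpha, kept


# Noise-free step signal: the only real change is between indices 9 and 10
y = [0.0] * 10 + [5.0] * 10
alpha, kept = adaptive_ridge(y, lam=1.0)
print(kept)
```

In the full method, the identity basis would be replaced by the B-spline design matrix and the first differences by differences of order $q+1$, so a general linear solver would be needed in place of the tridiagonal one.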

Statistical Properties of the Estimator
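The A-spline estimator is a ridge-type estimator and is used in this paper for the estimation of the right-censored nonparametric regression model. For the random error terms of the model in Equation (1.3), the following expressions can be written:

$$E(\boldsymbol{\varepsilon}) = \mathbf{0}, \qquad \mathrm{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}. \tag{3.1}$$

However, in this paper, because of censoring, the model in Equation (2.4), which involves synthetic responses, is used instead of the model in Equation (1.3). In this case, the distribution properties in Equation (3.1) change depending on Lemma 2.1 and are rewritten in terms of the synthetic responses (Equation (3.2)), where $\sigma^2_{\hat{G}}$ is the variance of the right-censored nonparametric model based on the synthetic response variable, $\mathbf{I}$ is the identity matrix, and $\hat{\mathbf{f}}$ denotes the fitted values.

It should be noted that the obtained estimator is a vector of coefficients $\hat{\boldsymbol{\alpha}}$, and therefore the quality of the model is measured partially based on the bias and variance of $\hat{\boldsymbol{\alpha}}$. In this context, following the ordinary ridge regression method, the variance-covariance matrix of $\hat{\boldsymbol{\alpha}}$ can be approximated as follows:

$$\mathrm{Cov}(\hat{\boldsymbol{\alpha}}) \approx \sigma^2 \big( \mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P} \big)^{-1} \mathbf{B}^{\top}\mathbf{B} \big( \mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P} \big)^{-1}, \tag{3.3}$$

and the covariance matrix of the fitted values of the model can then be given by

$$\mathrm{Cov}(\hat{\mathbf{f}}) = \sigma^2 \mathbf{H}\mathbf{H}^{\top}. \tag{3.4}$$

Because $\sigma^2$ is generally unknown, it needs to be estimated from the residuals of the fitted model (Equation (3.5)), where $\mathrm{tr}(\cdot)$ indicates the sum of the diagonal elements of a matrix. Additionally, the bias of $\hat{\boldsymbol{\alpha}}$ is one of the quality measurements for the estimated model. In order to calculate the bias, the conditional expected value of the estimator has to be obtained:

$$E(\hat{\boldsymbol{\alpha}} \mid \mathbf{B}) = \big( \mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P} \big)^{-1} \mathbf{B}^{\top}\mathbf{B}\, \boldsymbol{\alpha}. \tag{3.6}$$

Following on from Equation (3.6), the bias can be written as

$$\mathrm{Bias}(\hat{\boldsymbol{\alpha}}) = E(\hat{\boldsymbol{\alpha}} \mid \mathbf{B}) - \boldsymbol{\alpha} = -\lambda \big( \mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P} \big)^{-1} \mathbf{P}\boldsymbol{\alpha}. \tag{3.7}$$

In this study, Equations (3.3)-(3.5) and (3.7) are used as quality measures to evaluate the performance of the estimated right-censored model. In addition, the mean squared error (MSE), which is commonly employed in the literature, is also used for measuring the quality of the fitted model (Equation (3.8)).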
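The step from the conditional expectation to the bias can be written out explicitly. The sketch below assumes the ridge-type form $\hat{\boldsymbol{\alpha}} = (\mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P})^{-1}\mathbf{B}^{\top}\mathbf{Z}_{\hat{G}}$ with the penalty matrix $\mathbf{P}$ treated as fixed, and assumes $E(\mathbf{Z}_{\hat{G}} \mid \mathbf{B}) = \mathbf{B}\boldsymbol{\alpha}$, which is how the unbiasedness of the synthetic responses is used here:

```latex
\begin{aligned}
E(\hat{\boldsymbol{\alpha}} \mid \mathbf{B})
  &= (\mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P})^{-1}\mathbf{B}^{\top}\mathbf{B}\,\boldsymbol{\alpha} \\
  &= (\mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P})^{-1}
     \big[(\mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P}) - \lambda\mathbf{P}\big]\boldsymbol{\alpha} \\
  &= \boldsymbol{\alpha} - \lambda(\mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P})^{-1}\mathbf{P}\boldsymbol{\alpha},
\\[4pt]
\mathrm{Bias}(\hat{\boldsymbol{\alpha}})
  &= E(\hat{\boldsymbol{\alpha}} \mid \mathbf{B}) - \boldsymbol{\alpha}
   = -\lambda(\mathbf{B}^{\top}\mathbf{B} + \lambda\mathbf{P})^{-1}\mathbf{P}\boldsymbol{\alpha}.
\end{aligned}
```

Under these assumptions the bias vanishes when $\lambda = 0$ or when $\mathbf{P}\boldsymbol{\alpha} = \mathbf{0}$, that is, when the true coefficients have no penalized differences.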

Extended Properties of the Estimator
The modified A-spline estimator introduced in this paper is a smoothing technique that allows for the optimal selection of basis functions, penalties, the number of knots, and the location of knots. It achieves this by using an adaptive (weighted) ridge penalty that approximates the $L_0$-norm. In this section, some large sample properties of the modified A-spline estimator are given under right-censoring. It is worth noting that the theoretical properties of the A-spline estimator have not been deeply inspected in the literature. There have been some important studies about adaptive ridge estimators, such as [25], [21], and [20]. This section provides some initial inferences about the A-spline estimator in a nonparametric context and under censoring. Before we describe the asymptotic properties of the estimator, it should be emphasized that the flexibility of the A-spline estimator, which allows the choice of penalty and knot points, causes difficulties in the theoretical inferences. The A-spline estimator is a specialized version of the P-splines proposed by [26]. Its major difference is that the A-spline changes the penalty terms using weights that are obtained iteratively and that approximate the $L_0$-norm. Because of this, some assumptions and inferences are derived based on the known properties of P-splines.
The main function of the A-spline estimator is given in Equation (2.9), which can be rewritten as follows:

$$\sum_{i=1}^{n} \Big\{ z_{i\hat{G}} - \sum_{j} \alpha_j B_{j,q}(x_i) \Big\}^2 + \lambda \sum_{j} \big\| \Delta^{q+1}\alpha_j \big\|_p, \tag{3.9}$$

where $\|\cdot\|_p$ denotes the $L_p$-norm. To obtain substantial results, for this study, we assume that $p \to 0$ because solving the $L_0$-norm problem directly requires complex calculations. Accordingly, minimizing Equation (3.9) has good potential both for estimating the $\alpha_j$'s and for determining the optimal knot points, in the manner of model selection, for sufficiently large $\lambda > 0$. As is known from the literature, model selection with $p \to 0$ is realized by penalizing non-zero parameters, which is a limiting case of the bridge estimation introduced by [27] (Equation (3.10)). For $p \ge 1$, the objective function in Equation (3.9) has a convex structure, and its global minimum can be obtained easily by using numerical algorithms. However, when $p \to 0$ and $\lambda > 0$, the criterion in Equation (3.9) is no longer convex and its computation is non-trivial. In the $L_0$-norm context, there is no guarantee of reaching a global minimum. Moreover, more than one local minimum could exist. Thus, there is no unique solution for this estimator, and the result depends on the iterative process. In [21], it is shown that a minimum of 5 and a maximum of 40 iterations provide reasonable convergence of the estimator to the real parameters.

When the estimator is inspected asymptotically, calculations about asymptotic consistency can still be carried out even though its objective function in Equation (3.9) is non-convex. In this case, the following condition is assumed:

$$\mathbf{C}_n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{b}_i \mathbf{b}_i^{\top} \to \mathbf{C}, \tag{3.11}$$

where $\mathbf{b}_i^{\top}$ denotes the $i$th row of $\mathbf{B}$ and $\mathbf{C}$ is a non-negative definite matrix, and it is also assumed that

$$\frac{1}{n}\max_{1 \le i \le n} \mathbf{b}_i^{\top}\mathbf{b}_i \to 0. \tag{3.12}$$

In general, the explanatory variables included in $\mathbf{B}$ are scaled. Accordingly, all of the diagonal elements of $\mathbf{C}_n$ are equal to 1. Note that $\mathbf{C}_n$ and $\mathbf{C}$ must be assumed to be nonsingular matrices; consequently, they are full rank matrices, which ensures identifiability. Using the conditions in Equations (3.11)-(3.12), the limiting behavior of the estimator $\hat{\boldsymbol{\alpha}}$ can be observed by inspecting the asymptotic behavior of the minimization problem in Equation (2.9).
To examine the consistency of $\hat{\boldsymbol{\alpha}}$, a limiting objective function is considered, under which $\hat{\boldsymbol{\alpha}}$ is a consistent estimator for $\boldsymbol{\alpha}$. This result is confirmed by the following theorem, in which the limiting random vector is formed from the random error terms $\varepsilon_i$.

Simulation Study
In this section, a simulation study is carried out to examine the behavior of the modified A-spline estimator when estimating the right-censored nonparametric model. The datasets for the different simulation combinations are generated using the "simcensdata" function in the R software, which can be accessed via this link: https://github.com/yilmazersin13/simcensdata-generating-randomly-right-censored-data.
Our data generation procedure, with accompanying descriptions, is given in Table 1.

Figure 1. Scatter plots of both censored data and incomplete response data points over the smooth functions to be estimated by A-splines.

Panel (a) and Panel (b) represent two different datasets that were formed based on the nonlinear functions $f_1$ and $f_2$. The plots of Figure 1 are drawn for $n = 100$ and a censoring level (CL) of 20%. It should be noted that the optimal selection of the number and positions of the knots is extremely important for the functions represented in these panels. In the context of synthetic data transformation, censored data points take zero values and uncensored points take values larger than their observed ones. In this case, deciding the properties of the knots is crucial.
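As a rough illustration of how randomly right-censored data of this kind can be produced, the following Python sketch draws responses and independent censoring times and records the indicator and the achieved censoring level. It is not the "simcensdata" procedure and does not reproduce Table 1: the sine test function, the noise levels, the normal censoring distribution, and the `censor_shift` parameter are all hypothetical choices made only for this example.

```python
import math
import random


def simulate_right_censored(n, censor_shift, seed=0):
    """Generate (x, y, c, z, delta, level) with z = min(y, c) and
    delta = 1 when y <= c. A larger censor_shift moves the censoring
    times upward and therefore lowers the censoring level."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    # Hypothetical smooth function plus normal errors (illustrative only)
    y = [math.sin(2 * math.pi * xi) + rng.gauss(0.0, 0.3) for xi in x]
    # Censoring times drawn independently of the responses
    c = [censor_shift + rng.gauss(0.0, 1.0) for _ in range(n)]
    z = [min(yi, ci) for yi, ci in zip(y, c)]
    delta = [1 if yi <= ci else 0 for yi, ci in zip(y, c)]
    level = 1.0 - sum(delta) / n  # proportion of censored observations
    return x, y, c, z, delta, level


for shift in (0.0, 1.0, 3.0):
    *_, level = simulate_right_censored(100, shift, seed=1)
    print(f"censor_shift={shift}: censoring level = {level:.2f}")
```

In practice the location of the censoring distribution would be tuned until the achieved level matches a target such as 5%, 20%, or 40%.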
From the data generation procedure given above, the right-censored nonparametric model can be written as $z_i = f_h(x_i) + \varepsilon_i$. Then, to use the censoring information in the estimation process, a synthetic data transformation is applied, as in Equation (2.3). Therefore, the final model to be estimated in the simulation experiments is

$$z_{i\hat{G}} = f_h(x_i) + \varepsilon_{i\hat{G}}, \qquad i = 1, \ldots, n, \quad h = 1, 2. \tag{4.1}$$

In this simulation study, three sample sizes, three censoring levels, and two functions yield 18 configurations. All the outcomes for the model in Equation (4.1) under these conditions are given in the following figures and tables.

Table 2 presents the scores of all the evaluation metrics for each of the simulation configurations. The results are inspected from three aspects of the estimation performance of the A-spline estimator: the effects of the sample size, the censoring level, and the shape of the data. For the first aspect, it can be seen from the table that all of the evaluation metrics decrease when the sample size increases. This can be interpreted as empirical support for the asymptotic convergence, which is one of the main purposes of this simulation study. This interpretation is consistent across all censoring levels. The censoring level naturally affects the performance of the estimator in the opposite direction to the sample size; however, the key point, which makes this paper significant, is how the estimator reacts to variation in the censoring level. If the scores are inspected carefully, it can be seen that there are no large differences between low and high censoring levels, which can also be seen in the figures given below. This indicates that the A-spline estimator mitigates the effect of the censoring level on selecting the optimal knot points, as expected. Finally, two different function types are used in this paper. $f_1$ has a shape that is similar to that of a sine function and is not hard to capture for any smoothing technique. $f_2$ is an almost linear function but has one big peak; this is a challenge for the estimator, especially under censoring. The outcomes in Table 2 demonstrate this. Although the results for $f_1$ are smaller than those for $f_2$, the A-spline estimator shows a satisfactory performance for both datasets.

Table 3 presents the comparative outcomes for the introduced modified A-spline estimator and the commonly used smoothing spline (SS) and regression spline (RS) methods. The best scores are indicated in bold. The results indicate that the modified A-spline estimator shows the best performance overall. Additionally, as mentioned in the introduction, RS has smaller MSEs than SS. The introduced method thus gives more satisfying results, which can be explained by its adaptive nature. If Table 3 is inspected carefully, it can be seen that for one of the two functions the RS method has attractive outcomes when the censoring level is 40%. This is understandable given the shape of the function.

Figure 2 aims to show how the modified A-spline estimator behaves when the sample size is exceedingly small, under various censoring levels. The estimation of $f_1$ is easier than that of $f_2$, as discussed above, and Figure 2 shows this more clearly. In addition, the method is successful even for extremely small sample sizes ($n = 35$). This is an important contribution of the method for right-censored data, because in medical datasets, and especially in clinical observations, many data may frequently be unobtainable.

To that end, fitted curves are shown for a moderate sample size together with the lowest and the highest censoring levels, 5% and 40%. As we expected, the A-spline estimator demonstrates its ability to handle data with zero values obtained by synthetic data transformation, and it can be clearly seen that there is a difference between the two graphs.
This inference is also supported by the results in Table 2.

Figure 5 depicts bar plots of the measurement tools for both the estimated A-spline coefficients and the estimated model. In each panel, A1.5%, A1.20%, and A1.40% denote the obtained scores of the evaluation metric for censoring levels of 5%, 20%, and 40%, respectively, for $n = 35$. In a similar manner, A2.5%, A2.20%, and A2.40% represent the scores for $n = 100$ at all the censoring levels, and A3.5%, A3.20%, and A3.40% denote the results for $n = 350$ at all the censoring levels. The top panels of Figure 5 include bar plots for the bias values. As in Table 2, it can be seen here that the biases for the two models are very similar and, as expected, become smaller in larger samples. The panels in the middle show bar plots for the variances of the coefficients. The plots appear similar for the two models, but, as noted before, because the estimation of Model II is more difficult than that of Model I, the range of the y-axis is considerably wider. The panel at the bottom is drawn for the MSE values of the estimated model, and it is similar to the variance plots. Essentially, these plots indicate that the A-spline estimator can estimate the model while overcoming the effect of censoring in terms of various evaluation metrics.

Real Data Application
This section shows the performance of the modified A-spline estimator on real right-censored data. The dataset contains data from colon cancer patients in İzmir. It involves the survival times, the censoring indicator ($\delta$), and the albumin values (albumin is the most common protein found in the blood) of the patients. To provide continuity, the logarithms of the survival times are considered as the response variable ($Y$), and albumin is taken as the nonparametric covariate ($X$). The right-censored regression model is thus the nonparametric model of the log survival times on albumin. The dataset contains information for 97 patients to be used for this analysis. However, the records of 32 of these patients are incomplete, containing right-censored observations; the data of the remaining 65 patients are uncensored (deceased). Consequently, in this dataset, the censoring level is CL = 32.98%. The outcomes calculated for the model in Equation (5.2) are given in the following table and figure.

Table 4 summarizes the performance of the modified A-spline estimator. Note that the values of the evaluation metrics are better than the results of Aydın and Yilmaz (2018), who previously used regression splines to model right-censored data. In addition, to provide a fairer comparison, the results of the RS and SS methods are given in the table. The results are similar to the simulation results. Here, the A-spline gives the best score, which supports the benefit of the introduced method. Additionally, when the shape of the dataset is inspected in Figure 6, it can be described as irregular, and this irregularity increases after synthetic data transformation, which is demonstrated by the blue dots in the figure. Despite this challenging case, the A-spline fit seems to represent the data well.

Concluding Remarks
This paper demonstrates that a modified A-spline estimator can be used to estimate the right-censored nonparametric regression model successfully. This is because it uses an adaptive procedure for determining the penalty term and works with only the optimum knot points. A simulation study and a real data example were carried out to demonstrate the performance of the method, and our findings show that the modified A-spline estimator has merit for the estimation of right-censored data.
In the numerical examples, increases in the sample size improve the performance of the method, which gives results closer to the real observations. Moreover, changes to the censoring level also influence the goodness of fit, and, as expected, when the censoring level increases, the performance of the method is negatively affected. However, there is an important difference here in terms of the modified A-spline. The main purpose of using this method is to diminish the effect of censoring on the modeling process, and most of our results show that the introduced method achieves this purpose. For an example of these results, see Table 2. In the simulation study, two different function types are used to generate the model. $f_1$ follows the familiar pattern of a sine curve and is not difficult to estimate for any smoothing method. $f_2$ is somewhat more difficult to handle, especially for the smoothing techniques that use all data points as knots. In this paper, it can be seen that for almost all of the simulation configurations, the modified A-spline estimator gives very close values in terms of all evaluation metrics.
The real-world application uses the dataset of colon cancer patients. Their survival times are estimated by using albumin values in their blood. Figure 6 and Table 4 show the outcomes of this study. As mentioned above, our method performs well despite the irregularly scattered data points. The confidence interval given by the shaded region in Figure 6 seems wide because synthetic data transformation puts censored points (as zeros) far from the uncensored points. Considering the mentioned properties, the modified A-spline estimator can be regarded as a robust estimator for right-censored datasets. As a result of this study, we recommend the modified A-spline estimator as appropriate for modeling clinical datasets.