Modified Local Linear Estimators in Partially Linear Additive Models with Right-Censored Data Based on Different Censorship Solution Techniques

This paper introduces a modified local linear estimator (LLR) for partially linear additive models (PLAM) when the response variable is subject to random right-censoring. In the case of modeling right-censored data, PLAM offers a more flexible and realistic approach to the estimation procedure by involving multiple parametric and nonparametric components. This differs from the widely used partially linear models that feature a univariate nonparametric function. The LLR method is employed to estimate unknown smooth functions using a modified backfitting algorithm, delivering a non-iterative solution for the right-censored PLAM. To address the censorship issue, three approaches are employed: synthetic data transformation (ST), Kaplan–Meier weights (KMW), and the kNN imputation technique (kNNI). Asymptotic properties of the modified backfitting estimators are detailed for both ST and KMW solutions. The advantages and disadvantages of these methods are discussed both theoretically and practically. Comprehensive simulation studies and real-world data examples are conducted to assess the performance of the introduced estimators. The results indicate that LLR performs well with both KMW and kNNI in the majority of scenarios, a finding also supported by the real data example.


Introduction
Partially linear models (PLMs) have gained considerable attention in the field of survival analysis, especially for modeling right-censored data. The flexibility and capability of PLMs to capture both parametric and nonparametric components make them a favored choice for analyzing survival data with complex relationships. The classical PLM is expressed as follows for completely observed data with a sample size n:

y_i = x_i^T β + f(t_i) + ε_i,  i = 1, ..., n,  (1)

where the y_i's are the completely observed response values (or lifetimes in survival analysis), the x_i ∈ R^p are the parametric covariates, β = (β_1, ..., β_p)^T denotes the (p × 1)-dimensional vector of regression coefficients, and f(·) is the univariate unknown smooth function to be estimated based on the values of the nonparametric covariate t_i. Finally, the ε_i's are random error terms satisfying (i) ε_i ∼ N(0, σ²_ε), (ii) Cov(ε_i, x_i) = 0, and (iii) E[ε_i | x_i, t_i] = 0. Without censored data, model (1) has been studied by many researchers; notable studies include [1,2], among others. Additionally, ref.
[3] proposed the local linear regression (LLR) estimation for model (1). In the right-censored case, the response variable y_i is incompletely observed and censored from the right by a random censoring variable {c_i}_{i=1}^n under the assumption that x_i and t_i are completely observed. Accordingly, the censoring mechanism and some new variables can be written as

z_i = min(y_i, c_i),  δ_i = I(y_i ≤ c_i),  i = 1, ..., n,  (2)

where z_i denotes the incompletely observed response variable with the censoring indicator δ_i. Thus, instead of y_i, the data pairs {z_i, δ_i} are used in the modeling procedure. There are several important studies on the estimation of model (1) under right-censored data as given in (2), such as refs. [4-6], among others. While model (1) offers reliable performance for both censored and uncensored data due to its ability to incorporate both parametric and nonparametric components, it contains only a single nonparametric component. This constraint forces researchers to select a sole nonparametric covariate from the dataset, a premise that might not align with many real-world situations. Furthermore, adhering to this limitation could result in less dependable estimates unless the dataset genuinely contains only one nonparametric covariate. To improve estimation accuracy and provide a more adaptable model that accommodates the right-censored response variable z_i, this research considers the partially linear additive model (PLAM), tailored for q nonparametric functions:

y_i = x_i^T β + ∑_{j=1}^q f_j(t_ij) + ε_i,  i = 1, ..., n.  (3)

Here, q represents the number of nonparametric components, a value determined by the nature of the relationship between t_ij and y_i. When this relationship cannot be adequately captured by a linear parametric component, the covariate is treated as nonparametric, characterized by an unknown smooth function f_j(t_ij). As a result, the overall nonparametric component of model (3) is formed by the summation of these functions. The use of PLAMs in survival analysis with right-censored data allows for more realistic
modeling of the relationship between covariates and survival outcomes by incorporating multiple parametric and nonparametric components. By introducing nonparametric components, PLAMs provide a more adaptable framework for capturing potential nonparametric relationships between covariates and survival times. It is crucial to acknowledge that model (3) cannot be estimated unless the censorship problem is suitably addressed. Numerous studies in the literature have concentrated on estimating (3) for data that are fully observed and devoid of any censoring. Ref. [7] discussed the combination of smoothing splines with semiparametric additive models, while ref. [8] studied the asymptotic properties of M-estimators for model (3). Additionally, ref. [9] presented a comprehensive review of partially linear additive models based on various smoothing techniques.
Distinct from the studies previously mentioned, this paper presents modified LLR estimators for PLAM (3) using three distinct censoring solutions: synthetic data transformation (ST), Kaplan-Meier weights (KMW), and kNN imputation (kNNI). Through the examination of these modified estimators and the exploration of various techniques to tackle censorship, valuable insights can be gained, and the accuracy and effectiveness of modeling right-censored data may be improved. This paper also explains the procedure for obtaining these estimators, encompassing the modified backfitting technique and a non-iterative approach, accompanied by comparative numerical studies. To the best of our knowledge, this research fills a gap in the literature on modeling right-censored data.
The remaining part of the paper is organized as follows: In Section 2, the fundamentals of right-censored data are presented, and solution approaches are explained. Section 3 covers the estimation of PLAM using modified LLR estimators based on various censorship solution techniques. In Section 4, the statistical properties of the estimators are provided. Sections 5 and 6 present simulation and real data studies, respectively. Finally, Section 7 includes the conclusions of the paper.

Right-Censored Data and Solution Methods
In this section, we provide theoretical insights into modeling right-censored data. Let F and G represent the probability distribution functions of the observed response variable (y_i) and the censoring variable (c_i), respectively. Thus, for any arbitrary data point "u", these functions can be expressed as

F(u) = P(y_i ≤ u),  G(u) = P(c_i ≤ u).  (4)

It is essential to highlight that the estimation procedure for the model, utilizing the specified distributions (4), critically relies on two "censorship assumptions" that constrain all variables within model (2). These assumptions, as outlined by ref. [10] and elaborated by ref. [11] in the context of right-censored regression models, are of central importance. In essence, the dataset must meet the following criteria.
A1. y i and c i are independent.

A2. P(y_i ≤ c_i | x_i, t_i, y_i) = P(y_i ≤ c_i | y_i).
Assumptions (A1) and (A2) can be explained as follows: (A2) posits that the covariates in the model carry no information about the censorship in y_i. Assumption (A1) is particularly crucial when implementing censorship solutions. For a more in-depth discussion, one can refer to ref. [10]. Drawing on these details, this section presents the three censorship solutions. Additionally, toward the close of the section, a figure is shown to illustrate the practical distinctions between the synthetic data transformation and kNN imputation methods.
Synthetic data transformation: To incorporate the impact of censorship into the modeling procedure, synthetic data transformation is a commonly employed solution. The incomplete response pairs {(z_i, δ_i), i = 1, ..., n} are replaced by a synthetic response variable, as proposed by ref. [12]. Assuming that G is a continuous and known function, the observed lifetimes z_i can be modified in a manner that ensures unbiased estimation:

z_iG = δ_i z_i / (1 − G(z_i)),  (5)

where z_iG represents the synthetic response variable with E[z_iG | x_i, t_i] = E[y_i | x_i, t_i]. Nevertheless, the true distribution G of the censoring variable remains unknown in practice. To address this challenge, ref. [12] suggested replacing G with its estimated version, the Product-Limit (Kaplan-Meier) estimator, which computes the survival probabilities at an arbitrary positive data point "u" as

1 − Ĝ(u) = ∏_{i: z_(i) ≤ u} ((n − i)/(n − i + 1))^{1 − δ_(i)},  (6)

where z_(1) ≤ ... ≤ z_(n) are the sorted values of the right-censored response variable and the δ_(i) are the censoring indicators associated with the z_(i). Hence, Ĝ(z_i) is used instead of G(z_i) in (5), and z_Ĝ = (z_1Ĝ, ..., z_nĜ)^T can be obtained to fit the PLAM.
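As a concrete illustration, the transformation in (5) with G replaced by its Kaplan-Meier estimate can be sketched as follows. This is a minimal Python sketch: the function names and the small toy sample are ours, and the convention used for evaluating 1 − Ĝ at the observed points is one of several in use.

```python
import numpy as np

def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate of the censoring survival 1 - G, evaluated
    at each observed z_i (censored points, delta == 0, act as events)."""
    n = len(z)
    order = np.argsort(z)
    d_s = delta[order]
    i = np.arange(1, n + 1)
    surv_sorted = np.cumprod(((n - i) / (n - i + 1.0)) ** (1 - d_s))
    surv = np.empty(n)
    surv[order] = surv_sorted          # map back to the original order
    return surv

def synthetic_responses(z, delta):
    """Koul-Susarla-Van Ryzin transform: z_iG = delta_i z_i / (1 - G(z_i))."""
    return delta * z / km_censoring_survival(z, delta)

z = np.array([2.0, 3.5, 1.2, 4.1, 2.8])
delta = np.array([1, 0, 1, 1, 0])      # 0 marks a censored observation
z_syn = synthetic_responses(z, delta)
# censored points map to 0; uncensored points are inflated by 1/(1 - G_hat)
```

Note that the inflation of large uncensored observations compensates, on average, for the censored lifetimes that are mapped to zero, which is what makes the transformed response unbiased for the regression function.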
Kaplan-Meier weights: Kaplan-Meier weights (KMW), introduced by ref. [13], are a technique used in survival analysis to address the issue of right-censored data. The Kaplan-Meier estimator is the prevalent nonparametric approach for estimating survival probabilities amidst censoring. Nonetheless, using standard regression techniques directly on censored data can lead to biased outcomes. Stute (1993) addressed this by presenting Kaplan-Meier weights, derived from the Kaplan-Meier survival probabilities for each data point. These weights adjust the contribution of each observation in the regression analysis, effectively accounting for the censoring mechanism. By incorporating the Kaplan-Meier weights into the regression model, unbiased estimates of the regression coefficients can be obtained.
Before computing the KMW, assume that z_(i) denotes the ordered values of the incomplete response and that x_(i)^T, δ_(i), and t_(i) = (t_(i)1, ..., t_(i)q) are the correspondingly ordered values. Then the Kaplan-Meier weight w_(i), associated with z_(i), is computed from the Kaplan-Meier estimator F̂(z_(i)) given in (6) as

w_(i) = δ_(i)/(n − i + 1) ∏_{j=1}^{i−1} ((n − j)/(n − j + 1))^{δ_(j)},  (7)

and the KMW for all possible values of z_i are collected in a diagonal matrix W = diag(w_(1), ..., w_(n)). For further information about (7) and implementing these weights in regression models, see refs. [5,6].
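The weights in (7) can be computed directly from the ordered censoring indicators. The sketch below is a plausible Python implementation; the function name and the toy data are ours.

```python
import numpy as np

def kaplan_meier_weights(z, delta):
    """Stute-type Kaplan-Meier weights for the ordered sample:
    w_(i) = delta_(i)/(n - i + 1) * prod_{j<i} ((n - j)/(n - j + 1))^delta_(j).
    Weights vanish at censored points and sum to at most 1."""
    n = len(z)
    order = np.argsort(z)
    d_s = delta[order]
    w = np.zeros(n)
    prod = 1.0
    for i in range(1, n + 1):          # 1-based index over ordered data
        w[i - 1] = d_s[i - 1] / (n - i + 1) * prod
        prod *= ((n - i) / (n - i + 1.0)) ** d_s[i - 1]
    return w, order                    # weights aligned with sorted z

z = np.array([2.0, 3.5, 1.2, 4.1, 2.8])
delta = np.array([1, 0, 1, 1, 0])
w, order = kaplan_meier_weights(z, delta)
W = np.diag(w)    # diagonal weight matrix W = diag(w_(1), ..., w_(n))
```

In this toy sample the largest observation is uncensored, so the weights sum exactly to one; in general some mass can be lost to a censored maximum.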
kNN imputation method: kNN imputation is a prevalent technique for addressing missing data across various domains, as discussed by researchers including ref. [14]. Additionally, some studies, such as ref. [15], have adapted the kNN imputation method to manage right-censored data. This method allows for the practical estimation of right-censored data points without the constraints of theoretical limitations. In this context, we provide a succinct overview of the kNN imputation technique and an algorithm tailored for the PLAM dataset. Essentially, the kNN method is a machine learning technique that hinges on the similarity between data points, utilizing distance metrics for predictions. The choice of a suitable similarity measure can greatly impact the results, and the Euclidean norm is commonly employed as the distance measure. For censored data points, it can be computed as

d(x^c_j, x^c_i) = ‖x^c_j − x^c_i‖,  i, j = 1, ..., n_c,

where n_c is the number of censored data points and x^c_j and x^c_i denote the j-th and i-th values of a regressor that is strongly correlated with the response variable z_i. Details are provided in Algorithm 1. For imputation, the algorithm introduced by ref. [15] can be employed. The choice of the appropriate number of neighbors, "k", is pivotal, especially given the possibility of some neighbors being right-censored. While ref.
[16] suggests a smaller value for "k", such as 1 or 2, an optimal "k" between 2 and 10 is chosen in this context to minimize the mean squared error (MSE). This approach ensures precision in imputation, taking into account the distinct attributes of the data. As previously mentioned, Figure 1 has been created to illustrate the practical distinctions between the manipulative solution techniques, namely ST and kNNI. This visualization provides insights into how these methods impact the response variable and the changes they bring about. It should be noted that the effect of KMW is not demonstrated in the figure since it is incorporated into the objective function of the right-censored PLAM as weights. However, further explanation regarding KMW will be provided in the next section when obtaining the modified LLR estimators.
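Under the stated setup, a minimal version of the imputation step might look as follows. This is a hypothetical sketch: the neighbor-averaging rule and the lower-bound safeguard are our assumptions, not the exact algorithm of ref. [15].

```python
import numpy as np

def knn_impute_censored(z, delta, x, k=3):
    """Replace each censored z_i by the mean of its k nearest uncensored
    neighbours, where distance is the Euclidean norm on a covariate x
    assumed to be strongly correlated with z. The max() keeps the
    censored value as a lower bound: the true lifetime cannot be
    smaller than the observed censored time z_i."""
    z_imp = np.asarray(z, float).copy()
    uncens = np.where(delta == 1)[0]
    for i in np.where(delta == 0)[0]:
        dist = np.abs(x[uncens] - x[i])          # Euclidean norm in 1-D
        nearest = uncens[np.argsort(dist)[:k]]
        z_imp[i] = max(z[i], z[nearest].mean())
    return z_imp

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = np.array([1.1, 2.0, 1.5, 4.2, 5.1])          # z[2] is censored
delta = np.array([1, 1, 0, 1, 1])
z_imp = knn_impute_censored(z, delta, x, k=2)
```

In practice "k" would be varied over 2 to 10 and the value minimizing the MSE of the resulting fit retained, as described above.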

Algorithm 1 kNN Imputation for Right-Censored Data (fragment)
5: Find the distances between x_j and x_i for each censored data point
6: Sort the distances from small to large
7: for (j = …)

Fundamentals of PLAM
Before explaining the modified LLR estimators, this section provides a concise overview of the fundamental concepts of PLAM and summarizes the steps involved in utilizing the backfitting algorithm. Additionally, we express the right-censored PLAM (3) in vector and matrix form as

Z = Xβ + ∑_{j=1}^q f_j + ε,  (8)

where Z = (z_1, ..., z_n)^T, X = (x_1, ..., x_n)^T, f_j = (f_j(t_1j), ..., f_j(t_nj))^T, and ε = (ε_1, ..., ε_n)^T. The literature offers only a handful of studies specifically addressing the right-censored partially linear additive model (PLAM). In terms of estimating model (8), ref. [17] presented the primary optimization problem for the nonparametric additive model, which means Xβ = 0 in model (8), and ref. [18] formulated a similar least-squares problem for (8), given in (10). Accordingly, the solution expression for the j-th function f_j in the objective (10) can be written as

f_j = S_j (Z − Xβ − ∑_{l≠j} f_l),

and, based on this statement, the following equation system can be used for the general solution of the model. Let S_1, ..., S_q be smoothing matrices obtained from the LLR procedure. Then, the equation system for the estimation of model (8) is

[ I    S_1  ...  S_1 ] [ f_1 ]   [ S_1 (Z − X β̂) ]
[ S_2  I    ...  S_2 ] [ f_2 ] = [ S_2 (Z − X β̂) ]  (11)
[ ...  ...  ...  ... ] [ ... ]   [ ...            ]
[ S_q  S_q  ...  I   ] [ f_q ]   [ S_q (Z − X β̂) ]

where β̂ denotes the coefficients estimated by LLR, as shown in Section 3.2. For further details on (11), refer to ref. [9]. The solution of system (11) effectively yields the estimates of the functions {f_j}_{j=1}^q. However, inverting the (nq × nq) matrix on the left-hand side of (11), which comprises the smoothing matrices, becomes infeasible if its dimension is sufficiently large. As the dimension grows, solving (11) becomes progressively more challenging, potentially reaching a point where it cannot be directly addressed (refer to ref. [18]).
Hence, in practical applications, system (11) is typically solved using the backfitting method, incorporating initial-valued components denoted f_j^(0). Consequently, the LLR estimators are derived by the modified backfitting algorithm given at the end of Section 3.

Local Linear Regression
Local linear regression (LLR) is a widely employed smoothing technique for nonparametric, semiparametric, and additive models. Its effectiveness has been demonstrated across diverse domains, such as medical research, engineering, and the analysis of time-to-event (or survival) data in time-series studies. In this section, we present three LLR estimators for the partially linear additive model (PLAM) described in (8), employing the introduced censorship solution methods. These estimators are derived using a modified backfitting algorithm. LLR is a kernel-based method that differs from kernel regression in that it locally fits a line rather than a constant. To illustrate the working procedure of LLR, consider a partially linear model with a univariate function, q = 1, as given in (1), involving an unknown smooth function f(·). The key concept of LLR is to estimate model (1) linearly within small input intervals. To estimate the parameters of (1), the backfitting algorithm introduced by ref. [19] is used. Accordingly, the backfitting estimators (β̂, f̂) for model (1), where f̂_1 = (f̂_1(t_1), ..., f̂_1(t_n))^T, can be obtained by replacing the corresponding matrices S_h1 and H_1 in Algorithm 2. Here, S_h1 is computed based on the bandwidth parameter h_1 > 0 for LLR and is formed using the nonparametric variables t_1i.
In order to adapt the LLR method for estimating the parameters of the right-censored PLAM, a closer examination of the elements of the smoother matrix S_hj is required. Let each S_hj, j = 1, ..., q, be written in open form as S_hj = (s_j1, ..., s_jn)^T, where s_j1, ..., s_jn are the row vectors of S_hj obtained from the values of the j-th nonparametric covariate t_j = (t_j1, ..., t_jn)^T.
From the theory of LLR, s_jr^T for any t_j1 ≤ m ≤ t_jn can be obtained as

s_jr^T = d_1^T (T_jm^T W_jm T_jm)^{-1} T_jm^T W_jm,

where T_jm is the local design matrix with rows (1, t_jr − m), d_1 = (1, 0)^T selects the local intercept, and W_jm = diag{K((t_jr − m)/h_j)} contains the kernel weights. Based on this, the extension of LLR estimators to PLAM requires further adjustments. Moreover, it is crucial to satisfy the standard assumptions of LLR: K(·) is a continuous kernel function whose moments satisfy µ_j(K) ≡ ∫ u^j K(u) du = 0 for odd j, with µ_2(K) ≠ 0. The density of the t_ji satisfies g_t(m) > 0 for all m ∈ supp(g_t), and, as a common assumption, n → ∞, h → 0, and nh → ∞. Finally, the second derivative of the nonparametric smooth function f(·) exists and is continuous. These assumptions are discussed in detail in ref. [20].
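The row-by-row construction of S_hj described above can be sketched as follows. This is a Python sketch with a Gaussian kernel (the paper's kernel choice is an assumption here); a useful sanity check is that local linear smoothing reproduces linear functions exactly.

```python
import numpy as np

def local_linear_smoother(t, h):
    """Build the n x n local linear smoother matrix S_h with a Gaussian
    kernel: row r holds the weights s_r^T of the local linear fit at t_r."""
    n = len(t)
    S = np.empty((n, n))
    for r in range(n):
        u = (t - t[r]) / h
        w = np.exp(-0.5 * u ** 2)                    # kernel weights K(u)
        T = np.column_stack([np.ones(n), t - t[r]])  # local design matrix
        A = T.T @ (w[:, None] * T)                   # T' W T   (2 x 2)
        B = T.T * w                                  # T' W     (2 x n)
        S[r] = np.linalg.solve(A, B)[0]              # d_1' (T'WT)^{-1} T'W
    return S

t = np.linspace(0.0, 1.0, 25)
S = local_linear_smoother(t, h=0.2)
y_lin = 2.0 * t + 1.0
exact = np.allclose(S @ y_lin, y_lin)   # local linear reproduces lines
```

This exact reproduction of linear trends is precisely what distinguishes LLR from local constant (Nadaraya-Watson) smoothing, which only reproduces constants.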
In the backfitting estimation procedure, to simplify the definition of model (8), some restrictions on the {f_j(t_ij)}_{j=1}^q are needed. First, E[f_j(t_ij)] = 0 is assumed. Second, the parametric covariates x_i^T and the right-censored response values z_i are assumed to be centered around zero. These constraints are necessary to construct the centered smoother matrix S_hj used in the LLR estimation. Thus, the conditional expectation of model (8) can be expressed as

E[y_i | x_i, t_i] = x_i^T β + ∑_{j=1}^q f_j(t_ij).

By using the modified backfitting algorithm given in Algorithm 2, solutions can be obtained based on S_hj for the PLAM parameters β and {f_j}_{j=1}^q. Thus, without any censoring adjustment, the PLAM estimators (β̂, f̂) based on LLR are obtained.
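The backfitting loop described above (without any censoring adjustment) can be sketched as follows. The simple kernel smoother used as a stand-in for S_hj, the toy data, and the convergence rule are our assumptions.

```python
import numpy as np

def backfit_plam(X, Z, smoothers, tol=1e-8, max_iter=200):
    """Minimal backfitting sketch for Z = X beta + sum_j f_j + eps.
    `smoothers` holds precomputed n x n smoother matrices, one per
    nonparametric covariate. Returns beta and the additive fits."""
    n, q = len(Z), len(smoothers)
    f = [np.zeros(n) for _ in range(q)]
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        beta_old = beta.copy()
        # parametric step: least squares on the partial residual
        beta = np.linalg.lstsq(X, Z - sum(f), rcond=None)[0]
        # nonparametric steps: smooth each partial residual, then center
        for j in range(q):
            r_j = Z - X @ beta - sum(f[l] for l in range(q) if l != j)
            f[j] = smoothers[j] @ r_j
            f[j] -= f[j].mean()          # identifiability: mean-zero f_j
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta, f

# toy PLAM with one smooth component
rng = np.random.default_rng(1)
n = 150
X = rng.standard_normal((n, 2))
t = np.sort(rng.uniform(0.0, 1.0, n))
D = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.05) ** 2)
S = D / D.sum(axis=1, keepdims=True)     # simple kernel smoother stand-in
f1 = np.sin(2 * np.pi * t) - np.mean(np.sin(2 * np.pi * t))
Z = X @ np.array([1.0, -0.5]) + f1 + 0.1 * rng.standard_normal(n)
beta_hat, f_hat = backfit_plam(X, Z, [S])
```

With x_i independent of t_i, as assumed throughout, the parametric step recovers β accurately even with a crude stand-in smoother.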
Furthermore, it should be noted that ref. [20] presented a non-iterative formulation equivalent to the backfitting algorithm based on an additive smoother matrix S_A = ∑_{j=1}^q S*_j to demonstrate the LLR estimation process in the absence of censorship issues, which reveals the relationship between Z and f̂_A = ∑_{j=1}^q f̂_j. Here, S*_j is computed from the equation system (11) based on S_hj (see ref. [9]). Additionally, this information elucidates the connection between a unique solution and the iterative backfitting process.
Accordingly, LLR estimators for PLAM can be found for both ST and kNNI by replacing Z with Z_ST and Z_imp, respectively, in the non-iterative formulation.

Algorithm 2 Modified Backfitting Algorithm for Right-Censored PLAM
1: Initialize f_j^(0), j = 1, ..., q, and set i = 0
2: while (tol ≥ 0.05) and (i < max.iteration)
Selection of the optimal bandwidth parameter h_j by GCV between steps 3-8:
3: Create a sequence of tuning parameters h_seq = [0.01, 1.5] of determined length
4: for (l in 1 : length) do
5: Compute the smoothing matrix S^(l)
6: Compute X̃ and H^(l)
…
12: Select the optimal ĥ_j which minimizes GCV(h_j) for the j-th function f_j
13: Compute S_ĥj for each criterion (and method)
Solution of the censorship problem between steps 14-25:
14: if the censorship solution is kNNI
15: Replace Z with Z_imp using Algorithm 1
…
28: end
29: Return β̂ and f̂_1, ..., f̂_q
30: end

For the KMW solution, non-iterative estimators are obtained as

β̂_KMW = (X̃^T W X̃)^{-1} X̃^T W Z̃,  f̂_A^KMW = S_A (Z − X β̂_KMW),

where X̃ = (I − S_A)X and Z̃ = (I − S_A)Z. It should be noted that the validity of Equations (14)-(17) depends on the existence of a unique solution. Furthermore, the vector of fitted values for LLR can be expressed as

μ̂ = X β̂ + f̂_A = H_A Z,  (18)

where H_A denotes the corresponding hat matrix. Note that under completely observed data, H_A is derived by ref. [21] for the LLR estimator of PLAM.
To effectively demonstrate and interpret each nonparametric component individually, the introduced modified backfitting algorithm is more suitable than Equations (16)-(18), which yield an additive outcome for the nonparametric component. Additionally, computing S_A becomes significantly challenging as the dimension of the additive component increases. In this paper, the modified backfitting estimators (β̂_A, f̂_A) of LLR, obtained through Algorithm 2, are employed. This approach aims to showcase the performance of the estimated functions f̂ = {f̂_j}_{j=1}^q. In Algorithm 2, to calculate the GCV selection criterion, the degrees of freedom (DF) are computed as

DF_j = tr[(I − H_j)^T (I − H_j)] = n − 2 tr(H_j) + tr(H_j^T H_j),

where H_j denotes the hat matrix based on the j-th nonparametric component. For further details about Algorithm 2, see ref. [9].

Properties of the Estimator
The objective of this section is to assess the bias and variance of the modified LLR estimators introduced in the previous section. When evaluating the performance of the parametric component, the variances and biases of the regression coefficients are calculated using the non-iterative solutions given in Equations (14)-(17), owing to their theoretical simplicity.
Empirical studies can be conducted to calculate the bias and variance properties of the estimators. However, for LLR as demonstrated in Equations (14)-(17), the non-iterative formulations can be employed to compute the finite-sample properties for the other two methods as well. Accordingly, the conditional bias E[β̂_A − β | X, t] and variance Var(β̂_A) are obtained based on Equations (14)-(17).
Let us rewrite β̂_A as

β̂_A = (X̃^T X̃)^{-1} X̃^T Z̃,  (19)

where S_A = ∑_{j=1}^q S*_j, X̃ = (I − S_A)X, and Z̃ = (I − S_A)Z. Then B(β̂_A) and Var(β̂_A) can be given by

B(β̂_A) = E[β̂_A | X, t] − β,  Var(β̂_A) = σ̂²_ε (X̃^T X̃)^{-1} X̃^T (I − S_A)(I − S_A)^T X̃ (X̃^T X̃)^{-1}.  (20)

For the KMW solution, Equations (19) and (20) take their weighted analogues, with X̃^T W X̃ and X̃^T W Z̃ in place of X̃^T X̃ and X̃^T Z̃. Here, σ̂²_ε is the model variance estimated based on LLR; it can be computed using the hat matrix H_A, or H_A^KMW for the KMW solution, as defined after Equation (18). In addition, one can replace Z by Z_ST or Z_imp. Accordingly, σ̂²_ε is formulated as

σ̂²_ε = ‖Z − H_A Z‖² / DF_A,  DF_A = n − 2 tr(H_A) + tr(H_A^T H_A),  (23)

where H_A^KMW is used in DF_A for the KMW solution. For further details on DF_A, see ref. [17]. The modified backfitting algorithm provided in Algorithm 2 requires the estimation of the model variance for each individual nonparametric function in order to calculate the GCV score for bandwidth parameter selection. Consequently, if H_A is replaced by H_j (or H_j^KMW) in (23), the individual variance estimator σ̂²_εj is easily obtained. The fundamental concept behind computing σ̂²_εj lies in selecting the appropriate smoothing and bandwidth parameters using the GCV criterion, as it relies on the estimated model variance. The GCV criterion can be summarized as follows.
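The variance estimator in (23) can be illustrated with a short sketch. A convenient sanity check, used below, is that for a projection-type hat matrix the DF reduces to the familiar n − p of ordinary least squares.

```python
import numpy as np

def model_variance(Z, H):
    """sigma^2_hat = ||(I - H) Z||^2 / DF with
    DF = tr((I - H)'(I - H)) = n - 2 tr(H) + tr(H'H)."""
    n = len(Z)
    resid = Z - H @ Z
    df = n - 2.0 * np.trace(H) + np.trace(H.T @ H)
    return (resid @ resid) / df

# sanity check: for an OLS projection matrix, DF = n - p exactly
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
H = X @ np.linalg.inv(X.T @ X) @ X.T     # idempotent hat matrix, tr(H) = 3
Z = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(50)
s2 = model_variance(Z, H)                # equals RSS / (50 - 3)
```

For a genuine smoother matrix H_A, which is not idempotent, the two trace terms no longer cancel and the full DF_A expression is needed.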
GCV criterion: Generalized cross-validation is used to obtain a minimum score based on the optimal tuning parameter for the regression model. In terms of bandwidth selection in additive models with LLR, ref.
[22] presented a detailed study of GCV and its properties. Accordingly, to choose the optimal h_j for the j-th function f_j, the GCV(h_j) score can be computed based on μ̂ given in (18) as

GCV(h_j) = n ‖Z − μ̂‖² / (n − tr(H_j))²,  (24)

where H_j is the hat matrix obtained for f_j, as provided at the end of Section 3.
Notice that calculating the true DF_j in PLAM is asymptotically justifiable if the parametric and nonparametric covariates (x_i, t_j) are independent. If multicollinearity is present, Equation (24) may need to be properly regularized, since DF_j can be overestimated.
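A bandwidth search over the grid [0.01, 1.5] driven by a GCV score of this form might be sketched as follows; the kernel smoother used as a stand-in for the LLR hat matrix and the toy data are our assumptions.

```python
import numpy as np

def gcv_score(Z, H):
    """GCV(h) = n ||(I - H) Z||^2 / (tr(I - H))^2, one common form of (24)."""
    n = len(Z)
    resid = Z - H @ Z
    return n * (resid @ resid) / (n - np.trace(H)) ** 2

def select_bandwidth(t, Z, grid):
    """Return the bandwidth in `grid` minimizing the GCV score."""
    best_h, best = None, np.inf
    for h in grid:
        D = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
        H = D / D.sum(axis=1, keepdims=True)   # kernel smoother stand-in
        s = gcv_score(Z, H)
        if s < best:
            best_h, best = h, s
    return best_h

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 1.0, 80))
Z = np.sin(2.0 * np.pi * t) + 0.2 * rng.standard_normal(80)
h_opt = select_bandwidth(t, Z, np.linspace(0.01, 1.5, 30))
```

The score balances the residual sum of squares against the effective degrees of freedom tr(H), penalizing both under- and oversmoothing, so the selected bandwidth lies strictly inside the grid for data like these.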

Metrics for the Parametric Component
In this section, two metrics are presented to assess the performance of the LLR estimator of the parametric component β: the scalar version of the mean dispersion error (SMDE) and the relative efficiency (RE), which is computed as a ratio of SMDE values. The formulations are given below:

SMDE(β̂) = tr[MSE(β, β̂)],  (25)

where MSE(β, β̂) is expressed as the sum of the squared bias and the variance of β̂:

MSE(β, β̂) = B(β̂) B(β̂)^T + Var(β̂).  (26)

Then, using (25), the REs of the methods for estimating β can be computed. In this paper, the REs are used to compare the censorship solution techniques.
Let β̂_1 and β̂_2 represent the estimates of the parametric component based on two different censorship solutions. Accordingly, the RE can be formulated as

RE(β̂_1, β̂_2) = SMDE(β̂_1) / SMDE(β̂_2),  (27)

where RE(β̂_1, β̂_2) < 1 indicates that β̂_1 is more efficient than β̂_2.

Metrics for the Nonparametric Component
To evaluate the quality of the estimated nonparametric component, two measures are presented. The first is the root mean squared error (RMSE), which measures the accuracy of each individual estimated function in the model. The second is the averaged root mean squared error (ARMSE), which is specifically designed to assess the performance of the overall additive component f̂ = (f̂_1, ..., f̂_q). The formulations of RMSE and ARMSE are written as

RMSE(f_j, f̂_j) = [ (1/n) ∑_{i=1}^n (f_j(t_ij) − f̂_j(t_ij))² ]^{1/2}  (28)

and

ARMSE(f, f̂) = [ (1/n) ∑_{i=1}^n (f(t_i) − f̂(t_i))² ]^{1/2},  (29)

where f = ∑_{j=1}^q f_j and f̂ = ∑_{j=1}^q f̂_j.
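The two measures can be coded directly; this is a small sketch and the function names are ours.

```python
import numpy as np

def rmse(f_true, f_hat):
    """RMSE of one estimated function over the observed design points."""
    f_true, f_hat = np.asarray(f_true, float), np.asarray(f_hat, float)
    return np.sqrt(np.mean((f_true - f_hat) ** 2))

def armse(f_true_list, f_hat_list):
    """ARMSE: RMSE of the overall additive component f = sum_j f_j,
    computed from per-function arrays stacked in two lists."""
    return rmse(np.sum(f_true_list, axis=0), np.sum(f_hat_list, axis=0))
```

Note that ARMSE can be small even when individual RMSE values are not, since estimation errors of the q functions can partially cancel in the sum.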

Simulation Study
The practical performance of the modified LLR estimators in the context of right-censored PLAM with various censorship solution methods is analyzed in this section. To achieve this, different settings for the sample size (n), the number of additive nonparametric components (q), and the level of censoring (CL) are considered. Specifically, three sample sizes (n = 50, 100, and 200) and three levels of censoring (CL = 5%, 20%, and 35%) are chosen. A total of eight scenarios are obtained by combining these configurations. Additionally, a total of 24 cases for analysis are formed by using the three censorship solution methods. Moreover, accelerated failure time (AFT) model estimation results are presented as benchmark performance scores; for this existing method, the survival library in R is used. Note that the R code written for this paper is provided via the link https://github.com/yilmazersin13/Censored-Partially-linear-additive-models/tree/main, accessed on 9 August 2023. The simulation design and setup used in this study follow a structure commonly found in the literature (see ref. [4]). Small, medium, and large sample sizes are chosen, along with three different censoring levels, in accordance with reference articles. Furthermore, the nonparametric component count has been determined in two distinct ways, introducing a novel approach that differs from most similar studies (see ref. [9]).
After establishing the design, the data generation procedure for the right-censored PLAM is outlined here. First, the PLAM with completely observed responses is generated as

y_i = x_i^T β + ∑_{j=1}^2 f_j(t_ij) + ε_i,  i = 1, ..., n,  (30)

where x_i^T = (x_i1, x_i2)^T forms the (n × 2)-dimensional parametric covariate matrix with normally and independently distributed entries generated as x_i ∼ N(µ_x = 0, σ²_x = 1). Also, the vector of regression coefficients is set to β = (1, −0.5)^T. Regarding the nonparametric component, two smooth functions are generated for q = 2. Note that, because all the variables are scaled in the simulation study, the constant term α_0 is not used throughout the section. Finally, the random error terms ε_i are independent and identically distributed with zero mean and constant variance, ε_i ∼ N(0, σ²_ε = 0.5). After generating (30), by applying the censorship procedure given in Algorithm 3, the right-censored response variable Z is generated based on the random censoring variable C = (c_1, ..., c_n)^T and the censoring indicator δ = (δ_1, ..., δ_n)^T.
Algorithm 3 Censoring Procedure
Input: completely observed y_i. Output: right-censored dependent variable z_i.
1: For the given censoring level (CL), produce δ_i = I(y_i ≤ c_i) from the binomial distribution
2: for (i in 1 to n)
3: …
Else
13: z_i = c_i
14: end (for loop in Step 9)

Then, the right-censored PLAM is obtained with the incomplete response variable Z = (z_1, ..., z_n)^T. Accordingly, the following figures and tables are provided based on the censorship solution techniques. Tables 1 and 2 present the results for the performance of the parametric component estimation, specifically the SMDE and RE values, respectively. In addition, as a benchmark method, the performance of AFT model estimation based on Cox's semiparametric proportional hazards (CPH) estimator is provided in both the simulation and real data examples. The estimates are obtained using the "survival" package in R.
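A generic right-censoring generator that targets a prescribed censoring level can be sketched as follows. This is a convenience construction using exponential censoring times and bisection, not the paper's Algorithm 3 (which draws the indicator binomially); it assumes positive lifetimes.

```python
import numpy as np

def right_censor(y, cl, rng=None):
    """Generate (z_i, delta_i) = (min(y_i, c_i), I(y_i <= c_i)) with a
    censoring level of approximately `cl`. C_i ~ Exp(mean m); since the
    expected censoring proportion mean_i(1 - exp(-y_i/m)) is monotone
    decreasing in m, the mean m is found by bisection on a log scale."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y, float)
    lo, hi = 1e-6, 1e6
    for _ in range(100):
        m = np.sqrt(lo * hi)
        if np.mean(1.0 - np.exp(-y / m)) > cl:
            lo = m          # too much censoring -> increase the mean
        else:
            hi = m
    c = rng.exponential(m, size=len(y))
    z = np.minimum(y, c)
    delta = (y <= c).astype(int)
    return z, delta

rng = np.random.default_rng(42)
y = rng.exponential(2.0, size=500)
z, delta = right_censor(y, cl=0.35, rng=rng)
```

The realized censoring proportion fluctuates around the target with binomial noise of order n^{-1/2}, which is negligible at the sample sizes used in the simulations.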
Prior to presenting the findings, we offer a visual representation in Figure 2 that elucidates the process of bandwidth selection across diverse scenarios. This illustration shows how the choice of bandwidth is intertwined with the extent of censoring and the specific methods employed for addressing censorship. Note that for f_1, the selection of bandwidth appears to exhibit a lesser degree of sensitivity to variations in the level of censoring and sample size. However, in the case of the f_2 function, it becomes clear that the level of censorship exerts a discernible influence on the chosen bandwidth value. Notably, when confronted with elevated censorship levels across all solution strategies, a preference for smaller bandwidths becomes evident. This outcome is intuitively reasonable since, especially in scenarios involving ST and kNNI, the structural complexity of the data to be fitted takes on a more undulating nature. Therefore, accounting for the degree of censorship is a pivotal factor in bandwidth selection. These findings resonate with prior research in this domain. Ref. [23] demonstrated similar behavior in a related context, highlighting the sensitivity of bandwidth to censorship levels. In line with the in-depth investigations of ref.
[24], our observations underscore the need for cautious bandwidth selection in scenarios characterized by substantial censorship, promoting the accurate modeling of intricate data structures. The results in Table 1 demonstrate that the estimation quality of the modified LLR estimators for the parametric component β improves with lower censoring levels and larger sample sizes across all censorship techniques. These tendencies align with the expected theoretical behavior. Specifically, the LLR-KMW estimator exhibits dominant performance in many simulation combinations, closely followed by the LLR-kNNI estimator with competitive SMDE scores. However, the LLR-ST does not yield good performance. Also, as a benchmark method for the model, SMDE scores of the CPH estimator are presented in the table. It is evident that due to the model involving serious complexity with two different nonparametric functions, there is a significant distance between the LLR-based estimators and the CPH estimator, which is expected. Interestingly, in cases where n = 50 and CL = 5% or CL = 20%, the LLR-kNNI estimator outperforms the LLR-KMW estimator. As the sample size increases, LLR-KMW takes the lead, in accordance with its theoretical behavior. It is worth noting that due to its fully nonparametric nature, LLR-kNNI may yield better results under different configurations, demonstrating relative independence from specific simulation settings. This characteristic is observed in the combination of n = 200 and CL = 20%.
Additionally, to assess the impact of censorship on the solution techniques, the increase in SMDE scores between censorship levels is examined. The results indicate that the LLR-ST estimator is the most affected by censorship, which aligns with the theoretical background of ST presented in Section 2.
In Table 2, the calculation of the RE scores follows the convention that the numerators correspond to the columns and the denominators to the rows. Therefore, an RE value of less than 1 in Table 2 indicates that the method in the column is more effective than the method in the corresponding row. Please note that, for the sake of saving space, only certain simulation configurations are considered in Table 2. The results in the table confirm that LLR-KMW is more efficient than LLR-ST in all cases. Simultaneously, LLR-KMW and LLR-kNNI exhibit similar outcomes, indicating that neither is distinctly more efficient in any simulation configuration for estimating the parametric component of the PLAM. Furthermore, when the censoring level is very high (CL = 35%), the RE scores deviate from 1, making the performance differences among the LLR estimators based on the solution techniques more apparent. Once again, it is evident that, especially for n = 50, ST is the most sensitive technique to censorship compared with the other two methods. Additionally, the results reveal that LLR-kNNI and LLR-KMW display similar RE scores in every combination. In addition, in Table 2, the REs of CPH show a clear dominance of the LLR-based estimators for the estimation of the right-censored PLAM. This result also proves that the introduced estimator has important potential as an alternative for the model of interest in survival analysis.
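As a small illustration of this column/row convention, an RE table can be reproduced from a set of SMDE scores. The values below are purely illustrative and are not taken from the paper's tables:

```python
# Hypothetical SMDE scores for the four estimators in one simulation
# configuration (illustrative values only, not taken from the paper).
smde = {"LLR-ST": 0.182, "LLR-KMW": 0.121, "LLR-kNNI": 0.127, "CPH": 0.410}

def re_matrix(scores):
    """Pairwise relative efficiencies: RE[row][col] = score(col) / score(row).

    Following the convention of Table 2, a value below 1 means the
    column estimator is more efficient than the row estimator.
    """
    names = list(scores)
    return {r: {c: scores[c] / scores[r] for c in names} for r in names}

re_scores = re_matrix(smde)
# LLR-KMW (column) vs LLR-ST (row): below 1, so KMW is the more efficient one.
print(round(re_scores["LLR-ST"]["LLR-KMW"], 3))  # -> 0.665
```

With these illustrative scores, every column entry for CPH exceeds 1 against the LLR-based rows, mirroring the dominance pattern reported in Table 2.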
In Figure 3, the averaged values of the RE scores are displayed, confirming the interpretations from Table 2.

After analyzing the parametric component, the estimation of the additive nonparametric components is presented in Tables 3 and 4. Table 3 displays the RMSE values computed for the individual functions, while Table 4 provides the ARMSE values for all simulation configurations, serving as a measure of the overall performance in estimating the nonparametric component of the right-censored PLAM. Upon initial examination, the LLR-KMW estimator demonstrates a significantly superior performance compared with the other two estimators across all simulation configurations. This dominance is further evidenced by the ARMSE results presented in Table 4, which contrasts with the outcomes observed in the parametric component estimation. An interesting distinction in estimating the nonparametric component is that the performances of the introduced estimators deteriorate as the sample size increases. To explain this phenomenon, it is crucial to note that in the estimation of PLAMs, there exists a balance between the estimation of parametric and nonparametric components, which exhibits an inverse relationship. Furthermore, when data points are scattered widely around the representative smooth curve, the bias of the fitted curve increases. Additionally, the RMSE scores for the three modified LLR estimators are fairly similar to each other, confirming that the modified backfitting algorithm functions effectively with the censorship solution techniques.
Table 4 presents a strong case, confirming the dominant role of the LLR-KMW estimator in estimating nonparametric components within the context of the right-censored PLAM. The success of the LLR-KMW estimator lies in its use of weighted estimation, which works well for both the parametric and nonparametric aspects of the PLAM. Notably, the LLR-KMW estimator does not just improve β estimates; it also works well together with the LLR-kNNI estimator, forming a powerful estimation pair. When Table 4 is analyzed carefully alongside Figures 4 and 5, a clear pattern emerges. Both the LLR-KMW and LLR-kNNI estimators perform very similarly when it comes to estimating the nonparametric component, and both outperform the LLR-ST estimator, as the figures clearly demonstrate. In terms of estimating nonparametric components, it is naturally expected that the CPH estimator does not show a good performance due to its theoretical structure. However, its behavior under changes in sample size and censoring level is similar to that of the LLR-based estimators. In summary, the introduced LLR-based estimators show better performance than the classical CPH estimator.
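For clarity, the RMSE and ARMSE criteria used in Tables 3 and 4 can be sketched as follows. The "true" functions and fits below are toy placeholders, and the sketch assumes ARMSE is simply the average of the per-function RMSE values:

```python
import numpy as np

def rmse(f_true, f_hat):
    """Root mean squared error between a true curve and its fit on a grid."""
    f_true, f_hat = np.asarray(f_true), np.asarray(f_hat)
    return float(np.sqrt(np.mean((f_true - f_hat) ** 2)))

def armse(pairs):
    """Average RMSE over the additive components f_1, f_2, ...

    `pairs` is a list of (true_values, fitted_values) tuples, one per
    nonparametric function, evaluated on a common grid.
    """
    return float(np.mean([rmse(ft, fh) for ft, fh in pairs]))

# Toy example with two assumed additive components on [0, 1].
t = np.linspace(0, 1, 200)
f1, f2 = np.sin(2 * np.pi * t), t ** 2        # illustrative "true" functions
f1_hat, f2_hat = f1 + 0.05, f2 - 0.05         # fits with a constant bias
print(round(armse([(f1, f1_hat), (f2, f2_hat)]), 3))  # -> 0.05
```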
Figure 4 illustrates the behavior of the estimators under different censoring levels with fixed sample sizes. In panels (a)-(b), the effect of the censoring level is investigated when the sample size is small (n = 50). It can be observed that while f 2 (t 2 ) is not significantly affected, the estimate of f 1 (t 1 ) is heavily influenced by the censored data points. It is important to note that this inference is also related to the initial values β (0) , f (0) determined in the algorithm and their compatibility with the unknown functions f 1 and f 2 , respectively (see [9] for further discussions). Furthermore, the results demonstrate that the weakness of the LLR-ST estimator (red dotted line) is clear in all four panels (a), (b), (c), and (d), for both n = 50 and n = 200. Additionally, panels (c) and (d) support the
findings of Tables 3 and 4, leading to the conclusion that, for larger sample sizes, the fitted curves become more sensitive to the censoring level, resulting in a decrease in their performance.
Figure 5 investigates the effect of sample size (n) for fixed censoring levels in the upper and lower panels. Particularly for CL = 35% in panels (c) and (d), LLR-KMW and LLR-ST exhibit a slightly more pronounced response to increasing sample size compared with LLR-kNNI. This result is expected due to the nonparametric nature of kNNI. Furthermore, the changes observed in the fitted curves are more noticeable for the estimation of f 1 (t 1 ), as shown in Figure 4. Additionally, the differences between sample sizes for the lower censoring level (CL = 5%) in panels (a)-(b) indicate that there is minimal variation between the fitted curves for both functions.
These trends are consistent with the findings reported by ref. [25], where a similar sensitivity of the ST-based estimator to sample size was identified in a related context. The reaction of the kNNI, KMW, and ST estimators to sample size fluctuations aligns with the observations made by ref. [26], reinforcing the notion that these estimators can exhibit greater flexibility in accommodating varying sample sizes.
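The kNNI idea referred to above can be sketched in a simplified form: a censored response is replaced by the average response of its k nearest uncensored neighbours in covariate space. The implementation below is a minimal illustration and may differ in detail from the kNNI procedure used in the paper:

```python
import numpy as np

def knn_impute(y, delta, X, k=5):
    """Impute censored responses from their k nearest uncensored neighbours.

    y     : observed (possibly censored) responses
    delta : censoring indicators (1 = uncensored, 0 = censored)
    X     : covariate matrix used to measure closeness
    A censored lifetime is replaced here by the mean response of the k
    closest uncensored subjects; uncensored responses are left unchanged.
    """
    y, delta, X = np.asarray(y, float), np.asarray(delta), np.asarray(X, float)
    y_imp = y.copy()
    unc = np.where(delta == 1)[0]                 # indices of uncensored cases
    for i in np.where(delta == 0)[0]:             # loop over censored cases
        d = np.linalg.norm(X[unc] - X[i], axis=1)
        nearest = unc[np.argsort(d)[:k]]          # k closest uncensored points
        y_imp[i] = y[nearest].mean()
    return y_imp
```

Because the imputation depends only on local neighbourhoods, its behaviour is largely independent of any specific model structure, which is consistent with the relative insensitivity of LLR-kNNI to the simulation settings noted above.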
To assess the performance of the introduced modified LLR estimators on real-world data and compare them with the simulation results, a real data example is presented in the following section, focusing on the hepatocellular carcinoma dataset.

Hepatocellular Carcinoma Data Example
In this section, the Hepatocellular Carcinoma dataset is modeled using the modified LLR estimators: LLR-ST, LLR-KMW, and LLR-kNNI. Their performances are compared with similar simulation configurations presented in Section 5. The dataset was originally presented by ref. [27] to investigate the gene expression of CXCL17 in hepatocellular carcinoma. Ref. [6] also studied this dataset, comparing parametric and semiparametric models on right-censored data. However, their study focused on a semiparametric model with a univariate nonparametric component using the covariate age. This paper considers a more realistic partially linear additive model (PLAM) that involves two nonparametric covariates.
The dataset consists of 227 data points and five explanatory variables: age, recurrence-free survival (RFS), CXCL17T (CXCT), CXCL17P (CXCP), and CXCL17N (CXCN). It should be noted that the logarithm of the response variable, overall survival time (OS), is used in this analysis. The parametric component of the PLAM is determined by the covariates CXCT, CXCP, and CXCN. Regarding the bias of β, as anticipated, both ST and KMW yield lower values compared with kNNI, as they theoretically promise less biased estimates. Overall, the performance evaluation in Table 6 confirms that LLR-KMW exhibits the best results, which are evident from the RE scores. In both Tables 5 and 6, the performance of the benchmark CPH estimator is also provided and, as expected, it does not show a good performance, especially in the estimation of the nonparametric component. On the other hand, in terms of bias, Table 5 shows that CPH has satisfactory bias values but with large variances that cause large SMDE scores. This poor performance is largely due to the inability of CPH to represent smooth functions, and the RE scores confirm this inference. In summary, the comprehensive assessment in Table 6 affirms the leading position of the LLR-KMW estimator, as reflected by its RE scores.
In Figure 7, bar plots of the calculated relative efficiencies (RE) are presented. Consistent with the findings in Table 5, LLR-KMW exhibits lower RE scores compared with the other two estimators, which aligns with the results of the simulation study. It is worth noting that while the difference in performance between the estimators may appear significant, numerically they are relatively close to each other, with the RE values scattered around one.

After assessing the estimation of the parametric component, Figure 8 presents the results of the estimation of the nonparametric components f 1 (Age) and f 2 (RFS). It is noteworthy that in this dataset, the relative failure of LLR-kNNI and the relative success of LLR-ST can be attributed to the structure of the nonparametric components. Both functions f 1 and f 2 exhibit favorable structures for the properties of LLR-ST, such as magnifying the magnitudes of uncensored data points and assigning zero to censored ones, as clearly observed in panel (ii) of Figure 8.
To provide a more precise understanding of the solution procedures, the ST points and kNNI points are also included in the plots. These points illustrate why the fitted curves tend to lie below the region where all data points are scattered, especially in panel (ii). This is primarily influenced by the heavy censoring level, CL = 37%. Additionally, in panel (i), one can observe the LLR-ST's fitted curve being pulled down by the zeros. As expected, LLR-KMW follows a balanced approach between the other two estimators, as shown in Table 5, yielding the smallest ARMSE scores in the estimation of the nonparametric component of the PLAM.
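The ST points (zero for censored observations, inflated values for uncensored ones) and the Kaplan-Meier weights discussed above can be sketched as follows. This is a minimal illustration assuming no ties, using a Koul-Susarla-Van Ryzin-type synthetic transformation and Stute-type KM weights; the paper's exact definitions may differ:

```python
import numpy as np

def st_and_kmw(y, delta):
    """Synthetic responses and Kaplan-Meier weights for right-censored data.

    y     : observed responses; delta : 1 = uncensored, 0 = censored.
    Returns (y_star, w) in the original observation order.
    """
    y, delta = np.asarray(y, float), np.asarray(delta, int)
    n = len(y)
    order = np.argsort(y)                 # sort once; map back at the end
    d = delta[order]
    j = np.arange(1, n + 1)
    # Product-limit estimate of the censoring survival function G,
    # evaluated just below each order statistic (left-continuous version).
    G_factors = ((n - j) / (n - j + 1.0)) ** (1 - d)
    G_left = np.concatenate(([1.0], np.cumprod(G_factors)[:-1]))
    y_star_sorted = d * y[order] / G_left          # ST: censored points -> 0
    # Kaplan-Meier (Stute) weights: jumps of the KM estimator of F.
    F_factors = ((n - j) / (n - j + 1.0)) ** d
    w_sorted = d / (n - j + 1.0) * np.concatenate(([1.0], np.cumprod(F_factors)[:-1]))
    y_star, w = np.empty(n), np.empty(n)
    y_star[order], w[order] = y_star_sorted, w_sorted
    return y_star, w
```

For fully uncensored data the transformation leaves the responses unchanged and every weight reduces to 1/n, which is a convenient sanity check. With censoring, uncensored responses are magnified by 1/G and censored ones are set to zero, exactly the behaviour that pulls the LLR-ST fitted curve downwards in panel (i).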


Conclusions
This paper introduces three modified LLR estimators based on different censorship solutions: ST, KMW, and kNNI, to model the right-censored PLAM. For the solution methods that have a theoretical background, such as ST and KMW, the statistical properties and some asymptotic properties of LLR-ST and LLR-KMW are presented. This study pursues two main objectives: to combine the backfitting LLR estimator with the censorship solutions, and to compare the resulting estimators both theoretically and practically. The performances of the modified LLR estimators are observed through simulation and real data studies. The following conclusions have been drawn from this study:

•
In the simulation study, the performance of the estimators is measured individually for both parametric and nonparametric components. Regarding the parametric component estimation, it is observed that LLR-KMW provides the best results, followed by LLR-kNNI. On the other hand, LLR-ST does not yield good results for any simulation configuration, and it is the estimator most affected by censorship, as its performance changes dramatically when the censoring level increases. In this case, LLR-KMW can be considered the most robust estimator, as it reacts to censorship in a more balanced way compared with the other two. In addition, the introduced estimators are also compared with the benchmark estimator for the survival model, CPH. It is observed that the LLR-based estimators perform better than the CPH, as discussed in Section 6.

•
In the estimation of the nonparametric components, the effects of sample size and censoring level are clearly different compared with the parametric component. However, similar to the parametric component, LLR-KMW exhibits dominant performance for both nonparametric functions. It is noteworthy that, as the sample size increases, all three estimators tend to provide closer performances in terms of fitted curves. Furthermore, it should be noted that the performance of the introduced estimators is highly dependent on the structure of the nonparametric component and its compatibility with the chosen censorship solution. Hence, this paper investigates the three different solutions in detail. Ultimately, because the CPH model does not involve a smoothing structure, it falls short when compared with the newly introduced estimators.

•
The analysis of the Hepatocellular Carcinoma data serves as a real-world example in this study. This dataset is selected due to its censoring level and sample size, which align closely with one of the simulation configurations (n = 200 and CL = 35%), enabling a more realistic comparison. The results of the real data modeling demonstrate that the three introduced modified LLR estimators effectively handle the estimation of the right-censored PLAM for both parametric and nonparametric components. They exhibit a good level of agreement with the corresponding simulation configuration, with some minor differences. As expected, LLR-KMW yields the best results. Also, CPH does not show a good performance except in the bias of the regression coefficients, as observed in the simulation study. Notably, one important difference between the real data and the simulation study is that LLR-ST surprisingly outperforms LLR-kNNI in the estimation of both parametric and nonparametric components. However, this discrepancy can be attributed to the relatively large sample size (n = 227), and it does not imply inconsistency with the simulation results. On the contrary, it indicates a close agreement among all performances.

Figure 1 .
Figure 1. Working procedures of ST in panel (A) and kNNI in panel (B) for generated data.



Figure 2 .
Figure 2. Selection of bandwidth parameter (h) for different scenarios and censorship solution methods when n = 50.In each panel, (i) and (ii) involve the selection processes for f 1 (t 1 ) and f 2 (t 2 ), respectively.
The figure also shows both the effects of censorship and the sample size. In panel (a), the RE values are very close to each other due to the very low censoring level (CL = 5%). Panels (b) and (c) demonstrate the change in RE scores as the censoring level increases, with the differences between the estimators becoming more distinct, as mentioned earlier. Consequently, the LLR-kNNI and LLR-KMW estimators are more efficient than the LLR-ST estimator. In panel (c), the performances are once again close to each other, reflecting the large sample size (n = 200).

Figure 3 .
Figure 3. Bar plots of averaged RE scores.


Figure 4 .
Figure 4. Fitted curves to show the effect of the censoring level (CL). In each panel, (i) and (ii) show fitted curves for f 1 (t 1 ) and f 2 (t 2 ), respectively.



Figure 5 .
Figure 5. Fitted curves to show the effect of the sample size (n). In each panel, (i) and (ii) show fitted curves for f 1 (t 1 ) and f 2 (t 2 ), respectively.

Figure 7 .
Figure 7. Bar plots of the REs for the modified LLR estimators based on the censorship solution methods.



Figure 8 .
Figure 8. Fitted curves obtained for the Hepatocellular Carcinoma dataset. Panel (i) shows f 1 (Age) and panel (ii) shows f 2 (RFS).


Table 1 .
Calculated SMDE values for all simulation combinations.
Bold color denotes the best performance score.

Table 2 .
Comparative RE scores for the modified LLR estimators.
Bold color denotes the best performance score.


Table 3 .
RMSE values of individual nonparametric functions for both functions f 1 (t 1 ) and f 2 (t 2 ).
Bold color denotes the best performance score.