Efficient Estimation and Inference in the Proportional Odds Model for Survival Data

In modeling time-to-event data with long-term survivors, the proportional odds model is widely used for its easy and direct interpretation as well as its flexibility to accommodate past information and allow time-varying predictors. The model is most relevant when the mortality of individuals converges with time, in which case estimation and inference based on the proportional odds model can often yield more accurate and reasonable results than the classical Cox proportional hazards model. With the fast development of data science technologies, however, survival data with increasing sample sizes and diverging parameter dimensions pose computational challenges, and currently existing methods for analyzing such data are computationally inconvenient. In this paper, we propose efficient computational methods for analyzing survival data under the proportional odds model, combining the nonparametric maximum likelihood approach with the minorization-maximization (MM) algorithm and a regularization scheme to yield fast and accurate estimation and inference procedures. The methodology is illustrated through extensive simulation studies and an application to the Veterans' Administration lung cancer data.


Introduction
The proportional odds model was initially proposed by McCullagh [1,2] for analyzing ordinal data rather than censored survival data. Due to its easy and direct interpretation, it has been widely used in practice. Refs. [3,4] extended the model to fit survival data using the Newton-Raphson method. Many researchers adopted this model in survival analysis also because of its good prediction performance. Ref. [5] considered a rank-based estimation method. Ref. [6] proposed the semi-parametric proportional odds model, which is an important case of the general linear transformation model. In addition, ref. [7] employed the profile likelihood method and developed model diagnostic procedures. For interval-censored data, refs. [8,9] further introduced sieve maximum likelihood estimation and obtained consistency and asymptotic normality of the estimated parameters. Ref. [10] proposed an easy implementation of the proportional odds model based on conditional logistic regression. Ref. [11] considered a semi-parametric proportional odds model in which the baseline function can be any nondecreasing function. To improve the efficiency of parameter estimation, ref. [12] proposed a minorization-maximization (MM) algorithm for the proportional odds model, which performs well on high-dimensional data. Ref. [13] adopted a cubic spline for baseline estimation, and [14] proposed a Bayesian hierarchical model to capture spatial heterogeneity in the analysis of spatially related data. Similarly, ref. [15] proposed a semiparametric Bayesian proportional odds model in which the baseline is estimated through monotone increasing splines. The Bayesian framework was further extended to clustered and multi-event data by [16]. Ref. [17] proposed an extension of the original model to fit data with a multivariate response, a random intercept and non-linear covariate effects. Furthermore, based on interval-censored data and Bayesian estimation, ref. [18] introduced cure rate proportional odds models for the corresponding data analysis. Departing from the traditional proportional odds model, ref. [19] discussed a quantile-based definition of the model and illustrated it on real-life datasets. Ref. [20] added a log-concave constraint on the baseline distribution and developed the corresponding parameter and density estimation. Besides the MM algorithm, ref. [21] proposed an expectation-maximization (EM) scheme with good computational efficiency for both parameter and baseline estimation. In addition, ref. [22] discussed efficient estimation of the odds ratio in the proportional odds model with a censored time-lagged outcome. Ref. [23] pointed out the high computational complexity and inefficiency of existing methods for the proportional odds model with right-censored data. For example, the method of [7] cannot perform estimation appropriately when failure times are observed consecutively. In addition, ref. [13] noted that the method proposed by [9] is very complex and computationally unachievable. Moreover, the estimated baseline function from the method of [13] may not preserve the monotonicity of the natural cubic spline construction. To tackle these problems, ref. [23] developed an EM algorithm for the proportional odds model with right-censored data, where the baseline function is estimated using splines. The method can be applied to large datasets, with the AIC or BIC criterion used to determine the optimal number of knots for the natural cubic spline.
Covariates are important in regression analysis, but covariates with only a tiny impact on the response may be retained during the estimation procedure. Therefore, when constructing a regression model, it is essential to conduct variable selection. The most popular approach adds penalty functions to the original objective; the parameters are then estimated by maximizing the penalized objective. In survival analysis, the LASSO [24], SCAD [25], the adaptive LASSO [26] and the elastic SCAD are commonly applied to proportional hazards models. For the proportional odds model, only limited studies based on penalized regression have been conducted. Ref. [27] studied the application of the LASSO and the adaptive LASSO in the proportional odds model, where simulation results show that the adaptive LASSO performs better in both variable selection and parameter estimation.
For optimizing the likelihood function of the proportional odds model, the Newton-Raphson method and the EM algorithm perform well in finding roots without an explicit form. Ref. [28] first developed the MM algorithm to optimize convex functions iteratively; the EM algorithm is a special case of the MM algorithm. The MM algorithm can help avoid high-dimensional matrix inversion, separate the parameters, linearize the optimization problem and transform non-differentiable problems into smooth ones. In applications, the MM algorithm has been used to optimize objective functions from statistical models such as quantile regression [28], the Bradley-Terry model [29], zero-inflated and zero-truncated models [30,31], high-dimensional covariate selection [32], Gaussian mixture models [33] and finite mixture models. The MM algorithm has also been applied to survival analysis: ref. [34] developed an MM algorithm for the shared gamma frailty model, with estimation results that perform well in simulation and real data analysis.
The remainder of this article is organized as follows. We first review the MM algorithm in Section 2. Then, we apply the MM algorithm to the proportional odds model in Section 3. In Section 4, we further develop a variable selection procedure for the proportional odds model using the MM algorithm. Simulation studies and real data analysis are conducted in Sections 5 and 6. Finally, we summarize the performance of the proposed methods and discuss future work in Section 7.

Basic Principle
The MM algorithm performs optimization by constructing a simple surrogate function for the original objective function. Then, instead of optimizing the original objective, we optimize the surrogate function, which has a simplified expression. The MM algorithm involves two steps. The first M step is the minorization step, which constructs a proper minorization function satisfying conditions (1) and (2) below, using some commonly applied inequalities. The second M step is the maximization step, where the minorization function is maximized via the Newton-Raphson method or a quasi-Newton method; this step is simple because the constructed minorization function is usually parameter-separable.
Let ℓ(θ | Y_obs) be the log-likelihood of the observed data Y_obs, and let θ̂ be the parameter estimate obtained by maximizing this log-likelihood. Let θ^(t) denote the t-th iterate of θ, and let Q(θ | θ^(t)) be the surrogate function determined by θ^(t). If Q satisfies

Q(θ | θ^(t)) ≤ ℓ(θ | Y_obs) for all θ,   (1)
Q(θ^(t) | θ^(t)) = ℓ(θ^(t) | Y_obs),   (2)

then we call Q(θ | θ^(t)) a minorization function of ℓ(θ | Y_obs). Maximizing Q(θ | θ^(t)) can then substitute for maximizing the original objective. If Q(θ | θ^(t)) attains its maximum at θ^(t+1), then from (1) and (2),

ℓ(θ^(t+1) | Y_obs) ≥ Q(θ^(t+1) | θ^(t)) ≥ Q(θ^(t) | θ^(t)) = ℓ(θ^(t) | Y_obs).

The MM algorithm is thus a stable optimization procedure with a convergence property, since the objective increases in every iteration. The EM algorithm is a special case of the MM algorithm, whose first M step (also called the E step) calculates the expectation of the complete-data log-likelihood function.
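As a minimal, hypothetical illustration of conditions (1) and (2) (a toy objective, not the proportional odds model), the sketch below maximizes a simple concave function with a quadratic minorizer; the surrogate's closed-form maximizer gives the update, and the ascent inequality can be checked numerically at every iteration.

```python
# A minimal toy illustration of the MM ascent property (conditions (1)-(2)).
# Objective: l(theta) = -(theta - 1)^4, maximized at theta = 1.
# Surrogate: Q(theta | t) = l(t) + l'(t)(theta - t) - (L/2)(theta - t)^2,
# a quadratic minorizer of l when L bounds |l''| on the region visited
# (here |l''(theta)| = 12(theta - 1)^2 <= 48 on [1, 3], so L = 50 works).

def l(theta):
    return -(theta - 1.0) ** 4

def dl(theta):
    return -4.0 * (theta - 1.0) ** 3

def mm_step(t, L=50.0):
    # argmax_theta Q(theta | t) has the closed form t + l'(t)/L.
    return t + dl(t) / L

theta = 3.0
path = [l(theta)]
for _ in range(500):
    theta = mm_step(theta)
    path.append(l(theta))

# The ascent property: the objective never decreases across iterations.
assert all(b >= a - 1e-12 for a, b in zip(path, path[1:]))
assert abs(theta - 1.0) < 0.2
```

The update here is simply a gradient step with step size 1/L, which is what the quadratic minorizer yields; richer surrogates (such as the ones used later for the proportional odds likelihood) give richer updates, but the ascent mechanism is the same.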

Commonly Used Inequalities
The key step of the MM algorithm is to find an appropriate surrogate function for the objective with low computational cost. In applications, surrogate functions are commonly constructed using the following algebraic inequalities.
The first method is Jensen's inequality. Assume that X is a random variable and ϕ is a concave function; then

ϕ(E[X]) ≥ E[ϕ(X)].

In contrast, if ϕ is convex, then

ϕ(E[X]) ≤ E[ϕ(X)].

In applications, we use both the continuous and the discrete versions of Jensen's inequality. The continuous version is

ϕ( ∫_Ω f(x) g(x) dx ) ≥ ∫_Ω ϕ(f(x)) g(x) dx,

where Ω is a subset of the real line R, f(·) is a real-valued function defined on Ω, and g(·) is a density function on Ω. The discrete version is

ϕ( ∑_{i=1}^n a_i x_i ) ≥ ∑_{i=1}^n a_i ϕ(x_i),

where ϕ(·) is a concave function, a_i ≥ 0 and ∑_{i=1}^n a_i = 1. The second method is the inequality of arithmetic and geometric means,

∏_{i=1}^n x_i^{a_i} ≤ ∑_{i=1}^n a_i x_i,

where the x_i are nonnegative and the a_i are nonnegative weights summing to one. The left-hand side is a product of the terms x_i^{a_i} and the right-hand side is a weighted sum of the x_i, so this inequality can be used to minorize product terms by sums of separate terms.
The third method is the supporting hyperplane inequality: for a differentiable convex function f,

f(x) ≥ f(y) + f'(y)(x − y) for all x and y,

with equality at x = y.
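The three inequalities can be checked numerically on random inputs; the sketch below verifies the discrete Jensen inequality (with the concave choice ϕ(x) = log x), the weighted AM-GM inequality, and the supporting hyperplane bound for the convex function f(x) = −log x.

```python
import random
import math

random.seed(1)

for _ in range(100):
    n = 5
    x = [random.uniform(0.1, 5.0) for _ in range(n)]
    w = [random.uniform(0.1, 1.0) for _ in range(n)]
    s = sum(w)
    a = [wi / s for wi in w]  # nonnegative weights summing to one

    # Discrete Jensen with concave phi = log:
    # log(sum a_i x_i) >= sum a_i log(x_i)
    assert math.log(sum(ai * xi for ai, xi in zip(a, x))) >= \
        sum(ai * math.log(xi) for ai, xi in zip(a, x)) - 1e-12

    # Weighted AM-GM: prod x_i^{a_i} <= sum a_i x_i
    prod = math.prod(xi ** ai for ai, xi in zip(a, x))
    assert prod <= sum(ai * xi for ai, xi in zip(a, x)) + 1e-12

    # Supporting hyperplane for convex f(x) = -log x, f'(y) = -1/y:
    # f(x) >= f(y) + f'(y)(x - y)
    xv, yv = random.uniform(0.1, 5.0), random.uniform(0.1, 5.0)
    assert -math.log(xv) >= -math.log(yv) - (xv - yv) / yv - 1e-12
```

These are exactly the bounds reused in the next sections: Jensen separates sums inside a concave function, AM-GM splits products, and the supporting hyperplane linearizes convex terms at the current iterate.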

Proportional Odds Model
Let T denote the failure time, X the covariate vector of dimension p × 1, and β the corresponding coefficient vector. Suppose that, given X, T follows the proportional odds model

logit{F(t | X)} = log Λ_0(t) + X'β,

where F(t | X) denotes the cumulative distribution function of T given X, Λ_0(t) denotes the cumulative baseline function, and logit(x) = log{x/(1 − x)}. Under this assumption, we can easily obtain the survival and density functions of the proportional odds model, respectively,

S(t | X) = 1 / {1 + Λ_0(t) exp(X'β)},
f(t | X) = λ_0(t) exp(X'β) / {1 + Λ_0(t) exp(X'β)}²,

with the hazard rate

λ(t | X) = λ_0(t) exp(X'β) / {1 + Λ_0(t) exp(X'β)},

where λ_0(t) = dΛ_0(t)/dt.
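The model quantities are straightforward to compute; the sketch below uses the baseline Λ_0(t) = (t/2)² from the later simulations as an assumed example, and checks that F + S = 1 and that the logit identity defining the model holds.

```python
import numpy as np

# Sketch of the proportional odds model quantities, assuming as an example
# the baseline Lambda0(t) = (t/2)^2 used in the simulation studies.

def Lambda0(t):
    return (t / 2.0) ** 2

def lambda0(t):
    # derivative of Lambda0
    return t / 2.0

def F(t, x, beta):
    # logit F(t|X) = log Lambda0(t) + X'beta
    eta = Lambda0(t) * np.exp(x @ beta)
    return eta / (1.0 + eta)

def S(t, x, beta):
    return 1.0 / (1.0 + Lambda0(t) * np.exp(x @ beta))

def hazard(t, x, beta):
    return lambda0(t) * np.exp(x @ beta) / (1.0 + Lambda0(t) * np.exp(x @ beta))

x = np.array([0.5, -1.0, 0.2])
beta = np.array([2.0, 1.0, -3.0])
t = 1.7
# F + S = 1, and logit{F} recovers log Lambda0(t) + X'beta
assert np.isclose(F(t, x, beta) + S(t, x, beta), 1.0)
lhs = np.log(F(t, x, beta) / (1 - F(t, x, beta)))
assert np.isclose(lhs, np.log(Lambda0(t)) + x @ beta)
```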
Here, we consider the failure time T under right censoring, in a study consisting of n independent individuals. Let T_i, C_i and X_i denote the failure time, the censoring time and the covariate vector of the i-th individual, respectively, and assume that T_i and C_i are independent. Let t_i = min(T_i, C_i) denote the observed event time and δ_i = I(T_i ≤ C_i) the censoring indicator, so the observed data are {t_i, δ_i, X_i}, i = 1, ..., n. Then, the likelihood function is

L(Λ_0, β) = ∏_{i=1}^n f(t_i | X_i)^{δ_i} S(t_i | X_i)^{1−δ_i},

with the corresponding log-likelihood function

ℓ_obs(Λ_0, β) = ∑_{i=1}^n [ δ_i {log λ_0(t_i) + X_i'β} − (1 + δ_i) log{1 + Λ_0(t_i) exp(X_i'β)} ].   (3)

In Equation (3), applying the supporting hyperplane inequality to the convex function −log(1 + u) at u_i^(k) = Λ_0^(k)(t_i) exp(X_i'β^(k)), we have

−log{1 + Λ_0(t_i) exp(X_i'β)} ≥ −log(1 + u_i^(k)) − {Λ_0(t_i) exp(X_i'β) − u_i^(k)} / (1 + u_i^(k)).

Then, we can obtain the surrogate function of ℓ_obs as

Q(Λ_0, β | Λ_0^(k), β^(k)) = ∑_{i=1}^n [ δ_i {log λ_0(t_i) + X_i'β} − (1 + δ_i) Λ_0(t_i) exp(X_i'β) / (1 + u_i^(k)) ] + c,

where c is a constant not depending on (Λ_0, β). After this minorization step, we apply two methods to maximize the surrogate function derived above.
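As a numerical sanity check (a sketch with hypothetical data, treating the baseline jump sizes at the observed times as free positive quantities), one can verify that the supporting-hyperplane surrogate lies below the log-likelihood everywhere and touches it at the current iterate.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 3
X = rng.normal(size=(n, p))
delta = rng.integers(0, 2, size=n).astype(float)
lam = rng.uniform(0.1, 1.0, size=n)   # hypothetical baseline jump sizes
Lam = np.cumsum(lam)                  # hypothetical cumulative baseline values

def loglik(beta, Lam_vals):
    u = Lam_vals * np.exp(X @ beta)
    return np.sum(delta * (np.log(lam) + X @ beta) - (1 + delta) * np.log1p(u))

def surrogate(beta, Lam_vals, beta_k, Lam_k):
    # -log(1+u) >= -log(1+u_k) - (u - u_k)/(1 + u_k), by convexity of -log(1+u)
    u = Lam_vals * np.exp(X @ beta)
    u_k = Lam_k * np.exp(X @ beta_k)
    minor = -np.log1p(u_k) - (u - u_k) / (1.0 + u_k)
    return np.sum(delta * (np.log(lam) + X @ beta) + (1 + delta) * minor)

beta_k = rng.normal(size=p)
Lam_k = Lam.copy()

# Equality at the current iterate (condition (2))
assert np.isclose(surrogate(beta_k, Lam_k, beta_k, Lam_k), loglik(beta_k, Lam_k))

# Minorization everywhere (condition (1)), checked at random trial points
for _ in range(200):
    b = beta_k + rng.normal(scale=0.5, size=p)
    L2 = Lam * rng.uniform(0.5, 2.0, size=n)
    assert surrogate(b, L2, beta_k, Lam_k) <= loglik(b, L2) + 1e-9
```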

Profile MM Method
Given the value of β, we can obtain a closed-form estimate of Λ_0 by maximizing the surrogate function over the jump sizes of Λ_0 at the observed failure times; this profile estimate is given by Equation (4). Substituting it back into the surrogate at the current iterates Λ_0^(k) and β^(k) yields a function of β up to a constant c_1. Using the supporting hyperplane inequality again to minorize the remaining terms gives a further surrogate in β, up to another constant c_2. Taking the first and second derivatives of this surrogate with respect to β yields the estimating function for β, which is solved by a Newton-Raphson step. The algorithm is given below:

1. Initialize β^(0) and Λ_0^(0).
2. Given the current estimate of Λ_0, update β by maximizing the surrogate function.
3. Using the updated value of β, calculate the estimate of Λ_0 by Equation (4).
4. Iterate steps 2 and 3 until convergence.

Non-Profile MM Method
Alternatively, we can use the inequality of arithmetic and geometric means to separate the product Λ_0(t_i) exp(X_i'β) in the surrogate into one term involving only Λ_0 and one term involving only β, anchored at the current iterates Λ_0^(k)(t_i) and exp(X_i'β^(k)). By computation, this yields a surrogate function in which Λ_0 has a closed-form update, while β is updated from the first and second derivatives of its own surrogate, giving the estimating function for β. The algorithm is as follows:

1. Initialize β^(0) and Λ_0^(0).
2. Update β by maximizing its surrogate function.
3. Update Λ_0 by its closed-form expression.
4. Iterate steps 2 and 3 until convergence.
To sum up, both the profile MM and non-profile MM methods decompose the objective function into two separate parts: the non-parametric component Λ_0 is separated from the regression vector β, which makes the subsequent maximization step simpler than directly optimizing the log-likelihood. It is worth noting that this parameter-separable feature is one of the advantages of the MM algorithm, as it can easily incorporate quasi-Newton acceleration and other simple off-the-shelf accelerators to boost computational effectiveness.

Parameter Separated Estimation Method
Notice that the estimate of β relies on the Newton-Raphson algorithm, which is sensitive to the initial value; an inappropriate choice of initial value may lead to computational inefficiency. In particular, the higher the dimension of β, the larger the Hessian matrix in the Newton-Raphson method, and the higher the computational cost of matrix inversion. Under these circumstances, the proposed MM method can avoid the matrix inversion problem at a much lower computational cost. In the following section, we describe in detail the parameter-separated estimation method, following the two methods of Section 3.
First, let ω_iq = |x_iq| / ∑_{q'=1}^p |x_iq'|, with the convention that 1/ω_iq = 0 when x_iq = 0. Since ω_iq ≥ 0 and ∑_{q=1}^p ω_iq = 1, we can write

X_i'β = ∑_{q=1}^p ω_iq { x_iq (β_q − β_q^(k)) / ω_iq + X_i'β^(k) }.   (8)

Then, using the discrete form of Jensen's inequality,

ϕ( ∑_i a_i x_i ) ≥ ∑_i a_i ϕ(x_i),   (9)

where ϕ(·) is a concave function, a_i ≥ 0 and ∑_i a_i = 1, applied with the decomposition in Equation (8), we can minorize the surrogate in Equation (5) into a sum of univariate functions of the β_q, giving Equation (10). Similarly, letting ϕ(·) = −exp(·) in Equation (9) and using the decomposition in Equation (8), we can also minorize the surrogate in Equation (6), giving Equation (11). From Equations (10) and (11), it can be observed that the two resulting MM methods involve only p + 1 separate univariate optimizations in the maximization step, and no matrix inversion is needed. Thus, the proposed methods greatly reduce the computational cost.
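The weighted decomposition behind Equations (8) and (9) can be checked numerically; the sketch below verifies, for the concave choice ϕ(·) = −exp(·), that the decomposition reproduces X_i'β exactly and that the Jensen bound minorizes ϕ(X_i'β) with equality at β = β^(k). The covariate row, iterate and weights are hypothetical illustration values.

```python
import numpy as np

rng = np.random.default_rng(7)
p = 6
x = rng.normal(size=p)          # one covariate row x_i (all entries nonzero here)
beta_k = rng.normal(size=p)     # current iterate beta^{(k)}

w = np.abs(x) / np.sum(np.abs(x))   # omega_iq: nonnegative, summing to one

def separated_bound(beta):
    # sum_q w_q * phi( x_q (beta_q - beta_k_q)/w_q + x'beta_k ), with phi = -exp
    z = x * (beta - beta_k) / w + x @ beta_k
    return np.sum(w * (-np.exp(z)))

for _ in range(200):
    beta = beta_k + rng.normal(scale=0.7, size=p)
    # The decomposition is exact: sum_q w_q z_q = x'beta
    z = x * (beta - beta_k) / w + x @ beta_k
    assert np.isclose(np.sum(w * z), x @ beta)
    # Jensen with concave phi: phi(x'beta) >= separated lower bound
    assert -np.exp(x @ beta) >= separated_bound(beta) - 1e-9

# Equality at the current iterate
assert np.isclose(separated_bound(beta_k), -np.exp(x @ beta_k))
```

Because the bound is a sum over q of functions of a single β_q, maximizing it reduces to p univariate problems, which is the source of the computational savings noted above.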

The Variable Selection Based on SCAD and MCP Penalties
Variable selection is an important topic in high-dimensional data analysis. Using variable selection methods to retain significant explanatory variables and remove insignificant ones can improve the prediction accuracy of a statistical model. In this article, we apply the SCAD and MCP penalties, which enjoy oracle properties, to conduct variable selection in the proportional odds model. The penalized log-likelihood becomes the new objective to be maximized, which can be written as

ℓ_pen(β) = ℓ_obs(β) − n ∑_{q=1}^p ρ(|β_q| | λ, γ),

where ρ(· | λ, γ) is the penalty term. The SCAD penalty proposed by [35] is

ρ(t | λ, γ) = λt I(t ≤ λ) + {2γλt − t² − λ²} / {2(γ − 1)} I(λ < t ≤ γλ) + λ²(γ + 1)/2 I(t > γλ),  t ≥ 0,

where γ is suggested to take the value 3.7, and the MCP proposed by [36] is

ρ(t | λ, γ) = {λt − t²/(2γ)} I(t ≤ γλ) + γλ²/2 I(t > γλ),  t ≥ 0,

where γ here takes the value 3.
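The two penalties are simple piecewise functions; the sketch below implements the standard forms (with the conventional defaults γ = 3.7 for SCAD and γ = 3 for MCP) and checks their continuity at the knots and their flat tails beyond γλ.

```python
def scad(t, lam, gamma=3.7):
    # SCAD penalty of Fan and Li: linear near zero, quadratic blend, then constant.
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= gamma * lam:
        return (2 * gamma * lam * t - t ** 2 - lam ** 2) / (2 * (gamma - 1))
    return lam ** 2 * (gamma + 1) / 2

def mcp(t, lam, gamma=3.0):
    # MCP of Zhang: quadratically attenuated near zero, constant beyond gamma*lam.
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2

lam = 0.5
# Continuity at the knots and constant tails
assert abs(scad(lam, lam) - lam ** 2) < 1e-12
assert abs(scad(3.7 * lam, lam) - scad(10.0, lam)) < 1e-12
assert abs(mcp(3.0 * lam, lam) - mcp(10.0, lam)) < 1e-12
# Both penalties are nondecreasing in |t|
grid = [i * 0.01 for i in range(1, 1000)]
assert all(scad(b, lam) >= scad(a, lam) - 1e-12 for a, b in zip(grid, grid[1:]))
assert all(mcp(b, lam) >= mcp(a, lam) - 1e-12 for a, b in zip(grid, grid[1:]))
```

The flat tails are what distinguish SCAD and MCP from the LASSO: large coefficients incur a bounded penalty, which is why these penalties can achieve the oracle property without over-shrinking strong signals.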
To handle the singularity of ρ(|β_q| | λ, γ) around the origin, [35] suggested a local quadratic approximation of the penalty term,

ρ(|β_q| | λ, γ) ≈ ρ(|β_q^(k)| | λ, γ) + (1/2) {ρ'(|β_q^(k)| | λ, γ) / |β_q^(k)|} (β_q² − β_q^(k)²),

for β_q in a neighborhood of β_q^(k). Thus, the penalized surrogate function can be written as the minorized log-likelihood minus the quadratically approximated penalty, where ℓ_obs(β) is minorized by Equation (10) or Equation (11).
To obtain good variable selection results, we apply the BIC criterion to select the tuning parameter λ. The BIC suggested by [37] is

BIC(λ) = −2 ℓ(β̂_λ) + q log n,

where n is the sample size and q denotes the dimension of β̂_λ, i.e., the number of selected non-zero parameters. To select the tuning parameter, we first locate a proper range of λ values by searching over (0, +∞) with an initial sequence s_1, s_2, ..., s_{k_1} and evaluating this BIC criterion. The solution path can be plotted along this sequence, and the value s_i attaining the minimum BIC is identified. Then, grid points are constructed over the range (s_{i−1}, s_{i+1}) for a finer search, and the optimal λ is the grid value with the minimum BIC score.
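The two-stage search can be sketched as follows; `fit_and_score` is a hypothetical stand-in for fitting the penalized model at a given λ and returning its BIC, mocked here with a toy curve purely for illustration.

```python
import math

def bic(loglik, q, n):
    # BIC(lambda) = -2 * loglik + q * log(n)
    return -2.0 * loglik + q * math.log(n)

def fit_and_score(lam, n=200):
    # Hypothetical mock: log-likelihood degrades and sparsity grows as lam grows.
    loglik = -50.0 - 20.0 * (lam - 0.3) ** 2
    q = max(1, int(round(10 - 8 * min(lam, 1.0))))
    return bic(loglik, q, n)

# Stage 1: coarse search over an initial sequence s_1, ..., s_{k_1}
coarse = [0.05 * k for k in range(1, 21)]
s_i = min(coarse, key=fit_and_score)
i = coarse.index(s_i)

# Stage 2: finer grid on (s_{i-1}, s_{i+1}) around the coarse minimizer
lo = coarse[max(i - 1, 0)]
hi = coarse[min(i + 1, len(coarse) - 1)]
fine = [lo + (hi - lo) * k / 50 for k in range(51)]
lam_opt = min(fine, key=fit_and_score)
assert lo <= lam_opt <= hi
```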

Simulation Study
Based on the estimating equations derived in the previous sections, we simulate data to analyze the estimation results in finite samples.

Parameter Estimation of the Proportional Odds Model
Case 1: Data are simulated to verify the methods of Section 3. Let (X_1, X_2, X_3) be the covariates, which follow the standard normal distribution; the true parameter β is set to (2, 1, −3)' and Λ_0(t) = (t/2)². The censoring times are generated from the uniform distribution U(0, b), with b chosen to yield censoring proportions of about 30% and 50%. The sample size is taken to be 250 and 500. The simulation is repeated 500 times. The BIAS, MSE (mean squared error), SD (standard deviation) and median number of iterations (K) are reported in Tables 1 and 2.
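The failure times in this setting can be generated by inverse-transform sampling: solving F(T | X) = U for U ~ Uniform(0, 1) gives Λ_0(T) = U exp(−X'β)/(1 − U), and with Λ_0(t) = (t/2)² this inverts to T = 2√(U exp(−X'β)/(1 − U)). A sketch (with the censoring bound b treated as a tunable constant, not the paper's exact value):

```python
import numpy as np

rng = np.random.default_rng(2024)
n, beta = 250, np.array([2.0, 1.0, -3.0])
X = rng.normal(size=(n, 3))
U = rng.uniform(size=n)

# Invert Lambda0(t) = (t/2)^2 at the value U * exp(-X'beta) / (1 - U)
T = 2.0 * np.sqrt(U * np.exp(-X @ beta) / (1.0 - U))

b = 20.0                       # tunable upper bound of the censoring distribution
C = rng.uniform(0.0, b, size=n)
t_obs = np.minimum(T, C)
delta = (T <= C).astype(int)

# Sanity check: generated T satisfies logit F(T|X) = log Lambda0(T) + X'beta
eta = (T / 2.0) ** 2 * np.exp(X @ beta)
F = eta / (1.0 + eta)
assert np.allclose(np.log(F / (1 - F)), np.log((T / 2.0) ** 2) + X @ beta)
print(round(1.0 - delta.mean(), 3))   # observed censoring proportion
```

In practice b is tuned (e.g., by trial simulation) until the average of 1 − δ matches the target censoring proportion of 30% or 50%.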
From Tables 1-3, we observe that the two MM algorithms proposed in Section 3 perform similarly well, with small MSEs and SDs across sample sizes and censoring proportions. The estimates of both the parametric and nonparametric parts are accurate, with small bias. Moreover, as the sample size increases, the estimation results become more stable for both the profile MM and non-profile MM methods. In addition, compared with the non-profile MM method, the profile MM algorithm is considerably more efficient, requiring fewer iterations. In Figure 1, with sample size 250, the dotted line fits the solid line well at both the 30% and 50% censoring rates. Similarly, in Figure 2, at a 50% censoring rate, the curves of the true and estimated baseline cumulative hazard functions coincide for both sample sizes 250 and 500, which indicates the consistency of our estimator. The results in Figures 1 and 2 are consistent with those in Tables 1-3, and we conclude that the proposed methods perform excellently in estimating the baseline cumulative hazard function and the regression parameters, even with a high right censoring rate and a small sample size.

Simulation on Variable Selection
Case 3: We simulate data to verify the method discussed in Section 4.1. The dimension of the covariates is set to 10 and they follow the standard normal distribution. Let Λ_0(t) = (t/2)²; the censoring times are generated from the uniform distribution U(0, b) to yield a censoring proportion of about 50%, and β = (−4, −4, −2, −2, 1, 1, 3, 3, 5, 5)'. Based on 500 replications, the BIAS, MSE (mean squared error), SD (standard deviation) and median number of iterations (K) of the estimated parameters are reported in Table 4.
Case 4: Under the same setting with a 30% censoring rate, the simulation results are presented in Table 5. From the results in Tables 4 and 5, the parameter-separated MM method proposed in Section 4.1 estimates the parameters accurately with small bias, and we observe that the MSE, SD and K decrease as the sample size increases. Unlike the results in Tables 1-3, the non-profile MM method requires fewer iterations than the profile MM method in Tables 4 and 5.
To assess the effectiveness of our methods, we calculate the RMSE to measure the average difference between the true and estimated parameters,

RMSE = √{ (1/p) ∑_{q=1}^p (β̂_q − β_q)² },

where p is the number of explanatory variables. Moreover, we calculate the FDR (false discovery rate) and PSR (positive selection rate) proposed by [38]:

FDR = FP / (TP + FP) if TP + FP > 0, and 0 if TP + FP = 0;
PSR = TP / m,

where FP (false positive) is the number of parameters estimated to be non-zero whose true value equals zero, TP (true positive) is the number of truly non-zero parameters correctly estimated as non-zero, and m denotes the number of truly non-zero parameters. Thus, a low FDR together with a high PSR indicates a good variable selection result.
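These criteria are easy to compute from the estimated and true coefficient vectors; a sketch with a small hypothetical example:

```python
import math

def selection_metrics(beta_hat, beta_true, tol=1e-8):
    sel = [abs(b) > tol for b in beta_hat]    # selected (estimated non-zero)
    nz = [abs(b) > tol for b in beta_true]    # truly non-zero
    TP = sum(s and z for s, z in zip(sel, nz))      # truly non-zero, selected
    FP = sum(s and not z for s, z in zip(sel, nz))  # truly zero, selected
    m = sum(nz)
    fdr = FP / (TP + FP) if (TP + FP) > 0 else 0.0
    psr = TP / m if m > 0 else 0.0
    rmse = math.sqrt(sum((bh - bt) ** 2 for bh, bt in zip(beta_hat, beta_true))
                     / len(beta_true))
    return fdr, psr, rmse

beta_true = [2.0, 0.0, -3.0, 0.0, 1.0]
beta_hat = [1.9, 0.1, -2.8, 0.0, 0.0]   # one false positive, one missed signal
fdr, psr, rmse = selection_metrics(beta_hat, beta_true)
assert abs(fdr - 1.0 / 3.0) < 1e-12
assert abs(psr - 2.0 / 3.0) < 1e-12
```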
According to Table 6, the regularized estimation methods proposed in this paper can correctly select the significant parameters, and the selection results improve as the sample size increases. Under the same framework, the profile MM and non-profile MM methods produce similar results. For both sample sizes 200 and 400, no parameter whose true value equals 0 is selected. For the truly non-zero parameters, both penalties give good results with PSR greater than 0.9, and SCAD performs slightly better than MCP. As Table 7 shows, the BIAS, MSE and SD of the non-zero parameter estimates are small, indicating good estimation performance. That is, the MM method can effectively handle variable selection for the proportional odds model under the SCAD and MCP penalties.

Real Data Analysis
We apply our methods to the Veterans' Administration lung cancer study data, with sample size 137 and 8 covariates. The data can be retrieved from the R package "survival", and information on the covariates is given in Table 8. This dataset is commonly used to illustrate the proportional odds model. For comparison with other studies, we use the data of patients with no prior therapy, where the censoring rate is 6.19% and the covariates are "celltype" and "karno". The two MM algorithms are applied to estimate the parameters. Moreover, the standard error is estimated from G = 1000 bootstrap replicates β*_1, ..., β*_G as

SE = √{ (1/(G − 1)) ∑_{g=1}^G (β*_g − β̄*)² },   (13)

where β̄* is the mean of the bootstrap estimates. A normal-approximation 95% confidence interval is

(β̂ − z_{α/2} SE, β̂ + z_{α/2} SE),   (12)

and a percentile 95% confidence interval is

(β*_L, β*_U),   (14)

where β*_L and β*_U denote the α/2 and 1 − α/2 quantiles of β*_1, ..., β*_G, respectively, with α = 0.05. Tables 9 and 10 present the estimation results of the two MM algorithms, where the estimated standard error (SE) is defined by Equation (13), the 95% confidence interval CI1 by Equation (12), and the 95% confidence interval CI2 by Equation (14). In addition, the estimated cumulative hazard functions are plotted in Figure 3. From Tables 9 and 10, for the same patient, each one-unit increase in "karno" multiplies the odds of death by exp(−0.0532) = 0.9482. From Figure 3, the estimated baseline cumulative hazard functions of the two MM methods are almost identical. Our method produces results similar to those of [3], and slightly different from those of [5,7,11], as shown in Table 11 reproduced from [11]. We then consider all eight covariates for model fitting, where the censoring rate is 6.57%. We first use the method of Section 4.1 to estimate the parameters, and then apply the SCAD and MCP penalties discussed in Section 4.2 for variable selection; the results are presented in Table 12.
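The bootstrap quantities in Equations (12)-(14) are simple to compute from the G resampled estimates; a sketch using simulated draws in place of actual model refits:

```python
import numpy as np

rng = np.random.default_rng(11)
G = 1000
beta_hat = -0.0532
# Stand-in for G bootstrap re-estimates of a single coefficient
boot = beta_hat + rng.normal(scale=0.01, size=G)

# Equation (13): bootstrap standard error
se = np.sqrt(np.sum((boot - boot.mean()) ** 2) / (G - 1))

# Equation (12): normal-approximation 95% CI
z = 1.959964  # standard normal 0.975 quantile
ci_normal = (beta_hat - z * se, beta_hat + z * se)

# Equation (14): percentile 95% CI from the alpha/2 and 1 - alpha/2 quantiles
ci_pct = (np.quantile(boot, 0.025), np.quantile(boot, 0.975))

assert ci_normal[0] < beta_hat < ci_normal[1]
assert ci_pct[0] < ci_pct[1]
```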
From Table 12, the profile MM and non-profile MM methods produce similar results. Without the penalties, the parameter-separated MM algorithms indicate that four covariates, "small vs large", "adeno vs large", "karno" and "prior", significantly affect the death rate. After introducing the penalties, five covariates are shrunk to 0 and only three significant covariates are preserved: "small vs large", "adeno vs large" and "karno". "small vs large" and "adeno vs large" have a positive effect on the death rate, while "karno" is negatively related to the death rate, which is in line with common sense.

Conclusions and Future Work
Proportional odds models are more competitive than proportional hazards models in dealing with right-censored survival data where mortality tends to become uniform over time. The MM algorithm has the advantages of a simple structure, strong interpretability and easy implementation; hence, it is a useful tool for optimization problems and has a broad range of applications in statistics. In this work, we introduce the MM algorithm into the estimation of the proportional odds model. We first develop two MM algorithms that greatly simplify the estimation process by constructing two simple surrogate functions for the log-likelihood. The proposed MM algorithms successfully separate the parameters and decompose the high-dimensional maximization into separate low-dimensional ones, which avoids matrix inversion and can be applied to more general scenarios. Moreover, we apply the MM methods to regularized estimation in sparse high-dimensional proportional odds regression models with the SCAD and MCP penalties. We find that the proposed MM algorithms, with their parameter-separation property, mesh well with the SCAD and MCP penalties and yield good results in simultaneous parameter estimation and variable selection.
The advantage of our algorithm is that it separates the estimation of the baseline hazard from that of the other parameters, which makes the estimation process more efficient. In future studies, this technique, derived for a semi-parametric model, could be further extended to parameter estimation in fully non-parametric models. In addition, as mentioned in previous sections, existing methods for the proportional odds model with right-censored data involve high computational complexity when dealing with high-dimensional data, whereas our proposed algorithm avoids matrix inversion and is thus capable of high-dimensional regression analysis. Furthermore, our algorithm meshes well with existing quasi-Newton acceleration and other simple off-the-shelf accelerators to further speed up the estimation process. Although the proposed MM algorithms are developed for proportional odds models, a parallel approach can essentially be developed for more general transformation models. We will investigate this in future work.

Figure 1.
Figure 1. True and estimated baseline cumulative hazard functions with different censoring rates when the sample size is 250. The solid and dotted lines plot the true and estimated baseline cumulative hazard functions, respectively. The estimated curve is the empirical average of the estimated baseline cumulative hazard functions based on 500 replications.

Figure 2.
Figure 2. True and estimated baseline cumulative hazard functions with different sample sizes when the censoring rate is 50%.

Figure 3.
Figure 3. Estimated baseline cumulative hazard functions for the lung cancer data.

Table 1.
The estimation results of Case 1 with a 30% censoring rate.

Table 2.
The estimation results of Case 1 with a 50% censoring rate.

Table 3.
The estimation results of Case 2 with a 50% censoring rate.

Table 4.
The results of the parameter-separated estimation method for Case 3 at a 50% censoring rate.

Table 5.
The results of the parameter-separated estimation method for Case 4 at a 30% censoring rate.

Table 6.
The simulation results of variable selection in Case 5.

Table 7.
The estimation results of the non-zero coefficients in Case 5.

Table 8.
Covariates of the lung cancer data.

Table 9.
The estimation results from the profile MM method.

Table 10.
The estimation results from the non-profile MM method.

Table 11.
Estimation results from other studies.

Table 12.
The results of variable selection for the lung cancer data.