Weighted Log-Rank Statistics for Accelerated Failure Time Model

Abstract: This paper improves the sensitivity of the $G^{\rho}$ family of weighted log-rank tests for the accelerated failure time model, accommodating realistic alternatives in survival analysis with censored data, such as heavy censoring and crossing hazards. The procedures are based on a weight function that incorporates the censoring proportion as a component. Extensive simulations show that the weight function enhances the performance of the $G^{\rho}$ family, increasing its sensitivity and flexibility. The weight function method is illustrated with an example concerning vaginal cancer.


Introduction
An important problem in survival analysis is comparing two samples, especially when the data may be censored. One example is the evaluation of treatment effects in randomized clinical trials. In such studies, patients are randomized into two groups, one receiving a new treatment and the other receiving a placebo, and the groups are compared statistically over time to reveal the effect of the new treatment. For studies of this kind, the Cox proportional hazards model is the usual modeling choice in the presence of censoring. It is, in fact, the most popular of the models available for survival data, primarily because it does not assume a distribution for the baseline hazard. This model, however, assumes that the hazard functions of the two groups are proportional over the course of the study. The proportional hazards assumption is often violated over the entire study, even if it holds for a short period of time, and this restricts its use in practice. The accelerated failure time (AFT hereafter) model does not require the proportionality assumption. Furthermore, it has a simple structure in which the lifetime is accelerated or decelerated by a scale factor. For these reasons, the AFT model is an appealing alternative to the Cox proportional hazards model when the proportional hazards assumption is in doubt.
For the AFT model with two comparison samples, the effect of covariates is measured as a scale change between the two samples. Rank-based estimators built on weighted log-rank statistics are often used to estimate the scale change. Many authors have studied scale estimation with two-sample censored data, including [1][2][3]. The scale parameter can generally be estimated by a root of a weighted log-rank estimating function with a suitable weight function. A commonly used family of weighted log-rank tests for comparing the survival distributions of two samples is the $G^{\rho}$ family ([4][5][6]). In the $G^{\rho}$ family, the weight function is the product-limit estimator ([7]) of the survival function, also referred to as the Kaplan-Meier estimator, raised to a power, and it strongly affects the performance of the tests. Inappropriately chosen weights can reduce the power of the tests, especially when the survival curves cross at some point during the study. Thus, the weight function should be chosen carefully to avoid misinterpreting the estimates. Various methods to cope with these problems have been proposed.
Motivated by [8], this paper develops a class of weighted log-rank tests for randomly right-censored data that improves the flexibility and sensitivity of the $G^{\rho}$ family in the two-sample AFT model. To prevent the power loss that a $G^{\rho}$ test may suffer from censoring or from a misspecified weight, we use a weight function that has the censoring proportion as a component. Numerical simulations show that the $G^{\rho}$ family with this weight function is more powerful than the usual $G^{\rho}$ family and outperforms the log-rank test. The results also demonstrate that the weight function increases the sensitivity of the $G^{\rho}$ family in checking the validity of the AFT model, showing good power against a wider range of alternatives. The procedures are illustrated on a data set regarding vaginal cancer, in which acceleration or deceleration of the survival time of patients is examined via the AFT model. This paper is organized as follows. Section 2 reviews the accelerated failure time model and weighted log-rank tests, and describes a statistic with the weight function that accommodates realistic alternatives. Numerical studies are carried out in Section 3. Section 4 presents concluding remarks.

Accelerated Failure Time Model
To test the equivalence of two samples with censored survival data, we take censored samples of sizes $n_1$ and $n_2$ from the two comparison populations. For $i = 1, 2$, let $X_{ij}$, $j = 1, \ldots, n_i$, be independent, positive random variables with absolutely continuous distribution $F_i$. Let $C_{ij}$ be independent censoring variables corresponding to $X_{ij}$. It is assumed that $X_{ij}$ and $C_{ij}$ are independent. The observable random variables are $(T_{ij}, \delta_{ij})$, where $T_{ij} = \min(X_{ij}, C_{ij})$ is the minimum of the failure and censoring times, and $\delta_{ij} = I(X_{ij} \le C_{ij})$, with $I$ the indicator function. Let $f_i(t) = dF_i(t)/dt$ be the density function of $F_i(t)$, where $F_i(t)$ is the probability of failure by time $t$. The hazard function, also referred to as the hazard rate, is then defined as $\lambda_i(t) = f_i(t)/\{1 - F_i(t)\}$, which represents the instantaneous rate of failure (or death) at time $t$ for an individual that has survived up to time $t$. The cumulative hazard function is $\Lambda_i(t) = \int_0^t \lambda_i(s)\,ds$, a measure of cumulative hazard (or cumulative risk) up to time $t$. The null hypothesis of interest is that the two-sample AFT model fits the data:
$$\Lambda_1(t) = \Lambda_2(\theta t) \qquad (1)$$
for some constant $\theta$ acting multiplicatively on time $t$. That is, the random variable $X_{1j}$ has the same distribution as the random variable $X_{2j}/\theta$, where the scale-change factor $\theta$ quantifies the effect on time. This implies that the lifetime is either accelerated or decelerated by the constant $\theta$ through a risk factor such as gender or treatment. For example, the lifetime under a treatment is increased or decreased for $\theta > 1$ or $\theta < 1$, respectively. Note that the model in (1) can be written in terms of the hazard function as $\lambda_1(t) = \theta \lambda_2(\theta t)$, which is also equivalent to $S_1(t) = S_2(\theta t)$, where $S_i(t) = 1 - F_i(t)$ is the survival function.
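As an illustration of the observation scheme and of the scale relation $S_1(t) = S_2(\theta t)$, the following Python sketch builds the pair $(T, \delta)$ from a failure time and a censoring time, and checks the scale relation empirically. The exponential lifetimes, the seed, and the helper names `observe` and `empirical_survival` are assumptions made only for this example; they are not part of the paper.

```python
import random

def observe(x, c):
    """Return (T, delta) with T = min(X, C) and delta = I(X <= C)."""
    return (min(x, c), 1 if x <= c else 0)

def empirical_survival(sample, t):
    """Fraction of the sample strictly greater than t."""
    return sum(1 for x in sample if x > t) / len(sample)

rng = random.Random(0)
theta = 2.0
n = 50000
# Assumption for illustration only: unit-rate exponential lifetimes in group 1.
x1 = [rng.expovariate(1.0) for _ in range(n)]
# Group 2 lifetimes are theta times as long, so X2/theta is distributed like X1.
x2 = [theta * rng.expovariate(1.0) for _ in range(n)]

t = 0.7
s1 = empirical_survival(x1, t)          # estimates S1(t)
s2 = empirical_survival(x2, theta * t)  # estimates S2(theta * t)
print(round(s1, 3), round(s2, 3))       # the two values should nearly agree
print(observe(1.0, 2.0), observe(3.0, 2.0))
```

The two printed survival estimates agree up to sampling error, which is what the AFT model with scale factor $\theta$ asserts.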

Weighted Log-Rank Test
For the censored two-sample comparison, let $N_i(t) = \sum_{j=1}^{n_i} I(T_{ij} \le t, \delta_{ij} = 1)$ and $Y_i(t) = \sum_{j=1}^{n_i} I(T_{ij} \ge t)$. Note that $N_i(t)$ is the number of events (failures or deaths) in group $i$ that have occurred by time $t$, and $Y_i(t)$ is the number of individuals still under observation just prior to time $t$ (equivalently, the number at risk at time $t$). The Kaplan-Meier estimate of the survival function ([7]), $S(t) = P(T > t)$, for the pooled data is $\hat{S}(t) = \prod_{s \le t} \{1 - \Delta N(s)/Y(s)\}$, where $N = N_1 + N_2$, $Y = Y_1 + Y_2$, and $\Delta N(s) = N(s) - N(s-)$. Note that $1 - \hat{S}(t)$ estimates the distribution function $F(t)$ of the data. Also note that the cumulative hazard function $\Lambda_i(t)$ equals $-\log S_i(t)$. The Nelson-Aalen estimate of $\Lambda_i$ ([12,13]) is defined as $\hat{\Lambda}_i(t) = \int_0^t dN_i(s)/Y_i(s)$. Hence, an estimator of $\theta$ in the AFT model (1) is defined as a zero of the weighted log-rank statistic in (2), where $W_n(t; c)$ is a bounded weight function that determines the type of weighted log-rank statistic. For rank statistics of the form (2), by [14] and under the condition of [15], a legitimate estimate of $\theta$ can be taken as the value of $\theta$ that makes the integral as close to zero as possible. It is worth noting that the most powerful test for detecting proportional hazards alternatives is obtained with the log-rank estimator [16,17], whose weight function is constant over time. For the $G^{\rho}$ family, the weight function is built from $\hat{S}^{\rho}(t-)$, where $\rho$ is an exponent of $\hat{S}$ such that $0 \le \rho < \infty$. The log-rank statistic is obtained when $\rho = 0$, for which $\hat{S}^{\rho} = 1$, assigning equal weights over time. For $\rho > 0$, the $G^{\rho}$ statistics give relatively more weight to early survival differences, since $\hat{S}^{\rho}$ with $\rho > 0$ decreases as time $t$ progresses. For $\rho = 1$, the Wilcoxon statistic ([18]) is obtained. Such weighted log-rank statistics are often used for comparing two distributions in the presence of arbitrary right censoring. Because the behavior of the tests is sensitive to the weight function, a carefully chosen weight function should be used in practice.
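The Kaplan-Meier estimate and the resulting $G^{\rho}$ weights can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper; the function names `kaplan_meier` and `g_rho_weights` and the toy data are hypothetical, and the weights are taken to be $\hat{S}(t-)^{\rho}$ evaluated at the distinct event times.

```python
def kaplan_meier(times, events):
    """Return the distinct event times and the KM estimate just after each one.

    times  : observed times T = min(X, C)
    events : 1 if the failure was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    out_t, out_s = [], []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0  # failures at time t
        m = 0  # all observations (failures and censorings) at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk  # product-limit step
            out_t.append(t)
            out_s.append(surv)
        at_risk -= m
    return out_t, out_s

def g_rho_weights(surv, rho):
    """Weights S(t-)^rho at each event time, using the left limit of S."""
    left = [1.0] + surv[:-1]
    return [s ** rho for s in left]

# Toy data (hypothetical): six subjects, two of them censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
event_times, km = kaplan_meier(times, events)
print(event_times, [round(s, 4) for s in km])
print(g_rho_weights(km, 0))  # rho = 0: constant weights (log-rank)
print([round(w, 4) for w in g_rho_weights(km, 1)])  # rho = 1: early times weighted more
```

With $\rho = 0$ every event time receives weight 1, while with $\rho = 1$ the weights decrease over time, which is the early-difference emphasis described in the text.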
Ref. [8] proposed a modified version of the $G^{\rho}$ family to accommodate the situation where the $G^{\rho}$ family of weighted log-rank statistics remains almost unchanged for all values of $\rho$, and thus does not cover a useful range of tests. This can happen when the event rate is low, in which case $S(\tau)$ is near 1, so that $\hat{S}^{\rho}(t)$ is also near 1 for all values of $\rho$. Note that if the weight function $W_n$ does not change much, the $G^{\rho}$ statistic behaves much like the unweighted log-rank statistic. The modification to the $G^{\rho}$ statistics proposed by [8], denoted $\tilde{U}_{\rho}$, subtracts from $\hat{S}$ in the weight its value at $\kappa$, the time at the end of the study period. A corresponding estimator for the variance of $\tilde{U}_{\rho}$ is also available.

Adaptive Weight Function
In Section 2.2, to prevent the weight function in the $G^{\rho}$ family from remaining stable near 1 during the entire study period $[0, \tau]$, the terminal value of $\hat{S}$ is subtracted from $\hat{S}$, so that the weight decreases for any $\rho > 0$. When data are heavily censored, however, the Kaplan-Meier survival function $\hat{S}$ decreases relatively slowly over the observation period and may not approach zero at the end. Thus, the estimated survival function remains higher than it would under lighter censoring. With such an inflated weight function, a statistical test may lose power and fail to detect a difference between the two groups that actually exists. The same can happen in other situations. For example, a test using the $G^{\rho}$ family may lose power when the survival curves of two treatment groups cross because the treatment benefits differ over time: one treatment may have high initial efficacy, while the effectiveness of the other is gradual. To cope with situations of this kind in the two-sample AFT model, a class of weight functions $W_n^*(t; c)$, referred to as (3), can be used. It has the censoring proportion of the data as a component through a constant $a$, $0 \le a \le 1$, where $a \times 100\%$ is the percentage of observed (uncensored) data, so that $1 - a$ is the censoring proportion. The weight function $W_n^*$ adaptively assigns weights according to the censoring proportion of the data: it still decreases to near zero under heavy censoring, giving less weight to possible late survival differences. Thus, it provides a broader range of flexibility than the weight function $\hat{S}^{\rho}(t-)$. Note that $a$ is near 1 under light censoring, and thus the statistic with $W_n^*$ behaves like the $G^{\rho}$ family. On the other hand, with heavily censored data, $a$ is near 0, and its behavior is similar to the weight function in [8].
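The limiting behavior described above can be illustrated numerically. The sketch below assumes that the adaptive weight takes the form $\{\hat{S}(t-) - (1 - a)\hat{S}(\tau-)\}^{\rho}$, with $a$ the proportion of uncensored observations. This form is inferred from the surrounding description (the component $(1 - a)\hat{S}(\tau-; c)$ and the two limiting cases) and should be read as an assumption rather than as the paper's exact definition; the function name and the numerical values are hypothetical.

```python
def adaptive_weights(surv_left, surv_end, a, rho):
    """Assumed adaptive weight {S(t-) - (1 - a) * S(tau-)}^rho.

    surv_left : values of S(t-) at the event times
    surv_end  : S(tau-), the KM estimate just before the end of follow-up
    a         : proportion of uncensored observations (assumed meaning of a)
    """
    return [(s - (1.0 - a) * surv_end) ** rho for s in surv_left]

# Hypothetical KM left limits under heavy censoring: the curve stays high.
surv_left = [1.0, 0.9, 0.8, 0.7]
surv_end = 0.65

g1 = [s ** 1 for s in surv_left]                       # usual G^rho weight, rho = 1
w_light = adaptive_weights(surv_left, surv_end, 1.0, 1)  # a = 1: no censoring
w_heavy = adaptive_weights(surv_left, surv_end, 0.0, 1)  # a = 0: terminal value fully subtracted
w_mid = adaptive_weights(surv_left, surv_end, 0.4, 1)    # 60% censoring

print(w_light)                          # coincides with the G^rho weight
print([round(w, 2) for w in w_heavy])   # S(t-) - S(tau-), the modification attributed to [8]
print([round(w, 2) for w in w_mid])     # lies between the two, below the G^rho weight
```

Under this assumed form, $a = 1$ reproduces the $G^{\rho}$ weight and $a = 0$ reproduces the terminal-value subtraction, matching the two limiting cases stated in the text.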
Since the survival function $\hat{S}(t; c)$ decreases from 1 to $\hat{S}(\tau; c)$, the weight function $W_n^*(t; c)$, when scaled by $W_n^*(0; c)$ for $a > 0$, decreases from 1 to its terminal value at $\tau$ for $\rho > 0$, with $\hat{S}(\tau-; c) < 1$. The ratio $W_n^*(t; c)/W_n^*(0; c)$, for instance with $\rho = 1$, implies that $W_n^*(t; c)$ is a non-increasing function of $t > 0$. For censoring proportions of 20%, 40%, 60%, and 80%, the weight functions $W_n^*(t; c)$ and $W_n(t; c)$ with $\rho = 1$ are compared in Figure 1, where the dotted and solid lines represent $W_n^*(t; c)$ and $W_n(t; c)$, respectively. In the figure, both the survival and the censoring distributions are log-logistic. As the figure demonstrates, as the censoring proportion increases, the weight $W_n^*(t; c)$ moves further away from its component $(1 - a)\hat{S}(\tau-; c)$ while staying below $W_n(t; c)$. To test the equality of two samples under the AFT model using weighted log-rank statistics, we implement the weight function $W_n^*$. For $j = 1, \ldots, n_2$, let $N_{2j}(t; c) = I(T_{2j}/c \le t, \delta_{2j} = 1)$, where $T_{2j}/c = \min(X_{2j}/c, C_{2j}/c)$. Furthermore, let $N_2(t; c) = \sum_{j=1}^{n_2} N_{2j}(t; c)$. The statistic $U_{\rho}^*(c)$ is then defined as the weighted log-rank statistic with weight $W_n^*$, integrated over $[0, v]$; it can be used as an estimating function for $\theta$ in (1), where $d\hat{\Lambda}_2(ct)$ is the Nelson-Aalen increment based on $dN_2(t; c)$ and $v$ is the upper limit of the integral. Let $\hat{\theta}$ be the estimator that solves $U_{\rho}^*(c) = 0$. Note that instead of $[0, \infty)$, a finite range $[0, v]$ on which enough data are available is used to avoid unusual behavior of the integral near its upper limit. This kind of truncated integration is commonly used in survival analysis to keep the estimating function from becoming unstable in the upper tail of the data. In this work, $v$ was chosen such that $v < \min(\tau_1, \tau_2/c)$ for $c$ in a neighborhood of $\theta$, where $\tau_i = \sup\{t : F_i(t) < 1\}$. It is worth noting that the estimating function is well-defined, provided that it is bounded on the integration range.

Confidence Interval and Test
The AFT model with two-sample censored data has a direct interpretation in terms of a scale factor. The scale estimators proposed in the literature are consistent and asymptotically normal ([3]). However, their asymptotic variances are difficult to estimate directly, since this involves an unknown density or requires monotonicity conditions on the weight functions. This is a common occurrence for scale estimators, whether rank-based or minimum-distance-based. We thus used an indirect method to obtain confidence intervals for the scale parameter: among the available indirect methods, the test-based construction of confidence intervals proposed by [3]. Under the conditions of [15] and the null hypothesis $H_0: \Lambda_1(t) = \Lambda_2(\theta t)$, the standardized statistic $J(\theta)$ is asymptotically standard normal, where the definition of $J(\theta)$ involves the number of failures at time $t$ in group $i$, $i = 1, 2$. From this, an asymptotically distribution-free, test-based confidence interval for $\theta$ with confidence level $1 - \alpha$ is obtained as the set of values of $\theta$ for which $|J(\theta)| \le z_{\alpha/2}$, where $z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. In this work, the confidence interval was obtained from $J(\theta)$ by a grid search over values of $\theta$.
Using this method, the greatest lower and least upper bounds of the accepted values of $\theta$ were taken as the lower and upper limits of the interval, respectively. We now consider the testing problem of $H_0: \Lambda_1(t) = \Lambda_2(\theta t)$ for some $\theta$ versus $H_1: \Lambda_1(t) \ne \Lambda_2(\theta t)$. Under $H_0$, the statistic $U_{\rho}^*(\theta)$ converges in distribution to a normal distribution with mean zero and a variance that can be estimated from the data. Therefore, an asymptotic level-$\alpha$ test rejects $H_0$ if the standardized statistic exceeds the standard normal critical value in absolute value. Note that the testing procedures can be adapted, with some modifications, to check the equality of survival between two groups.
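The grid search behind the test-based interval can be sketched as follows. The function `invert_test` is a hypothetical helper that keeps every grid value of $\theta$ at which a standardized statistic is not rejected; the statistic `toy_j` is a made-up monotone function standing in for $J(\theta)$ and is not computed from any data.

```python
from statistics import NormalDist

def invert_test(j_stat, grid, alpha=0.05):
    """Return (lower, upper) limits of {theta in grid : |J(theta)| <= z_{alpha/2}}."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 1 - alpha/2 standard normal quantile
    accepted = [theta for theta in grid if abs(j_stat(theta)) <= z]
    if not accepted:
        return None  # no grid value is compatible with the data
    return (min(accepted), max(accepted))

def toy_j(theta):
    """Hypothetical standardized statistic, zero at theta = 1.1."""
    return (theta - 1.1) / 0.2

grid = [0.5 + 0.01 * k for k in range(151)]  # theta from 0.50 to 2.00 in steps of 0.01
lower, upper = invert_test(toy_j, grid)
print(round(lower, 2), round(upper, 2))
```

The resolution of the interval limits is set by the grid spacing, so a finer grid near the limits may be preferable in practice.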

Simulation
To assess the performance of the confidence interval developed with the weight function $W_n^*$ in (3) and compare it to the $G^{\rho}$ family, extensive simulations were conducted. For the simulation study, we considered three settings for the distributions of the survival times $X_{ij}$ and censoring times $C_{ij}$: (C1) a log-logistic power-scale family was used to generate both $X_{ij}$ and $C_{ij}$; (C2) Weibull and lognormal distributions were used for $X_{ij}$ and $C_{ij}$, respectively; (C3) normal and uniform distributions were used for $\log X_{ij}$ and $\log C_{ij}$, respectively. The three cases are specified as follows:
C1. $2^{1-i} X_{ij}$ has density $2t(1 + t^2)^{-2}$, $t > 0$, and $C_{ij}$ has density $2h^2 t(1 + h^2 t^2)^{-2}$, $t > 0$, for some constant $h$.
C2. $2^{1-i} X_{ij}$ has density $2te^{-t^2}$, $t > 0$, and $\log(C_{ij})$ is normal with mean $h$ and standard deviation 1.
C3. $\log(2^{1-i} X_{ij})$ is standard normal, and $\log(C_{ij})$ is uniform on $(h, 1 + h)$ for some constant $h$.
Note that the constant $h$ in each case is chosen to yield the censoring proportion of interest. For example, $h$ values of 0.25, 0.52, and 0.98 were used for censoring proportions of 20%, 40%, and 60%, respectively, in case C1 (the log-logistic power-scale family). Similar configurations were previously considered in [19]. Results are based on 1000 repetitions with $(n_1, n_2) \in \{(25, 25), (25, 50), (50, 50)\}$ and are summarized in Table 1. The scale parameter is $\theta = 2$, and the two samples have the same censoring distribution, which does not depend on the parameter. Table 1 presents the empirical coverage probabilities (ECP) and empirical mean lengths (EML) of the confidence intervals associated with the weight functions considered. To examine the size of the tests, the null hypothesis that the AFT model holds with $\theta = 2$ was tested at $\alpha = 0.05$ for the three cases with the same setup. Results are presented in Table 2. Overall, all of the tests achieve the nominal significance level of 0.05 in case C2. In cases C1 and C3, on the other hand, the empirical level tended to deviate slightly from the nominal level for the censoring proportions considered.
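Setting C1 can be simulated by inverse-CDF sampling, since the density $2s^2 t(1 + s^2 t^2)^{-2}$ has distribution function $s^2 t^2/(1 + s^2 t^2)$. The Python sketch below is an illustration only (the paper's simulations were run in MATLAB); the function names, the seed, and the large sample size used to check the censoring proportion are assumptions of this example.

```python
import math
import random

def rloglogistic(rng, scale=1.0):
    """Draw from density 2 s^2 t (1 + s^2 t^2)^(-2), t > 0, with s = scale.

    Inverts the CDF F(t) = s^2 t^2 / (1 + s^2 t^2).
    """
    u = rng.random()
    return math.sqrt(u / (1.0 - u)) / scale

def simulate_c1(n1, n2, h, rng):
    """Return a list of (group, T, delta) under setting C1 with theta = 2."""
    out = []
    for group, n in ((1, n1), (2, n2)):
        for _ in range(n):
            # 2^(1-i) X_ij has the base density, so X_ij = 2^(i-1) * base draw.
            x = 2 ** (group - 1) * rloglogistic(rng)
            c = rloglogistic(rng, scale=h)  # same censoring law in both groups
            out.append((group, min(x, c), int(x <= c)))
    return out

rng = random.Random(1)
data = simulate_c1(20000, 20000, h=0.25, rng=rng)
censoring = 1.0 - sum(d for _, _, d in data) / len(data)
print(round(censoring, 3))  # pooled censoring proportion, expected near 20%
```

With $h = 0.25$ the pooled censoring proportion comes out close to the 20% quoted in the text; the other values of $h$ can be checked the same way.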
Simulations on power were also conducted under specific alternatives to check the performance of the tests associated with the weight functions. We considered three settings in which the AFT model does not hold: (S1) the model of [20], which accommodates crossing hazard functions with $\beta_1$ and $\beta_2$ taken with opposite signs (see [20] for details); (S2) survival curves that cross near the middle of the time course; (S3) survival curves that cross late. For S2 and S3, we generated data from piecewise exponential distributions with piecewise-constant hazards $\lambda_i$ for group $i$, $i = 1, 2$, and the censoring distributions were $c \times \mathrm{Uniform}(0, 1)$, with $c$ chosen to produce the specified censoring proportions. For example, $c$ values of 3.2, 1.5, and 0.75 led to censoring proportions of 20%, 40%, and 60%, respectively, in S2 (middle crossover). For the middle survival differences (S2), $\lambda_1 = 3, 0.5, 0.5$ and $\lambda_2 = 0.5, 2, 2$ for $t < 0.5$, $0.5 \le t < 0.6$, and $t \ge 0.6$, respectively. For the late survival differences (S3), $\lambda_1 = 1, 0.5, 0.2$ and $\lambda_2 = 0.2, 1, 1$ for $t < 0.8$, $0.8 \le t < 1.5$, and $t \ge 1.5$, respectively. Figure 2 illustrates cases S2 and S3. Simulation results for the three cases are presented in Table 3. The results for S1-S3 demonstrate that the weight function $W_n^*$ notably improved the $G^{\rho}$ family overall, which outperformed the log-rank test. The improvement of the tests with $W_n^*$ over the others became more apparent as the sample size became smaller and the censoring heavier. For example, for the small sample $n_1 = n_2 = 25$ in case S2 with light censoring (20%), the powers with $G^1$, $G^2$, $W_1^*$, and $W_2^*$ were 15.8%, 21.2%, 16.5%, and 22.1%, respectively. That is, using the weight function $W_n^*$ gave a relative improvement in power of about 4% for both $\rho = 1$ (15.8% to 16.5%) and $\rho = 2$ (21.2% to 22.1%). For moderate censoring (40%), the relative improvements were about 10% for $\rho = 1$ (20.5% to 22.6%) and 9% for $\rho = 2$ (23.8% to 26%). Under heavy censoring (60%), the relative improvement exceeded 20%: 24% for $\rho = 1$ (18.9% to 23.4%) and 72% for $\rho = 2$ (21.7% to 37.5%). Similar phenomena were observed in the other settings, including S3. It is worth noting from Table 3 that the power increased as the percentage of censoring increased, which is a reversal of the usual pattern: in general, one would expect less power with heavier censoring. With right censoring, however, such a reversal is possible, because mainly early survival differences are observed as the censoring proportion increases. By contrast, with light or no censoring, a crossover in the survival functions can leave a test with little or no power. Note that all simulations were performed in MATLAB, in which the methods were implemented. Source code is available upon reasonable request.
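The piecewise exponential lifetimes of setting S2 can be generated by inverting the cumulative hazard: draw a unit exponential variable and find the time at which the cumulative hazard reaches it. The Python sketch below illustrates this under the S2 hazards and the censoring scale $c = 3.2$ quoted above; the function name `rpiecewise_exp`, the seed, and the sample size are assumptions of this example, not the paper's MATLAB code.

```python
import math
import random

def rpiecewise_exp(rng, cuts, rates):
    """Draw a lifetime with piecewise-constant hazard.

    cuts  : increasing change points, e.g. (0.5, 0.6)
    rates : hazard on each interval, one more entry than cuts, all positive
    """
    target = -math.log(1.0 - rng.random())  # unit exponential draw
    start = 0.0
    for end, rate in zip(list(cuts) + [math.inf], rates):
        width = end - start
        if target <= rate * width:
            return start + target / rate  # cumulative hazard reaches target here
        target -= rate * width
        start = end

rng = random.Random(2)
cuts = (0.5, 0.6)
n = 20000
x1 = [rpiecewise_exp(rng, cuts, (3.0, 0.5, 0.5)) for _ in range(n)]  # group 1, S2
x2 = [rpiecewise_exp(rng, cuts, (0.5, 2.0, 2.0)) for _ in range(n)]  # group 2, S2

# Censoring times c * Uniform(0, 1) with c = 3.2 (light censoring in S2).
obs1 = [(min(x, c), int(x <= c)) for x, c in ((x, 3.2 * rng.random()) for x in x1)]

surv1_half = sum(1 for x in x1 if x > 0.5) / n  # theory: exp(-1.5)
surv2_half = sum(1 for x in x2 if x > 0.5) / n  # theory: exp(-0.25)
print(round(surv1_half, 3), round(surv2_half, 3))
```

Group 1 has the lower survival before $t = 0.5$ and the lower hazard afterwards, which is what produces the mid-course crossing of the survival curves.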

Application
The procedures evaluated in the simulation study were applied to real-world data concerning vaginal cancer, a disease in which malignant cells grow abnormally in the vagina. Studies of the human disease report that vaginal cancer is predominantly a disease of older women, with approximately 50% of cases occurring in women over the age of 70 ([21]). In this work, data on vaginal cancer in female rats ([22]) were used, where rats were split into two groups by pretreatment regimen, with sample sizes $n_1 = 19$ and $n_2 = 21$. Two censored observations occurred in each group. The variable of interest was the time to death from vaginal cancer, or to censoring, following treatment of rats insulted with the carcinogen DMBA. Previous work found that the two-sample scale model described the effect of the pretreatment regimen well ([3,23]). The plot on the left in Figure 3 displays the estimated survival functions of the two groups. Table 4 summarizes point estimates, empirical p-values, and confidence intervals for the scale parameter with the weight functions considered. The results show that the vaginal cancer data can be reasonably fit by a two-sample scale model, $\Lambda_1(t) = \Lambda_2(\theta t)$, confirming the previous works ([3,23]). The weighted log-rank statistics in the table yield $\hat{\theta} \approx 1.1$, which implies a difference of approximately 10% in the effect of the pretreatment regimens on survival. It is questionable whether a 10% improvement in survival is significant. The plot on the right in Figure 3 shows $U_{\rho}^*(\theta)$ versus $\theta$, illustrating how the estimate of $\theta$ is obtained. The estimated cumulative hazard curves of the two groups are compared in Figure 4. The left panel depicts the estimated curves of the two groups, providing initial insight into their shape. The right panel presents the same curves with the time scale adjusted between the cumulative hazard functions; the two curves have approximately the same shape, which supports the suitability of the two-sample AFT model.

Concluding Remarks
The sensitivity of inference procedures based on the $G^{\rho}$ family of weighted log-rank tests depends on the choice of weight function. The tests may lose power when two hazard functions cross. For example, in a clinical study comparing two treatments that offer different benefits over time, one treatment could be effective immediately, whereas the other may have long-term effects. Thus, carefully chosen weight functions should be used in applications. This work modified the $G^{\rho}$ family of weighted log-rank tests so that it can be used for the AFT model, improving its performance in realistic situations. The procedures are based on a weight function that has the censoring proportion as a factor. Simulation results demonstrated that the modified weight function makes the $G^{\rho}$ family more dependable under realistic alternatives, including heavy censoring. The weight function in this work, with some modifications, could also be used to handle rare-event data for the AFT model. In addition, the weight function could be extended to censored regression with covariates. Finally, to deal with crossing hazards, versatile tests based on the simultaneous use of the weighted log-rank statistics associated with the weight function, such as Rényi-type tests (Gill, 1980), could be used instead of locally powerful tests.