The Inverse Log-Rank Test: A Versatile Procedure for Late Separating Survival Curves

Often in the planning phase of a clinical trial, a researcher will need to choose between a standard and a weighted log-rank test (LRT) for investigating right-censored survival data. While a standard LRT is optimal for analyzing evenly distributed but distinct survival events (proportional hazards), an appropriately weighted LRT may be better suited for handling non-proportional, delayed treatment effects. The "a priori" misspecification of this alternative may result in a substantial loss of power when determining the effectiveness of an experimental drug. In this paper, the standard unweighted and inverse log-rank tests (iLRTs) are compared with the multiple-weight, default Max-Combo procedure for analyzing differential late survival outcomes. Unlike combination LRTs that depend on the arbitrary selection of weights, the iLRT by definition is a single-weight test and does not require implicit multiplicity correction. Empirically, both weighted methods have reasonable flexibility for assessing continuous survival curve differences from the onset of a study. However, the iLRT may be preferable for accommodating delayed separating survival curves, especially when one arm finishes first. Using standard large-sample methods, the power and sample size for the iLRT are easily estimated without resorting to complex and time-consuming simulations.


Introduction
Delayed treatment effects are the most common type of non-proportional hazards arising in clinical trials, most notably for immunologic cancer drugs [1][2][3][4]. A certain period of exposure may be necessary before achieving a treatment response, owing to the mechanism of action for compounds like PD-1 or PD-L1 inhibitors. A small, insignificant difference between survival curves typically is observed initially, or, in some cases, the curves may even cross over up to a certain time point. Thereafter, the curves diverge and late separation occurs, manifesting a differential treatment effect. While a standard log-rank test (LRT) will remain valid for rejecting the null hypothesis of no survival difference and will control the Type I error rate, the procedure will not be uniformly most powerful when the hazards for the curves are non-proportional, as is the case for late-separating curves [5]. Importantly, power may not necessarily increase as the sample size becomes larger.
An LRT that assigns greater weight to events occurring later in the trial will be more sensitive to delayed treatment effects [6]. However, in the absence of "a priori" knowledge, finding a combination of weights that is best able to collectively accommodate various survival scenarios has been challenging [7]. Non-proportional hazards owing to differential censoring between treatment groups also pose a concern, especially when the censoring occurs with greater frequency toward the later part of the trial [8,9].
The inverse log-rank test (iLRT) is a computationally simple, single-weight procedure that is moderately robust in detecting late-occurring survival differences. Yet, this test also performs well under proportional hazards. We provide empirical examples to illustrate the novelty and versatility of this method in comparison with the multiple-weight "Max-Combo" procedure and the combinatoric-based "Split-Range" test.

Materials and Methods
2.1. Preliminaries

2.1.1. Hypergeometric Framework for Survival Time Data

Let time (t_i) ≥ 0 represent the pooled times at which participants in either Group 1 or Group 2 experience an event. Consider the layout in Table 1, where

d_i1 = # of events at (t_i) in Group 1;
d_i2 = # of events at (t_i) in Group 2;
d_i = d_i1 + d_i2;
R_i1 = # of participants available at (t_i) in Group 1;
R_i2 = # of participants available at (t_i) in Group 2;
R_i = R_i1 + R_i2.

Table 1. Events and non-events in the risk set at (t_i) by study group.

             Group 1          Group 2          Total
Event        d_i1             d_i2             d_i
Non-event    R_i1 - d_i1      R_i2 - d_i2      R_i - d_i
At risk      R_i1             R_i2             R_i

Under the null hypothesis that the sets of times in the two groups are equivalent, it follows that (d_i1), conditional on the marginal total (d_i), has a hypergeometric distribution [10][11][12]. Consisting of the sum of (R_i1) Bernoulli trials, each with a mean of d_i/R_i, the hypergeometric distribution is written as [13,14]

P(X_i = x_i) = C(R_i1, x_i) C(R_i2, d_i − x_i) / C(R_i, d_i),   (1)

where the random variable (x_i) denotes the number of events (d_i1) in Group 1 at each time point (t_i), and C(n, k) denotes the binomial coefficient. In many applied examples, the events of interest are deaths (d). The value for this variable is bounded below by max[0, R_i1 − (R_i − d_i)] and above by min[d_i, R_i1]. Given equal survival times, the probability of an event occurring at (t_i) is not contingent upon the group to which a patient belongs [15]. Observing that (x_i) is less than or equal to (R_i1), the number at risk in Group 1 at (t_i), it follows that [16]

∑_{x_i} C(R_i1, x_i) C(R_i2, d_i − x_i) = C(R_i, d_i),   (2)

so that the probabilities in Equation (1) sum to unity.
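As a concrete illustration of the hypergeometric probabilities described above, the following minimal Python sketch tabulates the density for a small, hypothetical risk set (the counts are invented for illustration only):

```python
from math import comb

def hypergeom_pmf(x, R1, R2, d):
    """P(X = x): probability that x of the d pooled events at a given time
    occur in Group 1, given R1 and R2 patients at risk (R = R1 + R2)."""
    return comb(R1, x) * comb(R2, d - x) / comb(R1 + R2, d)

# Hypothetical risk set: 10 at risk in Group 1, 12 in Group 2, 3 total events.
probs = [hypergeom_pmf(x, 10, 12, 3) for x in range(4)]
```

Summing the density over its support returns one, in agreement with the identity that the binomial-coefficient products sum to the total number of ways to choose the events.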

Expectation and Variance
The properties of the hypergeometric distribution are well described in the literature [13,14,17,18]. Briefly, the first raw moment of (X) gives the expected number of patients who experience an event at time (t_i) within a particular group, and is written as

E(X_i) = μ_x_i = R_i1 d_i / R_i.   (3)

The finite second factorial moment is obtained as

E[X_i(X_i − 1)] = R_i1(R_i1 − 1) d_i(d_i − 1) / [R_i(R_i − 1)].   (4)

Adding the first raw moment to this quantity and subtracting the square of the first raw moment gives the variance of (X) at time (t_i), i.e.,

Var(X_i) = σ²_x_i = R_i1 R_i2 d_i (R_i − d_i) / [R_i² (R_i − 1)].   (5)
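The closed-form mean and variance can be checked against direct enumeration over the hypergeometric density; the sketch below uses the same hypothetical risk-set counts as before:

```python
from math import comb

def hypergeom_moments(R1, R2, d):
    """Mean and variance of the number of Group 1 events at one time point,
    per the standard hypergeometric formulas."""
    R = R1 + R2
    mean = R1 * d / R
    var = R1 * R2 * d * (R - d) / (R**2 * (R - 1))
    return mean, var

# Check against direct enumeration over the probability mass function.
R1, R2, d = 10, 12, 3
pmf = [comb(R1, x) * comb(R2, d - x) / comb(R1 + R2, d) for x in range(d + 1)]
mean_direct = sum(x * p for x, p in enumerate(pmf))
var_direct = sum((x - mean_direct) ** 2 * p for x, p in enumerate(pmf))
mean, var = hypergeom_moments(R1, R2, d)
```

Both routes give the same mean (R_i1 d_i / R_i = 30/22) and variance for this toy risk set.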

Large Sample Properties
Under large sampling conditions, the null distribution for the hypergeometric test may be indirectly approximated by a Gaussian distribution [19]. As (d_i) and (R_i) approach infinity (with a fixed ratio), and assuming that (R_i − d_i) is relatively large with a fixed finite (x_i), the binomial coefficients in the hypergeometric density, which count the unordered ways to choose (x_i) from a set of (R_i1) elements (Pascal's triangle) [20], may be expanded using Stirling's formula [18,21]. The approximation becomes increasingly better as the ratio terms grow. Applying the series identity

∑_{j≥1} (−1)^{j+1} a^j / j = log(1 + a) for |a| ≤ 1,   (14)

combining terms while discarding contributions of order O(1/R_i), and assuming that d_i/R_i is neither close to 0 nor 1 while both R_i1 and R_i2 grow large, the discrete probability elements for each (X) at time (t_i) shrink infinitesimally to yield a symmetrical continuous density centered at (μ_x_i) with asymptotic points of inflection at μ_x_i ± σ_x_i:

P(X_i = x_i) ≈ (2π σ²_x_i)^(−1/2) exp[−(x_i − μ_x_i)² / (2σ²_x_i)].

A simple transformation gives

Z_i = (X_i − μ_x_i) / σ_x_i ≈ N(0, 1).

Noting that the finite-population correction term in the variance for the hypergeometric distribution asymptotically approaches unity, the corresponding variance for the Gaussian distribution becomes R_i1 (d_i/R_i)(1 − d_i/R_i), as expected for the sum of (R_i1) Bernoulli trials. Lastly, we mention that a more direct proof yielding the normal distribution can be obtained by rewriting the binomial coefficients in the hypergeometric distribution using the de Moivre-Laplace asymptotic formula and simplifying [22].
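The quality of the Gaussian approximation to the hypergeometric tail can be checked numerically. The sketch below (with hypothetical, moderately large risk-set counts) compares the exact cumulative probability with a continuity-corrected normal approximation built from the hypergeometric mean and variance:

```python
from math import comb, erf, sqrt

def hypergeom_cdf(x, R1, R2, d):
    """Exact P(X <= x) for the hypergeometric risk-set distribution."""
    R = R1 + R2
    return sum(comb(R1, k) * comb(R2, d - k) for k in range(x + 1)) / comb(R, d)

def normal_cdf_approx(x, R1, R2, d):
    """Gaussian approximation with the hypergeometric mean and variance
    and a 0.5 continuity correction."""
    R = R1 + R2
    mu = R1 * d / R
    var = R1 * R2 * d * (R - d) / (R**2 * (R - 1))
    return 0.5 * (1 + erf((x + 0.5 - mu) / sqrt(2 * var)))

# With a reasonably large risk set, the two probabilities agree closely.
exact = hypergeom_cdf(45, 100, 100, 90)
approx = normal_cdf_approx(45, 100, 100, 90)
```

For these counts the exact and approximate cumulative probabilities differ by only a few parts in ten thousand.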
2.1.4. Useful Approximations, Bounds, and Recursive Formulas

When (R_i > 50) and (d_i ≤ R_i1), a reasonable approximation for the hypergeometric probability, in terms of (d_i) Bernoulli trials with success probability (R_i1/R_i), is given as [14]

P(X_i = x_i) ≈ C(d_i, x_i) (R_i1/R_i)^{x_i} (1 − R_i1/R_i)^{d_i − x_i}.

Lower and upper bounds for the hypergeometric density, as a function of this binomial form, can be obtained by attaching exponential correction factors [18]; this readily follows from the inequality 1 − a ≤ exp(−a). In many cases, determining hypergeometric probabilities directly can be challenging. A convenient recursive equation is easily derived from the ratio of successive terms:

P(X_i = x_i + 1) = P(X_i = x_i) × (R_i1 − x_i)(d_i − x_i) / [(x_i + 1)(R_i2 − d_i + x_i + 1)].

Rearranging, we see that the entire density can be generated from a single starting value.
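The recursion over successive terms can be sketched as follows; starting from the lowest support point, each subsequent probability is obtained by multiplying by the ratio of adjacent terms, avoiding repeated evaluation of large binomial coefficients (the risk-set counts here are hypothetical):

```python
from math import comb

def hypergeom_pmf_recursive(R1, R2, d):
    """All pmf values via the term ratio
    f(x+1) = f(x) * (R1 - x)(d - x) / ((x + 1)(R2 - d + x + 1)),
    started from the lowest support point."""
    lo = max(0, d - R2)          # lower bound of the support
    hi = min(d, R1)              # upper bound of the support
    f = comb(R1, lo) * comb(R2, d - lo) / comb(R1 + R2, d)
    out = {lo: f}
    for x in range(lo, hi):
        f *= (R1 - x) * (d - x) / ((x + 1) * (R2 - d + x + 1))
        out[x + 1] = f
    return out

vals = hypergeom_pmf_recursive(10, 12, 3)
```

The recursively generated density matches the direct binomial-coefficient evaluation term by term and sums to one.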

Weighted Log-Rank Test
Consider (m) separate event time points (t_1 < t_2 < ... < t_m) and let (w_i) denote a non-disjoint, positive weight function that is appropriately bounded (detectable, nonzero measure) for each (i) value. The linear combination (∑_{i=1}^m w_i ξ_i) yields the weighted LRT, which defaults to the standard LRT when the weight function is equal to unity at each time point [23,24]. Because the moment generating function (MGF) for this linear combination is equal to the MGF of a normal distribution with mean (∑_{i=1}^m w_i μ_x_i) and variance (∑_{i=1}^m w_i² σ²_x_i), it holds that the linear combination is itself asymptotically normal with these moments, since no two distinct probability distributions can have the same moment generating function. Thus, under large sampling conditions, the standardized summation of (w_i ξ_i) over the (m) time points has an approximate standard normal distribution, i.e., N(0, 1), or, equivalently, by taking the square, a chi-square distribution with one degree of freedom.
Rewriting the weighted LRT as

ξ_w = ∑_{i=1}^m w_i (O_i − E_i) / [∑_{i=1}^m w_i² σ²_x_i]^{1/2},

where (O_i − E_i) denotes the deviation of the observed values (d_i1) from their expected values (μ_x_i), we see that the numerator of (ξ_w) corresponds to the weighted sum of conditionally independent and uncorrelated hypergeometric (asymptotically normal) random variables, with each term having a mean of zero under the null hypothesis of no treatment effect [10]. Since the event times are conditionally independent of one another and are functionally predictable (i.e., ξ_w is not contingent on outcomes that occur at or beyond t_i) [25], the variance of the numerator is simply equal to the sum of the variances for the individual [w_i(O_i − E_i)] terms [15]. Specifically,

Var[∑_{i=1}^m w_i (O_i − E_i)] = ∑_{i=1}^m w_i² Var(O_i) = ∑_{i=1}^m w_i² σ²_x_i,   (33)

as both the variance of (E_i) and the Cov(O_i, E_i) are equal to zero. Of further note, (ξ_w) remains the same if (w_i) is multiplied or divided by a scalar constant [26,27].
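The standardized statistic can be computed from a sequence of 2x2 risk-set summaries; the following minimal sketch uses hypothetical counts (d_i1, d_i2, R_i1, R_i2), one tuple per event time:

```python
from math import sqrt

def weighted_logrank(tables, weights):
    """tables: list of (d1, d2, R1, R2) risk-set summaries, one per event time.
    Returns sum w_i (O_i - E_i) / sqrt(sum w_i^2 var_i), the standardized
    weighted log-rank statistic."""
    num = 0.0
    var = 0.0
    for (d1, d2, R1, R2), w in zip(tables, weights):
        d, R = d1 + d2, R1 + R2
        E = R1 * d / R                                      # expected Group 1 events
        v = R1 * R2 * d * (R - d) / (R**2 * (R - 1)) if R > 1 else 0.0
        num += w * (d1 - E)
        var += w * w * v
    return num / sqrt(var)

# Unit weights recover the standard (Mantel-Haenszel) log-rank statistic.
tables = [(1, 0, 5, 5), (0, 1, 4, 5), (1, 1, 4, 4), (0, 1, 3, 3)]
z = weighted_logrank(tables, [1.0] * len(tables))
```

Note that rescaling all weights by a common constant leaves the statistic unchanged, consistent with the scalar-invariance property mentioned above.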
Applying the conditional central limit theorem (assuming the exchangeability of elements and Lindeberg's sufficiency conditions for martingales: finite variance, tightness, and uniform integrability), it follows that (ξ_w) is asymptotically consistent and, upon squaring, weakly convergent in distribution to a chi-square distribution with 1 degree of freedom, even when the individual terms are not necessarily identically distributed [28][29][30][31][32][33]. Thus, the conditional central limit theorem aligns with the abovementioned MGF approach for defining the large sample distribution of (ξ_w), but with less stringent conditions that are better suited for real-world applications [34]. Nonetheless, the small-sample behavior in both scenarios may be difficult to anticipate in practice, especially for highly censored and sparse-tailed data [35].

Selection of Weights
Various choices for (w_i) have been proposed in the literature. A popular selection is to set (w_i) equal to 1, which gives the standard Mantel-Haenszel LRT without continuity correction [23]. While this option is fairly robust for detecting survival curve differences, especially in the case of proportional hazards, there is no universal consensus regarding the best weight or combination of weights to use when the hazards (for the two groups under comparison) are not constant over time, as is the case for late separating survival curves. One flexible option is the two-parameter Fleming-Harrington (FH) weight, with (w_i) defined as

w_i = [S̃(t_i−)]^ρ [1 − S̃(t_i−)]^γ,

where S̃(t−) is the left-continuous product-limit estimate, and (ρ ≥ 0, γ ≥ 0) [29]. Here, G(ρ = 0, γ = 0), G(ρ > 0, γ = 0), G(ρ > 0, γ > 0), and G(ρ = 0, γ > 0) purportedly correspond to "evenly distributed", "early", "mid", and "late" treatment effects, with G(ρ = 0, γ = 0) denoting the standard LRT and G(ρ = 1, γ = 0) denoting the Prentice-Wilcoxon statistic. Barring prior knowledge, the selection of (ρ) and (γ) is largely arbitrary. Arguably, certain weights may lack clinical relevance, focusing only on a specific portion of a survival curve with low event rates or diminishing treatment effects.
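The FH weights can be generated directly from the pooled product-limit estimate; a minimal sketch, again assuming hypothetical risk-set summaries (d_i1, d_i2, R_i1, R_i2) per event time:

```python
def fh_weights(tables, rho, gamma):
    """Fleming-Harrington weights S(t-)^rho * (1 - S(t-))^gamma, using the
    left-continuous pooled Kaplan-Meier estimate over the risk-set tables."""
    weights = []
    S = 1.0  # pooled survival just before the first event time
    for d1, d2, R1, R2 in tables:
        d, R = d1 + d2, R1 + R2
        weights.append(S**rho * (1.0 - S)**gamma)
        S *= 1.0 - d / R  # step the Kaplan-Meier estimate past this event time
    return weights

tables = [(1, 0, 5, 5), (0, 1, 4, 5), (1, 1, 4, 4), (0, 1, 3, 3)]
w_logrank = fh_weights(tables, 0, 0)   # all ones: the standard LRT
w_late = fh_weights(tables, 0, 1)      # increasing: up-weights late events
```

Setting ρ = γ = 0 reproduces unit weights, while γ > 0 yields a weight sequence that grows over time, emphasizing late-occurring events.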
A compromise entails taking the maximum of the standardized statistics for a preset combination of FH-LRT values for (ρ) and (γ). Dividing the difference vector by the corresponding square root of Fisher's information matrix (a non-singular, uniformly minimum variance unbiased estimator), the resultant statistic asymptotically assumes a multivariate Gaussian distribution [36]. Known as the "Max-Combo" method, the test accommodates various treatment effects by selectively up- or down-weighting the log-rank statistics over time [37]. In general, combination approaches are more powerful than the standard LRT under a range of non-proportional hazard conditions [38,39]. The critical value (c_α) for a (k)-component Max-Combo test (Z_Max^k) is defined such that

P(Z_Max^k > c_α | H_0) = α,

with the probability evaluated under the joint multivariate normal null distribution of the components. A commonly used combination is {G(0,0), G(1,0), G(0,1), G(1,1)}, with this four-component Z_Max^4 traditionally being designated as the default set of weights. The Max-Combo test has been shown to perform well in many applied examples with non-proportional hazards [40]. However, under moderate to heavy censoring, and noting the potentially high correlation among weighted LRTs, the family of combination procedures (including the Max-Combo test) may not be more versatile than the individual component LRTs [8]. The extension to a group sequential analysis allows the Max-Combo procedure to accommodate multiple time point decisions, with the test statistic assuming a joint normal distribution under the null hypothesis (per the application of Slutsky's theorem) [41][42][43][44].
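Because the components are jointly asymptotically normal, the critical value for the maximum can be approximated by Monte Carlo once their correlation matrix is known. The sketch below assumes a hypothetical equicorrelated matrix purely for illustration; in practice the correlations would be estimated from the weighted variance-covariance terms:

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical correlation among k = 4 component statistics; in applications
# it is estimated from the data, and component LRTs are often highly correlated.
k, rho = 4, 0.8
corr = np.full((k, k), rho) + (1 - rho) * np.eye(k)

# Monte Carlo estimate of c_alpha such that P(max_j |Z_j| >= c_alpha) = alpha
# under H0, where Z ~ N(0, corr).
draws = rng.multivariate_normal(np.zeros(k), corr, size=200_000)
c_alpha = np.quantile(np.abs(draws).max(axis=1), 0.95)
```

The resulting critical value necessarily falls between the single-test value (1.96 under perfect correlation) and the larger value required for four independent components, illustrating the implicit multiplicity correction built into the procedure.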

Inverse Log-Rank Test
A key constraint of the Max-Combo test in practical applications is that the null hypothesis can be rejected in favor of both the experimental and reference arms for an identical set of observations [45]. That is, when survival curves cross and one wishes to test the superiority of Treatment A, it is possible for the Max-Combo method to reject the null hypothesis in favor of Treatment A; in contrast, if the objective is to test the superiority of Treatment B, then the Max-Combo method could conceivably yield the opposite conclusion given the same data (i.e., reject the null hypothesis in favor of Treatment B). Alternatively, the iLRT presents a single-weight LRT for analyzing non-proportional hazard survival curves [46].
Based on a smoothed, non-negative function of sample values that converges in probability to its true state, the weight is taken as the inverse logarithm of the combined number of patients at risk at each of the (m) study time points,

w_i = 1 / log(R_i).

The iLRT is defined as

ζ_w = [∑_{i=1}^m w_i (O_i − E_i)]² / ∑_{i=1}^m w_i² σ²_x_i,

with the p-value (two-sided test) estimated as

p = ∫_{ζ_w}^∞ x^{−1/2} e^{−x/2} / (√2 √π) dx.

As (√π) in the denominator of the last equation is equal to Γ(1/2), the integrand corresponds to the probability density function of a chi-square distribution with 1 degree of freedom. Being a score statistic, which can alternatively be expressed as a discrete-time, partial likelihood function, (ζ_w) easily accommodates censored data [47,48].
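A minimal sketch of the computation follows, assuming (as read from the definition above) the weight w_i = 1/log(R_i) applied to hypothetical risk-set summaries; the two-sided p-value is obtained from the chi-square distribution with 1 degree of freedom via the standard normal tail:

```python
from math import log, sqrt, erf

def inverse_logrank(tables):
    """Inverse log-rank statistic with weight w_i = 1 / log(R_i), where R_i is
    the combined number at risk. Returns (chi-square value, two-sided p-value)."""
    num = 0.0
    var = 0.0
    for d1, d2, R1, R2 in tables:
        d, R = d1 + d2, R1 + R2
        w = 1.0 / log(R)                                    # inverse-log weight
        E = R1 * d / R
        v = R1 * R2 * d * (R - d) / (R**2 * (R - 1)) if R > 1 else 0.0
        num += w * (d1 - E)
        var += w * w * v
    chi2 = num * num / var
    # Chi-square(1 df) upper tail equals 2 * (1 - Phi(sqrt(chi2))).
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(chi2) / sqrt(2.0))))
    return chi2, p

tables = [(1, 0, 5, 5), (0, 1, 4, 5), (1, 1, 4, 4), (0, 1, 3, 3)]
chi2, p = inverse_logrank(tables)
```

Because the weight varies slowly in (R_i), the statistic behaves much like the standard LRT early on while still up-weighting the sparser, later risk sets.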

Split-Range Test
Consider the special case of a 2-arm, randomized clinical trial where all of the patients in the comparison arm (Group 2) achieve the event of interest by a certain time, while some of the patients in the test arm (Group 1) have survival times beyond this time point. A p-value for testing the null hypothesis (H_0) of no survival differences between the groups may be computed using the split-range test (SRT) [49]. In this non-parametric method, designate the number of patients in Group 1 as (n_1) and the number in Group 2 as (n_2 = N − n_1), with (N) denoting the total sample size. This is equivalent to the Fermi-Dirac "ball and cell" model, where (n_2) balls are randomly dropped into (N) cells (corresponding to ranked survival times), allowing one ball per cell. Numbering the cells from 1 to (N), the range (R) is defined as the number of the highest occupied cell minus that of the lowest occupied cell. The value for the range must be a number from (n_2 − 1) to (N − 1). To test (H_0) with a Type I error rate of (α) for falsely rejecting the null hypothesis, find the largest integer (ϕ) such that P(R ≤ ϕ | H_0) ≤ α, and reject (H_0) if the observed value of (R) does not exceed (ϕ). When censored values occur in Group 1 before the last event occurs in Group 2, then (α) denotes an upper bound. That is, some of the true survival times for these censored observations may be longer than all of the event times in Group 2; by decreasing the range, this results in a smaller p-value.
Analogously, the split-range test can be applied in reverse by randomly dropping the (n_1) balls into the (N) cells. Again, the range is defined as the number of the highest occupied cell minus that of the lowest occupied cell. A non-directional test is obtained by simultaneously considering both cases and multiplying (α) by two to adjust for multiplicity.
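Under the ball-and-cell model, the range distribution has a simple closed form: choosing the lowest occupied cell (N − r ways for range r) and placing the remaining n_2 − 2 balls among the r − 1 cells strictly between the extremes gives P(R = r) = (N − r) C(r − 1, n_2 − 2) / C(N, n_2). A minimal sketch of the cumulative distribution (with hypothetical N and n_2):

```python
from math import comb

def split_range_cdf(N, n2):
    """P(R <= r) for the range R of n2 occupied cells among N ranked cells:
    P(R = r) = (N - r) * C(r - 1, n2 - 2) / C(N, n2), r = n2 - 1, ..., N - 1.
    Requires n2 >= 2."""
    total = comb(N, n2)
    cdf, acc = {}, 0
    for r in range(n2 - 1, N):
        acc += (N - r) * comb(r - 1, n2 - 2)
        cdf[r] = acc / total
    return cdf

cdf = split_range_cdf(20, 5)
```

The p-value for an observed range is then read directly from the cumulative distribution, with the cumulative probability at the maximum range equal to one.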

Computational Details
p-Values for the weighted Fleming and Harrington LRT were computed using the "Test=FH" option in the strata statement of the LIFETEST procedure in SAS v.9.4 software (Cary, NC, USA), while p-values for the Max-Combo procedure were obtained iteratively [50]. The SAS code for performing the iLRT is provided in Appendix A. In most cases, the computational run time for the iLRT is approximately 4-fold faster (or more) than that of the default 4-component Max-Combo test.
p-Values ≤ 0.05 were deemed to be statistically significant. Unless otherwise indicated, computed values were presented to two significant digits using the Goldilocks (Efron-Whittemore) rounding method, rather than a fixed number of decimal places [51].

Examples
Four examples are presented in this section comparing the results of the Prentice-Wilcoxon, standard Mantel (unweighted), combination Max-Combo (default four-component), and inverse log-rank tests. The combinatoric SRT is presented as a non-LRT comparison in the fourth example. Kaplan-Meier (product-limit) plots are provided for each example (see Figure 1). Summary computational results of the iLRT for the four examples are shown in Table 2.

Example 1
In this non-randomized cohort of n = 157 emulated patients with metastatic (stage IV), non-squamous cell lung cancer (NSCLC), who failed to respond to conventional chemotherapy, 75 opted to receive an experimental immune therapy compound (Group 1) versus 82 who were provided hospice care (Group 2) [46]. Among the 75 patients in the first group, 11 had censored outcomes, while all of the patients in Group 2 experienced an event (Table 2). Soon after the second month, a noticeable late survival advantage materialized for the experimental group, while those in the hospice group continued to decline (see Kaplan-Meier plot for Example 1). Notably, the Kaplan-Meier curves otherwise crisscrossed for the first two months before diverging. The median survival time for Group 1 was slightly higher than that for Group 2 (0.69 versus 0.65 months). Only the iLRT yielded a statistically significant survival group difference (p = 0.029). Although the default Max-Combo failed to achieve statistical significance (p = 0.071), several individual FH-LRT values for (ρ) and (γ) had correspondingly lower p-values than the iLRT, with the minimum being observed for (ρ = 0, γ = 5; p = 0.015) (Table 3). That is, the power of the Max-Combo test in a specific scenario may not exceed that of its component FH test statistics [52].

Example 2
The objective of Example 2 is to demonstrate the non-significant difference between the two treatment arms in Example 1, prior to their point of separation. As expected, upon deleting observations occurring after 1.9 months, none of the LRTs in this example had statistically significant p-values. The highest p-value corresponded to the iLRT (p = 0.83), followed by the default Max-Combo test (p = 0.51).

Example 3
An important characteristic of an omnibus LRT is the ability to accommodate late separating survival curves, while also having power to detect significant differences occurring from the beginning of a study. Example 3 elaborates on the comparative analysis of two cancer therapies, historically presented by Brown and Hollander [53]. Referring to the Kaplan-Meier plots for this example, we see that the treatment curves are relatively parallel, suggesting proportional hazards over time. Both the standard Mantel LRT (p = 0.0012) and Prentice-Wilcoxon LRT (p = 0.0010) are statistically significant, while the iLRT (p = 0.0017) and the default Max-Combo test (p = 0.0021) yield comparable levels of statistical significance, though to a slightly lesser degree.

Example 4
Example 4 illustrates a special case of late separating survival curves, as originally presented by the author [49]. In this analysis, all the patients in the comparison arm (Group 2) experience the event of interest, while 11 of the patients in the experimental treatment arm (Group 1) have survival times greater than the last event in Group 2 at 9.5 years. Accordingly, the SRT is applicable in this example and yields a p-value of between 0.0025 and 0.0050, as there is one censored value at 3.0 years that occurs in Group 1 before the last event in Group 2. While all of the values in Group 1 beyond the completion of Group 2 are censored, an equivalent p-value would have been obtained for this degenerate case even if one or more of these censored values were events (which is the case for LRTs in general).
In this example, the p-value obtained for the SRT is comparatively close to those of the iLRT (p = 0.0011) and the default (four-component) Max-Combo procedure (p = 0.012), with the iLRT yielding the most statistically significant value. The cumulative frequency for the split-range test given n = 100 and N = 200 is provided in Table 4.

Comparison with the Cox Regression Model
In Examples 1, 2, and 4, which depict non-proportional hazards, the corresponding hazard ratios (HRs) and significance levels (estimated by a Cox regression model) were 1.2 (p = 0.27), 1.0 (p = 0.88), and 1.2 (p = 0.28), respectively. In contrast, the hazards for the two survival curves shown in Example 3 were relatively constant over time (HR = 0.22) and manifested a p-value of 0.0030, being slightly less significant but comparable to the iLRT (p = 0.0017) and default Max-Combo procedure (p = 0.0021).

Sample Size and Power Methodology
To compute the sample size and power for a planned trial (i.e., how frequently a test will detect the falsehood of an underlying hypothesis when it is wrong), we note that [54]

1 − β = Φ(Ψ − Z_α/2),

where (Ψ) is the standardized test statistic for the iLRT under the alternative hypothesis, (Φ) is the standard normal cumulative distribution function, and (Z_α/2) denotes the 100(1 − α/2) percentile of a standard normal distribution, and proceed in a manner comparable to Garès and colleagues [55]. Specifying the desired power as (1 − β) for an α-level (two-sided) test of significance, the respective sample size for Group 1 (with N_Total = 2 × N_Group1) is given as

N_Group1 = [(Z_α/2 + Z_β) / Δ]²,

where (Δ) denotes the standardized per-patient effect size derived from the weighted observed-minus-expected deviations and their variances, and (Z_β) denotes the 100(1 − β) percentile. Rearranging the formula for sample size, we see that

1 − β = Φ(√N_Group1 × Δ − Z_α/2).
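These generic large-sample relations can be sketched in a few lines; the per-patient standardized effect size (here called delta) is an assumed input, and the 0.25 used below is purely illustrative:

```python
from math import sqrt, erf

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def inv_phi(p, lo=-10.0, hi=10.0):
    """Inverse of phi via bisection (phi is strictly increasing)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sample_size_per_arm(delta, alpha=0.05, power=0.90):
    """Generic large-sample relation N = ((z_{1-a/2} + z_{1-b}) / delta)^2,
    with delta the assumed per-patient standardized effect size."""
    return ((inv_phi(1 - alpha / 2) + inv_phi(power)) / delta) ** 2

def power_for_n(n, delta, alpha=0.05):
    """Inverse relation: power = Phi(sqrt(N) * delta - z_{1-a/2})."""
    return phi(sqrt(n) * delta - inv_phi(1 - alpha / 2))

n = sample_size_per_arm(0.25)          # illustrative delta = 0.25
achieved = power_for_n(n, 0.25)        # round-trips to the target power
```

Solving for N at 90% power and then plugging that N back into the power formula recovers 0.90, confirming that the two relations are algebraic rearrangements of one another.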

Sample Size and Power Example
In Example 1, the results of a non-randomized cohort were presented where a new experimental compound was compared with hospice care for late stage, refractory lung cancer.Based on the promising findings from this study, a pharmaceutical company would like to conduct a Phase-3 clinical trial randomizing an equal number of patients to the two treatment groups.
Specifically, the company wishes to reject the null hypothesis of equivalent survival times between the two arms of the planned study with a probability of 90% (given that the survival curves are truly different) and a Type I (two-sided) error rate of 5%. Plugging in the numbers from the first row of Table 2 yields the required sample size per arm. Upon being informed of the sample size, management decided that the cost to conduct the trial would be too high. Instead, they suggested a trial of no more than 144 patients per arm and asked the statistician to determine the corresponding statistical power.

Overview
The choice of weights for an LRT is arbitrary and largely predicated on the efficiency to detect treatment differences [56]. Under the null hypothesis, optimal "pre-specified" weights are a function of the total number of participants at risk at the time of a respective event and are estimated from the data [57]. While weighted rank tests are valid under unequal censoring, the asymptotic relative efficiency of the test statistic depends on the censoring distribution. A weighted LRT should be reasonably robust to unequal right-censoring, whereas permutation tests may fail to provide suitable approximations [58]. In such cases, the permutation-computed variance may underestimate the true variance when censoring is unequal [59]. Additionally, the analysis of arbitrarily interval-censored survival data requires special techniques beyond those discussed here [60,61].
The optimal weight or combination of weights for an LRT has a defined power advantage, contingent upon advance knowledge of when the survival curve separation will occur (e.g., early, mid, or late). Thus, the ideal selection depends on the data, knowledge of which may not be feasible before the completion of a study. While pilot data or results from comparable studies can be helpful in the decision-making process, there is no guarantee that a planned study will behave similarly. While several researchers have proposed adaptively choosing weights as a function of the data [38,47,62], the properties of such tests may be challenging to predict, and they may have less power when compared with the traditional unweighted LRT under proportional hazards [25].
The iLRT is nearly as powerful as the standard LRT under proportional hazards. Yet, the iLRT is more sensitive to the time-dependent, non-proportional hazards observed for differential or single-arm delayed treatment effects. When an investigator is uncertain in advance about the shape of the survival curves, it is not apposite to select an LRT after the data have been collected, as the analytic method should be clearly specified in the protocol prior to the initiation of a study. One option is to select a combination of FH weights in the form of the Max-Combo test. While this procedure performs reasonably well, as previously noted, it is possible to reject the null hypothesis both in favor of and against a particular treatment for the same data [45]. Combination tests also may have diminished power, albeit marginally, to detect treatment differences, resulting from the implicit multiplicity correction required by the procedure. As a single-weight method, the iLRT does not require adjustment for multiple testing and provides a flexible and non-subjective means for analyzing both continuously separating and late separating survival curves. However, if the investigator is certain of the shape of the survival curves in advance, then an appropriately parametrized FH-LRT may present the optimal choice for the planned analysis.

Efficiency
The chi-square statistic (ζ_w), built from the weighted event rates (w_i d_i/R_i), is a minimum chi-square, best asymptotic normal (BAN) estimator for [E_i = w_i R_i1 d_i/R_i], providing that it is a consistent estimate of the latter and asymptotically normal under large sample conditions (with properties akin to the maximum likelihood estimator and Fisher's information loss, albeit based on cell frequencies vs. original observations) [63][64][65]. Among all such asymptotically normal estimates within a multinomial framework, none have a smaller variance [66]. As such, (ζ_w) belongs to a class of tests which are unbiased and equivalent in the limit to Neyman's λ-test [54,67]. While tests within this family have comparable or greater power against Pitman alternatives (i.e., asymptotic relative efficiency), there is no guarantee that the statistic converges to a normal distribution at a reasonably fast rate, especially when observations are sparse toward the extreme right tail, with manifest censoring [68][69][70][71]. For Type II right-censored data with a presumed number of events, the total time of the trial is unknown until the last event occurs (versus trials with a fixed time of termination) [72]. Nonetheless, both types of censoring may lead to unreliable inferences and are challenging to model if censoring is sporadic, non-stationary, or a differential censoring mechanism exists between the two arms of a trial [73]. The misspecification of weights with respect to censoring or premature withdrawals can have undesirable and difficult-to-predict consequences on test efficiency and power, especially in the presence of incomplete data.

Lakatos-Cantor Method for Computing Power
In practice, an alternative method for computing the power of weighted LRTs exists that only requires specifying the survival probabilities at designated times for the two arms being compared. This method (based on a seminal paper by Lakatos in 1988 and later simplified by Cantor for practical application) involves partitioning the study period into a set number of subintervals [74,75]. The survival distribution for each treatment group is approximated by a piecewise linear curve, with the respective hazard at each time point estimated by linear interpolation. A Markov chain process is used to model state transitions of events across time. When both the sample size and corresponding number of subintervals are reasonably large, the power obtained by this method will tend toward that described in Section 4 [76].
The advantage of the piecewise linear approach for determining power is that one can visually estimate the required survival probabilities from published Kaplan-Meier curves or, alternatively for smaller sample sizes, by the Nelson-Aalen method [77]. Furthermore, computer packages for implementing the Lakatos model, allowing for user-provided LRT weights, are readily available [78,79]. The main limitation of this method lies in partitioning the study period into subintervals (i.e., discretizing continuous data into bins), particularly when the number of subintervals is small. In this case, the resulting values within each subinterval can vary depending on how the boundaries for the subintervals are chosen and potentially bias the analysis (i.e., the "Mendel effect") [80,81]. Implementing a prescribed algorithm to choose the interval widths alleviates this concern to some degree. However, there is no consensus on the optimal vs. practical approach for binning, with some historic and hitherto commonly used procedures lacking statistical consistency [82][83][84][85][86].
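A heavily simplified sketch of the piecewise idea follows, assuming no censoring or staggered accrual and using unit log-rank weights; the survival probabilities at the interval boundaries are hypothetical values of the kind one might read off a published Kaplan-Meier plot:

```python
from math import sqrt, erf

def lakatos_power(surv1, surv2, times, n_per_arm, alpha=0.05):
    """Sketch of a piecewise (Lakatos-style) power computation: expected
    events per interval are derived from the survival probabilities at the
    boundaries, and the expected (O - E) and variance are accumulated.
    Simplifying assumptions: no censoring, no accrual, unit weights."""
    z_a = 1.959964  # two-sided 5% critical value
    at_risk1 = at_risk2 = float(n_per_arm)
    num = var = 0.0
    for k in range(1, len(times)):
        # expected events in each arm during interval k
        e1 = at_risk1 * (1 - surv1[k] / surv1[k - 1])
        e2 = at_risk2 * (1 - surv2[k] / surv2[k - 1])
        d, R = e1 + e2, at_risk1 + at_risk2
        if d > 0 and R > 1:
            num += e1 - at_risk1 * d / R
            var += at_risk1 * at_risk2 * d * (R - d) / (R**2 * (R - 1))
        at_risk1 -= e1
        at_risk2 -= e2
    z = abs(num) / sqrt(var)
    return 0.5 * (1 + erf((z - z_a) / sqrt(2)))  # approximate power

times = [0, 6, 12, 18, 24]
surv1 = [1.0, 0.90, 0.80, 0.72, 0.65]  # hypothetical survival probabilities
surv2 = [1.0, 0.85, 0.65, 0.50, 0.40]
p_small = lakatos_power(surv1, surv2, times, 50)
p_large = lakatos_power(surv1, surv2, times, 200)
```

As expected, the approximate power increases monotonically with the per-arm sample size for a fixed pair of survival curves.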

Interim Power and Sample Size Re-Estimation
While event-level information often is not available during the planning stage of a clinical trial, investigators typically will have access to published Kaplan-Meier survival plots from previous studies [78]. A stop-gap measure, pending the availability of more precise information, involves initially estimating power using the Lakatos-Cantor method and then re-estimating the power and sample size at an interim point, implementing the iLRT method described in Section 4. Provided that the investigator and other members of the study team remain blinded, there is no need to apply a p-value penalty for each interim look at the data.
A first interim analysis typically is conducted after more than half of the planned events in the trial have been observed, with less than ~6% (or a predetermined percentage) of participants being lost to follow-up or censored early. In some cases, if allowed by the protocol and appropriately penalized, the unblinded "data monitoring committee statistician" may recommend a second sample size re-estimation after 75% of the planned events have occurred, since sample sizes may have to be adjusted depending upon the point of late separation for the survival curves. Of note, "writing back" the time of censoring to the time of an earlier administrative event can lead to an artifactual late separation of survival curves or unintended differential bias [87].

Potential Sources of Bias
Analogous to the broad class of tests for comparing survival time differences between the two arms of a study, the iLRT may yield biased results if censoring is related to prognosis or if survival probabilities are not stationary and instead depend upon when a participant is recruited into the clinical trial [88]. Likewise, the iLRT may experience a significant loss of power if competing risks are not independent or censoring is informative (i.e., a correlation exists between censoring and the event of interest) [89]. Examples include drug withdrawal attributable to a lack of efficacy or intolerability. Furthermore, as a test of statistical significance, the iLRT is not designed to estimate the effect size for a treatment difference between groups or to compute confidence intervals for an effect [88].
While the objective of the iLRT is to reduce the false negative rate while achieving a statistically significant result, the procedure may experience a slight loss of power in the case of diminishing treatment effects, where the survival curves initially diverge but converge back together over time. If this is anticipated and the clinician has a specific interest in diminishing treatment effects, then the Max-Combo or FH(1,0) tests may represent a better choice for accommodating this possibility. When the curves extend beyond the point of diminishing treatment effect and then cross over, this poses interpretational challenges that may be best handled as a post hoc stratified analysis. The latter scenario merits exploring the underlying reasons for the crossing-over and any subgroup effects (e.g., potential treatment switching) before reaching any conclusions [7,90]. In the case of crossing hazards, a two-sample semiparametric procedure has been proposed as an alternative analytic approach [91]. Investigators also may consider the use of a "standard of care" reference arm with a comparable hazard pattern.
A weighted LRT that is not consistent under stochastic ordering may not necessarily control the Type I error rate [92]. In Example 4, with the SRT as a comparison technique, we provide a heuristic argument that both the iLRT and default Max-Combo test independently control Type I error to within an absolute difference less than or equal to 0.0039 in the case of late separating survival curves, while preserving the false positive rate under proportional hazards (Example 3). Analogous to the consistent Prentice-Wilcoxon statistic, the weight for the iLRT is based upon the number of participants at risk for each time point. By taking the logarithm of the number at risk and scaling accordingly, the iLRT is bounded above by the Prentice-Wilcoxon test.
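The at-risk-based weighting just described can be illustrated with the generic two-sample weighted log-rank statistic. This is a minimal, self-contained sketch, not the published iLRT implementation [75]; the `weight` callback and the log-of-number-at-risk example below are illustrative assumptions, since the exact iLRT scaling is defined in the methods above:

```python
import math

def weighted_logrank(times1, events1, times2, events2, weight=None):
    """Two-sample weighted log-rank Z statistic.

    events are 1 (event) or 0 (censored); `weight` maps the total number
    at risk at an event time to a weight (default 1, the standard LRT).
    """
    if weight is None:
        weight = lambda n_at_risk: 1.0
    rows = [(t, e, 0) for t, e in zip(times1, events1)]
    rows += [(t, e, 1) for t, e in zip(times2, events2)]
    rows.sort()
    n1, n2 = len(times1), len(times2)
    num = var = 0.0
    i = 0
    while i < len(rows):
        t = rows[i][0]
        d1 = d2 = r1 = r2 = 0          # events and removals at time t
        while i < len(rows) and rows[i][0] == t:
            _, e, grp = rows[i]
            if grp == 0:
                d1 += e; r1 += 1
            else:
                d2 += e; r2 += 1
            i += 1
        n, d = n1 + n2, d1 + d2
        if d > 0 and n > 1:
            w = weight(n)
            # observed minus expected events in group 1, hypergeometric variance
            num += w * (d1 - d * n1 / n)
            var += w * w * d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
        n1 -= r1
        n2 -= r2
    return num / math.sqrt(var)
```

For instance, `weighted_logrank(t1, e1, t2, e2, weight=lambda n: math.log(n))` applies a log-of-number-at-risk weight in the spirit of the iLRT, whereas `weight=lambda n: n` recovers a Gehan-Breslow-type emphasis on early times.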
When censoring is not under the control of the investigator, censored participants may not have the same future risk of the outcome event as non-censored participants [93]. Consequently, there may not be a one-to-one correspondence between cause-specific hazard and cumulative incidence [94]. Such informative censoring can occur under competing risks and potentially bias risk estimates [95]. Unfortunately, commonly used methods to account for competing risks depend on the hazards being proportional, which may not always be the case when using the iLRT or other weighted procedures [96]. When appropriate, competing risks can be treated as random effects in a multilevel, mixed-effects model.

Sparseness of Data and Small Sample Sizes
The iLRT may lack statistical power if few events accompany the divergence of treatment hazards or censoring is heavy [97]. Sparseness in the tails of the survival curves at the time of interim analysis also can hinder reliable sample size re-estimation. As asymptotic theory was used to establish limiting formulas, the small-sample behavior of the iLRT may be uncertain in such cases. With sparse data, bootstrapping or permutation methods may be considered for validating the model robustness of the iLRT.
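A label-permutation check of this kind is straightforward to sketch. The fragment below is a generic illustration, assuming a user-supplied two-sample statistic (here a hypothetical `stat` callable, which in practice could be the iLRT Z statistic); it is not tied to any particular published implementation:

```python
import random

def permutation_pvalue(times1, times2, stat, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a two-sample statistic.

    Under H0 (no treatment difference), group labels are exchangeable,
    so labels are repeatedly shuffled and the statistic recomputed.
    Uses the standard add-one correction (hits + 1) / (n_perm + 1).
    """
    rng = random.Random(seed)
    observed = abs(stat(times1, times2))
    pooled = list(times1) + list(times2)
    k = len(times1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(stat(pooled[:k], pooled[k:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With censored data, the pooled entries would carry (time, event) pairs rather than bare times, but the shuffling logic is unchanged.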

Computational Barriers
Standard commercial software to compute power for weighted LRTs using the Lakatos-Cantor method generally is limited to a few weight options (e.g., standard log-rank, generalized Wilcoxon/Gehan-Breslow, and Tarone-Ware). However, a downloadable computer algorithm to compute the Lakatos-Cantor method for the iLRT and other user-specified weights is available online [75].
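The large-sample reasoning behind such power calculations can be sketched in a few lines. The fragment below is a generic normal-approximation sketch, not the Lakatos-Cantor algorithm itself: it assumes the test statistic is approximately N(θ·√d, 1) under the alternative, where d is the number of events and θ is a per-event drift that would, in practice, come from a Lakatos-type Markov computation over the assumed hazard and censoring pattern:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_from_drift(theta, n_events, alpha_z=1.959963984540054):
    """Approximate power of a two-sided test whose statistic is
    ~ N(theta * sqrt(n_events), 1) under the alternative
    (alpha_z is the 0.975 normal quantile for a 0.05-level test)."""
    mu = theta * sqrt(n_events)
    return norm_cdf(mu - alpha_z) + norm_cdf(-mu - alpha_z)

def events_for_power(theta, target=0.80):
    """Smallest event count achieving at least the target power."""
    d = 1
    while power_from_drift(theta, d) < target:
        d += 1
    return d
```

For example, with a drift of θ = 0.2 per event, roughly 196-197 events are needed for 80% power at the two-sided 0.05 level, in line with the familiar (z<sub>0.975</sub> + z<sub>0.80</sub>)²/θ² rule of thumb.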

Future Directions
The basis of this manuscript relies on selected empirical examples to support the use of the iLRT. Other situations may necessitate a different approach, and future research will help to delineate the most appropriate solution, such as adaptive or machine learning strategies [8,38,47]. The restricted mean survival time (RMST) method, which visually corresponds to the area under the Kaplan-Meier curve for a specified time period τ, is another method for analyzing non-constant hazards and may be useful as a secondary analysis [45,98]. However, this technique depends on the arbitrary choice of τ. The misspecification of this value can yield statistically significant but clinically irrelevant results by focusing only on a particular region of the survival curves. Exploring an assortment of data-driven τ points and accounting for these choices when estimating statistical significance is a promising area of ongoing research. Piecewise proportional hazard models also may be a good choice in some cases [99,100], and the hyperbolic cosine and logistic-like weight functions have received mention in the literature [52].
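The RMST described above is simply the area under the Kaplan-Meier step function truncated at τ. The following is a minimal self-contained sketch of that computation (standard Kaplan-Meier and trapezoid-free step-function integration, written here for illustration):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (event_time, S(t)) step points.
    events are 1 (event) or 0 (censored)."""
    data = sorted(zip(times, events))
    n = len(data)
    s = 1.0
    steps = []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i          # everyone still in the risk set at t
        d = 0
        while i < n and data[i][0] == t:
            d += data[i][1]
            i += 1
        if d > 0:
            s *= 1.0 - d / at_risk
            steps.append((t, s))
    return steps

def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area, prev_t, s = 0.0, 0.0, 1.0
    for t, s_next in kaplan_meier(times, events):
        if t >= tau:
            break
        area += s * (t - prev_t)   # S(t) is flat between event times
        prev_t, s = t, s_next
    area += s * (tau - prev_t)
    return area
```

With no censoring, the RMST at τ reduces to the mean of the τ-truncated survival times, which provides a simple sanity check; the sensitivity of the estimate to the choice of τ can be examined by evaluating `rmst` over a grid of candidate truncation points.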
While a diverse array of weights and variance estimators have been proposed for the LRT, there is a paucity of comparative information regarding their versatility and efficiency under varying levels of non-proportionality, censoring, and competing risks [36,55,59,101]. Furthermore, when the event rate is low, weighted LRTs may not retain their range of flexibility [102]. Future analysis, beyond the scope of the current manuscript, may be merited.

Conclusions
A truly omnibus test is able to accurately detect survival differences over the clinical spectrum of a drug trial, regardless of whether a positive result is apparent from the start of therapy or only materializes later in the study (i.e., there is a time lag in the effectiveness of therapy). In contrast to the standard LRT, which treats all time points uniformly, an appropriately weighted LRT has the advantage of identifying significant delayed treatment effects with only a slight reduction in power for other survival outcomes. That is, under proportional hazards, with a nominal decrease in the probability of truly rejecting the null hypothesis, a substantial gain in efficiency for late separating survival curves is achieved [103].
While the quest for a "Holy Grail" test with infinite flexibility (i.e., immune to the type of non-proportional hazard) remains elusive, the single-weight iLRT possesses many of the desirable properties of such an omnibus method, particularly when the terminal event of one arm occurs before study completion. The iLRT equals or surpasses the default (four-component) Max-Combo method in many important applications and is objectively simple to implement with available computer code. The method does not require complex or time-consuming simulations to estimate study power, and as a single-weight test, the iLRT does not involve implicit multiplicity correction, nor does it depend on the arbitrary selection of weights. Nonetheless, in some cases, the iLRT may lack the flexibility and power of other more generalized multi-component Max-Combo tests or individual (two-parameter) Fleming-Harrington (FH) weights.
Relying entirely on a proportional hazards assumption when planning for and selecting a statistical test is unwise unless one is highly confident about the parallel shape of the ensuing hazard functions [98]. For example, the benefit of treatment may not occur immediately, but rather require a certain amount of time to overcome a lengthy disease period. A delayed treatment benefit also may be a consequence of "immunologic adjustment", which often occurs with certain newer-generation cancer drugs. In contrast, the antibiotic treatment of an infectious disease generally manifests a rapid treatment response.
The single-weight iLRT does not depend on an arbitrary choice of weights yet is relatively versatile and retains excellent power under delayed treatment effects. Nonetheless, a preponderance of investigators continue to use the more familiar assumption of constant event rates and proportional hazards in the design and analysis of randomized controlled trials, despite a potential loss of power and efficiency if this supposition does not hold [104].