1. Introduction
In survival analysis, a fundamental objective is to compare the distribution of event times between two or more groups. The standard tool for this purpose is the survival function $S(t) = P(T > t)$, and the most widely used hypothesis test for its comparison is the Log-Rank test; see [1]. This test is fundamentally based on the null hypothesis of strict equality ($H_0\colon S_1 = S_2$) and is optimal under the crucial assumption of proportional hazards (PH).
However, in many clinical and reliability settings, the PH assumption is frequently violated. The most challenging scenario is when the survival curves cross. This phenomenon indicates that the advantage of one treatment over another changes over time (e.g., one treatment is superior initially but inferior in the long term). In this situation, classic tests like the Log-Rank test present a critical limitation:
The test statistic, which accumulates the observed and expected differences in events over time, suffers from a severe cancellation effect; see [2,3]. The positive contributions of one group are nullified by the negative contributions of the other group after the crossing point. This results in a drastic loss of statistical power and a non-significant p-value, often leading to the erroneous conclusion of 'no difference' when a significant, but time-varying, difference actually exists. The test is simply inadequate for diagnosing whether the observed difference is due to consistent dominance or to a crossing effect.
To reach a robust and clinically relevant conclusion, researchers must evaluate stochastic dominance. A survival function $S_1$ dominates another survival function $S_2$ if $S_1(t) \ge S_2(t)$ for all $t$. The crucial question for the investigator is, therefore, whether a consistent dominance exists or whether, on the contrary, the curves cross.
This paper addresses this gap by proposing a new statistical test specifically designed to clearly distinguish the case of dominance in the presence of right-censored data. Our test is based on the supremum of the difference between Kaplan–Meier estimators, focusing on the maximum deviation between the curves. As demonstrated through asymptotic properties and simulation studies, this approach provides superior sensitivity for detecting dominance and allows researchers to directly assess the question of stochastic dominance. Furthermore, the proposed methodology can also be utilized to detect the presence of crossing survival curves. In this way, our test complements the results provided by traditional equality tests by offering a more robust interpretation of survival patterns.
The remainder of the paper is structured as follows. In Section 2, we introduce a new test for assessing dominance between two survival functions in the presence of censored data and examine its asymptotic properties and consistency. In Section 3, we describe the implementation of the new test and apply it to real datasets. In Section 4, we conduct a simulation study to evaluate its performance under different scenarios. Throughout the paper, we assume that the samples are independent. Finally, in Section 5, we discuss advantages and limitations.
2. Test for Comparing Survival Functions: The Case of Independent Samples
As mentioned in the introduction, in this paper we consider right-censored data; that is, in some cases we know the time at which some event occurs (e.g., failure, death), but in others we know that the event has not yet happened by the end of the observation period, but we do not know the exact time at which it will occur.
Let us consider the lung dataset included in the survival package in R, version 4.5.2; see [4]. The lung dataset originates from the North Central Cancer Treatment Group study and includes 228 patients, 90 female and 138 male, with advanced lung cancer. One of the variables of interest is overall survival, measured in days from the initial diagnosis to death or loss to follow-up.
If we want to compare the survival curves for females and males, a first approach is to plot the Kaplan–Meier estimators of the survival functions for both groups.
Formally, let $T_1, \dots, T_n$ be independent and identically distributed death times with common survival function $S_T$, and let $C_1, \dots, C_n$ be independent and identically distributed censoring times with common survival function $S_C$. It is assumed that failure and censoring times are independent. The dataset consists of bivariate random vectors $(Z_i, \delta_i)$, $i = 1, \dots, n$, where $Z_i = T_i \wedge C_i$, where $\wedge$ denotes the minimum, with a common distribution function $F$, and $\delta_i = I(T_i \le C_i)$ indicates whether $Z_i$ corresponds to an observed death ($\delta_i = 1$) or a censored observation ($\delta_i = 0$), with $I$ being the indicator function. One of the main issues in this context is to provide information about $S_T$ from $(Z_i, \delta_i)$, $i = 1, \dots, n$. Denoting by $t_{(1)} < t_{(2)} < \cdots$ the observed death times (non-censored observations), the Kaplan–Meier or product limit estimator of the survival function $S_T$ is
$$\widehat{S}_T(t) = \prod_{t_{(i)} \le t} \left(1 - \frac{d_i}{n_i}\right),$$
where $d_i$ is the number of deaths at time $t_{(i)}$ and $n_i$ is the number of survivors (individuals at risk) just before time $t_{(i)}$.
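The paper's computations are carried out in R with the survival package; purely as a language-neutral illustration of the product-limit formula just given, the estimator can be sketched in a few lines of Python (the function name and toy data below are ours, not the paper's):

```python
# Minimal sketch of the Kaplan-Meier product-limit estimator.
# times: observed times Z_i; events: 1 if the death was observed, 0 if censored.
def km_estimator(times, events):
    # Distinct observed death times t_(1) < t_(2) < ...
    death_times = sorted({t for t, d in zip(times, events) if d == 1})
    surv = 1.0
    curve = []  # list of (t_(i), S_hat(t_(i)))
    for t in death_times:
        n_i = sum(1 for z in times if z >= t)  # at risk just before t
        d_i = sum(1 for z, d in zip(times, events) if z == t and d == 1)  # deaths at t
        surv *= 1.0 - d_i / n_i  # product-limit update
        curve.append((t, surv))
    return curve

# Toy example: 6 subjects; events = 0 marks a censored observation.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
print(km_estimator(times, events))
```

Each factor $1 - d_i/n_i$ is the estimated conditional probability of surviving past $t_{(i)}$ given survival up to it; censored subjects leave the risk set without contributing a factor.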
A key result for inferential purposes is the weak convergence of the Kaplan–Meier estimator. Under the previous notation, [5] (see also [6]) proves that, for $\tau$ such that $P(Z_1 > \tau) > 0$, the empirical process $\sqrt{n}\,(\widehat{S}_T - S_T)$ converges weakly on $[0, \tau]$ to a Gaussian process $G_T$ with mean 0 and covariance function given by
$$\Gamma_T(s, t) = S_T(s)\, S_T(t) \int_0^{s \wedge t} \frac{dF_T(u)}{S_T(u)^2\, S_C(u)},$$
where $F_T = 1 - S_T$, under the condition of $F_T$ and $F_C = 1 - S_C$ being continuous.
We now return to the lung cancer dataset introduced earlier.
Figure 1 displays the Kaplan–Meier estimators for both groups, suggesting that female patients exhibit a higher survival probability across all time points. Traditional methods only provide information on equality or difference of the survival functions; therefore, it would be more informative to test whether one survival function dominates the other against the alternative hypothesis that there is at least one crossing point between the two survival functions.
When one survival function lies below another, we use the concept of stochastic ordering. Formally, we say that $T$ is smaller than $U$ in the stochastic, or first stochastic dominance, order, denoted as $T \le_{st} U$, if $S_T(t) \le S_U(t)$ for all $t$, meaning that the survival probability of $T$ is always lower than that of $U$. Determining whether one distribution consistently exhibits better survival characteristics than another is essential. While this problem has been addressed by [7] in the context of stochastic dominance or ordering for non-censored data, there are limited statistical tools available in the case of censored data.
Following the approach of [7], the main purpose of this paper is to provide statistical methods for testing the ordering of two survival functions, considering a Kolmogorov–Smirnov-type test in the presence of right-censored data based on the supremum of the difference between the two Kaplan–Meier estimators.
To fix the notation, we consider another set $U_1, \dots, U_m$ of independent and identically distributed death times with common survival function $S_U$; let $C'_1, \dots, C'_m$ be independent and identically distributed censoring times with common survival function $S_{C'}$. Again, it is assumed that failure and censoring times are independent. The second dataset consists of bivariate random vectors $(Z'_j, \delta'_j)$, where $Z'_j = U_j \wedge C'_j$ and $\delta'_j = I(U_j \le C'_j)$ indicates whether $Z'_j$ is censored or not. Let us denote by $\widehat{S}_U$ the corresponding Kaplan–Meier estimator of the survival curve $S_U$. We assume that these observations are independent of the previous observations.
Under the previous notation, our main objective is to test the null hypothesis
$$H_0\colon S_T(t) \le S_U(t) \quad \text{for all } t \in [0, \tau],$$
against the alternative hypothesis
$$H_1\colon S_T(t) > S_U(t) \quad \text{for some } t \in [0, \tau],$$
using a test statistic based on the supremum of the difference of the Kaplan–Meier estimators. More precisely, we consider the test statistic
$$D_{n,m} = \sqrt{\frac{nm}{n+m}}\, \sup_{t \in [0, \tau]} \left(\widehat{S}_T(t) - \widehat{S}_U(t)\right).$$
It is important to note that $D_{n,m}$ is a one-sided statistic designed to detect departures from the null hypothesis $H_0$. To formally distinguish between strict dominance (where one curve is always above the other) and a crossing scenario, the test should be performed symmetrically by also considering $\sup_{t \in [0,\tau]} \left(\widehat{S}_U(t) - \widehat{S}_T(t)\right)$ (the maximum difference in the opposite direction). A crossing is statistically suggested when the null hypothesis of dominance is rejected in at least one direction, while secondary testing evidence shows a reversal of roles in another time interval.
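As a sketch of how such a supremum-type statistic can be computed in practice (the helper names, and the $\sqrt{nm/(n+m)}$ scaling, are our reconstruction of the notation above), one evaluates both Kaplan–Meier step functions on the pooled grid of jump points and takes the maximum scaled difference:

```python
import math

def km_curve(times, events):
    # Kaplan-Meier estimator as a list of (death time, S_hat) steps.
    death_times = sorted({t for t, d in zip(times, events) if d == 1})
    surv, curve = 1.0, []
    for t in death_times:
        n_i = sum(1 for z in times if z >= t)
        d_i = sum(1 for z, d in zip(times, events) if z == t and d == 1)
        surv *= 1.0 - d_i / n_i
        curve.append((t, surv))
    return curve

def km_eval(curve, t):
    # Right-continuous step-function evaluation of the KM curve.
    s = 1.0
    for ti, si in curve:
        if ti <= t:
            s = si
        else:
            break
    return s

def sup_statistic(times1, events1, times2, events2, tau):
    # sqrt(nm/(n+m)) * sup_{t in [0, tau]} (S1_hat(t) - S2_hat(t)),
    # with the sup taken over the pooled jump points in [0, tau].
    c1, c2 = km_curve(times1, events1), km_curve(times2, events2)
    grid = sorted({0.0, tau} | {t for t, _ in c1 + c2 if t <= tau})
    n, m = len(times1), len(times2)
    scale = math.sqrt(n * m / (n + m))
    return scale * max(km_eval(c1, t) - km_eval(c2, t) for t in grid)
```

Since the difference of two Kaplan–Meier curves is a step function, its supremum over $[0,\tau]$ is attained at a jump point (or at 0), so scanning the pooled grid is exact rather than an approximation.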
According to this, the null hypothesis is rejected if $D_{n,m} > c_\alpha$, where the critical value $c_\alpha$ would be determined in terms of the distribution of $D_{n,m}$. However, it is not feasible to obtain the exact distribution of such a statistic, and we rather use the asymptotic distribution. To derive the asymptotic properties, we make the following assumptions:
A1: The value $\tau$ satisfies $P(Z_1 > \tau) > 0$ and $P(Z'_1 > \tau) > 0$.
A2: When $n, m \to \infty$, then $n/(n+m) \to \lambda$, with $0 < \lambda < 1$.
Next, we provide an asymptotic upper bound, under $H_0$, for the tail probability of $D_{n,m}$ in terms of its asymptotic distribution.
Theorem 1. Following the previous notation and assumptions A1 and A2, we get, under $H_0$,
$$\limsup_{n,m \to \infty} P\left(D_{n,m} > c\right) \le P\left(\sup_{t \in [0,\tau]} G(t) > c\right), \quad (1)$$
where $G$ is a Gaussian process with 0 mean and covariance function given by
$$\Gamma(s,t) = (1-\lambda)\,\Gamma_T(s,t) + \lambda\,\Gamma_U(s,t).$$
Proof. As noticed previously, the empirical processes $\sqrt{n}\,(\widehat{S}_T - S_T)$ and $\sqrt{m}\,(\widehat{S}_U - S_U)$ converge weakly to Gaussian processes $G_T$ and $G_U$, respectively. The independence of the samples implies that the pair $\left(\sqrt{n}\,(\widehat{S}_T - S_T),\, \sqrt{m}\,(\widehat{S}_U - S_U)\right)$ converges weakly to $(G_T, G_U)$. Now, by the continuous mapping theorem,
$$\sqrt{\frac{nm}{n+m}}\left[(\widehat{S}_T - S_T) - (\widehat{S}_U - S_U)\right]$$
converges weakly to $G = \sqrt{1-\lambda}\,G_T - \sqrt{\lambda}\,G_U$, where $G$ is a Gaussian process with a 0 mean and covariance function given by $\Gamma(s,t) = (1-\lambda)\,\Gamma_T(s,t) + \lambda\,\Gamma_U(s,t)$. Now, under $H_0$ we get $S_T(t) - S_U(t) \le 0$ for all $t \in [0,\tau]$, and therefore
$$D_{n,m} \le \sqrt{\frac{nm}{n+m}}\, \sup_{t \in [0,\tau]} \left[\left(\widehat{S}_T(t) - S_T(t)\right) - \left(\widehat{S}_U(t) - S_U(t)\right)\right];$$
see Theorem 1.A.1 in [8]. As a consequence, we get
$$\limsup_{n,m \to \infty} P\left(D_{n,m} > c\right) \le P\left(\sup_{t \in [0,\tau]} G(t) > c\right). \qquad \square$$
This test is consistent, as can be seen next.
Proposition 1. Under the conditions of the previous theorem and under $H_1$, it holds that
$$\lim_{n,m \to \infty} P\left(D_{n,m} > c\right) = 1$$
for any $c > 0$. Proof. Under $H_1$, there exists a $t_0 \in [0, \tau]$ such that $S_T(t_0) - S_U(t_0) = \epsilon > 0$, and therefore
$$D_{n,m} \ge \sqrt{\frac{nm}{n+m}}\left(\widehat{S}_T(t_0) - \widehat{S}_U(t_0)\right).$$
It is not difficult to see that $\widehat{S}_T(t_0) - \widehat{S}_U(t_0) \to \epsilon$ almost surely, and clearly $\sqrt{\frac{nm}{n+m}} \to \infty$. Now the result follows, observing that
$$P\left(D_{n,m} > c\right) \ge P\left(\sqrt{\frac{nm}{n+m}}\left(\widehat{S}_T(t_0) - \widehat{S}_U(t_0)\right) > c\right) \to 1$$
for any $c > 0$. $\square$
We have introduced the theoretical framework of the proposed test and its motivation compared to traditional methods. Next, we will apply this methodology to real-world datasets and simulations to assess its performance in practical scenarios.
3. Implementation and Application to Some Datasets
To provide the upper bound (1) for the p-value, we need to compute the probability for the supremum of a Gaussian process. A common computational method is to approximate the probability via Monte Carlo simulation. The idea is to generate N samples of the Gaussian process over a discretized grid, then compute the supremum for each sample and estimate the probability using the empirical frequency.
However, following [9], we propose a more efficient method using the mvtnorm package in R; see [10]. Instead of relying exclusively on Monte Carlo simulations, this approach reduces the computational burden associated with the covariance matrix calculation, making the process faster.
Given a discretized grid $t_1 < t_2 < \cdots < t_k$ on $[0, \tau]$, the Monte Carlo method provides an approximation of $P\left(\sup_{t \in [0,\tau]} G(t) > d\right)$, where $d$ is the value of $D_{n,m}$ at a given sample, by an estimation of $P\left(\max_{1 \le i \le k} G(t_i) > d\right)$ based on the simulations. This probability can be computed as
$$P\left(\max_{1 \le i \le k} G(t_i) > d\right) = 1 - P\left(G(t_1) \le d, \dots, G(t_k) \le d\right).$$
Given that $(G(t_1), \dots, G(t_k))$ follows a multivariate normal distribution, this expression can be efficiently computed in R using the multivariate normal distribution function from the mvtnorm package.
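The Monte Carlo idea can be illustrated with a toy Gaussian process whose supremum distribution is known in closed form. The sketch below (our own; the actual test uses the covariance of Theorem 1 estimated from the data) simulates standard Brownian motion on a grid via independent Gaussian increments and checks the empirical exceedance frequency against the reflection-principle formula $P(\sup_{t \le 1} W(t) > d) = 2\,P(N(0,1) > d)$:

```python
import math
import random

def mc_sup_probability(path_sampler, d, n_sims=20000, seed=1):
    # Monte Carlo estimate of P(max_i G(t_i) > d): draw N paths of the
    # Gaussian vector (G(t_1), ..., G(t_k)) and count how often the
    # grid maximum exceeds d.
    random.seed(seed)
    hits = 0
    for _ in range(n_sims):
        if max(path_sampler()) > d:
            hits += 1
    return hits / n_sims

def brownian_on_grid(k=50):
    # Toy stand-in for G: standard Brownian motion on a grid of [0, 1],
    # simulated through independent Gaussian increments.
    dt = 1.0 / k
    path, w = [], 0.0
    for _ in range(k):
        w += random.gauss(0.0, math.sqrt(dt))
        path.append(w)
    return path

est = mc_sup_probability(brownian_on_grid, d=1.0)
# Reflection principle: P(sup W > 1) = 2 * (1 - Phi(1)).
exact = 2 * (1 - 0.5 * (1 + math.erf(1.0 / math.sqrt(2))))
print(est, exact)
```

The grid maximum slightly underestimates the continuous supremum (discretization bias), which is one reason the paper's grid-based multivariate normal computation with pmvnorm is attractive: it evaluates the joint probability on the grid exactly, removing the simulation noise.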
Additionally, to construct the covariance matrix, we require the theoretical values of the survival functions $S_T$ and $S_U$ and their corresponding densities, as well as the censoring survival functions $S_C$ and $S_{C'}$. Since these values are unknown, we use empirical values. The survival functions are replaced by the corresponding Kaplan–Meier estimators, and the distribution functions are replaced by their empirical counterparts. For density estimation, we apply the Foldes–Rejtó–Winter smoothed density estimator, which incorporates plug-in bandwidth selection, as proposed by [11,12]. This estimation is implemented in R using the survPresmooth package; see [13].
Finally, to assess assumption A1, we will take .
Next, we provide applications of the previous test to some datasets, starting with the lung cancer dataset introduced in Section 2.
Lung cancer: Here, we aim to assess whether the survival function of male patients lies consistently below that of female patients, as Figure 1 suggests.
Applying our test, we obtain an upper bound for the p-value of 0.9955, indicating no statistical evidence to reject the null hypothesis that female survival dominates male survival.
To determine whether this dominance is strict or the two survival functions are equal, we apply some classical tests to detect differences between the two survival functions. The results are presented in Table 1. Since all tests yield extremely low p-values, we conclude that female survival strictly dominates male survival.
Gastric cancer: In this second application, we illustrate the performance of the proposed test using the gastric cancer dataset. These data originate from a clinical trial conducted by the Gastrointestinal Tumor Study Group [14], which compared the survival of patients with locally advanced gastric carcinoma under two treatment arms: chemotherapy alone (5-fluorouracil and semustine), which we consider the control group, versus chemotherapy combined with radiation therapy. This dataset is particularly relevant because the survival curves of the two groups exhibit a visual intersection at approximately 900 days (see Figure 2). Before this point, the chemotherapy group appears to show higher survival probabilities, but the trend reverses thereafter.
We applied our proposed test to the null hypothesis that the survival function of the chemotherapy group dominates (is "above") that of the chemotherapy plus radiation group. Interestingly, despite the visual crossing, the test does not provide sufficient evidence to reject this dominance hypothesis: we obtain an upper bound for the p-value of 0.5864. This result indicates that the observed intersection in the sample is statistically compatible with the hypothesis of ordering. From a methodological perspective, this suggests that the divergence observed in the right tail of the distributions after 900 days may be attributed to sampling variability rather than to a significant structural reversal in treatment efficacy. Our test thus provides a robust interpretation by not over-interpreting visual crossings that lack enough statistical strength to invalidate the dominance model.
On the other hand, when applying traditional tests for comparing survival functions, we obtain the results presented in Table 2.
The results presented in Table 2 highlight the limitations of traditional tests when dealing with crossing survival functions. While the Log-Rank and Tarone–Ware tests fail to reach statistical significance, other methods like Gehan or Peto–Peto yield conflicting results with p-values below 0.05. This discrepancy arises from the cancellation effect inherent in rank-based tests, where differences before and after the crossing point (at approximately 900 days) neutralize each other. In contrast, our proposed test, based on the supremum of the difference between Kaplan–Meier estimators, provides a more robust and interpretable framework. With an upper bound of 0.5864 for the p-value of the dominance hypothesis, we can conclude that the observed intersection in the sample does not provide sufficient evidence to reject the model of stochastic ordering. This confirms that our approach effectively avoids over-interpreting visual crossings that lack statistical strength, offering a clear advantage for clinical decision-making over conventional methods.
4. Simulation Studies
To show the performance of our test in different scenarios, we carry out Monte Carlo experiments for small and large samples. The simulation studies are performed in several scenarios where dominance between the two survival functions either holds or does not hold, for different sample sizes and different rates of censoring.
First, we have considered gamma-distributed survival times. Let us describe the different cases that we have considered.
Case 1: In this case, T follows a gamma distribution with shape parameter 2 and scale parameter 1, Gamma(2, 1), and U follows a Gamma(3, 1) distribution. It is known that in this case the survival function of U dominates the survival function of T.
Case 2: In this case, T follows a Gamma(2, 1) distribution and U a Gamma(2.2, 1) distribution. It is known that in this case the survival function of U dominates the survival function of T, but they are very close.
Case 3: In this case, T follows a Gamma(3, 5) distribution and U a Gamma(6, 2) distribution. It is known that in this case the survival functions cross at one point.
Case 4: In this case, T follows a Gamma(2, 2) distribution and U a Gamma(3, 1) distribution. It is known that in this case the survival functions cross at one point, but it is difficult to detect the crossing point.
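For concreteness, one right-censored sample from a scenario like Case 1 can be generated in a few lines; the sketch below uses Python's standard generators as stand-ins for the R ones, and the censoring rate lam is purely illustrative (the values actually used in the study are determined numerically later in this section):

```python
import random

def draw_censored_sample(n, shape, scale, lam, seed=7):
    # Draw n right-censored observations: T ~ Gamma(shape, scale),
    # C ~ Exponential(rate = lam); observe Z = min(T, C) and
    # delta = 1 if the death was observed (T <= C), else 0.
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        t = rng.gammavariate(shape, scale)
        c = rng.expovariate(lam)
        sample.append((min(t, c), 1 if t <= c else 0))
    return sample

# Case 1, group T: Gamma(2, 1) death times; lam = 0.12 is illustrative.
data = draw_censored_sample(100, 2.0, 1.0, 0.12)
censor_rate = sum(1 for _, d in data if d == 0) / len(data)
```

Each replication of the simulation study draws such a sample for each group, computes the supremum statistic, and records whether the null hypothesis of dominance is rejected.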
In Figure 3 we plot the different cases, where we observe ordered survival functions in cases 1 and 2 and a crossing point in cases 3 and 4.
In addition to the gamma-based scenarios, we have considered four cases to broaden the scope of the simulation study and assess the performance of the proposed test under more general survival models. Specifically, cases 5 and 6 consider Log-Normal distributions, where the survival functions cross at a single point, with the intersection occurring very early in case 6, making it more difficult to detect. Similarly, cases 7 and 8 involve Weibull distributions, both exhibiting a single crossing point, although in case 8, the crossing appears near the origin, posing an additional challenge for detection. We want to highlight that, except for trivial cases or strict equality of parameters, comparisons between survival functions generated by two Weibull or Log-Normal distributions with different parameters always result in a crossing point.
In particular, we have considered the following cases.
Case 5: In this case, T follows a Log-Normal distribution with parameters 2 and 1, Log-Normal(2, 1), and U follows a Log-Normal(1.7, 0.5) distribution. In this setting, the survival functions of T and U intersect at a single point.
Case 6: In this case, T follows a Log-Normal(1, 1.5) distribution and U a Log-Normal(0.5, 1) distribution. Again, the survival functions cross at one point, but the intersection occurs very early, making it more difficult to detect.
Case 7: In this case, T follows a Weibull distribution with shape parameter 1 and scale parameter 5, Weibull(1, 5), and U follows a Weibull(2, 3.5) distribution. The survival functions cross at a single point.
Case 8: In this case, T follows a Weibull(1.5, 4) distribution and U a Weibull(2, 3) distribution. As in previous cases, the survival functions intersect at one point; however, the crossing occurs very early and is therefore more difficult to detect.
In Figure 4, we plot the different cases, where we observe crossing survival functions in cases 5–8.
For each case, we have considered exponentially distributed censoring times with rate parameter $\lambda$. In each case, we have selected two different values of $\lambda$ to obtain rates of approximately 20% and 50% of censored observations. Given a survival time T and a censoring time C with an exponential distribution, we need to solve the probability equation $P(C < T) = p$ to determine the appropriate $\lambda$. Because this equation is analytically complex, we used a numerical approach to solve for $\lambda$. We set up an equation where the calculated probability of censoring must equal our target (e.g., $p = 0.20$). We used R's numerical integration function (integrate) to precisely calculate $P(C < T)$ for any given $\lambda$. We then employed a root-finding algorithm (specifically, R's uniroot) to iteratively search for the single value of $\lambda$ that makes the calculated probability match the target percentage.
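This integrate-then-root-find procedure can be sketched as follows; the example takes Case 1's T ~ Gamma(2, 1), whose survival function has the closed form $S_T(t) = (1+t)e^{-t}$, uses a simple trapezoidal rule in place of R's integrate and bisection in place of uniroot, and the function names are ours (the resulting lam is illustrative, not necessarily the value used in the study):

```python
import math

def censoring_probability(lam, surv, upper=100.0, steps=10000):
    # P(C < T) = integral_0^inf S_T(c) * lam * exp(-lam * c) dc,
    # approximated with a trapezoidal rule on [0, upper].
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        c = i * h
        f = surv(c) * lam * math.exp(-lam * c)
        total += f * (0.5 if i in (0, steps) else 1.0)
    return total * h

def find_lambda(target, surv, lo=1e-6, hi=10.0):
    # Bisection stand-in for R's uniroot: P(C < T) is increasing in lam
    # (a larger rate censors earlier), so bisection applies directly.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if censoring_probability(mid, surv) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Case 1, group T: Gamma(2, 1), so S_T(t) = (1 + t) * exp(-t).
surv_T = lambda t: (1.0 + t) * math.exp(-t)
lam20 = find_lambda(0.20, surv_T)
print(lam20)
```

For this particular case the integral is available in closed form, $P(C < T) = \lambda(2+\lambda)/(1+\lambda)^2$, so the numerical solution can be checked against the root of $4\lambda^2 + 8\lambda - 1 = 0$.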
Following the numerical procedure described, we obtained the specific values that achieve the desired censoring proportions in each case. The values of the exponential parameters used to generate the censoring times for each scenario are the following:
Case 1: Gamma(2, 1) vs. Gamma(3, 1)
- 20% censoring:
- 50% censoring:
Case 2: Gamma(2, 1) vs. Gamma(2.2, 1)
- 20% censoring:
- 50% censoring:
Case 3: Gamma(3, 5) vs. Gamma(6, 2)
- 20% censoring:
- 50% censoring:
Case 4: Gamma(2, 2) vs. Gamma(3, 1)
- 20% censoring:
- 50% censoring:
Case 5: Log-Normal(2, 1) vs. Log-Normal(1.7, 0.5)
- 20% censoring:
- 50% censoring:
Case 6: Log-Normal(1, 1.5) vs. Log-Normal(0.5, 1)
- 20% censoring:
- 50% censoring:
Case 7: Weibull(1, 5) vs. Weibull(2, 3.5)
- 20% censoring:
- 50% censoring:
Case 8: Weibull(1.5, 4) vs. Weibull(2, 3)
- 20% censoring:
- 50% censoring:
This systematic determination of $\lambda$ ensures comparability across all settings and provides a sound foundation for the simulation design.
We performed 1000 Monte Carlo replications for each case with different sample sizes up to $n = 500$, in which the rejection rates of the null hypothesis have been computed for the two conventional significance levels. The number of grid points is kept fixed in every replication.
Table 3 and Table 4 summarize the performance of the proposed test under the eight scenarios, cases 1–8, with varying sample sizes and censoring levels (20% vs. 50%), for both significance levels.
Some key observations are the following:
When one survival function truly dominates the other (cases 1 and 2), the test almost never rejects the null hypothesis. Even with small samples, the rejection rate is essentially 0 at both significance levels, indicating no false alarms when the survival functions are ordered.
In scenarios where the survival functions cross (cases 3–8), the test's power to reject the null hypothesis increases sharply with larger sample sizes. For small samples, the rejection rates are modest, but with larger samples they increase, exceeding 90% in most cases and approaching 100% at n = 500. This shows the test is very sensitive to crossings given sufficient data.
Heavier censoring reduces the test's power to detect differences. At a given sample size, a 50% censoring rate yields lower rejection rates than 20% censoring. For example, in case 3 with n = 100, the rejection rate is 85.1% with 20% censoring versus 65.4% with 50% censoring. Nonetheless, as sample size grows, even with 50% censoring, the power eventually becomes high, reaching 99% at n = 500 in crossing cases.
These results demonstrate that the proposed test is reliable for confirming the order of survival functions and highly effective at flagging crossings, provided the data are sufficient. The test maintains a low false-positive rate when survival curves are truly ordered, giving confidence that a non-significant result indeed suggests no crossing. Conversely, a significant result from this test is strong evidence of at least one crossing point between survival curves. A sufficient sample size is crucial for the test to detect crossings, especially if censoring is heavy. In practice, researchers should plan for larger samples and/or seek to reduce censoring to ensure the test has high power to uncover crossing survival patterns. This sensitivity to crossings is a key advantage over traditional survival comparison tests (like Log-Rank), which often fail to detect any difference when survival curves intersect. Such improved detection can lead to better insights in studies where survival functions may cross, ensuring that meaningful survival differences are not overlooked.
5. Discussion
This paper introduces a novel test, based on the supremum of the difference between Kaplan–Meier estimators, specifically designed to evaluate stochastic dominance and to distinguish this condition from scenarios where survival curves cross. The asymptotic results presented, coupled with the simulation studies and applications to real data, confirm that the proposed test is a robust tool that is more sensitive than classic methods (such as the Log-Rank test) for detecting curve crossings. To ensure the generality of our findings, we included simulations using Gamma, Weibull and Log-Normal distributions.
Our proposed test, based on the supremum of the difference of Kaplan–Meier estimators, offers several advantages over traditional methods such as the Log-Rank test:
A significant contribution of this work, as evidenced by the analysis of the gastric cancer dataset in Section 3, is the ability of the proposed test to handle crossing survival curves with statistical rigor. A common criticism in survival analysis is whether a visual intersection of Kaplan–Meier curves automatically invalidates a dominance model. While the definition of our statistic focuses on the maximum distance between survival functions, this does not imply that the crossing is ignored. On the contrary, our methodology provides a formal decision rule to distinguish between a structural reversal of survival advantage and a crossing that is statistically compatible with the null hypothesis of ordering due to sampling variability. In the case of the gastric cancer study, the visual crossing observed at approximately 900 days does not lead to a rejection of the dominance hypothesis by our test. This suggests that the divergence after the intersection point lacks sufficient statistical strength to confirm a violation of the ordering. While the traditional Log-Rank test often results in non-significant p-values in such scenarios due to the cancellation of early and late differences, our approach avoids this 'blindness' by focusing on the maximum evidence of deviation. Therefore, the test proves to be a robust tool that prevents the over-interpretation of visual patterns in the tails of the distribution, where high censoring often increases uncertainty. This reinforces the practical utility of our test as a complement to visual inspection in clinical and biological research.
Improved detection of stochastic dominance: Our test is specifically designed to assess whether one survival function consistently lies above another, providing a more informative alternative to conventional hypothesis testing frameworks.
Computational efficiency and implementation in R: The test is easy to compute using standard statistical software. The estimation of p-values can be efficiently performed using multivariate normal approximations, making it practical for large datasets.
Robustness to outliers: The test inherits the robustness of the Kaplan–Meier estimator. As a non-parametric statistic based on the ranks of event times, the Kaplan–Meier estimator is less susceptible to the influence of a single extreme value than estimators based on the mean or variance. Nevertheless, caution is advised when analyzing very small samples with clear outliers, as an extreme event or censoring value can affect the calculation of the supremum and influence the asymptotic convergence properties.
Despite its strengths, the proposed method has certain limitations that suggest promising avenues for future research:
Performance in small sample sizes: While our test has strong asymptotic properties, its performance in small samples could be further refined by improving the estimation of the asymptotic distribution. Future research could explore bootstrap-based refinements or finite-sample adjustments to enhance the test’s accuracy in small-sample scenarios.
Adaptation for paired samples: A relevant extension would be the development of a similar test for paired survival data, where observations are correlated (e.g., matched case-control studies or twin studies).
Handling time-dependent covariates: In real-world applications, survival probabilities often depend on time-varying covariates. Extending our approach to incorporate time-dependent covariates could provide additional insights into how survival dominance evolves over time.
Finally, it is crucial to define the role our test plays in the general hypothesis testing process. The proposed test is not intended to replace classic tests like the Log-Rank but rather acts as a crucial diagnostic complement.
We propose the following workflow: If the test fails to reject the dominance hypothesis (high p-value), the data are consistent with one survival function being globally less than or equal to the other. In this scenario, the next step is to perform a classical test (e.g., Log-Rank) to determine whether there is strict dominance or statistical equality. Conversely, if the test rejects the dominance hypothesis (low p-value), it provides evidence that the supposedly dominated group is significantly better in at least one time interval. When combined with the initial clinical hypothesis of one group's superiority (or by performing the contrast in both directions), this rejection serves as a diagnostic for a crossing point. This result yields critical information unattainable by the Log-Rank test alone, allowing researchers to shift their focus to interval-specific comparisons. This allows for a more nuanced interpretation of treatment effects, particularly when early risks are offset by long-term benefits.