Model Checking with Right Censored Data Using Relative Belief Ratio

Model checking is a topic of special interest in statistics. When data are censored, the problem becomes more difficult. This paper employs the relative belief ratio and the beta-Stacy process to develop a method for model checking in the presence of right-censored data. For a given model of interest, the proposed method compares the concentration of the posterior distribution with the concentration of the prior distribution using a relative belief ratio. We propose a computational algorithm for the method and then illustrate it through several data analysis examples.


Introduction
An important problem in statistics is to check whether a chosen statistical model is in agreement with the observed data. This problem is known as model checking. The tests used to describe how well a statistical model fits a set of observations are goodness-of-fit tests. If it is determined that the observed data do not contradict the model, inference about the true values of the parameters can proceed. If the model fails its checks, there is cause for concern about the accuracy of any inferences drawn from it. Thus, checking a proposed model against the observed data is a critical component of statistical inference.
The time until an event occurs is the variable of interest in survival analysis. It is common to refer to this time as the survival time. In this setting, the observed data are typically right censored: the exact survival time may be incomplete on the right, for example, when a subject leaves the study before the event occurs, or when the study ends before the event occurs. This type of data is common in medical studies, where the variable of interest is the time to death or to relapse of a disease; in reliability studies, where the variable of interest is the time until a machine part fails; and in the social sciences, where the variable of interest is the lifetime of the elderly in certain social programs [1]. While there are several well-known methods for model checking, the approach we take is Bayesian in nature. There is considerable interest in developing nonparametric Bayesian procedures for hypothesis testing, and the majority of this work focuses on uncensored data. The main approach to model testing embeds the proposed model, as a null hypothesis, into a larger family of distributions; priors are then placed on the null and the alternative, and a Bayes factor is computed (see, for example, [2][3][4][5][6]). A second important approach places a prior on the true distribution generating the data and then measures the distance between the posterior distribution and the hypothesized distribution (see, for example, [7][8][9]). A third and more recent approach combines the second approach with the relative belief ratio ([10]) (for instance, see the work of [11][12][13]).
When data are censored, goodness-of-fit testing becomes a challenging problem. This explains why studies on goodness-of-fit tests for right-censored data are few and scattered; Bayesian goodness-of-fit tests in particular are rarely covered. Some exceptions include the work of [14][15][16]. [14] considered the two-sample problem by computing Bayes factors under beta process priors ([17]). Al-Labadi and Zarepour [15] used the beta-Stacy process and the Kolmogorov-Smirnov distance to test simple hypotheses. Chen and Hanson [16] used Polya tree priors and computed Bayes factors; MCMC procedures were used to compute the posteriors, and permutation tests were used to compute p-values. These approaches are computationally intensive.
The main problem addressed in this paper is model checking for lifetime distribution functions. A general description of the approach developed in this paper is provided, along with a discussion of the benefits that the approach brings to the problem of model checking. The beta-Stacy process $\mathrm{BSP}(k(t), F_0(t))$, $t \ge 0$, is considered as a prior on the space of cumulative distribution functions ([18]). Here, $k(\cdot) > 0$ is a function defined on $\mathbb{R}^+ = [0, \infty)$, and $F_0$ is a cumulative distribution function (cdf) on $\mathbb{R}^+$ representing the prior guess of the beta-Stacy process. How closely realizations from the process are clustered around $F_0$ is determined by $k(\cdot)$. The use of the beta-Stacy process is justified because, unlike the Dirichlet process, it is conjugate to both exact and right-censored observations ([19]). The proposed method then compares the concentrations of the posterior and prior distributions about the model of interest. If the posterior is more concentrated about the model than the prior, this is evidence in favor of the model; if the posterior is less concentrated, this is evidence against the model. This comparison is made using a relative belief ratio, which measures the evidence in the observed data for or against the model and provides a measure of the strength of this evidence; thus, the methodology is based on a direct measure of statistical evidence. The approach is fairly simple to implement and does not require a closed form of the relative belief ratio. Theoretical results are developed to support appropriate selections of the hyperparameters $k(\cdot)$ and $F_0$.
The rest of the paper is structured as follows. Section 2 provides background on the definitions and generic properties of the beta-Stacy process, the relative belief ratio, and the Cramér-von Mises distance between probability measures. Section 3 discusses the proposed methodology and the selection of relevant values for the hyperparameters of the beta-Stacy process. Section 4 develops a computational algorithm for implementing the approach. Section 5 examines the performance of the methodology through a number of examples; the method works well in general and can be used for both censored and uncensored observations. Section 6 wraps up the paper with a summary of the findings. Lastly, datasets and figures are provided in Appendix A.

Notations and Background
Let the random sample $X_1, \dots, X_n$ be drawn from an unknown cdf $F$ defined on $\mathbb{R}^+$. Let $(T, \delta) = ((T_1, \delta_1), \dots, (T_n, \delta_n))$ be observed, where $T_i = \min(X_i, C_i)$, $\delta_i = I(X_i \le C_i)$, and $C_1, \dots, C_n$ are the censoring times. When $\delta_i = 1$, $X_i$ is observed exactly, and when $\delta_i = 0$, $X_i$ is right censored. This type of data can occur, for instance, when $X_i$ is the lifetime of a patient who enrols in a study of a certain disease. If death from the disease occurs before the study ends, $X_i$ is recorded; however, if the patient is still alive at the end of the study, withdraws, or dies from another cause, $C_i$ records the end of the period during which the patient was under observation ([20]).
In survival analysis, the cdf $F$ is frequently referred to as the lifetime distribution function. We use the same notation throughout this paper for a probability measure and its corresponding cumulative distribution function, i.e., $F(t) = F((-\infty, t])$.
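To fix ideas, the following minimal sketch generates right-censored data of this form; the exponential survival times and uniform censoring times are hypothetical choices made only for illustration.

```python
# A minimal sketch of how right-censored data (T, delta) arise, assuming
# hypothetical exponential survival times and uniform censoring times.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 50
x = rng.exponential(scale=10.0, size=n)   # latent survival times X_i
c = rng.uniform(0.0, 25.0, size=n)        # latent censoring times C_i

t = np.minimum(x, c)                      # observed times T_i = min(X_i, C_i)
delta = (x <= c).astype(int)              # delta_i = 1 if X_i observed exactly

print(f"observed events: {delta.sum()} / {n} (rest are right censored)")
```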

Beta-Stacy Process
Walker and Muliere [18] introduced the beta-Stacy process, a nonparametric prior that is widely used in survival analysis. In this section, a brief introduction to the beta-Stacy process is provided.
The beta-Stacy process is defined through another process known as the log-beta process ([18]). Let $\beta(\cdot)$ be a positive function defined on $\mathbb{R}^+$, and let $\alpha$ be a measure concentrated on $\mathbb{R}^+$ that is absolutely continuous with respect to the Lebesgue measure such that $\int_0^\infty (\beta(z))^{-1}\, \alpha(dz) = \infty$. On the basis of [18], $\{Z(t)\}_{t \ge 0}$ is a log-beta process with parameters $\alpha$ and $\beta$, written as $Z(t) \sim \log\mathrm{BSP}(\alpha(t), \beta(t))$, if $\{Z(t)\}_{t \ge 0}$ is a Lévy process with Lévy measure defined, for $s > 0$, by
$$dN_t(s) = \frac{ds}{1 - e^{-s}} \int_0^t e^{-s\,\beta(z)}\, \alpha(dz),$$
and has the following moment generating function:
$$\mathbb{E}\!\left[e^{-\phi Z(t)}\right] = \exp\left\{-\int_0^\infty \left(1 - e^{-\phi s}\right) dN_t(s)\right\}. \tag{1}$$

In order to define the beta-Stacy process, let the positive function $k(\cdot)$ be defined on $\mathbb{R}^+$ and let the absolutely continuous cdf $F_0$ be defined on $\mathbb{R}^+$. We call $\{F(t)\}_{t \ge 0}$ a beta-Stacy process with parameters $k(\cdot)$ and $F_0$, written as $F(t) \sim \mathrm{BSP}(k(t), F_0(t))$, if $F(t) = 1 - e^{-Z(t)}$ and $\{Z(t)\}_{t \ge 0}$ is a log-beta process with parameters $\alpha(dt) = k(t)\, dF_0(t)$ and $\beta(t) = k(t)\,(1 - F_0(t))$. Walker and Muliere [18] showed that $\mathbb{E}[F(t)] = F_0(t)$, rendering $F_0$ the prior guess. The expression in (1) explains the need for the assumption $\int_0^\infty (\beta(z))^{-1}\, \alpha(dz) = \infty$: it guarantees that $F(t) \to 1$ almost surely as $t \to \infty$. The beta-Stacy process includes various neutral-to-the-right processes proposed in the literature. For instance, the beta-Stacy process reduces to the Dirichlet process when $k(t) = k > 0$ for all $t \ge 0$. The beta-Stacy process also reduces to a simple homogeneous process ([21]) when $\beta(\cdot)$ is constant.
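As a quick illustration of the special case $k(t) = k$, in which the beta-Stacy prior reduces to the Dirichlet process $\mathrm{DP}(k, F_0)$, the following sketch draws one truncated stick-breaking sample path centered at a hypothetical prior guess $F_0 = \mathrm{Exp}(1)$; the truncation level $M$ and all numbers are illustrative choices, not values from the paper.

```python
# A stick-breaking sketch of the special case k(t) = k, where the beta-Stacy
# prior reduces to a Dirichlet process DP(k, F0); F0 = Exp(1) is hypothetical.
import numpy as np

rng = np.random.default_rng(seed=2)
k, M = 5.0, 500                            # concentration and truncation level
v = rng.beta(1.0, k, size=M)               # stick-breaking proportions
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # atom weights
theta = rng.exponential(scale=1.0, size=M)                 # atoms drawn from F0

grid = np.linspace(0.0, 5.0, 200)
F = np.array([w[theta <= t].sum() for t in grid])  # one sample cdf path F(t)
F0 = 1.0 - np.exp(-grid)
print("max |F - F0| on grid:", np.abs(F - F0).max())
```

Larger values of $k$ make the sample paths cluster more tightly around $F_0$, matching the interpretation of $k(\cdot)$ as a concentration parameter.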
Next, we describe the posterior distributions of $\{Z(t)\}_{t \ge 0}$ and $\{F(t)\}_{t \ge 0}$. Let the random sample $X_1, \dots, X_n$ be drawn from $F$, and let $(T, \delta) = ((T_1, \delta_1), \dots, (T_n, \delta_n))$ be observed, where $T_i = \min(X_i, C_i)$, $\delta_i = I(X_i \le C_i)$, and $C_1, \dots, C_n$ are censoring times. Define the counting process $N$ by $N(t) = \sum_{i=1}^n I(T_i \le t \text{ and } \delta_i = 1)$ and the (left-continuous) at-risk process $Y$ by $Y(t) = \sum_{i=1}^n I(T_i \ge t)$, where $I$ is the indicator function. In particular, let $N\{t\} = N(t) - N(t^-)$ (i.e., $N\{t\}$ is the number of observed $X_i$'s at the exact position $t$). Let $\{Z(t)\}_{t \ge 0} \sim \log\mathrm{BSP}(\alpha(t), \beta(t))$ and $F(t) = 1 - \exp(-Z(t))$. Given the data $(T, \delta)$, the posterior distribution of $Z$ is a log-beta process ([18]) with updated parameters
$$\alpha^*(ds) = \alpha(ds) + dN(s), \tag{2}$$
$$\beta^*(s) = \beta(s) + Y(s) - N\{s\}; \tag{3}$$
the posterior Lévy measure for $Z(t)$ is obtained from the prior Lévy measure by replacing $\beta(\cdot)$ with $\beta(\cdot) + Y(\cdot)$. There are fixed points of discontinuity in the posterior process. These extra points appear at the exact (uncensored) observations. If $t_i$ is an exact observation with corresponding jump $S_i$, then the density of $S_i$ is of the form
$$f_{S_i}(s) \propto \left(1 - e^{-s}\right)^{N\{t_i\} - 1} e^{-s\,\beta^*(t_i)}, \quad s > 0. \tag{4}$$
If $N\{t_i\} = 1$, then the random jump $S_i$ follows an exponential density with mean $1/\beta^*(t_i)$. Given the data $(T, \delta)$, the posterior distribution of $F$ is a beta-Stacy process with parameters
$$k^*(t) = \frac{k(t)\,(1 - F_0(t)) + Y(t) - N\{t\}}{1 - F_0^*(t)} \tag{5}$$
and
$$F_0^*(t) = 1 - \prod_{s \in [0,t]} \left(1 - \frac{\alpha^*(ds)}{\beta^*(s) + N\{s\}}\right), \tag{6}$$
where $\alpha^*(s)$ and $\beta^*(s)$ are defined in (2) and (3), and $\prod$ stands for the product integral. Note that, as $k(\cdot)$ tends to zero, the nonparametric Kaplan-Meier estimator of the distribution function is obtained. On the other hand, $F_0^*$ becomes the prior guess $F_0$ as $k(\cdot)$ grows large. Thus, the parameter $k(t)$ can be viewed as a concentration parameter. The posterior consistency of the beta-Stacy process was addressed by Kim and Lee [22]. Algorithms A and B in Appendix A can be used to sample from the prior and posterior beta-Stacy processes. These algorithms were developed by Al-Labadi and Zarepour [15] (see also Lee and Kim [23]).
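The product-integral in (6) can be discretized over the observed time points. The following rough sketch (assuming constant $k(t) = k$ and crude increments for $F_0$; the helper name posterior_center and all numbers are illustrative) evaluates the posterior centering function on a grid; with $k = 0$ it recovers the Kaplan-Meier product-limit form noted above.

```python
# A rough discretized sketch of the posterior centering function F0*(t)
# of (6) for constant k(t) = k, via the product over observed time points:
#   F0*(t) = 1 - prod_{s <= t} [1 - (k dF0(s) + dN(s)) / (k (1 - F0(s-)) + Y(s))].
# Setting k = 0 recovers the Kaplan-Meier estimator, as noted in the text.
import numpy as np

def posterior_center(times, t_obs, delta, k, F0):
    """Evaluate a discretized F0* on a grid `times`, with prior guess cdf F0."""
    s = np.sort(np.unique(t_obs))                     # distinct observed times
    dN = np.array([((t_obs == u) & (delta == 1)).sum() for u in s])
    Y = np.array([(t_obs >= u).sum() for u in s])     # at-risk counts Y(s)
    dF0 = np.diff(F0(np.concatenate(([0.0], s))))     # crude increments of F0
    surv_left = 1.0 - F0(s) + dF0                     # rough stand-in for 1 - F0(s-)
    factors = 1.0 - (k * dF0 + dN) / (k * surv_left + Y)
    return np.array([1.0 - np.prod(factors[s <= t]) for t in times])

# Toy check: with k = 0 this matches the Kaplan-Meier product-limit estimator.
t_obs = np.array([2.0, 3.0, 3.0, 5.0, 8.0]); delta = np.array([1, 1, 0, 1, 0])
F0 = lambda t: 1.0 - np.exp(-np.asarray(t) / 5.0)     # hypothetical prior guess
print(posterior_center(np.array([4.0, 9.0]), t_obs, delta, k=0.0, F0=F0))
```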

Relative Belief Ratio
Evans' (2015) relative belief ratio has become a widely used tool in statistical hypothesis testing. Assume that we have a statistical model given by the densities $\{f_\theta : \theta \in \Theta\}$ with respect to the Lebesgue measure on the parameter space $\Theta$. Let $\pi(\theta)$ be a prior on $\Theta$. After observing the data $(T, \delta)$, the posterior distribution of $\theta$ is given by the conditional density
$$\pi(\theta \,|\, (T, \delta)) = \frac{\pi(\theta)\, f_\theta(T, \delta)}{\int_\Theta \pi(\theta)\, f_\theta(T, \delta)\, d\theta}.$$
Assume that we are interested in making inferences about the parameter $\theta$, and let $\pi$ and $\pi(\cdot \,|\, (T, \delta))$ be continuous at $\theta_0$. Then the relative belief ratio for a hypothesized value $\theta_0$ of $\theta$ is given by
$$RB_\Theta(\theta_0 \,|\, (T, \delta)) = \frac{\pi(\theta_0 \,|\, (T, \delta))}{\pi(\theta_0)}. \tag{7}$$
This ratio of the posterior density to the prior density at $\theta_0$ measures how beliefs about $\theta_0$ change from a priori to a posteriori. When $\pi$ and $\pi(\cdot \,|\, (T, \delta))$ are discrete, the relative belief ratio is defined through limits. For more information, see Evans (2015).
The quantity $RB_\Theta(\theta_0 \,|\, (T, \delta))$ is a measure of the evidence that $\theta_0$ is the true value. If $RB_\Theta(\theta_0 \,|\, (T, \delta)) > 1$, then the probability of $\theta_0$ being the true value increases from a priori to a posteriori; thus, the data provide evidence in favor of $\theta_0$. If $RB_\Theta(\theta_0 \,|\, (T, \delta)) < 1$, then the probability of $\theta_0$ being the true value decreases from a priori to a posteriori, and the data provide evidence against $\theta_0$. The case $RB_\Theta(\theta_0 \,|\, (T, \delta)) = 1$ implies there is no evidence in either direction.
The ability to calibrate relative belief ratios is an appealing feature that renders them desirable in hypothesis testing problems. After calculating the relative belief ratio, it is critical to determine whether the result represents strong or weak evidence for or against $H_0: \theta = \theta_0$. A typical calibration of $RB_\Theta(\theta_0 \,|\, (T, \delta))$ is obtained by computing the tail probability (Evans, 2015)
$$Str_\Theta(\theta_0 \,|\, (T, \delta)) = \Pi\!\left(RB_\Theta(\theta \,|\, (T, \delta)) \le RB_\Theta(\theta_0 \,|\, (T, \delta)) \,\middle|\, (T, \delta)\right), \tag{8}$$
where $\Pi(\cdot \,|\, (T, \delta))$ is the posterior probability measure corresponding to the posterior density $\pi(\cdot \,|\, (T, \delta))$. Equation (8) is the posterior probability that the true value of $\theta$ has a relative belief ratio no greater than that of the hypothesized value $\theta_0$. When $RB_\Theta(\theta_0 \,|\, (T, \delta)) < 1$, there is evidence against $\theta_0$, and a small value of $Str_\Theta(\theta_0 \,|\, (T, \delta))$ represents a large posterior probability that the true value has a relative belief ratio greater than $RB_\Theta(\theta_0 \,|\, (T, \delta))$; hence, there is strong evidence against $\theta_0$. Similarly, when $RB_\Theta(\theta_0 \,|\, (T, \delta)) > 1$, a large value of $Str_\Theta(\theta_0 \,|\, (T, \delta))$ indicates strong evidence in favour of $\theta_0$, while a small value indicates weak evidence in favour of $\theta_0$. Figure 1 illustrates the strength of the evidence in both cases: the shaded area represents $Str_\Theta(\theta_0 \,|\, (T, \delta))$, and RB represents the relative belief ratio $RB_\Theta(\theta_0 \,|\, (T, \delta))$. The smaller the shaded area, the stronger the evidence for case (a); the larger the shaded area, the stronger the evidence for case (b).
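The following self-contained sketch illustrates (7) and (8) for a toy location model (normal data with known variance and a conjugate normal prior), chosen purely because both densities are then available in closed form; all numbers are hypothetical.

```python
# A sketch of the relative belief ratio and its strength for a toy conjugate
# normal model; the data, prior, and theta0 are hypothetical illustrations.
import numpy as np
from scipy import stats

x = np.array([0.8, 1.3, 0.2, 1.9, 1.1])          # hypothetical observed data
sigma = 1.0                                       # known sampling sd
prior = stats.norm(0.0, 2.0)                      # prior on theta

n = len(x)
post_var = 1.0 / (1.0 / prior.var() + n / sigma**2)
post_mean = post_var * (prior.mean() / prior.var() + x.sum() / sigma**2)
post = stats.norm(post_mean, np.sqrt(post_var))   # conjugate posterior

theta0 = 0.0
rb0 = post.pdf(theta0) / prior.pdf(theta0)        # RB(theta0 | data), Eq. (7)

# Strength (8): posterior probability that RB(theta | data) <= RB(theta0 | data).
theta_draws = post.rvs(size=100_000, random_state=3)
rb_draws = post.pdf(theta_draws) / prior.pdf(theta_draws)
strength = (rb_draws <= rb0).mean()
print(f"RB(theta0) = {rb0:.3f}, strength = {strength:.3f}")
```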

Cramér-Von Mises Distance
Let $F$ and $G$ be two cdfs; the Cramér-von Mises distance between $F$ and $G$ is defined as
$$d(F, G) = \int \left(F(x) - G(x)\right)^2 dG(x).$$
Other distances could be used (see Gibbs and Su (2002)), but $d$ has some computational advantages. A formula for the distance between the beta-Stacy process and a continuous cdf is provided in the following result.
Lemma 1. Let $F = \sum_{i=1}^M J_i \delta_{\theta_i}$, where $\delta_{\theta_i}$ denotes the point mass at $\theta_i$, and let $G$ be a continuous cdf on $\mathbb{R}^+$. Let $\theta_{(1)} \le \dots \le \theta_{(M)}$ denote the order statistics of $\theta_1, \dots, \theta_M$, and let $J_{(1)}, \dots, J_{(M)}$ be the associated jump sizes, such that $J_{(j)} = J_i$ when $\theta_{(j)} = \theta_i$. Then
$$d(F, G) = \frac{1}{3} \sum_{i=0}^{M} \left[\left(P_i - G(\theta_{(i)})\right)^3 - \left(P_i - G(\theta_{(i+1)})\right)^3\right],$$
where $P_i = \sum_{j=1}^i J_{(j)}$ (with $P_0 = 0$), $\theta_{(0)} = 0$, and $\theta_{(M+1)} = +\infty$.

Proof. Note that $F(x) = P_i$ for $x \in [\theta_{(i)}, \theta_{(i+1)})$, with $\theta_{(0)} = 0$ and $\theta_{(M+1)} = +\infty$. Then
$$d(F, G) = \sum_{i=0}^{M} \int_{\theta_{(i)}}^{\theta_{(i+1)}} \left(P_i - G(x)\right)^2 dG(x) = \frac{1}{3} \sum_{i=0}^{M} \left[\left(P_i - G(\theta_{(i)})\right)^3 - \left(P_i - G(\theta_{(i+1)})\right)^3\right].$$

When considering the prior and posterior distributions of the Cramér-von Mises distance, the following lemma allows the approximation $F_\varepsilon$ given by Algorithm A to be used in place of an exact draw from $\mathrm{BSP}(k(\cdot), F_0)$.

Lemma 2. If $F \sim \mathrm{BSP}(k(\cdot), F_0)$ and $F_\varepsilon$ is the approximation of $F$ given by Algorithm A, then $d(F_\varepsilon, F_0) \to d(F, F_0)$ almost surely as $\varepsilon \to 0$.

Proof. Since the integrand in $d(F_\varepsilon, F_0)$ is bounded by 1 and hence integrable, the proof follows from the dominated convergence theorem.
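The closed form in Lemma 1 is straightforward to implement. The following sketch evaluates $d(F, G)$ for a discrete $F$ with jumps $J_i$ at points $\theta_i$ and a continuous $G$; the helper name cvm_distance, the choice $G = \mathrm{Exp}(1)$, and the three-atom $F$ are illustrative.

```python
# A sketch of the closed form in Lemma 1: integrate (F - G)^2 dG piecewise
# between consecutive order statistics, as in the proof above.
import numpy as np
from scipy import stats

def cvm_distance(theta, J, G):
    """d(F, G) = int (F(x) - G(x))^2 dG(x) for F = sum_i J_i * delta_{theta_i}."""
    order = np.argsort(theta)
    u = G(np.concatenate(([0.0], np.sort(theta), [np.inf])))  # G at 0, theta_(i), +inf
    P = np.concatenate(([0.0], np.cumsum(J[order])))          # value of F on each interval
    # On [theta_(i), theta_(i+1)), F = P_i, contributing
    # ((P_i - u_i)^3 - (P_i - u_{i+1})^3) / 3 to the integral.
    return np.sum((P - u[:-1])**3 - (P - u[1:])**3) / 3.0

# Toy usage with a hypothetical continuous G = Exp(1) and a crude 3-atom F.
G = stats.expon().cdf
theta = np.array([0.3, 1.0, 2.5]); J = np.array([0.3, 0.4, 0.3])
print(cvm_distance(theta, J, G))
```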

Model Checking Using the Relative Belief
Consider the statistical model $\{F_\theta : \theta \in \Theta\}$ of continuous cdfs on $\mathbb{R}^+$. Let $X_1, \dots, X_n$ be a random sample drawn from an unknown cdf $F$ defined on $\mathbb{R}^+$. Assume that $(T, \delta) = ((T_1, \delta_1), \dots, (T_n, \delta_n))$ is observed, where $T_i = \min(X_i, C_i)$, $\delta_i = I(X_i \le C_i)$, and $C_1, \dots, C_n$ are censoring times. The aim is to test the hypothesis $H_0: F \in \{F_\theta : \theta \in \Theta\}$. Let $\mathrm{BSP}(k(\cdot), F_0)$ be the prior on $F$ for some choice of $k(\cdot)$ and $F_0$. Then $F(t) \,|\, (T, \delta) \sim \mathrm{BSP}(k^*(t), F_0^*(t))$, where $k^*(\cdot)$ and $F_0^*$ are defined in (5) and (6), respectively. If $H_0$ is true, then the posterior distribution of the distance between $F$ and the proposed model should be more concentrated around 0 than the prior distribution. As a result, the test compares the concentrations of the prior and posterior distributions of $d$ (see Lemma 1) about 0 using the relative belief ratio, with the interpretation discussed in Section 2.2.
To perform this test, we must compute the distance and then set relevant values for $k(\cdot)$ and $F_0$. To compute the distance, similar to Al-Labadi and Evans [24], we use $d(F, F_{\hat\theta})$, where $\hat\theta$ is the relative belief estimate of $\theta$, which coincides with the maximum likelihood estimate (MLE) for the full model parameter. In terms of hyperparameters, we set $F_0 = F_{\hat\theta}$ and $k(t) = k$ for all $t$. There are numerous advantages to setting $F_0 = F_{\hat\theta}$. First, it avoids prior-data conflict, a possible contradiction between the data and the prior. This typically happens when the prior places its mass in a region of the parameter space where the data suggest the true value does not lie ([25,26]). In the context of the approach considered in this paper, prior-data conflict arises whenever there is a small overlap between the effective support regions of $F$ and $F_{\hat\theta}$. Note that, by Lemma 1, the distance $d(F, F_{\hat\theta})$ depends on the prior guess $F_0$ through the jump points $\theta_i$. If the $\theta_i$ lay in one tail of $F_{\hat\theta}$, there would be prior-data conflict between $F$ and $F_{\hat\theta}$, because $F_0$ and $F_{\hat\theta}$ would not share the same effective support. To avoid this, the $\theta_i$ must be selected in a region that includes most of the mass of $F_{\hat\theta}$. When $F_0 = F_{\hat\theta}$, then $F_{\hat\theta}$ is the prior mean of $F$; thus, both share the same effective support, which renders this a reasonable choice for avoiding prior-data conflict. We refer the reader to Example 1 of Al-Labadi and Evans (2018) for an interesting discussion of prior-data conflict. Nevertheless, the choice $F_0 = F_{\hat\theta}$ should also avoid any impact of the "double use of the data", which would make the approach conservative in detecting model failure when $H_0$ is false. Although setting $F_0 = F_{\hat\theta}$ appears to induce a data-dependent prior distribution for $d$, the following lemma implies that this is not the case; thus, with this choice, the approach is prior distribution-free.
Lemma 3. Let $F \sim \mathrm{BSP}(k, F_{\hat\theta})$, i.e., $F_0 = F_{\hat\theta}$. Then the prior distribution of $d(F, F_{\hat\theta})$ does not depend on $F_{\hat\theta}$.

Proof. Using Lemma 1, since $(\theta_i)_{1 \le i \le M}$ is a sequence of i.i.d. random variables with distribution $F_{\hat\theta}$, we have $F_{\hat\theta}(\theta_{(i)}) = U_{(i)}$, where $U_{(i)}$ is the $i$-th order statistic of $(U_i)_{1 \le i \le M}$, an i.i.d. uniform$[0,1]$ sample. Hence $d(F_\varepsilon, F_{\hat\theta})$ depends on $F_{\hat\theta}$ only through the $U_{(i)}$ and the jump sizes. Now, letting $\varepsilon \to 0$, by Lemma 2, we conclude that the distribution of $d(F, F_{\hat\theta})$ does not depend on $F_{\hat\theta}$.

The following result shows that setting $F_0 = F_{\hat\theta}$ prevents any effect due to the double use of the data. Specifically, as the sample size increases, the posterior distribution of $d(F, F_{\hat\theta})$ given $(T, \delta) = ((T_1, \delta_1), \dots, (T_n, \delta_n))$ behaves as follows: (i) if $H_0$ is true, then, as $n \to \infty$, $d(F, F_{\hat\theta}) \,|\, (T, \delta)$ concentrates at 0; (ii) if $H_0$ is false, then, as $n \to \infty$, $\liminf d(F, F_{\hat\theta}) \,|\, (T, \delta) > 0$.

Now, concerning the choice of $k(\cdot)$, we consider $k(t) = k$ for all $t > 0$. In general, larger values of $k$ must be chosen to identify smaller deviations. Consequently, it is advisable to consider multiple values of $k$. One way to proceed is, for instance, to start with $k = 1$; if larger (smaller) values of $k$ render the relative belief ratios below (above) 1, $H_0$ is rejected (accepted). As Section 5 shows, when the null hypothesis is correct (not correct), the relative belief ratio always remains above (below) 1 when larger (smaller) values of $k$ are considered. When using the Dirichlet process, Al-Labadi and Zarepour [27] advised using $k \le 0.5n$ for complete data to avoid the prior becoming too influential. Setting $k$ between 1 and 10 is satisfactory for most purposes. As indicated in the Introduction, when $k(t) = k$, the beta-Stacy process turns into the Dirichlet process. However, in the presence of right-censored data, the posterior distribution of the Dirichlet process becomes a beta-Stacy process ([19]). This justifies the necessity of using the beta-Stacy process in the approach.

Computational Algorithm
Closed forms of the prior and posterior densities of $D = d(F, F_{\hat\theta})$ are required to compute the relative belief ratio as in (7). These are not usually available. As a result, the relative belief ratio must be approximated through simulation. A particular difficulty in computing (7) arises when both $\pi_D(0)$ and $\pi_{D|(T,\delta)}(0)$ are close to 0, where $\pi_D(\cdot)$ and $\pi_{D|(T,\delta)}(\cdot)$ denote the densities of $D$ and $D \,|\, (T, \delta)$, respectively. In such a case, determining $RB_D(0 \,|\, (T, \delta))$ is difficult. The relative belief ratio, formally defined as a limit as discussed in Section 2.2, can be approximated at zero by
$$RB_D(0 \,|\, (T, \delta)) \approx \frac{\Pi_{D|(T,\delta)}(d_c)}{\Pi_D(d_c)} \tag{9}$$
for a suitably small value $d_c$, where $\Pi_D(\cdot)$ and $\Pi_{D|(T,\delta)}(\cdot)$ denote the cdfs of $D$ and $D \,|\, (T, \delta)$, respectively. Following (8), the strength of the evidence based on the relative belief ratio $RB_D(0 \,|\, (T, \delta))$ can be computed as
$$\Pi_{D|(T,\delta)}\!\left(\left\{d : RB_D(d \,|\, (T, \delta)) \le RB_D(0 \,|\, (T, \delta))\right\}\right). \tag{10}$$
Appendix B provides computational Algorithm C for assessing $H_0$ on the basis of estimates of (9) and (10). A similar algorithm for complete data based on the Dirichlet process was developed by Al-Labadi and Evans [24].
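A minimal sketch of the approximation in (9) follows: the ratio of the posterior to prior cdfs of $D$ at a small cutoff $d_c$, estimated from Monte Carlo draws. The gamma draws below are synthetic placeholders standing in for draws of $d(F, F_{\hat\theta})$ produced by Algorithms A and B, and the choice of $d_c$ as a low prior quantile is only one plausible option.

```python
# A sketch of (9): estimate RB_D(0 | data) by the ratio of posterior to prior
# empirical cdfs of D at a small cutoff d_c. The draws are placeholders.
import numpy as np

rng = np.random.default_rng(seed=4)
d_prior = rng.gamma(shape=2.0, scale=0.05, size=5000)   # placeholder prior draws of D
d_post = rng.gamma(shape=2.0, scale=0.02, size=5000)    # placeholder posterior draws

d_c = np.quantile(d_prior, 0.05)                        # a suitably small cutoff
rb0 = (d_post <= d_c).mean() / (d_prior <= d_c).mean()  # Pi_{D|data}(d_c) / Pi_D(d_c)
print(f"estimated RB_D(0 | data) = {rb0:.2f}")          # > 1 favours H0
```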

Examples
The approach is demonstrated through two main examples in this section. Throughout, let Exp$(\lambda)$, Weibull$(k, \lambda)$, and Lognormal$(\mu, \sigma)$ denote the exponential distribution with mean $1/\lambda$, the Weibull distribution with shape parameter $k$ and scale parameter $\lambda$, and the lognormal distribution with parameters $\mu$ and $\sigma$ (the mean and standard deviation of the underlying normal distribution), respectively. In all examples, the sensitivity to the choice of $k$ is investigated. We set $\varepsilon = 0.01$, $i_0 = 1$, $M_0 = 20$, and $r_1 = r_2 = 1000$ in Algorithms A, B, and C, though other values are also possible. The R package parmsurvfit was used to compute the MLEs of the distribution parameters.

Example 1. A real dataset from Lee and Wang (2003), based on the remission times (in months) of cancer patients, is considered. The dataset is given in Appendix C. We tested the hypothesis $H_0$: $F$, given in Table 1, is the underlying distribution of the observed data, where $F$ could be a family of distributions (composite hypothesis) or a specific distribution with known parameters (simple hypothesis). Various values of $k$ were considered to investigate the approach's sensitivity to the choice of the concentration parameter. The p-value of the (frequentist) log-rank test was also computed for comparison purposes. Table 1 summarizes the findings. When $H_0$ is true, we want RB > 1 and a strength close to 1, and when $H_0$ is false, we want RB < 1 and a strength close to 0. According to Table 1, the proposed test performed well in this example.
Example 2. Simulated data. The primary purpose of this example is to investigate how the proposed test performs as the sample size increases. We considered data $(T_1, \delta_1), \dots, (T_n, \delta_n)$, where $T_i = \min(X_i, C_i)$, the survival times $(X_i)_{1 \le i \le n}$ were generated from Lognormal(1, 4), and the censoring times $(C_i)_{1 \le i \le n}$ were generated from Lognormal(4, 1). We tested the hypothesis $H_0$: $F$, given in Table 2, is the underlying distribution of the observed data. Table 2 summarizes the results, which show that the selected models were accepted. Figures A1-A3 (see Appendix D) give the plots of $F_0 = F_{\hat\theta}$ and 5 sample paths each of the prior beta-Stacy process and the posterior beta-Stacy process for each case in Table 2. These figures clearly show that the sample paths of the posterior process moved toward the plot of $F_0$, supporting the previous conclusion regarding the null hypothesis. Furthermore, in this case, the p-values of the (frequentist) log-rank test support the conclusion that the null hypothesis should not be rejected.
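As an illustration of this data-generating setup, the following sketch simulates $(T_i, \delta_i)$ and fits a lognormal model to the censored sample by maximum likelihood. The paper uses the R package parmsurvfit for the MLEs; here SciPy (version 1.11 or later, for scipy.stats.CensoredData) is assumed as a stand-in, and the lognormal parameters are read on the log scale.

```python
# A sketch of the Example 2 data-generating setup plus a censored-data MLE fit,
# assuming SciPy >= 1.11 as a stand-in for the R package used in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
n = 100
x = rng.lognormal(mean=1.0, sigma=4.0, size=n)   # survival times ~ Lognormal(1, 4)
c = rng.lognormal(mean=4.0, sigma=1.0, size=n)   # censoring times ~ Lognormal(4, 1)
t = np.minimum(x, c)
delta = (x <= c).astype(int)

data = stats.CensoredData.right_censored(t, censored=(delta == 0))
shape, loc, scale = stats.lognorm.fit(data, floc=0)   # MLE under right censoring
print(f"fitted mu = {np.log(scale):.3f}, sigma = {shape:.3f}")
```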

Concluding Remarks
The beta-Stacy process and the relative belief ratio were used to propose a general approach for model checking. This method can be used for both complete and right-censored data. It can also be used to test composite or simple hypotheses. Several examples demonstrated that the approach works very well.
Though the Cramér-von Mises distance was used here, other distance measures, such as the Kolmogorov-Smirnov and Anderson-Darling distances, are viable alternatives. Testing for families of multivariate distributions is an important extension of the approach presented in this paper; while conceptually similar, it raises computational and inferential issues that must be addressed. This problem can be addressed in future work.
Algorithm A: Simulation algorithm to approximate the prior beta-Stacy process. The following steps are required to accomplish this: ... (e) Set $F_\varepsilon(t) = 1 - \exp(-Z_\varepsilon(t))$. The approximate prior beta-Stacy process is $F_\varepsilon(t)$.
Algorithm B: Simulation algorithm to approximate the posterior beta-Stacy process.

1. Generate the posterior log-beta process. The following steps are required to accomplish this:
(a) Generate the process $\{Z^*(t)\}_{t \ge 0}$, where the random jumps $(S_i^*)_{1 \le i \le l}$, $l \le n$, associated with the fixed points of discontinuity are drawn from the distribution given in (4). These random jumps occur at the points where the data are not right censored.
(c) Form the approximated posterior log-beta process from the components generated in the previous steps.

Algorithm C estimates $RB_D(d \,|\, (T, \delta))$ by the ratio of the estimates of the posterior and prior contents of $[d_{i/M_0}(\mathrm{pri}), d_{(i+1)/M_0}(\mathrm{pri}))$, where $d_{i/M_0}(\mathrm{pri})$ denotes the estimated $(i/M_0)$-th prior quantile of $D$.
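A rough sketch of this estimation step follows: partition the range of $D$ into $M_0$ equal-prior-probability cells via estimated prior quantiles, estimate the relative belief ratio of each cell by the posterior-to-prior content ratio, and sum the posterior contents of the cells whose ratio is at most the estimate of $RB_D(0 \,|\, (T, \delta))$ to obtain the strength (10). The gamma draws are synthetic placeholders, as in the earlier sketch, and taking the first cell as the approximation at 0 is an illustrative simplification.

```python
# A sketch of the quantile-binning step of Algorithm C; all draws and the
# identification of RB_D(0 | data) with the first cell are illustrative.
import numpy as np

rng = np.random.default_rng(seed=6)
d_prior = rng.gamma(2.0, 0.05, size=5000)   # placeholder prior draws of D
d_post = rng.gamma(2.0, 0.02, size=5000)    # placeholder posterior draws of D

M0 = 20
edges = np.quantile(d_prior, np.linspace(0.0, 1.0, M0 + 1))   # d_{i/M0}(pri)
post_content = np.histogram(d_post, bins=edges)[0] / len(d_post)
rb_cells = post_content * M0            # prior content of each cell is 1/M0
rb0 = rb_cells[0]                       # cell nearest 0 approximates RB_D(0 | data)
strength = post_content[rb_cells <= rb0].sum()   # estimate of (10)
print(f"RB_D(0 | data) ~ {rb0:.2f}, strength ~ {strength:.3f}")
```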