Bivariate Proportional Hazard Models: Structure and Inference

: We focus on a variety of bivariate models with proportional hazard components. Models with proportional hazard marginals are described together with a selection of models with proportional hazard conditional distributions. The bivariate distributions with marginal proportional hazards distributions are shown to be closely related to certain known bivariate exponential models. Two distinct kinds of conditional speciﬁcation are i nvestigated. Discussion is provided of cases with hazard function components that are (1) completely unknown, (2) known to belong to given parametric families and (3) completely known. Since the models are designed for use with survival data, it is inevitable that the marginal and conditional distributions will be asymmetric. However, logarithmic transformations in some cases will result in symmetric


Introduction
Survival models involving families of densities with proportional hazard functions have proved to be useful for analyzing many lifetime data sets. Not infrequently bivariate survival data (involving related lifetimes) need to be analyzed. In this paper, we review several methods for generating suitable bivariate models for such situations. The key observation in the development is that proportional hazard models can be viewed as ones obtained via monotone transformations applied to exponential models. In the latter sections of the paper, related statistical inference issues are discussed. Since the models are designed for use with survival data, it is inevitable that the marginal and conditional distributions will be asymmetric. However, logarithmic transformations in some cases will result in symmetric component distributions.

Bivariate Distributions with Proportional Hazard Marginals
Let F 1 and F 2 be two absolutely continuous distribution functions with support (0, ∞) and with corresponding densities f 1 and f 2 and hazard functions h 1 and h 2 (where h i = f i /(1 − F i ), i = 1, 2).
We will say that (X 1 , X 2 ) has a bivariate marginal proportional hazard distribution associated with F 1 and F 2 and with parameters α 1 , α 2 > 0 if for i = 1, 2 and we will write (X 1 , X 2 ) ∼ PHM(F 1 , α 1 ; F 2 , α 2 ) and also X i ∼ PH(F i , α i ) i = 1, 2 (here PHM is an acronym for proportional hazard marginals). Note that the name is appropriate since for i = 1, 2 Observe that if (X 1 , X 2 ) ∼ PHM(F 1 , α 1 ; F 2 , α 2 ) then the X i 's admit a representation of the form where for each i, Y i ∼ exp(α i ) (i.e., f Y i (y i ) = α i e −α i y i I(y i > 0)). Note that the transformations in (3) are monotone increasing. In fact, it is not necessary to build the model with reference to two distribution functions F 1 and F 2 . Instead we can begin with Y i ∼ exp(α i ), i = 1, 2 and use two monotone increasing functions g i : (0, ∞) → (0, ∞) to define X i = g i (Y i ), i = 1, 2. If we denote the corresponding inverse functions by h i (x) = g −1 i (x), then it is readily verified that each X i has a proportional hazard distribution, i.e., that for i = 1, 2, It is however customary to use the representation (3) involving the two distribution functions F 1 and F 2 , and we will adhere to this convention.
Of course, in the representation (3), (Y 1 , Y 2 ) can have any bivariate exponential distribution that we wish to utilize. Popular choices of bivariate exponential distributions involving few additional parameters include: (iii) Marshall-Olkin distribution (see Marshall and Olkin [1]), with Many other choices are possible, see for example Kotz et al. ([2], pp. 350-385). For a specific example, if we choose (Y 1 , Y 2 ) to have a Gumbel Type I density, i.e., and use F 1 (x 1 ) = x γ 1 1 , 0 < x 1 < 1 and F 2 (x 2 ) = x γ 2 2 , 0 < x 2 < 1, then the resulting PHM density will be of the form: In application of such models, it is frequently desirable to postulate that each F i is a member of some parametric family of distributions to add flexibility to the model. The dependence structure will however be completely determined by the copula of the particular bivariate exponential distribution used in the construction.
Alternatively, one could "let the data tell us which F i 's to use in the model". Thus, we would seek monotone marginal transformations that will make the transformed marginal sample distributions look as much like exponential distributions as possible. This semiparametric approach will be returned to later in the paper.

Bivariate Distributions with Proportional Hazard Conditionals
We will consider two types of conditioning. The first kind is quite traditional in that we consider the distribution of one variable given that a second variable takes on a particular value. The second kind involves conditioning on the event that the second variable is larger than a particular value.

The First Kind
Consider two proportional hazard families of densities as in (1). Recall that we write X i ∼ PH(F i , α i ) if the corresponding densities are given by (1).
For this conditional proportional hazards paradigm, we seek to identify joint distributions for (X 1 , X 2 ) with all conditional densities of the forms in (1). Thus, for each x 2 > 0 we wish to have and for each x 1 > 0, for some functions α 1 (x 2 ) and α 2 (x 1 ). It is not difficult to verify that this will be the case if and only if where (Y 1 , Y 2 ) has a joint distribution with exponential conditionals, i.e., such that for each and for each y 1 > 0 Y 2 |Y 1 = y 1 ∼ exp(α 2 (y 1 )).
The class of all densities with such exponential conditionals is identified in Arnold and Strauss [3] and is of the form: where α 1 > 0, α 2 > 0 and δ ≥ 0. The normalizing constant, k(δ), in (10) can be expressed in terms of the exponential integral function. Thus Consequently, bivariate densities with conditionals of the proportional hazard form will be given by This model is discussed in Arnold and Kim [4] and we will follow their nomenclature and call it a proportional hazard conditionals model of the first kind. If (X 1 X 2 ) has a density of the form (13), we will write

The Second Kind
Again consider two proportional hazard families of densities as in (1). Recall that we write X i ∼ PH(F i , α i ) if the corresponding densities are given by (1). The second kind of conditional model (also introduced in Arnold and Kim [4]) involves conditioning on events of the form {X 1 > x 1 } and {X 2 > x 2 }. Thus, we seek to identify joint survival functions for (X 1 , X 2 ) such that for each x 2 > 0, and for each x 1 > 0, for some functions α 1 (x 2 ) and α 2 (x 1 ).
To analyze this situation (since F 1 and F 2 are known) it is again convenient to write where the Y i 's are exponential random variables. The conditions (14) and (15) are then equivalent to the statements and Denote the survival functions of Y 1 and Y 2 by ψ 1 (y 1 ) = P(Y 1 > y 1 ) and ψ 2 (y 2 ) = P(Y 2 > y 2 ). It then follows that Taking logarithms we have: This is a Stephanos-Levi-Civita-Suto functional Equation (see Arnold et al. [5], p. 13) which is readily solved to yield the following expression for the joint survival function of (Y 1 , for y 1 > 0, y 2 > 0, where α 1 > 0, α 2 > 0 and 0 ≤ δ ≤ 1. This is recognizable as Gumbel's Type I bivariate exponential distribution (with exponential marginals). From Equation (21) we obtain the joint survival function of (X 1 , X 2 ) in the form: Then, the joint cumulative distribution function is and the joint density function is The vector (Y 1 , Y 2 ) with density (21) has exponential marginals, i.e., Y i ∼ exp(α i ), for i = 1, 2, and thus X i ∼ PH(F i , α i , ) for i = 1, 2. Consequently, for j, j = 1, 2 the conditional densities are given by If (X 1 , X 2 ) has a survival function of the form (22), we write note that X 1 and X 2 in (22) will be independent if and only if δ = 0.

If F 1 and F 2 Are Known
Suppose that we have available a sample of size n, (X 1,j , X 2,j ), j = 1, 2, . . . , n from one of the bivariate proportional hazard models discussed in this paper. Since F 1 and F 2 are known, it is appropriate to transform the data to obtain and thus to have a sample (Y 1,j , Y 2,j ), j = 1, 2, . . . , n from the corresponding well-known bivariate exponential distribution. See the following references for appropriate estimation strategies for these bivariate exponential data sets: Arnold and Strauss ( [3,8]), • Castillo and Hadi [9], • Arnold et al. [5].

If F 1 and F 2 are Known to Belong to Some Given Parametric Families
We will illustrate this with a particular example. Other examples may treated in analogous fashion. Suppose that in the PHC(II) model, (23), we replace F 1 (x 1 ) by F 1 (x 1 ; θ) and F 2 (x 2 ) by F 2 (x 2 ; τ), where the parameters θ and τ are unknown. In this case, the model becomes more complicated, but we can still envision success in estimating all the parameters in the model. As a specific example, consider the following distributions of the Weibull form: and The corresponding log-likelihood function is of the form (θ; X 1 , X 2 ) = n log(α 1 α 2 θτ) + (θ − 1) The score function U(θ) = (U(θ), U(τ), U(δ), U(α 1 ), U(α 2 )), has elements which are derivative of the log-likelihood function with respect to the parameters and thus are given by By equating the scores to zero we obtain the score equations, i.e., (U(θ) = 0). These equations are typically solved by means of Newton-Raphson or quasi-Newton numerical methods to obtain the maximum likelihood estimatorsθ = (θ,τ,δ,α 1 ,α 2 ) of the parameter vector θ = (θ, τ, δ, α 1 , α 2 ). The observed information matrix of θ is given by , with elements of the form of minus the second derivative of the log-likelihood function with respect to the parameters. The Fisher information matrix of vector θ, I(θ) is given by I(θ) = E(K(θ)) and should be calculated numerically. When, we use the base distributions (26) and (27) in the PHC(I) model, a technique known as pseudo-likelihood estimation (see Arnold and Strauss [10]) will provide estimates of all five parameters in the model. Besag [7], defined the pseudo-likelihood estimator of θ as the value θ 0 of θ that maximizes the pseudo likelihood function, which in the present bivariate situation is based on the conditional PH densities and is given by Thus for the example with distributions: F 1 (x 1 ; γ 1 ) = x γ 1 1 , 0 < x 1 < 1, and F 2 (x 2 ; γ 2 ) = x γ 2 2 , 0 < x 2 < 1, we have the following log pseudo likelihood.
Parallel to the definition of the score function, the pseudo−score function is defined to be the vector whose coordinates are partial derivatives of the log-pseudo-likelihood function with respect to each of the parameters in the model. It is denoted by U p (β) = (U p (γ 1 ), U p (γ 2 ), U p (δ), U p (α 1 ), U p (α 2 )) .
The estimating equations are constructed by setting the elements of the pseudo−score vector equal to zero. Solutions of these equations correspond to the pseudo-likelihood estimates of the parameters of the model. Typically, these solutions are obtained numerically using iterative methods such as Newton-Raphson or quasi-Newton.
The pseudo-likelihood estimatorβ of β obtained in the above fashion can be verified to be consistent and asymptotically normally distributed with covariance matrix given by Arnold and Strauss [10]), where for l, m = 1, 2 As a consistent estimate of the asymptotic variance-covariance matrix of the pseudolikelihood estimator, we will use the sandwich estimator proposed by Cheng and Riu [11]. This estimator is developed as follows.
, be the vector of pseudo-scores for the i-th observation. Then definê which is the sum over all the observations of the matrices of second derivatives of p (β) evaluated at the pseudo-likelihood estimator β. In addition, definê Using this, we construct a consistent sandwich estimator of the asymptotic variancecovariance matrix in the formΣ A detailed discussion and analysis of such a model, but with power Lindley base distributions, utilizing pseudo-likelihood estimation, may be found in Martínez-Flórez et al. [12].

If F 1 and F 2 are Unknown
All but one of the bivariate proportional hazard models described in this paper have marginals of the proportional hazard form. The exception is the PHC(I) model which, for unknown F 1 and F 2 , we will discuss in Section 7. For the other models, we know that if F 1 and F 2 were known, we could transform the data to obtain a sample from a well-known bivariate exponential model. Consequently, if we consider an estimate F 1,n (x) of F 1 based on X 1,1 , X 1,2 , . . . X 1,n and an estimate F 2,n (x) of F 2 based on X 2,1 , X 2,2 , . . . X 2,n , we can transform the data using and then we will have approximately a sample from a bivariate distribution with standard exponential marginals and we can then estimate the parameters in this exponential model. Note that for identifiability in unknown F 1 and F 2 models we have to fix α 1 and α 2 to be equal to 1. For the PHC(II) model with Weibull component distributions, given in (26) and (27), a small simulation study of the performance of the maximum likelihood parameter estimates has been implemented for a variety of sample sizes and for several parametric configurations. With minimal loss of generality we set α 1 = α 2 = 1 throughout the simulation study. Three values of the dependence parameter δ were used, namely 0.15, 0.30 and 0.45, together with four sample sizes n = 30, 50, 70, 90. The table presents results for three representative choices of values for θ and τ. As for measures of performance, the relative bias (RB) and the square root of the mean squared error (MSE) are given.
The results in Table 1 confirm that both the relative bias and the root mean-squared error of the estimates decrease as sample size increases.

If F 1 and F 2 Are Unknown in the PHC(I) Model
Our model is of the PHC(I)(1, F 1 ; 1, F 2 ; δ) form, i.e., Where δ, F 1 , and F 2 are unknown. Although it would be easy to estimate δ, via pseudolikelihood, if F 1 and F 2 were known, it is not apparent how to estimate F 1 and F 2 assuming that δ is known. So it is not clear how to implement an iterative strategy for estimating F 1 , F 2 and δ simultaneously. Perhaps our only choice is to assume that the F 1 's belong to some parametric families of distributions, with once more utilizing pseudo likelihood to avoid dealing with k(δ).

Application
The data analyzed in this example consist of the maximum water levels registered at two stations on the Fox river in Wisconsin during the period 1918-1950. Measurements were made at an upstream location (Berlin, X 1 ) and a downstream location (Wrightstown, X 2 ). This data set was previously analyzed by Gumbel and Mustafi [13] using a bivariate extreme model.
In our analysis of this data set we will fit four models namely: • The Arnold and Strauss [3] bivariate exponential conditionals distribution. denoted by BEC. • Gumbel's [6] first bivariate exponential distribution, denoted by BG(I). • The proportional hazard conditionals Weibull extension of the BEC distribution, denoted by PHC(I)-W. • The proportional hazard conditionals Weibull extension of the BG(I) distribution, denoted by PHC(II)-W.
In both of the Weibull proportional hazard conditionals extensions mentioned above, i.e., PHC(I)-W and PHC(II)-W, as described in Section 3, we use the following choices for the component distributions F 1 and F 2 : Using the Arnold and Strauss [3] density given in Equation (11), the density of the PHC(I)-W is given by The corresponding log-pseudo-likelihood function for a sample of size n takes the form (θ; X (1) , X (2) ) = n log(α 1 α 2 θτ) + (θ − 1) The log-pseudo-likelihood for the BEC model is obtained from the expression for the PHC(I)-W by setting θ = 1 and τ = 1.
The log-likelihood function for a sample of size n from the PHC(II)-W is of the form given in Equation (28), with simple change of notation. In this case the corresponding log-likelihood function for the BG(I) model is again obtained by setting θ = 1 and τ = 1.
Using the Fox river data, maximizing the log-likelihood for the models BG(I) and PHC(II)-W and the log-pseudo-likelihood for the models BEC and PHC(I)-W, we obtain the estimates of the parameters of the four models given in Table 2 (with standard errors in parentheses).
To compare model fitting, we use the AIC (Akaike [14]) criterion, namely AIC=−2ˆ (·) + 2p. We also consider the BIC (Schwarz [15]) criterion, namely BIC=−2ˆ (·) + log(n)p, criterion where p is the number of parameters for the model being considered. The best model is the one with the smallest AIC or BIC.
According to the values of the AIC and BIC criteria for the Fox river data, the best model is the PHC(I)-W followed by the PHC(II)-W model. Since the BEC and BG(I) models are special cases of the PHC(I)-W and PHC(II)-W models, respectively, obtained by setting θ = τ = 1, we may test the hypotheses H 0 : (θ, τ) = (1, 1) versus H 1 : (θ, τ) = (1, 1) for comparing the PHC(II)-W and PHC(I)-W models with the BG(I) and BEC models, respectively.
The corresponding values of −2 log(Λ) in each case are provided in Table 3 (note in the BEC-PHC(I)-W comparison the log-pseudo-likelihoods have been utilized instead of log-likelihoods) which are greater than the value of the χ 2 2,99% = 9.210 indicating that the PHC(II)-W and PHC(I)-W models are significantly better at the 1% level. Thus, the PHC(II)-W and PHC(I)-W models appear to be good alternative for fitting the set data. The choice between the PHC(I)-W and PHC(II)-W is not so clear-cut, but perhaps the PHC(I)-W might be considered to be marginally better.
The graphs in Figures 1a,b and 2a,b show the contours of the densities BG(I) and BEC and of the fitted models for PHC(II)-W and PHC(I)-W, respectively. Under the assumption that the forms of the F i 's are unknown, we use the transformations Z 1,j = − log F 1,n (X 1,j ) and Z 2,j = − log F 2,n (X 2,j ), to arrive at a BG(I) model with joint survival function S(z 1 , z 2 ) = exp(−z 1 − z 2 − δz 1 z 2 ).
Then, using the expression for the maximum likelihood estimate of δ provided by Kotz et al. ( [2], p. 352), we obtainδ = 0.2986, a much smaller value than the estimated value of the parameter δ obtained assuming a known form for the F i 's. Perhaps this indicates that the Weibull choices for the F i 's are not optimal.

Discussion
The bivariate models discussed in this paper utilize quite different approaches to their construction and thus can be expected to exhibit significantly different distributional properties, especially with regard to dependence. Future research on such models should put some focus on the problem of selecting the appropriate one of these models for a particular data set. Of course, the old stand-by of fitting via maximum likelihood and comparing models via AIC and BIC is always available.

Data Availability Statement:
The data can be found in Gumbel and Mustafi (1967).