Article

Statistical Generalized Derivative Applied to the Profile Likelihood Estimation in a Mixture of Semiparametric Models

School of Mathematics and Statistics, Victoria University of Wellington, P.O. Box 600, 6140 Wellington, New Zealand
*
Author to whom correspondence should be addressed.
Entropy 2020, 22(3), 278; https://doi.org/10.3390/e22030278
Submission received: 9 January 2020 / Revised: 19 February 2020 / Accepted: 25 February 2020 / Published: 28 February 2020

Abstract

There is a difficulty in finding an estimate of the standard error (SE) of the profile likelihood estimator in the joint model of longitudinal and survival data. The difficulty lies in differentiating an implicit function that appears in the profile likelihood estimation. We solve it by introducing the “statistical generalized derivative”. The derivative is used to show the asymptotic normality of the estimator, with the SE expressed in terms of the profile likelihood score function.

1. Introduction

This paper proposes a method to show the asymptotic normality of the maximum profile likelihood estimator (profile likelihood MLE) in a mixture of semiparametric models with the EM-algorithm. We derive an expression for the standard error (SE) of the estimator using the profile likelihood score function. As an example, we consider a joint model of ordinal responses and the proportional hazards model with a finite mixture. Through this example, we demonstrate a solution to the theoretical challenge in joint models of survival and longitudinal data stated by [1]: “No distributional or asymptotic theory is available to date, and even the standard errors (SEs), defined as the standard deviations of the parametric estimators, are difficult to obtain.” The core difficulty is dealing with an implicit function that is hard to differentiate. In the profile likelihood approach, we profile out the baseline hazard function by plugging an estimate of the hazard function into the likelihood function. This estimator of the hazard function is an implicit function in our problem (see also [2], p. 67).
Here, we review some of the related work. For a more complete review of joint models, see [2,3]. The paper [4] proposed a profile likelihood approach to a joint model with unobserved random effects. The EM-algorithm was applied to estimate the parameters of the model. This approach became one of the standard models and has been adopted by many (for example, [5,6,7]). In a series of studies ([8,9,10,11]), the authors also adopted the profile likelihood approach of [4] and showed the asymptotic normality of the profile likelihood MLE. As a result, they obtained the SE of the estimator. Another well-known approach was proposed by [12]. They used "an approximate least favorable submodel" to show the asymptotic normality of the profile likelihood MLE. The SE of the profile likelihood estimator was obtained from the asymptotic normality.
All of the existing work mentioned here avoided dealing with the implicit function in the profile likelihood. For example, the works [8,9,10,11] used the equality between the maximum likelihood estimator (MLE) and the maximum profile likelihood estimator (profile likelihood MLE). They showed the asymptotic normality of the MLE in the joint models and used that result to show the asymptotic normality of the profile likelihood MLE. This indirect proof avoided differentiating the implicit function in the profile likelihood. In the case of [12], "an approximate least favorable submodel" was used to approximate the profile likelihood function, and this approximation was used to prove the result without differentiating the profile likelihood function directly.
In summary, the existing methods established the asymptotic normality of the profile likelihood MLE by avoiding the differentiation of the profile likelihood function. They showed that the variance of the profile likelihood MLE is the inverse of an efficient information matrix. However, they did not express the efficient information matrix in terms of the score function of the profile likelihood.
In this paper, we introduce "the statistical generalized derivative" to deal with the differentiation of the implicit function in the profile likelihood under consideration. Our approach enables us to expand the profile likelihood function and to show the asymptotic normality of the profile likelihood MLE with the efficient information matrix expressed as the variance of the profile likelihood score function.
The results of this paper give us an analytical understanding of profile likelihood estimation and a method of computing the SE of the profile likelihood MLE in terms of the profile likelihood score function.

2. Mixture of Semiparametric Models and Statistical Generalized Derivative

2.1. Introduction of Mixture Model and Notations

We consider a mixture of semiparametric models whose density is of the form
p(x; \theta, \eta, \pi) = \sum_{r=1}^{R} \pi_r \, p_r(x; \theta_r, \eta_r),
where for each r = 1, …, R, p_r(x; θ_r, η_r) is a semiparametric model with a finite-dimensional parameter θ_r ∈ Θ_r ⊂ R^{m_r} and an infinite-dimensional parameter η_r ∈ H_r, where H_r is a subset of a Banach space B_r, and π_1, …, π_R are mixture probabilities. We assume that π_r > 0 for each r and Σ_{r=1}^R π_r = 1. We denote θ = (θ_1, …, θ_R), η = (η_1, …, η_R) and π = (π_1, …, π_R). Once we observe iid data X_1, …, X_n from the mixture model, the joint probability function of the data X = (X_1, …, X_n) is given by
p(X; \theta, \eta, \pi) = \prod_{i=1}^{n} \sum_{r=1}^{R} \pi_r \, p_r(X_i; \theta_r, \eta_r).
We consider θ to be the parameter of interest, and η and π nuisance parameters.
To discuss the EM-algorithm, we introduce further notation (following [13]). Let Z_i = (Z_{i1}, …, Z_{iR}) be a group indicator variable for subject i: for each r, Z_{ir} = 0 or 1 with P(Z_{ir} = 1) = π_r, and Σ_{r=1}^R Z_{ir} = 1. Let Z = (Z_1, …, Z_n). The joint probability function of the complete data (X, Z) is
p(X, Z; \theta, \eta, \pi) = \prod_{i=1}^{n} \prod_{r=1}^{R} \left[ \pi_r \, p_r(X_i; \theta_r, \eta_r) \right]^{Z_{ir}}.
Since the Z_{ir} are unobserved, it is common to replace them with their expected values. The expected complete-data log-likelihood under p(Z | X; θ, η, π) is
\sum_{i=1}^{n} \sum_{r=1}^{R} E(Z_{ir}) \left[ \log \pi_r + \log p_r(X_i; \theta_r, \eta_r) \right] = \sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \left[ \log \pi_r + \log p_r(X_i; \theta_r, \eta_r) \right],
where
\gamma(Z_{ir}) = \frac{\pi_r \, p_r(X_i; \theta_r, \eta_r)}{\sum_{j=1}^{R} \pi_j \, p_j(X_i; \theta_j, \eta_j)}, \quad r = 1, \dots, R.
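To make the E-step quantity concrete, the following is a minimal NumPy sketch of the computation of γ(Z_ir), using Gaussian components as a hypothetical stand-in for the semiparametric components p_r (the function name `responsibilities` is ours, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, pi, mu, sigma):
    """Posterior group probabilities gamma(Z_ir) in a Gaussian mixture.

    x: (n,) observations; pi: (R,) mixing probabilities;
    mu, sigma: (R,) component parameters, standing in for (theta_r, eta_r).
    Returns an (n, R) matrix whose rows sum to one.
    """
    # component densities p_r(x_i; theta_r, eta_r), shape (n, R)
    dens = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    weighted = pi[None, :] * dens                        # pi_r * p_r(x_i)
    return weighted / weighted.sum(axis=1, keepdims=True)
```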

2.2. Introducing Profile Likelihood, the Efficient Score Function and the Efficient Information Matrix

The efficient score function and information matrix in the mixture model: The score function for θ and the score operator for η in the mixture model given in (1) are, respectively,
\dot{\ell}(x; \theta, \eta) = \frac{\partial}{\partial \theta} \log \sum_{r=1}^{R} \pi_r \, p_r(x; \theta_r, \eta_r) = \sum_{r=1}^{R} \gamma_r(z_r) \, \frac{\partial}{\partial \theta} \log p_r(x; \theta_r, \eta_r),
and
B(x; \theta, \eta) = d_\eta \log \sum_{r=1}^{R} \pi_r \, p_r(x; \theta_r, \eta_r) = \sum_{r=1}^{R} \gamma_r(z_r) \, d_\eta \log p_r(x; \theta_r, \eta_r),
where γ_r(z_r) is given in (5) with Z_{ir} replaced by z_r. The notation d_η denotes the Hadamard derivative operator with respect to the parameter η.
Let θ_0, η_0 be the true values of θ, η, and denote ℓ̇_0(x) = ℓ̇(x; θ_0, η_0) and B_0(x) = B(x; θ_0, η_0). Then it follows from the standard theory ([14], page 374) that the efficient score function ℓ̃_0 and the efficient information matrix Ĩ_0 in the semiparametric mixture model are given by
\tilde{\ell}_0(x) = \left( I - B_0 ( B_0^{*} B_0 )^{-1} B_0^{*} \right) \dot{\ell}_0(x),
and
\tilde{I}_0 = E [ \tilde{\ell}_0 \tilde{\ell}_0^{T} ].
Note: Equations (6) and (7) show that the score functions in the semiparametric mixture model (1) coincide with the ones for the expected complete data likelihood (4).
Introduction of the profile likelihood and its score functions: In the estimation of (θ, η), we use the profile likelihood approach. Let F be a cdf which belongs to a set ℱ containing the empirical cdf F_n and the true cdf F_0. In the profile likelihood approach, we obtain a function (θ, F) ↦ η̂_{θ,F} = (η̂_{1,θ,F}, …, η̂_{R,θ,F}) whose values are in the space of the parameter η = (η_1, …, η_R). Then the profile likelihood function is defined by
p(X; \theta, F, \pi) = \prod_{i=1}^{n} \sum_{r=1}^{R} \pi_r \, p_r(X_i; \theta_r, \hat{\eta}_{r,\theta,F}).
We also define the score functions for the profile likelihood in the model:
\phi(x; \theta, F) = \frac{\partial}{\partial \theta} \log \sum_{r=1}^{R} \pi_r \, p_r(x; \theta_r, \hat{\eta}_{r,\theta,F}) = \sum_{r=1}^{R} \gamma_r(z_r) \, \frac{\partial}{\partial \theta} \log p_r(x; \theta_r, \hat{\eta}_{r,\theta,F})
and
\psi(x; \theta, F) = d_F \log \sum_{r=1}^{R} \pi_r \, p_r(x; \theta_r, \hat{\eta}_{r,\theta,F}) = \sum_{r=1}^{R} \gamma_r(z_r) \, d_F \log p_r(x; \theta_r, \hat{\eta}_{r,\theta,F}),
where γ_r(z_r) is given in (5) with Z_{ir} replaced by z_r and η_r by η̂_{r,θ,F}.

2.3. The EM-Algorithm to Obtain the Profile Likelihood MLE

Here, we describe the EM-algorithm applied to the profile likelihood function to obtain the profile likelihood MLE θ ^ of θ :
The E-step:
In the E-step, we use the current parameter estimates θ̂ of θ to find the expected values of Z_{ir}:
\gamma(Z_{ir}) = \frac{\pi_r \, p_r(X_i; \hat{\theta}_r, \hat{\eta}_{r,\hat{\theta},F_n})}{\sum_{j=1}^{R} \pi_j \, p_j(X_i; \hat{\theta}_j, \hat{\eta}_{j,\hat{\theta},F_n})}, \quad r = 1, \dots, R.
The M-step:
In the M-step,
  • Calculate the estimates of π_r:
    \hat{\pi}_r = \frac{1}{n} \sum_{i=1}^{n} \gamma(Z_{ir}).
  • Keeping γ(Z_{ir}) as a constant, we maximize the expected complete-data log-likelihood
    \sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \log p_r(X_i; \theta_r, \hat{\eta}_{r,\theta,F_n})
    with respect to θ to obtain new estimates θ̂.
The estimated parameters from the M-step are returned to the E-step until the value of θ̂ converges. The resulting estimator is the profile likelihood MLE.
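A minimal sketch of this iteration, reusing the `responsibilities` helper above (again with the hypothetical Gaussian stand-in components, for which the profiling map η̂_{θ,F_n} is trivial, so the M-step reduces to weighted maximum likelihood):

```python
import numpy as np

def em_profile_mle(x, R, n_iter=200, tol=1e-8):
    """EM iteration of Section 2.3 for the Gaussian stand-in model."""
    n = len(x)
    pi = np.full(R, 1.0 / R)
    mu = np.quantile(x, np.linspace(0.1, 0.9, R))   # spread-out starting values
    sigma = np.full(R, x.std())
    for _ in range(n_iter):
        gamma = responsibilities(x, pi, mu, sigma)  # E-step: gamma(Z_ir)
        nk = gamma.sum(axis=0)                      # effective group sizes
        pi, mu_old = nk / n, mu                     # M-step: pi_r-hat = mean of gamma
        mu = (gamma * x[:, None]).sum(axis=0) / nk  # weighted MLE of the group means
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        if np.max(np.abs(mu - mu_old)) < tol:       # stop once theta-hat settles
            break
    return pi, mu, sigma
```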

2.4. Statistical Generalized Derivative and Asymptotic Normality of the Profile Likelihood MLE

In this section, we present the main results. We show the asymptotic normality of the profile likelihood MLE in Theorem 2; from the asymptotic normality, we obtain an estimate of the SE of the estimator. Toward this goal, we introduce the statistical generalized derivative in Theorem 1.
Assumptions: We list the assumptions used in Theorems 1 and 2 below.
Let θ_0, η_0 and F_0 denote the true values of the parameters θ, η and the cdf F.
On the set ℱ of cdf functions, we use the sup-norm, i.e., for F, F_0 ∈ ℱ,
\| F - F_0 \|_{\infty} = \sup_x | F(x) - F_0(x) |.
For ρ > 0, let
C_\rho = \{ F \in \mathcal{F} : \| F - F_0 \|_{\infty} < \rho \}.
We denote the density function for the profile likelihood in the mixture model by
p(x; \theta, F) = \sum_{r=1}^{R} \pi_r \, p_r(x; \theta_r, \hat{\eta}_{r,\theta,F}).
We assume that:
(R1)
The density function p(x; θ, F) is bounded away from 0, i.e., there is a constant c > 0 such that for each x and (θ, F) ∈ Θ × ℱ, p(x; θ, F) > c > 0. Moreover, the density function p(x; θ, F) is continuously differentiable with respect to θ and Hadamard differentiable with respect to F for all x. We denote the derivatives by φ(x; θ, F) = (∂/∂θ) log p(x; θ, F) and ψ(x; θ, F) = d_F log p(x; θ, F); they are given in (11) and (12).
(R2)
We assume that η̂_{θ,F} satisfies η̂_{θ_0,F_0} = η_0 and that the function
\tilde{\ell}_0(x) := \phi(x; \theta_0, F_0)
is the efficient score function. Further, we assume n^{1/2} (F_n − F_0) = O_P(1) and η̂_{θ̂_n,F_n} − η_0 = o_P(1) if θ̂_n is a consistent estimator of θ_0.
(R3)
The efficient information matrix Ĩ_0 = E[ℓ̃_0 ℓ̃_0^T] = E[φφ^T(X; θ_0, F_0)] is invertible.
(R4)
The score function φ(x; θ, F) defined in (11) takes the form
\phi(x; \theta, F) = \tilde{\phi}(x; \theta, F, \hat{\eta}_{\theta,F}),
where, by assumption (R2), the efficient score function is given by
\tilde{\ell}_0(x) = \phi(x; \theta_0, F_0) = \tilde{\phi}(x; \theta_0, F_0, \eta_0).
We assume that there exist ρ > 0 and neighborhoods Θ and H of θ_0 and η_0, respectively, such that C_ρ and H are Donsker, and that the class of functions { φ̃(x; θ, F, η) : (θ, F, η) ∈ Θ × C_ρ × H } has a square-integrable envelope function and is Lipschitz in the parameters (θ, F, η):
\| \tilde{\phi}(x; \theta, F, \eta) - \tilde{\phi}(x; \theta', F', \eta') \| \le M(x) \left( \| \theta - \theta' \| + \| F - F' \|_{\infty} + \| \eta - \eta' \| \right),
where M(x) is a P_0-square integrable function.
Note: Since assumption (R4) is an unusual condition, it requires an explanation. In our profile likelihood problem, the estimate η̂_{θ,F} of the parameter η is an implicit function. Therefore, the map (θ, F) ↦ η̂_{θ,F} is hard to differentiate, and it follows that the profile likelihood score function φ(x; θ, F) defined in (11) is hard to differentiate. However, it is often the case that the function φ(x; θ, F) takes the form in (16). In our examples, the map (θ, F, η) ↦ φ̃(x; θ, F, η) is in closed form and therefore easy to differentiate. This differentiability becomes handy when we need to expand the function φ(x; θ, F) through the differentiability of φ̃(x; θ, F, η). We use assumption (R4) in the proof of Equation (20) in Theorem 1. In the example in Section 3, the function in (44) corresponds to η̂_{θ,F} and the function in (51) corresponds to φ̃(x; θ, F, η).
Note: To calculate the second derivative of the score function φ(x; θ, F) given in (11), we use an idea similar to the derivative of generalized functions ([15]). Let φ ↦ (f, φ) = ∫ f(x) φ(x) dx be a generalized function, where the test function φ vanishes outside of some interval. If f and φ are differentiable with derivatives f′ and φ′, then, by integration by parts,
(f', \phi) = \int f'(x) \phi(x) \, dx = - \int f(x) \phi'(x) \, dx = - (f, \phi').
We define the derivative (f′, φ) of the generalized function φ ↦ (f, φ) to be −(f, φ′). This definition is valid even if f is not differentiable, provided φ is differentiable.
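As a concrete illustration (a standard example, not specific to this paper), take f(x) = |x|, which has no classical derivative at 0. For any differentiable test function φ vanishing outside an interval, integration by parts on each half-line gives
- (f, \phi') = - \int |x| \, \phi'(x) \, dx = \int \operatorname{sign}(x) \, \phi(x) \, dx = ( \operatorname{sign}, \phi ),
so the generalized derivative of |x| is sign(x), even though the pointwise derivative fails to exist at 0.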
A similar idea can be applied to our problem. Suppose the density for the profile likelihood p(x; θ, F) given in (10) is twice differentiable with respect to θ; then, by differentiating
\int \frac{\partial}{\partial \theta} \log p(x; \theta, F) \; p(x; \theta, F) \, dx = 0
with respect to θ at (θ, F) = (θ_0, F_0), we get equivalent expressions for the efficient information matrix in terms of the score function φ(x; θ_0, F_0):
\tilde{I}_0 = E \left[ \phi \phi^T (X; \theta_0, F_0) \right] = - E \left[ \frac{\partial}{\partial \theta^T} \phi(X; \theta_0, F_0) \right].
From this equation, we are motivated to define the expected derivative of the score function E[(∂/∂θ^T) φ(X; θ_0, F_0)] to be −E[φφ^T(X; θ_0, F_0)]. In the following theorem, we show that this definition is valid even when the derivative of the score function (∂/∂θ^T) φ(x; θ, F) does not exist.
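For completeness, the differentiation step behind this identity is the standard product-rule computation (assuming differentiation under the integral sign is permitted):
0 = \frac{\partial}{\partial \theta^T} \int \frac{\partial}{\partial \theta} \log p \; p \, dx = \int \frac{\partial^2}{\partial \theta \, \partial \theta^T} \log p \; p \, dx + \int \frac{\partial}{\partial \theta} \log p \left( \frac{\partial}{\partial \theta^T} \log p \right) p \, dx,
which, evaluated at (θ, F) = (θ_0, F_0), rearranges to E[(∂/∂θ^T) φ(X; θ_0, F_0)] = −E[φφ^T(X; θ_0, F_0)].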
Theorem 1.
(Statistical generalized derivative) Let p(x; θ, F) = Σ_{r=1}^R π_r p_r(x; θ_r, η̂_{r,θ,F}), φ(x; θ, F) = (∂/∂θ) log p(x; θ, F), and ψ(x; θ, F) = d_F log p(x; θ, F), as defined in (15), (11) and (12), respectively.
Suppose (R1) and (R4) hold. Then, for θ_t → θ_0 and F_t → F_0 as t → 0, we have that
E \left[ t^{-1} \{ \phi(X; \theta_t, F_0) - \phi(X; \theta_0, F_0) \} \right] = - E \left[ \phi(X; \theta_0, F_0) \, \phi^T(X; \theta_0, F_0) \right] \{ t^{-1} ( \theta_t - \theta_0 ) \} + o(1),
and
E \left[ t^{-1} \{ \phi(X; \theta_t, F_t) - \phi(X; \theta_t, F_0) \} \right] = - E \left[ \phi(X; \theta_0, F_0) \, \psi(X; \theta_0, F_0) \right] \{ t^{-1} ( F_t - F_0 ) \} + O \left\{ t^{-1} ( \| \theta_t - \theta_0 \| + \| F_t - F_0 \| ) ( \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ) \right\} + o \left( 1 + \| \theta_t - \theta_0 \| + \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \eta_0 \| \right).
Note: Even when the derivative (∂/∂θ) φ(x; θ, F) does not exist, Equation (19) shows that the derivative of the map θ ↦ E[φ(X; θ, F)] exists. We may call this derivative the statistical generalized derivative. A similar comment holds for (20).
Proof. 
In (R1) we assumed that the density function p(x; θ, F) is bounded away from 0 and that it is differentiable with respect to θ and F. It follows that
\frac{t^{-1} \{ p(x; \theta_t, F_0) - p(x; \theta_0, F_0) \}}{p(x; \theta_0, F_0)} = \phi^T(x; \theta_0, F_0) \{ t^{-1} ( \theta_t - \theta_0 ) \} + o(1),
\frac{t^{-1} \{ p(x; \theta_t, F_t) - p(x; \theta_t, F_0) \}}{p(x; \theta_t, F_0)} = \psi(x; \theta_t, F_0) \{ t^{-1} ( F_t - F_0 ) \} + o(1),
and
\frac{t^{-1} \{ p(x; \theta_t, F_t) - p(x; \theta_0, F_0) \}}{p(x; \theta_0, F_0)} = \phi^T(x; \theta_0, F_0) \, t^{-1} ( \theta_t - \theta_0 ) + \psi(x; \theta_0, F_0) \, t^{-1} ( F_t - F_0 ) + o(1).
We first prove Equation (19). For each t, the equality
0 = t^{-1} \left[ \int \phi(x; \theta_t, F_0) \, p(x; \theta_t, F_0) \, dx - \int \phi(x; \theta_0, F_0) \, p(x; \theta_0, F_0) \, dx \right] = \int t^{-1} \{ \phi(x; \theta_t, F_0) - \phi(x; \theta_0, F_0) \} \, p(x; \theta_0, F_0) \, dx + \int \phi(x; \theta_t, F_0) \, t^{-1} \{ p(x; \theta_t, F_0) - p(x; \theta_0, F_0) \} \, dx
holds. It follows that, for each t, we have
\int t^{-1} \{ \phi(x; \theta_t, F_0) - \phi(x; \theta_0, F_0) \} \, p(x; \theta_0, F_0) \, dx = - \int \phi(x; \theta_t, F_0) \, t^{-1} \{ p(x; \theta_t, F_0) - p(x; \theta_0, F_0) \} \, dx.
By the dominated convergence theorem with (21), the right-hand side of (24) satisfies, as t → 0,
\int \phi(x; \theta_t, F_0) \, t^{-1} \{ p(x; \theta_t, F_0) - p(x; \theta_0, F_0) \} \, dx = \int \phi(x; \theta_t, F_0) \, \frac{t^{-1} \{ p(x; \theta_t, F_0) - p(x; \theta_0, F_0) \}}{p(x; \theta_0, F_0)} \, p(x; \theta_0, F_0) \, dx = \int \phi(x; \theta_0, F_0) \, \phi^T(x; \theta_0, F_0) \, p(x; \theta_0, F_0) \, dx \; \{ t^{-1} ( \theta_t - \theta_0 ) \} + o(1).
It follows that we have (19):
\int t^{-1} \{ \phi(x; \theta_t, F_0) - \phi(x; \theta_0, F_0) \} \, p(x; \theta_0, F_0) \, dx = - \int \phi(x; \theta_0, F_0) \, \phi^T(x; \theta_0, F_0) \, p(x; \theta_0, F_0) \, dx \; \{ t^{-1} ( \theta_t - \theta_0 ) \} + o(1).
Now we prove the second Equation (20). As at the beginning of the proof of (19), for each t, the following equation holds:
\int t^{-1} \{ \phi(x; \theta_t, F_t) - \phi(x; \theta_t, F_0) \} \, p(x; \theta_t, F_t) \, dx = - \int \phi(x; \theta_t, F_0) \, t^{-1} \{ p(x; \theta_t, F_t) - p(x; \theta_t, F_0) \} \, dx.
Using (23) with the dominated convergence theorem, the left-hand side of (25) satisfies, as t → 0,
\int t^{-1} \{ \phi(x; \theta_t, F_t) - \phi(x; \theta_t, F_0) \} \, p(x; \theta_t, F_t) \, dx - \int t^{-1} \{ \phi(x; \theta_t, F_t) - \phi(x; \theta_t, F_0) \} \, p(x; \theta_0, F_0) \, dx = \int \{ \phi(x; \theta_t, F_t) - \phi(x; \theta_t, F_0) \} \, \frac{t^{-1} \{ p(x; \theta_t, F_t) - p(x; \theta_0, F_0) \}}{p(x; \theta_0, F_0)} \, p(x; \theta_0, F_0) \, dx \le \int M(x) \left\| \phi^T(x; \theta_0, F_0) \, t^{-1} ( \theta_t - \theta_0 ) + \psi(x; \theta_0, F_0) \, t^{-1} ( F_t - F_0 ) + o(1) \right\| p(x; \theta_0, F_0) \, dx \times ( \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ) = O \left\{ t^{-1} ( \| \theta_t - \theta_0 \| + \| F_t - F_0 \| ) ( \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ) \right\} + o ( \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ),
where we used (17): with a P_0-square integrable function M(x),
\| \phi(x; \theta_t, F_t) - \phi(x; \theta_t, F_0) \| = \| \tilde{\phi}(x; \theta_t, F_t, \hat{\eta}_{\theta_t,F_t}) - \tilde{\phi}(x; \theta_t, F_0, \hat{\eta}_{\theta_t,F_0}) \| \le M(x) ( \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ).
By the dominated convergence theorem with (22), it follows that the integral on the right-hand side of Equation (25) is
\int \phi(x; \theta_t, F_0) \, t^{-1} \{ p(x; \theta_t, F_t) - p(x; \theta_t, F_0) \} \, dx = \int \phi(x; \theta_0, F_0) \, \psi(x; \theta_0, F_0) \, t^{-1} ( F_t - F_0 ) \, p(x; \theta_0, F_0) \, dx + O \left\{ t^{-1} \| F_t - F_0 \| ( \| \theta_t - \theta_0 \| + \| \hat{\eta}_{\theta_t,F_0} - \eta_0 \| ) \right\} + o \left( 1 + \| \theta_t - \theta_0 \| + \| \hat{\eta}_{\theta_t,F_0} - \eta_0 \| \right),
where again we used (17): there is a P_0-square integrable function M(x) such that
\| \phi(x; \theta_t, F_0) - \phi(x; \theta_0, F_0) \| = \| \tilde{\phi}(x; \theta_t, F_0, \hat{\eta}_{\theta_t,F_0}) - \tilde{\phi}(x; \theta_0, F_0, \eta_0) \| \le M(x) ( \| \theta_t - \theta_0 \| + \| \hat{\eta}_{\theta_t,F_0} - \eta_0 \| ).
Since O( \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ) = O( \| \hat{\eta}_{\theta_t,F_t} - \eta_0 \| + \| \hat{\eta}_{\theta_t,F_0} - \eta_0 \| ), combining (26) and (27) shows that the equality (25) is equivalent to
\int t^{-1} \{ \phi(x; \theta_t, F_t) - \phi(x; \theta_t, F_0) \} \, p(x; \theta_0, F_0) \, dx = - \int \phi(x; \theta_0, F_0) \, \psi(x; \theta_0, F_0) \, p(x; \theta_0, F_0) \, dx \; \{ t^{-1} ( F_t - F_0 ) \} + O \left\{ t^{-1} ( \| \theta_t - \theta_0 \| + \| F_t - F_0 \| ) ( \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \hat{\eta}_{\theta_t,F_0} \| ) \right\} + o \left( 1 + \| \theta_t - \theta_0 \| + \| F_t - F_0 \| + \| \hat{\eta}_{\theta_t,F_t} - \eta_0 \| \right).
Equation (20) follows from this.  □
Using the result in Theorem 1, we show the asymptotic normality of the profile likelihood MLE:
Theorem 2.
(Asymptotic normality of the profile likelihood MLE) Suppose assumptions (R1)–(R4) hold. If the profile likelihood MLE (described in Section 2.3) is consistent, then it is an asymptotically linear estimator of θ_0:
\sqrt{n} ( \hat{\theta}_n - \theta_0 ) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde{I}_0^{-1} \tilde{\ell}_0(X_i) + o_P(1).
Hence we have that
\sqrt{n} ( \hat{\theta}_n - \theta_0 ) \overset{d}{\longrightarrow} N( 0, \tilde{I}_0^{-1} ) \quad \text{as } n \to \infty.
Proof. 
The profile likelihood MLE described in Section 2.3 is a solution to the estimating equation
\sum_{i=1}^{n} \phi(X_i; \hat{\theta}_n, F_n) = 0,
where φ(x; θ, F) is the profile likelihood score function for θ defined in (11).
In (R4) we assumed that C_ρ and H are Donsker and that the function φ̃(x; θ, F, η) is Lipschitz in the parameters (θ, F, η) with a P_0-square integrable function M(x) given in (17). By Corollary 2.10.13 in [16], the class { φ̃(x; θ, F, η) : (θ, F, η) ∈ Θ × C_ρ × H } is Donsker. Moreover, we assumed θ̂_n − θ_0 = o_P(1). In (R2), we assumed n^{1/2}(F_n − F_0) = O_P(1) and η̂_{θ̂_n,F_n} − η_0 = o_P(1). By the dominated convergence theorem with (17), we have E[( φ̃(X; θ̂_n, F_n, η̂_{θ̂_n,F_n}) − φ̃(X; θ_0, F_0, η_0) )^2] = o_P(1).
By Lemma 19.24 in [14], it follows that
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \tilde{\phi}(X_i; \hat{\theta}_n, F_n, \hat{\eta}_{\hat{\theta}_n,F_n}) - \tilde{\phi}(X_i; \theta_0, F_0, \eta_0) \right\} = \sqrt{n} \, E \left\{ \tilde{\phi}(X; \hat{\theta}_n, F_n, \hat{\eta}_{\hat{\theta}_n,F_n}) - \tilde{\phi}(X; \theta_0, F_0, \eta_0) \right\} + o_P(1).
Using (16), this is equivalent to
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \phi(X_i; \hat{\theta}_n, F_n) - \phi(X_i; \theta_0, F_0) \right\} = \sqrt{n} \, E \left\{ \phi(X; \hat{\theta}_n, F_n) - \phi(X; \theta_0, F_0) \right\} + o_P(1).
From (19) it follows that
\sqrt{n} \, E \left\{ \phi(X; \hat{\theta}_n, F_0) - \phi(X; \theta_0, F_0) \right\} = - \tilde{I}_0 \sqrt{n} ( \hat{\theta}_n - \theta_0 ) + o_P(1),
where Ĩ_0 = E[ℓ̃_0 ℓ̃_0^T] = E{ φ(X; θ_0, F_0) φ^T(X; θ_0, F_0) }.
Using (20),
\sqrt{n} \, E \left\{ \phi(X; \hat{\theta}_n, F_n) - \phi(X; \hat{\theta}_n, F_0) \right\} = - E [ \phi(X; \theta_0, F_0) \, \psi(X; \theta_0, F_0) ] \{ \sqrt{n} ( F_n - F_0 ) \} + O \left\{ \sqrt{n} ( \| \hat{\theta}_n - \theta_0 \| + \| F_n - F_0 \| ) ( \| F_n - F_0 \| + \| \hat{\eta}_{\hat{\theta}_n,F_n} - \eta_0 \| ) \right\} + o \left( 1 + \| \hat{\theta}_n - \theta_0 \| + \| F_n - F_0 \| + \| \hat{\eta}_{\hat{\theta}_n,F_n} - \eta_0 \| \right) = o_P ( 1 + \sqrt{n} \| \hat{\theta}_n - \theta_0 \| ),
where we used:
  • Since ψ(x; θ_0, F_0) is in the nuisance tangent space and φ(x; θ_0, F_0) is the efficient score function, we have
    E [ \phi(X; \theta_0, F_0) \, \psi(X; \theta_0, F_0) ] = 0 .
  • Using the consistency of θ̂_n with the assumptions n^{1/2}(F_n − F_0) = O_P(1) and η̂_{θ̂_n,F_n} − η_0 = o_P(1) in (R2), it follows that
    O \{ \sqrt{n} ( \| \hat{\theta}_n - \theta_0 \| + \| F_n - F_0 \| ) ( \| F_n - F_0 \| + \| \hat{\eta}_{\hat{\theta}_n,F_n} - \eta_0 \| ) \} = o_P ( 1 + \sqrt{n} \| \hat{\theta}_n - \theta_0 \| ) \quad \text{and} \quad o ( 1 + \| \hat{\theta}_n - \theta_0 \| + \| F_n - F_0 \| + \| \hat{\eta}_{\hat{\theta}_n,F_n} - \eta_0 \| ) = o_P(1).
Using (30) and (31), the right-hand side of (29) is
\sqrt{n} \, E \left\{ \phi(X; \hat{\theta}_n, F_n) - \phi(X; \theta_0, F_0) \right\} = \sqrt{n} \, E \left\{ \phi(X; \hat{\theta}_n, F_0) - \phi(X; \theta_0, F_0) \right\} + \sqrt{n} \, E \left\{ \phi(X; \hat{\theta}_n, F_n) - \phi(X; \hat{\theta}_n, F_0) \right\} = - \tilde{I}_0 \sqrt{n} ( \hat{\theta}_n - \theta_0 ) + o_P \{ 1 + \sqrt{n} \| \hat{\theta}_n - \theta_0 \| \}.
Finally, (29) together with (33) and n^{-1/2} \sum_{i=1}^{n} \phi(X_i; \hat{\theta}_n, F_n) = 0 imply that
\sqrt{n} ( \hat{\theta}_n - \theta_0 ) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde{I}_0^{-1} \phi(X_i; \theta_0, F_0) + o_P(1).
 □
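In practice, Theorem 2 suggests a plug-in SE computed from the empirical profile likelihood scores. A minimal sketch under the assumption that the user supplies the n × m matrix of per-observation scores φ(X_i; θ̂_n, F_n) (rows approximately mean-zero at the MLE):

```python
import numpy as np

def profile_likelihood_se(scores):
    """Standard errors of the profile likelihood MLE via Theorem 2.

    scores: (n, m) array whose i-th row is phi(X_i; theta_hat, F_n).
    """
    n = scores.shape[0]
    info = scores.T @ scores / n      # estimate of I_0 = E[phi phi^T]
    cov = np.linalg.inv(info) / n     # asymptotic covariance of theta_hat
    return np.sqrt(np.diag(cov))      # SEs of the components of theta_hat
```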

3. Joint Mixture Model of Survival and Longitudinal Ordered Data

In [17], we treated “the joint model of ordinal responses and the proportional hazards model with a finite mixture” with real-data and simulated-data examples. In that paper, we applied a method of calculating the SE of the profile likelihood MLE to the examples; however, the method was not justified with proofs. This is the motivation for writing the present paper. In this section, we use Theorems 1 and 2 to show the asymptotic normality of the profile likelihood MLE in the model and to prove the method of calculating the SE used in [17]. Please see [17] for the real-data and simulated-data examples.
Ordinal Response Models:
Let Y_{ijm} be the ordered categorical response, from 1 (poor) to L (excellent), on item (or question) j for subject i at the m-th protocol-specified time point, where i = 1, 2, …, n, j = 1, 2, …, J and m = 1, 2, …, M. In total, there are J items in the questionnaire related to patients' quality of life, collected at times t_1, t_2, …, t_M. Given that subject i belongs to group r, an ordered stereotype model can be written as
\log \frac{P( Y_{ijm} = \ell \mid \theta_r )}{P( Y_{ijm} = 1 \mid \theta_r )} = a_\ell + \phi_\ell ( b_j + \theta_r ), \quad r = 1, \dots, R,
where a_ℓ is a response-level intercept parameter with ℓ = 2, …, L, b_j is an item effect, and θ_r is associated with the discrete latent variable, with a_1 = 0, b_1 = 0, φ_1 = 0 and θ_1 = 0. The parameter θ_r can be referred to as a group effect of the quality of life for patients in group r. However, the group memberships are unknown. The {φ_ℓ} parameters can be regarded as unknown scores for the outcome categories. Because φ_ℓ (b_j + θ_r) = (A φ_ℓ) ((b_j + θ_r)/A) for any constant A ≠ 0, for identifiability we need to impose monotone scores on {φ_ℓ} to treat Y_{ijm} as ordinal. Therefore, the model has the constraint 0 = φ_1 ≤ φ_2 ≤ ⋯ ≤ φ_L = 1. The ordinal response part of the likelihood function for the i-th subject is
P( Y_i \mid \theta_r, \alpha ) = \prod_{m=1}^{M_i} \prod_{j=1}^{J} \prod_{\ell=1}^{L} \left[ \frac{\exp( a_\ell + \phi_\ell ( b_j + \theta_r ) )}{1 + \sum_{k=2}^{L} \exp( a_k + \phi_k ( b_j + \theta_r ) )} \right]^{1\{ Y_{ijm} = \ell \}},
where α = (a, b, φ). Each follow-up time point may have a different number of observations because some patient responses are missing.
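For intuition, a small sketch of the per-cell category probabilities in the stereotype model (hypothetical inputs; category ℓ = 1 is the baseline, so the first entries of a and phi are zero):

```python
import numpy as np

def stereotype_probs(a, phi, b_j, theta_r):
    """P(Y_ijm = l | theta_r) for l = 1, ..., L in the stereotype model.

    a, phi: length-L arrays with a[0] = phi[0] = 0 (baseline category);
    b_j: item effect; theta_r: group effect.
    """
    logits = a + phi * (b_j + theta_r)  # log-odds vs. category 1; zero at l = 1
    num = np.exp(logits)
    return num / num.sum()  # denominator is 1 + sum_{k>=2} exp(a_k + phi_k(...))
```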
The Cox Proportional Hazards Model:
We consider the Cox proportional hazards model for the survival part of the joint model. Let X be a time-independent covariate. The hazard function for the failure time T_i of the i-th subject is of the form
\lambda( t \mid X_i, \theta_r, \delta ) = \lambda_0(t) \exp( \theta_r \delta_0 + X_i \delta_1 ),
where λ_0(t) is the baseline hazard function. The latent variable θ_r is linked with the ordinal response model, and δ = (δ_0, δ_1) are coefficients.
For the estimation of the baseline hazard function λ_0(t), we use the method of nonparametric maximum likelihood described in [18] (Section 4.3). Let λ_i be the hazard at time t_i, where t_1 < t_2 < ⋯ < t_n are the ordered observed times. Assume that the hazard is zero between adjacent times, so that the survival time is discrete. The corresponding cumulative hazard function Λ_0(t_i) = \sum_{p \le i} \lambda_p is a step function with jumps at the failure times t_i. Then the survival part of the likelihood function of subject i is
P( T_i, d_i \mid \lambda, \theta_r, \delta ) = \left[ \lambda_i \exp( \theta_r \delta_0 + X_i \delta_1 ) \right]^{d_i} \times \exp \left[ - \sum_{p \le i} \lambda_p \exp( \theta_r \delta_0 + X_i \delta_1 ) \right],
where d_i is a censoring indicator for individual i: if we observe the failure time, then d_i = 1; otherwise d_i = 0.
The Full Likelihood Function:
The joint likelihood function is obtained by combining the probability functions from the ordinal response model (34) and the proportional hazards model (36), assuming the two models are independent given the latent discrete random variables.
Let π_r be the unknown probability (r = 1, …, R) that a subject lies in group r, and let (Θ, λ) = ((θ, α, δ), λ) be all the unknown parameters of the joint model. The mixture model likelihood function is
L( \Theta, \lambda \mid Y, T, D ) = \prod_{i=1}^{n} \sum_{r=1}^{R} P( Y_i \mid \theta_r, \alpha ) \, P( T_i, d_i \mid \lambda, \theta_r, \delta ) \, \pi_r.
Let Z_{ir} be the group indicator, where Z_{ir} = 1 if the i-th individual comes from the r-th group and 0 otherwise. The complete-data likelihood can be written as
L( \Theta, \lambda \mid Y, T, d, Z ) = \prod_{i=1}^{n} \prod_{r=1}^{R} \left[ P( Y_i \mid \theta_r, \alpha ) \, P( T_i, d_i \mid \lambda, \theta_r, \delta ) \, \pi_r \right]^{Z_{ir}}.
The expected complete-data log-likelihood under q(Z) = P(Z | Y, T, d) is
\sum_{Z} q(Z) \log L( \Theta, \lambda \mid Y, T, d, Z ) = \sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \left[ \log \pi_r + \log P( Y_i \mid \theta_r, \alpha ) + \log P( T_i, d_i \mid \lambda, \theta_r, \delta ) \right],
where γ(Z_{ir}), P(Y_i | θ_r, α) and P(T_i, d_i | λ, θ_r, δ) are defined in Equations (42), (34) and (36), respectively.
To estimate all parameters and the baseline hazards simultaneously, we combine the EM algorithm and the method of nonparametric maximum likelihood.

3.1. Estimation Procedure: Profile Likelihood with EM Algorithm

Baseline Hazard Estimation:
Before starting the EM iterations, we profile out the baseline hazard function λ_0(t). The survival part of Equation (39) can be separately maximized with respect to λ:
\sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \log P( T_i, d_i \mid \lambda, \theta_r, \delta ) = \sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \left[ d_i ( \log \lambda_i + \theta_r \delta_0 + X_i \delta_1 ) - \sum_{p \le i} \lambda_p \exp( \theta_r \delta_0 + X_i \delta_1 ) \right].
By solving \frac{\partial}{\partial \lambda_l} \sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \log P( T_i, d_i \mid \lambda, \theta_r, \delta ) = 0, l = 1, …, n, we find the maximizer λ̂_l of (40), holding (θ, δ) fixed; it is given by
\hat{\lambda}_l( \theta, \delta ) = \frac{d_l}{\sum_{p \ge l} \sum_{r=1}^{R} \gamma(Z_{pr}) \exp( \theta_r \delta_0 + X_p \delta_1 )}.
Denote λ̂(θ, δ) = (λ̂_1(θ, δ), …, λ̂_n(θ, δ)).
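A sketch of this profiling step, a Breslow-type estimator weighted by the group responsibilities (assuming subjects are sorted by increasing observed time and gamma is the n × R matrix of γ(Z_ir)):

```python
import numpy as np

def baseline_hazard(d, X, gamma, theta, delta0, delta1):
    """Profiled-out discrete baseline hazards lambda_l-hat(theta, delta).

    d: (n,) event indicators; X: (n,) covariate; gamma: (n, R)
    responsibilities; theta: (R,) group effects; delta0, delta1: scalars.
    """
    # risk score of subject p: sum_r gamma_pr exp(theta_r*delta0 + X_p*delta1)
    risk = (gamma * np.exp(theta[None, :] * delta0
                           + X[:, None] * delta1)).sum(axis=1)
    risk_set = np.cumsum(risk[::-1])[::-1]  # sums over the risk sets {p : p >= l}
    return d / risk_set
```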
The E-step:
In the E-step, we use the current parameter estimates Θ = (θ, α, δ) to find the expected values of Z_{ir}:
\gamma(Z_{ir}) = E [ Z_{ir} \mid Y_i, T_i, d_i ] = \frac{\pi_r \, P( Y_i \mid \theta_r, \alpha ) \, P( T_i, d_i \mid \hat{\lambda}(\theta, \delta), \theta_r, \delta )}{\sum_{g=1}^{R} \pi_g \, P( Y_i \mid \theta_g, \alpha ) \, P( T_i, d_i \mid \hat{\lambda}(\theta, \delta), \theta_g, \delta )}.
The M-step:
In the M-step, we maximize Equation (39) with respect to π_r and Θ = (θ, α, δ). Since there is no relationship between π_r and Θ, they can be estimated separately.
  • Calculate the estimates of π_r:
    \hat{\pi}_r = \frac{1}{n} \sum_{i=1}^{n} \gamma(Z_{ir}).
  • We maximize the second and third parts of Equation (39) (with λ̂(θ, δ) in place of λ),
    \sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \left[ \log P( Y_i \mid \theta_r, \alpha ) + \log P( T_i, d_i \mid \hat{\lambda}(\theta, \delta), \theta_r, \delta ) \right],
    with respect to Θ = (θ, α, δ) to obtain Θ̂.
The estimated parameters from the M-step are returned to the E-step until the value of Θ̂ converges.

3.2. Asymptotic Normality of the Profile Likelihood MLE and Its Asymptotic Variance

From (41), an estimator of the cumulative hazard function in counting process notation is
\hat{\Lambda}(t) = \int_0^t \frac{\sum_{i=1}^{n} dN_i(u)}{\sum_{i=1}^{n} Y_i(u) \sum_{r=1}^{R} \gamma(Z_{ir}) \exp( \theta_r \delta_0 + X_i \delta_1 )},
where N_i(u) = 1{ T_i ≤ u, d_i = 1 } and Y_i(u) = 1{ T_i ≥ u }.
Let us denote E_{F_n} f = ∫ f dF_n. Then the above Λ̂(t) can be written as
\hat{\Lambda}(t; \Theta, F_n) = \int_0^t \frac{E_{F_n} \, dN(u)}{E_{F_n} \left[ Y(u) \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \right]},
where N(u) = 1{ T ≤ u, d = 1 } and Y(u) = 1{ T ≥ u }.
Note: From now on, we deal with the cumulative hazard function Λ(t) = ∫_0^t λ(s) ds instead of the hazard function λ(t). The function λ̂(θ, δ) in (42) and (43) will be replaced by Λ̂(Θ, F_n).
Note: The function Λ̂ in (44) depends on γ(Z_r). On the other hand, the function γ(Z_r) in (42) depends on Λ̂. This circular relationship shows that the function Λ̂ in (44) is an implicit function.
Equation (43) gives the profile likelihood function for Θ = (θ, α, δ). The log-profile likelihood for one observation is
\log P( Y_i, T_i, d_i \mid \Theta, F_n ) = \sum_{r=1}^{R} \gamma(Z_{ir}) \left[ \log P( Y_i \mid \theta_r, \alpha ) + \log P( T_i, d_i \mid \hat{\Lambda}(\Theta, F_n), \theta_r, \delta ) \right],
where
\sum_{r=1}^{R} \gamma(Z_{ir}) \log P( Y_i \mid \theta_r, \alpha ) = \sum_{r=1}^{R} \sum_{m=1}^{M_i} \sum_{j=1}^{J} \sum_{\ell=1}^{L} \gamma(Z_{ir}) \, 1\{ Y_{ijm} = \ell \} \left[ a_\ell + \phi_\ell ( b_j + \theta_r ) - \log \left( 1 + \sum_{k=2}^{L} \exp( a_k + \phi_k ( b_j + \theta_r ) ) \right) \right],
and
\sum_{r=1}^{R} \gamma(Z_{ir}) \log P( T_i, d_i \mid \hat{\Lambda}(\Theta, F_n), \theta_r, \delta ) = \sum_{r=1}^{R} \gamma(Z_{ir}) \left[ d_i \left( \log \frac{E_{F_n} \, dN(T_i)}{E_{F_n} \left[ Y(T_i) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right]} + \theta_r \delta_0 + X_i \delta_1 \right) - \exp( \theta_r \delta_0 + X_i \delta_1 ) \int_0^{T_i} \frac{E_{F_n} \, dN(u)}{E_{F_n} \left[ Y(u) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right]} \right].
In the above log-likelihood we set a_1 = b_1 = φ_1 = θ_1 = 0.
Score functions:
The score functions for the profile likelihood are
\phi( Y_i, T_i, d_i \mid \Theta, F_n ) = \phi_O( Y_i \mid \Theta ) + \phi_S( T_i, d_i \mid \Theta, F_n ) = \sum_{r=1}^{R} \gamma(Z_{ir}) \frac{\partial}{\partial \Theta} \log P( Y_i \mid \theta_r, \alpha ) + \sum_{r=1}^{R} \gamma(Z_{ir}) \frac{\partial}{\partial \Theta} \log P( T_i, d_i \mid \hat{\Lambda}(\Theta, F_n), \theta_r, \delta ),
\psi( Y_i, T_i, d_i \mid \Theta, F_n ) = \sum_{r=1}^{R} \gamma(Z_{ir}) \, d_F \log P( T_i, d_i \mid \hat{\Lambda}(\Theta, F_n), \theta_r, \delta ).
Here all derivatives are calculated treating γ(Z_{ir}) as constant. We call φ_O the score function for the ordinal response model and φ_S that for the survival model.
Theorem 3.
(The efficient score function) We drop the subscript i in Equation (48). We have the following: at the true value of (Θ, F),
1. 
Λ̂(t; Θ, F) = Λ(t), the true cumulative hazard function, and
2. 
the score function φ(Y, T, d | Θ, F) defined in (48) is the efficient score function in the model.
The proof is given in Appendix A.
Theorem 4.
(Asymptotic normality of the profile likelihood MLE) We assume the following:
(A1)
The maximal right-censoring time τ > 0 is finite and satisfies S(τ) = P(T > τ) > 0.
(A2)
The covariate X is bounded and the parameter Θ = (θ, α, δ) lies in a compact set. This implies that, for some 0 < M < ∞, we have |X| ≤ M and ‖Θ‖ ≤ M.
(A3)
The empirical cdf F_n is √n-consistent: √n(F_n − F_0) = O_P(1).
(A4)
The efficient information matrix Ĩ is invertible.
Then the profile likelihood MLE Θ̂_n obtained by the procedure described in Section 3.1 is asymptotically normal:
\sqrt{n} ( \hat{\Theta}_n - \Theta_0 ) \overset{d}{\longrightarrow} N( 0, \tilde{I}^{-1} ),
where Ĩ = E(φφ^T) is the efficient information matrix, with φ defined in (48).
Proof. 
We check conditions (R1)–(R4) in Section 2.4 so that Theorems 1 and 2 can be used to obtain the result stated in the theorem.
Since the ordinal response part is a parametric model, we mainly discuss the survival part of the model. The survival part of the profile log-likelihood for one observation is given in (47).
To express the survival part of the score function φ_S(T, d | Θ, F) in the form given in condition (R4), we introduce a few notations.
Let
\gamma( Z_r \mid \Theta, \Lambda ) = \frac{\pi_r \, P( Y \mid \theta_r, \alpha ) \, P( T, d \mid \Lambda, \theta_r, \delta )}{\sum_{g=1}^{R} \pi_g \, P( Y \mid \theta_g, \alpha ) \, P( T, d \mid \Lambda, \theta_g, \delta )}.
The function γ(Z_r | Θ, Λ) is differentiable with respect to Θ and Λ. The function γ(Z_r) in (42) can then be expressed as
\gamma(Z_r) = \gamma( Z_r \mid \Theta, \hat{\Lambda}(\Theta, F) ).
Let
M_0( t \mid \Theta, F, \Lambda ) = E_F \left[ Y(t) \sum_{r=1}^{R} \gamma( Z_r \mid \Theta, \Lambda ) \exp( \theta_r \delta_0 + X \delta_1 ) \right],
M_1( t \mid \Theta, F, \Lambda ) = E_F \left[ Y(t) \sum_{r=1}^{R} \gamma( Z_r \mid \Theta, \Lambda ) \, ( \delta_0, \theta_r, X )^T \exp( \theta_r \delta_0 + X \delta_1 ) \right].
Then the score function for the survival part φ_S(T, d | Θ, F) is
\phi_S( T, d \mid \Theta, F ) = \tilde{\phi}_S( T, d \mid \Theta, F, \hat{\Lambda}(\Theta, F) ) = \sum_{r=1}^{R} \gamma( Z_r \mid \Theta, \hat{\Lambda}(\Theta, F) ) \left[ d \left( ( \delta_0, \theta_r, X )^T - \frac{M_1( T \mid \Theta, F, \hat{\Lambda}(\Theta, F) )}{M_0( T \mid \Theta, F, \hat{\Lambda}(\Theta, F) )} \right) - \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^{T} \left( ( \delta_0, \theta_r, X )^T - \frac{M_1( u \mid \Theta, F, \hat{\Lambda}(\Theta, F) )}{M_0( u \mid \Theta, F, \hat{\Lambda}(\Theta, F) )} \right) \frac{E_F \, dN(u)}{M_0( u \mid \Theta, F, \hat{\Lambda}(\Theta, F) )} \right].
We will check condition (R4) using the function defined by
\tilde{\phi}_S( T, d \mid \Theta, F, \Lambda ) = \sum_{r=1}^{R} \gamma( Z_r \mid \Theta, \Lambda ) \left[ d \left( ( \delta_0, \theta_r, X )^T - \frac{M_1( T \mid \Theta, F, \Lambda )}{M_0( T \mid \Theta, F, \Lambda )} \right) - \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^{T} \left( ( \delta_0, \theta_r, X )^T - \frac{M_1( u \mid \Theta, F, \Lambda )}{M_0( u \mid \Theta, F, \Lambda )} \right) \frac{E_F \, dN(u)}{M_0( u \mid \Theta, F, \Lambda )} \right].
Condition (R1): Under assumptions (A1) and (A2), a straightforward calculation shows that the probability functions P(T, d | λ, θ_r, δ) in (36) and P(Y | θ_r, α) in (34) are bounded away from zero.
We calculated the survival part score function φ_S(T, d | Θ, F) = φ̃_S(T, d | Θ, F, Λ̂(Θ, F)) in (50). The ordinal response part is a parametric model; it is differentiable with respect to the parameter Θ (we omit the calculation).
We calculate the score function ψ(T, d | Θ, F) = \sum_{r=1}^{R} γ(Z_r) d_F log P(T, d | Λ̂(Θ, F), θ_r, δ). For an integrable function h with the same domain as the cdfs F (writing E_h f = ∫ f dh),
\psi( Y, T, d \mid \Theta, F ) h = \sum_{r=1}^{R} \gamma(Z_r) \, d_F \log P( T, d \mid \hat{\Lambda}(\Theta, F), \theta_r, \delta ) \, h = \sum_{r=1}^{R} \gamma(Z_r) \left[ d \left( \frac{E_h \, dN(T)}{E_F \, dN(T)} - \frac{E_h \left[ Y(T) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right]}{E_F \left[ Y(T) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right]} \right) - \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^{T} \frac{E_h \, dN(u)}{E_F \left[ Y(u) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right]} + \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^{T} \frac{E_F \, dN(u) \, E_h \left[ Y(u) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right]}{\left( E_F \left[ Y(u) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) \right] \right)^2} \right].
Condition (R2): In (A3), we assumed √n(F_n − F_0) = O_P(1) (the √n-consistency of F_n). In Appendix C, we show the consistency of Θ̂_n and Λ̂(Θ̂_n, F_n): Θ̂_n − Θ_0 = o_P(1) and Λ̂(Θ̂_n, F_n) − Λ_0 = o_P(1). In Theorem 3 we verified the rest of the conditions in (R2).
Condition (R3): In (A4), we assumed that the efficient information matrix Ĩ is invertible.
Condition (R4): The score function φ̃_S(T, d | Θ, F, Λ) given in (51) is differentiable with respect to the parameters (Θ, F, Λ) (under assumptions (A1) and (A2), the derivative is a bounded function). Condition (R4) follows.  □

4. Conclusions

We proposed a “statistical generalized derivative” to deal with an implicit function that appears in profile likelihood estimation in semiparametric models. Our leading example is the joint model of ordinal responses and the proportional hazards model with a finite mixture treated in this paper. Using the statistical generalized derivative, we expanded the profile likelihood function and showed the asymptotic normality of the profile likelihood MLE. Moreover, this approach enabled us to express the SE of the estimator in terms of the profile likelihood score function. This contributes to profile likelihood estimation methods where, otherwise, no direct expansion of the profile likelihood is possible. Our approach can be applied not only to the example in this paper, but also to many models with an implicit function in the profile likelihood estimation. As a next step, we are working on applying our method to such examples.

Author Contributions

I.L. proposed the project and constructed the model in the example section of the paper. Y.H. was motivated to obtain an estimate of the SE of the profile likelihood estimator in the example. The proofs in the paper are due to Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

Ivy Liu’s work was supported by the Marsden Fund (Award Number E2987-3648) administered by the Royal Society of New Zealand.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 3 (The Efficient Score Function)

Proof. 
For ease of notation, we denote the true values of the parameters by (Θ, F, Λ). From (44), replacing F_n by F, we have
\hat{\Lambda}( t; \Theta, F ) = \int_0^t \frac{E \, dN(u)}{E \left[ Y(u) \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \right]},
where E is the expectation with respect to the true distribution F. Since, at the true value of the parameters (Θ, F, Λ),
E \, dN(u) = E \left[ Y(u) \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \right] d\Lambda(u),
we have that Λ̂(t; Θ, F) = Λ(t).
The score function φ(Y, T, d | Θ, F) = φ_O(Y | Θ) + φ_S(T, d | Θ, F) in (48) has two parts: the score function for the ordinal response model, φ_O(Y | Θ), and the score function for the survival model, φ_S(T, d | Θ, F). Since the score function for the ordinal response model does not involve the parameter Λ, we only work on the survival part of the score function.
We treat γ(Z_r) as constant with respect to the parameters.
Let
M_1(t) = E \left[ \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \exp( \theta_r \delta_0 + X \delta_1 ) \, I( t \le T ) \right],
M_0(t) = E \left[ \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \, I( t \le T ) \right].
Then the score function in the survival part of the model at the true values of the parameters Θ and F is
\phi_S( T, d \mid \Theta, F ) = \sum_{r=1}^{R} \gamma(Z_r) \frac{\partial}{\partial \Theta} \log P( T, d \mid \hat{\Lambda}(\Theta, F), \theta_r, \delta ) = \sum_{r=1}^{R} \gamma(Z_r) \frac{\partial}{\partial \Theta} \left[ d \left( \log \frac{E \, dN(T)}{E [ Y(T) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) ]} + \theta_r \delta_0 + X \delta_1 \right) - \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^T \frac{E \, dN(u)}{E [ Y(u) \sum_{r'=1}^{R} \gamma(Z_{r'}) \exp( \theta_{r'} \delta_0 + X \delta_1 ) ]} \right] = \sum_{r=1}^{R} \gamma(Z_r) \, d \left( ( \delta_0, \theta_r, X )^T - \frac{M_1(T)}{M_0(T)} \right) - \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^T \left( ( \delta_0, \theta_r, X )^T - \frac{M_1(u)}{M_0(u)} \right) d\Lambda(u),
where we used Equation (A1). The last expression is the efficient score function in the survival part of the model derived in Equation (A4) of Appendix B.  □

Appendix B. Derivation of Efficient Score Function in the Joint Model

In this appendix, we derive the efficient score function in the joint model using (8). We denote P_{r,Θ,Λ}(T, d) = P(T, d | Λ, θ_r, δ). Again, the true values of the parameters are denoted by (Θ, F, Λ).
The survival part of the log-likelihood function for one observation is
\sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda}( T, d ) = \sum_{r=1}^{R} \gamma(Z_r) \left[ d \left( \log \lambda(T) + \theta_r \delta_0 + X \delta_1 \right) - \Lambda(T) \exp( \theta_r \delta_0 + X \delta_1 ) \right].
The score function for Θ is
\dot{\ell}_{\Theta,\Lambda} = \sum_{r=1}^{R} \gamma(Z_r) \frac{\partial}{\partial \Theta} \log P_{r,\Theta,\Lambda}( T, d ) = \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \left[ d - \Lambda(T) \exp( \theta_r \delta_0 + X \delta_1 ) \right].
Let h : [0, τ] → R be a function on [0, τ]. The path defined by
d\Lambda_s = ( 1 + s h ) \, d\Lambda
is a submodel passing through Λ at s = 0. The corresponding path for λ is
\lambda_s(t) = \frac{d\Lambda_s(t)}{dt} = ( 1 + s h(t) ) \, \lambda(t).
The derivative of the log-likelihood function
\sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda_s}( T, d ) = \sum_{r=1}^{R} \gamma(Z_r) \left[ d \left( \log \lambda_s(T) + \theta_r \delta_0 + X \delta_1 \right) - \Lambda_s(T) \exp( \theta_r \delta_0 + X \delta_1 ) \right]
with respect to s at s = 0 is the score operator for Λ:
B_{\Theta,\Lambda} h = \frac{d}{ds} \Big|_{s=0} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda_s}( T, d ) = \sum_{r=1}^{R} \gamma(Z_r) \left[ d \, h(T) - \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^T h(u) \, d\Lambda(u) \right].
The information operator B_{\Theta,\Lambda}^{*} B_{\Theta,\Lambda}:
For functions g, h : [0, τ] → R, define the paths d\Lambda_{s,t} = ( 1 + s g + t h + s t g h ) \, d\Lambda. Then
\frac{\partial}{\partial s} \Big|_{(s,t)=(0,0)} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda_{s,t}}( T, d ) = B_{\Theta,\Lambda} g, \qquad \frac{\partial}{\partial t} \Big|_{(s,t)=(0,0)} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda_{s,t}}( T, d ) = B_{\Theta,\Lambda} h.
Using these, we have
\langle B_{\Theta,\Lambda} g, B_{\Theta,\Lambda} h \rangle_{L_2(P)} = E \{ ( B_{\Theta,\Lambda} g )( B_{\Theta,\Lambda} h ) \} = - E \left[ \frac{\partial^2}{\partial t \, \partial s} \Big|_{(s,t)=(0,0)} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda_{s,t}}( T, d ) \right] = - E \left[ \frac{\partial}{\partial t} \Big|_{t=0} B_{\Theta,\Lambda_{0,t}} g \right] = E \left[ \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^\tau I( u \le T ) g(u) h(u) \, d\Lambda(u) \right] = \int_0^\tau h(u) \, E \left[ \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) I( u \le T ) \right] g(u) \, d\Lambda(u) = \left\langle E \left[ \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) I( u \le T ) \right] g, \, h \right\rangle_{L_2(\Lambda)}.
Since
\langle B_{\Theta,\Lambda} g, B_{\Theta,\Lambda} h \rangle_{L_2(P)} = \langle B_{\Theta,\Lambda}^{*} B_{\Theta,\Lambda} g, h \rangle_{L_2(\Lambda)},
we have the information operator
B_{\Theta,\Lambda}^{*} B_{\Theta,\Lambda} g = E \left[ \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) I( t \le T ) \right] g(t).
Since the operator is multiplication by a function, its inverse is
( B_{\Theta,\Lambda}^{*} B_{\Theta,\Lambda} )^{-1} g = \left\{ E \left[ \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) I( t \le T ) \right] \right\}^{-1} g(t).
Calculation of B_{\Theta,\Lambda}^{*} \dot{\ell}_{\Theta,\Lambda}:
Consider the paths (s, t) ↦ (Θ + s a, Λ_t) with d\Lambda_t = ( 1 + t h ) \, d\Lambda. Then
\frac{\partial}{\partial s} \Big|_{(s,t)=(0,0)} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta+sa,\Lambda_t}( T, d ) = a^T \dot{\ell}_{\Theta,\Lambda}, \qquad \frac{\partial}{\partial t} \Big|_{(s,t)=(0,0)} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta+sa,\Lambda_t}( T, d ) = B_{\Theta,\Lambda} h.
Using these, we compute that
\langle a^T \dot{\ell}_{\Theta,\Lambda}, B_{\Theta,\Lambda} h \rangle_{L_2(P)} = E \{ ( a^T \dot{\ell}_{\Theta,\Lambda} )( B_{\Theta,\Lambda} h ) \} = - E \left[ \frac{\partial^2}{\partial t \, \partial s} \Big|_{(s,t)=(0,0)} \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta+sa,\Lambda_t}( T, d ) \right] = - E \left[ \frac{\partial}{\partial t} \Big|_{t=0} a^T \dot{\ell}_{\Theta,\Lambda_t} \right] = a^T E \left[ \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^\tau I( u \le T ) h(u) \, d\Lambda(u) \right] = a^T \int_0^\tau E \left[ \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \exp( \theta_r \delta_0 + X \delta_1 ) I( u \le T ) \right] h(u) \, d\Lambda(u) = a^T \left\langle E \left[ \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \exp( \theta_r \delta_0 + X \delta_1 ) I( u \le T ) \right], \, h \right\rangle_{L_2(\Lambda)}.
Since
\langle a^T \dot{\ell}_{\Theta,\Lambda}, B_{\Theta,\Lambda} h \rangle_{L_2(P)} = a^T \langle B_{\Theta,\Lambda}^{*} \dot{\ell}_{\Theta,\Lambda}, h \rangle_{L_2(\Lambda)},
we have that
B_{\Theta,\Lambda}^{*} \dot{\ell}_{\Theta,\Lambda} = E \left[ \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \exp( \theta_r \delta_0 + X \delta_1 ) I( u \le T ) \right].
The efficient score function:
Then the efficient score function for the survival part of the model is given by
\tilde{\ell}_{\Theta,\Lambda} = \dot{\ell}_{\Theta,\Lambda} - B_{\Theta,\Lambda} ( B_{\Theta,\Lambda}^{*} B_{\Theta,\Lambda} )^{-1} B_{\Theta,\Lambda}^{*} \dot{\ell}_{\Theta,\Lambda} = \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \left[ d - \Lambda(T) \exp( \theta_r \delta_0 + X \delta_1 ) \right] - B_{\Theta,\Lambda} \left[ \frac{M_1(t)}{M_0(t)} \right] = \sum_{r=1}^{R} \gamma(Z_r) \, ( \delta_0, \theta_r, X )^T \left[ d - \Lambda(T) \exp( \theta_r \delta_0 + X \delta_1 ) \right] - \sum_{r=1}^{R} \gamma(Z_r) \left[ d \, \frac{M_1(T)}{M_0(T)} - \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^T \frac{M_1(u)}{M_0(u)} \, d\Lambda(u) \right] = \sum_{r=1}^{R} \gamma(Z_r) \, d \left( ( \delta_0, \theta_r, X )^T - \frac{M_1(T)}{M_0(T)} \right) - \sum_{r=1}^{R} \gamma(Z_r) \exp( \theta_r \delta_0 + X \delta_1 ) \int_0^T \left( ( \delta_0, \theta_r, X )^T - \frac{M_1(u)}{M_0(u)} \right) d\Lambda(u),
where M_1(t) and M_0(t) are defined in (A2).

Appendix C. Proof of Consistency

We outline the proof of the consistency of Θ̂_n and Λ̂(Θ̂_n, F_n). These consistencies were required in the verification of condition (R2) in Theorem 4.
We use the proof of Theorem 5.7 in [14] (page 45) and adopt the notation from that book.
Proof. 
Let us denote the log-likelihood function M_n(Θ, Λ) = P_n log P_{Θ,Λ} and the corresponding limit function M(Θ, Λ) = P_{Θ_0,Λ_0} log P_{Θ,Λ}. We denote the MLE and the true value of (Θ, Λ) by (Θ̂_n, Λ̂_n) and (Θ_0, Λ_0), respectively.
Then, by the construction of the profile likelihood function,
M_n( \hat{\Theta}_n, \hat{\Lambda}_n ) = M_n( \hat{\Theta}_n, \hat{\Lambda}( \hat{\Theta}_n, F_n ) )
and Λ̂_n = Λ̂(Θ̂_n, F_n). It follows that the consistency of the MLE (Θ̂_n, Λ̂_n) implies the consistency of (Θ̂_n, Λ̂(Θ̂_n, F_n)).
By (A2), Θ is bounded and Λ(t) lies in the set of monotone increasing functions, so the parameter (Θ, Λ) lies in a Donsker class. Since the function log P_{Θ,Λ} is differentiable in the parameter (Θ, Λ) with a square-integrable envelope function, the set { log P_{Θ,Λ} : (Θ, Λ) in the Donsker class } is Donsker as well (by Corollary 2.10.13 in [16]). It follows that
M_n( \hat{\Theta}_n, \hat{\Lambda}_n ) - M( \hat{\Theta}_n, \hat{\Lambda}_n ) = o_P(1).
By the property of the Kullback-Leibler divergence, we have
M( \Theta, \Lambda ) < M( \Theta_0, \Lambda_0 )
for all (Θ, Λ) ≠ (Θ_0, Λ_0).
By the definition of the MLE, we also have
M_n( \hat{\Theta}_n, \hat{\Lambda}_n ) \ge M_n( \Theta_0, \Lambda_0 ).
By the proof of Theorem 5.7 in [14] (page 45), the consistency of the MLE, (Θ̂_n, Λ̂_n) − (Θ_0, Λ_0) = o_P(1), follows.  □

References

  1. Hsieh, F.; Tseng, Y.K.; Wang, J.L. Joint modeling of survival and longitudinal data: Likelihood approach revisited. Biometrics 2006, 62, 1037–1043.
  2. Rizopoulos, D. Joint Models for Longitudinal and Time-to-Event Data with Applications in R; CRC Press: Boca Raton, FL, USA, 2012.
  3. Tsiatis, A.A.; Davidian, M. Joint modeling of longitudinal and time to event data: An overview. Stat. Sin. 2004, 14, 809–834.
  4. Wulfsohn, M.S.; Tsiatis, A.A. A joint model for survival and longitudinal data measured with error. Biometrics 1997, 53, 330–339.
  5. Henderson, R.; Diggle, P.; Dobson, A. Identification and efficiency of longitudinal markers for survival. Biostatistics 2002, 3, 33–50.
  6. Ratcliffe, S.J.; Guo, W.; Ten Have, T.R. Joint modeling of longitudinal and survival data via a common frailty. Biometrics 2004, 60, 892–899.
  7. Song, X.; Davidian, M.; Tsiatis, A.A. A semiparametric likelihood approach to joint modeling of longitudinal and time-to-event data. Biometrics 2002, 58, 742–753.
  8. Zeng, D.; Cai, J. Asymptotic results for maximum likelihood estimators in joint analysis of repeated measurements and survival time. Ann. Stat. 2005, 33, 2132–2163.
  9. Zeng, D.; Lin, D.Y. Maximum likelihood estimation in semiparametric regression models with censored data. J. R. Stat. Soc. Ser. B 2007, 69, 507–564.
  10. Zeng, D.; Lin, D.Y. Semiparametric transformation models with random effects for recurrent events. J. Am. Stat. Assoc. 2007, 102, 167–180.
  11. Zeng, D.; Lin, D.Y. A general asymptotic theory for maximum likelihood estimation in semiparametric regression models with censored data. Stat. Sin. 2010, 20, 871–910.
  12. Murphy, S.A.; van der Vaart, A.W. On profile likelihood (with discussion). J. Am. Stat. Assoc. 2000, 95, 449–485.
  13. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  14. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 1998.
  15. Kolmogorov, A.N.; Fomin, S.V. Introductory Real Analysis; Dover Publications: New York, NY, USA, 1975.
  16. van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes; Springer: Berlin/Heidelberg, Germany, 1996.
  17. Preedalikit, K.; Liu, I.; Hirose, Y.; Sibanda, N.; Fernández, D. Joint modeling of survival and longitudinal ordered data using a semiparametric approach. Aust. N. Z. J. Stat. 2016, 58, 153–172.
  18. Kalbfleisch, J.D.; Prentice, R.L. The Statistical Analysis of Failure Time Data, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2002.
