Statistical Generalized Derivative Applied to the Profile Likelihood Estimation in a Mixture of Semiparametric Models

There is a difficulty in estimating the standard error (SE) of the profile likelihood estimator in the joint model of longitudinal and survival data. The difficulty lies in the differentiation of an implicit function that appears in the profile likelihood estimation. We solve it by introducing the "statistical generalized derivative". The derivative is used to show the asymptotic normality of the estimator, with the SE expressed in terms of the profile likelihood score function.


Introduction
This paper proposes a method to show the asymptotic normality of the maximum profile likelihood estimator (profile likelihood MLE) in a mixture of semiparametric models with the EM-algorithm. We derive an expression for the standard error (SE) of the estimator using the profile likelihood score function. As an example, we consider a joint model of ordinal responses and the proportional hazards model with a finite mixture. Through this example, we demonstrate a solution to the theoretical challenge in joint models of survival and longitudinal data stated by [1]: "No distributional or asymptotic theory is available to date, and even the standard errors (SEs), defined as the standard deviations of the parametric estimators, are difficult to obtain." The crux of the problem is an implicit function that is difficult to differentiate. In the profile likelihood approach, we profile out the baseline hazard function by plugging an estimate of the hazard function into the likelihood function. This estimator of the hazard function is an implicit function in our problem (see also [2], p. 67).
Here, we review some of the related work. For a more complete review of joint models, please see [2] and [3]. The paper [4] proposed a profile likelihood approach to a joint model with unobserved random effects, applying the EM-algorithm for the estimation of the parameters in the model. This approach became one of the standard ones and has been adopted by many (for example, [5][6][7]). A series of studies ([8][9][10][11]) also adopted the profile likelihood approach of [4] and showed the asymptotic normality of the profile likelihood MLE; as a result, they obtained the SE of the estimator. Another well-known approach was proposed by [12], who used "an approximate least favorable submodel" to show the asymptotic normality of the profile likelihood MLE. The SE of the profile likelihood estimator was obtained from the asymptotic normality.
All of the existing work mentioned here avoided dealing with the implicit function in the profile likelihood. For example, the works [8][9][10][11] used the equality between the maximum likelihood estimator (MLE) and the profile likelihood MLE: they showed the asymptotic normality of the MLE in the joint models and used that result to deduce the asymptotic normality of the profile likelihood MLE. This indirect proof avoided differentiating the implicit function in the profile likelihood. In the case of [12], "an approximate least favorable submodel" was used to approximate the profile likelihood function, and the result was proved without differentiating the profile likelihood function directly.
In summary, the existing methods established the asymptotic normality of the profile likelihood MLE while avoiding the differentiation of the profile likelihood function. They showed that the variance of the profile likelihood MLE is the inverse of an efficient information matrix; however, they did not express the efficient information matrix in terms of the profile likelihood score function.
In this paper, we introduce the "statistical generalized derivative" to handle the differentiation of the implicit function in the profile likelihood under consideration. Our approach enables us to expand the profile likelihood function and to show the asymptotic normality of the profile likelihood MLE with the efficient information matrix expressed as the variance of the profile likelihood score function.
The results of this paper provide an analytical understanding of profile likelihood estimation and a method for computing the SE of the profile likelihood MLE in terms of the profile likelihood score function.

Introduction of Mixture Model and Notations
We consider a mixture of semiparametric models whose density is of the form
$$p(x; \theta, \eta, \pi) = \sum_{r=1}^{R} \pi_r\, p_r(x; \theta_r, \eta_r), \tag{1}$$
where for each $r = 1, \dots, R$, $p_r(x; \theta_r, \eta_r)$ is a semiparametric model with a finite-dimensional parameter $\theta_r \in \Theta_r \subset \mathbb{R}^{m_r}$ and an infinite-dimensional parameter $\eta_r \in H_r$, where $H_r$ is a subset of a Banach space $B_r$, and $\pi_1, \dots, \pi_R$ are mixture probabilities. We assume that $\pi_r > 0$ for each $r$ and $\sum_{r=1}^{R} \pi_r = 1$. We denote $\theta = (\theta_1, \dots, \theta_R)$, $\eta = (\eta_1, \dots, \eta_R)$ and $\pi = (\pi_1, \dots, \pi_R)$. Once we observe iid data $X_1, \dots, X_n$ from the mixture model, the joint probability function of the data $X = (X_1, \dots, X_n)$ is given by
$$p(X; \theta, \eta, \pi) = \prod_{i=1}^{n} \sum_{r=1}^{R} \pi_r\, p_r(X_i; \theta_r, \eta_r). \tag{2}$$
We consider $\theta$ to be the parameter of interest, and $\eta$ and $\pi$ nuisance parameters.
To discuss the EM-algorithm, we introduce further notation (following [13]). Let $Z_i = (Z_{i1}, \dots, Z_{iR})$ be a group indicator variable for subject $i$: for each $r$, $Z_{ir} \in \{0, 1\}$ with $P(Z_{ir} = 1) = \pi_r$, and $\sum_{r=1}^{R} Z_{ir} = 1$. Let $Z = (Z_1, \dots, Z_n)$. The joint probability function of the complete data $(X, Z)$ is
$$p(X, Z; \theta, \eta, \pi) = \prod_{i=1}^{n} \prod_{r=1}^{R} \big\{\pi_r\, p_r(X_i; \theta_r, \eta_r)\big\}^{Z_{ir}}. \tag{3}$$
Since the $Z_{ir}$ are unobserved, it is common to replace the complete data log-likelihood with its expected value. The expected complete data log-likelihood under $p(Z \mid X; \theta, \eta, \pi)$ is
$$\sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \big\{\log \pi_r + \log p_r(X_i; \theta_r, \eta_r)\big\}, \tag{4}$$
where
$$\gamma(Z_{ir}) = \frac{\pi_r\, p_r(X_i; \theta_r, \eta_r)}{\sum_{j=1}^{R} \pi_j\, p_j(X_i; \theta_j, \eta_j)}, \quad r = 1, \dots, R. \tag{5}$$

Introducing Profile Likelihood, the Efficient Score Function and the Efficient Information Matrix
The efficient score function and information matrix in the mixture model: The score function for $\theta$ and the score operator for $\eta$ in the mixture model given in (1) are, respectively,
$$\dot\ell(x; \theta, \eta) = \frac{\partial}{\partial \theta} \log p(x; \theta, \eta, \pi) = \Big( \gamma_r(z_r)\, \frac{\partial}{\partial \theta_r} \log p_r(x; \theta_r, \eta_r) \Big)_{r = 1, \dots, R} \tag{6}$$
and
$$B(x; \theta, \eta) h = d_\eta \log p(x; \theta, \eta, \pi)[h] = \sum_{r=1}^{R} \gamma_r(z_r)\, d_{\eta_r} \log p_r(x; \theta_r, \eta_r)[h_r], \tag{7}$$
where $\gamma_r(z_r)$ is given in (5) with $Z_{ir}$ replaced by $z_r$. The notation $d_\eta$ is the Hadamard derivative operator with respect to the parameter $\eta$.
Let $\theta_0, \eta_0$ be the true values of $\theta, \eta$, and denote $\dot\ell_0(x) = \dot\ell(x; \theta_0, \eta_0)$ and $B_0(x) = B(x; \theta_0, \eta_0)$. Then it follows from the standard theory ([14], page 374) that the efficient score function $\tilde\ell_0$ and the efficient information matrix $\tilde I_0$ in the semiparametric mixture model are given by
$$\tilde\ell_0 = \dot\ell_0 - B_0 (B_0^* B_0)^{-1} B_0^* \dot\ell_0, \qquad \tilde I_0 = E\big[\tilde\ell_0 \tilde\ell_0^T\big]. \tag{8}$$
Note: Equations (6) and (7) show that the score functions in the semiparametric mixture model (1) coincide with those for the expected complete data likelihood (4).
Introduction of the profile likelihood and its score functions: In the estimation of $(\theta, \eta)$, we use the profile likelihood approach. Let $F$ be a cdf which belongs to a set containing the empirical cdf $F_n$ and the true cdf $F_0$. In the profile likelihood approach, we obtain a function $(\theta, F) \mapsto \hat\eta_{\theta,F} = (\hat\eta_{1,\theta,F}, \dots, \hat\eta_{R,\theta,F})$ whose values are in the space of the parameter $\eta = (\eta_1, \dots, \eta_R)$. Then the profile likelihood function is defined by
$$\mathrm{pl}_n(\theta) = \prod_{i=1}^{n} p(X_i; \theta, F_n), \tag{9}$$
where $p(x; \theta, F)$ denotes the density with $\hat\eta_{\theta,F}$ plugged in,
$$p(x; \theta, F) = \sum_{r=1}^{R} \pi_r\, p_r(x; \theta_r, \hat\eta_{r,\theta,F}). \tag{10}$$
We also define the score functions for the profile likelihood in the model,
$$\varphi(x; \theta, F) = \frac{\partial}{\partial \theta} \log p(x; \theta, F) \tag{11}$$
and
$$\psi(x; \theta, F) = d_F \log p(x; \theta, F), \tag{12}$$
where, in the mixture form of these derivatives, $\gamma_r(z_r)$ is given in (5) with $Z_{ir}$ replaced by $z_r$ and $\eta_r$ by $\hat\eta_{r,\theta,F}$.
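Before proceeding, a toy numerical illustration of profiling may help. The sketch below (in Python; the model and all names are ours, a simple normal location model rather than the mixture above) profiles out a nuisance variance in closed form, so that $\hat\eta_{\theta,F_n}$ is explicit, and then maximizes the resulting profile log-likelihood on a grid:

```python
import numpy as np

# Toy illustration of profiling: for X_i ~ N(theta, eta), the nuisance eta is
# profiled out in closed form, eta-hat(theta) = mean((x - theta)^2), giving a
# profile log-likelihood that depends on theta alone.
def profile_loglik(theta, x):
    eta_hat = np.mean((x - theta) ** 2)   # inner maximizer eta-hat(theta)
    n = len(x)
    return -0.5 * n * (np.log(2 * np.pi * eta_hat) + 1.0)

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=200)
grid = np.linspace(2.0, 4.0, 201)
pl = [profile_loglik(t, x) for t in grid]
theta_hat = grid[int(np.argmax(pl))]      # profile likelihood MLE
print(theta_hat)
```

In this toy case the map $(\theta, F) \mapsto \hat\eta_{\theta,F}$ is explicit; the difficulty addressed in this paper arises precisely when it is only implicit.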

The EM-Algorithm to Obtain the Profile Likelihood MLE
Here, we describe the EM-algorithm applied to the profile likelihood function to obtain the profile likelihood MLE $\hat\theta$ of $\theta$. The E-step: In the E-step, we use the current parameter estimates $\hat\theta$ of $\theta$ to find the expected values of $Z_{ir}$:
$$\gamma(Z_{ir}) = \frac{\pi_r\, p_r(X_i; \hat\theta_r, \hat\eta_{r,\hat\theta,F_n})}{\sum_{j=1}^{R} \pi_j\, p_j(X_i; \hat\theta_j, \hat\eta_{j,\hat\theta,F_n})}, \quad r = 1, \dots, R.$$
The M-step: In the M-step,
1. Calculate the estimates of $\pi_r$:
$$\hat\pi_r = \frac{1}{n} \sum_{i=1}^{n} \gamma(Z_{ir}), \quad r = 1, \dots, R.$$
2. Keeping $\gamma(Z_{ir})$ constant, maximize the expected complete data log-likelihood with respect to $\theta$ to obtain new estimates $\hat\theta$.
The estimated parameters from the M-step are passed back to the E-step, and the two steps are iterated until the value of $\hat\theta$ converges. The resulting estimator is the profile likelihood MLE.
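As an illustration of the E- and M-steps above, the following minimal sketch runs EM for a two-component mixture. Gaussian component densities with unit variance stand in for the profiled components $p_r(x; \theta_r, \hat\eta_{r,\theta,F_n})$, so this is a schematic of the algorithm, not the semiparametric estimator itself; all names are ours.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, n_iter=200, tol=1e-8):
    """EM for a two-component location mixture with unit variances:
    the E-step computes gamma(Z_ir) as in (5); the M-step updates pi_r
    and theta_r with gamma held constant."""
    pi = np.array([0.5, 0.5])             # mixture probabilities pi_r
    theta = np.array([x.min(), x.max()])  # component parameters theta_r
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma(Z_ir), Equation (5)
        dens = np.column_stack([norm.pdf(x, loc=m, scale=1.0) for m in theta])
        weighted = dens * pi
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: pi_r = (1/n) sum_i gamma(Z_ir); then maximize the expected
        # complete-data log-likelihood (4) in theta (closed form here)
        pi = gamma.mean(axis=0)
        theta = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
        loglik = np.log(weighted.sum(axis=1)).sum()
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return pi, theta, gamma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 100)])
pi_hat, theta_hat, _ = em_mixture(x)
print(pi_hat, theta_hat)
```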

Statistical Generalized Derivative and Asymptotic Normality of the Profile Likelihood MLE
In this section, we present the main results. We show the asymptotic normality of the profile likelihood MLE in Theorem 2; from the asymptotic normality, we can obtain an estimate of the SE of the estimator. Toward this goal, we introduce the statistical generalized derivative in Theorem 1.
Assumptions: We list the assumptions used in Theorem 1 and Theorem 2 below. Let $\theta_0$, $\eta_0$ and $F_0$ denote the true values of the parameters $\theta$, $\eta$ and the cdf $F$. On the set of cdf functions $\mathcal{F}$, we use the sup-norm, i.e., for $F, F_0 \in \mathcal{F}$,
$$\|F - F_0\|_\infty = \sup_x |F(x) - F_0(x)|.$$
We denote the density function for the profile likelihood in the mixture model by $p(x; \theta, F)$ given in (10). We assume that:
(R1) The density function $p(x; \theta, F)$ is bounded away from 0, i.e., there is a constant $c > 0$ such that for each $x$ and $(\theta, F) \in \Theta \times \mathcal{F}$, $p(x; \theta, F) > c > 0$. Moreover, the density function $p(x; \theta, F)$ is continuously differentiable with respect to $\theta$ and Hadamard differentiable with respect to $F$ for all $x$. We denote the derivatives by $\varphi(x; \theta, F) = \frac{\partial}{\partial \theta} \log p(x; \theta, F)$ and $\psi(x; \theta, F) = d_F \log p(x; \theta, F)$; they are given in (11) and (12).
(R2) $n^{1/2}(F_n - F_0) = O_p(1)$ and $\hat\eta_{\hat\theta_n, F_n} - \eta_0 = o_p(1)$.
(R3) The profile likelihood MLE is consistent: $\hat\theta_n - \theta_0 = o_p(1)$.
(R4) The score function $\varphi(x; \theta, F)$ can be written in the form
$$\varphi(x; \theta, F) = \tilde\varphi(x; \theta, F, \hat\eta_{\theta,F}), \tag{16}$$
where the map $(\theta, \eta, F) \mapsto \tilde\varphi(x; \theta, F, \eta)$ is differentiable; the classes $\mathcal{C}_\rho$ and $\mathcal{H}$ are Donsker; and $\tilde\varphi(x; \theta, F, \eta)$ is Lipschitz in the parameters $(\theta, F, \eta)$ with a $P_0$-square integrable function $M(x)$:
$$|\tilde\varphi(x; \theta_1, F_1, \eta_1) - \tilde\varphi(x; \theta_2, F_2, \eta_2)| \le M(x)\big(|\theta_1 - \theta_2| + \|F_1 - F_2\|_\infty + \|\eta_1 - \eta_2\|\big). \tag{17}$$
Note: Since assumption (R4) is an unusual condition, it requires an explanation. In our profile likelihood problem, the estimator $\hat\eta_{\theta,F}$ of the parameter $\eta$ is an implicit function; therefore the map $(\theta, F) \mapsto \hat\eta_{\theta,F}$ is hard to differentiate. It follows that the profile likelihood score function $\varphi(x; \theta, F)$ defined in (11) is hard to differentiate. However, it is often the case that the function $\varphi(x; \theta, F)$ takes the form in (16). In our examples, the map $(\theta, \eta, F) \mapsto \tilde\varphi(x; \theta, F, \eta)$ is in closed form and is therefore easy to differentiate. This differentiability becomes handy when we need to expand the function $\varphi(x; \theta, F)$ through the differentiability of $\tilde\varphi(x; \theta, F, \eta)$. We use assumption (R4) in the proof of Equation (20) in Theorem 1. In the example of Section 3, the function in (44) corresponds to $\hat\eta_{\theta,F}$ and the function in (51) corresponds to $\tilde\varphi(x; \theta, F, \eta)$.
Note: To calculate the second derivative of the score function $\varphi(x; \theta, F)$ given in (11), we use an idea similar to the derivative of generalized functions [15]. Suppose the density for the profile likelihood $p(x; \theta, F)$ given in (10) is twice differentiable with respect to $\theta$. Then, by differentiating the identity $\int \varphi(x; \theta, F)\, p(x; \theta, F)\, dx = 0$ with respect to $\theta$ at $(\theta, F) = (\theta_0, F_0)$, we get equivalent expressions for the efficient information matrix in terms of the score function $\varphi(x; \theta_0, F_0)$:
$$\tilde I_0 = \int \varphi(x; \theta_0, F_0)\, \varphi(x; \theta_0, F_0)^T\, p(x; \theta_0, F_0)\, dx \tag{13}$$
$$= -\int \frac{\partial}{\partial \theta^T} \varphi(x; \theta_0, F_0)\, p(x; \theta_0, F_0)\, dx. \tag{14}$$
From this equation, we are motivated to define the expected derivative of the score function by
$$E_0\Big[\frac{\partial}{\partial \theta^T} \varphi(X; \theta_0, F_0)\Big] := -\tilde I_0 = -E_0\big[\varphi(X; \theta_0, F_0)\, \varphi(X; \theta_0, F_0)^T\big]. \tag{15}$$
In the following theorem, we show that this definition is valid even when the derivative $\frac{\partial}{\partial \theta^T} \varphi(x; \theta, F)$ of the score function does not exist.

Theorem 1. Let $\tilde I_0$, $\varphi(x; \theta, F)$ and $\psi(x; \theta, F)$ be given in (15), (11) and (12), respectively. Suppose (R1) and (R4). Then, for $\theta_t \to \theta_0$ and $F_t \to F_0$ as $t \to 0$, we have
$$E_0[\varphi(X; \theta_t, F_0)] = -\tilde I_0 (\theta_t - \theta_0) + o(|\theta_t - \theta_0|) \tag{19}$$
and
$$E_0[\varphi(X; \theta_t, F_t)] = -\tilde I_0 (\theta_t - \theta_0) - E_0\big[\varphi(X; \theta_0, F_0)\, \psi(X; \theta_0, F_0)(F_t - F_0)\big] + o(|\theta_t - \theta_0| + \|F_t - F_0\|_\infty). \tag{20}$$

Note: Even when the derivative $\frac{\partial}{\partial \theta} \varphi(x; \theta, F)$ does not exist, Equation (19) shows that the derivative of the map $\theta \mapsto E_0[\varphi(X; \theta, F_0)]$ exists. We may call this derivative the statistical generalized derivative. A similar comment holds for (20).
Proof. In (R1), we assumed that the density function $p(x; \theta, F)$ is bounded away from 0 and differentiable with respect to $\theta$ and $F$. It follows that the difference quotients of $p(x; \theta, F)$ in $\theta$ and in $F$, and the score $\varphi(x; \theta, F)$, are dominated by $P_0$-integrable envelopes; we refer to these bounds as (21), (22) and (23).

We prove the first Equation (19). For each $t$, the equality
$$\int \varphi(x; \theta_t, F_0)\, p(x; \theta_t, F_0)\, dx = 0$$
holds, since the score function has mean zero under its own parameter. It follows that, for each $t$,
$$E_0[\varphi(X; \theta_t, F_0)] = -\int \varphi(x; \theta_t, F_0)\, \{p(x; \theta_t, F_0) - p(x; \theta_0, F_0)\}\, dx. \tag{24}$$
By the dominated convergence theorem with (21), the right hand side of (24) is, as $t \to 0$,
$$-\int \varphi(x; \theta_0, F_0)\, \varphi(x; \theta_0, F_0)^T\, p(x; \theta_0, F_0)\, dx\; (\theta_t - \theta_0) + o(|\theta_t - \theta_0|) = -\tilde I_0 (\theta_t - \theta_0) + o(|\theta_t - \theta_0|).$$
It follows that we have (19).

Now we prove the second Equation (20). As in the proof of (19), for each $t$, the following equation holds:
$$E_0[\varphi(X; \theta_t, F_t)] = -\int \varphi(x; \theta_t, F_t)\, \{p(x; \theta_t, F_t) - p(x; \theta_0, F_0)\}\, dx. \tag{25}$$
Using (23) with the dominated convergence theorem, the left hand side of (25) converges as $t \to 0$, where we used (17): with a $P_0$-square integrable function $M(x)$,
$$|\varphi(x; \theta_t, F_t) - \varphi(x; \theta_0, F_0)| \le M(x)\big(|\theta_t - \theta_0| + \|F_t - F_0\|_\infty + \|\hat\eta_{\theta_t, F_t} - \hat\eta_{\theta_0, F_0}\|\big). \tag{26}$$
By the dominated convergence theorem with (22), the integral on the right hand side of Equation (25) is
$$-\tilde I_0 (\theta_t - \theta_0) - E_0\big[\varphi(X; \theta_0, F_0)\, \psi(X; \theta_0, F_0)(F_t - F_0)\big] + o(|\theta_t - \theta_0| + \|F_t - F_0\|_\infty). \tag{27}$$
By (26) and (27), the equality (25) is equivalent to (20). $\square$

Theorem 2. Suppose (R1)-(R4). Then the profile likelihood MLE $\hat\theta_n$ is asymptotically normal:
$$\sqrt{n}(\hat\theta_n - \theta_0) = \tilde I_0^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varphi(X_i; \theta_0, F_0) + o_p(1) \to N(0, \tilde I_0^{-1})$$
in distribution.

Proof. The profile likelihood MLE described in Section 2.3 is a solution to the estimating equation
$$\sum_{i=1}^{n} \varphi(X_i; \hat\theta_n, F_n) = 0, \tag{28}$$
where $\varphi(x; \theta, F)$ is the profile likelihood score function for $\theta$ defined in (11). In (R4), we assumed that $\mathcal{C}_\rho$ and $\mathcal{H}$ are Donsker and that the function $\tilde\varphi(x; \theta, F, \eta)$ is Lipschitz in the parameters $(\theta, F, \eta)$ with a $P_0$-square integrable function $M(x)$ given in (17). By Corollary 2.10.13 in [16], the class $\{\tilde\varphi(x; \theta, F, \eta) : (\theta, F, \eta) \in \Theta \times \mathcal{C}_\rho \times \mathcal{H}\}$ is Donsker. Moreover, we assumed $\hat\theta_n - \theta_0 = o_p(1)$. In (R2), we assumed $n^{1/2}(F_n - F_0) = O_p(1)$ and $\hat\eta_{\hat\theta_n, F_n} - \eta_0 = o_p(1)$. By the dominated convergence theorem with (17), we have $E[(\tilde\varphi(X; \hat\theta_n, F_n, \hat\eta_{\hat\theta_n, F_n}) - \tilde\varphi(X; \theta_0, F_0, \eta_0))^2] = o_p(1)$. By Lemma 19.24 in [14], it follows that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big\{\tilde\varphi(X_i; \hat\theta_n, F_n, \hat\eta_{\hat\theta_n, F_n}) - E\big[\tilde\varphi(X; \theta, F_n, \hat\eta_{\theta, F_n})\big]\big|_{\theta = \hat\theta_n}\Big\} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varphi(X_i; \theta_0, F_0) + o_p(1). \tag{29}$$
Using (16) and the estimating Equation (28), the left hand side of (29) equals $-\sqrt{n}\, E[\varphi(X; \theta, F_n)]\big|_{\theta = \hat\theta_n}$. Using (20),
$$-\sqrt{n}\, E[\varphi(X; \theta, F_n)]\big|_{\theta = \hat\theta_n} = \sqrt{n}\, \tilde I_0 (\hat\theta_n - \theta_0) + \sqrt{n}\, E_0\big[\varphi(X; \theta_0, F_0)\, \psi(X; \theta_0, F_0)(F_n - F_0)\big] + o_p\big(\sqrt{n}\, |\hat\theta_n - \theta_0| + 1\big), \tag{30}$$
where we used: 1. Since $\psi(x; \theta_0, F_0)$ is in the nuisance tangent space and $\varphi(x; \theta_0, F_0)$ is the efficient score function, we have
$$E_0\big[\varphi(X; \theta_0, F_0)\, \psi(X; \theta_0, F_0)(F_n - F_0)\big] = 0. \tag{31}$$
2. By the consistency of $\hat\theta_n$ together with $n^{1/2}(F_n - F_0) = O_p(1)$, the remainder in (20) is $o_p(\sqrt{n}\, |\hat\theta_n - \theta_0| + 1)$. Combining (29), (30) and (31),
$$\sqrt{n}\, \tilde I_0 (\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varphi(X_i; \theta_0, F_0) + o_p\big(\sqrt{n}\, |\hat\theta_n - \theta_0| + 1\big).$$
Finally, since $\tilde I_0$ is invertible, $\sqrt{n}(\hat\theta_n - \theta_0) = O_p(1)$ and
$$\sqrt{n}(\hat\theta_n - \theta_0) = \tilde I_0^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varphi(X_i; \theta_0, F_0) + o_p(1),$$
which converges in distribution to $N(0, \tilde I_0^{-1})$ by the central limit theorem, since $E_0[\varphi \varphi^T] = \tilde I_0$. $\square$
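In practice, Theorem 2 suggests a simple recipe for the SE: estimate $\tilde I_0$ by the empirical second moment of the profile likelihood score and invert it. The sketch below assumes only that per-observation log profile likelihood contributions can be evaluated numerically; the score $\varphi$ is approximated by central differences, and all function names are ours.

```python
import numpy as np

def profile_score_se(loglik_i, theta_hat, h=1e-5):
    """Sketch: SE of the profile likelihood MLE from the profile score.
    loglik_i(theta) returns the length-n vector of per-observation log
    profile likelihood contributions log p(X_i; theta, F_n).  The score
    phi_i is approximated by central differences; the efficient information
    is estimated by I = (1/n) sum_i phi_i phi_i^T, and SE = sqrt(diag(I^-1)/n)."""
    theta_hat = np.atleast_1d(np.asarray(theta_hat, dtype=float))
    p = theta_hat.size
    n = loglik_i(theta_hat).size
    phi = np.empty((n, p))
    for k in range(p):
        e = np.zeros(p)
        e[k] = h
        phi[:, k] = (loglik_i(theta_hat + e) - loglik_i(theta_hat - e)) / (2.0 * h)
    info = phi.T @ phi / n          # estimate of the efficient information
    return np.sqrt(np.diag(np.linalg.inv(info)) / n)

# Example with a normal location model, where the profile score is explicit:
rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=500)
se = profile_score_se(lambda t: -0.5 * (x - t[0]) ** 2, np.array([x.mean()]))
print(se)   # close to 1/sqrt(500)
```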

Joint Mixture Model of Survival and Longitudinal Ordered Data
In the paper [17], we treated "the joint model of ordinal responses and the proportional hazards model with a finite mixture" with real-data and simulated-data examples. In that paper, we had a method of calculating the SE of the profile likelihood MLE applied to the examples; however, it was not justified with proofs. This motivated the present paper. In this section, we use Theorem 1 and Theorem 2 to show the asymptotic normality of the profile likelihood MLE in the model and to prove the method of calculating the SE in [17]. Please see [17] for the real-data and simulated-data examples.

Ordinal Response Models: Let $Y_{ijm}$ be the ordered categorical response from 1 (poor) to $L$ (excellent) on item (or question) $j$ for subject $i$ at the $m$th protocol-specified time point, where $i = 1, \dots, n$, $j = 1, \dots, J$ and $m = 1, \dots, M$. In total, there are $J$ items in the questionnaire related to the patients' quality of life, collected at times $t_1, t_2, \dots, t_M$. Given that subject $i$ belongs to group $r$, an ordered stereotype model can be written as
$$\log \frac{P(Y_{ijm} = \ell \mid Z_{ir} = 1)}{P(Y_{ijm} = 1 \mid Z_{ir} = 1)} = a_\ell + \phi_\ell (b_j + \theta_r), \tag{34}$$
where $a_\ell$ is a response-level intercept parameter with $\ell = 2, \dots, L$, $b_j$ is an item effect, and $\theta_r$ is associated with the discrete latent variable, with $a_1 = 0$, $b_1 = 0$, $\phi_1 = 0$ and $\theta_1 = 0$. The parameter $\theta_r$ can be referred to as a group effect of the quality of life for patients in group $r$; however, the group memberships are unknown. The $\{\phi_\ell\}$ parameters can be regarded as unknown scores for the outcome categories. Because $\phi_\ell (b_j + \theta_r) = (A \phi_\ell)((b_j + \theta_r)/A)$ for any constant $A \neq 0$, for identifiability we need to impose monotone scores on $\{\phi_\ell\}$ to treat $Y_{ijm}$ as ordinal. Therefore, the model has the constraint $0 = \phi_1 \le \phi_2 \le \dots \le \phi_L = 1$. The ordinal response part of the likelihood function for the $i$th subject, $P(Y_i \mid \theta_r, \alpha)$, is the product over items, time points and categories of the probabilities in (34), where $\alpha$ collects the parameters $(a, b, \phi)$.

Proportional Hazards Model: Given that subject $i$ belongs to group $r$, the hazard of the survival time is modeled by proportional hazards,
$$\lambda(t \mid Z_{ir} = 1) = \lambda_0(t)\, e^{\eta_r(\delta)},$$
where $\lambda_0(t)$ is the baseline hazard function, the latent variable $\theta_r$ is linked with the ordinal response model through the linear predictor $\eta_r(\delta)$, and $\delta = (\delta_0, \delta_1)$ are coefficients. For the estimation of the baseline hazard function $\lambda_0(t)$, we use the method of nonparametric maximum likelihood described in [18, Section 4.3]. Let $\lambda_i$ be the hazard at time $t_i$, where $t_1 < t_2 < \dots < t_n$ are the ordered observed times. Assume that the hazard is zero between adjacent times, so that the survival time is discrete. The corresponding cumulative hazard function $\Lambda_0(t_i) = \sum_{p \le i} \lambda_p$ is a step function with jumps at the failure times $t_i$. Then the survival part of the likelihood function of subject $i$ is
$$P(T_i, d_i \mid \lambda, \theta_r, \delta) = \big(\lambda_i\, e^{\eta_r(\delta)}\big)^{d_i} \exp\Big(-\sum_{p: t_p \le T_i} \lambda_p\, e^{\eta_r(\delta)}\Big), \tag{36}$$
where $d_i$ is an indicator of censorship for individual $i$: if we observe the failure time, then $d_i = 1$; otherwise $d_i = 0$.

The Full Likelihood Function: The joint likelihood function is obtained by combining the probability function from the ordinal response model (34) and the proportional hazards model (36), assuming the two models are independent given the latent discrete random variables. Let $\pi_r$ be the unknown probability ($r = 1, \dots, R$) that a subject lies in group $r$, and let $(\Theta, \lambda) = ((\theta, \alpha, \delta), \lambda)$ be all the unknown parameters of the joint model. The mixture model likelihood function is
$$L(\Theta, \lambda, \pi) = \prod_{i=1}^{n} \sum_{r=1}^{R} \pi_r\, P(Y_i \mid \theta_r, \alpha)\, P(T_i, d_i \mid \lambda, \theta_r, \delta). \tag{37}$$
Let $Z_{ir}$ be the group indicator, where $Z_{ir} = 1$ if the $i$th individual is from the $r$th group and 0 otherwise. The complete data likelihood can be written as
$$\prod_{i=1}^{n} \prod_{r=1}^{R} \big\{\pi_r\, P(Y_i \mid \theta_r, \alpha)\, P(T_i, d_i \mid \lambda, \theta_r, \delta)\big\}^{Z_{ir}}. \tag{38}$$
The expected complete data log-likelihood under $q(Z) = P(Z \mid Y, T, d)$ is
$$\sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \big\{\log \pi_r + \log P(Y_i \mid \theta_r, \alpha) + \log P(T_i, d_i \mid \lambda, \theta_r, \delta)\big\}, \tag{39}$$
where $\gamma(Z_{ir})$, $P(Y_i \mid \theta_r, \alpha)$ and $P(T_i, d_i \mid \lambda, \theta_r, \delta)$ are defined in Equations (42), (34) and (36), respectively.
To estimate all parameters and the baseline hazards simultaneously, we combine the EM algorithm and the method of nonparametric maximum likelihood.
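To make the ordinal response part concrete, the following sketch evaluates the category probabilities of the ordered stereotype model under the linear predictor $a_\ell + \phi_\ell (b_j + \theta_r)$ used above, with the constraints $a_1 = 0$ and $0 = \phi_1 \le \dots \le \phi_L = 1$; function and argument names are ours.

```python
import numpy as np

def stereotype_probs(a, phi, b_j, theta_r):
    """Category probabilities of the ordered stereotype model (a sketch):
    P(Y = l) is proportional to exp(a_l + phi_l * (b_j + theta_r)),
    with constraints a[0] = 0 and 0 = phi[0] <= ... <= phi[-1] = 1."""
    a, phi = np.asarray(a, dtype=float), np.asarray(phi, dtype=float)
    eta = a + phi * (b_j + theta_r)   # linear predictor per category
    eta -= eta.max()                  # numerical stabilization
    w = np.exp(eta)
    return w / w.sum()

# L = 4 categories; the monotone scores enforce the ordinal interpretation
p = stereotype_probs(a=[0.0, 0.3, 0.1, -0.5], phi=[0.0, 0.4, 0.7, 1.0],
                     b_j=0.2, theta_r=1.0)
print(p, p.sum())
```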

Estimation Procedure: Profile Likelihood with EM Algorithm
Baseline Hazard Estimation: Before starting the EM iteration, we profile out the baseline hazard function $\lambda_0(t)$. The survival part of Equation (39) can be separately maximized with respect to $\lambda$:
$$\sum_{i=1}^{n} \sum_{r=1}^{R} \gamma(Z_{ir}) \log P(T_i, d_i \mid \lambda, \theta_r, \delta). \tag{40}$$
For $l = 1, \dots, n$, we find the maximizer $\hat\lambda_l$ of (40) by holding $(\theta, \delta)$ fixed, and it is given by
$$\hat\lambda_l(\theta, \delta) = \frac{d_l}{\sum_{i: T_i \ge t_l} \sum_{r=1}^{R} \gamma(Z_{ir})\, e^{\eta_r(\delta)}}. \tag{41}$$
Denote $\hat\lambda(\theta, \delta) = (\hat\lambda_1(\theta, \delta), \dots, \hat\lambda_n(\theta, \delta))$.

The E-step: In the E-step, we use the current parameter estimates $\hat\Theta = (\hat\theta, \hat\alpha, \hat\delta)$ to find the expected values of $Z_{ir}$:
$$\gamma(Z_{ir}) = \frac{\pi_r\, P(Y_i \mid \hat\theta_r, \hat\alpha)\, P(T_i, d_i \mid \hat\lambda(\hat\theta, \hat\delta), \hat\theta_r, \hat\delta)}{\sum_{j=1}^{R} \pi_j\, P(Y_i \mid \hat\theta_j, \hat\alpha)\, P(T_i, d_i \mid \hat\lambda(\hat\theta, \hat\delta), \hat\theta_j, \hat\delta)}, \quad r = 1, \dots, R. \tag{42}$$

The M-step: In the M-step, we maximize Equation (39) with respect to $\pi_r$ and $\Theta = (\theta, \alpha, \delta)$. Since $\pi_r$ and $\Theta$ enter separate terms of (39), they can be estimated separately.
The estimated parameters from the M-step are passed back to the E-step, and the procedure is iterated until the value of $\hat\Theta$ converges.
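The profiled hazard step (41) can be written in a few lines. The sketch below assumes distinct ordered event times (so each failure contributes $d_l = 1$) and takes the responsibilities $\gamma(Z_{ir})$ and risk scores $e^{\eta_{ir}}$ as inputs; in the full algorithm it alternates with the E-step (42), since $\gamma$ and $\hat\lambda$ depend on each other. All names are ours.

```python
import numpy as np

def profiled_baseline_hazard(times, events, gamma, risk_score):
    """Breslow-type profiled hazard, a sketch of Equation (41):
    lambda_l = d_l / sum_{i: T_i >= t_l} sum_r gamma_ir * exp(eta_ir),
    holding the responsibilities gamma and risk scores fixed.
    times: (n,) observed times; events: (n,) censoring indicators d_i;
    gamma: (n, R) responsibilities; risk_score: (n, R) values exp(eta_ir)."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    w = (gamma * risk_score)[order].sum(axis=1)   # sum_r gamma_ir e^{eta_ir}
    # risk-set sums: for each ordered time t_l, sum of w over {i: T_i >= t_l}
    risk = np.cumsum(w[::-1])[::-1]
    lam = np.where(d == 1, 1.0 / risk, 0.0)       # jumps only at failure times
    return t, lam, np.cumsum(lam)                 # times, jumps, Lambda-hat

# toy call: 5 subjects, 2 groups, flat responsibilities and unit risk scores
t = np.array([2.0, 5.0, 3.0, 8.0, 1.0]); d = np.array([1, 0, 1, 1, 1])
g = np.full((5, 2), 0.5)
w = np.ones((5, 2))
print(profiled_baseline_hazard(t, d, g, w))
```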

Asymptotic Normality of the Profile Likelihood MLE and Its Asymptotic Variance
The profile likelihood function for $\Theta = (\theta, \alpha, \delta)$ is obtained by plugging $\hat\lambda(\theta, \delta)$ into (37):
$$\prod_{i=1}^{n} \sum_{r=1}^{R} \pi_r\, P(Y_i \mid \theta_r, \alpha)\, P(T_i, d_i \mid \hat\lambda(\theta, \delta), \theta_r, \delta). \tag{43}$$
From (41), an estimator of the cumulative hazard function in counting process notation is, with $E_{F_n} f = \int f\, dF_n$,
$$\hat\Lambda(t; \Theta, F_n) = \int_0^t \frac{E_{F_n}[dN(s)]}{E_{F_n}\big[\sum_{r=1}^{R} \gamma(Z_r)\, Y(s)\, e^{\eta_r(\delta)}\big]}, \tag{44}$$
where $N(t) = 1\{T \le t,\, d = 1\}$ is the counting process and $Y(t) = 1\{T \ge t\}$ the at-risk process. Note: From now on, we deal with the cumulative hazard function $\Lambda(t) = \int_0^t \lambda(s)\, ds$ instead of the hazard function $\lambda(t)$. The function $\hat\lambda(\theta, \delta)$ in (42) and (43) will be replaced with $\hat\Lambda(\Theta, F_n)$.
Note: The function $\hat\Lambda$ in (44) depends on $\gamma(Z_r)$; on the other hand, the function $\gamma(Z_r)$ in (42) depends on $\hat\Lambda$. This circular relationship shows that the function $\hat\Lambda$ in (44) is an implicit function.
Equation (43) gives the profile likelihood function for $\Theta = (\theta, \alpha, \delta)$. The log profile likelihood function for one observation is
$$\ell(Y, T, d; \Theta, F) = \sum_{r=1}^{R} \gamma(Z_r) \big\{\log \pi_r + \log P(Y \mid \theta_r, \alpha) + \log P(T, d \mid \hat\Lambda(\Theta, F), \theta_r, \delta)\big\}, \tag{46}$$
where the survival part is
$$\ell_S(T, d; \Theta, F) = \sum_{r=1}^{R} \gamma(Z_r) \log P(T, d \mid \hat\Lambda(\Theta, F), \theta_r, \delta). \tag{47}$$
In the above log-likelihood, we set $a_1 = b_1 = \phi_1 = \theta_1 = 0$.

Score functions: The score functions for the profile likelihood are
$$\varphi(Y, T, d \mid \Theta, F) = \varphi_O(Y \mid \Theta) + \varphi_S(T, d \mid \Theta, F), \tag{48}$$
with
$$\varphi_O(Y \mid \Theta) = \sum_{r=1}^{R} \gamma(Z_r)\, \frac{\partial}{\partial \Theta} \log P(Y \mid \theta_r, \alpha), \qquad \varphi_S(T, d \mid \Theta, F) = \sum_{r=1}^{R} \gamma(Z_r)\, \frac{\partial}{\partial \Theta} \log P(T, d \mid \hat\Lambda(\Theta, F), \theta_r, \delta).$$
Here, all derivatives are calculated treating $\gamma(Z_{ir})$ as constant. We call $\varphi_O$ the score function for the ordinal response model and $\varphi_S$ the one for the survival model. Then the profile likelihood MLE $\hat\Theta_n$ obtained by the procedure described in Section 3.1 is asymptotically normal:
$$\sqrt{n}(\hat\Theta_n - \Theta_0) \to N\big(0, \tilde I_0^{-1}\big) \quad \text{in distribution}, \qquad \tilde I_0 = E_0\big[\varphi(Y, T, d \mid \Theta_0, F_0)\, \varphi(Y, T, d \mid \Theta_0, F_0)^T\big].$$
Proof. We check conditions (R1)-(R4) in Section 2.4, so that Theorem 1 and Theorem 2 can be used to obtain the result stated in the theorem.
Since the ordinal response part is a parametric model, we mainly discuss the survival part of the model. The survival part of the profile log-likelihood function for one observation is given in (47).
To express the survival part of the score function $\varphi_S(T, d \mid \Theta, F)$ in the form given in condition (R4), we introduce some notation. Let $\gamma(Z_r \mid \Theta, \Lambda)$ denote the posterior group probability in (42) viewed as a function of the parameters $(\Theta, \Lambda)$:
$$\gamma(Z_r \mid \Theta, \Lambda) = \frac{\pi_r\, P(Y \mid \theta_r, \alpha)\, P(T, d \mid \Lambda, \theta_r, \delta)}{\sum_{j=1}^{R} \pi_j\, P(Y \mid \theta_j, \alpha)\, P(T, d \mid \Lambda, \theta_j, \delta)}. \tag{49}$$
The function $\gamma(Z_r \mid \Theta, \Lambda)$ is differentiable with respect to $\Theta$ and $\Lambda$. The function $\gamma(Z_r)$ in (42) can then be expressed as $\gamma(Z_r) = \gamma(Z_r \mid \Theta, \hat\Lambda(\Theta, F))$, and the score function for the survival part satisfies
$$\varphi_S(T, d \mid \Theta, F) = \tilde\varphi_S(T, d \mid \Theta, F, \hat\Lambda(\Theta, F)). \tag{50}$$
We will check condition (R4) using the function defined by
$$\tilde\varphi_S(T, d \mid \Theta, F, \Lambda) = \sum_{r=1}^{R} \gamma(Z_r \mid \Theta, \Lambda)\, \frac{\partial}{\partial \Theta} \log P(T, d \mid \Lambda, \theta_r, \delta). \tag{51}$$
We calculated the survival part of the score function $\varphi_S(T, d \mid \Theta, F) = \tilde\varphi_S(T, d \mid \Theta, F, \hat\Lambda(\Theta, F))$ in (50). Since the ordinal response part is a parametric model, it is differentiable with respect to the parameter $\Theta$ (we omit the calculation).
We calculate the score function $\psi(T, d \mid \Theta, F) = \sum_{r=1}^{R} \gamma(Z_{ir})\, d_F \log P(T, d \mid \hat\Lambda(\Theta, F), \theta_r, \delta)$. For an integrable function $h$ with the same domain as the cdf $F$, the derivative is computed along the path $F + th$ through the representation of $\hat\Lambda(\Theta, F)$ in (44).

Conclusions
We proposed the "statistical generalized derivative" to deal with an implicit function that appears in profile likelihood estimation in semiparametric models. Our example is the joint model of ordinal responses and the proportional hazards model with a finite mixture treated in this paper. Using the "statistical generalized derivative", we expanded the profile likelihood function and showed the asymptotic normality of the profile likelihood MLE. Moreover, this approach enabled us to express the SE of the estimator in terms of the profile likelihood score function. This contributes to profile likelihood estimation methods where, otherwise, no direct expansion of the profile likelihood is possible. Our approach can be applied not only to the example in this paper but also to many models with an implicit function in the profile likelihood estimation. As a next step, we are working on applying our method to such examples.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Theorem 3 (The Efficient Score Function)
Proof. For ease of notation, we denote the true values of the parameters by $(\Theta, F, \Lambda)$. From (44), replacing $F_n$ by $F$, we have
$$\hat\Lambda(t; \Theta, F) = \int_0^t \frac{E[dN(s)]}{E\big[\sum_{r=1}^{R} \gamma(Z_r)\, Y(s)\, e^{\eta_r(\delta)}\big]}, \tag{A1}$$
where $E$ is the expectation with respect to the true distribution $F$. Since $E[dN(s)] = E\big[\sum_{r=1}^{R} \gamma(Z_r)\, Y(s)\, e^{\eta_r(\delta)}\big]\, d\Lambda(s)$ at the true values of the parameters $(\Theta, F, \Lambda)$, we have $\hat\Lambda(t; \Theta, F) = \Lambda(t)$.
The score function $\varphi(Y, T, d \mid \Theta, F) = \varphi_O(Y \mid \Theta) + \varphi_S(T, d \mid \Theta, F)$ in (48) has two parts: the score function for the ordinal response model, $\varphi_O(Y \mid \Theta)$, and the score function for the survival model, $\varphi_S(T, d \mid \Theta, F)$. Since the score function for the ordinal response model does not involve the parameter $\Lambda$, we work only on the survival part of the score function.
We treat the part $\gamma(Z_r)$ as constant in terms of the parameters. Let
$$M_0(t) = E\Big[\sum_{r=1}^{R} \gamma(Z_r)\, Y(t)\, e^{\eta_r(\delta)}\Big], \qquad M_1(t) = E\Big[\sum_{r=1}^{R} \gamma(Z_r)\, Y(t)\, e^{\eta_r(\delta)}\, \frac{\partial \eta_r(\delta)}{\partial \Theta}\Big]. \tag{A2}$$
Then the score function in the survival part of the model, evaluated at the true values of the parameters $\Theta$ and $F$ and using Equation (A1), coincides with the efficient score function in the survival part of the model derived in Equation (A4), Appendix B.

Appendix B. Derivation of Efficient Score Function in the Joint Model
In this appendix, we derive the efficient score function in the joint model using (8). We denote $P_{r,\Theta,\Lambda}(T, d) = P(T, d \mid \Lambda, \theta_r, \delta)$. Again, the true values of the parameters are denoted by $(\Theta, F, \Lambda)$.
The survival part of the log-likelihood function for one observation is
$$\ell_S(T, d \mid \Theta, \Lambda) = \sum_{r=1}^{R} \gamma(Z_r) \log P_{r,\Theta,\Lambda}(T, d). \tag{A3}$$
The score function for $\Theta$ is
$$\dot\ell_S(T, d \mid \Theta, \Lambda) = \sum_{r=1}^{R} \gamma(Z_r)\, \frac{\partial}{\partial \Theta} \log P_{r,\Theta,\Lambda}(T, d).$$
Let $h : [0, \tau] \to \mathbb{R}$ be a function on $[0, \tau]$. The path defined by $d\Lambda_s = (1 + sh)\, d\Lambda$ is a submodel passing through $\Lambda$ at $s = 0$. The corresponding path for $\lambda$ is $\lambda_s(t) = \frac{d\Lambda_s(t)}{dt} = (1 + sh(t))\, \lambda(t)$.

Efficient score function
Then the efficient score function for the survival part of the model is given by
$$\tilde\ell_S(T, d \mid \Theta, \Lambda) = \dot\ell_S(T, d \mid \Theta, \Lambda) - \int_0^\tau \frac{M_1(t)}{M_0(t)} \Big( dN(t) - \sum_{r=1}^{R} \gamma(Z_r)\, Y(t)\, e^{\eta_r(\delta)}\, d\Lambda(t) \Big), \tag{A4}$$
where $M_1(t)$ and $M_0(t)$ are defined in (A2).