Marginalized Two-Part Joint Modeling of Longitudinal Semi-Continuous Responses and Survival Data: With Application to Medical Costs

: Non-negative continuous outcomes with a substantial number of zero values and incomplete longitudinal follow-up are quite common in medical costs data. It is thus critical to incorporate the potential dependence of survival status and longitudinal medical costs in joint modeling, where censorship is death-related. Despite the wide use of conventional two-part joint models (CTJMs) to capture zero-inﬂation, they are limited to conditional interpretations of the regression coefﬁcients in the model’s continuous part. In this paper, we propose a marginalized two-part joint model (MTJM) to jointly analyze semi-continuous longitudinal costs data and survival data. We compare it to the conventional two-part joint model (CTJM) for handling marginal inferences about covariate effects on average costs. We conducted a series of simulation studies to evaluate the superior performance of the proposed MTJM over the CTJM. To illustrate the applicability of the MTJM, we applied the model to a set of real electronic health record (EHR) data recently collected in Iran. We found that the MTJM yielded a smaller standard error, root-mean-square error of estimates, and AIC value, with unbiased parameter estimates. With this MTJM, we identiﬁed a signiﬁcant positive correlation between costs and survival, which was consistent with the simulation results.


Introduction
In many medical studies, the measurement of the primary outcome may be via a semi-continuous random variable that combines a continuous distribution with point masses at one or more locations [1]. A particular type of semi-continuous outcome is characterized by a point mass at zero and positive values that usually follow a skewed distribution [2]. Examples include alcohol use, driving-simulator-based research, annual medical costs, etc. [3][4][5]. For instance, in medical-insurance-based economic applications, medical costs typically include a large number of zero values representing a population of "non-users" who do not benefit from the medical care system in a given time interval, and a continuous distribution representing the cost levels of those who do receive care [6,7]. Two-part models are often used to analyze such semi-continuous data. These models consist of two separate model parts, with part I to model the zero values and part II to model the continuous values. In fact, some researchers have shown an increased interest in analyzing the semi-continuous outcomes by modeling the discrete zero component separately from the nonzero continuous component [4]. More recently, accommodating longitudinal data has necessitated the extension of these two-part models, with researchers jointly modeling these two outcomes for valid and efficient inference. It has been suggested that ignoring the potential dependency between the components can lead to biased results [2,5,8].
Although still extensively used, the conventional two-part models (CTMs) for capturing zero-inflation are limited to conditional interpretations of the regression coefficients in the model's continuous part. Expressly, they exclude zero values from the second component, and they only provide inferences on the subpopulation of those with positive outcomes. When doing so, a generalization of the results for the effects of covariates that are included in the continuous part of model is only applicable to "users" (e.g., those positive values). Marginal inferences on the populations of health care users and non-users cannot be easily obtained using the conventional two-part model [7]. Moreover, in longitudinal studies, the population of health care users is not fixed over time, whereas the inferences from conventional two-part models are made based on a fixed population [9].
To enhance such marginal inferences, marginalized two-part (MTP) models are preferable for longitudinal semi-continuous data [7]; they yield more interpretable estimates when the primary focus is to estimate covariate effects on the average costs across the entire population of both users and non-users. These models retain many of the most significant features of conventional two-part models, such as capturing zero-inflation and skewness. Still, they allow investigators to examine covariate effects on the overall meanone of the primary targets in many applications [7,9]. Numerous studies have attempted to explain the importance of using marginalized two-part models to accurately model semi-continuous data from complex surveys [4,10,11]. Thus far, this method has not been applied to joint models of longitudinal and survival data, which is the aim of this paper.
Moreover, in health economics studies, patients are monitored longitudinally, and their medical costs are gathered until death or incomplete follow-up occurs. Particularly, this terminal event changes the repeated measures process afterward [12]. For example, death precludes further accumulation of medical costs-therefore, the repeated measures of medical costs (e.g., monthly medical costs) are zero after death [13]. In such cases, the repeated measures and time-to-event outcomes are not independent. Ignoring this possible correlation can result in biased parameter estimates and inefficient statistical inference [13,14]. In a variety of public health applications, the joint modeling of these two processes is a useful tool for considering the possible correlation between the longitudinal and survival outcomes [15,16].
Added to this correlation, as mentioned previously, the distribution of medical costs data is generally right-skewed, and includes a substantial number of zero values. An appropriate model, therefore, must take all of these aspects into consideration [17]. Liu et al., and Xu et al., proposed two-part joint models of these semi-continuous data. However, they used a CTP model for the longitudinal part, so they did not obtain the marginal inference for the continuous part of the models for the longitudinal part [13,14]. However, there have been no studies that join MTP and survival models. In this paper, we propose a new extension of a marginalized two-part model for a joint analysis of longitudinal and survival data that accounts for the semi-continuous nature of longitudinal medical costs data. This new method readily allows investigators to obtain marginal inferences for the entire population of heath care users and non-users. Considering that there is no closed form of the likelihood function, we use approximate maximum likelihood estimation with the Gaussian-Hermite quadrature method to make statistical inferences based on the new model. The analysis of medical costs data lies at the core of our motivation for attempting to conduct this study. Since these data are often highly skewed to the right, with a large proportion of patients having zero costs, together with censoring due to lack of follow-up or death, the distributional features of such medical costs data make modeling challenging from a methodological standpoint.
The remainder of this paper is arranged as follows: Section 2 briefly reviews the conventional two-part model for the joint analysis of longitudinal semi-continuous data and survival data, along with the new marginalized two-part joint model. A series of simulation studies is presented in Section 3 to investigate the performance of the proposed marginalized two-part joint model (MTJM). In addition, Section 3 presents an application of this new MTJM to electronic health record (EHR) data in Iran. Finally, Discussions are provided in Section 4 and conclusions in Section 5.

Materials and Methods
In this section, we describe the methods in detail.

Conventional Two-Part Model for Joint Analysis of Longitudinal Semi-Continuous Data and Survival Data
Suppose that the monthly medical costs for the j-th observation of subject i (i = 1, 2, . . . , n) at time t ij (j = 1, 2, . . . , n i ) are denoted by Y ij (i.e., semi-continuous repeated measures), where n is the total number of subjects, and n i is the number of measurements for the i-th subject. For each subject, consider a random censoring time C i and a random terminal event (death) time D i such that the repeated measures process is terminated at τ i = min(C i , D i ). Additionally, let ∆ i = I(D i < C i ) denote an indicator variable, and assume that the deaths and censoring times are both continuous and independent. Moreover, let x ij denote the vector of covariates corresponding to the repeated measure for subject i at time j, and z it the vector of covariates corresponding to the terminal event for subject i. Let λ i (t) denote the hazard for the terminal event.
To specify the semi-continuous nature of Y ij , consider two groups of subjects, with group 1 for those with zero costs, and group 2 for those with positive costs. The probability that a subject belongs to group 2, and the parameters associated with the level of costs to some explanatory variables during repeated measurements, can be considered as common random variables, denoted by a i , to capture the variability of subject effects. Since there might be subject effects with regard to hazard rates, we consider common random variables, denoted by b i , to capture the correlations between hazard rates and costs.
With these notations, the conventional two-part joint model (CTJM) proposed by Liu et al. [14] can be defined as: where is the probability that subject i has positive costs at time t ij , given all associated covariates, and µ ij is the logarithm of the costs for subject i. Moreover, consider λ 0 (t) as the baseline hazard function for the terminal event. In addition, α C , β C , γ C , δ C 1 , δ C 2 , and δ C 3 are unknown parameters. The subject-specific random effects correlate the odds of having positive costs, the level of positive costs, and the survival. As discussed in Liu et al. [14], more complicated random effects-such as a random slope-can be easily incorporated into these equations if necessary. In Equations (1) and (2), η C ij and µ ij can be considered only when a patient is still alive. Naturally, the monthly medical costs after death are zero. In a recent study, Rustand et al. [1] further developed this CTJM to analyze longitudinal semi-continuous biomarkers and terminal events from metastatic colorectal cancer data.

Marginalized Two-Part Joint Model for Longitudinal Semi-Continuous Data and Survival Data
Extending the conventional two-part joint model (CTJM) proposed by Liu et al. [14] and Rustand et al. [1], which has been widely used to jointly model semi-continuous data and survival outcomes, we propose a more flexible marginalized two-part joint model (MTJM) by using marginalized two-part (MTP) models instead of conventional two-part (CTP) models for the longitudinal part. This newly proposed MTJM is also an extension of the work of Smith et al. [9], where they considered only the marginalized two-part model for longitudinal medical costs data, without considering the survival data (i.e., without joint modeling).

Marginalized Two-Part Model for Longitudinal Semi-Continuous Data
The general form of the probability density functions (PDFs) for the CTP model [9] is given by: where the probability of observing a non-zero cost-i.e., π ij -can be modeled using a logit link: and the location parameter µ ij from the positive values can be modeled in the second part of the CTP model assuming a log link: Under this parameterization, β C lacks a meaningful interpretation in many conditions. Often, greater interest lies in estimating the effects of covariates on the marginal mean of y ij on the original scale for the overall population of users and non-users. For example, investigators are interested in examining the association between treatment and average costs among health care users and non-users. The conventional two-part model poses challenges to the estimation of this effect [7,9,18,19]. Therefore, a flexible two-part model remains necessary to accommodate dependence between components, while providing an interpretable parameterization of the marginal mean in longitudinal studies.
Smith et al. [9] proposed a novel marginalized two-part (MTP) longitudinal model in order to alleviate the limitations posed by the CTP model, directly parameterizing the effect of covariates on the marginal mean of Y ij .
The MTP model has the same two-part structure as the conventional model, but rather than parameterizing the model in terms of µ ij -the log-scale location parameter of the conditionally positive values-the model is parameterized in terms of ν ij = E Y ij -the overall mean from the combined population of users and non-users. Therefore, for the MTP model, the general form of the PDF [9] can be written as: The MTP model specifies the linear predictors: Under this parameterization, β M is estimated for the entire population, while β C is conditional on Y ij > 0.

Parameter Estimation of the Marginalized Two-Part Joint Model (MTJM)
We extend the previously proposed model to accommodate longitudinal semi-continuous and survival responses. We specify the general form of the marginalized two-part joint model based on CTJM, as before: The common random variables, denoted by a M i and b M i , capture the variability of subject effects and the correlations between hazard rates and costs, respectively. It is important to note that, with the MTJP model, the parameter β M in Equation (6) is interpreted differently from the parameter β C in the CTJP model. In the CTJP model, the parameter β C in Equation (2) is conditional on Y ij > 0, which means that the parameter β C considers the effect of covariates for the subpopulation with positive costs. However, in the MTJP model, the parameter β M in Equation (6) is estimated for the whole population with positive costs in the CTJP model, as well as the subpopulation with zero costs that is not included in the CTJP model. Using the MTP model as parameterized, β M in Equation (6) is estimated for the entire population, while β C in Equation (2) is conditional on Y ij > 0. Using log-normal distribution for f (.) in Equation (4), the overall marginal mean ν ij can be defined as: where σ is the shape parameter in log-normal distribution. With the model formulation in MTJM, the likelihood for the ith subject is: where ω is the vector of parameters including α M , β M , γ M , δ M 1 , δ M 2 , and δ M 3 , and p c M i is the density function for . The first part of the integral results from the odds of having positive costs, and constitutes part I of the twopart model: where I ij is 1 if Y ij > 0, and 0 otherwise. The amount of positive medical costs per month constitutes the second part: and the likelihood of death is: There is no analytical solution to obtain parameter estimates. Therefore, we adopt an adaptive Gaussian quadrature technique that can be implemented conveniently using the SAS procedure NLMIXED (SAS Program in Appendices A and B). In many real-life applications, nonlinear models are required that allow the investigators to specify parameters to the model individually and nonlinearly. For example, in some repeated measurements, it is of interest to fit a model that simultaneously accounts for the overall nonlinear mean structure as well as the variability between and within subjects. In this situation, nonlinear mixed-effects models can be useful, and the fitting of nonlinear mixed-effects models using the SAS procedure "PROC NLMIXED" can be the first choice. PROC NLMIXED fits nonlinear mixed-effects models by numerically maximizing an approximation to the marginal likelihood-that is, the likelihood integrated over the random effects [20,21].

Results
This section is divided into simulation study and real data study.

Simulation Study
In order to validate the proposed MTJM model, a simulation study was designed for the proposed estimation procedure performance to be examined. Repeated measures (such as monthly medical costs) are assumed for subject i = 1, . . . , n at the integer "time" (the month, for example), j = 0, . . . , n i with the baseline of month 0. Data are simulated using the following model: The first time-variant covariate x 1ij = z 1it is simulated from the Bernoulli (p = 0.5) distribution, and the second time-variant covariate x 2ij = z 2it is simulated from a standard normal distribution. The fixed-effects coefficients are set to α M = (14.4, −0.3, 1.6) and β M = (5, 0.05, 1.1). These values, and the coding of the covariates, are obtained from the literature [7]. Additionally, in order to compare the performance between the MTJM and CTJM, we used simulation codes from previous works [13,14] for the CTJM. We take USD 1 as the monthly medical cost's unit. We assume that the positive costs follow a log-normal distribution with σ 2 = 4. The hazard rate of subject i at time t follows an exponential distribution E(λ i (t)). The parameter γ is a vector of coefficient of covariates, and it is assumed to be γ M = (0.1, −1).
The independent censoring occurs from time 1 to time 4, with censoring probabilities of 2% (at time 1), 3% (at time 2), 15% (at time 3), and 80% (at time 4). The follow-up of repeated measures is not available after either death or independent censoring. The times of censoring are set independently to the exponential distribution E(1).
Independent random intercepts a i and b i are considered in the model. We assume that the random effects jointly follow a bivariate normal distribution with mean 0 and , where σ 2 a = σ 2 b = 9 and σ ab = 0.0. We set the coefficients δ M 1 = 1, δ M 2 = 2 and δ M 3 = 3. Note that all three δs are positive, meaning that the chances in the odds of having positive costs, the level of costs, and death hazard have a positive correlation. We set the sample sizes to n = 250, 500, and 1000, and ran 100 replicates for n i = 4 timepoints for the two responses. We used the Gaussian quadrature method to fit the data using five quadrature points. The SAS program containing the simulation study code is provided in Appendix A for readers interested in replicating the simulation study. We fit the MTJM and CTJM with r = σ ab = 0.0, 0.5, 0.8 to data generated under the MTJM with r = σ ab = 0.0.
The bias, standard error, and root-mean-square error (RMSE) of the parameter estimates are shown in Tables 1-3 for n = 250, 500, 1000, and r = σ ab = 0.0, 0.5, 0.8, respectively. Figure 1 graphically illustrates the RMSE over all 100 runs with the MTJM compared to the CTJM.
From Table 1, under n = 250, the estimates' bias is generally close to zero. From Figure 1, we can observe that the RMSE is smaller for larger correlations. The results for these two methods are comparable under n = 250 for the three correlation schemes considered. In addition, the results under n = 500 in Table 2 and n = 1000 in Table 3 are similar to those under n = 250 in Table 1.
Overall, the simulation study demonstrated the superior performance of the MTJM model compared to the CTJM model, and showed that the RMSE decreases as the number of subjects increases; however, the model still demonstrates excellent performance with small samples. We fit these two models to simulated data under an MTJM assumption, so it is expected the MTJM model will outperform the CTJM model. Moreover, the two models are structured differently, so the models should not be discriminated based only on model comparison statistics. We also found that there exist negligible empirical biases for the parameter estimates. The coverage probabilities of the confidence intervals are close to the nominal level of 0.95 (the confidence interval coverage, therefore, is adequate).

Analysis of HDK Data
We analyzed the direct medical costs per physician visit for coronary heart disease patients in Kerman (HDK dataset) as an application of the newly proposed marginalized two-part joint model. As a leading cause of mortality, morbidity, and disability, coronary heart disease imposed a significant economic burden-ranging between USD 4715 and 4908 billion-on the Iranian economic system in 2014 [22].

Data Description
The data were compiled from the Iranian Integrated Care Electronic Health Record (ICHR). The ICHR is a national middleware that creates and manages electronic health records (EHRs) for Iranian individuals. This national middleware is locally called SEPAS, and all patient visits to health care facilities are communicated through it. It has a distributed and service-oriented architecture based on ISO 13606. We focused on the analysis of the out-of-pocket medical costs of coronary heart disease in Kerman. We used data from 1664 patients who were referred to the heart department of the state hospitals of Kerman Province from 2016 to 2018. Each respondent was followed from their first hospital admission until death in the hospital, or censored at the end of 2018 (31 December 2018). Total medical costs were calculated by adding up the costs of each hospital visit. The mean visit rate was four visits. Approximately 11% of patients died in the hospital during the follow-up, and others were censored. For 10% of the total person-months, the direct medical costs were zero. Table 4 shows the studied variables from the variable selection part. In addition, Tables 5 and 6 show the descriptive statistics for the studied variables. The direct medical costs per physician visit were highly right-skewed. At the time of HDK data collection, USD 1 was worth, on average, IRR 35,847 16 .

Detailed Explanations of Variable Selection
In this study, univariate analyses and full stepwise variable selection were conducted using the same selection criteria as for the primary analysis to obtain the final model. Univariate analyses were conducted to determine which variables were associated with the longitudinal semi-continuous responses of costs, and the final survival model was obtained using full stepwise variable selection with the same criteria as for the primary analysis. Selected variables are shown in Table 4.

Fitting the CTJM and MTJM to the Data
The variables gender, age, and place of residence were included in the model as the time-independent covariates. The type of hospitalization variable was included as the time-dependent covariate. Specifically, the full MTJM model was specified as: In addition, the full CTJM model was specified as: To select the best model from among these two candidate models, we used the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The AIC is generally considered to be the first model-selection criterion that should be used in practice. For each model, this value is calculated, and the best model candidate is the one that provides the minimum AIC. Another model selection criterion based on information theory is the BIC, where the model with the lowest BIC considered the best model [23].
To facilitate the further understanding of the method developed in this paper with this real data analysis, we included the SAS program in Appendix B.

Parameter Estimation
The adaptive Gaussian quadrature is used with five quadrature points to estimate the model parameter. Table 7 summarizes the parameter estimates, SEs, and the associated p-values for both the MTJM and CTJM. As seen from Table 6, the results of the fitted MTJM model show that: (a) each outpatient tended to have higher odds (i.e., OR = exp α M 1 = exp(1.30) = 3.67, p < 0.0001) of paying the costs per physician visit. Moreover, the direct costs were higher among outpatients (whole population with positive and zero costs), but not significantly (i.e., β M 3 = 0.001, p > 0.05); (b) male patients in the whole population tended to pay higher medical care costs (i.e., β M 1 = 2.497, p-value < 0.0001)-in other words, the direct costs for males were USD 2.497 more than for females; (c) higher age in both groups with positive and zero costs meant greater direct medical costs (i.e., β M 2 = 0.027, p-value < 0.0001). The latter may be explained in two ways: (1) the higher frequency of visits to the hospital by those at a more advanced age, and (2) the high-cost intensive care that older patients with heart disease need toward the end of their lives.
We also fit the CTJM model to this data. The results of the fitted CTJM model show that: (a) each outpatient tended to have higher odds (i.e., OR = exp α C 1 = exp(1.366) = 3.92, p < 0.0001) of paying the costs per physician visit. Moreover, if outpatients had positive costs, their direct costs were higher (i.e., β C 3 = 0.012, p = 0.0004); (b) male patients who had positive costs tended to pay higher medical care costs (i.e., β C 1 = 2.109, p-value < 0.0001); (c) higher age among people with positive costs meant greater direct medical costs (i.e., β c 2 = 0.297, p-value < 0.0001).
In both models, we found that age plays a role in the survival model, but it is not statistically significant (i.e., γ M 2 = 0.499, γ C 2 = 0.479, p-value > 0.05). This means that under an MTJM assumption every 1-year increase in age brings about an exp(0.499) = 65% increase in death. In addition, not being residents of Kerman (i.e., γ M 3 = −0.136, γ C 3 = −0.182, p-value < 0.0001) seemed to be a major mortality risk factor in this sample for both models.
In terms of the association between the three models, the high significance of all δs (p < 0.0001) justifies using a joint model rather than considering the two models separately. The latter shows a significant association between part I and part II, i.e., patients who have positive costs per physician visit tend to pay higher medical treatment costs (i.e., δ 1 = 1.010). Furthermore, it is suggested that the random effects influence death hazards in both parts of monthly medical costs, i.e., a higher mortality rate is observed in patients who had more frequent medical costs and who paid for medical costs more directly (i.e., δ 2 = 1.999, δ 3 = 3.000). In summary, it can be said that when the three outcomes show a stronger correlation, our joint model is preferred over the separate longitudinal and survival models.
The results suggest that both models fit the data adequately. However, the AIC and BIC both indicate that the MTJM model is a more appropriate fit for these data (AIC MTJM = 211,965 vs. AIC CTJM = 222,610 BIC MTJM = 212,035 vs. BIC CTJM = 222,680). Although the two models do not have the same interpretation, the parameter estimates from the CTJM are similar to those estimated from the MTJM model, except in part 2 of the model.

Discussion
In this paper, we developed a marginalized two-part joint model for two correlated longitudinal semi-continuous and survival outcomes using marginal inferences. This model extended the advantages of marginalized two-part models to account for a dependent terminal event, such as death. From a clinical perspective, the MTJM is of special interest because it can account for various clinical responses to treatment, but with a more meaningful interpretation. Prior studies have noted the importance of the extension of the joint model in order to gain a better understanding of the medical cost data [13].
The simulation study demonstrated that the MTJM model performs well compared to those demonstrated by the CTJM model, and showed that larger sample sizes result in smaller bias and RMSE in parameter estimates, and that confidence intervals maintain nominal coverage. Our model still shows excellent performance with smaller samples. The latter is consistent with the results of previous studies [13,14]. The simulation showed that both models tend to be the same for larger correlations and sample sizes, except for the parameters in part two. In the MTJM, to examine such effects on the marginal mean in order to draw conclusions about the impact of predictors on the population as a whole, we parameterized the second part of the models, so it is expected that they should be estimating the same quantity-especially in part 1 and part 3. To differentiate the MTJM from the CTJM, we only made a change in the second part, so it is assumed that the parameters in the second part (the betas) do not represent the same quantity. Our results show that the MTJM works as well as the CTJM, but with marginal interpretation in the second part. These results therefore need to be interpreted with caution. As we mentioned previously, the two models are structured differently, so discrimination between these models should not be based on model comparison statistics alone. We aimed to show that the new model can work as well as previous models, but with the previous model not being able to provide marginal inference.
In the survival model, we used the exponential distribution. We found that the parameter estimation error consistently converges to zero when assuming an exponential distribution. However, current methods cannot provide a more appropriate estimate of parameter distribution if the parametric assumptions are violated [24]. For future studies, we plan to adopt more flexible survival models if these assumptions are violated Our real data application found a significant association between longitudinal medical costs and survival, suggesting that the proposed MTJM model may be a more appropriate tool than two-part models that model the longitudinal medical costs alone for these types of data. The findings of the current study are consistent with those of Xu et al., who found that parameter estimates could be seriously biased when information about the complex survey design was ignored [13].
We also showed that the proposed MTJM model gave better model fits compared to the simpler CTJM model, based on the AIC and BIC model selection criteria. It is recommended to choose the appropriate model via these model selection criteria [25]. Furthermore, the correlation between hospital status and having positive costs in part I, the correlation between gender/age and the level of cost in part II, and the correlation between gender/place of residence and the hazard of death are all statistically significant. These findings further support the importance of taking account of the correlation over time between the probability of incurring positive costs and the level of cost in longitudinal applications [4].
We acknowledge some limitations of our study. Our joint model is defined under several assumptions, such as the normality assumption for random effects and the lognormal assumption for positive costs. For future studies, we plan to adopt a more flexible generalized gamma distribution or other heavy-tailed distributions in part II of our model if these assumptions are violated. A Pareto distribution may better represent the upper tail of the medical costs distribution than the log-normal distribution. In addition, one may consider a Bayesian framework for parameter estimation with the MTJM. Finally, we assumed normal random effects in the model, so it would be interesting to investigate nonnormal random effects in this proposed model. We are actively investigating these ideas.

Conclusions
To address the medical cost data problems-including right skewness, clumping at zero, and censoring due to death and incomplete follow-up-a marginalized two-part joint model (MTJM) was developed in this paper. The simulation study showed that our proposed joint model yielded small biases of parameter estimates when the complex sample design was considered.