Multivariate Mixed Response Model with Pairwise Composite-Likelihood Method

: In clinical research, study outcomes usually consist of various patients’ information corresponding to the treatment. To have a better understanding of the effects of different treatments, one often needs to analyze multiple clinical outcomes simultaneously, while the data are usually mixed with both continuous and discrete variables. We propose the multivariate mixed response model to implement statistical inference based on the conditional grouped continuous model through a pairwise composite-likelihood approach. It can simplify the multivariate model by dealing with three types of bivariate models and incorporating the asymptotical properties of the composite likelihood via the Godambe information. We demonstrate the validity and the statistic power of the multivariate mixed response model through simulation studies and clinical applications. This composite-likelihood method is advantageous for statistical inference on correlated multivariate mixed outcomes.


Introduction
Clinical research, such as toxicity studies and laboratory examinations, can provide relevant information for measuring the effect of various treatments or experiments on the patients. This type of research needs to jointly analyze different experimental outcomes, while the research outcomes collected during the treatment are correlated and mixed with both categorical and continuous variables. For example, we are studying the efficacy of treatments along with the toxicity and adverse drug reactions simultaneously. In this case, the severity level could be measured as discrete or ordinal data, while the clinical examination results such as the blood test measures are continuous. In the traditional approach, these multiple outcomes are analyzed by different linear models to estimate the effects of the treatments together with the relevant clinical and demographic information. However, this approach ignores the correlation between the outcomes and only provides marginal inferences. Thus, it is desirable to develop a multivariate approach, which can jointly model the multiple mixed-type responses with the treatment and clinical covariates.
In the recent literature, there are two main approaches to build a joint model for mixed response variables. The conditional Gaussian distribution model (CGDM) can decompose the joint distribution of mixed response variables into a combination of the conditional distribution and the marginal distribution. In the bivariate mixed case, the conditional Gaussian distribution model can produce a conditional distribution of the categorical variables given the continuous variables and marginal distribution for the continuous variables. In particular, Cox [1] provided Stats 2020, 3 the logistic conditional distribution for binary variables, and Cox and Wermuth [2] extended the model with a probit-type function and showed the potential connection to the latent variable model. Another conditional Gaussian distribution model, referred to as the general location model (GLOM), was proposed by Olkin and Tate [3]. They adopted the opposite factorization, which consists of a conditional normal distribution given the categorical variables and marginal multinomial distribution. Teixeira-Pinto and Normand [4] compared this approach with the models proposed by Sammel et al. [5,6] in a comprehensive review. Yang et al. [7] extended the model to mixed Poisson and continuous response variables through a likelihood-based approach.
The grouped continuous model (GCM) proposed by Anderson and Pemberton [8] and de Leon [9] provides another solution for this problem. The fundamental technique allows the categorical variables to be treated as partitioned continuous latent variables with different nonoverlapping intervals (Poon and Lee [10]; Skrondal and Rabe-Hesketh [11]). This type of transformation allows the latent variables to follow a multivariate Gaussian distribution. Poon and Lee [10] demonstrated maximum likelihood estimation for latent variables with polychoric correlations. As an extension of the grouped continuous model, de Leon [9] proposed the conditional grouped continuous model (CGCM) to build a joint model for the variables mixed with categorical and continuous outcomes. Catalano and Ryan [12], Catalano [13], and Najita et al. [14] applied the conditional grouped continuous model to the studies of fetal toxicity for longitudinal data. Gueorguieva and Agresti [15] proposed that the estimation of correlated mixed response variables can be obtained by the expectation-maximization (EM) algorithm. Zhang et al. [16] extended this to the parameter expanded EM algorithm under the full-likelihood approach. de Leon and Carriére [17,18] developed a general mixed-data model, which aggregates the CGDM and CGCM to jointly analyze correlated nominal, ordinal, and continuous data together.
It remains computationally challenging to estimate the joint distribution of multivariate mixed data. The composite-likelihood method is based on compounded lower-dimensional distributions, which offers an alternative solution to the estimation problem (Lindsay [19], Cox and Reid [20], Varin [21], Varin et al. [22], and Reid et al. [23]). Faes et al. [24] applied this method for longitudinal data with mixed outcomes. In their model, the correlation structure is induced by the random effect, which does not have a closed-form expression. We aim to follow the approach of the CGCM and use the composite-likelihood method to analyze the joint distribution of multivariate mixed-type response variables, where the categorical response variables are modeled by continuous latent variables. The parameters of the mean structure, as well as the correlation among different outcomes, can be estimated simultaneously through the numerical algorithm. The proposed composite-likelihood method consists of three types of bivariate joint densities: two continuous outcomes; two discrete outcomes modeled by two continuous latent variables; and two mixed outcomes with one continuous and one categorical variable.
We provide the numerical algorithm for composite-likelihood estimation and discuss the asymptotical properties of the composite-likelihood estimates. In addition, we derive three composite-likelihood test statistics for joint inference on the multivariate mixed response model. Simulation studies were conducted to examine the empirical performance of the proposed method in comparison with the conventional approaches. The algorithm was applied to the clinical data from a colorectal cancer study. We analyze the effect of the treatment and other clinical factors' on multiple correlated responses of the patients.

Model Setup
Suppose there are n observations z 1 , z 2 , . . . , z i , . . . , z n in a clinical dataset, and each observation contains q multiple outcomes z i = (z i1 , z i2 , . . . , z ij , . . . , z iq ) T , which are correlated and mixed with continuous and binary variables. Suppose we wish to model the effects of a collection of covariates, and the generalized linear model can be constructed for each outcome as in which the covariate x ij with i = 1, 2, . . . , n and j = 1, 2, . . . , q for different responses can be the same or different, and g j denotes the link function used for the jth response. If we want to analyze these multivariate mixed-type outcomes simultaneously with the corresponding covariates, the conventional likelihood-based approach needs to identify the multivariate joint densities of all responses. Alternatively, we can set up the pairwise likelihood function for responses z ij and z ik as where θ jk denotes the parameters associated with the pairwise likelihood. The log-likelihood function is given by l jk (θ jk ) = log L jk (θ jk ), and the score function is given by According to Lindsay [19] and Cox and Reid [20], the pairwise composite-likelihood function of these q response variables is the product of ( q 2 ) paired likelihood functions: and the score function is constructed by defferentiating the composite log-likelihood function: Our multivariate mixed response model estimates the parameters of interest θ by numerically solving the composite score Function (3) equal to 0 through the Newton-Raphson [25] method. Since the outcomes are mixed with continuous and categorical variables, the score Function (2) can be derived under three different bivariate structures: outcomes with both continuous response variables, outcomes with both binary response variables, and outcomes mixed with one continuous and one binary response variable.

Case 1: Continuous Outcomes
We first discuss the case that both responses z ij and z ik are continuous and assume they follow a bivariate normal distribution: where the mean structure of z ij and z ik is associated with the generalized linear models as µ ij = x T ij β j and µ ik = x T ik β k . Thus, the pairwise density of z ij and z ik is the same as a bivariate normal density based on Johnson and Wichern [26]: The pairwise composite-likelihood function can be constructed by with the parameters of interest θ jk = {β j , β k , σ j , σ k , ρ jk }, and the log-likelihood function l jk (θ jk ) = ∑ n i=1 log f (z ij , z ik ), which gives the score function of both continuous responses used in the Equation (2):

Case 2: Binary Outcomes
For binary outcomes, the transformation can be obtained using the threshold value t such that if the latent variable z * ij ≥ t, then the response z ij = 1, otherwise z ij = 0. Without loss of generality, the value of t is set to 0, and we have µ ij = P(z ij = 1) = P(z * ij ≥ 0). We follow the grouped continuous model settings with the distributional assumptions z ij ∼ Bernoulli(µ ij ) and z * ij ∼ N(µ * ij , σ 2 j ). Thus, the model can be constructed with the covariates x ij as Since the mean µ * ij and the variance σ 2 j are not identifiable at the same time, we follow the rescaling method based on Dunson [27] by setting σ 2 j = 1. Therefore, the mean of the latent variable can be simplified as µ * ij = x T ij β j . Since we jointly analyze binary variables z ij and z ik , the paired latent variables z * ij and z * ik are generated from a model with the polychoric correlation. Thus, the joint function of (z ij , z ik ) can be derived by following composite-likelihood functions in four cases where ±ρ jk represents the polychoric correlation under different scenarios and Φ 2 denotes the bivariate normal cumulative density function. The equations above can be rewritten as with s ij = (2z ij − 1) and s ik = (2z ik − 1). The log-likelihood function of paired binary responses is The score function can be derived as

Case 3: Mixed Outcomes
When one response variable z ij is binary, and another response variable z ik is continuous, we can model the mixed outcomes together by adopting a latent normal variable z * ij as the conditional grouped continuous model. Thus, they follow a bivariate normal distribution given the polyserial correlation ρ jk : The factorization method can be applied for the joint bivariate normal distribution with a marginal density for the continuous response P(z ik ) and a conditional density for the latent variable P(z * ij |z ik ) given the continuous response.
1. z ik is the continuous response , the conditional probability P(z * ij |z ik ) can be written as The pairwise likelihood function and log-likelihood function is given as follows: The score function can be obtained as A similar formulation was extended to the longitudinal settings by Najita et al. [14]. More derivation details are included in Appendix A.
The model setting above can simplify the score Function (3) with three types of components as in Equations (4)- (6). We can obtain the maximum composite-likelihood estimateθ (MCLE) for the parameters θ by solving the score function numerically. Our numerical algorithm is presented below (Algorithm 1): 1. Select the initial values θ (0) of the parameters. Calculate the score function U(θ (0) ) in Equation (3) and the Hessian matrix ∇ θ U(θ (0) ) as the derivative of the score function; 2. Implement the Newton-Raphson method to obtain the updated parameters θ (t) with t = 0, 1, · · · ; 3. Calculate the score function and Hessian matrix with the updated parameters θ (t+1) ; 4. Repeat step 2 and 3 until convergence.

Statistical Inference Using the Composite-Likelihood Method
Cox and Hinkley [28] and Kent [29] presented various hypothesis testing procedures using the full-likelihood function. The composite-likelihood function can be treated as the misspecified-likelihood function. Its asymptotic properties were reviewed and discussed by Varin et al. [22], Reid et al. [23], Jin [30], and Gao and Song [31]. Following this framework, the pairwise composite-likelihood function implemented in our proposed model produces the estimators, which are consistent and asymptotically normally distributed.
The Godambe information [32] G of the parameters θ for the log composite-likelihood function involves the sensitivity matrix H and the variability matrix J: where the sensitivity matrix and variability matrix are defined as where U(θ; z i ) denotes the score function of the ith observation and the total score function U(θ) = ∑ n i=1 U(θ; z i ).
Theorem 1. Let θ 0 ∈ d denote the true parameter value to set up the multivariate mixed response model. Under regularity conditions, n → ∞, the maximum composite-likelihood estimatorθ is asymptotically normally distributed as Under regularity conditions, n → ∞, the maximum composite-likelihood estimatorθ is consistent The sensitivity matrix H and the variability matrix J can be evaluated by the empirical estimates under the maximum composite-likelihood estimators: Furthermore, according to Theorem 1, the composite Wald statistic, the composite score statistic, and the composite-likelihood ratio statistic for testing the null hypothesis H 0 : θ = θ 0 are given respectively by

Testing with Nuisance Parameters
Suppose the parameters are partitioned as θ = {ψ, λ} with ψ ∈ s , λ ∈ p , and d = s + p. The parameter of interest is ψ, and λ is treated as the nuisance parameter for hypothesis testing. In this setting, the Godambe information matrix and its inverse can be partitioned as and the inverse of the submatrix pertaining to ψ is given by According to the asymptotic theorem, the composite Wald statistics under the null hypothesis H 0 : which has an asymptotic χ 2 q distribution. Similarly, we define the composite score statistics: where the matrix H ψψ can be obtained by partitioning the inverse sensitivity matrix H. Furthermore, the composite-likelihood ratio statistic can be obtained by with the unrestricted maximum composite-likelihood estimateθ = {ψ,λ}. However, the asymptotic distribution of the composite-likelihood ratio under H 0 is given by ∑ q j=1 λ j χ 2 1(j) , where χ 2 1(j) are independent χ 2 1 variates and λ 1 , λ 2 · · · , λ q are the eigenvalues of the matrix H ψψ, There are different adjustments to this nonstandard weighted chi-square distribution (Rotnitzky and Jewell [33], Geys et al. [34], and Pace et al. [35]). For example, we can apply the adjustment by introducing the scaling factorλ = ∑ q j=1 λ j /q, then the adjusted composite-likelihood ratio has the same asymptotic distribution as W e (ψ 0 ) and W u (ψ 0 ): Therefore, the composite-likelihood method can simplify the modeling of correlated responses with multiple generalized linear models and allow users to conduct statistic inferences on parameters of interest from different generalized linear models. Moreover, we can select a subset of the parameters and conduct the further inferential assessment in the presence of nuisance parameters.

Simulation
Different simulation studies were implemented to show the validity of the multivariate mixed response model. The estimation results from the proposed model are compared with the full-likelihood and marginal approaches, respectively.

Comparison with Maximum Full-Likelihood Estimation
In the multivariate regression with correlated continuous outcomes, the full-likelihood estimation can be conducted without numerical integration. Thus, we can compare the maximum composite-likelihood estimates with the full-likelihood approach through the simulation study. The simulated samples contain four continuous response variables z ic 1 , z ic 2 , z ic 3 , and z ic 4 , which are generated from Equation (8): The covariates {x ic 1 , x ic 2 , x ic 1 , x ic 2 } ∼ N(0, 1) and {y ic 1 , y ic 2 , y ic 1 , y ic 2 } ∼ N(0, 0.5) are independently simulated. The errors are correlated and generated from a multivariate normal distribution N 4 (0, Σ), and the variance-covariance matrix Σ is given by In the simulation, the variances are designed as σ 2 c 1 = 1, σ 2 c 2 = 1, σ 2 c 3 = 2.25, and σ 2 c 3 = 4, and an identical correlation ρ = 0.3 is applied between the errors in the data generating process.
The simulation results ( Figure 1) were obtained through 1000 independent replications. The maximum composite-likelihood estimators demonstrate similar performance when compared with the full-likelihood approach. The simulated results also show that the estimates are close to each other, and the maximum likelihood estimators have slightly higher relative efficiency.

Comparison with the Marginal Approach
For the mixed outcome regression, the full-likelihood approach is computationally challenging, and marginal regression is often resorted to in order to conduct the analysis. We implemented simulation studies to evaluate the performance of our proposed method in comparison with marginal regression. We first tested the overall performance of the point estimates when the outcomes had different levels of dependency and covariates. Next, we focused on the test of the composite statistics. The multivariate mixed response model can provide the statistical inference with nuisance parameters and attains a higher statistical power in terms of dealing with joint inference.

Simulation Settings
We generated the sample data consisting of two binary responses z ib 1 and z ib 2 and two continuous responses z ic 1 and z ic 2 . The binary variables were obtained based on the corresponding latent normal variables z * ib 1 and z * ib 2 through the probit link function: The simulation studies of the responses are based on Equation (9) associated with covariates x i = {x ib 1 , x ib 2 , x ic 1 , x ic 2 } and y i = {y ib 1 , y ib 2 , y ic 1 , y ic 2 }, respectively: We provided different simulation scenarios of the covariates values and three levels of correlation to analyze the response variables with the proposed model. The regression parameters were arbitrarily chosen and set to be fixed values in each simulation study. The errors in Equation (9) follow a multivariate normal distribution N 4 (0, Σ), and the variance-covariance matrix Σ is given by In the following data generating processes, the values of the variance-covariance parameters are set as σ 2 b 1 = 1, σ 2 b 2 = 1, σ 2 c 1 = 16, and σ 2 c 2 = 25, and the correlation is designed at the levels of low (all ρ = 0.3), medium (all ρ = 0.5), and high (all ρ = 0.7), respectively, to assess the underlying model. Since there is no constraint on the sign of the correlation, the negative correlation can be estimated through our algorithm without further assumptions. Our simulation studies focus on the overall performance of the multivariate mixed response model through independent replications.

Point Estimates
Different simulation scenarios were designed to assess the performance of the underlying model on the point estimates by 1000 independent replications. There are two different sets of simulations for the data generating process, and within each setting, we analyze three levels of correlation, respectively. As shown in Table 1, the values of the regression parameters and the standard deviation of the continuous response variables are given across all simulation studies. In the first simulation setting, we provide 300 samples, and the mixed response variables are associated with covariates of distinct values. The covariate sets of x i and y i are identically and independently simulated from a normal distribution N(0, 1), respectively, in each linear model.
In the second simulation setting, we present the multivariate mixed response model dealing with the common covariate. Regarding the data generating process with 1000 samples, all responses share one common covariate, which was generated from a normal distribution N(0, 1), such as x ib 1 = x ib 2 = x ic 1 = x ic 2 in Equation (9). The second covariates y i are from a Bernoulli (0.5), and they are different for each response. This setting represents the scenario in practice when a common factor is included in all of the response models. , and the four responses have different covariates, e.g., x i and y i generated from a normal distribution N(0, 1); † Simulation II: N = 1000, the responses shared one common covariate x i ∼ N(0, 1) and added a different covariate y i ∼ Bernoulli (0.5).
In Table 1, we provide the ratio of the mean squared error (MSE) of the proposed method to the marginal approaches. This ratio represents the relative efficiency of the proposed method in comparison with the marginal method under different settings. In most of the simulation settings, the ratio rates of the MSE are well below 1. When the responses are highly correlated and have different covariate sets, our method can reduce MSE by 50%, which indicates a large efficiency gain.

Statistical Test
The test of composite-likelihood statistics can jointly assess the parameters of interest across different generalized linear models, while the conventional methods cannot achieve this. The simulation studies were conducted to measure the type I error rate and the power in comparison with the marginal approaches.
This simulation study was conducted to perform the hypothesis test. The correlated responses were generated based on Equation (9) with all correlation ρ = 0.3. The parameters of interest are the regression coefficients {β b 1 , β b 2 , β c 1 , β c 2 } of the first covariates across four generalized linear models, and the first covariates x i are independently simulated from N(0, 1). The regression coefficients of the second covariates y i and other parameters are nuisance parameters with y i ∼ N(0, .5). In the simulation study to assess the type 1 error rate, the regression parameters {β b 1 , β b 2 , β c 1 , β c 2 } are equal to zero in all generalized linear models, while other parameters have the same values as the previous simulation in Table 1. To assess the power, we fixed the values of the regression parameters as β b 1 = β b 2 = 0.1 for the binary responses and β c 1 = β c 2 = 0.3 for the continuous responses. In comparison, we combined the results of the marginal approaches by applying the Bonferroni adjustment. Table 2 illustrates the results over 2000 independent replications. The proposed model analyzes all responses simultaneously, and the simulated type I error rates are valid and close to the 0.05. Through the test of the joint effect of the covariate of interest on all responses, the simulated power is enhanced by our proposed model in comparison with the results from the Bonferroni test. As the sample size increases from 500 to 1000, the composite-likelihood statistics produce increased statistical power from approximately 0.800 to 0.989. The overall performance demonstrates that the composite statistics are more powerful than the conventional approach. Table 2. Type 1 error rate and power under different sample sizes (N = 500 and N = 1000).

Type I Error
Power N = 500 N = 1000 N = 500 N = 1000 The likelihood ratio statistic is adjusted as Equation (7), which approximates to a χ 2 4 .

Data Analysis
In this section, the multivariate mixed response model is applied to the clinical data from a colorectal cancer study. The data consist of clinical observations and demographic information on 743 patients, which are mixed with both categorical and continuous data. Our research interest was to evaluate the effect of treatment and other clinical factors on the toxicity outcomes. We focused on four common toxicity events that are related to colorectal cancer treatment. First, we chose nausea and diarrhea as two categorical responses. They are made up of ordinal data measuring the severity of the toxicity from grade 1 to 4. In our model setting, we only considered the occurrence of nausea and diarrhea for each patient. Therefore, these two responses were designed as binary variables, which are coded as 1 if they occurred and 0 if there is no record during the treatment. The continuous responses include two blood test measures, namely the hemoglobin (HGB) count and the white blood cell (WBC) count. Each patient had several blood examinations during the treatment, and we took the highest value for analysis. The explanatory variables contain the treatment effect, demographic information, tumor status, and genetic markers for each patient. There are two different treatment therapies in this colorectal cancer study. The patient demographic and clinical information, such as age, height, and weight, were collected as continuous variables. The tumor identified in either the colon or rectum was recorded as a binary variable. The study also includes the genetic markers, such as PERF1, PERF2, and KRAS, which are binary variables showing the occurrence of mutation. In total, we needed to jointly estimate 68 parameters for the coefficients of four linear models and the correlation between each outcome. Table 3 shows the main result of the effect of the treatment, and the complete result is presented in Appendix B. We can observe that the statistical inference on the effect of the treatment through two models was in agreement. The second treatment therapy resulted in lower measures of the hemoglobin and indicates a negative association with the occurrence of nausea and diarrhea, whereas, the effect difference on the measures of white blood cells is insignificant. We can use the composite statistics to jointly assess the overall effect of this therapy on four responses. Table 4 provides the standard deviation and the correlation of the four clinical outcomes estimated based on the proposed model.
Using the conventional approach, we cannot make a statistic inference across different linear models. The proposed model is able to test the hypothesis H 0 : β b 1 = β b 2 = β c 1 = β c 2 = 0 based on the asymptotical properties of the composite-likelihood function. The test statistics of the composite Wald statistics under the H 0 is approximately 138.5890, the composite score statistics is 476.975, and the adjusted composite-likelihood ratio is 264.3069, which are all greater than the critical value of χ 2 4 . Therefore, we can reject the null hypothesis and conclude that the two different treatments have a statistically significant difference in terms of patient toxicity response. More specifically, in our estimation results, we infer that there exists a significant difference in terms of the occurrence of nausea and diarrhea and a significant difference in HGB between the two treatments.

Discussion
The problem of mixed outcomes is widely discussed in health-related studies. As a result of the computational complexity, most existing methods mainly focus on the case of two outcomes mixed with one continuous variable and one categorical variable. As an extension of the conditional grouped continuous model, we present the multivariate mixed response model to solve high-dimensional mixed multivariate regressions. Our model is constructed using the pairwise composite-likelihood method, such that multiple outcomes are analyzed through different bivariate models simultaneously. Regarding data mixed with continuous and binary responses, our method simplifies the problem of multiple outcomes into three types of scenario, which is both methodologically flexible and analytic appealing. From the simulation studies, the estimators of the proposed model demonstrate a lower MSE than the marginal approaches. The composite statistics also provide increased statistical power for joint hypothesis testing across different generalized linear models, which could make this a favorable approach to analyze clinical data with multiple mixed-type responses.
In addition, the model can be generalized to deal with ordinal and continuous data simultaneously. Under the same setup, the latent variable z * ij is normally distributed with mean µ * ij = x T ij β j and variance σ 2 j = 1. If the latent normal variable z * ij ∈ [t l−1 , t l ) with the threshold values −∞ = t 0 < t 1 < · · · < t l < · · · < t L−1 < t L = ∞, the ordinal variable z ij = l with l = 1, 2, · · · , L. Thus, the bivariate model of two ordinal outcomes can be formulated by the pairwise probability: The score function can be obtained by taking the derivatives of ∑ n i=1 ∑ q−1 j=1 ∑ q k=j+1 log P(z ij = l, z ik = l ), and the thresholds t l 's need to be estimated with the monotone restriction. The maximum composite-likelihood estimation for mixed outcomes can be conducted through the same approaches as proposed in this paper. Further research will be considered to analyze multivariate outcomes with various distributions.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivatives for Mixed Response Variables
The score Function (6) contains the derivative of log marginal normal density and the log of the conditional probability: To illustrate the derivation, let z j and z k represent the n × 1 vector response variables with the design matrices X j and X k , respectively. Let The normal CDF has the following properties: The first component in the score function can be simplified as Stats 2020, 3

216
The derivatives of the second component has a simple format with respect d j : By adding the results above, the score function and its derivative can be given by Furthermore, For the term d j , more derivation is given here: In conclusion, the score function U jk (θ jk ) with respect to each parameter is given by  In Section 4, we illustrate some estimation results of the regression parameters and correlation between outcomes. The complete analysis of these data consists of 68 parameters.