The Impact of Misspecified Random Effect Distribution in a Weibull Regression Mixed Model

Abstract: Mixed models are useful tools for analyzing clustered and longitudinal data. These models assume that random effects are normally distributed. However, this may be unrealistic or restrictive when representing information of the data. Several papers have been published to quantify the impacts of misspecification of the shape of the random effects in mixed models. Notably, these studies primarily concentrated their efforts on models with response variables that have normal, logistic and Poisson distributions, and the results were not conclusive. As such, we investigated the misspecification of the shape of the random effects in a Weibull regression mixed model with random intercepts in the two parameters of the Weibull distribution. Through an extensive simulation study considering six random effect distributions and assuming normality for the random effects in the estimation procedure, we found an impact of misspecification on the estimations of the fixed effects associated with the second parameter σ of the Weibull distribution. Additionally, the variance components of the model were also affected by the misspecification.


Introduction
Mixed models are useful tools for analyzing correlated, clustered and longitudinal data that arise from medical, social and behavioral sciences studies.However, this class of models makes a strong assumption about the distribution of random effects.For computational convenience, random effects are assumed to be normal, but this assumption may be unrealistic for some applications [1].Since the random effects are not observable, checking the assumption is difficult, and if the true distribution of the random effects is far from normality, the estimation and inferences could be considerably affected.
In the literature, many studies have evaluated the impacts of mixed models that assume a normal distribution for random effects on several aspects related to the estimation and inference, when in fact the underlying distribution of the random effects is non-normal.This situation was called the "misspecification problem" by Neuhaus and McCulloch [2].In these studies, the covariates are classified into two types.A covariate can be a "between-cluster" covariate, meaning that it is constant over the units in a cluster.Conversely, it can be a "within-cluster" covariate, which means that it varies within a cluster, but it has an average that is constant between clusters [3].Some of these studies are summarized below.
Considering linear mixed models (LMM), Verbeke and Lesaffre [4] used simulations to show that maximum likelihood estimators for fixed effects and variance components are consistent and asymptotically normally distributed, even when the random effect distribution is not normal.Using the relative distance, the authors showed the clear consistency of the maximum likelihood estimators as the total sample size increased.Unfortunately, these conclusive results were not found for other classes of mixed models.
Neuhaus et al. [5] conducted a simulation of a logistic random model (LRM) in which gamma, t-Student and normal distributions were considered for the random effects.They estimated the model parameters assuming a normal distribution for the random effects and found that the estimated parameters were asymptotically biased, but the magnitude of the bias was typically small.
Heagerty and Kurland [6] considered an LRM in which the random effect followed one of four situations: a random intercept that was non-normally distributed, the variance of the random intercept, which depended on a between-cluster covariate, assuming a random intercept, when in fact there was a random intercept and slope, and autocorrelated random intercepts.Through a simulation study using relative bias, the authors found that incorrect assumptions regarding the random effects could lead to substantial bias in the estimates for the fixed effects when the random effect distribution depended on the measured covariates or when there was an autoregressive random effect.
Agresti et al. [7] showed that a considerable loss of efficiency occurred when the random effect distribution differs from the true distribution for three different models: LRM, random effects model for log odds ratio and frailty model.
Litière et al. [8] used simulations to study the impacts of the misspecification of the random effect distribution on the power of the Wald test for the fixed effect parameters in an LRM.Four true distributions for the random effects (normal, power function, discrete and a mixture of two normal) and each simulated dataset were fitted assuming normality for the random effects.They claimed that the misspecification of the random effect distribution could produce a marked increase or decrease in the power of the Wald test, depending on the shape of the random effect distribution.
Litière et al. [9] also studied the impact of misspecification on parameter estimation and hypothesis testing through simulations using an LRM based on a schizophrenia study considering nine different distributions of the random effects.These authors found that the estimates of the variance components were severely affected by the misspecification.Additionally, they found that the coefficient of the within-cluster covariate appeared to be less affected by the misspecification of the variance of less than four for the random effects.With respect to hypothesis testing, they found that misspecification severely affected the power of the test.
McCulloch and Neuhaus [3] used a simulation study to assess the impacts of misspecification on the inference of covariate effects, to estimate the random effects variance and to predict the random effects in an LRM.The authors used the Tukey distribution as the true distribution for random effects, and they fitted models assuming normal and Tukey (with known and unknown parameters) distributions for random effects.They found that the estimation of the intercept may be biased for a random effect distribution far from normal, and for the other parameters, the estimations had low biases.With respect to the prediction of random effects, they found that the mean square error of the prediction was slightly higher for the assumed normal distribution.
Neuhaus et al. [10] were not in agreement with Litière et al. [8] with respect to the LRM.They reanalyzed the scenarios and found that the misspecification of the shape of the random effect distribution led to a minimal increase in the Type II error.Moreover, to demonstrate the effects of misspecification, they argued that the assumed distribution needs to vary while the true distribution is held constant.
Litière et al. [11] presented a rejoinder to Neuhaus et al. [10], in which they defended the approach of varying the true underlying distribution for the random effects while fitting the model by assuming a normal distribution.Litière et al. considered an LRM and argued that, by using both approaches, the power associated with the test for the fixed-effect parameters in the logistic random model may be affected by misspecifying the random-effect distribution.
McCulloch and Neuhaus [12] used theory and a simulation study to investigate the impacts of the misspecification of the random effect distribution on how well the predicted values recover the true underlying distribution and the accuracy of the prediction of the realized values of the random effects.They considered two situations: LMM and LRM, with three random effect distributions.For each generated dataset, the model was fitted by assuming normal and exponential distributions for the random effects.The authors found that the shape of the distribution of the predictions of the random effects did not necessarily match the shape of the true distribution.In addition, the use of an incorrect distribution for the random effects only caused modest degradation of the mean square error of the prediction.
Neuhaus et al. [13] analyzed the impacts of the misspecification on the shape of the joint distribution of the random intercepts and slopes on the estimates and confidence intervals for generalized linear mixed models (GLMM).Through analytical and simulation studies, the estimates of the covariate effects showed little bias.
As shown, until now, most simulation studies focused on LRM or GLMM, and no significant studies have considered other classes of mixed models or other distributions for the response variable.The Weibull distribution is one of the most popular distributions associated with the models that can be applied to diverse areas ranging from engineering (reliability), health (survival analysis) and ecology (Fleming [14], Carroll [15], Abdel-Ghaly et al. [16]).The contribution of this paper is to explore the impacts of a misspecified random effect distribution on the estimation of the parameters in a mixed model with a response variable that follows a Weibull distribution.
Our paper is organized as follows.In Section 2, we present a case study of the lifetime of rubber in an abrasive process.In Section 3, we define the mixed model under consideration.In Section 4, we present the structures and the scenarios that were used in the simulation study, and in Section 5, we analyze the results of the simulation.The final Section 6 provides some concluding remarks.

A Case Study of Lifetime in an Abrasive Process
The data were obtained from an experimental design to study the effect of the density, torque, viscosity and temperature on the wear time of rubber pieces in an abrasive process.Ten pieces were analyzed, and each piece was cut into 15 small sub-pieces.For each piece, the density (g/cm 3 ) and the viscosity (Mooney viscosity) were measured; in the abrasive experiment, for each sup-piece, the torque (Tan δ) and temperature ( • C) were measured.The test involved submitting each sub-piece to an abrasive process to measure the wear time (minutes).Figure 1 depicts the empirical density (left) over time and boxplots for the time for the piece (right).The mean time was 12.9 min with a standard deviation of 8.9 min.Density q qq q q q q q q 1 2 3 In this application, we have clustered data because several sub-pieces were obtained from a specific piece.In this situation, it was appropriate to consider a mixed model to model the response variable.The variables density and viscosity were between-cluster covariates because they were measured over pieces; whereas torque and temperature were measured over the sub-pieces.Several mixed models were fitted to explain the time ij of the sub-piece j within piece i where i = 1, 2, . . ., 10 and j = 1, 2, . . ., 15.Using the Akaike information criterion (AIC) proposed by Akaike [17] and the Schwarz Bayesian criterion (SBC) proposed by Schwarz [18], the best mixed model for the dataset had a response variable showing a Weibull(µ, σ) distribution, and the model can be summarized as: The corresponding maximum likelihood estimates for the model (1) are summarized in Table 1.

Weibull Regression Mixed Model
The Weibull distribution was named after Swedish mathematician Waloddi Weibull (1887-1979).This distribution is important because it describes the failure times of many different phenomena [19].The Weibull distribution has been extensively used and cited in the fields of engineering, chemistry, meteorology, material science, medicine, quality control and biology.More details about the applications were provided by Murthy et al. [20] and Rinne [21].
The importance of the Weibull distribution can be viewed in terms of the parameterizations that were proposed in the literature.Almalki and Nadarajah [22] provided an extensive review of 29 modifications of the Weibull distribution.Bagheri et al. [23] proposed the generalized modified Weibull power series distribution that contains, as special cases, several important distributions that are modifications of the Weibull distribution.Domma et al. [24] proposed the generalized weighted Weibull distribution that includes decreasing, increasing, upside-down bathtub, N-shape and M-shape hazard rates.
In the literature, some regression models have used the Weibull distribution.Silva et al. [25] proposed a log-extended Weibull regression model to analyze survival time for patients with cancer using the performance status at diagnosis, a measure of general fitness, the age of the patient and the number of months since the cancer diagnosis as explanatory variables.Vigas et al. [26] used the Poisson-Weibull distribution to explain the survival time of the patients who were admitted into a heart transplant program using the year of acceptance to the program, the age of the patient, information about previous surgery (where one was yes and zero was no) and if the patient had the transplant within the program (where one was yes and zero was no) as covariates.Prataviera et al. [27] considered a generalized odd log-logistic flexible Weibull regression model to explain the breaking time for a part located in a sugarcane cutting system using the agent causing the failure as the covariate.
Mixed Weibull models have been also used to analyze real datasets.Sohn et al. [28] used a random effects Weibull model to explain the reliability of modules in a fighter aircraft based on the characteristics and operational conditions of the plane.Sohn et al. [29] considered a random effects Weibull regression model to study the occupational lifetime of the employees who join another company, based on characteristics such as position, gender, marriage status and age, among others.Bartolucci et al. [30] proposed a Bayesian mixed Weibull model to explore the survival time of patients with myelodysplastic syndrome by comparing the efficacy and toxicity of two treatments.Lv et al. [31] used a mixed Weibull model for data from a design experiment to explain the lifetime of a thermostat using 12 low-or high-level explanatory variables.
The original Weibull distribution has several parameterizations; however, in this work, we used the parameterization called WEI3, which can be found in Stasinopoulos and Rigby [32].The probability density function, mean and variance for the WEI3 parameterization are as follows: where κ = µ/Γ(1/σ + 1) and Γ(•) is the gamma function.
Assuming that y ij is the j-th observation for the i-th cluster and conditional on the independent random intercepts u 1i and u 2i , the response variable y ij is distributed as an independent random variable with a Weibull distribution as follows: where i = 1, 2, . . ., N represents the number of clusters and j = 1, 2, . . ., n i represents the number of observations within cluster i.In the model ( 5), X 1i and X 2i are known design matrices for the cluster i; thus, X 1i,j• and X 2i,j• correspond to the j-th rows of X 1i and X 2i respectively, and β 1 and β 2 are vectors of unknown fixed effects.The log function included in the model (5) ensures that the linear predictors for µ and σ map to the appropriate values for µ and σ, respectively.The model assumes that u 1i iid ∼ N(0, τ 2 1 ) and u 2i iid ∼ N(0, τ 2 2 ) and uncorrelated.Then, the parameter vector for the model ( 5) ) .The likelihood function for the i-th cluster is given by: where f y is the probability density function of WEI3 and p(•, a) corresponds to the normal density with zero mean and variance a 2 .The integrals in Equation ( 6) do not have a closed form, and approximations are required for a computationally-feasible estimation [33].In this work, we used the Gauss-Hermite quadrature (GHQ) approximation to obtain Equation (6).We choose the maximum likelihood method to obtain the estimates for the parameter vector θ, in which the objective is to maximize the log-likelihood function l(θ) given by:

Simulation Study
We performed a simulation study to evaluate the impacts of misspecification on the estimation of parameters.We adopted the approach used by Verbeke and Lesaffre [4], Agresti et al. [7], Litière et al. [8], Litière et al. [9], Litière et al. [11], Alonso et al. [34] and Alonso et al. [35], in which we varied the true distribution of the random effects while the assumed distribution remained constant.
For the random intercepts u 1i and u 2i , we considered six different distributions: normal, uniform, exponential, log-gamma, log-normal and symmetric mixture of two normal densities that were defined as in McCulloch and Neuhaus [12].The distributions were transformed such that the zero-mean condition was satisfied, and the corresponding variances were equal to the prespecified values τ 2 1 and τ 2  2 for u 1i and u 2i , respectively.Figure 2 shows the respective densities for the case with unit variance.With this choice, we cover a range of densities varying from very symmetric to very skewed distributions.For the simulation study, we considered the next model: with i = 1, 2, . . ., N being the number of clusters and j = 1, 2, . . ., n i the number of observations per cluster.The variables x 1b and x 2b represent between-cluster covariates, whereas x 1w and x 2w represent within-cluster covariates.We considered x 1b , x 1w , x 2b and x 2w distributed as U(0, 1).The vector parameter in the simulation study is given by θ = (β 10 , β 1b , β 1w , β 20 , β 2b , β 2w , τ 2 1 , τ 2 2 ) .The true values for the fixed parameters in the simulation were chosen to ensure that the datasets had a response variable with mean and variance of approximately 12 and nine, respectively.The fixed effects were considered constant with the following values: This selection emulated the dataset shown in Section 2.
We considered equal variances for the random intercepts u 1 and u 2 , and these variances were τ 2 1 = τ 2 2 = 0.5, 1.0, 1.5 and 2.0.The variances greater than 2.0 were not considered because they caused larger values of u 1 and u 2 , which directly affected the parameters µ and σ, which increased the variance Var(Y) that depends on µ and σ shown in Equation ( 4).We considered a variety of different numbers of clusters and observations per cluster, and the values were N = 5, 10, 15, 20, . .., 45, 50 and n i = 5, 10, 15, 20.
The simulation study included 960 cases, given by 6 distributions and 4 variances for the random intercepts, 10 values of N and 4 values of n i .For each distribution setting, the variance, N and n i , we generated 1000 samples with the model given by Equation (8).For each sample, we obtained the estimates of θ for the parameter vector θ that maximized the log-likelihood function in Equation ( 7) using our own code in R Core Team [36].
As in Verbeke and Lesaffre [4], we used the relative distance (RD) to quantify the impact of the misspecification on the estimates.The relative distance is defined as The smaller the value of the relative distance, the lower the impact.We also studied the impact of the misspecification on the estimation procedure using the median of θk .

Results
In this section, we present the results of the simulation study that was conducted to evaluate the impact of the misspecification of the random effect distribution in a Weibull mixed model as defined in Equation (8).The results are shown using figures that are provided in the Appendix A.

Relative Distance between the Estimated Parameter Vector θ and the True Parameter Vector θ
In Figures A1-A4, we can observe the median of the relative distance between the estimated parameter vector θ and the true parameter vector θ versus N for several combinations of n i and Blue was assigned to the symmetric distributions ( , , ), whereas green, orange and red ( , , ) correspond to exponential, log-gamma and log-normal (asymmetric distributions), respectively.Dark blue ( ) corresponds to the normal distribution used as the reference case.Figures A1-A4 have the same y-axis scale to facilitate comparison.
In each figure, the expected pattern of the relative distance decreasing as the number of clusters N increasing is observed.Additionally, we note an increase in the relative distance as the variances τ 2 1 = τ 2 2 increase.If we compare the corresponding panels in Figures A1-A4, the relative distance decreases as the number of observations per cluster n i increases, due to the increase in the number observations available to estimate the parameter vector θ.
From the third and fourth panels in Figures A2-A4, the blue lines tend to be below the asymmetric distributions ( , , ) when N ≥ 30.This means that when asymmetric distributions are used to generate the random intercepts and the model is fitted assuming normality for the random intercepts, the estimated parameter vector θ is far from θ.

Relative Distance between the Estimated Parameter θk and the True Parameter θ k
In Figures A5-A20, the medians are displayed of the relative distance between the estimated parameter θ and the true parameter θ versus N for several combinations of n i and τ 2 1 = τ 2 2 .Again, blue was assigned to the symmetric distributions ( , , ), whereas green, orange and red ( , , ) correspond to exponential, log-gamma and log-normal (asymmetric distributions), respectively.Dark blue ( ) corresponds to the normal distribution used as the reference case.In these figures, six colored lines are found within each panel, excluding the panels for β 1w and β 2w in which the red line ( ) intersects the other lines.The y-axis has different scales that clearly show the variability in the relative distance in each case.
By analyzing Figures A5-A20, we note that the relative distance for each estimated parameter decreases as N increases, except for β 1w and β 2w , in which the relative distance appears to be constant.The intercept β 20 shows the largest relative distance.By comparing Figures A5-A8, in which τ 2 1 = τ 2 2 is constant at 0.5, the relative distance decreases as n i incrementally increases.This pattern is also observed in Figures A9-A20 when the variances are τ 2 1 = τ 2 2 = 1, 1.5 and 2. If the figures are compared with n i fixed, the relative distance increases as the variances τ 2 1 = τ 2 2 increase.The relative distance curves for the estimations of β 10 , β 1b , β 1w , β 2b and β 2w present the same shape as in Figures A5-A20 in which all colored lines tend to be near each other.In fact, for the estimated fixed coefficients (β 1w and β 2w ) associated with the within variables, the relative distance curves are close.For this reason, we see only the red line that was drawn last.
For the estimated intercept β 20 , we note that the lines are slightly separated, but all lines show the same pattern.In half of the figures, the red line (log-normal) has a lower relative distance.
From Figures A5-A20, the blue lines (symmetric distributions) tend to be below the other lines (asymmetric distributions) as n i increases in the estimations for τ 2 1 and τ 2 2 .This means that when the random intercepts have asymmetric distributions, the estimated variances tend to have large relative distances.

Median for the Estimated Fixed Parameters θk
Figures A21-A36 demonstrate the median for the estimated parameters θk obtained in the simulation study versus N for several combinations of τ 2 1 = τ 2 2 and n i .Within each panel, the true value for the parameter and a dashed purple line (----) on the right side of the panel are included to identify if the estimators converge to the correct value.
Figures A21-A36 show some similar patterns.In all figures, the estimations for β 2w have a small bias that decreases as n i increases.As the variances of the random intercepts increase, the estimations for β 20 have a small bias when the true distributions of the random intercepts are normal, a mixture of normal, uniform and log-gamma.When the variances of the random effects are 1.5 or two, the estimations for β 2b tend to have a bias when the true distributions for the random intercepts are normal, a mixture of normal, uniform and log-gamma.The estimations for the variance components (τ 2 1 = τ 2 2 ) tend to have a bias when the true distributions for the random intercepts are log-gamma and log-normal.

Conclusions
This simulation study was conducted to explore the impacts of a misspecified random effect distribution in a Weibull mixed model.The datasets for the study were simulated following six different distributions for random intercepts, but assuming normality for the random intercepts in the parameter estimation procedure.Two measures were used to evaluate the impacts of the misspecification on the parameter estimation, the relative distance and the median of the estimated parameters.
In Section 5.1, we calculated the relative distance between the estimated parameter vector θ and the true parameter vector θ.We found that when the true distributions of the random intercepts were exponential, log-gamma and log-normal (asymmetric distributions), the relative distance tended to be higher for N ≥ 30, variances τ 2 1 = τ 2 2 =1.5, 2.0 and n i ≥ 10.This means that when the true distributions for the random intercepts are asymmetric, the estimated vector θ tends to be far from the true vector θ.
In Section 5.2, the individual relative distance was obtained between each estimated θk parameter and the true parameter θ k .The intercept β 20 demonstrated the largest relative distance, and the estimates of the variance components were severely affected when the random intercepts had an asymmetric distribution.
In Section 5.3, the median for each estimated θk parameter was obtained.From these results, we observed a large bias for β2w that was not identified with the individual relative distance.However, the bias decreased as n i increased.We noted a bias in β20 and β2b for all distributions that were used to simulate the random intercepts.This bias tended to be higher for normal, mixture of normal, uniform and log-gamma when τ 2 1 = τ 2 2 = 1.5, 2. Additionally, we identified a remarkable bias in the estimated variance components when τ 2 1 = τ 2 2 = 1.5, 2 and the random intercepts had exponential, log-gamma, and log-normal distributions.
Despite that we did not consider the variances for random effects higher than two in the simulation study, we think that this could severely impact the measures for the cases when the random effects have asymmetric distributions (exponential, log-gamma and log-normal).
With these findings, we conclude that the misspecification impacts the estimations for the fixed effects β 20 , β 2b and β 2w associated with the second parameter σ of the Weibull distribution.Additionally, the variance components τ 2 1 and τ 2 2 were also affected by the misspecification.For these reasons, in practice, we recommend checking the random effect distribution using diagnostic tests as proposed by Drikvandi et al. [37] and Efendi et al. [38].Then, if the random effect distribution is not normal, use flexible procedures by considering non-normal distributions for the random effects to estimate the model parameters.
Overall, the misspecification of the random effects significantly impacts the estimation of all the parameters of the Weibull mixed model mainly when the distribution of the random effects is asymmetric.Next are the figures for Section 5.2.Next are the figures for Section 5.3.

Figure 1 .
Figure 1.Empirical density function for time (left) and boxplot for time given each piece (right).

Figure 2 .
Figure 2. Probability density function used for random effects, each one with zero mean and unit variance.

Figure A1 .
Figure A1.Median of relative distance between θ and θ versus N for n i = 5.

Figure A2 .
Figure A2.Median of relative distance between θ and θ versus N for n i = 10.

Figure A3 .
Figure A3.Median of relative distance between θ and θ versus N for n i = 15.

Figure A4 .
Figure A4.Median of relative distance between θ and θ versus N for n i = 20.

Model for log(µ) Estimate Std. Error t-Value p-Value
Model for log(σ) Estimate Std.Error t-Value p-Value