Regression Models for Log-Normal Data: Comparing Different Methods for Quantifying the Association between Abdominal Adiposity and Biomarkers of Inflammation and Insulin Resistance

We compared six methods for regression on log-normal heteroscedastic data with respect to the estimated associations with explanatory factors (bias and standard error) and the estimated expected outcome (bias and confidence interval). Method comparisons were based on results from a simulation study, and also the estimation of the association between abdominal adiposity and two biomarkers; C-Reactive Protein (CRP) (inflammation marker,) and Insulin Resistance (HOMA-IR) (marker of insulin resistance). Five of the methods provide unbiased estimates of the associations and the expected outcome; two of them provide confidence intervals with correct coverage.


Introduction
A common objective in medical research is to identify and quantify associations. For example, this could include evaluating a biomarker or estimating personal exposure levels based on questionnaires and occupational history. In these cases regression analysis is often used. It can also be important to estimate the expected value, e.g., the expected exposure. A person's risk of developing an exposure-caused disease is related to the dose, and the dose is usually estimated by the cumulative exposure. In group-based exposure assessment, the arithmetic mean is considered superior to the geometric mean, as a dose-related variable [1,2]. The arithmetic mean is also preferred, in the form of mean exposure for individuals over time, when assessing long-term effects of exposures [3].
Many biological variables (e.g., exposure and biomarkers) have a skewed distribution with a median smaller than the mean and only positive values. It is also common with heteroscedasticity, where the variance increases with the expected value. Such data can often be described by a log-normal or quasi-log-normal distribution [4][5][6]. A common way to analyze a log-normal variable Y is to log-transform (Z = ln(Y)) so that Z follows a normal distribution with expected value and standard deviation . The geometric mean of Y is then found as exp( ), while the expected value of Y (the arithmetic mean) is found as μ Y = . In cases where the expected value μ Y depends on several predictors, regression analysis is often based on the log-transformed data, Z = , and the expected value of Y is estimated as .
This produces effect-measures on the multiplicative scale and the interpretation is that Y is expected to increase 100(exp(δ i ) − 1) percent as x i increases one unit, see e.g. [7]. We investigated the situation where we want an estimate of the absolute effect, thus we need the model to be linear on the original scale, Y 0 1 1 β β ... β p p X X      , in order to produces effect-measures on the additive scale. This is of interest e.g., in exposure modeling, when exposure time is an important factor and it is reasonable that the effect of time on exposure is linear. Effect-measures on the additive scale have also been discussed in relation to statistical vs.biologic interaction. Biologic interaction occurs when the effect of one cause depends on the presence of another cause, e.g., environmental causes and genetic predisposition, and is often defined as departure from additivity [8,9]. Different regression methods, suitable for log-normal data, were investigated and the aim was to estimate the absolute effect β i of each predictor. Because of the heteroscedasticity, the ordinary least squares regression will produce erroneous tests and confidence intervals. One solution is to use a weighted least squares regression. Another way to handle non-normal distributions is to use a general linear model, GLM, in which the distribution of the response variable Y belongs to the natural exponential family and the expected value of Y is linked to a linear model by a link function, g(μ Y ) = β 0 + β 1 X 1 + ...+ β p X p , see [10]. One example of a GLM that is suitable for the log-normal distribution is the gamma distribution with an identity link. Another possibility is the normal distribution and an exponential link, applied to Z = ln(Y).
We compared the different regression methods using both large scale simulations and by applying them to a cross-sectional data set with the aim to quantify the association of abdominal adiposity with inflammation and insulin resistance (two well-known associations).

Linear Regression with a Lognormal Response
We considered a regression model where the expected value of a continuous log-normal response variable Y is a linear function of the predictors X 1 ,X 2 ,..X p : The variance of Y depends on both the expected value of Y, , and the variance of Z = ln(Y), ; Ordinary least squares regression (here denoted LS lin ) can be used to obtain unbiased estimates 0 β , 1 β , …, β p However, the estimates provided by LS lin assume homoscedasticity, which, as previously noted, is incorrect for a log-normal variable. This incorrect variance assumption leads to incorrect statistical inferences.
In a situation with heteroscedasticity, weighted least squares regression (here denoted WLS) can be used. WLS can account for the heteroscedasticity by weighting each observation, Y i , with the inverse of its variance, . For a log-normal distribution, the weight for Y i is , where LS lin can provide estimates of . Unlike LS lin , WLS provides an estimate of the variance .
When the response Y is log-normally distributed, data are often log-transformed, ln(Y) = Z, and a log-linear model is estimated: where the expected value of Y is . Ordinary least squares regression on Z (here denoted LS exp ) provides estimates of the relative effect ( 0 δ , 1 δ , …, δ p ) as well as an estimate of the variance but no estimates of the absolute effects. Thus, both (1) and (2) can be used to estimate Y X  and . The reason for including LS exp , even if the linear model in (1) is assumed, is that LS exp is commonly used for log-normal data. The log-normal distribution is often approximated by the gamma distribution, with parameters μ (expected value) and ν (scale parameter, Var[Y] = μ 2 /ν). A generalized linear model (GLM) with gamma distribution and the identity link (denoted GLM G ), provides estimates 0 β , 1 β , …, β p and an estimate of can be found through the transformation .
Another GLM that can be used to estimate the absolute effects is one with a normal distribution and the link function exp(*), applied to Z = ln(Y), here denoted GLM N , such that The expected value of Y is then found as .
The method GLM N , does not, however, take into account the stochastic variation due to estimating . Therefore we also used a maximum likelihood method (ML LN , see [11,12]) based on the likelihood function of the log-normal distribution: The estimates 0 β , 1 β , …, β p and are found using iterations, for example the Newton-Raphson iteration used here [13].

Confidence Intervals
For LS lin , WLS, GLM G and ML LN , a 95% confidence interval for Y X  is estimated as where the sample-specific variance is estimated as: where x 0 = 1, v ar(β ) For GLM N , a confidence interval is estimated as where the sample-specific variance of the linear estimator is estimated as: For LS exp , a confidence interval for is estimated as , using the modified Cox method [14].
The sample-specific variance is estimated as: where x 0 = 1, v ar(δ ) i and ˆĉ ov(δ , δ ) i j are the sample-specific estimates of the variance and the covariance.

Simulation Model
In a simulation study we compared the large-sample properties of the methods for estimating the expected value of Y and the effect of each predictor, when data follow a log-normal distribution. To obtain a realistic scenario, a simulation model was estimated from a real-life data set on personal exposure to PM 2.5 -particles in Sweden. These data are described in [15]. PM 2.5 is the mass (microgram/m 3 ) of particles smaller than 2.5 micrometers, which implies that they are small enough to bypass the respiratory defenses and enter into the lungs. Increased levels of PM 2.5 have been associated with increased mortality from cardiovascular disease and lung cancer [16,17]. Several sources contribute to the personal exposure to PM 2.5 , two of them are tobacco smoke and traffic exhaust [18].
The expected outcome, personal exposure to PM 2.5 -particles (μg/m 3 ), was assumed to be a linear function of the number of cigarettes per day, Smoke, and residential outdoor concentration of PM 2.5 (μg/m 3 ), ConcOut: Observations were then simulated according to the model Z = ln( )-0.383 2 /2 + , where ε~N(0, σ Z = 0.383). In order to facilitate interpretation and comparison without the introduction of unnecessary variation, balanced data were used in the simulations, with the following values of the explanatory variables: ConcOut = {2, 8, 14}, Smoke = {0, 7, 14}. Thus we estimated the expected PM 2.5 exposure for 9 combinations of outdoor concentration and cigarettes smoked. Simulations with 10,000 replicates were used to evaluate the potential bias in the estimates of β 0 , β 1 and β 2 , the sample- also the properties of confidence intervals for μ Y .

The DIWA Data Set
The DIWA dataset is a population-based cohort of 64-year-old women from the city Gothenburg in Sweden and has previously been described in detail in [19]. Of the 2,595 women who was screened 9.5% had diabetes mellitus (DM) [20], and of these 230 participated in the study, together with similar sized, randomly-selected groups of women with impaired glucose tolerance (IGT, n = 209) and normal glucose tolerance (NGT, n = 190). The World Health Organization criteria for capillary glucose cut-off values were used to define diabetes and impaired glucose tolerance [21]. Insulin resistance was also assessed, as well as a large number of biomarkers including high sensitivity C-reactive protein (hS-CRP). The examination also included a questionnaire regarding medical history and lifestyle factors, including smoking habits (never smoker, past smoker and smoker) and recreational physical activity (<2 h/week and ≥2 h/week). Body weight and waist circumference were also measured.
CRP is an acute-phase protein found in blood serum and its levels increase during an inflammatory process. CRP is mainly used as an inflammatory marker in clinical practice and should, for a healthy person, be less than 5 mg/L. Diabetes, smoking, obesity and insulin resistance are all been associated with small increases in CRP-levels as assessed by high sensitivity methods [22][23][24][25].
Insulin resistance is a condition where the body has a reduced ability to respond to the insulin hormone which can cause blood glucose to rise above normal levels. Insulin resistance can lead to type 2 diabetes and cardiovascular disease. Even if insulin resistance is most common among persons with diabetes mellitus of type 2 or impaired glucose tolerance, it is also present in about 25% of non-obese persons with normal glucose tolerance, [26]. Obesity, and in particular abdominal obesity, is associated with increased insulin resistance [27,28]. Other factors are smoking and low physical activity [29,30]. In our study, insulin resistance was measured using the homeostasis model assessment of insulin resistance (HOMA-IR), which is a mathematical formula for quantifying insulin resistance [31]; HOMA-IR is the product of fasting serum glucose and fasting serum insulin (fasting serum glucose (mmol/L)·fasting serum insulin/22.5). A cut-off value around 2.5 is often used as an upper limit for normal HOMA-IR [32][33][34][35].

Bias and Standard Deviation of the Regression Coefficients (Simulation Study)
In the simulation study, balanced data sets were computer-generated using the model in Section 2.2, with two explanatory variables (Smoke and ConcOut) each with three levels. To obtain a balanced sample with at least 100 observations, the sample size n = 108 was used. For each sample, coefficients of the regression model were estimated, along with the expected outcome (personal exposure) and its confidence interval. All methods except LS exp provided unbiased estimates of the regression coefficients. Among the absolute-effects methods, GLM N tended to have the best precision (smallest SD). The sample-specific standard errors, se, were close to the true standard deviations, SD. All methods except LS lin provided reasonable estimates of , although the transformed scale parameter from GLM G was too small (Table 1).
All methods except LS exp provided an unbiased estimate of the expected value. The interval length was similar between WLS, ML LN , GLM G and GLM N , but tended to be smaller for the two GLM methods (Table 2). LS lin had the largest standard deviation, especially for small and large values of μ Y . Among the methods that provided an unbiased estimate of μ Y , GLM N had the smallest standard deviation. For all methods except LS lin , the sample-specific standard error tended to be an underestimation ( > E[se( )]), Table 3. All methods except LS lin and LS exp provided coverage close to the nominal, but both GLM G and GLM N tended to give too low coverage, whereas ML LN was slightly better. Using LS lin resulted in too high coverage for low values of μ Y , and too low coverage for large values. LS exp provided too low coverage both for low and high values (Table 4). Table 4. Actual coverage of the 95% confidence interval for Y  based on the sample-specific standard error (results from simulation with n = 108 observations and r = 10,000 replicates).

Application of the Regression Methods to the DIWA Dataset
The DIWA dataset consists of data from approximately 600 women for which a large amount of data, related to diabetes and obesity, were collected. Descriptive statistics for CRP, waist circumference and HOMA-IR are presented in Table 5, separate for each glucose tolerance group.

Regression Models for C-Reactive Protein (CRP) and Insulin Resistance (HOMA-IR)
For CRP, the start model in the multivariable regression analysis included smoking, physical activity, waist circumference (WC), insulin resistance (HOMA-IR) and glucose tolerance (GT), where GT was classified into three categories: normal glucose tolerance, impaired glucose tolerance and diabetes mellitus. We used a model that allowed for different associations for the GT groups, by including the interaction terms WC·DM and WC·IGT. The final model, based on backward elimination using ML LN , contained WC and HOMA-IR, but no interaction term, thereby implying that the association with WC could be similar for the three GT groups (Figure 1).
For HOMA-IR, the start model in the multivariable regression analysis included WC, physical activity and smoking, and we allowed for possible different association with WC for the different glucose groups by including the interaction between waist circumference and glucose tolerance.   The parameter estimates and 95% confidence intervals for the different regression methods, when estimating HOMA-IR as a function of waist circumference (WC) and the interaction between WC and glucose tolerance group (normal glucose tolerance, impaired glucose tolerance and diabetes mellitus), using n = 598 observations. The estimated standard deviation, , and the average length of the confidence intervals for μ Y , (estimated from the models presented in Figures 1 and 2), are given in Table 6. ML LN , GLM N and LS exp gave similar estimates of σ Z (this parameter cannot be estimated by LS lin ). WLS provided the largest estimate whereas GLM G gave the smallest. ML LN and GLM G had similar confidence intervals for the expected value, μ Y , GLM N had the shortest intervals, whereas LS lin had the longest intervals.

Quantification of Factors Associated with CRP and HOMA-OR (Method Comparison)
All of the methods demonstrated that WC was a significant predictor for CRP. According to the absolute-effects methods (LS lin , WLS, GLM G , GLM N and ML LN ), the CRP was expected to increase about 1 mg/L (between 0.74 and 1.07 mg/L) for every 10 cm in WC and, according to the relative-effects method (LS exp ), the expected increase was 49% for every 10 cm in WC (exp(0.40) -1 = 0.49), Figure 1. All methods showed a positive association between HOMA-IR and CRP. The expected increase in CRP was between 0.12 and 0.42 mg/L for every unit increase of HOMA-IR in the absolute-effects methods and 3% per unit of HOMA-IR for the relative-effects method. The association with HOMA-IR was not significant for LS lin and very high for GLM G and WLS (0.41 and 0.42, respectively). The point estimates from all methods had the same sign and for the absolute-effects methods the confidence intervals for β WC overlapped, as did the intervals for β HOMAIR , Figure 1.
All methods found a positive association between HOMA-IR and WC in all glucose tolerance groups, Figure 2. Further, the results showed that women with DM had a significantly stronger association with WC than women with NGT, and this was significant for all methods. The results also indicated a stronger association with WC for women with IGT, compared to women with NGT; the interaction term for WC•IGT was significant for all absolute-effects methods except LS lin . Among the absolute-effects methods, HOMA-IR was expected to increase 0.64-1.00 per 10 cm WC for women with DM, 0.42-0.74 for women with IGT and 0.39-0.70 for women with NGT. The relativeeffects method showed an expected increase in HOMA-IR of 39% per 10 cm for women with DM, 31% for women with IGT and 27% for women with NGT. Z Ẑ ˆZ ˆ

Discussion
Several methods for estimating a linear regression on log-normal data were compared. Much research has investigated making inferences, including confidence interval, of the expected value of a log-normal distribution, e.g. [36][37][38][39][40]. Here we considered the situation where the systematic part of the model for the outcome Y should be additive on the original scale Had we made the assumption that the systematic part was multiplicative, the regression coefficients could have been estimated either with a GLM using gamma distribution and the log link, or by a GLM using a normal distribution and identity link for Z = ln(Y), which give similar results [41,42]. But we wanted a model for estimating the absolute effect of each explanatory factor. In exposure assessment, we often want to assess the personal exposure to e.g., a specific compound in the air, by using a model that includes the important exposure determinants. Here the quantity is an important factor (e.g., time spent in different micro-environments, number of cigarettes smoked) and it is reasonable that the effect is linear. A linear model can also be used to estimate biologic interaction, discussed in Section 4.3 below. Six methods were compared; four of them directly modeled the expected value of Y as a linear function of the explanatory variables, and finally the common method based on log-transformation was included for comparison, Evaluation was made both using simulations and by applying the methods to a large data set to estimate well-known associations of abdominal adiposity (waist circumference, WC) on inflammation (measured using C-reactive protein, CRP) and insulin resistance (measured using HOMA-IR), respectively.

Method Comparison
In a simulation study we evaluated the regression methods in a situation where the expected outcome is a linear function of two explanatory variables. All methods except LS exp provided unbiased estimates of the regression coefficients and the expected outcome, but the sample-specific standard error, ( β ) i se , tended to be too small, thus overestimating the power. For LS lin , the assumption of a constant variance for Y resulted in confidence intervals for μ Y with unnecessary high coverage for small μ Y -values and too low coverage at large μ Y -values. LS exp does estimate the relative effect rather than the absolute and as a result the estimated expected values were biased and the coverage of the confidence intervals was erroneous. The confidence intervals from the GLM G method had too low coverage, as a result of the underestimation of the variance . This is contrary to the situation with a multiplicative model, where the gamma distribution often provide reasonable estimates when applied to a log-normal variable [41,42]. ML LN , WLS and GLM N provided approximately correct coverage, although GLM N had a tendency to underestimate, as a result of using the estimate , thus not including the stochastic variation of in the interval estimation. An approximate confidence interval taking into account its stochastic variation could be derived using Taylor expansion, see e.g. [43]. The methods were applied to two approximately log-normal response variables, CRP and HOMA-IR (almost 600 observations). The model for CRP contained WC and HOMA-IR, and the model for HOMA-IR contained WC and the interaction between WC and glucose tolerance groups When comparing confidence intervals for β and for μ Y , ML LN and GLM N consistently had narrower confidence intervals than WLS (and LS lin ). From the simulation we saw that WLS tends to overestimate the variance. Because of underestimation of , GLM G had narrower intervals than ML LN and GLM N for μ Y , but from the simulation we know that the coverage will be too low. Thus ML LN will have a higher power and for lognormal data the probability of detecting a true explanatory variable is higher. The smaller interval lengths of ML LN corroborate the results of a previous simulation study [11].

Factors Associated with CRP and HOMA-IR, Respectively
Using all methods, the analysis demonstrated a significant positive association between CRP and WC. Associations between CRP and several measures of obesity and abdominal adiposity have been shown in a number of studies [44][45][46][47], and some studies indicate that abdominal adiposity has a stronger association with inflammation than total adiposity [48][49][50]. For CRP we could not find any significant interaction between glucose tolerance group and waist circumference, thus our results did not indicate that the association between obesity and the inflammation marker depends on the degree of glucose tolerance. Many studies have been based on only one or two of the GT groups, [24,[51][52][53]. Our study showed an expected increase in CRP of between 0.74 and 1.07 mg/L per 10 cm increase in WC for the absolute-effects methods and 49% per 10 cm for the relative-effects method. All methods, with the exception of LS lin , showed a significant positive association between CRP and HOMA-IR. The lack of significant association using LS lin can probably be explained by the estimates of the variance. In the LS lin method the heteroscedasticity is not taken into account.
In the analysis of HOMA-IR, all methods identified WC as a significant predictor for HOMA-IR. There was also a significant interaction between glucose tolerance group and waist circumference, thus the absolute-effects models showed a departure from additivity. These results cannot be interpreted causally, but the interaction indicates that obesity might affect insulin resistance more for women who have diabetes mellitus compared to those with normal glucose tolerance. All models methods found a significantly stronger WC-association for women with DM compared to women with NGT, and all methods (apart from LS lin ) also had a significantly stronger WC-association for women with IGT compared to NGT. From the simulation we know that LS lin has larger standard errors than the other methods and thus lower power. The relative-effects method LS exp also showed a significant interaction between glucose tolerance group and waist circumference, i.e., departure from multiplicativity.
Even if HOMA-IR typically has a skewed non-normal distribution, regression analyses have been performed using both untransformed and log-transformed HOMA-IR values, see [54,55] shows an expected increase in HOMA-IR with 3.5 units per 10 cm WC, using LS lin on persons with DM, to be compared with 0.64-1.00 units in our study. The difference in association might be explained by the fact that the previous study included both men and women of different ages [56] uses the method here denoted LS exp and finds a positive association; about 22% per 10 cm WC, while we found the association to be stronger; 27%-39%.

Model Choice
The choice between an additive or multiplicative model affects the interpretation of the estimated coefficients. The aim of a regression analysis might be simply to test whether there is a significant association between an outcome and a potential explanatory variable. Another aim can be to quantify a specific association (e.g., the absolute or relative effect), or assess the biologic interaction. If the study is purely exploratory, using epidemiological data, residual analysis can be used to decide which model that fits the data best. The model choice might be based on previous knowledge, e.g., about the biological process, from experimental studies.
In risk-modeling, a log-linear model is often used, φ(Z, β) = exp(α 0 + α 1 X 1 + … + α k X k + βZ), where φ can be the odds ratio or rate ratio function, X 1 -X k are covariates and Z is the exposure variable of interest. In this model the ratio has an exponential dependence on Z; exp(βZ). However, linear models have also been discussed, see [57], for example in radiation epidemiology, where the linear relative rate model φ(Z,β) = exp(α 0 + α 1 X 1 + … + α k X k )(1-βZ) allows the rate ratio to increase linearly with the dose Z [58].
Not only the main effects but also potential interactions can be of interest. Interaction in a statistical sense is scale dependent, e.g., an absence of interaction in absolute-scale will lead to interaction in log-scale. An interaction in a linear absolute-effects model is additive, while an interaction in a log-linear relative-effects model is multiplicative. In epidemiology, an additive interaction (effect-modification on the absolute scale) is often considered more important when assessing public health impact, and seems to correspond more to biologically based notions of interaction [9,59,60]. There is a need for regression methods that can assess biologic interaction, as discussed in several articles. In logistic regression it is implicit that we have a multiplicative statistical relation and if an additive biological model holds, the logistic analysis would require three parameters to summaries the joint effects of only two variables, [61]. Additive interactions are given directly in a linear model, however a logistic regression model can be defined in such a way that additive interactions (e.g., biologic interaction) can be assessed [62].

Strengths and Weaknesses
Five regression methods for estimating associations on the absolute scale of the explanatory variables were compared, with regard to bias and standard deviation for the estimated coefficients and also with regard to the estimated expected outcome and its confidence interval. In addition, the standard method for log-normal data (log-transformation) was evaluated. The comparison of the methods was made both in a simulation study and using two examples. The absolute-effects methods provide similar results for the association with the predictors for CRP and HOMA-IR, respectively. The results from the examples are consistent with those from the simulations.
The aim of this study was not to provide a complete statistical model of which factors that are associated with CRP and HOMA-IR, but to compare the statistical methods. The number of factors in the regression models was therefore kept small; the simulation model only included two explanatory variables and in the models for CRP and HOMA-IR, only those variables that were significant after backward elimination using ML LN were included. Thus, all factors were significant for ML LN (and also for GLM N ). This could be seen as an advantage for these methods, compared to for example a situation in which LS exp had been used to select the model. However, since we assume a linear model (i.e., absolute effects) it is natural to use a method that can estimate the absolute effects in the model selection process. We also wanted the method that was expected to have a high power, and based on previous studies, [11], ML LN was expected to have higher power than e.g., WLS and LS lin .

Conclusions
In medical research we often want to identify and quantify associations using regression analysis. Log-normal data are common and there are situations when the absolute effects are of interest (rather than the relative) and thus there is a need for linear regression methods on untransformed log-normal data. We have evaluated several regression methods using both large scale simulations of personal exposure to PM, and by applying the methods to data on biomarkers (CRP and HOMA-IR). The LS exp does not provide estimates of the absolute effects and the expected outcome can be biased. The LS lin and GLM G provide correct point estimates of the expected outcome, but confidence intervals with incorrect coverage. The ML LN and GLM N worked best (unbiased estimates, narrow confidence intervals), although ML LN tends to have a slightly more correct coverage for the confidence intervals.