This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
We compared six methods for regression on lognormal heteroscedastic data with respect to the estimated associations with explanatory factors (bias and standard error) and the estimated expected outcome (bias and confidence interval). Method comparisons were based on results from a simulation study and on the estimation of the association between abdominal adiposity and two biomarkers: C-Reactive Protein (CRP; inflammation marker) and Insulin Resistance (HOMA-IR; marker of insulin resistance). Five of the methods provide unbiased estimates of the associations and the expected outcome; two of them provide confidence intervals with correct coverage.
A common objective in medical research is to identify and quantify associations. For example, this could include evaluating a biomarker or estimating personal exposure levels based on questionnaires and occupational history. In these cases regression analysis is often used. It can also be important to estimate the expected value, e.g., the expected exposure. A person’s risk of developing an exposure-caused disease is related to the dose, and the dose is usually estimated by the cumulative exposure. In group-based exposure assessment, the arithmetic mean is considered superior to the geometric mean as a dose-related variable [
Many biological variables (e.g., exposure and biomarkers) take only positive values and have a skewed distribution with a median smaller than the mean. Heteroscedasticity, where the variance increases with the expected value, is also common. Such data can often be described by a lognormal or quasi-lognormal distribution [
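The properties described above follow from the standard lognormal moment formulas. The sketch below evaluates them for two hypothetical parameter settings (the values of `mu_z` and `sigma_z` are illustrative assumptions, not taken from the study) to show that the median lies below the mean and that the standard deviation grows in proportion to the mean.

```python
import math

def lognormal_summary(mu_z: float, sigma_z: float) -> dict:
    """Summary statistics of Y = exp(Z), where Z ~ Normal(mu_z, sigma_z**2)."""
    mean = math.exp(mu_z + sigma_z**2 / 2)
    median = math.exp(mu_z)
    cv = math.sqrt(math.exp(sigma_z**2) - 1)  # coefficient of variation, free of mu_z
    return {"mean": mean, "median": median, "sd": cv * mean, "cv": cv}

# Hypothetical settings: same log-scale SD, different log-scale means
low = lognormal_summary(mu_z=0.5, sigma_z=0.4)
high = lognormal_summary(mu_z=1.5, sigma_z=0.4)
```

With a common log-scale standard deviation the coefficient of variation is constant, so a larger expected value implies a larger variance, which is the heteroscedasticity pattern the text refers to.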
We investigated the situation where we want an estimate of the absolute effect, thus we need the model to be linear on the original scale,
Different regression methods, suitable for lognormal data, were investigated and the aim was to estimate the absolute effect β
We compared the different regression methods using both large-scale simulations and a cross-sectional data set, with the aim of quantifying the association of abdominal adiposity with inflammation and insulin resistance (two well-known associations).
We considered a regression model where the expected value of a continuous lognormal response variable
Ordinary least squares regression (here denoted LS_{lin}) can be used to obtain unbiased estimates
In a situation with heteroscedasticity, weighted least squares regression (here denoted WLS) can be used. WLS can account for the heteroscedasticity by weighting each observation,
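As a rough illustration of weighted least squares under heteroscedasticity, the sketch below fits a single-predictor line and iterates the weights. The weight choice (inverse of the squared fitted mean, i.e., standard deviation assumed proportional to the expected value) and the function names are assumptions made for this example; the weighting scheme used in the article is not reproduced here.

```python
def weighted_least_squares(x, y, w):
    """Closed-form weighted least squares fit of y = b0 + b1 * x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def fit_wls_proportional_sd(x, y, n_iter=10):
    """Iterate WLS with weights 1 / fitted_mean**2 (assumed variance model)."""
    # Equal weights reproduce ordinary least squares and give starting values
    b0, b1 = weighted_least_squares(x, y, [1.0] * len(x))
    for _ in range(n_iter):
        # Guard against non-positive fitted means before inverting
        fitted = [max(b0 + b1 * xi, 1e-8) for xi in x]
        b0, b1 = weighted_least_squares(x, y, [1.0 / m**2 for m in fitted])
    return b0, b1

# Hypothetical noise-free data on the line y = 2 + 0.5 * x
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.5, 3.0, 3.5, 4.0, 4.5]
b0_hat, b1_hat = fit_wls_proportional_sd(x_demo, y_demo)
```

Observations with small fitted means receive large weights, so the noisier high-mean observations influence the fit less than they would under ordinary least squares.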
When the response
The lognormal distribution is often approximated by the gamma distribution, with parameters
Another GLM that can be used to estimate the absolute effects is one with a normal distribution and the link function exp(
The method GLM_{N} does not, however, take into account the stochastic variation due to estimating
For LS_{lin}, WLS, GLM_{G} and ML_{LN}, a 95% confidence interval for
For GLM_{N}, a confidence interval is estimated as
For LS_{exp}, a confidence interval for
In a simulation study we compared the large-sample properties of the methods for estimating the expected value of
The expected outcome, personal exposure to PM_{2.5} particles (
The DIWA dataset is a population-based cohort of 64-year-old women from the city of Gothenburg in Sweden and has previously been described in detail in [
CRP is an acute-phase protein found in blood serum, and its levels increase during an inflammatory process. CRP is mainly used as an inflammatory marker in clinical practice and should, for a healthy person, be less than 5 mg/L. Diabetes, smoking, obesity and insulin resistance have all been associated with small increases in CRP levels as assessed by high-sensitivity methods [
Insulin resistance is a condition where the body has a reduced ability to respond to the hormone insulin, which can cause blood glucose to rise above normal levels. Insulin resistance can lead to type 2 diabetes and cardiovascular disease. Although insulin resistance is most common among persons with type 2 diabetes mellitus or impaired glucose tolerance, it is also present in about 25% of non-obese persons with normal glucose tolerance [
In the simulation study, balanced data sets were computer-generated using the model in
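A minimal sketch of how lognormal responses with a linear expected value could be generated is given below. The parameterization (log-scale mean shifted by half the log-scale variance so that the expected value stays linear), the coefficients and the covariate values are assumptions for illustration, not the settings of the simulation study.

```python
import math
import random

def simulate_lognormal_linear(x_rows, beta, sigma_z, rng):
    """Draw one lognormal response per row so that E[Y | x] = beta0 + beta1*x1 + ..."""
    ys = []
    for row in x_rows:
        mu_y = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        # ln(Y) ~ Normal(ln(mu_y) - sigma_z**2 / 2, sigma_z**2) implies E[Y] = mu_y
        ys.append(rng.lognormvariate(math.log(mu_y) - sigma_z**2 / 2, sigma_z))
    return ys

rng = random.Random(1)
beta = (1.5, 0.1, 0.05)          # hypothetical intercept and two slopes
x_rows = [(5.0, 10.0)] * 20000   # one covariate pattern, heavily replicated
y = simulate_lognormal_linear(x_rows, beta, sigma_z=0.4, rng=rng)
sample_mean = sum(y) / len(y)    # should be near 1.5 + 0.1*5 + 0.05*10 = 2.5
```

Repeating such draws over a balanced grid of covariate patterns and refitting each method per replicate would give the kind of bias and standard deviation summaries reported in the tables.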
Estimates of the regression coefficients; expected value of the estimate, E[
LS_{lin}  WLS  ML_{LN}  GLM_{G}  GLM_{N}^{1}  LS_{exp} ^{2}  

Intercept  
E[ 
1.566  1.560  1.563  1.565  1.567  0.487  
SD[ 
0.226  0.190  0.183  0.187  0.180  0.083  
E[ 
0.269  0.187  0.180  0.178  0.179  0.084  
Parameter for X_{1}  
E[ 
0.121  0.122  0.122  0.122  0.121  0.042  
SD[ 
0.021  0.019  0.019  0.020  0.019  0.006  
E[ 
0.021  0.019  0.018  0.018  0.018  0.006  
Parameter for X_{2}  
E[ 
0.075  0.075  0.075  0.075  0.075  0.027  
SD[ 
0.024  0.021  0.021  0.021  0.02  0.008  
E[ 
0.024  0.021  0.020  0.020  0.02  0.008  
E[ 
1.229  
SD[ 
0.143  
Scale parameter  7.330  0.377  
SD[scale parameter]  1.015  0.026  
E[ 
0.379  0.376  0.358 ^{3}  0.377  0.384  
SD[ 
0.031  0.026    0.026  0.026 
^{1} After transformation of the coefficients in eq (3):
^{2} Coefficients
^{3} After transformation:
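One way to relate a gamma scale parameter to a lognormal log-scale standard deviation is to match the coefficients of variation of the two distributions. The sketch below does this under two assumptions: that the tabulated scale parameter plays the role of the gamma shape, and that moment matching is the intended transformation. The article's own formula is not shown in this text, so this is only a plausibility check.

```python
import math

def sigma_z_from_gamma_shape(shape: float) -> float:
    """Log-scale SD of the lognormal whose CV equals that of a gamma with this shape."""
    cv_squared = 1.0 / shape                       # gamma: CV^2 = 1 / shape
    return math.sqrt(math.log(1.0 + cv_squared))   # lognormal: CV^2 = exp(sigma_z^2) - 1

# Scale parameter reported in the table for the gamma model
sigma_z = sigma_z_from_gamma_shape(7.330)
```

Under these assumptions the result rounds to 0.358, which agrees with the transformed value listed in the table.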
All methods except LS_{exp} provided unbiased estimates of the regression coefficients. Among the absolute-effects methods, GLM_{N} tended to have the best precision (smallest SD). The sample-specific standard errors,
All methods except LS_{exp} provided an unbiased estimate of the expected value. The interval length was similar between WLS, ML_{LN}, GLM_{G} and GLM_{N}, but tended to be smaller for the two GLM methods (
Estimated expected value and expected length of 95% confidence interval for
Expected value  E[ 
E[length]  


LS_{lin}  WLS  ML_{LN}  GLM_{G}  GLM_{N}  LS_{exp}  LS_{lin}  WLS  ML_{LN}  GLM_{G}  GLM_{N}  LS_{exp} 
1.714  1.72  1.71  1.71  1.72  1.72  1.85  0.927  0.631  0.609  0.594  0.6  0.544 
2.164  2.17  2.16  2.16  2.17  2.17  2.17  0.733  0.533  0.518  0.501  0.507  0.506 
2.614  2.62  2.61  2.61  2.62  2.62  2.55  0.927  0.825  0.797  0.774  0.783  0.749 
2.568  2.57  2.57  2.57  2.57  2.57  2.49  0.733  0.605  0.588  0.567  0.574  0.58 
3.018  3.02  3.02  3.02  3.02  3.02  2.91  0.464  0.467  0.462  0.437  0.443  0.439 
3.468  3.47  3.47  3.47  3.47  3.47  3.42  0.733  0.763  0.743  0.715  0.723  0.798 
3.422  3.42  3.42  3.42  3.42  3.42  3.34  0.927  0.950  0.920  0.89  0.9  0.982 
3.872  3.87  3.87  3.87  3.87  3.87  3.92  0.733  0.850  0.827  0.796  0.804  0.914 
4.322  4.32  4.32  4.32  4.32  4.32  4.60  0.927  1.026  0.997  0.962  0.972  1.351 
LS_{lin} had the largest standard deviation, especially for small and large values of
True standard deviation and sample-specific standard error for the
Expected value  SD[ 
E[ 



LS_{lin}  WLS  ML_{LN}  GLM_{G}  GLM_{N}  LS_{exp}  LS_{lin}  WLS  ML_{LN}  GLM_{G}  GLM_{N}  LS_{exp} 
1.714  0.191  0.161  0.156  0.159  0.154  0.136  0.238  0.159  0.154  0.152  0.154   
2.164  0.145  0.135  0.132  0.135  0.132  0.128  0.188  0.135  0.131  0.128  0.13   
2.614  0.220  0.209  0.202  0.211  0.205  0.190  0.238  0.208  0.201  0.198  0.201   
2.568  0.167  0.153  0.150  0.154  0.151  0.147  0.188  0.153  0.148  0.145  0.147   
3.018  0.121  0.118  0.118  0.120  0.120  0.112  0.119  0.118  0.117  0.112  0.113   
3.468  0.210  0.195  0.190  0.196  0.192  0.204  0.188  0.192  0.187  0.183  0.185   
3.422  0.251  0.241  0.234  0.244  0.238  0.251  0.238  0.240  0.232  0.228  0.231   
3.872  0.228  0.217  0.212  0.219  0.215  0.235  0.188  0.214  0.209  0.204  0.206   
4.322  0.290  0.263  0.256  0.264  0.258  0.345  0.238  0.259  0.251  0.246  0.249   
All methods except LS_{lin} and LS_{exp} provided coverage close to the nominal, but both GLM_{G} and GLM_{N} tended to give too low coverage, whereas ML_{LN} was slightly better. Using LS_{lin} resulted in too high coverage for low values of
Actual coverage of the 95% confidence interval for
Coverage ^{1} by method
Expected value  LS_{lin}  WLS  ML_{LN}  GLM_{G}  GLM_{N}  LS_{exp}
1.714  0.98  0.94  0.95  0.93  0.94  0.83 
2.164  0.99  0.95  0.95  0.93  0.94  0.95 
2.614  0.96  0.95  0.95  0.93  0.94  0.93 
2.568  0.97  0.95  0.95  0.93  0.94  0.90 
3.018  0.94  0.95  0.95  0.93  0.93  0.83 
3.468  0.92  0.95  0.95  0.93  0.94  0.94 
3.422  0.93  0.95  0.95  0.92  0.93  0.93 
3.872  0.89  0.94  0.95  0.93  0.94  0.95 
4.322  0.89  0.94  0.94  0.93  0.94  0.87 
^{1} Proportion of replicates where 95% confidence interval covers true expected value
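The coverage measure defined in the footnote can be estimated by Monte Carlo. The sketch below does so for a deliberately simple case: a normal-theory interval for the mean of an intercept-only lognormal sample. It is not one of the six regression methods, and the sample size, replicate count and log-scale standard deviation are assumptions chosen for illustration.

```python
import math
import random

def ci_coverage(true_mean, sigma_z, n, n_rep, rng):
    """Share of replicates whose normal-theory 95% CI for the mean contains true_mean."""
    mu_z = math.log(true_mean) - sigma_z**2 / 2
    hits = 0
    for _ in range(n_rep):
        y = [rng.lognormvariate(mu_z, sigma_z) for _ in range(n)]
        mean = sum(y) / n
        var = sum((v - mean) ** 2 for v in y) / (n - 1)
        half_width = 1.96 * math.sqrt(var / n)
        # A replicate counts as a hit when the interval covers the true expected value
        hits += (mean - half_width) <= true_mean <= (mean + half_width)
    return hits / n_rep

coverage = ci_coverage(true_mean=3.018, sigma_z=0.4, n=200, n_rep=2000,
                       rng=random.Random(7))
```

A value clearly below 0.95 indicates intervals that are too narrow or badly centred, and a value clearly above indicates intervals that are wider than necessary.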
The DIWA dataset consists of data from approximately 600 women, for whom a large amount of data related to diabetes and obesity was collected. Descriptive statistics for CRP, waist circumference and HOMA-IR are presented in
Descriptive statistics for C-reactive protein (CRP), insulin resistance (HOMA-IR) and waist circumference (WC).
Group  n  CRP mean  CRP median  CRP SD  HOMA-IR mean  HOMA-IR median  HOMA-IR SD  WC mean  WC median  WC SD
NGT ^{1}  185  2.107  1.184  2.550  1.141  0.960  0.647  88.295  88.50  8.948 
IGT ^{1}  195  2.583  1.380  3.783  1.816  1.430  1.268  92.677  92.50  11.882 
DM ^{1}  218  4.468  1.856  10.255  4.677  2.835  5.842  98.083  98.00  12.631 
^{1} Results for women with normal glucose tolerance (NGT), impaired glucose tolerance (IGT) and diabetes mellitus (DM).
For CRP, the starting model in the multivariable regression analysis included smoking, physical activity, waist circumference (WC), insulin resistance (HOMA-IR) and glucose tolerance (GT), where GT was classified into three categories: normal glucose tolerance, impaired glucose tolerance and diabetes mellitus. We used a model that allowed for different associations for the GT groups by including the interaction terms WC∙DM and WC∙IGT. The final model, based on backward elimination using ML_{LN}, contained WC and HOMA-IR but no interaction term, thereby implying that the association with WC could be similar for the three GT groups (
For HOMA-IR, the starting model in the multivariable regression analysis included WC, physical activity and smoking, and we allowed for possibly different associations with WC in the different glucose tolerance groups by including the interaction between waist circumference and glucose tolerance. The final model, based on backward elimination using ML_{LN}, contained WC and the interaction WC∙GT, thus allowing different WC parameters for each GT group (
The parameter estimates and 95% confidence intervals for the different regression methods, when estimating CRP as a function of waist circumference (WC) and HOMA-IR, using
The parameter estimates and 95% confidence intervals for the different regression methods, when estimating HOMA-IR as a function of waist circumference (WC) and the interaction between WC and glucose tolerance group (normal glucose tolerance, impaired glucose tolerance and diabetes mellitus), using
The estimated standard deviation,
The σ_{Z} estimates and mean length of 95% confidence intervals for
Method  σ_{Z} estimate (CRP)  Length CI (mean, SD) (CRP)  σ_{Z} estimate (HOMA-IR)  Length CI (mean, SD) (HOMA-IR)
LS_{lin}    1.61 (0.89)    1.10 (0.19)  
WLS  1.22  1.51 (2.07)  0.73  0.64 (0.35)  
ML_{LN}  1.04  0.82 (0.86)  0.61  0.43 (0.19)  
GLM_{G}  0.71 (0.974 ^{1})  0.85 (1.26)  0.33 (2.52 ^{1})  0.47 (0.26)  
GLM_{N}  1.04  0.43 (0.23)  0.61  0.23 (0.06)  
LS_{exp}  1.04  1.19 (5.40)  0.60  0.50 (0.45) 
^{1} Estimated scale parameter
All of the methods demonstrated that WC was a significant predictor for CRP. According to the absolute-effects methods (LS_{lin}, WLS, GLM_{G}, GLM_{N} and ML_{LN}), CRP was expected to increase by about 1 mg/L (between 0.74 and 1.07 mg/L) for every 10 cm in WC and, according to the relative-effects method (LS_{exp}), the expected increase was 49% for every 10 cm in WC (exp(0.40) − 1 = 0.49),
All methods found a positive association between HOMA-IR and WC in all glucose tolerance groups,
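The conversion from a log-scale coefficient to a relative effect quoted above can be checked in a few lines. The sketch also shows why a relative effect does not translate into a single absolute effect: the implied increase in mg/L depends on the starting level. The baseline CRP levels used here are hypothetical.

```python
import math

def relative_change(log_scale_coef: float) -> float:
    """Proportional change in the outcome per one-unit step of the predictor."""
    return math.exp(log_scale_coef) - 1.0

def absolute_change(baseline: float, log_scale_coef: float) -> float:
    """Absolute change implied by a relative-effects model at a given baseline level."""
    return baseline * relative_change(log_scale_coef)

per_10_cm = relative_change(0.40)            # coefficient per 10 cm of WC, from the text
increase_low = absolute_change(2.0, 0.40)    # hypothetical baseline of 2 mg/L
increase_high = absolute_change(4.0, 0.40)   # hypothetical baseline of 4 mg/L
```

Doubling the baseline doubles the implied absolute increase, whereas the absolute-effects methods report one increase per 10 cm regardless of the starting level.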
Several methods for estimating a linear regression on lognormal data were compared. Much research has addressed inference, including confidence intervals, for the expected value of a lognormal distribution, e.g. [
Six methods were compared; four of them directly modeled the expected value of
In a simulation study we evaluated the regression methods in a situation where the expected outcome is a linear function of two explanatory variables. All methods except LS_{exp} provided unbiased estimates of the regression coefficients and the expected outcome, but the sample-specific standard error,
The methods were applied to two approximately lognormal response variables, CRP and HOMA-IR (almost 600 observations). The model for CRP contained WC and HOMA-IR, and the model for HOMA-IR contained WC and the interaction between WC and glucose tolerance groups (normal glucose tolerance [NGT], impaired glucose tolerance [IGT] and diabetes mellitus [DM]). When comparing confidence intervals for
Using all methods, the analysis demonstrated a significant positive association between CRP and WC. Associations between CRP and several measures of obesity and abdominal adiposity have been shown in a number of studies [
In the analysis of HOMA-IR, all methods identified WC as a significant predictor for HOMA-IR. There was also a significant interaction between glucose tolerance group and waist circumference, thus the absolute-effects models showed a departure from additivity. These results cannot be interpreted causally, but the interaction indicates that obesity might affect insulin resistance more for women who have diabetes mellitus compared to those with normal glucose tolerance. All methods found a significantly stronger WC association for women with DM compared to women with NGT, and all methods (apart from LS_{lin}) also had a significantly stronger WC association for women with IGT compared to NGT. From the simulation we know that LS_{lin} has larger standard errors than the other methods and thus lower power. The relative-effects method LS_{exp} also showed a significant interaction between glucose tolerance group and waist circumference,
Even though HOMA-IR typically has a skewed non-normal distribution, regression analyses have been performed using both untransformed and log-transformed HOMA-IR values, see [
The choice between an additive or multiplicative model affects the interpretation of the estimated coefficients. The aim of a regression analysis might be simply to test whether there is a significant association between an outcome and a potential explanatory variable. Another aim can be to quantify a specific association (e.g., the absolute or relative effect), or assess the biologic interaction. If the study is purely exploratory, using epidemiological data, residual analysis can be used to decide which model fits the data best. The model choice might also be based on previous knowledge, e.g., about the biological process, from experimental studies.
In risk modeling, a log-linear model is often used,
Not only the main effects but also potential interactions can be of interest. Interaction in a statistical sense is scale dependent, e.g., an absence of interaction on the absolute scale will lead to interaction on the log scale. An interaction in a linear absolute-effects model is additive, while an interaction in a log-linear relative-effects model is multiplicative. In epidemiology, an additive interaction (effect modification on the absolute scale) is often considered more important when assessing public health impact, and seems to correspond more to biologically based notions of interaction [
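The scale dependence of interaction can be illustrated with a small numerical example. The sketch below computes the interaction contrast on the absolute scale and on the log scale for a 2x2 table of expected outcomes; the four means are hypothetical and come from a purely additive model.

```python
import math

def interaction_contrasts(m00, m10, m01, m11):
    """Interaction contrasts for a 2x2 table of expected outcomes."""
    additive = m11 - m10 - m01 + m00
    multiplicative = (math.log(m11) - math.log(m10)
                      - math.log(m01) + math.log(m00))
    return additive, multiplicative

# Hypothetical means from an additive model: mu = 1 + x1 + x2, with x1, x2 in {0, 1}
additive, log_scale = interaction_contrasts(m00=1.0, m10=2.0, m01=2.0, m11=3.0)
```

Here the additive contrast is zero while the log-scale contrast equals ln(3/4), so a model fitted to the log-transformed outcome would need an interaction term to describe data that are free of interaction on the original scale.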
Five regression methods for estimating associations on the absolute scale of the explanatory variables were compared, with regard to bias and standard deviation for the estimated coefficients and also with regard to the estimated expected outcome and its confidence interval. In addition, the standard method for lognormal data (log-transformation) was evaluated. The comparison of the methods was made both in a simulation study and using two examples. The absolute-effects methods provide similar results for the association with the predictors for CRP and HOMA-IR, respectively. The results from the examples are consistent with those from the simulations.
The aim of this study was not to provide a complete statistical model of which factors are associated with CRP and HOMA-IR, but to compare the statistical methods. The number of factors in the regression models was therefore kept small; the simulation model only included two explanatory variables and in the models for CRP and HOMA-IR, only those variables that were significant after backward elimination using ML_{LN} were included. Thus, all factors were significant for ML_{LN} (and also for GLM_{N}). This could be seen as an advantage for these methods, compared to for example a situation in which LS_{exp} had been used to select the model. However, since we assume a linear model (
In medical research we often want to identify and quantify associations using regression analysis. Lognormal data are common and there are situations when the absolute effects are of interest (rather than the relative) and thus there is a need for linear regression methods on untransformed lognormal data. We have evaluated several regression methods using both large-scale simulations of personal exposure to PM, and by applying the methods to data on biomarkers (CRP and HOMA-IR). The LS_{exp} does not provide estimates of the absolute effects and the expected outcome can be biased. The LS_{lin} and GLM_{G} provide correct point estimates of the expected outcome, but confidence intervals with incorrect coverage. The ML_{LN} and GLM_{N} worked best (unbiased estimates, narrow confidence intervals), although ML_{LN} tends to have a slightly more correct coverage for the confidence intervals.
This project was funded by the Swedish state under the agreement between the Swedish government and county councils concerning economic support for research and education of doctors (ALF-agreement).
Sara Gustavsson and Eva M. Andersson were responsible for the statistical data analyses and for the manuscript. Gerd Sallsten serves as Sara Gustavsson’s assistant supervisor and contributed in the modelling of exposure to particles. Björn Fagerberg is responsible for the DIWA study, and contributed with important information on diabetes, obesity and biomarkers. All authors approved the final manuscript.
The authors declare no conflict of interest.