1. Introduction
Binomial regression is a general term for a regression model in which the dependent variable follows a binomial distribution. The binomial distribution describes the number of successes in a fixed number of repeated trials of a binomial experiment, where each trial has the same probability of success. Binomial regression belongs to the class of generalized linear models (GLMs), which use a link function to connect a linear predictor to the expectation of the response variable. In binomial regression, the predictor variables are related to the probability of success in a series of independent Bernoulli trials. The link function for binomial regression is formed from the inverse cumulative distribution function (c.d.f.) of a probability distribution: the logit link corresponds to the inverse c.d.f. of the standard logistic distribution, the probit link uses the inverse c.d.f. of the standard normal distribution, and the complementary log–log (cloglog) link is formed from the inverse c.d.f. of the Gumbel distribution. For more information about GLMs and binomial regression, see McCullagh and Nelder [1] or Agresti [2].
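As a minimal sketch of the three links named above, the following Python code writes each link and its inverse using only the standard library. The function names are illustrative and do not come from the paper; the cloglog inverse uses the c.d.f. F(x) = 1 - exp(-exp(x)).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def logit(p):
    """Logit link: inverse c.d.f. of the standard logistic distribution."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Standard logistic c.d.f.: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Probit link: inverse c.d.f. of the standard normal distribution."""
    return STD_NORMAL.inv_cdf(p)


def inv_probit(eta):
    """Standard normal c.d.f."""
    return STD_NORMAL.cdf(eta)


def cloglog(p):
    """Complementary log-log link: inverse of F(x) = 1 - exp(-exp(x))."""
    return math.log(-math.log(1.0 - p))


def inv_cloglog(eta):
    """C.d.f. whose inverse is the cloglog link."""
    return 1.0 - math.exp(-math.exp(eta))


if __name__ == "__main__":
    # Round trip: apply each link to p = 0.3, then map back to a probability.
    for name, link, inv_link in [
        ("logit", logit, inv_logit),
        ("probit", probit, inv_probit),
        ("cloglog", cloglog, inv_cloglog),
    ]:
        eta = link(0.3)
        print(name, round(eta, 4), round(inv_link(eta), 4))
```

Each pair is an exact inverse, so applying a link and then its inverse returns the original probability up to floating-point error.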
The logit, probit, and cloglog links are the three most commonly used link functions in binomial regression. The logit link is often preferred because its regression coefficients are easy to interpret in terms of odds ratios. The logit model is a linear model for the canonical (natural) parameter of the underlying exponential family, and it has a closed form [3]. However, the logit model cannot guarantee a good fit for every binomial regression, especially for asymmetric models [4]. Czado and Santner [5], Tiku and Vaughan [6], and Dobson and Barnett [7] indicated that misspecification of the link function in binomial regression can produce substantial bias and increase the mean squared error of the parameter estimates and the predicted probabilities.
Both the logit and probit links are symmetric. A symmetric link assumes that the probability of a given binomial response approaches 0 at the same rate as it approaches 1, which implies that the probability curve is symmetric about 0.5. This follows from the symmetric probability density function (p.d.f.) of the distribution whose inverse c.d.f. defines the link. This assumption may not be reasonable in many cases, and several studies have found that symmetric link functions have limitations. Nagler [8] showed how sensitive inferences are when a symmetric link is incorrectly used for an asymmetric model. Chen et al. [9] concluded that when the probability of a given binomial response approaches 0 at a different rate than it approaches 1 (depending on the covariate), the symmetric link function is inappropriate.
Collett [10] showed that an asymmetric link function can fit better than a symmetric one for binomial regression in some specific situations. One asymmetric link that is frequently used in binomial regression is the cloglog link function. However, the cloglog link has a fixed left (negative) skewness, which means that the probability of a given binomial response rises slowly from 0 but approaches 1 more quickly, and not the other way around [11]. As a result, the cloglog may produce inaccurate inferences for a symmetric model or a model with positive skewness. Hence, although it is an asymmetric link, it is not a flexible link function for binomial regression. Binomial regression is better modeled with flexible link functions that accommodate symmetric and asymmetric models (with both positive and negative skewness) and that allow the data to determine the required level of skewness.
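The difference between symmetric and asymmetric links can be checked numerically. The sketch below compares, for a c.d.f. F with median m, the lower-tail probability F(m - d) with the upper-tail probability 1 - F(m + d). The helper name `symmetry_gap` is hypothetical and not from the paper; the two tail probabilities coincide for a symmetric link and differ for the cloglog.

```python
import math


def inv_logit(eta):
    """Standard logistic c.d.f. (symmetric about its median 0)."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_cloglog(eta):
    """C.d.f. behind the cloglog link: F(x) = 1 - exp(-exp(x))."""
    return 1.0 - math.exp(-math.exp(eta))


# Median of the cloglog c.d.f.: solves 1 - exp(-exp(m)) = 0.5.
CLOGLOG_MEDIAN = math.log(math.log(2.0))


def symmetry_gap(cdf, median, d):
    """Return F(median - d) - (1 - F(median + d)).

    The gap is zero when the c.d.f. is symmetric about its median.
    A positive gap means the curve is closer to 1 at median + d than it is
    to 0 at median - d, i.e. it approaches 1 faster than it leaves 0.
    """
    return cdf(median - d) - (1.0 - cdf(median + d))


if __name__ == "__main__":
    print("logit gap:  ", symmetry_gap(inv_logit, 0.0, 2.0))
    print("cloglog gap:", symmetry_gap(inv_cloglog, CLOGLOG_MEDIAN, 2.0))
```

The positive gap for the cloglog matches the fixed direction of skewness described above: the curve leaves 0 slowly and reaches 1 quickly, and no parameter is available to reverse or remove that asymmetry.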
Many studies have proposed link functions that incorporate skewness. Stukel [12] presented a two-parameter class of generalized logistic models that can approximate many link functions, including the probit, logit, and complementary log–log. However, this class may not be straightforward to implement in a Bayesian analysis because of problems with multiple covariates and noninformative improper priors. Chen et al. [9] proposed skew-probit links using the latent variable approach of Albert and Chib [13], which can lead to proper posterior distributions under standard improper priors. Kim et al. [14] introduced a link function based on the class of generalized skewed t-link models with a constraint on the shape parameter. Bazan et al. [15] presented the skewed probit link function and some variants with different parameterizations. Naranjo et al. [16] presented an asymmetric link function based on the asymmetric exponential power (AEP) distribution. More recently, Caron et al. [17] proposed a skewed Weibull link function for categorical response data arising from binomial as well as multinomial models and showed that the logit, probit, and complementary log–log links can be obtained as limiting cases.
In this paper, we propose a flexible generalized logit (glogit) link function for the binomial regression model. The link function is constructed from the inverse c.d.f. of a new class of generalized logistic distributions, the exponentiated-exponential logistic (EEL) distribution introduced by Ghosh and Alzaatreh [18]. The EEL distribution has characteristics similar to those of the Type IV generalized logistic distribution (GLD). Both distributions accommodate symmetric and asymmetric models, including lighter and heavier tails than the standard logistic distribution. The logistic distribution, Type I GLD, and Type II GLD can be obtained as special cases of both distributions. However, the EEL distribution differs from the Type IV GLD in that its c.d.f. is algebraically more tractable, which makes it easier to implement. Several researchers have used earlier GLDs as link functions in binary regression: Tiku and Vaughan [6] used the Type I GLD, and Oral [19] extended their work to a stochastic covariate; Valle et al. [20] used the Type III GLD and showed that the model can capture heavy and light tails; and Prentice [21] applied the Type IV GLD model to dose-response curves.
A Bayesian estimation technique using Markov chain Monte Carlo (MCMC) is provided for implementing the proposed model. An attractive aspect of the model is that it can be computed with simple syntax in the JAGS (Just Another Gibbs Sampler) program. We conduct a simulation study to investigate the performance of the proposed model in comparison with the most commonly used link functions, i.e., the logit, probit, and complementary log–log. We also compare the proposed model with several other asymmetric models using two previously published datasets: the beetle mortality dataset [22], which was analyzed by Naranjo et al. [16], and the dataset on the potency of three different poisons [23], which was analyzed by Caron et al. [17].
The rest of the paper is organized as follows: Section 2 describes the binomial regression model and the role of the link function in the model. In Section 3, we introduce the flexible generalized logit (glogit) link based on the c.d.f. of the exponentiated-exponential logistic (EEL) distribution, and we show graphs of the distribution under various settings. Section 4 demonstrates the flexibility of the glogit model compared with the logit, probit, and complementary log–log models using simulated datasets. Section 5 contains the application and comparison of the glogit model using the beetle mortality dataset and the dataset on the potency of three different poisons. The conclusion and discussion are given in Section 6.
2. Binomial Regression Model
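The general notation of the binomial regression model is as follows. Suppose $\mathbf{y} = (y_1, \ldots, y_n)'$ is an $n \times 1$ vector of independent binary random variables that take the value 1 for success and 0 for failure. Let $\mathbf{X}$ denote the design matrix with rows $\mathbf{x}_i'$, where $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ik})'$ is a vector of predictor variables in which the value 1 corresponds to the intercept. Furthermore, let $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)'$ denote the corresponding vector of the intercept and regression coefficients.
In binomial regression models, the probability $p_i = P(y_i = 1)$ is predicted by applying a link function to a linear combination of the predictor variables, which leads to the following model:
$$g(p_i) = \mathbf{x}_i'\boldsymbol{\beta},$$
where $g(\cdot)$ is called the link function. The link function $g$ can be formed from the inverse c.d.f. of a probability distribution, $g = F^{-1}$, so that $p_i = F(\mathbf{x}_i'\boldsymbol{\beta})$. When $F$ is the c.d.f. of a symmetric distribution, the resulting link is also symmetric. The most common symmetric links are the logit and probit, whereas an asymmetric link is obtained when $F$ is an asymmetric c.d.f., for example, the cloglog link.
The likelihood function $L(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})$ is derived from the likelihood of the binomial distribution as follows:
$$L(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) = \prod_{i=1}^{n} \left[F(\mathbf{x}_i'\boldsymbol{\beta})\right]^{y_i} \left[1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right]^{1 - y_i},$$
thus, the log-likelihood function $\ell(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})$ is given by:
$$\ell(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) = \sum_{i=1}^{n} \left\{ y_i \log F(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i) \log\left[1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right] \right\}.$$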
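For binary responses, the log-likelihood is the sum over observations of y_i log p_i + (1 - y_i) log(1 - p_i), where p_i is the inverse link applied to the linear predictor. The following sketch evaluates this sum for an arbitrary inverse link. The function and argument names are illustrative, and the logit is used only as a default.

```python
import math


def inv_logit(eta):
    """Standard logistic c.d.f., used here as the default inverse link."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(beta, X, y, inv_link=inv_logit):
    """Bernoulli log-likelihood of coefficients beta.

    X is a list of rows, each starting with 1 for the intercept.
    y is a list of 0/1 responses of the same length.
    inv_link maps the linear predictor x_i' beta to a probability.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor
        p = inv_link(eta)                            # success probability
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total


if __name__ == "__main__":
    X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
    y = [0, 0, 1, 1]
    print(log_likelihood([0.0, 0.0], X, y))
    print(log_likelihood([-0.5, 1.5], X, y))
```

Passing a different `inv_link` changes the model without touching the likelihood code, which is the sense in which the link function is a separate modeling choice.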
The parameters in the model (3) can be estimated by maximum likelihood or by the Bayesian method. For maximum likelihood estimation (MLE), a numerical method such as the Newton–Raphson algorithm or the Nelder–Mead algorithm can be applied. In the Bayesian framework, inferences about the model parameters are typically summarized by random draws from the posterior density, which is proportional to the likelihood function for the data multiplied by a prior for the parameters.
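As an illustration of the maximum likelihood route, the sketch below applies Newton–Raphson to the simplest case: binary responses, the logit link, and a single predictor plus intercept. It is not the estimation procedure used in this paper, which is Bayesian; the function name and the toy data are assumptions made for the example.

```python
import math


def fit_logit_newton(x, y, n_iter=25, tol=1e-10):
    """Newton-Raphson MLE for P(y = 1) = inv_logit(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0           # score vector (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0   # entries of the information matrix X'WX
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - p          # residual
            w = p * (1.0 - p)   # Bernoulli variance weight
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (X'WX) d = score for the Newton step d.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    # Toy data (not from the paper); the classes overlap, so the MLE is finite.
    x = [-2.0, -1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.0, 1.0, 2.0]
    y = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
    print(fit_logit_newton(x, y))
```

For other links the score and information matrix change, because the logit is the canonical link and its weights simplify to p(1 - p).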
4. Simulation
A simulation study was carried out to evaluate the performance of the glogit link function for the binomial regression models compared to logit, probit, and cloglog. For this purpose, several datasets were generated by the following steps:
Step 1: Specify a binomial regression model with one predictor variable and set a vector of the regression coefficients ;
Step 2: Generate , then we create ;
Step 3: Compute from a given link, so that );
Step 4: Generate with = 100, and , which is the binomial distribution with as the probability of success and is the sample size for the repeated Bernoulli trials.
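The four steps above can be sketched in Python as follows. All numerical settings in the example are placeholders rather than the values used in the study: the coefficients, the standard normal distribution for the predictor, the number of observations, and the number of trials per observation are assumptions chosen only to make the code runnable.

```python
import math
import random


def inv_logit(eta):
    """Standard logistic c.d.f., used as an example inverse link."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_dataset(beta, n_obs, n_trials, inv_link=inv_logit, rng=None):
    """Generate one dataset following Steps 1 to 4.

    beta     : (intercept, slope), Step 1 (values are hypothetical)
    n_obs    : number of observations
    n_trials : number of Bernoulli trials per observation
    """
    rng = rng or random.Random()
    # Step 2: draw the predictor (standard normal is an assumption here).
    x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    # Step 3: success probabilities from the chosen inverse link.
    p = [inv_link(beta[0] + beta[1] * xi) for xi in x]
    # Step 4: binomial counts as sums of independent Bernoulli trials.
    y = [sum(rng.random() < pi for _ in range(n_trials)) for pi in p]
    return x, p, y


if __name__ == "__main__":
    x, p, y = simulate_dataset(beta=(0.0, 1.0), n_obs=100, n_trials=20,
                               rng=random.Random(2020))
    print(y[:10])
```

Swapping `inv_link` for another c.d.f. reproduces the idea behind the scenarios: the same steps are repeated with a different data-generating link.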
We consider seven scenarios that cover symmetric and asymmetric models as well as lighter and heavier tails (compared to the logit model). The scenarios differ in the inverse link function used to generate the data, as follows:
Scenario 1: , same with ;
Scenario 2: , represent the lighter tails compared to the logit;
Scenario 3: , represent the heavier tails compared to the logit;
Scenario 4: , represent the positive skewness;
Scenario 5: , represent the negative skewness;
Scenario 6: ;
Scenario 7: .
The curves for scenarios 1 to 5, which are specific to the glogit, are compared in Figure 1 and Figure 2. For each scenario, we generated 100 datasets, and for each dataset, we generated
binomial response variables, thus,
for
. Furthermore, we fitted all of the models to each simulated dataset to identify the best-performing model. We conducted a Bayesian analysis of the datasets simulated under the seven scenarios, using for each model the noninformative priors given in Section 3.2.
To compare the performance of the models under different link functions, we use two model comparison criteria. The first is the deviance information criterion (DIC) proposed by Spiegelhalter et al. [31], which is a goodness-of-fit criterion that estimates the expected predictive error; a lower DIC is therefore better. DIC is given by $\mathrm{DIC} = \bar{D} + p_D$, where $D(\theta) = -2 \log L(\theta)$ is the deviance of the model, $L(\theta)$ is the likelihood, and $\bar{D} = E[D(\theta) \mid \mathbf{y}]$ is the posterior mean of the deviance. The number of parameters is given as $p_D = \bar{D} - D(\bar{\theta})$, and $p_D$ is defined as the effective number of parameters, while $D(\bar{\theta})$ is the deviance evaluated at the posterior means of the parameters of interest, with $\bar{\theta} = E[\theta \mid \mathbf{y}]$. The second criterion for model comparison is the absolute error (AE). The absolute error is the absolute value of the difference between the predicted value and the true value of a measurement, and it is commonly reported as the maximum possible error given a model's degree of accuracy. AE is defined by
. The smaller values of AE indicate better models.
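Both criteria can be computed directly from MCMC output. The sketch below assumes a list of posterior draws of (intercept, slope) for a binomial model with one predictor and computes DIC as the posterior mean deviance plus the effective number of parameters. The helper `mean_absolute_error` averages the absolute differences between fitted and true probabilities; this averaging is one common form and is an assumption here, not necessarily the exact AE definition used in the study. All function names are illustrative.

```python
import math


def inv_logit(eta):
    """Standard logistic c.d.f., used as an example inverse link."""
    return 1.0 / (1.0 + math.exp(-eta))


def deviance(beta, x, y, n_trials, inv_link):
    """D(beta) = -2 log L(beta) for binomial counts y out of n_trials."""
    log_lik = 0.0
    for xi, yi in zip(x, y):
        p = inv_link(beta[0] + beta[1] * xi)
        log_lik += (math.log(math.comb(n_trials, yi))
                    + yi * math.log(p)
                    + (n_trials - yi) * math.log(1.0 - p))
    return -2.0 * log_lik


def dic(draws, x, y, n_trials, inv_link):
    """Return (DIC, p_D) from posterior draws of (intercept, slope)."""
    m = len(draws)
    # Posterior mean of the deviance.
    d_bar = sum(deviance(b, x, y, n_trials, inv_link) for b in draws) / m
    # Deviance evaluated at the posterior mean of the parameters.
    beta_bar = [sum(b[j] for b in draws) / m for j in range(2)]
    d_hat = deviance(beta_bar, x, y, n_trials, inv_link)
    p_d = d_bar - d_hat  # effective number of parameters
    return d_bar + p_d, p_d


def mean_absolute_error(p_hat, p_true):
    """Average absolute difference between fitted and true probabilities."""
    return sum(abs(a - b) for a, b in zip(p_hat, p_true)) / len(p_true)


if __name__ == "__main__":
    x = [-1.0, 0.0, 1.0]
    y = [2, 5, 8]
    draws = [[-0.2, 0.8], [0.0, 1.0], [0.2, 1.2]]  # toy draws, not real MCMC
    print(dic(draws, x, y, n_trials=10, inv_link=inv_logit))
```

In a simulation the true probabilities are known, so `mean_absolute_error` can be evaluated against them; with real data only DIC is available.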
Table 1 shows the results of the simulation for the seven scenarios. Scenarios 1 to 3 are symmetric models: the first coincides with the logit, the second has lighter tails than the logit, and the third has heavier tails than the logit. The fourth and fifth scenarios represent asymmetric models, with positive skewness in the fourth and negative skewness in the fifth. In addition, the sixth and seventh scenarios correspond to the cloglog and probit link functions, respectively. We evaluate each scenario using the DIC and AE values to compare the glogit, logit, cloglog, and probit models.
First, we focus on the results of the first scenario. When the datasets are generated from the logit link, which is a special case of the glogit link, the logit model is the best, followed by the glogit. However, the DIC and AE values of these two models are very close to each other compared with those of the other models. For the second to fifth scenarios, in which the datasets are generated from the glogit model with several values of its two parameters, the glogit model performs best. In the second scenario, which has lighter tails than the standard logistic, the second-best model is the probit. In the third scenario, the logit model gives almost identical values of DIC and AE and is therefore the second-best model. In the fourth scenario, the second-best model is the probit, whereas in the fifth, it is the cloglog. This result is consistent with the fact that the cloglog model also has negative skewness. When the datasets are generated from the cloglog and probit models (Scenarios 6 and 7), the generating model fits best and the glogit is second-best, although the model comparison values are quite close. In summary, the simulation results show that the glogit model is quite flexible: it is the best or second-best fitting model in all seven scenarios.
6. Conclusions
A flexible generalized logit (glogit) link function for binomial regression is proposed in this paper. The link function is based on a new class of generalized logistic distributions, the exponentiated-exponential logistic (EEL) distribution. This distribution covers both symmetric and asymmetric models and accommodates positive or negative skewness. Moreover, it allows for lighter or heavier tails than the standard logistic distribution.
The paper provides a simulation study of binomial regression under seven scenarios to demonstrate the flexibility of the glogit compared with the common link functions, i.e., the logit, probit, and cloglog, in the Bayesian framework. The simulation results show that the glogit model always gave the best fit for the datasets generated from the glogit link function and the second-best fit for the datasets generated from each of the other link functions. The results also show that, when the datasets are generated from the glogit, the second-best fitting model is the one related to the shape of the EEL p.d.f. curve determined by its parameter values.
In addition, applying the model to the datasets from previous studies indicates that the glogit fits the data well and outperforms the competing link functions, such as the probit and t-link (T(8)) models proposed by Albert and Chib [13], the skewed probit (SP) model proposed by Chen et al. [9], the skewed generalized t-link (SGT) model proposed by Kim et al. [14], the AEP-based link (AEP) model presented by Naranjo et al. [16], and the skewed Weibull (SW) model presented by Caron et al. [17]. The proposed model is flexible enough to handle both datasets: it attains the smallest DIC value, followed by the AEP-based link model, in each case. We therefore consider the glogit model a good option for practical use. Other attractive aspects of the model are that it is analytically tractable and can be easily implemented under a Bayesian approach using simple syntax.
This study adopted typical noninformative priors for the Bayesian analysis. Priors from other distributions can also be considered. Prior selection is an important issue that was not analyzed in this paper and can be addressed in future research.