1. Introduction
The recent statistical literature has seen intense research activity on skew distributions. This is because many data sets are not fitted well by the normal distribution owing to asymmetry and/or excess kurtosis [1]. A natural extension of the normal distribution is the skew-normal (SN) distribution, which has been studied in [2,3,4,5,6,7,8], among other works. Note that the Fisher information matrix of the SN distribution is singular [1].
In order to model random variables with bounded support, the beta distribution has been frequently used [9,10,11,12,13,14,15,16,17]. This type of variable has interesting applications when the support lies between zero and one. Additionally, versions of other distributions with support between zero and one, such as the Birnbaum–Saunders and Weibull distributions [18,19,20], have been proposed. Variables with support between zero and one arise particularly when modeling proportions and rates (for example, the proportion of deaths caused by a certain virus in a country, the rate of income spent on taxes, and the proportion of family income spent on food).
Note that some random variables cannot be observed below a certain value (lower detection limit (LDL)) and/or above a certain value (upper detection limit (UDL)), with LDL and UDL often being fixed values. When LDL and/or UDL are present, we say that the random variable is singly or doubly censored [21,22]. Data associated with this type of variable can be described by censored normal distributions [23,24,25,26].
If LDL = 0 and UDL = 1, extensions of the beta distribution may be considered to model an excess of zeros and/or ones. These extensions, used to describe variables in the intervals [0,1], [0,1) or (0,1], have been reported in [27,28,29] and are named zero/one inflated beta (ZOIB) distributions. In order to model inflation at zero and/or one, mixture distributions have been derived [30,31]. Bernoulli-beta mixture distributions for modeling data inflated at zero/one have been studied in [28]. However, in many situations, the distribution of random variables that take values between zero and one presents positive or negative asymmetry and/or degrees of kurtosis different from those of the normal or beta distributions. Consequently, distributions other than the beta and Bernoulli-beta mixture models are needed.
In order to deal with asymmetry in the data, a transformation can be considered. Nevertheless, such a transformation brings problems in the interpretability of the distribution parameters and a loss of power in the inference. As an alternative that avoids transforming the data, in the case of random variables with positive support, asymmetry to the right, and presence of LDL/UDL, the Birnbaum–Saunders, log-normal (LN), and log-SN (LSN) distributions can be used [30,32,33,34,35,36]. However, these distributions only cover positive asymmetry. Thus, the doubly-censored SN (DCSN) distribution may be proposed as an extension of the doubly-censored normal distribution covering both negative and positive asymmetry. To the best of our knowledge, no extensions of the SN distribution to describe variables with bounded support and high censoring, as in the case of the intervals [0,1], [0,1) or (0,1], have been reported to date.
The objective of this paper is to propose an alternative approach to deal with data sets in the [0,1] interval. Our approach is a mixture distribution between the Bernoulli and DCSN distributions, which we name the BDCSN distribution for short. Given that the information matrix of the DCSN distribution is singular, as in the case of the SN distribution, we define a centered DCSN (CDCSN) distribution to circumvent this singularity [37]. Therefore, our proposal addresses the mentioned problems of the existing distributions, and its maximum likelihood estimators are well behaved, with regularity conditions being satisfied, since the Fisher information matrix is non-singular in the vicinity of symmetry.
The paper is organized as follows. In Section 2, we present the DCSN distribution and the main results on inference for this distribution. Given that the information matrix of the DCSN distribution is singular, in Section 3 we define the CDCSN distribution to solve this problem. In this section, the doubly censored log-SN (DCLSN) distribution is also introduced. In Section 4, the DCSN and DCLSN distributions are considered for modeling zero and/or one inflation by using the BDCSN distribution. Parameter estimation is carried out with the maximum likelihood method. The corresponding observed and expected Fisher information matrices are derived and shown to be non-singular. Section 5 evaluates the performance of the maximum likelihood estimators with simulations based on the Monte Carlo method and introduces an algorithm to generate random numbers from the BDCSN distribution. Two real data analyses are considered in Sections 6 and 7, from which we conclude that the distributions presented in this paper are a good alternative to the ZOIB distributions. The conclusions of this research are provided in Section 8. Mathematical derivations of this work are detailed in Appendix A. All the numerical calculations were performed by using the R software [38].
2. Doubly-Censored SN Distribution
In this section, we define the DCSN distribution and estimate its parameters with the maximum likelihood method.
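A general structure for a skew-symmetric probability density function (PDF) was proposed in [1], which can be written as
$$g(z;\lambda) = 2 f(z) F(\lambda z), \quad z \in \mathbb{R}, \tag{1}$$
where $f$ is a symmetric PDF around zero, $F$ is an absolutely continuous cumulative distribution function (CDF) which is symmetric around zero, and $\lambda \in \mathbb{R}$ is a shape parameter controlling the asymmetry of the distribution. In (1), if $f = \phi$ and $F = \Phi$, that is, the PDF and CDF of the standard normal distribution, respectively, the so-called SN distribution is obtained with PDF given by
$$\phi_{SN}(z;\lambda) = 2 \phi(z) \Phi(\lambda z), \quad z \in \mathbb{R}, \tag{2}$$
in which case the notation $Z \sim SN(\lambda)$ is used. Observe that the hazard and inverse-hazard functions of the SN distribution are, respectively, stated as
$$h_{SN}(z;\lambda) = \frac{\phi_{SN}(z;\lambda)}{1 - \Phi_{SN}(z;\lambda)}, \quad r_{SN}(z;\lambda) = \frac{\phi_{SN}(z;\lambda)}{\Phi_{SN}(z;\lambda)}, \tag{3}$$
where $\Phi_{SN}(z;\lambda) = \Phi(z) - 2T(z;\lambda)$ is the SN CDF, with $T$ being the Owen function defined as
$$T(z;\lambda) = \frac{1}{2\pi} \int_0^{\lambda} \frac{\exp\{-z^2(1+x^2)/2\}}{1+x^2} \, \mathrm{d}x,$$
and $\phi_{SN}$ is established in (2).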
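As a purely illustrative aid, the quantities just described can be evaluated numerically. The following is a minimal Python sketch, assuming the usual SN forms: PDF equal to twice the standard normal PDF times the standard normal CDF evaluated at the shape parameter times the argument, and CDF equal to the standard normal CDF minus twice Owen's T function. The function names, the symbol `lam` for the shape parameter, and the use of Simpson's rule for Owen's T are assumptions of this sketch and are not taken from the paper.

```python
import math

def norm_pdf(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def owen_t(h, a, n=2000):
    """Owen's T(h, a) by composite Simpson's rule on [0, a] (n must be even)."""
    if a == 0.0:
        return 0.0
    step = a / n  # negative when a < 0, so that T(h, -a) = -T(h, a)
    total = 0.0
    for i in range(n + 1):
        x = i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * math.exp(-0.5 * h * h * (1.0 + x * x)) / (1.0 + x * x)
    return total * step / (6.0 * math.pi)

def sn_pdf(z, lam):
    """Standard SN PDF with shape parameter lam."""
    return 2.0 * norm_pdf(z) * norm_cdf(lam * z)

def sn_cdf(z, lam):
    """Standard SN CDF expressed through Owen's T function."""
    return norm_cdf(z) - 2.0 * owen_t(z, lam)

def sn_hazard(z, lam):
    """Hazard function: PDF over survival function."""
    return sn_pdf(z, lam) / (1.0 - sn_cdf(z, lam))

def sn_inverse_hazard(z, lam):
    """Inverse (reversed) hazard function: PDF over CDF."""
    return sn_pdf(z, lam) / sn_cdf(z, lam)
```

A quick sanity check is that `sn_cdf(z, 0.0)` reduces to the standard normal CDF, and that `sn_cdf(z, 1.0)` equals the square of the standard normal CDF, since an SN variable with shape one is distributed as the maximum of two independent standard normal variables.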
Let
. Then, a location-scale extension of
Z is obtained considering the transformation
, where
is a location parameter and
is a scale parameter. Therefore, from (
2), the PDF of
X is expressed as
where
. We denote this extension by
. Next, based on the extension defined in (
4), we introduce the DCSN distribution.
Definition 1. Let be a random sample of size n, where and suppose that only values of between the constants and are observed, with and being the LDL and UDL, respectively. For values of , only the value is reported, while for values of only the value is considered. Then, the observed data set can be written as for . The resulting sample from (5) is said to be drawn from a DCSN population. For and , we have that and where For the continuous part of , we consider that . In this case, we use the notation , with and being fixed. Note that, for , the DCSN distribution reduces to the doubly censored normal distribution [23]. Figure 1 provides graphs of PDFs of the DCSN distribution, denoted by in this figure. From
Figure 1 (left), note that, as the shape parameter
increases, keeping the other parameters fixed, different shapes of the PDF are obtained. From
Figure 1 (right), by keeping
fixed, it is evident that the parameter
modifies the location and
modifies the scale of the distribution.
Parameter estimation of the DCSN distribution can be performed using the maximum likelihood method. Thus, denoting by
,
and
the sums corresponding to
, and
, respectively, the log-likelihood function of
for the sample
is given by
where
are defined in (
6) and
. Hence, the score vector defined as the derivative of the log-likelihood function stated in (
7) with respect to the distribution parameters has elements
,
, and
, which are detailed in the
Appendix A. The first order conditions or estimating equations for the maximum likelihood method are obtained equating to zero the elements
,
, and
of the score vector. The solution of these equations leads to the maximum likelihood estimates of
. Notice that these estimating equations must be solved numerically by a nonlinear optimization method, for example, a quasi-Newton algorithm of Broyden–Fletcher–Goldfarb–Shanno (BFGS) type [
39], which is available through the optim function of the
R software.
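To make the structure of the log-likelihood concrete, the following Python sketch evaluates a doubly censored SN log-likelihood in the standard censored-data form: observations at the lower limit contribute the log of the CDF at that limit, observations at the upper limit contribute the log of the survival function at that limit, and interior observations contribute the log of the location-scale SN density. This form, the parameter names `xi`, `eta` and `lam`, and the default limits 0 and 1 are assumptions of the sketch; the paper's expression (7) should be taken as the reference. The resulting function could be passed (with its sign changed) to any numerical optimizer, in the same way that the text uses optim with BFGS in R.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def owen_t(h, a, n=400):
    """Owen's T(h, a) by composite Simpson's rule (n even)."""
    if a == 0.0:
        return 0.0
    step = a / n
    total = 0.0
    for i in range(n + 1):
        x = i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * math.exp(-0.5 * h * h * (1.0 + x * x)) / (1.0 + x * x)
    return total * step / (6.0 * math.pi)

def sn_cdf(z, lam):
    """Standard SN CDF, clamped away from 0 and 1 so that logs stay finite."""
    p = norm_cdf(z) - 2.0 * owen_t(z, lam)
    return min(max(p, 1e-300), 1.0 - 1e-16)

def dcsn_loglik(y, xi, eta, lam, lower=0.0, upper=1.0):
    """Log-likelihood of a sample censored at `lower` and `upper` (assumed form)."""
    log_p_lower = math.log(sn_cdf((lower - xi) / eta, lam))        # mass at lower limit
    log_p_upper = math.log(1.0 - sn_cdf((upper - xi) / eta, lam))  # mass at upper limit
    loglik = 0.0
    for value in y:
        if value <= lower:
            loglik += log_p_lower
        elif value >= upper:
            loglik += log_p_upper
        else:
            z = (value - xi) / eta
            loglik += (math.log(2.0) - math.log(eta)
                       - 0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
                       + math.log(norm_cdf(lam * z)))
    return loglik
```

With `lam = 0.0`, the function reduces to the log-likelihood of a doubly censored normal sample, which offers a simple way to check an implementation.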
The elements of the observed Fisher information matrix corresponding to the DCSN distribution depend on the second derivatives of the log-likelihood function with respect to the distribution parameters. These elements are provided in
Appendix A. The expected Fisher information matrix corresponding to the DCSN distribution then follows by taking expectations of the elements of the observed information matrix and multiplying by
. Then, after extensive algebraic manipulations, the elements of the expected information matrix are obtained and are also given in
Appendix A, with
where
is defined here because it is used in
Section 4. As mentioned, the Fisher information matrix of the SN distribution is singular [
1]. This occurs when
, since the score of the parameter
is
times the score of the parameter
, producing linear dependence between the corresponding columns of this matrix. Such a singularity is inherited by the Fisher information matrix corresponding to the DCSN distribution, when
. For this case of
, a convergence problem exists in the asymptotic inference and the uniqueness of the corresponding maximum likelihood estimators is not guaranteed.
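The linear dependence between scores can be checked numerically. The sketch below, which is only an illustration, works with the uncensored location-scale SN log-density, not with the censored model. It uses central finite differences, and the symbols `xi`, `eta` and `lam` for location, scale and shape are assumptions. At zero shape, the ratio of the shape score to the location score is the same constant for every observation, namely the scale times the square root of 2/π, which is what makes the corresponding columns of the information matrix linearly dependent.

```python
import math

def sn_logpdf(x, xi, eta, lam):
    """Log-density of the location-scale SN distribution."""
    z = (x - xi) / eta
    log_cdf_term = math.log(0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0))))
    return (math.log(2.0) - math.log(eta)
            - 0.5 * z * z - 0.5 * math.log(2.0 * math.pi) + log_cdf_term)

def scores_fd(x, xi, eta, lam, h=1e-6):
    """Central finite-difference scores with respect to location and shape."""
    score_xi = (sn_logpdf(x, xi + h, eta, lam) - sn_logpdf(x, xi - h, eta, lam)) / (2.0 * h)
    score_lam = (sn_logpdf(x, xi, eta, lam + h) - sn_logpdf(x, xi, eta, lam - h)) / (2.0 * h)
    return score_xi, score_lam

xi, eta = 0.4, 0.3  # illustrative values, not taken from the paper
ratios = []
for x in (0.1, 0.35, 0.8):
    score_xi, score_lam = scores_fd(x, xi, eta, lam=0.0)
    ratios.append(score_lam / score_xi)

# Every ratio is approximately eta * sqrt(2 / pi), regardless of x.
constant = eta * math.sqrt(2.0 / math.pi)
```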
3. Doubly-Censored Log-SN and Centered SN Distributions
As mentioned, the censored LSN distribution arises as an alternative that avoids transforming the data in the case of random variables with positive support and asymmetry to the right. In this section, we define the DCLSN and centered SN distributions and estimate their parameters with the maximum likelihood method.
Recalling that
, the PDF of a random variable
X with LSN distribution is given by
where
is a location parameter and
is a scale parameter. Notice that if
, then the LN distribution is obtained. We denote this extension of the LN distribution as
. The LSN distribution is required to model data with asymmetry different from that of
, or equivalently, of
. Hence, extending the definition of the DCSN distribution to the LSN case, we obtain the DCLSN distribution with parameters
,
and
, which is denoted by
, with
and
, and replacing
x by
in the DCSN PDF so that the DCLSN PDF is defined at zero. In this particular case, from the PDF given in (
9), the log-likelihood function of
is stated as
where
is the log-likelihood function defined in (
7) for the DCSN distribution, with
being replaced by
,
by
, and
by
. The score vector and Fisher information matrix associated with the log-likelihood function defined in (
10) can be obtained from the score vector and information matrix of the DCSN distribution by replacing, in the expressions of
Appendix A: (i)
by
, and (ii)
by
, where
and
are the hazard and inverse hazard functions of the SN distribution, respectively, defined in (
3). As mentioned, the Fisher information matrix of the DCSN distribution is singular, a property inherited from the SN distribution. Note that this singularity is also present in the case of the DCLSN distribution. As also mentioned, a centered parametrization is considered to circumvent such a singularity. Observe that the SN PDF with centered parametrization (CSN for short) is given by
where
,
,
,
, and
. The centrality parameters
,
and
represent, as usual, the mean, standard deviation (SD) and coefficient of skewness of
X, respectively. In this case, we use the notation
.
Note that the distribution regarding the PDF defined in (
11) can be a location-scale distribution denoted by
considering
By using the relations stated in (
12), we parametrize the DCSN distribution to obtain the CDCSN distribution, denoted by
.
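For readers who wish to experiment with the centered parametrization, the following Python sketch maps the direct parameters (location, scale, shape) to the centered ones (mean, SD, coefficient of skewness) using the well-known moment formulas of the SN distribution. The function and argument names are assumptions of this sketch. It is an independent illustration and is not intended to reproduce the dp2cp function of the sn package for R.

```python
import math

def dp_to_cp(xi, eta, lam):
    """Map direct SN parameters to centered ones: (mean, SD, skewness)."""
    b = math.sqrt(2.0 / math.pi)
    delta = lam / math.sqrt(1.0 + lam * lam)
    mean_z = b * delta                      # mean of the standard SN variable
    sd_z = math.sqrt(1.0 - mean_z ** 2)     # SD of the standard SN variable
    mu = xi + eta * mean_z
    sigma = eta * sd_z
    skewness = 0.5 * (4.0 - math.pi) * (mean_z / sd_z) ** 3
    return mu, sigma, skewness
```

At zero shape, the centered parameters coincide with the location and scale and the skewness is zero. As the shape grows in absolute value, the skewness approaches a bound slightly below one in absolute value, so the centered skewness parameter has a restricted range.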
Based on the relations established in (
12), the observed and expected Fisher information matrices may be obtained for the parameter vector
of the CDCSN distribution using
, where
is a matrix containing the derivatives of the vector of parameters
with respect to
, and
is the observed Fisher information matrix of the non-centered location-scale SN distribution. Under regularity conditions [
40],
is a consistent estimator of
. In addition, as
,
where
, with
being the expected Fisher information matrix, and
denotes convergence in distribution. In summary,
is consistent and, from (
13), it is asymptotically normally distributed with asymptotic covariance matrix expressed as
. Note that
is a consistent estimator of the asymptotic variance-covariance matrix of
.
Figure 2 shows plots of the PDF of the CDCSN distribution for different values of its parameters. The PDF of the CDCSN distribution is denoted by
in this figure. In
Figure 2 (left), it is evident that the parameter
modifies the symmetry of the PDFs, while in
Figure 2 (right) the parameters
and
modify the mean and dispersion of the PDFs, respectively.
4. The Bernoulli/Doubly-Censored SN Mixture Distribution
As mentioned, when the data set presents detection limits, mixture distributions are often used. In this section, we construct the BDCSN distribution considering the Bernoulli distribution for the discrete mixture variable and the DCSN distribution for the continuous mixture variable.
For the case of proportions, that is, with
and
, we can construct the mixture of Bernoulli and SN (BSN) distributions. On the one hand, we consider that the zero/one observations are well explained by a Bernoulli distributed variable with parameter
, which we denote by
. On the other hand, the remaining observations can be modeled by an SN distribution (or LSN for positive data) with parameters
,
and
. The BSN PDF for
is given by
where
are defined in (
6),
,
,
,
and
, with
denoting the SN PDF and
being the respective CDF. Observe that
and
. In this case, we use the notation
. After some algebraic manipulations, the CDF of
is stated as
where
, for
, is defined as in (
7).
Let
be a random sample of size
n from
, with
and
where
is an indicator function of
. Then, from (
14) and (
15), the log-likelihood function for
based on the data set
is established as
The elements of the score vector obtained from (
16) are detailed in
Appendix A. Maximum likelihood estimates of the parameters
p,
,
,
and
are the solution to the system which follows by equating the scores to zero. From the first two equations, we obtain an unbiased estimator for
p, namely
, while an estimator for
is given by
, corresponding to the proportion of zeros and ones in the sample and the proportion of ones in the subsample of zeros and ones, respectively. The estimates of the remaining three parameters can be obtained from the last three equations using iterative methods.
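The two closed-form estimates described above are simple sample proportions. A minimal Python sketch is given next, in which the names `p_hat` and `gamma_hat` (for the Bernoulli parameter) are assumptions of this illustration.

```python
def bernoulli_part_estimates(y, lower=0.0, upper=1.0):
    """Closed-form estimates for the discrete part of the mixture.

    Returns (p_hat, gamma_hat): the proportion of observations equal to
    `lower` or `upper`, and the proportion of `upper` values among them.
    """
    n = len(y)
    n_lower = sum(1 for value in y if value == lower)
    n_upper = sum(1 for value in y if value == upper)
    n_boundary = n_lower + n_upper
    p_hat = n_boundary / n
    # gamma_hat is undefined when there are no boundary observations
    gamma_hat = n_upper / n_boundary if n_boundary > 0 else float("nan")
    return p_hat, gamma_hat

# Illustrative data (not from the paper): three zeros and one one in ten values
example = [0.0, 0.0, 0.0, 1.0, 0.2, 0.5, 0.7, 0.9, 0.3, 0.4]
p_hat, gamma_hat = bernoulli_part_estimates(example)
```

Because these estimates do not involve the continuous part, they can be computed before the numerical optimization for the remaining three parameters is run.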
The expected Fisher information matrix corresponding to the BSN distribution is derived next. Considering the quantities
defined similarly as in (
8), with
, for
and
, with
,
,
,
and
, we have that the elements of the expected Fisher information matrix, denoted by
, for
, are defined as
,
,
,
,
,
,
,
,
, and are detailed in
Appendix A. Thus, the expected Fisher information matrix for
is given by
Notice from (
17) that the set of parameters for the discrete components
and the continuous components
are mutually orthogonal. Therefore, the expected Fisher information matrix can be written as
where
that is, we have block orthogonality. One of the advantages of this orthogonality is that the corresponding parameters may be estimated separately. Estimation methods were discussed in [
36] when there is orthogonality in relation to a partition of interest. Moreover, the corresponding maximum likelihood estimators are asymptotically independent.
Note that the parameterization
and
in the BSN distribution leads to the BDCSN distribution, for
, with PDF being defined as
where
and
. This is denoted by
.
Figure 3 shows plots of the PDF of the BDCSN distribution for different values of its parameters. In
Figure 3 (left), we note that, as the shape parameter
increases, keeping the other parameters fixed, the shapes of the PDF change. In
Figure 3 (right), by keeping
fixed, observe that the parameter
modifies the location and
modifies the scale of the BDCSN distribution. Note also that, in both panels, different values for
and
are used.
From the PDF given in (
18), the log-likelihood function of
based on the data set
can be written as
Therefore, from (
19), the elements of the score vector for
and
are
and
. For
,
and
given by the non-reparametrized distribution, the solution follows using the BFGS algorithm [
39]. In addition,
is the proportion of zeros in the sample, and
is the proportion of ones in the sample. For this distribution, its Fisher information matrix can be written as
where the elements of
are given by
,
, and
Furthermore, the elements of
are the corresponding elements of
. Given the orthogonality for the two sets of parameters, their estimates are computed separately.
Next, we present the inflated zero, inflated one and zero/one inflated cases of the BDCSN distribution. For the case of zero inflation
with
, its PDF is given by
where
. Then, from the PDF stated in (
20), the log-likelihood function of
based on the data set
can be written as
Hence, from (
21), the score for
is defined as
. By equating it to zero, we obtain the estimate
, that is, the proportion of zeros in the sample. The remaining parameters are estimated similarly as above.
For the case of one inflation
with
, its PDF is stated as
where
. Thus, from the PDF expressed in (
22), the log-likelihood function of
considering the data
is established as
From (
23), we reach the score for
as
By equating it to zero, we obtain the estimate
, that is, the proportion of ones in the sample. The remaining parameters are estimated as in the previous case.
Now, by considering the PDF given in (
18), the BDCSN distribution is obtained, which can be used for fitting positive data with high kurtosis and asymmetry. Then, from the PDF expressed in (
18), the log-likelihood function of
considering the data
is given by
where
and
is the log-likelihood function defined in (
16) for the BSN distribution, with
and
The score vector and information matrices are easily obtained from (
24)–(
26), and the previous results.
We denote the mixture between the Bernoulli and CDCSN distributions by . Note that the Fisher information matrix for the continuous part of the BCDCSN distribution is obtained similarly as for the CDCSN distribution, that is, , and .
5. Monte Carlo Simulation Study
In this section, we evaluate the performance of the maximum likelihood estimators with simulations based on the Monte Carlo method. For this simulation study, we consider the BDCSN distribution.
In this Monte Carlo simulation, the true values assumed for the parameters are , and = 0.3, whereas and . The sample sizes considered are and the number of Monte Carlo replicates is 5000. In each of these replications, we generate random numbers according to Algorithm 1.
Algorithm 1 Generation of random numbers from the BDCSN distribution.
1: Fix values for , , , , and .
2: Generate values for u from .
3: Compute values for x from
where , and is the quantile function of the SN distribution.
4: Repeat steps 2–3 until the required number of data (n) is reached.
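The idea of Algorithm 1 can be sketched in Python as follows. This is not the authors' algorithm: the formula in step 3 is not reproduced here, so the sketch assumes point masses `delta0` at zero and `delta1` at one, with the remaining probability assigned to an SN variable restricted to (0,1). In place of the SN quantile function, it uses the stochastic representation of the SN distribution together with a rejection step, which is a different but simpler technique. All names are hypothetical.

```python
import math
import random

def draw_sn(xi, eta, lam, rng):
    """One draw from SN(xi, eta, lam) via its stochastic representation."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    half_normal = abs(rng.gauss(0.0, 1.0))
    normal = rng.gauss(0.0, 1.0)
    return xi + eta * (delta * half_normal + math.sqrt(1.0 - delta * delta) * normal)

def draw_inflated_sn(n, delta0, delta1, xi, eta, lam, seed=None, max_tries=100000):
    """Sample n values: 0 w.p. delta0, 1 w.p. delta1, else SN restricted to (0,1)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        u = rng.random()
        if u < delta0:
            sample.append(0.0)
        elif u < delta0 + delta1:
            sample.append(1.0)
        else:
            # Rejection step (assumption of this sketch): redraw until inside (0,1)
            for _attempt in range(max_tries):
                x = draw_sn(xi, eta, lam, rng)
                if 0.0 < x < 1.0:
                    sample.append(x)
                    break
            else:
                raise RuntimeError("rejection step failed; check the parameters")
    return sample

# Illustrative parameter values, not the ones used in the paper
sample = draw_inflated_sn(5000, delta0=0.3, delta1=0.1, xi=0.5, eta=0.2, lam=2.0, seed=1)
```

The rejection step is efficient only when the SN component places a reasonable amount of probability inside (0,1); an inversion method based on the quantile function, as in Algorithm 1, avoids this dependence.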
For each parameter and sample size, we report the empirical mean, variance, bias and root of the mean squared error (RMSE) of the maximum likelihood estimators in
Table 1. From this table, in general, note that, as the sample size increases, both the bias and RMSE decrease, as expected. These results empirically show the good performance of the maximum likelihood estimators for the BDCSN distribution parameters.
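The summary measures reported for each parameter can be computed from the Monte Carlo replicates as in the following Python sketch. The function name and the illustrative numbers are assumptions; the variance uses the usual divisor of the number of replicates minus one.

```python
import math

def monte_carlo_summary(estimates, true_value):
    """Empirical mean, variance, bias and RMSE of a list of replicate estimates."""
    m = len(estimates)
    mean = sum(estimates) / m
    variance = sum((e - mean) ** 2 for e in estimates) / (m - 1)
    bias = mean - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / m)
    return {"mean": mean, "variance": variance, "bias": bias, "rmse": rmse}

# Illustrative replicate estimates of a parameter whose true value is 0.3
summary = monte_carlo_summary([0.28, 0.31, 0.33, 0.30], true_value=0.3)
```

Note that the squared RMSE equals the squared bias plus the variance computed with divisor m, so a decreasing RMSE reflects decreases in both components.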
6. Real Data Application 1
In this section, we illustrate the usefulness of the CDCSN and BDCSN distributions considering a first application to a real data set. We name this data set “death”; it corresponds to the proportions of unexplained infant deaths in 5561 Brazilian counties. The data set is available for downloading at
https://datasus.saude.gov.br and contains 3367 zeros (explained deaths) and 174 ones (unexplained deaths).
Table 2 provides descriptive statistics for the death data (uncensored), including central tendency statistics, SD, coefficients of variation (CV), skewness (CS) and kurtosis (CK). From this table, note the presence of skewness and kurtosis in the distribution of the data; see also
Figure 4 (left), which depicts the histogram with boxplot revealing the distributional behavior.
In order to compare our distributions with a standard competitor, the ZOIB distribution [
28] is considered and denoted by ZOIB
. To estimate the ZOIB distribution parameters, the
GAMLSS package of the
R software is used. Then, we fit the CDCSN, BDCSN and ZOIB distributions by using the maximum likelihood method to estimate their parameters based on the
sn package [
41] and its
selm and
dp2cp functions. The
optim function of the
R software and the BFGS algorithm are used for this estimation. As starting values to initiate the algorithm, we use the moment estimates proposed in [
5].
Table 3 provides the maximum likelihood estimates for the considered distributions. In addition, for the CDCSN distribution, using the corresponding estimated CDF, the estimated proportions of censored observations are 0.6039 (zeros) and 0.0272 (ones), respectively, whereas the corresponding empirical proportions are 0.6055 and 0.0313, revealing that the model fits the data well.
Figure 4 shows the estimated CDF of the ZOIB, CDCSN and BDCSN distributions indicating their good fit to the data.
We can numerically compare the distributions studied in this application using the Akaike information criterion (AIC), the Schwarz Bayesian information criterion (BIC), and the corrected Akaike information criterion (CAIC) [
42]. The AIC, BIC, and CAIC are given, respectively, by
where
is the log-likelihood function for
, associated with the underlying distribution, evaluated at
,
d is the dimension of the parameter space, and
n is the size of the data set. All of these criteria are based on the log-likelihood function and penalize the distribution with more parameters. A distribution whose information criterion has a smaller value is better [
43]. The log-likelihood, AIC, BIC and CAIC values computed according to expressions given in (
27), for the distributions studied in this application, are presented in
Table 3. From this table, observe that the BDCSN distribution has a better agreement with the death data. Additionally, in order to compare the CDCSN and BDCSN distributions against the ZOIB distribution, we use the Vuong test [
44], with its statistic being a distance between two distributions measured in terms of the Kullback–Leibler criterion [
45]. Then, when comparing the BDCSN and ZOIB distributions, the
p-value of the Vuong test is <0.001, providing highly significant evidence that the BDCSN distribution fits the death data better than the ZOIB distribution. Similarly, the Vuong
p-value is <0.001 when comparing the BDCSN and DCSN distributions, in favor of the BDCSN distribution, but the ZOIB distribution fits the data better than the CDCSN distribution. These results show that the BDCSN distribution is a viable alternative to the ZOIB distribution for modeling the death data.
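The information criteria used in this comparison can be computed from the maximized log-likelihood, the number of parameters d and the sample size n, as in the Python sketch below. The AIC and BIC forms are standard. The form used for CAIC is the common small-sample correction of the AIC and is an assumption of this sketch, since the expression in (27) is what defines the criterion in the paper.

```python
import math

def information_criteria(loglik, d, n):
    """Return (AIC, BIC, CAIC) from a maximized log-likelihood.

    CAIC is taken here as AIC + 2 d (d + 1) / (n - d - 1), which is an
    assumption of this sketch and requires n > d + 1.
    """
    aic = -2.0 * loglik + 2.0 * d
    bic = -2.0 * loglik + d * math.log(n)
    caic = aic + 2.0 * d * (d + 1) / (n - d - 1)
    return aic, bic, caic

# Illustrative values, not taken from Table 3
aic, bic, caic = information_criteria(loglik=-100.0, d=3, n=50)
```

For a fixed data set, the distribution with the smaller value of a given criterion is preferred; the criteria differ only in how strongly they penalize additional parameters.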
7. Real Data Application 2
In this section, in order to provide evidence that the distributions proposed in this paper fit different types of data, we analyze a second data set corresponding to cable TV penetration in the USA. These data were collected by the Federal Communications Commission (FCC) by means of questionnaires applied to cable community units, which are individual franchise areas. These questionnaires supplied data on prices, costs and cable operator background; see details of the questionnaire in Appendix E of [
46]. We name this data set “FCC”; it corresponds to 282 individual cable TV franchise areas; see [
12] for an analysis of these data. For the FCC data, we study the proportion of subscribers opting for additional channels.
The FCC data set contains 62 zeros, with the clump-at-zero in the histogram representing 21.98% of the data; see bold line in the histogram of
Figure 5 (left). Therefore, we note that this variable has an excess of zeros.
Table 4 provides descriptive statistics for the data set under study (uncensored), including median, mean, SD, CV, CS and CK. From this table, note the presence of skewness and kurtosis in the distribution of the data; see also
Figure 5 (left), which depicts the histogram with boxplot revealing the behavior of the data.
We fit four distributions to the FCC data: (i) the CSN distribution censored at
, (ii) the BDCSN distribution, (iii) the ZOIB distribution (with no ones), and (iv) the normal distribution censored at
. The maximum likelihood estimates for the parameters of these four distributions, with approximate standard errors in parentheses, are reported in
Table 5. The log-likelihood, AIC, BIC and CAIC values computed according to expressions given in (
27), for the distributions studied in this second application, are also presented in
Table 5. From this table, observe that the BDCSN distribution has a better agreement with the FCC data. Additionally, when comparing the BDCSN and ZOIB distributions with the Vuong test, the corresponding
p-value is 0.2124, indicating a statistically non-significant difference and suggesting that both distributions are good alternatives to model these data. When comparing the BDCSN and censored normal distributions, the Vuong
p-value is <0.001, favoring the BDCSN distribution, which also occurs if we compare the CSN and BDCSN distributions. In summary, the BDCSN distribution seems to be a good alternative for modeling the FCC data, which can be visually corroborated by
Figure 5 (right), where the empirical quantile versus theoretical quantile (QQ) plot for the BDCSN distribution is depicted.
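The Vuong comparisons reported in both applications can be illustrated with the following Python sketch for non-nested models. It takes the pointwise log-likelihood contributions of two fitted distributions and returns the standardized statistic and a two-sided p-value from the standard normal distribution. It applies no correction for differing numbers of parameters, and the function name and example values are assumptions, so it should be read as a sketch of the basic test, not as the exact procedure used in the paper.

```python
import math

def vuong_test(loglik_1, loglik_2):
    """Basic Vuong test from pointwise log-likelihoods of two non-nested models.

    A positive statistic favors model 1; a negative one favors model 2.
    """
    n = len(loglik_1)
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    mean_diff = sum(diffs) / n
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / n)
    statistic = math.sqrt(n) * mean_diff / sd_diff
    # Two-sided p-value: 2 * (1 - Phi(|statistic|)) = erfc(|statistic| / sqrt(2))
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))
    return statistic, p_value

# Illustrative pointwise log-likelihoods, not taken from the paper
statistic, p_value = vuong_test([-1.0, -1.2, -0.8, -1.1], [-1.5, -1.3, -1.4, -1.2])
```

A large p-value, such as the 0.2124 reported for the BDCSN versus ZOIB comparison on the FCC data, indicates that the test cannot distinguish the two distributions in terms of their Kullback–Leibler distance to the data-generating distribution.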