Properties and Applications of a New Family of Skew Distributions

: We introduce two families of continuous distribution functions with not-necessarily symmetric densities, which contain a parent distribution as a special case. The two families proposed depend on two parameters and are presented as an alternative to the skew normal distribution and other proposals in the statistical literature. The density functions of these new families are given by a closed expression which allows us to easily compute probabilities, moments and related quantities. The second family can exhibit bimodality and its standardized fourth central moment (kurtosis) can be lower than that of the Azzalini skew normal distribution. Since the second proposed family can be bimodal we fit two well-known data set with this feature as applications. We concentrate attention on the case in which the normal distribution is the parent distribution but some consideration is given to other parent distributions, such as the logistic distribution.


Introduction
There are many situations in which empirical data show slight or marked asymmetry. This is frequently the case, for example, with actuarial and financial data which, in addition to this feature, have heavy tails reflecting the existence of extreme values. The se features mean that the data can not be adequately modeled by the Gauss (or normal) distribution. Furthermore, bimodal distributions appear naturally in many different scenarios. For example, in certain disease patterns, as well as in certain cancer incidence curves. Behind the bimodality (and multimodality as well) of some cancer incidence curves, and their study, clinicians can improve their understanding of cancer, the development process as well as the potential characteristics that identify cancer and that separate a particular type of cancer of all other types. This occurs, for example, in cases where there are two peaks of occurrence per age. The se cancers include Kaposi's sarcoma and Hodgkin's lymphoma. The latter type of cancer has two peaks of occurrence: in young people adults and middle-aged adults. On the other hand, the normal skew distribution appears naturally in stochastic frontier analysis when a normal distribution is assumed to represent the noise or idiosyncratic component and a half-normal distribution to represent the inefficiency term, in the event that the researcher imposes inefficient behavior on all firms in the sample of interest. See, for instance [1]. Recently, [2] introduces (using a finite mixture model) the zero inefficiency stochastic frontier model which can accommodate the presence of both efficient and inefficient firms in the sample by appearing various bimodal scenarios. The refore, it seems plausible to try to obtain families of distributions that incorporate bias to the normal distribution but that at the same time are more versatile in the sense of being able to adapt to the bimodal scenario that appears in different situations.
Although there are various mechanisms to obtain skewed distributions from an initial that is not skewed, (Two well-known procedures that allow to generalize an initial probability distribution, symmetric or not, are those provided in the works of [3,4], among others). Our attention here will focus on the mechanism for this purpose introduced by [5], which enjoys an undoubted popularity and has been the subject of research in numerous works. Let g and G be, respectively, the probability density function (pdf) and the cumulative distribution function (cdf) of a symmetric distribution. A random variate Z is said to have a skew distribution if its pdf is given by This family of distributions has been widely studied as an extension of the normal distribution by means of a shape parameter, λ, which accounts for the skewness. In this case g and G are replaced in Equation (1) by the pdf and cdf of the standard normal distribution and the resulting distribution is called the skew normal distribution. It should be pointed out that the function g does not have to be precisely the derivative of the cdf G to ensure that the pdf given in Equation (1) is a genuine pdf, although this case has not been studied in depth in the statistical literature. Following the notation provided in Reference [6] we denote the family of distributions given by g(z) = 2φ(z)Φ(λz), where φ and Φ are the pdf and cdf of the standard normal distribution, respectively, by SN = {SN(λ) : λ ∈ IR }. Furthermore, when a random variable follows a skew normal distribution with location parameter −∞ < µ < ∞ and scale parameter σ > 0 we will write SN(λ, µ, σ).
In this article, a new generalization of the family of skew distributions given in Equation (1) is proposed, which also includes the skew family of distributions of Azzalini as a particular case; that is, the expression (1). The methodology used is based on the combination of Azzalini's proposal and a result provided by [7] which led us to add a new parameter to the family (1). Later, from this new family a second family, very similar to the first, is introduced. This new family of distributions can exhibit bimodality and its standardized fourth central moment (kurtosis) can be lower than the kurtosis of the Azzalini skew normal distribution (and can be positive or negative).
In recent decades, starting from Azzalini's proposal, several generalizations and extensions of the skew-normal distribution have been introduced (see for example [8]). For multivariate extensions, see References [9][10][11], among others. The methods applied in the present paper can be considered as extensions and alternatives to the well-known skew-normal distribution (see [5,12]), whose properties (see [12,13]), and corresponding estimation [14] have been widely discussed. Other ways of obtaining skewed normal distributions have also been introduced, such as the one proposed by Reference [15], the Balakrishnan skew-normal density in Reference [16], the proposed model of Reference [17] and the generalized normal distribution in References [18][19][20], among others. For an exhaustive and comprehensive study of the skew-normal distribution, see the recent book by Reference [21].
The class of probability models proposed in the present paper can also be considered as alternatives or as approximations to the usual collective risk models in actuarial settings (see [22,23] among others). Data sets in these settings are typically skewed and the generalized models of the present paper expected to provide better fits than the standard models. In collective risk settings the right tail of the distribution is of considerable interest since the likelihood of large claims is of concern. In addition, the total claim distribution is of interest. Normal approximations are frequently resorted to when dealing with these variables. The use of more flexible generalized normal models can be expected to yield improvement.
The organization of this paper is as follows. The main result from which we constructed the two proposed families is shown in Section 2. Due to the importance that the normal distribution plays in numerous problems of applied statistics we dedicate a complete section, Section 3, to the study of this distribution. For this purpose, the pdf, which appears in closed form, is shown for the two families. We also give expressions for the mean, variance and the third and fourth standardized cumulant, to compare with their equivalents corresponding to the classical (Azzalini) skew normal distribution. In Section 4, the parameter estimation problem is studied. In order to obtain numerical solutions to the maximum likelihood (ML) estimation problem, suitable software has to be used. Multivariate extensions are described briefly in Section 5. Some examples and applications are described in Section 6. Finally, some conclusions are drawn and promising fields for further research are proposed in the last Section.

Main Results
We recall (see [24]) that if X and Y are independent and indentically distributed random variables with a finite fractional moment and if for all real λ, Pr(X + λY > 0) = 1/2, then they are symmetric. Also, the following Theorem, which appears in [7], is required for the main result which will appear later. Theorem 1. (see [7]) Let G be the cdf of an arbitrary distribution that is symmetric about zero. The n, a −a G(z) dz = a, ∀a ∈ IR + . (2) Expression (2) in Reference [25] also establishes this assertion.

The First Family of Skew Distributions
The following result presents the key contribution of this work, consisting of proposing, given a family of symmetrical distributions, a more general family not necessarily symmetric that includes as a particular case the first family. Theorem 2. Let X and Y two random variables with symmetric cdf's G X (x) and G Y (y) and pdf's g X (x) and g Y (y), respectively. The n, represents a genuine pdf for α ∈ IR − {0} and λ ∈ IR .
Proof. Without loss of generality, asume that X and Y are independent random variables. Taking into account the fact that X − λY is symmetric and using the result provided in Theorem 1, we get Hence the result.
Expression (3) can instead be viewed in the following form related to an infinite mixture construction. Let G X (ξ + λy), λ ∈ IR, ξ ∈ IR and y ∈ IR , be the cdf of a symmetric distribution with support in the real line. Suppose now that ξ is random and follows a uniform distribution in the interval [−α, α], then is also a genuine cdf symmetric around zero. That is, H X (y) = 1 − H X (−y). Now, Equation (3) is derived by taking into account the fact that 2g Y (y)H X (y) is a genuine pdf. Another elegant way to see that Equation (3) defines a genuine pdf is to consider the argument given in Lemma 1 in [15]. That is, let The results in the following proposition are readily verified and consequently are stated without proof. ( (3) and consider the two random variates Z 1 and Z 2 following the pdf (3) with parameters λ 1 ∈ IR and λ 2 ∈ IR , respectively, then, if Because of Result (ii) in Proposition 1, an identifiability problem will arise in (3) if we allow α to assume both positive and negative values. To avoid this problem we can and will restrict α to be non-negative when discussing inference for this model. Observe that (iv) establishes that when α → 0 we get as a special case the well studied skew family of distributions appearing in References [5,12].
In most cases the density function (3) does not have a simple, closed form but it can be computed numerically. However, closed-form expressions for the pdf can be obtained for some specific choices of well-known distributions. For example, if we utilize the pdf and cdf of the logistic distribution with location parameter µ = 0 and scale parameter s = 1 in (3), i.e., take g Y (y) = exp(−y)/(1 + exp(−y)) 2 and G X (x) = (1 + exp(−x)) −1 , respectively, then (3) becomes On the other hand, combining the pdf of the standard normal distribution with the cdf of the logistic distribution we have, after applying (3) the new pdf If in Equation (3), we use a normal pdf and a normal cdf, i.e., take g Y (y) = φ(y) and G X (x) = Φ(x), where φ(·) and Φ(·) represent the pdf and cdf of the standard normal distribution, respectively, then (3) becomes: which does not appear to be a very attractive analytical expression. However it is not intractable and does represent a flexible two parameter family of densities. Figure 1 shows some graphs of this distribution in comparison with the skew normal distribution with the same mean. Since the mean and variance do not have a closed form for this distribution, we have chosen to calculate the mean numerically and match the mean of the SN distribution with parameters (λ, µ, σ) to obtain the value of the skew parameter λ. In all cases, except for the first graph, both distributions have the same mean and approximately the same variance.) and with parameters λ, µ and σ. Similar to the latter, the distribution is unimodal. Furthermore, the degree of skewness increases when λ grows.
Positive skewness corresponds to the case λ > 0. As can be seen, the new distribution can take the same shape as the normal skew distribution, even having one less parameter when the highest probability mass percentage is around the ordinate axis. Otherwise, shape and scale parameters will have to be incorporated, as will be proposed later. The refore, taking into account the Ockham's razor principle, it would seem logical, if one had to choose between both models, to opt for the new modeling proposed in this work. Other possibilities for which we can get simple expressions for the corresponding pdf by applying (3) include the hyperbolic secant distributions (see [26,27], among others) and the ArcSin distributions (see [28]).
On some occasions, it is convenient expressed (3) as which is obtained after the change of variable u = z + λy in (3). In fact the family of skew distributions given in [5] can alternatively be obtained from (5) by applying the first mean value Theorem to the integral appearing on (5) in the interval [λy − α, λy + α]. To see this take c = λy in the well known formula

The Second Family of Skew Distributions
Now we present the second family of skew distributions proposed in this work. This is derived from Equation (5) as follows.
Theorem 3. If g Y (y) be a density function that is symmetric about zero and G X a cdf also symmetric about zero, then, is a valid pdf for α ∈ IR and λ ∈ IR .
Proof. From Equation (3) it follows that If we now take the derivative with respect to α on both sides in (7) and apply the Fundamental Theorem of calculus we get Thus Equation (6) is a genuine pdf.
It can be verified that the pdf's given in (6) also satisfy the properties listed in Note that the two proposed families, (3) and (6), are different and the only density belonging to both families is the basic density g Y (y). A major difference between the two proposed families is that in the first family α is not permitted to take the value zero while that is permitted in the second family. Both families are very similar and differ markedly only in small regions of the support of the distributions. To see this, note that applying the trapezoid rule to (3) we get which coincides with (5).
Because G X (α) = 1 − G X (−α), from (6) we get that when λ = 0, f Y (y) = g Y (y) while for α = 0 we get the skew family of distributions proposed in Reference [5]. Again, it can be verified that f Y (y; λ, α) = f Y (y; λ, −α) for the family given in Equation (6). Thus the same identifiability problem arises for the model (6) as did in the model (3) if we allow α to assume both positive and negative values. To avoid this problem here too we can and will restrict α to be non-negative when discussing inference for the model. If desired, a more general class than the one proposed in (6) is one corresponding to finite mixtures of densities of the form (6) as follows where the δ i 's are positve and ∑ m i=1 δ i = 1. At this point, we give a more general result than the one provided in [5,29,30] for a symmetric random variable.

Proposition 2.
If U be a random variable that is symmetric about zero with cdf given by G(u), pdf g(u) and if λ and α are two real numbers, then, Proof. This is obtained in a straightforward way directly from the result given in Theorem 3 by integrating on both sides of the equality over the support (−∞, ∞).
When U follows a standard normal distribution (9) reduces to a result which is well known in the statistical literature (see [5,29,30]).

The Normal Distribution Case
Of greater interest, because it is expressed in a simpler formulation, is the pdf obtained from Equation (6) when g and G are replaced by the pdf and cdf of the standard normal distribution, respectively. This is given by In the folowing discussion, when a random variate X has its pdf given by Equation (10) it will be denoted by X ∼ GSN(λ, α). Figure 2 show some graphs of this pdf and the corresponding skew normal pdf with the same mean and variance. (In this case, the skew parameter λ of the SN distribution has been set so that, once equal to the mean and variance of the new distribution, the values of µ and σ were obtained numerically so that both distributions should have the same mean and the same variance. It can be seen that the new model is very versatile and that the value of the parameters provide a distribution which can exhibit unimodality and bimodality. Again, as with Figure 1, the distribution can take approximately the same shape as the normal skew distribution with fewer parameter).
We next provide the moment generating function of the family given in Equation (10).

Proposition 3.
The moment generating function of a random variable X having its pdf given by Equation (10) is of the form Proof. It is straightforward following the same argument as the one used in Reference [5] in order to get the moment generating function of the skew-normal distribution.
Moments can then be readily obtained by differentiation of Equation (11). In particular, the mean and variance are given by var(X) = 1 − (λbδ) 2 exp −(αδ) 2 , where b = √ 2/π. Table 1 shows the mean and variance of the proposed distribution and the corresponding ones for the skew normal distribution for some special cases of parameters. Similar to case of the skew normal distribution, it can be verified that E(X 2 ) = 1. Another important property that GSN(λ, α) shares with the skew normal distribution is that if Z ∼ GSN(λ, α) then Z 2 ∼ χ 2 1 for all values of λ and α. In complete moments can also be studied following the work of Reference [31]. The third (skewness) and fourth (kurtosis) standardised cumulants are given by, and E(X) and var(X) are given by Equations (12) and (13), respectively. Comparisons of these values with the standard skew normal distribution are shown in Table 2. As can be seen, the standardized fourth central moment (kurtosis) can be lower for the GSN distribution than it is for Azzalini's skew normal distribution.  Let Φ λ,α (z) denote the cdf of Z ∼ GSN(λ, α).

Proposition 4.
If Z ∼ GSN(λ, α), then its cdf is given by where T(x, a) is the Owen's function see [32] given by Proof. The proof is direct by applying result B.21 in Reference [21]. ∼ GSN(λ, α), then it follows that:

Proposition 5. If Z
Proof. The proof is also direct this time by applying the result B.23 in [21].
To end this Section we provide a result related to probability transformations which generalises a result appearing in Reference [5] and provided also in Reference [31]. Proposition 6. Let W and Z be independent random variables with distribution N(0, 1) and GSN(λ, α), respectively. The n, the random variable Y = (hW + kZ) Proof. It can be proved following the same argument as that used in Lemma 1 in Reference [31].

Estimation
The family of distributions GSN(λ, α) can be generalized by means of a linear transformation in order to introduce a location and a scale parameter adding more flexibility to the model (10). We thus will consider the location-scale generalization of the skewnormal distribution defined as the distribution of Y = µ + σX, where X ∼ GSN(λ, α) given in Equation (10), where µ ∈ IR and σ > 0. Its pdf is given by When λ = α = 0 this pdf reduces to the N(µ, σ) and when α = 0 to the SN(µ, σ, λ). For a sample y = {y 1 , . . . , y n } we can estimate the four parameters, Θ = (λ, α, µ, σ), of the model given in Equation (14) by a direct search for the maximum of the log-likelihood surface given by From Equation (15) we get the normal equations which are given by: Since it is not possible to obtain closed expressions for the maximum likelihood estimators, algorithms based on numerical methods, such as Newton-Raphson or Broyden-Fletcher-Goldfarb-Sanno (BGGS), among others, will have to be used. It is recommended to use different seed points as initial values to ensure that the solution obtained constitutes a global maximum of the logarithm-likelihood function.
The standard errors of the estimators can be obtained by inverting the Hessian matrix. Both Mathematica and WinRats have at least two methods for this purpose. The first is to use Cholesky factors (this package is available on the web upon request). The second, faster method, involves by finite differentiation. Furthermore, WinRats package also offers the possibility to get the maximum of the log-likelihood directly giving us the elements of the Fisher information matrix. In fact, for the examples considered later these two packages were used to get the maximum likelihood estimators in a fast way.

Multivariate Versions
Multivariate extensions of the univariate distributions arise in an easy way as we can see in the next result. Theorem 4. Let X and Y be two random variables where X ∼ N(0, 1) and Y ∼ N (m) (0, Σ).
The n, represents a genuine pdf for α ∈ IR − {0} and λ ∈ IR m .
Proof. Without loss of generality, asume that X and Y are independent random variables.
Taking into account the fact that X − λ T Y is symmetric and using the result provided in Theorem 1, we get Hence the result.

Remark 1.
The only important required property of the distribution of Y is that, for any λ, the random variable λ T Y is symmetric about zero. The only required property for the distribution of X is that it be symmetric about zero. This is true for the above Theorem and the next.
The n, is a valid pdf for α ∈ IR and λ ∈ IR m .
Proof. From (16) it follows that If we now take the derivative with respect to α on both sides in Equation (18) and apply the Fundamental Theorem of calculus we get Thus (17) is a genuine pdf.
Note that if we set α = 0 in (17), we obtain which was one of the first multivariate skew-normal models to appear in the literature. See [10,11], for instance. Figure 3 shows the density for bivariate generalized skew-normal (BGSN) model for some combinations of the parameters. Perusal of Figure 3, will confirm that the density of the BGSN model exhibits a more interesting array of possible shapes than do many of its competitors. The flexibility of this model can be expected to be useful in fitting the model to a variety of different data sets.

Numerical Illustrations
In this section, three examples for the GSN model given in Equation (14) are carried out and the results are compared with the flexible epsilon skew-normal (FESN) model introduced by Reference [33] in the first example, with the mixture of two normals (MN) model in the second example and flexible skew-normal (FSN) model introduced by Reference [34] in the third example. The three densities respectively are given by: where φ(·) and Φ(·) denote the density and distribution functions of the standard normal distribution, c λ = 1 − Φ(λ), µ, µ 1 , λ, α ∈ R, σ, σ 1 ∈ R + , −1 < ε < 1 and 0 ≤ p ≤ 1.
We use these three models, since they have been used in the applied statistics literature to explain bimodal empirical data. We chose the MN model since it is a classic model that is used to model bimodal data sets, we chose the FESN model since it is one of the first bimodal extensions of the family of epsilon-skew-simétric distributions and We chose the FSN model since it is one of the first bimodal extensions of the family of skewsimétric distributions.

Example 1
The data in this example is a set of fiber levels for 315 patients and is available online at http://Lib.stat.cmu.edu/datasets/Plasma_Retinol and contains values for 14 variables for each patient. For our analysis we will use only the variable called Fiber (Grams of fiber consumed per day). Low levels of this variable may be associated with higher risk of certain types of cancer. Descriptive statistics for the data set are displayed in Table 3. In the table b 1 and b 2 denote sample skewness and kurtosis coefficients. Note that the data exhibits a high level of kurtosis. The estimated values of the parameters for the two models are shown in Table 4 together with the standard errors (SE) in parentheses. The table also includes the maximum of the log-likelihood function ( max ), the Akaike information criterion (AIC) and the consistent Akaike information criterion (CAIC), proposed in References [35,36] respectively. Amodel with a lower AIC or CAIC is preferred to a model with a higher value. Graphs of histogram of the data and fitted densities are given in Figure 4. As it can be seen, the GSN distribution is the better of the two models with regard to reflecting the nature of the empirical data. All computations here were done using Mathematica v.11.0 and WinRATS v.7.0. These codes are available from the authors upon request.

Example 2
We consider the variables M-Sweet available in the database creaminess of cream cheese which can be found at http://www.models.kvl.dk/research/data/Cream/index. asp. Table 5 shows summary statistics for the M-Sweet dataset. In Table 6, presents parameter estimates (SE) for both, the MN and GSN models. It can be seen that the log-likelihood for the GSN model is higher than the for the MN model. The AIC and CAIC criterion are used again to compare the estimated models, it can be seen that the GSN model presents the best fit (smallest AIC and CAIC values). Finally, the histogram of the data and the fitted densities are shown in Figure 5.

Example 3
Finally, data corresponding to the age and frequency of cancer cataloged as Kaposi's sarcoma have been taken. This is a type of cancer that can form masses in the skin, lymph nodes, or other organs without distinguishing the sub-types. The data has been collected from the website of the Office for National Statistics (ONS, section Health statistics) and it can be seen in Table A1 in the Appendix A. It can be observed that there is a higher incidence in individuals with an age around 25 years as well as for those with an age around 60 years. The se records were taken during the years 1995-2016 and correspond to different regions of the United Kingdom. Table 7 shows summary statistics for the Kaposi's sarcoma dataset. The two fitted models are represented in Figure 6 and the correspondent estimated values can be seen in Table 8. The GSN model presents better fit for the data, since the AIC and CAIC values are smaller.

Conclusions
We have proposed two families of skew distributions which can be considered as alternatives to the well-known skew normal distribution for fitting skewed data.
Future research could address the following issue. We can ask whether the methodology proposed here can be applied to the generalized skew-normal distribution proposed by Reference [9] to obtain a more flexible distribution. For the case with standard normal components, one might consider the following model which is an average of two Arnold and Beaver densities.
where λ, α 1 , α 2 ∈ IR . Note that this model is not obtainable by methods analogous to those used to develop the model (10). However, it is a simple more flexible extension of the Arnold-Beaver model. But, once we recognize it as a mixture with equal weights, it is resonable to add more flexibility by considering unequal weights, as follows where γ ∈ [0, 1].