Bayesian Inference for Skew-Symmetric Distributions

Skew-symmetric distributions are a popular family of flexible distributions that conveniently model non-normal features such as skewness, kurtosis and multimodality. Unfortunately, their frequentist inference poses several difficulties, which may be adequately addressed by means of a Bayesian approach. This paper reviews the main prior distributions proposed for the parameters of skew-symmetric distributions, with special emphasis on the skew-normal and the skew-t distributions, which are the most prominent skew-symmetric models. The paper focuses on the univariate case in the absence of covariates, but more general models are also discussed.


Introduction
The need to model skewed data led to the development of many skewed distributions which are obtained by adding to a symmetric distribution a parameter that controls skewness [1]. Arguably, the best known example is the skew-normal distribution introduced by Azzalini (1985) [2]. Its probability density function (pdf) is

sn(z; µ, σ, λ) = (2/σ) φ((z − µ)/σ) Φ(λ(z − µ)/σ), z ∈ R, (1)

where µ ∈ R is the location parameter, σ ∈ R+ the scale parameter, both inherited from the standard normal distribution with pdf denoted by φ and cumulative distribution function (cdf) Φ, and λ ∈ R is called the skewness parameter, given that density (1) is asymmetric for λ ≠ 0 and reduces to the normal pdf for λ = 0. Several extensions and generalizations followed (see [3]); a very general and highly popular one is the family of skew-symmetric distributions of Wang et al. (2004) [4] with pdf 2 f(z) Π(z, λ), where f is the symmetric density to be skewed and Π : R × R → [0, 1] is a so-called skewing function satisfying Π(z, λ) + Π(−z, λ) = 1 ∀z, λ ∈ R and Π(z, 0) = 1/2 ∀z ∈ R. The most widely used subfamily of skew-symmetric distributions has densities of the form

(2/σ) f((x − µ)/σ) G(λ(x − µ)/σ), x ∈ R, (2)

where G is any symmetric, univariate, absolutely continuous cumulative distribution function. In (2), µ ∈ R is a location, σ ∈ R+ a scale and λ ∈ R a skewness parameter. The function G might be replaced by a function w(·) satisfying 0 ≤ w(−x) = 1 − w(x) ≤ 1, as done in [4,5]. Different skew
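To fix ideas, density (1) can be evaluated and checked numerically. The following sketch (an illustration, not part of the cited works) uses SciPy's standard normal pdf and cdf for φ and Φ; the function name `skew_normal_pdf` is ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def skew_normal_pdf(z, lam, mu=0.0, sigma=1.0):
    """Skew-normal density (2/sigma) * phi((z-mu)/sigma) * Phi(lam*(z-mu)/sigma)."""
    x = (z - mu) / sigma
    return 2.0 / sigma * norm.pdf(x) * norm.cdf(lam * x)

# lam = 0 recovers the normal density
assert abs(skew_normal_pdf(0.7, 0.0) - norm.pdf(0.7)) < 1e-12

# the density integrates to one for any lam
total = quad(skew_normal_pdf, -np.inf, np.inf, args=(3.0,))[0]
print(round(total, 6))  # -> 1.0
```

Note also the reflection property sn(z; λ) = sn(−z; −λ), which underlies the symmetry about zero of the default priors reviewed below.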

Default Prior Choices in Bayesian Statistics
The Bayesian approach to quantify uncertainty in statistical inference can be broken down into three steps [18]. The first step consists in choosing a joint probability distribution for observable and unobservable quantities, consistently with the available knowledge about the underlying scientific problem and the data collection process. The second step is to condition on the observed data, which is carried out by means of several computational techniques. The third step is assessing the model's fit and interpreting the implications of the resulting posterior distribution. In this section we focus on the first step and review the most common default choices for the prior distribution. We use the following notation in the sequel: θ is our parameter of interest, π(θ) is the prior distribution, π(θ|t) is the posterior distribution given data information t, p(t|θ) is the data likelihood, p(t, θ) = p(t|θ)π(θ) is the joint distribution of t and θ, and p(t) is the marginal distribution of t. With this notation in hand, we of course have that the posterior equals π(θ|t) = π(θ)p(t|θ)/p(t).
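The relation π(θ|t) = π(θ)p(t|θ)/p(t) can be made concrete with a grid approximation; the following sketch (hypothetical data, N(0, 1) prior on a normal mean, all choices ours) recovers the known conjugate posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
t = rng.normal(1.2, 1.0, size=20)        # hypothetical observed data
theta = np.linspace(-4.0, 6.0, 4001)     # grid over the parameter
dtheta = theta[1] - theta[0]

prior = norm.pdf(theta, 0.0, 1.0)                        # pi(theta): N(0, 1)
lik = np.prod(norm.pdf(t[:, None], theta, 1.0), axis=0)  # p(t | theta)
joint = prior * lik                                      # p(t, theta)
posterior = joint / (joint.sum() * dtheta)               # pi(theta | t) = joint / p(t)

# conjugate closed form: N(n*ybar/(n+1), 1/(n+1)) for a N(0, 1) prior
n, ybar = len(t), t.mean()
exact = norm.pdf(theta, n * ybar / (n + 1), np.sqrt(1.0 / (n + 1)))
print(np.max(np.abs(posterior - exact)) < 1e-4)  # -> True
```

The marginal p(t) only appears as the normalizing constant of the grid, which is why it can be ignored when the posterior is known up to proportionality.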

Jeffreys Priors
In Bayesian analysis there are situations in which the available prior information is too vague to be formalized into a probability distribution, too controversial to be acceptable in scientific communities or too complicated to allow for a reliable statistical analysis. Hence the need for priors with minimal effect on the posterior distribution, so that "the chosen prior would let the data speak for themselves" [19]. Reference analysis aims at an "objective" Bayesian solution to statistical inference in the same way as conventional statistical methods, where solutions only depend on model assumptions and observed data.
One of the earliest non-informative (objective) priors is the uniform prior for the Binomial proportion [20,21]. Unfortunately, this prior suffers from its lack of invariance under one-to-one reparameterization. Jeffreys' prior is a non-informative prior which is invariant under one-to-one reparameterization and is proportional to the positive square root of the Fisher information associated with the parameter of interest. For regular models where asymptotic normality holds, the Jeffreys prior enjoys some optimality properties in the absence of nuisance parameters, but suffers from serious difficulties in the presence of nuisance parameters. As a first example, in the Neyman-Scott problem it leads to a strong inconsistency in Bayes estimation of the error variance [22]. As a second example, when estimating the product of two independent normal means, a circular symmetric prior was found to be superior to Jeffreys' prior [23]. As a third example, Jeffreys himself supported the use of another prior for location-scale models.
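The invariance of Jeffreys' prior can be checked directly in the binomial case, where the Fisher information is I(p) = 1/(p(1 − p)) so that the Jeffreys prior is the Beta(1/2, 1/2) distribution. The sketch below (ours, for illustration) pushes the Jeffreys prior computed in the logit parameterization back to the p scale and recovers the same prior.

```python
import numpy as np
from scipy.stats import beta

# Fisher information for one Bernoulli(p) observation
fisher_p = lambda p: 1.0 / (p * (1.0 - p))   # in the p parameterization
fisher_psi = lambda p: p * (1.0 - p)         # in psi = logit(p): I(psi) = p(1-p)

p = np.linspace(0.05, 0.95, 19)

# Jeffreys prior in p: proportional to sqrt(I(p)) -> Beta(1/2, 1/2); B(1/2,1/2) = pi
jeffreys_p = np.sqrt(fisher_p(p)) / np.pi
assert np.allclose(jeffreys_p, beta.pdf(p, 0.5, 0.5))

# invariance: push the psi-scale Jeffreys prior back to the p scale
dpsi_dp = 1.0 / (p * (1.0 - p))              # derivative of logit(p)
pushed_back = np.sqrt(fisher_psi(p)) * dpsi_dp
assert np.allclose(pushed_back / np.pi, jeffreys_p)
print("Jeffreys prior is invariant under the logit reparameterization")
```

The uniform prior fails this check: transforming a uniform on p through the logit does not give a uniform on the logit scale, which is the lack of invariance mentioned above.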

Reference Priors
Intuitively, a reference prior for some real-valued parameter θ is a prior of the form π(θ) = π(θ|T, P) which maximizes the missing information about θ within the class P of prior distributions compatible with the available prior knowledge T [19]. More formally, let D be a set of observations, generated by some random mechanism p(D|θ) that only depends on a real-valued parameter θ ∈ Θ. Furthermore, let t = t(D) ∈ T be any sufficient statistic (which may be the complete data set D). In Shannon's general information theory, the amount of information I^θ[T, π(θ)] which may be expected to be provided by D, or equivalently by t(D), about θ is

I^θ[T, π(θ)] = E_t [ ∫_Θ π(θ|t) log( π(θ|t)/π(θ) ) dθ ], (4)

which is the expected Kullback-Leibler divergence of the prior from the posterior (here E_t indicates that the expectation is taken over t). The functional I^θ[T, π(θ)] is concave, non-negative and invariant under one-to-one transformations of θ. Lindley (1956) [24] and Bernardo (1979) [19] defined the reference prior as the prior maximizing (4). There are some situations where we need the asymptotic maximization of the above expectation, since for a fixed n its maximization might lead to a discrete prior with finitely many jumps, which is hardly compatible with the concept of a diffuse prior [25]. Ref. [19] proved that, in the absence of any nuisance parameter, Jeffreys' prior yields the necessary maximization.

Matching Priors
Matching priors allow for posterior probability statements which have an interpretation as confidence statements in the sampling model. Matching priors aim at achieving a compromise between Bayesian and frequentist inference based on some order of approximation, thus providing default priors for routine use in Bayesian inference and possibly more palatable to frequentist statisticians. The concept of matching prior appears to have been proposed first by Lindley (1958) [26] and several matching priors have been proposed since, such as for example quantile matching priors, matching priors for distribution functions, highest probability density matching priors and matching priors associated with likelihood ratio statistics [27].
In this subsection we illustrate the approach to matching priors introduced by Welch and Peers in the seminal paper [28]. Suppose that Y_1, …, Y_n are i.i.d. random variables with pdf f(y|θ), where θ is real-valued. In addition, assume all the regularity conditions which allow to expand the posterior around the MLE θ̂_n. Furthermore, for 0 < α < 1, let θ^π_{1−α}(Y_1, …, Y_n) ≡ θ^π_{1−α} denote the (1 − α)-th asymptotic posterior quantile of θ based on the prior π, that is, the frequentist coverage of the event {θ ≤ θ^π_{1−α}} equals 1 − α + O(n^{−r/2}) for some r > 0. If r = 1, π is called a first-order matching prior, and if r = 3/2, π is a second-order probability matching prior. For instance, the Jeffreys prior is a first-order probability matching prior in the absence of nuisance parameters. We illustrate this appealing property with an example from [29]. Suppose that Y_1, …, Y_n are i.i.d. N(θ, 1) and that π(θ) = 1, −∞ < θ < ∞. Then the posterior density π(θ|Y_1, …, Y_n) is that of the N(Ȳ_n, n^{−1}) distribution. Denoting by z_{1−α} the 100(1 − α)% quantile of the N(0, 1) distribution, the posterior probability of the event {θ ≤ Ȳ_n + z_{1−α}/√n} equals 1 − α, and so does its frequentist probability under θ. Therefore, the one-sided credible bound Ȳ_n + z_{1−α}/√n for θ has exact frequentist coverage probability 1 − α. This exact matching does not always exist. However, if Y_1, …, Y_n are i.i.d. random variables then θ̂_n|θ ∼_a N(θ, I^{−1}(θ)/n), where I is the expected Fisher information and ∼_a means asymptotically equivalent in distribution. Using the delta method we have g(θ̂_n)|θ ∼_a N(g(θ), (g′(θ))² I^{−1}(θ)/n). Therefore, if g′(θ) = I^{1/2}(θ), then g(θ) = ∫^θ I^{1/2}(t) dt and √n (g(θ̂_n) − g(θ)) given θ is asymptotically N(0, 1). In the absence of nuisance parameters, a first-order matching prior for θ is a solution of the differential equation (d/dθ)[π(θ) I^{−1/2}(θ)] = 0, so that the Jeffreys prior is the unique first-order matching prior; the same does not always hold for second-order probability matching [27].
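The exact coverage in the normal-mean example can be verified by simulation; the sketch below (our illustration of the example from [29]) repeatedly draws samples and checks how often the one-sided credible bound covers the true θ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
theta, n, alpha, reps = 0.7, 25, 0.05, 20000
z = norm.ppf(1 - alpha)

# each row is one sample of size n; the bound is ybar + z/sqrt(n)
ybar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
upper = ybar + z / np.sqrt(n)          # one-sided credible bound under pi(theta) = 1
coverage = np.mean(upper >= theta)     # frequentist coverage of {theta <= upper}
print(round(coverage, 3))              # close to 1 - alpha = 0.95
```

Here the matching is exact for every n, which is precisely the property that matching priors seek to achieve asymptotically in more general models.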
To obtain the second-order matching prior we need an asymptotic expansion of the posterior distribution function up to O(n^{−1}) and the differential equation (5) given by Mukerjee and Dey (1993) [30] and Mukerjee and Ghosh (1997) [31]. Jeffreys' prior is the unique second-order matching prior if it satisfies (5), as happens for location-scale families: for π_J(θ) = I^{1/2}(θ), (5) reduces to the requirement that a certain function of θ be constant for all values of θ. We refer the reader to [29] for more details on cases where, in the absence of nuisance parameters, no second-order probability matching prior exists, and on cases where, in the presence of nuisance parameters, first- and second-order matching priors exist or a second-order matching prior does not.

Prior Choices for the Skew-Normal Distribution
This section reviews the prior distributions proposed for Bayesian inference on the parameters of the skew-normal distribution: the reference prior by Liseo and Loperfido (2006) [32], the matching prior by Cabras et al. (2012) [33] and the informative prior by Canale and Scarpa (2013) [34].

The Reference Prior
Liseo and Loperfido (2006) [32] first proposed a default prior for the shape parameter of the location-scale-free (standard) skew-normal model sn(z; λ) = 2φ(z)Φ(λz), z ∈ R. The associated Jeffreys prior π_J(λ) ∝ I(λ)^{1/2}, where I(λ) is the Fisher information for λ, is proper, symmetric about λ = 0, decreasing in |λ| and its tails are of order O(λ^{−3/2}). This prior is therefore suitable for testing the hypothesis of symmetry, which might be formalized in the skew-normal framework as H_0 : λ = 0 versus H_1 : λ ≠ 0. The same authors investigated the frequentist performance of this prior with simulated data, concluding that the Bayesian approach might be beneficial in easing some inferential difficulties of the frequentist approach for the standard skew-normal distribution.
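The Fisher information for λ has no closed form but is a one-dimensional integral, since the score is z φ(λz)/Φ(λz). The sketch below (ours) computes I(λ) by quadrature, working on the log scale to avoid underflow in the tail of Φ, and checks the symmetry and monotonicity of the resulting prior.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def fisher_lambda(lam):
    """Fisher information for lambda in sn(z; lam) = 2 phi(z) Phi(lam z)."""
    # score: d/dlam log sn = z phi(lam z) / Phi(lam z); exponentiate on the log scale
    integrand = lambda z: 2.0 * z**2 * norm.pdf(z) * np.exp(
        2.0 * norm.logpdf(lam * z) - norm.logcdf(lam * z))
    return quad(integrand, -np.inf, np.inf)[0]

jeffreys = lambda lam: np.sqrt(fisher_lambda(lam))

# at lam = 0 the information is exactly 2/pi, so pi_J(0) = sqrt(2/pi)
assert abs(fisher_lambda(0.0) - 2.0 / np.pi) < 1e-8
# symmetric about zero and decreasing in |lambda|
assert abs(jeffreys(2.0) - jeffreys(-2.0)) < 1e-6
assert jeffreys(4.0) < jeffreys(1.0) < jeffreys(0.0)
print([round(jeffreys(l), 4) for l in (0.0, 1.0, 4.0)])
```

The polynomial decay of `jeffreys` in |λ| is what makes the prior proper yet heavy-tailed enough not to rule out large skewness.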
Ref. [32] also considered a default Bayes analysis for the general scalar case (1), where λ is the parameter of interest and the location parameter µ and the scale parameter σ are the nuisance parameters. They are assumed to be independent of λ and to have a normal-inverse gamma distribution, where Gamma(α, β) denotes a Gamma distribution with parameters α, β > 0. The default prior π(µ, σ) ∝ σ^{−1} is a limiting case and is the conditional reference prior for (µ, σ) given λ. These assumptions allow for a closed-form expression of the marginal likelihood for λ. The proposed method has been successfully applied to the infamous "frontier" dataset (see http://azzalini.stat.unipd.it/SN/frontier.dat), where the maximum likelihood estimate of the skewness parameter λ is infinite. Bayes and Branco (2007) [35] highlighted the advantages of the Bayesian approach and proposed two priors. They considered the stochastic representation (3) of the skew-normal distribution and, following the Bayes-Laplace rule, chose the uniform distribution on the interval [−1, 1] as a prior for δ = λ/√(1 + λ²), thus leading to a t(0, 0.5, 2) distribution as prior for λ, where t(a, b, c) denotes the Student t distribution centered at a ∈ R with scale b > 0 and c > 0 degrees of freedom; this is a non-vague and non-subjective prior. They further proposed the tractable approximation t(0, π²/4, 1/2) for the Jeffreys prior from [32], motivated by an approximation given in [36].
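The change of variables behind the Bayes–Branco prior is elementary: if δ = λ/√(1 + λ²) is uniform on (−1, 1), then p(λ) = (1/2)(1 + λ²)^{−3/2}. The sketch below (ours) checks that this density coincides with a Student t with 2 degrees of freedom, reading the scale of t(0, 0.5, 2) as a squared scale, i.e. SciPy scale 1/√2.

```python
import numpy as np
from scipy.stats import t as student_t

lam = np.linspace(-6.0, 6.0, 241)

# delta = lam / sqrt(1 + lam^2) uniform on (-1, 1) has density 1/2, and
# |d delta / d lam| = (1 + lam^2)^(-3/2), so the induced prior on lam is
induced = 0.5 * (1.0 + lam**2) ** (-1.5)

# a Student t with 2 df and squared scale 1/2 (SciPy scale = 1/sqrt(2))
assert np.allclose(induced, student_t.pdf(lam, df=2, scale=1.0 / np.sqrt(2.0)))
print("uniform prior on delta -> Student t prior on lambda")
```

The heavy t tails again decay like |λ|^{−3}, so the prior is proper while remaining diffuse over large skewness values.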

The Matching Prior
Cabras et al. (2012) [33] proposed another approach towards Bayesian inference about the shape parameter of the skew-normal distribution. It is based on a pseudo-likelihood function and a matching prior distribution for the shape parameter when location and scale parameters are unknown. First, they derive the marginal likelihood

L_m(λ) = ∫ L(λ, η) σ^{−1} dη, (6)

where L(λ, η) = ∏_{i=1}^n sn(y_i; η, λ) is the skew-normal likelihood function, η = (µ, σ) is the nuisance parameter and σ^{−1} is the right-invariant Haar measure on the location-scale group of transformations, whose action on the parameter space leaves λ unchanged. By considering the fact that the marginal likelihood (6) can be approximated by the modified profile likelihood L_mp(λ) of [37], since L_m(λ) = L_mp(λ)(1 + O(n^{−1})) (see [38]), and by invoking results about the use of pseudo-likelihood functions in Bayesian analysis, the matching prior π(λ) is simply proportional to the square root of the inverse of the asymptotic variance of the MLE of λ. Based on Ventura et al. (2009) [39], the matching prior (7) for λ corresponding to (6) involves the constrained MLE η̂_λ of η for a given λ; the entries of the underlying matrix are expectations of functions of Z, where Z follows the standard skew-normal distribution with parameter λ (see [33] for the detailed quantities). However, the prior (7) might be data dependent because of the presence of η̂_λ. A prior for λ which does not suffer from this problem is also derived in [33]: it is proper, symmetric about the origin and has tails of order O(λ^{−3/2}). It also compensates for the possible monotonicity of the modified profile likelihood (6) and possesses good frequentist properties.

The Informative Priors of Canale and Scarpa (2013)
Canale and Scarpa (2013) [34] discuss two informative priors for the skewness parameter of the skew-normal distribution. Their study is motivated by an interesting data set on the marks of first-year undergraduate students in the Economics program at the University of Padua. The skew-normal model is fitted to students' grades in the first mandatory Statistics class. Inference on the grades of previous years shows that the distribution of Statistics grades is skewed to the right around a certain mean, which motivates the use of informative priors.
The first prior is the normal density with hyperparameters reflecting prior beliefs about the expectation and variance of λ in order to center the prior on a particular guess for λ. The resulting posterior belongs to the family of unified skew-normal (SUN) distributions, introduced in [40]. The explicit expressions for the mean and the variance of the posterior are not very tractable, but they allow for a simple interpretation. The second informative prior is itself a skew-normal, motivated by the distribution of grades of university examinations [41]. The skew-normal prior includes location and scale hyperparameters as well as a skewness hyperparameter reflecting the beliefs on the direction of skewness. The posterior distribution also belongs to the class of SUN distributions. The authors set the location hyperparameter of the skew-normal prior equal to zero in order to have rough prior information only on the skewness side of the distribution of the data: the sign of the skewness hyperparameter determines on which semi-axis more prior mass is placed. In both cases the resulting posteriors are intractable, but the SUN parametrization enables efficient sampling methods for posterior computation via Markov Chain Monte Carlo (MCMC). For both prior choices for λ, they specified an independent normal inverse gamma distribution for the location and scale parameters. To perform the related Bayesian inference, the authors presented an algorithm to simulate from the full conditional distribution of the skewness parameter λ given the location and scale parameters. This algorithm uses a Gibbs sampler based on the stochastic representation of the SUN model. To get the posterior, the authors introduced normal latent variables, say η_1, …, η_n. Conditionally on these latent variables, the generic i-th observation is normally distributed with a specific mean and variance.
This way of constructing the Gibbs sampler leads to conjugacy for the location and scale parameters. For the detailed computations we refer the reader to [34]. This sampling method is useful in MCMC methods to approximate the posterior distribution.
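The latent-variable constructions above rest on the classical stochastic representation of the skew-normal: Z = δ|U_0| + √(1 − δ²) U_1 with U_0, U_1 i.i.d. N(0, 1) and δ = λ/√(1 + λ²). The sketch below (ours, not the authors' sampler) verifies the representation against SciPy's `skewnorm` with a Kolmogorov-Smirnov check.

```python
import numpy as np
from scipy.stats import skewnorm, kstest

rng = np.random.default_rng(1)
lam = 3.0
delta = lam / np.sqrt(1.0 + lam**2)

# stochastic representation: Z = delta*|U0| + sqrt(1 - delta^2)*U1
u0 = rng.standard_normal(50000)
u1 = rng.standard_normal(50000)
z = delta * np.abs(u0) + np.sqrt(1.0 - delta**2) * u1

stat = kstest(z, skewnorm(lam).cdf).statistic
print(f"KS statistic {stat:.4f}")   # small: samples match the skew-normal cdf
```

Conditioning on the half-normal latent |U_0| is exactly what renders each observation conditionally normal, which is the source of the conjugacy exploited in the Gibbs sampler.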
We also wish to mention that the MCMC machinery of Bayesian statistics bears particular importance in model selection. Suppose we have a set of models reflecting competing hypotheses about the underlying data set, where each model is characterized by a specific vector of parameters of interest. From the Bayesian viewpoint, these models are compared pairwise through their Bayes factor, which is the ratio of their marginal likelihoods. Computing the marginal likelihood is often not feasible, in particular analytically. We refer the reader to [42] and references therein for estimation methods of the marginal likelihood, specifically in general non-nested models.
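When the marginal likelihood is available in closed form, the Bayes factor is a one-liner. The hypothetical beta-binomial example below (ours, unrelated to the skew-normal application) compares a uniform-prior model against a point null p = 0.5.

```python
import numpy as np
from scipy.special import betaln, comb
from scipy.stats import binom

def log_marginal(k, n, a, b):
    """Log marginal likelihood of k successes in n trials under a Beta(a, b) prior."""
    return np.log(comb(n, k)) + betaln(k + a, n - k + b) - betaln(a, b)

k, n = 14, 20
# M1: p ~ Beta(1, 1) (uniform prior); M0: point null p = 0.5
log_m1 = log_marginal(k, n, 1.0, 1.0)
log_m0 = binom.logpmf(k, n, 0.5)
bayes_factor = np.exp(log_m1 - log_m0)
print(f"BF10 = {bayes_factor:.2f}")  # -> BF10 = 1.29
```

For skew-symmetric models no such closed form exists, which is why the estimation methods surveyed in [42] are needed.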

Prior Choices for Other Skew-Symmetric Distributions
There exists a wide literature on the Bayesian analysis of skew-symmetric distributions different from the skew-normal. Azzalini (1986) [6] and Naranjo et al. (2012) [43] provided a Bayesian analysis of a skewed exponential power distribution. This family includes the symmetric exponential power distributions as well as the skew-normal distribution, and provides flexible distributions with lighter and heavier tails. Interestingly, this family of densities can fit each tail separately. Hossianzadeh and Zare (2016) [44] estimated the parameter of the discrete skewed Laplace distribution by an empirical Bayesian analysis and compared it with the maximum likelihood approach. In what follows, we will first consider the popular skew-t distribution and then focus on two general approaches for skew-symmetric densities.

Jeffreys' Prior for Skew-t Distributions
Skew-t distributions are the best-known alternatives to skew-normal ones, due to their flexibility: they can model any level of skewness and excess kurtosis. However, they pose some further inferential problems, which we illustrate in the simpler case of the Student t distribution with known location and scale parameters. Ref. [45] discussed that the likelihood function approaches infinity when the degrees of freedom go to zero, and showed that the supremum of the likelihood function may be achieved when the degrees of freedom go to infinity. There have been several frequentist attempts to solve the inferential problems of the skew-t distribution with all parameters unknown. Sartori (2006) [46] used the modified score function, which requires the degrees of freedom to be fixed. Azzalini and Genton (2008) [10] proposed a deviance approach which is only partially satisfactory, since its implementation might not be straightforward. We illustrate the problem with the univariate skew-t distribution. The deviance approach replaces the boundary maximum likelihood estimate of (λ, ν) by the smallest vector (λ_0, ν_0) for which the null hypothesis H_0 : (λ, ν) = (λ_0, ν_0) is not rejected. The deviance approach assumes that such a smallest vector exists, but neither theoretical results nor simulation studies support this assumption. For these reasons we cannot exclude the existence of samples admitting two vectors (λ_a, ν_a) and (λ_b, ν_b) satisfying λ_a > λ_b and ν_a < ν_b for which the hypotheses H_0 : (λ, ν) = (λ_a, ν_a) and H_0 : (λ, ν) = (λ_b, ν_b) are not rejected, while no vector (λ_0, ν_0) exists satisfying (λ_0, ν_0) < (λ_a, ν_a), (λ_0, ν_0) < (λ_b, ν_b) and for which the null hypothesis H_0 : (λ, ν) = (λ_0, ν_0) is not rejected.
We reckon that this situation is likely to happen, given that when either of the parameters λ or υ is large the shape of the skew-t density function remains almost unchanged if one parameter is substantially increased while the other is substantially decreased.
Given these shortcomings, Branco, Genton and Liseo (2012) [47] studied Bayesian analysis for various forms of skew-t distributions. Denoting by ν > 0 the degrees of freedom parameter, they first considered skew-t densities of the form 2 t(z|ν) T(λz|ν), where t(·|ν) and T(·|ν) are the pdf and the cdf of a Student t distribution with ν degrees of freedom. The corresponding Jeffreys prior for the skewness parameter λ when ν is known and finite is proper, symmetric about zero and with tails of order O(λ^{−3/2}). The same authors further investigated the case of the skew-t distribution of [5]. The corresponding Jeffreys prior for λ for known and finite ν is again proper, symmetric about zero, with tails of order O(λ^{−3/2}).

Jeffreys Prior for General Skew-Symmetric Models
Rubio and Liseo (2014) [48] investigated the Jeffreys prior for the skewness parameter of a general class of scalar skew-symmetric models. The Jeffreys prior cannot be used for some skew-symmetric models at λ = 0 because of the singularity of the Fisher information at this point; see the Introduction for details about this issue. They showed that under mild conditions, including knowledge of location and scale parameters, the Jeffreys prior of the skewness parameter λ in the skew-symmetric model is proper, symmetric about zero and has tails of order O(|λ|^{−3/2}). They used these results to construct the independence Jeffreys prior for the model including the location and scale parameters: it is the product of the Jeffreys prior of each parameter, under the assumption that the remaining parameters are held fixed. The same authors also provided sufficient conditions for the existence of the posterior distribution and briefly discussed the existence of a proper independence Jeffreys prior for the skew-logistic model described in [7] and gave a Student t approximation to that prior.
The approach in [48] might be sketched as follows. The Fisher information I(λ) for the shape parameter in (2) involves a function h(z), see (8), and the Jeffreys prior for λ is π_J(λ) ∝ I(λ)^{1/2}. The first step for approximating the function h(z) in (8) is to see that h is the density of the transformed random variable Z = G^{−1}(X), where the random variable X ∼ Beta(1/2, 1/2) and G is the cumulative distribution function of an absolutely continuous symmetric random variable, that is, G(−z) = 1 − G(z) for all z ∈ R. The corresponding cumulative distribution function is H(z) = F_X(G(z)), the Beta(1/2, 1/2) cdf evaluated at G(z). An approximation of h in terms of the density g = G′ might then be achieved by a suitable choice of the scale parameter σ. The quality of this approximation depends on the thickness of the tails of g. The authors illustrate this point by considering the case of g(z) being a Student t density with ν degrees of freedom and comparing the approximations using quantiles. Alternatively, σ might be chosen to minimize the Kullback-Leibler divergence between h(z) and g(z/σ)/σ. Ref. [32] approximated the Jeffreys prior using the parameterization δ = λ/√(1 + λ²). They also proposed to use the symmetric Beta prior Beta(τ, τ) for β = (δ + 1)/2, thus leading to a Student t prior for λ which reduces to the Cauchy distribution for τ = 0.5, see [32].

Distance-Based Priors
As already mentioned, the shape parameter λ impacts not only the skewness in skew-symmetric models, but also the mean, the variance, the modes and the kurtosis. Dette et al. (2018) [49] dealt with this issue by assigning a prior distribution to the perturbation effect of the skewness parameter, quantified by the Total Variation distance between the symmetric density f and its skew-symmetric counterpart 2 f(x)G[λw(x)], where w is an odd function, rather than to the skewness parameter itself. The rationale behind this choice is that such a distance is more easily interpretable than the parameter λ, and hence informative as well as non-informative priors can more readily be found for the effect of λ than for λ itself. The Total Variation distance between two probability measures µ(·) and ν(·) on R is

d_TV(µ, ν) = sup_A |µ(A) − ν(A)|,

that is, the maximum difference between the probabilities assigned to the same event by the two measures. It is bounded between zero and one, 0 ≤ d_TV(µ, ν) ≤ 1. The Total Variation distance between f and 2 f(x)G[λw(x)] is given by

d_TV(f, G|λ) = (1/2) ∫ |1 − 2G[λw(x)]| f(x) dx.

The symmetry of G implies that d_TV(f, G|λ) is not a one-to-one function of the parameter λ: d_TV(f, G|λ) = d_TV(f, G|−λ). It is therefore convenient to use M_TV(λ) = sign(λ) d_TV(f, G|λ) as a measure of perturbation, due to its appealing properties: M_TV(0) = 0, the largest/smallest value of M_TV(λ) is ±0.5 (attained in the limit λ → ±∞), so that M_TV(λ) ∈ (−1/2, 1/2), and M_TV(λ) is invariant under affine transformations. Moreover, M_TV(λ) = 0.5 (1 − 2 S_{f;G}(0; λ)), where S_{f;G} is the cdf associated with the skew-symmetric density s_{f;G}(x; λ) = 2 f(x)G[λw(x)], which means that M_TV(λ) is a re-scaling of the difference between the mass cumulated on either side of the origin for a fixed choice of f and G by the distribution S_{f;G}. In summary, M_TV(λ) quantifies the impact of the parameter λ on the relocation of the probability mass on either side of the symmetry center of f.
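These properties are easy to check numerically; the sketch below (ours) takes f = φ, G = Φ and w(x) = x, computes d_TV(λ) by quadrature and verifies the limits 0 and 1/2.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def d_tv(lam):
    """Total Variation distance between phi(x) and 2 phi(x) Phi(lam x)."""
    integrand = lambda x: 0.5 * norm.pdf(x) * abs(1.0 - 2.0 * norm.cdf(lam * x))
    return quad(integrand, -np.inf, np.inf)[0]

m_tv = lambda lam: np.sign(lam) * d_tv(lam)

print(round(m_tv(0.0), 6))    # -> 0.0   (no perturbation)
print(round(m_tv(-1.0), 4))   # -> -0.25 (mass moved to the left)
print(round(d_tv(50.0), 4))   # close to 0.5 (half-normal limit)
```

The value at λ = ±1 is exactly 1/4 here, since 2Φ(X) − 1 is uniform on (−1, 1) when X ∼ N(0, 1); the bound 1/2 is approached only as the skewed density degenerates to a half-normal.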
The proposed measure M_TV allows one to build both informative and non-informative priors for the perturbation parameter λ in skew-symmetric models. Since M_TV ∈ (−1/2, 1/2) is an injective function of λ, any proper prior choice for M_TV(λ) induces a proper prior on λ. Ref. [49] proposed, for M_TV(λ), Beta priors rescaled to the support (−1/2, 1/2), with density involving the beta function B(α, β) with hyperparameters α, β > 0, and derived the corresponding proper prior induced on λ. Priors of this type are called Beta Total Variation priors and are denoted by BTV(α, β); they are flexible, interpretable and lead to tractable posterior distributions. The behaviour of the prior BTV(α, β) is well illustrated by the special case BTV(1, 1), that is, a uniform prior giving equal probability mass to any pair of subintervals of equal length belonging to the support. If g is a bounded pdf and ∫₀^∞ w(x) f(x) dx < ∞, then the prior on λ induced by BTV(1, 1) is well-defined for all λ and is given by expression (9) in [49]. Since (9) does not have a closed form, the authors proposed to approximate it by a Cauchy distribution centered at the origin and with scale parameter equal to 0.92. A Monte Carlo study showed that the proposed non-informative prior induces a posterior distribution with good frequentist properties, similar to those obtained with the Jeffreys prior.
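Under BTV(1, 1) the induced prior on λ is simply the derivative of M_TV(λ), since M_TV is uniform on (−1/2, 1/2). The sketch below (ours, again for f = φ, G = Φ, w(x) = x) differentiates M_TV numerically and compares the value at the origin with the Cauchy(0, 0.92) approximation.

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad

def m_tv(lam):
    """Signed TV perturbation measure for f = phi, G = Phi, w(x) = x."""
    integrand = lambda x: 0.5 * norm.pdf(x) * abs(1.0 - 2.0 * norm.cdf(lam * x))
    return np.sign(lam) * quad(integrand, -np.inf, np.inf, epsabs=1e-12)[0]

def btv11_pdf(lam, eps=1e-5):
    """BTV(1,1) prior on lambda: dM_TV/dlambda, by a central difference."""
    return (m_tv(lam + eps) - m_tv(lam - eps)) / (2.0 * eps)

# near the origin the slope is 1/pi, close to the Cauchy(0, 0.92) value
print(round(btv11_pdf(0.0), 4))            # close to 1/pi ~ 0.3183
print(round(cauchy(0, 0.92).pdf(0.0), 4))  # -> 0.3459
```

The two curves are not identical, but the Cauchy surrogate is proper, symmetric and heavy-tailed, which is all the Monte Carlo study in [49] requires.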

Prior Choices in the Presence of Kurtosis Parameters
Rubio and Steel (2015) [50] proposed a general strategy for constructing weakly informative priors for kurtosis parameters by assigning a uniform prior to a bounded measure of kurtosis applied to the symmetric baseline density f(·|δ), in which δ is the tail parameter and is a one-to-one function of the kurtosis. This methodology, used in [51], induces a proper prior on δ that can be interpreted as weakly informative, in that it assigns a flat prior to a function that incorporates the influence of the parameter δ on the shape of the density. This prior can be coupled with the Jeffreys prior for the skewness parameter in order to produce a joint prior for (δ, λ) in skew-symmetric models by using p(λ, δ) = p(λ|δ)p(δ), where p(λ|δ) is the Jeffreys prior of λ for fixed δ. For each value of δ the tails of p(λ|δ) are of order O(|λ|^{−3/2}). A simulation study showed that this prior produces a posterior density with good frequentist properties.

Overview on Related Topics
So far this paper has focussed on the univariate case without covariates. This section briefly reviews some of the literature on more general settings related to skew-symmetric distributions.
Ref. [52] proposed a general population Monte Carlo algorithm in order to conduct a full Bayesian analysis of the multivariate skew-normal distribution, also in the presence of constrained parameters. Since the prior distribution approximates the actual reference prior for the shape parameter vector, this approach can be considered as a weakly informative prior. In addition, a generalization to the matrix variate regression model with skew-normal error is also provided.
Ref. [53] carried out a Bayesian analysis of a p-variate skew-t distribution by providing a new parameterization, considering a set of non-informative priors and designing a sampler to explore the posterior distribution of the parameters. The methodology can be extended to multivariate regression models with skewed errors and also to stochastic frontier models.
Ref. [54] investigated the time series of electricity spot prices, which exhibit heavy tails and skewness. The authors conducted Bayesian inference on the multivariate skew-t distribution by putting a normal prior on the multi-dimensional skewness parameter.
Ref. [51] proposed a general non-informative structure for regression models with skew-symmetric errors, showed that under some mild conditions the resulting posterior distribution is proper, and extended the results to the case where the response variables are censored. The authors also investigated accelerated failure time models, which are relevant in survival analysis. Different prior distributions have been implemented on the skewness parameter of the skew-normal model, including a Jeffreys prior, a matching prior, an informative prior and a uniform, non-informative prior on the parameter δ = λ/√(1 + λ²), leading to the proper prior π(λ) = (1/2)(1 + λ²)^{−3/2} on λ.
Ref. [55] used finite mixtures of skewed distributions to model flow cytometry data, in order to describe their skewness, kurtosis and heterogeneity. The authors developed Bayesian inference based on data augmentation and MCMC sampling using the aforementioned model. Data augmentation in this case is based on the stochastic representation of the skew-normal distribution in terms of a random-effects model with truncated normal random effects. Finite mixtures of skew-normals admit a Gibbs sampling scheme that draws from standard densities only. The same MCMC scheme is extended to mixtures of skew-t distributions by considering the skew-t distribution as a scale mixture of skew-normals.
Ref. [56] proposed a new class of distributions by introducing a skewness parameter in multivariate elliptically symmetric densities. This class of densities contains many standard families such as skew-t and skew-normal distributions. They condition on some unobserved variables commonly used in regression modelling and model stock market returns, security options or risky financial assets subject to shocks. Within the Bayesian realm, they show inter alia that there exist posterior distributions and moments for regression coefficients derived under improper priors.
Linear mixed models (LMM) are commonly used to analyze repeated measures data since they allow for flexible modelling of within-subject correlations. LMM for continuous responses mostly assume that the random effects and the within-subject errors are normally distributed, which can be unrealistic. Ref. [57] considered the less restrictive assumption of skew-normality and Bayesian inference based on prior distributions very similar to non-informative ones. They illustrated the proposed approach with the Framingham cholesterol data, obtained from a well-known long-term study aimed at investigating the relationship between various risk factors and diseases and at characterizing the natural history of chronic circulatory diseases.

Discussion
In this paper we have provided an overview of the various proposals for Bayesian inference within skew-symmetric models. We hope that the reader will consider it a helpful tool and source of information on this research domain. We refer the interested reader to the simulation study and real data analysis of the recent paper [49] for a performance comparison between several of the prior proposals described above. Digging further into performance comparisons is a promising research task that would give a more complete picture of which prior to use in which situation when dealing with skew-symmetric distributions.
A referee remarked that in the general case the posterior distribution is multimodal and it is therefore necessary to impose some conditions ensuring unimodality. Log-concavity implies unimodality and it is preserved under convolution, marginalization, affine transformations and conditioning. For example, the assumption that the joint distribution of the parameter and the observations is log-concave implies that the posterior distribution is log-concave, too. We illustrate this point with a simple example. Assume that we sampled just one observation from a standard skew-normal distribution and that our prior distribution on the shape parameter is standard normal: f(z|λ) = 2φ(z)Φ(λz) and π(λ) = φ(λ). The joint distribution of the observation and the parameter is then f(z, λ) = 2φ(z)φ(λ)Φ(λz). As a function of λ for fixed z, its logarithm is, up to an additive constant, log φ(λ) + log Φ(λz), a sum of concave functions. Hence, without further calculation, we know that f(λ|z) is log-concave and hence unimodal, too. MAP estimates are then uniquely defined and can be easily derived by noticing that the posterior distribution is skew-normal: π(λ|z) = 2φ(λ)Φ(λz). Ref. [58] provides a thorough review of the literature on log-concavity, both in the univariate and in the multivariate case. Should the assumption of log-concavity prove too restrictive, one could resort to other multivariate generalizations of unimodality, such as block-unimodality, which has already appeared in the Bayesian literature [59].
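The log-concavity of the posterior π(λ|z) = 2φ(λ)Φ(λz) in λ can be checked numerically; the sketch below (ours, with a hypothetical observation z = 1.5) verifies that the discrete second derivative of the log posterior is everywhere negative on a grid.

```python
import numpy as np
from scipy.stats import norm

z = 1.5                       # the single (hypothetical) observation
lam = np.linspace(-6.0, 6.0, 1201)

# posterior: pi(lambda | z) = 2 phi(lambda) Phi(lambda z), a skew-normal in lambda
log_post = np.log(2.0) + norm.logpdf(lam) + norm.logcdf(lam * z)

second_diff = np.diff(log_post, 2)   # discrete second derivative
assert np.all(second_diff < 0)       # log-concave => unimodal posterior
print("posterior log-density is concave in lambda")
```

Indeed, d²/dλ² log π(λ|z) = −1 + z² (log Φ)″(λz) ≤ −1, since log Φ is concave, so the posterior is strictly log-concave for every z.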