The flexible Gumbel distribution: A new model for inference about the mode

A new unimodal distribution family indexed by the mode and three other parameters is derived from a mixture of a Gumbel distribution for the maximum and a Gumbel distribution for the minimum. Properties of the proposed distribution are explored, including model identifiability and flexibility in capturing heavy-tailed data that exhibit different directions of skewness over a wide range. Both frequentist and Bayesian methods are developed to infer parameters in the new distribution. Simulation studies are conducted to demonstrate satisfactory performance of both methods. By fitting the proposed model to simulated data and data from an application in hydrology, it is shown that the proposed flexible distribution is especially suitable for data containing extreme values in either direction, with the mode being a location parameter of interest. Using the proposed unimodal distribution, one can easily formulate a regression model concerning the mode of a response given covariates. We apply this model to data from an application in criminology to reveal interesting data features that are obscured by outliers. Computer programs for implementing all considered inference methods in the study are available at https://github.com/rh8liuqy/flexible_Gumbel.


Introduction
The mean, median, and mode are the three most commonly used measure of central tendency of data.When data contain outliers that cause heavy tails or are potentially skewed, the mode is a more sensible representation of the central location of data than the mean or median.The timely review on mode estimation and its application by Chacón [2020] and references therein provide many examples in various fields of research where the mode serves as a more informative representative value of data.Most existing methods developed to draw inference for the mode are semi-/non-parametric in nature, starting from early works on direct estimation in the 1960s [Chernoff, 1964, Dalenius, 1965, Venter, 1967] to more recent works based on kernel density estimation [Chen, 2018] and quantile-based methods [Ota et al., 2019, Zhang et al., 2021].Two main factors contribute to the enduring preference for semi-/non-parametric methods for mode estimation, despite the typically less straightforward implementation and lower efficiency compared to parametric counterparts.First, parametric models often impose strict constraints on the relationship between the mode and other location parameters, which may not hold in certain applications.Second, very few existing named distribution families that allow the inclusion of both symmetric and asymmetric distributions in the same family can be parameterized so that it is indexed by the mode as the location parameter along with other parameters, such as shape or scale parameters.In this study, we alleviate concerns raised by both reasons that discourage the use of parametric methods for mode estimation by formulating a flexible distribution indexed by the (unique) mode and parameters controlling the shape and scale.
When it comes to modeling heavy-tailed data, the Gumbel distribution [Gumbel, 1941] is arguably one of the most widely used models in many disciplines.Indeed, as a case of the generalized extreme value distribution [Jenkinson, 1955], the Gumbel distribution for the maximum (or minimum) is well-suited for modeling extremely large (or small) events that produce heavy-tailed data.For example, it is often used in hydrology to predict extreme rainfall and flood frequency [Loaiciga and Leipnik, 1999, Koutsoyiannis, 2004, Dawley et al., 2019].In econometrics, the Gumbel distribution plays an important role in modeling extreme movements of stock prices and large changes in interest rates [Bali, 2003, Pratiwi et al., 2019].The Gumbel distribution is indexed by the mode and a scale parameter, and thus is convenient for mode estimation.However, the Gumbel distribution for the maximum (or minimum) is right-skewed (or left-skewed) with the skewness fixed at around 1.14 (or −1.14), and the kurtosis fixed at 5.4 across the entire distribution family.Thus it may be too rigid for scenarios where the direction and extremeness of outliers presented in data are initially unclear, or when the direction and level of skewness are unknown beforehand.Constructions of more flexible distributions that overcome these limitations have been proposed.In particular, Cooray [2010] applied a logarithmic transformation on a random variable following the odd Weibull distribution to obtain the so-called generalized Gumbel distribution that includes the Gumbel distribution as a subfamily.But the mode of the generalized Gumbel distribution is not indexed by a location parameter, or an explicit function of other model parameters.Shin et al. [2015] considered mixture distributions with one of the components being the Gumbel distribution and the other component(s) being Gumbel of the same skewness direction or a different distribution, such as the gamma distribution.Besides the same drawback pointed out for the generalized Gumbel distribution, it is difficult to formulate a unimodal distribution following their construction of mixtures, and thus their proposed models are unsuitable when unimodality is a feature required to make inferring the mode meaningful, such as in a regression setting, as in modal regression [Yao et al., 2012, Yao and Li, 2013, Chen, 2018].
With heavy-tailed data in mind and the mode as the location parameter of interest, we construct a new unimodal distribution that does not impose stringent constraints on how the mode relates to other central tendency measures, while allowing a range of kurtosis wide enough to capture heavy tails at either direction, as well as different degrees and directions of skewness.This new distribution, called the flexible Gumbel (FG) distribution, is presented in Section 2, where we study properties of the distribution and discuss identifiability of the model.We present a frequentist method and a Bayesian method for estimating parameters in the FG distribution in Section 3. Finite sample performance of these methods is inspected in simulation study in Section 4, followed by an application of the FG distribution in hydrology in Section 5. Section 6 demonstrates fitting a modal regression model based on the FG distribution to data from a criminology study.Section 7 highlights the contributions of our study and outlines future research directions.

The flexible Gumbel distribution
The probability density function (pdf) of the Gumbel distribution for the maximum is given by where θ is the mode and σ > 0 is a scale parameter.The pdf of the Gumbel distribution for the minimum with mode θ and a scale parameter σ is given by We define a unimodal distribution for a random variable Y via a mixture of the two Gumbel distributions specified by ( 1) and ( 2) that share the same mode θ while allowing different scale parameters, σ 1 and σ 2 , in the two components.We call the resultant distribution the flexible Gumbel distribution, FG for short, with the pdf given by where w ∈ [0, 1] is the mixing proportion parameter.Henceforth, we state that Y ∼ FG(θ, σ 1 , σ 2 , w) if Y follows the distribution specified by the pdf in (3).
For each component distribution of FG, the mean and median are both some simple shift of the mode, with each shift solely determined by the scale parameter.Because the two components in (3) share a common mode θ, the mode of Y is also θ, and thus the FG distribution is convenient to use when one aims to infer the mode as a central tendency measure, or to formulate parametric modal regression models [Bourguignon et al., 2020, Zhou and Huang, 2020, 2022].
One can easily show that the mean of where γ ≈ 0.5772 is the Euler-Mascheroni constant.Thus the discrepancy between the mode and the mean of FG depends on three other parameters that control the scale and shape of the distribution.The median of Y , denoted by m, is the solution to the following equation, Even though this equation cannot be solved for m explicitly to reveal the median in closed form, it is clear that m − θ also depends on all three other parameters of FG.In conclusion, the relationships between the three central tendency measures of FG are more versatile than those under a Gumbel distribution for the maximum or a Gumbel distribution for the minimum.
The variance of , which does not depend on the mode parameter θ.Obviously, by setting w = 0 or 1, FG(θ, σ 1 , σ 2 , w) reduces to one of the Gumbel components.Unlike a Gumbel distribution that only has one direction of skewness at a fixed level (of ±1.14), an FG distribution can be left-skewed, or right-skewed, or symmetric.More specifically, with the mode fixed at zero when studying the skewness and kurtosis of FG, one can show (as outlined in Appendix A) that the third central moment of Y is given by where w = 1 − w, and ζ(3) ≈ 1.202 is Apéry's constant.Although the direction of skewness is not immediately clear from (4), one may consider a special case with w = 0.5 where (4) reduces to 2 )}.Now one can see that FG(θ, σ 1 , σ 2 , 0.5) is symmetric if and only if σ 1 = σ 2 , and it is left-skewed (or right-skewed) when σ 1 is less (or greater) than σ 2 .The kurtosis of Y can also be derived straightforwardly, with a more lengthy expression than (4) that we omit here, which may not shed much light on its magnitude except that it varies as the scale parameters and the mixing proportion vary, instead of fixing at 5.4 as for a Gumbel distribution.An R Shiny app depicting the pdf of FG(θ, σ 1 , σ 2 , w) with user-specified parameter values is available at https: //qingyang.shinyapps.io/gumbel_mixture/,created and maintained by the first author.Along with the density function curve, the Shiny app provides skewness and kurtosis of the depicted FG density.From there one can see that the skewness can be much lower than −1.14 or higher than 1.14, and the kurtosis can be much higher than 5.4, suggesting that inference based on FG can be more robust to outliers than when a Gumbel distribution is assumed for data at hand, without imposing stringent assumption on the skewness of the underlying distribution.
The flexibility of a mixture distribution usually comes with concerns relating to identifiability [Teicher, 1961, 1963, Yakowitz and Spragins, 1968].In particular, there is the notorious issue of label switching when fitting a finite mixture model [Redner and Walker, 1984].Take the family of two-component normal mixture (NM) distributions as an example, defined by {NM(µ 1 , σ 1 , µ 2 , σ 2 , w) : wN (µ 1 , σ 2 1 ) + (1 − w)N (µ 1 , σ 2 2 ), for σ 1 , σ 2 > 0 and w ∈ [0, 1]}.When fitting a data set assuming a normal mixture distribution, one cannot distinguish between, for instance, NM(1, 2, 3, 4, 0.2) and NM(3, 4, 1, 2, 0.8), since the likelihood of the data is identical under these two mixture distributions.As another example, for data from a normal distribution, a two-component normal mixture with two identical normal components and an arbitrary mixing proportion w ∈ [0, 1] leads to the same likelihood, and thus w cannot be identified.Teicher [1963] showed that imposing a lexicographical order for the normal components resolves the issue of non-identifiability, which also excludes mixtures with two identical components in the above normal mixture family.Unlike normal mixtures of which all components are in the same family of normal distributions, the FG distribution results from mixing two components from different families, i.e., a Gumbel distribution for the maximum and a Gumbel distribution for the minimum, with weight w on the former component.By construction, FG does not have the label-switching issue.And, we show in Appendix B by invoking Theorem 1 in Teicher [1963] that the so-constructed mixture distribution is always identifiable even when the true distribution is a (one-component) Gumbel distribution.

Frequentist inference method
Based on a random sample of size n from the FG distribution, y = {y i } n i=1 , maximum likelihood estimators (MLE) of all model parameters in Ω = (θ, σ 1 , σ 2 , w) can be obtained via the expectation-maximization (EM) algorithm [Dempster et al., 1977].To apply the EM algorithm, we introduce a latent variable Z that follows Bernoulli(w) such that the joint likelihood of (Y, Z) is where f 1 (y; θ, σ 1 ) is the pdf in (1) evaluated at y with the scale parameter σ = σ 1 , and f 2 (y; θ, σ 2 ) is the pdf in (2) evaluated at y with the scale parameter σ = σ 2 .A random sample of size n from Bernoulli(w), z = {z i } n i=1 , is viewed as missing data, and {(y i , z i )} n i=1 are viewed as the complete data in the EM algorithm.It can be shown [Boos and Stefanski, 2013, Section 2.6.3a] that integrating z out from (5) indeed gives the density of Y in (3).The log-likelihood based on the density in (3) is usually not well-behaved as an objective function to be maximized with respect to Ω.By considering the complete-data log-likelihood based on (5), one can often maximize an objective function that is better-behaved as we demonstrate next.More specifically, the complete-data log-likelihood is Starting from an initial estimate of Ω (at the zero-th iteration), denoted by Ω (0) , one iterates two steps referred to as the E-step and the M-step until a convergence criterion is met.In the E-step at the (t + 1)-th iteration, one computes the conditional expectation of ( 6) given y while assuming the true parameter value to be 2 , w (t) ), that is, E Ω (t) {ℓ(Ω; y, z)|y}.This conditional expectation can be shown to be where . (8) In the M-step at the (t + 1)-th iteration, one maximizes Q(Ω|Ω (t) ) with respect to Ω to obtain an updated estimate i /n, and the other three updated estimates in Ω (t+1) are obtained numerically.
The EM algorithm avoids directly maximizing the log-likelihood based on (3) by (iteratively) maximizing the betterbehaved Q(Ω|Ω (t) ) in ( 7).To further improve the numerical efficiency, we exploit the expectation-conditional maximization (ECM) algorithm [Meng and Rubin, 1993], which replaces the M-step with a sequence of simpler conditional maximizations referred to as the CM-step.Essentially, within each M -step, we update w via w i /n, then we update θ using w (t+1) along with (σ 2 ), followed by updating σ 1 using w (t+1) , the recently updated θ, and σ (t) 2 ; lastly, we update σ 2 using w (t+1) and the recently updated θ and σ 1 .There is no closed-form updating formula for the latter three updates, but each of them can now be easily updated by most well-accepted one-dimensional optimization algorithms, such as the Newton-Raphson method.To ensure convergence at the global maximum, as recommended by Wu [1983], one should implement the ECM algorithm several rounds with different starting values Ω (0) .
After obtaining the MLE of Ω, denoted by Ω, we propose to use the sandwich variance estimator [Boos and Stefanski, 2013, Chapter 7] to estimate the variance-covariance matrix of Ω.One may also estimate the variance-covariance of Ω based on the observed information matrix as described in Louis [1982] and Oakes [1999].The benefit of using the sandwich variance estimator is its robustness to model misspecification.Finally, the EM and ECM algorithms bear a strong resemblance to data augmentation [Wei and Tanner, 1990] in the Bayesian framework, which we turn to next for inferring Ω.

Bayesian inference method
In the Bayesian framework, we formulate hierarchical models starting with the FG distribution, followed by independent weakly informative or non-informative priors for elements in Ω, θ ∼ N (0, 10 4 ), σ j ∼ inv-Gamma(1, 1), for j = 1, 2, w ∼ Uniform(0, 1), where inv-Gamma refers to the inverse Gamma distribution.We choose the above prior for the scale parameters by following the prior selection for variance parameters suggested in Gelman [2006].
We employ the Metropolis-within-Gibbs sampler [Müller, 1991[Müller, , 1993] ] to obtain an estimate of Ω from the posterior distribution of Ω given observed data y.Similar to the EM/ECM algorithm in Section 3.1, the latent variable Z is also introduced as a device to carry out data augmentation.The iterative algorithm presented next is based on the following two conditional distributions that can be easily proved, where z −i results from dropping z i from z, and the first result above is also from which ( 8) is deduced.
• Step 4: For j = 1, 2, draw σj from N (σ (t) j , τ j ), and update σ (t) In Steps 3 and 4, p(•|•) refers to a conditional pdf generically, τ 0 , τ 1 , and τ 2 are three positive tuning parameters whose values should be chosen so that the acceptance rate at each step is around 23% [Gelman et al., 1997].To draw samples from the joint posterior distribution, there are numerous ways to design the Markov chain Monte Carlo (MCMC) sampler.Instead of the Metropolis-within-Gibbs sampler we adopt here, one may use other existing MCMC software, such as STAN [Stan Development Team, 2021], JAGS [Plummer et al., 2003], and BUGS [Spiegelhalter et al., 1996, Lunn et al., 2009], the former two of which are demonstrated in the GitHub repository (https://github.com/rh8liuqy/flexible_Gumbel).After obtaining enough high-quality samples from the joint posterior distribution p(θ, σ 1 , σ 2 , w|y), Bayesian inference is straightforward, including point estimation, interval estimation, and uncertainty assessment.

Simulation study
Large-sample properties of MLEs and likelihood-based Bayesian inference under a correct model for data have been well studied.To assess finite-sample performance of the frequentist method and Bayesian method proposed in Section 3, we carried out simulation study with two specific aims: first, to compare inference results from the two methods; second, to compare goodness of fit for data from distributions outside of the FG family when one assumes an FG distribution and when one assumes a two-component normal mixture distribution for the data.
In the first experiment, denoted as (E1) hereafter, we considered two FG distributions as true data generating mechanisms, FG(θ = 1, σ 1 = 1, σ 2 = 1, w = 0.4) and FG(θ = 0, σ 1 = 1, σ 2 = 5, w = 0.5).This design creates two FG distributions with the second one more skewed and variable than the first.Based on a random sample of size n = 50 from the first FG distribution, we estimated Ω by applying the ECM algorithm and the Metropolis-within-Gibbs algorithm.Similarly, based on a random sample of size n ∈ {100, 200}, we implemented the two algorithms to estimate Ω.The former algorithm produced the MLE of Ω, and we used the median of the posterior distribution of Ω at convergence of the latter algorithm as another point estimate of Ω. Table 1 presents summary statistics of these estimates of Ω and estimates of the corresponding standard deviation across 1000 Monte Carlo replicates under each simulation setting specified by the design of an FG distribution and the level of n.
According to Table 1, all parameter estimates in Ω are reasonably close to the true values.When the sample size is as small as 50, estimates for Ω resulting from the frequentist method are still similar to those from the Bayesian inference method, although estimates for the standard deviations of these point estimators can be fairly different.We do not find such discrepancy surprising because, for the frequentist method where we use the sandwich variance estimator to infer the uncertainty of an MLE for Ω, the asymptotic properties associated with MLEs that support the use of a sandwich variance estimator may not take effect yet at the current sample size; and, for the Bayesian method, the quantification of standard deviation can be sensitive to the choice of priors when n is small.These are confirmed by the diminishing discrepancy between the two sets of standard deviation estimates when n = 100, 200.A closer inspection of the reported empirical mean of estimates for Ω along with their empirical standard error suggests that, when n = 100, the Bayesian method may slightly underestimate σ 2 , the larger of the two scale parameters of FG.We believe that this is due to the inverse gamma prior imposed on the scale parameters that is sharply peaked near zero, and thus the posterior median of the larger scale parameter tends to be pulled downwards when the sample size is not large.As the sample size increases to 200, this trend of underestimation appears to diminish.The empirical means of the standard deviation estimates from both methods are close to the corresponding empirical standard deviations, which indicate that the variability of a point estimator is accurately estimated when n is not small, whether it is based on the sandwich variance estimator in the frequentist framework, or based on the posterior sampling in the Bayesian framework.In summary, the methods proposed in Section 3 under both frameworks provide reliable inference for Ω along with accurate uncertainty assessment of the point estimators when data arise from an FG distribution.
Among all existing mixture distributions, normal mixtures probably have the longest history and are most referenced in the literature.In another experiment, we compared the model fitting of normal mixture with that of FG when data arise from three heavy-tailed distributions: (E2) Laplace with the location parameter equal to zero and the scale parameter equal to 2; (E3) a mixture of two Gumbel distributions for the maximum, with a common mode at zero, scale parameters in the two components equal to 2 and 6, respectively, and the mixing proportion equal to 0.5; (E4) a Student-t distribution with degrees of freedom equal to 5. From each of the three distributions in (E2)-(E4), we generated a random sample of size n = 200, following which we fit a two-component normal mixture model via the EM algorithm implemented using the R package mixtools, and also fit an FG model via the two algorithms described in Section 3.This model fitting exercise was repeated for 1000 Monte Carlo replicates under each of (E2)-(E4).
We used an empirical version of the Kullback-Leibler divergence as the metric to assess the quality of modeling fitting.We denote the true density function as p(•), and let p(•) be a generic estimated density resulting from one of the three considered model fitting strategies.Under each setting in (E2)-(E4), a random sample of size 50000, (x 1 , . . ., x 50000 ), was generated from the true distribution, and an empirical version of the Kullback-Leibler divergence from p(•) to p(•)  is given by D KL = (1/50000) 50000 i=1 log(p(x i )/p(x i )). Figure 1 shows the boxplots of D KL across 1000 Monte Carlo replicates corresponding to each model fitting scheme under (E2)-(E4).
Judging from Figure 1, the FG distribution clearly outperforms the normal mixture when fitting data from any of the three heavy-tailed distributions in (E2)-(E4), and results from the frequentist method are comparable with those from the Bayesian method for fitting an FG model.When implementing the ECM algorithm for fitting the FG model and the EM algorithm for fitting the normal mixture, we set a maximum number of iterations at 1000.Our ECM algorithm always converged in the simulation, i.e., converged to a stationary point within 1000 iterations.However, the EM algorithm for fitting a normal mixture often had trouble achieving that, with more difficulty when data come from a heavier-tailed distribution.More specifically, under (E4), which has the highest kurtosis (equal to 9) among the three settings, the EM algorithm failed to converge in 59.9% of all Monte Carlo replicates; under (E2), which has the second highest kurtosis (equal to 6), it failed to converge in 6.7% of the replicates.Results associated with the normal mixture from these failing replicates were not included when producing the boxplots in Figure 1.In conclusion, the FG distribution is more suitable for symmetric or asymmetric heavy-tailed data than the normal mixture distribution.

An application in hydrology
Daily maximum water elevation changes of a water body, such as ocean, lake, and wetland, are of interest in hydrologic research.These changes may be close to zero on most days but can be extremely large or small under extreme weather.From the National Water Information System (https://waterdata.usgs.gov/),we downloaded water elevation data for Lake Murray near Columbia, South Carolina, United States, recorded from September 18, 2020 to September 18, 2021.The water elevation change of a given day was calculated by contrasting the maximum elevation and the minimum elevation on that day, returning a positive (negative) value if the maximum record of the day comes after Table 2: Frequentist and Bayesian inferences about daily maximum water elevation changes of Lake Murray, South Carolina, United States.Besides parameter estimates (under point.est)and the estimated standard deviations of these parameter estimates (under s.d.), 95% confidence intervals of the parameters from the frequentist method, and 95% credible intervals from the Bayesian method are also provided (under lower 95 and upper 95).(before) the minimum record on the same day.We fit the FG distribution to the resultant data with n = 366 records using the frequentist method and the Bayesian method, with results presented in Table 2.The two inference methods produced very similar estimates for most parameters, although small differences were observed.For example, one would estimate the mode of daily maximum water elevation change to be −0.795feet based on the frequentist method, but estimate it to be −0.485feet using the Bayesian method.The discrepancy between these two mode estimates is minimal considering that the daily maximum water elevation changes range from −38 feet to 49.4 feet within this one year.Taking into account the uncertainty in these point estimates, we do not interpret any of these differences as statistically significant because a parameter estimate from one method always falls in the interval estimate for the same parameter from the other method according to Table 2. Using parameter estimates in Table 2 in the aforementioned R Shiny app, we obtained an estimated skewness of −0.102 and an estimated kurtosis of 6.384 based on the frequentist inference results, whereas the Bayes inference yielded an estimated skewness of 0.058 and an estimated kurtosis of 6.074.Combining these two sets of results, we concluded that the underlying distribution of daily maximum water elevation change may be nearly symmetrical, with outliers on both tails that cause tails heavier than that of a Gumbel distribution.
Figure 2 presents the estimated density functions from these two methods, in contrast with the estimated density curve resulting from fitting the data to a two-component normal mixture, and a kernel density estimate using a Gaussian kernel with the bandwidth selected according to the method proposed by Sheather and Jones [1991].The last estimate is fully nonparametric and served as a benchmark against which the other three density estimates were assessed graphically.
The kernel density estimate is more flexible at describing varying tail behaviors, but such flexibility comes at the cost of statistical efficiency and interpretability.With the wiggly tails evident in Figure 2 for this estimate, we suspected a certain level of overfitting of the kernel density estimate.This often happens to kernel-based estimation of a function around a region where data are scarce, with a bandwidth not large enough for the region.Between the two FG density estimates, the difference is almost negligible.They both track the kernel density estimate closely over a wide range of the support around the mode.The mode of the estimated normal mixture density is close to the other three mode estimates, but the tails are much lighter than those of the other three estimated densities.
Besides comparing the three parametric density estimates pictorially, we also used the Monte-Carlo-based one-sample Kolmogorov-Smirnov test to assess the goodness of fit.The p-values from this test are 0.223, 0.312, and 0.106 for the frequentist FG density estimate, the Bayesian FG density estimate, and the estimated normal mixture density, respectively.Although none of the p-values are low enough to indicate a lack of fit (at significance level 0.05 for example), the p-value associated with the normal mixture is much lower than those for FG.Hence, between the two null hypotheses, with one assuming an FG distribution and the other claiming a normal mixture for this data set, we find even weaker evidence to reject the former than data evidence against the latter.It is also worth noting that the Kolmogorov-Smirnov test is known to have low power to detect deviations from a posited distribution that occur in the tails [Mason and Schuenemeyer, 1983].This may explain the above-0.05p-value for the normal mixture fit of the data even though the tail of this posited distribution may be too thin for the current data.Finally, as suggested by a referee, we computed the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) after fitting the FG distribution and the normal mixture distribution to the data.When assuming an FG distribution, we obtained an AIC/BIC of 2506.028/2521.638from the frequentist method, and 2506.299/2521.909from the Bayesian method.When assuming a mixture normal, we found the values of AIC and BIC to be 2499.821and 2519.334,respectively.Even though the fitted normal mixture distribution produces a lower AIC/BIC than the fitted FG distribution, we argue that these metrics focus more on the overall goodness of fit, and can be more forgiving when it comes to a relatively poor fit for certain feature of a distribution, such as the tail behavior.Figure 2: Four density estimates based on daily maximum water elevation changes in Lake Murray, including the kernel density estimate (solid line), the estimated FG density from the ECM algorithm (dotted line), the estimated FG density from the Bayesian method (dashed line), and the estimated normal mixture density (dash-dotted line).
Table 3: Frequentist and Bayesian modal regression models based on the FG distribution fitted to the crime data.
Besides parameter estimates (under point.est)and the estimated standard deviations of these parameter estimates (under s.d.), 95% confidence intervals of the parameters from the frequentist method, and 95% credible intervals from the Bayesian method are also provided (under lower 95 and upper 95).We used STAN to implement the Bayesian inference for the Lake Murray data.The code can be found in https://github.com/rh8liuqy/flexible_Gumbel,where the JAGS code for fitting the FG distribution is also provided.The posterior output is given in Appendix C. The output provided there indicates that our MCMC chain has converged (see the Rhat statistics).

An application in criminology
With the location parameter θ signified in the FG distribution as the mode, it is straightforward to formulate a modal regression model that explores the relationship between the response variable and predictors.To demonstrate the formulation of a modal regression model based on the FG distribution, we analyzed a data set from Agresti et al. [2021] in the area of criminology.This data set contains the percentage of college education, poverty percentage, metropolitan rate, and murder rate for the 50 states in the United States and the District of Columbia from the year 2003.The poverty percentage is the percentage of the residents with income below the poverty level; the metropolitan rate is defined as the percentage of the population living in the metropolitan area; and the murder rate is the annual number of murders per 100, 000 people in the population.
We fit the following modal regression model to investigate the association between the murder rate (Y ) and the aforementioned demographic variables, where β = [β 0 , β 1 , β 2 , β 3 ] ⊤ includes all regression coefficients.For the prior elicitation in Bayesian inference, we assume that β 0 , . . ., β 3 i.i.d ∼ N (0, 10 4 ) and use the same priors for σ 1 , σ 2 and w as those in Section 3.2.As a more conventional regression analysis to compare with our modal regression, we also fit the mean regression model assuming mean-zero normal model error to the data.
Table 3 shows the inference results from the modal regression model, and Table 4 presents the inference results from the mean regression model.At 5% significance level, both frequentist and Bayesian modal regression analyses confirm that there exists a negative association between the percentage of college education and the murder rate, as well as a positive association between the metropolitan rate and the murder rate.In contrast, according to the inferred mean regression model, there is a positive association between the percentage of college education and the murder rate.Such claimed positive association is intuitively difficult to justify and contradicts many published results in criminology [Hjalmarsson andLochner, 2012, Lochner, 2020].
The scatter plot of the data in Figure 3 can shed some light on why one reaches such a drastically different conclusion on a covariate effect when mean regression is considered in place of modal regression.As shown in Figure 3, the exists an obvious outlier, the District of Columbia (D.C.), in panels of the first row of the scatter plot matrix for instance.D.C. not only exhibited the highest murder rate but also the highest percentage of college-educated individuals.These dual characteristics position D.C., as an outlier within the dataset.Mean regression reacts to this one extreme outlier by inflating the covariate effect associated with the percentage of college education in the inferred mean regression function.Thanks to the heavy-tailed feature of the FG distribution, modal regression based on this distribution is robust to outliers, which strives to capture data features suggested by the majority of the data and is not distracted by the extreme outlier when inferring covariate effects in this application.
Lastly, to compare their overall goodness of fit for the current data, we computed AIC and BIC following fitting each regression model.Adopting the frequentist and Bayesian methods, the modal regression analysis yields AIC/BIC equal to 239.394/252.917and 238.710/252.233,respectively.The mean regression analysis leads to AIC and BIC equal to 303.154 and 312.813, respectively.Appendix C contains the convergence diagnosis for the Bayesian inferential method applied to this data set, from which we see no concerns about convergence.

Discussion
The mode had been an overlooked location parameter in statistical inference until recently when the statistics community witnessed a revived interest in modal regression among statisticians [Chen, 2018, Chacón, 2020, Feng et al., 2020, Xu et al., 2020, Ullah et al., 2021, Wang and Li, 2021, Xiang and Yao, 2022].Historically, statistical inference for the mode has been mostly developed under the nonparametric framework for reasons we point out in Section 1. Existing semiparametric methods for modal regression only introduce parametric ingredients in the regression function, i.e., the conditional mode of the response, with the mode-zero error distribution left in a nonparametric form [Yao and Li, 2013, Liu et al., 2013, Zhang et al., 2013, Yang and Yang, 2014, Zhao et al., 2014, Krief, 2017, Tian et al., 2017, Li and Huang, 2019].The few recently proposed parametric modal regression models all impose stringent parametric assumptions on the error distribution [Bourguignon et al., 2020, Zhou and Huang, 2020, 2022].Our proposed flexible Gumbel distribution greatly alleviates concerns contributing to data scientists' reluctance to adopt a parametric framework when drawing inferences for the mode.This new distribution is a heterogeneous mixture in the sense that the two components in the mixture belong to different Gumbel distribution families, which is a feature that shields it from the non-identifiability issue most traditional mixture distributions face, such as the normal mixtures.The proposed distribution is indexed by the mode along with shape and scale parameters, and thus is convenient to use to draw inferences for the mode while remaining flexible.It is also especially suitable for modeling heavy-tailed data, whether the heaviness in tails is due to extremely large or extremely small observations, or both.These are virtues of FG that cannot be achieved by the popular normal mixture and many other existing mixture distributions.
We develop the numerically efficient and stable ECM algorithm for frequentist inference for the FG distribution, and a reliable Bayesian inference method that can be easily implemented using free software, including STAN, JAGS, and BUGS.Compared with the more widely adopted mean regression framework, the modal regression model based on FG we entertained in Section 6 shows great potential in revealing meaningful covariate effects potentially masked by extreme outliers.With these advances made in this study, we open up new directions for parametric modal regression and semiparametric modal regression with a fully parametric yet flexible error distribution, and potentially nonparametric ingredients incorporated in the regression function.

Disclosure statement
Computer programs for implementing the FG distribution, related models and data used in this paper are available at https://github.com/rh8liuqy/flexible_Gumbel.
B Proof of identifiability of FG in (3) In our context of two-component mixture distributions of which the cumulative distribution functions are of the form wF 1 (x) + wF 2 (x), Theorem 1 in Teicher [1961] states that a mixture distribution is identifiable if and only if there exists y 1 in the support of F 1 (y) and y 2 in the support of F 2 (y) such that F 1 (y 1 ) F 2 (y 1 ) F 1 (y 2 ) F 2 (y 2 ) ̸ = 0, that is, the above determinant does not vanish for some (y 1 , y 2 ).In what follows, we prove that the FG distribution is identifiable by showing the existence of (y 1 , y 2 ) that makes the above determinant non-zero.
Proof.Recall that the cumulative distribution functions of right-skewed and left-skewed Gumbel distributions are given by By setting y 1 = θ, we have We next show by contradiction that there exists y 2 ∈ R such that (10) is not equal to zero.

Figure 1 :
Figure1: Boxplots of the empirical Kullback-Leibler divergence from an estimated density to the true density under each of the true-model settings in (E2)-(E4).Under each setting, the three considered model fitting strategies are, from left to right in the figure, (i) using the ECM algorithm to fit an FG distribution (FG ECM), (ii) using the Bayesian method to fit an FG distribution (FG Bayes), and (iii) using the EM algorithm to fit a normal mixture distribution (Normal Mixture Distribution EM).

Figure 3 :
Figure 3: Scatter plot matrix of the crime data, where D.C. stands out as an extreme outlier with the highest murder rate and the highest percentage of college education.

Table 1 :
Frequentist and Bayesian inference results in experiment (E1) across 1000 Monte Carlo replicates.Here, point.eststands for the average of 1000 point estimates for each parameter from each method, s.d.stands for the average of the corresponding 1000 estimated standard deviations, and s.d.refers to the empirical standard deviation of the 1000 point estimates from each method.Numbers in parentheses are 100× Monte Carlo standard errors associated with the averages of 1000 estimates of the standard deviation associated with a point estimator.

Table 4 :
Mean regression model based on the normal distribution fitted to the crime data.Besides parameter estimates (under point.est)and the estimated standard deviations of these parameter estimates (under s.d.), 95% confidence intervals of the parameters are also provided (under lower 95 and upper 95).