Model Selection Test for the Heavy-Tailed Distributions under Censored Samples with Application in Financial Data

Numerous heavy-tailed distributions are used for modeling financial data and in problems related to the modeling of economic processes. These distributions have higher peaks and heavier tails than the normal distribution. Moreover, in some situations, we cannot observe complete information about the data. Employing an efficient estimation method and then choosing the best model in this situation are very important. Thus, the purpose of this article is to propose a new interval for comparing two heavy-tailed candidate models and to examine its suitability for financial data under complete and censored samples. This interval is equivalent to encapsulating the results of many hypothesis tests. The maximum likelihood estimator (MLE) is used for estimating the parameters of the proposed heavy-tailed distributions. A real dataset representing the top 30 companies of the Tehran Stock Exchange indices is used to illustrate the derived results.


Introduction
In recent years, heavy-tailed distributions have attracted attention as a topic in various research studies. For some works on these distributions, we refer to, among others, [1][2][3][4]. These distributions have good statistical and reliability properties. Due to their practicality, heavy-tailed distributions can be used in many applied sciences, including economics, finance, econometrics, statistics, risk management and insurance. The inferential results under financial modeling have been developed by several authors; see, for example, [5][6][7][8]. There are different heavy-tailed distributions. The question then arises as to which of them is the best for modeling the proposed financial data. Thus, in this paper, we want to choose the best distribution using the new model selection test. There are different model selection tests for discriminating between two complete models. In almost all of the tests and criteria for model selection, the maximum likelihood estimator and the maximized likelihood function have an essential role. For example, Kundu et al. [9] compared the log-normal and generalized exponential distributions using the maximized likelihood method, Dey and Kundu [10] considered the problem of discriminating among the log-normal, Weibull and generalized exponential distributions, Cox [11] modified classical hypothesis testing to compare non-nested hypotheses and Vuong [12] tested two models using the log-likelihood ratio of the models. The results in Vuong [12] have been extended and applied in a number of ways, including [13][14][15][16][17][18]. Moreover, in experimental studies, it is quite common that complete data are not observed. Data obtained from such experiments are called censored data. Based on the studies of the 4 : For every α, we have, and µ is taken to be a Lebesgue measure.

New Model Selection Test (NMST) For HTDC
Let X_{c:n}, ..., X_{n:n} denote the truncated order statistics observed from an experimental test involving n units taken from an f_α(x) distribution. To simplify the notation, we will use X_i in place of X_{i:n}. Then, the likelihood function of (X_c, ..., X_n) can be obtained as

L(α) = [n! / (c − 1)!] [F_α(X_c)]^{c−1} ∏_{i=c}^{n} f_α(X_i),

where f_α(x) and F_α(x) are the probability density function and cumulative distribution function of the heavy-tailed distribution, respectively. We are interested in testing the following set of hypotheses to discriminate between H_0 and H_f or H_g, the NMST for left censored data.
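As a rough illustration of how such a likelihood can be evaluated, the sketch below computes the log-likelihood of a left-censored sample up to an additive constant, that is, (c − 1) log F_α(x_c) plus the sum of log f_α(x_i) over the observed values. It assumes that exactly the c − 1 smallest observations are unobserved; the function name, the exponential model and the sample values are hypothetical and are not taken from the paper.

```python
import math

def left_censored_loglik(observed, n_censored, pdf, cdf):
    """Log-likelihood (up to a constant) of a left-censored sample.

    observed   : observed order statistics x_c <= ... <= x_n
    n_censored : number of unobserved smallest values (assumed to be c - 1)
    pdf, cdf   : density and distribution function of the candidate model
    """
    xs = sorted(observed)
    x_c = xs[0]  # smallest observed value
    censored_part = n_censored * math.log(cdf(x_c)) if n_censored > 0 else 0.0
    observed_part = sum(math.log(pdf(x)) for x in xs)
    return censored_part + observed_part

# Hypothetical candidate model: exponential with rate 1.
rate = 1.0
exp_pdf = lambda x: rate * math.exp(-rate * x)
exp_cdf = lambda x: 1.0 - math.exp(-rate * x)

sample = [0.8, 1.1, 1.9, 2.5]  # illustrative observed values only
ll_complete = left_censored_loglik(sample, 0, exp_pdf, exp_cdf)
ll_censored = left_censored_loglik(sample, 2, exp_pdf, exp_cdf)
```

Maximizing this function over the model parameters (for example with a numerical optimizer) would give the censored-sample estimates used in the test; the optimizer itself is left out of the sketch.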
H_0: The two proposed heavy-tailed models (F_α and G_β) are equivalent, against H_f: F_α is better than G_β in the sense of closeness to the true model, or H_g: F_α is worse than G_β.

Theorem 1. (NMST for HTDC): Using the conditions 1-6 and the asymptotic distribution of the MLE (see Appendix A), the new interval as a model selection test for HTDC is given by . Now, using the Equation (1), we have where p and q are the number of parameters in the heavy-tailed models and α̂_n and β̂_n are the quasi maximum likelihood estimators under the censored sample. In addition, Z_α is the α-th quantile of the standard normal distribution and ω̂_c^2 satisfies

Proof. Based on Theorem B in Appendix B, it is observed that the difference of the log-likelihood functions of the two truncated rival models (data are left censored) converges in distribution to the normal distribution. Thus, it is sufficient to find the empirical form of ω_{*c}^2 as ω̂_c^2.
Using the missing information principle [23], the observed information can be written as where w = (w_1, ..., w_n) = the complete data, z = (z_1, ..., z_{c−1}) is the complete data of size from the right population with density functions: For simplicity, we use f_α(z_i) instead of f_α(z_i | X) in what follows. Thus, the Var[(1/n) L_n^{f/g}(α̂_n, β̂_n)] can be expressed as where the empirical form of (5) satisfies The proposed interval has the property of where Δ_c(f_{α̂_n}, g_{β̂_n}) is the difference of the expected Kullback-Leibler divergence (KL) of f_{α̂_n} and g_{β̂_n} under censored data and where P_h represents the probability with density h.

Decision Rule
An important problem in statistics concerning a sample of n observations is to test whether these observations come from a specified distribution. The Vuong test is one of the important tests for model selection. However, if the rival models are very close (not equivalent) to the true model, then this test can suffer from distortions. Therefore, this section suggests a simple model selection procedure based on the likelihood ratio statistic under censored data that is easy to compute and has an asymptotic standard normal distribution. The proposed interval is easy to compute and interpret using the following steps:
Step 1: Choose the rival models and calculate the quasi maximum likelihood estimates for the unknown parameters.
Step 3: Obtain ω̂_c and then construct the proposed interval using the α-th quantile of the standard normal distribution (Z_α).
Step 4: Interpret the proposed interval as follows. (i) If the calculated interval includes zero, it can be concluded that both proposed models (F_α and G_β) are equivalent. (ii) If both bounds of the interval are negative, this indicates that F_α is better than G_β to estimate the true model. (iii) Finally, if both bounds of the interval are positive, then we conclude that G_β is better than F_α to estimate the true model.
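The interpretation step can be sketched as a small function. This is only an illustration: it assumes the sign convention used in the application section (both bounds negative favours F_α, both bounds positive favours G_β), and the function name and the example intervals are hypothetical.

```python
def interpret_interval(lower, upper):
    """Apply the decision rule to an already computed interval.

    Returns "equivalent" when zero lies inside the interval,
    "F better" when both bounds are negative and
    "G better" when both bounds are positive.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower <= 0.0 <= upper:
        return "equivalent"
    if upper < 0.0:
        return "F better"
    return "G better"

# Hypothetical intervals, for illustration only.
decisions = [
    interpret_interval(-0.4, 0.7),   # contains zero
    interpret_interval(-3.1, -2.2),  # both bounds negative
    interpret_interval(0.5, 1.8),    # both bounds positive
]
```

The function only classifies the interval; computing the bounds still requires the estimates and the variance term described in the theorem.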
Our approach highlights the variability of any criterion based on the log-likelihood function.

Heavy Tail Properties
Heavy-tailed distributions are important distributions in economics and finance. In this section, we check the heavy tail properties for different distributions.

Definition 3. The distribution F(.) of the random variable X is considered to be heavy-tailed if and only if ∫_R e^{−λx} F(x) dx = ∞ for all λ > 0.

Definition 4. A continuous distribution function is considered to be heavy-tailed if its moment generating function is infinite.
Thus, we can check the heaviness using different criteria such as:
i. Based on Definition 4, if only some or none of the moments of the distribution exist, then it has a heavy tail.
ii. If limsup_{x→∞} h(x)/x = 0, then the distribution has a heavy tail. Here, h(x) is the hazard function.
iii. If h*(t) is a decreasing function for increasing values of t, then the distribution has a heavy tail, where h*(t) = (d/dt) h(t).
iv. If the distribution is heavy-tailed, then Var(X)/E(X)^2 ≥ 1. Note that the converse does not hold.
v. The distribution has the heavy tail, if Here, we say that F(x) has a light tail.
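Criterion iv lends itself to a quick numerical check. Since it is only a necessary condition, a ratio Var(X)/E(X)^2 below one argues against a heavy tail in the sense of this criterion, while a ratio of at least one is inconclusive. The sketch below computes the sample version of the ratio; the function names and the data are hypothetical, and a sample ratio is only an estimate of the population quantity.

```python
import statistics

def squared_cv(data):
    """Sample version of Var(X) / E(X)^2 (population variance formula)."""
    mean = statistics.fmean(data)
    return statistics.pvariance(data) / mean ** 2

def argues_against_heavy_tail(data):
    """True when the sample ratio is below 1, i.e. criterion iv fails."""
    return squared_cv(data) < 1.0

# Illustrative samples only.
spread_out = [1.0, 1.0, 1.0, 1.0, 16.0]   # ratio 36 / 16 = 2.25, inconclusive
concentrated = [9.0, 10.0, 11.0]          # ratio well below 1
ratio_spread = squared_cv(spread_out)
ratio_concentrated = squared_cv(concentrated)
```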

Heavy-Tailed Distributions
In this subsection, we consider different heavy-tailed distributions and then check the heaviness property using the different criteria.

Generalized Extreme Value Distribution (GEVD)
The cumulative distribution function (CDF) of GEVD is given by Variable X is bounded by (ξ + α)/k from above if k > 0 and from below if k < 0, where ξ ∈ R and α > 0. We have three cases of this distribution as We now want to check the heavy tail property. Using the ii and iii criteria, it is observed that the Fréchet-Weibull distribution and the Weibull distribution with 0 < β < 1 have a heavy tail. However, based on the v criterion, the Gumbel distribution does not have the heavy tail property.

Pareto Distribution
The Pareto distribution is a skewed, heavy-tailed distribution that is sometimes used to model the distribution of incomes. This distributional model is important in applications because many datasets are observed to follow a power law probability tail, at least approximately, for large values of x. Stable distributions with index α are also asymptotically Pareto in their probability tails, and this fact has been frequently used to develop estimators for those distributions. The CDF of Pareto is given by The hazard function of the Pareto distribution, k/(x + α), is a decreasing function for positive values of k and α. Thus, using the ii and iii criteria, it has a heavy tail.
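The behaviour of this hazard function can be checked numerically. The sketch below evaluates k/(x + α) on a grid and confirms that it decreases and that h(x)/x becomes negligible for large x. The survival function shown is the one implied by this hazard when the support is taken to start at zero, which is an assumption made here; the function names and parameter values are hypothetical.

```python
def pareto_hazard(x, k, alpha):
    """Hazard function k / (x + alpha) quoted in the text."""
    return k / (x + alpha)

def pareto_survival(x, k, alpha):
    """Survival function implied by the hazard, assuming support x >= 0."""
    return (alpha / (x + alpha)) ** k

k, alpha = 2.0, 1.0  # illustrative parameter values
grid = [0.0, 1.0, 10.0, 100.0, 1000.0]
hazards = [pareto_hazard(x, k, alpha) for x in grid]

# The hazard decreases along the grid.
is_decreasing = all(a > b for a, b in zip(hazards, hazards[1:]))

# h(x) / x is already tiny at x = 1e6, in line with criterion ii.
ratio_at_large_x = pareto_hazard(1e6, k, alpha) / 1e6
```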

Log-Normal Distribution
A log-normal distribution is applied as the standard model for financial data. It is used in many different fields of study, such as economics, metrology, biology, neuroscience and engineering. The density function of a log-normal distribution, with shape parameter σ > 0 and scale parameter µ > 0, is

f(x) = [1 / (x σ √(2π))] exp{−(ln x − µ)^2 / (2σ^2)}, x > 0.

The tail heaviness property of the log-normal distribution depends on the variance. In other words, based on the iv criterion, we can write,
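For the log-normal distribution, the ratio in criterion iv has a closed form. With E(X) = exp(µ + σ^2/2) and Var(X) = (exp(σ^2) − 1) exp(2µ + σ^2), the ratio Var(X)/E(X)^2 equals exp(σ^2) − 1, which is at least one exactly when σ^2 ≥ ln 2. The sketch below checks this numerically; the function names and parameter values are hypothetical, and the result only speaks to criterion iv, not to the other criteria.

```python
import math

def lognormal_mean(mu, sigma):
    return math.exp(mu + sigma ** 2 / 2.0)

def lognormal_variance(mu, sigma):
    return (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)

def lognormal_squared_cv(mu, sigma):
    """Var(X) / E(X)^2, which simplifies to exp(sigma^2) - 1."""
    return lognormal_variance(mu, sigma) / lognormal_mean(mu, sigma) ** 2

# Threshold at which the ratio equals 1: sigma^2 = ln 2.
sigma_threshold = math.sqrt(math.log(2.0))

ratio_at_threshold = lognormal_squared_cv(0.3, sigma_threshold)
ratio_small_sigma = lognormal_squared_cv(5.0, 0.5)  # below 1
ratio_unit_sigma = lognormal_squared_cv(0.0, 1.0)   # e - 1, above 1
```

The ratio does not depend on µ, which is why only the variance parameter matters for this criterion.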

Burr Type XII Distribution
Burr [24] introduced twelve cumulative distribution functions with the primary purpose of fitting distributions to real data. One of the most important of them is the Burr Type XII distribution. The cumulative distribution function of the Burr Type XII is given by

F(x) = 1 − (1 + x^β)^{−α}, x > 0.

Here, α > 0 and β > 0 are the two shape parameters. The shape of the hazard rate function of the Burr Type XII distribution depends only on parameter β. Its capacity to assume various shapes often permits a good fit when used to describe biological, financial, engineering or other experimental data. It also approximates the distributional form of normal, log-normal, gamma, logistic, and several Pearson-type distributions. For instance, the normal density function may be approximated by a Burr Type XII distribution with β = 4.8544 and α = 6.2266, and the gamma distribution with shape parameter 16 can be approximated by a Burr Type XII distribution with β = 3 and α = 6. The log-logistic distribution is a special case of the Burr Type XII distribution. In addition, using the i, ii and iii criteria, it is observed that the Burr Type XII has a heavy tail.
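Two of the statements above can be checked with a short sketch. It assumes the common two-parameter form F(x) = 1 − (1 + x^β)^{−α} for x > 0, with β as the power parameter; this parameterization is an assumption made for the illustration. Under it, the hazard is αβ x^{β−1} / (1 + x^β), so α only rescales the hazard, and α = 1 gives the log-logistic CDF. Function names and parameter values are hypothetical.

```python
def burr12_cdf(x, alpha, beta):
    """Assumed Burr Type XII CDF: 1 - (1 + x^beta)^(-alpha), x > 0."""
    return 1.0 - (1.0 + x ** beta) ** (-alpha)

def burr12_hazard(x, alpha, beta):
    """Hazard f(x) / (1 - F(x)) under the assumed CDF."""
    return alpha * beta * x ** (beta - 1.0) / (1.0 + x ** beta)

def loglogistic_cdf(x, beta):
    """Log-logistic CDF with unit scale."""
    return x ** beta / (1.0 + x ** beta)

# alpha = 1 reproduces the log-logistic distribution.
special_case_gap = abs(burr12_cdf(2.0, 1.0, 3.0) - loglogistic_cdf(2.0, 3.0))

# Changing alpha rescales the hazard but leaves its shape unchanged.
scale_ratios = [
    burr12_hazard(x, 2.0, 3.0) / burr12_hazard(x, 1.0, 3.0)
    for x in (0.5, 1.0, 4.0)
]

# h(x) / x is small for large x, in line with criterion ii.
ratio_at_large_x = burr12_hazard(1e3, 6.2266, 4.8544) / 1e3
```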

Dagum and Singh-Maddala Distributions
The Dagum and Singh-Maddala distributions are special cases of the generalized Beta distribution of the second kind (GB2). The CDFs of these distributions are given by, respectively: and Here, all three of the parameters are positive. In addition, the r-th moment of these distributions can be written as where Γ(.) and B(., .) denote the Gamma function and the Beta function, respectively. Using the i criterion, it is observed that the moments of these distributions only exist for values of −γα < r < α. This indicates that these distributions potentially have tail heaviness properties.

Application of the NMST to the Tehran Stock Exchange
In this section, the data set of daily returns of the top 30 companies from the Tehran Stock Exchange indices is used to study the performance of the proposed model selection test. All the programs are written in R. The mean, standard deviation, skewness and kurtosis of these data are 2.937687, 0.261701, 0.273005 and 1.790659, respectively. It is observed that the skewness and kurtosis are not close to zero and three, respectively. Thus, the data set has a higher peak, fatter tail and skewness in comparison to the Normal distribution. For further study, we demonstrate how it deviates from the Normal distribution using the Shapiro-Wilk (S-W) test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling (A-D) test and Jarque-Bera (J-B) test. For each test, the null hypothesis is that the data are normally distributed. If the p-value is less than the significance level (0.05), the null hypothesis is rejected. For computing the mentioned tests, we use the nortest and tseries R packages. The results are provided in Table 1. Based on Table 1, we observe that the data do not follow the Normal distribution. Thus, we use heavy-tailed distributions for modeling the data. First, we check the adequacy of the Weibull (We), Pareto (Pa), Burr Type XII (BXII), log-normal (LN), Dagum (Da) and Singh-Maddala (S-M) distributions using the K-S minimum distance criterion, the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the maximum log-likelihood criterion (LL). We select as the best model among all competing distributions the one that has the smallest AIC, BIC and K-S distance and the greatest LL value. We first estimate the unknown parameters using MLEs. The results are presented in Table 2. It is clear that the Da and LN distributions have a comparatively better fit for the present data set. The We and S-M distributions also have a good fit. We provide the Probability-Probability (P-P) plots for different distributions in Figures 1-6. Moreover, the empirical survival function and the fitted survival functions are presented in Figure 7.
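The information criteria used for this comparison follow directly from the maximized log-likelihood: AIC = −2 LL + 2p and BIC = −2 LL + p log n, where p is the number of parameters and n the sample size. A minimal sketch is given below; the function names and the log-likelihood values are hypothetical and do not reproduce the entries of Table 2.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: -2 * LL + 2 * p."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2 * LL + p * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Hypothetical fitted models: (maximized log-likelihood, number of parameters).
fits = {"model_a": (-100.0, 2), "model_b": (-99.5, 3)}
n_obs = 100

scores = {
    name: (aic(ll, p), bic(ll, p, n_obs))
    for name, (ll, p) in fits.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

In this made-up example the extra parameter of the second model does not improve the log-likelihood enough to offset the penalty, so both criteria prefer the first model.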
Therefore, based on Figures 1-7, it is observed that the BXII and Pa distributions do not fit the data reasonably well, and hence, they cannot be used to obtain inferential results from the considered data set. Using different model selection criteria, we can compare the proposed distributions. However, these criteria have some disadvantages. For example, the LL criterion assumes that the number of parameters in each competing model is the same. In addition, one problem with AIC and BIC is that their values have no intrinsic meaning; in particular, AIC and BIC are not invariant to a one-to-one transformation of the random variables, and their values depend on the number of observations. Thus, we consider the NMST for comparing the heavy-tailed distributions. Now, we check the results using the proposed interval for model selection. We consider four cases of rival models: (1) Da (f) and LN (g), (2) Da (f) and We (g), (3) We (f) and S-M (g), (4) Da (f) and S-M (g).
Based on the estimated values, we construct the proposed interval. The intervals for the above four cases are (1.9878339, 2.0104310), (−2.748678, −2.614607), (−4.2395868, −4.1287672) and (−172.43378, −161.83113), respectively. For Case 1, it is observed that both limits of the tracking interval are positive, which indicates that the LN is better than the Da distribution to estimate the true model for these data. However, the length of this interval is small, so we can conclude that the two models are similar in estimating the true model (as expected). For Cases 2-4, both limits of the tracking interval are negative, so the model (f) is better than the model (g). It is observed that this interval selects the correct model well. In addition, the computational steps of this interval are simple. Now, we suppose that some of the data are missing (censored). We generate artificially left censored data from the data set as Here, n is the complete sample size and c is the number of the left censored data.
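One simple way to produce an artificially left-censored sample from complete data is to sort the observations and discard a fixed number of the smallest values. The sketch below does this under the assumption that the number of discarded values is chosen in advance; it is not necessarily the scheme used in the paper, and the function name and data are hypothetical.

```python
def left_censor(data, n_censored):
    """Return the observed part of a left-censored sample.

    The n_censored smallest values are treated as unobserved and the
    remaining order statistics are returned in increasing order.
    """
    xs = sorted(data)
    if not 0 <= n_censored < len(xs):
        raise ValueError("n_censored must be between 0 and len(data) - 1")
    return xs[n_censored:]

complete = [3.0, 1.0, 2.0, 5.0, 4.0]  # illustrative values only
observed = left_censor(complete, 2)   # the two smallest values are dropped
n_total = len(complete)               # n, the complete sample size
```

The retained values, together with the count of discarded observations, are what a left-censored likelihood needs as input.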

Conclusions
Heavy-tailed distributions are among the most important distributions in several applied sciences, such as economics, financial engineering and mathematical finance. Moreover, in many situations, we cannot observe the complete information about the data. In this situation, the problem of choosing the correct distribution becomes more difficult. There are different criteria, such as AIC, BIC, LL and K-S distance, for comparing the models. These criteria have some disadvantages. Thus, in this paper, we have proposed a new model selection test for comparing heavy-tailed distributions under complete and censored data. This interval highlights the unavoidable variability of any criterion based on the log-likelihood ratio, such as AIC, BIC and their variants. Based on this test, we can make the best possible decision based on whatever data are available at hand. The NMST is easy to compute and could be very useful for censored data. We hope that the new model selection test will attract wider application in all areas of research.

The candidate distributions are the Weibull (We) distribution, Pareto (Pa) distribution, Burr Type XII (BXII) distribution, log-normal (LN) distribution, Dagum (Da) distribution and Singh-Maddala (S-M) distribution, compared using well-established model selection criteria such as the K-S minimum distance criterion, Akaike information criterion (AIC = −2 ∑_{i=1}^{n} log f_{α̂_n}(x_i) + 2p), Bayesian information criterion (BIC = −2 ∑_{i=1}^{n} log f_{α̂_n}(x_i) + p log n) and maximum log-likelihood criterion (LL).

Figure 2. The P-P plot for Singh-Maddala distribution.

Figure 4. The P-P plot for Weibull distribution.

Figure 5. The P-P plot for Pareto distribution.

Figure 6. The P-P plot for Burr XII distribution.

Figure 7. Empirical survival function and the fitted survival functions.

Table 1. Different normality tests for the proposed data.

Table 2. Estimated parameters, AIC values, BIC values and log-likelihood values for different distribution functions.