Goodness of Fit Tests for the Log-Logistic Distribution Based on Cumulative Entropy under Progressive Type II Censoring

In this paper, we propose two new goodness-of-fit tests for the log-logistic distribution under progressive Type II censoring, based on the cumulative residual Kullback-Leibler information and the cumulative Kullback-Leibler information. Maximum likelihood estimation and the EM algorithm are used for statistical inference on the unknown parameter. A Monte Carlo simulation is conducted to study the power of the tests against alternative distributions with monotonically increasing and monotonically decreasing hazard functions. Finally, we present illustrative examples to show the applicability of the proposed methods.


Introduction
The log-logistic distribution is a classical lifetime distribution in reliability, and it has been studied extensively. The works in [1,2] mainly studied the properties of the distribution and proposed a method for obtaining an exact confidence interval for the shape parameter of the log-logistic distribution based on maximum likelihood estimation. In [3,4], the log-logistic distribution was used for lifetime analysis, in particular for studying the reliability characteristics of a model, while [5-7] studied applications of this distribution in economics, networks, and the environment.
The log-logistic distribution has two main advantages. First, its hazard function is not necessarily monotone, which gives some flexibility in representing product failure rates. Second, its cumulative distribution function has a closed form, which is advantageous for the lifetime analysis of censored data.
A random variable $X$ is said to have the log-logistic distribution if its cumulative distribution function (cdf) is expressed as:

$$F(x) = \frac{x^{\beta}}{1 + x^{\beta}}, \quad x > 0,\ \beta > 0. \tag{1}$$

The probability density function (pdf) is as follows:

$$f(x) = \frac{\beta x^{\beta-1}}{(1 + x^{\beta})^{2}}, \quad x > 0,\ \beta > 0. \tag{2}$$

In many lifetime analyses, the experiment is stopped early in order to save money and resources, so the data obtained are not complete. Such data are called censored data. Type I censoring, Type II censoring, and progressive censoring are the most common censoring methods.
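As a minimal sketch, the distribution can be coded directly, assuming the unit-scale form $F(x) = x^{\beta}/(1 + x^{\beta})$ (the form consistent with Lemma 3, where $\ln(1 + X^{\beta})$ is standard exponential). The function names are illustrative and are not taken from the paper.

```python
import math

def loglogistic_cdf(x: float, beta: float) -> float:
    """F(x) = x^beta / (1 + x^beta) for x > 0 (unit scale assumed)."""
    if x <= 0:
        return 0.0
    return 1.0 - 1.0 / (1.0 + x ** beta)

def loglogistic_pdf(x: float, beta: float) -> float:
    """f(x) = beta x^(beta-1) / (1 + x^beta)^2 for x > 0."""
    if x <= 0:
        return 0.0
    return beta * x ** (beta - 1) / (1.0 + x ** beta) ** 2

def loglogistic_hazard(x: float, beta: float) -> float:
    """h(x) = f(x) / (1 - F(x)) = beta x^(beta-1) / (1 + x^beta)."""
    if x <= 0:
        raise ValueError("the hazard is evaluated for x > 0 only")
    return beta * x ** (beta - 1) / (1.0 + x ** beta)

def loglogistic_quantile(u: float, beta: float) -> float:
    """Inverse cdf: solves F(x) = u for 0 < u < 1."""
    return (u / (1.0 - u)) ** (1.0 / beta)

if __name__ == "__main__":
    # With beta = 2 the hazard rises and then falls, so it is not monotone.
    for x in (0.5, 1.0, 2.0):
        print(x, loglogistic_hazard(x, 2.0))
```

With $\beta = 2$, the hazard is 0.8 at $x = 0.5$, 1.0 at $x = 1$, and 0.8 at $x = 2$, which illustrates the non-monotone failure rate mentioned above.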
Progressive Type I censoring is time-based: at fixed, predetermined observation times, a predetermined proportion of the surviving individuals is removed. Progressive Type II censoring is failure-based: a predetermined number of surviving individuals is removed each time a failure occurs. Progressive censoring thus removes individuals continually during the observation, in contrast to conventional censoring, which removes them only at the end of the experiment. This article mainly discusses progressive Type II censoring.
The steps to perform progressive Type II censoring are as follows: $n$ individuals are put on test, and after the first individual fails, $R_1$ individuals are randomly removed from the surviving $n - 1$ individuals. Observation continues until the next failure, when $R_2$ individuals are randomly removed from the surviving $n - 2 - R_1$ individuals. The process is repeated until the $m$th failure is observed, at which point all $R_m$ remaining individuals are withdrawn from observation. This results in a $(R_1, R_2, \cdots, R_m)$ censoring scheme, with $n = m + R_1 + \cdots + R_m$.
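The censoring procedure can be mimicked literally on a list of complete lifetimes. The sketch below is only an illustration of the steps described above; the function name, the exponential lifetimes, and the example scheme are assumptions made for the demonstration.

```python
import random

def progressive_type2_censor(lifetimes, scheme, rng):
    """Apply a progressive Type II scheme (R_1, ..., R_m) to complete lifetimes.

    Returns the m observed failure times in increasing order.
    """
    n, m = len(lifetimes), len(scheme)
    if n != m + sum(scheme):
        raise ValueError("the scheme must satisfy n = m + R_1 + ... + R_m")
    alive = sorted(lifetimes)
    observed = []
    for r in scheme:
        # The next failure is the smallest lifetime still under observation.
        observed.append(alive.pop(0))
        # Withdraw r surviving individuals chosen at random.
        for _ in range(r):
            alive.pop(rng.randrange(len(alive)))
    return observed

if __name__ == "__main__":
    rng = random.Random(1)
    lifetimes = [rng.expovariate(1.0) for _ in range(10)]  # illustrative data
    print(progressive_type2_censor(lifetimes, [1, 0, 2, 0, 2], rng))
```

Simulating the removals directly is easy to follow but requires all $n$ lifetimes; the transformation-based algorithm of [10] generates only the $m$ observed failures.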
There are many studies on statistical inference for censored data. In [8], different progressive censoring methods for samples in life testing were systematically studied. Confidence intervals for the unknown parameters under progressive Type II censoring were provided in [9], where exact confidence intervals were derived for several quantities of interest, such as quantiles, location, reliability, and scale. The method was also applied to the one-parameter and two-parameter exponential models to verify its feasibility and practicability. An algorithm for simulating progressive Type II censored data was proposed in [10]; it is used in the Monte Carlo simulations later.
In this article, we present goodness-of-fit tests for the log-logistic distribution based on cumulative entropy under progressive Type II censoring. The rest of the article is arranged as follows. In Section 2, we introduce Kullback-Leibler (KL) information and its related properties. In Section 3, maximum likelihood estimation and the EM algorithm are used for statistical inference on the unknown parameter. We construct two test statistics based on CRKL and CKL in Section 4. A Monte Carlo simulation is presented in Section 5 to study the power against the alternative distributions. Finally, we provide an illustrative example in Section 6 to show the applicability of the statistics.

Entropy and Kullback-Leibler Information
As a quantification of information, the average amount of information after redundancy is excluded is called information entropy. In 1948, Shannon proposed the concept of information entropy for the first time in order to clarify the relationship between probability and information redundancy. Shannon entropy is defined for a discrete random variable $X$. Extending this concept to continuous random variables leads to the important concept of differential entropy, defined as:

$$H(X) = -\int f(x)\ln f(x)\,dx, \tag{3}$$

where $\ln(\cdot)$ represents the natural logarithm and $f(x)$ represents the probability density function of the random variable $X$.
In the last few years, information-theoretic measures have been used in many fields to assess distributional fit in statistical inference. For example, Vasicek [11] first proposed a goodness-of-fit test for the normal distribution using maximum entropy. Vasicek's work greatly advanced the development of entropy-based goodness-of-fit tests. Afterwards, many scholars studied this topic; refer to [12-14].
A goodness-of-fit test using the Kullback-Leibler (KL) divergence was proposed in [15,16], and its definition is as follows:

$$KL(F:G) = \int f(x)\ln\frac{f(x)}{g(x)}\,dx, \tag{4}$$

where $f(x)$ and $g(x)$ are the density functions of the distributions $F(x)$ and $G(x)$.

Lemma 1. $KL(F:G)$ is non-negative, which follows from the inequality $\ln x \le x - 1$.
Lemma 2. $F(x)$ and $G(x)$ are equally distributed if and only if $KL(F:G) = 0$.
Proof. Since KL information is non-negative, Formula (4) shows that the smaller the distance between the two distributions, the smaller the corresponding KL divergence. Applying $\ln x \le x - 1$ with $x = g(x)/f(x)$ gives

$$-KL(F:G) = \int f(x)\ln\frac{g(x)}{f(x)}\,dx \le \int f(x)\left[\frac{g(x)}{f(x)} - 1\right]dx = 0.$$

When $KL(F:G) = \int f(x)\ln\frac{f(x)}{g(x)}\,dx = 0$, the equal sign holds, i.e., $\ln\frac{g(x)}{f(x)} = \frac{g(x)}{f(x)} - 1$. Since equality in $\ln x \le x - 1$ holds only at $x = 1$, we get $f(x) = g(x)$. Because $f(x)$ and $g(x)$ are the density functions of the distributions $F(x)$ and $G(x)$, $F(x)$ and $G(x)$ are equally distributed. Conversely, if $f(x) = g(x)$, the integrand in (4) vanishes and $KL(F:G) = 0$.
According to Lemmas 1 and 2, when $G(x)$ is taken to be the log-logistic distribution, the empirical distribution function $F_n(x)$ can be obtained from the collected data. Then, the value of $KL(F_n:G)$ can be calculated for the goodness-of-fit test. If the value of the KL information is sufficiently close to zero, we can conclude that $F(x)$ and $G(x)$ are equally distributed.
The limitation of differential entropy is that it applies only to random variables that have density functions. The authors of [17] presented the concept of cumulative residual entropy (CRE) in order to extend the scope of application to random variables without a density function. The measure is based on the survival function $\bar{F}(x) = 1 - F(x)$ used in reliability analysis. The cumulative residual entropy applies to non-negative random variables and is defined as:

$$CRE(X) = -\int_{0}^{\infty} \bar{F}(x)\ln\bar{F}(x)\,dx. \tag{5}$$

Baratpour [14] proposed a new CRE-based measure of the distance between two distributions, named the cumulative residual Kullback-Leibler (CRKL) divergence, and constructed goodness-of-fit tests for the exponential distribution. CRKL can be shown to be non-negative by an argument similar to that for KL information.

$$CRKL(F:G) = \int_{0}^{\infty} \bar{F}(x)\ln\frac{\bar{F}(x)}{\bar{G}(x)}\,dx - \left[E(X) - E(Y)\right], \tag{6}$$

where $\bar{F}(x)$ and $\bar{G}(x)$ are the survival functions of the random variables $X$ and $Y$, and $E(X)$ and $E(Y)$ are the expectations of $X$ and $Y$.
The authors of [18] proposed cumulative entropy (CE), a new measure modeled on classical differential entropy. This cumulative entropy can be expressed as the expectation of the mean inactivity time of the random lifetime $X$. The measure is particularly well suited to aging properties based on past lifetime and inactivity time in reliability analysis. They defined CE as:

$$CE(X) = -\int_{0}^{\infty} F(x)\ln F(x)\,dx. \tag{7}$$

Another measure, the cumulative KL information (CKL), which is based on the cumulative distribution function, was proposed in [19] and is expressed as follows:

$$CKL(F:G) = \int_{0}^{\infty} F(x)\ln\frac{F(x)}{G(x)}\,dx - \left[E(Y) - E(X)\right]. \tag{8}$$

According to $\ln x \le x - 1$, $x > 0$, we can prove that $CKL(F:G) \ge 0$, with equality if and only if $F(x) = G(x)$.
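To make the two cumulative measures concrete, the sketch below evaluates them numerically, assuming the standard forms $CRKL(F:G) = \int_0^\infty \bar{F}\ln(\bar{F}/\bar{G})\,dx - [E(X) - E(Y)]$ and $CKL(F:G) = \int_0^\infty F\ln(F/G)\,dx - [E(Y) - E(X)]$. Two exponential distributions are used only because their means and survival functions are simple; the helper names and the truncation point of the integral are assumptions of this example.

```python
import math

def integrate(fn, lo, hi, steps=20_000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(fn(lo + (k + 0.5) * h) for k in range(steps))

def crkl(surv_f, surv_g, mean_f, mean_g, upper):
    """Cumulative residual KL information, integral truncated at `upper`."""
    body = integrate(lambda x: surv_f(x) * math.log(surv_f(x) / surv_g(x)), 0.0, upper)
    return body - (mean_f - mean_g)

def ckl(cdf_f, cdf_g, mean_f, mean_g, upper):
    """Cumulative KL information, integral truncated at `upper`."""
    body = integrate(lambda x: cdf_f(x) * math.log(cdf_f(x) / cdf_g(x)), 0.0, upper)
    return body - (mean_g - mean_f)

if __name__ == "__main__":
    # X ~ Exp(rate 1) with mean 1, Y ~ Exp(rate 2) with mean 0.5 (illustrative).
    surv_x = lambda x: math.exp(-x)
    surv_y = lambda x: math.exp(-2.0 * x)
    cdf_x = lambda x: 1.0 - math.exp(-x)
    cdf_y = lambda x: 1.0 - math.exp(-2.0 * x)
    print("CRKL:", crkl(surv_x, surv_y, 1.0, 0.5, 40.0))
    print("CKL: ", ckl(cdf_x, cdf_y, 1.0, 0.5, 40.0))
    print("CRKL of X with itself:", crkl(surv_x, surv_x, 1.0, 1.0, 40.0))
```

Both values are non-negative, and each measure is zero when the two distributions coincide, in line with the properties stated above.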

Maximum Likelihood Estimation
In this section, we discuss maximum likelihood estimation for the log-logistic distribution under progressive Type II censored data. Suppose $X_{(1)}, X_{(2)}, \cdots, X_{(m)}$ are the failure times obtained from $m$ observations for a given censoring scheme $R_1, R_2, \cdots, R_m$. In the following derivation, $X_{(i)}$ is written as $x_i$. Then, the likelihood function is:

$$L(\beta) = c\prod_{i=1}^{m} f(x_i)\left[1 - F(x_i)\right]^{R_i} = c\prod_{i=1}^{m}\frac{\beta x_i^{\beta-1}}{(1 + x_i^{\beta})^{R_i+2}}, \tag{9}$$

where $\beta$ is the shape parameter and $c = n(n - 1 - R_1)(n - 2 - R_1 - R_2)\cdots(n - m + 1 - R_1 - \cdots - R_{m-1})$ is a fixed coefficient. The corresponding log-likelihood function is as follows:

$$l(\beta) = \ln c + m\ln\beta + (\beta - 1)\sum_{i=1}^{m}\ln x_i - \sum_{i=1}^{m}(R_i + 2)\ln(1 + x_i^{\beta}). \tag{10}$$

The likelihood equation for the parameter $\beta$ is:

$$\frac{\partial l(\beta)}{\partial\beta} = \frac{m}{\beta} + \sum_{i=1}^{m}\ln x_i - \sum_{i=1}^{m}(R_i + 2)\frac{x_i^{\beta}\ln x_i}{1 + x_i^{\beta}} = 0. \tag{11}$$

In theory, the maximum likelihood estimate $\hat{\beta}$ of the parameter $\beta$ can be obtained from Formula (11), but the likelihood equation is complicated and nonlinear, so an explicit solution for $\hat{\beta}$ cannot be calculated. An approximate solution obtained numerically does not by itself establish the uniqueness of the solution.
Since the likelihood equation has no closed-form solution, in practice the EM algorithm, the bisection (dichotomy) method, or the Newton-Raphson method can be selected, according to convenience, to compute the estimate. Because the bisection and Newton-Raphson methods are already well established, this paper only elaborates on the EM algorithm to complete the maximum likelihood estimation.
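For comparison with the EM approach, the bisection (dichotomy) method mentioned above can be sketched as follows. The score function is written out under the assumption that the likelihood is $L(\beta) \propto \prod_i f(x_i)[1 - F(x_i)]^{R_i}$ with the unit-scale log-logistic cdf $F(x) = x^{\beta}/(1 + x^{\beta})$; the bracketing interval, the tolerance, and the data are illustrative choices, not values from the paper.

```python
import math

def score(beta, x, R):
    """Derivative of the censored log-likelihood with respect to beta:
    m/beta + sum(ln x_i) - sum((R_i + 2) * x_i^beta * ln x_i / (1 + x_i^beta)).
    """
    total = len(x) / beta
    for xi, ri in zip(x, R):
        lx = math.log(xi)
        t = beta * lx
        # x^beta / (1 + x^beta), written so that exp() never overflows
        frac = 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))
        total += lx - (ri + 2) * frac * lx
    return total

def mle_bisection(x, R, lo=1e-3, hi=50.0, tol=1e-10):
    """Root of the score on [lo, hi]; requires a sign change on the bracket."""
    if score(lo, x, R) <= 0 or score(hi, x, R) >= 0:
        raise ValueError("the score does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, x, R) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    x = [0.3, 0.6, 0.9, 1.4, 2.5]   # illustrative failure times
    R = [1, 0, 2, 0, 5]             # illustrative censoring scheme
    print(mle_bisection(x, R))
```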

Expectation Maximization Algorithm
Dempster [20] first proposed the expectation maximization (EM) algorithm. It originated in [21] and was subsequently refined and extended, for example into the GEM algorithm [22] and the Monte Carlo EM algorithm.
The EM algorithm is an iterative algorithm for computing maximum likelihood estimates in incomplete-data problems. It transforms the optimization of a complex likelihood function into the optimization of a simpler function.
In general, the observed order statistics are denoted by $X = (x_1, x_2, \cdots, x_m)$ and the censored data by $Z = (Z_1, Z_2, \cdots, Z_m)$, where $Z_i$ collects the lifetimes of the $R_i$ individuals removed at the $i$th failure. The individual component lifetimes of $Z_i$ are greater than $x_i$ and are unordered.
$X$ and $Z$ are combined to form the complete data $W = (X, Z) = (w_1, w_2, \cdots, w_n)$. The key to the EM algorithm is that the joint density function of the completed data is easy to obtain. When the observed data $X$ are known, the conditional density of the missing variable $Z$ can be obtained directly. $\beta$ is the unknown parameter to be estimated, $p(x, z|\beta)$ is the density of the complete data, and $p(z|\beta, x)$ is the conditional density of the missing variable $Z$ given the observed data.
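The EM algorithm divides the problem into the following steps:
Step 1 Add the missing variable $Z$. Combined with the distribution function (1), the conditional density of a censored lifetime $z$ removed at the $i$th failure is easily obtained:

$$f(z \mid z > x_i, \beta) = \frac{f(z)}{1 - F(x_i)} = \frac{\beta z^{\beta-1}(1 + x_i^{\beta})}{(1 + z^{\beta})^{2}}, \quad z > x_i.$$

The log-likelihood function of the complete data, $l(\beta|w) = l(\beta|x, z)$, is, up to an additive constant:

$$l(\beta \mid w) = n\ln\beta + (\beta - 1)\sum_{j=1}^{n}\ln w_j - 2\sum_{j=1}^{n}\ln(1 + w_j^{\beta}).$$

Step 2 E step (expectation step): Compute the conditional expectation of the complete-data log-likelihood given the observed data and the current estimate $\beta^{(i)}$:

$$Q(\beta \mid \beta^{(i)}) = E\left[\,l(\beta \mid x, Z) \mid x, \beta^{(i)}\right].$$

Step 3 M step (maximization step): Maximize the expected value to get the next iterate $\beta^{(i+1)}$. Let $\partial Q(\beta \mid \beta^{(i)})/\partial\beta = 0$ and solve for $\beta^{(i+1)}$.
Step 4 Use $\beta^{(i+1)}$ instead of $\beta^{(i)}$ in the E step, and repeat the E and M steps. When the absolute change between successive iterations is sufficiently small, stop the iteration.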

Distribution Fitting Test Statistics
EDF test statistics are usually used to measure the distance between the empirical distribution function $F_n(x)$ and the hypothesized distribution function $F_{\hat{\beta}}(x)$, where $\hat{\beta}$ is the estimate of $\beta$ obtained under the null hypothesis $H_0$. In this paper, we propose two EDF-type test statistics related to CRKL and CKL. We use these two statistics to test whether the unknown distribution is the log-logistic distribution of the null hypothesis, and we consider multiple alternative distributions.
Given a censoring scheme $(R_1, R_2, \cdots, R_m)$, let $X_{1:m:n}, X_{2:m:n}, \cdots, X_{m:m:n}$ be a progressively Type II censored sample from a distribution with cdf $F(x)$, and let $F_{\beta}(x)$ denote the log-logistic cdf (1), where $\beta$ is an unknown shape parameter.
Based on CRKL, CKL, and the obtained censored sample, we consider the following hypotheses:

$$H_0: F(x) = F_{\beta}(x) \quad \text{vs.} \quad H_1: F(x) \neq F_{\beta}(x).$$

When the statistic is large enough, we reject the null hypothesis.
The empirical distribution function of the complete data can be derived using order statistics. In the same way, we can estimate the empirical distribution function $F_{m:n}(x)$ of the censored data, which is presented as:

$$F_{m:n}(x) = \begin{cases} 0, & x < x_{1:m:n}, \\ \alpha_{i:m:n}, & x_{i:m:n} \le x < x_{i+1:m:n},\ i = 1, \cdots, m - 1, \\ \alpha_{m:m:n}, & x \ge x_{m:m:n}. \end{cases}$$

Balakrishnan [10] showed that $\alpha_{i:m:n}$ is the expected value of the $i$th progressively Type II censored order statistic from the uniform distribution, i.e., $\alpha_{i:m:n} = E(U_{i:m:n})$.
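Since the $U_{i:m:n}$ are not independent of each other, we let

$$V_i = \frac{1 - U_{m-i+1:m:n}}{1 - U_{m-i:m:n}},\ i = 1, \cdots, m - 1, \qquad V_m = 1 - U_{1:m:n}.$$

According to the factorization theorem applied to the joint density function of $V_1, \cdots, V_m$, the $V_i$ are independent of each other and follow $\text{Beta}\left(i + \sum_{j=m-i+1}^{m} R_j,\ 1\right)$, respectively. Then, we have:

$$\alpha_{i:m:n} = E(U_{i:m:n}) = 1 - \prod_{j=m-i+1}^{m}\frac{j + R_{m-j+1} + \cdots + R_m}{1 + j + R_{m-j+1} + \cdots + R_m}, \quad i = 1, \cdots, m.$$

When the sample follows the log-logistic distribution function (1), the censored CRKL reduces to Equation (23), in which the unknown parameter $\beta$ is replaced by its expectation maximization estimate $\hat{\beta}_{EM}$. In order to make the statistic free of units, (23) is divided by a normalizing term, which gives the test statistic ĈRKL. Similarly, the statistic ĈKL based on CKL is given by Equation (24).
Neither ĈRKL nor ĈKL depends on the scale; therefore, they are suitable for goodness-of-fit tests.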
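The plotting positions $\alpha_{i:m:n} = E(U_{i:m:n})$ are simple to compute. The sketch below assumes the standard product form $\alpha_{i:m:n} = 1 - \prod_{j=m-i+1}^{m} a_j/(1 + a_j)$ with $a_j = j + R_{m-j+1} + \cdots + R_m$, which follows from the independent Beta-distributed ratios of the uniform progressively censored order statistics; the function name is illustrative.

```python
def expected_uniform_order_stats(R):
    """Return [alpha_{1:m:n}, ..., alpha_{m:m:n}] for the scheme R = (R_1, ..., R_m)."""
    m = len(R)
    alphas, prod = [], 1.0
    for i in range(1, m + 1):
        j = m - i + 1
        a_j = j + sum(R[m - j:])        # j + R_{m-j+1} + ... + R_m
        prod *= a_j / (1.0 + a_j)       # E(V_j) for V_j ~ Beta(a_j, 1)
        alphas.append(1.0 - prod)
    return alphas

if __name__ == "__main__":
    # Without censoring the values reduce to i / (m + 1).
    print(expected_uniform_order_stats([0, 0, 0, 0]))
    print(expected_uniform_order_stats([1, 0, 2]))
```

For a scheme with no removals, the values reduce to the familiar $i/(m + 1)$, which is a convenient sanity check.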

Monte Carlo Simulations
Monte Carlo simulation is a method of solving computational problems using random numbers. Based on a probabilistic model, statistical simulation or sampling is performed on a computer to obtain an approximate solution to the problem, according to the process described by the model. There is related research in many fields, such as the environment [23], computer science and technology [24], physics [25], and finance [26].
For the ĈRKL and ĈKL tests, when the significance level $\alpha$ is fixed and the statistic is larger than the critical value, we reject the null hypothesis. Since the quantiles cannot be determined analytically from the distributions of ĈRKL (23) and ĈKL (24), we use the Monte Carlo simulation method to obtain the critical values ĈRKL$_{0.95}$, ĈRKL$_{0.90}$, ĈKL$_{0.95}$, and ĈKL$_{0.90}$.
Lemma 3. Suppose $X$ follows the log-logistic distribution (1), and let $Y = \ln(1 + X^{\beta})$; then $Y$ follows the standard exponential distribution.
Proof. Suppose the distribution functions of the random variables $X$ and $Y$ are $F_X(x)$ and $F_Y(y)$, respectively.
When $y > 0$:

$$F_Y(y) = P\left(\ln(1 + X^{\beta}) \le y\right) = P\left(X \le (e^{y} - 1)^{1/\beta}\right) = F_X\left((e^{y} - 1)^{1/\beta}\right) = 1 - e^{-y},$$

which is the cdf of the standard exponential distribution.
Let $X_{1:m:n}, X_{2:m:n}, \cdots, X_{m:m:n}$ be a sample for a given progressive Type II censoring scheme from the log-logistic distribution with unknown parameter $\beta$. Then $Y_{i:m:n} = \ln(1 + X_{i:m:n}^{\beta})$, $i = 1, \cdots, m$, are the order statistics of data from the standard exponential distribution under progressive Type II censoring. Refer to [10] for the following transformation:

$$\begin{aligned} S_1 &= nY_{1:m:n}, \\ S_2 &= (n - R_1 - 1)(Y_{2:m:n} - Y_{1:m:n}), \\ &\ \,\vdots \\ S_m &= (n - R_1 - \cdots - R_{m-1} - m + 1)(Y_{m:m:n} - Y_{m-1:m:n}). \end{aligned}$$

For the above transformation, Thomas [27] proved that $S_1, S_2, \cdots, S_m$ are independent of each other and follow the standard exponential distribution. Ten thousand random samples $(S_1, S_2, \cdots, S_m)$ can be generated with R software, and part of the code is shown in Appendix A. The samples under progressive Type II censoring were obtained by inverting the transformation.
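The sample generation can be sketched by running the transformation in reverse: draw independent standard exponential spacings $S_i$, accumulate $Y_{i:m:n} = Y_{i-1:m:n} + S_i/(n - R_1 - \cdots - R_{i-1} - i + 1)$, and map back with $X = (e^{Y} - 1)^{1/\beta}$, the inverse of $Y = \ln(1 + X^{\beta})$ from Lemma 3. This is an illustrative Python sketch under those assumptions, not the R code of Appendix A, and the function name is hypothetical.

```python
import math
import random

def sample_progressive_loglogistic(beta, R, rng):
    """Draw one progressively Type II censored sample X_{1:m:n} < ... < X_{m:m:n}."""
    n = len(R) + sum(R)
    at_risk, y, xs = n, 0.0, []
    for r in R:
        # S_i ~ Exp(1); the spacing of the exponential order statistics is S_i / at_risk.
        y += rng.expovariate(1.0) / at_risk
        xs.append(math.expm1(y) ** (1.0 / beta))   # X = (e^Y - 1)^(1/beta)
        at_risk -= r + 1                           # one failure plus r withdrawals
    return xs

if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_progressive_loglogistic(2.0, [1, 0, 2, 0, 2], rng))
```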
For a given sample size $n$, number of observed failures $m$, significance level $\alpha$, censoring scheme $(R_1, \cdots, R_m)$, and distribution (1) with $\beta = 2$, we randomly generated 10,000 progressively Type II censored samples. By the law of large numbers, if an experiment is repeated many times under the same conditions, the probability of a random event can be approximated by its relative frequency.
Because the study of the exponential distribution in [14] indicated that the hazard function has a great influence on the power of the statistics, we present the alternatives in Table 1 classified by the type of hazard function. The power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. It is used to assess whether the proposed test statistics are feasible.
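The Monte Carlo logic for critical values and p-values does not depend on the particular statistic, so it can be sketched generically. In the example below, `statistic` and `sampler` are placeholders to be supplied by the user; the demonstration uses a stand-in statistic and an exponential sampler purely for illustration, not the ĈRKL or ĈKL statistics.

```python
import random

def mc_null_distribution(statistic, sampler, n_rep, rng):
    """Sorted values of the statistic over n_rep samples drawn under H0."""
    return sorted(statistic(sampler(rng)) for _ in range(n_rep))

def critical_value(null_stats, level):
    """Empirical (1 - level) quantile of the simulated null distribution."""
    k = min(len(null_stats) - 1, int((1.0 - level) * len(null_stats)))
    return null_stats[k]

def mc_p_value(null_stats, observed):
    """Fraction of simulated null statistics at least as large as the observed one."""
    return sum(s >= observed for s in null_stats) / len(null_stats)

def rejection_rate(statistic, sampler, crit, n_rep, rng):
    """Frequency with which the statistic exceeds the critical value."""
    return sum(statistic(sampler(rng)) > crit for _ in range(n_rep)) / n_rep

if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for illustration only: not the statistics proposed in the paper.
    toy_statistic = lambda xs: abs(sum(xs) / len(xs) - 1.0)
    null_sampler = lambda r: [r.expovariate(1.0) for _ in range(10)]
    alt_sampler = lambda r: [r.expovariate(0.5) for _ in range(10)]

    null_stats = mc_null_distribution(toy_statistic, null_sampler, 2000, rng)
    crit = critical_value(null_stats, 0.05)
    print("critical value:", crit)
    print("estimated power:", rejection_rate(toy_statistic, alt_sampler, crit, 2000, rng))
```

Replacing the sampler with a generator of progressively Type II censored samples under the null hypothesis, and the statistic with the one being studied, gives the simulated critical values; sampling from an alternative distribution gives the estimated power.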
According to the classification method of the hazard function, the power study of this paper involved seven alternative hypotheses.To calculate the value of the test statistics, the parameters of the alternative distributions were estimated using the EM algorithm.
(a) $H_0$: $X \sim$ log-logistic(2) vs. $H_1$: $X \sim$ Pareto(5). The shape parameter is $\beta$, and the failure rate decreases as $x$ increases.
(b) $H_0$: $X \sim$ log-logistic(2) vs. $H_1$: $X \sim$ Rayleigh(0.2). The shape parameter is $\beta$, and the failure rate increases as $x$ increases.
(c) $H_0$: $X \sim$ log-logistic(2) vs. $H_1$: $X \sim$ Weibull(0.2, 1), where $k > 0$ and $\lambda > 0$ are the shape parameter and scale parameter of the distribution, respectively. $k < 1$ indicates that the failure rate decreases over time, $k = 1$ means that the failure rate is constant over time, and $k > 1$ indicates that the failure rate increases over time.
(d) $H_0$: $X \sim$ log-logistic(2) vs. $H_1$: $X \sim$ Bathtub-shaped(2, 15). The unknown shape parameter and scale parameter of the distribution are $\beta$ and $\lambda$. In fact, $\lambda$ has no influence on $h(x)$. This distribution has an increasing hazard function when $\beta > 1$.
The dependence of the power of the test statistics on the censoring scheme can be seen from Figures 1-4:

• From Figure 1, different censoring schemes had little effect on the power of the ĈRKL test when the hazard function was monotonically decreasing.
• From Figure 2, for the ĈKL test, there was no obvious trend in power across censoring schemes for the monotonically decreasing hazard function.
• From Figure 3, when censoring happened only at the end, the power of the ĈRKL test for the monotonically increasing hazard function was lower.
• From Figure 4, when censoring happened only at the beginning and the hazard function was monotonically increasing, the power of the ĈKL test was lower.

Therefore, we reached the following conclusions. Before performing the goodness-of-fit test, the test statistic should be selected by analyzing the hazard function of the relevant distributions. When the hazard function is monotonically decreasing, the statistic ĈKL is selected. When the hazard function is monotonically increasing, ĈRKL is used if censoring occurs only at the beginning of the experiment, and ĈKL is used if censoring occurs only at the end of the experiment; in other cases, both test statistics are acceptable.

Application Prospect
The modern concept of quality holds that product quality is the combination of characteristics that meet the requirements of use, that is, fitness for use. Improving quality during the production process is an important way to improve product quality. Therefore, in product reliability analysis, it is important to determine the probability distribution of the product lifetime in order to analyze further the shape of functions such as the failure rate.
For example, common sources of quality fluctuation in production include nonconforming parts and installation errors. In order to establish a reliability model under these two sources of fluctuation, it is first necessary to determine the probability distribution of the product lifetime. When relevant research suggests that the lifetime may follow the log-logistic distribution, the tests proposed in this paper can be used to assess the goodness of fit.
Because the failure rate function is not necessarily monotone under the log-logistic distribution, there is flexibility in establishing a reliability model. Our proposed tests have practical value for controlling quality fluctuations and improving product reliability. Below, we apply them to a real data set.

Real Data Application
The power analysis has shown that the proposed goodness-of-fit tests are able to discriminate between the null and alternative hypotheses. In this part, we use a set of real data to illustrate further the applicability of the tests. The work in [9] presented a progressively censored sample of the log-lifetimes to breakdown of an insulating fluid tested at 34 kV. These data, of size $m = 5$ generated from $n = 13$, are listed in Table 5. We use the two test statistics to test the collected censored data and analyze whether the sample follows the log-logistic distribution. First, the values of the test statistics are calculated according to Equations (23) and (24). The unknown parameter $\beta$ is replaced by the estimate $\hat{\beta}$ obtained with the parameter estimation method of Section 3. The results of the calculations were ĈRKL = 1.221995 and ĈKL = 0.025522.
Using the Monte Carlo algorithm of Section 5, implemented in R, to calculate the p-values, we obtained p-values of 0.898 and 0.605 for the two goodness-of-fit tests, respectively. With p-values this large, the null hypothesis cannot be rejected at the usual significance levels, so the sample observed under progressive Type II censoring is consistent with the log-logistic distribution. This example illustrates that the test statistics are applicable in practice.

Conclusions
In this paper, goodness-of-fit testing for the log-logistic distribution was discussed under progressive Type II censoring. We established the ĈRKL and ĈKL test statistics based on the cumulative residual entropy and the cumulative entropy. We used maximum likelihood estimation and the EM algorithm for parameter estimation, and a Monte Carlo simulation was presented to analyze multiple alternative hypotheses. The analysis of power under different censoring schemes demonstrated the feasibility of the test statistics.
We conclude that the test statistic should be chosen based on the monotonicity of the hazard function of the relevant distribution. When the hazard function is monotonically decreasing, the statistic ĈKL is selected. When the hazard function is monotonically increasing, the statistic should be selected according to the time at which censoring occurs. The power analysis showed that the proposed goodness-of-fit tests are able to discriminate whether the null hypothesis is true, and the real data example showed how the tests are applied in practice. The idea of the goodness-of-fit tests in this paper can also be applied to other distributions in subsequent studies, such as the Weibull, Rayleigh, and Pareto distributions.

Figure 4. Powers of different schemes for the monotone increasing hazard function Rayleigh(0.2) for the ĈKL test, in the case of n = 20 and 30.

Table 1. Alternative hypotheses for power study.

Table 2. Powers of ĈRKL for the monotone decreasing hazard at the 5% and 10% significance levels for the different censorship schemes in the case of sample sizes n = 10, 20, and 30.

Table 3. Powers of ĈKL for the monotone decreasing hazard at the 5% and 10% significance levels for the different censorship schemes in the case of sample sizes n = 10, 20, and 30.

Table 4. Powers for the monotone increasing hazard at the 5% and 10% significance levels for the different censorship schemes in the case of sample sizes n = 10, 20, and 30.

Table 5. A progressive censored sample of the logarithmic lifetime of the insulating fluid.