1. Introduction
The beta prime (BP) distribution has become popular for analyzing lifetime and monotonic failure rate phenomena. For modeling monotonic failure rates, the Weibull, log-logistic, and log-normal distributions can also be good choices, but they do not model bathtub-shaped, unimodal, and bimodal failure rates that are common in survival analysis. Because of this, several models have been proposed in recent years.
In this context, the Weibull-G (W-G) family [1] proved itself to be a good competitor to the Beta-G (B-G) [2] and Kumaraswamy-G (Kw-G) [3] classes. Like the B-G and Kw-G classes, the W-G family adds two parameters to those of the baseline G distribution. Note that the cumulative distribution function (cdf) of the beta distribution involves the incomplete beta function, whereas the Kumaraswamy cdf has a closed form. In addition, the W-G family deserves further exploration and dissemination, given that the B-G and Kw-G classes have been highly cited according to Google Scholar.
Recently, Ref. [4] defined a new extension of the W-G family, also a competitor of the B-G and Kw-G classes. Ref. [5] proposed a bivariate W-G family. The estimation of the parameters of the Weibull Generalized Exponential distribution based on adaptive progressive type II (APTII) censored samples was explored by [6].
Ref. [7] addressed the estimation of the BP distribution and discussed some properties. A generalized BP model defined by [8,9,10] introduced regression models based on the BP distribution. Other recent works studied this distribution [11,12]. Through the McDonald inverted beta (McIB) distribution [13], we can obtain other generalizations of the BP distribution, for example, the Kumaraswamy Beta Prime and Beta Beta Prime models.
In this context, our main objective is to introduce the Weibull-beta prime (WBP) distribution. We illustrate the applicability of the new distribution with three real COVID-19 data sets. Currently, the USA has the highest number of COVID-19 cases worldwide. Brazil has the second-highest number of deaths (688,316 in total) [14], and several factors call for an analysis of this number, including the continental dimension of the country, the proportion of elderly people, greater social vulnerability, and the high rate of chronic diseases. Accordingly, we first verify the flexibility of the new distribution through graphical analyses and statistical tests using data on the number of new daily deaths due to COVID-19 in the US. Second, we provide an application to the times to death from this coronavirus in a Brazilian capital. Third, we fit a regression model to investigate the influence of covariates on the time to death from COVID-19 in the city of Campinas, Brazil. With these studies, we aim to contribute to the literature on new distributions and survival analysis, and to support efforts to estimate the impact of the disease.
The BP random variable W, which has two positive shape parameters, has a cumulative distribution function (cdf) given by the ratio of the incomplete beta function, evaluated at w/(1 + w), to the beta function. The probability density function (pdf) of W and its sth ordinary moment, which exists only when s is smaller than the second shape parameter, are also available in closed form in terms of the beta function. Some other properties of W were tackled by [7]. The arguments in the functions are omitted from now on.
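For reference, the standard beta prime expressions can be written as below. The symbols a and b for the two shape parameters are assumed here for illustration only and may differ from the notation used elsewhere in the article; B_z(a, b) denotes the incomplete beta function and B(a, b) the beta function.

```latex
\begin{align*}
B_z(a,b) &= \int_0^z t^{a-1}(1-t)^{b-1}\,dt, \qquad B(a,b) = B_1(a,b),\\
G(w) &= \frac{B_{w/(1+w)}(a,b)}{B(a,b)}, \qquad w > 0,\ a > 0,\ b > 0,\\
g(w) &= \frac{w^{a-1}(1+w)^{-(a+b)}}{B(a,b)},\\
E(W^s) &= \frac{B(a+s,\,b-s)}{B(a,b)}, \qquad -a < s < b.
\end{align*}
```

The moment condition s < b is the reason the sth ordinary moment does not exist for every s.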
This article is organized as follows. Section 2 defines the Weibull-beta prime (WBP) model with four positive parameters. Section 3 provides some of its properties. Section 4 addresses the estimation and a simulation study. Section 5 develops a WBP regression model. Applications to three COVID-19 data sets in Section 6 illustrate the potential of the new models. Some conclusions are given in Section 7.
5. WBP Regression Model
A WBP regression model is constructed for censored samples, quite common in areas such as econometrics, engineering, and clinical trials. Generally, for censored samples, it is common to consider the systematic component for the shape parameter . Thus, we consider the systematic component , where is the vector of covariates and is the vector of unknown parameters. Let . Note that future research may be developed using more systematic components.
The survival function of
is
Equation (15) defines the WBP regression model.
A special feature of survival data is the presence of censoring, which is the partial observation of the response. This refers to circumstances in which some subjects are free from the event of interest, for example, by being withdrawn early from the study or by the end of the experiment. Then, it is important to add this information to statistical modeling.
Let
be
n independent observations, where
denotes the observed lifetime or censoring time of the
ith observation. Assume that the lifetimes and censoring times are independent (i.e., the censoring is non-informative), and let F and C denote the sets of individuals with observed lifetimes and with censoring times, respectively. The log-likelihood function for the vector of parameters
from model (15) is
where
r is the number of failures. The estimate
is found by maximizing Equation (16).
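The structure of a censored log-likelihood of this kind can be sketched as follows. This is a minimal illustration and not the WBP model of Equation (15): it assumes the usual form under non-informative right censoring (log-density summed over failures plus log-survival summed over censored cases) and plugs in an exponential distribution as a stand-in. The names `censored_loglik`, `exp_logpdf`, and `exp_logsurv` are hypothetical.

```python
import math
from typing import Callable, Sequence


def censored_loglik(
    times: Sequence[float],
    status: Sequence[int],
    logpdf: Callable[[float], float],
    logsurv: Callable[[float], float],
) -> float:
    """Right-censored log-likelihood: failures contribute the log-density,
    censored cases contribute the log-survival."""
    total = 0.0
    for y, delta in zip(times, status):
        total += logpdf(y) if delta == 1 else logsurv(y)
    return total


def exp_logpdf(rate: float) -> Callable[[float], float]:
    """Log-density of the exponential stand-in distribution."""
    return lambda y: math.log(rate) - rate * y


def exp_logsurv(rate: float) -> Callable[[float], float]:
    """Log-survival of the exponential stand-in distribution."""
    return lambda y: -rate * y


# Hypothetical toy data: status 1 = observed lifetime, 0 = censored.
times = [2.0, 5.0, 1.5, 7.0]
status = [1, 0, 1, 0]

# For the exponential stand-in the maximizer is available in closed form:
# number of failures divided by the total observed time.
rate_hat = sum(status) / sum(times)
ll_hat = censored_loglik(times, status, exp_logpdf(rate_hat), exp_logsurv(rate_hat))
```

For a model without a closed-form maximizer, the same function would be passed to a numerical optimizer over the parameter vector.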
5.1. Diagnostic and Residual Analysis
Assessing the robustness of the estimates in regression models has been an important concern of various researchers in recent decades. Case-deletion measures, which examine the impact on the estimates of dropping individual observations, are the most widely used technique for detecting influential observations; see, for example, Ref. [19].
A global influence measure considered by [20] is a generalization of the Cook distance defined by a standardized norm
, namely
where
is the observed information matrix.
Another influence measure is the likelihood distance given by
where
is the maximized log-likelihood function for the full sample and
is the maximized log-likelihood function for the sample excluding the
ith observation.
The quantile residuals (qrs) have the form
where
is the inverse of the standard normal cdf.
Various plots of these residuals can be adopted to assess the regression assumptions and detect influential observations.
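As a minimal sketch of the residual computation, the snippet below maps fitted cdf values to the standard normal scale through the inverse normal cdf. It assumes the basic form of a quantile residual (inverse standard normal cdf applied to the fitted cdf at the observed time) without any special treatment of censored cases, and it uses an exponential cdf with a hypothetical fitted rate as a stand-in for the fitted WBP regression model. The function name `quantile_residuals` is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Callable, List, Sequence


def quantile_residuals(
    times: Sequence[float],
    cdf: Callable[[float], float],
    eps: float = 1e-12,
) -> List[float]:
    """Apply the inverse standard normal cdf to the fitted cdf values."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    residuals = []
    for y in times:
        # Keep the probability strictly inside (0, 1) so inv_cdf is defined.
        p = min(max(cdf(y), eps), 1.0 - eps)
        residuals.append(std_normal.inv_cdf(p))
    return residuals


rate = 0.2  # hypothetical fitted rate of the stand-in model


def exp_cdf(y: float) -> float:
    """cdf of the exponential stand-in distribution."""
    return 1.0 - math.exp(-rate * y)


# The middle time is the median of the stand-in model, so its residual is near 0.
times = [0.5, math.log(2.0) / rate, 12.0]
qrs = quantile_residuals(times, exp_cdf)
```

If the model is adequate, residuals computed in this way should behave approximately like a standard normal sample, which is what the index and QQ plots check.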
5.2. Simulation Study
A simulation study examines the accuracy of the MLEs in the WBP regression model for
, 250, and 500 and censoring percentages 0%, 10%, and 30%. Here, 1000 replicates of each sample are generated using the inverse transformation method. The censoring times
are obtained from a
, where
controls the censoring percentage. The systematic component for the parameter
(for
) is
where
,
,
,
, and
.
The simulation process is as follows for each observation:
(i) Generate
, and calculate
from (
20);
(ii) The generated lifetimes
are determined from the WBP(
) model using Equation (
6);
(iii) Generate and obtain ;
(iv) Set the censoring indicator: if , then ; otherwise, .
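Steps (i) to (iv) can be sketched as follows. Several choices here are assumptions made only for illustration: a Weibull quantile function stands in for the WBP quantile function of Equation (6), the covariate is drawn from a Uniform(0, 1) distribution, a log link is used for the systematic component, and censoring times are drawn from a Uniform(0, tau) distribution. All function names and parameter values are hypothetical.

```python
import math
import random
from typing import List, Tuple


def weibull_quantile(u: float, shape: float, scale: float) -> float:
    """Inverse cdf of a Weibull(shape, scale); stand-in for the WBP quantile function."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def simulate_censored_sample(
    n: int, beta0: float, beta1: float, shape: float, tau: float, rng: random.Random
) -> List[Tuple[float, float, int]]:
    """Return a list of (covariate, observed time, censoring indicator)."""
    sample = []
    for _ in range(n):
        x = rng.random()                                   # (i) covariate, assumed Uniform(0, 1)
        scale = math.exp(beta0 + beta1 * x)                # (i) systematic component, assumed log link
        t = weibull_quantile(rng.random(), shape, scale)   # (ii) lifetime by inverse transformation
        c = rng.uniform(0.0, tau)                          # (iii) censoring time, assumed Uniform(0, tau)
        y = min(t, c)                                      # (iii) observed time
        delta = 1 if t <= c else 0                         # (iv) 1 = observed lifetime, 0 = censored
        sample.append((x, y, delta))
    return sample


rng = random.Random(2022)
data = simulate_censored_sample(n=250, beta0=0.5, beta1=-1.0, shape=1.5, tau=6.0, rng=rng)
censored_fraction = 1.0 - sum(delta for _, _, delta in data) / len(data)
```

In this sketch a larger `tau` lowers the censoring percentage, so it plays the role of the quantity that controls the censoring level.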
The values in Table 2 show that the average estimates approach the true parameters, and that the biases and MSEs decrease as n grows. Furthermore, the biases and MSEs of the estimates become larger as the censoring percentage increases. These results are in agreement with the consistency of the estimators.
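The summary measures of such a study can be computed per parameter as in the following minimal sketch, which assumes the usual definitions (bias as the average estimate minus the true value, MSE as the average squared deviation from the true value). The estimates shown are hypothetical and are not taken from Table 2.

```python
from statistics import fmean
from typing import Sequence, Tuple


def summarize_estimates(estimates: Sequence[float], true_value: float) -> Tuple[float, float, float]:
    """Return (average estimate, bias, MSE) over the simulation replicates."""
    average = fmean(estimates)
    bias = average - true_value
    mse = fmean([(est - true_value) ** 2 for est in estimates])
    return average, bias, mse


# Hypothetical estimates of one parameter from four replicates.
average, bias, mse = summarize_estimates([1.9, 2.1, 2.0, 2.2], true_value=2.0)
```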
6. Applications
First, the fits of the WBP, BP, Beta Beta Prime (BBP), and Kumaraswamy Beta Prime (KwBP) distributions are compared. The BBP and KwBP are special models of the McDonald inverted beta (McIB) [13].
For all fitted models, we calculate the MLEs and their standard errors (SEs). The well-known AIC, CAIC, and BIC statistics are also calculated to compare the WBP distribution with its nested BP model. The Cramér–von Mises (W*), Anderson–Darling (A*), and Kolmogorov–Smirnov (K-S) statistics (with the K-S p-value) are used to compare the WBP model with the other distributions. The computations use the AdequacyModel [18], MASS, and GenSA libraries of the R software, and the maximization is performed using the SANN method.
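As a rough illustration of how some of these statistics are obtained from a fitted model, the sketch below computes the AIC, the BIC, and the K-S distance. It assumes the textbook definitions and is not the code of the R packages cited above; CAIC is left out because its definition varies between sources. The function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def aic(loglik: float, k: int) -> float:
    """Akaike information criterion for a model with k parameters."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion for k parameters and n observations."""
    return -2.0 * loglik + k * math.log(n)


def ks_statistic(data: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Largest distance between the empirical cdf and the fitted cdf."""
    xs = sorted(data)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the fitted cdf with the empirical cdf just after and just before x.
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


# Toy check against a Uniform(0, 1) cdf.
d_toy = ks_statistic([0.25, 0.75], lambda x: x)
```

Lower values of all three quantities indicate a better fit, which is how the comparisons in this section are read.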
6.1. Application 1: COVID-19 Data in the US
The first data set refers to 95 daily new deaths due to COVID-19 in the US (from 2 April 2021 to 31 July 2021) extracted from https://www.worldometers.info/coronavirus/country/us/. This data set is used since the US is currently the country with the highest number of deaths from COVID-19. In the period, we find an average of 499.56 new deaths daily and a standard deviation of 222.69, which reflects the wide variation in the number of daily deaths: the minimum is 158 and the maximum is 985. In addition, we obtain skewness = 0.44 and kurtosis = 2.06.
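Descriptive measures of this type can be computed as in the following sketch. It assumes moment-based skewness and non-excess kurtosis (equal to 3 for the normal distribution) and the sample standard deviation with divisor n - 1; the definitions actually used for the reported values are not stated in the text. The input values are hypothetical and are not the real data.

```python
import math
from typing import Dict, Sequence


def describe(data: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, moment skewness, and non-excess kurtosis."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return {
        "mean": mean,
        "sd": math.sqrt(m2 * n / (n - 1)),
        "min": min(data),
        "max": max(data),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical daily counts used only to exercise the function.
summary = describe([158, 320, 455, 510, 640, 985])
```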
Table 3 reports the MLEs and their SEs (in parentheses). The statistics (and the p-values of K-S) are reported in Table 4. According to these statistics, the WBP distribution provides a better fit than the KwBP, BBP, and BP models.
The generalized likelihood ratio (GLR) test [21] assesses whether there is any significant difference between the fits of the distributions. The WBP model outperforms the KwBP (GLR = 4.18) and BBP (GLR = 4.99) distributions at the 5% significance level. Figure 5a displays the histogram and the estimated WBP, KwBP, and BBP densities. Figure 5b reports the empirical and estimated cumulative distributions. Overall, the WBP distribution yields the best fit to these data.
6.2. Application 2: COVID-19 Data in Florianópolis, Brazil
According to the Votorantim Institute's COVID-19 Municipal Vulnerability Index (MVI), Florianópolis is the Brazilian capital least vulnerable to COVID-19 [22]. In this context, the second application refers to 116 times (in days) from the date of hospitalization until death of COVID-19 patients in the city of Florianópolis, registered from January to March 2022 on the Ministry of Health platform at https://dados.gov.br/dataset/bd-srag-2021 (accessed on 26 May 2022). The average time from hospitalization to death is approximately 9.71 days in the analyzed period. The standard deviation is 7.67, which reflects the variation in these times: the minimum time from hospitalization to death is only one day and the maximum is 29 days. Furthermore, the skewness is 0.81 and the kurtosis is 2.75.
The MLEs, SEs, and the previous statistics (with the p-values of K-S) for the distributions fitted to these data are reported in Table 5 and Table 6. The values in Table 6 support the WBP distribution as the best model.
The GLR (Vuong) test [21] indicates that the new distribution is more adequate than the KwBP (GLR = 8.08) and BBP (GLR = 5.77) distributions at the 5% significance level. A comparison of the WBP distribution with its BP sub-model gives LR = 31.21 (p-value = 1.668 × 10⁻⁷). Thus, the WBP distribution is the best one to describe the current data.
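The p-value of the LR comparison with the nested BP sub-model can be checked by hand. The sketch below assumes a chi-square reference distribution with 2 degrees of freedom (the WBP distribution has four parameters and the BP sub-model has two); for an even number of degrees of freedom the chi-square survival function has a closed form, so no statistical library is needed. The function name `chi2_sf_even_df` is hypothetical.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Survival function of the chi-square distribution for even df:
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    partial_sum = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * partial_sum


lr_statistic = 31.21  # LR value reported in the text
p_value = chi2_sf_even_df(lr_statistic, df=2)  # roughly 1.67e-7
```

With the rounded LR value this gives a p-value of about 1.67 × 10⁻⁷, far below the 5% level.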
The histogram of the data and some estimated densities are reported in Figure 6a. Figure 6b displays the empirical and estimated cumulative distributions. They show that the WBP is the best model for these data.
6.3. Application 3: COVID-19 Data in Campinas, Brazil
Some regression models are fitted to 655 survival times of coronavirus patients hospitalized (in April 2021) in the city of Campinas (state of São Paulo), obtained from https://opendatasus.saude.gov.br/en/dataset/srag-2021-e-2022 (accessed on 1 September 2022). This city has the third largest municipal population in the state, an estimated 1,213,792 people in 2020 according to the Brazilian Institute of Geography and Statistics (IBGE) [23], which motivates its choice for the application. The censoring percentage (67.8%) refers to deaths from other causes or the end of the observation time. The survival time is the period (in days) from the first symptom to death from COVID-19.
The variables considered for each patient are:
observed time (in days);
censoring indicator (0 = censoring, 1 = observed lifetime);
age (in years);
chronic cardiovascular pathology (1 = yes, 0 = no or not informed).
Other studies have analyzed the influence of covariates on the time to death from COVID-19. Ref. [24] analyzed coronavirus data in Curitiba (Brazil) and verified the influence of sex and age on the times (in days) elapsed from the date of hospitalization to death. Ref. [25] investigated risk factors associated with these deaths in the Mexican population using survival analysis and concluded that the risk of death was higher for men, older individuals, chronic kidney disease patients, and people admitted to public health services.
First, only the response variable is modeled by fitting the WBP, KwBP, BBP, and BP distributions. The results of these preliminary analyses are reported in Figure 7, where the WBP distribution performs better than the others.
Next, we consider the following systematic components:
Table 7 gives the selection criteria values, and the WBP regression model has the lowest values for all systematic components. Note that this model with the structure
is superior to the other models.
The WBP, BBP, KwBP, and BP regression models with the structure
are evaluated using the quantile–quantile (QQ) and worm plots of the qrs in Figure 8 and
Figure 9, respectively. The WBP regression model-
is better than the others in agreement with the results in
Table 7.
The findings in the final WBP regression model-
are given in Table 8, where both covariates are significant.
Figure 10 displays the index plots of the case deletion measures
and
. From
Figure 10a, the 323rd, 409th, and 584th cases are possible influential observations, corresponding to the following patients:
323rd: A 42-year-old patient with a failure time of one day who does not have cardiovascular disease;
409th: A 64-year-old patient with a failure time of one day who has cardiovascular disease;
584th: A 57-year-old patient with a failure time of one day who has cardiovascular disease.
We examine the quality of fit of the WBP regression model—
. The qrs are randomly scattered around zero, as shown in Figure 11a. The QQ plot of these residuals with a simulated envelope [26] is displayed in Figure 11b. Both plots provide evidence of a good fit of the WBP regression model.
Some interpretations of the final WBP regression model:
The survival time tends to decrease as the age of the patient increases;
The survival times differ between patients with chronic cardiovascular disease and those without this condition.
7. Conclusions
We proposed a four-parameter Weibull beta prime (WBP) distribution. The estimation was conducted by the maximum likelihood method, and a simulation study supported the consistency of the estimators. We constructed a WBP regression model for censored data and illustrated the usefulness of the new models with three COVID-19 data sets. The new models were compared with some known competing models and provided better fits to all data sets. The regression model with censored data from COVID-19 patients showed that advanced age and cardiovascular disease are significant factors for the survival time. We conclude that the proposed models can be interesting alternatives for symmetric and asymmetric data, with bimodal shapes, censored or uncensored. Finally, future extensions of the article include, for example, other systematic components, thus defining heteroscedastic regression models based on the WBP distribution. In addition, generalizations of the new regression model to multivariate settings and linear mixed-effects models can be investigated.