SimBetaReg Web-Tool: The Easiest Way to Implement the Beta and Simplex Regression Models

Abstract: When the response variable is defined on the (0,1) interval, the beta and simplex regression models are commonly used by researchers. However, no software makes these models easy to implement for researchers without programming experience. In this study, we developed a web-tool, named SimBetaReg, to help researchers who are not familiar with programming to implement the beta and simplex regression models. The developed application is free and works independently of the operating system. Additionally, we model the incidence ratios of COVID-19 with educational and civic engagement indicators of the OECD countries using the SimBetaReg web-tool. Empirical findings show that when the educational attainment, years in education, and voter turnout increase, the incidence ratios of the countries decrease.


Introduction
Bounded data appear in many application areas, such as finance and the medical and actuarial sciences. Examples include fatal traffic accident ratios, miscarriage ratios, ratios of earthquakes resulting in loss of life, and positive logarithmic returns of investment instruments. To model these data sets, the beta, Kumaraswamy [1], Topp-Leone [2], and simplex [3] distributions are the first models that come to mind. In addition, many new bounded distributions have been proposed, especially in the last five years. Two common methods, the log and unit transformations, are used to generate new distributions defined on the unit interval. These distributions are derived from continuous probability distributions defined on the positive real line R+. For instance, log-xgamma was derived from the xgamma distribution [4]. Similarly, the log-weighted exponential (log-WE) distribution was derived from the weighted exponential distribution [5]. Other examples include the unit-improved second degree Lindley (unit-ISDL) by Altun and Cordeiro [6], unit-Lindley by Mazucheli et al. [7], log-Bilal by Altun et al. [8], unit-inverse Gaussian by Ghitany et al. [9], transmuted Kumaraswamy by Khan et al. [10], unit-Birnbaum-Saunders by Mazucheli et al. [11], and exponentiated Topp-Leone by Pourdarvish [12]. Researchers are still working to generate new distributions for bounded data sets.
When it is desired to explain the change in the dependent variable, defined on the (0,1) interval, by the independent variables, the beta regression model, introduced by Ferrari and Cribari-Neto [13], is the first choice. The second model is the simplex regression, introduced by Kieschnick and McCullough [14], which was extensively studied by Song and Tan [15], Song et al. [16], and Qiu et al. [17]. These models have gained attention from researchers, and several generalizations and alternative models of the beta and simplex distributions have been proposed.
For instance, the unit-Lindley regression model by Mazucheli et al. [7], the log-WE and log-ISDL regression models by Altun [5] and Altun and Cordeiro [6], respectively, the unit Burr-XII regression model by Korkmaz and Chesneau [18], and the arcsecant hyperbolic normal regression model by Korkmaz et al. [19] are alternatives to the beta and simplex regression models. Open code and data are very important for researchers who want to reproduce the results of a scientific study. However, even when the code of the relevant models is openly accessible, using it requires some expertise in R or Python. Therefore, the use of these models has not become widespread.
In light of these explanations, the first goal of the presented study is to develop a cloud-based web-tool for the application of the beta and simplex regression models to increase their usage by researchers. We use the R Shiny platform to develop the SimBetaReg web-tool. The name of the application comes from the abbreviations of the beta and simplex regression models. Thanks to the developed application, researchers can easily use the beta and simplex regression models without any programming knowledge. In addition, since the developed application is cloud-based, it does not require installation and works independently of the operating system. As in IBM SPSS, researchers can upload their data sets and obtain the results of the beta and simplex regression models with residual analysis and goodness-of-fit statistics.
The second goal of the presented study is to explore the possible relation of the incidence ratio of Coronavirus disease 2019 (COVID-19) with educational and civic engagement indicators for OECD countries.
Many studies have analyzed the incidence number or incidence ratio of COVID-19. These studies have two different aspects. The first is to forecast the incidence ratio of COVID-19 for a specific region or country by applying machine learning models. For instance, Mollalo et al. [20] used an artificial neural network to forecast the incidence rate for the United States. However, since only incidence rates were used, the predicted values obtained may deviate from the actual situation.
The second aspect is to seek a correlation or relation between incidence ratio and socio-economic variables. For instance, Karmakar et al. [21] examined the possible relation between incidence and death rates with the social vulnerability index in the United States. Duhon et al. [22] modeled the growth rate of COVID-19 with non-pharmaceutical interventions, social and climatic variables based on the multivariate linear regression.
El-Morshedy et al. [23] modeled the counts of deaths caused by COVID-19 using count regression models. However, the predictions made by count regression models may lead to over-estimated or under-estimated results. Therefore, modeling the incidence ratio instead of the counts of deaths or cases can produce more reliable results. According to the literature review, there is no direct research that models the incidence ratio of COVID-19 with educational and civic engagement indicators. Since the incidence ratio is defined on the (0,1) interval, it should be modeled with the beta or simplex regression model by considering the appropriate covariates.
The other parts of the presented study are organized as follows. In Section 2, the mathematical backgrounds of the beta and simplex regression models are summarized. In Section 3, COVID-19 incidence ratio of the OECD countries is modeled with educational and civic engagement indicators by applying the beta and simplex regression models. The results are discussed comprehensively. In Section 4, the implementation of the developed SimBetaReg web-tool is given. The concluding remarks and future works from the presented study are given in Section 5.

Regression Models
In this section, we examine the beta and simplex regression models comprehensively.
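Beta Regression
If the random variable $Y$ has the pdf in (2), its mean and variance are, respectively,
$$E(Y) = \mu, \qquad \mathrm{Var}(Y) = \frac{\mu(1-\mu)}{1+\phi}.$$
Note that $\phi^{-1}$ is interpreted as a dispersion parameter since the variance is an increasing function of it. The pdf plots of the beta distribution are given in Figure 1. The beta distribution has the following shapes: bathtub, right and left skewed, and symmetric. Now, the beta regression model can be defined based on an appropriate link function, which is defined by
$$g(\mu_i) = x_i^{\top}\beta, \tag{3}$$
where $\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_k)^{\top}$ is the vector of regression parameters and $x_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})^{\top}$ is the vector of $k$ covariates. The symbol $g(\cdot)$ in (3) represents the link function, $g(\cdot) : (0,1) \to \mathbb{R}$. Note that the link function is a strictly increasing and twice differentiable function. The standard link function in the beta regression model is
$$g(\mu_i) = \log\left(\frac{\mu_i}{1-\mu_i}\right), \tag{4}$$
which is called the logit link function. Using the inverse of the link function, the variance of the random variable $Y_i$ is redefined as follows
$$\mathrm{Var}(Y_i) = \frac{\mu_i(1-\mu_i)}{1+\phi}, \qquad \mu_i = \frac{\exp(x_i^{\top}\beta)}{1+\exp(x_i^{\top}\beta)}. \tag{5}$$
Based on the density in (2), the log-likelihood function of the beta regression model is given by
$$\ell(\beta, \phi) = \sum_{i=1}^{n} \ell_i(\mu_i, \phi), \tag{6}$$
where
$$\ell_i(\mu_i, \phi) = \log\Gamma(\phi) - \log\Gamma(\mu_i\phi) - \log\Gamma((1-\mu_i)\phi) + (\mu_i\phi - 1)\log y_i + ((1-\mu_i)\phi - 1)\log(1-y_i).$$
The betareg package uses the optim function to maximize the log-likelihood in (6) and obtain the maximum likelihood estimates (MLEs) of the unknown parameter vector $(\beta, \phi)$.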
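As a minimal sketch of how the log-likelihood in (6) can be evaluated, the Python snippet below computes it for given coefficients. It assumes the mean-precision beta density of Ferrari and Cribari-Neto [13], with shape parameters $\mu\phi$ and $(1-\mu)\phi$, and the logit link. The function names `inv_logit` and `beta_loglik` and the toy data are illustrative and are not part of the betareg package.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_loglik(beta, phi, X, y):
    """Log-likelihood of a beta regression with logit link.

    beta : list of coefficients (intercept first)
    phi  : precision parameter (> 0)
    X    : list of covariate rows, each starting with 1 for the intercept
    y    : list of responses, each strictly inside (0, 1)
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))
        mu = inv_logit(eta)
        p, q = mu * phi, (1.0 - mu) * phi  # shape parameters of the beta density
        total += (math.lgamma(phi) - math.lgamma(p) - math.lgamma(q)
                  + (p - 1.0) * math.log(y_i)
                  + (q - 1.0) * math.log(1.0 - y_i))
    return total


# Toy data (hypothetical): intercept plus one covariate.
X = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.9]]
y = [0.30, 0.45, 0.70]
print(beta_loglik([-1.0, 2.0], 10.0, X, y))
```

In practice this function would be handed to a numerical optimizer over $(\beta, \phi)$, which is the role the text attributes to optim inside betareg.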

Simplex Regression
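Let the random variable $Y$ follow the simplex distribution. The pdf of $Y$ is
$$f(y; \mu, \sigma^2) = \left[2\pi\sigma^2\{y(1-y)\}^3\right]^{-1/2}\exp\left\{-\frac{d(y;\mu)}{2\sigma^2}\right\}, \tag{7}$$
where $0 < y < 1$ and $d(y;\mu)$ is defined as
$$d(y;\mu) = \frac{(y-\mu)^2}{y(1-y)\mu^2(1-\mu)^2}. \tag{8}$$
The random variable having the pdf in (7) is denoted by $Y \sim S(\mu, \sigma^2)$. The mean of the simplex distribution is $E(Y) = \mu$ and its variance is given by
$$\mathrm{Var}(Y) = \mu(1-\mu) - \frac{1}{\sqrt{2\sigma^2}}\exp\left\{\frac{1}{2\sigma^2\mu^2(1-\mu)^2}\right\}\Gamma\left(\frac{1}{2}, \frac{1}{2\sigma^2\mu^2(1-\mu)^2}\right), \tag{9}$$
where $\Gamma(a, b) = \int_b^{\infty} t^{a-1}e^{-t}\,dt$ is the incomplete gamma function. The pdf plots of the simplex distribution are given in Figure 2. The simplex distribution has the following shapes: right and left skewed, symmetric, and bimodal. Using the link function given in (3), the log-likelihood function of the simplex regression model is
$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{3}{2}\sum_{i=1}^{n}\log\{y_i(1-y_i)\} - \frac{1}{2\sigma^2}\sum_{i=1}^{n} d(y_i;\mu_i). \tag{10}$$
The MLEs of $(\beta, \sigma^2)$ are obtained by the vglm function of the VGAM package of the R software, which uses the optim function to maximize the log-likelihood given in (10).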
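The simplex density in (7) and the log-likelihood in (10) can be sketched in a few lines of Python. The snippet assumes the standard form of the simplex density with unit deviance $d(y;\mu) = (y-\mu)^2 / \{y(1-y)\mu^2(1-\mu)^2\}$. The helper names are illustrative, and `total_mass` is only a crude midpoint-rule check that the density integrates to about one.

```python
import math


def simplex_deviance(y, mu):
    """Unit deviance d(y; mu) of the simplex distribution."""
    return (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)


def simplex_logpdf(y, mu, sigma2):
    """Log-density of S(mu, sigma^2) at a point y in (0, 1)."""
    return (-0.5 * math.log(2.0 * math.pi * sigma2)
            - 1.5 * math.log(y * (1.0 - y))
            - simplex_deviance(y, mu) / (2.0 * sigma2))


def simplex_loglik(mu, sigma2, y):
    """Log-likelihood for fitted means mu_i and observed responses y_i."""
    return sum(simplex_logpdf(y_i, mu_i, sigma2) for y_i, mu_i in zip(y, mu))


def total_mass(mu, sigma2, n=5000):
    """Midpoint-rule approximation of the integral of the density over (0, 1)."""
    h = 1.0 / n
    return h * sum(math.exp(simplex_logpdf((i + 0.5) * h, mu, sigma2))
                   for i in range(n))


print(total_mass(0.3, 1.0))
print(simplex_loglik([0.3, 0.4], 1.0, [0.25, 0.5]))
```

Working on the log scale avoids taking the logarithm of a density that underflows to zero near the boundaries of the unit interval.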

Randomized Quantile Residuals
The randomized quantile residuals were proposed by Dunn and Smyth [24]. They are defined as
$$r_i = \Phi^{-1}(u_i), \tag{11}$$
where $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard normal distribution, $u_i = F(y_i; \hat{\beta}, \hat{\phi})$ for the beta regression model, and $u_i = F(y_i; \hat{\beta}, \hat{\sigma}^2)$ for the simplex regression model. The quantile-quantile plot of the residuals as well as goodness-of-fit tests can be used to assess the suitability of the fitted model. The Kolmogorov-Smirnov (KS) test can be applied to the randomized quantile residuals of the beta and simplex regression models. If the obtained p-value is greater than 0.05, the hypothesis that the randomized quantile residuals follow the standard normal distribution is not rejected.
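The residuals in (11) and the KS distance can be illustrated with the Python standard library. This is a sketch under stated assumptions: the fitted CDF is passed in as a plain function, the example CDF $F(y) = y^2$ is a hypothetical stand-in for a fitted model, and the decision rule uses the large-sample 5% critical value $1.358/\sqrt{n}$ rather than the p-value reported by R's ks.test. For a continuous response such as a ratio in (0,1), no randomization is needed, so the residuals are deterministic.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def quantile_residuals(y, cdf):
    """r_i = Phi^{-1}(F(y_i)) for a continuous response and a fitted CDF F."""
    return [STD_NORMAL.inv_cdf(cdf(y_i)) for y_i in y]


def ks_statistic(residuals):
    """One-sample KS distance between the residuals and N(0, 1)."""
    r = sorted(residuals)
    n = len(r)
    d = 0.0
    for i, r_i in enumerate(r, start=1):
        f = STD_NORMAL.cdf(r_i)
        # Compare F with the empirical CDF just after and just before r_i.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def rejects_normality(residuals, crit=1.358):
    """Large-sample 5% rule: reject when D exceeds crit / sqrt(n)."""
    return ks_statistic(residuals) > crit / math.sqrt(len(residuals))


# Hypothetical fitted CDF F(y) = y^2 (a beta law with shapes 2 and 1).
def fitted_cdf(v):
    return v ** 2


y = [0.35, 0.5, 0.62, 0.71, 0.8, 0.88, 0.93, 0.97]
res = quantile_residuals(y, fitted_cdf)
print(ks_statistic(res), rejects_normality(res))
```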

Model Comparison
In general, the R-squared measure is used to assess the explanatory power of the fitted model. However, the standard R-squared measure cannot be used for generalized linear models (GLMs). Pseudo-R-squared measures are available for these types of models. Cox and Snell [25] proposed the following pseudo-R-squared measure for GLMs
$$R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}, \tag{12}$$
where $n$ is the number of observations, $L_0$ is the likelihood value of the null model with only an intercept term, and $L_M$ is the likelihood value of the fitted model with covariates. The likelihood ratio (LR) test is performed to compare the fitted model with the null model. The LR test statistic is given by
$$LR = -2(\ell_0 - \ell_M), \tag{13}$$
where $\ell_M$ is the log-likelihood value of the fitted model and $\ell_0$ is the log-likelihood value of the null model. The LR statistic is asymptotically distributed as $\chi^2$ with $df = df_1 - df_2$, where $df_1$ and $df_2$ are the residual degrees of freedom of the null model and of the fitted model with covariates, respectively, so that $df$ equals the number of covariates.
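The two measures in (12) and (13) depend only on the log-likelihood values, so they are easy to compute by hand. The sketch below assumes the Cox-Snell form $1 - (L_0/L_M)^{2/n}$, written on the log scale, and adds a small chi-square tail function for integer degrees of freedom. The log-likelihood values, sample size, and function names in the example are hypothetical and are not taken from Table 4.

```python
import math


def cox_snell_r2(loglik_null, loglik_model, n):
    """Cox-Snell pseudo R-squared: 1 - (L0 / LM)^(2/n), computed from log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)


def lr_statistic(loglik_null, loglik_model):
    """Likelihood ratio statistic: -2 * (l0 - lM)."""
    return 2.0 * (loglik_model - loglik_null)


def chi2_sf(x, df):
    """Upper tail probability P(X > x) of a chi-square law with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z) = erfc(sqrt(z))
    while a < df / 2.0:
        # Recurrence Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical values: null and fitted log-likelihoods, 38 observations, 4 covariates.
l0, lm, n, k = 40.0, 52.0, 38, 4
lr = lr_statistic(l0, lm)
print(cox_snell_r2(l0, lm, n), lr, chi2_sf(lr, k))
```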

Empirical Results
In this section, the beta and simplex regression models are used to model the incidence ratio of COVID-19 with some covariates. First, we describe our data set. Then, the estimated parameters of the fitted regression models are given. Later, the model evaluation is done by the residual analysis and goodness-of-fit measures. Finally, we interpret the estimated regression parameters for the best-fitted model.

Data Set
The data set comes from the indicators of the Better Life Index (BLI), which is calculated for the OECD countries and is available at https://stats.oecd.org/index.aspx?DataSetCode=BLI (accessed on 29 June 2021). The education and civic engagement indicators of the BLI are selected to model the incidence ratio of the OECD countries. The research question is "Do the education and civic engagement indicators affect the incidence ratio of COVID-19?".
We expect that countries with higher education and civic engagement indicators manage the pandemic more easily when applying measures and closures to reduce the incidence of COVID-19. Note that environmental conditions, government support, and the amount of resources are also important factors that help governments manage the pandemic.
The education indicator consists of two variables: educational attainment and years in education. Similarly, the civic engagement indicator has two variables: stakeholder engagement and voter turnout. These four variables are used as covariates. The response variable, the incidence ratio, is calculated as the ratio of the number of positive cases to the total number of tests. The incidence ratio is calculated based on the data available at https://www.worldometers.info/coronavirus/ (accessed on 29 June 2021). Figure 3 shows the incidence ratios of different countries. The two countries with the highest incidence ratios are Mexico and Brazil. The three countries with the lowest incidence ratios are New Zealand, Denmark, and Australia. The country with the highest case rate in Europe is Slovenia. The incidence ratios of Germany, France, and Italy are very close to each other. The incidence ratio in South Africa is lower than in Poland and Slovenia. Table 1 shows the descriptive statistics of the variables, such as the mean, standard deviation (SD), median, minimum and maximum values, range, skewness, and kurtosis. The educational attainment and stakeholder engagement variables are left-skewed, and the other variables are right-skewed.

Regression Results
Two regression models, the beta and simplex regression models, are fitted to the data. The statistical backgrounds of the models are presented in Section 2. The dependent variable should be defined on the (0,1) interval. The log-likelihood functions of the regression models given in Section 2 are maximized to obtain the parameter estimates. The incidence ratios of the countries are used as the response variable. The covariates are educational attainment, years in education, stakeholder engagement, and voter turnout. The fitted regression structure is given by
$$\operatorname{logit}(\mu_i) = \beta_0 + \beta_1\,\text{Educational attainment}_i + \beta_2\,\text{Years in education}_i + \beta_3\,\text{Stakeholder engagement}_i + \beta_4\,\text{Voter turnout}_i.$$
Since the response variable is defined on the (0,1) interval, the logit link function is used. Table 2 shows the estimated parameters of the beta and simplex regression models with the corresponding standard errors (SEs) and p-values, as well as model selection criteria, namely the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Based on the AIC and BIC values, we conclude that the beta regression model fits the data better than the simplex regression model since it has lower values of both measures.
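For readers who want to recompute the model selection criteria, the sketch below uses the usual definitions AIC $= -2\ell + 2p$ and BIC $= -2\ell + p\log n$, where $p$ counts all estimated parameters (here the intercept, four slopes, and one precision or dispersion parameter). The log-likelihood values and the sample size are hypothetical and do not come from Table 2.

```python
import math


def aic(loglik, n_params):
    """Akaike Information Criterion: -2 * loglik + 2 * p."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 * loglik + p * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical maximized log-likelihoods for two competing models.
n_obs, n_params = 38, 6
fits = {"beta": 52.0, "simplex": 47.5}
for name, loglik in fits.items():
    print(name, aic(loglik, n_params), bic(loglik, n_params, n_obs))

# The model with the smaller criterion value is preferred.
best = min(fits, key=lambda name: aic(fits[name], n_params))
print(best)
```

Because both models have the same number of parameters in this setting, the ranking by AIC or BIC coincides with the ranking by log-likelihood.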

Comparison
The randomized quantile residuals of the regression models are calculated by using the equation in (11). After that, the quantile-quantile (QQ) plots with simulated envelopes of the randomized quantile residuals for both regression models are plotted in Figure 5. If the model fits the data perfectly, the plotted points should lie on the diagonal line. The plotted points of the beta regression model are closer to the diagonal line than those of the simplex regression model. To verify the visual results by a hypothesis test, the KS test is implemented and the results are summarized in Table 3. From these results, it is clear that the normality assumption of the randomized quantile residuals holds for the beta regression model. However, the residuals of the simplex regression model do not satisfy the normality assumption. This can be regarded as evidence for the superiority of the beta regression model over the simplex regression model for the data set used. Using the equations in (12) and (13), the Cox-Snell R-squared values and LR test results of the regression models are calculated and reported in Table 4. The pseudo-R-squared value of the beta regression model is higher than that of the simplex regression model, which is further evidence in favor of the beta regression model.
Additionally, the beta regression model provides better results than the null model since the obtained p-value of the LR test is p < 0.001. However, the simplex regression model does not provide better results than its null model since its p-value is higher than the 0.05 significance level.

Interpretation
In this subsection, the estimated regression parameters of the beta regression model are interpreted. Educational attainment, years in education, and voter turnout are found to be statistically significant since their p-values are less than the 0.05 significance level. When the educational attainment, years in education, and voter turnout increase, the incidence ratios of the countries decrease. Stakeholder engagement has no statistically significant effect on the incidence ratios of the countries. These results show that the incidence ratios are lower in countries with higher education levels and stronger democratic participation. A possible explanation is that citizens living in these countries comply more with the measures taken during the pandemic.
Data openness is a crucial issue as all countries have to make information accessible for their institutions. During the pandemic, this issue has received more attention since some countries have not disclosed COVID-19 cases fully and on time. Mak [26] investigated the importance of data openness for predicting COVID-19 cases based on five East Asian cities, namely Beijing, Hong Kong, Seoul, Taipei, and Tokyo. Mak [26] analyzed the possible relation between pollution and lockdown policies by comparing the pre- and post-pandemic periods and emphasized that air pollution decreased after COVID-19 waves. Air pollution can be considered an important explanatory variable to explain the change in COVID-19 cases.
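Under the logit link, the sign and size of a coefficient can be read through the odds $\mu/(1-\mu)$: raising covariate $j$ by $\delta$ multiplies the odds by $\exp(\beta_j \delta)$. The short sketch below illustrates this reading with hypothetical coefficients, not the estimates in Table 2, and the function names are illustrative.

```python
import math


def predicted_mean(beta, x):
    """Mean response under the logit link for covariate row x (intercept first)."""
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def odds_ratio(beta_j, delta=1.0):
    """Multiplicative change in mu / (1 - mu) when covariate j rises by delta."""
    return math.exp(beta_j * delta)


# Hypothetical coefficients: intercept and one covariate with a negative effect.
beta = [0.5, -0.04]
low = predicted_mean(beta, [1.0, 60.0])
high = predicted_mean(beta, [1.0, 70.0])
print(low, high, odds_ratio(beta[1], delta=10.0))
```

A negative coefficient gives an odds ratio below one, which corresponds to a lower expected incidence ratio as the covariate increases.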

SimBetaReg Web-Tool
In this section, we introduce a web-tool for the beta and simplex regression models. The implementations of these models are presented in Section 3. Now, we reproduce the results given in Section 3 with the developed web-tool, SimBetaReg. These models are not available in widely used statistical programs, such as IBM SPSS and Minitab. The SimBetaReg web-tool is freely accessible at https://smartstat.shinyapps.io/SimBetaReg/ (accessed on 1 July 2021). Using the developed application, researchers can do their own analysis by uploading their data sets. Figure 6 shows how to upload a data set to the SimBetaReg application. Note that the only acceptable data file format is csv (comma-separated values), and the separator has to be a semicolon. The first row of the data set should contain the variable names. Using the "Select..." tab, one can easily upload a csv file that contains the data set to be analyzed. The uploaded data set is displayed in the "Data Table" tab. After uploading the data set, users can select the dependent variable and independent variable(s). Note that the dependent variable should be defined on the (0,1) interval; otherwise, the model does not work. Then, after selecting the model under "Bounded regression models" and clicking the "GO!" button, users can display the estimated parameters of the model in the "Parameter estimates" tab. Figure 7 displays the estimated parameters of the beta regression model obtained by the SimBetaReg application. The results obtained by the SimBetaReg web-tool are the same as the results given in Section 3 (see Table 2).
The QQ plot of the randomized quantile residuals can be displayed in the "QQ plot of the randomized quantile residuals" tab (see Figure 8). This plot can be downloaded in png format by clicking "Download Plot". As mentioned before, the randomized quantile residuals should be distributed as N(0,1) if the model is appropriate for the data set, so this should be checked after fitting the model. The normality test of the residuals is performed by the KS test, and the results are displayed in the "Normality test for the residuals" tab (see Figure 9). The same results are reported in Table 3. SimBetaReg interprets the estimated parameters of the models automatically. This property of the SimBetaReg application is very helpful for researchers with less statistical knowledge and may even be its most useful feature. The interpretations of the estimated parameters are displayed in the "Interpretation of the model results" tab (see Figure 10). The Cox-Snell pseudo R-squared and LR test results are given in the "pseudo R-squared and Likelihood Ratio Test" tab (see Figure 11). The pseudo R-squared measure can be used to compare the beta and simplex regression models. The LR test results show the adequacy of the model against the null model (see Table 4). Additionally, the AIC and BIC values of the fitted regression models are given in the "AIC and BIC values of the Beta and Simplex Regression Models" tab (see Figure 12). The AIC and BIC statistics are used to select the best model for the data. Thus, the beta regression is selected as the best model in Section 3 (see Table 2).
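Before uploading, a data file can be checked against the rules stated above: semicolon separator, variable names in the first row, and a dependent variable strictly inside (0,1). The following sketch does this with the Python standard library. The function name `check_upload` and the column names in the sample are hypothetical, and the check is independent of the web-tool itself.

```python
import csv
import io


def check_upload(text, response):
    """Check semicolon-separated CSV text against the stated upload rules.

    Returns the parsed rows as a list of dicts; raises ValueError otherwise.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    if reader.fieldnames is None or response not in reader.fieldnames:
        raise ValueError(f"column {response!r} not found in the header row")
    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        value = float(row[response])
        if not 0.0 < value < 1.0:
            raise ValueError(f"line {line_no}: {response}={value} is outside (0, 1)")
    return rows


# Hypothetical file content with a header row and two observations.
sample = "country;incidence;voter_turnout\nA;0.12;68\nB;0.07;79\n"
rows = check_upload(sample, "incidence")
print(len(rows), rows[0]["voter_turnout"])
```

To check a file on disk, its content can be read with `open(path, newline="").read()` and passed to the same function.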

Used R Packages
During the development process of the SimBetaReg web-tool, several R packages were used. These were the betareg package by Cribari-Neto and Zeileis [27], the VGAM package by Yee [28], the ggplot2 package by Wickham [29], the gamlss.dist package by Stasinopoulos and Rigby [30], and the stats package by R Core Team [31]. The betareg function of the betareg package was used to obtain the parameter estimates of the beta regression model. The vglm function of the VGAM package was used to obtain the parameter estimates of the simplex regression model. The ggplot function of the ggplot2 package was used to draw the QQ plot of the residuals with simulated envelopes. The gamlss.dist package was used for the CDF of the simplex distribution. Finally, the ks.test function of the stats package was used to perform the KS test.

Conclusions and Future Work
In this study, we developed a cloud-based web-tool using the R Shiny platform. The developed application, SimBetaReg, is user-friendly and makes the implementation of the beta and simplex regression models easy for researchers and academicians. The parameter estimates, residual analysis, and likelihood ratio tests with pseudo-R-squared values of the beta and simplex regression models are easily obtained by the SimBetaReg web-tool.
Additionally, the incidence ratio of COVID-19 was analyzed with the developed web-tool, and the empirical results are of interest to policy-makers. As future work related to the present study, we plan to develop a new web-tool to model the time-dependent incidence ratio by considering a longitudinal beta regression model. We believe that SimBetaReg will continue to gain attention from researchers, especially those working in the actuarial and medical sciences.

Data Availability Statement:
The data set is available upon request from the authors.

Conflicts of Interest:
The authors declare no conflict of interest.
