Quasi-Cauchy Regression Modeling for Fractiles Based on Data Supported in the Unit Interval

Abstract: A fractile is a location on a probability density function with the associated surface being a proportion of such a density function. The present study introduces a novel methodological approach to modeling data within the continuous unit interval using fractile or quantile regression. This approach has a unique advantage as it allows for a direct interpretation of the response variable in relation to the explanatory variables. The new approach provides robustness against outliers and permits heteroscedasticity to be modeled, making it a tool for analyzing datasets with diverse characteristics. Importantly, our approach does not require assumptions about the distribution of the response variable, offering increased flexibility and applicability across a variety of scenarios. Furthermore, the approach addresses and mitigates criticisms and limitations inherent to existing methodologies, thereby giving an improved framework for data modeling in the unit interval. We validate the effectiveness of the introduced approach with two empirical applications, which highlight its practical utility and superior performance in real-world data settings.


Introduction
Modeling continuously distributed data within the unit interval, which includes rates and percentages, is vital in many fields of knowledge [1-4]. This modeling is of particular interest in research areas in which indices, percentages, and rates play a significant role. Note that we often encounter data that originate from continuous random variables that have constraints on their possible values. Such data have gained considerable importance in the context of the COVID-19 pandemic, as they enable the exploration of various aspects such as global infection and recovery rates, as well as mortality statistics [5-7].
When dealing with data that are bounded within a specific range, the regression approach emerges as a widely used statistical method for estimating parameters and conducting hypothesis tests [8]. Typically, such data are fitted using multiple linear regression models [9,10], where their parameters are often estimated utilizing the ordinary least squares (OLS) technique. This technique has several advantages which have contributed to its widespread employment in various fields. Firstly, the OLS technique is relatively simple and computationally efficient, making it accessible to a broad spectrum of users. The underlying assumptions of the OLS technique, when met, ensure the best linear unbiased estimators for the model parameters. Furthermore, the results obtained using the OLS technique are interpretable and straightforward, offering intuitive insights into relationships between variables. The mentioned advantages have facilitated the application of OLS in various research settings.
After this introduction, the article is structured as follows. Section 2 states the principles of fractile regression and introduces our approach to modeling data in the unit interval. In Section 3, we demonstrate the utility of our approach through various applications. Lastly, Section 4 concludes our article, summarizing our findings and their implications.

Fractile Regression for Data in the Unit Interval
This section focuses on fractile regression for the modeling of data in the unit interval. We start by discussing the limitations that traditional approaches encounter when dealing with this type of data. Then, we delve into the concept of fractile regression, outlining its key properties and suitability for such modeling. Our approach, which exploits fractile regression for data in the unit interval, is also introduced here.

Prelude to Fractile Regression
Consider the traditional regression structure represented by

$$Y = x\beta + \varepsilon, \tag{1}$$

where $Y = (Y_i)$ is an $n \times 1$ vector of responses, with $Y_i$ being the random variable to be modeled for individual $i \in \{1, \ldots, n\}$; whereas $x = (x_{ij})$ denotes an $n \times k$ matrix of known values $x_{ij}$ of covariate $X_j$ for individual $i$, with $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, k\}$, and $k < n$.
The term $\beta$ is a $k \times 1$ vector of regression parameters to be estimated; and $\varepsilon$ constitutes an $n \times 1$ vector of independent and identically distributed errors with zero mean and constant variance. The most common technique used to estimate $\beta$ is OLS by means of

$$\widehat{\beta} = \arg\min_{\beta \in \mathbb{R}^{k}} \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2, \tag{2}$$

where $y_i$ is the observed value of the random variable $Y_i$, $x_i = (1, x_{i1}, \ldots, x_{ip})^{\top}$, and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^{\top}$ defined as in (1), with $k = p + 1$. Note that

$$\mathrm{E}(Y_i \mid x_i) = x_i^{\top}\beta$$

expresses the conditional mean of $Y_i$ given $x_i$ in the structure of a linear model.

The idea of fractile regression, to model a quantile (or fractile) of order $\tau$, $Q_\tau$ say, with $0 < \tau < 1$, as suggested in [26], is based on absolute error minimization, considering weights to curtail the error in estimating a fractile of interest related to $\beta$, which will henceforth be denoted as $\beta(\tau)$, such as

$$\widehat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^{k}} \sum_{i=1}^{n} \rho_\tau(y_i - x_i^{\top}\beta), \tag{3}$$

where $\rho_\tau(v) = v(\tau - 1_{\{v<0\}})$ is the loss function and $1_B$ is the indicator function of the set $B$.
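Observe that (3) is analogous to the OLS technique stated in (2), which operates on the principle of squared error minimization, but with a weighted absolute error in place of the squared error. Notably, provided all prerequisites of OLS are met with symmetrically distributed errors and the median is modeled via fractile regression, that is, τ = 0.5, both approaches target the same conditional location, and the estimates they generate are expected to align.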
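To make the loss function in (3) concrete, the following minimal sketch considers the simplest case of a model with only a constant term. In that case, minimizing the sum of check losses returns a sample fractile. The data values and the helper names (`check_loss`, `total_loss`, `best_constant`) are hypothetical and serve only as an illustration.

```python
def check_loss(v, tau):
    """Check loss rho_tau(v) = v * (tau - 1{v < 0})."""
    return v * (tau - (1.0 if v < 0 else 0.0))


def total_loss(q, ys, tau):
    """Sum of check losses when every observation is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in ys)


def best_constant(ys, tau):
    """Return the observation that minimizes the total check loss.

    The objective is piecewise linear and convex in q, so a minimizer
    can always be found among the observed values.
    """
    return min(ys, key=lambda q: total_loss(q, ys, tau))


ys = [0.12, 0.35, 0.08, 0.91, 0.47, 0.66, 0.23]  # hypothetical data in [0, 1]
print(best_constant(ys, 0.50))  # coincides with the sample median
print(best_constant(ys, 0.25))  # a lower fractile
```

Residuals below the fit are weighted by 1 − τ and residuals above it by τ, which is what moves the minimizer away from the median when τ ≠ 0.5.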
In [27], it was shown that β(τ) can be estimated by transforming the problem stated in (3) into one of linear programming. Thus, the minimization established in (3) can be substituted by

$$\min_{(\beta(\tau), u, v) \in \mathbb{R}^{k} \times \mathbb{R}_{+}^{2n}} \left\{ \tau 1_n^{\top} u + (1 - \tau) 1_n^{\top} v \right\} \quad \text{subject to} \quad y = x\beta(\tau) + u - v, \tag{4}$$

where $1_n$ is an $n \times 1$ vector of ones, whereas $u$ and $v$ are both $n \times 1$ vectors composed of elements given by $u_i = \max\{0, y_i - \widehat{y}_i\}$ and $v_i = \max\{0, \widehat{y}_i - y_i\}$, respectively, for $i \in \{1, \ldots, n\}$, with $\widehat{y}_i = x_i^{\top}\beta(\tau)$. The formulation presented in (4) can be utilized to scrutinize the response variable as a function of the explanatory variables at different fractiles of the conditional distribution of this response.
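The linear programming view can be illustrated with a small sketch for one covariate plus an intercept. It assumes the usual form of the objective, τ 1'u + (1 − τ) 1'v, with u and v being the positive and negative parts of the residuals. Instead of a simplex solver, the sketch enumerates the candidate vertices directly: for k = 2 parameters, an optimal solution can be found among the lines passing through two observations. The data and function names are hypothetical, and the brute-force search is a swapped-in technique chosen for transparency, not the algorithm used in [27].

```python
from itertools import combinations


def lp_objective(beta, xs, ys, tau):
    """Value of tau * 1'u + (1 - tau) * 1'v for the line beta[0] + beta[1] * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        fitted = beta[0] + beta[1] * x
        u = max(0.0, y - fitted)   # positive part of the residual
        v = max(0.0, fitted - y)   # negative part of the residual
        total += tau * u + (1.0 - tau) * v
    return total


def fit_line_by_vertices(xs, ys, tau):
    """Brute-force search over lines through two observations.

    A bounded linear program attains its optimum at a vertex; here a vertex
    corresponds to a fit with at least k = 2 zero residuals.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # no finite slope through these two points
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        value = lp_objective((intercept, slope), xs, ys, tau)
        if best is None or value < best[0]:
            best = (value, intercept, slope)
    return best[1], best[2]


# Hypothetical data: five points on the line 1 + 2x and one outlying response.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 40.0]
intercept, slope = fit_line_by_vertices(xs, ys, tau=0.5)
print(intercept, slope)
```

In this toy example, the median fit follows the five aligned points and is not pulled toward the outlying response, which is the robustness property discussed in the text. The enumeration costs on the order of n³ operations, so it is only suitable for small illustrations.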
The estimators of fractile regression parameters can be more efficient than those of OLS when the error term does not follow a Gaussian distribution, for instance, under heavy-tailed errors. Moreover, these estimators are less affected by outliers in the response variable [27].
One especially intriguing characteristic of fractile regression, which underpins the methodology proposed in this work, is that the fractile function commutes with monotonic transformations, a property known as equivariance. Detailed information on this and other fractile regression properties can be found in [27]. Thus, for any given random variable $Y$, the following holds true: $Q_\tau(\Psi(Y)) = \Psi(Q_\tau(Y))$, where $\Psi$ is a nondecreasing function on $\mathbb{R}$.
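The equivariance property can be checked numerically with a short sketch. It assumes a sample fractile defined as an order statistic (the ⌈nτ⌉-th smallest value), and it uses the exponential function as one example of a nondecreasing Ψ. The data are hypothetical.

```python
import math


def sample_fractile(values, tau):
    """Order-statistic fractile: the ceil(n * tau)-th smallest value."""
    ordered = sorted(values)
    index = max(0, math.ceil(len(ordered) * tau) - 1)
    return ordered[index]


def psi(y):
    """A nondecreasing transformation; the exponential is only an example."""
    return math.exp(y)


ys = [0.05, 0.40, 0.10, 0.95, 0.60, 0.25, 0.75, 0.30]  # hypothetical data

for tau in (0.25, 0.50, 0.75):
    left = sample_fractile([psi(y) for y in ys], tau)   # Q_tau(Psi(Y))
    right = psi(sample_fractile(ys, tau))               # Psi(Q_tau(Y))
    print(tau, left == right)

# The mean does not share this property for a nonlinear Psi.
mean_of_transformed = sum(psi(y) for y in ys) / len(ys)
transformed_mean = psi(sum(ys) / len(ys))
print(mean_of_transformed == transformed_mean)
```

The contrast with the mean is the reason why a model fitted on a transformed scale can be mapped back to the original scale for fractiles, whereas the same step is not valid for conditional means.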
In the following subsection, we present the model formalization for variables supported in [0, 1] and the corresponding link function.

Conditions Necessary for the Link Function
As in the formulation stated in (3), let Y be an n × 1 response vector and x be an n × k matrix containing the values of k covariates employed to model the response Y i , with i ∈ {1, . . . , n}. Unlike the structure defined in (3), each of the n observations of the response vector now falls within the interval [0, 1].
Let $G$ denote a function such that $G: [0, 1] \to \mathbb{R}$, which is monotone non-decreasing, has an inverse function $G^{-1}$, and is differentiable at least once. The idea of our proposal is to use the link function $G$ to map $Y$ to $\mathbb{R}$ and then to estimate the model parameters via fractile regression. Hence, the fractile regression model is formulated as

$$Q_\tau(G(Y_i) \mid x_i) = x_i^{\top}\beta(\tau), \quad i \in \{1, \ldots, n\},$$

where $\beta(\tau)$ is a $k \times 1$ vector of fractile regression parameters to be estimated.
According to the equivariance property of the fractile function $Q_\tau$, we find that

$$Q_\tau(Y_i \mid x_i) = G^{-1}(x_i^{\top}\beta(\tau)).$$

Note that we can therefore directly interpret the effect of a change in the value $x_{ij}$ on the response $Y_i$, for $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, k\}$.

Choice of the Link Function
In [24], a link based on the logistic function (logit link), stated in (5), was used to model data in the unit interval by means of fractile regression. In (5), w ∈ [0, 1] and δ (considered as δ = 0.001 in [24]) is arbitrary and should be chosen such that logit(w) is defined for all w ∈ [0, 1]. Inclusion of δ in the logit link function presented in (5) ensures the absence of indeterminacies. With this link function, the fractile regression model can be written as

$$Q_\tau(\mathrm{logit}(Y_i) \mid x_i) = x_i^{\top}\beta(\tau).$$

However, the logit link function can be criticized for two reasons when modeling data with support in the continuous unit interval. The first criticism relates to the lack of generality of the link function because, depending on the data being modeled and the choice of δ, it may not be feasible to map all the sample elements to R. The second criticism concerns the weighting of the observations mapped to R when the sample does not contain the extreme values zero and one of the interval. These issues are discussed further below.
To address the two mentioned criticisms, an adaptation of the function stated in (5), denoted as logit 2 and stated in (6), is proposed. However, the function logit 2 is still subject to the first mentioned criticism. An alternative form for G, which satisfies all the necessary conditions and is robust against the criticisms of the link function stated in (5), is the function based on the fractile function of the standard Cauchy distribution defined as

$$G(w) = \tan\left(\Pi\left(w - \frac{1}{2}\right)\right), \tag{7}$$

where "tan" denotes the tangent function and Π is a parameter used as a calibration tool for optimizing the model fit. It is important to state that 0 < Π < π ≈ 3.1416 to preserve the properties of the link function G. Thus, the function given in (7) is defined for all w ∈ [0, 1]. Various authors have employed the Cauchy distribution as a link function in regression analyses [11,28-30]. Nonetheless, our approach introduces a new class of link functions derived from the Cauchy distribution such as those stated in (7), which we refer to as quasi-Cauchy. In this context, any distinct value of Π ∈ (0, π) generates a unique link function.

To gain a deeper understanding of the second point of criticism, we consider a set of n = 61 simulated observations as our sample data y = (y 1 , . . . , y 61 ). In the data y, one extreme value of the interval [0, 1] is present, namely zero, and the maximum is 0.8670. Table 1 presents descriptive statistics of the sample y and its transformation using the link functions given in (5), (6), and (7). The value Π = 2.5 was selected after several ad hoc evaluations to improve the model's fit and ensure that extreme observations can be mapped to relatively extreme points in R. Observing the maximum and minimum values of the link functions, it becomes clear that the link stated in (5) assigns similar weights when mapping the values 0 and 0.8670 to R. This is undesirable, as zero is an extreme value of the interval [0, 1], unlike the value 0.8670.
Figure 1 displays boxplots that provide a deeper insight into the disparities among the transformed data using different link functions.  Examining the median (indicated by the solid line in the center of the boxes) and the mean (marked by the red point inside the boxes) of the data transformed, we can observe the following: (i) when the link function stated in (5) is considered, both mean and median are significantly distanced from zero, suggesting that the transformation mapping onto R is strongly influenced by extreme values; and in contrast, (ii) for the other link functions defined in (6) and (7), the mean and median are situated close to zero, indicating a more balanced transformation.
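As a minimal sketch of the quasi-Cauchy link, the code below assumes the form G(w) = tan(Π(w − 1/2)), that is, the standard Cauchy fractile function with π replaced by the calibration parameter Π. This form is our reading of the description of (7), and the function names and the argument `cap_pi` are hypothetical. The sketch evaluates the link at the two ends of the interval and at the sample maximum 0.8670 mentioned in the text.

```python
import math


def quasi_cauchy_link(w, cap_pi=2.5):
    """Map w in [0, 1] to the real line via tan(cap_pi * (w - 0.5)).

    Assumed form of the quasi-Cauchy link; 0 < cap_pi < pi keeps the
    tangent finite at w = 0 and w = 1.
    """
    if not 0.0 < cap_pi < math.pi:
        raise ValueError("cap_pi must lie in (0, pi)")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return math.tan(cap_pi * (w - 0.5))


def quasi_cauchy_inverse(z, cap_pi=2.5):
    """Inverse map from the real line back toward the unit interval."""
    return 0.5 + math.atan(z) / cap_pi


for w in (0.0, 0.5, 0.8670, 1.0):
    z = quasi_cauchy_link(w)
    print(w, z, quasi_cauchy_inverse(z))
```

Under this assumed form, the value zero is mapped noticeably farther from the origin than 0.8670, in line with the discussion of the second criticism. Note also that the inverse takes values in 0.5 ± π/(2Π), which is slightly wider than [0, 1] whenever Π < π (approximately (−0.128, 1.128) for Π = 2.5), so back-transformed fitted values may deserve a range check.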

Interpretation
An advantage of the introduced alternative approach is its ability to directly interpret the estimated results on the response variable. Consider a fractile regression model defined as

$$Q_\tau(Y_i \mid x_i) = G^{-1}(x_i^{\top}\beta(\tau)).$$

Then, the impact of a change in one unit of a covariate on the response variable can be interpreted through its marginal effect on the average, denoted as $E_m$. Let $\bar{x}$ be a $k \times 1$ vector composed of elements $\bar{x}_j$, with $j \in \{1, \ldots, k\}$, and $\bar{x}_j = \sum_{i=1}^{n} x_{ij}/n$. Also, as mentioned, $\beta(\tau)$ is a $k \times 1$ vector of fractile regression parameters. Then, $E_m$ is defined as

$$E_m = \frac{\partial G^{-1}(\bar{x}^{\top}\beta(\tau))}{\partial \bar{x}_j}. \tag{8}$$

Using the expression presented in (7) as an example, we can quantify the impact of a change in $x_j$ directly on the mean of $Y$ through

$$E_m = \frac{\beta_j(\tau)}{\Pi \left(1 + (\bar{x}^{\top}\beta(\tau))^2\right)}. \tag{9}$$

The formula stated in (9) is derived from the inverse tangent function, rescaled by the factor $\Pi$ (here, $\Pi = 2.5$). This formula captures the rate of change in the mean of $Y$ with respect to $x_j$. Thus, such a formula illustrates the impact of changing one unit in $x_j$ on the response variable $Y$, evaluated at the covariate averages. Based on the expression given in (8), we can measure the impact of a change in any covariate on the response variable when using the link functions defined in (5) or (6).
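The marginal effect can be checked with a short numerical sketch. It assumes the quasi-Cauchy inverse link G⁻¹(z) = 1/2 + arctan(z)/Π, for which the derivative with respect to the j-th covariate, evaluated at the covariate averages, is β_j/(Π(1 + η²)) with η being the linear predictor at those averages. The coefficient and covariate values are hypothetical, and the finite difference is included only as a sanity check of the closed form.

```python
import math

CAP_PI = 2.5  # assumed calibration value


def inverse_link(z):
    """Assumed inverse of the quasi-Cauchy link."""
    return 0.5 + math.atan(z) / CAP_PI


def linear_predictor(beta, x):
    """Inner product between the coefficient vector and a covariate vector."""
    return sum(b * v for b, v in zip(beta, x))


def marginal_effect(beta, x_bar, j):
    """Closed-form effect of covariate j at the covariate averages."""
    eta = linear_predictor(beta, x_bar)
    return beta[j] / (CAP_PI * (1.0 + eta * eta))


beta = [0.2, -0.5, 0.9]   # hypothetical coefficients, intercept first
x_bar = [1.0, 0.4, 0.3]   # hypothetical averages (1 for the intercept)
j = 2

analytic = marginal_effect(beta, x_bar, j)

# Forward finite difference of the inverse link in the j-th covariate.
h = 1e-6
shifted = list(x_bar)
shifted[j] += h
numeric = (inverse_link(linear_predictor(beta, shifted))
           - inverse_link(linear_predictor(beta, x_bar))) / h

print(analytic, numeric)
```

The effect has the same sign as the coefficient and shrinks as the linear predictor moves away from zero, so the same coefficient implies a smaller change in the response near the ends of the unit interval.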

Simulation Study
We conduct a simulation study with M = 10,000 Monte Carlo replicates to assess the finite sample properties of the estimators for the fractile regression parameters under various link functions. Our study not only tests the applicability of the estimators across different distributions but also verifies their robustness with different link functions. This verification is carried out even in simplified scenarios, reaffirming the robustness of our results. In the simulations, the response variable was generated based on the link function $G$ and then modeled using this link function. Let $\theta_\tau = (\beta_1(\tau), \beta_2(\tau), \beta_3(\tau))^{\top}$ be the vector of fractile regression parameters to be estimated.
The empirical mean of the parameter estimates, denoted by $M(\widehat{\theta}_\tau)$, was evaluated using Monte Carlo simulations. Additionally, the bias and mean squared error (MSE) of the estimators, denoted by $B(\widehat{\theta}_\tau)$ and $\mathrm{MSE}(\widehat{\theta}_\tau)$, respectively, were also assessed via simulation. The data were randomly generated from a normal distribution.
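The three Monte Carlo summaries can be sketched in a deliberately simplified setting. The example below uses a model with only a constant term, a normal error on the link scale, the assumed quasi-Cauchy form tan(Π(w − 1/2)) with Π = 2.5, and far fewer replicates than the M = 10,000 used in the study. All names and numerical settings are hypothetical and do not reproduce the design behind Table 2.

```python
import math
import random
import statistics

CAP_PI = 2.5  # assumed calibration value


def link(w):
    """Assumed quasi-Cauchy link."""
    return math.tan(CAP_PI * (w - 0.5))


def inverse_link(z):
    """Assumed inverse link, clipped so that the response stays in [0, 1].

    Clipping is a nondecreasing map, so the median is unaffected.
    """
    return min(1.0, max(0.0, 0.5 + math.atan(z) / CAP_PI))


def monte_carlo(n, replicates, beta_true=0.3, seed=1234):
    """Empirical mean, bias, and MSE of the median estimator on the link scale."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        z = [beta_true + rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [inverse_link(v) for v in z]                 # response in [0, 1]
        estimates.append(statistics.median([link(v) for v in y]))
    mean_est = statistics.fmean(estimates)
    bias = mean_est - beta_true
    mse = statistics.fmean([(e - beta_true) ** 2 for e in estimates])
    return mean_est, bias, mse


results = {n: monte_carlo(n, replicates=300) for n in (40, 200)}
for n, (mean_est, bias, mse) in results.items():
    print(n, round(mean_est, 4), round(bias, 4), round(mse, 4))
```

With a constant-only model and τ = 0.5, the fractile regression estimate reduces to the sample median of the transformed response, which keeps the sketch short while preserving the definitions of the empirical mean, bias, and MSE.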
The model parameters $\beta_1(0.25) = 0.5$, $\beta_2(0.25) = -0.5$, and $\beta_3(0.25) = 0.9$ were estimated for $\tau = 0.25$ employing the link function defined in (6), while $\beta_1(0.5) = 1.5$, $\beta_2(0.5) = 2.5$, and $\beta_3(0.5) = 1.9$ were estimated for $\tau = 0.50$ utilizing the link function stated in (7), with $\Pi = 2.5$ also being considered. Estimates for both $\tau \in \{0.25, 0.50\}$ were obtained considering the sample sizes $n \in \{40, 60, 100, 200, 500\}$. The results of this simulation study are shown in Table 2. The link function stated in (5) was not included in the simulation study due to the lack of its inverse function. As expected, we observe that the MSE of the estimators decreased as the sample size $n$ increased.

The fractile regression estimation process used to model the response variable contained in the interval [0, 1] can be divided into three main steps:

Step 1: Select an appropriate link function that satisfies the conditions mentioned in Section 2.2.
Step 2: (a) Apply the link function to the response variable; (b) estimate the fractile regression parameters as proposed in [26,31]; and (c) test the fractile regression model.
Step 3: Use the expression for E m given in (8) to ascertain the impact of changing one unit of the covariate on the response variable.
For technical details about statistical estimation, see [32]. Evaluation of the introduced fractile regression model was performed, noting that it possesses the same properties as the fractile regression models presented in [26,27]. Therefore, the same hypothesis testing, confidence intervals, and measures of fitting quality that are used in conventional fractile regression models can be employed in our case. The detailed estimation process of the fractile regression parameters is illustrated in Figure 2.
Figure 2. Flowchart of the estimation process. Step 1: select the link function. Step 2(a): apply the link function. Step 2(b): estimate the model parameters. Step 2(c): test the fractile regression model. Step 3: use E m and interpret the results.
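The steps of the estimation process can be sketched end to end for one covariate plus an intercept. The sketch assumes the quasi-Cauchy form tan(Π(w − 1/2)) with Π = 2.5 for Step 1, and it replaces a linear programming solver with an exact but brute-force search over lines through two observations for Step 2(b). Step 2(c), testing the model, is not shown. The simulated data, coefficient values, and function names are hypothetical.

```python
import math
import random
from itertools import combinations

CAP_PI = 2.5  # Step 1: assumed quasi-Cauchy link with this calibration value


def link(w):
    return math.tan(CAP_PI * (w - 0.5))


def inverse_link(z):
    # Clipped so that simulated responses stay inside [0, 1].
    return min(1.0, max(0.0, 0.5 + math.atan(z) / CAP_PI))


def check_loss_sum(b0, b1, xs, zs, tau):
    """Sum of check losses of the line b0 + b1 * x on the link scale."""
    total = 0.0
    for x, z in zip(xs, zs):
        r = z - (b0 + b1 * x)
        total += r * (tau - (1.0 if r < 0 else 0.0))
    return total


def fit_fractile_line(xs, zs, tau):
    """Exact fit for one covariate: search all lines through two observations."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue
        b1 = (zs[j] - zs[i]) / (xs[j] - xs[i])
        b0 = zs[i] - b1 * xs[i]
        loss = check_loss_sum(b0, b1, xs, zs, tau)
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


# Hypothetical data: median of G(Y) given x equals -0.5 + 0.8 * x.
rng = random.Random(1234)
n = 60
xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
ys = [inverse_link(-0.5 + 0.8 * x + rng.gauss(0.0, 0.5)) for x in xs]

zs = [link(y) for y in ys]                       # Step 2(a): apply the link
b0, b1 = fit_fractile_line(xs, zs, tau=0.5)      # Step 2(b): estimate
x_bar = sum(xs) / n
eta = b0 + b1 * x_bar
effect = b1 / (CAP_PI * (1.0 + eta * eta))       # Step 3: marginal effect
print(b0, b1, effect)
```

The last line reports the estimated coefficients on the link scale together with the marginal effect on the original scale, which is the quantity meant for interpretation in Step 3.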

Choosing the Value for Π
When estimating the formulation stated in (7), we initially assume that Π = 2.5. Although Π can be chosen within the support (0, π), a systematic approach would involve selecting the value that yields the best model fit. This selection can be evaluated using goodness-of-fit measures like the pseudo-R 2 or the Akaike information criterion (AIC).
Unlike the traditional R 2 used in OLS regression, the pseudo-R 2 does not provide a proportion of variance explained by the model. Instead, it measures how much the model reduces the deviation of the predicted values from the observed values relative to a reference model, with larger values indicating a better fit. In contrast, the AIC is a measure that balances the fit of the model against its complexity. Generally, a model with a smaller AIC is considered to provide a better fit, given the number of parameters it utilizes.
For illustrative purposes, we consider a simulated dataset. The data were randomly generated from a normal distribution. Using β 1 (0.5) = 1, β 2 (0.5) = −2, β 3 (0.5) = 5 and the inverse of the function stated in (7), we generated n = 100 observations of the response y, considering the seed 1234. Then, the model was fitted to the simulated data for τ = 0.5, testing 500 different values for Π, uniformly distributed in the interval (0, π). Figure 3 displays the values of Π used and the corresponding pseudo-R 2 values obtained. We observe a peak in the pseudo-R 2 at Π = 2.9972, indicating that this is the optimal value for such a specific scenario.
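The search over Π can be sketched as a simple grid evaluation. Several assumptions are made here, none of which is confirmed by the text: the pseudo-R 2 is taken to be one minus the ratio between the minimized check loss of the fitted model and that of a constant-only model, both computed on the link scale; the link has the form tan(Π(w − 1/2)); a single covariate is used; and the grid has 20 values instead of 500. The data are hypothetical and do not reproduce the scenario behind Figure 3.

```python
import math
import random
from itertools import combinations


def link(w, cap_pi):
    """Assumed quasi-Cauchy link."""
    return math.tan(cap_pi * (w - 0.5))


def check_loss_sum(residuals, tau):
    return sum(r * (tau - (1.0 if r < 0 else 0.0)) for r in residuals)


def min_loss_line(xs, zs, tau):
    """Smallest check loss over lines through two observations (one covariate)."""
    best = math.inf
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue
        b1 = (zs[j] - zs[i]) / (xs[j] - xs[i])
        b0 = zs[i] - b1 * xs[i]
        best = min(best, check_loss_sum([z - b0 - b1 * x for x, z in zip(xs, zs)], tau))
    return best


def min_loss_constant(zs, tau):
    """Smallest check loss of a constant-only model (attained at an observation)."""
    return min(check_loss_sum([z - q for z in zs], tau) for q in zs)


def pseudo_r(xs, ys, tau, cap_pi):
    """Assumed pseudo-R measure on the link scale for a given calibration value."""
    zs = [link(y, cap_pi) for y in ys]
    return 1.0 - min_loss_line(xs, zs, tau) / min_loss_constant(zs, tau)


# Hypothetical data generated with a calibration value of 2.5.
rng = random.Random(1234)
n = 40
xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
ys = [min(1.0, max(0.0, 0.5 + math.atan(-0.5 + 0.8 * x + rng.gauss(0.0, 0.5)) / 2.5))
      for x in xs]

grid = [0.1 + k * (3.0 - 0.1) / 19 for k in range(20)]   # values inside (0, pi)
scores = [pseudo_r(xs, ys, 0.5, cap_pi) for cap_pi in grid]
best_pi = grid[scores.index(max(scores))]
print(best_pi, max(scores))
```

Because the response is transformed differently for each Π, the measure compares fits on different scales. This is one possible reading of the selection procedure, and a criterion computed on the original scale would be an equally plausible alternative.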

Applications
This section illustrates the new fractile regression model using two datasets. The first dataset is presented in [33] and is related to household expenditure on food. These data are well known and were used in [34]. The second dataset is associated with the socioeconomic variables of 138 countries, and it was obtained from "The Quality of Government Basic Dataset", Jan15 version (University of Gothenburg: The Quality of Government Institute, http://www.qog.pol.gu.se, accessed on 18 August 2023). With this dataset, we model the democratization index.

Application 1
In this application, we utilize the dataset presented in [33] which pertains to household expenditure on food. The response variable, denoted as Y, represents the proportion of income that a family spends on food.
To explain Y, we use two explanatory variables: the family income (X 1 ) and the number of people in the family (X 2 ). The dataset consists of n = 38 observations, and it can be obtained from the betareg package of the R software [35,36], available on CRAN (https://www.r-project.org/, accessed on 18 August 2023).
Consider the fractile regression model formulated as

$$Q_\tau(G(Y_i) \mid x_i) = \beta_0(\tau) + \beta_1(\tau) x_{i1} + \beta_2(\tau) x_{i2}, \quad i \in \{1, \ldots, 38\}.$$

We estimate three models, denoted by M1, M2, and M3, utilizing different link functions. Model M1 uses the link function stated in (5), model M2 employs the link function defined in (6), and model M3 applies the link function established in (7). Using the methodology described in Subsection 2.6, we employ Π = 2.1061 to estimate model M3. The performance of this estimation, reflected by the pseudo-R 2 values, is shown in Figure 4. Table 3 reports the estimates of models M1, M2, and M3 for the 50th fractile.

Observing Table 3, note that the estimated parameters are influenced by the choice of the link function. For model M1, only one of the estimated parameters is statistically significant (β 1 ). For model M2, the estimates for β 0 and β 1 are statistically significant, whereas all the estimated parameters for model M3 are statistically significant. Evaluating the goodness of fit using a pseudo-R 2 as a measure, model M1 exhibits the poorest fit, while models M2 and M3 have similar fits, with model M3 having a slight advantage. It is also noticed that the statistical significance of the parameters varies according to the chosen link function. The variation in the estimated parameters β 0 , β 1 , and β 2 with respect to τ for model M3 is depicted in Figure 5.

Application 2
In this application, we employ data from "The Quality of Government Basic Dataset" of the University of Gothenburg. These data are from n = 138 countries in the year 2010, where the response variable is a democratization index (Y), which can take values in [0, 1]. The covariates are real gross domestic product per capita in thousands of dollars (X 1 ), average schooling (in years) of people aged 25 years or more (X 2 ), and press freedom (X 3 ). Note that X 3 is an index that takes values between zero and one, with a lower value indicating greater press freedom, while a higher value indicates limited press freedom.
Consider the fractile regression model of Section 2.2 with link function G and covariates X 1 , X 2 , and X 3 . We use the expressions stated in (5), (6), and (7) as link functions. Then, we estimate three models denoted by M4, M5, and M6. Once again, utilizing the methodology described in Subsection 2.6, we use Π = 1.4722 to estimate model M6, as shown in Figure 6. The estimation results of these models for the median (with their standard error in parentheses) and their corresponding pseudo-R 2 are reported in Table 4.

Note that the results of the estimates differ when different link functions are employed. Estimates for the parameters of model M4 indicate that only the values associated with X 2 and X 3 are statistically significant at a 10% level. For models M5 and M6, all the estimates were found to be statistically significant at 10%. Taking the pseudo-R 2 as a measure of the goodness of fit, it is observed that models M4 and M5 have a similar fit, while model M6 shows a better fit than models M4 and M5. Furthermore, it is once again evident that the statistical significance of the parameters varies depending on the chosen link function. Figure 7 illustrates the parameter estimates of model M6 across different fractiles. Observe that the estimated parameter associated with press freedom, β 3 (τ), has a low variation between the lower and upper fractiles, demonstrating a consistent influence on democracies.

Concluding Remarks
In conclusion, this article has introduced an alternative approach to modeling data contained in the unit interval by extending the standard fractile regression model. We have addressed two criticisms of the methodology proposed in [24], when applied to data within the unit interval. We have also proposed alternative link functions to overcome these criticisms.
Our new approach has offered several advantages. First, it has allowed for direct interpretation of the model estimates in terms of the response variable, providing meaningful insights into the relationship between the covariates and the response. Second, our extensive simulation studies have shown that the approach demonstrated robustness when using different link functions, ensuring the stability and reliability of the estimated fractile regression coefficients even in simplified scenarios.
Another strength of the introduced approach is its independence from assumptions about the distribution of the response variable. This flexibility makes our approach applicable to a wide range of data scenarios, where the underlying distribution may be unknown or may deviate from traditional parametric assumptions. Furthermore, the introduced approach addresses criticisms of the existing methods, offering a more robust and interpretable framework for data modeling within the unit interval.
The applications of the introduced approach to two real datasets have demonstrated its effectiveness and provided valuable insights into the modeling of household expenditure on food and democratization. The estimated models have yielded statistically significant coefficients and satisfactory goodness-of-fit measures, demonstrating the practical applicability of our new approach.
While our approach offers advantages, it may vary in efficacy depending on data characteristics. Its flexibility might not always be optimal for data with specific distributions, and there could be scenarios in which alternative methods might be preferable.
This investigation has focused on the development and illustration of the introduced approach. However, it is important to acknowledge that there are other existing approaches and methodologies for modeling data supported within limited intervals. Future research could involve comparative studies, where the introduced approach is compared with alternative methods specifically designed for modeling data within the unit interval. Such comparisons would provide a comprehensive evaluation of the performance and advantages of our approach, helping researchers select the most suitable method for their specific applications.
By recognizing the potential for comparisons and leveraging existing research in the field, we hope to contribute to the ongoing exploration and refinement of modeling techniques for bounded responses. The introduced approach, along with future comparative studies, can further enhance our understanding and ability to model data within the unit interval accurately.