Article

Modelling Interval Data with Random Intercepts: A Beta Regression Approach for Clustered and Longitudinal Structures

by Olga Usuga-Manco 1,*, Freddy Hernández-Barajas 2 and Viviana Giampaoli 3

1 Departamento de Ingeniería Industrial, Universidad de Antioquia, Medellín 050010, Colombia
2 Departamento de Estadística, Universidad Nacional de Colombia sede Medellín, Medellín 050034, Colombia
3 Departamento de Estatística, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo 05508-090, Brazil
* Author to whom correspondence should be addressed.
Modelling 2025, 6(4), 128; https://doi.org/10.3390/modelling6040128
Submission received: 30 August 2025 / Revised: 7 October 2025 / Accepted: 10 October 2025 / Published: 14 October 2025

Abstract

Beta regression models are a class of models frequently used for response variables in the interval (0, 1). Although these models have been used to model clustered and longitudinal data, the prediction of random effects has been limited, and residual analysis has not been implemented. In this paper, a random intercept beta regression model is proposed for the complete analysis of this type of data structure. We propose several types of residuals and formulate a methodology to obtain the best prediction of the random effects. The model is developed through a parameterisation of the beta distribution in terms of the mean and dispersion parameters. The log-likelihood function is approximated by Gauss–Hermite quadrature to numerically integrate over the distribution of the random intercepts. A simulation study is used to investigate the performance of the estimation process and the sampling distributions of the residuals.

1. Introduction

Beta regression models are used to model variables that assume values in the interval (0, 1), e.g., rates, proportions, or percentages. Because this type of data appears so frequently, these models have received the attention of many researchers. Paolino [1], Kieschnick and McCullough [2], and Smithson and Verkuilen [3] proposed beta regression models with applications in political science, market shares, and psychology. Ferrari and Cribari-Neto [4] defined a parameterisation of the beta distribution in terms of the mean and a precision parameter and proposed a beta regression model with a regression structure for the mean.
Extensions of the regression model proposed by Ferrari and Cribari-Neto [4] have been developed. Venezuela [5] developed a generalised estimating equation approach to analyse longitudinal data by considering marginal beta regression models. Simas et al. [6] proposed a regression structure for the precision parameter and a nonlinear regression structure for the mean. Ferrari and Pinheiro [7] considered a regression structure for the precision parameter and performed small-sample inference for beta regression models.
Cribari-Neto and Souza [8] performed testing inference in beta regression models based on a second parameterisation of the beta distribution that considers the mean and dispersion parameters. Rigby and Stasinopoulos [9] incorporated this parameterisation into the framework of generalised additive models for location, scale, and shape (GAMLSS).
Some more recent theoretical works include those by Karlsson et al. [10], Abonazel et al. [11], and Algamal and Abonazel [12], in which different estimators are proposed to address the problem of multicollinearity in the regression model. In applied publications, the work of Mullen et al. [13] illustrates the use of these models in solar radiation prediction; Douma and Weedon [14] highlight their use in ecology and evolution; Geissinger et al. [15] emphasise their application in the natural sciences; Abonazel et al. [16] present a comparative study in the medical sciences; and Cribari-Neto [17] presents a beta regression analysis of COVID-19 mortality in Brazil.
A mixed beta regression model provides a class of models for the analysis of clustered and longitudinal data with beta-distributed responses. Early contributions to this class of models include a logit random intercept beta regression model implemented by Rigby and Stasinopoulos [9] in the R gamlss package, a logit random intercept beta regression model via Template Model Builder (TMB) implemented by Brooks et al. [18] in the R glmmTMB package, and a mixed beta regression model with logit and log link functions, a special case of which is the finite mixture model proposed by Verkuilen and Smithson [19].
The main goal of this article is to propose a random intercept beta regression model using a parameterisation defined in terms of the mean and dispersion parameters, considering different link functions and using an estimation method different from that of the GAMLSS framework. In this model, the parameters related to the fixed and random effects are estimated jointly by maximum likelihood. The advantages of using a random intercept beta regression model instead of the usual beta regression model are that introducing random effects into the mean describes the heterogeneity of the means between clusters, while introducing random effects into the dispersion describes the heterogeneity of the dispersions between clusters. According to Lee, Nelder, and Pawitan [20], introducing random effects in the dispersion can also describe abrupt changes among repeated measures.
With the proposed model, we can also analyse hierarchical datasets, such as clustered data, repeated measurements, and longitudinal data. We further develop predictions of the random effects and propose residuals for the proposed model. An application is provided using longitudinal data from a prospective ophthalmology study described by Meyers et al. [21], in which the percentage of gas remaining in the eye of each patient was recorded on different follow-up days.
The rest of this paper unfolds as follows. In Section 2, we define the random intercept beta regression model and present the associated log-likelihood function, together with closed-form expressions for the score function and the observed information matrix. In Section 3, we define the mean, variance, and covariance of the marginal distribution of the beta random variables and use them to build the marginal residual. In Section 4, we present the prediction of the random effects used to build the proposed residuals. In Section 5, we describe randomised quantile, conditional, and marginal residuals to detect outlying observations, as well as standardised random effect estimates to detect outlying clusters. The results of the simulations regarding the performance of the estimation method and the residuals are presented in Section 6. In Section 7, we analyse a real dataset with the proposed model. Finally, concluding remarks are presented in Section 8. Some technical details are collected in the Supplementary Material.

2. Random Intercept Beta Regression Model

In this section, we present two different parameterisations of the beta distribution, the random intercept model, the likelihood inference, and the estimation procedure.

2.1. Beta Distribution

Suppose Y is a response variable that follows a beta distribution. Ferrari and Cribari-Neto [4] proposed a parameterisation of its density indexed by the mean μ and precision parameter ϕ . The parameterisation is given by
$$
f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi - 1} (1 - y)^{(1-\mu)\phi - 1}, \qquad (1)
$$
where 0 < y < 1, 0 < μ < 1, and ϕ > 0. The mean and variance of Y are given, respectively, by E(Y) = μ and Var(Y) = μ(1 − μ)/(1 + ϕ).
Alternatively, the beta distribution can be parameterised in terms of the mean μ and a dispersion parameter σ:
$$
f(y; \mu, \sigma) = \frac{\Gamma\!\left(\frac{1-\sigma^2}{\sigma^2}\right)}{\Gamma\!\left(\mu\,\frac{1-\sigma^2}{\sigma^2}\right)\Gamma\!\left((1-\mu)\,\frac{1-\sigma^2}{\sigma^2}\right)}\, y^{\mu\frac{1-\sigma^2}{\sigma^2} - 1} (1 - y)^{(1-\mu)\frac{1-\sigma^2}{\sigma^2} - 1}, \qquad (2)
$$
with 0 < y < 1, 0 < μ < 1, and 0 < σ < 1. In this parameterisation, the mean of Y is E(Y) = μ, and the variance of Y is Var(Y) = σ²μ(1 − μ).
The relationship between the precision parameter in (1) and the dispersion parameter in (2) is ϕ = (1 − σ²)/σ².
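This relationship makes it straightforward to evaluate density (2) with standard software. The following R sketch (an illustration added here, not part of the original paper; the function name dbeta_musigma is an assumption) evaluates the mean/dispersion density through the usual dbeta() function and checks the moment formulas by simulation:

dbeta_musigma <- function(y, mu, sigma) {
  phi <- (1 - sigma^2) / sigma^2            # precision implied by the dispersion
  dbeta(y, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}

# Quick check of E(Y) = mu and Var(Y) = sigma^2 * mu * (1 - mu) by simulation
mu <- 0.3; sigma <- 0.4
phi <- (1 - sigma^2) / sigma^2
y <- rbeta(1e5, mu * phi, (1 - mu) * phi)
c(mean(y), mu)                              # both close to 0.3
c(var(y), sigma^2 * mu * (1 - mu))          # both close to 0.0336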

2.2. Random Intercept Model

Without loss of generality, we assume that the response takes values in (0, 1). If the response variable is instead constrained to an interval (a, b), where a and b are known scalars with a < b, we model (y − a)/(b − a) instead of y.
Let y_ij, i = 1, 2, …, N, j = 1, 2, …, n_i, denote the j-th observed measurement on the i-th cluster, and let t_ij be the corresponding time at which the measurement is taken. In the beta random intercept model, it is assumed that the conditional distribution of y_ij given γ_i = (γ_i1, γ_i2)^T follows a beta distribution with the density given by Equation (2). Given γ_i, the repeated measurements y_i1, y_i2, …, y_in_i are independent. We assume the following model:
$$
y_{ij} \mid \gamma_{i1}, \gamma_{i2} \;\overset{\text{ind}}{\sim}\; \mathrm{Be}(\mu_{ij}, \sigma_{ij}), \qquad
g_1(\mu_{ij}) = \eta_{ij1} = \mathbf{x}_{ij1}^{\top}\boldsymbol{\beta}_1 + \gamma_{i1}, \qquad
g_2(\sigma_{ij}) = \eta_{ij2} = \mathbf{x}_{ij2}^{\top}\boldsymbol{\beta}_2 + \gamma_{i2}, \qquad (3)
$$
where x_ij1 = (x_ij11, x_ij21, …, x_ij p_1 1)^T and x_ij2 = (x_ij12, x_ij22, …, x_ij p_2 2)^T contain values of explanatory variables, possibly defined in terms of t_ij. The vectors β_1 = (β_11, β_21, …, β_p_1 1)^T and β_2 = (β_12, β_22, …, β_p_2 2)^T are fixed parameters that do not depend on time. The random intercepts γ_i1 and γ_i2 are shared among measurements within the same cluster. The link functions g_1 : (0, 1) → ℝ and g_2 : (0, 1) → ℝ are strictly monotonic and twice differentiable. The same or different link functions may be used for the mean and the dispersion parameter, e.g., logit, probit, clog-log, log-log, or cauchit. For a discussion of these link functions, see McCullagh and Nelder [22].
The random intercepts γ_i1 and γ_i2 are independent and identically distributed normal random variables:
$$
\gamma_{i1} \overset{\text{i.i.d.}}{\sim} N(0, \lambda_1^2), \qquad \gamma_{i2} \overset{\text{i.i.d.}}{\sim} N(0, \lambda_2^2), \qquad (4)
$$
where λ_1 and λ_2 are the standard deviations of the random intercepts. Larger estimates of these parameters imply greater differences between the groups. In the particular case where λ_1 = 0, the mean of the response variable is modelled without a random intercept.

2.3. Likelihood Inference

Let f(y_ij | γ_i1, γ_i2; β_1, β_2) denote the probability density function of y_ij given γ_i1 and γ_i2, and let f(γ_i1; λ_1) and f(γ_i2; λ_2) denote the probability density functions of γ_i1 and γ_i2, respectively. The parameter vector is θ = (β_1^T, β_2^T, λ_1, λ_2)^T. The marginal distribution of y_i = (y_i1, y_i2, …, y_in_i)^T is given by
$$
f(\mathbf{y}_i; \boldsymbol{\theta}) = \int_{\mathbb{R}^2} \prod_{j=1}^{n_i} f(y_{ij} \mid \gamma_{i1}, \gamma_{i2}; \boldsymbol{\beta}_1, \boldsymbol{\beta}_2)\, f(\gamma_{i1}; \lambda_1)\, f(\gamma_{i2}; \lambda_2)\, d\gamma_{i1}\, d\gamma_{i2}, \qquad (5)
$$
and the likelihood function of θ given the observed data y = (y_1, y_2, …, y_N)^T is
$$
L(\boldsymbol{\theta}) = \prod_{i=1}^{N} \int_{\mathbb{R}^2} \prod_{j=1}^{n_i} f(y_{ij} \mid \gamma_{i1}, \gamma_{i2}; \boldsymbol{\beta}_1, \boldsymbol{\beta}_2)\, f(\gamma_{i1}; \lambda_1)\, f(\gamma_{i2}; \lambda_2)\, d\gamma_{i1}\, d\gamma_{i2}. \qquad (6)
$$
Unlike a Gaussian linear model, the marginal distribution (5) and the likelihood function (6) do not have closed-form expressions. The main difficulty of likelihood inference for a random intercept beta regression model is the evaluation of intractable integrals in likelihood function (6). Common approaches include Monte Carlo integration, EM algorithms, and approximation methods (see Wu [23]). In this work, we use multivariate Gauss–Hermite quadrature to approximate the integrals. Thus, the likelihood function is
$$
L(\boldsymbol{\theta}) = \prod_{i=1}^{N} \sum_{k_1=1}^{Q_1} \sum_{k_2=1}^{Q_2} \prod_{j=1}^{n_i} f\!\left(y_{ij} \mid \sqrt{2}\,\lambda_1 z_{k_1}, \sqrt{2}\,\lambda_2 z_{k_2}; \boldsymbol{\beta}_1, \boldsymbol{\beta}_2\right) \frac{w_{k_1} w_{k_2}}{\pi},
$$
and the log-likelihood function can be written as
$$
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log\left\{ \sum_{k_1=1}^{Q_1} \sum_{k_2=1}^{Q_2} \prod_{j=1}^{n_i} f\!\left(y_{ij} \mid \sqrt{2}\,\lambda_1 z_{k_1}, \sqrt{2}\,\lambda_2 z_{k_2}; \boldsymbol{\beta}_1, \boldsymbol{\beta}_2\right) \frac{w_{k_1} w_{k_2}}{\pi} \right\}, \qquad (7)
$$
where Q_1 and Q_2 denote the numbers of quadrature points, z_k1 and z_k2 are the quadrature points, and w_k1 and w_k2 are the corresponding quadrature weights. For more details on multivariate Gauss–Hermite quadrature, see Fahrmeir and Tutz [24].
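For readers who wish to reproduce the approximation, the following R sketch (our illustration, not the authors' implementation; the data objects y, x1, id and the function name loglik_gh are assumptions, and the model is reduced to a single mean covariate and an intercept-only dispersion, both with logit links) implements the quadrature sum in (7). The nodes and weights can be obtained, for example, from gauss.quad() in the statmod package; the paper itself uses the glmmML package for this purpose.

library(statmod)                             # gauss.quad() for Hermite nodes/weights

dbeta_musigma <- function(y, mu, sigma) {
  phi <- (1 - sigma^2) / sigma^2
  dbeta(y, mu * phi, (1 - mu) * phi)
}

# theta = (beta11, beta21, beta12, log(lambda1), log(lambda2));
# the log scale keeps the standard deviations positive during optimisation.
loglik_gh <- function(theta, y, x1, id, Q = 8) {
  gh <- gauss.quad(Q, kind = "hermite")      # nodes z_k and weights w_k
  b1 <- theta[1:2]; b2 <- theta[3]
  l1 <- exp(theta[4]); l2 <- exp(theta[5])
  ll <- 0
  for (i in unique(id)) {
    yi <- y[id == i]; xi <- x1[id == i]
    Li <- 0
    for (k1 in seq_len(Q)) for (k2 in seq_len(Q)) {
      g1 <- sqrt(2) * l1 * gh$nodes[k1]      # value of gamma_i1 at this node
      g2 <- sqrt(2) * l2 * gh$nodes[k2]      # value of gamma_i2 at this node
      mu    <- plogis(b1[1] + b1[2] * xi + g1)
      sigma <- plogis(b2 + g2)
      Li <- Li + prod(dbeta_musigma(yi, mu, sigma)) *
        gh$weights[k1] * gh$weights[k2] / pi
    }
    ll <- ll + log(Li)
  }
  ll
}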

2.4. Estimating Procedure

The maximum likelihood estimators (MLEs) of θ = (β_1^T, β_2^T, λ_1, λ_2)^T are obtained as the solutions of the nonlinear system U(θ) = 0, where U(θ) denotes the p-dimensional gradient of ℓ(θ), with p = p_1 + p_2 + 2. The score function is U(θ) = (U_β1^T(θ), U_β2^T(θ), U_λ1(θ), U_λ2(θ))^T, where the vectors U_β1(θ) and U_β2(θ) and the quantities U_λ1(θ) and U_λ2(θ) are defined in the Supplementary Material.
The maximum likelihood estimators do not have closed forms and must be computed by numerically maximising the log-likelihood function (7) using a nonlinear optimisation algorithm, e.g., Newton’s method or a quasi-Newton algorithm. As initial values for β 1 T and β 2 T , we suggest β ^ 1 T and β ^ 2 T , the estimates from the beta regression model without random intercepts, i.e.,  
$$
y_{ij} \overset{\text{ind}}{\sim} \mathrm{Be}(\mu_{ij}, \sigma_{ij}), \qquad g_1(\mu_{ij}) = \mathbf{x}_{ij1}^{\top}\boldsymbol{\beta}_1, \qquad g_2(\sigma_{ij}) = \mathbf{x}_{ij2}^{\top}\boldsymbol{\beta}_2.
$$
These models can be fitted in R, for example, using the packages gamlss or betareg, developed by Rigby and Stasinopoulos [9] and Cribari-Neto and Zeileis [25], respectively. For the parameters λ 1 and λ 2 , the standard deviations of random intercepts, we suggest starting with initial values of 0, which corresponds to the usual beta regression model.
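As an illustration of this strategy (our sketch, not the authors' code; it reuses the assumed data objects y, x1, id and the loglik_gh() function from the sketch in Section 2.3), the starting values can be taken from a gamlss fit with the BE family, which uses the same mean/dispersion parameterisation, and the quadrature log-likelihood can then be maximised with nlminb(). Because this sketch works with the standard deviations on the log scale, they start near zero rather than exactly at zero.

library(gamlss)

fit0  <- gamlss(y ~ x1, sigma.formula = ~ 1, family = BE)
start <- c(coef(fit0, what = "mu"),          # starting values for beta_1
           coef(fit0, what = "sigma"),       # starting values for beta_2
           log(1e-3), log(1e-3))             # lambda_1, lambda_2 near zero (log scale)

opt <- nlminb(start, function(th) -loglik_gh(th, y, x1, id, Q = 8))
opt$par                                      # maximum likelihood estimates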
To obtain the covariance matrix of θ ^ , the ( p × p ) observed information matrix is required, which can be written as
$$
J(\boldsymbol{\theta}) =
\begin{pmatrix}
J_{\beta_1\beta_1} & J_{\beta_1\beta_2} & J_{\beta_1\lambda_1} & J_{\beta_1\lambda_2} \\
J_{\beta_2\beta_1} & J_{\beta_2\beta_2} & J_{\beta_2\lambda_1} & J_{\beta_2\lambda_2} \\
J_{\lambda_1\beta_1} & J_{\lambda_1\beta_2} & J_{\lambda_1\lambda_1} & J_{\lambda_1\lambda_2} \\
J_{\lambda_2\beta_1} & J_{\lambda_2\beta_2} & J_{\lambda_2\lambda_1} & J_{\lambda_2\lambda_2}
\end{pmatrix},
$$
where the elements of J ( θ ) are defined in the Supplementary Material.
Under the usual regularity conditions for maximum likelihood estimation, the MLE θ̂ of θ satisfies
$$
\sqrt{n}\,(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \xrightarrow{\;D\;} N_{p_1 + p_2 + 2}\!\left(\mathbf{0},\, J(\boldsymbol{\theta})^{-1}\right),
$$
where D denotes convergence in distribution.
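In practice, the observed information can also be approximated numerically. A minimal R sketch (an assumption on our part, reusing opt and loglik_gh() from the earlier sketches; under that parameterisation the standard errors of the last two components are on the log-lambda scale):

library(numDeriv)

negll <- function(th) -loglik_gh(th, y, x1, id, Q = 8)
J_hat <- numDeriv::hessian(negll, opt$par)   # numerical observed information
se    <- sqrt(diag(solve(J_hat)))            # asymptotic standard errors
cbind(estimate = opt$par, std.error = se)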

3. Moments

The moments of the marginal distribution of y in model (3) are given by
$$
\mathrm{E}(y_{ij}) = \mathrm{E}(\mu_{ij}), \qquad
\mathrm{Var}(y_{ij}) = \mathrm{E}(\mu_{ij}^2) - \left[\mathrm{E}(\mu_{ij})\right]^2 + \mathrm{E}(\sigma_{ij}^2)\,\mathrm{E}\!\left[\mu_{ij}(1 - \mu_{ij})\right], \qquad
\mathrm{Cov}(y_{ij}, y_{ij'}) = \mathrm{E}(\mu_{ij}\,\mu_{ij'}) - \mathrm{E}(\mu_{ij})\,\mathrm{E}(\mu_{ij'}), \quad j \neq j',
$$
where the expectations involving μ i j and σ i j are taken with respect to the distributions of the random intercepts γ i 1 and γ i 2 . Note that the results depend on the link functions used for μ and σ . To obtain these moments, the required integrals must be solved using numerical integration methods.
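For logit links, a minimal R sketch of this numerical integration (our illustration; eta1 and eta2 denote the fixed parts of the linear predictors and are assumptions) is:

marg_moments <- function(eta1, eta2, lambda1, lambda2) {
  E1 <- function(h)                          # expectation of h(mu_ij) over gamma_i1
    integrate(function(g) h(plogis(eta1 + g)) * dnorm(g, sd = lambda1),
              -Inf, Inf)$value
  E2 <- function(h)                          # expectation of h(sigma_ij) over gamma_i2
    integrate(function(g) h(plogis(eta2 + g)) * dnorm(g, sd = lambda2),
              -Inf, Inf)$value
  m <- E1(identity)                          # E(y_ij)
  v <- E1(function(u) u^2) - m^2 +
       E2(function(u) u^2) * E1(function(u) u * (1 - u))
  c(mean = m, var = v)
}

marg_moments(eta1 = 0.5, eta2 = -1, lambda1 = 0.5, lambda2 = 0.5)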

4. Prediction of Random Effects

In practice, we are usually primarily interested in estimating the fixed-effect parameters, but it is often useful to obtain predictions of the random effects γ i 1 and γ i 2 as well. Because the random effects reflect between-cluster variability, they are particularly useful for detecting unusual response profiles or groups of clusters whose profiles evolve differently over time. Moreover, estimates of the random effects are needed whenever cluster-specific predictions are of interest (Fitzmaurice et al. [26]). We propose using an empirical Bayes method to obtain the best predictor (BP) of the random intercepts, which takes the following form:
$$
\tilde{\gamma}_{i1} = \mathrm{E}[\gamma_{i1} \mid \mathbf{y}_i; \hat{\boldsymbol{\theta}}] =
\frac{\displaystyle\int_{\mathbb{R}^2} \gamma_{i1} \prod_{j=1}^{n_i} f(y_{ij} \mid \gamma_{i1}, \gamma_{i2}; \hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2)\, f(\gamma_{i1}; \hat{\lambda}_1)\, f(\gamma_{i2}; \hat{\lambda}_2)\, d\gamma_{i1}\, d\gamma_{i2}}
{\displaystyle\int_{\mathbb{R}^2} \prod_{j=1}^{n_i} f(y_{ij} \mid \gamma_{i1}, \gamma_{i2}; \hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2)\, f(\gamma_{i1}; \hat{\lambda}_1)\, f(\gamma_{i2}; \hat{\lambda}_2)\, d\gamma_{i1}\, d\gamma_{i2}},
$$
$$
\tilde{\gamma}_{i2} = \mathrm{E}[\gamma_{i2} \mid \mathbf{y}_i; \hat{\boldsymbol{\theta}}] =
\frac{\displaystyle\int_{\mathbb{R}^2} \gamma_{i2} \prod_{j=1}^{n_i} f(y_{ij} \mid \gamma_{i1}, \gamma_{i2}; \hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2)\, f(\gamma_{i1}; \hat{\lambda}_1)\, f(\gamma_{i2}; \hat{\lambda}_2)\, d\gamma_{i1}\, d\gamma_{i2}}
{\displaystyle\int_{\mathbb{R}^2} \prod_{j=1}^{n_i} f(y_{ij} \mid \gamma_{i1}, \gamma_{i2}; \hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2)\, f(\gamma_{i1}; \hat{\lambda}_1)\, f(\gamma_{i2}; \hat{\lambda}_2)\, d\gamma_{i1}\, d\gamma_{i2}},
$$
where i = 1, 2, …, N and j = 1, 2, …, n_i.
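A minimal R sketch of this best predictor for γ_i1 of a single cluster (our illustration, again with one mean covariate, an intercept-only dispersion, and logit links; yi, xi, b1, b2, l1, l2 denote the cluster data and the estimated parameters and are assumptions):

library(statmod)

bp_gamma1 <- function(yi, xi, b1, b2, l1, l2, Q = 8) {
  gh <- gauss.quad(Q, kind = "hermite")
  num <- den <- 0
  for (k1 in seq_len(Q)) for (k2 in seq_len(Q)) {
    g1 <- sqrt(2) * l1 * gh$nodes[k1]
    g2 <- sqrt(2) * l2 * gh$nodes[k2]
    mu    <- plogis(b1[1] + b1[2] * xi + g1)
    sigma <- plogis(b2 + g2)
    phi   <- (1 - sigma^2) / sigma^2
    f <- prod(dbeta(yi, mu * phi, (1 - mu) * phi)) *
         gh$weights[k1] * gh$weights[k2]
    num <- num + g1 * f                      # quadrature sum for the numerator
    den <- den + f                           # quadrature sum for the denominator
  }
  num / den                                  # empirical Bayes prediction of gamma_i1
}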

5. Residual Analysis

Authors such as Hilden-Minton [27], Verbeke and Lesaffre [28], Pinheiro and Bates [29], and Nobre and Singer [30] have presented different types of residuals that accommodate the additional sources of variability present in linear mixed models. To study departures from the model assumptions, outlying observations, and outlying clusters, we adopt three types of residuals that account for the extra variability in the proposed model: randomised quantile residuals (see Dunn and Smyth [31]), standardised conditional residuals, and standardised marginal residuals. In addition, we define standardised random effect estimates for μ and σ to assess the distribution of the random effects and to study the presence of outlying clusters.
To assess the overall adequacy of the random intercept beta regression model for the data, we propose the randomised quantile residual, which is given by
$$
r_{q_{ij}} = \Phi^{-1}\!\left\{ F(y_{ij}; \hat{\mu}_{ij}, \hat{\sigma}_{ij}) \right\},
$$
where Φ(·) denotes the cumulative distribution function of the standard normal distribution, and F(y_ij; μ̂_ij, σ̂_ij) denotes the cumulative distribution function of Be(μ̂_ij, σ̂_ij).
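Given fitted values μ̂_ij and σ̂_ij, this residual is simple to compute; a minimal R sketch (assumed inputs, not the authors' code):

rq_residual <- function(y, mu_hat, sigma_hat) {
  phi_hat <- (1 - sigma_hat^2) / sigma_hat^2
  qnorm(pbeta(y, mu_hat * phi_hat, (1 - mu_hat) * phi_hat))
}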
To study outlying observations, we consider the standardised conditional and marginal residuals. The standardised conditional residual is defined as
$$
r_{c_{ij}} = \frac{y_{ij} - \hat{\mathrm{E}}(y_{ij} \mid \gamma_{i1}, \gamma_{i2})}{\sqrt{\widehat{\mathrm{Var}}(y_{ij} \mid \gamma_{i1}, \gamma_{i2})}},
$$
where Ê(y_ij | γ_i1, γ_i2) = μ̂_ij and Var̂(y_ij | γ_i1, γ_i2) = σ̂_ij² μ̂_ij(1 − μ̂_ij), with μ̂_ij = g_1^{-1}(x_ij1^T β̂_1 + γ̃_i1) and σ̂_ij = g_2^{-1}(x_ij2^T β̂_2 + γ̃_i2); β̂_1 and β̂_2 are the maximum likelihood estimators of β_1 and β_2, and γ̃_i1 and γ̃_i2 denote the best predictors (BPs) of γ_i1 and γ_i2, respectively, defined in Section 4.
The standardised marginal residual is given by
$$
r_{m_{ij}} = \frac{y_{ij} - \hat{\mathrm{E}}(y_{ij})}{\sqrt{\widehat{\mathrm{Var}}(y_{ij})}},
$$
where Ê(y_ij) = E(μ̂_ij) and Var̂(y_ij) = E(μ̂_ij²) − [E(μ̂_ij)]² + E(σ̂_ij²)·E[μ̂_ij(1 − μ̂_ij)].
Thus, when the logit link function is used for both μ and σ, the marginal mean and variance are given by
$$
\mathrm{E}(y_{ij}) = 1 - \mathrm{E}\!\left[\frac{1}{1 + a\,e^{\gamma_{i1}}}\right], \qquad
\mathrm{Var}(y_{ij}) = \mathrm{E}\!\left[\frac{1}{(1 + a\,e^{\gamma_{i1}})^{2}}\right] - \left\{\mathrm{E}\!\left[\frac{1}{1 + a\,e^{\gamma_{i1}}}\right]\right\}^{2} + a\,b^{2}\, \mathrm{E}\!\left[\frac{e^{2\gamma_{i2}}}{(1 + b\,e^{\gamma_{i2}})^{2}}\right] \mathrm{E}\!\left[\frac{e^{\gamma_{i1}}}{(1 + a\,e^{\gamma_{i1}})^{2}}\right],
$$
where a = exp(x_ij1^T β_1) and b = exp(x_ij2^T β_2). Replacing β_1, β_2, γ_i1, and γ_i2 with β̂_1, β̂_2, γ̃_i1, and γ̃_i2, we obtain Ê(y_ij) and Var̂(y_ij). Note that multivariate Gauss–Hermite quadrature is used to approximate the integrals involved both in parameter estimation and in the prediction of the random intercepts.
To study outlying clusters and assess the random effect distribution, we consider the standardised random effect estimates for μ and σ , defined as
$$
r_{r_{\mu i}} = \frac{\tilde{\gamma}_{i1}}{\hat{\lambda}_1}, \qquad
r_{r_{\sigma i}} = \frac{\tilde{\gamma}_{i2}}{\hat{\lambda}_2},
$$
with i = 1 , 2 , , N , where γ ˜ i 1 and γ ˜ i 2 are the BPs of γ i 1 and γ i 2 , and λ ^ 1 and λ ^ 2 are the maximum likelihood estimates of λ 1 and λ 2 .
Atkinson [32] and Kutner et al. [33] suggested the use of probability plots with simulated half-normal envelopes as diagnostic tools. Such plots are useful for identifying outliers and examining the adequacy of the fitted model, even when the residual distribution is unknown. Half-normal plots with a simulated envelope can be produced as follows:
  1. Fit the beta random intercept model and generate a sample of n independent observations, treating the fitted model as the true model.
  2. Fit the beta random intercept model to the generated sample and compute the ordered absolute residuals.
  3. Repeat steps (1) and (2) k times.
  4. For each of the n positions, collect the k order statistics and compute their average, minimum, and maximum values.
  5. Plot these values together with the ordered absolute residuals of the original sample against the half-normal scores Φ^{-1}((i + n − 0.125)/(2n + 0.5)), where i is the position of the order statistic, 1 ≤ i ≤ n, and n is the sample size.
The minimum and maximum values of the k order statistics at each position define the envelope. Observations with absolute residuals outside the simulated limits warrant further investigation. Moreover, if a considerable proportion of points falls outside the envelope, there is evidence against the adequacy of the fitted model. A sketch of this procedure is given below.
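A minimal R sketch of the envelope construction (our illustration; fit_model(), simulate_from(), and rq_resid() are placeholders for the user's own fitting, simulation, and residual routines):

half_normal_envelope <- function(fit, y, k = 100) {
  n    <- length(y)
  absr <- sort(abs(rq_resid(fit, y)))                  # observed ordered |residuals|
  sims <- replicate(k, {
    ystar <- simulate_from(fit)                        # step 1: simulate from the fit
    sort(abs(rq_resid(fit_model(ystar), ystar)))       # step 2: refit and order
  })
  scores <- qnorm((1:n + n - 0.125) / (2 * n + 0.5))   # half-normal scores
  plot(scores, absr, xlab = "Half-normal scores", ylab = "Ordered |residuals|")
  lines(scores, apply(sims, 1, min), lty = 2)          # lower envelope bound
  lines(scores, apply(sims, 1, max), lty = 2)          # upper envelope bound
  lines(scores, apply(sims, 1, mean))                  # simulated average
}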

6. Simulation Study

The purpose of this simulation study is twofold. First, we study the behaviour of the estimates of β_1, β_2, λ_1, and λ_2 as the number of clusters, the cluster size, and the standard deviations of the random intercepts in μ and σ vary, for clustered and longitudinal data. Second, we investigate the sampling distributions of the residuals used to identify outlying observations and clusters.
The data were generated according to the following random intercept beta regression model:
$$
y_{ij} \mid \gamma_{i1}, \gamma_{i2} \;\overset{\text{ind}}{\sim}\; \mathrm{Be}(\mu_{ij}, \sigma_{ij}), \qquad
g(\mu_{ij}) = \beta_{11} + \beta_{21} x_{1ij} + \beta_{31} x_{2ij} + \gamma_{i1}, \qquad
g(\sigma_{ij}) = \beta_{12} + \beta_{22} x_{1ij} + \beta_{32} x_{2ij} + \gamma_{i2},
$$
where i = 1, …, N, j = 1, …, n_i, g(·) is the logit function, and β_1 = (β_11, β_21, β_31)^T and β_2 = (β_12, β_22, β_32)^T are parameter vectors. The covariates are generated following simulation studies with longitudinal data, such as those by Park and Wu [34], Guoyou and Zhongyi [35], and Fu and Wang [36]. The covariate x_1ij is generated from the uniform distribution U(0, 1), with a different value for each pair (i, j). The design time points are taken between 0 and 1 as x_2ij = (j − 1)/n_i, taking the same values for each cluster. Both parameter vectors are identical, β_1 = β_2 = (0.15, 0.15, 0.15)^T. The random intercepts are generated independently of the covariates as γ_i1 ∼ N(0, λ_1²) and γ_i2 ∼ N(0, λ_2²).
We examined all combinations of five numbers of clusters, N ∈ {20, 40, 60, 100, 150}; four cluster sizes, n_i ∈ {3, 5, 8, 12}; and three standard deviations of the random effects, λ_1, λ_2 ∈ {0.5, 1.0, 1.5}. In total, there were 60 combinations, and we simulated 10,000 datasets for each of them. For the Gauss–Hermite quadrature method, Q_1 = Q_2 = 8 quadrature points were used to obtain the estimates and the predictions of the random intercepts. All analyses were conducted in R [37]; the glmmML package by Broström and Holmberg [38] was used to calculate the points and weights needed for Gauss–Hermite quadrature, and the nlminb function from the R stats package [37] was used to maximise the log-likelihood function (7).
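A minimal R sketch of the data-generating scheme (our illustration; the function name sim_data is an assumption, and the coefficient values follow the text above):

sim_data <- function(N, ni, lambda1, lambda2,
                     beta1 = c(0.15, 0.15, 0.15),
                     beta2 = c(0.15, 0.15, 0.15)) {
  id <- rep(seq_len(N), each = ni)
  x1 <- runif(N * ni)                           # U(0, 1), one value per (i, j)
  x2 <- rep((seq_len(ni) - 1) / ni, times = N)  # design times (j - 1)/n_i
  g1 <- rep(rnorm(N, 0, lambda1), each = ni)    # gamma_i1, shared within a cluster
  g2 <- rep(rnorm(N, 0, lambda2), each = ni)    # gamma_i2, shared within a cluster
  mu    <- plogis(beta1[1] + beta1[2] * x1 + beta1[3] * x2 + g1)
  sigma <- plogis(beta2[1] + beta2[2] * x1 + beta2[3] * x2 + g2)
  phi   <- (1 - sigma^2) / sigma^2
  data.frame(id, x1, x2, y = rbeta(N * ni, mu * phi, (1 - mu) * phi))
}

d <- sim_data(N = 20, ni = 3, lambda1 = 0.5, lambda2 = 0.5)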
To study the maximum likelihood estimates of θ = ( β 1 , β 2 , λ 1 , λ 2 ) T , we used the root multivariate mean squared error (RMSE) for θ ^ , following Wissel [39], which is defined as
$$
\mathrm{RMSE} = \left\{ \mathrm{trace}\!\left(\Sigma(\hat{\boldsymbol{\theta}})\right) + (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{\top}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \right\}^{1/2},
$$
where Σ(θ̂) denotes the estimated covariance matrix of θ̂.
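In R, the criterion can be computed directly from the estimates; a minimal sketch (assumed objects: Sigma_hat, an estimate of the covariance matrix of θ̂, and theta, the true parameter vector):

rmse <- function(theta_hat, theta, Sigma_hat) {
  sqrt(sum(diag(Sigma_hat)) + sum((theta_hat - theta)^2))
}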
To analyse the convergence of the optimiser used in the estimation process, we computed the convergence rate (CR), defined as
$$
\mathrm{CR} = \frac{10{,}000 \ \text{iterations}}{\text{final iterations}},
$$
where “final iterations” corresponds to the number of iterations required to complete 10,000 simulations. If the convergence rate equals one, this indicates that 10,000 iterations were sufficient. If it is less than one, more than 10,000 iterations were needed to complete the simulation study. Additional iterations occur when the algorithm does not converge because the simulated values of y are very close to zero or one due to large variances in the random effects.
Table 1 presents the RMSE across 10,000 simulated datasets for λ_1 = 0.5 and 1.5. The RMSE decreases as the cluster size n_i increases for fixed N and (λ_1, λ_2), indicating improved estimation with larger within-cluster information. For example, with λ_1 = 0.5, λ_2 = 0.5, and N = 150, the RMSE values are (0.476, 0.339, 0.252, 0.210) for n_i = 3, 5, 8, and 12, respectively. The RMSE also decreases with a larger number of clusters N, holding n_i and (λ_1, λ_2) fixed. For instance, with λ_1 = 1.5, λ_2 = 0.5, and n_i = 12, the RMSE values are (1.075, 0.771, 0.646, 0.553, 0.454) for N = 20, 40, 60, 100, and 150. Conversely, the RMSE increases with higher λ_2 for fixed values of λ_1, N, and n_i, reflecting reduced estimation accuracy under greater random-intercept variability. For example, with λ_1 = 1.5, N = 60, and n_i = 3, the RMSE values are (1.134, 1.244, 1.364) for λ_2 = 0.5, 1.0, 1.5. Exceptions arise for N = 20 and n_i = 3 with λ_1 = 0.5 and 1.5, and for N = 20 and n_i = 5 with λ_1 = 0.5. A similar pattern is observed for λ_1: the RMSE increases as λ_1 rises for fixed n_i, N, and λ_2. For instance, with λ_2 = 1.5, N = 60, and n_i = 3, the RMSE values are (1.062, 1.364) for λ_1 = 0.5 and 1.5.
Table 2 presents the convergence rate (CR) over 10,000 simulated datasets for λ 1 = 0.5 ,   1.5 and λ 2 = 0.5 ,   1.0 ,   1.5 .
To study the sampling distributions of the proposed residuals, we used the maximum likelihood estimates and the best predictors (BPs) of the random intercepts for μ and σ, computed in each replication of the study, to obtain the residuals r_q and r_c, as well as the standardised random effect estimates r_rμ and r_rσ. In addition, estimates of the mean and variance of the marginal distribution of y were used to compute the marginal residual r_m defined in Section 5.
Figure 1 presents normal probability plots for the randomised quantile ( r q ) , standardised conditional ( r c ) , and standardised marginal residual ( r m ) for N = 20 , λ 1 = 0.5 , n i = 3 , 5 , and λ 2 = 0.5 ,   1.0 ,   1.5 . Each panel includes the three residuals. The plots show that the randomised quantile residuals follow an approximately normal distribution, while the conditional and marginal residuals exhibit symmetry with heavier tails, consistent with Student’s t distribution. Similar results were obtained for n i = 8 and 12.
Figure 2 displays normal probability plots for the standardised random effect estimates r r μ and r r σ , with N = 20 and n i = 3 ,   5 . Both distributions are well approximated by the normal law. Similar plots were generated for N = 20 , λ 1 = 0.5 , and n i = 8 ,   12 , and comparable results (not shown) were also obtained for λ 1 = 1.0 ,   1.5 and N = 40 ,   60 ,   100 ,   150 .

7. Application

In this section, we analyse data from a prospective ophthalmology study reported by Meyers et al. [21], with emphasis on heterogeneous mean and dispersion. Previous analyses of these data have examined homogeneous and heterogeneous dispersion using simplex regression models [40,41]. In the study, intraocular gas (C3F8) was used in complex retinal surgeries to provide an internal tamponade for retinal breaks. The concentration levels of C3F8 administered were 25%, 20%, and 15%. The primary objective was to assess whether the concentration of injected gas affected its decay rate. The sample comprised 29 patients, each followed up between 3 and 15 times over a 3-month period. Let y_ij denote the proportion of gas volume remaining relative to the initial injected volume for patient i at the j-th visit, on follow-up day t_ij. Since the response variable lies in (0, 1), the beta regression model is an appropriate choice.
As the initial specification, we adopted the quadratic model proposed in [40]:
$$
\mathrm{logit}(\mu_{ij}) = \beta_{11} + \beta_{21}\log(t_{ij}) + \beta_{31}\log^{2}(t_{ij}) + \beta_{41} x_{ij} + \gamma_{i1}, \qquad
\mathrm{logit}(\sigma_{ij}) = \beta_{12} + \beta_{22}\log(t_{ij}) + \beta_{32}\log^{2}(t_{ij}) + \beta_{42} x_{ij} + \gamma_{i2},
$$
with i = 1, …, 29, where γ_i1 ∼ N(0, λ_1²) and γ_i2 ∼ N(0, λ_2²). Here, t_ij denotes the time covariate (days after surgery), and x_ij represents the standardised gas concentration, coded as 1 (25%), 0 (20%), or −1 (15%).
The standard beta regression model was first fitted to the data, after which nonsignificant variables were removed. The final specification for μ i j and σ i j was
$$
\mathrm{logit}(\mu_{ij}) = \beta_{11} + \beta_{21}\log(t_{ij}) + \beta_{41} x_{ij}, \qquad
\mathrm{logit}(\sigma_{ij}) = \beta_{12}. \qquad (14)
$$
Subsequently, a random-intercept beta regression model for μ was fitted, using the same linear predictors as in (14). The resulting specification for μ i j and σ i j was
$$
\mathrm{logit}(\mu_{ij}) = \beta_{11} + \beta_{21}\log(t_{ij}) + \beta_{41} x_{ij} + \gamma_{i1}, \qquad
\mathrm{logit}(\sigma_{ij}) = \beta_{12}. \qquad (15)
$$
Finally, a random-intercept beta regression model for both μ and σ was fitted, using the same linear predictors as in (14). The specification for μ i j and σ i j was
$$
\mathrm{logit}(\mu_{ij}) = \beta_{11} + \beta_{21}\log(t_{ij}) + \beta_{41} x_{ij} + \gamma_{i1}, \qquad
\mathrm{logit}(\sigma_{ij}) = \beta_{12} + \gamma_{i2}. \qquad (16)
$$
Table 3 summarises the parameter estimates and standard errors for the beta regression model (BRM), the beta regression model with a random intercept in the mean fitted via TMB (TMBBRM), and the mixed beta regression model (MBRM). The Akaike information criterion (AIC) values were −97.2325, −138.2, and −169.034 for the BRM, TMBBRM, and MBRM, respectively, leading to the selection of the MBRM as the best-fitting model. The results indicate that gas concentration is statistically significant, showing that higher concentrations are associated with slower decay of the gas volume. Moreover, λ̂_1 was greater than λ̂_2, suggesting greater variability in the random intercept of the mean.
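For comparison, a random intercept beta regression for the mean such as the TMBBRM can be fitted with glmmTMB; a minimal sketch (the variable names gas, day, conc, patient and the data frame eyes are assumptions, and glmmTMB uses the precision parameterisation with a log link for the dispersion, so its estimates are not on the same scale as model (16)):

library(glmmTMB)

fit_tmb <- glmmTMB(gas ~ log(day) + conc + (1 | patient),
                   dispformula = ~ 1,
                   family = beta_family(link = "logit"),
                   data = eyes)
summary(fit_tmb)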
Figure 3 presents the half-normal probability plot with a simulated envelope for the randomised quantile, standardised conditional, and standardised marginal residuals of the random-intercept beta regression model. As no observations fall outside the simulated envelope, we conclude that the beta regression model with random intercepts provides an adequate fit to the data.

8. Conclusions

In this paper, we proposed a random-intercept beta regression model for the analysis of clustered and longitudinal data expressed as proportions, rates, or percentages. The model accommodates heterogeneity and heteroscedasticity across clusters by incorporating random effects in both the mean and dispersion structures. Maximum likelihood estimation was carried out using Gauss–Hermite quadrature to approximate the integrals of the log-likelihood, allowing simultaneous estimation of parameters and hyperparameters and straightforward prediction of random effects. We also introduced new residuals to evaluate model adequacy and detect potential outliers at both the observation and cluster levels. Simulation studies confirmed the effectiveness of the methodology, and its application to a real dataset demonstrated its practical usefulness.

Supplementary Materials

The following supporting information can be downloaded at https://github.com/fhernanb/Paper_Modeling_Interval_Data (accessed on 9 October 2025).

Author Contributions

Conceptualisation, O.U.-M. and V.G.; methodology, O.U.-M. and V.G.; software, O.U.-M. and F.H.-B.; validation, O.U.-M., F.H.-B. and V.G.; formal analysis, O.U.-M. and V.G.; investigation, O.U.-M.; resources, V.G.; data curation, O.U.-M. and F.H.-B.; writing—original draft preparation, O.U.-M.; writing—review and editing, O.U.-M., F.H.-B. and V.G.; supervision, V.G.; project administration, V.G.; funding acquisition, V.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brazil (CAPES) and by the Conselho Nacional de Desenvolvimento Científico e Tecnológico—Brazil (CNPq).

Data Availability Statement

The dataset is available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Paolino, P. Maximum Likelihood Estimation of Models with Beta-Distributed Dependent Variables. Political Anal. 2001, 9, 325–346. [Google Scholar] [CrossRef]
  2. Kieschnick, R.; McCullough, B. Regression analysis of variates observed on (0, 1): Percentages, proportions and fractions. Stat. Model. 2003, 3, 193–213. [Google Scholar] [CrossRef]
  3. Smithson, M.; Verkuilen, J. A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables. Psychol. Methods 2006, 11, 54–71. [Google Scholar] [CrossRef] [PubMed]
  4. Ferrari, S.; Cribari-Neto, F. Beta regression for modeling rates and proportions. J. Appl. Stat. 2004, 31, 799–815. [Google Scholar] [CrossRef]
  5. Venezuela, M. Equação de Estimação Generalizada e Influência Local para Modelos de Regressão Beta com Medidas Repetidas. Ph.D. Thesis, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brasil, 2008. [Google Scholar]
  6. Simas, A.; Barreto-Souza, W.; Rocha, A. Improved estimators for a general class of beta regression models. Comput. Stat. Data Anal. 2010, 54, 348–366. [Google Scholar] [CrossRef]
  7. Ferrari, S.; Pinheiro, E. Improved likelihood inference in beta regression. J. Stat. Comput. Simul. 2011, 81, 431–443. [Google Scholar] [CrossRef]
  8. Cribari-Neto, F.; Souza, T. Testing inference in variable dispersion beta regressions. J. Stat. Comput. Simul. 2012, 82, 1827–1843. [Google Scholar] [CrossRef]
  9. Rigby, R.; Stasinopoulos, D. Generalized additive models for location, scale and shape. Appl. Stat. 2005, 54, 507–554. [Google Scholar] [CrossRef]
  10. Karlsson, P.; Månsson, K.; Kibria, B.G. A Liu estimator for the beta regression model and its application to chemical data. J. Chemom. 2020, 34, e3300. [Google Scholar] [CrossRef]
  11. Abonazel, M.R.; Dawoud, I.; Awwad, F.A.; Lukman, A.F. Dawoud–Kibria estimator for beta regression model: Simulation and application. Front. Appl. Math. Stat. 2022, 8, 775068. [Google Scholar] [CrossRef]
  12. Algamal, Z.Y.; Abonazel, M.R. Developing a Liu-type estimator in beta regression model. Concurr. Comput. Pract. Exp. 2022, 34, e6685. [Google Scholar] [CrossRef]
  13. Mullen, R.; Marshall, L.; McGlynn, B. A beta regression model for improved solar radiation predictions. J. Appl. Meteorol. Climatol. 2013, 52, 1923–1938. [Google Scholar] [CrossRef]
  14. Douma, J.C.; Weedon, J.T. Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression. Methods Ecol. Evol. 2019, 10, 1412–1430. [Google Scholar] [CrossRef]
  15. Geissinger, E.; Khoo, C.; Richmond, I.; Faulkner, S.; Schneider, D. A case for beta regression in the natural sciences. Ecosphere 2022, 13. [Google Scholar] [CrossRef]
  16. Abonazel, M.R.; Said, H.A.; Tag-Eldin, E.; Abdel-Rahman, S.; Khattab, I.G. Using beta regression modeling in medical sciences: A comparative study. Commun. Math. Biol. Neurosci. 2023, 2023, 18. [Google Scholar] [CrossRef]
  17. Cribari-Neto, F. A beta regression analysis of COVID-19 mortality in Brazil. Infect. Dis. Model. 2023, 8, 309–317. [Google Scholar] [CrossRef] [PubMed]
  18. Brooks, M.E.; Kristensen, K.; van Benthem, K.J.; Magnusson, A.; Berg, C.W.; Nielsen, A.; Skaug, H.J.; Maechler, M.; Bolker, B.M. glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. R J. 2017, 9, 378–400. [Google Scholar] [CrossRef]
  19. Verkuilen, J.; Smithson, M. Mixed and Mixture Regression Models for Continuous Bounded Responses Using the Beta Distribution. J. Educ. Behav. Stat. 2012, 37, 82–113. [Google Scholar] [CrossRef]
  20. Lee, Y.; Nelder, J.; Pawitan, Y. Generalized Linear Models with Random Effects: Unified Analysis via H-Likelihood; Chapman and Hall/CRC: Orange, CA, USA, 2017. [Google Scholar]
  21. Meyers, S.; Ambler, J.; Tan, M.; Werner, J.; Huang, S. Variation of perfluoropropane disappearance after vitrectomy. Retina 1992, 12, 359–363. [Google Scholar] [CrossRef]
  22. McCullagh, P.; Nelder, J. Generalized Linear Models, 2nd ed.; Chapman and Hall: London, UK, 1989. [Google Scholar]
  23. Wu, L. Mixed Effects Models for Complex Data; Chapman & Hall/CRC: Boca Raton, FL, USA, 2010. [Google Scholar]
  24. Fahrmeir, L.; Tutz, G. Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2001. [Google Scholar]
  25. Cribari-Neto, F.; Zeileis, A. Beta Regression in R. J. Stat. Softw. 2010, 34, 1–24. [Google Scholar] [CrossRef]
  26. Fitzmaurice, G.; Davidian, M.; Verbeke, G.; Molenberghs, G. Longitudinal Data Analysis, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009. [Google Scholar]
  27. Hilden-Minton, J. Multilevel Diagnostics for Mixed and Hierarchical Linear Models. Ph.D. Thesis, University of California, Los Angeles, CA, USA, 1995. [Google Scholar]
  28. Verbeke, G.; Lesaffre, E. A linear mixed-effects model with heterogeneity in the random-effects population. J. Am. Stat. Assoc. 1996, 91, 217–221. [Google Scholar] [CrossRef]
  29. Pinheiro, J.; Bates, D. Mixed-Effects Models in S and S-PLUS; Springer: Berlin/Heidelberg, Germany, 2000. [Google Scholar]
  30. Nobre, J.; Singer, J. Residual Analysis for Linear Mixed Models. Biom. J. 2007, 49, 863–875. [Google Scholar] [CrossRef]
  31. Dunn, P.; Smyth, G. Randomized Quantile Residuals. J. Comput. Graph. Stat. 1996, 5, 236–244. [Google Scholar] [CrossRef]
  32. Atkinson, A. Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis; Oxford University Press: New York, NY, USA, 1985. [Google Scholar]
  33. Kutner, M.; Nachtsheim, C.; Neter, J.; Li, W. Applied Linear Statistical Models, 5th ed.; McGraw-Hill: New York, NY, USA, 2005. [Google Scholar]
  34. Park, J.G.; Wu, H. Backfitting and local likelihood methods for nonparametric mixed-effects models with longitudinal data. J. Stat. Plan. Inference 2006, 136, 3760–3782. [Google Scholar] [CrossRef]
  35. Guoyou, Q.; Zhongyi, Z. Robust estimation in partial linear mixed model for longitudinal data. Acta Math. Sci. 2008, 28, 333–347. [Google Scholar] [CrossRef]
  36. Fu, L.; Wang, Y.G. Quantile regression for longitudinal data with a working correlation model. Comput. Stat. Data Anal. 2012, 56, 2526–2538. [Google Scholar] [CrossRef]
  37. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2025; ISBN 3-900051-07-0. [Google Scholar]
  38. Broström, G.; Holmberg, H. glmmML: Generalized Linear Models with Clustering; R Package Version 0.82-1; R Core Team: Vienna, Austria, 2011; Available online: https://CRAN.R-project.org/package=glmmML (accessed on 20 September 2024).
  39. Wissel, J. A New Biased Estimator for Multivariate Regression Models with Highly Collinear Variables. Ph.D. Thesis, Institut für Mathematik, Universität Würzburg, Würzburg, Germany, 2009. [Google Scholar]
  40. Song, P.X.K.; Tan, M. Marginal models for longitudinal continuous proportional data. Biometrics 2000, 56, 496–502. [Google Scholar] [CrossRef]
  41. Song, P.; Qiu, Z.; Tan, M. Modelling Heterogeneous Dispersion in Marginal Models for Longitudinal Proportional Data. Biom. J. 2004, 46, 540–553. [Google Scholar] [CrossRef]
Figure 1. Normal probability plots for randomised quantile (column 1), conditional (column 2), and marginal (column 3) residuals with N = 20, λ_1 = 0.5, n_i = 3, and n_i = 5.

Figure 2. Normal probability plots for standardised random effect estimates for μ and σ with N = 20, λ_1 = 0.5, n_i = 3, and n_i = 5.

Figure 3. Half-normal probability plot with simulated envelope for randomised quantile, standardised conditional, and standardised marginal residuals.
Table 1. RMSE of θ̂ when λ_1 = 0.5, 1.5 and λ_2 = 0.5, 1.0, 1.5.

             |           λ_1 = 0.5            |           λ_1 = 1.5
  N     n_i  |  λ_2 = 0.5  λ_2 = 1.0  λ_2 = 1.5  |  λ_2 = 0.5  λ_2 = 1.0  λ_2 = 1.5
  20     3   |    2.897      3.188      2.296    |    3.577      3.619      3.506
         5   |    1.388      1.499      1.497    |    1.811      1.822      1.868
         8   |    0.918      0.995      1.116    |    1.279      1.369      1.543
        12   |    0.706      0.807      0.950    |    1.075      1.191      1.458
  40     3   |    1.250      1.462      1.472    |    1.645      1.737      1.789
         5   |    0.781      0.860      0.954    |    1.031      1.157      1.301
         8   |    0.572      0.638      0.782    |    0.857      1.000      1.189
        12   |    0.435      0.558      0.745    |    0.771      0.928      1.181
  60     3   |    0.937      1.014      1.062    |    1.134      1.244      1.364
         5   |    0.590      0.663      0.756    |    0.816      0.903      1.092
         8   |    0.430      0.518      0.658    |    0.693      0.816      1.061
        12   |    0.347      0.443      0.656    |    0.646      0.805      1.050
 100     3   |    0.638      0.688      0.727    |    0.777      0.859      1.005
         5   |    0.432      0.459      0.570    |    0.593      0.692      0.909
         8   |    0.315      0.387      0.540    |    0.556      0.677      0.901
        12   |    0.257      0.356      0.537    |    0.553      0.675      0.895
 150     3   |    0.476      0.528      0.566    |    0.599      0.682      0.963
         5   |    0.339      0.368      0.466    |    0.491      0.641      0.861
         8   |    0.252      0.314      0.478    |    0.490      0.590      0.822
        12   |    0.210      0.295      0.407    |    0.454      0.573      0.792
Table 2. CR when λ_1 = 0.5, 1.5 and λ_2 = 0.5, 1.0, 1.5.

             |           λ_1 = 0.5            |           λ_1 = 1.5
  N     n_i  |  λ_2 = 0.5  λ_2 = 1.0  λ_2 = 1.5  |  λ_2 = 0.5  λ_2 = 1.0  λ_2 = 1.5
  20     3   |    0.997      0.849      0.903    |    0.700      0.957      0.970
         5   |    1.000      0.958      0.975    |    0.875      0.985      0.943
         8   |    1.000      0.990      0.978    |    0.996      0.979      0.893
        12   |    0.998      0.999      0.969    |    0.992      0.942      0.817
  40     3   |    0.870      0.960      0.977    |    0.869      0.983      0.950
         5   |    0.992      0.999      0.988    |    0.981      0.968      0.867
         8   |    1.000      1.000      0.970    |    0.993      0.940      0.779
        12   |    1.000      1.000      0.951    |    0.967      0.883      0.648
  60     3   |    0.953      0.986      0.988    |    0.940      0.979      0.913
         5   |    1.000      0.999      0.987    |    0.992      0.960      0.846
         8   |    1.000      1.000      0.962    |    0.985      0.917      0.686
        12   |    1.000      1.000      0.923    |    0.964      0.842      0.524
 100     3   |    0.994      0.999      0.985    |    0.987      0.973      0.855
         5   |    1.000      1.000      0.974    |    0.989      0.929      0.724
         8   |    1.000      0.998      0.937    |    0.962      0.866      0.511
        12   |    1.000      0.997      0.860    |    0.935      0.742      0.348
 150     3   |    1.000      1.000      0.991    |    0.996      0.959      0.785
         5   |    1.000      1.000      0.962    |    0.986      0.894      0.610
         8   |    1.000      0.998      0.909    |    0.964      0.814      0.398
        12   |    1.000      0.994      0.814    |    0.917      0.647      0.229
Table 3. Parameter estimates and standard errors for models (15) and (16).

 Model    |       |  β_11    β_21     β_41    β_12     λ_1     λ_2
 BRM      | Est.  |  1.911   −0.669   0.245   −0.190   —       —
          | s.e.  |  0.187    0.069   0.100    0.078   —       —
 TMBBRM   | Est.  |  2.650   −0.958   0.239   −0.187   0.807   —
          | s.e.  |  0.254    0.070   0.243    0.060   0.898   —
 MBRM     | Est.  |  2.590   −0.928   0.184   −0.817   0.909   0.411
          | s.e.  |  0.222    0.080   0.191    0.123   0.132   0.111
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
