Analyzing County-Level COVID-19 Vaccination Rates in Texas: A New Lindley Regression Model

Abstract: This work studies the factors that explain the COVID-19 vaccination rate through a generalized odd log-logistic Lindley regression model with a systematic component for the shape parameter. To accomplish this, a dataset of the vaccination rates of 254 counties in the state of Texas, US, was used, and simulations were performed to investigate the accuracy of the maximum likelihood estimators in the proposed regression model. The mathematical properties investigated provide important information about the characteristics of the distribution. Diagnostic analysis and deviance residuals are used to examine the fit of the model. The proposed model is effective in identifying the key variables associated with COVID-19 vaccination rates at the county level, which can contribute to improving vaccination campaigns. Moreover, the findings corroborate prior studies, and the new distribution is a suitable alternative model for future work on different datasets.


Introduction
The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, has had a profound impact on the world in the past few years. It has affected nearly every aspect of human life, causing significant disruptions to healthcare systems, economies, and social structures across the globe. Developments in the fight against the pandemic, chiefly vaccination, have provided a crucial tool to protect individuals and communities against the virus and to mitigate its spread.
The US government has taken significant steps to ensure vaccine availability and accessibility, including funding vaccine production, distribution, and administration. Vaccination rates have been highest among older adults and healthcare workers, but efforts are ongoing to ensure that all eligible individuals have access to the vaccine. Despite challenges such as vaccine hesitancy and supply chain issues, vaccination efforts are critical to reducing the spread of the coronavirus and protecting public health.
According to data from Our World in Data, as of November 2023 the US had administered over 676 million doses of COVID-19 vaccines, with more than 81% of the eligible population having received at least one dose and over 69% fully vaccinated (https://covid.cdc.gov/covid-data-tracker, accessed on 21 November 2023). This puts the US ahead of many other countries in terms of vaccination rates, but disparities in vaccination coverage remain among different age groups and communities. Globally, vaccination rates vary widely across countries, with some countries still struggling to acquire and distribute enough vaccines.
Consequently, the use of statistical techniques to analyze pandemic data has been widespread in the US and other countries. A comprehensive study by [1] examines the correlation between vaccination rates and social vulnerability at the county level, revealing significant disparities in vaccination coverage across counties. Despite limited data on vaccination safety and efficacy during pregnancy, a recent study by [2] found that vaccination coverage increased across all racial and ethnic groups during the study period. Other studies [3-5] revealed a correlation between determinant factors and the COVID-19 vaccination rate.
In this context, this study aims to determine the factors that explain the COVID-19 vaccination rate by constructing a new regression model based on the generalized odd log-logistic Lindley (GOLLL) distribution. Ref. [6] introduced the underlying generalized odd log-logistic family of distributions and elucidated its advantages and its applicability across various fields, highlighting its superiority over well-known generators. For example, ref. [7] proposed parametric and partially linear regression models based on the generalized odd log-logistic Birnbaum-Saunders distribution, and ref. [8] defined the generalized odd log-logistic Maxwell mixture model to analyze Chinese COVID-19 data.
This particular distribution offers advantages compared to other competing models, as elaborated in the upcoming sections. Researchers have made significant contributions to the field by introducing and studying various generalizations of the Lindley distribution. Some notable examples of these generalizations include the study of the Lomax-Lindley distribution in lifetime data [9], the perspective of the Lindley distribution on the unit interval [10], the application of the Marshall-Olkin Lindley distribution in reliability data [11], and the application of the modified-Lindley distribution in three real data sets [12].
Several studies have explored the relationships between vaccination rates and their determinants, such as demographic and socioeconomic factors and comorbidities, among others. The construction of new models that capture the complexity of the data is crucial to addressing research gaps related to COVID-19. Due to the extra shape parameters, the new distribution has great flexibility in modeling a wide range of data shapes and in linking covariates to the response variable. The novel GOLLL regression aims to be an efficient model for identifying the factors that influence vaccination and can be considered an alternative for future work to help vaccination efforts.
Therefore, the focus of this study is the analysis of the COVID-19 completed primary vaccination series at the county level within the state of Texas. The main objective is to investigate the influence of explanatory variables on the response variable, with a specific focus on examining the impact of vaccination in the US. Through this study, the goal is to make a significant contribution to the literature on this topic and provide valuable insights into the factors that influence the response variable.
The paper is organized as follows. Section 2 defines the GOLLL distribution and its main features. A linear representation and some of its mathematical properties are presented. The maximum likelihood estimation method is utilized, and some simulations examine the accuracy of the estimators. In Section 3, a new GOLLL regression model with a systematic structure for the shape parameter is constructed, and the consistency of the estimators is examined. Some measures for model checking are provided. In Section 4, an application of the proposed model to COVID-19 vaccination rate data is considered, and its performance is compared with other models. Diagnostic analysis and deviance residuals confirm that the model provides the best fit to the current data. In addition, in Section 5, the study supports its conclusions with findings that corroborate those from other studies. Future work can verify the proposed model in other scenarios (states, countries, etc.). Finally, Section 6 summarizes the key results of the study.

The Proposed Model
The recent development of new distributions from well-known ones aims to capture the underlying distribution of the data more accurately and to obtain more precise estimates of key quantities of interest.
In this context, the generalized odd log-logistic-G (GOLL-G) family, pioneered by [6], is a versatile class of continuous distributions for modeling various types of data.
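This family is based on the transformed-transformer (T-X) generator defined by [13]. Consider a baseline cdf G(x) = G(x; ξ), where ξ denotes an unknown parameter vector. The GOLL-G cdf is defined by integrating the log-logistic density function, namely
$$F(x) = \int_{0}^{\frac{G(x)^{\theta}}{1-G(x)^{\theta}}} \frac{\alpha\, t^{\alpha-1}}{(1+t^{\alpha})^{2}}\, dt = \frac{G(x)^{\alpha\theta}}{G(x)^{\alpha\theta} + [1-G(x)^{\theta}]^{\alpha}}, \tag{1}$$
where α > 0 and θ > 0 are two extra shape parameters.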
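The pdf corresponding to (1) can be expressed as
$$f(x) = \frac{\alpha\,\theta\, g(x)\, G(x)^{\alpha\theta-1}\, [1-G(x)^{\theta}]^{\alpha-1}}{\{G(x)^{\alpha\theta} + [1-G(x)^{\theta}]^{\alpha}\}^{2}}, \tag{2}$$
where g(x) = g(x; ξ) is the baseline pdf. Its hazard rate function (hrf) is easily found as
$$h(x) = \frac{\alpha\,\theta\, g(x)\, G(x)^{\alpha\theta-1}}{[1-G(x)^{\theta}]\,\{G(x)^{\alpha\theta} + [1-G(x)^{\theta}]^{\alpha}\}}.$$
These equations define some characteristics of the GOLL-G family, allowing it to effectively model a wide range of data types (skewed, bimodal, asymmetric, etc.). The parameters α and θ play an important role in shaping the distribution. In addition, Equations (1) and (2) do not involve complex mathematical functions, unlike the gamma and beta classes.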
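Table 1 reports some sub-models of Equation (1). The Lindley distribution with shape parameter λ > 0 is defined by the cumulative distribution function (cdf) and probability density function (pdf) (for x > 0)
$$G(x) = 1 - \left(1 + \frac{\lambda x}{1+\lambda}\right) e^{-\lambda x} \tag{3}$$
and
$$g(x) = \frac{\lambda^{2}}{1+\lambda}\,(1+x)\, e^{-\lambda x}, \tag{4}$$
respectively.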
The new distribution, namely the generalized odd log-logistic Lindley (GOLLL), is obtained by inserting Equation (3) in (1). The cdf of the GOLLL distribution follows as
$$F(y) = \frac{G(y)^{\alpha\theta}}{G(y)^{\alpha\theta} + [1-G(y)^{\theta}]^{\alpha}}, \qquad G(y) = 1 - \left(1 + \frac{\lambda y}{1+\lambda}\right) e^{-\lambda y}, \quad y > 0. \tag{5}$$
Let Y ∼ GOLLL(α, θ, λ) be a random variable (rv) having cdf (5). By differentiating it, the pdf of Y reduces to
$$f(y) = \frac{\alpha\,\theta\,\lambda^{2}\,(1+y)\, e^{-\lambda y}\, G(y)^{\alpha\theta-1}\, [1-G(y)^{\theta}]^{\alpha-1}}{(1+\lambda)\,\{G(y)^{\alpha\theta} + [1-G(y)^{\theta}]^{\alpha}\}^{2}}. \tag{6}$$
Three special cases of the GOLLL model are given below:
1. For θ = 1 ⟹ odd log-logistic Lindley (OLLL) model.
2. For α = 1 ⟹ exponentiated Lindley (EL) model.
3. For θ = α = 1 ⟹ Lindley model [18].
Figures 1 and 2 provide plots of the pdf and hrf of Y for selected parameters. One of the standout characteristics of the GOLLL distribution is its flexibility in generating a wide variety of hazard shapes. The shapes in Figure 2 include, but are not limited to, increasing-decreasing-increasing, inverse J-shaped, and increasing-decreasing patterns. This versatility makes the model a powerful tool for modeling complex data sets with diverse hazard rate patterns.
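As an illustration, the GOLLL cdf, pdf, and hrf can be evaluated directly once the Lindley baseline is coded. The sketch below is written in Python rather than the R environment cited in the paper, and all function names are illustrative. It assumes the GOLL-G form F = G^{αθ} / (G^{αθ} + [1 − G^θ]^α) with the standard Lindley cdf G(x) = 1 − (1 + λx/(1+λ)) e^{−λx} and pdf g(x) = λ²(1+x) e^{−λx}/(1+λ).

```python
import math


def lindley_cdf(x, lam):
    """Lindley cdf: G(x) = 1 - (1 + lam*x/(1 + lam)) * exp(-lam*x), x > 0."""
    return 1.0 - (1.0 + lam * x / (1.0 + lam)) * math.exp(-lam * x)


def lindley_pdf(x, lam):
    """Lindley pdf: g(x) = lam^2/(1 + lam) * (1 + x) * exp(-lam*x), x > 0."""
    return lam ** 2 / (1.0 + lam) * (1.0 + x) * math.exp(-lam * x)


def golll_cdf(x, alpha, theta, lam):
    """GOLL-G cdf evaluated with a Lindley baseline G."""
    G = lindley_cdf(x, lam)
    a = G ** (alpha * theta)
    b = (1.0 - G ** theta) ** alpha
    return a / (a + b)


def golll_pdf(x, alpha, theta, lam):
    """Derivative of golll_cdf with respect to x."""
    G = lindley_cdf(x, lam)
    g = lindley_pdf(x, lam)
    denom = G ** (alpha * theta) + (1.0 - G ** theta) ** alpha
    return (alpha * theta * g * G ** (alpha * theta - 1.0)
            * (1.0 - G ** theta) ** (alpha - 1.0) / denom ** 2)


def golll_hrf(x, alpha, theta, lam):
    """Hazard rate: pdf divided by the survival function."""
    return golll_pdf(x, alpha, theta, lam) / (1.0 - golll_cdf(x, alpha, theta, lam))


if __name__ == "__main__":
    # Parameter values borrowed from scenario 1 of the simulation study.
    for x in (0.1, 0.5, 1.0, 2.0):
        print(x, golll_cdf(x, 0.50, 0.75, 1.25),
              golll_pdf(x, 0.50, 0.75, 1.25), golll_hrf(x, 0.50, 0.75, 1.25))
```

Setting α = θ = 1 in these functions returns the Lindley cdf and pdf, which gives a quick sanity check of the implementation.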

Properties
Most mathematical properties of the GOLLL distribution do not have closed forms. First, consider the EL rv W_p ∼ exp-L(p) with power parameter p > 0 and density π_p(y) = p g(y) G(y)^{p−1}, which follows from Equations (3) and (4) as
$$\pi_p(y) = \frac{p\,\lambda^{2}}{1+\lambda}\,(1+y)\, e^{-\lambda y} \left[1 - \left(1 + \frac{\lambda y}{1+\lambda}\right) e^{-\lambda y}\right]^{p-1}.$$
The pdf of Y can then be expressed as a linear representation of EL densities as follows.
Applying the linear representation derived in [6], the GOLLL density (6) can be expressed as
$$f(y) = \sum_{k=0}^{\infty} w_k\, \pi_{k+1}(y), \tag{7}$$
where the coefficients w_k = w_k(α, θ) are those given in [6] and π_{k+1}(y) is the density of W_{k+1}. Equation (7) is the main result of this section. The GOLLL properties are obtained in a straightforward way from the EL properties discussed in [17].

Quantile Function
The quantile function (qf) y = Q(u) = F^{−1}(u) of Y can be obtained from [6] as
$$Q(u) = Q_{\mathrm{Lindley}}\!\left( \left\{ \frac{t(u)^{1/\alpha}}{1 + t(u)^{1/\alpha}} \right\}^{1/\theta} \right), \tag{8}$$
where t(u) = u/(1 − u), and Q_Lindley(·) is the Lindley qf. Equation (8) is a useful tool for simulating the GOLLL distribution when U is drawn from a uniform distribution on the interval (0, 1). Figure 3 displays Galton's skewness [19] and Moors' kurtosis [20], both based on quantiles, for varying α and θ with λ = 5.25. These measures are more robust than the traditional skewness and kurtosis measures. The plots highlight the impact of both parameters on the shape of the distribution.
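A minimal Python sketch of simulation by inversion is given next. It assumes the qf has the form Q(u) = Q_Lindley({t^{1/α}/(1 + t^{1/α})}^{1/θ}) with t = u/(1 − u), obtained by solving F(y) = u in the GOLL-G cdf. The Lindley qf is inverted here by bisection to stay within the standard library (a closed form in terms of the Lambert W function also exists), and the quantile-based measures use the usual quartile and octile definitions. All function names are illustrative.

```python
import math
import random


def lindley_cdf(x, lam):
    return 1.0 - (1.0 + lam * x / (1.0 + lam)) * math.exp(-lam * x)


def lindley_qf(p, lam, tol=1e-12):
    """Invert the Lindley cdf numerically by bisection."""
    lo, hi = 0.0, 1.0
    while lindley_cdf(hi, lam) < p:      # expand the bracket until it contains p
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if lindley_cdf(mid, lam) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def golll_cdf(x, alpha, theta, lam):
    G = lindley_cdf(x, lam)
    a = G ** (alpha * theta)
    return a / (a + (1.0 - G ** theta) ** alpha)


def golll_qf(u, alpha, theta, lam):
    """Map u in (0, 1) to a baseline probability, then apply the Lindley qf."""
    t = (u / (1.0 - u)) ** (1.0 / alpha)
    p = (t / (1.0 + t)) ** (1.0 / theta)
    return lindley_qf(p, lam)


def galton_skewness(qf):
    q1, q2, q3 = qf(0.25), qf(0.50), qf(0.75)
    return (q3 - 2.0 * q2 + q1) / (q3 - q1)


def moors_kurtosis(qf):
    e = [qf(k / 8.0) for k in range(1, 8)]   # octiles E1, ..., E7
    return ((e[6] - e[4]) + (e[2] - e[0])) / (e[5] - e[1])


if __name__ == "__main__":
    rng = random.Random(2023)                # fixed seed for reproducibility
    alpha, theta, lam = 1.45, 0.25, 0.95     # scenario 2 of the simulation study
    sample = [golll_qf(rng.random(), alpha, theta, lam) for _ in range(5)]
    print(sample)
    q = lambda u: golll_qf(u, alpha, theta, lam)
    print(galton_skewness(q), moors_kurtosis(q))
```

Bisection is slower than a closed-form inverse but avoids any dependence on special functions, which is convenient for a quick check of simulated samples.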

Moments
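Theorem 1. The nth ordinary moment of Y is given by
$$\mu'_n = E(Y^{n}) = \sum_{k=0}^{\infty} w_k\, E(W_{k+1}^{n}),$$
where w_k are the coefficients of the linear representation in Equation (7) and E(W_{k+1}^n) is the nth ordinary moment of the EL rv W_{k+1}.
Proof. The proof is straightforward by applying Equation (7) and using the EL moments in [17].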
The generating function (gf) M_Y(·) of Y can also be determined straightforwardly from Equation (7) and the EL gf reported in [17,21].
The maximum likelihood estimate (MLE) of ψ = (α, θ, λ)^⊤ can be found from the score equations U_α = U_θ = U_λ = 0 using a Newton-Raphson type algorithm. Alternatively, the log-likelihood in Equation (10) can be maximized numerically using the optim routine available in [22].
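The numerical maximization can be sketched as follows. This is not the optim routine used in the paper: it is a plain coordinate pattern search on the log-parameters, chosen only because it needs no external libraries. The log-likelihood is assumed to be the sum of log f(y_i) under the GOLLL density built from the GOLL-G family with a Lindley baseline, and the test data are drawn from the Lindley special case (α = θ = 1) through its mixture representation. Function names are illustrative.

```python
import math
import random


def lindley_cdf(y, lam):
    return 1.0 - (1.0 + lam * y / (1.0 + lam)) * math.exp(-lam * y)


def lindley_logpdf(y, lam):
    return 2.0 * math.log(lam) - math.log1p(lam) + math.log1p(y) - lam * y


def golll_loglik(params, data):
    """Sum of log f(y_i); returns -inf outside the valid numerical region."""
    alpha, theta, lam = params
    if min(alpha, theta, lam) <= 0.0:
        return -math.inf
    total = 0.0
    for y in data:
        G = lindley_cdf(y, lam)
        if G <= 0.0:
            return -math.inf
        one_minus = 1.0 - G ** theta
        if one_minus <= 0.0:
            return -math.inf
        denom = G ** (alpha * theta) + one_minus ** alpha
        if denom <= 0.0:
            return -math.inf
        total += (math.log(alpha * theta) + lindley_logpdf(y, lam)
                  + (alpha * theta - 1.0) * math.log(G)
                  + (alpha - 1.0) * math.log(one_minus)
                  - 2.0 * math.log(denom))
    return total


def pattern_search(objective, start, step=0.5, tol=1e-5, max_iter=1000):
    """Maximize by trying +/- step on each coordinate, halving step on failure."""
    x, fx = list(start), objective(start)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for j in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[j] += delta
                f_trial = objective(trial)
                if f_trial > fx:
                    x, fx, improved = trial, f_trial, True
                    break
        if not improved:
            step *= 0.5
    return x, fx


def fit_golll(data, start=(1.0, 1.0, 1.0)):
    def objective(z):  # z holds log-parameters, so positivity is automatic
        try:
            return golll_loglik([math.exp(v) for v in z], data)
        except (OverflowError, ValueError):
            return -math.inf
    z_hat, loglik = pattern_search(objective, [math.log(v) for v in start])
    return [math.exp(v) for v in z_hat], loglik


def sample_lindley(n, lam, rng):
    """Lindley(lam): Exp(lam) w.p. lam/(1+lam), otherwise Gamma(shape 2, rate lam)."""
    return [rng.expovariate(lam) if rng.random() < lam / (1.0 + lam)
            else rng.gammavariate(2.0, 1.0 / lam) for _ in range(n)]


if __name__ == "__main__":
    data = sample_lindley(150, 1.5, random.Random(1))
    print(fit_golll(data))
```

A derivative-free search is slow and can stop at a local maximum, so in practice a quasi-Newton routine with several starting values is preferable. The sketch only shows how the likelihood is assembled and maximized.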

Simulations
We generate 1000 samples of sizes 50, 100, 200, 400, 800, and 1000 to evaluate the accuracy of the estimators under two scenarios: ψ = (0.50, 0.75, 1.25)^⊤ for scenario 1, and ψ = (1.45, 0.25, 0.95)^⊤ for scenario 2. The average estimates (AEs), biases, and mean square errors (MSEs) are calculated for each sample size, and the findings are reported in Table 2. The measures are
$$\mathrm{AE}(\hat{\epsilon}) = \frac{1}{1000}\sum_{j=1}^{1000} \hat{\epsilon}_j, \qquad \mathrm{Bias}(\hat{\epsilon}) = \mathrm{AE}(\hat{\epsilon}) - \epsilon, \qquad \mathrm{MSE}(\hat{\epsilon}) = \frac{1}{1000}\sum_{j=1}^{1000} (\hat{\epsilon}_j - \epsilon)^{2},$$
for ϵ = α, θ, λ. The results indicate that the AEs converge to the true parameters and that the biases and MSEs decrease when n increases, which agrees with the consistency of the estimators.

The GOLLL Regression Model
In recent years, new regression models have been proposed to handle various types of data without any transformation. These developments accommodate non-normal data and capture the complexity and diversity of real data sets, providing accurate results. Ref. [23] demonstrated the applicability of the family used here to real engineering data sets. Another work [24] studied COVID-19 ICU survival times in a Brazilian hospital.
In this situation, new models represent an important step to improve the analysis of different outcomes. Therefore, using the proposed GOLLL distribution, a new regression model is constructed as a tool to investigate any dataset that does not satisfy normality assumptions.

Definition
The systematic component of the GOLLL regression model takes into account the fact that the parameter λ in Equation (6) varies across observations (i = 1, . . ., n) as
$$\lambda_i = \exp(\mathbf{x}_i^{\top} \boldsymbol{\beta}), \tag{11}$$
which corresponds to a twice continuously differentiable log-linear link function g(·), where β is the p-dimensional parameter vector associated with the explanatory variables x_i. The components of β are assumed to be independent. Therefore, the non-linear function g(·) links the covariates to the new regression model.
Consider a sample of n independent observations (y_1, x_1), …, (y_n, x_n). The log-likelihood function for the parameter vector ψ = (α, θ, β^⊤)^⊤ in this regression model has the form
$$l(\psi) = \sum_{i=1}^{n} \log f(y_i; \alpha, \theta, \lambda_i), \tag{12}$$
where f(·) is the GOLLL density (6) evaluated at λ = λ_i. Equation (12) is maximized numerically using the optim routine in [22] to estimate ψ. The likelihood ratio (LR) statistic is adopted to compare the proposed regression with its nested models.
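To make the role of the covariates concrete, the following Python sketch evaluates the regression log-likelihood under the assumption of a log link, λ_i = exp(x_i^⊤β), with the GOLLL log-density built from the GOLL-G family and a Lindley baseline. The design matrix, responses, and coefficient values in the example are hypothetical and serve only to show the call pattern.

```python
import math


def golll_logpdf(y, alpha, theta, lam):
    """Log of the GOLLL density at y > 0 for positive alpha, theta, lam."""
    G = 1.0 - (1.0 + lam * y / (1.0 + lam)) * math.exp(-lam * y)   # Lindley cdf
    log_g = 2.0 * math.log(lam) - math.log1p(lam) + math.log1p(y) - lam * y
    one_minus = 1.0 - G ** theta
    denom = G ** (alpha * theta) + one_minus ** alpha
    return (math.log(alpha * theta) + log_g
            + (alpha * theta - 1.0) * math.log(G)
            + (alpha - 1.0) * math.log(one_minus)
            - 2.0 * math.log(denom))


def regression_loglik(alpha, theta, beta, X, y):
    """Sum of log f(y_i; alpha, theta, lam_i) with lam_i = exp(x_i' beta)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta_i = sum(b * x for b, x in zip(beta, x_i))   # linear predictor
        total += golll_logpdf(y_i, alpha, theta, math.exp(eta_i))
    return total


if __name__ == "__main__":
    # Hypothetical data: first column is the intercept, second a single covariate.
    X = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.9]]
    y = [0.35, 0.48, 0.61]
    print(regression_loglik(0.75, 1.85, [0.5, -0.3], X, y))
```

This function can be passed to any numerical optimizer over (α, θ, β). Working with log α and log θ keeps the shape parameters positive without explicit constraints.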

Simulations of the Regression Model
The accuracy of the MLEs in the GOLLL regression model can be assessed using the following measures: bias, MSE, estimated average length (AL) of the confidence intervals, and coverage probability (CP).
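The four measures can be computed from replicated estimates and their standard errors as in the sketch below. To keep the example short and self-contained, the GOLLL regression is replaced by a stand-in model with a closed-form MLE (the rate of an exponential distribution, with its asymptotic standard error), and the intervals are assumed to be 95% Wald intervals. The summary function itself does not depend on the model and would apply unchanged to replicated GOLLL regression fits.

```python
import math
import random

Z_95 = 1.959963984540054  # standard normal quantile for 95% Wald intervals


def simulate_estimates(n, rate, n_rep, seed):
    """Stand-in model: replicate the MLE of an exponential rate and its SE."""
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(n_rep):
        sample = [rng.expovariate(rate) for _ in range(n)]
        rate_hat = n / sum(sample)                    # closed-form MLE
        estimates.append(rate_hat)
        std_errors.append(rate_hat / math.sqrt(n))    # from the Fisher information
    return estimates, std_errors


def summarize(estimates, std_errors, true_value):
    """Bias, MSE, average interval length, and coverage probability."""
    n_rep = len(estimates)
    bias = sum(estimates) / n_rep - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n_rep
    al = sum(2.0 * Z_95 * s for s in std_errors) / n_rep
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if e - Z_95 * s <= true_value <= e + Z_95 * s)
    return {"bias": bias, "mse": mse, "al": al, "cp": covered / n_rep}


if __name__ == "__main__":
    for n in (25, 100, 400):
        est, se = simulate_estimates(n, rate=2.0, n_rep=1000, seed=n)
        print(n, summarize(est, se, true_value=2.0))
```

With a consistent estimator, the bias, MSE, and AL shrink as n grows while the CP moves toward the nominal 0.95, which is the pattern the simulation study looks for.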

Model Checking
Diagnostic measures and residual analysis are employed to assess whether the model accurately represents the data. This involves investigating whether the sample contains any outliers or influential observations that may affect the model's performance.
Measures based on case deletion are considered in the systematic component to identify influential observations in the regression model. Here, the effect of excluding the ith observation on the parameter estimates is examined. Hence, the log-likelihood function for ψ from model (13) after deleting the ith observation is l_(i)(ψ), and the corresponding MLE of the parameter vector is ψ̂_(i).
The influence of the ith observation is measured by the difference between the case-deleted estimate ψ̂_(i) and the original MLE ψ̂. If excluding the ith observation leads to a substantial change in the estimated parameters, then this observation is influential.
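A popular influence measure is the generalized Cook distance (GCD), namely
$$\mathrm{GCD}_i(\psi) = (\hat{\psi}_{(i)} - \hat{\psi})^{\top}\, J(\hat{\psi})\, (\hat{\psi}_{(i)} - \hat{\psi}),$$
where J(ψ̂) is the estimated observed information matrix. Another commonly used influence measure is the likelihood distance (LD), namely
$$\mathrm{LD}_i(\psi) = 2\,[\,l(\hat{\psi}) - l(\hat{\psi}_{(i)})\,].$$
In addition to global influence measures, analyzing residuals can also be an effective way to assess model adequacy and check for incompatibilities with the response distribution. The deviance residuals for the GOLLL regression are
$$r_{D_i} = \mathrm{sgn}(r_{M_i})\, \{-2\,[\,r_{M_i} + \log(1 - r_{M_i})\,]\}^{1/2},$$
where r_{M_i} are the martingale residuals (see [25]), and sgn(·) takes the value +1 if the argument is positive and −1 if it is negative.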
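The deviance residuals can be computed from the fitted survival probabilities, as in the following Python sketch. It assumes uncensored responses, for which the martingale residual reduces to r_M = 1 + log S(y), with S the fitted GOLLL survival function built from a Lindley baseline. The parameter values and function names are hypothetical.

```python
import math


def golll_survival(y, alpha, theta, lam):
    """S(y) = 1 - F(y) for the GOLL-G cdf with a Lindley baseline."""
    G = 1.0 - (1.0 + lam * y / (1.0 + lam)) * math.exp(-lam * y)
    b = (1.0 - G ** theta) ** alpha
    return b / (G ** (alpha * theta) + b)


def martingale_residual(survival):
    """r_M = 1 + log S(y) for an uncensored observation, 0 < S(y) < 1."""
    return 1.0 + math.log(survival)


def deviance_residual(survival):
    """r_D = sgn(r_M) * sqrt(-2 * [r_M + log(1 - r_M)])."""
    r_m = martingale_residual(survival)
    inner = -2.0 * (r_m + math.log1p(-r_m))
    sign = 1.0 if r_m > 0.0 else -1.0
    return sign * math.sqrt(max(inner, 0.0))   # guard against tiny negative rounding


if __name__ == "__main__":
    # Hypothetical fitted values: shape parameters and per-observation lambda_i.
    alpha_hat, theta_hat = 0.75, 1.85
    for y_i, lam_i in [(0.35, 1.6), (0.48, 1.5), (0.61, 1.4)]:
        s_i = golll_survival(y_i, alpha_hat, theta_hat, lam_i)
        print(y_i, deviance_residual(s_i))
```

Observations with a fitted survival probability close to one give large positive residuals, and those far in the upper tail give large negative ones, so plotting r_D against the index helps flag poorly fitted points.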

Application
Initially, a comparative analysis of the GOLLL model against some alternative models is conducted. The competing EL, beta Lindley (BL) [26], Kumaraswamy Lindley (KwL) [27], and gamma-Lindley (GL) [28] distributions are defined in terms of the Lindley cdf G(x), and all of their parameters are positive real numbers. For all fitted models, the goodness.fit function of the AdequacyModel package [29], with the BFGS method, computes the MLEs (SEs in parentheses). The selection of the best fitted model is based on several well-known measures, including the Cramér-von Mises (W*), Anderson-Darling (A*), and Kolmogorov-Smirnov (KS) statistics (p-values in parentheses).

COVID-19 Vaccination Rates on County-Level
To demonstrate the usefulness of the new GOLLL regression model over other competitive models, we provide an application that utilizes county-level COVID-19 vaccination rates in the state of Texas, USA.
The data set consists of 254 county-level percentages of the population with a completed COVID-19 vaccination series (age-adjusted), extracted from the CDC (https://covid.cdc.gov/covid-data-tracker/#datatracker-home, accessed on 22 February 2023). This data set is used since Texas is the state with the highest number of counties in the US. Further investigation with other data sets (states, countries, and counties) should be carried out to examine the accuracy of the new model.
Additional research has examined the impact of covariates on COVID-19 vaccination. Ref. [30] analyzed the COVID-19 vaccination coverage associated with social vulnerability and urbanity. Ref. [31] verified the impact of some variables on vaccination coverage and suggested that interventions be undertaken to improve COVID-19 vaccine acceptance and future uptake. The study conducted by [32] used machine learning to study the vaccination rate in the USA. The findings provide insights to increase vaccination acceptance and combat the COVID-19 pandemic. Other investigations [33,34] identify predictors of vaccine hesitancy using variables such as sociodemographic characteristics and comorbidities and report a strong association. Therefore, the inclusion of the study variables is based on past research, comparisons, and investigations of possible new associations to aid vaccination campaigns.
2. HP: Total number of hospitals reporting vaccination;
3. PR: Poverty rate (percentage of individuals with income below the poverty line);
4. MS: Metropolitan status;
5. HR: High school completion rate (proportion of individuals aged 25 and above who have completed high school or its equivalent);
6. BA: Broadband access (percentage of households that have access to broadband internet);
7. HT: Heart disease rate (percentage of individuals that have chronic heart disease).
Table 3 reports the descriptive statistics for the data set, and the histogram is given in Figure 8. The average vaccination rate across counties was 0.483 in the study period. The standard deviation is 0.132, with rates ranging from a minimum of 0.189 to a maximum of 0.950. Furthermore, the skewness and kurtosis are positive. First, the analysis involves modeling only the response variable by fitting the GOLLL, OLLL, EL, L, BL, KwL, and GL distributions. The MLEs, SEs, and the previous statistics (with the p-values of KS) are reported in Table 4 for the distributions fitted to the COVID-19 vaccination rate data. The GOLLL distribution is the most suitable model for the current data based on these measures.
Three LR tests compare the GOLLL distribution with its nested models. The numbers in Table 5 indicate that the inclusion of extra parameters is significant for accurately modeling the current data.
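For reference, an LR test of this kind compares twice the difference of the maximized log-likelihoods with a chi-square distribution whose degrees of freedom equal the number of restricted parameters. The Python sketch below assumes the three nested models are OLLL (θ = 1), EL (α = 1), and Lindley (α = θ = 1), so only one and two degrees of freedom are needed. The log-likelihood values are hypothetical placeholders, not the values behind Table 5.

```python
import math


def chi2_sf(w, df):
    """Upper tail probability of a chi-square variable, for df = 1 or 2 only."""
    if w < 0.0:
        raise ValueError("the LR statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(w / 2.0))
    if df == 2:
        return math.exp(-w / 2.0)
    raise ValueError("only df = 1 or df = 2 are implemented in this sketch")


def lr_test(loglik_full, loglik_nested, df):
    """Return the LR statistic w = 2 * (l_full - l_nested) and its p-value."""
    w = 2.0 * (loglik_full - loglik_nested)
    return w, chi2_sf(max(w, 0.0), df)


if __name__ == "__main__":
    loglik_golll = 170.0                    # hypothetical maximized log-likelihoods
    nested = {"OLLL": (165.2, 1), "EL": (160.1, 1), "Lindley": (150.3, 2)}
    for name, (loglik, df) in nested.items():
        w, p = lr_test(loglik_golll, loglik, df)
        print(f"GOLLL vs {name}: w = {w:.3f}, df = {df}, p-value = {p:.4g}")
```

A small p-value indicates that the restriction imposed by the nested model is rejected, that is, the extra shape parameter or parameters improve the fit.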
The histogram and fitted densities of the two best models are illustrated in Figure 9a. Further, the estimated cdfs of these models are reported in Figure 9b. Although the model presents a good fit to the current data, this is not enough to know whether the model will be suitable for other datasets at different time or space scales. Future research can test other datasets in different states and at different spatial scales or county levels to investigate the accuracy of the new model.

Results New Regression
Next, using the proposed regression model, the systematic component is taken as (for i = 1, . . ., 254)
$$\lambda_i = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6}).$$
Table 6 reports the MLEs, SEs, and p-values for the GOLLL regression model fitted to the current data. The numbers support that all six explanatory variables are significant (at the 5% level).

Diagnostic and Residual Analysis
Thereafter, the quality of the fit of the GOLLL regression model is examined. The LD and GCD measures in Figure 10 are useful to identify potentially influential observations. They show that the 83rd, 151st, and 176th observations (referring to the counties below) are possibly influential. However, their impacts on the regression model are not particularly significant. Additionally, the plot of the deviance residuals in Figure 11a shows that they fall randomly within the bands. The normal probability plot with simulated envelope in Figure 11b supports the adequacy of the model for this data set. So, the GOLLL regression model provides a good fit.
Finally, Figure 12 reports profile log-likelihood plots for the parameters while keeping all other estimates constant. These plots are useful for determining confidence intervals for the estimates and for assessing the reliability of the statistical analyses. The curves for all parameters convey the accuracy and uncertainty associated with the parameter estimates.

Discussion
The model checks reveal that the GOLLL regression model is suitable to explain the vaccination rates in Texas counties. From the parameter estimates reported in Table 6, the GOLLL regression model becomes
$$\lambda_i = \exp(1.017 - 0.010\,x_{i1} - 0.524\,x_{i2} - 0.149\,x_{i3} + 0.408\,x_{i4} + 0.637\,x_{i5} + 2.275\,x_{i6}). \tag{18}$$
Several facts can be drawn from Equation (18). For each covariate, the study reveals findings that corroborate other research and indicate the importance of the model for future applications to other vaccination data.

• All variables are statistically significant at a significance level of 5%;
• The HP variable shows a slight negative estimate, and this negative change is statistically significant;
• The PR variable is significant, and its estimate is negative. COVID-19 increased poverty and inequality worldwide [35,36]. Individuals living in poverty may lack access to reliable transportation, face barriers to accessing healthcare facilities, and have limited resources for paying out-of-pocket costs associated with vaccination [37,38]. The study of [39] revealed the lack of access to the COVID-19 vaccine in the lowest county's poverty rates across the American state of Illinois. Another study [40] showed a strong negative correlation between poverty and vaccine coverage in an analysis of 189 countries. This can result in lower vaccination rates among populations living in poverty, which is supported by data from the proposed model and prior studies;
• The MS variable has a negative estimate, which indicates that the vaccination rate is lower in metropolitan urban areas. The differences in vaccination rates between urban and rural communities are likely driven by various factors, such as differences in access to healthcare resources, vaccine distribution challenges, and mainly vaccine hesitancy [41]. Patterns in COVID-19 vaccination coverage by urbanity are addressed by [30]. It indicated lower vaccination rates in rural than urban areas, which corroborates with the study. Further, the study of [42] presented disparities in COVID-19 vaccination coverage between urban and rural counties and explained them by educational attainment, healthcare infrastructure, and Trump vote share;
• The HR variable is significant with a positive coefficient. Thus, counties with higher high school graduation rates tend to have higher vaccination rates as well, which can be attributed to greater access to accurate information regarding vaccines and to better healthcare and vaccination services [43]. Other studies [44-46] revealed that high school completion is a key factor in vaccination coverage, access, and hesitancy;
• The BA variable has a positive estimate, which shows that the internet has played a significant role in the COVID-19 vaccination effort. Websites and social media platforms have been used to disseminate information about vaccine availability, eligibility, and safety. The study's results suggest that counties with greater access to broadband internet have a higher COVID-19 vaccination rate, which highlights the disparities in access to the internet and technology among some communities. This finding is consistent with the research presented in [47]. Other studies [48,49] showed that lack of internet access is a barrier to vaccination. In New York City and some counties in North Carolina, COVID-19 vaccine hesitancy increases if there is difficulty accessing the internet;
• The HT variable has a highly positive estimate. Several studies [50-52] have demonstrated the heightened risk of individuals with chronic heart disease contracting and experiencing severe symptoms from COVID-19, as well as increased rates of hospitalization and mortality. For these reasons, many states in the US have implemented targeted outreach efforts to ensure that these populations have access to the vaccine. Hence, the study's results indicate that counties with high rates of chronic heart disease have a correspondingly higher rate of vaccination. This finding highlights the importance of the government's focus on prioritizing at-risk populations [53]. Subsequent studies [54,55] illustrated the efficacy and safety of the COVID-19 vaccine in the presence of comorbidities, including heart disease.
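The fitted systematic component in Equation (18) can be evaluated for any covariate profile, as in the short Python sketch below. The mapping of x_{i1}, …, x_{i6} to HP, PR, MS, HR, BA, and HT is an assumption inferred from the order and signs of the estimates discussed above, and the county profile and its measurement scales are hypothetical. Because of the log link, each coefficient β_j acts multiplicatively on λ_i through exp(β_j).

```python
import math

# Estimates from Equation (18); the variable-to-coefficient mapping is assumed.
COEFFICIENTS = {
    "intercept": 1.017,
    "HP": -0.010,
    "PR": -0.524,
    "MS": -0.149,
    "HR": 0.408,
    "BA": 0.637,
    "HT": 2.275,
}


def fitted_lambda(covariates):
    """lambda_i = exp(beta_0 + sum_j beta_j * x_ij); omitted covariates count as 0."""
    eta = COEFFICIENTS["intercept"]
    for name, value in covariates.items():
        eta += COEFFICIENTS[name] * value
    return math.exp(eta)


def multiplicative_effect(name, delta=1.0):
    """Factor by which lambda_i changes when covariate `name` increases by delta."""
    return math.exp(COEFFICIENTS[name] * delta)


if __name__ == "__main__":
    # Hypothetical county profile; scales are assumed for illustration only.
    county = {"HP": 3, "PR": 0.15, "MS": 1, "HR": 0.85, "BA": 0.80, "HT": 0.06}
    print("lambda_i =", fitted_lambda(county))
    for name in ("HP", "PR", "MS", "HR", "BA", "HT"):
        print(name, multiplicative_effect(name))
```

Reading the estimates on the multiplicative scale makes their relative sizes easier to compare than the raw coefficients.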

Conclusions
This article investigated the factors that explain the COVID-19 vaccination rate using the generalized odd log-logistic Lindley regression model with a systematic component for the shape parameter. Some mathematical properties of this model were provided, and the maximum likelihood method was used to estimate the parameters. A Monte Carlo simulation evaluated the parameter estimators of the proposed regression model and revealed their consistency and the convergence of the coverage probabilities to the nominal level. Diagnostic analysis and deviance residuals supported the suitability of the new model.
The analysis of COVID-19 vaccination rates at the county level in Texas, US, uncovered significant findings. The total number of hospitals reporting vaccination is a weak predictor of vaccination, and it is suggested that it be considered in future work for further investigation. Poverty rate and metropolitan status are shown in this work to be determinants. The first, discussed in [35-38], among others, reflects the lack of access to the COVID-19 vaccine among individuals living in poverty. The second, examined in [30,41,42], reflects disparities in COVID-19 vaccination coverage between urban and rural counties, in agreement with this study. The education level was also identified as a determinant of the vaccination rate. Supporting studies [43,44,46] indicated that the high school completion rate is a significant variable in vaccination coverage, access, and hesitancy. Another important variable is broadband access. The internet, through websites and social media platforms, disseminates information about vaccines, as suggested by prior studies [47-49], which is consistent with the findings of this study. Several studies [50-52] worldwide demonstrated how comorbidities influence COVID-19. Countries prioritized the availability of vaccines for people in the risk group, as highlighted in [53]. Subsequent studies [54,55] report findings similar to those of the current study, including for heart disease.
The new model showed that it was more flexible than some competitive models. Hence, it is possible to conclude that the proposed model can provide better insights into the relationship between the explanatory variables and the response variable and serve as an alternative model to evaluate other research. It is recommended to apply the regression model introduced here in other states, countries, or cities and verify whether the same covariates would be significant in future work.
One thousand samples of sizes n = 25, 55, …, 1000 are generated from Equation (8) by setting α = 0.75, θ = 1.85, β0 = 2.75, and β1 = 3.40. The Monte Carlo simulation provides a versatile approach to analyzing the parameters of the model, enabling researchers to explore the behavior of a distribution under various conditions. Figures 4-7 report how the measure values change with respect to the sample size. The biases, MSEs, and ALs decay toward zero when n increases. Additionally, the CPs approach the nominal level of 0.95 as n increases. These findings provide strong evidence of the consistency of the MLEs. The simulation contributes to the reliability and comprehensiveness of the new regression model.

Figure 11. (a) Deviance residual plot (circles: residuals; lines: bands of three standard deviations). (b) Normal probability plot of the r_D's with envelope.

Table 4. Findings from the fitted models.