Bayesian Statistical Inference for Factor Analysis Models with Clustered Data

Abstract: Clustered data are a complex and frequently encountered type of data. Traditional factor analysis methods are effective for non-clustered data, but they do not adequately capture the correlations among observed individuals or variables within clusters. This paper proposes a Bayesian approach, based on MCMC and the Gibbs sampling algorithm, to estimate the parameters of interest in a factor analysis model for clustered data. Ergodic mean plots of the parameters are used to verify convergence of the Markov chain, and a Bayesian case-deletion model with Cook's posterior mean distance is used to analyze the influence of individual clusters and identify outliers in clustered data. The applicability and validity of the method are demonstrated in three ways: its parameter estimates are compared with those of the principal component method, clustered data with and without internal relationships are contrasted in an example analysis, and anomalous groups are identified via Cook's posterior mean distance.


Introduction
Clustered data, commonly found in fields such as neuroscience and epidemiology, are characterized by a high degree of similarity among observations within the same group, which exceeds the similarity between different groups. They comprise interconnected samples, each encapsulating a collective of individuals with correlated characteristics. When these samples are amalgamated, they form a complex clustered dataset with multiple clusters that may vary in the number of constituent samples. Such datasets accentuate intra-group correlations, a focus distinct from longitudinal data, which prioritize the temporal sequence of observations for individual subjects. The hierarchical structure of clustered data introduces analytical challenges, particularly in statistical analysis, where the internal group structure is easily overlooked. Galbraith and Daniel [1] demonstrated that neglecting the internal relationships within clustered data can result in substantial biases in parameter estimation, underscoring the need to account for these internal associations. The field of clustered data analysis has seen significant advances in recent years with the emergence of diverse analytical models, such as linear models [2][3][4], random effects models [5], semiparametric models [6], nonparametric models [7], logistic regression models [8], and multilevel models [9,10]. Each of these methodologies has contributed to parameter estimation for clustered data, converging on a pivotal insight: disregarding the potential correlations among observations within groups during modeling can lead to an underestimation of standard errors, thereby compromising the robustness of inferential conclusions.
The factor analysis model is an effective statistical tool for describing the dependence between multidimensional variables. Spearman [11] first proposed factor analysis when studying intelligence tests. It not only reduces the dimension but also reflects the internal correlation among variables, effectively addressing intra-cluster correlation. Experiments in cytology performed by Julian [12] demonstrated that ignoring the nested structure leads to serious deviations in the estimates of unknown parameters when the intra-cluster correlation is significant, proving that the nested structure of clustered data cannot be ignored. Huang et al. [9] proposed a variety of alternative schemes for handling the nested structure of clustered data, but did not address methods applicable to the clustered factor model. The cluster-mean method is a simple and effective way to process clustered data and has been successfully applied by many scholars to deal with nested structures. Okech [13] used cluster means to process the nested structure of the data when analyzing characteristics of low-income parents' parenting pressure and economic pressure through a factor model. Galbraith et al. [1] also stated that using cluster means to process clustered data is an effective way to obtain unknown parameters. Recently, with the rise of Bayesian networks, estimating factor analysis models using Bayesian methods has become a popular topic. The Bayesian factor analysis model, first proposed by Press [14], obtains implicit solutions for the unknown parameters of the factor model when the sample covariance obeys a Wishart distribution. Wirth et al. [15] demonstrated that the MCMC method can produce better fitting results when estimating the parameters of a factor analysis model for non-clustered data. Zhang et al. [16] proposed a novel Bayesian factor analysis model that achieves dimension reduction and feature extraction for high-dimensional multi-omics data by incorporating biological mapping knowledge, improving the model's robustness to noise and sparsity. Hansen et al. [17] proposed a new variational inference algorithm that uses a multiplicative gamma process shrinkage prior to approximate the posterior distribution of a Bayesian factor model, addressing the poor scalability of factor analytic models for high-dimensional data.
Since Cook [18] introduced the case-deletion model for identifying influential points, statistical diagnostics of models has become a research hotspot, with various diagnostic methods being developed. Among these, Bayesian statistical diagnostics is a particularly active branch, widely used by statisticians both domestically and internationally. De Finetti [19] explored how to deal with data outliers through Bayesian subjective probability theory and analyzed the treatment of independent, exchangeable, and partially exchangeable error distributions. Jackson et al. [20] proposed two approximations capable of computing diagnostic statistics for the Bayesian case-deletion model. Zhu et al. [21] examined three Bayesian case-deletion influence measures in the context of complex models such as longitudinal data. Nevertheless, there is a paucity of research on Bayesian case-deletion diagnostics for factor models, and the statistical diagnosis of factor models for clustered data has not been addressed in the literature. Factor analysis models excel at identifying correlations among observable variables or individuals within clustered datasets, where large volumes of data across various fields share unobservable underlying common factors. However, applying these models for parameter estimation and influence analysis within clustered data remains largely unexplored. The specific contributions of this manuscript are as follows:

•
We used results averaged across clusters to deal with the nested structure of the clustered data, where the associations between clusters are characterized by a common diagonal error matrix;

•
We performed a comprehensive analysis of factor analysis models for clustered data, employing both Bayesian and principal component analysis methods, and compared the efficacy of these two methodologies;

•
We utilized Cook's posterior mean distance for statistical diagnostics of factor analysis models within clustered data and identified groups with anomalies through diagnostic statistics.
The remaining sections are structured as follows: Section 2 introduces the construction of the factor analysis model for clustered data; Section 3 presents the basic steps of modeling the factor analysis model of clustered data using the Bayesian method; Section 4 focuses on estimating factor analysis models for clustered data using principal component analysis; Section 5 introduces the diagnosis of anomalous groups using Cook's posterior mean distance; Section 6 validates the feasibility of the Bayesian method via numerical simulation and compares its estimation performance with that of the principal component method; Section 7 is an example analysis illustrating that the estimation method can be effectively applied in practice; Section 8 encapsulates the findings and concludes the paper.

Model Framework of Factor Analysis for Clustered Data
Suppose the observed variable of the ith individual in the gth cluster is x_gi = (x_gi1, x_gi2, ..., x_gip)ᵀ. For g = 1, 2, ..., G and i = 1, 2, ..., n_g, the factor analysis model is expressed as follows:

x_gi = μ_g + Λ_g F_gi + ε_gi, (1)

where μ_g is the p × 1 mean vector, Λ_g is the p × m factor loading matrix of the gth cluster, F_gi = (F_gi1, F_gi2, ..., F_gim)ᵀ denotes the m-dimensional common factor of the ith individual in the gth cluster, and ε_gi is the special factor of the ith individual in the gth cluster, with ε_gi ∼ N_p(0, Ψ_g). We further suppose that ε_gi is uncorrelated with F_gi for all g and i. Aggregating all individuals in the gth cluster in matrix form gives

X_g = μ_g 1ᵀ_{n_g} + Λ_g F_g + ε_g, (2)

where X_g = (x_g1, ..., x_gn_g), F_g = (F_g1, ..., F_gn_g), and ε_g = (ε_g1, ..., ε_gn_g). The factor analysis model of all clusters is expressed in block diagonal form, as shown in Equation (3):

X = μ + Λ F + ε, (3)

where Λ = diag(Λ_1, ..., Λ_G) and Ψ = diag(Ψ_1, ..., Ψ_G).
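As a concrete illustration, data from the single-cluster model in Equation (1) can be simulated as follows. This is a minimal sketch; the dimensions and parameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: p observed variables, m common factors, n_g individuals
p, m, n_g = 6, 2, 40

mu_g = rng.uniform(0, 1, size=p)             # cluster mean vector mu_g (p x 1)
Lambda_g = rng.uniform(0, 1, size=(p, m))    # factor loading matrix Lambda_g
Psi_g = np.diag(rng.uniform(0.1, 0.5, p))    # diagonal special-factor covariance

# x_gi = mu_g + Lambda_g F_gi + eps_gi, with eps_gi ~ N_p(0, Psi_g)
F_g = rng.standard_normal((n_g, m))                       # common factors F_gi
eps_g = rng.multivariate_normal(np.zeros(p), Psi_g, n_g)  # special factors
X_g = mu_g + F_g @ Lambda_g.T + eps_g                     # n_g x p data matrix
```

Stacking the rows of `X_g` corresponds to the matrix form of Equation (2) for one cluster.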

Analysis of Prior Distribution
The factor analysis model of a single cluster is shown in Equation (1). Let θ denote the vector of all unknown parameters. For Bayesian analysis, θ is assumed to satisfy the joint prior

π(θ) = π(μ_g) π(Λ_g) π(F_gi) π(Ψ_g). (4)

Ansari et al. [22] noted the advantages of using conjugate priors when dealing with multilevel factor analysis models: they simplify the computation of the posterior distribution, improve computational efficiency, and provide flexibility and stability to the parameters. We follow this approach and assume that the variance of the common factors follows an inverse gamma distribution. The prior distributions of the remaining unknown parameters are delineated below:

μ_g ∼ N(α_0, Σ_0), Λ_gj ∼ N(α_1j, Σ_1j), F_gi ∼ N(α_1, Σ_1), Ψ_g ∼ IG(c, d), (5)

where IG(·) denotes the inverse gamma distribution. Aware of the potential variability among different subgroups, we adjust the means and variances of the common factors to generate subgroup datasets with distinct means and variances, enabling a more precise simulation and analysis of the specific characteristics of these subgroups. In addition, we set the random error term to follow a common normal distribution across clusters, a setting designed to guarantee the feasibility and validity of the cluster-mean method. Here, α_1j, Σ_1j, c, and d are fixed hyperparameters; α_0, Σ_0, α_1, and Σ_1 are variable parameters.
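Under these conjugate priors, drawing one parameter configuration can be sketched as follows. The hyperparameter values are taken from the simulation section; the dimensions and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

rng = np.random.default_rng(1)
p, m = 6, 2  # illustrative dimensions

# Fixed hyperparameters, as in the simulation study
alpha_1j = np.array([0.9, 0.1])   # prior mean of each loading row
Sigma_1j = np.diag([0.1, 0.1])    # prior covariance of each loading row
c, d = 3.0, 4.0                   # inverse gamma shape and scale

# One draw from the joint prior: loading rows are independent normals,
# and the special-factor variances are independent inverse gammas
Lambda_g = np.vstack([
    multivariate_normal.rvs(mean=alpha_1j, cov=Sigma_1j, random_state=rng)
    for _ in range(p)
])
psi_diag = invgamma.rvs(c, scale=d, size=p, random_state=rng)  # diag of Psi_g
```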

Gibbs Sampling Method
The posterior distribution of the parameters is the starting point of Bayesian statistical inference. Since the posterior distribution of θ involves multiple integrals, we use the Gibbs sampler [23] to draw samples successively from the full conditional distributions, forming a Markov chain. Each parameter is assumed to be independent a priori, and the sampling density of an observation is

x_gi | θ ∼ N_p(μ̃_g, Ψ_g), where μ̃_g = μ_g + Λ_g F_gi.

Since θ denotes all the unknown parameters, the estimate of each component is computed from the conditional distribution satisfied by θ in order to generate x_gi^new. As the number of sampling iterations increases, the distribution of the sampled values approaches the true distribution. Let X_g = (x_g1ᵀ, x_g2ᵀ, ..., x_gn_gᵀ)ᵀ be the set of observed variables and π(μ_g, Λ_g, F_gi, Ψ_g) the prior distribution of the unknown parameters; then, according to Equations (4) and (5), the joint posterior distribution h(x_gi, θ) is proportional to the product of the likelihood and the prior, as given in Equation (6). The posterior distributions of the unknown parameters, p(θ|X_g), are inferred from Equation (6); the specific form is shown in Equation (7). Starting from initial values (μ_g^(0), Λ_g^(0), F_g^(0), Ψ_g^(0)), the chain {μ_g^(n), Λ_g^(n), F_g^(n), Ψ_g^(n)} is obtained after n steps of Gibbs sampling, each step updating one parameter in turn from its full conditional distribution given the current values of the others. As the number of cycles approaches infinity, the distribution of the samples gradually approaches the target distribution p(μ_g, Λ_g, F_gi, Ψ_g|X_g) [24], which provides sufficient justification for treating the samples drawn above as representative of the posterior distribution. The estimates of the unknown parameters of the factor analysis model for clustered data are then obtained by applying Gibbs sampling at the cluster level. However, it remains necessary to ascertain whether the effect of each group within the clustered data is consistent with the assumptions of the established model, and whether there are clusters that deviate significantly from the original model. Statistical diagnosis of the above models is therefore an indispensable component of research involving clustered data.
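As a sketch of the sampler's structure, the following simplified Gibbs sampler targets a mean-centred version of the model with F_i ∼ N(0, I). The priors used here (zero-mean normal loadings with variance `tau2`, inverse gamma uniquenesses) are illustrative assumptions rather than the exact priors of this section.

```python
import numpy as np

def gibbs_factor(X, m, n_iter=500, c=3.0, d=4.0, tau2=10.0, seed=0):
    """Simplified Gibbs sampler for the mean-centred model x_i = Lambda F_i + eps_i,
    with F_i ~ N(0, I_m) and eps_i ~ N(0, diag(psi)).
    Assumed priors (illustrative): rows of Lambda ~ N(0, tau2 I), psi_j ~ IG(c, d)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Lam = 0.1 * rng.standard_normal((p, m))
    psi = np.ones(p)
    for _ in range(n_iter):
        # (1) Draw F | Lambda, psi, X from its conjugate normal full conditional
        cov_F = np.linalg.inv(np.eye(m) + Lam.T @ (Lam / psi[:, None]))
        mean_F = X @ (Lam / psi[:, None]) @ cov_F
        F = mean_F + rng.standard_normal((n, m)) @ np.linalg.cholesky(cov_F).T
        # (2) Draw each row of Lambda | F, psi, X (Bayesian linear regression)
        FtF = F.T @ F
        for j in range(p):
            cov_j = np.linalg.inv(FtF / psi[j] + np.eye(m) / tau2)
            mean_j = cov_j @ (F.T @ X[:, j]) / psi[j]
            Lam[j] = rng.multivariate_normal(mean_j, cov_j)
        # (3) Draw psi_j | F, Lambda, X from its conjugate inverse gamma conditional
        resid = X - F @ Lam.T
        psi = (d + 0.5 * (resid ** 2).sum(axis=0)) / rng.gamma(c + n / 2, 1.0, size=p)
    return Lam, psi
```

Each sweep cycles through the three full conditionals, which is exactly the update pattern described above; retaining the draws after a burn-in period yields approximate posterior samples.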

Principal Component Method for Factor Analysis Models with Clustered Data
In the initial phase of our investigation, we focus on the factor analysis model for a single cluster, with the matrix representation given in Equation (2). The eigenvalues of the covariance matrix Σ_g are denoted by λ_1 ≥ λ_2 ≥ ... ≥ λ_p, arranged in descending order, with corresponding orthonormal eigenvectors ξ_1, ..., ξ_p. If Λ_g Λ_gᵀ is sufficiently large and the variance of the specific factors is negligible, Σ_g can be approximated by Λ_g Λ_gᵀ, i.e., Σ_g ≈ Λ_g Λ_gᵀ. Using spectral decomposition, the covariance matrix is decomposed as

Σ_g ≈ Σ_{i=1}^{m} λ_i e_gi e_giᵀ, (8)

where λ_i is the ith largest eigenvalue of the covariance matrix and e_gi is the corresponding orthonormal eigenvector, collected as E_g = (e_g1, e_g2, ..., e_gm) ∈ R^{p×m}. Since F_gi obeys a normal distribution with a non-zero mean in this paper, the estimator of the factor loading matrix derived from the principal component method is

Λ̂_g = (√λ_1 e_g1, √λ_2 e_g2, ..., √λ_m e_gm). (9)

The common factor F_g can then be estimated by the least-squares algorithm, as shown in Equation (10):

F̂_g = (Λ̂_gᵀ Λ̂_g)⁻¹ Λ̂_gᵀ (X_g − μ̂_g). (10)

Since all clusters share a common error matrix, the parameter estimates for the factor model in clustered data are obtained by averaging the estimation results across clusters.
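The principal-component estimates described above can be sketched for one cluster as follows. This is a minimal implementation under the standard eigendecomposition approach; the function and variable names are ours.

```python
import numpy as np

def pca_factor_estimate(X, m):
    """Principal component estimates for one cluster:
    the loading estimate collects sqrt(eigenvalue)-scaled leading eigenvectors
    of the sample covariance; the factor scores follow by least squares."""
    Xc = X - X.mean(axis=0)              # centre at the estimated cluster mean
    S = np.cov(Xc, rowvar=False)         # sample covariance matrix
    vals, vecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]   # indices of the m largest eigenvalues
    Lam = vecs[:, order] * np.sqrt(vals[order])   # loading estimate (p x m)
    F = Xc @ Lam @ np.linalg.inv(Lam.T @ Lam)     # least-squares factor scores
    return Lam, F
```

Because the eigenvectors are orthonormal, the estimated loadings satisfy Λ̂ᵀΛ̂ = diag(λ_1, ..., λ_m), a convenient check on the implementation.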

Case-Deletion Model
The case-deletion model is a diagnostic tool used to assess whether observed data points are influential. It compares the differences in the corresponding statistics or parameters after deleting one point to determine whether it is an influential point. The model after deleting the mth group can be expressed as Equation (11):

x_gi(m) = μ_(m) + Λ_(m) F_gi(m) + ε_gi(m), g ≠ m, (11)

where x_(m), μ_(m), F_(m), Λ_(m), and ε_(m) represent the corresponding quantities after deleting the mth group. The MCMC method is used to estimate the parameters after deletion, and Cook's posterior mean distance is used to measure the influence between clusters.

Cook's Posterior Mean Distance
To study the posterior mean effect of the gth group on the unknown parameters θ = (μ_g, Λ_g, F_g, Ψ_g), the Bayesian case-deletion influence measure for deleting the gth cluster is defined by Cook's posterior mean distance, as shown in Equation (12):

CM_g = (θ̄ − θ̄_(g))ᵀ Ψ_θ (θ̄ − θ̄_(g)), (12)

where θ̄ = ∫ θ p(θ|X) dθ and θ̄_(g) = ∫ θ p(θ|X_(g)) dθ denote the posterior means before and after deleting the gth cluster, respectively, and Ψ_θ, the inverse of the covariance matrix of θ, is positive definite. The larger the value of CM_g, the greater the difference between θ̄ and θ̄_(g), and the more likely the gth cluster is a strong-influence or anomalous cluster.
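Given posterior draws of θ from the full data and from the data with the gth cluster deleted, Equation (12) can be computed as follows. Here Ψ_θ is taken to be the inverse posterior covariance estimated from the full-data draws, one common positive-definite choice; this is an assumption for illustration.

```python
import numpy as np

def cooks_posterior_mean_distance(theta_draws, theta_draws_del):
    """CM_g = (theta_bar - theta_bar_(g))' Psi_theta (theta_bar - theta_bar_(g)),
    with Psi_theta set to the inverse posterior covariance of theta estimated
    from the full-data draws (an assumed, positive-definite choice)."""
    theta_bar = theta_draws.mean(axis=0)        # posterior mean, full data
    theta_bar_g = theta_draws_del.mean(axis=0)  # posterior mean, gth cluster deleted
    Psi_theta = np.linalg.inv(np.cov(theta_draws, rowvar=False))
    diff = theta_bar - theta_bar_g
    return float(diff @ Psi_theta @ diff)
```

Identical draw sets give CM_g = 0, and CM_g grows as deleting the cluster shifts the posterior mean, matching the interpretation above.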

Simulation Study

Random Parameter Generation for the Markov Chain Monte Carlo Algorithm
A clustered dataset is simulated using Monte Carlo methods. The construction of a single-cluster factor analysis model is considered first, with the caveat that the number of samples in each cluster is not necessarily identical. To this end, the number of samples in each cluster follows a uniform distribution U[40, 50], with samples generated sequentially and iteratively from the corresponding conditional distributions. The factor loading matrix is of order 6 × 2, with the initial coefficients of each row set to (0.9, 0.1)ᵀ. To ensure that the common factors satisfy different normal distributions in different clusters, the common factor is assumed to follow normal distributions with cluster-specific means and variances: the means follow the uniform distribution U[0, 1] and the variances follow the uniform distribution U[0, 0.5]. The random error term is also assumed to follow a normal distribution, ε_gi ∼ N_6(0, Ψ_g), where Ψ_g obeys the inverse gamma distribution Ψ_g ∼ IG_6(c, d). In the prior distribution, the hyperparameters are chosen as α_1j = (0.9, 0.1)ᵀ, Σ_1j = diag(0.1, 0.1), c = 3, and d = 4. The convergence and stability of the Gibbs sampling are assessed via the ergodic mean plot, as shown in Figure 1.
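The data-generating settings above can be sketched as follows. The loading values 0.9 and 0.1 mirror the prior means α_1j; the function and variable names are ours.

```python
import numpy as np

def simulate_clustered_data(G, seed=0):
    """Generate a clustered dataset under the stated simulation settings:
    6 x 2 loading matrix, n_g ~ U[40, 50], cluster-specific factor means
    U[0, 1] and variances U[0, 0.5], and psi_j ~ IG(3, 4)."""
    rng = np.random.default_rng(seed)
    p, m = 6, 2
    Lam = np.column_stack([np.full(p, 0.9), np.full(p, 0.1)])  # loading matrix
    clusters = []
    for g in range(G):
        n_g = rng.integers(40, 51)               # cluster size in [40, 50]
        mu_F = rng.uniform(0, 1, size=m)         # cluster-specific factor mean
        var_F = rng.uniform(0, 0.5, size=m)      # cluster-specific factor variance
        psi = 4.0 / rng.gamma(3.0, 1.0, size=p)  # psi_j ~ IG(3, 4) via gamma
        F = mu_F + np.sqrt(var_F) * rng.standard_normal((n_g, m))
        eps = np.sqrt(psi) * rng.standard_normal((n_g, p))
        clusters.append(F @ Lam.T + eps)         # n_g x 6 observations
    return clusters
```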

Simulation Results of Gibbs Sampling
Gibbs sampling is used to estimate the parameters Λ_g, F_gi, and Ψ_g over 50 iterations. The initial estimates of F_gi and ε_gi are derived from the mean of 1000 samples, each drawn from the corresponding prior distribution. For the generation of the simulated clustered dataset, the number of clusters is set to 10, 30, and 50. A comprehensive presentation of the simulation outcomes can be found in Table 1. We use Bias to denote the deviation between the true value and the estimated mean, RMSE the root mean square error, and SD the standard deviation.
Table 1 shows that, after many iterations, the bias and root mean square error of all parameter estimates are controlled within a tightly constrained range, and the RMSE and standard deviation are relatively close to each other, indicating that the estimation results are close to the true values and revealing the effectiveness of this method for fitting the factor model to clustered data. Since the principal component method is a classical estimation method for factor analysis models, we compare parameter estimation based on the two methods under the same settings as the simulated dataset; the estimation results of the principal component method are shown in the tables below.

Random Parameter Generation for the Principal Component Algorithm
Since the principal component method ignores the effect of variance, its parameters are set to be approximately the same as those of the Bayesian method for comparison purposes. The initial coefficients of the factor loading matrix are set as Λ_11 = Λ_12 = ... = Λ_16 = 0.9 and Λ_21 = Λ_22 = ... = Λ_26 = 0.1. The common factor is again assumed to satisfy normal distributions with different means and variances: the mean of F_g is uniformly distributed between 0 and 1, and the variance is uniformly distributed between 0 and 0.5. The random error ε_gi is distributed as N_p(0, Ψ_g), where the initial values of Ψ_g are the same as under the Gibbs sampling algorithm. The intra-group correlation is also set as in the Gibbs sampling setup.

Simulation Results of Principal Component Algorithms
The parameters of interest, Λ_g and F_g, can be estimated using principal component analysis. To gauge the accuracy of these estimates, the bias, standard deviation, and root mean square error of Λ̂_g versus Λ_g and F̂_g versus F_g are computed. This process is repeated 1000 times to ensure a robust assessment of how closely the estimated values approximate the true values. The number of clusters is again set to 10, 30, and 50, and the estimated results are presented in Table 2. As can be seen from Tables 1 and 2, the estimation results of the MCMC method are better than those of the principal component method on every item, indicating that the MCMC method is more effective. The corresponding influence diagnostics are reliable only when the estimation results are accurate.

Statistical Diagnosis of Clustered Data
The dataset under consideration is subjected to statistical diagnosis using the case-deletion method. Cook's distance is utilized to evaluate the influence of individual clusters by comparing the changes in CM_g, a measure of leverage and influence, following the removal of different clusters. Within the simulation study, the common factor F_g for the 8th, 15th, 21st, and 42nd clusters is incremented by 2 to appraise the validity of the applied method, and F_g is designated as the diagnostic parameter of interest. The outcomes of this analysis are presented in Figure 2. Figure 2 depicts Cook's distance for datasets consisting of 10, 30, and 50 clusters, generated through Monte Carlo simulation with clusters deleted sequentially. Analysis of Figure 2 reveals that, within the 10-cluster dataset, the 8th cluster exerts the most significant influence. In the dataset comprising 30 clusters, the 8th, 15th, and 21st clusters are identified as strong influence points, with the 21st cluster demonstrating the most pronounced impact. Similarly, in the dataset with 50 clusters, the 8th, 15th, 21st, and 42nd clusters are recognized as having substantial influence, with the 21st cluster again being the most influential. These findings are congruent with the intentional perturbations introduced to the 8th, 15th, 21st, and 42nd clusters, as discussed above. Consequently, the diagnostic methodology employed can be deemed valid.

Example Analysis
In this section, we present an experimental analysis of cigarette sales. Our study is based on data derived from a survey of cigarette sales conducted across the United States from 1963 to 1992. This dataset is accessible through the Wiley Online Library at https://www.wiley.com/legacy/vkhbwileychi/baltagi/ (accessed on 14 February 2024). The survey, conducted by Baltagi [25], investigated cigarette consumption patterns across 46 states within the United States. A comprehensive description of the variables in the dataset is presented in Table 3. The dataset, which is inherently clustered by state, is modeled as such for analysis. Before conducting factor analysis, the data were standardized, and the suitability of the data for factor analysis was assessed using the KMO test and Bartlett's test of sphericity. The outcomes are detailed in Table 4. As shown in Table 4, the KMO measure of sampling adequacy is 0.78, exceeding the threshold commonly accepted for proceeding with factor analysis. Moreover, Bartlett's test of sphericity yields a p-value of less than 0.05, which is statistically significant and further corroborates the appropriateness of the data for this analytical approach. Figure 3 indicates that the extraction of three common factors is justified, leading to the development of a factor analysis model with three factors.
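Both adequacy checks can be computed directly from the correlation matrix. The following is a minimal sketch using the standard textbook formulas; it is not necessarily the exact software used for Table 4.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 is that the correlation matrix is identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

def kmo(X):
    """Kaiser-Meyer-Olkin measure: squared correlations vs. squared partials."""
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    partial = -Rinv / d                      # partial correlation matrix
    mask = ~np.eye(R.shape[0], dtype=bool)   # off-diagonal entries only
    r2 = (R[mask] ** 2).sum()
    p2 = (partial[mask] ** 2).sum()
    return r2 / (r2 + p2)
```

A KMO value near 1 and a small Bartlett p-value both indicate that the variables share enough common variance for factor analysis to be appropriate.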
Figure 4 illustrates the classification of the eight indicators into three separate categories based on their factor loading coefficients.Specifically, the indicators labeled as Price, Cpi, Pimin, Ndi, and Year are categorized together as representing economic factors.The indicators Pop and Pop16 are united to form the demographic factor.Finally, the Sales indicator is identified as the sole representative of the demand factor.
Examination of Figure 5 reveals that the mean of the trace for each variable stabilizes after 3000 iterations, suggesting that the Gibbs sampling algorithm has converged. To ensure reliability, the analysis was conducted with a total of 5000 iterations, with the initial 3000 serving as a burn-in period; the subsequent 2000 iterations were retained as valid samples, on which statistical analysis was then performed. Anderson and Herman [26] showed that the core condition for identifiability of the factor loading matrix is that Λ_gᵀ Ψ_g⁻¹ Λ_g be diagonal; Λ_g can then be identified. The resulting estimates of the factor loading matrix coefficients are presented in Table 5. According to Table 5, the difference between the mean and median of the loading coefficients is small, indicating a good fit for the MCMC method. Baltagi [25] studied the variation in cigarette price elasticities across states in this dataset using a heterogeneous coefficient model but did not consider commonality between states. The clustered factor analysis model not only captures the internal relationships among these states but also describes the common factors across states, yielding richer results than the heterogeneous coefficient model. To highlight the advantages of modeling the clustered structure, the quality of the model can be judged by two criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Akaike [27] proposed a method for calculating the AIC value in a factor analysis model; the AIC and BIC values obtained with this method are shown in Table 6. As can be seen from Table 6, the AIC and BIC values of the model that considers the internal relationships of clusters are much lower than those of the model that does not, indicating that accounting for the internal relationships of clustered data is preferable to ignoring them, and also that the estimation performance of the MCMC method is better than that of the principal component method.
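For reference, AIC and BIC for a fitted factor model can be computed from the Gaussian log-likelihood as sketched below. The free-parameter count uses one common convention and is an assumption, since Akaike's exact formulation [27] is not reproduced here.

```python
import numpy as np

def factor_model_ic(X, Lam, psi):
    """AIC and BIC for a fitted factor model x ~ N(mu, Lam Lam' + diag(psi)).
    Free-parameter count k = p*m + p - m*(m-1)/2 (loadings plus uniquenesses,
    minus rotational constraints); one common convention, assumed here."""
    n, p = X.shape
    m = Lam.shape[1]
    Sigma = Lam @ Lam.T + np.diag(psi)       # model-implied covariance
    diff = X - X.mean(axis=0)
    _, logdet = np.linalg.slogdet(Sigma)     # stable log-determinant
    quad = np.sum((diff @ np.linalg.inv(Sigma)) * diff)
    loglik = -0.5 * (n * p * np.log(2 * np.pi) + n * logdet + quad)
    k = p * m + p - m * (m - 1) // 2
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)
```

Lower values of either criterion indicate a better trade-off between fit and complexity, which is the comparison reported in Table 6.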
In addition, the Bayesian case-deletion model was applied to these data to identify anomalous or high-influence clusters. With the common factor F_gi and the factor loading matrix Λ_g as the parameters of interest and the remaining parameters treated as nuisance parameters, the influence diagnostic results are shown in Figures 6 and 7, respectively.
Figure 6 provides visual evidence that, based on Cook's distance, the 17th cluster can be classified as an outlier or a cluster exerting a significant influence on the model, regardless of whether F_g or Λ_g is the parameter of interest. A comparison of the arithmetic means of the indicators in the original dataset reveals that the 17th state's cigarette price is substantially lower than the national average, and its population is significantly smaller than that of other states, while its consumer price index and per capita sales are higher. These observations support the inference that the 17th state wields a more substantial influence.
To investigate a potential masking effect among influential clusters, we performed a sequential exclusion analysis, omitting the 17th cluster before conducting the Bayesian statistical diagnosis. The results are depicted in Figure 7, which shows that the Cook's distance values for the parameters of the remaining clusters exhibit minimal variation. This suggests that there is no significant masking effect among these clusters, and it further confirms that the 17th cluster exerts a substantial influence on the parameter estimates.

Discussion
Clustered data, one of the complex data types, are characterized by internal correlation within groups and a lack of correlation between groups. Ignoring this nested structure during inference can lead to biased conclusions. The factor analysis model is an effective approach for addressing the nested structure inherent in clustered data. This paper presents a factor model specifically tailored to clustered data. Employing the MCMC method in conjunction with Gibbs sampling, we first estimate the unknown parameters of an individual cluster; the inter-cluster association is captured through a shared diagonal error matrix, and the factor model for clustered data is then constructed by integrating multiple clusters. Numerical simulations substantiate the validity of the model and methodology. Empirical evidence indicates that the clustered data model significantly outperforms models that disregard internal relationships. Furthermore, this study employs Bayesian deletion influence to compare diagnostic statistics before and after cluster deletion and utilizes Cook's posterior mean distance to identify potential outliers among clusters. Both the simulation study and the example analysis yield promising results.
The research concepts presented are extendable to structural equation models with clustered data. As structural equation models are an advanced form of the factor analysis model, the modeling approaches discussed are not only applicable to them but also provide a foundation for future statistical diagnostic research on clustered data across various models.

Figure 1 .
Figure 1. Ergodic mean plots. (a) Ergodic mean plot of the common factor. (b) Ergodic mean plot of the manifest variables.

Figure 5 .
Figure 5. Ergodic mean plots. (a) Ergodic mean plot of the common factor. (b) Ergodic mean plot of the manifest variables.

Table 1 .
Estimated results for the Gibbs sampling algorithm for clustered data.

Table 2 .
Estimated results for the principal component method for clustered data.

Table 3 .
Description of data variables.

Table 5 .
Estimated effects of the Gibbs sampling.

Table 6 .
Comparison of AIC and BIC values with Gibbs sampling and principal component analysis.