Analyzing Multiple Social Determinants of Health Using Different Clustering Methods

Social determinants of health (SDoH) have become an increasingly important area to acknowledge and address in healthcare; however, handling these measures in outcomes research can be challenging because the factors are inherently collinear. Here we discuss our experience using three statistical methods: exploratory factor analysis (FA), hierarchical clustering, and latent class analysis (LCA). We applied them to data collected using an electronic medical record social risk screener called the Protocol for Responding to and Assessing Patients' Assets, Risks, and Experiences (PRAPARE). The PRAPARE tool is a standardized instrument designed to collect patient-reported data on SDoH factors, such as income, education, housing, and access to care. A total of 2380 patients had complete PRAPARE and neighborhood-level data for analysis. We identified three composite SDoH clusters using FA, four clusters using hierarchical clustering, and four latent classes of patients using LCA. Our results highlight how different approaches can be used to handle SDoH, as well as how to select a method based on the researcher's intended outcome. Additionally, our study shows the usefulness of employing multiple statistical methods to analyze complex SDoH data gathered using social risk screeners such as the PRAPARE tool.


Introduction
Social determinants of health (SDoH) [1] impact a person's physical and mental health and wellbeing, and include social and economic factors such as housing, employment, and education, as well as area-level factors such as neighborhood socioeconomic status (SES), access to healthy food sources, and healthcare availability [2]. These factors contribute to health disparities [3,4], where certain populations experience worse health outcomes and higher rates of chronic conditions. In response, healthcare providers have begun to recognize the importance of addressing SDoH [5] to improve health outcomes and promote health equity, including the need to screen for social risks at the medical encounter.
The use of measures of social and environmental determinants of health within the electronic medical record (EMR) to guide population health initiatives has gained significant traction in recent years [6], with the emergence of standardized tools designed to capture SDoH information accurately and comprehensively in the EMR. One such tool is the Protocol for Responding to and Assessing Patients' Assets, Risks, and Experiences (PRAPARE) [7], developed to assist healthcare providers in identifying and addressing SDoH among patient populations through the EMR platform. The PRAPARE collects self-reported information from patients during healthcare encounters on a range of SDoH factors, including housing, food security, transportation, and employment. It also collects residential address information for geocoding, so that records can be linked to public area-level data. The tool has become a resource for healthcare providers to better understand the needs of their patients and develop targeted interventions to address SDoH challenges. For health providers and health services researchers, PRAPARE is a valuable tool for designing and customizing healthcare services to meet the specific needs of diverse patient populations. Using information from PRAPARE, providers can make informed decisions on resource allocation, program development, and policy formulation to improve the effectiveness of health services and enhance health equity in communities.
However, analyzing the data collected with PRAPARE can be challenging due to the complexity and multidimensional nature of SDoH [8]. PRAPARE surveys collect a wealth of data encompassing various SDoH domains, leading to large and complex datasets. Many SDoH variables and observations are interconnected, making it crucial to recognize patterns and structures within the data.
Despite the widespread use of PRAPARE, research is limited regarding which SDoH measures matter most [8], and which groupings of SDoH measures have the greatest impact on specific chronic disease outcomes [9] or on the social risk of specific groups of patients [10]. Advanced statistical methods that group SDoH may help extract meaningful insights from the data and identify patterns and relationships among the variables.
Here, we describe three statistical approaches we recently used to analyze PRAPARE data for a pilot study that sought to identify social risk phenotypes and link them to diabetes and obesity status among patients. These approaches are exploratory factor analysis, hierarchical clustering analysis, and latent class analysis. We also present the lessons learned as we compared the methods and eventually selected one methodological approach for presenting our pilot clinical findings.
Exploratory factor analysis [11][12][13] is a technique used to reduce a large number of variables into a smaller set of factors by identifying relationships and commonalities among multiple observed variables. This reduction not only simplifies the screening process, but also enhances the efficiency and interpretability of the subsequent analytical results. For instance, in the context of PRAPARE data, it might reveal that several variables related to income, education, and housing quality can be collectively understood as an "Economic Stability" factor. Observed variables associated with anxiety, depression, and stress may load onto a common factor called "Mental Health." Researchers can then use a set of questions to represent specific domains or factors, thus homing in on key areas for intervention.
Hierarchical clustering analysis [14][15][16][17] groups data points or patients based on observed similarities or differences in their variable values. It results in sets of clusters, where data points within each cluster are more similar to each other than to those in other clusters. For instance, it can identify a cluster of patients with similar experiences of housing instability, food insecurity, and transportation barriers.
Latent class analysis (LCA) [18,19] is also used to identify groups of individuals based on their responses to multiple categorical variables. However, it goes beyond clustering observed variables; its objective is to identify unobserved (latent) classes within a population based on patterns in the observed categorical responses. For instance, it can reveal distinct subgroups of individuals with specific combinations of socio-economic factors that influence their health outcomes.
Overall, using these statistical techniques to analyze PRAPARE data can help healthcare providers and researchers better understand the complex relationships between social determinants of health and health outcomes, and develop more effective strategies for addressing health disparities and inequities.

Study Design
This study aimed to explore the use of three statistical techniques (exploratory factor analysis, hierarchical clustering, and latent class analysis) to analyze data collected using the PRAPARE tool from a sample of 2380 patients who completed the PRAPARE survey in a primary care setting at our institution. The PRAPARE survey consists of questions that assess SDoH in five domains: (1) demographic information, (2) economic stability, (3) neighborhood and physical environment, (4) health care access and utilization, and (5) social and community context [20]. We used 16 self-reported variables from the PRAPARE: fear of a partner, veteran status, housing stability, incarceration history, housing insecurity, stress levels, social support, resource needs, safety in the current living environment, educational attainment, employment status, transportation access, insurance coverage, migrant or refugee status, and language proficiency. We also used three measures of area-level SDoH, defined by the census tract of the participant's residence at the time of assessment: the Social Vulnerability Index overall score, the rural-urban commuting code, and the Yost SES Index [21]. Demographic variables such as age, sex, and gender were used to describe our results but were not included in factor analysis or latent class analysis [9].

Statistical Analysis
The SDoH variables were first recoded so that all variables started from 1 and, to facilitate interpretation of the results, so that higher values indicated higher risk. Descriptive statistics were then used to summarize the frequencies and percentages for categorical variables, providing insight into the distribution of the data.
We conducted exploratory factor analysis, hierarchical clustering, and latent class analysis (LCA). All statistical analyses were carried out in R (version 4.0.5). Exploratory factor analysis was conducted using the 'fa' function from the R package 'psych' (version 2.2.5). Hierarchical clustering was performed using the 'NbClust' function (package version 3.0.1), and latent class analysis was conducted using 'poLCA' (version 1.6.0).
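The recoding step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `recode` and its arguments are hypothetical, and it assumes each variable is stored as integer codes with a known lowest and highest code.

```python
def recode(values, low, high, higher_is_riskier=True):
    """Map integer codes in [low, high] onto 1..(high - low + 1).

    If the original coding runs from high risk to low risk, set
    higher_is_riskier=False so the scale is flipped and larger
    recoded values always mean higher risk.
    """
    recoded = []
    for value in values:
        if not low <= value <= high:
            raise ValueError(f"code {value} outside [{low}, {high}]")
        if higher_is_riskier:
            recoded.append(value - low + 1)   # shift so the lowest code is 1
        else:
            recoded.append(high - value + 1)  # shift and reverse the scale
    return recoded


# Hypothetical binary item coded 0/1 where 1 already means "at risk".
at_risk_item = recode([0, 1, 1, 0], low=0, high=1)

# Hypothetical 3-level item coded 0..2 where 0 is the riskiest level.
reversed_item = recode([0, 1, 2], low=0, high=2, higher_is_riskier=False)
```

Passing the code range explicitly, instead of inferring it from the observed data, keeps the mapping stable even when a response level happens to be absent from a given sample.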

Exploratory Factor Analysis
Factor analysis is a statistical technique that is used to reduce data to a smaller set of summary variables, and to explore the underlying theoretical structure of the phenomena. Here, we used exploratory factor analysis (EFA) [22] to identify the underlying factors that contribute to the observed variation in the data. The number of constructs or factors was determined based on the number of eigenvalues greater than 1 [23]. To evaluate and verify the constructs found in the exploratory factor analysis, we used confirmatory factor analysis (CFA). The evaluation criteria included root mean squared error of approximation (RMSEA), standardized root mean residual (SRMR), and goodness of fit index (GFI) [24,25]. RMSEA measures the discrepancy between the model and the observed data, with lower values indicating a better fit. SRMR measures the difference between the model-implied and observed covariance matrices, with values less than 0.08 indicating a good fit.
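The eigenvalue-greater-than-1 rule mentioned above can be illustrated with a small sketch. It is written in plain Python for self-containment and is not the 'psych' routine used in the study; the helper names and the toy correlation matrix are assumptions made for illustration. Eigenvalues are obtained with cyclic Jacobi rotations, which in practice would be replaced by a linear algebra library call.

```python
import math


def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [list(row) for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_factor_count(correlation_matrix):
    """Number of eigenvalues strictly greater than 1."""
    return sum(1 for value in symmetric_eigenvalues(correlation_matrix) if value > 1.0)


# Toy correlation matrix: two pairs of items, correlated 0.6 within each pair
# and uncorrelated across pairs (made-up values, not study data).
toy_corr = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]
eigenvalues = symmetric_eigenvalues(toy_corr)
n_factors = kaiser_factor_count(toy_corr)
```

In this toy case each correlated pair contributes one eigenvalue of 1.6 and one of 0.4, so the rule retains two factors, one per pair of related items.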

Hierarchical Clustering
Hierarchical clustering was also performed to identify the groups of observations that were most similar to one another based on how they answered the questions. Patients within a cluster are similar to each other and different from those in other clusters. We used hierarchical clustering with Euclidean distance as the measure of dissimilarity [26] and "ward.D2" [27] as the agglomeration method. The Euclidean distance measure allows the algorithm to group observations that are closer together in terms of their characteristics, and "ward.D2" aims to minimize the total within-cluster variance. To determine the number of clusters, we used the "friedman" index, which selects the number of clusters based on the maximum difference between hierarchy levels [28]. This approach allowed us to identify the optimal number of clusters and create meaningful clusters based on the underlying structure of the data.
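To make the minimum-variance idea concrete, the sketch below implements a naive agglomerative procedure in plain Python: at each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares, which is Ward's criterion under Euclidean distance. It is an illustration only, not the 'NbClust' code used in the study; the function names and toy data are hypothetical, the number of clusters is passed in directly, and the "friedman" index is not reproduced.

```python
def ward_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    n_a, n_b = len(cluster_a), len(cluster_b)
    centroid_a = [sum(column) / n_a for column in zip(*cluster_a)]
    centroid_b = [sum(column) / n_b for column in zip(*cluster_b)]
    squared_distance = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return n_a * n_b / (n_a + n_b) * squared_distance


def ward_clusters(points, n_clusters):
    """Greedy agglomerative clustering using Ward's merge criterion."""
    clusters = [[point] for point in points]  # start with singletons
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Pick the pair whose merge increases within-cluster variance least.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy "patients" described by two recoded risk scores (made-up values).
toy_patients = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
toy_clusters = ward_clusters(toy_patients, n_clusters=2)
```

This brute-force version recomputes every pairwise cost at each step, so it is only suitable for small examples; library implementations update distances incrementally.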

Latent Class Analysis (LCA)
We defined the hypothesized unobserved classes through LCA. The process of LCA begins by fitting a baseline model with only one latent class. Then, the number of latent classes is increased iteratively, and the model is re-estimated at each step until an appropriate fit is obtained [29]. To assess model fit and determine an appropriate number of latent classes, we used the Bayesian information criterion (BIC) [30] and entropy [31]. BIC is a parsimony measure that balances model complexity with goodness of fit, while entropy measures the degree of uncertainty in assigning individuals to classes. An entropy value close to 1 indicates a high degree of certainty in class assignment, which is considered ideal.
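The two fit measures can be written out explicitly. The sketch below uses the standard definition of BIC and a commonly used normalized (relative) entropy computed from posterior class-membership probabilities; whether this matches the exact entropy statistic reported in the study is an assumption, and all function names and input values are hypothetical.

```python
import math


def lca_parameter_count(n_classes, categories_per_item):
    """Free parameters of a basic latent class model (no covariates)."""
    item_parameters = n_classes * sum(k - 1 for k in categories_per_item)
    class_size_parameters = n_classes - 1
    return item_parameters + class_size_parameters


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion; lower values are preferred."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)


def relative_entropy(posteriors):
    """Normalized entropy in [0, 1]; values near 1 mean clear class assignment.

    posteriors: one row per individual, holding that individual's posterior
    probability of belonging to each latent class (each row sums to 1).
    """
    n_individuals = len(posteriors)
    n_classes = len(posteriors[0])
    uncertainty = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - uncertainty / (n_individuals * math.log(n_classes))


# Made-up inputs for illustration only.
n_params_three_class = lca_parameter_count(3, [2] * 16)  # 16 binary items
example_bic = bic(log_likelihood=-100.0, n_parameters=2, n_observations=100)
certain = relative_entropy([[1.0, 0.0], [0.0, 1.0]])
uncertain = relative_entropy([[0.5, 0.5], [0.5, 0.5]])
```

Comparing models then amounts to fitting each candidate number of classes, computing both quantities, and weighing a lower BIC against a higher entropy and the interpretability of the classes.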

Results
Of the 2380 participants used for analysis, the mean age was 53.1 years (SD 16.3), over half of the participants were female (59.0%), and about 50% of the participants were Black (Table 1).

Exploratory Factor Analysis
We conducted an exploratory factor analysis on 16 SDoH measures after excluding three variables (incarcerated, refugee, migrant) due to their low prevalence. The analysis yielded three factors with eigenvalues greater than 1, indicating that they are meaningful (Figure 1). The first factor was labeled "Adverse Neighborhood" and was based on two neighborhood-level SDoH variables: the overall Social Vulnerability Index (SVI) and the Yost index. The second factor was labeled "Social Insecurities and Safety" and consisted of four variables: housing status, live safe, house insecurity, and afraid of partner. The third factor was labeled "Social Needs" and consisted of four variables: resource needs, stress, transportation, and employment (Figure 1). The root mean squared error of approximation was 0.02, which indicates that the exploratory factor analysis model fit was acceptable [32]. The results suggest that SDoH questions on the PRAPARE are not independent and can be grouped into meaningful factors.

Hierarchical Clustering
We included all 22 variables (including age, sex, and gender) in the analysis, and classified age into three categories with cutoffs at 39 and 59 years. To determine the ideal number of clusters in the dataset, we used the "friedman" index introduced in the Methods section and explored cluster numbers ranging from 2 to 15. Based on the analysis, we identified four distinct clusters of patients. As shown in Table 2, we described the clusters according to SDoH and other demographics: (1) older, mostly optimal SDoH; (2) housing and resource needs, high stress, unemployed, suboptimal SDoH; (3) predominantly male, most suboptimal SDoH reported; (4) predominantly middle aged, optimal SDoH. Table 2 presents the distribution of patients in each of these clusters. Figure 2 presents a dendrogram generated by plotting the hierarchical cluster object. The dendrogram provides a visual representation of the relationships and clustering patterns within our data; each branch and grouping signifies the degree of similarity or dissimilarity among the patients being analyzed. Overall, this method shows that clusters differing by SDoH status and race exist among patients.

Latent Class Analysis
Gender, age, and race covariates were not included. Based on the Bayesian information criterion (BIC), the latent class analysis identified four distinct classes of patients who shared similar response patterns, as shown in Table 3. The entropy of 0.745 indicated a high degree of certainty in class assignment, which suggests that the classes are distinct and well-defined. Our first look at the analysis revealed four classes that were labeled based on their socioeconomic status and social needs: (1) "High Socioeconomic Status, Low Social Needs"; (2) "Moderate Socioeconomic Status, Moderate Social Needs"; (3) "Low Socioeconomic Status, High Social Needs"; and (4) "Very Low Socioeconomic Status, Very High Social Needs". These four classes accounted for 18%, 27%, 45%, and 10% of the population, respectively. Figure 3 shows a screen capture of the model estimation, which provides a visual representation of the four estimated latent classes. After further inspection, we found that class 1 (Low Social Needs) and class 2 (Moderate Social Needs) were similar in nature. Moreover, there was minimal divergence in BIC between the model with three classes and the one with four classes (Table 3). To facilitate better interpretation and clinical usage, we opted for a three-class model, which resulted in an increased entropy of 0.841. These three classes represented 31.8%, 58.2%, and 10% of the population. The probabilities of being positively rated on each variable, by latent class, are illustrated in Figure 4 for the three-class model. These classes, or SDoH phenotypes, were used in a clinical paper that linked phenotypes to health outcomes (under review). The classes were categorized as follows: (1) low-social risk; (2) adverse neighborhood SDoH; and (3) high-social risk.

Discussion
For this investigation, we conducted exploratory factor analysis, hierarchical clustering, and latent class analysis to uncover the patterns and relationships within SDoH variables collected through the PRAPARE tool. These statistical techniques can provide insights into the underlying structure of the data, identify key social determinants of health, and help to identify subgroups of patients who may require tailored interventions. Our results show how the approaches can be used to achieve different objectives based on the needs of the researcher or study.
Due to the collinearity of SDoH factors, researchers have grappled with how to handle these measures, with approaches including clustering methods, creating indices using PCA, and, most recently, polysocial risk scores [33]. The decision to use exploratory factor analysis, hierarchical clustering, and latent class analysis in our research reflects a deliberate effort to approach patient clustering from multiple angles, acknowledging the diverse aspects of the underlying data. Although these methods were initially applied in a patient setting, they hold the potential for broader applications in various settings, extending their value not only to healthcare, but also to diverse research endeavors.
Exploratory factor analysis allowed us to structure the multiple SDoH variables into three factors: adverse neighborhood, social insecurities and safety, and social needs. The confirmatory factor analysis supported the validity of this structure. This finding provides insights into the potential underlying structure of the data, and informs how future scales might group concepts to describe specific domains, such as housing or food insecurity [34]. This result aligns with the study of Wan et al. [9], who identified three composite clusters among the 22 PRAPARE SDoH factors.
Hierarchical clustering provided a different lens, identifying four distinct clusters of patients based on similarities in how they answered the questions: older, mostly optimal SDoH; housing and resource needs, high stress, unemployed, suboptimal SDoH; predominantly male, most suboptimal SDoH reported; and predominantly middle aged, optimal SDoH. This approach offers a complementary perspective, focusing on patient groupings in a way that aligns with certain shared responses to specific questions [35]. To our knowledge, our study is the first to cluster PRAPARE patients using hierarchical clustering.
Latent class analysis allowed us to identify three unobserved classes (low-social risk; adverse neighborhood SDoH; high-social risk) that influence the self-reported or measured SDoH variables; it emphasized capturing patterns in responses that might not be immediately apparent through other clustering techniques. We ultimately used this approach to inform our outcome modeling of diabetes and obesity status in the original study.
While results varied by method, they did so only slightly, indicating the robustness of the patterns that we found in our particular data. The use of exploratory factor analysis, hierarchical clustering, and latent class analysis consistently revealed disparities in critical areas such as Social Vulnerability Index (SVI), housing status, living conditions, housing insecurity, and stress. For instance, hierarchical clustering identified significant differences in the percentage of low social vulnerability among clusters, with clusters 3 and 4 demonstrating lower rates than clusters 1 and 2. Similarly, in latent class analysis, class 2 (adverse neighborhood SDoH) and class 3 (high social risk) had a much higher rate of high social vulnerability than class 1 (low social risk). At the same time, each technique offered unique advantages and perspectives, contributing to a well-rounded interpretation of the complex relationships within the dataset. Considering this, investigators need to determine their objective with the data when selecting a method: (1) to identify underlying constructs of questions asked to individuals (factor analysis); or (2) to identify segments or clusters of people based on shared characteristics (hierarchical clustering or LCA). Here we wanted to examine the latter, which could help healthcare providers develop more targeted and effective strategies for addressing health disparities and promoting health equity. Factor analysis, by contrast, is helpful when developing screeners and questions pertinent to SDoH, and in discovering domains that may exist in the data or among patients.
Our study has several implications from both a research and clinical perspective. For instance, FA may prove to be more useful when developing social risk and/or SDoH screeners with good content validity. Hierarchical clustering may help to pool patients based solely on shared responses to inform clinical approaches and interventions. Latent class analysis may help to identify subtypes of patients based on shared SDoH for intervention development and tailoring. Investigators should choose approaches based on their intended outcomes; however, it is important to highlight here that all three approaches produced similar results and confirm that SDoH do indeed cluster into distinct categories.
Our study has several limitations. One limitation is that the applied techniques are reliant on the quality and completeness of the data gathered with the PRAPARE tool. If the data are incomplete or inaccurate, the validity and reliability of the results can be affected. Therefore, it is important to ensure that the data are collected using standardized and validated measures, and that any missing data are handled appropriately. While our results here are a demonstration of different methods, it is important to highlight this limitation. It is also important to consider the potential ethical implications of using these techniques, which were not the main focus of this study. For example, clustering or latent class analysis could potentially result in stigmatization or discrimination against certain subgroups of patients. Therefore, it is important to ensure that the results are used in a responsible and ethical manner, and that appropriate safeguards are put in place to protect patient privacy and confidentiality.

Conclusions
In our study, which involved 2380 participants and examined 19 SDoH variables, three factors emerged through factor analysis: adverse neighborhood, social insecurities and safety, and social needs. Hierarchical clustering revealed four distinct patient clusters: older, mostly optimal SDoH; housing and resource needs, high stress, unemployed, suboptimal SDoH; predominantly male, most suboptimal SDoH reported; and predominantly middle-aged, optimal SDoH. Furthermore, latent class analysis identified three unobserved classes related to social risk: low-social risk, adverse neighborhood SDoH, and high-social risk. These results demonstrate the reliability and validity of scoring tools. Future work should explore the use of these tools for improving population health outcomes.
Our paper primarily targets healthcare professionals who utilize the PRAPARE tool in real-world settings, and healthcare researchers interested in disseminating descriptive findings on SDoH and assessing the relationships between SDoH and patient outcomes.While this paper involves statistical methodologies, it is crafted to be accessible to a broad audience, including practitioners and individuals with varying levels of statistical proficiency.

Figure 1. Structure of PRAPARE SDoH factors by factor analysis. Abbreviation: PRAPARE, Protocol for Responding to and Assessing Patients' Assets, Risks, and Experiences.

Figure 2. A dendrogram generated by plotting the hierarchical cluster object. Rectangles around the branches of the dendrogram highlight the corresponding clusters. Objects located in the same cluster share similar characteristics with each other.

Figure 3. Estimation of the four-class basic latent class model, obtained by setting graphs = TRUE in the poLCA function call. Each group of red bars represents the conditional probabilities, by latent class, of being rated positively by each of the 16 SDoH measures. Taller bars correspond to conditional probabilities closer to 1 of a positive rating.

Figure 4. Estimation of the three-class basic latent class model, obtained by setting graphs = TRUE in the poLCA function call. Each group of red bars represents the conditional probabilities, by latent class, of being rated positively by each of the 16 SDoH measures. Taller bars correspond to conditional probabilities closer to 1 of a positive rating.

Table 1. Characteristics of the study population.

Table 2. Characteristics of the study population in each cluster produced using hierarchical clustering.

Table 3. Fit statistics of different latent classes using the latent class analysis method.
Note: bolded values indicate the criteria used to compare Models 3 and 4 to determine the best model fit.
Int. J. Environ. Res. Public Health 2024, 21, 145
