Article

Semi-Automatic Extraction and Analysis of Health Equity Covariates in Registered Research Projects

Daniel J. Epstein Department of Industrial & Systems Engineering, Information Sciences Institute, University of Southern California, Los Angeles, CA 90007, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 11853; https://doi.org/10.3390/app152211853
Submission received: 12 October 2025 / Revised: 2 November 2025 / Accepted: 3 November 2025 / Published: 7 November 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Advancing health equity requires rigorous analysis of how research initiatives incorporate and address structural disparities across populations. In this study, we apply large language models (LLMs) to systematically analyze research projects registered on the All of Us platform, with a focus on identifying patterns and institutional dynamics associated with health equity research. We examine the relationship between projects that explicitly pursue health equity goals and their use of available demographic data, their institutional composition (e.g., single- vs. multi-institutional teams), and the research tier of participating institutions (R1 vs. R2). Using the capabilities of an established LLM, we automate key tasks including the extraction of relevant attributes from unstructured project descriptions, classification of institutional affiliations, and the summarization of project content into standardized keywords from the Unified Medical Language System vocabulary. This LLM-assisted pipeline enabled scalable, replicable analysis of hundreds of projects with minimal manual overhead. Our findings suggest a strong association between the use of demographic data and health equity aims, and indicate nuanced differences in equity-oriented research participation by institution type and collaborative structure. More broadly, our approach demonstrates how LLMs can support equity-focused computational social science by transforming free-text administrative data into analyzable structures, enabling novel insights in public health, team science, and science-of-science studies.

1. Introduction

Health equity has been recognized as a significant concern in the United States [1,2,3,4,5], with Whitehead framing it in the early 1990s as “equal access to available care for equal need, equal utilization for equal need, equal quality of care for all” [6]. A framework created in the 1990s by the National Institute on Minority Health and Health Disparities similarly provides comprehensive definitions of health equity and emphasizes the importance of addressing factors to reduce health disparities [7].
The All of Us research platform is an ambitious initiative that envisions improving health equity (as one of its goals) by “catalyzing” an ecosystem of stakeholders in gathering and analyzing data from more than one million individuals residing in the United States [8]. Since its launch, the platform has witnessed many projects being registered and covering topics ranging from mental health to cardiovascular disease [9]. Registered projects are briefly annotated with both structured data (including demographic variables, if any, being used in the project) and more unstructured text data, such as anticipated findings.
Properly studied, these projects can be used to answer a rich set of open sociological questions involving covariates of projects that explicitly choose to address health equity as one of their goals [10,11,12]. In this article, we systematically analyze hundreds of registered All of Us project descriptions and associated metadata, across five broad topical clusters, to investigate the association between a project addressing health equity as a goal and that project (i) being led by a multi-institutional team (suggesting interdisciplinary research), (ii) making use of demographic variables (suggesting the value of All of Us in providing fine-grained data that facilitates health equity studies), and (iii) involving doctoral Carnegie-classified R2 universities (“high research activity”) compared with R1 institutions (“very high research activity”) [13], which would suggest greater institutional diversity in exploring health equity.
Our study contributes to a growing body of research on the prevalence and drivers of health equity projects currently being pursued in the United States and several other countries [14,15,16,17]. Some of this research is policy-driven, while other work focuses on the importance of health equity in implementation science. However, a systematic and quantitative review of registered projects that places health equity as a first-class subject of analysis has thus far been lacking. Because they involve personnel and resources, and are formulated with good-faith intent, such projects should be studied independently of whether they lead to tangible impacts like publications. Historically, data for such endeavors was not available, but platforms like All of Us have since made it so. We choose to analyze these projects directly, including bona fide descriptions of anticipated findings and a priori scientific questions of interest recorded when registering the study.
To bridge the gap between AI and health equity research, we conduct an exploratory feasibility demonstration of a scalable, end-to-end pipeline that converts granular, unstructured All of Us projects into data structures amenable to rigorous analysis. We assess whether our pipeline can reliably provide important variables for health equity analysis at scale. Since health equity research is fundamentally concerned with differences between population groups, access to research details and metadata is essential for conducting a large-scale, systematic analysis of the factors associated with health equity research. We demonstrate the utility of our pipeline using two research questions (RQs):
RQ1: What is the association between researchers’ using demographic data and the stated project’s focus on health equity?
Furthermore, the type and collaborative nature of an institution could significantly influence research priorities. Understanding this relationship is critical for identifying which academic environments currently drive this research and for informing policies that foster a more inclusive research ecosystem. We therefore consider the following research question as well:
RQ2: Are institutional factors, such as involvement of an R2 university and multi-institutional collaboration, associated with health equity research?
This multi-institutional perspective tests whether differences in missions of R1 and R2 institutions (research intensity vs. community-engaged teaching and regional service) map onto observable differences in health equity emphasis.
With these research questions as our motivation, we systematically investigate the covariates of projects on the All of Us platform. While this platform offers a rich ecosystem for research, its data is often granular and unstructured (as shown in Figure 1), making large-scale analysis methodologically non-trivial. Key variables required for our analysis—such as a project’s focus on health equity, the presence of a multi-institutional team, or an institution’s R1/R2 classification—are not explicitly coded and must be inferred from free text. To address this, our study introduces a comprehensive pipeline that employs a Large Language Model (LLM) to automate these complex data extraction and coding tasks. This method uses GPT-3.5 in conjunction with the Unified Medical Language System (UMLS), coupled with judicious manual verification, to obtain the necessary variables. It is important to note that our objective is not to benchmark the performance of different LLMs; rather, it is to demonstrate a scalable, end-to-end framework that makes the analysis of such data more efficient and accessible for addressing real-world computational social science questions.

2. Related Work

Recent advances in large language models (LLMs) have transitioned AI-assisted data analysis from proof-of-concept demonstrations to practical, real-world applications [18,19]. The integration of AI, particularly LLMs, into data analysis has been explored extensively, highlighting their potential in qualitative data analysis to significantly reduce manual labor [20]. Research has demonstrated how LLMs can effectively support data mining efforts by transforming unstructured text into structured, actionable knowledge. Additionally, agent-based LLM systems have been investigated for their capability to efficiently analyze and contextualize complex scientific datasets [21]. While these studies offer valuable methodological insights into the practical utility of recent LLMs, comprehensive, domain-specific end-to-end applications within computational social science remain notably underdeveloped. These advances are most compelling when applied to a large, heterogeneous corpus with real policy relevance; the All of Us Research Program provides precisely that substrate.
The All of Us Research Program aggregates extensive datasets, including electronic health records, genomic data, biosamples, and detailed socio-demographic metadata for over one million participants across the United States [8]. Beyond individual-level data, the program also provides a cloud-based research project directory detailing every approved study along with associated investigators and topical areas [22]. Previous analyses have leveraged this directory to integrate various data types—physical measurements, survey responses, EHRs, wearable device data, and genomic profiles—enabling robust methods for data harmonization and quality control, particularly in characterizing genetic ancestry [23]. Other research utilizing the All of Us directory has focused on evaluating the Precision Medicine Initiative (PMI), emphasizing the role of individual differences in genetics, environment, and lifestyle [24]. Despite its extensive utilization, researchers argue that the All of Us platform remains underexploited in exploring diverse healthcare domains [25]. However, drawing valid inferences from this heterogeneous text requires more than a generic information-extraction process: variables must be aligned to established health-equity constructs to avoid ad hoc labeling and hidden bias.
Health-equity frameworks emphasize social and structural determinants such as age, sex/gender, race/ethnicity, and neighborhood deprivation to avoid biased inference and masked disparities [26,27,28]. A growing literature examines how LLMs represent health-equity concepts, quantifies bias and hallucinations, and explores ways to constrain them [29,30,31]. This need for standards-aligned extraction and uncertainty handling informs the choice of modeling tools, balancing domain-specific NLP with general LLMs that can be grounded to controlled vocabularies. To our knowledge, there is no prior data science or analysis study that explicitly integrates an LLM as an information extractor and grounded knowledge generator with a formal health equity framework.
Biomedical natural language processing (NLP) has traditionally relied upon domain-specific transformer models such as BioBERT, SciBERT, and PubMedBERT [32,33,34]. However, recent benchmarks indicate that more generalized LLMs can achieve comparable or superior performance to specialized models in tasks such as concept recognition and UMLS linking [35,36]. Prior research on health equity frameworks has developed extensive vocabularies and conceptual corpora to represent health equity issues comprehensively [7,28]. Despite this significant progress, existing studies focused on various medical topics [37,38] highlight persistent gaps in biomedical text generation and mining, particularly noting inadequate annotation and representation within specific subdomains, such as health equity. Therefore, we adopt a general LLM, constrained by equity frameworks and normalized to controlled terminologies, as the front end of a pipeline that converts All of Us project text into an analysis-ready table described next in Materials and Methods.
Building on these topics, our framework delivers an auditable pipeline that grounds extraction in health-equity constructs and normalizes terms to a controlled schema. The resulting dataset enables the core analyses in this study (RQ1 and RQ2). We also report operational indicators (time, cost, reproducibility) to demonstrate practical scalability and to bridge directly to Section 3 and Section 4.

3. Materials and Methods

Because All of Us does not directly code for variables like health equity, variables flagging multi-institutional teaming, and variables indicating whether an institution is an R1 or R2 university, such variables must be inferred for hundreds of projects using the unstructured text descriptions and (where applicable) external data sources. We constructed such a pipeline (Figure 2) that makes judicious use of a large language model (LLM) like GPT-3.5 [39] for text analysis and coding of variables that traditionally required painstaking manual effort [40,41] and that may implicate potential cognitive biases [42,43,44].
In this section, we explain our end-to-end analysis pipeline. We begin with an overview of the All of Us Research Hub platform (Section 3.1), then introduce a brief conceptual framework (Section 3.2) that anchors variable definitions and cohort design to established health-equity standards and institutional mission profiles. We next describe the data acquisition and field extraction process used to convert unstructured data into structured CSV files (Section 3.3). To address the potential for duplicate project entries within the All of Us platform, we explain the cause of duplication and present our deduplication strategy in Section 3.4. Section 3.5 then details the data preprocessing steps, including data augmentation using GPT-3.5. Finally, we describe the statistical analysis used to derive our findings (Section 3.6).

3.1. The All of Us Research Hub

The All of Us Research Hub is a platform that hosts a large collection of health research projects from a diverse participant population across the U.S. It serves as a vital resource for researchers, providing access to over 13,000 registered medical projects. Each project contains detailed information, including research questions, purposes, approaches, anticipated findings, dataset types, and team details. The platform’s interactive data browser allows users to perform keyword searches, explore project data snapshots, and view comprehensive research descriptions, making it a potentially valuable tool for large-scale health equity research.
In this study, we used the All of Us Research Hub as our primary data source to investigate medical research projects. Figure 1 (top left) illustrates the platform’s interface, displaying a list of projects from the search results for the keyword “Diabetes”. We also show the detailed project information that is available when users select a project. While the hub offers a robust tool for exploring individual projects, some information needed for the analysis may not be available to researchers. For example, in considering the organizations that members of a project’s research team are affiliated with, All of Us does not code for the Carnegie classification of an academic organization (e.g., R1 vs. R2). Moreover, conducting large-scale manual analysis of thousands of projects requires additional preprocessing steps which can be time-consuming and distracting from the main analysis. Therefore, we used GPT-3.5, a large language model, to assist with data extraction tasks such as identifying institutional affiliations and extracting health equity-related keywords from the unstructured project descriptions. By automating these processes, we efficiently processed and analyzed medical projects across five targeted medical keywords, enabling a comprehensive analysis of the projects registered in the platform.

3.2. Theoretical Review

We align extraction and coding to established health-equity standards, such as the NIMHD framework, whose levels-of-influence model organizes equity constructs. This motivates our variable choices (core demographics, equity keywords, and access/SES proxies) and the use of uncertainty flags when project text is ambiguous, reducing drift from ad hoc labels and improving interpretability across topics (operational rules in Section 3.5).
Institutional differences are framed through a mission lens: R1 institutions emphasize scale and research infrastructure, whereas many R2 institutions emphasize regional and community-engaged scholarship. This yields two expectations that map to our research questions: projects that explicitly use demographic variables are more likely to state a health-equity aim (RQ1), and involvement of R2 institutions is associated with a higher proportional emphasis on equity framing even if R1 institutions contribute greater overall volume (RQ2). These expectations justify constraining the label space to standards-aligned categories, documenting uncertainty, and analyzing R1–R2 using exclusive cohorts (R1-only vs. R2-only) while summarizing mixed R1 + R2 teams separately (cohort definitions in Section 3.6).
We assess associations using topic-wise odds ratios with 95% confidence intervals from 2 × 2 contingency tables (Wald CIs), contrasting projects that state a health-equity aim vs. those that do not. To summarize across topics, we report Mantel–Haenszel pooled odds ratios with corresponding 95% confidence intervals and test for between-topic heterogeneity with Cochran’s Q (fixed-effects assumption for the pooled estimate). For Figure-based proportion contrasts (e.g., R1 vs. R2 keyword shares), we use two-sided binomial tests against 0.5. Unless noted otherwise, R1–R2 comparisons use exclusive cohorts (R1-only vs. R2-only), with mixed R1 + R2 teams summarized separately.
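The topic-wise odds ratios with Wald confidence intervals and the fixed-effects Mantel–Haenszel pooling described above can be sketched as follows. This is a minimal illustration with standard formulas, not the authors' actual analysis code; the cell counts in the usage example are invented for demonstration.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """OR and 95% Wald CI from a 2x2 table, where:
    a = equity aim & exposed,    b = equity aim & unexposed,
    c = no equity aim & exposed, d = no equity aim & unexposed."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

def mantel_haenszel_or(tables):
    """Fixed-effects Mantel-Haenszel pooled OR across topic strata,
    each stratum given as a (a, b, c, d) tuple."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Illustrative counts only (not from the study's data)
or_, lo, hi = odds_ratio_wald_ci(20, 10, 10, 20)
pooled = mantel_haenszel_or([(20, 10, 10, 20), (30, 15, 15, 30)])
```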

3.3. Data Acquisition and Field Extraction

We acquired data for the study from the All of Us platform on five topics of broad interest: asthma/pollution, cardiovascular disease, diabetes, dementia/Alzheimer’s, and mental health. For each topic, we ran simple keyword searches against the “scientific questions being studied” field to compile a topic-relevant list of projects, and retrieved a set of HyperText Markup Language (HTML) files for each subject from the search results, containing project details such as titles, goals and aims, approaches, and team members. Next, we extracted data from the HTML into Comma Separated Values (CSV) spreadsheets using a data extraction program based on a widely used web-text extraction Python 3.12 library [45]. To standardize outputs, we provide the LLM with three few-shot examples from a curated list of NIMHD concepts and require keyword generation to select from or map to these examples. When a source term does not match exactly, the prompt instructs the model to propose the closest NIMHD-consistent term and set an uncertainty flag. All generated keywords are then normalized to UMLS terms, with a recorded link to their NIMHD concept. We used GPT-3.5 to assist us in writing the extraction script, but manually verified its quality. We also conducted checks to detect missing values, nulls, or blanks. We processed 1305 project descriptions across five topics (mental health, dementia, cardiovascular, asthma/pollution, and diabetes), covering 617 unique institutions and 1927 listed team members; overall, 51.95% of projects explicitly used demographic categories. For scalability and cost, we report the API expense per record: a typical record with 1500 input and 300 output tokens costs about $0.0004 (range $0.0003–$0.001 for 0.8–2.0 K input and 150–500 output tokens), totaling roughly $0.4–$1.3 for all 1305 records. For reproducibility, we run the pipeline with identical system and user prompts, fixed seeds, and temperature 0. Key descriptive statistics are provided in Table 1.
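The per-record expense above is a simple linear function of token counts and per-token rates. A sketch of that arithmetic is below; the per-1K rates are illustrative placeholders chosen to land near the reported figures, not actual API pricing.

```python
def api_cost_usd(n_input_tokens, n_output_tokens,
                 rate_in_per_1k=0.0002, rate_out_per_1k=0.0003):
    """Estimate per-record API cost. Rates are illustrative
    placeholders, not a vendor's actual price list."""
    return (n_input_tokens / 1000) * rate_in_per_1k + \
           (n_output_tokens / 1000) * rate_out_per_1k

# A typical record (1500 input / 300 output tokens), scaled to the batch
per_record = api_cost_usd(1500, 300)
batch_total = 1305 * per_record
```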
We define two key attributes supporting RQ1 and RQ2. For demographic data, we create a flag for this attribute when a project explicitly reports a value for at least two of the three core fields: age, sex/gender, or race/ethnicity. We also flag a project as multi-institution (R1 + R2) when it includes both Carnegie R1 and R2 affiliations.
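The two flag rules above can be stated precisely in a few lines of code. This is a minimal sketch; the field names and input shapes are our own assumptions for illustration, not the pipeline's actual schema.

```python
# Core demographic fields per the rule in Section 3.3 (assumed key names)
CORE_FIELDS = ("age", "sex/gender", "race/ethnicity")

def uses_demographics(project_fields):
    """Flag a project when it explicitly reports values for at least
    two of the three core demographic fields."""
    reported = [f for f in CORE_FIELDS if project_fields.get(f)]
    return len(reported) >= 2

def is_mixed_r1_r2(carnegie_classes):
    """Flag a multi-institution (R1 + R2) team: both Carnegie tiers
    must appear among the team's institutional affiliations."""
    tiers = set(carnegie_classes)
    return "R1" in tiers and "R2" in tiers
```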

3.4. Deduplication of Projects

Next, we identified duplicate entries within our datasets from the compiled CSVs. Duplicates are due to two primary causes: (1) instances of projects being registered multiple times on All of Us (e.g., by different members of the same team), and (2) the occurrence of the same project under multiple related search keys (e.g., many projects overlap when independently using “dementia” and “alzheimers” as search keywords for retrieving results). We used a thresholded bag-of-words approach from the text analysis community frequently used in the deduplication of semi-structured data [46].
Specifically, for each project, we concatenated the text from the ‘title’, ‘scientific-questions-being-studied’, and ‘project-purpose’ fields into a single document. These documents were then converted into numerical vectors using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. We calculated the cosine similarity between all pairs of project vectors to quantify their textual overlap. A project pair was flagged as a duplicate if their similarity score exceeded a threshold of 0.8. A manual inspection of randomly sampled pairs confirmed that this method achieved 100% accuracy in identifying duplicates.
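The TF-IDF/cosine-similarity dedup step can be sketched as below. For self-containment we implement a minimal TF-IDF by hand rather than using the study's actual vectorizer; the documents and the 0.8 threshold follow the text, while tokenization and IDF weighting details are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: raw term frequency weighted by a smoothed
    inverse document frequency (log(n/df) + 1)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * (math.log(n / df[t]) + 1.0) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_duplicates(docs, threshold=0.8):
    """Return index pairs whose TF-IDF cosine similarity exceeds
    the duplicate threshold."""
    vecs = tfidf_vectors(docs)
    return [(i, j)
            for i in range(len(docs)) for j in range(i + 1, len(docs))
            if cosine(vecs[i], vecs[j]) > threshold]
```

In practice, each "document" is the concatenation of the title, scientific-questions, and project-purpose fields, as described above.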

3.5. Preprocessing and Keyword Extraction

We performed data preprocessing, both to extract more fine-grained information from project descriptions, and to infer additional fields from the original dataset. The former involved determining whether the project was being led by members of more than one institution (multi-institutional team), whether (and what) demographic variables were used, and total numbers of unique institutions and individuals. To determine multi-institutionality, we first concatenated the team members, with their roles and institutions, as a text string. Because this field does not have a fixed structure and cannot be parsed using a simple heuristic function, we elected to prompt GPT-3.5 on each concatenated string to extract three lists of (i) team member names; (ii) corresponding roles; and (iii) affiliated institutions. Using (iii), we further prompted GPT-3.5 to classify each institution as R1, R2, or neither. We manually sampled a small set of projects and metadata to verify the complete accuracy of this step. Figure 3 (top and middle) illustrates this process, showing how GPT-3.5 extracts structured information from unstructured project descriptions and classifies institutions into R1, R2, or neither. The key inferential field we sought to derive from the original data was a flag for whether the project had health equity as at least one of its goals. To do so in a robust way, we first provided GPT-3.5 with four fields (per project): aims and goals, questions, approaches, and findings. Next, we prompted it to extract the top 10 medical keywords using UMLS codes [47]. This yielded the top UMLS keywords each project is best related to. As shown in Figure 3 (bottom section), GPT-3.5 efficiently extracted UMLS-coded keywords, which were then used to flag projects for health equity relevance. Subsequently, we prompted GPT-3.5 to generate a comprehensive list of keywords related to “health equity,” recognizing that projects might address this concept in various ways.
In total, GPT-3.5 provided a list of 42 keywords, such as “social determinants of health,” “health disparity,” and so on. Any project flagged with at least one of these keywords was designated as pursuing health equity as one of its goals.
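The flagging rule above reduces to a set-intersection check between a project's extracted keywords and the 42-term equity list. A minimal sketch follows; the keyword subset shown is illustrative (only "social determinants of health" and "health disparity" are quoted in the text, the rest are hypothetical stand-ins for the remaining terms).

```python
# Illustrative subset of the 42 GPT-generated equity keywords;
# only the first two are quoted in the paper.
EQUITY_KEYWORDS = {
    "social determinants of health",
    "health disparity",
    "health equity",          # hypothetical stand-in
    "healthcare access",      # hypothetical stand-in
}

def pursues_health_equity(project_keywords):
    """A project is designated as pursuing health equity when any of
    its extracted UMLS keywords matches the equity list
    (case-insensitive, whitespace-normalized)."""
    normalized = {k.strip().lower() for k in project_keywords}
    return bool(normalized & EQUITY_KEYWORDS)
```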

3.6. Analysis

To investigate associations, we primarily use odds ratio (OR) analyses. All ORs are computed for individual topics, but where applicable, we report pooled ORs using fixed-effects Mantel–Haenszel pooling of ORs computed across all five topics. Based on the objectives stated earlier, we independently consider the association between a project pursuing health equity as a goal (outcome) and: (i) use of demographic fields, (ii) involving a multi-institution team, (iii) involving at least one R2 institution (with R1-only teams or institutions as baseline). For computing the OR in (iii), we exclude projects not affiliated with either R1 or R2 institutions, which results in a pruned dataset of 1131 projects out of the original 1305. To analyze co-occurrence strengths between keywords, we create a co-occurrence matrix that records the frequency of keyword pairs appearing together within projects. Using this matrix, we selected top 10 keyword pairs with the highest co-occurrence frequency across each of the five medical topics, illustrating the connection of keywords within and across topics. We visualize the co-occurrence using a chord diagram [48].
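The keyword co-occurrence counting described above can be sketched in a few lines; this is a simplified stand-in for the study's actual matrix construction, and the example project lists are invented.

```python
from collections import Counter
from itertools import combinations

def top_cooccurring_pairs(project_keyword_lists, k=10):
    """Count unordered keyword pairs appearing together within a
    project, then return the k most frequent pairs."""
    counts = Counter()
    for keywords in project_keyword_lists:
        # sorted(set(...)) canonicalizes pair order and drops repeats
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts.most_common(k)
```

The resulting top pairs per topic are what the chord diagram [48] visualizes.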

4. Results

4.1. Key Findings

This section presents the statistical findings that directly address the two research questions previously stated (RQ1 and RQ2). Our key statistical findings are provided in Table 2. Considering RQ1, we found a significant association between the use of demographic variables and health equity across all project topics (Mantel–Haenszel OR = 4.069, 95% CI 2.719 to 6.087), ranging from 73% higher odds in the dementia topical cluster (OR = 1.73, 95% CI 0.79 to 3.78) to 6.85× higher odds (OR = 6.85, 95% CI 4.08 to 11.50) in the cardiovascular cluster. Using Cochran’s Q test, the null hypothesis of homogeneity of odds was rejected (Cochran’s Q = 11,918.75, p < 0.001), suggesting that the topic potentially modifies the strength of the effect, and that the pooled estimate above should be treated with caution. The magnitudes of the OR effects are substantive, and except for dementia, they were all significant at the 95% confidence level.
Considering RQ2, we analyzed institutional factors that may correlate with a research project’s focus on health equity. Evidence that multi-institutional team projects have higher odds of addressing health equity as a goal is slight, and absent for the cardiovascular and asthma topics. The test for homogeneity was again rejected, suggesting a modifying effect of topic on each association. For mental health, the OR was 2.17 (95% CI 0.94 to 5.05); for dementia, 1.48 (95% CI 0.45 to 4.89); and for diabetes, 1.49 (95% CI 0.86 to 2.59). For asthma, the OR was 0, as none of the 24 health equity-related projects involved a multi-institutional team, while for cardiovascular, the OR was 0.88 (95% CI 0.45 to 1.74) and, like all other topics except asthma, not significant. Finally, compared to R1 institutions, R2 institutions had higher odds of engaging in projects related to health equity. Except for dementia, the OR was always higher than 1, but it was only significant for diabetes (OR = 1.89, 95% CI 1.15 to 3.09).
To further explore differences in research focus between R1 and R2 institutions (RQ2), Figure 4 uses the exclusive cohort (R1-only vs. R2-only); mixed R1 + R2 collaborations are excluded from these proportions and summarized separately. The figure shows the proportional distribution of projects across high-frequency keywords. A dominant pattern emerges: R1 institutions lead the majority of projects across most top keywords (e.g., “cardiovascular disease,” “risk factors,” “type 2 diabetes”). While the odds-ratio analysis suggests R2 institutions may be more likely to frame their work around health equity, R1 institutions still contribute a significantly higher volume of projects on most specific medical topics. However, for domains central to health equity, the proportional involvement of R2 institutions is more pronounced—e.g., “health disparities” (over 30%) and “social determinants of health” (over 25%). For “sociodemographic factors,” R1 and R2 involvement is nearly equal (not statistically significant). Together, these results suggest that despite R1’s higher overall volume, R2 institutions devote a larger share of their work to equity-centric areas. For each keyword, the proportion equals the number of projects from a group containing that keyword divided by the total number of R1-only plus R2-only projects containing that keyword; significance is assessed against 0.5 using a two-sided binomial test.
In sum, across topics, RQ1 is supported: projects that explicitly use demographic variables are substantially more likely to state a health-equity aim (Mantel–Haenszel pooled OR = 4.069, 95% CI 2.719–6.087), with topic-specific effects ranging from modest in dementia (OR = 1.73, 95% CI 0.79–3.78; not significant) to very strong in cardiovascular disease (OR = 6.85, 95% CI 4.08–11.50); heterogeneity was significant (Cochran’s Q = 11,918.75, p < 0.001), indicating topic modifies the association. For RQ2, institutional factors show mixed patterns: multi-institutional teams exhibit only limited evidence of higher odds of equity focus (most topic ORs not significant; e.g., mental health OR = 2.17, 95% CI 0.94–5.05; cardiovascular OR = 0.88, 95% CI 0.45–1.74), and the pooled estimate is not clearly different from 1 (1.377, 95% CI 0.821–2.307). By contrast, involvement of at least one R2 university trends positive overall (pooled OR = 1.522, 95% CI 0.979–2.366) and is significant in diabetes (OR = 1.89, 95% CI 1.15–3.09), while volume analyses show R1 institutions lead most topic-keyword counts but R2 institutions devote a comparatively larger share to equity-centric keywords—together suggesting complementary strengths rather than a uniform advantage.
We also performed a qualitative analysis of health equity-related keywords conceptualized from the NIMHD framework. We compared the keywords of R1 and R2 institutions to examine the differences in topic emphasis between these types of institutions. As shown in Figure 5, R1 institutions tend to focus more on topics such as “race/ethnicity,” “socioeconomic status,” “discrimination,” and “sex/gender,” with multiple categories co-occurring with these keywords. In contrast, R2 institutions cover a broader spectrum of topics, as evidenced by the greater number of keywords. Specific topics that are unique to R2 institutions include “Hispanic,” “Puerto Ricans,” and “housing instability.” This pattern underscores the differing approaches to health equity research between R1 and R2 institutions, suggesting that the contributions of the latter are important to the broader conversations on promoting and understanding drivers of health equity research. Detailed keyword comparisons for all categories are included as additional data.

4.2. Pipeline Validation and Reproducibility

We processed 1305 project descriptions across five topics (mental health 244; dementia 174; cardiovascular 388; asthma/pollution 92; diabetes 407), covering 617 unique institutions and 1927 listed team members; overall, 51.95% of projects explicitly used demographic categories. At this scale, the task is non-trivial even for human experts: it requires extracting precise medical keywords from heterogeneous text, mapping those terms to controlled vocabularies such as UMLS without loss of meaning, aligning outputs to health-equity frameworks like NIMHD to avoid ad-hoc labels, and auditing institutional metadata to detect multi-institution collaborations and R1/R2 status. Each step demands domain knowledge, careful disambiguation, and consistency checks—work that is slow, error-prone, and expensive to do manually across 1305 projects. This difficulty motivates a standards-aligned, auditable LLM pipeline that automates routine steps while flagging uncertainty for targeted expert review.
Time and cost indicators show substantial gains. The pipeline completed the full batch in T = 8.0 h (163 records/hour), whereas a conservative manual baseline of 6–10 min per record would require approximately 130–218 person-hours. This corresponds to a 94–96% reduction in person-hours (about 0.37 min per record via the pipeline vs. 6–10 min manually). Using typical Prolific annotator rates of $12–$18 per hour, fully manual labeling would cost about $1560–$3924 for 1305 records (approximately $1.19–$3.01 per record). The API run totaled $12.00 (about $0.009 per record), a 99.2–99.7% reduction in direct labeling cost. In practice, the manual cost is likely higher still, because this task requires adjudication and oversight from researchers with a healthcare background.
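The figures above follow from simple arithmetic on the quantities stated in the text; the following sketch reproduces them (the dollar range reflects the text's convention of rounding person-hours to 130–218 before multiplying by the hourly rate):

```python
# Back-of-envelope reproduction of the time/cost comparison: 1305 records,
# 8.0 pipeline hours, a 6-10 min/record manual baseline, $12-$18/hour
# annotator rates, and a $12.00 total API cost.
n_records = 1305
pipeline_hours = 8.0

records_per_hour = n_records / pipeline_hours              # ~163
pipeline_min_per_record = pipeline_hours * 60 / n_records  # ~0.37 min

manual_hours_low = n_records * 6 / 60    # 130.5 person-hours
manual_hours_high = n_records * 10 / 60  # 217.5 person-hours

# Person-hour reduction: ~94% (vs. 6 min/record) to ~96% (vs. 10 min/record)
reduction_low = 1 - pipeline_hours / manual_hours_low
reduction_high = 1 - pipeline_hours / manual_hours_high

manual_cost_low = 130 * 12               # $1560
manual_cost_high = 218 * 18              # $3924
api_cost_per_record = 12.00 / n_records  # ~$0.009

print(f"{records_per_hour:.0f} records/hour; "
      f"{reduction_low:.0%}-{reduction_high:.0%} fewer person-hours")
```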

5. Discussion

In this study, we demonstrated that GPT-3.5 can perform a range of medical information extraction tasks on free text, enabling thematic health equity analysis. Starting with a Python 3.12 script to extract data from HTML files hosted by the All of Us Research Hub, we used GPT-3.5 to automate the extraction of institutional affiliations, classify R1 and R2 institutions, and identify health equity-related keywords. These preprocessing steps are fundamental for accurate data analysis in medical research pipelines, reducing manual burden and increasing the efficiency of data preparation.
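As a dependency-free illustration of this first extraction step (the study's script used Beautiful Soup [45]; the field class names below are hypothetical, and real selectors would need to be chosen by inspecting the project-directory markup):

```python
# Sketch of pulling unstructured narrative fields out of a project page
# using only the standard library. Class names are illustrative assumptions.
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect text inside elements whose class matches a target field name."""
    def __init__(self, field_names):
        super().__init__()
        self.fields = {name: "" for name in field_names}
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in self.fields:
            self._current = cls

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data.strip()

sample = '<div class="scientific-questions">How do social determinants affect outcomes?</div>'
parser = FieldExtractor(["scientific-questions", "project-purpose"])
parser.feed(sample)
print(parser.fields["scientific-questions"])
```

The extracted field text then becomes the input to the downstream GPT-3.5 prompts.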
Our quantitative results suggest a strong positive association between the explicit use of demographic information in project descriptions and the presence of health equity aims. At a high level, this indicates that when teams operationalize demographics in their protocols or narrative fields, they are more likely to articulate equity-relevant objectives as well. Importantly, this association is not uniform across topics; field-specific norms and the maturity of demographic standards (e.g., in cardiovascular vs. dementia research) likely modulate both how demographics are recorded and how equity aims are framed. Observed topic-wise heterogeneity therefore should be understood as substantive rather than merely statistical, reflecting differences in how equity considerations are embedded in domain practice.
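The associations summarized here and in Table 2 are odds ratios with 95% confidence intervals. A minimal sketch of how such an estimate is computed from a 2×2 contingency table, using invented illustrative counts rather than the study's actual cell counts:

```python
# Odds ratio with a Wald-type 95% CI from a 2x2 table [[a, b], [c, d]],
# e.g. a = equity-aim projects using demographics, b = equity-aim projects
# not using them, c/d = the same split among non-equity projects.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Invented counts for illustration only (not the study's data)
print(odds_ratio_ci(60, 20, 40, 48))
```

An interval excluding 1 corresponds to a significant association, as flagged by the asterisks in Table 2.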
To complement our statistical analysis, we extend the discussion with a qualitative example that demonstrates the potential of LLM-assisted human analysis in medical research. Specifically, we selected a research project from the mental health category, which, as shown in Table 2, exhibited the strongest association with multi-institutional collaborations. This observation aligns with prior studies supporting collaboration as a means of enhancing the quality of mental health research. For instance, one study emphasizes that collaborations among researchers, clinicians, and individuals with mental illness are crucial for producing relevant, feasible, and ethical research [49]. Similarly, another explores organizational structures and barriers to collaboration with consumers in mental health research [50], advocating a systematic and strategic approach to advancing mental health consumer research and noting that collaborations are “always worth the extra effort.” Several other research initiatives likewise show the prominence of multi-institutional collaborations in mental health research [51,52,53], consistent with our statistical finding of a strong association between mental health projects and collaborative efforts across institutions.
A complementary pattern in our data is that while R1 institutions dominate total project volume across many topics, R2 institutions often devote a larger proportion of their projects to equity-focused concepts (e.g., social determinants of health, disparities, access). This suggests a division of strengths: R1 organizations provide scale and infrastructure, whereas R2 organizations may foreground community-proximal priorities. Program design and funding mechanisms that resource both scale (R1) and proportional equity emphasis (R2) could therefore be synergistic for national equity agendas.
In Figure 6, we present the project titled Classification of Mental Health Disorders and Social Determinants of Health, which addresses health equity and involves multiple institutions. This example highlights GPT-3.5’s ability to correctly classify Rutgers as an R1 institution and the City University of New York (CUNY) as R2. Furthermore, GPT-3.5 efficiently processed unstructured fields, including Scientific Questions, Project Purpose(s), Scientific Approaches, and Anticipated Findings, converting them into structured UMLS-coded keywords. The model effectively summarized complex medical terminology from raw text, for instance translating “psychiatrist diagnosis” into the UMLS keyword “Mental health disorders.” Moreover, GPT-3.5 identified several health equity-related keywords, with four out of ten terms—“Social determinants of health,” “Adverse experience,” “Neighborhood characteristics,” and “Racial and ethnic identities”—explicitly relating to health equity. These results underscore that GPT-3.5 can reduce the time researchers would otherwise spend on manual thematic analysis and labeling, allowing them to focus on the main statistical analysis. They also show that even relatively small LLMs, such as GPT-3.5, can support large-scale sociological analyses effectively. The results suggest the emerging potential of AI as a data analysis assistant, pointing toward future applications in which more advanced LLMs could automate the initial, labor-intensive stages of thematic analysis and data preprocessing.
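The keyword-generation step can be sketched as follows. The prompt wording, JSON schema, and CUI shown are illustrative assumptions rather than the study's exact prompt; in the real pipeline the reply string would come from the OpenAI chat API rather than a hard-coded mock:

```python
# Sketch of the UMLS keyword-extraction prompt and the parsing of the
# expected JSON reply. All specifics (prompt text, schema, CUI) are
# assumptions for illustration.
import json

def build_prompt(description: str) -> str:
    return (
        "Extract medical keywords from the project description below and "
        "map each to a UMLS concept. Reply as a JSON list of objects with "
        '"keyword" and "umls_cui" fields.\n\n' + description
    )

def parse_reply(reply: str) -> list[dict]:
    return json.loads(reply)

# Mock model reply, mirroring the example in the text where the raw phrase
# "psychiatrist diagnosis" is normalized to "Mental health disorders".
mock_reply = '[{"keyword": "Mental health disorders", "umls_cui": "C0004936"}]'
for item in parse_reply(mock_reply):
    print(item["keyword"], item["umls_cui"])
```

Requesting a fixed JSON schema in this way keeps the model's output machine-checkable, which is what makes the pipeline auditable at scale.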
Some limitations must be borne in mind when interpreting the study’s external validity. One issue is potential selection bias, since not all institutions and project leaders can make use of All of Us (although the program has opened its resources to a broad set of institutions and stakeholders). The dataset is also subject to technical limitations, such as the accuracy of the semi-automatic methods and the GPT-3.5 prompts used. Our preliminary analysis did not find evidence of hallucinatory behavior by the LLM, but some likelihood of it always exists, especially at scale [54,55,56,57]. While complete accuracy cannot presently be guaranteed for any automated text analysis, future replication of these analyses using other LLMs and topics would strengthen the conclusions.

6. Conclusions

A methodological consequence of this study is that, with judicious use, LLMs like GPT-3.5 and other advanced automated methods make it feasible to study the sociology of projects currently registered on the All of Us research platform, especially with respect to outcomes like health equity. These models enable us to obtain fine-grained data in a relatively unbiased and efficient manner compared to many hours of manual labeling. We used such methods to consider questions at the intersection of health equity, multi-institutional teaming, and the importance of making the All of Us platform and data broadly available, rather than only to “very high research activity” R1 institutions. Our methods suggest that answering other similar questions at scale may also be feasible, using the recent wave of commercial generative AI models that have become available at relatively low cost. As a policy matter, we also hope this work incentivizes researchers to enter high-quality metadata when registering their projects, as such metadata proves invaluable for conducting sociological analyses and showcasing the utility of an initiative like All of Us.
The study also demonstrates that even an inexpensive LLM, such as GPT-3.5, can effectively enable large-scale sociological analyses without removing the need for human oversight entirely. While a formal benchmark of different models was beyond our scope, future research should investigate how newer and more powerful LLMs can be integrated into additional stages of the analysis pipeline. For example, beyond text classification, these models could assist in the initial stages of hypothesis generation over the unstructured text of project descriptions. This would allow researchers to move from simply categorizing projects to discovering research trends with an LLM as an assistant. Such advancements could make these models an even more indispensable tool for healthcare researchers.

Author Contributions

N.N. was responsible for the primary aspects of the research, including data curation, experiments, analysis, and writing. M.K. contributed to research supervision, conceptualization, and revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The primary data on which our analyses are based are publicly available on the All of Us research directory: https://www.researchallofus.org/research-project-directory/ (accessed on 10 October 2025). All secondary data underlying this article can be made available upon request or upon publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Braveman, P. Health disparities and health equity: Concepts and measurement. Annu. Rev. Public Health 2006, 27, 167–194. [Google Scholar] [CrossRef] [PubMed]
  2. Baciu, A.; Negussie, Y.; Geller, A.; Weinstein, J.N.; National Academies of Sciences, Engineering, and Medicine; Committee on Community-Based Solutions to Promote Health Equity in the United States. The Need to Promote Health Equity. In Communities in Action: Pathways to Health Equity; National Academies Press: Cambridge, MA, USA, 2017. [Google Scholar]
  3. Marmot, M. Achieving health equity: From root causes to fair outcomes. Lancet 2007, 370, 1153–1163. [Google Scholar] [CrossRef] [PubMed]
  4. Farrer, L.; Marinetti, C.; Cavaco, Y.; Costongs, C. Advocacy for health equity: A synthesis review. Milbank Q. 2015, 93, 392–437. [Google Scholar] [CrossRef]
  5. Braveman, P.; Arkin, E.; Orleans, T.; Proctor, D.; Plough, A. What is health equity? Behav. Sci. Policy 2018, 4, 1–14. [Google Scholar] [CrossRef]
  6. Whitehead, M. The Concepts and Principles of Equity in Health; WHO, Regional Office for Europe: Geneva, Switzerland, 1990.
  7. National Institute on Minority Health and Health Disparities. NIMHD Research Framework. 2017. Available online: https://www.nimhd.nih.gov/resources/nimhd-research-framework (accessed on 11 October 2025).
  8. National Institutes of Health. All of Us: About. 2021. Available online: https://allofus.nih.gov/about (accessed on 6 April 2024).
  9. National Institutes of Health. All of Us: Research Projects Directory. 2024. Available online: https://allofus.nih.gov/protecting-data-and-privacy/research-projects-all-us-data (accessed on 6 April 2024).
  10. Bogard, K.; Murry, V.; Alexander, C. Perspectives on health equity and social determinants of health. In NAM Perspectives; National Academy of Medicine: Washington, DC, USA, 2017. [Google Scholar]
  11. Embrett, M.G.; Randall, G.E. Social determinants of health and health equity policy research: Exploring the use, misuse, and nonuse of policy analysis theory. Soc. Sci. Med. 2014, 108, 147–155. [Google Scholar] [CrossRef]
  12. Penman-Aguilar, A.; Talih, M.; Huang, D.; Moonesinghe, R.; Bouye, K.; Beckles, G. Measurement of health disparities, health inequities, and social determinants of health to support the advancement of health equity. J. Public Health Manag. Pract. 2016, 22, S33–S42. [Google Scholar] [CrossRef]
  13. Carnegie Classification of Institutions of Higher Education. Classification Methodology: Basic Classification. 2024. Available online: https://carnegieclassifications.acenet.edu/carnegie-classification/classification-methodology/basic-classification/ (accessed on 6 April 2024).
  14. Ostlin, P.; Schrecker, T.; Sadana, R.; Bonnefoy, J.; Gilson, L.; Hertzman, C.; Kelly, M.P.; Kjellstrom, T.; Labonte, R.; Lundberg, O.; et al. Priorities for research to take forward the health equity policy agenda. Bull. World Health Organ. 2005, 83, 948–953. [Google Scholar]
  15. Rasanathan, K.; Diaz, T. Research on health equity in the SDG era: The urgent need for greater focus on implementation. Int. J. Equity Health 2016, 15, 1–3. [Google Scholar] [CrossRef]
  16. Thomas, S.B.; Quinn, S.C.; Butler, J.; Fryer, C.S.; Garza, M.A. Toward a fourth generation of disparities research to achieve health equity. Annu. Rev. Public Health 2011, 32, 399–416. [Google Scholar] [CrossRef]
  17. Francés, F.; Parra-Casado, D. Participation as a driver of health equity. Gac. Sanit. 2019, 33, 96–98. [Google Scholar]
  18. Siiman, L.A.; Rannastu-Avalos, M.; Pöysä-Tarhonen, J.; Häkkinen, P.; Pedaste, M. Opportunities and challenges for AI-assisted qualitative data analysis: An example from collaborative problem-solving discourse data. In Proceedings of the International Conference on Innovative Technologies and Learning, Porto, Portugal, 28–30 August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 87–96. [Google Scholar]
  19. Gu, K.; Shang, R.; Althoff, T.; Wang, C.; Drucker, S.M. How do analysts understand and verify ai-assisted data analyses? In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–22. [Google Scholar]
  20. Dai, S.C.; Xiong, A.; Ku, L.W. LLM-in-the-loop: Leveraging large language model for thematic analysis. arXiv 2023, arXiv:2310.15100. [Google Scholar]
  21. Peasley, D.; Kuplicki, R.; Sen, S.; Paulus, M. Leveraging Large Language Models and Agent-Based Systems for Scientific Data Analysis: Validation Study. JMIR Ment. Health 2025, 12, e68135. [Google Scholar] [CrossRef]
  22. Ramirez, A.H.; Sulieman, L.; Schlueter, D.J.; Halvorson, A.; Qian, J.; Ratsimbazafy, F.; Loperena, R.; Mayo, K.; Basford, M.; Deflaux, N.; et al. The All of Us Research Program: Data quality, utility, and diversity. Patterns 2022, 3, 100570. [Google Scholar] [CrossRef]
  23. The All of Us Research Program Genomics Investigators. Genomic data in the all of us research program. Nature 2024, 627, 340–346. [Google Scholar] [CrossRef]
  24. Baxter, S.L.; Saseendrakumar, B.R.; Paul, P.; Kim, J.; Bonomi, L.; Kuo, T.T.; Loperena, R.; Ratsimbazafy, F.; Boerwinkle, E.; Cicek, M.; et al. Predictive analytics for glaucoma using data from the all of us research program. Am. J. Ophthalmol. 2021, 227, 74–86. [Google Scholar] [CrossRef]
  25. Douville, N.J.; Kertai, M.D.; Sheetz, K.H. Expanding the All of Us Research Platform into the perioperative domain. JAMA Surg. 2025, 160, 220–221. [Google Scholar] [CrossRef]
  26. Braveman, P.A. Monitoring equity in health and healthcare: A conceptual framework. J. Health Popul. Nutr. 2003, 21, 181–192. [Google Scholar] [PubMed]
  27. Peterson, A.; Charles, V.; Yeung, D.; Coyle, K. The health equity framework: A science-and justice-based model for public health researchers and practitioners. Health Promot. Pract. 2021, 22, 741–746. [Google Scholar] [CrossRef]
  28. Richardson, S.; Lawrence, K.; Schoenthaler, A.M.; Mann, D. A framework for digital health equity. NPJ Digit. Med. 2022, 5, 119. [Google Scholar] [CrossRef] [PubMed]
  29. Rodriguez, J.A.; Alsentzer, E.; Bates, D.W. Leveraging large language models to foster equity in healthcare. J. Am. Med. Inform. Assoc. 2024, 31, 2147–2150. [Google Scholar] [CrossRef] [PubMed]
  30. Pfohl, S.R.; Cole-Lewis, H.; Sayres, R.; Neal, D.; Asiedu, M.; Dieng, A.; Tomasev, N.; Rashid, Q.M.; Azizi, S.; Rostamzadeh, N.; et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 2024, 30, 3590–3600. [Google Scholar] [CrossRef]
  31. Iloanusi, N.J.; Chun, S.A. AI impact on health equity for marginalized, racial, and ethnic minorities. In Proceedings of the 25th Annual International Conference on Digital Government Research, Taipei, Taiwan, 11–14 June 2024; pp. 841–848. [Google Scholar]
  32. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  33. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar] [CrossRef]
  34. Han, Q.; Tian, S.; Zhang, J. A PubMedBERT-based classifier with data augmentation strategy for detecting medication mentions in tweets. arXiv 2021, arXiv:2112.02998. [Google Scholar]
  35. Groza, T.; Caufield, H.; Gration, D.; Baynam, G.; Haendel, M.A.; Robinson, P.N.; Mungall, C.J.; Reese, J.T. An evaluation of GPT models for phenotype concept recognition. BMC Med. Inform. Decis. Mak. 2024, 24, 30. [Google Scholar] [CrossRef]
  36. Rouhizadeh, H.; Yazdani, A.; Zhang, B.; Alvarez, D.V.; Hüser, M.; Vanobberghen, A.; Yang, R.; Li, I.; Walter, A.; Teodoro, D. Large language models struggle to encode medical concepts—A multilingual benchmarking and comparative analysis. medRxiv 2025. [Google Scholar] [CrossRef]
  37. Chen, Q.; Sun, H.; Liu, H.; Jiang, Y.; Ran, T.; Jin, X.; Xiao, X.; Lin, Z.; Chen, H.; Niu, Z. An extensive benchmark study on biomedical text generation and mining with ChatGPT. Bioinformatics 2023, 39, btad557. [Google Scholar] [CrossRef]
  38. Chen, Q.; Hu, Y.; Peng, X.; Xie, Q.; Jin, Q.; Gilson, A.; Singer, M.B.; Ai, X.; Lai, P.T.; Wang, Z.; et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat. Commun. 2025, 16, 3280. [Google Scholar] [CrossRef]
  39. OpenAI. Welcome to the OpenAI Developer Platform. 2024. Available online: https://platform.openai.com/docs/overview (accessed on 6 April 2024).
  40. Lewis, S.C.; Zamith, R.; Hermida, A. Content analysis in an era of big data: A hybrid approach to computational and manual methods. J. Broadcast. Electron. Media 2013, 57, 34–52. [Google Scholar] [CrossRef]
  41. Popping, R. Analyzing open-ended questions by means of text analysis procedures. Bull. Sociol. Methodol./Bull. Méthodol. Sociol. 2015, 128, 23–39. [Google Scholar] [CrossRef]
  42. Van Atteveldt, W.; Van der Velden, M.A.; Boukes, M. The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Commun. Methods Meas. 2021, 15, 121–140. [Google Scholar] [CrossRef]
  43. Barbosa, N.M.; Chen, M. Rehumanized crowdsourcing: A labeling framework addressing bias and ethics in machine learning. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–13. [Google Scholar]
  44. Tecimer, K.A.; Aghajani, E.; Bissyandé, T.F.; Klein, J.; Le Traon, Y. Detection and elimination of systematic labeling bias in code reviewer recommendation systems. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, Trondheim, Norway, 21–24 June 2021; pp. 222–231. [Google Scholar]
  45. Richardson, L. Beautiful Soup Documentation. 2007. Available online: https://readthedocs.org/projects/beautiful-soup-4/downloads/pdf/latest/ (accessed on 6 April 2024).
  46. Kejriwal, M.; Miranker, D.P. An unsupervised instance matcher for schema-free RDF data. J. Web Semant. 2015, 35, 102–123. [Google Scholar] [CrossRef]
  47. Bodenreider, O. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270. [Google Scholar] [CrossRef]
  48. Holtz, Y. Chord Diagram. 2024. Available online: https://r-graph-gallery.com/chord-diagram.html (accessed on 6 April 2024).
  49. Kohrt, B.A.; Upadhaya, N.; Luitel, N.P.; Maharjan, S.M.; Kaiser, B.N.; MacFarlane, E.K.; Khan, N. Authorship in Global Mental Health Research: Recommendations for Collaborative Approaches to Writing and Publishing. Ann. Glob. Health 2014, 80, 134–142. [Google Scholar] [CrossRef]
  50. Happell, B.; Gordon, S.; Bocking, J.; Ellis, P.; Roper, C.; Liggins, J.; Scholz, B.; Platania-Phung, C. ‘It is always worth the extra effort’: Organizational structures and barriers to collaboration with consumers in mental health research: Perspectives of non-consumer researcher allies. Int. J. Ment. Health Nurs. 2020, 29, 1168–1180. [Google Scholar] [CrossRef]
  51. Alarcón, R.D.; Parekh, A.; Wainberg, M.L.; Duarte, C.S.; Araya, R.; Oquendo, M.A. Hispanic immigrants in the USA: Social and mental health perspectives. Lancet Psychiatry 2016, 3, 860–870. [Google Scholar] [CrossRef]
  52. Mongelli, F.; Georgakopoulos, P.; Pato, M.T. Challenges and Opportunities to Meet the Mental Health Needs of Underserved and Disenfranchised Populations in the United States. Focus 2020, 18, 16–24. [Google Scholar] [CrossRef]
  53. Pearman, A.; Hughes, M.L.; Smith, E.L.; Neupert, S.D. Mental Health Challenges of United States Healthcare Professionals During COVID-19. Front. Psychol. 2020, 11, 2020. [Google Scholar] [CrossRef]
  54. Rawte, V.; Sheth, A.; Das, A. A survey of hallucination in large foundation models. arXiv 2023, arXiv:2309.05922. [Google Scholar] [CrossRef]
  55. Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is inevitable: An innate limitation of large language models. arXiv 2024, arXiv:2401.11817. [Google Scholar] [CrossRef]
  56. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Towards mitigating LLM hallucination via self reflection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  57. Martino, A.; Iannelli, M.; Truong, C. Knowledge injection to counter large language model (LLM) hallucination. In Proceedings of the European Semantic Web Conference, Crete, Greece, 28 May–1 June 2023; pp. 568–585. [Google Scholar]
Figure 1. An illustration of the All of Us Research Hub interface and an overview of our data processing pipeline. The top-left panel shows the project directory search interface, while the top-right displays the detailed, unstructured information available for a selected project. The bottom section illustrates the core methodological workflow, which includes: (1) Data Acquisition and Extraction, (2) Deduplication, (3) Keyword Extraction using LLM, and (4) Data Analysis, where the final enriched data is used for statistical analysis.
Figure 2. A simplified workflow illustrating data acquisition (from the All of Us platform), preprocessing, deduplication, and data augmentation (e.g., UMLS keyword extraction) steps. Code and GPT-3.5 prompts underlying these modules are discussed in the main text.
Figure 3. Three examples demonstrate how GPT-3.5 assists in data extraction and generation from unstructured text. In the first example (top), GPT-3.5 extracts the research team members, their roles, and affiliated institutions from an unstructured research-team description. In the second example (middle), GPT-3.5 processes a list of institutions (in red text) and returns each institution’s classification (academic or non-academic), location, and Carnegie classification (R1, R2, or neither). In the third example (bottom), GPT-3.5 extracts relevant medical keywords and their corresponding UMLS codes from the All of Us project descriptions, as provided in the prompt instructions.
Figure 4. Proportions of R1 and R2-institution projects for the (up to) top 10 highest-frequency keywords across all five topics common to both R1 and R2 projects. With the sole exception of “sociodemographic factors,” which was not significant at the 90% confidence level or above, the R1 proportion was significantly greater than 0.5 at the 99% confidence level or above for all other keywords. Proportions are based on R1-only vs. R2-only projects.
Figure 5. Two chord diagrams illustrating health equity-related keyword co-occurrence for R1 institutions (top) and R2 institutions (bottom). Each band connects two keywords, with band width proportional to co-occurrence frequency. Five band colors represent the topics (blue = asthma, purple = mental health, red = cardiovascular, green = dementia, yellow = diabetes), and grey bands represent health equity keywords.
Figure 6. An example of a project from the category with the highest odds of being associated with multi-institutional collaborations (mental health). The figure (left) shows project details related to the analysis, including the research team and scientific questions being studied. Next, the figure (middle) shows how GPT-3.5 prompting can help with information extraction and keyword generation for researchers to get relevant data for the analysis. Lastly, the figure (right) illustrates the results of LLM classification and keyword generation.
Table 1. Descriptive statistics on All of Us research project descriptions used in this study. The methodology for deduplication and the inference of additional fields such as institutions and individuals is detailed in the main text.
| Topic | Search Keyword(s) | # Registered Projects (Deduplicated) | # Unique Individuals Listed as Team Members | Avg. Individuals per Team | % Projects Using Demographic Categories | # Unique Institutions |
|---|---|---|---|---|---|---|
| Mental Health | “mental health” | 244 | 365 | 1.82 | 63.52% | 140 |
| Dementia and Alzheimer’s | “dementias”, “alzheimers”, “dementia” | 174 | 246 | 1.80 | 43.68% | 88 |
| Cardiovascular disease | “cardiovascular” | 388 | 588 | 1.86 | 50.00% | 153 |
| Asthma and Pollution | “asthma”, “pollution” | 92 | 128 | 1.67 | 45.65% | 60 |
| Diabetes | “diabetes” | 407 | 600 | 1.91 | 51.84% | 176 |
Table 2. Associations, using odds ratios (ORs) with 95% confidence intervals (CI) between projects (within a topic) pursuing health equity as a stated goal and the project, (i) being led by a multi-institutional team vs. single-institutional team (Column 2), (ii) making use of demographic variables vs. not making use (Column 3), and (iii) involving at least one doctoral Carnegie-classified R2 university vs. R1 institutions only (Column 4). Additionally, we use *, ** and *** to denote significant difference of the OR from unity at the 90, 95, and 99 percent confidence levels, respectively.
| Topic | Multi-Institutional Team | Demographic Variable Use | At Least One R2 University |
|---|---|---|---|
| Mental Health | 2.173 * (0.935, 5.050) | 3.635 *** (2.086, 6.337) | 1.296 (0.674, 2.493) |
| Dementia and Alzheimer’s | 1.481 (0.449, 4.894) | 1.729 (0.791, 3.778) | 0.564 (0.181, 1.776) |
| Cardiovascular disease | 0.883 (0.448, 1.743) | 6.846 *** (4.076, 11.498) | 1.746 * (0.983, 3.102) |
| Asthma and Pollution | 0 | 5.500 *** (1.926, 15.706) | 1.735 (0.292, 10.301) |
| Diabetes | 1.495 (0.862, 2.591) | 3.606 *** (2.309, 5.631) | 1.886 ** (1.154, 3.092) |
| Mantel–Haenszel pooled OR | 1.377 (0.821, 2.307) | 4.069 (2.719, 6.087) | 1.522 (0.979, 2.366) |

Share and Cite

MDPI and ACS Style

Nananukul, N.; Kejriwal, M. Semi-Automatic Extraction and Analysis of Health Equity Covariates in Registered Research Projects. Appl. Sci. 2025, 15, 11853. https://doi.org/10.3390/app152211853


