1. Introduction
HIV remains a global health challenge, with 39.9 million people living with the virus and 1.3 million new infections reported in 2023 [
1], highlighting persistent prevention gaps. While antiretroviral therapies have substantially reduced HIV-related mortality, prevention strategies lag behind the UNAIDS 95-95-95 targets for ending the epidemic by 2030 [
2]. Achieving these targets requires significant progress in HIV testing, treatment, and pre-exposure prophylaxis (PrEP) adoption. PrEP is highly effective for HIV prevention when taken as prescribed, reducing the risk of HIV acquisition from sex by about 99% [
3,
4]. However, real-world impact depends on effective uptake, adherence, persistence, and equitable access to services [
5,
6].
Current HIV risk assessment often relies on manual provider evaluations, which may fail to capture complex and evolving behavioral risk patterns, resulting in missed opportunities to identify PrEP-eligible individuals. Integrating advanced technology into healthcare systems is essential for optimizing PrEP delivery and uptake. Before machine learning (ML) was widely adopted, risk prediction relied on traditional approaches such as logistic regression models and rule-based risk scoring systems. Menza et al. developed a multivariable logistic regression model using behavioral risk factors and achieved modest predictive accuracy (AUC 0.66–0.67) for HIV acquisition among men who have sex with men (MSM) [
7]. Rule-based risk scoring systems, which convert regression-derived predictor weights into simplified point-based scores for clinical risk stratification, include the San Diego Early Test Score developed by Hoenigl et al. (C-statistic > 0.70) [
8] and the HIV risk score proposed by Yin et al. (C-index 0.70–0.71) [
9], both of which demonstrated moderate discriminatory performance.
In healthcare settings, ML approaches have been successfully validated for identifying PrEP candidates. Krakower et al. achieved an AUC of 0.86 using EHR data [
10], and Marcus et al. achieved an AUC of 0.84 with 3.7 million Kaiser Permanente members [
11]. Similarly, Saldana et al. reached 80% accuracy using surveillance data [
12], and Ridgway et al. demonstrated AI’s role in emergency departments for identifying eligible patients for counseling [
13]. More recent EHR-based work has further refined this approach: Nethi et al. developed a machine learning model using EHR data from a large urban health system (n = 458,893) to predict incident HIV diagnoses and prioritize patients for proactive PrEP outreach [
14], while May et al. proposed a generalizable automatic feature-engineering pipeline for HIV risk prediction across heterogeneous EHR systems, reporting consistent performance (AUC ≈ 0.87) and improved sensitivity for female patients [
15]. In population-based settings, Balzer et al. showed that ML algorithms in rural Kenya and Uganda could capture 50% of new HIV cases by targeting only 18% of the population, a significant improvement over traditional methods [
16]. Beyond EHR-based and population-based settings, ML has also been applied to specific key populations and newer model families. He et al. achieved an AUC of 0.88 for Chinese MSM using random forests [
17], and Xiang et al. highlighted the potential of deep learning approaches, including recurrent neural networks and convolutional neural networks, to reduce feature engineering requirements in HIV prevention research [
18]. Deep learning architectures have also been applied to relational and social-network data: Yu et al. recently demonstrated the use of explainable graph neural networks with domain adaptation to predict HIV infections among younger sexual minority men across two U.S. cohorts, illustrating both the modeling capacity and the cross-dataset transferability of such methods [
19]. Closely related to the objectives of the present work, recent studies have also examined missed opportunities for earlier HIV diagnosis and PrEP linkage. Weissman et al. applied least absolute shrinkage and selection operator (LASSO) regression to linked surveillance and all-payer healthcare records and showed that a substantial proportion of individuals later diagnosed with HIV had prior healthcare encounters, most commonly in emergency department settings, that represented missed screening opportunities [
20]. Such work underscores the importance of complementary, community-facing screening pathways that can intercept individuals before they present late to clinical care.
In parallel, large language models (LLMs) are emerging as decision-support and communication aids in HIV prevention. Govathson et al. developed an LLM-powered conversational app for stigma-free HIV vulnerability assessment and PrEP eligibility discussion among adults in South Africa, demonstrating feasibility and acceptability for supporting provider–patient interactions when consultation time is limited [
21]. These developments motivate the integration of generative AI for personalized communication in our framework while also reinforcing the need for clinician oversight, constrained prompting, and structured evaluation. Despite these advancements, challenges remain regarding data privacy, algorithmic transparency, and integration into routine care [
22,
23].
Although prior studies have demonstrated the value of machine learning for identifying individuals at elevated HIV risk or potential PrEP eligibility, most have focused primarily on predictive classification performance, treating risk classification as an endpoint. Many existing models were developed using electronic health record or surveillance data, which may not be readily available or transferable to community-based or digital health platforms. Less attention has been given to how model outputs can be translated into personalized, context-aware, user-facing recommendation content that supports communication, counseling, and real-world decision-making in digital health settings. While informatics research has advanced the design of clinical decision support systems across many domains, such system-level integration remains underexplored in community-based HIV prevention, particularly on digital platforms serving key populations. Similarly, while generative AI has been explored for general health communication, its application to HIV-specific PrEP recommendation guided by ML-derived risk profiles has not been reported. To the best of our knowledge, no prior study has integrated machine learning-based HIV risk stratification with generative AI to produce personalized, context-aware PrEP recommendation content within a single informatics framework.
This gap is important because prediction alone may be insufficient to support user engagement, counseling, and clinician-patient communication in routine HIV prevention practice, particularly in digital health settings where automated support may be needed. More broadly, AI in HIV prevention and care may support multiple functions, including risk prediction, decision support, workflow assistance, and personalized communication [
22,
23]. In this context, the present framework is positioned specifically as a health informatics-driven digital decision-support and recommendation system for HIV prevention, rather than as an autonomous clinical tool.
HIV remains a major public health challenge in Thailand. In 2023, 9100 new HIV infections were diagnosed and 580,000 people were living with HIV in the country [
1]. Key populations, including men who have sex with men (MSM), male sex workers (MSW), transgender women (TGW), and people who inject drugs (PWID), bear a disproportionate burden. As of June 2024, MSM accounted for 77.5% of new HIV cases, followed by MSW at 5.9% and TGW at 5.5%, highlighting the urgent need for targeted strategies [
24]. Extensive efforts have focused on expanding PrEP accessibility, most notably the Princess PrEP initiative. This community-led program provides free PrEP to high-risk populations and has successfully increased uptake [
25,
26].
Despite these advancements, stigma and unequal access persist. Although Princess PrEP was launched in 2016, coverage remains low: only 29,281 active PrEP users were reported in 2024, reaching just 41% of the national target [
24]. Data also suggest problems with persistence: the number of active users is well below the 79,291 people who have ever initiated oral PrEP [
27]. To address these barriers, key population-led health services (KPLHS) have shown promise in increasing accessibility and reducing barriers to PrEP uptake in Thailand [
25,
28].
To support Thailand’s HIV prevention efforts, this study aimed to develop and evaluate an integrated framework combining machine learning (ML) and generative artificial intelligence (GenAI) to (1) classify individuals as having high versus low HIV acquisition risk based on structured behavioral data and (2) generate personalized PrEP recommendation content. Using real-world data from Love2Test.org, we trained and compared multiple ML models through PyCaret [
29] and assessed the clinical relevance and validity of the integrated framework through independent physician evaluation. More broadly, this study contributes a generalizable health informatics architecture for integrating machine learning and generative AI within user-facing digital health systems, with design principles that extend beyond HIV-specific applications to other domains requiring automated risk inference and personalized communication.
4. Discussion
This study suggests the feasibility of integrating machine learning with generative AI to support personalized, data-informed PrEP recommendations within a digital health workflow. Beyond predictive performance, the proposed framework has potential to support clinician-aligned decision-making by translating structured behavioral risk patterns into recommendation content. The identification of key predictors, including multiple sexual partners, anal sex, condom use, and STI history, is consistent with prior HIV risk prediction studies and epidemiological evidence regarding HIV acquisition risk among MSM and transgender populations [
11,
34].
Importantly, the generative AI component extends the value of conventional risk prediction by translating model outputs into context-aware recommendation content. This may be useful in digital health settings, where risk classification alone may be insufficient to support communication, user understanding, and patient engagement [
35,
36]. Emerging evidence suggests that large language models may help bridge predictive systems and end users through personalized narrative explanations [
37]. In this context, the favorable physician evaluation observed in the present study is consistent with prior literature suggesting that generative AI may complement clinical judgment within structured decision-support workflows [
38].
We acknowledge that, given the limited number of structured indicators, comparable explanations could potentially be generated using predefined templates, and that generative models introduce non-determinism and a risk of hallucination. Accordingly, the present findings should not be interpreted as demonstrating the necessity or superiority of generative AI for this task. Future studies should directly compare template-based and generative approaches.
Compared with prior studies that focused primarily on predictive classification performance for HIV risk or PrEP eligibility [
39], the present framework extends this area by using generative AI to produce personalized prevention recommendations. At the same time, the use of AI in HIV prevention raises important concerns regarding algorithmic bias and digital health equity, particularly when training data may not fully represent all populations who could benefit from PrEP [
40,
41]. Because the framework communicates directly with users, formal fairness evaluation is important before real-world deployment. Future work should assess subgroup performance across demographic characteristics such as gender and geographic region to identify and mitigate potential bias. These concerns are especially relevant in low- and middle-income settings, where digital innovation may expand access to prevention services but should be implemented with careful attention to fairness, inclusion, and clinician oversight [
42,
43].
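The subgroup assessment proposed above can be illustrated with a minimal sketch that compares recall for the high-risk class across subgroups. The function names, the group labels "A" and "B", and the toy data are assumptions introduced for illustration only; they are not part of the study's code or results.

```python
from collections import defaultdict


def recall_by_group(labels, preds, groups):
    """Recall (sensitivity) for the high-risk class, computed per subgroup."""
    tp = defaultdict(int)  # high-risk cases correctly flagged
    fn = defaultdict(int)  # high-risk cases missed
    for y, p, g in zip(labels, preds, groups):
        if y == 1:
            if p == 1:
                tp[g] += 1
            else:
                fn[g] += 1
    # Only groups with at least one true high-risk case have a defined recall.
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


def recall_gap(per_group):
    """Largest difference in recall between any two subgroups."""
    return max(per_group.values()) - min(per_group.values())


# Illustrative toy data only (1 = high risk, 0 = low risk).
labels = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
preds = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]

per_group = recall_by_group(labels, preds, groups)
for name in sorted(per_group):
    print(f"group {name}: recall = {per_group[name]:.2f}")
print(f"recall gap = {recall_gap(per_group):.2f}")
```

Recall is used here because missed high-risk individuals are the error of greatest concern in this setting; other fairness criteria, such as subgroup precision or calibration, could be examined in the same way.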
From an implementation perspective, the framework has potential to support more standardized HIV risk stratification workflows and may, in principle, reduce some manual screening burden, particularly in resource-limited settings [
42,
43]. In Thailand, community-led PrEP delivery models and digital health platforms have increasingly supported linkage to HIV prevention services among key populations [
44,
45].
From a health informatics standpoint, this study proposes a modular architecture integrating structured data collection, machine learning inference, and generative AI within a unified digital health workflow. The proposed four-layer design is aligned with established informatics principles of modularity and interoperability and may support future adaptation to other digital health applications [
46].
From a human–computer interaction perspective, the framework embeds AI-generated recommendations directly within the user’s existing digital health workflow. Rather than presenting isolated risk scores, the system translates computational outputs into narrative-form recommendations intended to support user understanding and informed decision-making [
47]. This design reflects a user-centered informatics approach in which AI augments, rather than disrupts, the interaction flow between users and digital health services.
Taken together, this work differs from prior ML-based HIV risk prediction studies by addressing the full information pipeline from data acquisition through automated inference to user-facing content delivery, representing a system-level contribution to health informatics that extends beyond model performance alone.
Several limitations should be considered when interpreting the findings. First, the dataset was derived from a single digital platform in Thailand, which may limit the generalizability of the findings to other populations, healthcare systems, or sociocultural contexts. Second, the behavioral variables were based on self-reported information, which may be affected by recall bias, social desirability bias, or incomplete disclosure, particularly for sensitive sexual health behaviors. Third, although the model achieved very high predictive performance, including a high AUC, this result should be interpreted cautiously, as it may partly reflect overfitting or close alignment between the structured questionnaire items and the target labels within this dataset. Although internal validation was conducted, further external validation in independent and more diverse populations is necessary to confirm model robustness. Confidence intervals for performance metrics were not computed in this proof-of-concept study. Given the modest test set size (n = 400) and the acknowledged conceptual circularity between structured behavioral inputs and expert-defined labels, point estimates should be interpreted as descriptive summaries of internal discrimination rather than precise measures of generalizable predictive performance. Future validation studies using independent external datasets should report confidence intervals alongside point estimates.
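As an illustration of how confidence intervals could accompany point estimates in future validation work, the following minimal sketch computes a percentile bootstrap interval for the AUC. It is one common option rather than the method used in this study; the function names, the number of resamples, and the synthetic scores are assumptions made for the example.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus an approximate percentile bootstrap interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auc(labels, scores), lo, hi


# Synthetic scores for illustration only; these are NOT data from the study.
demo_rng = random.Random(42)
labels = [1] * 20 + [0] * 20
scores = ([demo_rng.gauss(0.7, 0.15) for _ in range(20)]
          + [demo_rng.gauss(0.4, 0.15) for _ in range(20)])

point, lo, hi = bootstrap_auc_ci(labels, scores, n_boot=500)
print(f"AUC {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

The pairwise AUC computation is quadratic in the number of cases, which is acceptable for a test set of a few hundred records but would need a rank-based implementation for larger datasets.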
A further limitation is that this proof-of-concept study reported only point estimates for Accuracy (ACC), Area Under the Receiver Operating Characteristic Curve (AUC), Recall (REC), Precision (PREC), F1-score (F1), Cohen’s Kappa, and Matthews Correlation Coefficient (MCC), and did not include fold-level variability statistics, such as mean, standard deviation, minimum, and maximum values across cross-validation folds. These measures can provide important information regarding the stability and consistency of model performance across different training–validation partitions and help assess whether reported results may be influenced by favorable data splits.
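The fold-level statistics described above are straightforward to derive once per-fold metric values are retained. The sketch below assumes those values are available as plain lists; the function name and the placeholder numbers are hypothetical and do not represent results from this study.

```python
import statistics


def summarize_folds(fold_scores):
    """Mean, sample SD, minimum and maximum of each metric across cross-validation folds."""
    summary = {}
    for metric, values in fold_scores.items():
        summary[metric] = {
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),  # sample SD; needs at least 2 folds
            "min": min(values),
            "max": max(values),
        }
    return summary


# Placeholder fold values for illustration only; these are NOT results from the study.
fold_scores = {
    "AUC": [0.97, 0.99, 0.98, 0.96, 0.99],
    "Recall": [0.90, 0.94, 0.88, 0.92, 0.91],
}

for metric, stats in summarize_folds(fold_scores).items():
    print(f"{metric}: mean={stats['mean']:.3f} sd={stats['sd']:.3f} "
          f"range=[{stats['min']:.2f}, {stats['max']:.2f}]")
```

A large standard deviation or a wide range relative to the mean would suggest that the reported point estimate depends on a favorable data split.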
Another limitation is that the present evaluation did not include a tailored rule-based classifier or a zero-/few-shot LLM-based comparator. Because the expert labeling criteria were themselves partly rule-based and derived from CDC-informed behavioral indicators, a simple deterministic or clinician-derived rule-based system might achieve comparable performance. Accordingly, the present findings should not be interpreted as establishing the superiority of machine learning over simpler decision-support approaches. Direct comparison against rule-based and LLM-based baselines therefore remains an important direction for future research.
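To make the idea of a rule-based comparator concrete, the sketch below shows a deterministic indicator count and its agreement with reference labels. The indicator names, the cutoff of two indicators, and the toy profiles are purely hypothetical; they are not the expert labeling criteria used in this study and are not clinical guidance.

```python
def rule_based_risk(profile, min_indicators=2):
    """Flag 'high' risk when at least min_indicators hypothetical indicators are present."""
    # Hypothetical indicator names chosen for illustration only.
    indicators = ("multiple_partners", "condomless_anal_sex", "recent_sti")
    score = sum(1 for key in indicators if profile.get(key, False))
    return "high" if score >= min_indicators else "low"


def agreement(rule_labels, reference_labels):
    """Proportion of cases where the rule output matches the reference label."""
    matches = sum(r == e for r, e in zip(rule_labels, reference_labels))
    return matches / len(reference_labels)


# Toy profiles and reference labels for illustration only.
profiles = [
    {"multiple_partners": True, "condomless_anal_sex": True, "recent_sti": False},
    {"multiple_partners": False, "condomless_anal_sex": False, "recent_sti": False},
    {"multiple_partners": True, "condomless_anal_sex": False, "recent_sti": False},
    {"multiple_partners": True, "condomless_anal_sex": True, "recent_sti": True},
]
reference = ["high", "low", "high", "high"]

rule_labels = [rule_based_risk(p) for p in profiles]
print(rule_labels)
print(f"agreement with reference labels: {agreement(rule_labels, reference):.2f}")
```

Evaluating such a baseline on the same held-out data, with the same metrics as the ML models, would show how much of the reported performance a simple rule set can already achieve.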
The framework also relies primarily on structured behavioral data, which may not fully capture the complexity of real-world HIV risk. Important contextual, psychosocial, interpersonal, and structural factors may be difficult to represent through fixed questionnaire variables alone. This reliance on structured inputs may reduce applicability in more complex real-world settings, where risk is dynamic and shaped by nuanced social and clinical circumstances.
A clinically important limitation of the selected model is the presence of 13 false-negative classifications, in which individuals with high HIV acquisition risk were predicted as low risk. In an HIV prevention context, such errors are especially concerning because they may result in missed opportunities for PrEP counseling and referral. Future model refinement should therefore prioritize reducing false negatives through approaches such as threshold adjustment, cost-sensitive learning, class weighting, and other optimization strategies that favor recall for the high-risk group [
48].
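Of the strategies listed above, threshold adjustment is the simplest to illustrate. The following minimal sketch selects the highest decision threshold that still reaches a target recall for the high-risk class; the function names, the target recall, and the toy probabilities are assumptions for illustration and do not come from the study.

```python
def confusion_at(labels, probs, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are predicted high risk."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pick_threshold(labels, probs, min_recall=0.95):
    """Highest threshold whose recall for the high-risk class reaches min_recall."""
    if not any(y == 1 for y in labels):
        raise ValueError("no high-risk cases in the data")
    # Recall can only rise as the threshold falls, so scan from the top down.
    for t in sorted(set(probs), reverse=True):
        tp, fp, fn, tn = confusion_at(labels, probs, t)
        if tp / (tp + fn) >= min_recall:
            return t
    raise ValueError("min_recall must be at most 1.0")


# Toy predicted probabilities for illustration only (1 = high risk).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.35, 0.6, 0.3, 0.2, 0.1, 0.45, 0.4]

print("default 0.5:", confusion_at(labels, probs, 0.5))
best = pick_threshold(labels, probs, min_recall=1.0)
print(f"threshold {best}:", confusion_at(labels, probs, best))
```

In this toy example, lowering the threshold removes the false negatives at the cost of additional false positives. In practice the threshold would be chosen on validation data rather than on the test set, so that the reported test performance remains unbiased.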
The physician evaluation also has limitations. Only four medical doctors participated, and each reviewed a limited number of recommendations. In addition, the ratings were based on Likert-scale assessments, which may involve some subjectivity. No inter-rater reliability statistics were calculated, and no comparator condition was included. Therefore, these findings should be interpreted as preliminary evidence of clinician-perceived acceptability rather than definitive evidence of superiority over simpler alternatives.
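One way future evaluations could quantify inter-rater reliability is Fleiss' kappa, sketched below. The sketch assumes that every recommendation is rated by the same number of physicians, which may not match the design of the present evaluation, and the ratings shown are hypothetical. Fleiss' kappa also treats Likert categories as unordered, so an ordinal measure such as Krippendorff's alpha or an intraclass correlation coefficient may be more appropriate.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels (one per rater)."""
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(ratings)
    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Number of agreeing rater pairs, scaled by the number of possible pairs.
        agree = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement.append(agree / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item_agreement) / n_items  # mean observed agreement
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical 5-point Likert ratings: 5 recommendations, 4 raters each.
ratings = [
    [5, 5, 4, 5],
    [4, 4, 4, 4],
    [3, 4, 4, 3],
    [5, 5, 5, 5],
    [2, 3, 3, 3],
]
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")
```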
An additional limitation concerns the transparency of the generative AI component. Although a structured prompt-based approach was used to combine the predicted HIV risk category with selected client profile variables, this study did not formally evaluate prompt sensitivity, alternative prompt formulations, or failure modes of the generated recommendations. Moreover, explainability of the generative component was addressed only at a practical design level rather than through formal explainable AI methods. Recent work has highlighted that explainability, usability, and user trust remain critical challenges for AI-based clinical decision support systems, particularly when such systems are intended for patient-facing or clinician-facing deployment [
49]. These considerations should be examined more systematically in future work, including the application of formal explainability frameworks, before any real-world clinical implementation.
In addition, this study was conducted entirely as an offline proof-of-concept evaluation focused on classification performance and clinician assessment of recommendation quality. No patients directly interacted with the system, and no measures related to stigma, usability, PrEP uptake, adherence, sustained engagement in care, or other real-world service outcomes were collected. Accordingly, claims regarding stigma reduction, scalable deployment, or real-world effectiveness, including any effect on PrEP uptake or adherence, remain hypothetical and should be tested in prospective implementation studies.
Finally, this study should be interpreted within its commercial and ethical context. PrEP is a commercial product, the study was funded by Gilead Sciences, and two co-authors are employees of Gilead Sciences. Accordingly, systems that may influence PrEP recommendation require careful attention to conflict-of-interest transparency, clinician oversight, and scientific independence. In this study, the generated recommendations were non-branded, non-prescriptive, and intended only for decision-support communication rather than promotion of any specific product. In addition, all analytic data were de-identified, the analytic workflow was led by the academic investigators, and the source code is publicly available for independent scrutiny.
Future research should address these limitations by incorporating more diverse datasets, integrating broader clinical and social determinants, and evaluating system performance in real-world implementation settings. Prospective studies should assess measurable outcomes such as PrEP initiation, persistence, adherence, and follow-up engagement with healthcare services. In addition, ensuring ethical AI governance, data privacy, fairness, and human oversight will be essential for safe and scalable deployment.