AI-HOPE-TGFbeta: A Conversational AI Agent for Integrative Clinical and Genomic Analysis of TGF-β Pathway Alterations in Colorectal Cancer to Advance Precision Medicine

Yang, Ei-Wen; Waldrup, Brigette; Velazquez-Villarreal, Enrique

doi:10.3390/ai6070137


by Ei-Wen Yang ¹, Brigette Waldrup ² and Enrique Velazquez-Villarreal ²,³,*

¹ PolyAgent, San Francisco, CA 94102, USA

² Department of Integrative Translational Sciences, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA

³ City of Hope Comprehensive Cancer Center, Duarte, CA 91010, USA

* Author to whom correspondence should be addressed.

AI 2025, 6(7), 137; https://doi.org/10.3390/ai6070137

Submission received: 21 May 2025 / Revised: 19 June 2025 / Accepted: 23 June 2025 / Published: 24 June 2025

(This article belongs to the Section Medical & Healthcare AI)


Abstract

Introduction: Early-onset colorectal cancer (EOCRC) is rising rapidly, particularly among Hispanic/Latino (H/L) populations, who face disproportionately poor outcomes. The transforming growth factor-beta (TGF-β) signaling pathway plays a critical role in colorectal cancer (CRC) progression by mediating epithelial-to-mesenchymal transition (EMT), immune evasion, and metastasis. However, integrative analyses linking TGF-β alterations to clinical features remain limited, particularly for diverse populations, which hinders translational research and the development of precision therapies. To address this gap, we developed AI-HOPE-TGFbeta (Artificial Intelligence agent for High-Optimization and Precision Medicine focused on TGF-β), the first conversational artificial intelligence (AI) agent designed to explore TGF-β dysregulation in CRC by integrating harmonized clinical and genomic data via natural language queries.

Methods: AI-HOPE-TGFbeta utilizes a large language model (LLM), Large Language Model Meta AI 3 (LLaMA 3), a natural language-to-code interpreter, and a bioinformatics backend to automate statistical workflows. Tailored for TGF-β pathway analysis, the platform enables real-time cohort stratification and hypothesis testing using harmonized datasets from the cBio Cancer Genomics Portal (cBioPortal). It supports mutation frequency comparisons, odds ratio testing, Kaplan–Meier survival analysis, and subgroup evaluations across race/ethnicity, microsatellite instability (MSI) status, tumor stage, treatment exposure, and age. The platform was validated by replicating findings on SMAD4, TGFBR2, and BMPR1A mutations in EOCRC. Exploratory queries were conducted to examine novel associations with clinical outcomes in H/L populations.

Results: AI-HOPE-TGFbeta successfully recapitulated established associations, including worse survival in SMAD4-mutant EOCRC patients treated with FOLFOX (fluorouracil, leucovorin and oxaliplatin) (p = 0.0001) and better outcomes in early-stage TGFBR2-mutated CRC patients (p = 0.00001). It revealed potential population-specific enrichment of BMPR1A mutations in H/L patients (OR = 2.63; p = 0.052) and uncovered MSI-specific survival benefits among SMAD4-mutated patients (p = 0.00001). Exploratory analysis showed better outcomes in SMAD2-mutant primary tumors vs. metastatic cases (p = 0.0010) and confirmed the feasibility of disaggregated ethnicity-based queries for TGFBR1 mutations, despite small sample sizes. These findings underscore the platform’s capacity to detect both known and emerging clinical–genomic patterns in CRC.

Conclusions: AI-HOPE-TGFbeta introduces a new paradigm in cancer bioinformatics by enabling natural language-driven, real-time integration of genomic and clinical data specific to TGF-β pathway alterations in CRC. The platform democratizes complex analyses, supports disparity-focused investigation, and reveals clinically actionable insights in underserved populations, such as H/L EOCRC patients. As a first-of-its-kind system studying TGF-β, AI-HOPE-TGFbeta holds strong promise for advancing equitable precision oncology and accelerating translational discovery in the CRC TGF-β pathway.

Keywords:

artificial intelligence; large language models; AI agents; medical AI; healthcare AI; colorectal cancer

1. Introduction

Colorectal cancer (CRC) continues to rank among the most prevalent and lethal cancers globally, with a notable and rapid increase over the past several decades in early-onset cases, defined as diagnoses occurring before age 50 [1,2,3,4,5]. This upward trend is particularly pronounced in high-risk groups, including Hispanic/Latino (H/L) individuals [6,7,8,9,10]. Although multiple oncogenic pathways contribute to CRC development and progression, the transforming growth factor-beta (TGF-β) signaling pathway plays a pivotal role by facilitating processes such as epithelial-to-mesenchymal transition (EMT), immune system evasion, and metastatic spread [11,12,13]. However, efforts to comprehensively characterize TGF-β dysregulation in early-onset CRC (EOCRC), especially in underrepresented populations, have been constrained by the scarcity of diverse patient cohorts in genomic databases and the absence of tools capable of linking clinical and molecular data in an integrative manner [6,7,14,15].

The TGF-β pathway is frequently altered in CRC through mutations in key components such as SMAD4, BMPR1A, and TGFBR2, which are associated with poor prognosis, therapy resistance, and aggressive tumor phenotypes [16,17,18,19]. Recent studies suggest that these mutations may present with distinct patterns in EOCRC compared to late-onset CRC (LOCRC) and may vary by ethnicity—highlighting a need for population-specific investigation [6,7,19,20]. For example, alterations in BMPR1A and BMP7 have been identified in H/L patients with EOCRC, suggesting unique mechanisms of TGF-β dysregulation in this group [6,7]. However, few tools exist that can efficiently integrate and stratify these genomic insights alongside clinical factors, such as age, tumor stage, treatment response, and microsatellite instability (MSI) status.

Although public platforms such as The Cancer Genome Atlas (TCGA) and AACR GENIE provide rich datasets for CRC research, existing analysis tools, such as the cBio Cancer Genomics Portal (cBioPortal) [21] and the University of California Santa Cruz (UCSC) Xena [22], require multi-step workflows and offer limited functionality for pathway-specific, population-disaggregated, or treatment-contextualized exploration of TGF-β dysregulation in CRC. These constraints disproportionately affect non-computational researchers, impeding precision oncology efforts in real-world and subpopulation contexts.

The emergence of artificial intelligence (AI), especially advancements in large language models (LLMs), has paved the way for conversational tools that convert natural language input into executable data analysis pipelines [23,24]. Although these technologies have demonstrated potential in streamlining multiomic investigations [25,26,27,28,29,30], there remains a lack of platforms specifically designed to target signaling pathways or to support integrative clinical–genomic research with an emphasis on hypothesis generation and health disparity considerations.

AI-HOPE-TGFbeta (Artificial Intelligence agent for High-Optimization and Precision Medicine focused on TGF-β) was developed to directly address the lack of tools capable of pathway-specific integrative analysis in CRC. This conversational AI system enables users to pose natural language questions that are translated into executable workflows, facilitating real-time synthesis of harmonized genomic and clinical data. With built-in automation for statistical tasks—including Kaplan–Meier survival analysis and odds ratio estimation—the platform streamlines both validation studies and exploratory investigations across large datasets. In the present study, we (1) created AI-HOPE-TGFbeta to enable a user-friendly, pathway-centered interrogation of CRC cohorts; (2) assessed its performance by replicating well-established clinical–genomic associations in EOCRC; and (3) applied the system to reveal novel links between TGF-β mutations, MSI, tumor staging, and population-level variables. These results highlight AI-HOPE-TGFbeta as an innovative and inclusive solution to support scalable, translational TGF-β pathway research in precision oncology.

To guide the reader through the manuscript, the remainder of the paper is organized as follows: In Section 2, we describe the architecture of the AI-HOPE-TGFbeta platform, including its system components, data sources, conversational query handling, statistical framework, and validation strategy. Section 3 presents the results of validation and exploratory analyses, highlighting both recapitulated findings and novel associations between TGF-β pathway alterations and clinical–genomic features in colorectal cancer. In Section 4, we discuss the translational significance of the findings, implications for health equity, and technical advantages of the platform, while also outlining limitations and future directions. Finally, Section 5 provides our concluding remarks on the broader impact of AI-HOPE-TGFbeta for precision oncology and population-focused cancer research.

This study contributes several novel methodological advancements at the intersection of AI and clinical–genomic research. AI-HOPE-TGFbeta leverages a fine-tuned biomedical variant of the Large Language Model Meta AI 3 (LLaMA 3) for the semantic interpretation of natural language queries. It incorporates a natural language-to-code interpreter that translates user inputs into executable Python 3.12-based statistical workflows. These AI components enable real-time cohort stratification, hypothesis testing, and interpretation of TGF-β pathway alterations in CRC—without requiring programming expertise. These technical innovations distinguish AI-HOPE-TGFbeta as a scalable and user-friendly system for translational bioinformatics and precision oncology.

2. Materials and Methods

2.1. System Architecture and Workflow of AI-HOPE-TGFbeta

AI-HOPE-TGFbeta is a natural language-enabled AI system engineered to explore CRC with a specific focus on alterations within the TGF-β signaling pathway. The platform is built on a layered, modular framework that integrates three core components: a built-in LLM based on the LLaMA 3 architecture for semantic query interpretation, a translation layer that converts user prompts into executable code, and a statistical backend designed to automate case generation, analytical processing, and result visualization. When a user submits a question in plain English, the system identifies the analytical intent, applies relevant filters to harmonized clinical–genomic datasets, and generates a suite of outputs—including survival analyses, mutation frequencies, odds ratios, and explanatory text summaries tailored to the context of the query (Figure 1).

The choice of the LLaMA 3 architecture as the core language model for AI-HOPE-TGFbeta was based on its strong performance in biomedical natural language processing tasks, its open-source accessibility, and its adaptability for domain-specific fine-tuning. Compared to other available LLMs, LLaMA 3 offers an optimal balance between computational efficiency and contextual accuracy, making it well-suited for real-time, query-driven clinical–genomic analysis. The model’s architecture supports long-context comprehension and precise semantic parsing, both essential for translating complex natural language queries into executable bioinformatics workflows. Furthermore, its open-access licensing facilitates reproducibility and broader deployment across research environments without commercial restrictions, aligning with the goal of democratizing precision oncology tools.

2.2. Data Sources and Preparation for AI-HOPE-TGFbeta

To power its analyses, AI-HOPE-TGFbeta draws from harmonized CRC datasets derived from public repositories such as TCGA and cBioPortal, with a targeted focus on genes implicated in TGF-β signaling. Key genomic features include alterations in SMAD4, TGFBR1, TGFBR2, BMPR1A, BMPR2, ACVR1B, and BMP7. The accompanying clinical metadata encompass a broad array of attributes—patient age, disease stage, treatment history (including FOLFOX—fluorouracil, leucovorin, and oxaliplatin—exposure), MSI classification, ethnicity, tumor tissue origin (primary versus metastatic), and overall survival metrics. Raw data underwent extensive preprocessing to ensure analytical compatibility: files were converted into standardized, tab-delimited matrices with harmonized sample identifiers, and ontology-based frameworks, such as OncoTree and the Disease Ontology, were applied to unify clinical annotations. Mutation data were cross-validated across sources, and all TGF-β pathway gene sets were curated using publicly available knowledge bases to ensure accuracy and biological relevance.
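The preprocessing described above (tab-delimited matrices keyed by harmonized sample identifiers) can be illustrated with a minimal sketch. The column names (`SAMPLE_ID`, `AGE`, `MUTATED_GENES`), the identifier normalization rule, and the toy rows are all assumptions made for illustration; they are not the platform's actual schema or data.

```python
import csv
import io


def harmonize_sample_id(raw_id: str) -> str:
    """Normalize an identifier: trim whitespace, upper-case, unify separators (assumed rule)."""
    return raw_id.strip().upper().replace("_", "-")


def load_tsv(text: str, id_column: str = "SAMPLE_ID") -> dict:
    """Read a tab-delimited table into {harmonized_sample_id: row_dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {harmonize_sample_id(row[id_column]): row for row in reader}


def merge_clinical_and_mutations(clinical: dict, mutations: dict) -> list:
    """Join clinical rows with mutation rows on the harmonized sample identifier."""
    merged = []
    for sample_id, clin in clinical.items():
        mut = mutations.get(sample_id, {})
        genes = {g for g in mut.get("MUTATED_GENES", "").split(";") if g}
        merged.append({"sample_id": sample_id, "age": int(clin["AGE"]), "mutated_genes": genes})
    return merged


# Toy inputs (hypothetical): the first clinical identifier is deliberately inconsistent.
clinical_tsv = "SAMPLE_ID\tAGE\n crc_001 \t44\nCRC-002\t61\n"
mutation_tsv = "SAMPLE_ID\tMUTATED_GENES\nCRC-001\tSMAD4;TGFBR2\n"

records = merge_clinical_and_mutations(load_tsv(clinical_tsv), load_tsv(mutation_tsv))
for record in records:
    print(record["sample_id"], record["age"], sorted(record["mutated_genes"]))
```

Normalizing identifiers before the join is what lets ` crc_001 ` in one file match `CRC-001` in the other; samples with no mutation row are kept with an empty gene set rather than dropped.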

2.3. Conversational Query Handling and Cohort Definition in AI-HOPE-TGFbeta

AI-HOPE-TGFbeta enables users to initiate complex clinical–genomic analyses using natural language inputs. Queries such as “Compare survival outcomes for SMAD4-mutated versus wild-type EOCRC patients treated with FOLFOX” or “Evaluate TGFBR1 mutation prevalence between H/L and non-Hispanic White (NHW) individuals” are interpreted by a built-in LLM based on the LLaMA 3 architecture. This LLM translates conversational prompts into executable code that filters datasets, defines cohorts, and launches the appropriate statistical analyses. When necessary, the system prompts the user for clarification to resolve ambiguity and ensure accurate query interpretation. AI-HOPE-TGFbeta accommodates a wide range of stratification parameters, including genetic mutation status, MSI, tumor stage, race/ethnicity, and chemotherapy exposure, allowing users to flexibly define custom subgroups for targeted analysis.
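As a rough illustration of the cohort-definition step, the sketch below applies a structured query to a list of patient records and splits the eligible patients into case and control cohorts. The `Patient` fields, the `define_cohorts` helper, and the structured form of the query are hypothetical; the code the platform actually generates is not shown in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    patient_id: str
    age: int
    ethnicity: str
    treatment: str
    mutated_genes: set = field(default_factory=set)


def define_cohorts(patients, gene, max_age=None, treatment=None):
    """Split eligible patients into (mutated, wild_type) cohorts for one gene."""
    case, control = [], []
    for p in patients:
        if max_age is not None and p.age >= max_age:
            continue  # outside the requested age group
        if treatment is not None and p.treatment != treatment:
            continue  # did not receive the requested treatment
        (case if gene in p.mutated_genes else control).append(p)
    return case, control


# Hypothetical structured form of the query "Compare survival outcomes for
# SMAD4-mutated versus wild-type EOCRC patients treated with FOLFOX".
query = {"gene": "SMAD4", "max_age": 50, "treatment": "FOLFOX"}

patients = [
    Patient("P1", 42, "H/L", "FOLFOX", {"SMAD4"}),
    Patient("P2", 47, "NHW", "FOLFOX"),
    Patient("P3", 63, "NHW", "FOLFOX", {"SMAD4"}),
    Patient("P4", 39, "H/L", "none", {"TGFBR2"}),
]

case, control = define_cohorts(patients, **query)
print([p.patient_id for p in case], [p.patient_id for p in control])
```

Keeping the query as plain data separates language interpretation from cohort filtering, so the filtering logic can be checked independently of the language model.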

To ensure data integrity and analytical reliability, all colorectal cancer datasets were extracted from cBioPortal, which provides harmonized and quality-controlled clinical–genomic data. For this study, we selected only those cases with complete clinical variables—including tumor stage, MSI status, treatment history, and survival data—and comprehensive genomic profiles covering key TGF-β pathway genes. The dataset underwent preprocessing to remove any entries with missing values across variables critical for cohort stratification and statistical analysis. Additionally, class balance was assessed across primary stratification variables (e.g., mutation status, MSI phenotype, ethnicity), and cohort sizes were adjusted where necessary to maintain representativeness and minimize analytical bias. These measures ensured that the final dataset was both complete and sufficiently balanced for robust statistical evaluation.
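The complete-case selection and class-balance check described above might look like the following sketch. The field names and the markers treated as missing (`None`, empty string, `"NA"`) are assumptions, and the records are invented.

```python
from collections import Counter

# Assumed names for the variables the text lists as required.
REQUIRED_FIELDS = ("stage", "msi_status", "treatment", "os_months")


def complete_cases(records, required=REQUIRED_FIELDS):
    """Keep only records with a non-missing value for every required field."""
    return [r for r in records if all(r.get(f) not in (None, "", "NA") for f in required)]


def class_balance(records, variable):
    """Return each category's share of the records for one stratification variable."""
    counts = Counter(r[variable] for r in records)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


toy_records = [
    {"stage": "II", "msi_status": "Stable", "treatment": "FOLFOX", "os_months": 30.0},
    {"stage": "IV", "msi_status": "Stable", "treatment": "FOLFOX", "os_months": 12.5},
    {"stage": "III", "msi_status": "Instable", "treatment": "none", "os_months": 48.0},
    {"stage": "I", "msi_status": "NA", "treatment": "none", "os_months": 60.0},  # dropped
]

kept = complete_cases(toy_records)
balance = class_balance(kept, "msi_status")
print(len(kept), balance)
```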

2.4. Analytical Framework and Statistical Methods in AI-HOPE-TGFbeta

AI-HOPE-TGFbeta’s analytical engine is powered by a Python-based bioinformatics workflow that supports a comprehensive suite of statistical methods tailored for clinical–genomic analysis. Categorical variables are evaluated using either chi-square or Fisher’s exact tests, with odds ratios and corresponding 95% confidence intervals calculated to quantify associations. For survival-related outcomes, the system implements Kaplan–Meier estimations and log-rank tests to compare groups, while multivariable Cox proportional hazards regression is available to adjust for confounding variables in time-to-event analyses. The platform also includes specialized modules for examining TGF-β pathway enrichment, identifying co-mutation patterns, and performing stratified survival analysis. Users can conduct subgroup comparisons across a variety of dimensions, including age groups (<50 vs. ≥50 years), racial and ethnic backgrounds (e.g., H/L vs. NHW), tumor sample type (primary vs. metastatic), and MSI status.
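For readers unfamiliar with the categorical tests named above, here is a minimal standard-library sketch of an odds ratio with a Wald 95% confidence interval and a two-sided Fisher's exact test for a 2×2 table. It illustrates the textbook methods, not the platform's implementation, and the counts are hypothetical. The article does not state which interval method the platform uses, so the Wald interval is an assumption.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for the table [[a, b], [c, d]] (all cells must be > 0)."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: total probability of all tables with the
    same margins that are no more probable than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical counts: rows = two groups, columns = mutated / wild-type.
or_value, ci_low, ci_high = odds_ratio_wald_ci(8, 2, 1, 5)
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"OR = {or_value:.1f}, 95% CI [{ci_low:.2f}, {ci_high:.1f}], p = {p_value:.4f}")
```

An exact p-value and a Wald interval are computed by different methods, so with small counts they can disagree about whether a result crosses the 0.05 threshold.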

The selection of statistical models integrated into AI-HOPE-TGFbeta was guided by standard practices in clinical–genomic research and the specific analytical needs of CRC studies. Chi-square and Fisher’s exact tests were employed for categorical variable comparisons due to their robustness in evaluating associations across stratified groups, particularly with small or imbalanced sample sizes. Odds ratios with 95% confidence intervals were used to quantify the strength of associations between genomic alterations and clinical characteristics. Kaplan–Meier survival analysis and log-rank tests were selected for time-to-event outcomes, given their widespread application and interpretability in oncology. To account for potential confounding variables, multivariable Cox proportional hazards regression was incorporated for more complex survival modeling. These models were chosen for their proven reliability, reproducibility, and suitability for the types of queries AI-HOPE-TGFbeta is designed to support—including population-level comparisons, mutation enrichment analyses, and treatment-specific outcome evaluations.
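The two survival methods named above can be sketched in plain Python as follows. This is an illustrative implementation of the standard Kaplan–Meier estimator and the two-group log-rank test under the usual hypergeometric variance; it is not the platform's code, the data are invented, and Cox regression is not covered.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = death, 0 = censored)."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value) with 1 df."""
    pooled = ([(t, e, 0) for t, e in zip(times_a, events_a)]
              + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e}):
        n = sum(1 for u, _, _ in pooled if u >= t)                    # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)       # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e)              # deaths, both groups
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == 0) # deaths, group A
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function for 1 df
    return chi2, p_value


# Invented follow-up times (months) and event indicators.
curve = kaplan_meier([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1])
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(curve)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The statistic is undefined when the variance is zero (for example, when every patient at risk belongs to one group), so a production implementation would need to guard against that case.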

2.5. Platform Design and Validation Strategy

AI-HOPE-TGFbeta was engineered with a strong emphasis on analytical rigor and reproducibility. At its core, the platform integrates a retrieval-augmented generation (RAG) mechanism that continuously references a structured biomedical knowledge base to enhance the contextual accuracy of its outputs and reduce the likelihood of AI-generated errors or hallucinations. The system applies schema-guided prompting to standardize how queries are interpreted and how results are formatted, ensuring consistency across diverse analyses. To assess the reliability of the platform, we validated its performance by successfully reproducing known clinical–genomic relationships from prior studies involving SMAD4, BMPR1A, and TGFBR2 in EOCRC [6,7], including stage-specific survival outcomes and mutation frequency disparities across patient populations.
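One way to picture schema-guided prompting is as a validation step that rejects any model output falling outside a fixed query schema before analysis code runs. The schema below (its keys and allowed values) is a hypothetical illustration built from the genes and stratification variables named in this article; the platform's real schema is not published here.

```python
import json

# Hypothetical schema: every key is required and must take one of the listed values.
QUERY_SCHEMA = {
    "analysis": {"survival", "odds_ratio", "mutation_frequency"},
    "gene": {"SMAD2", "SMAD4", "TGFBR1", "TGFBR2", "BMPR1A", "BMPR2", "ACVR1B", "BMP7"},
    "stratify_by": {"mutation_status", "ethnicity", "msi_status", "stage",
                    "treatment", "age_group", "sample_type"},
}


def validate_query(raw_json: str) -> dict:
    """Parse model output and raise ValueError if it does not match the schema."""
    query = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    if set(query) != set(QUERY_SCHEMA):
        raise ValueError(f"expected keys {sorted(QUERY_SCHEMA)}, got {sorted(query)}")
    for key, allowed in QUERY_SCHEMA.items():
        if query[key] not in allowed:
            raise ValueError(f"{key}={query[key]!r} is not an allowed value")
    return query


valid = validate_query('{"analysis": "survival", "gene": "SMAD4", "stratify_by": "msi_status"}')
print(valid)
```

Rejecting out-of-schema output and asking for clarification is one plausible way to keep a misread query from silently producing a wrong cohort.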

2.6. Usability Evaluation and Comparative Benchmarking

To assess the usability and performance of AI-HOPE-TGFbeta, we conducted a comparative analysis against established platforms, including cBioPortal and UCSC Xena. Evaluation criteria focused on speed of task execution, consistency of analytical output, and flexibility in constructing complex, stratified cohort queries. Benchmark tasks included identifying EOCRC patients with SMAD4 mutations stratified by treatment status, generating Kaplan–Meier survival curves based on MSI subtypes, and analyzing TGFBR1 mutation frequency across racial and ethnic groups. Across all scenarios, AI-HOPE-TGFbeta outperformed existing tools in both response time and user interaction efficiency—particularly in handling intersectional analyses that required integration of clinical, genomic, and demographic filters.

2.7. Visualization Capabilities and Exportable Results

Upon completion of each analysis, AI-HOPE-TGFbeta produces a suite of high-quality visual and tabular outputs designed for immediate interpretation and downstream application. These include Kaplan–Meier survival curves, forest plots, mutation heatmaps, and summary data tables—each rendered using backend libraries such as Matplotlib 3 and Plotly 4 to ensure visual clarity and consistency. In addition to graphical elements, the system generates narrative summaries that interpret statistical findings within the context of existing TGF-β pathway literature. All outputs can be downloaded in formats suitable for publication, presentation, or integration into clinical decision-making workflows.

3. Results

By converting user-generated natural language inputs into fully automated clinical–genomic workflows, AI-HOPE-TGFbeta enables the on-demand analysis of TGF-β signaling disruptions in CRC. Its interactive conversational design allows users to define custom cohorts based on variables such as age, tumor stage, MSI classification, mutational status, treatment exposure, and racial or ethnic background. The system then performs statistical evaluations—generating Kaplan–Meier survival curves, odds ratio calculations, and corresponding visual outputs—with no additional coding required. In both validation and discovery-focused queries, AI-HOPE-TGFbeta consistently reproduced the established associations and revealed new insights related to EOCRC, treatment efficacy, and pathway-specific biomarker patterns.

While validating the ancestry-stratified analyses, AI-HOPE-TGFbeta identified a potential disparity in the frequency of BMPR1A mutations among EOCRC patients across ethnic groups (Figure 2). Specifically, the platform revealed that 4.58% of H/L EOCRC patients harbored BMPR1A mutations, compared to 1.79% of NHW EOCRC patients. This difference translated to an odds ratio of 2.63 (95% CI: [1.093, 6.327]; p = 0.052), meaning that the odds of carrying a BMPR1A mutation were roughly 2.6 times higher in the H/L EOCRC population. Although the association narrowly missed the conventional threshold for statistical significance (p < 0.05), these findings underscore the potential of AI-HOPE-TGFbeta to uncover emerging ancestry-linked molecular patterns that may otherwise be overlooked. Importantly, this analysis highlights the need for a broader inclusion of racially and ethnically diverse populations in genomic studies to validate and extend these observations.
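The reported odds ratio can be checked directly from the two mutation frequencies quoted above. The short calculation below uses only those rounded percentages. Because the underlying patient counts are not given in this section, the confidence interval and p-value cannot be reproduced this way.

```python
def odds(proportion: float) -> float:
    """Convert a proportion into odds."""
    return proportion / (1 - proportion)


p_hl, p_nhw = 0.0458, 0.0179  # BMPR1A mutation frequencies reported in the text

odds_ratio = odds(p_hl) / odds(p_nhw)
prevalence_ratio = p_hl / p_nhw

print(round(odds_ratio, 2))        # 2.63
print(round(prevalence_ratio, 2))  # 2.56
```

Because the mutation is rare in both groups, the odds ratio (2.63) and the simple ratio of frequencies (2.56) are close, but they are not the same quantity.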

In ancestry-specific survival analyses, AI-HOPE-TGFbeta evaluated the prognostic impact of TGF-β pathway alterations in EOCRC among H/L patients (Figure S1). The platform stratified H/L EOCRC cases by mutation status in key TGF-β signaling genes, including SMAD4, TGFBR2, and BMPR1A. The case cohort consisted of 48 patients with TGF-β pathway alterations (0.9% of the dataset), while the control cohort included 105 patients without such mutations (1.9%). Kaplan–Meier survival analysis revealed no statistically significant difference in overall survival between the two groups (p = 0.8631), suggesting that TGF-β pathway mutations may not independently influence prognosis in H/L EOCRC patients at the current sample sizes. Despite the lack of significance, this result underscores the value of AI-HOPE-TGFbeta in enabling fine-grained subgroup analyses and generating hypotheses about context-dependent effects. The findings also highlight the need for larger, ancestry-specific datasets to more definitively assess the clinical relevance of TGF-β signaling alterations in diverse EOCRC populations.

In validation analyses, AI-HOPE-TGFbeta reproduced a key finding from the published TGF-β literature regarding the prognostic relevance of SMAD4 mutations in EOCRC patients treated with FOLFOX chemotherapy (Figure 3). Using a natural language query, the system stratified EOCRC patients (<50 years old) by SMAD4 mutation status and assessed treatment outcomes following FOLFOX (fluorouracil, leucovorin, and oxaliplatin) administration. The case cohort included 188 SMAD4-mutated patients (3.4% of the dataset), while the control cohort included 1066 SMAD4 wild-type patients (19.2%). Kaplan–Meier survival analyses revealed that SMAD4-mutated patients exhibited significantly worse overall and progression-free survival compared to wild-type cases (p = 0.0001 for both), consistent with prior reports linking SMAD4 loss to chemoresistance and aggressive tumor biology in EOCRC. These results highlight the ability of AI-HOPE-TGFbeta to recapitulate known genotype–treatment–outcome relationships and underscore the clinical importance of SMAD4 as a biomarker for poor prognosis in young CRC patients undergoing standard chemotherapy.

AI-HOPE-TGFbeta also enabled exploratory ethnicity-specific analysis of TGFBR1 mutation patterns in CRC patients (Figure 4). This analysis compared TGFBR1-mutated H/L patients (case cohort, n = 11) to TGFBR1-mutated NHW patients (control cohort, n = 79), highlighting the platform’s capacity to support disaggregated investigations despite sample imbalances. Although H/L individuals accounted for only 6.4% of the full dataset, AI-HOPE-TGFbeta successfully isolated enough cases to perform comparative statistical analyses. Odds ratio testing, stratified by early-onset age (<50 years), yielded a value of 1.029 (95% CI: [0.563, 7.134], p = 0.454), indicating no significant difference in mutation enrichment between the two ethnic groups. The Kaplan–Meier survival curves similarly showed no statistically significant difference in overall survival (p = 0.3561), despite apparent visual divergence between curves. These findings reinforce the underrepresentation of H/L populations in existing genomic datasets and underscore the potential of AI-HOPE-TGFbeta to enable focused population-level queries that can guide future efforts toward equity-driven precision oncology.

Further analysis using AI-HOPE-TGFbeta revealed clinically significant associations between TGFBR2 mutation status and tumor stage in CRC patients (Figure 5). The platform stratified TGFBR2-mutated cases into early-stage (Stages I–III) and late-stage (Stage IV) cohorts to assess the prognostic impact of disease stage in the context of TGF-β pathway disruption. Among the 307 TGFBR2-mutated patients analyzed, those with early-stage disease (n = 235) exhibited markedly improved overall survival compared to their late-stage counterparts (n = 72), with Kaplan–Meier analysis yielding a highly significant p-value (p < 0.0001). Additionally, a 2×2 odds ratio analysis evaluating FOLFOX chemotherapy exposure revealed that early-stage patients were significantly more likely to have received standard treatment (OR = 0.155, 95% CI: [0.082, 0.294], p < 0.001), suggesting potential treatment-related differences contributing to improved outcomes. These findings underscore the prognostic relevance of tumor stage among TGFBR2-mutated CRC patients and highlight AI-HOPE-TGFbeta’s capacity to integrate clinical and genomic variables for nuanced outcome stratification, supporting its use in guiding precision medicine strategies.

AI-HOPE-TGFbeta was used to assess the prognostic significance of MSI status among SMAD4-mutated CRC patients (Figure S2). In this analysis, patients were stratified by MSI phenotype, comparing those with MSI-high (Instable) tumors to those with MSI-stable tumors. The case cohort included 78 SMAD4-mutated patients with MSI-Instable tumors (1.4% of the dataset), while the control cohort comprised 710 SMAD4-mutated patients with MSI-Stable tumors (12.8%). The Kaplan–Meier survival analysis revealed that MSI-Instable patients had significantly better overall survival than MSI-Stable patients (p = 0.00001), with clearly divergent survival curves and non-overlapping 95% confidence intervals. This finding suggests a potential protective interaction between MSI-associated immunogenicity and SMAD4 pathway disruption, supporting the clinical relevance of combining genomic and molecular features in CRC prognosis. Moreover, the result highlights the utility of AI-HOPE-TGFbeta in uncovering context-dependent biomarker interactions that may inform immunotherapy stratification and precision treatment strategies.

Finally, AI-HOPE-TGFbeta was employed to evaluate the prognostic relevance of tumor sample type in CRC patients harboring SMAD2 mutations (Figure S3). The platform stratified patients by tumor origin—primary versus metastatic—within the SMAD2-mutant cohort to assess survival differences across disease progression stages. The case cohort included 209 patients with SMAD2-mutant primary tumors (3.8% of the dataset), while the control cohort consisted of 48 patients with SMAD2-mutant metastatic tumors (0.9%). The Kaplan–Meier survival analysis revealed significantly better overall survival in patients with primary tumors compared to those with metastatic lesions (p = 0.0010), with a clear separation of survival curves and non-overlapping 95% confidence intervals. This result supports prior evidence linking TGF-β signaling dysregulation to metastatic progression and underscores the clinical importance of tumor origin in prognostic modeling. Notably, this analysis highlights the strength of AI-HOPE-TGFbeta in dissecting context-specific molecular subgroups and advancing precision oncology through AI-enabled stratification.

Together, these findings demonstrate the versatility and analytical power of AI-HOPE-TGFbeta in uncovering both validated and novel insights into TGF-β pathway alterations across CRC subtypes and populations. By translating natural language prompts into executable clinical–genomic workflows, the platform enabled real-time, interpretable analyses incorporating key variables such as age, tumor stage, MSI status, mutation profiles, treatment exposure, and race/ethnicity. AI-HOPE-TGFbeta consistently recapitulated known prognostic associations—such as SMAD4-driven chemoresistance and stage-specific TGFBR2 outcomes—while also identifying emerging patterns, including ancestry-linked BMPR1A disparities and context-dependent survival modifiers, like MSI status and tumor origin. These results highlight the potential of conversational AI to democratize integrative bioinformatics, support equity-driven investigations, and accelerate precision medicine through scalable, dynamic cohort interrogation and hypothesis generation.

4. Discussion

AI-HOPE-TGFbeta represents a paradigm shift in precision oncology, offering a novel conversational AI platform that enables real-time, natural language-driven interrogation of TGF-β signaling dysregulation in CRC. By translating user-defined prompts into rigorous, reproducible analyses that integrate genomic and clinical variables, the system addresses longstanding limitations in accessibility, usability, and stratified data exploration. Unlike conventional bioinformatics platforms that often require complex scripting or multi-step workflows, AI-HOPE-TGFbeta streamlines the analytical process, allowing researchers and clinicians—even those without programming expertise—to formulate and execute pathway-centric, population-specific hypotheses with minimal friction.

The TGF-β signaling pathway is a central regulator of CRC progression, influencing processes such as epithelial-to-mesenchymal transition (EMT), immune evasion, and metastasis. Mutations in the TGF-β pathway genes—such as SMAD4, TGFBR2, and BMPR1A—are well-documented markers of poor prognosis and therapeutic resistance, particularly in EOCRC, which is rising at alarming rates in young adults and underserved populations. Despite this clinical importance, integrative analysis of TGF-β alterations has been hindered by the fragmentation of clinical–genomic data, underrepresentation of diverse populations, and the technical inaccessibility of traditional analysis pipelines. AI-HOPE-TGFbeta was developed to close these gaps, empowering users to interrogate TGF-β dysregulation across molecular and demographic contexts with unprecedented ease.

A core strength of AI-HOPE-TGFbeta lies in its ability to validate known associations while surfacing novel insights. In this study, the platform successfully recapitulated key findings from the TGF-β literature. These included the significantly worse overall and progression-free survival observed in SMAD4-mutated EOCRC patients treated with FOLFOX chemotherapy and the markedly better outcomes among the early-stage TGFBR2-mutated patients compared to their late-stage counterparts. These results not only confirmed the accuracy of AI-HOPE-TGFbeta’s analytic engine but also highlighted its potential for reinforcing known clinical–genomic relationships in treatment stratification and prognosis modeling.

AI-HOPE-TGFbeta also enabled hypothesis-driven, population-disaggregated analyses that are critical for advancing health equity in cancer genomics. Through natural language queries, the platform identified a potential disparity in the frequency of BMPR1A mutations among H/L EOCRC patients relative to their NHW counterparts—an observation that approached statistical significance and may reflect unique molecular etiologies in underrepresented groups. Similarly, the system allowed for survival analysis of TGFBR1-mutated CRC patients by ethnicity, demonstrating the platform’s flexibility in handling small cohort comparisons and highlighting the persistent underrepresentation of minority populations in genomic datasets. These findings underscore the urgent need for inclusive datasets and the value of AI systems like AI-HOPE-TGFbeta that support such investigations despite cohort-size limitations.

Beyond ancestry and ethnicity, the platform uncovered clinically actionable interactions between the TGF-β pathway genes and molecular or histopathological features. For example, among SMAD4-mutated tumors, those with MSI-high status exhibited significantly better survival than MSI-stable counterparts. This suggests a potentially protective immunologic interaction between MSI and TGF-β pathway disruption—an observation that may have implications for immunotherapy response prediction in CRC. Likewise, AI-HOPE-TGFbeta identified a strong prognostic benefit for patients with SMAD2-mutant primary tumors compared to those with metastatic lesions, reinforcing the relevance of tumor origin in outcome prediction and supporting the growing recognition of spatial tumor context in clinical decision-making.

From a technical perspective, AI-HOPE-TGFbeta is built on LLMs, a retrieval-augmented generation (RAG) engine, and harmonized clinical–genomic data. The system integrates structured biomedical ontologies to ensure accurate cohort definitions and interpretable outputs. In our benchmarking, AI-HOPE-TGFbeta outperformed widely used tools such as cBioPortal and UCSC Xena in execution speed, subgroup flexibility, and multidimensional filtering, particularly for complex queries involving intersectional factors like age, race/ethnicity, MSI subtype, tumor stage, and treatment exposure. These performance advantages position the platform as a scalable tool for precision oncology research.

Looking ahead, future iterations of AI-HOPE-TGFbeta will incorporate additional statistical metrics to further refine its survival analysis capabilities. While the current version focuses on Kaplan–Meier and log-rank methods—consistent with prior TGF-β pathway studies—we recognize the value of including complementary metrics, such as the concordance index. This measure offers a more nuanced evaluation of predictive discrimination in time-to-event models and could enhance the platform’s utility in personalized prognosis estimation. Expanding the platform to support such metrics will allow users to generate more comprehensive survival insights and better evaluate the clinical relevance of molecular alterations across diverse CRC populations.
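To make the proposed metric concrete, the following is a minimal sketch of Harrell's concordance index, written in plain Python. It is not part of the AI-HOPE-TGFbeta codebase: the function name, the risk scores, and the toy follow-up data are all hypothetical and serve only to illustrate how the index rewards a model that assigns higher risk to patients who experience the event earlier.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter follow-up time than patient j.
    Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy data: follow-up in months, 1 = event observed, 0 = censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.4, 0.1]  # invented model outputs, higher = worse prognosis

print(concordance_index(toy_times, toy_events, toy_risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the index complements a log-rank p-value: it measures how well patients are ranked rather than whether two curves differ.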

While AI-HOPE-TGFbeta represents a significant advancement in conversational AI-driven clinical–genomic analysis, several limitations should be acknowledged. First, the platform is currently limited to publicly available datasets, such as those from cBioPortal, which may not fully capture the diversity or granularity of patient populations, especially among underrepresented groups. Second, although the platform supports a wide range of analytical outputs, its current statistical capabilities do not yet include advanced machine learning models or multivariate feature selection techniques. Third, while natural language interfaces lower the barrier for non-programming users, the platform may require refinement to handle ambiguous or overly complex queries with optimal accuracy. Lastly, the generalizability of AI-HOPE-TGFbeta beyond the TGF-β pathway or colorectal cancer has not yet been formally evaluated. Future versions will aim to incorporate additional data modalities, broader pathway support, and enhanced analytical flexibility to address these limitations.

In evaluating the interpretability and communication capabilities of AI-HOPE-TGFbeta, we compared its outputs to human-generated content previously analyzed and curated in two peer-reviewed studies on colorectal cancer [6,7]. These prior publications provided a foundation for assessing the AI agent’s ability to generate clinically relevant and context-aware responses. However, it is important to note that this content was derived from public databases, such as cBioPortal [21], whose primary objective is to simplify and enhance the accessibility of complex cancer genomic data for cancer biologists and clinicians. AI-HOPE-TGFbeta is intended to serve as a research-oriented AI tool that supports precision oncology investigations focused on TGF-β pathway signaling in colorectal cancer and is not a substitute for authoritative medical advice. We strongly recommend that non-professional users consult licensed healthcare providers for any clinical or treatment-related decisions. This limitation has been explicitly acknowledged to ensure transparency regarding the context of AI–human comparisons and to emphasize the importance of professional oversight when applying AI tools in healthcare settings.

An important consideration in this study is the lack of control over input complexity, response length, and answer structure across the evaluated queries. Because AI-HOPE-TGFbeta generates natural language outputs in real time based on user-defined questions, the heterogeneity in phrasing and depth of prompts may influence the comprehensiveness and formatting of the responses. This variation could introduce unintended bias in interpretability or perceived quality, particularly when outputs are assessed subjectively or in comparison with human-generated responses. While this flexibility reflects real-world user interaction with conversational AI systems, we acknowledge it as a limitation in standardized performance evaluation. Future versions of the platform will incorporate more structured benchmarking protocols to ensure consistent comparisons across different question types and complexity levels.

An acknowledged constraint of this study is that the current version of AI-HOPE-TGFbeta relies primarily on univariate statistical methods—such as Kaplan–Meier survival analysis and odds ratio testing—without formal adjustment for potential confounding variables. As a result, some associations identified may be influenced by underlying covariates that were not accounted for, potentially limiting the robustness and generalizability of the findings. While this approach aligns with the platform’s initial focus on demonstrating core functionality and query interpretability, future versions of AI-HOPE-TGFbeta will incorporate multivariate regression models to support confounder-adjusted analyses and enhance the statistical rigor of clinical–genomic interpretations.
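The univariate character of these methods can be seen in a minimal sketch of the Kaplan–Meier product-limit estimator. The code below is an illustration under stated assumptions, not the platform's implementation: the function name and the toy data are hypothetical. Its only inputs are follow-up times and event indicators, so covariates such as stage or treatment can enter only by splitting the cohort beforehand.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up duration for each patient
    events : 1 if the event was observed, 0 if the patient was censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve


# Hypothetical toy cohort (months of follow-up, event flag).
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in toy_curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Confounder adjustment, as planned for future versions, would require a regression model in place of this estimator rather than a change to it.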

While this study focused primarily on validating the technical and analytical capabilities of AI-HOPE-TGFbeta, it is important to recognize that key dimensions of clinical communication, such as patient understanding, emotional tone, and ethical risk, were not formally evaluated. These factors are critical for ensuring that AI-generated outputs support effective, compassionate, and responsible communication, particularly in health contexts involving sensitive information or vulnerable populations. AI-HOPE-TGFbeta is intended strictly for research use and is not designed to provide clinical guidance or patient-facing communication. As AI systems increasingly interface with researchers, patients, and clinicians, future work should incorporate structured assessments of how these tools affect comprehension, emotional response, and ethical acceptability.

As this study represents an initial demonstration of the AI-HOPE-TGFbeta platform, the number of showcased queries is intentionally limited and exploratory in nature. While the system was tested with a broader set of inputs during development, we selected a focused subset of seven representative queries that best illustrate the platform’s core functionalities—particularly in stratified survival analysis, mutation enrichment, and population-specific genomic insights related to the TGF-β pathway in colorectal cancer. These examples were chosen based on their alignment with previously published findings by our team and their ability to validate known clinical–genomic relationships, especially within early-onset colorectal cancer. We also acknowledge that the dataset used is relatively small, in part due to our focus on populations who are significantly underrepresented in current genomic databases. This limitation has been noted to ensure transparency and contextualize the scope of the current evaluation. Future benchmarking efforts will incorporate larger and more systematically designed query sets to further assess the scalability, robustness, and generalizability of the AI-HOPE-TGFbeta platform.

While AI-HOPE-TGFbeta successfully replicated known clinical–genomic associations involving SMAD4, TGFBR2, and BMPR1A—validated through prior peer-reviewed studies from our group—the current demonstration is based on a relatively small and exploratory cohort. This limitation is partially attributable to the underrepresentation of minority populations, such as Hispanic/Latino patients, in public genomic datasets. We acknowledge that broader clinical validation using larger, more diverse, and independent cohorts is essential to further establish the clinical significance and generalizability of our findings. Future work will focus on extending the platform’s reach by integrating additional datasets and validating outputs across external cohorts to enhance reproducibility, equity, and translational impact in precision oncology.

The observed enrichment of BMPR1A mutations in Hispanic/Latino patients with EOCRC underscores the potential of AI-HOPE-TGFbeta to reveal population-specific genomic patterns. However, we recognize that these conclusions are based on a relatively small sample size, which limits the statistical power and generalizability of the findings. This reflects a broader challenge in precision oncology: the persistent underrepresentation of racial and ethnic minority groups in genomic databases. To strengthen the validity of these insights, future research will prioritize the inclusion of larger, ancestrally diverse cohorts and multi-institutional datasets. Such efforts are critical for ensuring that AI-driven platforms like AI-HOPE-TGFbeta contribute meaningfully to equitable and generalizable precision medicine.

As AI-HOPE-TGFbeta enables population-specific genomic exploration, particularly in underserved groups such as Hispanic/Latino patients, it is critical to approach these analyses with ethical responsibility. The interpretation of genetic data from underrepresented populations must be conducted with caution to avoid reinforcing stereotypes, misrepresentation, or stigmatization. Moreover, while our platform utilizes de-identified, publicly available data (e.g., from cBioPortal), we acknowledge the ongoing responsibility to protect patient privacy and promote transparency in how subgroup findings are generated and communicated. Future versions of AI-HOPE-TGFbeta will incorporate ethical oversight frameworks and stakeholder input, especially from communities affected by disparities, to guide responsible data use and foster trust in AI-driven precision oncology.

While AI-HOPE-TGFbeta provides a suite of informative outputs, including survival curves, odds ratio plots, and mutation heatmaps, the platform is explicitly intended for research use only and not as a clinical decision support system. Its current functionality is designed to assist investigators in exploring pathway-specific alterations and generating population-aware hypotheses using harmonized genomic and clinical data. The translation of these outputs into actionable insights for clinical decision-making remains outside the current scope of the platform. However, future iterations of AI-HOPE-TGFbeta will incorporate modules aimed at evaluating clinical applicability more systematically, including integration with clinical guidelines, interpretability for care teams, and potential alignment with decision-support frameworks. We acknowledge this boundary and treat it as a key consideration for future development.

The clinical implications of the prognostic biomarkers and subgroup patterns identified by AI-HOPE-TGFbeta (such as SMAD4 mutations, TGFBR2 alterations, and MSI status) are supported by prior peer-reviewed publications from our group, which investigated these associations in early-onset colorectal cancer using traditional statistical approaches. Notably, two previously published studies serve as empirical references that both informed and validated key query outputs presented in this manuscript. These prior studies enhance the interpretability and translational relevance of our AI-driven analyses. Nevertheless, we acknowledge that broader clinical validation, especially across diverse ethnic groups and larger cohorts, remains necessary. Future research will focus on systematically evaluating the predictive and therapeutic value of these findings to inform personalized treatment strategies and support equitable precision oncology regarding the TGF-β pathway in CRC.

It is also important to note that this study was conducted entirely within a research setting and does not reflect the operational complexities of real-world clinical environments. As such, the findings and applications of AI-HOPE-TGFbeta are not currently generalizable to routine clinical practice. The platform was developed as a research tool to support clinical–genomic exploration, translational oncology, and hypothesis generation. It has not been validated for direct clinical use or decision-making. Future work will be necessary to assess its performance, usability, and safety in clinical contexts, including integration with electronic health records, alignment with clinical workflows, and evaluation under regulatory standards.

Several areas remain for continued development. Expanding the platform to incorporate additional omics layers, including transcriptomics, proteomics, and spatial data, would enable deeper mechanistic insight into TGF-β-driven tumor biology. Integration with federated learning frameworks and secure, privacy-preserving data environments will be essential for real-world clinical deployment, especially in multi-institutional settings where patient-level data must remain decentralized. Further comparative evaluations against other AI-powered bioinformatics agents are needed to assess the generalizability of AI-HOPE-TGFbeta beyond CRC and the TGF-β pathway, paving the way for a modular suite of disease- and pathway-specific AI agents.

Finally, while AI-HOPE-TGFbeta was designed to lower the barriers to entry for researchers working at the intersection of genomics and health disparities, broader efforts to support training, community engagement, and interdisciplinary collaboration will be key to maximizing its impact. The platform is not a substitute for diverse data generation but rather a catalyst for extracting meaningful insights from the data that already exist—highlighting the need for ongoing investment in both algorithmic innovation and inclusive data science.

5. Conclusions

In conclusion, AI-HOPE-TGFbeta supports integrative, equitable, and conversational bioinformatics by uniting advanced language models with clinical–genomic reasoning. It provides an intuitive tool for validating known associations, identifying novel biomarkers, and generating actionable hypotheses, particularly in the context of early-onset CRC and underserved patient populations. As precision oncology evolves, AI-HOPE-TGFbeta exemplifies the potential of AI not only to accelerate discovery but also to bridge the gap between data and equitable health outcomes across populations.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ai6070137/s1, Figure S1: AI-HOPE-TGFbeta analysis of TGF-β pathway alterations in early-onset colorectal cancer (EOCRC) among Hispanic/Latino (H/L) patients; Figure S2: AI-HOPE-TGFbeta analysis of SMAD4-mutant colorectal cancer (CRC) patients by microsatellite instability status (MSI-high vs. MSI-stable); Figure S3: AI-HOPE-TGFbeta analysis of SMAD2-mutant colorectal cancer (CRC) patients by sample type: primary vs. metastatic tumors.

Author Contributions

Conceptualization, E.V.-V. and B.W.; methodology, E.V.-V. and B.W.; software, E.V.-V. and E.-W.Y.; validation, B.W., E.-W.Y. and E.V.-V.; formal analysis, E.V.-V. and B.W.; investigation, E.V.-V. and B.W.; resources, E.V.-V.; data curation, B.W. and E.V.-V.; writing—original draft preparation, E.V.-V. and E.-W.Y.; writing—review and editing, E.V.-V., B.W. and E.-W.Y.; visualization, E.V.-V. and B.W.; supervision, E.V.-V.; project administration, E.V.-V.; funding acquisition, E.V.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Department of Integrative Translational Sciences at City of Hope, the City of Hope Cancer Control and Population Sciences Program, National Cancer Institute (NCI), National Institutes of Health (NIH), award number NIH/NCI P30-CA033572; NIH, NCI, Cancer Moonshot project, PE-CGS: Optimizing Engagement of Hispanic Colorectal Cancer Patients in Cancer Genomic Characterization Studies, award number NIH/NCI U2C-CA252971; and NIH, NCI, U54 University of California Riverside (UCR) and City of Hope Comprehensive Cancer Center (COH-CCC) partnership, Drug Development and Capacity Building, award number NIH/NCI U54-CA285116.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available at cbioportal.org. The AI-HOPE-TGFbeta software and demonstration data are available at https://github.com/Velazquez-Villarreal-Lab/AI-HOPE-TGFb (accessed on 15 March 2025).

Acknowledgments

The authors gratefully acknowledge the support of the Department of Integrative Translational Sciences at City of Hope, the Cancer Control and Population Sciences Program (NIH P30 CA033572) at the City of Hope Comprehensive Cancer Center, and the Drug Development and Capacity Building initiative: a UCR/CoH-CCC Partnership project (NIH U54 CA285116).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bhandari, A.; Woodhouse, M.; Gupta, S. Colorectal cancer is a leading cause of cancer incidence and mortality among adults younger than 50 years in the USA: A SEER-based analysis with comparison to other young-onset cancers. J. Investig. Med. 2017, 65, 311–315.
2. Venugopal, A.; Carethers, J.M. Epidemiology and biology of early onset colorectal cancer. EXCLI J. 2022, 21, 162–182.
3. Patel, S.G.; Karlitz, J.J.; Yen, T.; Lieu, C.H.; Boland, C.R. The rising tide of early-onset colorectal cancer: A comprehensive review of epidemiology, clinical features, biology, risk factors, prevention, and early detection. Lancet Gastroenterol. Hepatol. 2022, 7, 262–274.
4. Santucci, C.; Mignozzi, S.; Alicandro, G.; Pizzato, M.; Malvezzi, M.; Negri, E.; Jha, P.; La Vecchia, C. Trends in cancer mortality under age 50 in 15 upper-middle and high-income countries. J. Natl. Cancer Inst. 2025, 117, 747–760.
5. Sinicrope, F.A. Increasing Incidence of Early-Onset Colorectal Cancer. N. Engl. J. Med. 2022, 386, 1547–1558.
6. Monge, C.; Waldrup, B.; Carranza, F.G.; Velazquez-Villarreal, E. Molecular Heterogeneity in Early-Onset Colorectal Cancer: Pathway-Specific Insights in High-Risk Populations. Cancers 2025, 17, 1325.
7. Monge, C.; Waldrup, B.; Carranza, F.G.; Velazquez-Villarreal, E. WNT and TGF-Beta Pathway Alterations in Early-Onset Colorectal Cancer Among Hispanic/Latino Populations. Cancers 2024, 16, 3903.
8. Garcia, S.; Pruitt, S.L.; Singal, A.G.; Murphy, C.C. Colorectal cancer incidence among Hispanics and non-Hispanic Whites in the United States. Cancer Causes Control 2018, 29, 1039–1046.
9. Muller, C.; Ihionkhan, E.; Stoffel, E.M.; Kupfer, S.S. Disparities in Early-Onset Colorectal Cancer. Cells 2021, 10, 1018.
10. Carranza, F.G.; Waldrup, B.; Jin, Y.; Amzaleg, Y.; Postel, M.; Craig, D.W.; Carpten, J.D.; Salhia, B.; Hernandez, D.; Gutierrez, N.; et al. Assessment of MYC Gene and WNT Pathway Alterations in Early-Onset Colorectal Cancer Among Hispanic/Latino Patients Using Integrated Multi-Omics Approaches. medRxiv 2025.
11. Lieu, C.H.; Golemis, E.A.; Serebriiskii, I.G.; Newberg, J.; Hemmerich, A.; Connelly, C.; Messersmith, W.A.; Eng, C.; Eckhardt, S.G.; Frampton, G.; et al. Comprehensive Genomic Landscapes in Early and Later Onset Colorectal Cancer. Clin. Cancer Res. 2019, 25, 5852–5858.
12. Tang, J.; Peng, W.; Tian, C.; Zhang, Y.; Ji, D.; Wang, L.; Jin, K.; Wang, F.; Shao, Y.; Wang, X.; et al. Molecular characteristics of early-onset compared with late-onset colorectal cancer: A case controlled study. Int. J. Surg. 2024, 110, 4559–4570.
13. Antelo, M.; Balaguer, F.; Shia, J.; Shen, Y.; Hur, K.; Moreira, L.; Cuatrecasas, M.; Bujanda, L.; Giraldez, M.D.; Takahashi, M.; et al. A high degree of LINE-1 hypomethylation is a unique feature of early-onset colorectal cancer. PLoS ONE 2012, 7, e45357.
14. Alshenaifi, J.Y.; Vetere, G.; Maddalena, G.; Yousef, M.; White, M.G.; Shen, J.P.; Vilar, E.; Parseghian, C.; Dasari, A.; Morris, V.K.; et al. Mutational and co-mutational landscape of early onset colorectal cancer. Biomarkers 2025, 30, 64–76.
15. Papavassiliou, K.A.; Delle Cave, D.; Papavassiliou, A.G. Targeting the TGF-β Signaling Axis in Metastatic Colorectal Cancer: Where Do We Stand? Int. J. Mol. Sci. 2023, 24, 17101.
16. Koveitypour, Z.; Panahi, F.; Vakilian, M.; Peymani, M.; Seyed Forootan, F.; Nasr Esfahani, M.H.; Ghaedi, K. Signaling pathways involved in colorectal cancer progression. Cell Biosci. 2019, 9, 97.
17. Li, X.; Wu, Y.; Tian, T. TGF-β Signaling in Metastatic Colorectal Cancer (mCRC): From Underlying Mechanism to Potential Applications in Clinical Development. Int. J. Mol. Sci. 2022, 23, 14436.
18. Laskar, R.S.; Qu, C.; Huyghe, J.R.; Harrison, T.; Hayes, R.B.; Cao, Y.; Campbell, P.T.; Steinfelder, R.; Talukdar, F.R.; Brenner, H.; et al. Genome-wide association studies and Mendelian randomization analyses provide insights into the causes of early-onset colorectal cancer. Ann. Oncol. 2024, 35, 523–536.
19. Li, Q.; Geng, S.; Luo, H.; Wang, W.; Mo, Y.Q.; Luo, Q.; Wang, L.; Song, G.B.; Sheng, J.P.; Xu, B. Signaling pathways involved in colorectal cancer: Pathogenesis and targeted therapy. Signal Transduct. Target. Ther. 2024, 9, 266.
20. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012, 487, 330–337.
21. Cerami, E.; Gao, J.; Dogrusoz, U.; Gross, B.E.; Sumer, S.O.; Aksoy, B.A.; Jacobsen, A.; Byrne, C.J.; Heuer, M.L.; Larsson, E.; et al. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012, 2, 401–404; Erratum in Cancer Discov. 2012, 2, 960.
22. Goldman, M.J.; Craft, B.; Hastie, M.; Repečka, K.; McDade, F.; Kamath, A.; Banerjee, A.; Luo, Y.; Rogers, D.; Brooks, A.N.; et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol. 2020, 38, 675–678.
23. Chen, Q.; Hu, Y.; Peng, X.; Xie, Q.; Jin, Q.; Gilson, A.; Singer, M.B.; Ai, X.; Lai, P.T.; Wang, Z.; et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat. Commun. 2025, 16, 3280.
24. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950.
25. Manojlovic, Z.; Christofferson, A.; Liang, W.S.; Aldrich, J.; Washington, M.; Wong, S.; Rohrer, D.; Jewell, S.; Kittles, R.A.; Derome, M.; et al. Comprehensive molecular profiling of 718 Multiple Myelomas reveals significant differences in mutation frequencies between African and European descent cases. PLoS Genet. 2017, 13, e1007087.
26. Xiao, J.; Yu, X.; Meng, F.; Zhang, Y.; Zhou, W.; Ren, Y.; Li, J.; Sun, Y.; Sun, H.; Chen, G.; et al. Integrating spatial and single-cell transcriptomics reveals tumor heterogeneity and intercellular networks in colorectal cancer. Cell Death Dis. 2024, 15, 326.
27. Patrad, E.; Khalighfard, S.; Amiriani, T.; Khori, V.; Alizadeh, A.M. Molecular mechanisms underlying the action of carcinogens in gastric cancer with a glimpse into targeted therapy. Cell. Oncol. 2022, 5, 1073–1117.
28. Idrissi, Y.A.; Rajabi, M.R.; Beumer, J.H.; Monga, S.P.; Saeed, A. Exploring the Impact of the β-Catenin Mutations in Hepatocellular Carcinoma: An In-Depth Review. Cancer Control 2024, 31, 10732748241293680.
29. Bungaro, C.; Guida, M.; Apollonio, B. Spatial proteomics of the tumor microenvironment in melanoma: Current insights and future directions. Front Immunol. 2025, 16, 1568456.
30. Greene, S.B.; Dago, A.E.; Leitz, L.J.; Wang, Y.; Lee, J.; Werner, S.L.; Gendreau, S.; Patel, P.; Jia, S.; Zhang, L.; et al. Chromosomal Instability Estimation Based on Next Generation Sequencing and Single Cell Genome Wide Copy Number Variation Analysis. PLoS ONE 2016, 11, e0165089.

Figure 1. Overview of AI-HOPE-TGFbeta workflow. This figure illustrates the end-to-end workflow of AI-HOPE-TGFbeta, a conversational AI system developed to investigate TGF-β pathway dysregulation in colorectal cancer (CRC) through natural language-guided, integrative bioinformatics. (a) Users interact with the system using natural language queries, such as assessing the survival outcomes of CRC patients with SMAD4 mutations or comparing BMPR1A mutation prevalence between early-onset Hispanic/Latino (H/L) and non-Hispanic White (NHW) patients. (b) The input is interpreted through a graphical user interface (GUI) powered by a large language model (LLM), which converts user intent into executable code and applies relevant filters for patient subgroups, clinical stage, or treatment regimen. (c) AI-HOPE-TGFbeta interfaces with harmonized clinical and genomic datasets from resources such as TCGA and cBioPortal, focusing on key genes in the TGF-β pathway—including SMAD4, BMPR1A, TGFBR2, TGFB1, and BMP7. (d) The platform performs automated statistical analyses—such as Kaplan–Meier survival estimation, mutation frequency comparisons, and odds ratio testing—generating both visual and textual outputs.
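Steps (b) and (c) of this workflow, in which user intent becomes filters that define case and control cohorts, can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `CohortQuery` class, the record fields, and the four toy patients are hypothetical and do not reflect the platform's actual code or data schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Patient = Dict[str, Any]


@dataclass
class CohortQuery:
    """Hypothetical structured form of a parsed natural language query."""
    name: str
    filters: List[Callable[[Patient], bool]]

    def select(self, patients: List[Patient]) -> List[Patient]:
        # A patient is kept only if every filter condition holds.
        return [p for p in patients if all(f(p) for f in self.filters)]


def mutation_frequency(cohort: List[Patient], gene: str) -> float:
    """Fraction of the cohort carrying a mutation in the given gene."""
    if not cohort:
        raise ValueError("empty cohort")
    return sum(gene in p["mutated_genes"] for p in cohort) / len(cohort)


# Invented toy records, not drawn from any real dataset.
patients = [
    {"id": "P1", "age": 42, "ethnicity": "H/L", "mutated_genes": {"BMPR1A"}},
    {"id": "P2", "age": 47, "ethnicity": "NHW", "mutated_genes": set()},
    {"id": "P3", "age": 63, "ethnicity": "H/L", "mutated_genes": {"SMAD4"}},
    {"id": "P4", "age": 38, "ethnicity": "NHW", "mutated_genes": {"BMPR1A", "SMAD4"}},
]


def early_onset(p: Patient) -> bool:
    return p["age"] < 50


case = CohortQuery("EOCRC H/L", [early_onset, lambda p: p["ethnicity"] == "H/L"])
control = CohortQuery("EOCRC NHW", [early_onset, lambda p: p["ethnicity"] == "NHW"])

for query in (case, control):
    cohort = query.select(patients)
    print(query.name, [p["id"] for p in cohort], mutation_frequency(cohort, "BMPR1A"))
```

Representing each condition as a separate predicate is one way to let filters for age, ethnicity, stage, or treatment be combined freely, which is the kind of multidimensional stratification the caption describes.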

Figure 2. AI-HOPE-TGFbeta analysis of BMPR1A mutations in early-onset colorectal cancer (EOCRC) among Hispanic/Latino (H/L) vs. non-Hispanic White (NHW) patients. (a) Pie charts display the distribution of selected samples from the overall dataset following the application of query filters. The case cohort includes 153 EOCRC H/L patients (<50 years old, H/L ethnicity), representing 2.8% of the queried population. The control cohort includes 1117 EOCRC NHW patients (<50 years old, White race, non-Hispanic ethnicity), accounting for 20.2% of the filtered samples. (b) A 2 × 2 odds ratio analysis compares the frequency of BMPR1A mutations between the case and control cohorts within the context of early-onset disease. The stacked bar plot shows the distribution of in-context (mutation present) and out-of-context (mutation absent) samples in each group. BMPR1A mutations were present in 4.58% of H/L cases and 1.79% of NHW controls. The odds ratio was 2.63 (95% CI: 1.093–6.327), with a p-value of 0.052, suggesting that the odds of carrying a BMPR1A mutation are approximately 2.6 times higher in the H/L EOCRC population, although the difference did not reach statistical significance at the 0.05 level.
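The odds ratio and confidence interval in Figure 2 can be checked with a short calculation. The sketch below is illustrative and is not the platform's own code. Two assumptions are made: the mutation counts (7 of 153 H/L cases and 20 of 1117 NHW controls) are inferred from the reported percentages, and a Wald interval on the log-odds scale is used because it agrees with the reported bounds. The source does not state which test produced the p-value, so no p-value is computed here.

```python
import math
from statistics import NormalDist


def odds_ratio_wald_ci(a, b, c, d, level=0.95):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a = cases with mutation,    b = cases without mutation
    c = controls with mutation, d = controls without mutation
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("the Wald interval needs all four cells to be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Counts inferred from the percentages in Figure 2 (assumption, not stated in the source).
hl_mutated, hl_total = 7, 153
nhw_mutated, nhw_total = 20, 1117

or_value, ci_lower, ci_upper = odds_ratio_wald_ci(
    hl_mutated, hl_total - hl_mutated,
    nhw_mutated, nhw_total - nhw_mutated,
)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")
```

Under these assumptions the calculation gives an odds ratio of about 2.63 with an interval of roughly 1.093 to 6.327, in line with the values in the caption.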

Figure 3. AI-HOPE-TGFbeta analysis of early-onset colorectal cancer (EOCRC) patients treated with FOLFOX, stratified by SMAD4 mutation status. This figure presents the output of a natural language query executed via AI-HOPE-TGFbeta, investigating the association between SMAD4 mutation status and survival outcomes among EOCRC patients treated with FOLFOX chemotherapy (fluorouracil, leucovorin, and oxaliplatin). (a) A histogram displays the age distribution for the full cohort (mean: 57.92 years), contextualizing the selection of patients under age 50 as EOCRC cases. (b) Pie charts illustrate the selected sample distributions. The case cohort (SMAD4-mutated) includes 188 patients (3.4%) from a total of 5543, while the control cohort (SMAD4 wild-type) comprises 1066 patients (19.2%). These charts visualize the proportional representation of each group within the broader dataset. (c) Kaplan–Meier survival analyses comparing overall survival and progression-free survival between SMAD4-mutated and wild-type EOCRC patients reveal significantly worse outcomes in the SMAD4-mutated group (p = 0.0001 for both endpoints), suggesting SMAD4 mutations may confer treatment resistance or more aggressive disease behavior in young patients receiving standard FOLFOX therapy.
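Survival comparisons like those in panel (c) rely on the log-rank test, which the Discussion names as the platform's current method. The following is a minimal two-group sketch in plain Python under stated assumptions: the function name and the toy data are hypothetical, no patient-level data from the study are used, and the p-value comes from the chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) with 1 df."""
    observed_a = 0.0
    expected_a = 0.0
    variance = 0.0
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    for t in event_times:
        at_risk_a = sum(1 for x in times_a if x >= t)
        at_risk_b = sum(1 for x in times_b if x >= t)
        n = at_risk_a + at_risk_b
        deaths_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        deaths_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        d = deaths_a + deaths_b
        observed_a += deaths_a
        expected_a += d * at_risk_a / n  # expected events in A under the null
        if n > 1:
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0:
        # No informative event times; treated here as no evidence of a difference.
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Invented toy data: group A fails early, group B fails late, no censoring.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

The statistic compares observed with expected event counts at every event time, so it tests whether the two curves differ overall rather than at a single time point.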

Figure 4. AI-HOPE-TGFbeta analysis of TGFBR1-mutated colorectal cancer (CRC) in Hispanic/Latino (H/L) vs. non-Hispanic White (NHW) patients. This figure demonstrates the use of AI-HOPE-TGFbeta to examine ethnicity-specific patterns in CRC patients with TGFBR1 mutations. The analysis compares H/L individuals (case cohort) to NHW individuals (control cohort), all of whom carry TGFBR1 mutations. (a) The initial panel displays the distribution of ethnicities in the full dataset. H/L individuals (X1) account for 6.4% of samples (n = 357), while NHW individuals (X0) make up the majority (89.0%, n = 4993). Both absolute counts and relative percentages are shown to contextualize the underrepresentation of H/L patients. (b) Pie charts summarize the number of selected samples after applying the query filters. The case cohort includes 11 samples (0.2% of the dataset), and the control cohort includes 79 samples (1.4%), indicating that TGFBR1-mutated H/L CRC cases are rare but analyzable using the platform. (c) A 2 × 2 odds ratio analysis contextualized by age (less than 50) compares the enrichment of in-context samples (defined by full query conditions) between the two cohorts. A stacked bar plot shows the distribution of in-context and out-of-context cases. The odds ratio is 1.029 with a 95% confidence interval of [0.563, 7.134] and a p-value of 0.454, suggesting no statistically significant difference in age-stratified mutation context under current sample sizes. (d) Kaplan–Meier survival curves compare overall survival between the case and control groups.

Figure 5. AI-HOPE-TGFbeta analysis of TGFBR2-mutated colorectal cancer (CRC) by tumor stage: early (Stages I–III) vs. late (Stage IV). This figure showcases the use of AI-HOPE-TGFbeta to investigate survival and enrichment patterns among CRC patients with TGFBR2 mutations, stratified by tumor stage. The analysis compares early-stage (Stages I–III) and late-stage (Stage IV) patients to evaluate how disease stage influences survival outcomes in the context of TGFBR2 pathway alterations. (a) Bar plots depict the distribution of tumor stages across the dataset. Patients with Stages I–III tumors account for 57.2% of the cohort (n = 3171), while Stage IV tumors make up 42.8% (n = 2372), providing context for defining early vs. late disease. (b) Pie charts summarize the sample selection following the user-defined query. The case cohort (TGFBR2-mutated patients with Stage I–III disease) includes 235 samples (4.2% of the total dataset), while the control cohort (TGFBR2-mutated patients with Stage IV disease) includes 72 samples (1.3%). (c) A 2 × 2 odds ratio test is performed using chemotherapy exposure (fluorouracil, leucovorin, and oxaliplatin) as a contextual filter. The stacked bar plot visualizes the proportion of in-context and out-of-context samples in each group. The odds ratio is 0.155, with a 95% confidence interval of [0.082, 0.294] and a p-value of 0.0001, indicating that the odds of the treatment context are significantly lower among early-stage than among late-stage TGFBR2-mutated patients. (d) Kaplan–Meier survival curves compare overall survival between the case and control cohorts.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
