Comprehensive Transcriptomic and Proteomic Analyses Identify a Candidate Gene Set in Cross-Resistance for Endocrine Therapy in Breast Cancer

Endocrine therapy (ET) of selective estrogen receptor modulators (SERMs), selective estrogen receptor downregulators (SERDs), and aromatase inhibitors (AIs) has been used as the gold standard treatment for hormone-receptor-positive (HR+) breast cancer. Despite its clinical benefits, approximately 30% of patients develop ET resistance, which remains a major clinical challenge in patients with HR+ breast cancer. The mechanisms of ET resistance mainly focus on mutations in the ER and related pathways; however, other targets still exist from ligand-independent ER reactivation. Moreover, mutations in the ER that confer resistance to SERMs or AIs seldom appear in SERDs. To date, little research has been conducted to identify a critical target that appears in both SERMs/SERDs and AIs. In this study, we conducted comprehensive transcriptomic and proteomic analyses from two cohorts of The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) to identify the critical targets for both SERMs/SERDs and AIs of ET resistance. From a treatment response cohort with treatment response for the initial ET regimen and an endocrine therapy cohort with survival outcomes, we identified candidate gene sets that appeared in both SERMs/SERDs and AIs of ET resistance. The candidate gene sets successfully differentiated progress/resistant groups (PD) from complete response groups (CR) and were significantly correlated with survival outcomes in both cohorts. In summary, this study provides valuable clinical implications for the critical roles played by candidate gene sets in the diagnosis, mechanism, and therapeutic strategy for both SERMs/SERDs and AIs of ET resistance for the future.


Introduction
Breast cancer is the most common cancer in women worldwide with approximately 2.3 million incident cases (11.7%) in 2020. In the United States, breast cancer is the leading cancer type, accounting for 30% of new cases and 15% of death among all cancer types in women [1]. Breast cancer is a highly heterogeneous cancer that is classified based on the histopathology of the receptor status or microarray analysis of molecular subtypes [2]. With immunohistochemical staining, breast cancer is classified as hormone-receptor-positive based on the expression of the estrogen receptor (ER) and progesterone receptor (PR). In contrast, breast cancer without the expression of ER, PR, and human epidermal growth factor receptor-2 (HER2) is classified as triple-negative breast cancer (TNBC) [3]. According to the current guidelines, the classification is considered positive if at least 1% of the tumor nuclei stain for the receptor with appropriate internal and external controls [4]. Nearly 80% of breast cancers are estrogen-receptor-positive (ER+) or hormone-receptor-positive [2].
Estrogen is a critical hormone that is not only responsible for normal growth and development of female mammary and reproductive organs, but also mammary hyperplasia and tumorigenesis. When the estrogen receptor binds to estrogen, it dimerizes and is translocated to the nucleus with coactivators to activate gene transcription, including cell cycle progression [5][6][7]. Due to the high dependence of breast tumorigenesis on the estrogen-ER axis, endocrine therapy with estrogen suppression or ER antagonists is the first-line treatment for ER+ breast cancer. Clinically, endocrine therapy includes: selective ER modulators (SERMs), such as tamoxifen, which competitively inhibit the binding of estrogen to ER; selective ER downregulators (SERDs), such as fulvestrant, which impair mobility/translocation of ER; aromatase inhibitors (AIs), such as letrozole, which deplete systemic estrogen levels in post-menopausal patients by blocking the conversion of androgens to estrogen. The above molecules of endocrine therapy are approved for adjuvant treatment of ER+ breast cancer patients and can reduce the mortality rate of breast cancer by 30% [8,9].
Despite the clinical benefit of endocrine therapy, approximately 20-30% of patients acquire resistance and recurrence after long-term treatment with endocrine therapy [10][11][12][13]. The mechanisms of endocrine resistance are complex and mainly focus on gain-of-function mutations in ER and compensatory cross-talk between ER and growth factor receptor/oncogenic signaling pathways [9]. Based on clinical observations, mutations in the ligand-binding domain (LBD) of ESR1 (the gene encoding ERα) [9,14] and receptor tyrosine kinases of HER2 amplification [15] are predominantly observed in endocrine therapy resistance. In addition, mutations in oncogenic pathways such as PI3-AKT/MAPK pathways are also frequently observed in mutation profiles and endow endocrine resistance in breast cancer [9].
Endocrine resistance remains a major clinical challenge for therapeutic efficacy in most ER+ breast cancer patients. Besides the above mechanisms, there may still be other genetic targets of endocrine resistance and this could be a potential therapeutic strategy to overcome endocrine resistance. Moreover, endocrine resistance is commonly driven by ligand-independent ER reactivation [9,16]. The dominant mutation of ESR1 accounts for less than 20% of cases of endocrine resistance in patients treated with SERMs or AIs [9,17], which indicates that patients resistant to SERMs with ESR1 mutation may also not benefit from AIs. However, ESR1 mutations that confer resistance to SERMs or AIs seldom appear in SERDs [14]. To date, little research has been conducted to identify a critical target that appears in both SERMs/SERDs and AIs. Moreover, a previous study from The Cancer Genome Atlas (TCGA) mainly focused on studies of untreated primary tumors or datasets without treatment response [18][19][20][21][22]. Therefore, in this study, we aimed to comprehensively explore the critical target by integrating transcriptomic (RNA-seq) and proteomic (reversephase protein array, RPPA) approaches that appeared in both SERMs/SERDs and AIs from the TCGA dataset of two cohorts with endocrine therapy response and survival outcomes, and elucidating their potential in diagnostic and survival outcomes, thereby providing genetic profiles and strategies to overcome endocrine therapy resistance.

Characteristics of Study Subjects from the TCGA-BRCA
In this study, 670 samples with ET were collected from TCGA-BRCA of the Genomic Data Commons (GDC) data portal. Based on the record of treatment response, most of the samples were recorded with "Not Applicable" or "Not Available", and only 44 samples were recruited in the arm of the treatment response cohort (TR cohort) with the record of "Complete Response (CR)" and "Clinical Progressive Disease (PD)". The 44 samples were further divided into two groups: CR = 34 samples and PD = 10 samples. The PD group showed clinical progress and resistance to the first-line ET regimen of AI (N = 5 samples) and SERM/SERD (N = 5 samples). The baseline characteristics of the study population groups according to CR and PD are summarized in Table 1. There were significant differences in the distribution of metastasis status, stage, PFS, and OS between CR and PD. We aimed to integrate the dataset of transcriptomic (RNA-Seq) and proteomic (reversephase protein array, RPPA) analyses, which appeared in both SERMs/SERDs and AIs. The baseline characteristics of the treatment response cohort (CR and PD) were further divided into two tables, the RNA-seq and RPPA datasets, as shown in Tables 2 and 3. Of the 44 samples, 32 possessed an RPPA dataset with CR = 25 and PD = 7, as shown in Table 3. Both tables show significant differences in the distribution of the proportion of stage, recurrence/metastases, and survival status between CR and PD. In addition, the baseline characteristics of the endocrine therapy cohort (ET cohort, without treatment response, N = 449) derived from the previous 670 samples are also included in Tables 2 and 3.

The Critical Target in Both SERMs/SERDs and AIs of PD Groups
To verify the critical target of transcriptomic and proteomic analyses in both SERMs/ SERDs and AIs of PD groups, a Venn diagram was conducted in both RNA-seq and RPPA from the treatment response cohort. In the Venn diagram, the targets from each group of CR, AIs, and SERMs/SERDs were selected based on the criteria of z-score threshold ±2 in the TCGA database. A comprehensive workflow to display the screening and analysis approaches in this study is shown in Figure 1. There were 1470 mRNA targets that appeared in the region of overlap between SERMs/SERDs and AIs and among CR, SERMs/SERDs, and AIs ( Figure 2A). In addition, seven protein targets appeared in the region of overlap between SERMs/SERDs and AIs and among CR, SERMs/SERDs, and AIs ( Figure 2B). The seven protein targets were retained for further study; however, the 1470 mRNA targets were filtered through a series of workflows, as shown in Figure 1. We filtered 1470 targets with statistical significance for both PFS and OS, and obtained 229 targets (significantly correlated with both PFS and OS with the criteria of z-score threshold ±2 and p < 0.05). Then, these 229 targets were filtered with differentiations of SERD/SERM versus CR or AI versus CR with fold changes > 0.5 or < −0.5, and 107 targets were obtained. Subsequently, 107 targets were filtered in both the treatment response cohort (with ET treatment response) and the endocrine therapy cohort (with survival outcomes) based on progression-free survival (PFS). In the arm of the TR cohort, inclusion 1470 mRNA targets were filtered through a series of workflows, as shown in Figure 1. We filtered 1470 targets with statistical significance for both PFS and OS, and obtained 229 targets (significantly correlated with both PFS and OS with the criteria of z-score threshold ±2 and p < 0.05). Then, these 229 targets were filtered with differentiations of SERD/SERM versus CR or AI versus CR with fold changes > 0.5 or < −0.5, and 107 targets were obtained. Subsequently, 107 targets were filtered in both the treatment response cohort (with ET treatment response) and the endocrine therapy cohort (with survival outcomes) based on progression-free survival (PFS). In the arm of the TR cohort, inclusion criteria were: (1) response of area under the curve (AUC) for CR vs. PD ≥0.6; (2)    The 29 targets from RNA-seq and seven targets from RPPA were calculated based on optimal cut-off points and individual receiving operating characteristics (ROC) analysis. The results of RNA-seq are shown in Table 4, which lists the cut-off points of high risk in each of the 29 targets. When the values reach the cut-off point of high risk, they receive a score of one; otherwise, a score of zero is accorded. The score was calculated in the total population of 44 samples for each target to obtain the results of AUC for distinguishing PD from CR (individual AUC, shown in the column of response) and AUC for PFS and OS (individual AUC, shown in the column of PFS, OS, and based on the criteria of optimal cutoff points). The above analyses were also conducted on targets from RPPA (Table 5). Finally, based on PFS, we filtered and arranged 29 targets with inclusion criteria: (1) response AUC ≥ 0.6 and PFS ≤ 0.05 in both TR and ET cohorts (AKT1S1, NSL1, ESRRA, TMEM81, CKB, SGEF); (2) statistical significance in PFS/OS of two cohorts (KRT19); (3) borderline significance between CR and PD (SCG5, CEACAM1, ALOX12B) and seven targets with inclusion criteria; (4) statistical significance in PFS of TR or ET cohort. Further cumulative ROC analysis was conducted based on the above criteria to obtain the optimal gene set (RNA-seq with ten targets and RPPA with five targets) of cumulative ROC values for response AUC and PFS AUC. The differentiations of candidate genes (RNA-seq with ten targets and RPPA with five targets) are summarized in Tables 6 and 7. borderline significance between CR and PD (SCG5, CEACAM1, ALOX12B) and seven targets with inclusion criteria; (4) statistical significance in PFS of TR or ET cohort. Further cumulative ROC analysis was conducted based on the above criteria to obtain the optimal gene set (RNA-seq with ten targets and RPPA with five targets) of cumulative ROC values for response AUC and PFS AUC. The differentiations of candidate genes (RNA-seq with ten targets and RPPA with five targets) are summarized in Tables 6 and 7.        p-value is estimated using Wilcoxon rank-sum test.

The Distinguishability of Candidate Gene Sets in Identifying PD versus CR
Based on a series of analyses and selections, candidate genes of transcriptomics (RNAseq with 10 targets) and proteomics (RPPA with five targets) were verified as critical gene sets that appeared in both SERMs/SERDs and AIs of ET resistance. The differentiations of RNA-seq expression with 10 targets in the treatment response cohort is illustrated in Figure 2C with a box plot. AKT1S1, SCG5, CEACAM1, and ALOX12B reached borderline significance between the CR and PD groups (p = 0.022 to 0.09). On the other hand, the differences in RPPA expression of five targets in the treatment response cohort is illustrated in Figure 2D with box plots. XRCC1 reached a significant difference between the CR and PD groups (p = 0.018). Furthermore, the expression of candidate gene sets (RNA-seq and RPPA) in the treatment response and endocrine therapy cohort is shown with a heatmap in Figure 2E,F.
We further verified the accuracy of the gene set for distinguishing the PD and CR groups. As mentioned previously, when the values of each target gene reach the cut-off point of high risk, they receive a score of one; otherwise, they receive a score of zero. The sum of the risk scores in the gene set was calculated using the ROC analysis. The results of ROC analysis showed that the gene set of RNA-seq with 10 targets displayed excellent discrimination in identifying PD and CR groups (AUC = 0.902, p < 0.001, Figure 3A,B). The gene set of RPPA with five targets also displayed excellent discrimination in identifying PD and CR groups (AUC = 0.92, p < 0.001, Figure 3A,B). Moreover, the risk score from the combination of RNA-seq with RPPA with 15 targets obtained outstanding discrimination in identifying PD and CR groups (AUC = 1, p < 0.001, Figure 3A,B). The distribution of each subject was further illustrated using a scatter plot with the x-axis of the RNA-seq risk score and the y-axis of the RPPA risk score. The results of the scatter plot showed a clear and separated region between the CR and PD groups ( Figure 3C). The above results indicated that the candidate gene set of 15 targets was critical in both SERMs/SERDs and AIs of ET resistance and provided excellent discrimination in identifying PD and CR groups.

The Candidate Gene Set in PFS/OS Outcomes of Treatment Response and Endocrine Therapy Cohort
The survival outcomes, especially for PFS, were critical clinical points for the evaluation of CR and PD. In the original groups of CR versus PD, the Kaplan-Meier analysis revealed a significant difference in PFS and OS between the CR and PD groups ( Figure 4A). To further verify the importance of the candidate gene sets, we evaluated the PFS/OS outcomes using the average and median risk score from Figure 3 (RNA-seq in PD groups: ≥5; RPPA in PD groups: ≥3; RNA-seq + RPPA in PD groups: ≥5) in the treatment response and endocrine therapy cohort. As Figure 4B,C show, based on the cut-off point of the risk score from RNA-seq or RPPA data, the Kaplan-Meier analysis demonstrated a significant difference in PFS and OS between the low-risk and high-risk groups in the treatment response and endocrine therapy cohort (the score from RPPA did not show statistical significance in the endocrine therapy cohort). Moreover, when inputting the risk score from the combination of RNA-seq with RPPA, the Kaplan-Meier analysis also showed a significant difference in PFS and OS between the low-risk and high-risk groups, not only in the treatment response but also in the endocrine therapy cohort ( Figure 4D). The above results indicate that the importance of a total of 15 targets was not only observed in the ability to discriminate but also in the outcomes of PFS and OS.
from the combination of RNA-seq with RPPA with 15 targets obtained outstanding discrimination in identifying PD and CR groups (AUC = 1, p < 0.001, Figure 3A,B). The distribution of each subject was further illustrated using a scatter plot with the x-axis of the RNA-seq risk score and the y-axis of the RPPA risk score. The results of the scatter plot showed a clear and separated region between the CR and PD groups ( Figure 3C). The above results indicated that the candidate gene set of 15 targets was critical in both SERMs/SERDs and AIs of ET resistance and provided excellent discrimination in identifying PD and CR groups. The risk score from RNA-seq, RPPA, and RNA-seq + RPPA analyses showed excellent and outstanding discrimination ability in identifying PD and CR groups. A p value less than * p < 0.05, ** p < 0.01, and *** p < 0.001 was considered statistically significant. Red line represents the ROC curve for a line of identity. (C) The scatter plot provides a quick view of the co-expression of both RNAseq and RPPA cumulative risk scores of study cohorts with different clinical outcomes. Blue circles represent the group of CR and red circles represents the group of PD. The risk score from RNA-seq, RPPA, and RNA-seq + RPPA analyses showed excellent and outstanding discrimination ability in identifying PD and CR groups. A p value less than * p < 0.05, ** p < 0.01, and *** p < 0.001 was considered statistically significant. Red line represents the ROC curve for a line of identity. (C) The scatter plot provides a quick view of the co-expression of both RNA-seq and RPPA cumulative risk scores of study cohorts with different clinical outcomes. Blue circles represent the group of CR and red circles represents the group of PD. nificance in the endocrine therapy cohort). Moreover, when inputting the risk score from the combination of RNA-seq with RPPA, the Kaplan-Meier analysis also showed a significant difference in PFS and OS between the low-risk and high-risk groups, not only in the treatment response but also in the endocrine therapy cohort ( Figure 4D). The above results indicate that the importance of a total of 15 targets was not only observed in the ability to discriminate but also in the outcomes of PFS and OS.

Gene Ontology Analysis (GO) of Candidate Gene Set from PD Groups
The candidate gene set of RNA-seq + RPPA was further analyzed through the correlation matrix and gene set enrichment analysis (GSEA) of gene ontology analysis to determine gene function. From the correlation matrix, we found most of the candidate gene sets were significantly correlated with each other in both cohorts. For example, the high-risk cut-off points for AKT1S1 were as follows: ≥−0.154, ESRRA: ≥0.281, NSL1: ≤−2.058. The correlation matrix showed that the expression of AKT1S1 was positively correlated with ESRRA (r = 0.57, 0.41) and negatively correlated with NSL1 (r = −0.46, −0.25) in both cohorts ( Figure 5A,B). Thus, the candidate gene sets were not only involved in survival outcomes but were also significantly correlated with each other in both cohorts. Finally, we elucidated the function of candidate gene sets by using the gene ontology analysis, including biological processes, cellular components, molecular functions, and pathways. Regarding biological processes, the candidate gene set was mainly associated with cell death and the metabolic process. Regarding cellular components and molecular functions, the candidate gene set was mainly associated with cytosol and kinase activity. Regarding pathways, the candidate gene set was mainly associated with neuregulin, DNA doublestrand break repair, and cancer-related signaling pathways ( Figure 5C). The above results provide a potential mechanism and therapeutic strategy for both SERMs/SERDs and AIs of ET resistance in the future. sets were significantly correlated with each other in both cohorts. For example, the highrisk cut-off points for AKT1S1 were as follows: ≥−0.154, ESRRA: ≥0.281, NSL1: ≤−2.058. The correlation matrix showed that the expression of AKT1S1 was positively correlated with ESRRA (r = 0.57, 0.41) and negatively correlated with NSL1 (r = −0.46, −0.25) in both cohorts ( Figure 5A,B). Thus, the candidate gene sets were not only involved in survival outcomes but were also significantly correlated with each other in both cohorts. Finally, we elucidated the function of candidate gene sets by using the gene ontology analysis, including biological processes, cellular components, molecular functions, and pathways. Regarding biological processes, the candidate gene set was mainly associated with cell death and the metabolic process. Regarding cellular components and molecular functions, the candidate gene set was mainly associated with cytosol and kinase activity. Regarding pathways, the candidate gene set was mainly associated with neuregulin, DNA double-strand break repair, and cancer-related signaling pathways ( Figure 5C). The above results provide a potential mechanism and therapeutic strategy for both SERMs/SERDs and AIs of ET resistance in the future. The main function of the candidate gene set was associated with cell death, metabolic process, kinase activity, neuregulin, DNA double−strand break repair, and cancer−related signaling pathways. A p value less than * p < 0.05, ** p < 0.01, and *** p < 0.001 was considered statistically significant.

Discussion
In this study, we integrated transcriptomic (RNA-seq) with proteomic (reverse-phase protein array, RPPA) data that appeared in both SERMs/SERDs and AIs from the clinical database of TCGA-BRCA of the GDC data portal to comprehensively analyze the critical targets in endocrine resistance. As previously described, most studies from TCGA mainly focused on the dataset of untreated tumors or datasets without treatment response [18][19][20][21][22]. In our studies, we used two cohorts in which one cohort showed treatment response (treatment response cohort: CR+PD) and the other cohort was without treatment response information but possessed the outcomes of survival (endocrine therapy cohort). In terms of clinical characteristics, the PD groups that were resistant to the first-line ET regimen correlated with a higher proportion of N status and higher stage. Resistance to ET was also correlated with poor PFS and OS. To elucidate the critical targets that appeared in both SERMs/SERDs and AIs from PD groups, a Venn diagram was produced in the region of overlap between SERMs/SERDs and AIs, and among CR, SERMs/SERDs, and AIs.
Regarding the mechanisms of endocrine resistance, loss of ER expression occurs in less than 10% of patients [23,24]. In most cases, endocrine resistance is driven by ligandindependent ER reactivation and is mainly focused on genomic alteration and activation of oncogenic pathways. To date, most studies have concentrated on the mechanisms of genomic alteration. For instance, mutations in ESR1 are observed in approximately 20% of recurrent ER+ breast cancers following long-term treatment with AIs or tamoxifen [9,17]; mutations in oncogenic pathways PI3K (including PIK3CA, PTEN, and AKT1) and MAPK (including NF1, KRAS/NRAS/HRAS, BRAF, and MAP2K1) are also frequently observed and endow endocrine resistance in breast cancer [9,17,25,26]. In addition to genomic alteration, the expression of specific targets is also involved in endocrine resistance; for example, high expression of FGFR1 and c-Myc was associated with tamoxifen resistance [27,28], and overexpression of RAD51 was associated with AI resistance [29]. However, little is known about the specific targets involved in the cross-resistance of ET in breast cancer.
Cross-resistance refers to resistance to several drugs or treatment strategies with a similar mechanism of resistance. In several circumstances, ESR1 mutations confer resistance to AI/SERMs but not to SERD [14], indicating that patients resistant to SERMs with ESR1 mutations may also not benefit from AIs but a clinical benefit from fulvestrant may be yielded [30]. In addition, the loss of NF1 causes resistance to ET in both SERMs and SERDs [25]. To date, few studies have been conducted on the cross-resistance of ET, including SERMs /SERDs and AIs in breast cancer. In this study, we provide the first evidence to elucidate specific targets of transcriptomics and proteomics in the crossresistance of ET in breast cancer. In the TCGA-BRCA database, the amount of data in RPPA was less than RNA-seq, such that seven targets in the Venn diagram were retained for further study, whereas 1470 mRNA targets were filtered using a series of workflows. Finally, we obtained the optimal gene set of cumulative ROC values for RNA-seq (10 targets) and RPPA (five targets).
From the gene ontology analysis, the candidate gene sets are involved in different categories, including cell death, metabolic process, kinase activity, neuregulin, DNA doublestrand break repair, and cancer-related signaling pathways. In these candidate gene sets, CEACAM1 expression is reduced or lost in breast cancer compared to normal tissues and controls the switch of epithelial-to-mesenchymal transition (EMT), which is involved in endocrine resistance [31][32][33][34]. The KRT locus of keratin family, KRT19, is a tumor suppressor gene in breast cancer and regulates drug sensitivity through cancer stem cell reprogramming and NOTCH signaling pathways [35][36][37]. TMEM81 is a transmembrane protein which is involved in the response to fulvestrant treatment [38]. One of the members, TMEM119, has been reported to promote the stemness of breast cancer and is negatively correlated with the survival of patients [39]. ESRRA is overexpressed in a variety of cancers, including breast cancer, and is associated with recurrence, poor prognosis, and tamoxifen/fulvestrant treatment response [40][41][42]. ERBB3 is a typical oncogenic RTK that is upregulated in breast cancer and is directly involved in the development of resistance to both tamoxifen and fulvestrant [43][44][45]. SRC is a non-receptor tyrosine kinase, participating in several oncogenic pathways and promoting tamoxifen resistance in breast cancer [46,47]. AKT1S1 is a substrate of AKT that binds the 14-3-3 protein and is involved in the oncogenic PI3K-Akt pathways, which are critical for endocrine resistance [48,49]. SGEF is a guanine-nucleotide exchange factor which is overexpressed in prostate cancer and its depletion enhanced invadopodia formation in breast cancer [50,51]. SCG5 encodes a neuroendocrine protein and is overexpressed in brain metastatic breast cancer and breast cancer stem cells [52,53]. ALOX12B encodes an enzyme to transfer arachidonic acid to 12R-hydroxyeicosatetraenoic acid. ALOX12B has been reported to promote the carcinogenesis of cervical cancer and is associated with an increased risk of breast cancer [54,55]. CKB is a creatine kinase brain isoform that promotes invasion and metastasis of breast cancer [56]. BID is a pro-apoptotic protein of the Bcl-2 family, participating in drug-induced apoptosis and tamoxifen resistance [57,58]. XRCC1 is a well-known DNA repair gene, and deficiency of XRCC1 promotes an aggressive phenotype of breast cancer [59,60]. NSL1 is a kinetochore-associated protein for cell division, normal development, and accumulation of tumor suppressor gene BRCA1 [61,62]. CHEK2 is also a tumor suppressor gene involved in DNA repair and endocrine resistance [63].
Based on the literature, the function and expression of the candidate gene sets were consistent with our differentiation analysis between CR and PD. For instance, upregulation of ERBB3 in resistance to both tamoxifen and fulvestrant was also consistent with the high expression level in PD groups. Moreover, increased ESRRA expression was associated with both tamoxifen and fulvestrant resistance and was consistent with the high expression level in PD groups. According to the individual optimal cut-off point for RNA-seq and RPPA expression of each target gene, we further showed that scores from the combination of candidate gene sets were higher in PD groups than CR groups, providing excellent discrimination in identifying PD and CR groups. Moreover, the survival outcomes were critical for the therapeutic response in which the PFS/OS was significantly reduced in the endocrine resistance of PD groups. As previously mentioned, most studies from TCGA mainly focused on the dataset without treatment response in which the authors utilize the outcomes of survival to predict ET-resistance-related genes based on the dataset without therapeutic response information [20][21][22]. However, in our studies, we obtained the candidate gene sets from two cohorts with the clinical characteristics of both treatment response and survival outcomes. In our studies, when the average and median scores from PD groups (high risk) were input, we observed a significant difference in PFS/OS not only in the treatment response but also in the endocrine therapy cohort, which indicates that at least 15% of the population may develop endocrine resistance in this cohort. However, the score from RPPA did not reach statistical significance in the endocrine therapy cohort of PFS/OS. The above results may be due to the characteristics of highly heterogeneous factors in endocrine resistance [17]; only five targets in RPPA could not accurately reflect and discriminate the complexity of endocrine resistance, which requires more abundant markers and more functional categories, such as targets in RNA-seq.
The present study still has certain limitations. Due to the limited number of medical records of treatment responses, the number of subjects in the arm of the treatment response cohort was quite small. In the future, we will verify candidate gene sets in a larger number of clinical participants with detailed treatment response records. However, our findings reveal potential targets in the cross-resistance of SERMs/SERDs and AIs based on transcriptomics and proteomics, thereby providing a potential therapeutic approach for endocrine resistance in breast cancer in the future.

Study Population
All data were downloaded from the TCGA-BRCA project in the GDC database using TCGAbiolinks packages. Breast cancer patients who were hormone-positive and had received endocrine therapy (ET) were recruited, and those with incomplete clinical characteristics, RNA sequencing (RNA-seq), and reverse-phase protein array (RPPA) expression profiles were excluded. A total of 44 patients with a record of treatment response (treatment response cohort, TR cohort) and 449 patients (endocrine therapy cohort, ET cohort, without a record of treatment response but with survival outcomes) were analyzed. Both cohorts were used to determine the candidate gene sets and the risk score based on the TR cohort. Baseline clinical characteristics included age at diagnosis, TNM staging, and the pathological stage. The survival endpoints included progression-free survival (PFS) and overall survival (OS).

RNA-Sequencing and Reverse Phase Protein Array (RPPA) Analysis
The RNA-seq expression profile was determined experimentally using the Illumina HiSeq 2000 RNA Sequencing platform at the University of North Carolina TCGA genome characterization center. Differential gene expression (DGE) analysis was conducted to obtain standardized reading count data and to conduct a statistical analysis to identify quantitative changes in gene expression levels based on RNA-seq level 3 data. RNA-seq expression was reported in reads per kilobase million (RPKM), and the current dataset shows the gene-level transcription estimates as log2(x + 1)-transformed RNA-seq by expectationmaximization (RSEM) normalized counts. RPPA is a high-throughput antibody-based technique for protein analysis. RPPA expression was first derived from a single curve using all the samples on a slide with the signal intensity as the response variable and the dilution steps as independent variables. The fitted curve was plotted with the signal intensities on the y-axis and the log2-concentration of proteins on the x-axis for diagnostic purposes. The current dataset abstracted the level 3 RPPA normalized data across all proteins and samples [64].

Individual Receiving Operating Characteristics (IROC)
The individual optimal cut-off points for RNA-seq and RPPA expression for each target gene were derived by receiving operating characteristics (ROC) using a treatment response cohort. Each data point was used to dichotomize the study population into high-risk and lowrisk groups according to the estimated outcome (initial ET response for TR cohort or survival status for ET cohort). The case numbers of both risk groups and estimated outcome were summarized into four values including true positive (TP), true negative (TN), false negative (FN), and false positive (FP). TP indicates the high-risk subjects with unfavored outcomes (i.e., PD, metastases, or death), TN indicates low-risk subjects with the favored outcome (i.e., CR, disease-free, or survived), FP indicates high-risk subjects with the favored outcome, and FN indicates low-risk subjects with the unfavored outcome. Then, the area under the curve (AUC) of each data point was computed using TP, TN, FP, and FN values based on the area formula as shown in equation [65]. In general, 0.7 ≤ AUC ≤ 0.8 (acceptable discrimination); 0.8 ≤ AUC ≤ 0.9 (excellent discrimination); 0.9 ≤ AUC ≤ 1.0 (outstanding discrimination).
The cut-off point that provided the maximized accuracy metric was selected as the optimal cut-off point. An individual AUC (IAUC) of the corresponding optimal cut-off point was computed, and a higher IAUC indicates better predictive ability for the initial ET response. The high-risk characteristic of each target gene was defined by IROC, and an individual risk score of one was scored if the study population obtained high-risk characteristics for the corresponding target genes. In contrast, the individual risk score was zero.

Cumulative Risk Score and Scatter Plot
Subsequently, the individual risk scores of each target gene were combined to generate a cumulative risk score. The specificity and sensitivity of the risk score from RNA-seq, RPPA, and RNA-seq + RPPA were determined using the ROC curve and AUC value using GraphPad Prism 8 (GraphPad Software, La Jolla, CA, USA). The co-expression of the cumulative risk score based on RNA-seq and RPPA was illustrated using a scatter plot. The x-axis indicates the RNA-seq cumulative risk score, the y-axis indicates the RPPA cumulative risk score, and markers with different shapes and colors indicate different clinical outcomes. Furthermore, the study cohorts were dichotomized into high-and lowrisk subgroups based on the mean or median cumulative risk score. Subjects who obtained a cumulative risk score greater than or equal to the mean or median value of PD groups were defined as a high-risk subgroup, and those who obtained a cumulative risk score lower than the mean or median value were defined as low-risk subgroups. The survival difference between the high-risk and low-risk subgroups was estimated to validate the predictive ability of the cumulative risk score for survival outcomes.

Statistical Analysis and Correlation Matrix
Baseline characteristics were summarized as median, range, frequency, and percentage. Differences in baseline characteristics between subgroups were compared using Fisher's exact and Wilcoxon rank-sum tests. The RNA-seq and RPPA expression of the study cohorts was summarized as median and range, and the difference between groups was estimated using the Wilcoxon rank-sum test. The Venn diagram was conducted based on the criteria of the z-score threshold ± 2.0. The differentiations of the candidate genes in the treatment response cohort were visualized using a box plot. The middle line within the box represents the median and the upper and bottom borders of the box represent the interquartile range. The upper and lower whiskers represent the minimum and maximum values before the fence (Q1/Q3 + 1.5*IQR), and the dots represent the maximum or minimum outliers of the corresponding subgroup. The expression of candidate genes in the treatment response and endocrine therapy cohort was illustrated using a heatmap and annotated with either the treatment response to endocrine therapy or PFS status. The survival rates of the ET response and cumulative risk score subgroups were estimated using the Kaplan-Meier estimator, and the survival difference between subgroups was tested using the log-rank test. Both p-values of PFS and OS were estimated using a log-rank test. The distribution and correlation between each target gene were summarized using a correlation matrix, and the correlation between each target gene was tested using the Pearson correlation test. The diagonal shows the histograms for each gene together with the density functions, the lower diagonal shows the scatter plots, and the upper diagonal shows the Pearson correlation coefficients. Moreover, the x-axis indicated the expression of a target from up to down and the y-axis indicated the expression of a target from right to left. All p-values were two-tailed, and a p-value less than 0.05 was considered statistically significant. All analyses were performed using R 4.0.2 software (R Core Team, 2021) [64] and GraphPad Prism 8 (GraphPad Software, La Jolla, CA, USA).

Gene Set Enrichment Analysis (GSEA)
Gene set enrichment analysis (GSEA) was conducted using the target genes to further explore related gene functions, including biological processes (BP), cellular components (CC), molecular functions (MF), and enrichment pathways. GSEA was performed using the TCGAbiolinks package in R software, and a comprehensive set of efficient and concise annotation tools was derived from the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [66]. The cutoff criterion for GSEA was set at a false discovery rate (FDR) < 0.05 [64].

Conclusions
In conclusion, we conducted comprehensive transcriptomic (RNA-seq) and proteomic (RPPA) analyses of ET-resistance-related targets from the clinical TCGA-BRCA PanCancer Atlas database. With a series of analyses and selections from both cohorts, we elucidated a candidate gene set of 15 targets that was critical in both SERMs/SERDs and AIs of ET resistance. The candidate gene set provided excellent discrimination in identifying PD and CR groups and was significantly correlated with the survival outcomes of PFS/OS in the treatment response and endocrine therapy cohort. The present study still has certain limitations: (1) some uncertainties and variable factors (different methodologies) exist in the approach with information and computational technology; (2) we lack the time and space characteristics for fully elucidating the interaction and function of targets; (3) we did not take into account the factor of post-translational modification sites (PTMs); (4) there are differences among transcription, translation, and PTM level (not all RNAs are coding for proteins). To overcome the above limitations, we will verify candidate gene sets in a larger number of clinical participants with detailed treatment response records, in vitro cell lines,