Comprehensive Transcriptomic Analysis Identifies ST8SIA1 as a Survival-Related Sialyltransferase Gene in Breast Cancer

Hypersialylation caused by the overexpression of sialyltransferases (STs) is a common feature in cancer that is associated with several characteristics of tumorigenesis. Thus, identifying cancer-associated STs is critical for cancer therapy. However, ST screening has been frequently conducted in cell line models. In this study, we conducted a comprehensive analysis of STs in the clinical database and identified the STs related with the survival of breast cancer patients. RNA sequencing (RNA-Seq) data of 496 patients were obtained from The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA). Of the eight mapped STs, ST3GAL5, and ST8SIA1 met the acceptable area under the curve (AUC) criteria for overall survival (OS). Using Kaplan–Meier methods, we determined that high expression of ST8SIA1 was associated with poor 10-year OS in all patients, triple-negative breast cancer (TNBC), and non-TNBC patients, and poor disease-free survival (DFS) rates particularly in TNBC. ST8SIA1 also had superior AUC values in terms of OS/DFS. High ST8SIA1 levels showed a higher risk for poor OS in different groups of patients and a higher risk for poor DFS particularly in TNBC. In summary, we conducted a comprehensive analysis of STs from the clinical database and identified ST8SIA1 as a crucial survival-related ST, which might be a potential therapeutic target for breast cancer and TNBC patients.


Introduction
Glycosylation is the main post-translational modification (PTM) in which glycans are conjugated to a protein at asparagine (N-linked) or serine/threonine (O-linked) residue, contributing to protein cancer patients were also included in the analysis. A total of 496 patients with comprehensive baseline characteristics and RNA-Seq expression data were analyzed. The baseline characteristics included age, ethnicity, molecular subtypes, pathological stage, lymph node invasion status, radiation, and pharmaceutical treatments. The included patients were divided into triple-negative breast cancer (TNBC) and non-triple-negative breast cancer (non-TNBC), according to the molecular subtypes of breast cancer. Overall survival (OS) was considered as a primary endpoint and disease-free survival (DFS) was considered as a surrogate endpoint.

RNA Sequencing Analysis
A total of 20 STs genes were included in the initial analysis. Only 8 STs genes were mapped by the RNA-Seq expression. Therefore, we estimated the different RNA-Seq expression levels regarding 8 STs genes between TNBC and non-TNBC groups, and those with significant differences between the subgroups were selected as target genes for later discriminant analysis. Level 3 data were downloaded from the TCGA data coordination center. The gene expression profile was determined experimentally using the Illumina HiSeq 2000 RNA Sequencing platform at the University of North Carolina TCGA genome characterization center. We conducted a differential gene expression (DGE) analysis to obtain standardized reading count data and conducted a statistical analysis to identify quantitative changes in gene expression levels based on RNA sequencing data. RNA-Seq expression was reported in reads per kilobase million (RPKM), and the current data set shows the gene-level transcription estimates as log2(x + 1)-transformed RSEM (RNA-Seq by Expectation-Maximization) normalized count.

Statistical Analysis
The baseline characteristics of the study population were presented as mean (±standard deviation, SD) or frequency (percentages). The difference in distribution between the TNBC and non-TNBC groups was estimated using the independent two-sample t-test, chi-squared test, or Fisher's exact test. The RNA-Seq expression of genes in all patients was illustrated using a heatmap. The RNA-Seq expression in the entire study population, the TNBC group, and the non-TNBC group was expressed as mean ± SD, and the comparison of gene expression levels between TNBC and non-TNBC was evaluated using the independent two-sample t-test. Genes with statistically significant differences in RNA-Seq expression levels between TNBC and non-TNBC were chosen as target genes. Furthermore, receiver operating characteristic (ROC) analysis was used to estimate the optimal cutoff point of RNA-Seq expression for each target gene according to the OS and DFS status (death was set as an outcome event). All cut points were compared using the area under the curve (AUC), and an AUC greater than 0.6 was considered as an acceptable cutoff point. The patients in the entire study population, the TNBC group, and the non-TNBC group were divided into low and high expression levels according to the optimal cutoff points. The survival difference between low and high expression levels was illustrated using the Kaplan-Meier estimator and tested using the log-rank test. Univariate and multivariate Cox proportional hazard regression models were used to estimate the impact of RNA-Seq expression levels and baseline characteristics on OS and DFS in breast cancer. Hazard ratio (HR) and 95% confidence interval (CI) were computed. Moreover, differential expression analysis was conducted for the target genes with acceptable prediction ability for OS and PFS. All p-values were two-sided, and the statistical significance level for all tests was set at 0.05. All statistical analyses were performed using the computing environment R 4.0.2 (R Core Team, 2020, Vienna, Austria).

Gene Set Enrichment Analysis (GSEA) and Gene Ontology (GO)
Gene Set Enrichment Analysis (GSEA) is a computational approach that enables one to determine whether an a priori defined set of candidate genes shows statistically significant, concordant differences between two biological states or phenotypes. GSEA was conducted using the target genes to further explore related Gene Ontology (including BP: biological process, CC: cellular components, and MF: molecular functions) or enrichment pathways. The analysis was performed using TCGAbiolinks packages in R software. The comprehensive set of efficient and concise annotation tools were derived from the Database for Annotation, Visualization and Integrated Discovery (DAVID) [24]. The cutoff criterion for the GSEA was set at a false discovery rate (FDR) < 0.05.

Characteristics of the Study Subjects from TCGA-BRCA
In this study, 496 samples were collected from TCGA-BRCA of the GDC data portal. These breast cancer samples were further divided into the triple-negative breast cancer (TNBC, n = 110) and non-TNBC (n = 386) groups. Baseline characteristics of the study population group by TNBC and non-TNBC are summarized in Table 1. A total of 110 TNBC patients with a mean age of 55.8 ± 12.4 years and 386 non-TNBC patients with a mean age of 56.9 ± 13.1 years were analyzed. Most of the patients were not Hispanic or Latino. There were significant differences in the distribution of the proportion of the pathological stage (p = 0.013) and lymph node invasion (p < 0.001) between TNBC and non-TNBC.

The Difference in RNA-Seq Expression of Sialyltransferase Genes
To verify the expression of different sialyltransferase genes in breast cancer, RNA-Seq expression of different sialyltransferase genes between TNBC and non-TNBC patients was calculated by reads per kilobase per million mapped reads (RPKM) and is summarized in Table 2. A total of 20 STs genes were included in the initial evaluation. Only 8 STs genes were mapped by the RNA-Seq expression. Overall, ST3GAL5 had significantly higher RNA-Seq expression levels in the TNBC group (TNBC vs. non-TNBC: 3.86 ± 0.35 vs. 3.62 ± 0.30, p < 0.001). However, ST6GALNAC4 (TNBC vs. non-TNBC: 3.66 ± 0.47 vs. 4.16 ± 0.36, p < 0.001) and ST8SIA1 (TNBC vs. non-TNBC: 0.75 ± 0.76 vs. 1.04 ± 0.87, p < 0.001) had significantly higher RNA-Seq expression levels in the non-TNBC group. The heatmap of RNA-Seq expression of ST3GAL2, ST3GAL3, ST3GAL5, ST6GAL1, ST6GALNAC3, ST6GALNAC4, ST8SIA1, and ST8SIA3 in each patient is illustrated in Figure 1. In the horizontal scale bar, the green color indicates a lower level of RNA-Seq expression and the red color indicates a higher level of RNA-Seq expression. Thus, ST3GAL5, ST6GALNAC4, and ST8SIA1 were candidate genes for differential expression between TNBC and non-TNBC patients.  p-value is estimated using an independent two-sample t-test.

Distinction of OS by Sialyltransferase Gene in Breast Cancer
From the RNA-Seq expression results, ST3GAL5, ST6GALNAC4, and ST8SIA1 were the candidate genes that showed statistically significant differential expression between TNBC and non-TNBC patients. Thus, we utilized OS status as the outcome to further determine the cutoff point and the ability of these candidate genes to discriminate OS and DFS. The ROC analysis results are shown in Table 3. ROC analysis was conducted to obtain the optimal cutoff point of RNA-Seq expression of ST3GAL5, ST6GALNAC4, and ST8SIA1 according to the primary endpoint of OS status and the surrogate endpoint of DFS to confirm their importance. The ROC analysis results indicated that the optimal cutoff points of OS for ST3GAL5, ST6GALNAC4, and ST8SIA1 were 3.63, 4.01, and 1.31, respectively. ST8SIA1 (AUC = 0.68, sensitivity = 0.71, specificity = 0.73) showed the best predictive performance, followed by ST3GAL5 (AUC = 0.62, sensitivity = 0.51, specificity = 0.71) and ST6GALNAC4 (AUC = 0.54, sensitivity = 0.71, specificity = 0.44). However, only ST3GAL5 and ST8SIA1 met the acceptable criteria (AUC > 0.6).

Distinction of OS by Sialyltransferase Gene in Breast Cancer
From the RNA-Seq expression results, ST3GAL5, ST6GALNAC4, and ST8SIA1 were the candidate genes that showed statistically significant differential expression between TNBC and non-TNBC patients. Thus, we utilized OS status as the outcome to further determine the cutoff point and the ability of these candidate genes to discriminate OS and DFS. The ROC analysis results are shown in Table 3. ROC analysis was conducted to obtain the optimal cutoff point of RNA-Seq expression of ST3GAL5, ST6GALNAC4, and ST8SIA1 according to the primary endpoint of OS status and the surrogate endpoint of DFS to confirm their importance. The ROC analysis results indicated that the optimal cutoff points of OS for ST3GAL5, ST6GALNAC4, and ST8SIA1 were 3.63, 4.01, and 1.31, respectively. ST8SIA1 (AUC = 0.68, sensitivity = 0.71, specificity = 0.73) showed the best predictive performance, followed by ST3GAL5 (AUC = 0.62, sensitivity = 0.51, specificity = 0.71) and ST6GALNAC4 (AUC = 0.54, sensitivity = 0.71, specificity = 0.44). However, only ST3GAL5 and ST8SIA1 met the acceptable criteria (AUC > 0.6).

High ST8SIA1 Expression Was Correlated with Poor OS/DFS in TNBC Patients
The study population was divided into high and low ST3GAL5 and ST8SIA1 RNA-Seq expression levels according to the optimal cutoff point of each gene. The OS and DFS analyses of ST3GAL5 and ST8SIA1 RNA-Seq expression levels were evaluated using the Kaplan-Meier method and are illustrated in Figures 2 and 3, respectively. To examine the impact on OS/DFS, the study population was divided into three groups: all patients, TNBC, and non-TNBC. There was no statistically significant difference in the expression of ST3GAL5 in all patients, TNBC, and non-TNBC of OS/DFS status ( Figure 2). However, high expression of ST8SIA1 was statistically significantly correlated with poor OS in all ( Figure 3A, p < 0.001) and non-TNBC patients ( Figure 3C, p < 0.001). Moreover, high expression of ST8SIA1 was associated with poor OS (p = 0.007) and DFS (p = 0.026) in TNBC patients ( Figure 3B). ROC analysis was also conducted to verify the ability of ST3GAL5 and ST8SIA1 to discriminate OS/DFS of TNBC and non-TNBC patients. The Kaplan-Meier results showed that ST8SIA1 had superior AUC values compared to ST3GAL5 (Table 4)

Cox Proportional Hazard Regression Model of ST8SIA1
According to the Kaplan-Meier and ROC results, ST8SIA1 RNA-Seq expression levels that reached statistically significant survival differences were further included in the Cox proportional hazard regression model to estimate the impact of ST8SIA1 levels and baseline characteristics on OS and DFS in all patients, TNBC patients, and non-TNBC patients. The Cox proportional hazard regression analyses for OS and DFS are presented in Tables 5 and 6, respectively. In Table 5, the univariate analysis demonstrated that patients with high ST8SIA1 levels (HR = 8.57, 95% CI = 2.84, 25.9, p < 0.001) had a higher risk for poor OS compared to patients with low ST8SIA1 levels in the all patients group. Multivariate analysis also indicated that patients with high ST8SIA1 levels (HR = 9.95, 95% CI = 3.25, 30.50, p < 0.001), those aged over 50 years (HR = 3.49, 95% CI = 1.09, 11.1, p = 0.035), those at pathological stage II (HR = 14.4, 95% CI = 1.55, 134, p = 0.019), and those at stage III (HR = 20.40, 95% CI = 1.47, 283, p = 0.025) were at higher risk for poor OS.
The study population was further divided into TNBC and non-TNBC groups. In the univariate analysis of the TNBC group, high ST8SIA1 levels still showed a higher risk for poor OS (HR = 11.9, 95% CI = 1.23, 114.00, p = 0.032) compared to patients with low ST8SIA1 levels, whereas in the non-

Cox Proportional Hazard Regression Model of ST8SIA1
According to the Kaplan-Meier and ROC results, ST8SIA1 RNA-Seq expression levels that reached statistically significant survival differences were further included in the Cox proportional hazard regression model to estimate the impact of ST8SIA1 levels and baseline characteristics on OS and DFS in all patients, TNBC patients, and non-TNBC patients. The Cox proportional hazard regression analyses for OS and DFS are presented in Tables 5 and 6, respectively. In Table 5, the univariate analysis demonstrated that patients with high ST8SIA1 levels (HR = 8.57, 95% CI = 2.84, 25.9, p < 0.001) had a higher risk for poor OS compared to patients with low ST8SIA1 levels in the all patients group. Multivariate analysis also indicated that patients with high ST8SIA1 levels (HR = 9.95, 95% CI = 3.25, 30.50, p < 0.001), those aged over 50 years (HR = 3.49, 95% CI = 1.09, 11.1, p = 0.035), those at pathological stage II (HR = 14.4, 95% CI = 1.55, 134, p = 0.019), and those at stage III (HR = 20.40, 95% CI = 1.47, 283, p = 0.025) were at higher risk for poor OS. The study population was further divided into TNBC and non-TNBC groups. In the univariate analysis of the TNBC group, high ST8SIA1 levels still showed a higher risk for poor OS (HR = 11.9, 95% CI = 1.23, 114.00, p = 0.032) compared to patients with low ST8SIA1 levels, whereas in the non-TNBC group, both univariate and multivariate analysis showed that high ST8SIA1 levels were associated with higher risk for poor OS (univariate: HR = 8.01, 95% CI = 2.26, 28.40, p = 0.001; multivariate: HR = 11.30, 95% CI = 3.04, 41.90, p < 0.001). Furthermore, diagnosis age, pathological stage, and radiation also impacted OS (Table 5). Interestingly, regarding DFS (Table 6), high ST8SIA1 levels were associated with a higher risk for poor DFS in both univariate (HR = 4.30, 95% CI = 1.07, 17.30, p = 0.04) and multivariate analysis (HR = 5.13, 95% CI = 1.03, 25.50, p = 0.046) only in the TNBC groups. Thus, high expression of ST8SIA1 could increase the risk for poor OS in breast cancer patients and particularly increased the risk for poor DFS in TNBC patients.

Gene Ontology Analysis (GO) of Candidate ST8SIA1 Gene
The survival-related STs of ST8SIA1 were further analyzed through the GSEA of Gene Ontology analysis to determine the gene function ( Figure 4). The Gene Ontology analysis included BP: biological process, CC: cellular components, and MF: molecular functions. Regarding BP, ST8SIA1 was associated with intrinsic glycosphingolipid and glycosylation-related processes with a false discovery rate (FDR) of 0.001-0.004. Regarding CC, ST8SIA1 was associated with organelle and Golgi membrane/apparatus (FDR = 0.001-0.008), which is the location of STs. Regarding MF, ST8SIA1 was associated with intrinsic STs activity (FDR = 0.0004-0.0009). The results of GO analysis confirmed the genetic function of ST8SIA1.
was associated with intrinsic glycosphingolipid and glycosylation-related processes with a false discovery rate (FDR) of 0.001-0.004. Regarding CC, ST8SIA1 was associated with organelle and Golgi membrane/apparatus (FDR = 0.001-0.008), which is the location of STs. Regarding MF, ST8SIA1 was associated with intrinsic STs activity (FDR = 0.0004-0.0009). The results of GO analysis confirmed the genetic function of ST8SIA1.

Discussion
Hypersialylation is a feature of various cancers in which aberrant expression of STs increases sialylation expression on tumor cell surfaces. Hypersialylation further facilitates several aspects of tumorigenesis including (1) immune evasion through immune inhibitory Siglecs (Immune inhibitory receptors), (2) enhancement of tumor proliferation and metastasis through cytoskeleton-related protein, (3) promotion of tumor angiogenesis through the interaction between VEGF and polysialic acid, and (4) resistance to apoptosis through anti-apoptosis/kinase inhibitors [14]. The above effects of hypersialylation on cancer affect the survival of patients. Thus, it is crucial to elucidate survival-related STs in breast cancer and provide another therapeutic approach.
In our studies, we used RNA-Seq data from the clinical database of TCGA-BRCA of the GDC data portal to comprehensively analyze the role of STs in breast cancer. Prior to this study, a comprehensive analysis of STs had only been conducted in cell lines. In the cell line model, the author analyzed the transcript level of sialic acid metabolism and glycosylation (SAMG) genes including 20 STs in MCF10A, T-47D, and MDA-MB-231 cells [21,22]. In compartment 2 of 20 STs, MDA-MB-231 cells displayed higher levels of ST3GAL5 and ST8SIA1 than MCF10A/T-47D cells and no change was observed in ST6GALNAC4. Intriguingly, sialylation was lower in MDA-MB-231 than in T-47D cells due to metabolic flux-based control of sialylation [21,22]. In a proteomic analysis of different breast cancer cell lines, lectin chromatography/mass spectrometry was used to compare glycosylation profiles between luminal and TNBC cell lines. In contrast to the luminal type, a number of TNBC-specific glycosites were enriched with sialic acids and a concomitant increase in STs gene expression was observed [23]. The results of the proteomic analysis indicated that TNBC-specific sialylation might be a therapeutic target for this aggressive tumor subtype.
In our RNA-Seq analysis of the clinical data set, we identified three STs genes that were significantly different between TNBC and non-TNBC. ST3GAL5 was higher in the TNBC group as in MDA-MB-231 cells, whereas ST6GALNAC4 and ST8SIA1 were higher in the non-TNBC group, which was somewhat different from the cell line model, indicating the difference of transcript level between cell line and clinical patients. In order to clarify the potential of STs as a therapeutic target, their effect on OS was investigated, and we found that ST8SIA1 significantly impacted OS in all patients, the TNBC group, and the non-TNBC group. Moreover, ST8SIA1 specifically affected DFS in TNBC patients, indicating that ST8SIA1 might be a therapeutic target for this aggressive TNBC subtype. ST8SIA1, also known as GD3 synthase gene (GD3S), is a key enzyme in the biosynthesis of b-/c-series gangliosides (GD3, GD2, and GT3) expressed on the cell surface. Gangliosides are glycosphingolipids carrying one or more sialic acid residues and are located on the outer leaflet of the plasma membrane to interact with receptor tyrosine kinases (RTKs) [25,26]. High expression of ST8SIA1 or GD3S has been reported to be associated with poor histologic grade/survival in estrogen-receptor (ER)-negative breast cancer and with better prognosis in ER-positive breast cancer [27,28]. However, in our study, the impact of ST8SIA1 on survival was not limited to ER-negative breast cancer. High expression of ST8SIA1 was significantly correlated with poor OS in all patients, the TNBC group, and the non-TNBC group. In Cox proportional hazard regression, high expression of ST8SIA1 was also significantly associated with higher risk of poor OS in all patients (HR = 8.57), the TNBC group (HR = 11.9), and the non-TNBC group (HR = 8.01). Thus, ST8SIA1 is a crucial survival-related ST in breast cancer, regardless of the ER receptor status. Moreover, high expression of ST8SIA1 was also significantly associated with poor DFS and a higher risk for poor DFS (HR = 4.30) in TNBC patients particularly. This TNBC-specific effect was similar to that found in other studies indicating an association between ST8SIA1 and ER-negative breast cancer. The authors of several studies reported that ST8SIA1 could promote tumor growth, metastasis, and chemoresistance through the FAK/Akt/mTOR and Wnt/β-catenin signaling pathways in TNBC cell lines [18,29,30]. This specific effect of ST8SIA1 in TNBC cell lines may indicate the potential mechanisms underlying the association of higher ST8SIA1 levels with poor DFS in TNBC patients. Thus, from the literature and our comprehensive study, survival-related STs of ST8SIA1 may be a novel and crucial therapeutic target in breast cancer, especially for TNBC patients.
Several groups of sialyltransferase inhibitors (STIs) have been developed, including (1) analogs of sialic acids/CMP-sialic acid/cytidine/lithocholic acid, (2) oligosaccharide derivatives, (3) aromatic compounds, and (4) flavonoids [31,32]. Most STIs have poor cell membrane permeability and bioavailability that fail to meet the criteria for clinical utility. Few STIs have great permeability across the cell membrane. For instance, lithocholic acid analogs of AL10 exert great permeability and lower cytotoxicity, and they inhibit adhesion, migration, proliferation, and invasion of lung and breast cancer cells [11,16]. However, most STIs were developed against α2,3/2,6-STs, and fewer were developed to inhibit polysialyltransferases of α2,8-STs, especially ST8SIA1. Triptolide was isolated from the perennial vine Tripterygium wilfordii and displayed several anticancer cell properties [33,34]. Triptolide was also found to indirectly inhibit ST8SIA1 in melanoma cells [35]. A specific STI against ST8SIA1 is an unmet clinical need for breast cancer and TNBC treatment to address in a future study.
The present study still has some limitations. In our comprehensive transcriptomic analysis of STs from the TCGA-BRCA of the GDC data portal, 20 STs were included initially, but only 8 STs were mapped by the RNA-Seq expression. However, our studies have identified ST8SIA1 as a crucial survival-related ST that is associated with higher risk for poor OS in all patients and DFS in TNBC patients particularly. Further in vitro and clinical studies are required to verify the expression and function of ST8SIA1 beside the GO analysis and other members of STs to provide another therapeutic approach for breast cancer in the future.

Conclusions
In conclusion, we conducted a comprehensive transcriptomic analysis of STs using RNA-Seq data from the clinical TCGA-BRCA of the GDC data portal. With a different calculation approach, we identified ST8SIA1 as a crucial survival-related ST. High expression of ST8SIA1 displayed significant association with poor OS and indicated higher risk for poor OS in all breast cancer patients, TNBC, and non-TNBC patients. Furthermore, high expression of ST8SIA1 exerted a significant association with poor DFS and higher risk for poor DFS in TNBC patients particularly. This study provides a valuable analysis of STs based on clinical patient data and indicates ST8SIA1 as a potential therapeutic target for breast cancer and TNBC patients, particularly in the future.