1. Introduction
According to 2022 data (https://gco.iarc.who.int/en, accessed on 15 November 2024), colorectal cancer (CRC) ranks as the third most commonly diagnosed cancer in both men and women and is the second leading cause of cancer-related deaths globally. According to the World Health Organization (WHO), 70% of all colorectal cancer cases are localized in the colon.
One of the hallmarks of cancer, including colon cancer, is its dynamic transcriptional landscape and usage of alternative promoters. Expression of many human protein-coding genes is regulated by alternative promoters. Recent findings from an extensive pan-cancer transcriptome analysis revealed differential expression of two alternative PHF19 gene promoters in malignant versus non-malignant gut mucosa [1].
The promoter upregulated in colon and rectal cancer gives rise to the PHF19-207 transcript, suggesting a potential tumor-promoting function. The PHF19 gene encodes PHD finger protein 19, a component of the polycomb repressive complex 2 (PRC2), which is involved in H3K27 methylation, a chromatin modification linked to transcriptional repression [2]. The target genes of PHF19 protein are implicated in processes such as proliferation, differentiation, angiogenesis, and the organization of the extracellular matrix [3]. The role of PHF19 protein in malignant transformation has been demonstrated in several malignancies, with its tumor-promoting role in colorectal cancer revealed only recently [4,5].
The PHF19 gene is located at 9q33.2 and encompasses 39,245 bp. A set of 14 transcripts has been identified from this gene, with the major transcript PHF19-202 (ENST00000373896) encoding a 580-amino-acid protein. According to the Ensembl database, the majority of other transcripts are either truncated or have an undefined coding sequence. Elements of the non-coding transcriptome are increasingly recognized as key contributors to the complexity of the genome; however, their specific roles remain largely unexplored.
The objective of our study was to examine the expression of PHF19-207 in colon cancer, assess its potential as an early biomarker for colorectal cancer, and evaluate its functional implications using in silico tools, as the function of this transcript has not been previously characterised.
2. Materials and Methods
2.1. In Silico Analysis of the PHF19 Gene Promoters
The promoter sequences of the PHF19 gene were defined as 1 kb regions both upstream and downstream of the two transcription start sites (TSSs) identified as differentially active in colorectal cancer [1]. These promoter sequences were retrieved in FASTA format from the human GRCh38.p13 assembly using the Ensembl genome browser. To analyze characteristic motifs within these sequences, the Motif Finder tool of the Integrative Genomics Viewer (IGV) program was employed [6]. The distribution of GC boxes was examined using the MethPrimer 2.0 tool (http://www.urogene.org/methprimer/) [7].
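The promoter definition above (1 kb on each side of a TSS) can be sketched as a small coordinate helper. This is only an illustration: the function names and the example TSS coordinate are hypothetical and are not taken from the study, and 1-based inclusive coordinates are assumed.

```python
from typing import NamedTuple


class Window(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str


def promoter_window(chrom: str, tss: int, strand: str, flank: int = 1000) -> Window:
    """Return the region spanning `flank` bp upstream and downstream of a TSS."""
    if strand not in {"+", "-"}:
        raise ValueError("strand must be '+' or '-'")
    # The window is symmetric, so the coordinates do not depend on the strand;
    # the strand only matters when the sequence is oriented (see below).
    return Window(chrom, max(1, tss - flank), tss + flank, strand)


def reverse_complement(seq: str) -> str:
    """Orient a minus-strand promoter sequence 5'->3' relative to the gene."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


# Hypothetical TSS position, used only to show the arithmetic.
window = promoter_window("9", 5000, "-")
print(window)
```

For a gene on the minus strand, the sequence retrieved for the window would be reverse-complemented before motif analysis so that "upstream" and "downstream" keep their meaning relative to the direction of transcription.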
Four available bioinformatic tools were utilized to predict the presence of consensus sequences for potential transcriptional regulator binding within the PHF19 gene promoter: Alggen PROMO 2.0 (https://alggen.lsi.upc.es, https://bio.tools/alggen), AliBaba 2.1 (https://gene-regulation.com/pub/programs/alibaba2/), CiiiDER (https://ciider.com, https://bio.tools/CiiiDER), and TFBIND (https://tfbind.hgc.jp) [8,9,10]. Each of these four tools employs a different algorithm, and their combined use enhances prediction robustness and allows cross-validation of results. Default query parameters and human libraries were applied in these analyses to ensure optimal performance of the tools and reproducibility. A transcriptional regulator was considered only if it was predicted by at least two of the four algorithms. The expression levels of the identified regulators in colon cancer and normal gut mucosa were analyzed using the Gene Expression Profiling Interactive Analysis 2 (GEPIA2) tool (http://gepia.cancer-pku.cn/).
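The "at least two algorithms" rule can be expressed as a simple vote across tools. The sketch below assumes each tool's output has already been reduced to a list of regulator names; the predictions shown are invented placeholders, not results from the study.

```python
from collections import defaultdict


def consensus_regulators(predictions: dict[str, list[str]], min_tools: int = 2) -> dict[str, list[str]]:
    """Keep regulators predicted by at least `min_tools` distinct tools."""
    support: dict[str, set[str]] = defaultdict(set)
    for tool, factors in predictions.items():
        for factor in factors:
            # Normalise case and whitespace so the same name matches across tools.
            support[factor.strip().upper()].add(tool)
    return {f: sorted(tools) for f, tools in support.items() if len(tools) >= min_tools}


# Placeholder predictions for illustration only.
example = {
    "PROMO": ["CTF", "SP1"],
    "AliBaba": ["ctf", "YY1"],
    "CiiiDER": ["SP1"],
    "TFBIND": ["GATA1"],
}
print(consensus_regulators(example))
```

In practice, the tools use different naming schemes for the same factor (for example, family names versus individual members), so a name-mapping step would likely be needed before voting.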
The list of genetic variants in the PHF19 gene promoter sequences was extracted from the Ensembl database (global MAF: 0.005–0.5; class: SNP; clinical consequences: all; consequences: all) to map variants occurring in the predicted binding sites of transcriptional regulators.
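Mapping variants onto predicted binding sites amounts to a filter followed by an interval overlap. The following is a minimal sketch with hypothetical record types and made-up identifiers; it assumes variant positions and site coordinates are on the same coordinate system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    variant_id: str
    pos: int
    maf: Optional[float]   # global minor allele frequency, if reported
    var_class: str


@dataclass(frozen=True)
class BindingSite:
    factor: str
    start: int   # inclusive
    end: int     # inclusive


def variants_in_sites(variants, sites, maf_min=0.005, maf_max=0.5):
    """Return (variant_id, factor) pairs for SNPs in the MAF range that fall in a site."""
    hits = []
    for v in variants:
        if v.var_class != "SNP" or v.maf is None:
            continue
        if not (maf_min <= v.maf <= maf_max):
            continue
        for s in sites:
            if s.start <= v.pos <= s.end:
                hits.append((v.variant_id, s.factor))
    return hits


# Invented records for illustration only.
demo_variants = [
    Variant("var_demo_1", 105, 0.12, "SNP"),
    Variant("var_demo_2", 108, 0.001, "SNP"),      # too rare, filtered out
    Variant("var_demo_3", 300, 0.30, "SNP"),       # outside every site
    Variant("var_demo_4", 110, 0.20, "deletion"),  # not a SNP
]
demo_sites = [BindingSite("CTF", 100, 112)]
print(variants_in_sites(demo_variants, demo_sites))
```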
2.2. In Silico Analysis of PHF19-207
The sequence of the PHF19-207 transcript was retrieved as a FASTA file using the human GRCh38.p13 assembly from the Ensembl genome browser (ENST00000456291).
2.3. Cell Cultures
The cell lines utilized in this study were derived from human colon tissue, including the immortalized colonic epithelial cell line HCEC-1CT (CVCL_AQ45) isolated from healthy tissue (Evercyte GmbH, Wien, Austria) and a panel of colon cancer cell lines: HCT 116 (CVCL_0291), HT-29 (CVCL_0320), CaCo-2 (CVCL_0025), SW480 (CVCL_0546), DLD-1 (CVCL_0248), and SW620 (CVCL_0547) (ATCC, Manassas, VA, USA). All cell lines were cultured in Dulbecco’s Modified Eagle Medium (DMEM; Capricorn Scientific, Ebsdorfergrund, Germany) supplemented with 10% fetal bovine serum (FBS; Capricorn Scientific, Germany) and 1% antibiotic/antimycotic solution (Capricorn Scientific, Ebsdorfergrund, Germany) in a 5% CO2 atmosphere at 37 °C. Cells were subcultured once they reached 70–80% confluence using 1× trypsin/EDTA (Capricorn Scientific, Ebsdorfergrund, Germany). To ensure biological relevance, cells were cultivated in triplicate. All cell lines were confirmed to be free from mycoplasma contamination.
Non-malignant (HCEC-1CT) and malignant cell lines representing different tumor stages according to the Dukes’ classification (HCT 116, DLD-1, and SW620) were cultured in 3D as spheroids. To generate the spheroids, adherent cells were detached using 1× trypsin/EDTA (Capricorn Scientific, Ebsdorfergrund, Germany) and counted with a standard hemocytometer. Approximately 2 × 10⁵ cells per well were seeded in a 24-well Nunclon™ Sphera™ Dish (Thermo Fisher Scientific, Waltham, MA, USA), designed for low cell attachment, containing 1 mL of the complete culture medium described earlier. Spheroids were cultured for 7 days in a humidified incubator at 37 °C with 5% CO2. To maintain nutrient levels and remove dead cells, media changes were performed every 2–3 days. During incubation, spheroids were monitored daily under a phase-contrast microscope for shape, growth, and compactness. Compact spheroids were defined based on the following morphological criteria: spherical morphology with clearly defined borders, absence of fragmented edges or loosely attached cells, and uniform size distribution within biological replicates. Compact spheroids were collected under a microscope to ensure that only live cells, free from debris, were selected for subsequent total RNA extraction.
2.4. Cell Transfection
HCEC-1CT cells were seeded at a density of 2 × 10⁵ and cultured for 24 h in DMEM medium without antibiotic/antimycotic solution. Cells were counted using a standard hemocytometer. Transient transfection with the EGFP-KRAS-G12V plasmid (#164925, Addgene, Watertown, MA, USA) and control plasmid was performed using Lipofectamine™ 3000 (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol. Transfection was performed in triplicate. KRAS expression was confirmed by the presence of GFP fluorescence, and cells were collected after 24 h.
2.5. RNA Extraction
Total RNA was extracted from approximately 8 × 10⁶ adherent cells (2D cultured cells) and spheroids (3D cultured cells) from two 24-well plates using the PureLink™ RNA Mini Kit (Thermo Fisher Scientific, Waltham, MA, USA) following the manufacturer’s protocol. RNA from transfected cells was isolated using the same procedure. For RNA isolation from the HCEC-1CT and SW620 cell compartments, the Cytoplasmic & Nuclear RNA Purification Kit (Norgen Biotek Corp., Thorold, ON, Canada) and the Cell Culture Media Exosome Purification and RNA Isolation Midi Kit (Norgen Biotek Corp., Thorold, ON, Canada) were employed, adhering to the respective manufacturer’s protocols. The concentration and purity of the extracted RNA were assessed by measuring absorbance at 260 nm and 280 nm using a BioSpec-nano spectrophotometer (Shimadzu, Kyoto, Japan).
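The absorbance readings translate into concentration and purity with two short formulas. The sketch below uses the standard conversion of one A260 unit to about 40 ng/µL of RNA at a 1 cm path length; the function names, the example readings, and the dilution and path-length parameters are illustrative assumptions, since the instrument normally reports these values directly.

```python
def rna_concentration_ng_per_ul(a260: float, dilution_factor: float = 1.0,
                                path_length_cm: float = 1.0) -> float:
    """Estimate RNA concentration from absorbance at 260 nm.

    Uses the conventional factor of 40 ng/uL per A260 unit at a 1 cm path.
    """
    return a260 / path_length_cm * 40.0 * dilution_factor


def purity_ratio(a260: float, a280: float) -> float:
    """A260/A280 ratio; values near 2.0 are generally taken to indicate pure RNA."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


# Illustrative readings, not measurements from the study.
print(rna_concentration_ng_per_ul(0.5))   # concentration in ng/uL
print(purity_ratio(1.0, 0.5))             # A260/A280
```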
2.6. RNA Sequencing
High-throughput next-generation RNA sequencing was conducted by Novogene (UK) Company Limited (Cambridge, UK). Total RNA from the cultivated spheroids underwent quality control (QC), which included 1% agarose gel electrophoresis, Nanodrop spectrophotometry to assess RNA concentration and purity, and Agilent 2100 analysis to evaluate the RNA Integrity Number. Library preparation involved ribosomal RNA depletion, facilitating RNA enrichment for gene expression profiling of both coding and non-coding transcripts. Sequencing was performed using the Illumina NovaSeq6000 platform, generating paired-end 150 bp reads. Bioinformatics analysis comprised quality control, mapping of the reads to the GRCh38 human reference genome, and quantification of gene expression levels using Novogene’s established pipeline, which provided raw counts. A Sashimi plot of the Binary Alignment Map (BAM) files was generated using the Integrative Genomics Viewer (IGV).
2.7. Quantitative Real-Time PCR (qRT-PCR)
Reverse transcription of total RNA, isolated from both 2D and 3D cultured cells (2 μg) and cell compartments (0.1 μg), into complementary DNA (cDNA) was carried out using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems, Waltham, MA, USA), following the manufacturer’s protocol. The reaction conditions included 10 min at 25 °C, 120 min at 37 °C, and 5 min at 85 °C.
Relative expression of PHF19-207 was quantified in triplicate using quantitative real-time PCR (qRT-PCR) with Power SYBR Green PCR Master Mix (Thermo Fisher Scientific, Waltham, MA, USA). The specificity of the amplification products was confirmed by performing a melting curve analysis. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) served as the endogenous control for all measurements. The forward primer was designed to bind to the retained intron of the PHF19-207 transcript, and the reverse primer targeted the retained intron–exon junction to ensure reaction specificity. The sequences of the primers used for relative quantification are provided in Table 1.
qRT-PCR was conducted using the 7500 Real-Time PCR System (Applied Biosystems, Waltham, MA, USA), and relative quantification was calculated using the 2^(−ΔCt) method. The reaction conditions consisted of 2 min at 50 °C, 10 min at 95 °C, followed by 40 cycles of 15 s at 95 °C and 1 min at 60 °C. The expression levels of PHF19-207 were determined and normalized to the endogenous control.
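The relative quantification step can be written out explicitly: ΔCt is the target Ct minus the reference (GAPDH) Ct, and relative expression is 2 raised to −ΔCt. The Ct values below are invented for illustration, and averaging technical triplicates before subtraction is an assumption about how the replicates were combined.

```python
from statistics import mean


def relative_expression(ct_target: list[float], ct_reference: list[float]) -> float:
    """Relative expression by the 2^(-dCt) method, using mean Ct of technical replicates."""
    delta_ct = mean(ct_target) - mean(ct_reference)
    return 2.0 ** (-delta_ct)


def fold_change(sample_expr: float, control_expr: float) -> float:
    """Ratio of normalised expression between a sample and a control cell line."""
    return sample_expr / control_expr


# Invented Ct values: a tumor line versus a non-malignant control line.
tumor = relative_expression([24.0, 24.0, 24.0], [20.0, 20.0, 20.0])
control = relative_expression([26.0, 26.0, 26.0], [20.0, 20.0, 20.0])
print(tumor, control, fold_change(tumor, control))
```

A Ct difference of two cycles between the sample and the control, after normalization, corresponds to a four-fold difference in relative expression.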
2.8. Data Used in the Study
Publicly available high-throughput RNA sequencing data (GSE152562, GSE164541, and GSE254832) were obtained from the National Center for Biotechnology Information’s Gene Expression Omnibus database, NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/) [16,17,18]. The corresponding raw sequencing files in FASTQ format were downloaded from the Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra) under accessions SRP267478, SRP301216, and SRP487433, respectively. Downloads were facilitated using the SRA Explorer web application (https://sra-explorer.info/). The sequencing reads were aligned to the GRCh38 human reference genome, and the expression levels of PHF19 transcripts were quantified using the HISAT2 and StringTie tools [19]. The expression of PHF19-207 in clinical samples was assessed using the UCSC Xena Browser web server (https://xenabrowser.net/), comparing The Cancer Genome Atlas (TCGA) colon adenocarcinoma and Genotype-Tissue Expression (GTEx) colon datasets, with data accessed on 15 November 2024 [11]. Additionally, PHF19-207 expression was evaluated across other solid tumor types using TCGA data available through the UCSC Xena Browser web server.
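The alignment and quantification step is typically a three-command chain: HISAT2 alignment, BAM sorting, and StringTie quantification against a reference annotation. The sketch below only builds the command lines and does not run them. The file names, thread count, paired-end layout, and the use of samtools for sorting are assumptions; the text does not state the exact parameters used.

```python
def build_pipeline(sample: str, index: str, gtf: str, threads: int = 4) -> list[list[str]]:
    """Build HISAT2 -> samtools sort -> StringTie commands for one paired-end sample."""
    r1, r2 = f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz"   # hypothetical file names
    sam, bam = f"{sample}.sam", f"{sample}.sorted.bam"
    return [
        # --dta reports alignments tailored for transcript assemblers such as StringTie.
        ["hisat2", "-p", str(threads), "--dta", "-x", index, "-1", r1, "-2", r2, "-S", sam],
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # -e restricts quantification to the transcripts in the reference annotation (-G).
        ["stringtie", bam, "-p", str(threads), "-G", gtf, "-e", "-o", f"{sample}.gtf"],
    ]


# Placeholder sample, index, and annotation names.
for cmd in build_pipeline("SAMPLE01", "grch38_index", "annotation.gtf"):
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` once the tools and input files are in place. Restricting StringTie to annotated transcripts keeps isoform identifiers such as ENST00000456291 directly comparable across samples.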
2.9. Statistical Analysis
Statistical analysis was conducted using GraphPad Prism v9 software. Data are presented as percentages and as the mean ± standard deviation. The distribution of the data was assessed using the Shapiro–Wilk test. Differences between groups were analyzed using the independent samples t-test and analysis of variance (ANOVA), followed by Dunnett’s post hoc test. A p-value of ≤0.05 was considered statistically significant.
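The analysis itself was run in GraphPad Prism; for readers who prefer a scripted equivalent, the same sequence of tests can be approximated with SciPy. This is a sketch under assumptions: the data are invented, the helper name is hypothetical, and `scipy.stats.dunnett` is only present in recent SciPy releases, so the code falls back gracefully when it is missing.

```python
from scipy import stats


def compare_to_control(control: list[float], groups: dict[str, list[float]], alpha: float = 0.05) -> dict:
    """Shapiro-Wilk per group, one-way ANOVA, and Dunnett's test against the control."""
    all_groups = {"control": control, **groups}
    # Shapiro-Wilk: p > alpha means no evidence against normality.
    normal = {name: bool(stats.shapiro(vals).pvalue > alpha) for name, vals in all_groups.items()}
    anova_p = float(stats.f_oneway(*all_groups.values()).pvalue)
    result = {"normal": normal, "anova_p": anova_p}
    dunnett = getattr(stats, "dunnett", None)   # only available in newer SciPy versions
    if dunnett is not None:
        res = dunnett(*groups.values(), control=control)
        result["dunnett_p"] = {name: float(p) for name, p in zip(groups, res.pvalue)}
    return result


# Invented relative-expression values for illustration only.
control_line = [1.0, 1.1, 0.9, 1.0]
tumor_lines = {"line_A": [2.0, 2.2, 1.9, 2.1], "line_B": [5.0, 4.8, 5.3, 5.1]}
print(compare_to_control(control_line, tumor_lines))

# Two-group comparison with the independent samples t-test.
print(stats.ttest_ind(control_line, tumor_lines["line_A"]).pvalue)
```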
4. Discussion
Alterations in alternative transcription initiation have been observed in various pathologies, including cancer [21,22]. The diagnostic and prognostic potential of alternative promoters and transcripts has been established in several cancers, including colorectal cancer, multiple myeloma, prostate cancer, and hepatocellular carcinoma [23,24,25,26]. This study was conducted using cell lines, high-throughput sequencing, and computational tools to evaluate the potential of the transcript PHF19-207 as a biomarker for early colon cancer and to explore its possible role in tumor promotion. The hypothesis on its involvement in the early stages of colon cancer was derived from a previous comprehensive study that screened for deregulated promoter activity between tumor and non-tumor tissue and found deregulated activity of the PHF19 gene promoters [1]. The promoter down-regulated in colon cancer tissue was found to be upregulated in glioblastoma. The other promoter was found to be upregulated in colon and rectal cancer, kidney cancer, stomach cancer, and chronic lymphocytic leukemia.
The presence of characteristic motifs was similar in the two analysed promoters of the PHF19 gene. According to in silico predictions, both promoters are located within CpG islands. A previous study investigated transcriptional activity of gene promoters in colon cancer using the H3K4me3 mark [4]. The results of that study are consistent with the finding that the upregulated promoter is located in a transcriptionally active region of the PHF19 gene.
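The CpG island prediction rests on two sequence statistics, GC content and the observed/expected CpG ratio. The sketch below computes both for a whole sequence and applies thresholds that are commonly used for CpG island calling (length over 100 bp, GC content over 50%, observed/expected ratio over 0.6). These thresholds are an assumption here, as the text does not state the settings used, and prediction tools apply such criteria over sliding windows rather than to the whole sequence at once.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC percent, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g, cpg = seq.count("C"), seq.count("G"), seq.count("CG")
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    # Obs/Exp = (number of CpG * N) / (number of C * number of G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 100,
                          min_gc: float = 50.0, min_obs_exp: float = 0.6) -> bool:
    """Apply the assumed length, GC, and Obs/Exp thresholds to one sequence."""
    gc_percent, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc_percent > min_gc and obs_exp > min_obs_exp


# Toy sequences, not the PHF19 promoters.
print(cpg_stats("CG" * 60), looks_like_cpg_island("CG" * 60))
print(cpg_stats("AT" * 60), looks_like_cpg_island("AT" * 60))
```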
The potential binding of transcriptional regulators to the promoter sequences was assessed using four distinct bioinformatics tools. Regulators predicted by at least two of these tools to bind to either of the promoters were considered for further analysis. Based on the presence of their binding motifs in the promoter sequences, the CTF family was predicted to bind to the promoter upregulated in colon cancer. Data from GEPIA showed that the expression level of CTF is lower in colon cancer than in healthy gut mucosa. Lower expression of CTF proteins combined with upregulation of this promoter in colon cancer can be explained by their dual activity, since the family includes both activators and repressors. Since NFI/CTF transcription factors have both oncogenic and tumor suppressor potential, depending on the type of carcinoma, their role in regulating PHF19 gene promoters should be further investigated [27].
Transcript PHF19-207 is 888 nucleotides long and classified as protein-coding according to Ensembl. Its computationally mapped protein isoform consists of 106 amino acids. In silico evaluation of PHF19-207 suggests that this RNA may be non-coding rather than coding. It has low coding probability according to the LCG Coding Potential Prediction tool and Coding Potential Calculator tool, and the AnnoLnc2 tool indicates its localization in the nucleus. Although localization data are not available for colon cell lines, they are consistent across a variety of other tissues, indicating that this transcript is predominantly retained in the nucleus regardless of the tissue. Another bioinformatic tool, lncLocator, predicts the localization of this transcript in exosomes. In silico data also point to the upregulation of PHF19-207 in colon cancer tissue samples in comparison to normal colon mucosa. Overall, in silico data indicate that PHF19-207 may be a long non-coding RNA involved in gene regulation and/or signalling. Additionally, its role in communication between colon cancer cells and the tumor microenvironment could be investigated using RNA fluorescence in situ hybridization.
In silico data also predict binding of nine microRNA molecules to the transcript PHF19-207. These microRNAs are predicted to bind mainly towards the 5′ (within intron 1) and 3′ ends of the transcript, and only a couple of them have overlapping binding sites. Most of the miRNAs bind to the predicted loops of the RNA secondary structure. For some of the microRNA molecules predicted to bind to PHF19-207, anti-tumor roles were demonstrated, while others are not yet characterized [28,29]. These data suggest that the retained intron of the PHF19-207 transcript may act as a microRNA sponge, which is in line with its proposed tumor-promoting role in colon tumorigenesis. MicroRNA hsa-miR-6721-5p has four binding sites in the PHF19-207 sequence. Simultaneous binding of multiple miR-6721-5p molecules may modulate transcript secondary structure and stability, suggesting a possible anti-tumor role of this miRNA. As such, the retained intron of the PHF19-207 transcript could be involved in the regulation of colon carcinogenesis, and further study of its functional properties is required to understand its role.
However, according to the Ensembl database, this transcript is protein-coding. The translated protein of this transcript is suggested to be 106 amino acids long. In comparison with the reference transcript, which encodes a protein of 580 amino acids, this protein might have different roles in cells. Recent studies suggested the existence of small open reading frames (sORFs) on long non-coding RNA molecules that are engaged by ribosomes [30]. Micropeptides originating from these ORFs deviate from canonical peptide sequences and are, on average, about 100 amino acids long. The predicted PHF19-207 peptide fits this description. The hypothesis that PHF19-207 is a protein-coding long non-coding RNA should be further investigated. Confirmation of this hypothesis would require analysis of ribosome recruitment to this transcript, mass spectrometry, in vivo translation, and custom-made antibodies for Western blot [31].
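A first computational check of the small-ORF hypothesis is to scan the transcript for open reading frames in all three forward frames. The sketch below is a deliberately simple scan (first ATG to the next in-frame stop codon) and is not the method used by Ensembl or by the coding-potential tools mentioned above; the toy sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> tuple[int, int]:
    """Return (start, end) of the longest ATG...stop ORF; end is exclusive and includes the stop."""
    seq = seq.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None                   # close the ORF and keep scanning
    return best


def orf_length_aa(start: int, end: int) -> int:
    """Number of encoded amino acids, excluding the stop codon."""
    return max(0, (end - start) // 3 - 1)


# Toy sequence: 2 nt of leader, ATG, four alanine codons, a stop codon, 2 nt of trailer.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
start, end = longest_orf(toy)
print(start, end, orf_length_aa(start, end))
```

By the same arithmetic, a 106-amino-acid product corresponds to 318 coding nucleotides plus a stop codon, which is well within the 888-nucleotide transcript.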
Analysis of the public data showed no expression of PHF19-207 in either the wild-type or APC knockdown HCEC-1CT cell line [16]. It can be assumed that this transcript is not the result of the earliest genetic alteration in the canonical colorectal carcinogenesis pathway. However, we have to consider that the platform used in that study, Illumina NextSeq 500, does not provide the same sequencing depth as the NovaSeq 6000 used in our study. This may explain why we quantified the lowly expressed PHF19-207 transcript in the HCEC-1CT cell line and observed upregulation in cell lines representing other stages of tumor development.
Results of public sequencing data also showed no statistically significant changes in PHF19-207 expression between wild-type and KRAS knockdown HCT 116 cells [18]. Within that study, RNA of wild-type and transfected cell lines was sequenced in duplicate. Considering that, we overexpressed the GFP-labeled KRAS G12V mutant protein in the normal colon mucosa cell line HCEC-1CT. Results showed statistically significant changes in PHF19-207 expression in cells overexpressing mutated KRAS compared with cells without KRAS overexpression. This experiment suggests that the KRAS mutation could be one of the first genetic deregulations that drive upregulation of the PHF19-207 transcript.
Analysis of publicly available sequencing data showed a slight increase in the expression of PHF19-207 in the tumor in comparison to normal tissue, but without statistical significance [17]. The study that produced these public data analyzed triplicate tissue samples from five patients with colorectal cancer. A comparison between TCGA and GTEx data suggested that there is a difference between tumor and normal gut mucosa. However, these data did not include adenomas and showed a clear difference in tumor staging. We suggest further research on the expression of this transcript in clinical samples from larger patient groups with tumors in different stages, using more sensitive methods such as droplet digital PCR (ddPCR).
The PHF19-201 transcript is shown to be significantly expressed in the HCEC-1CT APC knockdown cell line. Our data showed similar expression between HCEC-1CT and the HCT 116 cell line. However, its expression in DLD-1 and SW620 was significantly upregulated. Considering that the HCT 116 cell line has a functional APC gene, we can conclude that this transcript could be a marker of APC deregulation in colon cancer.
The results of the transcriptional profiling support the biomarker potential of the PHF19-207 transcript and are also in line with its proposed tumor-promoting role. The expression analysis of cells cultured in 2D, conducted using qPCR, revealed that PHF19-207 expression was elevated in all malignant cell lines compared to the non-malignant HCEC-1CT cell line (by 2- to 5-fold). A more pronounced increase in expression was observed in the cell lines derived from advanced stages of colon tumors (HCT 116 Dukes’ A category; HT-29, CaCo-2 and SW480 Dukes’ B category; DLD-1 and SW620 Dukes’ C category). Similar results were obtained when the expression of PHF19 gene transcripts was analysed in cells cultivated in 3D using RNA sequencing, where PHF19-207 expression was elevated 2- to 7.5-fold in the malignant cells vs. the non-malignant cell line. In addition, aberrant splicing events occurred more frequently in cells representing later stages of colon cancer. Cell lines were cultured in 3D so that the resulting transcriptomes more closely represent those of cells in their native environment. PHF19-207 was detected by qPCR in the nucleus of both HCEC-1CT and SW620 cell lines and in SW620 exosomes, which also validated the in silico predictions of transcript localization. Exosomes are one of the regulators of cell-to-cell communication [32]. Their role in cancer development and aggressiveness has been demonstrated in breast cancer [33]. Increased secretion of PHF19-207 via exosomes in colon cancer could elucidate its mechanism of action in cancer development and should be further explored.
Transcript PHF19-210 showed an expression pattern similar to that of PHF19-207 in expression data from cell lines, and considering its undefined coding sequence and length (588 nucleotides), it may also be considered by future studies as a non-coding RNA with a potential role in tumorigenesis. Its expression in clinical NGS data does not suggest that it could be used as a biomarker.
The results of this study provide detailed data on PHF19 expression through the development of colon cancer and suggest the potential use of PHF19-207 as a biomarker of early colon cancer. However, several limitations should be acknowledged. First, the findings of this study rely primarily on data from established cell lines and publicly available RNA sequencing data, which may not fully capture the complexity or heterogeneity of primary tumor tissues. Additionally, while differential expression and splicing patterns of PHF19-207 are clearly demonstrated, functional validation experiments are still required, and the biological role of this isoform therefore remains speculative. Although PHF19-207 contains a retained intron, its consistent and elevated expression across samples suggests that it is not subject to effective nonsense-mediated decay (NMD), supporting the potential functional relevance of this isoform. Previous studies illustrate that non-coding transcripts originating from loci of protein-coding genes could have roles in different molecular processes in a malignant cell [34]. High-risk patients undergoing screening for colorectal cancer could benefit the most from the implementation of early colon cancer biomarkers, such as PHF19-207. With further functional characterization and description of transcript behavior in tumor cells under therapeutics, we could estimate the different aspects of biomarker potential.