Pseudogene Transcripts in Head and Neck Cancer: Literature Review and In Silico Analysis

Once considered nonfunctional, pseudogene transcripts are now known to provide valuable information for cancer susceptibility, including head and neck cancer (HNC), a serious health problem worldwide, with about 50% unimproved overall survival over the last decades. The present review focuses on the role of pseudogene transcripts involved in HNC risk and prognosis. We combined current literature and in silico analyses from The Cancer Genome Atlas (TCGA) database to identify the most deregulated pseudogene transcripts in HNC and their genetic variations. We then built a co-expression network and performed gene ontology enrichment analysis to better understand the pseudogenes’ interactions and pathways in HNC. In the literature, few pseudogenes have been studied in HNC. Our in silico analysis identified 370 pseudogene transcripts associated with HNC, where SPATA31D5P, HERC2P3, SPATA31C2, MAGEB6P1, SLC25A51P1, BAGE2, DNM1P47, SPATA31C1, ZNF733P and OR2W5 were found to be the most deregulated and presented several genetic alterations. NBPF25P, HSP90AB2P, ZNF658B and DPY19L2P3 pseudogenes were predicted to interact with 12 genes known to participate in HNC, DNM1P47 was predicted to interact with the TP53 gene, and HLA-H pseudogene was predicted to interact with HLA-A and HLA-B genes. The identified pseudogenes were associated with cancer biology pathways involving cell communication, response to stress, cell death, regulation of the immune system, regulation of gene expression, and Wnt signaling. Finally, we assessed the prognostic values of the pseudogenes with the Kaplan–Meier Plotter database, and found that expression of SPATA31D5P, SPATA31C2, BAGE2, SPATA31C1, ZNF733P and OR2W5 pseudogenes were associated with patients’ survival. Due to pseudogene transcripts’ potential for cancer diagnosis, progression, and as therapeutic targets, our study can guide new research to HNC understanding and development of new target therapies.


Introduction
Head and neck cancer (HNC) is the eighth most common cancer worldwide, with more than 835,000 new cases and 431,000 deaths due to the disease per year [1]. HNC comprises tumors in the oral cavity, pharynx, and larynx, and nearly 95% of them are squamous cell carcinoma histological type tumors [2]. The classical risk factors for developing HNC are smoking habits and alcohol consumption [3]. The human papillomavirus (HPV), in particular HPV16, is detected in approximately 25% of HNC cases, especially in tumors located at the oropharynx, and serves as a favorable prognostic factor for these patients [4]. Most HNC patients are diagnosed with measurable locally advanced disease, and only about half of these patients achieve complete or partial responses after treatment [5]. Besides surgery, cisplatin (CDDP), combined or not with radiotherapy (RT), has been used in HNC patients' treatment [6], but therapy resistance has been reported [7].

Pseudogene Transcripts
The transcription of pseudogenes depends on their genomic location, and it could be processed mainly into antisense RNAs, endogenous small interfering (endo-si)RNAs, or micro (mi)RNA sponges [18], acting as negative or positive gene regulators [19]. Pseudogene antisense RNAs can bind directly into sense RNA from the parental gene and inhibit its translation and protein production [20]. Pseudogene-derived endo-siRNAs are produced by cleaved pseudogene sense or antisense transcripts of double-stranded RNA or by the inverted repeat region of pseudogenes transcribed into hairpin-shaped RNA and sliced by the ribonuclease Dicer [20]. They are then separated into single strands and incorporated into the RNA-induced silencing complex to bind and degrade target mRNAs, inhibiting protein production [20]. Pseudogene transcripts can also act as competing endogenous RNAs (ceRNA) [21]. They may share miRNA response elements with other genes, especially parental ones, and positively regulate their expression by competing for the same pool of miRNAs, acting as miRNAs sponges [20]. In this process, pseudogene transcripts inhibit miRNA-target gene binding, allowing gene expression and modulating biological processes and tumor progression [22,23]. In addition, pseudogene transcripts in sense orientation may compete for miRNAs, RNA-binding proteins, and translational machinery with the parental gene [18].
Pseudogene transcriptome can vary during physiological and pathological processes, such as cancer [19]. Pseudogene transcripts have been described in diverse types of human cancer, acting as tumor promotors, facilitating cancer development, or as tumor suppressors, inhibiting cancer progression [19]. Moreover, pseudogene transcripts have also shown a prognostic role in HNC, highlighting its importance in patient prognosis and treatment outcomes [24].

Pseudogene Transcripts in HNC
Although largely understudied, some pseudogene transcripts have been described in HNC. Here, we divided the literature information by tumor location (HNC in general with mixed patients, oral cancer, and laryngeal cancer studies) and provided a summary of pseudogenes associated with HNC described in the literature in Table 1. To the best of our knowledge, there is no study of pseudogene transcripts in the pharynx only. Additionally, it is possible that some real pseudogenes are named as genes in certain studies, meaning that we may have missed some information while searching the literature for pseudogenes only.

HNC in General
In HNC tumor cells, loss of PTENP1 (phosphatase and tensin homolog pseudogene 1) pseudogene transcript modulated malignant behavior and worst survival of HNC patients, leading to tumor cell proliferation, colony formation, and migration, possibly by interacting with its parental gene PTEN (phosphatase and tensin homolog) [25].
In HNC tumor cells and cell lines (FaDu, Cal-27, SCC4, and SCC9), FKBP9P1 (FKBP prolyl isomerase 9 pseudogene 1) pseudogene transcript abundance was found to correlate with advanced tumor stage and poor prognosis of patients by enhancing tumor cell proliferation, migration, and invasion, possibly by interacting with the PI3K/AKT signaling pathway [26].
Based on The Cancer Genome Atlas (TCGA) data and the University of Alabama Cancer Database (UALCAN), Grzechowiak et al. [27] described that PTTG3P (pituitary tumor-transforming 3 pseudogene) pseudogene transcript abundance was correlated with HNC lower T-stage (T1 or T2), positive HPV16 status, and poor prognosis [27]. Gene Set Enrichment Analysis indicated that PTTG3P was correlated with genes involved in DNA repair, oxidative phosphorylation, and peroxisome pathways [27].
FTH1P3 pseudogene transcript was also associated with oral cancer risk and prognosis by becoming a miRNA sponge for miR-224-5p and thereby modulating the expression of the FZD5 (frizzled class receptor 5) gene, facilitating cell proliferation and colony formation [30]. Moreover, the FTH1P3 transcript was associated with advanced tumor stage and poor prognosis in oral cancer [31]. FTH1P3 could promote proliferation, migration, and invasion of tumor cells, possibly by PI3K/Akt/GSK3β/Wnt/β-catenin signaling [31].
In addition, the PTENP1 transcript was found to act as a ceRNA, protecting the parental gene PTEN from miR-21 binding and therefore inhibiting tumor cell proliferation and colony formation [32].

Laryngeal Cancer
In laryngeal cancer, HMGA1P6 (high mobility group at-hook 1 pseudogene 6) and HMGA1P7 (high mobility group at-hook 1 pseudogene 7) pseudogene transcripts were identified as possible miRNA sponges, allowing the increase of HMGA2 expression (high mobility group at-hook 2) and other oncogenic genes involved in proliferation and cell cycle progression, such as CCNB2 (cyclin B2) and WNT (proto-oncogene WNT) family member genes, and epithelial-mesenchymal transition, such as SNAIL (snail family transcriptional repressor) and TWIST1 (twist family BHLH transcription factor 1) genes [33].
In a laryngeal cancer patient, the HLA-A (major histocompatibility complex class I A) processed pseudogene (HLA-A*31012) was identified from retro-transposition of the parental gene within a clonal tumor cell [34]. The HLA-A*31012 pseudogene transcript was restricted to tumor cells since it was not amplified in normal laryngeal tissue nor peripheral blood leucocytes [34]. This transcript may have contributed to laryngeal tumor progression by providing tumor cells with the immune system escape [34].
In addition, the FTH1P3 pseudogene transcript was associated with advanced tumor stage and worst overall survival in laryngeal cancer by enhancing cell proliferation, migration, and invasion, and inhibiting cell apoptosis, although the exact mechanism was not elucidated [35]. Using clinical materials from TCGA, a five-gene signature predicting survival of laryngeal cancer patients was established by Zhang et al. [36], where the pseudogene DPY19L2P1 was included. However, the exact mechanism was not investigated further [36].
Since pseudogene transcripts identified in HNC are largely unexplored in the literature, we performed an in silico analysis to predict potential biomarkers and their pathways for future validation with functional analysis.

Materials and Methods
The in silico analysis consisted of acquiring and evaluating TCGA data from HNC patients to identify potential pseudogenes enrolled in tumor development. TCGA is a comprehensive public database for key genomic changes in various types of cancers, and we can obtain deregulated pseudogenes from the available transcriptome profiles.
Next, we constructed co-expression networks to identify pairs of genes (or pseudogenes) showing similar expression patterns across samples and performed gene ontology (GO) enrichment analysis to map the relationship between pseudogenes and other genes involved in the same biological process. A workflow of the methodology used in this study is shown in Figure 1.

Materials and Methods
The in silico analysis consisted of acquiring and evaluating TCGA data from HNC patients to identify potential pseudogenes enrolled in tumor development. TCGA is a comprehensive public database for key genomic changes in various types of cancers, and we can obtain deregulated pseudogenes from the available transcriptome profiles.
Next, we constructed co-expression networks to identify pairs of genes (or pseudogenes) showing similar expression patterns across samples and performed gene ontology (GO) enrichment analysis to map the relationship between pseudogenes and other genes involved in the same biological process. A workflow of the methodology used in this study is shown in Figure 1.

TCGA Data Analysis
To identify potential pseudogene transcripts in HNC, we used the TCGA database (https://portal.gdc.cancer.gov; accessed on 21 May 2021) to download pseudogene transcript data on patients with HNC with squamous cell subtype and corresponding tumor location (oral cavity, oropharynx, hypopharynx, and larynx). We selected genomic information of patients with tumors located at the lip, gum, palate, floor of mouth, tonsil, base of tongue, oropharynx, nasopharynx, hypopharynx, and larynx (n = 546). Tumors in illdefined sites and with complex mixed, stromal, adenomas, or adenocarcinomas subtypes were excluded. Next, we selected only those cases with primary tumors as sample type (n = 439). From those 439 HNC patients, we selected the cases with pseudogene information. We included only the transcript classification (biotypes) denominated, transcribed unprocessed pseudogene, processed pseudogene, unprocessed pseudogene, transcribed processed pseudogene, polymorphic pseudogene, immunoglobulin variable region pseudo-

TCGA Data Analysis
To identify potential pseudogene transcripts in HNC, we used the TCGA database (https://portal.gdc.cancer.gov; accessed on 21 May 2021) to download pseudogene transcript data on patients with HNC with squamous cell subtype and corresponding tumor location (oral cavity, oropharynx, hypopharynx, and larynx). We selected genomic information of patients with tumors located at the lip, gum, palate, floor of mouth, tonsil, base of tongue, oropharynx, nasopharynx, hypopharynx, and larynx (n = 546). Tumors in ill-defined sites and with complex mixed, stromal, adenomas, or adenocarcinomas subtypes were excluded. Next, we selected only those cases with primary tumors as sample type (n = 439). From those 439 HNC patients, we selected the cases with pseudogene information. We included only the transcript classification (biotypes) denominated, transcribed unprocessed pseudogene, processed pseudogene, unprocessed pseudogene, transcribed processed pseudogene, polymorphic pseudogene, immunoglobulin variable region pseudogene, and unitary pseudogene. From the 439 HNC patients, 220 did not present pseudogene expression information, which reduced our sample size to 219 HNC patients. From the 219 HNC patients selected, we also downloaded the somatic genomic alterations pseudogene data, such as single-nucleotide variation (SNV), deletions, and insertions, due to their potential of altering pseudogene transcription.

Co-Expression Networks and GO Enrichment Analysis
We built co-expression networks to better understand the relationship between pseudogenes and other genes in the genome. In a co-expression network, genes that have similar expression variation among samples are clustered in the same module, and they are generally thought to be involved in the same biological process [37]. Using RNAseq data from 213 HNC patients available in TGCA, we built a tumor co-expression network using the Weighted Correlation Network Analysis (WGCNA) R software package from gene expression values (log2-converted FPKM) [38]. It was not possible to obtain RNAseq data from six HNC patients. We kept only protein-coding genes from the human genome assembly reference GRCh38.p13 and the ten most deregulated pseudogenes of each HNC location (HNC in general, oral cavity, oropharynx, hypopharynx, and larynx). We then removed genes and pseudogenes not expressed in at least half of the samples and built the network with 19,406 protein-coding genes and 31 pseudogenes identified in the TCGA data analysis. Additional parameters used to build the network with WGCNA were a soft thresholding power of 8 and a minimum number of 30 genes allowed in a module [39]. The modules were then classified in colors.
To gain more insights about the possible functions of the pseudogenes described above, we performed a GO enrichment analysis for all gene modules containing at least one of most deregulated pseudogenes from HNC using the GOATOOLS Python library [40] (false discovery rate (FDR) adjusted p-values < 0.05) and subsequently summarized and visualized using the web tool REVIGO [41].

Survival Analysis
The Kaplan-Meier Plotter (https://kmplot.com/analysis/; accessed on 25 May 2021), an online database established from gene expression data and survival information of cancer patients from the TCGA database [42], was assessed to evaluate the prognostic value of the pseudogene transcripts identified in our study in a cohort of 500 HNC patients. Unfortunately, patients' clinicopathological aspects were not available. A Kaplan-Meier survival plot, log-rank p-value, and confidence interval (CI) were directly determined and displayed by the database. Patients were split by the expression median. Relapse-free survival (RFS) consists of the date of diagnosis to the date of relapse or last follow-up. Overall survival (OS) consists of the date of diagnosis until the date of death, due to any cause, or last follow-up.
When we stratified patients' information by tumor location, we found a few differences in pseudogene transcripts' pattern that could be explained by different cancer behaviors [43], but some of the identified pseudogenes were present in more than one tumor location.
Only one pseudogene identified in our analysis, DPY19L2P1, has already been associated with HNC. DPY19L2P1 was considered an independent prognosis predictor of laryngeal cancer, where its higher expression was associated with the worst OS, although the exact mechanism was not explored [36]. In our analysis, DPY19L2P1 was included in the ten most deregulated pseudogenes in oropharyngeal and hypopharyngeal cancers. However, it was identified only as the 54th most deregulated pseudogene in laryngeal cancer, and its prognostic value could not be evaluated. Other described pseudogenes have already been associated with carcinogenesis, although the exact mechanisms were not reported or detailed in most studies. HERC2P3 upregulation was associated with gastric cancer cell growth and migration by interacting with the Akt signaling pathway [44]. BAGE2 was related to the tumor-specific antigen profile [45]. FOLH1B was found to be involved in metallopeptidase activity, and its upregulation was associated with aggressiveness and metastasis in prostate cancer [46]. BNIP3P1 upregulation was found in breast cancer brain metastases [47]. PKD1L2 upregulation was associated with a good prognosis in breast cancer patients [48] and colorectal cancer risk and poor survival in obese patients [49]. POTEA upregulation was associated with increased colorectal cancer risk [50]. MSL3P1 upregulation was considered a non-invasive biomarker of renal cell carcinoma [51]. HLA-H upregulation was associated with cervical [52] and lung [53] carcinomas. GBA3 lower expression was associated with poor prognosis of hepatocellular carcinoma patients [54]. MST1P2, binding to miR-133b, affected the chemoresistance of bladder cancer cells to cisplatin-based therapy [55] and promoted cervical cancer progression [56]. PNLIPRP2 lower expression was found in pancreatic ductal adenocarcinoma [57].
We also identified 993 somatic genetic alterations in the 370 pseudogene transcripts identified in HNC, and SNV was the most common type (96.  Supplementary Table S2. The pseudogenes SPATA31D5P,  HERC2P3, MAGEB6P1, SPATA31C2, SPATA31C1, SLC25A51P1, BAGE2, HSP90AB2P, OR2W5 and REG1P were the most genetically altered by several SNVs in the HNC patients. The complete approach can be found in Supplementary Dataset S1.
From these identified pseudogenes, only BAGE2 was already studied for genomic mutation profile and its copy number variation may be associated with the Robertsonian Down syndrome [58]. Pseudogenes' genetic alterations, their potential of interfering in pseudogene transcription, and carcinogenesis have not been explored in the literature yet.  n: number of patients, UV: ultraviolet radiation, NAD + : nicotinamide adenine dinucleotide.

Co-Expression Networks and GO Enrichment Analysis
We performed co-expression network and GO enrichment analyses to map the relationship between pseudogenes and other genes in the genome. After analyzing 19,406 proteincoding genes and 31 pseudogenes, our co-expression networks contained 14,379 genes organized in 36 modules, with an average of 399 genes per module (median: 140 genes; range: 35-3645 genes; Supplementary Dataset S2).
From the 31 most deregulated pseudogenes from HNC and its subtypes, 14 of them (SPATA31D5P, SPATA31C2, DNM1P47, NBPF25P, HSP90AB2P, NXF4, ZNF658B, POTEA, MROH5, HLA-H, DPY19L2P1, DPY19L2P3, GBA3, and PNLIPRP2) were present in our network across 8 modules (light yellow, red, dark magenta, orange, black, salmon, ivory, and dark olive green). Although the other pseudogenes of interest did not show a coexpression profile similar enough to other genes to be assigned to a module according to the parameters used to build this network, they are still important for further studies in HNC based on our previous TCGA data analysis.
We chose to highlight three modules (light yellow, red, and dark magenta) that contain genes well-known to be involved in head and neck carcinogenesis based on two important reviews [76,77]. However, the other five modules also presented relevant gene associations that should be evaluated in further studies.
The light-yellow module contained the greatest number of pseudogenes (NBPF25P, HSP90AB2P, ZNF658B, and DPY19L2P3). Interestingly, this module also contains 12 genes known to participate in head and neck carcinogenesis (CASP8, involved in apoptosis; NOTCH2 and NOTCH3, involved in cell differentiation; TRAF3, involved in antiviral response; RB1 and HRAS, involved in cell cycle control and proliferation; PTEN, involved in apoptosis and cell cycle control; CUL3 and NFE2L2, involved in oxidative stress response; EP300, involved in chromatin remodeling, and the transcription factors IGF1R and TP63) [76,77], suggesting that these genes may directly or indirectly interact with the pseudogenes in this module and play a role in HNC (Figure 2A). In addition, the light-yellow module is enriched in GO terms related to tumor development, such as regulation of cell communication, cellular response to stress, and intracellular transport ( Figure 2B). The potential relationship between the described pseudogenes and genes in this module has not been explored in the literature yet.
The red module contains the pseudogene DNM1P47 and the tumor suppressor TP53 gene, mainly involved in cell cycle control and apoptosis [76,77] (Figure 3A), and is enriched in GO terms associated with regulation of cell proliferation, response to stress, and cell death ( Figure 3B). The potential relationship between the DNM1P47 pseudogene and the TP53 gene has not been explored in the literature yet. terms identified by GOATOOLS Python library and summarized by the web tool REVIGO. Circles colored with darker red indicate GO terms with lower p-values in our enrichment analysis, while the size of the circles indicates the frequency of a GO term in the GO annotation database. GO terms that are highly similar have thicker lines. The dark magenta module contains the known tumor genes HLA-A and HLA-B, involved in immune response [77], very closely associated with the HLA-H pseudogene ( Figure 4A), and is also enriched in GO terms important to cancer biology, such as cell proliferation, regulation of the immune system, regulation of gene expression, and Wnt signaling pathway ( Figure 4B). The relationship of the HLA-H pseudogene and HLA-A and HLA-B genes has been described in the literature, and even related to lung cancer in the Asian population [53]. The SNV rs12333226, capable of modulating HLA-A and HLA-H expression levels, was suggested to exert an effect on lung cancer through these two immune-related genes and the pseudogene [53].  The complete list of genes in light-yellow, red, and dark magenta modules can be found in Supplementary Dataset S2. The other five modules consist of an orange module containing DPY19L2P1, a black module containing NXF4, MROH5, and SPATA31D5P, a salmon module containing PNLIPRP2, an ivory module containing SPATA31C2, and a dark olive-green module containing POTEA and GBA3 pseudogenes. The list of genes interacting with those pseudogenes and the enriched GO terms for these modules can be found in Supplementary Dataset S2.

Survival Analysis
To investigate the prognostic value of the pseudogenes identified by the TCGA database in HNC, we assessed the online database The Kaplan-Meier Plotter. In this analysis, only the prognostic value of the ten most deregulated pseudogenes identified in HNC general samples was assessed (SPATA31D5P, HERC2P3, SPATA31C2, MAGEB6P1, SLC25A51P1, BAGE2, DNM1P47, SPATA31C1, ZNF733P and OR2W5) because the online database does not allow to stratify the patients by tumor location.
In addition, we observed that higher expression of SPATA31D5P, ZNF733P and OR2W5 pseudogene transcripts was associated with the worst OS of HNC patients. Patients with higher expression presented 1.39, 1.56, and 1.53 more chances respectively, of dying, compared to other patients. Moreover, lower expression of SPATA31C2, BAGE2 and SPATA31C1 pseudogene transcripts was associated with the worst OS of HNC patients. Patients with lower expression presented 1.43, 1.39, and 1.43 more chances respectively, of dying, compared to other patients ( Figure 5B). MAGEB6P1, SLC25A51P1 and DNM1P47 pseudogene transcripts were not available in The Kaplan-Meier Plotter database, so we could not perform survival analysis. The HERC2P3 pseudogene transcript did not influence RFS or OS in the HNC patients. However, in the literature, HERC2P3 upregulation was associated with cell migration in gastric cancer, an independent prognosis factor [44]. The prognostic values of SPATA31D5P, SPATA31C2, BAGE2, SPATA31C1, ZNF733P and OR2W5 have not been explored in the literature yet.

Conclusions
The biology of cancer is complex and not fully understood so far. Pseudogenes lost the label of "junk DNA" and are now known to be modulators of gene expression and potential biomarkers for cancer risk and prognosis [12]. Pseudogene transcripts regulate gene expression by directly binding at the target mRNA or by functioning as miRNA sponges, impeding miRNA-target mRNA binding [20]. Thus, pseudogene transcripts can function as negative or positive gene regulators [20].
In HNC, some pseudogene transcripts have been studied and associated with tumor aggressiveness, HPV16 infection, and prognosis [24][25][26][27][28]. However, their role in HNC remains poorly explored, while patients' therapy resistance and poor survival rates highlight the need of finding novel molecular biomarkers. In our in silico analysis, we identified potential pseudogene transcripts, their genetic alterations, their interactions, and potential pathways in HNC progression and prognostics. Therefore, this study can guide new research to HNC understanding and development of new target therapies.