Differentially-Expressed Pseudogenes in HIV-1 Infection

Not all pseudogenes are transcriptionally silent as previously thought. Pseudogene transcripts, although not translated, contribute to the non-coding RNA pool of the cell that regulates the expression of other genes. Pseudogene transcripts can also directly compete with the parent gene transcripts for mRNA stability and other cell factors, modulating their expression levels. Tissue-specific and cancer-specific differential expression of these “functional” pseudogenes has been reported. To ascertain potential pseudogene:gene interactions in HIV-1 infection, we analyzed transcriptomes from infected and uninfected T-cells and found that 21 pseudogenes are differentially expressed in HIV-1 infection. This is interesting because parent genes of one-third of these differentially-expressed pseudogenes are implicated in HIV-1 life cycle, and parent genes of half of these pseudogenes are involved in different viral infections. Our bioinformatics analysis identifies candidate pseudogene:gene interactions that may be of significance in HIV-1 infection. Experimental validation of these interactions would establish that retroviruses exploit this newly-discovered layer of host gene expression regulation for their own benefit.


Introduction
Pseudogenes are considered relics of once-functional genes that have lost the ability to code for proteins. They arise due to mutations, frame-shifts and gene duplication events, where the duplicated gene loses transcriptional elements (like promoters and enhancers) necessary for gene expression [1]. Irrespective of how they are created, pseudogenes are traditionally assumed to be transcriptionally silent. However, microarray studies show that 2%-20% of pseudogenes are quantifiably expressed [1][2][3]. Tissue-specific expression of pseudogenes hints at their possible functional importance in those tissues [4,5]. Pseudogenes are also differentially expressed in diseases, notably cancer [6][7][8][9][10][11][12].
We are only beginning to understand the implications of pseudogene expression for cellular well-being. Pseudogenes can regulate the expression of parent genes by the following mechanisms [1]: (i) antisense pseudogene transcripts can pair with a sense parent transcript, altering

HIV-1 Infection
To generate transcriptomic data, we infected 5 × 10 6 H9 T-cells with 300 ng p24 Gag NL43 strain of HIV-1 using a spinfection protocol (multiplicity of infection > 1): infected cells were incubated at 37 • C for 15 min and then centrifuged at 1200 rpm for 2 h at 33 • C, followed by a further half-hour incubation at 37 • C and three rounds of washing with RPMI. The infected cells were then replenished with fresh medium and incubated at 37 • C for seven days. Then, 3 × 10 6 uninfected H9 T-cells were used as the control and incubated for 7 days in conditions identical to the infected T-cells (37 • C, 5% CO 2 ). Cellular RNA from infected and uninfected cells was extracted for sequencing 7 days post-infection using Qiagen's RNeasy Mini Kit.

RNA Sequencing
Paired-end (150 bp × 2) RNASeq data were generated using the Illumina HiSeq sequencing platform. The uninfected sample generated 82 million paired-end reads (24 Gbp) with a mean fragment size of 377 bp, and the infected sample generated 76 million reads (23 Gbp) with a mean fragment size of 361 bp. Illumina adaptors were removed using Trimmomatic Version 0.3, and low-quality reads were dropped using fastq_quality_filter [32]. Ribosomal RNA was depleted, and poly-A RNA was selected prior to sequencing. The sequence data generated were of high quality. The average quality score for the forward reads was >34, and that for the reverse reads was >33 (a Q-score of 30 is considered optimal). The transcriptomics data are available at the NCBI's Gene Expression Omnibus (GEO) database, via Accession Number GSE70785.

Pipeline to Identify Differentially-Expressed Pseudogenes
The filtered reads were mapped to the human reference transcriptome using Tophat2 Version 2.0.11 [33]. The human reference genome (fasta file: Homo_sapiens.GRCh37.57.dna.toplevel.fa.gz, indexed using Bowtie2 Version 2.2.1 [34]) and the most recent human gene model (Gene Transfer Format (GTF) file: Homo_sapiens.GRCh37.75.gtf) were downloaded from Ensembl. More than 90% of reads from the uninfected and the infected sample mapped to the human reference transcriptome. The mapped reads were then supplied to the Cuffdiff program of Cufflinks (Version 2.2.0) to quantify gene expression, as well as to calculate the log 2 fold change in gene expression due to HIV-1 infection [35]. Cuffdiff reports gene expression in FPKM (fragments per kilobase of transcript per million mapped reads).
To avoid false positives due to noise in measuring very low expression levels, we limited ourselves to genes with expression >0.1 FPKM in both infected and uninfected cells, a threshold widely used for reliable detection of transcripts in RNASeq data [36][37][38]. The 16,243 mRNA transcripts fulfilling this criteria were further pruned to only those that showed at least a 4-fold increase or decrease in gene expression due to infection (i.e., a log 2 fold change in expression >2 or <−2). We focused on fold change in gene expression, because this quantity is comparable across genes with different basal expression levels. This resulted in 493 transcripts with at least a 4-fold change in expression, of which 46 are annotated as pseudogenes in Ensembl based on GENCODE (ENCyclopedia Of DNA Elements-genes and gene variants) annotations [26].
We then computed the p-values and false discovery rate (FDR) for these 46 differentially-expressed pseudogenes as follows: the 16,243 genes with expression level >0.1 FPKM in both infected and uninfected cells had an average log 2 (fold change in expression) of −0.037 (µ) with a standard deviation (σ) of 0.819 (the log fold change distribution is normally distributed; Supplementary Figure S1). Using this µ and σ, we calculated z-scores for the pseudogene change in gene expression as z = (x − µ)/s, where x is the log 2 (fold change). p-values for the two-sided test were determined as the cdf of the normal distribution, and the FDR (Q-value) was calculated using the qvalue function of R [39]. The computational pipeline to identify differentially-expressed pseudogenes is outlined in Figure 1.  Computational pipeline to identify differentially-expressed pseudogenes from transcriptomic data. The transcriptomic data collected from HIV-1 infected and uninfected H9 T-cells 7 days post-infection were mapped to the human reference transcriptome after quality control analysis. Differentially-expressed pseudogenes were identified following several stringent filtering steps that are the accepted standard in the field: at least >4-fold change in expression, p-value < 0.01 and FDR < 0.25.

Bioinformatics Analysis of Candidate Pseudogenes
Human proteins matching the translated pseudogene sequence were identified by BLASTX against the GENCODE/Ensembl database using a cutoff E-value of 10 −4 . BLASTX hits that were uncharacterized novel protein coding genes are not reported. cDNAs (transcripts/splice variants) matching pseudogene sequences were identified using BLASTN against the Ensembl cDNA database using a cutoff E-value of 10 −4 and sequence identity of >90%. cDNA hits for uncharacterized genes and other pseudogenes were ignored.  [40]. We analyzed these RNASeq datasets as described above: quality control followed by mapping the reads to human reference transcriptome using Tophat2 and quantifying gene expression using Cuffdiff to determine the log 2 fold change in gene expression. This generated 9 sets of gene expression comparisons at the 12-h time point (3 replicates of Mock transcriptomes compared to 3 replicates of HIV-1 infection transcriptomes: a total of 9 comparisons) and 6 sets of gene expression comparisons at the 24-h time point (2 replicates of Mock × 3 replicates of HIV-1 infection transcriptomes). These 15 transcriptomics datasets had 14-23 million paired-end reads each, with a 73%-88% overall read mapping rate to the reference human transcriptome. Supplementary Table S2 lists the log 2 fold change in gene expression at 12 h and 24 h post-infection for the pseudogenes and their parent genes.

Results
Using a p-value < 0.01 as a cutoff, we identified 13 pseudogenes as significantly over-expressed and 17 as significantly under-expressed in HIV-1-infected T-cells (Table 1). Using an FDR cutoff of 0.25, as previously used for exploratory gene expression analyses [41,42], we further narrowed the number of differentially-expressed pseudogenes to 21 (eight over-expressed and 13 under-expressed; shaded rows in Table 1). Most of the over-expressed pseudogenes are processed (Table 1). We also find several uncharacterized pseudogenes as differentially expressed (pseudogene names with prefix RP followed by a number).

Pseudogenes Derived from Over-Expressed Genes
Over-expressed genes are more likely to form processed pseudogenes [43]. In our analysis, we find that several differentially-expressed pseudogenes are derived from highly-expressed genes: MTATP8P2, MTND4P15, RP1-89D4.1, RP11-380G5.3, AC010733.5. It is likely that these pseudogenes are observed as differentially expressed largely due to fluctuations in the expression of their parent genes.
To investigate the putative function of differentially-expressed pseudogenes that are not derived from over-expressed protein-coding genes, we identified human proteins (Table 2; BLASTX hits against GENCODE using a cutoff E-value of 10 −4 ) and transcripts/splice variants (Table 3; BLASTN hits against the Ensembl cDNA database using a cutoff E-value of 10 −4 and sequence identity of > 90%) that are similar to these pseudogene transcripts or their translated products.
Several pseudogenes have their parent genes as the best (and sometimes the only) BLASTX or BLASTN hit (Tables 2 and 3). A pseudogene may have a regulatory function if its expression correlates (positively or negatively) with that of its parent or other similar genes. Table 2. BLASTX hits of differentially-expressed pseudogenes. Only hits to known protein-coding genes with E-value < 10 −4 are shown. "fc" denotes fold change in gene expression; "Mock" represents gene expression in uninfected T-cells; and "HIV" denotes gene expression in HIV-1-infected T-cells 7 days post-infection.

Pseudogenes with Antagonistic Expression to the Parent Genes
If pseudogene transcripts are acting as molecular sponges for mRNA stability factors and miRNA, then their expression will negatively correlate with that of the parent gene.
We find several pseudogene:gene pairs where pseudogene expression is antagonistic to their BLASTX/BLASTN hits (Tables 3 and 4). The GO terms for the TBC1D9B protein product suggest its function as a regulator of Rab-GTPase activity. Another GTPase (Rho-GTPase) is required for HIV-1 replication [44,45], but the role of Rab-GTPases in the HIV-1 life cycle is not clear. However, it has been shown that Rab GTPase-activating proteins bind Hepatitis C virus protein NS5A to mediate viral replication [46]. The average gene expression change from nine comparisons at 12 h and 6 h post-infection also show upregulation of TBC1D9B and a >2-fold increase in gene expression seven days post-infection (Table 4). Thus, the interaction between ZNF137P and TBC1D9B transcripts may be implicated in HIV-1 replication. Table 4. Differentially-expressed pseudogenes in early and late HIV-1 infection. log 2 (fold change) of only those pseudogene:gene pairs are shown where the parent gene shows a >1.5-fold change in expression due to HIV-1 infection, i.e., log 2 (fold change) either >0.585 (upregulated) or <−0.585 (downregulated). The time points (12 h, 24 h or 7 days post-infection) where the genes show a >1.5-fold change in expression are highlighted in bold. The genes that are upregulated or downregulated in all of the replicate observations (nine observations for 12 h and six observations for 24 h time-points) are identified by *. "on" genes are those that show detectable expression in HIV-1-infected cells only, and "off" genes show detectable expression in uninfected cells only. The log 2 (fold change) at the 12-h and 24-h time points is the average log 2 (fold change) over replicate observations. Supplementary Table S2 lists log 2 (fold change) in each replicate. ANTXRL did not have detectable gene expression in any of the nine observations at 12 h post-infection. It is known that HIV's Tat protein downregulates SOD2 transcripts in mitochondria, leading to mitochondria-mediated apoptosis in infected cells [47], making the pair ANTXRLP1:SOD2 of interest (Table 3). RP11-411B10.4's parent gene ADC is upregulated in HIV-1 infection (more than an eight-fold increase in gene expression at the 24-h time point; Table 4). ADC encodes arginine decarboxylase, an enzyme that metabolizes L-arginine to agmatine. This is interesting, because increased metabolism of L-arginine impairs lymphocyte response to antigens [48]. Thus, even though ADC expression is not linked to HIV-1 infection, upregulation of arginine decarboxylase appears to be an effective viral strategy to evade host immune response.

Pseudogenes with Synergistic Expression to the Parent Genes
Sometimes, pseudogene transcripts compete with the parent gene transcript for silencing non-coding RNAs, as has been shown for tumor suppressor gene PTEN and its pseudogene PTENP1 [20]. Of these, only a few parent genes code for proteins that are known to have a role in viral infection, such as HNRNPA3/HNRNPA1. The heterogeneous nuclear ribonucleoproteins (hn-RNPs) bind to pre-mRNA and are involved in presenting the mRNA to the splicing machinery and transporting mRNA to the cytoplasm. Downregulating hn-RNP A proteins promotes viral production due to efficient alternative splicing of viral mRNA transcripts and their trafficking [49]. Further, upregulation of hn-RNP A1 results in a 100-fold decline in virus production [50]. We find that pseudogene HNRNPA3P6, its parent gene HNRNPA3 and a close paralog of the parent gene HNRNPA1 are under-expressed in HIV-1 infection (Tables 2 and 3). It is thus possible that the virus exploits the regulatory influence of pseudogene HNRNPA3P6 on HNRNPA3 and HNRNPA1 to downregulate hn-RNP proteins and promote virus production.
Several functional genes similar to pseudogene KLHL2P1 are downregulated in HIV-1 infection (Table 4). BIRC5 inhibits apoptosis, and its downregulation in both early and late HIV-1 infection could contribute to HIV-associated depletion of T-cells [51]. While the TRIM family of proteins is important in innate immune responses [52], we find TRIM59 to be downregulated in HIV-1 infection (Table 4). MYB encodes a transcriptional activator protein that binds to the HIV-1 long terminal repeat (LTR) and activates viral transcription [53]. Consistent downregulation of MYB expression could thus be the host response to mitigate HIV-1 transcription.
Pseudogene RP11-490K7.4's functional relative ADAM10 encodes a cell surface protein that has both adhesion and protease domains and has been shown to facilitate nuclear trafficking of HIV-1 [54]. Another protein coding gene similar to this pseudogene, GTF2A2, is found downregulated in uninfected T-cells in response to exposure to the HIV-1 gp120/V3 epitope [55], similar to our observation at 24 h post-infection.

Pseudogenes and Their Parent Genes with Modulating Gene Expression in Early and Late HIV-1 Infection
As the expression of pseudogene DUTP1 goes from a >1.5-fold decrease at 12 h post-infection to a >8-fold increase at seven days post-infection, its parent gene expression is modulated as well from over-expressed to under-expressed (Table 4). DUT codes for a protein that hydrolyzes the dUTP to dUMP reaction, and its downregulation will increase cellular dUTP levels. Heavy uricilation of HIV reverse transcripts is common and promotes chromosomal integration of the viral genome [56]. Thus, over-expression of DUTP1 to downregulate DUT expression may be the cellular response to inhibit the integration of uricilated viral transcripts into the host genome.
Pseudogene RP11-265N6.3 is consistently upregulated in HIV-1 infection with consistent expression at 12 h and 24 h, and a >4-fold increase in expression at seven days post-infection. However, its parent gene MYL12A shows modulating gene expression at 12 h and 24 h post-infection. MYL12A is involved in the regulation of the actin cytoskeleton and is found to be upregulated in HIV/HCV co-infection [57]. HIV-1 proteins are known to reorganize the actin cytoskeleton to optimize virion production [58].
While pseudogene ZNF137P is increasingly downregulated as the infection progresses, its parent gene ZNF83, an anti-inflammatory gene [59], is over-expressed by a factor of 1.5 in all six datasets at 24 h post-infection, but is under-expressed seven days post-infection (Table 4). In contrast, FOXK1 (parent gene of KLHL2P1) is downregulated only in early viral infection (a >1.5-fold decrease in expression and downregulation in all nine datasets at 12 h and six datasets at the 24-h time point; Table 4). FOXK1 inhibited interferon production in Sendai virus infection, a single-stranded RNA virus [60], and its downregulation in early HIV-1 infection likely contributes to triggering the immune response.
In the RP11-114F3.5:CRLF2 pair, the CRLF2 gene shows a >2-fold downregulation 12 h post-infection, followed by an increase in gene expression at 24 h (as reflected by the "on" signal in all six transcriptomic comparisons at 24 h post-infection), leading to a >8-fold increase in expression in infected cells seven days post-infection (Table 4). Thus, it appears that the CRLF2 gene expression went from downregulated in early infection to significantly upregulated in the late HIV-1 infection. CRLF2 encodes a receptor for cytokine TSLP, whose over-production in HIV-1 infected mucosal epithelial cells amplifies HIV infection in activated CD4+ T-cells [61]. It is thus important to understand how CRLF2 expression is regulated in T-cells and if pseudogene RP11-114F3.5 has any influence on its post-transcriptional regulation.
ABCC10, a functional protein similar to pseudogene RP11-114F3.5, codes for the multidrug resistance-associated protein 7 (MRP7) that transports molecules, such as xenobiotics, across intraand extra-cellular membranes [62]. HIV-Tat is known to upregulate the expression of another transporter from the same family (MRP1), a cell-surface protein of the cells of the blood brain barrier that remove anti-viral drugs from the central nervous system [63]. It is thus likely that modulating gene expression of ABCC10 (downregulated in all nine datasets at 12 h and upregulated in all six datasets at 24 h post-infection) might be interfering with the transport of molecules across HIV-1-infected T-cells. Interestingly, another transporter encoding gene (SLC2A5) shows modulating gene expression, as well (a >3-fold decrease in expression at 12 h and a >3-fold increase in expression at the 24-h time point; Table 4).
Pseudogene RP11-471L13.3 is over-expressed by a factor of >1.5 at 24 h and shows a >4-fold decrease in expression seven days post-infection. Its putative parents genes (DYNLT1 and DYNLT3) are dynein motor proteins that carry cargo, such as proteins, across the cellular microtubules; dynein proteins are thought to associate with viral proteins during infections [64].

Discussion
Although Vanin characterizes pseudogenes as "related but defective" [22], this sentiment has seen a shift in recent years. With the accumulating evidence of functional consequences of pseudogene transcription, the term "pseudogene" should be revisited [1]. A pseudogene capable of exerting regulatory control on other genes via its RNA intermediates is functionally active even if it does not code for a protein. Although most pseudogenes are thought to evolve neutrally [3,65], some transcribed pseudogenes exhibit patterns of non-neutral evolution [66]. The contribution of pseudogenes to the non-coding small-RNA pool that regulates the expression of a wide range of genes provides a possible explanation for their evolutionary maintenance. About a fifth of human pseudogenes are transcribed [3], and thus might be candidates for purifying selection if they participate in gene expression regulation. For a retrovirus with a pre-packaged reverse transcriptase and integrase, this presents an opportunity to exploit a layer of gene expression regulation for its own benefit. Here, we present evidence that suggests the involvement of pseudogenes in the regulation of cellular gene expression in the early stages of a retroviral infection.
Our computational pipeline for detecting differentially-expressed pseudogenes relies on software that is widely used and accepted by the scientific community. Gene expression is normalized by gene length to account for the fact that longer genes end up producing more sequencing reads than shorter genes. Although it is desirable to have replicates in the sequencing experiment design, replicate data are usually generated by pooling the replicate samples together in a single sequencing run, reducing the data generated per sample. While we analyzed multiple replicate transcriptomes at 12 h and 24 h post-infection (which likely increased specificity) the single transcriptome at seven days post-infection time point had four times the sequencing data (≈80 million reads compared to the ≈20 million reads at the 12-h and 24-h time points). This high sequencing depth combined with high quality reads and a mean fragment size of >350 (150 × 2 paired-end sequencing) for both uninfected and infected samples improves sensitivity in detecting differentially-expressed genes and adds confidence to our predictions. Both increasing sequencing depth and biological replication have been shown to increase the power of detecting differentially-expressed genes [67]. Strongly upregulated protein-coding genes in our data (the seven-day time point) are implicated in HIV-1 infection, lending further confidence to our analysis (Supplementary Table S1).
While our methodology cannot find novel pseudogenes that might be created in an infected cell due to viral reverse transcriptase and integrase activity, we find a small number of known pseudogenes that are differentially expressed in HIV-1-infected cells. While the data from the 12-h and 24-h time points comes from a T lymphocyte cell-line that is different from the T-cell-line at the seven-day time point, our candidate pseudogene:gene interactions show consistent differential expression across the three time points, suggesting that these host responses are not cell-line specific. We further identify protein-coding genes (parent genes or genes that are BLASTX/BLASTN hits to the pseudogenes) whose expression is correlated with that of the differentially-expressed pseudogenes. One third of these parent genes are implicated in HIV-1 infection, suggesting pseudogene-mediated post-transcriptional regulation of gene expression in the infected cells. It should be noted that the study that experimentally confirmed the effect of pseudogene PTENP1 transcription on the mRNA levels of tumor suppressor gene parent gene PTEN showed that the change in PTEN mRNA expression was less than two-fold, yet the pseudogene PTENP1 is "selectively lost in human cancer" [20]. Thus, we restricted our focus to those parent genes that show a >1.5-fold change in gene expression at 12 h, or 24 h, or seven days post-infection (Table 4). It is possible that these parent genes are differentially expressed in HIV-1 infection due to other factors in the complex gene-regulatory machinery of the cell, but the significant change in expression in their pseudogenes suggests pseudogene-mediated gene expression regulation. Experimental validation of regulatory interactions between the candidate pseudogene:gene pairs is needed to confirm if viruses exploit this newly-discovered layer of post-transcriptional regulation of gene expression.

Conclusions
In this study, we find that instead of being silent spectators, pseudogenes are differentially expressed in HIV-1 infection. While non-coding RNAs are now considered integral to the host response to viral infections, our findings indicate for the first time that RNAs derived from pseudogene transcripts are a part of the cellular non-coding RNA pool, and may regulate expression of genes implicated in viral infections. Functional relevance of pseudogene transcription has been experimentally verified in cancer but remains to be validated in viral infections. We anticipate that pseudogene transcription will be of significance in other cellular processes and in host responses to the environment as well.