Alternative Splicing, RNA Editing, and the Current Limits of Next Generation Sequencing

The advent of next generation sequencing (NGS) has fostered a shift in basic analytic strategies of a gene expression analysis in diverse pathologies for the purposes of research, pharmacology, and personalized medicine. What was once highly focused research on individual signaling pathways or pathway members has, from the time of gene expression arrays, become a global analysis of gene expression that has aided in identifying novel pathway interactions, the discovery of new therapeutic targets, and the establishment of disease-associated profiles for assessing progression, stratification, or a therapeutic response. But there are significant caveats to this analysis that do not allow for the construction of the full picture. The lack of timely updates to publicly available databases and the “hit and miss” deposition of scientific data to these databases relegate a large amount of potentially important data to “garbage”, begging the question, “how much are we really missing?” This brief perspective aims to highlight some of the limitations that RNA binding/modifying proteins and RNA processing impose on our current usage of NGS technologies as relating to cancer and how not fully appreciating the limitations of current NGS technology may negatively affect therapeutic strategies in the long run.


Introduction
The advent of high-throughput technologies (HTTs) such as DNA microarrays and reverse-phase protein arrays (RPPAs) in the 1990s revolutionized the manner in which gene expression analysis was undertaken in the life sciences [1,2]. Prior to HTTs, most biological studies focused on a single gene/protein or a handful of genes, with quantitative reverse-transcription polymerase chain reaction (qRT-PCR) and Northern and Western blotting techniques used to assess gene/protein expression. Over the years, HTTs in nucleic acid analyses further developed to include whole genome sequencing (WGS), whole exome sequencing (WES), whole transcriptome sequencing (WTS or RNA-seq), non-coding RNA-seq, and variations of these techniques to identify modified (acetylated or methylated) nucleic acids [3]. In contrast, HTTs in protein expression analyses developed along two fronts: those involving modifications of RPPA and advancements in mass spectrometry [4][5][6][7]. Modern gene expression analyses now add single-cell and spatial parameters to these HTTs [8][9][10]. But similar to the limitations that post-translational modifications (PTMs) impose on protein expression analysis, data derived from HTTs involving gene expression analyses at the nucleic acid level are limited by a number of factors, some of the most important being the presence of RNA binding proteins (RBPs), the isoforms of these RBPs present, the PTM status of these RBPs, and the activities these RBPs catalyze during RNA processing. Barring epigenetic modifications in DNA, single-cell WGS is currently the most accurate HTT platform. The genome represents the base starting material for gene expression with a limited chromosome/gene copy number, and single-cell analysis avoids the heterogeneity that is often present in an analysis of whole diseased tissue.
Still possible though are chromosomal alterations due to duplications, deletions, and translocations that can alter the gene structure and copy number or create novel fusions that cause generated data to deviate from the expected. Added to this is the possible influence of DNA editing enzymes, mutations in genes affecting the efficiency of DNA repair, or innate immune/stress-mediated changes [11][12][13][14]. But, all in all, the genome of a cell is that particular cell's genome. For the transcriptome, it is not that simple, and it is becoming more evident that expressed transcripts and isoforms of proteins can vary depending on the tissue/cell type, state of differentiation, microenvironment, stress, and disease, with many RNA transcripts and potential protein isoforms going undocumented and unvalidated.

RNA Modifications
All RNA transcripts can undergo some form of post-transcriptional processing. In eukaryotes, the classic example involves mRNA capping, polyadenylation, and splicing. These processes involve the addition of an N7-methylated guanosine linked 5'-5' to a second guanosine by a triphosphate linkage (7mGpppG) at the 5' nucleotide of the pre-mRNA, the addition of a poly(A) tail to the 3' end of the sequence, and the removal of introns to form the mature mRNA. Both 7mGpppG capping and polyadenylation influence the stability of the mRNA transcript, its nuclear export, and its subsequent translation, thus influencing the expression level of the encoded protein [15]. In addition to these processes, a number of further modifications can be added to all RNA species, including methylations (A, C, and ribose ring), hydroxymethylations (C), and isomerizations of uridine (U) to pseudouridine (Ψ). These modifications can influence the folding and stability of the RNA, the proteins that associate, and the trafficking of the RNA [16,17]. While the exact influence of all these RNA modifications is not completely understood, some, such as 2'-O-methylation of the ribose ring in the second nucleotide of the mRNA cap, appear to serve a purpose in distinguishing self-transcripts from non-self [18,19]. But the fact that these modifications do not directly change the functional transcript sequence limits the influence they impose on an HTT gene expression analysis, and they will not be further discussed here. For more information on these types of modifications, see reviews by Boo and Kim, 2000 and Roundtree et al., 2017 [16,17]. In contrast, the processes of splicing and RNA editing alter the actual sequence, structure, and coding of the RNAs affected. As with other post-transcriptional modifications, a specialized group of RNA binding proteins catalyzes these processes.

mRNA Splicing and Alternative Splicing
In eukaryotic cells, the removal of introns and the splicing together of exons is mandatory for the proper synthesis of mature mRNA transcripts and the synthesis of encoded proteins. In the nucleus, the association of the spliceosome complex initiates the first post-transcriptional modification of the pre-mRNA. The assembly of the spliceosome complex on the pre-mRNA and subsequent splicing require both cis- and trans-acting factors. The cis-acting factors consist of RNA sequences 5' and 3' of the intron-exon junction and the branch-point (BP) region located 18-40 nucleotides upstream of the 3' splice site. In addition to these, sequences present in the introns and exons serve as splicing enhancers or silencers based on the RNA binding proteins that preferentially associate at these sequences. The trans-acting factors consist of the small nuclear ribonucleoproteins (snRNPs) U1, U2, U4/U6, and U5 as well as close to 300 additional proteins [20,21].
Splicing initiates with the ATP-dependent binding of U1 snRNP to the 5' splice site of the intron. This interaction is stabilized by members of the serine/arginine-rich (SR) protein family. Following this initial step, SF1/BBP binds the BP region while the 35-kDa subunit of the heterodimeric U2 auxiliary factor (U2AF) binds the 3' splice site. The 65-kDa subunit of U2AF then serves to bridge SF1/BBP and the 35-kDa subunit of U2AF. This initial recognition of the 5' and 3' splice sites forms the E-complex. At this juncture, the U2 snRNP associates in an ATP-dependent manner with the BP, forming the pre-spliceosome (A-complex), and is stabilized by the association of SF3 (A and B) proteins as well as U2AF. SF1/BBP is then displaced by other SF3 proteins. A pre-assembled U4/U6/U5 complex is then recruited to the A-complex to form the pre-catalytic spliceosome (B-complex). Structural alteration of the spliceosome results in the destabilization of the U1 and U4 interactions, resulting in the loss of these snRNPs and the generation of an active spliceosome complex or B'-complex, which initiates the first catalytic step of splicing to form the C-complex [20,21]. The second catalytic step in splicing occurs through structural modification of the spliceosome RNP component, which may associate or disassociate from the complex. The final phase results in the release of U2, U5, and U6 snRNPs and the release of the spliced mRNA and intron (Figure 1). This process normally proceeds sequentially, intron by intron, to form the mature mRNA, but this is not always the case. Statistical data suggest that, on average, a gene encoding 11 exons will produce approximately 5.4 differing mRNAs. The responsible process is referred to as "alternative" splicing [22]. Alternative splicing of mRNA is a fundamental process that expands the possibilities of gene expression and increases protein diversity.
It is estimated that at least 50% of genes express alternately spliced isoforms that vary according to the tissue/cell type and condition [21]. This estimate may be low, as many immune-related genes have been observed to undergo varying degrees of alternative splicing relative to conditions. In T- and B-lymphocytes, RNA-seq and microarray studies suggested that nearly 60% of all genes expressed were alternately spliced [23]. Five patterns of alternative splicing have been described: exon skipping, intron retention, mutually exclusive exons, alternative 5' splice site, and alternative 3' splice site (Figure 1). The major players involved consist of the trans-regulatory factors, serine/arginine-rich splicing factors (SRSF) and heterogeneous nuclear ribonucleoproteins (hnRNPs), and the cis-regulatory elements defined as exonic splicing enhancers (ESE), intronic splicing enhancers (ISE), exonic splicing silencers (ESS), and intronic splicing silencers (ISS). Usually, SRSF proteins associate with ESE and ISE elements, favoring the recruitment of U1 and U2 snRNPs and auxiliary factors and subsequent splicing, while in contrast, association of hnRNPs with ESS and ISS elements favors suppression by inhibiting spliceosome access to the polypyrimidine tract [20,24]. Impairment of alternative splicing is associated with a failure of normal cellular function with a final outcome of disease [20,21]. It is becoming more evident that for a majority of genes, a single primary transcript does not code simply for a single protein but rather for diverse isoforms of the same protein, each having specific functions in given tissues. Not only can alternative splicing affect the amino acid sequence of the resulting protein, affecting its function/activity or localization, but it can also influence the translation efficiency and stability of the encoding mRNA and thus regulate protein expression post-transcriptionally but pre-translationally.
This mechanism is finely tuned at developmental stages in different tissues, and an alteration in the regulation of alternative splicing is now linked with several human diseases, including pre-leukemic states such as MDS, leukemia, a variety of solid tumors, Alzheimer's disease, Parkinson's disease, and metabolic and autoimmune diseases [20,21,25]. The identification of frequent mutations in genes involved in RNA splicing in cancer gives an interesting view into the mechanism of cancer-specific alternative splicing [21,25]. As changes in alternative splicing and regulators of alternative splicing have been associated with disease, it is only reasonable to assume that diseased tissues present with a significant percentage of abnormal transcripts, many most likely not yet identified or validated, which contributes to their absence from curated databases (NCBI, ENSEMBL) and their elimination by most standard expression profiling protocols that map reads to annotated transcripts. But these transcripts may be highly important and have roles (positive or negative) in the disease state.
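To give a sense of scale for the gap between possible and documented isoforms, the following minimal Python sketch enumerates the exon combinations available to an 11-exon gene through exon skipping alone. It is an illustration under stated assumptions, not a model of real splicing: it assumes that the first and last exons are always retained and that exon order is preserved, and it ignores the other four splicing patterns, regulatory constraints, and nonsense-mediated decay. The function name is hypothetical.

```python
from itertools import combinations

def exon_skipping_isoforms(n_exons):
    """Enumerate exon combinations reachable by exon skipping alone.

    Assumption (illustrative only): the first and last exons are always
    retained and exon order is preserved.
    """
    first, last = 1, n_exons
    internal = range(2, n_exons)  # exons that may be skipped
    isoforms = []
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            isoforms.append((first, *kept, last))
    return isoforms

isoforms = exon_skipping_isoforms(11)
# 9 internal exons, each either kept or skipped: 2**9 combinations
print(len(isoforms))
```

Under these assumptions, the theoretical space (512 combinations) is roughly two orders of magnitude larger than the approximately 5.4 mRNAs per 11-exon gene cited above, which is one way to picture how many condition-specific transcripts could remain undocumented.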

Serine/Arginine-Rich Splicing Factors (SRSFs)
The serine/arginine-rich (SR) splicing factor family comprises 12 phylogenetically conserved and structurally related proteins encoded by genes, SRSF1-12, scattered throughout the genome (Figure 2A). These proteins act in complexes to control constitutive and alternative pre-mRNA splicing, as well as other aspects of gene expression (Figure 2A, Table 1) [26]. SRSFs contain one or two RNA-recognition motifs at the N-terminus and one SR-rich domain, which serves as an intermediary in the interaction with other proteins, at the C-terminus. SRSF1 and SRSF2, which have been extensively investigated, serve pivotal roles in cell cycle regulation, genome stability, and translation as well as pre-mRNA splicing, stability, and transport [26,27]. Most SRSFs are localized to the nucleus, nucleoplasm, or speckles, but a number are also found in the cytoplasm and shuttle between the cytoplasm and nucleus.
Table 1 (excerpt):
Function/Role: Acts as an antagonist to SR proteins in pre-mRNA splicing regulation. Mainly expressed in testis.
Function/Role: Regulates alternative pre-RNA splicing by binding U-rich sequences immediately downstream of 5' splice sites in a U snRNP-dependent manner. Can promote atypical 5' splice site selection by promoting splicing of exons with weak 5' splice sites. Isoform 2 demonstrates enhanced splicing regulatory activity as compared to Isoform 1. Most prominently expressed in heart, small intestine, kidneys, liver, lungs, skeletal muscle, pancreas, ovary, and testis. Disease associated.
Function/Role: Associates with pre-mRNA in a sequence-specific manner to promote or inhibit exon inclusion. Works by antagonizing splicing regulators belonging to the hnRNP class of proteins. Ubiquitously expressed with highest expression in heart, skeletal muscle, and pancreas; lowest expression in kidneys and liver.
Table 1 notes: uc-uncharacterized and muc-mostly uncharacterized. * This protein is not considered part of the SR splicing factor family but is included here for its role in regulating members of this family.
The activity of the SRSF proteins is highly regulated by extensive and reversible phosphorylation of serine residues carried out by several kinases, including AKT family kinases, CDC-like kinases (CLKs), SRSF protein kinases (SRPKs), pre-mRNA splicing 4 kinase (PRP4K), and protein kinase A (PKA) [28][29][30]. These phosphorylations modulate protein-protein interactions within the spliceosome and regulate the activity and sub-cellular distribution of SRSF proteins. Therefore, changes in the phosphorylation state of SRSFs during growth factor/cytokine stimulation or stress (inflammation, cytotoxic cytokines) play a critical role in the control of their activity and the splicing landscape of expressed transcripts. Other than phosphorylations, additional PTMs, such as acetylations, methylations, succinylations, sumoylations, and ubiquitinations, are present. But with the exceptions of methylations at the amino-terminus, which appear to regulate subcellular localization to the nucleus, and ubiquitinations at the amino-terminus, which appear to promote protein degradation, the significance of most of these PTMs is unknown (Table 1) [31]. SRSFs are also known to influence splicing of their own transcripts and those of other SRSFs, thus resulting in multiple SRSF isoforms that most likely demonstrate altered activity and transcript/substrate preference; the majority of these alternative isoforms of SRSF proteins have never been examined. Moreover, the activity of these SRSFs is also altered by other splicing modulators such as splicing regulatory glutamine/lysine-rich protein (SREK)-1, serine/arginine repetitive matrix protein (SRRM)-1/-2, cytotoxic granule-associated RNA binding protein TIA1, and transformer-2 protein homolog (TRA2)-A/B (Table 1) [32][33][34][35][36].
Thus, differential expression of these splicing factors, the alternate isoforms expressed, and their PTMs, mutations, and regulation by additional splicing factors each can drastically change the RNA splicing landscape to produce an infinite number of potential transcripts and protein isoforms that have specific roles.

Heterogeneous Nuclear Ribonucleoproteins (hnRNPs)
Heterogeneous nuclear ribonucleoproteins assist in the maturation of pre-mRNA transcripts to mature mRNAs during nuclear to cytoplasmic transport, and their subsequent translation, by influencing alternative splicing and mRNA stability, transport, and folding. There are around 20 major types of hnRNPs and several minor families whose genes are scattered throughout the genome, some being sex-linked (Figure 2B, Table 2) [37]. The hnRNPs contain a nuclear localization sequence (NLS) and are thus primarily localized to the nucleus, but this localization is highly dependent on post-translational modifications, which regulate the nuclear-cytoplasmic localization as well as intermolecular interactions of the hnRNPs [37]. In addition to the NLS, the hnRNPs contain at least one or a combination of four types of RNA-binding domains (RBDs): RNA recognition motif (RRM), RRM-like, glycine-rich with an RGG box, or KH domain. The specificity of RNA-protein binding within hnRNPs is entirely dependent on the three-dimensional structure of the protein surrounding the RBD, with diversity dictated by the combination of different RNA-binding motifs present [37].
Similar to SRSF proteins, a number of hnRNPs also express alternatively spliced isoforms, which potentially change activity or substrate specificity, and carry PTMs, including acetylations, caspase cleavage (activation/protein processing), methylations, N-glycosylations, O-linked β-N-acetylglucosaminylation (O-GlcNAc), phosphorylations, succinylations, sumoylations, and ubiquitinations, whose roles in the cell are not clear (Table 2) [31]. Many of the hnRNPs are known to be part of the spliceosome complex and influence splicing of particular mRNAs or their alternative splicing; thus, alterations in their expression, the isoform expressed, their PTMs, and the proteins with which they interact can change the expressed RNA landscape. In addition to splicing and alternative splicing, hnRNPs are known to carry out a number of other functions: (1) binding of certain hnRNPs to the 3' untranslated region (UTR) stabilizes some mRNAs while targeting others for degradation; (2) certain hnRNPs sequester specific mRNAs or suppress their translation; (3) hnRNPs can influence internal ribosome entry site (IRES)-dependent translation of specific mRNAs; and (4) hnRNPs can also influence miRNAs [37]. The expression of hnRNPs is known to influence the transcription and translation of multiple oncogenes and tumor suppressors, epithelial-mesenchymal transition (EMT), and immune/inflammatory regulation; thus, it is not surprising that alterations/mutations in a number of these proteins are associated with disease [37]. As with the alternative SRSF isoforms, the cellular functions of most alternative hnRNP isoforms have never been examined, nor is it clear how PTMs of the hnRNPs influence alternative splicing. Thus, the potential exists for hnRNPs, whether through differential expression, isoforms expressed, PTM regulation, or mutation, to produce an infinite number of potential transcripts and protein isoforms that have specific roles but have not yet been identified or annotated.
Table 2 (excerpt):
Function/Role: Specifically binds AU-rich element (ARE)-containing mRNAs and is involved in post-transcriptional stability of diverse cytokine mRNAs.
Function/Role: Involved in the packaging of pre-mRNA into hnRNP particles, nucleo-cytoplasmic transport of poly(A) mRNA, and modulation of splice site selection. Disease associated.
Localization: Nuc, Cyto (shuttles). Function/Role: Involved in the packaging of pre-mRNA into hnRNP particles, nucleo-cytoplasmic transport of poly(A) mRNA, and modulation of splice site selection.
Table 2 notes: The level to which these PTMs have been studied is defined in parentheses next to the number of PTMs identified for each protein: uc-uncharacterized; muc-mostly uncharacterized; ssc-some sites modestly characterized; wc-a significant number of sites well characterized.

Consequences of Alternative Splicing and Measures to Address the Issue
The biggest issue involving alternative splicing with regard to the interpretation of RNA-seq data is the sheer absence of many alternatively spliced transcripts from the databanks. This stems from both the inability to continuously update the databanks and the failure of the research community to upload novel transcripts promptly when found, which of course may occur for any number of understandable reasons. As can be seen in the previous sections, the number of protein complex combinations, the isoforms involved, and their PTM status suggest, in theory, an infinite number of splicing possibilities across the transcriptome. Certainly, many of these may not lead to viable mRNA products, as they are relegated to nonsense-mediated decay or encode a protein product that is highly unstable and rapidly degraded, but it would be premature to suggest that these products do not have a function in the cell. Twenty years ago, pseudogenes were viewed as leftovers/byproducts of gene amplification; only later was it found that in many cases, these non-coding transcripts acted as bait to sequester miRNAs away from the actual coding transcripts (a primary example involves the tumor suppressor PTEN). One must also keep in mind that, as well as having an infinite number of splicing possibilities, we also have an infinite number of possible conditions that may influence alternative splicing and promote the production of novel isoforms. Moreover, the response to these conditions and the expression of alternately spliced isoforms may be, and likely is, tissue/cell-type specific and highly influenced by the disease state of the tissue. Thus, it is not surprising that alterations in several splicing factors (SRSFs and hnRNPs; Tables 1 and 2) are associated with disease and that the expression of these factors is associated with cancer. Therefore, while physical long-read RNA sequencing itself may catch all these isoforms, a standard analysis protocol that relies on annotated transcripts is bound to miss the majority of novel variants.
So how can the presence of undocumented splice variants be verified?
For starters, little information can be gained from the gene expression profile data itself (gene/isoform quantification), and while many gene profiles from gene expression arrays and RNA-seq are available as open access, the information they provide in this context is only partial. One of the more direct methods is manually assembling transcript products from a specific gene. Manual mapping is a low-throughput method but has the highest probability of identifying all transcript products from an individual gene. In addition, validation and cloning of any alternately spliced forms identified can be undertaken with RNA isolation, gene-specific amplification of transcripts, and RNA-seq together with shotgun cloning and DNA-seq, giving a representation of all transcripts present that are related to a particular gene. Of course, this works only when the expression of a handful of genes is of interest [38]. When it comes to identifying novel splice variants from RNA-seq on a larger scale, the past several years have seen a vast increase in the number of algorithms to detect novel alternately spliced transcripts. In most cases, these algorithms bypass the direct use of annotated transcript databases such as ENSEMBL or NCBI's RefSeq for transcript identification but may still use these annotated sequences as a scaffold for assembling reads or simply for comparison [39,40].
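The core idea behind annotation-independent detection can be sketched in a few lines of Python: splice junctions supported by reads are compared against the junctions implied by annotated transcripts, and supported junctions that are absent from the annotation are flagged for follow-up. This is a simplified illustration and not the method of any particular published tool; the coordinates, read counts, transcript name, function names, and the `min_reads` threshold are all hypothetical.

```python
def junctions(exons):
    """Return the (donor, acceptor) pairs implied by a transcript.

    `exons` is a list of (start, end) tuples sorted by position.
    """
    return {(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)}

def novel_junctions(annotated_transcripts, observed_counts, min_reads=3):
    """Flag read-supported junctions that no annotated transcript contains."""
    known = set()
    for exons in annotated_transcripts.values():
        known |= junctions(exons)
    return {j: n for j, n in observed_counts.items()
            if j not in known and n >= min_reads}

# Hypothetical annotation: one three-exon transcript
annotated = {"tx1": [(100, 200), (300, 400), (500, 600)]}

# Hypothetical junction read counts taken from spliced alignments
observed = {
    (200, 300): 42,  # annotated junction
    (400, 500): 40,  # annotated junction
    (200, 500): 7,   # skips the middle exon; not annotated
    (200, 310): 1,   # not annotated, but below the read threshold
}

print(novel_junctions(annotated, observed))
```

A pipeline that only quantifies reads against annotated transcripts would discard or misassign the reads spanning the unannotated junction, which is the loss of information described above.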

RNA Editing and RNA Editing Enzymes
RNA editing is an important post-transcriptional mechanism occurring in a wide range of organisms, which alters the primary RNA sequence through the insertion/deletion or modification of specific nucleotides. The most common modification in humans is deamination, leading to both a biochemical and functional change in the nucleotide [41]. In RNA, two forms of nucleotide deamination are observed and involve either adenosine or cytidine residues. Deamination of adenosine at the C6 position results in the biochemical conversion of adenosine to inosine. As inosine is interpreted by the cellular machinery as guanosine, deamination also results in the functional equivalent of an adenosine to guanosine conversion in the affected transcript. Similarly, deamination of cytidine at the C4 position results in the biochemical and functional conversion of cytidine to uridine (Figure 3).
Figure 3 (caption, partial). (B) The deamination of cytidine to uridine is carried out by members of the AID/APOBEC family, functioning as monomers, homodimers, heterodimers, and tetramers depending on the family member involved. The molecular weight (designated main isoform), major structural domains, and the number of validated mRNA transcripts/protein isoforms produced from the respective genes are presented.
Deamination of adenosine and cytidine residues in RNA is carried out by dedicated and specific enzymes classified according to the activity they catalyze as either adenosine deaminases or cytidine deaminases. The adenosine deaminases consist of the adenosine deaminase acting on double-stranded RNA (ADAR) family and the adenosine deaminase acting on tRNA (ADAT) family, while the cytidine deaminases belong to the activation-induced cytidine deaminase/apolipoprotein B mRNA editing enzyme, catalytic polypeptide (AID/APOBEC) family (Figure 3, Table 3).
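Because inosine is read as guanosine and uridine is reported as thymidine in cDNA, both forms of deamination surface in sequencing data as specific mismatches against the reference genome: A-to-G for adenosine deamination and C-to-T for cytidine deamination. The Python sketch below illustrates this classification, including the strand correction needed for transcripts encoded on the minus strand. It is a minimal illustration with hypothetical names; a mismatch of this type is only a candidate, since genomic variants and sequencing errors produce the same signal.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def editing_candidate(ref_base, read_base, strand="+"):
    """Classify a reference-vs-cDNA mismatch as a candidate deamination event.

    Bases are given on the genomic plus strand; for minus-strand transcripts
    they are complemented so the comparison is made in transcript orientation.
    """
    if strand == "-":
        ref_base, read_base = COMPLEMENT[ref_base], COMPLEMENT[read_base]
    if (ref_base, read_base) == ("A", "G"):
        return "A-to-I candidate"
    if (ref_base, read_base) == ("C", "T"):
        return "C-to-U candidate"
    return None

print(editing_candidate("A", "G"))       # plus-strand transcript
print(editing_candidate("T", "C", "-"))  # same event on a minus-strand transcript
print(editing_candidate("C", "T"))
print(editing_candidate("A", "T"))       # not a deamination signature
```

Without matched genomic data or strand information, such mismatches are easily filtered out as noise or miscalled as variants, which is one way edited transcripts can disappear from a standard analysis.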
Table 3 (excerpt):
Function/Role: Demonstrates C-to-U editing activity. Selectively targets ssDNAs of foreign origin (viral). Antiviral activity occurs through both deaminase-dependent and -independent mechanisms. Can homodimerize or form heterodimers with APOBEC3F and APOBEC3G. Expressed in peripheral blood mononuclear cells, bone marrow, lymphoid tissues, gastrointestinal tract, and female reproductive system.
Localization: Cyto, P-bodies. Transcripts/isoforms: 2 trscpts/2 isoforms. PTMs: 4 (uc). Function/Role: Demonstrates C-to-U editing activity. Selectively targets ssDNAs of foreign origin (viral). Exhibits antiviral activity toward wide spectrum of viruses through both deaminase-dependent and -independent mechanisms. Has not been shown to target nuclear or mitochondrial DNA. May play a role in epigenetic regulation of gene expression. Widely expressed with highest expression observed in ovary. Interacts with APOBEC3G. Expression is induced by interferon.
Localization: Nuc, Cyto, P-bodies. Transcripts/isoforms: 1 trscpt/2 isoforms #. PTMs: 11 (ssc). Function/Role: Demonstrates C-to-U editing activity. Selectively targets ssDNAs of foreign origin (viral). Antiviral activity occurs through both deaminase-dependent and -independent mechanisms. Expressed in peripheral blood mononuclear cells, bone marrow, lymphoid tissue, lungs, testis, ovary, and skin.
Table 3 notes: [12,42]. Expression of all the proteins unless otherwise stated under "Function/Role" is ubiquitous.
1 For substrate nucleic acid, (h)=human; (v)=viral.
2 Alternate transcripts (protein coding only) and protein isoforms are defined as those currently verified and annotated in ENSEMBL, UniProt/SwissProt, and NCBI databases. In some cases, (#) verified transcripts have no corresponding verified protein annotated and vice versa.
3 Post-translational modifications (PTMs) were retrieved from PhosphositePlus (https://www.phosphosite.org accessed on 18 May 2023) and include acetylations, caspase cleavage (alteration of activity), methylations, phosphorylations, succinylations, sumoylations, and ubiquitinations. The level to which these PTMs have been studied is defined in parentheses next to the number of PTMs identified for each protein: uc-uncharacterized; muc-mostly uncharacterized; ssc-some sites characterized; wc-sites well characterized.
Beyond modifying RNA secondary structure and protein binding and changing protein coding, adenosine deamination is also known to alter splice donor-acceptor site selection by modifying key adenosine residues near the acceptor site, thus promoting alternative splicing of transcripts. In non-coding RNA, such as miRNAs, deamination can result in alteration of the seed region, thereby changing the target specificity of miRNAs. Similarly, cytidine deamination alters transcript coding, structure, and stability and may alter miRNAs, although generally at a much lower frequency compared to adenosine deamination. Therefore, the enzymes that carry out these activities have the potential to drastically enhance transcript variability on a small or large scale, depending on the tissue, process involved, and conditions, as well as promote disease (Figure 1) [42,43]. Moreover, the enzymes that principally carry out these modifications are also associated with genomic stability, DNA repair, and modification of nucleotides in genome-associated ssRNA/ssDNA, which potentially alter the genome [11,12,44,45,46,47].
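The effect of seed-region editing on miRNA targeting can be illustrated with a short Python sketch. It takes the seed as nucleotides 2-8 of the miRNA, treats inosine as pairing with cytidine, and derives the perfectly complementary mRNA site before and after a single A-to-I edit. The miRNA sequence and function names are hypothetical, and real target recognition depends on more than perfect seed complementarity, so this is only a minimal illustration.

```python
# Watson-Crick partners, with inosine (I) treated as pairing with cytidine
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G", "I": "C"}

def seed(mirna):
    """Return the seed region, taken here as nucleotides 2-8 (1-based)."""
    return mirna[1:8]

def target_site(seed_seq):
    """Return the mRNA site (5'->3') perfectly complementary to the seed."""
    return "".join(PAIRS[base] for base in reversed(seed_seq))

def edit_a_to_i(mirna, position):
    """Deaminate the adenosine at a 1-based position to inosine."""
    if mirna[position - 1] != "A":
        raise ValueError("no adenosine at this position")
    return mirna[:position - 1] + "I" + mirna[position:]

mirna = "UGACAUCGGUAACGUUAGCUGA"  # hypothetical miRNA sequence
edited = edit_a_to_i(mirna, 3)    # edit the adenosine at seed position 3

print(target_site(seed(mirna)))   # CGAUGUC
print(target_site(seed(edited)))  # CGAUGCC
```

A single deamination changes the preferred complementary site at one position, which is sufficient in this toy model to redirect the miRNA toward a different set of sequence matches.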

Adenosine Deaminases
The majority of RNA editing in mammalian cells is carried out by the ADAR family of enzymes. This family consists of three independently encoded proteins, ADAR1 (ADAR; 1q21.3), ADAR2 (ADARB1; 21q22.3), and ADAR3 (ADARB2; 10p15.3). These enzymes are composed of a series of tandemly arranged double-strand (ds) RNA binding domains (dsRBDs; 3 in ADAR1 and 2 in both ADAR2 and ADAR3) upstream of the catalytic domain. In addition, full-length ADAR1 also contains two Z-α domains in the amino terminus, which allow ADAR1 to associate with nucleic acids possessing a left-handed helical structure (Z-RNA/Z-DNA) (Figure 3A) [48]. ADAR1 and ADAR2 are ubiquitously expressed, functional deaminases, while ADAR3 is catalytically inactive with expression mostly restricted to the brain and neural tissue [42,49]. These enzymes act on RNAs as either homo- or heterodimers, and the diverse dimeric combinations are believed to influence substrate selection and modification. In addition, each of these genes produces a number of protein coding isoforms through alternative splicing (ADAR1 (5 isoforms), ADAR2 (6 isoforms), and ADAR3 (2 isoforms)), or through the use of alternative transcriptional start elements that alter exon 1 (ADAR1 (2 isoforms)) [50][51][52]. Of these, only a handful have been fully characterized. The two prominent forms of ADAR1 expressed in the cell, ADAR1p110 and ADAR1p150, are produced through the use of alternative promoter/transcriptional start sites and alternative splicing. Transcriptional initiation at promoters 2 and 3 results in the inclusion of an exon 1 (1B and 1C) devoid of an AUG start codon so that translation of the final mRNA product initiates in exon 2, thus producing ADAR1p110. In contrast, transcriptional initiation at promoter 1, which is interferon-responsive, results in the inclusion of an exon 1 that contains an AUG start codon, thus encoding a protein containing an additional 295 amino acids, ADAR1p150 [52,53].
As ADAR1p110 primarily localizes to the nucleus and ADAR1p150 primarily localizes to the cytoplasm, the RNA substrates that these isoforms preferentially modify likely differ. Moreover, of the reported isoforms of ADAR1 generated by alternative splicing, many demonstrate spatial changes in the structural distribution between dsRBDs or between dsRBD 3 and the catalytic domain, suggesting that these isoforms may have different substrates than those thus far identified for ADAR1p150 and p110, further enhancing genetic variability.
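The way alternative first exons select the translation start, and therefore the protein isoform, can be illustrated with a toy Python sketch that locates the first AUG in a transcript assembled from named exons. The sequences are short hypothetical placeholders, not ADAR1 sequence; the exon label "exon1A" for the AUG-containing first exon is an assumption made for the example; and the sketch considers only the position of the first AUG, ignoring reading frame and sequence context.

```python
def first_aug(exons):
    """Return (exon_name, position) of the first AUG in the joined transcript.

    `exons` is an ordered list of (name, sequence) pairs; the position is
    0-based within the joined mRNA. Returns None if no AUG is present.
    """
    mrna = "".join(seq for _, seq in exons)
    pos = mrna.find("AUG")
    if pos == -1:
        return None
    offset = 0
    for name, seq in exons:
        if pos < offset + len(seq):
            return name, pos
        offset += len(seq)

# Hypothetical toy sequences
exon1a = ("exon1A", "GGCAUGGCU")  # first exon containing an AUG
exon1b = ("exon1B", "GGCCUCGCU")  # first exon devoid of an AUG
exon2 = ("exon2", "CCAAUGGAG")

print(first_aug([exon1a, exon2]))  # start codon falls in the first exon
print(first_aug([exon1b, exon2]))  # start codon falls in exon 2
```

The same downstream exons thus yield a longer or shorter open reading frame depending solely on which first exon the promoter choice includes, mirroring the p150/p110 relationship described above.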
Similar to ADAR1, two main forms of ADAR2 have been characterized, ADAR2 long (ADAR2) and ADAR2 short (ADAR2-S). ADAR2-S results from the alternative splicing of an Alu element, producing an ADAR2 isoform that is missing amino acids 466-505. The lack of these 40 amino acids in ADAR2-S results in an enzyme with elevated catalytic activity. In contrast, only one form of ADAR3 has been significantly characterized [50,51,54,55,56,57].
Beyond alternative splicing and homo-/heterodimerization, these editases are regulated by a number of post-translational modifications. In ADAR1p150, over 120 PTMs have been identified, mostly through high-throughput methods like mass spectrometry, but only a few of these PTMs have been characterized for their effect on ADAR1 activity [31]. Sumoylation of K418 inhibits editase activity, while ubiquitination of K574 and K576 promotes proteasome-mediated degradation of ADAR1 [58,59]. Phosphorylation of T808, T811, S823, and S825 through stress activation of the MKK6-p38-MSK MAP kinase pathway promotes the nuclear export of ADAR1p110, while phosphorylation of T1033 alters editase activity toward specific substrates [60,61]. Phosphorylation of T1033 is rather interesting, as it suggests that PTMs may not be all or nothing (activating editase activity or inhibiting editase activity) but may alter editase activity toward specific targets, changing the repertoire of edited transcripts [42,61]. For ADAR2, 23 PTMs have been identified, again mostly with high-throughput methods [31]. Like ADAR1, the effect that most of these PTMs have on ADAR2 is unknown. To date, only three of these PTMs have been characterized. Phosphorylation of S211 and S216 in the spacer region between dsRBD2 and the editase domain by PKCζ results in enhanced editing activity, while, similar to phosphorylation of T1033 in ADAR1, phosphorylation of the homologous site in ADAR2, T553, alters editase activity toward specific substrates [61,62]. Hence, other than alternate isoforms of these enzymes, which likely display a differing substrate preference and activity, regulation by PTMs can also influence the enzymatic complexes in which these editases participate, their substrate preference, and their activity, thus changing the panorama of transcripts altered.
Unlike the ADAR family, the ADAT proteins (ADAT1, ADAT2, and ADAT3) carry out a minor portion of the RNA editing in the cell. As their name indicates, editing catalyzed by these enzymes is specific to tRNA; their influence is therefore largely confined to translation and has no significant impact on a DNA/RNA-seq analysis, so they will not be discussed further here [63].
Although the first members of the AID/APOBEC family were identified for their RNA editing ability, not all members of this family edit RNA. APOBEC proteins function as monomers, homodimers, heterodimers, and tetramers depending on the family member (Table 3) [11]. The first family member identified, APOBEC1 (12p13.31), was discovered for its ability to catalyze the cytidine deamination of apolipoprotein B (APOB) mRNA and generate a stop codon at amino acid 2180, thus generating a truncated form of APOB, APOB48 [69]. In normal human tissues, APOBEC1 expression is limited to the small intestine, where it is known to edit only APOB, while in mice, numerous APOBEC1-edited mRNAs have been identified along with a variety of cofactors that dictate mRNA substrate specificity. A significant amount of data from mice has suggested the importance of APOBEC1-dependent editing in immunity [11,12,70,71]. Interestingly, APOBEC1 expression and editing are enhanced in several human cancers, but this association appears to be more related to the ability of APOBEC1 to deaminate DNA rather than RNA [11,12,72-74].
Mostly expressed in lymphoid tissues and B-lymphocytes, AID (12p13.31) is a ssDNA deaminase that is responsible for Ig heavy chain class switching and somatic hypermutation of variable regions, promoting antibody maturation [75]. Localization of AID is predominantly cytosolic, with nuclear localization dependent on germinal center-associated nuclear protein (GANP) and β-catenin-like protein 1 (CTNNBL1) [76,77]. As with many other APOBEC family members, off-target editing of the genome by AID has been reported [78,79]. Although AID binds both single-stranded (ss) DNA and ssRNA, it is only catalytically active toward ssDNA.
In mice, APOBEC3 is encoded by a single gene, which in humans has diverged and expanded into seven related genes in a cluster located on chromosome 22 (22q13.1). APOBEC3 proteins are widely expressed in tissues. Expression is most notable in lymphoid tissue and peripheral blood mononuclear cells as well as reproductive organs, suggestive of a significant role in innate immune signaling. While certain APOBEC3 members (3A and 3G) can edit ssRNAs, the primary purpose of these editases is the modification of ssDNAs arising during viral infection [11,12]. In addition to editing viral ssRNAs (3A and 3B) and ssDNAs (3A, 3B, 3C, 3D, 3F, 3G, and 3H), APOBEC3 members have also been observed to modify host cell ssRNAs (3A and 3G) and ss genomic DNAs (3A and 3B) at sites of DNA damage repair [11,80-84]. Enhanced activity of these enzymes has been reported to promote DNA damage and genomic instability.
In contrast to other APOBECs, the expression of APOBEC2 (6p21.1) is restricted to cardiac and skeletal muscle. APOBEC2 does not demonstrate any deaminase activity toward DNA or RNA yet strongly binds to DNA at specific promoters and is suspected of being a transcriptional repressor that regulates myoblast differentiation and myogenic stem satellite cell self-renewal [85,86]. Finally, the expression of APOBEC4 (1q25.3) is restricted to the testis, where it appears to have a role in epigenetic remodeling of promoter regions [87,88]. An interesting aside is that several of these cytidine deaminases (3A, 3B, 3C, 3F, and 3G), as well as ADAR1p150, are inducible by interferon; thus, activation of the innate immune system and the integrated stress response (ISR) promotes their expression, thereby enhancing editing of transcripts and potentially of genomic DNA. Similar to the ADAR proteins, a number of splice variants have been identified (and likely many others remain unidentified) for various AID/APOBEC family members (Table 3). As with the ADAR proteins, these alternately spliced forms are expected to demonstrate altered activity/substrate specificity in comparison to the canonical form, adding further complexity to how these RNA editases may diversify the transcriptome and/or alter the transcriptome-to-genome homology.

Consequences of RNA Editing and Measures to Address the Issue
RNA editing presents many of the same issues that alternative splicing presents, but with the potential for modifications that are much more difficult to clearly identify. Both WGS and single-cell techniques can remove a significant amount of potential variability with regard to editing. Our current understanding of the ADAR family is that adenosine deaminations catalyzed by ADAR1 and ADAR2 are restricted to RNA, although the association of ADAR1p110 and ADAR2 with DNA repair complexes and their involvement in the DNA repair process should at least be mentioned for consideration [44-47]. In contrast, APOBEC family members target genomic DNA and RNA; thus, transcript changes visualized in a complex multicellular analysis may result from genomic or transcriptomic changes and would require WGS to rule out any germline changes. A single-cell analysis would reduce this variability and allow the origin of changes (genome or mRNA) to be determined based on the number of reads containing a particular edit. Single nucleotide changes induced by RNA editing can influence alternative splicing but can also cause point modifications/mutations in the encoded protein. Whereas a splicing or alternative splicing event is unlikely to be an artifact, the same cannot be said of point mutations. Again, the frequency of edits becomes highly important, as artifacts in the nucleotide sequence are common. As with SNPs, the frequency of point changes must be determined to distinguish an artifact from reality. Most pipelines include read alignment, and thus, individual counts of edits are normally analyzed or can be easily analyzed from the raw data. While methods to determine the identity of novel splice variants resulting from RNA editing are no different from those discussed above regarding splicing factors, specifically determining a role for RNA editing in the alternative splicing of particular transcripts is not a direct process.
Sites of deamination influencing alternative splicing are often located in introns and are lost from the final product during the splicing event; they are thus difficult to define unless caught in edited pre-mRNA. In contrast, those affecting acceptor/donor sites are usually quickly identifiable from the sequence but still require a level of manual surveillance to catch. To date, the best approach to defining alternatively spliced transcripts related to RNA editing is first to identify the novel transcript and any noted sequence differences, and then to individually analyze the presence of known RNA editing sites in the gene giving rise to the suspected novel transcript, using the A-to-I REDIportal database [89]. Currently, a comprehensive database for C-to-U editing does not exist. Moreover, when a high percentage to a complete switch of a particular splice variant is observed, DNA editing by APOBEC family members should also be considered, as changes in the genomic sequence also have the potential to influence downstream splicing, with particular alternative events occurring at a much higher frequency.
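The frequency-based reasoning above can be illustrated with a minimal Python sketch. It assumes per-site read counts have already been extracted from an alignment, that the reference base has been oriented to the transcribed strand, and that a set of known A-to-I positions has been exported from a resource such as REDIportal. The class name, function names, and both thresholds are hypothetical and chosen only for illustration; they are not taken from any published pipeline.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-offs; suitable values depend on coverage and the error profile.
MIN_DEPTH = 10
MIN_EDIT_FRACTION = 0.05


@dataclass(frozen=True)
class CandidateSite:
    chrom: str
    pos: int          # 1-based genomic coordinate
    ref: str          # reference base, assumed oriented to the transcribed strand
    alt: str          # mismatching base observed in RNA-seq reads
    alt_reads: int    # reads carrying the mismatch
    total_reads: int  # all reads covering the position


def edit_fraction(site: CandidateSite) -> float:
    """Fraction of covering reads that carry the mismatch."""
    return site.alt_reads / site.total_reads if site.total_reads else 0.0


def deamination_type(site: CandidateSite) -> Optional[str]:
    """Inosine is read as G and uridine as T, so edits appear as A>G or C>T."""
    if (site.ref, site.alt) == ("A", "G"):
        return "A-to-I"
    if (site.ref, site.alt) == ("C", "T"):
        return "C-to-U"
    return None


def classify(site: CandidateSite, known_a_to_i: set) -> str:
    """Label a mismatch using coverage, frequency, and a known-site lookup."""
    if site.total_reads < MIN_DEPTH:
        return "insufficient coverage"
    if edit_fraction(site) < MIN_EDIT_FRACTION:
        return "possible artifact"
    kind = deamination_type(site)
    if kind is None:
        return "not a canonical deamination"
    if kind == "A-to-I" and (site.chrom, site.pos) in known_a_to_i:
        return "known A-to-I site"
    return f"candidate {kind} site"


if __name__ == "__main__":
    known = {("chr1", 100)}  # toy stand-in for an exported list of known sites
    for s in (
        CandidateSite("chr1", 100, "A", "G", 30, 100),
        CandidateSite("chr2", 5, "C", "T", 1, 200),
    ):
        print(s.chrom, s.pos, round(edit_fraction(s), 3), classify(s, known))
```

Because no comprehensive C-to-U database is available, the sketch can only label C>T mismatches as candidates. In practice, any call would also need to be checked against genomic sequence before being attributed to RNA editing.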

Conclusions
The increasing use of high-throughput technologies for the diagnosis, prognosis, and assessment of therapeutic responses of patients often overlooks important caveats: the transcriptome does not necessarily reflect the genome, nor will all transcript reads be correctly identified. Early studies using genome arrays based on short probe binding missed many alternatively spliced transcripts, and for the transcripts they did recognize, the assay format was not adequate to distinguish the presence or absence of most alternatively spliced transcript isoforms, nor was it adequate for catching post-transcriptional edits of the transcripts. Using this approach, multiple transcript isoforms may bind to the same oligonucleotide probe; conversely, some expressed alternatively spliced transcripts may exclude the sequences necessary to bind to a gene's specific oligonucleotide probe, and thus, their expression fails to be identified. The re-use of most older genome array datasets requires updating/re-aligning the oligonucleotide probes to the genome prior to re-analysis, as well as a focused analysis of oligonucleotide probes that demonstrate no significant homology to the genome but show significant change during the assay.
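The probe ambiguity described above can be made concrete with a toy Python sketch. The exon sequences, isoform names, and probe names below are invented for illustration, and exact substring matching is a deliberate simplification of hybridization, which tolerates mismatches.

```python
# Toy exon sequences (hypothetical); real probes and transcripts are far longer.
EXON1, EXON2, EXON3 = "ATGGCC", "TTTAAACCC", "GGGTGA"

ISOFORMS = {
    "full_length": EXON1 + EXON2 + EXON3,
    "exon2_skipped": EXON1 + EXON3,
}

PROBES = {
    "probe_exon1": EXON1,     # lies in an exon shared by both isoforms
    "probe_exon2": "TTAAAC",  # lies in the alternatively spliced exon
}


def probe_hits(probes, isoforms):
    """Map each probe to the sorted list of isoforms containing its sequence."""
    return {
        name: sorted(iso for iso, seq in isoforms.items() if probe_seq in seq)
        for name, probe_seq in probes.items()
    }


def ambiguous_probes(hits):
    """Probes that cannot distinguish isoforms because several match."""
    return {name: isos for name, isos in hits.items() if len(isos) > 1}


def undetected_isoforms(hits, isoforms):
    """Isoforms matched by no probe, whose expression would go unreported."""
    detected = {iso for isos in hits.values() for iso in isos}
    return sorted(set(isoforms) - detected)


if __name__ == "__main__":
    hits = probe_hits(PROBES, ISOFORMS)
    print("hits:", hits)
    print("ambiguous:", ambiguous_probes(hits))
    exon2_only = probe_hits({"probe_exon2": PROBES["probe_exon2"]}, ISOFORMS)
    print("undetected:", undetected_isoforms(exon2_only, ISOFORMS))
```

A probe in the shared exon reports both isoforms as a single signal, while an array carrying only the exon 2 probe would miss the exon-skipped isoform entirely, which are the two failure modes noted in the text.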
With the advancement of RNA-seq technology that provides longer reads, many of these points are now resolvable but require significant attention on the part of those conducting the analysis. A streamlined RNA-seq analysis often uses database matching with annotated transcripts reported by the NCBI or ENSEMBL, but unique transcripts in an analysis that have yet to be annotated are not assigned and are "discarded". Alternative splicing resulting from the altered expression/regulation of splicing-associated proteins or RNA editing enzymes can have a significant effect on this aspect. Since many splicing factors and RNA/DNA editing enzymes are affected by and regulate the innate immune/inflammatory response, the effects elicited by alternative splicing factors and RNA editing enzymes become even more relevant under inflammatory/cell stress conditions, where innate immune signaling and activation of the integrated stress response become paramount. In addition, high-throughput techniques are not designed to analyze the potential downstream effects that alternative splicing or single-nucleotide editing of the RNA have on the function of specific proteins. Single nucleotide changes observed in RNA-seq need to be defined and compared to WGS/WES to differentiate RNA or DNA editing from polymorphisms (both ADAR and AID/APOBEC family members can promote DNA modifications, which may have a role in adaptation/evolution as well as disease) [14]. Thus, in most cases, a single dataset represents a partial analysis that paints less than half the picture. Current advancements in RNA-seq bioinformatics will hopefully close this gap in information by using predicted homologies to assign transcripts currently relegated to experimental "garbage", turning them into meaningful data [39,40].
The next big step will be linking the analysis of the reads with the predicted protein structure and function and the potential impacts of edited sites on protein function and/or targeting of miRNA, finally giving a much clearer picture of what is happening in the cell.
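As a minimal illustration of the comparison of RNA-seq changes against WGS/WES discussed above, the following Python sketch assigns a provisional origin to each single nucleotide change seen in RNA. It assumes variant calls are already available as sets of (chromosome, position, reference base, alternate base) tuples, and that a matched germline call set exists; the function name, the labels, and the availability of matched germline data are all assumptions made for illustration.

```python
def assign_origin(rna_variants, sample_dna_variants, germline_variants=frozenset()):
    """Give each RNA-seq variant a provisional origin from DNA evidence.

    Each variant is a (chrom, pos, ref, alt) tuple. The three-way split is an
    illustrative simplification, not a validated classification scheme.
    """
    origins = {}
    for variant in rna_variants:
        if variant in germline_variants:
            origins[variant] = "germline polymorphism"
        elif variant in sample_dna_variants:
            origins[variant] = "DNA-level change (mutation or DNA editing)"
        else:
            origins[variant] = "RNA-only (candidate RNA editing or artifact)"
    return origins


if __name__ == "__main__":
    rna = {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T"), ("chr3", 7, "C", "T")}
    sample_dna = {("chr2", 50, "C", "T"), ("chr3", 7, "C", "T")}
    germline = {("chr3", 7, "C", "T")}
    for variant, origin in sorted(assign_origin(rna, sample_dna, germline).items()):
        print(variant, "->", origin)
```

Changes found only in RNA remain candidates rather than confirmed edits, since sequencing artifacts fall into the same category until their read frequency is assessed.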

Conflicts of Interest:
The authors declare no conflict of interest.