4.1. 5′UTR Length
Among the ABCA1 gene transcripts from the 55 vertebrate species evaluated, longer whole 5′UTR regions, including the before-Intron-1 and after-Intron-1 sections as well as the whole Intron 1, were seen in primates and their closest relatives, whereas in more evolutionarily distant vertebrates all of these 5′UTR parts were shorter. In contrast, the length of the after-Intron-1 section appears to have been conserved across vertebrate evolution, since it did not vary significantly among the vertebrate groups studied.
The first reports comparing the average length of whole UTR regions at the whole-genome level included species from very distant taxa, with a limited number of vertebrates, and evaluated hundreds of sequenced 5′UTRs [30,31,32]. These studies focused mainly on the length of 3′UTRs, which are strikingly longer than 5′UTRs on average and show greater variability between lineages. The increase in the average length of 3′UTRs with evolutionary distance, prominent during the evolution of vertebrates, was discussed, and a possible link to the increase in organismal complexity was proposed [30,31,32]. Based on the comparison of average 5′UTR lengths, it was concluded that the length of 5′UTRs stays relatively constant during evolution [30,31,32].
Lynch and colleagues (2003, 2005) [33,34] introduced the idea that the features of 5′UTRs, including their length, are in most eukaryotes largely dictated by random genetic drift and by mutational processes that cause stochastic turnover of transcription start sites (TSSs) and premature start codons, in relation to the reduced effective population sizes of eukaryotes compared with prokaryotes. In eukaryotes, lengthy 5′UTRs impose a mutational burden on their associated alleles by increasing the rate at which premature start codons are acquired. Natural selection would therefore act against long 5′UTRs because of this increased rate, and would thus influence the lengths of UTRs only indirectly [34]. However, the idea that the variability in genome size, or in any of its components, can be neatly explained by a single factor such as population size was questioned by several researchers on both theoretical and methodological grounds [35,36,37]. Vinogradov and Anatskaya [38] reported that human 5′UTRs are on average 37% longer than the corresponding mouse ones and that this difference remained strongly significant after correction for the 16% difference in genome size. According to Vinogradov and Anatskaya [38], this suggests that the difference in 5′UTR lengths cannot be due to mutation pressure alone (if one assumes that the difference in genome sizes was caused by neutral drift) and should have functional significance. Compared with mouse, the longer 5′UTRs of human mRNAs and their 30% greater number of uAUGs, which suggest a more complexly regulated translation, are in good agreement with the higher expression rates of the human translation machinery [38]. The dataset from Lynch and Conery’s work [33] was revisited from a phylogenetic perspective by Whitney and Garland [39]. In parallel, the correlation between 5′UTR length and organismal complexity, measured as the number of cell types, was not confirmed in the work of Chen et al. [40]; however, data from only 17 species (9 vertebrates, 5 invertebrates, 2 plants, and a yeast) were evaluated in that study. The results of Chen et al. also raise the questions of how organismal complexity can be measured and whether, e.g., the number of cell types is a suitable proxy. Furthermore, Chen et al. [41] demonstrated that genomic features other than uAUGs, particularly upstream stop codons (uSTOPs) and G+C content, play an important role in the evolution of 5′UTR length. Considering that uAUGs and uSTOPs together can form uORFs, the observation of Chen et al. seems to imply that the major target of selection in 5′UTRs is likely uORFs rather than uAUGs per se.
In a study focused on the impact on gene expression of exon-exon junctions (left after intron splicing) and uORFs in 5′UTRs, Lim and colleagues [42] described the conservation of these features between human and mouse. Our conclusions about the prominent conservation of the after-Intron-1 sections are in line with these recently published results.
4.2. Intron 1—Length and Position
Intron 1 was annotated within the 5′UTRs in 45 species of our dataset and in 10 species there was no 5′UTR intron. Because Intron 1 was described in the representatives of the most ancient groups evaluated, spotted gar and coelacanth, the most parsimonious conclusion is that it was lost in those 10 vertebrate species independently during the evolution of ABCA1 genes. In general, the length of Intron 1 was the longest in primates and shortest in ray-finned fishes; however, several atypically long (e.g., in coelacanth and anole lizard) or short (e.g., in chicken and cow) versions were noticed. Importantly, the position of Intron 1 has been more conserved in relation to the start codon of the main open reading frame (sATG, mORF) than to the TSS.
Studies evaluating 5′UTRs at the whole-genome level have revealed that approx. 25% of metazoan 5′UTRs carry introns, with lower frequencies in plants (14%) and fungi (5%) [31,43]. Data from these studies also suggest the existence of a strong barrier against carrying more than one intron within a 5′UTR in all taxa. Early analyses of the genomes of eukaryotic model organisms also indicated a positive correlation between genome size and intron size; however, because of the low level of significance, the authors suggested that other factors may be involved [44,45]. Because UTRs are under a less stringent substitutional constraint than protein-coding sequences (CDSs), they show a higher frequency of insertions and deletions and greater length heterogeneity [46,47]. As a consequence, introns in UTRs may experience less stabilizing selection for some traits than introns in CDSs; the more dynamic nature of UTRs may thus promote intron loss and result in lower intron abundance in comparison with CDSs [43]. Hong and colleagues [43] suggested a mechanism by which selection on intron size can differ between 5′UTRs and CDSs, even over very short distances, driven by the potentially deleterious effects of uAUGs within 5′UTR exons. They proposed the existence of: (1) selection against intron contraction, due to the potential introduction of uAUGs residing in 5′UTR introns at nearly neutral proportions; and (2) selection for intron expansion, due to the beneficial effects of both removing uAUGs from 5′UTR exons and preventing the appearance of new uAUGs by reducing the total 5′UTR exon length.
Another view was given by Vinogradov [48], who hypothesized that longer introns (as well as intergenic regions) preferentially occur in tissue-specific genes because they allow chromatin-mediated gene suppression and complex regulation. The data reported by Pozzoli et al. [49] supported and extended this view. In particular, their unifying model proposed that regulatory needs, accounted for by multi-species conserved sequences (MCSs), shape intron size and tend to be stronger in genes that are not highly expressed. They also showed that, when MCS content was fixed, no variation of size with expression level was observed for introns (and intergenic spacers) in genes expressed at a medium-to-low level. They therefore proposed that the fixation of functional conserved elements is the adaptive event underlying size increase [49].
We have shown that the increase in length of the whole region spanning the ABCA1 5′UTRs studied can be attributed mainly to the striking increase in Intron 1 length. Similar observations at the gene level were published by LaConte and Mukherjee [50] in their work describing the major constraints that have shaped the evolution of CASK (Ca²⁺/calmodulin-activated serine kinase) genes. They showed that despite the tremendous increase in the size of the CASK gene over the course of animal evolution, the changes in the number of introns have been minimal, and most of the gene size increase can be attributed to an increase in the size of the introns.
4.3. Upstream ORF—Position, Function, and Evolution
Within the ABCA1 5′UTRs we defined a 15-nt-long sequence consisting of the most conserved nucleotides based on alignment analyses. This sequence was almost always located at the start of the after-Intron-1 section. In nearly all of the 5′UTRs evaluated, exactly one uATG was found; the few exceptions, with none or more than one, occurred mainly in ray-finned fishes. This out-of-frame uATG was assigned to the most conserved 15-nt-long sequence. Several bioinformatics tools predicted a uORF starting from this uATG. The length of the uORF, and of the peptide possibly encoded by this regulatory feature, varies from 30 nts, i.e., 6 aa, in coelacanth, through 51 nts, i.e., 16 aa, in anole lizard and 78 nts, i.e., 25 aa, in platypus, to 111 nts, i.e., 36 aa, in primates, where the uORF starts to overlap with the mORF. A highly conserved in-frame uTGA stop codon at position −9 to −7 from the sATG was described within the 5′UTRs from primates to ray-finned fishes. Moreover, we showed striking differences in the distribution of upstream non-ATG start and stop codons within the 5′UTR sections studied.
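The uORF features described above (position of the uATG relative to the sATG, reading frame, length in nts and aa, and overlap with the mORF) can be illustrated with a short Python sketch. It is not the workflow used in this study: the function name `find_uorfs`, the record fields, and the toy sequences are hypothetical. Positions are reported as offsets with the first base upstream of the sATG at −1, and the peptide length is taken as nts/3 − 1 (stop codon excluded), which is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_uorfs(utr5: str, cds: str) -> list:
    """Describe every upstream ATG (uATG) found in a 5'UTR.

    utr5: 5'UTR sequence (DNA alphabet), ending just before the sATG.
    cds:  coding sequence, beginning with the sATG.
    """
    utr5, cds = utr5.upper(), cds.upper()
    transcript = utr5 + cds
    satg = len(utr5)  # index of the A of the main start codon
    records = []
    pos = utr5.find("ATG")
    while pos != -1:
        # Walk codon by codon from the uATG to the first in-frame stop.
        stop_end = None
        for i in range(pos, len(transcript) - 2, 3):
            if transcript[i:i + 3] in STOP_CODONS:
                stop_end = i + 3
                break
        length_nt = None if stop_end is None else stop_end - pos
        records.append({
            "uatg_pos": pos - satg,                 # negative = upstream
            "in_frame": (satg - pos) % 3 == 0,      # same frame as mORF?
            "length_nt": length_nt,                 # includes stop codon
            "length_aa": None if length_nt is None else length_nt // 3 - 1,
            "overlaps_morf": stop_end is not None and stop_end > satg,
        })
        pos = utr5.find("ATG", pos + 1)
    return records
```

For example, `find_uorfs("CATGCC", "ATGGCTGAAATAA")` reports a single out-of-frame uATG at offset −5 whose uORF runs past the sATG, i.e., a toy analogue of an overlapping uORF.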
Generally, uATGs decrease mRNA translation efficiency and may be considered negative translational regulatory factors [51]. Among five commonly studied vertebrate species including human, a median of 36% of 5′UTRs contained at least one uATG [52]. A higher incidence of uATGs (44%) was calculated several years later by Iacono et al. [53], and an even higher one (49%) by Calvo et al. [54], in human and rodent 5′UTR transcripts. Although uATGs were shown to be relatively common, the frequencies of ATG trinucleotides in 5′UTRs were significantly lower than the frequencies expected by chance, suggesting that ATGs are specifically selected against in 5′UTRs and can have important functional relevance [52,53]. Based on a comparison of orthologous genes from human, mouse, and rat as well as from different species of the yeast genus Saccharomyces, Churbanov and co-workers [55] concluded that ATG triplets in 5′UTRs are subject to the pressure of purifying selection in two opposite directions: the uATGs that have no specific function tend to be deleterious and are eliminated by natural selection, whereas those uATGs that do serve a function are conserved. Most probably, the principal role of the conserved uATGs is attenuation of translation at the initiation stage by diverting scanning ribosomes from the authentic sAUGs. This process is often additionally regulated by alternative splicing in the mammalian 5′UTRs [55]. Consistent with this hypothesis, they found that ORFs starting from conserved uATGs are significantly shorter than those starting from non-conserved uATGs. However, because they compared only sequences from relatively closely related species, they could not recognize the pattern of increasing uORF length among vertebrate species that we describe for the first time in this study for the ABCA1 orthologous genes. Matsui et al. [56] introduced the idea that uORFs are sequence elements that downregulate RNA transcripts via RNA decay mechanisms, which maintain the balance between the synthesis and degradation of RNA transcripts.
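The depletion of ATG trinucleotides relative to chance expectation, mentioned earlier in this section, can be made concrete with a simple calculation. The sketch below assumes the simplest possible null model, in which the expected ATG count is derived from the mononucleotide frequencies of the sequence itself; the cited studies may have used different null models, and the function name `atg_observed_expected` is hypothetical.

```python
from collections import Counter

def atg_observed_expected(utr5: str) -> tuple:
    """Return (observed, expected) ATG counts for one 5'UTR.

    Assumed null model: bases are independent, so the expected count is
    (number of triplet positions) * f(A) * f(T) * f(G).
    """
    seq = utr5.upper()
    n_sites = len(seq) - 2  # number of overlapping triplet positions
    if n_sites <= 0:
        raise ValueError("sequence is shorter than one triplet")
    observed = sum(seq[i:i + 3] == "ATG" for i in range(n_sites))
    counts = Counter(seq)
    total = len(seq)
    expected = (n_sites
                * (counts["A"] / total)
                * (counts["T"] / total)
                * (counts["G"] / total))
    return observed, expected
```

An observed/expected ratio well below 1 across many 5′UTRs would be consistent with selection against ATG triplets; for a single short sequence the ratio is too noisy to interpret.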
The use of alternative promoters and alternative splicing are two well-known mechanisms contributing to protein isoform complexity. Another, less explored molecular mechanism allowing several protein variants to be produced from a single mRNA is alternative translation. Based on experimental work, it has been assumed that ribosomes can recognize alternative translation initiation sites (TISs) by leaky scanning and that the additional protein isoforms can differ in functional properties [57,58]. The initiation/scanthrough ratio depends on the nucleotide context of the ATG. When the first ATG resides in a weak context, lacking both a purine (R) in position −3 and a guanine (G) in position +4, some ribosomes initiate at that point but most continue scanning and initiate farther downstream [51]. This situation may apply to the uORF defined in this study, which starts from the first ATG with a weak context, allowing the mORF, which starts from the second ATG with a strong context, to be read in most cases. Similarly, reinitiation of translation after a short uORF can be employed as a second mechanism for the selection of alternative TISs and the synthesis of alternative protein isoforms [57,58]. In addition, reinitiation occurs more efficiently at longer distances between the stop codon of the uORF and the next start ATG [59,60]. In our case, the uORF is small enough to allow reinitiation; however, the distance between the uORF stop codon and the sATG is quite short, and in primates the uORF even overlaps with the mORF. It can be speculated that reinitiation in ABCA1 genes is possible, but that an isoform different from the main one is then produced, starting from an ATG further downstream. Based on bioinformatics analyses, Kochetov et al. [61] assumed that some 5′UTR-located uORFs can function to deliver ribosomes to alternative TISs and that they should be taken into consideration in the prediction of human mRNA coding potential. Moreover, it has been shown experimentally that gene expression can be determined by a combination of leaky scanning and reinitiation events and that the system is sensitive to changing global translational conditions, e.g., the expression of the yeast transcription factor GCN4 under nutritional stress conditions [62]. uORFs overlapping with mORFs (VuORFs) were reported by Hsu and Chen [63] to participate in the regulation of condition-specific protein expression.
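The context rule used in this paragraph (a purine at position −3 and a G at position +4, with the A of the ATG numbered +1) can be expressed as a small classifier. This is a minimal sketch under stated assumptions: the function name `kozak_context` is hypothetical, and the label "intermediate" for an ATG with only one of the two key positions matching is our own choice; the text above defines only the weak case (both absent) explicitly.

```python
def kozak_context(seq: str, atg_index: int) -> str:
    """Classify the initiation context of the ATG starting at atg_index.

    Only the two key positions are checked: -3 (purine, A or G) and
    +4 (G), counting the A of the ATG as +1.
    """
    seq = seq.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # Both flanking positions must exist in the supplied sequence.
    if atg_index < 3 or atg_index + 3 >= len(seq):
        return "undetermined"
    purine_minus3 = seq[atg_index - 3] in "AG"
    g_plus4 = seq[atg_index + 3] == "G"
    if purine_minus3 and g_plus4:
        return "strong"
    if purine_minus3 or g_plus4:
        return "intermediate"  # label is an assumption, see lead-in
    return "weak"
```

Applied to a uATG and to the sATG of the same transcript, such a classifier gives a first indication of whether leaky scanning past the uATG is plausible; it does not replace experimental testing.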
As uORFs often reduce the rate of translation of the main isoform, a SNP creating or removing a uORF would be expected to have serious consequences for the level of gene expression. Verified examples of such disease-causing SNPs in humans, congenital as well as acquired, were reviewed, e.g., by Barbosa et al. [64] and Somers et al. [65]. We plan to focus on this issue in ABC genes during our ongoing next-generation sequencing experiments. Furthermore, emerging evidence about functional short peptides (sPEPs) encoded by short ORFs (sORFs), which can be located in any section of a gene, has broadened our understanding of transcriptomic and proteomic complexity [66,67]. A putative bioactive peptide of 36 aa based on the above-described uORF in the ABCA1 gene was also predicted by the uPEPperoni program.
4.4. GC Content and Motifs
The GC content of the whole 5′UTRs as well as of their subsections showed great variability between extant vertebrate subgroups and, in some cases, even between closely related species. Placental mammals showed the highest GC content. Motif discovery analyses revealed several highly conserved elements among transcriptional regulatory motifs (TFIIA, TATA, NFAT1, NFAT4, and HOXA13), exon and intron splicing enhancers (sc35, ighg2 cgamma2, ctnt, and gh-1), exon splicing silencers (fibronectin eda exon) as well as microRNA target sites (hsa-miR-4474-3p). The existence of specific base-repetition-rich subregions, which were conserved and created a characteristic pattern within the 5′UTRs, was suspected from visual screening of the sequences and from the MEME program results.
The mammalian genome is characterized by high spatial heterogeneity in base composition. The average GC content of a 100-kb fragment of the human genome can be as low as 35% or as high as 60%, a range that is twice as wide as that typically observed in teleost fishes [68,69]. Several published studies have explained the cross-species variability in GC content at the genome level, e.g., Duret et al. [70], Gu and Li [71], and Romiguier et al. [69]. Reuter et al. [37] did not see a positive relationship between GC content and 5′UTR length in genomic data from yeast, fruit fly, and human. We also did not see this correlation in our data; however, the data set tested was quite small.
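GC content as discussed here, both for a whole 5′UTR and for fixed-size windows along a longer sequence, is straightforward to compute. The following is a minimal sketch; the function names `gc_content` and `windowed_gc` are hypothetical, and ignoring ambiguous bases such as N is an assumption about how such positions should be handled.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return float("nan")  # no unambiguous bases to count
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)

def windowed_gc(seq: str, window: int, step: int) -> list:
    """GC content of consecutive windows of a fixed size.

    Windows that would run past the end of the sequence are skipped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

With `window=100_000` and an equal step, `windowed_gc` corresponds to the 100-kb fragment scale quoted above; for 5′UTRs, `gc_content` applied to the whole region or to each subsection is sufficient.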
Transcription of RNA polymerase II-dependent genes is triggered by the regulated assembly of the preinitiation complex. Preinitiation complex formation commences with the binding of transcription factor IID (TFIID), which contains the TATA-box binding protein (TBP), to the core promoter. Binding of TFIID to the core promoter is followed by the recruitment of further general transcription factors, including transcription factor IIA (TFIIA), and RNA polymerase II [72]. The finding of high conservation of TFIIA and TATA elements within ABCA1 5′UTRs is consistent with their importance for transcription initiation. The NFAT (nuclear factor of activated T cells) family of transcription factors consists of four Ca²⁺-regulated members (NFAT1–NFAT4) and one protein, NFAT5, which is activated in response to osmotic stress. In addition to their well-documented role in T lymphocytes, where they control gene expression during cell activation and differentiation, NFAT proteins are also expressed in a wide range of cell and tissue types and regulate genes involved in the cell cycle, apoptosis, angiogenesis, and metastasis. The NFAT proteins share a highly conserved DNA-binding domain, which allows all NFAT members to bind to the same DNA sequence in enhancers or promoter regions [73]. The homeobox (HOX) genes are an evolutionarily highly conserved gene family of homeodomain-containing transcription factors, which play critical roles in regulating embryonic development. HOXA13, which is a member of the Abd-B subfamily of Hox genes, is crucial for the autopod development of the limb. Increasing evidence now indicates that dysregulated expression of HOX genes, including HOXA13, in a variety of cancers is closely related to tumor development and progression [74]. The high conservation of NFAT and HOXA elements can probably be attributed to the ABCA1 transport function and suggests a connection between the ABCA1 protein and cancer [75]. We can speculate that the base-repetition-rich subregions observed in this study play more general roles in translation initiation and/or recognition of introns; they warrant further study.
4.5. Secondary Structure
In contrast to GC content, the number and distribution of hairpin loops were found to be relatively conserved across vertebrates. Notably, the largest hairpin loop structure closely followed the upstream start codon of the uORF predicted in the study.
It has been proposed that stable secondary structures can influence TIS recognition. A hairpin loop structure located downstream of the start codon could delay 40S ribosomal subunit movement into the proper position and facilitate the recognition of a TIS in a weak context [76,77]. Understanding of how secondary structure is involved is still limited, and experiments testing the influence on uORF translation of different hairpin loops located closely downstream need to be conducted.
Secondary structure of mRNA affects translation in more than one way [78]. Significant correlations between the folding free energies of 5′UTRs and various transcript features were found in genome-wide studies of yeast [79]. In particular, mRNAs with weakly folded 5′UTRs have higher translation rates, higher abundances of the corresponding proteins, longer half-lives, and higher numbers of transcripts. According to our predictions, the human ABCA1 5′UTR is strongly folded, because it has a secondary structure with a low minimum free energy. Similar situations can also be found in other vertebrates. Transcripts with 5′UTRs that hamper their translation often encode proteins that need to be strongly and finely regulated.
In conclusion, we have defined highly conserved sequences within the 5′UTRs of vertebrate ABCA1 orthologous genes, including human. Elements having the strongest influence on transcription or translation of this gene, namely exon and intron enhancers and silencers (sc35, ighg2 cgamma2, ctnt, gh-1 and fibronectin eda exon), transcription factors (TFIIA, TATA, NFAT1, NFAT4, and HOXA13), microRNAs (hsa-miR-4474-3p), and uORF and secondary structure features, were located to these sequences. Single nucleotide polymorphisms (SNPs) disrupting these important elements can probably have an impact on ABCA1 gene expression. We calculated that 24% of the nts with variants annotated in the Ensembl database were located within these hot spots. Furthermore, the uORF characterized in the study possibly codes for an unknown bioactive peptide (sPEP). We described a workflow, based on freely available bioinformatics tools, which can be suitably applied to any other gene of interest (Table 5). We hope that it can help researchers dealing with next-generation sequencing results for research as well as clinical purposes. Moreover, we showed several features of 5′UTRs which are interesting from a phylogenetic point of view and can stimulate further evolutionarily oriented research.