Imaging Intron Evolution

Intron evolution may be readily imaged through the combined use of the “dot plot” function of the NCBI BLAST, aligning two sequences at a time, and the Vertebrate “Multiz” alignment and conservation tool of the UCSC Genome Browser. With the NCBI BLAST, an ideal alignment of two highly conserved sequences generates a diagonal straight line in the plot from the lower left corner to the upper right corner. Gaps in this line correspond to non-conserved sections. In addition, the dot plot of the alignment of a sequence with the same sequence after the removal of the Transposable Elements (TEs) can be observed along the diagonal gaps that correspond to the sites of TE insertion. The UCSC Genome Browser can graph, along the entire sequence of a single gene, the level of overall conservation in vertebrates. This level can be compared with the conservation level of the gene in one or more selected vertebrate species. As an example, we show the graphic analysis of the intron conservation in two genes: the mitochondrial solute carrier 21 (SLC25A21) and the growth hormone receptor (GHR), whose coding sequences are conserved through vertebrates, while their introns show dramatic changes in nucleotide composition and even length. In the SLC25A21, a few short but significant nucleotide sequences are conserved in zebrafish, Xenopus and humans, and the rate of conservation steadily increases from chicken/human to mouse/human alignments. In the GHR, a less conserved gene, the earlier indication of intron conservation is a small signal in chicken/human alignment. The UCSC tool may simultaneously display the conservation level of a gene in different vertebrates, with reference to the level of overall conservation in Vertebrates. It is shown that, at least in SLC25A21, the sites of higher conservation are not always coincident in chicken and zebrafish nor are the sites of higher vertebrate conservation.


Introduction
In eukaryotes, the coding DNA sequence of most protein-coding genes is interrupted by the interpolation of non-coding DNA sequences termed introns ("intragenic regions"), as opposed to the coding DNA segments termed exons [1,2]. In the cell nucleus, the whole gene is sequentially transcribed as a pre-mRNA, but before it is passed to the cytoplasm as mature mRNA, the introns are removed by a process called RNA splicing, carried out by a large ribonucleoprotein machinery called the Spliceosome [3][4][5]. In evolutionary conserved genes, the sites of splicing are also most often conserved, although alternatively spliced transcripts may be expressed in addition to the canonical transcript [6,7]. Although originally considered "junk" DNA, the introns are now believed to play a role in the control of gene expression [8,9]. In addition, introns are thought to control their own splicing by carrying nucleotide signaling sequences at their ends (5'ss and 3'ss) [10,11], as well as internal splicing enhancer or silencing sequences and sequences that interact with components of the spliceosome [12]. The origin and evolution of introns is still poorly understood. It is debated whether introns are very ancient and are gradually being lost, or whether their relatively recent evolution has caused them to gradually accumulate: the 1 January 2020) (Tables 1 and 2). The genomic sequences (from start codon to stop codon) were processed by deleting the coding sequences, thus leaving the introns only (fourth column of Table 3). Some analyses were performed after clearing the intron sequences of all transposable elements (TEs), using the CENSOR software (https://www.girinst.org/ censor/, accessed on 1 January 2020) [25] (Table 3, last column). Nucleotide alignments were usually performed with BLAST (Basic Local Alignment Search Tool; https://blast.ncbi. nlm.nih.gov/Blast.cgi, accessed on 1 January 2020), setting the Expect threshold at 0.0005. BLAST finds regions of local similarity between sequences and calculates the statistical significance of matches [26]. The dot plot is a graph on the XY plane, each dot having the positions of the aligned segment in one gene and in the other gene as coordinates. When comparing homologous genes, the series of dots will align along a somewhat regular straight line stretching from the lower left corner to the upper right corner. The gaps in the line correspond to non-homologous segments. The UCSC Genome Browser was accessed at https://genome.ucsc.edu on 1 January 2020, and the function "Comparative Genomics-Conservation" was used. This method shows multiple alignments of 100 vertebrate species and measurements of evolutionary conservation using two methods (phastCons and phyloP).
Further instructions for both BLAST and UCSC Genome Browser plots are available at the indicated sites. BLAST dot plots and UCSC graphics photos are unretouched in order to show the actual outputs of the analysis. However, when the matching "dots" were too small to be viewed on the original print (e.g., human/zebrafish alignments, accessed on 1 January 2020), the plots were redrawn with suitably sized dots at the sites of alignment.
The percentages of nucleotide conservation between species were calculated as follows: From the list of the individual aligned sections (with their length) provided by BLAST, we computed the total number of the aligning nucleotides and divided it by the total number of intron nucleotides.

The TEs
The alignment of an intron sequence with itself, apart from the expected exact diagonal alignment, usually shows a large number of matches outside the diagonal ( Figure 1A,B).  Further instructions for both BLAST and UCSC Genome Browser plots are available at the indicated sites.
BLAST dot plots and UCSC graphics photos are unretouched in order to show the actual outputs of the analysis. However, when the matching "dots" were too small to be viewed on the original print (e.g., human/zebrafish alignments, accessed on 1 January 2020), the plots were redrawn with suitably sized dots at the sites of alignment.
The percentages of nucleotide conservation between species were calculated as follows: From the list of the individual aligned sections (with their length) provided by BLAST, we computed the total number of the aligning nucleotides and divided it by the total number of intron nucleotides.

The TEs
The alignment of an intron sequence with itself, apart from the expected exact diagonal alignment, usually shows a large number of matches outside the diagonal ( Figure 1A   In the next experiment, we removed the SINE Alus from the SLC25A21 intron sequence which are the most represented TEs in human genes, but average 280 nucleotides only in length. Their removal generates a dramatic decrease in the matches outside the diagonal but also reveals gaps along the diagonal. Due to the short length of the Alus, the gaps in the diagonal are not apparent at a low magnification ( Figure 1C), but may be clearly demonstrated at a higher magnification ( Figure 1D). When all the other non-Alu TEs were also removed (a total of 192,584 nucleotides, i.e., about 39% of the introns) the small gaps along the diagonal could be appreciated, even at a low magnification. It was also seen that all matches outside the diagonal disappeared, showing that the latter were all due to coupling between TEs ( Figure 1E). The same experiment with GHR yielded similar results (not shown).

Human/Mouse Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous mouse gene shows that the large majority of matches lie along a slightly irregular diagonal line. The additional sparse matches are likely to correspond to TEs, which are at least partly similar in human and mouse, such as LINE 1 transposons ( Figure 2A); indeed, when all TEs are cancelled in both genes before the alignment, the sparse matches are practically absent (and the alignment diagonal is more regular) ( Figure 2B). The gaps (usually short) in the alignment diagonal correspond to poorly conserved sections.
In the next experiment, we removed the SINE Alus from the SLC25A21 intron sequence which are the most represented TEs in human genes, but average 280 nucleotides only in length. Their removal generates a dramatic decrease in the matches outside the diagonal but also reveals gaps along the diagonal. Due to the short length of the Alus, the gaps in the diagonal are not apparent at a low magnification ( Figure 1C), but may be clearly demonstrated at a higher magnification ( Figure 1D). When all the other non-Alu TEs were also removed (a total of 192,584 nucleotides, i.e., about 39% of the introns) the small gaps along the diagonal could be appreciated, even at a low magnification. It was also seen that all matches outside the diagonal disappeared, showing that the latter were all due to coupling between TEs ( Figure 1E). The same experiment with GHR yielded similar results (not shown).

Human/Mouse Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous mouse gene shows that the large majority of matches lie along a slightly irregular diagonal line. The additional sparse matches are likely to correspond to TEs, which are at least partly similar in human and mouse, such as LINE 1 transposons ( Figure 2A); indeed, when all TEs are cancelled in both genes before the alignment, the sparse matches are practically absent (and the alignment diagonal is more regular) ( Figure 2B). The gaps (usually short) in the alignment diagonal correspond to poorly conserved sections.
The BLAST dot plot alignment of the human GHR with the homologous mouse gene shows that approximately 1 to 140,000 human nucleotides have no match with the mouse sequence and are likely to be an additional insertion peculiar in human. On the contrary, the rest of the human sequence significantly aligns with the mouse sequence, although the alignment line is considerably distorted and exhibits relatively large gaps ( Figure 2C). Following the removal of all TEs in both sequences, the diagonal alignment becomes more regular, but with gaps that are much larger than in the SLC25A21 alignments ( Figure 2D; compare with Figure 1E).
(A)   The BLAST dot plot alignment of the human GHR with the homologous mouse gene shows that approximately 1 to 140,000 human nucleotides have no match with the mouse sequence and are likely to be an additional insertion peculiar in human. On the contrary, the rest of the human sequence significantly aligns with the mouse sequence, although the alignment line is considerably distorted and exhibits relatively large gaps ( Figure 2C). Following the removal of all TEs in both sequences, the diagonal alignment becomes more regular, but with gaps that are much larger than in the SLC25A21 alignments ( Figure 2D; compare with Figure 1E).

Human/Chicken Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous chicken gene yielded virtually identical results, whether using the complete sequences or after TEs deletion, since practically no TE is shared between the two species. Figure 3A shows that the vast majority of matches, i.e., only 32, lie along a slightly irregular diagonal line; the gaps between matching segments, which correspond to segments that do not share significant homologies, are relatively wide. The matching segments are so short that they appear as simple dots at a lower magnification ( Figure 3A), but at a higher magnification, they appear to be short segments ( Figure 3B).

Human/Chicken Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous chicken gene yielded virtually identical results, whether using the complete sequences or after TEs deletion, since practically no TE is shared between the two species. Figure 3A shows that the vast majority of matches, i.e., only 32, lie along a slightly irregular diagonal line; the gaps between matching segments, which correspond to segments that do not share significant homologies, are relatively wide. The matching segments are so short that they appear as simple dots at a lower magnification ( Figure 3A), but at a higher magnification, they appear to be short segments ( Figure 3B).
On the contrary, the BLAST 2-nucleotide search of the human GHR intron sequence against the homologous chicken sequence yielded only one 182 nucleotide-long match (i.e., 0.30 of the chicken sequence).

Human/Zebrafish Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous zebrafish gene yielded five non-TE matches (with a total of 1022 nucleotides; 0.51% of the total) positioned near the theoretical diagonal line (Figure 4).  On the contrary, the BLAST 2-nucleotide search of the human GHR intron sequence against the homologous chicken sequence yielded only one 182 nucleotide-long match (i.e., 0.30 of the chicken sequence).

Human/Zebrafish Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous zebrafish gene yielded five non-TE matches (with a total of 1022 nucleotides; 0.51% of the total) positioned near the theoretical diagonal line (Figure 4).

Human/Zebrafish Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous zebrafish gene yielded five non-TE matches (with a total of 1022 nucleotides; 0.51% of the total) positioned near the theoretical diagonal line (Figure 4).  On the contrary, no significant similarity below the Expect threshold was found when aligning the human GHR intron sequence against the intron sequence of the GHR of zebrafish.

Human/Xenopus Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous Xenopus gene yielded five non-TE matches (with a total of 846 nucleotides; 0.45% of the total) positioned near the theoretical diagonal line ( Figure 5). On the contrary, no significant similarity below the Expect threshold was found when aligning the human GHR intron sequence against the intron sequence of the GHR of zebrafish.

Human/Xenopus Alignments
The BLAST dot plot alignment of the human SLC25A21 with the homologous Xenopus gene yielded five non-TE matches (with a total of 846 nucleotides; 0.45% of the total) positioned near the theoretical diagonal line ( Figure 5). On the contrary, the alignment of the human GHR intron sequence against the intron sequence of the GHR of Xenopus did not yield significant similarities below the Expect threshold.

Zebrafish/Chicken Alignments
The BLAST dot plot alignment of the zebrafish slc25a21 with the homologous chicken gene (not shown) yielded five non-TE matches positioned near the theoretical diagonal line. The aligned nucleotides constitute 0.42% of the chicken sequence.
The alignment of the zebrafish GHR intron sequence against the intron sequence of the GHR of chicken did not yield significant similarities.

Zebrafish/Xenopus Alignments
The BLAST dot plot alignment of the zebrafish slc25a21 with the homologous Xenopus gene (not shown) yielded one match only corresponding to 0.11% of the zebrafish sequence and 0.10% of the Xenopus sequence.
The alignment of the zebrafish GHR intron sequence against the intron sequence of the GHR of Xenopus did not yield significant similarities.

Chicken/Xenopus Alignments
The BLAST dot plot alignment of the chicken slc25a21 with the homologous Xenopus gene (not shown) yielded six non-TE matches positioned near the theoretical diagonal line. The aligned nucleotides constitute 0.44% of the chicken sequence. On the contrary, the alignment of the human GHR intron sequence against the intron sequence of the GHR of Xenopus did not yield significant similarities below the Expect threshold.

Zebrafish/Chicken Alignments
The BLAST dot plot alignment of the zebrafish slc25a21 with the homologous chicken gene (not shown) yielded five non-TE matches positioned near the theoretical diagonal line. The aligned nucleotides constitute 0.42% of the chicken sequence.
The alignment of the zebrafish GHR intron sequence against the intron sequence of the GHR of chicken did not yield significant similarities.

Zebrafish/Xenopus Alignments
The BLAST dot plot alignment of the zebrafish slc25a21 with the homologous Xenopus gene (not shown) yielded one match only corresponding to 0.11% of the zebrafish sequence and 0.10% of the Xenopus sequence.
The alignment of the zebrafish GHR intron sequence against the intron sequence of the GHR of Xenopus did not yield significant similarities.

Chicken/Xenopus Alignments
The BLAST dot plot alignment of the chicken slc25a21 with the homologous Xenopus gene (not shown) yielded six non-TE matches positioned near the theoretical diagonal line. The aligned nucleotides constitute 0.44% of the chicken sequence.
The alignment of the chicken GHR intron sequence against the intron sequence of the GHR of Xenopus did not yield significant similarities.

Global Analysis of the SLC25A21 and GHR Conservation in Vertebrates
The UCSC Genome Browser, using the function Comparative Genomics/Conservation, plots the estimated evolutionary trend of a given vertebrate gene, calculated along its entire length as an average from 100 multialigned vertebrate species. The system calculates, for each position in the gene, the conservation expected under neutral drift (the zero level in the plot) and then determines the evolutionary "acceleration" (faster evolution than expected, i.e., poor conservation) and the "conservation" (slower than expected evolution) values. The values of "conservation" are plotted as positives, while the values of "acceleration" are plotted as negatives. An example is given in Figure 6A. In addition, the system provides the statistical distribution of the readings and the general average in the section studied. The alignment of the chicken GHR intron sequence against the intron sequence of the GHR of Xenopus did not yield significant similarities.

Global Analysis of the SLC25A21 and GHR Conservation in Vertebrates
The UCSC Genome Browser, using the function Comparative Genomics/Conservation, plots the estimated evolutionary trend of a given vertebrate gene, calculated along its entire length as an average from 100 multialigned vertebrate species. The system calculates, for each position in the gene, the conservation expected under neutral drift (the zero level in the plot) and then determines the evolutionary "acceleration" (faster evolution than expected, i.e., poor conservation) and the "conservation" (slower than expected evolution) values. The values of "conservation" are plotted as positives, while the values of "acceleration" are plotted as negatives. An example is given in Figure 6A. In addition, the system provides the statistical distribution of the readings and the general average in the section studied.  Figure  6B, but each bar corresponds to 480 nucleotides.
The conservation profiles of SLC25A21 and GHR are shown in Figure 6B,C, respectively. In both graphs, the positive values largely prevail, although a few small negative peaks may be appreciated in spite of the small enlargement. The general averages of all readings are 0.256 in SLC25A21 and 0.093 in GHR. These averages forcefully include exons; however, these represent a very small fraction of the sequences (0.18% of SLC25A21 in human and 0.66% in GHR). The average figures indicate that, in both genes, the intron sequences tend to be conserved, but the level of conservation is about 2.8 times higher in SLC25A21. The conservation profiles of SLC25A21 and GHR are shown in Figure 6B,C, respectively. In both graphs, the positive values largely prevail, although a few small negative peaks may be appreciated in spite of the small enlargement. The general averages of all readings are 0.256 in SLC25A21 and 0.093 in GHR. These averages forcefully include exons; however, these represent a very small fraction of the sequences (0.18% of SLC25A21 in human and 0.66% in GHR). The average figures indicate that, in both genes, the intron sequences tend to be conserved, but the level of conservation is about 2.8 times higher in SLC25A21.

Differences among Species
The peaks of conservation in the different vertebrate species do not always match the general vertebrate pattern. Here, we provide two examples covering two different 53,000-nucleotide segments in intron 1 ( Figure 7A,B).

Differences among Species
The peaks of conservation in the different vertebrate species do not always match the general vertebrate pattern. Here, we provide two examples covering two different 53,000nucleotide segments in intron 1 ( Figure 7A Table 2). (B) A section of the SLC25A21 sequence: the general vertebrate conservation exhibits a long series of conserved nucleotides, while the chicken matches only a small part of the series and zebrafish exhibits no signal at all. This section approximately corresponds in human to nucleotides 36,792 K to 36,845 K.

Repeats in Introns of Human SLC25A21 and GHR
Within the intronic sequence of human SLC25A21, we found two short Plus/Plus repeats of non-TE sequences: the longest comprised 96 nt (Expect = 5 × 10 −34 ). Within the intronic sequence of human GHR, we found four short Plus/Plus repeats of non-TE sequences: the longest comprised 81 nt (Expect = 3 × 10 −22 ).

Exon Nucleotide and Amino Acid Conservation in SLC25A21 and GHR (Zebrafish/Human)
In SLC25A21, the zebrafish/human conservation was 69.6% for exon nucleotides and 81.2% for the amino acids. In GHR, the zebrafish/human conservation was 56.3% for exon nucleotides and 43.5% for the amino acids.
Actinopterygii/Sarcopterygii divergence seems to have entailed a radical re-editing of introns. In SLC25A21 alignments of the actinopterygian zebrafish with human, chicken, and Xenopus (all of the sarcopterygian group), there were five (0.51% of the intron sequence), five (0.42%), and one (0.11%) matches, respectively. The corresponding alignments with the less conserved GHR yielded no matches.
In the sarcopterygian group of Tetrapoda, the subsequent Amphibia/Amniota divergence also entailed a similarly radical re-editing of introns. In SLC25A21 alignments of  Table 2). (B) A section of the SLC25A21 sequence: the general vertebrate conservation exhibits a long series of conserved nucleotides, while the chicken matches only a small part of the series and zebrafish exhibits no signal at all. This section approximately corresponds in human to nucleotides 36,792 K to 36,845 K.

Repeats in Introns of Human SLC25A21 and GHR
Within the intronic sequence of human SLC25A21, we found two short Plus/Plus repeats of non-TE sequences: the longest comprised 96 nt (Expect = 5 × 10 −34 ). Within the intronic sequence of human GHR, we found four short Plus/Plus repeats of non-TE sequences: the longest comprised 81 nt (Expect = 3 × 10 −22 ).

Exon Nucleotide and Amino Acid Conservation in SLC25A21 and GHR (Zebrafish/Human)
In SLC25A21, the zebrafish/human conservation was 69.6% for exon nucleotides and 81.2% for the amino acids. In GHR, the zebrafish/human conservation was 56.3% for exon nucleotides and 43.5% for the amino acids.
Actinopterygii/Sarcopterygii divergence seems to have entailed a radical re-editing of introns. In SLC25A21 alignments of the actinopterygian zebrafish with human, chicken, and Xenopus (all of the sarcopterygian group), there were five (0.51% of the intron sequence), five (0.42%), and one (0.11%) matches, respectively. The corresponding alignments with the less conserved GHR yielded no matches.
In the sarcopterygian group of Tetrapoda, the subsequent Amphibia/Amniota divergence also entailed a similarly radical re-editing of introns. In SLC25A21 alignments of Xenopus with human and chicken, the matches were 5 (0.45% of the intron sequence), and 6 (0.44%), respectively. The corresponding alignments with the less conserved GHR yielded no match.
In Amniota, the Synapsida/Sauropsida divergence appears to be much more conservative. In the SLC25A21 alignment of human with chicken, the matches were 32 (4.66% of the intron sequence). The corresponding alignment with the less conserved GHR yielded only one match (0.30%).
Eventually, as expected, the more recent rodents/primates divergence was more conservative. The human/mouse alignments yielded 266 matches (40.03%) with SLC25A21 and 27 matches (11.07%) with GHR.
The evolution of exons was also more conservative in SLC25A21 than in GHR, as reported in paragraph 3.12. Thus, the nucleotide dynamics of introns seems to parallel that of exons, although changes in introns are possibly less restricted and can accumulate, while exons are subject to a selective pressure, which minimizes changes that would alter their protein product.
A comparison of the total number of intronic nucleotides in human and zebrafish shows that the figures in human are almost three times those in zebrafish, both in SLC25A21 and GHR (Table 3). According to our investigations (unpublished results), through the whole mitochondrial carrier subfamily, the number of intron nucleotides is significantly higher in human than in zebrafish. The reason for a trend of intron length increase during evolution has not yet been completely elucidated. The hypothesis is that introns can grow by progressive addition of new "structure units" at their 3 end [33]. In addition, the spontaneous duplication of DNA segments, apparently by accidental replication, was described [34]. In the SLC25A3 gene of vertebrates, a whole exon (about 120 nt long) is duplicated, and two copies which retain a high level of similarity are alternatively transcribed, giving rise to two similar transcripts [35][36][37]. In the present material, we found a few short but significant repeats in the human sequences (Section 3.11) that could be the recognizable remnants of larger duplications which had diverged by a summation of mutations. The relatively regular distribution of matches along the entire gene length seems to contradict the hypothesis of additions at the 3 end only; on the other hand, the number of duplications observed appears too low to support weighty intron elongations. These observations suggest that the increase in intron length during evolution may occur through different mechanisms.

Conclusions
Using the NIH Nucleotide BLAST suite to align the introns of homologous genes of two species, the "dot plot" provides a complete visual image of the intron conservation between the two species. Aligning a genuine intronic sequence with the same sequence after TEs clearing, the "dot plot" returns a clear (negative) image of the numerousness and positions of TEs ( Figure 1E). However, when the two species share similar TEs, as is the case of the human SINE Alu and mouse SINE B1 [38], several matches show outside of the alignment diagonal (Figure 2A), but after TE removal, these matches are no longer present ( Figure 2B). The UCSC Genome graphics suite may be a powerful tool for displaying multicomparisons along the entire gene of reference, the average conservation level in vertebrates and the conservation level profile in selected individual species, simultaneously. Thus, the combination of the two approaches may be helpful to gain a better understanding of the general evolutionary dynamics in vertebrate introns. For this connection, the graphically basic approaches outlined here may help in providing a qualitative, but readily visual, evaluation of the intron structural conservation/mutation and integration with transposable elements in vertebrate evolution, as a preliminary step to more quantitative approaches.