Can Introns Stabilize Gene Duplication?

Simple Summary
Eukaryotic genes are organized as DNA sequences containing exon and intron regions. Exons are the sequences that, after transcription, are retained in the mature mRNA and provide the blueprint for protein translation. Introns, on the other hand, are present in the primary transcript and are then removed by splicing. The evolutionary solutions that maintain this complex gene organization and keep it functional are only partially known. Here, we speculate that the presence of introns in the gene sequence can stabilize the products of gene duplication, one of the most effective driving forces in genome evolution. The hypothesis we propose should be considered an addition to those currently reported, not an alternative.

Abstract
Gene duplication is considered one of the most important events that determine the evolution of genomes. However, the neo-duplication condition of a given gene is particularly unstable due to recombination events. Several mechanisms have been proposed to explain how this unstable step is overcome. In this "opinion article" we propose a role for intron sequences in stabilizing gene duplication by limiting and reducing the sequence identity between the two duplicated copies. A review of the topic and a detailed hypothesis are presented.


Essential Aspects of Gene Duplication
Several processes are surmised to drive the evolution of genomes. Among them, gene duplication has traditionally been regarded as a valuable source of innovation and functional variation, and its active involvement in genome evolution has been extensively considered [1][2][3]. The gene duplication rate is estimated to be of the same order of magnitude as that of single nucleotide polymorphisms [4].
The concept of gene duplication was originally formalized by S. Ohno in 1970 [5] and its major features were subsequently highlighted by the same author [6]. In particular, Ohno considered that, since one gene copy is generally sufficient to support a given function, any extra copy obtained by duplication could undergo mutations that either make it non-functional (pseudo- or non-functionalization, Figure 1 a-a') or cause the acquisition of a new function (neo-functionalization, Figure 1 a-a"). Moreover, if the second gene copy is not further modified, gene dosage would be increased [7] (Figure 1 a-a). Finally, through a specialization inducing an altered function, sub-functionalization could occur [8] (Figure 1 a-a"'). According to Ohno's model, there would be an overall increase in the evolutionary rate of the duplicated copy. In practice, at least two processes limit the spread of new gene variants: accumulation of mutations, with the consequent formation of pseudogenes [2,9,10,11], and gene conversion, which homogenizes the duplicate with the original gene copy [12]. In essence, as gene duplication operates in all genomes [13], it allows evolution to experiment with otherwise prohibitive gene variants and therefore confers a strong evolutionary advantage [14].
Gene duplication events are observed both at a small scale (small-scale duplication, SSD) [15][16][17] and at a large scale, as exemplified by whole-genome duplication (WGD) [18,19]. SSD derives mainly from tandem duplication events and unequal crossing over between paralogous genes, and usually involves genes placed in proximity to each other [15][16][17]. Transposition events can also lead to gene duplication, and traces of these events are observed in the conserved terminal repeats [20,21]. As for large-scale gene duplication, it is worth mentioning polyploidy (partial or temporary). This condition typically arises from alloploidy (the acquisition of related chromosomes from different species), autoploidy (duplication caused by lack of cytokinesis), or fertilization between gametes that have not undergone reduction of the chromosome number [3]. The latter case is usually identified as WGD. Temporary changes in ploidy have been found in most organisms [3,22], and sequence analysis allows the traces of these events to be identified. These phenomena are observed frequently in plants [23][24][25]. Generally, in WGD most duplicated genes are short-lived: one of the two copies is soon lost or altered and only one copy is eventually maintained. Interestingly, genes encoding products that operate in protein complexes tend to be maintained even as double copies [26]. The fact that in most cases of gene duplication only one copy is maintained has led to the speculation that the initial "double copy" state is inherently unstable, the loss of one of the two copies being the result of unequal recombination.
Overall, what appears in current genomes can be regarded essentially as a balance between the acquisition and loss of gene copies. The widely acknowledged genetic relevance of gene duplication notwithstanding, the molecular mechanisms that can lead to the duplication of genes and to their initial stabilization are as yet not fully understood. In the "hypothesis" section we propose a role for intronic sequences precisely in the stabilization of the sequences of neo-duplicated genes.

Figure 1. Schematic representation of gene duplication and its consequences. The symbols X, star, and full circle denote one or more mutations which, after gene duplication, result in the functional outcomes specified on the right. The functional state of the duplicated gene is highlighted in color (orange, unchanged; red, non- or pseudo-functional; green, neo-functional; grey, sub-functional).

Genomes and Non-Coding DNA Content
By comparing genome size in organisms along the phylogenetic scale, a progressive, though not linear, increase can be observed starting with the simplest species. In bacteria and simple eukaryotes, gene density is rather high, while in the genomes of more complex eukaryotes gene density decreases as the amount of DNA increases. These observations have led to the formulation of the well-known C-value paradox [27,28]. In essence, on the evolutionary scale, the size of the genome and the number of genes do not grow in parallel. The expansion of repeated copies, polyploidization processes, whole-genome duplications, and the insertion of transposons are major factors responsible for the increase in the size of genomes. Due to both pseudogenization and sub-functionalization processes, a large part of the increased genome size does not produce new genes and, with subsequent duplications, accumulates as "junk DNA". However, maintaining this pool of non-coding sequences allows for the acquisition of new biological functions and can thus represent a reservoir for biological evolution [29,30].
In this context, it is worth stressing that, according to the genotype-phenotype dualism [31], the content of genomic DNA (genome size) represents an adaptive feature that changes during evolution just like a phenotypic trait. Hence, deciphering how genomes grow and how the formation and loss of genes and junk DNA occur is crucial to the understanding of genome evolution.

Introns
A very important genomic feature, which also provides a conspicuous contribution to the size of eukaryotic genomes, is the presence of introns, originally also called intervening sequences. They are found in most eukaryotic genes. The long-standing "one gene, one polypeptide chain" paradigm [32] suddenly became much less stringent after Sharp and colleagues discovered introns about forty years ago [33,34]. Initially (and for a long time) considered "junk DNA", introns have been gradually recognized as having an increasingly important role in gene expression.
Introns are currently classified into three major categories according to their structure and the way in which they are removed from precursor RNA to produce its mature form [35]: (a) spliceosomal introns, which are ubiquitous in eukaryotic genomes and require a complex RNA/protein machinery (the spliceosome) for their removal from the RNA precursor; (b) self-splicing introns (ribozymes), subdivided into group I (present in bacteria, viruses and, in eukaryotes, in the rRNA fraction of mitochondria and plastids; their splicing involves a complex three-dimensional structure) and group II (found in bacteria, mitochondria, and chloroplasts; their excision involves a specific secondary structure of the precursor RNA and the formation of a lariat); and (c) tRNA introns.
Almost four decades after their discovery, the debate on the origin of introns remains open [36]. In particular, a hot topic is the early versus late appearance of these sequences during evolution. In the early-appearance hypothesis, introns are regarded as very ancient elements which, depending on the organism, have been successively lost in different ways [37]. Bacteria would have lost almost all of them in a streamlining process of the genome, while eukaryotes, particularly those endowed with large genomes, would have preserved intronic sequences in large quantities [38]. The late-appearance view is supported by the strong similarity between self-splicing group II introns and spliceosomal introns (which are surmised to derive from group II ones, as suggested by the formation of lariat structures in both systems and by the conservation of boundary sequences [39]). In essence, the group II self-splicing form is regarded as the original one, which then largely evolved into spliceosomal introns.
Since "introgenesis" has continued during the course of evolution [40], current views assume the coexistence of the two intron origin processes, so that a pool of early-appearing introns is maintained with late-appearing ones that continue to accumulate. Regardless of the mechanisms that allowed the presence of introns, their maintenance must be subject to natural selection, and therefore to the existence of a functional advantage. Several specific advantages can be envisaged. For instance, an increased coding repertoire resulting from alternative splicing (AS) could represent an evolutionary "push" towards the creation of isoforms (giving rise to possible variants and new biological functions [41]), or towards the introduction of slight changes allowing the fine-tuning of specific functions [42] or the appearance of non-functioning forms [43,44]. Another advantage is represented by the possible increase of homologous recombination between similar sequences (exon shuffling) [35,45]. Finally, there is the possibility for faster evolution compared to coding sequences, as introns are not blocked by the need to specify coding information [46].

Hypothesis
We wish to suggest a further advantage for coding sequences in eukaryotic genomes to acquire or retain intervening sequences. The hypothesis we propose should be considered as an addition and not as an alternative to those summarized above. In particular, we focus on tandem-based duplications, as these exemplify a major gene duplication category.
Let us first consider the "intron early-appearance" scenario, i.e., an initial genomic condition characterized by introns present in most genes. When gene duplication occurs, the maintenance of two or more identical copies is difficult due to the purging systems of the genome, e.g., unequal homologous recombination, which tends to restore the single-gene condition and to control repeated sequences, as occurs (in a specialized form) for ribosomal genes. The presence of the intron(s) contributes to the rapid differentiation of one gene copy from the other. Indeed, as intronic sequences are non-coding, they can mutate easily. A rapid, progressive reduction in the homology between the two genes can be expected and, as sequence homology is the basis for efficient recombination, this in turn decreases the possibility for recombination to occur. Hence, the duplicate copy is quickly allowed to maintain its differentiated state.
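The reasoning of this paragraph can be made concrete with a rough numerical sketch. The Python snippet below is only an illustration under stated assumptions, not part of the original analysis: it uses the Jukes-Cantor substitution model to estimate the expected identity between two duplicated copies, and the sequence lengths and per-site substitution rates (`exon_rate`, `intron_rate`) are hypothetical placeholders chosen solely so that introns change faster than exons.

```python
import math


def jc_identity(subs_per_site: float) -> float:
    """Expected fraction of identical sites under the Jukes-Cantor model."""
    return 0.25 + 0.75 * math.exp(-4.0 * subs_per_site / 3.0)


def copy_identity(exon_len: int, intron_len: int,
                  exon_rate: float, intron_rate: float,
                  generations: float) -> float:
    """Length-weighted expected identity between two duplicated gene copies."""
    # Both copies mutate independently, so divergence accrues at twice
    # the per-copy substitution rate.
    exon_div = 2.0 * exon_rate * generations
    intron_div = 2.0 * intron_rate * generations
    total_len = exon_len + intron_len
    return (exon_len * jc_identity(exon_div)
            + intron_len * jc_identity(intron_div)) / total_len


if __name__ == "__main__":
    # Hypothetical values: 1500 bp of exons, 6000 bp of introns, and
    # introns assumed to change five times faster than exons.
    print("generations  exons only  exons + introns")
    for generations in (0, 1e6, 1e7, 1e8):
        exons_only = copy_identity(1500, 0, 1e-9, 5e-9, generations)
        with_introns = copy_identity(1500, 6000, 1e-9, 5e-9, generations)
        print(f"{generations:>11.0f}  {exons_only:.4f}      {with_introns:.4f}")
```

Under these assumptions, at any time after duplication the copy pair carrying introns shows a lower overall identity than an intronless pair, which is the effect the hypothesis relies on. The sketch does not address how identity translates into recombination frequency.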
Let us now consider the "intron late-appearance" scenario, i.e., a genomic condition characterized by the absence of introns in most genes. Following gene duplication, the acquisition of intervening sequences within the genes significantly lowers the identity between the sequences and minimizes the possibility of recombination that would lead to the loss of a copy. The acquisition of further introns reduces the likelihood of recombination even more and increases the stability of the duplicated sequence. In line with this expectation, exon size would exhibit minimal changes while intron size would be hypervariable. By reducing the homology between the gene copies, the acquisition of new introns thus stabilizes their presence.
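A toy example can illustrate how a single intron insertion interrupts the identity between two copies. The sketch below is a hypothetical illustration, not a model of recombination: the sequences are random, their lengths are arbitrary, and the longest uninterrupted identical tract is used as a crude proxy for the homology available to recombination (an assumption of this sketch, not a claim of the text).

```python
import random
from difflib import SequenceMatcher


def longest_shared_tract(copy_a: str, copy_b: str) -> int:
    """Length of the longest contiguous stretch present in both sequences."""
    # autojunk=False: with a four-letter alphabet every base counts as
    # "popular", and the default heuristic would discard all of them.
    matcher = SequenceMatcher(None, copy_a, copy_b, autojunk=False)
    match = matcher.find_longest_match(0, len(copy_a), 0, len(copy_b))
    return match.size


def random_dna(length: int, rng: random.Random) -> str:
    """Random nucleotide string, used only as placeholder sequence."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)
gene = random_dna(400, rng)     # hypothetical intronless gene
intron = random_dna(100, rng)   # hypothetical newly acquired intron
copy_with_intron = gene[:200] + intron + gene[200:]

before = longest_shared_tract(gene, gene)              # identical copies
after = longest_shared_tract(gene, copy_with_intron)   # one copy gains an intron
print(before, after)
```

In this toy case the insertion splits the fully identical tract into two shorter ones, so the longest uninterrupted match drops to roughly half the gene length; each additional intron would fragment it further.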
In essence, both acquisition and maintenance of intervening sequences would allow (at least in the case of tandem duplications) the targeted gene to undergo duplication with a lower risk of being eliminated by homologous recombination. The association of introns with gene duplication enhances the stability of the duplicated copies and both processes are evolutionarily supported.