Classification Problems of Repetitive DNA Sequences

Abstract: Repetitive DNA sequences, namely satellite DNAs (satDNAs) and transposable elements (TEs), are essential components of the genome landscape, with many different roles in genome function and evolution. Despite significant advances in sequencing technologies and bioinformatics tools, detection and classification of repetitive sequences can still be an obstacle to the analysis of genomic repeats. Here, we summarize how specificities in repetitive DNA organizational patterns can lead to an inability to classify (and study) a significant fraction of bivalve mollusk repetitive sequences. We suggest that the main reasons for this inability are: the predominant association of satDNA arrays with Helitron/Helentron TEs; the existence of many complex loci; and the unusual, highly scattered organization of short satDNA arrays or single monomers across the whole genome. The specificities of bivalve genomes confirm the need for introducing diverse organisms as models in order to understand all aspects of repetitive DNA biology. It is expected that further development of sequencing techniques and synergy among different bioinformatics tools and databases will enable quick and unambiguous characterization and classification of repetitive DNA sequences in assembled genomes.


Introduction
Despite the exponentially growing number of genome sequencing projects spanning all taxa, genomic regions largely composed of repetitive DNA sequences still present substantial technical problems in genome assembly [1]. Repetitive DNAs consist mainly of satellite DNAs (satDNAs), formed by sequences repeated in tandem, and of mobile elements, interspersed throughout the genome [2]. According to the established classical view, satDNAs are associated with constitutive heterochromatin, which is commonly located at pericentromeric and subtelomeric chromosomal domains and at interstitial loci of the chromosomal arms. They build long arrays of monomers repeated in tandem, comprising hundreds to thousands of highly similar repeat units [3]. However, more recent work has shown that satDNA sequences can also be located outside of heterochromatin, where they occur in different organizational forms: for example, as monomers or monomer fragments, in arrays of diverse length, or incorporated into mobile elements [4][5][6][7][8][9][10]. In addition, many lines of evidence show that satDNAs and mobile elements are often tightly interconnected. For example, tandem repeats can be created from mobile elements or their segments, or satDNAs can expand from short internal arrays carried by mobile elements (reviewed in [11]).
Sequencing problems arise in attempts to reconstruct repetitive genomic segments and, consequently, these regions are still regularly omitted or misassembled in the available genomic data [12]. Ongoing improvements in sequencing technologies (e.g., long-read PacBio and Nanopore sequencing) are opening the possibility of obtaining insights into these missing fractions of assembled genomes [13]. At the same time, a number of software tools aimed at advancing repeat detection and characterization are being developed and/or upgraded (reviewed in [14]), substantially changing our knowledge of the inventory of repetitive sequences in genomes, satellitomes and repeatomes [15][16][17][18][19][20]. However, clear and unambiguous classification of the repetitive fractions of genomes still presents a challenge, as described here for bivalve mollusks.

Repetitive Sequences in Bivalve Genomes
Bivalve genomes show many peculiarities in the content and composition of repetitive DNA sequences and heterochromatin, and can be valuable sources of information as model organisms in studying repetitive DNA biology [21]. The overall contribution of repetitive DNAs to bivalve genomes is relatively high, in some species reaching 50% [22][23][24][25] or even 60% [26] of the genome. In contrast, the estimated content of sequences classified as satDNAs is very low according to genome sequencing projects, usually <2% [24,26,27,28,29]. This differs significantly from organisms in which up to 40% of the genome is composed of satDNAs [30][31][32].
Interestingly, a large number of repetitive sequences, >70%, remain unassigned in many bivalve species ([28,29,33,34], etc.), differing again from the organisms where the success in repeat classification is significantly higher, reaching 98% in some plant species [35]. At the same time, it was noticed that Helitron/Helentron mobile elements constitute a substantial part of bivalve genomes, frequently being by far the most abundant type of DNA transposons in a certain genome [26,28,36]. Mobile elements of the Helitron/Helentron superfamily consist of conserved sequence segments at element ends that hold subterminal inverted repeats and a short palindromic sequence at the 3′ end of the right flanking sequence, while in the central part, they frequently contain arrays of tandem repeats, preceded by a microsatellite sequence ([37], Figure 1a).
Whole elements can also be repeated in tandem, with frequent truncation at the 3′ end of the element [38]. The number of central repeats found within Helitron/Helentron elements in bivalves can vary significantly, from 1 to ~90 [39], with arrays holding up to 6 monomers being the most frequent [4,9,40,41,42].

Figure 1.
(a) [...] repeat in the 5′ conserved segment and a short palindromic sequence at the 3′ conserved segment. The central part of the element frequently contains arrays of tandem repeats preceded by a microsatellite sequence. (b) In RepeatExplorer analysis, different parts of Helitron mobile elements are placed in separate clusters. In this example, central repeats were assigned to cluster CL2 and classified as low-confidence satellite DNAs (satDNA), while sequence segments flanking central repeats were assigned to clusters CL4 and CL18, both remaining unclassified. The position of contigs, obtained after the clustering, is shown with respect to the Helitron-N2_Cgi consensus sequence.

Problems in Classification of Repetitive DNA Sequences: The Case of the Pacific Oyster Crassostrea gigas
The "hybrid" structure of mobile elements incorporating tandem repeats could, at least partially, explain the difficulties in the precise classification of repetitive sequences, and in determining the exact contribution of each type of repeat to the genome. For example, in a RepeatExplorer [43] analysis of the Pacific oyster Crassostrea gigas, contigs based on short-read sequences (Supplementary Data S1-S3) built separate clusters, which corresponded to different parts of Helitron mobile elements. Tandem repeats derived from the central parts of Helitrons were placed into one cluster and classified as satDNA, while sequences corresponding to flanking sequences were allocated to two separate clusters (Figure 1b) that remained unclassified [44]. Similar assignment problems were also observed in Drosophila virilis and Drosophila americana, species containing structurally similar elements of the same superfamily [45]. The inability to detect three formerly known satDNAs of Rhodnius prolixus with RepeatExplorer analysis [46] could also be a consequence of their association with certain sequence segments (potentially of transposable element origin), resulting in their placement within clusters that remained unclassified. While such problems are noticeable during analyses based on short-read data, repeat classification on assembled genomes encounters different obstacles. Programs based solely on structural characteristics found at the element ends (subterminal inverted repeats, short palindromic sequence) would fail to characterize large numbers of truncated variants, while homology-based classifications could be biased by several other factors. Among these factors, we include: the extreme variation in the number of central repeats (observed in [39]), sequence variations, the existence of two different types of repeats within an array (reported in [42]) and the existence of many complex loci. Such complex loci are abundant in the genome of the Pacific oyster C. gigas (Figure 2).
Here, tandemization of repeats and of conserved boxes that are usually found at the element ends, together with insertion, deletion and recombination events, generates a very complex network of tandem repeats and Helitron components, presenting a significant classification challenge.
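The limitation of classification based solely on element ends can be illustrated with a minimal Python sketch. It assumes that each locus has already been annotated as an ordered list of features; the labels ("box5", "box3", "msat", "rep"), the function names and the example loci are hypothetical and serve only as an illustration, not as the behavior of any existing program.

```python
from typing import List

# Hypothetical feature labels for an already annotated locus (illustrative only):
#   "box5" / "box3" : conserved 5' and 3' end segments of a Helitron-like element
#   "msat"          : microsatellite preceding the central repeats
#   "rep"           : one monomer of the central tandem repeat array

def classify_by_ends(features: List[str]) -> str:
    """Structure-only rule: call an element only if both end boxes are present."""
    if features and features[0] == "box5" and features[-1] == "box3":
        return "Helitron-like"
    return "unclassified"

def classify_relaxed(features: List[str]) -> str:
    """Relaxed rule: one conserved box plus at least one repeat flags a candidate."""
    if features and features[0] == "box5" and features[-1] == "box3":
        return "Helitron-like"
    has_box = "box5" in features or "box3" in features
    has_repeat = "rep" in features
    if has_box and has_repeat:
        return "truncated Helitron-like candidate"
    if has_repeat:
        return "standalone repeat(s)"
    return "unclassified"

# Toy loci: a complete element, a 3'-truncated variant and a single monomer.
loci = {
    "complete": ["box5", "msat", "rep", "rep", "rep", "box3"],
    "truncated_3prime": ["box5", "msat", "rep", "rep"],
    "single_monomer": ["rep"],
}

results = {name: (classify_by_ends(f), classify_relaxed(f)) for name, f in loci.items()}
for name, (strict, relaxed) in results.items():
    print(f"{name}: strict={strict}; relaxed={relaxed}")
```

In this toy setting, the strict rule recognizes only the complete element, whereas the truncated variant and the single monomer remain unclassified. The relaxed rule is merely one conceivable alternative, and it would still not resolve complex loci in which boxes and repeats of several elements are intermingled.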
In C. gigas, satellitome analysis revealed an unusual, highly scattered organization of relatively short satDNA arrays across the whole genome, a pattern followed by all 52 satDNAs detected in this species [39]. Interestingly, the same species contains an extremely low amount of heterochromatin, limited to the centromeric region of one chromosome pair and the telomeric region of another, and predominantly composed of DNA transposons [44,47]. For this species, in strong contrast with the established concept of satDNA genomic organization, no significant accumulation of satDNAs has been observed at any chromosomal locus. Additionally, a substantial number of satDNA sequences have been found in the form of single, interspersed monomers. They were, in the same manner as arrays of monomers, found either in standalone form or associated with conserved boxes of Helitron mobile elements on one or both sides [39]. We believe that this pattern, differing significantly from the classical organizational concept of satDNA, is a significant contributor to the inability to classify a large number of repetitive DNA sequences, especially standalone interspersed monomers. If examined as interspersed sequences, they would escape classification because they lack recognizable structural characteristics, while, on the other hand, they would most probably remain unrecognized by satDNA-oriented tandem repeat finders.
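Why single interspersed monomers escape tandem-oriented detection can be shown with a minimal Python sketch. It assumes exact sequence matching and a known monomer sequence, which is a strong simplification: real monomers diverge in sequence, and real tools rely on alignment-based searches. The toy sequence, the monomer and the function names are hypothetical.

```python
import re
from typing import List, Tuple

def find_tandem_arrays(seq: str, monomer: str, min_copies: int = 2) -> List[Tuple[int, int]]:
    """Return (start, copy_count) for each run of at least min_copies exact tandem copies."""
    pattern = re.compile(f"(?:{re.escape(monomer)}){{{min_copies},}}")
    return [(m.start(), len(m.group()) // len(monomer)) for m in pattern.finditer(seq)]

def find_all_copies(seq: str, monomer: str) -> List[int]:
    """Return start positions of every exact, non-overlapping monomer copy."""
    positions = []
    start = seq.find(monomer)
    while start != -1:
        positions.append(start)
        start = seq.find(monomer, start + len(monomer))
    return positions

# Toy sequence: a three-copy array followed by one isolated monomer.
monomer = "ACGTTGCA"
seq = "TTTTT" + monomer * 3 + "GGGGGGG" + monomer + "CCCC"

arrays = find_tandem_arrays(seq, monomer, min_copies=2)
copies = find_all_copies(seq, monomer)

# Monomer copies that are not part of any detected tandem array.
covered = {start + i * len(monomer) for start, n in arrays for i in range(n)}
singles = [p for p in copies if p not in covered]

print("tandem arrays (start, copies):", arrays)
print("all monomer copies:", copies)
print("single interspersed monomers:", singles)
```

The tandem-oriented search reports only the three-copy array, while the isolated monomer is found only when the sequence is scanned with an already known monomer. This mirrors the situation in which standalone monomers stay unrecognized unless the corresponding satDNA has been characterized beforehand.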

Figure 2.
An example of complex loci found within the Crassostrea gigas genome. Vertical bars denote conserved boxes of Helitron mobile elements. Boxes 1 and 2 represent conserved segments found at the 5′ and 3′ ends of Helitron-N2_Cgi, while boxes 3 and 4 represent 5′ and 3′ conserved segments, respectively, shared among several Helitron mobile elements of C. gigas. CgiSat01-06 and CgiSat10, 17, 24, 49 denote monomers belonging to some of the satDNAs described in [46]. Each arrowhead represents one monomer, with arrowhead sizes varying with respect to the monomer length of a particular satDNA. In silico localization was performed on linkage groups/chromosomes of the currently representative genome assembly GCA_902806645.1 [36].

Conclusions
While there are other circumstances that can contribute to the difficulties in the classification of repetitive elements in bivalves, e.g., the existence of species-specific variants of known elements [33], we believe that a significant role is played by: (i) the absence of classical satellite DNAs, (ii) the predominant existence of satDNA arrays of different sizes within the complete and truncated Helitron/Helentron elements, (iii) the existence of combined types of arrays within such elements, (iv) the presence of a large number of single, interspersed monomers throughout the genome and (v) the existence of complex loci generated by insertion, deletion and recombination events in combination with tandemization of repeats and mobile element conserved boxes.
Comprehensive and accurate annotation and characterization of repetitive sequences are necessary, as this part of the genome is important for understanding genomic/chromosomal architecture and function as a whole. We believe that the combination of quickly evolving sequencing technologies with the constant development and improvement of bioinformatics tools and databases, supplemented with manual curation and adaptation to the specificities of each model organism of interest, will significantly advance our ability to correctly characterize and classify repetitive DNA sequences and obtain novel insights into these important genomic components.

Institutional Review Board Statement:
Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.