PlantDRs: A Database of Dispersed Repeats in Plant Genomes Identified by the Iterative Procedure Method

Rudenko, Valentina; Korotkov, Eugene; Kostenko, Dmitrii

doi:10.3390/data10070111

by Valentina Rudenko *, Eugene Korotkov and Dmitrii Kostenko

Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld. 2, 33 Leninsky Ave., 119071 Moscow, Russia

* Author to whom correspondence should be addressed.

Data 2025, 10(7), 111; https://doi.org/10.3390/data10070111

Submission received: 16 May 2025 / Revised: 6 July 2025 / Accepted: 7 July 2025 / Published: 9 July 2025

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

Abstract

In this work, we searched for and analyzed highly divergent dispersed repeats (DRs) in the genomes of four plants: Arabidopsis thaliana, Capsicum annuum, Daucus carota, and Zea mays. DRs were detected using the iterative procedure method which has shown efficacy in searches for highly divergent repeats in bacteria and algae. The results indicated that the number of DRs in the plant genomes depended on the genome size, whereas the number of repeat families did not. The DRs covered from 36 to 50% of the studied genomes. The shortest repeats were observed in the D. carota genome, but their consensus lengths were similar to those in the other species. Analysis of periodicity in various DR families showed that most periods were 3 bp long. We created a database of the detected DRs, which contains 5,392,216 DRs grouped in 150 families and which can be accessed on the Research Center of Biotechnology RAS server. The server makes it possible to search for repeats based on various criteria and to download the obtained data.

Keywords: dispersed repeats; transposable elements; repeat family; plant genome; Arabidopsis thaliana; Capsicum annuum; Daucus carota; Zea mays; IP method; dispersed repeats database

1. Introduction

It is known that eukaryotic genomes contain numerous repetitive DNA sequences of various types. Among them, dispersed repeats (DRs), which are DNA segments that occur multiple times at random positions in the genome, include mobile elements such as retrotransposons and DNA transposons. The current point of view on the role of transposable elements (TEs) is that they regulate the plasticity of the genome and can play a significant role in evolution [1]. The movement of TEs throughout the genome can result in TE insertion into gene reading frames and adjacent regulatory regions, which can lead to alternative splicing, gene duplications, and recombinations and affect the expression rate [2,3,4,5]. During evolution, organisms have developed strategies to suppress negative effects of transposons, including epigenetic control of their number and activity through small RNAs and methylation [6,7,8,9,10]. Therefore, mobile elements are predominantly silent in the genome.

Among living organisms, plants are the most enriched with repetitive elements [11]. Plants are spread ubiquitously in all climate zones and can thrive in all environments, demonstrating a remarkable ability to adapt to diverse living conditions and stress factors, both biotic, such as pathogens, and abiotic, such as temperature, humidity, pressure, light, and nutrient availability. As most plants are unable to change their habitat in response to external challenges, they have developed internal adaptation mechanisms, including shifts in gene expression, which are closely related to TE activity [12].

It has been observed that in plants, TEs respond to adverse conditions by relocation to stress response genes or the associated regulatory regions. Thus, Arabidopsis reacts to heat stress by initiating a rapid increase in the number of ONSEN TE copies in the genome, which leads to changes in gene expression, alternative splicing, and exonization, ultimately affecting the heat sensitivity of the plant [13].

Although the nature of TE activity is not fully understood, it is clear that TEs play an important role in plant evolution. Thus, there are periods when transposons are relatively immobile and controlled by genomic defense mechanisms, which are followed by bursts of TE activity; such fluctuations point to a dynamic equilibrium between TE suppression and reactivation within a host genome [14]. In particular, it is noted that genome polyploidization has been accompanied by an increase in the number of TEs [15]. The interaction of transposons with the host genome leads to so-called “domestication” of TEs which confer useful traits to the plant [16,17,18].

Such TE properties could be valuable for applied biotechnology. The activation of element transposition in response to stress is currently used for selection and domestication of agricultural crops [19,20,21]. TEs have been shown to stimulate pigment biosynthesis and promote the development of new colors and forms of flower genotypes [22]. TEs are also useful as molecular markers in genotyping and phylogenetic studies [23,24].

Nowadays, genome analysis and annotation cannot be performed without identifying not only genes but also various DR classes, including TEs. Many studies have concentrated on the search for DRs, which resulted in the rapid growth of DR databases. Along with previously used Repbase [25] and Dfam [26], there are now many resources providing information about DRs detected in the genomes of various organisms. For plants, one can consider DR databases for Cicer species [27], Populus trichocarpa, P. euphratica, Salix suchowensis [28], Gossypium raimondii [29], and soybean [30]; there are also databases of TEs belonging to specific classes such as MITEs [31], SINE [32], and LTR [33].

A large number of mathematical methods and software programs have been developed for DR detection in genomes [34,35,36]. All currently known approaches to identify DRs are divided into two groups: methods based on searching for similarities with already known sequences from repeat databases and de novo methods that search for previously unknown repeats. Pipelines combining algorithms of both groups are also very popular.

The limitation of the first group of methods is that they require prior information about the type of repeats, which is not always available. The algorithms of the second group are more universal as they can identify new DR families; however, since they are based on either self-comparison or k-mer distribution, these methods can reveal only short conserved repeats that have accumulated few mutations. At the same time, TEs are mostly located in non-coding DNA regions, near centromeres and telomeres [37], where they may have mutated beyond recognition by the existing methods.

In contrast to the above-mentioned mathematical approaches, the recently developed iterative procedure (IP) method allows detection of highly divergent long DRs of about 500–600 bp, which has been demonstrated in our previous studies of the genomes of various bacteria and red algae [38,39,40].

In this work, we applied the IP method to analyze DNA repeats in the genomes of higher plants: Arabidopsis thaliana, Capsicum annuum, Daucus carota, and Zea mays. Since the bioinformatics approach cannot provide clues to the repeat nature and confirm that they are true TEs, we use a more general term, “DRs”. As a result of our analysis, we identified in total 150 DR families including 5,392,216 repeats, which were deposited into an open access database, PlantDRs 1.0. The database contains all the necessary information on the DRs in the four plant genomes and provides search tools based on various criteria.

2. Materials and Methods

The search for DRs in the plant genomes was performed using the IP method described in detail in our previous publications [38,39,40].

In short, the method consists of four steps: (1) generation of position weight matrices (PWMs) of repeat families, (2) scanning of the genome based on family matrices, (3) filtering of potential repeats, and (4) selection of the statistical significance level and confirmation of significant DRs.

During the first step, we find such a PWM that would correspond to any repeated sequence in the analyzed genome. The IP method belongs to the de novo category, i.e., the type of repeat is unknown. In order to determine the presence of highly divergent DRs, the frequencies of dinucleotides in the sequence are taken into account in the calculations.

Let us denote the genomic sequence as S and its length as Ls. We used M(16 × L) as the repeat profile matrix, where 16 is the number of all possible dinucleotides and L = 600 bp is the maximum repeat length. Initially, a random matrix M(i,j) was generated for the unknown repeat family with the elements from −10.0 to 10.0. This matrix was transformed into M_tr(i,j) so that

R_{0}^{2} = \sum_{i = 1}^{16} \sum_{j = 1}^{L} M_{tr}(i, j)^{2}

and

K_{0} = \sum_{i = 1}^{16} \sum_{j = 1}^{L} M_{tr}(i, j) p_{1}(i) p_{2}(j),

where p₂(j) = 1/L, p₁(i) is the probability of dinucleotide type i in S, K₀ = −1.0, and R_{0}^{2} = 300 L^{0.5}. More detailed explanations for the choice of parameters K₀ and R₀ can be found in [41]. The transformed matrix M_tr(i,j) represented the PWM.
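As an illustration of the quantity p₁(i) used in the definition of K₀, the dinucleotide probabilities of a sequence can be estimated as in the following minimal Python sketch. It is not the authors' implementation: the function name, the use of overlapping dinucleotides, and the skipping of pairs that contain symbols other than A, C, G, and T are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# The 16 possible dinucleotide types, in a fixed order (AA, AC, ..., TT).
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_probabilities(seq: str) -> dict:
    """Estimate p1(i) from overlapping dinucleotides of seq.

    Assumption: pairs containing symbols other than A, C, G, T
    (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[k:k + 2] for k in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        raise ValueError("sequence contains no valid dinucleotides")
    return {d: counts[d] / total for d in DINUCLEOTIDES}


# Toy sequence, for illustration only.
p1 = dinucleotide_probabilities("ACGTACGTNNACGT")
```

The resulting 16 values sum to 1, as required for p₁(i) to act as a probability distribution over dinucleotide types.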

Next, we used the profile PWM to search for the best local alignment of all fragments of sequence S with coordinates 10 × t and 10 × t + 649 (t = 0, 1, …, (Ls−650)//10, where // is the integer division operator). Statistical significance of the alignment Z was determined using the Monte Carlo method, and alignments with Z ≥ 3.5 were selected. In the alignment, v1 was the sequence of profile matrix positions and v2 was the fragment of sequence S. If v2 for two different alignments intersected, the alignment with a lower Z value was excluded from further consideration.
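The window coordinates described above can be enumerated with a short Python generator. This is only a sketch of the stated formula; the function name and the 0-based, inclusive coordinate convention are assumptions.

```python
def window_coordinates(seq_len: int, step: int = 10, width: int = 650):
    """Yield inclusive (start, end) pairs (10*t, 10*t + 649)
    for t = 0, 1, ..., (Ls - 650) // 10.

    Assumption: coordinates are 0-based.
    """
    if seq_len < width:
        return  # sequence is shorter than one window
    for t in range((seq_len - width) // step + 1):
        start = step * t
        yield start, start + width - 1


# A sequence of length Ls = 1000 gives t = 0..35, i.e., 36 windows.
windows = list(window_coordinates(1000))
```

Because the step (10 bp) is much smaller than the window width (650 bp), neighboring windows overlap heavily, which is why intersecting alignments have to be resolved by their Z values afterwards.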

For the remaining alignment fragments, we constructed a multiple alignment of all v2 sequences, taking into account their positions in v1, and calculated the frequency matrix M_f(16 × L), in which element (i,j) shows the number of dinucleotides of type i in position j of the repeat. This matrix was then transformed as described above into M_tr(i,j). The resultant matrix was used as the PWM in the next iteration.

The iterative procedure of constructing the best local alignments and recalculating the PWM continued until the number of alignments with Z ≥ 3.5 ceased to increase. If at the end the number of alignments with Z ≥ 3.5 was greater than 300, then the PWM was fixed as the family matrix, and all fragments of S with Z ≥ 3.5 were removed from S.

Then, the next DR family was created in the same way. The process of generating new families continued for as long as each new family contained more than 300 elements.
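The control flow of this first step can be summarized in the following Python sketch. It shows only the stopping rules stated above (iterate while the number of alignments with Z ≥ 3.5 keeps increasing; accept a family only if it has more than 300 members). The callables make_random_pwm, find_hits, rebuild_pwm, and mask_hits are hypothetical placeholders for the alignment and matrix routines, which are not reproduced here.

```python
def refine_family(pwm, find_hits, rebuild_pwm, z_min=3.5):
    """Iterate PWM -> alignments -> PWM until the number of
    alignments with Z >= z_min stops increasing."""
    best_pwm, best_hits = pwm, find_hits(pwm, z_min)
    while True:
        new_pwm = rebuild_pwm(best_hits)
        new_hits = find_hits(new_pwm, z_min)
        if len(new_hits) <= len(best_hits):
            return best_pwm, best_hits
        best_pwm, best_hits = new_pwm, new_hits


def build_families(make_random_pwm, find_hits, rebuild_pwm, mask_hits,
                   min_size=300):
    """Create families one by one; stop at the first family that
    does not exceed min_size members."""
    families = []
    while True:
        pwm, hits = refine_family(make_random_pwm(), find_hits, rebuild_pwm)
        if len(hits) <= min_size:
            break
        families.append((pwm, hits))
        mask_hits(hits)  # remove the found fragments from S
    return families
```

Passing the routines in as arguments keeps the sketch independent of any particular alignment implementation.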

At the second step of the IP algorithm, the initial sequence S was scanned with the PWMs. In this case, a higher Z level was used, and fragments that had Z ≥ 4.0 were considered as potential repeats.

According to step 1, the repeats of the same family do not intersect with each other, whereas those belonging to different families can do so. Therefore, at the third step of the IP algorithm, filtering was performed to remove redundancy. If two repeats overlapped by more than 50 bp, we selected the one with a greater Z value. The final number of repeats after filtering was denoted as N.
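One possible way to implement this filtering is a greedy pass over the repeats in order of decreasing Z, as in the Python sketch below. The source states only the rule (for an overlap above 50 bp, keep the repeat with the greater Z); the greedy ordering, the Repeat record, and the inclusive coordinates are assumptions of this example.

```python
from typing import List, NamedTuple


class Repeat(NamedTuple):
    start: int   # left coordinate (inclusive)
    end: int     # right coordinate (inclusive)
    z: float     # statistical significance
    family: int  # family number


def overlap(a: Repeat, b: Repeat) -> int:
    """Number of positions shared by two repeats."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def filter_overlaps(repeats: List[Repeat], max_overlap: int = 50) -> List[Repeat]:
    """Keep a repeat only if it overlaps every already kept,
    higher-Z repeat by at most max_overlap bp."""
    kept: List[Repeat] = []
    for r in sorted(repeats, key=lambda r: r.z, reverse=True):
        if all(overlap(r, k) <= max_overlap for k in kept):
            kept.append(r)
    return sorted(kept, key=lambda r: r.start)


# Hypothetical repeats, for illustration only.
a = Repeat(0, 599, 6.0, 1)
b = Repeat(300, 899, 4.5, 2)    # overlaps a by 300 bp and has lower Z
c = Repeat(560, 1100, 5.0, 3)   # overlaps a by only 40 bp
filtered = filter_overlaps([a, b, c])
```

The pairwise comparison is quadratic in the number of repeats; for genome-scale data, an interval index or a sweep over sorted coordinates would be needed.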

At the final step of the algorithm, steps 2 and 3 were performed for a random sequence generated by rearranging symbols of the original sequence using a random number generator, and the number of found repeats was denoted as Nr. Then, the reliability of the results obtained in steps 1–3 was determined according to the false discovery ratio (FDR) calculated as FDR = Nr/(N + Nr).
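The randomization and the FDR formula can be sketched in Python as follows. The function names and the fixed seed are assumptions; the counts in the usage line are hypothetical and do not come from the paper.

```python
import random


def shuffled_sequence(seq: str, seed: int = 0) -> str:
    """Rearrange the symbols of seq at random, preserving composition."""
    symbols = list(seq)
    random.Random(seed).shuffle(symbols)
    return "".join(symbols)


def false_discovery_ratio(n_real: int, n_random: int) -> float:
    """FDR = Nr / (N + Nr)."""
    total = n_real + n_random
    if total == 0:
        raise ValueError("no repeats found in either sequence")
    return n_random / total


control = shuffled_sequence("ACGTTTGACA")
fdr = false_discovery_ratio(n_real=970, n_random=30)  # hypothetical counts
```

Shuffling individual symbols preserves the nucleotide composition of the sequence but not its dinucleotide composition.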

3. Results

3.1. Search for DRs in the Plant Genomes

The species chosen for analysis, A. thaliana, C. annuum (paprika), D. carota (carrot), and Z. mays (corn), represent flowering plants belonging to different orders—Brassicales, Solanales, Apiales, and Poales, respectively. A. thaliana is a popular model organism with a small, well-studied genome; the other species were chosen as important agricultural crops, the information on which could be useful for practical reasons. The data on the genomic sequences and the number of annotated genes were obtained from the website https://plants.ensembl.org. Table 1 shows the characteristics of the four plant genomes and the links to download the genomic sequences and gff3 annotation files.

We searched for DR families in plant genomic sequences using the IP algorithm (described in detail in Section 2). The calculations were carried out using the program located on the website of our institute at http://victoria.biengi.ac.ru/shddr/auth/login (accessed on 15 May 2025).

First, family matrices were created for as long as the number of repeats in each new family exceeded 300. This threshold of 300 repeats per new DR family, which was also used in our earlier studies [38,39,40], was chosen because simulations on random sequences show that the average number of elements in a family is 122 with a standard deviation of 12; a family size of 300 elements therefore lies (300 − 122)/12 ≈ 14.8 standard deviations above the random expectation, i.e., well beyond 10σ. As a result, we obtained 35, 26, 35, and 54 DR families for A. thaliana, C. annuum, D. carota, and Z. mays, respectively. It should be noted that there was no correlation between the genome size and the number of identified repeat families. For example, among the four species, C. annuum has the largest genome but the smallest number of DR families.

The family matrices were used as PWMs in the alignment algorithm to find DRs. The identified sequences were considered as potential repeats if statistical significance Z was greater than the initial level Z₀ = 4.0. Scanning was performed in both forward and reverse DNA strands. The number of fragments (potential DRs) for which Z ≥ Z₀ was denoted as N(Z₀). To determine the reliability of our predictions, we performed the same calculations for random sequences and determined the FDR. Random sequences were obtained by random permutation of symbols in the original genomic sequence and were generated for each genome. If Nr(Z₀) is the number of fragments with Z ≥ Z₀ in the random sequence, then FDR(Z₀) = Nr(Z₀)/(N(Z₀) + Nr(Z₀)). FDR values for Z₀ = 4.0 in the genomes of A. thaliana, C. annuum, D. carota, and Z. mays were 3.0, 2.1, 6.1, and 1.6%, respectively. We also determined the FDR for higher Z₀ levels. The parameters of DRs found in the plant genomes are shown in Table 2.

It can be seen that the total number of repeats was proportional to the genome size of the plant, whereas the number of repeats on the forward strand was approximately the same as that on the reverse strand in all four species. The percentage of DR coverage in the genome ranged from 36% in D. carota to almost 50% in Z. mays, and the average repeat length varied from 465 to 530 bp. A more detailed distribution of repeat lengths is shown in Figure 1. The results indicate that in A. thaliana, C. annuum, and Z. mays, most DRs were over 500 bp long, whereas in D. carota, there were peaks corresponding to DR lengths of 200–300 bp and 500–600 bp.

Consensuses of DR families were analyzed using Weblogo v.2.8.2 software [42]. For this purpose, repeats with Z ≥ 5.0 were selected from each family and their multiple alignment was constructed. The Weblogo consensuses are presented in supplementary archive file consensus.zip which includes subdirectories and files, consensus/<org>/consensus<n>.png, where <org> is athaliana, cannuum, dcarota, or zmays and n is the family number.

The lengths of the DR family consensuses are given in Table S1 of the supplement.xlsx file. For all species, most consensuses were longer than 450 bp; shorter consensuses were detected in Z. mays and D. carota. However, Z. mays had only three short repeats, whereas D. carota carried many DRs of 200–300 bp (Figure 1c), which may indicate that the repeats in carrot were fragmented. The longest consensuses among the species were observed in the C. annuum genome.

3.2. Periodicity in DRs

Many bioinformatics methods which search for DRs can identify relatively conserved and short (about 200 bp) DRs [40]. Since the DRs identified by the IP method were almost three times longer, it was necessary to check whether they were tandem duplications of shorter repeats. There are several methods to search for periodicity [43,44,45]. A classical approach to solve this problem is the Fourier transform [46,47,48], which was used here.

We calculated the spectral function for the frequency matrix M_f(i,j) of each DR family (Section 2), which had 16 rows and 600 columns, and determined the following:

x(i) = \sum_{j = 1}^{600} M_{f}(i, j), \quad y(j) = \sum_{i = 1}^{16} M_{f}(i, j), \quad N = \sum_{i = 1}^{16} \sum_{j = 1}^{600} M_{f}(i, j) \qquad (1)

v(i, j) = \frac{x(i) \times y(j)}{N^{2}}, \quad w(i, j) = \frac{M_{f}(i, j) - N \times v(i, j)}{\sqrt{N \times v(i, j) \times (1 - v(i, j))}} \qquad (2)
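Equations (1) and (2) can be followed step by step in the Python sketch below, which converts a frequency matrix into the normalized deviations w(i,j). The function name and the handling of cells with zero expected variance (left at 0.0) are assumptions of this example; the matrices in the usage lines are toy data.

```python
import math


def normalized_deviation(mf):
    """Compute w(i, j) from a frequency matrix M_f per Equations (1)-(2)."""
    rows, cols = len(mf), len(mf[0])
    x = [sum(mf[i]) for i in range(rows)]                          # row sums x(i)
    y = [sum(mf[i][j] for i in range(rows)) for j in range(cols)]  # column sums y(j)
    n = sum(x)                                                     # grand total N
    if n == 0:
        raise ValueError("frequency matrix is empty")
    w = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            v = x[i] * y[j] / n ** 2       # expected cell probability v(i, j)
            var = n * v * (1 - v)          # binomial variance of the cell count
            if var > 0:                    # assumption: leave 0.0 otherwise
                w[i][j] = (mf[i][j] - n * v) / math.sqrt(var)
    return w


w_diag = normalized_deviation([[10, 0], [0, 10]])  # strong position dependence
w_flat = normalized_deviation([[5, 5], [5, 5]])    # no position dependence
```

Each w(i,j) measures how far the observed count deviates from the count expected if dinucleotide type and position were independent, in units of the binomial standard deviation.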

Next, we computed the Fourier transform of each row of matrix w(i,j). The Monte Carlo method was used to determine Z(i,k), the statistical significance of the difference between the spectral function and the random value, for period lengths k from 2 to 200. Z(i,k) had an approximately standard normal distribution. To calculate the overall statistical significance for all rows, we took the following:

Z_{1}(k) = \frac{1}{16} \sum_{i = 1}^{16} Z(i, k) \qquad (3)

The dependence of Z1 on k was plotted for all repeat families of the four species. An example of such dependence in the 10th DR family of the C. annuum genome is shown in Figure 2. We considered that periodicity with a period length of k₀ was present in the frequency matrix of a DR family if Z1(k₀) ≥ 5. Thus, in the 10th family, we could identify periods with lengths of 3 and 119 bp.
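The periodicity test can be illustrated with the following Python sketch. The details of the Monte Carlo procedure are not given in the text, so the null model used here (random permutation of the values within a row, with Z computed as the deviation of the observed spectral power from the permutation mean in units of the permutation standard deviation) is an assumption, as are the function names, the number of shuffles, and the toy input.

```python
import cmath
import random
import statistics


def spectral_power(row, period):
    """Squared magnitude of the Fourier component of row at the given period."""
    s = sum(val * cmath.exp(-2j * cmath.pi * j / period)
            for j, val in enumerate(row))
    return abs(s) ** 2 / len(row)


def monte_carlo_z(row, period, n_shuffles=200, seed=0):
    """Z(i, k): observed power versus powers of shuffled rows (assumed null)."""
    rng = random.Random(seed)
    observed = spectral_power(row, period)
    shuffled = list(row)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null.append(spectral_power(shuffled, period))
    sd = statistics.pstdev(null)
    return (observed - statistics.mean(null)) / sd if sd > 0 else 0.0


def z1(w, period, **kwargs):
    """Z1(k): average of Z(i, k) over all rows of w, as in Equation (3)."""
    return sum(monte_carlo_z(row, period, **kwargs) for row in w) / len(w)


# Toy matrix with an exact 3-position period in every row.
toy_w = [[1.0, 0.0, 0.0] * 100 for _ in range(2)]
has_period_3 = z1(toy_w, 3) >= 5
has_period_7 = z1(toy_w, 7) >= 5
```

In this toy example, the period of 3 passes the Z1 ≥ 5 criterion and the period of 7 does not.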

The list of periods found in all DR families of the four plants is given in Table S2 of the supplement.xlsx file. Only a few families (5 out of 150) had periods with a length of about 200 bp, which points to the presence of 2–3 diffuse copies of these repeats within the identified DRs. However, this phenomenon was uncommon, and in most families with periodicity (59 out of 79), the period length was 3 bp. Such periodicity is a well-known property of coding DNA regions [49], suggesting that some DRs are insertions into the coding sequences or have been so in the past.

3.3. Database of DRs Found in the Plant Genomes

As a result of DR search in the four plant genomes, we obtained data on repeat family matrices and chromosomal DR locations and deposited this information in the PlantDRs 1.0 database, which is located on the Research Center of Biotechnology RAS server (link: http://victoria.biengi.ac.ru/cgi-bin/ipdisprep/index.cgi; accessed on 15 May 2025). In total, the database contains 5,392,216 DRs belonging to 150 families.

PlantDRs was implemented in SQLite, and the interface to it was written in Perl. The start page contains a filter at the top, which makes it possible to select DRs with specific characteristics (Figure 3). The following fields are used to set the filtering parameters:

Chromosome of a specific organism (field: Organism). The chromosome number is selected from the list and is accompanied by information about the chromosome length, which should be taken into account when entering the data on repeat positions in the following fields.
DR family (field: Position weight matrix). Here, the repeat family matrix associated with the organism in question is selected; therefore, the species and the following family number should be selected from the list. If a chromosome of a specific organism has been previously selected, you need to make sure that the family matrix corresponds to the same biological species.
Internal identifier in the database (field: Record ID). The identifier is an integer, and a range of identifier values can be specified.
Left and right coordinates in the chromosome (field: Position in chr.).
Range of values for the statistical significance of the repeat Z (field: Significance).
DNA strand (forward “+” or reverse “−”) on which the repeat is found (field: DNA string).

A filter parameter should be set to “Any” if it does not need to be specified.

Below the filter input fields, there are two buttons: «Search» and «Export data». After clicking «Search», DRs are filtered according to the conditions specified above and displayed at the bottom of the screen in tabular form. In the Organism column, the chromosome number is accompanied by the >> symbol; clicking on it provides more detailed information on the repeat. In addition to the fields mentioned above, the detailed view provides a link to the sequence of the corresponding chromosome, the direction of the DNA strand, and the alignment of the repeat sequence relative to the family profile matrix. You can also download the family frequency matrix (Download button) or full information on the repeat (Export button). The file with the frequency matrix is named <xx>-<yy>_pwm<n>.txt, where <xx> and <yy> are the first two letters of the generic and specific names of the species, respectively, and <n> is the chromosome number. For example, for the first chromosome of A. thaliana, the file is ar-th_pwm1.txt.

When «Export data» is pressed, all DRs selected according to the filter conditions are downloaded to the user’s computer as a zip archive containing two or more text files: the corresponding frequency matrix files <xx>-<yy>_pwm<n>.txt and the records.csv file. The records.csv file contains the following comma-separated fields: record_id (internal identifier of the repeat in the database), pwm (family number), org (organism code in the form <xx>-<yy>), chromosome, organism (full name of the biological species), reference (link to the genomic sequence in the Ensembl Plants portal: https://plants.ensembl.org/index.html; accessed on 15 May 2025), beginning and end (left and right coordinates of the repeat in the chromosome, respectively), significance (statistical significance of the repeat), string (direction of the DNA strand), and alignment_0 and alignment_1 (v1 and v2 sequences, respectively, corresponding to the alignment).

The number of records downloaded per request is limited to 10,000 in order to avoid copying the entire database and to reduce the load on the server during parallel downloads.
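An exported records.csv file could be post-processed as in the following Python sketch, which keeps only repeats above a chosen significance level and reconstructs the matrix file name from the species name. The field names are those listed above; the presence of a header row, the function names, and the two sample records are assumptions made for illustration and are not real database entries.

```python
import csv
import io


def pwm_filename(species: str, n: int) -> str:
    """Build <xx>-<yy>_pwm<n>.txt from a binomial species name."""
    genus, epithet = species.lower().split()[:2]
    return f"{genus[:2]}-{epithet[:2]}_pwm{n}.txt"


def load_records(handle, min_significance: float = 4.0):
    """Read records.csv and keep repeats with significance >= threshold.

    Assumption: the first line of the file is a header with the field names.
    """
    records = []
    for row in csv.DictReader(handle):
        if float(row["significance"]) >= min_significance:
            row["beginning"] = int(row["beginning"])
            row["end"] = int(row["end"])
            row["significance"] = float(row["significance"])
            records.append(row)
    return records


# Hypothetical sample in the documented column order (alignments left empty).
SAMPLE = """record_id,pwm,org,chromosome,organism,reference,beginning,end,significance,string,alignment_0,alignment_1
1,3,ar-th,1,Arabidopsis thaliana,https://plants.ensembl.org/index.html,1000,1520,4.2,+,,
2,3,ar-th,1,Arabidopsis thaliana,https://plants.ensembl.org/index.html,9000,9580,5.6,-,,
"""

strong = load_records(io.StringIO(SAMPLE), min_significance=5.0)
name = pwm_filename("Arabidopsis thaliana", 1)
```

Filtering on the significance field after export has the same effect as setting a higher Z₀ in the web filter, which lowers the expected share of false positives in the selected set.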

Database help can be obtained using the “?” button.

4. Discussion

Earlier, we developed the IP method, which has been proven to be very effective in the detection of highly divergent DRs [39]. Using the genomes of prokaryotes [38,39] and the simplest plant, red alga C. merolae [40], we have shown that the IP algorithm can identify divergent repeats of 500–600 bp that are missed by other methods. It is important to note that DRs in bacterial genomes have rarely been detected before and that most of the annotated repeats in C. merolae are highly conserved and rather short (50–200 bp). Therefore, the current study of the genomes of four plants (A. thaliana, C. annuum, D. carota, and Z. mays) is a natural continuation of our previous work. The number of families that could be generated for the plant genomes was 35, 26, 35, and 54, respectively. It is evident that although the genome of C. annuum is much larger than those of A. thaliana and D. carota, it has fewer DR families. The second largest genome, that of Z. mays, has the maximum number of repeat families, which is an expected result since Z. mays contains multiple repeated sequences (including tandem repeats) that occupy up to 85% of its genome [7]. Our finding that DRs cover 50% of the maize genome is consistent with earlier reports.

In contrast to the number of DR families, the number of repeats was directly proportional to the plant genome size. It can be hypothesized that the IP method identified some structural elements that participate in genome formation. The most conserved among the created DR families could in fact correspond to the generally accepted TE classes. It is unclear whether more diffuse TE copies are functional; it can be suggested that these TEs represent certain markers within the genome, which are associated with its compaction.

For all DR families identified in this study, Z₀ ≥ 4.0 was used as a cutoff level to detect statistically significant repeats. However, the FDR differed among the analyzed species: the lowest value was observed for Z. mays and the highest for D. carota (Table 2). By raising the Z₀ level, it would be possible to decrease the errors. It is reasonable to expect that the adjustment of Z₀ to a specific repeat family could have increased the number of detected DRs. However, in this work, our aim was not so much to optimize the parameters of the algorithm but rather to determine whether divergent DRs, which are genomic markers, are widespread in plants.

Analysis of DR periodicity showed that the predominant period length was 3 bp, which could be related to the triplet periodicity of the coding DNA sequences.

The obtained results are accessible to a broad audience interested in repeated DNA sequences of plants. We created the PlantDRs 1.0 database of DRs in the plant genomes, which is available via the internet. The search can be tailored by setting filter parameters according to repeat characteristics; in particular, it is possible to set a Z₀ value to obtain a fraction of DRs with a lower false discovery ratio. Currently, the database provides the information for the genomes of A. thaliana, C. annuum, D. carota, and Z. mays, but in the future, we plan to extend PlantDRs by depositing the data on DRs in other plant species.

5. Conclusions

The paper presents the results of the search for DRs in the genomes of four plants. The search for repeats was carried out using the IP method, which has no analogs for detecting highly divergent repeats. DRs are combined into the PlantDRs database, which includes 150 repeat families and 5,392,216 DRs. It will be expanded as more plant genomes are tested for DRs. We hope that the data on DRs presented here will form the basis for further studies of the role and function of DRs in plant genomes.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/data10070111/s1, consensus.zip: Weblogo consensuses for DR families for the genomes of A. thaliana, C. annuum, D. carota, and Z. mays; supplement.xlsx: Table S1. Consensus length for different DR families; Table S2. Periodicity in the DR family profiles of different plants.

Author Contributions

Conceptualization, E.K.; methodology, E.K. and V.R.; software, D.K. and E.K.; validation, V.R. and D.K.; formal analysis, V.R.; investigation, V.R. and E.K.; resources, D.K.; data curation, V.R.; writing—original draft preparation, V.R.; writing—review and editing, V.R. and E.K.; visualization, D.K.; supervision, E.K. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the Russian Science Foundation (project No. 24-24-00031).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Links to all data analyzed and obtained during the work are located in the article and Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DRs	Dispersed repeats
IP	Iterative procedure
PWM	Position weight matrix

References

1. Galindo-González, L.; Mhiri, C.; Deyholos, M.K.; Grandbastien, M.-A. LTR-retrotransposons in plants: Engines of evolution. Gene 2017, 626, 14–25.
2. Ujino-Ihara, T. Stress-responsive retrotransposable elements in conifers. Genes Genet. Syst. 2022, 97, 185–191.
3. Dubin, M.J.; Scheid, O.M.; Becker, C. Transposons: A blessing curse. Curr. Opin. Plant Biol. 2018, 42, 23–29.
4. Makarevitch, I.; Waters, A.J.; West, P.T.; Stitzer, M.; Hirsch, C.N.; Ross-Ibarra, J.; Springer, N.M. Transposable elements contribute to activation of maize genes in response to abiotic stress. PLoS Genet. 2015, 11, e1004915.
5. Hirsch, C.D.; Springer, N.M. Transposable element influences on gene expression in plants. Biochim. Biophys. Acta Gene Regul. Mech. 2017, 1860, 157–165.
6. Liu, B.; Zhao, M. How transposable elements are recognized and epigenetically silenced in plants? Curr. Opin. Plant Biol. 2023, 75, 102428.
7. Schnable, P.S.; Ware, D.; Fulton, R.S.; Stein, J.C.; Wei, F.; Pasternak, S.; Liang, C.; Zhang, J.; Fulton, L.; Graves, T.A.; et al. The B73 maize genome: Complexity, diversity, and dynamics. Science 2009, 326, 1112–1115.
8. Kalendar, R.; Sabot, F.; Rodriguez, F.; Karlov, G.I.; Natali, L.; Alix, K. Editorial: Mobile elements and plant genome evolution, comparative analyzes and computational tools. Front. Plant Sci. 2021, 12.
9. Hassan, A.H.; Mokhtar, M.M.; El Allali, A. Transposable elements: Multifunctional players in the plant genome. Front. Plant Sci. 2023, 14, 1330127.
10. Rymen, B.; Ferrafiat, L.; Blevins, T. Non-coding RNA polymerases that silence transposable elements and reprogram gene expression in plants. Transcription 2020, 11, 172–191.
11. Pulido, M.; Casacuberta, J.M. Transposable element evolution in plant genome ecosystems. Curr. Opin. Plant Biol. 2023, 75, 102418.
12. Mojica, E.A.; Kültz, D. Physiological mechanisms of stress-induced evolution. J. Exp. Biol. 2022, 225.
13. Roquis, D.; Robertson, M.; Yu, L.; Thieme, M.; Julkowska, M.; Bucher, E. Genomic impact of stress-induced transposable element mobility in Arabidopsis. Nucleic Acids Res. 2021, 49, 10431–10447.
14. Liu, P.; Cuerda-Gil, D.; Shahid, S.; Slotkin, R.K. The epigenetic control of the transposable element life cycle in plant genomes and beyond. Annu. Rev. Genet. 2022, 56, 63–87.
15. Vicient, C.M.; Casacuberta, J.M. Impact of transposable elements on polyploid plant genomes. Ann. Bot. 2017, 120, 195–207.
16. Jangam, D.; Feschotte, C.; Betrán, E. Transposable element domestication as an adaptation to evolutionary conflicts. Trends Genet. 2017, 33, 817–831.
17. Capy, P. Taming, Domestication and Exaptation: Trajectories of transposable elements in genomes. Cells 2021, 10, 3590.
18. Romano, N.C.; Fanti, L. Transposable Elements: Major players in shaping genomic and evolutionary patterns. Cells 2022, 11, 1048.
19. Benoit, M.; Drost, H.-G.; Catoni, M.; Gouil, Q.; Lopez-Gomollon, S.; Baulcombe, D.; Paszkowski, J.; Scheid, O.M. Environmental and epigenetic regulation of Rider retrotransposons in tomato. PLoS Genet. 2019, 15, e1008370.
20. Sun, X.; Xiang, Y.; Dou, N.; Zhang, H.; Pei, S.; Franco, A.V.; Menon, M.; Monier, B.; Ferebee, T.; Liu, T.; et al. The role of transposon inverted repeats in balancing drought tolerance and yield-related traits in maize. Nat. Biotechnol. 2023, 41, 120–127.
21. Li, X.; Guo, K.; Zhu, X.; Chen, P.; Li, Y.; Xie, G.; Wang, L.; Wang, Y.; Persson, S.; Peng, L. Domestication of rice has reduced the occurrence of transposable elements within gene coding regions. BMC Genom. 2017, 18, 55.
22. Jung, S.; Venkatesh, J.; Kang, M.-Y.; Kwon, J.-K.; Kang, B.-C. A non-LTR retrotransposon activates anthocyanin biosynthesis by regulating a MYB transcription factor in Capsicum annuum. Plant Sci. 2019, 287, 110181.
23. Arvas, Y.E.; Marakli, S.; Kaya, Y.; Kalendar, R. The power of retrotransposons in high-throughput genotyping and sequencing. Front. Plant Sci. 2023, 14, 1174339.
24. Kumar, A.; Hirochika, H. Applications of retrotransposons as genetic tools in plant biology. Trends Plant Sci. 2001, 6, 127–134.
25. Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 2015, 6, 11.
26. Storer, J.; Hubley, R.; Rosen, J.; Wheeler, T.J.; Smit, A.F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA 2021, 12, 2.
27. Mokhtar, M.M.; Alsamman, A.M.; Abd-Elhalim, H.M.; El Allali, A.; Kashkush, K. CicerSpTEdb: A web-based database for high-resolution genome-wide identification of transposable elements in Cicer species. PLoS ONE 2021, 16, e0259540.
28. Yi, F.; Jia, Z.; Xiao, Y.; Ma, W.; Wang, J. SPTEdb: A database for transposable elements in salicaceous plants. Database 2018, 2018, bay024.
29. Xu, Z.; Liu, J.; Ni, W.; Peng, Z.; Guo, Y.; Ye, W.; Huang, F.; Zhang, X.; Xu, P.; Guo, Q.; et al. GrTEdb: The first web-based database of transposable elements in cotton (Gossypium raimondii). Database 2017, 2017, bax013.
30. Du, J.; Grant, D.; Tian, Z.; Nelson, R.T.; Zhu, L.; Shoemaker, R.C.; Ma, J. SoyTEdb: A comprehensive database of transposable elements in the soybean genome. BMC Genom. 2010, 11, 113.
31. Chen, J.; Hu, Q.; Zhang, Y.; Lu, C.; Kuang, H. P-MITE: A database for plant miniature inverted-repeat transposable elements. Nucleic Acids Res. 2014, 42, D1176–D1181.
32. Vassetzky, N.S.; Kramerov, D.A. SINEBase: A database and tool for SINE analysis. Nucleic Acids Res. 2013, 41, D83–D89.
33. Mokhtar, M.M.; Alsamman, A.M.; El Allali, A. PlantLTRdb: An interactive database for 195 plant species LTR-retrotransposons. Front. Plant Sci. 2023, 14, 1134627.
Nicolas, J.; Tempel, S.; Fiston-Lavier, A.-S.; Cherif, E. Finding and Characterizing Repeats in Plant Genomes. In Plant Bioinformatics. Methods in Molecular Biology; Edwards, D., Ed.; Humana: New York, NY, USA, 2023; Volume 2443, pp. 327–385. [Google Scholar] [CrossRef]
Storer, J.M.; Hubley, R.; Rosen, J.; Smit, A.F.A. Methodologies for the de novo discovery of transposable element families. Genes 2022, 13, 709. [Google Scholar] [CrossRef] [PubMed]
Rodriguez, F.; Arkhipova, I.R. An Overview of Best Practices for Transposable Element Identification, Classification, and Annotation in Eukaryotic Genomes. In Methods in Molecular Biology; Branco, M.R., de Mendoza Soler, A., Eds.; Humana: New York, NY, USA, 2023; Volume 2607, pp. 1–23. [Google Scholar] [CrossRef]
Marsano, R.M.; Dimitri, P. Constitutive heterochromatin in eukaryotic genomes: A mine of transposable elements. Cells 2022, 11, 761. [Google Scholar] [CrossRef]
Korotkov, E.; Korotkova, M. Detection of dispersed repeats in the genomes of bacteria from different phyla. IPSJ Trans. Bioinform. 2024, 17, 55–63. [Google Scholar] [CrossRef]
Korotkov, E.; Suvorova, Y.; Kostenko, D.; Korotkova, M. Search for Dispersed repeats in bacterial genomes using an iterative procedure. Int. J. Mol. Sci. 2023, 24, 10964. [Google Scholar] [CrossRef]
Rudenko, V.; Korotkov, E. Study of dispersed repeats in the Cyanidioschyzon merolae genome. Int. J. Mol. Sci. 2024, 25, 4441. [Google Scholar] [CrossRef]
Pugacheva, V.; Korotkov, A.; Korotkov, E. Search of latent periodicity in amino acid sequences by means of genetic algorithm and dynamic programming. Stat. Appl. Genet. Mol. Biol. 2016, 15, 381–400. [Google Scholar] [CrossRef]
Crooks, G.E.; Hon, G.; Chandonia, J.-M.; Brenner, S.E. WebLogo: A sequence logo generator. Genome Res. 2004, 14, 1188–1190. [Google Scholar] [CrossRef]
Benson, G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 1999, 27, 573–580. [Google Scholar] [CrossRef] [PubMed]
Serizay, J.; Ahringer, J. periodicDNA: An R/Bioconductor package to investigate k-mer periodicity in DNA. F1000Research 2021, 10, 141. [Google Scholar] [CrossRef] [PubMed]
Suvorova, Y.M.; Korotkova, M.A.; Korotkov, E.V. Comparative analysis of periodicity search methods in DNA sequences. Comput. Biol. Chem. 2014, 53, 43–48. [Google Scholar] [CrossRef] [PubMed]
Jin, H.; Rube, H.T.; Song, J.S. Categorical spectral analysis of periodicity in nucleosomal DNA. Nucleic Acids Res. 2016, 44, 2047–2057. [Google Scholar] [CrossRef]
Silverman, B.D.; Linsker, R. A measure of DNA periodicity. J. Theor. Biol. 1986, 118, 295–300. [Google Scholar] [CrossRef]
Yin, C.; Wang, J. Periodic power spectrum with applications in detection of latent periodicities in DNA sequences. J. Math. Biol. 2016, 73, 1053–1079. [Google Scholar] [CrossRef]
Eskesen, S.T.; Eskesen, F.N.; Kinghorn, B.; Ruvinsky, A. Periodicity of DNA in exons. BMC Mol. Biol. 2004, 5, 12. [Google Scholar] [CrossRef]

Figure 1. DR length distribution in (a) A. thaliana, (b) C. annuum, (c) D. carota, and (d) Z. mays.

Figure 2. The periodicity profile of the 10th DR family of the C. annuum genome (k is shown in a logarithmic scale).

Figure 3. Query page of the PlantDRs database.

Table 1. Genome characteristics of four plant species.

Organism, Path to Genome Data	Genome Size, bp	Chromosomes	Genes (Excluding ncRNA)	DR Families
Arabidopsis thaliana (https://plants.ensembl.org/Arabidopsis_thaliana/Info/Index accessed on 15 May 2025)	119,146,348	5	27,445	35
Capsicum annuum (https://plants.ensembl.org/Capsicum_annuum/Info/Index accessed on 15 May 2025)	2,589,160,526	12	31,600	26
Daucus carota (https://plants.ensembl.org/Daucus_carota/Info/Index accessed on 15 May 2025)	361,968,048	9	30,824	35
Zea mays (https://plants.ensembl.org/Zea_mays/Info/Index accessed on 15 May 2025)	2,131,846,805	10	43,459	54

Table 2. Number of DRs on the forward (+) and reverse (−) DNA strands and chromosome coverage by DRs in the plant genomes.

Species	Z	FDR	Repeats (+)	Repeats (−)	Repeats (total)	Average length, bp	Min length, bp	Max length, bp	Coverage, bp	Coverage, %
A. thaliana	4.0	3.0%	59,316	57,734	117,050	520.13	27	600	55,361,784	46.5
A. thaliana	5.0	1.0%	24,365	22,462	46,827	521.00	38	600	22,636,501	19.0
A. thaliana	6.0	0.4%	9,820	8,819	18,639	518.99	38	600	9,132,775	7.7
A. thaliana	7.0	0.1%	4,140	3,692	7,832	512.86	46	600	3,850,736	3.2
C. annuum	4.0	2.1%	1,259,727	1,265,244	2,524,971	530.20	47	600	1,227,076,568	47.4
C. annuum	5.0	0.6%	778,107	781,446	1,559,553	534.61	48	600	773,308,166	29.9
C. annuum	6.0	0.2%	596,646	599,577	1,196,223	536.71	54	600	598,836,871	23.1
C. annuum	7.0	0.1%	515,231	517,139	1,032,370	538.09	54	600	517,957,179	20.0
D. carota	4.0	6.1%	167,250	163,078	330,328	465.78	24	600	131,926,275	36.4
D. carota	5.0	3.4%	90,025	87,235	177,260	440.29	24	600	64,507,019	17.8
D. carota	6.0	1.9%	59,320	57,588	116,908	412.01	31	600	38,591,244	10.7
D. carota	7.0	0.8%	44,326	43,157	87,483	388.61	39	600	26,760,676	7.4
Z. mays	4.0	1.6%	1,209,144	1,210,723	2,419,867	511.81	28	600	1,062,814,683	49.9
Z. mays	5.0	0.5%	869,739	870,844	1,740,583	520.65	32	600	771,959,857	36.2
Z. mays	6.0	0.1%	722,664	723,102	1,445,766	524.51	35	600	646,688,520	30.3
Z. mays	7.0	0.0%	641,103	641,307	1,282,410	527.13	42	600	578,910,853	27.2
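
The coverage percentages and repeat totals in Table 2 can be cross-checked against the genome sizes in Table 1. The following is a minimal sketch of that arithmetic, not part of the authors' pipeline: it assumes coverage (%) is the covered bp divided by the genome size from Table 1, and the names `GENOME_SIZE_BP`, `Z4_ROWS`, `coverage_percent`, and `summarize` are illustrative only. All numbers are copied from the Z = 4.0 rows of Table 2 and from Table 1.

```python
# Genome sizes in bp, copied from Table 1.
GENOME_SIZE_BP = {
    "A. thaliana": 119_146_348,
    "C. annuum": 2_589_160_526,
    "D. carota": 361_968_048,
    "Z. mays": 2_131_846_805,
}

# Z = 4.0 rows of Table 2: (repeats on + strand, repeats on - strand, covered bp).
Z4_ROWS = {
    "A. thaliana": (59_316, 57_734, 55_361_784),
    "C. annuum": (1_259_727, 1_265_244, 1_227_076_568),
    "D. carota": (167_250, 163_078, 131_926_275),
    "Z. mays": (1_209_144, 1_210_723, 1_062_814_683),
}


def coverage_percent(covered_bp, genome_bp):
    """Assumed definition: share of the genome covered by DRs, in percent."""
    return round(100.0 * covered_bp / genome_bp, 1)


def summarize(rows, genome_sizes):
    """Recompute the total repeat count and coverage percentage per species."""
    summary = {}
    for species, (plus, minus, covered_bp) in rows.items():
        summary[species] = {
            "total": plus + minus,
            "coverage_pct": coverage_percent(covered_bp, genome_sizes[species]),
        }
    return summary


summary = summarize(Z4_ROWS, GENOME_SIZE_BP)

# Sum of DRs over the four genomes at Z = 4.0.
grand_total = sum(entry["total"] for entry in summary.values())

for species, entry in summary.items():
    print(f"{species}: {entry['total']:,} DRs, {entry['coverage_pct']}% coverage")
print(f"All species: {grand_total:,} DRs")
```

Under this assumed definition, the recomputed percentages agree with the Coverage column at Z = 4.0, and the sum over the four species equals the 5,392,216 DRs reported for the database.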

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
