Search for Dispersed Repeats in Bacterial Genomes Using an Iterative Procedure

We have developed a de novo method for the identification of dispersed repeats based on the use of random position-weight matrices (PWMs) and an iterative procedure (IP). The created algorithm (IP method) allows detection of dispersed repeats for which the average number of substitutions between any two repeats per nucleotide (x) is less than or equal to 1.5. We have shown that all previously developed methods and algorithms (RED, RECON, and some others) can only find dispersed repeats for x ≤ 1.0. We applied the IP method to find dispersed repeats in the genomes of E. coli and nine other bacterial species. We identified three families of approximately 1.09 × 10^6, 0.64 × 10^6, and 0.58 × 10^6 DNA bases, respectively, constituting almost 50% of the complete E. coli genome. The length of the repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain one to three families of dispersed repeats with a total number of 10^3 to 6 × 10^3 copies. The existence of such highly divergent repeats could be associated with the presence of a single-type triplet periodicity in various genes or with the packing of bacterial DNA into a nucleoid.


Introduction
Dispersed repetitive DNA sequences constitute a significant portion of existing genomes. In the human genome, more than one third is occupied by dispersed repeats, which are primarily copies of transposable elements [1], whereas in other organisms, the proportion could be much higher [2,3]. Accurate localization of dispersed repeats in the sequenced genome can help determine the functional significance and evolutionary origin of genomic sequences. Considering that the genomes of many eukaryotic organisms have already been sequenced and those of all living organisms are expected to be sequenced in the next decade according to the Earth BioGenome Project [4], the search for dispersed repeats in various genomes emerges as an important task of bioinformatics.
At present, many mathematical methods and algorithms have been developed for the identification of dispersed repeats [5]. Most of the programs contain two independent parts: the first (A) generates a dispersed repeat sequence, whereas the second (B) searches the genome under study for sequences similar to those generated in part A. Part A can be of two types: A1 uses a library of already published dispersed repeats, whereas A2 uses dispersed repeats created de novo by various mathematical methods. The dispersed repeats of the library or those created de novo are used in part B to search for similar sequences in the genome.
Programs such as RepeatMasker [6], Censor [7], and MaskerAid [8] are A1-based. The mathematical methods implemented in these programs detect repeats in the genomes by comparison with Repbase [9], a database of repetitive sequences. Thus, the detection of new repeats by these methods is limited to the ones existing in the library.
[...] accumulated many mutations and their similarity is less than 45%, it is very difficult to detect such a family de novo in a 4 × 10^6-base-long sequence by A1 methods. A similar x value is shown in [5]. It should be noted that, in reality, x may be smaller because, in addition to base substitutions, indels could be present in dispersed repeats. The use of word counting and signature-based methodologies to search for dispersed repeats cannot significantly improve the situation, because at x = 0.75 sequence similarity is already at the level of random noise (~25%) and word frequencies may differ insignificantly from those expected for random sequences.
In this study, we showed that the known de novo methods could find dispersed repeats with x ≤ 1.00, whereas part B-based methods such as nHMMER could do the same with x ≤ 3.0 using a previously created multiple alignment, indicating that the main constraints on the identification of dispersed repeats are related to part A. Thus, a family of unknown dispersed repeats in the genome which has accumulated a large number of base substitutions and indels (x > 1.0) is unlikely to be detected by A2-based methods. Consequently, the application of part B-based methods is impossible because there is no sequence, consensus, or multiple alignment to search for the dispersed repeats of such a family in the genome. Therefore, highly divergent repeat families (x > 1.0), which could be present in already sequenced genomes, cannot be detected because of the limitations of A2-based methods.
Here, we report a method for identifying dispersed repeats based on the use of random position-weight matrices (PWMs) and an iterative procedure (IP). The developed algorithm allows for the detection of dispersed repeats with x ≤ 1.5, which significantly exceeds the capacity of all modern methods based on part A2 (x ≤ 1.0). We applied the new IP method to search for dispersed repeats in the genome of E. coli and nine other bacterial species and showed that the bacterial genomes contained families of dispersed repeats with copy numbers >10^3 and lengths of 400-600 bases.
We chose bacterial genomes for this analysis because they lack long dispersed repeats (>300 bp) with copy numbers greater than 10^3 and, therefore, the results obtained cannot be related to known families of dispersed repeats. Additionally, bacterial genomes are significantly shorter than eukaryotic genomes and their analysis does not require very high computational power. At the same time, weakly similar dispersed repeats may be present in bacterial genomes due to the stacking of bacterial DNA in the nucleoid [24]. Possible functional significance and the origin of the dispersed repeats are discussed.

Using Model Sequences for the IP Method
We generated a random sequence S_test with a length of 4 × 10^6 bases and randomly inserted into it sequences from set Q(x), which contained 10^3 sequences with x substitutions per nucleotide relative to each other. Overall, 500 sequences from set Q(x) were introduced into sequence S_test in the forward orientation and 500 in the reverse orientation. Then, the IP method was applied to identify the repeats. The results shown in Table 1 (where columns 1-4 show the number of repeats found in the first four families) indicate that the repeats were detected in the first and second families (columns 1 and 2), whereas the other two families (columns 3 and 4) contained sequences combined randomly. The level of random noise was 145 ± 35 sequences in all four generated repeat families. The results in Table 1 indicate that the IP method clearly separated forward and reverse sequences into two families and consistently detected two repeat families with x up to 1.5.
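The construction of such a model sequence can be sketched as follows. This is only an illustration under stated assumptions, not the generator used in the study: the function names are hypothetical, each copy receives x/2 substitution events per site relative to a common ancestor (so that two copies differ by roughly x events per site), copies overwrite the random background on a non-overlapping grid instead of being inserted, and indels are omitted.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutate(seq, subs_per_site, rng):
    """Apply round(subs_per_site * len(seq)) substitution events at random sites.

    Sites are drawn with replacement, so multiple hits at one site are possible.
    """
    bases = list(seq)
    for _ in range(round(subs_per_site * len(bases))):
        pos = rng.randrange(len(bases))
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def build_model_sequence(total_len, n_repeats, repeat_len, x, seed=0):
    """Random background carrying n_repeats diverged copies of one ancestor.

    Assumption: copies are placed on a grid of non-overlapping slots and
    replace the background there, which keeps the total length fixed.
    Returns the sequence and a list of (start, strand) placements.
    """
    rng = random.Random(seed)
    ancestor = "".join(rng.choice("ACGT") for _ in range(repeat_len))
    background = [rng.choice("ACGT") for _ in range(total_len)]
    slots = sorted(rng.sample(range(total_len // repeat_len), n_repeats))
    placements = []
    for k, slot in enumerate(slots):
        copy = mutate(ancestor, x / 2.0, rng)
        strand = "+" if k % 2 == 0 else "-"  # half forward, half reverse
        if strand == "-":
            copy = reverse_complement(copy)
        start = slot * repeat_len
        background[start:start + repeat_len] = copy
        placements.append((start, strand))
    return "".join(background), placements
```

With `total_len=4_000_000`, `n_repeats=1000`, and `repeat_len=600`, the call mirrors the dimensions of the model sequence described above.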
The search for sequences similar to the PWM (Section 4.3) was performed in only one direction. Therefore, forward and reverse repeats could create two separate families; as a result, to conclusively determine the number of the identified families, we needed to check for possible association of two PWM families into a single family by taking into account inversion and complementarity. Such analysis indicated that the similarity of the PWMs of the three dispersed repeat families was statistically insignificant, indicating that the identified families cannot be combined. Below (Section 2.3), we show that the reason for this is the association of the dispersed repeat families found in the E. coli genome with triplet periodicity, which in different families is not similar with regard to inversion and complementarity.
A model sequence of 4 × 10^6 bases containing 500 repeats in the forward and 500 repeats in the reverse orientation was used; x is the average number of substitutions per nucleotide between family members.

Comparison of the IP Method with RED, RECON, RepeatMasker, BLAST, and nHMMER
A previous study has indicated that the known methods of de novo search for dispersed repeats cannot detect them if the number of mutations in the dispersed repeats that originate from a common ancestor exceeds 25-30 [5], which corresponds to x ≤ 0.6. Our calculations (Section 1) showed that by using pairwise comparison, such a search is possible at x ≤ 0.55, which roughly corresponds to the result obtained in [5]. We chose RED (which uses k-mers) [10] as one of the popular algorithms for finding repeats and applied it to identify 10^3 artificial repeats of 600 bases randomly scattered across a sequence of 4 × 10^6 bases. A total of nine sequences were created, for which dispersed repeats had different numbers of mutations relative to each other, i.e., x = 0-2.0. Figure 1 shows that RED found 100%, 40%, and practically 0% of dispersed repeats with x ≤ 1.0, x = 1.25, and x ≥ 1.5, respectively, indicating that RED could reliably identify dispersed repeats only with x ≤ 1.0. At the same time, the application of the IP method to the nine artificial sequences resulted in the detection of dispersed repeats with x ≤ 1.5 (Figure 1), which is a better result than that of RED.
We also examined the performance of RECON [11] in finding families of dispersed repeats with different degrees of divergence. The results indicate that, similar to RED, RECON could find dispersed repeats only with x ≤ 1.0. However, at x = 1.00, RECON, instead of finding just one repeat family, detected 108 families containing an overall number of 825 repeats, which is an incorrect result. Therefore, RECON could detect one family of repeats only with x ≤ 0.8.
In a previous study, the same analysis was performed using the RepeatMasker program, and the results indicate that it finds almost 100% of dispersed repeats for x ≤ 0.5 but only 25% and less than 5% for x = 0.75 and x = 1.0, respectively [25].
Many of the de novo methods mentioned in [5] use the BLAST program to compare sequences with themselves and then perform the assembly of dispersed repeats from the found similarities. Therefore, it was interesting to analyze the capability of BLAST to search for dispersed repeats in the same artificial sequence S(x) (4 × 10^6 bases) carrying 10^3 randomly inserted 600-base repeats from set Q(x), which had the same number of random mutations relative to original sequence S_m, two indels at random positions, and x substitutions per nucleotide between repeats. The word_size parameter in BLAST was chosen to be 4, which gave the best result. To evaluate the effectiveness of BLAST, we randomly created 100 sequences of S(x) containing repeats from different Q(x) (E_value was chosen to be 100). The E-value is the number of expected hits of similar score that could be found just by chance. Table 2 shows the average numbers of repeats per sequence S(x) for each x. The results indicate that BLAST could find pairwise similarity for x ≤ 1.0 but failed to do so for x > 1.0.

Figure 1. Comparative performance of the IP and RED methods in search for dispersed repeats in artificial sequences. The search was performed in an artificial sequence of 4 × 10^6 bases containing 10^3 repeats (each of 600 bases but with different x), which were randomly inserted from set Q(x) containing repeats with the same number of random mutations relative to original sequence S_m as well as indels at random positions. Black and white circles indicate the IP and RED methods, respectively; x is the average number of substitutions per nucleotide and N is the number of repeats identified.

Table 2. Numbers of dispersed repeats identified by the BLAST, nHMMER, and IP methods.
nHMMER  1004  1002  1004  1006  1002  1002  1004  1003   992   668    21     0
IP      1068  1006  1003  1002  1005  1002  1004  1003  1065   907   221    80
Model sequence S(x) of 4 × 10^6 bases containing 10^3 dispersed repeats was used; x is the average number of base substitutions per nucleotide between two dispersed repeats.
Thus, previous findings with the de novo A2-based methods [5] and the results of the present analysis suggest that the currently used methods can identify dispersed repeats with x ≤ 1.0 but skip those with x > 1.0. In contrast, the IP method can identify repeats with x ≤ 1.5, i.e., those missed by the other methods.
We also compared the performance of the IP method with that of nHMMER, which is one of the best part B-based methods to find dispersed repeats with already known multiple alignments. In this test, we aligned all 10^3 sequences from set Q(x), and since the placement of indels in these sequences with respect to S_m was known, it was not difficult to construct a multiple alignment, which was used by nHMMER to create a hidden Markov model and search for dispersed repeats in sequence S(x).
We also searched for dispersed repeats in sequence S(x) using the IP method. To correctly compare the IP method with nHMMER in search for repeats with known multiple alignments, we used S_m as a sequence from the library to create a PWM (Figure 2, step 4), when indices i and j were limited to 1 (i.e., one cycle of dispersed repeat search). The created PWM had a length of 600 bases and 16 rows and was filled in as: PWM(n,i) = PWM(n,i) + 1 for all i from 2 to 600 (here, n = s_m(i−1) + 4s_m(i) and i is the column number). Then, for i = 1, n = s_m(600) + 4s_m(1).
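The filling rule for the 16-row matrix can be illustrated with a short sketch. It is an assumption-laden illustration rather than the original implementation: the function name is hypothetical, and bases are coded 0-3 (A = 0, T = 1, C = 2, G = 3) so that the row index n = s_m(i−1) + 4s_m(i) falls in the range 0-15; the first column pairs the last base with the first one, as in the rule for i = 1 above.

```python
# Assumed 0-based coding; the text codes bases as A = 1, T = 2, C = 3, G = 4.
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def dinucleotide_count_matrix(sm):
    """Build a 16 x len(sm) count matrix from a single sequence sm.

    Column i receives one count in row n = code(sm[i-1]) + 4 * code(sm[i]).
    For i = 0, Python's sm[-1] supplies the last base (circular wrap).
    """
    length = len(sm)
    pwm = [[0] * length for _ in range(16)]
    for i in range(length):
        prev_base = CODE[sm[i - 1]]
        cur_base = CODE[sm[i]]
        pwm[prev_base + 4 * cur_base][i] += 1
    return pwm
```

Because only one sequence contributes, every column holds exactly one non-zero cell; this is the starting matrix for a single search cycle.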
The results shown in Table 2 indicate that the IP method, similar to nHMMER, could find dispersed repeats using a known alignment. Both methods perform reliably up to x ≤ 3.0; however, the IP method was slightly more efficient at x = 3.0.

Search for Dispersed Repeats in the E. coli Genome Using the IP Method
The escherichia_coli_str_k_12sbstr_mg1655_gca_000005845.ASM584v2.49 sequence was obtained from http://bacteria.ensembl.org/index.html/, accessed on 1 January 2023. To search for dispersed repeats in the E. coli genome with the IP method, we used length L_1 = 600 bases (Section 4.1). The results shown in Figure 3 indicate that, for a random sequence, the IP method generated families of 145 ± 35 repeats, because the iterative algorithm (Figure 2) can always capture a certain number of sequences and build PWMs. In the case of the E. coli genome, the respective numbers of dispersed repeats found in the three families were 2239, 1170, and 1024. The sizes of the other repeat families were close to random, whereas the probability of finding families as large as the first three in a random sequence of the same length as the genome is extremely low. The coordinates of the found repeat families and their alignment with the PWM (sequence S_2, Sections 4.3 and 4.5) are shown in Supplementary Materials in additional files fam1.txt, fam2.txt, and fam3.txt, and the PWMs created for these families are shown in files pwm1.txt, pwm2.txt, and pwm3.txt.
We also constructed an artificial sequence based on the E. coli genome, where the codons in each gene were randomly shuffled to ensure that any similarity of the coding sequences was absent but where the triplet periodicity was preserved because the first, second, and third codon positions did not change [26]. In Figure 3, the volume of the created families for this sequence is indicated by white circles. The results show that if the codons in the genes were conserved, dispersed repeat families of sufficiently large volumes could still be created, indicating that the identified repeat families were associated with the triplet periodicity of coding sequences [27].
Next, we created an artificial sequence containing the E. coli genome in which all non-coding sequences were randomly mixed and used it to search for dispersed repeat families with the IP method. Figure 4 shows that, in this case, we could still find families of dispersed repeats, indicating that the non-coding regions in the E. coli genome do not contain a significant number of dispersed repeats. This result was verified by calculating the proportion of non-coding regions in dispersed repeats of families 1, 2, and 3 (Table 3), which confirmed that most of the found repeat families were not associated with non-coding sequences.

Table 3. Distribution of the found repeats from families 1, 2, and 3 (Figure 3) according to the proportion of non-coding regions.
Family  0.0-0.1  0.1-0.2  0.2-0.3  0.3-0.4  0.4-0.5  0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0
1       …        129      58       36       22       7        3        2        1        25
2       918      97       60       32       13       11       4        4        3        28
3       709      116      72       43       29       15       8        2        3        27

Next, we analyzed the length distribution of the repeats in each of the three families (Figure 5). The repeats of the first family (485 bases) were slightly shorter than those of the second and third families (548 and 564 bases, respectively).

Triplet Periodicity of Dispersed Repeat Families in the E. coli Genome
Next, we investigated the triplet periodicity in the sequences of the three found repeat families. For this, we filled in matrix M(3,4) for each repeat sequence: M(f(s_2(i)) + 1, s_1(i)) = M(f(s_2(i)) + 1, s_1(i)) + 1 for i from 1 to L, where L is the repeat length, s_1(i) is a sequence element of the found repeat (sequence S_1, Section 4.3), and s_2(i) is a column sequence element of the PWM (sequence S_2, Section 4.3). We calculated function f(s_2(i)) = s_2(i) − 3int((s_2(i) − 0.1)/3.0) and found that it was 1, 2, or 3. If s_1(i) or s_2(i) was equal to zero (deletion), then 1 was not added to matrix M, and only i was incremented. After determining matrix M, we calculated mutual information I.
Then, we calculated the argument of normal distribution Z = (4I)^0.5 − (11.0)^0.5 for each sequence in the repeat family and plotted the Z distribution for all family members. Figure 6 shows the distribution for the first repeat family. For randomly shuffled sequences, the Z distribution was close to normal (black circles), but for repeat sequences without alignment with the PWM, the distribution was markedly shifted to the right (white circles), indicating that the coding sequences had triplet periodicity [26]. The Z distribution for sequences aligned with the PWM was shifted even more to the right compared with the two distributions mentioned above (black circles on a dashed line; Figure 6), indicating that the alignment of repeats with the PWM results in a clearer triplet periodicity.
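The calculation of Z for a single unaligned sequence can be sketched as follows. The expression for I is not given in the text above, so the sketch assumes the standard mutual information of a contingency table in natural-log units, I = Σ m(i,j) ln m(i,j) − Σ X(i) ln X(i) − Σ Y(j) ln Y(j) + L ln L, where X and Y are row and column sums. This choice agrees with the stated transform, since 2I is then approximately chi-square distributed with 6 degrees of freedom and (4I)^0.5 − (11)^0.5 is approximately standard normal. Function names are hypothetical, and positions are taken modulo 3 directly, i.e., without alignment to a PWM.

```python
import math

BASES = "ATCG"


def triplet_matrix(seq):
    """Count bases by position within a period of three (3 x 4 matrix)."""
    m = [[0] * 4 for _ in range(3)]
    for i, base in enumerate(seq):
        if base in BASES:  # skip N and other symbols
            m[i % 3][BASES.index(base)] += 1
    return m


def _xlogx(value):
    return value * math.log(value) if value > 0 else 0.0


def mutual_information(m):
    """Mutual information of a count matrix in nats (assumed standard form)."""
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    info = (sum(_xlogx(v) for row in m for v in row)
            - sum(_xlogx(v) for v in row_sums)
            - sum(_xlogx(v) for v in col_sums)
            + _xlogx(total))
    return max(0.0, info)  # guard against tiny negative rounding errors


def z_from_matrix(m):
    """Z = sqrt(4I) - sqrt(11), the normal-distribution argument from the text."""
    return math.sqrt(4.0 * mutual_information(m)) - math.sqrt(11.0)
```

A sequence with perfect period three, such as `"ACG" * 100`, yields a large positive Z, whereas a sequence with no positional preference gives Z near −(11)^0.5.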
We combined all matrices M from all sequences for each family into one matrix, created matrices M_1, M_2, and M_3, and converted each matrix element into a normal distribution argument using normal approximation for binomial distribution. For this, we calculated partial sums of X(i) and Y(j) as we did in Figure 6 and determined probabilities p(i,j) = X(i)Y(j)/L^2 and the argument of normal distribution z_k(i,j) = {m_k(i,j) − Lp(i,j)}/{Lp(i,j)(1.0 − p(i,j))}^0.5 (where k indicates the number of the dispersed repeat family). The resulting matrices are shown in Table 4; column numbers are s_2(i)mod(3), where s_2(i) is the column sequence element of the PWM (sequence S_2, Section 4.2). It should be noted that the column numbers in matrices M_1, M_2, and M_3 are unrelated to the reading frame in the genes because there are indels in sequences S_1 and S_2. From the matrices, we could conclude that the nucleotides were distributed extremely unevenly across the matrix positions. Thus, the first group of dispersed repeats contained more than expected nucleotides C and G in the first position, C in the second position, and A and T in the third position.
Columns represent positions in a period of three bases; A, B, and C are matrices for sequences from the first, second, and third groups of dispersed repeats obtained for the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome.
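The conversion of a combined count matrix into normal-distribution arguments follows directly from the formulas p(i,j) = X(i)Y(j)/L^2 and z(i,j) = {m(i,j) − Lp(i,j)}/{Lp(i,j)(1 − p(i,j))}^0.5. The sketch below is a minimal illustration with a hypothetical function name; it takes L as the sum of all matrix elements, which is an assumption about how the combined matrix is normalized.

```python
import math


def z_matrix(m):
    """Convert counts m(i,j) into z(i,j) by the normal approximation
    to the binomial distribution with p(i,j) = X(i) * Y(j) / L**2."""
    total = sum(sum(row) for row in m)        # L, assumed to be the grand total
    x = [sum(row) for row in m]               # X(i): row sums
    y = [sum(col) for col in zip(*m)]         # Y(j): column sums
    z = []
    for i, row in enumerate(m):
        z_row = []
        for j, count in enumerate(row):
            p = x[i] * y[j] / total ** 2
            variance = total * p * (1.0 - p)
            if variance > 0:
                z_row.append((count - total * p) / math.sqrt(variance))
            else:
                z_row.append(0.0)             # empty row or column
        z.append(z_row)
    return z
```

Positive entries mark nucleotides that occur at a period position more often than expected from the marginal frequencies, and negative entries mark depleted ones.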

Comparison with Nucleoid-Associated Protein-Binding Sites
It is known that the spatial structure of bacterial genomes, including that of E. coli, is maintained by so-called nucleoid-associated proteins (NAPs). By binding to DNA, these proteins help stabilize and compact DNA and can also perform regulatory functions. Several such proteins, with different properties and DNA specificity, are currently known [24]. Experimental binding maps have been constructed for some NAPs using ChIP-seq methods. It is possible that the DNA regions with remote similarity identified in this study may be the binding sites for various NAPs. To test this hypothesis, we compared the intervals found here with the binding sites of some NAPs (such as Fis, H-NS, and Ihf) whose binding site coordinates were obtained from [28,29]. The intersection of these coordinates was determined using the bedtools program [30]. The results reveal that there was no statistical difference between the numbers of intersections of NAP-binding sites with the found repeat families and with randomly located dispersed repeats of these families, indicating the absence of a statistically significant intersection of the found intervals with NAP-binding sites.
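The logic of this comparison can be sketched in plain Python. The study used bedtools for the intersections; the code below is a stand-in written only for illustration, not the bedtools interface, and the function names, the half-open interval convention, and the uniform random placement of repeats are all assumptions.

```python
import random


def count_overlapping(repeats, sites):
    """Number of repeats overlapping at least one site.

    Intervals are (start, end) pairs treated as half-open [start, end).
    """
    count = 0
    for r_start, r_end in repeats:
        if any(s_start < r_end and r_start < s_end for s_start, s_end in sites):
            count += 1
    return count


def random_overlap_counts(repeats, sites, genome_len, n_trials, seed=0):
    """Overlap counts for repeats relocated uniformly at random.

    Each trial keeps the repeat lengths but draws new start positions,
    giving an empirical null distribution for count_overlapping.
    """
    rng = random.Random(seed)
    lengths = [end - start for start, end in repeats]
    counts = []
    for _ in range(n_trials):
        relocated = []
        for length in lengths:
            start = rng.randrange(genome_len - length + 1)
            relocated.append((start, start + length))
        counts.append(count_overlapping(relocated, sites))
    return counts
```

Comparing the observed count with the spread of the randomized counts indicates whether the repeats coincide with binding sites more often than chance placement would.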

Search for the Families of Dispersed Repeats in the Genomes of Other Bacteria
To confirm that dispersed repeat families exist not only in E. coli but also in other bacteria, we applied the IP method to search for dispersed repeats in the genomes of the following bacterial species: Azotobacter vinelandii, Bacillus subtilis, Clostridium tetani, Methylococcus capsulatus, Mycobacterium tuberculosis, Shigella sonnei, Treponema pallidum, and Yersinia pestis (genome sequences were obtained from http://bacteria.ensembl.org/index.html). The results shown in Table 5 indicate that in all the analyzed bacterial genomes, 1-2 repeat families could be detected at a statistically significant level. The smallest number of repeats was identified in the genome of T. pallidum, which may be due to its small size.

Discussion
In this study, we developed a new IP method and applied it to search for the families of dispersed repeats in the E. coli genome. As a result, we could identify three families of approximately 1.09 × 10^6, 0.64 × 10^6, and 0.58 × 10^6 DNA bases, respectively (2.3 × 10^6 bases in total), constituting almost 50% of the complete E. coli genome. Such extensive repeat families could not be detected in the E. coli genome via the RED, RECON, or RepeatMasker programs, but could be detected via the IP method, which could find de novo repeat families with x ≤ 1.5, whereas all other programs found them only with x ≤ 1.0.
It should be noted that in the search of genomes containing 5 × 10^5 to 1.1 × 10^7 DNA bases, the level of false positives in each family was 145 ± 35 repeats. This level of noise is due to the fact that, at the initial step of the iterative procedure, the random matrix can always find weak similarities (Z > 3.0) with some sequences (Figure 2). After creating a new PWM based on these similarities, the Z value for these sequences increases. In total, the iterative procedure can randomly include about 145 sequences in the PWM for which Z would be over 5.0.
A legitimate question arises regarding the origin of such highly divergent families of repeats and their functional significance. Our results (Section 2.4) indicate that each repeat has a similar triplet periodicity, which can account for the similarity of these sequences and classify them as one family. The emergence of triplet periodicity is partially related to the use of the same synonymous codons [31,32]. Therefore, the origin of repeat families could be associated with gene segments in which the same synonymous codons are used. In this case, dispersed repeats of the same family may be found in genes with similar transcriptional activity [33], whereas those with different activities could contain distinct families of dispersed repeats.
In Section 2.5 we found no correlation between the detected dispersed repeats and the binding sites of some of the proteins involved in nucleoid formation. Despite this, we cannot completely reject the assumption that the identified repeats contribute to DNA stacking in the nucleoid. Dispersed repeats create a certain markup of the bacterial genome that may contribute to the spatial self-organization of bacterial DNA.
Since dispersed repeats exist not only in the genome of E. coli but also in those of many other bacterial species (Table 5), it is also possible that the detected families of repeats could be involved in the creation of the liquid crystal structure within bacterial DNA through interactions between repeats within a family [34][35][36].
The IP method can be used to search for dispersed repeats in any DNA sequences, including those from eukaryotic organisms. The main limitation is that the analyzed sequence must be longer than 5 × 10^5 bp. This limitation is due to the fact that the IP method uses an iterative procedure, meaning that at smaller lengths it is not possible to start because there will be no hits with Z > 3.0 (Figure 2). At the same time, for eukaryotic chromosomes with lengths of more than 2 × 10^7 bp, the computation time could be too long. Therefore, the present version of IP can be used to find dispersed repeats in parts of eukaryotic chromosomes <2 × 10^7 bp. The dispersed repeats found could then be used in nHMMER to search for IP-detected repeats in the whole genome.
The IP method is currently available on the server at: http://victoria.biengi.ac.ru/shddr, accessed on 27 June 2023, which is open for use. The search time for dispersed repeats in the E. coli genome was just over five days, and we plan to increase the capacity of this computational system as the number of users grows. If necessary, we will also increase the volume of the computer cluster and shorten the search time for dispersed repeats in a prokaryotic genome to about an hour or less.
Figure 2 illustrates the algorithm used in this work to search for dispersed repeats de novo. The algorithm is iterative and can be divided into six steps explained in detail below.

Calculation of the Random Matrix
To search for a family of dispersed repeats in sequence S with length L, we created a random PWM with 16 rows and L 1 columns (step 1, Figure 2), in which L 1 was the maximum repeat length that could be identified in sequence S using local alignment. The created matrix was then transformed so that sum pwm(i, j) 2 had constant value R 2 0 for all matrices used below to find the similarity between the PWM and sequence S (Section 4.2) (the procedure of matrix transformation is described in detail in [37]). For these matrices, sum K = pwm(i, j)p 1 (i)p 2 (j) was also kept equal to K 0 . In this formula, pwm(i,j) is the element of the PWM on row i and column j, p 2 (j) = 1/L 1 and p 1 (i) = f (k)f (l), where f (k) and f (l) are the probabilities of encountering nucleotides of types k and l, respectively, in the analyzed sequence S (Section 4.2) (k and l could be A, T, C, or G and pair kl formed index i).
In the present study, we used $L_1 = 600$, $K_0 = -1.0$, and $R_0^2 = 300 L_1^{0.5}$; the value $K_0 = -1.0$ permits the accurate determination of the start and end points of the local alignment [25] between the PWM and sequence $S_1$ (Section 4.2). Thereafter, for local alignment we used only PWMs with these parameters.
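The two constrained quantities, the sum of squared elements and K, can be evaluated for any 16-row matrix with a few lines of code. The sketch below is illustrative only: the function names, the Gaussian draw for the random entries, and the mapping of base pairs to rows are assumptions, and the transformation that enforces $R_0^2$ and $K_0$ (described in [37]) is not reproduced.

```python
import random

BASES = "ATCG"  # same order as the numeric coding A=1, T=2, C=3, G=4


def random_pwm(l1, seed=None):
    """Return a 16 x l1 matrix of random numbers (Gaussian draw is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(l1)] for _ in range(16)]


def pair_probabilities(base_freqs):
    """p1(i) = f(k) * f(l) for the 16 ordered base pairs.

    Row order is an assumption: the first base of the pair varies fastest.
    """
    return [base_freqs[first] * base_freqs[second]
            for second in BASES for first in BASES]


def pwm_invariants(pwm, base_freqs):
    """Return (sum of squared elements, K) for a 16-row matrix."""
    l1 = len(pwm[0])
    p1 = pair_probabilities(base_freqs)
    p2 = 1.0 / l1  # p2(j) = 1 / L1 for every column
    sum_sq = sum(v * v for row in pwm for v in row)
    k = sum(p1[i] * p2 * v for i, row in enumerate(pwm) for v in row)
    return sum_sq, k
```

Calling `pwm_invariants(random_pwm(600, seed=1), freqs)` with the base frequencies of the analyzed sequence shows how far an untransformed matrix is from the target values before any rescaling.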

Calculation of $F_{max}$ and σ for the PWM
To calculate $F_{max}$ and σ, we randomly shuffled sequence S (step 2, Figure 2). Starting with t = 1, we selected in sequence S a window (sequence $S_1$) beginning at point t and ending at point $t + L_1 + 50$. If the letter N occurred in sequence $S_1$, 10 was added to t and sequence $S_1$ was created again.
Let $F_{max}$ be the maximum value of the similarity function after the local alignment between the PWM and sequence $S_1$, whose elements we denote as $s_1(i)$ for i from 1 to $L_1$; the alignment takes into account the correlation of neighboring bases in sequence $S_1$ [38]. Briefly, we first recoded the entire sequence $S_1$ so that the DNA bases became A = 1, T = 2, C = 3, and G = 4, and then created sequence $S_2$, whose elements $s_2(j) = j$ for j from 1 to $L_1$ are the column numbers of the PWM. Then, similarity function F was calculated for the alignment of $S_1$ with $S_2$. The initial conditions were $F(0,0) = F(i,0) = F(0,i) = 0.0$, and the PWM row index was $n = s_1(k) + 4(s_1(i) - 1)$, where i and j each ranged from 2 to $L_1$; for i = 1, $n = s_1(1)$, and for j = 1, $n = s_1(i)$. This choice of the initial n values had little effect on the final alignment results. By considering variable n, we took into account the correlation of neighboring nucleotides in sequence $S_1$. To calculate n, we had to find the previous position k, which had already been included in the alignment; it was calculated as previously described (Section 2.4 and Equation (7) in [38]). Here, we used d = 35.0 and e = 3.5.
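The recoding of bases and the row index n can be illustrated for the gap-free case, in which the previous aligned position is simply k = i − 1. This is a minimal sketch under that assumption; the function names are hypothetical, and the rule for locating k in a gapped alignment (from [38]) is not reproduced.

```python
CODE = {"A": 1, "T": 2, "C": 3, "G": 4}  # numeric coding of the bases


def encode(seq):
    """Recode a DNA string into the numeric sequence s1."""
    return [CODE[base] for base in seq.upper()]


def pwm_rows(s1):
    """Return the PWM row index n (1..16) for every position of s1.

    Assumes no gaps, so the previous aligned position is k = i - 1.
    """
    rows = [s1[0]]  # first position: n = s1(1)
    for i in range(1, len(s1)):
        rows.append(s1[i - 1] + 4 * (s1[i] - 1))
    return rows
```

For example, `pwm_rows(encode("AT"))` gives row 1 for the first base and row 5 for the pair A followed by T, and every index stays within the 16 rows of the PWM.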
First, we calculated matrix F and its maximum value $F_{max}$ for t = 1, then added 10 bases to t and repeated the calculation up to $t = L - L_1 - 49$. As a result, we obtained vector $F_{max}(t)$ and used it to calculate the mean $\bar{F}_{max}$ and σ for the PWM used. Together with matrix F, we filled in the matrix of inverse transitions, in which each cell (i, j) held the coordinates of the cell or cells of matrix F from which point (i, j) was reached. Then, we found coordinates $(i_{max}, j_{max})$ for $F_{max}$ and, by backtracking, those of the cell with $F(i_0, j_0) = 0$. Thus, we obtained the local alignment of the PWM with sequence $S_1$ and its coordinates $(i_0, i_{max})$.
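The window scan over the shuffled sequence, and the resulting mean and σ, can be sketched as follows. This is an illustration rather than the authors' code: the window is taken as $L_1 + 50$ bases so that the last start is $t = L - L_1 - 49$, the population standard deviation is used for σ, and `score_window` is a hypothetical stand-in for the local alignment that returns $F_{max}$ for one window.

```python
import random
import statistics


def iter_windows(seq, l1, step=10, pad=50):
    """Yield (t, window) with 1-based start t; windows containing 'N' are skipped."""
    width = l1 + pad  # assumed window length
    for start in range(0, len(seq) - width + 1, step):
        window = seq[start:start + width]
        if "N" in window:
            continue  # move on by `step` bases, as for masked regions
        yield start + 1, window


def null_statistics(seq, l1, score_window, seed=0):
    """Mean and sigma of F_max over windows of a randomly shuffled sequence."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    shuffled = "".join(letters)
    scores = [score_window(w) for _, w in iter_windows(shuffled, l1)]
    return statistics.mean(scores), statistics.pstdev(scores)
```

Keeping the scoring function as a parameter lets the same scan be reused later on the unshuffled sequence.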

Search for Similarities to the PWM in Sequence S
In sequence S, which was searched for dispersed repeats, we determined vector $F_{max}(t)$ ($t = 1, 11, \ldots, L - L_1 - 49$) (step 3, Figure 2) using the PWM from Section 4.2. For each point t in sequence S, we calculated the coordinates of the beginning and end of the local alignment, $y_0(t) = i_0$ and $y_{max}(t) = i_{max}$, and then searched for local maxima in vector $F_{max}(t)$. A local maximum was found at position t if $F_{max}(t') < F_{max}(t)$ for all $t' \neq t$ from $t - L_1 - 49$ to $t + L_1 + 49$. Next, we selected the local maxima for which $Z = (F_{max} - \bar{F}_{max})/\sigma \geq 5.0$ and denoted the number of such local maxima found in sequence S as $N_{lm}$. For all found local maxima, the average Z was calculated as:

$$\bar{Z} = \frac{1}{N_{lm}} \sum_{k=1}^{N_{lm}} Z_k$$

where $Z_k$ is the Z value of the k-th selected local maximum.

Below, we show that the threshold of Z ≥ 5.0 yields about 6% false positives for the first family of dispersed repeats found in the E. coli genome by this method, and about 15% for the other two families. As a result, we constructed a local alignment of sequences $S_1$ and $S_2$ (the sequence of PWM column numbers) for each selected local maximum.
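The selection of local maxima and the Z threshold can be sketched as below. The representation of $F_{max}(t)$ as a list with one entry per scanned window (t = 1, 11, 21, ...) and the function names are assumptions made for illustration.

```python
def select_hits(f_max, mean_f, sigma, l1, step=10, z_min=5.0):
    """Return (t, Z) for local maxima of f_max with Z >= z_min.

    f_max[m] is the score of the window starting at t = 1 + m * step.
    A position is a local maximum if its score is strictly larger than
    every other score within l1 + 49 bases on either side.
    """
    radius = (l1 + 49) // step  # neighborhood size in window indices
    hits = []
    for m, f in enumerate(f_max):
        lo = max(0, m - radius)
        hi = min(len(f_max), m + radius + 1)
        if all(f > f_max[q] for q in range(lo, hi) if q != m):
            z = (f - mean_f) / sigma
            if z >= z_min:
                hits.append((1 + m * step, z))
    return hits


def mean_z(hits):
    """Average Z over the selected local maxima."""
    return sum(z for _, z in hits) / len(hits)
```

The number of selected maxima, $N_{lm}$, is simply `len(hits)`.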

Creating a New PWM Based on the Found Similarities
Based on the obtained local maxima, we created a new PWM (Figure 2, step 4). For this, we used all local alignments found at the local maxima selected in Section 4.3, all of which had Z ≥ 5.0. These local alignments contained fragments of sequences $S_1$ and $S_2$ (Section 4.2), the former representing a nucleotide sequence in the numeric code and the latter representing PWM column numbers. Using sequences $S_1$ and $S_2$, we filled in frequency matrix M(16, 600), as $L_1 = 600$ (Section 4.1): $M(n, s_2(i)) = M(n, s_2(i)) + 1$ for all i from 2 to $L_1$, where $n = s_1(i-1) + 4(s_1(i) - 1)$. Then, we calculated the matrix of normal arguments $M_1(16, 600)$ from M and the values $p(i,j) = x(i)y(j)/(L-1)^2$.

To find a PWM(j) with the maximum value of Z, the procedures described in Sections 4.2-4.4 were repeated for iterations i = 1, 2, ..., 20 (step 5, Figure 2). The aim of these iterations was to find the PWM(i) with the maximum Z value; the iteration at which it was reached was denoted $i_{max}$. As a result, we stored PWM(j) = PWM($i_{max}$), all alignments found for $i_{max}$, their coordinates in sequences $S_1$ and $S_2$ (Section 4.3), and Z for each alignment. Then, the procedures described in Sections 4.1-4.5 were repeated j times.
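The frequency-matrix update used to build the new PWM can be sketched as follows. The sketch assumes that each alignment is supplied as a gap-free pair of equally long lists (coded bases and PWM column numbers) and uses the row index $n = s_1(i-1) + 4(s_1(i) - 1)$ from the alignment step, which keeps n within the 16 rows; the function name and input format are hypothetical, and the conversion to the matrix of normal arguments is not shown.

```python
def fill_frequency_matrix(alignments, l1=600):
    """Count base pairs per PWM column over all selected alignments.

    alignments: iterable of (s1, s2) pairs, where s1 holds coded bases
    (A=1, T=2, C=3, G=4) and s2 the matching PWM column numbers (1..l1).
    """
    m = [[0] * l1 for _ in range(16)]
    for s1, s2 in alignments:
        for i in range(1, len(s1)):  # start at the second aligned position
            n = s1[i - 1] + 4 * (s1[i] - 1)  # pair index, 1..16
            m[n - 1][s2[i] - 1] += 1
    return m
```

Each alignment of length q contributes q − 1 counts, because the first aligned base has no predecessor.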

Creating a Family of Dispersed Repeats
The procedures in Sections 4.1-4.5 were repeated 50 times, so that index j varied from 1 to 50. Then, we chose the $j_{max}$ at which the maximum Z value (reached at $i_{max}$) was obtained (step 6, Figure 2), which gave the first family of dispersed repeats. Thus, for each repeat family, we created PWM($j_{max}$), stored all the alignments found for $j_{max}$ together with their coordinates in sequences $S_1$ and $S_2$ (Section 4.3), and determined Z for each alignment.
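The nesting of 50 random starts, each refined for up to 20 iterations, can be outlined as a simple search loop. This is a schematic sketch only: `make_random_pwm`, `scan`, and `rebuild` are hypothetical callables standing for the steps in Sections 4.1, 4.2-4.3, and 4.4, respectively, and stopping a start early when no alignments are found is an added assumption.

```python
def search_family(make_random_pwm, scan, rebuild, n_restarts=50, n_iter=20):
    """Return the best result over all restarts j and iterations i.

    make_random_pwm(j) -> initial PWM for restart j
    scan(pwm)          -> (score, alignments) for the current PWM
    rebuild(alignments)-> new PWM built from the alignments
    """
    best = None
    for j in range(1, n_restarts + 1):
        pwm = make_random_pwm(j)
        for i in range(1, n_iter + 1):
            score, alignments = scan(pwm)
            if best is None or score > best["score"]:
                best = {"score": score, "j": j, "i": i,
                        "pwm": pwm, "alignments": alignments}
            if not alignments:
                break  # nothing to build the next PWM from
            pwm = rebuild(alignments)
    return best
```

Taking the maximum over all (j, i) pairs is equivalent to first choosing the best iteration within each start and then the best start.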
After creating the first family of repeats, we replaced the sequences of the found repeats in $S_1$ with N, repeated the calculations described in Sections 4.1-4.6, and constructed the next family of dispersed repeats.
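Masking the repeats that have already been found is a straightforward substitution. The sketch below assumes that repeat coordinates are given as 1-based inclusive intervals; the function name is hypothetical.

```python
def mask_repeats(seq, intervals):
    """Replace every base inside the given 1-based inclusive intervals with 'N'."""
    chars = list(seq)
    for start, end in intervals:
        for pos in range(start - 1, end):
            chars[pos] = "N"
    return "".join(chars)
```

Because windows containing N are skipped during the scan, masked regions cannot be reported again when the next family is searched for.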

Conclusions
We have developed a new mathematical method that allows identification of dispersed repeats with the average number of substitutions per nucleotide x ≤ 1.5, which is higher than that for any currently existing program. We have shown that all previously developed methods and algorithms (RED, RECON, and some others) can only find dispersed repeats for x ≤ 1.0. The new IP method has made it possible to detect families of dispersed repeats in bacterial genomes which have not been previously reported. We identify three families of approximately $1.09 \times 10^6$, $0.64 \times 10^6$, and $0.58 \times 10^6$ DNA bases, respectively, constituting almost 50% of the complete E. coli genome. The length of the repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain one to three families of dispersed repeats with a total number of $10^3$ to $6 \times 10^3$ copies. The existence of such highly divergent repeats could be associated with the presence of a single-type triplet periodicity in various genes or with the packing of bacterial DNA into a nucleoid. The method can also be applied to the search for dispersed repeats in eukaryotic genomes. We have created a web site for the analysis of bacterial genomes, where users can enter a genome sequence and obtain the result in a reasonable time.