# Search for Dispersed Repeats in Bacterial Genomes Using an Iterative Procedure


## Abstract

^{6}, 0.64 × 10^{6}, and 0.58 × 10^{6} DNA bases, respectively, constituting almost 50% of the complete E. coli genome. The length of the repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain one to three families of dispersed repeats with a total number of 10^{3} to 6 × 10^{3} copies. The existence of such highly divergent repeats could be associated with the presence of a single-type triplet periodicity in various genes or with the packing of bacterial DNA into a nucleoid.

## 1. Introduction

_{1} in a sequence of length L. To perform this task, we select two windows of length L_{1} in the sequence of length L and compare their sequences. The first window runs from the beginning to the end of the sequence, and the second runs from the end of the first window to the end of the sequence. As a result, the number of comparisons between the two windows (N) is proportional to L^{2}/2. In this case, the probability of detecting random similarity between two sequences of length L_{1} (p) should be no more than k/N. To exclude noise in the form of random pairwise similarities, we take k = 0.01, which means that the probability of finding a random similarity over all N comparisons is about 1%. From probability p, we can estimate the level of statistically significant similarity. Let us first calculate the argument of the normal distribution Z_{0} for which P(Z > Z_{0}) = p. Based on Z_{0}, we can estimate the number of identical bases s_{0} up to which the similarity of two sequences of length L_{1} is considered statistically insignificant (or noise): $s_{0}=\overline{s}+Z_{0}\sqrt{L_{1}t(1-t)}$, where t is the probability of random similarity between two nucleotides, taken as 0.25, and $\overline{s}=L_{1}t$ is the expected number of identical bases. In the case of a bacterial genome, L = 4 × 10^{6}. If repeat length L_{1} is 300, then probability p is approximately 10^{−15}, which corresponds to Z_{0} ≈ 8.0; thus, s_{0} = 75 + 8.0 × 7.5 = 135, which corresponds to 45% similarity and x ~ 0.55 (formula 11 in [23]). This means that if a family of dispersed repeats has accumulated many mutations and their similarity is less than 45%, it is very difficult to detect such a family de novo in a 4 × 10^{6}-base-long sequence by A1 methods. A similar x value is shown in [5]. It should be noted that, in reality, x may be smaller because, in addition to base substitutions, indels could be present in dispersed repeats. The use of word counting and signature-based methodologies to search for dispersed repeats cannot significantly improve the situation, because at x = 0.75 sequence similarity is already at the level of random noise (~25%) and word frequencies may differ insignificantly from those expected for random sequences.
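The arithmetic behind this threshold can be checked with a short calculation. The sketch below is only an illustration: the function name `noise_threshold` and its arguments are assumptions introduced here, and it takes N = L^{2}/2 exactly, whereas the text states only that N is proportional to this value.

```python
from math import sqrt
from statistics import NormalDist


def noise_threshold(genome_len, window_len, k=0.01, t=0.25):
    """Return (p, z0, s0) for the pairwise-comparison noise estimate.

    genome_len -- L, length of the analysed sequence
    window_len -- L1, length of the compared windows
    k          -- tolerated number of random similarities over all comparisons
    t          -- probability that two random nucleotides are identical
    """
    n_comparisons = genome_len ** 2 / 2           # assumed N = L^2 / 2
    p = k / n_comparisons                         # per-comparison probability
    z0 = -NormalDist().inv_cdf(p)                 # P(Z > z0) = p (symmetry of the normal law)
    mean_matches = window_len * t                 # expected identical bases by chance
    sd_matches = sqrt(window_len * t * (1 - t))   # binomial standard deviation
    s0 = mean_matches + z0 * sd_matches
    return p, z0, s0


p, z0, s0 = noise_threshold(genome_len=4_000_000, window_len=300)
print(f"p = {p:.2e}, Z0 = {z0:.2f}, s0 = {s0:.1f} ({s0 / 300:.0%} identity)")
```

For L = 4 × 10^{6} and L_{1} = 300 this gives p of the order of 10^{−15}, Z_{0} slightly below 8, and s_{0} close to the value of 135 quoted above.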

^{3} and lengths of 400–600 bases.

^{3} and therefore the results obtained cannot be related to known families of dispersed repeats. Additionally, bacterial genomes are significantly shorter than eukaryotic genomes, so their analysis does not require very high computational power. At the same time, weakly similar dispersed repeats may be present in bacterial genomes owing to the stacking of bacterial DNA in the nucleoid [24]. Possible functional significance and the origin of the dispersed repeats are discussed.

## 2. Results

#### 2.1. Using Model Sequences for the IP Method

_{test} with a length of 4 × 10^{6} bases and randomly inserted into it sequences from set Q(x), which contained 10^{3} sequences with x substitutions per nucleotide relative to each other. Overall, 500 sequences from set Q(x) were introduced into sequence S_{test} in the forward orientation and 500 in the reverse orientation. Then, the IP method was applied to identify the repeats. The results in Table 1 (where columns 1–4 show the number of repeats found in the first four families) indicate that the repeats were detected in the first and second families (columns 1 and 2), whereas the other two families (columns 3 and 4) contained sequences combined randomly. The level of random noise was 145 ± 35 sequences in all four generated repeat families. Thus, IP clearly separated forward and reverse sequences into two families and consistently detected both repeat families for x up to 1.5.
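A test sequence of this kind can be imitated with a few lines of code. The following is a minimal sketch under stated assumptions, not the procedure used to build set Q(x): every copy is derived from one hypothetical master sequence by x/2 substitution events per site (so that two copies differ by roughly x events per site), indels are omitted, and copies overwrite the random background instead of lengthening it. All function names are introduced here for illustration.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def mutate(seq, events_per_site, rng):
    """Apply round(events_per_site * len(seq)) substitutions at random sites."""
    bases = list(seq)
    for _ in range(round(events_per_site * len(bases))):
        pos = rng.randrange(len(bases))
        bases[pos] = rng.choice([b for b in BASES if b != bases[pos]])
    return "".join(bases)


def build_test_sequence(total_len, n_repeats, repeat_len, x, seed=0):
    """Return (sequence, master, placements); placements are (start, strand)."""
    rng = random.Random(seed)
    master = "".join(rng.choice(BASES) for _ in range(repeat_len))
    sequence = [rng.choice(BASES) for _ in range(total_len)]
    # Sorted random offsets plus k * repeat_len give non-overlapping starts.
    free = total_len - n_repeats * repeat_len
    offsets = sorted(rng.choices(range(free + 1), k=n_repeats))
    strands = ["+"] * (n_repeats // 2) + ["-"] * (n_repeats - n_repeats // 2)
    rng.shuffle(strands)
    placements = []
    for k, (offset, strand) in enumerate(zip(offsets, strands)):
        copy = mutate(master, x / 2, rng)
        if strand == "-":
            copy = reverse_complement(copy)
        start = offset + k * repeat_len
        sequence[start:start + repeat_len] = copy
        placements.append((start, strand))
    return "".join(sequence), master, placements


# Scaled-down call; the text uses 4 x 10^6 bases and 10^3 repeats of 600 bases.
seq, master, placements = build_test_sequence(40_000, 10, 600, x=1.0)
print(len(seq), len(placements))
```

Returning the placements alongside the sequence makes it possible to count how many inserted copies a search method recovers in each orientation.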

#### 2.2. Comparison of the IP Method with RED, RECON, RepeatMasker, BLAST, and nHMMER

^{3} artificial repeats of 600 bases randomly scattered across a sequence of 4 × 10^{6} bases. A total of nine sequences were created, for which dispersed repeats had different numbers of mutations relative to each other, i.e., x = 0–2.0. Figure 1 shows that RED found 100%, 40%, and practically 0% of dispersed repeats with x ≤ 1.0, x = 1.25, and x ≥ 1.5, respectively, indicating that RED could reliably identify dispersed repeats only with x ≤ 1.0. At the same time, the application of the IP method to the nine artificial sequences resulted in the detection of dispersed repeats with x ≤ 1.5 (Figure 1), which is a better result than that of RED.

^{6} bases) carrying 10^{3} randomly inserted 600-base repeats from set Q(x), which had the same number of random mutations relative to original sequence S_{m}, two indels at random positions, and x substitutions per nucleotide between repeats. The word_size parameter in BLAST was set to 4, which gave the best result. To evaluate the effectiveness of BLAST, we randomly created 100 sequences S(x) containing repeats from different Q(x) (the E_value threshold was set to 100). The E-value is the number of hits with a similar score that are expected to be found just by chance. Table 2 shows the average number of repeats found per sequence S(x) for each x. The results indicate that BLAST could find pairwise similarity for x ≤ 1.0 but failed to do so for x > 1.0.

^{3} sequences from set Q(x), and since the placement of indels in these sequences with respect to S_{m} was known, it was not difficult to construct a multiple alignment, which was used by nHMMER to create a hidden Markov model and search for dispersed repeats in sequence S(x).

_{m} as a sequence from the library to create a PWM (Figure 2, step 4), when indices i and j were limited to 1 (i.e., one cycle of dispersed repeat search).

_{m}(i−1) + 4(s_{m}(i) − 1) and i is the column number). Then, for i = 1, n = s_{m}(600) + 4(s_{m}(1) − 1).

#### 2.3. Search for Dispersed Repeats in the E. coli Genome Using the IP Method

_{1} = 600 bases (Section 4.1). The results in Figure 3 indicate that, for a random sequence, the IP method generated families of 145 ± 35 repeats, because the iterative algorithm (Figure 2) can always capture a certain number of sequences and build PWMs. In the case of the E. coli genome, the numbers of dispersed repeats found in the first three families were 2239, 1170, and 1024, respectively. The sizes of the other repeat families were close to those obtained for a random sequence, whereas the probability of finding families as large as the first three in a random sequence of the same length as the genome is extremely low. The coordinates of the found repeat families and their alignments with the PWM (sequence S_{2}, Section 4.3 and Section 4.5) are provided in the Supplementary Materials in files fam1.txt, fam2.txt, and fam3.txt, and the PWMs created for these families are provided in files pwm1.txt, pwm2.txt, and pwm3.txt.

#### 2.4. Triplet Periodicity of Dispersed Repeat Families in the E. coli Genome

_{1}(i)) = M(f(s_{2}(i)) + 1, s_{1}(i)) + 1 for i from 1 to L, where L is the repeat length, s_{1}(i) is a sequence element of the found repeat (sequence S_{1}, Section 4.3), and s_{2}(i) is a column sequence element of the PWM (sequence S_{2}, Section 4.3). Function f(s_{2}(i)) = s_{2}(i) − 3int((s_{2}(i) − 0.1)/3.0) takes the values 1, 2, or 3. If s_{1}(i) or s_{2}(i) was equal to zero (deletion), then matrix M was left unchanged and only i was incremented. After determining matrix M, we calculated mutual information I as:

^{0.5} − (11.0)^{0.5} for each sequence in the repeat family and plotted the Z distribution for all family members. Figure 6 shows the distribution for the first repeat family. For randomly shuffled sequences, the Z distribution was close to normal (black circles), but for repeat sequences without alignment with the PWM, the distribution was markedly shifted to the right (white circles), indicating that the coding sequences had triplet periodicity [26]. The Z distribution for sequences aligned with the PWM was shifted even further to the right than the two distributions mentioned above (black circles on a dashed line; Figure 6), indicating that the alignment of repeats with the PWM results in a clearer triplet periodicity.
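The triplet periodicity measure can be sketched in code. Two parts of the sketch are assumptions rather than statements of the source: the mutual information is taken in its standard contingency-table form, and Z is taken as (4I)^{0.5} − (11.0)^{0.5}, i.e., the usual normal approximation for 2I distributed as chi-square with 6 degrees of freedom. The function names are hypothetical; only the mapping f and the gap handling follow the text directly.

```python
from math import log, sqrt


def frame(col):
    """f(s2) = s2 - 3*int((s2 - 0.1)/3.0): PWM column 1,2,3,4,... -> 1,2,3,1,..."""
    return col - 3 * int((col - 0.1) / 3.0)


def xlogx(v):
    return v * log(v) if v > 0 else 0.0


def triplet_periodicity_z(s1, s2):
    """s1: bases coded 1..4 (0 = gap); s2: PWM column numbers (0 = gap)."""
    m = [[0] * 4 for _ in range(3)]            # 3 frame positions x 4 bases
    for base, col in zip(s1, s2):
        if base == 0 or col == 0:              # deletion: matrix left unchanged
            continue
        m[frame(col) - 1][base - 1] += 1
    total = sum(map(sum, m))
    row_sums = [sum(row) for row in m]
    col_sums = [sum(row[j] for row in m) for j in range(4)]
    # Assumed form: I = sum m ln m - sum X ln X - sum Y ln Y + L ln L
    info = (sum(xlogx(v) for row in m for v in row)
            - sum(map(xlogx, row_sums))
            - sum(map(xlogx, col_sums))
            + xlogx(total))
    info = max(info, 0.0)                      # guard against rounding below zero
    return sqrt(4.0 * info) - sqrt(11.0)       # assumed chi-square(6) approximation


periodic = [1, 2, 3] * 100                     # A, T, C repeated: perfect period 3
columns = list(range(1, 301))
print(round(triplet_periodicity_z(periodic, columns), 2))
```

A sequence with no dependence between base and frame position yields I close to zero and therefore Z close to −(11.0)^{0.5}, while a strictly periodic sequence yields a large positive Z.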

_{1}, M_{2}, and M_{3}, and converted each matrix element into a normal distribution argument using the normal approximation for the binomial distribution. For this, we calculated partial sums X(i) and Y(j) as for Figure 6 and determined probabilities p(i,j) = X(i)Y(j)/L^{2} and the argument of the normal distribution $z_{k}(i,j)=\frac{m_{k}(i,j)-Lp(i,j)}{\sqrt{Lp(i,j)(1-p(i,j))}}$ (where k indicates the number of the dispersed repeat family). The resulting matrices are shown in Table 4; column numbers are s_{2}(i)mod(3), where s_{2}(i) is the column sequence element of the PWM (sequence S_{2}, Section 4.2). It should be noted that the column numbers in matrices M_{1}, M_{2}, and M_{3} are unrelated to the reading frame in the genes because there are indels in sequences S_{1} and S_{2}. From the matrices, we could conclude that the nucleotides were distributed extremely unevenly across the matrix positions. Thus, the first group of dispersed repeats contained more C and G than expected in the first position, more C in the second position, and more A and T in the third position.
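The conversion of a count matrix into normal distribution arguments can be written compactly. This is a sketch assuming that the denominator of p(i,j) is the squared total count and that all marginal sums are nonzero; the function name `z_matrix` is introduced here for illustration.

```python
from math import sqrt


def z_matrix(m):
    """Normal approximation of the binomial distribution for each cell of m."""
    total = sum(map(sum, m))                               # L: total count
    x = [sum(row) for row in m]                            # X(i): row sums
    y = [sum(row[j] for row in m) for j in range(len(m[0]))]  # Y(j): column sums
    z = []
    for i, row in enumerate(m):
        z_row = []
        for j, count in enumerate(row):
            p = x[i] * y[j] / total ** 2                   # expected cell probability
            z_row.append((count - total * p) / sqrt(total * p * (1.0 - p)))
        z.append(z_row)
    return z


for row in z_matrix([[10, 20], [30, 40]]):
    print([round(v, 3) for v in row])
```

Positive entries mark cells that are more populated than expected from the marginal frequencies, and negative entries mark depleted cells, which is how the values in Table 4 are read.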

#### 2.5. Comparison with Nucleoid-Associated Protein-Binding Sites

#### 2.6. Search for the Families of Dispersed Repeats in the Genomes of Other Bacteria

## 3. Discussion

^{6}, 0.64 × 10^{6}, and 0.58 × 10^{6} DNA bases (2.3 × 10^{6} bases in total), constituting almost 50% of the complete E. coli genome. Such extensive repeat families could not be detected in the E. coli genome by the RED, RECON, or RepeatMasker programs, but could be detected by the IP method, which found de novo repeat families with x ≤ 1.5, whereas the other programs found them only with x ≤ 1.0.

^{5}–1.1 × 10^{7} DNA bases, the level of false positives in each family was 145 ± 35 repeats. This level of noise arises because, at the initial step of the iterative procedure, the random matrix can always find weak similarities (Z > 3.0) with some sequences (Figure 2). After a new PWM is created based on these similarities, the Z value for these sequences increases. In total, the iterative procedure can randomly include about 145 sequences in the PWM for which Z exceeds 5.0.

^{5} bp. This limitation arises because the IP method uses an iterative procedure, which cannot start at smaller lengths because there will be no hits with Z > 3.0 (Figure 2). At the same time, for eukaryotic chromosomes longer than 2 × 10^{7} bp, the computation time could be too long. Therefore, the present version of IP can be used to find dispersed repeats in parts of eukaryotic chromosomes shorter than 2 × 10^{7} bp. The dispersed repeats found could then be used in nHMMER to search for IP-detected repeats in the whole genome.

## 4. Materials and Methods

#### 4.1. Calculation of the Random Matrix

_{1} columns (step 1, Figure 2), in which L_{1} was the maximum repeat length that could be identified in sequence S using local alignment. The created matrix was then transformed so that the sum $R^{2}=\sum_{i=1}^{16}\sum_{j=1}^{L_{1}}pwm(i,j)^{2}$ had a constant value $R_{0}^{2}$ for all matrices used below to find the similarity between the PWM and sequence S (Section 4.2) (the procedure of matrix transformation is described in detail in [37]). For these matrices, the sum $K=\sum_{i=1}^{16}\sum_{j=1}^{L_{1}}pwm(i,j)\,p_{1}(i)\,p_{2}(j)$ was also kept equal to K_{0}. In this formula, pwm(i,j) is the element of the PWM in row i and column j, p_{2}(j) = 1/L_{1}, and p_{1}(i) = f(k)f(l), where f(k) and f(l) are the probabilities of encountering nucleotides of types k and l, respectively, in the analyzed sequence S (Section 4.2) (k and l could be A, T, C, or G and pair kl formed index i).

_{1} = 600, K_{0} = −1.0, and $R_{0}^{2}=300L_{1}^{0.5}$; setting K_{0} = −1.0 permits the accurate determination of the start and end points of the local alignment [25] between the PWM and sequence S_{1} (Section 4.2). Thereafter, we used only PWMs with these parameters for local alignment.

#### 4.2. Calculation of ${\overline{F}}_{\mathrm{max}}$ and σ for the PWM

_{1}) with the beginning at point t and the end at point t + L_{1} + 50. If the letter N occurs in sequence S_{1}, then 10 is added to t and sequence S_{1} is created again.

_{max} be the maximum value of the similarity function after the local alignment between the PWM and sequence S_{1}, performed by taking into account the correlation of neighboring bases in sequence S_{1} [38], whose elements we denote as s_{1}(i) for i from 1 to L_{1}. Briefly, we first recoded the entire sequence S_{1} so that DNA bases became A = 1, T = 2, C = 3, and G = 4, and then created sequence S_{2}, in which elements s_{2}(j) = j for j from 1 to L_{1} and which contained the column numbers of the PWM. Then, similarity function F was calculated as:

_{1}(k) + 4(s_{1}(i) − 1)), where i and j each ranged from 2 to L_{1}; for i = 1, n = s_{1}(1) and for j = 1, n = s_{1}(i). This choice of the initial n values had little effect on the final alignment results. By considering variable n, we took into account the correlation of neighboring nucleotides in sequence S_{1}. To calculate n, we need to find the previous position k, which has already been included in the alignment and which was calculated as previously described (Section 2.4 and Equation (7) in [38]). Here, we used d = 35.0 and e = 3.5.
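The numeric recoding and the row index n can be illustrated directly. The sketch below uses hypothetical helper names (`encode`, `pair_row`); it covers only the index arithmetic, not the alignment itself.

```python
# Numeric code from the text: A = 1, T = 2, C = 3, G = 4.
CODE = {"A": 1, "T": 2, "C": 3, "G": 4}


def encode(seq):
    """Recode a DNA string into the numeric sequence S1."""
    return [CODE[base] for base in seq.upper()]


def pair_row(prev_base, cur_base):
    """Row n = s1(k) + 4*(s1(i) - 1) of the 16-row PWM, in the range 1..16."""
    return prev_base + 4 * (cur_base - 1)


s1 = encode("ATCGA")
# Without gaps the previous aligned position k is simply i - 1.
rows = [pair_row(s1[i - 1], s1[i]) for i in range(1, len(s1))]
print(s1, rows)
```

In the alignment, k is the previous position already included in the alignment, so it coincides with i − 1 only when no gap separates the two bases.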

_{max} for t = 1 and then added 10 bases to t and repeated the calculation up to t = L − L_{1} − 49. As a result, we obtained vector F_{max}(t) and used it to calculate mean ${\overline{F}}_{\mathrm{max}}$ and σ for the PWM used. Together with matrix F, we filled in the matrix of inverse transitions, in which each cell (i,j) held the coordinates of the cell or cells of matrix F from which point (i,j) was reached. Then, we found coordinates (i_{max}, j_{max}) for F_{max} and, by backtracking, those for F(i_{0}, j_{0}) = 0. Thus, we obtained the local alignment of the PWM with sequence S_{1} and its coordinates (i_{0}, i_{max}).

#### 4.3. Search for Similarities to the PWM in Sequence S

_{max}(t) (t = 1, 11, ..., L − L_{1} − 49) (step 3, Figure 2) using the PWM from Section 4.2. For each point t in sequence S, we calculated the coordinates of the beginning and end of the local alignment, y_{0}(t) = i_{0} and y_{max}(t) = i_{max}, and then searched for local maxima in vector F_{max}(t); a local maximum was found at position t if F_{max}(t + i) < F_{max}(t) for all nonzero i from −(L_{1} + 49) to L_{1} + 49. Next, we selected the local maxima for which Z = (F_{max} − ${\overline{F}}_{\mathrm{max}}$)/σ ≥ 5.0 and denoted the number of local maxima found in sequence S as N_{lm}. For all found local maxima, the average Z was calculated as $\overline{Z}=\sum_{k=1}^{N_{lm}}Z(k)/N_{lm}$.
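The selection of significant local maxima can be sketched as follows. The sketch assumes that F_{max} is available on the grid t = 1, 11, 21, ... and that the mean and σ have already been computed; the function name `select_hits` and its parameters are introduced here for illustration.

```python
def select_hits(f_max, mean_f, sigma, step=10, window_len=600, z_min=5.0):
    """Return ([(t, Z), ...], mean Z) for significant local maxima.

    f_max[k] holds F_max at position t = 1 + k * step.
    A point is a local maximum if it is strictly greater than every other
    grid point within window_len + 49 bases on either side.
    """
    radius = (window_len + 49) // step        # half-width in grid points
    hits = []
    for k, value in enumerate(f_max):
        lo = max(0, k - radius)
        hi = min(len(f_max), k + radius + 1)
        if all(value > f_max[m] for m in range(lo, hi) if m != k):
            z = (value - mean_f) / sigma
            if z >= z_min:
                hits.append((1 + k * step, z))
    mean_z = sum(z for _, z in hits) / len(hits) if hits else 0.0
    return hits, mean_z


profile = [0, 1, 2, 10, 2, 1, 0, 0, 9, 0]
print(select_hits(profile, mean_f=0.0, sigma=1.0, step=10, window_len=11))
```

The strict inequality mirrors the condition in the text, so plateaus of equal F_{max} values are not reported as maxima.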

_{1} and S_{2} (columnar sequence of the PWM) for each selected local maximum.

#### 4.4. Creating a New PWM Based on the Found Similarities

_{1} and S_{2} (Section 4.2), the former representing a nucleotide sequence in the numeric code and the latter representing PWM column numbers. Using sequences S_{1} and S_{2}, we filled in frequency matrix M(16, 600), as L_{1} = 600 (Section 4.1): M(n, s_{2}(i)) = M(n, s_{2}(i)) + 1 for all i from 2 to L_{1}, where n = s_{1}(i−1) + 4(s_{1}(i) − 1). Then, we calculated the matrix of normal arguments M_{1}(16, 600) as:

_{1} as described in Section 4.1, we obtained a PWM that could be used in Section 4.2.

#### 4.5. Selection of the PWM to Find the Greatest Number of Similarities with Sequence S

_{max}. As a result of iterations i = 1, 2, ..., 20, we stored PWM(j) = PWM(i_{max}), all alignments found for i_{max}, their coordinates in sequences S_{1} and S_{2} (Section 4.3), and Z for each alignment. Then, the procedures described in Sections 4.1–4.5 were repeated j times.

#### 4.6. Creating a Family of Dispersed Repeats

_{max} at which the maximum value (i_{max}) was obtained (step 6, Figure 2) and obtained the first family of dispersed repeats. Thus, for each repeat family, we created PWM(j_{max}), collected all the alignments found for j_{max}, obtained their coordinates in sequences S_{1} and S_{2} (Section 4.3), and determined Z for each alignment.

_{1} with N, repeated the calculations described in Sections 4.1–4.6, and constructed the next family of dispersed repeats.

## 5. Conclusions

^{6}, 0.64 × 10^{6}, and 0.58 × 10^{6} DNA bases, respectively, constituting almost 50% of the complete E. coli genome. The length of the repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain one to three families of dispersed repeats with a total number of 10^{3} to 6 × 10^{3} copies. The existence of such highly divergent repeats could be associated with the presence of a single-type triplet periodicity in various genes or with the packing of bacterial DNA into a nucleoid. The method can also be applied to the search for dispersed repeats in eukaryotic genomes. We have created a website for the analysis of bacterial genomes, where users can enter a genome sequence and obtain the result in a reasonable time.

## References

1. Smit, A.F.A. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. **1996**, 6, 743–748.
2. Mayer, K.F.X.; Waugh, R.; Langridge, P.; Close, T.J.; Wise, R.P.; Graner, A.; Matsumoto, T.; Sato, K.; Schulman, A.; Ariyadasa, R.; et al. A physical, genetic and functional sequence assembly of the barley genome. Nature **2012**, 491, 711–716.
3. Meyer, A.; Schloissnig, S.; Franchini, P.; Du, K.; Woltering, J.M.; Irisarri, I.; Wong, W.Y.; Nowoshilow, S.; Kneitz, S.; Kawaguchi, A.; et al. Giant lungfish genome elucidates the conquest of land by vertebrates. Nature **2021**, 590, 284–289.
4. Gupta, P.K. Earth Biogenome Project: Present status and future plans: (Trends in Genetics 38:8 p: 811-820, 2022). Trends Genet. **2023**, 39, 167.
5. Storer, J.M.; Hubley, R.; Rosen, J.; Smit, A.F.A. Methodologies for the De novo Discovery of Transposable Element Families. Genes **2022**, 13, 709.
6. Tempel, S. Using and understanding RepeatMasker. Methods Mol. Biol. **2012**, 859, 29–51.
7. Jurka, J.; Klonowski, P.; Dagman, V.; Pelton, P. CENSOR—A program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. **1996**, 20, 119–121.
8. Bedell, J.A.; Korf, I.; Gish, W. MaskerAid: A performance enhancement to RepeatMasker. Bioinformatics **2000**, 16, 1040–1041.
9. Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA **2015**, 6, 11.
10. Girgis, H.Z. Red: An intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinform. **2015**, 16, 227.
11. Bao, Z.; Eddy, S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. **2002**, 12, 1269–1276.
12. Edgar, R.C.; Myers, E.W. PILER: Identification and classification of genomic repeats. Bioinformatics **2005**, 21 (Suppl. S1), i152–i158.
13. Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics **2005**, 21 (Suppl. S1), i351–i358.
14. Volfovsky, N.; Haas, B.J.; Salzberg, S.L. A clustering method for repeat analysis in DNA sequences. Genome Biol. **2001**, 2, 0027.1.
15. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. **1997**, 25, 3389–3402.
16. Mount, D.W. Using a FASTA Sequence Database Similarity Search. CSH Protoc. **2007**, 2007, pdb.top16.
17. Tamura, K.; Peterson, D.; Peterson, N.; Stecher, G.; Nei, M.; Kumar, S. MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. **2011**, 28, 2731–2739.
18. Wheeler, T.J.; Eddy, S.R. Nhmmer: DNA homology search with profile HMMs. Bioinformatics **2013**, 29, 2487–2489.
19. Notredame, C.; Higgins, D.G.; Heringa, J. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. **2000**, 302, 205–217.
20. Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. **2004**, 32, 1792–1797.
21. Korotkov, E.V.; Suvorova, Y.M.; Kostenko, D.O.; Korotkova, M.A. Multiple alignment of promoter sequences from the Arabidopsis thaliana L. genome. Genes **2021**, 12, 135.
22. Blattner, F.R.; Plunkett, G.; Bloch, C.A.; Perna, N.T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J.D.; Rode, C.K.; Mayhew, G.F.; et al. The complete genome sequence of Escherichia coli K-12. Science **1997**, 277, 1453–1462.
23. Kostenko, D.O.; Korotkov, E.V. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences. Int. J. Mol. Sci. **2022**, 23, 3764.
24. Verma, S.C.; Qian, Z.; Adhya, S.L. Architecture of the Escherichia coli nucleoid. PLoS Genet. **2019**, 15, e1008456.
25. Suvorova, Y.M.; Kamionskaya, A.M.; Korotkov, E.V. Search for SINE repeats in the rice genome using correlation-based position weight matrices. BMC Bioinform. **2021**, 22, 42.
26. Frenkel, F.E.; Korotkov, E.V. Classification analysis of triplet periodicity in protein-coding regions of genes. Gene **2008**, 421, 52–60.
27. Suvorova, Y.M.; Korotkov, E.V. Study of triplet periodicity differences inside and between genomes. Stat. Appl. Genet. Mol. Biol. **2015**, 14, 113–123.
28. Kahramanoglou, C.; Seshasayee, A.S.N.; Prieto, A.I.; Ibberson, D.; Schmidt, S.; Zimmermann, J.; Benes, V.; Fraser, G.M.; Luscombe, N.M. Direct and indirect effects of H-NS and Fis on global gene expression control in Escherichia coli. Nucleic Acids Res. **2011**, 39, 2073–2091.
29. Prieto, A.I.; Kahramanoglou, C.; Ali, R.M.; Fraser, G.M.; Seshasayee, A.S.N.; Luscombe, N.M. Genomic analysis of DNA binding and gene regulation by homologous nucleoid-associated proteins IHF and HU in Escherichia coli K12. Nucleic Acids Res. **2012**, 40, 3524–3537.
30. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics **2010**, 26, 841–842.
31. Trotta, E. The 3-Base Periodicity and Codon Usage of Coding Sequences Are Correlated with Gene Expression at the Level of Transcription Elongation. PLoS ONE **2011**, 6, e21590.
32. Sánchez, J.; López-Villaseñor, I. A simple model to explain three-base periodicity in coding DNA. FEBS Lett. **2006**, 580, 6413–6422.
33. Großmann, P.; Lück, A.; Kaleta, C. Model-based genome-wide determination of RNA chain elongation rates in Escherichia coli. Sci. Rep. **2017**, 7, 1–11.
34. Yevdokimov, Y.M.; Salyanov, V.I.; Nechipurenko, Y.D.; Skuridin, S.G.; Zakharov, M.A.; Spener, F.; Palumbo, M. Molecular Constructions (Superstructures) with Adjustable Properties Based on Double-Stranded Nucleic Acids. Mol. Biol. **2003**, 37, 293–306.
35. Yevdokimov, Y.M.; Salyanov, V.I.; Skuridin, S.G. From liquid crystals to DNA nanoconstructions. Mol. Biol. **2009**, 43, 284–300.
36. Skuridin, S.G.; Vereshchagin, F.V.; Salyanov, V.I.; Chulkov, D.P.; Kompanets, O.N.; Yevdokimov, Y.M. Ordering of double-stranded DNA molecules in a cholesteric liquid-crystalline phase and in dispersion particles of this phase. Mol. Biol. **2016**, 50, 783–790.
37. Pugacheva, V.; Korotkov, A.; Korotkov, E. Search of latent periodicity in amino acid sequences by means of genetic algorithm and dynamic programming. Stat. Appl. Genet. Mol. Biol. **2016**, 15, 381–400.
38. Korotkov, E.V.; Suvorova, Y.M.; Nezhdanova, A.V.; Gaidukova, S.E.; Yakovleva, I.V.; Kamionskaya, A.M.; Korotkova, M.A. Mathematical Algorithm for Identification of Eukaryotic Promoter Sequences. Symmetry **2021**, 13, 917.

**Figure 1.** Comparative performance of the IP and RED methods in the search for dispersed repeats in artificial sequences. The search was performed in an artificial sequence of 4 × 10^{6} bases containing 10^{3} repeats (each of 600 bases but with different x), which were randomly inserted from set Q(x) containing repeats with the same number of random mutations relative to original sequence S_{m} as well as indels at random positions. Black and white circles indicate the IP and RED methods, respectively; x is the average number of substitutions per nucleotide and N is the number of repeats identified.

**Figure 3.** The number of repeats in the groups created for the E. coli genome. Black circles on a continuous curve indicate the groups for the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome; white circles indicate the size of the groups created for the same genome, in which the codons in coding sequences are mixed randomly; black circles on a discontinuous curve indicate the size of the groups created for a random sequence of the same length as the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome. N is the number of repeats identified.

**Figure 4.** The number of repeats in the coding sequences of the E. coli genome. Black circles indicate groups created for the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome, in which all non-coding sequences were randomly mixed; white circles indicate the size of the groups created for the randomly mixed genome. N is the number of repeats identified.

**Figure 5.** Length distribution of dispersed repeats found in the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome. Repeats of families 1, 2, and 3 are indicated by black circles on a continuous line, white circles, and black circles on a discontinuous line, respectively.

**Figure 6.** Distribution of the dispersed repeats from the first group according to the level of triplet periodicity. Here, X is the argument of normal distribution indicating the level of statistical significance of triplet periodicity. Black circles on a continuous line show triplet periodicity calculated for randomly mixed sequences from the first group; white circles show triplet periodicity for the first group sequences found in the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome without alignment with the PWM for this family (sequences without indels as they are in the genome); black circles on a dashed line show the level of triplet periodicity in the sequences that are a part of the alignments with the first group PWM (sequences with indels).

| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 508 | 490 | 51 | 71 |
| 0.5 | 503 | 502 | 55 | 62 |
| 0.75 | 507 | 496 | 95 | 75 |
| 1.0 | 505 | 502 | 102 | 113 |
| 1.25 | 506 | 501 | 85 | 92 |
| 1.5 | 501 | 483 | 112 | 101 |
| 1.75 | 166 | 152 | 139 | 124 |
| 2.0 | 125 | 144 | 138 | 132 |
| 4.0 | 114 | 127 | 132 | 90 |

^{6} bases containing 500 repeats in the forward and 500 repeats in the reversed orientation was used; x is the average number of substitutions per nucleotide between family members.

| x | 0.1 | 0.25 | 0.5 | 0.75 | 1.0 | 1.25 | 1.5 | 2.0 | 2.5 | 3.0 | 4.0 | 20.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLAST | 1000 | 1000 | 1001 | 1000 | 1000 | 2.9 | 1.4 | 1.1 | 2.0 | 0.2 | 1.1 | 1.2 |
| nHMMER | 1004 | 1002 | 1004 | 1006 | 1002 | 1002 | 1004 | 1003 | 992 | 668 | 21 | 0 |
| IP | 1068 | 1006 | 1003 | 1002 | 1005 | 1002 | 1004 | 1003 | 1065 | 907 | 221 | 80 |

^{6} bases containing 10^{3} dispersed repeats was used; x is the average number of base substitutions per nucleotide between two dispersed repeats.

**Table 3.** Distribution of the found repeats from families 1, 2, and 3 (Figure 3) according to the proportion of non-coding regions.

| Family / proportion of non-coding sequences | 0.0–0.1 | 0.1–0.2 | 0.2–0.3 | 0.3–0.4 | 0.4–0.5 | 0.5–0.6 | 0.6–0.7 | 0.7–0.8 | 0.8–0.9 | 0.9–1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1956 | 129 | 58 | 36 | 22 | 7 | 3 | 2 | 1 | 25 |
| 2 | 918 | 97 | 60 | 32 | 13 | 11 | 4 | 4 | 3 | 28 |
| 3 | 709 | 116 | 72 | 43 | 29 | 15 | 8 | 2 | 3 | 27 |

**Table 4.** Matrices with dimensions 3 × 4 (A, B and C) which contain the normal distribution argument obtained by normal approximation of binomial distribution for each cell of matrices M_{1}, M_{2} and M_{3}.

| DNA base | A, 1 | A, 2 | A, 3 | B, 1 | B, 2 | B, 3 | C, 1 | C, 2 | C, 3 |
|---|---|---|---|---|---|---|---|---|---|
| A | −48.7 | −3.5 | 52.2 | 9.2 | 22.8 | −32.0 | 24.7 | 3.9 | −28.6 |
| T | −9.8 | −27.4 | 37.3 | −40.4 | 7.7 | 32.7 | −19.1 | 28.0 | −8.8 |
| C | 17.2 | 37.1 | −54.4 | −20.4 | −22.1 | 42.6 | −6.8 | −18.7 | 25.6 |
| G | 37.2 | −8.6 | −28.5 | 49.3 | −7.13 | −42.2 | 1.4 | −12.1 | 10.7 |

| Bacteria | 1 | 2 | 3 | 4 | Genome Size (bases) |
|---|---|---|---|---|---|
| Azotobacter vinelandii | 4565 | 1357 | 322 | 178 | 5.3 × 10^{6} |
| Bacillus subtilis | 2563 | 768 | 340 | 305 | 4.2 × 10^{6} |
| Clostridium tetani | 1605 | 640 | 168 | 111 | 2.8 × 10^{6} |
| Methylococcus capsulatus | 2489 | 375 | 280 | 95 | 3.3 × 10^{6} |
| Mycobacterium tuberculosis | 3343 | 1152 | 299 | 103 | 4.4 × 10^{6} |
| Shigella sonnei | 2606 | 645 | 519 | 358 | 5.0 × 10^{6} |
| Treponema pallidum | 590 | 273 | 83 | 46 | 1.1 × 10^{6} |
| Xanthomonas campestris | 4622 | 1348 | 359 | 75 | 5.1 × 10^{6} |
| Yersinia pestis | 1953 | 43 | 35 | 43 | 4.8 × 10^{6} |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Korotkov, E.; Suvorova, Y.; Kostenko, D.; Korotkova, M.
Search for Dispersed Repeats in Bacterial Genomes Using an Iterative Procedure. *Int. J. Mol. Sci.* **2023**, *24*, 10964.
https://doi.org/10.3390/ijms241310964
