Next Article in Journal
Comparative Analysis of Polysaccharides from Chicory Roots and Aerial Parts Reveals Comparable Cytoprotective Effects Associated with MAPK/NF-κB Signaling
Previous Article in Journal
Library Preparation Biases Plant Virome Detection: Poly(A) mRNA Enrichment vs. rRNA Depletion in Pepper and Garlic
Previous Article in Special Issue
Hybrid Genome Assembly and Annotation of the Basidiomycete Fungus Candolleomyces candolleanus Strain CMU-8613 Using a Cost-Effective Iterative Pipeline
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Regions Enriched with Reverse Complement Triplets in Bacterial Genomes

by
Eugene V. Korotkov
Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld. 2, 33 Leninsky Ave., 119071 Moscow, Russia
Int. J. Mol. Sci. 2026, 27(5), 2301; https://doi.org/10.3390/ijms27052301
Submission received: 27 December 2025 / Revised: 24 February 2026 / Accepted: 26 February 2026 / Published: 28 February 2026
(This article belongs to the Special Issue Computational Genomics and Bioinformatics in Microbiology)

Abstract

I developed a mathematical method to search for DNA regions that are significantly enriched in reverse complement triplets (RCTs) and are located in sequences with strongly expressed triplet periodicity (TP). The method makes it possible to exclude the influence of TP on the number of RCTs. To search for RCTs, I used the difference between triplet frequencies and their expected number, which was determined by taking into account the TP of the analyzed region. I analyzed the genomes of 42 bacteria representing all bacterial phyla, and found that the number of DNA regions containing RCTs ranged from several hundred to several thousand per genome depending on its size. The average length of the region was about 850 DNA bases. The most common inversion symmetry (IS) pattern of the RCT-containing regions was the enrichment of the first, second, and third triplet positions with {A, G}, {A, T}, and {T, C} bases, respectively. When the sequence was rotated 180 degrees and the bases were replaced with complementary ones (IS), such enrichment of triplet positions was preserved. I suggest that the emergence of IS could be a result of evolutionary processes such as inversions, transpositions, and recombinations.

1. Introduction

Advances in sequencing technology have led to the accumulation of a large number of DNA and protein sequences and promoted the development of mathematical methods to identify genomic and proteomic structural patterns [1,2]. One such pattern is dispersed repeats (DRs), which are short repetitive nucleotide sequences distributed throughout genomes in a non-tandem, yet non-random, manner [3]. Typically, the degree of similarity between individual repeats ranges from 40 to 100%; often, DRs are associated with mobile elements such as transposons. Repeats constitute a significant part of eukaryotic genomes [4]; thus, 45% of the human genome consists of DRs [3].
Recently, we have developed a mathematical approach that allows statistically significant identification of DRs with similarity less than 40% and applied it to analyze bacterial genomes [5]. It appeared that DRs occupy from 17 to 72% of the bacterial genome and that most repeats are superimposed on coding sequences as a motif. In addition, the found DRs have inversion symmetry (IS) [5,6]. A detailed study of IS in the DRs of bacterial genomes shows that they display triplet periodicity (TP), which is characterized by matrix M(3,4) containing three columns and four rows that correspond to positions in the found DRs and to DNA bases, respectively [7,8]. Matrix M is calculated as multiple alignment AL constructed for the identified bacterial DRs. If the number of sequences in AL is N, then matrix M is filled as m(i,sk(j)) = m(i,sk(j)) + 1, where i = j − 3int(j/3) and j varies from 1 to L (L is the length of AL and sk(i) is the element of sequence k in AL). After matrix M is filled for all k from 1 to N, we obtain matrix MM(3,4) by first calculating sums x ( i ) = j = 1 4 m ( i , j ) , y ( j ) = i = 1 3 m ( i , j ) , W = i = 1 3 j = 1 4 m ( i , j ) , and p(i,j) = x(i)y(j)/W2, and then mm(i,j) = m(i,j) − Wp(i,j))/(Wp(i,j)(1 − p(i,j))0.5. Each cell mm(i,j) shows the argument of the normal distribution; the larger it is, the more the values of cell m(i,j) differ from those obtained for randomly shuffled sequences included in AL. It turned out that matrix MM for DRs of bacterial genomes has IS, which means that when all sequences in multiple alignment AL are rotated by 180 degrees and the bases are replaced with complementary ones, matrix MM looks the same because its values that are greater than 3.0 remain in the same cells. In 10 of the 12 studied bacterial genomes, these cells are 1G, 2A, 2T, and 3C (the number indicates the column and the letter—the row of matrix MM). In the other two genomes, the cells are 1G, 2C, 3A, and 3T [6]; such matrix MM also has IS but only with a phase shift: thus, if sequence GC(A/T)GC(A/T)GC(A/T)… is converted to sequence (T/A)GC(T/A)GC(T/A)GC…, the latter would result in the same matrix MM only if matrix M is filled from the second base.
It is very likely that the patterns found in [6] are a consequence of Chargaff’s second rule [9,10]. Applied to nucleotides and mononucleotides, it means that on a single DNA strand the frequency of nucleotide A is approximately equal to that of T and that of G to that of C. Chargaff’s generalized rule (also called IS) extends this principle to longer sequences [11,12], implying that the frequency of a k-mer is approximately equal to that of its reverse complementary (RC) k-mer. For example, according to this rule the frequencies of triplet AGC and its RC triplet GCT in a single DNA strand would be approximately equal. Such symmetry is observed in the genomes of most organisms, both eukaryotes and prokaryotes [13,14,15]; however, mitochondrial genomes often show significant deviations from this rule because of a different replication mechanism [16,17].
As discussed in [6], the formation of RC triplets (RCTs) in bacterial genomes may be, on the one hand, a result of evolutionary processes such as inversions, transpositions, and recombinations [18,19,20] and/or the formation of secondary DNA structures (e.g., hairpins) which requires intrachain complementarity [21,22]. On the other hand, RCTs may appear because of TP due to codon degeneracy. The uneven presence of amino acids in proteins, which results in the increased frequency of the encoding triplets, could also generate TP in DNA [7,23,24]. Thus, if a protein contains multiple serine and threonine residues encoded by AGT and ACT, respectively, the corresponding gene sequence should be enriched in two RCTs; a similar situation is observed with glycine and alanine encoded by GGC and GCC, respectively.
Mathematical methods used to search for RC k-mers usually operate based on k-mer frequencies [25,26,27]; however, in the presence of TP, RCT frequencies can be quite high merely on the account of nucleotide probabilities in the corresponding positions. For example, if a random sequence has p(1,a) = 0.4, p(1,t) = 0.4, p(2,c) = 0.4, p(2,g) = 0.4, p(3,a) = 0.4, and p(3,t) = 0.4, it would contain many ACT and TGA triplets; in this case, the occurrence of RCTs would be associated only with different base frequencies in the first, second, and third positions of the triplets and would be random in nature. Thus, the known methods to search for IS in the presence of TP find all RCTs irrespectively of their origin. However, it would be interesting to identify RCTs that have appeared exclusively through evolutionary processes or the assembly of secondary DNA structures. For this purpose, it is necessary to use a search parameter other than triplet frequency.
In this study, I developed a mathematical method for identifying DNA regions that are statistically significantly enriched in RCTs in the presence of strongly expressed TP. The advantage of the new method is that it allows excluding the influence of TP on the number of RCTs because it employs a new mathematical measure for RCT detection. The measure is the difference between triplet frequencies and their expected number, which is determined taking into account the TP of the analyzed region. I also defined some parameters of the introduced mathematical measure. The developed algorithm is applicable to search for DNA regions longer than 300 DNA bases. I used the method to analyze the genomes of 42 bacteria representing all bacterial phyla and found that the number of RCT-containing regions ranged from several hundred to several thousand per genome, depending on the genome size; the average region length was approximately 850 DNA bases. I then created RCT classes for each studied bacterium and RCT groups for all the 42 bacterial species. The MM matrix for the most common RCT group included cells 1/A, 1/G, 2/A, 2/T, 3/T, 3/C, and 3/G with the matrix values greater than 3.0 (6.7, 45.4, 46.6, 46.9, 7.3, 39.5, and 3.7, respectively). With the exception of cell 3G, all other cells converted to themselves upon reverse complement DNA transformation. I hypothesize that the emergence of RCTs is associated with evolutionary processes such as inversions, transpositions, and recombinations.

2. Results

2.1. Search for Regions Enriched with RCT in the Bacterial Genome

The genome sequence of E. coli strain K-12 substr. MG1655 (ASM584v2) was obtained from https://bacteria.ensembl.org/index.html (accessed on 20 February 2026). The search for RCT-containing regions in the E. coli genome was performed according to the algorithm described in Section 4.3. This search was performed on the website http://victoria.biengi.ac.ru/mct (accessed on 20 February 2026).
Table 1 shows the number of the identified sequences with RCTs depending on threshold V0 (Section 4.3). The results indicated that at V0 = 4.5, α = N0/Nm ≈ 0.05. Therefore, I used V0 = 4.5 when searching for RCTs.
Let us consider an example of an RCT-containing region in the E. coli genome, which represents a sequence of 1200 bases (nucleotides 119,401–120,600) spanning the end of gene b0111 (119,281–120,135) and the beginning of gene b0112 (120,178–121,551), which encode AmpE and AroP proteins, respectively. For this sequence, Vmax(a) (Materials and Methods, Section 4.3) is equal to 5.7 and at R1 > 5.0 (Formula (1)) there are the following mirrored pairs of triplets: AAA/TTT, AAT/ATT, ATG/CAT, ACC/GGT, ACG/CGT, TAC/GTA, CAC/GTG, CAG/CTG, CGC/GCG, and GCC/GGC. For these triplet pairs, the x T 1 ( i ) x M T 1 ( i ) values are 10.1, 16.5, 23.5, 9.8, 21.3, 9.5, 10.0, 90.4, 22.1 and 8.8, respectively, and the probability of x T 1 ( i ) x M T 1 ( i ) > 5.0 for two normal distributions is less than 0.05.
I then searched for regions with RCTs in the genomes of bacteria from 42 phyla. All RCT-containing sequences identified in the bacterial genomes had a length that was a multiple of 3 bases. For each found sequence, I calculated x T 0 ( i ) and x M T 0 ( i ) as described for sequence S (Section 2.1). The use of k = 0 showed that I isolated triplets from the beginning of the sequence. I also calculated R ( i ) = x T 0 ( i ) x M T 0 ( i ) , which made it possible to determine pairs of triplets that formed RCT-enriched regions in the sequence because for them R(i) > 0. All sequences found in the bacterial genomes are presented in the Supplementary Materials, where for each bacterium, there is a directory named ‘Phyla’ including a directory named ‘Class’ which in turn contains the selection.txt file. In each of these files, the line beginning with KOK shows the number of the found sequence, KK and LL indicate the coordinates (beginning and end, respectively) of the sequence, lines starting with A=== and B=== show pairs of reverse complementary triplets, and Z and DL show the level of statistical significance Vmax(a) and the length of the sequence, respectively. The encoding (1 ≡ a, 2 ≡ t, 3 ≡ c, 4 ≡ g) is used; for example, triplet aaa is shown as 111. The field marked as RRR shows the R(i) values for the corresponding triplets, and the lines marked as CD1 and CD2 show the frequencies of triplets from sets T and MT. The threshold V0 was 4.5.
The data on the number of sequences with RCTs found in the 42 bacterial genomes are presented in Table 2. It can be seen that at V0 = 4.5, the rate of false positives was mostly within the interval from 1.6 to 21.5% (except Bacillus anthracis (42.7%) and Candidatus poribacteria bacterium (35.3%)) with an average of approximately 8%. This finding indicates the non-random presence of RCT regions in different bacterial genomes.
I analyzed the length distribution of the found RCT-enriched regions. The results for the genomes of E. coli, Victivallis vadensis, and Gemmatimonas aurantiaca are shown in Figure 1; similar results were obtained for the other bacterial genomes. The average length of a region containing RCTs ranged from 700 to 850 DNA bases.

2.2. Classes of RCT-Contaning Regions in the Bacterial Genomes

I created classes of sequences enriched with RCTs for each of the 42 bacteria, as described in Section 4.4. These results are summarized in the mirror.txt file consisting of directories for each bacterial phylum in Supplementary Materials. The file contains the following information. The lines beginning with SMAX and KMAX show the number of sequences with RCTs included in the created class and the central sequence number, respectively. The lines beginning with NOM contain the numbers of sequences from the selection.txt file that belong to the created class. The line beginning with KK contains the characteristics of the central sequence as in the selection.txt file (Section 4.3), including the coordinates (KK, LL), Vmax (Z), and length (DL). Lines beginning with M correspond to those in matrix MM (Section 4.4), and lines beginning with J contain the lines in matrix A(i,j) (Formula (9)). The lines beginning with A=== and B=== indicate pairs of reverse complementary triplets organized in a descending order of RRR(i) (Formula (11)).
From all mirror.txt files comprising RCT-enriched sequence classes for each bacterium, I created a single class_all.txt file in which the number of each bacterium is located after “start”, followed by a line titled “Phyla”. Then, from each matrix A of each class, I selected cells for which A(i,j) ≥ 3.0; the set of these cells shows the positions and bases with which the RCT-containing regions are enriched. Matrices A and M are included in the class_all1.txt file, and the set of positions in matrix A for which A(i,j) ≥ 3.0 is shown in the class_all.txt file in Supplementary Materials. For example, the first class from the Acidobacterium capsulatum genome contains 344 sequences (KOL = 344), and the set of positions with A(i,j) ≥ 3.0 are 1/T, 2/C, 2/G, and 3/A with values 19.6, 14.7, 15.3, and 20.9, respectively.
A total of 195 classes were created for the analyzed bacteria. For example, for the E. coli genome, six RCT-enriched classes containing 1549 sequences were created, which constitute about 90% of all sequences with RCTs found in the genome.
Table 3 shows that when inversion and complementary recoding were applied, positions 1/A, 1/C, 1/G, 2/A, 2/T, 3/T, 3/C, 3/G for which A(i,j) > 3.0 were preserved in triplets. The same happened in classes 2, 3, and 6. Classes 4 and 5 were also preserved though not completely; thus, position 1/C was lost in class 4 and 3/T, in class 5, but the remaining cells of the matrix were preserved.

2.3. Groups of RCT-Enriched Sequences for the Genomes of 42 Bacteria (Run_Class)

The groups of RCT-containing sequences created for different bacteria are included in the group100.txt file in Supplementary Materials. In this file, the line beginning with ‘Center’ contains the group number and the line beginning with ‘START’ incorporates the number of the central class and the number of classes included in the group (KOL(IM)). The next line (NAM) contains the name of the phyla where the central class was created; cells of this class, for which A(i,j) ≥ 3.0, are shown. Below, after ‘NAM1=’, the classes created in the class_all.txt file and included in the group are presented; each class begins with ‘K=’.
Table 4 shows the six groups of RCT-containing sequences from the 42 bacterial genomes. It can be seen that the first group comprised 77 classes and the central MM matrix was mirror-symmetrical with a complementary recoding of sequences, i.e., cells 1/A, 1/G, 2/A, 2/T, 3/T, and 3/C, where A(i,j) ≥ 3.0, did not change with recoding. There was only one exception, cell 3/G, but the matrix A value for this cell was quite small. All other groups shown in Table 4 had the same property.

2.4. Intersection of RCT-Enriched Sequences with Genomes, Mobile Elements, and Promoters

Intersection was considered if an RCT-enriched sequence was completely incorporated into the gene or, conversely, if the gene was completely incorporated into the RCT-enriched sequence. The number of such sequences was denoted as N. At the same time, I randomly shuffled the coordinates of the identified RCT-enriched sequences 200 times while maintaining their length and calculated the number of intersections for each shuffle, which was denoted as Nr. Then, I calculated the mean and variance for Nr, denoted as D(Nr) and N r ¯ , respectively, and determined Z = (N N r ¯ ). The results showed that for all the bacteria in Table 2, Z ≤ 3.0, indicating that no significant correlations existed between the genes and RCT-enriched sequences. The same procedure was performed for promoter sequences, and, similarly, Z was ≤3.0 for all the bacteria from Table 2. Finally, I examined the intersection of the RCT-enriched sequences with mobile elements of the E. coli genome, which are listed in file eschrichia_coli_str_k_12_substr_mg1655_gca_000005845.gff3 from https://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Info/Index/ (accessed on 20 February 2026). (for the other bacteria from Table 2, the coordinates of mobile elements are not specified). In this case, I were able to detect a correlation between the RCT-enriched sequences and the mobile element (Z = 4.0).

3. Discussion

In this work, I developed a new mathematical method for screening regions with IS in the genomes of bacteria. It has been shown that in bacterial genomes, approximately 80% are coding sequences characterized with TP [8] which can greatly influence triplet frequencies. My aim was to develop a mathematical method for finding regions with RCTs in the presence of TP. The results shown in Table 2 indicate that using my method, I could confidently identify RCT-containing regions in sequences with TP, as evidenced by a low rate of false positive results (8%) if random shuffling was performed with preservation of TP. It turns out that in addition to the factors leading to TP, there are those causing the formation of RCTs. Among them, evolutionary processes such as inversions, transpositions, and recombinations [18,19,20], as well as the generation of stem-loops [22], are the most instrumental in the emergence of RCTs; moreover, the binding of transcription factors could also promote RCT formation [13].
In my previous study, I have discovered RCTs in bacterial genomes during the search for DRs [6]. In 10 of the 12 examined bacterial species, regions enriched with RCTs have matrix MM for which cells 1G, 2A, 2T, and 3C contain values greater than 3.0; these fit completely into the first group of Table 4 here. In the other two species, matrix MM cells 1G, 2C, 3A, and 3T contain values greater than 3.0; such sequences belong to group 5 in Table 4 with a shift of one base (see Section 1). It can be stated that all RCT sequence classes found earlier were also identified in this work.
The method developed in this study is based on the approach to creating a sequence in which the formation of RCTs is associated with purely random factors (described in Section 4.2, step 4 and Section 4.3, step 7). In such a sequence, the correlations between adjacent DNA bases would be preserved but the formation of RCTs would be due to random factors. There are different ways to create such a sequence. First, an arbitrary sequence can be generated by random mixing of nucleotides. However, in this case Nm values in Table 2 and Table 3 would be significantly overestimated because bacterial genomes consist of 80% coding sequences which have TP [7,23,24], and it would not be possible to understand whether I detect the regions with a significant number of RCTs or it is just a feature of TP.
Second, it is possible to shuffle nucleotide pairs instead of individual nucleotides. In this case, only some of the correlations between adjacent bases in the sequence would be preserved. For example, let us consider the first group in Table 4, in which cell G{A/T}C has the highest value of matrix A. There are two phases of base pair extraction for sequence. In the first phase, I obtain pairs G{A/T}, {A/T}C, and CG, and their random shuffling would produce a sequence containing combinations of these pairs, which means that in such a sequence RCTs can be partially preserved and their formation cannot be attributed to purely random factors. As a result, the level of N0 would be increased and that of Nm decreased (Table 1). I verified this by randomly shuffling base pairs in the sequence in Section 4.2, step 4 and Section 4.3, step 7, which produced 1303 and 150 RCT-containing regions, respectively, for the E. coli genome (Table 1), indicating that this mixing method cannot be considered optimal.
Third, a scrambled sequence can be created using a Markov model constructed taking into account the observed correlations. However, this strategy is a mere expansion of the one described in the previous paragraph, and for such artificial sequences, the formation of RCTs also cannot be due to purely random factors.
In this study, my goal was to preserve the TP of the sequence which must be purely random. It was achieved by the method used in Section 4.2, step 4, and Section 4.3, step 7, when I split one nucleotide sequence into three according to the position in the triplet, then randomly shuffled each of the three sequences and reassembled the original one. Thus, this method preserves the TP of the sequence while detecting the regions with RCTs formed because of purely random factors.
It is also interesting to discuss the search for RCT-containing sequences in the genomes of 42 phylogenetically diverse bacteria (Table 2; here, V0 = 4.5 was chosen based on the E. coli genome analysis shown in Table 1). The results in Table 2 revealed that the ratio α = N0/Nm varied significantly: from 3.3% (Helicobacter pylori) to 42.7% (Bacillus anthracis), although most bacteria had α ≤ 10% and only few had large α values. Thus, the Vmax(a) distribution (Section 4.3, step 6) varies for different bacteria, which can lead to changes in SR0 and D0 (Section 4.2). Figure 2 and Figure 3 show SR0 and D0, which were calculated for the E. coli genome and were assumed to be the same for the other bacterial genomes. To test this assumption, I determined the SR0(L,N) and D0(L,N) spectra for different lengths L of the genomes of three bacteria. The results presented in Table 5 and Table 6 indicate that variations in SR0(L,N) and D0(L,N) did not exceed a few percent, which I thought was to be expected since only the sample size changed in the calculation of SR0(L,N) and D0(L,N).
I also analyzed the dependence of SR0(L,N) and D0(L,N) on the A + T content by comparing bacterial genomes with A + T from 32 to 74%. It turned out that there were variations in SR0(L,N) and D0(L,N) among the analyzed species (Table 5 and Table 6), which could result in different α values (Table 2). However, the average α for the 42 bacteria in Table 2 was 8% and such a comparatively low level of random noise cannot change the results shown in Table 3 and Table 4.
Previously, IS k-mers have been studied in complete genomes or in large sets of relatively short sequences. Thus, calculation of k-mer frequencies for 804 mitochondrial, 236 eubacterial, and 42 chloroplast genomes have been reported [17]. The advantage of the method described here is the possibility to find relatively short (from 300 to 1200 nucleotides) RCT-containing sequences and exclude the effect of TP, which allows attributing the nature of the found RCTs to evolutionary mechanisms.

4. Materials and Methods

4.1. Development of a Mathematical Parameter for Searching Sequences Enriched with Reverse Complementary Triplets

I took a bacterial genome of sequence G and length LG, and selected nucleotide sequence S within genome G, which has start coordinates equal to a and length L. Then, I created 64 triplets starting from position a + k (k can be 0, 1, and 2) to the end of sequence S, assuming that the first base of the first triplet was at base k of sequence S and the triplets were selected without overlap. The 64 triplets were divided into two equal groups so that each triplet in one group was reverse complementary to that in the other group. The first group was denoted as T = {aaa, aat, aac, aag, ata, atc, atg, aca, act, acc, acg, aga, agc, agg, taa, tac, tag, ttc, ttg, tca, tcc, tcg, tgc, tgg, cac, cag, ctc, ccc, ccg, cgc, gac, gcc} and the second as MT = {ttt, att, gtt, ctt, tat, gat, cat, tgt, agt, ggt, cgt, tct, gct, cct, tta, gta, cta, gaa, caa, tga, gga, cga, gca, cca, gtg, ctg, gag, ggg, cgg, gcg, gtc, ggc}. For each triplet in the two groups, I calculated x T k ( i ) and x M T k ( i ) —the arguments of the normal distribution, which allow us to evaluate the abundance of the triplet in sequence S. To obtain x T k ( i ) and x M T k ( i ) , I determined frequencies f(i) and probabilities p(i) = f(i)/L of bases in sequence S, where i is a DNA base. The probability of detecting a triplet from set T in sequence S was calculated as PT(t) = p(i)p(j)p(k), where i, j, and k are DNA bases located at the first, second, and third triplet positions, respectively. Then, the average number of triplets t in sequence S would be n ¯ ( t ) = ( L / 3 ) P T ( t ) , variance D(t) = (L/3)PT(t)(1 − PT(t)), and x T k ( i ) = ( n ( t ) n ¯ ( t ) ) / D ( t ) , where n(t) is the number of triplets t in sequence S. x M T k ( i ) for set MT was calculated in the same way as x T k ( i ) for set T. Finally, I calculated Rk:
R k = i = 1 32 x T k ( i ) x M T k ( i )
Each element of the sum x T k ( i ) x M T k ( i ) is the product of two independent normal variables, the probability density of which can be described by a modified Bessel function of the second kind K0 [28]. The larger the value of K0, the lower the probability that x T k ( i ) and x M T k ( i ) are independent of each other. For this reason, it is convenient to take Rk as a measure of the independence of x T k ( i ) and x M T k ( i ) and for all i from 1 to 32.

4.2. Determination of Mean Values SRk(L,N) and Variance Dk(L,N) for Rk

If I assume that sequence S is purely random, then each product x T k ( i ) x M T k ( i ) for i from 1 to 32 has normal product distribution [27]. However, bacterial genomes cannot be considered purely random sequences, as there is strong correlation between neighboring DNA bases, which can be observed as TP [8] and which can significantly change distributions Rk (k = 0, 1, 2). Therefore, I determined average value SRk(L,N) and variance Dk(L,N) for Rk depending on TP values and L of sequence S.
To calculate TP, I created periodic sequence S1 of length L, which contained numbers 123123123…, and filled matrix M(3,4) as M(s1(i),s(i)) = M(s1(i),s(i)) + 1, where i varies from 1 to L. Mutual information I [29] was determined using the following formula:
I = i = 1 3 j = 1 4 m ( i , j ) ln m ( i , j ) i = 1 3 x ( i ) ln x ( i ) j = 1 4 y ( j ) ln y ( j ) + L ln L
where x ( i ) = j = 1 4 m ( i , j ) and y ( j ) = i = 1 3 m ( i , j ) , and the argument of normal distribution Z was calculated as
Z = (4I)0.5 − (11.0)0.5
Z ranges from −3.0 to +3.0 for sequences with no significant correlation between adjacent bases but increases with the increase in the correlation. To take this into account in the calculation of Rk (k = 0, 1, 2), I introduced seven intervals N for Z (Table 7). The intervals were chosen to minimize the number of false positives when searching for RCT-enriched regions in the E. coli genome; each interval also contained a sufficient number of regions to calculate mean value SRk(L,N) and variance Dk(L,N), which were necessary to introduce a quantitative parameter indicating statistically significant presence of RCTs in the sequence. I used the deviation of Rk from the mean divided by the square root of the variance as such a parameter. SRk(L,N) and Dk(L,N) were calculated according to the Algorithm 1.
Algorithm 1. The algorithm for calculating SRk(L,N) and Dk(L,N) is shown here
Step 1Let us introduce array SPk(L,N,J) and set its initial values to zero. In total, I considered 31 intervals for L (from 300 to 1200 bases with a step of 30) and 7 intervals N for Z; J was the ordinal number.
Step 2First, I chose k = 0, J = 0, and L = 300 bases. Subsequently, I used L values that were multiples of three bases.
Step 3Sequence S was taken from the genome of bacterium G starting from base a + k.
Step 4I calculated J = J + 1 and shuffled sequence S while preserving TP in order to maintain the correlations between DNA bases among randomly selected triplets. To do this, I created three sequences, S1, S2, and S3, where bases were selected according to their number in sequence S; these sequences contained bases located at positions 3n + 1, 3n + 2, and 3n + 3 of sequence S, respectively (here, n = 1, 2, 3… as long as 3n + 3 < L). Then, each of the three sequences was randomly shuffled and recombined into a single sequence S. Thus, the formation of RCTs could be attributed only to random factors.
Step 5Rk was calculated, and interval N corresponding to Z was determined from Table 7. Then, SPk(L,N,J) = Rk.
Step 6Sequence S was shifted by L bases from the beginning of the genome (a = a + L), and I returned to step 4. However, if J > int(LG/L) (LG is the length of the bacterial genome), then the end of the bacterial genome was reached, and I proceeded to step 7.
Step 7I added 30 bases to L, set J to 0, and went back to step 3. If L > 1200, I proceeded to step 8.
Step 8I took k = k + 1 and went back to step 1. If k > 2, I proceeded to step 9.
Step 9I calculated SRk(L,N) and Dk(L,N), which are the mean and variance, respectively, of Rk for each interval Z (Table 7) and all values of L.
As an example, Figure 4 shows the distribution of SP0(600,1,J)/Sum1 and SP0(600,4,J)/Sum2 for all J (step 6 above) that fall within the interval from R0 to R0 + 1 in the R0 range from 0 to 200. These quantities is shown in the figure as N. Sum1 and Sum2 are the sum of SP0(600,1,J) and SP0(600,4,J) for all J, respectively. It can be seen that Rk strongly depended on the TP level in sequence S. Figure 2 shows SR0(L,1) and SR0(L,4) and Figure 3D0(L,1) and D0(L,4) obtained for the complete E. coli genome. The results indicated that SRk(L,N) and Dk(L,N) did not depend on k, i.e., SRk(L,N) and Dk(L,N) were equal within the statistical error for all three k values. Full results of the calculations of SRk(L,N) and Dk(L,N) are presented in the sred.txt file.

4.3. Searching the E. coli Genome for Sequences Enriched in RCTs

As mentioned in Section 4.1, a indicates the starting coordinate of sequence S in the E. coli genome. To search for DNA sequences enriched in RCTs, I used V m k , which is the maximum of Vk(l1,l2) for all values of coordinates l1 and l2 of subsequence Seqk(l1 + k,l2) in sequence S. Vk(l1,l2) was calculated as
Vk(l1,l2) = {Rk(l1,l2) − SRk(l2l1 + 1,N)}/{Dk(L,N)0.5}
where k can take values 0, 1, and 2, and N is the number of the interval from Table 7, into which the TP sequence in coordinates from l1 to l2 falls.
The calculations were performed using Algorithm 2. The graphical diagram of the Algorithm 2 is shown in Figure 5.
Algorithm 2. Here is an algorithm that was used to search for DNA sequences enriched in RCTs
Step 1Sequence S with a length of 1200 bases was selected from the beginning of the E. coli genome starting from a.
Step 2Subsequence Seqk(l1,l2) was selected in sequence S. k was initially equal to 0, coordinate l1 varied from base k + 1 to base 900 + k of sequence S, and coordinate l2 varied from l1 + 300 to 1200 with a step of 30 bases. Mutual information I was calculated for each sequence Seqk(l1,l2) using Formula (2), and Z(l1,l2) was calculated using Formula (3) (Section 4.2). Then, interval number N was determined for Z(l1,l2) in Table 7 and used to calculate Rk according to Formula (1), which allowed determination of V(l1,l2) for sequence Seqk(l1,l2) according to Formula (4).
Step 3I selected coordinates l1, l2 for which the maximum value of Vk(l1,l2) was achieved; these coordinates were termed l1m, l2m and the maximum V m k . Then, k = k + 1 and all calculations were redone starting from step 2. If k > 2, I proceeded to step 4.
Step 4As a result, I obtained three values ( V m 1 , V m 2 , V m 3 ) and selected the largest of them denoted as Vmax(a), whereas the corresponding k was denoted as kmax.
Step 5I took a = a + 600 and proceeded to step 1. If a was greater than LG − 1200, I proceeded to step 6.
Step 6As a result of the calculations, I obtained vector Vmax(a) where a varied from 1 to LG-1200. Then, I entered a threshold value for Vmax(a) denoted as V0 and selected all a where Vmax(a) > V0; the number of these values was denoted as Nm.
Step 7All the calculations in steps 1–6 were repeated with the following changes. In step 2, I randomly shuffled Seqk(l1,l2) as I did above in the calculation of SRk(L,N) and dispersion Dk(L,N) (Section 2.2, step 4). During such shuffling, triplets were changed but TP was preserved. In this case, Z(l1,l2) was influenced only by the correlation of bases and the number of RCTs was related solely to random factors. Here, I also selected all a where Vmax(a) > V0 and denoted the number of such values as N0.
Then, I calculated α = N0/Nm. As a result, I obtained the values of Vmax(a), number of sequences Nm enriched with RCTs, the coordinates of the found sequences (a, l1m, l2m), and kmax.

4.4. Classification of Regions Enriched with RCTs in the Bacterial Genome

I also classified the RCT-enriched regions found in the bacterial genome in order to see which triplets occurred in these regions as reverse complementary. To do this, I used the Algorithm 3. The graphical diagram of the algorithm is shown in Figure 6.
Algorithm 3. An algorithm for classification of regions enriched with RCTs in the bacterial genome is shown here
Step 1.Let us have Nm regions containing RCTs in the bacterial genome and calculate x T k ( i ) и x M T k ( i ) for each of them as described in Section 4.1. Since I were classifying the already found sequences, it could be assumed that for them k = 0, i.e., triplets were determined from the first base of the sequence. Then, for each sequence I calculated sum X i 2 = i = 1 32 ( x T 0 ) 2 + i = 1 32 ( x M T 0 ) 2 and normalized Xi2 to 100.0 in order to exclude the influence of the sequence length on x T 0 ( i ) and x M T 0 ( i ) . For this, I introduced correction factor sk = (100.0/Xi2)0.5 and then used y T ( i ) = s k × x T 0 ( i ) and y M T ( i ) = s k × x M T 0 ( i ) for classification.
Step 2Matrix W(Nm, Nm) was filled without filling in the main diagonal. The elements of the matrix were sums
W ( j , k ) = i = 1 32 y T j ( i ) y T k ( i ) + i = 1 32 y M T j ( i ) y M T k ( i )
where indices j (from 1 to Nm) and k (from 1 to Nm) corresponded to the two sequences enriched with RCTs.
Step 3I set threshold W0 = 40.0; for all lesser values, W = 0.0 and l was taken as 1.
Step 4For each j, I calculated sum T(j) for all k = 1, …, Nm, where T(j) = T(j) + 1 only if W(j,k) ≠ 0.0; here, T(j) shows the number of sequences that are similar to sequence j, provided that W ≥ 40.0. I selected such jm where T(jm) would have the greatest value. The set of all k for which W(jm,k) ≠ 0.0 would form the class of sequences G(l) enriched with RCTs; sequence numbers would be recorded in G(l). Then, I set all rows and columns of matrix W(j,k), whose numbers were included in set G(l), equal to zero.
Step 5If G(l) ≥ 20, then l = l + 1, and I proceeded to step 4. If G(l) < 20, then I stopped creating classes. As a result, I obtained l classes, numbers jm indicating the number of the sequence around which the class was formed, and numbers of sequences included in the class, which are recorded in set G.
The threshold of 20 was chosen based on grouping the RCT-enriched regions, provided that all x T k ( i ) and x M T k ( i ) (step 1) were randomly shuffled. In total, I created 100 sets, each containing randomly shuffled x T k ( i ) and x M T k ( i ) for each of the Nm regions found in the E. coli genome. Figure 7 shows the distribution of the number of classes created depending on threshold l for the 100 sets; obviously, at l ≤ 1 100 classes are created. The results indicate that it was not possible to obtain a single class with more than 20 RCT-enriched regions, which means that in the bacterial genome the number of classes formed purely at random does not exceed 1%.
I also created matrix MM(3,4) for each class, which is the sum of matrices M(3,4) calculated for each sequence belonging to the class (Section 4.2). For matrix MM, I calculated
x ( i ) = j = 1 4 m m ( i , j )
y ( j ) = i = 1 3 m m ( i , j )
L = i = 1 3 j = 1 4 m m ( i , j )
Next, I calculated matrix B(i,j) = x(i)y(j)/L and P(i,j) = x(i)y(j)/((L)2); the variance of the number of bases in cell mm(i,j) of the matrix was found using formula D(i,j) = LP(i,j)(1 − P(i,j)). Then, I obtained matrix
A(i,j) = (mm(i,j) − b(i,j))/(D(i,j)0.5)
I also calculated x T 0 ( i ) and x M T 0 ( i ) for each created class. To do this, I determined f(i) and probabilities p(i) = f(i)/L for each sequence in the class, where f(i) is the frequency of bases a, t, c, g and L is the total length of all sequences in the class. The probability of detecting a triplet from set T in sequence S was calculated as PT(t) = p(i)p(j)p(k), where i, j, and k are the DNA bases in the first, second, and third positions of the triplet, respectively. Then, the average number of triplets t in all sequences of the class was equal to variance D(t) = (L/3)PT(t)(1 − PT(t)) and
x T ( i ) = ( n ( t ) n ¯ ( t ) ) / D ( t )
where n(t) is the number of triplets t in all sequences from the class. xMT(i) for set MT was calculated in the same way as xM(i).
As a result, I obtained:
RRR(i) = xM(i) xMT(i)
This parameter made it possible to estimate the contribution of each triplet and its mirror copy to the formation of a sequence with RCTs.

4.5. Creation of RCT-Enriched Sequence Groups for the Genomes of 42 Bacteria

Then, I created groups of matrices MM found for 42 bacterial species; the set of these matrices was denoted as BM. Each matrix MM was calculated for the central sequence of a class that contained RCT-enriched sequences existing in the bacterial genome (Section 4.4). To create matrix groups, I first calculated the maximum value of sum L2 (Section 4.4) for all MM matrices of all bacteria, which turned out to be 4.12 × 105; the purpose of this step was to exclude the influence of MM matrix sizes on group creation. Each element of matrix MM was recalculated as mm(i,j) = 4.12 × 105mm(i,j)/L2, and for each of the recalculated matrices I determined matrix A(i,j) using Formulas (6)–(11). A(i,j) shows the bases in triplet positions that are expected to be more frequent than in a randomly shuffled sequence; the set of A(i,j) matrices was denoted as BA.
Thus, each matrix MM from set BM corresponded to matrix A from set BA, which means that the classification of matrices MM can be performed using that of matrices A. To do this, I employed the similarity measure of two matrices, Ak(i,j) and Al(i,j):
R ( k , l ) = i = 1 , 3 j = 1 4 A k ( i , j ) A l ( i , j )
where k and l vary from 1 to 195 which is a total number of classes of RCT-enriched sequences in bacterial genomes created in this work.
For each matrix A from set BA, I created matrix Arand by randomly shuffling the elements of matrix A and denoted this set of matrices as BArand. I also introduced matrix Rmax(k,l) = 0 for all k and l. Then, I compared matrices Ak(i,j) and Al(i,j) using Formula (5) and stored the resulting number R(k,l) in matrix Rmax(k,l), if R(k,l) > Rmax(k,l). These calculations were performed for all k and l. After that, I generated a new set BArand and again compared the two matrices Ak(i,j) and Al(i,j) using Formula (5). This cycle was repeated 100 times. As a result, I obtained number Rmax(k,l) for each pair of matrices k and l from set BA, which was recorded in the group_r100.txt file.
Next, I grouped matrices from set BA using the Algorithm 4.
Algorithm 4. Here is the algorithm that I used to create of RCT-enriched sequence groups for the genomes of 42 bacteria
Step 1Let t = 0. I filled vector Ri(k), where the number of matrices with R(k,l) > Rmax(k,l) (l from 1 to 195) was counted for each matrix k from set BA and denoted as N(k). The numbers of these matrices were recorded in matrix Nom(k,i), i = 1, …, N(k).
Step 2I selected the value of k for which N(k) was maximal and set t = t + 1; thus, the first group of matrices Gr(t) was created.
Step 3I took N(k) = 0 and excluded from matrix Nom all matrix numbers listed in Nom(k,i), i = 1, …, N(k). The exclusion was performed according to index k and the numbers contained in Nom(k,i) for all i.
Step 4If Gr(t) > 10.0, I proceeded to step 1, and if Gr(t) ≤ 10.0, the creation of classes was stopped. For each class, I obtained the number of central matrix A and a set of matrices that belonged to this group. I also recorded the cells in the central matrix with the values ≥ 3.0. The condition Gr(t) ≤ 10.0 was due to the fact that if the cells within matrices Ak(i,j) and Al(i,j) were randomly mixed, the size of the classes Gr(t) did not exceed 6. Therefore, the classification performed can be considered non-random.

5. Conclusions

I developed a method to search bacterial genomes for sequences enriched in reverse complementary triplets (RCTs). This method makes it possible to find RCTs in the presence of strongly expressed triplet periodicity characteristic for coding sequences. I analyzed 42 genomes of bacteria from all phyla and found that the number of RCT-enriched sequences ranged from hundreds to thousands per genome depending on the genome size. The average length of the region with RCTs was about 850 DNA bases. The most statistically significant inversion symmetry (IS) pattern was the enrichment of triplet positions 1, 2, and 3 with bases {A,G}, {A,T}, and {T,C}, respectively. When the sequence was rotated 180 degrees and the bases were replaced with complementary ones (IS), this enrichment of triplet positions was preserved. I suggest that the emergence of IS may be a result of evolutionary processes such as inversions, transpositions, and recombinations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms27052301/s1.

Funding

This research received no external funding.

Data Availability Statement

All received data are in Supplementary Materials. The calculations were done on the website http://victoria.biengi.ac.ru/mct (accessed on 20 February 2026).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Ejigu, G.F.; Jung, J. Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing. Biology 2020, 9, 295. [Google Scholar] [CrossRef] [PubMed]
  2. Bielińska-Wąż, D.; Wąż, P.; Błaczkowska, A.; Mandrysz, J.; Lass, A.; Gładysz, P.; Karamon, J. Mathematical Modeling in Bioinformatics: Application of an Alignment-Free Method Combined with Principal Component Analysis. Symmetry 2024, 16, 967. [Google Scholar] [CrossRef]
  3. Liao, X.; Zhu, W.; Zhou, J.; Li, H.; Xu, X.; Zhang, B.; Gao, X. Repetitive DNA sequence detection and its role in the human genome. Commun. Biol. 2023, 6, 954. [Google Scholar] [CrossRef] [PubMed]
  4. Snowbarger, J.; Koganti, P.; Spruck, C. Evolution of Repetitive Elements, Their Roles in Homeostasis and Human Disease, and Potential Therapeutic Applications. Biomolecules 2024, 14, 1250. [Google Scholar] [CrossRef]
  5. Korotkov, E.; Suvorova, Y.; Kostenko, D.; Korotkova, M. Search for Dispersed Repeats in Bacterial Genomes Using an Iterative Procedure. Int. J. Mol. Sci. 2023, 24, 10964. [Google Scholar] [CrossRef]
  6. Korotkov, E.V. Mirror Complementary Triplet Periodicity of Dispersed Repeats in Bacterial Genomes. Symmetry 2025, 17, 549. [Google Scholar] [CrossRef]
  7. Wang, L.; Stein, L.D. Localizing triplet periodicity in DNA and cDNA sequences. BMC Bioinform. 2010, 11, 550. [Google Scholar] [CrossRef]
  8. Suvorova, Y.M.; Korotkov, E.V. Study of triplet periodicity differences inside and between genomes. Stat. Appl. Genet. Mol. Biol. 2015, 14, 113–123. [Google Scholar] [CrossRef]
  9. Chargaff, E.; Lipshitz, R.; Green, C. Composition of the desoxypentose nucleic acids of four genera of sea-urchin. J. Biol. Chem. 1952, 195, 155–160. [Google Scholar] [CrossRef]
  10. Chargaff, E.; Zamenhof, S.; Green, C. Human Desoxypentose Nucleic Acid: Composition of Human Desoxypentose Nucleic Acid. Nature 1950, 165, 756–757. [Google Scholar] [CrossRef]
  11. Shporer, S.; Chor, B.; Rosset, S.; Horn, D. Inversion symmetry of DNA k-mer counts: Validity and deviations. BMC Genom. 2016, 17, 696. [Google Scholar] [CrossRef]
  12. Kafri, A.; Chor, B.; Horn, D. Inter-chromosomal k-mer distances. BMC Genom. 2021, 22, 644. [Google Scholar] [CrossRef]
  13. Coons, L.A.; Burkholder, A.B.; Hewitt, S.C.; McDonnell, D.P.; Korach, K.S. Decoding the Inversion Symmetry Underlying Transcription Factor DNA-Binding Specificity and Functionality in the Genome. iScience 2019, 15, 552–591. [Google Scholar] [CrossRef]
  14. Baisnée, P.F.; Hampson, S.; Baldi, P. Why are complementary DNA strands symmetric? Bioinformatics 2002, 18, 1021–1033. [Google Scholar] [CrossRef]
  15. Yang, Z.; Li, H.; Jia, Y.; Zheng, Y.; Meng, H.; Bao, T.; Li, X.; Luo, L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol. Biol. 2020, 20, 157. [Google Scholar] [CrossRef]
  16. Mitchell, D.; Bridge, R. A test of Chargaff’s second rule. Biochem. Biophys. Res. Commun. 2006, 340, 90–94. [Google Scholar] [CrossRef]
  17. Nikolaou, C.; Almirantis, Y. Deviations from Chargaff’s second parity rule in organellar DNA Insights into the evolution of organellar genomes. Gene 2006, 381, 34–41. [Google Scholar] [CrossRef] [PubMed]
  18. Cristadoro, G.; Degli Esposti, M.; Altmann, E.G. The common origin of symmetry and structure in genetic sequences. Sci. Rep. 2018, 8, 15817. [Google Scholar] [CrossRef] [PubMed]
  19. Albrecht-Buehler, G. Asymptotically increasing compliance of genomes with Chargaff’s second parity rules through inversions and inverted transpositions. Proc. Natl. Acad. Sci. USA 2006, 103, 17828–17833. [Google Scholar] [CrossRef]
  20. Okamura, K.; Wei, J.; Scherer, S.W. Evolutionary implications of inversions that have caused intra-strand parity in DNA. BMC Genom. 2007, 8, 160. [Google Scholar] [CrossRef] [PubMed]
  21. Prabhu, V.V. Symmetry observations in long nucleotide sequences. Nucleic Acids Res. 1993, 21, 2797. [Google Scholar] [CrossRef]
  22. Bell, S.J.; Forsdyke, D.R. Accounting Units in DNA. J. Theor. Biol. 1999, 197, 51–61. [Google Scholar] [CrossRef]
  23. Jiang, M.; Ning, W.; Wu, S.; Wang, X.; Zhu, K.; Li, A.; Li, Y.; Cheng, S.; Song, B. Three-nucleotide periodicity of nucleotide diversity in a population enables the identification of open reading frames. Brief. Bioinform. 2022, 23, bbac210. [Google Scholar] [CrossRef]
  24. Sánchez, J. 3-base periodicity in coding DNA is affected by intercodon dinucleotides. Bioinformation 2011, 6, 327–329. [Google Scholar] [CrossRef][Green Version]
  25. Marçais, G.; Elder, C.S.; Kingsford, C. k-nonical space: Sketching with reverse complements. Bioinformatics 2024, 40, btae629. [Google Scholar] [CrossRef] [PubMed]
  26. Deorowicz, S.; Debudaj-Grabysz, A.; Grabowski, S. Disk-based k-mer counting on a PC. BMC Bioinform. 2013, 14, 160. [Google Scholar] [CrossRef] [PubMed]
  27. Moeckel, C.; Mareboina, M.; Konnaris, M.A.; Chan, C.S.Y.; Mouratidis, I.; Montgomery, A.; Chantzi, N.; Pavlopoulos, G.A.; Georgakopoulos-Soares, I. A survey of k-mer methods and applications in bioinformatics. Comput. Struct. Biotechnol. J. 2024, 23, 2289–2303. [Google Scholar] [CrossRef] [PubMed]
  28. Normal Product Distribution—From Wolfram MathWorld. Available online: https://mathworld.wolfram.com/NormalProductDistribution.html (accessed on 23 December 2025).
  29. Kullback, S. Statistics and Information Theory; John Wiley & Sons: New York, NY, USA, 1959. [Google Scholar]
Figure 1. Length distribution for RCT-containing regions in the genomes of E. coli (circles), Victivallis vadensis (triangles), and Gemmatimonas aurantiaca (squares).
Figure 1. Length distribution for RCT-containing regions in the genomes of E. coli (circles), Victivallis vadensis (triangles), and Gemmatimonas aurantiaca (squares).
Ijms 27 02301 g001
Figure 2. SR0(L,1) and SR0(L,4) depending on the length of sequence S from the E. coli genome (Section 4.2, step 9). Black and white circles indicate the first and fourth intervals, respectively.
Figure 2. SR0(L,1) and SR0(L,4) depending on the length of sequence S from the E. coli genome (Section 4.2, step 9). Black and white circles indicate the first and fourth intervals, respectively.
Ijms 27 02301 g002
Figure 3. D0(L,1) and D0(L,4) depending on length L of sequence S from the E. coli genome (Section 4.2, step 9). Black and white circles indicate the first and fourth intervals, respectively.
Figure 3. D0(L,1) and D0(L,4) depending on length L of sequence S from the E. coli genome (Section 4.2, step 9). Black and white circles indicate the first and fourth intervals, respectively.
Ijms 27 02301 g003
Figure 4. The distribution of SP0(600,1,J)/Sum1 (black circle) and SP0(600,4,J)/Sum2 (white circle) for all J (step 6 above) that fall within the interval from R0 to R0 + 1 in the R0 range from 0 to 200. These quantities are shown in the figure as N. Sum1 and Sum2 are the sum of SP0(600,1,J) and SP0(600,4,J) for all J, respectively. R0 shows the intervals for SP0(600,1,J) and SP0(600,4,J).
Figure 4. The distribution of SP0(600,1,J)/Sum1 (black circle) and SP0(600,4,J)/Sum2 (white circle) for all J (step 6 above) that fall within the interval from R0 to R0 + 1 in the R0 range from 0 to 200. These quantities are shown in the figure as N. Sum1 and Sum2 are the sum of SP0(600,1,J) and SP0(600,4,J) for all J, respectively. R0 shows the intervals for SP0(600,1,J) and SP0(600,4,J).
Ijms 27 02301 g004
Figure 5. A graphical diagram of the Algorithm 2 for finding sequences enriched in RCTs is shown in this figure. A detailed description of the algorithm is shown in Section 4.3.
Figure 5. A graphical diagram of the Algorithm 2 for finding sequences enriched in RCTs is shown in this figure. A detailed description of the algorithm is shown in Section 4.3.
Ijms 27 02301 g005
Figure 6. A graphical diagram of the Algorithm 3 for the classifying of regions enriched with RCTs in the bacterial genome is showm in this figure. A detailed description of the Algorithm 3 is shown in Section 4.4.
Figure 6. A graphical diagram of the Algorithm 3 for the classifying of regions enriched with RCTs in the bacterial genome is showm in this figure. A detailed description of the Algorithm 3 is shown in Section 4.4.
Ijms 27 02301 g006
Figure 7. Distribution of classes according to the number of sequences included. Here, l is the threshold value for the number of sequences in a class, and K is the number of classes with the number of sequences equal to l (Section 4.4, step 4).
Figure 7. Distribution of classes according to the number of sequences included. Here, l is the threshold value for the number of sequences in a class, and K is the number of classes with the number of sequences equal to l (Section 4.4, step 4).
Ijms 27 02301 g007
Table 1. Number of regions (Nm) enriched with RCTs in the E. coli genome (N0 is the number of false positives and V0 is the threshold level).
Table 1. Number of regions (Nm) enriched with RCTs in the E. coli genome (N0 is the number of false positives and V0 is the threshold level).
V03.03.54.04.55.05.56.0
Nm303426172180172613481040772
N015806762588735135
Table 2. The number of sequences with RCTs in the genomes of bacteria from different phyla for V0 = 4.5.
Table 2. The number of sequences with RCTs in the genomes of bacteria from different phyla for V0 = 4.5.
NPhylaSpeciesNmN0p (%)Genome Size, 106 bp
1AcidobacteriotaAcidobacterium capsulatum1055928.74.1
2ActinomycetotaActinomyces israelii3085818.82.6
3AquificotaThermosulfidibacter takaii614355.71.8
4ArmatimonadotaChthonomonas calidirosea3025618.53.4
5AtribacterotaCandidatus Atribacteria bacterium4306515.12.7
6BacillotaBacillus anthracis35115042.75.2
7BacteroidotaBacteroides fragilis62410016.03.5
8BalneolotaBalneola sp.8548610.04.0
9BdellovibrionotaBdellovibrio bacteriovorus1337836.23.8
10CaldisericotaCandidatus poribacteria bacterium35712635.34.9
11CalditrichotaCalditrichaeota bacterium1509785.23.6
12CampylobacterotaHelicobacter pylori1042353.31.6
13ChlamydiotaChlamydia muridarum316288.91.1
14ChlorobiotaChlorobium chlorochromatii1107585.22.6
15ChloroflexotaDehalogenimonas lykanthroporepellens606437.11.7
16ChrysiogenotaDesulfurispirillum indicum1136625.43.0
17CoprothermobacterotaCoprothermobacter proteolyticus267269.71.5
18DeferribacterotaCalditerrivibrio nitroreducen4635812.32.3
19DeinococcotaTruepera radiovictrix658649.73.3
20DictyoglomotaDictyoglomus thermophilum725567.72.0
21ElusimicrobiotaElusimicrobium minutum1107423.81.6
22FibrobacterotaFibrobacter succinogenes832667.93.6
23FusobacteriotaFusobacterium necrogenes3694010.82.0
24GemmatimonadotaGemmatimonas aurantiaca2120984.64.7
25IgnavibacteriotaIgnavibacterium album11381059.23.7
26KiritimatiellotaKiritimatiellaceae_bacterium2073918.81.7
27LentisphaerotaVictivallis vadensis23991456.05.2
28MycoplasmatotaSpiroplasma poulsonii4187217.22.3
29MyxococcotaLentisphaerae bacterium18881125.94.3
30NitrososphaerotaNitrosopumilus_maritimus792465.81.6
31NitrospinotaNitrospina gracilis_1105595.33.1
32NitrospirotaNitrospira moscoviensis10651029.64.6
33PlanctomycetotaGemmata obscuriglobus129620816.09.0
34PseudomonadotaSalmonella enterica2090753.55.0
35RhodothermotaRhodothermaceae bacterium4559821.54.7
36SpirochaetotaLeptospira interrogans25791696.55.1
37SynergistotaAnaerobaculum hydrogeniforman2570431.62.1
38TermoproteotaCrenarchaeota_archaeon4717215.23.5
39ThermodesulfobacteriotaSulfolobus acidocaldarius2564517.52.3
40ThermomicrobiaThermomicrobium roseum1513624.13.0
41ThermotogotaPetrotoga mobilis2694717.41.8
42VerrucomicrobiotaChlamydia trachomatis5217213.82.4
Table 3. Classes of sequences with RCTs found in the E. coli genome.
Table 3. Classes of sequences with RCTs found in the E. coli genome.
Class Number Number of Sequences in the Class Cells in Matrix A (Section 4.5) That Are Greater than 3.0 (Matrix A Values Shown in Brackets)
112611/A (6.9), 1/C (10.4), 1/G (29.3), 2/A (41.7), 2/T (43.0), 3/T (5.5), 3/C (31.3), 3/G (10.2)
21161/A (7.1), 1/G (49.8), 2/A (34.9), 2/T (28.9), 3/T (17.4), 3/C (36.7)
3701/T (43.2), 1/C (3.3), 2/C (23.9), 2/G (22.6), 3/A (36.0), 3/G (8.2)
4501/C (12.4), 1/G (9.9), 2/A (21.7), 2/T (10.1), 3/C (6.1)
5311/C (16.0), 1/G (28.7), 2/A (41.3), 2/T (29.0), 3/T (14.2), 3/C (12.4), 3/G (9.6)
6211/G (16.1), 2/A (19.7), 2/T (7.9), 3/C (20.4)
Table 4. Groups of sequences with RCT created for 195 classes from 42 bacteria.
Table 4. Groups of sequences with RCT created for 195 classes from 42 bacteria.
Group NumberCentral Class NumberNumber of Classes in the GroupCells in Matrix A (Section 4.5) Greater than 3.0 (Matrix A Values Shown in Brackets)
1153771/A (6.7), 1/G (45.4), 2/A (46.6), 2/T (46.9), 3/T (7.3), 3/C (39.5), 3/G (3.7)
243361/T (44.5), 2/C (38.1), 2/G (48.4), 3/A (38.9)
374291/A (6.5), 1/G (26.5), 2/C (14.6), 3/T (8.8), 3/C (7.2)
4154151/T (33.5), 1/C (11.3), 2/C (18.1), 2/G (18.7), 3/A (31.9), 3/G (12.4)
5166111/C (13.4), 1/G (6.5), 2/A (9.0), 2/T (12.2), 3/G (14.1)
6108101/A (13.2), 1/G (18.4), 3/T (5.6), 3/C (28.2)
Table 5. SR0(L,4) calculated for several lengths L of three bacterial genomes with different A + T composition.
Table 5. SR0(L,4) calculated for several lengths L of three bacterial genomes with different A + T composition.
L30039048057066075084093010201100
E. coli
A + T = 50%
60.080.093.6102.8110.7117.0123.9125.9136.6143.1
Truepera radiovictrix
A + T = 32%
73.289.7100.6109.2116.1123.1128.2135.5142.3148.3
Spiroplasma poulsonii
A + T = 74%
59.074.485.596.3104.4110.8120.2126.1132.0140.2
Table 6. D0(L,4) calculated for several lengths L of three bacterial genomes with different A + T composition.
Table 6. D0(L,4) calculated for several lengths L of three bacterial genomes with different A + T composition.
L30039048057066075084093010201100
E. coli
A + T = 50%
24.029.633.634.437.839.640.241.542.845.0
T. radiovictrix
A + T = 32%
32.037.640.839.641.043.442.645.846.646.8
S. poulsonii
A + T = 74%
25.031.832.634.638.038.740.243.044.045.2
Table 7. Z levels of TP, for which SRk(L,N) and Dk(L,N) were calculated separately.
Table 7. Z levels of TP, for which SRk(L,N) and Dk(L,N) were calculated separately.
Interval Number N1234567
Z range−5–22–44–66–1010–1414–2020–100
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Korotkov, E.V. Regions Enriched with Reverse Complement Triplets in Bacterial Genomes. Int. J. Mol. Sci. 2026, 27, 2301. https://doi.org/10.3390/ijms27052301

AMA Style

Korotkov EV. Regions Enriched with Reverse Complement Triplets in Bacterial Genomes. International Journal of Molecular Sciences. 2026; 27(5):2301. https://doi.org/10.3390/ijms27052301

Chicago/Turabian Style

Korotkov, Eugene V. 2026. "Regions Enriched with Reverse Complement Triplets in Bacterial Genomes" International Journal of Molecular Sciences 27, no. 5: 2301. https://doi.org/10.3390/ijms27052301

APA Style

Korotkov, E. V. (2026). Regions Enriched with Reverse Complement Triplets in Bacterial Genomes. International Journal of Molecular Sciences, 27(5), 2301. https://doi.org/10.3390/ijms27052301

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop