Article

Multiple Factors Drive Replicating Strand Composition Bias in Bacterial Genomes

1 Center of Bioinformatics, Key Laboratory for NeuroInformation of the Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
2 Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu 610054, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Int. J. Mol. Sci. 2015, 16(9), 23111-23126; https://doi.org/10.3390/ijms160923111
Submission received: 27 July 2015 / Revised: 18 August 2015 / Accepted: 18 September 2015 / Published: 23 September 2015
(This article belongs to the Special Issue Microbial Genomics and Metabolomics)

Abstract

Deviation from Chargaff’s second parity rule (PR2), known as composition bias, has long been observed in sequenced genomes and is believed to be strongly related to the replication process in microbial genomes. However, some disagreement remains about the underlying cause of strand composition bias. We performed an integrative analysis of various genomic features that might influence composition bias using a large-scale dataset of 1111 genomes. Our results indicate that (1) the bias was stronger in obligate intracellular bacteria than in the other species (p-value = 0.0305); (2) Fusobacteria and Firmicutes had the highest average bias among the 24 microbial phyla analyzed; (3) the strength of selected codon usage bias and generation times were not observably related to strand composition bias (p-value = 0.3247); (4) significant negative relationships were found between composition bias and GC content, genome size, rearrangement frequency, and Clusters of Orthologous Groups (COG) functional subcategories A, C, I, and Q (p-values < 1.0 × 10^−8); (5) gene density and COG functional subcategories D, F, J, L, and V were positively related to composition bias (p-value < 2.2 × 10^−16); and (6) gene density made the most important contribution to composition bias, indicating that transcriptional bias was strongly associated with strand composition bias. Therefore, strand composition bias was found to be influenced by multiple factors with varying weights.

1. Introduction

The DNA replication process produces two identical DNA molecules from one original DNA molecule. The leading strand is synthesized continuously in the same direction as the growing replication fork, whereas the lagging strand is synthesized as short, separate Okazaki fragments that are then joined to form a continuous strand [1]. According to Chargaff’s second parity rule (PR2), a single DNA strand globally contains approximately equal numbers of complementary bases (A ≈ T and G ≈ C) when there is no strand bias caused by mutation or selection [2]. After PR2 bias caused by mutation was found between the leading and lagging strands in echinoderm and vertebrate mitochondrial genomes [3], the same phenomenon was found in an increasing number of genomes [4,5,6,7,8,9,10,11]. These biases consistently showed that the leading strand had more G than C and, to a lesser extent, more T than A, whereas in the lagging strand the bias was in the opposite direction [9,12,13].
Many researchers found that the strand bias was related to the replication process, because the asymmetric replication mechanism causes base mutations to accumulate differently on the two strands [1,2,6,14,15]. Watson–Crick base pairing protects cytosine from deamination in double-stranded DNA [16,17]. However, the two strands must be separated temporarily during replication. In single-stranded DNA, cytosine is more prone to deamination and conversion to thymine, which contributes to the composition bias in genomes [16]. Researchers have found that other factors may lead to asymmetry of DNA, such as thymine dimers [18], nonsense mutations [11,16], twofold degenerate sites of cytosine [13,19], and nucleotide usage in twofold as well as fourfold degenerate sites at third codon positions [20]. Other researchers suggested that the strand composition bias was associated with the transcription process [21,22]. Mutation and repair frequencies differ between coding and non-coding regions of genomes, and most genes are located on the leading strand [1,23]. Hence, given the gene orientation bias, the transcription process could also induce composition bias between the two replicating strands.
Thus, the mechanisms underlying nucleotide composition bias are still open to debate. In this work, we selected 1111 microbial genomes to study a number of factors that may affect strand composition bias, using a quantitative analysis approach.

2. Results and Discussion

2.1. Composition Bias in Obligate Intracellular Bacteria

Extremely strong strand composition bias has been reported in 11 bacteria, among which seven are obligate intracellular parasites [8]. The strong bias means that genes have significantly different base and codon usages between the two replicating strands [24,25,26]. Obligate intracellular bacteria live permanently in their hosts, which helps to protect them against some DNA damage [7]. Thus, during their long-term evolution, some DNA repair genes would have been lost and mutations would have accumulated, resulting in the strand composition bias that has been reported.
In this work, we analyzed the composition bias in obligate intracellular bacteria using a broader range of genomes than has been used previously. Among the 1111 genomes that we downloaded from the NCBI FTP site (see Section 3.1 for details), 83 bacteria were confirmed as obligate intracellular. The species names and accession numbers are listed in Table S1. The average Score_{composition bias} (see Section 3.2 for details) of the 83 obligate intracellular bacteria (0.0433) was significantly higher than that of the other bacteria (0.0362) (t-test, p-value = 0.0305), and 40 of the 83 genomes were among the 258 top-scoring genomes (top quarter). However, the top 10 genomes were not from obligate intracellular bacteria. Thus, the Score_{composition bias} of obligate intracellular bacteria was higher on the whole than that of the other species, but was not always high for an individual genome.

2.2. Composition Bias in Different Bacterial Phyla

We separated the 1111 microbial genomes into 24 phyla and plotted the Score_{composition bias} for each phylum (Figure 1); the variance, standard deviation, and average Score_{composition bias} are given in Table 1. Fusobacteria, which are obligately anaerobic, non-spore-forming Gram-negative bacteria [27], had the highest average Score_{composition bias}. Firmicutes had the second highest average Score_{composition bias}, which is in accord with a previous study that found that strand-biased gene distribution was stronger in Firmicutes than in other bacteria [28]. To explore other features that may affect composition bias at the phylum level, we compared the size, GC content, and rearrangement frequencies of the Fusobacteria and Firmicutes genomes and found that these three features were smaller than the average values for all the other bacterial genomes; however, the gene densities in these two phyla were larger than the average values for all the other bacteria (Table 2). We reconstructed the phylogenetic tree of the 24 phyla (Figure 2) and found that the Fusobacteria and Firmicutes phyla had the closest relationship. They also had the two highest Score_{composition bias} values (0.100 and 0.071). We also found that several other closely related clades had similar Score_{composition bias} values, such as Gemmatimonadetes, Planctomycetes, and Acidobacteria. This suggests that phylogenetic relationship is one of the determining factors of strand composition bias in bacterial genomes.
Figure 1. Box-and-whisker plots of the composition bias of all genomes, sorted into 24 phyla. The bottom and top of each box mark the first and third quartiles, and the band inside the box denotes the median. The ends of the whiskers represent the lowest datum still within 1.5 IQR (interquartile range) of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile. Any datum not included between the whiskers is plotted as an outlier with a small circle. The plot depicts the distribution of bias in each phylum.
Figure 2. The phylogenetic tree of the 24 phyla. N is the total number of strains in a phylum; M is the average Score_{composition bias} in a phylum.
Table 1. Strand composition bias for each phylum a.
| Phylum | Standard Deviation | Variance | Mean |
|---|---|---|---|
| Acidobacteria | 0.005309 | 2.82 × 10^−5 | 0.009124 |
| Actinobacteria | 0.015749 | 0.000248 | 0.018728 |
| Aquificae | 0.005263 | 2.77 × 10^−5 | 0.00957 |
| Bacteroidetes | 0.016805 | 0.000282 | 0.027048 |
| Chlamydiae | 0.012521 | 0.000157 | 0.055526 |
| Chlorobi | 0.018046 | 0.000326 | 0.051947 |
| Chloroflexi | 0.015056 | 0.000227 | 0.024993 |
| Cyanobacteria | 0.021638 | 0.000468 | 0.019847 |
| Deferribacteres | 0.007318 | 5.36 × 10^−5 | 0.051752 |
| Deinococcus-Thermus | 0.007668 | 5.88 × 10^−5 | 0.015442 |
| Dictyoglomi | 0.010132 | 0.000103 | 0.052093 |
| Elusimicrobia | 0.030697 | 0.000942 | 0.052418 |
| Fibrobacteres | NA | NA | 0.056901 |
| Firmicutes | 0.028571 | 0.000816 | 0.071236 |
| Fusobacteria | 0.048886 | 0.00239 | 0.099682 |
| Gemmatimonadetes | NA | NA | 0.020857 |
| Nitrospirae | NA | NA | 0.013445 |
| Planctomycetes | 0.012161 | 0.000148 | 0.023082 |
| Proteobacteria | 0.017163 | 0.000295 | 0.028607 |
| Spirochaetes | 0.046978 | 0.002207 | 0.062153 |
| Synergistetes | 0.012306 | 0.000151 | 0.047907 |
| Tenericutes | 0.023255 | 0.000541 | 0.030599 |
| Thermotogae | 0.004126 | 1.70 × 10^−5 | 0.016197 |
| Verrucomicrobia | 0.005228 | 2.73 × 10^−5 | 0.029585 |
a All genomes are grouped by phylum. NA indicates that the phylum contains only one species. Fusobacteria had the highest mean bias value, followed by Firmicutes.
Table 2. Mean value of various biological characters for each phylum a.
| Phylum | Genome Size | GC Content | Gene Density | gcRF | taRF |
|---|---|---|---|---|---|
| Acidobacteria | 6,581,121.33 | 0.602611 | 0.524179 | 0.546299 | 0.239179 |
| Actinobacteria | 4,434,386.26 | 0.647473 | 0.591745 | 0.655926 | 0.5707 |
| Aquificae | 1,680,594.86 | 0.3874153 | 0.514286 | 0.026764 | 0.090473 |
| Bacteroidetes | 3,688,038.52 | 0.4246355 | 0.553854 | 0.035009 | 0.101365 |
| Chlamydiae | 1,265,852.44 | 0.4046721 | 0.544713 | 0.022567 | 0.081014 |
| Chlorobi | 2,618,734.27 | 0.5079388 | 0.583907 | 0.061015 | 0.114787 |
| Chloroflexi | 2,435,937.54 | 0.5531583 | 0.519221 | 0.044977 | 0.063278 |
| Cyanobacteria | 3,397,176.98 | 0.4460103 | 0.508569 | −0.33356 | −0.55797 |
| Deferribacteres | 2,728,233 | 0.3682745 | 0.642415 | 0.012609 | 0.057666 |
| Deinococcus-Thermus | 2,411,100.11 | 0.66285 | 0.517812 | −0.10793 | −0.12243 |
| Dictyoglomi | 1,907,773.5 | 0.3384917 | 0.681195 | 0.01941 | 0.055101 |
| Elusimicrobia | 1,384,709.5 | 0.3757977 | 0.726988 | 0.014904 | 0.078649 |
| Fibrobacteres | 3,842,635 | 0.4805184 | 0.580603 | 0.047916 | 0.088216 |
| Firmicutes | 3,077,249.49 | 0.3853 | 0.786812 | 0.020021 | 0.081354 |
| Fusobacteria | 2,680,383 | 0.29141 | 0.72341 | 0.01046 | 0.05595 |
| Gemmatimonadetes | 4,636,964 | 0.6427436 | 0.566455 | 0.043068 | 0.055612 |
| Nitrospirae | 2,003,803 | 0.341289 | 0.552386 | 0.019141 | 0.07548 |
| Planctomycetes | 6,254,950 | 0.5550987 | 0.502151 | 0.116125 | 0.138471 |
| Proteobacteria | 3,506,416.55 | 0.5337785 | 0.569934 | 0.067462 | 0.135439 |
| Spirochaetes | 1,702,653.17 | 0.3721947 | 0.600467 | 0.021591 | 0.121083 |
| Synergistetes | 1,914,533 | 0.5454971 | 0.75006 | 0.023406 | 0.050368 |
| Tenericutes | 892,007.889 | 0.2794737 | 0.665323 | −0.02018 | −0.08702 |
| Thermotogae | 1,976,742.36 | 0.4028872 | 0.54724 | 0.024232 | 0.083806 |
| Verrucomicrobia | 3,998,507 | 0.5480856 | 0.51413 | 0.093882 | 0.10771 |
| Mean | 3,329,265.48 | 0.4952767 | 0.612158 | 0.092191 | 0.127667 |
a The mean genome size, GC content, and rearrangement frequency of Fusobacteria and Firmicutes are all smaller than the corresponding means for all genomes, whereas the opposite is true for gene density.

2.3. Composition Bias in Genomes with Different S Values

Selection and mutation are two primary factors that generate bias in species’ genomes during evolution. These two factors may generate biases that partially counteract each other. The S value measures the strength of selected codon usage bias and can be used as an indicator of selection bias [29]. Replicating strand composition bias can be considered to represent mutation bias. Thus, we used the S values reported by Sharp et al. [29] for 80 bacterial genomes and examined their correlation with the Score_{composition bias} of the same 80 genomes. We found no significant correlation between them (Spearman’s correlation, ρ = −0.08604675, p-value = 0.3247). Hence, we suggest that selection and mutation may influence genome bias through different mechanisms, such that codon usage bias may counteract strand composition bias.

2.4. Composition Bias in Genomes with Different Generation Times

Microbial generation times range from a few minutes to several weeks and are affected by evolutionary factors such as environmental stability, nutrient availability, and community diversity. Vieira-Silva and Rocha found that codon usage bias was correlated with growth rates [30]. Hence, we explored the relationship between generation time and Score_{composition bias}. The bacterial generation time data were extracted from the paper by Vieira-Silva and Rocha [30]. Our results indicated that generation time also was not significantly related to Score_{composition bias} (Spearman’s correlation, ρ = −0.1457365, p-value = 0.1021). The explanation may be the same as that proposed for the S value.

2.5. Composition Bias in Genomes with Different Genome Sizes

The average sizes of the genomes in the Fusobacteria and Firmicutes phyla are smaller than the average size of the genomes in all the bacterial phyla examined. We found a significant negative correlation between genome size and Score_{composition bias} (Spearman’s correlation, ρ = −0.2508015, p-value < 2.2 × 10^−16). This finding is similar to the results of Guo and Ning [7], who found that the genome sizes of 11 bacteria with extremely strong strand composition biases were all smaller than 2000 kb. Guo and Ning speculated that the repair mechanism might be inefficient in small bacterial genomes that had undergone reductive evolution [7]. Additionally, mutation pressure may be insufficient to surpass translational selection in larger genomes.

2.6. Composition Bias in Genomes with Different Gene Densities of the Leading Strand

With the availability of a large number of complete genome sequences, it has become increasingly clear that the unequal distribution of genes between leading and lagging strands varies widely among species. Numerous studies have shown that genes are generally preferentially located on the leading strand [31,32,33,34], which may be explained by the polymerase collision avoidance model [1].
We calculated the density of leading strand genes for all 1111 genomes. Our correlation analysis showed that gene density was strongly and positively correlated with Score_{composition bias} (Spearman’s correlation, ρ = 0.6273871, p-value < 2.2 × 10^−16). This result could be caused by mutation bias arising during transcription, when DNA is transiently separated into single strands, because DNA mutation and repair rates differ markedly between transcribed and non-transcribed strands. Because most protein-coding genes are located on the leading strand, the two replication strands can have extremely different compositions [21]. Thus, the asymmetric transcription process is likely to have a major impact on the composition bias between the two replication strands.

2.7. Composition Bias in Genomes with Different GC Contents

GC content is the percentage of bases in a DNA sequence that are guanine or cytosine. The GC content of bacterial genomes ranges from about 20% to 70% [35]. We investigated the correlation between GC content and Score_{composition bias} and found a significant negative correlation between them (Spearman’s correlation, ρ = −0.5026315, p-value < 2.2 × 10^−16). A possible explanation is that genomes with high GC content generate fewer mutations than those with low GC content [36]. This relationship further suggests that replicating strand composition bias is caused by a complex set of factors.

2.8. Composition Bias in Genomes with Different Recombination Rates

Chromosomal recombination occurs as a result of deletions, duplications, inversions, and translocations in native chromosomes. Rocha [1] has shown that the recombination rate is related to strand composition bias, and has suggested that codon usage separation may be caused by low recombination rates in some obligate intracellular parasites. Wei and Guo confirmed this suggestion in 11 obligate intracellular bacteria with strong strand composition bias using the Z-curve method [24].
Here, we explored this issue in the 1111 genomes. The recombination rates (taRF, gcRF) of each genome were calculated as described in Section 3.3. Then, the correlations between Score_{composition bias} and both taRF and gcRF were estimated for all the genomes. We found that taRF and gcRF were both negatively associated with Score_{composition bias} (Spearman’s correlations, ρ_gcRF = −0.3746862, ρ_taRF = −0.2916134, both p-values < 2.2 × 10^−16).
Rocha suggested that frequent chromosomal recombination would reduce strand composition bias [1]. The base distribution along any one strand is consistent; that is, if G > C in a particular region, then a similar base distribution will also be found in other regions of the same chromosome. However, recombination would break this consistency and reduce strand composition bias.

2.9. Composition Bias in Different COG Functional Categories

To determine whether gene function has an impact on strand composition bias, we explored the relationship between Clusters of Orthologous Groups (COG) functional categories and composition bias for the first time.

2.9.1. Percentage of Gene Number for Each COG Functional Subcategory

To explore the influence of each COG subcategory on composition bias, the correlation between the percentage of each COG functional subcategory (pCOG_i; see Section 3.4 for details) and the corresponding Score_{composition bias} was analyzed across genomes. The results, summarized in Table 3, were considered statistically significant if the p-value was <1.0 × 10^−8. Based on this cutoff value, the pCOGs of the A, C, I, and Q subcategories were negatively related to Score_{composition bias}, and those of the D, F, J, L, and V subcategories were positively related to Score_{composition bias}.
Klasson and Andersson have studied gene function and composition bias [37]. They found that strong asymmetric mutation bias in endosymbiont genomes caused them to lack replication restart genes (subcategory L). Guo and Ning reported that genes associated with replication initiation and re-initiation such as mutH, dnaT, and fis were absent in 11 obligate intracellular bacterial genomes with extreme strand composition bias [7]. However, in our analysis of the 1111 genomes we detected some replication initiation and re-initiation genes, and COG subcategory L and composition bias were positively correlated. This is an interesting finding that we explore further in Section 2.9.2. Rocha and Danchin [38] reported some obligate parasitic bacteria with strong composition bias in which genes associated with energy metabolism were absent. This finding is largely in accord with our result that the metabolism-related genes (subcategories C, I, and Q) were all negatively correlated with composition bias, except those in subcategory F.
Table 3. The correlation of each Clusters of Orthologous Groups (COG) functional subcategory and strand composition bias.
| COG | Functional Category | Sig. | p Value | Correlation |
|---|---|---|---|---|
| | Information Storage and Processing | | | |
| J | Translation, ribosomal structure and biogenesis | P | 8.11 × 10^−32 | 0.341886 |
| A | RNA processing and modification | N | 2.44 × 10^−13 | −0.21728 |
| K | Transcription | | 0.099239 | −0.04948 |
| L | Replication, recombination and repair | P | 1.01 × 10^−8 | 0.170797 |
| B | Chromatin structure and dynamics | | 0.002404 | −0.09097 |
| | Cellular Processes and Signaling | | | |
| D | Cell cycle control, cell division, chromosome partitioning | P | 1.05 × 10^−45 | 0.407564 |
| Y | Nuclear structure | | 0.222949 | 0.036592 |
| V | Defense mechanisms | P | 3.93 × 10^−14 | 0.224269 |
| T | Signal transduction mechanisms | | 1.77 × 10^−7 | −0.15589 |
| M | Cell wall/membrane/envelope biogenesis | | 0.609835 | −0.01533 |
| N | Cell motility | | 0.198305 | 0.038623 |
| Z | Cytoskeleton | | 0.006632 | −0.0814 |
| W | Extracellular structures | | 0.901043 | −0.00373 |
| U | Intracellular trafficking, secretion, and vesicular transport | | 0.908091 | 0.003467 |
| O | Posttranslational modification, protein turnover, chaperones | | 0.188347 | −0.0395 |
| | Metabolism | | | |
| C | Energy production and conversion | N | 4.51 × 10^−11 | −0.1959 |
| G | Carbohydrate transport and metabolism | | 0.193919 | 0.039003 |
| E | Amino acid transport and metabolism | | 0.417676 | −0.02434 |
| F | Nucleotide transport and metabolism | P | 5.99 × 10^−39 | 0.377498 |
| H | Coenzyme transport and metabolism | | 0.01405 | 0.073666 |
| I | Lipid transport and metabolism | N | 1.22 × 10^−19 | −0.26737 |
| P | Inorganic ion transport and metabolism | | 0.081681 | −0.05226 |
| Q | Secondary metabolites biosynthesis, transport and catabolism | N | 6.65 × 10^−40 | −0.38194 |
N denotes significantly negative correlation between subcategories and composition bias. P denotes significantly positive correlation between subcategories and composition bias.

2.9.2. Proportion of Replication and Repair Genes

The correlation between subcategory L and composition bias that we obtained is opposite to what has been found previously. To explore this result further, we collected the replication and repair genes from the KEGG Pathway database and divided them into 10 subtypes (for details see Section 3.7) based on their functions. The correlations between the percentage of genes in each subtype and the Score_{composition bias} are shown in Table 5. The correlation coefficients were positive for all subtypes, and the nucleotide excision repair and mismatch repair subtypes had the highest correlations. We suspect that genomes with strong composition bias may have generated more repair genes to balance the composition bias during evolution. However, the cause-and-effect relationship between repair genes and composition bias is still not clear; that is, it is unknown which is the cause and which is the effect.

2.9.3. Average Value of Times between Strong-Biased Group and Weak-Biased Group for Each Functional Subcategory

The average Diff_{SBG/WBG} (see Section 3.5 for details) for all COG subcategories is shown in Table 4. Subcategory D had the highest value (5.709) among all the subcategories, which indicated that genes involved in cell cycle control, cell division, and chromosome partitioning were considerably more abundant in the strong-biased genomes (i.e., the genomes with the top 555 Score_{composition bias} values). This result is in accordance with Lin et al. [39], who found that only some essential COG subcategories were situated preferentially on the leading strand and that subcategory D genes showed the most significant bias among 10 strand-biased classifications. Furthermore, the proportions of both the strong-biased COG groups (SCOGs) and weak-biased COG groups (WCOGs) in all 1111 genomes were significantly related to Score_{composition bias} (Spearman’s correlation, ρ_SCOG = 0.51473 and ρ_WCOG = −0.65945, both p-values < 2.2 × 10^−16). We suggest that although the essential subcategories are similar in number across genomes, they tend to be located on the leading strand, resulting in strong composition bias. In small genomes, the percentages of essential subcategories are higher than in large genomes, leading to stronger composition bias in small genomes.
Table 4. Average value of discrepant times (AVDT) between strong-biased group and weak-biased group for each functional subcategory in descending order.
| COG | AVDT | COG | AVDT |
|---|---|---|---|
| D | 5.709197 | C | 1.086021 |
| K | 3.415376 | H | 1.053758 |
| N | 2.848684 | F | 1.046122 |
| T | 2.229241 | V | 1.02066 |
| M | 2.181872 | E | 0.99786 |
| O | 2.089135 | I | 0.936222 |
| U | 2.013089 | P | 0.914553 |
| G | 1.472415 | A | 0.864394 |
| L | 1.363586 | Z | 0.775298 |
| B | 1.266486 | Q | 0.64794 |
| J | 1.23429 | W | 0.6 |
Table 5. Relationship between each type of replication and repair genes and composition bias.
| Pathway | Function | p Value | Correlation |
|---|---|---|---|
| ko03030 | DNA replication | 3.69 × 10^−10 | 0.18656 |
| ko03032 | DNA replication proteins | 6.70 × 10^−9 | 0.172841 |
| ko03036 | Chromosome and associated proteins | 3.28 × 10^−7 | 0.152472 |
| ko03400 | DNA repair and recombination proteins | 6.73 × 10^−10 | 0.183808 |
| ko03410 | Base excision repair | 2.11 × 10^−6 | 0.141724 |
| ko03420 | Nucleotide excision repair | 4.15 × 10^−12 | 0.2059713 |
| ko03430 | Mismatch repair | 9.39 × 10^−12 | 0.2025802 |
| ko03440 | Homologous recombination | 1.16 × 10^−10 | 0.191753 |
| ko03450 | Non-homologous end-joining | 0.926821 | 0.002759 |
| ko03460 | Fanconi anemia pathway | 0.002531 | 0.090509 |

2.10. Conjoint Analysis of Multiple Factors and Composition Bias by Principal Component Regression

We determined the independent contribution of each genomic feature to composition bias by principal component regression. Here, we selected only the features that were significantly related to strand composition bias (p-values < 1.0 × 10^−8). The replication and repair genes were not considered separately because they belong to COG subcategory L. The individual contributions are presented in detail in Table 6. The results show that, of the total contribution of all the features (R² = 0.5104), gene density (R² = 0.064778) made the largest contribution to strand composition bias. Thus, gene orientation bias was the primary factor that influenced base composition among the biological features tested.
Table 6. Principal component regression analysis of various genomic features a.
| Genomic Feature | R² |
|---|---|
| Genome Size | 0.0558 |
| Gene Density | 0.0648 |
| GC Content | 0.0391 |
| gcRF | 0.0004 |
| taRF | 0.0003 |
| SCOGs | 0.0332 |
| WCOGs | 0.0326 |
| A | 0.0122 |
| C | 0.0634 |
| D | 0.0348 |
| F | 0.0272 |
| I | 0.0238 |
| J | 0.0299 |
| L | 0.0371 |
| Q | 0.0262 |
| V | 0.0297 |
a Detail values for each of the genomic features and strand composition bias are listed in Table S2.

3. Experimental Section

3.1. Data Source

We retrieved 1111 bacterial genome sequences from the NCBI FTP site in September 2010. Among them, 76 bacteria had multiple strains and hence the 1111 bacteria belonged to only 703 species. We used all bacterial genomes that had been sequenced at that time rather than a sample of them.
The origins and termini of DNA replication were obtained from the DoriC database [40] in July 2011. This information was used to assign genes to the leading and lagging strands.
The genes related to DNA repair and replication were extracted from the KEGG Pathway database [41] in April 2013.
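The strand assignment described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' pipeline: the function names, the 0-based coordinate convention, and the use of the gene start to decide the replichore are all assumptions made here. The sketch treats the chromosome as circular and calls a gene leading-strand when its coding direction matches the direction of fork movement. The fraction it reports is one plausible reading of the leading-strand gene density used in Section 2.6.

```python
from typing import Iterable, Tuple


def on_leading_strand(gene_start: int, gene_strand: str,
                      origin: int, terminus: int, genome_length: int) -> bool:
    """Return True if a gene is co-oriented with replication fork movement."""
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    # Distance from the origin when walking toward increasing coordinates.
    gene_offset = (gene_start - origin) % genome_length
    terminus_offset = (terminus - origin) % genome_length
    # Replichore copied by the fork that moves toward increasing coordinates.
    in_forward_replichore = gene_offset < terminus_offset
    # '+' genes lead in the forward replichore, '-' genes lead in the other one.
    return (gene_strand == "+") == in_forward_replichore


def leading_strand_fraction(genes: Iterable[Tuple[int, str]],
                            origin: int, terminus: int, genome_length: int) -> float:
    """Fraction of genes, given as (start, strand), that lie on the leading strand."""
    gene_list = list(genes)
    if not gene_list:
        raise ValueError("at least one gene is required")
    leading = sum(on_leading_strand(start, strand, origin, terminus, genome_length)
                  for start, strand in gene_list)
    return leading / len(gene_list)


# Hypothetical toy chromosome: 1000 bp, origin at 100, terminus at 600.
toy_genes = [(200, "+"), (200, "-"), (800, "-"), (50, "+")]
toy_fraction = leading_strand_fraction(toy_genes, 100, 600, 1000)
```

Genes that span the origin or terminus would need special handling that this sketch omits.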

3.2. Computation of Strand Composition Bias

Strand composition bias of a whole genome was obtained as:
$$\mathrm{Score}_{\text{composition bias}} = \frac{|G - C| + |T - A|}{\text{Chromosome Length}}$$
where G, C, T, and A are the numbers of corresponding bases in leading strands. According to the principle of complementary base pairing, strand composition bias in lagging strands is equal to that of the leading strand.
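As an illustration of this score, the following Python sketch computes it from a sequence that is assumed to already represent the leading strand of the whole chromosome. The function name and the string input are assumptions for the example; assembling the leading-strand sequence from the two replichores is not shown.

```python
from collections import Counter


def composition_bias_score(leading_strand_seq: str) -> float:
    """Return (|G - C| + |T - A|) / length for a leading-strand sequence."""
    seq = leading_strand_seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)  # missing bases count as zero
    g, c, t, a = counts["G"], counts["C"], counts["T"], counts["A"]
    return (abs(g - c) + abs(t - a)) / len(seq)


# Toy sequence: G=3, C=1, T=3, A=1 over 8 bases gives (2 + 2) / 8.
toy_score = composition_bias_score("GGGCTTTA")
```

A sequence that obeys PR2 exactly, such as "ACGT", scores 0 under this definition.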

3.3. Computation of Counteracting Effect of Recombination

Strand composition bias was measured by the mean value of GC + TA. Recombination may change the natural order of nucleotides, thereby counteracting part of the usual bias and ultimately lowering the strength of the overall bias. We introduced two values, gcRF and taRF, to roughly reflect this effect of recombination. gcRF was calculated as:
$$\overline{gcBias} = \frac{\sum_{i=1}^{N} \frac{G_i - C_i}{L_i}}{N}$$
$$gcRF = \frac{\sum_{i=1}^{N} \left( \frac{G_i - C_i}{L_i} - \overline{gcBias} \right)^2}{(N - 1) \times \overline{gcBias}}$$
and taRF was calculated as:
$$\overline{taBias} = \frac{\sum_{i=1}^{N} \frac{T_i - A_i}{L_i}}{N}$$
$$taRF = \frac{\sum_{i=1}^{N} \left( \frac{T_i - A_i}{L_i} - \overline{taBias} \right)^2}{(N - 1) \times \overline{taBias}}$$
where G_i, C_i, T_i, and A_i are the numbers of corresponding bases in the ith leading strand gene; L_i is the length of the corresponding gene; and N is the total number of genes in the leading strand. Usually, the higher the two values, the higher the frequency of counteracting recombination.
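A small Python sketch may help make the calculation concrete. It assumes the formula is the sample variance of the per-gene bias divided by its mean, with the per-gene bias taken as the signed difference of two base counts over gene length. The function name and the tuple layout are hypothetical.

```python
from typing import Sequence, Tuple


def rearrangement_frequency(genes: Sequence[Tuple[int, int, int]]) -> float:
    """Variance-to-mean ratio of per-gene bias.

    Each gene is (x_count, y_count, length): use (G, C, L) for gcRF
    and (T, A, L) for taRF.
    """
    n = len(genes)
    if n < 2:
        raise ValueError("at least two genes are required")
    per_gene_bias = [(x - y) / length for x, y, length in genes]
    mean_bias = sum(per_gene_bias) / n
    if mean_bias == 0:
        raise ZeroDivisionError("mean bias is zero, so the ratio is undefined")
    squared_deviation = sum((b - mean_bias) ** 2 for b in per_gene_bias)
    return squared_deviation / ((n - 1) * mean_bias)


# Toy genes with per-gene GC bias 0.2, 0.1 and 0.0 (mean 0.1).
toy_gcrf = rearrangement_frequency([(30, 10, 100), (20, 10, 100), (10, 10, 100)])
```

Because the mean bias is signed, the ratio is negative whenever the mean is negative, which is consistent with the negative gcRF and taRF entries in Table 2.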

3.4. Computation of the Percentage of Each COG Functional Subcategory

The percentage of each COG functional subcategory (pCOG) was calculated as:
$$pCOG_i = \frac{N_{COG_i}}{N_{COG}}, \quad i = A \sim Z, \ \text{except } R, S, X$$
where i is the ith subcategory and N_{COG_i} is the number of genes in the ith subcategory in a genome. N_{COG} is the total number of genes within all the COG subcategories.
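The percentage calculation can be sketched as follows. The input format (one single-letter COG code per gene) and the choice to drop R, S, and X from the denominator as well as the numerator are assumptions of this example, since the text does not state how the total is counted.

```python
from collections import Counter
from typing import Dict, Iterable

EXCLUDED = {"R", "S", "X"}  # subcategories left out of the analysis


def pcog(gene_categories: Iterable[str]) -> Dict[str, float]:
    """Return the share of genes in each retained COG subcategory."""
    counts = Counter(code for code in gene_categories if code not in EXCLUDED)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


# Toy genome with two J genes, one L gene and one excluded R gene.
toy_pcog = pcog(["J", "J", "L", "R"])
```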

3.5. Computation of Average Value of Differences between Strong-Biased Group and Weak-Biased Group for Each Functional Subcategory

We grouped the genomes with the top 555 Score_{composition bias} values as the strong-biased group (SBG) and the remaining genomes as the weak-biased group (WBG), and counted the number of genes in each COG subcategory for all the species in each group separately. For each COG, we defined an indicator, Diff_{SBG/WBG}, to measure the difference between the two groups as:
$$Diff_{SBG/WBG} = \frac{N_{SBG}}{N_{WBG}}$$
where N_{SBG} is the number of genes in each COG subcategory in the SBG, and N_{WBG} is the number of genes in each COG subcategory in the WBG.
Finally, we defined another indicator, Diff_{COG}, for each COG functional subcategory as:
$$Diff_{COG_i} = \frac{\sum_{j=1}^{N} Diff_{SBG/WBG_j}}{N}, \quad i = A \sim Z, \ \text{except } R, S, X$$
where i is the ith subcategory of the 23 COG functional subcategories; j is the jth gene in the ith subcategory; and N is the total number of genes in the ith subcategory.
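One way to read these two indicators is sketched below in Python. It assumes that the ratio is taken per COG entry and then averaged within each functional subcategory; the identifiers, the dictionary inputs, and the decision to skip entries with no genes in the weak-biased group are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict


def diff_cog(sbg_counts: Dict[str, int], wbg_counts: Dict[str, int],
             subcategory_of: Dict[str, str]) -> Dict[str, float]:
    """Average the SBG/WBG gene-count ratio within each COG subcategory."""
    ratios = defaultdict(list)
    for cog_id, n_sbg in sbg_counts.items():
        n_wbg = wbg_counts.get(cog_id, 0)
        if n_wbg == 0:
            continue  # ratio undefined; skipping is an assumption of this sketch
        ratios[subcategory_of[cog_id]].append(n_sbg / n_wbg)
    return {sub: sum(values) / len(values) for sub, values in ratios.items()}


# Hypothetical counts for three made-up COG entries.
toy_sbg = {"cog_a": 10, "cog_b": 4, "cog_c": 3}
toy_wbg = {"cog_a": 2, "cog_b": 4, "cog_c": 6}
toy_map = {"cog_a": "D", "cog_b": "D", "cog_c": "Q"}
toy_diff = diff_cog(toy_sbg, toy_wbg, toy_map)
```

In the toy data, subcategory D averages the ratios 5.0 and 1.0, and subcategory Q has the single ratio 0.5.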

3.6. Proportion of SCOGs and WCOGs

Subcategories with Diff_{COG} > 5 were defined as strong-biased COG groups (SCOGs), and subcategories with Diff_{COG} < 0.2 were defined as weak-biased COG groups (WCOGs). Then, the proportions of SCOGs and WCOGs in each genome were calculated.
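The thresholding and per-genome proportions can be sketched as follows. The function name, the inputs, and the toy values are hypothetical; only the two cutoffs (5 and 0.2) come from the text.

```python
from typing import Dict, Tuple


def scog_wcog_proportions(diff_by_subcategory: Dict[str, float],
                          genome_counts: Dict[str, int],
                          strong_cutoff: float = 5.0,
                          weak_cutoff: float = 0.2) -> Tuple[float, float]:
    """Return (SCOG proportion, WCOG proportion) for one genome."""
    scogs = {s for s, d in diff_by_subcategory.items() if d > strong_cutoff}
    wcogs = {s for s, d in diff_by_subcategory.items() if d < weak_cutoff}
    total = sum(genome_counts.values())
    if total == 0:
        raise ValueError("genome has no COG-annotated genes")
    scog_genes = sum(n for s, n in genome_counts.items() if s in scogs)
    wcog_genes = sum(n for s, n in genome_counts.items() if s in wcogs)
    return scog_genes / total, wcog_genes / total


# Made-up values: one subcategory above 5, one below 0.2, one in between.
toy_diffs = {"D": 5.7, "J": 1.2, "W": 0.1}
toy_counts = {"D": 10, "J": 80, "W": 10}
toy_props = scog_wcog_proportions(toy_diffs, toy_counts)
```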

3.7. Proportion of Replication and Repair Genes

We downloaded the genes associated with replication and repair from the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway database [41]. Ten pathways are classified under replication and repair, namely DNA replication, DNA replication proteins, chromosome and associated proteins, DNA repair and recombination proteins, base excision repair, nucleotide excision repair, mismatch repair, homologous recombination, non-homologous end-joining, and Fanconi anemia pathway. Then, we computed the proportion of genes associated with each classification in each genome.

3.8. Statistical Analyses

The correlations between various genomic features and the strand composition bias were measured by Spearman’s rank correlation coefficient, a nonparametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described by a monotonic function. Spearman’s rho reflects the strength of the correlation between variables, and its absolute value reflects the relative strength of the relationship. For example, a rho value of −0.14 indicates a weaker correlation than a rho value of −0.25. The p-value of Spearman’s correlation is used to measure the significance of the correlation between two variables. In this work, a correlation was considered significant if the p-value was <0.05. The independent contribution of each feature to the bias was confirmed statistically by principal component regression analysis. All statistical analyses were conducted using the freely available R software (https://cran.r-project.org/).
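For readers who want to see what the rank correlation computes, the following is a minimal Python sketch. It is not the R code used in the study: it ranks each variable, assigns tied values their average rank, and then takes the Pearson correlation of the ranks. It does not compute a p-value; in R, `cor.test(x, y, method = "spearman")` returns both rho and the p-value.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5
```

Any strictly increasing relationship gives rho = 1 and any strictly decreasing relationship gives rho = −1, regardless of the actual values.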

4. Conclusions

Strand composition bias has been reported in different genomes over many years and might be driven by multiple factors. In this work, we explored the relationship between strand composition bias and various genomic features. The results show that multiple factors are related to replication strand composition bias. Together, these factors play a major role: our principal component regression analysis showed that they accounted for over 50% of the bias. Gene orientation bias had the highest independent contribution, which indicates that the transcription process is likely to have a major impact on the composition bias between the two replication strands. For most of these factors, we quantitatively measured their contribution to strand composition bias for the first time. Thus, to date, this study is the first integrative analysis of strand composition bias in prokaryotes. The results will help to elucidate the underlying mechanisms that generate such bias.

Supplementary Materials

Supplementary materials can be found at https://www.mdpi.com/1422-0067/16/09/23111/s1.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (grant numbers 31071109 and 31470068), the Fundamental Research Funds for the Central Universities of China (grant number ZYGX2013J101), and the Sichuan Youth Science and Technology Foundation of China (grant number 2014JQ0051).

Author Contributions

Conceived and designed the experiments: Feng-Biao Guo and Yuan-Nong Ye. Performed the experiments: Hai-Long Zhao and Zhong-Kui Xia. Analyzed the data: Zhong-Kui Xia and Fa-Zhan Zhang. Wrote the manuscript: Yuan-Nong Ye and Hai-Long Zhao. Polished the manuscript: Feng-Biao Guo and Zhong-Kui Xia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rocha, E.P. The replication-related organization of bacterial genomes. Microbiology 2004, 150, 1609–1627.
  2. Frank, A.C.; Lobry, J.R. Asymmetric substitution patterns: A review of possible underlying mutational or selective mechanisms. Gene 1999, 238, 65–77.
  3. Asakawa, S.; Kumazawa, Y.; Araki, T.; Himeno, H.; Miura, K.; Watanabe, K. Strand-specific nucleotide composition bias in echinoderm and vertebrate mitochondrial genomes. J. Mol. Evol. 1991, 32, 511–520.
  4. Lobry, J.R. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 1996, 13, 660–665.
  5. Xia, X. DNA replication and strand asymmetry in prokaryotic and mitochondrial genomes. Curr. Genom. 2012, 13, 16–27.
  6. Necsulea, A.; Lobry, J.R. A new method for assessing the effect of replication on DNA base composition asymmetry. Mol. Biol. Evol. 2007, 24, 2169–2179.
  7. Guo, F.-B.; Ning, L.-W. Strand-Specific Composition Bias in Bacterial Genomes; INTECH Open Access Publisher: Rijeka, Croatia, 2011.
  8. Guo, F.B. Replicating strand asymmetry in bacterial and eukaryotic genomes. Curr. Genom. 2012, 13, 2–3.
  9. Arakawa, K.; Tomita, M. Measures of compositional strand bias related to replication machinery and its applications. Curr. Genom. 2012, 13, 4–15.
  10. Lin, Q.; Cui, P.; Ding, F.; Hu, S.; Yu, J. Replication-associated mutational pressure (RMP) governs strand-biased compositional asymmetry (SCA) and gene organization in animal mitochondrial genomes. Curr. Genom. 2012, 13, 28–36.
  11. Khrustalev, V.V.; Barkovsky, E.V. A blueprint for a mutationist theory of replicative strand asymmetries formation. Curr. Genom. 2012, 13, 55–64.
  12. Arakawa, K.; Suzuki, H.; Tomita, M. Quantitative analysis of replication-related mutation and selection pressures in bacterial chromosomes and plasmids using generalised GC skew index. BMC Genom. 2009, 10, 640.
  13. Khrustalev, V.V.; Barkovsky, E.V. Study of completed archaeal genomes and proteomes: Hypothesis of strong mutational AT pressure existed in their common predecessor. Genom. Proteom. Bioinform. 2010, 8, 22–32.
  14. Lobry, J.R.; Sueoka, N. Asymmetric directional mutation pressures in bacteria. Genome Biol. 2002, 3, RESEARCH0058.
  15. Khrustalev, V.V.; Barkovsky, E.V. “Protoisochores” in certain archaeal species are formed by replication-associated mutational pressure. Biochimie 2011, 93, 160–167.
  16. Khrustalev, V.V.; Barkovsky, E.V. The probability of nonsense mutation caused by replication-associated mutational pressure is much higher for bacterial genes from lagging than from leading strands. Genomics 2010, 96, 173–180.
  17. Beletskii, A.; Bhagwat, A.S. Transcription-induced mutations: Increase in C to T mutations in the nontranscribed strand during transcription in Escherichia coli. Proc. Natl. Acad. Sci. USA 1996, 93, 13919–13924.
  18. Cordeiro-Stone, M.; Nikolaishvili-Feinberg, N. Asymmetry of DNA replication and translesion synthesis of UV-induced thymine dimers. Mutat. Res. Fundam. Mol. Mech. Mutagen. 2002, 510, 91–106.
  19. Khrustalev, V.V.; Barkovsky, E.V. The level of cytosine is usually much higher than the level of guanine in two-fold degenerated sites from third codon positions of genes from simplex- and varicelloviruses with G plus C higher than 50%. J. Theor. Biol. 2010, 266, 88–98.
  20. Khrustalev, V.; Barkovsky, E. Bioinformatical approaches for studies on replication-associated and transcription-associated mutational pressure, interpretations and applications. Adv. Genet. Res. 2011, 6, 1–108.
  21. Francino, M.P.; Ochman, H. Strand asymmetries in DNA evolution. Trends Genet. 1997, 13, 240–245.
  22. Nikolaou, C.; Almirantis, Y. A study on the correlation of nucleotide skews and the positioning of the origin of replication: Different modes of replication in bacterial species. Nucleic Acids Res. 2005, 33, 6816–6822.
  23. Rocha, E.P. The organization of the bacterial genome. Annu. Rev. Genet. 2008, 42, 211–233.
  24. Wei, W.; Guo, F.B. Strong strand composition bias in the genome of Ehrlichia canis revealed by multiple methods. Open Microbiol. J. 2010, 4, 98–102.
  25. Guo, F.B.; Yu, X.J. Separate base usages of genes located on the leading and lagging strands in Chlamydia muridarum revealed by the Z curve method. BMC Genom. 2007, 8, 366.
  26. Guo, F.B.; Yuan, J.B. Codon usages of genes on chromosome, and surprisingly, genes in plasmid are primarily affected by strand-specific mutational biases in Lawsonia intracellularis. DNA Res. 2009, 16, 91–104.
  27. Bennett, K.W.; Eley, A. Fusobacteria: New taxonomy and related diseases. J. Med. Microbiol. 1993, 39, 246–254.
  28. Hu, J.; Zhao, X.; Yu, J. Replication-associated purine asymmetry may contribute to strand-biased gene distribution. Genomics 2007, 90, 186–194.
  29. Sharp, P.M.; Bailes, E.; Grocock, R.J.; Peden, J.F.; Sockett, R.E. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 2005, 33, 1141–1153.
  30. Vieira-Silva, S.; Rocha, E. The systemic imprint of growth and its uses in ecological (meta) genomics. PLoS Genet. 2010, 6, e1000808.
  31. McLean, M.J.; Wolfe, K.H.; Devine, K.M. Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes. J. Mol. Evol. 1998, 47, 691–696.
  32. Blattner, F.R.; Plunkett, G., 3rd; Bloch, C.A.; Perna, N.T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J.D.; Rode, C.K.; Mayhew, G.F.; et al. The complete genome sequence of Escherichia coli K-12. Science 1997, 277, 1453–1462.
  33. Rocha, E.P. Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes? Trends Microbiol. 2002, 10, 393–395.
  34. Karlin, S. Bacterial DNA strand compositional asymmetry. Trends Microbiol. 1999, 7, 305–308.
  35. Hildebrand, F.; Meyer, A.; Eyre-Walker, A. Evidence of selection upon genomic GC-content in bacteria. PLoS Genet. 2010, 6, e1001107.
  36. Paul, S.; Million-Weaver, S.; Chattopadhyay, S.; Sokurenko, E.; Merrikh, H. Accelerated gene evolution through replication-transcription conflicts. Nature 2013, 495, 512–515.
  37. Klasson, L.; Andersson, S.G. Strong asymmetric mutation bias in endosymbiont genomes coincide with loss of genes for replication restart pathways. Mol. Biol. Evol. 2006, 23, 1031–1039.
  38. Rocha, E.P.; Danchin, A. Base composition bias might result from competition for metabolic resources. Trends Genet. 2002, 18, 291–294.
  39. Lin, Y.; Gao, F.; Zhang, C.T. Functionality of essential genes drives gene strand-bias in bacterial genomes. Biochem. Biophys. Res. Commun. 2010, 396, 472–476.
  40. Gao, F.; Luo, H.; Zhang, C.T. DoriC 5.0: An updated database of oriC regions in both bacterial and archaeal genomes. Nucleic Acids Res. 2013, 41, D90–D93.
  41. Kanehisa, M.; Goto, S.; Kawashima, S.; Okuno, Y.; Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32, D277–D280.

Share and Cite

MDPI and ACS Style

Zhao, H.-L.; Xia, Z.-K.; Zhang, F.-Z.; Ye, Y.-N.; Guo, F.-B. Multiple Factors Drive Replicating Strand Composition Bias in Bacterial Genomes. Int. J. Mol. Sci. 2015, 16, 23111-23126. https://doi.org/10.3390/ijms160923111

