Multiple Factors Drive Replicating Strand Composition Bias in Bacterial Genomes

Composition bias from Chargaff’s second parity rule (PR2) has long been found in sequenced genomes, and is believed to relate strongly with the replication process in microbial genomes. However, some disagreement on the underlying reason for strand composition bias remains. We performed an integrative analysis of various genomic features that might influence composition bias using a large-scale dataset of 1111 genomes. Our results indicate (1) the bias was stronger in obligate intracellular bacteria than in other free-living species (p-value = 0.0305); (2) Fusobacteria and Firmicutes had the highest average bias among the 24 microbial phyla analyzed; (3) the strength of selected codon usage bias and generation times were not observably related to strand composition bias (p-value = 0.3247); (4) significant negative relationships were found between GC content, genome size, rearrangement frequency, Clusters of Orthologous Groups (COG) functional subcategories A, C, I, Q, and composition bias (p-values < 1.0 × 10−8); (5) gene density and COG functional subcategories D, F, J, L, and V were positively related with composition bias (p-value < 2.2 × 10−16); and (6) gene density made the most important contribution to composition bias, indicating transcriptional bias was associated strongly with strand composition bias. Therefore, strand composition bias was found to be influenced by multiple factors with varying weights.

contribution to composition bias, indicating transcriptional bias was associated strongly

Introduction
The DNA replication process produces two identical DNA molecules from one original DNA molecule. The leading strand is synthesized continuously in the same direction as the growing replication fork and the lagging strand is replicated by the synthesis of short and separated Okazaki fragments that are then joined together to form an integrated strand [1]. According to Chargaff's second parity rule (PR2), a single DNA strand globally has an equal percentage of base pairs (A ≈ T and G ≈ C) when there is no strand bias caused by mutation or selection [2]. After PR2 bias caused by mutation was found between the leading and lagging strands in the echinoderm and vertebrate mitochondria genomes [3], the same phenomenon has been found in an increasing number of genomes [4][5][6][7][8][9][10][11]. These biases consistently showed that the leading strand had more G than C and, to a lesser extent more T than A, while in lagging strand the bias was in the opposite direction [9,12,13].
Many researchers found that the strand bias was related to the replication process, because the accumulation of base mutations were caused by the asymmetric replication mechanism between the two strands [1,2,6,14,15]. The rule of Watson-Crick base pairing would protect cytosine from being deaminized in double-stranded DNA [16,17]. However, DNA must be separated into two single strands temporarily during replication. In single-stranded DNA, cytosine would be easier to undergo deamination and transform to thymine, which contributes towards the composition bias in genomes [16]. Researchers have found that other factors may lead to asymmetry of DNA, such as thymine dimers [18], nonsense mutations [11,16], two-fold degenerated sites of cytosine [13,19], and nucleotide usage in twofold as well as fourfold degenerate sites from third codon positions [20]. Other researchers suggested that the strand composition bias was associated with the transcription process [21,22]. The mutation and repair frequencies between coding and non-coding regions of genomes are different, and most genes are located on the leading strands [1,23]. Hence, considering the gene orientation bias, the transcription process also could induce composition bias between two replicating strands.
Thus, the mechanisms underlying nucleotide composition bias are still open to debate. In this work, we selected 1111 microbial genomes to study a number of factors that may affect strand composition bias, using a quantitative analysis approach.

Composition Bias in Obligate Intracellular Bacteria
Extremely strong strand composition bias has been reported in 11 bacteria, among which seven are obligate intracellular parasites [8]. The strong bias means that genes have significantly different base and codon usages between the two replicating strands [24][25][26]. Obligate intracellular bacteria live permanently in their hosts, which helps to protect them against some DNA damage [7]. Thus, during their long-term evolution, some DNA repair genes would have been lost and mutations would have accumulated, resulting in the strand composition bias that has been reported.
In this work, we analyzed the composition bias in obligate intracellular bacteria using a broader range of genomes than has been used previously. Among the 1111 genomes that we downloaded from the NCBI FTP site (see Section 3.1 for details), 83 bacteria were confirmed as obligate intracellular. The species names and access numbers are displayed in Table S1. The average Scorecomposition bias (see Section 3.2 for details) of the 83 obligate intracellular bacteria (0.0433) was significantly higher than that of the other bacteria (0.0362) (t-test, p-value = 0.0305), and 40 of the 83 genomes were among the top scoring 258 genomes (top quarter). However, the top 10 genomes were not from obligate intracellular bacteria. Thus, the Scorecomposition bias of obligate intracellular bacteria was stronger on the whole than that of the other species, but not always strong for an individual genome.

Composition Bias in Different Bacterial Phyla
We separated the 1111 microbial genomes into 24 phyla and plotted the Scorecomposition bias for each phylum ( Figure 1); the variance, standard deviation, and average Scorecomposition bias are given in Table 1. Fusobacteria had the highest average Scorecomposition bias. They are obligately anaerobic non-spore-forming Gram-negative bacteria [27]. Firmicutes had the second highest average Scorecomposition bias, which is in accord with a previous study that found that strand-biased gene distribution was stronger in Firmicutes than in other bacteria [28]. To explore other features that may affect composition bias at the phylum level, we compared the size, GC content, and rearrangement frequencies of the Fusobacteria and Firmicutes genomes and found that these three features were smaller than the average values for all the other bacterial genomes; however, the gene densities in these two phyla were larger than the average values for all the other bacteria ( Table 2). We reconstructed the phylogenetic tree of the 24 phyla ( Figure 2) and found that the Fusobacteria and Firmicutes phyla had the closest relationship. Meanwhile, they had the top two Scorecomposition bias (0.100 and 0.071). We also found that several other clades with close relationship had similar Scorecomposition bias, such as among Gemmatimonadetes, Planctomycetes and Acidobacteria. This suggests phylogenetic relationship is one of the determinant factors of strand composition bias in bacterial genomes.

Composition Bias in Genomes with Different S Values
Selection and mutation are two primary factors that generate bias in species' genomes during evolution. These two factors may generate biases that partially counteract each other. An S value can be used to measure the strength of codon usage bias as an indicator of selection bias [29]. Replicating strand composition bias can be considered to represent mutation bias. Thus, we used the S values for 80 bacterial genomes that were reported by Sharp et al. [29] to study the correlation between them and the Scorecomposition bias of the same 80 genomes. We found that there was no significant correlation between them (Spearman's correlation, ρ = −0.08604675, p-value = 0.3247). Hence, we suggest that selection and mutation may influence genome bias by different mechanisms; therefore, codon usage bias may counteract strand composition bias.

Composition Bias in Genomes with Different Generation Times
Microbial generation times range from a few minutes to several weeks and are affected by evolutionary factors such as environment stability, nutrient availability, and community diversity. Vieira-Silva and Rocha found that codon usage bias was correlated with growth rates [30]. Hence, we explored the relationship of generation time and Scorecomposition bias. The bacterial generation time data were extracted from of the paper by Vieira-Silva and Rocha [30]. Our result indicated that generation time also was not significantly related with Scorecomposition bias (Spearman's correlation, ρ = −0.1457365, p-value = 0.1021). That may be the same as the reason mentioned on the S value.

Composition Bias in Genomes with Different Genome Sizes
The average sizes of the genomes in the Fusobacteria and Firmicutes phyla are smaller than average sizes of the genomes in all the bacterial phyla examined. We found that a significantly negative correlation existed between genome size and Scorecomposition bias (Spearman's correlation, ρ = −0.2508015, p-value < 2.2 × 10 −16 ). This finding is similar to the results of Guo and Ning [7] who found that the genome sizes of 11 bacteria with extremely strong strand composition biases were all smaller than 2000 kb. Guo and Ning speculated that the repair mechanism might be inefficient in small bacterial genomes that had undergone reductive evolution [7]. Additionally, mutation pressure may be insufficient to surpass translational selection in larger genomes.

Composition Bias in Genomes with Different Gene Densities of the Leading Strand
With the availability of a large number of complete genome sequences, it has become increasing clear that the unequal distribution of genes between leading and lagging strands varies widely among different species. Numerous studies have shown that genes are generally preferentially located on the leading strand [31][32][33][34], which may be explained by the polymerase collision avoidance model [1].
We calculated the density of leading strand genes for all 1111 genomes. Our correlation analysis showed that gene density was highly positively correlated with Scorecomposition bias (Spearman's correlation, ρ = 0.6273871, p-value < 2.2 × 10 −16 ). This result could be caused by DNA replication-associated mutation bias during the transcription process in which DNA decomposes into single strands. However, the DNA mutation or repair rates were quite different between transcribed and non-transcribed strands. Because most protein-coding genes are located on the leading strand, the two replication strands can have extremely different compositions [21]. Thus, the asymmetric transcription process is likely to have a major impact on the composition bias between the two replication strands.

Composition Bias in Genomes with Different GC Contents
GC content is the percentage of guanine and cytosine base pairs in a DNA sequence. The GC content of bacterial genomes ranges from about 20% to 70% [35]. We investigated the correlation between GC content and Scorecomposition bias and found that a significantly negative correlation existed between them (Spearman's correlation, ρ = −0.5026315, p-value < 2.2 × 10 −16 ). It may be explained that genomes with high GC content will generate fewer mutations than those with low GC content [36]. However, this would inspire us that the replicating strand composition bias is caused by a complex set of factors.

Composition Bias in Genomes with Different Recombination Rates
Chromosomal recombination occurs as a result of deletions, duplications, inversions, and translocations in native chromosomes. Rocha [1] has shown that the recombination rate is related to strand composition bias, and has suggested that codon usage separation may be caused by low recombination rates in some obligate intracellular parasites. Wei and Guo confirm this suggestion in 11 obligate intracellular bacteria with strong strand composition bias using the Z-curve method [24].
Here, we explored this issue in the 1111 genomes. The recombination rates (taRF, gcRF) of each genome were calculated as described in Section 3.3. Then, the correlations between Scorecomposition bias and both taRF and gcRF were estimated for all the genomes. We found that taRF and gcRF were both negatively associated with Scorecomposition bias (Spearman's correlations, ρgcRF = −0.3746862, ρtaRF = −0.2916134, both p-values < 2.2 × 10 −16 ).
Rocha suggested that frequent chromosomal recombination would reduce strand composition bias [1]. The base distribution in any one strand is accordant; that is, if G > C in a particular region, then a similar base distribution also will be found in other regions of the same chromosome. However, recombination would break the accordance and reduce strand composition bias.

Composition Bias in Different COG Functional Categories
To determine whether gene function has an impact on strand composition bias, we explored the relationship between Clusters of Orthologous Groups (COG) functional categories and composition bias for the first time.

Percentage of Gene Number for Each COG Functional Subcategory
To explore the influence of each COG subcategory on composition bias, the correlation between the percentage of each COG functional subcategory (pCOGi; see Section 3.4 for details) and the corresponding Scorecomposition bias was analyzed for each genome. The results, summarized in Table 3, were considered as statistically significant if the p-value was <1.0 × 10 −8 . Based on this cutoff value, the pCOGs of the A, C, I, and Q subcategories were negatively related to Scorecomposition bias, and the D, F, J, L, and V subcategories showed positive correlations to Scorecomposition bias.
Klasson and Andersson have studied gene function and composition bias [37]. They found that strong asymmetric mutation bias in endosymbiont genomes caused them to lack replication restart genes (subcategory L). Guo and Ning reported that genes associated with replication initiation and re-initiation such as mutH, dnaT and fis were absent in 11 obligate intracellular bacteria genomes with extreme strand composition bias [7]. However, we detected some replication initiation and re-initiation genes based on our analysis of the 1111 genomes, which indicated that COG subcategory L and composition bias was positively correlated. This is an interesting finding that we will further explore in Section 2.9.2. Rocha and Danchin [38] reported some obligate parasite bacteria with strong composition bias in which genes associated with energy metabolism were absent. This finding is mostly accord with our result that the metabolism-related genes (subcategories C, I, and Q) were all negatively correlated with composition bias, except those in subcategory F.  N denotes significantly negative correlation between subcategories and composition bias. P denotes significantly positive correlation between subcategories and composition bias.

Proportion of Replication and Repair Genes
The correlation between subcategory L and composition bias that we obtained is opposite to what has been found previously. To explore this result further, we collected the replication and repair genes from the KEGG pathway database and divided then into the 10 subtypes (for details see Section 3.7) based on their functions. The correlations between the percentage genes under each subtype and the Scorecomposition bias are shown in Table 4. The gene subtypes were all positively related to composition bias, and the excision and mismatch repair subtype had the highest correlation. We suspect that genomes with strong composition bias may have generated more repair genes to balance the composition bias during evolution. However, the cause-and-effect relationship between repair genes and composition bias is not still clear; that is, which is the cause and which is the effect.

Average Value of Times between Strong-Biased Group and Weak-Biased Group for Each Functional Subcategory
The DiffSBG/WBG (see Section 3.5 for details) for all COG subcategories is shown in Table 5. Subcategory D had the highest value (5.709 among all the subcategories, which indicated that genes involved in cell cycle control, cell division, and chromosome partitioning were present in significant numbers in the strong-biased genomes (i.e., the genomes with three top 555 Scorecomposition bias values). This result is in accordance with Lin et al. [39] who found that only some essential COG subcategories were situated preferentially on the leading strand and that subcategory D genes showed the most significant bias among 10 strand-biased classifications. Furthermore, both the strong-biased COG groups (SCOGs) and weak-biased COG groups (WCOGs) in all 1111 genomes were significantly related to Scorecomposition bias (Spearman's correlation, ρSCOG = 0.51473 and ρWCOG = −0.65945, both p-values < 2.2 × 10 −16 ). We suggest that although the essential subcategories are similar in number in the genomes, they tend to be located on the leading strand, resulting in strong composition bias. For small genomes, the percentages of essential subcategories are higher than for large genomes, hence leading to stronger composition bias in small genomes.

Conjoint Analysis of Multiple Factors and Composition Bias by Principal Component Regression
We determined the independent contribution of each genomic feature to composition bias by principal component regression. Here, we selected only the features that were significantly related with strand composition bias (p-values < 1.0 × 10 −8 ). The replication and repair genes were not considered separately because they belong to COG subcategory L. The respective contribution is presented in detail in Table 6. The results show that among the whole contribution (R 2 = 0.5104) of all the features, gene density (R 2 = 0.064778) made the most contribution to strand composition bias. Thus, gene orientation bias was the primary factor that influenced base composition among the biological features tested.

Data Source
We retrieved 1111 bacterial genome sequences from the NCBI FTP site in September 2010. Among them, 76 bacteria had multiple strains and hence the 1111 bacteria belonged to only 703 species. We used all sequenced bacterial genomes at that time, rather than sampling the genomic data to analyze.
The origin and terminus of DNA replications were obtained from the Doric database [40] in July 2011. This information was used to separate genes onto leading and lagging strands.
The genes related to DNA repair and replications were extracted from the KEGG Pathway database [41] in April 2013.

Computation of Strand Composition Bias
Strand composition bias of a whole genome was obtained as: where G, C, T, and A are the numbers of corresponding bases in leading strands. According to the principle of complementary base pairing, strand composition bias in lagging strands is equal to that of the leading strand.

Computation of Counteracting Effect of Recombination
Strand composition bias was measured by the mean value of G − C + T − A. Recombination may change the natural order of nucleotides, so to counteract some usual bias and finally lower the strength of the whole bias, we introduced two values, gcRF and taRF, to roughly reflect this effect of recombination. gcRF was calculated as: and taRF and was calculated as: where Gi, Ci, Ti, and Ai are the numbers of corresponding bases of the i th leading strand gene; Li is the length of the corresponding gene; and N is the total number of genes in the leading strand. Usually, the higher the two values are, the higher the frequency of counteracting recombination occurs.

Computation of the Percentage of Each COG Functional Subcategory
The percentage of each COG functional subcategory (pCOG) was calculated as: where i is the i th subcategory and NCOGi is the number of genes with the i th subcategory in a genome. NCOG is the total number of genes within all the COG subcategories.

Computation of Average Value of Differences between Strong-Biased Group and Weak-Biased Group for Each Functional Subcategory
We grouped the genomes with the top 555 Scorecomposition bias values as the strong-biased group (SBG), and the remaining genomes as the weak-biased group (WBG) and count the number of genes in each COG subcategory for all the species in each group separately. For each COG, we defined an indicator, DiffSBG/WBG, to measure the differences between the two groups as: where NSBG is the number of genes in each COG subcategory in the SBG, and NWBG is the number of genes in each COG subcategory in the WBG. Finally, we defined another indicator, DiffCOG, for each COG functional subcategory as: where i is the i th subcategory of the 23 COG functional subcategories; j is the j th gene in i th subcategory; and N is the total number of genes in i th subcategory.

Proportion of SCOGs and WCOGs
Subcategories with DiffCOG > 5 were defined as strong-biased COG groups (SCOGs), and subcategories with DiffCOG < 0.2 were defined as weak-biased COG groups (WCOGs). Then, the proportions of SCOGs and WCOGs in each genome were calculated.

Proportion of Replication and Repair Genes
We download the genes associated with replication and repair from the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway database [41]. Ten pathways are classified under replication and repair; namely, DNA replication, DNA replication proteins, chromosome and associated proteins, DNA repair and recombination proteins, base excision repair, nucleotide excision repair, mismatch repair, homologous recombination, non-homologous end-joining, and Fanconi anemia pathway. Then, we computed the proportion of genes associated with each classification in each genome.

Statistical Analyses
The correlations between various genomic features and the strand composition bias were measured by Spearman's rank correlation coefficient, which is a nonparametric measure of statistical dependence between two factors. It uses a monotonic function to assess how well the relationship between two variables. Rho of Spearman's rank correlation is used to reflect the intensity of correlation between variables of statistical indicators and the absolute value of rho reflects the relative significance between two variables. For example, a rho value of −0.14 is less significant than a rho value of −0.25. The p-value of Spearman's correlation is used for measuring significance of correlation between two variables. In this work, it is considered a significant correlation if the p-value <0.05. The independent contribution of each feature to the bias was confirmed statistically by principal component regression analysis. All statistical analyses were conducted using the freely available R package (https://cran.r-project.org/).

Conclusions
Strand composition bias has been reported in different genomes over many years. The bias might be driven by multiple factors. In this work, we explored the relationship between strand composition bias and various genomic features. The results show that multiple factors are related to replication strand composition bias. Together, these factors play a major role and our principal component regression analysis showed that their contribution to replication strand composition bias accounted for over 50% of the bias. Gene orientation bias had the highest independent contribution, which indicates that the transcription process is likely to have a major impact on the composition bias between two replication strands. For most of the factors, we, for the first time, quantitatively measured their contribution to strand composition bias. Thus, so far, this study is the first integrative analysis of strand composition bias in prokaryotes. The results will help understand the underlying mechanisms of how such bias is generated.