Bipyrimidine Signatures as a Photoprotective Genome Strategy in G + C-rich Halophilic Archaea

Halophilic archaea experience high levels of ultraviolet (UV) light in their environments and demonstrate resistance to UV irradiation. DNA repair systems and carotenoids provide UV protection but do not account for the high resistance observed. Herein, we consider genomic signatures as an additional photoprotective strategy. The predominant forms of UV-induced DNA damage are cyclobutane pyrimidine dimers, most notoriously thymine dimers (T^Ts), which form at adjacent Ts. We tested whether the high G + C content seen in halophilic archaea serves a photoprotective function through limiting T nucleotides, and thus T^T lesions. However, this speculation overlooks the other bipyrimidine sequences, all of which are capable of forming photolesions to varying degrees. Therefore, we designed a program to determine the frequencies of the four bipyrimidine pairs (5’ to 3’: TT, TC, CT, and CC) within genomes of halophilic archaea and four other randomized sample groups for comparison. The outputs for each sampled genome were weighted by the intrinsic photoreactivities of each dinucleotide pair. Statistical methods were employed to investigate intergroup differences. Our findings indicate that the UV-resistance seen in halophilic archaea can be attributed in part to a genomic strategy: high G + C content and the resulting bipyrimidine signature reduce the genomic photoreactivity.


Introduction
Halophilic microorganisms thrive in the briny waters of salt lakes and evaporative solar salterns [1]. Saturated salinity, from 25% to 35%, presents osmotic obstacles for life, but we see representatives from all three domains thriving in these conditions [2]. Hypersaline habitats select for archaea at a higher abundance than bacteria and eukarya in the community [3]. The present study focuses on this particular group, whose members likely share genetic strategies for life at high salinity [2]. Microbial diversity studies show a wide array of halophilic archaea present in hypersaline ecosystems [4] that are adapted to thrive in the saltiest places on Earth.
Beyond salt, high solar radiation is a feature of these environments; thus, their microbial inhabitants must have evolved to overcome the challenge of ultraviolet (UV) exposure. Halophilic archaea are highly resistant to UV light, a trait first noted by Dundas and Larsen [5]. For example, one Halorubrum species was previously found to be nearly ten-fold more UV-resistant than Escherichia coli [6]. This is due in part to their robust UV DNA repair systems, including photoreactivation and nucleotide excision repair [7][8][9][10][11], but there are clearly additional processes that afford the high level of resistance to solar radiation exhibited by halophilic archaea.
Several studies have proposed other potential photoprotective strategies; for example, carotenoids in the cell membranes of halophilic archaea may provide UV resistance [6,12,13], although the

Table 1. Halophilic archaea sampled for the present study, representative of all species with full genome sequences presently available (taxid: 183963). Data were obtained from the National Center for Biotechnology Information (NCBI) database [21]. Columns: Species, Strain, G + C (%), Size (Mb), Genes.

One photoprotective mechanism suggested in the literature for decades [22][23][24][25][26] relates to the genomes of organisms with high guanine + cytosine (G + C) content, such as halophilic archaea. Their genomes typically exceed 60% G + C (Table 1), which effectively reduces the number of thymines (Ts) present. Thymine dimers (T^T), which form from the UV-induced cyclization of two adjacent thymines on the same DNA strand (Figure 1), were long thought to be the primary DNA damage arising from sunlight [23]. Logically, the limitation of T residues is expected to reduce the incidence of adjacent Ts on the same DNA strand and thus, the possibility of T^Ts. Reduction of DNA damage would reduce mutagenesis during replication of unrepaired lesions. Kennedy et al. [26] suggested that this strategy of thymine limitation could explain the UV resistance of halophilic archaea, but the idea remains untested.
The most important oversight in this premise is that T^T lesions are but a subclass of cyclobutane pyrimidine dimers (CPDs), which include each possible adjacent bipyrimidine pair: (5' to 3') T^T, C^C, C^T, and T^C. Haynes [21] noted that the sensitivity of microorganisms to UV light was not mathematically proportional to the thymine frequency in the genome, which pointed to other lethal lesions. Any set of two adjacent pyrimidines may cyclize upon exposure to a photon of UV light. Though CPDs represent the majority of solar-induced DNA damage events [28], other potential lesions resulting from the UV-irradiation of bipyrimidine sequences are pyrimidine (6-4) pyrimidone photoproducts (6-4PPs). The proportion of CPDs to 6-4PPs is dependent on wavelength of light [29] and on flanking sequences [30,31].
Though it is logical to assume that T^T lesions may be limited in halophilic archaea, other photolesions may form; therefore, the photoreactivities of each bipyrimidine pair and the relationship between G + C content and their incidences must be considered in order to assess any photoprotective benefit [32,33]. We thus sought to test the hypothesis that thymine limitation is photoprotective using a genome photoreactivity score that incorporates all bipyrimidine pairs and their respective photoreactivities, while accounting for all potential types of UV damage (e.g., CPDs and 6-4PPs). We developed an equation to quantify the theoretical photoreactivity of a genome based on the frequencies of each bipyrimidine doublet within it, weighted by their intrinsic photoreactivities (as determined by [33]) and genome size. Comparison with other taxonomic groups, supported by statistical analysis, sheds light on genome evolution in halophilic archaea.

Comparing G + C Content of Halophilic Archaea Versus Other Prokaryotes
A list of all prokaryotes with full-genome sequences available (n = 5074) and their corresponding G + C contents was downloaded from the NCBI database [21]. One representative genome for each species was selected at random, yielding sample groups of n = 29 halophilic archaea and n = 2231 other prokaryotes.

Genome Sampling
Tabulated lists of all species with full-genome sequences available were downloaded from the NCBI database [21] for the sample groups bacteria (taxid: 2, n = 4829), halophilic archaea (taxid: 183963, n = 33), archaea excluding halophilic archaea (taxid: 2157, n = 209), cyanobacteria (taxid: 1117, n = 90) and enterobacteriaceae (taxid: 91347, n = 736).
Steps were taken to minimize sample bias: for the halophilic archaea group, one representative strain from each species was selected at random, yielding a sample group of n = 29 halophilic archaea. For the archaea, cyanobacteria, and enterobacteriaceae groups, one representative from each genus was selected at random, yielding sample groups of n = 68 (non-halophilic) archaea, n = 32 cyanobacteria, and n = 42 enterobacteriaceae. For the bacteria, the first 101 strains of a unique genus to be randomly selected constituted the final sample group of n = 101 bacteria.
The full-genome sequences corresponding to each sampled strain were downloaded as .fasta files from the NCBI database [21].
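The one-representative-per-species (or per-genus) draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study reports using the "RAND" function in Microsoft Excel, whereas this sketch swaps in Python's seeded random number generator, and the function name `pick_representatives`, the record fields, and the toy strain names are all hypothetical.

```python
import random
from collections import defaultdict

def pick_representatives(records, key, seed=0):
    """Pick one record at random from each group defined by `key`.

    `records` is a list of dicts; `key` names the field to group by
    (e.g. "species" or "genus"). A fixed seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    # Iterate over sorted group names so a given seed always yields the same picks.
    return {name: rng.choice(members) for name, members in sorted(groups.items())}

# Hypothetical toy table: two species, one of which has several sequenced strains.
strains = [
    {"species": "Species A", "strain": "A1"},
    {"species": "Species A", "strain": "A2"},
    {"species": "Species A", "strain": "A3"},
    {"species": "Species B", "strain": "B1"},
]

representatives = pick_representatives(strains, key="species", seed=42)
for species, rec in representatives.items():
    print(species, "->", rec["strain"])
```

Grouping before drawing guarantees that a heavily sequenced species contributes exactly one genome, which is the sample-bias concern the sampling procedure addresses.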

Determining Bipyrimidine Incidences
To determine the incidences (i.e. relative frequencies) of each bipyrimidine in the sampled genomes, a novel word-counting program "DinucleotideCounts" was written in the scripting language R (script available at: [34]). This program determines the frequency of each dinucleotide, the frequency of each nucleotide, and the size of any .fasta-formatted DNA sequence.
The bipyrimidine frequencies within sampled genomes were determined via the DinucleotideCounts program. Bipyrimidine incidences (TCi, TTi, CTi, CCi) were then computed by dividing each bipyrimidine's frequency by the size of the corresponding genome in bases.
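The counting step can be illustrated with a short Python sketch. It is not the authors' R script: the function names `parse_fasta` and `bipyrimidine_incidences` are hypothetical, and the sketch assumes that overlapping dinucleotides are counted on the strand given in the file, that counts never span two FASTA records, and that genome size is the total number of bases across records.

```python
from collections import Counter

BIPYRIMIDINES = ("TC", "TT", "CT", "CC")

def parse_fasta(text):
    """Return the sequences in FASTA-formatted text, one string per record."""
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                records.append("".join(current))
            current = []
        elif line:
            current.append(line.upper())
    if current:
        records.append("".join(current))
    return records

def bipyrimidine_incidences(records):
    """Count overlapping dinucleotides per record and divide by total bases."""
    counts = Counter()
    size = 0
    for seq in records:
        size += len(seq)
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    return {pair: counts[pair] / size for pair in BIPYRIMIDINES}

# Hypothetical five-base record: its dinucleotides are TT, TC, CC, CT.
example = ">toy_record\nTTCCT\n"
incidences = bipyrimidine_incidences(parse_fasta(example))
print(incidences)  # {'TC': 0.2, 'TT': 0.2, 'CT': 0.2, 'CC': 0.2}
```

Counting each record separately avoids creating a spurious dinucleotide at the junction between, for example, a chromosome and a plasmid stored in the same file.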

Determining Theoretical Genomic Photoreactivity (Pg)
We devised the metric Pg to quantify the theoretical photoreactivity of a genome based on its bipyrimidine signature: Pg corresponds to the weighted sum of a genome's bipyrimidine incidences (TCi, TTi, CTi, CCi). The coefficients of this sum represent the intrinsic photoreactivity of each bipyrimidine, as experimentally determined by [33] via establishing the ratio between the frequency of the photoproducts (6-4PPs and CPDs) and the bipyrimidine incidences in DNA with varying G + C content.
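As a minimal sketch of the weighted sum, the Python function below combines the four incidences with one coefficient per bipyrimidine. The coefficient values shown are placeholders chosen only to respect the reported ordering TC > TT > CT > CC; they are not the values determined in [33], and the names `genomic_photoreactivity` and `PLACEHOLDER_WEIGHTS` are hypothetical.

```python
def genomic_photoreactivity(incidences, weights):
    """Return the weighted sum of the four bipyrimidine incidences."""
    return sum(weights[pair] * incidences[pair] for pair in ("TC", "TT", "CT", "CC"))

# Placeholder coefficients (NOT the published values from [33]).
PLACEHOLDER_WEIGHTS = {"TC": 4.0, "TT": 3.0, "CT": 2.0, "CC": 1.0}

# Hypothetical incidences for two toy genomes with different T content.
t_rich = {"TC": 0.06, "TT": 0.09, "CT": 0.06, "CC": 0.04}
t_poor = {"TC": 0.06, "TT": 0.03, "CT": 0.06, "CC": 0.08}

pg_rich = genomic_photoreactivity(t_rich, PLACEHOLDER_WEIGHTS)
pg_poor = genomic_photoreactivity(t_poor, PLACEHOLDER_WEIGHTS)
print(pg_rich > pg_poor)  # True for these toy numbers
```

Passing the coefficients as an argument keeps the calculation separate from any particular set of measured photoreactivities.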

Statistical Methods
G + C content averages were compared (Figure 2) via a Welch Two Sample t-test using the "stats" package in R [35]. Intergroup differences in bipyrimidine incidences (Figure 3) and Pg (Figure 4) were assessed via one-way analysis of variance (ANOVA) and post-hoc Tukey contrasts using the "multcomp" package [36] in R [35]. The regression analysis of G + C content vs. Pg (Figure 5) and corresponding Pearson's product-moment correlation test were carried out using the "stats" package in R. All randomization was facilitated by the "RAND" function in Microsoft Excel.
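For readers unfamiliar with the Welch test, the following Python sketch computes its t statistic and Welch-Satterthwaite degrees of freedom from first principles. It is an illustration, not the R "stats" implementation used in the study: it stops short of the p-value (which requires the t distribution), does not cover the ANOVA or Tukey contrasts, and the function name `welch_t` and the sample values are hypothetical.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical G + C percentages for two small groups.
group_a = [62.0, 64.0, 66.0]
group_b = [40.0, 50.0, 60.0]

t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 3))
```

Unlike the pooled-variance t-test, this form does not assume the two groups share a variance, which suits a comparison between a tightly clustered group and a widely spread one.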

G + C Content of Halophilic Archaea
Halophilic archaea are known to have high G + C content with one notable exception, Haloquadratum walsbyi [37][38][39]. To understand the unique genomic features of these halophiles, we compared them to other prokaryotes. The NCBI GenBank database contains the full genome sequences for 29 species of halophilic archaea (Table 1), which were utilized for this study. A further 2231 prokaryotic species (bacteria and non-halophilic archaea) were selected as described in Materials and Methods. This analysis (Figure 2) showed that halophilic archaea (excluding H. walsbyi) have a clustered distribution with a remarkably high G + C content (63.1% ± 1.3%) relative to other microbial life (49.7% ± 0.55%), thereby demonstrating the uniqueness of this group and pointing to the relatedness of halophilic archaea.

Bipyrimidine Signature of Halophilic Archaea
Due to their high G + C content (Figure 2), we predicted halophilic archaea would have low and high incidences of TT and CC dinucleotides, respectively. However, considering the incidences of all bipyrimidine pairs (CCi, CTi, TCi, and TTi) is necessary for understanding the overall photoreactivity of their genomes. Hence, bipyrimidine incidences were determined for and subsequently compared between halophilic archaea and the other taxonomic groups bacteria, non-halophilic archaea, cyanobacteria, and enterobacteriaceae (as described in Materials and Methods) for insight into unique genomic signatures. If an evolutionary genome strategy were employed to protect organisms from sunlight, one would expect that a photosynthetic group like cyanobacteria might share such a strategy, and that enterobacteriaceae bacteria, which dwell inside higher eukaryotes, would not. Random samplings of species in these groups are thus included as controls, in addition to the bacteria and non-halophilic archaea (Figure 3).
As predicted, halophilic archaea are distinctive among these control groups in their low TTi and high CCi (Figure 3). Further, it was found that halophilic archaeal genomes are also characterized by high TCi relative to the comparative groups. Multiple comparisons of means with respect to each bipyrimidine pair (as described in Materials and Methods) point to some interesting observations, indicating that on average:

• Halophilic archaea have smaller TTi than any other group (p < 10^-4 each). No other significant intergroup differences in TTi were detected.

Intergroup Differences in Theoretical Genomic Photoreactivity (Pg)
Until recently, TTs were thought to be the most photoreactive sequences [24]. This idea has been challenged by current data on the photoreactivity of each bipyrimidine pair in both naked and intracellular DNA with a variety of genome G + C contents [31,32]. Matallana-Surget et al. [33] experimentally found the relative photoreactivities of the bipyrimidine sequences to be in the decreasing order of TC > TT > CT > CC. Furthermore, these authors quantified the intrinsic photoreactivity of each bipyrimidine by determining the ratio between the frequency of photoproducts (CPDs and 6-4PPs) and bipyrimidine incidences in DNA with varying G + C content. These ratios were employed as coefficients in the equation for our theoretical quantification of genomic photoreactivity in terms of bipyrimidine signature, Pg (Equation (1)). Pg was calculated for each sampled genome and then compared across all tested taxonomic groups (Figure 4).
Intergroup differences in Pg were examined via a multiple comparison of means (as described in Materials and Methods), which indicated that on average:

Genomic Strategy of Photoprotection
In analyzing photoreactivity in the studied genomes, we plotted each computed Pg value against its corresponding genome's G + C content (Figure 5). These data clearly show a strong, negative correlation between Pg and G + C content (R^2 = 0.7139, p < 2.2 × 10^-16). Figure 5 further demonstrates that this relationship holds true not only for the halophilic archaea tested, but also for all other genomes with this bias toward G + C, altogether giving evidence that an organism's genomic G + C content contributes to its relative sensitivity to UV radiation.
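The regression behind such a plot can be sketched as an ordinary least-squares fit with its coefficient of determination. The Python below is an illustration only: the study used the R "stats" package, the function name `linear_fit` is hypothetical, and the data points are invented to show a negative slope rather than taken from the sampled genomes.

```python
def linear_fit(x, y):
    """Return (slope, intercept, r_squared) for a least-squares line y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical (G + C %, Pg) pairs with a downward trend.
gc_percent = [35.0, 45.0, 55.0, 65.0]
pg_values = [0.30, 0.28, 0.27, 0.24]

slope, intercept, r2 = linear_fit(gc_percent, pg_values)
print(slope < 0, round(r2, 3))
```

A negative slope with R^2 near 1 would correspond to the pattern described above: genomes richer in G + C receive lower photoreactivity scores.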

Discussion
Solar exposure for all prokaryotic life on Earth results in UV-induced DNA damage, the majority of which is in the form of CPDs, a cycloaddition of two adjacent pyrimidine bases (Figure 1) [28,31]. CPDs are mitigated by two universal DNA repair systems: photolyase [40,41] and nucleotide excision repair [42]. Halophilic archaea are no exception, as they have a similar ratio of CPDs to 6-4 photoproducts [8] and efficient repair [7][8][9][10][11].
The UV spectrum is broken into three subdivisions: UV-A (315-400 nm), UV-B (280-315 nm), and UV-C (<280 nm), with UV-C being the most damaging for DNA [43] due to the shorter wavelengths. The solar spectrum (UV-A/B) peaks at 300 nm and results in CPDs as the principal type of UV lesions, placing less emphasis on 6-4PPs [27]. UV-C light is sometimes used in laboratory experiments to amplify DNA damage detection [6,24], but it does not have real-world consequences for biological systems since it is absorbed by the oxygen and ozone in the Earth's atmosphere [44].
The photoreactivity of a specific bipyrimidine pair may vary depending on the wavelength of UV light utilized, the flanking sequence conditions, and the cellular environment. For example, in one study the ratio of T^T to other CPDs was greater when UV-C light was used in experiments than when UV-B was utilized [30]. This may account for the significance placed on the T^T lesions over other photoproducts in decades of laboratory experimentation. The bipyrimidine photoreactivity coefficients utilized in our Pg calculations were adapted from a study that measured UV-B induced CPDs and 6-4PPs [33].
Our data clearly show that T limitation reduces the incidence of TT sites, and thus the potential for T^T formation (Figure 3). However, this is an incomplete picture. We developed the Pg formula (Equation (1)) to further assess other bipyrimidine impacts on photoreactivity. For example, the corresponding enrichment in C nucleotides and the higher photoreactivity of the TC bipyrimidine [33] also impact the Pg score for a halophilic archaeal genome. Nevertheless, the strong, negative correlation observed between Pg and G + C content (Figure 5) gives evidence that T limitation facilitates a net increase in photoprotection. Note that the three most photoreactive sequences, TC, TT, and CT [32,33], are T-containing: this could explain the observed relationship between genomic photoreactivity and G + C content. It should also be noted that C^C photolesions are the most mutagenic [32], which suggests that there is more work to be done beyond photoreactivity to explore mutagenesis from solar irradiation. Finally, determination of the role that flanking sequences [30] play in the formation of photolesions could add another dimension to this work.
From a genome evolutionary perspective, if exposure to UV light were driving an organism's genomic bipyrimidine signature, we would expect to see this adaptation in cyanobacteria, a photosynthetic group of microorganisms, and we do not (Figure 4). Conversely, we would expect members of the enterobacteriaceae group, which dwell inside higher eukaryotes protected from the sun, to be void of UV adaptations, but in fact, their photoreactivity scores are more similar to halophilic archaea than any other sample group (Figure 4). The analysis of our controls gives evidence that there is no clear or predictive relationship between Pg and UV-resistance alone. The similarity between the enterobacteriaceae and halophilic archaea stems not from lifestyle, but instead from their relatively high G + C contents (Table 2), which have been shown to result in low Pg values (Figure 5). Hence, we find no evidence in our analysis of genomic photoreactivity that UV exposure is a selective pressure for a photoprotective bipyrimidine signature. One important outlier is Haloquadratum walsbyi, a dominant halophilic archaeon in salt lakes and salterns. This microorganism is an exception to the G + C-rich genomes of the other members of this group, having only 48% G + C content (Table 1) [37][38][39]. This square-shaped organism otherwise shares the same ecosystem niche as the other extreme halophiles, thriving in high UV exposure; however, it lacks the photoprotective genome signature seen in other halophilic archaea, having a Pg value of 0.260, which is over 8 standard errors greater than the sample group's mean of 0.245 (Figure 4). Bolhuis et al. [37] noted that H. walsbyi has a relatively high number (four) of photolyase genes, and this could impact its ability to counteract solar DNA damage.
If the overwhelming majority of halophilic archaea have genomes with lower photoreactivity because of their high G + C contents (Table 2), then we must ask how the G + C richness evolved. Litchfield [45] pointed to the stability of genomes rich in base pairs that have more hydrogen bonds; GC pairs have three and AT pairs have two. Halophilic archaea living in high salinity (and having high intracellular cation concentrations) would benefit from DNA helices that are more tightly paired, adding stability to their molecular structure in a destabilizing environment. High G + C content has also been discussed as a hypersaline adaptation that impacts the proteome composition of these microorganisms [46]. A + T limitations result in preferences for particular amino acids (those with G + C-rich codons) over others. For example, acidic residues such as aspartic acid would be over-represented and cysteine would be under-represented. Paul et al. demonstrate that this signature would reduce the likelihood of helices forming in protein tertiary structure but would positively impact coil structures. Both stronger DNA hydrogen bonding and preferential codon usage (and resulting protein structure) could impact the ability of halophilic archaea to manage life in hypersaline waters.
Genomic signatures, such as bipyrimidine limitations, should be examined when painting a picture of genome evolution for any microbial community. Each related group of species in their environment over time results in a unique signature of nucleic acid and protein composition [47][48][49], and G + C content is a marker for this signature. The UV-resistance observed in halophilic archaea can