Metabolism of the Genus Guyparkeria Revealed by Pangenome Analysis

Halophilic sulfur-oxidizing bacteria belonging to the genus Guyparkeria occur at both marine and terrestrial habitats. Common physiological characteristics displayed by Guyparkeria isolates have not yet been linked to the metabolic potential encoded in their genetic inventory. To provide a genetic basis for understanding the metabolism of Guyparkeria, nine genomes were compared to reveal the metabolic capabilities and adaptations. A detailed account is given on Guyparkeria’s ability to assimilate carbon by fixation, to oxidize reduced sulfur, to oxidize thiocyanate, and to cope with salinity stress.


Introduction
The halophilic sulfur-oxidizing species Guyparkeria [1] was previously classified as a member of Thiobacillus [2] and then of Halothiobacillus [3]. Currently, two species are declared, namely the type strain G. halophila DSM 6132 (which is also the type species of Guyparkeria [4]) and the type strain G. hydrothermalis R3 [2]. Since then, more Guyparkeria isolates and metagenome-assembled genomes (MAGs) have been identified. They originated from marine and terrestrial environments: hydrothermal vent chimney, deep-sea sediment, coastal sediment, hot spring sediment, and lacustrine sediment (see Table 1 for details). Guyparkeria is also being explored to be used as a more environmentally friendly strain in metal extraction from ores [5]. As of now, common physiological characteristics displayed by Guyparkeria isolates have not been linked to the metabolic potential encoded in their genetic inventory. The genome of strain SCN-R1 has been studied [6,7], but the focus mainly centered around its thiocyanate oxidation, an ability that is so far exclusive to this strain.
Seven genomes assigned to Guyparkeria based on the Genome Taxonomy Database (GTDB) [8] taxonomic classification have become available in the past few years, making it possible to perform a genome-wide study. To provide a genetic basis for understanding the metabolism of the genus Guyparkeria, genomes were sequenced from G. hydrothermalis R3, isolated from hydrothermal vent chimney [2], and a coastal strain B1-1 [9] and compared with the seven publicized genomes. The metabolic capabilities and adaptations are discussed on the basis of the pan genome data. Note: * Guyparkeria halophila is the type species of Guyparkeria according to Boden (2017) [1].ˆScaffolds < 1500 nucleotides (nt) were removed in this study.

Microorganisms and Culturing Media
G. hydrothermalis R3 (hereafter, R3) was isolated from an active hydrothermal vent chimney at 2000 m depth at the North Fiji Basin, Tonga, Pacific Ocean [2]. Live culture of R3 was purchased from DSMZ-German Collection of Microorganisms and Cell Cultures GmbH (catalog number: 7121). G. hydrothermalis B1-1 (hereafter, B1-1) was isolated more recently from sediments of marine culture cage area at Bachimen, Zhangzhou, Fujian [9]. B1-1 was determined to the species level based on its high 16S rRNA gene similarity (>99%) and phylogenetic relatedness to R3 [9]. Live culture of B1-1 was purchased from China Center for Type Culture Collection (catalog number: CCTCC AB 2016151).
Live culture of B1-1 cells was maintained at 28 • C on 1% agar plates prepared from a medium, following Chen et al. [9] with minor modifications, containing (per L) 0.05 g anhydrous MgSO 4 , 5 g Na 2 S 2 O 3 ·5H 2 O, 2 g K 2 HPO 4 , 0.1 g (NH 4 ) 2 SO 4 , 0.08 g anhydrous CaCl 2 , 0.02 g FeSO 4 ·7H 2 O in 1 L of aged seawater (i.e., sea water collected from the nearby coast and was kept in dark for at least 1 week before use), adjusted to pH 7.6. The medium was sterilized by autoclaving.

Genomic DNA Extraction and Sequencing of Strains B1-1 and R3
Batch culture (3-6 mL) was centrifuged at 10,000 rpm for 1 min and the supernatant was discarded. The cell pellet was then suspended in 180 µL of lysozyme solution (20 mg/mL lysozyme in buffer containing 20 mM Tris, pH 8.0; 2 mM Na 2 -EDTA; 1.2% Triton), and then incubated at 37 • C for 30 min. Cell lysis was achieved by added 400 µL of lysis buffer (containing 40 mM Tris; 20 mM sodium citrate; 1% SDS; 1 mM EDTA; 20 mg/mL proteinase K). After mixing with 300 µL of 5 M NaCl by inverting up-and-down., chloroform (800 µL) was added and mixed by inversion, then let the solution sit still for 2 min before centrifugation at 12,000 rpm for 10 min. About 500-600 µL of crude DNA solution was taken from the upper aqueous layer, to which double volume of absolute ethanol was added, before being left to sit for 2 min at room temperature to precipitate the DNA. After centrifugation at 12,000 rpm for 10 min, supernatant was discarded, and the tube was placed upside down to air dry. Finally, DNA pellet was suspended in 300 µL of sterile double-distilled water and the DNA solution was kept at −20 • C. DNA samples were sent to Novogene Co., Ltd. (Tianjin, China) at which metagenomic library was prepared using NEB Next Ultra DNA Library Prep Kit and sequenced by Illumina Novaseq 6000.

Pangenome and Core Genome Analyses
Seven genome sequences of Guyparkeria (Table S2), as classified by GTDB in both the previous Release 95 and the current Release 202, were retrieved from the NCBI FTP site. It is noted that the taxonomy of these genome sequences in NCBI differs from that in GTDB, and only two (Guyparkeria halophila sp2 and Guyparkeria sp. SCN-R1) are classified as Guyparkeria by NCBI (Table S2). Scaffolds shorter than 1500 nucleotides (nt) were discarded. EDGAR v3.0 platform [17] was used to perform comparative genomic analysis and visualize the results. The tools used in this study include calculating the Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI) values, assigning orthologous genes and using those to calculate the pangenome. EDGAR v3.0 was also used to derive exponential decay functions that predict the development of the pan genome or core genome with increasing genome number. The formulas for Heaps' Law (the least squares fit of the power law to medians) for the pan genome (1) and the least squares fit of the exponential regressing decay to medians (2) were used, respectively [18]: where n is the number of genes, N is the number of genomes, and k and γ are fitted constants. α is defined as 1 − γ.
where n is the number of genes, N is the number of genomes, e is Euler number, and k, τ and tgθ are constants.
Of the nine genomes included in this study, G. halophila sp2 genome is declared as a complete genome, therefore, it was used as the reference genome when analyzing core genes for Guyparkeria.
As genes encoding hydrogenase, thiosulfate:quinone oxidoreductase and thiosulfate dehydrogenase were not identified in the automated genome annotations, additional search using HMMER v3. 3
With the exclusion of Halothiobacillaceae bacterium SpSt-1134 genome, 1458 core genes were extracted from the Guyparkeria genomes and aligned individually on the EDGAR v3.0 platform. Multiple sequence alignment file of concatenated core genes (in total 501,819 residue positions), provided by the EDGAR team, was analyzed by Prottest v3 and the recommended evolutionary model JTT+G was used to obtain a maximum likelihood tree followed by bootstrap analysis (--bs-trees autoMRE) using RAxML-NG v1.0.3 (httpes://github.com/amkozlov/raxml-ng; installed on 17 November 2021).
The numbers of coding sequences (CDS) predicted by CheckM and PROKKA for each genome were different. These differences were generally small (0-3.6%), given the total number of predicted CDS, except for Halothiobacillaceae bacterium SpSt-1134, for which the difference accounted for~18%. Side-by-side annotation results for strains B1-1 and R3 are provided in Tables S3 and S4. As the annotations by NCBI PGAP and BAKTA yielded a lower percentage of hypothetical proteins, they were desirably used for the later comparative genomic analysis.
G. hydrothermalis B1-1 and R3 and G. halophila sp2 have two copies of 16S, 5S, 23S ribosomal RNA (rRNA) genes. G. hydrothermalis B1-1 and R3 genomes have identical 16S rRNA gene copies, whereas the two 16S rRNA genes in G. halophila sp2 genome are not identical, with one being more related to B1-1 and the other to R3 (Figure 1). The slight difference (3 out of 1539 nucleotide positions) between the two 16S rRNA gene copies in G. halophila sp2 genome could be a result of: (1) mixed populations or variants; (2) sequencing error; (3) assembly error; and (4) point mutation. With 16S rRNA gene pairwise identities between 96.46% and 100%, and AAI values >82.36% (Figure 2), the genomes compared, except Halothiobacillaceae bacterium SpSt-1134, can confidently be regarded as being within the same (i.e., Guyparkeria) genus in accordance with the definition (AAI of 65-95% and 16S rRNA of 95-98.6%) given in Konstantinidis et al. (2017) [22]. It is worthy to note that Halothiobacillus sp. S21.Bin061 shows the lowest ANI and AAI values among the 10 Guyparkeria genomes (Figures 2 and 3). While clearly being a member of the Guyparkeria genes, it still seems to be evolutionarily more distant than the other genomes in this analysis. This fact, together with the lower completeness of this genome and the resulting lower CDS number (Table 1), leads to a bipartite pattern in the core genome development plot (see Section 3.3 for discussion). Given the described slightly higher deviation of strain S21.Bon061 from the other Guyparkeria genomes, it was used to root the phylogenetic tree. Due to the low completeness of Halothiobacillaceae bacterium SpSt-1134 and its relatively low AAI values (~68%) compared to other Guyparkeria genomes, it was omitted from the core genome and phylogenetic analysis.   It is noted that the genomes of Halothiobacillus sp. XI15 and Halothiobacillus sp. WRN 7 are of identical length (Table 1) and ANI values (Figure 3), although they are claimed t originate from different habitats and geographic locations (XI15 from Kebrit deep brine seawater interface, Red Sea, Saudi Arabia and WRN-7 from saline-alkaline soil, Tian Ji City, China). As of the time of writing this manuscript, clarification has not been receive from the author who submitted both of these genomes; as a result, no attempt was mad to associate the observed phylogeny and genes with their origins.
The genomes of the type strain G. hydrothermalis R3, Halothiobacillus sp. SB14A and G halophila sp2 share >97% ANI values (Figure 3), which satisfies the recommended criterio recommended for organisms of the same species (i.e., >95-96% ANI; [22,23]). Nonetheless it warrants a careful look at their morphological and physiological differences befor proposing their detailed taxonomic classification.  It is noted that the genomes of Halothiobacillus sp. XI15 and Halothiobacillus sp. WRN 7 are of identical length (Table 1) and ANI values (Figure 3), although they are claimed t originate from different habitats and geographic locations (XI15 from Kebrit deep brine seawater interface, Red Sea, Saudi Arabia and WRN-7 from saline-alkaline soil, Tian Jin City, China). As of the time of writing this manuscript, clarification has not been received from the author who submitted both of these genomes; as a result, no attempt was mad to associate the observed phylogeny and genes with their origins.
The genomes of the type strain G. hydrothermalis R3, Halothiobacillus sp. SB14A and G halophila sp2 share >97% ANI values (Figure 3), which satisfies the recommended criterion recommended for organisms of the same species (i.e., >95-96% ANI; [22,23]). Nonetheless it warrants a careful look at their morphological and physiological differences befor proposing their detailed taxonomic classification. It is noted that the genomes of Halothiobacillus sp. XI15 and Halothiobacillus sp. WRN-7 are of identical length (Table 1) and ANI values (Figure 3), although they are claimed to originate from different habitats and geographic locations (XI15 from Kebrit deep brineseawater interface, Red Sea, Saudi Arabia and WRN-7 from saline-alkaline soil, Tianjin, China). As of the time of writing this manuscript, clarification has not been received from the author who submitted both of these genomes; as a result, no attempt was made to associate the observed phylogeny and genes with their origins.
The genomes of the type strain G. hydrothermalis R3, Halothiobacillus sp. SB14A and G. halophila sp2 share >97% ANI values (Figure 3), which satisfies the recommended criterion recommended for organisms of the same species (i.e., >95-96% ANI; [22,23]). Nonetheless, it warrants a careful look at their morphological and physiological differences before proposing their detailed taxonomic classification.

Phylogenetic Relatedness of Guyparkeria 16S rRNA Genes, Genomes and soxB Genes
The analysis of 16S rRNA genes, the gold standard for understanding phylogeny of life, indicated that the coastal strain G. hydrothermalis B1-1 is closely related to Halothiobacillus sp. WRN-7 and Halothiobacillus sp. XI15, whereas the deep-sea vent chimney strain G. hydrothermalis R3 is closely related to marine sediment strain Halothiobacillus sp. SB14A and G. halophila sp2 from cold seep sediment. Congruent with the 16S rRNA gene tree, the two distinct clades are also formed with robust bootstrap support in the phylogenetic tree (Figure 4), which is in agreement with the observed ANI or AAI similarities. Similar clustering is also observed in the analysis of soxB genes ( Figure 5

Pangenome of Guyparkeria
The analysis of the Guyparkeria genomes presented a pan genome size of 3204, with 46% (1458 CDS), 28% (909 CDS) and 26% (837 CDS) being in the core and accessory genomes, and as singleton genes, respectively (Table S5). The core genome therefore accounted for~63-77% of CDS in the studied genomes. This amount of shared genes in the eight studied Guyparkeria strains is quite high, when compared to that reported from clinicalrelevant bacterial species from several phyla (~20-64% determined from 50 genomes per species; [24]) and the hydrothermal vent sulfur-oxidizing epsilonproteobacterial genera Sulfurovum (~5% from 20 genomes; [25]) and Sulfurimonas (~33-50% from 11 genomes; [26]).  From the Guyparkeria genomes, the Heaps' Law function for the pangenome was determined as 2092*(Nˆ0.185) ( Figure 6A) and the exponential regression decay function for the core genome was 670.278*e (−N/4.764) + 1384.351 ( Figure 6B). The predicted alpha value of Heaps' Law is 0.815, where a value of less than 1 indicates that the pan genome is open. It is understandable that the proportion of shared genes will decrease with the addition of more Guyparkeria genomes.
From the Guyparkeria genomes, the Heaps' Law function for the pangenome was determined as 2092*(N^0.185) ( Figure 6A) and the exponential regression decay function for the core genome was 670.278*e (−N/4.764) + 1384.351 ( Figure 6B). The predicted alpha value of Heaps' Law is 0.815, where a value of less than 1 indicates that the pan genome is open. It is understandable that the proportion of shared genes will decrease with the addition of more Guyparkeria genomes. The core genome plot shows a bipartite pattern of values. This is due to the lower similarity of Halothiobacillus sp. S21.Bin061, as indicated by the lowest ANI and AAI values in the pairwise comparisons among the Guyparkeria genomes (Figures 2 and 3). Another factor to explain this pattern is the smaller genome size of S21.Bin061 (92% completeness, Table 1), which has 140-380 CDS less than other Guyparkeria genomes ( Table 1). The core development plot ( Figure 6B) uses random sampling of sets of n genomes, with n being the range 1-10 in this case. Whenever the Halothiobacillus sp. S21.Bin061 genome is part of the random sample, the core genome size is ~250 genes smaller than when this genome is not sampled. If this genome is omitted from the analysis, the core development plot shows The core genome plot shows a bipartite pattern of values. This is due to the lower similarity of Halothiobacillus sp. S21.Bin061, as indicated by the lowest ANI and AAI values in the pairwise comparisons among the Guyparkeria genomes (Figures 2 and 3). Another factor to explain this pattern is the smaller genome size of S21.Bin061 (92% completeness, Table 1), which has 140-380 CDS less than other Guyparkeria genomes ( Table 1). The core development plot ( Figure 6B) uses random sampling of sets of n genomes, with n being the range 1-10 in this case. Whenever the Halothiobacillus sp. S21.Bin061 genome is part of the random sample, the core genome size is~250 genes smaller than when this genome is not sampled. If this genome is omitted from the analysis, the core development plot shows a narrower distribution of values (see Figure S1).

Genetic Basis of Key Metabolic Functions of Guyparkeria
The core genome provides the genetic basis of key metabolic features that have been reported for the four described Guyparkeria isolates, namely the type strain of G. halophila DSM 6132 (also being the type species of Guyparkeria; [4]), the type strain of G. hydrothermalis R3 [2], and strains SCN-R1 [7] and B1-1 [9].
Carbon assimilation by autotrophy. CO 2 is fixed using the Calvin-Benson-Bassham (CBB) cycle (Figure 7 and Table S5). To facilitate CO 2 fixation, Guyparkeria employs proteincoated carboxysomes (encoded by cso genes) that house a carbonic anhydrase (CA), converting bicarbonate ions to CO 2 , and form I ribulose-bisphosphate carboxylase (RuBisCO), converting the 5-carbon compound ribulose-1,5-bisphosphate (RuBP) to form two molecules of 3-carbon compound 3-phosphoglycerate (3-PGA) [27]. It is believed that energy-dependent bicarbonate transporters, such as SulP, play a role in concentrating the intracellular bicarbonate level, similar to other autotrophs such as marine cyanobacteria [28] and sulfur-oxidizer Thiomicrospira crunogena [29]. Carbon-concentrating mechanism enables Guyparkeria to cope with times when the dissolved inorganic carbon availability is low or scarce.
The 3-PGA enters the CBB cycle that generates glyceraldehyde 3-phosphate (G-3-P), and regenerates RuBP to complete the CBB cycle. G-3-P is an important metabolic intermediate that generates precursors pyruvate and acetyl-CoA for biomass synthesis (e.g., lipids, nucleotides, proteins, etc.). A partial tricarboxylic acid (TCA) cycle is observed, with the genes encoding the enzymes of the oxidation arm being present in the Guyparkeria genomes but not that of the reductive arm (Figure 7). The incomplete TCA cycle explains why G. hydrothermalis R3 cannot assimilate acetate for heterotrophic growth [2]. G-3-P may also be directed to synthesize starch (Figure 7).
The described genetic composition of central carbon metabolism in Guyparkeria resembles very much that of sulfur-oxidizing bacteria Thiomicrospira crunogena XCL-2 [30]. Scott and coworkers [30] hypothesized that when there is ample supply of reduced sulfur compounds (electron donors) and oxygen (terminal electron acceptor), the CBB cycle is employed for growth, with some carbon being used to produce starch; when reduced sulfur compounds become limiting, starch is used as the main source for energy and carbon intermediates. This hypothesis of adaptive carbon use in response to changes in energy substrate levels may also apply to Guyparkeria. Scott and coworkers [30] also hypothesized that hydrogenase may play a role in maintaining a membrane proton potential in Thiomicrospira crunogena XCL-2, which is not likely to be applicable to Guyparkeria, because genes encoding for hydrogenase are not apparent in any of the Guyparkeria genomes (not among singleton genes (Table S5) and no valid hit was found by hmmsearch). The absence of hydrogenase-coding genes explains the observation that G. hydrothermalis R3 does not grow under CO 2 and H 2 /O 2 [2].
Sulfur oxidation. Guyparkeria genomes encode a complete SOX pathway, as existing isolates showed complete oxidation of thiosulfate to sulfate (Figure 7 and Table S5). Interestingly, the sox genes are not organized in a single operon, suggesting that in the absence of an alternative pathway, they are always expressed and do not require complex regulation [30]. The sox genes are split into three clusters: soxBAX, soxCD and soxZY in all Guyparkeria genomes except that of SCN-R1 and S21.Bin061, in which the soxB gene is separated from the soxAX gene cluster. The production of sulfate results in lower pH, as has been reported for Guyparkeria isolates (an end point of pH 4.8-6.0; [1]). Guyparkeria cells grown under thiosulfate are coated with elemental sulfur. The accumulation of elemental sulfur outside of the cells could be explained by incomplete oxidation of thiosulfate at lower pH that inhibits the activity of subunit soxCD [31].
enables Guyparkeria to cope with times when the dissolved inorganic carbon availability is low or scarce.
Multiple copies of genes related to sulfide dehydrogenases were identified. The presence of fcc and sqr genes encoding for flavocytochrome c (FCC) and sulfide:quinone oxidoreductase (SQR), respectively, suggested that they are responsible for the oxidation of sulfide to elemental sulfur with the involvement of electron transfer to cytochromes [32].
Thiocyanate oxidation. Of the four Guyparkeria isolates, it was reported that G. halophila DSM 6132 and G. hydrothermalis R3 do not use thiocyanate as an energy source, but only strain SCN-R1 [7]. Tsallagov et al. [6] revealed that SCN-R1 genome possesses the gene encoding for thiocyanate dehydrogenase as well as cyanate hydratase or cyanase (CYN). Interestingly, the cyn gene is present in all Guyparkeria genomes, and in its proximity is another copy of CA. Further investigation is needed to find out whether thiocyanate oxidation would be a general characteristics for Guyparkeria, and under what conditions. Salt tolerance. A variety of compounds are known to be employed by diverse halotolerant or halophilic microorganisms to cope with osmolytic stress, such as ectoine in Proteobacteria and glycine betaine in halophiles [33]. As suggested for strain SCN-R1 [6], ectoine is considered to be the primary osmolyte for osmoregulation synthesized by halophilic Guyparkeria using ect genes (Table S5). Removal of ectoine by Guyparkeria can be achieved by ectoine-degrading enzymes (encoded by doe genes). Secondary osmolytes, specifically glycine betaine, may be imported through membrane transporters in the ABC and BCCT families, including OpuD (Table S5).
Other metabolisms. Ammonium is used as an N source [1], and is imported by ammonium transporter AMT. The addition of ammonium (0.8 g/L) was reported to shorten the lag phase of B1-1 culture when compared to adding nitrate and peptone of the same concentration [34]. Ubiquinone 8 (Q-8) is the dominant respiratory quinone used by Guyparkeria [1]. The ubi genes for ubiquinone 8 synthesis are present in all genomes. ATP is synthesized via oxidative phosphorylation, with genes encoding for complex I NADH:quinone oxidoreductase (NDH-1), complex III Cytochrome bc1 complex respiratory unit, complex IV Cytochrome c oxidase (including cbb3 type of high O 2 affinity), complex V F-type ATPase being present in the Guyparkeria genomes (Table S5). Multi-subunit cation transporters, such as that encoded by Mnh genes, are probably involved in Na+/H+ and pH homeostasis. Guyparkeria are capable of transporting essential inorganic nutrients and metal cofactors such as phosphate, molybdate/molydenate, zinc, iron, nitrate, cobalt, sulfate, manganese and copper (cus and cop genes) (Table S5). Lipopolysaccharide production (lpx and lip genes) and exportation (lpt genes), as well as biofilm formation signaling genes, are present in most, if not all, Guyparkeria genomes, suggesting that Guyparkeria are capable of forming biofilm for protection against adverse conditions (e.g., toxins, heavy metals, fluid flow, etc.) [35,36]. In addition, Guyparkeria genomes possess genes encoding for the formation of type IV pili (T4aP, pil genes) (Figure 7). Type IV pili exhibit dynamic and diverse functions such as twitching motility, surface sensing, DNA uptake and even virulent infection to other cells [37], thus enabling Guyparkeria to respond to environmental changes.

Highlights on Some Accessory Genes in B1-1 and R3 Genomes
One of the physiological characteristics distinguishing G. halophila and G. hydrothermalis from Halothiobacillus is that tetrathionate was not a detectable intermediate when thiosulfate was provided as the sole electron donor for Guyparkeria cells, although they are able to oxidize tetrathionate [1]. It was reported that strain SB14A expressed soxB and soxC genes when growing on tetrathionate. Thiosulfate-oxidizing gammaproteobacteria that form tetrathionate either use doxDA-encoded thiosulfate:quinone oxidoreductase or tsdAencoded thiosulfate dehydrogenase [38,39], yet these genes are not present in Guyparkeria genomes, except for a tsdA gene homolog that was identified in strain B1-1 and SCN-R1 (by automated genome annotation (Table S5) and hmmsearch). Without empirical evidence, it is too early to comment on the actual function of the tsdA gene homologs encoded in the genomes of strains B1-1 and SCN-R1. It would be interesting to perform more targeted experiments to test whether any of the Guyparkeria isolates exhibits the physiological characteristic of converting thiosulfate to tetrathionate.
Interestingly, genes encoding for clustered regularly interspaced short palindromic repeats (CRISPR) are present in the genomes of strains R3, sp2 and SB14A, which form a cluster in phylogenetic trees constructed for 16S rRNA genes, and soxB genes and genomewide shared genes. CRISPR genes are known as a bacterial defense mechanism against virus/phage infection. This can partly be explained by the considerable amount of viral load in marine environments (e.g., vent chimney, cold seep sediment and deep-sea sediment) from which these strains were isolated [40][41][42].