Large-Scale Integration of Amplicon Data Reveals Massive Diversity within Saprospirales, Mostly Originating from Saline Environments

The order Saprospirales, a group of bacteria involved in complex degradation pathways, comprises three officially described families: Saprospiraceae, Lewinellaceae, and Haliscomenobacteraceae. These collectively contain 17 genera and 31 species. The current knowledge on Saprospirales diversity is the product of traditional isolation methods, with the inherited limitations of culture-based approaches. This study utilized the extensive information available in public sequence repositories combined with recent analytical tools to evaluate the global evidence-based diversity of the Saprospirales order. Our analysis resulted in 1183 novel molecular families, 15,033 novel molecular genera, and 188 K novel molecular species. Of those, 7 novel families, 464 novel genera, and 1565 species appeared in abundances at ≥0.1%. Saprospirales were detected in various environments, such as saline water, freshwater, soil, various hosts, wastewater treatment plants, and other bioreactors. Overall, saline water was the environment showing the highest prevalence of Saprospirales, with bioreactors and wastewater treatment plants being the environments where they occurred with the highest abundance. Lewinellaceae was the family containing the majority of the most prevalent species detected, while Saprospiraceae was the family with the majority of the most abundant species found. This analysis should prime researchers to further explore, in a more targeted way, the Saprospirales proportion of microbial dark matter.


Materials and Methods
To create the initial dataset of sequences associated with the order Saprospirales, we executed taxonomy and similarity queries in the IMNGS and SILVA databases [44,45]. As neither of these databases is up to date with the recent splitting of the Saprospirales order, the search term was limited to "Saprospiraceae". For the taxonomy query, sequences classified to the family of interest were extracted from SILVA. These sequences were also used as the input in IMNGS for the similarity query. The gathered sequences (n = 988 K) were dereplicated, and then aligned and reclassified using SINA (v.1.7.2) with the SILVA SSU database (v.138) [46]. Reclassification was performed to update the taxonomy information and to retain the sequences belonging to the Saprospirales order. Following this, the novel tool "taxonomy informed clustering" (TIC) [47] was used to process the sequencing data. TIC is a new clustering algorithm that first divides taxonomically annotated sequences into bins of the same taxonomy down to the genus level. Then, it performs incremental clustering using only the sequences confined within the same taxonomic level, which prevents clusters from being contaminated with sequences of clearly different phylogenetic origin whose overall sequence similarity is nevertheless above the set cut-off levels.
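The idea of clustering only within taxonomic bins can be illustrated with a minimal sketch. This is not the TIC implementation: the greedy centroid strategy, the `identity` measure over aligned columns, and the names `identity` and `cluster_within_bins` are assumptions made purely for illustration.

```python
from collections import defaultdict

def identity(a, b):
    """Fraction of identical bases over columns where both aligned sequences have a base."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def cluster_within_bins(records, cutoff):
    """Greedy clustering restricted to sequences sharing the same genus label.

    records: iterable of (seq_id, genus_label, aligned_sequence).
    Returns {genus_label: [[seq_id, ...], ...]}.
    """
    # Step 1: bin sequences by their taxonomic annotation.
    bins = defaultdict(list)
    for seq_id, genus, seq in records:
        bins[genus].append((seq_id, seq))

    # Step 2: cluster incrementally, but only inside each bin.
    clusters = {}
    for genus, members in bins.items():
        centroids = []  # list of (centroid_sequence, member_ids)
        for seq_id, seq in members:
            for centroid_seq, ids in centroids:
                if identity(seq, centroid_seq) >= cutoff:
                    ids.append(seq_id)
                    break
            else:
                # No existing centroid is similar enough: start a new cluster.
                centroids.append((seq, [seq_id]))
        clusters[genus] = [ids for _, ids in centroids]
    return clusters
```

In this sketch, two identical sequences carrying different genus labels never end up in the same cluster, which is the property the binning step is meant to guarantee.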
Since different studies use different sequencing technologies or different primers, the resulting 16S rRNA gene fragments do not always overlap. To identify the most represented region of the 16S rRNA gene in our dataset, we calculated the representation of each position in the SINA multiple sequence alignment as the sum of all bases in that position. After identification of the most represented region, all sequences were trimmed around these positions. Sequences retaining more than 80% of the number of bases that the aligned Escherichia coli 16S rRNA gene has in this region were selected for further analysis. All sequences that did not cover, or only partially covered, that region were removed.
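The coverage calculation and the subsequent filtering can be sketched as follows. This is an illustrative sketch only: treating both "-" and "." as gap characters is an assumption about the alignment output, and selecting the region with a fixed-width sliding window (`best_window`) is an assumed stand-in for however the region is actually chosen. All function names are hypothetical.

```python
GAPS = "-."

def column_coverage(alignment):
    """Count, for each alignment column, how many sequences have a base there."""
    counts = [0] * len(alignment[0])
    for seq in alignment:
        for i, ch in enumerate(seq):
            if ch not in GAPS:
                counts[i] += 1
    return counts

def best_window(counts, width):
    """Start index of the fixed-width window with the highest summed coverage."""
    current = sum(counts[:width])
    best, best_start = current, 0
    for start in range(1, len(counts) - width + 1):
        # Slide the window one column to the right.
        current += counts[start + width - 1] - counts[start - 1]
        if current > best:
            best, best_start = current, start
    return best_start

def keep_covering(alignment, start, end, reference, min_fraction=0.8):
    """Trim to [start, end) and keep sequences with more than min_fraction
    of the bases the aligned reference has in that region."""
    expected = sum(ch not in GAPS for ch in reference[start:end])
    kept = []
    for seq in alignment:
        region = seq[start:end]
        if sum(ch not in GAPS for ch in region) > min_fraction * expected:
            kept.append(region)
    return kept
```

Here `reference` plays the role of the aligned E. coli sequence, which supplies the expected number of bases for the selected region.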
The 16S reference sequences of 24 known Saprospirales species were obtained directly from the NCBI and SILVA databases. For the 7 remaining described species, the 16S rRNA gene sequences were extracted from their respective genomes using BLAST, with the 16S reference sequence of Aureispira maritima as a query. Finally, the 31 reference 16S sequences were aligned and trimmed around the selected SINA positions.
Traditionally, the cut-offs of 97%, 95%, and 90% 16S rRNA gene similarity are used to denote sufficient evolutionary divergence for the classification of distinct species, genera, and families, respectively. Nevertheless, smaller regions of the 16S rRNA gene do not always mirror the evolutionary information captured by the whole gene. Adjusting similarity cut-offs for selected regions is important to avoid over- or underestimation of diversity. For the selected region of the 16S rRNA gene, we evaluated the corresponding similarity cut-offs to be used for clustering of species, genera, and families, based on the actual sequence distances among all known Saprospirales species. We found that the known species, when compared over the selected region, showed on average 96% similarity among species of the same genus, 92% similarity among species across genera of the same family, and 90% similarity among species belonging to different families. Those values were used as clustering cut-offs in TIC for determining the diversity of molecular species (sOTUs), molecular genera (gOTUs), and molecular families (fOTUs), respectively.
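The derivation of region-specific cut-offs from reference sequences can be sketched as an average of pairwise similarities grouped by taxonomic relationship. The sketch below assumes pre-aligned, trimmed reference sequences and a simple column-wise identity; the function names are hypothetical, and the exact distance measure used in the study is not specified here.

```python
from itertools import combinations
from statistics import mean

def identity(a, b):
    """Fraction of identical bases over columns where both aligned sequences have a base."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def mean_similarity_by_level(references):
    """references: list of (family, genus, aligned_sequence) for known species.

    Returns the mean pairwise similarity for species pairs of the same genus,
    of different genera within one family, and of different families.
    """
    levels = {"same_genus": [], "same_family": [], "different_family": []}
    for (fam1, gen1, seq1), (fam2, gen2, seq2) in combinations(references, 2):
        sim = identity(seq1, seq2)
        if fam1 != fam2:
            levels["different_family"].append(sim)
        elif gen1 != gen2:
            levels["same_family"].append(sim)
        else:
            levels["same_genus"].append(sim)
    # Levels with no pairs are omitted rather than reported as zero.
    return {level: mean(values) for level, values in levels.items() if values}
```

Applied to the 31 trimmed reference sequences, the three averages would correspond to the species, genus, and family cut-offs, respectively.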
For the ecological analysis, the metadata available in the IMNGS database for each sequence in our dataset was used. We extracted information from IMNGS on the environments where each molecular species (sOTU) was detected (prevalence), as well as its abundance in each of those samples. The ecological analysis focused on the sOTUs with an abundance of ≥0.1% in at least one sample. The origins of samples with unclear metadata were determined manually by following their sequence read archive (SRA) accession numbers on the SRA website (https://www.ncbi.nlm.nih.gov/sra, accessed in January 2023). Manually assembled systems, such as laboratory-cultivated photosynthetic mats and biofilms on polymer material surfaces, were removed from the ecological analysis. We considered only the samples derived from natural environments as well as artificial environments such as wastewater treatment plants. Briefly, the natural environments in which the Saprospirales species were detected were saline water, saline water sediments, and beach sand (hereafter referred to as "saline water"), freshwater and freshwater sediments (referred to as "freshwater"), soil, air, terrestrial flora and fauna (referred to as "plant" and "host", respectively), as well as saline water flora and fauna (referred to as "plant saline water" and "host saline water", respectively). In addition, Saprospirales species were detected in samples derived from wastewater treatment plants, bioreactors, activated sludges, and fermentation processes (referred to as "bioreactor").
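The per-sOTU summary described above (abundance filter, most prevalent environment, and environment of maximum abundance) can be sketched as follows. The input layout, the function name `summarize_sotus`, and the choice to count every detection towards prevalence are assumptions for illustration; they do not reflect the actual IMNGS export format.

```python
from collections import Counter, defaultdict

def summarize_sotus(observations, min_abundance=0.1):
    """observations: iterable of (sotu_id, sample_id, environment, abundance_percent).

    Keeps sOTUs that reach min_abundance (in percent) in at least one sample
    and reports their prevalence and maximum-abundance environment.
    """
    per_sotu = defaultdict(list)
    for sotu, sample, env, abundance in observations:
        per_sotu[sotu].append((sample, env, abundance))

    summary = {}
    for sotu, rows in per_sotu.items():
        max_row = max(rows, key=lambda r: r[2])
        if max_row[2] < min_abundance:
            continue  # never reaches the abundance threshold in any sample
        env_counts = Counter(env for _, env, _ in rows)
        summary[sotu] = {
            "n_samples": len(rows),
            "prevalent_environment": env_counts.most_common(1)[0][0],
            "max_abundance": max_row[2],
            "max_abundance_environment": max_row[1],
        }
    return summary
```

The example keeps the two notions separate on purpose: an sOTU can be detected most often in one environment while reaching its highest abundance in another.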
Finally, known species were assigned to the formed sOTUs when BLAST similarity was above 98%. Some species were not distinguishable in the selected region, as they were assigned to the same sOTU. In the text, those indistinguishable sOTUs carry the names of both closest known species, i.e., Neolewinella marina/litorea, Neolewinella persica/agarilytica, Portibacter lacus/marinus, and Vicinibacter affinis/proximus.
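The naming rule for sOTUs that match more than one known species can be sketched in a few lines. The input is assumed to be a list of best BLAST hits already parsed into tuples; the function name `name_sotus` and the "/" joining of full species names are illustrative choices rather than the procedure used in the study.

```python
from collections import defaultdict

def name_sotus(hits, min_identity=98.0):
    """hits: iterable of (species_name, sotu_id, percent_identity).

    Assigns each known species to an sOTU when identity is above the
    threshold; sOTUs matched by several species receive a joint name.
    """
    assigned = defaultdict(set)
    for species, sotu, percent_identity in hits:
        if percent_identity > min_identity:
            assigned[sotu].add(species)
    return {sotu: "/".join(sorted(names)) for sotu, names in assigned.items()}
```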
The general workflow that was followed (Figure 1) can be easily adapted to other taxonomic groups of interest, simplifying future microbial diversity studies.

Processing Results
The evaluation of the most represented region of 16S rRNA gene sequences in our Saprospirales dataset pointed to a region spanning the 10 K to 25.5 K positions of the complete SINA alignment (Figure 2). Our estimation of expected bases across this region, using the SINA-aligned 16S rRNA gene of E. coli, was 282 bases. Following our requirement for an 80% minimum coverage over this region, we excluded every sequence that had fewer than 229 bases around our selected positions. After eliminating sequences targeting different regions, our final dataset was reduced to 691 K sequences from the initial 988 K.

Figure 2.
Agglomerative coverage of integrated sequences over their SINA alignment. The y-axis (counts) indicates the number of times a base has been found in the multiple sequence alignment for the respective position. Consecutive high counts correspond to regions overrepresented in the integrated dataset. Considering this, the most represented region spanned from the position 10 K to 25.5 K in the alignment. All aligned sequences were trimmed around these positions, and those left with a sufficient number of bases were selected for further analysis.

The incremental taxonomy-bounded clustering of the final sequencing data with TIC resulted in 118 K novel molecular species (sOTUs), 9 K molecular genera (gOTUs), and 1269 families (fOTUs) (a FASTA-formatted file, "Saprospirales_diversity.zip", with all the sOTUs and their assigned taxonomy is available in the Supplementary Materials). The terms species, genera, and families, when referring to the results of our analysis, should always be considered as short forms of "clusters of sequences with similarities over the selected region equivalent to the corresponding taxonomic level". Therefore, our sOTUs, gOTUs, and fOTUs are not equivalent to official taxonomies. The sequences clustered in one sOTU at 96% similarity could belong to multiple biological species. This means that our method tends to underestimate diversity relative to what common practices (ANI, phylogeny, function) would determine if they were applied to all isolates carrying the variants of the 16S rRNA gene in the dataset.

Analysis of Results
There were 204 K samples, out of 500 K pre-processed samples in the IMNGS database, that covered our selected region of interest. Within those, Saprospirales were found in almost 13% of IMNGS samples (Figure 3). Specifically, 48% of freshwater samples, 33% of saline water samples, 26% of plant samples, 22% of soil-derived samples, and 1% of samples originating from hosts (terrestrial fauna) were found to be positive. Furthermore, 13% of the samples marked as "other" were positive; this category included samples derived from different environments such as wastewater treatment plants, bioreactors, activated sludges, fermentation processes, air, and saline water flora and fauna.
Microorganisms 2023, 11, x FOR PEER REVIEW

Regarding the predicted number of molecular families, genera, and species belonging to the order Saprospirales, a small percentage of species appeared in abundance at ≥0.1% (Table 1). Only nine families had species with an abundance of ≥0.1%, including the three known families, meaning that only 0.71% of the predicted families had at least one species with an abundance of ≥0.1%. Similarly, 479 genera (3.18% of the predicted genera) had at least one species with an abundance of ≥0.1%, including all known genera except for Rubidimonas and Saprospira. Overall, 1565 species (1.33% of the predicted species) presented an abundance of ≥0.1%. More specifically, the sOTUs with an abundance of ≥0.1% were distributed in 132 genera belonging to Saprospiraceae, including 9 out of 11 known genera in the family; in 186 genera belonging to Lewinellaceae; and in 150 genera belonging to Haliscomenobacteraceae, including all known genera of both of these families (Table 2). The remaining seven unknown families had 1 to 4 genera, resulting in 11 genera with a species abundance of ≥0.1%. Overall, regardless of the species abundance, most of the predicted genera were classified as Lewinellaceae.

Regardless of species abundance, almost all predicted species were classified to known families (Table 3). Specifically, 98% of the predicted species and 99% of species with abundances of ≥0.1% were classified to the three known families. Concerning these two groups, most of the species classified to Saprospiraceae belonged to known genera (53% of all predicted species and 56% of those with a ≥0.1% abundance), while most of the species classified to Lewinellaceae and to Haliscomenobacteraceae belonged to unknown genera (56% and 59% for Lewinellaceae, and 80% and 68% for Haliscomenobacteraceae, respectively) (Tables 3 and 4).

Table 3. Predicted global sequenced-based diversity of species belonging to known and unknown families within the order Saprospirales.

In particular, almost 82% of the unknown families had only one species, while 0.17% of the unknown families had more than 51 species classified to each of them (Figure 4). Furthermore, almost 17% of the unknown families had 2 to 10 species, while the rest had 11 to 50 species. On the other hand, all unknown families with a species abundance of ≥0.1% had fewer than 10 species. More specifically, these seven unknown families had one to six species each: five families had one species, one family had two species, and the last unknown family had six species classified to it. Moreover, the latter family had five genera, while the rest of these families had only one. Regarding the known families, the majority of the unknown genera within the three known families had 2 to 10 species each, regardless of their abundance (Figure 4). The same applies to the known genera of these families, except for Saprospiraceae species with an abundance of ≥0.1%. In this case, most of the known genera had either 2 to 10 species or at least 51 species.

Figure 4.
Distribution of the number of species within unknown families, unknown genera, and known genera. The three plots present the percentage of unknown families, unknown genera, and known genera that have only 1 species (blue), 2 to 10 species (orange), 11 to 50 species (grey), and more than 51 species (green).

Families
Regarding the environmental distribution of Saprospirales families, saline water was the environment where most species were present and had their maximum abundance (Figure 5). Briefly, for Haliscomenobacteraceae, the descending order of environments according to species prevalence was saline water > host saline water > freshwater > bioreactor > plant saline water > soil > air > host. Similarly, for Lewinellaceae, the corresponding descending order was saline water > bioreactor > host saline water > freshwater > soil > plant saline water > air = host, and for Saprospiraceae, it was saline water > bioreactor > freshwater > soil > host saline water > host > plant saline water. As far as the environments with the maximum species abundance for each known family are concerned, the respective descending orders were as follows: for Haliscomenobacteraceae, saline water > host saline water > freshwater > bioreactor > plant saline water > soil > air > host > plant; for Lewinellaceae, saline water > bioreactor > freshwater > host saline water > soil > plant saline water > host > air = plant; and for Saprospiraceae, saline water > bioreactor > freshwater > soil > host saline water > plant saline water > host > plant. Collectively, concerning the unknown families (7 families containing 11 species), the descending order of environments according to species prevalence and maximum abundance was saline water > plant saline water > soil > host saline water = host.
As far as the known genera are concerned, the most prevalent environments, as well as the maximum abundance environments with the greatest number of species, were found to be different for each genus (Figure 6). It is noted that the number of species belonging to the known genera and with abundances of ≥0.1% was 681 (Table 3). For Aquirestis, the descending order of environments according to the number of species detected in each environment was freshwater > saline water > bioreactor > soil = host. For Aureispira, the corresponding descending order was saline water > host saline water > soil = freshwater. For Brachybacter, the descending order according to species maximum abundance environment was bioreactor > saline water > soil > host saline water > freshwater = host = plant saline water, while the corresponding descending order according to species prevalence was almost the same, i.e., bioreactor > saline water > soil > host saline water > freshwater = host. For Defluviibacterium, the respective order was bioreactor > freshwater. For Membranihabitans, the descending order of environments was host saline water > soil > saline water = bioreactor > host. For Epiflobacter, the order was freshwater > saline water = bioreactor, whilst for Opimibacter, the order was soil > saline water > bioreactor > host saline water > freshwater > host. Finally, for Saprospiraceae, the descending order of environments for Vicinibacter species according to their prevalence was bioreactor > saline water > plant saline water > soil > freshwater, whilst for maximum abundance environments, the corresponding order was almost the same, i.e., bioreactor > saline water > plant saline water > soil.
Regarding the family of Lewinellaceae, for Lewinella species, the descending order of environments according to their maximum abundance was saline water > freshwater = host saline water > plant saline water > bioreactor > air, while according to their prevalence, the order was saline water > plant saline water > freshwater > host saline water > bioreactor > air. For Neolewinella, the descending order of environments regarding species prevalence was saline water > plant saline water > freshwater > host saline water > bioreactor, while for maximum abundance environments, the corresponding order was saline water > plant saline water > host saline water > freshwater > bioreactor > air. Similarly, for Flavilitoribacter, the orders of environments were saline water > host saline water > bioreactor = freshwater = soil > host, and saline water > host saline water > bioreactor = freshwater > soil > host. Regarding the family of Haliscomenobacteraceae, for Haliscomenobacter species, the descending order of environments according to their prevalence was bioreactor = freshwater > saline water > host saline water = soil = air, while according to their maximum abundance, the order was bioreactor > freshwater = saline water > host saline water = soil = air. For Phaeodactylibacter, the corresponding descending order of environments was saline water > host saline water > freshwater > bioreactor > soil = host. Finally, for Portibacter, the order of environments was saline water > host saline water > plant saline water > freshwater.
As far as the unknown genera are concerned, regardless of their classification in terms of families, saline water was the environment where most species of these genera were present and had their maximum abundance (Figure 7). It is noted that, for species abundance of ≥0.1%, the total number of unknown genera belonging to the three known families, as well as to the unknown families, was 462; the species belonging to them were 884 (Tables 2 and 3). To be more specific regarding their environmental distribution, 53.24% of the species belonging to these unknown genera were found in saline water, and 51.33% of the species had their maximum abundance in samples derived from saline water. On the other hand, the remaining environments appeared in smaller percentages, i.e., freshwater, 12.27% and 12.17%; bioreactor, 12.73% and 13.33%; saline water hosts, 11.11% and 12.05%; soil, 6.02% and 5.91%; saline water plants, 3.47% and 3.71%; air, 0.58%; and terrestrial hosts, 0.58% and 0.70%, respectively. Furthermore, 0.23% of species had their maximum abundance in samples derived from plants.
Cosmopolitan species, i.e., species that appeared at least in 50 samples, constituted almost 16% of species with an abundance of ≥0.1%, i.e., 244 species out of 1565 species. Most of the cosmopolitan species showed maximum abundances between 0.1% and 1%, while fewer cosmopolitan species showed maximum abundances between 1% and 5%, and even fewer species showed maximum abundances above 5% (Figure 8). Specifically, eight species showed maximum abundances above 5%, four of which had abundances above 10%. As mentioned previously, saline water was both the most prevalent environment and the environment where the majority of species had their maximum abundance, followed by environments related to bioreactors and freshwater. Further analysing the cosmopolitan species, almost all species belonged to known families, except one species that belonged to an unknown family. Furthermore, 97 species belonged to Saprospiraceae, 32 of which belonged to unknown genera; 85 species belonged to Lewinellaceae, 31 of which belonged to unknown genera; and 61 species belonged to Haliscomenobacteraceae, 32 of which belonged to unknown genera. As far as the unknown genera are concerned, regardless of their classification in terms of families, saline water was the environment where most species of these genera were present and had their maximum abundance (Figure 7). It is noted that, for species abundance of ≥0.1%, the total number of unknown genera belonging to the three known families, as well as to the unknown families, was 462; the species belonging to them were 884 (Tables 2 and 3). To be more specific regarding their environmental distribution, 53.24% of the species belonging to these unknown genera were found in saline water, and 51.33% of the species had their maximum abundance in samples derived from saline water. 
On the other hand, the remaining environments appeared in smaller percentages, i.e., freshwater, 12.27% and 12.17%; bioreactor, 12.73% and 13.33%; saline water hosts, 11.11% and 12.05%; soil, 6.02% and 5.91%; saline water plants, 3.47% and 3.71%; air, 0.58%; and terrestrial hosts, 0.58% and 0.70%, respectively. Furthermore, 0.23% of species had their maximum abundance in samples derived from plants. Figure 6. Environmental preferences by known genera with species abundance ≥0.1%. Pr corresponds to prevalence (dominant environment), while Ab corresponds to maximum abundance environments. Each bar represents the number of species found in each corresponding environment.
Cosmopolitan species, i.e., species that appeared in at least 50 samples, constituted almost 16% of species with an abundance of ≥0.1%, i.e., 244 out of 1565 species. Most of the cosmopolitan species showed maximum abundances between 0.1% and 1%, while fewer showed maximum abundances between 1% and 5%, and even fewer showed maximum abundances above 5% (Figure 8). Specifically, eight species showed maximum abundances above 5%, four of which had abundances above 10%. As mentioned previously, saline water was both the most prevalent environment and the environment where the majority of species had their maximum abundance, followed by environments related to bioreactors and freshwater. Further analysing the cosmopolitan species, almost all species belonged to known families, except one species that belonged to an unknown family. Furthermore, 97 species belonged to Saprospiraceae, 32 of which belonged to unknown genera; 85 species belonged to Lewinellaceae, 31 of which belonged to unknown genera; and 61 species belonged to Haliscomenobacteraceae, 32 of which belonged to unknown genera.
The known genera that correspond to these cosmopolitan species showed environmental preferences, unlike unknown genera, which had species detected in a variety of environments (Figure 8). Specifically, Aureispira showed an environmental preference for saline water; Aquirestis, for freshwater; Opimibacter, for soil; Brachybacter and Vicinibacter, for bioreactor-related environments; and Membranihabitans, for terrestrial hosts. Defluviibacterium had few cosmopolitan species, and almost all of them were detected in bioreactors and wastewater treatment plants. In addition, Epiflobacter was detected solely in freshwater. Parvibacillus was equally detected in saline water and bioreactor-related environments. On the other hand, unknown genera belonging to Saprospiraceae could be found in saline water, soil, freshwater, hosts, bioreactors, and wastewater treatment plants. Regarding the Lewinellaceae family, Neolewinella, Lewinella, and Flavilitoribacter showed environmental specificity in saline water, while unknown genera could be found in saline water, freshwater, bioreactors, and wastewater treatment plants. Regarding Haliscomenobacteraceae, both Portibacter and Phaeodactylibacter showed an environmental preference for saline water, while Haliscomenobacter was found in freshwater, bioreactors, and wastewater treatment plants. Haliscomenobacteraceae unknown genera could be detected in saline water, freshwater, bioreactors, and wastewater treatment plants. Finally, the one species that belonged to the unknown family was found in freshwater.
Twenty-eight known species were represented by twenty-four sOTU clusters (Table 5). Among these known species, the number of positive samples with a species abundance of ≥0.1% ranged from 1 to 282; in addition, their maximum abundances ranged from 0.11% to 4.84%. Saline water was the most prevalent environment, as well as the environment where most of these species had their maximum abundance (Table 5 and Figure 9). In general, Brachybacter algidus, Defluviibacterium haderslevense, Parvibacillus calidus, and Vicinibacter affinis/proximus, all of which belong to the family Saprospiraceae, were found in samples that originated from wastewater treatment plants and bioreactors. Aureispira marina, Aureispira maritima, and Membranihabitans marinus were found in saline-water-related environments, while Aquirestis calciphila was found exclusively in freshwater samples. Species belonging to the genus Epiflobacter were found in freshwater and saline water hosts.
In contrast, Opimibacter skivensis was found in all detected environments, i.e., soil, freshwater, bioreactors, plants, and saline water. Lewinella cohaerens and Flavilitoribacter nigricans appeared in a variety of unrelated environments, unlike the rest of the Lewinellaceae species, which were found only in saline-water-related environments. In particular, both Lewinella cohaerens and Flavilitoribacter nigricans were found in samples related to bioreactors and activated sludge, freshwater, and saline water. Both species were more prevalent in samples originating from saline water and had their maximum abundance in samples from a nitrifying bioreactor and activated sludge, respectively. On the other hand, all known Haliscomenobacteraceae species appeared either in saline water (Portibacter lacus/marinus, Haliscomenobacter hydrossis, and Phaeodactylibacter xiamenensis) or in freshwater (Phaeodactylibacter luteus). Opimibacter iunctus, Rubidimonas crustatorum, Saprospira grandis, Haliscomenobacter calcifugiens, and Neolewinella lacunae were not detected above a 0.1% abundance in any sample and therefore lack an environmental distribution.
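The per-species summaries reported in this section (number of positive samples, dominant environment, and maximum-abundance environment) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Observation` record, the function name `summarise_species`, and the choice to count every sample in which a species was detected towards the 50-sample cosmopolitan cut-off are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass

ABUNDANCE_CUTOFF = 0.1          # percent; the >=0.1% threshold used in the text
COSMOPOLITAN_MIN_SAMPLES = 50   # "appeared in at least 50 samples"


@dataclass
class Observation:
    """One detection of a species in one sample (hypothetical record layout)."""
    sample_id: str
    environment: str
    abundance: float  # relative abundance in percent


def summarise_species(observations):
    """Summarise one species from the samples in which it was detected."""
    if not observations:
        raise ValueError("a species needs at least one observation")

    # Prevalence (Pr): the environment contributing the most samples.
    env_counts = Counter(o.environment for o in observations)
    prevalence_env, _ = env_counts.most_common(1)[0]

    # Maximum abundance (Ab): the environment of the single sample
    # in which the species reached its highest relative abundance.
    top = max(observations, key=lambda o: o.abundance)

    # Positive samples: those at or above the abundance cut-off.
    n_positive = sum(1 for o in observations if o.abundance >= ABUNDANCE_CUTOFF)

    return {
        "n_samples": len(observations),
        "n_positive": n_positive,
        "prevalence_env": prevalence_env,
        "max_abundance": top.abundance,
        "max_abundance_env": top.environment,
        # Assumption: all detections count towards the 50-sample cut-off,
        # and the species must reach the abundance cut-off at least once.
        "cosmopolitan": (
            len(observations) >= COSMOPOLITAN_MIN_SAMPLES
            and top.abundance >= ABUNDANCE_CUTOFF
        ),
    }
```

Keeping prevalence and maximum abundance as separate fields mirrors the distinction drawn in the text, where a species can be most prevalent in one environment (e.g., saline water) yet reach its maximum abundance in another (e.g., a bioreactor).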
Figure 8. Taxonomic tree representing the environmental distribution of species appearing in at least 50 samples each. All presented species had abundances ≥0.1%, i.e., in this diagram, 249 species out of 1565 species are presented. The diagram was created using GraPhlAn [48]. Species are presented according to the family and the genus they belong to. The inner circle of environments corresponds to species prevalence (Pr), while the rest of the circles correspond to species abundance and to the respective environment, i.e., the second circle corresponds to species abundances <1%, the third circle corresponds to abundances <5%, and the last circle corresponds to abundances <10%.
The prevalence of known species was not uniform across the environments, indicating varying levels of niche specificity. For example, Aquirestis calciphila and Vicinibacter affinis/proximus each appeared in five samples, all of which were derived from freshwater and from bioreactor and wastewater treatment processes, respectively. Aureispira marina and Aureispira maritima were found in six and seven samples, respectively, originating from saline-water-related environments, specifically saline water and saline water flora and fauna. Epiflobacter species were found in six samples, the majority of which originated from freshwater-related environments. In addition, Membranihabitans marinus appeared in fifteen samples derived from saline water hosts. Further, Opimibacter skivensis was detected mainly in soil (211 out of 282 samples); it was also found in bioreactors, plants, freshwater, and saline water. Also, Neolewinella xylanilytica, Neolewinella aquimaris, Neolewinella marina/litorea, and Neolewinella lutea appeared in seven to nine samples, all of which were derived from saline-water-related environments. Lewinella cohaerens and Flavilitoribacter nigricans were found in ten and eight samples, respectively, the majority of which were derived from saline water.
Table 5. Known species assignment to sOTUs, with BLAST similarity above 98%. Each known species was assigned the ecological parameters of the respective sOTU. Symbol #, when present, stands for "The absolute number of" the corresponding entity.
Figure 9. Environmental distribution (%) of known species. The number at the end of each bar is the number of samples in which the species was found at an abundance of ≥0.1%. Opimibacter iunctus, Rubidimonas crustatorum, Saprospira grandis, Haliscomenobacter calcifugiens, and Neolewinella lacunae were not found in any samples; therefore, they lack environmental distribution.

Opimibacter skivensis was also the most prevalent species in this study, as it appeared in 3066 samples, 282 of which had abundances of ≥0.1% (Table 6). Moreover, almost 75% of these samples were derived from soil. Opimibacter skivensis appeared to be low in abundance, as its maximum abundance was 1.53%. Flavilitoribacter nigricans and Neolewinella aquimaris were also among the top 20 most prevalent species, appearing in 315 samples (8 samples with ≥0.1% species abundance) and 256 samples (9 samples with ≥0.1% species abundance), respectively. The dominant environment for both species was saline water. However, Flavilitoribacter nigricans was found to be most abundant in bioreactor processes, while Neolewinella aquimaris was most abundant in saline water. Like Opimibacter skivensis, both Flavilitoribacter nigricans and Neolewinella aquimaris had low maximum abundances, specifically 0.22% and 0.33%, respectively. Overall, for the top 20 most prevalent species, both the most prevalent environment and the environment where most species had their maximum abundance was saline water (Table 6).
Unlike the top 20 most prevalent species, all of the top 20 most abundant species remained unidentified (Table 7). Species abundance ranged from 5.89% to 28.02%. In addition, the most prevalent environments were saline water and bioreactor processes, while the environment where most species had their maximum abundance was bioreactor processes. To sum up, saline water may be the environment with the highest prevalence, but bioreactors and wastewater-treatment-related processes showed the highest abundances (Tables 6 and 7). Finally, Lewinellaceae is the family that contained the majority of the most prevalent species (Table 6), while Saprospiraceae is the family that contained the majority of the most abundant species (Table 7).
Table 6. The 20 most prevalent species accompanied by their respective information regarding the environment of prevalence, the maximum abundance environment and the corresponding maximum abundance, and the number of samples identified. The sOTUs with BLAST similarity above 98% were assigned to known species. Symbol #, when present, stands for "The absolute number of" the corresponding entity.
Table 7. The 20 most abundant species accompanied by their respective information regarding the environment of prevalence, the maximum abundance environment and the corresponding maximum abundance, and the number of samples identified. The sOTUs with BLAST similarity above 98% were assigned to known species. Symbol #, when present, stands for "The absolute number of" the corresponding entity.
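The two rankings behind Tables 6 and 7 sort the same species by different keys, which is why they can produce disjoint lists. The following Python sketch illustrates this under stated assumptions: the `SpeciesSummary` record and `top_n` helper are hypothetical names, the first three entries use the sample counts and maximum abundances quoted in the text, and the fourth entry is a made-up placeholder for an unidentified species (only its 28.02% maximum abundance comes from the text; its sample count is invented).

```python
from typing import Callable, List, NamedTuple


class SpeciesSummary(NamedTuple):
    """Per-species summary (hypothetical record layout)."""
    name: str
    n_samples: int        # number of samples in which the species was detected
    max_abundance: float  # maximum relative abundance in percent


def top_n(species: List[SpeciesSummary],
          key: Callable[[SpeciesSummary], float],
          n: int = 20) -> List[SpeciesSummary]:
    """Return the n species with the largest value of `key`."""
    return sorted(species, key=key, reverse=True)[:n]


species = [
    SpeciesSummary("Opimibacter skivensis", 3066, 1.53),
    SpeciesSummary("Flavilitoribacter nigricans", 315, 0.22),
    SpeciesSummary("Neolewinella aquimaris", 256, 0.33),
    # Placeholder entry: the sample count of 120 is invented for illustration.
    SpeciesSummary("hypothetical unidentified sOTU", 120, 28.02),
]

# Ranking by prevalence (as in Table 6) versus by maximum abundance (as in Table 7).
most_prevalent = top_n(species, key=lambda s: s.n_samples, n=2)
most_abundant = top_n(species, key=lambda s: s.max_abundance, n=2)
```

With these inputs, the prevalence ranking is led by a widely detected but low-abundance species, while the abundance ranking is led by the placeholder species that is detected in fewer samples.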

Discussion
Saprospirales can indeed be found in various environments, such as saline water, freshwater, soil, bioreactors, and wastewater treatment plants, as has already been mentioned in various studies. In addition, air, plants, and terrestrial and saline water hosts are also potential environments for Saprospirales. In this work, we show that saline water was the environment in which Saprospirales were most prevalent, whereas bioreactors and wastewater treatment plants were the environments in which they were found in the highest abundance. Furthermore, Lewinellaceae is the family that contains the majority of the most prevalent species, while Saprospiraceae is the family that contains the majority of the most abundant species.
Regarding cosmopolitan species, i.e., species that appeared in at least 50 samples, saline water was the environment with the highest prevalence and the one where the majority of species had their maximum abundance, followed by bioreactor-related environments and freshwater. In general, almost all known genera that belong to Saprospiraceae, Lewinellaceae, and Haliscomenobacteraceae showed environmental preferences that corresponded with already published studies. For instance, Aquirestis showed an environmental preference for freshwater [12]; Aureispira showed an environmental preference for saline water [13,14]; and Brachybacter, Defluviibacterium, and Vicinibacter showed environmental specificity in bioreactors and wastewater treatment plants [26]. Furthermore, Neolewinella, Lewinella, Flavilitoribacter, Portibacter, and Phaeodactylibacter showed environmental specificity in saline water [3,7–11,15,17,22,43]. Haliscomenobacter appeared in saline water, freshwater, bioreactors, and wastewater treatment plants, which agrees with previous studies that associate this genus with wastewater treatment plants and freshwater environments [12,24,25]. On the other hand, Membranihabitans showed environmental specificity in terrestrial hosts, which contrasts with the findings of previous studies: Li et al. [16] and Béziat et al. [49] found Membranihabitans in saline water environments. Opimibacter showed environmental specificity in soil. Parvibacillus appeared equally in saline water and activated sludge, whereas Kondrotaite et al. [26] related this genus to wastewater treatment plants. Finally, Epiflobacter appeared in freshwater, whereas Xia et al. [23] associated this genus with activated sludge.
Twenty-eight known species were represented by twenty-four sOTU clusters (≥0.1% species abundance). Therefore, not all described species could be detected in this integrated analysis as abundant constituents of microbial communities across the more than 200 K samples tested. If this observation were extended to the 1565 species-level clusters that were detected in abundances of ≥0.1%, we could safely and conservatively estimate that the global Saprospirales diversity is at least 2000 species. The niche assignment of known species agrees to a high extent with that of already published studies. Specifically, concerning Saprospiraceae, Aquirestis calciphila was found in freshwater samples [12], and Aureispira marina and Aureispira maritima were found in saline-water-related environments [13,14]. In particular, Aureispira maritima was found to be most prevalent in saline water hosts and most abundant in saline-water-derived samples, while Aureispira marina had both its prevalence and maximum abundance in saline water. Brachybacter algidus, Defluviibacterium haderslevense, Parvibacillus calidus, and Vicinibacter affinis/proximus were found in samples that originated from wastewater treatment plants and bioreactors, as reported by Kondrotaite et al. [26]. Membranihabitans marinus was found in saline water hosts, in alignment with the studies of Li et al. [16] and Béziat et al. [49]. On the other hand, Opimibacter skivensis was found in a variety of environments, especially in soil samples, whereas Kondrotaite et al. [26] reported that this species should be found in wastewater treatment plants. Opimibacter skivensis was found in 282 samples with an abundance of ≥0.1%, 211 of which were derived from soils, while 18 were derived from bioreactor-related and wastewater-treatment-related samples. It was also found in plants, hosts, freshwater, and saline water.
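The source does not spell out how the lower bound of 2000 species given above was derived. One possible reading, sketched below in Python, scales the 1565 abundant species-level clusters by the ratio of described species (31, as stated in the abstract) to the sOTU clusters that represented known species at ≥0.1% abundance (24). This ratio, the function name, and the constant names are assumptions made for illustration, not the authors' stated calculation.

```python
# Figures taken from the text; how they are combined is an assumption.
DESCRIBED_SPECIES = 31       # officially described Saprospirales species (abstract)
CLUSTERS_FOR_KNOWN = 24      # sOTU clusters representing known species at >=0.1%
ABUNDANT_CLUSTERS = 1565     # species-level clusters detected at >=0.1%


def extrapolate_species(clusters: int, described: int, clusters_for_described: int) -> float:
    """Scale a cluster count by the described-species-per-cluster ratio.

    Assumes that the share of described species recovered as abundant
    clusters is representative of the whole order.
    """
    return clusters * described / clusters_for_described


estimate = extrapolate_species(ABUNDANT_CLUSTERS, DESCRIBED_SPECIES, CLUSTERS_FOR_KNOWN)
```

Under this assumption the estimate lands slightly above 2000, which is consistent with the conservative lower bound quoted in the text; a different choice of ratio would give a different figure.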
Regarding Haliscomenobacteraceae, both Phaeodactylibacter xiamenensis and Portibacter lacus were found in saline water samples, as already published studies indicate [17,22]. In contrast, we detected Phaeodactylibacter luteus in freshwater, whereas Lei et al. [15] isolated it from the saline water alga Picochlorum sp., and Ma et al. [29] found it in wastewater-related environments.
Haliscomenobacter hydrossis was detected in saline water environments, although it has been reported that this species is usually found in activated sludge [24,25]. Finally, regarding Lewinellaceae, all species appeared to be most prevalent and most abundant in saline water environments, except Lewinella cohaerens and Flavilitoribacter nigricans. These had maximum abundances in samples from a nitrifying bioreactor and activated sludge, respectively, despite being more prevalent in samples originating from saline water. However, Khan et al. [43] associated both species with marine environments. Unlike the rest of the Lewinellaceae species, both Lewinella cohaerens and Flavilitoribacter nigricans appeared in a variety of unrelated environments, i.e., in samples related to bioreactors and activated sludge, freshwater, and saline water. Regarding the remaining Lewinellaceae species, our results conformed with those of previous studies. Neolewinella persica/agarilytica, Neolewinella antarctica, Neolewinella aquimaris, Neolewinella aurantiaca, Neolewinella lutea, Neolewinella marina/litorea, Neolewinella maritima, and Neolewinella xylanilytica were found in saline water samples and saline-water-related environments, such as saline water sediments, saline water flora and fauna, and beach sand [3,7–11,43]. Concluding the analysis of known species, the number of positive samples (with species abundance of ≥0.1%) ranged from 1 to 282; in addition, their maximum abundance ranged from 0.11% to 4.84% (Table 5).
Although it cannot be determined from incidence data such as those in our analysis, the extensive diversity observed in aquatic and especially saline environments could be attributed to dissolved organic carbon (DOC). The amount and complexity of DOC in aquatic systems, if mirrored by diverse metabolic pathways distributed across various microorganisms, could explain some of the remarkable diversity of Saprospirales revealed by our analysis. Ecological factors, such as niche connectivity and increased dispersal through oceanic currents, as well as the role of oceans as the principal terminal reservoir of rainwater, may also play a role. Saprospirales species collected through precipitation across the land are eventually pooled in coastal seas. Currents can then contribute to the dispersal of those species across the connected oceanic bodies. Nevertheless, further investigation is needed to elucidate the observed high diversity and prevalence of Saprospirales in aquatic systems.

Conclusions
The introduction of next-generation sequencing allowed for high-throughput microbial profiling of environments of interest, represented, to a large extent, by the targeted sequencing of 16S rRNA gene amplicons. The relative cost efficiency and methodological simplicity led to a geometric increase in studies covering a wide range of environments around the globe. Integrating that wealth of information and reutilizing it as a unified resource for new questions is not only a very powerful approach but is also energy- and cost-effective. It is our duty as a research community to show that we can achieve the most out of the datasets we spend millions to create.
We have applied a similar integrating procedure in the past, although more demanding in execution, to the bacterial phylum Chlamydiota [50]. Over the years, the available datasets have grown, and novel databases and tools have been introduced that enable a much more streamlined querying of the global sequencing repositories. In this study, we showed that applying the tool TIC [47] to the half a million amplicon datasets in IMNGS [45] can readily give us insights into the diversity and ecological distribution of selected taxonomic groups like the order Saprospirales.
Future efforts should focus on the targeted isolation of those novel members for their functional characterization and on the elucidation of the ecological reasons behind that high diversity. In addition, further automation of the pipeline used will allow further insights into other important microbial taxa whose study is still hindered by the limitations of isolation-based characterization.