# ‘SRS’ R Package and ‘q2-srs’ QIIME 2 Plugin: Normalization of Microbiome Data Using Scaling with Ranked Subsampling (SRS)


## Abstract


## 1. Introduction

## 2. Theory

C_{min}, and the structure of the normalized matrix approximates the structure of the original matrix. The criteria for the approximation may differ, but a key principle is that the relative frequencies of counts in the normalized matrix are as close as possible to the relative frequencies of counts in the original matrix. A relative frequency is obtained by dividing the count for a particular species in a particular sample by the sum of counts for all species in that sample. Different implementations of the criterion of matching relative frequencies are conceivable. The simplest option is to construct a normalized matrix that minimizes the sum of absolute values of pairwise differences between the relative frequencies. This approach, however, ignores the effect of sampling error on the accuracy of relative frequencies. To a first approximation, the coefficient of variation of a count is proportional to the inverse of the square root of the count. Therefore, frequencies may be weighted by the inverse square root of counts. Depending on the purpose of the study, for instance, regarding the importance of rare species, other weightings may be more appropriate.
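The unweighted criterion described above can be illustrated with a minimal Python sketch. It is not taken from the ‘SRS’ package; the function names and the toy counts are hypothetical. `approx_cv` assumes a proportionality constant of 1 (as for Poisson-distributed counts), which the text does not specify.

```python
from fractions import Fraction
from math import sqrt


def relative_frequencies(counts):
    """Divide each count by the sum of counts in the sample (exact arithmetic)."""
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


def abs_deviation(original, normalized):
    """Sum of absolute differences between the two sets of relative frequencies."""
    f_orig = relative_frequencies(original)
    f_norm = relative_frequencies(normalized)
    return sum(abs(a - b) for a, b in zip(f_orig, f_norm))


def approx_cv(count):
    """First-order coefficient of variation of a count, assuming constant 1."""
    return 1 / sqrt(count)


# Hypothetical sample with 200 counts and two candidate normalizations to 100.
original = [60, 30, 10, 100]
candidate_a = [30, 15, 5, 50]   # preserves every relative frequency exactly
candidate_b = [31, 14, 5, 50]   # shifts one count between two species

dev_a = abs_deviation(original, candidate_a)
dev_b = abs_deviation(original, candidate_b)
print(dev_a, dev_b, approx_cv(100))
```

Under this criterion, `candidate_a` would be preferred because its deviation is zero. The sketch only shows that rare species have a larger relative sampling error (`approx_cv(4)` is larger than `approx_cv(100)`); how that error enters a weighted criterion is left open, as in the text.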

C_{min} = 100, the same normalized counts are obtained for all combinations of optimization parameters:

C_{min}. Exhaustive enumeration of this kind is not feasible for real-world data. In 2014, Cont and Heidari suggested an algorithm that solves this optimization problem with a complexity of O(n log n), n being the number of species, but their preprint has not yet been peer reviewed [6]. The SRS algorithm [5], which has a complexity of O(n), generated the same results in this example.
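To give a sense of why exhaustive enumeration does not scale, the following Python sketch enumerates every integer vector that sums to C_{min} for a tiny, hypothetical sample and picks the one minimizing the unweighted sum of absolute differences in relative frequencies. The function names and the input are assumptions made for illustration, and the criterion is only one of the options the text mentions.

```python
from fractions import Fraction
from math import comb


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def n_candidates(c_min, n_species):
    """Number of candidate vectors (stars-and-bars count)."""
    return comb(c_min + n_species - 1, n_species - 1)


def best_by_enumeration(counts, c_min):
    """Brute-force minimizer of the summed absolute frequency differences."""
    total = sum(counts)
    target = [Fraction(c, total) for c in counts]

    def deviation(candidate):
        return sum(abs(Fraction(x, c_min) - t) for x, t in zip(candidate, target))

    return min(compositions(c_min, len(counts)), key=deviation)


counts = [7, 5, 3, 1]                      # hypothetical sample, 16 counts
best = best_by_enumeration(counts, 10)     # normalize to C_min = 10
print(best, n_candidates(10, 4))
```

Even this toy case has 286 candidates; `n_candidates(10000, 1000)` gives the corresponding number for a sample of more realistic size, which is far beyond what can be enumerated.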

## 3. Method

### 3.1. Principle of SRS

- Scaling: feature counts (such as OTUs (operational taxonomic units), ASVs (amplicon sequence variants), or clades) are scaled sample-wise so that the sum of the scaled counts (C_{scaled}) for each sample is equal to the desired number of counts (C_{min}).
- Ranked subsampling: the scaling step produces fractional values that must be converted into counts (integers). To do this, the C_{scaled} for each feature is split into the floor (C_{int}) and fractional part (C_{frac}) of C_{scaled}. Because C_{min} = ΣC_{scaled} = ΣC_{int} + ΣC_{frac}, it follows that C_{min} ≥ ΣC_{int}. Therefore, ΔC C_{frac} values (where ΔC = C_{min} − ΣC_{int}) must be converted into additional counts (integers) so that C_{min} can be reached. To do so, C_{frac} values are ranked. Next, from the highest to the lowest rank, a count for each feature is added until ΔC counts have been added. After this step, all samples will have been normalized to C_{min} counts.
- Special cases: (i) when C_{frac} values involved in picking ΔC counts share the same rank across features, the counts are added for features based on the respective C_{int} ranks; (ii) when both C_{frac} and its respective C_{int} values involved in picking ΔC counts share the same ranks across features, the counts are assigned randomly (without replacement). The specification of the seed that initializes the random process enables reproducible results.
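The three steps above can be sketched for a single sample in Python. This is a minimal illustration, not the code of the ‘SRS’ R package or the ‘q2-srs’ plugin: the function name `srs_normalize` is hypothetical, the sketch assumes that the feature with the higher C_{int} wins a tie in C_{frac}, and Python's random number generator will not reproduce the draws of R's set.seed.

```python
from fractions import Fraction
from itertools import groupby
from math import floor
import random


def srs_normalize(counts, c_min, seed=1):
    """Normalize one sample to c_min counts by scaling with ranked subsampling."""
    total = sum(counts)
    if total < c_min:
        raise ValueError("sample has fewer than c_min counts")

    # Scaling: exact fractions avoid floating-point ties that are not real ties.
    scaled = [Fraction(c * c_min, total) for c in counts]
    c_int = [floor(s) for s in scaled]
    c_frac = [s - i for s, i in zip(scaled, c_int)]
    delta = c_min - sum(c_int)          # counts still to be distributed

    def rank_key(i):
        # Rank by C_frac first, then by C_int (assumed: higher C_int first).
        return (c_frac[i], c_int[i])

    order = sorted(range(len(counts)), key=rank_key, reverse=True)
    result = list(c_int)
    rng = random.Random(seed)

    # Walk through groups of features that share both ranks.
    for _, group in groupby(order, key=rank_key):
        if delta == 0:
            break
        tied = list(group)
        # Only a group straddling the cut-off needs random selection.
        chosen = tied if len(tied) <= delta else rng.sample(tied, delta)
        for i in chosen:
            result[i] += 1
        delta -= len(chosen)
    return result


print(srs_normalize([7, 5, 3, 1], 10))   # no ties involved
print(srs_normalize([3, 1], 2))          # tie in C_frac, resolved by C_int
print(srs_normalize([1, 1], 1, seed=1))  # full tie, resolved randomly
```

In the first call, the scaled counts are 4.375, 3.125, 1.875, and 0.625, so ΔC = 2 and the two features with the largest fractional parts each receive one extra count. The last call depends on the seed but returns the same result whenever the same seed is used.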

### 3.2. ‘SRS’ R Package

#### 3.2.1. SRS-Function

C_{min}, set_seed = TRUE, seed = 1)

C_{min} is the number of counts to which all samples will be normalized (C_{min}), set_seed enables the use of the set.seed-function, and seed specifies the seed used by set.seed to initialize the random process.

#### 3.2.2. SRScurve-Function

max.sample.size = 0, rarefy.comparison = FALSE,
rarefy.repeats = 10, rarefy.comparison.legend = FALSE,
xlab = "sample size", ylab = "richness", label = FALSE,
col, lty, ...)

#### 3.2.3. SRS.shiny.app-Function

C_{min}. The app utilizes the SRScurve-function and enables the selection of four diversity metrics (see metric in SRScurve) that will be returned at different C_{min}. The selection of C_{min} is interactive through a slider or an interconnected numeric text field. In response to the selected C_{min}, the app returns

- a rug plot that shows the distribution of the number of counts per sample and displays discarded samples as well as summary statistics (including a list of discarded samples and descriptive statistics of the global feature richness and selected alpha diversity metric of the input dataset) in response to the selected C_{min} (Figure 1A),
- a plot of SRS curves (SRScurve-function) that respond to the selected step size (step) and maximum sample size (max.sample.size) (Figure 1B), and
- an interactive table with sample names and the number of counts per sample as well as the initial diversity (non-normalized), retained diversity (normalized), %retained diversity (normalized), and %discarded diversity (normalized) of the selected alpha diversity metric in response to the selected C_{min} (Figure 1C).

C_{min} of the app is the lowest total number of counts per sample in the input data (no samples are discarded by default), which can be restored within the app using the reset C_{min}-button. The default maximum sample size equals the default setting of C_{min} and can be restored using the reset max. sample size-button. The default step size for SRS curves is 1000. The default setting of the SRS.shiny.app-function (as of version 0.2.2) is:

### 3.3. ‘q2-srs’ QIIME 2 Plugin

## 4. Results and Discussion

C_{frac}. The random step in SRS is rare and negligible for complex microbiome data, as noted previously [5]. This rather minor modification, however, ensures the reproducibility of SRS, which is essential for microbiome analysis [8].

C_{frac}) are chosen: depending on the scaling factor, an ASV with an integer value (C_{int}) of zero may or may not be chosen by ranked subsampling due to its C_{frac}, causing a reproducible zigzag behaviour in the observed number of species. The magnitude of the zigzag in SRS curves depends on the data structure (the balance between rare and abundant ASVs). Despite the zigzag behaviour, the observed ASV richness was frequently higher after SRS than after rarefying (Figure 2B). Therefore, we recommend that users working in the R environment use the SRS Shiny app (SRS.shiny.app-function) prior to SRS to determine C_{min}. QIIME 2 users are also encouraged to utilize .qza files in the SRS Shiny app (https://vitorheidrich.shinyapps.io/srsshinyapp/ (accessed on 1 November 2021)).
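The zigzag behaviour can be reproduced with a small Python sketch. It uses a simplified, fully deterministic variant of ranked subsampling (ties are broken by C_{int} and then by feature order rather than randomly), a hypothetical three-feature sample, and hypothetical function names; it is not the SRScurve-function.

```python
from fractions import Fraction
from math import floor


def srs_deterministic(counts, c_min):
    """Scaling plus ranked subsampling, with ties broken by C_int, then order."""
    total = sum(counts)
    scaled = [Fraction(c * c_min, total) for c in counts]
    c_int = [floor(s) for s in scaled]
    delta = c_min - sum(c_int)
    order = sorted(
        range(len(counts)),
        key=lambda i: (scaled[i] - c_int[i], c_int[i]),
        reverse=True,
    )
    result = list(c_int)
    for i in order[:delta]:             # add one count to the top-ranked features
        result[i] += 1
    return result


def richness(counts):
    """Number of features with at least one count."""
    return sum(1 for c in counts if c > 0)


sample = [49, 49, 2]                    # two abundant features and one rare one
curve = {c_min: richness(srs_deterministic(sample, c_min))
         for c_min in range(16, 22)}
print(curve)
```

For this toy input, the rare feature has C_{int} = 0 throughout; it is retained at C_{min} = 17, 19, and 21, where its C_{frac} is the largest, and dropped at C_{min} = 16, 18, and 20, where the abundant features have larger fractional parts. The richness therefore alternates between 3 and 2 as C_{min} increases.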

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Yatsunenko, T.; Rey, F.E.; Manary, M.J.; Trehan, I.; Dominguez-Bello, M.G.; Contreras, M.; Magris, M.; Hidalgo, G.; Baldassano, R.N.; Anokhin, A.P.; et al. Human Gut Microbiome Viewed across Age and Geography. Nature **2012**, 486, 222–227.
2. Fierer, N. Embracing the Unknown: Disentangling the Complexities of the Soil Microbiome. Nat. Rev. Microbiol. **2017**, 15, 579–590.
3. Orsi, W.D. Ecology and Evolution of Seafloor and Subseafloor Microbial Communities. Nat. Rev. Microbiol. **2018**, 16, 671–683.
4. McMurdie, P.J.; Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput. Biol. **2014**, 10, e1003531.
5. Beule, L.; Karlovsky, P. Improved Normalization of Species Count Data in Ecology by Scaling with Ranked Subsampling (SRS): Application to Microbial Communities. PeerJ **2020**, 8, e9593.
6. Cont, R.; Heidari, M. Optimal Rounding under Integer Constraints. arXiv **2014**, arXiv:1501.00014.
7. Oksanen, J.; Blanchet, F.G.; Friendly, M.; Kindt, R.; Legendre, P.; McGlinn, D.; Minchin, P.R.; O’Hara, R.B.; Simpson, G.L.; Solymos, P.; et al. Vegan: Community Ecology Package. R Package Version 2.5-7. 2020. Available online: https://CRAN.R-project.org/package=vegan (accessed on 1 November 2021).
8. Schloss, P.D. Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research. mBio **2018**, 9, e00525-18.
9. Bolyen, E.; Rideout, J.R.; Dillon, M.R.; Bokulich, N.A.; Abnet, C.C.; Al-Ghalith, G.A.; Alexander, H.; Alm, E.J.; Arumugam, M.; Asnicar, F.; et al. Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2. Nat. Biotechnol. **2019**, 37, 852–857.
10. Yang, J.; Park, J.; Jung, Y.; Chun, J. AMDB: A Database of Animal Gut Microbial Communities with Manually Curated Metadata. Nucleic Acids Res. **2021**, gkab1009.
11. Beule, L.; Arndt, M.; Karlovsky, P. Relative Abundances of Species or Sequence Variants Can Be Misleading: Soil Fungal Communities as an Example. Microorganisms **2021**, 9, 589.
12. Pontiller, B.; Pérez-Martínez, C.; Bunse, C.; Osbeck, C.M.G.; González, J.M.; Lundin, D.; Pinhassi, J. Taxon-Specific Shifts in Bacterial and Archaeal Transcription of Dissolved Organic Matter Cycling Genes in a Stratified Fjord. bioRxiv **2021**.
13. Barreto Filho, M.M.; Walker, M.; Ashworth, M.P.; Morris, J.J. Structure and Long-Term Stability of the Microbiome in Diverse Diatom Cultures. Microbiol. Spectr. **2021**, 9, e00269-21.

**Figure 1.** User interface of the Shiny app for SRS (SRS.shiny.app-function of the ‘SRS’ R package version 0.2.2). (**A**) Rug plot showing the distribution of the number of counts per sample, discarded samples, and summary statistics; (**B**) plot showing SRS curves; (**C**) interactive table with sample names, the number of counts per sample, and summary statistics for the diversity metric.

**Figure 2.** (**A**) SRS curves and (**B**) comparison of SRS curves and repeated rarefying (10 repeats) using the “richness” metric (SRScurve-function of the ‘SRS’ R package version 0.2.2). The vertical black solid line indicates the chosen number of counts (10,000) to which all samples will be normalized (C_{min}).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Heidrich, V.; Karlovsky, P.; Beule, L.
‘SRS’ R Package and ‘q2-srs’ QIIME 2 Plugin: Normalization of Microbiome Data Using Scaling with Ranked Subsampling (SRS). *Appl. Sci.* **2021**, *11*, 11473.
https://doi.org/10.3390/app112311473
