Article

Virus Quasispecies Rarefaction: Subsampling with or without Replacement?

by Josep Gregori 1,*, Marta Ibañez-Lligoña 1,2,3, Sergi Colomer-Castell 1,2,3, Carolina Campos 1,2,3 and Josep Quer 1,2,3,4,*
1 Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
2 Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
3 Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
4 Medicine Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
* Authors to whom correspondence should be addressed.
Viruses 2024, 16(5), 710; https://doi.org/10.3390/v16050710
Submission received: 18 April 2024 / Revised: 26 April 2024 / Accepted: 26 April 2024 / Published: 29 April 2024
(This article belongs to the Section General Virology)

Abstract

In quasispecies diversity studies, the comparison of two samples of varying sizes is a common necessity. However, the sensitivity of certain diversity indices to sample size variations poses a challenge. To address this issue, rarefaction emerges as a crucial tool, serving to normalize and create fairly comparable samples. This study emphasizes the need for sample size normalization in quasispecies diversity studies using next-generation sequencing (NGS) data. We present a thorough examination of resampling schemes using various simple hypothetical cases of quasispecies with different structures, in the sense of haplotype genomic composition, offering a comprehensive understanding of their implications in general cases. Despite the large numbers involved in this sort of study, often with coverages exceeding 100,000 reads per sample and amplicon, the rarefaction process for normalization should be performed with repeated resampling without replacement, especially when rare haplotypes constitute a significant fraction of interest. However, different diversity indices exhibit distinct sensitivities to sample size. Consequently, some diversity indicators may be compared directly without normalization, or may instead be resampled safely with replacement.

1. Introduction

The study of viral quasispecies diversity is similar in many ways to the study of biodiversity in ecology. Nevertheless, the size of a quasispecies is far beyond the size of any known ecosystem; for example, the number of viral particles in a patient chronically infected by the hepatitis C virus (HCV) may outnumber the human population. Indeed, the dynamics in a quasispecies have no comparison with the dynamics in any ecosystem. Because of the low polymerase fidelity characteristic of these viruses, each viral replication cycle generates new variants [1,2,3]. With high viral loads, the number of viral particles generated and eliminated daily may be over $10^{12}$ [4,5].
The study of quasispecies diversity and composition through NGS by amplicons constitutes a powerful approach to this extremely high-diversity world. It nevertheless still falls short, despite the developments since the times when molecular cloning was the only available tool to dig into quasispecies [6,7]. The study by amplicon haplotypes starts with the processing of next-generation sequencing (NGS) data to obtain a set of haplotypes and corresponding frequencies as read counts [8]. The depth of the analysis depends on the library size, that is, the number of sequenced reads per sample and amplicon. Groups of samples typically have different library sizes because of technical variation, and library size normalization is required for fair comparisons, given that some diversity indices are highly sensitive to sample size. Rarefaction is a widely used normalization technique that involves random subsampling of reads from the initial sample library to a common library size. In the field of metagenomics, there is an open debate [9,10,11,12,13] about whether this process should be used at all.
The basis of this debate is that by subsampling, we are increasing the downward bias already existing in our data. Bias exists in the sense that, first, we may only approach the true diversity of a microbiome, and second, the more we limit our library size, the less representative it is of the studied population. Nevertheless, despite these limitations, rarefaction continues to be widely used in practice as a suitable normalization whenever diversity comparisons are required. It helps reduce the impact of an uneven sampling effort by eliminating the differential bias associated with more deeply sequenced samples. In metagenomic studies, the estimation of richness, which is the number of species identified in a sample, is paramount, and alternative approximations to estimate the true value, including unobserved species [14], might be preferred. In quasispecies studies, richness is understood as the number of observed and unobserved haplotypes, which plays a minor role, with other diversity computations being potentially preferable [15].
This report aims at clarifying the use of this essential tool in quasispecies diversity comparisons, the process of rarefaction, by which two samples of different sizes become comparable. In this context, rarefaction only makes sense in the frame of sample comparisons or when studying richness rarefaction curves of single samples. In ecology, rarefaction is defined as repeated resampling without replacement to the reference size. In the context of rarefaction, resampling without replacement implies that during a cycle of random subsampling, the extracted reads are not reintroduced into the initial sample pool. Conversely, resampling with replacement involves returning the sampled reads to the population after each random extraction. From a computational perspective, when dealing with substantial sample sizes, sampling with replacement tends to be faster than sampling without replacement. In a recent work with metagenomic data [11], the authors observed that rarefying libraries with or without replacement had no substantial impact on Shannon entropy. However, libraries rarefied with replacement exhibited a slightly reduced Shannon index compared to those rarefied without replacement across different library sizes. This effect is attributed to the exclusion of rare sequences, which occurred more frequently in sampling with replacement than in sampling without replacement. This suggests that samples dominated by only a few highly abundant sequence variants are comparatively robust to subsampling. Nevertheless, the authors stated that rarefying without replacement should be encouraged because it is more theoretically correct, specifically when representing the data so they account for the limitations of smaller library sizes [11]. We aim to study the impact of both sampling schemes with the large numbers implied in quasispecies analysis, both theoretically and numerically.
Subsampling with replacement is equivalent to sampling from a population where haplotype frequencies remain constant. Using the cumulative distribution derived from these frequencies, extracting an item involves generating a random value from a uniform distribution. This value is then matched against the cumulative distribution to determine the corresponding haplotype ID. In this context, the item sampled with replacement represents a haplotype ID. In contrast, when sampling without replacement, the sampled item is a single read from the original sample. To prevent repetitions in successive random extractions, it becomes necessary to track which reads have been sampled and which have not.
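The two extraction mechanisms just described can be sketched as follows. This is a minimal illustration in Python, not the authors' R implementation (which relies on dqrng); the function names, the toy read counts, and the seeds are assumptions made for the example.

```python
import bisect
import itertools
import random
from collections import Counter


def subsample_with_replacement(counts, size, rng):
    """Draw `size` haplotype IDs from fixed frequencies via the cumulative distribution."""
    total = sum(counts)
    cumulative = list(itertools.accumulate(c / total for c in counts))
    cumulative[-1] = 1.0  # guard against floating-point round-off in the last bin
    draws = Counter()
    for _ in range(size):
        u = rng.random()  # uniform value in [0, 1)
        haplotype_id = bisect.bisect_right(cumulative, u)  # match against the CDF
        draws[haplotype_id] += 1
    return draws


def subsample_without_replacement(counts, size, rng):
    """Draw `size` individual reads; each read can be extracted at most once."""
    # Full-sample-size vector mapping every read to its haplotype ID.
    read_to_haplotype = [h for h, c in enumerate(counts) for _ in range(c)]
    # random.Random.sample selects positions without repetition.
    return Counter(rng.sample(read_to_haplotype, size))


counts = [50, 30, 20]  # hypothetical read counts for three haplotypes
print(subsample_with_replacement(counts, 40, random.Random(1)))
print(subsample_without_replacement(counts, 40, random.Random(1)))
```

With replacement, the sampled item is a haplotype ID and the frequencies never change; without replacement, the sampled item is a read, so no haplotype can be drawn more often than it occurs in the original sample.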

2. Methods

Intensive resampling simulations under each hypothetical case were carried out to support and extend the theoretical considerations for each case and resampling scheme. See Box 1 for definitions. Each described example is represented as a vector of reads, where each element corresponds to a different haplotype, and the total coverage is given by the sum of the vector elements. Based on this representation of quasispecies composition for each case, all the computations were conducted using R [16], with the help of packages knitr [17], tidyverse [18], ggplot2 [19], and dqrng [20]. Resampling without replacement was optimized using the dqsample.int function from the package dqrng [20], with the help of a vector of the full sample size that maps reads to haplotypes. The full R code used is given in Supplementary Materials.
Box 1. Rarefaction-related definitions [15,21,22].

| Concept | Definition |
| --- | --- |
| Rarefaction | A technique used to compensate for different intensities of sampling in diversity studies. |
| Subsampling cycle | The successive random extraction of a given number of items from a sample, lower than the sample size, with or without replacement at each extraction. |
| Subsampling with replacement | This is based on a situation where an element is randomly extracted from a sample, identified, and then immediately replaced. Therefore, this element can be obtained again in further extractions along the same subsampling cycle. |
| Subsampling without replacement | All extractions in a subsampling cycle are performed without replacement, so no item may be extracted multiple times in the same cycle. |
| Downward bias | Inaccuracy in measurement or estimation that underestimates the true value. |
| Subsampling fraction | Fraction of reads being subsampled from a given sample in a single resampling cycle. |
| Granularity | Level of resolution at which the data are processed when estimating frequencies from counts. |

3. Results

In the following sections, we study different specific cases to compose a comprehensive picture of the type of results to be expected with the two sampling schemes according to quasispecies structure. We start by showing the limitations inherent to sampling and subsampling with replacement. The following cases are studied under both subsampling schemes:
  • All singletons: This represents a quasispecies where all haplotypes are represented by a single read. It serves as the simplest case to numerically show the discussed limitations, showcasing the most significant differences between the sampling schemes.
  • Single dominant case: This hypothetical scenario involves a dominant haplotype, while all other haplotypes are singletons. Our goal is to evaluate the master frequency and the number of haplotypes.
  • Prominent haplotypes: In this case, there are six prominent haplotypes along with a set of singletons. The objective is to evaluate the frequencies of the prominent haplotypes, the fraction of singletons in the quasispecies, and the fraction of reads for haplotypes with over one read and below the top 6 haplotypes, representing singleton replicates produced in sampling with replacement.
  • No rare haplotypes: This is a quasispecies composed of a master haplotype at 90%, with 10 other haplotypes at 1% each. This scenario excludes singletons and lower frequency haplotypes. We seek to estimate haplotype frequencies by repeated subsampling.
  • Flat quasispecies: Similar to the first case, all the haplotypes have equal frequencies, ranging from 1 read to 10 reads each, representing a perfectly even quasispecies. This case is crucial for demonstrating the robustness in sampling quasispecies data that have undergone a previous abundance filter at a low level.
Finally, we discuss the sensitivity of various diversity indices with respect to sample size, and the corresponding granularity caused by estimating frequencies from the counts.

3.1. Bootstrap: The Theory around 0.632

Given a sample of n items (reads), the probability to randomly extract any given item is 1/n, and the probability to not extract it is 1 − 1/n. In a bootstrap cycle composed of n extractions, each extraction is followed by a replacement, which makes successive extractions independent and with identical probability.
As a bootstrap cycle is composed of $n$ extractions, the probability of not extracting a given item from the sample in a full cycle is $(1 - 1/n)^{n}$; this means that the probability of having a given item sampled in a bootstrap cycle is $1 - (1 - 1/n)^{n}$. As $n$ tends to infinity, the limit of this expression is $1 - 1/e = 0.632$. This result implies that, in the limit as $n$ grows to infinity, a fraction 0.6321 of a bootstrap resample consists of unique realizations of items in the original sample, and the remaining 0.3679 consists of replicates.
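A short numerical check of this limit is sketched below. It evaluates the exact probability for a few sample sizes and compares it with one simulated bootstrap cycle; the function names, the chosen sizes, and the seed are assumptions made for the illustration.

```python
import math
import random


def prob_seen_in_bootstrap(n):
    """Exact probability that a given item appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, rng):
    """Fraction of the n original items seen in one simulated bootstrap cycle."""
    seen = {rng.randrange(n) for _ in range(n)}
    return len(seen) / n


limit = 1.0 - 1.0 / math.e
for n in (10, 100, 10_000):
    print(n, round(prob_seen_in_bootstrap(n), 4))
print("limit", round(limit, 4))
print("simulated", round(simulated_unique_fraction(10_000, random.Random(42)), 4))
```

The exact probability approaches the limit from above, so small samples see a slightly larger fraction of their items than 0.632.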

3.2. Subsampling a Given Fraction with Replacement

In subsampling with replacement to a given fraction $f$ of the sample size, the number of random extractions in each cycle is $f \cdot n$; the probability of having a given item in the subsample is then $1 - (1 - 1/n)^{f n}$, and the limit as $n$ tends to infinity is $1 - (1/e)^{f}$.
To solve the limit, we transform the indetermination $1^{\infty}$ to $\infty \cdot 0$, noting that $f(x) = e^{\ln(f(x))}$:
$$\lim_{n \to \infty} (1 - 1/n)^{f n} = \lim_{n \to \infty} e^{f n \ln(1 - 1/n)}$$
Next, $\infty \cdot 0$ is converted to $0/0$ by double inversion of one term:
$$\lim_{n \to \infty} f \, n \ln(1 - 1/n) = \lim_{n \to \infty} f \, \frac{\ln(1 - 1/n)}{1/n}$$
and finally, by application of L'Hôpital's rule, differentiating the numerator and denominator:
$$\lim_{n \to \infty} f \, \frac{\ln(1 - 1/n)}{1/n} = \lim_{n \to \infty} f \, \frac{\dfrac{1}{1 - 1/n} \cdot \dfrac{1}{n^{2}}}{-\dfrac{1}{n^{2}}} = \lim_{n \to \infty} \frac{-f}{1 - 1/n} = -f$$
So that
$$\lim_{n \to \infty} (1 - 1/n)^{f n} = \lim_{n \to \infty} e^{f n \ln(1 - 1/n)} = e^{-f} = \frac{1}{e^{f}}$$
Given the limit, Table 1 and Figure 1 show the expected fraction of items seen from the original sample in a resampling cycle at various fractions of the original size, from 0.1 to 1. In resampling without replacement, the fraction of seen items would correspond exactly to the fraction of subsampling. As the sample fraction increases, the deviation of seen items with respect to the sampled fraction also increases to reach its maximum at f = 1, corresponding to the pure bootstrap. This holds particular importance when studying richness by subsampling. Due to the replacement, some rare species may be observed with inflated frequencies in a subsampling cycle, while others may not be observed at all. This inflation adversely affects the representation of other rare species in the sampled data, as these will be sampled less.
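The limiting expression $1 - (1/e)^{f}$ can be tabulated directly to see how the deficit grows with the subsampling fraction. The sketch below computes values from the formula only; it is not a reproduction of the authors' table, and the function name is an assumption.

```python
import math


def expected_fraction_seen(f):
    """Limiting fraction of original items seen when subsampling a fraction f with replacement."""
    return 1.0 - math.exp(-f)


for tenth in range(1, 11):
    f = tenth / 10
    seen = expected_fraction_seen(f)
    # Without replacement, the fraction of items seen equals f exactly.
    print(f"f={f:.1f}  with replacement={seen:.4f}  "
          f"without replacement={f:.4f}  deficit={f - seen:.4f}")
```

The deficit column is the difference between the two schemes; it increases monotonically with $f$ and is largest at $f = 1$.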

3.3. The All-Singletons Case

The numerical approach involves creating a quasispecies sample composed of 10,000 unique reads, each representing a distinct haplotype with a single sequence. All of them are rare haplotypes. In this scenario, sampling without replacement will produce a resample with as many haplotypes as the resample size. Resampling with replacement will suffer from the 0.632 effect described above. In repeated subsampling without replacement to a fraction f, the number of haplotypes obtained is equal to the subsample size, as denoted in the ‘True’ column in Table 2.
Table 2 shows the results of B = 500 repeated resampling cycles at increasing fractions from 0.1 to 1, where ‘True’ is the true number of haplotypes in a fraction of the sample. ‘Expected’ is the number of expected haplotypes from the computed probability, ‘Median’ is the median number of haplotypes observed after B cycles of resampling with replacement, ‘Unique’ is the proportion of haplotypes observed as singletons, and ‘Replicated’ is the proportion of reads corresponding to replicated singletons.
A first conclusion is that, in this case, an accurate estimate of richness is obtained only when subsampling without replacement. On the other hand, the deviation from the true value shrinks as the subsampling fraction decreases.
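The all-singletons experiment can be reproduced in miniature as follows. This is an illustrative Python sketch rather than the authors' R code: the number of cycles is reduced to 20 (the text uses B = 500) to keep it fast, and the function names and seed are assumptions.

```python
import random
import statistics


def haplotypes_seen_with_replacement(n_reads, fraction, rng):
    """Distinct haplotypes in one cycle with replacement; every read is its own haplotype."""
    size = round(n_reads * fraction)
    return len({rng.randrange(n_reads) for _ in range(size)})


def haplotypes_seen_without_replacement(n_reads, fraction, rng):
    """Distinct haplotypes in one cycle without replacement."""
    size = round(n_reads * fraction)
    return len(set(rng.sample(range(n_reads), size)))


def median_over_cycles(sampler, n_reads, fraction, cycles, rng):
    """Median number of haplotypes over repeated resampling cycles."""
    return statistics.median(sampler(n_reads, fraction, rng) for _ in range(cycles))


rng = random.Random(2024)
n_reads = 10_000  # all singletons, as in the case described above
for fraction in (0.1, 0.5, 1.0):
    true_count = round(n_reads * fraction)
    expected = n_reads * (1.0 - (1.0 - 1.0 / n_reads) ** true_count)
    median = median_over_cycles(haplotypes_seen_with_replacement, n_reads, fraction, 20, rng)
    print(fraction, true_count, round(expected), median)
```

Each printed row gives the fraction, the true count obtained without replacement, the expected count with replacement, and the simulated median with replacement.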

3.4. The Single-Dominant Case

Let us consider a quasispecies composed of a dominant haplotype at varying frequencies, from 10% to 90%, with the remaining reads attributed to singletons. Our focus is on how the estimates of the number of haplotypes and of the master haplotype frequency behave under repeated subsampling, both with and without replacement. In this example, each quasispecies sample comprises 100,000 reads. Table 3 shows the quasispecies IDs and compositions.
Table 4 and Figure 2 show the results in estimating the number of haplotypes at different subsampling fractions after B resampling cycles with and without replacement, and Table 5 and Figure 3 show the results in estimating master frequencies.
As observed in the case of all singletons, the number of haplotypes is underestimated with respect to the true value when subsampling with replacement, contrary to the estimation by subsampling without replacement, which is very close to the true value.
The estimation of the master haplotype frequency was nearly identical under both sampling schemes (Table 5 and Figure 3), contrary to the estimation of the number of haplotypes, which was severely downward biased when sampling with replacement, except for the lowest fractions of subsampling.
This observation aligns with the data presented in Table 2 in the all-singletons case. An accurate estimation of the master frequency in this scenario implies that the complementary, which is the fraction of reads in the quasispecies for non-dominant haplotypes, is also accurate. In simpler terms, despite the underestimation in the number of haplotypes, the aggregated sum for non-dominant reads remains accurate.
These results can be extended to say that prominent haplotypes will be accurately subsampled under both schemes, but rare haplotypes will be severely underestimated when subsampling with replacement, particularly when the fraction of reads for rare haplotypes in the sample is significant.
Nonetheless, the estimate of the fraction of reads for low-abundance haplotypes may still be accurate.
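The contrast between the two estimates in the single-dominant case can be sketched with a small simulation. The Python code below is an illustration under stated assumptions: a reduced sample of 20,000 reads (the text uses 100,000), a master haplotype at 50%, a subsampling fraction of 0.5, a single cycle per scheme, and hypothetical function names.

```python
import random


def build_reads(total_reads, master_fraction):
    """Read vector: ID 0 is the master haplotype, every other read is a singleton."""
    master_reads = round(total_reads * master_fraction)
    singletons = total_reads - master_reads
    return [0] * master_reads + list(range(1, singletons + 1))


def summarize(sampled):
    """Return (master frequency, number of distinct haplotypes) for a subsample."""
    master_frequency = sampled.count(0) / len(sampled)
    return master_frequency, len(set(sampled))


rng = random.Random(7)
reads = build_reads(20_000, 0.5)
size = len(reads) // 2  # subsampling fraction f = 0.5

without = summarize(rng.sample(reads, size))      # without replacement
with_repl = summarize(rng.choices(reads, k=size)) # with replacement
print("without replacement:", without)
print("with replacement:   ", with_repl)
```

Under these assumptions, the master frequency should be close to 0.5 in both schemes, while the haplotype count is expected to be lower with replacement because some singletons are drawn more than once.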

3.5. Prominent Haplotypes

Let us consider now a quasispecies featuring six prominent haplotypes, each at half frequency of the preceding one, with the remaining haplotypes being singletons (Table 6). In this case, our goal is to estimate the frequencies of the six prominent haplotypes and to also determine the fraction of reads belonging to singleton haplotypes.
To account for singleton replicates, we compute the fraction of reads for haplotypes above 1 read but below the top 6 haplotypes in each subsampling.
Further analysis confirms that subsampling without replacement provides results very close to the true values. In contrast, subsampling with replacement estimates the frequencies of the prominent haplotypes fairly well but underestimates the fraction of singletons in favor of replicated singletons, that is, haplotypes with more than 1 read but below the top 6 haplotypes. However, the sum of the fractions for singletons and replicated singletons, the complement of the top 6, remains well approximated, as in the single-dominant case (Table 7 and Table 8).
In conclusion, the frequencies of prominent haplotypes will be equally estimated under both schemes, but the number of rare haplotypes could be severely underestimated when subsampling with replacement. The aggregation of non-dominants is well approximated when subsampling with replacement. We observed similar results with real hepatitis E virus data [23] sequencing sample replicates at different depths [24].
At a high sequencing depth, the frequencies of prominent haplotypes are highly stable across varying sample sizes and may be compared directly without subsampling. Note that the variance of a proportion p is given by Var(p) = p · (1 − p)/n, where n is the sample size.
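As a worked illustration of the variance formula, the sketch below computes the standard error of an estimated haplotype frequency at several sample sizes. The frequency of 10% and the sample sizes are arbitrary choices for the example, and the function name is an assumption.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion p estimated from n reads: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)


for n in (1_000, 10_000, 100_000):
    se = proportion_standard_error(0.1, n)
    print(f"n={n:>7}  p=0.10  standard error={se:.5f}")
```

Because the standard error scales with $1/\sqrt{n}$, a hundredfold increase in coverage reduces it tenfold, which is why prominent frequencies are stable at high depth.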

3.6. No Rare Haplotypes

Let us review a quasispecies composed of a master haplotype at 90% and 10 other haplotypes at 1% each, without any singletons or rare haplotypes. Our aim is to estimate haplotype frequencies by repeated subsampling. A quasispecies without haplotypes at very low frequencies gives approximately the same results with or without subsampling, and both subsampling schemes retrieve very similar results (Table 9 and Table 10).

3.7. Flat Quasispecies

Let us consider the case of a quasispecies with $n$ haplotypes, all with an equal number of reads, $k$, growing from 1 to 10. As the frequency of each haplotype increases, the haplotypes become less rare. Given the number of reads per haplotype, $k$, the probability of sampling a given haplotype with replacement in a full resampling cycle of $n \cdot k$ random extractions, that is, a bootstrap cycle, is $1 - (1 - k/(n \cdot k))^{n \cdot k} = 1 - (1 - 1/n)^{n \cdot k}$, where $n \cdot k$ is the sample size. In the limit as $n$ goes to infinity, this probability is $1 - (1/e)^{k}$. Table 11 and Figure 4 show the values computed for $n = 1000$ haplotypes, $k$ from 1 to 10 reads each, the computed probability, and the corresponding limits.
In subsampling a fraction f of the full sample size, as described in Table 12, the expected number of haplotypes subsampling without replacement is given by the rarefaction equation, which, when applied to this case, results in Equation (1).
$$E_1[n \mid k, f] = n - n \, \binom{n k - k}{n k f} \Big/ \binom{n k}{n k f} \qquad (1)$$
The expected number of haplotypes when subsampling with replacement is given by Equation (2):
$$E_2[n \mid k, f] = n \left( 1 - (1 - 1/n)^{n k f} \right) \qquad (2)$$
The results of Equation (1) for n = 10,000 haplotypes with frequencies, k = 1, 2, …, 10 reads, and subsample fractions, f = 0.1, 0.2, …, 1 are represented in Figure 5.
The ratio E2[n|k,f]/E1[n|k,f] gives the fraction of haplotypes estimated in subsampling with replacement with respect to those estimated in subsampling without replacement (rarefaction). This ratio gives a representation of the accuracy obtained in subsampling with replacement in this scenario, and is represented in Figure 6, computed for n = 10,000 haplotypes, k = 1, 2, …, 10 reads per haplotype, and f = 0.1, 0.2, …, 1.
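The two expectations and their ratio can be evaluated numerically as sketched below. This is an illustration under assumptions: the binomial coefficients of the rarefaction equation are computed through log-gamma to avoid huge integers, the with-replacement expectation is written as $n$ times the per-haplotype probability of being seen so that both quantities are counts, $n k f$ is rounded to an integer, and the function names and the smaller $n = 1000$ are choices made for the example.

```python
import math


def log_choose(a, b):
    """Natural log of the binomial coefficient C(a, b), for 0 <= b <= a."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def expected_without_replacement(n, k, f):
    """Rarefaction expectation for n haplotypes with k reads each, subsampled to fraction f."""
    total = n * k
    m = round(total * f)
    if m > total - k:
        return float(n)  # C(total - k, m) is zero: every haplotype must be seen
    log_ratio = log_choose(total - k, m) - log_choose(total, m)
    return n - n * math.exp(log_ratio)


def expected_with_replacement(n, k, f):
    """Expected distinct haplotypes in round(n * k * f) draws with replacement."""
    m = round(n * k * f)
    return n * (1.0 - (1.0 - 1.0 / n) ** m)


n = 1000
for k in (1, 2, 5, 10):
    for f in (0.1, 0.5, 1.0):
        e1 = expected_without_replacement(n, k, f)
        e2 = expected_with_replacement(n, k, f)
        print(f"k={k:>2}  f={f:.1f}  E1={e1:8.1f}  E2={e2:8.1f}  ratio={e2 / e1:.4f}")
```

A ratio close to 1 indicates that subsampling with replacement recovers nearly as many haplotypes as rarefaction for that combination of $k$ and $f$.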
This represents an extreme case with all haplotypes at equivalent and very low frequencies, with no prominent or dominant haplotypes. This is a useful example to show the implications of low-level abundance filters in our context. This type of low-level abundance filter may aim at limiting technical and instrumental noise, but it is not always advisable.
When comparing samples with the rarest haplotypes being excluded by setting a minimum abundance threshold, i.e., such as a minimum of 5 reads per haplotype, the outcomes of subsampling under both methods will tend to be similar, particularly in not-so-extreme cases, such as flat-like quasispecies.

4. Discussion

The determination of quasispecies diversity is significantly influenced by sample size, where larger samples generally yield a more accurate estimation. It is well established that diversity estimation depends on the sampling effort, which affects some indices more than others. The higher the effort, the greater the chance of sampling the low-frequency and rare haplotypes that are an inherent characteristic of quasispecies.
$${}^{q}D(p) = \left( \sum_{i=1}^{H} p_i^{\,q} \right)^{1/(1-q)} \qquad (3)$$
When computing diversity through Hill numbers, ${}^{q}D(p)$ (Equation (3)) [25,26], of different orders $q$, the values are bounded above by ${}^{0}D$, the number of haplotypes, and below by ${}^{\infty}D$, the inverse of the frequency of the dominant haplotype. As the order $q$ increases, the relative weight of low-frequency and rare haplotypes in the computation decreases, as low-frequency values are more heavily affected by the exponent. At $q = 0$, all haplotypes have equal weight regardless of their frequency, while at $q = \infty$, only the highest frequency holds significance. This observation suggests that the sensitivity or dependence of a Hill number with respect to sample size decreases as $q$ gets bigger. Considering the correspondence between Hill numbers and other classical diversity indices, we may set the sensitivity order as:
Number of haplotypes > Shannon entropy > Gini index > Master frequency
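The correspondence between Hill numbers and the classical indices in this ordering can be sketched numerically. In the Python illustration below, the order $q = 1$ is evaluated as the exponential of Shannon entropy (the standard limit of the Hill formula), and $q = \infty$ as the inverse of the dominant frequency; the function name and the toy frequencies are assumptions.

```python
import math


def hill_number(frequencies, q):
    """Hill number of order q for a vector of haplotype frequencies summing to 1."""
    p = [x for x in frequencies if x > 0]
    if q == 1:
        # Limit of the general formula as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(q):
        # Limit as q -> infinity: inverse of the dominant haplotype frequency.
        return 1.0 / max(p)
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


frequencies = [0.5, 0.25, 0.25]  # hypothetical quasispecies with three haplotypes
for q in (0, 1, 2, math.inf):
    print(q, round(hill_number(frequencies, q), 4))
```

Order 0 returns the number of haplotypes, order 2 the inverse Simpson concentration (related to the Gini–Simpson index), and the values decrease as $q$ grows, in line with the bounds described above.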
The rare haplotype load (RHL) [27] and the quasispecies fitness fractions (QFF) [23] are specific quasispecies diversity indices calculated as fractions of aggregated reads corresponding to haplotypes with frequencies between defined limits. These indices represent prominent fractions, which are relatively insensitive to sample size. Even the fraction of rare haplotypes, computed below 1% or below 0.1%, proves to be robust to sample size, provided that the coverage is high enough to capture these fractions with sufficient depth.
The number of haplotypes and the related incidence indices [26], such as the number of polymorphic sites or the number of mutations, on one hand, and the Shannon entropy on the other will benefit from rarefaction, and repeated subsampling without replacement to the reference size is required. Higher order diversity indices like the Gini–Simpson index will be less sensitive to sample size and may be rarefied with replacement, when required. The frequency of prominent haplotypes remains highly robust to sample size variations and can be directly compared or rarefied with replacement if needed.
As mentioned earlier, it is advisable to employ subsampling without replacement. In specific cases, such as when examining quasispecies diversity in hepatitis E virus (HEV) [23], this approach is particularly crucial. HEV, characterized by a high mutation rate, exhibits considerable diversity, resulting in haplotypes at very low frequencies. Subsampling with replacement might not fairly represent these low-frequency haplotypes. This especially needs to be considered when analyzing HEV samples from chronic patients treated with ribavirin, because the increased mutation rate it introduces produces a more unstructured quasispecies with even more low-frequency haplotypes [23,28]. On the contrary, for viruses with a less diverse quasispecies, like SARS-CoV-2 [23], subsampling could be effectively carried out with replacement, as demonstrated earlier in the ‘Flat Quasispecies’ case. This is supported by the removal of the lowest-frequency haplotypes, which yields results comparable to those of subsampling without replacement.
In NGS data, haplotype frequencies are obtained as read counts, which are integers. However, when computing diversity indices, frequencies are used as fractions of haplotype reads relative to the sample size (total number of reads), N. This introduces granularity in the results, as not all frequencies from 0 to 1 are possible, with the granularity determined by the frequency of a single read, 1/N. With a coverage of 1000 reads, the resolution is 0.1%. This resolution may be insufficient for quasispecies analysis, especially when rare haplotypes are of interest, in which case a target depth above 100,000 reads per sample/amplicon is recommended.
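The granularity argument amounts to simple arithmetic, sketched below with exact fractions. The function name and the coverages other than 1000 and 100,000 reads are assumptions made for the illustration.

```python
from fractions import Fraction


def frequency_resolution(total_reads):
    """Smallest non-zero frequency observable with `total_reads` reads: 1/N."""
    return Fraction(1, total_reads)


for total_reads in (1_000, 10_000, 100_000):
    resolution = frequency_resolution(total_reads)
    percent = resolution * 100  # exact percentage as a Fraction
    print(f"N={total_reads:>7}  resolution={resolution}  ({float(percent):g}%)")
```

Every observable frequency is an integer multiple of this resolution, so a haplotype present below 1/N in the population can only appear as a singleton or not at all.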
The controversy in the metagenomics field mentioned in the introduction also arises because there are approaches other than rarefaction when differential abundance analysis, rather than diversity comparison, is the main objective. Methods based on counts instead of frequencies, like generalized linear models (GLM), with family distributions like the negative binomial [29], are being used in RNAseq or in label-free proteomics by LC-MS/MS, among others. These methods are used in differential expression studies, aiming to compare the relative abundances of mRNA or proteins between two biological conditions, and use an implicit normalization by offsets, which allows for complex normalizations [30]. In metagenomics, these and other methods based on compositional data analysis (CoDA) [31,32] are also used in the normalization of microbiome abundance tables [10,33]. Nevertheless, when comparing diversity metrics between unbalanced samples, rarefaction is a necessity [34], especially with diversity indices such as the number of haplotypes, polymorphic sites, mutation frequency, Shannon entropy, Hill numbers, and other metrics that depend on sample size.
This study lays the groundwork for determining the most appropriate subsampling approach depending on quasispecies structure and the specific indices to be compared. In summary, at high depths, frequencies of prominent haplotypes and fractions of reads are robust to sample size variations and can be compared directly or after subsampling with replacement. The estimation of Shannon entropy, where low-frequency and rare haplotypes still have a significant role, requires rarefaction by subsampling without replacement to balance biases. The estimation of the number of haplotypes, incidence indices, the fraction of singletons, or the fraction of the lowest-frequency haplotypes requires subsampling without replacement. As part of the experimental design, a minimum coverage must be established beforehand to reject or repeat any samples falling below that threshold. This minimum coverage sets the level of information conveyed by the study.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/v16050710/s1, R code as Supplementary Material.

Author Contributions

J.G. and J.Q. wrote the main manuscript and prepared the tables and figures. J.G. and M.I.-L. developed the software and the formal analysis. S.C.-C. and C.C. worked on figures, results, and corrections of the main manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially supported by Plan Estratègic de Recerca i Innovació en Salut (PERIS), Direcció General de Recerca i Innovació en Salut (DGRIS), Catalan Health Ministry, Generalitat de Catalunya; Centro para el Desarrollo Tecnológico Industrial (CDTI) from the Spanish Ministry of Economy and Business, grant number IDI-20200297; and Project PI22/00258 funded by Instituto de Salud Carlos III (ISCIII) and co-funded by the European Union. S.C.-C. has received support from the Spanish Ministry of Education, grant FPU21/04150. M.I.-L. received the support of a fellowship from the “la Caixa” Foundation (ID 100010434), whose code is LCF/BQ/DR23/12000020. C.C. received a predoctoral fellowship from VHIR.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article and Supplementary Materials.

Acknowledgments

We would like to acknowledge Roche Diagnostics S.L.U. for covering the article processing charges (APC) and the full cost of publishing this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Holland, J.; Spindler, K.; Horodyski, F.; Grabau, E.; Nichol, S.; VandePol, S. Rapid evolution of RNA genomes. Science 1982, 215, 1577–1585. [Google Scholar] [CrossRef] [PubMed]
  2. Vignuzzi, M.; Stone, J.K.; Arnold, J.J.; Cameron, C.E.; Andino, R. Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature 2006, 439, 344–348. [Google Scholar] [CrossRef] [PubMed]
  3. Domingo, E.; Holland, J.J. Mutation rates and rapid evolution of RNA viruses. In Evolutionary Biology of Viruses; Morse, S.S., Ed.; Raven Press: New York, NY, USA, 1994; pp. 161–184. [Google Scholar]
  4. Neumann, A.U.; Lam, N.P.; Dahari, H.; Gretch, D.R.; Wiley, T.E.; Layden, T.J.; Perelson, A.S. Hepatitis C viral dynamics in vivo and the antiviral efficacy of interferon-alpha therapy. Science 1998, 282, 103–107. [Google Scholar] [CrossRef] [PubMed]
  5. Lam, N.P.; Neumann, A.U.; Gretch, D.R.; Wiley, T.E.; Perelson, A.S.; Layden, T.J. Dose-dependent acute clearance of hepatitis C genotype 1 virus with interferon alfa. Hepatology 1997, 26, 226–231. [Google Scholar] [CrossRef] [PubMed]
  6. Martell, M.; Esteban, J.I.; Quer, J.; Genesca, J.; Weiner, A.; Esteban, R.; Guardia, J.; Gomez, J. Hepatitis C virus (HCV) circulates as a population of different but closely related genomes: Quasispecies nature of HCV genome distribution. J. Virol. 1992, 66, 3225–3229. [Google Scholar] [CrossRef] [PubMed]
  7. Gregori, J.; Salicru, M.; Domingo, E.; Sanchez, A.; Esteban, J.I.; Rodríguez-Frías, F.; Quer, J. Inference with viral quasispecies diversity indices: Clonal and NGS approaches. Bioinformatics 2014, 30, 1104–1111. [Google Scholar] [CrossRef]
  8. Gregori, J.; Esteban, J.I.; Cubero, M.; Garcia-Cehic, D.; Perales, C.; Casillas, R.; Alvarez-Tejado, M.; Rodríguez-Frías, F.; Guardia, J.; Domingo, E.; et al. Ultra-deep pyrosequencing (UDPS) data treatment to study amplicon HCV minor variants. PLoS ONE 2013, 8, e83361. [Google Scholar] [CrossRef] [PubMed]
  9. Willis, A.D. Rarefaction, Alpha Diversity, and Statistics. Front. Microbiol. 2019, 10, 2407. [Google Scholar] [CrossRef]
  10. Calle, M.L. Statistical Analysis of Metagenomics Data. Genom. Inform. 2019, 17, e6. [Google Scholar] [CrossRef]
  11. Cameron, E.S.; Schmidt, P.J.; Tremblay, B.J.M.; Emelko, M.B.; Müller, K.M. Enhancing diversity analysis by repeatedly rarefying next generation sequencing data describing microbial communities. Sci. Rep. 2021, 11, 22302. [Google Scholar] [CrossRef]
  12. Hong, J.; Karaoz, U.; de Valpine, P.; Fithian, W. To rarefy or not to rarefy: Robustness and efficiency trade-offs of rarefying microbiome data. Bioinformatics 2022, 38, 2389–2396. [Google Scholar] [CrossRef] [PubMed]
  13. Shamsuri, Q.S.; Ab Majid, A.H. Metagenomic 16S rRNA amplicon data of gut microbial diversity in three species of subterranean termites (Coptotermes gestroi, Globitermes sulphureus and Macrotermes gilvus). Data Br. 2023, 47, 108993. [Google Scholar] [CrossRef] [PubMed]
  14. Gotelli, N.J.; Colwell, R.K. Estimating Species Richness. In Biological Diversity: Frontiers in Measurement and Assessment, 1st ed.; Magurran, E.A., McGill, B.J., Eds.; Oxford University Press: New York, NY, USA, 2011; pp. 1–335. [Google Scholar]
  15. Adombie, C.M.; Bosch, A.; Buti, M.; Campos, C.; Colomer-Castell, S.; Cortese, M.F.; Domingo, E.; Esteban, J.I.; Gallego, I.; Garcia-Cehic, D.; et al. Viral Quasispecies Diversity and Evolution: A Bioinformatics Molecular Approach, 1st ed.; Gregori, J., Rodríguez-Frías, F., Quer, J., Eds.; Il Pensiero Scientific Editore: Rome, Italy, 2023; pp. 1–182. [Google Scholar]
  16. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: https://www.R-project.org/ (accessed on 25 April 2024).
  17. Xie, Y. knitr: A General-Purpose Package for Dynamic Report Generation in R. 2023. Available online: https://rdrr.io/cran/knitr/ (accessed on 25 April 2024).
  18. Wickham, H. Welcome to Master the Tidyverse. J. Open Source Softw. 2019, 4, 1686. [Google Scholar] [CrossRef]
  19. Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: Cham, Switzerland, 2016; p. 260. [Google Scholar] [CrossRef]
  20. Stubner, R. dqrng: Fast Pseudo Random Number Generators. 2023. Available online: https://CRAN.R-project.org/package=dqrng (accessed on 25 April 2024).
  21. Magurran, A.E. Measuring Biological Diversity; Wiley-Blackwell: Oxford, UK, 2013; 272p. [Google Scholar]
  22. Gotelli, N.J.; Colwell, R.K. Quantifying biodiversity: Procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett. 2001, 4, 379–391. [Google Scholar] [CrossRef]
  23. Gregori, J.; Colomer-Castell, S.; Campos, C.; Ibañez-Lligoña, M.; Garcia-Cehic, D.; Rando-Segura, A.; Adombie, C.M.; Pintó, R.; Guix, S.; Bosch, A.; et al. Quasispecies Fitness Partition to Characterize the Molecular Status of a Viral Population. Negative Effect of Early Ribavirin Discontinuation in a Chronically Infected HEV Patient. Int. J. Mol. Sci. 2022, 23, 14654. [Google Scholar] [CrossRef] [PubMed]
  24. Gregori, J.; Colomer-Castell, S.; Ibañez-Lligoña, M.; Garcia-Cehic, D.; Campos, C.; Buti, M.; Riveiro-Barciela, M.; Andrés, C.; Piñana, M.; González-Sánchez, A.; et al. In-host flat-like quasispecies, methods and clinical implications. Microorganisms 2024, in press. [Google Scholar]
  25. Hill, M.O. Diversity and evenness: A unifying notation and its consequences. Ecology 1973, 54, 427–432. [Google Scholar] [CrossRef]
  26. Gregori, J.; Perales, C.; Rodriguez-Frias, F.; Esteban, J.I.; Quer, J.; Domingo, E. Viral quasispecies complexity measures. Virology 2016, 493, 227–237. [Google Scholar] [CrossRef] [PubMed]
  27. Gregori, J.; Soria, M.E.; Gallego, I.; Guerrero-Murillo, M.; Esteban, J.I.; Quer, J.; Perales, C.; Domingo, E. Rare haplotype load as marker for lethal mutagenesis. PLoS ONE 2018, 13, e0204877. [Google Scholar] [CrossRef]
  28. Todt, D.; Gisa, A.; Radonic, A.; Nitsche, A.; Behrendt, P.; Suneetha, P.V.; Pischke, S.; Bremer, B.; Brown, R.J.; Manns, M.P.; et al. In vivo evidence for ribavirin-induced mutagenesis of the hepatitis E virus genome. Gut 2016, 65, 1733–1743. [Google Scholar] [CrossRef]
  29. Agresti, A. Categorical Data Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2002. [Google Scholar]
  30. Gregori, J.; Méndez, O.; Katsila, T.; Pujals, M.; Salvans, C.; Villarreal, L.; Arribas, J.; Tabernero, J.; Sánchez, A.; Villanueva, J. Enhancing the Biological Relevance of Secretome-Based Proteomics by Linking Tumor Cell Proliferation and Protein Secretion. J. Proteome Res. 2014, 13, 3706–3721. [Google Scholar] [CrossRef] [PubMed]
  31. Aitchison, J. The Statistical Analysis of Compositional Data; Chapman & Hall: Boca Raton, FL, USA; The Blackburn Press: Caldwell, NJ, USA, 1986; 460p. [Google Scholar]
  32. Pawlowsky-Glahn, V.; Egozcue, J.J.; Tolosana-Delgado, R. Modelling and Analysis of Compositional Data; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2015. [Google Scholar]
  33. Gloor, G.B.; Wu, J.R.; Pawlowsky-Glahn, V.; Egozcue, J.J. It’s all relative: Analyzing microbiome data as compositions. Ann. Epidemiol. 2016, 26, 322–329. [Google Scholar] [CrossRef] [PubMed]
  34. Weiss, S.; Xu, Z.Z.; Peddada, S.; Amir, A.; Bittinger, K.; Gonzalez, A.; Lozupone, C.; Zaneveld, J.R.; Vázquez-Baeza, Y.; Birmingham, A.; et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 2017, 5, 27. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Subsampling with replacement. Theoretical limit to the number of observed items when subsampling with replacement at different subsampling fractions.
Figure 2. Single-dominant case. Estimation of the number of haplotypes at different subsampling fractions after B resampling cycles with and without replacement.
Figure 3. Single-dominant case. Estimation of the master frequencies at different subsampling fractions after B resampling cycles with and without replacement.
Figure 4. Theoretical limit to the number of observed haplotypes in a bootstrap resample cycle. Flat quasispecies with growing reads per haplotype.
Figure 5. Flat quasispecies rarefaction.
Figure 6. Flat quasispecies. Ratio of number of haplotypes estimated in subsampling with replacement versus those estimated by the rarefaction equation.
Table 1. Subsampling a given fraction with replacement. Proportion of items seen and unseen in a single resampling cycle.

| Fraction | Seen | Unseen |
|---|---|---|
| 0.1 | 0.0952 | 0.9048 |
| 0.2 | 0.1813 | 0.8187 |
| 0.3 | 0.2592 | 0.7408 |
| 0.4 | 0.3297 | 0.6703 |
| 0.5 | 0.3935 | 0.6065 |
| 0.6 | 0.4512 | 0.5488 |
| 0.7 | 0.5034 | 0.4966 |
| 0.8 | 0.5507 | 0.4493 |
| 0.9 | 0.5934 | 0.4066 |
| 1.0 | 0.6321 | 0.3679 |
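The values in Table 1 are consistent with the large-sample limit of drawing $fN$ reads with replacement from $N$ equally likely items. A short derivation under that equal-probability assumption follows; the symbols $N$ (number of items) and $f$ (subsampling fraction) are introduced here for illustration.

```latex
\begin{align}
P(\text{item unseen}) &= \left(1 - \frac{1}{N}\right)^{fN}
  \xrightarrow[N \to \infty]{} e^{-f}, \\
P(\text{item seen}) &= 1 - e^{-f}.
\end{align}
```

For $f = 1$ this gives $e^{-1} \approx 0.3679$ unseen and $0.6321$ seen, in agreement with the last row of Table 1.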
Table 2. All-singleton case. Estimating the number of haplotypes. Subsampling a given fraction with replacement.

| Frac | True | Expected | Median | IQR | SD | Unique | Replicated |
|---|---|---|---|---|---|---|---|
| 0.1 | 1000 | 952.1 | 952.0 | 8.00 | 6.21 | 0.9520 | 0.0480 |
| 0.2 | 2000 | 1813.5 | 1812.0 | 17.00 | 12.52 | 0.9060 | 0.0940 |
| 0.3 | 3000 | 2592.9 | 2593.0 | 23.00 | 15.88 | 0.8643 | 0.1357 |
| 0.4 | 4000 | 3298.1 | 3295.0 | 25.50 | 20.20 | 0.8238 | 0.1762 |
| 0.5 | 5000 | 3936.2 | 3932.0 | 33.00 | 24.95 | 0.7864 | 0.2136 |
| 0.6 | 6000 | 4513.5 | 4512.0 | 39.25 | 27.85 | 0.7520 | 0.2480 |
| 0.7 | 7000 | 5035.9 | 5033.0 | 37.00 | 28.11 | 0.7190 | 0.2810 |
| 0.8 | 8000 | 5508.5 | 5505.0 | 42.00 | 30.37 | 0.6881 | 0.3119 |
| 0.9 | 9000 | 5936.1 | 5934.0 | 43.25 | 32.02 | 0.6593 | 0.3407 |
| 1.0 | 10,000 | 6323.0 | 6321.5 | 42.00 | 32.47 | 0.6322 | 0.3678 |
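The all-singleton experiment summarized in Table 2 can be sketched as a small Monte Carlo simulation. This is a minimal illustration, not the authors' code: it assumes 10,000 singleton haplotypes (as the "True" column suggests), uses Python's standard `random` module, and the function names, seed, and number of cycles are hypothetical choices.

```python
import random
import statistics


def distinct_after_resampling(n_items, fraction, cycles=50, seed=1):
    """Median number of distinct items seen when drawing with replacement.

    n_items  : number of singleton haplotypes (one read each)
    fraction : subsampling fraction of the full sample
    cycles   : number of resampling cycles (B in the figure captions)
    """
    rng = random.Random(seed)
    size = round(n_items * fraction)
    counts = []
    for _ in range(cycles):
        # Each read is drawn independently, so the same read may recur.
        drawn = rng.choices(range(n_items), k=size)
        counts.append(len(set(drawn)))
    return statistics.median(counts)


def expected_distinct(n_items, fraction):
    """Analytic expectation N * (1 - (1 - 1/N)^m) for m = round(f * N) draws."""
    size = round(n_items * fraction)
    return n_items * (1.0 - (1.0 - 1.0 / n_items) ** size)


if __name__ == "__main__":
    for f in (0.1, 0.5, 1.0):
        simulated = distinct_after_resampling(10_000, f)
        analytic = round(expected_distinct(10_000, f), 1)
        print(f, simulated, analytic)
```

Without replacement the same experiment would return exactly the subsample size, since every read is a distinct haplotype; the shortfall seen here comes entirely from reads drawn more than once.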
Table 3. Single-dominant quasispecies data.

| ID | Master | Hpl. No. |
|---|---|---|
| Q.90.10 | 0.9 | 10,001 |
| Q.80.20 | 0.8 | 20,001 |
| Q.70.30 | 0.7 | 30,001 |
| Q.60.40 | 0.6 | 40,001 |
| Q.50.50 | 0.5 | 50,001 |
| Q.40.60 | 0.4 | 60,001 |
| Q.30.70 | 0.3 | 70,001 |
| Q.20.80 | 0.2 | 80,001 |
| Q.10.90 | 0.1 | 90,001 |
Table 4. Single dominant. Estimating number of haplotypes. Median values.

| ID | Subsz | NoRpl | WithRpl | Exact |
|---|---|---|---|---|
| Q.90.10 | 0.50 | 5002.0 | 3933.0 | 5000 |
| Q.90.10 | 0.25 | 2500.0 | 2215.0 | 2500 |
| Q.90.10 | 0.10 | 1000.0 | 953.0 | 1000 |
| Q.90.10 | 0.05 | 501.0 | 490.0 | 500 |
| Q.80.20 | 0.50 | 10,002.5 | 7866.0 | 10,000 |
| Q.80.20 | 0.25 | 5002.5 | 4423.0 | 5000 |
| Q.80.20 | 0.10 | 1999.0 | 1906.0 | 2000 |
| Q.80.20 | 0.05 | 1001.5 | 977.0 | 1000 |
| Q.70.30 | 0.50 | 14,996.0 | 11,799.5 | 15,000 |
| Q.70.30 | 0.25 | 7495.5 | 6635.0 | 7500 |
| Q.70.30 | 0.10 | 2999.0 | 2858.0 | 3000 |
| Q.70.30 | 0.05 | 1501.0 | 1466.5 | 1500 |
| Q.60.40 | 0.50 | 20,005.0 | 15,741.0 | 20,000 |
| Q.60.40 | 0.25 | 10,001.0 | 8852.0 | 10,000 |
| Q.60.40 | 0.10 | 3999.5 | 3807.5 | 4000 |
| Q.60.40 | 0.05 | 1998.0 | 1951.0 | 2000 |
| Q.50.50 | 0.50 | 25,001.0 | 19,676.5 | 25,000 |
| Q.50.50 | 0.25 | 12,500.5 | 11,070.0 | 12,500 |
| Q.50.50 | 0.10 | 5006.0 | 4759.0 | 5000 |
| Q.50.50 | 0.05 | 2499.0 | 2440.0 | 2500 |
| Q.40.60 | 0.50 | 29,996.0 | 23,609.5 | 30,000 |
| Q.40.60 | 0.25 | 14,993.0 | 13,274.0 | 15,000 |
| Q.40.60 | 0.10 | 6000.0 | 5706.0 | 6000 |
| Q.40.60 | 0.05 | 3001.5 | 2927.5 | 3000 |
| Q.30.70 | 0.50 | 35,001.0 | 27,542.5 | 35,000 |
| Q.30.70 | 0.25 | 17,504.5 | 15,487.5 | 17,500 |
| Q.30.70 | 0.10 | 7004.0 | 6661.0 | 7000 |
| Q.30.70 | 0.05 | 3499.0 | 3415.0 | 3500 |
| Q.20.80 | 0.50 | 39,997.0 | 31,477.5 | 40,000 |
| Q.20.80 | 0.25 | 20,002.5 | 17,701.0 | 20,000 |
| Q.20.80 | 0.10 | 7997.0 | 7613.5 | 8000 |
| Q.20.80 | 0.05 | 4003.0 | 3904.0 | 4000 |
| Q.10.90 | 0.50 | 45,001.0 | 35,409.0 | 45,000 |
| Q.10.90 | 0.25 | 22,502.0 | 19,914.0 | 22,500 |
| Q.10.90 | 0.10 | 9003.0 | 8565.0 | 9000 |
| Q.10.90 | 0.05 | 4503.0 | 4389.0 | 4500 |
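The gap between the NoRpl and WithRpl columns of Table 4 can be approximated analytically. The sketch below assumes each quasispecies holds 100,000 reads split between one master haplotype and singletons (inferred from Table 3, not stated in this section); the function name and its parameters are hypothetical.

```python
def expected_haplotypes(n_singletons, total_reads, fraction, with_replacement):
    """Approximate expected number of haplotypes in one subsample.

    The quasispecies is assumed to hold one master haplotype plus
    n_singletons haplotypes with a single read each.
    """
    size = round(total_reads * fraction)
    if with_replacement:
        # A given singleton is missed by every one of the `size` draws
        # with probability (1 - 1/total_reads) ** size.
        p_seen = 1.0 - (1.0 - 1.0 / total_reads) ** size
    else:
        # Without replacement each read is kept with probability size / total_reads.
        p_seen = size / total_reads
    # The master is so abundant that it is counted as always observed.
    return n_singletons * p_seen + 1


if __name__ == "__main__":
    # Q.90.10: 90,000 master reads and 10,000 singletons (assumed from Table 3).
    for f in (0.50, 0.25, 0.10, 0.05):
        no_rpl = expected_haplotypes(10_000, 100_000, f, with_replacement=False)
        with_rpl = expected_haplotypes(10_000, 100_000, f, with_replacement=True)
        print(f, round(no_rpl, 1), round(with_rpl, 1))
```

Under these assumptions the with-replacement expectation falls short of the true count by roughly the factor $(1 - e^{-f})/f$, so the underestimate grows with the subsampling fraction.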
Table 5. Single dominant. Estimating master frequency. Median values.

| ID | Subsz | NoRpl | WithRpl | Exact |
|---|---|---|---|---|
| Q.90.10 | 0.50 | 0.899980 | 0.90005 | 0.9 |
| Q.90.10 | 0.25 | 0.900040 | 0.90004 | 0.9 |
| Q.90.10 | 0.10 | 0.900100 | 0.90000 | 0.9 |
| Q.90.10 | 0.05 | 0.900000 | 0.90000 | 0.9 |
| Q.80.20 | 0.50 | 0.799970 | 0.80010 | 0.8 |
| Q.80.20 | 0.25 | 0.799940 | 0.80014 | 0.8 |
| Q.80.20 | 0.10 | 0.800200 | 0.79980 | 0.8 |
| Q.80.20 | 0.05 | 0.799900 | 0.80020 | 0.8 |
| Q.70.30 | 0.50 | 0.700100 | 0.70016 | 0.7 |
| Q.70.30 | 0.25 | 0.700220 | 0.70012 | 0.7 |
| Q.70.30 | 0.10 | 0.700200 | 0.69980 | 0.7 |
| Q.70.30 | 0.05 | 0.700000 | 0.69980 | 0.7 |
| Q.60.40 | 0.50 | 0.599920 | 0.60004 | 0.6 |
| Q.60.40 | 0.25 | 0.600000 | 0.59988 | 0.6 |
| Q.60.40 | 0.10 | 0.600150 | 0.59990 | 0.6 |
| Q.60.40 | 0.05 | 0.600600 | 0.60040 | 0.6 |
| Q.50.50 | 0.50 | 0.500000 | 0.49985 | 0.5 |
| Q.50.50 | 0.25 | 0.500020 | 0.49982 | 0.5 |
| Q.50.50 | 0.10 | 0.499500 | 0.50005 | 0.5 |
| Q.50.50 | 0.05 | 0.500400 | 0.50000 | 0.5 |
| Q.40.60 | 0.50 | 0.400100 | 0.39998 | 0.4 |
| Q.40.60 | 0.25 | 0.400320 | 0.39988 | 0.4 |
| Q.40.60 | 0.10 | 0.400100 | 0.40040 | 0.4 |
| Q.40.60 | 0.05 | 0.399900 | 0.39980 | 0.4 |
| Q.30.70 | 0.50 | 0.299993 | 0.30011 | 0.3 |
| Q.30.70 | 0.25 | 0.299860 | 0.30008 | 0.3 |
| Q.30.70 | 0.10 | 0.299700 | 0.30010 | 0.3 |
| Q.30.70 | 0.05 | 0.300400 | 0.30000 | 0.3 |
| Q.20.80 | 0.50 | 0.200072 | 0.19978 | 0.2 |
| Q.20.80 | 0.25 | 0.199924 | 0.19992 | 0.2 |
| Q.20.80 | 0.10 | 0.200400 | 0.20025 | 0.2 |
| Q.20.80 | 0.05 | 0.199600 | 0.20020 | 0.2 |
| Q.10.90 | 0.50 | 0.100000 | 0.10007 | 0.1 |
| Q.10.90 | 0.25 | 0.099960 | 0.09990 | 0.1 |
| Q.10.90 | 0.10 | 0.099800 | 0.10000 | 0.1 |
| Q.10.90 | 0.05 | 0.099600 | 0.10000 | 0.1 |
Table 6. Prominent haplotypes, quasispecies composition.

Number of reads: 100,000
Number of haplotypes: 3083
Prominent haplotypes (read counts): 49,231, 24,615, 12,308, 6154, 3077, 1538
Singletons (reads): 3077
Table 7. Prominent haplotypes. Subsampling without replacement. Median values.

| Subs | SngFr | Hpl_1 | Hpl_2 | Hpl_3 | Hpl_4 | Hpl_5 | Hpl_6 | Ov1 |
|---|---|---|---|---|---|---|---|---|
| True | 0.03077 | 0.49231 | 0.24615 | 0.12308 | 0.06154 | 0.03077 | 0.01538 | 0 |
| 0.5 | 0.03076 | 0.49211 | 0.24626 | 0.12315 | 0.06148 | 0.03084 | 0.01542 | 0 |
| 0.25 | 0.03080 | 0.49224 | 0.24606 | 0.12316 | 0.06164 | 0.03076 | 0.01536 | 0 |
| 0.1 | 0.03090 | 0.49240 | 0.24635 | 0.12280 | 0.06160 | 0.03070 | 0.01540 | 0 |
| 0.05 | 0.03100 | 0.49280 | 0.24620 | 0.12280 | 0.06120 | 0.03060 | 0.01520 | 0 |
Table 8. Prominent haplotypes. Subsampling with replacement. Median values.

| Subs | SngFr | Hpl_1 | Hpl_2 | Hpl_3 | Hpl_4 | Hpl_5 | Hpl_6 | Ov1 |
|---|---|---|---|---|---|---|---|---|
| True | 0.03077 | 0.49231 | 0.24615 | 0.12308 | 0.06154 | 0.03077 | 0.01538 | 0.00000 |
| 0.5 | 0.01872 | 0.49230 | 0.24604 | 0.12302 | 0.06146 | 0.03078 | 0.01542 | 0.01210 |
| 0.25 | 0.02396 | 0.49232 | 0.24626 | 0.12308 | 0.06148 | 0.03068 | 0.01536 | 0.00684 |
| 0.1 | 0.02780 | 0.49215 | 0.24620 | 0.12320 | 0.06170 | 0.03070 | 0.01540 | 0.00285 |
| 0.05 | 0.02900 | 0.49280 | 0.24600 | 0.12320 | 0.06140 | 0.03060 | 0.01540 | 0.00140 |
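The SngFr and Ov1 columns of Table 8 can be approximated with a binomial argument. The sketch below rests on an assumed reading of those columns: SngFr as the fraction of subsample reads in original singletons drawn exactly once, and Ov1 as the fraction in original singletons drawn more than once. The function name is hypothetical, and the composition (3077 singletons among 100,000 reads) is taken from Table 6.

```python
def singleton_fractions(n_singletons, total_reads, fraction):
    """Expected SngFr and Ov1 when subsampling with replacement.

    Returns the expected fraction of subsample reads that belong to
    original singletons drawn exactly once (SngFr) and to original
    singletons drawn more than once (Ov1).
    """
    size = round(total_reads * fraction)
    p = 1.0 / total_reads  # chance that one draw hits a given singleton
    # Binomial probability that the singleton is drawn exactly once.
    p_once = size * p * (1.0 - p) ** (size - 1)
    # Expected number of reads drawn from one singleton.
    mean_reads = size * p
    sng_fr = n_singletons * p_once / size
    # Reads from singletons seen two or more times: E[count] - P(count == 1).
    ov1 = n_singletons * (mean_reads - p_once) / size
    return sng_fr, ov1


if __name__ == "__main__":
    for f in (0.5, 0.25, 0.1, 0.05):
        sng_fr, ov1 = singleton_fractions(3077, 100_000, f)
        print(f, round(sng_fr, 5), round(ov1, 5))
```

If this reading is right, the two fractions always sum to the true singleton fraction (0.03077), which is what the rows of Table 8 show within sampling noise.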
Table 9. No rare haplotypes. Subsampling without replacement.

| Subs | HplNo | Hpl_01 | Hpl_02 | Hpl_03 | Hpl_04 | Hpl_05 |
|---|---|---|---|---|---|---|
| True | 11 | 0.90000 | 0.01000 | 0.01000 | 0.01000 | 0.0100 |
| 0.5 | 11 | 0.89996 | 0.01002 | 0.01004 | 0.01002 | 0.0100 |
| 0.25 | 11 | 0.89990 | 0.01000 | 0.01000 | 0.01004 | 0.0100 |
| 0.1 | 11 | 0.90000 | 0.01010 | 0.01000 | 0.01000 | 0.0101 |
| 0.05 | 11 | 0.90000 | 0.01000 | 0.01000 | 0.01000 | 0.0100 |

| Subs | Hpl_06 | Hpl_07 | Hpl_08 | Hpl_09 | Hpl_10 | Hpl_11 |
|---|---|---|---|---|---|---|
| True | 0.0100 | 0.01000 | 0.01000 | 0.01 | 0.01000 | 0.01000 |
| 0.5 | 0.0100 | 0.01002 | 0.01001 | 0.01 | 0.01002 | 0.00999 |
| 0.25 | 0.0100 | 0.00996 | 0.00996 | 0.01 | 0.01000 | 0.00996 |
| 0.1 | 0.0101 | 0.01000 | 0.01010 | 0.01 | 0.01000 | 0.01000 |
| 0.05 | 0.0100 | 0.01000 | 0.01000 | 0.01 | 0.01000 | 0.01000 |
Table 10. No rare haplotypes. Subsampling with replacement.

| Subs | HplNo | Hpl_01 | Hpl_02 | Hpl_03 | Hpl_04 | Hpl_05 |
|---|---|---|---|---|---|---|
| True | 11 | 0.90000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
| 0.5 | 11 | 0.90006 | 0.00999 | 0.01002 | 0.01002 | 0.01004 |
| 0.25 | 11 | 0.90004 | 0.01000 | 0.01004 | 0.00992 | 0.01004 |
| 0.1 | 11 | 0.90010 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
| 0.05 | 11 | 0.90010 | 0.01000 | 0.01000 | 0.01000 | 0.01020 |

| Subs | Hpl_06 | Hpl_07 | Hpl_08 | Hpl_09 | Hpl_10 | Hpl_11 |
|---|---|---|---|---|---|---|
| True | 0.01000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
| 0.5 | 0.01002 | 0.00998 | 0.01002 | 0.00998 | 0.00996 | 0.01002 |
| 0.25 | 0.01000 | 0.01000 | 0.00994 | 0.01004 | 0.01004 | 0.01000 |
| 0.1 | 0.00990 | 0.01000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
| 0.05 | 0.00980 | 0.00980 | 0.01000 | 0.00980 | 0.00990 | 0.01000 |
Table 11. Flat quasispecies: full bootstrap cycle results at growing haplotype frequencies to this case results in Equation (1).

| nHpl | k | Reads | Prob | Limit |
|---|---|---|---|---|
| 1000 | 1 | 1000 | 0.6323046 | 0.6321206 |
| 1000 | 2 | 2000 | 0.8648001 | 0.8646647 |
| 1000 | 3 | 3000 | 0.9502876 | 0.9502129 |
| 1000 | 4 | 4000 | 0.9817210 | 0.9816844 |
| 1000 | 5 | 5000 | 0.9932789 | 0.9932621 |
| 1000 | 6 | 6000 | 0.9975287 | 0.9975212 |
| 1000 | 7 | 7000 | 0.9990913 | 0.9990881 |
| 1000 | 8 | 8000 | 0.9996659 | 0.9996645 |
| 1000 | 9 | 9000 | 0.9998771 | 0.9998766 |
| 1000 | 10 | 10,000 | 0.9999548 | 0.9999546 |
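The Prob and Limit columns of Table 11 appear to follow from the chance that a given haplotype is drawn at least once in a full bootstrap resample. The sketch below assumes the forms $1 - (1 - 1/n)^{nk}$ and $1 - e^{-k}$, which reproduce the tabulated values; the function names are hypothetical and the article's Equation (1) is not shown in this section.

```python
import math


def prob_seen(n_haplotypes, reads_per_haplotype):
    """Chance that a given haplotype appears in a full bootstrap resample."""
    total_reads = n_haplotypes * reads_per_haplotype
    # Each draw misses the haplotype with probability 1 - k/(n*k) = 1 - 1/n.
    return 1.0 - (1.0 - 1.0 / n_haplotypes) ** total_reads


def limit_seen(reads_per_haplotype):
    """Large-n limit of prob_seen: 1 - exp(-k)."""
    return 1.0 - math.exp(-reads_per_haplotype)


if __name__ == "__main__":
    for k in range(1, 11):
        print(k, 1000 * k, f"{prob_seen(1000, k):.7f}", f"{limit_seen(k):.7f}")
```

With 1000 haplotypes the finite-sample value is already within about 2e-4 of the limit at k = 1, and the two converge further as k grows.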
Table 12. Flat quasispecies subsampling.

Haplotypes: n
Reads per haplotype: k
Full sample size: n · k
Subsampling fraction: f
Subsample size: round(n · k · f)
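With the quantities of Table 12, the expected number of haplotypes under both subsampling schemes can be written in closed form. The sketch below uses the standard hypergeometric rarefaction term for sampling without replacement, which may differ in form from the rarefaction equation used in the article. The function names are hypothetical, and the ratio of the two results is the kind of quantity plotted in Figure 6.

```python
import math


def log_choose(n, r):
    """Natural log of the binomial coefficient C(n, r)."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def flat_expected_haplotypes(n, k, f):
    """Expected haplotypes in a subsample of a flat quasispecies.

    n : number of haplotypes, k : reads per haplotype, f : subsampling fraction.
    Returns (without_replacement, with_replacement).
    """
    total = n * k
    size = round(total * f)
    # Without replacement: a haplotype is missed when all `size` reads come
    # from the other total - k reads (hypergeometric rarefaction term).
    if size > total - k:
        p_missed = 0.0
    else:
        p_missed = math.exp(log_choose(total - k, size) - log_choose(total, size))
    without_rpl = n * (1.0 - p_missed)
    # With replacement: each draw misses the haplotype with probability 1 - 1/n.
    with_rpl = n * (1.0 - (1.0 - 1.0 / n) ** size)
    return without_rpl, with_rpl


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        no_rpl, rpl = flat_expected_haplotypes(1000, k, 0.5)
        print(k, round(no_rpl, 1), round(rpl, 1), round(rpl / no_rpl, 4))
```

The log-gamma form keeps the binomial coefficients from overflowing at the coverages discussed in the article. Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from other environments for some values of n · k · f.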
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
