# Virus Quasispecies Rarefaction: Subsampling with or without Replacement?

## Abstract

## 1. Introduction

## 2. Methods

Concept | Definition |
---|---|
Rarefaction | A technique used to compensate for different sampling intensities in diversity studies. |
Subsampling cycle | The successive random extraction of a given number of items, lower than the sample size, from a sample, with or without replacement at each extraction. |
Subsampling with replacement | An element is randomly extracted from a sample, identified, and then immediately replaced, so it can be obtained again in later extractions within the same subsampling cycle. |
Subsampling without replacement | All extractions in a subsampling cycle are performed without replacement, so no item can be extracted more than once in the same cycle. |
Downward bias | Inaccuracy in measurement or estimation that underestimates the true value. |
Subsampling fraction | Fraction of reads subsampled from a given sample in a single resampling cycle. |
Granularity | Level of resolution at which the data are processed when estimating frequencies from counts. |
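The two subsampling schemes defined above can be illustrated with a minimal sketch. It assumes a hypothetical sample of 10,000 reads, each from a different haplotype; the function name `subsampling_cycle` and the seed are illustrative choices, not part of the original methods.

```python
import random

def subsampling_cycle(reads, fraction, replace, rng):
    """Draw round(fraction * len(reads)) reads in a single subsampling cycle."""
    size = round(fraction * len(reads))
    if replace:
        # Each extracted read is put back, so it can be drawn again in the same cycle.
        return rng.choices(reads, k=size)
    # No read can be extracted more than once within the cycle.
    return rng.sample(reads, k=size)

# Hypothetical sample: 10,000 reads, each labelled with its own haplotype id.
reads = list(range(10_000))
rng = random.Random(1)  # fixed seed, only for reproducibility of the sketch

without = subsampling_cycle(reads, 0.5, replace=False, rng=rng)
with_rpl = subsampling_cycle(reads, 0.5, replace=True, rng=rng)

# Without replacement every drawn read is distinct; with replacement some
# draws are replicates, so fewer distinct haplotypes are observed.
print(len(set(without)), len(set(with_rpl)))
```

Both subsamples contain the same number of reads; only the number of distinct haplotypes they cover differs.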

## 3. Results

- All singletons: A quasispecies in which every haplotype is represented by a single read. It is the simplest case for showing the discussed limitations numerically, and it produces the largest differences between the sampling schemes.
- Single dominant case: A hypothetical scenario with one dominant haplotype, all other haplotypes being singletons. The goal is to estimate the master frequency and the number of haplotypes.
- Prominent haplotypes: Six prominent haplotypes along with a set of singletons. The objective is to estimate the frequencies of the prominent haplotypes, the fraction of singletons in the quasispecies, and the fraction of reads belonging to haplotypes that have more than one read but rank below the top six; the latter are singleton replicates produced by sampling with replacement.
- No rare haplotypes: A quasispecies composed of a master haplotype at 90% and 10 other haplotypes at 1% each. This scenario excludes singletons and lower-frequency haplotypes. The goal is to estimate haplotype frequencies by repeated subsampling.
- Flat quasispecies: As in the first case, all haplotypes have equal frequencies, here with the same number of reads per haplotype, from 1 to 10, representing a perfectly even quasispecies. This case is crucial for demonstrating the robustness of sampling quasispecies data that have previously undergone a low-level abundance filter.
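The five scenarios can be represented as vectors of read counts per haplotype. The sketch below is illustrative only. The function names are hypothetical; the prominent-haplotypes counts are those tabulated in this paper; the 10,000 singletons and the 100,000 reads of the single-dominant case are inferred from the tables; and the total of 100,000 reads in the no-rare case is an assumption, since only frequencies are given for it.

```python
def all_singletons(n_haplotypes):
    """Every haplotype is represented by exactly one read."""
    return [1] * n_haplotypes

def single_dominant(total_reads, master_fraction):
    """One master haplotype; every remaining read is a singleton."""
    master = round(total_reads * master_fraction)
    return [master] + [1] * (total_reads - master)

def no_rare(total_reads):
    """Master at 90% plus 10 haplotypes at 1% each (total_reads is assumed)."""
    one_percent = total_reads // 100
    return [90 * one_percent] + [one_percent] * 10

def flat(n_haplotypes, reads_per_haplotype):
    """All haplotypes share the same number of reads."""
    return [reads_per_haplotype] * n_haplotypes

# Six prominent haplotypes plus 3077 singletons, as tabulated in the paper.
prominent = [49_231, 24_615, 12_308, 6_154, 3_077, 1_538] + [1] * 3_077

q_90_10 = single_dominant(100_000, 0.9)
print(len(all_singletons(10_000)), len(q_90_10), len(prominent), sum(prominent))
print(len(no_rare(100_000)), len(flat(1000, 10)))
```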

#### 3.1. Bootstrap: The Theory around 0.632

In a bootstrap cycle, $n$ items are drawn with replacement from a sample of $n$ items. The probability that a given item is not sampled in the cycle is $(1 - 1/n)^{n}$; this means that the probability that a given item is sampled in a bootstrap cycle is $1 - (1 - 1/n)^{n}$. As $n$ tends to infinity, this expression tends to $1 - 1/e \approx 0.632$. In the limit of large $n$, a bootstrap resample is therefore composed of a fraction 0.6321 of unique realizations of items in the original sample plus a fraction 0.3679 of replicates.
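The convergence towards 0.632 can be checked numerically. This is a minimal sketch; the function name `prob_item_in_bootstrap` is an illustrative choice.

```python
import math

def prob_item_in_bootstrap(n):
    """Probability that a given item appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# The probability decreases towards 1 - 1/e as the sample size grows.
for n in (10, 100, 1000, 10_000, 100_000):
    print(n, round(prob_item_in_bootstrap(n), 6))

limit = 1.0 - math.exp(-1.0)
print("limit", round(limit, 6))
```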

#### 3.2. Subsampling a Given Fraction with Replacement

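When a fraction $f$ of the sample is drawn with replacement, that is, $f \cdot n$ extractions from $n$ items, the probability that a given item is sampled is $1 - (1 - 1/n)^{f \cdot n}$, and the limit as $n$ tends to infinity is $1 - (1/e)^{f}$.

The limit is obtained by converting the indeterminate form $1^{\infty}$ to $\infty \cdot 0$, noting that $f(x) = e^{\ln(f(x))}$, where $f(x)$ here denotes the expression whose limit is taken and not the subsampling fraction.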
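The limit can be worked out step by step. The derivation below is a sketch using standard calculus (L'Hôpital's rule); the symbol $g(n)$ is introduced here for illustration, to avoid a clash with the subsampling fraction $f$, and does not appear in the original text.

```latex
\begin{align*}
g(n) &= \left(1 - \frac{1}{n}\right)^{f n}
      = \exp\!\left( f \, n \, \ln\!\left(1 - \frac{1}{n}\right) \right), \\
\lim_{n\to\infty} n \ln\!\left(1 - \frac{1}{n}\right)
     &= \lim_{n\to\infty} \frac{\ln\left(1 - 1/n\right)}{1/n}
      = \lim_{n\to\infty} \frac{\dfrac{1/n^{2}}{1 - 1/n}}{-1/n^{2}}
      = -1, \\
\lim_{n\to\infty} g(n) &= e^{-f} = \left(\frac{1}{e}\right)^{f},
\qquad
\lim_{n\to\infty} \bigl(1 - g(n)\bigr) = 1 - \left(\frac{1}{e}\right)^{f}.
\end{align*}
```

Setting $f = 1$ recovers the bootstrap value $1 - 1/e \approx 0.632$.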

#### 3.3. The All-Singletons Case

#### 3.4. The Single-Dominant Case

#### 3.5. Prominent Haplotypes

#### 3.6. No Rare Haplotypes

#### 3.7. Flat Quasispecies

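With $n$ haplotypes of $k$ reads each, the probability that a given haplotype is observed in a bootstrap cycle is $1 - \left(1 - k/(n \cdot k)\right)^{(n \cdot k)} = 1 - (1 - 1/n)^{(n \cdot k)}$, where $n \cdot k$ is the sample size. In the limit as $n$ goes to infinity, this probability is $1 - (1/e)^{k}$. Table 11 and Figure 4 show the values computed for $n$ = 1000 haplotypes, with $k$ from 1 to 10 reads each: the computed probability and the corresponding limit.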
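The probabilities and limits reported in Table 11 can be recomputed directly from these two expressions. The sketch below assumes the same setting ($n$ = 1000 haplotypes, $k$ = 1 to 10); the function names are illustrative.

```python
import math

def prob_haplotype_seen(n_haplotypes, k):
    """Probability that a haplotype with k reads is seen in a full bootstrap cycle."""
    draws = n_haplotypes * k  # a bootstrap cycle draws as many reads as the sample holds
    return 1.0 - (1.0 - 1.0 / n_haplotypes) ** draws

def limit_prob(k):
    """Limit of the same probability as the number of haplotypes grows."""
    return 1.0 - math.exp(-k)

for k in range(1, 11):
    print(1000, k, 1000 * k,
          round(prob_haplotype_seen(1000, k), 7),
          round(limit_prob(k), 7))
```

With only a few reads per haplotype, nearly all haplotypes are recovered in a bootstrap cycle, which is why a previous abundance filter makes sampling with replacement much less harmful.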

The ratio $E_{2}[n|k,f]/E_{1}[n|k,f]$ gives the number of haplotypes estimated by subsampling with replacement as a fraction of the number estimated by subsampling without replacement (rarefaction). This ratio represents the accuracy obtained by subsampling with replacement in this scenario, and is shown in Figure 6, computed for $n$ = 10,000 haplotypes, $k$ = 1, 2, …, 10 reads per haplotype, and $f$ = 0.1, 0.2, …, 1.
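A ratio of this kind can be sketched numerically. The expressions for $E_{1}$ and $E_{2}$ are not reproduced in this section, so the code below assumes the standard forms: the hypergeometric rarefaction expectation for sampling without replacement, and the binomial expectation for sampling with replacement. Function names are illustrative.

```python
def expected_without_replacement(n, k, f):
    """Expected haplotypes in a subsample drawn without replacement (rarefaction)."""
    total = n * k
    m = round(total * f)
    # P(haplotype absent) = C(total - k, m) / C(total, m), which reduces to
    # a product of k terms; a non-positive numerator means the haplotype is certain.
    p_absent = 1.0
    for i in range(k):
        p_absent *= max(total - m - i, 0) / (total - i)
    return n * (1.0 - p_absent)

def expected_with_replacement(n, k, f):
    """Expected haplotypes in a subsample of the same size drawn with replacement."""
    total = n * k
    m = round(total * f)
    return n * (1.0 - (1.0 - 1.0 / n) ** m)

def ratio(n, k, f):
    return expected_with_replacement(n, k, f) / expected_without_replacement(n, k, f)

for k in (1, 2, 5, 10):
    print(k, [round(ratio(10_000, k, i / 10), 4) for i in range(1, 11)])
```

Under these assumptions the ratio is lowest for singletons at large subsampling fractions and approaches 1 as the number of reads per haplotype grows.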

## 4. Discussion

Hill numbers, ${}^{q}D(p)$ (Equation (3)) [25,26], of different orders $q$ are bounded above by ${}^{0}D$, the number of haplotypes, and below by ${}^{\infty}D$, the inverse of the frequency of the dominant haplotype. As the order $q$ increases, the relative weight of low-frequency and rare haplotypes in the computation decreases, as low-frequency values are more heavily affected by the exponent. At $q = 0$, all haplotypes have equal weight regardless of their frequency, while at $q = \infty$, only the highest frequency counts. This suggests that the sensitivity of a Hill number to sample size decreases as $q$ increases. Given the correspondence between Hill numbers and other classical diversity indices, those indices can be ordered by decreasing sensitivity to sample size according to the order $q$ of the Hill number they correspond to, from $q = 0$ (the number of haplotypes) to $q = \infty$ (the dominant haplotype frequency).
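As a numerical illustration of these bounds, the sketch below computes Hill numbers for the no-rare-haplotypes composition (a master at 90% and 10 haplotypes at 1%). Equation (3) is not reproduced in this section, so the code assumes the standard definition ${}^{q}D = \left(\sum_i p_i^{q}\right)^{1/(1-q)}$, with the exponential of Shannon entropy at $q = 1$; the function name `hill_number` is illustrative.

```python
import math

def hill_number(freqs, q):
    """Hill number of order q for a vector of haplotype frequencies (standard form, assumed)."""
    p = [x for x in freqs if x > 0]
    if q == 0:
        return float(len(p))             # number of haplotypes
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in p))  # exp of Shannon entropy
    if math.isinf(q):
        return 1.0 / max(p)              # inverse of the dominant frequency
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

freqs = [0.90] + [0.01] * 10
for q in (0, 1, 2, math.inf):
    print(q, round(hill_number(freqs, q), 4))
```

The values decrease monotonically with $q$, from the number of haplotypes down to the inverse of the master frequency.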

**Figure 1.** Subsampling with replacement. Theoretical limit to the number of observed items when subsampling with replacement at different subsampling fractions.

**Figure 2.** Single-dominant case. Estimation of the number of haplotypes at different subsampling fractions after B resampling cycles with and without replacement.

**Figure 3.** Single-dominant case. Estimation of the master frequencies at different subsampling fractions after B resampling cycles with and without replacement.

**Figure 4.** Theoretical limit to the number of observed haplotypes in a bootstrap resample cycle. Flat quasispecies with growing reads per haplotype.

**Figure 6.** Flat quasispecies. Ratio of number of haplotypes estimated in subsampling with replacement versus those estimated by the rarefaction equation.

**Table 1.** Subsampling a given fraction with replacement. Proportion of items seen and unseen in a single resampling cycle.

Fraction | Seen | Unseen |
---|---|---|
0.1 | 0.0952 | 0.9048 |
0.2 | 0.1813 | 0.8187 |
0.3 | 0.2592 | 0.7408 |
0.4 | 0.3297 | 0.6703 |
0.5 | 0.3935 | 0.6065 |
0.6 | 0.4512 | 0.5488 |
0.7 | 0.5034 | 0.4966 |
0.8 | 0.5507 | 0.4493 |
0.9 | 0.5934 | 0.4066 |
1.0 | 0.6321 | 0.3679 |
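The Seen and Unseen columns of Table 1 agree with the limiting expression $1 - (1/e)^{f}$ for subsampling a fraction $f$ with replacement. A minimal sketch that recomputes them follows; the function name `seen_fraction` is illustrative.

```python
import math

def seen_fraction(f):
    """Limiting proportion of items seen when subsampling a fraction f with replacement."""
    return 1.0 - math.exp(-f)

for i in range(1, 11):
    f = i / 10
    seen = seen_fraction(f)
    print(f, round(seen, 4), round(1.0 - seen, 4))
```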

**Table 2.** All-singleton case. Estimating the number of haplotypes. Subsampling a given fraction with replacement.

Frac | True | Expected | Median | IQR | SD | Unique | Replicated |
---|---|---|---|---|---|---|---|
0.1 | 1000 | 952.1 | 952.0 | 8.00 | 6.21 | 0.9520 | 0.0480 |
0.2 | 2000 | 1813.5 | 1812.0 | 17.00 | 12.52 | 0.9060 | 0.0940 |
0.3 | 3000 | 2592.9 | 2593.0 | 23.00 | 15.88 | 0.8643 | 0.1357 |
0.4 | 4000 | 3298.1 | 3295.0 | 25.50 | 20.20 | 0.8238 | 0.1762 |
0.5 | 5000 | 3936.2 | 3932.0 | 33.00 | 24.95 | 0.7864 | 0.2136 |
0.6 | 6000 | 4513.5 | 4512.0 | 39.25 | 27.85 | 0.7520 | 0.2480 |
0.7 | 7000 | 5035.9 | 5033.0 | 37.00 | 28.11 | 0.7190 | 0.2810 |
0.8 | 8000 | 5508.5 | 5505.0 | 42.00 | 30.37 | 0.6881 | 0.3119 |
0.9 | 9000 | 5936.1 | 5934.0 | 43.25 | 32.02 | 0.6593 | 0.3407 |
1.0 | 10,000 | 6323.0 | 6321.5 | 42.00 | 32.47 | 0.6322 | 0.3678 |

ID | Master frequency | Number of haplotypes |
---|---|---|
Q.90.10 | 0.9 | 10,001 |
Q.80.20 | 0.8 | 20,001 |
Q.70.30 | 0.7 | 30,001 |
Q.60.40 | 0.6 | 40,001 |
Q.50.50 | 0.5 | 50,001 |
Q.40.60 | 0.4 | 60,001 |
Q.30.70 | 0.3 | 70,001 |
Q.20.80 | 0.2 | 80,001 |
Q.10.90 | 0.1 | 90,001 |

ID | Subsampling fraction | Without replacement | With replacement | Exact |
---|---|---|---|---|
Q.90.10 | 0.50 | 5002.0 | 3933.0 | 5000 |
Q.90.10 | 0.25 | 2500.0 | 2215.0 | 2500 |
Q.90.10 | 0.10 | 1000.0 | 953.0 | 1000 |
Q.90.10 | 0.05 | 501.0 | 490.0 | 500 |
Q.80.20 | 0.50 | 10,002.5 | 7866.0 | 10,000 |
Q.80.20 | 0.25 | 5002.5 | 4423.0 | 5000 |
Q.80.20 | 0.10 | 1999.0 | 1906.0 | 2000 |
Q.80.20 | 0.05 | 1001.5 | 977.0 | 1000 |
Q.70.30 | 0.50 | 14,996.0 | 11,799.5 | 15,000 |
Q.70.30 | 0.25 | 7495.5 | 6635.0 | 7500 |
Q.70.30 | 0.10 | 2999.0 | 2858.0 | 3000 |
Q.70.30 | 0.05 | 1501.0 | 1466.5 | 1500 |
Q.60.40 | 0.50 | 20,005.0 | 15,741.0 | 20,000 |
Q.60.40 | 0.25 | 10,001.0 | 8852.0 | 10,000 |
Q.60.40 | 0.10 | 3999.5 | 3807.5 | 4000 |
Q.60.40 | 0.05 | 1998.0 | 1951.0 | 2000 |
Q.50.50 | 0.50 | 25,001.0 | 19,676.5 | 25,000 |
Q.50.50 | 0.25 | 12,500.5 | 11,070.0 | 12,500 |
Q.50.50 | 0.10 | 5006.0 | 4759.0 | 5000 |
Q.50.50 | 0.05 | 2499.0 | 2440.0 | 2500 |
Q.40.60 | 0.50 | 29,996.0 | 23,609.5 | 30,000 |
Q.40.60 | 0.25 | 14,993.0 | 13,274.0 | 15,000 |
Q.40.60 | 0.10 | 6000.0 | 5706.0 | 6000 |
Q.40.60 | 0.05 | 3001.5 | 2927.5 | 3000 |
Q.30.70 | 0.50 | 35,001.0 | 27,542.5 | 35,000 |
Q.30.70 | 0.25 | 17,504.5 | 15,487.5 | 17,500 |
Q.30.70 | 0.10 | 7004.0 | 6661.0 | 7000 |
Q.30.70 | 0.05 | 3499.0 | 3415.0 | 3500 |
Q.20.80 | 0.50 | 39,997.0 | 31,477.5 | 40,000 |
Q.20.80 | 0.25 | 20,002.5 | 17,701.0 | 20,000 |
Q.20.80 | 0.10 | 7997.0 | 7613.5 | 8000 |
Q.20.80 | 0.05 | 4003.0 | 3904.0 | 4000 |
Q.10.90 | 0.50 | 45,001.0 | 35,409.0 | 45,000 |
Q.10.90 | 0.25 | 22,502.0 | 19,914.0 | 22,500 |
Q.10.90 | 0.10 | 9003.0 | 8565.0 | 9000 |
Q.10.90 | 0.05 | 4503.0 | 4389.0 | 4500 |

ID | Subsampling fraction | Without replacement | With replacement | Exact |
---|---|---|---|---|
Q.90.10 | 0.50 | 0.899980 | 0.90005 | 0.9 |
Q.90.10 | 0.25 | 0.900040 | 0.90004 | 0.9 |
Q.90.10 | 0.10 | 0.900100 | 0.90000 | 0.9 |
Q.90.10 | 0.05 | 0.900000 | 0.90000 | 0.9 |
Q.80.20 | 0.50 | 0.799970 | 0.80010 | 0.8 |
Q.80.20 | 0.25 | 0.799940 | 0.80014 | 0.8 |
Q.80.20 | 0.10 | 0.800200 | 0.79980 | 0.8 |
Q.80.20 | 0.05 | 0.799900 | 0.80020 | 0.8 |
Q.70.30 | 0.50 | 0.700100 | 0.70016 | 0.7 |
Q.70.30 | 0.25 | 0.700220 | 0.70012 | 0.7 |
Q.70.30 | 0.10 | 0.700200 | 0.69980 | 0.7 |
Q.70.30 | 0.05 | 0.700000 | 0.69980 | 0.7 |
Q.60.40 | 0.50 | 0.599920 | 0.60004 | 0.6 |
Q.60.40 | 0.25 | 0.600000 | 0.59988 | 0.6 |
Q.60.40 | 0.10 | 0.600150 | 0.59990 | 0.6 |
Q.60.40 | 0.05 | 0.600600 | 0.60040 | 0.6 |
Q.50.50 | 0.50 | 0.500000 | 0.49985 | 0.5 |
Q.50.50 | 0.25 | 0.500020 | 0.49982 | 0.5 |
Q.50.50 | 0.10 | 0.499500 | 0.50005 | 0.5 |
Q.50.50 | 0.05 | 0.500400 | 0.50000 | 0.5 |
Q.40.60 | 0.50 | 0.400100 | 0.39998 | 0.4 |
Q.40.60 | 0.25 | 0.400320 | 0.39988 | 0.4 |
Q.40.60 | 0.10 | 0.400100 | 0.40040 | 0.4 |
Q.40.60 | 0.05 | 0.399900 | 0.39980 | 0.4 |
Q.30.70 | 0.50 | 0.299993 | 0.30011 | 0.3 |
Q.30.70 | 0.25 | 0.299860 | 0.30008 | 0.3 |
Q.30.70 | 0.10 | 0.299700 | 0.30010 | 0.3 |
Q.30.70 | 0.05 | 0.300400 | 0.30000 | 0.3 |
Q.20.80 | 0.50 | 0.200072 | 0.19978 | 0.2 |
Q.20.80 | 0.25 | 0.199924 | 0.19992 | 0.2 |
Q.20.80 | 0.10 | 0.200400 | 0.20025 | 0.2 |
Q.20.80 | 0.05 | 0.199600 | 0.20020 | 0.2 |
Q.10.90 | 0.50 | 0.100000 | 0.10007 | 0.1 |
Q.10.90 | 0.25 | 0.099960 | 0.09990 | 0.1 |
Q.10.90 | 0.10 | 0.099800 | 0.10000 | 0.1 |
Q.10.90 | 0.05 | 0.099600 | 0.10000 | 0.1 |
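The pattern in the single-dominant tables (an unbiased master frequency under both schemes, but fewer haplotypes when sampling with replacement) can be reproduced with a single subsampling cycle. The sketch below builds the Q.90.10 composition, assuming 100,000 reads in total as implied by the master frequency and haplotype count; it runs one cycle per scheme rather than the B cycles summarized in the tables, so its output is only comparable to, not identical with, the tabulated medians. Function names and the seed are illustrative.

```python
import random

def build_reads(total_reads, master_fraction):
    """Reads labelled by haplotype id: 0 is the master, every other id is a singleton."""
    master = round(total_reads * master_fraction)
    return [0] * master + list(range(1, total_reads - master + 1))

def one_cycle(reads, fraction, replace, rng):
    """Run one subsampling cycle; return (number of haplotypes, master frequency)."""
    size = round(fraction * len(reads))
    draw = rng.choices(reads, k=size) if replace else rng.sample(reads, k=size)
    return len(set(draw)), draw.count(0) / size

rng = random.Random(7)  # fixed seed, only for reproducibility of the sketch
reads = build_reads(100_000, 0.9)

results = {}
for replace in (False, True):
    results[replace] = one_cycle(reads, 0.5, replace, rng)
    label = "with replacement" if replace else "without replacement"
    print(label, results[replace][0], round(results[replace][1], 4))
```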

Parameter | Value |
---|---|
Number of reads | 100,000 |
Number of haplotypes | 3083 |
Prominent haplotypes (read counts) | 49,231, 24,615, 12,308, 6154, 3077, 1538 |
Singletons (reads) | 3077 |

Subs | SngFr | Hpl_1 | Hpl_2 | Hpl_3 | Hpl_4 | Hpl_5 | Hpl_6 | Ov1 |
---|---|---|---|---|---|---|---|---|
True | 0.03077 | 0.49231 | 0.24615 | 0.12308 | 0.06154 | 0.03077 | 0.01538 | 0 |
0.5 | 0.03076 | 0.49211 | 0.24626 | 0.12315 | 0.06148 | 0.03084 | 0.01542 | 0 |
0.25 | 0.03080 | 0.49224 | 0.24606 | 0.12316 | 0.06164 | 0.03076 | 0.01536 | 0 |
0.1 | 0.03090 | 0.49240 | 0.24635 | 0.12280 | 0.06160 | 0.03070 | 0.01540 | 0 |
0.05 | 0.03100 | 0.49280 | 0.24620 | 0.12280 | 0.06120 | 0.03060 | 0.01520 | 0 |

Subs | SngFr | Hpl_1 | Hpl_2 | Hpl_3 | Hpl_4 | Hpl_5 | Hpl_6 | Ov1 |
---|---|---|---|---|---|---|---|---|
True | 0.03077 | 0.49231 | 0.24615 | 0.12308 | 0.06154 | 0.03077 | 0.01538 | 0.00000 |
0.5 | 0.01872 | 0.49230 | 0.24604 | 0.12302 | 0.06146 | 0.03078 | 0.01542 | 0.01210 |
0.25 | 0.02396 | 0.49232 | 0.24626 | 0.12308 | 0.06148 | 0.03068 | 0.01536 | 0.00684 |
0.1 | 0.02780 | 0.49215 | 0.24620 | 0.12320 | 0.06170 | 0.03070 | 0.01540 | 0.00285 |
0.05 | 0.02900 | 0.49280 | 0.24600 | 0.12320 | 0.06140 | 0.03060 | 0.01540 | 0.00140 |
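The loss of singletons seen in this table (SngFr falling below its true value while Ov1 grows) is what a simple binomial argument predicts for sampling with replacement. The sketch below is not taken from the paper: it assumes that, in $m$ draws from $N$ reads, a given singleton read is drawn exactly once with probability $m \cdot (1/N) \cdot (1 - 1/N)^{m-1}$, and that the remaining singleton draws end up as replicated haplotypes (Ov1). The function name is illustrative.

```python
def singleton_fraction_with_replacement(n_singletons, total_reads, fraction):
    """Expected fraction of subsample reads that are still singletons."""
    m = round(total_reads * fraction)
    # Expected number of original singletons drawn exactly once, divided by m:
    # n_singletons * m * (1/N) * (1 - 1/N)**(m - 1) / m
    return (n_singletons / total_reads) * (1.0 - 1.0 / total_reads) ** (m - 1)

true_fraction = 3077 / 100_000
for f in (0.5, 0.25, 0.1, 0.05):
    sng = singleton_fraction_with_replacement(3077, 100_000, f)
    ov1 = true_fraction - sng  # singleton reads drawn more than once
    print(f, round(sng, 5), round(ov1, 5))
```

The computed expectations lie close to the tabulated medians; at the smallest fractions the tabulated values are also limited by granularity, since a subsample of 5000 reads resolves frequencies only in steps of 0.0002.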

Subs | HplNo | Hpl_01 | Hpl_02 | Hpl_03 | Hpl_04 | Hpl_05 |
---|---|---|---|---|---|---|
True | 11 | 0.90000 | 0.01000 | 0.01000 | 0.01000 | 0.0100 |
0.5 | 11 | 0.89996 | 0.01002 | 0.01004 | 0.01002 | 0.0100 |
0.25 | 11 | 0.89990 | 0.01000 | 0.01000 | 0.01004 | 0.0100 |
0.1 | 11 | 0.90000 | 0.01010 | 0.01000 | 0.01000 | 0.0101 |
0.05 | 11 | 0.90000 | 0.01000 | 0.01000 | 0.01000 | 0.0100 |

Subs | Hpl_06 | Hpl_07 | Hpl_08 | Hpl_09 | Hpl_10 | Hpl_11 |
---|---|---|---|---|---|---|
True | 0.0100 | 0.01000 | 0.01000 | 0.01 | 0.01000 | 0.01000 |
0.5 | 0.0100 | 0.01002 | 0.01001 | 0.01 | 0.01002 | 0.00999 |
0.25 | 0.0100 | 0.00996 | 0.00996 | 0.01 | 0.01000 | 0.00996 |
0.1 | 0.0101 | 0.01000 | 0.01010 | 0.01 | 0.01000 | 0.01000 |
0.05 | 0.0100 | 0.01000 | 0.01000 | 0.01 | 0.01000 | 0.01000 |

Subs | HplNo | Hpl_01 | Hpl_02 | Hpl_03 | Hpl_04 | Hpl_05 |
---|---|---|---|---|---|---|
True | 11 | 0.90000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
0.5 | 11 | 0.90006 | 0.00999 | 0.01002 | 0.01002 | 0.01004 |
0.25 | 11 | 0.90004 | 0.01000 | 0.01004 | 0.00992 | 0.01004 |
0.1 | 11 | 0.90010 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
0.05 | 11 | 0.90010 | 0.01000 | 0.01000 | 0.01000 | 0.01020 |

Subs | Hpl_06 | Hpl_07 | Hpl_08 | Hpl_09 | Hpl_10 | Hpl_11 |
---|---|---|---|---|---|---|
True | 0.01000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
0.5 | 0.01002 | 0.00998 | 0.01002 | 0.00998 | 0.00996 | 0.01002 |
0.25 | 0.01000 | 0.01000 | 0.00994 | 0.01004 | 0.01004 | 0.01000 |
0.1 | 0.00990 | 0.01000 | 0.01000 | 0.01000 | 0.01000 | 0.01000 |
0.05 | 0.00980 | 0.00980 | 0.01000 | 0.00980 | 0.00990 | 0.01000 |

**Table 11.** Flat quasispecies: full bootstrap cycle results at growing haplotype frequencies to this case results in Equation (1).

nHpl | k | Reads | Prob | Limit |
---|---|---|---|---|
1000 | 1 | 1000 | 0.6323046 | 0.6321206 |
1000 | 2 | 2000 | 0.8648001 | 0.8646647 |
1000 | 3 | 3000 | 0.9502876 | 0.9502129 |
1000 | 4 | 4000 | 0.9817210 | 0.9816844 |
1000 | 5 | 5000 | 0.9932789 | 0.9932621 |
1000 | 6 | 6000 | 0.9975287 | 0.9975212 |
1000 | 7 | 7000 | 0.9990913 | 0.9990881 |
1000 | 8 | 8000 | 0.9996659 | 0.9996645 |
1000 | 9 | 9000 | 0.9998771 | 0.9998766 |
1000 | 10 | 10,000 | 0.9999548 | 0.9999546 |

Quantity | Notation |
---|---|
Haplotypes | n |
Reads per haplotype | k |
Full sample size | n · k |
Subsampling fraction | f |
Subsample size | round(n · k · f) |

