#### 3.1. Search for the Conserved Targets and Statistical Criteria for Deep Sequencing Data Arrays

In most bioinformatic pipelines (in all of them, to the authors’ knowledge), mutations are detected against fixed, predetermined targets. The choice of such reference targets is dictated partly by the available drugs and partly by convention. For example, the sequence of HIV-1 isolate 97CDKP58e from the Republic of the Congo (GenBank Accession No. AF316544) could be considered a generic reference for HIV-1 subtype A sequences. The conservation of a reference target, and the corresponding mutational repertoire determined against the complete deep sequencing set, may depend on which particular target is chosen as the reference (see, for example, [17]). The natural choice of reference targets is the conserved targets in the set; these targets should then be compared with the related predetermined drug targets. Using a conserved target as the reference and aligning the complete deep sequencing set against it yields the frequencies of nucleotide substitutions and microindels at the different sites of the target and allows the general conservation of the target to be assessed.

The frequency of mutations, $f_{i,\,N\to N'}^{(k)}$, is defined against the set of aligned sequences as

$$f_{i,\,N\to N'}^{(k)} = \frac{n_{i,\,N\to N'}^{(k)}}{n_{seq}^{(k)}}, \qquad (1)$$

where $n_{i,\,N\to N'}^{(k)}$ is the number of aligned sequences containing the replacement $N\to N'$ at site $i$ and $n_{seq}^{(k)}$ is the total number of aligned sequences for the $k$-th cohort. The total frequency of mutations at site $i$ is obtained by summation over $N'\ne N$. The expected standard deviation of the mutation frequency at the $i$-th site may be assessed from the binomial distribution [30,31]:

$$\sigma\left(f_{i,\,N\to N'}^{(k)}\right) = \sqrt{\frac{f_{i,\,N\to N'}^{(k)}\left(1 - f_{i,\,N\to N'}^{(k)}\right)}{n_{seq}^{(k)}}}. \qquad (2)$$
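The definitions of the substitution frequency and its binomial standard deviation can be illustrated with a short Python sketch. The function names, the list-of-bases input format, and the toy read counts are assumptions made for illustration only; they are not taken from the authors' pipeline.

```python
import math
from collections import Counter


def substitution_frequencies(column, reference_base):
    """Frequencies of each replacement N -> N' at a single site.

    column: bases observed at site i across all aligned reads
            (hypothetical input format, one base per read).
    reference_base: the base N of the reference (conserved) target.
    """
    n_seq = len(column)
    counts = Counter(column)
    return {
        base: count / n_seq
        for base, count in counts.items()
        if base != reference_base
    }


def binomial_sigma(f, n_seq):
    """Binomial standard deviation of a frequency f estimated from n_seq reads."""
    return math.sqrt(f * (1.0 - f) / n_seq)


# Toy data (invented): 10,000 reads covering one site whose reference base is A.
column = ["A"] * 9990 + ["G"] * 7 + ["T"] * 3

freqs = substitution_frequencies(column, "A")
total = sum(freqs.values())  # total mutation frequency: sum over N' != N

for base, f in sorted(freqs.items()):
    print(f"A->{base}: f = {f:.2e}, sigma = {binomial_sigma(f, len(column)):.2e}")
print(f"total: f = {total:.2e}, sigma = {binomial_sigma(total, len(column)):.2e}")
```

The same `binomial_sigma` helper is applied to both the per-replacement and the total frequency, since the standard deviation has the same form in both cases.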

A similar expression holds for the total frequency of mutations. The sensitivity of mutation detection with deep sequencing techniques can be assessed by the criterion $1.96\,\sigma(f_{thr}) = f_{thr}$ (Pr = 0.05), which yields the threshold at large $n_{seq}$,

$$f_{thr} \approx \frac{1.96^{2}}{n_{seq}} \approx \frac{3.84}{n_{seq}}. \qquad (3)$$
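As a worked check of the detection criterion $1.96\,\sigma(f_{thr}) = f_{thr}$, the sketch below solves it for a binomial standard deviation and compares the exact solution with its large-$n_{seq}$ limit. The function names and the read counts in the loop are illustrative assumptions.

```python
import math


def detection_threshold(n_seq, z=1.96):
    """Exact root of z * sqrt(f * (1 - f) / n_seq) = f, i.e. f = z^2 / (n_seq + z^2)."""
    z2 = z * z
    return z2 / (n_seq + z2)


def detection_threshold_large_n(n_seq, z=1.96):
    """Large-n_seq limit of the threshold, f ~ z^2 / n_seq (about 3.84 / n_seq)."""
    return z * z / n_seq


# Hypothetical read depths, chosen only to show the scaling with n_seq.
for n_seq in (10**3, 10**5, 10**6):
    exact = detection_threshold(n_seq)
    approx = detection_threshold_large_n(n_seq)
    print(f"n_seq = {n_seq:>7}: exact = {exact:.4e}, large-n limit = {approx:.4e}")
```

For depths of $10^{5}$ reads or more the two values agree to better than 0.01%, so the simpler large-$n_{seq}$ form is adequate in practice.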

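The detection limit (3) depends only on the total number of reads for a particular target and should be applied to a specific site in an individual target. The statistical significance of the difference between the corresponding replacements at sites $i$ and $j$, for the same or two different cohorts $k$ and $l$, can be assessed by the Gaussian $z$-criterion [30]:

$$z = \frac{f_{i,\,N\to N'}^{(k)} - f_{j,\,N\to N'}^{(l)}}{\sqrt{\sigma^{2}\left(f_{i,\,N\to N'}^{(k)}\right) + \sigma^{2}\left(f_{j,\,N\to N'}^{(l)}\right)}}. \qquad (4)$$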

At $f_{N\to N'}\approx 10^{-4}$ and $n_{seq}\approx 10^{6}$, differences of the order of $\Delta f_{N\to N'}\approx 10^{-5}$ can be resolved between the corresponding replacements $N\to N'$ in two cohorts. In some cases, small differences in mutation rates may have significant genetic consequences and may be used for the early diagnosis of disease.
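The order-of-magnitude estimate above can be reproduced with a minimal sketch of a two-sample Gaussian $z$-test. It assumes unpooled binomial variances for the two frequencies, which is a standard form of the test and may differ in detail from the expression used by the authors; the function names and the example frequencies are hypothetical.

```python
import math


def z_criterion(f_i, n_i, f_j, n_j):
    """Gaussian z statistic for two frequencies with unpooled binomial variances."""
    var_i = f_i * (1.0 - f_i) / n_i
    var_j = f_j * (1.0 - f_j) / n_j
    return (f_i - f_j) / math.sqrt(var_i + var_j)


def two_sided_p(z):
    """Two-sided Gaussian p-value for a given z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def min_resolvable_difference(f, n_seq, z=1.96):
    """Smallest difference reaching |z| = 1.96 when both cohorts have
    frequency close to f and the same number of reads n_seq."""
    return z * math.sqrt(2.0 * f * (1.0 - f) / n_seq)


f, n_seq = 1e-4, 10**6
delta = min_resolvable_difference(f, n_seq)
print(f"minimal resolvable difference: {delta:.2e}")  # a few times 1e-5

# Hypothetical pair of cohorts whose frequencies differ by 3e-5.
z = z_criterion(1.3e-4, n_seq, 1.0e-4, n_seq)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.3f}")
```

Under these assumptions the resolvable difference at the 1.96 level is a few times $10^{-5}$, consistent with the order of magnitude quoted in the text.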

All sequencing techniques introduce some experimental errors into the resulting reads. In Illumina deep-sequencing technology, the quality of reads is assessed in terms of the parameter $Q$, which is indirectly related to the error frequency. We do not discuss methods for correcting the Illumina output data in this paper; instead, we restrict ourselves to the effects of given error rates on the proposed criteria and results. Consider a model in which the detected mutation frequency is composed of the actual mutation frequency and the frequency of read errors, $f_{observable} = f_{mutation} + f_{error}$. Both contributions are assumed to be independent and to obey binomial statistics. A study of the cumulant generating function [30] for the composite stochastic process shows that, in the limit of small error frequencies, $f_{error} \ll 1$, the resulting statistics are approximately binomial or Gaussian. The inequality $f_{error} \ll 1$ is required for any practical application of deep sequencing and is therefore not restrictive. This means that the criteria (2)–(5) provide a suitable interpolation throughout the entire range of $f_{mutation}$ when the observable mutation frequencies are substituted into them, i.e., these criteria remain robust against the contribution of read errors.
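The robustness claim can be checked at the level of the variance (the second cumulant only, not the full cumulant generating function analysis cited in the text). The sketch below assumes one reading of the model: the observed count is the sum of two independent binomial counts over the same $n_{seq}$ reads. It compares the variance of this composite process with the binomial variance obtained by substituting $f_{observable}$ directly. All function names and numerical values are illustrative assumptions.

```python
def composite_variance(f_mut, f_err, n_seq):
    """Variance of the observed frequency when mutation and error counts are
    independent binomial variables over the same n_seq reads (assumed model)."""
    return (f_mut * (1.0 - f_mut) + f_err * (1.0 - f_err)) / n_seq


def binomial_variance(f_obs, n_seq):
    """Variance obtained by treating the observed frequency as purely binomial."""
    return f_obs * (1.0 - f_obs) / n_seq


def relative_discrepancy(f_mut, f_err):
    """Relative difference between the two variances (independent of n_seq).

    Algebraically this equals 2 * f_mut * f_err / (f_obs * (1 - f_obs)).
    """
    f_obs = f_mut + f_err
    exact = composite_variance(f_mut, f_err, 1)
    approx = binomial_variance(f_obs, 1)
    return (exact - approx) / approx


# Hypothetical error frequency and a wide range of mutation frequencies.
f_err = 1e-4
for f_mut in (1e-6, 1e-4, 1e-2, 0.5):
    print(f"f_mutation = {f_mut:.0e}: relative discrepancy = "
          f"{relative_discrepancy(f_mut, f_err):.2e}")
```

Because $f_{mutation}/f_{observable} \le 1$, the discrepancy is bounded by roughly $2f_{error}/(1 - f_{observable})$, so under this model it stays small over the whole range of $f_{mutation}$ whenever $f_{error} \ll 1$.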

Commonly, the noise in experimental data (read errors in our case) is assessed via the weakest signals (here, the smallest detected mutation frequencies). The smallest detected mutation frequencies in the example below are about $10^{-8}$–$10^{-7}$, which is several times (or even orders of magnitude) lower than the threshold set by the statistical scattering in finite sampling sets (see Equation (3)).

#### 3.3. Mutations in RNA Interference Targets for HIV-1 Subtype A

A deep sequencing technique was applied to study the conservation of RNAi targets in HIV-1 subtype A. Detailed information about the selected RNAi targets A1–A4 and A6 was published previously [16,17]. Targets A1 and A2 are located inside the RT domain and A3 is inside the integrase domain, whereas A4, A5, and A6 reside inside the domains specifying vpu, gp120, and p17, respectively. Their positions on the HIV-1 genome are shown in Figure 1a. The total numbers of reads for the different RNAi targets are summarized in Table S1. The conserved targets were determined as described in the Methods section. Their 19-nucleotide core sequences are shown in Figure 1b together with the profiles of the total mutation frequencies over the target sites. The mutation profiles reveal clear conservation of the target cores, indicating their functional significance. Rare microindels were also detected, but their contribution to the general target conservation is about two orders of magnitude lower than that of mutations. The $z$-criterion profiles (Equation (4)) that characterize the difference between the mutation profiles for cohorts 1 and 2 at the corresponding sites ($i = j$ in Equation (4)) of the same targets are shown in Figure 3.