Next-generation sequencing (NGS) is increasingly used in molecular diagnostic laboratories, including for HIV-1 drug resistance (HIVDR) genotyping [1]. NGS methods have several potential advantages over standard Sanger sequencing (SS) methods, including more sensitive detection of low-abundance variants (LAV, here defined as variants detectable by NGS but not SS), potentially less subjective and more quantitative and automatable data processing steps, and a reduction in cost. Since virus populations within individuals include multiple variants (possibly including variants with mutations conferring drug resistance) that are present at frequencies below the minimum required for detection by SS, NGS has the potential to improve the utility of HIVDR genotyping [5]. However, NGS involves complex laboratory and analytic methods that are not yet well-standardized between laboratories, although recommendations for bioinformatic analysis pipelines have been proposed [7]. HIVDR genotyping tests based on different NGS platforms are also commercially available [10]. The clinical significance of LAV is largely unknown, although there is general agreement that LAV detected by more sensitive methods such as NGS may increase the predictive value of HIVDR genotyping for clinical outcomes as compared to SS [14]. There is ongoing debate, along with a general lack of certainty, regarding the optimal LAV threshold for clinical applications [17].
The World Health Organization (WHO) HIVDR Laboratory Network supports the national surveillance of HIVDR in low- and middle-income countries (LMIC) [20]. Network laboratories currently employ a variety of SS-based methods including commercial kits and in-house developed procedures, but several laboratories are planning to adopt NGS methods. Since resistance prevalence trends over time and between countries and geographic regions are an important part of the survey results, the standardization of genotyping assay performance characteristics is crucial. Consistency is ensured by the implementation of a rigorous validation, quality assurance, and quality control system [23]. New technologies, such as NGS, must be introduced carefully, with consideration given to comparability to results from other laboratories in the network and to historical data. Currently, most laboratories located in LMIC do not have access to NGS platforms. Until all WHO Network laboratories have the capability to implement NGS, individual laboratories that are doing so are required to report consensus sequences that mimic those generated by SS as closely as possible. To support this approach during this transitional period, a standardized threshold for reporting LAV that generates data comparable to those derived from SS is needed.
The National Institute of Allergy and Infectious Diseases (NIAID) Virology Quality Assurance (VQA) program provides a comprehensive quality assessment program for virologic assays for HIV, including drug resistance genotyping [26]. A crucial function of the VQA program is to ensure the validity and inter- and intra-laboratory comparability of virologic laboratory data generated for NIAID-supported clinical trials and research by the provision and analysis of proficiency testing panels. The VQA program also implements standards of performance for existing and state-of-the-art new virologic assays, develops and evaluates biostatistical methods relating to the assays, and acquires, tests, stores, and dispenses quality control materials and reagents. Since 2007, the VQA has provided proficiency testing specimens to the WHO HIVDR Laboratory Network [24]. This resource is well-suited to the investigation of NGS LAV thresholds that maximize the comparability of sequences from SS assays.
In this paper, we report for the first time an inter-laboratory comparison of HIV protease and reverse transcriptase sequences from an external quality assurance panel, comparing NGS sequences to Sanger sequences and NGS sequences between laboratories.
2. Materials and Methods
Ten VQA HIVDR genotyping proficiency testing panel specimens (five from each of two panels) were used. The specimens were prepared from patient plasma or cell culture virus stocks, and belonged to HIV-1 subtypes B, C, D, or F, with viral loads ranging from 3656 to 29,139 copies/mL. Several specimens contained multiple drug resistance-associated mutations (DRMs), some of which were present as mixtures (Table 1).
2.2. Sequencing Methods
Ten laboratories participated in this evaluation study. The laboratories are numbered from 1 to 10. Six of the laboratories were from the WHO HIVDR Laboratory Network, and four were extra-network laboratories with extensive HIVDR testing and NGS experience. Each laboratory used its own RNA extraction, RT-PCR amplification, raw sequencing data analysis, and post-testing QA procedures (Table A1), but all used the Illumina MiSeq platform (Foster City, CA, USA). One laboratory (#1) used a unique molecular identifier approach to more accurately quantitate the number of amplified templates in each reaction [27].
The laboratories submitted consensus sequences for each specimen using LAV thresholds of 5%, 10%, 15%, and 20% (i.e., minor nucleic acid variations with frequencies below these thresholds were ignored, and all variations with frequencies above the threshold were included in the base call at that position). Lower thresholds were not evaluated because of the lack of data demonstrating the clinical relevance of LAV at less than 5%. The software used by laboratory 4 to generate the consensus sequences could not apply the 20% threshold, so there are no data for this laboratory in the 20% group. The consensus sequences spanned the protease (PR)-reverse transcriptase (RT) regions that encompass all DRM sites that contribute to resistance to the PR and RT inhibitors of interest to the WHO HIVDR surveillance program (PR 10–93 and RT 41–238), except those from laboratory 1, which did not cover RT amino acids 123 to 151.
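The threshold rule described above can be illustrated with a minimal sketch. It is not any participating laboratory's pipeline: the per-position count dictionaries, the function names `call_base` and `consensus`, and the choice to include variants exactly at the threshold are all assumptions made for illustration, and the read counts are invented toy values.

```python
# IUPAC ambiguity codes keyed by the alphabetically sorted bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def call_base(counts, threshold):
    """Return the IUPAC code for every base whose frequency is >= threshold.

    `counts` maps base -> read count at one position (hypothetical input format).
    Including variants exactly at the threshold is an assumption.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"
    kept = sorted(b for b, n in counts.items() if n / total >= threshold)
    return IUPAC["".join(kept)] if kept else "N"

def consensus(pileup, threshold):
    """Build a consensus string from a list of per-position count dicts."""
    return "".join(call_base(c, threshold) for c in pileup)

# Toy pileup (invented numbers): a clean position, a 12% variant, a 7% variant.
pileup = [
    {"A": 1000},
    {"A": 880, "G": 120},
    {"C": 930, "T": 70},
]

for t in (0.05, 0.10, 0.15, 0.20):
    print(f"{t:.2f}", consensus(pileup, t))
```

In this toy case the 7% variant is reported as a mixture only at the 5% threshold and the 12% variant only at 5% and 10%, so the 15% and 20% consensus sequences contain no ambiguity codes.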
2.3. Sequence Comparison
The SS consensus sequences for each specimen were generated by VQA based on over 30 results from independent laboratories that used an SS-based, FDA-approved commercial genotyping kit (ViroSeq or TruGene), using an 80% identity threshold. Where an 80% absolute agreement was not reached, an “N” was inserted at that position, and these positions were excluded from identity percentage calculations. The VQA SS consensus sequence covers protease codons 4–99 and reverse transcriptase 38–247; portions of the NGS sequences outside this region were excluded from the analysis of identity to the VQA consensus. A secondary analysis evaluating only the sequence at DRM codons (any position with a potential impact on the penalty score in the Stanford HIVdb algorithm, version 8.5) was also performed.
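As a rough illustration of the agreement rule and the exclusion of "N" positions, the following sketch assumes pre-aligned sequences of equal length and counts identity by exact character match (so a mixture code such as R does not match A). The function names and the toy sequences are hypothetical, and whether the VQA procedure treats mixtures this way is an assumption, not something stated in the text.

```python
from collections import Counter

def agreement_consensus(seqs, min_agree=0.80):
    """Per-column consensus: keep the most common character if it reaches
    `min_agree` of the sequences, otherwise insert 'N'."""
    out = []
    for column in zip(*seqs):
        base, n = Counter(column).most_common(1)[0]
        out.append(base if n / len(column) >= min_agree else "N")
    return "".join(out)

def percent_identity(query, reference):
    """Percent identity of `query` to `reference`, skipping reference 'N' positions."""
    scored = [(q, r) for q, r in zip(query, reference) if r != "N"]
    if not scored:
        return float("nan")
    return 100.0 * sum(q == r for q, r in scored) / len(scored)

# Toy example: five aligned Sanger results; the third column reaches only 60% agreement.
sanger = ["ACGT", "ACGT", "ACGT", "ACAT", "ACAT"]
reference = agreement_consensus(sanger)
print(reference)
print(percent_identity("ACGT", reference))
print(round(percent_identity("ACGA", reference), 2))
```

Because the third column becomes "N", it is excluded from the denominator, so a query that differs from the reference only at that position still scores 100% identity.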
The sequences were aligned using Geneious software (version 11.1; San Diego, CA, USA) and analyzed in Microsoft Excel. To assess the extent to which the sequences between labs agreed with each other, without comparison to SS, the sequence identity at all positions was determined between all possible pairs of sequences for each specimen and threshold. Missing data (gaps) were ignored.
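The all-pairs comparison can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the Geneious/Excel workflow used in the study: sequences are taken to be already aligned, gaps are represented by "-", identity is an exact character match, and the laboratory labels and sequences are invented.

```python
from itertools import combinations

def pairwise_identity(a, b, gap="-"):
    """Percent identity between two aligned sequences, ignoring gapped positions."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return float("nan")
    return 100.0 * sum(x == y for x, y in compared) / len(compared)

def all_pairs(seqs_by_lab):
    """Identity for every unordered pair of laboratories for one specimen and threshold."""
    return {
        (l1, l2): pairwise_identity(seqs_by_lab[l1], seqs_by_lab[l2])
        for l1, l2 in combinations(sorted(seqs_by_lab), 2)
    }

# Toy aligned sequences (invented); lab1 has missing data at position 5.
labs = {"lab1": "ACGT-A", "lab2": "ACRTTA", "lab3": "ACGTTA"}
result = all_pairs(labs)
for pair, identity in result.items():
    print(pair, round(identity, 2))
```

With ten laboratories this yields 45 pairs per specimen and threshold (fewer where a laboratory submitted no sequence), each scored over the positions both sequences cover.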
Sequence quality evaluation (i.e., assessing the presence of anomalies such as frameshifts, stop codons, APOBEC mutations, and unusual mutations) was performed with Stanford HIVdb (https://hivdb.stanford.edu/). The anomalies reported in the region not covered by laboratory 1 (RT 123–151) were ignored for this laboratory only.
Comparisons of percent identity between thresholds were performed using the Wilcoxon matched-pairs signed rank test and paired t-test (Prism 7, GraphPad, San Diego, CA, USA), and a random effects model with laboratory and specimen as random effects, cut-off values as fixed effects, and pairwise adjusted estimates of differences between cut-off values using SAS PROC MIXED.
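For readers unfamiliar with paired comparisons, the paired t statistic can be computed by hand as sketched below. The study itself used Prism and SAS; this sketch covers only the t statistic (not the Wilcoxon test, the p-value, or the mixed model), the function name is hypothetical, and the identity values are invented toy numbers rather than study results.

```python
import math
import statistics

def paired_t_statistic(x, y):
    """Paired t statistic for matched observations x[i], y[i].

    t = mean(d) / (sd(d) / sqrt(n)), where d[i] = x[i] - y[i],
    with n - 1 degrees of freedom.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Invented percent-identity values for the same four specimens at two thresholds.
identity_at_20 = [99.0, 99.5, 98.0, 99.0]
identity_at_5 = [98.0, 97.5, 95.0, 97.0]

t = paired_t_statistic(identity_at_20, identity_at_5)
print(round(t, 3), "with", len(identity_at_20) - 1, "degrees of freedom")
```

Pairing by specimen (and laboratory) removes between-specimen variation from the comparison, which is why matched tests were appropriate for contrasting thresholds applied to the same sequencing data.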
The global surveillance of HIVDR relies on high quality, standardized methods for detecting DRMs in specimens from survey participants. The current standard platform method is SS, and until NGS is accessible to all laboratories that are contributing sequence data for HIVDR surveys, those that adopt NGS-based methods must be able to produce sequences that have the same performance characteristics as laboratories using SS. It is recognized that this transitional approach may initially lead to the under-utilization of some potential advantages of NGS, including better sensitivity for LAV detection and de-convolution of complex mixtures.
We evaluated several thresholds for reporting LAV from NGS data and demonstrated that the similarity to SS data was highest when a 20% threshold was applied. Furthermore, inter-laboratory comparability was also highest at this threshold. Previous studies that evaluated the sensitivity of SS for LAV detection or that compared sensitive point mutation assays and SS are consistent with the 20% threshold [30].
The decreased agreement between NGS and SS data at thresholds below 20% might have been considered predictable, since more low-frequency LAV are detected at lower thresholds and additional mixtures are therefore expected to be reported in the NGS sequences. However, we found that the observed decreased agreement below 20% is not solely the result of the better sensitivity of NGS, since the inter-laboratory agreement also decreased as the threshold was lowered. These observations suggest that the detection of LAV can be subject to stochastic effects that may not be robustly repeatable or reproducible between methods or laboratories. Importantly, our results raise concerns about accuracy and inter-laboratory (and perhaps also intra-laboratory) reproducibility at low thresholds such as 5%, and strongly suggest that if even lower thresholds were to be used, the reproducibility would continue to decline. In the future, if drug-resistant LAV are conclusively shown to increase the predictive value of HIVDR genotyping for clinical outcomes and a threshold below 20% is established, there may be enough impetus to transition laboratory assays that support the public health surveillance of HIVDR to NGS platforms. At that time, it will be important to gain a better understanding of the sources of inter-laboratory variability in sequence determination and implement ways to minimize their impact. Both processes would be greatly facilitated by the development and use of standardized reference and/or control material with relevant LAV at specific frequencies, for use in external QA programs and/or assay optimization and validation. Other challenges inherent in the capacity of laboratories in LMIC to perform NGS-based HIVDR genotyping (e.g., instrument cost, operator training, and the availability of technical support) will also require attention and significant resources.
The low inter-laboratory reproducibility of NGS sequences may also be at least partly related to input amplifiable copy number, specimen sequence heterogeneity, position in the genome (Figure 2 and Figure 3), and differences in the bioinformatics pipelines used. This complexity strongly suggests that clinical specimens with these characteristics should be included in external QA programs and inter-laboratory comparisons, rather than virus clones or reconstructed mixtures. With regard to differences in pipelines, Lee et al. [34] evaluated the data from six of the ten laboratories included in this study using five pipelines and reported that sensitivity was good (over 99%) using thresholds as low as 1%, but specificity was low (82.4%) at the 1% threshold; they therefore suggested that a 2% threshold would be more reliable than 1%.
Our study has several limitations. (1) One or more of the NGS methods used may include unique aspects that make them more accurate than others or than those used to generate the SS standard comparator sequences (for example, the use of unique molecular identifiers, different input RNA volumes, or bioinformatic analysis pipelines). In this case, that method might generate sequences that are very different from the gold standard, but in reality, closer to the correct result. (2) Because thresholds over 20% were not evaluated, it is possible that the optimal threshold is higher; it is expected that at very high thresholds, a decrease in concordance would be seen, as mixtures start to be under-called. (3) We have analyzed similarity to SS across the entire sequence uniformly; different optimal thresholds may exist for specific DRM positions, due to the context dependence of chromatogram peak height in SS raw data. For example, a LAV that involves a change from a “weak” A base to a “strong” G might be expected to reach maximum identity at thresholds lower than 20%. (4) It is possible that many of the sites where the variability between laboratories is introduced involve synonymous mutations that would not have any impact on the predicted amino acid sequence and DR interpretation. (5) All participating laboratories used the Illumina MiSeq platform, limiting the application of our conclusions to that platform. Finally, (6) several assay variables that could be hypothesized to have an impact on NGS assay reproducibility have not been explored, including PCR reaction input copy number, sampling bias related to procedural bottlenecks, PCR-associated errors, and analysis pipeline methodology.