A Hierarchical Genotyping Framework Using DNA Melting Temperatures Applied to Adenovirus Species Typing

Known genetic variation, in conjunction with post-PCR melting curve analysis, can be leveraged to provide increased taxonomic detail for pathogen identification in commercial molecular diagnostic tests. Increased taxonomic detail may be used by clinicians and public health decision-makers to observe circulation patterns, monitor for outbreaks, and inform testing practices. We propose a method for expanding the taxonomic resolution of PCR diagnostic systems by incorporating a priori knowledge of assay design and sequence information into a genotyping classification model. For multiplexed PCR systems, this framework is generalized to incorporate information from multiple assays to increase classification accuracy. An illustrative hierarchical classification model for human adenovirus (HAdV) species was developed and demonstrated ~95% cross-validated accuracy on a labeled dataset. The model was then applied to a near-real-time surveillance dataset in which deidentified adenovirus detected patient test data from 2018 through 2021 were classified into one of six adenovirus species. These results show a marked change in both the predicted prevalence for HAdV and the species makeup with the onset of the COVID-19 pandemic. HAdV-B decreased from a pre-pandemic predicted prevalence of up to 40% to less than 5% in 2021, while HAdV-A and HAdV-F species both increased in predicted prevalence.


Introduction
Characterizing genetic variation in pathogen populations and expanding the understanding of how they evolve between hosts can have important implications for public health [1]. Providing increased taxonomic information about a pathogen can inform medical decision-making. For example, it has been shown that a specific strain of Enterovirus (D-68) presents an increased risk of severe respiratory infections for some at-risk populations (e.g., pediatric and asthmatic patients) compared to other similar Enterovirus strains [2,3]. When this increased taxonomic resolution of a pathogen is available and provided in real-time, clinicians and health care professionals can make decisions for local and regional institutions to mitigate the effects of an outbreak [4].
At the present time, next-generation sequencing (NGS) is the primary method of genetic characterization and is increasingly being integrated into the clinical microbiology laboratory, but there still exist many challenges in its routine use, namely cost and time to results [5]. Alternatively, lab-developed tests (LDTs) to identify specific strains are common practice in diagnostics for the clinical microbiology laboratory. However, these tests cannot be distributed directly from lab to lab, and labs are not required to standardize assays used for the identification and reporting of specific strains. Commercially available diagnostic tests for common pathogens are widely available due to their low cost, time to result, typing efforts and the generalizability of this method to other genotyping applications are discussed. Finally, we define the formal mathematical framework used for predicting genotypic variability based on expected Tm values and detail how the model to classify HAdV species was optimized using a labeled dataset of well-characterized isolates.

Hierarchical Classification Model Performance
A confusion matrix of the predicted HAdV species against known HAdV species using the optimized classification model is shown in Table 1. Values along the diagonal indicate that the prediction agreed with the known genotype for the sample, and off-diagonal elements indicate misclassifications. Table 2 summarizes the performance results of the model on the labeled dataset with the mean of the precision, recall, and F1-score across the ten cross-validation folds, along with support by HAdV species. The mean class-wise precision for HAdV-A of 0.88 indicates that when HAdV-A is predicted, it is correct 88% of the time. The mean class-wise recall for HAdV-A of 0.96 indicates that when the true classification for a sample is HAdV-A, the species is correctly predicted 96% of the time. The mean class-wise F1-score for HAdV-A of 0.92 indicates balanced accuracy as the harmonic mean of precision and recall. Support indicates the number of samples of each species included in the analysis (see the Methods section for more details of the training dataset). The macro average for each metric is the average across all classes with equal class weights and the weighted average is the average across all classes weighted by the number of samples in each class [24]. The logistic regression model finds coefficients for each assay's posterior probabilities that optimize the model's classification performance on the training dataset. In contrast, the assay design-informed model used for a sensitivity analysis sets the coefficients by the heuristic of expected assay reactivity and demonstrates 75% overall class accuracy for the same training dataset. The detailed performance results and confusion matrix for this method are shown in Appendix B, Tables A1 and A2, respectively. Figure 1A shows the optimized classification model applied to the RPP runs in Trends with an Adenovirus "Detected" result. The rate of Adenovirus "Detected" results are displayed as a rate of total RPP runs in the database. Subtypes are displayed in the stacked area under the total detections, indicating the percentage of RPP runs with an Adenovirus "Detected" result attributed to each species. Figure 1B shows the normalized percentage of each HAdV species from Figure 1A.  The detailed performance results and confusion matrix  this method are shown in Appendix B, Tables A1 and A2, respectively. Figure 1A shows the optimized classification model applied to the RPP runs Trends with an Adenovirus "Detected" result. The rate of Adenovirus "Detected" resu are displayed as a rate of total RPP runs in the database. Subtypes are displayed in t stacked area under the total detections, indicating the percentage of RPP runs with Adenovirus "Detected" result attributed to each species. Figure 1B shows the normaliz percentage of each HAdV species from Figure 1A.

Hierarchical Classification Model Performance
The mathematical framework for predicting genotypic differences applied to HAd species typing showed robust cross-validation performance in assigning correct spec labels to samples tested with the RPP, with an overall class accuracy of 95% and a standa deviation of 4% across the 10-fold cross-validation. However, the optimized classificati model performance was not equal across all HAdV species with the lowest species cla accuracy equal to 87% (see Table 2). Higher class precision and recall were observed f HAdV species B and C compared to A, D, E, and F. This may be attributed to the ass coverage of these species, as A and F and D and E often shared high posterior probabilit in similar temperature ranges. The model also demonstrates performance in line with t Xu, McDonough, and Erdman HAdV species typing assay that show an overall class curacy of 96% with the lowest species class accuracy of 83% [25].
Compared to the optimized classification model, the assay design-informed class cation model had two primary deficiencies in performance. The most discrepant resu

Hierarchical Classification Model Performance
The mathematical framework for predicting genotypic differences applied to HAdV species typing showed robust cross-validation performance in assigning correct species labels to samples tested with the RPP, with an overall class accuracy of 95% and a standard deviation of 4% across the 10-fold cross-validation. However, the optimized classification model performance was not equal across all HAdV species with the lowest species class accuracy equal to 87% (see Table 2). Higher class precision and recall were observed for HAdV species B and C compared to A, D, E, and F. This may be attributed to the assay coverage of these species, as A and F and D and E often shared high posterior probabilities in similar temperature ranges. The model also demonstrates performance in line with the Xu, McDonough, and Erdman HAdV species typing assay that show an overall class accuracy of 96% with the lowest species class accuracy of 83% [25].
Compared to the optimized classification model, the assay design-informed classification model had two primary deficiencies in performance. The most discrepant results were observed with HAdV-A and HAdV-F samples, often mistaking one for the other. It also misclassified many HAdV-E samples as HAdV-D. While the optimized classification model exhibits the worst performance with HAdV-A and HAdV-F, it avoids most of the specific biases of the assay design-informed model by using the labeled data to inform the posterior probability weights. Although the training dataset used for these species classification models was comprised of a diverse set of serotypes for each HAdV species, we would expect lower accuracy than what is demonstrated here when applied to unseen serotypes.

Generalization of the Genotyping Framework
The mathematical framework presented here was well-suited for the RPP HAdV assays. These assays were designed specifically to capture the diversity of the HAdV genome across the different species. However, it could be generalized to classify genotypic diversity using other assays, provided there is variability in Tm values inside the relevant temperature range due to sequence differences of the amplicon sequences associated with genotypic diversity. The variation in Tm values associated with genotypic diversity must be greater than the temperature resolution of the instrumentation and the run-to-run variability for a reactive sequence. The error of the predicted Tm values for a given set of reaction conditions and amplicon sequences must also be less than the differences in Tm values between the genotypic groups. If assays were designed to maximize the variation of Tm values from different genotypic groups, the performance of this method could be further improved.

Adenovirus Surveillance
The relative prevalence of HAdV species in specific subpopulations has been studied extensively [11,12,18,19,[26][27][28][29]. The surveillance of HAdV species in the greater population is currently addressed by the CDC's National Adenovirus Type Reporting System (NA-TRS) [30]. This program collects and summarizes typing results from labs and hospitals and publishes a static prevalence estimation for several time periods. NATRS has published prevalence data from 2003 to 2016 collected from 11 sites and from 2014 to 2017 collected from at least 33 sites. We applied the HAdV species classification model to the RPP patient test data collected by the Trends system from 2018 through 2021. Because HAdV species prevalence varies from year to year, and the data collection timeframe does not overlap, no direct comparison of prevalence estimated here is made to that of NATRS [14,29]. However, the relative species proportions between NATRS (2014-2017) and the model predictions of the Trends data (2018-2020) are similar with HAdV-B and HAdV-C combined, consisting of greater than 80% of HAdV relative prevalence in both estimates. The application of the HAdV species classification model to Trends data supports and augments the surveillance efforts of NATRS. The Trends database contains a large pool of "Adenovirus Detected" results (>25,000) reported from over 100 testing sites across the United States. The application of the model to an active reporting system, such as Trends, confers several benefits over traditional passive reporting. First, this application requires minimal additional effort beyond running the samples on the RPP. Second, because all results are produced by the same testing platform, they share the same sample preparation protocol, reagents, and assays, which improves data consistency. Finally, because the data are automatically processed and categorized by the system and available in near-real-time, they can be used to detect species-specific outbreaks and monitor seasonality and changes in circulation patterns. The HAdV species classification model can be continuously applied to Trends data moving forward, and as more sites are added to the platform, more detailed patterns of circulation can be discerned.
Trends data were utilized in this study to investigate United States HAdV species prevalence. While Trends data has been used to suggest pathogen circulation dynamics [4,31,32], it should be noted that the testing practices and demographics of the institutions contributing to Trends are unknown. Biases in sites contributing to the network may exist that could give Trends a skewed representation of population pathogen prevalence. These limitations need to be kept in mind when making inferences about HAdV species prevalence based on the classification model results applied to Trends data.
Investigating the predicted prevalence data produced by the model shown in Figure 1A reveals several interesting results. Immediately apparent is the large drop in positivity in the spring of 2020 due to the COVID-19 pandemic. This pattern was not limited to HAdV, however, and was observed across all positivity in the Trends network [32]. As social practices return to normal over the course of 2021, rates of Adenovirus "Detected" results increase in Trends to levels seen prior to the onset of the pandemic. It is interesting to note in Figure 1B that the estimated proportion of HAdV-B does not recover to its previous state of about 30% of Adenovirus "Detected" results after the onset of the pandemic; in 2021 it has consisted of less than 5% of total Adenovirus "Detected" results. In the same timeframe, HAdV-C increased in relative prevalence to make up approximately 70% of all Trends Adenovirus "Detected" results. Additionally, two species (HAdV-A and HAdV-F) appear to be increasing in prevalence in the absence of HAdV-B. HAdV-A has increased from a low in 2020 of 4.5% relative prevalence to a high in 2021 of 11.6%, and HAdV-F increased from a low of 1.8% in 2020 to a high in 2021 of 17.9%. As stated above, this data may not reflect the actual circulating patterns of HAdV in the general population. Although this is an interesting observation on potential changes in the prevalence of several HAdV species, further study and monitoring are needed to understand the dynamics of circulating virus populations after a major disruption in social practices and competition between similar species.

Materials and Methods
In this section, we detail the general mathematical framework for predicting genotypic differences as applied to predict HAdV species from an unknown sample. To develop the framework, we first determine the HAdV species reactivity of each RPP HAdV assay by finding a complementary alignment of the forward and reverse primer sequences for HAdV entries in the Nucleotide collection nr/nt sequence database [33]. The Tm values are predicted in silico for each expected resulting amplicon sequence. These estimated Tm values are used to form a probability density function for the Tm value of each species. We then employ Bayes' Theorem to estimate the posterior probabilities of each HAdV species given a Tm value. The results for all assays, with their associated HAdV species posterior probabilities, are the training input to a data-driven, hierarchical classification model to predict the species of an unknown sample. The classification model was optimized and the performance assessed and validated using 10-fold cross-validation on a labeled dataset. The validated model was applied to Adenovirus "Detected" results from RPP test data in the Trends system to predict HAdV species prevalence in the United States from 2018 through 2021.

Estimating Assay Reactivity
For each RPP HAdV assay, forward and reverse primer sets were used as query inputs for a BLASTn search against the nr/nt database [33]. The results from these queries were filtered to HAdV hits. We identified a predicted 2249 amplicons in HAdV genomes, where an amplicon is bounded by primers in the correct orientation and less than 1000 bp in length. The amplicons are linked to their respective taxonomic lineage, including the HAdV species. The set of HAdV species that have at least one predicted amplicon for an assay is considered to be reactive for an RPP HAdV assay. Table 3 contains the predicted reactivity of HAdV species for the five HAdV assays on the RPP.
The primer sequences for the RPP HAdV assays are not disclosed and the RPP HAdV assay names have been obfuscated for the commercial interest of bioMérieux, Inc.

Predicting Distributions of Tm Values
For a single HAdV assay on the RPP, let S i represent one of the HAdV species (A-F). For each species S i , there is a set of associated potential amplicon sequences denoted R i(j) for j = 1, 2, 3, . . . n i , with n i being the number of distinct amplicon sequences that are associated with species S i and amplified by the HAdV assay. Note that if the species is not expected to react with an assay, then n i = 0.
For each amplicon sequence R i(j) , the predicted Tm value, based on the base-pair sequence of the amplicon, is denoted µ Tm,Ri(j) . For simplicity, we use the model proposed by Howley et al. for all Tm value predictions [9]. To account for system variability due to slight changes in chemistry, instrument conditions, and sample makeup, the Tm value for each R i(j) is represented by a normally distributed random variable T Ri(j) : In Equation (1), the variance of T Ri(j) , σ 2 sys , represents the measurement error of the PCR system and is set to 0.5 • C standard deviations for all sequences in the application based on empirical evidence [6]. The distribution of expected Tm values for all expected reactive sequences of species S i can be represented as a weighted sum of all the individual sequence distributions T Ri(j) : where 0 ≤ w j ≤ 1 for all w j and ∑ j w j = 1. For this application, the weights w j are naively set to 1/n i for all j. For a given assay and species i, T Si is the random variable that describes the possible Tm values based on the reactivity of that assay with the individual sequences included from the species. The associated probability density function for T Si is f TSi .

Posterior Probabilities of a Genotype Group
To calculate the posterior probability of each species given a Tm value, x, we employ Bayes' Theorem for each S i : where P(x|x ∈ T Si ) is the likelihood of the Tm value, x, given it is from species S i . For a Tm value, x, the conditional probability that the Tm value is associated with species S i can be written: where ε is defined as the minimum resolution of the Tm detection algorithm, set to 0.01 • C for the BioFire system. Note that to account for the possibility of a reactive sequence not included in R i , we incorporate a uniform distribution S unknown~U (Tm low , Tm high ), where Tm low and Tm high are the lower and upper bounds of the melting curve analysis range for all RPP HAdV assays, set to 75 • C and 95 • C, respectively. The final set of likelihood probabilities, S all , is the union of the conditional probabilities for an observed Tm and the uniform distribution S unknown .
The prior, P(S i ), is a discrete distribution indicating the relative prevalence of HAdV species and for this application is set such that all species are equally likely. P(x) is the marginal probability calculated using the Law of Total Probability.
As an illustration of the methods, Figure 2 shows the progression from the estimated Tm values of expected reactive sequences (Figure 2A) to the likelihood distributions ( Figure 2B) and subsequent posterior probability distributions ( Figure 2C) for the RPP HAdV-5 assay. For a Tm value, each posterior probability describes the probability that the sample is of a particular species. Note that for each Tm value in the melting curve analysis range, the posterior probabilities for all species sum to one. Analogous figures for the other four HAdV assays (RPP HAdV-1, RPP HAdV-2, RPP HAdV-3, and RPP HAdV-4) can be found in Appendix A, Figures A1-A4. all RPP HAdV assays, set to 75 °C and 95 °C, respectively. The final set of likelihood abilities, Sall, is the union of the conditional probabilities for an observed Tm and t form distribution Sunknown.
The prior, P(Si), is a discrete distribution indicating the relative prevalence of species and for this application is set such that all species are equally likely. P(x marginal probability calculated using the Law of Total Probability.
As an illustration of the methods, Figure 2 shows the progression from the est Tm values of expected reactive sequences (Figure 2A) to the likelihood distribution ure 2B) and subsequent posterior probability distributions ( Figure 2C) for the RPP H 5 assay. For a Tm value, each posterior probability describes the probability that th ple is of a particular species. Note that for each Tm value in the melting curve a range, the posterior probabilities for all species sum to one. Analogous figures other four HAdV assays (RPP HAdV-1, RPP HAdV-2, RPP HAdV-3, and RPP HA can be found in Appendix A, Figures A1-A4.

Hierarchical Classification Model
The previous methods are applied to the five HAdV assays in the RPP indepen to compute the posterior probabilities for the entire melting curve analysis range probabilities are combined into a data-driven hierarchical classification model to in accuracy in predicting the species for an Adenovirus "Detected" result. Labeled da generated by running isolates of known HAdV species using the RPP. A total of 3 tests with an Adenovirus "Detected" result were sourced from three internal BioFir ies used to characterize RPP performance: the limit of detection study, the incl study, and the cross-reactivity study. The results of these studies contribute to the i tions for use of the RPP [22]. The number of replicates, serotypes, and concentrat the samples for each study are summarized in Table 4. This set of varied seroty each species, with multiple replicates of each isolate, provides a diverse training

Hierarchical Classification Model
The previous methods are applied to the five HAdV assays in the RPP independently to compute the posterior probabilities for the entire melting curve analysis range. These probabilities are combined into a data-driven hierarchical classification model to increase accuracy in predicting the species for an Adenovirus "Detected" result. Labeled data were generated by running isolates of known HAdV species using the RPP. A total of 347 RPP tests with an Adenovirus "Detected" result were sourced from three internal BioFire studies used to characterize RPP performance: the limit of detection study, the inclusivity study, and the cross-reactivity study. The results of these studies contribute to the instructions for use of the RPP [22]. The number of replicates, serotypes, and concentrations of the samples for each study are summarized in Table 4. This set of varied serotypes for each species, with multiple replicates of each isolate, provides a diverse training dataset for optimizing a data-driven classification model. For each sample in the labeled dataset, the posterior probabilities for each assay were computed from the Tm value observed from each assay. Note that the Tm values for all data in this application have been normalized against the internal controls of the RPP [34]. The posterior probabilities are combined to create a feature matrix X, and the known species from each isolate form a response vector y. X and y are used to optimize a multiclass elastic-net regularized logistic regression model to predict HAdV species. This model was optimized using an L1 ratio of 0.6 with balanced class weights, and all other parameters set to the default values in the scikit-learn v1.0.1. package with Python v3.8 [35]. The performance of the optimized classification model was assessed and validated using 10-fold cross-validation and reported as the mean and standard deviation of the accuracy across all folds. The mean of the precision, recall, and F1-scoring across the ten cross-validation folds were also used to assess overall and class-wise performance [24].
In addition to the optimized logistic regression model, we developed another HAdV species classification model as a sensitivity analysis to ensure that the data-driven model provided increased performance over a naive approach. This model is referred to as the assay design-informed classification model and sets predetermined weights for each of the posterior probabilities based on a priori knowledge of assay reactivity. The weights for the posterior probabilities are set equally across all species that an assay has expected reactivity to, as shown in Table 1. For each sample, the prediction is determined by computing the Argmax of the weighted sum of posterior probabilities [35]. Because this model was not optimized using the labeled dataset, no cross-validation was performed.

Application to Syndromic Trends
The optimized classification model was elected to be used for application to Trends due to its increased performance over the assay design-informed classification model developed for the sensitivity analysis. HAdV species prevalence was predicted from Adenovirus "Detected" results from RPP test data in the Trends system in the United States from 2018 through 2021.

Conclusions
Using a priori information and in silico analysis, we developed a mathematical framework for providing additional taxonomic resolution for pathogen targets of the highly multiplexed BioFire system. This framework is flexible enough to provide increased genotypic classification for any PCR-based test with end-point melting analysis, so long as the genotypic variability being classified results in predicted Tm value variability. We applied this framework to the problem of predicting HAdV species from RPP Adenovirus "Detected" results. The HAdV species classification accuracy is enhanced by combining the results from multiple assays in a data-driven regularized logistic regression model. The model exhibited high accuracy on the labeled dataset. As a result, the optimized classification model was applied to unlabeled data from the Trends network. These results show a marked change in both the predicted prevalence for HAdV and the species makeup with the onset of the COVID-19 pandemic. In particular, HAdV-B decreased from a pre-pandemic predicted prevalence of approximately 30% to less than 5% in 2021, which led to relative increases in predicted prevalence for HAdV-A and HAdV-F during that same timeframe.
There are two areas where we would like to continue these efforts: the further characterization and validation of the species classification model and integrating the model into protect the patient's privacy and comply with applicable privacy laws and regulations. For this work, the data are aggregated in such a manner as to make it far less likely that any patient-run data can be traced to a specific facility and/or individual. The aggregation method employed by BioFire also serves the purpose of protecting the identity of the facility which, for competitive and privacy reasons, may not want its use of the BioFire system to be made known publicly. If a dataset not included in the provided repository is requested, BioFire will review such a request internally to ensure that any disclosure does not conflict with BioFire's obligations and restrictions set forth in the DUA. BioFire is subject to the terms and conditions of a Data Use Agreement ("DUA") by and between BioFire and each facility participating in the Trends program. Primarily, the DUA is intended to protect the patient's privacy and comply with applicable privacy laws and regulations. For this work, the data are aggregated in such a manner as to make it far less likely that any patient-run data can be traced to a specific facility and/or individual. The aggregation method employed by BioFire also serves the purpose of protecting the identity of the facility which, for competitive and privacy reasons, may not want its use of the BioFire system to be made known publicly. If a dataset not included in the provided repository is requested, BioFire will review such a request internally to ensure that any disclosure does not conflict with BioFire's obligations and restrictions set forth in the DUA.     Table A1. Confusion matrix of the predicted HAdV species by the assay design-informed classification model.