Characterization of Molecular Cluster Detection and Evaluation of Cluster Investigation Criteria Using Machine Learning Methods and Statewide Surveillance Data in Washington State

Molecular cluster detection can be used to interrupt HIV transmission but is dependent on identifying clusters where transmission is likely. We characterized molecular cluster detection in Washington State, evaluated the current cluster investigation criteria, and developed a criterion using machine learning. The population living with HIV (PLWH) in Washington State, those with analyzable genotype sequences, and those in clusters were described across demographic characteristics from 2015 to 2018. The relationship between 3- and 12-month cluster growth and demographic, clinical, and temporal predictors was described, and a random forest model was fit using data from 2016 to 2017. The ability of this model to identify clusters with future transmission was compared to that of the Centers for Disease Control and Prevention (CDC) and Washington State criteria in 2018. The population with a genotype was similar to all PLWH, but people in a cluster were disproportionately white, male, and men who have sex with men. The clusters selected for investigation by the random forest model grew on average 2.3 cases (95% CI 1.3–3.2) in 3 months, which was not significantly more than the clusters selected by the CDC criterion (2.0 cases, 95% CI 0.5–3.4). Disparities in the cases analyzed suggest that molecular cluster detection may not benefit all populations. Jurisdictions should use auxiliary data sources for prediction or continue using established investigation criteria.


Introduction
In the past five years, progress in reducing HIV incidence has slowed in the United States [1,2]. In response, the United States federal government launched the Ending the HIV Epidemic Initiative, which sets the goal of reducing new HIV infections by 90% within 10 years. One of the four major components of the initiative is "Respond quickly to potential HIV outbreaks to get needed prevention and treatment services to people who need them" in an effort to reduce transmission [3]. Molecular cluster detection is a component of this.
Molecular epidemiological approaches are used to identify groups of individuals (clusters) with closely related HIV viruses, which may reflect groups of people with ongoing HIV transmission, and lead to prevention resources being directed to those groups [4,5]. The existence of a cluster does not mean that transmission is occurring, however, as some clusters may represent transmission that occurred many years in the past [6]. Similarly, transmission may occur independently of identifiable clustering; incomplete genetic data or late sampling of genetic data (potentially after changes to genetic sequences from selective pressure or genetic drift) can hide genetic linkages [7].
While molecular cluster detection can be a useful epidemiological tool, its potential for real-time effectiveness relies partly on the ability to distinguish clusters where transmission is occurring from clusters that are stable. Many state and local health jurisdictions do not have the resources to review and investigate every cluster; prioritization of clusters should therefore be based on the resources available and potential cases averted. The general prioritization criterion suggested by the Centers for Disease Control and Prevention (CDC) for medium- and high-burden jurisdictions (>4000 people living with HIV) is to investigate clusters that have grown by five cases linked by a genetic distance of 0.5% (0.005 substitutions per site) in the past 12 months [8]. This criterion is not sensitive to differences in local HIV transmission dynamics, and its appropriateness across jurisdictions or varying resources and priorities is unclear. The Washington State Department of Health currently manually reviews all clusters that have grown by three cases linked by a genetic distance of 1.5% (0.015 substitutions per site) in the last 12 months. Two other investigation criteria are under consideration for use by Washington State: 3 cases in the past 12 months linked by a genetic distance of 0.5% and 5 cases in the past 12 months linked by a genetic distance of 1.5% [9,10].
An ideal cluster investigation criterion would be a strong predictor of cluster growth and would be flexible enough to accommodate varying levels of health department resources. It would also include only data that are readily accessible to surveillance programs to minimize the burden of prediction. Previous efforts to predict cluster growth have been successful in a number of jurisdictions, though some have demonstrated that predictors of cluster growth change in performance over time, which indicates that static models may not be suitable tools for describing a dynamic transmission environment [11,12]. Random forest modeling is a supervised learning algorithm that requires minimal input and has demonstrated considerable predictive success in other public health settings. Unlike traditional regression models, random forests perform well without input such as individual variable selection and fitting [13,14]. The model's flexibility and ease of implementation make it an appealing choice for the development of a tool that can be tailored to local settings and updated continuously as the context of the HIV epidemic changes.
The objectives of this study were to characterize current molecular cluster detection and growth in Washington State, to examine the performance of the CDC and current Washington State criteria for cluster review, and to assess the performance of random forest modeling using core surveillance data as a tool to prioritize cluster investigation in Washington State. By identifying successful criteria for cluster investigation, the results of our study can be used to optimize cluster review and intervention and to fully leverage the power of molecular cluster detection to prevent HIV transmission.

Study Design and Data Collection
We conducted a retrospective study of active clusters of molecularly linked HIV cases in Washington State from 2015 to 2018. Demographic, risk, HIV genetic sequence, and HIV viral load information were collected as part of core HIV surveillance by the Washington State Department of Health.
Washington State is considered a "moderate prevalence" jurisdiction; in 2018 the HIV prevalence was 13,417 and there were 511 incident cases reported within the state [2,15]. Of these incident cases, 20% had reported previous positive tests and may have been diagnosed in another state and were not in care or were from another country [16]. Washington State has had name-based HIV case reporting and mandatory reporting of CD4 test and all HIV viral load (detectable and undetectable) results since 2006 [15]. HIV genotype sequence reporting was not mandatory during the study period, potentially impacting completeness of molecular cluster membership [17]. Additionally, resistance testing was not ordered for all new HIV diagnoses, also impacting completeness [18].
HIV genetic sequences were used to identify clusters of molecularly linked HIV cases using the CDC-recommended HIV-TRACE software. Full documentation of the methodology to generate these clusters has been previously described [8,19]. Briefly, the pairwise genetic distances between protease and reverse transcriptase sequences from every pair of individuals were calculated using the Tamura-Nei 93 nucleotide substitution model [20]. For this analysis, we used only the HIV sequence obtained from each individual's first sample (when HIV sequences were collected from multiple samples of the same individual). Cases were considered molecularly linked if the genetic distance between their sequences was 1.5% or less. Active clusters were defined as HIV clusters of three or more individuals with molecular links in Washington State at any point between 2016 and 2018. Clusters with fewer than three individuals are not routinely monitored in Washington State.
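As a rough illustration of the linkage rule described above (not the HIV-TRACE implementation itself), the following Python sketch treats every pair of cases at or below the 1.5% threshold as linked and returns the connected groups of three or more cases. The function name `find_clusters`, the input format, and the example distances are hypothetical; the pairwise distances are assumed to have been computed elsewhere.

```python
from collections import defaultdict


def find_clusters(pairwise_distances, threshold=0.015, min_size=3):
    """Group cases into clusters of molecularly linked cases.

    pairwise_distances: dict mapping (case_a, case_b) -> genetic distance
    in substitutions per site (assumed to be precomputed).
    Returns a list of sets, one per cluster with at least min_size cases.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in pairwise_distances.items():
        root_a, root_b = find(a), find(b)
        # Link the two cases only if they are within the distance threshold.
        if dist <= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for case in list(parent):
        groups[find(case)].add(case)
    return [members for members in groups.values() if len(members) >= min_size]


# Hypothetical example: A-B and B-C are linked, so A, B, and C form one
# cluster even though A and C are more than 1.5% apart.
example_distances = {
    ("A", "B"): 0.004,
    ("B", "C"): 0.012,
    ("A", "C"): 0.020,
    ("C", "D"): 0.030,
    ("D", "E"): 0.010,
}
clusters_015 = find_clusters(example_distances, threshold=0.015)
clusters_005 = find_clusters(example_distances, threshold=0.005)
```

In this toy example the 1.5% threshold yields a single cluster of three cases, while the 0.5% threshold yields none, which illustrates why the choice of distance cutoff changes how many clusters are available for review.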

Variables
On a monthly basis, we described active clusters across the following dimensions: size (number of cases living in Washington), viremic cases (number of cases with last viral load >200 copies/mL or no viral load in last 12 months), late diagnoses (number of cases with first post-diagnosis CD4 count <200 cells/mm³), gender (number of female vs. non-female cases), transmission risk (number of cases with reported injection drug use vs. no reported injection drug use), race (number of white cases vs. non-white cases), and time since diagnosis. Cases were determined to be living in Washington from addresses reported to the health department; departure from Washington State was estimated to be at the midpoint between a person's last reported residence in Washington State and their first reported residence outside of Washington State. The number of viremic cases was assessed using the cases' most recent viral load value at the current month of analysis. These variables were selected for their availability in surveillance data, associations with transmission rate in Washington State, and their ability to describe historic clusters with rapid growth in Washington State. The variable categories were selected to accommodate population size and to preserve contrasts of particular interest (e.g., transmission risks of men who have sex with men (MSM), heterosexual contact, pediatric exposure, etc. were all categorized as no injection drug use).
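The viremia definition above can be sketched in a few lines of Python. This is an illustrative reading of the definition rather than the surveillance program's code: the function name `is_viremic`, the tuple layout, and the example values are hypothetical, and "12 months" is approximated as 365 days.

```python
from datetime import date


def is_viremic(last_vl_value, last_vl_date, as_of, threshold=200, window_days=365):
    """Return True if a case counts as viremic at the analysis date.

    A case is viremic if its most recent viral load exceeds the threshold
    (copies/mL) or if it has no viral load in the window before as_of.
    Assumption: the 12-month window is approximated as 365 days.
    """
    if last_vl_date is None or (as_of - last_vl_date).days > window_days:
        return True  # no viral load reported in the last 12 months
    return last_vl_value > threshold


# Hypothetical cluster: (most recent viral load, date of that result)
example_cases = [
    (50, date(2018, 3, 1)),     # suppressed, recent result
    (1500, date(2018, 5, 10)),  # detectable above 200 copies/mL
    (40, date(2016, 11, 2)),    # suppressed, but result is too old
    (None, None),               # no viral load on record
]
analysis_date = date(2018, 6, 30)
viremic_count = sum(is_viremic(v, d, analysis_date) for v, d in example_cases)
```

Under these assumptions three of the four example cases are counted as viremic, two of them only because no recent result is available.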
These metrics were expressed as counts, percent of overall cluster size, and percentiles (quartiles for cluster size and viremia, and above and below the median for the number of late diagnoses, gender, transmission risk, and race). To simulate the information available at the time of analysis, cases were included in the cluster based on the date their genotype sequences were received by the Washington State Department of Health.
In addition to point-of-time information, CDC and Washington State candidate criteria for cluster investigation were generated for each month: 3 cases in the previous 12 months at 0.5% genetic distance, 5 cases in the previous 12 months at 0.5% genetic distance, 3 cases in the previous 12 months at 1.5% genetic distance, and 5 cases in the previous 12 months at 1.5% genetic distance (Table 1).
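The four candidate criteria can be expressed as simple counts over a trailing 12-month window. The sketch below is one possible way to compute them for a single cluster at a single analysis month; the names `recent_growth`, `evaluate_criteria`, and the example dates are hypothetical, and the window is counted in whole calendar months, which is an assumption not stated in the text.

```python
from datetime import date

# (genetic distance threshold, minimum new cases in the previous 12 months)
CRITERIA = {
    "3 cases at 0.5%": (0.005, 3),
    "5 cases at 0.5% (CDC)": (0.005, 5),
    "3 cases at 1.5% (Washington State)": (0.015, 3),
    "5 cases at 1.5%": (0.015, 5),
}


def month_index(d):
    """Map a date to a running count of calendar months."""
    return d.year * 12 + d.month


def recent_growth(added_dates, as_of, months=12):
    """Count cases added in the `months` calendar months ending at as_of."""
    end = month_index(as_of)
    return sum(end - months < month_index(d) <= end for d in added_dates)


def evaluate_criteria(added_by_threshold, as_of):
    """Return a dict of criterion name -> True/False for one cluster-month.

    added_by_threshold: dict mapping distance threshold -> list of dates on
    which cases were added to the cluster at that threshold.
    """
    return {
        name: recent_growth(added_by_threshold[dist], as_of) >= min_cases
        for name, (dist, min_cases) in CRITERIA.items()
    }


# Hypothetical cluster: three recent additions at 0.5%, five at 1.5%.
example_added = {
    0.005: [date(2017, 9, 12), date(2018, 1, 5), date(2018, 4, 20)],
    0.015: [
        date(2017, 3, 2), date(2017, 9, 12), date(2017, 11, 30),
        date(2018, 1, 5), date(2018, 4, 20), date(2018, 6, 1),
    ],
}
flags = evaluate_criteria(example_added, as_of=date(2018, 6, 30))
```

In this example the cluster meets every criterion except the CDC criterion of five cases at 0.5%, since the March 2017 addition falls outside the 12-month window and only three cases are linked at the tighter threshold.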

Analysis
To characterize molecular cluster detection in Washington State, we described the state's molecular cluster detection program in terms of number of active clusters, the proportion of cases (prevalent and incident) with a valid genotype sequence reported to the state, the proportion of cases in any cluster (including historical clusters not analyzed in this study), the proportion of cases in active clusters, the median days from HIV diagnosis to specimen collection of genotype sequences, and the median time from HIV diagnosis to HIV-TRACE analysis.
To characterize the study population, we compared the population of people living with HIV (PLWH) with genotypes (who could potentially be in a cluster) and the population of PLWH in active clusters to the prevalent population of PLWH in Washington State in terms of gender, transmission risk, and age on 31 December 2018.
We measured cluster growth as the number of newly diagnosed cases added to the cluster in the subsequent 3 and 12 months in Washington State, regardless of when the sequence was identified as a member of the cluster. In this case, diagnosis is used as a proxy for transmission, and such cases would be considered potentially preventable during a cluster intervention. Clusters were stratified by posited predictors of cluster growth (cluster size, virally suppressed individuals, % not suppressed, the CDC/Washington State candidate historical criteria, number of cases with a first CD4 count <200, gender, transmission risk, race, and time since diagnosis), which were updated monthly. The number of newly diagnosed cases added to each cluster in the subsequent 3- and 12-month periods was calculated both on an absolute scale and as a rate per 100 person-months. Differences between categories, confidence intervals, and p-values were calculated using a repeated-measures generalized estimating equation with a Poisson distribution.
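The difference between the absolute scale and the rate scale can be made concrete with a short calculation. The sketch below assumes that person-time is the cluster size at the start of follow-up multiplied by the number of months followed; the text does not spell out the denominator, so this is an assumption, and the function name and example numbers are hypothetical.

```python
def growth_rate_per_100_person_months(new_cases, cluster_size, months):
    """Return new diagnoses per 100 person-months of cluster follow-up.

    Assumption: person-months = cluster size at baseline * months followed.
    """
    person_months = cluster_size * months
    if person_months <= 0:
        raise ValueError("cluster_size and months must be positive")
    return 100.0 * new_cases / person_months


# Hypothetical clusters followed for 3 months.
small_rate = growth_rate_per_100_person_months(new_cases=1, cluster_size=3, months=3)
large_rate = growth_rate_per_100_person_months(new_cases=2, cluster_size=20, months=3)
```

Here the larger cluster adds more cases in absolute terms (2 vs. 1) but has the lower rate, because its growth is spread over many more person-months.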

Prediction of Cluster Growth
A random forest regression model was generated to predict the number of newly diagnosed cases added to the cluster in the subsequent three months using data from 2016 to 2017. The following variables were included in the model: number of cases per cluster living in Washington State; number of viremic individuals per cluster; number of new cases in the previous 1, 3, and 12 months at both 0.015 and 0.005 genetic distance; number of late diagnoses; number of white cases; number of cases described as injection drug use transmission risk; number of female cases; and number of cases that had been diagnosed <1 year prior. A random search was used to identify the optimum number of candidate variables at each split. Five hundred trees were estimated. A second model was fit to predict the rate of cluster growth per 100 person-months. Regression was performed using the RandomTree package in R [21].
To create an investigation criterion from this model, we derived a cutoff for predicted cluster growth at which an investigation should be initiated. For comparability with the existing criteria, we sought a criterion that would yield approximately the same number of investigations as the existing criteria. To identify this target number of investigations, we counted the number of investigations initiated by the existing criteria from 2016 to 2017 (n). To translate this into a cutoff for the random forest models, we ranked the clusters by their highest monthly predicted growth from the random forest model. The predicted growth of the nth-ranked cluster was selected as the cutoff for investigation, as this value would yield n investigations from 2016 to 2017.
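The cutoff derivation described above amounts to ranking clusters by their peak predicted growth and reading off the nth value. A minimal Python sketch follows; the function name `derive_cutoff`, the dictionary layout, and the predicted values are hypothetical and are not taken from the fitted model.

```python
def derive_cutoff(monthly_predictions, n_investigations):
    """Return the predicted-growth cutoff that would flag n clusters.

    monthly_predictions: dict mapping cluster id -> list of monthly
    predicted growth values over the derivation period.
    """
    # Rank clusters by their highest monthly prediction, largest first.
    peaks = sorted((max(p) for p in monthly_predictions.values()), reverse=True)
    if not 1 <= n_investigations <= len(peaks):
        raise ValueError("n_investigations must be between 1 and the number of clusters")
    return peaks[n_investigations - 1]


# Hypothetical predictions for five clusters.
example_predictions = {
    "c1": [0.2, 0.5, 2.6],
    "c2": [1.1, 0.9],
    "c3": [0.1, 0.3],
    "c4": [2.3, 1.0],
    "c5": [0.4, 0.8],
}
cutoff = derive_cutoff(example_predictions, n_investigations=2)
flagged = sorted(
    cid for cid, preds in example_predictions.items() if max(preds) >= cutoff
)
```

With a target of two investigations, the cutoff is the second-highest peak prediction, and exactly two clusters meet or exceed it. If several clusters tie at the cutoff value, more than n clusters would be flagged, so ties may need a separate rule.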
Using this technique, we derived two investigation criteria to match the activity of the most stringent and least stringent CDC and Washington State investigation criteria (five cases at a genetic distance of 0.005 and three cases at a genetic distance of 0.015).

Evaluation of Investigation Criteria
The CDC cluster investigation criterion, Washington State cluster investigation criterion, and random forest criteria were evaluated for 2018 on the basis of growth of clusters indicated for investigation. The number of new diagnoses in the subsequent three months and the rate of new diagnoses per 100 person-months were calculated for clusters that met each criterion. To simulate a real public health intervention, clusters were only eligible for investigation once. Criteria that identified clusters with higher growth over the subsequent three-month period were considered the best.
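The evaluation rule above, in which each cluster is eligible for investigation only once, can be sketched as follows. The record layout, the function name `evaluate_criterion`, and the example values are hypothetical; the sketch only illustrates counting each cluster at the first month it meets a criterion and averaging the growth observed over the following three months.

```python
def evaluate_criterion(records):
    """Summarize growth of clusters flagged by one investigation criterion.

    records: iterable of (cluster_id, month, meets_criterion, growth_next_3)
    tuples, one per cluster-month, in any order.
    Returns (number of clusters flagged, mean growth at first flag or None).
    """
    first_hit = {}
    for cluster_id, month, meets, growth in sorted(records, key=lambda r: r[1]):
        # A cluster is counted only the first time it meets the criterion.
        if meets and cluster_id not in first_hit:
            first_hit[cluster_id] = growth
    n = len(first_hit)
    mean_growth = sum(first_hit.values()) / n if n else None
    return n, mean_growth


# Hypothetical cluster-months for one evaluation year.
example_records = [
    ("A", 1, False, 0), ("A", 2, True, 3), ("A", 3, True, 1),
    ("B", 1, True, 1), ("B", 2, True, 0),
    ("C", 1, False, 0), ("C", 2, False, 2),
]
n_flagged, mean_growth = evaluate_criterion(example_records)
```

In the example, clusters A and B are each counted once, at months 2 and 1 respectively, and cluster C is never flagged, so later months in which A and B still meet the criterion do not contribute to the mean.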
The data collected and analyses conducted under HIV/AIDS surveillance authority as an evaluation of the surveillance system are not considered research. No data which could identify individuals are presented.

Cluster Detection in Washington State
Between 2015 and 2018, 57% (1058 of 1847) of new cases had a reverse transcriptase or protease sequence reported to the state. Only 49% (7373 of 15,150) of prevalent cases in that time frame had sequences. The median time from diagnosis to genotype specimen collection was 14 days (IQR 6-31), and the median time from diagnosis to HIV-TRACE analysis was 291 days (IQR 138-714). From 2015 to 2018, these delays decreased from a median of 17 to 11 days and from 821 to 109 days, respectively (Table 2). People with a genotype were more likely to be under 45 years of age (41% vs. 33%, p < 0.01) (Table 3).
Using the 1.5% genetic distance threshold, 107 clusters were present during the study time period, which represented 22% of new HIV cases and 8% of prevalent HIV cases (Table 2). Compared to the total population of PLWH in Washington, those within clusters of three or more prevalent cases were more likely to be recently diagnosed (47% diagnosed in the last five years vs. 21%), male (93% vs. 83%, p < 0.01), white (64% vs. 57%, p < 0.01), described as MSM transmission risk (74% vs. 60%, p < 0.01), and under 45 years old (68% vs. 34%, p < 0.01) (Table 3).

Predictors of Cluster Growth
In 2016 and 2017, the mean number of new cases added to clusters was 0.24 (95% CI 0.18-0.33) per three-month period and 1.22 (95% CI 0.86-1.73) per 100 person-months. Over a three-month period, larger clusters and clusters with more viremic individuals grew faster on an absolute scale. Clusters with three or fewer members grew at an average of 0.17 cases (95% CI 0.11-0.17), while clusters with 12 or more members grew at an average of 0.37 cases (95% CI 0.24-0.58). Clusters with zero viremic individuals grew at an average of 0.13 cases (95% CI 0.08-0.20), while clusters with three or more viremic individuals grew at an average of 0.38 cases (95% CI 0.26-0.57). As a rate, the only significant predictor of growth was cluster size, where larger clusters grew more slowly (0.43, 95% CI 0.33-0.58 cases per 100 person-months in clusters of 12 members or more vs. 2.07, 95% CI 1.29-3.34 cases in clusters with three or fewer members).
Clusters meeting the CDC criterion of five cases in the previous 12 months at a genetic distance of 0.005 had the highest growth on the absolute scale (1.50, 95% CI 0.79-2.86) and a high growth rate (2.93 per 100 person-months, 95% CI 0.79-10.86), but these values were not significantly different from those of clusters that did not meet this criterion (Table 4). Assessed at 12 months of growth, clusters showed similar but attenuated trends (Table 5).

Derivation and Evaluation of Investigation Criteria
The CDC criteria identified four clusters from 2016 to 2017 that would prompt investigation. To create a criterion that would produce a similar volume of investigations, a cutoff was selected for the random forest model (i.e., the predicted number of newly diagnosed cases in the subsequent three months that would prompt investigation). The criterion to prompt investigation was set at 2.3 predicted cases. This value would have initiated the same number of investigations on unique clusters from 2016 to 2017. A looser criterion was set at 0.9 predicted cases, as this would have matched the 24 clusters investigated under the Washington State Department of Health investigation criterion. For growth per 100 person-months, criteria of 0.14 and 0.42 were selected. The first four layers of an example tree sampled from the random forest are presented in Figure 1. Of note, this represents one of the 500 trees that were combined to create the prediction model.
In 2018, there were six clusters that met the CDC investigation criterion, which had on average 28.3 (95% CI 5.7-62.4) members and grew by 2.0 (95% CI 0.5-3.4) members in the subsequent three months. The clusters selected for investigation by the random forest model did not grow significantly faster. The model selected four clusters with a mean of 31.8 (95% CI 20.5-42.9) members that grew by 2.3 (95% CI 1.3-3.2) members in the subsequent three months. There were 17 clusters that met the Department of Health investigation criterion; these had a mean size of 20.5 (95% CI 8.3-32.6) and grew by an average of 1.4 cases (95% CI 0.8-2.0). The clusters selected by the random forest model with a looser criterion did not grow significantly faster (n = 15, mean size = 19.9, 95% CI 16.8-23.1, mean growth = 1.3, 95% CI 1.1-1.4). All criteria selected clusters that grew significantly faster than the statewide cluster growth of 0.3 cases per three months.
Prediction of cluster growth per 100 person-months was poor, and only the random forest with the looser criterion selected clusters that grew significantly faster than the statewide cluster growth of 1.4 cases per 100 person-months. This model detected 20 clusters with a mean size of 7.5 (95% CI 6.1-8.8) and a mean growth of 3.4 (95% CI 2.8-4.0) per 100 person-months. The random forest model with the stricter criterion only identified one cluster, which did not grow (Table 6).

Discussion
Molecular cluster detection of HIV has the potential to guide public health intervention to populations of the greatest need. This is dependent on the completeness of genetic sequence data and the ability to identify molecular clusters where transmission is likely to occur. In this analysis, we demonstrated the disparities in the applicability of molecular cluster detection across populations in Washington State and the challenge of predicting cluster growth from core surveillance data. Finally, we suggest that the established criteria for prediction of cluster growth based on cases in the previous 12 months continue to be used in the absence of additional data sources.
Molecular cluster detection is dependent on the timeliness and completeness of the sequence data included in the analysis [17]. From 2015 to 2018, there was an 87% decrease in the median time from a person's HIV diagnosis to analysis in HIV-TRACE. In the same time period, Washington State was near the CDC target of 60% completeness of sequence data in new HIV cases (57%) but only achieved 49% in prevalent cases. The majority of incident and prevalent cases do not have an observed molecular link within the state (71% and 82%, respectively). While the population of PLWH with analyzable genotype sequences was broadly similar to the overall population of PLWH in Washington State, there were stark disparities in the populations who were in analyzable clusters of three or more individuals. People in analyzable clusters are more likely to be white, male, and MSM. In Washington State, these are populations with decreasing incidence counts and a higher proportion of viral suppression, which suggests that they may not be the populations who could most benefit from interventions informed by molecular cluster detection. The disparity between the population with analyzable genotype sequences and the population in analyzable clusters may be tied to disparity in time since diagnosis, which could hide a difference in the distribution of cases with sequences in more recent years. There may also be differences in the network structure of demographic groups that affect the number of analyzable clusters.
Completeness of collection of molecular sequences can be a challenge for health departments as there are multiple points in the process that can result in a sequence not being available for analysis: an individual is not diagnosed, the individual is not linked to care, a resistance test is not ordered at the time of initial linkage to care, there is a problem with the specimen collection, there is a problem with specimen storage or transportation, there is a problem at the time of running the test, or the sequence is not transmitted to the health department. Efforts to improve the completeness (to increase the percentage of new diagnoses with reported sequences) and efficiency (to decrease the amount of time for inclusion of sequences in analysis) of the process are important to the utility of the program.
The random forest model used in this analysis did not predict cluster growth significantly better than the CDC or Washington State criteria. This demonstrates the limitations in predictive power of HIV core surveillance and molecular cluster detection in isolation. Previous successful prediction efforts have used partner services data to supplement these information sources, but some jurisdictions may have limited access to such data through structural or legal barriers [12]. In addition to partner services data, information about cluster topology and data from prior cluster and data for care investigations could be useful in prediction of cluster growth. The performance of our prediction model (relative to the existing criteria) was likely hurt by the rapid growth of an outbreak cluster among heterosexual persons who were living homeless and injecting drugs in King County in 2018, which was unlike the baseline cluster growth in years past [10]. This is the type of event that would be important to predict, however, and highlights the limitations of prediction models in the domain of rare events [22]. A larger dataset, covering either a longer time period or a larger geographic region, would be useful for predicting such events.
In the absence of complete or readily available adjunct data from partner services investigations, the results of this study suggest that jurisdictions use the established criteria based on cluster growth in the past 12 months. Looser criteria (e.g., smaller numbers of individuals at a less restrictive genetic distance cutoff) may have the added benefit of giving staff a chance to identify non-surveillance indicators that are important in the local transmission networks.
Notably, none of the criteria examined in our analyses was able to predict cluster growth as a rate. Jurisdictions that are concerned with cases prevented as a function of investigated cases may need to seek criteria based on information outside of core surveillance.
For cluster detection and response to be most effective, it should incorporate robust core surveillance, including an ongoing understanding of changes in HIV transmission trends and complete case and lab reporting; robust partner services activities with disease investigators who are familiar with the profile of their local cases; and efficient and complete molecular data collection and analysis [10,12,23]. There should also be transparency and dialogue with the community about the activities being undertaken and why they are important. It is only with this combination of information and efforts that programs will be able to make effective use of cluster detection to decrease HIV transmission early and focus prevention efforts.