1. Introduction
SARS-CoV-2 is a member of the Coronaviridae family of positive-sense ribonucleic acid (RNA) viruses. Coronaviruses induce respiratory illnesses in infected individuals that manifest as the common cold, pneumonia, or coronavirus disease-2019 (COVID-19) in the case of CoV-2 [1]. Transmission of CoV-2, which began in Wuhan, China, in late 2019, increased rapidly and demanded a concomitant expansion of molecular diagnostic techniques for the detection of virion RNA [1]. The basic detection methods rely on highly sensitive and specific quantitative reverse transcription PCR (qRT-PCR) assays [2]. These assays probe several locations across the CoV-2 RNA genome, including the Nucleocapsid, ORF1a/b, and Spike regions [1]. Once these PCR-based molecular assays were established across hospitals and laboratories worldwide, a second, more advanced molecular need emerged. Research and clinical laboratories began using next-generation sequencing (NGS) techniques for typing CoV-2 variants. Specifically, targeted sequencing technologies for CoV-2 variant identification expanded rapidly in 2022. This is at least partially due to the increased availability of user-friendly benchtop sequencers, such as the Illumina MiSeq. This implementation was, in part, an attempt to understand the molecular dynamics of CoV-2 viral mutagenesis and dissemination. The Iowa City Veterans Affairs (VA) laboratory, as part of the VA Sequencing for Research Clinical and Epidemiology (SeqFORCE) consortium, implemented sequencing technology for CoV-2 to assist treating physicians when clinically indicated. Clinical indications include vaccine breakthrough cases, reinfection with CoV-2, extended COVID-19 hospitalization, and local outbreak investigations. Outbreak investigations typically involve two or more cases of a suspected intra-facility transmission event.
An epidemiological investigation of a transmissible infection consists of a multi-step plan that includes developing exposure data with respect to place, time, and persons involved. The ultimate goal is to trace back to the source (patient zero), identify other cases, control spread, and evaluate infection control procedures [3,4]. A recent excellent report by Al Hamad et al. illustrates this type of investigation for SARS-CoV-2 [5]. Their clinical investigation allowed tracing back to patient zero (the index case) and re-evaluation of infection control procedures. While these investigations are important, as the authors mention, the addition of sequencing data would have provided more concrete evidence of strain sharing at the molecular level [5].
Methods currently exist that combine sequencing and epidemiological data to evaluate transmission dynamics. These include TransPhylo, SCOTTI, and outbreaker2, along with other quantitative methods [6,7,8,9]. However, the workflows of some of these methods may not fit well in a clinical laboratory setting, because they require personnel with a complex computational or programming background. Here we bring together simple qualitative methods for determining molecular linkage among strains suspected to be part of a transmission ring. We present three case examples that used methods based on the Nextclade and UShER algorithms, which are publicly available to clinical laboratory personnel [10,11,12]. Outbreak samples first underwent phylogenetic clustering with locally sequenced samples using Nextclade. Next, nucleotide mutational analysis was performed to determine specific shared regions that are carried through the strains of interest. The samples were then assessed against national databases (UShER) to determine global phylogenetic clustering and association with samples from similar regions. We believe these methods are effective and practical because the software is publicly available and easy to use. These procedures can be used as part of a screening process for viral molecular epidemiological analysis in outbreak investigations. If needed, the investigator can then proceed to post-screening follow-up with statistically based algorithms [6,7,8,9].
3. Defining Similarity/Dissimilarity of Outbreak Strains
The methodology comprises three approaches that together form the full molecular assessment of the strains in question. The methods need not be performed in order, but in general we expect concordant findings from the three approaches, pointing towards either similarity or dissimilarity, especially within the nucleotides of ORF1a. However, the temporal dynamics of viral mutations may in some circumstances lead to less than 100% identity within the ORF1a gene in suspected strains [13,14]. Additionally, the unavailability of recently sequenced strains (i.e., strains with close sampling dates within international databases) could lead to neutral or non-concordant results when performing global phylogenetic clustering. We discuss these limitations in the following sub-sections.
3.1. Utility of Mutational Frequencies
It is important to define the prevalence of single nucleotide mutations for qualitative analysis of outbreak strains. Mutations with very high frequencies (>65%) are less useful for qualitatively linking suspect strains, in part because confidence in an association decreases when a large number of circulating CoV-2 sequences already carry these mutations. We noticed that absolute cut-offs for low vs. high mutational prevalence within CoV-2 are not clearly defined in the literature. Some groups have attempted to define the prevalence of mutations based on normalized frequencies in conjunction with viral temporal dynamics. For example, Arevalo et al. categorized CoV-2 mutations into low, medium, and high prevalence for structural and non-structural proteins [13]. The authors studied 115 mutations that occurred above 3% prevalence. They presented high-frequency mutations as nucleotide locations that do not show reductions in prevalence greater than 1% and instead increase rapidly. The low-frequency mutations undulate temporally in prevalence but with an overall frequency below 15%. Based on this, we recommend choosing unique mutations within the outbreak samples that have a frequency below 15%, with preference given to frequencies ranging from less than 3% up to 10%. Next, due to temporal mutational dynamics, we would expect a degree of variation within each mutation site [13,14]. Thus, frequency calculations should be based on recently circulating variants, not on the bulk of all CoV-2 viral sequences. To this end, we found that the COVIDCG software allows calculation of CoV-2 mutational frequencies for set date ranges.
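The frequency cut-offs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "intermediate" label, the function names, and all example mutations other than G4181T and C6402T are hypothetical, and the example frequencies are invented values rather than COVIDCG output.

```python
from typing import Dict, List

# Cut-offs taken from the text: below 15% is treated as low prevalence,
# above 65% as too common to be informative.
LOW_MAX = 0.15
HIGH_MIN = 0.65


def classify_mutation(frequency: float) -> str:
    """Label a mutation by its prevalence (expressed as a fraction, 0 to 1)."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be a fraction between 0 and 1")
    if frequency > HIGH_MIN:
        return "high"
    if frequency < LOW_MAX:
        return "low"
    return "intermediate"  # assumption: the text does not name this band


def candidate_markers(frequencies: Dict[str, float]) -> List[str]:
    """Return the low-prevalence mutations, rarest first."""
    low = [m for m, f in frequencies.items() if classify_mutation(f) == "low"]
    return sorted(low, key=lambda m: frequencies[m])


# Illustrative values only; they are not measured frequencies.
example = {
    "G4181T": 0.70,
    "C6402T": 0.68,
    "A1234G": 0.02,  # hypothetical mutation
    "C5678T": 0.12,  # hypothetical mutation
    "T9999C": 0.30,  # hypothetical mutation
}
print(candidate_markers(example))
```

Sorting the rarest mutations first reflects the stated preference for the lowest frequencies within the sub-15% band.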
3.2. Method 1—Local Clustering
The goal of this step is to screen for signs of identity within the outbreak strains when they are compared, using Nextclade, to other sequences obtained from the same laboratory (i.e., local sequences). Conceptually, a higher number of non-outbreak samples in this comparison contributes to higher overall confidence in the screen. We therefore recommend performing the analysis alongside 40–100 samples. This allows formation of a “higher resolution” phylogenetic tree for better sample comparison. If the nucleotide sequences of the outbreak samples are drifting towards homology, they will cluster phylogenetically within the same branch or a close branch. In contrast, dissimilar samples will be dispersed at some level along the phylogenetic tree.
We suggest that the 40–100 sequences used for comparison consist of currently circulating strains with sampling dates close to those of the outbreak strains. There is more confidence in strain association if the outbreak strains cluster together in the presence of samples containing similar homologous sequences (e.g., Delta strains with similar sampling dates). This is in contrast to observing clustering of outbreak strains in the presence of older or different sequence pools that have not been exposed to the same temporal dynamics of viral mutations (e.g., Delta strains sampled months earlier). Evidence of temporally associated mutagenesis can be observed, in part, in the expansion or splitting of the Delta strains into sub-Deltas, with more than 100 AY lineages (outbreak.info).
Another consideration is that the strengths of this method are exhibited when comparing similar clades, such as Delta to Delta. When the outbreak strains belong to a clade different from the dominant circulating one (e.g., Alpha or Omicron), this alone is already highly suggestive of association, even without clustering, given that the main circulating pool of variants at the time of this writing is Delta. In this case, mutational analysis (method-2) should be performed to probe the homogeneity of CoV-2 genomic regions in the outbreak samples.
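As an illustration of how the local comparison set could be assembled before it is loaded into Nextclade, the following Python sketch picks same-clade local samples whose sampling dates fall close to the outbreak. The function name, the tuple layout, and the 30-day window are assumptions made for this example; the text only specifies 40–100 samples with close sampling dates.

```python
from datetime import date, timedelta


def select_context_samples(samples, outbreak_dates, clade,
                           window_days=30, max_n=100):
    """Pick local non-outbreak samples for the comparison tree.

    samples: list of (sample_id, clade, sampling_date) tuples.
    window_days is a hypothetical choice, not a value from the text.
    """
    first, last = min(outbreak_dates), max(outbreak_dates)
    start = first - timedelta(days=window_days)
    end = last + timedelta(days=window_days)
    midpoint = first + (last - first) // 2
    # Keep samples of the same clade sampled inside the window.
    candidates = [s for s in samples
                  if s[1] == clade and start <= s[2] <= end]
    # Prefer samples closest in time to the middle of the outbreak.
    candidates.sort(key=lambda s: abs((s[2] - midpoint).days))
    return candidates[:max_n]


outbreak_dates = [date(2021, 11, 3), date(2021, 11, 9)]
local = [
    ("S01", "Delta", date(2021, 11, 1)),
    ("S02", "Delta", date(2021, 7, 15)),   # too old for the window
    ("S03", "Alpha", date(2021, 11, 5)),   # different clade
    ("S04", "Delta", date(2021, 10, 20)),
]
context = select_context_samples(local, outbreak_dates, "Delta")
enough = len(context) >= 40  # the text recommends 40-100 samples
print([s[0] for s in context], "enough samples:", enough)
```

In practice the selected sample identifiers would be used to gather the corresponding FASTA sequences for the Nextclade run; that step is not shown here.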
3.3. Method 2—Analysis of Genomic Mutations
In this step, we show the use of the Nextclade nucleotide sequence alignment interface. The tool is used to infer signature mutations that are carried through the strains of interest, in comparison to other strains derived from alternate sources. Several genomic regions could be used for this goal. We found that, within the same clade, ORF1a is relatively unstable in genetic consensus (i.e., more informative), so that one may observe unique shared mutations only within the outbreak strains. This is in contrast to the Spike region, which was not useful (i.e., does not change) when comparing strains from a matching clade. As with phylogenetic clustering, there is more confidence when a distinctive set of mutations is detected in outbreak samples in the presence of unrelated samples with relatively homologous sequences (e.g., Delta strains with similar sampling dates). If the strains of interest belong to a different clade than the circulating dominant strain, this would already be indicative of association. Nevertheless, mutational analysis should be performed to confirm identity.
A limitation of molecular analysis is the difficulty of locating unique mutations with low frequency within the global CoV-2 sequence pool (frequency <0.15, i.e., <15%). We found that COVIDCG provides GISAID-based data with options to designate timeframes for obtaining mutational frequencies. COVIDCG data, as well as Nextclade alignment, show that the Delta variant carries high-frequency mutations in ORF1a. Examples are G4181T and C6402T, which are present at a frequency of more than 0.65 and should be excluded from comparisons. In some cases, however, the absence of high-frequency mutations may be useful for defining associations. The observation of 1–5 low-frequency shared mutations in outbreak strains is indicative of association. On the other hand, the inability to find unique mutations does not exclude association or linear viral transmission. In that case, these genomic regions should be labeled as “non-informative” and other regions should be analyzed.
Another limitation is that one should not expect all outbreak samples to be identical within the ORF1a gene (see case-2). The longer the transmission chain, the greater the likelihood of virions drifting towards dissimilarity. In this case, choosing shared low-frequency mutations will produce higher confidence in association, even if some of the samples are not fully identical. As a troubleshooting step, one can compare sampling dates between outbreak strains to determine whether one or two not fully identical strains were sampled more than one week apart in a long outbreak chain. If this is the case, then the minor differences can be attributed to viral temporal mutational dynamics.
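The search for shared, low-frequency mutations can be sketched as simple set operations on mutation lists exported from an alignment. This is a minimal illustration, not the procedure used in the case examples: the function names and data layout are assumptions, mutations other than G4181T and C6402T are hypothetical, and the frequencies are invented. The sketch uses a strict intersection, so a strain that has drifted will drop a mutation from the result; the sampling-date helper supports the troubleshooting step described above.

```python
from datetime import date


def shared_unique_mutations(outbreak, context, frequencies, low_max=0.15):
    """Mutations present in every outbreak strain, absent from context
    strains, and below the low-frequency cut-off.

    outbreak, context: dict of sample_id -> set of mutation names.
    Mutations with no known frequency are skipped (an assumption).
    """
    if not outbreak:
        return []
    shared = set.intersection(*(set(m) for m in outbreak.values()))
    elsewhere = set().union(*context.values())
    return sorted(
        m for m in shared
        if m not in elsewhere and frequencies.get(m, 1.0) < low_max
    )


def max_days_apart(sampling_dates):
    """Largest gap in days between any two outbreak sampling dates."""
    return (max(sampling_dates) - min(sampling_dates)).days


outbreak = {
    "OB1": {"G4181T", "C6402T", "C1000T", "A2000G"},
    "OB2": {"G4181T", "C6402T", "C1000T", "A2000G"},
    "OB3": {"G4181T", "C6402T", "C1000T"},  # lacks A2000G
}
context = {
    "L1": {"G4181T", "C6402T"},
    "L2": {"G4181T", "C6402T", "T3000C"},
}
frequencies = {"G4181T": 0.70, "C6402T": 0.68,
               "C1000T": 0.01, "A2000G": 0.04, "T3000C": 0.02}

markers = shared_unique_mutations(outbreak, context, frequencies)
gap = max_days_apart([date(2021, 11, 3), date(2021, 11, 5),
                      date(2021, 11, 14)])
print(markers, "max gap (days):", gap)
```

An empty result would correspond to labeling the region as “non-informative” rather than concluding that the strains are unrelated.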
3.4. Method 3–Global Clustering (UShER)
This method is similar to local clustering by Nextclade; it provides a second layer of confirmation of strain association. First, if samples are genetically similar or identical, they will cluster together against a global CoV-2 sequence base. Second, we have observed that outbreak CoV-2 sequences tend to home, or co-cluster, near other samples sequenced from the same sending state. The latter would be indicative of association and may link the outbreak to other related small sub-outbreaks occurring within these regions. A limitation is that homing to the state of origin depends on the availability of recent CoV-2 sequences from that region for comparison with the samples in question. Therefore, not observing co-clustering with samples from the same sending state should not be taken as evidence of dissimilarity. Instead, overall results should be integrated from all three methods.
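One possible way to record and integrate the three qualitative findings is sketched below. The labels, the function name, and the decision rule are assumptions introduced for illustration; the text only states that the methods should generally agree and that a non-informative result must not be read as dissimilarity.

```python
ALLOWED = {"similar", "dissimilar", "non-informative"}


def integrate_findings(local_clustering, mutation_analysis,
                       global_clustering):
    """Combine the three qualitative findings into one summary.

    Each argument is one of "similar", "dissimilar", "non-informative".
    The rule below is a hypothetical convention, not a validated one.
    """
    findings = [local_clustering, mutation_analysis, global_clustering]
    for finding in findings:
        if finding not in ALLOWED:
            raise ValueError(f"unexpected finding: {finding!r}")
    # Non-informative results are set aside, not counted against linkage.
    informative = [f for f in findings if f != "non-informative"]
    if not informative:
        return "inconclusive"
    if all(f == "similar" for f in informative):
        return "consistent with association"
    if all(f == "dissimilar" for f in informative):
        return "consistent with no association"
    return "discordant, review further"


summary = integrate_findings("similar", "similar", "non-informative")
print(summary)
```

A discordant summary would be a natural point at which to analyze additional genomic regions or move to the quantitative follow-up methods.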
5. Discussion
In this manuscript, we described the utility of publicly available algorithms in order to assist epidemiological investigations. Here, molecular analysis of CoV-2 outbreak sequences can be assessed at multiple levels (i.e., local/global phylogenetic clustering, and specific mutational analysis). This process yields a three-layer qualitative analysis that, in many cases, produces confident results in terms of defining strain similarity/dissimilarity.
It is important to note that, as with other diagnostic laboratory tests, the methods presented here should serve only as an outbreak screen, in part because of the limitations mentioned above. For example, one cannot produce a quantitative estimate of strain relatedness, since no direct statistical comparison is involved beyond qualitative grouping of the samples based on genetic similarities (Nextclade, UShER) [10,11,12]. In this case, the clinical laboratory may need to extend its findings with other available methodologies involving statistical computing [8,9]. These include TransPhylo, SCOTTI, outbreaker2, and other quantitative methods [6,7,8,9]. However, while these methods can provide a quantitative evaluation, they require expertise in computational methods and/or R programming.
Another limitation is the focused use of the CoV-2 ORF1a region. We have found that the SARS-CoV-2 ORF1a region shows a degree of nucleotide instability (i.e., it is more informative) when compared to the Spike region. This can be true even between strains of the same sub-lineage, which has been observed for both Delta and Omicron strains. Regions with relative genomic instability, such as ORF1a, can be used to compare outbreak strains qualitatively, with low-frequency mutations as the first-tier choice. However, we have found that other regions (e.g., Gene M) may sometimes assist in identifying group similarities when the ORF1a region is identical between all samples. We refer the reader to a discussion on the subject of ORF1a use [9].
Care should be taken when evaluating mutational frequencies using COVIDCG or any other frequency-producing software [15]. We initially hypothesized, and confirmed through our outbreak analyses, that mutational frequencies for outbreak samples drift temporally. The temporal drift is observed even within the same CoV-2 lineages. Therefore, mutational frequencies must be based on the specific outbreak period. For example, if an outbreak occurred in November 2021, then compiling or averaging mutational frequencies from January 2021–September 2021 may not produce accurate ORF1a frequencies. In some cases this can be misleading, in that a truly rare mutation would be considered common, or a truly common mutation would be considered rare. For the November 2021 example, a better COVIDCG date range would be August 2021–November 2021. Narrower date ranges provide frequencies that better approach the true values for variants circulating at the time of the outbreak. However, a narrower date range will contain fewer samples. We therefore advise the outbreak investigator to compare both a narrow range (October 2021–November 2021) and a slightly wider range (August 2021–November 2021). If the two ranges provide similar numbers, then use of the wider date range is recommended, as it contains a higher sample number. Overall, since the methods mentioned here are only qualitative, slight deviations in frequencies will not affect the final outcome of the investigation.
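The comparison of a narrow and a wider date range can be expressed as a short calculation. The Python sketch below works on a local list of dated records and does not query COVIDCG; the function names, the record layout, the 0.05 tolerance for "similar numbers", and the fallback to the narrow range when the two differ are all assumptions made for illustration.

```python
from datetime import date


def frequency_in_range(records, mutation, start, end):
    """Return (frequency, sample_count) for a mutation in a date range.

    records: list of (sampling_date, set_of_mutations) tuples.
    """
    in_range = [muts for d, muts in records if start <= d <= end]
    if not in_range:
        return None, 0
    carrying = sum(1 for muts in in_range if mutation in muts)
    return carrying / len(in_range), len(in_range)


def choose_range(records, mutation, narrow, wide, tolerance=0.05):
    """Prefer the wide range when both ranges give similar frequencies.

    tolerance is a hypothetical threshold; the text gives no number.
    """
    narrow_freq, _ = frequency_in_range(records, mutation, *narrow)
    wide_freq, _ = frequency_in_range(records, mutation, *wide)
    if wide_freq is None:
        return None  # no data in either range
    if narrow_freq is None or abs(narrow_freq - wide_freq) <= tolerance:
        return "wide"
    return "narrow"  # assumption: fall back to the outbreak period


# Invented records; "C1000T" is a hypothetical mutation.
records = [
    (date(2021, 8, 10), {"C1000T"}),
    (date(2021, 9, 5), set()),
    (date(2021, 9, 20), set()),
    (date(2021, 10, 12), {"C1000T"}),
    (date(2021, 10, 25), set()),
    (date(2021, 11, 8), set()),
    (date(2021, 11, 20), set()),
]
narrow = (date(2021, 10, 1), date(2021, 11, 30))
wide = (date(2021, 8, 1), date(2021, 11, 30))
choice = choose_range(records, "C1000T", narrow, wide)
print(choice)
```

With real data the two frequencies would be read from the frequency-producing software for each date range, and only the comparison step would apply.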
In conclusion, we propose that, along with qualitative analysis, clinical laboratories should consider quantitative probability-based models to estimate the likelihood of relatedness among outbreak strains. Such a model should include inputs for outbreak sample number, mutational frequencies, and potentially spatiotemporal dynamics. Overall, the above methods can function as a platform for future refinements and developments but may also be used as is to define strain association in localized viral outbreaks.