Characterization of Near Full-Length Transmitted/Founder HIV-1 Subtype D and A/D Recombinant Genomes in a Heterosexual Ugandan Population (2006–2011)

Detailed characterization of transmitted HIV-1 variants in Uganda is fundamentally important to inform vaccine design, yet studies on the transmitted full-length strains of subtype D viruses are limited. Here, we amplified single genomes and characterized viruses transmitted in Uganda between December 2006 and June 2011, some of which were previously classified as subtype D by sub-genomic pol sequencing. Analysis of 5′ and 3′ half genome sequences showed 73% (19/26) of infections involved single virus transmissions, whereas 27% (7/26) of infections involved multiple variant transmissions based on predictions of a model of random virus evolution. Subtype analysis of inferred transmitted/founder viruses showed a high transmission rate of inter-subtype recombinants (69%, 20/29) involving mainly A1/D, while pure subtype D variants accounted for about one-third of infections (31%, 9/29). Recombination patterns included a predominance of subtype D in the gag/pol region and a highly recombinogenic envelope gene. The signal peptide-C1 region and gp41 transmembrane domain (Tat2/Rev2 flanking region) were hotspots for A1/D recombination events. Analysis of a panel of 14 transmitted/founder molecular clones showed no difference in replication capacity between subtype D viruses (n = 3) and inter-subtype mosaic recombinants (n = 11). However, individuals infected with high replication capacity viruses had a faster CD4 T cell loss. The high transmission rate of unique inter-subtype recombinants is striking and emphasizes the extraordinary challenge for vaccine design and, in particular, for the highly variable and recombinogenic envelope gene, which is targeted by rational designs aimed at eliciting broadly neutralizing antibodies.


Introduction
According to the Joint United Nations Programme on HIV/AIDS (UNAIDS) report, an estimated 2 million individuals are newly infected with HIV-1 every year [1]. Phylogenetic analyses of HIV-1 strains have revealed four major groups: M, N, O and P. Group M is responsible for the majority of infections and is sub-divided into nine subtypes: A, B, C, D, F, G, H, J and K. Recently, subtype L has been identified in Central Africa and is yet to be confirmed (Yamaguchi et al., 2020).
Globally, it is estimated that approximately 20% of new HIV infections are due to HIV-1 circulating recombinant forms (CRFs) and unique recombinant forms (URFs), particularly in regions such as Africa, where several subtypes are known to co-circulate [2]. Previous studies investigating HIV-1 diversity in Uganda have targeted sub-genomic fragments: gag [3], pol [4,5] and env [3]. In addition, full-length genome analyses in other studies were performed mainly in chronically infected individuals in rural and semi-urban areas, focusing on the southern and central regions of Uganda. In these studies, inter-subtype recombinant viruses (mostly A-D URFs) were estimated to occur in a range of 6-19% of total infections based on sub-genomic data [3,5–7] and nearly 30% based on full-length genome data [8]. As sequence diversity assessments are largely based on the gag-pol and env genes instead of the full HIV-1 genome, it is plausible that current figures of the frequency of HIV recombinants are an underestimate. More recent estimates using near full-length genome data from the same geographic regions in Uganda showed that 46% and 49.9% of the 200 and 465 patients, respectively, were infected by recombinant viruses, with A1D recombinants being most frequently observed (25%) [9,10].
In approximately 80% of heterosexual transmission events of HIV-1, a single virus, the transmitted/founder (T/F) virus, is thought to be transmitted and establish infection in the naive host from the diverse quasi-species in the chronically infected donor [11,12]. Understanding the key features of T/F viruses provides additional insights into the mechanisms underlying transmission, which is important for both vaccine design and therapeutic interventions [13]. Most T/F studies have used subtype B and C viruses to determine the phenotypic characteristics relative to the chronic viruses. Deymier et al. and Iyer et al. studied genomes and clones from six and eight epidemiologically linked transmission pairs, respectively [13,14]. Deymier et al. observed that T/Fs from subtype C showed little difference in replicative capacity and resistance to interferon alpha compared to non-transmitted variants within each transmission pair. In contrast, Iyer et al.'s data demonstrated that T/F viruses are more resistant to both interferon alpha and beta than the chronic viruses for both subtype B and C strains [13,14]. The difference in data could be attributed to the different sample types used, female-male transmissions compared to male-female transmissions and the subtypes studied. Baalwa et al., 2013 characterized 12 T/F viruses of subtypes A, D and A/D and discovered that all 12 viruses used CCR5 but not CXCR4 as a co-receptor. Additionally, T/F viruses of subtype D replicated more efficiently than subtype A viruses when assayed in primary human CD4+ T cells [15].
As there have been limited studies on subtype D T/F viruses, the current study focused on near full-length genome analyses of viruses previously classified on the basis of the pol gene. In addition, we investigated features of HIV-1 infections and patient outcomes related to the full genome sequence and in vitro replication of T/F viruses.

Study Subjects
The International AIDS Vaccine Initiative (IAVI) Protocol C acute HIV infection cohort was a prospective multi-center observational study that enrolled approximately 600 volunteers 18-60 years of age with recent HIV infection [16]. Study sites were in Kenya, Uganda, Rwanda, Zambia and South Africa.
Newly infected individuals were identified through p24 ELISA and serological tests and followed up at regular intervals for up to 8 years. All volunteers received HIV care, including Antiretroviral Therapy (ART) according to national guidelines. The estimated

Amplification of Near Full-Length Genomes
In-house primers used in this study and their details are listed in Table 1. Viral RNA was extracted from plasma and full-length cDNA synthesized as described previously [17]. Briefly, 140 µL of patient plasma was used to extract viral RNA using the QIAamp Viral RNA Mini Kit (Qiagen Inc, Valencia, CA, USA). RNA was recovered and used to synthesize near full-length HIV cDNA using SuperScript III Reverse Transcriptase (Life Technologies, USA: Invitrogen, Ljubljana, Slovenia) with primers 1′3′3′ and OFM19 (Table 1). Near full-length (NFL) cDNA was serially diluted in replicates of eight PCR wells and subjected to nested PCR amplification with HIV-specific primers (Table 1) to yield a ~9-kb amplicon at a dilution in which 30% of the wells tested positive for amplification [18]. The first round of PCR was performed in a total reaction volume of 25 µL containing 1X Q5 Reaction Buffer, 1X Q5 High GC Enhancer, 0.35 mM of each dNTP, 0.5 µM of primers 1.U5Cc and 1.3′3′PlCb and 0.02 U/µL of Q5 Hot Start High-Fidelity DNA Polymerase (NEB). PCR conditions for the first round were: 98 °C for 30 s, followed by 30 cycles of 98 °C for 10 s and 72 °C for 7.5 min, with a final extension at 72 °C for 10 min.
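The rationale for amplifying at a dilution where about 30% of wells are positive can be illustrated with a short calculation. The sketch below assumes that cDNA templates are distributed across wells according to a Poisson distribution; this is the usual assumption behind limiting-dilution PCR and is not stated explicitly in the text. The function name is illustrative only.

```python
import math


def single_template_probability(fraction_positive: float) -> float:
    """Probability that a PCR-positive well was seeded by exactly one template.

    Assumes the number of templates per well follows a Poisson distribution,
    so that P(no template) = exp(-lam) = 1 - fraction_positive.
    """
    if not 0.0 < fraction_positive < 1.0:
        raise ValueError("fraction_positive must be strictly between 0 and 1")
    # Mean number of templates per well implied by the fraction of negative wells.
    lam = -math.log(1.0 - fraction_positive)
    # P(exactly one template) under the Poisson model.
    p_one = lam * math.exp(-lam)
    # Condition on the well being positive (one or more templates).
    return p_one / fraction_positive
```

Under this assumption, `single_template_probability(0.30)` evaluates to roughly 0.83, i.e., about five in six positive wells would be expected to contain a single genome.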
One microliter of first-round PCR product was then used as a template for the second-round PCR, with identical cycling conditions and PCR mix except for the primers, which were 2.U5Cd and 2.3′3′plCb. PCR products were then run at 300 V for 25 min on a 1% agarose lithium acetate gel to detect the presence of a ~9 kb band. For samples that did not yield the ~9 kb fragment, a half genome Single Genome Amplification (SGA) approach was employed to generate both the 3′ and 5′ half genomes as previously described [19]. Briefly, the SGA half genome method entailed amplifying overlapping 5′ (U5, gag-pol and vif) and 3′ (pol, vif, vpr, rev, vpu, tat, env, nef and U3-R) half genomes. We used primers 5FIV-R1/b5r1 and 1.R3.B3R (Table 2) to make cDNA for the 5′ and 3′ half genomes, respectively, using SuperScript IV. Primers used for first-round PCR were RVDA-F1 and 5FIV-R1 for the 5′ half genome and b3F1 and 1.R3.B3 for the 3′ half genome (Table 2). Amplification reactions were performed with 10× High Fidelity Platinum Taq PCR buffer, 10 µM of each primer, 50 mM MgSO4, 10 mM of each deoxynucleoside triphosphate and 5 units/µL of Platinum Taq High Fidelity polymerase in reactions of 25 µL (Invitrogen, Carlsbad, CA, USA). The first round of PCR was performed at 94 °C for 2 min, followed by 35 cycles of 94 °C for 15 s, 55 °C for 30 s and 68 °C for 6 min, followed by a final extension at 68 °C for 10 min. For the second round of PCR, the number of cycles was increased to 37 and the annealing temperature was raised from 55 °C to 58 °C. Primers used for the second round were RVDA-F1 and 5FV-R22 for the 5′ half genome and b3F3 and 2.R3.B6R for the 3′ half genome (Table 2).
[20]. In brief, we combined 75 near full-length single genome amplicons for each RSII library; 10 patients' half genome PCR products were collected for each RSII library.
The final library DNA concentration was more than 20 ng/µL, the 260/280 purity ratio was greater than 1.8 and the 260/230 ratio was greater than 2.0; the total volume was 30 µL. SMRT sequencing was performed on a PacBio RSII at the University of Delaware DNA Sequencing & Genotyping Center. Sequencing errors were removed using an algorithm described by Dilernia et al., 2015 [20] that stratifies unique reads from the different genomes and estimates a consensus within each genome stratum. To determine the T/F, all ~9 kb viral sequences were aligned using MUSCLE in Geneious bioinformatics software (Biomatters, Auckland, New Zealand), followed by manual alignment. Maximum likelihood parsimony with 100 bootstraps was used for phylogenetic analysis. MEGA7 was used to extract intra-patient pairwise distances using the Poisson correction model. The Los Alamos National Laboratory (LANL) HIV Database Consensus/Ancestral Sequence Alignments were used as reference sequences.

Generation of Infectious Molecular Clones and In-Fusion Cloning
HIV-1 T/F genomes were chemically synthesized (by GenScript Inc., Piscataway, NJ, USA) in three fragments with 100 bp overlaps in the pol and env regions (Figure 2) of the proviral genome (to facilitate In-Fusion HD cloning) and ligated separately and, whenever possible, in the same orientation into the multiple cloning site (MCS) of the pUC57 plasmid vector, which contains an ampicillin resistance marker gene for selection. To combine the three genome fragments into one contiguous proviral genome sequence and generate an infectious molecular clone (IMC) plasmid, we utilized an In-Fusion cloning-based approach. The principle of the strategy we followed is outlined below; for some of the IMCs, the cloning strategy had to be customized further to overcome plasmid instability challenges that can arise when cloning primary HIV-1 proviral genomes. Primers (Table 1) overlapping by 15 to 20 nt in forward (fwd) and reverse (rev) orientation were designed in the ampicillin resistance gene and in the overlapping regions between segments 1 and 2 and between segments 2 and 3. These were then used to generate three PCR fragments using NEB Q5 high-fidelity DNA polymerase: PCR segment 1 contained the portion of the pUC57 vector from ampR through proviral segment 1 (~4300 bp); PCR segment 2 spanned proviral segment 2 (~3850 bp), overlapping with the 3′ end of segment 1 and the 5′ end of segment 3 by 15-20 nt, respectively; PCR segment 3 spanned proviral segment 3 and the second portion of pUC57 into the ampR sequence (~4860 bp). PCR fragments were purified using a QIAquick Gel Extraction Kit (Qiagen Inc, Valencia, CA, USA) following agarose gel electrophoresis using Invitrogen™ SYBR™ Safe™ DNA Gel Stain and an Invitrogen™ Safe Imager™ 2.0 blue-light transilluminator for band visualization. Optimized ratios of purified fragments were subjected to In-Fusion HD Cloning Plus CE (TaKaRa Bio, Mountain View, CA, USA) ligation, principally following the manufacturer's instructions.
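The three-fragment design described above relies on each PCR segment sharing 15-20 nt of terminal sequence with its neighbor, with segment 3 closing the circle back onto segment 1 within ampR. The following is a minimal sketch of how such a design could be checked in silico before ordering primers. It is not the procedure used in this study, and the function names and the strict exact-match criterion are assumptions made for illustration.

```python
def terminal_overlap(upstream: str, downstream: str,
                     min_len: int = 15, max_len: int = 20) -> int:
    """Length of the longest exact match between the 3' end of `upstream`
    and the 5' end of `downstream` within [min_len, max_len], or 0 if none."""
    for k in range(max_len, min_len - 1, -1):
        if len(upstream) >= k and len(downstream) >= k \
                and upstream[-k:] == downstream[:k]:
            return k
    return 0


def check_circular_assembly(fragments: list[str]) -> list[int]:
    """Overlap length at every junction of a circular assembly.

    Fragment i is compared with fragment i + 1, and the last fragment is
    compared with the first, mirroring a plasmid closed by In-Fusion cloning.
    A value of 0 flags a junction without a usable 15-20 nt overlap.
    """
    n = len(fragments)
    return [
        terminal_overlap(fragments[i].upper(), fragments[(i + 1) % n].upper())
        for i in range(n)
    ]
```

A result such as `[15, 15, 15]` would indicate that all three junctions carry a usable overlap; any zero identifies the junction whose primers need to be redesigned.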
Two and a half microliters of the In-Fusion cloning reaction above were added to Stellar competent cells and incubated on ice for 30 min, and the cells were heat shocked for 45 s at 42 °C. Pre-warmed SOC was added to make a final volume of 500 µL and incubated for 1 h at 30 °C while shaking. The transformation reaction was plated on LB plates containing 100 µg/mL of ampicillin or carbenicillin, and plates were incubated for 18 h at 30 °C. Individual isolated colonies were picked, sub-cultured at 30 °C overnight and miniprep plasmid purified (PureYield™ Plasmid Miniprep System; Promega, Madison, WI, USA). Multiplex PCR was used to check for the full-length insert of the IMCs. To preempt the deletion of proviral sequences, correctly sized miniprep DNA was usually re-transformed into Invitrogen™ MAX Efficiency™ Stbl2™ competent cells, followed by large-scale culture in carbenicillin-containing LB medium and maxiprep DNA preparation (PureYield™ Plasmid Maxiprep System, Promega). Sequence confirmation of the IMCs was done using Sanger sequencing. The 14 T/F virus sequences that were analyzed as infectious molecular clones (IMCs) are indicated in Table 1 (GenBank accession numbers MW006052 to MW006081).

Generation of Virus Stocks and Determination of Viral Replicative Capacity
Viral stocks of IMCs were generated by transfecting 1.5 µg of purified proviral plasmid DNA into 293T cells (American Type Culture Collection) in 6-well plates using the Fugene HD transfection reagent (Roche, Basel, Switzerland). Viral stocks were collected 72 h post-transfection, clarified by low-speed centrifugation and frozen at −80 °C. The titer of each viral stock was determined by infecting TZM-bl cells (NIH AIDS Research and Reference Reagent Program) with 5-fold serial dilutions of virus in a manner previously described [24]. To assess the in vitro replicative capacity of the IMCs, 5 × 10^5 whole peripheral blood lymphocytes (PBLs) from one donor were infected at a multiplicity of infection of 0.05 (based on TZM-bl titer), and 100 µL of viral supernatants were collected at 2-day intervals during a period of 10 days. Briefly, previously activated PBLs (using DNase I, interleukin-2 (IL-2) and phytohemagglutinin) and virus were spinoculated at 37 °C for 2 h and washed 5 times with complete R10 media, and a volume of 200 µL was plated into 48- or 96-well U-bottomed plates. Viral supernatants were taken at days 2, 4, 6 and 8 and replaced with an equal volume of complete R10 media containing IL-2. Virion production was quantified using a P33-labeled reverse transcriptase assay as previously described [25]. The optimum window for logarithmic growth for all viruses was estimated to be between days 2 and 6 based on values obtained for days 2-8. By day 8, many high replicating viruses had depleted their target cells, causing the replication curve to flatten or decrease. As a result, log10-transformed slopes were computed using days 2, 4, 6 and 8 for all viruses. Viral Replication Capacity (VRC) scores were calculated as the area under the curve normalized to the wild-type subtype C lab-adapted strain MJ4 (AF321523).
Each sample was analyzed in triplicate in three independent experiments to generate average VRC scores, which controlled for assay variability and avoided a potential bias in determining score values. NL4.3 and R880F were also included in the assays.
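The VRC score described above can be sketched as a ratio of areas under the replication curves. The example below assumes that the area is computed by the trapezoidal rule on log10-transformed reverse transcriptase (RT) values over the sampled days; the text does not specify whether raw or log-transformed values entered the area calculation, so this choice, like the function names, is an assumption.

```python
import math


def area_under_curve(days: list[float], values: list[float]) -> float:
    """Trapezoidal area under a curve sampled at the given days."""
    if len(days) != len(values) or len(days) < 2:
        raise ValueError("need two or more matched time points")
    return sum(
        (days[i + 1] - days[i]) * (values[i] + values[i + 1]) / 2.0
        for i in range(len(days) - 1)
    )


def vrc_score(days: list[float], rt_sample: list[float],
              rt_reference: list[float]) -> float:
    """Area under the log10 RT curve of a test virus divided by that of the
    reference virus (MJ4 in this study). RT values must be positive."""
    log_sample = [math.log10(v) for v in rt_sample]
    log_reference = [math.log10(v) for v in rt_reference]
    return area_under_curve(days, log_sample) / area_under_curve(days, log_reference)
```

With this definition, the reference virus scores 1.0 by construction, and scores from triplicate wells and independent experiments can simply be averaged.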

Statistical Analyses
To characterize the association between the in vitro VRC of selected T/F IMCs and the patients' CD4+ T cell decline in vivo, a subset of 14 volunteers with longitudinal CD4+ T cell counts and VRC data was studied. The relation between VRC and decrease in CD4+ T cells was estimated using the Mantel-Cox method in Prism software. The endpoint was defined as the time until CD4+ T cell counts fell below a given threshold, such as 350 or 500 cells/mm^3 [26]. The difference between set point VL and VRC was also determined using Prism software.
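For readers without access to Prism, the Mantel-Cox (log-rank) comparison can be sketched in a few lines. The version below is a from-scratch illustration for two groups and is not the Prism implementation. It assumes that each volunteer contributes a (time, event) pair, where the event flag is True if the CD4+ T cell threshold was crossed and False if follow-up was censored; splitting volunteers into two groups (for example, high versus low VRC) is likewise an assumption.

```python
import math


def logrank_test(group_a: list[tuple[float, bool]],
                 group_b: list[tuple[float, bool]]) -> tuple[float, float]:
    """Two-group log-rank (Mantel-Cox) test.

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    event_times = sorted({t for t, event in group_a + group_b if event})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t.
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        # Events occurring exactly at time t.
        d_a = sum(1 for time, event in group_a if event and time == t)
        d_b = sum(1 for time, event in group_b if event and time == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("no informative event times")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Times can be expressed in any consistent unit (for example, days from the estimated date of infection), because the test depends only on the ordering of events.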

Study Subjects
A total of 29 adult subjects recruited between 2006 and 2011 in Uganda and whose plasma was collected during the acute/early phase of HIV-1 infection were analyzed in this study. Of the 29 participants (Table 2), 14 (48%) were women and 15 (52%) were men. They represent a section of a young Ugandan heterosexual population with a mean age of 31 (range: 21 to 58) whose major risk factor for HIV infection was co-habitation with a seropositive partner and who ultimately became infected. We examined plasma samples taken near the time of infection, with a mean estimated time from infection (EDI) of 42 days (range: 11 to 73). A significant positive correlation was observed between the TMRCA derived from the 5′ and 3′ half genome data (p = 0.006). The mean TMRCA was significantly higher for the 3′ half sequence data relative to the 5′ half (30.65 vs. 52.75, t-test p = 0.0005).
Viral loads at the first seropositive visit for all 29 samples varied broadly (range: 201 to 1,394,000 copies/mL) with a mean of 238,166 copies/mL. When grouped by gender, women and men did not significantly differ (p > 0.05, t-test assuming equal variances) in their mean age (29 versus 34 years), EDI (43 versus 41 days) and viral load (294,144 versus 185,919 copies/mL). We next identified single and multivariant transmissions, inferred the nucleotide sequence of the near full-length T/F virus, determined their subtype composition, characterized their sequence variability and determined the VRC using a panel of 14 IMCs that represent the most common subtypes and inter-subtype recombinants transmitted in Uganda during the 2006-2011 time period.

HIV-1 Genetic Diversity
Viral sequences from 29 study subjects were split into 5′ half and 3′ half genomes for genetic analysis. A total of 209 5′ half genomes (median of seven sequences per subject) and 197 3′ half genomes (median of six sequences per subject) were phylogenetically analyzed. Maximum likelihood phylogenetic trees of 5′ half genome (Figure 3A) and 3′ half genome nucleotide sequences (Figure 3B) showed distinct monophyletic lineages in a patient-specific pattern with strong statistical support. Sequences from 23 of the 29 subjects formed clusters with very small branches with little or no structure, indicative of low intrastrain genetic diversity and strongly suggesting that these infections resulted from transmissions of a single virus or two or more closely related viruses. In contrast, the remaining six subjects showed sequence clusters with more heterogeneous viral populations, depicted with larger branches and more structure at their tips (shown in red) in both 5′ half and 3′ half phylogenetic trees (Figure 3A,B).
We next employed within-patient sequence diversity and model predictions to identify single and multivariant transmissions. The median number of amplicons per sample was seven (range: 2-14) for 5′ half and six (range: 2-12) for 3′ half genomes (Table 3). Maximum within-patient half genome diversities ranged from 0.07% to 11.01% for 26 individuals; three other subjects were excluded from the diversity analysis because two samples had only two to three sequences in both or at least in one half genome, while the third subject had two out of four sequences highly enriched for APOBEC3G G-to-A mutations. There was a significant correlation between 5′ half and 3′ half maximum diversity values (r = 0.89, p < 0.001), and no significant difference between their mean values: 5′ half mean 0.38% versus 3′ half mean 0.85% for the 26 participants. Maximum diversity values and conformance to a model of random diversification for single virus transmissions were next examined for 26 individuals as previously reported [12,19,27,28].
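The maximum within-patient diversity referred to above is the largest pairwise distance among the aligned SGA sequences of one subject. A minimal sketch is given below; it uses the uncorrected proportion of differing sites (p-distance) rather than the distance model applied in MEGA7, ignores columns containing gaps or ambiguous bases, and assumes the sequences are already aligned. Function names are illustrative.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    counting only columns where both sequences carry A, C, G or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def max_within_patient_diversity(aligned_seqs: list[str]) -> float:
    """Largest pairwise p-distance among a subject's sequences, in percent."""
    if len(aligned_seqs) < 2:
        raise ValueError("two or more sequences are required")
    return 100.0 * max(p_distance(a, b) for a, b in combinations(aligned_seqs, 2))
```

Because the statistic is a maximum over all pairs, a single divergent or hypermutated sequence is enough to raise it, which is why subjects with very few sequences or APOBEC3G-enriched sequences were handled separately.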

Figure 3. Maximum likelihood phylogenetic trees of (A) 5′ half and (B) 3′ half genome nucleotide sequences. Single genome amplification (SGA) sequences from each subject fell into distinct monophyletic lineages with low genetic diversity and 100% bootstrap support (black asterisks); nodes with bootstrap support ≥80% are shown with grey asterisks. Branches colored in red denote sequences with intra-patient maximum diversity >0.6%; some have sub-lineages with 100% bootstrap support. HIV-1 subtypes A1, D, A/D and subtype B HXB2 reference sequences from the LANL database are shown in grey. The scale bar represents 10% genetic distance.

Model Predictions Analysis
Based on model predictions from previous studies [11,19], the maximum diversity expected within 100 days of transmission of a single virus is 0.6% with confidence intervals of 0.54-0.68%. For all study subjects, clinically determined estimates for the number of days since infection were available, which fell within a range of 11 to 71 days (Table 2), thus allowing us to use previously developed model predictions. Sequence data relevant for predictions are summarized in Table 3. Out of the 26 subjects included in the diversity analysis, 21 had a half genome maximum sequence diversity of <0.6%, whereas five subjects had a sequence diversity greater than 0.6%; the latter is inconsistent with the prediction of an infection initiated by a single virus [11,19]. Phylogenetic and highlighter plot analysis of the 21 subjects displaying <0.6% maximum within-patient sequence diversity allowed us to differentiate those infected by a single virus from those infected by more than one closely related viral variant. We found that sequence data for 19 of these 21 subjects (73% of the 26 subjects analyzed) were indicative of single virus transmission. For example, subject 194584 had sequences with mutations randomly distributed following a Poisson distribution and coalescing to a single consensus in both half genomes (Figure 4A,B); two of the 14 5′ half genome sequences were hypermutated by APOBEC3G; the removal of G-to-A sites restored the star phylogeny. Another factor causing deviations from a star phylogeny and model predictions of single virus transmission was the appearance of shared polymorphisms in two or more sequences due to early stochastic changes or immune selection. Indeed, subject 194289's 3′ half genomes (Figure 4D) showed five loci with shared polymorphisms, most notably in the nef gene, which exhibited multiple polymorphisms within a region spanning five amino acid residues.
This pattern of clustered polymorphisms strongly suggests a vigorous immune response to an epitope in the Nef protein (thicker arrow in Figure 4D). Most of the 19 subjects identified as single virus infections had sequences in which confounding factors due to enrichment in APOBEC3G sites and selection were discerned.
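The effect of removing G-to-A sites before applying the 0.6% criterion can be sketched as follows. This simplified example masks every alignment column in which the majority consensus is G and any sequence carries A, and then recomputes the maximum pairwise diversity. Dedicated hypermutation tools additionally consider the dinucleotide context preferred by APOBEC3G, which this sketch does not, and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

SINGLE_FOUNDER_MAX_DIVERSITY = 0.6  # percent; model threshold quoted in the text


def consensus(aligned_seqs: list[str]) -> str:
    """Majority base at each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned_seqs))


def mask_g_to_a_sites(aligned_seqs: list[str]) -> list[str]:
    """Replace with 'N' every column where the consensus is G and some
    sequence carries A (candidate APOBEC-mediated G-to-A sites)."""
    cons = consensus(aligned_seqs)
    masked = {i for i, col in enumerate(zip(*aligned_seqs))
              if cons[i] == "G" and "A" in col}
    return ["".join("N" if i in masked else base for i, base in enumerate(seq))
            for seq in aligned_seqs]


def max_diversity_percent(aligned_seqs: list[str]) -> float:
    """Largest pairwise proportion of differing A/C/G/T sites, in percent."""
    best = 0.0
    for a, b in combinations(aligned_seqs, 2):
        pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
        if pairs:
            best = max(best, 100.0 * sum(x != y for x, y in pairs) / len(pairs))
    return best


def consistent_with_single_founder(aligned_seqs: list[str]) -> bool:
    """True if the maximum diversity is below the single-founder threshold."""
    return max_diversity_percent(aligned_seqs) < SINGLE_FOUNDER_MAX_DIVERSITY
```

A subject whose raw alignment exceeds the threshold only because of one hypermutated sequence would fail `consistent_with_single_founder` before masking and pass it afterwards, mirroring the case of subject 194584.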
On the other hand, sequence analysis showed that seven out of twenty-six (27%) individuals were most likely to be infected by more than one virus. Two examples of multivariant transmissions are shown next. Subject 275026 displayed half genome sequence diversities that did not conform to model predictions of single virus transmission (Table 3). Phylogenetic and highlighter analyses of 10 viral sequences revealed a nonrandom distribution of mutations and evidence of two viral sub-lineages distinguished by at least 20-26 polymorphisms in the 5′ and 3′ half genomes (Figure 4E,F). The predominant founder lineage represented by seven sequences and a minor second variant by three sequences are depicted in Figure 4E,F. Polymorphisms identifying the second lineage in subject 275026 were clustered into five segments in 5′ half genomes (see dashed line boxes in Figure 4E), each with 3 to 6 polymorphisms within a region spanning an average length of 159 nucleotides (range: 132-219). Clustered mutations were interspaced by identical or nearly identical sequences to the consensus of the predominant lineage. A similar pattern was seen in the 3′ half genomes (Figure 4F). We thus inferred that these clustered polymorphisms represent recombination events with a second founder virus still detectable approximately two months after infection. A separate analysis of seven homogeneous 3′ half genome sequences identifying the major lineage (Table 3) conformed to model predictions of single virus transmission. Altogether, the data indicate that subject 275026 was infected by at least two closely related variants from a single donor.
In the case of subject 270015 (Figure 4G,H), the plasma sample was estimated to be taken shortly after infection (11 days), and 5′ and 3′ half genome sequences had low maximum diversities (0.27% and 0.46%, respectively), a random distribution of mutations and a star phylogeny, thus suggesting single virus transmission. However, the Poisson time estimate (>150 days) was not consistent with the clinical data, and sequences did not fit the model prediction of single virus transmission (Table 3). The highlighter plot revealed one sequence containing a high number of polymorphisms (9 in the 5′ half and 19 in the 3′ half) that cannot be explained by APOBEC hypermutation and exceed the number of mutations expected within 11 days of infection for a single virus transmission event. We reasoned that the more divergent sequence represents a second founder virus. Indeed, exclusion of this divergent sequence from the analysis showed that the remaining five homogeneous 3′ half sequences (depicted as lineage one in Figure 4H) conformed to model predictions of a single ancestor. Altogether, these data support the idea that subject 270015 was infected by two closely related viruses from the same donor, with lineages diverging from each other by as much as 0.4% in the 3′ half genome. Half genome sequences from five additional individuals also showed evidence of multivariant transmissions, as listed in Table 3 and represented individually in the phylogenetic trees and highlighter plots of half genomes depicted in Supplementary Figure S1.
Three subjects (191996, 275027 and 194140) had a low number of sequences per sample or exhibited APOBEC-mediated hypermutation and for these reasons were excluded from this analysis. These three subjects were also sampled >50 days from infection.

Inference of Transmitted/Founder Viruses
Next, we inferred the transmitted/founder viral sequences in the 29 study subjects and determined their subtype composition. In the 19 cases of single virus transmissions, star phylogenies and highlighter plots of half genomes coalescing to a consensus sequence allowed us to infer the sequence of the most recent common ancestor for both 5′ and 3′ half genomes. Seven of nineteen subjects (37%) had star phylogenies with or without the exclusion of APOBEC sites (Table 3), thus allowing us to unambiguously infer a T/F virus sequence. As shown in Figure 4A, 5′ half genome sequences (n = 14) from subject 194584 were identical or near identical to the consensus with a random distribution of mutations coalescing to a consensus except for two sequences with multiple APOBEC3G mutations; removal of the APOBEC mutated sites resulted in a star phylogeny and conformance to model predictions. Analysis of 3′ half genomes (Figure 4B) showed seven near identical sequences, including two sequences with a single shared polymorphism that was identified as an early stochastic change, as the model of random evolution predicts to happen in some cases. Thus, the shared mutation did not confound the inference of a consensus sequence in this subject and in other similar cases. Consensus sequences of both 5′ and 3′ half genomes were linked through a partial overlap to complete the near full-length HIV-1 genome. On the other hand, 12 of 19 subjects with single virus transmissions had some sequences harboring a few shared polymorphisms that suggested immune selection and confounded the unambiguous inference of the transmitted virus sequence. As there were no pre-seroconversion plasma samples available from earlier time points, it was not possible to resolve this issue. When shared polymorphism(s) occurred in a fraction of SGA sequences, the predominant nucleotide was assigned in the consensus sequence.
This is represented in the 3′ half genome of subject 194289 (Figure 4D), in which two distinct nucleotide sites differed in about half of the nine sequences, and both mutations conferred amino acid changes in the Nef protein. Because these two mutational sites were accompanied by three additional shared polymorphisms in a region spanning five amino acid residues, the pattern strongly suggested immune selection. Four other sites with shared polymorphisms that changed amino acid residues (involving two or three sequences) were detected in other genes: one in vif, one in vpu and two in the env gene; at each site, the predominant nucleotide was assigned in the consensus sequence. On the other hand, when a shared polymorphism occurred in exactly half of the sequences, the ancestral nucleotide was selected. Thus, in 12 study subjects with shared polymorphisms, we could not be 100% certain that the assigned nucleotide truly corresponded to the actual transmitted virus sequence; it is still possible that we inferred an early adapted variant with a few mutations that emerged under immune pressure, or a founder virus with one or two stochastic changes that conferred a replication/survival advantage over the transmitted virus. In any event, in about two-thirds of subjects with single virus infections, we identified an early founder virus that may be one or a few nucleotides away from the actual transmitted virus.
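The consensus assignment rule described above (predominant nucleotide at sites with shared polymorphisms, ancestral nucleotide when a site is split exactly in half) can be illustrated with a minimal Python sketch. The function names and the `ancestral_seq` input are hypothetical; the text does not state how the ancestral nucleotide was determined, so here it is simply supplied as a reference string of the same length as the alignment.

```python
from collections import Counter


def consensus_base(column, ancestral):
    """Pick the consensus nucleotide for one alignment column.

    column    -- nucleotides observed at this site, one per SGA sequence
    ancestral -- assumed reference base, used only to break exact ties
    """
    counts = Counter(column).most_common()
    top_base, top_n = counts[0]
    # Exact tie between the two most frequent bases: use the ancestral base.
    if len(counts) > 1 and counts[1][1] == top_n:
        return ancestral
    return top_base


def infer_consensus(sequences, ancestral_seq):
    """Column-wise consensus over equal-length, pre-aligned sequences."""
    return "".join(
        consensus_base(col, anc)
        for col, anc in zip(zip(*sequences), ancestral_seq)
    )
```

This sketch assumes gap-free, pre-aligned sequences and does not handle APOBEC site exclusion, which would need to be applied before the consensus step.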
In seven cases of multivariant transmissions, the lineage of one of the transmitted/founder viruses, defined as virus 1 in Figure S1, was identified as the consensus of the set of sequences harboring the fewest mutations. Lineage one was easier to identify in five subjects (275026 and 270015 shown in Figure 4H and 191997, 192023 and 191696 in Figure S1) than in two other individuals whose sequences showed great variability (193004 and 193005 shown in Figure S1). The identification of the second and third transmitted variants was not possible due to the overall low number of sequences, which exhibited substantial heterogeneity with multiple adaptation and immune selection mutations as well as recombination events among founder viruses; this was best illustrated for subject 275026, for whom evidence of a second strain is shown in sequences with dashed line boxes in Figure 4E,F. Sequences with a variable number of recombination events between founder viruses were also apparent in subjects 192023, 193004, 193005 and 191696 (Figure S1). However, the inferred lineage sequence of one founder virus may represent an early virus rather than a transmitted virus.

Subtype Composition
Irrespective of whether the inferred consensus represents a bona fide T/F virus or an early virus, the sequence information allowed us to determine the most common subtype composition in our cohort from Uganda. Analysis of near full-length genome sequences from 29 T/F viruses using different subtyping and recombination analysis tools (see Methods) showed overall good agreement between methods. Unique mosaic inter-subtype recombinant strains accounted for most infections (69.0%, 20/29), of which 19 were A1/D and one was an A1/C/D recombinant, as shown in Figure 5A. Infections by pure clade D strains accounted for only about one-third (31%, 9/29) of sero-converters in our cohort. We observed that subtype D tended to predominate in the gag/pol gene; in fact, the entire gag/pol region and often the vif gene were clade D in six out of twenty recombinant viruses, whereas as few as three or four viruses were subtype A1 in approximately half the length of gag/pol genes (Figure 5A). The other pattern observed is that in half of the recombinant viruses, most of the gp120 and part of gp41 were subtype A1, while both exons 1 and 2 of Rev and Tat were almost invariably subtype D. A similar pattern was observed in the envelope region of the single A1/C/D recombinant virus detected, where D and C subtypes were intermixed in the gag/pol and vif regions. Interestingly, in the only two viruses in which the entire envelope gene was subtype A1, exons 1 and 2 of the rev and tat genes were also subtype A1.

Recombination Breakpoints
To gain further insights into how A1 and D subtypes recombine, we determined the recombination breakpoints in the 20 inter-subtype recombinant founder viruses inferred above. Unique mosaic inter-subtype recombinant strains accounted for most of the infections (69.0%; 19 A1/D and one A1/C/D), as shown in Figure 5A. The complete envelope gene (HXB2 coordinates 6225-8795), covering ~856 amino acid residues, is depicted as a mosaic of clade A1 (shown in red) and clade D (shown in green; Figure 5A). A high rate of recombination was observed within the envelope gene, and an initial analysis suggested that recombination breakpoints were enriched in coding regions for integrase, gp120 (gp160 signal peptide (SP)/vpu overlap and variable V1-V2 loops) and gp41 (Tat2/Rev2 flanking region, encoding the membrane spanning domain) within the 20 unique recombinant T/F viruses analyzed. Two hotspot regions for recombination are suggested by this analysis; the most abundant is within the gp41 transmembrane domain (approximately 75% of the 16 envelopes have this recombination event), and the second is in the signal peptide-C1 region, with approximately 25% (4/16) of the samples having this recombination event (Figure 5B). Interestingly, in recombinant proviral genomes, the gag-pol coding region is predominantly subtype D, and the directional change between subtypes from the 5′ to the 3′ end was mostly from subtype D to A to D (Table 4), with this final switch to D occurring within the envelope region. The cytoplasmic tail coding region and exon 2 of tat and rev of these recombinant T/F viruses were, without exception, derived from subtype D.
Table 4. Inter-subtype recombination events and identification of recombination hot spots in envelope genes of HIV-1 transmitted/founder variants from Uganda.

Viral Replicative Capacity
VRC scores were determined based on the area under the curve for each sample (Figure 6) and normalized by the MJ4 HIV-1 clone. This clade C chimeric clone (MJ4) was preferred as a control because, like the 14 IMC T/F viruses in this study, MJ4 is CCR5-tropic, unlike NL4-3, which is CXCR4-tropic. Additionally, MJ4 has been used in previous studies of VRC and allows comparisons to be made. The lab-recombinant clade B virus, NL4-3, despite being more closely related to clade D, was not chosen for normalization because its VRC range, even at low multiplicity of infection, was significantly higher than that of the primary clade D and A/D viruses in this study. The R880F virus, an example of a poorly replicating HIV-1 IMC, was also included. High VRC was defined as the top tercile of T/F virus VRC, while low VRC was defined as the bottom tercile. It is unknown whether viral replicative capacity plays a role in HIV-1 transmission. Some research into transmission pairs suggests that T/F viruses have higher replicative fitness than non-transmitted variants, but other studies of T/F variants did not demonstrate increased viral fitness in terms of particle infectivity or viral replicative capacity [13,29].
The replicative capacity of recombinant A/D founder variants (n = 11) was compared to that of pure D variants (n = 3) from these early infections to determine whether there was a difference in replicative capacity. The median T/F VRC was 1.211, ranging from 0.085 to 3.75, with 8 T/F VRC scores over the median (Figure S2). The mean VRC scores for A/D and D were 1.307 and 0.975, respectively.
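The scoring scheme described above (area under the replication curve, normalized to the MJ4 reference, then split into terciles) could be sketched in Python as follows. This is only an illustration under stated assumptions: the trapezoidal rule, the rank-based tercile cut (bottom and top `n // 3` viruses) and all function names are choices made here, not details given in the text, and the replication readout is treated as a generic numeric series.

```python
def auc_trapezoid(days, values):
    """Area under a replication curve by the trapezoidal rule (assumed method)."""
    return sum(
        (days[i + 1] - days[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(days) - 1)
    )


def vrc_score(days, sample_curve, reference_curve):
    """AUC of a sample divided by the AUC of the reference clone (e.g. MJ4)."""
    return auc_trapezoid(days, sample_curve) / auc_trapezoid(days, reference_curve)


def tercile_groups(scores):
    """Label each virus 'low', 'mid' or 'high' by rank of its VRC score.

    scores -- dict mapping virus name to VRC score.
    The bottom and top n // 3 viruses form the low and high groups
    (an assumed convention when n is not divisible by three).
    """
    ranked = sorted(scores, key=scores.get)
    k = len(ranked) // 3
    groups = {name: "mid" for name in ranked}
    for name in ranked[:k]:
        groups[name] = "low"
    for name in ranked[len(ranked) - k:]:
        groups[name] = "high"
    return groups
```

A score of 1.0 under this scheme means the sample replicated as well as the reference clone over the sampled time course.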

VRC Association with CD4+T Cell Count and Set Point Viral Load
Previous studies have shown that transmitted viral characteristics in subtype C infection significantly correlate with set point viral load (SPVL) as well as CD4 T-cell decline, even in the context of viral control by previously identified host factors [30][31][32]. However, the role of viral replicative capacity in influencing disease progression in subtype D and A/D recombinant infections has not been adequately studied. Using the Mantel-Cox method in Prism software, Kaplan-Meier curves were compared between two groups of high and low VRC (Figure 7). This analysis showed that the replicative capacity of the initial infecting viral strain had a statistically significant impact on the trajectory of the CD4 counts in the first 4-6 years of follow-up in this small group of ART-naïve participants. The median time to CD4 count <500 cells/mm³ was 1199 days for individuals with low VRC and 419 days for those with high VRC. The median time to CD4 count <350 cells/mm³ was 1021 and 1513 days for individuals with high and low VRC, respectively. In both cases, CD4+ decline was significantly faster in individuals infected with high VRC strains than in those infected with low VRC viruses. Next, we examined the effect of VRC on SPVL. We defined SPVL as the median viral load following acute phase HIV-1 infection (which, in this case, was 3-24 months). A statistically significant positive correlation was observed between the VRC of the 14 IMCs and SPVL (Figure 8A); however, to tease out differences in a small group of individuals, we divided the replicative capacity into high and low VRC, as was done previously [26,31]. There was no difference in SPVL between the low and high VRC groups (Figure 8B). These data are, thus, consistent with previous reports that the VRC of the initial infecting strain (T/F) has an impact on some important markers of HIV pathogenesis, especially CD4+ T cell count decline.
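For readers unfamiliar with the time-to-event comparison used here, the following Python sketch shows a bare-bones Kaplan-Meier median and a two-group log-rank (Mantel-Cox) test. It is not the Prism procedure used in the study and is not expected to reproduce its output exactly; the function names are hypothetical, and the p-value uses the standard chi-square approximation with one degree of freedom.

```python
import math
from fractions import Fraction


def km_median(times, events):
    """Kaplan-Meier median: first time at which estimated survival is <= 0.5.

    times  -- follow-up in days for each participant
    events -- True if the CD4 threshold was reached, False if censored
    Returns None when the curve never drops to 0.5.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = Fraction(1)
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)
        d = sum(1 for tt, e in data if tt == t and e)
        if d:
            surv *= Fraction(at_risk - d, at_risk)
            if surv <= Fraction(1, 2):
                return t
        at_risk -= n_at_t
        i += n_at_t
    return None


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-sided log-rank p-value for two groups (chi-square, 1 df)."""
    pooled = zip(times_a + times_b, events_a + events_b)
    event_times = sorted({t for t, e in pooled if e})
    obs_minus_exp = 0.0
    var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))
```

Participants who never reach the CD4 threshold during follow-up are passed with `events` set to `False`, so they contribute to the at-risk counts until their last visit.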

Amino Acid Signatures of T/F Sequences
It has been previously reported [33,34] that His12 in the signal peptide of the Envelope gene is a strong signature for viral transmission. Thus, we examined the amino acid variability at position 12 for the set of T/F genomes described herein. We observed that nearly half (41%; 11/27) of the sequences had a histidine at position 12 (Table 5). Interestingly, when we compared the signal peptides of different subtypes, subtypes D and A1 were highly contrasting. The subtype D sequences had His12 at a frequency of 70% (14/20), whereas none of the A1 sequences had His12; instead, they had a high frequency (63%; 5/8) of Asn12. None of the subtype D sequences had Asn12. We further observed that His12 was present among recombinants with low VRC (3/5), while recombinants with high VRC (6/9) did not contain His at position 12.
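Tallying the residue at a signature position is simple to express in code. The Python sketch below counts the amino acid at a given 1-based position across a set of signal peptide sequences; it assumes each string starts at the initiating methionine and is numbered without gaps, which may not hold for real alignments, and the function name is hypothetical.

```python
from collections import Counter


def residue_counts(signal_peptides, position=12):
    """Count the amino acid found at a 1-based position across sequences.

    Sequences shorter than the requested position are skipped.
    """
    return Counter(
        sp[position - 1] for sp in signal_peptides if len(sp) >= position
    )


def residue_frequency(signal_peptides, residue, position=12):
    """Fraction of usable sequences carrying the given residue at the position."""
    counts = residue_counts(signal_peptides, position)
    total = sum(counts.values())
    return counts[residue] / total if total else 0.0
```

Running `residue_frequency` separately on the subtype D and subtype A1 signal peptides would give the kind of per-subtype His12 and Asn12 frequencies reported in Table 5.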

Discussion
This study characterized HIV-1 near full-length T/F viral genomes at both the molecular and phenotypic level for mainly subtype D and A/D recombinants from heterosexual mucosal transmissions. HIV-1 mucosal transmission is characterized by an extreme bottleneck that most often results in only one virus variant, termed the T/F virus, being selected out of a donor's quasi-species to cross the mucosal barrier and establish clinical infection in the new host [11,19]. These breakthrough viruses may have unique properties that confer a higher probability of being transmitted [14,17]. Here, we report single viral variant transmissions in 73% (19/26) of new infections and multiple variant transmissions in 27% (7/26) of the cases. The proportion of multivariant transmission in our study, although slightly higher, was not statistically different from that in previous studies in heterosexual populations, which reported proportions ranging from 17.7% to 21.7% [11,12,35]. One of the caveats in our study with regard to assessment of the multiplicity of infection was the limited number of sequences we were able to obtain for some samples that did not conform to the criteria of single variant transmission. Additionally, a later sampling time is expected to impact the detection of transmitted variants that are less fit than a co-transmitted variant that becomes the predominant strain [36]. Thus, it is more likely that an infrequent outcompeted variant would be missed with a limited number of sequences. The power analysis predicted that for samples with only six sequences, the probabilities of multivariant detection were 60% for variants with ≥15% frequency and 80% for variants with ≥20% frequency. Nonetheless, the genetic analyses of both 5′ and 3′ half genomes were concordant in all seven individuals with multivariant transmission, cross-validating each other's findings despite the relatively low number of sequences per sample.
Our study did not show evidence of an association between multivariant transmission and inflammation caused by sexually transmitted infections. This finding differs from a previous study by Haaland and colleagues published in 2009, in which multivariant transmissions for subtypes A and C were correlated with genital inflammation.
However, for the majority of study subjects classified as single virus transmissions, we had obtained at least six half-genome sequences (either both or one of the two halves). It is reasonable, therefore, to assume that some, if not all of these samples are single virus infections for vaccine-design purposes because these are the fittest strains that a vaccine should target for neutralization. However, we cannot rule out the possibility that a few individuals may have been infected by a second variant whose less fit progeny represent <10% of the virus population.
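The dependence of minority-variant detection on sequencing depth can be illustrated with a simple binomial sketch in Python. It assumes that each sequence is an independent draw from the virus population, which is a simplification introduced here; the study's own power analysis may use a different model, so the values produced need not match the figures quoted above. Function names are hypothetical.

```python
def detection_probability(n_sequences, variant_frequency):
    """P(at least one of n sequences derives from a variant at this frequency)."""
    return 1.0 - (1.0 - variant_frequency) ** n_sequences


def sequences_needed(variant_frequency, power):
    """Smallest number of sequences giving at least the requested power."""
    n = 1
    while detection_probability(n, variant_frequency) < power:
        n += 1
    return n
```

Under this model, a variant making up less than 10% of the population is more likely to be missed than detected when only six sequences are available, which is consistent with the caution expressed above.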
In this study, of 29 subjects initially classified as subtype D using a small fragment of the HIV-1 pol gene, unique mosaic inter-subtype recombinant strains accounted for most infections (69.0%, A1/D or A1/C/D) using near full-length sequences. This proportion is higher than in previous studies of near full-length HIV-1 sequences from chronically infected individuals in Uganda, for which the reported prevalence of HIV-1 inter-subtype recombinants was 46% (92/200) [9], 39.3% (108/275) [37], 30% (14/46) [8] and, recently, 49.9% (232/465) [10]. The difference in percentages of recombinants could be due to the limited cohort size in our study (n = 29), which focused mainly on the unique sequences corresponding to T/F viruses. Overall, our study, together with others, suggests a high frequency of inter-subtype recombinants in Uganda, most of which are URFs combining subtypes A and D. The extraordinary viral diversity and highly pathogenic nature seen among some of these T/F URFs in our study pose a challenge for both vaccine development and treatment and support the view that vaccine products must be matched to the predominant subtype in a country rather than designed as universal vaccines. The viruses that crossed the mucosal barrier to establish clinical infection in the new host were, indeed, mainly unique recombinant forms of A1/D, suggesting that these recombinant variants must exhibit a selective advantage for transmission over the parental subtypes, but the molecular basis for this property is not yet fully understood.
Interestingly, two hotspots for recombination were identified, in the gp41 transmembrane coding region and the signal peptide-C1 region, as has been previously observed for different subtypes [38,39]. Furthermore, the directionality (5′ to 3′ of the NFL recombinant genomes) of the subtype switches was from D to A1 to D, with the final switch back to D occurring within the envelope transmembrane region. This recurrence of recombination patterns would suggest that generation of a fit hybrid A1/D virus may not be a random process, but that there must be structural and functional constraints that select for a virus with transmission fitness. In addition, the selection observed for the rev and tat genes, which results in exons 1 and 2 of both tat and rev belonging to the same subtype (D in this case) but interspaced by a region of subtype A1, seems not to be a random event, but rather one explainable by biological constraints. The second exons of tat and rev overlap with the gp41 coding area of HIV-1 env and must be appropriately spliced to create functional Tat and Rev proteins, which are two key viral regulatory factors for HIV gene expression [40,41]. Inter-subtype discordance between exons 1 and 2 of tat and rev might provide a functional bottleneck for Envelope recombinants with breakpoints upstream of the tat 2/rev 2 exons. Constraints in the overlapping env open reading frame could favor this selection of the same subtype in exons 1 and 2 of tat and rev, but this occurrence could also be due to pressure on tat [42,43]; in chimeras where the 5′ end is clade D, it may be advantageous if that is also D. Furthermore, subtype recombinants that have an Envelope cytoplasmic tail that matches the subtype of gag may be advantaged in virion incorporation of Envelope [44].
In addition, we observed that a previously reported amino acid signature, a histidine at position 12 (His12) in the Envelope signal peptide, was highly prevalent in subtype D Envelopes in our study, while the subtype A1 Envelope signal peptide completely lacked His12, and asparagine was predominant at this position. Histidine at position 12 occurred in virus strains with low viral replicative capacity, while departure from histidine at position 12 was seen for strains with high viral replicative capacity in this study. More detailed phylogenetic analyses will need to be conducted before conclusions on possible associations can be drawn. It is, however, of note that transmitted Envelope glycoproteins from HIV-1 subtype B with a basic amino acid at position 12 were found to be incorporated into virions at a higher density and had higher infectious titers than non-His12 signature envelopes, according to Asmal et al. 2011 [33]. Similarly, the Envelope His12 signature was identified by Gnanakaran et al. 2011 [34], and its expression levels were implicated in selection at viral transmission or early expansion. Thus, further exploration of this important signature with more sequences from acute infection among subtypes A, D and A/D recombinants is warranted.
HIV-1 transmission favors viruses with high infectivity and replication capacity, and a subject's founder virus replication capacity can predict the rate at which subtype C disease progresses [13,26,31,45]. While there was no significant difference in replicative capacity between subtype D and A/D viruses in this study, the replicative capacity of the initial infecting viral strain had a statistically significant impact on the trajectory of the CD4 counts in the first 4-6 years of follow-up in this small group of ART-naïve participants. In contrast, SPVL did not differ between the high and low VRC groups in this study, although there was a significant correlation between VRC and SPVL. Additional analyses with larger sample sizes and better representation of subtypes A, D and A/D recombinants will be more informative.

Conclusions
Subtyping based on a single partial genome region is inaccurate, as many of the recombinants identified here were previously missed when only pol was sequenced. Therefore, there is a need to use a full genome sequence, or at least multiple regions, for accurate subtyping of the viruses. The HIV-1 Envelope recombination patterns observed in this study further underpin the need for larger studies of HIV-1 acutely infected individuals, given that subtype-specific immunogens are being considered for vaccine development. The number of available T/F sequences has increased significantly over time, with fair representation from all subtypes, enabling extensive bioinformatic comparisons with chronic viral sequences to identify functional sequence domains unique to T/F viruses. Additionally, the full-length genome infectious molecular clones derived here will be further utilized in mucosal studies to elucidate mucosal transmission biology and to inform other immuno-pathogenesis studies currently ongoing within the IAVI consortium.