HIV-1 Gag-Pol Sequences from Ugandan Early Infections Reveal Sequence Variants Associated with Elevated Replication Capacity

The ability to efficiently establish a new infection is a critical property for human immunodeficiency virus type 1 (HIV-1). Although the envelope protein of the virus plays an essential role in receptor binding and internalization of the infecting virus, the structural proteins, the polymerase and the assembly of new virions may also play a role in establishing and spreading viral infection in a new host. We examined Ugandan viruses from newly infected patients and focused on the contribution of the Gag-Pol genes to replication capacity. A panel of Gag-Pol sequences generated using single genome amplification from incident HIV-1 infections were cloned into a common HIV-1 NL4.3 pol/env backbone and the influence of Gag-Pol changes on replication capacity was monitored. Using a novel protein domain approach, we then documented diversity in the functional protein domains across the Gag-Pol region and identified differences in the Gag-p6 domain that were frequently associated with higher in vitro replication.


Introduction
During early HIV-1 infection, viremia increases rapidly, reaching a peak within weeks of infection, then drops to a level (the set point viral load or SPVL) that can remain stable over months to years of asymptomatic infection [1]. High SPVL is a predictor of faster disease progression [2]. The determinants of SPVL are complex, involve the host's immune system as well as properties of the infecting virus, and have been a matter of intensive research. SPVL and viral control vary by infecting subtype, with subtype A associated with control [3,4]. Subtype D HIV-1 infections have an increased frequency of CXCR4 co-receptor usage [5,6] and faster CD4+ T cell decline [7], which could account for the more aggressive clinical course of HIV-1 subtype D infections compared with subtype A infections in sub-Saharan Africa [6,8,9,10,11].
Several studies report that the initial viruses establishing new HIV-1 infections may be important determinants of SPVL [12] and disease progression [13]. High viral replicative capacity (VRC) of transmitted HIV-1 among subtype C viruses has been associated with faster progression to disease [14,15]. Baalwa suggested that early subtype D viruses replicate more efficiently than subtype A [16] and subtype C viruses have lower VRC compared to other subtypes [17][18][19]. We asked if there were differences in VRC among Ugandan HIV-1 early viruses of subtypes A and D and their recombinants and set out to identify virus sequence features that might account for differences in VRC. The HIV-1 gag and pol genes are among the most conserved of the HIV-1 genome and in subtype C viruses appear to drive replication capacity and clinical outcomes [14,20]. Moreover, Gag-Pol chimeric viruses were shown to display similar VRC as the full-length HIV-1 genomes from which they were derived, supporting the idea that the Gag-Pol region was a major determinant of VRC. A large analysis of the Gag-Pol region from East African subtypes supported a hierarchy of inter-subtype recombinants replicating more highly in vitro than subtype D, which was in turn higher than subtypes A or C and identified changes in the Gag-p6 region that may play an important role among these chronically infected individuals [21]. Insertions in Gag-p6 are associated with increased replication as well as cooperation with protease resistance mutations [22][23][24]. Our study cohort consisted of HIV-seronegative individuals in the International AIDS Vaccine Initiative protocol C (IAVI protocol C) HIV epidemiology cohorts [25,26] who had been followed until seroconversion with frequent sampling intervals that allowed us to identify the virus near the time of transmission. We report here the molecular features of the Gag-Pol region of a set of these viruses and the contribution of these features to VRC. 
The results are important for determining the dynamics of HIV in human populations from East Africa where subtypes A, D and A/D recombinants predominate and may help identify sequence features associated with transmitted variants of distinct subtypes.

Study Subjects
This was a laboratory-based study incorporated into a larger multi-center primary HIV-1 infection cohort (IAVI protocol C) through Clinical Research Centers in Uganda, Kenya, Rwanda, Zambia and South Africa [26]. The protocol C study objectives were to follow the immunologic, virologic and clinical parameters in HIV-infected volunteers with a date of infection that could be accurately defined. In this study, data and samples were obtained from Ugandan participants, all initially HIV negative. Individuals who seroconverted were enrolled in IAVI protocol C. All were heterosexual individuals at high risk from the general population and from HIV-1 sero-discordant couples. Participants who became newly infected (tested positive for p24-antigen ELISA or HIV antibody) were invited to enroll. The estimated date of HIV infection (EDI) was defined as the midpoint between the last negative and first positive HIV antibody test, 14 days before the first positive p24 antigen test, 10 days before the first positive viral load test in the absence of p24 antigen or rapid HIV antibodies or the date of a self-reported high-risk exposure event. All participants were seen monthly until 3 months after EDI, then quarterly until 24 months and semi-annually thereafter. This study utilized protocol C stored plasma samples from 60 participants within 90 days post-EDI.
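The EDI rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic stated in the text; the function names are hypothetical, and the protocol's handling of odd-length intervals and of cases where several rules apply is not described here, so none is assumed.

```python
from datetime import date, timedelta


def edi_from_antibody(last_negative: date, first_positive: date) -> date:
    """Midpoint between the last negative and first positive antibody test.

    For an odd number of days the half day is dropped (an assumption;
    the text does not say how such cases were rounded).
    """
    return last_negative + (first_positive - last_negative) / 2


def edi_from_p24(first_positive_p24: date) -> date:
    """14 days before the first positive p24 antigen test."""
    return first_positive_p24 - timedelta(days=14)


def edi_from_viral_load(first_positive_vl: date) -> date:
    """10 days before the first positive viral load test
    (used when p24 antigen and rapid antibody tests are negative)."""
    return first_positive_vl - timedelta(days=10)


def within_sampling_window(edi: date, sample_date: date, max_days: int = 90) -> bool:
    """True if a plasma sample was drawn within max_days after the EDI."""
    return 0 <= (sample_date - edi).days <= max_days
```

For example, a last negative antibody test on 1 January and a first positive test on 31 January give an EDI of 16 January, and a sample drawn 60 days later falls inside the 90-day window used in this study.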

Amplification and Sequencing of Transmitted Virus for Identification of Early Gag-Pol Sequences
Viral RNA was isolated from 140 µL plasma using a QIAamp Viral RNA Mini Kit (Qiagen Inc, Valencia, CA, USA). RNA was either frozen at −80 °C or immediately used to synthesize cDNA using SuperScript IV (Invitrogen, Ljubljana, Slovenia). Using the reverse primer 5FIV-R1 (5′-CTYTTTCTCCTGTATGCAGACCCC-3′; nucleotides 5272 to 5249 of the HXB2 sequence), cDNA was generated that served as a template to amplify a 5 kb 5′ half viral genome fragment spanning the Gag-Pol region. For single genome amplification (SGA), the cDNA was serially diluted in replicates of eight and subjected to nested PCR amplification with HIV-specific primers: 5FIV-R1 and RVDA-F1 (5′-GGGTCTCTCTDGTTAGACCAGAT-3′) for first round PCR and RVDA-F1 and 5FVR22 (5′-CCTAGTGGGATGTGTACTTCTGAAC-3′) for second round PCR. cDNA dilutions that yielded >30% PCR positive wells were retested in 96-well plates to identify a dilution where <30% of wells were positive for amplification products; these procedures and primers have been previously described in detail [27]. To ensure amplification from single molecules and avoid in vitro PCR artefacts, 8-10 SGA amplicons were generated per patient and these were sequenced using di-deoxy sequencing technology (Applied Biosystems 3500), aligned and analyzed using Sequencher and Geneious software to infer an early infection consensus sequence. HIV-1 subtype classification was done using REGA (http://hivdb.stanford.edu/), the Recombination Identification Program (RIP) (http://www.hiv.lanl.gov/content/sequence/RIP/RIP.html) and jpHMM (GOBICS; University of Göttingen) [28][29][30] (Table 1). The jpHMM tool (http://jphmm.gobics.de/submission_hiv) was used to obtain recombination breakpoints, and the recombinant HIV-1 drawing tool from Los Alamos National Laboratories (https://www.hiv.lanl.gov/content/sequence/DRAW_CRF/recom_mapper.html) was used to generate the recombinant breakpoint maps.
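The text does not spell out why a dilution giving fewer than 30% positive wells is chosen. A common rationale, assumed here, is that template molecules are distributed across wells according to a Poisson model, so that at low positivity most positive wells started from a single molecule. The short sketch below works through that arithmetic; the function names are illustrative only.

```python
import math


def mean_templates_per_well(fraction_positive: float) -> float:
    """Poisson mean (lambda) implied by the observed fraction of positive wells.

    Under a Poisson model, P(well is positive) = 1 - exp(-lambda).
    """
    if not 0.0 < fraction_positive < 1.0:
        raise ValueError("fraction_positive must be strictly between 0 and 1")
    return -math.log(1.0 - fraction_positive)


def single_template_fraction(fraction_positive: float) -> float:
    """Expected share of positive wells that contain exactly one template.

    P(exactly one | at least one) = lambda * exp(-lambda) / (1 - exp(-lambda)).
    """
    lam = mean_templates_per_well(fraction_positive)
    return lam * math.exp(-lam) / fraction_positive


if __name__ == "__main__":
    for frac in (0.10, 0.20, 0.30):
        share = single_template_fraction(frac)
        print(f"{frac:.0%} positive wells: {share:.1%} expected single-template")
```

Under this assumed model, a plate with 30% positive wells is expected to have about 83% of its positive wells derived from a single template, and the share rises as the dilution increases.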

In Vitro Assay for HIV-1 Replicative Capacity
To assess the VRC of Gag-Pol NL4.3 chimeras, 5 × 10⁵ GXR25 cells [31] were infected at a multiplicity of infection (MOI) of 0.05. GXR25 cells and chimeric viruses were incubated with 5 µg/mL polybrene at 37 °C for 3 h, washed 5× with complete Roswell Park Memorial Institute 1640 medium (RPMI) and plated into 24-well plates. Cells were split 1:2 to maintain confluency by replacement with an equal amount of fresh media. Viral supernatants were collected on days 2, 4, 6, 8 and 10 [32] and virions were quantified using a 33P-labeled reverse transcriptase assay and the colorimetric assay, as described below. The optimal window for logarithmic growth was determined to be between days 2 and 6. Replication capacity values were generated by dividing the area under the curve (AUC) for days 2-6 of the chimeric viruses by the AUC of the NL4.3 wildtype after subtracting the negative control [14]. Two independent Gag-Pol NL4.3 chimera clones per participant were run to confirm cloning fidelity.
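The normalization described above can be illustrated with a minimal sketch. Several details are assumptions rather than statements from the text: the AUC is computed with the trapezoid rule on untransformed RT readings, and the negative control is subtracted from each reading before integration. The function names and the numbers in the usage note are hypothetical.

```python
from typing import Sequence


def trapezoid_auc(days: Sequence[float], values: Sequence[float]) -> float:
    """Area under a growth curve by the trapezoid rule."""
    if len(days) != len(values) or len(days) < 2:
        raise ValueError("need matching day and value lists with >= 2 points")
    return sum(
        (days[i + 1] - days[i]) * (values[i] + values[i + 1]) / 2.0
        for i in range(len(days) - 1)
    )


def replication_capacity(
    days: Sequence[float],
    chimera: Sequence[float],
    wildtype: Sequence[float],
    negative: Sequence[float],
) -> float:
    """AUC of the chimera divided by AUC of NL4.3 wildtype.

    The negative control is subtracted from every reading first
    (assumed order of operations).
    """
    chimera_corrected = [c - n for c, n in zip(chimera, negative)]
    wildtype_corrected = [w - n for w, n in zip(wildtype, negative)]
    return trapezoid_auc(days, chimera_corrected) / trapezoid_auc(days, wildtype_corrected)
```

With illustrative readings on days 2, 4 and 6 of 10, 60 and 510 for a chimera, 10, 110 and 1010 for NL4.3, and a constant background of 10, the function returns 0.5, meaning the chimera replicated at half the wildtype level over the logarithmic window.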

Quantification of HIV-1 Reverse Transcriptase Using Radioactive and Colorimetric Assays
Culture supernatant aliquots from infected cells were added to a reverse transcriptase (RT) PCR master mix and incubated at 37 °C for 2 h; then the RT-PCR product was blotted onto DE-81 paper and allowed to dry. Blots were washed five times with saline sodium citrate buffer (SSC) and three times with 90% ethanol, allowed to dry and exposed to a phosphoscreen overnight. Counts were read using a Cyclone Phosphorimager [32]. The colorimetric RT assay takes advantage of the ability of reverse transcriptase to synthesize DNA using the hybrid poly(A) × oligo(dT)15 as a template and primer. It avoids the use of [3H]- or [32P]-labeled nucleotides that are employed in standard RT assays. In place of radiolabeled nucleotides, digoxigenin- and biotin-labeled nucleotides in an optimized ratio are incorporated into the same DNA molecule by the RT activity. The detection and quantification of the synthesized DNA as a parameter for RT activity follows a sandwich ELISA protocol: biotin-labeled DNA binds to the surface of streptavidin-coated microplate modules. In the next step, an antibody to digoxigenin, conjugated to peroxidase (anti-DIG-POD), is added and binds to the digoxigenin-labeled nucleotides (licensed by Institut Pasteur). In the final step, the peroxidase substrate ABTS is added. The peroxidase enzyme catalyzes the cleavage of the substrate to produce a colored reaction product. The absorbance of the samples was determined using a microplate (ELISA) reader and was directly correlated to the level of RT activity in the sample according to the manufacturer's instructions (Sigma-Aldrich, Munich, Germany; content version May 2016).

Protein Domain Methods
For the initial analysis, the encoded Pfam domains were identified using HMMER-3.2.1 [33] (http://hmmer.org/) with the Pfam database (Pfam 32.0, September 2018; http://pfam.xfam.org/) [34]. For each sequence, all open reading frames ≥75 amino acids were determined from both reading strands and examined for Pfam content. A domain hit was retained if the domain i-Evalue was <0.0001. Details of each domain instance were gathered, including position in the query genome, length, domain i-Evalue and bit score. For the analysis in Figure 5, all full or nearly full HIV-1 genomes were retrieved from GenBank using the query (txid11676[Organism] AND 8000[SLEN]:11000[SLEN]) and HIV-1 subtype classification was performed using the KAMERIS tool [35].
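The hit-filtering step can be sketched as follows. This is not the authors' pipeline; it assumes the domain search was run with hmmscan and its per-domain table written with the --domtblout option, in which each whitespace-separated row carries the domain name, query name, independent E-value, domain bit score and alignment coordinates at fixed column positions. The DomainHit type and function name are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class DomainHit(NamedTuple):
    domain: str       # profile HMM name (target column in hmmscan output)
    query: str        # ORF or sequence identifier
    i_evalue: float   # independent E-value of this domain instance
    bit_score: float  # per-domain bit score
    ali_from: int     # start of the alignment on the query
    ali_to: int       # end of the alignment on the query


def parse_domtblout(lines: Iterable[str], max_i_evalue: float = 1e-4) -> List[DomainHit]:
    """Keep domain instances whose i-Evalue is below the cutoff (0.0001 in the text)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comment and blank lines
        fields = line.split()
        hit = DomainHit(
            domain=fields[0],
            query=fields[3],
            i_evalue=float(fields[12]),
            bit_score=float(fields[13]),
            ali_from=int(fields[17]),
            ali_to=int(fields[18]),
        )
        if hit.i_evalue < max_i_evalue:
            hits.append(hit)
    return hits
```

If hmmsearch were used instead of hmmscan, the target and query columns would swap roles, so the field assignments above would need to be adjusted.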

Participant and Virus Characteristics
Thirty-two Ugandan protocol C participants had sequences successfully cloned from early samples drawn within 90 days of EDI and had their VRC characterized. Table 1 shows the participants' characteristics. Three analysis tools, REGA, RIP and jpHMM [28][29][30] were used to assign subtypes and identify possible recombinants. We observed 6 with subtype A1, 13 with subtype D and 13 inter-subtype recombinants. The recombinants identified were A1D (10), A1C (1), CD (1) and a complex recombinant of subtypes E, F1, G and A (1) (Table 1, Figure 1).

Figure 1. Recombinant breakpoint maps. The maps were generated using the jpHMM website and the recombinant HIV-1 drawing tool from the LANL website as described in Materials and Methods. Key to colors: red, A1; light green, D; brown, C; dark green, G; light blue, 01_AE.

Gag-Pol-NL4.3 Chimeras Showed a Range of Replicative Capacities
VRC was measured using Gag-Pol chimeras of early virus Gag-Pol cloned into an NL4.3 clone backbone [20,32]. The normalized VRC values of the chimeras for days 2-6 (logarithmic growth phase of these viruses) relative to wildtype NL4.3 ranged from 0.07 to 1.34 (Figure 2). The viral replicative capacity scores appeared to be biphasic and, accordingly, we used two groups (LowVRC ≤ 0.8 and HighVRC ≥ 0.8). The results demonstrate that replacement with a novel Gag-Pol region can have measurable effects on the ability of the virus to replicate in cell culture. When sequences were arranged by VRC (Table 1), the subtype A1 sequences showed the lowest VRC values while subtype D, followed by the recombinants, showed higher VRC values. The subtype of the Gag-P6 region within each sequence (Table 1) shows a pattern, with higher VRC values found in sequences with non-A1 Gag-P6 and the highest VRCs found in viruses with more complex Gag-P6 regions.

There Was No Difference in Set Point Viral Load, CD4+ T Cell Count Decline and Subtypes
Previous studies have documented the importance of the transmitted/founder (T/F) virus genotype in determining HIV-1 subtype B and C SPVL [36,37]. However, we observed no statistical correlation between the replication capacity of the Gag-Pol NL4.3 chimera and SPVL in this cohort of subtype A1, D and A1D recombinants (Figure 3B). The time taken for the CD4+ cell count to drop to less than 350 cells/µL between subtypes A1, D and recombinants also showed no statistical difference (Figure 3A).

Protein Domain Diversity of Gag-Pol Regions
To gain information about changes in viral protein functions associated with and perhaps influencing replication capacity, we used Pfam profile hidden Markov models (profile HMMs) to document differences in functional protein domains encoded by the viruses. Profile HMMs provide a statistical description of protein domains or cleavage sites and can be used to identify domains as well as to document changes in domain sequences relative to a reference set [34,38]. The functional domains of HIV-1 are well studied and provide a good starting point to identify protein motifs whose variation might influence virus replication. The 13 domains from the HIV-1 Gag-Pol region are described by Pfam, and preliminary results showed that seven domains (DUF935, zf-CCHC_2 and Gag-P6 in the Gag protein, and gag_asp_proteas, RVT_thumb, integrase_Zn and rve_3 in the Pol protein, marked in green and orange in Figure 4A) showed variation in the set of 32 sequences (Figure 4B).

Variation of Gag-Pol Domains Linked to Elevated VRC
Using the Pfam domains [36] found in HIV-1 as guides, we prepared custom domains based on alignments from 391 subtype A1 complete genomes found in GenBank (see Section 2.7). Using A1 as the reference domain set allowed us to detect differences in the query sequences from the A1 type domains. For each of the 32 query sequences, the instances of the seven domains within the query sequences were identified and their domain bit scores were collected (the bit score measures how closely the query matches the reference domain, so a lower score indicates greater distance from the reference). The major contributors to variation were the Gag-P6 domain and the zinc finger CCHC domain, although modest changes were observed in the other domains (Figure 4B).
Stratifying the Gag-Pol sequences into four subtype categories (A1, D, A1D and Other_recombinants) revealed important patterns (Figure 3). In vitro replication as measured by VRC was clearly different across the four groups, with the non-recombinant groups A1 and D showing lower VRC than the recombinants A1D and Other_recombinants (CD, A1C, A1AEF) (Figure 5A). Combined total Pfam bit scores of all seven domains were calculated as a measure of how different the sequences were from the subtype A1 reference set. When total scores were compared across the four groups, the reverse pattern was seen, with the A1 sequences showing the highest scores (as expected, they were closest to the subtype A1 reference set) and the other groups showing more distance from subtype A1 sequences (Figure 5B). Within the domains analyzed, the major contribution to the distance score was in the Gag-P6 domain and, accordingly, the Gag-P6 scores showed a similar pattern to the total score (Figure 5C).
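The combined score and the group comparison can be sketched as below. The code is illustrative only: the function names are hypothetical, and summarizing each group by its median is an assumption, since the text does not state which summary statistic underlies the group comparison.

```python
from collections import defaultdict
from statistics import median
from typing import Dict


def total_bit_scores(domain_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum the per-domain bit scores of each sequence into one combined score.

    domain_scores maps sequence id -> {domain name -> bit score}.
    """
    return {seq: sum(scores.values()) for seq, scores in domain_scores.items()}


def median_by_group(values: Dict[str, float], groups: Dict[str, str]) -> Dict[str, float]:
    """Median value per subtype category (for example A1, D, A1D, Other_recombinants)."""
    grouped = defaultdict(list)
    for seq, value in values.items():
        grouped[groups[seq]].append(value)
    return {group: median(members) for group, members in grouped.items()}
```

Because the reference domains were built from subtype A1 genomes, a lower combined score for a group indicates that its sequences sit further from the A1 reference, which is the direction of the comparison described above.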

Protein Changes in Gag-P6 Region
A sequence logo of the Gag protein alignment shows the positions and residues unique to the low VRC sequences (Figure 6). The first proline in the Gag-P6 motif is part of the protease cleavage site 5′ to the Gag-P6 and seven of the eight low VRC sequences have a proline at this site (cleavage site FP), while there is leucine (cleavage site FL) in the majority of the medium and high VRC sequences (Figure 4). Similarly, low VRC sequences have either a proline or cysteine at position 36 near to the carboxy-terminal cleavage site flanking the Gag-P6 domain. These changes to or from proline near essential protease cleavage sites are expected to alter the local secondary structure and may play important roles in determining the efficiency of Gag polyprotein processing.
Figure 6. Protein changes in Gag-P6 region. The amino acid sequences of the Gag-P6 domains from the 32 sequences were aligned and a sequence logo was generated using Weblogo3 [39]. Amino acids are indicated by a single letter code with the height of each letter stack indicating conservation at that position (measured in entropy bits, see [39]) and the height of the letter within the stack indicating the relative frequency of the amino acid at that position. Amino acids found only in the genomes with VRC ≤ 0.4 are indicated in red.
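The stack and letter heights described in the Figure 6 legend follow from a per-column entropy calculation, sketched below for one alignment column. This is a simplified illustration rather than a reimplementation of Weblogo3: it assumes a 20-letter amino acid alphabet, ignores gap characters and omits the small-sample correction that logo software may apply. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Conservation of one alignment column in bits: log2(alphabet) - entropy."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column: str, alphabet_size: int = 20) -> Dict[str, float]:
    """Height of each letter: its relative frequency times the stack height."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return {}
    info = column_information(column, alphabet_size)
    total = len(residues)
    return {aa: (n / total) * info for aa, n in Counter(residues).items()}
```

A fully conserved column such as "PPPP" reaches the maximum stack height of log2(20), about 4.32 bits, whereas a column split evenly between proline and leucine loses one bit and each letter takes half of the remaining height.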

Global Gag-P6 Domain Variation
Because of the complexity of early infection identification, sequencing and VRC determination, our sample size was modest at 32 infections. To get an indication of the generality of Gag-P6 variation in HIV-1 biology, we expanded our analysis to include all available HIV-1 full genome sequences. We asked if the observed Gag-P6 domain variation occurred in HIV-1 genomes from chronic infections. To answer this question, all available HIV-1 complete genome sequences were retrieved from GenBank (12,571 genomes, 30 October 2019) and classified by subtype. The majority of the HIV-1 genome sequences in GenBank are expected to be derived from chronic infections because acute infection is (by definition) time-limited and acute infection samples are difficult to obtain. For all available near-full-length HIV genomes, subtypes were determined, the Gag-P6 Pfam bit scores were determined and, for each subtype, a median Gag-P6 Pfam bit score was calculated. We then compared the 32 early Gag-P6 Pfam bit scores generated from the acute infection study to the median values for the GenBank set of 12,571 genomes (Figure 7). We found that 21 of the Gag-P6 bit scores fell below the median value for their corresponding subtype (showing greater protein distance from the subtype A1 reference) and 14 of 32 scores fell below the interquartile range, the normal range of variation found in viruses from chronic infections (Figure 5). This shows increased variability (lower bit scores) in the Gag-P6 domains of early infection sequences relative to the Gag-P6 domains from chronic infections.
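The comparison of each early infection score with its subtype's reference distribution can be sketched as follows. The function name and labels are hypothetical, and the quartile definition is whatever Python's statistics.quantiles uses by default, which may differ from the method used for the published figure.

```python
from statistics import median, quantiles
from typing import Sequence


def classify_against_reference(score: float, reference_scores: Sequence[float]) -> str:
    """Place one early-infection bit score relative to a subtype's reference scores.

    Returns "below_iqr" if the score is under the first quartile,
    "below_median" if it lies between the first quartile and the median,
    and "at_or_above_median" otherwise.
    """
    first_quartile, _, _ = quantiles(reference_scores, n=4)
    reference_median = median(reference_scores)
    if score < first_quartile:
        return "below_iqr"
    if score < reference_median:
        return "below_median"
    return "at_or_above_median"
```

Note that the labels are mutually exclusive, so a count of scores below the subtype median, as reported above, corresponds to the sum of the "below_iqr" and "below_median" labels.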

Discussion
In this study, we documented the VRC supported by Gag-Pol gene chimeras with NL4.3 viruses generated from 32 Ugandan adults with very early HIV infection. The study included the subtypes typically observed in Uganda, that is, subtype A, D and A1D recombinants. The recombinant breakpoints greatly varied among the 13 recombinants identified in this study, as shown in Figure 1. Our results indicate that the set of Gag-Pol genes described here support a range of VRCs, with some variants showing a higher VRC than that of the wildtype NL4.3. In general, subtype A1 had the lowest VRC, followed by subtype D, with inter-subtype recombinants having the greatest VRC.
When only the subtype classification of the Gag-p6 region is considered (Figure 1), this ranking is consistent with earlier reports of inter-subtype differences in disease progression in which recombinants progressed fastest, followed by subtype D, with subtype A progressing the slowest [9][10][11]. Our results are also consistent with earlier studies in West Africa that showed inter-subtype recombinants having higher replicative fitness than pure subtypes [39,40]. Another study in East African cohorts showed a similar hierarchy of Gag protease-driven replication capacities, with subtypes A or C replicating least, followed by D, and inter-subtype recombinants replicating most [21].
Increasing evidence indicates that in vitro VRC is a strong indicator of HIV pathogenicity in the patient [14,20,41,42]. Here, we observed that while there were differences in VRC between subtypes A, D and recombinant Gag-Pol, there was no correlation between VRC and CD4+ cell count levels or viral load in the small number of patients examined (results not shown). There was, however, a trend where most high replicators progressed faster to CD4+ counts of less than 350 cells/µL in the first 5 years of infection, although this was not statistically significant. No trends or significant correlations between SPVL and VRC were observed (results not shown). This suggests that the VRC of the initial infecting strain may have limited impact on these important long-term markers of HIV pathogenesis.
To gain information about viral protein functions that might be associated with the observed differences in replication capacity, we monitored changes in the Pfam profile hidden Markov models found in these sequences to reveal differences in functional or defined protein domains in the Gag-Pol genes. Rather than categorizing VRC by general subtype, the domain analysis we performed provided a more detailed focus on changes in protein domains with functional attributes. Across the set of 32 sequences, there was variability in three domains in the Gag coding region: a domain of unknown function, DUF935, in the amino-terminal half of the protein, the zinc finger motif zf-CCHC_2, and the Gag-p6 domain near the C-terminus, which overlaps with the Pol coding region. Gag-p6 is a major phosphoprotein of HIV-1 that has been shown to play an important role in the release of the virus from infected cells [43]. The four viruses with the highest VRCs showed the greatest variation in the Gag-p6 domain (lowest HMM bit scores), suggesting that changes in this domain may influence viral replication. Two sequences had insertions related to a PYxE insert previously observed in subtype C viruses with elevated virulence [44]. The PYxE motif may be involved in the ALIX (ALG-2 (apoptosis-linked gene 2)-interacting protein X)-mediated virus release pathway [45], and recently the insertion of this tetrapeptide has been implicated in the restoration of Gag binding to ALIX with enhanced viral fitness in the presence or absence of the antiretroviral drugs lopinavir and tenofovir alafenamide [23].
The HIV-1 nucleocapsid protein carries two zinc fingers and is located at the C-terminus of Gag, trailed by the p6 domain. The zf-CCHC_2 domain is one of the two zinc finger domains in the Gag nucleocapsid protein and both are required for protein localization, genomic RNA binding and encapsidation [46][47][48]. All zinc finger changes or mutations in one study were shown to negatively affect virus replication and maturation [49]. The Gag-p6 domain is needed for particle budding, during which the viral particles pinch off from the cellular membrane [50]. The p6 domain additionally contains proline-rich and di-leucine areas, which are the targets of the cellular proteins Tsg101 and Alix, respectively, which are involved in the cellular class E protein sorting pathway and HIV-1 budding machinery [51,52].
We asked if the observed Gag-P6 domain variants were unique to incident viruses or if similar variation can be observed in HIV-1 genomes derived from chronic infection. We examined the Gag-P6 domain from all available full or nearly full genomes from GenBank (Figure 5). Comparing the Gag-P6 bit scores (a measure of the distance of the query sequence to the reference domain) to median scores for each HIV-1 subtype showed that 21 of the early infection sequences had Gag-P6 bit scores below the median value for their subtype (Figure 5). Lower Gag-P6 bit scores indicate greater variation from the A1 reference domain, so the early infection sequences show a tendency toward changes in Gag-P6. The Gag-P6 region is emerging as an important determinant of HIV-1 replication [23,44,45]. Although it seems unlikely that a Gag-P6 variant unique to early infection sequences exists, the increased variation at this site observed in this small set of 32 patients is consistent with the domain playing a role in transmission. It is also notable that additional changes were observed in six other Gag-Pol domains (Figure 2) and these may cooperate with the Gag-p6 alterations in viruses associated with transmission. The first proline in the Gag-P6 motif is part of the protease cleavage site 5′ to Gag-P6 and seven of the eight low VRC sequences have a proline at this site (cleavage site FP), while there is leucine (cleavage site FL) in the majority of the medium and high VRC sequences (Figure 4). Similarly, low VRC sequences have either a proline or cysteine at position 36 near to the carboxy-terminal cleavage site flanking the Gag-P6 domain. These changes to or from proline near essential protease cleavage sites may play important roles in determining the efficiency of Gag polyprotein processing, which in turn influences viral packaging and viral load and perhaps plays an important role in establishing early infection.
It should be noted that the proline to serine or proline to leucine coding changes require only a 1 nt change and may account for the diversity observed at this site. One can speculate that as infections progress to a chronic stage, it may be useful to reduce viral loads to avoid immune responses and simple amino acid switches might be involved.
Our study had some limitations. The effort required for SGA cloning limited the number of sequences available. The VRC measurement is a simplified measure of virus replication in the absence of immune responses, and the measurements were performed using a query Gag-Pol sequence within an NL4.3 backbone virus. This potentially misses more complex interactions between the Gag-Pol region and the rest of the virus. However, despite the modest sample size, we were still able to observe strong differences in VRC by HIV-1 subtype. The samples were obtained in 2006-2011 and HIV-1 evolution has continued. However, the global analysis shown in Figure 5 included more recent sequence data up to December 2019 and the Gag-p6 variations we observed in the set of 32 early infection sequences appeared to be representative of the entire HIV-1 epidemic.
In conclusion, the current study has revealed crucial features of the HIV-1 Gag-Pol region, especially in the Gag-p6 domain, that influence viral replicative capacity and may play a role in establishing new HIV-1 infections.