HIV-1 Subtypes and 5′LTR-Leader Sequence Variants Correlate with Seroconversion Status in Pumwani Sex Worker Cohort

Within the Pumwani sex worker cohort, a subgroup remains seronegative, despite frequent exposure to HIV-1; some of them seroconverted several years later. This study attempts to identify viral variations in 5′LTR-leader sequences (5′LTR-LS) that might contribute to the late seroconversion. The 5′LTR-LS contains sites essential for replication and genome packaging, viz, primer binding site (PBS), major splice donor (SD), and major packaging signal (PS). The 5′LTR-LS of 20 late seroconverters (LSC) and 122 early seroconverters (EC) were amplified, cloned, and sequenced. HelixTree 6.4.3 was employed to classify HIV subtypes and sequence variants based on seroconversion status. We find that HIV-1 subtypes A1.UG and D.UG were overrepresented in the viruses infecting the LSC (P < 0.0001). Specific variants of PBS (Pc < 0.0001), SD1 (Pc < 0.0001), and PS (Pc < 0.0001) were present only in the viral population from EC or LSC. Combinations of PBS [PBS-2 (Pc < 0.0001) and PBS-3 (Pc < 0.0001)] variants with specific SD sequences were only seen in LSC or EC. Combinations of A1.KE or D with specific PBS and SD variants were only present in LSC or EC (Pc < 0.0001). Furthermore, PBS variants only present in LSC co-clustered with PBS references utilizing tRNAArg; whereas, the PBS variants identified only in EC co-clustered with PBS references using tRNALys,3 and its variants. This is the first report that specific PBS, SD1, and PS sequence variants within 5′LTR-LS are associated with HIV-1 seroconversion, and it could aid designing effective anti-HIV strategies.


Introduction
In 2016, there were 62,000 new HIV infections, and 1,600,000 people living with HIV in Kenya [1]. Efforts are underway, globally, to find ways to prevent infection, as well as to explore practical cures for HIV [2,3]. The most at-risk individuals for infection by HIV are commercial sex workers (CSW), intravenous drug users, and men who have sex with men (MSM). The CSW population is at increased risk, as they may have hundreds of sexual partners each year. Compounding the risk of infection and transmission, many of them could be intravenous drug users, and/or may be infected with other sexually transmitted pathogens that could enhance HIV transmission [4]. Kenya has around 133,675 sex workers [1]. The percentage of female sex workers in a population has been reported to strongly correlate with total HIV/AIDS prevalence [4]. Female sex workers have a 13.5-fold higher [...] genome packaging capability [29]. Primer-binding site, SD1, and PS are therefore pivotal sequence elements for the replication and proliferation of HIV-1.
The presence of these three essential sites in the 5′LTR-leader sequence led us to choose this region to examine the viruses infecting the late seroconverters. We compared 5′LTR-leader sequences from the late seroconverters with those from women who were seropositive at enrollment, or seroconverted within the first three years of enrollment in the Pumwani sex worker cohort. We hypothesized that late seroconverters were infected with specific variants of HIV-1, whose distinct 5′ leader sequence profile could confer potential replicative advantages, besides efficient genome packaging capability.

Sample Collection
HIV-1 positive sex workers and late seroconverters from the Pumwani sex worker cohort were selected for this study. No anti-retroviral treatments (ARTs) were available during the sample collection period in Kenya, thus, none of the samples analyzed in this study were confounded by ARTs. Informed written consent was obtained from all study subjects. The University of Manitoba, as well as University of Nairobi ethics review panels, have approved studies with these subjects. Women in this cohort are routinely screened for HIV-1 infection by serology and PCR amplification for the env, nef, and vif genes. Women were defined as resistant to HIV-1 infection if they remain HIV-1 seronegative and PCR negative for a minimum of three years of follow up after enrollment [6]. The late seroconverters were defined as those who seroconverted after meeting the defined resistance criteria [6,17]. In this study, 20 patients met this criterion. Seven of these patients also had samples collected at different dates since seroconversion. The control population consisted of 122 seropositive patients, of which 101 women were positive at enrollment, and 21 seroconverted within three years after enrollment. Sixteen control subjects had more than one timepoint sample. The average seronegative time of the late seroconverters is 5.94 ± 2.92 years, compared to an average of 0.80 ± 0.70 seronegative years of the 21 seroconverters in the positive control group.

Genomic DNA Isolation and Nested PCR Amplification of Partial 5′LTR of HIV-1
Genomic DNA was isolated from peripheral blood mononuclear cells of the study subjects using QIAamp DNA Mini Kit (Qiagen Inc., Mississauga, ON, Canada). Nested PCR was carried out, using Expand High Fidelity PCR system (Roche Diagnostics, Mannheim, Germany), to amplify a 2 kb fragment containing partial 5′LTR, HIV-1 gag, and partial protease gene (found in pol) (Figure 1A,B). Primers HIV71-89F (5′-CTTCCCTGATTGGCAGAAY-3′) and HIVseq2692R (5′-GGATTTTCAGGCCCAATTTTTG-3′) were used for the first round of amplification. The PCR cycle conditions were 2 min initial denaturation at 94 °C, followed by 35 cycles of 15 s at 94 °C, 30 s at 53 °C, and 5 min at 68 °C, with final extension at 68 °C for 15 min. Primers Gag PCR outerF (5′-AATCTCTAGCAGTGGCGCCCGAACAG-3′) and GagRT (5′-CCATTGTTTAACCTTTGGGCCATCCA-3′) were used for the second round PCR reaction. One microliter of PCR product from first round amplification was used as template. Thermal cycler parameters were set as 94 °C for 2 min, 35 cycles of 94 °C for 15 s, 59 °C for 30 s, and 68 °C for 4 min, with final extension at 68 °C for 10 min. All PCR products were examined using 1% agarose gel electrophoresis.
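The first-round forward primer HIV71-89F ends in the IUPAC degenerate base Y (C or T), so it denotes a mixture of two oligonucleotides. A minimal sketch of expanding such a primer follows; the helper name `expand_degenerate` is illustrative, and only a subset of the IUPAC codes is included.

```python
from itertools import product

# Subset of IUPAC nucleotide codes (assumption: enough for the primers in this section).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def expand_degenerate(primer):
    """Return every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# First-round forward primer HIV71-89F as given in the text.
variants = expand_degenerate("CTTCCCTGATTGGCAGAAY")
print(variants)
```

A primer without degenerate bases expands to a single sequence, so the same helper can be applied to all four primers.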

Cloning and Sequencing of Amplified Partial 5′LTR Sequences
Prior to cloning, the PCR products were TA extended. Each TA extended PCR product was ligated into pCR®4-TOPO® vector (TOPO TA Cloning Kit for Sequencing, Invitrogen Life Technologies, Carlsbad, CA, USA) and transformed into One Shot® TOP10 Chemically Competent E. coli. Forty-eight clones were picked from each sample and cultured for 16-20 h in 2 mL LB medium with ampicillin (200 μg/mL). Bacteria cultures were pelleted by centrifugation for 6 min at 1900 × g. QIAprep 96 Turbo Miniprep Kit protocol was used to isolate plasmids containing the amplified HIV-1 fragment. EcoRI restriction digestion and agarose gel electrophoresis were conducted to detect the presence of insert DNA. T3 and T7 sequencing primers were used to sequence the clones using BigDye version 3.1 Cycle sequencing kit (Applied Biosystems™, Carlsbad, CA, USA), and the products were analyzed with an ABI3730XL DNA Analyzer, available at the DNACORE facility of the National Microbiology Laboratory, Winnipeg, Manitoba, Canada.

Sequence and Phylogenetic Analyses
The sequences were examined using Sequencher version 4.6 (Gene Codes Corporation, MI, USA). HIV gag sequences were removed and 160 nucleotide sequences of partial 5′LTR region, including the part of U5 and untranslated leader sequence, were retained for further analysis. Close to 4000 5′LTR leader sequences have been generated. Phylogenetic analysis using MEGA 3.1 [39] was done to classify viral subtypes. Briefly, partial 5′LTR sequences were aligned with 51 reference sequences obtained from HIV sequence database [19]. Alignment was done with ClustalW and phylogenetic trees were generated. Alignment and phylogenetic relatedness to reference sequences permitted subtype identification for each clone. To confirm the subtype assignment of 5′LTR sequences by phylogenetic analysis, we also conducted phylogenetic analysis of p17 sequences of these cloned sequences. The results confirmed the subtype assignment using the sequences of the partial 5′LTR region. Two examples of the phylogenetic analysis using p17 sequences are shown in Supplemental Figures S1 and S2. To assess the possible function of PBS variants observed in the present study, 19 published PBS sequences that used different tRNA primers for reverse transcription were taken as references to construct a maximum likelihood phylogenetic tree, using MEGA 6. The 19 PBS sequences in the reference alignment included those corresponding to tRNALys,3 (wild-type), tRNALys1,2, tRNALys,5a, tRNALys,1, EctRNALys,3 (E. coli tRNA), tRNAPro, tRNAIle, tRNAMet, tRNAMet(e) (used in elongation), tRNAMet(i) (used in initiation), tRNAMet(i)AG (contains a transition), tRNASer, tRNAPhe, tRNAThr, tRNAGln,1, tRNAGln,3, tRNAHis, tRNAArg(ACG), and tRNAArg(CCU) [32,34,40-47].
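Subtypes in this study were assigned from ClustalW alignments and MEGA phylogenetic trees. As a much simpler stand-in, and not the authors' method, the sketch below assigns an aligned clone to the subtype of the reference with the smallest p-distance (the fraction of mismatched sites, ignoring gaps). The function names, toy sequences, and subtype labels are hypothetical and serve only as illustration.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences, skipping gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def nearest_subtype(query, references):
    """Return (subtype, distance) for the closest reference sequence."""
    return min(
        ((subtype, p_distance(query, seq)) for subtype, seq in references.items()),
        key=lambda item: item[1],
    )

# Hypothetical toy alignment; real references would come from the HIV sequence database.
references = {
    "A1": "TGGCGCCCGAACAGGGAC",
    "D": "TGGCGCCTGAACAGGGTC",
}
query = "TGGCGCCCGAACAGG-AC"
subtype, dist = nearest_subtype(query, references)
print(subtype, dist)
```

A nearest-reference rule ignores tree topology and bootstrap support, so it is only a rough screen and does not replace the phylogenetic assignment described above.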

Sequence Variant Classification by Recursive Partitioning Analysis
Recursive partitioning methods have become popular and widely used tools for non-parametric regression and classification in many scientific fields [48]. They can deal with large numbers of predictor variables, even in the presence of complex interactions, and have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years [48]. In this study, we used the interactive tree analysis tool, which is based on recursive partitioning methods, in HelixTree SNP and Variation Suite version 6.4.3 (Golden Helix, Inc., Bozeman, MT, USA) to analyze the large pool of sequence variants of the three important sites (PBS, SD, and PS) within the 5′LTR leader region. The interactive tree analysis tool was developed based on formal inference recursive modeling (FIRM) technology by Dr. Douglas Hawkins [48-57] (accessed 21 December 2017); it takes the statistical foundations of FIRM and augments them with faster and more exact segmenting algorithms. It also extends FIRM methods to include multivariate response. Recursive partitioning uses a set of data and, based on some criterion, partitions or splits the original set into smaller sets. These smaller sets are, in turn, split into still smaller sets. This process continues (recursively) until additional splitting of the data into smaller sets gives no statistically meaningful information.
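The splitting loop described above can be illustrated with a generic sketch. It is not the FIRM or HelixTree algorithm: the split criterion (a 2x2 Pearson chi-square with a 3.84 cutoff, roughly p = 0.05 at one degree of freedom), the function names, and the toy records are all assumptions made for illustration, and no multiple-testing correction is applied.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_split(rows):
    """Find the (feature, value) test that best separates labels 0 and 1."""
    best = (0.0, None, None)
    for feature in rows[0][0]:
        for value in sorted({feats[feature] for feats, _ in rows}):
            inside = [label for feats, label in rows if feats[feature] == value]
            outside = [label for feats, label in rows if feats[feature] != value]
            stat = chi2_2x2(sum(inside), len(inside) - sum(inside),
                            sum(outside), len(outside) - sum(outside))
            if stat > best[0]:
                best = (stat, feature, value)
    return best

def partition(rows, threshold=3.84):
    """Split recursively until no split reaches the chi-square threshold."""
    stat, feature, value = best_split(rows)
    if feature is None or stat < threshold:
        labels = [label for _, label in rows]
        # Leaf: node size and mean label (1 = late, 0 = early seroconverter).
        return {"n": len(rows), "u": sum(labels) / len(labels)}
    yes = [row for row in rows if row[0][feature] == value]
    no = [row for row in rows if row[0][feature] != value]
    return {"test": (feature, value),
            "yes": partition(yes, threshold),
            "no": partition(no, threshold)}

# Hypothetical records: (site variants, label).
rows = (
    [({"PBS": "v1", "SD": "s1"}, 1)] * 6
    + [({"PBS": "v2", "SD": "s1"}, 1)] * 3
    + [({"PBS": "v2", "SD": "s2"}, 0)] * 3
    + [({"PBS": "v3", "SD": "s1"}, 0)] * 6
)
tree = partition(rows)
print(tree)
```

In this toy data the PBS variant v2 occurs in both groups and is only resolved by a further split on SD, which mirrors the way mixed PBS nodes are classified further by SD variants in the study.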
Because the aim of our study is to identify the sequence variants of the three sites within the HIV 5′LTR leader region that are predominantly detected among late seroconverters, we designated the u-value of late seroconverters as 1.0 and the u-value of early seroconverters as 0. For example, when analyzing sequence variants of PBS using the tree analysis tool, the sequence variants were partitioned based on whether they were detected in the early or late seroconverters and on the p value. PBS sequence variants in a tree node with u-value equal to 1 were identified only in late seroconverters, whereas PBS sequence variants in a tree node with u-value equal to 0 were identified only in early seroconverters. PBS sequence variants in tree nodes with u-values between 0 and 1 exist in both early and late seroconverters. Because seroconversion may be influenced not only by specific sequence variants of PBS but also by combinations of PBS sequence variants with specific sequences of SD or PS, the PBS sequences in the nodes with u-values between 0 and 1 can be further classified by sequence variants of SD or PS. At each step of analysis, a combination of u-value and p value was used to define the sequences associated with late or early seroconverters.
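The u-value coding can be illustrated with a short sketch, assuming the u-value of a group of sequences is the mean of their 0/1 labels (1 for late, 0 for early seroconverters). The function names and the toy observations are hypothetical.

```python
from collections import defaultdict

def u_values(observations):
    """Map each variant to the fraction of its sequences from late seroconverters."""
    counts = defaultdict(lambda: [0, 0])  # variant -> [late count, total count]
    for variant, is_late in observations:
        counts[variant][0] += int(is_late)
        counts[variant][1] += 1
    return {variant: late / total for variant, (late, total) in counts.items()}

def classify(u):
    """Interpret a u-value as described in the text."""
    if u == 1.0:
        return "late only"
    if u == 0.0:
        return "early only"
    return "both"

# Hypothetical observations: (variant, 1 if from a late seroconverter else 0).
observations = [
    ("variant_a", 1), ("variant_a", 1),
    ("variant_b", 0), ("variant_b", 0), ("variant_b", 0),
    ("variant_c", 1), ("variant_c", 0), ("variant_c", 0), ("variant_c", 0),
]
u = u_values(observations)
print({variant: (value, classify(value)) for variant, value in u.items()})
```

Variants classified as "both" are the ones that would be passed to a further split on SD or PS.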
Differences in subtype distributions of the sequence variants between late seroconverters and controls were analyzed by Pearson χ² analysis using SPSS version 13.0. p values equal to or less than 0.05 were considered statistically significant.
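For a single subtype compared against all others, the Pearson χ² test reduces to a 2x2 table. The sketch below computes the statistic and its p value for one degree of freedom without continuity correction; the counts are hypothetical and are not taken from this study, and the function name is illustrative.

```python
import math

def pearson_chi2_2x2(table):
    """Return (chi-square statistic, p value) for a 2x2 table, 1 degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = late / early seroconverters,
# columns = sequences of one subtype / sequences of all other subtypes.
stat, p_value = pearson_chi2_2x2([[30, 70], [10, 90]])
print(stat, p_value)
```

The approximation is unreliable when expected cell counts are small, in which case an exact test would be the usual alternative.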

Uganda A1 and D Subtype 5′LTR-Leader Sequences Were Significantly Enriched in HIV Viral Population from Late Seroconverters
A total of 3678 sequences from 20 late seroconverters and 122 early seroconverters were phylogenetically analyzed to determine their HIV-1 subtypes. This analysis only included the sequences of the earliest sampling date of the available samples from each patient. Similar to previous studies, subtype A predominates in the HIV viral population of this Kenyan population, followed by subtype D. The frequencies of subtypes A1.KE, A1.UG, D, and D.UG were 57.2%, 3.7%, 27.2%, and 1.3%, respectively (Figure 2 and Table 1). There is a significant difference in overall subtype distribution of 5′LTR leader sequences between viral population in early and late seroconverters (p < 0.0001). While subtypes B (0% versus 3.4%, p < 0.0001) and C (0% versus 9.2%, p < 0.0001) sequences were not observed among the late seroconverters, subtype A1.UG sequences were significantly enriched in the late seroconverters compared to the ones in early seroconverters (11.4% versus 1.5%, p < 0.0001). Further, subtype D.UG sequences were absent in early seroconverters (5.7% versus 0%, p < 0.0001). It is apparent that the viral population infecting late seroconverters was enriched with subtype A1.UG and D.UG 5′LTR leader sequences.

Unique Sequences and Combinations of PBS, SD, and PS Sequences in Late Seroconverters
We then examined whether specific sequences of primer binding site (PBS), splice donor (SD), and packaging signal (PS), and their combinations, are more likely to be associated with HIV viral population in late seroconverters. For this, we included all 4839 sequences from multiple sample dates of the patients in recursive partition analysis using the Tree analysis tool of HelixTree 6.4.3. Recursive partitioning analysis classifies the 5′LTR-leader sequence variants based on their nucleotide sequences, subtypes, and their origin, into early (designated as 0) or late seroconverters (designated as 1) (Figures 3-5 and Table 2). The analysis showed that specific sequence variants of PBS were only identified in the viral population of either early or late seroconverters (Pc < 0.0001) (Figure 3 and Table 2). Specifically, 12 PBS sequence variants were only found in the viral population of late seroconverters (PBS-1, Figure 3 and Table 2), and 23 PBS sequence variants were only identified in the viral population of early seroconverters (PBS-4, Figure 3A,B, and Table 2). Some PBS sequence variants were identified in the viral population of both early and late seroconverters (PBS-2 and PBS-3, Figure 3 and Table 2).

Similarly, specific sequence variants of SD were only identified in the viral population of either early or late seroconverters (Pc < 0.0001) (Figure 4 and Table 2). Nine SD sequence variants were only found in the viral population of late seroconverters (SD-1, Figure 4A,B and Table 2), while 14 SD sequence variants were only found in the viral population of early seroconverters (SD-5, Figure 4A,B and Table 2). Some SD sequence variants were identified in the viral population of early as well as late seroconverters (SD-2, 3, 4, Figure 4 and Table 2).
Likewise, specific sequence variants of PS were only identified in the viral population of either early or late seroconverters (Pc < 0.0001) (Figure 5 and Table 2). Five PS sequence variants were only identified in the viral population of late seroconverters (PS-1, Figure 5A,B and Table 2), while four PS sequence variants were only seen in the viral population of early seroconverters (PS-3, Figure 5A,B and Table 2). Some PS sequence variants were identified in the viral population of both early and late seroconverters (PS-2, Figure 5A,B and Table 2).
For the primer binding site sequence variants that existed in both early and late seroconverters (PBS-2 and PBS-3, Figure 3 and Table 2), we conducted further analysis to see whether combinations of specific PBS and SD sequence variants were more likely to exist in the viral population of late seroconverters. Further recursive analysis for the six sequence variants in the PBS-2 node with sequence variants of SD showed that the combinations of four specific SD sequence variants with the six PBS variants were only identified in the late seroconverters (PBS-2-SD-1; Figure 6A,B and Table 3). Similarly, PBS-3 node sequences in combinations with seven specific SD sequences occurred only in the viral population of late seroconverters (PBS-3-SD-1; Figure 7A,B and Table 3). In contrast, PBS-3 and 14 specific SD sequence variants (PBS-3-SD-4) existed only in the early seroconverters (Figure 7A,B and Table 3).
[Figure residue: PBS sequence variants with panel labels, in extracted order] TGGCGCCCGAACAGGGTC, TGGCGCCCCAACGGGGAC, TGGCGCCCGAACAGGAAC, TGG-GCCCGAACAGGGAC, TGGCGCCCAAACAGGGAC, TG-CGCCCGAACAGGGAC, TGGCCGCCCGAACAGGGAC, (D-PBS-1), LSC, TGGCGCCGGAACAGGGAC, TGGCGCCCGAACAGGGTAC, TGGCGCCCGACGTGGGGC, TGGCGACCGAACAGGGAC, TGGCGCCCGAACCGGGAC, TGGCGCCCGTACAGGGAC, TGGC-CCCGAACAGGGAC, TGGCCGCCCGATCAGGGAC, TG

Combinations of Subtype A1.KE or D with Unique PBS and SD Sequence Variants in Late Seroconverters
The late seroconverters are most likely to be infected with HIV variants with 5′LTR sequences belonging to A1.UG and D.UG, and specific PBS, SD, and PS variants are only identified in the viral population infecting the late seroconverters. However, late seroconverters were also infected with A1.KE and D, the two major HIV subtypes circulating in Kenya. Are there unique PBS, SD, and PS sequence variants in A1.KE and D infecting late seroconverters? The recursive analysis showed that specific SD variants or PBS variants in subtype D were identified only in late seroconverters or early seroconverters (Figures 8 and 9, and Table 3). Specific SD variants in A1.KE were only identified in late seroconverters or early seroconverters (Figure 10 and Table 3). Thus, A1.KE or D with specific PBS and SD variants infect late seroconverters.

Our study showed that late seroconverters are more likely to be infected with A1 and D from Uganda, and specific PBS, SD, and PS sequences were only identified in the late seroconverters. In addition, A1.KE and D with specific PBS and/or SD variants are likely to infect late seroconverters. Table 4 summarizes the 5′LTR subtypes, PBS, SD, and PS variants, and the combinations identified and enriched in the 20 late seroconverters. These 5′LTR subtypes, PBS, SD, and PS variants, and their combinations were identified and enriched in 16 out of 20 late seroconverters (Table 4). The subtype classification of the 5′LTR-leader sequences of viruses infecting late seroconverters is shown in Table 5.

Potential Functional Differences among PBS Variants in Late Seroconverters
Among the three sites studied, only PBS had sufficient supporting literature available to permit analysis of its potential functional significance. A phylogenetic tree was constructed containing PBS variant sequences only identified in late or early seroconverters, together with 19 PBS reference sequences that have been studied for their function (Figure 11). With the exception of tRNALys,3 and tRNALys,5a, none of the other tRNA molecules have been reported to be used as primers in naturally occurring HIV-1. Phylogenetic analysis showed that the majority of the PBS sequence variants identified only in late seroconverters (PBS-1) co-clustered with PBS reference sequences utilizing tRNAArg molecules, whereas the PBS sequence variants identified only in early seroconverters (PBS-4) co-clustered with PBS wild type references PBS-tRNALys,3 and its variants PBS-tRNALys1-9, PBS-tRNALys1,2, PBS-tRNALys(5), and PBS-tRNAHis (Figure 11).
The evolutionary history was inferred by using the maximum likelihood method based on the Tamura-Nei model [1]. The bootstrap consensus tree inferred from 1000 replicates [2] is taken to represent the evolutionary history of the taxa analyzed [2]. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. Initial tree(s) for the heuristic search were obtained automatically by applying neighbor-join and BioNJ algorithms to a matrix of pairwise distances estimated using the maximum composite likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. The analysis involved 54 nucleotide sequences. There was a total of 22 positions in the final dataset. Evolutionary analyses were conducted in MEGA6.
Note: reference sequences are marked with colored filled circles. The PBS sequences identified only from late seroconverters (PBS-1 as PBS-L) are marked with red filled square. The PBS sequences identified only from early seroconverters (PBS-4 as PBS-E) are marked with purple filled triangle.

Discussion
The outcome of exposure to HIV-1 is influenced by both host and pathogen derived genetic factors. HIV-1 late seroconversion has been observed in the Pumwani sex worker cohort. Here, we investigate whether the late seroconversion is associated with specific subtypes and 5′LTR-leader sequence variants in this epidemiologically well-characterized cohort. We showed that the 5′LTR-leader sequence variants are dominated by clade A1 and D viruses in this cohort, and this is consistent with previous studies of Kenyan HIV infected patients [20-22]. We observed a significant difference in HIV-1 subtype distribution between late seroconverters and the early seroconverters. A significantly higher proportion of late seroconverters were infected by subtype A1 and D from Uganda. Two possibilities may explain this observation. One, viral subtypes from Uganda may differ in their ability to cause infection and exhibit superior replicative properties. Two, the late seroconverters may have been infected while they were back in their home village during a break from sex work [17]. As none of the late seroconverters were from Uganda, it is possible that the migration of their clientele between Uganda and Kenya was responsible for the transmission of subtype A1.UG and D.UG. The predominance of the Uganda subtype in the late seroconverter population suggests a relationship between Ugandan viral origin and late seroconversion. HIV-1 subtypes originating in Uganda may be more infectious than their Kenyan counterparts, and comparative infectivity studies will need to be carried out to confirm this possibility. Moreover, the rates of disease progression of patients infected with Ugandan A and D subtypes could be examined and compared with that of patients infected with Kenyan subtypes A and D. In addition, other genetic factors unique to subtypes A1.UG and D.UG might play an important role in HIV-1 late seroconversion.
We also showed that unique sequence variants of PBS, SD, and PS exist in viruses infecting late seroconverters. Specific SD sequences were identified only in viruses from late seroconverters or early seroconverters. SD is essential to all splicing events in HIV-1 [36], and as such, the association of specific SD sequence variants with late seroconverters deserves specific attention. Functional studies, currently lacking, could address whether these specific SD sequence variants exhibit more efficient splicing activity. SD, PBS, and PS each have different roles in HIV-1 replication [28,29]. Our study showed that combinations of sequence variants from these sites associated significantly with late seroconverters or with the early seroconverters, suggesting a synergistic effect between these three functional sites. This appears also true in the combination of A1.KE or D with specific PBS and SD sequence variants infecting late or early seroconverters. Thus, both viral subtypes and PBS, SD, and PS sequence variants, play a role in late seroconversion. The interplay between the sequence variants of these sites and their effect on HIV-1 exposure outcome is not clear, and warrants further functional investigations.
Studies have shown that most HIV viruses, including proviral sequences and virions in plasma samples, are defective. Our study is limited to the analysis of 5′LTR-leader sequences; these diverse sequences may be associated with either defective or non-defective HIV viruses. The identification of specific PBS, SD, or PS variants that exist only in LSC or only in EC may provide a reasonable basis for investigating whether these sequence variants play a more important role in viral pathogenesis than their population frequencies would indicate. In addition, defective viruses have been shown to drive HIV infection, persistence, and pathogenesis [58], and the data from our study provide another perspective on HIV pathogenesis.
Earlier studies in our cohort suggested that viral cytotoxic T lymphocyte escape variants were not likely to be the primary factors influencing HIV-1 late seroconversion, and pointed to potential links between late seroconversion and the loss or waning of HIV-1 epitope-specific responses after a break from sex work [17]. The present study explored the phenomenon of late seroconversion further, and suggests that the process need not be purely immunological; virological factors, viz, PBS, SD1, and PS variants and subtypes, could play important roles.
Analysis of the potential functional implications of the PBS variants that were identified only in late or only in early seroconverters, based on published data, showed that most of the PBS variants identified only in late seroconverters co-clustered with PBS sequence variants using tRNAArg as a primer for reverse transcription, whereas the PBS variants identified only in early seroconverters co-clustered with the wild-type PBS sequences using tRNALys,3, tRNALys variants, or tRNAHis as a primer for reverse transcription. Studies have shown that HIV can replicate using either tRNAHis or tRNALys1,2 as primers [59][60][61][62][63][64]; however, HIV mutants that use reverse transcription primers other than tRNALys,3 have reduced replication [65]. The only retrovirus that has been reported to use tRNAArg as a primer for reverse transcription is MuLV [66,67]. Analysis of the replication and stability of MuLVs with alternative PBSs revealed a preference for a PBS complementary to tRNAPro, tRNAGly, or tRNAArg [67]. The selection of tRNAArg for MuLV was probably facilitated, in part, by the multiple isoacceptors for tRNAArg [67]. Our study is the first to report that HIV PBS sequence variants identified only in late seroconverters co-cluster with PBS sequences utilizing tRNAArg as a primer for reverse transcription. By interaction analysis, the PBS variants do not appear to belong to one specific subtype (data not shown). Studies have shown that primer selection and viral translation, in particular the synthesis of Gag-Pol, are linked [66,67]. How these specific HIV PBS variants, which cluster with PBS sequences using tRNAArg as a primer for reverse transcription, contribute to the infection of women who were relatively resistant to HIV-1 infection needs to be investigated.
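The relationship between a PBS variant and its candidate tRNA primer rests on sequence complementarity: the 18-nucleotide PBS pairs with the 3′-terminal 18 nucleotides of the primer tRNA. The following is a minimal sketch of how such complementarity could be checked, not the clustering procedure used in this study. The function names are hypothetical, the scoring counts only Watson-Crick mismatches (G-U wobble pairs are ignored), and the two reference sequences are the commonly cited HIV-1 wild-type PBS and tRNALys,3 3′ end, which should be verified against a curated database before any real use.

```python
# Illustrative sketch only: function names and the mismatch-count scoring
# are assumptions, not the method used in the study.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def primer_mismatches(pbs_dna, trna_3prime_rna):
    """Count positions where the PBS fails to Watson-Crick pair with the
    3'-terminal nucleotides of a candidate tRNA primer.

    pbs_dna         : PBS written 5'->3' as DNA
    trna_3prime_rna : tRNA 3' end written 5'->3' as RNA, same length
    """
    target = trna_3prime_rna.upper().replace("U", "T")
    expected = reverse_complement(pbs_dna)
    if len(expected) != len(target):
        raise ValueError("PBS and tRNA 3' end must have equal length")
    return sum(1 for a, b in zip(expected, target) if a != b)


# Commonly cited reference sequences (assumption: verify before use).
HIV1_WILDTYPE_PBS = "TGGCGCCCGAACAGGGAC"   # 18 nt, 5'->3'
TRNA_LYS3_3PRIME = "GUCCCUGUUCGGGCGCCA"    # 3'-terminal 18 nt, 5'->3'

if __name__ == "__main__":
    print(primer_mismatches(HIV1_WILDTYPE_PBS, TRNA_LYS3_3PRIME))
```

Scoring each PBS variant against the 3′ ends of several candidate tRNAs and picking the lowest mismatch count gives a crude indication of the likely primer; a real analysis would also need to handle insertions, deletions, and wobble pairing.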
The current study set out to investigate viral factors influencing the HIV-1 late seroconversion observed in the Pumwani cohort. Viral subtypes, as well as PBS, SD, and PS variants within the 5′ leader sequence, are associated with this clinical outcome, underscoring the importance of viral factors in late seroconversion. Viral genotypes have been shown to exert a profound influence over HIV-1 viral load [68]. Understanding why viruses of certain clades exhibit seemingly greater infectiousness and pathogenicity will provide valuable information that could be used to help prevent HIV-1 infection. This knowledge could also be applied as a clinical predictor to guide treatment decisions for patients. Successful inhibition of HIV-1 replication through small interfering RNA targeted to the PBS has been reported [69], and RNA transcripts containing HIV-1 PS sequences have been explored as HIV-1 antivirals [70]. To our knowledge, this is the first report of an association of 5′LTR-leader sequence variation with HIV-1 late seroconversion, in addition to reporting the specific sequence variations in the 5′ leader sequence region. The association of PBS, SD, and PS variants with LSC or EC identified in this study may help to find additional pharmaceutical targets, aiding the development of new anti-HIV therapeutics and HIV/AIDS prevention strategies.