NMR Studies of the Structure and Function of the HIV-1 5′-Leader

The 5′-leader of the human immunodeficiency virus type 1 (HIV-1) genome plays several critical roles during viral replication, including differentially establishing mRNA versus genomic RNA (gRNA) fates. As observed for proteins, the function of the RNA is tightly regulated by its structure, and a common paradigm has been that genome function is temporally modulated by structural changes in the 5′-leader. Over the past 30 years, combinations of nucleotide reactivity mapping experiments with biochemistry, mutagenesis, and phylogenetic studies have provided clues regarding the secondary structures of stretches of residues within the leader that adopt functionally discrete domains. More recently, nuclear magnetic resonance (NMR) spectroscopy approaches have been developed that enable direct detection of intra- and inter-molecular interactions within the intact leader, providing detailed insights into the structural determinants and mechanisms that regulate HIV-1 genome packaging and function.

A hallmark of retroviruses is that they selectively and efficiently package two copies of their unspliced, 5′-capped and 3′-polyadenylated RNA genomes during virus assembly [1]. Both molecules are utilized for strand transfer-mediated recombination during reverse transcription [2,3], enabling transcription at strand breaks caused by restriction endonucleases [4] and promoting evolution of resistance to antiviral therapies [5]. Unspliced genomes are selected for packaging as non-covalently linked dimers from a cellular milieu that includes a substantial excess of non-viral and spliced viral mRNAs [6,7]. RNA elements that direct packaging are located within the 5′-leader of the genome [8], the most conserved region of the genome [9]. Elements important for transcriptional activation, splicing, and translation initiation are also located within the 5′-leader, and there is evidence that these and other activities are temporally modulated by dimerization-dependent exposure of functional signals [10-14].
Dimerization occurs both in the cytoplasm and at the plasma membrane (PM), and is mediated by a conserved GC-rich palindromic dimer initiation sequence (DIS) located within the 5′-leader of the genome (Figure 1). Genetic studies indicate that the DIS is responsible for RNA:RNA partner selection [15,16], but other leader elements, including those overlapping the gag start codon (AUG, Figure 1), also play roles in dimerization [13,17-19]. Genome selection during virus assembly is mediated by interactions between the nucleocapsid (NC) domains of a small number of viral Gag polyproteins (~12 or fewer) [20] and packaging signals located within the 5′-leader of the viral genome.
Figure 1. Secondary structure of the human immunodeficiency virus type 1 (HIV-1) 5′-leader in monomer (a) and dimer-promoting (b) conformations. Highlighted regions on each structure denote elements whose structure has been probed using nuclear magnetic resonance (NMR) spectroscopy in the full-length leader. In contrast to the dimeric form of the 5′-leader, few regions of the monomeric leader have been probed by NMR spectroscopy. Yellow boxes indicate regions whose structure could only be probed in truncated or mutated (see inset) 5′-leader constructs, and other colored boxes correspond to different regions with detectable Adenosine-H2 signals in partially deuterated 5′-leader RNA samples. Figure adapted from [31].

Probing the Secondary Structure of the Intact 5′-Leader by Nuclear Magnetic Resonance (NMR) Spectroscopy
In vitro, human immunodeficiency virus type 1 (HIV-1) 5′-leader RNA (residues 1-356, which includes the intact 5′-untranslated region and the first 21 nucleotides of the gag gene; Figure 1) exists as an equilibrium mixture of monomeric and dimeric species [8,10,11,14,21-29]. The secondary structure of the 5′-leader has been probed by combinations of nucleotide reactivity mapping, phylogenetic analyses, biochemical and molecular biological studies, and free energy calculations, resulting in more than 25 predicted secondary structures and multiple dimerization models [8]. Probing of the intact HIV-1 5′-leader RNA by nuclear magnetic resonance (NMR) spectroscopy was initially conducted with a segmentally labeled sample, in which residues overlapping the gag start codon (AUG; Figure 1) were enriched with carbon-13 (¹³C) [13]. These studies revealed that AUG adopts a hairpin structure in the monomeric form of the 5′-leader that is favored at low ionic strength [13]. Technical problems (severe line broadening associated with ¹H-¹³C dipolar coupling) precluded ¹³C-based NMR spectroscopy studies of the dimeric form of the leader. However, using a ²H-edited approach that does not rely on ¹³C labeling (long-range probing by adenosine interaction detection, lrAID), residues of AUG were shown to base pair with an upstream U5 element in the dimeric form of the 5′-leader, consistent with phylogenetic predictions [32,33]. These data, in combination with other mutagenesis experiments, suggested a dimerization model in which DIS is sequestered by base pairing with U5 in the monomeric RNA, and U5:AUG base pairing displaces and exposes the DIS to promote dimerization (Figure 1) [13].
Mutations engineered to favor the monomer reduce the number of high-affinity NC binding sites and inhibit vector RNA packaging, whereas those that promote dimerization also promote NC binding and packaging, consistent with the proposed dimerization-dependent RNA structural switch mechanism. Subsequent studies with HIV-2 and simian immunodeficiency virus (SIV) 5′-leader RNAs indicate that this dimerization/packaging mechanism is likely conserved among evolutionarily distant lentiviruses [34].

NMR Structure of the HIV-1 RNA Packaging Signal
A systematic nucleotide deletion strategy was used to identify a minimal region of the 5′-leader sufficient for packaging (Core Encapsidation Signal, ΨCES) [35]. ΨCES lacks the TAR, PolyA, and upper primer binding site (PBS) loops, yet maintains the dimerization, high-affinity NC binding, and, importantly, NMR spectroscopic properties observed for the intact, native 5′-leader [35]. ΨCES is also capable of directing vector RNA packaging with >80% of the efficiency of the native 5′-leader [35]. To improve NMR spectral quality for structural studies, the size of the symmetrical ΨCES dimer was effectively "halved" by replacing the native DIS palindromic loop with a hairpin-promoting tetraloop (ΨCESm), Figure 2 [36]. This substitution did not affect the NC binding properties or nuclear Overhauser effect (NOE) NMR spectral patterns of the RNA. The structure of ΨCESm was determined using a fragmentation-based ²H-edited NMR approach, which was designed to simplify spectral analysis and provide direct information about long-range base pairing interactions, Figure 2 [36]. Unexpectedly, ΨCESm was found to adopt a tandem three-way junction structure in which residues of the splice donor site (SD) participate in long-range base pairing, rather than adopting a widely predicted hairpin structure. Notably, the observed secondary structure is in better agreement with in-gel chemical probing results obtained for the resolved dimeric 5′-leader RNA [37] than with previously predicted secondary structures [38,39].
The ΨCESm structure informs four key aspects of HIV replication: (i) dimerization-dependent attenuation of translation is explained by the sequestration of the gag start codon, which base pairs with upstream U5 residues; (ii) the tandem three-way junction exposes unpaired and weakly paired guanosines that are required for high-affinity NC binding, which explains why ΨCES promotes RNA packaging; (iii) the exquisite selectivity of HIV-1 for packaging its unspliced genome is likely due to the requirement of residues immediately downstream of the major splice site for formation of the packaging-competent three-way junction structure; and (iv) although the structure of the monomeric genome has not been determined, it is likely that monomers are ignored during virus assembly because they do not adopt the guanosine-exposing tandem three-way junction structure [36].
A phylogenetic analysis showed that 31 out of 48 residues at or near the tandem three-way junction are strictly (16 sites) or >99% (15 sites) conserved, and that 13 sites exhibited 90%-99% identity [36]. The lack of base pair covariation has led some to question the validity of the tandem three-way junction structure [40], especially in view of strong evidence that residues of SD adopt a hairpin structure [8,40,41]. Although it is unfortunate that sequence conservation was too high for a conclusive phylogenetic assessment, it is certainly possible (likely, we believe) that the three-way junction and hairpin structures are both formed in infected cells, possibly as a consequence of heterogeneous transcription start site usage (see below) [42].
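The conservation counts quoted above from [36] can be tallied in a few lines to show why covariation analysis had so little to work with. The following is a minimal sketch in Python; the variable names are illustrative, and the number of "remaining" sites is obtained here by subtraction rather than being stated in the text.

```python
# Tally of the conservation counts quoted in the text for the 48 residues
# at or near the tandem three-way junction [36].
TOTAL_SITES = 48
strictly_conserved = 16    # 100% identity
above_99 = 15              # >99% identity
between_90_and_99 = 13     # 90%-99% identity

# Sites that are strictly or >99% conserved (the text reports 31).
highly_conserved = strictly_conserved + above_99

# Sites with at least 90% identity.
at_least_90 = highly_conserved + between_90_and_99

# Inferred by subtraction (not stated in the source): sites below 90% identity.
remaining = TOTAL_SITES - at_least_90


def percent(count, total=TOTAL_SITES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


print(f"strictly or >99% conserved: {highly_conserved}/{TOTAL_SITES} "
      f"({percent(highly_conserved):.1f}%)")
print(f">=90% identity: {at_least_90}/{TOTAL_SITES} ({percent(at_least_90):.1f}%)")
print(f"remaining sites: {remaining}")
```

Under these counts, only a handful of positions vary appreciably, which is consistent with the statement that sequence conservation was too high for a conclusive phylogenetic assessment.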

Probing the Intermolecular Interface in the Dimeric 5′-Leader by NMR
The 5′-leader contains a conserved stem-loop that serves as the primary site of genome dimerization, the DIS [15,43-50]. The isolated DIS oligoribonucleotide is capable of adopting either a "kissing" dimer, where the only intermolecular contacts are in the palindromic loop (5′-GCGCGC-3′ for HIV-1 NL4-3) [51-55], or an "extended" dimer with extensive intermolecular interactions [54,56,57]. Longer DIS oligoribonucleotides that contain unpaired or bulged residues are able to convert from "kissing" to thermodynamically stable "extended" dimers [58]. Native gel electrophoresis studies indicate that 5′-leader RNAs of some lentiviruses, including some strains of HIV-1, form "labile" dimers [34,59,60] presumed to be mediated by loop-loop "kissing" interactions, and that the labile dimers can in some cases be converted by mild heating to "non-labile" dimers presumed to be stabilized by an extended DIS duplex interface [61]. In contrast, only non-labile dimers have been observed for the HIV-1 NL4-3 5′-leader. The propensity to form observable labile dimers appears to be related to the GC content of the DIS palindrome, since labile dimers were only observed for leaders containing a GCGCGC DIS sequence [34]. Using a ²H-edited NMR spectroscopy approach, the dimer interface of the HIV-1 NL4-3 5′-leader was shown to include both extended DIS duplex interactions and intermolecular U5:AUG base pairing, Figure 3 [31]. Using a differential mutagenesis/²H-edited 1D NMR approach, the U5:AUG intermolecular interface was shown to form on roughly the same timescale as overall dimerization, suggesting that the proposed kissing intermediate, if formed, converts rapidly to the extended interface structure at 35 °C even in the absence of RNA chaperones [31].
Figure 3. [...] The exposure of DIS enables the formation of a "kissing" dimer, where two RNA molecules interact at the palindromic DIS loop sequence. There is no spectroscopic evidence for this conformation; therefore, it is likely a short-lived species; (c) the HIV-1 5′-leader rapidly adopts an "extended" dimer conformation characterized by extensive base pairing between the two RNA molecules. (Figure adapted from [31]).
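The loop-loop "kissing" interaction described in this section depends on the DIS loop being palindromic in the nucleic acid sense, that is, identical to its own reverse complement, so that two copies of the same loop can pair with each other. The short Python sketch below checks this property and the GC content for the HIV-1 NL4-3 loop sequence given in the text. The function names are illustrative, and only Watson-Crick pairs are considered (G·U wobble pairs are ignored as a simplifying assumption).

```python
# Watson-Crick complement table for RNA (wobble pairs deliberately ignored).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def is_self_complementary(rna):
    """True if the sequence equals its own reverse complement (a palindrome)."""
    return rna.upper() == reverse_complement(rna)


def gc_fraction(rna):
    """Fraction of residues that are G or C."""
    rna = rna.upper()
    return (rna.count("G") + rna.count("C")) / len(rna)


dis_loop = "GCGCGC"  # HIV-1 NL4-3 DIS palindrome, as quoted in the text

print(dis_loop, "self-complementary:", is_self_complementary(dis_loop))
print(dis_loop, "GC fraction:", gc_fraction(dis_loop))
```

Because the loop is self-complementary, two identical leader molecules can in principle pair loop to loop without any sequence difference between partners, which is the geometric premise of the kissing-dimer model.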

Transcriptional Start Site Heterogeneity Modulates Genome Structure and Function
An additional layer of regulation may come from the sequence of the RNA itself, determined by the transcription start site (TSS) used during genome synthesis [42,62]. Shortly after initiation of RNA synthesis, the HIV-1 genome is co-transcriptionally capped by a 5′-5′ triphosphate-linked 7-methylguanosine moiety (7MeG) [62-66]. The U3/R junction of HIV-1 proviruses contains a conserved run of three consecutive guanosines, any (or all) of which can serve as the +1 site for RNA transcription. Recent reports suggest that HIV-1 utilizes three different transcription start sites (1G, 2G, and 3G), creating an array of genomes with varying 5′-end identity, each of which is co-transcriptionally capped (Cap1G, Cap2G, Cap3G). Interestingly, the Cap1G genome is predominantly packaged into virions (70 to ~100%) [42,62], whereas Cap2G and Cap3G RNAs are enriched on polysomes [42]. The distinct functions of these transcripts appear to correlate with distinct structural rearrangements. In vitro transcribed 5′-leader (5′-L) RNAs beginning with one, two, or three guanosines (1G, 2G, or 3G) displayed strikingly different dimerization profiles, with 1G and 2G 5′-L RNA favoring the dimer and 3G 5′-L RNA favoring the monomer [42]. The authors went on to examine the effect of capping on these in vitro transcribed leader RNAs. Dimerization propensities of Cap1G, Cap2G, and Cap3G 5′-L RNAs were assayed and, strikingly, the influence of a 7MeG cap mimicked that of the addition of a single guanosine (Cap1G 5′-L and 2G 5′-L behaved similarly). Mutagenesis studies show that destabilizing the predicted base of the PolyA stem promoted the monomeric RNA conformation, while stabilizing the base of the hairpin promoted dimerization. These studies suggest a new paradigm for RNA fate, in which function is not encoded by an intrinsic monomer-dimer equilibrium, but rather by the transcriptionally encoded sequence of the capped 5′-terminus [42].
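The relationship between 5′-end identity and dimerization described above can be summarized as a small lookup rule. The Python sketch below is a toy model and not a method from [42]: it encodes the reported observations for uncapped 1G, 2G, and 3G leaders, and treats a 7MeG cap as equivalent to one additional guanosine, as the text reports for Cap1G versus 2G. Extending that rule to Cap2G and Cap3G is an assumption made here for illustration, although it is in line with the reported enrichment of those transcripts on polysomes. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaderTranscript:
    """A 5'-leader RNA described only by its 5'-terminal guanosines and cap."""
    n_guanosines: int      # number of 5'-terminal guanosines: 1, 2, or 3
    capped: bool = False   # True if a 7MeG cap is present

    @property
    def label(self):
        return f"{'Cap' if self.capped else ''}{self.n_guanosines}G"


def favored_state(transcript):
    """Toy rule: a cap counts as one extra guanosine (assumption generalized
    from the reported similarity of Cap1G and 2G leaders); leaders with an
    effective count of one or two favor the dimer, otherwise the monomer."""
    effective = transcript.n_guanosines + (1 if transcript.capped else 0)
    return "dimer" if effective <= 2 else "monomer"


for capped in (False, True):
    for n in (1, 2, 3):
        t = LeaderTranscript(n, capped)
        print(f"{t.label:>6}: favors {favored_state(t)}")
```

In this simplified picture, Cap1G is the only capped species predicted to favor the dimer, which matches the observation that Cap1G genomes are the ones predominantly packaged.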

Strengths and Limitations of NMR Spectroscopy for Large RNAs
Although NMR spectroscopy has historically been used to determine structures of relatively small RNAs that typically comprise fewer than 60 nucleotides [67-75], the above studies have shown that structures can be probed by NMR spectroscopy in RNAs comprising up to 688 nucleotides. When used in combination with cryo-electron microscopy (cryoEM) or small-angle X-ray scattering (SAXS), 3D structural information can be obtained for relatively large protein:RNA complexes [68,69]. An advantage of the Adenosine-H2 detected, ²H-edited method described above is that it enables direct detection of structure, even in RNAs that exist as equilibrium mixtures of species. The fact that HIV-1 5′-leader transcripts can exist as an equilibrium mixture of conformers in vitro, and as an equilibrium mixture of chemical species and structures in transfected cells, may explain discrepant secondary structure predictions made on the basis of chemical probing [37-39,76], and it is noteworthy that the NMR-derived structure of the HIV-1 packaging signal [36] is more consistent with in-gel Selective 2′-Hydroxyl Acylation analyzed by Primer Extension (SHAPE) probing of the dimeric 5′-leader [37] than with previously published SHAPE-derived models [38,39]. Difficulties interpreting bulk chemical probing data may now be overcome using a "mutate-and-map" approach coupled with the RNA Ensemble Extraction From Footprinting Insights Tool (REEFFIT) algorithm in order to deconvolute and reconstruct complex conformational landscapes [77].
Disadvantages of the NMR spectroscopy approach are the time and expense associated with data collection (up to ~five days per spectrum for the largest RNAs) and sample preparation (up to a dozen differentially ²H-labeled RNAs at ~$1000 per sample). One key weakness of the NMR spectroscopy approach described above is that it relied only on NOE-derived structural information, which is inherently short-range (~5 Å, although some very long-range NOEs were identified in highly deuterated samples [36]). For appropriate RNAs that adopt a unique structure and do not undergo conformational averaging, NMR-derived residual dipolar couplings (RDCs) can provide global structural information regarding the relative orientations of the different helices [78,79]. The use of hybrid NMR/SAXS [78,79] or NMR/cryoEM [80] approaches provides exciting new opportunities to combine the high-resolution "local" structural information obtainable by NMR spectroscopy with lower-resolution global structural information derived by SAXS and/or cryoEM. Of course, caution must be taken when using RDC [80], cryoEM, or SAXS data to refine structures of RNAs that undergo rapid, large-scale conformational averaging. In addition, although the size limit for RNA NMR spectroscopy studies remains undefined, it will certainly not approach the large molecular sizes that are now tractable by X-ray crystallography and recently developed high-resolution (~3 Å) cryoEM methods [81].

Future Directions
Recent advances in isotopic labeling, segmental ligation, and non-covalent fragmentation-based labeling approaches have expanded the size of RNAs that can be successfully studied by NMR spectroscopy. Due in large part to these technological advances, high-resolution structures have been obtained for relatively large, independently functional portions of the HIV-1 5′-leader, and secondary structures have been directly detected in the intact leader [13,31,36]. However, these studies still do not provide a complete view of all functional elements within the dimeric form of the HIV-1 5′-leader, and the monomeric form of the leader has not been extensively probed by NMR (shaded regions, Figure 1a). A number of important questions remain: What is the structure of the intact 5′-leader in both its monomeric and dimeric states? How does the structure of the genomic 5′-leader compare to that of various spliced viral mRNAs? How does transcription start site usage influence RNA structure and function? Ultimately, it will be important to determine if these new NMR approaches can contribute to our understanding of the Gag:ΨCES structures that assemble in cells and nucleate virus assembly. Although NMR alone will not likely be capable of providing all of the information needed to determine 3D structures of large Gag:5′-leader complexes, it could provide important complementary information when used in combination with other techniques such as cryoEM [80] or SAXS [69,82].