Spontaneous Mutations in HIV-1 Gag, protease, RT p66 in the first replication cycle and how they appear: Insights from an in vitro BSL2 assay on mutation rates and types

While drug resistant mutations in HIV-1 is largely credited to its error prone HIV-1 RT, host proteins such as deaminases may also play a role generating mutations. Many HIV-1 RT mutational in vitro studies utilize reporter genes (LacZ) as template, leaving out the possible contribution of HIV codon usage and gene-specific effects. To address this gap, we studied HIV-1 RT mutation rates and bias on its own Gag, protease, and RT p66 genes in an in-vitro selection pressure free system. We found rare clinical mutations with a general avoidance of crucial functional sites in the background mutations rates for Gag, protease and RT p66 at 4.71 x 10−5, 6.03 x 10−5, and 7.09 x 10−5 mutations/bp respectively. Gag and p66 genes showed a large number of ‘A to G’ hypermutations likely due to cellular adenosine deaminases. Comparisons with silently mutated p66 sequences showed an increase in mutation rates (1.88 x 10−4 mutations/bp) and that ‘A to G’ mutations occurred in regions reminiscent of ADAR neighbour preferences. Mutational free energies by the ‘A to G’ mutations revealed an avoidance of destabilizing effects with the natural p66 gene codon usage providing barriers to ADAR effects. Our study demonstrates the importance of studying mutation emergence in HIV genes to understand how fast drug resistance can emerge, sometimes with contributions of host deaminases, providing transferable applications to how new viral diseases and drug resistances can emerge.


Introduction 37
RNA viruses have been shown to have a higher likelihood of undergoing genetic 38 changes, leading to species jump [1] and efficient spread among humans [2]. Notably, among 39 RNA viruses, the human immunodeficiency viruses such as HIV-1 and HIV-2 are believed to 40 be zoonotic transmissions of the simian immunodeficiency viruses (SIV) [3,4]. 41 Within HIV-1, the Gag and protease proteins play crucial roles in viral assembly and 42 maturation of infectious virions [5]. Protease cleaves Gag and Pol polyproteins into functional 43 subunits [6] and this cleaving is prevented by protease inhibitors (PI) which compete with Gag 44 for the active site [7,8]. In emerging PI resistance, mutations on viral protease reduce affinity 45 to PIs [8,9], and the gradual accumulation of many such resistance mutations [10][11][12] induced 46 by HIV-1 RT [13,14], eventually render the PIs ineffective, and in some cases, cross-47 resistances to limit clinical drug selection [8]. In many cases, mutations on the substrate Gag 48 compensate for reduced viral fitness [8,9,[14][15][16] with paired mutations that work 49 synergistically to protease mutations [17][18][19][20][21] to overcome PIs. 50 In the spotlight for the cause of the mutations in HIV is the error prone enzyme reverse 51 transcriptase (RT), an asymmetric heterodimer of the p66 and p51 subunits [22]. The p66 52 subunit performs the key enzymatic functions of catalysing DNA polymerisation and cleaving 53 the RNA of the RNA/DNA duplex [23,24], while the p51 subunit plays a more supportive role 54 to p66 [25]. 55 In the investigation of HIV mutations, in-depth analysis of RT is indispensable. 56 However, most of the previous studies on HIV mutation rates to date utilize reporter genes such 57 as LacZ for the template for analysis of mutations, and not HIV genes. There is thus a current 58 gap in understanding the contribution of HIV gene-specific sequences and codon usage in 59 mutational hotspots, as well as type of mutations and when they can emerge in the infection 60 cycle [26]. To fill this gap, we developed an in-vitro based assay without translational, immune 61 calculated as mutations/bp where the total number of mutations and the total nucleotide bases 111 of the respective HIV-1 genes were compared. 112 A 'Two Sample logo' was generated to determine the underlying sequence contexts of 113 'A to G' transitions generated in our in-vitro assay for comparisons against previously reported 114 ADAR neighbour preferences [37] . A custom script was written to locate, pool and align 4 115 nucleotides upstream and downstream of all adenosine mutants. Analysis of natural adenosines 116 were also analysed to generate an ersatz set of sequences as background control (see Figure 4a  117 for schematic representation). The mutated sites were compared against the respective ersatz 118 background sequences using the Two Sample logo software [38] with two sample t-test without 119 Bonferroni correction to test for significantly enriched and depleted bases within 9 nucleotides. 120 In-silico assessment on protein thermostability using FoldX and Rosetta

122
To study the mutations structurally in Gag, in silico mutagenesis was performed on 123 previously modelled compact Gag structures [13] using PyMOL followed by minimization 124 using GROMOS96 implementation in Swiss-PdbViewer. For p66, the 3T19 crystal structure 125 was used. 126 Protein thermostability, free energy changes of the mutations ΔΔGmt (ΔGwild-type -127 ΔGmutant) were studied using FoldX5 and Rosetta Cartesian_ddg (Version : 2017.52.58848) 128 [39]. Model structures were first relaxed and minimized using either the FoldX RepairPDB 129 protocol or with the cartesian-space refinement relax protocol and ref2015_cart score function 130 prior to mutagenesis. Free energy modelling was performed with the Cartesian_ddg protocol 131 (https://www.rosettacommons.org/docs/latest/cartesian-ddG). To calculate the mutational 132 effects on protein thermostability with FoldX, BuildModel (using default settings) was used 133 with the numberOfRuns parameter set to 10. For Rosetta procedures, the lowest scoring 134 structure out of 1000 cartesian-space minimization iterations were assessed to be free of serious 135 clashes by the MolProbity server [40]. The Cartesian_ddg protocol and talaris_cart score 136 functions were used to generate 15 models for each mutation and the average of the lowest 3 137 scoring mutations were averaged and multiplied by the scaling factor [39] to calculate the 138 ΔΔGmt in kcal/mols. 139

Mutation Rates of the Domains of HIV-1 Gag, protease and RT p66 204
The HIV-1 genes were further analysed by their respective domains and the mutation 205 rates calculated separately (see Figure 3). Within Gag, the overall mutation rates throughout 206 the domains were within a narrow range of 2.29 to 4.46 x 10 -5 mutations/bp when excluding 207 'A to G' hypermutations, but increased to 8.16 x 10 -5 mutations/bp when including them. 'A 208 to G' hypermutations were found in the capsid (CA), nucleocapsid (NC), and p6 subunits, but 209

Sequence contexts of 'A to G' mutations 228
To determine the influence of host defence deaminases, specifically ADAR in the 229 overrepresentation of 'A to G' mutations in our assay as was performed on double stranded 230 non-HIV RNA [37], we analysed the neighbouring sequences +/-4 nucleotides (nt) of all the 231 'A to G' mutations using the two-sample logo analysis ( Figure 4A-C). 232 Analysis of the Gag showed significant depletions of 'A' at -4 and 'G' at -1 nt positions 233 downstream of 'A to G' mutation sites at (p < 0.05) agreeing with known ADAR neighbour 234 preferences [37] (see Figure 4B). Using the Two Sample Logo of multiple and single 235 substitution mutations, there was an enrichment of 'C' at the +1, +4 (p < 0.05) and -4 nt (p < 236 0.01) positions (see Figure S1), with the former two (+1 and +4 nt positions) concurring with 237 ADAR neighbour preferences [37] (see Figure S1 and 4B). On protease, only the 'T' 238 enrichment at the -2 nt ( Figure 4B) in agreement to previously reported ADAR neighbor 239 preferences was found. For RT codon mutated p66 (p<0.01), there were enrichments of 'T' at the -1, -2 nt, 244 enrichment of 'A' bases at the +3 nt, and enrichments of 'G' at the +1 and +4 nt positions (p 245

<0.01). 246
For both wild-type and codon mutated p66, the large representation of multiple 247 mutation variants masked single substitution variants with ADAR neighbour preferences found 248 with enrichments of 'T' at the -1 and -2 nt positions of the 3' end ( Figure S1). 249

Effects of 'A to G' Mutations on Protein Thermostability 250
The distinct observed robustness of HIV to adapt to drugs, immune pressures and the 251 error prone nature of RTs may be attributed to the ability of viral protein structures to buffer 252 fitness by minimizing the change in mutational free energies (ΔΔGMt) [41]. When significantly 253 destabilized (e.g. >4kcal/mol), proteins may adopt a substantially different fold or be misfolded 254 were fewer destabilizing mutations with ΔΔGMt at +5.17 kcal/mol and +7.52 kcal/mol for 'A 266 to G' and other transition and transversion mutations, respectively ( Figure 5A). This enforces 267 the trend that 'A to G' mutations tend to elicit less ΔΔGMt, especially since respective 268 secondary peaks were also seen in Gag and the wild-type RT p66 ( Figure 5B and C). 269 The 'A to G' mutations in Gag and RT p66 genes generally had lower ΔΔGMt compared 270 to other mutations from calculations by both protein design tools ( Figure 5A The results suggest that the HIV-1 genes, particularly RT p66 wild-type, were more 275 robust to 'A to G' mutations, especially since 'A to G' mutations in codon mutated p66 were 276 more destabilizing compared to all other mutations; with significant differences in calculations 277 generated by FoldX. 278 Models of the hypermutations in Gag and wild-type p66 (Table S6) showed the average 279 ΔΔGMt of multiple mutations as: Gag -0.37 kcal/mol, 1.75 kcal/mol; p66 -2.53 kcal/mol, 3.25 280 kcal/mol as determined using Rosetta and FoldX, respectively. These were in contrast to codon 281 mutated p66 with an average doubling of destabilizing ΔΔGMt mutants to that of wild-type p66 282 at 6.48 kcal/mol, 7.00 kcal/mol for Rosetta and FoldX. 283

Discussion 284
We set out to study the native mutation rates of HIV-1 RT on HIV-1 genes: Gag, 285 protease and RT p66 subunit in a single replication cycle using our in vitro BSL2 assay that is 286 devoid of selection pressures. In the absence of protein translational, immune, viral fitness, and 287 drug selection pressures in our system, the observed biases are presumed to be intrinsic to  RT and that of host cell factors. Although mutations can be contributed from plasmid hosts 289 (DH5α and/or HEK293F), Q5 polymerase and even sequencing artefacts, these were highly 290 unlikely given that the observed phenotypic mutations/bp/replication for E. coli and 291 mammalian cells were estimated to be 5.4×10 -10 and 5.0×10 -11 , respectively [44], and that Q5 292 polymerase is one of the most high-fidelity polymerases [45]. Admittedly, there may be a role 293 for primer selection bias when priming the first few bases of the genes, and for this reason, 294 these bases were excluded from analysis and calculation of mutation rates. 295 While numerous in vitro studies have reported the error rates of HIV-1 RT (as 296 previously reviewed in [26,46]), we found only two reports that utilised HIV-1 genes as their 297 template for analysis [47,48]. From our study, we calculated the mutation rates of HIV-1 Gag, 298 protease, RT p66 and codon mutated p66 to be 4.71 x 10 -5 , 6.03 x 10 -5 , 7.09 x 10 -5 and 1.88 x 299 10 -4 mutations/bp (inclusive of hypermutations) respectively, within the range of 1.8 x 10 -5 -300 6.67 x 10 -4 mutations/bp previously reported [26,46]. Across the HIV-1 genes, there was a 301 predominance of transition mutations consistent across most phyla. This transition bias was 302 hypothesized to be for the conservation of protein functions [49][50][51]. In our assay, we observed 303 missense mutations to occur at the highest frequency, followed by silent mutations, frameshifts 304 and nonsense. These observations are in agreement with the in-built genetic code probabilities 305 [52] to maintain protein translatability, and avoidance of detrimental effects from frameshifts 306 and nonsense mutations. 307 The use of different HIV-1 gene templates also elicited varying mutation rates, type of 308 mutations and mutational biases. We noted a general absence of 'A -T' or 'C -G' mutations 309 in the wild-type HIV genes studied, detecting these mutations only in the codon mutated p66 310 (see Figure 1), suggesting influences that are inbuilt in the sequences and codon usage. This is 311 clearly seen when the codon mutated p66 mutation rate was 2.7 folds higher than the wild-type 312 HIV-1 RT p66. Interestingly, the RNase H domain of the codon mutated HIV-1 RT p66 was 313 also found to have a 7.1 fold increase (3.38 x 10 -4 mutations/bp) when compared to the wild-314 type (4.79 x 10 -5 mutations/bp), possibly due to RNA secondary structural 315 protection/susceptibility [37,53,54]. 316 The increase in 'A to G' hypermutations were observed in HIV-1 Gag, RT p66 and 317 codon mutated RT p66, but not in protease. Possible explanations may be due to the short 318 nucleotide length of protease, thereby avoiding the 'A to G' mutational effects caused by host 319 adenosine deaminases such as double-stranded RNA-specific adenosine deaminase (ADAR). 320 ADAR editing (by either ADAR1 and ADAR2) have been suggested to influence such cell 321 based in vitro RT-fidelity assays [55]. ADAR deaminates 'A' residues to inosine (I) residues 322 (read as 'G' by ribosomes due to structural homology), leading to 'C' mis-incorporations in 323 the plus strand DNA after reverse transcription, resulting in 'G' residues being incorporated 324 during second (minus) stand synthesis [56,57]. ADAR deaminases are believed to be host 325 defence mechanism proteins against viral infections by exerting significant changes on the 326 genomes of such RNA based lifeforms [58-60], exhibiting both pro-viral and anti-viral 327 activities [61][62][63][64]. 328 We also found lower occurrences of ADAR neighbour preferences bases in protease 329 The susceptibility of Gag, p66 and codon mutated p66 to 'A to G' hypermutations 332 would result in an accumulation of G bases, that would lead to a translational bias towards 333 glycine, arginine, valine and alanine [52]. This will in turn reduce the occurrences of 334 phenylalanine, isoleucine, tyrosine, histidine and asparagine in these genes. For HIV-1 335 protease, the mutational bias towards 'A' and 'C' accumulation, which is also the case when 336 'A to G' hypermutations are not considered for Gag and both p66 genes, would lead to 337 translational bias in the 'A' accumulation towards threonine and lysine, isoleucine, asparagine, 338 arginine and stop codons, while reducing phenylalanine, tryptophan and cysteine occurrences. 339 Similarly, the accumulation of 'C' would lead to proline, serine, leucine, threonine, alanine and 340 arginine while reducing methionine, lysine, glutamic acid, tryptophan, and more importantly, 341 stop codons. It is expected that higher mutations rates can push the virus towards lethal 342 mutagenesis, allowing a possible therapeutic intervention to exploit the ADAR neighbour 343 preference bases flanking target sites. 344

Structurally, the 'A to G' transitions elicit less destabilization effects in ΔΔGMt within 345
Gag and wild-type p66. Gag was found to be more robust compared to p66 with its ΔΔGMt 346 being generally neutral (Rosetta data showed distributions of significantly stabilizing mutations 347 that could have acted as epistatic buffers against destabilizing ones), although this may require 348 confirmation using cleaved Gag models [67]. On the other hand, codon mutated p66 was less 349 resilient to 'A to G' changes compared to wild-type p66, hinting that RNA viral protein do not 350 just minimize ΔΔGMt at the structural level as suggested by Tawfik, Berezovsky and colleagues 351 [41], but also at the nucleotide sequence level. With multiple 'A to G' substitutions by ADAR 352 hyper-editing, the average ΔΔGMt of all p66 mutations (and also Gag) were half that of the 353 codon mutated p66. Lower rates of 'A to G' transition occurred in the wild-type p66 compared 354 to its codon mutated counterpart. Such observations clearly demonstrate that the codon usage 355 of the natural wild-type p66 (and possibly Gag) gene offered some protection against ADAR. 356 Despite the lack of selection pressure towards functional proteins, we did not find 357 mutations in active sites of the enzymes (protease and p66) nor in the cleavage sites of Gag, 358 with the exception of A431D and F433L in the NC/p1 cleavage site 429 RQAN/FLG 435 of Gag. 359 The detection of non-cleavage mutations in Gag show that such compensation mutations would 360 emerge early in the infection process even in the absence of drugs. While mimiking a single 361 replication cycle, we found known clinical drug resistance mutations (see Table S5 for Gag, 362 protease, and p66). In Gag, the previously reported rare transient mutation E17K (in the 363 matrix), implicated in cytotoxic T lymphocyte (CTL) immune evasion resistance [68,69] was 364 detected. In protease, K70T, a minority mutation associated with resistance to PIs [70] was also 365 found. In p66, a known polymorphic mutation K103R (in the palm domain), which when 366 combined with V179D (not found in our assay), reduced susceptibility to non-nucleoside 367 reverse transcriptase inhibitors (NNRTIs): nevirapine (NVP) and efavirenz (EFV) by about 15 368 folds [71]. Interestingly. the observed K103R originated from an 'A to G' hypermutation event, 369 specifically from "AAA" to "AGA" suggesting that drug resistance mutations can be induced 370 by ADAR. 371 Though in much reduced frequency, some mutations were also found in crucial 372 functional sites. In Gag, P222L occurred in the cyclophilin A (CyPA) binding site (along with 373 G221), potentially affecting the binding of the capsid to CyPA as previously reported for 374 P222A [72]. Given that the probability of P (CCT) to L (CTT) is 0.33 in having a T substitution 375 in the 2 nd codon position, and that P (CCT) to A (GCT) is also 0.33 in having the first C mutated 376 to a G, there were equal probabilities. However, when considering that our Gag mutation 377 analysis did not show a transition mutation link between C to G (Figure 1 In the absence of selection pressures, mutations that are selected against can be better 398 detected, allowing for a more comprehensive analysis of the HIV-1 RT bias that can be missed 399 in multiplexing methods. Given that many HIV proteins can function in intense drug/immune 400 selection environments with significant reduced activity [78,79], an all-out campaign against 401 HIV could involve inhibitors against low-frequency occurring drug resistance mutations. This 402 pushes the HIV infection towards Muller's ratchet [80][81][82]  Coupling with in silico tools, such as those incorporating probability mutational 408 change, it may be possible to identify mutational hotspots. As it was previously shown that 409 protease and RT drug cross resistance have a structural basis governed by drug resistance 410 mutations [89,90], the bias of restricting specific amino acid changes by the absence of A-T, 411 C-G mutation occurrences can be exploited. However, such an approach will require an in-412 depth understanding of HIV-RT mutations that are selected against at protein functional levels. 413 Through the mutational biases and mutation rate of HIV-1 RT, it is possible to calculate 414 the mutational events leading to the zoonotic transmission of SIV to HIV, opening up 415 surveillance of emerging viral threats [2], especially given that RNA viruses are the most likely 416 to species jump [1]. 417 In conclusion, we have established an assay and characterised HIV-1 RT mutations on 418 HIV-1 Gag, protease and RT p66 in a safe BSL2 environment, allowing for insights to the 419 mutational bias and mutation rate of HIV in the absence of biological selection pressures. 420 Through our analysis, we found effects of ADAR in increasing the genetic diversity beyond 421 RT to generate rare and documented clinical drug resistance mutations within a single 422 replication cycle and thereby very early in infection. The assay can provide deeper insights 423 relevant for drug and vaccine development and be applied for horizontal understanding to other 424 viruses. 425

Acknowledgments 426
We thank Chinh Tran-To Su, Wai-Heng Lua and Wei-Li Ling for useful comments and 427 discussion. 428

Disclosure of interest 429
The authors report no conflict of interest.   Mutation rates were calculated as the ratio of the total number of mutations and the total number 691 of nucleotide bases. The rates were within the reported range of 1.8 x 10 -5 -6.67 x 10 -4 as 692 previously reviewed [26,46]. 693  Mutation rates calculated as the total number of mutations and the total number of nucleotide 729 bases were counted separately excluding and including A to G hypermutations. Truncations 730 that span across multiple domains were excluded in the calculations. 731