Optimizing Protein Production in Therapeutic Phages against a Bacterial Pathogen, Mycobacterium abscessus

: Therapeutic phages against pathogenic bacteria should kill the bacteria efﬁciently before the latter evolve resistance against the phages. While many factors contribute to phage efﬁciency in killing bacteria, such as phage attachment to host, delivery of phage genome into the host, phage mechanisms against host defense, phage biosynthesis rate, and phage life cycle, this paper focuses only on the optimization of phage mRNA for efﬁcient translation. Phage mRNA may not be adapted to its host translation machinery for three reasons: (1) mutation disrupting adaptation, (2) a recent host switch leaving no time for adaptation, and (3) multiple hosts with different translation machineries so that adaptation to one host implies suboptimal adaptation to another host. It is therefore important to optimize phage mRNAs in therapeutic phages. Theoretical and practical principles based on many experiments were developed and applied to phages engineered against a drug-resistant Mycobacterium abscessus that infected a young cystic ﬁbrosis patient. I provide a detailed genomic evaluation of the three therapeutic phages with respect to translation initiation, elongation, and termination, by making use of both experimental results and highly expressed genes in the host. For optimizing phage genes against M. abscessus , the start codon should be AUG. The D toStart distance from base-pairing between the Shine-Dalgarno (SD) sequence and the anti-SD sequence should be 14–16. The stop codon should be UAA. If UAG or UGA is used as a stop codon, they should be followed by nucleotide U. Start codon, SD, or stop codon should not be embedded in a secondary structure that may obscure the signals and interfere with their decoding. The optimization framework should be generally applicable to developing therapeutic phages against bacterial pathogens.


Introduction
Phage therapy against bacterial infection has a long history, with success stories in both mice [1] and humans [2]. It gained momentum in recent years due to the alarming increase of bacterial resistance to antimicrobial drugs [3][4][5]. Bacterial resistance to penicillin became known soon after its discovery in 1928 and its routine medical applications in 1940 [6,7]. However, it is the recent emergence and prevalence of multidrug-resistant bacteria [8] that highlights the urgency of research in therapeutic phages as an attractive alternative to antibiotics [9,10]. What is particularly promising is that phage therapy works not only through phages lysing the bacterial pathogen, but also enhancing the human immune response against the pathogen by disabling one or more immune evasion mechanisms of the pathogen [11]. Thus, the bacterial pathogen is attacked by phages from within and by human immune responses from without.

General Criteria for Optimizing Therapeutic Phages against Pathogenic Bacteria
A desirable therapeutic phage must have (1) a high bactericidal efficiency [12], (2) limited side effects [13], (3) a low chance of drug resistance [14], (4) a low cost of development [15], and (5) a low human immune response so that the therapeutic phage will not be cleared before it has a chance to kill the pathogenic bacteria [16][17][18][19]. However, among these desirable properties, the bactericidal efficiency is obviously the most important.
There are many other processes contributing to the bactericidal efficiency of phages, such as phage attachment to host, delivery of phage DNA into the host, susceptibility of phage DNA to host deoxyribonucleases, phage life cycle, viral genome replication, viral transcription, etc. All these factors contribute to generation time. However, the main objective of this paper is to develop principles for phage mRNA optimization for increased translation efficiency and to illustrate the application of these principles to improve the design of three phages against a bacterial pathogen, Mycobacterium abscessus. I should emphasize that this optimization of phage mRNAs is relevant only after the phage has already been shown to lyse the bacterial pathogen.

Principles of Optimizing Phage mRNA
A translation machinery is made of ribosomes, translation initiation factors, tRNAs, tRNA-charging enzymes, and release factors, as well as the energy that drives the translation machinery ( [20], pp. 522-635). Optimization of mRNA for efficient translation depends mainly on two factors: selection and mutation [21][22][23]. Natural selection has operated for millions of years to optimize the translation machinery in bacteria. An organism that cannot mass-produce its proteins efficiently would be eliminated by its more efficient competitors. Well-adapted phages typically have their mRNA mimicking the highly expressed genes in their bacterial hosts to take advantage of the host translation machinery [24][25][26][27]. However, rapidly replicating bacteria such as Escherichia coli with generation time of only 20-30 min [28] are expected to approach translation optimum closer than slow-replicating bacteria such as M. abscessus with generation time of 4-5 h [29]. Mutations in bacterial genomes disrupt optimizing selection on their encoded mRNAs. It also disrupts adaptation of phage mRNA to their host environment [24,25], so that phages [24,25] or human viruses [30] with a high mutation rate are associated with less adaptation than those with relatively a low mutation rate.
With efficient translation initiation, translation elongation becomes ratelimiting [31,33]. Codon-anticodon adaptation is invariably observed in rapidly replicating organisms [21,38,47,48]. Highly expressed genes such as ribosomal protein genes exhibit strong preferences of codons decoded by the most abundant tRNAs [31,[47][48][49][50]. The same pattern was also found in phages [24][25][26][27]. Codon optimization is associated with an increased growth rate in E. coli. Genes with optimal codon usage tend to generate more proteins. Experimental replacement of minor codons by major codons, or vice versa, typically leads to increased or decreased translation rates.
Translation terminations in bacterial species are mediated by one or two release factors RF1 and RF2 encoded by prfA and prfB, with RF1 decoding UAA and UAG and RF2 decoding UAA and UGA. All three stop codons can be misread by tRNAs in bacterial species, and UGA appears to be the leakiest of the three, with a readthrough frequency of at least 10 −3 to 10 −2 in Salmonella typhimurium [51] and E. coli [52,53]. Stop codons UAA and UAG can also be leaky in bacteria [54,55]. Natural UAG readthrough frequency is mostly within the range of 1.1 × 10 −4 to 7 × 10 −3 , depending on the nature of the downstream nucleotides [55][56][57][58]. The readthrough rate is the lowest for UAA, at frequencies [59] from 9 × 10 −4 to less than 1 × 10 −5 [55]. Overall, the available experimental data suggest that readthrough error rate in bacteria species is in the order of UGA > UAG > UAA [53,[60][61][62][63][64][65]. The hypothesis that such small error rates should be biologically important may seem weak. However, I present empirical results to show that it is because highly expressed genes exhibit strong preferences of UAA against UGA.
Termination efficiency also depends on the nucleotide immediately downstream of the stop codon [66][67][68], leading to the proposal of the tetranucleotide stop signal including the +4 site [23,66,[68][69][70][71]. The best documented case involves translation of prfB (encoding RF2) in E. coli in which an inframe UGA stop codon is followed by nucleotide C [72][73][74]. When RF2 is abundant, the inframe UGA is decoded correctly to terminate translation, generating a short nonfunctional peptide. When RF2 is rare, UGA is not decoded. A +1 frameshift leads to translation of GAC at a different coding frame, generating a functional RF2.
Phage proteins ideally should be produced in exact stoichiometric ratios if phage protein composition is fully known [75]. While this paper does not address stoichiometry directly, an understanding of the translation machinery and translation efficiency should facilitate stoichiometric optimization. For example, if protein A is 30 times as many as protein B in the final assembled virion, but only 3 times as many as protein B in the host bacterial cell, then increasing translation efficiency for protein A would bring us closer to the 10:1 stoichiometric ratio.
The optimization principles above have been derived mostly from model species, such as E. coli as a representative of the gram-negative bacteria and Bacillus subtilis as a representative of the gram-positive bacteria. For a specific bacterial pathogen, one must examine features of its translation machinery by contrasting its highly expressed and lowly expressed genes.

A Case Study: Optimizing Phages against M. abscessus
A recent successful application of phage therapy targets antibiotic-resistant M. abscessus [2]. The patient was 15 years old, suffered from cystic fibrosis, and underwent a lung transplantation, but was subsequently infected by an antibiotics-resistant bacterial pathogen. Genomic sequencing identified the pathogen as M. abscessus subsp. massiliense strain GD01, which is a common nontuberculous lung pathogen [76]. The three Mycobacterium phages targeting M. abscessus were Muddy, ZoeJ, and BPs from screening a large number of potential phage candidates [2]. The original study [2] was a compassionate therapeutic application for a terminally ill patient, and therefore could not afford much time for phage mRNA optimization. Three significant findings emerged from our detailed genomic analysis of the pathogenic host and the three phages. Firstly, the translation machinery of M. abscessus strongly favors AUG as a start codon and UAA as a stop codon. They are strongly preferred by highly expressed genes in M. abscessus. Thus, therapeutic phages should be engineered to have AUG and UAA as start and stop codons, respectively. Secondly, secondary structure stability is much reduced near start and stop codons, especially in highly expressed genes. This is consistent with the hypothesis that a strong secondary structure embedding initiation and termination signals may interfere with the decoding of these signals. Therapeutic phage mRNA should be engineered to reduce secondary structure near important translation signals such as start codon, SD sequence, and stop codon. Thirdly, protein-coding genes in phage Muddy would benefit more from codon optimization than the other two phages. This is because M. abscessus, like M. tuberculosis [59,77], has a GC-rich genome, with its tRNA anticodons strongly in favor of G-ending and C-ending codons. Phage Muddy is relatively AT-rich, so its codon usage deviates more from the optimum than the other two GC-rich phages. Lastly, if stop codons UGA and UAG must be allowed in phage mRNA, they should be engineered to have nucleotide U at the site immediately following the stop codon.

Results
I present mRNA optimization specific to the M. abscessus strain GD01, which is the pathogen against which the three-phage cocktail was used. Results from GD01A and GD01B are nearly identical, so only results from GD01A were presented unless otherwise indicated.

Translation Initiation
I present results to show that the translation machinery of M. abscessus prefers AUG as a start codon, a much-reduced secondary structure stability in sequences near the start codon and SD sequence, and SD/aSD base-pairing with a D toStart within the range of 13-18. Phage mRNAs should be so modified to achieve efficient translation.

Start Codon Usage
AUG is typically the preferred start codon in [27,35,78], which is also true in M. abscessus (Table 1), together with E. coli and B. subtilis. The AUG usage as a start codon is presented for HEGs (highly expressed protein-coding genes, including ribosomal protein genes and the 11 highly expressed genes listed in Materials and Methods) and REST (all annotated protein-coding genes excluding HEG). HEGs consistently use AUG as a start codon more frequently than REST (Table 1), consistent with previous studies [27,34,79,80] that AUG is the start codon favored by the bacterial translation machinery. This effect is highly significant (p = 0.00986) when tested with a three-way log-linear model (SPECIES*START*EXPRESSION, where START has AUG and non-AUG categories and EXPRESSION has HEG and REST categories). (1) N AUG and N NonAUG : count of AUG and non-AUG start codons; P AUG and P NonAUG : proportion of AUG and non-AUG start codons; (2) HEG: highly expressed genes, including ribosomal protein genes and the 11 highly expressed genes listed in Materials and Methods; REST: all annotated genes on the genome excluding HEG.
Generalizing the early hypothesis that highly expressed genes are more optimized than lowly expressed genes [50], one may also hypothesize that rapidly replicating bacterial species should be under stronger selection pressure to optimize their translation machinery than slowly replicating organisms. E. coli replicates once every 20-30 min [28], B. subtilis once every 30-70 min [81], but M. abscessus once every 4-5 h [29]. If AUG is universally a better start codon than other alternatives, then one would predict that AUG should be used most frequently by E. coli, followed by B. subtilis, and finally by M. abscessus. This prediction is empirically substantiated ( Table 1). Both HEG and REST genes use AUG as a start codon most frequently in E. coli, but least frequently in M. abscessus. This species difference is also highly significant (p < 0.00001) when tested with the three-way log-linear model. The start codon preference is AUG > GUG > UUG > Other (i.e., CUG+AUA+AUC+AUU, Table 2), which is consistent with other bacterial and archaeal species [78].
Many publications have documented GUG and UUG as suboptimal to AUG for translation initiation, although most of studies involve the translation machinery in E. coli. A recent study [80] tested all 64 codons for their initiation efficiencies as start codons. Among the three canonical start codons (AUG, GUG, and UUG), translation initiation efficiency is consistently in the order of AUG > GUG > UUG [79,80]. The hypothesis proposed to explain this pattern states that a nucleotide mismatch weakens the pairing affinity between a start codon and the decoding tRNA anticodon, leading to reduced translation initiation efficiency [78]. This is the same hypothesis proposed previously for AUA as a suboptimal start codon [79]. The prediction from this hypothesis is that, if temperature is increased leading to further weakening of the base-pairing affinity, then start codons such as AUA or UUG with a nucleotide mismatch would exhibit reduced efficiency of translation initiation. Indeed, when temperature was increased from 21 • C to 37 • C, initiation efficiency decreased dramatically for mRNA with AUA as a start codon, but increased for mRNA with AUG as a start codon [79]. The hypothesis also explains the finding of an earlier study in which ACG as a start codon exhibits the same temperature sensitivity as AUA [82]. The frequency of start codon usage, in the order of AUG > GUG > UUG, has been documented in diverse bacterial and archaeal species [78], suggesting perhaps a universal preference of AUG as a start codon in prokaryotic translation machinery. The three phages do not use rare start codons CUG, AUA, AUC, or AUU (Table 2), which is similar to HEGs in the bacterial host. Phage BPs exhibits the highest AUG usage, higher than HEG genes of the bacterial host (87.30% vs. 78.57%). However, phages Muddy and ZoeJ still have many suboptimal GUG and UUG as start codons. Three of the four UUG start codons in Muddy genes ( Table 2) are in hypothetical genes. The remaining UUG is in a gene encoding ATPase, which is typically needed only during genome packaging [83]. Since it is not a structural protein and not needed in large quantities, there might be little selection for its translation efficiency, so a UUG start codon is tolerated. All genes encoding structural proteins in Muddy, such as capsids, minor tail proteins, and minor tail subunits, as well as tail assembly chaperones, use AUG as start codon. Thus, changing UUG to AUG in ATPase in Muddy may not result in a more efficient phage, although more phage ATPase than necessary may be produced.
The four UUG start codons in ZoeJ genes (Table 2) also include three in hypothetical genes. The remaining one is in a gene encoding lysin A, which the phage uses, together with holins, to lyse the bacterial cell wall in the late phase of the lytic cycle [84]. Phage lysins have been used by themselves as antibacterial agents [84,85]. A tentative hypothesis is that both ATPase and lysins are not mass-produced proteins such as capsids and tail proteins, so no strong selection operates on them to optimize their translation.

Secondary Structure Flanking the Start Codon
The minimum requirement for translation initiation is an accessible start codon that is not embedded in a strong secondary structure [22,35,46,79,86]. Previous experimental data based on engineered E. coli genes have shown a strong negative association between secondary structure stability near the start codon and translation efficiency in E. coli [31][32][33]. I measure the secondary structure stability by the minimum folding energy (MFE) computed over sliding windows of 40, 50, and 60 nucleotides. The patterns are similar from the three different window sizes and only results from the window size of 40 nt are shown below.
MFE equal to 0 means no secondary structure, and a stronger secondary structure corresponds to a more negative MFE value. Secondary structure stability decreases conspicuously in sequences flanking the translation initiation sequence in both M. abscessus ( Figure 1A) and the three phages ( Figure 1B). The weakest secondary structure corresponds to the SD sequence upstream of the start codon ( Figure 1). This is consistent with the interpretation [27,34] that a strong secondary structure embedding the SD sequence or the start codon is selected against because it prevents the translation initiation signal (SD and start codon) from being decoded by the aSD sequence and the initiation tRNA, respectively. The three phages are consistent in exhibiting a pattern of reduced secondary structures in sequences flanking the start codon ( Figure 2). However, there are highly significant differences among the three phage species (ANOVA on the three phage species blocked by the sliding windows, p < 2 × 10 −16 ). The secondary structure is significantly weaker (p < 2 × 10 −16 ) in Muddy genes ( Figure 2B) than in BPs genes ( Figure 2A) but stronger in ZoeJ genes ( Figure 2C) than in BPs genes (p < 2 × 10 −16 ). If I may extrapolate from experimental results in E. coli [31][32][33] where reduced secondary structure is associated with increased translation efficiency, then genes in Phage Muddy are expected to be translated more efficiently than those in the other two phages.
There could be two nonexclusive explanations for the difference in MFE among the three phage species. The first involves differential selection. Phage Muddy is naturally lytic and the other two phages are lysogenic before being genetically engineered to disable lysogeny [2]. Lytic phages are more frequently under optimizing selection than lysogenic phages because most genes in a lysogenic phage do not express during the lysogenic phase and are consequently at the mercy of mutations [27]. Given that adaptation represents a balance between selection and mutation, the prolonged accumulation of mutation during the lysogenic phase is likely to contribute to a suboptimal state. Therefore, genes in phages BPs and ZoeJ should have stronger (suboptimal) secondary structures near the translation initiation signals than the genes in Phage Muddy. The second explanation is differential mutation bias. Phage Muddy is more AT-rich than the other two phages (genomic AT% is 41.2% for Muddy, but 33.4% for BPs and 31.5% for ZoeJ). Secondary structure from an AT-rich mRNA is expected to be, on average, weaker than that from a GC-rich mRNA.
If a reduction in secondary structure stability in sequences flanking the start codon is functionally important, then one would expect highly expressed genes (HEGs) to exhibit a stronger pattern than an average gene in M. abscessus. This prediction is strongly supported (Figure 3). These results suggest that genes in the three phages can be modified to reduce secondary structure to improve their translation initiation efficiency.  I have previously reported that ATPase in Muddy and Lysin A in GPs have UUG as a start codon in their mRNAs, and hypothesized that these proteins may be lowly expressed relative to structure proteins and consequently have experienced less selection for increasing their translation efficiency. If ATPase and lysins are indeed lowly expressed, then the selection for reducing secondary structure in sequences flanking their start signals should also be weak. I contrasted MFE along a sliding window for ATPase and lysin genes from the three phages against structural protein genes ( Figure 4). Secondary structure, as indicated by MFE, is indeed weaker for structural protein genes than for ATPase and lysin genes. Thus, both the use of UUG as a stop codon and the relatively strong secondary structure near the translation initiation signal in ATPase and lysin genes appear consistent with the hypothesis that ATPase and lysins are not mass-produced and their genes have not experienced strong selection for increasing translation efficiency. Figure 4. Contrast of average MFE between genes encoding structure proteins (representing massproduced phage proteins, including capsid proteins, head-to-tail adaptors, head-to-tail stoppers, major tail subunits, minor tail proteins, tail assembly chaperones, tail terminators, minor tail subunits, and scaffolding proteins, n = 34 from all three phages) and those encoding ATPase and lysins (representing non-mass-produced phage proteins, n = 6 from all three phages). MFE is calculated along a sliding window of 40 nt. The vertical bar indicates the position of the start codon.

Shine-Dalgarno Paring
In addition to a good start codon and the lack of a strong secondary structure embedding the start codon, an efficiently translated bacterial or phage mRNA must have a well-positioned SD sequence. Highly expressed bacterial genes or phage genes are more likely to have a well-positioned Shine-Dalgarno sequence than lowly expressed genes [22,27].
SD/aSD base-pairing is illustrated with two ribosomal protein genes rpsT and rpsI in M. abscessus ( Figure 5A) to highlight a common misconception of an optimal distance between SD and the start codon. The distance (L 1 ) between the SD sequence for rpsT (AAGGA, Figure 5A) and the start codon is greater than that (L 2 ) for rpsI (SD = AGGUG, Figure 5A). As both ribosomal proteins are highly expressed, one may wonder if the differences between L 1 and L 2 matter in translation initiation. A model of SD/aSD basepairing [27,34] to juxtapose the start codon and the initiation tRNA anticodon ( Figure 5B) shows SD 1 and SD 2 by base-pairing with different aSD sequences at the 3' end of the small subunit (ssu) rRNA; both serve to align the start codon against the anticodon of the tRNA. In spite of the differences between L 1 and L 2 , the distance D toStart remains the same, being 15 for both rpsT and rpsI. If D toStart changes, then the aligned start codon and the initiation tRNA anticodon would be shifted out of alignment ( Figure 5B). The model in Figure 5B predicts that D toStart must be constrained within a narrow range. This prediction is substantiated by the frequency distributions of D toStart ( Figure 5C) which suggests that a good D toStart should be within the range between 13 and 18. This constrained D toStart range is also visible in the phage genes ( Figure 5C). Therefore, genes encoding mass-produced proteins in therapeutic phages targeting M. abscessus should be engineered to have their D toStart within the range of 13 and 18.
I previously hypothesized that ATPase in Muddy and Lysin A in BPs are not subject to strong selection for translation efficiency. They both have UUG as the start codon, and their secondary structures near their start signals are more stable than more highly expressed structure protein genes ( Figure 4). ATPase mRNA in Muddy does not have effective SD/aSD base-pairing; the maximum number of consecutive base-pairs is three. Lysin A mRNA has an SD/aSD base-pairing with D toStart = 18, which is at the upper limit of the preferred D toStart range of 13-18 ( Figure 5C).
It has long been noted that both base-pairing strength and the position of SD/aSD affect translation initiation efficiency [82]. When SD/aSD involves only four consecutive base-pairs, D toStart is widely distributed in M. abscessus genes ( Figure 6A). However, the distribution of D toStart becomes narrower and peakier when the SD/aSD base-pair increases to seven, when the preferred D toStart = 17 ( Figure 6A). The figure does not include genes with eight or more consecutive base-pairs because there are too few of such genes to form a meaningful distribution.
The same relationship between D toStart and SD/aSD is visible in E. coli ( Figure 6B), although the distribution is overall much narrower in E. coli genes than in M. abscessus genes. E. coli replicates much more rapidly than M. abscessus, which may indicate stronger selection for more efficient translation in E. coli than in M. abscessus. The most frequent D toStart is 13 in E. coli.

Translation Elongation and Codon-Anticodon Adaptation
Codon-anticodon adaptation depends on both tRNA-mediated selection and mutation. Different tRNAs differ in availability. Highly expressed genes preferentially use codons corresponding to the most abundant tRNA to increase translation efficiency [21,38,47,48]. This pattern is particularly strong in highly expressed genes in rapidly replicating unicellular organisms such as bacterial species [21,48,50,87] or yeast species [21,48]. Protein production increases with codon optimization, especially when translation initiation is efficient [31,33].
Mutation can disrupt codon-anticodon adaptation. For example, A-biased mutation in HIV-1 disrupts the codon adaptation of HIV-1 genes to the human tRNA pool, which favors C-ending and G-ending codons in some codon families [30]. I used nucleotide frequencies of sequences between genes in M. abscessus as a proxy of mutation spectrum ( Table 3). The GC% of 59.71% indicates GC-biased mutation in M. abscessus, as previously observed in M. tuberculosis [59]. If there is no tRNA-mediated selection, then the nucleotide frequencies at the third codon site should have similar GC%. However, the GC% at the third codon site is much greater (Table 3), especially so in highly expressed genes (HEGs). This suggests strong selection in favor of C-ending and G-ending codons, consistent with the observation that M. abscessus tRNAs feature C and G at the wobble site favoring Cand G-ending codons. This preference of C-and G-ending codons is particularly strong in highly expressed genes (HEG-CS3, Table 3).
The tRNA-mediated selection is strong against codon AUA in the Ile codon family in M. abscessus species. Among the same 48 tRNA genes shared by three M. abscessus genomes (GD01A, GD01B and GZ002), only a single tRNA Ile/GAU exists for decoding Ile codons, with a GAU anticodon decoding AUC and AUU codons. The translation of AUA would require a noncanonical G/A wobble between two purines. Highly expressed genes (HEGs) in GD01A have only 2 AUA codons out of 755 Ile codons, indicative of strong selection against AUA codons. The other Ile codons in HEGs include 689 AUC and 64 AUU codons. Anticodon GAU forms perfect base-pairing with codon AUC but wobble with codon AUU, which explains the preference of AUC over AUU.
Codon adaptation mediated by differential tRNA availability was traditionally measured by the codon adaptation index (CAI) [88] and its improved version [89]. Calculation of CAI requires a codon usage reference of known highly expressed genes, e.g., ribosomal protein genes. One main shortcoming of CAI is that it does not take background mutation bias into consideration. This motivated the development of a more general index of translation elongation (I TE ) of which CAI is a special case when there is no background mutation bias [31]. I TE is implemented in DAMBE [90] with compiled codon usage for many commonly used species. Table 3. Nucleotide frequencies from sequences between genes (Between) and from the third codon site of all coding sequences (CS3) and of highly expressed genes (HEG-CS3). Column "N Nuc " is the total nucleotide count, and the last four columns show nucleotide proportions. Among the three groups of genes in the M. abscessus genome, HEGs naturally have the highest mean I TE , followed by REST and pseudogenes (Table 4). It is not surprising that pseudogenes should have the poorest codon adaptation measured by I TE because they are not subject to tRNA-mediated natural selection optimizing their codon usage. In contrast, functional genes, especially highly expressed ones, are subject to selection to increase translation elongation efficiency. The difference in I TE between HEGs and REST is greater in the reference genome (Mean ± SE = 0.770736 ± 0.006869 for HEGs and 0.639903 ± 0.000789 for REST) than in GD01. The three phages differ significantly in I TE (Table 4) based on an ANOVA test (F = 19.747, DF betweenGroup = 2, DF withinGroup = 223, p < 0.00001). Muddy genes have the lowest mean I TE (Table 4). However, even Muddy genes have I TE s significantly higher than M. abscessus pseudogenes based on a two-sample t-test (t = 3.9933, DF = 99, p = 0.0001, two-tailed test). This suggests positive selection for codon optimization. Two factors may contribute to the low I TE in Muddy genes. The first factor involves overlapped genes or frameshifted sites that are in one reading frame for one gene but in another reading frame for another gene. Muddy has 513 of such functionally constrained sites, but BPs and ZoeJ have only 217 and 255 of such sites, respectively. Since optimizing codons for one gene at such sites would alter protein sequence in another gene, and since altering protein sequence is typically deleterious, natural selection would not favor such codon optimization that would result in deleterious amino acid replacement. Therefore, Muddy genes are less amenable to codon optimization, and consequently have lower I TE s than the genes in the other two phages. The second contributing factor invokes differential mutation bias among the three phages and is based on the following observations: All major codons in M. abscessus (i.e., codons preferred in highly expressed genes and typically decoded by the most abundant tRNA) are NNC or NNG codons. This suggests that GC-biased mutation is in the same direction as tRNA-mediated selection. However, the Muddy genome is more AT-rich than the other two phage genomes (41.2% for Muddy, but 33.4% for BPs and 31.5% for ZoeJ), suggesting a less GC-biased mutation in Muddy than the other two phages. Thus, less GC-biased mutations in Muddy would shift codons away from the optimal NNC or NNG codons. In short, highly expressed Muddy genes would benefit more from codon optimization than genes in the other two phage species.

Species
There are other factors affecting codon adaptation. A phage just recently switched to a new bacterial host will not have time to evolve adaptation in the new host. Moreover, a phage parasitizing multiple hosts with different tRNA pools would experience fluctuating selection, i.e., adaptation to the tRNA pool in one host may leads to suboptimal codon usage in another host. All these factors could contribute to suboptimal codon usage in phage genes.
One complication in optimizing phage codon usage is the potential change of the tRNA pool in the host cell between early and late phase of the infection cycle. In HIV-1, early genes have similar codon usage as host genes, which is expected because these early genes are translated with a normal tRNA pool shared with host gene translation. However, late genes in HIV-1 exhibit a different codon usage bias. Experiments measuring tRNA abundance suggests that late HIV-1 genes adapt to the late tRNA pool that is different from the tRNA pool in the normal cell [30]. Interestingly, a parallel case has also been documented in phage lambda [91].

Termination Signals
Our results highlight three features involving stop signals in M. abscessus. First, UAA is the optimal stop codon although UGA is the most frequently used stop codon. UAA prefers a +4G but UGA and UAGs prefer a +4U. The secondary structure stability is greatly reduced in sequences flanking a stop codon.

Identification of Optimal Stop Signals
UGA stop codon is used most frequently in M. abscessus genes (Table 5), similar to M. tuberculosis [59]. The latter, as well as other GC-rich bacterial species, expresses much more RF2 than RF1 and uses UGA as a stop codon much more frequently than UAG or UAA codons. However, highly expressed genes in these GC-rich genomes invariably prefer UAA stop codons [23,59]. This is also true in M. abscessus (Table 5)with a higher proportion of genes using UAA in HEGs than in REST genes (21.9% for HEG vs. 15.6% for REST, Table 5). UAA in bacterial species, especially in E. coli, has been shown to have the lowest error rate in translation termination with almost any nucleotide at the +4 site [53,[60][61][62][63][64][65]. Without contrasting between the HEG and REST, one might be misled to think that UGA, being the most frequently used, is the preferred stop codon by the translation machinery in M. abscessus. Among the three phages, Muddy has the highest proportion of genes using the UAA stop codon, comparable to the proportion in HEGs. The lower proportion of genes using UAA in BPSs and ZoeJ may reflect the fact that they are lysogenic phages in which most genes are not under selection during the lysogenic phase. However, it could also result from stronger GC-biased mutation in these two phages than in Muddy. Given that highly expressed genes prefer UAA stop codons [59], which is consistent with the results in Table 5, one should change stop codons in highly expressed phage genes to UAA.

Differential Nucleotide Preference at the +4 Site
The +4 site is the nucleotide site immediately downstream of the stop codon. The association between stop codon and +4 nucleotide for all 4707 genes in M. abscessus (ALL subtable in Table 6) partially reflects mutation bias, i.e., a preponderance of +4C (43.8%, 40.4%, and 36.5% for UAA, UAG, and UGA, respectively. Table 6). This pattern is surprising because +4C is known to reduce termination efficiency in diverse organisms, from bacteria [56,58,67,92] to yeast [93][94][95] and other eukaryotes [96][97][98]. A +4C increases readthrough error particularly strongly for stop codon UGA [63,92,99]. One explanation for this preponderance of +4C is the GC-richness of the M. abscessus genome, i.e., GC-biased mutation creates +4C even if it is suboptimal. If +4C reduces translation termination efficiency and accuracy in M. abscessus as in other bacteria, then I should expect +4C to be avoided in HEGs [98]. This expectation is supported by empirical data for HEGs (HEG subtable in Table 5), with UAA associating with +4G, and UAG and UGA codons associating with +4U (HEG subtable in Table 6). This is consistent with previous studies that +4U decreases stop codon misreading and is preferred by highly expressed genes in diverse bacterial species [67,71,92,93,100]. Therefore, stop codon UAA in phage mRNA intended to replicate in M. abscessus should be engineered to have +4G and stop codons UAG and UGA should have +4U.

Reduction in Secondary Structure Stability in Sequences Flanking the Stop Codon
The tetranucleotide stop signal, just like the translation initiation signal represented by the SD sequence and the start codon, would not be available for decoding if it is embedded in a strong secondary structure. I have previously shown reduced secondary structure stability in sequences flanking the initiation signal. This is also true for sequences flanking the stop codon, not only for genes in M. abscessus ( Figure 7A), but also in the three phage species ( Figure 7B).
If the reduction in secondary structure stability near the stop codon is important for translation termination, then one should expect the highly expressed genes in M. abscessus to exhibit even a stronger pattern than the average of all M. abscessus genes shown in Figure 7A. This is indeed the case (Figure 8). The change in MFE near the stop codon is the most dramatic in HEGs in M. abscessus (Figure 8). The phage genes have weaker secondary structure than an average M. abscessus gene. However, the weak secondary structure in HEGs ( Figure 8) suggests that one should aim to further reduce the secondary structure stability of phage mRNAs.  Different optimizations can conceivably conflict with each other. For example, the importance of having a weak secondary structure in sequences flanking the start and the stop codons would conflict with the optimization of codons immediately downstream of the start codon or immediately upstream of the stop codon because replacing a minor codon by a major codon could increase local secondary structure. Integrated software is needed to facilitate simultaneous optimization of the three subprocesses of translation.

Discussion
Will translation optimization necessarily increase phage replication and consequently bactericidal efficiency? If translation is not a limiting factor in phage replication, then we should not observe translation adaptation of phage genes to host translation machinery. The ubiquitous observation of adaptation of phage genes to host translation machinery suggests that optimization of translation efficiency is beneficial to the phage proliferation and that the selection that optimizes translation has overwhelmed the disruptive effect of mutation. Ideally one should produce viral proteins in the right proportion. It is not clear what the optimal proportion of different viral proteins is, but the stoichiometric ratio, if known, should make a good approximation. If a virion needs 100 copies of protein A and 10 copies of protein B, then one should engineer the phage to make proteins A and B in such proportions. However, phage infection and replication depend on many factors so translation efficiency may not be a limiting factor. This may well be true even within the translation framework. For example, optimization of codon usage in genes with low translation initiation efficiency contributes little to protein production [31][32][33].
There are many concerns in phage therapy [101]. It is more difficult to elucidate the potential side effects of therapeutic phages than antibiotics. The latter are mostly small molecules with limited numbers of functional groups. In contrast, phage proteins are macromolecules with far more functional groups that can potentially interact with many cellular components and immune systems. While phages have not been reported to become human pathogens, they are obviously capable of modulating microbiomes in our digestive system [102][103][104] and causing diseases [105,106]. Some phages were intended to kill intracellular pathogens phagocytosed into human cells [107,108], i.e., they can enter eukaryotic cells. Free phages can travel across multiple human tissues [102,109,110]. T7 phages administered through mouse tail veins can be recovered in uterus and fetal tissues [111]. The potential interaction of such phages with human cells is unpredictable. Experimental studies are clearly needed to investigate potential side effects of engineered phages.
It is often assumed that phages are highly host-specific. However, viruses frequently switch hosts in evolution. Bacterial hosts such as E. coli can evolve resistance to phage T4. In this particular context, it is beneficial for phage T4 to find new hosts, i.e., nonresistant strains of E. coli or other bacterial species. It turns out that the host specificity in T4 (and close relatives of T4) depends heavily on the C-terminus of the distal tail fiber which binds to the host receptor [2]. Mutations at this distal tail fiber can allow phage T4 to enter bacterial species that belong to different families, e.g., Yersinia pseudotuberculosis or even different orders, e.g., vibrio species [112]. Host switching also occurs in mammalian viruses, e.g., the feline panleukopenia virus switching to dogs to become canine parvovirus [113,114] (p. 793), the avian influenza virus traveling from birds to mammals, coronaviruses passed from bats or camels to humans.
One may ask what platform can be used to optimize phage mRNA in a phage genome, and create an infectious phage with the optimized genome. There are two possible approaches. The first makes use of the method that created the first bacterial cell with a chemically synthesized genome [115]. One can incorporate an optimized phage genome into a bacterial genome, and then induce the lytic phage to generate new phages. This should be suitable for DNA phages. The second would follow the reverse genetics approach [116][117][118][119] which was used to generate a full-length coronavirus genome and make it infectious. The approach potentially can be used to generate a full-length phage genome and make it infectious. It is suitable for RNA phages.
One of the problems in the development of phage therapies is that naturally occurring phages may not be patentable because many of them have been in use over the past century. The proposed mRNA optimization would help biopharmaceutical companies to patent therapeutic phage. However, phage optimization can be applied in nontherapeutic scenarios. For example, phages have been used to decontaminate food items [101,120]. Such phages would also benefit from such phage mRNA optimizations.

Materials and Methods
The pathogenic M. abscessus subsp. massiliense strain GD01 has two variants with complete genomic sequences (NZ_CP035923 for GD01A and NZ_CP035924 for GD01B). These were downloaded from GenBank, together with the reference genome of M. abscessus (NZ_CP034181, strain GZ002), which is better-annotated. For example, genes tig and tsf have their gene names in the reference genome but only a locus_tag in the genomes for GD01A and GD01B. The reference genome allows us to match locus_tags to gene names. I also downloaded the three therapeutic phage genomes (BPs: EU568876; Muddy: NC_022054; ZoeJ: NC_024147).
The M. abscessus reference genome includes 4942 protein-coding genes (CDSs), of which 30 are annotated as pseudogenes. The GD01A and GD01B genomes have 4770 and 4759 CDSs, respectively, and 63 and 61 pseudogenes, respectively. These pseudogenes were excluded from data analysis involving translation initiation, elongation, and termination because some of them have lost the 5' or 3' end of their functional counterparts, and some have frameshifting mutations.
To test the prediction that secondary structure should be reduced near important translation signals such as start codon, SD sequence, and stop codon, I measured secondary structure stability by the minimum folding energy (MFE) implemented in DAMBE [90] that uses the Vienna RNA fold library [122] for secondary structure characterization. For quantifying MFE in sequences flanking the start codon, 50 nt (nucleotides) immediately upstream of the start codon and 50 nt downstream of the start codon were extracted, with the start codon occupying sites 51-53 in the resulting 100 nt sequence. A sliding window of 40 nt was used to characterize the change of MFE along the sequence. Alternative sliding windows of 50 nt and 60 nt were also used. This was similarly done for quantifying MFE in sequences flanking the stop codon; 50 nt at the 3 end of the coding sequence (including the stop codon) as well as 50 nt immediately downstream of the stop codon were extracted, with the stop codon occupying sites 48-50. The same sliding window approach was used to measure the change of MFE along the sequence.
To characterize the optimal pairing configuration between SD and aSD, I measured the D toStart distance, which is the number of nucleotides between the 3 end of 16S rRNA and the start codon given an SD/aSD pairing. Previous studies have shown D toStart to be constrained to a narrow range for highly expressed bacterial genes [90].
All sets of sequences extracted from the GD01A genome (NZ_CP035923) and used in this manuscript are included in the Supplementary Files. The file name convention is explained below. There are six gene groups: MaHEG (n = 73, where Ma is abbreviation for M. abscessus), MaREST (n = 4635), MaPSEUDO (n = 63), PhageBPs (n = 63), PhageMuddy (n = 71), and PhageZoeJ (n = 92). Each gene group features three types of sequences: (1) CDSs, (2) Start_50_50 sequences (50 nt of 5' UTR immediately upstream of the start codon plus 50 nt of the CDS, including the start codon occupying sites 51-53), (3) Stop_50_50 sequences (50 nt of 3 end of CDS followed by 50 nt of 3 UTR, with the stop codon occupying sites [48][49][50], and (4) Up30 sequences (30 nt immediately upstream of the start codon) for characterizing strength and position of SD sequence. There should be 6 × 4 = 24 sequence files. However, because pseudogenes often miss 5 ends or 3 ends, only CDS sequences, but not the last three types of sequences, were compiled for pseudogenes, so the total number of files is 21. The file name follows the format of GeneGroup_SequenceType.FAS. For example, the file containing CDSs of the highly expressed genes in M. abscessus is named MaHEG_CDS.FAS. The first and the last genes may not have 50 nt upstream or downstream of the coding sequence. For example, the first CDS in the phage BPs genome starts at site 43. This gene is not included in PhageBPs_Start_50_50.FAS file.

Conclusions
Natural phage mRNAs may not be optimized to their host translation machinery for three reasons: (1) a phage has recently switched to a new host with no time for adaptation, (2) a high mutation rate is disrupting adaptation, and (3) a phage is parasitizing multiple hosts with different translation machineries so that adaptation to one host would lead to suboptimal adaptation to another host. For these reasons, phage mRNAs must be optimized for efficient translation in a specific bacterial pathogen. I have detailed the optimization principles for the three subprocesses of translation: initiation, elongation, and termination. The optimization principles can be further enhanced by taking stoichiometric ratios of different phage proteins. However, stoichiometric ratios of phage component proteins are unknown for most phages.