Biophysical Modeling of SARS-CoV-2 Assembly: Genome Condensation and Budding

The COVID-19 pandemic caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has spurred unprecedented and concerted worldwide research to curtail and eradicate this pathogen. SARS-CoV-2 has four structural proteins: Envelope (E), Membrane (M), Nucleocapsid (N), and Spike (S), which self-assemble along with its RNA into the infectious virus by budding from intracellular lipid membranes. In this paper, we develop a model to explore the mechanisms of RNA condensation by structural proteins, protein oligomerization and cellular membrane–protein interactions that control the budding process and the ultimate virus structure. Using molecular dynamics simulations, we have deciphered how the positively charged N proteins interact and condense the very long genomic RNA resulting in its packaging by a lipid envelope decorated with structural proteins inside a host cell. Furthermore, considering the length of RNA and the size of the virus, we find that the intrinsic curvature of M proteins is essential for virus budding. While most current research has focused on the S protein, which is responsible for viral entry, and it has been motivated by the need to develop efficacious vaccines, the development of resistance through mutations in this crucial protein makes it essential to elucidate the details of the viral life cycle to identify other drug targets for future therapy. Our simulations will provide insight into the viral life cycle through the assembly of viral particles de novo and potentially identify therapeutic targets for future drug development.


Introduction
Challenging every aspect of daily lives, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the etiologic agent of COVID-19, has driven scientists all around the world to rapidly develop technology and research advancing our understanding of this virus and the means to combat it. The coronavirus family extracts a significant burden on the global population through the sheer number of viruses from this family causing disease. Before the pandemic, research had been hampered by the limited scope of severe infections dying out before becoming wide spread health risks (e.g., the original SARS-CoV) and by the high level of complexity involved in the formation and assembly pathways of these viruses. Up to now, nearly all vaccines target the spike (S) proteins inhibiting viral attachment to the host cell membrane thus preventing viral entry into the host cell. Focusing on this single mechanism to disarm the virus can easily be rendered ineffective by mutations in the S protein.
While the S protein does play a crucial role in the life-cycle of the virus, it is not integral to the assembly process. Coronaviruses have three other structural proteins: Envelope (E), Membrane (M), and Nucleocapsid (N), which are known essential constituents for the assembly and formation of the viral envelope, see Figure 1. It is known that the N proteins attach to the genomic RNA and form a compact viral ribonucleoprotein complex (RNP) [1][2][3][4][5]. It appears that the electrostatic interaction between the positively charged N The morphology of the formed RNP and how the N proteins interact with and condense the genomic RNA resulting in its packaging by a lipid envelope decorated with structural proteins inside a host cell are still unclear. Two models have been proposed based on Electron Microscopy (EM) images of RNP composite structures. Some studies [2,3,8,9] show that N proteins and RNA form a helical fiber structure (9-16 nm in diameter), which attaches to the interior of the viral envelope. Other studies [4,10] suggest that the entire RNP consists of 35-40 copies of small but connected clusters in which each cluster contains 12 N proteins attached to 800 nucleotides forming a spherical layer with a thickness of 15 nm.
In addition to the uncertainty about the structure within the virus envelope, the process of packing of RNP and its budding from the organelles inside the host cell is not well understood. In fact, it is believed that the interaction between N and M proteins play an important role in the RNA encapsulation by the lipid envelope [11][12][13][14][15]. Using cryo-EM, Neuman et al. have shown that the M protein forms a matrix which is responsible for the curvature of the viral envelope [16,17]. Several other studies show that the presence of M proteins is required for virus budding and envelope formation [14,[18][19][20]. It is worth mentioning that the M protein is the most abundant structural protein in the viral particle. However, due to limitations of resolution in electron microscopy imaging, our knowledge remains rudimentary about the properties of all the structural proteins including M proteins composing the viral envelope.
In general, a careful study of the assembly experiments reveals that there are very few known details about the formation and assembly of coronaviruses. Computational models have often used to shed light on the myriad of complexities contributing to the assembly of other viruses not easily gleaned from experiments [21][22][23][24][25][26][27][28][29]. All-atom molecular dynamics simulations provide powerful analytic tools in analyzing protein-protein interactions. However, because these processes represent miniscule portions of an immensely larger enterprise and occur over time scales on the order of nanoseconds, they are nearly impossible to apply to the entire assembly and budding process [30][31][32]. To this end, coarse-grained (CG) modeling has been employed to explore the kinetics of assembly of spherical icosahedral viruses. While a considerable number of CG computer simulations and the in vitro self-assembly experiments of other viruses such as HBV [33,34] have had remarkable impact on deciphering the factors that contribute to the assembly of viruses and the means to combat them [35][36][37][38][39][40][41][42][43], similar systematic studies aimed at understanding the assembly of coronaviruses are missing.
To investigate and understand the assembly pathway of SARS-CoV-2 and gain insight into behavior over longer time scales, we have designed coarse-grained models for RNA, phospholipid membrane, and N and M proteins. Using Molecular Dynamics simulations, we have examined what types of interactions between the membrane, proteins, and genome are necessary to give rise to the packaging of RNA and the assembly of the virus. Our simulations reveal the role of each structural protein in the formation of the SARS-CoV-2, the mechanisms underlying the assembly process and the physical factors contributing to the virus morphology, see Video S6 and Video S7.
We find that, while the positively charged N monomers can condense the negatively charged RNA, oligomerization of N proteins is essential for the packaging of the very long genome of coronaviruses. Without N oligomerization, the RNA will not be compact enough to be packaged. As a result of the oligomerization of N proteins, RNA and Nproteins form several small helical clusters connected through small segments of RNA, which leads to the condensation of the long genome and its packaging. The structures of RNP in our simulations reveal why some experimental results show that RNP has a helical structure [2,3,8,9], while others find that it is a combination of several clusters with no particular order and symmetry [4,10]. In fact, we find that it is a combination of the two observed cases. Our study shows that small helical clusters can join and condense RNA resulting in its packaging; however, one single long helical structure renders RNP too stiff to be packaged.
Furthermore, considering the length of RNA and the size of the virus, our study shows that the intrinsic curvature of M proteins is necessary for viral budding. Under no circumstances were we able to observe the packaging and formation of an envelope through budding in the absence of the M protein intrinsic curvature. Consistent with the experimental data on most coronaviruses [17,19,20], we find that, while the interaction between the M and N proteins is not absolutely necessary for the budding process of an empty envelope, it is necessary for the genome packaging and significantly facilitates the virus budding through the host cell membrane.
Our work elucidates the fundamental process of viral formation essential to its reproduction, survival and ultimately its infectivity. Even though our knowledge about the coronavirus proteins responsible for the virus assembly is still very basic, our CG model based on the known protein structures has been able to successfully condense the RNA and shed light on the kinetic pathways of the viral assembly. Our simulations have deciphered to some extent the elusive assembly process of SARS-CoV-2, which could be the basis and guide for the design of future experiments.
Understand that the factors contributing to the formation of coronaviruses can ultimately prevent viral replication through disruption of the assembly. It is through inhibition of multiple viral mechanisms that resistance is most effectively thwarted and viral replication suppressed.

N Protein
We construct our coarse-grained model for the N proteins based on the fact that they are highly positively charged and mainly form dimers in solution. Furthermore, NMR, small angle X-ray scattering, and other microscopy experiments [44][45][46] reveal that the structure of N proteins (built from 419 amino acids) can be divided into five regions, namely: the N-terminal disordered region (N IDR ), the RNA binding domain (RBD), the Linker disordered region (Linker IDR ), the C-terminal dimerization domain (CTD), and the C-terminal disordered region (C IDR ), in which N IDR , Linker IDR and C IDR form the highly flexible intrinsically disorder regions (IDRs) of the protein [47,48]. The in vitro experiments using reconstitution approaches and cellular assays have shown that the Linker region plays a crucial role in phase separation of the N protein with RNA [49], whereas the CTD mainly contribute to the N protein oligomerization [3].
A careful review of the entire 419 residues of N proteins show that the majority of basic residues are located in N IDR (+4, residues 1-43), the middle of RBD (+5, residues 89-110), Linker IDR (+6, residues 181-243), and the N-terminal of the CTD (+9, residues 244-278), as shown in Figure 2. Since our focus is on the electrostatic interaction between the N proteins and RNA, we assume that the N protein is composed of three regions based on its charge distribution. A schematic view of our N protein model is shown in Figure 2. Each N protein consists of three hard particles: N-terminal (N1, red), Linker (N2, yellow), and C-terminal (N3, black) regions, respectively. Note that the C-terminal region contains C IDR and CTD, while the N-terminal region contains N IDR and RBD. We note that, since the charges in CTD are located at the N-terminal sites, we combine them with the Linker IDR charges, which form the N2 region with +15 charges, the yellow particle in Figure 2.
We further perform a set of molecular dynamics simulations to adjust the strength of N-N interaction based on the experimental data reporting that most N proteins form dimers in solution [44]. We find that the oligomerization of N proteins largely depends on the relative size of N1, N2, and N3 domains as shown in Figure S1. This is mainly due to the fact that a smaller N3 effectively increases the electrostatic repulsion between charged regions (N1 and N2) rendering the higher order oligomerization (trimers, tetramers, . . . ) less favorable. Based on the analysis of the distributions of the N protein oligomerization, we set the diameters of N1, N2, and N3 particles equal to 1.2a, 0.5a, and 0.8a, respectively, and set N 3 N 3 = 20 for the N3-N3 interaction, where a is the system unit length corresponding to 3 nm. With these parameters, the majority of oligomers form dimers regardless of the N protein concentration, consistent with the experimental data.  [50,51]. The intrinsic disorder regions (C IDR , Linker IDR , N IDR ) are obtained from UniProt P0DTC9. The visualization is performed through RCSB 3D-view, where the concentrated positive charges are colored light green; the rest of the regions have a zero net charge. The shaded circles indicate the coarse-grained model including three spherical particles: the N-terminal region (N1, red), the linker region (N2, yellow), and the C-terminal region (N3, black).

Membrane and Membrane Proteins
We model the ERGIC membrane as a spherical vesicle built of a triangular lattice; see Figure 3A. The energy of the triangular network involves both the bond stretching and bending energies. The stretching energy can simply be expressed by a harmonic potential summed over all bonds as follows: where l 0 is equal to a, the system length scale corresponding to 3 nm as described before, and k s = 20a −2 is the stretching modulus. The bending energy is determined by summing over all pairs of triangular subunits sharing an edge and can be written as with θ 0 = π the preferred angle and k b = 20 the bending modulus. To make the membrane incompressible, we have placed a spherical hard particle with diameter a (green sphere in Figure 3A) on each membrane vertex. As for the M protein, we have modeled it as a rigid body composed of three particles, M1, M2, and M3, corresponding to N-terminal, transmembrane, and C-terminal domains, respectively, see Figure 3B. We have then replaced a fraction f of the membrane vertices with membrane proteins as illustrated in Figure 3A. The M2-M2, M3-M3, N3-N3, and N3-M3 (see Figure 3B,C) interactions are modeled by Lennard-Jones potential, where σ = 2 −1/6 d with d the distance between two interacting particles when touching each other and ij and r ij are the interaction strength and distance between two particles i and j, respectively. The interactions are subject to a cutoff distance r cut = 3a.
The electrostatic interactions between N2, N3, and the genome are modeled by Debye-Hückle interaction potential, where the Bjerrum length l b = 0.24a corresponds to the length scale over which the electrostatic interaction between two elementary charges is comparable to the thermal energy, Z i , and Z j are the charge numbers, d is the sum of the radius of two particles, and κ −1 = 0.5a is the Debye-Hückel screening length. In addition, the excluded volume interaction between all particles is presented with a shifted LJ repulsion potential with = 1, r cut = d and σ = 2 −1/6 d.

Genome
While the length of SARS-CoV-2 RNA is 30k nucleotides, it is important to note that 63.8% of the RNA is base-paired [52]. Given that dsRNA forms an A-form helix which is 20-25% shorter than dsDNA helix [53,54], we assume each bead is built of 11-12 bp. Thus, we use 800 beads to present the double-stranded segments of RNA. We note that the single-stranded segments of RNA are mostly at the branching points or stem loops [4]. For simplicity, we did not explicitly model the single-stranded segments but rather implicitly consider their role in RNA flexibility. To this end, we model RNA as a negatively charged linear chain composed of L = 800 beads whose diameter is 1.0a and are connected by unextendable bonds. The chain is the so-called freely-jointed model if we do not consider the electrostatic repulsion. Because of the base-pairing and the secondary structures, RNA has often been modeled as a branched polymers [23,55,56]. Thus, we also consider the case of branched polymers with the same total length L = 800.
Taking into account the condensation of counter ions around the genome [57], we set the effective number of charges of each coarse-grained genome bead equal to Z g = b zl b Q g with b the distance between the genome phosphate charges, l b the Bjerrum length as described above, z the valence of the counter ions, and Q g the original charge number of each bead. In the following sections, we perform a number of simulations for Z g ∈ [−2, −10] and employ the Debye-Hückle theory to model the protein-RNA electrostatic interactions.

Simulations
To study the kinetics of budding, we use the Lagenvin integrator to integrate the membrane vertices movement, with time step dt = 10 −3 s. At every 100 time steps, we perform Monte Carlo (MC) simulations combined with the bond flipping method to change the membrane connectivity. The details of Bond Flip moves are shown in Figure 4. Each MC step consists of N b attempts of removing a randomly chosen bond between two triangles and attaching two vertices that were not linked before as shown in Figure 4, where N b is the total number of bonds. Detachment and attachment of the bonds are such that the total number of vertices and bonds remain constant. The simulation is performed on NVIDIA GeForce RTX 3090 where a 8000 s run takes approximately 3 h. We have employed HOOMD-blue [58] for the Molecular Dynamics (MD) simulations and built a plug-in for Bond-Flip MC simulations. All simulations are visualized by OVITO software [59].

N Protein and RNA Assemble into RNP with Multiple Protein Clusters
As noted in the Introduction, coronaviruses have the largest genome among all singlestranded RNA viruses. Several experiments have shown that the presence of N proteins is necessary for the packaging of such a long genome by the viral envelope. In fact, it is now widely accepted that the electrostatic interaction between the positively charged N proteins and negatively charged RNA is the driving force for genome encapsulation.
In this section, we explore the condensation of RNA by N proteins into a compact ribonucleocapsid structure (RNP) as a function of the concentration of N proteins and the strength of genome-N protein interaction. Following the literature on the structure of the N protein, we have modeled it as described in the Methods section; see Figure 2. Note that we have chosen the N-N interaction based on the structure shown in Figure S1, where the dimers are dominant structures in the solution. It is worth mentioning that the stronger N-N interaction results in the formation of higher-ordered oligomers, but, as it is illustrated in Figure S3, they do not modify the final structure of RNP. Using these subunits, in the majority of cases, we find that N proteins form several clusters along the length of RNA. Quite interestingly, as illustrated in Figure 5, RNA wraps around N proteins to form several clusters with helical structures that are connected by small segments of naked RNA.
To quantify the structure of RNPs and to create a basis for the comparison with the future experiments, we have analyzed the clusters and assigned a unique index number and color to each one, see Figure S2. If the distance between two neighboring N proteins is less than a cutoff distance 1.2a with a the system length scale 3 nm, then we consider that they belong to the same cluster. Figure 5A shows the snapshots of the simulations of the formation of several clusters built from N proteins along a linear genome for two cases of weak and strong RNA-N protein interactions. The top row illustrates the RNA-N protein (RNP) complex structures. For clarity, the middle row displays RNA only, whereas the bottom row illustrates only the N protein clusters (NPC) without showing the genome. The simulations indicate that multiple nucleation sites simultaneously form along the genome. At the early stages of the assembly, we observe that RNA wraps around N proteins (see t = 1000 s in Figure 5A) and several small clusters made up of two or three N proteins are formed. As time passes, the small clusters either fuse with neighboring clusters or attract more proteins from the solution to form larger clusters.
We note that the strength of RNA-N protein interaction in the simulation depends on the effective genome charge density, which depends on the concentration of salt and counterions in solution [61]. As the effective charge density increases, both the RNA-N protein attractive interaction and the RNA self-repulsion become stronger. At this regime, larger N protein clusters form with an average of 10 proteins in each cluster. As time passes, multiple clusters join and form a large cylindrical configuration which is pretty rigid, as shown on the right side of Figure 5A.
We recall that the focus of our work is on the regimes in which electrostatics dominates all the other interactions between RNA and N proteins. If, however, the strength of RNA-N protein interaction were weak, the presence of packaging signals or other specific interactions would be necessary to produce a similar genome condensation process [62]. In that case, each packaging signal would act as a nucleation site for the N proteins to start forming clusters along RNA.
The snapshots of simulations of the formation of RNPs using branched polymers are shown in Figure 5B. In each case, 30 branch points were randomly generated. The length of each branch was randomly chosen between one and forty monomers while the total length of the branch polymer was kept the same as the linear chain model (L = 800). Our simulations show that, at the weak RNA-N protein interaction, several clusters immediately nucleate along the longest length of the polymer, which later will grow in all directions. However, as we increase the strength of RNA-N protein interaction (the effective genome charge density), the N proteins nucleate first along the small branches and then they grow towards the branch point, see Video S1.
Moreover, while the branchiness of the polymers seems to speed up the condensation process, we find that the impact of the secondary structure of RNA becomes more significant when the branch number is larger than 10. At the smaller branch numbers (≤10), the condensation of branch polymers follows more or less a similar pathway as a linear chain (see Video S2). We note that, while previous experimental and theoretical investigations on small icosahedral viruses like CCMV and BMV have shown that branchiness increases packaging efficiency and lowers the free energy of the formation of virus particles [23,55,63], the effect of branchiness on the rate and the kinetics of virus assembly has not yet been studied. We have also explored the effect of the protein concentration on the structure of RNP (see Videos S3 and S4). The results of the simulations indicate that the protein concentration does not have a significant impact on the final RNP structures if the genome effective charge density is low. Nevertheless, a higher protein concentration speeds up the condensation time as expected. For the case of the genomes with a higher effective charge density, a higher protein concentration, however, facilitates the adhesion of neighboring clusters, rigidifies RNP configuration, and increases the number of clusters and the number of N proteins per cluster, leading to a more compact RNP configuration.
Finally, we examine the impact of the strength of the attractive hydrophobic interaction between N proteins on the RNP structure. We find that the N-N attractive interaction plays a key role in the formation of N protein clusters along RNA. Even a weak interaction between N proteins can induce the formation of several small clusters. In fact, for a wide range of the strength of the N-N interaction, the RNP configurations are almost the same, and the number of N proteins in clusters remains mainly the same (see Figure S3). Quite interestingly, we find that these clusters are crucial for making RNP compact enough to ensure its packaging. In the absence of the N-N interaction, the positive charges on N proteins condense the negatively charged RNA but not as much as it is necessary for its complete packaging. The presence of N protein clusters along the genome is essential for RNP packaging if the genome is very long as in the case of SARS-CoV-2, see the structures of RNP in Figure S3 for N 3 N 3 = 0 and N 3 N 3 = 30.

Spontaneous Radius and Line Tension of M Proteins Are Important for Budding
As noted in the Introduction, despite the intense experimental and theoretical studies, the most basic questions about the mechanism and relevant time scales associated with the budding process of coronaviruses are still lingering. Using a combination of electrophoresis, fluorescence and electron microscopy, the earlier experiments on coronaviruses have revealed that M, E, and N proteins are sufficient for the budding of virus-like particles (VLPs), whereas the shape of VLPs is mainly determined by M proteins [14,16,[18][19][20]64,65]. It is important to note that, according to the previous experiments, the E proteins are required for budding; it appears that they contribute to the completion of assembly and the scission of the budding neck. In the absence of enough information about the role of E proteins, we do not explicitly include their effects here, and thus investigate the general properties of the proteins embedded in the lipid membrane resulting in virus budding. More specifically, we investigate the minimum requirement for the packaging and budding of RNP at the ERGIC membrane and explore the contribution of the mechanical properties of M proteins to the formation of viral envelope.
As explained in the Methods section, we model ERGIC as a fluid membrane and model M proteins based on the literature investigating its structure [32,[66][67][68]. To explore the impact of M-M interaction on the formation of the envelope, we first build a twodimensional phase diagram as a function of the interaction between the transmembrane domains (M2-M2) and endodomains or cytosolic domains (M3-M3) of M proteins, see Figure 3. We note that, for all simulations presented in this section, we have assumed that M proteins have already aggregated into a circular region and that ERGIC is relaxed with several pentameric defects located at various positions. It is worth mentioning that we have explored different scenarios and found that the locations and the initial numbers of defects do not affect the final budding configuration. We explain in the later section why the initial distribution of M proteins in the membrane does not have an impact on the final results. Figure 6A shows the snapshots of the M proteins budding at ERGIC in the absence of a genome. The strength of the interaction between the segments of the proteins embedded in the membrane (M2-M2) increases from top to bottom while the strength of interaction between cytosolic domains, the segments outside of the membrane (M3-M3), increases from left to right. Figure 6B shows the plots of the size of the perimeters of the partially budding vesicles (red regions in Figure 6A) vs. time. The snapshots and time analysis of the growing vesicles indicate that the stronger transmembrane-transmemebrane (M2-M2) and endodomain-endodomain (M3-M3) interactions facilitate the budding process within the regime explored in Figure 6A. We emphasize that we have set the size of endodomains smaller than the transmembrane domains, which gives rise to the M protein's effective spontaneous curvature, leading to the bending of the membrane. A stronger attractive interaction between endodomains facilitates budding and overcoming the energy barrier for the formation of vesicles.
It is worth mentioning that, since the transmembrane domains are embedded in the membrane, their interactions with each other and the membrane create a line tension, promoting the budding process through reducing the perimeter of the budding region. However, if the strength of the transmembrane-transmembrane interaction is beyond a certain limit, the M proteins can easily get trapped in a metastable state where the proteins form a hexagonal lattice and cannot bud. Thus, we were not able to locate a regime in which the line tension by itself is sufficient for virus budding, see the first column in Figure 6A. The spontaneous curvature of M proteins is essential for bending the membrane to complete the budding and viral formation. It is interesting to note that, even in the absence of the M2-M2 attractive interaction, if the strength of M3-M3 attractive interaction is high enough, the budding will take place (see Figure 6A.c).
We note that recent computational models on SARS-CoV-2 show that M proteins can induce the membrane curvature through the protein-lipid interaction [30], which can be equivalent to the M3-M3 (cytosolic domains) interaction in our model. However, based on the fact that the formation of empty envelope is not optimal for a virus, we think that the effective M3-M3 interaction of SARS-CoV-2 is not sufficient for budding and that the RNP is required to trigger the budding so that the genomic material can be packaged more efficiently. In fact, some experiments on coronaviruses suggest that RNP interact with M proteins through the electrostatic interaction between the negatively charged portion of the C-terminal domain of the N protein and the positively charged cytosolic domain of the M protein [11,[69][70][71]. In the next section, we show how N proteins can induce viral budding even if M3-M3 interaction is weak.

Two Different Models for RNA Packaging
We now explore the effect of the interaction between N and M proteins on the packaging of RNP at the ERGIC membrane. We consider two different scenarios: (1) The genome has not yet been condensed but is surrounded by N proteins and is located in the vicinity of M proteins embedded in ERGIC (Video S6), and (2) the compact RNP has been formed and transported to the ERGIC interface before interacting with the M proteins (Video S7). The snapshots of the simulations of two scenarios are shown in Figure 7. Condensation of the genome occurs at the same time as the budding process. In this case, at the beginning of the simulations, the naked branched genome is located near the M proteins embedded in the ERGIC membrane surrounded by N proteins (25 µM), where the N-terminals of N proteins attract the genome and their C-terminals interact with M proteins, triggering both RNP condensation and budding simultaneously. Scenario 2: N proteins (25 µM) have already condensed RNA before the transport of RNP to the ERGIC interface. In this scenario, the N proteins within RNP are arranged such that their C-terminals interact with the endodomain of M proteins and trigger the budding process. For clarity, only the N proteins (not RNA) are shown after the completion of the budding process in both cases. For both scenarios, the genome length is L = 800 and genome charge is Z g = −2, the interactions between proteins are McMc = 1, A careful review of the snapshots shows that both scenarios are feasible and could result in the formation of viral particles. The cluster analysis of the final structures ( Figure S5) reveals that, when the genome is not condensed but is in the vicinity of the membrane surrounded by N proteins, many small RNP clusters form right at the beginning of the assembly process. This is mainly due to the fact that the interaction with M proteins make the small clusters consisting of two to four N proteins survive over a relatively longer period of time. The packaged RNP through this mechanism contains more N proteins than if it were formed while away from the membrane and then would be transferred to the membrane to interact with the M proteins to acquire its envelope. In the latter case (Scenario 2), the number of clusters remains constant during the budding as the clusters have already been formed and fully relaxed. However, the packaging of RNP will take more time as N proteins in the clusters need to reorient and expose their C-terminal domains to interact with the M proteins. It is worth mentioning that the protein concentration does not affect the rates of condensation and the assembly process, in the case of the second scenario. However, for the first scenario, the condensation and budding occur simultaneously so that the protein concentration and thus rate of condensation will have an impact on the budding rate.
Furthermore, we find that the strength of the attractive interaction between the N proteins and RNA modifies the structure of RNP inside the envelope. The clusters along RNP are distributed more sparsely inside the viral particle for a weak N protein-RNA interaction while RNP assumes a more condensed structure if the N protein-RNA interaction is stronger, as shown in Figure S6. In the case of the second scenario, such condensed conformation might prevent RNP from being packaged at the ERGIC membrane since the N protein clusters are too compact to reorient their C-terminal and interact with M proteins. A thorough analysis of RNP structures can reveal the condensation pathway of the coronavirus genomic RNA. However, the current cryo-em images cannot distinguish the difference between the structures of RNP formed through different pathways. The comparison of our simulations with future cryo-em images and experiments with different mutants of N proteins can shed light on the mechanism through which RNP is condensed and packaged at ERGIC.
Our simulations also show that, if we set the interaction between transdomains M 2 M 2 = 1 but between endodomains M 3 M 3 = 0 (see Figure 3), the M-M interaction cannot bend the membrane and no empty vesicle can form, as shown in Figure 6Ad. However, in the presence of RNP, the interaction between M and RNP is enough for the virus to bud even though M 3 M 3 = 0. We emphasize that, if the size of M3 is not smaller than M2, under no circumstances does the virus bud as discussed in the previous section, indicating that the spontaneous curvature of M proteins is absolutely necessary for budding (see Figure S4). It appears that M-N interaction constitutes the main contribution to the RNP budding if there is no interaction between endodomains. Because of the attraction between the N proteins, the M-N interaction induces the M3-M3 attractive interaction indirectly, which then provides the necessary curvature to trigger the budding process. This can explain why for some coronaviruses, the interaction of M-N is crucial for the formation of VLPs. Further experiments is necessary to understand if there is an attractive interaction between endodomains of M proteins.

Condensation of RNA by N Proteins Is Essential for Its Packaging
It is worth mentioning that the recent experiments on the SARS-CoV-2 M proteins show that the endodomains of the M proteins are highly positively charged, which might have an impact on their interactions with N and S proteins and RNA [68]. To examine the effect of the M positive charges on the packaging of the genome, we also consider the case in which the endodomain (M3) has positive charges (+6, residues 101-222) (see Figure S7). Following [30], we assume that the electrostatic repulsion between the M protein endodomains is screened out by the POPI lipids through the protein-lipid interactions, and thus we set the M3-M3 electrostatic repulsion equal to zero. Intuitively, one would expect that the positive charges on M proteins might be sufficient for the RNA packaging because of the attractive electrostatic interactions between RNA and the M endodomain. However, we find that the sole attractive electrostatic interaction between the M proteins and RNA is not enough for its packaging in the absence of N proteins, i.e., the condensation of genome by the N proteins is essential for the virus formation. This is in contrast to the case of many icosahedral viruses where the sole attractive interaction between the genome and the capsid proteins is sufficient for the encapsidation of the genome with viral capsid proteins [55,72]. Because of the length of the genome of coronaviruses and the budding process, the presence of N proteins are essential for the genome condensation and virus formation. We have examined that, in our system, the interaction between the positive charges on M and negative charges on RNA will be enough for budding if the length of the genome is much shorter.
We note that all the budding simulations are performed using branched polymers with the number of branch points N = 30. Although the secondary structure of RNA does not have an impact on the final configurations of condensed RNPs, the budding of a long linear chain is more challenging because the radius of gyration of a free linear chain is larger than a free branched polymer; thus, it takes much more time for the linear one to condense and be completely packaged. For example for a linear chain with L ∼ 800, we have been able to package only half of the RNP during the time of our simulations (see Figure S8).

Impact of the Initial Distribution of M Proteins on Virus Budding
We recall that, in the budding simulations presented above, we have always assumed that the M proteins initially form a spherical cap with a circular edge. However, the initial distribution of M proteins at the ERGIC membrane could be different. To this end, we have also considered the condition in which the initial distribution of the M proteins is random and have repeated the simulations presented in Figure 6A.d with the same parameters. We first fully relax the ERGIC membrane and then allow the randomly distributed M proteins to diffuse and form different structures rather than imposing a circular cap from the beginning Figure 8A. As we increase the strength of the interactions between the endodomain part of M proteins, the irregular branched-like patterns start to form several circular domains pointed with arrows in Figure 8A, which will later bud into empty vesicles. This disorder-order transition indicates that the spontaneous curvature of M proteins makes the branched configuration energetically unfavorable; as such, they diffuse to form separate circular domains.
Quite interestingly, we find that, even in the absence of the interaction between endodomains of M proteins, RNP can trigger the formation of circular caps (Video S8). Figure 8B illustrates that, when we place RNP near the disordered regions where the M proteins embedded in ERGIC have formed a branched-like pattern, the M-M interaction starts to bend the membrane due to the endodomain-RNP interaction. As explained in the previous section, even in the absence of M3-M3 attractive interaction, the interaction of RNP with the endodomain of M leads to an effective spontaneous curvature between M proteins. The curved region attracts more M proteins due to the energetically favorable endodomain-RNP interaction. After a while, the branched-like regions start to assume a circular configuration, which later rapidly buds through ERGIC encapsulating RNP to form the viral particle. We note that with a random distribution of M proteins, the simultaneous packaging of genome and budding are not plausible through scenario 1 because then many N proteins basically follow the branched-like pattern of the M proteins and will be distributed everywhere on ERGIC. In that case, RNA spreads out to reach those N proteins and thus it cannot condense to be packaged. Thus, a better understanding of the M-M interaction while embedded in the membrane can shed light on the pathway of genome condensation and its budding through the ERGIC membrane. As a result of diffusion, the M proteins have formed a branched-like pattern. In the absence of the M3-M3 interaction, the branched-like pattern does not change (top row). However, with increasing the endodomain-endodomain attractive interaction M 3 M 3 , the connected patches of M proteins begin to aggregate into circular domains (bottom row, indicated with red arrows), which will eventually bud into empty vesicles. Note that the M proteins cover half of the surface of ERGIC, i.e., the M protein fraction is f = 50% during the simulations; (B) snapshots of the simulations of the RNP budding from ERGIC. Initially, the randomly distributed M proteins form a branched-like pattern on the surface of ERGIC in the absence of the endodomain interaction. However, the interaction with RNP results in the aggregation of the proteins in the vicinity of RNP and the formation of the envelope enclosing RNP. The branched-like distribution is obtained by relaxing ERGIC and the M proteins for 16,000 s. The RNP structure is obtained by relaxing N proteins (25 µM) and RNA for 8000 s. RNA is modeled as a branched polymers with the charge density Z g = −2 and the number of branch points N = 30. The comparison of (A,B) reveals that RNP can mediate the interaction between the endodomains giving rise to a preferred curvature between M proteins.

Discussion
Even though the replication and assembly are conserved across many members of the family of coronaviruses, the most rudimentary details of these processes remain elusive. This could be due to the presence of a number of structural proteins involved in the formation of viral particles and our limited knowledge regarding their specific roles. Additionally, viral assembly and budding occur in the milieu of the intracellular host cell organelle membranes which further complicates isolating mechanisms specific only to the virus.
In this paper, we have investigated the underlying physical principles for the formation of SARS-CoV-2 focusing on the role of the structural proteins in the assembly of the virus. While many unknown parameters are involved in the packaging of the genome, our primary goal has been to obtain an overall understanding of under what conditions the large genomic RNA of coronaviruses can be condensed to bud through the host cell intercellular membrane. Specifically, we have sought to determine what types of effective protein-protein, genome-protein and membrane-protein interactions are necessary for the budding and formation of viral particles.
Despite all the experiments, the quantitative values of the interactions critical to modeling remain experimentally unknown. To this end, we have made several phase diagrams exploring how the relative strength of interactions between the different virus constituent parts influence the virus assembly and budding. Our simulations show that the essential requirement for the virus budding is the spontaneous curvature of the membrane proteins. Considering the length of SARS-CoV-2 RNA, under no circumstances were we able to observe virus budding in the absence of the M protein spontaneous curvature. We note that the attractive interaction between the cytosolic domain of M proteins or the presence of external scaffolding such as N proteins or RNP interacting with the cytosolic domain of M proteins can give rise to the spontaneous curvature of M proteins, see Figure 8.
The attractive interaction between the transmembrane domain of M proteins could also lower the budding energy barrier by introducing the line tension. A very strong interaction between the transmembrane domains might create a high line tension, which could lead to virus budding. However, if the M-M interaction is very strong, we observe that the M proteins form immobile large islands, trapped in metastable configurations.
We have also investigated the mechanism of packaging of the long genomic RNA during the virus budding. Although several experiments show that the N proteins and RNA phase separate [47,48,[73][74][75][76], the exact location and when this process occurs remain unclear. Our simulations offer two distinct plausible models: Either the RNP condensation and budding occur simultaneously at the ERGIC membrane or RNP is formed remotely and then transported to the ERGIC for budding. In both models, the RNP forms a spherical layer under M proteins with a hollow center, consistent with previous observations [4]. In the first scenario, however, the order of arrival of N proteins and RNA to the ERGIC regions seems vital since the N proteins may initiate the budding process in the absence of genome and form empty particles (see Figure S9). If the entire RNP is to be encapsulated, it will largely depend on the protein concentration, the rate of RNP condensation, and the initial M protein distribution.
In the alternative model that we offer in this paper, the preformed RNP is transported to the ERGIC region and triggers the budding process by the M-N interaction. There will be no extra N proteins in the immediate vicinity of ERGIC in this model; thus, the environmental factors have less impact on the budding process. However, if the RNP configurations are specifically compact, in the case of the strong N proteins-RNA interactions, the packaging of the RNP will be difficult as the N proteins in the clusters need to reorient and expose their C-terminal domains to interact with M proteins. Quite interestingly, a recent paper on SARS-CoV-2 shows that N proteins aggregate around ER membranes that connect to viral RNA replication foci, whereas M, E, and S proteins accumulate at the ERGIC membrane where the budding and maturation happen [77]. Since N proteins are located near RNA, it is possible that RNP is pre-formed before meeting M, E, and S proteins in ERGIC. Furthermore, the study shows that N proteins and genome are observed at 5 h post infection (hpi), whereas M, E, and S proteins are not observed until 7.5 hpi [77]. The fact that N proteins are expressed earlier than the other structural proteins indicates the possibility that RNA is condensed before being transported to the location of ERGIC. Considering these temporospatial aspects, we think that scenario 2 is completely plausible. However, more experiments are still needed to decipher where and when the genome is condensed. The comparison of the our simulations with the future in vivo experiments focused on the kinetic pathways of RNP condensation can shed light on virus formation at the ERGIC membrane.
We note that, while some experiments regarding the assembly of both SARS-CoV and SARS-CoV-2 demonstrate that the minimal requirements for the virus budding are M and N proteins [14,20], other experiments show that M and E proteins are essential for the VLP formation [18,65]. Interestingly, another set of experiments on SARS-CoV suggests that E proteins might not be required for the VLP assembly but rather enhance the production level [19,64]. Even with the unprecedented deluge of research the pandemic has wrought, the role the structural proteins (E, M and N) play in the SARS-CoV-2 life cycle continue to remain speculative and pose an area of intense debate.
Our work can explain to some extent why there are inconsistencies in the experiments regarding the budding mechanism of various coronaviruses. Different requirements for budding could be due to the various strength of interactions between the structural proteins from one type of coronavirus to another, or the composition of lipids in different host cells. In one case, the M-M interaction might be weak and the M-N interaction must be essential for the budding to occur, while in another case, the presence of the E proteins confers a higher spontaneous curvature to the membrane [78,79]; and as such, when E proteins are incorporated, a significant enhancement of the viral particle production is observed [19].
Our aim is to shed light on the elusive assembly process of coronaviruses and particularly that of SARS-CoV-2, which could be the basis and guide for the design of future experiments too. A systematic comparison of our simulations and the experiments with various mutations on M and N proteins could significantly advance our knowledge on this virus and the factors contributing to its formation. Understanding the complex life cycle of this virus inherently involves how the virus reproduces itself in such an extraordinarily successful way. Since there is always the danger of reduced effectiveness or even ineffectiveness of vaccines as a result of novel mutations, a better understanding of the structural properties of SARS-CoV-2 could lead to alternative treatments for COVID-19 infections beyond the limited means we now have.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/v14102089/s1. Figure S1: Dimerization of the N proteins for various domain sizes and the strengths of interaction between the C-terminals. Figure S2: Cluster analysis of NPCs along RNP for weak and strong RNA-N protein interactions. Figure S3: RNP condensation for various strengths of interactions between N proteins. Figure S4: Illustration of the role of the spontaneous curvature in the budding process. Figure S5: Cluster analysis of the N proteins along RNP during the budding. Figure S6: RNP budding with various N protein-RNA interactions for scenarios 1 and 2. Figure S7: Snapshots of the simulations of the RNP budding when the endodomains of M proteins are positively charged. Figure S8: Snapshots of simulations of the RNP budding using a linear RNA. Figure S9: Formation of the virus like particles through the budding of N proteins in the absence of RNA. Video S1: RNA condensation by the N proteins. Video S2: The impact of the secondary structure of RNA on its condensation. Video S3: N proteins condense a linear genome for various N protein concentrations and effective genome charge densities. Video S4: N proteins condense RNA at various N protein concentrations and effective genome charge densities. Video S5: The role of the M protein interactions in budding at the ERGIC membrane. Video S6: RNP budding through the ERGIC membrane for scenario 1. Video S7: RNP budding through the ERGIC membrane for scenario 2. Video S8: RNP budding through the ERGIC membrane for the randomly distributed M proteins.
Author Contributions: Conceptualization, S.L. and R.Z.; investigation, S.L. and R.Z.; writing, S.L. and R.Z. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: All simulation data are present in the manuscript. HOOMD-blue plug-in is available at https://github.com/virusassembly/hoomd_bondflip_plugin.

Acknowledgments:
We thank Paul van der Schoot for many helpful suggestions and Thomas Kuhlman for critically reading several sections of the paper. We thank Yuanzhong Zhang, Ajay Gopinathan and Umar Mohideen for many helpful discussions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: