A Computational Approach to Investigate TDP-43 RNA-Recognition Motif 2 C-Terminal Fragments Aggregation in Amyotrophic Lateral Sclerosis

Many of the molecular mechanisms underlying the pathological aggregation of proteins observed in neurodegenerative diseases are still not fully understood. Among the aggregate-associated diseases, Amyotrophic Lateral Sclerosis (ALS) is of relevant importance. In fact, although understanding the processes that cause the disease is still an open challenge, its relationship with protein aggregation is widely known. In particular, human TDP-43, an RNA/DNA binding protein, is a major component of the pathological cytoplasmic inclusions observed in ALS patients. Indeed, the deposition of the phosphorylated full-length TDP-43 in spinal cord cells has been widely studied. Moreover, it has also been shown that the brain cortex presents an accumulation of phosphorylated C-terminal fragments (CTFs). Even if it is debated whether the aggregation of CTFs represents a primary cause of ALS, it is a hallmark of TDP-43 related neurodegeneration in the brain. Here, we investigate the CTFs aggregation process, providing a computational model of interaction based on the evaluation of shape complementarity at the molecular interfaces. To this end, extensive Molecular Dynamics (MD) simulations were conducted for different types of protein fragments, with the aim of exploring the equilibrium conformations. Adopting a newly developed approach based on Zernike polynomials, able to find complementary regions in the molecular surface, we sampled a large set of solvent-exposed portions of CTFs structures as obtained from MD simulations. Our analysis proposes and assesses a set of possible association mechanisms between the CTFs, which could drive the aggregation process of the CTFs. To further evaluate the structural details of such associations, we perform molecular docking and additional MD simulations to propose possible complexes and assess their stability, focusing on complexes whose interacting regions are both characterized by a high shape complementarity and involve β3 and β5 strands at their interfaces.


I. INTRODUCTION
The investigation of the molecular mechanisms that lead to the accumulations of aggregated proteins is crucial for understanding the pathophysiology of many neurodegenerative diseases [1].The accumulation of aggregates containing the DNA-and RNA-binding protein TDP-43 in the central nervous system is a common feature in diseases, such as Amyotrophic Lateral Sclerosis (ALS), Frontotemporal Dementia (FTD), and Alzheimer's Disease (AD), t [2,3].However, the mechanisms of aggregation are not yet fully understood and various aggregation models have been proposed [4].In this scenario, the fundamental role of the C-terminal fragments in the formation of TDP-43 aggregates has already been widely confirmed [5][6][7][8][9][10].In particular, it is interesting to study this aggregation model for the possible implications on ALS disease.TDP-43 is composed of an N-terminal domain (NTD), two RNA recognition motifs (RRMs), and a long Cterminal (CTD) glycine-rich region [11].The CTF is formed by the last 195-206 residues [8] of TDP-43, corresponding to a portion of the full protein including only the CTD together with a truncated RRM2 fragment.In the CTFs two kinds of RRM2 fragments can usually be found [12]: one corresponding to the cleavage at the residue 208 and the other corresponding to the cleavage at residue 219.
In normal conditions, the NTD-driven head-to-tail oligomerization spatially separates the high aggregationprone CTDs of consecutive TDP-43 monomers, antagonizing aggregation [7].However, if a preoteolytic cleavage releases the CTF, these free portions of the protein are able to aggregate [8]: according to current knowledge, the formation of inclusions seems indeed to start from this disruption of the physiological oligomerization of TDP-43.Furthermore, the removal of the N-terminus increases the cytoplasmic localization, since it deprives the CTF of the Nuclear Localization Signal (NLS).The aggregation involves the CTD, which is intrinsically disordered and aggregation-prone, and harbors most of the mutations related to ALS [10].That said, it has been discussed how the CTD is necessary for cytoplasmic aggregation and toxicity but not sufficient, since it requires an intact RRM, i.e. the RRM2 fragment in the CTFs is fundamental for the aggregation [13].In physiological conditions RRM2 is a really stable domain, thanks to a cluster of twelve connected hydrophobic residues in its core [9], but the cleavage deprives it of its stabilizing interaction with RRM1.In a study of the RRM2 unfolding [14] that follows the separation from RRM1, it has been found that the mutually stabilizing interaction between RRM1 and RRM2 reduces the population of an intermediate state of the latter linked with pathological misfolding.This intermediate state may enhance the access to the Nuclear Export Signal (NES) within its sequence that increases the transport to the cytoplasm, and serve as a molecular hazard linking physiological folding with pathological misfolding and aggregation.
A second effect of the cleavage of RRM2 is the exposition of its aggregation-prone β-strands [8,15].These strands are normally buried in the native state, but have been found to form fibrils in vitro [9].These processes confirm a fundamental role of the fragment of RRM2 in the CTFs for the protein's aggregation [15].In particular, the β-strands could be at the core of the aggregation, because they are able to form steric zippers which then, following a typical atomic model for amyloid fibril structure formation, give rise to amyloid structures.[8].Amyloid fibrils consist of packed β-sheets that run parallel to the fibril axes.Each β-sheet adheres to its neighboring sheet through the side chains that project roughly perpendicular to the fibril axis, toward the neighboring sheet.This interdigitation between the side chains of mating sheets is the so-called steric zipper.
In support of the hypothesis that this kind of structure is at the base of the CTFs aggregation, as shown in Figure 1, it has been found that some regions of RRM2 can form different classes of steric zipper structures [10,12] at the core of the formation of these amyloid fibrils [16].
The aggregation of TDP-43 is strongly influenced by the interaction with DNA and RNA: RNA aptamers are able to interfere with the aggregation kinetic, as a function of their nucleotides composition, binding affinity, and length [17].A correctly designed RNA aptamer should then be able to hinder the aggregation.This molecule should have an effect after binding to the RRM2 fragment: assuming the validity of the cross-β spine model for the CTFs aggregation, by binding an aptamer to the RRM2 site that forms the "spine" of the fibril, we could prevent another CTF to bind to that site.This is indeed an almost mandatory choice since the RRM2 fragment is the only part of the CTFs that we can control (with both MD simulations and the Zernike [18] method) because of the CTD disordered structure.The designing of this interfering molecule should obviously consider the binding compatibility to the misfolded conformation of the RRM2 fragments, which is not available yet in literature.Because of this, we use MD simulations to study the conformations of the fragments after the cut.The aim of this work is to suggest some possible binding regions that in the future could be taken as starting point for designing specific interfering molecules.To obtain a complete study of this region of TDP-43, we employ MD simulations to study the evolution of the whole RRM2 as well.

A. Molecular dynamics simulations and equilibrium configurations
A MD simulation of 10 µs is performed for the whole RRM2 and for each fragment (see Methods section for details).The whole RRM2 corresponds to the residues 192-269 of TDP-43.The first fragment, Fragment A, corresponds to the residues 209-269.The second one, Fragment B, corresponds to the residues 220-269.To obtain their equilibrium configurations, we firstly apply a Principal Component Analysis (PCA) on the trajectory resulting from each MD simulation.In this way, we get an essential representation of their motions.Then, we implement a cluster analysis on the projection of each trajectory on its first two principal components.Our aim is to find the most representative configurations for each one of the possible conformations that the fragment can take at equilibrium: we are assuming that each cluster's center (or centroid) is a good representative of that cluster.The structures corresponding to these centroids are the equilibrium configurations, we will use to search for possible binding regions with the Zernike polynomial formalism [18][19][20].We found four equilibrium configurations for the whole RRM2 (W1, W2, W3, W4), five for Fragment A (A1, A2, A3, A4, A5), and two for Fragment B (B1, B2).Since the plot of the RMSD evolution shows a clear peak for the trajectory of Fragment B, which should corre- spond to an unfolding of the fragment, we implement the same analysis specifically for that time interval.After the application of the just-described procedure, we found as the most representative conformations for the unfolding of Fragment B three configurations: B3, B4, and B5. Figure 2 shows the just described steps of the analysis and their results for each of the studied cases.
We then compute for each one of the conformations found for Fragment A and B the corresponding 3D molecular surface (see Methods for details).The shape complementarity between these molecular surfaces can now be studied with the Zernike polynomial expansion.

B. 2D Zernike polynomial expansion for binding regions prediction
Recently, a new method based on the Zernike 2D polynomial expansion, has been developed with the aim of evaluating whether and where two proteins can interact with each other to form a complex, based on their shape complementarity [18].Although there is noise in the shape of the interacting molecular surface, which probably has not been evolved to optimize intermolecular binding, we have previously shown the goodness of the methods of recognizing interacting surface portions from non-interacting patches (decoys) [18].Not all complexes are characterized by the same degree of shape complementarity, but in general, this property is one of the most important for molecular docking.By expanding the wellexposed molecular surface patches in terms of 2D Zernike polynomials, we are able to rapidly and quantitatively measure the geometrical complementarity between interacting proteins by comparing their molecular surfaces.In our case, we apply it to all the possible pairs between the 3D surfaces of the two CTFs RRM2 fragments to find the binding regions on each surface.More specifically, for each point i belonging to the molecular surface we define a molecular surface region (patch), which can be described through 2D Zernike formalism with a set of invariant descriptors.Two similar patches, if defined by the same reference point, have a small distance between the Zernike vectors, since perfectly complementary patches are equal under roto-translation.For each point, i of one of the two surfaces, the distance between the Zernike descriptors of its patch and all the patches built on the points of the other surface is computed.The minimum of these values is selected, and after all the points had been studied these minimum values are mapped in [0, 1] and inverted.At the end of the process, the points whose corresponding patches have a high complementarity with the other surface are associated with a value of the binding propensity (BP) near one.As a next step, each point is associated with the mean BP value of the points in its neighborhood: the interacting regions should be made up mostly of elements with high complementarity and therefore a high average value of BP.Finally, we associate to each residue the mean value of the BP of the corresponding points in the surface.We apply this procedure to all the possible pairs of conformations.To identify the most promising regions of interaction between the two fragments (i.e., the regions that we expect to be at the core of the CTFs aggregation), we select, for each conformation of each fragment, the pairing, for each fragment, that results in the highest mean BP of the residues corresponding to β-strands in the conformation sequence.In this way, we are selecting the pairings that are more prone to bind through β strands.Figure 3 shows the results of this selection employed for each of the ten conformations for Fragment A and B.

C. Proposal of binding and β-strands regions
The pairings found in the previous section are relevant to describe CTFs aggregation process.Artificial molecules, such as RNA aptamers or peptides could be in the future proposed as candidates for the interruption of the molecular interaction between the CTFs of the TDP43 protein: to propose some binding regions suited for testing with aptamers we followed a second approach.
For each conformation, we summed the BPs obtained for all of its possible pairings with the other conformation.In this way, we obtain a clear representation of the residues in each conformation that are in general more involved in the interaction with other surfaces.At this point, we have to take into account the fact that the Zernike method looks only at the shape complementarity between surfaces, which is a necessary but not sufficient condition for interaction.To select among the residues found with this summation, the ones corresponding to the binding regions that will be able to bind an aptamer, we can apply some additional chemical-based constraint.In particular, we consider one of the most basic requirements for interaction, i.e. the Coulombic interaction: aptamers are characterized by a negative charge, consequently we can select among the Zernike-selected residues the ones associated with a positive charge.
To select these regions we use UCSF Chimera [21] to visualize the Coulombic surface coloring of each conformation: negative regions are red-colored, positive ones are blue-colored.Then we look at the Coulomb coloring of the surface regions corresponding to our Zernikeselected residues and select only the residues corresponding to non-negative regions.Figure 4 shows an example of how a binding region is selected.
That said, we are interested in the study of the aggregation of these fragments as hypothesized by a model according to which their interaction is mediated by the β-strands.To better understand the importance of these β-strands we select, among the Zernike-found residues, the ones corresponding to β-strand fragments as well, independently of their charge.These processes result in the selection of the binding and β-strand regions listed in Table I.The residues among the binding regions colored in red are the ones corresponding to β-strand fragments.We expect these regions to be the most interesting ones since they are regions that according to our hypothesized model are at the core of the aggregation (because they include β-strands) and we will be able to test them by binding to them an aptamer (since they are positive).

III. DISCUSSION
The principal objective of the here presented work was to find some regions on the TDP-43 CTFs (in particular on their RRM2 fragment) to propose as the foundation of their aggregation.Since the structure of these fragments had not been studied yet and their conformations were not yet available, we began our project by studying the time evolution of the two possible RRM2 fragments constituting the CTFs, i.e.Fragment A and B, with MD simulations; we studied the whole RRM2 as well.From the analysis of the trajectories of these two fragments, we found five representative configurations each: these are the possible configurations that should be observable in a cell, as well as the ones that the fragments assume while interacting with each other.The definition of this equilibrium configuration is the first result presented by this paper.As a next step, we searched on the surfaces of these configurations all the possible regions of interaction, by verifying their shape complementarity.This complementarity is a necessary but not sufficient condition for interaction: even though having narrowed the possible region of interaction to a limited set of residues (listed in Table  I) is already an interesting result, this study could be in the future further developed with additional constraints for the selection of regions.
A second continuation of this work could be the verification in vitro of our results: following the insertion of expressively designed aptamers (starting from our suggested binding regions' residues) in CTFs-expressing cells, if our results are correct, the aggregates number and dimension should decrease.With this aim, we identified among the Zernike selected binding regions the ones that would be able to bind an aptamer.

A. Dataset
The three starting structures for the MD simulations of the whole RRM2, Fragment A, and Fragment B were extracted from the Protein Data Bank [22]: we begun from the Nuclear Magnetic Resonance (NMR) structure of the TDP-43 tandem RRMs in complex with UG-rich RNA (PDB id: 4BS2) [23], and then deleted from the file the RNA e the RRM1 domain.The resulting structure, to which we refers as 'whole RRM2' contains residues from 192 to 269 of TDP-43.Next, to obtain the molecular structure of Fragment A removing residues up to the 208th, whereas to obtain Fragment B we removed the residues up to the 219th.

B. Molecular Dynamics simulations
For each of the starting structures, we carried out one molecular dynamics simulation of 10 µs.All steps of the simulation were performed using Gromacs 2019.3 [24].The topologies of the system were built using the CHARMM-27 force field [25], the standard force field for proteins.Each fragment was placed in a rhombic dodecahedron simulative box, with periodic boundary conditions, filled with TIP3P water molecules [26].The system of the whole RRM2 included 5269 water molecules, Fragment A 4607 and Fragment B 4658.The rhombic dodecahedron box is built so that each atom of each fragment is at least at a distance of 11 Å from the box borders.Its volume is 71% of the one of a cubic box of the same periodic distance: fewer water molecules have to be added to solvate the protein.For a protein to have the correct behavior there need to be at least two or three layers of water around it: with 11 Å there is space for approximately five layers.The final system of the whole RRM2, consisting of 17038 atoms, was first minimized with 371 steps of steepest descent.In the same way, the system of Fragment A, consisting of 14777 atoms, was minimized with 102 steps, whereas the system of Fragment B, consisting of 14759 atoms, was minimized with 346 steps.Each step had a size of 0.01, while the force limit value was set to max(|F n |) < 10 3 kJ/mol/nm.The thermalization and pressurization of the systems in NVT and NPT environments were run each for 0.1 ns at 2 f s time-step.The temperature was kept constant at 300 K with a Modified Berendsen thermostat and the final pressure was fixed at 1 bar with the Parrinello-Rahman algorithm [27] (with a time constant of coupling between the system and the barostat of τ P = 2 ps), which guarantees a water density close to the experimental value of the SPC/E model of water of 1008 kg/m 3 .Indeed for the whole RRM2, we obtained an average density value of 1012±4 kg/m 3 , for Fragment A a value of 1006±5 kg/m 3 , and a value of 1002 ± 4 kg/m 3 for Fragment B. LINCS algorithm [28] was used to constraint h-bonds.Finally, the systems were simulated with a 2 f s time-step for 10 µs in periodic boundary conditions, using a cut-off of 12 Å for the evaluation of short-range non-bonded interactions and the Particle Mesh Ewald method [29] for the long-range electrostatic interactions.For all these steps the Leap-Frog integrator and the Verlet cut-off scheme were used.

C. Computation of molecular surfaces
The molecular surfaces were obtained starting from the PDB files found after the clustering of the PCA of the trajectories resulting from the MD simulations.To compute for each one of the 14 structures the solvent-accessible surface, we used DMS [30], with a density of 5 points per Å2 and a water probe radius of 1.4 Å.For each surface point, we calculated the unit normal vector with the flag −n.

D. Evaluation of Complementarity
The first step of this algorithm is to select from the surface a patch Σ, defined as the set of surface points that fall within a sphere of radius R zernike = 6 Å centered on one point of the surface.The points contained in this sphere are divided, with a clustering from a random point that includes only the points closer than a distance D p , in points belonging to the surface and points not directly connected to it (for example coming from a protuberance included in the sphere).Only the former will constitute the patch.Once the patch has been selected, the mean vector of the normal vectors of the patch points is computed and oriented along the z-axis.Thus, given a point C on the z-axis, we define θ as the largest angle between the z-axis and a secant connecting C to any point of the patch Σ. C is then set so that θ = 45 • and each surface point is labeled with its distance r from C. As a next step, a square grid that associates each pixel with the mean r value calculated on the points inside it is built.Such a 2D function can now be expanded on the basis of the Zernike polynomials.Indeed, each function of two variables f (r, ψ) defined in polar coordinates inside the region of the unitary circle can be decomposed in the Zernike basis as with and c n m are the expansion coefficients, while the complex functions Z n m (r, ψ) are the Zernike polynomials.The radial part R n m is given by Since for each couple of polynomials, it is true that the complete sets of polynomials form a basis, and knowing the set of complex coefficients c n m allows for a univocal reconstruction of the original patch.Once a patch is represented in terms of its Zernike descriptors, the similarity between that patch and another one can be simply measured as the Euclidean distance between the invariant vectors.The norm of each coefficient z n m = |c n m | constitutes one of the Zernike invariant descriptors.Since z n m does not depend on the phase (i.e. it is invariant for rotations around the origin of the unitary circle), two patches can be assessed by comparing the Zernike invariants of their associated 2D projections, without considering their orientation.On the other hand, the relative orientation must be taken into account: if we search for similar regions we must compare patches that have the same orientation once projected in the 2D plane, i.e. the solventexposed part of the surface must be oriented in the same direction for both patches (for example as the positive zaxis).If instead, we want to assess the complementarity between them, we must orient the patches contrariwise, i.e. one patch with the solvent-exposed part toward the positive z-axis ('up') and the other toward the negative z-axis ('down').
What is in practice done to understand if two surfaces have some complementary patches as described in the following: 1.For both surfaces we compute the Zernike descriptors of the patches centered in all the points of the two surfaces up to the selected maximum expansion order n.The two surfaces have to be oriented in opposite verse along the z-axis: we are going to study the similarity between the first one and the inverse of the second one.

2.
For each point i of the surface 1, we compute the distance between the Zernike descriptors of its patch and all the patches built on the points of the surface 2. The minimum of these values is selected, and after all the points have been studied these minimum values are mapped in [0, 1] and inverted.At the end of the process, the points whose corresponding patches have a high complementarity with the other surface are associated with a value near one.
3. After all surface points are associated with these binding propensities, we perform a smoothing process.
In this process, each point is associated with a novel binding propensity (BP) computed as the mean value of the points in its neighborhood, defined as all the points having a spatial distance from it smaller than 6 Å.

FIG. 1 .
FIG. 1. Hypothesised model of the TDP-43 CTFs aggregation.A) TDP-43 in physiological conditions forms dimers.B) After the cleavage the CTF is split from the whole protein.C) The RRM2 fragment resulting from the cleavage exposes its β-strands.D) The β-strands from different CTFs allow the formation of aggregates to happen.

FIG. 2 .
FIG. 2. Analysis of the Molecular Dynamics simulations.A) From left to right, (i) Root Mean Square Deviation (RMSD) as a function of time for the whole RRM2 structure with respect to its equilibrated starting state.The Radius of gyration (Rg) during the simulation is shown in the inset.(ii) Two-dimensional projection of the sampled conformations in the subspace spanned by the first two Principal Components (PCs) of the covariances of the atomic positions during the simulation.Each point corresponds to a conformation after a number of steps indicated by the color bar.(iii) Clustering of the scatter plot of the two-dimensional projection of the sampled conformations.The centroids of each cluster (labeled by the numbered white circle) are depicted as well.(iv)Cartoon representation of the configurations corresponding to the found centroids.B) Same as in A) but for Fragment A. C) Same as in A) but for Fragment B. D) Same as in C) but considering only the unfolding configurations of Fragment B, marked but the blue vertical lines in RMSD plot.

FIG. 3 .
FIG.3.Binding propensity profiles.Binding propensities scores for the conformations where the mean BP of the βstrand residues in the starting conformation has the highest value.Each row corresponds to one of the ten conformations: for each row i, the plot on the left corresponds to the BP of the residues of conformation i respect to the conformation of Fragment B that results in the highest mean BP for the β-strands residues.The plot on the right instead shows the same results, but with the best pairing with a conformation of Fragment B. The row corresponding to B2 is empty because this conformation has no β-strand residues on its surface.

FIG. 4 .
FIG. 4. Example of binding region selection.A) Coulomb surface colouring of the binding region of the first binding region for conformation A1.B) Selection of the binding region sequence.

TABLE I .
Binding and β-strands regions found for each conformation.