Application of Hierarchical Clustering to Analyze Solvent-Accessible Surface Area Patterns in Amycolatopsis lipases

Simple Summary
Solvent-accessible surface area (SASA), a one-dimensional structural property of a protein, measures the exposure of an amino acid residue to the solvent. It is an important structural property because the active sites of proteins are mostly located on the protein surface. The aim of this paper is to provide clear information on different Amycolatopsis eburnea lipases based on their SASA patterns. This information could help in recognizing their structural stability and conformation, as well as in clustering them precisely to reveal lipase evolution.

Abstract
The wealth of biological databases provides a valuable asset for understanding evolution at the molecular level. This research presents a machine learning approach, an unsupervised agglomerative hierarchical clustering analysis of invariant solvent accessible surface areas and conserved structural features of Amycolatopsis eburnea lipases, to explore enzyme stability and evolution. Amycolatopsis eburnea lipase sequences were retrieved from a biological database. Six structurally conserved regions and their residues were identified. Clustering of the total Solvent Accessible Surface Area (SASA) and the structurally conserved SASA with the unsupervised agglomerative hierarchical algorithm grouped the lipases into three distinct groups (99/96%). The minimum SASA of nucleus residues was related to Lipase-4. The overall SASA of the side chains was higher than that of the backbone in all enzymes. The SASA pattern of the conserved regions clearly showed the evolutionarily conserved areas that stabilize Amycolatopsis eburnea lipase structures. This research can bring new insight into protein design based on structurally conserved SASA in lipases with the help of a machine learning approach.


Introduction
Hydrophobic forces in proteins play a vital role in stability, folding and protein-protein interaction [1,2]. The residues that make up hydrophobic areas, their interactions and their packing are useful for studying protein structure and protein-substrate binding [3,4]. The residues involved in the core stability of proteins are hydrophobic. Therefore, determining the Solvent Accessible Surface Area (SASA) [5,6] and the hydrophobic areas of total and conserved residues of proteins and enzymes, and clustering them, could provide unique features for comparing proteins and enzymes [7]. Furthermore, these residues, their interactions and their classification could be used to extrapolate the protein contact map by emphasizing the role of each specific residue in protein stability and conformation [8]. This paper provides insight into SASA patterns in Amycolatopsis eburnea lipases and into the clustering of structurally conserved SASA with the unsupervised agglomerative hierarchical method (a machine learning approach), toward the identification of hot-spot structures for protein stability and conformation. The results will help in the design and engineering of new enzymes.
Glycerol ester hydrolase, or triacylglycerol acylhydrolase (E.C. 3.1.1.3), is a fat-splitting enzyme that is also called lipase [9]. As a catalyst, it hydrolyzes triglycerides; the products of the enzymatic reaction are glycerol and fatty acids. Applications of lipase include the food, dairy, flavor, pharmaceutical, biofuel, leather, cosmetics, detergent and many chemical industries [10][11][12]. It is the third most significant enzyme in industry after proteases and amylases [13]. This enzyme can hydrolyze triglycerides in both aqueous and non-aqueous media [14,15]. It should be mentioned that lipase substrates are insoluble in water. Lipase loses its functionality in different organic solvents [16]. Structurally, lipases contain α/β hydrolase folds [17]. A triad of Ser, Asp (Glu) and His residues in the active site is considered their characteristic structural feature [18,19].
It is clearly established that around seventy medicinal and agricultural microbial products are from the Actinobacteria phylum. The best known are the different antibiotics produced by the Amycolatopsis genus [20,21]. This genus is remarkable for producing antibiotics such as balhimycin and vancomycin, as well as immunosuppressants, anti-cancer agents and many other secondary metabolites [22][23][24][25]. Its significant position in the medicine and agriculture market is due to the diversity of vital compounds and the amount of production [21]. This position, together with the great structural variety of its compounds and ultimately of its genomes, offers researchers a great prospect of discovering valuable insights for future applications. Therefore, Amycolatopsis species need to be studied to determine the structure and diversity of important enzymes such as lipases [26][27][28][29][30]. It was recently reported that Amycolatopsis eburnea, one of the species in the Amycolatopsis genus, has a symbiotic relationship with mycorrhizal fungi; however, the details of this mechanism need to be investigated.
Computational characterization of enzymes with the help of machine learning algorithms offers a great opportunity to speed up systematic classification [31,32]. It can support routine scientific proposals to engineer better enzymes with superior activity. Additionally, computational analysis offers a clearer way to understand the mechanism of each reaction from a structural point of view. Lipases, with their very wide applications, can benefit greatly from these computational approaches [33]. Moreover, designing the most efficient laboratory experiments requires clear computational and structural information [31,33].
Lipases of microbial origin can be isolated from different cellular compartments, as extracellular, peripheral or intracellular enzymes [34][35][36]. It is accepted that the structural features of enzymes are key indicators of protein evolution despite sequence differences. The folding and stability of proteins in general, and of enzymes in particular, are highly dependent on their structure and environment. The structural plasticity of enzymes in different environments is the key to functional efficiency. As many enzymes with a conserved fold show no significant sequence similarity, the regions with a particular conserved structure would play a critical role in plasticity and, eventually, functionality. On the other hand, not all residues of an enzyme are involved in determining folding and stability. To detect the sequence requirements for a particular fold and its stability, and their role in enzyme evolution, the Solvent Accessible Surface Area was applied. This feature is based on the fact that hydrophobic residues have little or no SASA [37].
Information on the structural stability of lipases from Amycolatopsis eburnea is noticeably lacking, despite their important place in industry. Therefore, the aim of this research paper is to provide insight and clear information on different Amycolatopsis eburnea lipases based on their SASA. This information can help in recognizing the structural stability and conformation of the lipases. Clear information on each amino acid in the structure could help in designing new lipases with better functionality. Furthermore, this precise clustering of specific amino acids can clarify lipase evolution and even enzyme functionality.

Methods
The sequence data for Amycolatopsis eburnea lipases were retrieved from the National Center for Biotechnology Information (NCBI) database. The physicochemical parameters and 3-D homology models were calculated with the help of the Swiss Institute of Bioinformatics server (https://www.expasy.org/) (accessed on 1 February 2020) [38]. Furthermore, to confirm and compare the 3-D models, deep learning de novo modeling was performed for all lipases. In this method, after the multiple sequence alignment is generated, the distance and orientation distributions are predicted, followed by coarse-grained structure modeling by energy minimization, full-atom structure refinement and, finally, generation of the models. The percentages of similarity and structural identity were also calculated for all models [39]. The phylogenetic relationship of the sequences was presented with MEGA X software [40]. Secondary structure was predicted with the Chou & Fasman method [41]. The SASA data were clustered with the hierarchical clustering method, which groups together the closest or most similar SASA. In this paper, the agglomerative (bottom-up) approach was used: each SASA data set (an observation) was initially considered as one cluster and was merged with the closest cluster as one moved up the hierarchy. To calculate the distance between the SASA data of the proteins, the Euclidean distance and the Ward method were applied.
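The bottom-up merging described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the function names `ward_cost` and `ward_clustering` and the example feature vectors are hypothetical, and each lipase is assumed to be represented by a fixed-length vector of SASA values. At every step, the two clusters whose merge gives the smallest increase in within-cluster squared Euclidean distance (the Ward criterion) are joined.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    squared_distance = sum(
        (x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"])
    )
    return a["size"] * b["size"] / (a["size"] + b["size"]) * squared_distance


def ward_clustering(profiles, n_clusters):
    """Agglomerative clustering: start with one cluster per observation and
    repeatedly merge the pair with the lowest Ward cost."""
    if not 1 <= n_clusters <= len(profiles):
        raise ValueError("n_clusters must be between 1 and len(profiles)")
    clusters = [
        {"members": [i], "size": 1, "centroid": [float(v) for v in p]}
        for i, p in enumerate(profiles)
    ]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        a, b = clusters[i], clusters[j]
        size = a["size"] + b["size"]
        centroid = [
            (a["size"] * x + b["size"] * y) / size
            for x, y in zip(a["centroid"], b["centroid"])
        ]
        merged = {
            "members": a["members"] + b["members"],
            "size": size,
            "centroid": centroid,
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Hypothetical SASA feature vectors (illustrative only, not data from this study).
example_profiles = [
    [10.0, 12.0, 11.0],
    [10.5, 12.2, 11.1],
    [30.0, 31.0, 29.5],
    [30.2, 30.8, 29.9],
    [55.0, 54.0, 56.0],
]
groups = ward_clustering(example_profiles, n_clusters=3)
print(groups)
```

For real data sets, a library routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be preferred, since it also returns the merge heights needed to draw a dendrogram.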

Solvent Accessible Surface Area (SASA)
The solvent accessible surface area of each enzyme was calculated according to Fraczkiewicz and Braun [42]. The Cartesian coordinates of protein atoms stored in the PDB models were used to calculate the SASA of each residue [43]. The solvent accessible surface area of residues was calculated for two environments in each enzyme: the nucleus and the surface. The SASA was identified as the area of contact between the solvent and the atoms, represented by points located on a sphere with the interaction radius around each atom. For this calculation, the interaction radius is the van der Waals radius of each atom type plus the radius of a water molecule. The individual protein chains and the similarly covered areas of each enzyme were calculated and compared. To categorize residues as nucleus or surface, the side-chain solvent accessible surface area is divided by the specific accessibility value of that residue type. The specific accessibility value is the average solvent accessible surface area of residue X in the tripeptide Gly-X-Gly over an ensemble of 30 random conformations. Residues with a ratio above 50% were considered to be in the surface environment, and residues with a ratio below 20% were marked as being in the nucleus (core) environment.
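The nucleus/surface rule above amounts to a simple threshold on relative accessibility. The following sketch is illustrative only: the function name `classify_residue`, the label "intermediate" for ratios between the two thresholds, and the numbers in the usage lines are assumptions, and the Gly-X-Gly reference value must be supplied by the caller.

```python
def classify_residue(side_chain_sasa, reference_sasa):
    """Label a residue by its relative side-chain accessibility.

    side_chain_sasa: side-chain SASA of the residue in the model (in Å^2).
    reference_sasa: specific accessibility value of the residue type,
        i.e. its average SASA in the Gly-X-Gly tripeptide (in Å^2).
    """
    if reference_sasa <= 0:
        raise ValueError("reference accessibility must be positive")
    ratio = side_chain_sasa / reference_sasa
    if ratio > 0.50:
        return "surface"
    if ratio < 0.20:
        return "nucleus"
    # The text assigns no label to ratios between 20% and 50%;
    # "intermediate" is a placeholder chosen for this sketch.
    return "intermediate"


# Hypothetical values, not taken from the lipase models.
print(classify_residue(60.0, 100.0))
print(classify_residue(10.0, 100.0))
```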

$$\text{Total average SASA} = \frac{\text{Nucleus SASA} + \text{Surface SASA}}{\text{Total amino acids}}$$

Structurally conserved regions (SCR) of each enzyme were identified with Chimera using default parameters [44], and the solvent accessible surface area of each SCR was calculated as described above. The SCR-SASA is herein considered as a new conserved fingerprint descriptor for Amycolatopsis eburnea lipases. The cluster analyses of the SASA patterns of the lipases were performed with the unsupervised agglomerative hierarchical clustering method, as a machine learning approach, using the Python 3.9 programming language (http://www.python.org) (accessed on 21 August 2021).
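As a small worked example of the total average SASA defined in this section, the sketch below divides the summed nucleus and surface SASA by the number of amino acids. The function name `total_average_sasa` and the input numbers are hypothetical and are not values from the tables of this study.

```python
def total_average_sasa(nucleus_sasa, surface_sasa, total_amino_acids):
    """Total average SASA = (nucleus SASA + surface SASA) / total amino acids."""
    if total_amino_acids <= 0:
        raise ValueError("total_amino_acids must be positive")
    return (nucleus_sasa + surface_sasa) / total_amino_acids


# Hypothetical enzyme: 2000 Å^2 nucleus SASA, 16000 Å^2 surface SASA, 400 residues.
average = total_average_sasa(2000.0, 16000.0, 400)
print(average)  # 45.0 (Å^2 per residue)
```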

Results
The physicochemical features of the Amycolatopsis eburnea lipases are shown in Table 1. The number of amino acids in the lipases ranged from 252 to 436. The average molecular weight (MW) was around 38 kDa, with a minimum of 24 kDa and a maximum of 44 kDa. The number of negatively charged amino acids ranged from 19 to 47, whereas positively charged amino acids were fewer. The pI of the lipases was more than 4.52, with a maximum of 6.23, so all of them fall in the acidic pH range. Therefore, it remains to be determined whether these enzymes perform their function in acidic environments. Furthermore, buffer preparation for purifying them

The three-dimensional structures of the enzymes were modeled by homology modeling with the help of the Swiss Institute of Bioinformatics server (Figure 1). All models were then evaluated for stereochemical quality with the Ramachandran map (Ramachandran and Sasisekharan 1968), as well as with QMEAN for model confirmation (Tables 2 and 3). The Amycolatopsis eburnea lipase sequences were modeled again with deep learning de novo modeling as described by Yang and coworkers [39]. The de novo models were also checked with the Ramachandran and QMEAN methods. The quality of the models improved significantly (Table 3). The Ramachandran map showed that at most 1.92% (A0A3R9DV90) of residues were in the outlier region; thus, the models are fully acceptable. More than 91% of residues were in the favored region, which showed the high quality of the de novo modeling compared with the homology modeling performed earlier. This information confirmed the models for further analysis. All lipase models were homodimers. The Ramachandran results showed that the maximum percentage of residues in the favored region was 96.85% (A0A3R9EQB2). The QMEAN indexes were higher than −1.98 and were considered acceptable for all de novo models.
The Amycolatopsis eburnea lipases showed a lower frequency of His, Met, Cys and Trp than of other residues (Table 4). The secondary structure of the lipases is shown in Table 5. The percentage of helices in the lipase structures was higher than that of beta sheets and turn loops. At least 53.7% (A0A3R9DV90) of the lipase structure was helical.

Table 4. Each row lists the lipase number, the accession, the sequence length and then the counts of the 20 amino acids.
1  A0A3R9KNJ9  388  62  21   4  22   4  10  10  48   5  11  37   7   2  12  33  21  26   8  16  29
2  A0A3R9DUJ4  394  58  13  10  24   4  18  11  39   4   9  34  16   4  16  30  20  29   5  18  32
3  A0A427T6P4  436  65  18  13  22   2   9  12  50   5  14  40   9   5  11  33  33  34   5  19  37
4  A0A3R9KMI2  404  69  20   8  20   4  11  18  39   9  14  38  11   3  14  28  24  21   3  17  33
5  A0A3R9EQB2  288  45   8   7  14   6   7   5  37   7   9  28   8   3   9  17  18  22   3  11  24
6  A0A3R9F8T1  252  44   9   3  10   4  13  14  31   3   4  26   3   2   0  20  11  20   3   3  29
7  A0A3R9DV90  419  51  19  13  20   4  20   9  52   8  13  31   7   7  15  25  39  27  10  19  30
8  A0A427T2R3  380  54  34   8  26   2   6  21  38  14  10  48   5   2  11  30   8  26   3   5  29

The total SASA of the enzymes was used to cluster the Amycolatopsis eburnea lipases. The overall similarity of the lipases was 99.96%. The dendrogram showed that lipase 4 and lipase 3, together with lipase 2, had more than 99.99% similarity. The same similarity percentage was observed for lipase 6 and lipase 8. However, the similarity between these two clusters was around 99.94%. Overall, the lipases fell into three distinct clusters. A clear identification by SASA clustering is one of the great advantages of this grouping, even for lipases with very high sequence similarity (Figures 2 and 3).
The total solvent accessible surface area and the average solvent accessible surface area for the two environments (nucleus and surface) of each enzyme are shown in Table 6. The maximum SASA of nucleus residues was related to lipase 4, whereas the maximum SASA of surface residues was related to lipase 8. The average solvent accessible areas of the enzymes ranged from 39.65 to 51.53 Å². The overall solvent accessible area of the side chains was higher than that of the backbones in both the nucleus and surface environments. This trend was observed in the individual enzymes as well. The results showed that lipase 4 had more of a chance to interact with the solvent. Furthermore, the results showed that the side chains of the enzymes were more accessible than the backbones for interaction with the solvent and, eventually, the substrate.
Hierarchical clustering of the structurally conserved regions-SASA (SCR-SASA) of the Amycolatopsis eburnea lipases is shown in Figure 4. Lipase 1, lipase 2, lipase 3 and lipase 4 showed approximately the same similarity as lipase 5, lipase 6, lipase 7 and lipase 8. Lipase 1 and lipase 2, with the minimum dissimilarity, showed more conserved SASA than the other lipases. On the other hand, lipase 8 and lipase 7 were totally different from lipase 3 or lipase 4. Overall, this dendrogram showed clearer similarity features than the dendrogram of the whole-enzyme SASA. The SCR-SASA could provide more flexibility in selecting a lipase for a specific substrate based on its contact area with the solvent.

The structurally conserved regions (Figure 5) showed correlation in their SASA (Table 7). Lipase 5 and lipase 6 had the highest correlation, followed by lipase 1 and lipase 2; however, the residues involved in the structure were not the same. The lowest SASA correlation was between lipase 5 and lipase 3, followed by lipase 4 and lipase 7. The SASA correlation of the different structurally conserved regions showed an overall high correlation among the Amycolatopsis eburnea lipases. The SASA of the conserved regions might indicate the minimum SASA that is essential for the stability and folding of the protein.

The similarity of SASA in the conserved regions could shed light on the conservation of, and preference for, residues that stabilize Amycolatopsis eburnea lipases. Gly and Val were the most frequent residues in the conserved regions, with 28 and 27 occurrences, respectively. The other residues are shown in Figure 6. Three residues, Cys, Met and Trp, were not observed in the structurally conserved regions of the Amycolatopsis eburnea lipases.
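A pairwise SCR-SASA correlation of the kind reported in Table 7 can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption; the function names `pearson` and `correlation_table` and the SCR-SASA values are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_x * var_y)


def correlation_table(scr_sasa):
    """Correlation for every ordered pair of lipases.

    scr_sasa maps a lipase name to its list of SASA values, one per
    structurally conserved region, in the same region order for all lipases.
    """
    names = list(scr_sasa)
    return {(a, b): pearson(scr_sasa[a], scr_sasa[b]) for a in names for b in names}


# Hypothetical SASA values for six conserved regions (not data from Table 7).
example = {
    "lipase A": [120.0, 85.0, 60.0, 140.0, 95.0, 70.0],
    "lipase B": [118.0, 90.0, 58.0, 150.0, 99.0, 65.0],
    "lipase C": [80.0, 130.0, 95.0, 75.0, 60.0, 110.0],
}
table = correlation_table(example)
print(round(table[("lipase A", "lipase B")], 3))
```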

Discussion
Interest in lipase is increasing, as it is the third most important enzyme in the market and hydrolyzes triglycerides in different media [14,19][45][46][47][48]. The markets for the enzyme include the food, dairy, flavor, detergent, pharmaceutical, biofuel and cosmetics industries. A demand of more than 1000 tons of lipases for the detergent industry has been reported [49,50].
The wide application of lipases in many fields, from food to medicine, is due to their ability to function in different media; these diverse applications are the reason for the huge market demand [51].
Different sources of lipases have been reported; however, bacterial sources are more suitable and have a better chance of industrial application [52]. The microbial origin of lipases, as well as their functionality in various environments, eases their use in many industries. A significant part of lipase production is the introduction of suitable microbial species or strains. Thus, microorganisms play a dynamic role in lipase production, and bacterial sources have a better chance of being used for industrial lipase production [53,54]. High GC-content bacteria within the family Pseudonocardiaceae (Amycolatopsis eburnea) offer a good prospect for understanding lipase production. Bacteria of this genus (Amycolatopsis) produce many antibiotics under different conditions [55,56]. Therefore, investigating the structure of their lipases alongside their antibiotic properties can make it easier to introduce them for industrial application [57].
Recently published work on the genetic diversity of lipases in bacteria showed great differences in lipase characteristics. However, it revealed a conserved sequence containing the pentapeptide Gly-X-Ser-X-Gly [52]. Bacterial lipases have been classified into seven groups (Groups A-G). The computational analysis of new enzymes from bacteria such as Amycolatopsis eburnea can provide clearer information for lipase classification and even a clearer understanding of lipase evolution. As lipases are water soluble and their substrates are mostly insoluble, their structures should dictate their specific functional activity [16]. The functional activity of this ubiquitous enzyme is very efficient in terms of energy consumption and is environmentally friendly compared with other catalysts [58].
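The conserved pentapeptide Gly-X-Ser-X-Gly mentioned above can be located in a protein sequence with a regular expression. The sketch below is illustrative only: the function name `find_lipase_motif` and the example sequence are hypothetical, X is assumed to stand for any single residue, and the sequence is assumed to be in one-letter amino acid code.

```python
import re

# G-X-S-X-G in one-letter code; the lookahead also reports overlapping matches.
MOTIF = re.compile(r"(?=(G[A-Z]S[A-Z]G))")


def find_lipase_motif(sequence):
    """Return (1-based start position, matched pentapeptide) for every hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in MOTIF.finditer(sequence.upper())
    ]


# Hypothetical sequence fragment, not one of the lipases analyzed here.
print(find_lipase_motif("MKGHSQGAA"))
```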
Lipases are also substrate specific, as well as structurally chemo-, regio- and stereo-specific. The three most recognized groups of lipases are non-specific lipases, 1,3-specific lipases and acid-specific lipases, based on their catalytic activity on triglyceride substrates in different systems. The preferred lipases for industry should have a short reaction time and remain resistant to various pH values, in addition to being active in non-aqueous media [59]. The preference for lipases of microbial origin, together with the ease of producing them in cheap growth media, provides a better opportunity to work on their genetic manipulation towards achieving ideal lipases [60].
In our research, the role of interacting hydrophobic residues in structurally conserved regions was clearly presented. The lipases presented here are homologous, and their structural homology features are recognizable from their similarity and the phylogenetic dendrogram. However, finding structural similarity was necessary to establish a common ancestry. The structural conservation presented here clearly showed the surface plasticity of Amycolatopsis eburnea lipases. The structurally conserved regions contained a higher percentage of Gly and Val residues, which conferred stability and folding functionality on Amycolatopsis eburnea lipases. The SASA features of these regions also provided hydrophobic contact information for the hydrophobic residues in the lipase structures. The secondary structure lengths and loops in Amycolatopsis eburnea lipases were closely related and substantially conserved. The conserved SASA could be a result of selective pressure on molecular conservation. Here, we identified 8 clusters whose mean SASA values were very close. The 6 conserved regions were great tools for describing the stability and surface plasticity of lipases in different environments and their substrate specificity [61][62][63].
The question of the role of these residues in folding or function can be answered clearly, as most of the amino acids in the conserved regions are hydrophobic to some extent. Therefore, they would act in the stability and folding conservation of lipases. This research could support results and hypotheses aimed at finding specific residues for developing better enzymes with mutation approaches. Furthermore, it could help in establishing the sequence-structure correlation and the role of individual residues in folding, stability and function.
It is important to mention that the results clearly showed specific structural constraints, with the SASA features of specific residues conserved in Amycolatopsis eburnea lipases. This pattern of SASA/hydrophobic positions was clearly conserved. The results showed that compensating amino acid substitutions that might have occurred during evolution were conserved in their SASA features. It should be noted that amino acid mutations could also be detected in the conserved regions. Thus, the only conserved features in the conserved structures were the SASA and hydrophobic features. The results therefore showed a specific SASA conservation pattern that imposes the native fold in homologous lipases from Amycolatopsis eburnea. A significant correlation between sequences, structurally conserved regions and SASA features was thus observed. This information could extend lipase structural information and inform algorithms for predicting SASA-based protein contact maps in the future.
Microbial lipase production requires introducing better microbes and optimizing environmental conditions. Identifying the lipase structure is an essential part of finding a better microbial source for production. Therefore, microbial sources can play a vital role in this selection. Traditionally, many microbial sources were selected for industrial production based on the amount of lipase they produce. However, the functionality and efficiency of lipases can be improved significantly by finding a better structure [17,50,64].
These days, with the help of the huge amount of bioinformatics data in protein databases and of computational analysis, finding a better lipase structure for industry is more feasible. As enzyme structure has been shown to be species specific, working on lipases from a specific species is more reasonable and practical. Lipase structural analysis would also be a great help in tracing the evolution of the enzyme, especially in finding the essential residues needed to track homology relationships. In addition, different environments could play substantial roles in the functionality of lipases. Soil usually provides a vital habitat for lipase-producing microbes, alone and in interaction with plants and other biofilms. Thus, research on lipases of soil microbial origin, such as those of Amycolatopsis eburnea, is essential for industrial and evolutionary purposes [57].
Bacterial lipases of soil origin have shown huge diversity and variation in molecular and biochemical characteristics. However, the conserved structural area related to the active site (a serine residue enclosed in the conserved pentapeptide Gly-X-Ser-X-Gly) was conserved in all lipases [52]. Many enzyme features and factors have been considered for clustering lipases; however, the solvent accessibility of the enzyme, an outstanding factor, has always been missed. Solvent accessibility has an important impact on enzyme stability and substrate activity [65]. Even finding the hydrophobic contact area, which is the opposite of SASA, can shed light on the stability factor in structural evolution. This study clearly showed that the SASA of the lipases from Amycolatopsis eburnea had a specific conserved hydrophobic contact area. This feature robustly categorized the lipases into two clusters. Another report categorized lipases with other features and found seven groups; however, the SASA feature was not considered for that categorization [66].
Furthermore, the design of new lipases, as well as of new primers and probes, could benefit greatly from the conserved SASA feature [64,67]. Purification of lipases, an important factor in industry, especially for mass production, could benefit from the conserved SASA feature too. Enzyme formulation for the market, and even wet lab experiments, could be more approachable when the conserved SASA feature is known [66,68].
It is important to mention that all modifications of lipases, from chemical modification and immobilization to UV and gamma ray irradiation, as well as amino acid modification and mutagenesis, require thorough investigation of their effect on the conserved SASA of lipases [72,73]. The information provided here on the SASA of Amycolatopsis eburnea lipases can serve as a great asset for the precise engineering of lipases for agricultural and industrial purposes.
The 3-D structures of lipases with α/β-hydrolase fold architecture provided here can be an outstanding tool for protein modeling and for designing lab experiments [74,75]. The hydrolase fold of this enzyme was assumed to be unrelated to specific residues, while remaining active in diverse environments. As different residues are involved in the folding of lipases, the structurally conserved regions and their features need to be classified and investigated in more detail to understand the mechanisms of lipase action. It is noteworthy that a parallel β-sheet of eight strands plays a great role in the folded structure of lipases [76].

Conclusions
Lipases from Amycolatopsis eburnea, which have a great impact on the agricultural and industrial sectors, were clearly shown to have specific structural patterns. Therefore, to support the development and design of new lipases, substantial insight into Amycolatopsis eburnea lipase structures and hotspots was presented with a machine learning approach, an unsupervised agglomerative hierarchical method. The structural landscapes of lipases from Amycolatopsis eburnea, with their specific conserved SASA features, showed good potential as models for designing and developing synthetic lipases. The conserved SASA of Amycolatopsis eburnea lipases showed a clear need for specific residues with specific SASA in the structure and sequence of the enzyme for its stability and conformation. This pattern in the enzyme structure can help in the design of synthetic lipases and even provides a great asset for tracing the homology of this enzyme from an evolutionary point of view. Amycolatopsis bacteria, with their symbiotic relationship with mycorrhiza, can also be good examples for research on soil-, bacteria- and fungi-plant interactions, and the SASA patterns in the structure of lipase enzymes can help to investigate and understand this symbiosis in future research.