Targeting Cathepsin L in Cancer Management: Leveraging Machine Learning, Structure-Based Virtual Screening, and Molecular Dynamics Studies

Cathepsin L (CTSL) expression is dysregulated in a variety of cancers. Extensive empirical evidence indicates their direct participation in cancer growth, angiogenic processes, metastatic dissemination, and the development of treatment resistance. Currently, no natural CTSL inhibitors are approved for clinical use. Consequently, the development of novel CTSL inhibition strategies is an urgent necessity. In this study, a combined machine learning (ML) and structure-based virtual screening strategy was employed to identify potential natural CTSL inhibitors. The random forest ML model was trained on IC50 values. The accuracy of the trained model was over 90%. Furthermore, we used this ML model to screen the Biopurify and Targetmol natural compound libraries, yielding 149 hits with prediction scores >0.6. These hits were subsequently selected for virtual screening using a structure-based approach, yielding 13 hits with higher binding affinity compared to the positive control (AZ12878478). Two of these hits, ZINC4097985 and ZINC4098355, have been shown to strongly bind CTSL proteins. In addition to drug-like properties, both compounds demonstrated high affinity, ligand efficiency, and specificity for the CTSL binding pocket. Furthermore, in molecular dynamics simulations spanning 200 ns, these compounds formed stable protein-ligand complexes. ZINC4097985 and ZINC4098355 can be considered promising candidates for CTSL inhibition after experimental validation, with the potential to provide therapeutic benefits in cancer management.


Introduction
Cathepsins are proteases found in lysosomes classified into cysteine, aspartate, and serine cathepsins, depending on their active site [1].Cysteine cathepsins (CTS) are found in a wide range of organisms, spanning from prokaryotes to mammals, and exhibit a conserved cysteine residue within their active site.These enzymes are essential for degrading proteins internalized by lysosomes and are active at pH levels ranging from neutral to acidic [2].
They also aid in the breakdown of intracellular proteins, the activation of inactive enzyme precursors, and the remodeling of the extracellular matrix [3,4].CTS serve an important role in tissue homeostasis and contribute to processes such as immune response, apoptosis, and development under normal physiological settings [5].However, their expression, location, and activity changes have been linked to various clinical diseases, including cancer progression [6,7].Ectopic cysteine cathepsin expression is frequently associated with a poor prognosis [8].The number of cathepsin classes in humans has increased from the previously recognized 11 to 15.These include the classes A, B, C, D, E, F, G, H, L, K, O, S, V, X, W, and Z [9].
Cathepsin L (CTSL) functions both intracellularly and extracellularly.The acidic environment created in cancer cells by anaerobic glycolysis promotes the activity of CTSL [10,11], which actively degrades collagen, fibronectin, and laminin [12].CTSL is normally confined within lysosomes in healthy physiological states.However, in the context of tumors, changes in expression levels and translocation pathways can lead to CTSL secretion.Studies on animal models of pancreatic cancer have demonstrated that CTSL and cathepsin S enzymes are released from cancer cells and act on the extracellular domain of cell adhesion molecules present on the surface of cancer cells [13].CTSL's ability to degrade E-cadherin (a cell-cell adhesion molecule) contributes to cancer cells' invasive function and metastatic capacity [14].CTSL expression increases in various cancers, including glioma, melanoma, pancreatic, breast, and prostate carcinoma [15].Given the importance of CTSL in tumor cell dissemination, there is a pressing demand to develop novel CTSL inhibition strategies.Several CTSL inhibitors have been identified, including SID 26681509, quarterhydrate, and Z-FY-CHO [16].While these inhibitors show CTSL specificity, it is important to note that they are synthetic and not derived from natural sources.Because of their synthetic origin, they may have potential side effects when used.This highlights the importance of investigating natural compounds as potential CTSL inhibitors in order to alleviate concerns about side effects and improve the translational potential of CTS-targeted therapeutic interventions.
The drug development process is characterized by its interdisciplinary nature, significant costs, and time requirements.Over the last two decades, scientific progress has transformed the approach to pharmaceutical research in generating new bioactive molecules [17].Notably, advances in computational techniques and parallel hardware support have made it easier to use in silico methods, particularly the structure-based drug design approach.These methods have accelerated various stages of the drug development process, including target identification via identifying initial active compounds and subsequent optimization of lead compounds [18].This study employed in silico strategies, including a combination of machine learning (ML) and structure-based screening of natural compounds, to find natural CTSL inhibitors for combating cancer.

Results and Discussion
The CHEMBL database was used to obtain compound activity data against the CTSL enzyme.The dataset's redundancy was examined, and all duplicate compounds were removed.Compounds with IC 50 values less than 1000 nM were considered active, while those with IC 50 values greater than 1000 nM were considered inactive.We confirmed 2000 active molecules and 1278 inactive molecules using these criteria.The Morgan fingerprints (1024) have been calculated for the active/inactive molecules.Following vectorization, the random forest (RF) classification model was trained to differentiate between active and inactive molecules.ROC curves from 10-fold cross-validation were plotted.Figure 1A depicts the study's workflow.The AUC mean values of the RF model were estimated at 0.91 (Figure 1B).Following that, the RF train model was used to screen the Biopurify and Targetmol natural compound libraries, with the top hits selecting them for further SBVS.A total of 149 compounds with prediction scores >0.6 were identified.These compounds were then used in structure-based virtual screening (SBVS) to selectively interact with the CTSL protein's active site.In this study, the positive control was AZ12878478.The binding energies that exceeded those of the control compound were chosen for further interaction analysis, which included both two-dimensional (2D) and three-dimensional (3D) interactions.SBVS yielded thirteen hits with higher affinity, as determined by binding energy, in comparison to the positive control (Table 1).Table 1 presents the RF prediction scores and binding affinities of the 13 compounds.Further physicochemical and molecular properties encompass the molecular weight, the number of rotatable bonds, as well as the number of hydrogen bond donors and acceptors.Following that, the RF train model was used to screen the Biopurify and Targetmol natural compound libraries, with the top hits selecting them for further SBVS.A total of 149 compounds with prediction scores >0.6 were identified.These compounds were then used in structure-based virtual screening (SBVS) to selectively interact with the CTSL protein's active site.In this study, the positive control was AZ12878478.The binding energies that exceeded those of the control compound were chosen for further interaction analysis, which included both two-dimensional (2D) and three-dimensional (3D) interactions.SBVS yielded thirteen hits with higher affinity, as determined by binding energy, in comparison to the positive control (Table 1).Table 1 presents the RF prediction scores and binding affinities of the 13 compounds.Further physicochemical and molecular properties encompass the molecular weight, the number of rotatable bonds, as well as the number of hydrogen bond donors and acceptors.The top two compounds, ZINC4097985, and ZINC4098355, as well as the positive control, were subjected to detailed interaction analysis with the CTSL active site residues.ZINC4097985 was found to interact with Gln19, Gly20, Gln21, Cys22, Gly23, Cys25, Trp26, Cys65, Asn66, Gly67, Gly68, Leu69, Met70, Ala135, Met161, Asp162, His163, and Trp189 residues of CTSL protein.The Gln19, Gly20, and Asp162 residues were H-bonded with ZINC4097985 (Figure 2A).Furthermore, ZINC4098355 interacted with Cys25, Trp26, Asn66, Gly67, Gly68, Leu69, Met70, Asp71, Ala135, Glu159, Asp160, Met161, Asp162, Ala214, and Ser216 residues of CTSL protein.The Gly68, Met70, and Met161 residues were H-bonded with ZINC4098355 (Figure 2B).
The co-crystal inhibitor was found to interact with Gly23, Cys25, Trp26, Asn66, Gly67, Gly68, Leu69, Ala135, Glu159, Met161, Asp162, His163, and Gly164 residues of CTSL protein (Figure 2C).Notably, among these residues, Cys25, Trp26, Asn66, Gly67, Gly68, Leu69, Ala135, Met161, and Asp162 were found to be common interaction residues for the cocrystal inhibitor and the hit compounds (ZINC4097985 and ZINC4098355) (Figure 2A-C).This observation suggests that the hit compounds ZINC4097985 and ZINC4098355 bind to the same binding pocket on the CTSL protein as the co-crystal inhibitor.CTSL has an electrophilic center that can act as a thiol trap for the active-site cysteine (Cys-25), making it an ideal target for covalent inhibitors.CTSL inhibitors currently use electrophilic moieties such as epoxides, vinyl sulfones, fluoromethyl ketones, and diazomethyl ketones to form an irreversible covalent linkage with CTSL [19][20][21].Nonetheless, the use of irreversible inhibitors in drug candidate development raises significant safety concerns due to the potential difficulties associated with dissociation.Nitrile-functionalized inhibitors have been classified as covalent and reversible inhibitors due to the capacity of nitriles to selectively bind sulfur atoms, leading to the formation of thioimidazole bonds that exhibit hydrolytic properties over a period of time [22,23].Importantly, the identified natural compounds in this study (ZINC4097985 and ZINC4098355) have demonstrated interactions with CTSL's Cys-25 residue, implying that these compounds could be potent CTSL inhibitors.
The evaluation of adsorption, distribution, metabolism, excretion, and toxicity (AD-MET) properties holds significant importance in the realm of small-molecule drug discovery and therapeutics.It is important to acknowledge that a multitude of clinical trials have encountered obstacles due to inadequacies in these characteristics.Although early-stage ADMET profiling is considered optimal, the execution of experimental evaluations is often hindered by the high costs involved and the limited availability of data.The ADMET properties of the top two compounds were predicted using online tools (https://ai-druglab.smu.edu/admet,accessed on 10 July 2023) (Table 2) [24].The anticipated values of diverse molecular properties for these two compounds fell within an acceptable range, suggesting their potential as lead molecules.
Finally, the docked complexes of these two compounds (ZINC4097985 and ZINC4098355) and the control were subjected to 200 ns of MD simulation.MD simulations of the docked complexes were performed to assess their binding stability.The root mean square deviation (RMSD) serves as a metric for assessing protein stability, with smaller deviations indicating a higher degree of stability in the protein structure.CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 had average RMSD values of 0.18, 0.21, and 0.15 nm, respectively.The RMSD plot exhibits that CTSL-ZINC4098355 and the CTSL-control complex were more stable than CTSL-ZINC4097985.The bound structure of the CTSL-ZINC4097985 complex exhibited a high deviation from the initial conformation, indicating that CTSL's catalytic pocket formed a stable interaction with the screened compound.Furthermore, the ligand RMSD exhibits that the control and ZINC4098355 demonstrated high deviation, while the ZINC4097985 complex showed low deviation (Figure 3A,B).
Int. J. Mol.Sci.2023, 24, x FOR PEER REVIEW 6 of 11 CTSL-ZINC4097985 complex exhibited a high deviation from the initial conformation, indicating that CTSL's catalytic pocket formed a stable interaction with the screened compound.Furthermore, the ligand RMSD exhibits that the control and ZINC4098355 demonstrated high deviation, while the ZINC4097985 complex showed low deviation (Figure 3A,B).
The average fluctuation of all residues during the simulation and the root mean square fluctuation (RMSF) of CTSL during the binding of CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 were plotted as a function of CTSL residue numbers.The CTSL-control and CTSL-ZINC4098355 backbones exhibited consistent fluctuations in the CTSL binding pocket, most likely due to differing orientations, and CTSL-ZINC4097985 showed high fluctuation in several regions such as 38-45, 58-61, and 110-115 residues (Figure 3C).Conversely, the CTSL-control and CTSL-ZINC4098355 complexes had the overall fewest fluctuations.Collectively, ZINC4098355 was more stable than ZINC4097985, and it follows the control RMSF pattern with CTSL.
In order to acquire a deeper understanding of the intricate compactness profile within a biological system, the radius of gyration (Rg) was used.The complexes CTSLcontrol, CTSL-ZINC4097985, and CTSL-ZINC4098355 exhibited average Rg values of 1.66 nm, 1.68 nm, and 1.64 nm, respectively.The Rg plot revealed that the CTSL-control and CTSL-ZINC4097985 complexes were less compact than the CTSL-ZINC4098355 complex.It was inferred that after the binding of CTSL-control and CTSL-ZINC4097985 compounds, CTSL becomes more stable due to the higher unfolding of CTSL's native structure (Figure 3D).A protein's solvent-accessible surface area (SASA) refers to the extent of its surface that is engaged in interactions with the surrounding solvent molecules.The average SASA values for the three complexes, CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355, were graphed.SASA values for the CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 complexes were 114.01, 116.12, and 109.23 nm 2 , respectively (Figure 4A).The average fluctuation of all residues during the simulation and the root mean square fluctuation (RMSF) of CTSL during the binding of CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 were plotted as a function of CTSL residue numbers.The CTSL-control and CTSL-ZINC4098355 backbones exhibited consistent fluctuations in the CTSL binding pocket, most likely due to differing orientations, and CTSL-ZINC4097985 showed high fluctuation in several regions such as 38-45, 58-61, and 110-115 residues (Figure 3C).Conversely, the CTSL-control and CTSL-ZINC4098355 complexes had the overall fewest fluctuations.Collectively, ZINC4098355 was more stable than ZINC4097985, and it follows the control RMSF pattern with CTSL.
In order to acquire a deeper understanding of the intricate compactness profile within a biological system, the radius of gyration (Rg) was used.The complexes CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 exhibited average Rg values of 1.66 nm, 1.68 nm, and 1.64 nm, respectively.The Rg plot revealed that the CTSL-control and CTSL-ZINC4097985 complexes were less compact than the CTSL-ZINC4098355 complex.It was inferred that after the binding of CTSL-control and CTSL-ZINC4097985 compounds, CTSL becomes more stable due to the higher unfolding of CTSL's native structure (Figure 3D).
A protein's solvent-accessible surface area (SASA) refers to the extent of its surface that is engaged in interactions with the surrounding solvent molecules.The average SASA values for the three complexes, CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355, were graphed.SASA values for the CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 complexes were 114.01, 116.12, and 109.23 nm 2 , respectively (Figure 4A).SASA analysis showed that the binding of ZINC4098355 reduced surface exposure, while the control and ZINC4097985 compounds enhanced the surface area of solvent accessibility.The selected docked complexes were also subjected to hydrogen bond analysis.The control and ZINC4097985 exhibited an average 3-5 H-bond with CTSL protein, while the ZINC4098355 compound demonstrated a 1-3 H-bond.In addition, they also formed more H-bonds with solvents in the system.It was inferred that the control and ZINC4097985 compounds interacted more with water molecules in the system, while ZINC4098355 interacted less with water.Therefore, it binds more stably in the catalytic pocket of the CTSL protein (Figure 4B,C).Further, 2D plot analysis of complexes showed that all the complexes overlap very well with each other (Figure 4D).compounds interacted more with water molecules in the system, while ZINC409835 teracted less with water.Therefore, it binds more stably in the catalytic pocket of the C protein (Figure 4B,C).Further, 2D plot analysis of complexes showed that all the plexes overlap very well with each other (Figure 4D).The free energy landscape of CTSL-control, CTSL-ZINC4097985, and C ZINC4098355 complexes has been demonstrated in Figure 5A-C.The complex CTSL trol and CTSL-ZINC4098355 formed a similar pattern to CTSL-ZINC4097985.The plex CTSL-ZINC4097985 showed a higher trajectory distribution than the other pounds.The free energy landscape of CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 complexes has been demonstrated in Figure 5A-C.The complex CTSL-control and CTSL-ZINC4098355 formed a similar pattern to CTSL-ZINC4097985.The complex CTSL-ZINC4097985 showed a higher trajectory distribution than the other compounds.
Because of their rich chemical diversity, natural products represent a valuable reservoir of diverse bioactive compounds with promising medicinal potential [25].Over the past few decades, considerable research endeavors have been dedicated to the identification and extraction of novel natural products from a variety of organisms, such as plants, microorganisms, and other biological entities.Numerous anti-cancer pharmaceuticals have been developed as a result of extensive research into the anti-cancer properties of these naturally occurring compounds and the underlying mechanisms of their action.Interestingly, it is estimated that from 1981 to 2019, about 25% of newly approved anticancer medications had natural constituents as their source [26].Notably, the hits identified in this study are natural compounds that have been demonstrated to bind strongly to the CTSL protein; further investigation could confirm that these substances are potent CTSL inhibitors.Because of their rich chemical diversity, natural products represent a valuable reservoir of diverse bioactive compounds with promising medicinal potential [25].Over the past few decades, considerable research endeavors have been dedicated to the identification and extraction of novel natural products from a variety of organisms, such as plants,

Machine Learning Model Building and Screening Natural Compounds
Compound datasets of reported active and inactive CTSL ligands were retrieved from freely available cheminformatics databases, such as ChEMBL.To distinguish between active and inactive compounds, the largest experimental datatype for the CTSL enzyme was used.The activity cutoff for biochemical or biophysical activity (IC 50 ) was 1000 nM, and anything above that was considered inactive.Further, molecular fingerprints (FPs) for the active and inactive compounds were determined using the cheminformatics package RDKit (5 March 2022).We counted all circular fragments from each chosen center-heavy atom up to the specified radius of two atoms and 1024 binary bits using the Morgan FPs [27].
Next, the ML models were built with the default Python (3.9) libraries.The classic ML algorithm (RF) method was chosen.The scikit-learn (1.7.3) [28] package is used to implement the RF model with a dataset splitting ratio of 4:1 (train and test).Ten-fold crossvalidation was used to assess the trained models, and receiver operating characteristic curves were plotted for RF models.The trained model was then used to screen the natural compound library (Biopurify and Targetmol).Furthermore, we choose the top virtual hits from the natural compound database with a prediction score > 0.6 for further screening using structure-based methods.

CTSL Protein Retrieval and Preparation
The 3D structure of the CTSL protein (PDB ID: 3HHA) was obtained in pdf format from the Protein Data Bank.The structure exhibits a monomeric arrangement, with AZ12878478 as a co-crystal ligand.The co-crystal ligand and water molecules were eliminated from the structure using Discovery Studio (DS) 2021.The 'clean' protein was then prepared by employing the "prepared protein" protocol in DS 2021, and the resulting protein structure was saved in pdb format.

Structure-Based Virtual Screening
SBVS is a well-established computational technique used in computer-assisted drug design.This screening methodology is critical to accelerating the process of novel drug design by identifying potentially potent compounds.Specifically, molecular docking-based vs. is an effective method for identifying active compounds from large ligand databases.This is accomplished by assessing the binding affinities of proteins and ligands, which aids in the selection of compounds with promising interactions for further investigation [29].
The PyRx tool (0.8 version) was used to screen the compounds identified by the ML model against the active site of CTSL.The coordinates for the grid center of the CTSL were defined as follows: X = 8.674536, Y = 8.057071, and Z = −2.836607.

Physiochemical and ADMET Properties Prediction
The physicochemical and molecular properties of the top 13 compounds were estimated by the SwissADME web tool (http://www.swissadme.ch/,accessed on 10 July 2023).The ADMET properties of the final top two compounds were predicted using online tools (https://ai-druglab.smu.edu/admet,accessed on 10 July 2023).

Molecular Dynamics (MD) Simulation
GROMACS 2022 was used to conduct MD simulations of CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 complexes at 300 K [30].CTSL enzyme topology was determined using the AMBER 99SB force field [31].Force-field parameters and lead compound topologies were generated using the AnteChamber server [32].The lead compound's atoms were assembled into a complex using the topology of the CTSL enzyme.The CTSL-compound complex was then solvated using the TIP3P water model in a cubic box [33].To ensure a physiological concentration of 0.15 M, the charges on CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 were restored by adding Na + and Cl − ions through the gmx genion module.The steepest descent method, with 1500 steps, was used to minimize each system.The temperature was stabilized using the V-rescale thermostat after the system was first brought to equilibrium using an NVT ensemble with a constant number of particles, volume, and temperature of 300 K for 100 ps.The Parrinello-Rahman barostat was then used to equalize the pressure in each system to 1.0 bar [34].To calculate long-range electrostatic interactions, each equilibrated system was simulated for 200 ns using particle mesh Ewald.The Berendsen (V-rescale) thermostat [35] and the Parrinello-Rahman barostat were used during the simulation to keep the temperature at 300 K and the pressure at 1.0 bar, respectively.GROMACS analysis modules were used to study subsequent trajectories.VMD [36] and PyMOL were used to generate graphical representations of the 3D models.

Conclusions
This study used a novel approach that combined ML and SBVS techniques to identify potential natural CTSL inhibitors, the expression of which is known to be abnormal in a variety of cancer types.ZINC4097985 and ZINC4098355 have been found to possess strong affinities and specificities for the CTSL binding site, as well as drug-like properties.In a 200 ns MD simulation, these compounds also demonstrated stable binding in complexes with the CTSL protein.ZINC4097985 and ZINC4098355 exhibit potential as viable inhibitors of CTSL, thereby offering therapeutic value in the management of cancer.Nevertheless, it is necessary to conduct experimental validation in order to optimize them as inhibitors of CTSL.

Figure 1 .
Figure 1.Workflow of study (A), and evaluation of the train RF machine learning model for CTSL protein (B).

Figure 1 .
Figure 1.Workflow of study (A), and evaluation of the train RF machine learning model for CTSL protein (B).

Figure 2 .
Figure 2. 3D and 2D interaction of ZINC4097985 (red) (A), ZINC4098355 (green) (B), and control compound (black) (C) in the CTSL protein binding pocket.CTSL has an electrophilic center that can act as a thiol trap for the active-site cysteine (Cys-25), making it an ideal target for covalent inhibitors.CTSL inhibitors currently use electrophilic moieties such as epoxides, vinyl sulfones, fluoromethyl ketones, and diazomethyl ketones to form an irreversible covalent linkage with CTSL[19][20][21].Nonetheless, the use of irreversible inhibitors in drug candidate development raises significant safety concerns due to the potential difficulties associated with dissociation.Nitrile-functionalized inhibitors have been classified as covalent and reversible inhibitors due to the capac-

Figure 4 .
Figure 4. SASA plot of complexes (A), number of H-bonds in complexes (B), number of H-b between ligands and water molecules (C), and 2D projection of protein ligand complexes (D) green, and black color represent ZINC4097985, ZINC4098355, and control.

Figure 4 .
Figure 4. SASA plot of complexes (A), number of H-bonds in complexes (B), number of H-bonds between ligands and water molecules (C), and 2D projection of protein ligand complexes (D).Red, green, and black color represent ZINC4097985, ZINC4098355, and control.

Figure 4 .
Figure 4. SASA plot of complexes (A), number of H-bonds in complexes (B), number of H-bonds between ligands and water molecules (C), and 2D projection of protein ligand complexes (D).Red, green, and black color represent ZINC4097985, ZINC4098355, and control.The free energy landscape of CTSL-control, CTSL-ZINC4097985, and CTSL-ZINC4098355 complexes has been demonstrated in Figure5A-C.The complex CTSL-control and CTSL-ZINC4098355 formed a similar pattern to CTSL-ZINC4097985.The complex CTSL-ZINC4097985 showed a higher trajectory distribution than the other compounds.

Table 1 .
Binding energies and physicochemical properties of the top 13 compounds.

Table 1 .
Binding energies and physicochemical properties of the top 13 compounds.
No. of H acceptors: HA; No. of H donors: HD; No. of rotatable bonds: RB.

Table 2 .
ADMET prediction for the top two compounds, ZINC4097985 and ZINC4098355.