Monte Carlo Method and GA-MLR-Based QSAR Modeling of NS5A Inhibitors against the Hepatitis C Virus

Hepatitis C virus (HCV) infection is a serious disease that threatens human health. Despite consistent efforts to inhibit the virus, it has infected more than 58 million people and causes approximately 300,000 deaths per year. The HCV nonstructural protein NS5A plays a critical role in the viral life cycle, as it is a major contributor to the viral replication and assembly processes. Its importance is therefore evident in all currently approved HCV combination treatments. The present study identifies new potential compounds for possible medical use against HCV using the quantitative structure–activity relationship (QSAR). In this context, a set of 36 NS5A inhibitors was used to build QSAR models using genetic algorithm multiple linear regression (GA-MLR) and Monte Carlo optimization, the latter implemented in the CORAL software. The Monte Carlo method was used to build QSAR models using SMILES-based optimal descriptors. Four splits were performed, and 24 QSAR models were developed and verified through internal and external validation. The model created for split 3 produced the highest values of the determination coefficients for the validation set (R2 = 0.991 and Q2 = 0.943). In addition, this model provides information about the structural features responsible for the increase and decrease of inhibitory activity, which was used to design eight novel NS5A inhibitors. The constructed GA-MLR model, with satisfactory statistical parameters (R2 = 0.915 and Q2 = 0.941), confirmed the predicted inhibitory activity of these compounds. The Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) predictions showed that the newly designed compounds were nontoxic and exhibited acceptable pharmacological properties. These results could accelerate the process of discovering new drugs against HCV.


Introduction
Hepatitis C virus (HCV) has significantly affected the lives of infected patients over the last century, as only a small proportion of them clear the virus naturally. Most infected individuals develop a spectrum of liver diseases ranging from mild inflammation to extensive liver fibrosis, cirrhosis, chronic hepatitis C, and hepatocellular carcinoma [1,2]. According to the statistical report of the World Health Organization (WHO), an estimated 58 million people were infected with hepatitis C in 2019, and approximately 300,000 deaths were caused by HCV [3]. HCV belongs to the Flaviviridae family, genus Hepacivirus. It is a single-stranded RNA virus with a genome of approximately 9600 nucleotides. The HCV genome consists of an open reading frame (ORF) between the 5' and 3' conserved untranslated regions, encoding three structural proteins (C, E1 and E2) and seven non-structural proteins (NS1, NS2, NS3, NS4A, NS4B, NS5A and NS5B) [4]. HCV strains are classified into eight major genotypes, with 86 subtypes identified to date [5]. For the past two decades, the standard therapy for HCV infection has been based on peginterferon and the antiviral nucleoside analog ribavirin. With this regimen, approximately half of patients achieved a lower sustained virologic response (SVR) and suffered from undesired harmful effects such as cardiac-related problems, leukopenia, and thrombocytopenia [6]. Recently, many direct-acting antiviral (DAA) drugs have been authorized for the treatment of HCV infection, with higher SVR rates (>90%), shorter treatment duration, and fewer adverse effects compared with older therapies [7,8]. These innovative therapies have revolutionized HCV medicine and delivered significant insights into curing HCV patients. In 2016, an affordable combination treatment with the new drug Ravidasvir was shown to be safe and effective, with exceptionally high cure rates [9].
Nevertheless, HCV infection is not yet fully controlled, owing to the high costs associated with the therapies and the emergence of mutant strains resistant to DAA drugs. These treatments target three nonstructural proteins: NS3/4A protease, NS5B polymerase, and NS5A protein, which are involved in the replication and assembly processes of the virus [7].
The NS5A receptor is a 478-amino acid phosphoprotein containing three structural domains (I, II, and III) that terminate in four complementary functional zones (A, B, C, and D). The NS5A protein interacts with other important viral proteins (NS4B, NS5B, NS3) and host cell proteins (cyclophilin A, kinases and others) to regulate viral replication and assembly [10,11]. Due to its critical role in HCV replication, NS5A has emerged as a potential therapeutic target for treating chronic HCV infection. Recently, computational drug design has emerged as a powerful technique that plays a pivotal role in drug development. The quantitative structure-activity relationship (QSAR), which links the structural features of molecules to endpoints, is an important part of cheminformatics. The QSAR approach is widely used to predict biological activities and the development of new lead compounds. Thus, the biological activity of new structures based on the developed model can be easily determined using the QSAR method without the need for experimental synthesis and biological testing [12]. Due to its predictive power, the QSAR approach could also eliminate molecules with undesirable properties at an early stage. Therefore, it reduces the cost, time, and error rate in developing new drug molecules.
Continuing our recent work on the development of new potent inhibitors targeting the NS4B receptor of HCV [13], we report here several QSAR models targeting NS5A. The current marketed anti-NS5A drugs have common structural features including C2 axial symmetry and the presence of methyl carbamates on both extremities. However, the symmetrical nature of anti-HCV agents is not essential for the inhibition of HCV, as reported by Nakamura et al. [14]. The QSAR models were built based on the structural features of asymmetrical NS5A derivatives with their potent inhibitory activity. The first model, based on Monte Carlo optimization, was applied to develop SMILES-based QSAR models that provide insights into the design of novel anti-HCV agents. The second model aims to confirm the prediction of inhibitory activity of the designed molecules using the genetic algorithm multiple linear regression (GA-MLR) technique. ADMET analysis was used to investigate and evaluate the drug-likeness properties of the newly designed inhibitors.

SMILES-Based QSAR Model
In total, 24 QSAR models were developed from four random splits using two objective functions: TF1 without the IIC and TF2 with different values of the IIC. For TF2, different numerical values of WIIC were used, including 0.1, 0.3, 0.5, 0.7, and 0.9. The calculated statistical parameters for the created SMILES-based QSAR models show that WIIC = 0.5 strengthens the influence of the IIC on the Monte Carlo optimization (Supplementary materials, Spreadsheet). The statistical parameters calculated with WIIC = 0.5 for all splits are shown in Table 1. The experimental pEC50 values compared to the calculated values for the four splits are shown in Figure 1. Table 1 clearly shows the statistical reliability of all models and that they meet the criteria established by Tropsha et al. [15] and Ojha et al. [16]. The established QSAR model of split 3 provides the best statistical parameters (R2 = 0.991, CCC = 0.911, and Q2 = 0.943). The model equation of split 3 is given below:
pEC50 = 0.532 (± 0.184) + 0.103 (± 0.003) × DCW(2,30) (1)
An additional validation of the Monte Carlo model was performed using the AD. We determined the theoretical range in which the predictions of the constructed SMILES-based QSAR model are accurate.
In the case of TF1, without considering the influence of IIC on activity (pEC50), the number of outliers for split 3 was four (i.e., compounds No. 29, 32, 34, and 35). In the case of TF2, the number of outliers for split 3 was three (i.e., compounds No. 8, 9, and 16).

GA-MLR QSAR Model
The GA-MLR method was applied to the training set, and the resulting model was then evaluated against the test set using the selected descriptors. Three descriptors that contribute to the inhibitory activity were selected from the entire set to build the QSAR model: RBN (number of rotatable bonds), MATS1e (Moran autocorrelation of lag 2 weighted by Sanderson electronegativity), and G(N..O) (sum of geometrical distances between N..O). The model created using the GA-MLR technique and its statistical parameters are given in Equation (2), where Ntr is the total number of samples in the training set and CCC represents the concordance correlation coefficient [17]. Q2F1, Q2F2, and Q2F3 are external validation criteria [18]. The values of these parameters for the developed GA-MLR model meet the standard validation criteria according to the OECD guidelines. In addition, Figure 2a compares the experimental pEC50 endpoints with those predicted by the developed GA-MLR model and shows a good correlation between the activity of concern and the three selected descriptors. To further validate the constructed model, its applicability domain (AD) was evaluated with the leverage method, as shown by the Williams plot in Figure 2b. The dashed lines show the cutoff value of ±3 s.d., and the warning leverage for X outliers (h*) is 0.462. The Williams plot shows that all molecules are within the AD, with the exception of compound No. 21.
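The leverage method mentioned above can be illustrated with a minimal sketch. It assumes the common convention h = x^T (X^T X)^-1 x with an intercept column and a warning leverage h* = 3(p + 1)/n; the paper does not state which convention it used. The descriptor values are random placeholders, and the training set size of 26 is an assumption chosen only because, with three descriptors, it gives h* ≈ 0.462.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

def leverage(x_query, X_train):
    """Leverage h = x^T (X^T X)^-1 x, with an intercept term prepended."""
    rows = [[1.0] + list(r) for r in X_train]
    x = [1.0] + list(x_query)
    k = len(x)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    z = solve(xtx, x)  # z = (X^T X)^-1 x
    return sum(xi * zi for xi, zi in zip(x, z))

def warning_leverage(n_train, n_descriptors):
    """Assumed convention: h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train

# Placeholder data: 26 hypothetical training compounds, 3 descriptors each.
rng = random.Random(0)
X_train = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(26)]

h_star = warning_leverage(len(X_train), 3)
h_values = [leverage(row, X_train) for row in X_train]
outside_ad = [i for i, h in enumerate(h_values) if h > h_star]
```

A compound whose leverage exceeds h* would be flagged as an X outlier, which is how a Williams plot marks structures outside the AD.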

Mechanistic Interpretation
A mechanistic interpretation is a crucial part of the OECD principles for QSAR validation. Molecular features responsible for increasing and decreasing an endpoint can be extracted and interpreted from such models. The mechanistic interpretation of the CORAL model can be obtained from multiple runs of Monte Carlo optimization. Over three independent Monte Carlo optimization runs, SMILES attributes with positive CWs in all runs are promoters of an increase in pEC50, and SMILES attributes with negative CWs in all runs are promoters of a decrease in pEC50. In contrast, SMILES attributes with both positive and negative CWs are undefined. The main promoters of an increase or decrease in pEC50, with their CWs for three independent runs of the QSAR model built for split 3, are shown in Table 2. Considering these data, the top-ranking fragments for increasing activity are: (1) combination of sp3 carbon with branching; (2) presence of sp3 oxygen surrounded by two sp3 carbons; (3) presence of oxygen; (4) combination of sp3 nitrogen with branching; (5) presence of sp3 carbon surrounded by sp3 oxygen and sp3 carbon; (6) combination of sp3 nitrogen and sp3 carbon in the aliphatic ring; (7) combination of sp3 nitrogen and sp3 carbon; (8) presence of sp3 nitrogen surrounded by two sp3 carbons; (9) presence of two sp3 carbon atoms; (10) presence of nitrogen; (11) presence of sp3 carbon surrounded by sp3 nitrogen and sp3 carbon; (12) maximum number of nitrogen is 8; and (13) maximum number of oxygen is 8. In contrast, the top-ranking fragments for decreasing the activity are: (1) presence of one ring; (2) combination of oxygen, double bond, and branching; (3) presence of doubly bonded carbon; and (4) combination of sp3 oxygen and branching. Based on these considerations, the promoters of decrease were avoided.
The promoters of increase were examined at three different positions, indicated by R1, L, and R2, in the lead compound 25, which has the highest pEC50 value. The structures of all designed compounds with their pEC50 values are listed in Table S3 in the Supplemental Material.
Consequently, eight novel HCV NS5A inhibitors, which showed high activity among the designed NS5A inhibitors, were selected based on these promoters (Figure 3). All pEC50 values of the selected inhibitors predicted by the SMILES-based QSAR and GA-MLR QSAR models were higher than that of the lead compound 25 (Figure 3). These newly designed hits, with their chemical structures, promoters of increase, and predicted pEC50 values, are shown in Figure 3 and Table 3.
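The sign rule for promoters can be written down in a few lines. This is only an illustrative sketch: the attribute names and CW values below are hypothetical and are not taken from Table 2.

```python
def classify_attribute(cws):
    """Classify a SMILES attribute from its correlation weights over several runs."""
    if all(cw > 0 for cw in cws):
        return "increase"   # positive CW in every run
    if all(cw < 0 for cw in cws):
        return "decrease"   # negative CW in every run
    return "undefined"      # mixed signs (or a zero CW) across runs

# Hypothetical attributes with CWs from three independent optimization runs.
runs = {
    "attr_A": [0.52, 0.31, 0.78],
    "attr_B": [-0.44, -0.12, -0.67],
    "attr_C": [0.25, -0.18, 0.40],
}
labels = {name: classify_attribute(cws) for name, cws in runs.items()}
```

Only attributes labelled "increase" would be candidates for introduction into a lead structure under this rule.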

ADMET Study
In silico ADMET analysis was performed using AdmetSAR and OSIRIS servers to evaluate the drug-likeness and pharmacokinetic characteristics of the newly designed compounds. The designed hit compounds do not present risks in terms of tumorigenic, irritant, mutagenic or reproductive effect profiles.
Water solubility is important for drug formulation and for determining the persistence of organic compounds in the environment. The results in Table 4 show that all the newly developed compounds are soluble (water solubility is expressed in log (mol/L)). In addition, the blood-brain barrier (BBB) is the major interface between the central nervous system and the bloodstream. BBB permeability is an important property because it controls whether drugs can pass through the brain barrier and exert their effects. It is believed that a molecule with a logBB > −1 is widely distributed in the brain. The BBB permeability results in Table 4 indicate that the newly suggested compounds do not penetrate the BBB. Moreover, human intestinal absorption (HIA) is one of the most important ADME properties. A compound with an intestinal absorption value greater than 30% is considered to be highly absorbed. Consequently, all newly developed compounds can be expected to have good biological activity, drug-like features, and ADMET properties.
Table 4. Pharmacokinetic and ADME properties of the designed molecules and the lead compound evaluated using AdmetSAR and Osiris property explorer.

Data Preparation
For this study, a dataset of 36 asymmetric inhibitors of HCV NS5A was used [14,19]. The chemical structures of these derivatives were drawn and pre-optimized using the MMFF94 molecular mechanics force field of the ChemDraw package. Their geometries were then optimized with the AM1 method in the gas phase using the Gaussian 09 software [20]. We calculated vibrational spectra to confirm that the optimized structures are energy minima. The activity value of each molecule (half-maximal effective concentration, EC50) was converted to the negative logarithmic scale, pEC50 = −log(EC50), and used as the dependent variable to build the QSAR models.
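The conversion to pEC50 can be sketched as follows. The sketch assumes a base-10 logarithm and EC50 values expressed in mol/L, which is the usual convention; the source does not state the units, so the nanomolar helper is only an illustration.

```python
import math

def pec50_from_ec50(ec50_molar):
    """pEC50 = -log10(EC50), with EC50 assumed to be in mol/L."""
    if ec50_molar <= 0:
        raise ValueError("EC50 must be positive")
    return -math.log10(ec50_molar)

def pec50_from_nanomolar(ec50_nm):
    """Convenience wrapper for EC50 values reported in nM (assumed unit)."""
    return pec50_from_ec50(ec50_nm * 1e-9)
```

On this scale a tenfold drop in EC50 raises pEC50 by one unit, so more potent inhibitors have larger pEC50 values.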
Two QSAR models were created using Monte Carlo optimization and the GA-MLR technique. For the Monte Carlo method, the simplified molecular input line entry system (SMILES) was used to represent the chemical structures and to develop the QSAR models. The SMILES strings were generated with ACD/ChemSketch software (File Version C35E41, Build 125843, 14 Jan 2022, Toronto, ON, Canada) [21]. For the GA-MLR model, the molecular descriptor values (0D-3D) of the 36 compounds were computed using OCHEM [22]. To avoid multicollinear variables in the QSAR model, the total number of variables generated was reduced by excluding descriptors that possessed more than 95% constant values and descriptor pairs with a correlation coefficient greater than 0.9. A final set of 625 descriptors was selected from the initial pool of 3085 descriptors. The molecular structures and their corresponding pEC50 data are listed in Table 5 (the SMILES notation can be found in the Supplemental Materials in Table S1).
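The descriptor reduction described above can be sketched in plain Python. The thresholds (95% constant values, correlation above 0.9) come from the text; the decision to keep the first descriptor of each correlated pair, the use of the absolute Pearson correlation, and the toy descriptor names and values are assumptions for illustration only.

```python
import math
from collections import Counter

def constant_fraction(col):
    """Share of compounds that take the column's most common value."""
    return Counter(col).most_common(1)[0][1] / len(col)

def pearson(x, y):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def prefilter(descriptors, max_constant=0.95, max_corr=0.9):
    """Return descriptor names kept after both filters (dict: name -> values)."""
    kept = []
    for name, col in descriptors.items():
        if constant_fraction(col) > max_constant:
            continue  # near-constant descriptor
        if any(abs(pearson(col, descriptors[k])) > max_corr for k in kept):
            continue  # redundant with an already kept descriptor
        kept.append(name)
    return kept

# Toy example with hypothetical descriptors d1..d4.
descriptors = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [2, 4, 6, 8, 10, 12.5],   # almost collinear with d1
    "d3": [0, 0, 0, 0, 0, 0],       # constant
    "d4": [3, 1, 4, 1, 5, 9],
}
kept = prefilter(descriptors)
```

In this toy case d3 is dropped as constant and d2 as redundant with d1, leaving d1 and d4 for the subsequent genetic-algorithm selection.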

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
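The random partitioning into four sets can be sketched as below. This is an illustration rather than the procedure used by CORAL: the seeds, the rounding rule, and the assignment of the remainder to the validation set are assumptions, and with 36 compounds the resulting set sizes only approximate the stated percentages.

```python
import random

SET_NAMES = ("training", "invisible_training", "calibration", "validation")

def split_compounds(ids, fractions=(0.35, 0.35, 0.15, 0.15), seed=0):
    """Randomly partition compound ids into the four sets named in SET_NAMES."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    sizes = [round(f * n) for f in fractions[:-1]]
    sizes.append(n - sum(sizes))  # remainder goes to the validation set
    parts, start = {}, 0
    for name, size in zip(SET_NAMES, sizes):
        parts[name] = shuffled[start:start + size]
        start += size
    return parts

# Four independent splits of 36 compounds numbered 1..36 (hypothetical seeds).
splits = [split_compounds(range(1, 37), seed=s) for s in range(1, 5)]
```

Every compound lands in exactly one set per split, so the validation compounds never contribute to the correlation weights of that split.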
In Equation (3), the SMILES-based optimal descriptor DCW(T, Nepoch) combines SMILES-based attributes, each associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated, as shown in Equation (4):
pEC50 = C0 + C1 × DCW(T, Nepoch) (4)
where C0 is the intercept and C1 is the slope of the regression equation.
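The two steps above, summing correlation weights into a descriptor and regressing the endpoint on it, can be sketched as follows. The sketch is deliberately simplified: it uses only single-token SMILES attributes, a naive tokenizer, and hypothetical CW values, whereas the attributes actually used are those listed in Table 6. The default coefficients in `predict` are the split 3 values from Equation (1).

```python
def tokenize(smiles):
    """Naive SMILES tokenizer: two-letter halogens, otherwise one character."""
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens

def dcw(smiles, cw):
    """Sum of the correlation weights of all tokens; unknown tokens count as 0."""
    return sum(cw.get(t, 0.0) for t in tokenize(smiles))

def fit_line(x, y):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    c1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - c1 * mx, c1

def predict(dcw_value, c0=0.532, c1=0.103):
    """pEC50 = C0 + C1 * DCW, defaulting to the split 3 coefficients."""
    return c0 + c1 * dcw_value

# Hypothetical correlation weights for a handful of SMILES tokens.
cw = {"C": 1.0, "O": 2.0, "(": 0.5, ")": 0.5, "=": -1.0}
example_dcw = dcw("CC(=O)O", cw)
```

In the Monte Carlo procedure the CW values themselves are the quantities being optimized; here they are fixed by hand only to show how a descriptor value and a prediction are obtained from them.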

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
Equation ( SMILES DCW (T, Nepoch) combines SMILES-based attributes associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated as shown in Equation (4).
C0 is the intercept, while C1 is the slope of the regression equation.

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
Equation ( SMILES DCW (T, Nepoch) combines SMILES-based attributes associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated as shown in Equation (4).
C0 is the intercept, while C1 is the slope of the regression equation.

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
Equation ( SMILES DCW (T, Nepoch) combines SMILES-based attributes associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated as shown in Equation (4).
C0 is the intercept, while C1 is the slope of the regression equation.

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
Equation ( SMILES DCW (T, Nepoch) combines SMILES-based attributes associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated as shown in Equation (4).
C0 is the intercept, while C1 is the slope of the regression equation.

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
Equation ( SMILES DCW (T, Nepoch) combines SMILES-based attributes associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated as shown in Equation (4).
C0 is the intercept, while C1 is the slope of the regression equation.

SMILES-Based QSAR Model Construction
The Monte Carlo optimization was used to create SMILES-based QSAR models using CORAL 2019 [23]. The SMILES attributes were used in this software to predict the endpoint using optimal descriptors (i.e., correlation weights (CWs)) and the balance-of-correlation method [24]. Four splits were created from the 36 compounds. Each split was randomly divided into 4 partitions: training (35%), invisible training (35%), calibration (15%), and validation (15%). Each set has a different task in constructing the QSAR model. The training set creates the QSAR model by calculating the correlation weight. The invisible training (inv. Train) set is assigned to evaluate the fitness of the molecules that are not included in the training set. The calibration set is used to identify the onset of overfitting, while the validation set is used to test the models for the compounds that are not included in the remaining sets [25][26][27].
Equation ( SMILES DCW (T, Nepoch) combines SMILES-based attributes associated with a correlation weight (CW). A description of the optimal SMILES parameters is provided in Table 6.
The linear regression approach was used to develop QSAR models after all CWs were calculated as shown in Equation (4).
C0 is the intercept, while C1 is the slope of the regression equation.

In the Monte Carlo method, T is the threshold and $N_{\mathrm{epoch}}$ is the number of epochs. The threshold T is used as a criterion to divide the SMILES attributes into two classes: an active class, whose attributes are involved in model construction, and a rare class (noise), whose attributes are excluded from it. Rare attributes can cause overtraining, because they may produce a good correlation during training but a poor correlation during validation. $N_{\mathrm{epoch}}$ is the number of epochs that provides the best statistical quality for the calibration set [28].
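The role of the threshold T can be illustrated with a minimal Python sketch. It assumes, as is usual for CORAL-type optimal descriptors, that DCW is the sum of the correlation weights of the attributes found in a SMILES string and that attributes occurring in fewer than T training SMILES are treated as rare and contribute nothing. The one-character tokenizer, the function names, and the numerical weights are hypothetical and serve only as an illustration; they are not taken from the CORAL software.

```python
from collections import Counter


def tokenize(smiles):
    """Hypothetical tokenizer: one attribute per character.

    Real SMILES atoms such as 'Cl' or bracketed atoms need more care.
    """
    return list(smiles)


def active_attributes(training_smiles, threshold):
    """Return attributes present in at least `threshold` training SMILES."""
    counts = Counter()
    for smi in training_smiles:
        counts.update(set(tokenize(smi)))  # count each attribute once per molecule
    return {attr for attr, n in counts.items() if n >= threshold}


def dcw(smiles, weights, active):
    """Sum the correlation weights of active attributes; rare ones add zero."""
    return sum(weights.get(a, 0.0) for a in tokenize(smiles) if a in active)


def predict(smiles, weights, active, c0, c1):
    """Linear model of the form endpoint = C0 + C1 * DCW."""
    return c0 + c1 * dcw(smiles, weights, active)


training = ["CCO", "CCN", "CCF"]
active = active_attributes(training, threshold=2)    # only 'C' is frequent enough
weights = {"C": 0.5, "O": 1.0, "N": -0.3, "F": 2.0}  # made-up correlation weights
print(dcw("CCO", weights, active))
```

In this toy example the attributes O, N, and F each occur in a single training molecule, so with T = 2 they fall into the rare class and only the two carbon atoms contribute to the descriptor.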
To develop the QSAR models, two types of target function (TF) were used. TF1 uses the balance of correlation, as described in Equation (5), while TF2 adds the Index of Ideality of Correlation (IIC), as described in Equation (6) [29,30]. The IIC (Equation (7)) was proposed as a criterion for evaluating the predictive power of the developed QSAR models. Its use is intended to improve model accuracy as measured by the coefficient of determination ($R^2$) and the mean absolute error (MAE). The value of the coefficient $W_{\mathrm{IIC}}$ (the weight of the IIC) controls the strength of the influence of the IIC on the Monte Carlo optimization. The preferred value of $W_{\mathrm{IIC}}$ depends on two factors: molecular diversity and the nature of the endpoint [31][32][33].
$R_{\mathrm{training}}$ and $R_{\mathrm{inv.train}}$ are the correlation coefficients between the experimental and calculated pEC50 for the respective sets, and Const is an empirical constant that is typically fixed. $R_{\mathrm{set}}$ is the correlation coefficient between the observed and predicted endpoint values of a given set. MAE is the mean absolute error, computed from the residuals $\Delta_k = \mathrm{Observed}_k - \mathrm{Predicted}_k = \mathrm{pEC}_{50,k}^{\mathrm{obs}} - \mathrm{pEC}_{50,k}^{\mathrm{pred}}$, where $\Delta_k$ is the prediction error for the kth substance of a set.
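The two target functions and the IIC can be sketched in Python as follows. The formulas in the comments are the forms commonly given in the CORAL literature and are assumed here to correspond to Equations (5) to (7); the function names, the value of Const in the usage line, and the handling of sets without negative or non-negative residuals are assumptions made for this illustration.

```python
import numpy as np


def correlation(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(obs, pred)[0, 1])


def iic(obs, pred):
    """Assumed form: IIC = R * min(MAE-, MAE+) / max(MAE-, MAE+).

    MAE- averages |delta| over residuals with delta < 0,
    MAE+ averages |delta| over residuals with delta >= 0.
    """
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    delta = obs - pred
    neg = np.abs(delta[delta < 0])
    pos = np.abs(delta[delta >= 0])
    if neg.size == 0 or pos.size == 0:
        return 0.0  # assumption: undefined cases are scored as zero
    mae_neg, mae_pos = neg.mean(), pos.mean()
    largest = max(mae_neg, mae_pos)
    if largest == 0.0:
        return 0.0  # assumption: avoids division by zero
    return correlation(obs, pred) * min(mae_neg, mae_pos) / largest


def tf1(r_training, r_inv_train, const):
    """Assumed form: TF1 = R_trn + R_inv - |R_trn - R_inv| * Const."""
    return r_training + r_inv_train - abs(r_training - r_inv_train) * const


def tf2(r_training, r_inv_train, iic_calibration, w_iic, const):
    """Assumed form: TF2 = TF1 + IIC_calibration * W_IIC."""
    return tf1(r_training, r_inv_train, const) + iic_calibration * w_iic


# Hypothetical values, for illustration only.
print(tf2(0.9, 0.8, iic_calibration=0.5, w_iic=0.2, const=0.1))
```

The penalty term in TF1 shrinks the target function when the training and invisible training correlations diverge, which is the sense in which the correlation is balanced.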
A grid search was used to find the best values of T and $N_{\mathrm{epoch}}$ for the four splits (T from 1 to 10 and $N_{\mathrm{epoch}}$ from 1 to 30). The number of optimization probes was set to 3.
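A grid search of this kind can be sketched in Python as shown below. The scoring function is a hypothetical stand-in for one Monte Carlo optimization run that returns a calibration-set statistic; averaging the score over the probes and selecting the maximum are assumptions of this sketch, not a description of how CORAL aggregates probes.

```python
from itertools import product


def grid_search(score, thresholds=range(1, 11), epochs=range(1, 31), probes=3):
    """Return (mean score, T, N_epoch) for the best pair on the grid."""
    best = None
    for t, n_epoch in product(thresholds, epochs):
        # Assumption: each probe is an independent optimization run.
        mean_score = sum(score(t, n_epoch, p) for p in range(probes)) / probes
        if best is None or mean_score > best[0]:
            best = (mean_score, t, n_epoch)
    return best


def toy_score(t, n_epoch, probe):
    """Placeholder for a real run; peaks at T = 3 and N_epoch = 12."""
    return -((t - 3) ** 2) - ((n_epoch - 12) ** 2) / 10.0


best_score, best_t, best_n_epoch = grid_search(toy_score)
print(best_t, best_n_epoch)
```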

GA-MLR QSAR Construction
The first step in QSAR analysis is to choose the most relevant descriptors from the entire pool of computed descriptors. For this purpose, the stepwise linear regression method was applied, and the value of the leave-one-out cross-validation coefficient was used as the fitness function. A total of 3085 different molecular descriptors were calculated using the OCHEM server [22]. The calculated descriptors were first examined to remove the constant and near-constant variables and so decrease the redundancy in the descriptor matrix. The correlation between the calculated descriptors and the inhibitory activity was then examined to exclude the collinear descriptors. In this way, 625 molecular descriptors were filtered out from the original set of variables. Then, the stepwise-MLR method was used to select the most relevant descriptors, and three molecular descriptors were finally selected from the whole set. Based on the selected molecular descriptors, the MLR method used the ordinary least squares (OLS) algorithm to establish a linear relationship between the pEC50 endpoints of the NS5A inhibitors and their molecular descriptors. QSARINS software was used to create the GA-MLR model [34,35]. The data set was randomly split into training (26 molecules) and test (10 molecules) sets, corresponding to approximately 70% and 30% of the data, respectively. The default parameters were used to build the GA-MLR models, except for: subsets = 1 to 5, maximum generation = 10,000, and mutation probability = 0.05.
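The descriptor pre-filtering, the OLS fit, and the leave-one-out fitness value can be sketched with NumPy as follows. This is not the QSARINS implementation and it does not include the genetic algorithm itself; the cutoffs, the use of pairwise inter-descriptor correlation as the collinearity criterion, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def drop_near_constant(X, min_std=1e-6):
    """Remove descriptor columns whose standard deviation is (almost) zero."""
    keep = X.std(axis=0) > min_std
    return X[:, keep], np.flatnonzero(keep)


def drop_collinear(X, max_abs_r=0.95):
    """Greedy filter: keep a column only if |r| with every kept column is below the cutoff."""
    kept = []
    for j in range(X.shape[1]):
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < max_abs_r for k in kept):
            kept.append(j)
    return X[:, kept], np.array(kept)


def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def ols_predict(coef, X):
    return coef[0] + X @ coef[1:]


def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        coef = ols_fit(X[mask], y[mask])
        press += (y[i] - ols_predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Synthetic example: 26 "molecules", 3 informative descriptors,
# plus one constant column and one exact multiple of the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(26, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.05, size=26)
X_raw = np.column_stack([X, np.ones(26), 2.0 * X[:, 0]])

X_var, kept_var = drop_near_constant(X_raw)
X_sel, kept_col = drop_collinear(X_var)
print(X_sel.shape, round(q2_loo(X_sel, y), 3))
```

In a GA-MLR workflow a value such as the leave-one-out Q2 would serve as the fitness of each candidate descriptor subset; here it is simply evaluated once for the filtered matrix.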

QSAR Models Validation
The validation process is essential in QSAR to test the model's suitability for making reliable predictions of the modeled activity for new compounds whose activity is unknown. This process is considered one of the crucial steps for checking the robustness, predictability, and reliability of any QSAR model. Four steps are usually used to validate the constructed model: (a) internal validation or cross-validation using the training set, (b) Y-randomization, (c) independent validation using the test set, and (d) applicability domain (AD) evaluation [36].

Validation of GA-MLR QSAR Model
In GA-MLR, the validity of the generated QSAR model was confirmed by internal validation using the leave-many-out (LMO) and leave-one-out (LOO) procedures, Y-randomization, independent validation, and finally by checking the model AD. Moreover, the model was checked against the thresholds for the statistical metrics proposed in the literature [37]: the determination coefficient $R^2_{tr} \geq 0.6$; the cross-validated $Q^2_{loo} \geq 0.5$; the determination coefficient obtained for the test set $R^2_{ext} \geq 0.6$; the root-mean-square error $RMSE_{tr} < RMSE_{cv}$; the concordance correlation coefficient (CCC) $\geq 0.80$; $Q^2_{Fn} \geq 0.6$; the Y-scramble correlation coefficient $R^2_{Yscr} < 0.2$; the Y-scramble cross-validation coefficient $Q^2_{Yscr} < 0.2$, with $Q^2_{Yscr} < R^2_{Yscr}$; and the root-mean-square error of Y-randomization ($RMSE_{AV\,Yscr}$) and the mean absolute error (MAE) should be close to zero.
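Two of these checks, the concordance correlation coefficient and Y-randomization, can be sketched with NumPy as follows. The CCC is written in its usual form (Lin's coefficient with population variances); the number of scrambling iterations, the seeds, and the synthetic data are assumptions of this sketch and do not reproduce the QSARINS settings.

```python
import numpy as np


def r2(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


def ccc(obs, pred):
    """Concordance correlation coefficient between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    cov = np.mean((obs - obs.mean()) * (pred - pred.mean()))
    return 2.0 * cov / (obs.var() + pred.var() + (obs.mean() - pred.mean()) ** 2)


def fit_r2(X, y):
    """R2 of an OLS model with intercept, evaluated on the data it was fitted to."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return r2(y, A @ coef)


def y_scramble_r2(X, y, n_iter=100, seed=0):
    """Mean R2 of models refitted after randomly permuting the response."""
    rng = np.random.default_rng(seed)
    return float(np.mean([fit_r2(X, rng.permutation(y)) for _ in range(n_iter)]))


# Synthetic data with a real structure-response relationship.
rng = np.random.default_rng(1)
X = rng.normal(size=(26, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.05, size=26)

print(round(fit_r2(X, y), 3), round(y_scramble_r2(X, y), 3))
```

A model that captures a genuine relationship should show a high R2 on the original data and a low average R2 once the response values are scrambled.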

Validation of CORAL QSAR Model
In Monte Carlo optimization, additional parameters were used to verify the quality of the predictions of the QSAR models. ${}^{c}R^2_p$ measures the deviation of the mean determination coefficient of the randomized models ($R^2_r$) from the determination coefficient of the non-randomized model ($R^2$). ${}^{c}R^2_p$ should be greater than 0.5 for an acceptable QSAR model.
$R^2_m$ is a metric proposed by Roy et al. [38,39] to indicate the external predictability of QSAR models; the average $R^2_m$ ($\mathrm{Avg}R^2_m$) should be greater than 0.5, and $\Delta R^2_m$ should be less than 0.2, where $\Delta R^2_m = R^2_m(x,y) - R^2_m(y,x)$, x is the experimental value, and y is the predicted value of the endpoint.
Any QSAR model that does not meet the above criteria is eliminated. The formulas for calculating these statistical parameters are listed in Supplementary Material Table S2.
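These acceptance criteria can be sketched with NumPy as follows. The formulas in the comments are the forms usually attributed to Roy and co-workers and are assumed here, since the exact expressions used in this work are those of Table S2; the axis convention for the regression through the origin and the use of the absolute difference for the delta are likewise assumptions.

```python
import numpy as np


def c_r2p(r2_model, r2_random_mean):
    """Assumed form: cR2p = R * sqrt(R2 - R2r), written as sqrt(R2 * (R2 - R2r))."""
    return float(np.sqrt(r2_model * max(r2_model - r2_random_mean, 0.0)))


def _r2_through_origin(a, b):
    """Determination coefficient for the regression of a on b forced through the origin."""
    k = np.sum(a * b) / np.sum(b * b)
    return 1.0 - np.sum((a - k * b) ** 2) / np.sum((a - a.mean()) ** 2)


def rm2(a, b):
    """Assumed form: r2m = r2 * (1 - sqrt(|r2 - r02|))."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    r2 = np.corrcoef(a, b)[0, 1] ** 2
    r02 = _r2_through_origin(a, b)
    return float(r2 * (1.0 - np.sqrt(abs(r2 - r02))))


def rm2_summary(obs, pred):
    """Return (average r2m, delta r2m) over the two axis assignments."""
    first, second = rm2(obs, pred), rm2(pred, obs)
    return (first + second) / 2.0, abs(first - second)


def acceptable(c_r2p_value, avg_rm2, delta_rm2):
    """Apply the thresholds quoted in the text."""
    return c_r2p_value > 0.5 and avg_rm2 > 0.5 and delta_rm2 < 0.2


# Hypothetical observed and predicted endpoint values.
obs = np.array([5.1, 6.0, 6.8, 7.5, 8.2, 9.0])
pred = np.array([5.3, 5.9, 6.6, 7.7, 8.1, 8.8])
avg, delta = rm2_summary(obs, pred)
print(acceptable(c_r2p(0.81, 0.17), avg, delta))
```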

Applicability Domain
The definition of an applicability domain (AD) is one of the requirements set out in the Organization for Economic Cooperation and Development (OECD) guidelines. The AD allows the evaluation of the uncertainty in the prediction for a given molecule based on its similarity to the compounds used to develop the model. Compounds outside the AD are considered outliers. In CORAL QSAR models, the AD is determined by the calculated statistical defects d(A) of the SMILES attributes, based on the distribution of the available data among all sets (Equation (11)). The d(A) of a SMILES attribute is defined from the difference between the probability of the attribute in the training set and that in the calibration set. Outliers are SMILES whose statistical defect is more than twice the average defect over the training set compounds.
A molecule is considered an outlier when
$$D > 2 \times \bar{D} \quad (13)$$
where $\bar{D}$ is the average of the calculated D over the training, invisible training, and calibration sets [40].
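The defect-based outlier rule can be sketched in Python as follows. The attribute defect is written in the form commonly reported for CORAL, d(A) = |P_train(A) − P_calib(A)| / (N_train(A) + N_calib(A)), and the defect D of a SMILES is taken as the sum of the defects of its attributes; both forms, the value assigned to attributes absent from both sets, and the one-character attributes are assumptions of this sketch.

```python
from collections import Counter


def attribute_counts(smiles_list):
    """Number of SMILES in which each attribute occurs (one character = one attribute)."""
    counts = Counter()
    for smi in smiles_list:
        counts.update(set(smi))
    return counts


def make_defect(train, calib):
    """Build d(A) from the training and calibration sets."""
    n_train, n_calib = attribute_counts(train), attribute_counts(calib)

    def defect(attr):
        total = n_train[attr] + n_calib[attr]
        if total == 0:
            return 1.0  # assumption: unseen attributes get the maximal defect
        p_train = n_train[attr] / len(train)
        p_calib = n_calib[attr] / len(calib)
        return abs(p_train - p_calib) / total

    return defect


def smiles_defect(smiles, defect):
    """D for one SMILES: sum of the defects of its attributes."""
    return sum(defect(a) for a in smiles)


def outliers(candidates, reference, defect):
    """Flag candidates whose D exceeds twice the mean D of the reference compounds."""
    mean_d = sum(smiles_defect(s, defect) for s in reference) / len(reference)
    return [s for s in candidates if smiles_defect(s, defect) > 2.0 * mean_d]


train = ["CCO", "CCN"]
calib = ["CCO", "CCC"]
d = make_defect(train, calib)
print(outliers(["CCO", "CCF"], train + calib, d))
```

In this toy case fluorine never occurs in the training or calibration SMILES, so the fluorinated candidate receives a large defect and is flagged as lying outside the domain.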
In the GA-MLR model, the Williams plot of standardized residuals versus leverage was used to visualize the model AD. Reliable model predictions have standardized residuals within ±3 standard deviations and leverage values lower than the warning leverage h* of 0.48. Outliers are compounds that fall outside the horizontal reference lines on the plot. In contrast, influential chemicals are compounds that have h > h* [41].
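The quantities plotted in a Williams plot can be sketched with NumPy as follows. The leverage is taken as the diagonal of the hat matrix built from the training descriptors plus an intercept column, and the warning leverage follows the common convention h* = 3(p + 1)/n; both conventions, the function names, and the synthetic data are assumptions of this sketch.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h_i = x_i (A^T A)^-1 x_i^T, with A the training matrix plus intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    inv = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, inv, Q)


def warning_leverage(n_train, n_descriptors):
    """Common convention: h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train


def flag_compounds(h, std_residuals, h_star, limit=3.0):
    """Return boolean masks (influential, outlier) for each compound."""
    influential = h > h_star
    outlier = np.abs(std_residuals) > limit
    return influential, outlier


rng = np.random.default_rng(2)
X_train = rng.normal(size=(26, 3))   # synthetic descriptor matrix
h_train = leverages(X_train, X_train)
h_star = warning_leverage(len(X_train), X_train.shape[1])
print(round(float(h_train.sum()), 6), round(h_star, 3))
```

A useful sanity check is that the training leverages sum to p + 1, the number of fitted parameters.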

ADMET Study
ADMET assessment is critical in the early phase of drug discovery. A high-quality therapeutic agent is expected to have excellent efficacy against the target receptor and excellent ADMET properties at a therapeutic dose. Therefore, it is necessary to evaluate the pharmacokinetic profile of the hit compounds to prevent subsequent drug failure [42]. Drug-likeness properties explain how a compound is distributed inside an organism and thus influence its pharmacological efficacy [43]. The ADMET properties of the designed compounds were predicted using admetSAR and OSIRIS Property Explorer [44,45].

Conclusions
Hepatitis C virus is a worldwide health problem that causes several life-threatening chronic liver diseases. Currently, there is no effective vaccine against hepatitis C, and treatment is still quite difficult. Computational methods have repeatedly proven useful in addressing the unique challenges of antiviral drug discovery. In this study, two QSAR models were developed, using the GA-MLR and Monte Carlo optimization techniques, to determine the quantitative relationship between anti-NS5A HCV biological activity and the molecular structure of a series of NS5A inhibitors. The results of the two models were in accordance with OECD guidelines. The SMILES-based model was used to evaluate the effects of the presence or absence of different molecular fragments on the biological activity studied. These results provided insights that guided the design of eight novel NS5A inhibitors. The GA-MLR model confirmed the predicted inhibitory activities of the eight compounds. The ADMET study indicated that the designed molecules have favorable properties, supporting them as promising NS5A inhibitor candidates.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules27092729/s1. Table S1: SMILES notation for the 36 compounds and their experimental activity data;