In Silico Modeling and Structural Analysis of Soluble Epoxide Hydrolase Inhibitors for Enhanced Therapeutic Design

Human soluble epoxide hydrolase (sEH), a dual-functioning homodimeric enzyme with hydrolase and phosphatase activities, is known for its pivotal role in the hydrolysis of epoxyeicosatrienoic acids. Inhibitors targeting sEH have shown promising potential in the treatment of various life-threatening diseases. In this study, we employed a range of in silico modeling approaches to investigate a diverse dataset of structurally distinct sEH inhibitors. Our primary aim was to develop predictive and validated models while gaining insights into the structural requirements necessary for achieving higher inhibitory potential. To accomplish this, we initially calculated molecular descriptors using nine different descriptor-calculating tools, coupled with stochastic and non-stochastic feature selection strategies, to identify the most statistically significant linear 2D-QSAR model. The resulting model highlighted the critical roles played by topological characteristics, 2D pharmacophore features, and specific physicochemical properties in enhancing inhibitory potential. In addition to conventional 2D-QSAR modeling, we implemented the Transformer-CNN methodology to develop QSAR models, enabling us to obtain structural interpretations based on the Layer-wise Relevance Propagation (LRP) algorithm. Moreover, a comprehensive 3D-QSAR analysis provided additional insights into the structural requirements of these compounds as potent sEH inhibitors. To validate the findings from the QSAR modeling studies, we performed molecular dynamics (MD) simulations using selected compounds from the dataset. The simulation results offered crucial insights into receptor–ligand interactions, supporting the predictions obtained from the QSAR models. Collectively, our work serves as an essential guideline for the rational design of novel sEH inhibitors with enhanced therapeutic potential. Importantly, all the in silico studies were performed using open-access tools to ensure reproducibility and accessibility.


Introduction
Epoxide hydrolases are a family of widely distributed enzymes responsible for the rapid hydrolysis of epoxides into corresponding vicinal diols.The soluble epoxide hydrolase (sEH) is found in all lower and upper vertebrates, but only the mammalian sEH is associated with phosphatase activity [1,2].Human soluble epoxide hydrolase (sEH) is a dual-functioning homodimeric enzyme and a member of the epoxide hydrolase family.It is involved in the hydrolysis of epoxyeicosatrienoic acids (EETs) [3,4].Human sEHs are present in both cytosol and peroxisomes and exhibit hydrolase and phosphatase activities.In the presence of this enzyme, the biological effects of EETs are diminished.EETs are involved in various biological processes, including vasodilation of coronary arterioles, vascular smooth muscle relaxation, renal excretion of sodium, reduction of the expression of cytokine-induced endothelial cell adhesion molecules, and lipid and carbohydrate metabolism, as well as insulin resistance [5][6][7][8].Additionally, EETs may contribute to neovascularization by promoting angiogenesis [8].Consequently, sEH is responsible for degrading EETs into inactive products, thereby diminishing several protective mechanisms elicited by EETs.Inhibitors of sEH may then have implications in the treatment of various diseases such as diabetes, fibrosis, chronic pain, cardiovascular diseases, and neurodegenerative diseases [9].These inhibitors are also claimed to be useful in the treatment of disorders related to smooth muscles, such as erectile dysfunction, hyperactive bladder, uterine contractions, irritable bowel syndrome (IBS), rheumatoid arthritis, and nephropathy [5,10].Yet, the role of sEH mentioned above is primarily regulated through catalysis that occurs at the C-terminal hydrolase domain of the enzyme.The role of the N-terminal phosphatase domain has been comparatively less investigated, but there is strong evidence that the phosphatase activity of this domain is capable of hydrolyzing diverse lipid phosphates, including farnesyl pyrophosphate, sphingosine-1-phosphate, and lysophosphatidic acid.Recent studies have reported that inhibiting the phosphatase activity of sEH may prevent obesity and cardiac ischemic injury [11][12][13].Several compounds inhibiting sEH functionalities have been reported, and some of these (e.g., SMTP-7, an investigational thrombolytic drug for the treatment of ischemic stroke; and ebselen, an anti-inflammatory, antioxidant, and cytoprotective drug) may simultaneously block both hydrolase and phosphatase activities [14].Recently, it was discovered that sEH inhibition leads to a reduction in hepatic fat accumulation and inflammation, also suggesting a promising role in the treatment of Nonalcoholic Steatohepatitis (NASH) [15,16].
Researchers from the Goethe University, Germany, have been involved in the design and development of sEH inhibitors with a range of structural scaffolds [4,10,14].The objective of this work is to perform a ligand-based in silico study utilizing receptor-independent Quantitative Structure-Activity Relationship (QSAR) modeling.The aim is to gain an understanding of the structural requirements of 184 compounds that have been reported by the researchers involved in such a study.QSAR, which is one of the oldest but most reliable in silico techniques, provides a viable option to minimize experimental work and screen novel molecules during drug design and development [17,18].Whole molecular descriptorbased QSAR is particularly helpful in estimating the structural requirements for a diverse set of ligands with multiple mechanisms of action [19][20][21].In recent years, QSAR methodologies have advanced with the discovery of various novel descriptors, and model-building strategies have also improved with the progress in feature selection methodologies and machine learning techniques coupled with computational efficiency [21,22].As mentioned earlier, compounds inhibiting sEH may have multiple binding sites (hydrolase catalytic site and phosphatase catalytic site).Therefore, in this study, we primarily relied on QSAR regression methods to determine if validated predictive models can be generated with a dataset containing diverse sEH inhibitors that likely possess multiple binding mechanisms.

2D-QSAR Model
Following the strategy mentioned in Materials and Methods, we systematically sought the best linear 2D-QSAR MLR models.As mentioned, this involved using nine types of descriptors and two feature selection strategies, resulting in a total of 81 models.The outcomes of the 81 models are summarized in Table S3 of the Supplementary Materials.It was observed that each set of descriptors was capable of producing at least one model with acceptable Q 2 LOO (>0.65) and R 2 Pred (>0.50) values [23,24].However, the primary objective was to identify the most predictive model in terms of statistical quality.It was found that the AlvaDesc descriptors along with the GA feature selection approach yielded the most successful model (Q 2 LOO = 0.784 and R 2 Pred = 0.792).The resulting model (an eight-variable equation) is given below together with the statistical parameters of the regression.Pred /Q 2 (F1) = 0.792, Q 2 (F2) = 0.769, Q 2  (F3) = 0.763, RMSEP = 0.558, r m 2 test = 0.685, ∆r m 2 test = 0.167 The observed vs. predicted activity plot of the 2D-QSAR model is shown in Figure 1.However, as noticed by Gramatica et al. [25], it is also important to consider the difference between R 2 and Q 2  LOO in assessing internal predictivity.In this model, the R 2 − Q 2  LOO difference was found to be small (0.027), indicating good internal predictivity.Furthermore, the model achieved satisfactory values for the metrics r m 2 LOO (=0.701) and ∆r m 2 LOO (= 0.147), which are considered more stringent parameters than Q 2 LOO .For these parameters, acceptable values are greater than 0.50 and less than 0.20, respectively.Additionally, the low value obtained for MAE indicates that the model achieves sufficient internal predictivity.
Subsequently, to assess external predictivity, a test set of 37 compounds was used.The model demonstrated a relatively high value of 0.792 for predicting the biological activity of the test set compounds.The Q 2 F2 and rm 2 test values further confirmed the model's satisfactory performance on the test set.The low values obtained for RMSEP and ∆rm 2 test also support the model's external predictivity.
The maximum intercorrelation (R 2 ) between any two descriptors in the model was found to be 0.557, indicating that the descriptors used in the model are independent.The intercorrelation matrix can be found in Table S4 of the Supplementary Materials.The variance inflation factor (VIF) was calculated for each descriptor, and none of the values exceeded 5.0, indicating the absence of multi-collinearity in the model.Moreover, the model exhibited acceptable Kxx and ΔK values, further supporting its robustness.The Y-randomization test with 1000 runs yielded a cRp 2 value of 0.784, indicating that the model is not a result of chance but rather a unique and meaningful model.
The Williams plot of this 2D-QSAR model is presented in Figure 1.Only one training set compound was found to be a structural outlier and two compounds appeared as response outliers.Nonetheless, given the good predictivity of the structural outlier, we decided to retain it in the model.In Table 1, the eight descriptors of this 2D-QSAR model are listed along with their meaning, and their relative significance, determined by the standardized coefficients, is The maximum intercorrelation (R 2 ) between any two descriptors in the model was found to be 0.557, indicating that the descriptors used in the model are independent.The intercorrelation matrix can be found in Table S4 of the Supplementary Materials.The variance inflation factor (VIF) was calculated for each descriptor, and none of the values exceeded 5.0, indicating the absence of multi-collinearity in the model.Moreover, the model exhibited acceptable K xx and ∆K values, further supporting its robustness.The Y-randomization test with 1000 runs yielded a cR p 2 value of 0.784, indicating that the model is not a result of chance but rather a unique and meaningful model.
The Williams plot of this 2D-QSAR model is presented in Figure 1.Only one training set compound was found to be a structural outlier and two compounds appeared as response outliers.Nonetheless, given the good predictivity of the structural outlier, we decided to retain it in the model.
In Table 1, the eight descriptors of this 2D-QSAR model are listed along with their meaning, and their relative significance, determined by the standardized coefficients, is depicted in Figure 2. As can be noticed, save for RDF140v, all these descriptors belong to the category of 2D descriptors [26].For example, ATS6m, J_Dz(p) and SM14_AEA(dm) are 2D topological descriptors, which provide information about the structural characteristics and connectivity patterns within the compounds.depicted in Figure 2. As can be noticed, save for RDF140v, all these descriptors belong to the category of 2D descriptors [26].For example, ATS6m, J_Dz(p) and SM14_AEA(dm) are 2D topological descriptors, which provide information about the structural characteristics and connectivity patterns within the compounds.Among the descriptors, ATS6m was identified as the most significant descriptor in the model, showing a positive correlation with pIC50(M).ATS6m is a 2D autocorrelation descriptor that encodes the distribution of atomic mass within a molecule, considering atom pair distances up to a 2D topological distance of 6 [27].The analysis of the ATS6m descriptor values revealed that compounds with higher molecular weights and higher values of ATS6m tend to have a higher affinity towards the sEH enzyme.This indicates that both molecular weight and specific 2D topology, as encoded by the ATS6m descriptor, Among the descriptors, ATS6m was identified as the most significant descriptor in the model, showing a positive correlation with pIC 50 (M).ATS6m is a 2D autocorrelation descriptor that encodes the distribution of atomic mass within a molecule, considering atom pair distances up to a 2D topological distance of 6 [27].The analysis of the ATS6m descriptor values revealed that compounds with higher molecular weights and higher values of ATS6m tend to have a higher affinity towards the sEH enzyme.This indicates that both molecular weight and specific 2D topology, as encoded by the ATS6m descriptor, play important roles in determining the activity of the compounds as potent inhibitors of the enzyme.
The next most significant descriptor in the model is J_Dz(p), which is a 2D matrixbased descriptor representing the Balaban-like index from the Barysz matrix weighted by polarizability [26].Similarly, a higher value of J_Dz(p) was found to be associated with higher biological activity.This suggests that, apart from molecular mass and topology, the polarizability of compounds may also play a crucial role in influencing their inhibitory activity against sEH.
The third, fourth, and fifth most significant descriptors of the model belong to the category of CATS2D or 2D pharmacophore descriptors, specifically CATS2D_07_AA, CATS2D_03_NL, and CATS2D_05_AA.Chemically advanced template search (CATS) descriptors are particularly useful in elucidating the structural requirements for higher activity.These descriptors encode the topological distances between specific pharmacophore features within the molecules [28].For example, the descriptor CATS2D_07_AA indicates the presence of hydrogen bond acceptors (A) at a topological distance of 7. In the context of the 2D-QSAR model, this descriptor was found to negatively impact the endpoint response.This suggests that compounds with fewer hydrogen bond acceptors located at such a topological distance have higher biological activity.The following most significant descriptor of the model is F09[N-O].This is a simple 2D atom-pair descriptor that specifically captures the frequency of nitrogen (N) and oxygen (O) atoms located at a topological distance of 9 within the compounds.It is interesting to note that, similar to CATS2D_07_AA, the F09[N-O] descriptor also exhibits a negative correlation with the endpoint response.This means that compounds with higher values of this descriptor tend to be less active, while most of the highly active compounds tend to have lower values of this descriptor.Indeed, as illustrated in Figures 3 and 4, the observed correlation between higher descriptor values and lower activity for some compounds reinforces the importance of these descriptors in capturing the relevant structural features influencing the biological activity in the context of the 2D-QSAR model.
Molecules 2023, 28, x FOR PEER REVIEW 5 of 23 The third, fourth, and fifth most significant descriptors of the model belong to the category of CATS2D or 2D pharmacophore descriptors, specifically CATS2D_07_AA, CATS2D_03_NL, and CATS2D_05_AA.Chemically advanced template search (CATS) descriptors are particularly useful in elucidating the structural requirements for higher activity.These descriptors encode the topological distances between specific pharmacophore features within the molecules [28].For example, the descriptor CATS2D_07_AA indicates the presence of hydrogen bond acceptors (A) at a topological distance of 7. In the context of the 2D-QSAR model, this descriptor was found to negatively impact the endpoint response.This suggests that compounds with fewer hydrogen bond acceptors located at such a topological distance have higher biological activity.The following most significant descriptor of the model is F09[N-O].This is a simple 2D atom-pair descriptor that specifically captures the frequency of nitrogen (N) and oxygen (O) atoms located at a topological distance of 9 within the compounds.It is interesting to note that, similar to CATS2D_07_AA, the F09[N-O] descriptor also exhibits a negative correlation with the endpoint response.This means that compounds with higher values of this descriptor tend to be less active, while most of the highly active compounds tend to have lower values of this descriptor.Indeed, as illustrated in Figures 3 and 4, the observed correlation between higher descriptor values and lower activity for some compounds reinforces the importance of these descriptors in capturing the relevant structural features influencing the biological activity in the context of the 2D-QSAR model.The last two descriptors, i.e., SM14_AEA(dm) and RDF140v, further contribute to the understanding of the compounds' biological activity against sEH.SM14_AEA(dm) is a 2D graph-based descriptor weighted by the dipole moment (dm).The positive correlation between SM14_AEA(dm) and the biological activity suggests that compounds with higher dipole moments tend to exhibit higher inhibitory activity against sEH.This indicates that electrostatic interactions, mediated by the dipole moment, play a significant role in the binding of compounds to the target enzyme.On the other hand, RDF140v is a 3D descriptor (RDF) weighted by van der Waals volume (v).It captures the steric effects and the spatial distribution of atoms in the molecule.The negative correlation between RDF140v and the biological activity indicates that steric interactions, mainly governed by van der Waals volume, influence the binding and activity of compounds against sEH.The last two descriptors, i.e., SM14_AEA(dm) and RDF140v, further contribute to the understanding of the compounds' biological activity against sEH.SM14_AEA(dm) is a 2D graph-based descriptor weighted by the dipole moment (dm).The positive correlation between SM14_AEA(dm) and the biological activity suggests that compounds with higher dipole moments tend to exhibit higher inhibitory activity against sEH.This indicates that electrostatic interactions, mediated by the dipole moment, play a significant role in the binding of compounds to the target enzyme.On the other hand, RDF140v is a 3D descriptor (RDF) weighted by van der Waals volume (v).It captures the steric effects and the spatial distribution of atoms in the molecule.The negative correlation between RDF140v and the biological activity indicates that steric interactions, mainly governed by van der Waals volume, influence the binding and activity of compounds against sEH.
Interestingly, both J_Dz(p) and SM14_AEA(dm) exhibit a positive correlation with the biological activity, whereas, contrary to ATS6m, RDF140v shows a negative correlation.The contrasting correlations of these four descriptors (ATS6m, J_Dz(p), SM14_AEA(dm), and RDF140v) indicate the complex interplay of molecular topology, electrostatic interactions, and steric effects in shaping the biological activity of the compounds.Understanding these relationships can help in the design of compounds with optimized structural features to enhance their inhibitory activity against the target enzyme sEH.
In order to check whether non-linear models may be developed with better statistical predictivity, we attempted to develop some non-linear models using three distinct machine learning techniques, namely, MLP, RF, and SVM.A concise overview of the statistical outcomes derived from these models is shown in Table 2. Interestingly, both J_Dz(p) and SM14_AEA(dm) exhibit a positive correlation with the biological activity, whereas, contrary to ATS6m, RDF140v shows a negative correlation.The contrasting correlations of these four descriptors (ATS6m, J_Dz(p), SM14_AEA(dm), and RDF140v) indicate the complex interplay of molecular topology, electrostatic interactions, and steric effects in shaping the biological activity of the compounds.Understanding these relationships can help in the design of compounds with optimized structural features to enhance their inhibitory activity against the target enzyme sEH.
In order to check whether non-linear models may be developed with better statistical predictivity, we attempted to develop some non-linear models using three distinct machine learning techniques, namely, MLP, RF, and SVM.A concise overview of the statistical outcomes derived from these models is shown in Table 2. Clearly, none of the non-linear models managed to outperform the previously discussed linear 2D-QSAR model.Moreover, the non-linear models with the highest statistical significance were established using descriptors from the most predictive linear model (Equation ( 2)), through the utilization of MLP and SVM techniques.Remarkably, this SVM model was developed using a linear kernel.In contrast, descriptors chosen via differential Shannon entropy (dSe) proved insufficient to yield any models exhibiting statistical significance surpassing either the non-linear models or the proposed linear model.

Transformer-CNN-Based QSAR Model
The Transformer-CNN-based model yielded promising results in terms of its predictive performance and interpretability.The model attained a 5-fold cross-validated Q 2 value of 0.713, coupled with an RMSE (CV) of 0.628, which underscore its ability to precisely forecast compound activity.This assertion is reinforced when the model is evaluated with a separate test set comprising 37 data points, yielding an R 2 Pred of 0.731.This outcome thus further confirms its predictive power.This model was produced with 200 epochs and a batch size of 4. It is notable that increasing the batch size to 16, 32, or 64 compromised the predictivity of the model.Similarly, reducing the number of epochs to 100 or increasing it to 300 also resulted in reduced predictivity.
In addition to predictive performance, the focus was also on the interpretability of the Transformer-CNN model.The LRP (Layer-wise Relevance Propagation) algorithm implemented in the Transformer-CNN repository was employed to obtain structural interpretations from the model.The interpretations for selected highly active and less active compounds from the dataset are depicted in Figure 5.The insights extracted from the Transformer-CNN model offer valuable understandings into the structural characteristics influencing the activity against sEH.These interpretations align with the findings from the conventional 2D-QSAR model, which identified molecular mass and polarizability as important factors in governing higher activity.For example, compounds D1_01 and D2_37 are structurally similar, but the presence of a chlorine atom and sulfonamide (which contain heavy atoms and polar atoms) make a major difference to their activities.This observation is also consistent with the predictions of the conventional 2D-QSAR model, which highlighted the unfavorable effect of negative ionizable carboxylate for higher biological properties (see Figure 3).In the comparison of compounds D5_27 and D5_32, despite their structural similarities, the contributions of oxazole atoms varied considerably.Similarly, the contributions of the methylbenzene scaffold also varied to a considerable extent in D1_01 and D2_37.This reinforces the notion that it is the overall topology of the compounds that shapes their activity.The examples of compounds D4_02 and D4_06 also demonstrate the impact of specific structural The insights extracted from the Transformer-CNN model offer valuable understandings into the structural characteristics influencing the activity against sEH.These interpretations align with the findings from the conventional 2D-QSAR model, which identified molecular mass and polarizability as important factors in governing higher activity.For example, compounds D1_01 and D2_37 are structurally similar, but the presence of a chlorine atom and sulfonamide (which contain heavy atoms and polar atoms) make a major difference to their activities.This observation is also consistent with the predictions of the conventional 2D-QSAR model, which highlighted the unfavorable effect of negative ionizable carboxylate for higher biological properties (see Figure 3).In the comparison of compounds D5_27 and D5_32, despite their structural similarities, the contributions of oxazole atoms varied considerably.Similarly, the contributions of the methylbenzene scaffold also varied to a considerable extent in D1_01 and D2_37.This reinforces the notion that it is the overall topology of the compounds that shapes their activity.The examples of compounds D4_02 and D4_06 also demonstrate the impact of specific structural features on activity.The contributions of the carboxamide group varied considerably between these compounds, indicating that this feature plays a significant role in their differential activities.The sulfonamide residues were generally found to be partially favorable, while only the oxygen atoms of the carboxamide contributed positively to higher biological properties.To corroborate these interpretations, we also contrasted the results of MD simulations for compounds D4_02 and D2_37 with the insights derived from the Transformer-CNN model.This comparison likely lends additional support to the relationship between identified structural features and their impact on the compounds' activity against sEH.
The color codes depicted in Figure 5 hold varying significance based on the LRP algorithm.For compounds D1_01 and D2_37, which are structurally akin, the relevance of each atom concerning favorable and unfavorable activity is shown in the Supplementary Materials.It is evident that D2_37 exhibits maximum negative influence primarily from its fluorine atom and the oxygen atom of the carboxylate.On the contrary, the chlorine atom in D1_01 contributes significantly and positively to its higher potency.Notably, the negative influence of the carboxamide fragment is markedly more pronounced in D1_01 than in D2_37.

3D-QSAR Analysis
The current 2D-QSAR model illustrates the significance of steric and electrostatic interactions, as well as specific fragments and pharmacophores, in determining the activity against sEH.To gain a better understanding of the structural requirements, we resorted to 3D-QSAR modeling and analysis using the Open3DQSAR software.Similar to the 2D-QSAR modeling approach, the dataset was randomly divided into a training set and a test set.Atom-based rigid body alignment was performed to align the structures, which were then used to calculate steric and electrostatic fields.Two different feature selection techniques, FFD-SEL and UVE-PLS, were employed for PLS model development.Both techniques yielded the most predictive models with three components.Figure 6 presents the aligned structures and contour maps, while Table 3 showcases the statistical results of the models.

3D-QSAR Analysis
The current 2D-QSAR model illustrates the significance of steric and electrostatic interactions, as well as specific fragments and pharmacophores, in determining the activity against sEH.To gain a better understanding of the structural requirements, we resorted to 3D-QSAR modeling and analysis using the Open3DQSAR software.Similar to the 2D-QSAR modeling approach, the dataset was randomly divided into a training set and a test set.Atom-based rigid body alignment was performed to align the structures, which were then used to calculate steric and electrostatic fields.Two different feature selection techniques, FFD-SEL and UVE-PLS, were employed for PLS model development.Both techniques yielded the most predictive models with three components.Figure 6 presents the aligned structures and contour maps, while Table 3 showcases the statistical results of the models.The UVE-PLS technique yielded superior statistical results in the 3D-QSAR analysis conducted with a training set of 148 compounds and a test set of 36 compounds.The model achieved satisfactory Q 2 LOO (=0.643) and R 2 Pred (=0.657), considering the inclusion of a relatively large and structurally diverse dataset, potentially involving multiple binding mechanisms.The UVE-PLS model indicated that electrostatic interactions (34%) and steric contributions (66%) played a significant role in determining the binding affinity of the ligands towards sEH, with the steric component being dominant.Unlike 2D-QSAR models, assessing the applicability domain of 3D-QSAR models is challenging.However, leverage values of the training set compounds were determined using the Open3DQSAR tool, and it was observed that the leverage values (range: 0.983-0.849)did not vary considerably.Hence, it can be assumed that the compounds analyzed in this study were well within the AD of the model.Figure 7 displays the most potent compound (D4_02) and the least potent compound (D5_32) from the dataset, along with their respective contour maps.An analysis revealed that the bulky aromatic moiety of D4_02 is positioned near the steric favorable field, whereas such bulky groups are absent in D5_32.This indicates that steric interactions play a significant role in the potency of D4_02, which is consistent with our findings in the 2D-QSAR models, where descriptors such as ATS6m and RDF140v emerged as important factors.Additionally, electropositive (electron-deficient) fields were more prevalent than electronegative (electron-rich) fields.In the case of D4_02, the presence of the trifluoromethyl group in the benzene ring created an electropositive environment, which was absent in D5_32.Furthermore, D4_02 featured an indole ring fully inserted into another electropositive field, whereas the cyanobenzene residue of D5_32 (with an electron-deficient benzene residue) was not fully inserted into this field.Please note that this information may not be visible in Figure 8, and an additional figure from a different angle is provided in the Supplementary Materials (Figure S3).
Molecules 2023, 28, x FOR PEER REVIEW 10 of 23 whereas such bulky groups are absent in D5_32.This indicates that steric interactions play a significant role in the potency of D4_02, which is consistent with our findings in the 2D-QSAR models, where descriptors such as ATS6m and RDF140v emerged as important factors.Additionally, electropositive (electron-deficient) fields were more prevalent than electronegative (electron-rich) fields.In the case of D4_02, the presence of the trifluoromethyl group in the benzene ring created an electropositive environment, which was absent in D5_32.Furthermore, D4_02 featured an indole ring fully inserted into another electropositive field, whereas the cyanobenzene residue of D5_32 (with an electron-deficient benzene residue) was not fully inserted into this field.Please note that this information may not be visible in Figure 8, and an additional figure from a different angle is provided in the Supplementary Materials (Figure S3).It is evident that, in addition to the presence of polar groups, the specific topology of the compounds plays a crucial role in governing their biological activity, as also suggested by the 2D-QSAR analysis.Interestingly, two sulphonyl residues of D4_02 were found to be in proximity to the electronegative favorable contour maps, which could further en-

Molecular Dynamics Simulations
The compounds D4_02 and D2_37, which represent one of the most potent and one of the least potent compounds, respectively, were subjected to 50 ns molecular dynamics (MD) simulations.These compounds were docked into the active site of the sEH protein (PDB: 4X6X).Similarly, the complex 4X6X with a bound ligand (S74: 3-{4-[(1-{[(1s,2R,3S)-2,3-diphenylcyclopropyl]carbamoyl}-piperidin-4-yl)oxy]phenyl}-pro-panoic acid) was used as a reference protein complex for MD simulations.However, prior to conducting molecular docking on the dataset compounds, a self-docking analysis was performed using S74 in the 4X6X configuration.This step aimed to validate the docking methodology, resulting in an RMSD of 1.54 Å between the docked pose of S74 and its bound pose.
Figure 9 shows the RMSD plots of the protein backbones and ligands, along with the RMSF and RG plots.From the ligand RMSD plots, it is apparent that the highly active compound D4_02 exhibits lower fluctuations compared to the less active compound D2_37, primarily due to the lower fluctuations of D4_02 in residues 140-160 and 260-280.However, D4_02 displays higher fluctuations in residues 180-220 when compared to both S74 and D2_37.When assessing the compactness of the complexes using the radius of gyration (RG) plots, it was observed that the D4_02-4X6X complex remained more compact throughout the MD simulation compared to the D2_37-4X6X complex.It is evident that, in addition to the presence of polar groups, the specific topology of the compounds plays a crucial role in governing their biological activity, as also suggested by the 2D-QSAR analysis.Interestingly, two sulphonyl residues of D4_02 were found to be in proximity to the electronegative favorable contour maps, which could further enhance the biological activity of this molecule.Conversely, no electron-rich group was observed near these contours.Similar observations were made when examining the contour maps of higher active compounds D1_24 and D2_37, as depicted in Figure 8.

Molecular Dynamics Simulations
The compounds D4_02 and D2_37, which represent one of the most potent and one of the least potent compounds, respectively, were subjected to 50 ns molecular dynamics (MD) simulations.These compounds were docked into the active site of the sEH protein (PDB: 4X6X).Similarly, the complex 4X6X with a bound ligand (S74: 3-{4-[(1-{[(1s,2R,3S)-2,3diphenylcyclopropyl]carbamoyl}-piperidin-4-yl)oxy]phenyl}-pro-panoic acid) was used as a reference protein complex for MD simulations.However, prior to conducting molecular docking on the dataset compounds, a self-docking analysis was performed using S74 in the 4X6X configuration.This step aimed to validate the docking methodology, resulting in an RMSD of 1.54 Å between the docked pose of S74 and its bound pose.
Figure 9 shows the RMSD plots of the protein backbones and ligands, along with the RMSF and RG plots.From the ligand RMSD plots, it is apparent that the highly active compound D4_02 exhibits lower fluctuations compared to the less active compound D2_37, primarily due to the lower fluctuations of D4_02 in residues 140-160 and 260-280.However, D4_02 displays higher fluctuations in residues 180-220 when compared to both S74 and D2_37.When assessing the compactness of the complexes using the radius of gyration (RG) plots, it was observed that the D4_02-4X6X complex remained more compact throughout the MD simulation compared to the D2_37-4X6X complex.We also calculated the MM-GBSA binding energies for these complexes, which are presented in Table 4.The results clearly indicate that the highly active compound D4_02 exhibits a higher binding affinity towards sEH compared to the less active compound D2_37, consistent with the ligand RMSD plots of these two compounds.Compound D4_02 showed higher electrostatic and van der Waals interactions, and notably, there were significant differences in electrostatic interactions (ΔEelec) between D4_02 and D2_37.This finding aligns with our 3D-QSAR analyses, which suggested that electrostatic interactions play a substantial role in determining the inhibitory potentials of the compounds in the dataset.The lower entropy of D4_02 contributed to its higher theoretical binding energy.Given these findings, it was essential to examine the final binding poses obtained for these two compounds in the analysis.a ΔGbind(T): theoretical binding free energy (=ΔEvdW + ΔEele + ΔGpolar + ΔGnonpolar − TΔS) and its components.ΔEvdW: van der Waals interaction energy; ΔEele: electrostatic interaction energy; ΔGpolar: polar solvation free energy; ΔGnonpolar: nonpolar solvation free energy; TΔS: entropy.
Figure 10 displays the final binding poses of compounds D4_02 and D2_37.The 3D-QSAR analysis correctly predicted the involvement of π-π and π-alkyl interactions between the indole moiety of D4_02 and amino acid residues such as Tyr154, as well as Val269 (due to its insertion into an electropositive favorable field).Similarly, the π-alkyl interactions of the trifluoromethylbenzene moiety were well predicted by the 3D-QSAR model.While both aromatic rings of D2_37 exhibited π-π interactions with the amino acid We also calculated the MM-GBSA binding energies for these complexes, which are presented in Table 4.The results clearly indicate that the highly active compound D4_02 exhibits a higher binding affinity towards sEH compared to the less active compound D2_37, consistent with the ligand RMSD plots of these two compounds.Compound D4_02 showed higher electrostatic and van der Waals interactions, and notably, there were significant differences in electrostatic interactions (∆E elec ) between D4_02 and D2_37.This finding aligns with our 3D-QSAR analyses, which suggested that electrostatic interactions play a substantial role in determining the inhibitory potentials of the compounds in the dataset.The lower entropy of D4_02 contributed to its higher theoretical binding energy.Given these findings, it was essential to examine the final binding poses obtained for these two compounds in the analysis.Figure 10 displays the final binding poses of compounds D4_02 and D2_37.The 3D-QSAR analysis correctly predicted the involvement of π-π and π-alkyl interactions between the indole moiety of D4_02 and amino acid residues such as Tyr154, as well as Val269 (due to its insertion into an electropositive favorable field).Similarly, the π-alkyl interactions of the trifluoromethylbenzene moiety were well predicted by the 3D-QSAR model.While both aromatic rings of D2_37 exhibited π-π interactions with the amino acid residues, the overall van der Waals and electrostatic interactions of this ligand were significantly lower than those of D4_02.It should be noted that our 3D-QSAR analysis accurately predicted a large number of van der Waals interactions surrounding the trifluoromethylbenzene moiety of D4_02 (cf. the steric favorable field).In contrast, fewer van der Waals interactions were observed in D2_37 due to its lower molecular mass, which was also indicated by the 2D-QSAR model where ATS6m was identified as the most influential descriptor.
Molecules 2023, 28, x FOR PEER REVIEW 13 of 23 residues, the overall van der Waals and electrostatic interactions of this ligand were significantly lower than those of D4_02.It should be noted that our 3D-QSAR analysis accurately predicted a large number of van der Waals interactions surrounding the trifluoromethylbenzene moiety of D4_02 (cf. the steric favorable field).In contrast, fewer van der Waals interactions were observed in D2_37 due to its lower molecular mass, which was also indicated by the 2D-QSAR model where ATS6m was identified as the most influential descriptor.Furthermore, D4_02 exhibited hydrogen bond interactions with Thr131, whereas D2_37 depicted hydrogen bond interactions with Tyr237 and Asp106.It is worth noting that these interactions were not predicted by the 3D-QSAR model, likely because most of the compounds in the dataset had amide moieties that were aligned, and these specific interactions were not found to have a significant influence in the 3D-QSAR analysis.
Finally, it is important to compare the interpretation results from the Transformer-CNN with the interactions obtained from the MD simulations.The Transformer-CNN accurately predicted the interactions of the carboxamide, trifluoromethyl, and aromatic rings.Notably, the carboxylate group of D2_37 was solvent-exposed and did not show polar interactions with amino acid residues.This lack of polar interactions contributed to the unfavorable ΔEelec of this compound, thereby reducing its overall binding affinity.This observation may explain the negative influence of carboxylate residues in both the 2D-QSAR and Transformer-CNN models.Additionally, one of the sulphonyl groups of D4_02 formed a hydrogen bond interaction with Gln155 (not shown in Discovery Studio Visualizer but detected by the PoseView software of https://proteins.plus/,(accessed on 3 June 2023) and presented in Figure S2 of Supplementary Materials).

Dataset Collection and Preparation
The dataset utilized in this study consists of 184 structurally diverse human soluble epoxide hydrolase (sEH) inhibitors, which were sourced from articles published by research groups affiliated with the Goethe University, Germany [4,10,14,16,29].A complete listing of the SMILES of the dataset compounds, along with the corresponding experimental data, can be found in Table S1 of the Supplementary Materials.The chemical structures of the inhibitors were obtained either from the provided SMILES notations in the original publications or drawn using ChemSketch [30].These canonical SMILES were subsequently converted to .sdfformat and protonated at pH 7.4 using the Openbabel software-2.4.1 [31].To ensure consistency, the structures were converted back to canonical Furthermore, D4_02 exhibited hydrogen bond interactions with Thr131, whereas D2_37 depicted hydrogen bond interactions with Tyr237 and Asp106.It is worth noting that these interactions were not predicted by the 3D-QSAR model, likely because most of the compounds in the dataset had amide moieties that were aligned, and these specific interactions were not found to have a significant influence in the 3D-QSAR analysis.
Finally, it is important to compare the interpretation results from the Transformer-CNN with the interactions obtained from the MD simulations.The Transformer-CNN accurately predicted the interactions of the carboxamide, trifluoromethyl, and aromatic rings.Notably, the carboxylate group of D2_37 was solvent-exposed and did not show polar interactions with amino acid residues.This lack of polar interactions contributed to the unfavorable ∆E elec of this compound, thereby reducing its overall binding affinity.This observation may explain the negative influence of carboxylate residues in both the 2D-QSAR and Transformer-CNN models.Additionally, one of the sulphonyl groups of D4_02 formed a hydrogen bond interaction with Gln155 (not shown in Discovery Studio Visualizer but detected by the PoseView software of https://proteins.plus/,(accessed on 3 June 2023) and presented in Figure S2 of Supplementary Materials).

Dataset Collection and Preparation
The dataset utilized in this study consists of 184 structurally diverse human soluble epoxide hydrolase (sEH) inhibitors, which were sourced from articles published by research groups affiliated with the Goethe University, Germany [4,10,14,16,29].A complete listing of the SMILES of the dataset compounds, along with the corresponding experimental data, can be found in Table S1 of the Supplementary Materials.The chemical structures of the inhibitors were obtained either from the provided SMILES notations in the original publications or drawn using ChemSketch [30].These canonical SMILES were subsequently converted to .sdfformat and protonated at pH 7.4 using the Openbabel software-2.4.1 [31].To ensure consistency, the structures were converted back to canonical SMILES notation using the sdftosmi.pyprogram from the tanimoto_similarities package (https://github.com/MunibaFaiza/tanimoto_similarities,accessed on 10 June 2023), and any duplicate structures were removed.Further processing of the .sdfstructures was performed using Chemaxon in the OCHEM platform, involving the following steps: (a) standardization, (b) neutralization, (c) removal of salts, and (d) cleaning of structures [32].Furthermore, geometrical optimization of the structures for the calculation of 3D descriptors was conducted using Corina under the OCHEM platform [33].
To assess the structural diversity of the dataset compounds, we generated their MACCS Keys structural fingerprints [34].These fingerprints were employed to compute a distance matrix using Tanimoto Similarity analysis.Subsequently, the distance matrix underwent t-Distributed Stochastic Neighbor Embedding (t-SNE) analysis, producing a structural diversity plot with two components [35].Following this, a k-means cluster analysis was performed using 6 clusters determined by the Silhouette score, resulting in a plotted representation (refer to Figure S1).Such representation clearly depicts that these structures cover a considerably large chemical space that can easily be clustered.
The biological activity of interest here is the measured inhibitory potential of the compounds against human sEH, expressed as IC 50 (in µM).The latter, as is usual, was logconverted (pIC 50 (M) = −log 10 (IC 50 /10 6 )) and taken as the response variable for practical use in the subsequent 2D-QSAR modeling.
All these descriptors were calculated using the OCHEM web platform [32].Each set of descriptors was employed separately to develop QSAR linear interpretable models.These models will be specifically referred to as 2D-QSAR models to distinguish them from the other QSAR modeling approach applied in this study.

Dataset Division and Feature Selection
The dataset was divided into a training set and a test set using the open-access Pythonbased SFS-QSAR tool (available at https://github.com/ncordeirfcup/SFS-QSAR-tool_v2,accessed on 11 April 2023) [43].The SFS-QSAR tool implements the train_test_split function from Scikit-learn [44], and a seed value of 3 was set to ensure reproducibility for each descriptor set.Two distinct feature selection techniques were employed to generate the linear 2D-QSAR models by adopting a multiple linear regression (MLR)-based procedure, namely: (i) Sequential Forward Selection (SFS) [43], and (ii) Genetic Algorithm (GA) [45].
Feature selection is an important step in developing linear QSAR models as it identifies the most significant descriptors for determining the structural requirements of the compounds.SFS is a non-stochastic feature selection method that consistently produces the same model given the same descriptors, data distribution, and parameter settings.In this study, the SFS-MLR models were developed using the open-access SFS-QSAR-tool, which implements the Mlxtend tool (http://rasbt.github.io/mlxtend/,accessed on 5 April 2023).Four scoring functions, i.e., determination coefficient (R 2 ), negative mean absolute error (NMAE), negative mean Poisson deviance (NMPD), and negative mean gamma deviance (NMGD), were chosen one by one in this tool, with the option of no cross-validation (No CV) or 5-fold cross-validation (5-fold CV).As a result, eight SFS-MLR models were generated for each descriptor set, as shown in Figure 11.Conversely, GA is a stochastic method.The GA-MLR models were created using the GeneticAlgorithm v. 4.1_2 open-access tool [45] with default settings, including 100 iterations/generation, a crossover probability of 1, a mutation probability of 0.3, an initial number of 100 generated equations, and the selection of 30 equations in each generation.GA involves the random selection of descriptors, estimation of fitting scores for these random models, and the application of crossover and mutation schemes to improve the fitting scores and establish the final models [45].To account for the stochastic nature of GA, at least 20 different runs were performed for each dataset, and the best model was selected based on its overall predictivity.
Before model development, a pre-treatment step was performed in both tools.This involved setting a correlation cutoff of 0.99 and a variance cutoff of 0.0001 to eliminate highly correlated descriptors and constant/near-constant descriptors.For all linear interpretable 2D-QSAR models, a maximum of eight descriptors was allowed.

Model Evaluation
In order to compare the statistical quality of the developed models and determine the most reliable one, two well-known validation parameters were utilized, namely Q 2 LOO (leave-one-out cross-validated determination coefficient R 2 ) [23] and R 2 Pred (predicted R 2 or Q 2 F1) [24].The former is known for evaluating the internal predictivity of the model, whereas the latter estimates its external predictivity.The average value of these parameters was considered to select the most statistically reliable model.
To further assess the final models, additional statistical parameters were employed, i.e., the adjusted R 2 (R 2 Adj), the Fisher statistic (F-test), the mean absolute error (MAE), and the metrics rm 2 LOO and ∆rm 2 LOO were computed for the training set, whereas Q 2 (F2), Q 2 (F3), the root mean square error of prediction (RMSEP), and the metrics rm2test and ∆rm2test were computed for the test set.These parameters provide a more critical evaluation of the final models, both in terms of internal performance and external predictivity.A detailed description of these statistical parameters can be found elsewhere [25,46,47].
Likewise, to ensure the robustness and reliability of the proposed 2D-QSAR models, additional tests were carried out.To begin with, the maximum inter-collinearity among the descriptors of the final models was estimated from the cross-correlation matrix using Conversely, GA is a stochastic method.The GA-MLR models were created using the GeneticAlgorithm v. 4.1_2 open-access tool [45] with default settings, including 100 iterations/generation, a crossover probability of 1, a mutation probability of 0.3, an initial number of 100 generated equations, and the selection of 30 equations in each generation.GA involves the random selection of descriptors, estimation of fitting scores for these random models, and the application of crossover and mutation schemes to improve the fitting scores and establish the final models [45].To account for the stochastic nature of GA, at least 20 different runs were performed for each dataset, and the best model was selected based on its overall predictivity.
Before model development, a pre-treatment step was performed in both tools.This involved setting a correlation cutoff of 0.99 and a variance cutoff of 0.0001 to eliminate highly correlated descriptors and constant/near-constant descriptors.For all linear interpretable 2D-QSAR models, a maximum of eight descriptors was allowed.

Model Evaluation
In order to compare the statistical quality of the developed models and determine the most reliable one, two well-known validation parameters were utilized, namely Q 2 LOO (leave-one-out cross-validated determination coefficient R 2 ) [23] and R 2 Pred (predicted R 2 or Q 2 F1 ) [24].The former is known for evaluating the internal predictivity of the model, whereas the latter estimates its external predictivity.The average value of these parameters was considered to select the most statistically reliable model.
To further assess the final models, additional statistical parameters were employed, i.e., the adjusted R 2 (R 2 Adj ), the Fisher statistic (F-test), the mean absolute error (MAE), and the metrics rm 2  LOO and ∆rm 2 LOO were computed for the training set, whereas Q 2 (F2) , Q 2 (F3) , the root mean square error of prediction (RMSEP), and the metrics rm2test and ∆rm2test were computed for the test set.These parameters provide a more critical evaluation of the final models, both in terms of internal performance and external predictivity.A detailed description of these statistical parameters can be found elsewhere [25,46,47].
Likewise, to ensure the robustness and reliability of the proposed 2D-QSAR models, additional tests were carried out.To begin with, the maximum inter-collinearity among the descriptors of the final models was estimated from the cross-correlation matrix using the SFS-QSAR-tool.Then, the multi-collinearity of the final models was assessed using the variance inflation factor (VIF) [48], defined as follows: where R 2 i is the coefficient of determination (R 2 ) obtained from regressing the i th descriptor on the other descriptors [48].
The multi-collinearity of the 2D-QSAR models was also checked using the parameters K xx and ∆K calculated by the software QSARINS v2.2.4 [49].The K xx parameter represents the overall correlation among descriptors, while ∆K is the difference between the correlation among descriptors (K x ) and the correlation between descriptors and responses (K xy ) [50].
To further ensure the statistical robustness of the models, a Y-randomization test was performed.This involved randomizing the response variables while keeping the descriptors unchanged and then calculating the cR p 2 value using the following formula [51]: where R r denotes the average R 2 obtained from the randomized models.A value of cR p 2 greater than 0.5 generally suggests that the model was not developed by chance [51].

Applicability Domain of the Models
The applicability domain of a QSAR model refers to the region in the response and chemical structure space in which the model can make reliable predictions for new or unseen compounds [46].In this work, to determine the applicability domain (AD) of the 2D-QSAR models, a leverage estimation approach was followed, and the Williams plot generated.The Williams plot displays the leverage, which measures the influence of individual data points, against the standardized residuals [46,[52][53][54].This plot helps in identifying structural and response outliers in the linear 2D-QSAR models.It is important to note that, according to the Organization for Economic Cooperation and Development (OECD) guidelines, QSAR models should be reported along with their applicability domain.This ensures that the reliability and validity of the models can be assessed based on their performance within the defined applicability domain [22].
3.1.6.Machine Learning Techniques and Partial Least Square (PLS) Non-linear models were set up using selected features via three distinct techniques: (a) multilayer perception (MLP) [55], (b) support vector machines (SVM) [56], and (c) random forests (RFs) [57].These models were developed using the open-source software "Non-linear-Regression-tools" (available at https://github.com/ncordeirfcup/Non-linear-Regression-tools,accessed on 15 May 2023), which leverages Scikit-learn-based programs for model creation while incorporating hyperparameter optimization.Within this tool, users can specify the necessary parameters by means of a .csvfile.These parameters are then tuned to create optimal models based on 5-fold cross-validation on the training set.The optimized parameters for this study are detailed in Table S2 of the Supplementary Materials.The performance of the final models is subsequently gauged against external predictivity with the test set.Two distinct feature selection algorithms were employed during the development of the non-linear models.Firstly, descriptors from the most predictive linear model were utilized for setting up the model.As an alternative, we identified the eight most significant descriptors using differential Shannon entropy, a process implemented through the open-access tool IMMAN [58].
Additionally, the partial least squares (PLS) method was also employed using the selected features.This procedure was facilitated by another open-access tool named PLS-QSAR (accessible at https://github.com/ncordeirfcup/PLS-QSAR,accessed on 15 May 2023), resourcing to the following settings: maximum number of components: 5, condition: "CVLOO" (cross-validation leave-one-out), and increment: 5. Therefore, the tool would test dataset was employed to estimate the external predictivity of the generated models using a different configuration file (config_val.cfg).The input configuration files for these steps are provided in the Supplementary Information for reproducibility.
To interpret the models and assess the significance of individual input features, the "standalone" Transformer-CNN tool was applied (available at https://github.com/bigchem/transformer-cnn, accessed on 22 May 2023).This tool utilizes the Layer-wise Relevance Propagation (LRP) algorithm, which splits the overall predicted result into a sum of contributions coming from the individual neurons.The relevance is propagated from the last layer to the input layer, allowing the evaluation of contributions from specific input variables and the identification of significant features for the training set or the explanation of individual neural network predictions [61].

Alignment Techniques
For the development of the 3D-QSAR models, the compounds in the dataset were aligned using an atom-based alignment method or unsupervised rigid body molecular alignment.Initially, the 3D structures of the ligands in the dataset were minimized using the "obminimize" function of OpenBabel.The minimization process involved employing the steepest descent technique and the MMFF94 forcefield [31].After the minimization, the ligand structures were used to generate 100 conformations using the rdMolAlign.GetCrippenO3A code of Rdkit.The Python script "alignment.py"written and used for the atom-based alignment can be found in the GitHub repository: https: //github.com/ncordeirfcup/InsilicoModeling_RdRp,(accessed on 25 May 2023) [52].

Model Development
The 3D-QSAR models were generated using the aligned conformations with the opensource software called Open3DQSAR-2.24.The methodology for this software has been described in detail in earlier works by Tosco and Balle [62,63].Open3DQSAR utilizes a carbon and a volume-less positively charged probe to estimate steric and electrostatic domains, respectively.In its data pre-treatment stage, a smart region definition (SRD) cutoff level (here equal to 2.0) is employed, and N-level variables are removed.Open3DQSAR deploys SRD for grouping variables.Two different variable selection algorithms are utilized for such a purpose, namely, Fractional Factorial Design-based variable SELection (FFD-SEL), and Uninformative Variable Elimination-based Partial Least Square (UVE-PLS).

Molecular Docking and Molecular Dynamics Simulations
The X-ray crystal structure of sEH hydrolase (PDB: 4X6X) [64] was downloaded and utilized for molecular docking of the selected compounds from the dataset.The docking was performed using the AutoDock 4.2 package [65].A grid box with a spacing of 0.375 and dimensions of 50 Å × 50 Å × 50 Å was defined at the coordinates X = 2.98, Y = 4.39, and Z = 35.20.The detailed methodology for the docking procedure can be found in our previous work [66].
The best poses of the compounds obtained from the docking experiment were selected, and the resulting ligand-receptor complexes underwent 50 ns molecular dynamics (MD) simulations using Amber 20.The specific steps of the MD simulations were described in detail in our previous investigations [66,67].Trajectory analysis was performed using the cpptraj function of Python.Various properties, such as the root mean square deviation (RMSD) of the complexes and ligands, the root mean square fluctuations (RMSF), and their radius of gyration (Rg), were calculated.To estimate the binding free energies of the complexes, the molecular mechanics generalized born surface area (MM-GBSA) approach was applied.The MM-PBSA.py tool was used to calculate the binding free energies, considering 100 snapshots taken from the last 10 ns of the MD production run.Additionally, the entropy contributions (−T∆S) to the binding free energies were determined using normal mode analysis, collecting 100 snapshots from the last 10 ns [67,68].

Conclusions
The sEH (soluble epoxide hydrolase) enzyme is indeed a significant biological target for various diseases.It has been identified that the binding affinity of sEH inhibitors can be influenced by the binding sites present in the enzyme's C-terminal region, responsible for hydrolase activity, as well as the N-terminal region, associated with phosphatase activity.These distinct binding sites offer potential opportunities for designing and developing sEH inhibitors as single-target or multi-target agents, aiming to modulate the enzyme's activity and provide therapeutic benefits in the context of different diseases.The understanding of these binding sites and their contributions to the inhibitory potency of compounds is crucial for the rational design of effective sEH inhibitors.
In the present work, a large and diverse series of sEH inhibitors were investigated using receptor-independent 2D-QSAR and 3D-QSAR analyses.The aim was to generate validated and predictive models that can provide insights into the structural requirements of these inhibitors.The most predictive linear 2D-QSAR regression model found achieved high predictive power, explaining 80% of the variances in the training set compounds and predicting 78.4% of their variances.More importantly, external validation on the test set compounds yielded a prediction of 79.4% for their variances.The model highlighted the importance of 2D pharmacophoric information, as indicated by its CATS2D descriptors, and emphasized the significance of topological characteristics and properties such as molecular mass, van der Waals volume, dipole moment, and polarizability in determining the biological activity against sEH.The transformer CNN-based model provided a clear pictorial understanding of favorable and unfavorable fragments responsible for biological potency.The 3D-QSAR analyses also demonstrated satisfactory statistical predictivity and supported the interpretations from the 2D-QSAR models.These analyses provided valuable information about the structural requirements of potent sEH inhibitors.Furthermore, the MD simulations conducted with highly active and less active compounds revealed important receptor-ligand interactions, which were consistent with the predictions from the QSAR models.This comprehensive investigation serves as an important guideline for the design of novel sEH inhibitors.For instance, the generated 2D-QSAR models can serve as a means to obtain average predicted values for novel compounds to be synthesized.To set up the 2D-QSAR linear models, along with generating plots and values, one can refer to the files "2DQSAR_train.csv" and "2DQSAR_test.csv"(accessible at https://github.com/amitporto/soluble-epoxide-hydrolase-inhibitors,accessed 2 August 2023) and process them either through the Flask-based web application accessible at https://amit-mlr.onrender.com,(accessed 2 August 2023) (note that when using this application, file names should remain unchanged) or simply by employing the SFS-MLR-tool, which is available at https://github.com/ncordeirfcup/SFS-QSAR-tool_v2,(last accessed on 5 August 2023).Both these tools are suitable for predicting outcomes for new compounds.In addition, the Transformer-CNN can be leveraged for predictions using the "sEH.pickle"object file (located at https://github.com/amitporto/soluble-epoxide-hydrolase-inhibitors,accessed on 2 August 2023) and the "ochem.py"script (accessible at https://github.com/bigchem/transformer-cnn/tree/master/standalone, accessed on 22 May 2023).Concerning the 3D-QSAR models, these offer a means to further predict activity as well as to check the proximity of molecular scaffolds along with favorable contour maps (steric and electrostatic contours can be found in files "uvepls_coefficients_fld-01_y-01.grd" and "uve-pls_coefficients_fld-02_y-01.grd" at https://github.com/amitporto/soluble-epoxidehydrolase-inhibitors,accessed on 2 August 2023).Regarding the results of the MD simula-tions, these clearly demonstrated reduced fluctuations in amino acid residues 260-280 for both active complexes, pertaining to the dataset compound D4_02 and bound ligand S74.This observation implies that stronger interactions with these amino acid residues could contribute to enhanced inhibitory potential.Substantial differences were observed in steric and electrostatic interaction energies between D04_02 and D2_37.These factors should be monitored preliminarily while predicting the potency of the new compounds against the sEH enzyme.
Finally, and of paramount significance, the entirety of this study was undertaken using noncommercial open-access tools and web platforms to ensure fast reproducibility and accessibility.In the updated version of the SFS-QSAR-tool (available at https://github.com/ncordeirfcup/SFS-QSAR-tool_v2, accessed on 5 August 2023), we have incorporated two Jupyter notebook files, specifically, multiSFSQSAR_random.ipynb and multiSFSQSAR_random.ipynb.These notebooks were designed to assist users in generating multiple SFS-MLR models in a single run, as we performed in this study.Furthermore, we report here, for the first time, two automated Python-based tools, namely, Non-linear-Regression-tools (accessible at https: //github.com/ncordeirfcup/Non-linear-Regression-tools,accessed on 15 May 2023) and PLS-QSAR (available at https://github.com/ncordeirfcup/PLS-QSAR,accessed on 15 May 2023).These tools are intended to assist the scientific community in developing machine-learning-based regression models and PLS models, respectively.

Figure 1 .
Figure 1.Observed vs. predicted activity (left) and Williams plot (right) of 2D-QSAR model.Subsequently, to assess external predictivity, a test set of 37 compounds was used.The model demonstrated a relatively high value of 0.792 for predicting the biological activity of the test set compounds.The Q 2 F2 and r m 2 test values further confirmed the model's satisfactory performance on the test set.The low values obtained for RMSEP and ∆r m 2 test also support the model's external predictivity.The maximum intercorrelation (R 2 ) between any two descriptors in the model was found to be 0.557, indicating that the descriptors used in the model are independent.The intercorrelation matrix can be found in TableS4of the Supplementary Materials.The variance inflation factor (VIF) was calculated for each descriptor, and none of the values exceeded 5.0, indicating the absence of multi-collinearity in the model.Moreover, the model exhibited acceptable K xx and ∆K values, further supporting its robustness.The Y-randomization test with 1000 runs yielded a cR p 2 value of 0.784, indicating that the model is not a result of chance but rather a unique and meaningful model.The Williams plot of this 2D-QSAR model is presented in Figure1.Only one training set compound was found to be a structural outlier and two compounds appeared as response outliers.Nonetheless, given the good predictivity of the structural outlier, we decided to retain it in the model.In Table1, the eight descriptors of this 2D-QSAR model are listed along with their meaning, and their relative significance, determined by the standardized coefficients, is depicted in Figure2.As can be noticed, save for RDF140v, all these descriptors belong to

Figure 2 .
Figure 2. Relative significance of each descriptor of the 2D-QSAR model.

Figure 2 .
Figure 2. Relative significance of each descriptor of the 2D-QSAR model.

Figure 3 .
Figure 3.The negative influence of descriptors CATS2D_07_AA and CATS2D_03_NL for the biological activity of the compounds.Figure 3. The negative influence of descriptors CATS2D_07_AA and CATS2D_03_NL for the biological activity of the compounds.

Figure 3 .
Figure 3.The negative influence of descriptors CATS2D_07_AA and CATS2D_03_NL for the biological activity of the compounds.Figure 3. The negative influence of descriptors CATS2D_07_AA and CATS2D_03_NL for the biological activity of the compounds.

Figure 4 .
Figure 4.The negative influence of descriptors CATS2D_05_AA and F09[N-O] for the biological activity of the compounds.

Figure 4 .
Figure 4.The negative influence of descriptors CATS2D_05_AA and F09[N-O] for the biological activity of the compounds.

23 Figure 5 .
Figure 5. Structural interpretations obtained from the Transformer-CNN model for some selected highly active (upper row) and less active (lower row) dataset compounds.Color codes: green (favorable) and red (unfavorable).

Figure 5 .
Figure 5. Structural interpretations obtained from the Transformer-CNN model for some selected highly active (upper row) and less active (lower row) dataset compounds.Color codes: green (favorable) and red (unfavorable).

Figure 6 .
Figure 6.The aligned structures of the compounds used for 3D-QSAR modeling (left) and all contour maps (right).Notice that only the five best actives and the five least actives are shown.Color codes: green (steric favorable), yellow (steric unfavorable), blue (electropositive favorable), and red (electronegative favorable).

Figure 6 .
Figure 6.The aligned structures of the compounds used for 3D-QSAR modeling (left) and all contour maps (right).Notice that only the five best actives and the five least actives are shown.Color codes: green (steric favorable), yellow (steric unfavorable), blue (electropositive favorable), and red (electronegative favorable).

Figure 9 .
Figure 9. Results from the trajectory analysis for the MD simulations of D4_02 and D2_37 as well as of S74 (bound ligand of PDB 4X6X).

Figure 9 .
Figure 9. Results from the trajectory analysis for the MD simulations of D4_02 and D2_37 as well as of S74 (bound ligand of PDB 4X6X).

Figure 10 .
Figure 10.Poses obtained from the final trajectory of MD simulations for D4_02 and D2_37.

Figure 10 .
Figure 10.Poses obtained from the final trajectory of MD simulations for D4_02 and D2_37.

Figure 11 .
Figure 11.Model development strategy adopted for developing SFS-MLR models.

Figure 11 .
Figure 11.Model development strategy adopted for developing SFS-MLR models.

Table 1 .
List of descriptors of the 2D-QSAR model with their descriptions.
RDF140vRadial Distribution Function at a distance of 14.0 Å weighted by van der Waals volume 3D (RDF)Molecules 2023, 28, x FOR PEER REVIEW 4 of 23

Table 1 .
List of descriptors of the 2D-QSAR model with their descriptions.

Table 2 .
Summary of the results obtained from non-linear models.

Table 3 .
Results for the 3D-QSAR models with three components.

Table 3 .
Results for the 3D-QSAR models with three components.