Chemical Graph Theory for Property Modeling in QSAR and QSPR—Charming QSAR & QSPR

Abstract: Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models are mathematical models for predicting the chemical, physical, or biological properties of chemical compounds. They are usually based on structural parameters (fragment contributions) or calculated parameters (three-dimensional QSAR (QSAR-3D) or chemical descriptors). Here, we describe a graph theory approach for generating and mining molecular fragments to be used in QSAR or QSPR modeling based exclusively on fragment contributions. Combining molecular graph theory, Simplified Molecular Input Line Entry Specification (SMILES) notation, and connection table data provides a precise way to differentiate and count the molecular fragments. Machine learning strategies generated models with outstanding root mean square error (RMSE) and R² values. We also present the software Charming QSAR & QSPR, written in Python, for predicting the properties of chemical compounds using this approach.


Introduction
Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models correlate structural parameters with a given attribute using statistical tools. In principle, any complex characteristic can be modeled by QSAR or QSPR, such as toxicity, IC50, cetane number, solubility, and so on. Although the studied property is reported as a single number, it is influenced by several physicochemical parameters that also depend on the molecular structure [1]. The observed biological activity, for example, relates to specific intermolecular interactions, membrane permeability, pKa, molecular weight, polarity, and many other characteristics. Some of those traits may act synergistically to improve the observed result, whereas others behave antagonistically [1][2][3][4][5][6][7][8][9].
QSAR and QSPR work with structural, experimental, or theoretical parameters in order to model a specific property, usually using a weighted combination of them. Since any property ultimately arises from the connection pattern, geometry, and molecular structure, it should be possible to anticipate a desired property using exclusively the structural data of the study set, provided that the number of compounds in the study set is large enough and the structural parameters are sufficiently precise [2][3][4][5][6][7][8][9]. This work presents software developed by our group that uses chemical graph theory exclusively to generate a small set of molecular fragments, obtained from the original study set, to model a specific property.
A molecular property can be interpreted as a sum of positive and negative contributions of the different fragments in the compound. Polar groups, for example, interact with one another and increase the boiling point of a substance. The symmetry and polarity of a molecule may facilitate intermolecular interactions and increase its melting point. The biological activity of a small molecule is a result of specific interactions between the compound and a macromolecular target. In this case, such molecular recognition arises from pharmacophoric interaction points in the active site. All of those molecular properties arise from increments of positive and negative contributions of the different substructural fragments in the whole molecule [2][3][4][5][6][7][8][9]. For this reason, substructural descriptors are already used in QSAR/QSPR studies: their counts help to quantify the desired property, and they are readily calculable from the molecular structure.
Graph theory concepts are present in our daily routine, although we do not always perceive them. Routing internet traffic, finding the shortest path between two points, and map coloring are typical examples. Likewise, in chemistry, those concepts can be found in Hückel molecular orbital (HMO) theory, protein folding, and nomenclature, just to cite a few cases [1,10].
There is a great resemblance between a graph and a structural formula, as depicted in Figure 1, in which a direct correspondence between vertices and atoms, as well as between edges and chemical bonds, is clearly perceived. Unfortunately, a critical drawback precludes a more generalized use of graph theory in chemistry: its inability to distinguish atoms or bonds [11][12][13][14][15][16][17][18][19][20][21][22]. Quantifying specific interactions among substructural fragments is crucial for inferring any molecular property and, to do so, atoms and chemical bonds must be differentiated. In order to circumvent this weakness of graph theory, we envisage its simultaneous use with the SMILES linear notation of chemical structures [23,24]. Determining all of the subgraphs of a more complex graph using graph theory is relatively fast and precise. The inability of graph theory to differentiate atoms or bonds is overcome with SMILES by matching each substructural fragment obtained from graph theory with the atom connection pattern obtained from the input data. Structural chemical data, such as a mol file or similar, define the different types of chemical bonds that connect the atoms, and such information is transferred precisely to the SMILES notation [17][18][19][20][21][22][23][24]. With such an approach, it is possible to discriminate any atom, even its hybridization, as well as its precise location in the molecule. Thereby, it is possible to tell whether an oxygen atom belongs to an alcohol, ether, or ester group, for example. Figure 1 shows an example of this approach, which is used in the Charming QSAR & QSPR software described here. A graph is built using the connectivity information from the input data. Accordingly, each subgraph obtained by graph theory corresponds to a substructural fragment, which is written in SMILES, retrieving the corresponding chemical information.
In Charming QSAR & QSPR, the activity model is achieved from a multivariate linear regression while using molecular fragment frequencies as descriptors. The frequency is simply calculated by counting the number of occurrences of an independent molecular fragment that is produced by the Graph Theory/SMILES/Structural Data approach described above.
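The fragment-contribution model described above can be sketched as a linear equation. The symbols are introduced here for illustration only and do not come from the software: $n_{i,m}$ is the number of occurrences of fragment $i$ in molecule $m$, $b_i$ is the fitted contribution of that fragment, $b_0$ is the intercept, and $F$ is the number of fragments retained as descriptors.
$$\hat{y}_m = b_0 + \sum_{i=1}^{F} b_i \, n_{i,m}$$
Under this form, a positive $b_i$ raises the predicted property each time the fragment occurs in the molecule, and a negative $b_i$ lowers it.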

The Charming QSAR & QSPR
The Charming QSAR & QSPR program mines molecular fragments and generates predictive QSAR models that highlight the main structural patterns related to a given property of a set of molecules. The program is written in Python and uses several RDKit tools, such as SDWriter, MolToSmiles, and MolFragmentToSmiles [25]. It also uses Scikit-learn tools for model selection, statistical metrics, regressions, and machine learning [26]. The input data are in SDF file format, and the compounds are handled one at a time. The applied sequence (Figure 1) begins with the conversion of the input molecule into a benchmarked molecular coordinate using Chem.SDMolSupplier. The tool follows the steps required to generate and validate a QSAR model, which are listed below.

Standardization
Standardization is an important step in building a QSAR model: it removes inconsistent and duplicated data that could otherwise introduce errors during model elaboration. For standardization, Charming QSAR & QSPR has a function called standardized_molecules, built on RDKit functionality, that accepts an SDF file as input and searches for duplicates in the work set. This function uses the tautomer enumerator and standardization functionalities of the RDKit Chem module to create a standard representation and SMILES notation with which to compare the different structures in the SDF file [23,24,26]. It also sets lower and upper limits on the molecular mass and a maximum count for each halogen atom.

Outliers
The package presented here includes a function to identify possible outliers among the endpoints of the work set compounds. This tool, called rem_poss_out, calculates the average of the endpoint values and, based on the standard deviation, removes the possible outliers. The cut-off for identifying outliers was a Z-score of ±3. Outlier removal is necessary when some molecules have unexpected or different biological/chemical/physical activity contributing to the endpoint. In such cases, the outlying compounds do not fit a QSAR model, because they may act through a different mechanism of action, have a different target biomolecule [27], or have different binding modes due to conformational flexibility [28]. The input is the standardized SDF file, and the outputs are a scatter plot of the endpoints and two SDF files, one with the removed compounds and the other with the selected ones.
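The Z-score rule described above can be illustrated with a minimal sketch. This is not the code of rem_poss_out: the function name split_by_z_score, the (name, endpoint) input format, and the use of the population standard deviation are assumptions made for this example, and SDF reading and plotting are left out.

```python
from statistics import mean, pstdev


def split_by_z_score(endpoints, z_cutoff=3.0):
    """Split (name, endpoint) pairs into kept and removed lists using |Z| > z_cutoff."""
    values = [value for _, value in endpoints]
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed convention)
    kept, removed = [], []
    for name, value in endpoints:
        # A constant endpoint column has no outliers, so Z is taken as zero.
        z = (value - mu) / sigma if sigma > 0 else 0.0
        (removed if abs(z) > z_cutoff else kept).append((name, value))
    return kept, removed


# Toy data: twenty ordinary endpoints and one extreme value.
data = [(f"mol{i}", 1.0) for i in range(20)] + [("mol20", 50.0)]
kept, removed = split_by_z_score(data)
print(len(kept), removed)
```

With a single extreme value among n points, the largest possible Z-score is the square root of n - 1, so a ±3 cut-off can only flag an outlier when the set has more than ten compounds.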

Molecular Fragment Generation and Counting
The molecular fragments (MFs) are the descriptors used to portray the compound set. The package has a function that generates MFs within a range of lengths chosen by the user. A molecular graph is built by sequentially adding edges to an empty graph. Each edge is denoted by two connected vertices that, in this case, represent two bonded atoms. After all of the edges (bonds) of a compound structure have been added, the molecular graph is submitted to subgraph mining, which returns straight and branched paths.
In order to generate the subgraphs, an atom is selected, and its neighbors are added, one at a time, according to the connection pattern. All of the possible combinations of neighboring atoms are calculated and registered, as depicted in Figure 2. Subsequently, each neighbor in each generated combination is the starting point for the next round of growth. The growth step is repeated until the graph reaches the maximum number of atoms. The subgraphs that fulfill the requirements established by the user are recorded in SMILES notation. The function repeats this process for each atom in a molecule and each molecule in the SDF file. After that, the duplicated MFs are removed, and their SMILES strings are saved into a CSV file. The correlation table is constructed by counting the matches of each MF in every molecular structure. The CORAL software [29][30][31] also uses SMILES notation, to calculate descriptors, and molecular graphs, to calculate local graph invariants. By contrast, Charming uses graph theory as a tool for generating structural patterns. Each generated subgraph is stored, and the SMILES notation is used to encode its chemical information.
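The growth procedure above can be sketched in plain Python as an enumeration of connected atom sets. This is an illustrative reading of the description, not the package's implementation: the function connected_subgraphs and the adjacency-dictionary input are assumptions, atoms are represented only by integer indices, and the conversion of each atom set to SMILES (for which the program relies on RDKit) is not shown.

```python
def connected_subgraphs(adjacency, min_atoms, max_atoms):
    """Return every connected atom set containing min_atoms..max_atoms atoms."""
    found = set()
    frontier = {frozenset([atom]) for atom in adjacency}  # start from single atoms
    size = 1
    while frontier and size <= max_atoms:
        if size >= min_atoms:
            found |= frontier
        grown = set()
        for subgraph in frontier:
            for atom in subgraph:
                for neighbor in adjacency[atom]:
                    if neighbor not in subgraph:
                        # Growing by one bonded neighbor keeps the set connected;
                        # using frozensets removes duplicates automatically.
                        grown.add(subgraph | {neighbor})
        frontier = grown
        size += 1
    return found


# Carbon skeleton of 2-methylbutane: atom 1 is the branching point.
skeleton = {0: [1], 1: [0, 2, 3], 2: [1, 4], 3: [1], 4: [2]}
fragments = connected_subgraphs(skeleton, min_atoms=3, max_atoms=4)
print(sorted(sorted(f) for f in fragments))
```

Both straight paths such as {1, 2, 4} and branched patterns such as {0, 1, 2, 3} are returned, which mirrors the straight and branched paths mentioned in the text. The number of subgraphs grows quickly with the size limit, which is one reason to keep the maximum number of atoms small.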

Preprocessing
The work set is randomly split into training and test sets, typically 80% for the training set and 20% for the test set. Subsequently, the descriptors (MFs) pass through two filter functions: the first removes descriptors below a minimum variance threshold; the second analyzes the correlation among the descriptors and removes the highly correlated MFs. Two or more MFs are considered to be correlated when the graph of MF 1 is a subgraph of MF 2 and their counts are linearly dependent. Rare fragments are also removed. A fragment is considered rare if it is present in fewer than a minimum number of molecules in the training set; this value is set according to the size of the training set. A study was then performed to find the most appropriate correlation threshold. Among the values of 0.99, 0.95, and 0.90, the value of 0.99 gave the best performance in the machine learning model elaboration. At the end of the whole process, every descriptor in the MF count table is scaled to the range between 0 and 1. This process provides the learning algorithm with equally weighted inputs from which repeated information has been removed.
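The variance filter, correlation filter, and 0 to 1 scaling can be sketched as follows. This is a simplified illustration under stated assumptions: the function filter_and_scale and the dictionary-of-columns layout are hypothetical, correlation is measured with the Pearson coefficient only (the subgraph condition and the rare-fragment filter are omitted), and min_variance is assumed to be non-negative.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant count columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm


def filter_and_scale(columns, min_variance=0.0, max_corr=0.99):
    """columns maps a fragment SMILES to its counts, one entry per molecule."""
    kept = {}
    for name, counts in columns.items():
        if pvariance(counts) <= min_variance:
            continue  # constant or near-constant descriptor
        if any(abs(pearson(counts, other)) >= max_corr for other in kept.values()):
            continue  # redundant with a descriptor that was already kept
        kept[name] = counts
    scaled = {}
    for name, counts in kept.items():
        lo, hi = min(counts), max(counts)
        scaled[name] = [(c - lo) / (hi - lo) for c in counts]  # min-max scaling
    return scaled


counts = {
    "CO": [0, 1, 2, 1],
    "CCO": [0, 2, 4, 2],  # exactly twice the "CO" counts, so it is dropped
    "C=O": [1, 1, 1, 1],  # zero variance, so it is dropped
    "CN": [2, 0, 1, 0],
}
scaled = filter_and_scale(counts)
print(scaled)
```

In this sketch the first descriptor of a correlated pair is the one retained, so the result depends on column order; a real implementation would need an explicit rule for which fragment to keep.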

Descriptor Selection
In order to avoid overfitting, it is necessary to select the set of descriptors that is most correlated with the studied property. For descriptor selection, Charming QSAR & QSPR has a backward selection function that selects the descriptors according to their p-values in a multilinear regression. This tool is recommended for small datasets, where the number of descriptors is reduced.
LASSO (Least Absolute Shrinkage and Selection Operator) regression can also be applied to select the descriptors. It is a linear model that estimates sparse coefficients; in other words, it prefers solutions with fewer non-zero coefficients, reducing the dimensionality of the data [32][33][34].
The Random Forest (RF) regressor can also be used to select the MFs, and its performance was evaluated in each case studied here. In all three examples presented in this work, the descriptors selected by the LASSO model gave a better description of the chemical space.
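To illustrate how LASSO drives some coefficients exactly to zero, the sketch below implements cyclic coordinate descent with soft-thresholding in plain Python. It is a teaching example and not the code used by the program, which relies on Scikit-learn; the functions soft_threshold and lasso_fit are written for this example, and the objective minimized is the mean squared error divided by two plus alpha times the sum of absolute coefficients, with an unpenalized intercept.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_fit(X, y, alpha, n_sweeps=100):
    """Fit LASSO by cyclic coordinate descent on centered data."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    residual = [t - y_mean for t in y]  # residual when all weights are zero
    weights = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in Xc]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue  # a constant descriptor carries no information
            # Covariance of descriptor j with the residual that excludes it.
            rho = sum(v * (r + weights[j] * v) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / norm
            residual = [r - (new_w - weights[j]) * v for v, r in zip(col, residual)]
            weights[j] = new_w
    intercept = y_mean - sum(w * m for w, m in zip(weights, x_means))
    return intercept, weights


# Toy data: the endpoint depends only on the first descriptor (y = 2*x1 + 1).
X = [[0, 1], [1, 0], [2, 1], [3, 0]]
y = [1.0, 3.0, 5.0, 7.0]
intercept, weights = lasso_fit(X, y, alpha=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(intercept, weights, selected)
```

The second descriptor receives a coefficient of exactly zero and is therefore not selected, while the coefficient of the first descriptor is shrunk slightly below its true value of 2; this shrinkage is the price paid for sparsity.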

Model Training
Scikit-learn provides several machine learning algorithms, three of which were selected: support vector machine (SVM), random forest (RF), and gradient boosting machine (GBM) [26]. Only the descriptors with nonzero coefficients in the LASSO model are used to describe the studied property. For each machine learning algorithm, a grid search is used for hyperparameter optimization, and the best training model is selected. A linear ensemble and a stacking of the trained models are then built in order to improve the prediction accuracy. After each training process, the test set is used to verify the model quality and predictive power.
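A linear ensemble can be read as a weighted sum of the predictions of the individual models. The sketch below shows that idea only; the function linear_ensemble is hypothetical, and the default of equal weights is an assumption, since the text does not state how the weights are chosen.

```python
def linear_ensemble(predictions, weights=None):
    """Combine per-model prediction lists into one list by a weighted sum."""
    names = list(predictions)
    if weights is None:
        # Equal weights are an assumption made for this example.
        weights = {name: 1.0 / len(names) for name in names}
    n_samples = len(predictions[names[0]])
    return [
        sum(weights[name] * predictions[name][i] for name in names)
        for i in range(n_samples)
    ]


# Made-up predictions of three models for two compounds.
preds = {"SVM": [1.0, 2.0], "RF": [2.0, 2.0], "GBM": [3.0, 5.0]}
combined = linear_ensemble(preds)
weighted = linear_ensemble(preds, {"SVM": 0.5, "RF": 0.25, "GBM": 0.25})
print(combined, weighted)
```

Stacking differs in that a second model, such as the RF regressor mentioned in the examples, is trained on the individual predictions instead of applying fixed weights.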

Validation
Several statistical metrics are used to verify the accuracy and predictive ability of the model. The root mean square error (RMSE) is the most appropriate among them. This metric is considered during both steps, model elaboration and validation [35][36][37]. It is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(y_j - \hat{y}_j\right)^2}$$
where $y_j$ is the property value, $\hat{y}_j$ is the predicted value, and $N$ is the number of molecules in the set. During the model elaboration, $R^2$ is also considered:
$$R^2_{tr} = 1 - \frac{\sum_{j=1}^{N_{tr}}\left(y_j - \hat{y}_j\right)^2}{\sum_{j=1}^{N_{tr}}\left(y_j - \bar{y}_{tr}\right)^2}$$
where $N_{tr}$ is the number of compounds in the training set, $y_j$ is the experimental value, $\hat{y}_j$ is the predicted value, and $\bar{y}_{tr}$ is the average value of the studied property in the training set [35][36][37].
During the test step, in addition to RMSE, $R^2$ and $R^2_0$ are also considered, defined as:
$$R^2_{test} = 1 - \frac{\sum_{j=1}^{N_{test}}\left(y_j - \hat{y}_j\right)^2}{\sum_{j=1}^{N_{test}}\left(y_j - \bar{y}_{test}\right)^2}$$
$$R^2_{0,test} = 1 - \frac{\sum_{j=1}^{N_{test}}\left(\hat{y}_j - y^{r0}_j\right)^2}{\sum_{j=1}^{N_{test}}\left(\hat{y}_j - \bar{\hat{y}}_{test}\right)^2}$$
where $N_{test}$ is the number of entries in the test set, $y_j$ is the experimental value, $\hat{y}_j$ is the predicted value, $\bar{y}_{test}$ is the average value of the studied property in the test set, and $\bar{\hat{y}}_{test}$ is the mean of the predicted values. $R^2_{0,test}$ is the coefficient of determination of a regression with no intercept (regression through the origin): $y^{r0}_j = k\,\hat{y}_j$ [35][36][37]. The observed values of these metrics allow the model performance to be analyzed according to the procedure proposed by Tropsha and Golbraikh. Accordingly, a model is considered acceptable when $R^2_{tr} > 0.5$, $R^2_{test} \geq 0.6$, and $0.85 \leq k \leq 1.15$ [35][36][37]. These reference values may change according to the modeled data set and the intended application.
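The metrics above can be computed with a few lines of plain Python. This is a minimal sketch for illustration: the function names are hypothetical, the slope k is taken as the least-squares slope of the observed values regressed on the predicted values through the origin (an assumption about how k is obtained), and the numerical data, including the training R² of 0.90, are made up.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination relative to the mean of the observed values."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def slope_through_origin(y_true, y_pred):
    """Least-squares slope k of observed = k * predicted (no intercept)."""
    return sum(y * p for y, p in zip(y_true, y_pred)) / sum(p * p for p in y_pred)


def passes_acceptance(r2_train, r2_test, k):
    """Thresholds quoted in the text: R2_tr > 0.5, R2_test >= 0.6, 0.85 <= k <= 1.15."""
    return r2_train > 0.5 and r2_test >= 0.6 and 0.85 <= k <= 1.15


# Made-up test-set values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
test_rmse = rmse(observed, predicted)
test_r2 = r_squared(observed, predicted)
k = slope_through_origin(observed, predicted)
ok = passes_acceptance(r2_train=0.90, r2_test=test_r2, k=k)  # 0.90 is hypothetical
print(test_rmse, test_r2, k, ok)
```

A slope k close to 1 indicates that the predictions are neither systematically compressed nor stretched relative to the observed values.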

Application
In order to illustrate the QSAR and QSPR elaboration with Charming QSAR, three data sets were evaluated, and the results are shown below.

Example 1

Data Set
The dataset endpoint is the negative logarithm of the compound concentration that reduces the growth of the protozoan Tetrahymena pyriformis by 50% (pIGC50) [38]. This assay is a commonly accepted toxicity screening and evaluation tool. The data set includes small organic molecules that contain a diverse set of functional groups. It was retrieved from the literature and contains 1094 observations [39].

Standardization and Outlier Analysis
Applying the standardization filter to the SDF file removed four molecules from the whole set of compounds. The threshold for each halogen atom was set at 10, the minimum molecular mass at 33, and the maximum molecular mass at 2000. The outlier analysis did not remove any compound, and the final work set has 1090 molecules. Endpoints were considered outliers when they were more than three σ away from the mean value (Figure 3).

Molecular Fragment Generation, Counting, and Preprocessing
A total of 8288 molecular fragments with three to eight atoms were generated. The matches of each molecular fragment were counted, and the matrix was saved as a CSV file. Subsequently, the work set was randomly split into training and test sets. After that, the descriptors passed through the variance filter, which removed MFs with zero variance. The correlation threshold was set at 0.99, and repeated chemical information was removed. After filtering, the descriptor set has 1187 molecular fragments, and all of the selected descriptors ranged between 0 and 1.

Descriptor Selection, Model Training, and Validation
A LASSO model was trained in order to identify the MFs most highly correlated with the biological activity. A total of 233 molecular fragments were selected to describe the chemical space of the training set.
The selected MFs were used to train the machine learning models, which were built in Python using the scikit-learn package. For each machine learning algorithm, namely support vector machine (SVM), gradient boosting machine (GBM), and random forest (RF), a set of hyperparameters was optimized using a grid search with repeated K-fold cross-validation. Individual models were built using the optimized hyperparameters, and a combination of these three models, as well as the LASSO model, was evaluated in order to improve the predictive ability. Two ensemble techniques were explored: first, a linear combination of the models, called "Linear ensemble"; and second, a stacking model using an RF regressor. Table 1 shows the observed parameters for model evaluation, as well as a comparison with Zhu's work [39]. The study published by Zhu and co-workers reported the same difficulty in describing the test set. The training and test sets show differences in performance when the statistical parameters are analyzed, which reflects the inhomogeneity of the dataset. Nevertheless, the performance of Charming was similar to that of the approaches that also use molecular fragments exclusively and better than that of some of the approaches that use a combination of different features. Some of the models in Zhu's work also have moderate transferability to the test set, as can be seen for the entries kNN-Dragon, kNN-MolconnZ, SVM-dragon, ISIDA-SVM, ISIDA-MLR, and OLS [39].
Although the gradient boosting machine (GBM) showed a better fit for the training set, the linear ensemble performed better on the test set. The linear ensemble shows higher R² and lower MAE values for the test set than any of the machine learning models alone (0.799 and 0.36, respectively, in Table 1; Figures 4 and 5). The stacking using a random forest regressor performs similarly to the other models. The test-set R² measures the dispersion of the results along the linear regression; R₀² and the k value measure how good the results are and whether the model can be used to predict new molecules [35][36][37]. Some compounds in the test set showed a large deviation from the model. A plausible reason for such a poor fit in the QSAR model is that these compounds may act through a different mechanism of action, have a different target biomolecule despite showing the same biological result, or have different binding modes due to conformational flexibility.
The correlation between each selected molecular fragment and the biological activity can be retrieved from the LASSO model. The main MFs and their contributions are shown in Figure 6. The blue MFs have positive values and decrease toxicity. On the other hand, the red-colored fragments, such as the pattern 1-hydroxy-4-methoxy-benzene (upper right in Figure 6), increase the compound toxicity. That particular result could be associated with the similarity of this MF to the ubiquinol pattern. Therefore, compounds that have this moiety are expected to interfere with processes vital to the cell, showing toxic activity.

Example 2

Data Set
The logarithms of the equilibrium constants for the extraction of Eu³⁺ for 128 compounds were retrieved from the literature [40,41]. The experimental procedures were performed at 25 °C and 0.1 mol/L. The compounds in the SDF file are crown ether complexing agents with different ring sizes and substitution patterns.

Standardization and Outlier Analysis
The standardization filter that was applied to the SDF file did not remove any molecule. The threshold value for each halogen atom was set at 6, the minimum molecular mass was 20, and the maximum 900. The outlier analysis removed 1 compound and the final work set has 127 molecules. The endpoints considered to be outliers were those that are higher than three σ away from the mean value (Figure 7).

Molecular Fragment Generation, Counting, and Preprocessing
A total of 2787 MFs with three to eight atoms were generated. The correlation threshold was set at 0.95. After the filtering step, the descriptor set has 189 MFs, and all of the selected descriptors ranged between 0 and 1.

Descriptor Selection, Model Training, and Validation
A LASSO model was trained in order to identify the MFs correlated with the logarithm of the affinity constant (logK). A total of 22 molecular fragments were selected to describe the chemical space of the training set.
The selected MFs were used to train the machine learning models, and the hyperparameters were optimized using a grid search with repeated K-fold cross-validation. The optimized hyperparameters were used to build the individual models, the stacking, and the ensemble model. Table 2 shows the observed parameters for model evaluation.

All of the trained models have similar validation parameters. Special attention must be given to the LASSO and RF stacking models, which have the lowest RMSE values for the test set. Among them, the SVM has the lowest RMSE; in this case, it would be prudent to select this model in order to make predictions regarding this chemical space. Figures 8 and 9 show the training and test plots for the RF-stacking model of the selected features, which has R₀² = 0.997. Interestingly, the highly polar molecule 3-amino-5-sulfosalicylic acid behaves as an outlier, at coordinates (8.49, 2.37). This behavior is probably due to a different mode of coordination with the Eu³⁺ ion. This compound has four different functional groups and behaves as a dipolar ion in solution.
The LASSO model identified the main molecular fragments responsible for Eu³⁺ complexation. Figure 10 shows the main MFs and their contributions to the training model. The blue MFs have positive values and increase the equilibrium constant. Conversely, the red MFs have negative values and decrease the equilibrium constant. Rationalizing the chemical information carried by the fragments, we can say that compounds with amino or hydroxyl groups will have a greater affinity for Eu³⁺ complexation than those with ether and amide groups. This pattern can be explained by the charge relief in the coordination site through hydrogen bonding. That pattern acts as a better σ electron-donating group, stabilizing the cation more efficiently.
Figure 10. Main molecular fragments (MFs) retrieved by the LASSO model for Eu³⁺ complexation. The red-colored fragments have negative values and reduce the logK value. The blue MFs have positive values and improve the complexation ability. Below each MF are its name, SMILES notation, LASSO coefficient, and, in parentheses, the multilinear coefficient.

Standardization and Outlier Analysis
The standardization step on the SDF file did not remove any molecules. The threshold for each halogen atom was set at 6, the minimum molecular mass at 20, and the maximum at 900. The outlier analysis did not remove any compound for any dataset, and the final work set has all the molecules. The endpoints that were considered to be outliers were those that are higher than three σ away from the mean value.

Molecular Fragment Generation, Counting, and Preprocessing
For the CU, HEPT, and TIBO datasets, 5082, 7621, and 10,495 MFs, respectively, were generated, containing from three to 10 atoms. After the filtering step, the CU, HEPT, and TIBO datasets have 191, 128, and 226 MFs, respectively.

Descriptor Selection, Model Training, and Validation
A LASSO model was trained in order to identify the best set of MFs to describe the biological activity. There were 47 molecular fragments selected for the CU derivatives set, 31 for the HEPT set, and 18 for the TIBO set to describe the chemical space of the training set.
The selected MFs were used to train the machine learning models, and the hyperparameters were optimized using a grid search with repeated K-fold cross-validation. The optimized hyperparameters were used to build the individual models, the stacking, and the ensemble model. Table 3 shows the observed parameters for each model evaluation. For the cyclic urea derivatives, the LASSO model showed the best parameters for the test set (R² = 0.874, RMSE = 0.600) and a satisfactory description of the training set. For the HEPT compound set, the Random Forest (RF) shows the best parameters for the training and test sets (R² train = 0.966, RMSE train = 0.258, R² test = 0.948, and RMSE test = 0.330). For the TIBO group, the RF also shows the best parameters for the test set (R² = 0.822, RMSE = 0.657) and a satisfactory performance for the training set (Figure 11).
Figure 11. QSAR for the anti-HIV activity: observed vs. predicted anti-HIV activity (log(1/IC50)) for the training (upper right) and test (upper left) sets of cyclic urea (CU) derivatives using the LASSO model, for the training (center right) and test (center left) sets of HEPT derivatives using the RF model, and for the training (lower right) and test (lower left) sets of TIBO derivatives using the LASSO model.
Solov'ev and Varnek created several models to describe the same datasets of compounds [43]. The best model for each of the three datasets is summarized in Table 4.
Table 4. Statistical parameters calculated for each model in the prediction of anti-HIV activity for three different groups of molecules: cyclic ureas (CU), TIBO, and HEPT derivatives. The ISIDA results were obtained by Varnek's group [43]. For further details, see the Supplementary files.
The ISIDA software was used in Solov'ev and Varnek's work to describe the anti-HIV activity for the three datasets [43]. The CHARMING analysis clearly showed better results for the CU and HEPT datasets. Although CHARMING gave a better fit for the training set in the TIBO group, the best parameters for the test set were achieved using ISIDA.

By retrieving the coefficient assigned to each fragment by the LASSO model, the main structural patterns responsible for the anti-HIV activity were identified. Figure 12 shows an overview of the main MFs and the coefficients assigned during model elaboration for each dataset. The blue MFs have positive values; therefore, they contribute to increasing the log(1/IC50) endpoint (smaller IC50 values). On the other hand, the red-colored MFs have negative values and decrease log(1/IC50) (greater IC50 values). Interpreting the chemical information embedded in the fragments, we can see that the cyclic urea derivatives that have the 3-(aminomethyl)aniline pattern show smaller IC50 values than the CUs that have the 1-hydroxy-3-methoxy-5-methyl arene pattern. The HEPT compounds that show the 1,3-dimethylbenzene pattern have smaller IC50 values than the HEPT compounds that have the N-(2-hydroxymethoxy)methyl pattern. Finally, the TIBO compounds that have the N-(2-bromobenzyl)ethanamine pattern show smaller values than the TIBOs that have the 1H-imidazo(4,5-c)pyridin-2(3H)-one pattern.
Figure 12. Main molecular fragments (MFs) retrieved by the LASSO model for anti-HIV activity. The red-colored fragments have negative values and decrease the log(1/IC50) endpoint. The blue MFs have positive values and increase the log(1/IC50) endpoint. Below each MF are its name, SMILES notation, LASSO coefficient, and, in parentheses, the multilinear coefficient.

Final Considerations
Cheminformatic tools with predictive and qualitative models have proved to be valuable instruments in the development of biologically active compounds, helping to optimize compound potency, selectivity, and physicochemical properties. In this context, Charming QSAR & QSPR provides an accessible alternative for developing statistical models for QSAR and QSPR. The use of molecular fragments (MFs) to describe the chemical space and its relation to physical, chemical, and biological properties has been developed and exemplified using graph theory along with the SMILES notation. The application of MFs has the advantage of direct interpretability of the chemical information, which is encoded in the form of molecular patterns. Charming QSAR & QSPR was successfully applied to the prediction of the pIGC50 for T. pyriformis, the logK for the complexation of Eu³⁺ ions, and the log(1/IC50) for three sets of compounds with anti-HIV activity.