Modeling of Effectiveness of N3-Substituted Amidrazone Derivatives as Potential Agents against Gram-Positive Bacteria

Prediction of the antibacterial activity of new chemical compounds is an important task, due to the growing problem of bacterial drug resistance. Generalized linear models (GLMs) were created using 85 amidrazone derivatives based on the results of antimicrobial activity tests, determined as the minimum inhibitory concentration (MIC) against Gram-positive bacteria: Staphylococcus aureus, Enterococcus faecalis, Micrococcus luteus, Nocardia corallina, and Mycobacterium smegmatis. For the analysis of compounds characterized by experimentally measured MIC values, we included physicochemical properties (e.g., molecular weight, number of hydrogen donors and acceptors, topological polar surface area, compound percentages of carbon, nitrogen, and oxygen, melting points, and lipophilicity) as potential predictors. The presence of R1 and R2 substituents, as well as interactions between melting temperature and R1 or R2 substituents, were also considered. The set of potential predictors also included possible biological effects (e.g., antibacterial, antituberculotic) of the tested compounds calculated with the PASS (Prediction of Activity Spectra for Substances) program. Using GLMs with least absolute shrinkage and selection operator (LASSO), least-angle regression, and stepwise selection, statistically significant models with the optimal value of the adjusted determination coefficient and of seven other fit criteria (e.g., Akaike’s information criterion) were chosen. The most often selected variables were as follows: molecular weight, PASS_antieczematic, PASS_anti-inflam, squared melting temperature, PASS_antitumor, and experimental lipophilicity. Additionally, depending on the bacterial strain, the interactions between melting temperature and R1 or R2 substituents were selected, indicating that the relationship between MIC and melting temperature depends on the type of R1 or R2 substituent.


Introduction
The increasing number of bacterial infections is one of the challenges of modern medicine [1]. The search for new therapeutic substances remains a pressing task for many scientists. Despite huge developments in medicine, chemistry, and biochemistry, there are still rare diseases without effective therapies [2]. At the same time, the increasing drug resistance of pathogenic bacterial strains creates the risk that, in a few years, we will be left without an effective weapon against these microorganisms [3,4]. Different bacterial strains exhibit varying resistance to widely used drugs [5,6].
Bacterial drug resistance mechanisms have evolved due to the presence of selective pressure. The main known mechanisms of resistance include the limitation of drug absorption, modification of drug targets, drug inactivation, and active drug efflux. Antibiotic resistance in Gram-positive cocci remains a problem, and the selection pressure of antibiotics is one of the most important factors contributing to its spread. Methicillin-resistant Staphylococcus aureus (MRSA) and vancomycin-resistant enterococci (VRE), which cause nosocomial infections, are of particular concern. Additionally, many Gram-positive bacteria have high intrinsic antimicrobial resistance. The genetic and biochemical basis of antimicrobial resistance in these groups of bacteria is diverse and often varies within genera and/or species [7]. Therefore, effectively treating resistant bacterial infections is an important problem in contemporary medicine. Many antibiotics are no longer sufficiently effective, which has prompted the examination of the antibacterial activity of compounds isolated from plants [8,9] or forest-derived soil microorganisms [10] against significant bacteria.
Amidrazone derivatives are known for their wide biological activity, including antibacterial, antifungal, anti-inflammatory, cytoprotective, and anticancer effects [11]. Previous studies have demonstrated that unsubstituted amidrazones and their chloride or bromide salts exhibit good antibacterial activity. Due to our experience in the synthesis of N3-substituted amidrazone derivatives, compounds such as acyclic derivatives [12], 1,2,4-triazole derivatives [13], and cyclic imides [14] were selected for this research. In this study, we attempted to determine the influence of the R1 and R2 substituents and other structural factors of new N3-substituted amidrazone derivatives on their antibacterial activity against selected strains of Gram-positive bacteria.
The search for new drugs is a long and expensive process, with as many as 90% of promising substances being discarded for failing to meet the strict demands of several clinical trial phases [15]. Multidimensional statistical methods or machine learning procedures might be useful tools in the earliest stages of drug design [5,16,17]. These modern methods are increasingly used in areas such as the financial sector, energy sector, entertainment, and health care, in addition to the academic environment [18]. Artificial intelligence methods are increasingly used in structure-based drug discovery [19].
Innovative machine learning methods can be helpful in medicinal chemistry. These types of tools are increasingly used at the earliest stages of drug design, and their effectiveness is a key benefit of this approach. Such methods have been used to study the activity of synthetic and natural small molecules and peptides against Gram-negative and Gram-positive bacteria and mycobacteria, including multidrug-resistant strains [5]. Currently used antibiotics act on one of the pathways necessary for the survival of bacterial cells, including cell wall synthesis and the biosynthesis of nucleic acids or proteins. Due to rapidly developing antibiotic resistance (e.g., production of enzymes inactivating antibiotics, modifications in the targeted pathways), it is important to select new biochemical and therapeutic targets for antibacterial drugs [5,20]. An interesting approach is to block the biosynthesis of peptidoglycan, the main component of the bacterial cell wall, not by targeting membrane-bound extracellular enzymes but at the cytoplasmic stage of biosynthesis by inhibiting Mur enzymes, which are essential for bacterial survival [21].
A deep neural network was applied to find molecules that are structurally divergent from conventional antibiotics and display bactericidal activity against a wide phylogenetic spectrum of bacterial strains [22]. Various types of data serve as the basis of analysis using machine learning methods, including omics data [15]. Generalized linear models (GLMs) were applied to predict minimum inhibitory concentrations (MICs) based on growth curves [23].
In contrast, we decided to model the MIC values for five bacterial strains based on the results of our experiments, supplemented with theoretical data describing chemical structure, using a GLM [24,25]. GLMs with main effects and interaction analysis have been applied by other authors, but they focused on plant drugs against Pseudomonas fluorescens [8] and examined differences between groups and incubation times.
In the synthesis of new chemical compounds, it is essential to theoretically assess their biological properties, such as antibacterial activity, anti-inflammatory properties, anti-cancer potential, and others. Developing statistical models using measurements from experiments and theoretical values associated with chemical structure characterization may prove useful for this purpose. We are interested in identifying crucial factors for models that could facilitate the design of new chemical compounds with potential antibacterial activity.
The presented work aimed to create a model for the antibacterial activity of amidrazone derivatives by evaluating their MICs using generalized linear models (GLMs). Models of growth inhibition by eighty-five N3-substituted amidrazone derivatives were examined for the following five strains of Gram-positive bacteria: Staphylococcus aureus, Enterococcus faecalis, Micrococcus luteus, Nocardia corallina, and Mycobacterium smegmatis.

Results
The models for predicting MICs, an in vitro measure of the pharmacodynamic potency of the drug [26], were developed for five bacterial strains. Experimental and theoretical data from 85 N3-substituted amidrazone derivatives, including derivatives of 1,2,4-triazole and cyclic imides (general structures are shown in Figure 1, full formulas in Figures S1-S3 in the Supplementary Materials), were used to build models.
Molecules 2024, 29
The influence of the R1 and R2 substituents was analyzed in the models; the influence of the R3 and R4 substituents was omitted because their large number of categories would have caused greater computational complexity.
MIC values for the five bacterial strains were the response variables in building the GLMs. For each Gram-positive bacterial strain, least absolute shrinkage and selection operator (LASSO), least-angle regression (LAR), and stepwise selection were applied to create GLMs. The meaning of the potential variables and the methods of their calculation are given in Table 1. Variables such as molecular weight (MW), the theoretical measure of lipophilicity (miLOGP), hydrogen donors (Donors_H), hydrogen acceptors (Acceptors_H), and topological polar surface area (TPSA) [27] were calculated with Molinspiration online software [28] using the Simplified Molecular Input Line Entry System (SMILES) codes of the studied compounds. SMILES codes and the percentages of carbon, nitrogen, and oxygen in the compounds were generated using the ChemSketch program. Variables denoting biological activity, namely antibacterial (PASS_antibact), anti-inflammatory (PASS_anti-inflam), antieczematic (PASS_antieczematic), antitumor (PASS_antitumor), and antituberculosis (PASS_antituberculosi), were calculated with PASS Online software version 2.0 [29,30], using the SMILES codes. Variables such as PASS_anti*PASS_antib, meltingTemp*R1_substituent, meltingTemp*R2_substituent, and meltingTe*meltingTem were calculated directly by the Statistical Analysis System (SAS) [31]. The remaining two variables, melting point (meltingTemp) and experimental lipophilicity (RMoExper), were collected through experiments. Experimental lipophilicity values were evaluated using reversed-phase thin-layer chromatography (TLC) [32]. Discrete descriptors such as R1 and R2 were incorporated into this study as binary dummy variables.
Estimated parameters for the best models, according to fit and performance measures such as the adjusted determination coefficient (Adj R²), Akaike's information criterion (AIC), corrected Akaike's information criterion (AICC), Mallows' C(p) statistic, Sawa's Bayesian information criterion (BIC), the Schwarz Bayesian criterion (SBC), the predicted residual sum of squares (PRESS), and the mean square error on the validation set (described in Section 4), are presented in Tables 2-7, while the remaining Tables S1-S24 can be found in the Supplementary Materials. In Tables S2-S16, fit statistics for the models are given with F-values and p-values from the analysis of variance (ANOVA).
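The dummy coding of a substituent and its interaction with melting temperature can be illustrated with a minimal Python sketch. It is not the SAS procedure used in the study: the function name `design_row`, the melting temperature of 150.0, and the second category `phenyl` are hypothetical, and the column names only imitate the variable labels used in the text.

```python
def design_row(melting_temp, r2_substituent, r2_levels):
    """Build the melting-temperature and R2 columns for one compound.

    r2_levels lists the non-reference categories (an assumption of this
    sketch); a compound whose substituent is not listed falls into the
    reference category, so all of its dummies are 0.
    """
    row = {
        "meltingTemp": melting_temp,
        "meltingTe*meltingTem": melting_temp ** 2,  # squared melting temperature
    }
    for level in r2_levels:
        dummy = 1 if r2_substituent == level else 0
        row[f"R2_substituent_{level}"] = dummy
        # Interaction column: melting temperature counts only for this category.
        row[f"meltingTemp*R2_substituent_{level}"] = melting_temp * dummy
    return row


# Hypothetical compound with a 4-nitrophenyl R2 substituent.
example = design_row(150.0, "4-nitrophenyl", ["4-nitrophenyl", "phenyl"])
```

Each interaction column equals the melting temperature for compounds in that category and zero otherwise, which is why its coefficient acts as a category-specific correction to the melting-temperature slope.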
The models for the analyzed bacterial strains differed in the number of selected important variables and in the quality of prediction, as measured by the adjusted coefficient of determination (Adj R²) and the other fit and performance criteria. Therefore, the details of the outcomes are presented in subsections. The unstandardized estimate of the i-th coefficient in the models is denoted by b_i, while the standardized estimate of the i-th coefficient is denoted by β_i.
Thus, after GLM selection, MIC (M) was modeled by an equation with unstandardized coefficients, as follows:

$$M = b_0 + \sum_{i=1}^{p} b_i x_i,$$

or with standardized coefficients (without intercept), as follows:

$$M = \sum_{i=1}^{p} \beta_i x_i,$$

where b_0 denotes the intercept and the summing is performed from one to the number of selected predictors, p.
Unstandardized coefficients (given in the left part of Tables 2-7) are useful in the biochemical interpretation of the models obtained as GLM results. In a model without an interaction term, the unstandardized coefficient b_i is the average change in the MIC variable (denoted as M in the above equations) for a one-unit increment in the explanatory variable (main effect) x_i, with the other variables unchanged. A positive b_i means that MIC increases on average by b_i when the explanatory variable increases by one unit; a negative b_i means a decrease. This also holds for melting temperature, provided that the selected model contains no interaction of melting temperature with the R1 or R2 substituent. However, the explanatory variables are measured on different scales, so unstandardized coefficients depend on those scales. To compare the importance of variables (or effects) from a chemical point of view, standardized coefficients in the GLM (presented numerically or graphically) are used for inference. A standardized coefficient is obtained by rescaling the unstandardized coefficient with standard deviations: b_i is multiplied by the standard deviation (SD) of the respective explanatory variable and divided by the SD of the response. Standardized coefficients are interpreted similarly, but in terms of changes measured in SDs. Specifically, for a numerical explanatory variable, a one-SD increase in the predictor is associated on average with a change of β_i SDs in MIC (an increase for positive β_i and a decrease for negative β_i; see the right part of Tables 2-7), assuming the other variables remain unchanged (and again assuming no interaction of this variable with the R1 or R2 substituent in the selected model). When the selected model contains an interaction between variable x_i and a substituent effect, R1 or R2, this interpretation (for both unstandardized and standardized estimates) becomes slightly more complicated. In such cases, the coefficient b_j (β_j) of the interaction term of the numerical variable x_i with the discrete effect should be added to b_i (or β_i, respectively) to estimate the change in MIC when x_i changes by one unit (or one SD, respectively). Furthermore, the importance of the i-th effect is measured by the absolute value of the standardized coefficient, |β_i|, and the sign of β_i indicates the direction of the change in MIC. A numerical variable with a larger absolute standardized coefficient has a greater impact on the predicted MIC. Standardized coefficients β_i are therefore valuable for comparing the impact of the numerical explanatory variables x_i (effects) on MIC.
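The two rules described above, rescaling a coefficient and adding an interaction coefficient to a main effect, can be sketched in Python. This is an illustration only, under the assumption that standardization follows the common convention β_i = b_i · SD(x_i) / SD(MIC); the function names and all numbers are hypothetical and are not taken from Tables 2-7.

```python
from statistics import stdev


def standardized_coefficient(b, x_values, y_values):
    """Rescale an unstandardized slope b to SD units (assumed convention)."""
    return b * stdev(x_values) / stdev(y_values)


def effective_slope(main_coef, interaction_coefs, present_levels):
    """Slope of a numerical predictor for a compound with given substituents.

    interaction_coefs maps a substituent category to the coefficient of its
    interaction term with the predictor; categories without an interaction
    term in the model contribute nothing.
    """
    return main_coef + sum(interaction_coefs.get(level, 0.0)
                           for level in present_levels)


# Hypothetical data: y changes by 2 units per unit of x.
beta = standardized_coefficient(2.0, [1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Hypothetical model: main effect 0.2, interaction 0.5 for 4-nitrophenyl.
slope_nitro = effective_slope(0.2, {"4-nitrophenyl": 0.5}, ["4-nitrophenyl"])
slope_other = effective_slope(0.2, {"4-nitrophenyl": 0.5}, ["phenyl"])
```

In this toy case the slope of the predictor is 0.7 for compounds carrying the category with an interaction term and 0.2 for all others, which is the sense in which the effect of the predictor depends on the substituent.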
Only significant models, with p-values of the F-statistic below the 0.05 level, were considered when analyzing the effects selected for the prediction of antibacterial activity, and models with the optimal value of adjusted R² (Adj R²) and of seven other fit criteria were chosen for further consideration (Figures 2-13, Tables 2-7).
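For orientation, some of the fit criteria named above can be computed from observed and fitted values as in the following Python sketch. The formulas are common textbook forms and are an assumption here: statistical packages, including SAS, may add constants or use slightly different definitions, so absolute values need not match the tables; only comparisons between models fitted to the same data are meaningful. The function name and the data are hypothetical.

```python
import math


def fit_criteria(y, y_hat, n_predictors):
    """Return R2, adjusted R2, and simple AIC/SBC forms for a linear model."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # residual SS
    sst = sum((obs - mean_y) ** 2 for obs in y)                # total SS
    r2 = 1 - sse / sst
    # Adjusted R2 penalizes the number of predictors (intercept excluded).
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    k = n_predictors + 1  # estimated parameters, including the intercept
    aic = n * math.log(sse / n) + 2 * k
    sbc = n * math.log(sse / n) + k * math.log(n)
    return {"r2": r2, "adj_r2": adj_r2, "aic": aic, "sbc": sbc}


# Hypothetical observed and fitted values for a one-predictor model.
criteria = fit_criteria([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], 1)
```

Adj R² is maximized, whereas AIC and SBC are minimized, which is why different criteria can stop the selection at different steps.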

Models for Staphylococcus aureus
According to Tables S1-S3, considering the determination coefficients R² with Adj R² values and other fit criteria, three selection methods (LASSO, LAR, and stepwise) for modeling the inhibition of S. aureus are worth considering. Therefore, these models are discussed in detail in Sections 2.1.1-2.1.3.

Models for S. aureus Selected by Adaptive Least Absolute Shrinkage and Selection Operator (LASSO)
According to Table S1, the best results for S. aureus (n = 83) for LASSO selection were obtained using several criteria, such as Adj R², AIC, and C(p), which created equivalent models. Specifically, 51.6% of the variance in antimicrobial activity can be explained by the same 11 variables selected for models M1, M2, and M5 (by the Adj R², AIC, and C(p) criteria, respectively), while 45.4% of the variance is explained by the 7 variables creating models M3 and M6 (based on the AICC and SBC criteria, respectively). Additionally, 49.8% of MIC variation is explained by 12 variables, selected according to the ASE Validation criterion, creating model M7 (Table S1).
The best models are selected for visualizing the selection steps, demonstrating how the standardized parameters change during the process. The sequential selection steps with the final models are presented in Figure 2 and other analogous figures. Each line characterizes one explanatory variable and its importance in the model at each step. In the plots, we can observe at which step each variable enters or leaves the model and to what extent it affects the predicted variable as the model is built. Standardized coefficients show the comparable importance of variables, and the step defining the final model is denoted by a vertical line. At the end of the selection process, we can compare the magnitude and sign of the standardized coefficients of the variables chosen for the final model (Table 2). In Figure 2 (and in other analogous figures), the horizontal axis shows the numbers of consecutive steps, a positive or negative sign (+ or −) indicating the addition or removal of a variable, and the variable name (the names are described in Section 4).
Figure 3 and other analogous figures serve as counterparts to Figure 2 and its analogous figures. They depict the values of all examined fit measures (Adj R², AIC, AICC, BIC, C(p), SBC, ASE Val, and PRESS) at the consecutive steps of the model building presented in Figure 2 (and its analogues), with the selected criterion statistic marked (or several criteria, if they lead to the same model). The maximum Adj R² and the minimum AIC and C(p) occur at step 15, where 11 variables are chosen in the model. Models that coincide although they were created by optimizing different fit criteria can be identified in Table 2 (or an analogous table) by the same finally chosen variables and the same final coefficients (standardized or unstandardized). Equivalent models are also confirmed in Tables S1-S3 (by the same final fit measures in corresponding columns).
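Choosing the final step from such a sequence of fit measures amounts to locating the maximum (for Adj R²) or the minimum (for AIC, C(p), and similar criteria) along the steps. A minimal Python sketch follows; the per-step values are hypothetical and are not read from Figure 3, and numbering the steps from 1 is an assumption of this sketch.

```python
def best_step(values, maximize):
    """Return the 1-based step at which a fit criterion is optimal.

    If several steps tie, the earliest one is returned.
    """
    chooser = max if maximize else min
    best_index = chooser(range(len(values)), key=lambda i: values[i])
    return best_index + 1


# Hypothetical criterion values recorded after each selection step.
adj_r2_by_step = [0.10, 0.25, 0.31, 0.30]
aic_by_step = [120.0, 110.5, 108.2, 109.0]

step_adj_r2 = best_step(adj_r2_by_step, maximize=True)
step_aic = best_step(aic_by_step, maximize=False)
```

When two criteria point to the same step, as here, they define the same model, which corresponds to the equivalent models reported in the text.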
The values of the unstandardized coefficients presented in Table 2 make it possible to predict MICs for given values of the variables. Each unstandardized coefficient represents the change in MIC for a one-unit change in the corresponding variable when the other numerical variables are held unchanged (e.g., at their mean values).
According to the left part of Table 2, the equation for M with unstandardized coefficients may be written. Similarly, the equation for M with standardized coefficients (without intercept) from Table 2 may be written.
Interaction terms between meltingTemp and R1 or R2 are included in the model (interaction is marked by *).The existing interaction with R1 (or R2) means that the impact of melting temperature on MIC depends on the category of R1 (or R2) substituent.Simultaneously, the same interaction coefficient means that the impact of the R1 (or R2) substituent on MIC depends on the value of the melting temperature.
Assuming that the other variables remain unchanged, taking R1_substituent as 2-pyridyl, and taking into account the interaction of melting temperature with the R1 and R2 substituents, the effect on MIC of an increase in melting temperature by one standard deviation (SD) depends on the R2 substituent. According to Table 2 (standardized coefficients), the variables with the greatest impact on MIC are as follows: meltingTemp*R2_substituent_4-nitrophenyl and PASS_anti-inflam (with positive signs, β = 0.862991 and 0.325141), and R2_substituent_4-nitrophenyl and PASS_antitumor (with negative signs, β = −0.648931 and −0.391309). A lower MIC value indicates better experimental antibacterial activity. Therefore, for instance, an increase in PASS_antitumor has a considerable impact on improving the growth inhibition of S. aureus, assuming that the other ten variables remain unchanged.
The "story" of creating the final model is visible in Figure 2. For example, Figure 2 indicates that TPSA is included in step 7 (with a small negative standardized coefficient) and removed in step 10, only to be added again in step 13. Consequently, in the final model, TPSA is present with a clearly positive coefficient (0.178460; see Table 2). Notably, the very important R2_substituent_4-nitrophenyl (β = −0.648931) is added only in the penultimate step (14). The interaction effect meltingTemp*R2_substituent_4-nitrophenyl has remained in the adaptive LASSO model since step 4 (β = 0.862991). This high coefficient indicates considerable interaction between melting temperature and the R2 substituent.
The highest mean selection percentages (87% and 82.1%) by seven criteria from 1000 bootstrap samples from the dataset involve PASS_antitumor and the interaction of melting point with R2_substituent (4-nitrophenyl) (Table 2).
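The idea behind a bootstrap selection percentage can be shown with a small Python sketch: the data are resampled with replacement, a selection rule is rerun on each resample, and the share of resamples in which a variable is chosen is reported. The selection rule below (keeping predictors whose absolute Pearson correlation with the response reaches a threshold) is a deliberately simple stand-in and is not LASSO, LAR, or stepwise selection; all names and data are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sample is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select(columns, y, threshold=0.5):
    """Stand-in selection rule: keep predictors with |r| >= threshold."""
    return {name for name, x in columns.items()
            if abs(pearson(x, y)) >= threshold}


def selection_percentages(columns, y, n_boot=1000, seed=0):
    """Percentage of bootstrap resamples in which each predictor is selected."""
    rng = random.Random(seed)
    n = len(y)
    counts = {name: 0 for name in columns}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        y_b = [y[i] for i in idx]
        cols_b = {name: [x[i] for i in idx] for name, x in columns.items()}
        for name in select(cols_b, y_b):
            counts[name] += 1
    return {name: 100.0 * c / n_boot for name, c in counts.items()}


# Hypothetical data: one informative predictor and one constant column.
y_demo = [float(v) for v in range(20)]
cols_demo = {"signal": [2 * v + 1 for v in y_demo], "flat": [1.0] * 20}
pct = selection_percentages(cols_demo, y_demo, n_boot=200)
```

A high percentage indicates that the variable is selected regardless of which compounds happen to be in the sample, which is how the reported percentages can be read as a measure of selection stability.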

Models for S. aureus Selected Using the Least-Angles Regression Method
According to the determination coefficients presented in Table S2, the LAR model selection was also chosen for further analysis with different fit criteria. The LAR method of variable selection explains 49.78% of MIC variation using 11 variables according to the Adj R² criterion, forming the M1 model. A lower percentage (46.8%) of MIC variation is explained by the M2-M4 models (based on the AIC, AICC, and BIC criteria, respectively), which share the same eight selected variables (Table S2).
The creation of a model based on the Adj R² criterion is presented in Figure 4. The optimal Adj R² is obtained after 11 selection steps, where one variable is added at each step.
Figure 5 shows the standardized coefficients of all the effects selected at some step of the LAR method, plotted as a function of the step number. According to the standardized coefficients (also see the right part of Table 3), the effects with the highest impact on the models for MIC are meltingTemp, meltingTemp*R2_substituent_4-nitrophenyl, PASS_anti-inflam, and PASS_antieczematic (with positive coefficients β = 0.435941, 0.243939, 0.265515, and 0.127085, respectively), as well as PASS_antitumor and meltingTe*meltingTem (with negative coefficients β = −0.431145 and −0.394171, respectively).
In the LAR model, both the interaction between melting temperature and the R1 substituent (step 9) and the interaction between melting temperature and the R2 substituent (step 4) were selected. Interaction effects indicate that the R2 substituent or R1 substituent variable influences the relationship between melting temperature and the MIC variable. The highest mean selection percentages (86.01% and 65.09%) from eight criteria from 1000 bootstrap samples from the dataset are for MW and meltingTemp (Table 3).

Models for S. aureus Selected Using Stepwise Procedure
According to Table S3, the best model obtained with the stepwise procedure is achieved using Adj R² and five other fit criteria, which result in equivalent models. The stepwise selection method yields models explaining 44.15% of MIC variation, based on only four variables, namely the M1-M5 and M7 models (i.e., for Adj R², AIC, AICC, BIC, C(p), and PRESS, respectively). Model M6, based on the SBC criterion with three variables selected, explains 42.56% of the variability, while model M8 (ASE Val criterion) explains 43.21% with only two variables selected, namely PASS_antitumor and PASS_anti-inflam (Table S3).
We focus on the equivalent models obtained after only six steps of stepwise selection (Figures 6 and 7) according to the Adj R², AIC, AICC, BIC, C(p), and PRESS criteria for S. aureus (Adj R² = 0.4129). No interaction term is selected (Table 4, Figures 6 and 7), so the interpretation of the model is straightforward. Thus, according to the unstandardized coefficients (left side of Table 4), MIC (M) can be estimated as the following function of the four selected variables:

$$M = b_0 + 732.638700\,\text{PASS\_anti-inflam} + 76.605993\,\text{miLOGP} - 1633.785168\,\text{PASS\_antitumor} - 29.306596\,\text{perc\_C},$$

where b_0 denotes the intercept estimate. The model for MIC can be interpreted as follows: assuming the three other variables remain unchanged, an increase of one unit in PASS_anti-inflam causes an increase of 732.638700 in MIC. Similarly, assuming the other variables remain unchanged, an increase of one unit in miLOGP leads to an increase of 76.605993 in MIC.
However, the coefficients are negative for PASS_antitumor and perc_C. Therefore, assuming the other three variables remain unchanged, an increase of one unit in PASS_antitumor results in a decrease of 1633.785168 in MIC, and an increase of one unit in perc_C leads to a decrease of 29.306596 in MIC.
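The interpretation above can be checked numerically. The following Python sketch uses the four unstandardized coefficients quoted in the text and computes the predicted change in MIC for chosen changes in the predictors, with all unlisted predictors held unchanged. The function name `delta_mic` is hypothetical, and the intercept is not needed because only differences are computed. The example change of 0.1 assumes that PASS scores are probabilities on a 0-1 scale, in which case a full one-unit change spans their whole range.

```python
# Unstandardized coefficients of the stepwise model for S. aureus,
# as quoted in the text (left side of Table 4).
STEPWISE_COEFS = {
    "PASS_anti-inflam": 732.638700,
    "miLOGP": 76.605993,
    "PASS_antitumor": -1633.785168,
    "perc_C": -29.306596,
}


def delta_mic(changes):
    """Predicted change in MIC for the given changes in predictors.

    changes maps a predictor name to its change; predictors that are not
    listed are treated as unchanged.
    """
    return sum(STEPWISE_COEFS[name] * delta for name, delta in changes.items())


# An increase of 0.1 in PASS_antitumor alone lowers the predicted MIC.
change_antitumor = delta_mic({"PASS_antitumor": 0.1})

# One additional percentage point of carbon alone.
change_carbon = delta_mic({"perc_C": 1.0})
```

Because the model contains no interaction term, the contributions of several simultaneous changes simply add up.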
It should be noted that the stepwise model selects miLOGP, the theoretical counterpart of the experimental lipophilicity (RmoExper), whereas RmoExper itself is not selected.
The highest mean selection percentage (29.9%), determined by eight criteria from 1000 bootstrap samples from the dataset, corresponds to PASS_anti-inflam (Table 4).
Summarizing the results from Section 2.1, the variables that exert the greatest impact on compound activity in inhibiting the S. aureus bacterial strain, namely PASS_antitumor and PASS_anti-inflam, are selected in all three types of models. Additionally, for models with more than four variables, the interaction of melting point with the R2 substituent (4-nitrophenyl) is important (Tables 2-4). The variables most frequently selected are similar across the LASSO, LAR, and stepwise selection methods (Tables 2-4, Figures 2, 4 and 6).

Models for Nocardia corallina
According to Tables S4-S6, the best models for N. corallina, with the optimal adjusted determination coefficient (Adj R²) and other fit criteria such as AIC, AICC, C(p), and SBC, were obtained by stepwise selection; they are presented in Table 5 and Figures 8 and 9. The remaining models (LASSO and LAR) are provided in Tables S16 and S17.
Figure 8 presents the changes in standardized parameters during the selection process. The best model for N. corallina with eight variables (12 effects together with intercept) was obtained after 10 steps of stepwise selection based on the Adj R² criterion (Figures 8 and 9). The variable perc_N was included in the 6th step but removed in the 9th step. Interaction effects indicate that the R2 substituent or R1 substituent variable influences the relationship between melting temperature and the MIC variable. The most important variables, according to standardized coefficients, are squared melting temperature and the interaction between melting temperature and the R2 substituent (Figure 8 and Table 5). Variables such as MW (with a negative coefficient) and miLOGP (with a positive coefficient) are slightly less important. Other variables in the model, such as PASS_anti-inflam, PASS_antieczematic, and PASS_antitumor, have smaller standardized coefficients (see Table 5).
The highest mean selection percentages (61.8% and 42.5%) by eight criteria from 1000 bootstrap samples from the dataset belong to PASS_antieczematic and MW (Table 5).

Models for Micrococcus luteus
Based on Tables S7-S9 and the optimal values of the Adj R², AIC, BIC, C(p), and PRESS criteria, stepwise models were selected for further detailed examination. The parameter estimates for the chosen stepwise selection are presented in Figures 10 and 11 and Table 6, while the parameters of the remaining models (obtained by LASSO and LAR selection) are provided in the Supplementary Materials (Tables S18 and S19).
The best models for M. luteus were obtained with stepwise selection according to the Adj R², AIC, BIC, C(p), and PRESS criteria. This set of models, comprising M1-M2, M4-M5, and M7, uses the same subset of five variables, explaining 41.49% of MIC variability (Adj R² = 0.3564). No interaction term is selected, which simplifies the interpretation of the MIC model. According to the equation provided, assuming the other variables remain unchanged, an increase of one unit in any variable x_i causes a change in MIC equal to b_i. For example, an increase of one unit in MW results in a decrease in MIC of 3.248648, while a one-unit increase in RmoExper leads to an increase in MIC of 251.327244.
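The unit-change interpretation above can be illustrated with a short sketch. The two coefficients are the ones quoted in the text; the dictionary and function names are hypothetical, and the remaining model terms are omitted because they cancel when only one predictor changes.

```python
# Coefficients quoted in the text for the M. luteus stepwise model
# (change in MIC per one-unit change in the predictor).
COEFFICIENTS = {"MW": -3.248648, "RmoExper": 251.327244}


def mic_change(variable: str, delta: float = 1.0) -> float:
    """Change in predicted MIC when one predictor shifts by `delta`,
    with all other predictors held fixed (linear model, no interactions)."""
    return COEFFICIENTS[variable] * delta


if __name__ == "__main__":
    for name in COEFFICIENTS:
        print(f"{name}: +1 unit -> MIC changes by {mic_change(name):+.6f}")
```

Because no interaction term was selected for M. luteus, the effect of each predictor does not depend on the values of the others, which is what makes this one-line calculation valid.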
Selected effects with larger absolute values of standardized coefficients have a greater influence on the dependent variable. Thus, according to the standardized coefficients in Table 6, RmoExper (with a positive coefficient), together with MW and PASS_antieczematic (with negative coefficients), have the greatest impact in the model. RmoExper has the highest mean selection percentage (88.91%) across the eight criteria in 1000 bootstrap samples drawn from the dataset (Table 6).

Models for Enterococcus faecalis
For E. faecalis, according to the standardized coefficients, the most important variable is PASS_antibact, in both the first and second power (Table 7, Figures 12 and 13). MW and PASS_anti-inflam (with negative coefficients of −0.472224 and −0.329342) and RmoExper (with a positive coefficient of 0.535104) are also important. The interaction of melting temperature and the R1 substituent, with all coefficients positive, should also be noted.
RmoExper and MW have the highest mean selection percentages (94.21% and 80.76%, respectively) across the eight criteria in 1000 bootstrap samples drawn from the dataset (Table 7).

Models for Mycobacterium smegmatis
For M. smegmatis, the fit criteria obtained for the selected models are not satisfactory (Tables S13-S15). For example, in the best LASSO model, a rather large number of effects (12) explained only 26.46% of MIC variability (model M1, with a small Adj R² = 0.1538, Table S13; parameter estimates in Table S22). The nine-effect LAR models M2-M5 explain 23.07% of the variability in M. smegmatis inhibition (Adj R² = 0.0989, Tables S14 and S24). The remaining LASSO and LAR models, selected by the other criteria, and all stepwise models are not useful (Tables S13-S15).
Mycobacteria are evolutionarily classified as Gram-positive bacteria, but the architecture of their cell wall is more complex. The outer membrane contains a variety of lipids necessary for the survival and virulence of pathogenic species. The permeability barrier of the outer membrane is a major determinant of resistance to many antibiotics, especially in slow-growing mycobacteria.
It has been shown that, in Mycobacterium smegmatis, porins play an important role in the transport of small and hydrophilic β-lactam antibiotics through the outer membrane. Hydrophobic antibiotics like moxifloxacin, in contrast to norfloxacin, were more effective in inhibiting the growth of Mycobacterium smegmatis, probably due to better diffusion through the lipid membrane. Structural models showed that drug molecules that were too large (e.g., erythromycin, kanamycin, and vancomycin) did not pass through porin channels in this bacterial strain [34]. The distinctive biochemical characteristics of Mycobacterium smegmatis among the five examined bacterial strains might be reflected in the different model results, including different numbers of selected variables and worse final fit measures for M. smegmatis.

Discussion
In papers using machine learning for drug discovery [5,17], the authors only classified compounds into groups (e.g., whether the drug works or not), whereas we modeled concrete MIC values, selecting the important variables and indicating the direction of each effect. Moreover, we modeled new chemical compounds synthesized in our laboratory, in contrast to the compounds from the large chemical database analyzed in [5]. Both chemical structure and experimental results were included in the selection procedures, additionally considering the interaction between structure and experimental melting point.
The models for the tested bacterial strains differed from each other; however, we can point to some variables that are important in the models selected for S. aureus, E. faecalis, M. luteus, and N. corallina. The overlap of selected variables for each of the five bacterial strains and the LASSO, LAR, and stepwise selection procedures is presented in Tables 2-7 and Tables S16-S24, with estimated parameters according to models based on the Adj R², AIC, AICC, BIC, C(p), SBC, average squared error on the validation set (ASE Val), and PRESS criteria.
The appearance of variables (main effects) in the selected models for S. aureus (SA), N. corallina (NC), M. luteus (ML), E. faecalis (EF), and M. smegmatis (MS) is given in Supplementary Table S25 (summarizing findings from Tables 2-7 and Tables S16-S24). The sum of the model numbers with positive and negative signs occurring at least once in each of the 15 tables (for five bacterial strains and three selection procedures: LASSO, LAR, and stepwise) is presented for each main effect in the last column. According to this column, the most often selected variables are molecular weight, PASS_antieczematic, PASS_anti-inflam, squared melting temperature, PASS_antitumor, and RmoExper.
After removing the weakest models, those for M. smegmatis, from this summary and analyzing the remaining 12 tables (Tables 2-7 and Tables S16-S21), PASS_anti-inflam, molecular weight, PASS_antieczematic, squared melting point, RmoExper, and PASS_antitumor are still the most frequently occurring main effects, which confirms the importance of these variables (Table S26).
Models for some bacteria selected the interaction terms. Interaction effects indicate that the categorical R1 (or R2) substituent variable influences the relationship between melting temperature and MIC. For example, in the selected S. aureus LASSO models (M1-M2, M5), the interactions of melting temperature with both the R1 and R2 substituents have an impact on MIC. For the LAR model (M1), only the interaction between melting temperature and the R2 substituent is chosen. For E. faecalis, the selected stepwise models (M1-M5, M7) include the interaction between melting temperature and the R1 substituent, whereas for N. corallina, the selected stepwise models (M1-M3, M5-M6) include the interaction between melting temperature and the R2 substituent.
For the best models presented in the main part of the article (Tables 2-7), Venn diagrams were prepared (Figures 14-16). For S. aureus, the variables common to all LAR, LASSO, and stepwise models are PASS_anti-inflam and PASS_antitumor (Figure 14). This set of two common variables is marked as "2" in the middle of Figure 14. For S. aureus, the number of variables shared by the LASSO model with 11 selected variables (SA_LAS_M1_p11; yellow circle in Figure 14) and the LAR models with 11 variables (SA_LAR_M1_2_5_p11; blue circle in Figure 14) equals the sum of the intersection counts, 2 + 4 = 6. According to Tables 2 and 3, these variables are MW, PASS_anti-inflam, PASS_antitumor, perc_N, the interaction of melting point with R2_substituent (4-nitrophenyl), and melting point squared. Among these variables, the signs, which indicate the direction of impact on MIC, are concordant for MW and PASS_antitumor (negative sign) and for PASS_anti-inflam, perc_N, and the interaction of melting point with R2_substituent 4-nitrophenyl (positive sign). When the smaller stepwise model for S. aureus with only four variables is added, the variables common to the LAR (M1, M2, M5; 11 variables), LASSO (M1; 11 variables), and stepwise (M5-M7; 4 variables) models are PASS_anti-inflam (positive sign) and PASS_antitumor (negative sign) (Figure 14, Tables 2-4).
After adding the best models for N. corallina (NC) and E. faecalis (EF) to the LASSO (LAS), LAR, and stepwise (ST) models for S. aureus (SA), we obtained the next Venn diagram (Figure 15). For this wider set of models across bacterial strains (including the added stepwise models M1-M3 with 11 effects for N. corallina and M1-M5 and M7 with 10 effects for E. faecalis), one common variable is observed: PASS_anti-inflam (Tables 2-5 and 7). This one-variable intersection of the selected sets of variables is marked as "1" in the middle of Figure 15.
MW (always with a negative sign) is the common variable chosen in models with at least seven effects selected, i.e., across four bacteria strains: S. aureus (SA), N. corallina (NC), E. faecalis (EF), and M. luteus (ML), as presented in Figure 16 (see Tables 2, 3 and 5-7).
The presented tables (Tables 2-7) indicate that SBC selects no more variables than AIC and, most often, fewer. Similarly, the BIC criterion selects no more variables than AIC, and often fewer. This can be explained by the formulas in Section 4.4, which demonstrate that the BIC is more parsimonious, i.e., it penalizes models for free parameters more strictly [23].
The average squared error validation (Val ASE) method makes it possible to assess the validity of effects (parameters) estimated on a random subset (70 percent of all tested chemicals) by applying them to the remaining 30 percent of the data, which replaces testing on a new independent sample. This validation (Val ASE) was performed, and the results are presented as model 7 in Tables 2 and 3 (or model 8, in the case of the stepwise method, in Tables 4-7). Moreover, the values of the adjusted coefficient of determination for the validation sample are usually similar to those of models based on all compounds.
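The 70/30 validation idea can be sketched in a few lines. This is only an illustration under assumptions: the function names are hypothetical, the split is a plain random shuffle, and the exact partitioning performed by the software used in the study is not described in the text.

```python
import random


def split_70_30(n_compounds: int, seed: int = 0) -> tuple[list[int], list[int]]:
    """Randomly assign compound indices to a 70% training and a 30% validation subset."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    cut = int(0.7 * n_compounds)  # size of the training subset, rounded down
    return indices[:cut], indices[cut:]


def average_squared_error(observed: list[float], predicted: list[float]) -> float:
    """ASE: mean of squared differences between observed and predicted MIC values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
```

A model would be fitted on the training indices only, and `average_squared_error` would then be evaluated on the validation indices, so that the error is measured on compounds the fit has not seen.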
Additionally, to maximize the use of the dataset, bootstrap sampling of compounds with 1000 replications was used to assess the frequency of variable selection in the models. These selection frequencies largely confirm the validity of the parameters. A high bootstrap selection rate of a variable is often accompanied by a high absolute value of its standardized coefficient (e.g., meltingTemp*R2_substituent_4nitrophenyl in Table 2, RmoExper in Table 6, RmoExper and MW in Table 7).
When interpreting the impact of the selected variables on MIC, assuming the other variables are unchanged, negative coefficients indicate a decrease in MIC, while positive coefficients indicate an increase (see Equations (1) and (2)). The impact of individual variables on the antibacterial activity of compounds depends on the bacterial strain. Variables are considered important when they have the largest positive or the most negative coefficients. Assuming that the remaining variables are constant, the following influence of variables on the antibacterial activity of compounds can be observed in the models below.
For the LASSO models M1-M6 of S. aureus (Table 2, Figure 2), the results are as follows:
1. PASS_antitumor: an increase in this variable lowers the MIC value.
3. MW: as the molar weight increases, the MIC value decreases.
4. MeltingTemp*R2_substituent: assuming a constant melting point, the influence of different R2 substituents can be arranged in the order of having a favorable effect on the MIC value as follows: 2-pyridyl > 4-methylphenyl > 4-nitrophenyl.
5. MeltingTemp: an increase in this parameter increases the MIC value for all kinds of R1 and R2 substituents.
For the LAR models of S. aureus (Table 3), the results are as follows:
PASS_antitumor: an increase in this variable lowers the MIC value.
3. MeltingTemp: after summing the effects of meltingTemp, meltingTemp squared, and the interaction with the R2 substituent, an increase in meltingTemp causes an increase in the MIC value, which is larger for 4-nitrophenyl than for the remaining kinds of R2 substituents.
5. PASS_antieczematic: as this variable increases, the MIC value increases.
6. MW: as the molar weight increases, the MIC value decreases.
For the stepwise models of S. aureus (Table 4), the results are as follows:
PASS_antitumor: as this variable increases, the MIC value decreases.
3. perc_C: an increase in the percentage of carbon in a compound molecule causes a decrease in the MIC value.
4. miLOGP: lowering the lipophilicity value results in lower MIC values.
The importance of PASS_antitumor and PASS_anti-inflam, with the described direction of impact on MIC in the M1-M5 and M7 models, is confirmed by the validation ASE model, which has a very close value of Adj R² (41.15).
For the stepwise models of N. corallina (Table 5), the results are as follows:
1. MW: as the molar weight increases, the MIC value decreases.
2. MeltingTemp*R2_substituent: assuming the same melting point, the influence of different R2 substituents can be arranged in the order of having a beneficial effect on the MIC value as follows: 4-methylphenyl > 4-nitrophenyl > 2-pyridyl > phenyl.
3. The miLOGP variable has a greater impact on the MIC value than the experimental RmoExper; as the values of the two variables describing lipophilicity increase, the MIC value increases.
4. PASS_anti-inflam, PASS_antitumor, and PASS_antieczematic: as the value of each of these variables increases, the MIC value decreases.
For the stepwise models of M. luteus (Table 6), the results are as follows:
1. RmoExper: as the lipophilicity increases, the MIC value increases.
2. MW: as the molar weight increases, the MIC value decreases.
3. PASS_antieczematic: increasing the value of this variable causes a decrease in the MIC value.
4. PASS_antituberculosi: lowering the value of this variable lowers the MIC value.
5. MeltingTemp: as this variable increases, the MIC value decreases.
For the stepwise models M1-M5 and M7 of E. faecalis (Table 7, Figure 12), the results are as follows:
1. MW: as the molar weight increases, the MIC value decreases.
2. RmoExper: the lower the RmoExper value, the lower the MIC value.
3. MeltingTemp*R1_substituent: based on the interaction of the melting point and the R1 substituent, and assuming that the melting point does not change, the R1 substituents can be ranked from the most to the least favorable as 4-pyridyl > phenyl > 2-pyridyl. All substituents increase the MIC, but 4-pyridyl increases it the least and is the most favorable for this model. The weakest antibacterial activity is expected for the R1 substituent 2-pyridyl.
4. PASS_anti-inflam: as this variable increases, the MIC value decreases.
5. PASS_antituberculosi: as this variable increases, the MIC value increases.
6. PASS_antibact: assuming the other variables are unchanged, an increase of one SD in PASS_antibact increases the MIC value by 0.268199 (2.242303 − 1.974104); a lower value of this variable is therefore more favorable.
Using the data collected in Table S26, the selected variables common to the modeling of the four bacterial strains S. aureus, M. luteus, N. corallina, and E. faecalis are as follows:
1. MW: as this variable increases, the MIC value decreases.
3. PASS_antieczematic: as this variable increases, the MIC value decreases (except for S. aureus).
4. MeltingTemp squared: in general, an increase in this variable causes a decrease in the MIC value (except for E. faecalis).
PASS_antitumor: as this variable increases, the MIC value decreases.
In the majority of models, a beneficial effect of increasing molar mass on antibacterial activity was observed. However, it can be expected that this relationship will persist only up to a certain point, similar to previous studies on the relationship between the mass of chitosan derivatives and activity against S. aureus [35]. The compounds used for modeling had molar masses ranging from 291 to 411 g/mol.
The high frequency of selecting the PASS_anti-inflam and PASS_antieczematic variables (Table S26) may seem surprising. However, the recently described potential use of anti-inflammatory drugs as antibacterial agents to combat biofilm formed by pathogenic bacteria [36] suggests a possible relationship between the anti-inflammatory and antibacterial activity of the compounds. In turn, the selected PASS_antitumor variable may indicate a relationship between the cytotoxicity of compounds and their antimicrobial activity. This may also support the potential of known anti-inflammatory and anti-cancer drugs, which may have antibacterial activity and are currently undergoing repositioning tests [37]. Interestingly, the PASS_antibact variable was selected less frequently, mainly in models involving E. faecalis.
The most frequently selected variables included the melting point in the second power (Table S26). The importance of the melting point in the analyzed models is also evidenced by its frequent presence in the interaction with the R1 or R2 substituents. The influence of this variable on biological activity has not been widely studied so far, although several works on it have been published. The dependence of decomposition on the melting point of approved and withdrawn drugs was examined [38], as was the relationship between drug absorption and melting point [39]. The quantitative structure-property relationship concerning the melting point of drug compounds is another area of current research [40].
Lipophilicity plays a role not only in penetrating biological membranes but also in the metabolism, distribution, excretion, and toxicity of drugs [41]. It is worth noting that the experimental lipophilicity values (RmoExper) explain the antibacterial activity better than the calculated lipophilicity values (miLOGP; only two selections in Table S26). This agrees with the suggestions of authors investigating lipophilicity using computational and chromatographic methods [32,42].
Among the other analyzed variables, the percentage composition of the tested compounds seems to play a secondary role, as the percentages of carbon and nitrogen in the compounds were selected mainly in the models for S. aureus. The number of hydrogen bond donors was selected more often than the number of hydrogen bond acceptors (Table S26), but their impact on MIC values, like that of the topological polar surface area (TPSA), seems to be less important.
The R1 and R2 substituent variables, important from the point of view of the structure-activity relationship, occur as single main effects rather rarely, because interactions between the R1 or R2 substituents and melting point are often present. For example, the stepwise model of E. faecalis predicts the highest antibacterial activity for compounds possessing the R1 substituent 4-pyridyl and the lowest for the 2-pyridyl substituent at the R1 position. Moreover, based on the stepwise model of N. corallina, the R2 substituents can be ranked from the most favorable in terms of antibacterial activity to the least favorable as follows: 4-methylphenyl > 4-nitrophenyl > 2-pyridyl > phenyl. However, the LASSO and LAR models of S. aureus suggest that the R2 substituent 4-nitrophenyl is the least favorable for activity against this bacterial strain. These differences may be caused by differences between individual bacterial strains.
More detailed studies of the influence of the above-mentioned variables, carried out on models built on a larger number of compounds with greater structural diversity, seem advisable to clarify the precise impact of these variables.
Generalized linear models for predicting the activity of chemical compounds from three groups (linear, 1,2,4-triazole derivatives, and cyclic imides), based on selected theoretical variables, lipophilicity, and the type of R1 and R2 substituents, with the interaction with the melting point, have beneficial predictive properties for the creation of compounds as potential drugs.
Despite obtaining significant results, it should be remembered that our work has certain limitations. Our models only concern antibacterial activity against five strains of Gram-positive bacteria. Only amidrazone derivatives from three groups (Figure 1b-d) were included in the analyzed data. In addition, only the R1 and R2 substituents, with three and four categories, respectively, were included in the set of potential variables, together with their interactions with the melting point. We intend to include other compounds to develop more general models. A higher number of agents will increase the size of the dataset, which may also let us analyze R3 and R4 substituents with a larger number of categories. We also plan to expand the analysis to a broader range of bacterial strains (including Gram-negative strains).
The classic approach to the design of antibacterial substances mainly considered the influence of substituents such as R1 and R2. Our approach to the design of antibacterial drugs extends the analysis to include the impact of other factors that may also affect the MIC value. Our models are multidimensional and also consider interactions between melting point and the R1 and R2 substituents.
For designing compounds with a similar chemical structure (i.e., amidrazone derivatives) for particular types of Gram-positive bacteria, we suggest using the variables that we obtained in our selection for single models and validated based on the ASE, or the variables that were most frequently selected based on the analysis of 1000 random samples using the bootstrap method.

Bacterial Strains
The antibacterial activity of compounds was tested against Gram-positive bacteria derived from the American Type Culture Collection (ATCC), including Staphylococcus aureus ATCC 25923 and Enterococcus faecalis ATCC 29212, and from the collection of the Department of Genetics and Microbiology of Maria Curie Sklodowska University, including Micrococcus luteus, Nocardia corallina (currently Gordonia rubripertincta), and Mycobacterium smegmatis.

Determination of Minimum Inhibitory Concentration (MIC) In Vitro
The test was performed using the microdilution method in sterile 96-well microplates, following the procedure described by the Clinical and Laboratory Standards Institute [43]. All tested compounds were dissolved in dimethyl sulfoxide (DMSO) to obtain a concentration of 10.24 mg/mL and were then diluted ten-fold in Mueller-Hinton (MH) broth, followed by serial two-fold dilution in MH, resulting in a concentration range from 512 to 0.5 µg/mL. Each well was then inoculated with a bacterial suspension (0.5 McFarland standard). The plates were incubated at 37 °C for 18 h. Bacterial growth was confirmed spectrophotometrically using an ASYS UVM 340 microplate reader (Biogenet, Józefów, Poland). The MIC value was taken as the lowest concentration of the compound that inhibited bacterial growth.
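The dilution arithmetic above can be checked with a short sketch. The function and parameter names are hypothetical, and it is an assumption that the first tested concentration (512 µg/mL) arises from one two-fold step applied to the 1024 µg/mL working solution.

```python
def tested_concentrations(stock_ug_per_ml: float = 10240.0,
                          first_dilution: float = 10.0,
                          lowest_ug_per_ml: float = 0.5) -> list[float]:
    """Concentrations (µg/mL) of a serial two-fold dilution series.

    The stock (10.24 mg/mL = 10240 µg/mL) is first diluted ten-fold,
    then halved repeatedly until the lowest tested concentration is reached.
    """
    working = stock_ug_per_ml / first_dilution  # 1024 µg/mL working solution
    series = []
    concentration = working / 2.0  # assumed first tested concentration
    while concentration >= lowest_ug_per_ml:
        series.append(concentration)
        concentration /= 2.0
    return series


if __name__ == "__main__":
    print(tested_concentrations())
```

Under these assumptions the series contains eleven concentrations, which fits one compound per row of a 96-well plate with a column left for a control.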
The eighty-five examined chemical compounds were obtained in the reaction of N3-substituted amidrazones with cyclic anhydrides, following methods described previously [12-14]. They belong to three groups depending on chemical structure (linear, 1,2,4-triazole derivatives, and cyclic imides, as shown in Figure 1). Individual compounds differ in the substituents R1 and R2, which are the only qualitative variables in the models. The structures of the studied compounds are presented in Figures S1-S3.
For the 85 compounds with experimentally measured biological activities, we analyzed physicochemical properties such as molecular weight (MW) and the percent composition of oxygen, nitrogen, and carbon (calculated using the free ChemSketch software, version 2017.1.2, Advanced Chemistry Development, Toronto, Ontario, Canada) [33]. Moreover, the computational lipophilicity (miLOGP), the number of hydrogen NH and OH donors (Donors_H), the number of nitrogen and oxygen atoms (Acceptors_H), and the topological polar surface area (TPSA) of the studied compounds were obtained using Molinspiration cheminformatics software [28].
Additionally, two experimental variables were considered: experimental lipophilicity (R_M0, determined by thin-layer chromatography using reversed-phase plates (nano-silica gel RP-18W, Fluka Analytical, Neu-Ulm, Germany) and 40-60% water-methanol mixtures, according to the method described previously [32]) and a medium melting point (determined with a MEL-Temp apparatus, Electrothermal, Stone, UK). Interactions between melting temperature and the R1 substituent, and between melting temperature and the R2 substituent, were also considered. Adding an interaction term to a MIC model becomes necessary when the statistical association between MIC and melting temperature depends on the level of the R1 or R2 substituent. Additionally, PASS antibacterial activity squared was considered. We anticipated that, among the five results from the PASS program, the PASS_antibact variable might be most closely related to MIC. To avoid duplicating variables, we decided to investigate this variable's potential non-linear relationship by including its square in the set of potential variables. Abbreviations with descriptions of the potential explanatory variables are given in Table 1.
Statistical generalized linear models (GLMs) estimate the influence of predictors on the outcome, MIC. Most predictors are continuous; two are categorical: the R1 and R2 substituents (Table 1). However, we also consider the modification of the effect of melting temperature on MIC by the R1 substituent and by the R2 substituent. R1 and R2 may be regarded as moderating variables. If this effect modification exists, it is a statistical interaction. For example, we may model the impact of melting temperature on MIC as modified by the three different R1 substituents (1 = 2-pyridyl; 2 = 4-pyridyl; 3 = phenyl) or by the R2 substituent (1 = 2-pyridyl; 2 = 4-methylphenyl; 3 = 4-nitrophenyl; 4 = phenyl). Interaction variables are generated by multiplying melting temperature by the level of R1 (or R2), and the resulting product variable is then entered into the model as an additional potential predictor, along with the melting temperature and R1 (R2).
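The construction of interaction variables can be sketched as follows. This is an illustration under an assumption: each substituent level is coded as a 0/1 indicator, as is common for categorical effects, and the column names and function are hypothetical rather than those used by the software in the study.

```python
R1_LEVELS = ("2-pyridyl", "4-pyridyl", "phenyl")


def interaction_columns(melting_temp: float, r1: str,
                        levels: tuple[str, ...] = R1_LEVELS) -> dict[str, float]:
    """Product terms meltingTemp * indicator(R1 == level), one per substituent level."""
    if r1 not in levels:
        raise ValueError(f"unknown R1 substituent: {r1!r}")
    return {
        f"meltingTemp*R1_{level}": melting_temp * (1.0 if r1 == level else 0.0)
        for level in levels
    }


if __name__ == "__main__":
    # A compound with a 4-pyridyl R1 substituent and a melting point of 150 degrees.
    print(interaction_columns(150.0, "4-pyridyl"))
```

With this coding, each substituent level receives its own slope for melting temperature, which is what it means for the substituent to moderate the relationship between melting temperature and MIC.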

Variables Selection for Generalized Linear Models
For multidimensional problems, dimension reduction that represents the problem without losing significant amounts of information is crucial. The selection of variables (main effects or interaction effects) was performed using three types of generalized linear model (GLM) selection methods: adaptive LASSO, least-angle regression (LAR), and stepwise selection. The maximal number of variables built into the model was bounded by a value depending on the number of training examples for MIC: for n > 80 it was set to 13 (S. aureus n = 83, M. smegmatis n = 85), and for the others it was set to 10 (N. corallina n = 50, M. luteus n = 56, and E. faecalis n = 56).
LASSO selection (least absolute shrinkage and selection operator for generalized linear models) [25] is a regularization technique that reduces the number of predictors in a generalized linear model. LASSO is used to identify important predictors and to choose among redundant ones. It is a penalized method that considers all candidate variables and shrinks the coefficients of unimportant variables to zero. As the penalty term increases, LASSO sets more coefficients to zero, resulting in a smaller model with fewer predictors. By making shrinkage estimates, LASSO can potentially achieve lower prediction errors than ordinary least squares. Adaptive LASSO is a method for regularization and variable selection in regression analysis that was introduced by Zou [44]. Zou demonstrated that the adaptive LASSO has theoretical advantages over the standard LASSO. It is beneficial when the explanatory variables are correlated and there is a need to select a small subset of important predictors for the model. By introducing weights (prior information about the variables), adaptive LASSO tends to penalize coefficients that are truly nonzero less than those that are zero. In GLMs, variable selection using LASSO is often applied [45].
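How the penalty sets coefficients to zero can be seen in the simplest special case. The sketch below assumes an orthonormal design, where the LASSO solution is obtained by soft-thresholding each least-squares coefficient; it is not the algorithm run by the software in the study, and the function names and the weight choice 1/|b| are illustrative assumptions.

```python
def soft_threshold(b: float, penalty: float) -> float:
    """Shrink b toward zero by `penalty`; values within the penalty become exactly zero."""
    if b > penalty:
        return b - penalty
    if b < -penalty:
        return b + penalty
    return 0.0


def lasso_orthonormal(ols_coefs: list[float], lam: float,
                      adaptive: bool = False) -> list[float]:
    """LASSO estimates for an orthonormal design (assumption of this sketch).

    With adaptive=True, each penalty is multiplied by the weight 1/|b_ols|,
    so large least-squares coefficients are penalized less (adaptive-LASSO-style).
    """
    estimates = []
    for b in ols_coefs:
        weight = 1.0 / abs(b) if (adaptive and b != 0.0) else 1.0
        estimates.append(soft_threshold(b, lam * weight))
    return estimates


if __name__ == "__main__":
    ols = [3.0, -0.5, 1.0]
    for lam in (0.1, 1.0, 2.0):
        print(lam, lasso_orthonormal(ols, lam), lasso_orthonormal(ols, lam, adaptive=True))
```

Increasing `lam` zeroes out more coefficients, and the adaptive weights shrink the large coefficient less than the plain penalty does, which is the behavior described in the paragraph above.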
Least-angle regression (LAR) was introduced by Efron et al. [46]. The LAR model-selection method is based on traditional forward selection, where, given a collection of possible explanatory variables, the one having the largest absolute correlation with the explained variable is selected.
Stepwise (forward-backward) model selection is a classical model selection and dimensionality reduction technique. The same entry and removal significance levels for the forward selection and backward elimination were used to evaluate the contributions of variables as they are added to or removed from a model. The entry significance level (SLE) was set to 0.15 and the stay significance level (SLS) was also set to 0.15.
Variable selection within the framework of general linear models was performed using the GLMSELECT procedure in the SAS package [31,47]. A variety of fit statistics were specified as selection criteria. Seven or eight criteria were applied: Adj R² (resulting model denoted as M1), AIC (M2), AICC (M3), BIC (M4), C(p) (M5), SBC (M6), PRESS (M7, for stepwise selection only), and ASE (M7, or M8 for stepwise selection). The process of adding or removing variables was stopped at the optimal value of the given statistic; minimal values are considered optimal for all statistics except the adjusted R² (Table 8). R² is the proportion of variance in the dependent variable that can be explained by the independent variables; hence, it is called the determination coefficient. It serves as a measure of goodness of fit for the model. The adjusted determination coefficient takes into account the number of predictors and is calculated as follows:
$$\mathrm{Adj}\,R^2 = 1 - \frac{(n - 1)(1 - R^2)}{n - p},$$
where n is the number of observations used to fit the model and p is the number of parameters (effects, including the intercept) in the model. The AIC and SBC both assess model fit penalized for the number of estimated parameters. Akaike's information criterion is defined as follows:
$$\mathrm{AIC} = n \log(\mathrm{SSE}/n) + 2p,$$
where SSE is the sum of squared errors (residual sum of squares). The corrected AIC (AICC) additionally applies a small-sample correction to the penalty term. Sawa's Bayesian information criterion (BIC) additionally takes into account the pure error variance σ² from fitting the full model, as follows:
$$\mathrm{BIC} = n \log(\mathrm{SSE}/n) + 2(p + 2)q - 2q^2, \tag{7}$$
where q = n s²/SSE and s² is the estimate of the pure error variance σ² from fitting the full model. Mallows' statistic is defined as follows:
$$C(p) = \frac{\mathrm{SSE}}{s^2} + 2p - n.$$
The smallest values of AIC, AICC, SBC, C(p), and BIC are preferred.
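The criteria can be evaluated with a few lines of code. The sketch below uses the textbook forms of the adjusted R², AIC, SBC, Sawa's BIC, and Mallows' C(p); the SBC formula is the standard one and is not quoted in the text, additive constants may differ between software versions, and the function names are hypothetical. As a check against the reported results, R² = 0.4149 for M. luteus with n = 56 and, assuming five variables plus the intercept (p = 6), gives Adj R² = 0.3564.

```python
import math


def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared; p counts all parameters, including the intercept."""
    return 1.0 - (n - 1) * (1.0 - r2) / (n - p)


def information_criteria(sse: float, n: int, p: int, s2_full: float) -> dict[str, float]:
    """AIC, SBC, Sawa's BIC and Mallows' C(p) for a linear model.

    sse      residual sum of squares of the candidate model
    s2_full  error-variance estimate from the full model
    """
    base = n * math.log(sse / n)
    q = n * s2_full / sse
    return {
        "AIC": base + 2 * p,
        "SBC": base + p * math.log(n),  # standard Schwarz form (assumed)
        "BIC": base + 2 * (p + 2) * q - 2 * q ** 2,
        "C(p)": sse / s2_full + 2 * p - n,
    }


if __name__ == "__main__":
    print(round(adjusted_r2(0.4149, 56, 6), 4))
    print(information_criteria(sse=50.0, n=56, p=6, s2_full=1.0))
```

Since log(n) exceeds 2 for the sample sizes used here (n between 50 and 85), the SBC penalty per parameter is larger than that of AIC, which is consistent with SBC selecting no more variables than AIC.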
To compare single models with models obtained by resampling the data (where model selection is repeated on subsets of the input data and the parameter estimates are averaged), the model averaging (MODELAVERAGE) statement of the GLMSELECT procedure in the SAS package [31,47] was applied. Resampling was performed by the bootstrap method, in which samples are obtained by drawing with replacement. The use of the bootstrap in the context of variable selection was proposed by Breiman [48]. One thousand bootstrap replications were used.
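The idea of bootstrap model averaging can be sketched in Python as follows. This is a simplified stand-in and not the MODELAVERAGE implementation: it assumes greedy forward selection that stops when SBC no longer decreases, and it treats the coefficient of a predictor as zero in replicates where that predictor is not selected. All function names and the simulated data are hypothetical.

```python
import math
import random

def ols(columns, y):
    """Return (coefficients, SSE) for an OLS fit of y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(k)]
    for j in range(k):
        piv = max(range(j, k), key=lambda i: abs(A[i][j]))
        A[j], A[piv] = A[piv], A[j]
        for i in range(k):
            if i != j:
                f = A[i][j] / A[j][j]
                A[i] = [u - f * v for u, v in zip(A[i], A[j])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    return beta, sse

def sbc(n, p, sse):
    """Schwarz Bayesian criterion for a model with p parameters."""
    return n * math.log(sse / n) + p * math.log(n)

def forward_select(columns, y):
    """Greedy forward selection; stops when SBC no longer decreases."""
    n = len(y)
    chosen = []
    best = sbc(n, 1, ols([], y)[1])
    while True:
        trials = [(sbc(n, len(chosen) + 2, ols([columns[i] for i in chosen + [j]], y)[1]), j)
                  for j in range(len(columns)) if j not in chosen]
        if not trials or min(trials)[0] >= best:
            return chosen
        best, j = min(trials)
        chosen.append(j)

def bootstrap_average(columns, y, n_boot=1000, seed=0):
    """Average selected-model coefficients over bootstrap samples.

    Returns (averaged coefficients, selection frequencies), one entry per predictor.
    """
    rng = random.Random(seed)
    n, m = len(y), len(columns)
    coef_sum = [0.0] * m
    counts = [0] * m
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)  # draw n rows with replacement
        cols_b = [[col[i] for i in idx] for col in columns]
        y_b = [y[i] for i in idx]
        chosen = forward_select(cols_b, y_b)
        beta, _ = ols([cols_b[j] for j in chosen], y_b)
        for pos, j in enumerate(chosen):
            coef_sum[j] += beta[pos + 1]  # position 0 is the intercept
            counts[j] += 1
    return [s / n_boot for s in coef_sum], [c / n_boot for c in counts]

# Simulated example: y depends on the first predictor only.
rng = random.Random(1)
x0 = [rng.gauss(0, 1) for _ in range(60)]
x1 = [rng.gauss(0, 1) for _ in range(60)]
y = [1.0 + 2.0 * a + rng.gauss(0, 0.5) for a in x0]
avg_coef, freq = bootstrap_average([x0, x1], y, n_boot=200)
print("averaged coefficients:", avg_coef)
print("selection frequencies:", freq)
```

The selection frequency shows how stable the choice of each predictor is across resamples, which is the main reason to compare averaged models with single models.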

Conclusions
Using GLMs, statistically significant models were obtained for the Gram-positive bacterial strains. The best results, based on the optimum adjusted determination coefficient and other fit criteria (e.g., Akaike's information criterion), were obtained for S. aureus, indicating an opportunity to support the design of antibacterial drugs using this method and the selected predictors. Variables with the greatest impact on compound activity in inhibiting S. aureus growth included PASS_antitumor, PASS_anti-inflam, and the interaction of the melting point with the R2 substituent (4-nitrophenyl). The weakest models, with an adjusted determination coefficient below 20%, were obtained for the M. smegmatis strain, which may be due to the different properties of mycobacterial cells.
For all bacterial strains, the most often selected main effects in models were molecular weight, PASS_antieczematic, PASS_anti-inflam, PASS_antitumor, experimental lipophilicity (RmoExper), and squared melting temperature. For several models, the influence of R1 and R2 substituents on the antibacterial activity against a given bacterial strain was determined. For example, the stepwise model for E. faecalis predicted the highest antibacterial activity for compounds with the 4-pyridyl substituent at the R1 position and the lowest for those with the 2-pyridyl substituent at that position. Another stepwise model, for N. corallina, suggested a beneficial influence of substituents on the MIC value in the following order: 4-methylphenyl > 4-nitrophenyl > 2-pyridyl > phenyl. The selection of the interaction of melting point with substituents R1 and R2 indicates that the relationship between MIC and melting point depends on the substituent type. The models found can constitute the basis for building further strategies that may be useful in designing new antibacterial drugs.
, full formulas in Figures S1-S3 in the Supplementary Materials), were used to build models.

Table 1. Potential variables in models.

Table 2. S. aureus parameter estimates for LASSO models optimized by seven fit criteria.

Table 3. S. aureus parameter estimates for LAR models optimized by seven fit criteria.

Table 4. S. aureus parameter estimates for stepwise models optimized by eight fit criteria.

Table 5. N. corallina parameter estimates for stepwise models optimized by eight fit criteria.

Table 6. M. luteus parameter estimates for stepwise models optimized by seven fit criteria.

Table 7. E. faecalis parameter estimates for stepwise models optimized by eight fit criteria.

Table 8. Fit criteria applied and denotation of final models after achieving the optimal fit value or after a pre-assumed maximal number of variables.