Synthesis of Synthetic Musks: A Theoretical Study Based on the Relationships between Structure and Properties at Molecular Scale

Synthetic musks (SMs), as an indispensable odor additive, are widely used in various personal care products. However, due to their physico-chemical properties, SMs were detected in various environmental media, even in samples from arctic regions, leading to severe threats to human health (e.g., abortion risk). Environmentally friendly and functionally improved SMs have been theoretically designed in previous studies. However, the synthesizability of these derivatives has barely been proven. Thus, this study developed a method to verify the synthesizability of previously designed SM derivatives using machine learning, 2D-QSAR, 3D-QSAR, and high-throughput density functional theory in order to screen for synthesizable, high-performance (odor sensitivity), and environmentally friendly SM derivatives. In this study, three SM derivatives (i.e., D52, D37, and D25) were screened and recommended due to their good performances (i.e., high synthesizability and odor sensitivity; low abortion risk; and bioaccumulation ability in skin keratin). In addition, the synthesizability mechanism of SM derivatives was also analyzed. Results revealed that high intramolecular hydrogen bond strength, electrostatic interaction, qH+ value, energy gap, and low EHOMO would lead to a higher synthesizability of SMs and their derivatives. This study broke the synthesizability bottleneck of theoretically designed environment-friendly SM derivatives and advanced the mechanism of screening functional derivatives.


Introduction
Synthetic musks (SMs), as a kind of additive with a particular smell, have been widely used in perfume, shower gel, shampoo, lotion, shampoo, shaving cream, soap, deodorant, sunscreen, nail polish, hair oil, hair dye, lip balm, aftershave softener, air freshener, washing powder, and other personal care products in daily life to enhance the fragrance and mask odors [1]. SMs contain various compounds and have been categorized into four categories (i.e., nitromusks, polycyclic, macrocyclic, and cycloaliphatic). Polycyclic musks such as galaxolide (HHCB) and tonalide (AHTN) have been classified as high-yield chemicals [2]. Specifically, the annual output of HHCB in the United States reached about 4536 tons in 2015 [3]. After using personal care products, they entered the environment through water and air, thus polluting the environment. SMs can also cause secondary pollution problems because they cannot be completely degraded in their environments [2][3][4][5][6][7]. SMs have been reported in various environmental matrices, such as surface water [8,9], air [10], 2 of 28 and soil [2,11], and HHCB was even detected in sediment samples from the Arctic [12]. Regarding their environmental properties, SMs are considered semi-volatile, lipophilic, bio-accumulative, and partially biodegradable [4,13,14]. The bioaccumulation abilities of HHCB and AHTN have called for extensive attention in recent years. HHCB and AHTN have been found to cause endocrine disruption and antiestrogenic activity in animals and humans [4,15] and have also been shown to accumulate in living tissues and blood with potential adverse health effects [16,17]. Furthermore, as an emerging pollutant, SMs have been proven to have endocrine-disrupting effects and developmental toxicity to embryonic development, which can potentially cause abortion risks in pregnant women [1,15,18]. Therefore, it is important to investigate and reduce the risk of SMs to the environment and human health.
The environmentally friendly molecular design has been widely used for the source control of emerging pollutants. Zhou et al. [19] designed a novel bee-friendly peptidomimetic insecticide based on 3D-QSAR. Zhao et al. [20] developed and screened four neonicotinoid insecticides with bidirectional selectivity. Specifically, these four insecticides had enhanced toxic effects on pests and grubs (increased by 1.44-12.58%), while the chronic sublethal impact on beneficial insects (i.e., bees and earthworms) slowed down (0.29-27.18%). Ren et al. [21] designed two LEV substitute molecules using levofloxacin (LEV) as the parent molecule. They found that the binding effects of LEV substitutes on B-G mutant proteins were increased by 13.24% and 19.40%, respectively. These results indicated that antibacterial drug resistance had improved, thus inhibiting its vertical gene transfer ability in humans. Fu et al. [22] designed and screened a fluoroquinolone substitute molecule with high biodegradability (+120.51%), improved functional properties (genotoxicity) (+13.66%), decreased bioaccumulation (−44.81%), and relieved human liver toxicity (−106.21%). In addition, a variety of SM derivatives, including 19 SM derivatives with improved functionality (strong odor sensitivity) [23], ten function-improved SM derivatives with reduced risk through dermal exposure [24], and 48 SM derivatives with lower abortion risk [18], were evaluated in our previous studies. The design of those environmentally friendly alternatives for emerging pollutants is generally based on theoretical methods. Although it has been confirmed from multiple perspectives, such as molecular functional characteristics, environmental characteristics, environmental existence, and environmental friendliness improvement mechanisms of emerging pollutant substitutes, none of these derivatives have been synthesized. It is undoubtedly a shortcoming for theoretically designed emerging pollutant derivatives. For this reason, compared with the commercially available emerging pollutants, the research on the synthesizability and mechanism of theoretically designed emerging pollutant substitute molecules is expected to make up for the above shortcomings to a certain extent. It can also provide technical support for effectively reducing development costs when put into experimental synthesis.
The synthesizable studies of environmentally friendly emerging contaminant derivatives were barely reported. Density functional theory (DFT) has been widely used to predict the positive frequency value of emerging pollutant derivatives to verify whether the designed molecules can exist in the environment [21,[25][26][27]. However, the synthesizability of derivatives calculated by the DFT method has not been compared with their precursors; thus, the prediction is incomplete. Frey et al. [28] used a machine learning (ML) model to predict the synthesizability of two-dimensional metal carbides and nitrides and their precursors for the first time and screened 18 molecules and 111 phases from 56 MXenes and 792 MAX phases, respectively, with improved synthesis efficiency. By this means, this paper attempted to improve the prediction ability of accurately screening environmentally friendly SMs and their derivatives with improved functionality using DFT and ML methods.
In order to accurately verify the predicted synthesizability of SM derivatives, it is important to analyze the mechanism of the relationship between the molecular structures and descriptors and the synthesizability of SM derivatives. The integration of DFT and ML methods for predicting the synthesizability of SM derivatives is based on the comparative

Synthesizability Calculation of SM Derivatives Based on the Bagging-RF Algorithm
In this paper, the 88 SMs and SM derivatives retrieved were divided into positive and unlabeled samples, of which 11 commercial SMs were marked as positive samples. The remaining 77 theoretically designed and yet-to-be synthesized environmentally friendly SM derivatives were labelled as "unlabeled samples." In each iteration of the PU machine learning process, some unlabeled samples can be randomly marked as unlabeled, non-synthesizable samples. In addition, the key descriptors screened by Pearson correlation coefficients were used to construct a random forest base classifier for predicting the synthesizability of unlabeled samples. The above PU machine learning procedure was repeated for 23 iterations, and the unlabeled, non-synthesizable samples were relabeled in each iteration. Furthermore, the synthesizability of an unlabeled sample was defined as the average of the RF classifier prediction scores for all samples except that sample. The SM derivatives with synthesizability greater than 0.5 were marked as an unlabeled synthesizable sample, and those with less than 0.5 were marked as an unlabeled non-synthesizable sample [28]. The model outputted the synthesizability of unlabeled synthesizable samples as an indicator representing their synthesizability. The 16 key descriptors screened were used to construct a PU ML prediction model for the synthesizability of SM derivatives based on the bagging-RF algorithm. The evaluation score of the In this paper, the 88 SMs and SM derivatives retrieved were divided into positive and unlabeled samples, of which 11 commercial SMs were marked as positive samples. The remaining 77 theoretically designed and yet-to-be synthesized environmentally friendly SM derivatives were labelled as "unlabeled samples". In each iteration of the PU machine learning process, some unlabeled samples can be randomly marked as unlabeled, nonsynthesizable samples. In addition, the key descriptors screened by Pearson correlation coefficients were used to construct a random forest base classifier for predicting the synthesizability of unlabeled samples. The above PU machine learning procedure was repeated for 23 iterations, and the unlabeled, non-synthesizable samples were relabeled in each iteration. Furthermore, the synthesizability of an unlabeled sample was defined as the average of the RF classifier prediction scores for all samples except that sample. The SM derivatives with synthesizability greater than 0.5 were marked as an unlabeled synthesizable sample, and those with less than 0.5 were marked as an unlabeled non-synthesizable sample [28]. The model outputted the synthesizability of unlabeled synthesizable samples as an indicator representing their synthesizability. The 16 key descriptors screened were used to construct a PU ML prediction model for the synthesizability of SM derivatives based on the bagging-RF algorithm. The evaluation score of the built model was 0.705 (>0. 7), indicating that the model has good predictive performance [28]. According to Table 1, the predicted synthesizability of 16 (i.e., D3, D5, D7, D8, D22, D25, D28, D33, D35, D37, D38, D50, D51, D52, D60, and D34) out of 77 unlabeled samples was greater than 0.5. So, the above 16 theoretically designed environment-friendly SM derivatives had strong synthesizability. This result provides important guidance for developing green SM derivatives in the future and simultaneously reduces the number of unlabeled samples by 79.22%, which minimizes the research cost of subsequent musk substitute molecules. In this paper, bagging-ERT ML was used to predict the synthesis of SM derivatives. The Extremely Randomized Tree Classifier (ERT) was used in the replacement algorithm of the PU machine learning prediction model to prove that the synthesis of SM derivatives does not depend on specific predictive models. Eleven commercial SMs were selected as positive samples, and the remaining 77 theoretically designed environmentally friendly SM derivatives that have not yet been synthesized were used as unlabeled samples. The bagging-ERT algorithm for predicting the synthesizability of SM derivatives was constructed. The evaluation score of the built model was 0.727 (>0. 7), indicating that the model has good predictive performance [28]. Among the 77 unlabeled samples, the predicted synthesizability of 16 unlabeled samples (i.e., D5, D8, D9, D11, D23, D25, D26, D29, D34, D35, D38, D51, D52, D53, D57, and D60) was greater than 0.5. Thus, the above 16 theoretically designed environment-friendly SM derivatives had strong synthesizable properties in the follow-up experimental development process (Table 2). Compared with the RF algorithm, 11 SMs substitute molecules (i.e., D7, D8, D22, D25, D28, D33, D34, D37, D50, D51, and D52) with a high synthesizable probability were screened out by both the bagging-RF and bagging-ERT algorithms. The synthesizability of these SM derivatives was ranked in the top 10 of the two prediction models. The deviations of synthesizability for D37, D8, D22, D34, D51, D28, and D52 in the bagging-RF and bagging-ERT prediction models were less than 10.13%, indicating that the constructed bagging-RF and bagging-ERT machine learning models both had high prediction accuracy. The gradient-boosting classifier (GBC) was selected as the basic classifier to construct the bagging-gradient-boosting classifier (bagging-GBC) model for further analyzing the molecular synthesis of SM derivatives. We aim to screen the SM derivatives with high predicted synthesizability that meet different models so that these screened environmentally friendly SM derivates could have the highest probability of synthesis. The synthesizability of the SM derivatives was predicted using the bagging-GBC algorithm. The constructed model showed an evaluation score of 0.770 (>0. 7), indicating that the model has good predictive performance [28]. Table 3 shows the prediction results of the PU machine learning model for the molecular synthesizability of SM derivatives based on the bagging-GBC algorithm. The synthesizability of D6, D7, D8, D25, D32, D35, D37, D50, D51, and D52 is greater than 0.5. Seven SM derivatives (i.e., D7, D8, D25, D37, D50, D51, and D52) were screened and simultaneously met the requirements of synthesizable probability (greater than 0.5) predicted by bagging-GBC, bagging-ERT, and bagging-RF algorithms. The synthesizability and molecular structures of these seven SM derivatives are provided in Figure 2. The synthesizability of D7 and D50 predicted by three ML models was high. Specifically, the synthesizability of D7 predicted by the three models was all greater than 0.65, and the synthesizability of D50 predicted by the three models was greater than 0.62 (Table 3). In addition, the synthesizability of the seven SM derivatives ranked in the top 10 of the predicted values predicted by the bagging-GBC, bagging-ERT, and bagging-RF models, indicating the prediction accuracy of the three synthesizable prediction models has consistency (Table 4).

Evaluation of Environmental Risk and Functional Properties of SM Derivatives
The environmental risks and functional properties of seven SM derivatives screened by the bagging-GBC, bagging-ERT, and bagging-RF models were predicted by the 3D-QSAR models constructed by Li et al. [18,23,24] and EPI Suite 4.1 (U.S. Environmental Protection Agency, USA) software. The results are shown in Table 5. The abortion risk induced by SM derivatives was represented by the docking scores of SM derivatives to estrogen (1A52) and progesterone (1A28), respectively [24]. The docking score of SM derivatives to skin keratin proteins (4ZRY) was used to characterize the bioaccumulation of SM derivatives in humans [18]. The biotoxicity of SM derivatives was reflected by LC 50 in the fish [23]. The functional property (i.e., odor sensitivity) of SM derivatives was represented by the binding energy of SM derivatives and human olfactory protein (OR5AN1).  [24]; ** Data obtained from [18];ˆData obtained from [23].
It was found that D37, as a substitute molecule for HHCB, had a 7.18% improvement in its functional properties (odor sensitivity) compared with HHCB. The docking scores of D37 to 1A52 and 1A28 decreased, indicating that the abortion risk induced by D37 was alleviated. In addition, D37 had an improved LC 50 value in fish and reduced bioaccumulation ability in skin keratin, indicating that the environmental risks of D37 were lower than those of its precursor, HHCB. Therefore, D37, with its high synthesizability and lower environmental impacts, can be an alternative for HHCB. According to Table 5, the odor sensitivity of D7 was increased, and its bioaccumulation ability and abortion risk were reduced. Moreover, the biological toxicity of D7 is one level lower than that of its precursor, HHCB [23]. Thus, D7 can be recommended as another substitute molecule for HHCB. Compared with the other six screened SM derivatives, the functional properties (odor sensitivity) of D52 were significantly improved (10.77%). For the design of SM alternatives with improved functional properties, D52 is the first choice as the substitute for its precursor, MK. Furthermore, D52 has almost equal bioaccumulation ability to MK but a lower abortion risk and lower biotoxicity. For PHAN, the odor sensitivity of D25 was increased by 7.65%, and the biotoxicity was significantly reduced (−743.48%). The predicted LibDock scores of D25 for two hormone proteins (i.e., progestogen and estrogen) were consistent with those of its parent molecule, PHAN, maintaining the same level of abortion risk. Nakata et al. [35] found that skin contact was the most important way for SMs to be absorbed into the human body. Thus, the bioaccumulation ability of SMs can be effectively reduced by inhibiting their entry into skin keratin. The bioaccumulation ability of D25 was decreased by 17.25%, indicating that D25 can be recommended as one of the environmentally friendly substitutes for PHAN, which is consistent with the conclusion provided by Li et al. [24]. To further analyze the influence of the 16 key descriptors (i.e., qH + , E HOMO , energy gap (EG), dipole moment (DM), Q YY , Q ZZ , Q XY , Q XZ , Q YZ , positive frequency (Freq), Raman, GE, AATSC8c, GATS5c, GATS6c, and GATS3s) screened by the Pearson correlation coefficient method in Section 3.1, the python software was used to output the correlation ranking between the 16 key descriptors and synthesizability. Then the sensitivity analysis of the top 50% most important key descriptors was analyzed by SPSS software to explore the important factors affecting the synthesis of SM derivatives. With the help of the "sklearn.feature_selection" package in the ML tool library scikit-learn, the relationship between the synthesizability of SM derivatives and 16 key descriptors in the bagging-RF, bagging-ERT, and bagging-GBC algorithms [36] was analyzed. The SelectKBest function was used to output the correlation ranking of 16 key descriptors in the three algorithm models (Table 6). Raman, energy gap, qH + , dipole moment, positive frequency, Q YY , and E HOMO are the top 50% of the key descriptors in three models; Q XY is the top 50% of the key descriptors in two models; and AATSC8c ranks the top 50% of the descriptors of the key feature correlation of the bagging-GBC model. Therefore, this paper selected nine key eigenvalues (i.e., Raman, energy gap, qH + , dipole moment, positive frequency, Q YY , E HOMO , Q XY , and AATSC8c) for the sensitivity analysis. The above nine key descriptor values were used as independent variables, and the synthesizability of SM derivatives was used as a dependent variable for constructing three linear regression models by SPSS software. The correlation coefficients R of the bagging-RF, bagging-ERT, and bagging-GBC algorithms were 0.793, 0.825, and 0.805, respectively, all of which met the statistical requirements, and the Sig. of these models were all 0.000, passing the significance test [37]. The linear relationship between the synthesizability of SM derivatives and key descriptors was shown in Formula (1) (bagging-RF), Formula (2) (bagging-ERT), and Formula (3) (bagging-GBC). The coefficients of Raman, positive frequency, and E HOMO are positive, indicating a positive correlation between these three descriptors and the synthesizability of SM derivatives. The coefficients of the energy gap, qH + , dipole moment, Q YY , Q XY , and AATSC8c are negative, indicating a negative correlation between these three descriptors and the synthesizability of SM derivatives. (1) (2) The absolute value of the sensitivity coefficient under different variation degrees of key descriptors was calculated by Formulas (1)-(3) (Figure 3). As shown in Figure 3, when the degree of key eigenvalues increased, the sensitivity coefficients of all key eigenvalues showed an upward trend, except for the Raman descriptor in the bagging-ERT model. Among the descriptors, the characteristics of Raman, energy gap, qH + , dipole moment, and E HOMO were more prominent, and their sensitivity coefficients were all greater than 0.2. However, the characteristics of positive frequency, Q YY , Q XY , and AATSC8c descriptors were not obvious, and the sensitivity coefficients were all less than 0.05. Therefore, it can be inferred that the key descriptors (i.e., Raman, energy gap, qH + , dipole moment, and E HOMO ) have a more significant impact on the synthesizability of SM derivatives and are in a higher position in the ranking order of the key descriptors' correlations (Table 6). In contrast, the positive frequencies, Q YY , Q XY , and AATSC8c had relatively little effect on the synthesizable properties of SM derivatives.
The absolute value of the sensitivity coefficient under different variation degrees of key descriptors was calculated by Formulas (1)-(3) (Figure 3). As shown in Figure 3, when the degree of key eigenvalues increased, the sensitivity coefficients of all key eigenvalues showed an upward trend, except for the Raman descriptor in the bagging-ERT model. Among the descriptors, the characteristics of Raman, energy gap, qH + , dipole moment, and EHOMO were more prominent, and their sensitivity coefficients were all greater than 0.2. However, the characteristics of positive frequency, QYY, QXY, and AATSC8c descriptors were not obvious, and the sensitivity coefficients were all less than 0.05. Therefore, it can be inferred that the key descriptors (i.e., Raman, energy gap, qH + , dipole moment, and EHOMO) have a more significant impact on the synthesizability of SM derivatives and are in a higher position in the ranking order of the key descriptors' correlations ( Table  6). In contrast, the positive frequencies, QYY, QXY, and AATSC8c had relatively little effect on the synthesizable properties of SM derivatives. Raman, energy gap, qH + , dipole moment, and EHOMO significantly impact the synthesizability of SM derivatives when the change degree of key descriptors is less than 50%. In order to further analyze the changing trend of the synthesizability of SM derivatives Raman, energy gap, qH + , dipole moment, and E HOMO significantly impact the synthesizability of SM derivatives when the change degree of key descriptors is less than 50%. In order to further analyze the changing trend of the synthesizability of SM derivatives when the parameter eigenvalues vary greatly, this paper analyzed the growth rate of the sensitivity coefficients of the characteristic parameters. Thus, the eigenvalues that most significantly affect the synthesizability of SM derivatives were screened, and the growth rates of the sensitivity coefficients of the three models were calculated (Table 7). Results showed that the average growth rates of sensitivity coefficients for the energy gap, qH + , dipole moment, and E HOMO were all higher than 10%. However, the average growth rate of Raman's sensitivity coefficient was less than 3% and even showed a negative growth trend in the bagging-GBC algorithm. After comparison, in the case of significant changes in key eigenvalues, the potential impact of Raman on the molecular synthesis of SM derivatives is much smaller than that of the energy gap, qH + , dipole moment, and E HOMO . Therefore, energy gap, qH + , dipole moment, and E HOMO were the most significant key descriptors affecting the synthesizability of SMs and their derivatives. E HOMO refers to the energy of the highest occupied orbital of a molecule, which is one of the important quantum chemical properties of molecules [38]. The energy gap is the difference in energy between the highest and the lowest occupied orbitals of a molecule. Studies have shown that molecular E HOMO and energy gap values are closely related to molecular stability [39]. The descriptor qH + refers to the maximum charge number of molecular hydrogen ions, and intramolecular hydrogen bonds can enhance the stability of molecules [39]. The dipole moment is the product of the distance between the positive and negative charge centers in a molecule and the charge at the charge center, which is closely related to the effective charge carried by the molecule [40]. Li et al. [24] found that those mentioned above as key characteristic values with significant influence belong to the electrons of the molecule parameter. The molecular structure is associated with key eigenvalues such as E HOMO , energy gap, and other properties [41]. Laikov [42] developed a new molecular electronic structure model using the electronic parameters of the molecule and found that molecular structures were inseparable from their electronic parameters. In quantum machine learning, molecular structures and electronic parameters such as E HOMO , energy gap, and dipole moment play an extremely important role in studying molecular physico-chemical properties [43]. In summary, the eigenvalues of the molecular electronic descriptors of SM derivatives played an essential role in the training process of ML models. They had a significant impact on the synthetic probability of SM derivatives. Table 7. Growth rate of the key descriptors with high correlation in the bagging-RF, bagging-ERT, and bagging-GBC models.

The Mechanism Analysis for the Synthesizability of SMs Derivatives Based on 3D-QSAR Model
In Section 2.4.1, energy gap, qH + , dipole moment, and E HOMO were screened out as the key descriptors that significantly affect the synthesis of theoretically designed SM derivatives. In order to verify the above results, the 11 commercially synthesized SMs (positive samples), the top seven SM derivatives with high synthesizability, and the last seven SM derivatives with low synthesizability predicted by the three models were selected. The energy gap, qH + , dipole moment, and E HOMO values of the 14 SM derivatives and 11 commercially synthesized SMs are given in Table 8. Compared with the 11 synthesized SMs, the energy gap, qH + , dipole moment, and E HOMO eigenvalues of the seven synthesizable SM derivatives changed by 3.72%, −0.66%, −5.70%, and −7.57%, respectively. The energy gap, qH + , dipole moment, and E HOMO eigenvalues of the seven none synthesizable SM derivatives changed by 3.72%, −0.66%, −5.70%, and −7.57%, respectively, compared with the 11 commercially synthesized SMs. The results further showed that the eigenvalues of qH + , dipole moment, and E HOMO have relatively significant effects on the synthesizable properties of SMs. The higher the value of qH + , the larger the value of the dipole moment, and the lower the value of E HOMO , the lower the synthesizability of SM derivatives, which is consistent with the conclusion of the sensitivity analysis of the constructed machine learning models. The 3D-QSAR model can effectively analyze the relationship between molecular structural features and physico-chemical activities [23]. Therefore, the synthesizability of unlabeled samples (SM derivatives) was used as the input to construct an environment-friendly 3D-QSAR prediction model (CoMSIA) for predicting the synthesis of SM derivatives. The data set (n = 35) was composed of a training set (27 SMs) and a test set (9 SMs) for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. for 3D-QSAR model construction and validation, and the template molecule (SM 12) existed in both the training set and test set. The cross-validation coefficient q 2 is 0.753 > 0.5, indicating that the constructed 3D-QSAR model had a good prediction ability [23]. Relatively high values of non-cross-validation (R 2 = 0.971 > 0.9 and close to 1.000) and the external test coefficient (r 2 pred = 0.940 > 0.6) further proved the good predictive ability and robustness of the generated models. The standard error of estimate (SEE) of the model was 0.047 < 0.95, which confirmed the good fit ability and predictive ability of the constructed 3D-QSAR model. In addition, in order to verify the rationality of setting the positive samples and unlabeled samples of the machine learning prediction model, this paper also used the constructed 3D-QSAR model to predict the synthesizability of the 11 positive samples (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone). However, since the 3D-QSAR model can only predict molecules with a common skeleton (Table 9), it cannot predict Muscone without a common benzene ring. Based on the 3D-QSAR model prediction, it was found that 90% of the positive samples had a synthesizability greater than 0.5, indicating that the constructed 3D-QSAR model had high prediction accuracy. At the same time, it showed that the setting of positive and unlabeled samples of SMs and SM derivatives used for machine learning model training is reasonable. In the contour maps of the CoMSIA model for the molecular synthesizability of environmentally friendly SM derivatives, the contributions of hydrophobic, electrostatic, hydrogen bond acceptor, hydrogen bond donor, and steric fields were 22.6%, 19.8%, 33.2%, 20.4%, and 4.0%, respectively. The results indicated that the hydrophobic, electrostatic, hydrogen bond acceptor, and hydrogen bond donor fields significantly impacted the synthesizability of SM derivatives. In this study, unlabeled sample D50 was taken as a template molecule for analysis (Figure 4). Seven derivatives of MK (i.e., D50, D51, D57, D61, D62, D67, and D76) were selected as examples (Table 10) to analyze the synthesizability mechanism based on the contour maps. Studies showed that increasing the positive electric groups in the blue area in the contour maps was beneficial to chemical activity [22,44]. As shown in Table 10 and Figure 4a, b, compared with the synthesizability of D50 and D51, the non-synthesizable SM derivatives (i.e., D67, D61, D62, D57, and D76) had increased electronwithdrawing groups (-NO 2 ) at position 3, which led to the decreased synthesizability of D67, D61, D62, D57, and D76. Previous studies have shown that adding strong electronwithdrawing groups can reduce the E HOMO value of the molecule [45]. Then the energy gap value of the molecule will change significantly, which will increase the charge of some hydrogen atoms in the molecule. According to Long and Niu [46], the higher the qH + value of a molecule, the easier it is to accept electrons and generate electrophilic reactions. The lower the energy gap value of the molecule, the easier it is for nucleophilic and electrophilic reactions to occur, resulting in poor stability (or synthesizability) of the molecule [47,48]. Compared with the synthesizable D50 and D51, the E HOMO values of non-synthesizable SM derivatives (i.e., D57, D61, D62, and D76) were all smaller than D50 and D51. Furthermore, D57, D61, D62, D67, and D76 substitute molecules had lower energy gap values and higher qH + values than D50 and D51. The higher the value of qH + , the lower the value of E HOMO , and the smaller the energy gap value, the lower the synthesis ability of SMs substitute molecules. It showed that the higher the value of qH + , the lower the value of E HOMO , and the smaller the energy gap value, the lower the synthesizability of SM derivatives. hydrogen bond acceptor, and hydrogen bond donor fields significantly impacted the synthesizability of SM derivatives. In this study, unlabeled sample D50 was taken as a template molecule for analysis ( Figure 4). Seven derivatives of MK (i.e., D50, D51, D57, D61, D62, D67, and D76) were selected as examples (Table 10) to analyze the synthesizability mechanism based on the contour maps. Studies showed that increasing the positive electric groups in the blue area in the contour maps was beneficial to chemical activity [22,44]. As shown in Table 10 and Figure 4a, b, compared with the synthesizability of D50 and D51, the non-synthesizable SM derivatives (i.e., D67, D61, D62, D57, and D76) had increased electron-withdrawing groups (-NO2) at position 3, which led to the decreased synthesizability of D67, D61, D62, D57, and D76. Previous studies have shown that adding strong electron-withdrawing groups can reduce the EHOMO value of the molecule [45]. Then the energy gap value of the molecule will change significantly, which will increase the charge of some hydrogen atoms in the molecule. According to Long and Niu [46], the higher the qH + value of a molecule, the easier it is to accept electrons and generate electrophilic reactions. The lower the energy gap value of the molecule, the easier it is for nucleophilic and electrophilic reactions to occur, resulting in poor stability (or synthesizability) of the molecule [47,48]. Compared with the synthesizable D50 and D51, the EHOMO values of non-synthesizable SM derivatives (i.e., D57, D61, D62, and D76) were all smaller than D50 and D51. Furthermore, D57, D61, D62, D67, and D76 substitute molecules had lower energy gap values and higher qH + values than D50 and D51. The higher the value of qH + , the lower the value of EHOMO, and the smaller the energy gap value, the lower the synthesis ability of SMs substitute molecules. It showed that the higher the value of qH + , the lower the value of EHOMO, and the smaller the energy gap value, the lower the synthesizability of SM derivatives.     Yellow patches appeared surrounding positions 2 and 4 of D50 in the contour map (Figure 4c), indicating that substituting hydrophobic groups in this region are conducive to synthesizing SMs molecules [23]. Compared with the synthesizability of D50, hydrophilic groups (e.g., methoxyl and carboxyl groups) of D67, D61, and D62 were found at Yellow patches appeared surrounding positions 2 and 4 of D50 in the contour map (Figure 4c), indicating that substituting hydrophobic groups in this region are conducive to synthesizing SMs molecules [23]. Compared with the synthesizability of D50, hydrophilic groups (e.g., methoxyl and carboxyl groups) of D67, D61, and D62 were found at Yellow patches appeared surrounding positions 2 and 4 of D50 in the contour map (Figure 4c), indicating that substituting hydrophobic groups in this region are conducive to synthesizing SMs molecules [23]. Compared with the synthesizability of D50, hydrophilic groups (e.g., methoxyl and carboxyl groups) of D67, D61, and D62 were found at Yellow patches appeared surrounding positions 2 and 4 of D50 in the contour map (Figure 4c), indicating that substituting hydrophobic groups in this region are conducive to synthesizing SMs molecules [23]. Compared with the synthesizability of D50, hydrophilic groups (e.g., methoxyl and carboxyl groups) of D67, D61, and D62 were found at Yellow patches appeared surrounding positions 2 and 4 of D50 in the contour map (Figure 4c), indicating that substituting hydrophobic groups in this region are conducive to synthesizing SMs molecules [23]. Compared with the synthesizability of D50, hydrophilic groups (e.g., methoxyl and carboxyl groups) of D67, D61, and D62 were found at Yellow patches appeared surrounding positions 2 and 4 of D50 in the contour map (Figure 4c), indicating that substituting hydrophobic groups in this region are conducive to synthesizing SMs molecules [23]. Compared with the synthesizability of D50, hydrophilic groups (e.g., methoxyl and carboxyl groups) of D67, D61, and D62 were found at position 4. This should be why the synthesizable properties of the D67, D61, and D62 molecules were predicted as "non-synthesizable". It has been found that the higher the dipole moment value of the molecule, the stronger the polarity of the molecule [49]. Compared with D50, the dipole moment values of the remaining non-synthesizable D67, D61, and D62 were 4.313, 5.726, and 5.915, respectively, which were significantly higher than the dipole moment values of D50 (1.192). Thus, the possible replacement of hydrophilic groups may enhance the polarity of SMs (increased dipole moment value). The change will decrease the synthesis of SMs molecules, which is consistent with the previous conclusions. Thus, the 3D-QSAR model of the molecular synthesizability of environmentally friendly SM derivatives constructed in this paper has a good predictive ability. The results obtained from sensitivity analysis were consistent with those obtained from contour maps: the higher the value of qH + , the larger the value of the dipole moment, the lower the value of E HOMO , and the smaller the value of the energy gap, the lower the synthesizability of SM molecules.

Mechanism Verification Analysis for the Synthesizability of SM Derivatives Based on Intramolecular Hydrogen Bond Theory
In order to further analyze the synthesizable mechanism of SM derivatives, they were screened based on bagging-RF, bagging-ERT, and bagging-GBC models. The intramolecular hydrogen bond differences of the top seven SM substitute molecules (D7 > D50 > D8 > D25 > D37 > D51 > D52) and the last seven non-synthesizable SM derivatives (D67 < D62 < D39 < D57 < D61 < D76 < D49) were compared to analyze the intrinsic factors affecting the internal reasons of synthesizability. HHCB and MK derivatives from the above 14 SM derivatives were selected as representatives. D7, D8 (synthesizable), and D39, D49 (non-synthesizable) were the derivatives of HHCB. D50, D51, and D52 (synthesizable) and D57 and D76 (non-synthesizable) were the derivatives of MK.  (Table 11). Comparing the number of intramolecular hydrogen bonds, it was found that the formation rate of intramolecular hydrogen bonds in the synthesizable SM derivatives was 100%. The predicted rate of intramolecular hydrogen bond formation in non-synthesizable SM derivatives was only 50% ( Figure 5). Since the basic forms of intramolecular hydrogen bonds in the SM derivatives were all C-H···O, it was speculated that the hydrogen bond between the methyl group and the nitro group restricts the rotation of the C-C bond between the benzene ring and the nitro group in the molecular center of the SM derivatives. Thus, D7, D8, D50, D51, D52, D57, and D76 have stable, planar molecular conformations that allow for synthesis. However, the lack of intramolecular hydrogen bonds in D39 and D49 resulted in unclear molecular conformation and unstable structures. Thus, D39 and D49 had low synthesizability. The above inference was consistent with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. bonds in D39 and D49 resulted in unclear molecular conformation and unstable structures. Thus, D39 and D49 had low synthesizability. The above inference was consistent with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. bonds in D39 and D49 resulted in unclear molecular conformation and unstable struc-tures. Thus, D39 and D49 had low synthesizability. The above inference was consistent with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. tures. Thus, D39 and D49 had low synthesizability. The above inference was consistent with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. tures. Thus, D39 and D49 had low synthesizability. The above inference was consistent with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. with a previous study [50]. The expected molecular conformation and arrangement can be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. be obtained by rationally designing the intramolecular hydrogen bond between the amide and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50]. and alkoxy groups. The bispyridyl aromatic dicarboxamide derivatives and their complexes can thus be designed [50].    Although the formation rate of intramolecular hydrogen bonds in non-synthesizable SM derivatives was only 50%, D57 and D76 still had intramolecular hydrogen bonds due to structural reasons. Therefore, we further analyzed the influences of intramolecular hydrogen bonds on the synthesizability of MK derivatives (i.e., D50, D51, D52, D57, and D76). Intramolecular hydrogen bond strength is difficult to calculate accurately due to the complex and strict operation process. If the system needs to be cut off and the cutoff point should be saturated, the structure needs to be adjusted to avoid severe steric hindrance. Emamian et al. [51] first proposed that the bond critical point (BCP) electron density can be defined by AIM theory to estimate the intramolecular hydrogen bond strength (E_HB). Emamian et al. [51] also redefined the standard definition of intramolecular hydrogen bond strength, where E_HB > −2.5 kcal/mol is weak strength and E_HB < −2.5 kcal/mol is weak strength. Among these, E_HB > −2.5 kcal/mol is "very weak intensity," and −14 < E_HB < −2.5 kcal/mol is "weak to medium intensity." The BCP electron densities of D7, D8, D50, D51, D52, D57, and D76 were calculated by density functional theory (DFT). The strength of the intramolecular hydrogen bonds of these derivatives was then estimated (Table 11, Figure 5). Among the synthesizable SM derivatives, D50 formed three extremely weak intramolecular hydrogen bonds with E_HB of −2.336, −1.102, and −1.101 kcal/mol, respectively, and two weak intramolecular hydrogen bonds with E_HB of −2.704 and −2.707 kcal/mol, respectively. D51 formed five extremely weak intramolecular hydrogen bonds (with the E_HB of −1.960, −1.602, −1.919, −2.100, and −0.677 kcal/mol) and one weak intramolecular hydrogen bond (with E_HB of −4.039 kcal/mol). D52 formed two extremely weak intramolecular hydrogen bonds (E_HB were −2.182, −1.091 kcal/mol) and two weak intramolecular hydrogen bonds (E_HB were −2.885, −4.127 kcal/mol). Among the nonsynthesizable SM derivatives, D57 formed four "very weak intensity" intramolecular hydrogen bonds (E_HB were −2.280, −2.105, −2.167, and −2.128 kcal/mol) and one weak intramolecular hydrogen bond (E_HB was −2.576 kcal/mol). D76 formed three "very weak Although the formation rate of intramolecular hydrogen bonds in non-synthesizable SM derivatives was only 50%, D57 and D76 still had intramolecular hydrogen bonds due to structural reasons. Therefore, we further analyzed the influences of intramolecular hydrogen bonds on the synthesizability of MK derivatives (i.e., D50, D51, D52, D57, and D76). Intramolecular hydrogen bond strength is difficult to calculate accurately due to the complex and strict operation process. If the system needs to be cut off and the cutoff point should be saturated, the structure needs to be adjusted to avoid severe steric hindrance. Emamian et al. [51] first proposed that the bond critical point (BCP) electron density can be defined by AIM theory to estimate the intramolecular hydrogen bond strength (E_HB). Emamian et al. [51] also redefined the standard definition of intramolecular hydrogen bond strength, where E_HB > −2.5 kcal/mol is weak strength and E_HB < −2.5 kcal/mol is weak strength. Among these, E_HB > −2.5 kcal/mol is "very weak intensity", and −14 < E_HB < −2.5 kcal/mol is "weak to medium intensity". The BCP electron densities of D7, D8, D50, D51, D52, D57, and D76 were calculated by density functional theory (DFT). The strength of the intramolecular hydrogen bonds of these derivatives was then estimated (Table 11, Figure 5). Among the synthesizable SM derivatives, D50 formed three extremely weak intramolecular hydrogen bonds with E_HB of −2.336, −1.102, and −1.101 kcal/mol, respectively, and two weak intramolecular hydrogen bonds with E_HB of −2.704 and −2.707 kcal/mol, respectively. D51 formed five extremely weak intramolecular hydrogen bonds (with the E_HB of −1.960, −1.602, −1.919, −2.100, and −0.677 kcal/mol) and one weak intramolecular hydrogen bond (with E_HB of −4.039 kcal/mol). D52 formed two extremely weak intramolecular hydrogen bonds (E_HB were −2.182, −1.091 kcal/mol) and two weak intramolecular hydrogen bonds (E_HB were −2.885, −4.127 kcal/mol). Among the non-synthesizable SM derivatives, D57 formed four "very weak intensity" intramolecular hydrogen bonds (E_HB were −2.280, −2.105, −2.167, and −2.128 kcal/mol) and one weak intramolecular hydrogen bond (E_HB was −2.576 kcal/mol). D76 formed three "very weak intensity" intramolecular hydrogen bonds (E_HB were −2.049, −2.200, and −2.485 kcal/mol) and one weak intramolecular hydrogen bond (E_HB was −2.781 kcal/mol). Statistical analysis found that the formation rate of weak intramolecular hydrogen bonds in predicted synthesizable SM derivatives was 33.33%, and the weak intramolecular hydrogen bond formation rate in non-synthesized SM derivatives was 22.22%. It showed that "very weak intensity" intramolecular hydrogen bonds are more conducive to synthesizing SM derivatives than weak intramolecular hydrogen bonds. Studies have confirmed that when the intramolecular hydrogen bond is weak, it is mainly dominated by electrostatic interaction forces [51]. This result is consistent with the conclusion in Section 2.4.2 that the electrostatic field can affect the synthesizability of SMs and their derivatives. In addition, based on the sensitivity analysis results of the synthesizability-machine learning model, it can be seen that the energy gap, qH + , dipole moment, and E HOMO are key descriptors affecting the synthesizability of SM derivatives. Among them, the size of E HOMO and the energy gap are closely related to molecular stability, and qH + affects the interaction between donor and acceptor in intramolecular hydrogen bonds [39]. Therefore, the intramolecular hydrogen bond strength analysis, the sensitivity analysis of the three synthesizability-machine learning models, and the contour map analysis by the 3D-QSAR model were mutually verified. The higher the intramolecular hydrogen bond strength, electrostatic interaction, qH + value, and energy gap value, the lower the E HOMO value, the more stable the SMs substitute molecule, and the higher the synthesis probability will be.

Molecular Structures of SMs and SM Derivatives-Literature Review Method
A total of 88 SMs and their derivatives were retrieved, including 11 commercialized SMs (i.e., Phantolide, Celestolide, Tonalid, Galaxolide, Versalide, Musk xylene, Muscone, Musk methy, Musk ambrette, Moskene, and Musk ketone) and 77 environmentally friendly SM derivatives that have been theoretically designed but have not yet been experimentally synthesized. The 77 SM derivatives consist of 19 functionally improved SM derivatives [23], 10 SM derivatives with reduced bioaccumulation ability and enhanced odor sensitivity [24], and 48 SM derivatives with lower abortion risks [18]. The molecular structures of 88 SMs and SM derivatives are shown in Table S1.

Construction of Machine Learning Models for the Synthesizability of SM Derivatives
SMs are widely used in personal care products. Due to the extensive use of SM, environmental and human health problems were accelerated. Therefore, it is of great practical significance to develop and design environmentally friendly SM derivatives. The process of developing SM derivatives takes a long time and much labor. The machine learning (ML) method can reduce unnecessary consumption in the early research stage by helping to discover and terminate the molecular design of SM derivatives with low synthesizability. Therefore, it is necessary to predict the synthesizability of the theoretically designed SM derivatives before synthesizing.

Calculation of Molecular Descriptors of SMs and SM Derivatives-Software Calculation Method
The research found that structural parameters (e.g., Gibbs energy, atomic volume, energy of crystal structure, bond strength, and bond length between adjacent atoms, total energy, atomic energy, formation energy, Bader charge, lattice constant, and electronegativity) can affect the physic-chemical properties of molecules [24,52]. Dolz et al. [53] pointed out that the maximum exfoliation energy is a key descriptor affecting the synthesis of MXenes. Mladenović et al. [54] reported that Highest Occupied Molecular Orbital is one of the key descriptors affecting the synthesis of 4-hydroxy-chromene-2-one. Therefore, the descriptors of SMs molecules were selected in this study as the original eigenvalues of the ML model for predicting the synthesizability of SM derivatives. ChemBioDraw 12.0 (PerkinElmer, USA) was utilized to calculate the physico-chemical parameters, struc-tural parameters, and topological parameters (e.g., critical temperature, critical pressure, hydrophobic constant of organic compounds, Henry constant, heat of formation, steric parameters, molecular weight, and polar surface area) of SMs and SM derivatives [37]. The density functional theory (DFT) in Gaussian 09 software was used to optimize the molecular structures of SMs and SM derivatives at the B3LYP/6-31G basis set level and to calculate the spectral, geometric, and electronic parameters (e.g., Milligan charge, occupied orbital energy, positive frequency value, energy gap value, dipole moment, quadrupole moment, infrared, and Raman spectra) [24]. The topological, electronic, geometric, and physico-chemical parameters (e.g., van der Waals volume and atomic number) of SMs and SM derivatives can be calculated using PaDEL-Descriptor software.

Dimensionality Reduction on Descriptors of SMs and SM Derivatives-Pearson Correlation Coefficient Method
The Pearson correlation coefficient is widely used in statistical analysis, pattern recognition, image processing, and other fields. The Pearson correlation matrix can be used to select suitable descriptors for multiple linear regression analysis [55] and analyze the correlation between production data, dilution attributes, and system efficiency [56]. The molecular descriptors of SMs and SM derivatives calculated in this study may have repeatability. Moreover, in the process of constructing PU machine learning models, not all descriptors can provide different molecular information. Some descriptors are highly correlated and express similar molecular information [56]. Therefore, Pearson's correlation coefficient method was used in this study to rank the eigenvalues with a high degree of similarity in the correlation coefficient matrix [56,57]. Eliminating features with high similarity and screening out the key descriptors. The Pearson correlation coefficient can be calculated below: Among which, _ x and _ y are the average values of the two eigenvalues of SMs and SM derivatives (X and Y), respectively. The absolute value of the Pearson correlation coefficient r xy is less than or equal to 1, indicating a degree of correlation: r xy > 0.5 indicates a strong correlation; 0.3 < r xy < 0.5 indicates a moderate correlation; 0.1 < r xy < 0.3 indicates a weak correlation; and r xy < 0.1 indicates almost no correlation. SPSS 18.0 software (SPSS Inc., Chicago, IL, USA) is commonly used for Pearson correlation coefficient analysis of eigenvalues. However, since there are over 1000 descriptors of SMs and SM derivatives in this study, the Pearson correlation coefficient was calculated and analyzed by self-writing "code packages" in python software [28]. The correlation between each molecular descriptor was calculated as well. The number of eigenvalues required for PU machine learning was set by adjusting the classification threshold (which was set to p = 0.6 in this study), and the highly correlated descriptors were eliminated independently. The specific process code is shown in Figure S1.

Molecular Synthesizable Prediction Model for SM Derivatives-Bagging-Random Forests Algorithm
In this paper, the number of positive samples (i.e., 11 synthesized and commercialized SMs) used to construct the PU machine learning prediction model of synthesizability is small, and its number is much smaller than that of unlabeled samples (i.e., 77 theoretically designed SM derivatives). Thus, traditional classifiers such as the random forest (RF) method are more appropriate. A bagging classification model was constructed in this study based on the basic classifier (i.e., RF) and the independent variables (i.e., descriptors obtained by dimensionality reduction through the Pearson correlation coefficient method) [58].
In the Bagging-RF classification method, the synthesized commercial SMs were set as positive samples (marked as "1"), and the theoretically designed yet unsynthesized environment-friendly SM derivatives were set as unlabeled samples (marked as "0"). The above samples were used to construct a PU machine learning classification model. During the model training process, unlabeled samples are divided into unlabeled synthesizable samples and unlabeled non-synthesizable samples, and their synthesizability can be output simultaneously. By adjusting the n_estimators, criterion, and random_state parameters inside the RF algorithm and the n_estimators, max_samples, random_state, and other parameters in the PU classifier, a PU classification model with a model evaluation score (out-of-bag (OOB) score) greater than 0.7 is obtained. Finally, the model with a high oob_score (greater than 0.7) is used to predict the synthesizability of the positive and unlabeled samples. In this way, the synthesizability of SM derivatives was obtained. When the synthesizability is >0.5, it is considered synthesizable, and when the synthesizability is <0.5, it is considered not synthesizable [28]. The specific code package is shown in Figure S2.

Molecular Synthesizable Prediction Model for SM Derivatives-Bagging-Extremely Randomized Tree Classifier Algorithm
In this paper, the bagging-Extremely Randomized Tree Classifier (ERT) machine learning method was also used to predict the synthesizability of environmentally friendly SM derivatives. ERT was used as an alternative algorithm for PU machine learning predictive models to demonstrate that the synthesis of SM derivatives is not dependent on a specific predictive model. The ERT algorithm is a machine learning method based on tree structure for decision-making. Its algorithm is very similar to the random forest algorithm, which is composed of many decision trees. The RF uses randomly selected samples. In comparison, the ERT uses all samples with randomly selected molecular characteristics because the split is random. Therefore, to some extent, ERT is more appropriate than the prediction results simulated by the RF algorithm [59]. The RF model selects the optimal forked features in a feature subset, while the ERT model randomly selects the forked features [60]. The ERT model can adjust the minority class of target features in classification by reducing the variance of tree-splitting nodes [60]. The ERT model was used to reduce the variance inherent in many tree-based and neural network algorithms through an enhanced tree splitting technique. Due to its randomization properties for numerical inputs, ERT was very efficient in solving problems involving a large number of numerical features [60]. Thus, this paper adopted the ERT model for the synthesizability prediction of unlabeled samples. An ensemble of decision trees generated a decision function. The classifier took an input feature vector and classified it for each tree in a forest-like structure. Then the labeled class was output based on the majority vote [61]. The specific code package is shown in Figure S3.
ERT is an extension of RF, where a further randomization stage is added for selecting cut points and, at the same time, randomizes the attributes in RF, randomly selecting attributes and splitting cut points [62]. According to Soltaninejad et al. [62], each tree was determined by t {1 . . . T}, where T is the number of random trees. For a given data point x and data set Dtrain, the feature vector is represented by f Dtrain, and the feature vector is represented by f (x, D train ). To classify the data from class c, each tree learned a weak predictor pt(c| f (D train )) for an n-dimensional feature representation. During testing, for an unseen data point x', the probability of belonging to class c is calculated by the average of the probabilities on all trees, as shown in Equation (5):

Molecular Synthesizable Prediction Model for SM Derivatives-Bagging-Gradient Boosting Classifier
In this paper, the machine learning of the bagging-gradient boosting classifier (GBC) was used to construct and compare the predictive models for the synthesis of SM derivatives to prove once again that the synthetic properties of SM derivatives do not depend on specific prediction models. GBC is an ensemble classifier that performs well when the number of variables exceeds the number of samples (high-dimensional data) [63]. GBC has been applied to the status classification of water quality. Experiments showed that GBC was more effective in classifying water quality status than the AdaBoost classifier, supporting vector classifiers, and random forest classifiers [64]. Gradient boosting is a method used to develop classification and regression models to optimize the learning process of the models, which are mostly nonlinear and are more broadly known as decision trees or regression trees. The GBC algorithm [65,66] first used prior information to initialize the classifier F 0 (x). F 0 (x) is the average value of the training target value, and the proportion of y = 1 in the training sample is P(Y = 1|x). For the first iteration, t = 1, the formula was shown as follows: The loss function negative gradient r m,i can be calculated by Formula (7), among which m = 1, 2, 3, . . . , M.
Best-fit value c m,j can be calculated using Formula (8). The regression tree was used to fit the data x i , r m,j , among which i = 1, 2, 3, . . . , N. The leaf nodes under the mth regression tree are R m,j , among which j = 1, 2, 3 · · · , J m, J m is the number of leaf nodes for the mth regression tree. c m,j = ∑ x i∈ R m,j r m,i ∑ x i∈ R m,j (y − r m,i )(1 − y + r m,i ) The calculation of F m (x) classifier: The best classifier F M (x): The final computational form of the classifier model: The detailed code package is provided in Figure S4. Sensitivity analysis is mainly used to analyze the sensitivity of model output values to changes in characteristic parameters [29]. In this study, the linear regression module of SPSS 18.0 software (SPSS Inc., Chicago, IL, USA) was used to construct two-dimensional quantitative structure-activity relationships (2D-QSAR) models. The models were constructed based on the dependent variable (i.e., synthesizability of SM derivatives) and the independent variables (i.e., key descriptors screened by the machine learning model). The structure-activity relationship between the synthesizability of SM derivatives and the characteristic parameters was thus created. Sensitivity analysis was then carried out with the help of characteristic parameter coefficients to screen the key descriptors, which helps to accurately analyze the influence of key descriptors on the synthesizability of SM derivatives [30].
The sensitivity coefficient can be calculated as follows: where SC i represents the sensitivity coefficient of the input feature parameter i, ∆X i /X i indicates the change rate of the input feature parameters, ∆Y i /Y i is the synthesizability change rate of SM derivatives. The sensitivity analysis on each descriptor was conducted with the help of each descriptor's coefficient in the linear regression model. The sensitivity coefficients for each descriptor varying by 10%, 20%, 30%, 40%, and 50% of its value were calculated, which helped to screen out the key descriptors.

Molecular Force Field Analysis Affecting Molecular Synthesizability of SM Derivatives-3D-QSAR Model
The chemical structures (i.e., designed SM derivatives) and their synthesizability were used as independent and dependent variables, respectively, for the establishment of 3D-QSAR models. The synthesizability of unlabeled samples (i.e., SM derivatives) is predicted by the ML model with the highest model evaluation score. Using the SM derivative with the highest synthesizability as a template, a 3D-QSAR model about the synthesizability of environmentally friendly SM derivatives was constructed. The molecules of SM derivatives were optimized using the Tripos force field and Gasteiger-Hückel charges [67]. The Powell method was optimized up to 10,000 times, the energy convergence gradient value was set to 0.005 kJ/mol, and the rest of the parameters were set to default values [5]. The template molecules were then aligned with other SMs using the align database command in SYBYL.
The constructed 3D-QSAR model was then used to predict the synthesizability of positive samples (i.e., synthesized and commercialized) for verifying the molecular synthesizability of SMs and their derivatives predicted by the PU ML model. The plausibility of the positive and unlabeled samples used in the PU ML model was also verified. At the same time, the coupling of the force field information of the 3D-QSAR contour maps with the key descriptors affecting the molecular synthesis of SM derivatives was used to analyze the synthesizable mechanism of SM derivatives further.

Theoretical Analysis of Intramolecular Hydrogen Bonds Affecting Synthesizability of SM Derivatives-DFT Method
The stability of chemicals is closely related to the ease of synthesis [68], and intramolecular hydrogen bonds can directly affect the stability of compound molecules [32]. For example, Berl et al. [69] reported in "Nature" that the intramolecular hydrogen bond between the amide group and the pyridine group could ensure that the pyridine carboxamide oligomer has a stable double helix structure. Intramolecular hydrogen bonds have directional non-covalent interactions. According to this interaction, highly planar molecular conformations can be designed, and finally, newly designed compounds with high stability and easy synthesis can be obtained [33]. The basic form of intramolecular hydrogen bonds is D−H···A, where D is a hydrogen bond donor and A is a hydrogen bond acceptor, and the length of H···A is usually less than 3.2 Å [70]. This paper used the DFT method in Gaussian 09 software (Gaussian, Inc., Wallingford, CT, USA) to optimize the molecular structure of SM derivatives under the B3LYP/6-31G basis set [71]. The bond critical point (BCP) was obtained by using the topology analysis module of the wave function analysis program Multiwfn, and finally, the intramolecular hydrogen bond strength (E_HB) was estimated based on the BCP electron density defined by the Atoms in Molecules (AIM) theory. The AIM theory is mainly based on the topological properties of the electron density function to describe the bonding situation in the SM derivative molecules, where the point between two interacting atoms in the SM substitute molecule is defined as BCP [4].

Conclusions
This study investigated the synthesizability of environmentally friendly SM derivatives by integrating machine learning, 3D-QSAR, 2D-QSAR, and DFT methods. Seven of the 77 SM derivatives screened were found to have high synthesizability and low environmental risks. SM derivatives were recommended based on their performances. D37 and D52, with low abortion risks, can be used as alternatives to HHCB and MK. D52, with improved odor sensitivity (increased 10.77%), can be used as an alternative with improved functional properties. From the perspective of reducing the dermal exposure risks of SMs, D25 had a lower skin keratin enrichment (−17.25) and thus could be used as an alternative to PHAN. In addition, the synthesizability of SM derivatives was analyzed by 3D-QSAR, 2D-QSAR, and DFT methods. Based on the mechanism analysis, descriptors such as qH + and energy gap can influence the synthesizability of SMs and their derivatives. This study analyzed the molecular structure and physico-chemical parameters of SM derivatives to accurately screen high-performance, environmentally friendly SM derivatives with high synthesizability. This research has remarkably improved in synthesizing high-performance alternatives for emerging contaminants from molecular aspects. This technology can also be applied to other fields, promoting the discovery and development of more advanced functional materials in various areas.  Data Availability Statement: Data will be provided upon request.