Designing Sustainable Hydrophilic Interfaces via Feature Selection from Molecular Descriptors and Time-Domain Nuclear Magnetic Resonance Relaxation Curves

Surface modification using hydrophilic polymer coatings is a sustainable approach for preventing the clogging of water treatment membranes caused by foulant adhesion and for reducing membrane-replacement frequency. Typically, both molecular descriptors and time-domain nuclear magnetic resonance (TD-NMR) data, which reveal physicochemical properties and polymer-chain dynamics, respectively, are required to predict the properties and understand the mechanisms of hydrophilic polymer coatings. However, studies on the selection of essential components from high-dimensional data and their application to the prediction of surface properties are scarce. Therefore, we developed a method for selecting features from combined high-dimensional molecular descriptors and TD-NMR data. The molecular descriptors of the monomers used to coat polyethylene terephthalate films were calculated using RDKit, an open-source chemoinformatics toolkit, and TD-NMR measurements were performed over a wide time range using five pulse sequences to investigate the mobility of the polymer chains. The model that analyzed the data using the random forest algorithm, after reducing the features using gradient boosting machine-based recursive feature elimination, achieved the highest prediction accuracy. The proposed method enables the extraction of important elements from both types of descriptors of surface properties and can contribute to the development of new sustainable materials and material-specific informatics methodologies encompassing multiple information modalities.


Introduction
The bioeconomy [1] and circular economy [2] are key to realizing a sustainable society. With the shift in focus toward managing the life cycle of plastics [3], understanding the interfaces of materials has become crucial. For example, it is important to understand the mechanisms underlying biological and chemical reactions that occur at material interfaces, including microbial reactions that degrade polyethylene, polystyrene, and polypropylene [4,5], enzymatic reactions that degrade polyethylene terephthalate (PET) [6], and marine biofouling [7]. Accordingly, antifouling methods based on wettability (the hydrophilicity or hydrophobicity of a material), antifoulant release, self-renewability, temperature and pH changes, and biomimetics have been developed [8]. Biocompatible 2-methacryloyloxyethyl phosphorylcholine was developed by introducing phosphatidylcholine, a component of biological membranes [9]; the hydration layer formed on the polymer imparts strong antifouling properties [10]. A hydration layer, namely an intermediate water layer, formed on poly(2-methoxyethyl acrylate) (PMEA) also plays a critical role in preventing fouling [11,12]. Thus, both hydrophilic and hydrophobic polymer coatings play crucial roles in controlling fouling. Superhydrophobic polymers exhibit antifouling properties owing to their low surface free energies [13]. In addition, elastomers based on silicone or polydimethylsiloxane are used to prevent fouling. However, the adhesion between such coating materials and substrates is weak, and thus, various studies have been conducted to improve the adhesion using nanofiller mixtures [14].
The functionality of polymers is influenced by their microscopic molecular structures as well as their intricate molecular dynamics, including the behavior of polymer chains and their entanglement. Hence, comprehending and managing molecular dynamics is pivotal for effectively understanding and harnessing polymer properties [15]. The surface properties of materials, such as rigidity, affect their hydrophilicity [16]. Nuclear magnetic resonance (NMR) is a powerful tool for analyzing molecular dynamics, and the corresponding signals can be measured over a wide range of timescales, from picoseconds to milliseconds (ms) [17]. Thus, NMR spectroscopy can be employed to study the relationships between the functions and physical properties of materials and their structures [18]. Proton and carbon-13 NMR (¹H NMR and ¹³C NMR) spectroscopies are frequently used in polymer development [18,19], and time-domain NMR (TD-NMR), a highly promising analytical tool, is extensively utilized to explore the impact of internal and external factors on the structure and properties of various materials, including polymers, fresh foods, processed food products, and agricultural items [20].
Because NMR spectroscopy generates a large amount of high-dimensional data pertaining to molecular dynamics, various measurement informatics technologies have been developed in parallel to streamline the associated measurement process [21]. To optimize machine learning (ML) performance, the extraction of optimal features from raw data is crucial. Moreover, high-dimensional datasets often lead to overfitting [22]. To address these challenges, approaches involving dimensionality reduction and/or feature selection are employed. Methods such as principal component analysis [23], multidimensional scaling [24], and linear discriminant analysis [25] are used for reducing feature space dimensions owing to their high efficacy in identifying highly relevant descriptors, which are commonly referred to as key features and are particularly beneficial for ML applications [26]. Additionally, non-negative matrix factorization (NMF), partial least squares [27], and semi-supervised NMF have been used as dimension reduction methods [28], in which genetic algorithms [29,30] are utilized to meaningfully reduce the relaxation component to 10% [31]. Recursive feature elimination (RFE), a type of feature selection, has been applied to the quality control of polylactic acid processing [32] and to antibacterial peptide development [33]. Recently, a new RFE approach has been proposed that evaluates the "feature (variable) importance" using support vector machine (SVM), random forest (RF), and gradient boosting machine (GBM) models and then eliminates the least important features [34].
As mentioned before, analyzing microscopic molecular structures is also necessary for assessing the surface properties of materials. For instance, in an antifouling membrane [8], microorganism adhesion occurs after the initial formation of conditioning films [35]. The antifouling ability of PMEA originates from the interactions of its constituent carboxy and methoxy groups with water molecules [36]. Hence, controlling the intermolecular interactions is crucial. Materials informatics (MI) involves the analysis of microscopic surficial molecular structures using informatics technology. To date, various MI models based on molecular descriptors have been developed using open-source tools, such as RDKit, which is widely utilized in chemoinformatics. These models aid in devising synthesis strategies for molecules, including inorganic nickel(II) salts, organic photosensitizers [37], and amphiphilic copolymers [38]. Although MI research is primarily focused on extracting essential physicochemical components, progress in integrating these findings with molecular dynamics has been limited. Specifically, studies on the application of diverse types of RFE algorithms to NMR relaxation curves are scarce.
In the present study, we constructed an ML model that incorporates both molecular and dynamics descriptors to predict the hydrophilicity of hydrophilic polymer coating materials. The conceptual framework of the study is illustrated in Figure 1. We conducted RF classification using a combination of RDKit descriptors, five distinct pulse sequences from TD-NMR spectroscopy, and different ultraviolet (UV) wavelengths applied during the manufacturing of the coating material. This study was focused on enhancing interpretability through the application of RFE as a feature selection method.
propyl]dimethyl (3-sulfobutyl)ammonium hydroxide inner salt (FOM-3010), N-tert-butylacrylamide (NTBA) were used as hydrophilic monomers (Figure S1). 2-Hydroxy-4′-(2-hydroxyethoxy)-2-methylpropiophenone (Irgacure 2959) was used as the photoinitiator. Methanol was used as the solvent to dissolve the monomers. FOM-3006, FOM-3007, FOM-3008, FOM-3009, FOM-3010, and methanol were obtained from Fujifilm Wako Pure Chemical Corporation (Osaka, Japan). NTBA and Irgacure 2959 were purchased from Tokyo Chemical Industry Co., Ltd. (Tokyo, Japan) and Sigma-Aldrich (Tokyo, Japan), respectively.

Surface Coating
Surface coatings were prepared by copolymerizing two types of hydrophilic monomers on PET sheets. PET was adopted owing to its strong affinity for hydrophilic polymers and its high recycling rate (approximately 85% in Japan) [39], which makes it one of the most sustainable plastics.
Two monomers were randomly selected and dissolved in methanol, which contained the photoinitiators listed in Table S1. Approximately 200 µL of each mixture was coated onto a PET sheet with dimensions of approximately 3.5 cm × 3.5 cm (Cosmoshine A4360, TOYOBO, Osaka, Japan) using a micropipette. The mixing ratio of each monomer is listed in Table S1. The coated sheets were dried at 50 °C for 10 min in a constant-temperature drying oven (NDO-420, TOKYO RIKAKIKAI CO., LTD., Tokyo, Japan). Subsequently, the sheets were cured inside a box by exposure to UV light with a wavelength of either 254 nm or 365 nm from a handheld UV lamp (SLUV-4, AS ONE CORPORATION, Osaka, Japan).

TD-NMR Measurements
TD-NMR measurements were conducted at 298 K using a Minispec mq20 NMR spectrometer (Bruker, Billerica, MA, USA) to assess the dynamics of the polymer chains within the hydrophilic coating. The instrument provides the Carr-Purcell-Meiboom-Gill (CPMG) [28], double quantum (DQ) filter [40], magic sandwich echo (MSE) [40], solid echo (SE) [41], and magic and polarization echo (MAPE) [40] pulse sequences. The PET sheets were cut into pieces measuring approximately 1 mm × 10 mm and placed in measurement tubes without any solvent. T2 relaxation curves were obtained using the five pulse sequences, viz., CPMG, DQ, MSE, SE, and MAPE. Because CPMG can measure long relaxation times on the order of ms, it provided information on rapid molecular mobility. SE can measure short relaxation times on the order of µs and was thus used to probe high-order structures, such as crystalline and amorphous structures. However, the application of SE is limited by its associated dead time. Thus, the MSE, DQ, and MAPE pulse sequences, which can overcome this dead-time issue, were also employed. MSE consists of DQ and MAPE; DQ can measure extremely short relaxation times, whereas MAPE can measure relaxation times longer than those measured by DQ. The relaxation curves exhibited distinct time regions for the slow and fast mobile components of the polymers. Specifically, the MAPE, DQ, MSE, and SE sequences depicted the slow mobile components, while CPMG captured the fast mobile components in the ms range. MSE represented both slow and relatively fast mobile components. Furthermore, the DQ and SE sequences detected the slow mobile components with short relaxation times.
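The link between mobility and relaxation time described above can be illustrated with a minimal sketch. It assumes that the signal is a sum of mono-exponential T2 decays, which is a simplification not stated in the text (rigid components often decay non-exponentially), and all amplitudes and T2 values are hypothetical.

```python
import math

def signal(t_ms, components):
    """Sum of exponential T2 decays; components is a list of (amplitude, T2 in ms)."""
    return sum(a * math.exp(-t_ms / t2) for a, t2 in components)

# Hypothetical film (not measured data): a rigid fraction with a short T2
# and a mobile fraction with a long T2.
rigid = (0.7, 0.02)   # T2 = 20 µs
mobile = (0.3, 5.0)   # T2 = 5 ms
film = [rigid, mobile]

early = signal(0.01, film)  # 10 µs: both fractions still contribute
late = signal(1.0, film)    # 1 ms: the rigid fraction has fully decayed

# Share of the late signal that comes from the mobile fraction alone.
mobile_share_late = signal(1.0, [mobile]) / late
print(f"early={early:.3f} late={late:.3f} mobile share at 1 ms={mobile_share_late:.6f}")
```

Under these assumptions, the ms window sampled by CPMG carries essentially only the mobile fraction, whereas the µs window sampled by the echo sequences is dominated by the rigid fraction.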

Contact Angle Measurement
Contact angle measurements were conducted using a contact angle meter (DMs-401, Kyowa Interface Science Co., Ltd., Saitama, Japan). A 2 µL droplet of clean water was dispensed using a microsyringe, and its side image was captured using the accompanying digital camera. From the obtained image, the contact angle was automatically calculated using the θ/2 method. The contact angle of each specimen was measured thrice, and the average value was adopted as the final measured contact angle. Based on the measured contact angles, the films were divided into two groups using a threshold of 25°, 30°, 35°, or 40°: films with contact angles less than the threshold were categorized as 0, while those with contact angles exceeding the threshold were classified as 1. This classification scheme was adopted to avoid large differences in the number of samples between the two classes.
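The thresholding step can be sketched as follows. The contact angles are hypothetical values, not the study's data, and assigning an angle exactly equal to the threshold to class 1 is an assumption, because the text does not specify that case.

```python
from collections import Counter

def binarize(angles_deg, threshold_deg):
    """Return 0 for angles below the threshold and 1 otherwise.

    Treating an angle equal to the threshold as class 1 is an assumption.
    """
    return [0 if a < threshold_deg else 1 for a in angles_deg]

# Hypothetical mean contact angles in degrees (each the average of three readings).
angles = [8.0, 12.5, 22.0, 27.5, 33.0, 38.0, 41.5, 47.0, 55.0, 72.0]

# Compare the class balance obtained with each candidate threshold.
for threshold in (25, 30, 35, 40):
    counts = Counter(binarize(angles, threshold))
    print(f"threshold {threshold}°: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

Inspecting the class counts for each threshold in this way makes it easy to see which thresholds leave one class too small to train on.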

Generation of Molecular Descriptors
Molecular descriptors of the monomers were generated from simplified molecular input line entry system (SMILES) strings [42], which represent chemical structures as text, using RDKit (version 2023.3.2) (Table S2). The descriptors of all the monomers were selected (Table S3). The molecular descriptors of the copolymers were calculated based on the mixing ratio of the monomers and photoinitiators. Additionally, the numbers of chemical bonds of each type (C-C, C=C, C-N, and C-O) in each monomer were utilized as descriptors. Furthermore, the bond distances between the vinyl groups were manually calculated based on the individual bond numbers and bond lengths (C-C: 1.54 Å, C=C: 1.34 Å, C-N: 1.43 Å, and C-O: 1.43 Å).
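The two manual calculations above can be sketched as follows. The sketch assumes that copolymer descriptors are ratio-weighted averages of the component descriptors and that the vinyl-to-vinyl distance is the plain sum of bond lengths along the connecting path; the text states only that both were calculated from mixing ratios and from bond numbers and lengths. The descriptor values and bond counts are hypothetical, and the function names are illustrative.

```python
# Bond lengths in angstroms, as quoted in the text.
BOND_LENGTH = {"C-C": 1.54, "C=C": 1.34, "C-N": 1.43, "C-O": 1.43}

def path_length(bond_counts):
    """Sum bond lengths along a path, given the count of each bond type."""
    return sum(BOND_LENGTH[bond] * n for bond, n in bond_counts.items())

def mix_descriptors(components):
    """Ratio-weighted average of descriptor dicts (assumed mixing rule).

    components is a list of (mixing_ratio, descriptor_dict) pairs that
    share the same descriptor names.
    """
    total = sum(ratio for ratio, _ in components)
    names = components[0][1].keys()
    return {
        name: sum(ratio * desc[name] for ratio, desc in components) / total
        for name in names
    }

# Hypothetical descriptor values for two monomers mixed at 3:1.
monomer_a = {"MolLogP": 1.0, "TPSA": 30.0}
monomer_b = {"MolLogP": -1.0, "TPSA": 90.0}
copolymer = mix_descriptors([(3, monomer_a), (1, monomer_b)])

# Hypothetical linker between two vinyl groups: two C-C and two C-N bonds.
distance = path_length({"C-C": 2, "C-N": 2})
print(copolymer, round(distance, 2))
```

In practice the per-monomer dictionaries would be filled from the RDKit descriptor output rather than typed by hand.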

Data Analysis
The data were processed using Python (version 3.10.12), the scikit-learn library (version 1.2.2), LightGBM (version 1.2.2), and XGBoost (version 2.02). Autoscaling (standardization) was conducted for both the molecular descriptors and TD-NMR relaxation curves. Because the mobile molecules were assumed to contribute to hydrophilicity, TD-NMR data from all five pulse sequences were employed in the analysis. Feature selection was executed using GBM-RFE, RF-RFE, SVM-RFE, and XGB-RFE. To ensure that the number of features remained less than the number of samples (57), the number of features after reduction was set to 30. The features obtained via RFE were employed as explanatory variables, while the binary classification values of the contact angle were employed as target variables. RF classifiers were used to construct the classification models. The data were split into training and test datasets using the holdout method. Hyperparameters were determined via cross-validation by applying GridSearchCV to the training data. Prediction performance was assessed using four metrics: accuracy, precision, recall, and F1-score. The flow of data analysis is depicted in Figure S2, and the RF hyperparameters are shown in Table S4.
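The workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the authors' code: scikit-learn's GradientBoostingClassifier stands in for LightGBM, and the feature count, split ratio, hyperparameter grid, and the choice to fit the scaler on the training data only are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 57-sample descriptor + TD-NMR matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(57, 60))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Holdout split (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Autoscaling, fitted on the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Importance-based RFE down to 30 features with a gradient boosting model.
selector = RFE(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    n_features_to_select=30,
    step=1,
).fit(X_train_s, y_train)

# RF classifier tuned by cross-validated grid search on the training data.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
).fit(selector.transform(X_train_s), y_train)

y_pred = search.predict(selector.transform(X_test_s))
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
print(scores)
```

Swapping the estimator passed to RFE (for example, a random forest or a linear-kernel SVM) would give the other RFE variants in this sketch while leaving the rest of the pipeline unchanged.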

Surface Coating
Surface coatings were applied by the photoinitiated copolymerization of acrylamide monomers on PET films. Both ionic (FOM-3010) and nonionic (NTBA) monomers were utilized to alter the surface properties. Cross-linkers with varying numbers of vinyl groups were employed to stabilize the coating and regulate film dynamics. Upon exposure to UV light, the initially flowable liquid transformed into a cured solid with high viscosity. As a result, polyacrylamide was coated onto the PET films through the copolymerization of the monomers.

TD-NMR
Changes in chain dynamics due to surface modifications were assessed through TD-NMR measurements. Figure S3 displays the TD-NMR relaxation curves, which are correlated with the surface modification conditions, acquired for various pulse sequences. Noticeable distinctions in the relaxation curves were evident for the CPMG, MSE, and SE sequences. Specifically, CPMG and MAPE [40] appeared suitable for mobile components, suggesting their effectiveness in detecting components characterized by long relaxation times on the surface.
The impact of the presence or absence of cross-linkers on the relaxation curves was confirmed, as depicted in Figure 2. The evaluation of samples coated with NTBA using the CPMG sequence revealed a gradual attenuation in the intensity of T2 relaxation. Conversely, the relaxation curve obtained using the MSE sequence exhibited a low intensity with minimal changes. However, for samples with cross-linkers, the CPMG relaxation curves exhibited minimal alterations, with sharper relaxations observed in the MSE sequences. These trends were accentuated with an increase in the number of vinyl groups. Polyacrylamide prepared with NTBA featuring a single vinyl group exhibited linear polymer chains, while FOM-3006, FOM-3007, FOM-3008, and FOM-3009, which possess multiple vinyl groups, displayed cross-linked or networked structures with reduced mobility. Thus, these findings underscore a disparity in the TD-NMR relaxation curves, stemming from the chain mobilities of polyacrylamide on the surface.

Contact Angle
The properties of the modified surfaces were evaluated through contact angle measurements. Figure 3 illustrates a histogram of the contact angles of the sample films. The contact angles of the modified surfaces exhibited a broad range of values, spanning from 5° to 80°. To determine the threshold for binary classification, data analysis was performed according to the approach indicated in Section 2.5, and the best results were obtained at 40°. Therefore, 40° was set as the criterion for binary classification. The performance data for thresholds of 25°, 30°, and 35° are shown in Figure S4.

RFE
Feature selection was performed using the importance-based RFE method, and the importance was evaluated using the GBM, RF, SVM, and XGB classifiers. When a classifier model is trained on a training dataset, feature weights that reflect the importance of each feature are obtained. After all the features are ranked according to their weights, the feature with the lowest weight is removed. The classifier is then retrained with the remaining features, and the procedure is repeated until the target number of features remains. In this way, the model-based RFE method can identify important features and show good performance [20]. The classification models were constructed by RF classifiers using the selected features. With feature reduction using GBM-RFE, RF-RFE, SVM-RFE, and XGB-RFE, the accuracy, precision, recall, and F-score of all models were higher than those obtained without RFE. GBM-RFE showed the highest accuracy and F-score, GBM-RFE and XGB-RFE had the highest precision, and RF-RFE had the highest recall. In the receiver operating characteristic (ROC) analysis, GBM-RFE had the highest area under the curve (AUC) value (Figure 4). Among the RDKit descriptors, fr_unbrch_alkane was the top selected feature in all RFE models. For the TD-NMR data, several time points of the CPMG, DQ, MSE, MAPE, and SE sequences were selected, but most of them were located after the intermediate region, where the slope of the relaxation curve becomes gentle (Figure 6). The important factors extracted differed for each model (Figures 5 and 6).
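The elimination loop described above can be written out explicitly. The following sketch uses absolute least-squares weights as a simple stand-in for the model-derived importance (loosely analogous to the weights of a linear SVM); it is not one of the four models used in the study, and the data and function names are hypothetical.

```python
import numpy as np

def linear_importance(X, y):
    """Absolute least-squares weights, a stand-in for model feature weights."""
    weights, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    return np.abs(weights)

def rfe(X, y, n_keep, importance_fn=linear_importance):
    """Remove the least important feature and refit until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        importance = importance_fn(X[:, remaining], y)
        # Drop the column whose weight is lowest in the current fit.
        remaining.pop(int(np.argmin(importance)))
    return remaining

# Synthetic data in which only features 3 and 7 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(57, 10))
y = (2.0 * X[:, 3] - 1.5 * X[:, 7] > 0).astype(float)

selected = rfe(X, y, n_keep=2)
print("retained feature indices:", selected)
```

Because importance_fn is a parameter, the same loop could wrap any model that exposes per-feature weights or importances, which is why different models may retain different features.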

Discussion
Among the four RFE methods, GBM-RFE exhibited the best feature-selection performance. LightGBM builds decision trees and applies gradient boosting to enhance accuracy. This boosting technique improves the predictive performance by learning from the errors between predicted and actual values, focusing particularly on data that could not be accurately predicted initially. LightGBM grows each tree leaf by leaf, a strategy called leaf-wise growth [43].
Although XGB is also based on gradient boosting, it grows the branches of each decision tree layer by layer, a strategy called level-wise growth [43]. SVM-RFE, a wrapper method [44], was employed for feature selection in this study using a linear form [45]. However, the linear approach might not have effectively classified the current dataset, which comprised RDKit and five pulse sequence data with diverse characteristics. RF, a bagging method based on decision trees [46], demonstrated the second-best performance among the models. Its strength lies in amalgamating multiple decision trees into an ensemble, which potentially contributes to its effective analysis ability. Although both GBM and XGB are gradient boosting methods, their tree-growth approaches differ: GBM utilizes the leaf-wise method, focusing on improving accuracy while learning the errors for each leaf. Interestingly, GBM-RFE proved suitable for our dataset with its varying characteristics. Notably, the selection and importance of the features depended on the RFE method employed in the analysis. This dependency underscores the significance of the chosen methodology in determining feature relevance and importance.
TD-NMR measurements offer a wide dynamic range, spanning from sub-microseconds (µs) to seconds, allowing for the extraction of various types of information across different time scales [46]. Mobility within meso-regions, such as domain fluctuations, was observed in the µs range [40,47,48], while fluctuations in chain ends or the mobility of unfrozen and bound water were detected at the ms level [31]. The CPMG pulse sequence enables the measurement of long T2 relaxation times, allowing the analysis of the long (liquid-like) components of polymers [28,49-51]. DQ, MSE, and SE capture short T2 relaxation times and are thus suitable for the analysis of the rigid (solid-like) components of polymers [52-54]. The data after the middle region of each relaxation curve include information related to highly mobile components. The monomers used in this study contained acrylamide groups, and the monomer FOM-3010 features a betaine structure, which is a zwitterionic group. The C=O and N-H of the acrylamide group [55] (betaine structures [56]) interact with the water surrounding the polymer, forming a hydration layer that inhibits foulant adhesion. In our case, multiple pulse sequences of CPMG, DQ, MSE, and SE were used to detect data from relaxation curves; in addition, various mobilities might be involved in the measured contact angle and hydration properties.
Among the several molecular descriptors of physicochemical features, fr_unbrch_alkane (number of unbranched alkanes of at least four members, excluding halogenated alkanes) [57], which represented the proportion of unbranched alkanes, was selected in most of the RFE methods. The presence of branched alkanes in molecular structures is crucial for predicting crystallization owing to their disruptive effect on molecular packing, which destabilizes the liquid crystal phase [58]. This characteristic suggests that hydrophilic monomers may form structures conducive to expressing hydrophilicity by orderly bonding among themselves. Furthermore, molecular fluctuations occur more easily in a linear structure than in a branched structure. In the TD-NMR data, time points in the intermediate or late region of the relaxation curves, which reflect high molecular mobility, were selected as important factors; thus, we can infer that this molecular mobility (molecular fluctuation) is involved in the expression of hydrophilicity.
As discussed earlier, our current methodology offers the unique advantage of simultaneously providing the essential components from both molecular descriptors and TD-NMR T2 relaxation curves. These components respectively represent physicochemical and dynamic properties. While understanding both properties is crucial for designing superior surface modifications, the challenge lies in the limited human capacity to manage a vast array of molecular descriptors and numerous NMR curves obtained through various pulse sequences. An additional noteworthy aspect is our utilization of diverse types of RFE methods. The importance assigned to each feature varies among ML algorithms, and relying on a single set of criteria can lead to misunderstandings due to factors such as noise or spurious correlation. Therefore, our approach is particularly well-suited for a comprehensive and multifaceted examination of material data. Until recently, expensive contact angle meters were extensively used to measure contact angles. In recent years, however, simple and inexpensive methods using smartphones and ML have been developed [59,60]. These new ML-based methods enable the evaluation of the process of creating hydrophilic/hydrophobic polymer coating materials and aid in efficiently managing their manufacturing process. Therefore, incorporating the proposed ML-based method into future studies will be useful.

Conclusions
In our study, we showcased feature-selection techniques for molecular descriptors assessed via RDKit and relaxation curves obtained from TD-NMR, employing RFE for surface modifications. Our surface modifications involved copolymerizing various combinations of acrylamide monomers on PET films. To evaluate polymer chain dynamics across a broad time range, we utilized TD-NMR measurements with multiple pulse sequences, standardizing the data obtained from these five sequences. Applying GBM-RFE, RF-RFE, SVM-RFE, and XGB-RFE treatments to these descriptors significantly improved the predictability of the RF classifications. Moreover, our findings highlighted the crucial roles played by both the physicochemical properties and dynamics of polymer chains in determining the surface properties. The RFE method not only enhanced the predictability but also allowed us to extract critical factors or time regions from both physicochemical and TD-NMR data. This deeper insight into underlying mechanisms underscores the versatility

Figure 1. Conceptual framework of the study, illustrating the pivotal role of molecular descriptors and molecular dynamics in hydrophilicity development. The study encompassed calculations and measurements of these elements. The dataset, comprising RDKit, TD-NMR, contact angle data, and manufacturing process details, underwent feature selection via RFE. Preprocessed data were then utilized by an RF classifier to identify crucial factors for building predictive models of hydrophilicity and investigating the underlying principles governing this trait.

Figure 3. Contact angle histogram, with each bin representing a 5° interval ranging from 10° to 80°. The orange line represents 40°, which is the criterion for binary classification.

Figure 4. Performance of each model. (a) Accuracy, precision, recall, and F-score of each RFE; (b) ROC and AUC score of each RFE.

Figure 6. Important features of relaxation time in each pulse sequence. The extracted important features (blue line) for each relaxation time are shown.