A Novel Machine Learning Model to Predict the Photo-Degradation Performance of Different Photocatalysts on a Variety of Water Contaminants

Abstract: This paper describes an innovative machine learning (ML) model to predict the performance of different metal oxide photocatalysts on a wide range of contaminants. The structures of the metal oxide photocatalysts are encoded with a crystal graph convolutional neural network (CGCNN). The structures of the organic compounds are encoded via digital molecular fingerprints (MF). The encoded features of the photocatalysts and contaminants are input to an artificial neural network (ANN); the combined model is named CGCNN-MF-ANN. The CGCNN-MF-ANN model achieves good predictions of the photocatalytic degradation rate constants of different photocatalysts over a wide range of organic contaminants. The effects of the data training strategy on the ML model performance are compared. The effects of different factors on photocatalytic degradation performance are further evaluated by feature importance analyses. Examples illustrate the use of this novel ML model for optimal photocatalyst selection and for assessing other types of photocatalysts for different environmental applications.


Introduction
Water pollution associated with the increasing amount of human and industrial activities has become an emerging environmental issue that threatens the health of people and animals [1]. Organic chemicals, such as pesticides, herbicides, and polycyclic aromatic hydrocarbons (PAHs), are major types of pollutants present in wastewater [2]. Catalysts, including the semiconducting oxide photocatalysts used in practice [4][5][6], are important supplements to conventional biological treatment [3] for removing organic water contaminants effectively and efficiently. The photocatalyst-assisted contaminant removal process is sustainable and environmentally friendly for wastewater treatment [7].
In the past decades, tremendous efforts have been devoted to developing photocatalysts and evaluating their performance in municipal water treatment operations [8][9][10][11]. However, it is challenging to quantify the efficiency of photocatalysts for a range of waterborne contaminants. The photo-degradation performance depends on the properties of the photocatalyst, including the crystalline structure, the size and shape of the grains, the specific surface area, the pore structure, etc. [12,13]. In addition, the experimental conditions, such as photocatalyst dosage, medium pH, contaminant concentration, and light wavelength and intensity, also affect the photocatalytic activity [14,15]. A full factorial experimental design to optimize photocatalyst performance over multiple variables requires significant time and cost, if it is feasible at all. The feasibility of the conventional experimental approach is further compromised by the wide range of waterborne contaminants.
Metal-oxide semiconductor photocatalysts are capable of degrading organic compounds in contaminated water. Assessing their performance with the conventional experimental approach requires tremendous effort and investment, particularly in light of the complex structure of photocatalysts and the wide range of contaminants. Recent progress in machine learning (ML) enables a data-driven approach that allows much more efficient investigation and prediction of the performance of different photocatalysts. ML models can make full use of the experimental data in the published literature and can generate results that guide subsequent experimental designs. This saves significant time and labor compared with the conventional experimental approach.
Data-driven ML is emerging as a new solution for photocatalyst performance assessment. The ML approach is faster, cheaper, and more flexible than experiments. An artificial neural network (ANN) is an ML model that has been widely used to predict the properties of materials, ranging from polymers, metals, and ceramics to composite materials [16][17][18][19][20][21]. It has also been explored to accelerate the discovery and design of novel photocatalysts [22,23] and to predict the photocatalytic performance of a photocatalyst [24][25][26][27][28]. However, the scope of these models is limited to organic contaminants with a similar chemical structure, since they only consider a limited number of contaminants and a single photocatalyst [29]. The type of photocatalyst, which is a major factor affecting the photo-degradation of water contaminants, is not included. A challenge that prevents the inclusion of a comprehensive set of experimental variables is the transformation of non-numerical variables into a machine-readable form.
This work introduces an innovative ML model, CGCNN-MF-ANN, applicable to a variety of photocatalysts and contaminants. Data from published research were collected to build a photocatalysis database that includes the experimental variables. The features of common semiconductor photocatalysts were extracted with a Crystal Graph Convolutional Neural Network (CGCNN). The features of contaminants were represented with a digital molecular fingerprint (MF). The features of photocatalysts and contaminants, together with the experimental conditions, were inputs to an optimized artificial neural network (ANN). The CGCNN-MF-ANN model achieved satisfactory and consistent performance by learning the connections between the experimental variables (the types of photocatalysts, contaminants, and experimental conditions) and the photocatalytic activities. As a generalized model, it can predict the performance of new photocatalysts and select the best photocatalyst for the degradation of a range of contaminants.

Results of ML Model Prediction
The CGCNN-MF-ANN ML model with optimal hyperparameters was trained with a three-fold cross-validation method. Three-fold cross-validation is a re-sampling procedure that can reduce the bias in evaluating the model prediction. With this method, the complete dataset was randomly split into three subgroups; two of the subgroups were used for model training and the remaining one was used for testing. This process was repeated three times so that each subgroup was used once as the testing data.
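The splitting procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name `three_fold_splits` and the fixed random seed are assumptions made for the example, while the dataset size of 449 is the number of data points reported for the database.

```python
import random


def three_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for three-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random assignment to subgroups
    folds = [indices[i::3] for i in range(3)]   # three nearly equal subgroups
    for k in range(3):
        test_idx = folds[k]                     # one subgroup held out for testing
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


# 449 is the number of data points in the collected database.
splits = list(three_fold_splits(449))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each data point appears in exactly one testing subgroup, so the pooled test predictions cover the whole dataset once.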
The scatter plot in Figure 1 summarizes the CGCNN-MF-ANN model predictions versus the experimentally measured photocatalytic performance. A perfect prediction would lie along the 1:1 line. Overall, the rate constants predicted by the ML model followed a trend consistent with the experimental results. The coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the ML prediction performance. The overall R² of the ML prediction versus the measured results was 0.746, which was a promising performance given the complex factors involved and the amount of data used for model training. The evaluation scores of the ML model on the three testing subgroups and on the overall dataset are listed in Table 1. There were only small variations in the evaluation scores among the subgroups, which indicated that the CGCNN-MF-ANN model achieved consistent and reliable predictions.
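For reference, the three evaluation scores can be computed as in the sketch below. The function name `regression_scores` and the numerical values are hypothetical and serve only to illustrate the standard definitions; they are not data from this study.

```python
import math


def regression_scores(y_true, y_pred):
    """Return R2, MAE, and RMSE for paired measured and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}


# Hypothetical -log(k) values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = regression_scores(measured, predicted)
print(scores)
```

With this definition, R² becomes negative whenever the residual sum of squares exceeds the total sum of squares, that is, when the predictions are worse than simply predicting the mean of the measured values.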

Performance of ML Model for Different Photocatalysts
Further analyses were conducted to investigate how well the CGCNN-MF-ANN ML model predicts photocatalytic degradation by different photocatalysts. Two groups of analyses were conducted for this purpose. The first aimed to investigate the generality of the ML model in predicting photocatalytic activities for different photocatalysts. All data collected for the different photocatalysts were used for training and testing of the ML model via the three-fold cross-validation method described above. The second group of analyses aimed to determine whether there are benefits to individualized training of the ML model with data for a specific photocatalyst only. For this purpose, the data were divided into subsets according to photocatalyst; each subset was then used to train and validate the CGCNN-MF-ANN ML model for that type of photocatalyst. Figure 2 shows the scatter plots of the prediction results grouped by photocatalyst, with the CGCNN-MF-ANN ML model trained and validated using all the data. Figure 3 shows the results for the different photocatalysts with the CGCNN-MF-ANN ML model trained separately on the data for each photocatalyst. Table 2 summarizes the prediction performance for the different photocatalysts under these two training and testing procedures. For the ML model trained with all data (Figure 2), the overall performance metrics, such as MAE and RMSE (Table 1), lay between the values obtained for the individual photocatalysts (Table 2). This was consistent with expectation. Additionally, as seen from Table 2, for an individual photocatalyst, the ML model trained with all data achieved a better prediction performance than the model trained separately with only the data for that photocatalyst.
This would be counter-intuitive from the perspective of a physics-based model, in which including data with no direct relevance (i.e., using all data) tends to produce a larger error than using only the relevant data (i.e., photocatalyst-specific data). Two reasons might explain the better performance of the ML model trained with more data. Firstly, the amount of data for ML model training was significantly reduced when the data were split by photocatalyst for individualized training. Secondly, fewer types of organic contaminants were included in the training dataset for an individual photocatalyst, which means the diversity of the organic contaminants was reduced. This observation demonstrates the essentials of a data-driven approach, i.e., the amount as well as the diversity of the data. More training data usually lead to more accurate ML models because the data are more diverse. In general, ML models can learn more patterns and relationships from a more comprehensive dataset, even when the additional data would be regarded as 'noise' in the conventional sense. The observation in Table 2 confirms the effects of data on the ML model performance.

Model Interpretability via Feature Importance
The previous results indicated that the CGCNN-MF-ANN ML model achieved decent performance in predicting the photocatalytic degradation rate constants of different photocatalysts over a wide range of contaminants. Compared with conventional physics-based models, data-driven ML models are generally limited in interpretability. To interpret the ML results, feature importance was analyzed by calculating the SHapley Additive exPlanations (SHAP) value of each variable [30]. The SHAP value assesses the impact of a feature by comparing predictions made with and without that feature. The mean SHAP values of the seven experimental variables are shown in Figure 4a, and the SHAP values of individual data points are shown in Figure 4b. From Figure 4a, among the seven experimental variables, the type of water contaminant was the most important factor for the photo-degradation rate constant, with its SHAP value accounting for more than 50% of the total SHAP value. This indicated that, for a given photocatalyst, the capability to degrade organic contaminants could vary significantly with the type of contaminant. Therefore, for wastewater treatment applications, it is suggested that the major contaminant types be analyzed before selecting the most effective photocatalyst and treatment conditions. Moreover, the SHAP values showed that the type of photocatalyst and its particle size also had a relatively high impact on the photo-degradation performance, while the initial concentration of the contaminants and the pH had less influence.
For example, for the photocatalyst particle size in Figure 4b, large photocatalysts (indicated with red points) were clustered on the right side, far from the centerline, while small photocatalysts (indicated with blue points) were clustered on the left side, far from the centerline. This meant the photocatalyst size had a strong positive effect on the model output, −log(k); because of the negative sign in −log(k), photocatalyst size was negatively related to the predicted photocatalytic reaction rate constant. That is, a photocatalyst with a smaller size would result in a higher photo-degradation rate constant. This was consistent with experimental evidence.
Following a similar assessment, photocatalyst dosage had a strong negative effect on the model output, −log(k). This indicated that the photo-degradation rate constant predicted by the ML model increased with increasing photocatalyst dosage. The initial concentration did not show an obvious trend, since its data points were widely distributed on both sides of the centerline. The pH had a positive effect on the model output, −log(k), which implied that a lower pH could result in a higher photo-degradation rate constant. However, most of the pH data points were concentrated near the centerline, which implied that the impact of pH on the ML model output was small. Since the light property was converted to categorical data with three categories (solar light, visible light, UV light), the impact of each light type on the ML model output was analyzed separately. For each light type, the red color code indicated that this type of light was used in the experiment, while the blue color code indicated that it was not used. The SHAP values for solar light clustered to the right of the centerline (a negative effect on the reaction rate constant k). The SHAP values for UV light clustered to the left of the centerline (a positive effect on the reaction rate constant k). The SHAP values for regular visible light were scattered on both sides of the centerline. From these observations, the effects of the different light types on improving the photo-degradation rate constant followed the sequence solar light (the whole spectrum) < visible light (400-700 nm nominal range) < UV light.
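The "with and without the feature" idea behind SHAP values can be illustrated with an exact Shapley value calculation on a toy model. This is only a sketch under stated assumptions: the function `shapley_values`, the linear toy model, and the single all-zero reference input are hypothetical, and the SHAP method cited above estimates these quantities for the trained network rather than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of this subset in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Features outside the subset are replaced by baseline values.
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(v):
    """Hypothetical linear stand-in for the model output, -log(k)."""
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

A positive value means the feature pushes the output, −log(k), upward relative to the baseline (a lower rate constant), and a negative value pushes it downward; the values sum to the difference between the prediction for `x` and the prediction for the baseline.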

Performance of CGCNN-MF-ANN ML Model for Different Types of Contaminants
To further interpret the performance of the ML model, we also analyzed its performance for different types of contaminants. In total, 45 types of water contaminants were included in the dataset used to train the ML model and were labeled from 1 to 45. Figure 5 summarizes the MAE for each type of contaminant. Most of the predictions were reasonably accurate (with MAE values below 0.5), except for two contaminants, 2-chlorophenol and 2-nitrophenol. After carefully revisiting the data, it was found that there were only two data points involving 2-chlorophenol and only one data point involving 2-nitrophenol. The lack of data might be the reason for the larger prediction errors. To examine this assumption, we also investigated the contaminants with more than 20 data points each. The statistics of the ML model prediction results for these groups are given in Table 3. The ML model achieved decent prediction performance on all seven of these contaminants, as indicated by the small MAE and RMSE values. These observations supported the view that including a sufficient amount of data for a contaminant in model training was crucial for the ML model to achieve good prediction accuracy on that contaminant.
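A per-contaminant error summary of this kind, reporting the MAE together with the number of data points behind it, can be sketched as follows. The helper `mae_by_group` and all numerical values are hypothetical placeholders, not results from this study.

```python
from collections import defaultdict


def mae_by_group(groups, y_true, y_pred):
    """Return {group: (MAE, number of data points)} for grouped predictions."""
    errors = defaultdict(list)
    for group, t, p in zip(groups, y_true, y_pred):
        errors[group].append(abs(t - p))
    return {g: (sum(e) / len(e), len(e)) for g, e in errors.items()}


# Placeholder records: contaminant name, measured and predicted -log(k).
contaminants = ["methylene blue", "methylene blue", "rhodamine B", "2-nitrophenol"]
measured = [2.0, 2.4, 1.8, 3.0]
predicted = [2.1, 2.2, 1.9, 3.9]

summary = mae_by_group(contaminants, measured, predicted)
for name, (mae, count) in sorted(summary.items()):
    print(f"{name}: MAE={mae:.2f} from {count} data point(s)")
```

Reporting the count next to each MAE makes it easy to see when a large error coincides with a contaminant that has very few data points.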

Application of the CGCNN-MF-ANN ML Model in Selecting the Best Photocatalyst for Contaminant Removal
With its capability to predict the performance of different photocatalysts over a range of contaminants, an important application of the CGCNN-MF-ANN model is to select the optimal photocatalyst for the removal of a certain group of contaminants, such as in wastewater treatment. As an example, the photo-degradation rate constants of different photocatalysts on two contaminants, i.e., methylene blue and rhodamine B, were predicted. The other experimental variables were set to typical values (the average values in the training data for that contaminant). Figure 6 shows the ML-predicted −log(k) for these two contaminants with the different photocatalysts. A smaller predicted value indicated a higher rate constant because of the negative sign in −log(k). The overall predicted −log(k) was higher for methylene blue than for rhodamine B, which implied that methylene blue was more difficult to decompose than rhodamine B under the specified experimental conditions. For methylene blue, the efficiency of the six types of photocatalysts followed the sequence Fe2O3 < WO3 < SnO2 < β-MnO2 < TiO2 < ZnO, while for rhodamine B, the efficiency of the photocatalysts ranked as WO3 < SnO2 < TiO2 < ZnO < Fe2O3 < β-MnO2. Therefore, ZnO and β-MnO2 were the most efficient photocatalysts for the removal of methylene blue and rhodamine B, respectively. When both contaminants must be degraded, the ZnO photocatalyst appeared to be the best option.
Another example is given on the use of the CGCNN-MF-ANN model to select appropriate photocatalysts for the fast degradation of a combination of contaminants. The seven contaminants with the most training data were selected, i.e., methylene blue, rhodamine B, rose Bengal, toluidine blue, azure B, carmine indigo, and phenoxyacetic acid. The other input variables for the ML model were set to the average values of the overall training dataset. With these inputs, the photo-degradation rate constants of each contaminant with the different photocatalysts were predicted with the CGCNN-MF-ANN model. The results were assembled to determine the average photo-degradation rates and their ranges. Figure 7 shows the averages and ranges of the predicted −log(k) of the seven contaminants degraded by each type of photocatalyst. Overall, the efficiency of the six types of photocatalysts in degrading the combination of these seven contaminants followed the sequence WO3 < Fe2O3 < SnO2 < β-MnO2 < TiO2 < ZnO. Among these, ZnO achieved the best predicted photocatalytic reaction rates and was the best candidate for the removal of the combination of these seven contaminants.
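The selection step can be sketched as a simple ranking of model predictions. In this illustration the function `rank_photocatalysts`, the catalyst and contaminant labels, and the numbers are all placeholders; in practice the dictionary would be filled with −log(k) values predicted by the trained model.

```python
def rank_photocatalysts(predictions):
    """Rank catalysts by mean predicted -log(k); lower means a higher rate constant.

    predictions: {catalyst: {contaminant: predicted -log(k)}}
    Returns a list of (catalyst, mean, minimum, maximum), best catalyst first.
    """
    summary = []
    for catalyst, by_contaminant in predictions.items():
        values = list(by_contaminant.values())
        summary.append((catalyst, sum(values) / len(values), min(values), max(values)))
    return sorted(summary, key=lambda row: row[1])


# Placeholder predictions, not values from this study.
predictions = {
    "catalyst A": {"dye 1": 1.8, "dye 2": 2.2},
    "catalyst B": {"dye 1": 1.5, "dye 2": 1.9},
    "catalyst C": {"dye 1": 2.4, "dye 2": 2.0},
}

ranking = rank_photocatalysts(predictions)
best = ranking[0][0]
for catalyst, mean, low, high in ranking:
    print(f"{catalyst}: mean={mean:.2f}, range=[{low:.2f}, {high:.2f}]")
```

Ranking by the mean treats all contaminants as equally important; a weighted mean could be substituted if some contaminants dominate the waste stream.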

Predicting the Performance of Other Photocatalysts
Analyses were conducted to further assess the generality of the pre-trained CGCNN-MF-ANN model on other types of photocatalysts that were not included in model training.
The pre-trained model was used to predict the performance of another photocatalyst, tetragonal Mn3O4. Twenty-five additional data points for tetragonal Mn3O4 were collected. Two alternative predictions were compared, i.e., (1) predictions with the CGCNN-MF-ANN model pre-trained only with data from the six other types of photocatalysts, and (2) predictions with the CGCNN-MF-ANN model re-trained with all data, including those for tetragonal Mn3O4. Figure 8 shows the scatter plots of the predicted vs. experimental −log(k) for Mn3O4 using these two methods (i.e., Figure 8a for method 1, Figure 8b for method 2). The re-trained model outperformed the pre-trained CGCNN-MF-ANN model, with R² improved from −2.259 to 0.572, MAE reduced from 0.597 to 0.199, and RMSE reduced from 0.73 to 0.265. The findings showed that the ML model could be extended to other photocatalysts by incorporating additional training data. This observation pointed to strategies for improving the reliability and generality of the ML model in predicting the performance of a wide range of photocatalysts.

Data Collection, Preparation, and Encoding
A database was created through an extensive collection of experimental data from the published literature; it includes 449 data points and is listed in Table S1 of the Supplementary Material. The typical experimental procedure to measure the photocatalytic degradation rate started with preparing water with the designed contaminant concentration and adjusting the pH value of the solution. A certain amount of photocatalyst was then added and mixed. The suspension was placed in a dark environment for a period of time to reach equilibrium between adsorption and desorption. This step excluded adsorption effects from the measured photocatalytic degradation performance. After that, the suspension was stirred and irradiated with light. At given time intervals, a small portion of the suspension was extracted and filtered, and the residual contaminant concentration was measured by methods such as UV-vis spectrometry. From the measured contaminant concentration versus time, the photocatalytic degradation rate constant was obtained.
A variety of experimental variables affected the photocatalytic degradation performance. These could be classified into three major categories: factors related to the photocatalyst (type, crystalline structure, size, dosage, etc.), the type of organic contaminant, and the type of light used to activate the photocatalytic reaction. A brief summary of the data collected follows:


• Six common types of photocatalysts were included in this study, i.e., wurtzite ZnO, rutile SnO2, rhombohedral Fe2O3, anatase TiO2, monoclinic WO3, and tetragonal β-MnO2.
• Forty-five different organic compounds, with the name of each compound, its initial concentration, and the pH value if available.
• The properties of light, including a range of wavelengths and intensities. Seventy percent of the light intensity data was missing in the published papers, and only the range of light wavelength was provided. Therefore, only the wavelength of light was used in the ML model.
Other experimental conditions, such as the temperature, could also affect the photocatalytic performance. However, since most experiments were conducted close to room temperature, temperature was not included as an experimental variable for the ML model.
Based on data completeness, seven experimental variables were selected as inputs to the ML model, i.e., the photocatalyst type, photocatalyst particle size, photocatalyst dosage, organic compound, initial concentration of the organic contaminant, pH of the solution, and light property category. The output was the photocatalytic degradation rate constant (k, min⁻¹), which was converted to the negative base-10 logarithm, −log(k), because the rate constants span several orders of magnitude. Among the seven model inputs, the photocatalyst dosage, the initial contaminant concentration, and the pH were quantitative continuous data. The particle sizes typically covered a wide range and were converted to categorical data with three levels: particles < 100 nm were labeled as 1; particles between 100 nm and 1 µm were labeled as 2; particles > 1 µm were labeled as 3. The type of light was also represented as categorical data with three levels: UV light with a wavelength less than 400 nm was labeled as 1, visible light with a wavelength between 400 nm and 700 nm was labeled as 2, and full-spectrum light, including sunlight, was labeled as 3.
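The conversions described above can be sketched as small helper functions. The function names, the string keys for the light types, and the treatment of particles that are exactly 100 nm or 1 µm in size are assumptions made for this illustration; the text does not specify how boundary values were assigned.

```python
import math

# Category codes for the light type as described in the text.
LIGHT_CODES = {"uv": 1, "visible": 2, "full_spectrum": 3}


def encode_rate_constant(k_per_min):
    """Convert a rate constant k (1/min) to the model output -log10(k)."""
    return -math.log10(k_per_min)


def encode_particle_size(size_nm):
    """Map a particle size in nm to the three-level size category."""
    if size_nm < 100:
        return 1
    if size_nm <= 1000:  # assumption: boundary values fall in the middle category
        return 2
    return 3


def encode_light(light_type):
    """Map a light type label to its category code."""
    return LIGHT_CODES[light_type]


print(encode_rate_constant(0.01), encode_particle_size(50), encode_light("uv"))
```

Because of the negative sign, a larger encoded output corresponds to a smaller rate constant, which is why lower predicted −log(k) values indicate faster degradation.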
The photocatalysts and organic contaminants are non-numerical variables, which need to be encoded, i.e., converted to a digital representation. Photocatalysts can be either crystalline or amorphous. In this study, only crystalline photocatalysts were included, with data extracted from the literature and analyzed. The crystal graph convolutional neural network (CGCNN) algorithm was used to encode the crystalline materials and extract important features [31]. The CGCNN model preserves the essential information of crystalline materials (i.e., atoms and the bonds between atoms) in a crystal graph. It is capable of representing the crystal structures of inorganic materials and has been successfully applied to predict intrinsic material properties, such as the formation energy and band gap [31,32].
The organic contaminant compounds were encoded with molecular fingerprints (MF), which converted the organic compounds into a bit string. Molecular fingerprints were originally created for the structural similarity search of small molecules [33]. It stores the atomic and structural information of molecules in a binary digit vector, where "1" represents presence and "0" represents the absence of a particular substructure. It has shown the potentials to encode organic materials for machine learning models [34][35][36][37]. The advantages of MF representation include that the properties of small molecules can be predicted at high accuracy and with low computational time at the same time [36]. Besides, the length and radius of the molecular fingerprints are adjustable based on the needs of the ML model. Figure 9 shows the configuration of the machine learning (ML) model, referred to as the CGCNN-MF-ANN model. The ANN component of the model consisted of the input layer, hidden layers, and output layer. Seven experimental variables that capture the information of photocatalysts and organic contaminants were fed into the input layer. As discussed in the previous section, the photocatalysts and organic compounds were firstly encoded and converted by CGCNN and MF, respectively. After conversion, each of them was connected to a neuron in the input layer. Each of the other five variables (which are quantitative data or categorical data) occupied one input neuron. The output layer was the photocatalytic degradation rate constant in −log scale. Hidden layers of the artificial neural network provided connections between input and output layers with activation functions that were fine-tuned from the training data. The hyperparameters of the ANN (i.e., the number of the hidden layers and the number of neurons in each hidden layer) had significant effects on the model prediction accuracy. Bayesian optimization was used to optimize the hyperparameters of the model [38]. 
The optimized hyperparameters included the length and radius of the molecular fingerprints, the number of hidden layers, and the number of neurons in each hidden layer. The hyperparameters of the CGCNN followed typical settings and were not included in the hyperparameter optimization process. Bayesian optimization yielded an ANN with two hidden layers, with 512 neurons in the first layer and 256 neurons in the second layer. The optimal length and radius of the molecular fingerprints were 128 and 1, respectively. The input layer of the ANN model therefore had 517 neurons: 384 for the CGCNN representation of the photocatalyst crystal, 128 for the MF-encoded contaminant, and 5 for the other experimental factors (i.e., particle size, dosage, concentration, initial pH, and light).
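The input-layer arithmetic above (384 + 128 + 5 = 517) can be checked with a short sketch that assembles one input vector and counts the trainable parameters of the resulting network. The helper names, the placeholder feature values, and the numeric coding of the light variable are hypothetical. The parameter count also assumes fully connected layers with bias terms and a single output neuron, which the text does not state explicitly.

```python
CGCNN_DIM = 384  # photocatalyst representation length (from the text)
MF_DIM = 128     # optimized fingerprint length (from the text)
FACTOR_NAMES = ("particle_size", "dosage", "concentration", "initial_pH", "light")


def build_input(crystal_features, fingerprint_bits, factors):
    """Concatenate the encoded photocatalyst, contaminant, and five factors."""
    if len(crystal_features) != CGCNN_DIM:
        raise ValueError("expected %d crystal features" % CGCNN_DIM)
    if len(fingerprint_bits) != MF_DIM:
        raise ValueError("expected %d fingerprint bits" % MF_DIM)
    return (
        [float(x) for x in crystal_features]
        + [float(b) for b in fingerprint_bits]
        + [float(factors[name]) for name in FACTOR_NAMES]
    )


def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network (assumed architecture)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Placeholder values only; they are not taken from the dataset.
factors = {"particle_size": 25.0, "dosage": 1.0, "concentration": 10.0,
           "initial_pH": 7.0, "light": 1.0}
x = build_input([0.0] * CGCNN_DIM, [0] * MF_DIM, factors)

layer_sizes = [len(x), 512, 256, 1]  # input, two hidden layers, one output
print(len(x), dense_parameter_count(layer_sizes))
```

Under these assumptions the network has 396,801 trainable parameters, most of them in the connections between the 517 inputs and the first hidden layer.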

Conclusions
A novel machine learning model, CGCNN-MF-ANN, was developed to predict the performance of different metal oxide photocatalysts in degrading a wide range of contaminants. The structures and features of the photocatalysts were represented with a crystal graph convolutional neural network (CGCNN). The structures of the contaminants were encoded with molecular fingerprints (MF). The encoded information of the photocatalysts and contaminants was combined with experimental variables and fed into an artificial neural network (ANN) model. The hyperparameters of the ANN were optimized with Bayesian optimization. A dataset covering six types of photocatalysts and 45 types of organic contaminants was assembled and used for the training and validation of the CGCNN-MF-ANN model. The photo-degradation rate constants predicted by the ML model matched the experimental results reasonably well, with an R² of 0.746 and an RMSE of 0.293. The interpretability of the ML model was evaluated by analyzing the importance of the different variables to the model predictions through their SHAP values and distributions. The feature importance analyses revealed influences of the experimental variables on the ML model predictions that were consistent with experimental observations. Examples were given to demonstrate the application of the CGCNN-MF-ANN model to the selection of optimal catalysts for contaminant removal. The pre-trained ML model was extended to predict the performance of other photocatalysts, and a re-training strategy was proposed to improve the generality of the model.