Exploring Deep Learning for Metalloporphyrins: Databases, Molecular Representations, and Model Architectures

Metalloporphyrins have been studied as biomimetic catalysts for more than 120 years, and the large amount of data accumulated over that time provides a solid foundation for deep learning to discover chemical trends and structure–function relationships. In this study, key components of deep learning of metalloporphyrins, including databases, molecular representations, and model architectures, were systematically investigated. A protocol to construct canonical SMILES for metalloporphyrins was proposed, which was then used to represent the two-dimensional structures of over 10,000 metalloporphyrins in an existing computational database. Subsequently, several state-of-the-art chemical deep learning models, including graph neural network-based models and natural language processing-based models, were employed to predict the energy gaps of metalloporphyrins. Two models showed satisfactory predictive performance (R² ≈ 0.94) with canonical SMILES as the only source of structural information. In addition, an unsupervised visualization algorithm was used to interpret the molecular features learned by the deep learning models.


Introduction
Metalloporphyrins are coordination compounds of metal ions and porphyrins or porphyrin derivatives, derived from the core structure of cytochrome P450 enzymes [1][2][3][4]. As a recognized class of biomimetic catalysts, metalloporphyrins can catalyze different types of chemical reactions, mainly reduction and oxidation [1][5][6][7][8]. Since the earliest study in 1979 [9,10], experimental chemists have synthesized a variety of metalloporphyrins and explored their functions in catalysis [11][12][13][14][15]. Meanwhile, computational chemists have explored the mechanisms of metalloporphyrin-catalyzed reactions using quantum mechanical atomistic simulation methods, mostly density functional theory (DFT) [16][17][18]. While continued research using either computational or experimental approaches is necessary, it is also essential to develop a method that can learn the chemical trends and structure–function relationships of metalloporphyrins from the available data.
Recently, deep learning has emerged as an effective technique for predicting molecular properties such as reactivity, synthesizability, solubility, binding affinity, and biological activity [19][20][21][22][23][24][25]. Unlike quantum mechanical methods based on solving Schrödinger's equation, deep learning extracts features from a large amount of data, generalizes them, and then maps the learned features to the labels carried by the data. While quantum mechanical methods are good at explaining organic chemistry mechanisms on a case-by-case basis, deep learning models extract the overall trends and relationships from large amounts of data. In addition, the overall computational cost of deep learning is lower than that of DFT computations. Although it may take hours to days to train deep learning models, once the models are trained properly, they can make predictions in seconds.
Currently, there are two main challenges in using deep learning models to study metalloporphyrins. First, although recognized public small molecule databases such as PubChem [26], ChEMBL [27,28], DSSTox [29], MoleculeNet [30], and ZINC [31] have adopted the canonical SMILES as one of the representations of molecular 2D structure, there is no such general molecular representation for metalloporphyrins, which makes it more difficult to index, merge, read, and process the data. Furthermore, although one of our previous studies showed the possibility of using deep learning models to predict the properties of molecular complexes, such as solute-solvent pairs [32], metalloporphyrins have significantly larger structures than "drug-like" small molecule complexes and contain additional inorganic components (i.e., center metal ions), which may increase the difficulty for deep learning.
In this study, we first proposed a protocol for assembling the canonical SMILES for a recognized computational database of metalloporphyrins. Afterward, state-of-the-art deep learning models, including three graph neural network models (hereinafter referred to as molecular graph-based models) and two attention-based natural language processing models (hereinafter referred to as string-based models), were trained on this database and tested for energy gap (E gap) prediction. In addition, the molecular features extracted by these models were visualized using a big data-based visualization algorithm for better interpretability. The overall workflow of this study is shown in Figure 1.

Establishing Canonical SMILES for Porphyrins and Metalloporphyrins
To the best of our knowledge, no canonical SMILES rules have been established for porphyrins or metalloporphyrins prior to our study, probably due to their more complex structures compared to "drug-like" small molecules. Most small molecule databases, such as ZINC, PubChem, and ChEMBL, provide canonical SMILES representing the two-dimensional (2D) structure of molecules, while the Porphyrin Based Dyes Database (PBDD) provides the molecular formula (e.g., ZnC56H46N4O11), the short name of side groups (e.g., TMP), and the metal center (e.g., ZnP). The establishment of canonical SMILES for porphyrins and metalloporphyrins not only facilitates the use of existing databases (e.g., PBDD) for deep learning studies, but also encourages the entire research community to store existing and newly designed porphyrin and metalloporphyrin structures in a big-data format.
Therefore, we implemented a framework that allows the assembly of canonical SMILES for molecules in PBDD (Figure 2). First, translations from the short names of side groups and metal centers (e.g., FPh) to their corresponding SMILES fragments were established. Next, these SMILES fragments were concatenated to the SMILES of the porphyrin backbone in a predetermined order to produce the final canonical SMILES for the entire molecule (Figure 2, lower). We designed the concatenation following the pattern of a limited
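The two steps described above (lookup of SMILES fragments, then concatenation in a fixed order) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the lookup tables, the meaning assigned to each short name, the anchoring group, and in particular the `CORE` backbone template are hypothetical placeholders. The template is a schematic string with named slots, not a valid porphyrin ring SMILES.

```python
# Hypothetical lookup tables; entries are illustrative assumptions only.
SIDE_GROUP_SMILES = {
    "Ph": "c1ccccc1",       # phenyl
    "FPh": "c1ccc(F)cc1",   # assumed here to mean 4-fluorophenyl
}
ANCHOR_SMILES = {"COOH": "C(=O)O"}  # hypothetical anchoring group
METAL_SMILES = {"ZnP": "[Zn]"}      # metal center short name -> metal atom

# Schematic stand-in for the porphyrin backbone, NOT a valid ring SMILES.
BACKBONE_TEMPLATE = "{metal}.CORE({r1})({r2})({r3})({anchor})"


def assemble_smiles(metal, side_groups, anchor):
    """Join fragments in a fixed order: metal, three side groups, anchor."""
    if len(side_groups) != 3:
        raise ValueError("expected exactly three side groups")
    r1, r2, r3 = (SIDE_GROUP_SMILES[name] for name in side_groups)
    return BACKBONE_TEMPLATE.format(
        metal=METAL_SMILES[metal],
        r1=r1,
        r2=r2,
        r3=r3,
        anchor=ANCHOR_SMILES[anchor],
    )


print(assemble_smiles("ZnP", ["Ph", "FPh", "Ph"], "COOH"))
```

Because the slot order is fixed in the template, the same combination of short names always yields the same string, which is the property that makes the assembled SMILES usable as a database key.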

Comparing the Performance of Deep Learning Models
Using the SMILES representation of the molecular structure as a feature and E gap as a target, PBDD was used to train several deep learning models. The data were randomly split into a training set and a test set with a split ratio of 8:2. These models were trained with the default hyperparameters of their original architectures. For each model, the deviation between the predicted and computed E gap was visualized and presented as a scatter plot with a linear regression fit (e.g., Figure 3, left column). In addition, the overlap of the distributions of the predicted and the computed E gap was shown as a histogram (e.g., Figure 3, right column). To ensure the reproducibility of the models, the training and testing of the models were repeated 10 times, and the averaged results are provided in the Supplementary Materials Figure S4.
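The evaluation protocol described above (a random 8:2 split, repeated 10 times and averaged) might look like the following sketch. The function names and the use of the repeat index as a random seed are assumptions made for illustration; `train_and_score` is a hypothetical callback standing in for training a model and returning one score.

```python
import random


def split_80_20(records, seed):
    """Shuffle a copy of the records and return (train, test) in an 8:2 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def repeated_evaluation(records, train_and_score, n_repeats=10):
    """Average a score over n_repeats independent random splits."""
    scores = []
    for seed in range(n_repeats):
        train, test = split_80_20(records, seed)
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


# Toy usage: the "score" is just the test-set size, to show the plumbing.
data = list(range(100))
print(repeated_evaluation(data, lambda train, test: len(test)))
```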

Molecular Graph-Based Model Results
As shown in Figure 3 (left column), as the molecular graph-based models evolve from the earliest GCN to MPNN to D-MPNN, the prediction and generalization ability of these models for metalloporphyrins improves. D-MPNN shows an improvement compared to MPNN, with R² improving from 0.9316 to 0.9446 and RMSE and MAE decreasing from 0.1137 eV and 0.0949 eV to 0.1014 eV and 0.0872 eV, respectively. Meanwhile, the overlap between the measured data and the data predicted from GCN, MPNN, and D-MPNN showed a steady increase, reflecting their improved predictive power from a different perspective (Figure 3, right column).
Compared to GCN, MPNN has a modularized message passing stage, which makes the model construction more suitable for molecular graph algorithms [19] and possibly leads to its better performance than GCN in metalloporphyrin E gap prediction. On the other hand, GCN and MPNN pass atom-centric information, whereas D-MPNN passes information across the molecule centering on directed bonds [22], which may lead to predictions with higher R², lower error, and greater overlap with the true results.
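For reference, the three metrics used to compare the models can be computed as in the sketch below. It assumes that R² denotes the coefficient of determination; if the reported values are instead the squared correlation of the linear regression fit, the R² line would differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2 (coefficient of determination), RMSE, and MAE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Toy energy gaps in eV (made-up numbers, for illustration only).
print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```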

String-Based Model Results
Based on the regression plots (Figure 4, left column), the performance of the BERT model (R² = 0.9371, RMSE = 0.1117 eV, MAE = 0.0951 eV) is significantly better than that of the Transformer (R² = 0.7111, RMSE = 0.2344 eV, MAE = 0.1812 eV). At the same time, the overlap between the measured data and the BERT-predicted data is significantly larger than that of the Transformer.

Transfer Learning Results
Transfer learning strategies were used to further improve the performance of the string-based models. The transfer learning of the Transformer was implemented directly on the ChemBERTa [35] architecture. The model was first pretrained using the data collected from PubChem and ZINC15 and then fine-tuned with the PBDD database.
The best pre-trained model, named 'PubChem10M_SMILES_BPE_396_250', was selected from the ChemBERTa models (detailed comparison results are provided in the Supplementary Materials Figure S5). Compared to the model without transfer learning, the R² of the Transformer improved to 0.8010, and the RMSE and MAE were reduced to 0.1965 eV and 0.1524 eV, respectively (Figure 5).
The data used for pretraining BERT consisted of 1,000,000 molecules randomly selected from the ZINC15 database. Fine-tuning the pretrained BERT model with PBDD improved the R² to 0.9372 and decreased the RMSE and MAE to 0.1114 eV and 0.0919 eV, respectively (Figure 6). It is worth noting that the pretraining phase of BERT is unsupervised learning, i.e., the pretraining only extracts structural information from the input SMILES. Although the molecules of ZINC15 are small, metal-free organic molecules with structures significantly different from metalloporphyrins, the BERT model still learns features from these small molecules, which significantly improves its predictive power.
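To illustrate why this pretraining needs no property labels, the sketch below shows the masking step of a masked language model applied to a SMILES string: some tokens are hidden and the original tokens become the prediction targets. The tokenizer regex, the 15% mask rate, and the `[MASK]` token are assumptions for illustration, not the settings used in this study.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Return (masked_tokens, labels); labels keep the original token
    at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = tokenize("c1ccc(F)cc1")
print(mask_tokens(tokens, mask_rate=0.3, seed=1))
```

The training targets come entirely from the input string itself, which is why a corpus such as ZINC15 can be used without any computed or measured properties.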

Comparing the Computational Costs of Different Models
To understand the computational cost of the deep learning models in this study, the runtime of each model (including data reading, feature extraction, and model training) and the number of epochs needed to achieve the above performance were recorded on an NVIDIA GeForce RTX 3060 Lite Hash Rate platform. Table 1 shows that the models without transfer learning need fewer than 1000 s to complete the training, while fine-tuning pretrained models consumes even less time (210 s).

Mapping the Chemical Space of the Porphyrin Database under the D-MPNN Model and BERT Model
We used the TMAP algorithm and the Faerun visualization library [36] to visualize the chemical space of PBDD using the features extracted by the D-MPNN model (Figure 7a,b) and the BERT model (Figure 7c,d). These features are the high-dimensional vectors output by the feature extraction layers of each model. In the upper panels (a and c), the color encodes the value of the energy gap (red indicates a higher value and blue indicates a lower value), while in the lower panels (b and d) the color shows the type of central metal. Figure 7 shows that the clustering and trends observed on the TMAP of the features extracted by D-MPNN are more correlated with the energy gap, while the features extracted by BERT are more correlated with the structure of the molecules. This difference coincides with the structural difference between the two models: the feature extraction layers of BERT are trained mainly in the unsupervised training stage, which relies only on the structure of the molecules, while D-MPNN follows traditional supervised learning, where the weights of the feature extraction layers are adjusted according to the target.

Discussion
Databases, molecular representations, and model architectures are the three key components of deep learning for chemistry. The protocol presented in this study for assembling canonical SMILES of metalloporphyrins fills the gap between traditional metalloporphyrin databases such as PBDD (which do not have SMILES) and the state-of-the-art chemical deep learning models (which typically use SMILES as input). We encourage scientists in the metalloporphyrin research community to use canonical SMILES to represent the 2D structures of metalloporphyrins, not only for deep learning, but also for easier data indexing, searching, and curation.
Excitingly, both D-MPNN and BERT achieved satisfactory performance in predicting the E gap of metalloporphyrins with canonical SMILES as the only source of structural information. Although it is difficult to assess the results by comparing them with the performance of these models on small molecules, the distributions of the predicted and computed results (Figures 3 and 4) show clear overlap.
We also emphasize that none of the deep learning models tested in this study requires feature engineering (i.e., manual selection of the molecular features to be provided to the models). Moreover, these models read structural information directly from the SMILES without the need to compute molecular descriptors or fingerprints. This is in contrast to an earlier study by Li et al., which used traditional physicochemical descriptors as molecular representations and traditional machine learning algorithms (such as Lasso, kernel ridge regression (KRR), support vector machine (SVM), and feedforward artificial neural networks (ANNs)) as models [37]. The performance of our approach in energy gap prediction is comparable to that of Li et al. Considering the computational and labor resources saved by avoiding feature engineering and descriptor computation, our method is more efficient and economical.
Furthermore, both molecular graph-based and string-based deep learning models have been successfully used for forward reaction outcome prediction, retrosynthesis planning, and reaction condition recommendation [19][38][39][40][41][42][43][44][45], using SMILES-based graphical representations or reaction SMILES as input to the model instead of molecular descriptors and fingerprints. Therefore, one of our future studies will combine high-throughput DFT computation with deep learning models to study the relationships between metalloporphyrin structures and selectivity in the catalysis of reduction and oxidation reactions.

Porphyrin-Based Dyes Database
To the best of our knowledge, the Porphyrin-Based Dyes Database (PBDD) [46,47] at the Computational Materials Repository (https://cmrdb.fysik.dtu.dk/dssc/ (accessed on 12 November 2022)) is the largest computational database of porphyrins/metalloporphyrins published online. PBDD contains 12,096 porphyrin structures, of which 10,080 are metalloporphyrins and the rest are porphyrins without any central metal. In addition, 4032 molecules have hydrogen substituted by fluorine at the β-position. Each porphyrin molecule contains three aromatic side groups and an anchoring group that serves as an anchor point for the semiconductor carrier.
The properties of porphyrins in PBDD include frontier orbital energy levels (HOMO and LUMO), an optical gap, and an energy gap. Among these properties, the energy gap is often chosen to represent the ability of metalloporphyrins to act as reduction catalysts [48]. The energy gap data in the database show a normal distribution without any significant data imbalance or outliers (Figure 8).

Databases for Transfer Learning
Two datasets, drawn from ZINC15 and PubChem, were selected for pretraining in transfer learning. The goal of transfer learning is to familiarize the model with the basics of chemical structure, such as atoms, bonds, and functional groups. ZINC15 is a publicly accessible database containing more than 230 million purchasable compounds in 3D formats for virtual screening [31] (https://zinc15.docking.org/ (accessed on 16 November 2022)). In this study, a randomly selected subset of 1 million molecules from ZINC15 was used as a dataset for pretraining. PubChem [26] is the world's largest free repository of chemical information, which records the structural information, activity data, and other relevant information for 112 million compounds. We selected a subset of 10 million PubChem compounds as another pretraining dataset.

Model Names and Architecture
The models used in this study are from the two most commonly used classes of chemical deep learning models, namely graph neural networks and attention-based natural language processing models. Depending on the form of molecular representation they take as input, they are referred to as molecular graph-based models and string-based models, respectively. A brief description of each model is given below, and detailed information can be found in the Supplementary Materials.

Graph Convolutional Neural Network (GCN)
A graph neural network propagates information about nodes and edges in a non-Euclidean graph, and then compares the results of multiple propagations with existing results to update parameters in the model for training purposes. A graph structure model that contains convolutional layer(s) is called a graph convolutional neural network (GCN). In this study, we used the GCN model architecture [50] and corresponding featurizer [51] implemented by DeepChem [52].

Message Passing Neural Network (MPNN)
MPNN is obtained by modularizing the convolution operation in the graph convolutional neural network into two parts: a message passing stage and a state update stage. In this study, we implemented an MPNN model following the Keras tutorial [53] (https://keras.io/ (accessed on 12 November 2022)). RDKit (https://www.rdkit.org/ (accessed on 12 November 2022)) was used to extract the molecular features, including the symbol (element), the number of valence electrons, the number of hydrogen bonds, orbital hybridization, bond type, and conjugation.
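The split into a message passing stage and a state update stage can be illustrated with a toy example. In the sketch below, atom states are single numbers and `message_fn` and `update_fn` are simple hand-written functions; in an actual MPNN both are learned neural network layers and the states are feature vectors. All names here are hypothetical.

```python
def message_passing_step(node_states, bonds, message_fn, update_fn):
    """One round: every atom sums the messages from its bonded neighbours
    (message passing stage), then updates its own state (state update stage)."""
    incoming = {atom: 0.0 for atom in node_states}
    for u, v in bonds:  # each undirected bond carries a message in both directions
        incoming[v] += message_fn(node_states[u])
        incoming[u] += message_fn(node_states[v])
    return {atom: update_fn(state, incoming[atom]) for atom, state in node_states.items()}


# Toy three-atom chain 0-1-2 with scalar atom states.
states = {0: 1.0, 1: 2.0, 2: 3.0}
bonds = [(0, 1), (1, 2)]
new_states = message_passing_step(
    states,
    bonds,
    message_fn=lambda h: h,        # message = the neighbour's state
    update_fn=lambda h, m: h + m,  # new state = old state + summed messages
)
print(new_states)
```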

Directed Message Passing Neural Network (D-MPNN)
D-MPNN is a further update of MPNN. D-MPNN uses messages associated with directed edges (bonds) instead of using messages associated with vertices (atoms). In contrast to atom-based message passing methods such as MPNN, bond-based message passing methods such as D-MPNN allow fixed message passing directions, thus avoiding unnecessary loops in the message passing trajectory [54]. We used the model architecture developed by Yang et al. [22], which reads both the atomic and chemical bonding information of the molecules as well as the molecular descriptors.
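The difference from atom-based message passing can be sketched on a toy chain of three atoms. Here every bond is represented by two directed bonds, each with its own state, and a directed bond u→v collects information from the bonds entering u except its own reverse v→u. The sketch uses scalar states and omits the learned weight matrix and activation function of the actual model; the function name is hypothetical.

```python
def directed_message_step(hidden, initial):
    """One D-MPNN-style update on directed bonds.

    hidden and initial map a directed bond (u, v) to a scalar state.
    """
    new_hidden = {}
    for (u, v) in hidden:
        # Sum the states of bonds w->u that enter atom u,
        # excluding the reverse bond v->u.
        message = sum(h for (w, x), h in hidden.items() if x == u and w != v)
        new_hidden[(u, v)] = initial[(u, v)] + message
    return new_hidden


# Chain 0-1-2: four directed bonds, all starting from state 1.0.
initial = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
print(directed_message_step(initial, initial))
```

Excluding the reverse bond is what prevents a message from being sent straight back to the atom it just came from.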

Transformer
Transformer [55] is a relatively new class of NLP models based entirely on attention mechanisms [56] that show a powerful ability in modeling sequential data [57]. The underlying structure of Transformer models consists of a multi-layer encoder-decoder architecture like the seq2seq model, where a multi-headed attention mechanism is used in each encoder and decoder. A previous study used the Transformer model to make predictions about molecular properties [35]. We used the Transformer architecture provided by Simple Transformers [58] (https://simpletransformers.ai/ (accessed on 16 November 2022)). When transfer learning is performed, the Transformer's pre-trained model is loaded via Hugging Face [59].
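As a minimal illustration of the attention mechanism, the sketch below implements single-head scaled dot-product attention on plain Python lists. It omits the learned projection matrices and the multiple heads used in real Transformer layers, and the toy vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# A query that matches both keys equally attends to both values equally.
attended = scaled_dot_product_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(attended)
```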

Bidirectional Encoder Representation from Transformers (BERT)
The BERT model is a pre-trained language representation model based on the transformer model [60]. A new masked language model (MLM) was employed so that deep bidirectional language representations can be created. In this study, we use the BERT model architecture rxnfp built by Schwaller et al. [43], which has been adapted for chemical reaction yield prediction [38,41] and molecular property prediction [32].

Tree MAP (TMAP)
TMAP (http://tmap.gdb.tools (accessed on 16 November 2022)) is an algorithm that visualizes high-dimensionality data as a two-dimensional tree, preserving global and local features with a sufficient level of detail for human inspection and interpretation [36]. In this work, the TMAP algorithm was applied to visualize the molecular features of metalloporphyrins extracted by the deep learning models.
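The idea of representing high-dimensional data as a tree can be illustrated with a minimum spanning tree over pairwise distances, as in the sketch below. This is a simplified stand-in and not the TMAP implementation: it uses exact Euclidean distances and Prim's algorithm on a handful of made-up points, whereas TMAP is designed for very large datasets and also computes a 2D layout of the resulting tree.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances; returns a list of (i, j) edges."""
    n = len(points)
    in_tree = {0}  # start growing the tree from the first point
    edges = []
    while len(in_tree) < n:
        # Pick the shortest edge leading from the tree to a point outside it.
        _, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Toy "feature vectors": two close points and one distant point.
features = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(minimum_spanning_tree(features))
```

Points with similar feature vectors end up as neighbours on the tree, which is why clusters on such a map can be read as groups of molecules that a model treats as similar.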

Conclusions
In this study, deep learning of metalloporphyrins was investigated from three important perspectives: databases, molecular representations, and model architectures. A protocol for assembling canonical SMILES was developed to make the open-source metalloporphyrin database PBDD available for the training of state-of-the-art deep learning models. Both the D-MPNN and the BERT models trained on PBDD had R² above 0.93 in terms of energy gap prediction. It is worth mentioning that we only used data from one database because other data on metalloporphyrins are scattered in various papers and difficult to collect in a short time. Therefore, in the future, we plan to use deep learning-assisted automatic literature data extraction methods [61,62] to curate another metalloporphyrin database containing data with more diverse structures. In parallel, we are preparing to publish another study to develop a high-throughput DFT method to compute the energy gaps of metalloporphyrins that have appeared in the literature in recent years, with a wider variety of central metals, since only Ti and Zn are available in PBDD. We are also extending the SMILES representation of metalloporphyrin molecules to metalloporphyrin-catalyzed reactions in order to use deep learning models for reaction prediction to study the catalysis of metalloporphyrins. Furthermore, we are studying metalloporphyrins using Graphormer, an advanced Transformer model that combines the advantages of graph representation with the power of Transformer and shows better performance than message passing-based GNNs [53,57,58].