Triple Generative Self-Supervised Learning Method for Molecular Property Prediction

Molecular property prediction is an important task in drug discovery, and with the help of self-supervised learning methods, its performance can be improved by utilizing large-scale unlabeled datasets. In this paper, we propose a triple generative self-supervised learning method for molecular property prediction, called TGSS. Three encoders, a bi-directional long short-term memory recurrent neural network (BiLSTM), a Transformer, and a graph attention network (GAT), are pre-trained on molecular sequence and graph structure data to extract molecular features. A variational autoencoder (VAE) is used to reconstruct features from the three models. In the downstream task, in order to balance the information between different molecular features, a feature fusion module is added to assign different weights to each feature. In addition, to improve the interpretability of the model, atomic similarity heat maps are introduced to demonstrate the effectiveness and rationality of molecular feature extraction. We demonstrate the accuracy of the proposed method on chemical and biological benchmark datasets through comparative experiments.


Introduction
Drug development is a time-consuming and costly process. In order to improve the success rate and reduce the time and costs, computer-aided drug design (CADD) [1,2] methods such as virtual screening and molecular docking have been introduced to provide guidance for the entire process. Despite their success in drug discovery [3,4], many traditional CADD methods based on molecular simulation techniques suffer from high computational costs and long running times, which limit their large-scale application in the pharmaceutical industry.
In recent years, artificial intelligence has developed rapidly and has become a popular and dominant direction within drug discovery because of its superior performance and high efficiency. Moreover, many deep-learning methods [5,6] have been successfully applied to various tasks in drug discovery, including molecular property prediction [7], drug-target affinity prediction [8,9], and protein-protein interaction prediction [10].
Molecular property prediction aims to identify, from a large number of candidate molecules, those that have the expected properties (solubility, biological activity, etc.), which is important for drug design. There are many ways to represent molecules as sequences, including the simplified molecular-input line-entry system (SMILES) and fingerprints such as extended-connectivity fingerprints (ECFP) [11] and the molecular access system (MACCS) [12]. SMILES is a specification that uses ASCII strings to encode molecular structures. Atoms are represented by their one- or two-letter symbols from the periodic table. For chemical bonds [13], single bonds are usually left implicit or written explicitly as "-", and double, triple, and quadruple bonds are represented by "=", "#", and "$", respectively. Various deep learning methods based on SMILES strings have emerged. Hou et al. [14] used LSTM to process SMILES strings to obtain complex information about atoms. Honda et al. [15] proposed the SMILES Transformer, which pre-trains a sequence-to-sequence model on SMILES strings. However, the structural information of molecules cannot be obtained from a SMILES string directly, since two connected atoms may be far away from each other in the SMILES string.
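The last point can be illustrated with a small sketch. The tokenizer below is a deliberately simplified, hypothetical pattern (it is not a full SMILES parser and is not part of TGSS); it only shows that the two atoms joined by a ring-closure label can sit far apart in the string.

```python
import re

# Hypothetical, simplified SMILES token pattern: bracket atoms, two-letter
# halogens, ring-closure labels (%nn or one digit), then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|\d|.")

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    return SMILES_TOKEN.findall(smiles)

def ring_closure_pairs(tokens: list[str]) -> list[tuple[int, int]]:
    """Return token-index pairs of atoms that are bonded through a ring closure."""
    open_labels: dict[str, int] = {}
    pairs = []
    last_atom = None
    for i, tok in enumerate(tokens):
        if tok.isdigit() or tok.startswith("%"):
            if tok in open_labels:
                pairs.append((open_labels.pop(tok), last_atom))
            else:
                open_labels[tok] = last_atom
        elif tok[0].isalpha() or tok.startswith("["):
            last_atom = i
    return pairs

# Chlorobenzene: the first and last aromatic carbons are directly bonded.
tokens = tokenize_smiles("c1ccc(Cl)cc1")
pairs = ring_closure_pairs(tokens)
```

Here the bonded ring atoms are at token positions 0 and 9, which is the kind of long-range dependency a purely sequential model has to learn.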
Moreover, the molecular graph structure [16] provides another way to represent molecules, in which atoms are represented by nodes and chemical bonds are represented by edges. In recent years, many studies have concentrated on molecular graph structures for molecular property prediction through message passing networks, including MPNN [17], DMPNN [18], and CMPNN [19]. Most current graph-based methods are supervised learning methods that require large-scale labeled data for training. However, acquiring labels (i.e., molecules with known properties) is a difficult and expensive process. On the other hand, there are many databases with a large amount of data but no label information (e.g., ZINC [20], ChEMBL, and PubChem). How to reasonably and effectively utilize these data to improve the accuracy of molecular property prediction remains an open problem.
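As a minimal sketch of the graph view, ethanol can be written by hand as a node list and an edge list. The variable and function names are illustrative assumptions; in practice a cheminformatics toolkit would derive the graph from the SMILES string.

```python
from collections import defaultdict

# Ethanol (SMILES "CCO") written by hand as a graph; hydrogens are implicit.
atoms = ["C", "C", "O"]                        # node labels, indexed 0..2
bonds = [(0, 1, "single"), (1, 2, "single")]   # undirected edges with bond type

def build_adjacency(num_atoms, bond_list):
    """Map every atom index to the sorted list of its bonded neighbors."""
    adjacency = defaultdict(list)
    for i, j, _bond_type in bond_list:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return {i: sorted(adjacency[i]) for i in range(num_atoms)}

adjacency = build_adjacency(len(atoms), bonds)
```

Unlike the string form, this structure makes every bonded pair of atoms directly adjacent, which is what message passing and graph attention operate on.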
The natural language processing (NLP) and computer vision (CV) [21] fields address this problem through self-supervised learning (SSL). Specifically, the model is first pre-trained on a large unlabeled dataset and then fine-tuned for downstream tasks using labeled data. SSL includes generative and contrastive self-supervised learning [22]. Generative SSL consists of an encoder and a decoder. The encoder is trained to encode an input x into a latent vector z, and the decoder is used to reconstruct x from z by minimizing the reconstruction loss. Generative SSL methods include autoregressive (AR) models, flow-based models, autoencoder (AE) models, and hybrid generative models. In contrastive SSL [23,24], features are learned by constructing positive and negative samples, and an encoder is trained to encode an input x into an explicit vector z that is used to measure similarity.
SSL has achieved great success in the field of natural language processing, with examples such as the Generative Pre-trained Transformer (GPT) [25] and Bidirectional Encoder Representations from Transformers (BERT) [26]. GPT is OpenAI's pre-trained Transformer model for natural language processing, which uses deep learning to generate human-like text given a prompt or seed text. The pre-trained BERT language model learns contextual word representations by predicting masked words and reconstructing the input context, thereby improving the performance of downstream tasks. Wen et al. [27] pre-trained BERT through SSL to obtain a semantic representation of compound fingerprints, called Fingerprint-BERT (FP-BERT). Then, the molecular embedding was fed into a convolutional neural network (CNN) to obtain higher-level features.
However, language models can only handle sequence-based molecular representations, ignoring the important topology of molecular graphs. Therefore, the utilization of SSL for molecular graphs is also a non-negligible aspect of molecular property prediction. Graph contrastive coding (GCC) [28] is a self-supervised graph neural network pre-training framework designed to capture common network topological properties across multiple networks. The KPGT [29] self-supervised framework introduces the line graph transformer (LiGhT), which is mainly used to accurately model the structural information of molecular graphs. However, it ignores the unique structural properties of chemical molecules, such as rings and functional groups. To fully consider the properties of molecular graphs, Zhang et al. [30] sampled subgraphs by learning graph motifs. Motif learning was formulated as a clustering problem, using EM clustering to group similar and important subgraphs into several motifs. These learned motifs were used to train the sampler to generate more informative subgraphs for graph-to-subgraph contrastive learning. HiMol [31] used a hierarchical molecular graph neural network (HMGNN) to encode motif structures and extracted node-motif-graph hierarchical molecular representations.
Despite improvements in molecular representation learning, some problems remain to be solved:
(1) Although molecular representations based on SSL have been extensively studied, most methods focus on pre-training using sequence information or graph information only. The effective fusion of heterogeneous molecular information is important for enhancing the diversity of molecular representations. Some methods have considered this direction. Liu et al. [32] used 3D and 2D information for SSL, aiming to maximize the mutual information between 3D and 2D views of the same molecule. However, much less 3D molecular structural information is available than 2D and 1D information. Although some methods can calculate 3D information about a molecule, the accumulated error can make predictions inaccurate. Zhu et al. [33] used sequence and graph information to conduct SSL and proposed dual-view molecular pre-training (DMP), a pre-training algorithm that combines the two molecular representations and maximizes the consistency between molecular sequence and molecular graph representations. However, we believe that a generative model can reflect molecular information more accurately and effectively. Therefore, inspired by the work of Liu et al. [32], this paper concentrates on how to use a generative SSL model to learn molecular representations from the sequence and topological structural information of molecules.
(2) The existing SSL models, whether generative or contrastive, generally use only one or two different models. For example, in generative learning, the encoder and decoder are used to reconstruct features, and in contrastive learning, SSL is performed by minimizing the difference between the feature representations of two different types or sources of data. However, no existing method discusses the introduction of three or more models in SSL. We believe that, to a certain extent, having more models participate in SSL can also improve the accuracy and generalization of the final feature representation.
(3) After pre-training, multiple models are obtained for downstream tasks, and how to integrate them more effectively is also a problem worth studying. Ensemble learning is widely used in the fusion of various models, but directly concatenating output features cannot effectively utilize the advantages of different models. Treating each output feature equally can also cause key information to be lost among the multiple features. Therefore, how to design an effective fusion model, discover the important parts of features from different sources, and improve the accuracy of the prediction are also important issues in this paper.
To address the above problems, a triple generative self-supervised learning method (TGSS) is proposed in this paper, which combines molecular sequence information and molecular graph structure information to improve model performance. BiLSTM and Transformer are used to learn the feature representation of the molecular sequence, and GAT is used to learn the feature representation of the molecular graph. The generative SSL method is introduced in the pre-training step, and all three representations are used for reconstruction, which is performed in pairs to improve the generalization of the model. For the downstream tasks, all three pre-trained models are fused and an attention module is utilized to fully integrate the three features. We experimented with eight downstream molecular property prediction tasks, on five of which TGSS outperforms existing supervised and self-supervised learning methods.

Datasets
For the pre-training dataset, we used 430,000 unlabeled molecules randomly sampled from the public ChEMBL database available at https://www.ebi.ac.uk/chembl/ (accessed on 3 March 2023). ChEMBL is a database of bioactive molecules with drug-like properties, containing millions of unlabeled SMILES strings. The comparative experiments were conducted on the public MoleculeNet benchmark [34] available at https://moleculenet.org/ (accessed on 5 April 2023), which includes classification tasks and regression tasks. For the regression datasets, we use random splitting. For the classification datasets, following the method of Yang et al. [35], we use scaffold splitting, which splits the molecules according to their structures, so that the molecular samples in the training set and the test set come from different molecular scaffolds. Scaffold splitting is more challenging and evaluates the generalization of the model more accurately. Both split methods divide a dataset into a training set, validation set, and test set in the ratio of 8:1:1. The details of the datasets are shown in Table 1. It should be noted that Tox21 and SIDER involve multi-label classification tasks, where each input sample may correspond to multiple labels. Therefore, the arithmetic mean over all labels is calculated for these two datasets as the final result.

Regression dataset:
• FreeSolv: the experimental and calculated hydration free energies in water of 642 small neutral molecules.
• ESOL: 1128 compounds and their corresponding water solubility.

Classification dataset:
• BACE: the quantitative and qualitative binding results for a panel of inhibitors of human β-secretase 1 (BACE-1).
• HIV: more than 40,000 compounds tested for their ability to inhibit HIV replication, labeled as inactive or active.
• Tox21: the qualitative toxicity measurements of 12 different targets for 7831 compounds.
• SIDER: 27 drug side effect labels for 1427 compounds.
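The scaffold splitting described above can be sketched as follows. This is a common greedy scheme and an assumption on our part, not necessarily the exact procedure of Yang et al. [35]; the scaffold strings are taken as precomputed inputs (for example, from a cheminformatics toolkit), and the function and variable names are hypothetical.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold is shared between subsets.

    `scaffolds` holds one precomputed scaffold string per molecule (assumed input).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest scaffold groups first, so rare scaffolds end up in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy input: ten molecules with hypothetical scaffold labels.
toy_scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = scaffold_split(toy_scaffolds)
```

On the toy input this yields an 8:1:1 split in which the validation and test molecules carry scaffolds never seen during training, which is why this protocol is harder than random splitting.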
To demonstrate the effectiveness of the TGSS method, we tested it on eight molecular datasets. The experimental results are shown in Tables 2 and 3 and are reported as the mean and standard deviation over three different random seeds. Table 2 shows the performance of the TGSS method in classification tasks. Compared with other supervised and self-supervised learning methods, TGSS performed the best on three of the five classification benchmark datasets: BBBP, HIV, and SIDER. Compared with the previous best results on these three datasets, the TGSS method achieved improvements of 5.7%, 0.3%, and 4.9%, respectively. Overall, TGSS achieved the best performance across the five datasets compared with the supervised learning and other self-supervised learning methods, including generative and contrastive SSL models. These results demonstrate the effectiveness and good generalization ability of our self-supervised strategy. Table 3 shows the performance of the TGSS method in regression tasks. It can be seen from the table that TGSS outperformed the previous supervised learning methods on all three datasets. Compared with other self-supervised learning methods, although our method is weaker than MolBERT and KEMPNN on the ESOL and Lipophilicity datasets, respectively, its overall performance across the three datasets is better. It is worth noting that TGSS improved the result on the FreeSolv dataset by 18.8%, which shows that the improvement made by our model was most significant on small datasets. This demonstrates that the TGSS model is capable of extracting effective representations from limited molecular data.

Ablation Experiments
To explore the influence of different factors on the model's performance, we conducted ablation experiments in the pre-training, downstream task prediction, and feature fusion stages, respectively.

Performance Comparison of Different Combinations of the Model in the Pre-Training Process
In this paper, three models, BiLSTM, Transformer, and GAT, were embedded in TGSS to improve the generalization performance of the model. To explore the impact of pre-training different models, we designed the first ablation experiment with five groups: pre-training all three models (TGSS); pre-training only two models, namely BiLSTM and GAT (Pre-BG), BiLSTM and Transformer (Pre-BT), or Transformer and GAT (Pre-TG), that is, using $L_{xz}$, $L_{xy}$, or $L_{yz}$ alone as the objective function; and no pre-training (No pre). For a fair comparison, all groups still used the three-model fusion prediction method in the downstream tasks, but the parameters of the models that did not participate in pre-training were initialized randomly.
Four datasets, including two regression tasks (ESOL and Lipophilicity) and two classification tasks (BACE and BBBP), were selected for evaluation. The prediction result at different epochs was used as the indicator for the different methods. For the ESOL dataset in Figure 1a, it can be seen that, in the first 100 epochs, the proposed TGSS model performed worse than Pre-BG, Pre-TG, and no pre-training. After the 100th epoch, its RMSE became the lowest. It can be concluded that pre-training significantly improved the performance of the model. Compared with Pre-BG, Pre-BT, Pre-TG, and no pre-training, the results were improved by 19.8%, 15.7%, 11.7%, and 7.6%, respectively. Since Lipophilicity contains more data than ESOL, the gap between the methods widened. Pre-BT, Pre-TG, and the proposed TGSS method were all significantly better than no pre-training. At the 195th epoch, the proposed model achieved its best result; compared with the best results of Pre-BG, Pre-BT, Pre-TG, and no pre-training, the improvements were 17.2%, 3.7%, 0.6%, and 9.0%, respectively. In Figure 1b, it can be seen that the TGSS model reached its best performance faster than the other methods, and its curve is smoother, indicating better stability. Therefore, pre-training the model effectively improved the prediction accuracy on downstream tasks. Compared with the non-pre-trained model, the pre-trained model also converged faster, which sped up the training process.

Performance Comparison of Different Sizes of Pre-Training Dataset
The model was pre-trained through SSL to learn effective molecular representations without labels. A subset containing 430,000 molecules was used in pre-training. To explore whether less data would affect the performance on downstream tasks, we ran pre-training with different amounts of data, from 10,000 to 430,000 molecules, and tested the performance on downstream tasks.
It can be clearly seen from Figure 2a that, on the regression dataset, pre-training with more data effectively improved the performance of the model. The RMSE obtained with the whole pre-training dataset was 0.597, whereas the RMSE obtained with a pre-training dataset of 20,000 molecules was 1.086. For the classification tasks in Figure 2b, the improvements brought by using 430,000 molecules as the pre-training dataset compared with the smaller datasets are also clear. To summarize, the amount of pre-training data affected the performance. By increasing the amount of pre-training data, the model could learn more comprehensive molecular features, thereby improving its generalization ability.
As can be seen from Figure 3a, the improvement of our TGSS model is even more obvious on the ESOL dataset, at about 18.3% compared with the best of the other model combinations. On the larger regression dataset, Lipophilicity, although the fusion of BiLSTM and GAT achieved an RMSE of 0.653, our method still led to an improvement of about 1.6%. As can be seen in Figure 3b, for the classification task, the proposed TGSS method led to an improvement of about 0.4% on the smaller BACE dataset and 6.4% on the BBBP dataset. This experiment demonstrated that the proposed TGSS model combining three models obtained the best results and improved the generalization performance of the model. This shows that using multiple models to learn molecular information is effective. Different models can learn various aspects of molecular information, thus compensating for the limitations of a single model, so that the proposed model can acquire molecular information comprehensively.
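The percentage improvements quoted in these ablations read as relative changes in the metric. The text does not state the formula, so the definition below is an assumption; it is applied to the two RMSE values reported for the pre-training set size experiment.

```python
def relative_improvement(baseline_rmse: float, new_rmse: float) -> float:
    """Percentage reduction in RMSE relative to the baseline (lower is better)."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# RMSE values quoted in the text: 20,000 vs. 430,000 pre-training molecules.
gain = relative_improvement(1.086, 0.597)
```

Under this definition, moving from 20,000 to 430,000 pre-training molecules corresponds to a reduction in RMSE of roughly 45%.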

Performance Comparison of Different Feature Fusion Methods
When merging different molecular features, we believe that directly concatenating features cannot explore the deep information of each feature, and so we introduced a hierarchical elem-feature fusion method to the TGSS model. In this section, we experimented with two different strategies for the model, direct concatenation and the hierarchical elem-feature fusion module, to explore their different impacts on the model.
For the three molecular features extracted by the model, adding a feature fusion method could effectively balance the proportion of the three in the final output features. As shown in Figure 4a, on the ESOL dataset, adding feature fusion achieved a lower RMSE than using no fusion method. It can be seen from Figure 4b that, on the larger dataset, Lipophilicity, the two curves were more consistent, but the best result with feature fusion was still about 8.6% better. For the classification datasets, feature fusion significantly improved the prediction performance, and this trend is evident from Figure 4d. From the experiments, it was found that feature fusion could prevent the premature fitting of the model when the amount of data was small. Although the improvement became less obvious as the amount of data increased, feature fusion was still better than directly concatenating features.
Molecular features consist of the individual features of each atom. For the TGSS model, there are three representations for each atomic feature, which are BiLSTM features, Transformer features, and GAT features. In order to study the evolution of these features during the training process, we calculated the similarity coefficient (Pearson correlation coefficient) between atomic features, and then visualized the similarity with a heat map.
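A minimal sketch of the similarity computation behind such a heat map is given below. The per-atom feature values are hypothetical placeholders rather than TGSS outputs, and the helper names are our own.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / denom

def similarity_matrix(atom_features):
    """Pairwise Pearson similarity between per-atom feature vectors."""
    return [[pearson(u, v) for v in atom_features] for u in atom_features]

# Hypothetical 4-dimensional features for three atoms (not taken from the paper).
atom_features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
sim = similarity_matrix(atom_features)
```

Each entry of `sim` lies in [-1, 1]; plotting the matrix as a heat map makes blocks of mutually similar atoms visible as clusters.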
We randomly selected one molecule from each of the Lipophilicity and ESOL datasets and plotted the heat maps of their final output features. As can be seen from Figure 5a, the TGSS model, which combines the information of the three features, clearly showed the clustering of atoms. After 100 epochs, the atoms of the molecule were divided into three clusters, namely 4-chlorophenyl, 1-methylbenzimidazole, and piperazine. Moreover, both 4-chlorophenyl and 1-methylbenzimidazole are lipophilic, which suggests that TGSS can learn representations related to the lipid solubility of molecules. In addition, it can also be seen from Figure 5b that oxy acetonitrile, phenyl, and 3,4,5-trihydroxy-6-(hydroxymethyl)oxan-2-yl, which are related to the water solubility of the molecule, are all clustered. Therefore, the TGSS model was able to effectively extract molecular property-related information.

Discussion
In this work, we explored the fusion of multiple models for molecular representation through generative self-supervised learning. TGSS, a triple generative self-supervised learning method, is proposed, which pre-trains BiLSTM and Transformer on molecular sequences and GAT on 2D graphs. Moreover, molecular features are reconstructed by a VAE between each pair of models in pre-training. In downstream tasks, the trained models were fine-tuned and a feature fusion module was added to balance the weights between the three molecular features.
We experimentally validated the accuracy and generalization of the TGSS model using benchmark datasets from the fields of chemistry and biology. The results indicate that pre-training with a large unlabeled dataset is effective for property prediction, since pre-training enables the model to learn from more molecular data and makes up for the lack of labeled data. Meanwhile, the comparison with other self-supervised learning methods showed that our self-supervised strategy could extract molecular property-related representations more effectively, since this strategy fully combines multiple molecular features and obtains the information contained in the molecules more comprehensively.
In addition, we verified the impact of pre-training weights, pre-training data volume, different model combinations, and molecular feature fusion on model performance through ablation experiments. Pre-training the model significantly improved the convergence speed and accuracy of the model in downstream tasks. The amount of pre-training data also affects the performance of the model: the more pre-training data, the better the model performs. By using a combination of three models, the characteristics of different models can be fully exploited to improve the comprehensiveness of the extracted molecular features. The added molecular feature fusion can effectively balance the proportions between different molecular features and improve the performance of the final prediction.
To validate the interpretability of the model, heat maps were generated by computing similarity coefficients, which revealed a high degree of consistency with the actual molecular structure. This demonstrates that the proposed model can extract key information from molecules.

Overview
This section presents the proposed triple generative self-supervised learning method based on molecular sequences and graph structures, which consists of two parts, a pre-training stage and a downstream task prediction stage, as shown in Figure 6a. Unlabeled molecules were used to train the TGSS model in pre-training, and the trained weights were then transferred to the TGSS model used for molecular property prediction. In the pre-training part, the data used were all unlabeled and came from a subset of the ChEMBL dataset, with a total of 430,000 molecules. In the encoder part, BiLSTM and Transformer were used to encode sequence data, and GAT was used to encode graph data. After obtaining the corresponding features, the VAE was used for generative self-supervised learning. As shown in Figure 6b, the input molecules were processed by the three models to obtain $h_x$, $h_y$, and $h_z$. These three features entered the VAE, where the reconstruction loss was calculated after reparameterization. The model weights were optimized based on the loss, and the best-performing weights were used for downstream tasks.
In the downstream task prediction, the data used were labeled data from MoleculeNet, as shown in Figure 6c. The model was fine-tuned on the labeled data for prediction. Moreover, a feature fusion module was introduced to balance the proportions of each feature in the final output.
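The idea of weighting the three features rather than concatenating them can be sketched as a softmax-weighted sum. This is a generic stand-in and not the hierarchical elem-feature fusion module of TGSS, whose internals are not detailed here; the feature values and importance scores are hypothetical.

```python
import math

def fuse_features(features, scores):
    """Weighted sum of feature vectors, with weights = softmax(scores)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

# Hypothetical 3-dimensional features from BiLSTM, Transformer, and GAT.
h_x, h_y, h_z = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 2.0]
# Hypothetical importance scores; in a trained model these would be learned.
fused, weights = fuse_features([h_x, h_y, h_z], scores=[0.0, 0.0, 0.0])
```

With equal scores the three features contribute equally; learned, unequal scores would let the model emphasize whichever view is most informative for a given task.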

Pre-Training Models
In the pre-training stage, three models were utilized for training: BiLSTM, Transformer, and GAT, and the parameters of these models were transferred to downstream tasks for molecular property prediction.
There are two types of molecular input, which are molecular sequences (SMILES) and 2D molecular graphs. Sequence-based BiLSTM and Transformer are used to process SMILES to obtain the corresponding molecular features $h_x$ and $h_y$. Graph-based GAT is used to process 2D molecular graphs to obtain molecular features $h_z$. Each model will be introduced in the following sections.

BiLSTM
BiLSTM, as an extension of the recurrent neural network (RNN), addresses the challenges faced by RNNs in learning long-term dependencies. The LSTM contains three gate units: a forget gate, an input gate, and an output gate. These gate units enable the model to extract features from the input data and keep this information for a long time. During the training process, information is kept or discarded based on the weight values. Figure 7a shows the basic structure of BiLSTM, where $\overrightarrow{h}_t$ represents the hidden vector of the forward layer and $\overleftarrow{h}_t$ represents the hidden vector of the backward layer. The inputs $\{x_1, x_2, x_3, \cdots, x_n\}$ are fed into the embedding layer to obtain the corresponding embedding vectors, and the forward layer and the backward layer are used to obtain $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively. These vectors are then combined to obtain the output vector $h_x$ of BiLSTM.
For an input $x_t$, the BiLSTM uses two LSTMs with different directions, namely the forward layer and the backward layer, to process the input data. At time t, the forward layer calculates the hidden vector $\overrightarrow{h}_t$ at the current moment based on the previous hidden vector $\overrightarrow{h}_{t-1}$ and the embedding vector $X_t$; the backward layer, which reads the sequence in reverse, calculates the hidden vector $\overleftarrow{h}_t$ based on $\overleftarrow{h}_{t+1}$ and the embedding vector $X_t$. Subsequently, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are combined to form the final hidden vector, which serves as the output of BiLSTM.
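The gate computations and the two reading directions can be sketched with a scalar toy model. The code uses the standard LSTM update (forget, input, and output gates plus a tanh candidate), which may differ in notation from the original equations, and it assumes that "combined" means concatenation; all weights are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps each gate name to (w_x, w_h, bias)."""
    gate = lambda name: p[name][0] * x + p[name][1] * h_prev + p[name][2]
    f = sigmoid(gate("forget"))        # how much of the old cell state to keep
    i = sigmoid(gate("input"))         # how much of the candidate to write
    o = sigmoid(gate("output"))        # how much of the cell state to expose
    c_tilde = math.tanh(gate("cell"))  # candidate cell state
    c = f * c_prev + i * c_tilde
    h = o * math.tanh(c)
    return h, c

def run_lstm(xs, p):
    """Run one LSTM over a sequence, starting from zero hidden and cell states."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs

def bilstm(xs, p_fwd, p_bwd):
    """Pair (concatenate) forward and backward hidden states at each position."""
    forward = run_lstm(xs, p_fwd)
    backward = run_lstm(xs[::-1], p_bwd)[::-1]
    return [(hf, hb) for hf, hb in zip(forward, backward)]

# Hypothetical scalar weights, chosen only to make the example runnable.
params = {name: (0.5, 0.3, 0.0) for name in ("forget", "input", "output", "cell")}
states = bilstm([0.2, -0.1, 0.4], params, params)
```

Each position ends up with one state summarizing everything to its left and one summarizing everything to its right, which is what lets the model use context from both sides of a SMILES token.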

Transformer
The Transformer consists of a self-attention layer and a feed-forward neural network to capture the global dependencies between input and output through an attention mechanism.When processing a sequence, the RNN operates by sequentially processing words and passing the results to the next layer.However, when dealing with long sequences, the gradient tends to vanish or explode when words are distant from each other.Unlike RNN, the Transformer [48] tracks the relationship between words in the long text in both forward and backward directions through the attention mechanism.A detailed flowchart of Transformer is shown in Figure 7b.
First, an embedding layer is used to convert the input to $X_{embedding} \in \mathbb{R}^{B \times S \times d}$, where B, S, and d represent the batch size, sequence length, and vector dimension, respectively. Subsequently, Q, K, and V are obtained through linear transformations.
To address the problem of sequence prediction, the Transformer provides sequential information through a position encoding $X_{pos}$ with the same dimensions as the input, which is added to $X_{embedding}$ to obtain a new embedding as follows:
$$X_{embedding} = X_{embedding} + X_{pos}$$
Next, the self-attention mechanism is introduced to ensure the model attends to the more relevant characters as follows:
$$Q = X_{embedding} W_Q, \quad K = X_{embedding} W_K, \quad V = X_{embedding} W_V$$
where $W_Q$, $W_K$, and $W_V$ are trainable parameters. The attention matrix is then calculated from $QK^T$ and used to weight $V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $d_k$ represents the number of columns in the $Q$ and $K$ matrices.
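The attention step can be sketched in plain Python for a single sequence. This is the standard scaled dot-product attention; for brevity the learned projections are omitted and Q = K = V is set to a toy input, which is an assumption made only for this example.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 tokens, d_k = 2; Q = K = V for brevity (no learned projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(x, x, x)
```

Every row of `weights` sums to one, so each output token is a convex combination of the value vectors, with larger weights on tokens whose keys align with its query.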
Then, the residual connection and layer normalization are applied to obtain $X_{hidden}$:
$$X_{hidden} = \mathrm{Norm}(X_{embedding} + \mathrm{Attention}(Q, K, V))$$
$X_{hidden}$ is used as the input to the feed-forward network, which contains two linear transformations and produces the final output $h_y$.

GAT
Molecules can be represented as topological graphs by treating atoms as nodes and bonds as edges, which can be defined as $G = (V, E)$, where V denotes the set of nodes and E denotes the set of edges. A two-layer graph attention network [49] is used for node aggregation to obtain graph representations in TGSS. The processing flow of a molecule in GAT is shown in Figure 7c. First, the topological information is obtained from the molecular graph. The processing in GAT is then divided into three steps.
The first step is to calculate the attention weights $e_{ij}$ and $e_{ik}$ between the central atom $i$ and its neighbor atoms $j$ and $k$. The second step is to normalize the weights with a softmax function to obtain $a_{ij}$:
$$a_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N(i)} \exp(e_{ik})}$$
where $N(i)$ denotes the set of neighbors of atom $i$.
Finally, the features of the neighbor nodes, together with the feature of the node itself, are aggregated using the normalized attention weights. After obtaining the features of each atom $i$, max pooling and an MLP are used to obtain the feature $h_z$.
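The three steps can be sketched for a single attention head with scalar node features. The scoring function follows the standard GAT form (a LeakyReLU over a learned combination of the two projected features), which is assumed here rather than taken from the original equations; the output nonlinearity and the final MLP are omitted, and all numbers are hypothetical.

```python
import math

def leaky_relu(a, slope=0.2):
    return a if a > 0 else slope * a

def gat_layer(features, neighbors, weight, attn):
    """Single-head graph attention over scalar-projected node features.

    features: one scalar per node; weight: scalar projection W;
    attn: pair (a_self, a_neigh) playing the role of the attention vector.
    """
    projected = [weight * h for h in features]
    updated = []
    for i, h_i in enumerate(projected):
        nbrs = neighbors[i] + [i]   # include the node itself (self-loop)
        # Step 1: unnormalized attention weights e_ij
        e = [leaky_relu(attn[0] * h_i + attn[1] * projected[j]) for j in nbrs]
        # Step 2: softmax normalization to a_ij
        peak = max(e)
        exps = [math.exp(v - peak) for v in e]
        alpha = [v / sum(exps) for v in exps]
        # Step 3: attention-weighted aggregation of neighbor features
        updated.append(sum(a * projected[j] for a, j in zip(alpha, nbrs)))
    return updated

# Ethanol-like chain C-C-O with hypothetical scalar atom features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
node_out = gat_layer([1.0, 1.0, 2.0], neighbors, weight=1.0, attn=(0.5, 0.5))
graph_feature = max(node_out)  # max pooling over atoms
```

Because the weights for each atom sum to one, every updated feature is a weighted average over the atom and its bonded neighbors, and the pooled value gives a single graph-level summary.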

Molecular Representation Reconstruction
In the pre-training process, a VAE is used to reconstruct the molecular features and calculate the reconstruction loss, as shown in Figure 8a. The VAE consists of two parts: an encoder and a decoder. The encoder processes the input features to obtain the mean $\mu_x$ and the logarithm of $\sigma_x$, which determine the latent vector $z_x$. In the decoder part, reparameterization is used to calculate the latent vector $z_x$, and then the reconstructed features are output through the decoder. The reconstruction loss is obtained by calculating the mutual information (MI) between the two reconstructed features.
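The reparameterization step is commonly written as $z_x = \mu_x + \sigma_x \odot \epsilon$ with $\epsilon \sim N(0, I)$. The sketch below assumes this standard form and assumes that the encoder outputs the logarithm of the standard deviation; the numeric values are hypothetical.

```python
import math
import random

def reparameterize(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
mu = [0.5, -1.0]                # hypothetical encoder outputs
log_sigma = [-50.0, -50.0]      # very small sigma: samples collapse onto mu
z = reparameterize(mu, log_sigma, rng)
```

Writing the sample as a deterministic function of $\mu_x$, $\sigma_x$, and external noise is what allows gradients to flow back into the encoder during pre-training.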
MI measures the nonlinear dependence between two random variables, and the larger the MI, the stronger the dependence between the variables. Unlike the correlation coefficient, MI is more general and measures the difference between the joint distribution $p(x, y)$ and the product of the marginal distributions $p(x)$ and $p(y)$. The standard expression of MI is calculated as follows, where $h_x$, $h_y$, and $h_z$ correspond to the features of BiLSTM, Transformer, and GAT, respectively.
$$I(h_x; h_y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right] \quad (20)$$
where $p(x, y)$ represents the joint probability distribution function of $h_x$ and $h_y$, while $p(x)$ and $p(y)$ represent the marginal probability distribution functions of $h_x$ and $h_y$, respectively. It can be seen from the above equation that the greater the divergence between $p(x, y)$ and the product $p(x)p(y)$, the stronger the dependence between x and y.
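Equation (20) can be checked numerically on small discrete distributions. The two joint tables below are toy examples chosen for illustration, not data from the experiments.

```python
import math

def mutual_information(joint):
    """MI (in nats) of a discrete joint distribution given as a nested list."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]         # y is fully determined by x
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives an MI of zero, while the fully dependent one gives log 2, the maximum for two binary variables, matching the interpretation above.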
Therefore, our objective is to maximize the MI between any two of the features produced by the above models in order to obtain a more accurate representation, i.e., to maximize $I(h_x; h_y)$, $I(h_x; h_z)$, and $I(h_y; h_z)$. In practice, this amounts to minimizing the difference between each reconstructed feature and the other features; that is, minimizing $L_{xy}$, $L_{xz}$, and $L_{yz}$ in Figure 8b.
In this paper, we employed a variational lower bound to approximate the conditional log-likelihood term in Equation (20). Specifically, when generating the Transformer sequence features from the corresponding BiLSTM sequence features, we modeled the conditional likelihood $p(y|x)$ and derived its lower bound. Similarly, $p(z|x)$ represents the generation of GAT graph features from the BiLSTM sequence features, and $p(z|y)$ represents the generation of GAT graph features from the Transformer sequence features. The expression for $\log p(x|y)$ is analogous to that for $\log p(y|x)$. The resulting objective function consists of a conditional log-likelihood term and a Kullback-Leibler (KL) divergence term, which together represent the reconstruction of the Transformer's sequence features ($y$) from a latent vector sampled from the BiLSTM's sequence features ($z_x$). However, one challenge arises from the discrete nature of molecules, which makes them difficult to model in the molecular space.
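As a hedged illustration, the bound described above has the shape of the standard variational lower bound for a conditional likelihood. Assuming a variational posterior $q(z_x|x)$ and a prior $p(z_x)$ (notation introduced here, following the convention of Liu et al. [32] rather than taken from this paper), it can be written as:

```latex
\log p(y \mid x) \;\ge\;
  \mathbb{E}_{q(z_x \mid x)}\big[\log p(y \mid z_x)\big]
  - \mathrm{KL}\big(q(z_x \mid x)\,\|\,p(z_x)\big)
```

The first term is the conditional log-likelihood (reconstruction) term and the second is the KL regularizer. Swapping the roles of $x$ and $y$ gives the analogous bound for $\log p(x|y)$.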
Therefore, taking inspiration from Liu et al. [32], we moved the reconstruction from the data space to a continuous representation space. For the reconstruction, we projected the latent vector $z_x$ onto the target representation space. The lower bound of the conditional likelihood can then be calculated in that space, where $C$ is a constant and SG denotes the regularization operator used for optimizing the variational representation reconstruction.
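For orientation, the representation-space reconstruction proposed by Liu et al. [32] takes the form below. Here $q_x$ is a hypothetical name for the projection applied to $z_x$, and $h_y$ is the Transformer feature. Whether TGSS uses exactly this expression is an assumption.

```latex
\log p(y \mid z_x) = -\,\big\| q_x(z_x) - \mathrm{SG}(h_y) \big\|^{2} + C
```

Under this form, maximizing the conditional log-likelihood is equivalent to minimizing the squared distance between the projected latent vector and the target feature.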
Combining the variational lower bound with the representation-space reconstruction, and taking BiLSTM and Transformer as an example, yields the loss $L_{xy}$ between the two models. The loss $L_{xz}$ between BiLSTM and GAT and the loss $L_{yz}$ between Transformer and GAT are calculated in the same way. The final objective function is as follows:

$$L = \mathrm{mean}\left(L_{xy}, L_{xz}, L_{yz}\right) \qquad (24)$$
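A minimal sketch of how such a loss could be assembled is given below. The averaging in `total_loss` follows Equation (24). The symmetric pairwise form, the weight `beta`, the standard-normal prior, and all function names are assumptions borrowed from common VAE practice, not details confirmed by this paper.

```python
import math

def squared_error(u, v):
    """Squared Euclidean distance between a reconstruction and its target."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(2.0 * ls) - 1.0 - 2.0 * ls
                     for m, ls in zip(mu, log_sigma))

def pair_loss(recon_b_from_a, h_b, recon_a_from_b, h_a, kl_a, kl_b, beta=1.0):
    """Hypothetical symmetric loss between two views a and b."""
    recon = 0.5 * (squared_error(recon_b_from_a, h_b)
                   + squared_error(recon_a_from_b, h_a))
    return recon + 0.5 * beta * (kl_a + kl_b)

def total_loss(l_xy, l_xz, l_yz):
    """Equation (24): the mean of the three pairwise losses."""
    return (l_xy + l_xz + l_yz) / 3.0

# Toy values (hypothetical): BiLSTM feature h_x and Transformer feature h_y.
h_x, h_y = [1.0, 0.0], [0.0, 1.0]
kl_x = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_y = kl_to_standard_normal([0.5, 0.0], [0.0, 0.0])
l_xy = pair_loss([0.0, 0.5], h_y, [1.0, 0.0], h_x, kl_x, kl_y)
```

In a real training loop the targets would be excluded from gradient computation and the three pairwise losses would be computed per batch before averaging.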

Downstream Task with Hierarchical Elem-Feature Fusion
The downstream task prediction stage includes three parts: model fine-tuning, hierarchical feature fusion, and molecular property prediction, as shown in Figure 9a. First, the model reloads the pre-training weights and fine-tunes them on the input labelled data. In the fine-tuning stage, the three models (BiLSTM, Transformer, and GAT) output their features, and hierarchical feature fusion is performed in two steps: sequence feature fusion, followed by sequence-graph feature fusion. The final output is used for molecular property prediction.
Since three models are used to obtain three molecular features, simply concatenating them cannot fully exploit the information hidden in the features or identify the important part of each feature. Inspired by Hua et al. [50], we combined the sequence and structural features and balanced the weights of the different features. Two identical feature fusion modules were adopted in the TGSS model, as shown in Figure 9b.
First, the sequence features of molecules are fused by combining the BiLSTM feature and the Transformer feature. The weights of the two features are obtained through the attention module shown in Equation (25):

$$W_{attn} = \sigma\left(\mathrm{CNN}_{2D}\left(\mathrm{Concat}(h_x, h_y)\right)\right) \qquad (25)$$

where $h_x \in \mathbb{R}^{N \times d}$ and $h_y \in \mathbb{R}^{N \times d}$ denote the BiLSTM feature and the Transformer feature, respectively. Concat is the concatenation operation, and features are extracted from the concatenated feature map by the 2D convolution operation $\mathrm{CNN}_{2D}$. $\sigma$ is a sigmoid function that normalizes the convolved features to obtain the attention weight matrix $W_{attn}$ of $h_x$. Correspondingly, $(1 - W_{attn})$ is defined as the attention weight matrix of $h_y$.
After a residual connection is added to each of the two features, the combined feature $h_f$ is obtained by Equation (26):

$$h_f = \mathrm{FC}(h_x) \ast W_{attn} + h_x + \mathrm{FC}(h_y) \ast (1 - W_{attn}) + h_y \qquad (26)$$

where FC denotes a Linear layer followed by a ReLU layer, and $\ast$ denotes the element-wise product. Following the same method, the fused sequence feature $h_f$ is then fused with the graph feature $h_z$ to obtain the final feature, which is also denoted $h_f$. After hierarchical fusion, the output $h_f$ is robust and fully combines the sequence and graph information of molecules, which can be used for more effective molecular property prediction.
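The gated fusion of Equation (26) and its hierarchical use can be sketched as follows. For brevity, the features are flat vectors, and the output of the convolution over the concatenated features is passed in directly as `gate_logits`. The FC layer is reduced to a bare ReLU. All of these are simplifying assumptions, and the names and values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def relu(h):
    """Stand-in for FC (Linear + ReLU); the linear part is omitted here."""
    return [max(0.0, v) for v in h]

def fuse(h_a, h_b, gate_logits, fc_a=relu, fc_b=relu):
    """h_f = FC(h_a) * W + h_a + FC(h_b) * (1 - W) + h_b, element-wise."""
    w_attn = [sigmoid(g) for g in gate_logits]  # weights for h_a; h_b gets 1 - W
    f_a, f_b = fc_a(h_a), fc_b(h_b)
    return [fa * w + a + fb * (1.0 - w) + b
            for fa, fb, a, b, w in zip(f_a, f_b, h_a, h_b, w_attn)]

h_x = [1.0, 2.0]   # BiLSTM feature (toy values)
h_y = [3.0, 4.0]   # Transformer feature (toy values)
h_z = [0.5, -1.0]  # GAT feature (toy values)

# Level 1: sequence feature fusion; level 2: sequence-graph feature fusion.
h_seq = fuse(h_x, h_y, gate_logits=[0.0, 0.0])
h_final = fuse(h_seq, h_z, gate_logits=[2.0, -2.0])
```

With zero gate logits both inputs receive a weight of 0.5, so the first fusion gives equal importance to the two sequence features. Non-zero logits shift the balance per element, which is what lets the module emphasize the more informative feature.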

Conclusions
Molecular property prediction is an important task in molecular design. Deep learning methods can extract molecular features effectively, thereby reducing the time and costs required. We used three models to extract features from different dimensions so that as much molecular information as possible is retained. In pre-training, a generative self-supervised strategy was adopted. Specifically, a VAE was used to calculate the molecular reconstruction loss, and the model was optimized based on this loss. Our results indicate that generative self-supervised learning benefits both molecular sequence representation and graph representation.
Our current research focused only on the 1D and 2D information of molecules. In the future, 3D information could be introduced as an additional source of useful information. Additionally, we would like to use larger pre-training datasets to improve the comprehensiveness of the model. We hope that this work contributes to molecular property prediction.

Figure 2. Performance comparison of different sizes of pre-training dataset. (a) Regression tasks. (b) Classification tasks.

2.3.3. Performance Comparison of Different Combinations of Models in Downstream Tasks
In the downstream task, we used three models to predict molecular properties. Among them, BiLSTM and Transformer extracted molecular sequence features, and GAT extracted 2D molecular graph features. In this section, we investigate the contribution of each single model in a downstream task. For a fair comparison, all three trained models were taken from the TGSS pre-training step. Instead of combining all three, each single model, BiLSTM (B), Transformer (T), and GAT (G), and the fusion of any two models (B + G, B + T, T + G) were used for comparison, and the results are shown in Figure 3.

Figure 3. Performance comparison of different combinations of models in downstream tasks. (a) Regression tasks. (b) Classification tasks.

Figure 4. Performance comparison of different feature fusion methods. (a) ESOL. (b) Lipophilicity. (c) BACE. (d) BBBP.

2.4. Feature Visualization
TGSS has shown good results on various datasets, but there are still interpretability problems within the model. Due to the black-box nature of deep learning models, the learned content (weights, features) cannot be effectively mapped to chemistry, biology, or other knowledge domains. Therefore, visualizing what the model has learned can help measure the effectiveness of the model and improve its interpretability.

Figure 5. Atomic similarity heat map. (a) Example in the Lipophilicity dataset. (b) Example in the ESOL dataset.

Figure 6. Overall framework. (a) TGSS framework. (b) Description of the generative self-supervised strategy in pre-training, in which the model is updated according to the reconstruction loss. (c) Downstream task prediction.
where $W_f$, $W_i$, $W_c$, $W_o$ are weight matrices, and $b_f$, $b_i$, $b_C$, $b_o$ are biases.

Figure 8. The process of molecular reconstruction loss calculation. (a) Calculation process of the VAE. (b) Calculation process of the molecular representation reconstruction loss. The loss between two molecular features is calculated by the VAE. The three molecular features are first obtained from the corresponding latent vectors $z_x$, $z_y$, and $z_z$ through the reparameterization of the VAE. The reconstruction losses $L_{xy}$, $L_{xz}$, and $L_{yz}$ with respect to the other two molecular features are then obtained, respectively.

Figure 9. Downstream task with hierarchical elem-feature fusion. (a) The downstream task prediction process. (b) The molecular feature fusion process.

Table 1. The details of the MoleculeNet datasets.

Table 2. The ROC-AUC values of various approaches in classification tasks. Higher values mean better results.
Note: The best results are shown in bold. Standard deviations are in brackets.

Table 3. The RMSE values of various approaches in regression tasks. Lower values mean better results.
Note: The best results are shown in bold. Standard deviations are in brackets.