Absorption, Distribution, Metabolism, Excretion, and Toxicity Property Prediction Utilizing a Pre-Trained Natural Language Processing Model and Its Applications in Early-Stage Drug Development

Machine learning techniques are extensively employed in drug discovery, with a significant focus on developing QSAR models that interpret the structural information of potential drugs. In this study, the pre-trained natural language processing (NLP) model ChemBERTa was utilized in the drug discovery process. We proposed and evaluated four core model architectures: deep neural network (DNN), encoder, concatenation (concat), and pipe. The DNN model processes physicochemical properties as input, while the encoder model leverages the simplified molecular input line entry system (SMILES) along with NLP techniques. The latter two models, concat and pipe, incorporate both SMILES and physicochemical properties, operating in parallel and sequential manners, respectively. We collected 5238 entries from DrugBank, including their physicochemical properties and absorption, distribution, metabolism, excretion, and toxicity (ADMET) features. The models' performance was assessed by the area under the receiver operating characteristic curve (AUROC), with the DNN, encoder, concat, and pipe models achieving 62.4%, 76.0%, 74.9%, and 68.2%, respectively. In a separate test with 84 experimental microsomal stability datasets, the AUROC scores for external data were 78% for DNN, 44% for encoder, and 50% for concat, indicating that the DNN model had superior predictive capabilities for new data. This suggests that models based on structural information may require further optimization or alternative tokenization strategies. The application of NLP techniques to pharmaceutical challenges has demonstrated promising results, highlighting the need for more extensive data to enhance model generalization.


Introduction
Over the past few decades, the landscape of drug discovery has been significantly transformed by the integration of in silico methodologies, witnessing a substantial surge in efficiency and effectiveness. This revolution in computational approaches has been instrumental in streamlining the drug screening process, thereby offering the pharmaceutical industry considerable savings in terms of both costs and time. Among the various strategies employed, Quantitative Structure-Activity Relationship (QSAR) models have emerged as a cornerstone for predicting the chemical properties of compounds. The foundational premise of QSAR models is the assumption that compounds with analogous structures are likely to exhibit similar activities, thereby enabling the prediction of chemical activity through structural analysis.
Traditionally, QSAR models have relied on machine learning (ML) techniques, including but not limited to support vector machines, decision trees, naive Bayes, and k-nearest neighbors [1,2]. These methods typically dissect the structure of molecules into predefined molecular fragments or employ theoretical molecular descriptors, often determined through human judgment on a training dataset. Such an approach, while functional, has its limitations, particularly in terms of predictability on novel datasets.
However, the advent of deep learning is reshaping this landscape by addressing the shortcomings of conventional QSAR methodologies. Deep learning algorithms have the capacity to algorithmically define the criteria for analysis, thus bypassing the constraints imposed by human-set parameters. This advancement not only enhances the predictive accuracy of these models but also broadens their application. Furthermore, a significant limitation of traditional QSAR models has been their reliance solely on compounds with available ADMET experimental results for model construction. Considering the vast number of synthesized compounds, the subset with ADMET data is relatively small, posing a considerable challenge to the generalization of ADMET prediction models.
An online competition held in 2012 revealed the potential of deep learning algorithms to address pharmaceutical problems, prompting a shift toward deep-learning techniques. Although deep learning has shown promising results that can replace traditional methods, some problems remain [3]. Deep learning models tend to improve their performance by memorizing the inputs, which can increase their dependency on the tested data [4][5][6]. This tendency is even more pronounced in pharmaceutical fields, for example, in the relationship between a molecular structure and its properties. Because molecular data come in various forms depending on their specific domain, many efforts to generate compatible data and to link various domains are underway. Owing to the shared understanding of computer-aided drug design (CADD) in pharmaceutical fields, various trials to predict pharmacologic features and endpoints in drug development are being made with machine learning [7][8][9]. In terms of pharmacology, the features concerning absorption, distribution, metabolism, excretion, and toxicity (ADMET) are of significant interest in typical drug development and can be used for weighing the systemic exposure and potential side effects of a candidate drug. Since this systemic exposure is affected by numerous factors and features, reliable ADMET prediction has the utmost priority before candidate drugs are further evaluated in real clinical situations [10].
In techniques for deep learning, graph convolutional neural networks (GCNNs) enable the dynamic learning of chemical structures by considering a space for atoms and adjacent bonds [11]. After the advantages of GCNNs were demonstrated, new featurization approaches based on multitasking or sequential learning were implemented using GCNNs, leading to further performance improvements [12]. However, despite these improvements, GCNNs have difficulties with unlabeled structures because they require many feature parameters. In a recent study, contrastive learning in GCNNs was introduced to resolve this problem [13], showing remarkable performance improvements across tasks. Just as a certain level of progress in performance was achieved with graph neural networks, the same was demonstrated in natural language processing. Transformer-based learning is actively pursued in this field; recent applications of natural language techniques to drug development tasks have succeeded in improving benchmark results. Within natural language processing (NLP), bidirectional encoder representations from transformers (BERT) have significantly improved NLP over the past four years via transformer pre-training and task-specific model fine-tuning [14]. Because BERT is generally used in conjunction with masked language modeling (MLM), it is also expected to be able to handle the atom-masking problems seen in GCNNs.
In addition, BERT is capable of handling large amounts of data because it was originally designed to deal with large volumes of text. In 2020, Chithrananda et al. introduced ChemBERTa, which was pre-trained on 77 million simplified molecular input line entry system (SMILES) strings from PubChem and was designed to perform large-scale self-supervised pre-training for molecular property predictions. ChemBERTa is expected to provide promising performance for representation learning and molecular property prediction as a pre-trained model [15].
In addition to BERT, models that employ transformers and have shown effectiveness in masked modeling, such as BART (Bidirectional Auto-Regressive Transformers) [16] and ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [17], could serve as promising pre-trained models in drug discovery. Performance metrics in these studies have exceeded those of traditional approaches in many tasks, as has been previously demonstrated in language tasks [18].
Both GCNN and NLP models are continuously evolving, complementing each other's weaknesses, yet the strengths and weaknesses of the NLP approach have not been sufficiently examined from a pharmaceutical perspective. In this study, (1) the performance of the natural language models ChemBERTa and ELECTRA was assessed on benchmark datasets against other prediction models; (2) large-scale transfer learning with fine-tuning of natural language models on ADMET problems was carried out to test their ability to perform multi-task prediction; and (3) the models were then assessed on an external dataset to investigate the NLP models' generalization to ADMET problems.

MoleculeNet Dataset
In Tables 1 and 2, the results from the MoleculeNet dataset are shown. The mean and standard deviation of the AUROC, or of the RMSE and MAE, on each dataset are reported. In reference to the benchmark result from Wang et al., the performance metrics of supervised learning models or graph models were compared with those of ChemBERTa and ELECTRA. In classification tasks, ChemBERTa recorded around mid-ranks on average and showed superior performance on toxicity problems like Tox21 and ClinTox (ranked 1st and 3rd, respectively). ELECTRA ranked slightly below the ChemBERTa scores in general. Both ChemBERTa and ELECTRA scored almost the lowest in regression tasks. The performance of ELECTRA was lower than that of ChemBERTa in most tasks except ESOL and QM7.

DrugBank Dataset
Based on the AUROC values, the encoder model had the best performance (76.0%), followed by the concat model (74.9%), the pipe model (68.2%), the DNN_A model (63.6%), the DNN model (62.4%), and the pipe_A model (61.2%). The encoder and concat models, which included pre-trained models, generally showed higher predictive power than the others. The pipe model showed comparatively low performance, even though it utilized a pre-trained model. The incorporation of attention slightly increased the performance of the DNN model but decreased the pipe model's performance.
Although the DNN is the simplest model, it uses parameters that are considered important in drug development, and its performance was not significantly inferior to that of the other models. The performance of the simple DNN model slightly improved with the addition of attention. Considering that the encoder and concat models are similar in structure and differ only in their input information, these results show that important structural information can be sufficiently reflected by SMILES in the QSAR workflow.
The pipe model, comprising two steps that predict physicochemical information and ADMET properties, may have exhibited decreased performance due to uncertainties introduced during the learning processes. This issue was likely more pronounced in the pipe model variant that utilized the attention algorithm (pipe_A), adding complexity. A summary of the performance of the output labels is presented in Figures 1 and 2.

External Dataset
The DNN, encoder, and concat models had AUROC values of 0.78, 0.44, and 0.50, respectively. When tested only with CYP 3A4 substrate prediction, the matched label proportions for the test data were 0.631, 0.583, and 0.571. For weighted soft voting, which accounts for the abundance of CYP450 subtype enzymes, the matched label proportions for the test data were 0.619, 0.571, and 0.583, respectively. In all three assessment methods, the DNN scored the best.

Applicability Domain
The models were developed using datasets from PubChem and DrugBank. Initially, the PubChem dataset was employed for the pre-training of the language model through MLM techniques. This step enabled the model to understand the structures of a wide variety of substances. The models were then fine-tuned with the DrugBank dataset, with a focus on the ADMET features of substances classified as drugs. The scope of chemical structures targeted by these models was those cataloged in PubChem. To assess the model's applicability and its limitations within this domain, the external dataset, comprising rates of CYP450 enzyme reactions for toxic substances not listed in DrugBank, was employed for validation purposes. The validation showed that the DNN model, which is close to traditional QSAR models, had superior performance, but within the DrugBank dataset, the other models performed better. This suggests that these models are more adept at predicting the ADMET features of therapeutic drugs rather than toxic substances.


Discussion
In the MoleculeNet benchmark dataset, the pre-trained NLP models generally exhibited good performance in classification tasks. However, in most regression tasks, the NLP models demonstrated poor performance, with other models surpassing the NLP model metrics, especially in tasks predicting physicochemical properties. It is believed that the regression tasks require more detailed information on atomic spacing, which the NLP models used in this study cannot fully consider. On the other hand, the classification tasks resulted in better outcomes with simpler model implementations. In the Tox21 dataset, the NLP models achieved a better AUROC (82.3% and 80%, respectively) compared to the latest GNN techniques. The lower performance of ELECTRA in this study, compared to BERT in previous studies, could be attributed to its pre-training on a smaller set of molecules. It is anticipated that ELECTRA's performance will improve with further pre-training using SMILES information. Moreover, as GNN techniques have enhanced their performance by addressing atom-masking tasks, these NLP models could also see improved performance by developing an approach that considers the precise functional space or by introducing another tokenization method to generate minimal units of functional atom groups.
The DNN model resembles the traditional QSAR model. Similar to its predecessor, it trains exclusively on datasets containing results from ADMET experiments without employing a pre-training approach. In contrast, the other models (encoder, concat, and pipe) utilize a fine-tuning strategy with pre-trained NLP models. Except for pipe_A, these models demonstrated superior performance compared to the DNN. This outcome validates the efficacy of the NLP MLM training technique in capturing the structural nuances of chemical compounds.
In an effort to enhance model accuracy, we explored the concat and pipe models, which integrate SMILES notation alongside the physicochemical properties of compounds. However, the encoder model, relying solely on SMILES notation, emerged as the most effective. This can be attributed to the fact that pharmaceutical development typically focuses on compounds adhering to specific physicochemical criteria, such as Lipinski's rule of five and the Ghose filter. The limited variance in physicochemical properties within the training dataset presumably had minimal impact on the model performance.
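As a concrete illustration of the drug-likeness criteria discussed here, a minimal check of Lipinski's rule of five can be sketched as follows. The four thresholds are the standard published ones; the function name and descriptor dictionary are our own illustration, and in practice the descriptors would be computed by a cheminformatics toolkit such as RDKit.

```python
def passes_lipinski(mol):
    """Return True if a molecule satisfies Lipinski's rule of five.

    `mol` is a plain dict of precomputed descriptors (illustrative keys).
    """
    return (
        mol["mol_weight"] <= 500      # molecular weight <= 500 Da
        and mol["logp"] <= 5          # octanol-water partition coefficient <= 5
        and mol["h_donors"] <= 5      # hydrogen bond donors <= 5
        and mol["h_acceptors"] <= 10  # hydrogen bond acceptors <= 10
    )

# Aspirin-like descriptor values (approximate, for illustration only)
aspirin = {"mol_weight": 180.2, "logp": 1.2, "h_donors": 1, "h_acceptors": 4}
print(passes_lipinski(aspirin))  # -> True
```

A compound violating any one of the four thresholds would fail the check; some formulations of the rule tolerate a single violation, which this sketch does not model.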
When assessed using the external dataset, the DNN model's performance surpassed its counterparts (the concat and encoder models). The external dataset comprised toxic compounds, which often deviate from conventional guidelines like Lipinski's rule of five. This deviation suggests that the range of physicochemical properties in the training dataset is considerably narrower than that in the external dataset. This finding underscores the significance of employing diverse and unbiased datasets to bolster the model's generalization capability.

Data Collection and Preprocessing
Three datasets were collected to evaluate learning under different conditions. To derive quantitative benchmark results in comparison to other machine learning techniques, the dataset from MoleculeNet was used [18]. The data from DrugBank [38] were used to assess the model's position within the drug development workflow. From the labels used in learning on the DrugBank dataset, we created an external dataset (unseen in the learning process) to evaluate the model trained on the DrugBank data.

DrugBank Dataset
Datasets for training, testing, and validation were collected from DrugBank. We obtained 13,856 raw JSON files. Each file contained drug information such as the name, description, attribute values, related molecules, and applications. In total, 5238 raw files contained SMILES and ADMET data, and 18 features extracted from the "Experimental Properties" and "Predicted Properties" tabs were used for model training: SMILES, LogP, LogS, pKa, water solubility, physiological charge, hydrogen acceptor count, hydrogen donor count, polar surface area, rotatable bond count, molar refractivity, polarizability, the number of rings, bioavailability, and the drug-likeness filters of Lipinski's rule of five [40], the Ghose filter [41], Veber's rule [42], and the MDDR-like rule [43] (Table A4). The filter properties (bioavailability, Lipinski's rule of five, the Ghose filter, Veber's rule, and the MDDR-like rule) are Boolean-type data indicating 'true or false', whereas all other properties except SMILES are numeric data types. Among the extracted data, the four filter values of Veber's rule, the MDDR-like rule, Lipinski's rule of five, and the Ghose filter were excluded since those values could not be determined from experiments, and possible overfitting was observed in the pre-test. The models were used to predict 21 ADMET features extracted from the "Predicted ADMET Features" table, and these features are described in Table A4. To avoid semantic redundancy in the outputs, the labels of human intestinal absorption and Caco-2 permeability were combined into human intestinal absorption. The two p-glycoprotein inhibitor (I and II) descriptors were combined into one p-glycoprotein inhibitor label, and the same was performed for the two hERG inhibition descriptors. If one of the 18 features was missing from a raw file, the corresponding chemical was excluded from the analysis.
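The label merging and completeness filtering described above might be sketched as follows. All key names are our own shorthand rather than DrugBank's actual JSON fields, and the logical-OR rule used to combine redundant labels is an assumption (the paper does not state the exact combination rule).

```python
REQUIRED = ["smiles", "logp", "pka"]  # stand-ins for the paper's 18 input features

def merge_redundant_labels(labels):
    """Fold semantically redundant ADMET labels into single outputs.

    Mirrors the merges described in the text: HIA with Caco-2
    permeability, the paired p-glycoprotein inhibitor labels, and the
    paired hERG inhibition labels. Key names are illustrative.
    """
    out = dict(labels)
    for merged, parts in [
        ("human_intestinal_absorption",
         ["human_intestinal_absorption", "caco2_permeability"]),
        ("pgp_inhibitor", ["pgp_inhibitor_1", "pgp_inhibitor_2"]),
        ("herg_inhibition", ["herg_inhibition_1", "herg_inhibition_2"]),
    ]:
        present = [out.pop(p) for p in parts if p in out]
        if present:
            out[merged] = any(present)  # logical-OR combination (assumption)
    return out

def keep_complete(records):
    """Exclude any chemical missing one of the required input features."""
    return [r for r in records if all(r.get(k) is not None for k in REQUIRED)]

records = [
    {"smiles": "CCO", "logp": -0.1, "pka": 15.9},
    {"smiles": "CCN", "logp": 0.1, "pka": None},  # incomplete -> excluded
]
print(len(keep_complete(records)))  # -> 1
```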

External Dataset
Additional model testing was performed using 84 compounds collected externally via CYP assays to evaluate the ability to perform CYP substrate prediction. The external dataset included the chemical structure, formula, and CYP metabolism in human liver microsomes. The chemical structures were encoded in the SMILES format, and the classification of a compound as a CYP substrate was determined by the percentage of the chemical that remained following a specified reaction duration.
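The percent-remaining labeling rule can be sketched as a one-line threshold check. The 50% cutoff below is an illustrative assumption; the paper does not state its exact threshold.

```python
def classify_cyp_substrate(percent_remaining, cutoff=50.0):
    """Binary CYP substrate label from a microsomal stability assay.

    A compound is treated as a substrate when less than `cutoff` percent
    remains after the reaction period, i.e. when it was substantially
    metabolized. The 50% cutoff is an assumption for illustration.
    """
    return percent_remaining < cutoff

labels = [classify_cyp_substrate(p) for p in (12.0, 48.0, 75.0)]
print(labels)  # -> [True, True, False]
```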

Deep Learning Models
For the MoleculeNet dataset, both ChemBERTa and ELECTRA were utilized in the benchmarks. For tokenization, a byte pair encoding (BPE) [44]-based SMILES tokenizer and WordPiece were used, respectively [15,45]. BPE is a sub-word-level tokenization technique that builds a vocabulary from the most frequent symbol pairs in a text corpus. Given the unlimited number of letter combinations, even unknown words can be processed by decomposing them into multi-letter combinations; thus, even SMILES can be expressed as a set of sub-SMILES. The WordPiece tokenizer is a variant of BPE: it merges the pair that most increases the 'likelihood' of the corpus when merged, as opposed to BPE, which merges the pair that appears most often based on 'frequency'. The ELECTRA model emphasizes the efficiency of learning as well as the accuracy of the model. ELECTRA includes a new pre-training task called replaced token detection (RTD) to improve learning efficiency, through which ELECTRA learns faster and more effectively. For this study, we used ELECTRA-small and randomly extracted 10 M molecules for pre-training from the PubChem 109 M dataset, trained for 10 epochs. For the DrugBank dataset, the NLP model with the better benchmark performance was selected and used in learning. Six models with different structures were tested in this study (Figure 3); they were all trained using a cross-entropy loss function.
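The frequency-based merging that BPE performs can be demonstrated on toy SMILES strings. This is a didactic reimplementation of the merge-learning step, not the actual tokenizer used by ChemBERTa:

```python
from collections import Counter

def bpe_merges(corpus, n_merges):
    """Learn byte-pair merges over a list of SMILES strings.

    Starts from single characters and repeatedly merges the most
    frequent adjacent pair, as in BPE sub-word tokenization.
    Returns the learned merges and the tokenized corpus.
    """
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merges.append(a + b)
        new_seqs = []
        for seq in seqs:  # apply the merge everywhere it occurs
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_seqs.append(merged)
        seqs = new_seqs
    return merges, seqs

corpus = ["CCO", "CCN", "CC(=O)O"]  # ethanol, ethylamine, acetic acid
merges, tokenized = bpe_merges(corpus, 2)
print(merges[0])  # -> CC  (the most frequent pair across the corpus)
```

Because "CC" occurs in all three strings, it becomes the first learned sub-SMILES token; unseen molecules can then be decomposed into such fragments plus single characters.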

1. A deep neural network (DNN) model consists of fully connected and embedding layers. The model uses 18 physicochemical values as input properties. One Boolean feature was transformed into a 10-dimensional vector via an embedding layer. Five integer and seven float features were each transformed into a 10-dimensional vector via fully connected layers. The vectors were concatenated into a 30-dimensional vector, which passed through fully connected layers to return a vector with 21 dimensions (the predicted ADMET features).
2. An encoder model includes a pre-trained ChemBERTa model. This model treats SMILES data as "natural sentences" and learns via MLM, as in RoBERTa. The SMILES data used for the pre-trained model are in the form of a 768-dimensional hidden vector, which is transformed into an 18-dimensional vector via fully connected layers.
3. A concat model combines the DNN and encoder models described above. The 30-dimensional vector from the DNN model and the 768-dimensional hidden vector from the encoder model are concatenated and passed to the hidden layer of the concat model. This 798-dimensional hidden vector is then transformed into 21 dimensions.
4. A pipe model, which subsumes a pre-trained ChemBERTa model, uses a 768-dimensional hidden vector based on SMILES data to predict 21 physicochemical properties. Those physicochemical properties are then used as input for a DNN model to predict the ADMET features.
5. DNN_A is a modified version of the DNN model (where A stands for attention). We incorporated dot-product self-attention into the model, which uses hidden vectors from the DNN as the query, key, and value. By implementing dot-product self-attention, it was possible to identify which input most affected the ADMET feature predictions.
6. Pipe_A is a modified version of the pipe model into which dot-product self-attention was likewise incorporated.

Settings
The data were divided into training, validation, and test sets (ratio of 7:1.5:1.5). All numeric data were normalized. The batch size was fixed at 32, with a learning rate of 5 × 10⁻⁵. Early stopping was used during training; this automatically terminated training if the validation loss did not drop for five epochs. A maximum of 30 epochs was allowed before early stopping. For more accurate performance measurements, each model was constructed five times with identical parameters, using different random seeds for generating random numbers during model initialization. AdamW was used as the optimizer. For the classification and regression problems, binary cross-entropy and mean squared error (MSE), respectively, were used as loss functions. No activation function was used except in the attention models, which used the tanh activation function. Model training was performed using a computer with an Nvidia A100 GPU, an AMD EPYC ROME 7742 CPU, and 1 TB of RAM.
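The early-stopping schedule described above can be sketched in isolation; the list of validation losses below stands in for the actual training loop, and the function name is our own.

```python
def train_with_early_stopping(val_losses, patience=5, max_epochs=30):
    """Return the epoch at which training stops.

    Mirrors the schedule in the text: stop when the validation loss has
    not improved for `patience` consecutive epochs, capped at `max_epochs`.
    """
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0   # new best -> reset patience counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch             # early stop triggered
    return min(len(val_losses), max_epochs)

# Loss improves until epoch 3, then stagnates; training halts at epoch 8.
losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]
print(train_with_early_stopping(losses))  # -> 8
```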

Evaluation
In the benchmark datasets, model evaluation was performed with the area under the receiver operating characteristic curve (AUROC) for classification tasks. For regression tasks, FreeSolv, ESOL, and Lipo used the root mean square error (RMSE), while QM7, QM8, and QM9 were measured with the mean absolute error (MAE), in accordance with MoleculeNet's recommendation. Performance evaluation on the DrugBank dataset was based on accuracy, the AUROC, F1, precision, and recall. Accuracy refers to the proportion of data correctly predicted by the model. Accuracy is an intuitive metric, but the data should be balanced to evaluate accuracy appropriately. For example, if a test dataset consists of 100 values, of which 99 are true and 1 is false, the accuracy would be 99% if the model unconditionally predicted 100 true values, which implies that the metric is biased. When the data are unbalanced, it is difficult to obtain reliable results. Precision, recall, and F1 should be included as performance metrics to overcome potential bias in the accuracy evaluation. Precision refers to the proportion of actual true values relative to all predicted true values. Precision does not take predicted false values into account; only true values are considered, which gives rise to bias. Therefore, precision alone is not a reliable metric. Recall refers to the proportion of true values correctly predicted by the model. This metric also does not take predicted false values into account. The problem with recall is that if all values are predicted to be true, performance is considered perfect. Recall and precision are related; if precision increases, recall tends to decrease.
Precision and recall are good metrics when several conditions are satisfied, but due to their bias, F1 (the harmonic mean of precision and recall) is the most used metric. Precision and recall are complementary, but both values must be high for F1 to be high; thus, F1 represents a compromise that mitigates the problems of precision and recall. The receiver operating characteristic (ROC) curve can show the predictive performance of a model at different thresholds. ROC curves plot recall against specificity, which is also complementary. Specificity is defined as the proportion of false values predicted correctly by the model. The AUROC is a commonly used metric that increases when a model predicts both true and false values accurately; therefore, it evaluates performance well. The prediction results of the six models used in this experiment were voted on: when the result was a tie (3:3), the highest average prediction probability of each model was taken as the final value.
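The metrics discussed above can be computed directly. The snippet below reproduces the 99-true/1-false imbalance example from the text and sketches the AUROC as a rank statistic; the function names are our own.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney statistic: the probability that a
    random positive is scored higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The imbalance example from the text: 99 true values, 1 false value,
# and a model that unconditionally predicts true.
y = [1] * 99 + [0]
pred = [1] * 100
acc, prec, rec, f1 = binary_metrics(y, pred)
print(round(acc, 2), rec)  # -> 0.99 1.0 (high accuracy, useless classifier)
```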
When evaluating the external data, the CYP450 substrate prediction performance was assessed with the DNN, encoder, and concat models. Since the external label did not directly match what the models predicted, performance evaluation was conducted with the following three methods of transformation. (1) The predicted CYP450 subtype substrate values for each model were concatenated into one vector and then synthesized into a single CYP substrate feature (logical, true or false) with deep neural networks. (2) The CYP 3A4 (the major enzyme in CYP metabolism) substrate value was taken directly as the CYP substrate. (3) The prediction was compared to the results of weighted soft voting based on CYP450 abundance. Abundance was set to 12% for the CYP450 subtype 2C9, 4% for subtype 2D6, and 30% for subtype 3A4. For method (1), the performance was measured by the AUROC; for the remaining methods, performance was measured as the matched proportion with the test data.
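Method (3) can be sketched as follows. The abundance weights are those quoted above; the renormalization over the three subtypes and the 0.5 decision threshold are our assumptions, not details stated in the text.

```python
# Abundance weights from the text: 12% CYP2C9, 4% CYP2D6, 30% CYP3A4
ABUNDANCE = {"CYP2C9": 0.12, "CYP2D6": 0.04, "CYP3A4": 0.30}

def weighted_soft_vote(probs, threshold=0.5):
    """Weighted soft voting over CYP450 subtype substrate probabilities.

    `probs` maps subtype -> predicted substrate probability. The score is
    an abundance-weighted average (renormalized so weights sum to 1);
    the 0.5 threshold is an illustrative assumption.
    """
    total = sum(ABUNDANCE.values())
    score = sum(ABUNDANCE[k] * probs[k] for k in ABUNDANCE) / total
    return score >= threshold, score

is_substrate, score = weighted_soft_vote(
    {"CYP2C9": 0.2, "CYP2D6": 0.9, "CYP3A4": 0.8}
)
print(is_substrate)  # -> True (CYP3A4, the heaviest weight, dominates)
```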

Conclusions
In traditional ADMET prediction models, the scope was narrowly confined to compounds with pre-existing experimental ADMET data. This limitation significantly curtailed the models' generalizability, as the dataset of compounds with known ADMET outcomes was substantially smaller than the entire pool of synthesized compounds. Additionally, these conventional models often introduced bias by incorporating human-defined molecular fragments or theoretical molecular descriptors. In contrast, NLP models have adopted the strategy of unsupervised pre-training on extensive datasets, a technique proven to bolster model performance while also reducing the potential for human-induced biases. However, when evaluated against the external dataset, simpler models, such as the DNN model, outperformed more complex ones. This discrepancy revealed a decline in performance when dealing with heterogeneous datasets, suggesting that generalization capabilities might be compromised by dataset bias. Enhancing our dataset with a more diverse array of data points could, therefore, further refine the accuracy of deep learning models. To improve model robustness and lessen reliance on large datasets, we advocate for methodological advancements, including data augmentation, few-shot learning, and the adoption of sophisticated pre-trained models proficient in interpreting the SMILES notation.

Figure 1.
Figure 1. Performance metrics of total features (Table A1) for suggested models.

Figure 2.
Figure 2. Box plots of the distribution of performance metrics for each feature (Table A4) of suggested models.

Figure 3.
Figure 3. Schematic flow of prepared core models, (A): DNN, (B): encoder, (C): concat, and (D): pipe. Single SMILES and Boolean features, five integers, and seven float values are concatenated and processed in the model before the values are transformed into 21 output values. I: input, h: hidden layer, EMB: embedding layer, LINEAR: fully connected layer.

Table 2.
Mean and standard deviation (in parentheses) of the RMSE and MAE measures. RMSE for the FreeSolv, ESOL, and Lipo datasets; MAE for QM7, QM8, and QM9. Supervised learning models: first seven rows. Self-supervised/pre-training methods: rows eight to thirteen. Tested models (ChemBERTa and ELECTRA): rows twelve and thirteen. (RF: random forest. SVM: support vector machine. #: number of.)