Multimodal Contrast-Enhanced Molecular Representation Learning and Property Prediction

Luo, Hong; He, Jie; Liu, Zhichao; Zeng, Chen

doi:10.3390/biophysica6020024

by Hong Luo ¹, Jie He ¹, Zhichao Liu ¹,* and Chen Zeng ²,*

¹ Chongqing Engineering Research Center of Medical Electronics and Information Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

² Department of Physics, The George Washington University, Washington, DC 20052, USA

* Authors to whom correspondence should be addressed.

Biophysica 2026, 6(2), 24; https://doi.org/10.3390/biophysica6020024

Submission received: 24 February 2026 / Revised: 23 March 2026 / Accepted: 25 March 2026 / Published: 27 March 2026

(This article belongs to the Special Issue Latest Advances in Molecular Docking Involved in Biophysics)

Abstract

Molecular representation learning (MRL) has garnered significant attention due to its pivotal role in downstream applications such as molecular property prediction and drug discovery. In most MRL approaches, molecules are encoded as 2D topological graphs by graph neural networks (GNNs), which suffer from over-smoothing and limited receptive fields. Furthermore, most GNN models fail to utilize the 3D spatial structural information that determines molecular physicochemical properties and biological activity. To this end, we propose multimodal contrast-enhanced molecular representation learning (MCMRL). This approach uses both the 2D topological information and the 3D structural information of molecules for contrastive learning to enhance molecular graph representations. It further integrates molecular fingerprint information through feature fusion to incorporate multimodal knowledge, yielding more reliable and generalizable molecular representations. MCMRL is pre-trained on ~10 million unlabeled molecules from PubChem and then fine-tuned on various downstream benchmark tasks. Experimental results show that MCMRL achieves superior performance on 9 out of 13 molecular property prediction benchmarks, supporting its effectiveness in molecular representation learning. Furthermore, candidate drug molecules for the biological target protein DRD2, screened using the MCMRL representation, show promising affinity scores, which further demonstrates the efficacy of the proposed method.

Keywords:

molecular representation learning; molecular property prediction; multimodal; self-supervised learning; contrastive learning

1. Introduction

Molecular property prediction is widely accepted as a fundamental step in computational drug and material discovery, with numerous methods relying on accurate molecular properties for molecular evaluation, screening, and generation [1,2]. For this purpose, reliable and generalizable molecular representation learning is an essential task and has gained increasing attention. Many conventional molecular representations, such as SMILES [3], ECFP [4] and RDKFP [5], have been developed and widely adopted. However, given the enormous size of the chemical space of potential pharmacologically active molecules (~10⁶⁰) [6], universal molecular representation learning methods are urgently needed and still face fundamental challenges [7].

In recent years, graph neural networks (GNNs) have emerged as the mainstream deep neural network (DNN) approach to molecular representation learning (MRL) [8]. By mapping molecular structures onto topological graphs (where atoms serve as nodes and chemical bonds as edges), GNNs capture topological information to predict molecular properties [9,10]. However, their inherent limitations have become a persistent bottleneck for performance improvement: GNNs suffer from severe over-smoothing and restricted receptive fields during graph convolution and message passing, resulting in inadequate representation of complex molecular features [11,12,13]. Attentive FP propagates node information by extending graph attention mechanisms to capture local atomic environments, even identifying latent intramolecular critical bonds. Yet it remains confined to modeling topological structures while neglecting non-local spatial correlations [14]. D-MPNN employs a specialized message-passing architecture to aggregate molecular graph structural information, yet it similarly encodes only 2D topological features without considering 3D geometric information [15]. Variants such as MGCN [16] advance explicit modeling of intramolecular quantum interactions yet remain tailored for quantum property prediction. They fail to generalize to general physicochemical prediction tasks and lack integration of comprehensive 3D structural features, which are crucial for determining molecular physicochemical properties and biological activity [17,18,19].

Self-supervised learning (SSL), which has found widespread application in natural language processing and computer vision, alleviates the scarcity of annotated data in MRL by mining large-scale unlabeled molecular datasets [20,21,22,23]. However, its implementation in MRL remains nascent and lacks chemical depth. Existing SSL approaches suffer from narrow feature modeling and oversimplified pre-training designs, failing to match the semantic complexity inherent in molecular structure–property relationships. Early SSL-based MRL research, such as PretrainGNN [24], used only molecular topological information and completely disregarded 3D geometry; pre-training was conducted exclusively by masking and predicting nodes, edges, or contextual features within 2D topological graphs. GraphMVP [25] attempted to address this by emphasizing alignment and consistency between 2D and 3D views in SSL, yet it did not incorporate molecular functional group information. Another representative SSL approach, FG-BERT [26], employed a functional group masking strategy for pre-training to optimize 2D topological substructure modeling, but it similarly excluded 3D spatial information and therefore could not capture the molecular similarities and differences determined by geometric structure. In summary, existing SSL-based MRL approaches either entirely disregard 3D geometric information or achieve only superficial alignment between 2D and 3D views without deep integration. Their pre-training tasks underestimate the sensitivity of molecular properties to subtle structural alterations: minor structural changes may induce drastic property shifts. Existing methods therefore struggle to extract chemically meaningful semantic representations, let alone achieve accurate and generalizable MRL [27,28,29,30,31,32,33].

The main limitations of existing self-supervised molecular representation learning methods can be summarized as the following two aspects. First, the integration of molecular 2D topological features and 3D spatial geometric features remains limited to superficial alignment, failing to achieve deep cross-modal enhancement and feature interaction [34,35,36,37]. This impedes the extraction of the intricate correlations between molecular structure and properties. Second, they neglect prior chemical knowledge embedded in conventional molecular fingerprints, such as functional groups and substructures. Feature learning from a single modality or partial modalities struggles to cover the complex physicochemical properties of molecules. These two shortcomings result in insufficient generalization capabilities for predicting molecular properties and an inability to effectively learn chemically meaningful molecular representations.

To address these challenges, here we propose multimodal contrast-enhanced molecular representation learning (MCMRL). By conducting contrastive learning between 2D topological information and 3D structural information, it optimizes molecular graph representations. In addition, the incorporation of additional molecular fingerprint information, coupled with feature fusion techniques, fully integrates multimodal molecular knowledge to construct more robust and generalizable molecular representations. To validate the effectiveness of the proposed MCMRL method, we evaluated it against multiple state-of-the-art (SOTA) baseline models across 13 molecular property prediction benchmark datasets, achieving superior results on nine tasks. Furthermore, we screened potential molecular drugs binding to biological target protein DRD2 using MCMRL representation. The proposed candidate molecules show promising affinity scores.

Our contributions can be summarized as follows:

1. We propose a novel multimodal contrast-enhanced molecular representation learning model that innovatively integrates multimodal information including 2D topology, 3D spatial structure, and functional group fingerprints, overcoming the limitations of existing methods that rely on single-modal or partial-modal feature learning.

2. We construct a cross-attention-based feature fusion module that achieves deep integration of local topological embeddings, global geometric embeddings, and functional group fingerprint information, yielding reliable and generalizable molecular feature representations.

3. Benchmark dataset experiments demonstrate that our approach achieves significant performance improvements in molecular property prediction tasks. A case study on potential drugs for DRD2 further supports the efficacy of the proposed approach.

2. Materials and Methods

2.1. 2D Encoder Module

In our work, a 2D molecular graph is defined as $G = (V, E)$, where $V$ and $E$ are the nodes (atoms) and edges (chemical bonds), respectively. Because the message-passing mechanism of a GNN lets the model learn local topological information, the 2D Encoder Module uses a GNN whose neighborhood aggregation operation updates the node representations iteratively. The aggregation and update rules for a node feature at the $k$th layer of a GNN are given in Equations (1) and (2):

$$a_v^{(k)} = \mathrm{Aggregate}^{(k)}\left(\{h_u^{(k-1)} : u \in N(v)\}\right), \tag{1}$$

$$h_v^{(k)} = \mathrm{Combine}^{(k)}\left(h_v^{(k-1)}, a_v^{(k)}\right), \tag{2}$$

where $h_v^{(k)}$ is the feature of node $v$ at the $k$th layer, $h_v^{(0)}$ is initialized by the node feature $h_v$, and $N(v)$ denotes the set of all neighbors of node $v$. To further extract a graph-level feature $h_G$, a readout operation integrates the features of all nodes in the graph $G$, as given in Equation (3):

$$h_G = \mathrm{READOUT}^{(k)}\left(\{h_v^{(k)} : v \in G\}\right). \tag{3}$$

In our work, we build the GNN encoder on the Graph Isomorphism Network (GIN). GIN aggregates node features by weighted summation and transforms the result with an MLP; both are simple yet generic graph convolutional operations. Additionally, we use the widely adopted mean pooling as the readout.
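The aggregate, combine, and readout steps of Equations (1) to (3) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the `eps` self-weight set to zero, and the use of ReLU as a stand-in for a trained MLP are all assumptions made for the example.

```python
# Toy GIN-style layer and mean-pool readout (illustrative only).

def gin_layer(h, neighbors, mlp, eps=0.0):
    """One message-passing step: sum-aggregate neighbors, then combine.

    h         : dict node -> feature vector (list of floats)
    neighbors : dict node -> list of neighboring nodes
    mlp       : callable applied to the combined vector (stand-in for a trained MLP)
    eps       : GIN self-weight; 0.0 is an assumed value for this sketch
    """
    new_h = {}
    for v, h_v in h.items():
        # Equation (1): aggregate the features of neighbors u in N(v) by summation.
        a_v = [0.0] * len(h_v)
        for u in neighbors[v]:
            a_v = [a + x for a, x in zip(a_v, h[u])]
        # Equation (2): combine the node's own feature with the aggregate.
        combined = [(1.0 + eps) * x + a for x, a in zip(h_v, a_v)]
        new_h[v] = mlp(combined)
    return new_h


def mean_readout(h):
    """Equation (3) with mean pooling: average node features over the graph."""
    n = len(h)
    dim = len(next(iter(h.values())))
    return [sum(vec[d] for vec in h.values()) / n for d in range(dim)]


def relu(vec):
    return [max(0.0, x) for x in vec]


# Three-atom chain 0-1-2 with 2-dimensional toy features.
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h1 = gin_layer(h0, nbrs, mlp=relu)
h_graph = mean_readout(h1)
```

In practice a graph library would batch these operations over tensors; the dictionary form is used here only to keep each equation visible.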

2.2. 3D Encoder Module

In recent years, learning 3D geometric representations has made great progress in molecular modeling. 3D molecular graphs incorporate the spatial positions of atoms, which need not be static, since atoms continuously move along potential energy surfaces in real-world scenarios. The 3D structure at a local minimum on this surface is termed a molecular conformation. To fully leverage molecular geometric information and better characterize global features, this paper defines a 3D molecular graph as $G = (X, R)$, where $X$ represents the nodes (atoms) and $R$ is the 3D-coordinate matrix. Compared to 2D molecular graphs, this approach introduces the additional coordinates $R$ for the nodes and employs a 3D graph neural network to encode the representation embedding of molecular conformations. Specifically, it can be expressed as

$$h_{3D} = \mathrm{GNN}\left(T_{3D}(g_{3D})\right) = \mathrm{GNN}\left(T_{3D}(X, R)\right), \tag{4}$$

where $T_{3D}$ is the 3D transformation. Note that further information such as plane and torsion angles can be derived from the positions. The 3D encoder is SchNet, which is composed of the following key steps:

$$z_i^{(0)} = \mathrm{embedding}(x_i), \tag{5}$$

$$z_i^{(t+1)} = \mathrm{MLP}\left(\sum_{j=1}^{n} f\left(x_j^{(t)}, r_i, r_j\right)\right), \tag{6}$$

$$h_i = \mathrm{MLP}\left(z_i^{(K)}\right), \tag{7}$$

where $K$ is the number of hidden layers, $r_i$ is the 3D position of atom $i$, and

$$f(x_j, r_i, r_j) = x_j \cdot e_k(r_i - r_j) = x_j \cdot \exp\left(-\gamma \left\| \|r_i - r_j\|_2 - \mu \right\|_2^2\right) \tag{8}$$

is the continuous-filter convolution layer, enabling the modeling of continuous positions of atoms. The 3D structure of each molecule is generated with the RDKit toolbox [5]. The conformation is initialized with the Experimental Torsion Knowledge Distance Geometry (ETKDG) algorithm and optimized with the Merck Molecular Force Field (MMFF94) before model pre-training and downstream tasks. For processing speed and model efficiency, each molecule is currently represented by one representative 3D structure.
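To make the continuous filter of Equation (8) concrete, the sketch below evaluates the Gaussian radial response for one pair of atoms. It assumes a small grid of centers $\mu$ and a value of $\gamma$ chosen purely for illustration, and it multiplies the neighbor feature element-wise by the response; SchNet itself passes the expansion through learned dense layers first, which is omitted here.

```python
import math


def distance(r_i, r_j):
    """Euclidean distance between two 3D positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_i, r_j)))


def rbf_filter(r_i, r_j, centers, gamma):
    """e_k(r_i - r_j) = exp(-gamma * (||r_i - r_j|| - mu_k)^2) for each center mu_k."""
    d = distance(r_i, r_j)
    return [math.exp(-gamma * (d - mu) ** 2) for mu in centers]


def continuous_filter_message(x_j, r_i, r_j, centers, gamma):
    """f(x_j, r_i, r_j): scale the neighbor feature by the filter response.

    Assumption: x_j has one entry per center, so the product is element-wise.
    """
    e = rbf_filter(r_i, r_j, centers, gamma)
    return [x * w for x, w in zip(x_j, e)]


# Hypothetical settings: three centers spaced 1.0 apart and gamma = 10.0.
centers = [0.0, 1.0, 2.0]
r_a, r_b = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)  # two atoms at distance 1.0
weights = rbf_filter(r_a, r_b, centers, gamma=10.0)
message = continuous_filter_message([1.0, 2.0, 3.0], r_a, r_b, centers, gamma=10.0)
```

The response peaks at the center that matches the interatomic distance and decays smoothly elsewhere, which is what lets the layer treat atomic positions as continuous rather than binned.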

2.3. Molecular Fingerprints Encoder Module

We use three complementary fingerprints, namely the MACCS, PubChem, and Pharmacophore ErG fingerprints, as detailed below.

MACCS Fingerprint: A substructure-key fingerprint defined by SMARTS patterns. MACCS covers most atomic properties, bond properties, and atomic neighborhoods at different topological separations, which are informative for drug discovery. We chose the short 166-bit variant for this study.

PubChem Fingerprint: An 881-bit substructure-key fingerprint with broad chemical structure coverage.

Pharmacophore ErG Fingerprint: A fingerprint that uses the extended reduced graph (ErG) method with pharmacophore-type node descriptions to encode molecular properties.

Among these, MACCS can effectively capture local functional groups (such as carboxyl groups and pyridine groups). PubChem can supplement large-scale features such as macrocycles and stereocenters. ErG can encode 3D pharmacophore distributions, addressing the limitations of the previous two 2D fingerprints. Therefore, combining the three yields a three-level complementary ‘local-global-interaction’ scheme intended to improve accuracy, robustness, and generalization in drug discovery tasks.

The three fingerprints are concatenated into a hybrid fingerprint according to Equation (9):

$$FP = \mathrm{CONCAT}\left(FP_{\mathrm{MACCS}}, FP_{\mathrm{PubChem}}, FP_{\mathrm{PharmacophoreErG}}\right). \tag{9}$$

The fingerprint vector is then fed into an artificial neural network (ANN) to obtain the representation in Equation (10):

$$V = W \cdot FP + b, \tag{10}$$

where $W$ and $b$ are the weight matrix and bias of the network layer.
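Equations (9) and (10) amount to a concatenation followed by one linear map. The sketch below shows this on toy bit vectors; the vector lengths, the weight matrix `W`, and the bias `b` are hypothetical values chosen so the arithmetic is easy to follow, whereas in the model they would come from the real fingerprints and from training.

```python
def concat_fingerprints(*fps):
    """Equation (9): concatenate bit vectors into one hybrid fingerprint."""
    hybrid = []
    for fp in fps:
        hybrid.extend(fp)
    return hybrid


def linear(fp, weight, bias):
    """Equation (10): V = W . FP + b for one fingerprint vector."""
    return [sum(w * x for w, x in zip(row, fp)) + b_k
            for row, b_k in zip(weight, bias)]


# Toy bit vectors; the real MACCS and PubChem fingerprints have 166 and 881 bits.
fp_maccs = [1, 0, 1]
fp_pubchem = [0, 1]
fp_erg = [1, 1]
fp = concat_fingerprints(fp_maccs, fp_pubchem, fp_erg)

# Hypothetical 2 x 7 weight matrix and bias (learned in practice).
W = [[0.5] * 7, [-1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]]
b = [0.1, -0.2]
V = linear(fp, W, b)
```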

2.4. Fusion Module

To efficiently integrate 2D local topological features, 3D global geometric features, and molecular fingerprint knowledge, MCMRL employs a feature fusion module based on a cross-attention mechanism. The 2D molecular graph and the 3D molecular graph are encoded by a 2D graph neural network module and a 3D graph neural network module, respectively. The 3D graph neural network aims to capture global molecular information and represents the geometric information within the 3D molecular graph in finer detail. However, employing a 3D graph neural network alone may lose local molecular information (such as atoms, bonds, and connectivity relationships), which the 2D graph neural network can capture. Furthermore, we introduce three complementary types of molecular fingerprint information as supplementary knowledge to enhance the model’s ability to characterize molecular features accurately. These two hierarchical information types, molecular graphs and molecular fingerprints, are fed into the fusion module, where an interactive attention mechanism fuses graph structure with functional group information. The result is a robust representation that integrates local topological features, global geometric features, and supplementary molecular fingerprint information.

Concretely, the 2D molecular graph embedding $h_T$ originates from the 2D graph neural network, while the 3D molecular graph embedding $h_G$ derives from the 3D graph neural network module. These two are concatenated to form $h_C$, which is input alongside the molecular fingerprint $h_{FP}$ into the feature fusion module. This module computes the interactive attention between graph structure and functional group information, using $h_C$ as the query and $h_{FP}$ as the key–value pair. The specific computational workflow of the fusion module is illustrated in the figure, and the computation is given in Equations (11) and (12):

$$Q, K, V = \mathrm{Linear}_Q(h_C), \mathrm{Linear}_K(h_{FP}), \mathrm{Linear}_V(h_{FP}), \tag{11}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V, \tag{12}$$

where $d$ is the dimensionality of the key vectors.

By applying layer normalization (LN) following the multi-head attention mechanism and feedforward network, the molecular representation H is ultimately obtained. This fusion network aggregates and embeds local atomic information from the 2D molecular graph, global molecular information from the 3D molecular graph, and molecular fingerprint information. Through an interactive attention mechanism, it assigns higher weights to important atoms throughout the entire molecule.
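The scaled dot-product step of Equation (12) can be sketched as follows. This is an illustration under stated assumptions, not the authors' module: the projections $\mathrm{Linear}_Q$, $\mathrm{Linear}_K$, and $\mathrm{Linear}_V$ of Equation (11) are replaced by identity maps, the inputs are a few toy rows of width 2, and multi-head splitting, the feedforward network, and layer normalization are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q, K, V):
    """Equation (12): softmax(Q K^T / sqrt(d)) V, one output row per query row."""
    d = len(K[0])
    out, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                    for c in range(len(V[0]))])
    return out, weights


# Toy stand-ins for the projected inputs (hypothetical values).
Q = [[1.0, 0.0]]                 # plays the role of Linear_Q(h_C)
K = [[1.0, 0.0], [0.0, 1.0]]     # plays the role of Linear_K(h_FP)
V = [[10.0, 0.0], [0.0, 10.0]]   # plays the role of Linear_V(h_FP)
fused, attn = cross_attention(Q, K, V)
```

Because the query comes from the graph embedding and the keys and values from the fingerprint, the attention weights express how strongly each fingerprint component contributes to the fused representation.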

2.5. Contrastive Learning Module

To align 2D topological features with 3D geometric features, a projection head maps both modalities into the same vector space. The projection head takes as input the 512-dimensional topological features output by the 2D GIN model and the 512-dimensional geometric features output by the 3D SchNet model, and performs dimensionality reduction and spatial alignment through two linear layers. The first linear layer contains 512 neurons and uses the ReLU activation function for nonlinear feature transformation. The second linear layer contains 256 neurons and omits the activation function to preserve feature continuity, outputting a 256-dimensional modal feature vector.

The projected features are fed into the InfoNCE loss, which maximizes the similarity between features from the same molecule while minimizing the similarity between features from different molecules. In other words, we maximize the mutual information between the 2D and 3D molecular graph representations of the same molecule, using InfoNCE as the contrastive loss, while distinguishing representations that come from different molecules. During training, the 2D and 3D representations of the same molecule form a positive pair, and representations of different molecules form negative pairs. More specifically, the $i$th 2D molecular graph representation and the $i$th 3D molecular graph representation form the unique positive pair, while the $i$th 2D molecular graph representation and the 3D molecular graph representations of the remaining molecules form negative pairs. Therefore, the InfoNCE loss for the $i$th 2D molecular graph is

$$L_i^t = -\log\left(\frac{\exp\left(\mathrm{sim}(z_i^t, z_i^g)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^t, z_j^g)/\tau\right)}\right), \tag{13}$$

where $t$ and $g$ denote 2D molecular graphs and 3D molecular graphs, respectively, $z_i^t$ and $z_i^g$ are the projected representations of the $i$th molecule, $N$ is the number of molecules in the batch, $\mathrm{sim}(\cdot, \cdot)$ is a pairwise similarity function (cosine similarity), and $\tau$ is the temperature parameter. Similarly, the loss for the $i$th 3D molecular graph is

$$L_i^g = -\log\left(\frac{\exp\left(\mathrm{sim}(z_i^g, z_i^t)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^g, z_j^t)/\tau\right)}\right). \tag{14}$$

In summary, the final InfoNCE loss function is

$$L_{\mathrm{InfoNCE}} = \frac{1}{2N} \sum_{k \in \{t, g\}} \sum_{i=1}^{N} L_i^k. \tag{15}$$
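The symmetric loss of Equations (13) to (15) can be written directly from the definitions. The sketch below is a plain-Python illustration on toy vectors; the temperature `tau=0.1` and the example projections are assumed values, and a real implementation would operate on batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_sided_loss(anchors, others, i, tau):
    """Equations (13)/(14): negative log-softmax of the matching pair."""
    logits = [cosine(anchors[i], o) / tau for o in others]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[i]


def info_nce(z_t, z_g, tau=0.1):
    """Equation (15): average both directions over the batch."""
    n = len(z_t)
    total = 0.0
    for i in range(n):
        total += one_sided_loss(z_t, z_g, i, tau)  # 2D anchor against all 3D views
        total += one_sided_loss(z_g, z_t, i, tau)  # 3D anchor against all 2D views
    return total / (2 * n)


# Toy 2-dimensional projections for three molecules (hypothetical values).
z_2d = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(z_2d, z_2d)                           # each 3D view matches its 2D view
shuffled = info_nce(z_2d, [z_2d[1], z_2d[2], z_2d[0]])   # positives deliberately mismatched
```

The loss is small when each molecule's two views agree and grows when the positives are mismatched, which is the behavior the pre-training objective relies on.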

3. Results

3.1. Overview of MCMRL

Multimodal contrast-enhanced molecular representation learning (MCMRL) is a self-supervised contrastive learning framework for MRL that optimizes molecular graph representations through contrastive learning of 2D topological information with 3D structural information. Furthermore, it incorporates additional molecular fingerprint information and feature fusion techniques to construct more robust and generalizable molecular representations. Within this framework, a 2D encoder extracts local topological information via a Graph Isomorphism Network (GIN); the 3D encoder utilizes a 3D graph neural network [38] to extract global spatial structural features; the fingerprint encoder employs an artificial neural network to extract fingerprint information; and feature fusion techniques adopt a cross-attention mechanism. The basic workflow comprises two stages: pre-training and fine-tuning, as shown in Figure 1.

The MCMRL model employs contrastive learning [39,40], which optimizes molecular graph representations by minimizing distances between the two representations of the same molecule while maximizing distances between representations of distinct molecules [41]. It further integrates three complementary molecular fingerprint representations, namely the MACCS fingerprint [42], the Pharmacophore ErG fingerprint [43], and the PubChem fingerprint [44]. These are fused with the molecular graph representations enhanced by contrastive learning. The three representation modalities form a ‘local-global-interaction’ hierarchical complementary framework [45], achieved through a cross-attention [46] module. This integrates multimodal molecular knowledge to construct a more robust and universal molecular representation, termed the MCMRL representation (Figure 1a). For a batch of N molecular SMILES, we simultaneously construct a 2D molecular graph (where nodes represent atoms and edges represent chemical bonds) for the 2D encoder and a 3D molecular structure (containing atomic coordinates and atomic types) for the 3D encoder. The two embeddings of the same molecule form a positive sample pair, while embeddings from different molecules form negative sample pairs (Figure 1b). The InfoNCE loss of Section 2.5, a normalized temperature-scaled cross-entropy (NT-Xent) loss, is employed to maximize agreement among positive pairs and minimize it among negative pairs. The MCMRL model is pre-trained by contrastive learning on approximately 10 million unlabeled molecules from the PubChem database [47]. Fine-tuning is then carried out through supervised learning on the target molecular property dataset. Further details of the MCMRL framework are provided in Section 2.

3.2. Molecular Property Prediction Performance

To validate the effectiveness of MCMRL, we conducted benchmarking tests on multiple challenging classification and regression tasks from MoleculeNet [48]. Table 1 presents the comparison of ROC-AUC (%)—the area under the receiver operating characteristic curve—between our MCMRL model and supervised learning as well as self-supervised/pre-trained baseline models across classification tasks. The mean and standard deviation of three independent runs are reported. Compared to other self-supervised learning or pre-training strategies, the MCMRL framework achieves superior performance on five out of seven benchmarks. These improvements demonstrate that MCMRL provides a powerful self-supervised learning strategy for MRL.

Table 2 presents the performance of MCMRL and baseline models on regression benchmarks. FreeSolv, ESOL, and Lipo use root mean square error (RMSE) as the evaluation metric, while QM7, QM8, and QM9 follow MoleculeNet’s recommendation to use mean absolute error (MAE). Compared to classification tasks, regression tasks are more challenging because they require predicting continuous values rather than manually defined discrete labels. Table 2 reveals the following observations: (1) Across six benchmarks, MCMRL achieves the best result on four and near-equivalent performance on the remaining QM7 and QM9 benchmarks. Compared to the Attentive FP model, MCMRL demonstrates superior performance on all five regression datasets. For instance, on the FreeSolv and ESOL datasets, it achieves 28% and 7% improvements over Attentive FP, respectively. (2) Compared to supervised learning models, MCMRL demonstrates competitive performance in most scenarios. For instance, on the Lipo dataset, MCMRL achieves results comparable to the top-performing supervised D-MPNN. However, on the QM9 dataset, MCMRL cannot compete with the supervised MGCN, which is specifically designed to model quantum interactions. Notably, while MGCN excels on datasets related to quantum mechanical properties (i.e., QM7, QM8, and QM9), it does not demonstrate superiority over other supervised learning baselines on the remaining benchmarks. Furthermore, MCMRL pre-training remains effective on the challenging QM9 benchmark, where MCMRL outperforms the other supervised learning baselines.
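For reference, the two regression metrics named above can be computed as in the sketch below. The `relative_improvement` helper is an assumption about how a percentage gain over a baseline might be derived (error reduction divided by the baseline error); the text does not state the exact formula behind the reported percentages, and the numbers used here are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_error, model_error):
    """Fraction by which the model lowers the baseline error (assumed definition)."""
    return (baseline_error - model_error) / baseline_error


# Made-up targets and predictions with a single large miss.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
err_rmse = rmse(y_true, y_pred)
err_mae = mae(y_true, y_pred)
gain = relative_improvement(2.0, 1.5)
```

RMSE penalizes the single large miss more heavily than MAE does, which is one reason the two metric families are not directly comparable across datasets.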

Table 1 and Table 2 both demonstrate that in certain scenarios, MCMRL exhibits superior predictive accuracy compared to other pre-trained/self-supervised learning baselines. Notably, MCMRL benefits from pre-training on large-scale unlabeled datasets, and the use of unlabeled data gives MCMRL advantages over the other baseline methods in terms of chemical space generalization capability and molecular property diversity.

3.3. Evaluation of Contrastive Learning Pre-Training and Multimodal Fusion

To assess whether cross-dimensional contrastive learning improves the graph representations of MCMRL, we compared MCMRL molecular graph embeddings generated with and without contrastive learning. The dataset comprises two representations (2D and 3D) for each of 1000 molecules, giving 1000 positive pairs and 999,000 negative pairs (1000 × 999), where 2D–3D feature pairs from the same molecule are positive examples and pairs from different molecules are negative examples. We computed the cosine similarity between each feature pair, defined as $\frac{\langle u, v \rangle}{\|u\| \|v\|}$ [49]. As shown in Figure 2, before contrastive learning the similarity values of positive pairs mostly fall between 0.4 and 0.8, with an average of approximately 0.6, while those of negative pairs are concentrated between 0.3 and 0.7, averaging approximately 0.5. This overlap indicates that, before cross-dimensional contrastive learning, the graph representations of MCMRL struggle to distinguish different molecules. Because both representations encode atom types, they share a limited amount of identical information, so a relatively high similarity among positive pairs before contrastive learning is expected. After contrastive learning, the similarity values of positive pairs mostly fall within the 0.8 to 1.0 range, with an average of approximately 0.93, whereas those of negative pairs are mostly distributed between −0.2 and 0.2, with an average approaching zero. This shows that contrastive learning enables the graph representations of MCMRL to capture the similarity of positive pairs while separating negative pairs, which supports the benefit of cross-dimensional contrastive learning for the graph representations.
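The pair-level analysis can be reproduced in miniature as below. This is a sketch on four made-up molecules with 2-dimensional embeddings, not the experiment itself; the function names and the embedding values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity <u, v> / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pair_similarities(z_2d, z_3d):
    """Split all 2D-3D cosine similarities into same-molecule and cross-molecule sets."""
    pos, neg = [], []
    for i, u in enumerate(z_2d):
        for j, v in enumerate(z_3d):
            (pos if i == j else neg).append(cosine(u, v))
    return pos, neg


# Hypothetical embeddings: each 3D view is a slightly perturbed copy of its 2D view.
z_2d = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
z_3d = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1], [0.1, -0.9]]
pos, neg = pair_similarities(z_2d, z_3d)   # 4 positive pairs, 12 negative pairs
mean_pos = sum(pos) / len(pos)
mean_neg = sum(neg) / len(neg)
```

Plotting histograms of `pos` and `neg` for embeddings taken before and after pre-training would give the kind of comparison described for Figure 2.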

To systematically evaluate the contribution of each component, we compared multiple combinations including 2D features, 3D features, fingerprint (FP) features, their fusions, and variants with certain components removed (no 2D, no 3D, no FP, no fusion). The results are evaluated with ROC-AUC for classification tasks and RMSE for regression tasks across various benchmark datasets. As shown in Figure 3, the full MCMRL achieves the best average performance across the vast majority of molecular property prediction tasks, encompassing both classification and regression. This stems from its fusion of 2D, 3D, and FP multimodal features, which captures complementary molecular information from multiple dimensions. Although its advantage is less pronounced on a few classification datasets such as SIDER, and 2D features alone perform poorly on the ESOL regression task, these instances highlight the importance of multimodal information when tackling complex tasks. This experiment therefore shows that removing any feature category or the fusion module degrades model performance, confirming that topological, geometric, and molecular fingerprint information and the fusion module each contribute to the final representation.

3.4. Investigation on MCMRL’s Graph Representation

To examine whether our contrastive learning enhancement enables MCMRL to learn topological and spatial geometric information, and whether the resulting graph representations are expressive, we employed t-distributed stochastic neighbor embedding (t-SNE) [50] to analyze the graph representations. The t-SNE algorithm maps closely related molecular representations to adjacent points in a 2D space. Figure 4 displays the t-SNE 2D embedding for 100,000 molecules from the validation set, color-coded by molecular weight. We also show randomly selected molecules to illustrate how MCMRL’s graph representations distinguish between similar and dissimilar molecules. For example, the three molecules at the top share a similar structure, each featuring a chlorine atom bonded to a benzene ring. The five molecules at the bottom all contain a benzene ring and a carbonyl group. This suggests that even without downstream fine-tuning, our pre-training method enables the model to learn intrinsic relationships between molecules: molecules with similar properties often share comparable features.

To further evaluate MCMRL’s graph representation capability, we compare its graph representations with traditional molecular fingerprints (ECFP and RDKFP). Specifically, given a query molecule, we extract its graph representation via MCMRL and compute its cosine distance to all reference molecules in the pre-training database. All reference molecules are then sorted by representation distance and uniformly divided into 20 bins based on their sorted percentile. Lower percentile thresholds indicate molecules more closely related to the query, as their MCMRL graph representations are closer. Within each bin, 5000 molecules were randomly selected, and their Dice fingerprint similarity with the query was computed. Figure 5 shows the mean and standard deviation of FP similarity within each bin. ECFP tends to yield lower similarities than RDKFP, as the former encompasses a broader range of features related to molecular activity. However, both ECFP and RDKFP similarities decrease as the MCMRL graph representation distance increases. The mean RDKFP similarity is ~0.91 for the top 5% and drops to ~0.53 for the bottom 5%. Similarly, the mean ECFP similarity decreases from ~0.50 in the top 5% to ~0.27 in the bottom 5%. Despite fluctuations with increasing percentile thresholds, the overall trend of the MCMRL graph representation aligns with chemical FP similarity, indicating that distances between MCMRL graph representations reflect molecular similarity. Furthermore, in terms of MCMRL graph representation similarity, molecules closer to the query exhibit high similarity, while distant molecules show even lower similarity than under RDKFP. The overall decline is more pronounced than for either molecular fingerprint, suggesting that MCMRL’s graph representations have stronger molecular representation capability than these two fingerprints and reflect molecular similarity more effectively.
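The percentile-binning procedure described above can be sketched as follows. The example represents each fingerprint as a set of "on" bit indices and uses made-up distances; unlike the study, it averages over every molecule in a bin instead of sampling 5000 of them, and all names are hypothetical.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient 2|A & B| / (|A| + |B|) on sets of 'on' bit indices."""
    if not bits_a and not bits_b:
        return 1.0
    return 2.0 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def split_into_bins(distances, n_bins):
    """Sort reference indices by distance and cut them into equal-size bins.

    Any remainder that does not fill a whole bin is dropped in this sketch.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    size = len(order) // n_bins
    return [order[b * size:(b + 1) * size] for b in range(n_bins)]


def mean_similarity_per_bin(query_fp, ref_fps, distances, n_bins):
    """Mean fingerprint similarity to the query within each distance bin."""
    bins = split_into_bins(distances, n_bins)
    return [sum(dice_similarity(query_fp, ref_fps[i]) for i in idx) / len(idx)
            for idx in bins]


# Toy data: four reference molecules and their representation distances to the query.
query = {1, 2, 3, 4}
refs = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2}, {7, 8}]
dists = [0.05, 0.20, 0.60, 0.90]   # hypothetical cosine distances
means = mean_similarity_per_bin(query, refs, dists, n_bins=2)
```

If the learned distances track chemical similarity, the per-bin means fall as the bin index grows, which is the trend reported for Figure 5.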

3.5. Case Study: Potential Drug for DRD2

To further demonstrate the efficacy of the molecular representation learned by MCMRL, we conducted a case study of screening potential drugs for DRD2. The 6CM4 crystal structure downloaded from the PDB database is a complex of DRD2 with risperidone (8NU). The receptor portion of this structure was used for the docking study, and the co-crystallized ligand risperidone was extracted for subsequent docking validation. We employed the DRD2 ligand 8NU [51] as the query molecule and calculated the latent space distances between 8NU and 100,000 randomly selected molecules based on their MCMRL representations. These distances were then sorted and divided into 10 bins. We used molecular docking (AutoDock Vina Version 1.1.2) [52] to calculate the binding affinity (docking score, in kcal/mol) of 100 randomly selected molecules from each bin with the DRD2 receptor. These results were compared against the experimentally resolved DRD2-8NU complex structure 6CM4 [53]. As shown in Figure 6, the docking score rises, i.e., the predicted binding becomes weaker, as the latent space distance from 8NU increases, and a clear overall upward trend is evident. Notably, molecules in the first bin, which are the most similar to 8NU, have markedly lower (more favorable) docking scores than those in subsequent bins, with an average of −11.78. This case study identified candidate molecules with high predicted DRD2 affinity from MCMRL feature similarity alone, supporting the representational capability of MCMRL.

Next, Figure 7 presents 2D schematic diagrams of the 10 molecules closest to 8NU in the MCMRL representation space. We used PyMOL (Version 3.1.6.1) [54] to render the docked structures of these 10 molecules with the DRD2 protein and annotated each with its binding affinity. The binding affinities of these 10 molecules to DRD2 ranged from −12.7 to −10.5, comparable to 8NU’s affinity value of −12.1. These selected molecules show clear structural and functional group similarities, with the top-ranked molecule differing from 8NU by only one hydrogen atom. Through contrastive learning on large-scale unlabeled datasets, MCMRL embeds molecules into a feature space that separates compounds in a chemically meaningful manner. More details of the molecular docking procedure are given in Appendix D.

4. Discussion

In this study, we explored a molecular representation learning method enhanced by multimodal contrastive learning. The contrastive learning mechanism in MCMRL enables the model to capture fine-grained correlations between the 2D topological and 3D geometric features of molecules, with the cosine similarity of positive sample pairs rising to 0.93 after pre-training. We attribute this improvement to the model learning the basic chemical principle that topological structure determines spatial geometry, which in turn reflects physicochemical properties; this addresses a key limitation of existing methods, which capture structure–property correlations inadequately. In the multimodal feature fusion, the 2D topological features preserve local bonding information (e.g., atomic connections and bond types); the 3D geometric features characterize the steric hindrance and intermolecular interaction potential that dominate molecular biological activity and physicochemical behavior; and the molecular fingerprint features introduce prior chemical knowledge of functional groups, substructures, and pharmacophore distributions. We regard the fusion of these three feature types as the main reason MCMRL achieves superior performance on 9 out of 13 benchmark datasets for molecular property prediction.
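To make the notion of positive and negative pairs concrete, the following is a minimal sketch of an InfoNCE-style contrastive objective over paired 2D and 3D embeddings, with cosine similarity as the score. It is an assumption-laden illustration: the function name `info_nce_2d_to_3d`, the temperature value, and the one-directional (2D anchor, 3D candidates) form are hypothetical choices and may differ from the loss defined in Section 2.5.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_2d_to_3d(z2d, z3d, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    z2d[i] and z3d[i] are embeddings of the same molecule (positive pair);
    z2d[i] and z3d[j] with j != i are treated as negative pairs.
    The temperature is an assumed hyperparameter.
    """
    n = len(z2d)
    total = 0.0
    for i in range(n):
        logits = [cosine_similarity(z2d[i], z3d[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n
```

Minimizing such a loss raises the cosine similarity of positive pairs and lowers that of negative pairs, which is the behavior summarized in Figure 2.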

Despite its promising performance, MCMRL still has limitations that remain to be addressed. First, the generation of 3D molecular conformations relies on classical molecular mechanics force fields, which limits prediction accuracy for macromolecules with complex spatial structures and flexible conformations. Second, the model’s prediction performance for quantum mechanical properties (e.g., the dipole moment in QM9) is still inferior to that of MGCN, a model specifically designed for quantum interaction modeling, because MCMRL lacks a dedicated feature extraction module for quantum chemical features.

Given the above limitations, several promising directions for MRL can be investigated in future work. For example, the 3D conformational space, or at least multiple conformers of each molecule, could be incorporated to improve the accuracy of 3D feature characterization. In addition, quantum chemical descriptors, together with more generic fusion and enhancement strategies for the multimodal information, could help improve the efficacy and accuracy of MRL.

Author Contributions

Conceptualization, Z.L. and C.Z.; methodology, Z.L. and H.L.; software, H.L.; validation, H.L., J.H.; formal analysis, H.L.; data curation, H.L., J.H.; writing—original draft preparation, H.L. and Z.L.; writing—review and editing, Z.L., J.H. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China (Grant No. 32300557), the China Postdoctoral Science Foundation (Grant No. 2022MD713690), and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (Grant No. KJQN202200639).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The pre-training dataset is from PubChem and is also provided here: https://drive.google.com/file/d/1e9TUmxePrAr4tQRxIIAjI_GSKS0JMH1z/view?usp=drive_link (accessed on 24 February 2026). The benchmark datasets are from MoleculeNet [48]. Code is available at https://github.com/LH000809/MCMRL (accessed on 24 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Details of Molecular Datasets

Table A1 summarizes all the benchmarks used in our work. These benchmarks from MoleculeNet cover a wide variety of molecular properties, including physiology (BBBP, Tox21, SIDER, ClinTox), biophysics (BACE, MUV, HIV), physical chemistry (FreeSolv, Lipo, ESOL), and quantum mechanics (QM7, QM8, QM9). Dataset sizes also vary significantly among the benchmarks, ranging from fewer than 1K to more than 130K molecules. All benchmarks except QM9 are scaffold-split into train/validation/test sets at a ratio of 8/1/1, which provides a more challenging yet realistic setting. Random splitting is used for QM9, following the setting of most related works, to allow comparison. ROC-AUC is used as the metric for classification tasks, while RMSE and MAE are used for regression tasks.

Table A1. Summary of all the benchmarks for molecular property predictions used in this work.

Dataset	Molecules	Tasks	Task Type	Metric	Split
BBBP	2039	1	Classification	ROC-AUC	Scaffold
Tox21	7831	12	Classification	ROC-AUC	Scaffold
ClinTox	1478	2	Classification	ROC-AUC	Scaffold
HIV	41,127	1	Classification	ROC-AUC	Scaffold
BACE	1513	1	Classification	ROC-AUC	Scaffold
SIDER	1427	27	Classification	ROC-AUC	Scaffold
MUV	93,087	17	Classification	ROC-AUC	Scaffold
FreeSolv	642	1	Regression	RMSE	Scaffold
ESOL	1128	1	Regression	RMSE	Scaffold
Lipo	4200	1	Regression	RMSE	Scaffold
QM7	6830	1	Regression	MAE	Scaffold
QM8	21,786	12	Regression	MAE	Scaffold
QM9	130,829	8	Regression	MAE	Random
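The 8/1/1 scaffold split mentioned above can be illustrated as follows. This is a hedged sketch rather than the authors' code: it assumes a scaffold identifier (for example, a Bemis-Murcko scaffold SMILES computed beforehand) is already available for every molecule, and it uses the common greedy convention of assigning the largest scaffold groups first. The function name `scaffold_split` is hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, ratios=(8, 1, 1)):
    """Split molecule indices into train/valid/test so that molecules
    sharing a scaffold never end up in different subsets.

    `scaffolds[i]` is the precomputed scaffold identifier of molecule i
    (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    total = sum(ratios)
    train, valid, test = [], [], []
    for group in ordered:
        # Integer arithmetic avoids floating-point rounding at the cut-offs.
        if (len(train) + len(group)) * total <= ratios[0] * n:
            train.extend(group)
        elif (len(train) + len(valid) + len(group)) * total <= (ratios[0] + ratios[1]) * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups are kept together, the test set contains chemotypes the model has not seen during training, which is why this setting is harder than a random split.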

We constructed node and bond (edge) features for embedding two-dimensional (2D) molecular graphs, and node features and node coordinates for embedding three-dimensional (3D) molecular graphs. During the fine-tuning phase, three complementary molecular fingerprints were incorporated as additional knowledge inputs. RDKit was employed to convert SMILES into 2D graphs, 3D graphs, and molecular fingerprints, from which features were extracted. Feature details are presented in Table A2.

Table A2. Input features of MCMRL.

Embedding Method	Feature Name	Description	Size
2D graph	Atomic number	Type of atom, by atomic number (one-hot)	119
	Chirality	CW, CCW, unspecified or other (one-hot)	4
	Bond type	Single, double, triple or aromatic (one-hot)	4
	Bond direction	Begin dash, begin wedge, etc. (one-hot)	3
3D graph	Atomic number	Type of atom, by atomic number (one-hot)	119
3D graph	coordinate	Node coordinates (float)	-
Fingerprint	MACCS	A substructure-key fingerprint defined by SMARTS patterns	166
	PubChem	A substructure-based fingerprint offering broad coverage of chemical structures	881
	Pharmacophore ErG	Extended Reduced Graph (ErG) encoding with pharmacophore node descriptions	442
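As an illustration of the one-hot features in Table A2, the sketch below builds a 2D-graph atom vector and bond vector by concatenation. Several details are assumptions made for the example: that the individual one-hot blocks are concatenated, that atomic-number index 0 is reserved (giving 119 slots for indices 0 to 118), and that the three bond-direction categories are "none", "begin_dash", and "begin_wedge". All names are hypothetical.

```python
NUM_ATOM_TYPES = 119  # size from Table A2; indices 0..118 (index 0 assumed reserved)
CHIRALITY = ["CW", "CCW", "unspecified", "other"]
BOND_TYPES = ["single", "double", "triple", "aromatic"]
BOND_DIRECTIONS = ["none", "begin_dash", "begin_wedge"]  # assumed categories


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0] * size
    vec[index] = 1
    return vec


def atom_features(atomic_number, chirality):
    """Atom vector: atomic-number one-hot (119) + chirality one-hot (4)."""
    return (one_hot(atomic_number, NUM_ATOM_TYPES)
            + one_hot(CHIRALITY.index(chirality), len(CHIRALITY)))


def bond_features(bond_type, direction):
    """Bond vector: bond-type one-hot (4) + bond-direction one-hot (3)."""
    return (one_hot(BOND_TYPES.index(bond_type), len(BOND_TYPES))
            + one_hot(BOND_DIRECTIONS.index(direction), len(BOND_DIRECTIONS)))


# Example: an sp3 carbon without assigned chirality and an aromatic bond.
carbon = atom_features(6, "unspecified")
aromatic = bond_features("aromatic", "none")
```

Under these assumptions each atom is described by 123 values and each bond by 7, matching the sizes listed for the 2D graph rows of Table A2.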

Appendix B. Visualization of MCMRL Representations

To demonstrate the representation results of the pre-trained MCMRL, we visualized molecular features using t-SNE. The molecules were sourced from various databases and colored according to their corresponding attribute labels (Figure A1). Note that all features were extracted directly from the pre-trained MCMRL model without any fine-tuning. That is, the model had no access to molecular attribute labels during training. Figure A1 shows molecules from QM8 [55,56] and QM9 [57]. The features generated by the pre-trained MCMRL exhibit label-based clustering phenomena, even without exposure to labels during training. For example, in Figure A1b, molecules are colored according to their dipole moment (μ). Molecules with relatively high μ values (green and blue) cluster in the lower-right region, while those with low μ values (dark red) cluster near the center of the plot. Similar clustering trends are observed in other t-SNE visualizations in Figure A1.

Figure A1. 2D t-SNE embedding of the molecular representations learned by our MCMRL pre-training. (a) Molecules from the QM8 database; color indicates the electronic spectrum calculated from CC2 of each molecule. (b) Molecules from the QM9 database; color indicates the dipole moment μ of each molecule.

Appendix C. Comparison of Single-Molecule Fingerprint Performance

To assess the individual impact of each of the three molecular fingerprints on model performance, we compared models that use a single fingerprint. The experiments were conducted on three representative regression tasks (FreeSolv, ESOL, Lipo) and three representative classification tasks (BACE, BBBP, HIV) from MoleculeNet. The results are shown in Figure A2.

Figure A2. The impact of molecular fingerprints on the prediction of molecular properties. Upper: Comparison of different molecular representation strategies in regression tasks. Lower: Comparison of different molecular representation strategies in classification tasks.

Adding any one of the three fingerprints to the 2D–3D baseline model (No_FP, gray bar) improves model performance, except for the MACCS fingerprint on the HIV task, while integrating all three fingerprints improves performance further (All, red bar). This indicates that each individual fingerprint contributes to the molecular property prediction performance of our MCMRL model.

Appendix D. Details of Molecular Docking Procedure

The crystal structure with PDB ID 6CM4 is mainly used for the docking analysis in this manuscript. This structure is a complex of the DRD2 with risperidone (8NU, resolution 2.87 Å). In our docking experiments, we use the receptor from this structure as the target and define the docking grid box center based on the position of its co-crystallized ligand.

First, the candidate small molecules in SMILES format are converted to PDB format with the RDKit toolbox (Version 2025.9.3) using default parameters. More specifically, the AllChem.EmbedMolecule function is used to generate initial 3D conformations. By default, it employs the Experimental Torsion Knowledge Distance Geometry (ETKDG) algorithm. The initial 3D conformations are then optimized with the AllChem.MMFFOptimizeMolecule function, using the MMFF94 force field and a maximum of 200 iterations per molecule. The output for each ligand molecule is a PDB file.

Next, using the OpenBabel toolbox (Version 3.1.1), the PDB files of both the candidate small molecules and the 6CM4 receptor are converted into the pdbqt format required by AutoDock Vina for molecular docking. The conversion command is ‘obabel *.pdb -O *.pdbqt -h -xb’, in which ‘-h’ adds hydrogen atoms and ‘-xb’ denotes rotatable bonds, atom types, and charges.

Finally, molecular docking is conducted with AutoDock Vina (Version 1.1.2). The grid box of docking is centered on the geometric center of risperidone (x = 8.926, y = 6.796, z = −6.012, based on the crystal structure coordinates from PDB 6CM4). The box dimensions are set to 20 Å × 20 Å × 20 Å. All other parameters, including the scoring function and the exhaustiveness, are kept at their default values.
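For readers who want to reproduce the grid box, the settings above map directly onto an AutoDock Vina configuration file. The sketch below only generates the configuration text; the file names are hypothetical placeholders, and the scoring function and exhaustiveness are deliberately left out so that Vina uses its defaults, as stated in the text.

```python
def vina_config(receptor, ligand, out, center, size=(20.0, 20.0, 20.0)):
    """Return the text of an AutoDock Vina config file for one ligand.

    `center` and `size` are (x, y, z) tuples in angstroms. Options that
    are not written here (e.g. exhaustiveness) stay at Vina's defaults.
    """
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"out = {out}",
        f"center_x = {center[0]}",
        f"center_y = {center[1]}",
        f"center_z = {center[2]}",
        f"size_x = {size[0]}",
        f"size_y = {size[1]}",
        f"size_z = {size[2]}",
    ]
    return "\n".join(lines) + "\n"


# Grid box centered on the geometric center of risperidone in 6CM4.
# The three file names below are hypothetical.
config_text = vina_config(
    receptor="6cm4_receptor.pdbqt",
    ligand="candidate.pdbqt",
    out="candidate_docked.pdbqt",
    center=(8.926, 6.796, -6.012),
)
```

Writing `config_text` to a file such as `config.txt` and running `vina --config config.txt` would then dock the ligand with this box.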

To validate the docking procedure, redocking was conducted for the target DRD2 and the ligand 8NU. Using the same docking parameters, risperidone (8NU) was redocked into its original binding site of DRD2 10 times, and each time the RMSD between the top redocked pose and the conformation of risperidone in the crystal structure was calculated. The top redocked pose had an average RMSD of 1.949 Å relative to the crystal structure (Figure A3). This indicates that the docking procedure and parameters (grid box size, box center, exhaustiveness, etc.) can reproduce the experimental binding conformation.
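The redocking RMSD can be computed directly from the two sets of ligand coordinates, since both poses live in the same receptor frame. The following is a minimal sketch under explicit assumptions: atoms are listed in the same order in both poses, no superposition is applied, and symmetry-equivalent atoms are not handled. The function name `in_place_rmsd` is hypothetical.

```python
import math


def in_place_rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two poses of the same ligand.

    Each argument is a list of (x, y, z) tuples in angstroms, with atoms
    in identical order. No alignment is performed, because redocked and
    crystal poses share the receptor's coordinate frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))
```

Averaging this value over the repeated redocking runs gives a single figure comparable to the 1.949 Å reported above.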

Figure A3. Redocking of 8NU with DRD2. Red stands for original crystal structure of 8NU in 6CM4, green stands for redocking conformation. Docking score and RMSD are labeled for each experiment.

References

1. Wieder, O.; Kohlbacher, S.; Kuenemann, M.; Garon, A.; Ducrot, P.; Seidel, T.; Langer, T. A Compact Review of Molecular Property Prediction with Graph Neural Networks. Drug Discov. Today Technol. 2020, 37, 1–12.
2. Shen, J.; Nicolaou, C.A. Molecular Property Prediction: Recent Trends in the Era of Artificial Intelligence. Drug Discov. Today Technol. 2019, 32–33, 29–36.
3. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
4. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
5. RDKit. Available online: https://www.rdkit.org/ (accessed on 10 November 2025).
6. Bohacek, R.S.; McMartin, C.; Guida, W.C. The Art and Practice of Structure-Based Drug Design: A Molecular Modeling Perspective. Med. Res. Rev. 1996, 16, 3–50.
7. Fang, X.; Liu, L.; Lei, J.; He, D.; Zhang, S.; Zhou, J.; Wang, F.; Wu, H.; Wang, H. Geometry-Enhanced Molecular Representation Learning for Property Prediction. Nat. Mach. Intell. 2022, 4, 127–134.
8. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
9. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
10. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How Powerful Are Graph Neural Networks? In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
11. Huang, K.; Fu, T.; Glass, L.M.; Zitnik, M.; Xiao, C.; Sun, J. DeepPurpose: A Deep Learning Library for Drug-Target Interaction Prediction. Bioinformatics 2021, 36, 5545–5547.
12. Rong, Y.; Bian, Y.; Xu, T.; Xie, W.; Wei, Y.; Huang, W.; Huang, J. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In Proceedings of the 34th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020; pp. 12559–12571.
13. Shindo, H.; Matsumoto, Y. Gated Graph Recursive Neural Networks for Molecular Property Prediction. arXiv 2019, arXiv:1909.00259.
14. Xiong, Z.; Wang, D.; Liu, X.; Zhong, F.; Wan, X.; Li, X.; Li, Z.; Luo, X.; Chen, K.; Jiang, H.; et al. Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism. J. Med. Chem. 2020, 63, 8749–8760.
15. Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; et al. Analyzing Learned Molecular Representations for Property Prediction. J. Chem. Inf. Model. 2019, 59, 3370–3388, Erratum in J. Chem. Inf. Model. 2019, 59, 5304–5305.
16. Lu, C.; Liu, Q.; Wang, C.; Huang, Z.; Lin, P.; He, L. Molecular Property Prediction: A Multilevel Quantum Interactions Modeling Perspective. AAAI 2019, 33, 1052–1060.
17. Wang, Y.; Wang, J.; Cao, Z.; Barati Farimani, A. Molecular Contrastive Learning of Representations via Graph Neural Networks. Nat. Mach. Intell. 2022, 4, 279–287.
18. Zhang, Z.; Liu, Q.; Wang, H.; Lu, C.; Lee, C.-K. Motif-Based Graph Self-Supervised Learning for Molecular Property Prediction. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 15870–15882.
19. Chen, J.; Zheng, S.; Song, Y.; Rao, J.; Yang, Y. Learning Attributed Graph Representation with Communicative Message Passing Transformer. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence; International Joint Conferences on Artificial Intelligence Organization: Montreal, QC, Canada, 2021; pp. 2242–2248.
20. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019.
21. He, P.; Liu, X.; Gao, J.; Chen, W. Deberta: Decoding-Enhanced Bert with Disentangled Attention. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
22. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised Visual Representation Learning by Context Prediction. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV); IEEE: Santiago, Chile, 2015; pp. 1422–1430.
23. Shanker, V.R.; Bruun, T.U.J.; Hie, B.L.; Kim, P.S. Unsupervised Evolution of Protein and Antibody Complexes with a Structure-Informed Language Model. Science 2024, 385, 46–53.
24. Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; Leskovec, J. Strategies for Pre-Training Graph Neural Networks. In Proceedings of the International Conference on Learning Representations 2020, Addis Ababa, Ethiopia, 26 April–1 May 2020.
25. Liu, S.; Wang, H.; Liu, W.; Lasenby, J.; Guo, H.; Tang, J. Pre-Training Molecular Graph Representation with 3d Geometry. In Proceedings of the International Conference on Learning Representations 2022, Online, 25–29 April 2022.
26. Li, B.; Lin, M.; Chen, T.; Wang, L. FG-BERT: A Generalized and Self-Supervised Functional Group-Based Molecular Representation Learning Framework for Properties Prediction. Brief. Bioinform. 2023, 24, bbad398.
27. Dwivedi, V.P.; Rampášek, L.; Galkin, M.; Parviz, A.; Wolf, G.; Luu, A.T.; Beaini, D. Long Range Graph Benchmark. arXiv 2022, arXiv:2206.08164.
28. Wu, Z.; Jain, P.; Wright, M.; Mirhoseini, A.; Gonzalez, J.E.; Stoica, I. Representing Long-Range Context for Graph Neural Networks with Global Attention. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 13266–13279.
29. Chen, B.; Barzilay, R.; Jaakkola, T.S. Path-Augmented Graph Transformer Network. arXiv 2019, arXiv:1905.12712.
30. Maziarka, Ł.; Danel, T.; Mucha, S.; Rataj, K.; Tabor, J.; Jastrzębski, S. Molecule Attention Transformer. arXiv 2020, arXiv:2002.08264.
31. Kreuzer, D.; Beaini, D.; Hamilton, W.; Létourneau, V.; Tossou, P. Rethinking Graph Transformers with Spectral Attention. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 21618–21629.
32. Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.-Y. Do Transformers Really Perform Badly for Graph Representation? In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 28877–28888.
33. Li, S.; Zhou, J.; Xu, T.; Dou, D.; Xiong, H. GeomGCL: Geometric Graph Contrastive Learning for Molecular Property Prediction. AAAI 2022, 36, 4541–4549.
34. Li, C.; Wang, J.; Niu, Z.; Yao, J.; Zeng, X. A Spatial-Temporal Gated Attention Module for Molecular Property Prediction Based on Molecular Geometry. Brief. Bioinform. 2021, 22, bbab078.
35. Klicpera, J.; Groß, J.; Günnemann, S. Directional Message Passing for Molecular Graphs. In Proceedings of the International Conference on Learning Representations 2020, Addis Ababa, Ethiopia, 26 April–1 May 2020.
36. Liu, Y.; Wang, L.; Liu, M.; Zhang, X.; Oztekin, B.; Ji, S. Spherical Message Passing for 3D Graph Networks. arXiv 2021, arXiv:2102.05013.
37. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709.
38. Schütt, K.T.; Kindermans, P.-J.; Sauceda, H.E.; Chmiela, S.; Tkatchenko, A.; Müller, K.-R. SchNet: A Continuous-Filter Convolutional Neural Network for Modeling Quantum Interactions. arXiv 2017, arXiv:1706.08566.
39. Sun, R.; Dai, H.; Yu, A.W. Does GNN Pretraining Help Molecular Representation? arXiv 2022, arXiv:2207.06010.
40. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748.
41. Schneider, N.; Sayle, R.A.; Landrum, G.A. Get Your Atoms in Order: An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm. J. Chem. Inf. Model. 2015, 55, 2111–2120.
42. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
43. Stiefl, N.; Watson, I.A.; Baumann, K.; Zaliani, A. ErG: 2D Pharmacophore Descriptions for Scaffold Hopping. J. Chem. Inf. Model. 2006, 46, 208–220.
44. Bolton, E.E.; Wang, Y.; Thiessen, P.A.; Bryant, S.H. PubChem: Integrated Platform of Small Molecules and Biological Activities. In Annual Reports in Computational Chemistry; Elsevier: Amsterdam, The Netherlands, 2008; Volume 4, pp. 217–241.
45. Cai, H.; Zhang, H.; Zhao, D.; Wu, J.; Wang, L. FP-GNN: A Versatile Deep Learning Architecture for Enhanced Molecular Property Prediction. Brief. Bioinform. 2022, 23, bbac408.
46. Hao, Y.; Chen, X.; Fei, A.; Jia, Q.; Chen, Y.; Shao, J.; Pandiyan, S.; Wang, L. SG-ATT: A Sequence Graph Cross-Attention Representation Architecture for Molecular Property Prediction. Molecules 2024, 29, 492.
47. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2019 Update: Improved Access to Chemical Data. Nucleic Acids Res. 2019, 47, D1102–D1109.
48. Wu, Z.; Ramsundar, B.; Feinberg, E.N.; Gomes, J.; Geniesse, C.; Pappu, A.S.; Leswing, K.; Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci. 2018, 9, 513–530.
49. Shen, W.X.; Zeng, X.; Zhu, F.; Wang, Y.L.; Qin, C.; Tan, Y.; Jiang, Y.Y.; Chen, Y.Z. Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations. Nat. Mach. Intell. 2021, 3, 334–343.
50. van der Maaten, L.; Hinton, G. Visualizing Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
51. Rose, P.W.; Prlić, A.; Altunkaya, A.; Bi, C.; Bradley, A.R.; Christie, C.H.; Costanzo, L.D.; Duarte, J.M.; Dutta, S.; Feng, Z.; et al. The RCSB Protein Data Bank: Integrative View of Protein, Gene and 3D Structural Information. Nucleic Acids Res. 2017, 45, D271–D281.
52. Eberhardt, J.; Santos-Martins, D.; Tillack, A.F.; Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61, 3891–3898.
53. Wang, S.; Che, T.; Levit, A.; Shoichet, B.K.; Wacker, D.; Roth, B.L. Structure of the D2 Dopamine Receptor Bound to the Atypical Antipsychotic Drug Risperidone. Nature 2018, 555, 269–273.
54. DeLano, W.L. PyMOL: An Open-Source Molecular Graphics Tool. CCP4 Newsl. Protein Crystallogr. 2002, 40, 82–92.
55. Ramakrishnan, R.; Hartmann, M.; Tapavicza, E.; Von Lilienfeld, O.A. Electronic Spectra from TDDFT and Machine Learning in Chemical Space. J. Chem. Phys. 2015, 143, 084111.
56. Ruddigkeit, L.; Van Deursen, R.; Blum, L.C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.
57. Ramakrishnan, R.; Dral, P.O.; Rupp, M.; Von Lilienfeld, O.A. Quantum Chemistry Structures and Properties of 134 Kilo Molecules. Sci. Data 2014, 1, 140022.

Figure 1. MCMRL Overview. (a) The entire MCMRL framework. First, contrastive learning is employed to enhance molecular graph representations, incorporating additional molecular fingerprint information and feature fusion to ultimately capture universal characteristics of each molecule, termed MCMRL representation. The gray sections, whilst not directly involved in the contrastive learning process, remain integral components of the MCMRL model. Subsequently, supervised learning is applied for model fine-tuning. (b) Contrastive learning enhances molecular graph representations. Convert the SMILES structural formulas of a batch of molecular data into 2D molecular graphs and 3D molecular graphs. Input both types of molecular graphs into their respective feature encoders to extract feature representations, ultimately outputting corresponding feature vectors. Employ contrastive loss to minimize the distance between latent vectors of the same molecule while maximizing the distance between latent vectors of different molecules, where distance is computed using cosine similarity.

Figure 2. Cosine similarity distribution and change of molecular representation for positive and negative pairs with contrastive pre-training. Before model pre-training, distributions of cosine similarity of positive examples (light blue) and negative samples (light red) are similar. After contrastive pre-training, the cosine similarity of positive molecular pairs (dark blue) significantly increases, while the cosine similarity of negative molecular pairs (dark red) significantly decreases.

Figure 3. Comparison of different molecular representation methods in predicting molecular properties. (a) Comparison in classification tasks (ROC-AUC). (b) Comparison in regression tasks (RMSE).

Figure 4. Visualization of graph representations based on t-SNE. Representations were extracted from the validation set of the pre-training dataset, which contains 100,000 unique molecules. Each point is colored by its corresponding molecular weight (g mol⁻¹). Some molecules within the representation domain are also displayed.

Figure 5. Comparison of graph representations and conventional FPs using the query molecule. Change in ECFP, RDKFP and MCMRL similarities with respect to the distance between MCMRL representations.

Figure 6. Changes in DRD2 protein binding affinity based on MCMRL representation, with excellent docking in the upper left and poor docking in the lower right.

Figure 7. 8NU and its 10 nearest neighbors in the MCMRL representation domain, along with their DRD2 protein docking structures, annotated with docking affinity.

Table 1. Test performance of different models on seven classification benchmarks.

Dataset	BBBP	Tox21	ClinTox	HIV	BACE	SIDER	MUV
GCN [9]	71.8 ± 0.9	70.9 ± 2.6	62.5 ± 2.8	74.0 ± 3.0	71.6 ± 2.0	53.6 ± 3.2	71.6 ± 4.0
GIN [10]	65.8 ± 4.5	74.0 ± 0.8	58.0 ± 4.4	75.3 ± 1.9	70.1 ± 5.4	57.3 ± 1.6	71.8 ± 2.5
PretrainGNN [24]	68.7 ± 1.3	78.1 ± 0.6	87.6 ± 1.5	71.1 ± 0.5	84.5 ± 0.7	62.7 ± 0.8	80.1 ± 2.1
Attentive FP [14]	64.3 ± 1.8	76.1 ± 0.5	84.7 ± 0.3	75.7 ± 1.4	78.4 ± 0.0	60.6 ± 3.2	76.6 ± 1.5
D-MPNN [15]	71.2 ± 3.8	68.9 ± 1.3	90.5 ± 5.3	75.0 ± 2.1	85.3 ± 5.3	63.2 ± 2.3	76.2 ± 2.8
MGCN [16]	85.0 ± 6.4	70.7 ± 1.6	63.4 ± 4.2	73.8 ± 1.6	73.4 ± 3.0	55.2 ± 1.8	70.2 ± 3.4
GraphMVP [25]	72.4 ± 1.6	75.9 ± 0.5	79.1 ± 2.8	77.0 ± 1.2	81.2 ± 0.9	63.9 ± 1.2	77.7 ± 1.9
FG-BERT [26]	70.2 ± 0.9	78.4 ± 0.8	83.2 ± 1.6	77.4 ± 1.0	84.5 ± 1.5	64.0 ± 0.7	75.3 ± 2.4
MCMRL	74.1 ± 0.6	79.7 ± 1.2	91.3 ± 1.8	80.3 ± 2.1	84.6 ± 0.4	67.6 ± 0.7	81.8 ± 1.0

The mean and standard deviation of the ROC-AUC (%) for each benchmark were reported. Best-performing methods for each benchmark are marked in bold.

Table 2. Test performance of different models on six regression benchmarks.

Dataset	FreeSolv	ESOL	Lipo	QM7	QM8	QM9
GCN [9]	2.87 ± 0.14	1.43 ± 0.05	1.43 ± 0.05	122.9 ± 2.2	0.037 ± 0.001	5.796 ± 1.969
GIN [10]	2.76 ± 0.18	1.45 ± 0.02	0.85 ± 0.07	124.8 ± 0.7	0.037 ± 0.001	4.741 ± 0.912
PretrainGNN [24]	2.76 ± 0.02	1.10 ± 0.06	0.74 ± 0.10	113.2 ± 6.0	0.020 ± 0.002	4.081 ± 0.001
Attentive FP [14]	2.07 ± 0.18	0.98 ± 0.02	0.72 ± 0.10	72.0 ± 2.7	0.018 ± 0.001	2.156 ± 0.001
D-MPNN [15]	2.18 ± 0.91	0.98 ± 0.26	0.65 ± 0.05	105.8 ± 13.2	0.018 ± 0.002	3.241 ± 0.119
MGCN [16]	3.35 ± 0.01	1.27 ± 0.15	1.11 ± 0.04	77.6 ± 4.7	0.022 ± 0.002	0.050 ± 0.002
GraphMVP [25]	—	1.029	0.681	—	—	—
FG-BERT [26]	—	0.94 ± 0.03	0.66 ± 0.01	—	—	—
MCMRL	1.79 ± 0.25	0.91 ± 0.01	0.65 ± 0.05	89.0 ± 3.2	0.017 ± 0.001	2.095 ± 0.216

The mean and standard deviation of the test RMSE (FreeSolv, ESOL, Lipo) or MAE (QM7, QM8, and QM9) are reported. The best-performing methods for each benchmark are marked in bold; —means no result reported for the corresponding models.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
