MCARSMA: A Multi-Level Cross-Modal Attention Fusion Framework for Accurate RNA–Small Molecule Affinity Prediction

Li, Ye; Zhang, Yongfeng; Zhu, Lei; Wang, Menghua; Wang, Rong; Wang, Xiao

doi:10.3390/math14010057


by Ye Li ¹, Yongfeng Zhang ¹, Lei Zhu ², Menghua Wang ¹, Rong Wang ³ and Xiao Wang ¹,*

¹ School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China

² School of Software, Zhengzhou University of Light Industry, Zhengzhou 450000, China

³ School of Electronic Information, Zhengzhou University of Light Industry, Zhengzhou 450000, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(1), 57; https://doi.org/10.3390/math14010057

Submission received: 17 November 2025 / Revised: 9 December 2025 / Accepted: 20 December 2025 / Published: 24 December 2025

(This article belongs to the Special Issue Machine Learning Algorithms and Their Applications in Bioinformatics)


Abstract

RNA has emerged as a critical drug target, and accurate prediction of its binding affinity with small molecules is essential for the design and screening of RNA-targeted therapeutics. Although current deep learning methods have achieved progress in predicting RNA–small molecule interactions, existing models commonly suffer from reliance on single-modality features and insufficient representation of cross-level interactions. This paper proposes a multi-level cross-modal attention fusion framework, named MCARSMA, which integrates sequence, structural, and semantic information from both RNA and small molecules. The model employs a dual-path interaction mechanism to capture multi-scale relationships spanning from atom–nucleotide fine-grained interactions to global conformational features. The model architecture comprises (1) the feature extraction of RNA secondary structure and sequence using GAT and CNN; (2) small molecule representation that combines GCN and Transformer for joint graph and sequence embedding; (3) a dual-path fusion module for atom–nucleotide fine-grained interactions and structure-guided multi-level interactions; and (4) an adaptive feature weighting mechanism implemented via a gated network. The results demonstrate that on the R-SIM dataset, MCARSMA achieves RMSE = 0.883, PCC = 0.772, and SCC = 0.773, validating the effectiveness of the proposed multi-level cross-modal attention fusion framework. This study provides a highly interpretable deep learning solution with high predictive accuracy.

Keywords:

RNA–small molecule interaction prediction; multi-modal fusion; graph neural network; attention mechanism; adaptive feature weighting

MSC:

92B20

1. Introduction

RNA not only serves as a bridge for genetic information transfer but also participates in transcription, splicing, translation, and epigenetic regulation via its complex and conserved secondary/tertiary structures. It plays key roles in gene regulation [1], RNA processing and modification [2], and catalysis [3]. In recent years, breakthroughs in structural biology and small-molecule screening technologies have led to the recognition that RNA can be as important a drug target as proteins, provided that druggable pockets are present. The first FDA-approved RNA-targeting small molecule drug, risdiplam, confers revolutionary benefits to patients with spinal muscular atrophy (SMA) by stabilizing specific splicing sites of the SMN2 pre-mRNA, thereby significantly increasing functional SMN protein levels [4]. This milestone has spurred enthusiasm in both academia and industry for systematic studies of RNA–small molecule affinity (RSMA), leading to rapid expansion in pipelines targeting viral RNAs, repeat expansion RNAs, and cancer-associated non-coding RNAs [5]. Because RNA is a key regulatory target in disease, its specific binding with small molecules holds irreplaceable value in areas such as antiviral and cancer therapeutics. Accurate prediction of RNA–small molecule binding affinity is therefore crucial [6], as it can significantly accelerate the screening and optimization of targeted drugs and reduce experimental costs.

However, compared to the well-established field of protein–ligand research [7,8], the progress in modeling RNA–ligand interactions remains relatively limited. To date, only a handful of docking and scoring methods specifically adapted for RNA–ligand interactions have been developed, such as rDOCK [9,10], AutoDock [11,12], Dock6 [13], and NLDock [14]. Correspondingly, the scoring functions designed for these docking tasks include LigandRNA [15], SPA-LN [16], I-INF [17], I-RMSD [18], and DrugScoreRNA [19]. Nevertheless, the available data volume and algorithmic sophistication for RNA–small molecule interactions are still insufficient. On the other hand, while experimental techniques such as Surface Plasmon Resonance (SPR) and Microscale Thermophoresis (MST) can provide high-precision binding affinity data, they face bottlenecks including difficulties in RNA sample preparation, high costs, and low throughput [20,21], which severely limit the generation of large-scale, high-quality datasets. The integration of computational approaches at an early stage can effectively narrow the chemical space, thereby significantly reducing the need for costly and labor-intensive experimental screening. Early computational methods, such as those relying on energy minimization-based docking scoring functions [22] or the combination of empirical potential terms [23], were able to coarsely distinguish binding from non-binding states. However, these methods often yielded substantial deviations due to insufficient accounting for RNA flexibility and the influence of the ionic environment.

With the rise of machine learning, particularly deep learning, data-driven prediction of RNA–small molecule affinity (RSMA) has emerged as a new trend. Researchers have explored various approaches, such as using convolutional neural networks to process ribosomal hairpin images [24], employing graph neural networks to encode non-covalent interaction fingerprints [25], or constructing dedicated models for RNA molecules categorized into six distinct subtypes [26]. The DeepRSMA framework, for instance, captures fine-grained RNA–small molecule interactions through a dual-perspective feature extraction and cross-fusion module at the nucleotide-atom level, though it still has room for improvement in mining hierarchical structural information and generalizing to cold-start scenarios [27]. However, existing computational methods still face two major challenges: (1) limitations in feature extraction, where traditional models often rely on single-modality data (e.g., sequence or structure alone) and struggle to capture multi-scale information ranging from atomic interactions to global conformational features; and (2) inefficiency in cross-modal interaction modeling: although RNA–small molecule binding involves multi-level interactions, such as nucleotide–functional group and secondary structure–pharmacophore relationships, existing cross-attention mechanisms often overlook critical interactions at the substructure level.

Based on this, we propose an innovative framework that integrates the advantages of both approaches: it combines multi-scale feature extraction, cross-modal hierarchical interaction, and a confidence-weighted mechanism to achieve holistic information modeling from atoms to the overall conformation. The proposed model addresses existing limitations through the following strategies: (1) integrating dual-perspective nucleotide-atom level sequence and graph features to capture hierarchical information from local functional groups to global architecture, augmented with RNA and molecule pre-trained language models to enhance sequence representations; and (2) designing a substructure-nucleotide cross-attention mechanism that captures not only local and global interactions but also inter-level interactions between different hierarchical substructures.

This integrated framework not only provides a more accurate computational tool for predicting RNA–small molecule binding but also offers new insights into the molecular mechanisms of RNA-targeted binding through hierarchical feature modeling and interpretability analysis. It is expected to facilitate the rational design and development of RNA-targeted therapeutics.

2. Methods

2.1. Model Overview

As illustrated in Figure 1, our MCARSMA comprises two main components: (A) the RNA–small molecule feature extraction module (light gray area) and (B) the cross-modal feature fusion module (light green area).

In the feature extraction module (A), we aim to comprehensively capture the multi-dimensional structural information of both RNA and small molecules by extracting sequence-based, structure-based, and pretrained language model-based representations. Specifically, a one-dimensional convolutional neural network is employed to capture multi-scale sequence patterns. The RNA secondary structure predicted by RNAfold [28] is converted from its dot-bracket notation into a graph representation, which is subsequently encoded using a graph attention network (GAT) to obtain structure-aware features. In addition, semantic-level embeddings are extracted from the pretrained RNA language model RNAErnie [29].

Meanwhile, the structural features of small molecules are extracted using a graph convolutional network (GCN) applied to molecular graphs parsed from SMILES strings via RDKit. The sequential features of small molecules are obtained by encoding their SMILES representations with a Transformer encoder. SMILES (Simplified Molecular Input Line Entry System) is an ASCII-based notation that explicitly describes molecular structures by linearizing their two-dimensional topology into a one-dimensional character sequence. In this representation, atoms are encoded according to their element symbol and aromaticity, enabling the capture of key chemical patterns such as branching, ring systems, and ionization states. In addition, we employ ChemBERTa [30] to extract semantic-level embeddings that further enrich the molecular representations.

In the cross-fusion module (B), we design two complementary interaction mechanisms to comprehensively model RNA–small molecule relationships at different levels. The first mechanism is a fine-grained atom–nucleotide interaction module, which integrates the graph-based and sequence-based representations of RNA and small molecules across four perspectives (graph–graph, graph–sequence, sequence–graph, and sequence–sequence). Bidirectional cross-attention is then applied to compute fine-grained associations between nucleotides and atoms. The resulting attention outputs are subsequently fused with semantic embeddings derived from RNAErnie and ChemBERTa, yielding the final atom–nucleotide interaction features.

The second mechanism is a structure-oriented multi-level interaction module. Specifically, three layers of stacked GAT and GCN networks are employed to obtain hierarchical structural representations of RNA (R₁, R₂, R₃) and small molecules (M₁, M₂, M₃), respectively. All 3 × 3 pairwise combinations of these structural features are then computed using element-wise products, followed by a shared linear mapping. The mapped features are subsequently aggregated through a weighted summation with learnable coefficients α_kl, producing the structure-level interaction representation.

Finally, a gating network is used to adaptively fuse the fine-grained interaction features with the structure-level interaction features, yielding the final prediction for RNA–small molecule binding affinity.

2.2. Feature Extraction Module

2.2.1. RNA Feature Extraction

The RNA feature extraction module processes RNA molecules from both graph and sequence perspectives to generate their respective graph and sequence embeddings.

From the sequential perspective, the representation is derived from the one-hot encoded RNA nucleotide sequence, which is first transformed into dense embeddings via an embedding layer. This is followed by a multi-scale feature extraction using a 1D CNN block comprising three parallel convolutional branches with kernel sizes of 7, 11, and 15, respectively. Each branch consists of a two-layer convolutional structure with an output channel size of 128, and incorporates a max-pooling operation after the convolution to retain salient local sequential patterns. The outputs from the three branches are then integrated through a projection head containing two linear layers, which projects them into a unified dimensional space. The final sequence-level embedding for the RNA is obtained by averaging the features across all nucleotide positions.
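The sequence branch described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel sizes (7, 11, 15) and the 128 output channels come from the text, whereas the embedding width (`EMBED_DIM`), the ReLU activations, the stride-1 pooling window, and the random weights are placeholders chosen only so the example runs.

```python
import numpy as np

NUCLEOTIDES = "ACGU"
KERNEL_SIZES = (7, 11, 15)   # kernel sizes quoted in the text
CHANNELS = 128               # output channels quoted in the text
EMBED_DIM = 64               # assumed; the text does not state the embedding width


def one_hot(seq):
    """Map an RNA string to an (L, 4) one-hot matrix."""
    return np.eye(len(NUCLEOTIDES))[[NUCLEOTIDES.index(n) for n in seq]]


def conv1d_same(x, kernel):
    """Zero-padded 1D convolution. x: (L, C_in), kernel: (k, C_in, C_out)."""
    k = kernel.shape[0]
    padded = np.pad(x, ((k // 2, k // 2), (0, 0)))
    windows = np.stack([padded[i:i + k] for i in range(x.shape[0])])
    return np.einsum("lkc,kco->lo", windows, kernel)


def max_pool_same(x, size=3):
    """Stride-1 max pooling that keeps one row per nucleotide (assumed setting)."""
    padded = np.pad(x, ((size // 2, size // 2), (0, 0)), constant_values=-np.inf)
    return np.stack([padded[i:i + size].max(axis=0) for i in range(x.shape[0])])


def init_params(rng):
    """Random weights standing in for trained parameters."""
    p = {"embed": rng.normal(0, 0.1, (len(NUCLEOTIDES), EMBED_DIM))}
    for k in KERNEL_SIZES:
        p[f"conv{k}_a"] = rng.normal(0, 0.1, (k, EMBED_DIM, CHANNELS))
        p[f"conv{k}_b"] = rng.normal(0, 0.1, (k, CHANNELS, CHANNELS))
    p["proj1"] = rng.normal(0, 0.1, (len(KERNEL_SIZES) * CHANNELS, CHANNELS))
    p["proj2"] = rng.normal(0, 0.1, (CHANNELS, CHANNELS))
    return p


def encode_rna_sequence(seq, p):
    h = one_hot(seq) @ p["embed"]                               # (L, EMBED_DIM)
    branches = []
    for k in KERNEL_SIZES:
        z = np.maximum(conv1d_same(h, p[f"conv{k}_a"]), 0.0)    # conv layer 1 + ReLU
        z = np.maximum(conv1d_same(z, p[f"conv{k}_b"]), 0.0)    # conv layer 2 + ReLU
        branches.append(max_pool_same(z))
    fused = np.concatenate(branches, axis=1)                    # (L, 3 * CHANNELS)
    tokens = np.maximum(fused @ p["proj1"], 0.0) @ p["proj2"]   # two-layer projection head
    return tokens, tokens.mean(axis=0)                          # per-nucleotide, pooled


params = init_params(np.random.default_rng(0))
tokens, pooled = encode_rna_sequence("GGGAAACCCU", params)
print(tokens.shape, pooled.shape)   # (10, 128) (128,)
```

The function returns both the per-nucleotide features and their mean, since the former are what a nucleotide-level attention module would consume and the latter corresponds to the averaged sequence-level embedding.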

From the graph perspective, the structural features of the RNA are derived from its secondary structure represented in the dot-bracket notation predicted by RNAfold. A graph is constructed based on the base-pairing interactions, where G-C, A-U Watson-Crick pairs and G-U wobble pairs are defined as edges. A three-layer stacked Graph Attention Network (GAT) is then applied to this structural graph to extract relational features. The hidden dimensions of the three GAT layers are 128, 256, and 256, respectively, each employing 4 attention heads. This architecture enables the model to simultaneously capture local base-pairing patterns, meso-scale structural motifs, and high-level topological dependencies across different segments. The node embeddings encoded by the three layers are integrated via a global average pooling operation, thereby yielding the final structural embedding representation of the RNA.
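As an illustration of the graph construction step, the sketch below converts a dot-bracket string into a list of base-pair edges using a stack. It assumes a pseudoknot-free structure written with only `(`, `)`, and `.`; the function name `dot_bracket_to_edges` is hypothetical, and whether backbone (sequence-adjacent) edges are also added is not stated in the text, so only base-pair edges are returned here.

```python
def dot_bracket_to_edges(structure):
    """Return sorted base-pair edges (i, j) with i < j from a dot-bracket string."""
    stack = []   # positions of opening brackets still waiting for a partner
    edges = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            edges.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(edges)


# A six-nucleotide hairpin: positions 0-5 and 1-4 are paired, 2 and 3 form the loop.
print(dot_bracket_to_edges("((..))"))   # [(0, 5), (1, 4)]
```

The resulting edge list can be passed to any graph library as an undirected edge index for the attention layers.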

2.2.2. Small Molecule Feature Extraction

Similar to the RNA feature extraction, the small molecule module also encompasses two primary dimensions: semantic encoding based on the SMILES sequence and structural encoding based on the molecular graph.

From the sequential perspective, the SMILES string of a small molecule is first tokenized at the atomic level, where characters representing atoms and bonds are mapped into a sequence of vectors. SMILES (Simplified Molecular Input Line Entry System) is a linear notation widely used in cheminformatics that unambiguously describes the two-dimensional topology of a molecule, making it highly suitable as a sequential input representation. Subsequently, an encoder composed of a three-layer Transformer architecture is employed to model the SMILES sequence. Each Transformer block consists of a multi-head self-attention layer and a feed-forward network, with a hidden dimension of 128 and 4 attention heads, enabling it to capture long-range dependencies and semantic relationships implied by branches and ring structures. The resulting sequential embedding effectively encodes multi-level semantic information, ranging from atomic symbols to global structural patterns.
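The text does not specify the tokenizer, so the following is only one plausible atom-level scheme: a regular expression that keeps bracket atoms and two-letter halogens as single tokens. The pattern and the name `tokenize_smiles` are assumptions made for illustration.

```python
import re

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# bond/branch/stereo symbols, two-digit ring closures, single-digit ring closures.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[()=#\-+\\/:~@?>*$.]|%\d{2}|\d"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-level tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)Cl"))   # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

Each token would then be mapped to an integer index and an embedding vector before entering the Transformer encoder.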

From the graph perspective, the structure of the small molecule is represented as a molecular graph constructed from the atoms and chemical bonds identified by the RDKit toolkit. In this graph, nodes correspond to atoms and edges represent chemical bonds. A three-layer Graph Convolutional Network (GCN) with hidden dimensions of 78, 156, and 128, sequentially, is applied to process this graph. Mean aggregation is used for node updating across layers, enabling the network to hierarchically capture local bonding information, functional group structures, and the global topological pattern of the molecule. The GCN and Transformer pathways thus extract complementary features from the structural and sequential representations of the small molecule, respectively. Furthermore, to enhance the model’s generalization capability, we incorporate two pre-trained models: RNAErnie-1.0 for RNA sequences and ChemBERTa-77M-MLM for SMILES strings. These models are utilized to extract high-dimensional semantic embeddings for the respective molecular representations.
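The mean-aggregation graph convolution described above can be sketched as follows. The hidden sizes (78, 156, 128) are taken from the text; the input feature size `IN_DIM`, the ReLU activation, the self-loops, and the random weights are assumptions, and the ethanol adjacency matrix is written by hand rather than parsed with RDKit so that the example has no external dependency beyond NumPy.

```python
import numpy as np


def gcn_mean_layer(adj, feats, weight):
    """One graph-convolution step: mean over self + neighbours, linear map, ReLU."""
    adj_self = adj + np.eye(adj.shape[0])            # add self-loops
    degree = adj_self.sum(axis=1, keepdims=True)     # neighbourhood size per atom
    aggregated = (adj_self @ feats) / degree         # mean aggregation
    return np.maximum(aggregated @ weight, 0.0)


# Heavy-atom graph of ethanol (C-C-O), written by hand.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
IN_DIM = 78                     # assumed atom-feature size; not given in the text
HIDDEN = (78, 156, 128)         # hidden sizes quoted in the text
x = rng.normal(size=(3, IN_DIM))

levels = []                     # node embeddings after layers 1, 2, 3
dims = (IN_DIM,) + HIDDEN
for d_in, d_out in zip(dims[:-1], dims[1:]):
    x = gcn_mean_layer(adj, x, rng.normal(0, 0.1, (d_in, d_out)))
    levels.append(x)

print([m.shape for m in levels])   # [(3, 78), (3, 156), (3, 128)]
```

Keeping the output of every layer, rather than only the last one, is what makes the three hierarchical levels available to later interaction modules.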

2.3. RNA–Small Molecule Cross-Information Fusion

2.3.1. Atom–Nucleotide Fine-Grained Interaction

This interaction pathway employs a cross-attention mechanism to capture fine-grained binding information between RNA nucleotides and small molecule atoms. The process builds upon the graph embeddings $R^{g}$ and sequence embeddings $R^{s}$ from the RNA feature extraction module, together with the graph embeddings $M^{g}$ and sequence embeddings $M^{s}$ from the small molecule feature extraction module. To prevent confusion between features derived from different perspectives, segment embeddings $S^{g}$ (indicating graph-view origin) and $S^{s}$ (indicating sequence-view origin) are introduced to explicitly mark the source of each feature. The final input to the interaction module is defined as follows:

$$R_{\mathrm{input}}^{c} = [R^{c,g}, R^{c,s}] = [R^{g} + S^{g},\; R^{s} + S^{s}], \qquad M_{\mathrm{input}}^{c} = [M^{c,g}, M^{c,s}] = [M^{g} + S^{g},\; M^{s} + S^{s}] \tag{1}$$

Subsequently, a multi-head cross-attention computation is employed. We design parallel cross-attention layers from both the RNA perspective and the small molecule perspective. In each perspective, the features of one party serve as the Query, while the features of the other party serve as the Key and Value, thereby calculating fine-grained interaction weights. For the i-th attention head, the interaction from the RNA perspective captures the attention of nucleotides towards atoms, as shown in Equation (2):

$$\mathrm{CrossHead}_{i}^{R} = \mathrm{Softmax}\left(\frac{R^{c} W_{R,i}^{Q} \left(M^{c} W_{R,i}^{K}\right)^{T}}{\sqrt{d}}\right) M^{c} W_{R,i}^{V} \tag{2}$$

Correspondingly, the interaction from the small molecule perspective captures the attention of atoms towards nucleotides, as formulated in Equation (3).

$$\mathrm{CrossHead}_{i}^{M} = \mathrm{Softmax}\left(\frac{M^{c} W_{M,i}^{Q} \left(R^{c} W_{M,i}^{K}\right)^{T}}{\sqrt{d}}\right) R^{c} W_{M,i}^{V} \tag{3}$$

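Here, $R^{c}$ and $M^{c}$ denote the inputs defined in Equation (1); $W_{R,i}^{Q}$, $W_{R,i}^{K}$, $W_{R,i}^{V}$ and $W_{M,i}^{Q}$, $W_{M,i}^{K}$, $W_{M,i}^{V}$ denote the trainable projection matrices for the RNA and small molecule perspectives, respectively; and $\sqrt{d}$ is the scaling factor, which keeps the dot products from saturating the softmax and thereby mitigates gradient vanishing.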

Finally, the outputs from all attention heads are concatenated, processed through a feed-forward layer, and subjected to mean pooling to obtain the integrated features representing atom–nucleotide interactions. These features comprehensively capture fine-grained binding information across four distinct view combinations between RNA and the small molecule: graph–graph, graph–sequence, sequence–graph, and sequence–sequence.
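A single head of the bidirectional cross-attention in Equations (2) and (3) can be sketched in NumPy as below. Several details are assumptions rather than facts from the text: the two views are stacked along the token axis, $d$ is taken to be the per-head dimension, and all sizes and weights are toy values.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_head(queries, context, w_q, w_k, w_v):
    """One head of Equations (2)/(3): queries attend over context."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
n_nt, n_atoms, d_model, d_head = 5, 4, 8, 4      # toy sizes (assumed)

# Stand-ins for the graph-view and sequence-view embeddings of each molecule.
R_g, R_s = rng.normal(size=(n_nt, d_model)), rng.normal(size=(n_nt, d_model))
M_g, M_s = rng.normal(size=(n_atoms, d_model)), rng.normal(size=(n_atoms, d_model))
S_g, S_s = rng.normal(size=d_model), rng.normal(size=d_model)   # segment embeddings

# Equation (1): tag each view with its segment embedding, then stack the views.
R_c = np.concatenate([R_g + S_g, R_s + S_s], axis=0)   # (2 * n_nt, d_model)
M_c = np.concatenate([M_g + S_g, M_s + S_s], axis=0)   # (2 * n_atoms, d_model)


def new_weights():
    """Three random projection matrices (query, key, value) for one head."""
    return [rng.normal(0, 0.1, (d_model, d_head)) for _ in range(3)]


head_R, attn_R = cross_head(R_c, M_c, *new_weights())   # nucleotides attend to atoms
head_M, attn_M = cross_head(M_c, R_c, *new_weights())   # atoms attend to nucleotides

print(head_R.shape, attn_R.shape)   # (10, 4) (10, 8)
print(head_M.shape, attn_M.shape)   # (8, 4) (8, 10)
```

Under the stacking assumption, the four blocks of `attn_R` correspond to the four view combinations: for example, `attn_R[:n_nt, :n_atoms]` holds the graph–graph weights and `attn_R[:n_nt, n_atoms:]` the graph–sequence weights.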

Additionally, to enhance the model’s generalization capability, semantic embeddings extracted from the two pre-trained language models—RNAErnie for RNA and ChemBERTa for the small molecule—are incorporated, and their averaged representation is combined with the interaction features.

2.3.2. Structure-Guided Multi-Level Interaction

To capture hierarchical structural information of RNA and the small molecule, multi-layer graph neural networks are employed to extract structural features at different depths, and a full-combination interaction schema is constructed. Specifically, three-layer stacked architectures based on the Graph Attention Network (GAT) and the Graph Convolutional Network (GCN) are used to generate three levels of RNA structural features [$R_{1}$, $R_{2}$, $R_{3}$] and small molecule structural features [$M_{1}$, $M_{2}$, $M_{3}$], respectively. Here, $R_{1}$ captures local nucleotide contact relationships, $R_{2}$ encodes intermediate-scale structural module interactions, and $R_{3}$ characterizes the global topological organization of the RNA. Correspondingly, $M_{1}$ reflects local atomic bonding environments, $M_{2}$ represents connectivity patterns between functional groups, and $M_{3}$ describes the overall chemical architecture of the small molecule.

To comprehensively model cross-interactions between different structural hierarchies of the RNA and the small molecule, all pairwise combinations (the Cartesian product) of the multi-level RNA features and small molecule features are formed, resulting in nine structure interaction pairs, as expressed in Equation (4).

$$P = \{[R_{1}, M_{1}], [R_{1}, M_{2}], [R_{1}, M_{3}], [R_{2}, M_{1}], [R_{2}, M_{2}], [R_{2}, M_{3}], [R_{3}, M_{1}], [R_{3}, M_{2}], [R_{3}, M_{3}]\} \tag{4}$$

For each interaction pair, the inter-level interaction strength is first computed via element-wise product. The result is then passed through a shared linear projection layer to unify the feature dimensions. Finally, a weighted summation is performed using learnable coefficients to obtain the integrated feature representation of the structure-guided multi-level interactions. The corresponding calculation is given by Equation (5):

$$F_{\mathrm{struct}} = \sum_{k=1}^{3} \sum_{l=1}^{3} \alpha_{kl} \cdot W_{\mathrm{map}} \cdot (R_{k} \odot M_{l}) \tag{5}$$

In the equation, $\odot$ denotes the element-wise product operation, and the coefficient $\alpha_{kl}$ is adaptively learned during training to emphasize hierarchical interaction pairs that contribute more significantly to binding affinity.
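Equation (5) can be written out directly as a double loop over the nine level pairs. The sketch below assumes each $R_{k}$ and $M_{l}$ has already been pooled to a vector of a common width (the text does not say how differing layer widths are reconciled), and it uses uniform coefficients and random weights in place of learned ones.

```python
import numpy as np


def structure_interaction(rna_levels, mol_levels, alpha, w_map):
    """Equation (5): weighted sum over all level pairs of W_map (R_k * M_l)."""
    f_struct = np.zeros(w_map.shape[0])
    for k, r_k in enumerate(rna_levels):
        for l, m_l in enumerate(mol_levels):
            f_struct += alpha[k, l] * (w_map @ (r_k * m_l))
    return f_struct


rng = np.random.default_rng(0)
d, d_out = 16, 8                                       # toy sizes (assumed)
rna_levels = [rng.normal(size=d) for _ in range(3)]    # pooled R1, R2, R3
mol_levels = [rng.normal(size=d) for _ in range(3)]    # pooled M1, M2, M3
alpha = np.full((3, 3), 1.0 / 9.0)                     # learnable in the model; uniform here
w_map = rng.normal(0, 0.1, (d_out, d))                 # shared linear map

f_struct = structure_interaction(rna_levels, mol_levels, alpha, w_map)
print(f_struct.shape)   # (8,)
```

Because the linear map is shared across pairs, the same result can be obtained by summing the weighted element-wise products first and applying `w_map` once, which is cheaper when the number of levels grows.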

2.3.3. Adaptive Fusion via Gating Network

To balance the contributions of the two interaction pathways, a gating unit based on a sigmoid activation function is designed to dynamically weight the features $F_{\text{atom-nuc}}$ and $F_{\mathrm{struct}}$ and to output the final interaction representation. The features from both pathways are concatenated and fed into a single-layer perceptron to generate the gating weight $g$, as given by Equation (6):

$$g = \sigma\left(W_{g} \cdot [F_{\text{atom-nuc}}, F_{\mathrm{struct}}]\right) \tag{6}$$

In Equation (6), $W_{g}$ denotes the gating weight matrix, $\sigma$ represents the sigmoid activation function, and $g$ serves as the fusion weight for the atom–nucleotide interaction features. In Equation (7), $1 - g$ is assigned as the weight for the structural multi-level interaction features, and the final interaction feature passed to the affinity prediction module is the weighted sum of the two pathways:

$$F_{\mathrm{final}} = g \cdot F_{\text{atom-nuc}} + (1 - g) \cdot F_{\mathrm{struct}} \tag{7}$$

This integrated representation both preserves fine-grained atom–nucleotide binding information and incorporates global interaction patterns across multiple structural hierarchies, thereby establishing a solid foundation for accurately predicting RNA–small molecule binding affinity.
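The gated fusion of Equations (6) and (7) reduces to a few lines. The sketch assumes a scalar gate, so that $W_{g}$ is a single weight vector over the concatenated features, and omits a bias term because Equation (6) shows none; the input vectors are toy values.

```python
import numpy as np


def gated_fusion(f_atom_nuc, f_struct, w_g):
    """Equations (6)-(7) with a scalar gate g in (0, 1)."""
    z = w_g @ np.concatenate([f_atom_nuc, f_struct])
    g = 1.0 / (1.0 + np.exp(-z))                     # sigmoid
    return g * f_atom_nuc + (1.0 - g) * f_struct, g


f_atom_nuc = np.array([1.0, 0.0, 2.0])
f_struct = np.array([3.0, 4.0, 0.0])

# A zero weight vector gives g = 0.5, i.e. a plain average of the two pathways.
fused, g = gated_fusion(f_atom_nuc, f_struct, np.zeros(6))
print(g, fused)   # 0.5 [2. 2. 1.]
```

A gate value below 0.5 shifts the fused representation toward the structural pathway, and a value above 0.5 shifts it toward the atom–nucleotide pathway.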

3. Results

3.1. Datasets and Baselines

Two types of datasets were employed in this study to comprehensively evaluate the model’s fitting capability and generalization performance: one for cross-validation and the other for independent testing.

The R-SIM dataset, sourced from Krishnan et al. (2023) [31], serves as a common benchmark in the field of RNA–small molecule binding affinity prediction and was used for cross-validation. After preprocessing, 1439 valid instances were retained, covering 341 distinct RNA structures and 749 different small molecules. The binding affinity was quantified using the negative logarithm of the dissociation constant (pKd), where a higher pKd value indicates stronger binding.

The HIV-1 TAR RNA independent dataset from Cai et al. (2022) [32] was used for independent testing. This dataset contains binding affinities for 48 compounds interacting with HIV-1 trans-activation response (TAR) RNA, measured via surface plasmon resonance (SPR). To ensure test independence (i.e., no data overlap between training and test sets), only 282 viral RNA-related samples from the R-SIM dataset were initially selected as the training set. Furthermore, sequence similarity filtering was applied to remove samples in the initial training set that showed similarity to RNAs or small molecules in the independent test set. This process resulted in an optimized training set of 141 RNA–small molecule pairs with affinity scores, while the 48 samples in the independent test set remained unchanged.

A total of ten mainstream methods from three categories were selected as baselines to ensure comprehensive and representative comparisons, covering classical machine learning, deep learning, and dedicated drug-target prediction approaches. These include Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGBoost), Graph Convolutional Network (GCN), Graph Attention Network (GAT), Transformer, DeepCDA, DeepDTAF, GraphDTA, and DeepRSMA.

To evaluate the performance of MCARSMA, three commonly used metrics were employed to quantify the prediction accuracy, all computed based on the discrepancy between the predicted pKd values and the ground-truth pKd values: Root Mean Square Error (RMSE), Pearson Correlation Coefficient (PCC), and Spearman’s Rank Correlation Coefficient (SCC). Furthermore, to guide the training process, we designed a multi-level fusion loss function that jointly optimizes predictions at different hierarchical levels. This loss function integrates prediction errors from the main fusion pathway, the graph-level pathway, and the node-level pathway, thereby constraining the model’s learning process at both global and local scales and enhancing consistency and robustness in multi-scale feature fusion. Specifically, denoting the main pathway prediction as ŷ_main, the graph-level prediction as ŷ_graph, and the node-level prediction as ŷ_node, with y representing the ground-truth label, the overall loss function is defined as follows:

$$L_{\mathrm{total}} = \alpha\, L(\hat{y}_{\mathrm{main}}, y) + \beta\, L(\hat{y}_{\mathrm{graph}}, y) + \gamma\, L(\hat{y}_{\mathrm{node}}, y) \tag{8}$$

In the equation, α, β, and γ denote the weighting coefficients for the main pathway loss, graph-level loss, and node-level loss, respectively, which balance the contributions of different hierarchical representations.

The base loss $L$ for each level adopts a weighted combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE), aiming to jointly enhance model stability and robustness to outliers. It is defined as follows:

$$L(\hat{y}, y) = (1 - \lambda) \cdot \mathrm{MSE}(\hat{y}, y) + \lambda \cdot \mathrm{MAE}(\hat{y}, y) \tag{9}$$

In the equation, λ represents the balancing coefficient. When λ = 0, the loss reduces to a pure MSE objective, while when λ = 1, it becomes a pure MAE loss. This design allows the model to leverage the MSE term for faster convergence during early training stages, while increasingly relying on the MAE term in later phases to mitigate the influence of outliers, thereby achieving more robust regression performance.
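Equations (8) and (9) can be checked numerically with a few lines of plain Python. The coefficient values used as defaults below (`alpha`, `beta`, `gamma`, `lam`) are placeholders, since the text does not report the settings used in the experiments.

```python
def base_loss(pred, target, lam):
    """Equation (9): (1 - lam) * MSE + lam * MAE."""
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / n
    return (1 - lam) * mse + lam * mae


def total_loss(y_main, y_graph, y_node, y, alpha=1.0, beta=0.5, gamma=0.5, lam=0.5):
    """Equation (8): weighted sum of the three pathway losses (placeholder weights)."""
    return (alpha * base_loss(y_main, y, lam)
            + beta * base_loss(y_graph, y, lam)
            + gamma * base_loss(y_node, y, lam))


pred, target = [1.0, 3.0], [0.0, 0.0]       # MSE = 5.0, MAE = 2.0
print(base_loss(pred, target, lam=0.0))     # 5.0 (pure MSE)
print(base_loss(pred, target, lam=1.0))     # 2.0 (pure MAE)
print(base_loss(pred, target, lam=0.5))     # 3.5
```

The example also shows why the two terms behave differently: the error of 3.0 contributes 9.0 to the squared term but only 3.0 to the absolute term, so a larger λ reduces the influence of large residuals.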

3.2. Cross-Validation Results

To evaluate the fitting accuracy and stability of MCARSMA within known data distributions, we first performed stratified 5-fold cross-validation on the R-SIM dataset. The number of training epochs was uniformly set to 1500. Using Root Mean Square Error (RMSE), Pearson Correlation Coefficient (PCC), and Spearman’s Correlation Coefficient (SCC) as the primary metrics, we compared MCARSMA against ten baseline methods. The results are summarized in Table 1.
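For reference, the three metrics can be computed as in the following dependency-free sketch. The pKd values are invented toy numbers used only to exercise the functions; they are not taken from the R-SIM dataset.

```python
import math


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


true_pkd = [4.0, 5.0, 6.0, 7.0]     # toy values
pred_pkd = [4.5, 5.0, 6.5, 9.0]     # toy values

print(round(rmse(pred_pkd, true_pkd), 3))       # 1.061
print(round(pearson(pred_pkd, true_pkd), 3))    # 0.958
print(round(spearman(pred_pkd, true_pkd), 3))   # 1.0
```

The toy case illustrates how the metrics can disagree: the predictions preserve the ranking perfectly (SCC = 1) while a single large residual inflates the RMSE.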

In the 5-fold cross-validation regression task, MCARSMA achieved the lowest RMSE among all compared methods, with a value of 0.883. This represents a 2.3% improvement over DeepRSMA and a 4.2% improvement over the second-best baseline, XGBoost. On the correlation metrics (PCC and SCC), its values were slightly lower than those of DeepRSMA, but the differences were marginal, indicating comparable overall performance.

These results indicate that the RNA–small molecule binding affinities predicted by MCARSMA exhibit a smaller average deviation from the experimental values. This enhancement can be attributed to the model’s dual-path architecture, which integrates fine-grained atom–nucleotide interactions with structure-guided multi-level interactions and thereby captures cross-tier binding patterns that are not sufficiently modeled in DeepRSMA. Furthermore, the gating network enables adaptive feature fusion, optimizing the weighting of critical interaction information and reducing the loss of key binding signals.

3.3. Independent Test Results

Independent testing serves as a critical step for evaluating model generalization. In this study, experiments were conducted on the HIV-1 TAR RNA independent dataset. The training set consisted of 141 viral RNA samples from the R-SIM dataset after similarity filtering (consistent with the setup in the original study), in order to validate the predictive capability of MCARSMA on out-of-distribution data. The results are presented in Table 2.

The independent test results further confirmed the superior generalization of MCARSMA, which achieved an RMSE of 0.908, outperforming the DeepRSMA and Transformer baselines by 1.3% and 6.2%. Its PCC and SCC scores were on par with those of DeepRSMA.

In the independent test, the generalization advantage of MCARSMA became even more pronounced: it achieved an RMSE of 0.911, representing a 0.9% reduction compared to DeepRSMA and a 5.8% improvement over the second-best baseline, Transformer. This result significantly outperforms those of traditional machine learning baselines and also surpasses simpler deep learning approaches such as GCN and GAT, demonstrating that MCARSMA effectively captures universal patterns of RNA–small molecule binding rather than merely overfitting to the specific distribution of the training set.

3.4. Ablation Study

To evaluate the contribution of each component in MCARSMA, we constructed several ablated variants and assessed their performance using 5-fold cross-validation. The evaluated variants included: a no-sequence model, a no-structure model, a node-level-only model, a graph-level-only model, a no-interaction model, and the complete model. The corresponding results are presented in Table 3.

As evidenced by the results in Table 3, each component of the MCARSMA model contributes significantly to its overall performance. The removal of either sequence features (no-sequence variant) or structural information (no-structure variant) leads to a noticeable decline in model performance, indicating that sequence and structural information play complementary roles in affinity prediction. In terms of interaction modeling, both the node-level-only and graph-level-only variants outperform the no-interaction model, suggesting that cross-modal feature alignment—whether at the local (node) or global (graph) level—effectively promotes the integration of RNA and small molecule representations. In comparison, the full model incorporating both node-level and graph-level interactions achieves the best performance (RMSE = 0.883, PCC = 0.772, SCC = 0.773), further validating that the multi-level interaction mechanism can more comprehensively capture cross-modal correlations, thereby improving prediction accuracy and stability. Overall, these ablation results demonstrate the rationality and necessity of the individual modules in MCARSMA.

3.5. Computational Efficiency Analysis

To comprehensively assess the computational efficiency of MCARSMA in a practical environment, we systematically evaluated the model from several aspects: training time, per-sample inference time, peak GPU memory usage, and parameter count. The experimental results are summarized in Table 4.

As observed from Table 4, despite integrating multiple heterogeneous feature encoders and a hierarchical interaction mechanism, the overall computational cost of MCARSMA remains at a manageable level. During training, the time consumed per epoch for MCARSMA is 12.3 s, which is closely aligned with the 11.8 s of the baseline model, DeepRSMA. The inference time per sample is 7.25 milliseconds, also remaining at the millisecond level, indicating that the proposed interaction architecture does not introduce significant inference latency.

In terms of memory consumption, the peak GPU memory usage of MCARSMA is 1752 MB, which is lower than the 1838 MB of DeepRSMA. Despite the introduction of multi-view feature fusion and structural interaction modules, the memory footprint remains well-controlled. Furthermore, the parameter size of MCARSMA is 7.86 MB, only slightly higher than the 7.39 MB of its simplified variant without pre-trained language models, demonstrating good parameter efficiency. It is noteworthy that although MCARSMA integrates multiple structural and sequential encoders, its actual computational overhead remains on the same order of magnitude as that of a single-structure model, indicating that the multi-module design does not lead to a significant engineering burden.

Overall, MCARSMA retains its expressive power and multi-modal modeling capability while remaining efficient in both time and memory. This balance suggests that the model is suitable for conventional experimental environments and may also be deployable on larger-scale datasets or under resource-constrained conditions.

3.6. Analysis of Gating Behavior and Model Decision-Making

To evaluate the efficacy and stability of the gated fusion strategy, we analyzed how the gating weight g (the weight assigned to the atom–nucleotide pathway) evolves during training and where it settles. The results are shown in Figure 2. The distribution of g is unimodal with low variance (mean = 0.314, median = 0.312, standard deviation = 0.042), and the value converges to a stable regime after approximately 100 training epochs. The gating mechanism therefore operates stably without collapsing to 0 or 1.
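The statistics reported for g can be computed from logged gate values with the standard library. The sketch below is illustrative: the list `g_values` is made-up data rather than the paper's, and the collapse threshold `eps` is an assumed choice, not one stated by the authors.

```python
import statistics


def summarize_gate(values, eps=0.05):
    """Summary statistics and a simple collapse check for gate weights in [0, 1]."""
    lowest, highest = min(values), max(values)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
        "min": lowest,
        "max": highest,
        # Flag a collapse when any gate value lies within eps of 0 or 1.
        "collapsed": lowest < eps or highest > 1.0 - eps,
    }


# Hypothetical gate values for illustration only.
g_values = [0.27, 0.30, 0.31, 0.32, 0.35]
summary = summarize_gate(g_values)
print(summary)
```

A gate that saturates near 0 or 1 would be flagged by the `collapsed` field, which is the failure mode the analysis above rules out.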

These results suggest that, on the current benchmark dataset, RNA–small molecule binding affinity may be driven predominantly by global structural complementarity (governed by the structural pathway with a weight of 1 − g ≈ 0.69), rather than by atom-level specific pairing interactions alone.
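The pathway weights g and 1 − g quoted above are consistent with a convex combination of the two pathway outputs. The sketch below assumes that form, with a scalar gate obtained from a sigmoid over a linear projection of the concatenated features. The exact formulation is the one given in Section 2.3.3 and may differ; every name here (`gated_fusion`, `h_atom`, `h_struct`, `w`, `b`) is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(h_atom, h_struct, w, b):
    """Fuse two equal-length feature vectors with a scalar gate.

    Assumed form: g = sigmoid(w . [h_atom; h_struct] + b), where g weights the
    atom-nucleotide pathway and (1 - g) weights the structural pathway.
    """
    concat = list(h_atom) + list(h_struct)
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    fused = [g * a + (1.0 - g) * s for a, s in zip(h_atom, h_struct)]
    return g, fused


# Toy inputs: zero weights and a bias chosen so the gate equals the reported mean of 0.314.
h_atom, h_struct = [1.0, 0.0], [0.0, 1.0]
g, fused = gated_fusion(h_atom, h_struct, w=[0.0] * 4, b=math.log(0.314 / 0.686))
print(round(g, 3), [round(v, 3) for v in fused])
```

With g near 0.314, roughly 69% of each fused feature comes from the structural pathway, while the atom–nucleotide pathway still contributes a non-zero share.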

This finding is consistent with established principles of RNA-targeted drug design, in which many successful small molecule inhibitors act by recognizing and binding to specific secondary structure pockets, such as bulges and internal loops. The model appears to have learned this pattern directly from the data. The trajectory in the lower panel of Figure 2, which stabilizes after approximately 100 training epochs, further indicates that the training process is robust.

It is noteworthy that despite the low value of g, the atom–nucleotide pathway remains active (with a minimum g value of 0.207, consistently greater than zero), and its subtle contribution is integrated into the final prediction. This ensures that the model retains sufficient modeling capacity for scenarios where binding events are predominantly governed by key atomic interactions. The unimodal distribution of g values, as shown in the upper panel of Figure 2, further corroborates the robustness and interpretability of the gating behavior.

In summary, the gating mechanism proves to be stable and effective. It functions as an adaptive fusion tool that enables the model to learn from data and assign higher decision-making weight to the more informative feature pathway (the structural pathway in this case), while preserving the complementary modeling capacity of fine-grained interactions, thereby enhancing the overall robustness of predictions.

4. Discussion

The results of this study indicate that multi-level cross-modal fusion is an effective strategy for improving RNA–small molecule binding affinity prediction. The main contribution of MCARSMA is that it integrates information from different modalities and scales into a unified framework for interactive learning. On the R-SIM benchmark dataset, the model attains an RMSE of 0.883, the lowest among the compared methods in five-fold cross-validation (Table 1), which supports its potential for handling complex biological information.

In-depth analysis reveals that the performance improvement can be attributed to two key design elements: firstly, the dual-path architecture simultaneously captures local atomic interactions and global structural features; secondly, the gated adaptive weighting mechanism optimizes the contribution of multi-modal features. Ablation studies provide strong evidence for these conclusions.

The significance of this work lies in the development of an interpretable deep learning framework, which provides a new tool for targeting traditionally “undruggable” RNA molecules. Future work will focus on integrating dynamic conformational information to further enhance the model’s utility in real-world drug discovery scenarios.

Author Contributions

Conceptualization, Y.L. and X.W.; methodology, Y.L. and Y.Z.; software, Y.Z. and M.W.; validation, Y.Z. and M.W.; data curation, L.Z. and R.W.; writing—original draft preparation, Y.L., Y.Z. and M.W.; writing—review and editing, Y.L., Y.Z., L.Z. and R.W.; supervision, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research Project of Colleges and Universities of Henan Province (No. 22A520013, No. 23B520004), the Key Science and Technology Development Program of Henan Province (No. 232102210020, No. 252102210137), and the Training Program of Young Backbone Teachers in Colleges and Universities of Henan Province (No. 2019GGJS132).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. We used two datasets to evaluate the proposed method; both were collected and compiled by other research groups, not by us. Because of the associated privacy and confidentiality restrictions, we are unable to provide or distribute these data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bartel, D.P. MicroRNAs: Target recognition and regulatory functions. Cell 2009, 136, 215–233. [Google Scholar] [CrossRef]
Cui, L.; Ma, R.; Cai, J.; Guo, C.; Chen, Z.; Yao, L.; Wang, Y.; Fan, R.; Shi, X.W.Y. RNA modifications: Importance in immune cell biology and related diseases. Signal Transduct. Target. Ther. 2022, 7, 334. [Google Scholar] [CrossRef] [PubMed]
Higgs, P.G.; Lehman, N. The RNA World: Molecular cooperation at the origins of life. Nat. Rev. Genet. 2015, 16, 7–17. [Google Scholar] [CrossRef] [PubMed]
Ottesen, E.W. ISS-N1 makes the First FDA-approved Drug for Spinal Muscular Atrophy. Transl. Neurosci. 2017, 8, 1–6. [Google Scholar] [CrossRef] [PubMed]
Warner, K.D.; Hajdin, C.E.; Weeks, K.M. Principles for targeting RNA with drug-like small molecules. Nat. Rev. Drug Discov. 2018, 17, 547–558. [Google Scholar] [CrossRef]
Childs-Disney, J.L.; Yang, X.; Gibaut, Q.M.R.; Tong, Y.; Batey, R.T.; Disney, M.D. Targeting RNA structures with small molecules. Nat. Rev. Drug Discov. 2022, 21, 736–762. [Google Scholar] [CrossRef]
Jiménez, J.; Škalič, M.; Martínez-Rosell, G.; DeFabritiis, G. KDEEP: Protein-Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. J. Chem. Inf. Model. 2018, 58, 287–296. [Google Scholar] [CrossRef]
Jones, D.; Kim, H.; Zhang, X.; Stevenson, A.Z.; Bennett, W.F.D.; Kirshner, D.; Wong, S.E.; Lightstone, F.C.; Allen, J.E. Improved Protein-Ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference. J. Chem. Inf. Model. 2021, 61, 1583–1592. [Google Scholar] [CrossRef]
Morley, S.D.; Afshar, M. Validation of an empirical RNA-ligand scoring function for fast flexible docking using Ribodock. J. Comput. Aided. Mol. Des. 2004, 18, 189–208. [Google Scholar] [CrossRef]
Ruiz-Carmona, S.; Alvarez-Garcia, D.; Foloppe, N.; Garmendia-Doval, A.B.; Juhos, S.; Schmidtke, P.; Barril, X.; Hubbard, R.E.; Morley, S.D. rDock: A fast, versatile and open source program for docking ligands to proteins and nucleic acids. PLoS Comput. Biol. 2014, 10, e1003571. [Google Scholar] [CrossRef]
Trott, O.; Olson, A.J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461. [Google Scholar] [CrossRef]
Goodsell, D.S.; Sanner, M.F.; Olson, A.J.; Forli, S. The AutoDock suite at 30. Protein Sci. 2021, 30, 31–43. [Google Scholar] [CrossRef] [PubMed]
Lang, P.T.; Brozell, S.R.; Mukherjee, S.; Pettersen, E.F.; Meng, E.C.; Thomas, V.; Rizzo, R.C.; Case, D.A.; James, T.L.; Kuntz, I.D. DOCK 6: Combining techniques to model RNA-small molecule complexes. RNA 2009, 15, 1219–1230. [Google Scholar] [CrossRef] [PubMed]
Feng, Y.; Zhang, K.; Wu, Q.; Huang, S.Y. NLDock: A Fast Nucleic Acid-Ligand Docking Algorithm for Modeling RNA/DNA-Ligand Complexes. J. Chem. Inf. Model. 2021, 61, 4771–4782. [Google Scholar] [CrossRef] [PubMed]
Philips, A.; Milanowska, K.; Lach, G.; Bujnicki, J.M. LigandRNA: Computational predictor of RNA-ligand interactions. RNA 2013, 19, 1605–1616. [Google Scholar] [CrossRef]
Yan, Z.; Wang, J. SPA-LN: A scoring function of ligand-nucleic acid interactions via optimizing both specificity and affinity. Nucleic. Acids Res. 2017, 45, e110. [Google Scholar] [CrossRef]
Ludwiczak, O.; Antczak, M.; Szachniuk, M. Assessing interface accuracy in macromolecular complexes. PLoS ONE 2025, 20, e0319917. [Google Scholar] [CrossRef]
Nithin, C.; Kmiecik, S.; Błaszczyk, R.; Nowicka, J.; Tuszyńska, I. Comparative analysis of RNA 3D structure prediction methods: Towards enhanced modeling of RNA-ligand interactions. Nucleic Acids Res. 2024, 52, 7465–7486. [Google Scholar] [CrossRef]
Pfeffer, P.; Gohlke, H. DrugScore: RNA-knowledge-based scoring function to predict RNA-ligand interactions. J. Chem. Inf. Model. 2007, 47, 1868–1876. [Google Scholar] [CrossRef]
Tor, Y. Targeting RNA with small molecules. ChemBioChem 2003, 4, 998–1007. [Google Scholar] [CrossRef]
Kairys, V.; Baranauskiene, L.; Kazlauskiene, M.; Matulis, D.; Kazlauskas, E. Binding affinity in drug design: Experimental and computational techniques. Expert Opin. Drug Discov. 2019, 14, 755–768. [Google Scholar] [CrossRef] [PubMed]
Guilbert, C.; James, T.L. Docking to RNA via root-mean-square-deviation-driven energy minimization with flexible ligands and flexible targets. J. Chem. Inf. Model. 2008, 48, 1257–1268. [Google Scholar] [CrossRef] [PubMed]
Feng, Y.; Huang, S.Y. ITScore-NL: An Iterative Knowledge-Based Scoring Function for Nucleic Acid-Ligand Interactions. J. Chem. Inf. Model. 2020, 60, 6698–6708. [Google Scholar] [CrossRef] [PubMed]
Grimberg, H.; Tiwari, V.S.; Tam, B.; Gur-Arie, L.; Gingold, D.; Polachek, L.; Akabayov, B. Machine learning approaches to optimize small-molecule inhibitors for RNA targeting. J. Cheminform. 2022, 14, 4. [Google Scholar] [CrossRef]
Wang, K.; Zhou, R.; Li, Y.; Li, M. DeepDTAF: A deep learning method to predict protein-ligand binding affinity. Brief. Bioinform. 2021, 22, bbab072. [Google Scholar] [CrossRef]
Krishnan, S.R.; Roy, A.; Gromiha, M.M. Reliable method for predicting the binding affinity of RNA-small molecule interactions using machine learning. Brief. Bioinform. 2024, 25, bbae002. [Google Scholar] [CrossRef]
Zhijian, H.; Yucheng, W.; Song, C.; Tan, Y.S.; Deng, L.; Wu, M. DeepRSMA: A cross-fusion-based deep learning method for RNA–small molecule binding affinity prediction. Bioinformatics 2024, 12, btae678. [Google Scholar]
Lorenz, R.; Bernhart, S.H.; Siederdissen, C.H.Z.; Tafer, H.; Flamm, C.; Stadler, P.F.; Hofacker, I.L. ViennaRNA Package 2.0. Algorithms Mol. Biol. 2011, 6, 26. [Google Scholar] [CrossRef]
Wang, N.; Bian, J.; Li, Y.; Li, X.; Mumtaz, S.; Kong, L.; Xiong, H. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nat. Mach. Intell. 2024, 6, 548–557. [Google Scholar] [CrossRef]
Chithrananda, S.; Gabriel, G.; Bharath, R. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv 2020, arXiv:2010.09885. [Google Scholar]
Krishnan, S.R.; Roy, A.; Gromiha, M.M. R-SIM: A database of binding affinities for RNA-small molecule interactions. J. Mol. Biol. 2023, 435, 167914. [Google Scholar] [CrossRef]
Cai, Z.; Zafferani, M.; Akande, O.M.; Hargrove, A.E. Quantitative Structure–Activity Relationship (QSAR) Study Predicts Small-Molecule Binding to RNA Structure. J. Med. Chem. 2022, 65, 7262–7277. [Google Scholar] [CrossRef]

Figure 1. The overall workflow of the MCARSMA framework. (A) Semantic feature extraction for RNA sequences and SMILES strings of small molecules using the large language models RNAErnie and ChemBERTa, respectively. (B) Illustration of the multi-level cross-modal fusion template and the adaptive feature weighting via a gating network.

Figure 2. The upper panel displays the global statistical distribution of the g-values upon the completion of all training epochs. The lower panel illustrates the dynamic variation of the g-values as the number of training epochs increases.

Table 1. Comparative Evaluation of Our Proposed MCARSMA Model Against Baseline Models Using Five-Fold Cross-Validation.

Method	RMSE	PCC	SCC
SVM	0.994	0.706	0.714
KNN	1.038	0.671	0.684
XGBoost	0.922	0.755	0.765
GCN	1.046	0.715	0.717
GAT	1.012	0.715	0.716
Transformer	1.067	0.699	0.695
DeepCDA	0.982	0.746	0.743
DeepDTAF	0.957	0.751	0.747
GraphDTA	0.928	0.772	0.773
DeepRSMA	0.904	0.784	0.786
MCARSMA	0.883	0.772	0.773

Note: The best performance for each metric is highlighted in boldface.
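The three metrics reported in the tables are RMSE, the Pearson correlation coefficient (PCC), and the Spearman correlation coefficient (SCC). The sketch below shows one standard way to compute them in plain Python; it is not the authors' evaluation code, and the sample values are invented for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented affinities and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
```

Lower RMSE is better, while higher PCC and SCC are better; SCC depends only on the ordering of predictions, so a model can rank compounds well and still have a large RMSE.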

Table 2. Comparative Evaluation of the MCARSMA Model Against Baseline Models on an Independent Test Set.

Method	RMSE	PCC	SCC
SVM	1.116	−0.101	−0.090
KNN	1.144	0.097	−0.012
XGBoost	1.383	−0.169	−0.209
GCN	1.025	0.297	0.409
GAT	1.017	0.258	0.381
Transformer	0.968	0.396	0.412
DeepCDA	1.025	0.305	0.293
DeepDTAF	1.106	0.077	0.052
GraphDTA	1.012	0.301	0.316
DeepRSMA	0.920	0.490	0.499
MCARSMA	0.908	0.488	0.484

Note: The best performance for each metric is highlighted in boldface.

Table 3. Ablation Study Results for the MCARSMA Model.

Variant	RMSE	PCC	SCC
no-sequence	0.953	0.755	0.733
no-structure	0.971	0.751	0.731
node-level-only	0.912	0.763	0.741
graph-level-only	0.922	0.766	0.743
no-interaction	0.965	0.758	0.732
complete	0.883	0.772	0.773

Note: The best performance for each metric is highlighted in boldface.

Table 4. Comparison of Training Time, Inference Time, Peak GPU Memory, and Parameter Size Across Different Models.

Model	Train Time/Epoch (s)	Inference Time/Sample (ms)	GPU Memory Peak (MB)	Parameters (MB)
MCARSMA (Ours)	12.3	7.25	1752	7.86
MCARSMA (w/o LM)	12.1	7.11	1741	7.39
DeepRSMA	11.8	6.41	1838	3.56
