1. Introduction
Molecular property prediction plays a central role in drug discovery, materials design, and chemical informatics, enabling the efficient estimation of physicochemical properties and biological activities without extensive experimental measurements [1,2]. The core challenge of this task lies in molecular representation learning: how to encode molecular structures into expressive and transferable representations that faithfully capture structure–property relationships.
Molecules are naturally described as graphs, with atoms as nodes and chemical bonds as edges [3]. Consequently, graph neural networks (GNNs) have become a widely adopted paradigm for molecular representation learning, achieving strong performance across a wide range of benchmark datasets [4]. To mitigate the scarcity of labeled molecular data, recent studies have further introduced self-supervised pretraining strategies inspired by advances in natural language processing, such as masked language modeling [5]. In these approaches, atom types or local graph components are typically masked and reconstructed, encouraging models to learn general-purpose molecular representations from large-scale unlabeled data [6,7].
Despite their success, existing molecular pretraining frameworks still exhibit limitations that constrain their representational capacity. Firstly, most atom-level pretraining objectives rely on a small and highly imbalanced atom vocabulary, where a few atom types, such as carbon, dominate molecular datasets. This leads to limited supervision diversity and potentially biased learning signals [8]. Recent studies have further revealed that self-supervised pretraining does not consistently yield performance gains and can even induce negative transfer, a phenomenon largely attributed to the sparse or redundant information content of atom-level representations [9]. To enrich the supervisory signal, several works have introduced substructure- or fragment-level pretraining objectives, such as masking and predicting functional groups or molecular motifs [10,11]. However, these approaches often depend on predefined or dataset-specific structural units, raising concerns regarding their generality and robustness. Overall, existing pretraining research has predominantly focused on “how to design pretraining tasks,” while the question of “which representation unit is most suitable for pretraining,” and how that unit interacts with the underlying model architecture, remains underexplored.
Secondly, GNN-based architectures primarily depend on local message passing and are known to suffer from the over-squashing phenomenon, which limits their efficiency in capturing long-range dependencies such as extended conjugation effects or remote non-covalent interactions [12]. To address these atom-centric limitations, a line of research has shifted toward explicitly modeling chemical bonds and their directional properties. Architectures such as Hybrid Directional Graph Neural Networks (HDGNN) [13] incorporate directional information, while Directed Message Passing Neural Networks [14] use directed edges to prevent information backtracking. Line-graph-based GNNs further transform molecular graphs into line graphs to capture bond-level interactions and angular information [14]. These studies demonstrate the importance of bond- and edge-centric molecular modeling, indicating that the use of bonds as representation units is not itself a new concept. However, most existing bond-aware or line-graph-based methods are still implemented within graph message-passing or graph-convolutional frameworks, where capturing long-range dependencies generally requires multiple propagation layers [14,15]. This motivates the exploration of sequence-style formulations that can retain bond-level topological information while benefiting from the global receptive field of attention-based architectures.
Thirdly, sequence-based Transformer models built on SMILES representations provide global receptive fields and have achieved strong performance through large-scale pre-training, as demonstrated by MoLFormer-XL [16]. However, this success raises a fundamental question: does such dominance reflect an optimal representation choice, or does it arise primarily from the capacity of Transformers to compensate for syntactic imperfections through scale? Even with more robust descriptors like SELFIES [17], these approaches fundamentally treat molecules as one-dimensional strings, where learned correlations often reflect statistical co-occurrence patterns rather than the intrinsic chemical bond connectivity that governs molecular properties. Graph Transformers attempt to bridge this gap by injecting molecular topology through soft positional or distance biases [16,18]. Yet, these soft constraints tend to entangle local connectivity with long-range correlations, often struggling to prioritize the specific bond pathways that are essential for chemical reasoning. Consequently, there remains a need for a representation that can more faithfully interface with molecular topology while retaining the modeling flexibility of sequence-based architectures.
To address these limitations, in this work we explore a bond-level sequence formulation for Transformer-based molecular modeling. Bonds naturally encode relational information between atoms and exhibit substantially greater combinatorial diversity than atom types alone, which may help alleviate the issue of limited and imbalanced vocabularies. More fundamentally, chemical bonds serve as important pathways for chemical connectivity and electronic interactions. Unlike spatial proximity, which acts as a noisy proxy for interaction and may conflate geometric nearness with chemical connectivity, bond pathways define chemically meaningful routes of structural interaction. By modeling molecules as bond sequences, we can enforce clear structural constraints that guide models toward prioritizing these chemically meaningful paths without resorting to explicit graph-based message passing.
Motivated by these observations, herein we propose a bond-sequence representation framework for molecular property prediction, built entirely on attention mechanisms. Molecules are represented as sequences of bond-level tokens. Unlike methods that mainly rely on soft biases to encode structural information [18], we introduce a structure-aware hybrid attention mechanism that explicitly imposes hard topological constraints on a subset of attention heads to enforce local bond connectivity, while preserving unrestricted global attention in the remaining heads. This design aims to separate short-range chemical bonding from long-range contextual dependencies, enabling the model to jointly capture local structural priors and global molecular context within a unified Transformer architecture.
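The split between constrained and unconstrained heads can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the choice to constrain the first `n_local_heads` heads, and the decision to always let a bond attend to itself are assumptions made only for illustration.

```python
import math

NEG_INF = float("-inf")


def masked_softmax(scores, allowed):
    """Softmax that assigns exactly zero weight to disallowed positions."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite because each row always allows itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_attention(scores_per_head, adjacency, n_local_heads):
    """Return attention weights indexed as [head][query bond][key bond]."""
    n = len(adjacency)
    weights = []
    for h, head_scores in enumerate(scores_per_head):
        local = h < n_local_heads  # assumption: the first heads are the constrained ones
        rows = []
        for i in range(n):
            # Local heads see only adjacent bonds (plus self); global heads see everything.
            allowed = [(not local) or i == j or adjacency[i][j] == 1 for j in range(n)]
            rows.append(masked_softmax(head_scores[i], allowed))
        weights.append(rows)
    return weights


# Toy input: three bonds in a chain, so bond 0 and bond 2 are not adjacent.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
uniform = [[0.0] * 3 for _ in range(3)]
weights = hybrid_attention([uniform, uniform], adjacency, n_local_heads=1)
print(weights[0][0])  # local head, query bond 0
print(weights[1][0])  # global head, query bond 0
```

With uniform scores, the local head spreads its weight only over bond 0 and its neighbor, while the global head spreads it over all three bonds. A hard mask of this kind removes non-adjacent bonds from the softmax entirely, which is the difference from an additive soft bias.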
To learn transferable bond-level representations, we further introduce a masked bond token pretraining objective tailored to molecular structure. The pretrained encoder is subsequently fine-tuned for downstream molecular property prediction tasks. While the predictive performance varies across different datasets, the model demonstrates promising behavior on complex, topology-sensitive benchmarks such as MUV.
In summary, our study could contribute to the methodological integration of bond-centric topology into a sequence-based Transformer framework. We emphasize that the novelty of this work does not lie simply in using bonds as representation units, since bond- and edge-centric ideas have already been explored in D-MPNN, Chemprop-like models, line-graph GNNs, and graph Transformer frameworks. Instead, our main contribution lies in converting a line-graph-equivalent bond representation into a Transformer-compatible sequence framework and combining it with hard topological attention constraints and self-supervised pretraining. Specifically, our contributions are threefold:
(1) We propose a deterministic conversion mechanism that transforms a 2D line-graph-equivalent bond representation into a 1D Transformer-compatible sequence, providing a sequence-based interface for bond-level molecular topology;
(2) We design a structure-aware hybrid attention mechanism that utilizes a bond-to-bond adjacency matrix as a hard topological constraint. This aims to separate short-range chemical connectivity from unconstrained long-range global attention without relying on traditional absolute positional encodings;
(3) We introduce an atom-centric structured masking pre-training strategy specifically tailored for bond-level sequences, aiming to reduce local information redundancy and encourage the learning of transferable topological representations.
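To make the first two contributions concrete, the sketch below converts a small molecular graph into a bond-token sequence and a bond-to-bond adjacency matrix. The token format (`atom + bond symbol + atom`) and the ordering rule (sorting bonds by atom index) are hypothetical stand-ins; the actual tokenization scheme is not specified in this section.

```python
def bond_sequence(atoms, bonds):
    """Build bond tokens and a bond-to-bond adjacency matrix.

    atoms: list of element symbols, indexed by atom id.
    bonds: list of (i, j, symbol) tuples, e.g. (0, 1, "-").
    """
    # Assumed convention: a deterministic order obtained by sorting on atom indices.
    ordered = sorted(bonds)
    tokens = [f"{atoms[i]}{symbol}{atoms[j]}" for i, j, symbol in ordered]

    n = len(ordered)
    adjacency = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # Two bonds are adjacent when they share an endpoint atom (line-graph edge).
            if set(ordered[a][:2]) & set(ordered[b][:2]):
                adjacency[a][b] = adjacency[b][a] = 1
    return tokens, adjacency


# Heavy-atom skeleton of acetic acid: CH3-C(=O)-OH
atoms = ["C", "C", "O", "O"]
bonds = [(1, 3, "-"), (0, 1, "-"), (1, 2, "=")]
tokens, adjacency = bond_sequence(atoms, bonds)
print(tokens)
print(adjacency)
```

All three bonds share the carboxyl carbon, so every pair of bond tokens is adjacent. The adjacency matrix, rather than the position in the sequence, carries the topology, which is consistent with the stated goal of not relying on absolute positional encodings.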
2. Results and Discussion
2.1. Comparative Analysis and Performance Discussion
To evaluate our bond-sequence framework, we benchmarked its performance against three categories of established baselines, ranging from classical supervised models to representative self-supervised architectures. Firstly, we selected representative supervised GNNs (GCN [19], GIN [20], and D-MPNN [15]) to establish the baseline performance achievable through pure topological modeling. Secondly, we included self-supervised GNNs (MolCLR [21], GraphMAE [10], and GIN + ContextPred [5]). Thirdly, we included advanced knowledge-guided Transformers (Mole-BERT [8], KPGT [22]). The latter two categories allow us to assess the impact of bond-level tokenization relative to existing pre-training approaches. Notably, our architecture has approximately 3.5 million parameters, a compact scale compared with many larger molecular Transformer models.
The experimental results on molecular classification benchmarks (Table 1) reveal a dataset-dependent performance profile. On MUV, widely regarded as a challenging benchmark due to its design against structural artifacts, our model achieves an ROC-AUC of 77.64%. This is higher than the supervised baseline D-MPNN [15] (74.8%) and pre-training methods such as KPGT [22] (76.2%) and Mole-BERT [8] (77.3%). These results suggest that bond-level granularity may provide a useful inductive bias for capturing topology-sensitive structural patterns in selected tasks. Unlike atom-centric models, the bond-sequence representation naturally encodes bond connectivity patterns, which may help the Transformer represent chemically meaningful structural contexts.
On the large-scale HIV dataset, our model achieves 76.66%, which is comparable to established pre-training baselines such as GraphMAE [10] (76.4%) and MolCLR [21] (77.0%) and exceeds the standard supervised GCN [19] and GIN + ContextPred [5] models. While it trails slightly behind KPGT [22] (78.5%), it is important to contextualize this comparison. KPGT relies on more extensive external knowledge and molecular descriptors during its pre-training phase. In contrast, our model incorporates a lightweight set of four handcrafted physicochemical priors (TPSA, LogP, HBD, and HBA) calculated via RDKit. This highlights the smaller descriptor burden and lower parameter count of our framework, while acknowledging that it is not entirely free from external physicochemical knowledge.
Conversely, lower absolute scores are observed on certain datasets such as BBBP (62.65%) and ClinTox (71.77%). These results indicate that the proposed framework should not be interpreted as broadly superior across all benchmarks. While baseline models typically report optimal results derived from extensive, dataset-specific hyperparameter grid searches, our Bond-Transformer was evaluated under a unified, fixed hyperparameter protocol across all eight benchmarks. Furthermore, following the stage-dependent behavior observed in the global attribute ablation, we adopted the decoupled ptGA_ftNoGA setting (global attributes used during pre-training but not during fine-tuning) to avoid directly imposing global descriptor bias during fine-tuning. Although this absence of task-specific tuning may limit absolute performance on some datasets, it provides a consistent assessment of the model under a unified evaluation setting.
Collectively, these findings suggest that the proposed bond-sequence framework offers a parameter-efficient and structurally aware alternative to existing molecular representation methods. By grounding the Transformer in a bond-level topological sequence and evaluating it under stringent unified protocols, our model combines local bond-connectivity constraints with global attention-based contextual modeling. This indicates that bond-centric sequence modeling may be particularly useful for selected complex or topology-sensitive molecular prediction tasks, while its performance remains dependent on the dataset and evaluation setting.
2.2. Ablation Study
2.2.1. Topological Advantage of Bond-Level Representation
While chemical principles dictate that the electronic environment and conformational constraints of a molecule are intrinsically defined by the nature of its chemical bonds and their interaction networks, the potential advantage of bond-level representation can also be discussed from graph-theoretical principles. In a standard molecular graph $G = (V, E)$, the maximum degree of a carbon atom is naturally constrained by its valency to 4. However, by transforming the molecule into a bond-centric representation (conceptually equivalent to a line graph $L(G)$), the topological connectivity is altered and, in many cases, increased. The degree of a bond-level token $e_{uv}$ in $L(G)$ is defined as

$$\deg_{L(G)}(e_{uv}) = \deg_G(u) + \deg_G(v) - 2.$$

For a typical C-C bond connecting two saturated carbons, this yields a degree of $4 + 4 - 2 = 6$, creating a higher connectivity density that may facilitate information exchange within each attention layer.
To quantitatively support this observation, we conducted a statistical survey on a representative subset of the pre-training corpus. Our analysis reveals that while the traditional atom-level graphs exhibit an average node degree of 2.1071 and a graph density of 0.1144, the bond-level representation achieves a higher average degree of 2.7469 and a graph density of 0.1435. This corresponds to a 30.36% increase in average degree and a 25.43% increase in overall graph density, providing a richer topological context for the attention mechanism. Such enhanced density may help expose dense structural motifs, such as fused rings or highly branched chains, to the attention mechanism and partially alleviate the information bottlenecks associated with sparse atom-centric message passing.
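The kind of statistic reported above can be reproduced on a toy molecule with the following sketch. It assumes the standard definitions of average degree, $2|E|/|V|$, and density, $2|E|/(|V|(|V|-1))$, for a simple undirected graph; the exact definitions used in the survey are not stated here, and the helper names are illustrative.

```python
def degree_and_density(n_nodes, edges):
    """Average degree 2|E|/|V| and density 2|E|/(|V|(|V|-1)) of a simple graph."""
    avg_degree = 2 * len(edges) / n_nodes
    density = 2 * len(edges) / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    return avg_degree, density


def line_graph_edges(bonds):
    """Edges of the line graph: pairs of bonds that share an endpoint atom."""
    return [
        (a, b)
        for a in range(len(bonds))
        for b in range(a + 1, len(bonds))
        if set(bonds[a]) & set(bonds[b])
    ]


def relative_gain(old, new):
    """Percentage increase from `old` to `new`."""
    return (new - old) / old * 100.0


# Heavy-atom skeleton of isobutane: one central carbon bonded to three others.
bonds = [(0, 1), (0, 2), (0, 3)]
atom_stats = degree_and_density(4, bonds)
bond_stats = degree_and_density(len(bonds), line_graph_edges(bonds))
print(atom_stats, bond_stats)
print(round(relative_gain(2.1071, 2.7469), 2))  # average degrees reported in the text
```

For this branched skeleton the three bonds form a triangle in the line graph, so both average degree and density increase. For an unbranched chain the opposite holds, which is why the increase is a corpus-level tendency rather than a guarantee for every molecule.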
2.2.2. Impact of Global Attribute Bias
Beyond capturing local connectivity, our framework originally incorporates four global descriptors, namely Topological Polar Surface Area (TPSA), octanol-water partition coefficient (LogP), Hydrogen Bond Donors (HBD), and Hydrogen Bond Acceptors (HBA), as macro-scale physicochemical anchors. Derived from Lipinski’s “Rule of Five” [23], these descriptors were intended to provide the model with a holistic profile of the molecular landscape. However, a pivotal observation is that the integration of these global biases significantly alters the model’s behavior depending on the training stage.
To thoroughly examine the role of these global attributes, we conducted a systematic cross-ablation study across the pre-training (pt) and fine-tuning (ft) stages (Table 2). Quantitatively, the results reveal a stage-dependent and task-dependent effect. During the pre-training phase, the inclusion of global attributes improves the masked bond modeling (MBM) objective. As evidenced by the pre-training metrics, models incorporating global descriptors (ptGA) achieved a higher MBM accuracy (90.21%) compared to the “w/o Global” variant (83.28%), indicating that these descriptors can provide weak physicochemical priors that help guide representation learning during self-supervised pre-training.
However, the downstream results show that the direct use of the same global attributes during fine-tuning does not uniformly improve prediction performance. In several datasets, especially small or distribution-sensitive tasks, incorporating global attributes during fine-tuning can lead to lower ROC-AUC values compared with the decoupled setting. This observation suggests that global descriptors may introduce descriptor-induced negative transfer when their coarse-grained physicochemical bias is not well aligned with the task-specific structural patterns required by a downstream dataset.
Therefore, global attribute bias should not be interpreted as a universally beneficial regularizer. Instead, the ablation results support a stage-dependent usage pattern: global attributes can be useful during pre-training as weak macro-scale physicochemical priors, whereas their direct incorporation during fine-tuning should be treated cautiously and may need to be decoupled depending on the downstream task and data distribution.
2.2.3. Synergy of Structured Masking and Multi-Scale Consistency Learning
Given the substantial computational overhead required to pre-train the Transformer encoder from scratch, we adopted a two-phase evaluation protocol. Phase 1 (Architectural Search): We conducted exploratory ablations (Table 3 and Table 4) using an initial parameter configuration fixed across variants. The primary objective of this phase was not to benchmark final fine-tuned performance, but to identify topological configurations that support numerical convergence and prevent representation collapse during pre-training. Phase 2 (Rigorous Evaluation): Once the selected stable architectural hyperparameters were locked (i.e., Structured Masking and the selected adjacency ratio), we deployed the multi-seed protocol under a unified fine-tuning setting for the final baseline comparisons (Table 1) and the global attribute ablation (Table 2).
The design of the Atom-centric Structured Masking (SM) strategy is primarily motivated by the observation of inherent structural redundancy within bond-level sequences. Since adjacent bond tokens typically share common endpoint atoms, there is a potential risk that the model may exploit these local atom indices to perform trivial inference—a form of pattern matching that bypasses deep topological reasoning. By creating a “structural vacuum” through the simultaneous masking of all bonds connected to a central atom, this approach is intended to mitigate such local information leakage and steer the encoder toward capturing more robust molecular representations through non-redundant dependencies.
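The masking rule described above can be sketched as follows. The number of central atoms, the uniform random choice of centers, and the `[MASK]` token string are assumptions for illustration; the section does not state how centers are selected or what mask rate is used.

```python
import random


def atom_centric_mask(bonds, n_centers, rng):
    """Choose central atoms and return the indices of all bonds touching them."""
    atoms = sorted({atom for bond in bonds for atom in bond})
    centers = set(rng.sample(atoms, n_centers))
    masked = [k for k, (i, j) in enumerate(bonds) if i in centers or j in centers]
    return centers, masked


def apply_mask(tokens, masked, mask_token="[MASK]"):
    """Replace the tokens at the masked positions with a placeholder."""
    hidden = set(masked)
    return [mask_token if k in hidden else tok for k, tok in enumerate(tokens)]


# Toy molecule: bonds as atom-index pairs, with hypothetical bond tokens.
bonds = [(0, 1), (1, 2), (1, 3), (3, 4)]
tokens = ["C-C", "C=O", "C-N", "N-C"]
centers, masked = atom_centric_mask(bonds, n_centers=1, rng=random.Random(0))
masked_tokens = apply_mask(tokens, masked)
print(centers, masked_tokens)
```

Because every bond incident to the chosen atom is hidden together, no unmasked neighbor token still names that atom, so the shortcut of copying a shared endpoint from an adjacent token is removed. Random per-token masking would leave that shortcut available in most cases.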
A critical finding from this Phase 1 architectural search (Table 3) is that Multi-scale Consistency Learning (MSCL) plays an important stabilizing role during pre-training. In configurations where MSCL was omitted (Groups 01 and 02), the model consistently suffered from representational collapse within the initial three epochs, with the training loss abruptly vanishing and the masked bond modeling accuracy dropping to zero. This phenomenon indicates that, due to the high-entropy nature of bond-level representations, the Transformer benefits from explicit semantic regularization to avoid converging toward non-informative, degenerate solutions. In this context, MSCL is not merely a performance-enhancing component but appears to be an important condition for stable optimization.
The synergy between SM and MSCL becomes most apparent when evaluating downstream task performance and predictive consistency. While the inclusion of MSCL alone enables stable convergence even with random masking, the integration of SM provides further performance gains in benchmarks sensitive to specific functional group arrangements, such as ClinTox (+2.96%) and BACE (+1.45%). Furthermore, the full model demonstrates improved predictive consistency; for instance, the standard deviation of predictions for the BACE task decreased from 1.04 to 0.29 when SM was employed alongside MSCL. These results suggest that the structural challenge posed by SM, when anchored by the semantic constraints of MSCL, may reduce the model’s reliance on local redundancies and encourage more transferable molecular representations.
2.2.4. Quantifying Pre-Trained Knowledge Injection and Information Balancing
To evaluate the effect of our self-supervised learning phase, we compared the performance of the pre-trained encoder against a randomly initialized baseline (No Pre-train) and conducted a sensitivity analysis on the adjacency ratio, which governs the integration of local structural focus and global context. As demonstrated in Table 4, the model initialized with random weights exhibits lower performance, particularly on large-scale and complex benchmarks. For instance, on MUV and Tox21, the pre-trained model outperforms the non-pre-trained baseline by 11.15% and 7.41%, respectively. This suggests that bond-level pre-training contributes useful transferable information beyond randomly initialized supervised learning under the same downstream setting.
A notable observation within our parametric sweep of the adjacency ratio is the identification of a critical stability threshold. Specifically, when the adjacency ratio falls to or below 0.5, the pre-training process invariably suffers from representational collapse. This observation suggests that sufficient local structural bias is important for stabilizing the bond-level latent space during high-entropy reconstruction tasks. In the absence of robust local connectivity constraints, the model may fail to resolve the structural degeneracies induced by the masking procedure.
Among the stable configurations, our initially locked adjacency ratio provides a favorable balance between local structural focus and global context in this sensitivity analysis. It outperforms the pure local structural representation across almost all benchmarks, with notable gains in MUV (+5.42%) and HIV (+6.14%). This performance gap suggests that incorporating an appropriate proportion of global context during pre-training can complement local bond connectivity, although the utility of such global information remains stage-dependent and task-dependent. While the pure local configuration remains numerically stable, its lower performance compared to the locked configuration suggests that excluding global context limits the model’s ability to contextualize local bond environments within the broader molecular architecture. Interestingly, on the smallest dataset (BBBP), performance exhibits limited sensitivity to pre-training and ratio adjustment, suggesting that the benefit of global-context balancing may be less pronounced in some small-sample settings.
2.2.5. Exploration of Bond Orientation: Directional vs. Canonicalized Undirected Representation
A fundamental question in bond-level molecular modeling is whether the intrinsic ordering of atom pairs within a bond token (typically an artifact of graph traversal or computer storage) conveys meaningful chemical priors. In this study, we benchmarked our default directional representation (e.g., distinguishing a bond token from the token with the reversed atom order) against a canonicalized undirected approach, which ensures a unique token for each chemical bond through predefined sorting protocols.
Empirical results (Table S1) demonstrate similar performance between the directional and canonicalized undirected strategies across all evaluated molecular property benchmarks. The performance divergence across the diverse range of datasets remains marginal, consistently falling within the range of standard deviation. This indicates that the Transformer architecture can obtain comparable predictive performance from these two bond-tokenization strategies under the current experimental setting.
From a computational perspective, the canonicalized undirected approach is more advantageous as it reduces the bond vocabulary size from 3804 to 2451 (a 35.6% reduction) without causing an obvious loss in predictive performance. This reduction in vocabulary complexity may reduce memory usage and simplify optimization during large-scale pre-training. Given that the increased task entropy of the directional representation does not translate into a tangible predictive gain, the undirected representation is a more compact and practical configuration for the present bond-level sequence framework.
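The effect of canonicalization on vocabulary size can be illustrated with a toy example. The token format and the alphabetical sorting rule below are assumptions; the actual bond descriptors and the "predefined sorting protocols" are not specified in this section.

```python
def directional_token(a, symbol, b):
    """Keep the atom order as encountered, so C-N and N-C are distinct tokens."""
    return f"{a}{symbol}{b}"


def canonical_token(a, symbol, b):
    """Sort the two endpoint labels so each bond type maps to a single token."""
    first, second = sorted((a, b))
    return f"{first}{symbol}{second}"


def reduction_percent(before, after):
    """Relative vocabulary reduction in percent."""
    return (before - after) / before * 100.0


observed = [("C", "-", "N"), ("N", "-", "C"), ("C", "=", "O"), ("O", "=", "C"), ("C", "-", "C")]
directional_vocab = {directional_token(*bond) for bond in observed}
canonical_vocab = {canonical_token(*bond) for bond in observed}
print(len(directional_vocab), len(canonical_vocab))
print(round(reduction_percent(3804, 2451), 1))  # vocabulary sizes reported in the text
```

Symmetric tokens such as C-C are unchanged by canonicalization, which is why the reduction is less than one half.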
2.3. Qualitative Representation and Attention Visualization
To qualitatively examine the representations learned by the model, we performed a multi-scale visualization analysis, ranging from global latent space distribution to local attention distribution. Firstly, to inspect whether fine-tuning changes the representation space for HIV activity prediction, we visualized the latent features using t-SNE.
Figure 1 compares the embeddings from the pre-trained and fine-tuned stages. As shown in Figure 1a, the pre-trained features of active (orange) and inactive (blue) molecules are highly entangled, indicating that generic structural features are insufficient for this task. In contrast, Figure 1b reveals a clear change in the latent distribution after fine-tuning. The emergence of dense, localized clusters of active molecules suggests that the model learns task-relevant discriminative patterns after fine-tuning. It is important to note that the remaining dispersion of some active molecules is chemically consistent with the intrinsic nature of the HIV dataset, which encompasses inhibitors targeting diverse mechanisms (e.g., reverse transcriptase vs. protease inhibitors) and possessing distinct scaffolds. Furthermore, the presence of “activity cliffs,” where minor structural changes drastically alter bioactivity, contributes to the complexity of the decision boundary. Despite these challenges, the clear transition from Figure 1a to Figure 1b provides qualitative evidence that fine-tuning reshapes the representation space toward task-relevant separation.
To further explore whether the attention distribution shows qualitative correspondence with chemically relevant regions, we visualized the attention weights from the final Transformer layer for four representative active molecules. This analysis serves as a qualitative visualization: it first examines the attention pattern of Zidovudine (AZT) from the dense clusters in Figure 1b, and subsequently visualizes representative external clinical drugs (Efavirenz (EFV), Raltegravir, and Saquinavir) that are not present in the original dataset.
As illustrated in Figure 2, the highlighted regions demonstrate a qualitative correspondence with reported pharmacophore-related regions. The analysis begins with AZT, a prototypical Nucleoside Reverse Transcriptase Inhibitor (NRTI). Relatively high attention weights are observed around the azido (-N3) moiety, and weight is also assigned to the thymine (5-methyl-2,4-dioxopyrimidine) ring. This dual-focus pattern shows qualitative correspondence with the thymidine-mimicking structure of AZT, including the azido-containing region and the nucleobase moiety involved in enzyme recognition.
Moving to the external inhibitors, we observe a qualitative correspondence between the model’s attention weights and known structural motifs. In the case of EFV, the attention assigns relatively high weights to the strongly electron-withdrawing trifluoromethyl group (-CF3) and the adjacent phenyl ring. This attention pattern shows qualitative correspondence with the electron-deficient aromatic region associated with known SAR features of EFV. This pattern of highlighting broader structural scaffolds is also observed in Raltegravir, where attention is distributed across the central hydroxypyrimidinone ring. This region corresponds to the chelation-related scaffold reported to participate in coordination. Finally, for the complex peptidomimetic Saquinavir, the model assigns lower attention to the bulky peripheral hydrophobic naphthyl groups and relatively higher attention to the secondary hydroxyl group (-OH) and its adjacent nitrogen atom. This specific region shows qualitative correspondence with the transition-state-mimetic region involved in HIV protease inhibition.
Collectively, these qualitative observations suggest that the Bond-level Transformer assigns relatively higher attention to chemically plausible regions, such as pharmacophore-related motifs and functional groups. However, it is important to emphasize that these attention distributions serve solely as qualitative observations of pattern correlations rather than faithful mechanistic explanations. While the high-attention regions visually align with known pharmacological features, we cannot claim that the model has learned “electronic intuition” or moved “beyond simple pattern matching” based solely on attention weights. True chemical interpretability would require more rigorous attribution analyses in future work—such as integrated gradients, SHAP-like methods, input perturbations, or counterfactual molecular edits—to quantitatively compare the model’s decision patterns with reported Structure-Activity Relationship (SAR) trends.
4. Conclusions
In this work, we explore an alternative perspective on molecular representation learning by revisiting the choice of representation granularity for Transformer-based models. Our study examines whether a bond-level representation can offer a structurally grounded and potentially scalable interface between molecular topology and attention-based sequence modeling. By representing molecules as sequences of directed bond tokens and implementing explicit topological constraints, we observe that Transformers can incorporate bond-level topological information without relying solely on traditional graph message passing.
One of the primary challenges addressed in this framework is the structural degeneracy encountered during high-entropy molecular pre-training. Our findings suggest that integrating atom-centric group masking with multi-scale consistency learning can help stabilize the pre-training process and improve representation consistency. In addition, the global attribute ablation indicates that macro-scale physicochemical descriptors can serve as weak priors during pre-training, whereas their direct use during fine-tuning should be treated as stage-dependent and task-dependent. Overall, the proposed approach is not intended to replace large-scale SMILES-based or graph-based methods, but to provide a parameter-efficient alternative representation design that is compatible with self-supervised objectives.
More broadly, this work is intended as a complementary viewpoint rather than a replacement for existing molecular modeling approaches. As Transformer architectures continue to dominate general-purpose representation learning, understanding how input units shape inductive bias remains a significant question beyond immediate benchmark performance. Looking ahead, the bond-centric granularity of our model provides a natural foundation for incorporating 3D geometric information. By extending the current binary topological mask to encode continuous physical descriptors such as bond angles (1st-order) and dihedral angles (2nd-order), this framework may provide a possible direction for connecting discrete bond-level topology with continuous conformational information. We hope that bond-level sequence modeling provides a useful reference point for future studies seeking to bridge discrete chemical structure with scalable sequence modeling frameworks, while further validation on broader datasets and more rigorous interpretability analyses remain important future work.