1. Introduction
The recent release of Google DeepMind’s AlphaFold3 has accelerated interest in AI-driven drug discovery by expanding protein structure availability at scale [1]. However, progress in this area fundamentally depends on how molecular structures and physicochemical properties are represented for computation. Complex molecules must be transformed into data formats, such as Simplified Molecular Input Line Entry System (SMILES) strings or fingerprints, that machine learning models can use for prediction or design [2]. Just as sentences are composed of words and words of letters, complex chemical structures are composed of basic chemical components. The central challenge is therefore not merely to fragment molecules but to design fragmentation methods that expose informative substructures while preserving chemical interpretability.
Molecules, especially those that are large and complex, consist of many structural components, each contributing to the molecule’s overall function and behavior. Synthetic cannabinoids are one example: during mass spectrometry analysis they are often fragmented into four primary groups (head, linker, core, and tail) that reflect their distinct chemical roles in biological interactions and analytical behavior [3]. Although such decomposition is typically applied in experimental settings, the underlying modular organization is a general characteristic of complex molecules across both natural and synthetic compounds. This compositional nature motivates the development of systematic methods for decomposing sophisticated chemical structures into elementary components. By decomposing complex molecules into chemically meaningful substructures and identifying the relationships among them, researchers can build a robust foundation for future advancements in AI-driven drug discovery and chemical analysis.
Among the widely adopted approaches, molecular fragmentation has traditionally relied on predefined fragment libraries and experimental screening techniques, such as high-resolution mass spectrometry (HRMS), X-ray crystallography, and nuclear magnetic resonance (NMR), to elucidate molecular structure and generate fragments with low molecular weight (MW) [4,5]. While these techniques are effective and chemically grounded, they are limited by high experimental cost, time-consuming analysis, and the inability of predefined fragment libraries to encompass all potential molecular fragments. Therefore, such approaches are often used as preprocessing tools rather than as scalable, standalone solutions, particularly when analyzing large and diverse molecular datasets.
To improve scalability, computational fragmentation methods have increasingly drawn inspiration from natural language processing (NLP), facilitated by the widespread adoption of SMILES strings. By treating SMILES strings as sentences, subword-based segmentation techniques aim to decompose molecules into statistically informative substructures. Byte-pair encoding (BPE), adapted for subword segmentation by Sennrich et al. [6], exemplifies this approach by iteratively merging frequently occurring token pairs to construct data-driven vocabularies of molecular fragments. Despite their scalability, these methods rely on statistical frequency and operate on linear representations, which often fail to preserve chemically meaningful motifs and obscure the underlying molecular topology. Graph-based fragmentation methods address these limitations by directly leveraging molecular graphs to preserve chemically meaningful substructures.
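The merge procedure described above can be illustrated with a minimal sketch. It assumes a character-level starting vocabulary, which is a simplification (practical SMILES tokenizers usually treat multi-character atoms such as "Cl" or bracketed atoms as single tokens), and the function names are hypothetical rather than taken from any cited implementation.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(smiles_list, num_merges):
    """Learn up to `num_merges` BPE merges from a list of SMILES strings."""
    # Assumption: each SMILES string starts as a list of single characters.
    corpus = [list(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it occurs.
        best_pair, _count = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        corpus = [merge_pair(tokens, best_pair) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    merges, corpus = learn_bpe_merges(["CCO", "CCN", "CCC"], num_merges=1)
    print(merges)
    print(corpus)
```

Because merges are chosen purely by frequency over the linear string, nothing in this procedure knows about rings or branching, which is the limitation noted above.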
A fundamental concept in drug discovery is the molecular scaffold, the invariant core of a molecule that is retained while peripheral structures are varied to alter properties [7]. Bemis and Murcko formalized this concept by defining molecular frameworks obtained through the removal of terminal side chains [8,9]. Beyond scaffold extraction, rule-based methods such as the Retrosynthetic Combinatorial Analysis Procedure (RECAP) [10] and the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) [11,12] introduced predefined cleavage rules based on retrosynthetically relevant bonds to generate chemically valid building blocks. While interpretable and chemically intuitive, these approaches remain constrained by fixed cleavage rules and limited flexibility.
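The framework idea of removing terminal side chains can be sketched on a plain heavy-atom graph. The sketch below is an illustration under stated assumptions, not the procedure of any cited toolkit: the molecule is given as a hypothetical adjacency dictionary of atom indices, bond orders and atom types are ignored, and side chains are stripped by repeatedly pruning atoms that have at most one remaining neighbor. What survives is the set of ring atoms plus the linker atoms connecting rings.

```python
def murcko_like_framework(adjacency):
    """Return the atoms that remain after repeatedly pruning terminal atoms.

    `adjacency` maps each atom index to the set of its bonded neighbors
    (hypothetical input format; hydrogens are assumed to be omitted).
    """
    # Copy neighbor sets so the caller's graph is not modified.
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    # Atoms with at most one neighbor are side-chain ends.
    stack = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while stack:
        atom = stack.pop()
        if atom not in graph:
            continue  # already pruned via an earlier stack entry
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            # Removing `atom` may expose its neighbor as a new terminal atom.
            if len(graph[nbr]) <= 1:
                stack.append(nbr)
    return set(graph)


if __name__ == "__main__":
    # Two three-membered rings (0-1-2 and 4-5-6) joined by linker atom 3,
    # with a two-atom side chain (7-8) attached to ring atom 5.
    molecule = {
        0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
        3: {2, 4},
        4: {3, 5, 6}, 5: {4, 6, 7}, 6: {4, 5},
        7: {5, 8}, 8: {7},
    }
    print(sorted(murcko_like_framework(molecule)))
```

Under this pruning rule an acyclic molecule reduces to the empty set, since it has no ring system to anchor a framework.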
Recent work has sought to address these constraints through graph-based fragmentation methods that preserve structural connectivity without relying on predefined rules. Matched molecular pairs (MMPs) model local structural transformations through dual cleavages and synthetic accessibility filtering [13,14], while methods such as ReLMole [15] and Tree Decomposition [16] further abstract molecular graphs into functional-group-level or hierarchical representations. Although these approaches demonstrate meaningful progress, many lack a principled mechanism for constructing interpretable, hierarchical abstractions that systematically organize molecular topology across different levels of granularity.
Although there have been extensive advances in molecular fragmentation, existing methods continue to face fundamental trade-offs between chemical interpretability, topological fidelity, and representational flexibility. Motivated by these limitations, we introduce HiFrAMes, a framework for Hierarchical Fragmentation and Abstraction of Molecular graphs. HiFrAMes transforms raw chemical structure graphs into interpretable, abstracted topological scaffold representations to enable structured and chemically informed analysis of complex molecules. In this work, we (1) present a four-stage, topology-driven pipeline for hierarchical fragmentation of molecular graphs; (2) define a set of abstract substructures (chains, rings, linkers, and scaffold representations) that are derived from input molecular graphs; and (3) evaluate the resulting fragment library extracted from ZINC-250K (a widely used benchmark set of 250,000 drug-like molecules) using multiple fragment selection criteria established in the literature, as well as qualitative case studies to demonstrate that HiFrAMes produces quality, reusable fragments aligned with common medicinal chemistry motifs.
4. Discussion
To evaluate the quality of fragments extracted by HiFrAMes, we applied it to ZINC-250K, a 250,000-molecule subset of ZINC, one of the most widely used public databases of commercially available drug-like compounds [25]. On this benchmark, our framework generated a rich vocabulary of chemically valid fragments, obtaining 29,219 unique fragments from 1,638,977 fragment instances and enabling significant abstraction of the original molecular graphs. The resulting fragment library contains 19,803 leaf chain fragments, 7,153 ring fragments, and 2,468 linker fragments.
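The relationship between unique fragments and fragment instances can be made concrete with a short sketch. It assumes each fragment instance is identified by a canonical string (for example a canonical SMILES), and the function name is hypothetical. Applied to the totals reported above, the same ratio gives roughly 56 instances per unique fragment.

```python
from collections import Counter


def vocabulary_stats(fragment_instances):
    """Return (unique fragments, total instances, mean instances per unique fragment)."""
    counts = Counter(fragment_instances)
    n_instances = sum(counts.values())
    n_unique = len(counts)
    mean_reuse = n_instances / n_unique if n_unique else 0.0
    return n_unique, n_instances, mean_reuse


if __name__ == "__main__":
    # Toy library: benzene appears twice, the other fragments once each.
    toy = ["c1ccccc1", "C", "c1ccccc1", "CCO"]
    print(vocabulary_stats(toy))

    # Mean reuse implied by the reported totals for ZINC-250K.
    print(round(1_638_977 / 29_219, 1))
```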
In fragment-based drug discovery (FBDD), fragments are commonly described as low-complexity molecules with molecular weight below 300 Da and typically no more than 20 heavy atoms (non-hydrogen atoms) [26,27,28]. These size ranges are widely used to keep fragments interpretable and efficiently searchable and to enable growing, linking, and merging [29,30,31]. As shown in Figure 11, the molecular weight and heavy atom count (HAC) distributions of HiFrAMes fragments exhibit tightly bounded and unimodal profiles across all fragment categories. Over 99% of fragments fall below the 300 Da threshold and contain no more than 20 heavy atoms, with median sizes well within these ranges. Moreover, leaf chain, ring, and linker fragments each span distinct yet overlapping size ranges, consistent with their different roles within molecules.
Having established that HiFrAMes fragments conform to established fragment size and complexity criteria, we next examine the chemical composition and structural motifs represented by each fragment category. Common structural motifs extracted by HiFrAMes are shown in Figure 12, Figure 13 and Figure 14, organized by fragment type. From the analysis presented in the figures, three key findings stand out.
First, leaf chains dominate by count, suggesting that most molecules feature multiple peripheral substituents decorating core scaffolds composed of a comparatively smaller number of ring and linker fragments. As shown in the leaf chain fragment profile in Figure 12, the most frequent leaf chains contain common functional groups, such as carbonyls (C=O), hydroxyls (-OH), amines (-NH2), and halogenated substituents, consistent with what is typically observed in drug-like molecules. These fragments tend to be reactive or polar moieties appended to ring systems, a pattern well aligned with functional-group-driven optimization strategies in medicinal chemistry, where peripheral substituents are systematically varied to tune molecular properties while preserving core scaffolds [32]. Carbonyl and amino groups frequently act as H-bond donors, H-bond acceptors, or reactive centers, whereas hydroxyls and protonated amines (-NH3+) increase polarity and solubility, thereby shaping Absorption, Distribution, Metabolism, and Excretion (ADME) profiles [33].
Second, the number of unique ring systems is relatively small. Many molecules reuse the same ring families shown in Figure 13, such as aromatic benzene rings and heterocycles, indicating that HiFrAMes extracts frequently occurring core ring structures.
Third, simple aliphatic linkers shown in Figure 14, such as propyl (-CCC-), ethyl ether (-CCO-), and ethylamine (-CCN-), bridge larger substructures and are frequently observed as short connectors. These flexible linkers help space functional domains, modulate conformations, and often increase lipophilicity by boosting hydrocarbon content. Overall, the most frequent leaf chain, ring, and linker fragments correspond to established medicinal chemistry motifs [34,35], suggesting that HiFrAMes extracts a compact set of reusable, chemically meaningful fragments.
To assess the behavior of HiFrAMes relative to established fragmentation approaches, we conducted qualitative and quantitative comparisons against two widely used fragmentation methods: RECAP and BRICS. Although all three methods aim to decompose molecules into reusable substructures, they differ fundamentally in how fragments are defined, organized, and contextualized.
Figure 15 presents a side-by-side decomposition of vitamin B-12 using RECAP, BRICS, and HiFrAMes. RECAP is highly conservative, producing only a small number of large fragments in which most peripheral substituents remain attached to the corrin macrocycle or are grouped into single nucleotide-like fragments. BRICS applies more aggressive retrosynthetic bond cleavage, separating many peripheral functional groups into small, disconnected fragments while still preserving a limited number of large cores. In contrast, HiFrAMes separates the major ring systems of the same molecule, such as the corrin core, benzimidazole, and sugar ring, into distinct subgraphs, isolates the phosphate-containing connector as its own fragment, and extracts numerous small chain fragments. Although HiFrAMes yields a larger number of fragments for this example than both RECAP and BRICS, the decomposition reflects the underlying roles of structural components rather than simply cleaving predefined bond types.
In contrast to the vitamin B-12 example, where molecular complexity exposes clear differences between fragmentation schemes, structurally simple molecules exhibit largely convergent behavior across methods. Figure 16 illustrates this for tiabendazole, a simple ZINC-250K molecule composed of a compact heteroaromatic core with no peripheral substituents and only a single cleavable linker. In this case, HiFrAMes, RECAP, and BRICS yield nearly identical fragmentations (BRICS and RECAP produce the same fragments) that preserve the fused cyclic core as a single dominant fragment, with only a minor difference arising from the retention of a small linker moiety in HiFrAMes, as shown in the figure. This behavior reflects the topology of the molecular graph rather than algorithmic limitations, and indicates that HiFrAMes is most informative for sets of molecules with pronounced structural heterogeneity.
When HiFrAMes is applied to the entire ZINC-250K dataset, these observed differences are reflected in the fragment size statistics summarized in Table 1 and Figure 17. These results demonstrate that HiFrAMes generates a substantially more compact fragment vocabulary than RECAP and BRICS. This indicates markedly higher fragment reuse across molecules for HiFrAMes, suggesting that it captures recurring topological motifs rather than producing a large number of idiosyncratic fragments. Consistent with FBDD criteria, 99.7% of HiFrAMes fragments fall below 300 Da and 99.4% contain no more than 20 heavy atoms, exceeding the corresponding proportions for RECAP (93.8% and 92.6%) and BRICS (73.0% and 69.9%). These trends are clearly visible in the molecular weight and heavy atom count histograms, where HiFrAMes exhibits tightly bounded unimodal distributions with limited tails, whereas RECAP shows broader distributions and BRICS produces a pronounced shift toward larger and more complex fragments. Notably, when both size criteria are applied jointly (MW ≤ 300 Da and HAC ≤ 20), 99.3% of HiFrAMes fragments satisfy fragment-like constraints, compared to 91.9% for RECAP and 68.3% for BRICS. These results indicate that HiFrAMes avoids the under-fragmentation observed in conservative rule-based methods while also preventing the explosion of large or overly complex fragments characteristic of more aggressive retrosynthetic cleavage strategies, thereby enabling greater fragment reuse.
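The individual and joint size criteria discussed above can be expressed as a short sketch. The `FragmentSize` record and `size_compliance` function are hypothetical names introduced for illustration, and the inclusive thresholds (MW ≤ 300 Da, HAC ≤ 20) follow the joint criterion as stated in the text. Descriptor values would normally come from a cheminformatics toolkit; here they are supplied directly as toy numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentSize:
    mw: float  # molecular weight in Da
    hac: int   # heavy atom count


def size_compliance(fragments, max_mw=300.0, max_hac=20):
    """Return the fraction of fragments meeting each size criterion and both jointly."""
    n = len(fragments)
    if n == 0:
        return {"mw": 0.0, "hac": 0.0, "joint": 0.0}
    mw_ok = sum(f.mw <= max_mw for f in fragments)
    hac_ok = sum(f.hac <= max_hac for f in fragments)
    joint_ok = sum(f.mw <= max_mw and f.hac <= max_hac for f in fragments)
    return {"mw": mw_ok / n, "hac": hac_ok / n, "joint": joint_ok / n}


if __name__ == "__main__":
    # Toy values only; not taken from the ZINC-250K results.
    toy = [
        FragmentSize(mw=78.11, hac=6),
        FragmentSize(mw=310.0, hac=19),
        FragmentSize(mw=290.0, hac=22),
        FragmentSize(mw=150.0, hac=10),
    ]
    print(size_compliance(toy))
```

The joint fraction can never exceed either individual fraction, which is consistent with the reported joint rates being slightly lower than the per-criterion rates for each method.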
In addition to fragment size, we evaluated other physicochemical characteristics of fragments produced by each method, including the commonly used Rule of Three (Ro3) selection criteria [36,37,38]. As summarized in Table 2, the HiFrAMes fragments exhibit consistently high compliance for individual constraints when compared with those of BRICS and RECAP, indicating that the fragments produced are not only appropriately sized but also chemically well-balanced.
When compliance with all Ro3 metrics is evaluated, requiring satisfaction of the MW, ClogP, hydrogen bond donor (HBD), and hydrogen bond acceptor (HBA) thresholds, 87.6% of HiFrAMes fragments meet all four criteria, substantially exceeding the corresponding Ro3 compliance rates for RECAP (68.0%) and BRICS (45.9%), as shown in Table 3. We further evaluated fragment quality using an extended seven-criterion framework that includes HAC, number of rotatable bonds (NROT), and topological polar surface area (TPSA) [26], in addition to the Ro3 criteria. Under this set of constraints, 42.8% of HiFrAMes fragments are fully compliant, compared to 51.2% of RECAP fragments and 26.6% of BRICS fragments; however, it is well established that successful fragments may violate at least one commonly used selection criterion without compromising ligand efficiency or downstream developability [39]. Relaxing the constraint of full compliance, we also evaluated the rate of high extended compliance for each set of fragments, in which a fragment must meet at least six of the seven thresholds. This tolerates single-criterion violations while still assessing overall fragment appropriateness. We observed that 82.5% of HiFrAMes fragments satisfy at least six of the seven fragment-likeness thresholds, compared to 78.1% of RECAP fragments and only 49.3% of BRICS fragments. Altogether, our results indicate that HiFrAMes produces appropriately sized, chemically balanced fragments that are well aligned with common motifs observed in medicinal chemistry and are highly reusable, making them well suited for downstream fragment-oriented computational tasks.
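The three levels of compliance described above (all four Ro3 criteria, all seven extended criteria, and at least six of seven) can be sketched as a single threshold check. The Ro3 limits and the HAC limit follow the commonly quoted values; the NROT ≤ 3 and TPSA ≤ 60 Å² limits are assumptions based on commonly quoted Ro3 extensions, since the exact cutoffs are not stated in this section. Function and dictionary names are hypothetical.

```python
# Commonly quoted Rule of Three limits (all treated as inclusive upper bounds).
RO3_LIMITS = {"mw": 300.0, "clogp": 3.0, "hbd": 3, "hba": 3}

# Extended seven-criterion set. HAC <= 20 follows the text; the NROT and TPSA
# limits are ASSUMED values and may differ from those used in the study.
EXTENDED_LIMITS = {**RO3_LIMITS, "hac": 20, "nrot": 3, "tpsa": 60.0}


def criteria_met(descriptors, limits):
    """Count how many upper-bound criteria a fragment satisfies."""
    return sum(descriptors[name] <= limit for name, limit in limits.items())


def is_compliant(descriptors, limits, min_met=None):
    """True if at least `min_met` criteria hold (default: all of them)."""
    required = len(limits) if min_met is None else min_met
    return criteria_met(descriptors, limits) >= required


if __name__ == "__main__":
    # Toy descriptor values for a single hypothetical fragment.
    fragment = {"mw": 180.2, "clogp": 1.2, "hbd": 2, "hba": 3,
                "hac": 13, "nrot": 2, "tpsa": 75.0}
    print(is_compliant(fragment, RO3_LIMITS))                   # all four Ro3 criteria
    print(is_compliant(fragment, EXTENDED_LIMITS))              # all seven criteria
    print(is_compliant(fragment, EXTENDED_LIMITS, min_met=6))   # at least six of seven
```

Expressing the relaxed rule as a count of satisfied criteria keeps the strict and tolerant evaluations on the same code path, so only the `min_met` argument changes between them.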