AMP: Automatic Modality-Aware Parallelization with Hidden-Dimension Tensor Parallelism for Multi-Modal 3D Biological Models

Zhang, Kailin; Zheng, Hao; Yuan, Lang

doi:10.3390/electronics15132769


by Kailin Zhang, Hao Zheng * and Lang Yuan

College of Computer Science and Engineering, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(13), 2769; https://doi.org/10.3390/electronics15132769 (registering DOI)

Submission received: 21 May 2026 / Revised: 16 June 2026 / Accepted: 17 June 2026 / Published: 23 June 2026

(This article belongs to the Special Issue Advances in 3D Computer Vision and 3D Data Processing)


Abstract

Three-dimensional (3D) spatial interaction data are fundamental to understanding genome architecture. Multi-modal deep learning models that jointly learn from 3D spatial data and orthogonal modalities, such as gene expression, face a critical computational challenge: the 3D spatial modality dominates computation by over one order of magnitude, creating a structural memory bottleneck that renders heavyweight model instances untrainable on a single GPU. Existing distributed training methods rely on cost-model searching and treat model components uniformly, overlooking modality-specific memory asymmetries. We propose Automatic Modality-aware Parallelization (AMP), a framework that diagnoses memory bottlenecks from data configuration signals and prescribes strategies from a set of five. At the core of this framework is a hidden-dimension tensor parallelism strategy ($S_{5}$) that partitions the 3D decoder’s hidden dimension across GPUs, transforming five non-standard operators into sharded forms with formal equivalence proofs. Evaluated on Hi-C and RNA-seq data from the HiRES single-cell mouse brain dataset across lightweight and heavyweight configurations, AMP converts out-of-memory (OOM) failures into successful training runs. When scaling from four to eight GPUs under heavyweight configurations, the 500 kb and 100 kb variants achieve 2.0× and 3.8× training speedups, respectively, with mathematical equivalence to single-GPU computation guaranteed by formal proofs.

Keywords:

3D data processing; automatic parallelization; multi-modal deep learning; tensor parallelism; hidden-dimension sharding; Hi-C data; single-cell genomics; memory bottleneck; mathematical equivalence

1. Introduction

Three-dimensional (3D) spatial interaction data, spanning Hi-C contact matrices in genomics [1,2,3] and point clouds in computer vision [4,5,6,7], are characterized by extreme sparsity and high dimensionality. As shown in Figure 1a, a typical single-cell Hi-C contact matrix spans tens of thousands of spatial bins yet exhibits detectable contacts in only a tiny fraction of bin pairs; the corresponding 3D spatial architecture is visualized in Figure 1b.

Technologies that co-assay chromatin conformation with transcriptome in individual cells, exemplified by HiRES [8] and scHiCAR [9], anchor sparse 3D spatial measurements with orthogonal RNA readouts. This multi-modal co-registration is analogous to how RGB imagery enriches LiDAR point clouds in autonomous driving [10,11], and has motivated deep learning architectures that jointly model 3D spatial interactions and orthogonal molecular readouts from the same cell.

Single-cell multi-modal profiling has motivated a range of deep learning architectures for joint representation learning, including VAE-based integration [12,13,14,15], graph-based methods [16], and attention-driven architectures [17,18,19]. Among these, HiGLUE [20] is the first model to simultaneously combine VAE encoders, graph neural networks (GNNs), transformer attention, and multi-branch encoder–decoder designs for jointly modeling paired single-cell Hi-C and RNA-seq data.

HiGLUE inherits a computational challenge that is structurally analogous to multi-modal 3D perception systems in autonomous driving [10]. In both settings, separate encoder–decoder branches process modalities of vastly different scales: in HiGLUE, the RNA modality consists of several thousand gene expression values while the Hi-C modality exceeds it by over one order of magnitude; in LiDAR-camera fusion, the point cloud branch similarly dominates computation over the camera branch. This asymmetry concentrates the majority of model parameters in the heavier spatial encoder and causes decoder activations to accumulate linearly across decomposition layers, creating a structural memory bottleneck whose severity is predictable directly from the data configuration. Lightweight configurations fit within typical GPU memory, whereas heavyweight configurations inevitably exhaust it, resulting in out-of-memory (OOM) failures regardless of batch size.

Existing distributed training methods apply generic cost models that treat all model components uniformly, overlooking modality-specific memory asymmetries. For heavyweight model instances, they recommend standard data parallelism, which replicates the structural memory bottleneck on every GPU instead of distributing it and therefore does not reduce per-device activation memory. In contrast, we find that partitioning the hidden dimension of the 3D spatial decoder addresses the bottleneck at a single point: it reduces per-device activation memory across all subsequent strata, keeping cumulative memory below device limits without modifying the encoder or graph embedding stages.

We argue that for domain-specific models with structurally predictable computational heterogeneity, the problem is not searching for an optimal parallel strategy in an infinite space but diagnosing the bottleneck from the model configuration and prescribing a deterministic remedy.

We introduce Automatic Modality-aware Parallelization (AMP), a framework that operates in three phases:

Auto-Profile. Extracts three bottleneck signals (input dimensionality ratio, attention decoder presence, and decomposition depth) from preprocessed data files without executing a forward pass.
Auto-Diagnose. Maps each detected signal independently to its corresponding strategy via a fixed signal-to-strategy mapping: HIGH_INPUT_DIMENSIONALITY activates $S_{2}$ and $S_{4}$ ; ATTENTION_DECODER activates $S_{3}$ ; HEAVY_DECOMPOSITION activates $S_{5}$ when the predicted peak memory exceeds GPU capacity.
Auto-Execute. Composes the activated strategies into a unified per-stratum computation under the $S_{1}$ data-parallel infrastructure: $S_{2}$ determines the local feature partition, $S_{4}$ applies key-first decoding using this partition, $S_{3}$ provides chunked attention as a reusable operator, and $S_{5}$ internally invokes both $S_{3}$ and $S_{4}$ in its sharded attention block.

Among the five strategies, the key innovation is $S_{5}$, a hidden-dimension tensor parallelism strategy that partitions the 3D decoder’s hidden dimension across GPUs, transforming five non-standard operators into sharded forms, each proven equivalent through three shared mathematical properties (Section 3.4.5). Every transformation in the framework is mathematically proven to preserve single-GPU numerical results.

We evaluated AMP on the HiRES mouse brain dataset [8]. Lightweight configurations trained successfully on a single GPU and achieved throughput improvement with four-GPU data parallelism. Heavyweight configurations, which are untrainable on a single GPU due to OOM, were correctly diagnosed by the framework and completed training successfully. Scaling from four to eight GPUs on heavyweight configurations, the 500 kb and 100 kb variants achieved 2.0× and 3.8× training speedups, respectively. A leave-one-out ablation confirmed that chunked attention ($S_{3}$) and hidden-dimension tensor parallelism ($S_{5}$) are each individually necessary for training to succeed.

The remainder of this paper is organized as follows. Section 2 reviews related work in multi-modal 3D biological models and distributed parallelization. Section 3 presents the AMP framework, detailing the Auto-Profile, Auto-Diagnose, and Auto-Execute phases along with the five strategies. Section 4 reports experimental results. Section 5 concludes with limitations and future directions.

2. Related Works

This section reviews two lines of related work. Section 2.1 surveys multi-modal deep learning models for 3D spatial interaction data, identifying shared architectural patterns and the computational challenges they create for large-scale training. Section 2.2 discusses existing distributed training and automatic parallelization methods, highlighting their limitations for models with asymmetric modality scales.

2.1. Multi-Modal Deep Learning Models for 3D Spatial Interaction Data

Single-cell multi-modal profiling technologies, including HiRES [8] and scHiCAR [9], alongside transformer-based pretrained models for single-cell data such as scBERT [21] and scGPT [22], have catalyzed deep learning architectures that jointly learn from 3D spatial interaction data and other modalities. Despite their diversity, these models converge on four shared architectural building blocks: (a) VAEs for representation learning from noisy high-dimensional observations; (b) GNNs for irregular graph structures arising from 3D spatial measurements; (c) transformer-based attention for long-range dependencies; and (d) multi-branch encoder–decoders for heterogeneous modality integration (Table 1).

Each paradigm presents specific computational challenges when applied to 3D spatial datasets. We emphasize that these architectural categories are descriptive (they characterize the design space of existing models) rather than prescriptive for our parallelization framework. Our strategy decisions are driven not by whether a model uses VAEs, attention, or GNNs, but by measurable data-level signals (input dimensionality, attention decoder presence, and decomposition depth) extracted during the profiling phase (Section 3.2). The input projection layers of VAE encoders scale linearly with the input dimensionality, dominating the computation for high-dimensional 3D spatial inputs while performing inefficient dense operations on predominantly zero-valued data. GNNs on genomic graphs face irregular, load-imbalanced adjacency structures. Transformer attention has quadratic complexity in sequence length and wastes computation on sparse 3D spatial inputs [24,25]. Multi-branch designs inherit severely skewed parameter distributions: the 3D spatial branch dominates the computation and memory, whereas other branches contribute negligibly. Critically, all these models share the structural property that is central to our work: the computational cost of the 3D spatial modality overwhelms other modalities by one to two orders of magnitude, creating a systematic bottleneck whose location and severity can be predicted from the model configuration alone. While 3D computer vision has developed sparse tensor libraries such as MinkowskiEngine [26] that avoid computation on empty voxels, these operate at the operator level and do not address the architectural memory bottleneck arising from modality asymmetry in multi-branch models.

2.2. Distributed Training and Parallelization Strategies

Existing distributed training methods are primarily designed for conventional workloads. Fully Sharded Data Parallelism (FSDP) [27,28] applies identical sharding strategies regardless of the component characteristics. Tensor parallelism, as implemented in Megatron-LM [29], partitions weight matrices along fixed dimensions optimized for dense transformer layers with balanced parameter sizes, unable to account for orders-of-magnitude input dimensionality differences across branches. Pipeline parallelism is ill-suited for multi-branch architectures, where branches can be computed independently.

Tensor parallelism, which partitions weight matrices and activations across devices, is highly effective for scaling homogeneous transformer architectures. Megatron-LM [29] splits attention heads and feed-forward layers along fixed and balanced dimensions optimized for dense self-attention. These assumptions break down in multi-modal 3D biological models, where the 3D spatial decoder exhibits a stratified local attention structure, GEGLU (Gated Gaussian Error Linear Unit) activation gating, and an uneven decomposition of the hidden representation across multiple distance-dependent strata. None of these operators offers natural dimension-splitting points that align with standard tensor-parallel partitioning schemes. In the 3D computer vision domain, the parallelization of sparse 3D architectures has focused on operator-level efficiency, as in MinkowskiEngine [26], which avoids computation on empty voxels but does not tackle modality asymmetry in multi-branch models. While sequence parallelism methods such as RingAttention [30] distribute long-sequence attention across devices and IO-aware kernels such as FlashAttention [31] reduce single-GPU attention memory, neither approach provides formal equivalence guarantees for non-standard decoder operators. No existing study has addressed the general problem of sharding the hidden dimension of a non-standard decomposed decoder across GPUs with formal equivalence guarantees.

A closely related line of work is automatic parallelism strategy generation. AutoDDL leverages the OneFlow SBP (Split, Broadcast, Partial-sum) abstraction and coordinate descent to search for near-optimal distributed schemes [32]. Shi et al. formulated the parallelism strategy generation problem as an integer linear program (ILP) targeting minimal memory redundancy [33].

However, all existing automatic parallelism frameworks share a common assumption: parallel strategies must be discovered through searching because the optimal configuration is model-specific and not obvious in advance. This assumption underlies cost-model-based methods such as FlexFlow [34], Alpa [35], and AutoDDL [32], ILP-based approaches such as Shi et al. [33], and recursive strategy propagation methods. This assumption holds for general-purpose DNNs but fails for domain-specific models with structurally predictable bottlenecks. In a multi-modal 3D genomic model, the bottleneck is deterministically located in the 3D spatial modality, which has a high-dimensional sparse input and attention-based decoding, and the exact out-of-memory (OOM) threshold can be predicted directly from the model’s data configuration. Our work addresses both gaps simultaneously: the profiling and diagnosis framework replaces the search, while strategy $S_{5}$ provides the first hidden-dimension tensor parallelism scheme proven equivalent for non-Transformer 3D spatial decoders.

Table 2 positions AMP relative to representative automatic parallelism frameworks along five dimensions that distinguish diagnosis-driven from search-driven approaches.

3. Methods

This section presents the proposed framework, which operates through three deterministic phases: Auto-Profile (Section 3.2) extracts bottleneck signals from the data configuration, Auto-Diagnose (Section 3.3) maps them to a strategy prescription via a fixed signal-to-strategy mapping, and Auto-Execute (Section 3.5) implements the prescribed training pipeline. These three phases are organized into two functional layers: a lightweight routing layer (Section 3.2 and Section 3.3) and a strategy library (Section 3.4) containing five mathematically equivalent parallel strategies ($S_{1}$–$S_{5}$).

3.1. Problem Formulation and System Overview

We formulate the parallelization problem for multi-modal 3D biological models as follows. A model $M$ consists of a set of modalities $K = \{k_{1}, k_{2}, \dots\}$, where each modality $k$ has an encoder $E_{k}$ and decoder $D_{k}$, and a shared guidance graph encoder $G(V, E)$ that processes a prior knowledge graph. Given $N$ GPUs with per-device available memory $M_{avail}$, the goal is to determine a parallel strategy configuration $S = (s_{1}, \dots, s_{5})$, $s_{j} \in \{0, 1\}$, indicating which of the five parallel strategies are activated, such that the peak per-GPU memory consumption satisfies

$$MEM(g_{i}, M, S) \leq M_{avail} \quad \forall i \in \{1, \dots, N\}, \tag{1}$$

where $g_{i}$ denotes the $i$-th GPU. Here, $MEM(g_{i}, M, S)$ is the peak GPU memory consumption on device $g_{i}$ when model $M$ is executed under strategy configuration $S$. In practice, this value is predicted analytically by the EPM estimator (Equation (2)) during the profiling phase without executing a forward pass. The secondary objective is to minimize the total training time $T(M, S)$. An additional constraint is that every parallel transformation must be mathematically equivalent to the single-GPU forward and backward computations.

Figure 2 illustrates the three phases of the proposed framework. The framework takes only a preprocessed data directory and the number of available GPUs as user input and proceeds through three deterministic phases: Auto-Profile, Auto-Diagnose, and Auto-Execute. The strategy configuration is determined by profiling rather than manual specification or cost-model search. The framework is designed around five principles:

Zero Intrusion. The original single-GPU training code is preserved unmodified.
Diagnosis-Driven. Strategy activation is determined by data-derived bottleneck signals rather than user configuration.
Mathematical Equivalence. Every parallel transformation is proven to preserve the single-GPU numerical results.
Determinism. Identical inputs always produce identical strategy configurations.
Extensibility. New bottleneck signals and corresponding strategies can be registered using a standard interface.

Architecturally, these three phases are organized into two functional layers. The upper layer (Section 3.2 and Section 3.3) provides a lightweight bottleneck-aware routing mechanism: Auto-Profile extracts structural signals from the data configuration, and Auto-Diagnose maps them to a strategy prescription using a fixed signal-to-strategy mapping. This layer is intentionally simple because, for multi-modal 3D biological models, the bottleneck is structurally knowable from a small set of data-derived signals. The lower layer (Section 3.4) constitutes the strategy library, which includes five parallel strategies, each with a formal mathematical equivalence proof. This is where the principal technical depth of this study resides. Strategy $S_{5}$ in particular involves the complete tensor-parallel transformation of five core operators in the Hi-C decoder, each proven equivalent through three shared mathematical properties (Section 3.4.5).

3.2. Auto-Profile: Bottleneck Signature Extraction

In the first phase, modality signatures are automatically extracted from the pre-processed data files without executing a single forward pass. Given a data directory, the profiler scans the modality-specific subdirectories (RNA expression, Hi-C contact matrices, and optionally ATAC-seq or methylation data) and reads the associated structured data files to extract properties that are predictive of the downstream memory bottlenecks. This profiling procedure is deliberately lightweight. The key architectural claim of this study is that a small set of structurally encoded signals suffices to diagnose the bottleneck for this class of models. This eliminates the need for an elaborate cost-model search employed by general-purpose automatic parallelism frameworks.

For each modality, the profiler records: (1) the probability model type (e.g., Negative Binomial (NB) for RNA, Hi-C Zero-Inflated Negative Binomial (HiCZINB) for Hi-C, and Zero-Inflated Log-Normal (ZILN) for methylation), which determines the decoder architecture; (2) the input dimensionality $n_{features}$ and the number of cells $n_{cells}$; (3) for the 3D spatial modality specifically, whether the decoder contains attention mechanisms and, when applicable, the number of distance-dependent decomposition units (such as the number of strata in Hi-C’s stratified decoder, inferred from file naming conventions or feature name suffixes); and (4) the relative scale of computation across modalities, obtained by a lightweight estimation that accounts for the interaction between input size, decoder complexity, and training batch size. The guidance graph is profiled separately to record its node and edge counts.

The extracted measurements are mapped to a set of three bottleneck signals (Table 3). A modality is flagged with the HIGH_INPUT_DIMENSIONALITY signal when its input features exceed a domain-specific threshold relative to other modalities, indicating that its encoder projection and decoder reconstruction pathways dominate memory consumption. The ATTENTION_DECODER signal is raised when the modality’s decoder employs attention mechanisms, which produce large intermediate tensors of shape $[N_{nodes}, D]$ during the query-key computation. The HEAVY_DECOMPOSITION signal (instantiated here for the Hi-C decoder) is raised when the decoder employs a multi-layer decomposed structure (such as the stratified attention layers in Hi-C’s decoder), indicating that multiple attention and feedforward blocks are applied sequentially to the same input, which causes cumulative activation retention.

The three signals are designed to be orthogonal in their detection logic rather than in the architectural features they target. ATTENTION_DECODER detects the presence of attention as a mechanism, and HEAVY_DECOMPOSITION detects the structural repetition of decoder blocks regardless of their internal operation. A model with both properties will activate both $S_{3}$ and $S_{5}$, and the functional dependencies ensure that their sharded attention and chunked attention are composed correctly. The EPM estimator (Equation (2)) then determines how many such layers the available GPU memory can accommodate before it is exhausted.

The threshold for HIGH_INPUT_DIMENSIONALITY is domain-specific and is assessed relative to the next-largest modality. In the HiRES dataset, the Hi-C feature count exceeds the RNA features by over one order of magnitude, which serves as the practical detection criterion.
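As an illustration of the detection logic described above, the following minimal sketch derives the three signals from per-modality metadata. The `ModalityProfile` record, its field names, the ratio threshold of 10 (standing in for "over one order of magnitude"), the minimum stratum count, and the feature counts in the example are all assumptions made for this sketch rather than values taken from the AMP implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityProfile:
    """Hypothetical per-modality record; field names are illustrative only."""
    name: str
    n_features: int
    has_attention_decoder: bool = False
    n_strata: int = 0  # decomposition units; 0 if the decoder is not stratified


def detect_signals(target, others, ratio_threshold=10.0, min_strata=2):
    """Return the set of bottleneck signals raised for `target`.

    ratio_threshold and min_strata are assumed values, not taken from the paper.
    """
    signals = set()
    # Compare against the next-largest modality, as described in the text.
    next_largest = max((m.n_features for m in others), default=0)
    if next_largest > 0 and target.n_features / next_largest > ratio_threshold:
        signals.add("HIGH_INPUT_DIMENSIONALITY")
    if target.has_attention_decoder:
        signals.add("ATTENTION_DECODER")
    if target.n_strata >= min_strata:
        signals.add("HEAVY_DECOMPOSITION")
    return signals


# Made-up feature counts, chosen only so that Hi-C exceeds RNA by more than 10x.
rna = ModalityProfile("rna", n_features=3_000)
hic = ModalityProfile("hic", n_features=60_000,
                      has_attention_decoder=True, n_strata=8)
print(sorted(detect_signals(hic, [rna])))
```

Because the function reads only metadata, it runs without loading a model or touching a GPU, which is the property the profiling phase relies on.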

The key insight is that these signals are structurally encoded in the model’s data configuration; they can be detected by reading data files and model metadata without running the model or measuring the actual GPU memory. Because these signals derive from preprocessing parameters (bin size, decomposition depth, and decoder architecture) rather than from the biological content of the data, the diagnosis does not depend on which tissue or species the data originate from. This makes the profiling phase zero-cost and immediately informative before the training begins.

The profiler also estimates the peak GPU memory using a linear budget model. Let $K_{strata}$ denote the number of decomposition units (strata) in the Hi-C decoder. Let $M_{base}$ denote the memory consumption of the encoder and graph embedding stages, and let $\Delta M$ denote the marginal memory cost per decomposition unit, which is dominated by the per-unit attention block activations, decoder intermediate tensors, and softmax normalization. The estimated peak memory for $K_{strata}$ decomposition units is

$$EPM(K_{strata}) = M_{base} + K_{strata} \times \Delta M. \tag{2}$$

Both $M_{base}$ and $\Delta M$ are estimated analytically from tensor shapes without executing a forward pass: $M_{base}$ is computed from the encoder output dimensions and graph embedding parameters, whereas $\Delta M$ is derived from the intermediate tensor dimensions of the attention block per stratum.
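A short sketch of the linear budget model in Equation (2) follows. The function names and the memory figures in the example (6 GiB base, 2.5 GiB per stratum, 24 GiB available) are hypothetical and serve only to show how the estimate could be compared against a device limit.

```python
def estimated_peak_memory(m_base, delta_m, k_strata):
    """Equation (2): EPM(K) = M_base + K * delta_M, all in the same unit."""
    return m_base + k_strata * delta_m


def max_strata_that_fit(m_base, delta_m, m_avail):
    """Largest K with EPM(K) <= m_avail; 0 if even the base does not fit."""
    if delta_m <= 0:
        raise ValueError("delta_m must be positive")
    if m_base > m_avail:
        return 0
    return int((m_avail - m_base) // delta_m)


# Hypothetical figures in GiB, chosen for illustration only.
m_base, delta_m, m_avail = 6.0, 2.5, 24.0
epm = estimated_peak_memory(m_base, delta_m, k_strata=8)
print(epm, epm > m_avail, max_strata_that_fit(m_base, delta_m, m_avail))
```

Under these assumed figures, eight strata exceed the budget while seven would fit, which is the kind of threshold the HEAVY_DECOMPOSITION check depends on.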

3.3. Auto-Diagnose: Bottleneck-to-Strategy Mapping

The second phase maps the extracted bottleneck signals to a concrete parallel strategy prescription via a fixed signal-to-strategy mapping (Figure 3). The decision procedure is intentionally straightforward; its role is to provide a reproducible, auditable mapping from structural signals to strategy activation, not to contribute algorithmic complexity. The three signal nodes in Figure 3 (HIGH_INPUT, ATTN_DECODER, HEAVY_DECOMP) abbreviate the bottleneck signals defined in Table 3, and each independently activates its corresponding strategy from Table 4.

Signal-to-strategy mapping. If only one GPU is available, the framework prescribes single-GPU execution without parallelism. If the 3D spatial modality is absent (RNA-only or RNA + ATAC configurations), standard data parallelism (Strategy $S_{1}$) is sufficient because the bottleneck signals that require deeper parallelization are exclusively associated with the 3D spatial modality. When signals are present, the mapping evaluates each detected signal independently as follows:

If HIGH_INPUT_DIMENSIONALITY is raised, strategies $S_{2}$ and $S_{4}$ are activated: $S_{2}$ shards the reconstruction output across GPUs, and $S_{4}$ restructures the decoder-side matrix computation using the same feature partition.
If ATTENTION_DECODER is raised, strategy $S_{3}$ (chunked attention) is activated, replacing the full attention computation with sequential chunked evaluation.
If HEAVY_DECOMPOSITION is raised and the estimated peak memory $EPM (K_{strata})$ exceeds $M_{avail}$ , strategy $S_{5}$ (hidden-dimension sharding) is activated, partitioning the hidden dimension across GPUs.
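The mapping above can be sketched as a small deterministic function. This is a minimal illustration under stated assumptions: the function name `diagnose`, its argument names, and the string labels for strategies are hypothetical, and the sketch mirrors only the rules stated in this section.

```python
def diagnose(signals, n_gpus, epm, m_avail, has_spatial_modality=True):
    """Return the sorted list of activated strategies, e.g. ['S1', 'S3'].

    An empty list stands for plain single-GPU execution.
    """
    if n_gpus <= 1:
        return []
    active = {"S1"}  # data parallelism is the base infrastructure on multi-GPU
    if not has_spatial_modality:
        return sorted(active)
    # Each signal is evaluated independently of the others.
    if "HIGH_INPUT_DIMENSIONALITY" in signals:
        active |= {"S2", "S4"}
    if "ATTENTION_DECODER" in signals:
        active.add("S3")
    if "HEAVY_DECOMPOSITION" in signals and epm > m_avail:
        active.add("S5")
    return sorted(active)


all_signals = {"HIGH_INPUT_DIMENSIONALITY", "ATTENTION_DECODER",
               "HEAVY_DECOMPOSITION"}
print(diagnose(all_signals, n_gpus=4, epm=26.0, m_avail=24.0))
print(diagnose(all_signals, n_gpus=4, epm=20.0, m_avail=24.0))
```

Since the function has no randomness and no search loop, identical inputs always yield the same prescription, which matches the determinism principle stated in Section 3.1.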

These three signals characterize the structural source of modality asymmetry: high input dimensionality creates the encoder-side weight dominance, an attention decoder creates large quadratic or windowed temporary tensors, and deep decomposition causes cumulative activation retention. Together they describe the structural pattern that makes one modality dominate GPU memory.

While the current instantiation of $S_{3}$ and the tensor-parallel attention in $S_{5}$ target the local (sliding-window) attention pattern of HiGLUE, the ATTENTION_DECODER signal detects the architectural presence of attention in the decoder rather than its specific type. Adapting the framework to full self-attention decoders would require integrating alternative parallelization strategies (e.g., FlashAttention) into the strategy library, a direction we defer to future work. The profiling and diagnosis layers remain applicable because their detection logic depends on whether attention exists, not on the particular attention variant employed.

Strategy $S_{1}$ (data parallelism) serves as the underlying multi-process infrastructure and is always enabled when multiple GPUs are available. This design yields a compositional strategy prescription: each signal independently activates its corresponding strategy, and the functional dependencies (Section 3.4.6) ensure the correct composition, regardless of which subset is activated.

Unlike existing automatic parallelism frameworks that search over a continuous strategy space, the proposed method is fully determined by the detected bottleneck signals. The mapping from signal to strategy is fixed and documented (Table 4), making the prescription reproducible and auditable. The strategies are composed according to functional dependencies: the feature-parallel strategy $S_{2}$ must be activated before the restructured decode $S_{4}$ (because $S_{4}$ needs to know which features belong to the local rank); the chunked attention $S_{3}$ is an independent component; and the hidden-dimension sharding $S_{5}$ depends on the chunked attention routine from $S_{3}$ as a building block within its sharded attention block.
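One way to picture these functional dependencies is as a small directed graph that is sorted topologically before setup. The sketch below is an assumption-laden illustration: the `DEPENDS_ON` table encodes only the dependencies named in this section, modeling $S_{1}$ as a prerequisite of $S_{2}$ and $S_{3}$ is our reading of "underlying infrastructure", and pulling in a missing dependency automatically is a design choice of the sketch, not documented AMP behavior. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Dependencies as described in the text: S4 needs S2's feature partition,
# S5 reuses S3's chunked attention routine, and everything runs on top of S1.
DEPENDS_ON = {
    "S1": set(),
    "S2": {"S1"},
    "S3": {"S1"},
    "S4": {"S2"},
    "S5": {"S3"},
}


def setup_order(active):
    """Return an initialization order that satisfies DEPENDS_ON.

    Dependencies missing from `active` are pulled in transitively; whether
    AMP does this or rejects such a prescription is not stated in the paper.
    """
    needed, stack = set(), list(active)
    while stack:
        strategy = stack.pop()
        if strategy not in needed:
            needed.add(strategy)
            stack.extend(DEPENDS_ON[strategy])
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {s: DEPENDS_ON[s] for s in sorted(needed)}
    return list(TopologicalSorter(graph).static_order())


order = setup_order({"S4", "S5"})
print(order)
```

Any order returned this way places $S_{2}$ before $S_{4}$ and $S_{3}$ before $S_{5}$; the relative order of independent strategies is left unspecified.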

3.4. Strategy Library: Five Mathematically Equivalent Parallel Strategies

The strategy library consists of five parallel strategies, each targeting a specific bottleneck signal and provably equivalent to single-GPU computation. All five strategies were implemented as non-intrusive extensions to the original training workflow, preserving the original model code base unmodified and allowing the single-GPU baseline to remain available for comparison.

3.4.1. $S_{1}$ : Lightweight Data Parallelism

Strategy $S_{1}$ implements conventional multi-process data parallelism. Each GPU holds a complete model replica and processes a distinct subset of the training cells. After each backward pass, the gradients are manually synchronized across all ranks using an all-reduce sum, followed by division by the world size. The full guidance graph (stored as edge index, edge weight, and edge sign tensors) is replicated on every GPU and processed identically by the graph encoder forward pass on each rank. Although this is computationally redundant, the graph encoder is lightweight relative to the Hi-C decoder, and the redundancy avoids additional communication overhead for partial graph synchronization. Training metric logging, learning rate scheduling, and early stopping are disabled to ensure that all ranks follow an identical control flow.

Let $g_{ik}$ denote the gradient of parameter $\theta_{k}$ computed on rank $i$’s local data shard $D_{i}$. The synchronized gradient is as follows:

$$\bar{g}_{k} = \frac{1}{N} \sum_{i=1}^{N} g_{ik}. \tag{3}$$

The parameter update follows standard SGD:

$$\theta_{k} \leftarrow \theta_{k} - \eta \cdot \bar{g}_{k}. \tag{4}$$

Because the loss decomposes additively over independent data samples,

$$L = \sum_{b \in D} \ell(x_{b}, \theta), \tag{5}$$

with $D = \bigcup_{i} D_{i}$, $\bar{g}_{k}$ equals the gradient that would be obtained from the full batch, ensuring equivalence to single-GPU training with $N$ times the batch size.
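The averaging step in Equation (3) can be checked numerically on a toy problem. The sketch below simulates the ranks in a single process with NumPy and uses a least-squares loss as a stand-in for the model loss; it assumes that each rank reduces its local loss by the mean over an equal-sized shard. With a summed loss as written in Equation (5), the averaged gradient would match the full-batch gradient up to the constant factor $1/N$.

```python
import numpy as np


def grad_mse(theta, x, y):
    """Gradient of mean((x @ theta - y)^2) with respect to theta."""
    residual = x @ theta - y
    return 2.0 * x.T @ residual / len(y)


rng = np.random.default_rng(0)
n_ranks, per_rank, dim = 4, 8, 3
x = rng.normal(size=(n_ranks * per_rank, dim))
y = rng.normal(size=n_ranks * per_rank)
theta = rng.normal(size=dim)

# Each simulated rank computes a gradient on its own equal-sized shard.
local_grads = [
    grad_mse(theta, xs, ys)
    for xs, ys in zip(np.split(x, n_ranks), np.split(y, n_ranks))
]

# All-reduce sum followed by division by the world size (Equation (3)).
g_bar = np.sum(local_grads, axis=0) / n_ranks

# Reference: one process holding the full batch of n_ranks * per_rank samples.
g_full = grad_mse(theta, x, y)
print(np.allclose(g_bar, g_full))
```

In an actual multi-process run, the sum over `local_grads` would be replaced by a collective all-reduce; the arithmetic is otherwise the same.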

3.4.2. $S_{2}$ : Feature-Parallel Hi-C Reconstruction

Target signal. HIGH_INPUT_DIMENSIONALITY.

Unlike conventional data parallelism, where different GPUs process different cell batches, Strategy $S_{2}$ employs same-batch multi-rank execution: all ranks receive the identical cell batch $B$ and graph batch. The Hi-C feature dimension $F_{hic}$ is evenly partitioned into $N$ contiguous shards $F^{(0)}, F^{(1)}, \dots, F^{(N-1)}$, with rank $i$ computing the reconstruction loss only for its local feature shard.

Algorithm 1 summarizes the per-cell local reconstruction loss computation under feature-parallel sharding.

Algorithm 1 Local Hi-C Reconstruction Loss for a Single Cell

Input: cell batch $B$, node embeddings $v \in \mathbb{R}^{N_{nodes} \times D}$, Hi-C raw data $x$, $query \in \mathbb{R}^{B \times D}$, decoder parameters $\theta$
Output: local reconstruction loss $L_{hic}^{(i)}$

1: $[start, end) \leftarrow$ rank $i$ feature partition bounds in $[0, F_{hic})$
2: for stratum $k = 0$ to $K - 1$ do
3:     Compute mask: $M_{k} \leftarrow \{f \mid f \in \text{stratum } k \land f \in [start, end)\}$
4:     if $M_{k} = \emptyset$ then
5:         continue
6:     end if
7:     Extract local features: $x_{local} \leftarrow x[:, \{f \in M_{k}\}]$
8:     if $k = 0$ then
9:         $key \leftarrow v$ ▹ no attention at $k = 0$
10:     else if $S_{3}$ activated and $S_{5}$ activated then
11:         $key \leftarrow \text{ShardedAttentionBlock}(k - 1, v)$ ▹ Algorithm 2
12:     else
13:         $key \leftarrow \text{dilated\_conv}(k - 1, v)$
14:     end if
15:     Select: $K_{F} \leftarrow key[\text{feature\_indices}(M_{k}), :]$, $V_{F} \leftarrow v[\text{feature\_indices}(M_{k}), :]$ ▹ $F$ selected rows
16:     Decode: $\mu \leftarrow \text{softmax}(query \cdot K_{F}^{\top}) \cdot V_{F}$ ▹ see $S_{4}$ for detail
17:     Compute local loss: $L_{local} \leftarrow -\log p(x_{local} \mid \mu)$
18: end for
19: return mean of all $L_{local}$

The Hi-C reconstruction loss decomposes additively over the features as follows:

$$L_{hic}(x, \theta; F) = \sum_{i=0}^{N-1} L_{hic}(x, \theta; F^{(i)}). \tag{6}$$

Because gradients are accumulated across all ranks via all-reduce, the total gradient is

$$\nabla_{\theta} L_{hic} = \sum_{i=0}^{N-1} \nabla_{\theta} L_{hic}^{(i)}, \tag{7}$$

which is exactly the gradient that would be obtained from the full-feature reconstruction on a single GPU.
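The additivity in Equations (6) and (7) can be illustrated with a toy decoder. In the sketch below, a squared-error loss replaces the Hi-C likelihood, a single linear map `w` replaces the decoder, and the gradient is taken with respect to a shared input `z`; these are simplifications chosen for brevity. The `shard_bounds` helper and its handling of a remainder (earlier ranks receive one extra feature) are also assumptions, since the text specifies only an even contiguous partition.

```python
import numpy as np


def shard_bounds(n_features, n_ranks, rank):
    """Contiguous [start, end) bounds; earlier ranks absorb any remainder."""
    base, extra = divmod(n_features, n_ranks)
    start = rank * base + min(rank, extra)
    return start, start + base + (1 if rank < extra else 0)


def loss_and_grad(z, w, x):
    """Squared-error stand-in for the Hi-C loss, with its gradient w.r.t. z."""
    err = z @ w - x
    return 0.5 * np.sum(err ** 2), err @ w.T


rng = np.random.default_rng(0)
batch, hidden, n_features, n_ranks = 5, 4, 10, 4
z = rng.normal(size=(batch, hidden))       # shared across all ranks
w = rng.normal(size=(hidden, n_features))  # decoder weights, one column per feature
x = rng.normal(size=(batch, n_features))   # same cell batch on every rank

full_loss, full_grad = loss_and_grad(z, w, x)

# Each simulated rank sees the same batch but only its own feature columns.
parts = []
for rank in range(n_ranks):
    s, e = shard_bounds(n_features, n_ranks, rank)
    parts.append(loss_and_grad(z, w[:, s:e], x[:, s:e]))

sum_loss = sum(p[0] for p in parts)  # Equation (6)
sum_grad = sum(p[1] for p in parts)  # Equation (7), i.e. the all-reduce sum
print(np.isclose(sum_loss, full_loss), np.allclose(sum_grad, full_grad))
```

The same argument applies to any loss that is a sum of per-feature terms, which is the property the equivalence claim rests on.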

3.4.3. $S_{3}$ : Chunked Local Attention

Target signal. ATTENTION_DECODER.

The Hi-C decoder uses local self-attention layers, where each graph node attends to nodes within a spatial window of size $w$. The standard implementation processes all query positions simultaneously, materializing an $O (N_{nodes} \times 2 w \times D)$ temporary tensor. Strategy $S_{3}$ decomposes this into sequential chunks of size $C$:

Attn {(Q, K, V)}_{j} = softmax (\frac{Q_{j} K_{[j - w, j + w]}^{⊤}}{\sqrt{D}}) V_{[j - w, j + w]},

(8)

{Attn}_{chunked} (Q, K, V) = ⨁_{c = 0}^{⌈ N_{nodes} / C ⌉ - 1} Attn (Q_{[c C, \min ((c + 1) C, N_{nodes}))}, K, V)

(9)

where ⨁ denotes sequential concatenation along the query axis and each chunk covers the half-open query range $[c C, \min ((c + 1) C, N_{nodes}))$. For each chunk, the attention operation internally selects keys and values within the window region, reducing the peak memory from $O (N_{nodes} \cdot 2 w \cdot D)$ to $O (C \cdot 2 w \cdot D)$. The chunk size $C$ is configurable.

Because attention at position $j$ depends only on keys and values within $[j - w, j + w]$, and softmax normalization is performed independently per query position, the chunked evaluation yields results identical to the full computation. There is no cross-chunk dependency in the computation of the above equation. This design is distinct from Flash Attention, which tiles general dense attention operations with online softmax rescaling. $S_{3}$ exploits the structural locality of the window-limited attention of the decoder: each chunk’s queries interact only with their own window neighborhood; therefore, no inter-chunk softmax rescaling is required. Chunking is a direct consequence of the local attention pattern identified by the ATTENTION_DECODER signal, rather than a general memory-saving technique applicable to arbitrary attention patterns.
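The equivalence between full and chunked evaluation can be illustrated with a small NumPy sketch. It is not the HiGLUE or AMP implementation: the function names, the clipping of windows at the sequence boundaries, and the toy sizes are assumptions made for the example. The gathered key and value temporaries have leading dimension equal to the number of queries processed at once, which is $C$ per chunk instead of $N_{nodes}$.

```python
import numpy as np

def local_attention_block(Q, K, V, w, start, stop):
    """Window-limited attention for queries start..stop-1; query j sees keys in [j - w, j + w]."""
    n, D = K.shape
    pos = np.arange(start, stop)[:, None] + np.arange(-w, w + 1)[None, :]  # (c, 2w+1)
    valid = (pos >= 0) & (pos < n)                  # mask window slots that fall off the ends
    idx = np.clip(pos, 0, n - 1)
    Kw, Vw = K[idx], V[idx]                         # (c, 2w+1, D) temporaries
    scores = np.einsum("cd,cwd->cw", Q[start:stop], Kw) / np.sqrt(D)
    scores = np.where(valid, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)     # per-query stabilization
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)               # softmax is normalized per query position
    return np.einsum("cw,cwd->cd", p, Vw)

def chunked_local_attention(Q, K, V, w, C):
    """Evaluate queries in sequential chunks of size C and concatenate, as in Eq. (9)."""
    n = Q.shape[0]
    blocks = [local_attention_block(Q, K, V, w, s, min(s + C, n)) for s in range(0, n, C)]
    return np.concatenate(blocks, axis=0)

rng = np.random.default_rng(0)
n_nodes, D, w, C = 50, 8, 3, 16                     # toy sizes; C does not divide n_nodes
Q, K, V = (rng.normal(size=(n_nodes, D)) for _ in range(3))
full = local_attention_block(Q, K, V, w, 0, n_nodes)
chunked = chunked_local_attention(Q, K, V, w, C)
same = np.allclose(full, chunked)
```

No running maximum or rescaling is carried between chunks, which is the difference from online-softmax tiling noted above.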

3.4.4. $S_{4}$ : Restructured Decode Path

Target signal. HIGH_INPUT_DIMENSIONALITY (intermediate decode matrix).

The original Hi-C decoder computes a full intermediate matrix $Q K^{⊤} \in R^{B \times N_{nodes}}$ before selecting the subset of features needed for the current stratum via column indexing. When $N_{nodes}$ is large (e.g., 24,000+ nodes for chromosome-level processing), this intermediate matrix dominates memory. Strategy $S_{4}$ applies the matrix algebra identity:

{(Q K^{⊤})}_{[:, F]} = Q \cdot K_{[F, :]}^{⊤}

(10)

where $F \subset \{1, \dots, N_{nodes}\}$ is the feature index set for the current stratum, whose size is also written $F$, with $| F | = F ≪ N_{nodes}$. The left-hand side materializes a $B \times N_{nodes}$ intermediate; the right-hand side selects the $F$ key rows first and produces only a $B \times F$ result.

The same transformation is applied to the fully decoded path required for the per-stratum softmax normalization. Since $F ≪ N_{nodes}$ per stratum, the memory requirement drops from $O (B \cdot N_{nodes} + B \cdot F)$ to $O (B \cdot F)$.

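Equation (10) is a special case of the general column-selection identity

{(A B)}_{[:, I]} = A \cdot B_{[:, I]},

(11)

where $I$ is a set of column indices. Setting $A = Q$, $B = K^{⊤}$, and $I = F$ gives $B_{[:, I]} = K_{[F, :]}^{⊤}$ and hence Equation (10).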
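A quick numerical check of Equation (10) makes the memory difference concrete. This is a minimal sketch with arbitrary toy sizes; `F_idx` is a hypothetical name for the stratum's feature index set.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N_nodes, D = 4, 1000, 16
Q = rng.normal(size=(B, D))
K = rng.normal(size=(N_nodes, D))
F_idx = np.sort(rng.choice(N_nodes, size=25, replace=False))  # feature index set F

full_then_select = (Q @ K.T)[:, F_idx]      # materializes a B x N_nodes intermediate
select_then_multiply = Q @ K[F_idx, :].T    # materializes only a B x |F| result
same = np.allclose(full_then_select, select_then_multiply)
```

Both expressions give the same $B \times F$ matrix; only the second avoids the $B \times N_{nodes}$ temporary.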

3.4.5. $S_{5}$ : Hidden-Dimension Tensor Parallel Decoder

Target signal. HEAVY_DECOMPOSITION (multi-layer decomposed decoder).

Strategy $S_{5}$ is the most important strategy in the library. It partitions the hidden dimension $D$ of the Hi-C decoder evenly across $N$ ranks and transforms every hidden-dimension-dependent operator into a sharded yet mathematically equivalent form. Let the hidden dimension be partitioned as $D = N \times d$, with rank $i$ owning channels $D_{i} = [i \cdot d, (i + 1) \cdot d)$. A vector $x \in R^{D}$ is decomposed as

x = [x^{(0)} | x^{(1)} | \dots | x^{(N - 1)}], x^{(i)} \in R^{d}

(12)

The core algebraic principle that makes Strategy $S_{5}$ possible is the additive decomposability of the dot product over the hidden dimension:

Q K^{⊤} = \sum_{i = 0}^{N - 1} Q^{(i)} {(K^{(i)})}^{⊤}

(13)

where $Q^{(i)} = Q_{[:, D_{i}]}$ and $K^{(i)} = K_{[:, D_{i}]}$ are the column blocks owned by rank $i$. This identity holds because the query and key vectors are concatenated along the hidden dimension across ranks, and the full dot product is exactly the sum of the partial dot products over the concatenated sub-vectors. Every operator transformation in Table 5 is derived from this decomposition combined with the standard properties of all-reduce and layer normalization. We now detail each of these transformations.
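Written out entry by entry, Equation (13) is a regrouping of a single finite sum. The short derivation below assumes zero-based channel indices, so that the blocks $D_{i}$ cover $\{0, \dots, D - 1\}$ exactly once; the row indices $b$ and $n$ are introduced here only for illustration.

```latex
\begin{aligned}
{(Q K^{\top})}_{b n}
  &= \sum_{j = 0}^{D - 1} Q_{b j} K_{n j} \\
  &= \sum_{i = 0}^{N - 1} \sum_{j \in D_{i}} Q_{b j} K_{n j} \\
  &= \sum_{i = 0}^{N - 1} {\bigl( Q^{(i)} {(K^{(i)})}^{\top} \bigr)}_{b n} .
\end{aligned}
```

The middle step only reorders the terms of the sum, which is why a sum all-reduce over ranks recovers the unsharded product.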

LayerNorm sharding. Rank i computes the local partial sums over its hidden shard as follows:

S_{1}^{(i)} = \sum_{j \in D_{i}} x_{j}, S_{2}^{(i)} = \sum_{j \in D_{i}} x_{j}^{2} .

(14)

After two all-reduce operations, each rank reconstructs the global statistics as follows:

μ = \frac{1}{D} \sum_{i} S_{1}^{(i)}, σ^{2} = \frac{1}{D} \sum_{i} S_{2}^{(i)} - μ^{2} .

(15)

The sharded output is

y^{(i)} = γ^{(i)} \cdot \frac{x^{(i)} - μ}{\sqrt{σ^{2} + ϵ}} + β^{(i)},

(16)

which is identical to the corresponding slice of the unsharded LayerNorm output.
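Equations (14) to (16) can be simulated on one machine by treating each hidden shard as a rank. The sketch below is an illustration under stated assumptions: Python-level sums stand in for the two all-reduce calls, and the sizes, `eps`, and variable names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, eps = 128, 4, 1e-5
d = D // N
x = rng.normal(size=D)
gamma, beta = rng.normal(size=D), rng.normal(size=D)

# Reference: unsharded LayerNorm over the full hidden dimension (population variance).
mu_ref, var_ref = x.mean(), x.var()
y_ref = gamma * (x - mu_ref) / np.sqrt(var_ref + eps) + beta

shards = [slice(i * d, (i + 1) * d) for i in range(N)]
S1 = sum(x[s].sum() for s in shards)            # stands in for all-reduce of S_1^(i), Eq. (14)
S2 = sum((x[s] ** 2).sum() for s in shards)     # stands in for all-reduce of S_2^(i), Eq. (14)
mu = S1 / D
var = S2 / D - mu ** 2                          # Eq. (15)

# Each rank normalizes only its own slice with its own gamma/beta slice, Eq. (16).
y_shards = [gamma[s] * (x[s] - mu) / np.sqrt(var + eps) + beta[s] for s in shards]
y_sharded = np.concatenate(y_shards)
matches = np.allclose(y_sharded, y_ref)
```

Only two scalars per normalized vector cross rank boundaries, so the communication volume of this step does not grow with $d$.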

Linear layer sharding. For a weight matrix $W \in R^{D \times D}$, rank $i$ holds only the columns $W^{(i)} \in R^{D \times d}$ corresponding to its hidden shard. Each rank computes the partial product $y_{i} = W^{(i)} x^{(i)} \in R^{D}$; an all-reduce (AR) then sums the partial products, $y = AR (y_{i}) = \sum_{i} y_{i}$, and each rank retains $y^{(i)} = y [D_{i}]$.
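The column-sharded linear layer can be checked in the same single-process style. This is a sketch, not the AMP operator: `np.sum` over the list of partial products plays the role of the sum all-reduce, and the toy dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 12, 3
d = D // N
W = rng.normal(size=(D, D))
x = rng.normal(size=D)
shards = [slice(i * d, (i + 1) * d) for i in range(N)]

partials = [W[:, s] @ x[s] for s in shards]   # y_i = W^(i) x^(i), each a vector in R^D
y = np.sum(partials, axis=0)                  # stands in for the sum all-reduce
y_local = [y[s] for s in shards]              # each rank keeps only y[D_i]

y_ref = W @ x
full_matches = np.allclose(y, y_ref)
slices_match = np.allclose(np.concatenate(y_local), y_ref)
```

Each partial product has the full output length $D$, so the all-reduce payload here is $D$ values per vector, in contrast to the two scalars needed for LayerNorm.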

Sharded decode. The decomposition of Equation (13) is directly applied. Each rank computes its partial contribution as follows:

M^{(i)} = Q^{(i)} {(K^{(i)})}^{⊤} \in R^{B \times N_{nodes}},

(17)

and a single all-reduce sum recovers the full result. This decomposes the $O (B \cdot N_{nodes} \cdot D)$ decode matrix multiplication into $N$ parallel $O (B \cdot N_{nodes} \cdot D / N)$ operations.

Sharded local attention. Rank i computes local scores:

S^{(i)} = \frac{Q^{(i)} {(K^{(i)})}^{⊤}}{\sqrt{D}} .

(18)

Note that the normalization is by the global dimension $\sqrt{D}$ rather than the local dimension $\sqrt{d}$. After all-reduce, $S = \sum_{i} S^{(i)} = Q K^{⊤} / \sqrt{D}$, then ${output}^{(i)} = softmax (S) \cdot V^{(i)}$, which is equivalent to the unsharded computation.
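The role of the global $\sqrt{D}$ scaling can be seen in a small simulation. The sketch below omits the window mask for brevity and uses dense attention on toy sizes, so it illustrates only the hidden-dimension sharding; the helper `softmax` and all variable names are assumptions made for the example.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, D, N = 6, 8, 4
d = D // N
Q, K, V = (rng.normal(size=(n, D)) for _ in range(3))
shards = [slice(i * d, (i + 1) * d) for i in range(N)]

# Reference: unsharded attention.
ref = softmax(Q @ K.T / np.sqrt(D)) @ V

# Per-rank partial scores scaled by the global sqrt(D), Eq. (18), then a simulated all-reduce.
S = sum(Q[:, s] @ K[:, s].T / np.sqrt(D) for s in shards)
out = np.concatenate([softmax(S) @ V[:, s] for s in shards], axis=1)
matches = np.allclose(out, ref)

# Scaling by the local sqrt(d) instead inflates every score by sqrt(D / d).
S_wrong = sum(Q[:, s] @ K[:, s].T / np.sqrt(d) for s in shards)
local_scaling_differs = not np.allclose(S_wrong, S)
```

With $D / d = 4$ in this example, local scaling would double every pre-softmax score, which sharpens the attention distribution and breaks equivalence with the single-GPU result.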

Algorithm 2 integrates all five sharded operations into a complete per-stratum decoder attention block, illustrating the end-to-end tensor-parallel computation flow on each rank.

Algorithm 2 Sharded Decoder Attention Block (per rank $i$, per stratum $k \geq 1$)

Input: $v_{input} \in R^{N_{nodes} \times D}$ , $query \in R^{B \times D}$ , decoder parameters $W_{qk}^{(i)}, W_{val}^{(i)}, W_{ff}^{(i)}$ for rank $i$
Output: $decoded \in R^{B \times F}$ , updated $v^{(i)} \in R^{N_{nodes} \times d}$

1: $v^{(i)} \leftarrow v_{input} [:, D_{i}]$ ▹ Extract hidden shard
2: $prenorm_local \leftarrow ShardedLayerNorm (v^{(i)})$
3: $qk_local \leftarrow ShardedLinear (prenorm_local, W_{qk}^{(i)})$
4: $val_local \leftarrow ShardedLinear (prenorm_local, W_{val}^{(i)})$
5: $[q_local, k_local] \leftarrow split (qk_local)$ ▹ $N_{nodes} \times d$ each
6: $scores_local \leftarrow 0_{N_{nodes} \times N_{nodes}}$
7: for $c = 0$ to $⌈ N_{nodes} / C ⌉ - 1$ do
8:     $s \leftarrow c \cdot C$ , $e \leftarrow \min ((c + 1) C, N_{nodes})$
9:     $scores_local [s : e, :] \leftarrow q_local [s : e, :] \cdot {k_local}^{T} / \sqrt{D}$ ▹ per-chunk computation, window-masked as in $S_{3}$
10: end for
11: $scores \leftarrow AllReduceSum (scores_local)$
12: $attn_out \leftarrow softmax (scores) \cdot val_local$
13: $key_local \leftarrow val_local + attn_out$
14: $key_local \leftarrow ShardedGEGLUFFN (key_local, W_{ff}^{(i)})$
15: $partial \leftarrow query [:, D_{i}] \cdot {key_local}^{T}$
16: $decoded \leftarrow AllReduceSum (partial)$
17: return $decoded [:, F_{indices}]$ , $key_local$

All five operations in Table 5 satisfy the identity $y^{(i)} = y [:, D_{i}]$: each rank’s sharded output exactly equals the corresponding slice of the unsharded single-GPU computation. The proof is divided into three steps.

Dot product additivity. For any query $Q \in R^{B \times D}$ and key $K \in R^{N_{nodes} \times D}$ , the decomposition in Equation (13) follows from the definition of the dot product as a sum over the hidden dimension. This identity is the algebraic foundation for sharded decoding, sharded local attention scores, and sharded GEGLU FFN partial computations.
LayerNorm invariance under sharding. The global mean $μ = \frac{1}{D} \sum_{j = 1}^{D} x_{j}$ and variance $σ^{2} = \frac{1}{D} \sum_{j = 1}^{D} {(x_{j} - μ)}^{2}$ are reconstructed from the per-rank partial sums via Equations (14) and (15). Since the normalization $y_{j} = γ_{j} (x_{j} - μ) / \sqrt{σ^{2} + ϵ} + β_{j}$ is element-wise after $μ$ and $σ^{2}$ are determined, applying it to the local shard with local parameters $γ^{(i)}, β^{(i)}$ produces results identical to applying the full LayerNorm and then slicing.
Softmax equivalence under global normalization. The sharded attention computation is normalized by the global dimension $\sqrt{D}$ as shown in Equation (18). After all-reduce, $S = \sum_{i} S^{(i)} = Q K^{⊤} / \sqrt{D}$ , which is exactly the unsharded pre-softmax attention score matrix. Applying softmax to the correctly reconstructed global scores and multiplying by the local value shard yields an attention output identical to the corresponding slice of the unsharded result.

These three properties, all relying on the basic algebraic fact that summation over a partitioned index is order-independent, collectively guarantee the full mathematical equivalence of Strategy $S_{5}$ to single-GPU computation.

The ATTENTION_DECODER signal is broadly defined to detect the presence of any attention mechanism in the decoder. Strategy $S_{3}$ and the attention-specific portion of $S_{5}$ are instantiated here for the local (sliding-window) attention pattern found in HiGLUE’s Hi-C decoder. For models employing full self-attention, the same signal would activate alternative strategies (such as Flash Attention or sequence-parallel attention). The remaining strategies $S_{1}$, $S_{2}$, $S_{4}$, and the non-attention components of $S_{5}$ (LayerNorm, Linear, and Decode sharding) are independent of the attention variant and can be applied directly to any multi-modal 3D biological model with a large 3D spatial modality. New strategies for different attention patterns can be registered through the framework’s extensibility interface (Design Principle 5, Section 3.1) without modifying the profiling or diagnosis logic.

3.4.6. Strategy Composition

The five strategies interact through well-defined functional dependencies that determine how they are composed in the unified per-stratum computation (Algorithm 1). These dependencies reflect which strategy’s output serves as another’s input, not a sequential execution pipeline.

$S_{2}$ determines the local feature partition, and $S_{4}$ applies key-first decoding using this partition; hence, $S_{2}$ must be evaluated before $S_{4}$.
$S_{3}$ provides chunked attention as a reusable operator that does not depend on the outputs of $S_{2}$ or $S_{4}$; it is invoked within the attention block regardless of which features are being processed.
$S_{5}$ internally invokes both $S_{3}$ and $S_{4}$ in its sharded attention block: the chunked attention mechanism from $S_{3}$ is called with cross-rank score reduction enabled, and the key-first decoding approach of $S_{4}$ is adopted in the sharded decode step.
$S_{1}$ operates as the underlying multi-process infrastructure throughout.

3.5. Auto-Execute: Strategy Composition and Execution

The third phase executes the prescribed strategy configuration. Strategy $S_{1}$ (data parallelism) serves as the underlying multi-process infrastructure and is always enabled when multiple GPUs are available.

When no bottleneck signals are detected, only $S_{1}$ is activated, and multi-process data-parallel training is launched. The trainer inherits the original single-GPU training procedure and augments the per-iteration step with manual gradient all-reduce synchronization.

When one or more bottleneck signals are detected, the framework activates the corresponding subset of strategies as prescribed by the diagnosis phase (Section 3.3). All activated strategies are integrated into the training loop and composed according to their functional dependencies (Section 3.4.6): $S_{2}$ determines the local feature partition and $S_{4}$ applies key-first decoding using this partition, both inside the feature-parallel loss function; $S_{3}$ and $S_{5}$ operate through the tensor-parallel attention primitives; and $S_{1}$ provides the data-parallel infrastructure throughout.
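The activation rule described in this section can be sketched as a small lookup. This is a hypothetical illustration rather than the AMP diagnosis code: the function `prescribe`, the string labels, and in particular the pairing of $S_{2}$ with HIGH_INPUT_DIMENSIONALITY are assumptions, since only the target signals of $S_{3}$, $S_{4}$, and $S_{5}$ are stated in Sections 3.4.3 to 3.4.5.

```python
from typing import List, Set

# Signal -> strategies it activates. The S2 entry is an assumed pairing (see lead-in).
SIGNAL_TO_STRATEGIES = {
    "HIGH_INPUT_DIMENSIONALITY": {"S2", "S4"},
    "ATTENTION_DECODER": {"S3"},
    "HEAVY_DECOMPOSITION": {"S5"},
}

def prescribe(signals: Set[str], n_gpus: int) -> List[str]:
    """Return the activated strategies for the detected signals (hypothetical helper)."""
    active = {"S1"} if n_gpus > 1 else set()   # S1 is always on when multiple GPUs exist
    for signal in signals:
        active |= SIGNAL_TO_STRATEGIES[signal]
    # Sorted order keeps S2 ahead of S4, and S3/S4 ahead of S5, matching Section 3.4.6.
    return sorted(active)

lightweight = prescribe(set(), n_gpus=4)
heavyweight = prescribe(
    {"HIGH_INPUT_DIMENSIONALITY", "ATTENTION_DECODER", "HEAVY_DECOMPOSITION"}, n_gpus=4
)
```

With no signals the sketch returns only the data-parallel strategy, and with all three signals it returns the five-strategy configuration used for the heavyweight runs.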

4. Results

This section empirically validates the framework through four experiments: lightweight configurations verify that no unnecessary parallelism is imposed (Section 4.1), a leave-one-out ablation quantifies each strategy’s marginal memory contribution (Section 4.2), per-stratum memory profiling confirms the EPM budget model (Section 4.3), and cross-configuration evaluation tests generalization and GPU scaling (Section 4.4). HiRES [8] is among the few publicly available datasets providing paired single-cell Hi-C and RNA-seq at genome-wide scale with published preprocessing pipelines—the multi-modal 3D spatial plus orthogonal modality setting that AMP targets.

We evaluated AMP on the HiRES mouse brain dataset [8] under four configurations combining two bin sizes (100 kb, ∼27 K bins; 500 kb, ∼5.4 K bins) and two stratum counts (10 and 32). All experiments were conducted on servers equipped with NVIDIA RTX 3090 GPUs (24 GB each) using NCCL-based distributed communication. Hyperparameters (learning rate $2 \times 10^{- 3}$, latent dimension 64, hidden dimension 128) were held constant across all runs. A batch size of 128 was used for the 100 kb configurations and 256 for the 500 kb configurations, with a local-attention chunk size of 32 on 100 kb and 128 on 500 kb. Each configuration ran one pretrain stage and one fine-tune stage, each consisting of five iterations in benchmark mode. Lightweight configurations (10 strata) serve as the baseline where single-GPU training is feasible; heavyweight configurations (32 strata) exhaust single-GPU memory and are addressed by the prescribed five-strategy configuration ($S_{1}$–$S_{5}$), of which $S_{3}$ and $S_{5}$ are the decisive enablers (Section 4.2).

All timing values represent per-epoch averages obtained over 10 training iterations (five pretrain followed by five fine-tune) per configuration. Peak GPU memory values are reported as mean ± standard deviation over the same 10 iterations, measured via torch.cuda.max_memory_allocated() at the end of each iteration. Memory standard deviations were below 0.7 GB in all tested configurations.

4.1. Multi-GPU Speedup on Lightweight Configurations

We first evaluated whether AMP provides training acceleration on configurations that already fit on a single GPU. Both lightweight configurations (10 strata) were trained with data parallelism ($S_{1}$ only) using four GPUs. Table 6 shows that four-GPU data parallelism yields approximately a 1.5× speedup on the 500 kb configuration (1.59 to 1.09 s/epoch). The 100 kb configuration showed a larger per-epoch time under DDP (4.67 s vs. 2.34 s on a single GPU), reflecting load imbalance caused by the highly variable number of genomic contacts per cell at finer bin resolution.

The slowdown on the 100 kb configuration arises because, at finer bin resolution, the per-cell graph complexity varies substantially, causing uneven work distribution across GPUs. This imbalance motivates the strategy-based parallelism of AMP: pure data parallelism alone is insufficient when the data distribution is inherently skewed. Nevertheless, these results confirm two important baselines: first, AMP’s data-parallel backbone ($S_{1}$) operates correctly; second, the auto-diagnosis layer correctly limits parallelism to data-parallel mode for configurations that fit comfortably on a single GPU, imposing no unnecessary parallelism.

4.2. Leave-One-Out Ablation: Quantifying Per-Strategy Memory Impact

To quantify the marginal contribution of each parallel strategy to memory reduction, we conducted a leave-one-out ablation on the 500 kb–32 strata configuration using eight GPUs with a batch size of one. Starting from all five strategies ($S_{1}$–$S_{5}$), we disabled one strategy at a time and recorded the training outcome and peak per-GPU memory. Table 7 reports these results.

The results revealed a clear hierarchy of memory impact among the strategies.

$S_{3}$ (Chunked Local Attention) is the first line of defense. Disabling $S_{3}$ while keeping $S_{4}$ and $S_{5}$ active results in an immediate out-of-memory failure at stratum 1. The unchunked local attention attempts to allocate a full attention matrix for 113,505 guidance graph nodes, requesting approximately 48 GiB in a single allocation and exceeding the GPU capacity by a factor of two. Without chunking, the decoder cannot complete even the first attention-enabled stratum, regardless of the other optimizations.
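The reported allocation size is consistent with a dense square score matrix over all guidance graph nodes. The arithmetic below is a back-of-the-envelope check that assumes 4-byte (float32) elements and treats the 24 GB card as 24 GiB; neither assumption is stated in the text.

```python
n_nodes = 113_505
bytes_per_element = 4                      # assumption: float32 attention scores
full_matrix_gib = n_nodes ** 2 * bytes_per_element / 2 ** 30
gpu_capacity_gib = 24                      # assumption: 24 GB card counted as 24 GiB
ratio = full_matrix_gib / gpu_capacity_gib
```

Under these assumptions the matrix alone needs just under 48 GiB, about twice the device capacity, before any other activation is counted.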

$S_{5}$ (Hidden-Dimension Tensor Parallelism) is the decisive enabler. Disabling $S_{5}$ causes per-stratum memory growth to accelerate from approximately 0.56 GB per stratum to 1.38 GB per stratum, more than doubling the decoder’s activation footprint. Training fails at stratum 16 after exceeding 22.33 GB, well before reaching the final stratum. With $S_{5}$ active, the full five-strategy configuration completes all 32 strata within 17.87 GB, over 25% below the 24 GB limit. This confirms $S_{5}$ as the strategy that converts an untrainable configuration into a trainable one.
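The reported growth rates can be placed in a simple linear model to see how they relate to the 24 GB budget. The sketch below is a rough illustration, not the EPM budget model of Section 3.2: it assumes zero baseline memory and ignores transient allocations, and the helper names are hypothetical.

```python
from typing import Optional

def peak_after(strata: int, delta_gb: float, base_gb: float = 0.0) -> float:
    """Linear estimate of accumulated memory after processing `strata` strata."""
    return base_gb + strata * delta_gb

def first_count_over(budget_gb: float, delta_gb: float,
                     base_gb: float = 0.0, max_strata: int = 32) -> Optional[int]:
    """Smallest number of strata whose linear estimate exceeds the budget, if any."""
    for k in range(1, max_strata + 1):
        if peak_after(k, delta_gb, base_gb) > budget_gb:
            return k
    return None

with_s5_total = peak_after(32, 0.56)            # linear estimate for all 32 strata
without_s5_at_16 = peak_after(16, 1.38)         # linear estimate where training failed
with_s5_limit = first_count_over(24.0, 0.56)    # None: the budget is never exceeded
without_s5_limit = first_count_over(24.0, 1.38)
```

The estimate with $S_{5}$ (about 17.9 GB) is close to the measured 17.87 GB, and the estimate without $S_{5}$ at 16 strata (about 22.1 GB) is close to the reported 22.33 GB. The linear model alone would place the crossing a little later than the observed failure, which is expected because it leaves out the baseline and the temporary buffers needed to compute the next stratum.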

$S_{4}$ (Local Key Optimization, the restructured decode path of Section 3.4.4) provides significant but non-critical savings. Without $S_{4}$, the decoder reverts to gathering full key and value tensors across all hidden-dimension ranks before each attention operation. Training still completes, but the peak memory increases by 3.24 GB (from 17.87 to 21.11 GB), confirming that $S_{4}$ eliminates a substantial intermediate memory cost while not being individually indispensable for this configuration.

$S_{2}$ (Feature-Parallel Encoder) primarily optimizes computation rather than memory. Disabling $S_{2}$ leaves the peak memory unchanged at 17.87 GB. The output of the encoder projection layer is only $[1, 128]$, and its weight tensor (∼77 MB) is negligible compared to the decoder’s multi-GB activations. $S_{2}$ reduces the per-GPU FLOPs by distributing the large feature-dimension matrix multiplication across ranks.

Taken together, this leave-one-out analysis establishes a clear ranking of memory impact: $S_{3} ≫ S_{5} ≫ S_{4} > S_{2}$. $S_{3}$ and $S_{5}$ are both individually necessary for training to succeed in this configuration; $S_{4}$ provides measurable relief; and $S_{2}$ optimizes throughput rather than memory footprint.

4.3. Per-Stratum Memory Profile

To understand why the 32-strata configuration exhausts GPU memory and how $S_{5}$ resolves this, we profiled per-GPU memory consumption after each decoder stratum with all five strategies on the 100 kb–32 strata configuration (four GPUs). Memory snapshots were collected at key points across the training pipeline using built-in profiling hooks.

Figure 4 shows the near-linear growth of per-GPU memory with stratum index, approximately 0.45 GB per stratum once the attention-enabled decoder layers begin. With all five strategies activated, $S_{5}$ partitions the hidden dimension $D$ across GPUs, keeping the per-stratum memory increment sufficiently low to remain below 24 GB across all 32 strata. The observed linear accumulation is consistent with the analytical EPM budget model formulated in Section 3.2: each additional stratum adds a predictable $Δ M$ to the per-GPU memory footprint, and hidden-dimension sharding compresses this per-stratum cost by approximately a factor of $N$ (the number of GPUs). Extrapolating the same per-stratum increment without $S_{5}$ indicates that the 24 GB limit would be exceeded at approximately stratum 28, which is consistent with single-GPU profiling, where memory exceeded 22 GB at strata 18–19 under the original unoptimized decoder.

4.4. Generalization and Scalability Across Configurations

To assess whether the five-strategy configuration generalizes beyond a single configuration and GPU count, we evaluated AMP across both bin sizes on four and eight GPUs.

Table 8 confirms that AMP generalizes across bin sizes. Both 32-strata configurations exhaust single-GPU memory but complete training under all five strategies on four GPUs. The 500 kb configuration peaks at 17.89 ± 0.15 GB, and the 100 kb configuration peaks at 19.51 ± 0.67 GB, both safely under the 24 GB limit. The absolute memory values differ by approximately 1.6 GB despite the 100 kb configuration having five times as many input bins (∼27 K vs. ∼5.4 K), demonstrating that $S_{5}$ effectively decouples per-device memory from the input dimensionality, which is a direct consequence of sharding along the hidden dimension rather than the feature dimension.

Table 9 reports the throughput scaling from four to eight GPUs. The 100 kb configuration achieves a 3.8× speedup when the number of GPUs is doubled, while the 500 kb configuration gains a 2.0× improvement. The larger relative gain on 100 kb is expected because its substantially larger bin count (∼27 K vs. ∼5.4 K bins) makes the attention and decode stages the dominant runtime components, where $S_{3}$ and $S_{5}$ achieve the greatest parallelization efficiency through finer hidden-dimension sharding. Peak per-GPU memory does not decrease proportionally with the GPU count beyond four GPUs: at $d = D / 8$, the residual memory is dominated by the non-sharded components (RNA encoder, guidance graph embedding, and loss computation) rather than the decoder’s hidden-dimension activations. This memory plateau confirms that $S_{5}$ has effectively saturated the hidden-dimension memory bottleneck on the decoder side; further memory reduction would require sharding of the encoder and graph embedding stages, which are not currently targeted by the strategy library.

The speedup gap between the two configurations is explained by the communication overhead of the distributed deployment. Each attention stratum under $S_{5}$ incurs the all-reduce operations listed in Table 5 (up to 10 per stratum across 31 attention-enabled strata). The 500 kb configuration (∼5.4 K bins) has a lighter compute load, resulting in a higher communication-to-computation ratio and a modest 2.0× speedup. In contrast, the 100 kb configuration (∼27 K bins) carries a substantially heavier attention and decode workload that amortizes the same communication cost, achieving a 3.8× speedup.

Finally, to verify that standard data parallelism alone is insufficient for heavyweight configurations, we evaluated both 32-strata configurations under pure $S_{1}$ on four GPUs. In both cases, each GPU holds a full model replica, and the structural memory bottleneck, identified as predictable from the data configuration in Section 3.2, is duplicated rather than distributed across devices. Training fails with OOM before the first epoch is completed. This negative result confirms the analysis in Section 2.2: generic parallelism strategies that ignore modality-specific structural asymmetry cannot resolve the predictable memory bottleneck of multi-modal 3D biological models. Only the diagnosis-driven five-strategy configuration, with hidden-dimension tensor parallelism as its decisive component, converts out-of-memory failures into successful training runs.

4.5. Summary of Experimental Validation

Taken together, the experimental results validate the three core claims of this work. First, the memory bottleneck of multi-modal 3D biological models is structurally predictable from data configuration alone: the EPM budget model accurately captures the linear accumulation of per-stratum activations, and the OOM threshold can be identified before training begins. Second, bottleneck-specific diagnosis can replace cost-model search: three orthogonal signals mapped to a prescribed set of five strategies suffice to resolve the bottleneck in all tested configurations without any trial-and-error strategy exploration. Third, hidden-dimension tensor parallelism ($S_{5}$) is the decisive enabler: the leave-one-out ablation on the 500 kb–32 strata configuration shows that disabling $S_{5}$ more than doubles the per-stratum memory growth and causes OOM at stratum 16, while the full five-strategy configuration completes all 32 strata within 17.87 GB. The generalization of these results across two bin sizes and two GPU counts further demonstrates that the AMP framework is not configuration-specific but rather addresses a structural property of this model class.

5. Conclusions

In this work we addressed the structural memory bottleneck that prevents training heavyweight multi-modal 3D biological models on single GPUs. We showed that in models of this class, the location and severity of the bottleneck are deterministically predictable from the data configuration, and that this predictability enables a diagnostic approach: three bottleneck signals (HIGH_INPUT_DIMENSIONALITY, ATTENTION_DECODER, and HEAVY_DECOMPOSITION) are extracted via zero-cost profiling and mapped to a prescribed set of parallel strategies. The resulting AMP framework deploys five mathematically equivalent strategies ($S_{1}$–$S_{5}$), with hidden-dimension tensor parallelism ($S_{5}$) serving as the decisive enabler.

Experimental validation on the HiRES mouse brain dataset demonstrated that all five strategies together enable the training of previously untrainable 32-strata configurations at both 100 kb and 500 kb bin sizes. A leave-one-out ablation on the 500 kb configuration confirmed that $S_{3}$ (chunked attention) and $S_{5}$ (hidden-dimension tensor parallelism) are each individually necessary for training to succeed, while $S_{4}$ (local key optimization) provides significant supplementary memory savings and $S_{2}$ (feature-parallel encoding) optimizes computational throughput. Per-stratum memory profiling confirmed the linear accumulation predicted by the EPM budget model and showed that $S_{5}$ compressed the per-stratum activation footprint sufficiently to keep memory within hardware limits. Pure data parallelism on the same configuration failed with OOM on four GPUs, confirming that generic parallelism without modality-aware diagnosis cannot resolve structural memory bottlenecks.

Several limitations define the current scope of AMP. First, bottleneck signal definitions and strategy implementations were instantiated for Hi-C-based 3D genomic models with local attention decoders. Although the ATTENTION_DECODER signal is broadly defined to cover any attention mechanism, concrete strategies for full self-attention decoders (e.g., Flash Attention substitution) remain to be integrated and tested. Second, the auto-diagnosis module currently operates on a fixed three-signal vocabulary; extending the framework to additional modality types (e.g., spatial transcriptomics and imaging-based 3D data) will require registering new bottleneck signals and corresponding strategies through the extensibility interface.

Beyond the immediate generalization to other multi-modal 3D biological models, the core insight that domain-specific architectural asymmetries can replace generic cost-model searching has implications for the broader parallel computing community. Multi-modal 3D deep learning architectures where one spatial modality dominates computation face the same class of predictable memory bottlenecks. Applying the AMP diagnostic framework to 3D computer vision models, such as multi-modal LiDAR-camera detectors in autonomous driving, represents a natural next step. Future work should also explore the integration of activation checkpointing as a complementary memory-saving mechanism, incorporate a communication-aware scheduling layer that selects among parallelism strategies based on cluster topology, and provide an automated GPU-count recommendation that balances memory savings against communication cost. Extending the strategy library to support heterogeneous GPU clusters represents another practical direction.

Author Contributions

Conceptualization, K.Z., H.Z. and L.Y.; methodology, K.Z., H.Z. and L.Y.; software, K.Z. and H.Z.; validation, K.Z. and H.Z.; formal analysis, K.Z. and H.Z.; investigation, K.Z., H.Z. and L.Y.; resources, K.Z., H.Z. and L.Y.; data curation, K.Z. and H.Z.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z., H.Z. and L.Y.; visualization, K.Z.; supervision, H.Z. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The HiRES single-cell Hi-C and RNA-seq raw data analyzed in this study were originally generated by Liu et al. [8] and are publicly available at NCBI GEO under accession number GSE223917. The preprocessed data files (h5ad and scool formats) were produced by the authors using the HiGLUE preprocessing pipeline. The HiGLUE model source code is publicly available at https://github.com/dylan-plummer/HiGLUE (accessed on 16 June 2026). AMP framework code supporting the reported findings is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lieberman-Aiden, E.; Van Berkum, N.L.; Williams, L.; Imakaev, M.; Ragoczy, T.; Telling, A.; Amit, I.; Lajoie, B.R.; Sabo, P.J.; Dorschner, M.O.; et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science 2009, 326, 289–293.
2. Rao, S.; Huntley, M.; Durand, N.; Stamenova, E.; Bochkov, I.; Robinson, J.; Sanborn, A.; Machol, I.; Omer, A.; Lander, E.; et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 2014, 159, 1665–1680.
3. Nagano, T.; Lubling, Y.; Stevens, T.J.; Schoenfelder, S.; Yaffe, E.; Dean, W.; Laue, E.D.; Tanay, A.; Fraser, P. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 2013, 502, 59–64.
4. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85.
5. Qi, C.R.; Yi, L.; Su, H.; Guibas, L. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
6. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364.
7. Benallal, H.; Abdallah Saab, N.; Tairi, H.; Alfalou, A.; Riffi, J. Advancements in Semantic Segmentation of 3D Point Clouds for Scene Understanding Using Deep Learning. Technologies 2025, 13, 322.
8. Liu, Z.; Chen, Y.; Xia, Q.; Liu, M.; Xu, H.; Chi, Y.; Deng, Y.; Xing, D. Linking genome structures to functions by simultaneous single-cell Hi-C and RNA-seq. Science 2023, 380, 1070–1076.
9. Wei, X.; Xu, Y.; Yang, D.; Kim, K.; Yi, L.; Luo, W.; Lin, X.; Xiang, Y.; Williams, A.B.; Wang, X.; et al. Trimodal single-cell profiling of transcriptome, epigenome and 3D genome in complex tissues with scHiCAR. Nat. Biotechnol. 2026.
10. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
11. Chen, X.; Song, Z.; Zhou, J.; Xie, D.; Lu, J. Camera and LiDAR Fusion for Urban Scene Reconstruction and Novel View Synthesis via Voxel-Based Neural Radiance Fields. Remote Sens. 2023, 15, 4628.
12. Gayoso, A.; Steier, Z.; Lopez, R.; Regier, J.; Nazor, K.L.; Streets, A.; Yosef, N. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 2021, 18, 272–282.
13. Lopez, R.; Regier, J.; Cole, M.B.; Jordan, M.I.; Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 2018, 15, 1053–1058.
14. Ashuach, T.; Gabitto, M.I.; Koodli, R.V.; Saldi, G.A.; Jordan, M.I.; Yosef, N. MultiVI: Deep generative model for the integration of multimodal data. Nat. Methods 2023, 20, 1222–1231.
15. Cao, Z.J.; Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 2022, 40, 1458–1466.
16. Zhang, R.; Zhou, T.; Ma, J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat. Biotechnol. 2022, 40, 254–261.
17. Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat. Genet. 2022, 54, 725–734.
18. Valeyre, H.; Pati, P.; Gossi, F.; Somnath, V.R.; Martinelli, A.; Rapsomaniki, M.A. ChromFormer: A transformer-based model for 3D genome structure prediction. bioRxiv 2022.
19. Schuette, G.; Lao, Z.; Zhang, B. ChromoGen: Diffusion model predicts single-cell chromatin conformations. Sci. Adv. 2025, 11, eadr8265.
20. Plummer, D. Loop2Loop: Representation Learning of the 3D Genome for Multimodal Single-Cell Integration and In-Silico Chromatin Rewiring. Ph.D. Thesis, Case Western Reserve University School of Graduate Studies, Cleveland, OH, USA, 2024.
21. Yang, F.; Wang, W.; Wang, F.; Fang, Y.; Tang, D.; Huang, J.; Lu, H.; Yao, J. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 2022, 4, 852–866.
Cui, H.; Wang, C.; Maan, H.; Pang, K.; Luo, F.; Duan, N.; Wang, B. scGPT: Toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 2024, 21, 1470–1480. [Google Scholar] [CrossRef] [PubMed]
Li, G.; Fu, S.; Wang, S.; Zhu, C.; Duan, B.; Tang, C.; Chen, X.; Chuai, G.; Wang, P.; Liu, Q. A deep generative model for multi-view profiling of single-cell RNA-seq and ATAC-seq data. Genome Biol. 2022, 23, 20. [Google Scholar] [CrossRef] [PubMed]
Tan, J.; Shenker-Tauris, N.; Rodriguez-Hernaez, J.; Wang, E.; Sakellaropoulos, T.; Boccalatte, F.; Thandapani, P.; Skok, J.; Aifantis, I.; Fenyö, D.; et al. Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nat. Biotechnol. 2023, 41, 1140–1150. [Google Scholar] [CrossRef] [PubMed]
Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 200. [Google Scholar] [CrossRef]
Choy, C.; Gwak, J.; Savarese, S. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3070–3079. [Google Scholar] [CrossRef]
Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. ZeRO: Memory optimizations Toward Training Trillion Parameter Models. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 9–19 November 2020; pp. 1–16. [Google Scholar] [CrossRef]
Zhao, Y.; Gu, A.; Varma, R.; Luo, L.; Huang, C.C.; Xu, M.; Wright, L.; Shojanazeri, H.; Ott, M.; Shleifer, S.; et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Proc. VLDB Endow. 2023, 16, 3848–3860. [Google Scholar] [CrossRef]
Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on GPU clusters using megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021. [Google Scholar] [CrossRef]
Liu, H.; Zaharia, M.; Abbeel, P. Ring Attention with Blockwise Transformers for Near-Infinite Context. In Proceedings of the International Conference on Learning Representations; Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y., Eds.; 2024; Volume 2024, pp. 3992–4008. [Google Scholar]
Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 16344–16359. [Google Scholar]
Chen, J.; Li, S.; Guo, R.; Yuan, J.; Hoefler, T. AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 1331–1344. [Google Scholar] [CrossRef]
Shi, Y.; Liang, P.; Zheng, H.; Qiao, L.; Li, D. Automatic parallelism strategy generation with minimal memory redundancy. Front. Inf. Technol. Electron. Eng. 2025, 26, 109–118. [Google Scholar] [CrossRef]
Jia, Z.; Zaharia, M.; Aiken, A. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems; Talwalkar, A., Smith, V., Zaharia, M., Eds.; MLSys Organization: Indio, CA, USA, 2019; Volume 1, pp. 1–13. [Google Scholar]
Zheng, L.; Li, Z.; Zhang, H.; Zhuang, Y.; Chen, Z.; Huang, Y.; Wang, Y.; Xu, Y.; Zhuo, D.; Xing, E.P.; et al. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), Carlsbad, CA, USA, 11–13 July 2022; pp. 559–578. [Google Scholar]

Figure 1. Three-dimensional (3D) spatial structure captured by Hi-C data from the HiRES mouse brain dataset. (a) Hi-C contact heatmap of chromosome 1 (50–100 Mb, 500 kb resolution). Nonzero contacts are concentrated near the diagonal; the majority of matrix entries are zero. This structural sparsity causes dense encoder operations to waste computation on zero-valued entries, a key contributor to the memory bottleneck that AMP is designed to address. (b) 3D spatial architecture of chromosomes Chr1, Chr2, and Chr3 reconstructed from Hi-C contact data. The complexity of such spatial structures translates directly into the high-dimensional input tensors that create the memory bottlenecks addressed in this work.

Figure 2. Two-layer architecture of the proposed framework. The upper layer (Auto-Profile + Auto-Diagnose) performs lightweight bottleneck-aware routing: three bottleneck signals are extracted from data files, and each independently activates its corresponding parallel strategy. The lower layer (Strategy Library) contains five mathematically equivalent strategies ($S_{1}$–$S_{5}$) that are applied according to the activated subset, with composition dependencies defined in Section 3.4.6.

Figure 3. Signal-to-strategy mapping of the Auto-Diagnose phase. Three bottleneck signals are detected independently from the data configuration; each activates its corresponding strategy (solid arrows). Dashed arrows indicate functional dependencies detailed in Section 3.4.6: $S_{4}$ uses the feature partition established by $S_{2}$; $S_{5}$ internally invokes the chunked attention of $S_{3}$ and adopts the key-first decoding of $S_{4}$. $S_{1}$ serves as the data-parallel infrastructure throughout.

Figure 4. Per-stratum GPU memory consumption on the 100 kb–32 strata configuration with all five strategies (four GPUs). Memory grows monotonically with stratum index, reaching 18.54 GB at stratum 31, which remains below the 24 GB GPU limit (dashed line).

Table 1. Architectural components and input dimensionality of representative models for 3D spatial interaction data.

Model	Year	Task(s)	VAE	GNN	Trans. ³	Multi-Branch
totalVI [12]	2021	RNA + protein ¹	√			√
scMVP [23]	2022	RNA + ATAC ¹	√			√
Higashi [16]	2022	scHi-C imputation ²		√		√
ChromFormer [18]	2022	Chromatin structure prediction			√	
MultiVI [14]	2023	RNA + ATAC integration	√			√
HiGLUE [20]	2024	scHi-C + scRNA integration ²	√	√	√	√

¹ RNA + protein/ATAC: joint modeling of paired omics. ² The prefix “sc” denotes single-cell data, i.e., measurements taken from individual cells. ³ Trans. = Transformer.

Table 2. Comparison of AMP with representative automatic parallelism frameworks. Entries marked “—” indicate that the feature is not applicable: these frameworks handle only standard Transformer operators and do not support the non-standard decoder operators that AMP targets.

Feature	FlexFlow [34]	Alpa [35]	AutoDDL [32]	AMP (Ours)
Strategy det.	Cost-model	Cost-model	Cost-model	Diagnosis-driven
Modality-aware	No	No	No	Three signals
Non-std. TP	—	—	—	Five operators
Equiv. proofs	—	—	—	Proven
Target arch.	General	General	General	Multi-modal 3D

Abbreviations: det. = determination; Non-std. = Non-standard; TP = tensor parallelism; Equiv. = Equivalence; arch. = architecture.

Table 3. Bottleneck signals and their detection criteria.

Signal	Detection Criterion	Consequence
HIGH_INPUT_DIMENSIONALITY	$n_{features}$ ≫ feature count of other modalities	Encoder projection and decoder reconstruction dominate memory
ATTENTION_DECODER	decoder uses attention	Large $[N_{nodes}, D]$ temporary tensors in attention
HEAVY_DECOMPOSITION	decoder has multi-layer decomposed structure	Cumulative per-unit activation accumulation

Table 4. Bottleneck-to-strategy mapping.

Bottleneck Signal	Activated Strategy	Memory Reduction Mechanism
HIGH_INPUT_DIMENSIONALITY	$S_{2}$ : Feature-parallel reconstruction	Shard output features across ranks → $[B, F / N]$ per GPU
	$S_{4}$ : Restructured decode	Key-first matrix computation → $[B, F]$ per GPU
ATTENTION_DECODER	$S_{3}$ : Chunked attention	Sliding window chunking → $[C, N_{nodes}]$ per chunk
HEAVY_DECOMPOSITION	$S_{5}$ : Hidden-dimension tensor parallelism (if $EPM (K_{strata}) > M_{avail}$ )	Partition hidden dim across ranks → $[N_{nodes}, D / N]$ per rank
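
The routing in Tables 3 and 4 can be sketched as a small rule-based function. This is a minimal illustration, not the authors' implementation: the `ModelProfile` fields, the function names, and the `DOMINANCE_RATIO` threshold used to stand in for "≫" are all assumptions introduced here, and `est_peak_mem_gb` and `avail_mem_gb` are placeholders for $EPM (K_{strata})$ and $M_{avail}$.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    # Hypothetical profile fields; names are illustrative only.
    n_features_3d: int            # input dimensionality of the 3D (Hi-C) modality
    n_features_other: int         # largest input dimensionality among other modalities
    decoder_uses_attention: bool  # criterion for ATTENTION_DECODER
    decoder_is_decomposed: bool   # multi-layer decomposed decoder structure
    est_peak_mem_gb: float        # placeholder for EPM(K_strata)
    avail_mem_gb: float           # placeholder for M_avail


# Assumed threshold for "much larger than"; the tables do not give a number.
DOMINANCE_RATIO = 10.0


def diagnose(p: ModelProfile) -> list:
    """Return the bottleneck signals of Table 3 that fire for this profile."""
    signals = []
    if p.n_features_3d >= DOMINANCE_RATIO * p.n_features_other:
        signals.append("HIGH_INPUT_DIMENSIONALITY")
    if p.decoder_uses_attention:
        signals.append("ATTENTION_DECODER")
    if p.decoder_is_decomposed:
        signals.append("HEAVY_DECOMPOSITION")
    return signals


def select_strategies(p: ModelProfile) -> list:
    """Map signals to strategies following Table 4; S1 is always enabled."""
    strategies = {"S1"}  # data-parallel infrastructure throughout
    signals = diagnose(p)
    if "HIGH_INPUT_DIMENSIONALITY" in signals:
        strategies |= {"S2", "S4"}
    if "ATTENTION_DECODER" in signals:
        strategies.add("S3")
    # S5 is gated on the estimated peak memory exceeding available memory.
    if "HEAVY_DECOMPOSITION" in signals and p.est_peak_mem_gb > p.avail_mem_gb:
        strategies.add("S5")
    return sorted(strategies)
```

Each signal is checked independently, so the activated subset depends only on which criteria hold for the given data configuration rather than on a cost-model search.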

Table 5. Comparison of operations in single-GPU and tensor-parallel formulations.

Operation	Single-GPU Formulation	Tensor Parallel (Rank i)	All-Reduce Count
LayerNorm	$y = \frac{x - μ}{\sqrt{σ^{2} + ϵ}} ⊙ γ + β$	local $\sum_{i} S_{1}^{(i)}, \sum_{i} S_{2}^{(i)} \overset{AR}{\to} μ, σ^{2}$ ; $y^{(i)} = \frac{x^{(i)} - μ}{\sqrt{σ^{2} + ϵ}} ⊙ γ^{(i)} + β^{(i)}$	2
Linear	$y = W x + b$	$y_{i} = W^{(i)} x^{(i)}$ ; $y = \sum_{i} AR (y_{i})$ ; retain $y^{(i)}$	1
GEGLU FFN ¹	$y = g ⊙ GELU (v)$ where $[g; v] = W_{2 H} x$	partial → AR sum → retain paired gate/value rows for $D_{i}$	1
LocalAttn ²	$softmax (\frac{Q K^{⊤}}{\sqrt{D}}) V$	$AR (Q^{(i)} {(K^{(i)})}^{⊤}) \overset{AR}{\to} S$ ; $softmax (S / \sqrt{D}) V^{(i)}$	1
Decode	$Q K^{⊤}$	$\sum_{i = 0}^{N - 1} AR (Q^{(i)} {(K^{(i)})}^{⊤})$	1

¹ FFN means Feed-Forward Network. ² Attn means Attention.
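
To make the LayerNorm and Decode rows of Table 5 concrete, the following sketch simulates hidden-dimension sharding on a single process in pure Python. It is an illustration under stated assumptions, not the paper's code: `all_reduce_sum` merely sums a list to mimic a collective, all function names are hypothetical, and the hidden size is assumed to divide evenly by the number of ranks. The variance is computed as $E[x^{2}] - μ^{2}$, which is algebraically equal to the single-GPU form.

```python
import math
import random


def all_reduce_sum(per_rank_values):
    """Simulated all-reduce: every rank would receive this global sum."""
    return sum(per_rank_values)


def shard(vec, n_ranks):
    """Split a vector into equal contiguous slices (len(vec) % n_ranks == 0 assumed)."""
    size = len(vec) // n_ranks
    return [vec[r * size:(r + 1) * size] for r in range(n_ranks)]


def layernorm(x, gamma, beta, eps=1e-5):
    """Single-GPU reference LayerNorm over the hidden dimension."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]


def layernorm_tp(x_shards, gamma_shards, beta_shards, eps=1e-5):
    """Sharded LayerNorm: two all-reduces (sum and sum of squares), then local work."""
    d = sum(len(s) for s in x_shards)
    s1 = all_reduce_sum([sum(s) for s in x_shards])                 # all-reduce 1
    s2 = all_reduce_sum([sum(v * v for v in s) for s in x_shards])  # all-reduce 2
    mu = s1 / d
    var = s2 / d - mu * mu
    return [[(v - mu) / math.sqrt(var + eps) * g + b
             for v, g, b in zip(xs, gs, bs)]
            for xs, gs, bs in zip(x_shards, gamma_shards, beta_shards)]


def decode_tp(q_shards, k_shards):
    """Sharded q·k: each rank forms a partial dot product; one all-reduce sums them."""
    partials = [sum(a * b for a, b in zip(qs, ks))
                for qs, ks in zip(q_shards, k_shards)]
    return all_reduce_sum(partials)


def max_equivalence_error(hidden=8, n_ranks=4, seed=0):
    """Largest absolute gap between sharded and reference results on random data."""
    rng = random.Random(seed)
    x, gamma, beta, q, k = ([rng.uniform(-1, 1) for _ in range(hidden)]
                            for _ in range(5))
    ref = layernorm(x, gamma, beta)
    tp = [v for s in layernorm_tp(shard(x, n_ranks), shard(gamma, n_ranks),
                                  shard(beta, n_ranks)) for v in s]
    ln_err = max(abs(a - b) for a, b in zip(ref, tp))
    dec_err = abs(decode_tp(shard(q, n_ranks), shard(k, n_ranks))
                  - sum(a * b for a, b in zip(q, k)))
    return max(ln_err, dec_err)
```

Calling `max_equivalence_error()` should return a value at floating-point rounding level, which is the numerical counterpart of the equivalence the table asserts for these two operators.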

Table 6. Single-GPU vs. four-GPU data-parallel training time on lightweight configurations. Epoch times are averaged over one pretrain epoch and one fine-tune epoch (five iterations each).

Configuration	1 GPU	4 GPUs ( $S_{1}$ Only)
500 kb–strata10	1.59 s/epoch	1.09 s/epoch
100 kb–strata10	2.34 s/epoch	4.67 s/epoch

Table 7. Leave-one-out ablation of parallel strategies on the 500 kb–32 strata configuration (eight GPUs, batch size one). The DDP backbone ($S_{1}$) is enabled in all configurations. $Δ$ denotes the peak memory increase relative to all five strategies.

Activated Strategies	Disabled	Status	Peak (GB)	Δ (GB)
$S_{1} + S_{2} + S_{3} + S_{4} + S_{5}$ (full)	—	Completed	17.87	—
$S_{1} + S_{2} + S_{3} + S_{4}$	$S_{5}$ (hidden TP)	OOM @ strata 16	≥22.33	—
$S_{1} + S_{2} + S_{3} + S_{5}$	$S_{4}$ (local key)	Completed	21.11	+3.24
$S_{1} + S_{2} + S_{4} + S_{5}$	$S_{3}$ (chunked attn)	OOM @ strata 1	—	—
$S_{1} + S_{3} + S_{4} + S_{5}$	$S_{2}$ (feat. enc.)	Completed	17.87	≈0

Table 8. Training outcome and peak memory for heavyweight configurations under all five strategies (four GPUs). OOM indicates that training failed before completing one epoch. Peak memory values are reported as mean ± standard deviation over 10 iterations.

Configuration	One GPU	Four GPUs ( $S_{1}$ – $S_{5}$ )	Peak Memory (Four GPUs)
500 kb–strata32	OOM	Completed (137.5 s/epoch)	17.89 ± 0.15 GB
100 kb–strata32	OOM	Completed (5000 s/epoch)	19.51 ± 0.67 GB

Table 9. Training throughput scaling from four to eight GPUs on heavyweight configurations with all five strategies. Four-GPU peak memory values are reported in Table 8. Timing values represent per-epoch averages (five iterations per epoch). Eight-GPU peak memory is reported as mean ± std over 10 iterations.

Configuration	Four GPUs	Eight GPUs	Speedup	Peak Mem (Eight GPUs)
500 kb–strata32	137.5 s/epoch	69.0 s/epoch	2.0×	18.07 ± 0.14 GB
100 kb–strata32	5000 s/epoch	1320 s/epoch	3.8×	15.52 ± 0.66 GB
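
The speedup column of Table 9 follows directly from the per-epoch times. The short sketch below recomputes it and adds a scaling efficiency (speedup divided by the GPU ratio); the efficiency figure and the function names are additions for illustration and are not reported in the table.

```python
def speedup(t_base: float, t_scaled: float) -> float:
    """Ratio of baseline epoch time to scaled epoch time."""
    return t_base / t_scaled


def scaling_efficiency(t_base: float, t_scaled: float,
                       gpus_base: int, gpus_scaled: int) -> float:
    """Speedup normalized by the increase in GPU count (1.0 = linear scaling)."""
    return speedup(t_base, t_scaled) / (gpus_scaled / gpus_base)


# Per-epoch times in seconds, taken from Table 9 (four GPUs, eight GPUs).
runs = {
    "500 kb-strata32": (137.5, 69.0),
    "100 kb-strata32": (5000.0, 1320.0),
}

for name, (t4, t8) in runs.items():
    print(f"{name}: speedup {speedup(t4, t8):.1f}x, "
          f"efficiency {scaling_efficiency(t4, t8, 4, 8):.2f}")
# 500 kb-strata32: speedup 2.0x, efficiency 1.00
# 100 kb-strata32: speedup 3.8x, efficiency 1.89
```

An efficiency above 1.0, as in the 100 kb row, means that doubling the GPU count reduced the epoch time by more than half.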

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
