1. Introduction
One of the main issues that has limited the depth of convolutional neural networks (CNNs) is the vanishing gradient problem, in which error signals diminish as they are propagated backward through multiple layers. The introduction of residual connections in ResNet was a major breakthrough in deep learning: by preserving gradient flow, it enabled the successful training of very deep networks [1]. The prevalence of ResNet in academic research and its integration into various systems and applications suggest that it serves as a benchmark architecture due to its reliable performance and widespread acceptance [2]. The core residual formulation, $y = \mathcal{F}(x) + x$, establishes gradient highways: the identity term passes gradients directly across layers, mitigating their attenuation during backpropagation.
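As a minimal illustration of the gradient-highway argument (our sketch, not code from the GradAttn repository), the snippet below compares the end-to-end derivative of a stack of scalar layers with and without the identity term, assuming the standard residual form y = F(x) + x. The depth of 20 and the per-layer derivative of 0.1 are arbitrary assumptions chosen only to make the effect visible.

```python
def chain_gradient(local_grads, residual):
    """Derivative of the stack output w.r.t. its input for scalar layers.

    Plain layer:    y = F(x)      gives dy/dx = F'(x)
    Residual layer: y = F(x) + x  gives dy/dx = F'(x) + 1
    """
    total = 1.0
    for g in local_grads:
        total *= (g + 1.0) if residual else g
    return total


# Hypothetical per-layer derivatives F'(x); small values mimic a vanishing regime.
local_grads = [0.1] * 20

plain = chain_gradient(local_grads, residual=False)
skip = chain_gradient(local_grads, residual=True)
print(f"plain stack: {plain:.3e}   residual stack: {skip:.3e}")
```

In the plain stack the product of small factors collapses toward zero, whereas the identity term keeps every factor at or above one, so the signal reaching early layers does not vanish.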
Still, residual connections are inherently static and uniform: Each skip connection offers the same direct gradient path, regardless of input complexity or the relevance of particular features to the task. Such uniformity neglects a fundamental characteristic: Visual recognition tasks, ranging from natural images to medical imaging, require adaptive weighting of feature representations at different network depths. Residual connections cannot selectively emphasize some routes while suppressing others; they combine shallow texture features and deep semantic representations with equal importance. Despite recent advances in hybrid architectures and attention mechanisms, only a few works treat gradient flow itself as a learnable, task-adaptive component. Most approaches either address gradient degradation through fixed structural shortcuts or apply attention solely for feature enhancement, leaving a fundamental gap: the inability to dynamically route gradients based on input complexity and task requirements across network hierarchies.
This work introduces GradAttn (https://github.com/SoudeepGhoshal/GradAttn (accessed on 21 May 2026)), which implements attention-controlled gradient flow by replacing static residual connections with learnable gradient pathways. GradAttn uses transformer attention not only to model interactions between features but also to route gradient flow through CNN hierarchies. Multi-scale CNN features are extracted from various depths and mapped to a common embedding space, where transformer attention dynamically weights their contributions. This creates gradient pathways that can favor global semantic patterns for complex scenes or local texture cues for specialized domains, all learned automatically during training.
Our contributions are as follows: (a) We propose attention-controlled (inter-layer) gradient flow as a learnable alternative to fixed residual connections, enabling adaptive gradient pathways in deep networks. (b) We demonstrate that this approach outperforms ResNet-18 on five of eight diverse datasets, achieving up to an 11.07% accuracy improvement while maintaining comparable parameter counts. (c) We provide empirical evidence that controlled gradient instabilities introduced by attention often coincide with improved generalization, challenging the assumption that perfect stability is always optimal.
We perform a direct comparison against the industry standard ResNet, confirming that gradient routing achieves more effective learning. Since CNNs are omnipresent, this work has broad and immediate applicability.
We investigate three architectural variants, viz., those without positional encoding (No PE), those with learnable positional encoding, and those with rotary positional encoding (RoPE), revealing that the effectiveness of explicit spatial encoding is dataset-dependent, with CNN hierarchies often providing sufficient structural priors. These findings position attention mechanisms as fundamental enablers of learnable gradient control, opening possibilities for adaptive deep learning across diverse visual domains.
The remainder of this paper is organized as follows. Section 2 reviews related work on gradient flow in deep architectures, attention mechanisms, and multi-scale feature integration. Section 3 details the proposed GradAttn architecture, its positional encoding variants, training protocols, and evaluation datasets. Section 4 presents comprehensive experimental results, including classification performance, domain-specific analysis, training dynamics, gradient flow analysis, and parameter efficiency. Finally, Section 5 concludes this paper and outlines directions for future work.
3. Methods
The key idea of this work is to regulate the learning and flow of gradients via attention at the inter-layer level. Since each layer learns features at different levels of the semantic hierarchy, the idea is to efficiently regulate learning within these hierarchies. In this paper, we realize this key idea through self-attention in the image recognition problem using deep convolutional neural networks. The methodology is split into two parts: first, a high-level mathematical view of the proposed solution and, second, a deeper architectural explanation.
3.1. Conceptual Framework for Gradient Modulation
We first establish a conceptual framework to reason about our multi-scale token attention, which we also refer to as gradient modulation; the actual realization in GradAttn is through transformer self-attention, as described in Section 3.2. Let a deep network be a composition of $L$ layers $f_1, \dots, f_L$ with parameters $\theta = \{\theta_l\}_{l=1}^{L}$, producing intermediate feature maps $h_l$ for input $x$. In a standard residual network, the backward gradient at layer $l$ is the unmodulated chain $\nabla_{\theta_l} \mathcal{L}$, and skip connections inject a uniform identity term $I$, treating every layer as equally update-worthy regardless of the input or training state. We argue that this is suboptimal: at any training step $t$, the useful update magnitude for layer $l$ depends on how much that layer has already learned relative to others, a quantity not visible from local gradients alone.
We therefore introduce a meta-regulator $\mathcal{M}_\phi$ with parameter $\phi$ that observes a global summary of the network state and emits a per-layer modulation signal $\alpha_l^{(t)}$, interpreted as the relative degree to which layer $l$ should be updated at step $t$. Conceptually, $\alpha_l^{(t)}$ encodes the notion that not all layers require equal adjustment at every training step: Some layers may have already converged to useful representations, while others remain under-adapted. It is crucial to distinguish between this analytical framework and our physical implementation. In our actual architecture, this modulation is achieved implicitly through transformer self-attention over the token sequence $Z = [z_1, \dots, z_5]$ from the extraction points defined in Equation (3). Specifically, the softmax operation within the attention layers dynamically weights the feature tokens. During backpropagation, these learned attention weights scale the magnitude of the gradient routed back to each corresponding convolutional stage. Therefore, the self-attention mechanism effectively acts as a learned implicit proxy for the theoretical $\alpha_l^{(t)}$ gating parameters. To represent this ideal explicit scaling with analytical clarity, we retain the following gating formulation:
where $\mathcal{T}$ denotes the full transformer encoder. In this conceptual formulation, $\alpha_l^{(t)}$ represents the effective influence that the attended representation of token $z_l$ exerts on the final output of $\mathcal{T}$, as governed by the self-attention operation in Equation (5). This influence shapes the magnitude of the gradient signal propagated back to stage $l$. The effective gradient update for layer $l$ is then expressed as

$$\tilde{g}_l^{(t)} = \alpha_l^{(t)} \, \nabla_{\theta_l} \mathcal{L} \tag{2}$$
While Equation (2) shows an explicit scalar multiplication, in practice, this scaling is carried out by the chain rule passing through the attention mechanism. The key assumptions underlying this formulation are as follows: (i) a global view across $\{h_l\}_{l=1}^{L}$ contains sufficient signal for estimating the relative learning need at each depth; (ii) $\mathcal{M}_\phi$ is end-to-end differentiable so that $\phi$ is learned jointly with $\theta$ from the task loss; (iii) the modulation is adaptive per input batch, so $\alpha_l^{(t)}$ varies with both the sample and the training stage. Under this view, residual connections correspond to the degenerate case $\alpha_l^{(t)} = 1$ for all $l$ and $t$, and GradAttn generalizes this to a learned, input-conditioned gradient routing policy.
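The explicit scalar-multiplication view of Equation (2) can be sketched in a few lines. This is an illustration of the conceptual framework only: in GradAttn the scaling arises implicitly through the chain rule, and the function name, gradient values, and modulation values below are all hypothetical.

```python
def modulate_gradients(layer_grads, modulation):
    """Scale each layer's gradient vector by its per-layer modulation signal."""
    if len(layer_grads) != len(modulation):
        raise ValueError("need exactly one modulation value per layer")
    return [[a * g for g in grads] for grads, a in zip(layer_grads, modulation)]


# Hypothetical raw gradients for three layers (two parameters each).
layer_grads = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1]]

# Degenerate case: every signal equals 1, so gradients pass through unchanged.
uniform = modulate_gradients(layer_grads, [1.0, 1.0, 1.0])

# Learned routing (values assumed): the middle layer is updated far less.
routed = modulate_gradients(layer_grads, [0.5, 0.1, 0.4])
print(uniform)
print(routed)
```

Setting every signal to 1 reproduces the uniform treatment attributed to residual connections, while unequal signals concentrate the update on selected depths.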
3.2. Architecture Design
Our GradAttn framework (Figure 1) replaces ResNet’s fixed residual connections with attention-controlled gradient pathways. The backbone follows ResNet-18’s convolutional structure but removes all skip connections. Instead, we extract features at five strategic depths: after initial max pooling and after each of the four convolutional stages, yielding feature maps $F_1, \dots, F_5$.
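Each feature map $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, where $C_i$, $H_i$, and $W_i$ denote the number of channels, height, and width at extraction point $i$, respectively, undergoes global average pooling and linear projection into a common embedding dimension $d$:

$$z_i = W_i \, \mathrm{GAP}(F_i), \quad i = 1, \dots, 5 \tag{3}$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling that spatially aggregates each channel into a scalar, and $W_i \in \mathbb{R}^{d \times C_i}$ projects the pooled features into the shared embedding space, yielding $z_i \in \mathbb{R}^{d}$.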
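The tokenization step described above (global average pooling followed by a per-stage linear projection) can be sketched in plain Python. The toy feature-map shapes, the embedding dimension `d = 4`, and the random weights are assumptions for illustration; they are not the dimensions used in the paper, and the projection is shown without a bias term.

```python
import random


def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a length-C vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def linear_project(vector, weight):
    """Multiply a d x C weight matrix by a length-C vector, giving a length-d token."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


rng = random.Random(0)
d = 4  # hypothetical shared embedding dimension
# Toy (C, H, W) shapes for the five extraction points; channels grow, resolution shrinks.
shapes = [(2, 8, 8), (2, 8, 8), (4, 4, 4), (8, 2, 2), (16, 1, 1)]

tokens = []
for c, h, w in shapes:
    fmap = [[[rng.random() for _ in range(w)] for _ in range(h)] for _ in range(c)]
    weight = [[rng.uniform(-1.0, 1.0) for _ in range(c)] for _ in range(d)]
    tokens.append(linear_project(global_average_pool(fmap), weight))

print(len(tokens), [len(t) for t in tokens])
```

Each stage needs its own projection because the channel count differs per stage; after projection, all five tokens share one dimension and can be processed as a single sequence.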
The five extraction points correspond to ResNet-18’s architectural structure: after initial max pooling and after each of the four convolutional stages (every two basic blocks). This configuration balances computational efficiency with comprehensive gradient flow coverage. Increasing to per-block extraction (10 points) would significantly increase memory overhead without substantial benefit, as our Gradient Health Score analysis (Section 4.6) shows minimal vanishing gradients within two consecutive blocks. Reducing below five points would skip entire stages, essentially creating a vanilla CNN. Thus, five points represent the optimal balance for multi-scale feature integration.
The resulting token sequence enters a transformer encoder built from multi-head self-attention layers with 8 heads, standard feedforward networks, and layer normalization; the number of layers, the embedding dimension, and the feedforward width are listed with the other transformer hyperparameters in Table 1.
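Multi-head self-attention computes queries, keys, and values as follows:

$$Q = ZW_Q, \quad K = ZW_K, \quad V = ZW_V \tag{4}$$

where $Z$ is the token sequence and $W_Q$, $W_K$, and $W_V$ are learned projection matrices. The attention operation dynamically weights feature contributions:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{5}$$

where $d_k$ is the key dimension per head.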
This mechanism learns adaptive gradient pathways that dynamically shape learned representations; unlike ResNet’s uniform short circuits, attention can emphasize deep semantic features for complex inputs while prioritizing shallow texture representations for simpler patterns.
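To make the weighting concrete, the following sketch implements single-head scaled dot-product self-attention over five tokens in plain Python. It is a simplified stand-in for the multi-head encoder: the token values are made up, and the query, key, and value projections are taken to be the identity as a simplifying assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Five hypothetical stage tokens (shallow to deep), embedding dimension 2.
tokens = [[0.1, 0.9], [0.2, 0.7], [0.5, 0.5], [0.8, 0.3], [1.0, 0.1]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Because each output is a weighted sum of value vectors, the gradient reaching a given stage's value vector through one output is multiplied by the corresponding attention weight, which is the sense in which attention scales the gradient routed back to each stage.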
3.3. Positional Encoding Variants
We evaluate three variants to understand spatial encoding requirements:
No PE: Relies solely on CNN-derived spatial structure.
Learnable PE: Adds trainable position embeddings:

$$Z' = Z + P$$

where $P \in \mathbb{R}^{5 \times d}$ is a trainable parameter adapted during training. This allows domain-specific positional cues while maintaining the same transformer pipeline as the No PE variant.
RoPE: Relative positional information is incorporated directly into the attention mechanism by rotating queries and keys:

$$\tilde{q}_i = R_i \, q_i, \quad \tilde{k}_i = R_i \, k_i$$

where $R_i$ is a rotation operator parameterized by position $i$. This design integrates spatial relationships natively into the dot-product attention computation.
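The defining property of rotary encoding, namely that the query-key score depends only on the relative offset between positions, can be checked numerically. The sketch below assumes the common pairwise-rotation scheme with frequency base 10000; whether GradAttn uses exactly this parameterization is not stated in the text, and the vectors are arbitrary.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of an even-length vector by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]  # hypothetical query
k = [1.1, 0.4, -0.6, 0.9]  # hypothetical key

# Both pairs of positions are two steps apart, so the scores should agree.
score_a = dot(rotate_pairs(q, 1), rotate_pairs(k, 3))
score_b = dot(rotate_pairs(q, 2), rotate_pairs(k, 4))
print(abs(score_a - score_b) < 1e-9)
```

Since each rotation is orthogonal, vector norms are preserved and only the angle between rotated queries and keys changes with their offset.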
4. Experiments and Analysis
We present a comprehensive evaluation of GradAttn across eight diverse datasets, examining classification performance, gradient flow characteristics, model calibration, and computational efficiency. Our analysis reveals the dataset-specific advantages of attention-controlled gradient pathways and provides insights into when and why learnable gradient routing outperforms fixed residual connections.
4.1. Training Protocol
All models are trained using the Adam optimizer with a fixed random seed for reproducibility. Full training and transformer hyperparameters are summarized in Table 1. Learning rate scheduling follows ReduceLROnPlateau monitoring validation accuracy, with training capped at 100 epochs and early stopping restoring the best observed weights. Dataset-specific preprocessing and augmentation strategies are detailed in Table 2; augmentation is applied exclusively to training splits, with validation and test splits receiving only resizing and normalization. Both GradAttn and ResNet-18 use identical preprocessing pipelines throughout all experiments, ensuring that performance differences are attributable solely to architectural design rather than data handling.
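The control flow of this protocol (decay the learning rate on a validation plateau, stop early, keep the best epoch) can be sketched without any deep learning framework by replaying a recorded validation-accuracy trace. This is a simplified re-implementation for illustration, not PyTorch's scheduler; the decay factor, both patience values, and the trace are hypothetical, and only the 100-epoch cap comes from the text.

```python
def replay_plateau_schedule(val_accuracies, lr=1e-3, factor=0.1,
                            lr_patience=2, stop_patience=5, max_epochs=100):
    """Replay a validation trace; return (best_epoch, best_acc, final_lr, epochs_run)."""
    best_acc, best_epoch = float("-inf"), -1
    since_improvement = 0  # epochs since the best accuracy last improved
    since_lr_change = 0    # stagnant epochs since the last improvement or LR decay
    epochs_run = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        epochs_run = epoch + 1
        if acc > best_acc:
            # A real training loop would snapshot the model weights here.
            best_acc, best_epoch = acc, epoch
            since_improvement = 0
            since_lr_change = 0
        else:
            since_improvement += 1
            since_lr_change += 1
            if since_lr_change > lr_patience:
                lr *= factor
                since_lr_change = 0
            if since_improvement >= stop_patience:
                break  # early stopping; the snapshot from best_epoch is restored
    return best_epoch, best_acc, lr, epochs_run


trace = [0.50, 0.60, 0.65, 0.64, 0.65, 0.63, 0.62, 0.61, 0.60]  # made-up trace
result = replay_plateau_schedule(trace)
print(result)
```

Monitoring validation accuracy means an epoch counts as an improvement only when the accuracy strictly increases, so ties also advance both patience counters.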
4.2. Evaluation Datasets
We carry out testing on eight diverse benchmarks: Tiny ImageNet (primary) [27], CIFAR-10 [28], SVHN [29], FashionMNIST [30], and four medical datasets (TissueMNIST [31], BloodMNIST [31], PCam [32], and PAD-UFES-20 [33]). This spans natural images, structured recognition, and specialized medical imaging to validate generalizability across domains (Figure 2). Tiny ImageNet, CIFAR-10, SVHN, FashionMNIST, and PAD-UFES-20 use a 70/15/15 train/val/test split. TissueMNIST and BloodMNIST use a 70/10/20 split to provide a larger test set given their medical imaging context. PCam uses an 80/10/10 split to maximize training data for the large-scale binary classification task.
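As an illustration of how such splits can be produced reproducibly, the sketch below shuffles sample indices with a fixed seed and cuts them by the stated fractions. It assumes a simple random (non-stratified) split, which the text does not specify; the helper name and the seed are hypothetical, while the sample count of 2298 is the PAD-UFES-20 size reported in this paper.

```python
import random


def split_indices(n_samples, fractions, seed=0):
    """Shuffle indices once and cut them into train/val/test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    # The test split takes the remainder so that no sample is dropped by rounding.
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


train, val, test = split_indices(2298, (0.70, 0.15, 0.15))
print(len(train), len(val), len(test))
```

Assigning the rounding remainder to the test split keeps the three parts disjoint and exhaustive.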
4.3. Classification Performance
Table 3 summarizes the Top-1 accuracy results across all datasets. GradAttn variants outperform ResNet-18 on five of eight datasets, with performance gains strongly correlated with task complexity and the need for multi-scale feature integration. The RoPE variant achieves the highest accuracy on Tiny ImageNet (37.67%, +4.65% over ResNet-18) and SVHN (98.15%, +0.17%), while Learnable PE excels on FashionMNIST (75.18%, +11.07%) and matches ResNet-18 on TissueMNIST (69.88%).
GradAttn underperforms ResNet-18 on three datasets: CIFAR-10 (by 2.94–5.61%), PCam (by 0.89%), and PAD-UFES-20 (by 4.91–7.80%). On CIFAR-10, ResNet-18’s strong convolutional inductive biases for low-resolution 32 × 32 images provide an advantage that multi-scale attention routing cannot overcome. On PCam, the binary metastasis detection task has simple decision boundaries where local convolutional features suffice. On PAD-UFES-20, the limited dataset size of only 2298 samples causes the transformer components to overfit.
4.3.1. Extended Evaluation
Beyond Top-1 accuracy, we evaluate Top-3 and Top-5 accuracy, macro-averaged F1 scores, and the Expected Calibration Error (ECE) to assess ranking quality and prediction reliability. Table 4 presents these metrics for representative datasets where GradAttn demonstrates substantial improvements.
The ECE reductions of 35–45% across successful domains indicate that attention-controlled gradient flow improves not only accuracy but also calibration. The dynamic weighting mechanism produces more reliable confidence estimates by selectively emphasizing features relevant to each input, whereas uniform residual connections propagate all features equally regardless of their contribution to the prediction.
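The ECE is defined as follows:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where $M$ is the number of bins, $B_m$ represents the set of predictions in the $m$-th bin, $n$ is the total number of samples, $\mathrm{acc}(B_m)$ is the average accuracy of bin $B_m$, and $\mathrm{conf}(B_m)$ is the average predicted confidence of bin $B_m$.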
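A compact reference computation of the ECE may help when reproducing the calibration numbers. The sketch below uses equal-width confidence bins; the bin count of 10 is an assumption, since the number of bins is not stated here, and the example predictions are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins [k/M, (k+1)/M); a confidence of 1.0 falls in the last bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Two confident, correct predictions and two uncertain ones (one of them wrong).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

Each occupied bin contributes its share of samples times the gap between its accuracy and its mean confidence, so a perfectly calibrated model scores zero.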
4.3.2. Comprehensive Performance Analysis
Table 5 presents precision, recall, and F1 scores (both macro- and weighted averages) across all datasets and variants. These metrics reveal nuanced performance characteristics beyond simple accuracy measurements.
For datasets with many classes of varying difficulty, such as Tiny ImageNet (200 classes), macro-averaged metrics reveal that GradAttn (RoPE) achieves more balanced performance across classes, with a macro-F1 score improvement of +0.044 over ResNet-18. The weighted metrics confirm that improvements are not merely driven by better performance on dominant classes but reflect genuine enhancement in handling diverse visual categories.
On FashionMNIST, Learnable PE achieves remarkable macro-F1 score gains (+0.136), indicating substantially improved recognition of challenging classes like pullovers and shirts that ResNet-18 frequently confuses. This suggests that dataset-specific positional adaptations enable the model to capture subtle texture and shape variations critical for fine-grained fashion recognition.
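The distinction between macro- and weighted averaging can be made concrete with a small computation. The helper and the ten-sample toy labels below are illustrative assumptions, not data from the paper.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class, support = [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        support.append(tp + fn)  # number of true samples of this class
    macro = sum(per_class) / len(labels)
    weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)
    return macro, weighted


# Toy example: class "a" is frequent and easy, class "b" is rare and harder.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
macro, weighted = f1_scores(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))
```

Macro averaging gives every class equal weight and therefore exposes weak classes, whereas weighted averaging follows class frequency; reporting both shows whether a gain is spread across classes.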
4.4. Domain-Specific Patterns
4.4.1. Complex Natural Images
On Tiny ImageNet and SVHN, RoPE consistently outperforms other variants. Tiny ImageNet contains highly diverse visual categories (200 classes spanning objects, animals, and scenes) with significant intra-class variations and cluttered backgrounds. RoPE’s relative positional encoding effectively captures spatial relationships in these complex scenes, enabling the attention mechanism to weight features based on their geometric configuration. The +4.65% accuracy gain on Tiny ImageNet demonstrates that explicitly modeling relative spatial structure benefits tasks requiring global context understanding.
SVHN (street view house numbers) presents a different challenge: Digits appear at varying scales and orientations with complex backgrounds. Despite the simpler task structure (10 classes), RoPE achieves a +0.17% improvement by better handling spatial transformations through rotary relative positional encoding. The near-ceiling performance (98.15%) indicates that attention-controlled gradients can approach optimal performance on well-structured recognition tasks.
CIFAR-10 represents a case where GradAttn variants consistently underperform ResNet-18 (92.10%), with No PE achieving 88.68%, Learnable PE achieving 86.49%, and RoPE achieving 89.16%. Unlike Tiny ImageNet and SVHN where multi-scale feature integration provides measurable benefits, CIFAR-10’s low-resolution 32 × 32 images contain limited spatial complexity that does not warrant the adaptive gradient routing GradAttn provides. At this resolution, discriminative information is largely captured by local convolutional features within the early layers, making ResNet-18’s fixed residual connections sufficient and the transformer’s cross-scale attention redundant. Furthermore, the relatively small intra-class variation and well-separated decision boundaries in CIFAR-10 mean that shallow texture features alone adequately represent the ten object categories, leaving little room for the deep semantic weighting that attention-controlled gradients excel at. This suggests that GradAttn’s benefits are contingent on sufficient input resolution and feature hierarchy complexity to justify the learnable gradient routing mechanism.
4.4.2. Fashion and Texture Recognition
FashionMNIST represents structured object recognition where shape and texture jointly determine categories. Learnable PE achieves the most substantial improvement (+11.07%) by adapting positional encodings to clothing-specific patterns. Unlike natural images where spatial relationships follow universal geometric principles, fashion items exhibit dataset-specific structural regularities (e.g., shirts always have sleeves in particular positions, trousers have consistent leg structures). Learnable positional encodings capture these domain-specific spatial priors more effectively than either fixed CNN hierarchies (No PE) or universal relative encodings (RoPE).
4.4.3. Medical Imaging
Medical imaging datasets exhibit divergent patterns that illuminate when attention-controlled gradients provide advantages:
BloodMNIST (Blood Cell Classification): The No PE variant achieves the highest accuracy (96.23%, +0.26% over ResNet-18), suggesting that CNN-derived spatial hierarchies sufficiently encode the structure of microscopy images. Blood cells have consistent internal structure (nucleus and cytoplasm) with discrimination primarily based on morphological features at fixed scales. The strong performance without explicit positional encoding indicates that convolutional inductive biases already capture medically relevant spatial patterns.
TissueMNIST (Kidney Tissue Classification): Learnable PE matches ResNet-18 in Top-1 accuracy but achieves superior Top-3 (+0.09%) and Top-5 (+0.09%) performance alongside substantially reduced ECE (0.039 vs. 0.067, −41.8%). This pattern suggests that attention-controlled gradients improve confidence calibration and ranking quality even when final classification accuracy remains comparable. For medical applications where physicians review top-k predictions, improved ranking reliability provides clinical value beyond raw accuracy.
PCam and PAD-UFES-20: ResNet-18 outperforms GradAttn variants on both datasets (+0.89% on PCam, +4.91% on PAD-UFES-20). PCam involves binary metastasis detection in histopathology patches with simple object boundaries; PAD-UFES-20 contains only 2298 samples across six skin lesion types. These results indicate two scenarios where attention-controlled gradients provide limited benefit: (1) tasks with simple decision boundaries where local convolutional features suffice and (2) small datasets where transformer components overfit due to insufficient training samples. The negative results validate that GradAttn’s benefits depend on task complexity and dataset scale rather than universally improving upon residual connections.
4.5. Training Dynamics and Convergence
Figure 3 and Figure 4 present training and validation accuracy curves comparing ResNet-18 with the best-performing GradAttn variant for each dataset. Contrary to expectations that attention mechanisms would accelerate convergence, we observe nuanced patterns where GradAttn variants often require more epochs to converge yet achieve superior final performance.
4.5.1. Convergence Speed and Training Requirements
Across datasets, GradAttn models typically require more epochs to converge than ResNet-18. This does not reflect optimization inefficiency but rather the additional time needed for the attention mechanism to learn meaningful routing patterns. On FashionMNIST, for instance, the Learnable PE variant requires 11 epochs compared to ResNet-18’s five, which is a more than twofold increase, yet this extended training yields a +11.07% improvement in accuracy. This illustrates that the extra training time is compensated by substantially better representations that fixed skip connections cannot obtain. Overall, the modest increase in convergence time is justified by the consistent improvements in final performance.
4.5.2. Optimization Landscape Complexity
The extended training requirements for GradAttn reveal fundamental differences in optimization dynamics between fixed and learnable gradient pathways. We identify two factors contributing to slower convergence:
Interdependent Component Optimization: GradAttn must simultaneously optimize CNN feature extraction, linear projections, and attention weights. Early in training, random attention patterns distribute gradients nearly uniformly across extraction points, providing minimal benefit over ResNet while adding optimization complexity. Only after the attention mechanism learns meaningful feature importance patterns (typically 15–25 epochs on Tiny ImageNet) does performance begin exceeding ResNet-18. This initial “discovery phase” explains the extended training time.
Non-Stationary Gradient Distributions: Unlike ResNet where gradient pathways remain constant, GradAttn’s attention-controlled routing creates non-stationary optimization. As attention weights evolve, the effective learning rate for different layers changes dynamically, with layers receiving high attention experiencing larger gradient updates while low-attention layers adapt slowly. This adaptive behavior requires more iterations to reach equilibrium but produces better calibrated feature representations.
4.5.3. Generalization Gap Analysis
Table 6 quantifies the train–validation accuracy gap at convergence, revealing overfitting tendencies across architectures.
GradAttn variants dramatically reduce overfitting on complex datasets where they achieve superior test accuracy. Most notably, on Tiny ImageNet, Learnable PE reduces the generalization gap from 65.77% (ResNet-18) to 14.30%, a 78.2% reduction in overfitting. This massive improvement indicates that ResNet-18 severely overfits the training data, while attention-controlled gradients learn substantially more generalizable representations. The RoPE and No PE variants also achieve large reductions (61.4% and 75.3%, respectively), confirming that dynamic feature weighting inherently regularizes learning.
On FashionMNIST, Learnable PE reduces overfitting by 22.5% (11.57% to 8.97%), while on TissueMNIST, No PE achieves a 75.7% reduction (5.39% to 1.31%). These results suggest that attention-controlled gradients act as an adaptive regularizer, where less relevant features contribute minimally to gradients during backpropagation, preventing the model from memorizing spurious correlations in the training set.
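The relative reductions quoted above follow from the reported gaps by simple arithmetic, which the short check below reproduces. The helper name is ours; the input values are the FashionMNIST and TissueMNIST train-validation gaps stated in this paragraph.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Generalization gaps in percentage points, as reported in the text.
fashion = relative_reduction(11.57, 8.97)  # ResNet-18 vs. Learnable PE
tissue = relative_reduction(5.39, 1.31)    # ResNet-18 vs. No PE
print(f"FashionMNIST: {fashion:.1f}%   TissueMNIST: {tissue:.1f}%")
```

Note that these are relative reductions of the gap itself, not differences in percentage points of accuracy.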
CIFAR-10 shows a balanced generalization pattern, with ResNet-18 exhibiting a 6.52% gap and GradAttn variants showing comparable or smaller gaps (3.16–6.62%). Despite Learnable PE achieving the tightest gap at 3.16%, ResNet-18 maintains superior test accuracy at 92.10%, indicating that GradAttn variants generalize more consistently but from a lower performance ceiling due to ResNet-18’s stronger convolutional inductive biases for low-resolution inputs.
On medical imaging datasets, the pattern is nuanced. For TissueMNIST and BloodMNIST, GradAttn variants achieve minimal generalization gaps (0.15–1.75%), substantially lower than ResNet-18 (5.39% and 0.57%, respectively), indicating excellent generalization. However, on PCam, ResNet-18 maintains a smaller gap (5.86%) compared to GradAttn variants (7.31–8.33%), aligning with our earlier finding that PCam’s binary classification with simple decision boundaries favors ResNet’s convolutional inductive biases. On PAD-UFES-20, all models exhibit relatively small gaps (2.95–4.21%), with minimal differences between architectures, suggesting that the primary challenge is insufficient training data (only 2298 samples) rather than overfitting.
These results establish attention-controlled gradients as a powerful implicit regularizer for complex, large-scale datasets where multi-scale feature integration benefits from adaptive selection. The regularization effect emerges naturally from the attention mechanism’s learned feature weighting rather than requiring explicit regularization techniques.
4.6. Gradient Flow Analysis
The Gradient Health Score is defined as follows:

$$\mathrm{GHS} = \frac{|\mathcal{H}|}{N}$$

where $\mathcal{H}$ represents layers with gradient norms in the range $[\tau_{\min}, \tau_{\max}]$, and $N$ is the total number of analyzed layers. $\mathrm{GHS} = 1$ indicates perfect gradient stability, with all layers maintaining healthy gradient magnitudes, while $1 - \mathrm{GHS}$ reveals the fraction of layers experiencing vanishing ($< \tau_{\min}$) or exploding ($> \tau_{\max}$) gradients. This composite metric quantifies overall network gradient flow quality, where controlled instabilities ($\mathrm{GHS} < 1$) can coincide with improved generalization, as observed in our attention variants.
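A minimal computation of the score from per-layer gradient norms might look as follows. The thresholds `low=1e-6` and `high=1e2` and the example norms are hypothetical placeholders chosen for illustration; they are not the bounds used in the paper.

```python
def gradient_health_score(grad_norms, low, high):
    """Return (score, n_vanishing, n_exploding) for a list of per-layer gradient norms.

    A layer is healthy when low <= norm <= high; the score is the healthy fraction.
    """
    if not grad_norms:
        raise ValueError("need at least one layer")
    vanishing = sum(1 for g in grad_norms if g < low)
    exploding = sum(1 for g in grad_norms if g > high)
    healthy = len(grad_norms) - vanishing - exploding
    return healthy / len(grad_norms), vanishing, exploding


# Made-up norms for eight layers; two of them have effectively vanished.
norms = [0.8, 0.3, 1e-9, 0.05, 2e-8, 0.4, 1.2, 0.9]
score, vanishing, exploding = gradient_health_score(norms, low=1e-6, high=1e2)
print(score, vanishing, exploding)
```

Returning the vanishing and exploding counts alongside the score makes it easy to see which kind of instability lowers the score for a given model.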
We monitored gradient health during testing using normalized stability metrics across all eight datasets. While ResNet-18 maintained perfect stability (GHS = 1.0) across experiments, attention variants consistently introduced controlled instabilities that often coincided with improved generalization. The No PE variant exhibited minimal gradient decay across most datasets, with generally stable training dynamics (e.g., GHS = 0.914 on FashionMNIST). The Learnable PE variant experienced localized vanishing gradients in several layers, yet it achieved the best performance on FashionMNIST (+11.07%) and matched ResNet-18 on TissueMNIST while substantially improving calibration (ECE reduced by 41.8%). Similarly, RoPE introduced controlled instability across a subset of layers while still achieving the highest accuracy on Tiny ImageNet and SVHN among all variants.
Table 7 provides a concrete illustration of these gradient dynamics for the Tiny ImageNet dataset, where RoPE attains GHS = 0.732 with vanishing gradients in eight layers, yet it outperforms the perfectly stable ResNet-18 by +4.65% in Top-1 accuracy.
This suggests that perfect gradient stability may not be optimal, i.e., controlled attention-induced redistribution can enhance feature learning despite minor gradient anomalies.
Controlled Instability and Generalization
The relationship between gradient stability and performance challenges conventional deep learning theory. On Tiny ImageNet, RoPE achieves the highest accuracy (37.67%) despite exhibiting GHS = 0.732, with eight layers experiencing vanishing gradients. Learnable PE on FashionMNIST attains a +11.07% improvement with GHS = 0.914. These results suggest that attention-induced gradient redistribution creates beneficial training dynamics where not all layers require uniform gradient flow.
We hypothesize that controlled instabilities enable the attention mechanism to effectively “prune” less relevant gradient pathways during training. Layers experiencing occasional vanishing gradients contribute minimally to feature learning, allowing the network to focus representational capacity on extraction points that are most critical for the task. This differs fundamentally from traditional vanishing gradient problems where poor initialization or activation functions cause universal gradient decay; attention-controlled vanishing is selective and input-dependent.
Figure 5 visualizes layer-wise gradient norm distributions, revealing that vanishing gradients in GradAttn models occur selectively in mid-depth layers, while shallow and deep layers maintain healthy gradients. This pattern indicates that attention learns to bypass intermediate feature hierarchies when they provide redundant information, creating direct pathways from shallow texture features and deep semantic representations to the loss function.
4.7. Parameter and Computational Efficiency
Despite transformer layers, the models maintained competitive efficiency. While ResNet-18 has ∼11.2 M parameters, attention variants added only ∼1.6 M parameters (≈14.3% increase).
Table 8 reports the average training time per epoch across all datasets and model variants. Contrary to expectations, GradAttn variants do not consistently incur higher per-epoch computational cost than ResNet-18, and in several cases, they achieve lower epoch times, notably on Tiny ImageNet, TissueMNIST, and PCam. This suggests that the attention-controlled gradient routing does not introduce significant computational overhead per epoch and that the extended total training time observed for GradAttn variants is attributable primarily to the greater number of epochs required for convergence rather than increased per-epoch cost.
4.8. Comparison with Attention-Augmented Baselines
To further contextualize GradAttn’s contributions beyond a plain ResNet-18 baseline, we compare against ResNet-18 augmented with established channel and spatial attention modules: Squeeze-and-Excitation (SE) Networks [13] and the Convolutional Block Attention Module (CBAM) [14]. These modules apply attention within individual residual blocks for feature recalibration, representing a natural intermediate between fixed residual connections and GradAttn’s inter-layer gradient routing.
Table 9 presents Top-1 accuracy on Tiny ImageNet, SVHN, and FashionMNIST, chosen as representative datasets spanning complex natural images, structured digit recognition, and texture-based fashion recognition, respectively.
The architectural distinction between these approaches is fundamental. SE and CBAM operate as intra-block attention mechanisms: They recalibrate feature channels or spatial responses within a single residual block, leaving the gradient pathways between blocks unchanged. GradAttn, by contrast, functions as a global gradient regulator across the full network hierarchy. As formalized in Equations (1) and (2), the meta-regulator $\mathcal{M}_\phi$ observes a global summary of the network state across all extraction points and emits per-layer modulation signals $\alpha_l^{(t)}$, governing the effective gradient update $\tilde{g}_l^{(t)}$ for each stage. This modulation is realized through the transformer self-attention over the token sequence $Z = [z_1, \dots, z_5]$ defined in Equation (3), where the attention operation in Equation (5) dynamically weights the relative contribution of each hierarchical level. The result is that gradient signals are routed adaptively across the entire network depth rather than recalibrated within any single block. SE and CBAM enhance what is represented within a block; GradAttn controls which blocks drive learning across the network, a fundamentally different and complementary form of attention.
The results demonstrate that GradAttn’s attention-controlled gradient pathways provide advantages beyond what intra-block channel or spatial attention alone can achieve, validating that inter-layer gradient routing captures complementary information that block-level attention mechanisms cannot access.
4.9. Discussion
Domain-Specific Advantages: GradAttn variants excel in domains requiring global context modeling (Tiny ImageNet and SVHN) or abstract structural understanding (FashionMNIST), but they offer limited benefits in texture-dominated domains.
Complementarity of CNNs and Attention: Attention-based control demonstrates complementary behavior rather than replacing residual connections. Convolutions handle local feature extraction, while attention adaptively redistributes gradient signals across scales, enabling dynamic control where deeper features selectively dominate or recede.
Generalization vs. Stability Trade-Off: Gradient diagnostics reveal that minor instability can coincide with improved accuracy, suggesting that perfectly uniform gradient flow (as in ResNet) may not be optimal for all tasks, and controlled imbalance through attention may promote richer feature learning.
Comparison with Block-Level Attention: SE and CBAM modules augment individual residual blocks with channel and spatial recalibration, respectively, but they operate within fixed residual connection topologies. GradAttn differs fundamentally by routing gradients across the full network hierarchy through inter-layer attention, enabling adaptive weighting of features from different semantic depths rather than recalibrating features within a single block.