GradAttn: Transformer-Based Modulation of Residual Approach for Classification and Representation Learning Problems

Ghoshal, Soudeep; Buckchash, Himanshu

doi:10.3390/app16115252


by Soudeep Ghoshal ¹ and Himanshu Buckchash ²,*

¹ School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT), Bhubaneswar 751024, India

² Department of Science and Technology, IMC University of Applied Sciences Krems, 3500 Krems an der Donau, Austria

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5252; https://doi.org/10.3390/app16115252 (registering DOI)

Submission received: 22 April 2026 / Revised: 21 May 2026 / Accepted: 22 May 2026 / Published: 24 May 2026

(This article belongs to the Special Issue Convolutional Neural Networks and Computer Vision, 2nd Edition)


Abstract

Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed shortcuts cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a variation of the residual approach in CNNs that replaces the fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features and deep semantic representations. For representational analysis, we evaluated three GradAttn variants across eight diverse datasets spanning natural images, medical imaging, and fashion recognition. The results demonstrate that GradAttn outperforms ResNet-18 on five of eight datasets, achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining a comparable network size. Gradient flow analysis reveals that controlled instabilities, introduced by attention, often coincide with improved generalization, challenging the assumption that perfect stability is optimal. Furthermore, the effectiveness of positional encoding turned out to be dataset-dependent, with CNN hierarchies frequently encoding sufficient spatial structure. These findings position attention mechanisms as enablers of learnable gradient control, offering a new route to adaptive representation learning in deep neural architectures.

Keywords: convolutional neural networks (CNN); transformers; gradient flow; residual connections; attention mechanisms; image classification

1. Introduction

One of the main issues that has limited the depth of convolutional neural networks (CNNs) is the vanishing gradient problem, in which error signals diminish as they are propagated backward through many layers. The introduction of residual connections in ResNet was a major breakthrough in deep learning: by preserving gradient flow, it enabled the successful training of very deep networks [1]. The prevalence of ResNet in academic research and its integration into various systems and applications suggest that it serves as a benchmark architecture due to its reliable performance and widespread acceptance [2]. The core residual formulation, $y = F(x, \{W_i\}) + x$, establishes gradient highways that transmit gradients effectively across layers and mitigate signal attenuation during backpropagation.

Still, residual connections are inherently static and uniform: Each skip connection offers the same direct gradient path, regardless of input complexity or the relevance of a feature to the task at hand. Such uniformity neglects a fundamental characteristic of visual recognition: Tasks ranging from natural images to medical imaging require adaptive weighting of feature representations at different network depths. Residual connections cannot selectively emphasize some routes while suppressing others; they combine shallow texture features and deep semantic representations with equal importance. Despite recent advances in hybrid architectures and attention mechanisms, only a few works treat gradient flow itself as a learnable, task-adaptive component. Most approaches either solve gradient degradation through fixed structural shortcuts or apply attention solely for feature enhancement, leaving a fundamental gap: the inability to dynamically route gradients based on input complexity and task requirements across network hierarchies.

This work introduces GradAttn (https://github.com/SoudeepGhoshal/GradAttn (accessed on 21 May 2026)), which replaces static residual connections with learnable, attention-controlled gradient pathways. GradAttn uses transformer attention not only to model interactions among features but also to regulate gradient flow through CNN hierarchies. The central step is to extract multi-scale CNN features from several depths and map them to a common embedding space, where transformer attention weights their contributions dynamically. The resulting gradient pathways can favor global semantic patterns for complex scenes or local texture cues for specialized domains, and this preference is learned automatically during training.

Our Contribution: (a) We propose attention-controlled (inter-layer) gradient flow as a learnable alternative to fixed residual connections, enabling adaptive gradient pathways in deep networks. (b) We demonstrate that this approach outperforms ResNet-18 on five of eight diverse datasets, achieving up to +11.07% accuracy improvement while maintaining comparable parameter counts. (c) We provide empirical evidence that controlled gradient instabilities introduced by attention often coincide with improved generalization, challenging the assumption that perfect stability is always optimal.

We compare GradAttn directly against the widely used ResNet baseline and find that learnable gradient routing yields more effective learning on the majority of the evaluated datasets. Because CNNs are ubiquitous, this work has broad and immediate applicability.

We investigate three architectural variants, viz., those without positional encoding (No PE), those with learnable positional encoding, and those with rotary positional encoding (RoPE), revealing that the effectiveness of explicit spatial encoding is dataset-dependent, with CNN hierarchies often providing sufficient structural priors. These findings position attention mechanisms as fundamental enablers of learnable gradient control, opening possibilities for adaptive deep learning across diverse visual domains.

The remainder of this paper is organized as follows. Section 2 reviews related work on gradient flow in deep architectures, attention mechanisms, and multi-scale feature integration. Section 3 details the proposed GradAttn architecture, its positional encoding variants, training protocols, and evaluation datasets. Section 4 presents comprehensive experimental results, including classification performance, domain-specific analysis, training dynamics, gradient flow analysis, and parameter efficiency. Finally, Section 5 concludes this paper and outlines directions for future work.

2. Related Works

2.1. Gradient Flow in Deep Architectures

Gradient flow innovations implemented as residual architectures have reshaped deep learning, especially for convolutional neural networks (CNNs). ResNets (residual networks) rely on skip connections that allow gradients to bypass certain layers, which makes deeper networks trainable [1]. DenseNets connect each layer to all subsequent layers within a dense block, encouraging feature reuse and maintaining gradient propagation throughout the network [3]. EfficientNets advance these concepts through compound scaling, which systematically balances network depth, width, and resolution, creating more sophisticated gradient flow patterns while maintaining computational efficiency [4]. Nevertheless, these architectures share a dependence on fixed, uniform connections, which may limit the flexibility of gradient flow and its adaptation to different input characteristics [1]. More recent work addresses parts of this problem; for example, residual connections have been shown to support iterative inference [5], but such analyses still operate within the constraints of existing architectures. This points to a need for more versatile mechanisms that allocate gradient flow according to the requirements of the learning task.

Recent architectural innovations have further explored gradient flow optimization through structural modifications. ResNeXt extended ResNet’s design by introducing cardinality as an additional dimension beyond depth and width, demonstrating that aggregating transformations can improve gradient propagation while maintaining computational efficiency [6]. Inception architectures explored multi-branch convolutions that process information at different scales simultaneously, though these branches converge through concatenation rather than adaptive weighting [7]. More recently, neural architecture search (NAS) has automated the discovery of optimal skip connection patterns, though these approaches still operate within the paradigm of architectural decisions that are fixed once the model is deployed [8].

The degradation problem in deep networks has also been addressed through normalization techniques that stabilize gradient flow. Batch normalization reduces internal covariate shift and enables higher learning rates, indirectly improving gradient propagation [9]. Layer normalization and group normalization have extended these concepts to scenarios where batch statistics are unreliable, further demonstrating that gradient health depends on multiple architectural factors beyond skip connections alone [10,11]. However, normalization techniques complement but do not replace the need for effective gradient routing mechanisms in very deep architectures.

More recent work has examined the instability introduced when learned residual mappings are left unconstrained. Xie et al. [12] proposed Manifold-Constrained Hyper-Connections (mHC), which extends the Hyper-Connections framework by projecting residual mapping matrices onto the Birkhoff polytope of doubly stochastic matrices via the Sinkhorn–Knopp algorithm. This constraint restores the identity mapping property across arbitrary network depths, preventing unbounded signal amplification. Empirical analyses of 27 B parameter language models showed that unconstrained composite residual mappings can reach gain magnitudes of approximately 3000, causing training instability at scale. GradAttn takes a complementary approach: rather than constraining the mapping space geometrically, the softmax normalization inherent to self-attention naturally bounds feature contributions, implicitly regularizing gradient routing without requiring explicit manifold projection.

2.2. Attention Mechanisms and Hybrid Architectures

Attention mechanisms rose to prominence in NLP and have since been integrated into CNNs to improve feature representation. Squeeze-and-Excitation Networks (SENets) and the Convolutional Block Attention Module (CBAM) brought substantial advances in image recognition by dynamically recalibrating channel features [13,14]. In addition, the rise of Vision Transformers (ViTs) has introduced a new paradigm of hybrid models that combine CNNs and transformer architectures to exploit the strengths of both [15]. Nevertheless, most of these frameworks use attention only for feature enhancement. They do not address gradient flow, so a gap remains for methods that control (inter-layer) gradients during training.

Building on these foundations, the integration of attention mechanisms with convolutional architectures has evolved beyond simple feature recalibration. Non-local neural networks introduced self-attention blocks within CNNs to capture long-range dependencies, demonstrating that global context modeling enhances feature representations in vision tasks [16]. BAM (Bottleneck Attention Module) explored a dual-pathway design of attention across both spatial and channel dimensions, showing complementary benefits when applied at intermediate network stages [17]. More recently, CoAtNet architectures have systematically studied the interplay between convolutional inductive biases and transformer attention, revealing that hybrid designs can outperform pure CNN or pure transformer architectures when properly configured [18].

In a related vein, InternImage [19] is a large-scale CNN foundation model that replaces fixed convolutional kernels with deformable convolutions as the core operator, enabling adaptive spatial aggregation conditioned on input content. Evaluated on ImageNet, COCO, and ADE20K, InternImage demonstrated that relaxing the rigid spatial inductive bias of standard convolutions yields strong gains, with InternImage-B achieving 84.9% Top-1 accuracy on ImageNet-1K. However, InternImage’s adaptivity operates within individual layers by dynamically adjusting sampling locations in the spatial domain, leaving the inter-layer flow of information governed by fixed residual connections.

Gradient flow in transformer architectures presents challenges distinct from those in CNNs. Pre-normalization configurations have been shown to stabilize training in very deep models by preventing gradient explosion in attention layers [20], while post-normalization variants demonstrate trade-offs between training stability and representational capacity [21].

Despite these advances, existing works treat gradient flow as either a stability problem to be solved through normalization and careful initialization or as a fixed architectural property through skip connections. None of these approaches consider gradient flow as a learnable, task-adaptive component of the network.

2.3. Multi-Scale Feature Integration

Multi-scale feature fusion strategies in detection and segmentation architectures relate to hierarchical feature integration. Feature pyramid networks (FPNs) combine features across multiple scales through lateral connections, establishing the importance of hierarchical feature integration [22]. PANet further enhanced this design with bottom-up path augmentation, showing that bidirectional feature flow improves representation learning [23]. U-Net architectures demonstrate the effectiveness of skip connections that bridge encoder and decoder pathways in dense prediction tasks [24]. However, these architectures employ fixed fusion patterns with predetermined connection topologies that do not adapt to input characteristics or task requirements.

Ensemble and multi-branch architectures provide another perspective on feature combination. Wide ResNets increased network capacity through wider layers rather than deeper stacks, showing that width provides complementary benefits to depth [25]. Multi-Scale Dense Networks (MSDNets) combined features from different depths to enable early-exit inference, though feature combination weights remain static across all inputs [26]. These methods demonstrate the value of combining information from multiple network locations but rely on fixed combination strategies that cannot selectively emphasize or de-emphasize specific feature hierarchies based on input complexity.

The attention-controlled gradient flow that we propose differs from these fixed operations: it offers a flexible mechanism for optimizing gradient flow, extending what residual connections and feature-level attention can achieve on their own. In the next section, we describe the design of the proposed method.

3. Methods

The key idea of this work is to regulate the learning and flow of gradients via attention at the inter-layer level. Since each layer learns features at different levels of the semantic hierarchy, the idea is to efficiently regulate learning within these hierarchies. In this paper, we realize this key idea through self-attention in the image recognition problem using deep convolutional neural networks. The methodology is split into two parts: first, a high-level mathematical view of the proposed solution and, second, a deeper architectural explanation.

3.1. Conceptual Framework for Gradient Modulation

We first establish a conceptual framework to reason about our multi-scale token attention, which we also refer to as gradient modulation; the actual realization in GradAttn is through transformer self-attention, as described in Section 3.2. Let a deep network be a composition of $L$ layers $\{\phi_l\}_{l=1}^{L}$ with parameters $\{\theta_l\}_{l=1}^{L}$, producing intermediate feature maps $f_l = \phi_l(f_{l-1}; \theta_l)$ for input $f_0 = x$. In a standard residual network, the backward gradient at layer $l$ is the unmodulated chain $g_l = \partial \mathcal{L} / \partial \theta_l$, where $\mathcal{L}$ denotes the task loss, and skip connections inject a uniform identity term, $\partial f_{l+1} / \partial f_l = I + \partial F_l / \partial f_l$, treating every layer as equally update-worthy regardless of the input or training state. We argue that this is suboptimal: at any training step $t$, the useful update magnitude for layer $l$ depends on how much that layer has already learned relative to others, a quantity not visible from local gradients alone.
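To make the role of the identity term concrete, the following toy sketch compares the backward gain through a stack of plain layers with the gain through residual layers in the scalar case. It is an illustration only, not the authors' code; the depth of 10 and the per-layer local derivative of 0.1 are assumed values.

```python
def backward_gain(local_grads, residual):
    """Product of per-layer Jacobians for a scalar toy network.

    A plain layer contributes dF/df; a residual layer contributes 1 + dF/df,
    mirroring d f_{l+1} / d f_l = I + dF_l / d f_l.
    """
    gain = 1.0
    for g in local_grads:
        gain *= (1.0 + g) if residual else g
    return gain


# Assumed toy values: 10 layers, each with local derivative 0.1.
local_grads = [0.1] * 10
plain_gain = backward_gain(local_grads, residual=False)
residual_gain = backward_gain(local_grads, residual=True)

print(f"plain:    {plain_gain:.3e}")
print(f"residual: {residual_gain:.3f}")
```

In this toy setting the plain chain shrinks the signal by ten orders of magnitude, while the residual chain keeps it of order one. The same identity term is applied at every layer, which is the uniformity that the modulation framework aims to relax.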

We therefore introduce a meta-regulator $M_\psi$ with parameters $\psi$ that observes a global summary of the network state and emits a per-layer modulation signal $\alpha_l \in [0, 1]$, interpreted as the relative degree to which layer $l$ should be updated at step $t$. Conceptually, $\alpha_l$ encodes the notion that not all layers require equal adjustment at every training step: Some layers may have already converged to useful representations, while others remain under-adapted. It is crucial to distinguish between this analytical framework and our physical implementation. In our actual architecture, this modulation is achieved implicitly through transformer self-attention over the token sequence $Z = [z_1, \dots, z_L]$ from the extraction points defined in Equation (3). Specifically, the softmax operation within the attention layers dynamically weights the feature tokens. During backpropagation, these learned attention weights scale the magnitude of the gradient flow routed back to each corresponding convolutional stage. Therefore, the self-attention mechanism effectively acts as a learned implicit proxy for the theoretical gating parameters $\alpha_l$. For analytical clarity, we retain the following gating formulation to represent this idealized explicit scaling:

$$[\alpha_1, \dots, \alpha_L] = M_\psi(Z), \quad \alpha_l \in [0, 1], \tag{1}$$

where $M_\psi$ denotes the full transformer encoder. In this conceptual formulation, $\alpha_l$ represents the effective influence that the attended representation of token $z_l$ exerts on the final output of $M_\psi$, as governed by the self-attention operation in Equation (5). This influence shapes the magnitude of the gradient signal propagated back to stage $l$. The effective gradient update for layer $l$ is then expressed as

$$\tilde{g}_l = \alpha_l \cdot \frac{\partial \mathcal{L}}{\partial \theta_l}, \qquad \theta_l^{(t+1)} = \theta_l^{(t)} - \eta \, \tilde{g}_l. \tag{2}$$

While Equation (2) shows an explicit scalar multiplication, in practice, this scaling is carried out by the chain rule passing through the attention mechanism. The key assumptions underlying this formulation are as follows: (i) a global view across $\{z_l\}$ contains sufficient signal for estimating the relative learning need at each depth; (ii) $M_\psi$ is end-to-end differentiable so that $\psi$ is learned jointly with $\{\theta_l\}$ from the task loss; (iii) the modulation is adaptive per input batch, so $\alpha_l$ varies with both the sample and the training stage. Under this view, residual connections correspond to the degenerate case $\alpha_l = 1$ for all $l$ and $t$, and GradAttn generalizes this to a learned, input-conditioned gradient routing policy.
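The idealized update of Equation (2) can be sketched in a few lines. This is a minimal illustration of the explicit gating form only; in GradAttn the scaling arises implicitly through attention and is not implemented this way. The function name `modulated_sgd_step`, the scalar "layers", and all numeric values are hypothetical.

```python
def modulated_sgd_step(params, grads, alphas, lr):
    """Apply theta_l <- theta_l - lr * alpha_l * g_l for every layer l."""
    if not (len(params) == len(grads) == len(alphas)):
        raise ValueError("params, grads and alphas need one entry per layer")
    if any(a < 0.0 or a > 1.0 for a in alphas):
        raise ValueError("each alpha_l must lie in [0, 1]")
    return [p - lr * a * g for p, g, a in zip(params, grads, alphas)]


# Hypothetical scalar "layers" with identical raw gradients.
params = [1.0, 1.0, 1.0]
grads = [0.2, 0.2, 0.2]

# alpha_l = 1 for all l recovers the unmodulated (residual-style) update.
uniform = modulated_sgd_step(params, grads, alphas=[1.0, 1.0, 1.0], lr=0.1)

# Non-uniform alphas damp the update of layers treated as already converged.
routed = modulated_sgd_step(params, grads, alphas=[0.0, 0.5, 1.0], lr=0.1)

print(uniform)
print(routed)
```

With uniform gates every layer moves by the same amount, whereas the routed case leaves the first layer untouched and halves the step of the second, which is the behavior the degenerate case $\alpha_l = 1$ cannot express.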

3.2. Architecture Design

Our GradAttn framework (Figure 1) replaces ResNet’s fixed residual connections with attention-controlled gradient pathways. The backbone follows ResNet-18’s convolutional structure but removes all skip connections. Instead, we extract features at five strategic depths: after initial max pooling and after each of the four convolutional stages, yielding feature maps $\{f_1, f_2, f_3, f_4, f_5\}$.

Each feature map $f_i \in \mathbb{R}^{c_i \times h_i \times w_i}$, where $c_i$, $h_i$, and $w_i$ denote the number of channels, height, and width at extraction point $i$, respectively, undergoes global average pooling and linear projection into a common embedding dimension $d$:

$$z_i = W_p^i \cdot \mathrm{Pool}(f_i) \tag{3}$$

where $\mathrm{Pool}(\cdot) : \mathbb{R}^{c_i \times h_i \times w_i} \to \mathbb{R}^{c_i}$ denotes global average pooling that spatially aggregates each channel into a scalar, and $W_p^i \in \mathbb{R}^{d \times c_i}$ projects the pooled features into the shared embedding space, yielding $z_i \in \mathbb{R}^d$.
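The tokenization step in Equation (3) can be illustrated with plain nested lists. This is a sketch under assumed toy sizes (two channels, a 2 × 2 spatial grid, and an embedding dimension of 3), and the projection matrix is a hypothetical example rather than a learned weight.

```python
def global_average_pool(feature_map):
    """Reduce a c x h x w nested list to a length-c vector of channel means."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def project(weight, vector):
    """Multiply a d x c matrix (list of rows) by a length-c vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


# Toy feature map with c = 2 channels and h = w = 2 (assumed sizes).
f_i = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[2.0, 2.0], [2.0, 2.0]],  # channel 1, mean 2.0
]
# Hypothetical projection W_p^i with d = 3 rows and c = 2 columns.
W_p = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

z_i = project(W_p, global_average_pool(f_i))
print(z_i)
```

Because pooling removes the spatial axes, each extraction point contributes exactly one token of length $d$ regardless of its resolution, which is what allows stages with different channel counts to share one embedding space.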

The five extraction points correspond to ResNet-18’s architectural structure: after initial max pooling and after each of the four convolutional stages (every two basic blocks). This configuration balances computational efficiency with comprehensive gradient flow coverage. Increasing to per-block extraction (10 points) would significantly increase memory overhead without substantial benefit, as our Gradient Health Score analysis (Section 4.6) shows minimal vanishing gradients within two consecutive blocks. Reducing below five points would skip entire stages, leaving those stages without a direct gradient pathway, as in a plain CNN. We therefore adopt five points as a practical balance for multi-scale feature integration.

The resulting token sequence $Z = [z_1, z_2, z_3, z_4, z_5]$ enters a transformer encoder with $L$ attention layers (we use $L = 3$ attention layers with 8 heads, an embedding dimension of $d = 256$, and standard feedforward networks of dimension $f_{dim} = 512$ with layer normalization).

Multi-head self-attention computes queries, keys, and values as follows:

$$Q = Z W_Q, \quad K = Z W_K, \quad V = Z W_V \tag{4}$$

The attention operation dynamically weights feature contributions:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \tag{5}$$

This mechanism learns adaptive gradient pathways that dynamically shape learned representations; unlike ResNet’s uniform shortcuts, attention can emphasize deep semantic features for complex inputs while prioritizing shallow texture representations for simpler patterns.
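As a concrete illustration of Equations (4) and (5), the sketch below runs single-head scaled dot-product attention over five toy stage tokens. It is a simplified stand-in, not the authors' implementation: it uses one head instead of eight, an assumed embedding size of 2 instead of 256, and identity projections in place of learned $W_Q$, $W_K$, and $W_V$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention, softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of n token vectors of length d. Returns the attended
    tokens and the n x n attention weight matrix.
    """
    d = len(Q[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights


# Five toy stage tokens with d = 2 (assumed values).
Z = [[1.0, 0.0], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]]
# Identity projections W_Q = W_K = W_V = I are assumed for brevity.
out, attn = attention(Z, Z, Z)

for i, row in enumerate(attn):
    print(f"token {i + 1}:", [round(w, 3) for w in row])
```

Each row of the weight matrix sums to one, so the contribution of every stage token is bounded. Those same weights multiply the gradients flowing back to each stage, which is the sense in which attention acts as a routing mechanism.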

3.3. Positional Encoding Variants

We evaluate three variants to understand spatial encoding requirements:

No PE: Relies solely on CNN-derived spatial structure.

Learnable PE: Adds trainable position embeddings:

$$z_i^{PE} = z_i + p_i, \quad p_i \in \mathbb{R}^d \tag{6}$$

where $p_i$ is a trainable parameter adapted during training. This allows domain-specific positional cues while maintaining the same transformer pipeline as the No PE variant.

RoPE: Relative positional information is incorporated directly into the attention mechanism by rotating queries and keys:

$$Q_i^{RoPE} = R_{\theta(i)}(Q_i), \quad K_i^{RoPE} = R_{\theta(i)}(K_i) \tag{7}$$

where $R_{\theta(i)}$ is a rotation operator parameterized by position $i$. This design integrates spatial relationships natively into the dot-product attention computation.
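The relative nature of the rotary variant can be checked numerically. The sketch below is an illustration under stated assumptions: the paper does not give the angle schedule, so the common RoPE convention with base 10000 is assumed, and the query and key vectors are made-up values.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position rotation to an even-length vector.

    Consecutive pairs (x_{2k}, x_{2k+1}) are rotated by the angle
    position * base^(-2k / d), following the common RoPE convention.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for k in range(d // 2):
        angle = position * base ** (-2.0 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors with d = 4.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Positions (1, 3) and (2, 4) share the same offset of 2, so the
# rotated dot products agree up to floating-point error.
score_a = dot(rotate(q, 1), rotate(k, 3))
score_b = dot(rotate(q, 2), rotate(k, 4))
print(abs(score_a - score_b) < 1e-9)
```

Since the attention score depends only on the offset between two positions, the rotary variant encodes how far apart two extraction points are rather than their absolute indices, in contrast to the additive embeddings of Equation (6).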

4. Experiments and Analysis

We present a comprehensive evaluation of GradAttn across eight diverse datasets, examining classification performance, gradient flow characteristics, model calibration, and computational efficiency. Our analysis reveals the dataset-specific advantages of attention-controlled gradient pathways and provides insights into when and why learnable gradient routing outperforms fixed residual connections.

4.1. Training Protocol

All models are trained using the Adam optimizer with a fixed random seed for reproducibility. Full training and transformer hyperparameters are summarized in Table 1. Learning rate scheduling follows ReduceLROnPlateau monitoring validation accuracy, with training capped at 100 epochs and early stopping restoring the best observed weights. Dataset-specific preprocessing and augmentation strategies are detailed in Table 2; augmentation is applied exclusively to training splits, with validation and test splits receiving only resizing and normalization. Both GradAttn and ResNet-18 use identical preprocessing pipelines throughout all experiments, ensuring that performance differences are attributable solely to architectural design rather than data handling.
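The interaction between plateau-based learning rate decay and early stopping can be sketched without a deep learning framework. The function below is a simplified stand-in for a plateau scheduler and is not the PyTorch ReduceLROnPlateau implementation. The starting learning rate, decay factor, both patience values, and the accuracy history are hypothetical placeholders, since the actual settings are those listed in Table 1.

```python
def run_schedule(val_accuracies, lr=1e-3, factor=0.5, lr_patience=2,
                 stop_patience=4, max_epochs=100):
    """Simulate plateau-based LR decay with early stopping.

    Returns (best_epoch, best_accuracy, final_lr, epochs_run). In real
    training, the weights saved at best_epoch would be restored at the end.
    """
    best_acc, best_epoch = float("-inf"), -1
    since_lr_drop = 0  # epochs without improvement since the last LR change
    since_best = 0     # epochs without improvement since the best epoch
    epochs_run = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        epochs_run = epoch + 1
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            since_lr_drop = since_best = 0
        else:
            since_lr_drop += 1
            since_best += 1
        if since_lr_drop > lr_patience:
            lr *= factor
            since_lr_drop = 0
        if since_best >= stop_patience:
            break
    return best_epoch, best_acc, lr, epochs_run


# Made-up validation accuracies that plateau after the third epoch.
history = [0.40, 0.55, 0.61, 0.60, 0.61, 0.59, 0.60, 0.58]
print(run_schedule(history))
```

Monitoring validation accuracy for both mechanisms means the learning rate is reduced first and training stops only if the reduced rate also fails to improve on the best epoch.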

4.2. Evaluation Datasets

We carry out testing on eight diverse benchmarks: Tiny ImageNet (primary) [27], CIFAR-10 [28], SVHN [29], FashionMNIST [30], and four medical datasets (TissueMNIST [31], BloodMNIST [31], PCam [32], and PAD-UFES-20 [33]). This spans natural images, structured recognition, and specialized medical imaging to validate generalizability across domains (Figure 2). Tiny ImageNet, CIFAR-10, SVHN, FashionMNIST, and PAD-UFES-20 use a 70/15/15 train/val/test split. TissueMNIST and BloodMNIST use a 70/10/20 split to provide a larger test set given their medical imaging context. PCam uses an 80/10/10 split to maximize training data for the large-scale binary classification task.
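A seeded index split of the kind described above might look as follows. This is a sketch only: the text states the split ratios and the use of a fixed seed, but not whether the split is random or stratified, so a plain random shuffle and the seed value 0 are assumptions.

```python
import random


def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


# PAD-UFES-20 has 2298 samples and uses the 70/15/15 split.
train, val, test = split_indices(2298)
print(len(train), len(val), len(test))

# The 80/10/10 split used for PCam, shown on a hypothetical size of 1000.
pcam_train, pcam_val, pcam_test = split_indices(1000, (0.80, 0.10, 0.10))
```

Assigning the rounding remainder to the test split keeps the three subsets disjoint and exhaustive for any dataset size.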

4.3. Classification Performance

Table 3 summarizes the Top-1 accuracy results across all datasets. GradAttn variants outperform ResNet-18 on five of eight datasets, with performance gains strongly correlated with task complexity and the need for multi-scale feature integration. The RoPE variant achieves the highest accuracy on Tiny ImageNet (37.67%, +4.65% over ResNet-18) and SVHN (98.15%, +0.17%), while Learnable PE excels on FashionMNIST (75.18%, +11.07%) and matches ResNet-18 on TissueMNIST (69.88%).

GradAttn underperforms ResNet-18 on three datasets: CIFAR-10 (by 2.94–5.61%), PCam (by 0.89%), and PAD-UFES-20 (by 4.91–7.80%). On CIFAR-10, ResNet-18’s strong convolutional inductive biases for low-resolution 32 × 32 images provide an advantage that multi-scale attention routing cannot overcome. On PCam, the binary metastasis detection task has simple decision boundaries where local convolutional features suffice. On PAD-UFES-20, the limited dataset size of only 2298 samples causes the transformer components to overfit.

4.3.1. Extended Evaluation

Beyond Top-1 accuracy, we evaluate Top-3 and Top-5 accuracy, macro-averaged F1 scores, and the Expected Calibration Error (ECE) to assess ranking quality and prediction reliability. Table 4 presents these metrics for representative datasets where GradAttn demonstrates substantial improvements.

The ECE reductions of 35–45% across successful domains indicate that attention-controlled gradient flow improves not only accuracy but also calibration. The dynamic weighting mechanism produces more reliable confidence estimates by selectively emphasizing features relevant to each input, whereas uniform residual connections propagate all features equally regardless of their contribution to the prediction.

The ECE is defined as follows:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \tag{8}$$

where $M$ is the number of bins, $B_m$ represents the set of predictions in the $m$-th bin, $n$ is the total number of samples, $\mathrm{acc}(B_m)$ is the average accuracy of bin $B_m$, and $\mathrm{conf}(B_m)$ is the average predicted confidence of bin $B_m$.
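Equation (8) translates directly into code. The sketch below assumes equal-width confidence bins and a default of 10 bins, neither of which is specified in the text; the four predictions in the example are made up.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)| with equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin m covers [m / M, (m + 1) / M); confidence 1.0 joins the last bin.
        index = min(num_bins - 1, int(conf * num_bins))
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(hit for _, hit in members) / len(members)
        avg_conf = sum(conf for conf, _ in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


# Two confident, correct predictions and two uncertain ones (one correct).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

In this example each occupied bin has a gap of 0.05 between accuracy and confidence and holds half of the samples, so the weighted sum is 0.05.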

4.3.2. Comprehensive Performance Analysis

Table 5 presents precision, recall, and F1 scores (both macro- and weighted averages) across all datasets and variants. These metrics reveal nuanced performance characteristics beyond simple accuracy measurements.

For datasets with many classes of varying difficulty, such as Tiny ImageNet (200 classes), macro-averaged metrics reveal that GradAttn (RoPE) achieves more balanced performance across classes, with a macro-F1 score improvement of +0.044 over ResNet-18. The weighted metrics confirm that the improvements are not driven by a small subset of classes but reflect genuine enhancement in handling diverse visual categories.

On FashionMNIST, Learnable PE achieves remarkable macro-F1 score gains (+0.136), indicating substantially improved recognition of challenging classes like pullovers and shirts that ResNet-18 frequently confuses. This suggests that dataset-specific positional adaptations enable the model to capture subtle texture and shape variations critical for fine-grained fashion recognition.
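The difference between the macro- and weighted-averaged F1 scores discussed above can be seen on a small example. The sketch below is an illustrative implementation with made-up labels, not the evaluation code used for Table 5.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for integer class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    per_class, supports = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)
    # Macro: every class counts equally. Weighted: classes count by support.
    macro = sum(per_class) / len(classes)
    weighted = sum(f * s for f, s in zip(per_class, supports)) / len(y_true)
    return macro, weighted


# Toy labels: class 0 is frequent and easy, class 1 is rare and hard.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
macro, weighted = f1_scores(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))
```

The macro average is pulled down by the hard minority class, while the weighted average stays closer to the score of the frequent class. This is why a gain in macro-F1 indicates improvement on difficult classes rather than on the bulk of the data.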

4.4. Domain-Specific Patterns

4.4.1. Complex Natural Images

On Tiny ImageNet and SVHN, RoPE consistently outperforms other variants. Tiny ImageNet contains highly diverse visual categories (200 classes spanning objects, animals, and scenes) with significant intra-class variations and cluttered backgrounds. RoPE’s relative positional encoding effectively captures spatial relationships in these complex scenes, enabling the attention mechanism to weight features based on their geometric configuration. The +4.65% accuracy gain on Tiny ImageNet demonstrates that explicitly modeling relative spatial structure benefits tasks requiring global context understanding.

SVHN (street view house numbers) presents a different challenge: Digits appear at varying scales and orientations with complex backgrounds. Despite the simpler task structure (10 classes), RoPE achieves a +0.17% improvement, plausibly by handling spatial variation better through rotary relative positional encoding. The near-ceiling performance (98.15%) indicates that attention-controlled gradients can approach optimal performance on well-structured recognition tasks.

CIFAR-10 represents a case where GradAttn variants consistently underperform ResNet-18 (92.10%), with No PE achieving 88.68%, Learnable PE achieving 86.49%, and RoPE achieving 89.16%. Unlike Tiny ImageNet and SVHN where multi-scale feature integration provides measurable benefits, CIFAR-10’s low-resolution 32 × 32 images contain limited spatial complexity that does not warrant the adaptive gradient routing GradAttn provides. At this resolution, discriminative information is largely captured by local convolutional features within the early layers, making ResNet-18’s fixed residual connections sufficient and the transformer’s cross-scale attention redundant. Furthermore, the relatively small intra-class variation and well-separated decision boundaries in CIFAR-10 mean that shallow texture features alone adequately represent the ten object categories, leaving little room for the deep semantic weighting that attention-controlled gradients excel at. This suggests that GradAttn’s benefits are contingent on sufficient input resolution and feature hierarchy complexity to justify the learnable gradient routing mechanism.

4.4.2. Fashion and Texture Recognition

FashionMNIST represents structured object recognition where shape and texture jointly determine categories. Learnable PE achieves the most substantial improvement (+11.07%) by adapting positional encodings to clothing-specific patterns. Unlike natural images where spatial relationships follow universal geometric principles, fashion items exhibit dataset-specific structural regularities (e.g., shirts always have sleeves in particular positions, trousers have consistent leg structures). Learnable positional encodings capture these domain-specific spatial priors more effectively than either fixed CNN hierarchies (No PE) or universal relative encodings (RoPE).

4.4.3. Medical Imaging

Medical imaging datasets exhibit divergent patterns that illuminate when attention-controlled gradients provide advantages:

BloodMNIST (Blood Cell Classification): The No PE variant achieves the highest accuracy (96.23%, +0.26% over ResNet-18), suggesting that CNN-derived spatial hierarchies sufficiently encode the structure of microscopy images. Blood cells have consistent internal structure (nucleus and cytoplasm) with discrimination primarily based on morphological features at fixed scales. The strong performance without explicit positional encoding indicates that convolutional inductive biases already capture medically relevant spatial patterns.

TissueMNIST (Kidney Tissue Classification): Learnable PE matches ResNet-18 in Top-1 accuracy but achieves superior Top-3 (+0.09%) and Top-5 (+0.09%) performance alongside substantially reduced ECE (0.039 vs. 0.067, −41.8%). This pattern suggests that attention-controlled gradients improve confidence calibration and ranking quality even when final classification accuracy remains comparable. For medical applications where physicians review top-k predictions, improved ranking reliability provides clinical value beyond raw accuracy.

PCam and PAD-UFES-20: ResNet-18 outperforms GradAttn variants on both datasets (+0.89% on PCam, +4.91% on PAD-UFES-20). PCam involves binary metastasis detection in histopathology patches with simple object boundaries; PAD-UFES-20 contains only 2298 samples across six skin lesion types. These results indicate two scenarios where attention-controlled gradients provide limited benefit: (1) tasks with simple decision boundaries where local convolutional features suffice and (2) small datasets where transformer components overfit due to insufficient training samples. The negative results validate that GradAttn’s benefits depend on task complexity and dataset scale rather than universally improving upon residual connections.
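The ECE values quoted for TissueMNIST above can be read against the usual binned definition of expected calibration error. The following is a minimal pure-Python sketch, assuming equal-width confidence bins; the function name `expected_calibration_error`, the choice of 10 bins, and the toy inputs are illustrative assumptions and not the authors' evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: max-softmax confidence and 0/1 correctness.
confs = [0.95, 0.92, 0.85, 0.65, 0.55]
hits = [1, 1, 0, 1, 0]
ece = expected_calibration_error(confs, hits)
print(f"ECE = {ece:.3f}")
```

A lower value means that predicted confidences track empirical accuracy more closely, which is the sense in which the Learnable PE variant is better calibrated on TissueMNIST.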

4.5. Training Dynamics and Convergence

Figure 3 and Figure 4 present training and validation accuracy curves comparing ResNet-18 with the best-performing GradAttn variant for each dataset. Contrary to expectations that attention mechanisms would accelerate convergence, we observe nuanced patterns where GradAttn variants often require more epochs to converge yet achieve superior final performance.

4.5.1. Convergence Speed and Training Requirements

Across datasets, GradAttn models typically require more epochs to converge than ResNet-18. This does not reflect optimization inefficiency but rather the additional time needed for the attention mechanism to learn meaningful routing patterns. On FashionMNIST, for instance, the Learnable PE variant requires 11 epochs compared to ResNet-18’s five, which is a more than $2\times$ increase, yet this extended training yields a +11.07% improvement in accuracy. This illustrates that the extra training time is compensated by substantially better representations that fixed skip connections cannot obtain. Overall, the modest increase in convergence time is justified by the consistent improvements in final performance.

4.5.2. Optimization Landscape Complexity

The extended training requirements for GradAttn reveal fundamental differences in optimization dynamics between fixed and learnable gradient pathways. We identify two factors contributing to slower convergence:

Interdependent Component Optimization: GradAttn must simultaneously optimize CNN feature extraction, linear projections, and attention weights. Early in training, random attention patterns distribute gradients nearly uniformly across extraction points, providing minimal benefit over ResNet while adding optimization complexity. Only after the attention mechanism learns meaningful feature importance patterns (typically 15–25 epochs on Tiny ImageNet) does performance begin exceeding ResNet-18. This initial “discovery phase” explains the extended training time.

Non-Stationary Gradient Distributions: Unlike ResNet where gradient pathways remain constant, GradAttn’s attention-controlled routing creates non-stationary optimization. As attention weights evolve, the effective learning rate for different layers changes dynamically, with layers receiving high attention experiencing larger gradient updates while low-attention layers adapt slowly. This adaptive behavior requires more iterations to reach equilibrium but produces better calibrated feature representations.
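The remark about layer-wise effective learning rates can be made concrete with a short derivation. It assumes plain gradient descent with a global step size $\eta$ (an assumption made for illustration; the experiments use Adam, whose per-parameter normalization changes this picture) and uses the per-layer modulation signal $\alpha_l \in [0, 1]$ of the conceptual framework:

```latex
\tilde{g}_l = \alpha_l \cdot \frac{\partial L}{\partial \theta_l},
\qquad
\theta_l \leftarrow \theta_l - \eta\,\tilde{g}_l
        = \theta_l - \underbrace{(\eta\,\alpha_l)}_{\eta_l^{\mathrm{eff}}}\,
          \frac{\partial L}{\partial \theta_l}.
```

Under this reading, a stage with $\alpha_l$ close to 1 trains at nearly the full step size, while a stage with $\alpha_l$ close to 0 is almost frozen. Because $\alpha_l$ itself changes as the attention weights are learned, the per-stage step size is non-stationary.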

4.5.3. Generalization Gap Analysis

Table 6 quantifies the train–validation accuracy gap at convergence, revealing overfitting tendencies across architectures.

GradAttn variants substantially reduce overfitting on complex datasets where they achieve superior test accuracy. Most notably, on Tiny ImageNet, Learnable PE reduces the generalization gap from 65.77% (ResNet-18) to 14.30%, a 78.3% relative reduction. This indicates that ResNet-18 severely overfits the training data, while attention-controlled gradients learn substantially more generalizable representations. The RoPE and No PE variants also achieve large reductions (61.4% and 75.3%, respectively), consistent with the view that dynamic feature weighting regularizes learning.

On FashionMNIST, Learnable PE reduces overfitting by 22.5% (11.57% to 8.97%), while on TissueMNIST, No PE achieves a 75.7% reduction (5.39% to 1.31%). These results suggest that attention-controlled gradients act as an adaptive regularizer, where less relevant features contribute minimally to gradients during backpropagation, preventing the model from memorizing spurious correlations in the training set.
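The relative reductions quoted in this subsection follow directly from the gaps in Table 6. The short sketch below recomputes them; the helper name `relative_reduction` and the dictionary layout are illustrative choices, and the results agree with the quoted percentages to within rounding.

```python
def relative_reduction(baseline_gap, variant_gap):
    """Percentage by which variant_gap is smaller than baseline_gap."""
    return 100.0 * (baseline_gap - variant_gap) / baseline_gap


# (ResNet-18 gap, GradAttn gap) in percentage points, taken from Table 6.
gaps = {
    "Tiny ImageNet / Learnable PE": (65.77, 14.30),
    "Tiny ImageNet / No PE": (65.77, 16.24),
    "Tiny ImageNet / RoPE": (65.77, 25.38),
    "FashionMNIST / Learnable PE": (11.57, 8.97),
    "TissueMNIST / No PE": (5.39, 1.31),
}
reductions = {name: round(relative_reduction(base, var), 1)
              for name, (base, var) in gaps.items()}
for name, value in reductions.items():
    print(f"{name}: gap is {value}% smaller than for ResNet-18")
```

Note that these are relative reductions of the gap, not differences in percentage points: on Tiny ImageNet the absolute gap shrinks by 51.47 points for Learnable PE.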

CIFAR-10 shows a balanced generalization pattern, with ResNet-18 exhibiting a 6.52% gap and GradAttn variants showing comparable or smaller gaps (3.16–6.62%). Despite Learnable PE achieving the tightest gap at 3.16%, ResNet-18 maintains superior test accuracy at 92.10%, indicating that GradAttn variants generalize more consistently but from a lower performance ceiling due to ResNet-18’s stronger convolutional inductive biases for low-resolution inputs.

On medical imaging datasets, the pattern is nuanced. On TissueMNIST, all GradAttn variants achieve much smaller generalization gaps (1.31–1.75%) than ResNet-18 (5.39%). On BloodMNIST, gaps are small for every model (0.15–0.63% for GradAttn vs. 0.57% for ResNet-18), with the RoPE and No PE variants below the baseline and Learnable PE marginally above it. However, on PCam, ResNet-18 maintains a smaller gap (5.86%) compared to GradAttn variants (7.31–8.33%), aligning with our earlier finding that PCam’s binary classification with simple decision boundaries favors ResNet’s convolutional inductive biases. On PAD-UFES-20, all models exhibit relatively small gaps (2.95–4.21%), with minimal differences between architectures, suggesting that the primary challenge is insufficient training data (only 2298 samples) rather than overfitting.

These results establish attention-controlled gradients as a powerful implicit regularizer for complex, large-scale datasets where multi-scale feature integration benefits from adaptive selection. The regularization effect emerges naturally from the attention mechanism’s learned feature weighting rather than requiring explicit regularization techniques.

4.6. Gradient Flow Analysis

The Gradient Health Score is defined as follows:

$$\mathrm{GHS} = \frac{N_{\mathrm{healthy}}}{N_{\mathrm{total}}} \qquad (9)$$

where $N_{\mathrm{healthy}}$ represents the number of layers with gradient norms in the range $[10^{-6}, 10]$, and $N_{\mathrm{total}}$ is the total number of analyzed layers. $\mathrm{GHS} = 1.0$ indicates perfect gradient stability, with all layers maintaining healthy gradient magnitudes, while $\mathrm{GHS} < 1.0$ reveals the fraction of layers experiencing vanishing (gradient norm $< 10^{-6}$) or exploding (gradient norm $> 10$) gradients. This composite metric quantifies overall network gradient flow quality, where controlled instabilities ($0.8 < \mathrm{GHS} < 1.0$) can coincide with improved generalization, as observed in our attention variants.
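The definition in Equation (9) translates into a few lines of code. The sketch below is a minimal pure-Python version operating on a list of per-layer gradient norms; the function names and the example norms are hypothetical, and in a PyTorch training loop the norms would typically be collected from each parameter's `.grad` after the backward pass (an assumption about the setup, not a description of the authors' code).

```python
def gradient_health_score(grad_norms, low=1e-6, high=10.0):
    """Fraction of layers whose gradient norm lies in [low, high] (Equation (9))."""
    healthy = sum(1 for g in grad_norms if low <= g <= high)
    return healthy / len(grad_norms)


def count_unhealthy(grad_norms, low=1e-6, high=10.0):
    """Return (vanishing, exploding) layer counts for the given thresholds."""
    vanishing = sum(1 for g in grad_norms if g < low)
    exploding = sum(1 for g in grad_norms if g > high)
    return vanishing, exploding


# Hypothetical per-layer gradient norms for a 10-layer model (not measured values).
norms = [0.37, 0.21, 0.15, 3e-8, 0.09, 0.0, 0.12, 0.30, 0.61, 0.05]
ghs = gradient_health_score(norms)
vanishing, exploding = count_unhealthy(norms)
print(f"GHS = {ghs:.2f}, vanishing = {vanishing}, exploding = {exploding}")

# Robustness check in the spirit of Table 7: repeat with wider and narrower ranges.
ghs_by_range = [gradient_health_score(norms, lo, hi)
                for lo, hi in [(1e-7, 1e2), (1e-6, 1e1), (1e-5, 1e0)]]
```

When the unhealthy layers have norms far below every lower threshold, as in this toy example, the score is identical for all three ranges, which mirrors the threshold robustness reported in Table 7.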

We monitored gradient health during testing using normalized stability metrics across all eight datasets. While ResNet-18 maintained perfect stability (GHS = 1.0) across experiments, attention variants consistently introduced controlled instabilities that often coincided with improved generalization. The No PE variant exhibited minimal gradient decay across most datasets, with generally stable training dynamics (e.g., GHS = 0.914 on FashionMNIST). The Learnable PE variant experienced localized vanishing gradients in several layers, yet it achieved the best performance on FashionMNIST (+11.07%) and matched ResNet-18 on TissueMNIST while substantially improving calibration (ECE reduced by 41.8%). Similarly, RoPE introduced controlled instability across a subset of layers while still achieving the highest accuracy on Tiny ImageNet and SVHN among all variants. Table 7 provides a concrete illustration of these gradient dynamics for the Tiny ImageNet dataset, where RoPE attains GHS = 0.732 with vanishing gradients in eight layers, yet it outperforms the perfectly stable ResNet-18 by +4.65% in Top-1 accuracy.

This suggests that perfect gradient stability may not be optimal, i.e., controlled attention-induced redistribution can enhance feature learning despite minor gradient anomalies.

Controlled Instability and Generalization

The relationship between gradient stability and performance challenges conventional deep learning theory. On Tiny ImageNet, RoPE achieves the highest accuracy (37.67%) despite exhibiting GHS = 0.732, with eight layers experiencing vanishing gradients. Learnable PE on FashionMNIST attains a +11.07% improvement with GHS = 0.914. These results suggest that attention-induced gradient redistribution creates beneficial training dynamics where not all layers require uniform gradient flow.

We hypothesize that controlled instabilities enable the attention mechanism to effectively “prune” less relevant gradient pathways during training. Layers experiencing occasional vanishing gradients contribute minimally to feature learning, allowing the network to focus representational capacity on extraction points that are most critical for the task. This differs fundamentally from traditional vanishing gradient problems where poor initialization or activation functions cause universal gradient decay; attention-controlled vanishing is selective and input-dependent.

Figure 5 visualizes layer-wise gradient norm distributions, revealing that vanishing gradients in GradAttn models occur selectively in mid-depth layers, while shallow and deep layers maintain healthy gradients. This pattern indicates that attention learns to bypass intermediate feature hierarchies when they provide redundant information, creating direct pathways from shallow texture features and deep semantic representations to the loss function.

4.7. Parameter and Computational Efficiency

Despite the added transformer layers, the models maintain competitive efficiency. While ResNet-18 has ∼11.2 M parameters, the attention variants add only ∼1.6 M parameters (≈14.3% increase). Table 8 reports the average training time per epoch across all datasets and model variants. Contrary to expectations, GradAttn variants do not consistently incur higher per-epoch computational cost than ResNet-18: all three variants achieve lower epoch times on CIFAR-10 and PCam, and individual variants are also faster on TissueMNIST and BloodMNIST (No PE) and on PAD-UFES-20 (Learnable PE). Where GradAttn is slower, the per-epoch overhead stays below 18%, with the largest slowdowns on SVHN and Tiny ImageNet. This suggests that the attention-controlled gradient routing introduces only moderate computational overhead per epoch and that the extended total training time observed for GradAttn variants is attributable primarily to the greater number of epochs required for convergence rather than increased per-epoch cost.
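The per-epoch comparison can be checked directly against Table 8. The sketch below tabulates, from the published timings, which datasets have every GradAttn variant faster than ResNet-18 and what the largest relative slowdown is; the variable and function names are illustrative.

```python
# Average seconds per epoch from Table 8: (ResNet-18, No PE, Learnable PE, RoPE).
epoch_time = {
    "Tiny ImageNet": (50.90, 56.53, 58.17, 58.59),
    "SVHN": (174.61, 200.67, 205.52, 203.86),
    "CIFAR-10": (42.03, 40.45, 40.50, 40.84),
    "FashionMNIST": (36.33, 37.42, 37.50, 38.94),
    "TissueMNIST": (142.00, 119.43, 152.71, 158.90),
    "BloodMNIST": (12.52, 11.86, 12.97, 13.28),
    "PCam": (209.20, 198.92, 197.31, 198.15),
    "PAD-UFES-20": (77.91, 79.29, 76.89, 78.75),
}


def overhead_percent(baseline, variant):
    """Relative per-epoch cost of a variant versus the baseline, in percent."""
    return 100.0 * (variant - baseline) / baseline


# Datasets on which every GradAttn variant is faster per epoch than ResNet-18.
all_faster = [d for d, (base, *variants) in epoch_time.items()
              if all(v < base for v in variants)]
# Largest slowdown of any variant on any dataset, as (percent, dataset).
worst = max((overhead_percent(base, v), d)
            for d, (base, *variants) in epoch_time.items() for v in variants)
# Parameter overhead quoted in the text: about 1.6 M added to about 11.2 M.
param_overhead = 100.0 * 1.6 / 11.2
print(all_faster, f"{worst[0]:.1f}% on {worst[1]}", f"{param_overhead:.1f}%")
```

Since per-epoch cost differs by less than a fifth in either direction, total training cost is driven mainly by the number of epochs until early stopping.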

4.8. Comparison with Attention-Augmented Baselines

To further contextualise GradAttn’s contributions beyond a plain ResNet-18 baseline, we compare against ResNet-18 augmented with established channel and spatial attention modules: Squeeze-and-Excitation (SE) Networks [13] and the Convolutional Block Attention Module (CBAM) [14]. These modules apply attention within individual residual blocks for feature recalibration, representing a natural intermediate between fixed residual connections and GradAttn’s inter-layer gradient routing. Table 9 presents Top-1 accuracy on Tiny ImageNet, SVHN, and FashionMNIST, chosen as representative datasets spanning complex natural images, structured digit recognition, and texture-based fashion recognition, respectively.

The architectural distinction between these approaches is fundamental. SE and CBAM operate as intra-block attention mechanisms: They recalibrate feature channels or spatial responses within a single residual block, leaving the gradient pathways between blocks unchanged. GradAttn, by contrast, functions as a global gradient regulator across the full network hierarchy. As formalized in Equations (1) and (2), the meta-regulator $M_{\psi}$ observes a global summary of the network state across all extraction points and emits per-layer modulation signals $\alpha_{l} \in [0, 1]$, governing the effective gradient update $\tilde{g}_{l} = \alpha_{l} \cdot \partial L / \partial \theta_{l}$ for each stage. This modulation is realised through the transformer self-attention over the token sequence $Z = [z_{1}, z_{2}, z_{3}, z_{4}, z_{5}]$ defined in Equation (3), where the attention operation in Equation (5) dynamically weights the relative contribution of each hierarchical level. The result is that gradient signals are routed adaptively across the entire network depth rather than recalibrated within any single block. SE and CBAM enhance what is represented within a block; GradAttn controls which blocks drive learning across the network, a fundamentally different and complementary form of attention.
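To make the inter-layer view concrete, the sketch below computes scaled dot-product self-attention over five stage tokens and then applies a per-stage scaling of the gradients. It is a toy illustration under several assumptions: the query and key projections are taken as the identity, the token dimension is 4 rather than the 256 listed in Table 1, the token and gradient values are made up, and summarising the attention matrix into one weight per stage (the column mean) is an illustrative choice rather than the mapping defined by Equations (1) and (2).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(tokens):
    """Attention matrix softmax(Z Z^T / sqrt(d)) with identity Q/K projections."""
    d = len(tokens[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
              for q in tokens]
    return [softmax(row) for row in scores]


def modulate(grads, alphas):
    """Scale each stage's gradient by its weight: g~_l = alpha_l * g_l."""
    return [[a * g for g in grad] for a, grad in zip(alphas, grads)]


# Five hypothetical stage tokens z_1..z_5 (shallow to deep).
Z = [[0.2, 0.1, 0.0, 0.3],
     [0.5, 0.4, 0.1, 0.0],
     [0.1, 0.9, 0.3, 0.2],
     [0.0, 0.2, 0.8, 0.5],
     [0.3, 0.0, 0.6, 0.9]]
A = self_attention_weights(Z)
# Assumed summary: average attention each stage receives from all stages.
alphas = [sum(A[i][j] for i in range(5)) / 5 for j in range(5)]
grads = [[1.0, -2.0] for _ in range(5)]  # hypothetical per-stage gradients
g_tilde = modulate(grads, alphas)
```

Each row of `A` sums to one, so the derived weights lie in $[0, 1]$; stages that receive little attention end up with proportionally smaller gradient updates, which is the qualitative behaviour the paragraph above describes.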

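The results in Table 9 show that the best GradAttn variant exceeds both SE and CBAM on all three datasets: 37.67% vs. 32.01% and 31.76% on Tiny ImageNet, 98.15% vs. 97.94% and 97.99% on SVHN, and 75.18% vs. 63.11% and 66.98% on FashionMNIST. The advantage is not uniform across variants, since CBAM (66.98%) edges out the No PE (66.70%) and RoPE (62.40%) variants on FashionMNIST. Overall, these results suggest that inter-layer gradient routing captures complementary information that block-level attention mechanisms alone do not exploit.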

4.9. Discussion

Domain-Specific Advantages: GradAttn variants excel in domains requiring global context modeling (Tiny ImageNet and SVHN) or abstract structural understanding (FashionMNIST), but they offer limited benefits in texture-dominated domains.

Complementarity of CNNs and Attention: Attention-based control demonstrates complementary behavior rather than replacing residual connections. Convolutions handle local feature extraction, while attention adaptively redistributes gradient signals across scales, enabling dynamic control where deeper features selectively dominate or recede.

Generalization vs. Stability Trade-Off: Gradient diagnostics reveal that minor instability can coincide with improved accuracy, suggesting that perfectly uniform gradient flow (as in ResNet) may not be optimal for all tasks, and controlled imbalance through attention may promote richer feature learning.

Comparison with Block-Level Attention: SE and CBAM modules augment individual residual blocks with channel and spatial recalibration, respectively, but they operate within fixed residual connection topologies. GradAttn differs fundamentally by routing gradients across the full network hierarchy through inter-layer attention, enabling adaptive weighting of features from different semantic depths rather than recalibrating features within a single block.

5. Conclusions

This study shows that attention-controlled gradient flow is a viable alternative to the fixed residual connections used in ResNet. By replacing the fixed shortcut connections with attention-based pathways that are modulated by the input and the task, GradAttn matched or exceeded ResNet-18 in Top-1 accuracy on five of eight datasets (Table 3), with a gain of up to +11.07% on FashionMNIST. The main findings indicate that gradient flow control benefits from being adaptive rather than uniform. Our gradient flow analysis showed that controlled instabilities can coincide with improved generalization, which challenges the assumption that strict stability is always optimal. The effectiveness of positional encoding was dataset-dependent: on BloodMNIST, the No PE variant, which relies on the spatial structure already encoded by the CNN, performed best, whereas complex natural images (Tiny ImageNet and SVHN) benefited most from the relative positioning of RoPE. In future work, we will explore larger architectures and gradient-aware training objectives. Treating gradient flow as a learnable component opens pathways for adaptive deep learning architectures beyond residual connections.

Author Contributions

S.G. led the software development and the preparation of the original draft. S.G. and H.B. contributed significantly to all other aspects of the research and the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All datasets used in this study are publicly available and are cited in the references. The implementation of GradAttn is available at https://github.com/SoudeepGhoshal/GradAttn (accessed on 21 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Sima, Z.; Tao, J.; Liu, Z. Adaptive and generic improvements to ResNet backbone in image classification. In Proceedings of the International Conference on Electronics, Electrical and Information Engineering (ICEEIE 2024); SPIE: San Francisco, CA, USA, 2024; Volume 13445, pp. 982–989.
3. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
4. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 6105–6114.
5. Jastrzębski, S.; Arpit, D.; Ballas, N.; Verma, V.; Che, T.; Bengio, Y. Residual connections encourage iterative inference. arXiv 2017, arXiv:1710.04773.
6. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
7. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
8. Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv 2016, arXiv:1611.01578.
9. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2015; pp. 448–456.
10. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
11. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
12. Xie, Z.; Wei, Y.; Cao, H.; Zhao, C.; Deng, C.; Li, J.; Dai, D.; Gao, H.; Chang, J.; Yu, K.; et al. mHC: Manifold-constrained hyper-connections. arXiv 2025, arXiv:2512.24880.
13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
14. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
16. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
17. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514.
18. Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying convolution and attention for all data sizes. Adv. Neural Inf. Process. Syst. 2021, 34, 3965–3977.
19. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14408–14419.
20. Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On layer normalization in the transformer architecture. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 10524–10533.
21. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 10347–10357.
22. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
23. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
24. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
25. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
26. Huang, G.; Chen, D.; Li, T.; Wu, F.; Van Der Maaten, L.; Weinberger, K.Q. Multi-scale dense networks for resource efficient image classification. arXiv 2017, arXiv:1703.09844.
27. Le, Y.; Yang, X. Tiny ImageNet visual recognition challenge. CS 231N 2015, 7, 3.
28. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009.
29. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 16–17 December 2011.
30. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
31. Yang, J.; Shi, R.; Wei, D.; Liu, Z.; Zhao, L.; Ke, B.; Pfister, H.; Ni, B. MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 2023, 10, 41.
32. Veeling, B.S.; Linmans, J.; Winkens, J.; Cohen, T.; Welling, M. Rotation equivariant CNNs for digital pathology. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2018; pp. 210–218.
33. Pacheco, A.G.; Lima, G.R.; Salomao, A.S.; Krohling, B.; Biral, I.P.; De Angelo, G.G.; Alves, F.C., Jr.; Esgario, J.G.; Simora, A.C.; Castro, P.B.; et al. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief 2020, 32, 106221.

Figure 1. Proposed attention-controlled gradient flow architecture [GradAttn].

Figure 2. Dataset overview and representative samples.

Figure 3. Training dynamics comparison across complex natural images and texture recognition datasets.

Figure 4. Training dynamics comparison across medical datasets.

Figure 5. Gradient dynamics of GradAttn (RoPE) on Tiny ImageNet dataset.

Table 1. Training hyperparameters.

Hyperparameter	Value	Hyperparameter	Value
Optimizer	Adam	Early Stopping Monitor	Val Accuracy
Learning Rate	$1 \times 10^{-3}$	Early Stopping Patience	7 epochs
Weight Decay	$1 \times 10^{-4}$	Best Weight Restoration	Enabled
Batch Size	128	LR Scheduler	ReduceLROnPlateau
Maximum Epochs	100	LR Patience	3 epochs
Random Seed	42	LR Reduction Factor	0.2
		Minimum LR	$1 \times 10^{-7}$
Transformer Encoder Settings
Embedding Dim	256	Feedforward Dim	512
Attention Heads	8	Dropout	0.1
Encoder Layers	3
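The interaction between the learning-rate scheduler and early stopping in Table 1 can be illustrated with a small simulation. The sketch below replays a sequence of validation accuracies through simplified plateau logic using the tabulated values (factor 0.2, LR patience 3, minimum LR $10^{-7}$, stopping patience 7). The function `simulate_schedule`, the exact patience semantics, and the accuracy sequence are assumptions for illustration and do not reproduce the authors' training loop.

```python
def simulate_schedule(val_accs, lr=1e-3, factor=0.2, lr_patience=3,
                      min_lr=1e-7, stop_patience=7):
    """Replay validation accuracies through plateau LR decay and early stopping.

    Assumed semantics: the LR is multiplied by `factor` once more than
    `lr_patience` consecutive epochs pass without a new best accuracy, and
    training stops after `stop_patience` consecutive epochs without improvement.
    """
    best, best_epoch = float("-inf"), -1
    bad_lr = bad_stop = 0
    history = []
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best, best_epoch = acc, epoch
            bad_lr = bad_stop = 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr > lr_patience:
            lr = max(lr * factor, min_lr)  # decay, but never below the floor
            bad_lr = 0
        history.append(lr)
        if bad_stop >= stop_patience:
            break  # the weights from best_epoch would be restored here
    return best_epoch, history


# Hypothetical validation accuracies: three improving epochs, then a plateau.
val_accs = [0.60, 0.65, 0.70] + [0.69] * 10
best_epoch, lrs = simulate_schedule(val_accs)
print(f"best epoch: {best_epoch}, epochs run: {len(lrs)}, final LR: {lrs[-1]:.0e}")
```

With these settings the learning rate is reduced once during a plateau before early stopping ends the run, so the reported epoch counts include the patience window after the best epoch.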

Table 2. Data preprocessing and augmentation settings per dataset (training split).

Dataset	Preprocessing Applied (Training)	Normalization (Mean/Std)
Tiny ImageNet	None	[0.485, 0.456, 0.406]/[0.229, 0.224, 0.225]
CIFAR-10	RandomHorizontalFlip (p = 0.5), RandomCrop (32, pad = 4)	[0.4914, 0.4822, 0.4465]/[0.2023, 0.1994, 0.2010]
SVHN	None	[0.4377, 0.4438, 0.4728]/[0.1980, 0.2010, 0.1970]
FashionMNIST	RandomHorizontalFlip (p = 0.5), RandomRotation (10°), RandomAffine (translate = 0.1)	[0.2860]/[0.3530]
TissueMNIST	Resize (64 × 64), RandomRotation (10°), RandomHorizontalFlip (p = 0.5), ColorJitter (b = 0.1, c = 0.1)	[0.485, 0.456, 0.406]/[0.229, 0.224, 0.225]
BloodMNIST	Resize (64 × 64), RandomRotation (10°), RandomHorizontalFlip (p = 0.5), ColorJitter (b = 0.1, c = 0.1)	[0.485, 0.456, 0.406]/[0.229, 0.224, 0.225]
PCam	None	[0.485, 0.456, 0.406]/[0.229, 0.224, 0.225]
PAD-UFES-20	Resize (224 × 224), RandomHorizontalFlip (p = 0.5), RandomRotation (10°), ColorJitter (b = 0.2, c = 0.2, s = 0.2, h = 0.1), RandomResizedCrop (224, scale = 0.8–1.0)	[0.485, 0.456, 0.406]/[0.229, 0.224, 0.225]

Note: No augmentation is applied to validation or test splits for any dataset.

Table 3. Top-1 accuracy comparison across datasets.

Dataset	ResNet-18	No PE	Learnable PE	RoPE
Tiny ImageNet	33.02%	35.89%	34.72%	37.67%
SVHN	97.98%	98.09%	98.09%	98.15%
CIFAR-10	92.10%	88.68%	86.49%	89.16%
FashionMNIST	64.11%	66.70%	75.18%	62.40%
TissueMNIST	69.88%	69.79%	69.88%	69.72%
BloodMNIST	95.97%	96.23%	94.48%	95.59%
PCam	80.71%	79.65%	79.74%	79.82%
PAD-UFES-20	55.49%	50.29%	50.58%	47.69%
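The per-dataset comparison in Table 3 can be tallied mechanically. The sketch below selects the best GradAttn variant for each dataset and computes its margin over ResNet-18 in percentage points; the data are copied from the table, and the variable names are illustrative.

```python
# Top-1 accuracy (%) from Table 3: (ResNet-18, No PE, Learnable PE, RoPE).
top1 = {
    "Tiny ImageNet": (33.02, 35.89, 34.72, 37.67),
    "SVHN": (97.98, 98.09, 98.09, 98.15),
    "CIFAR-10": (92.10, 88.68, 86.49, 89.16),
    "FashionMNIST": (64.11, 66.70, 75.18, 62.40),
    "TissueMNIST": (69.88, 69.79, 69.88, 69.72),
    "BloodMNIST": (95.97, 96.23, 94.48, 95.59),
    "PCam": (80.71, 79.65, 79.74, 79.82),
    "PAD-UFES-20": (55.49, 50.29, 50.58, 47.69),
}
variants = ("No PE", "Learnable PE", "RoPE")

summary = {}
for dataset, (resnet, *grad_attn) in top1.items():
    best_acc = max(grad_attn)
    best_variant = variants[grad_attn.index(best_acc)]
    # Margin of the best GradAttn variant over ResNet-18, in percentage points.
    summary[dataset] = (best_variant, round(best_acc - resnet, 2))

wins = [d for d, (_, delta) in summary.items() if delta > 0]
ties = [d for d, (_, delta) in summary.items() if delta == 0]
for dataset, (variant, delta) in summary.items():
    print(f"{dataset}: best = {variant}, margin = {delta:+.2f}")
```

On Top-1 accuracy alone, this gives four strict improvements (Tiny ImageNet, SVHN, FashionMNIST, BloodMNIST) and one tie (TissueMNIST), so a count of five holds when the tie is included.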

Table 4. Extended performance metrics.

Model	Top-3 Acc	Top-5 Acc	F1 Score	ECE
Tiny ImageNet
ResNet-18	50.52%	58.62%	0.327	0.344
RoPE	56.45%	64.81%	0.373	0.197
FashionMNIST
ResNet-18	92.21%	98.06%	0.589	0.193
Learnable PE	96.93%	99.30%	0.723	0.121
TissueMNIST
ResNet-18	93.83%	98.55%	0.695	0.067
Learnable PE	93.92%	98.64%	0.696	0.039
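For readers less familiar with the Top-3 and Top-5 columns, the sketch below shows how top-k accuracy is commonly computed from class scores. The function name, the scores, and the labels are hypothetical; ties between equal scores are broken by class index here, which is an arbitrary choice.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Hypothetical class scores for 4 samples over 4 classes.
scores = [[0.1, 0.6, 0.2, 0.1],
          [0.5, 0.3, 0.1, 0.1],
          [0.2, 0.2, 0.5, 0.1],
          [0.4, 0.1, 0.2, 0.3]]
labels = [1, 1, 0, 1]
acc_top1 = top_k_accuracy(scores, labels, k=1)
acc_top3 = top_k_accuracy(scores, labels, k=3)
print(f"Top-1 = {acc_top1:.2f}, Top-3 = {acc_top3:.2f}")
```

Top-k accuracy can only grow with k, which is why the Top-3 and Top-5 values in Table 4 are much closer to 100% than the Top-1 values in Table 3.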

Table 5. Precision, recall, and F1 score across datasets.

Model	Precision (M)	Precision (W)	Recall (M)	Recall (W)	F1 (M)	F1 (W)
Tiny ImageNet
ResNet-18	0.328	0.331	0.330	0.330	0.327	0.329
RoPE	0.371	0.374	0.377	0.377	0.371	0.373
FashionMNIST
ResNet-18	0.662	0.662	0.636	0.641	0.585	0.589
Learnable PE	0.760	0.760	0.748	0.752	0.721	0.723
TissueMNIST
ResNet-18	0.631	0.695	0.609	0.699	0.617	0.695
Learnable PE	0.632	0.696	0.613	0.699	0.618	0.695

M: Macro-averaged; W: weighted-averaged across classes.
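The difference between the macro (M) and weighted (W) columns matters on imbalanced datasets such as TissueMNIST, where the two averages diverge. The sketch below computes both F1 averages from label lists; the function names and the toy labels are illustrative assumptions.

```python
def f1_per_class(y_true, y_pred, classes):
    """Map each class to (F1 score, support) computed from label lists."""
    result = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        result[c] = (2 * tp / denom if denom else 0.0, tp + fn)
    return result


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 weights classes equally; weighted F1 weights them by support."""
    classes = sorted(set(y_true))
    stats = f1_per_class(y_true, y_pred, classes)
    macro = sum(f1 for f1, _ in stats.values()) / len(classes)
    weighted = sum(f1 * n for f1, n in stats.values()) / len(y_true)
    return macro, weighted


# Hypothetical imbalanced labels: class 0 has 6 samples, class 1 has 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Because the minority class is predicted less reliably in this toy example, the macro average falls below the weighted average, the same direction as the TissueMNIST rows of Table 5.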

Table 6. Train–validation accuracy gap at convergence.

Dataset	ResNet-18	No PE	Learnable PE	RoPE
Tiny ImageNet	65.77%	16.24%	14.30%	25.38%
SVHN	0.70%	0.94%	1.29%	1.34%
CIFAR-10	6.52%	5.47%	3.16%	6.62%
FashionMNIST	11.57%	10.85%	8.97%	10.96%
TissueMNIST	5.39%	1.31%	1.75%	1.58%
BloodMNIST	0.57%	0.43%	0.63%	0.15%
PCam	5.86%	7.37%	7.31%	8.33%
PAD-UFES-20	3.46%	3.27%	2.95%	4.21%

Note: Gap = |Train Accuracy - Validation Accuracy|.

Table 7. Gradient flow analysis results on Tiny ImageNet.

Model	GHS [$10^{-7}$, $10^{2}$]	GHS [$10^{-6}$, $10^{1}$]	GHS [$10^{-5}$, $10^{0}$]	V/E Gradients	Avg Norm	Range	Std
ResNet-18	1.00	1.00	1.00	None	0.153	0.022–0.370	0.106
GradAttn (No PE)	0.80	0.80	0.80	Vanishing (4 layers)	0.085	0.000–0.610	0.106
GradAttn (Learnable PE)	0.80	0.80	0.80	Vanishing (4 layers)	0.078	0.000–0.336	0.070
GradAttn (RoPE)	0.73	0.73	0.73	Vanishing (8 layers)	0.117	0.000–0.957	0.199

GHS = Gradient Health Score; V/E = vanishing/exploding. GHS values are consistent across all tested threshold ranges, confirming robustness of the metric.

Table 8. Average training time per epoch (seconds) across datasets.

Dataset	ResNet-18	No PE	Learnable PE	RoPE
Tiny ImageNet	50.90	56.53	58.17	58.59
SVHN	174.61	200.67	205.52	203.86
CIFAR-10	42.03	40.45	40.50	40.84
FashionMNIST	36.33	37.42	37.50	38.94
TissueMNIST	142.00	119.43	152.71	158.90
BloodMNIST	12.52	11.86	12.97	13.28
PCam	209.20	198.92	197.31	198.15
PAD-UFES-20	77.91	79.29	76.89	78.75

Average time per epoch is computed over all epochs before early stopping.

Table 9. Comparison with attention-augmented ResNet-18 baselines on selected datasets.

Model	Tiny ImageNet	SVHN	FashionMNIST
ResNet-18	33.02%	97.98%	64.11%
ResNet-18 + SE	32.01%	97.94%	63.11%
ResNet-18 + CBAM	31.76%	97.99%	66.98%
GradAttn (No PE)	35.89%	98.09%	66.70%
GradAttn (Learnable PE)	34.72%	98.09%	75.18%
GradAttn (RoPE)	37.67%	98.15%	62.40%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
