CHARMS: A CNN-Transformer Hybrid with Attention Regularization for MRI Super-Resolution
Highlights
- CHARMS, a lightweight CNN-Transformer hybrid (~1.9 M parameters, ~30 GFLOPs), outperforms state-of-the-art lightweight MRI super-resolution models (EDSR, PAN, W2AMSN-S, and FMEN) by 0.1–0.6 dB PSNR and up to 1% SSIM at ×2/×4 upscaling while reducing inference time by ~40% (both metrics are defined after these highlights).
- With cross-field fine-tuning on only twenty subjects, CHARMS upgrades clinical 3T MRI to near-7T quality, yielding ~6 dB PSNR and 0.12 SSIM gains over native 3T scans across T1w/T2w contrasts.
- Near-real-time inference (~11 ms per slice, i.e., ~1.6–1.9 s for a typical 150–170-slice 3D brain volume on an RTX 4090) and a small model footprint enable practical deployment on clinical workstations, in online reconstruction pipelines, and in resource-constrained environments, including low-field and portable MRI scanners.
- Superior fidelity–efficiency balance paves the way for shorter scan times, reduced motion artifacts, and 7T-like diagnostic quality from standard 3T systems without additional hardware.
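For reference, the PSNR and SSIM figures quoted above and throughout the results tables follow the standard definitions [66,67]. For an image x and reference y with N pixels and intensity range L:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2, \qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{L^{2}}{\mathrm{MSE}}\right)\ \text{[dB]}, \qquad
\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^{2} + \mu_y^{2} + c_1)(\sigma_x^{2} + \sigma_y^{2} + c_2)},
```

where the μ, σ², and σ_xy terms are local means, variances, and covariance, and c₁, c₂ are small stabilizing constants.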
Abstract
1. Introduction
2. Related Work
2.1. CNN-Based MRI Super-Resolution
2.2. Attention Mechanisms in SR
2.3. Transformer and Hybrid Architectures
2.4. Diffusion Models and Emerging Trends
3. Materials and Methods
3.1. CHARMS Framework
3.2. Reverse Residual Attention Fusion (RRAF) Block
3.3. Pixel–Channel Attention (PCA) Module
3.4. Transformer Module with MDDTA and GDDFN
3.5. High-Frequency Information Refinement (HFIR)
3.6. Datasets and Preprocessing
3.7. Training Protocol and Comparative Models
3.8. Cross-Field Adaptation and Evaluation Procedure
4. Results
4.1. Benchmark Performance
4.2. Ablation Study
4.3. Cross-Field Validation Using Paired 3T/7T Datasets
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Glossary of Key Acronyms
| Acronym | Full Name | Brief Description |
|---|---|---|
| CHARMS | CNN-Transformer Hybrid with Attention Regularization for MRI Super-Resolution | The proposed lightweight model for MRI super-resolution, combining CNN and Transformer elements with attention regularization. |
| RRAF | Reverse Residual Attention Fusion | Backbone module for hierarchical local feature extraction, integrating residual learning and attention. |
| RLFE | Residual Local Feature Extraction | Units within RRAF blocks, consisting of convolutions, ReLU activations, and ESA for feature encoding. |
| ESA | Enhanced Spatial Attention | Spatial attention operator that highlights high-frequency regions using dilated convolutions. |
| PCA | Pixel–Channel Attention | Mechanism merging pixel- and channel-level attention for fine-grained feature recalibration (sketched in code after this table). |
| MDDTA | Multi-Depthwise Dilated Transformer Attention | Transformer block for efficient long-range dependency modeling with linear complexity. |
| GDDFN | Gated Depthwise Dilated Feed-Forward Network | Feed-forward component in the Transformer module, enhancing nonlinearity via gated convolutions. |
| HFIR | High-Frequency Information Refinement | Refinement module applied after upsampling to restore high-frequency details and suppress artifacts. |
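To make the PCA entry concrete, below is a minimal PyTorch sketch of a pixel–channel attention block that combines SE-style channel attention in the spirit of [7] with PAN-style pixel attention [12]. The reduction ratio, layer sizes, and fusion order are illustrative assumptions, not the exact CHARMS configuration.

```python
import torch
import torch.nn as nn

class PixelChannelAttention(nn.Module):
    """Illustrative pixel-channel attention (PCA) block.

    Channel branch: squeeze-and-excitation style gating (global pooling
    plus a bottleneck MLP). Pixel branch: one sigmoid-gated attention
    weight per spatial location via a 1x1 convolution.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze H x W -> 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excite
            nn.Sigmoid(),
        )
        self.pixel = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel(x)   # per-channel recalibration
        return x * self.pixel(x)  # per-pixel recalibration

feats = torch.randn(1, 48, 64, 64)                 # hypothetical feature map
print(PixelChannelAttention(48)(feats).shape)      # torch.Size([1, 48, 64, 64])
```

Applying the channel gate before the pixel gate is one plausible ordering; the ablation study below only confirms that adding PCA costs about 0.2 M parameters and yields a small PSNR gain.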
References
- Li, Y.; Sixou, B.; Peyrin, F. A Review of the Deep Learning Methods for Medical Images Super Resolution Problems. IRBM 2021, 42, 120–133. [Google Scholar] [CrossRef]
- Yang, H.; Wang, Z.; Liu, X.; Li, C.; Xin, J.; Wang, Z. Deep learning in medical image super resolution: A review. Appl. Intell. 2023, 53, 20891–20916. [Google Scholar] [CrossRef]
- Ji, Z.; Zou, B.; Kui, X.; Liu, J.; Zhao, W.; Zhu, C.; Dai, P.; Dai, Y. Deep learning-based magnetic resonance image super-resolution: A survey. Neural Comput. Appl. 2024, 36, 12725–12752. [Google Scholar] [CrossRef]
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 184–199. [Google Scholar]
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar]
- Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 294–310. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
- Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Computer Vision—ECCV 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 561–580. [Google Scholar]
- Wang, Z.; Xie, X.; Yang, J.; Song, X. RA-Net: Reverse attention for generalizing residual learning. Sci. Rep. 2024, 14, 12771. [Google Scholar] [CrossRef] [PubMed]
- Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient Image Super-Resolution Using Pixel Attention. In Proceedings of the ECCV Workshops, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Liu, J.; Zhang, W.; Tang, Y.; Tang, J.; Wu, G. Residual Feature Aggregation Network for Image Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2359–2368. [Google Scholar]
- Shen, L.; Qin, F.; Zhang, Y.; Wang, Q.; Hu, B.; Zhao, W. MATNet: MRI Super-Resolution with Multiple Attention Mechanisms. In Proceedings of the 2023 International Conference on New Trends in Computational Intelligence (NTCI), Qingdao, China, 3–5 November 2023; Volume 1, pp. 171–180. [Google Scholar]
- Wood, D.A.; Kafiabadi, S.; Busaidi, A.A.; Guilhem, E.; Montvila, A.; Lynch, J.; Townend, M.; Agarwal, S.; Mazumder, A.; Barker, G.J.; et al. Accurate brain-age models for routine clinical MRI examinations. Neuroimage 2022, 249, 118871. [Google Scholar] [CrossRef] [PubMed]
- Glasser, M.F.; Smith, S.M.; Marcus, D.S.; Andersson, J.L.; Auerbach, E.J.; Behrens, T.E.; Coalson, T.S.; Harms, M.P.; Jenkinson, M.; Moeller, S.; et al. The Human Connectome Project’s neuroimaging approach. Nat. Neurosci. 2016, 19, 1175–1187. [Google Scholar] [CrossRef]
- Chu, L.; Ma, B.; Dong, X.; He, Y.; Che, T.; Zeng, D.; Zhang, Z.; Li, S. A paired dataset of multi-modal MRI at 3 Tesla and 7 Tesla with manual hippocampal subfield segmentations. Sci. Data 2025, 12, 260. [Google Scholar] [CrossRef]
- Chen, X.; Qu, L.; Xie, Y.; Ahmad, S.; Yap, P.T. A paired dataset of T1- and T2-weighted MRI at 3 Tesla and 7 Tesla. Sci. Data 2023, 10, 489. [Google Scholar] [CrossRef]
- Chen, J.; Sasaki, H.; Lai, H.; Su, Y.; Liu, J.; Wu, Y.; Zhovmer, A.; Combs, C.A.; Rey-Suarez, I.; Chang, H.Y.; et al. Three-dimensional residual channel attention networks denoise and sharpen fluorescence microscopy image volumes. Nat. Methods 2021, 18, 678–687. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.-S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
- Liu, J.; Tang, J.; Wu, G. Residual Feature Distillation Network for Lightweight Image Super-Resolution. In Computer Vision—ECCV 2020 Workshops; Springer International Publishing: Cham, Switzerland, 2020; pp. 41–55. [Google Scholar]
- Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Wang, C.; Feng, J. Improving Convolutional Networks with Self-Calibrated Convolutions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10093–10102. [Google Scholar]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Li, G.; Lv, J.; Tian, Y.; Dou, Q.; Wang, C.; Xu, C.; Qin, J. Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20604–20613. [Google Scholar]
- Mahapatra, D.; Ge, Z. MR Image Super Resolution by Combining Feature Disentanglement CNNs and Vision Transformers. In Proceedings of the 5th International Conference on Medical Imaging with Deep Learning, Zurich, Switzerland, 6–8 July 2022; Ender, K., Bjoern, M., Archana, V., Christian, B., Qi, D., Shadi, A., Eds.; PMLR: Proceedings of Machine Learning Research: Cambridge, MA, USA, 2022; Volume 172, pp. 858–878. [Google Scholar]
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE Computer Society: Washington, DC, USA, 2022; pp. 5718–5729. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the ICCVW, Montreal, BC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 1833–1844. [Google Scholar]
- Liu, Y.; Zhang, M.; Zhang, W.; Hou, B.; Liu, D.; Lian, H.; Jiang, B. Flexible Alignment Super-Resolution Network for Multi-Contrast MRI. arXiv 2022, arXiv:2210.03460. [Google Scholar]
- Wang, J.; Zou, Y.; Wu, H. Image super-resolution method based on attention aggregation hierarchy feature. Vis. Comput. 2023, 40, 2655–2666. [Google Scholar] [CrossRef]
- Lyu, J.; Li, G.; Wang, C.; Cai, Q.; Dou, Q.; Zhang, D.; Qin, J. Multicontrast MRI Super-Resolution via Transformer-Empowered Multiscale Contextual Matching and Aggregation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 12004–12014. [Google Scholar] [CrossRef]
- Huang, S.; Liu, X.; Tan, T.; Hu, M.; Wei, X.; Chen, T.; Sheng, B. TransMRSR: Transformer-based self-distilled generative prior for brain MRI super-resolution. Vis. Comput. 2023, 39, 3647–3659. [Google Scholar] [CrossRef]
- He, K.; Gan, C.; Li, Z.; Rekik, I.; Yin, Z.; Ji, W.; Gao, Y.; Wang, Q.; Zhang, J.; Shen, D. Transformers in medical image analysis. Intell. Med. 2023, 3, 59–78. [Google Scholar] [CrossRef]
- Meng, F.; Guo, Y.; He, W.; Xu, Z. Ultra-Low Field Magnetic Resonance Image Enhancement based on Deep-Learning Method. In Proceedings of the 2023 2nd International Conference on Health Big Data and Intelligent Healthcare (ICHIH), Zhuhai, China, 27–29 October 2023; pp. 39–42. [Google Scholar]
- Bischoff, L.M.; Peeters, J.M.; Weinhold, L.; Krausewitz, P.; Ellinger, J.; Katemann, C.; Isaak, A.; Weber, O.M.; Kuetting, D.; Attenberger, U.; et al. Deep Learning Super-Resolution Reconstruction for Fast and Motion-Robust T2-weighted Prostate MRI. Radiology 2023, 308, e230427. [Google Scholar] [CrossRef]
- Yang, G.; Zhang, L.; Zhou, M.; Liu, A.; Chen, X.; Xiong, Z.; Wu, F. Model-Guided Multi-Contrast Deep Unfolding Network for MRI Super-resolution Reconstruction. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 3974–3982. [Google Scholar]
- Wang, H.; Hu, X.; Zhao, X.; Zhang, Y. Wide Weighted Attention Multi-Scale Network for Accurate MR Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 962–975. [Google Scholar] [CrossRef]
- Zhao, C.; Dewey, B.E.; Pham, D.L.; Calabresi, P.A.; Reich, D.S.; Prince, J.L. SMORE: A Self-Supervised Anti-Aliasing and Super-Resolution Algorithm for MRI Using Deep Learning. IEEE Trans. Med. Imaging 2021, 40, 805–817. [Google Scholar] [CrossRef]
- Rakotonirina, N.C.; Rasoanaivo, A. ESRGAN+: Further Improving Enhanced Super-Resolution Generative Adversarial Network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3637–3641. [Google Scholar]
- Li, Y.; Yang, M.; Bian, T.; Wu, H. MRI Super-Resolution Analysis via MRISR: Deep Learning for Low-Field Imaging. Information 2024, 15, 655. [Google Scholar] [CrossRef]
- Wang, R.; Cao, Z.; He, Y.; Liu, J.; Shi, F.; Shen, D. Clinical Brain MRI Super-Resolution with 2D Slice-Wise Diffusion Model; Springer Nature: Cham, Switzerland, 2025; pp. 166–176. [Google Scholar]
- Zhao, K.; Pang, K.; Hung, A.L.Y.; Zheng, H.; Yan, R.; Sung, K. MRI Super-Resolution With Partial Diffusion Models. IEEE Trans. Med. Imaging 2025, 44, 1194–1205. [Google Scholar] [CrossRef] [PubMed]
- Safari, M.; Wang, S.; Eidex, Z.; Li, Q.; Qiu, R.L.J.; Middlebrooks, E.H.; Yu, D.S.; Yang, X. MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting. Phys. Med. Biol. 2025, 70, 125008. [Google Scholar] [CrossRef]
- Chen, W.; Wu, S.; Wang, S.; Li, Z.; Yang, J.; Yao, H.; Tian, Q.; Song, X. Multi-contrast image super-resolution with deformable attention and neighborhood-based feature aggregation (DANCE): Applications in anatomic and metabolic MRI. Med. Image Anal. 2025, 99, 103359. [Google Scholar] [CrossRef] [PubMed]
- Yoon, D.; Myong, Y.; Kim, Y.G.; Sim, Y.; Cho, M.; Oh, B.M.; Kim, S. Latent diffusion model-based MRI superresolution enhances mild cognitive impairment prognostication and Alzheimer’s disease classification. Neuroimage 2024, 296, 120663. [Google Scholar] [CrossRef]
- Georgescu, M.-I.; Ionescu, R.T.; Miron, A.-I.; Savencu, O.; Ristea, N.-C.; Verga, N.; Khan, F.S. Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; IEEE: New York, NY, USA, 2023. [Google Scholar]
- Wang, Y.; Hu, H.; Yu, S.; Yang, Y.; Guo, Y.; Song, X.; Chen, F.; Liu, Q. A unified hybrid transformer for joint MRI sequences super-resolution and missing data imputation. Phys. Med. Biol. 2023, 68, 135006. [Google Scholar] [CrossRef] [PubMed]
- Zhang, M.; Xia, C.; Tang, J.; Yao, L.; Hu, N.; Li, J.; Peng, W.; Hu, S.; Ye, Z.; Zhang, X.; et al. Evaluation of high-resolution pituitary dynamic contrast-enhanced MRI using deep learning-based compressed sensing and super-resolution reconstruction. Eur. Radiol. 2025, 35, 5922–5934. [Google Scholar] [CrossRef]
- Xiao, H.; Yang, Z.; Liu, T.; Liu, S.; Huang, X.; Dai, J. Deep learning for medical imaging super-resolution: A comprehensive review. Neurocomputing 2025, 630, 129667. [Google Scholar] [CrossRef]
- Kravchenko, D.; Isaak, A.; Mesropyan, N.; Peeters, J.M.; Kuetting, D.; Pieper, C.C.; Katemann, C.; Attenberger, U.; Emrich, T.; Varga-Szemes, A.; et al. Deep learning super-resolution reconstruction for fast and high-quality cine cardiovascular magnetic resonance. Eur. Radiol. 2025, 35, 2877–2887. [Google Scholar] [CrossRef]
- Chatterjee, S.; Sciarra, A.; Dünnwald, M.; Ashoka, A.B.T.; Vasudeva, M.G.C.; Saravanan, S.; Sambandham, V.T.; Tummala, P.; Oeltze-Jafra, S.; Speck, O.; et al. Beyond Nyquist: A Comparative Analysis of 3D Deep Learning Models Enhancing MRI Resolution. J. Imaging 2024, 10, 207. [Google Scholar] [CrossRef]
- Yang, Q.; Zhang, Y.; Chandler, D.M.; Farias, M.C.Q. SSRT: Intra- and cross-view attention for stereo image super-resolution. Multimed. Tools Appl. 2025, 84, 22917–22945. [Google Scholar] [CrossRef]
- Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
- Singh, V.; Ramnath, K.; Mittal, A. Refining high-frequencies for sharper super-resolution and deblurring. Comput. Vis. Image Underst. 2020, 199, 103034. [Google Scholar] [CrossRef]
- Pham, C.-H.; Tor-Díez, C.; Meunier, H.; Bednarek, N.; Fablet, R.; Passat, N.; Rousseau, F. Multiscale brain MRI super-resolution using deep 3D convolutional networks. Comput. Med. Imaging Graph. 2019, 77, 101647. [Google Scholar] [CrossRef]
- Kong, F.; Li, M.; Liu, S.; Liu, D.; He, J.; Bai, Y.; Chen, F.; Fu, L. Residual Local Feature Network for Efficient Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 765–775. [Google Scholar]
- Tustison, N.J.; Avants, B.B.; Cook, P.A.; Zheng, Y.; Egan, A.; Yushkevich, P.A.; Gee, J.C. N4ITK: Improved N3 bias correction. IEEE Trans. Med. Imaging 2010, 29, 1310–1320. [Google Scholar] [CrossRef] [PubMed]
- Hoopes, A.; Mora, J.S.; Dalca, A.V.; Fischl, B.; Hoffmann, M. SynthStrip: Skull-stripping for any brain image. Neuroimage 2022, 260, 119474. [Google Scholar] [CrossRef] [PubMed]
- Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Duchi, J.C.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
- Smith, L.N. Cyclical Learning Rates for Training Neural Networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472. [Google Scholar]
- Masters, D.; Luschi, C. Revisiting Small Batch Training for Deep Neural Networks. arXiv 2018, arXiv:1804.07612. [Google Scholar] [CrossRef]
- Bengio, Y. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade, 2nd ed.; Montavon, G., Orr, G.B., Müller, K.-R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 437–478. [Google Scholar]
- Zou, W.; Ye, T.; Zheng, W.; Zhang, Y.; Chen, L.; Wu, Y. Self-Calibrated Efficient Transformer for Lightweight Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 929–938. [Google Scholar]
- Du, Z.; Liu, D.; Liu, J.; Tang, J.; Wu, G.; Fu, L. Fast and Memory-Efficient Network Towards Efficient Image Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 852–861. [Google Scholar]
- Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How Transferable Are Features in Deep Neural Networks? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R.; Hurst, R.T.; Kendall, C.B.; Gotway, M.B.; Liang, J. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging 2016, 35, 1299–1312. [Google Scholar] [CrossRef]
- Zhang, Y.; Li, K.; Li, K.; Fu, Y. MR Image Super-Resolution with Squeeze and Excitation Reasoning Attention Network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13420–13429. [Google Scholar]
- Feng, C.-M.; Yan, Y.; Fu, H.; Chen, L.; Xu, Y. Task Transformer Network for Joint MRI Reconstruction and Super-Resolution. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 307–317. [Google Scholar]
- Georgescu, M.-I.; Ionescu, R.T.; Verga, N. Convolutional Neural Networks with Intermediate Loss for 3D Super-Resolution of CT and MRI Scans. IEEE Access 2020, 8, 49112–49124. [Google Scholar] [CrossRef]
- Zhao, X.; Zhang, Y.; Zhang, T.; Zou, X. Channel Splitting Network for Single MR Image Super-Resolution. IEEE Trans. Image Process. 2019, 28, 5649–5662. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481. [Google Scholar]
- Hui, Z.; Wang, X.; Gao, X. Fast and Accurate Single Image Super-Resolution via Information Distillation Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 723–731. [Google Scholar]
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
- Iglesias, J.E.; Billot, B.; Balbastre, Y.; Magdamo, C.; Arnold, S.E.; Das, S.; Edlow, B.L.; Alexander, D.C.; Golland, P.; Fischl, B. SynthSR: A public AI tool to turn heterogeneous clinical brain scans into high-resolution T1-weighted images for 3D morphometry. Sci. Adv. 2023, 9, eadd3607. [Google Scholar] [CrossRef]
- Wang, X.; Zhang, S.; Lin, Y.; Lyu, Y.; Zhang, J. Pixel attention convolutional network for image super-resolution. Neural Comput. Appl. 2022, 35, 8589–8599. [Google Scholar] [CrossRef]
- Forigua, C.; Escobar, M.; Arbelaez, P. SuperFormer: Volumetric Transformer Architectures for MRI Super-Resolution. In Simulation and Synthesis in Medical Imaging; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
- Song, Z. REHRSeg: Unleashing the Power of Self-Supervised Super-Resolution for Resource-Efficient 3D MRI Segmentation. Neurocomputing 2025, 624, 129425. [Google Scholar] [CrossRef]
- Schlereth, M.; Schillinger, M.; Breininger, K. Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2025, Daejeon, Republic of Korea, 23–27 September 2025. [Google Scholar]
- Zhang, H.; Zhang, Y.; Wu, Q.; Wu, J.; Zhen, Z.; Shi, F.; Yuan, J.; Wei, H.; Liu, C.; Zhang, Y. Self-Supervised Arbitrary Scale Super-Resolution Framework for Anisotropic MRI. In Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), Cartagena, Colombia, 18–21 April 2023; pp. 1–5. [Google Scholar]
- Wang, X. Inter-Slice Super-Resolution of Magnetic Resonance Images by Pre-Training and Self-Supervised Fine-Tuning. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024. [Google Scholar]
- Gundogdu, B. Self-Supervised Multi-Contrast Super-Resolution for Diffusion-Weighted Prostate MRI. Magn. Reson. Med. 2024, 92, 319–331. [Google Scholar] [CrossRef] [PubMed]
- Chen, H. Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 25857–25867. [Google Scholar]
- Benisty, R. SIMPLE: Simultaneous Multi-Plane Self-Supervised Learning for Isotropic MRI Restoration from Anisotropic Data. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024. [Google Scholar]
- Remedios, S.W. Self-Supervised Super-Resolution for Anisotropic MR Images with and Without Slice Gap; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; pp. 118–128. [Google Scholar]
- Mayo, P. Physics informed guided diffusion for accelerated multi-parametric MRI reconstruction. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Daejeon, Republic of Korea, 23–27 September 2025. [Google Scholar]
- Zhang, L.; Nie, J.; Wei, W.; Zhang, Y. Unsupervised Test-Time Adaptation Learning for Effective Hyperspectral Image Super-Resolution With Unknown Degeneration. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5008–5025. [Google Scholar] [CrossRef]
- Wang, Z. One for Multiple: Physics-informed Synthetic Data Boosts Generalizable Deep Learning for Fast MRI Reconstruction. Med. Image Anal. 2025, 103, 103616. [Google Scholar] [CrossRef]
- Ren, Y.; Liu, W.; Zhou, Z.; Hu, P. PI-GNN: Physics-Informed Graph Neural Network for Super-Resolution of 4D Flow MRI. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024. [Google Scholar]
- Jacobs, L. Generalizable synthetic MRI with physics-informed convolutional networks. Med. Phys. 2023, 51, 3348–3359. [Google Scholar] [CrossRef]
- Ferdian, E.; Marlevi, D.; Schollenberger, J.; Aristova, M.; Edelman, E.R. Cerebrovascular super-resolution 4D Flow MRI: Sequential combination of resolution enhancement by deep learning and physics-informed image processing. Med. Image Anal. 2023, 88, 102831. [Google Scholar] [CrossRef]
- Donnay, C.; Okar, S.V.; Tsagkas, C.; Gaitán, M.I.; Poorman, M.; Reich, D.S.; Nair, G. Super resolution using sparse sampling at portable ultra-low field MR. Front. Neurol. 2024, 15, 1330203. [Google Scholar] [CrossRef] [PubMed]
- Man, C.; Lau, V.; Su, S.; Zhao, Y.; Xiao, L.; Ding, Y.; Leung, G.K.K.; Leong, A.T.L.; Wu, E.X. Deep learning enabled fast 3D brain MRI at 0.055 tesla. Sci. Adv. 2023, 9, eadi9327. [Google Scholar] [CrossRef]
- Lau, V.; Xiao, L.; Zhao, Y.; Su, S.; Ding, Y.; Man, C.; Wang, X.; Tsang, A.; Cao, P.; Lau, G.K.K.; et al. Pushing the limits of low-cost ultra-low-field MRI by dual-acquisition deep learning 3D superresolution. Magn. Reson. Med. 2023, 90, 400–416. [Google Scholar] [CrossRef] [PubMed]
- Islam, K.T.; Zhong, S.; Zakavi, P.; Rawluk, N.; Hill, M.; Spreeuwers, L.; Mateen, F.; Sher, I.; McCreary, C.; Frayne, R. Improving portable low-field MRI image quality through image-to-image translation using paired low- and high-field images. Sci. Rep. 2023, 13, 20848. [Google Scholar] [CrossRef] [PubMed]
- Iglesias, J.E.; Schleicher, R.; Laguna, S.; Billot, B.; Schaefer, P.; McKaig, B.; Goldstein, J.N.; Sheth, K.N.; Rosen, M.S.; Kimberly, W.T. Quantitative brain morphometry of portable low-field-strength MRI using super-resolution machine learning. Radiology 2023, 306, e220522. [Google Scholar] [CrossRef]
- Samarasinghe, D.; Wickramasinghe, D.; Wijerathne, T.; Meedeniya, D.; Yogarajah, P. Brain Tumour Segmentation and Edge Detection Using Self-Supervised Learning. Int. J. Online Biomed. Eng. (IJOE) 2025, 21, 127–141. [Google Scholar] [CrossRef]
Datasets used for training, benchmarking, and cross-field validation (HC = healthy controls):

| Dataset | Source | Subjects | Demographics | Contrast | Resolution (mm³) |
|---|---|---|---|---|---|
| HCP-YA | [16] | 1206 HC | 22–36 years, m/f = 507/699 | T1w | 0.7 × 0.7 × 0.7 |
| | | | | T2w | 0.7 × 0.7 × 0.7 |
| IXI | [15] | 563 HC | 20–86 years, m/f = 250/313 | T1w | 0.94 × 0.94 × 1.2 |
| | | | | T2w | 0.94 × 0.94 × 1.2 |
| PTT1 | [17] | 20 HC | 18–25 years, m/f = 10/10 | T1w | 1 × 1 × 1, 0.7 × 0.7 × 0.7 |
| | | | | T2w | 0.9 × 0.9 × 1.9, 0.4 × 0.4 × 1 |
| PTT2 | [18] | 10 HC | 25–41 years, m/f = 7/3 | T1w | 0.8 × 0.8 × 0.8, 0.7 × 0.7 × 0.7 |
| | | | | T2w | 0.8 × 0.8 × 0.8, 0.7 × 0.7 × 0.7 |
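The benchmark tables below compare against a bicubic baseline at ×2/×4. As a minimal sketch of how paired LR/HR training slices are typically formed under this degradation model (whether CHARMS adds noise or k-space truncation on top of bicubic downsampling is not specified here, so this is the simplest plausible pipeline):

```python
import torch
import torch.nn.functional as F

def make_lr_hr_pair(hr_slice: torch.Tensor, scale: int = 2):
    """Simulate a LR/HR training pair by bicubic downsampling.

    hr_slice: (1, 1, H, W) normalized HR MRI slice. The bicubic
    degradation mirrors the 'Bicubic' baseline rows in the tables.
    """
    lr = F.interpolate(hr_slice, scale_factor=1 / scale,
                       mode="bicubic", align_corners=False)
    return lr.clamp(0, 1), hr_slice

hr = torch.rand(1, 1, 256, 256)        # stand-in for a preprocessed slice
lr, hr = make_lr_hr_pair(hr, scale=4)
print(lr.shape, hr.shape)              # (1, 1, 64, 64) (1, 1, 256, 256)
```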
CHARMS vs. lightweight baselines at ×2 upscaling (PSNR in dB, mean ± SD):

| Model | Parameters (Million) | Train (h) | PSNR (IXI-T1w) | SSIM (IXI-T1w) | PSNR (IXI-T2w) | SSIM (IXI-T2w) | PSNR (HCP-T1w) | SSIM (HCP-T1w) |
|---|---|---|---|---|---|---|---|---|
| Bicubic | — | — | 31.68 ± 0.55 | 0.935 ± 0.028 | 32.51 ± 0.58 | 0.946 ± 0.025 | 40.19 ± 0.48 | 0.9874 ± 0.020 |
| SRCNN | 0.82 | 0.7 | 33.86 ± 0.42 | 0.961 ± 0.018 | 35.33 ± 0.45 | 0.969 ± 0.016 | 43.80 ± 0.38 | 0.9935 ± 0.014 |
| VDSR | 0.37 | 3.0 | 35.43 ± 0.35 | 0.968 ± 0.015 | 36.74 ± 0.38 | 0.963 ± 0.013 | 44.77 ± 0.32 | 0.9936 ± 0.011 |
| EDSR | 1.36 | 4.7 | 36.13 ± 0.28 | 0.971 ± 0.012 | 38.07 ± 0.30 | 0.979 ± 0.010 | 46.11 ± 0.25 | 0.9957 ± 0.010 |
| PAN | 0.78 | 5.0 | 36.38 ± 0.27 | 0.970 ± 0.013 | 37.79 ± 0.27 | 0.978 ± 0.011 | 45.71 ± 0.21 | 0.9954 ± 0.008 |
| W2AMSN-S | 11.37 | 14 | 37.12 ± 0.22 | 0.972 ± 0.009 | 38.30 ± 0.26 | 0.980 ± 0.008 | 46.43 ± 0.20 | 0.9961 ± 0.007 |
| FMEN | 3.80 | 11 | 37.22 ± 0.16 | 0.973 ± 0.008 | 38.40 ± 0.17 | 0.981 ± 0.007 | 46.58 ± 0.14 | 0.9963 ± 0.007 |
| CHARMS | 1.74 | 10 | 37.79 ± 0.11 | 0.973 ± 0.008 | 38.56 ± 0.11 | 0.981 ± 0.006 | 46.58 ± 0.10 | 0.9963 ± 0.006 |
CHARMS vs. lightweight baselines at ×4 upscaling (PSNR in dB, mean ± SD):

| Model | Parameters (Million) | Train (h) | PSNR (IXI-T1w) | SSIM (IXI-T1w) | PSNR (IXI-T2w) | SSIM (IXI-T2w) | PSNR (HCP-T1w) | SSIM (HCP-T1w) |
|---|---|---|---|---|---|---|---|---|
| Bicubic | — | — | 26.17 ± 0.58 | 0.786 ± 0.030 | 26.92 ± 0.58 | 0.826 ± 0.025 | 31.57 ± 0.48 | 0.924 ± 0.024 |
| SRCNN | 0.82 | 0.5 | 29.18 ± 0.50 | 0.888 ± 0.028 | 30.63 ± 0.50 | 0.873 ± 0.022 | 33.96 ± 0.39 | 0.948 ± 0.021 |
| VDSR | 0.37 | 1.5 | 30.49 ± 0.45 | 0.909 ± 0.020 | 32.44 ± 0.46 | 0.890 ± 0.020 | 34.60 ± 0.36 | 0.954 ± 0.018 |
| EDSR | 1.51 | 2.3 | 31.48 ± 0.36 | 0.931 ± 0.018 | 32.53 ± 0.38 | 0.948 ± 0.012 | 36.10 ± 0.33 | 0.965 ± 0.011 |
| PAN | 0.92 | 2.7 | 31.76 ± 0.28 | 0.926 ± 0.016 | 32.27 ± 0.28 | 0.944 ± 0.011 | 35.90 ± 0.28 | 0.964 ± 0.009 |
| W2AMSN-S | 11.41 | 9.0 | 32.61 ± 0.22 | 0.939 ± 0.016 | 32.76 ± 0.11 | 0.953 ± 0.009 | 36.56 ± 0.20 | 0.968 ± 0.008 |
| FMEN | 3.95 | 7.0 | 32.72 ± 0.20 | 0.944 ± 0.010 | 32.92 ± 0.20 | 0.956 ± 0.007 | 36.81 ± 0.18 | 0.969 ± 0.007 |
| CHARMS | 1.89 | 7.0 | 33.27 ± 0.14 | 0.945 ± 0.010 | 32.97 ± 0.15 | 0.956 ± 0.007 | 36.65 ± 0.11 | 0.969 ± 0.007 |
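The PSNR/SSIM entries in the two benchmark tables can be computed with standard library routines. A minimal scikit-image sketch follows; the evaluation details (brain masking, slice selection, data range) are assumptions of this example rather than the paper's exact protocol:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical reference and super-resolved slices, normalized to [0, 1].
ref = np.random.rand(256, 256).astype(np.float32)
sr = np.clip(ref + 0.01 * np.random.randn(256, 256).astype(np.float32), 0, 1)

psnr = peak_signal_noise_ratio(ref, sr, data_range=1.0)
ssim = structural_similarity(ref, sr, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```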
Ablation study at ×2 and ×4 (components added cumulatively; parameters in millions):

| Scale | Metric | Baseline | +CS | +CS +PCA | +CS +PCA +Transformer | Full Model |
|---|---|---|---|---|---|---|
| ×2 | Parameters (M) | 1.46 | 1.49 | 1.69 | 1.72 | 1.74 |
| | PSNR/SSIM (IXI-T1w) | 35.12/0.967 | 36.19/0.973 | 36.27/0.973 | 36.28/0.973 | 36.29/0.973 |
| | PSNR/SSIM (IXI-T2w) | 37.38/0.973 | 38.43/0.979 | 38.50/0.981 | 38.51/0.981 | 38.56/0.981 |
| | PSNR/SSIM (HCP-T1w) | 45.63/0.989 | 46.44/0.995 | 46.56/0.996 | 46.56/0.996 | 46.58/0.996 |
| ×4 | Parameters (M) | 1.61 | 1.64 | 1.84 | 1.87 | 1.89 |
| | PSNR/SSIM (IXI-T1w) | 28.66/0.869 | 29.37/0.889 | 29.42/0.890 | 29.44/0.890 | 29.47/0.891 |
| | PSNR/SSIM (IXI-T2w) | 29.56/0.896 | 30.39/0.909 | 30.41/0.916 | 30.44/0.916 | 30.47/0.916 |
| | PSNR/SSIM (HCP-T1w) | 35.81/0.946 | 36.53/0.967 | 36.60/0.968 | 36.63/0.968 | 36.65/0.969 |
Cross-field validation, PSNR (dB) of native 3T and super-resolved (SR) images against the paired 7T reference:

| Subject | 3T (T1w) | SR (T1w) | 3T (T2w) | SR (T2w) |
|---|---|---|---|---|
| 1 | 27.310 | 33.330 | 27.359 | 28.475 |
| 2 | 28.639 | 34.660 | 32.265 | 38.286 |
| 3 | 27.927 | 33.947 | 31.938 | 37.958 |
| 4 | 26.724 | 32.744 | 30.251 | 36.271 |
| 5 | 27.174 | 33.195 | 30.546 | 36.566 |
| 6 | 27.752 | 33.773 | 31.982 | 38.003 |
| 7 | 27.717 | 33.737 | 31.956 | 37.976 |
| 8 | 27.622 | 33.643 | 30.472 | 36.493 |
| 9 | 28.394 | 34.415 | 31.208 | 37.229 |
| 10 | 27.213 | 33.234 | 31.952 | 37.973 |
| Mean | 27.647 | 33.668 | 30.993 | 36.523 |
| Std | 0.579 | 0.579 | 1.476 | 2.923 |
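The Mean/Std summary rows are straightforward to reproduce; for instance, for the 3T (T1w) column above (the reported Std matches the sample standard deviation, i.e., ddof = 1):

```python
import numpy as np

# 3T (T1w) per-subject PSNR values from the table above.
psnr_3t_t1w = np.array([27.310, 28.639, 27.927, 26.724, 27.174,
                        27.752, 27.717, 27.622, 28.394, 27.213])

print(f"Mean = {psnr_3t_t1w.mean():.3f}")       # 27.647
print(f"Std  = {psnr_3t_t1w.std(ddof=1):.3f}")  # 0.579
```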
Cross-field validation, SSIM of native 3T and super-resolved (SR) images against the paired 7T reference:

| Subject | 3T (T1w) | SR (T1w) | 3T (T2w) | SR (T2w) |
|---|---|---|---|---|
| 1 | 0.821 | 0.938 | 0.726 | 0.727 |
| 2 | 0.870 | 0.954 | 0.872 | 0.948 |
| 3 | 0.830 | 0.945 | 0.851 | 0.933 |
| 4 | 0.807 | 0.933 | 0.783 | 0.890 |
| 5 | 0.832 | 0.947 | 0.822 | 0.907 |
| 6 | 0.844 | 0.949 | 0.829 | 0.920 |
| 7 | 0.816 | 0.944 | 0.816 | 0.914 |
| 8 | 0.836 | 0.949 | 0.817 | 0.911 |
| 9 | 0.854 | 0.951 | 0.840 | 0.925 |
| 10 | 0.818 | 0.939 | 0.825 | 0.914 |
| Mean | 0.833 | 0.945 | 0.818 | 0.899 |
| Std | 0.019 | 0.007 | 0.040 | 0.062 |
| Subject | 3T (T1w) | 7T (T1w) | SR (T1w) | 3T (T2w) | 7T (T2w) | SR (T2w) |
|---|---|---|---|---|---|---|
| 1 | 8.007 | 9.013 | 8.557 | 4.301 | 5.983 | 5.481 |
| 2 | 8.887 | 10.792 | 10.7 | 5.695 | 7.331 | 6.689 |
| 3 | 8.478 | 10.368 | 9.628 | 5.353 | 6.501 | 5.956 |
| 4 | 7.451 | 7.458 | 7.588 | 4.119 | 5.071 | 4.542 |
| 5 | 7.995 | 7.576 | 7.963 | 4.133 | 5.497 | 5.105 |
| 6 | 8.311 | 10.153 | 9.375 | 5.147 | 6.368 | 5.791 |
| 7 | 8.159 | 9.771 | 8.879 | 4.961 | 6.285 | 5.779 |
| 8 | 8.055 | 9.277 | 8.642 | 4.652 | 6.174 | 5.725 |
| 9 | 8.616 | 10.729 | 10.301 | 5.466 | 6.888 | 6.522 |
| 10 | 8.001 | 8.791 | 8.305 | 4.192 | 5.878 | 5.389 |
| Mean | 8.196 | 9.393 | 8.994 | 4.802 | 6.198 | 5.697 |
| Std | 0.401 | 1.201 | 1.003 | 0.601 | 0.652 | 0.631 |
| Subject | 3T (T1w) | 7T (T1w) | SR (T1w) | 3T (T2w) | 7T (T2w) | SR (T2w) |
|---|---|---|---|---|---|---|
| 1 | 1.916 | 1.915 | 2.081 | 1.551 | 2.823 | 2.357 |
| 2 | 3.193 | 3.537 | 3.46 | 2.755 | 4.676 | 3.646 |
| 3 | 2.559 | 2.557 | 2.492 | 2.431 | 3.321 | 2.520 |
| 4 | 1.234 | 1.451 | 1.205 | 1.042 | 1.623 | 1.578 |
| 5 | 1.393 | 1.679 | 1.307 | 1.298 | 1.722 | 1.742 |
| 6 | 2.476 | 2.305 | 2.42 | 2.015 | 2.185 | 2.603 |
| 7 | 1.953 | 2.285 | 2.21 | 1.986 | 3.082 | 3.398 |
| 8 | 1.923 | 2.011 | 2.173 | 1.842 | 3.001 | 2.715 |
| 9 | 2.607 | 3.421 | 2.718 | 2.549 | 4.174 | 3.416 |
| 10 | 1.775 | 1.81 | 2.025 | 1.539 | 1.642 | 2.332 |
| Mean | 2.103 | 2.297 | 2.209 | 1.901 | 2.825 | 2.631 |
| Std | 0.601 | 0.702 | 0.652 | 0.559 | 1.201 | 1.066 |