Article

Transform Domain Based GAN with Deep Multi-Scale Features Fusion for Medical Image Super-Resolution

1 Department of Information Engineering, City University of Wuhan, Wuhan 430083, China
2 School of Electronics and Information Engineering, Liaoning Technical University, Huludao 125105, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3726; https://doi.org/10.3390/electronics14183726
Submission received: 17 July 2025 / Revised: 21 August 2025 / Accepted: 3 September 2025 / Published: 20 September 2025
(This article belongs to the Special Issue New Trends in AI-Assisted Computer Vision)

Abstract

High-resolution (HR) medical images provide clearer anatomical details and facilitate early disease diagnosis, yet acquiring HR scans is often limited by imaging conditions, device capabilities, and patient factors. We propose a transform-domain deep multi-scale feature fusion generative adversarial network (MSFF-GAN) for medical image super-resolution (SR). Combining the advantages of generative adversarial networks (GANs) and convolutional neural networks (CNNs), MSFF-GAN integrates a deep multi-scale convolutional network into the GAN generator, composed primarily of a series of cascaded multi-scale feature extraction blocks that restore medical images in a coarse-to-fine manner. Two tailored blocks are designed: a multi-scale information distillation (MSID) block that adaptively captures long- and short-path features across scales, and a granular multi-scale (GMS) block that expands receptive fields at fine granularity to strengthen multi-scale feature extraction at reduced computational cost. Unlike conventional methods that predict HR images directly in the spatial domain, which often yield excessively smoothed outputs with missing textures, we formulate SR as the prediction of coefficients in the non-subsampled shearlet transform (NSST) domain. This transform-domain modeling enables better preservation of global anatomical structure and local texture details. The predicted coefficients are inverted to reconstruct HR images, and the transform-domain sub-bands are also fed to the discriminator to enhance its discrimination ability and improve perceptual fidelity. Extensive experiments on medical image datasets demonstrate that MSFF-GAN outperforms state-of-the-art approaches in structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR), while more effectively preserving global anatomy and fine textures. These results validate the effectiveness of combining multi-scale feature fusion with transform-domain prediction for high-quality medical image super-resolution.

1. Introduction

Computed tomography (CT) and magnetic resonance imaging (MRI) technologies enable convenient, non-invasive examinations and help doctors make diagnoses. High-resolution (HR) medical imaging reveals finer lesion structures and therefore substantially improves diagnostic accuracy and prognostic assessment. Acquiring HR medical images, however, is challenging for several reasons. In addition to technical limitations, clinical constraints related to patient health and limited scan time often impede image acquisition. For instance, patient movement caused by fatigue and physiological motions such as cardiac and respiratory cycles further degrades image quality and lowers the signal-to-noise ratio (SNR). Low-resolution images with constrained fields of view and elevated noise can obscure critical pathological features and undermine diagnostic reliability. Consequently, super-resolution (SR) reconstruction methods, which recover fine structural details and enhance clinical utility, have become a major focus of recent research [1,2,3,4].
Super-resolution (SR) is an effective approach for improving image quality and has attracted substantial interest in computer vision. Learning-based SR methods typically perform better when trained with external example pairs. Traditional SR approaches are commonly divided into reconstruction-based methods [5,6] and shallow learning techniques [7,8,9,10,11]. Reconstruction-based SR methods usually require explicit prior information to constrain the reconstruction results [12], which entails a substantial amount of computation, difficult optimization, and significant time consumption. Shallow learning SR techniques learn representations that relate Low-Resolution (LR) images to their HR counterparts to guide reconstruction, yet the feature extraction and representational power of these learned models remain constrained. Numerous works report that deep learning (DL) achieves superior results compared with conventional machine learning on problems such as image classification [13] and object detection [14]. With advances in computing resources and the availability of massive training data, DL-based SR methods [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34] have markedly enhanced the fidelity of reconstructed high-resolution images. These methods build a variety of network models with excellent reconstruction performance, tailored to specific SR needs, on top of CNNs, ResNet [35], GANs [36], or other network structures. The GAN model [36] offers a fresh concept for image generation and thus a model foundation for high-resolution image generation. SRGAN [30] was the first study to apply the GAN model to SR reconstruction, producing images with improved visual quality and more accurate high-frequency (HF) features. However, the relatively simple design of the SRGAN generator often yields insufficient features, which limits reconstruction quality. To boost super-resolution performance across scales, researchers have since proposed various GAN-based and deep convolutional network techniques [31,32,33,34,35,36,37,38].
Multi-scale features have been widely used in deep learning [24,39,40,41,42] and yield excellent performance: their inherent ability to extract information at multiple scales provides effective representations for a variety of visual tasks. However, all of the aforementioned methods perform the image SR task in the spatial domain, where the output is typically too smooth and lacks texture detail. By contrast, representations in the transform domain retain contextual and texture information at different levels, resulting in better SR results. Medical images, moreover, contain more texture and stronger correlations than natural images, so higher resolution is required to achieve more accurate matching, detection, and segmentation. For medical image SR, we therefore present a deep multi-scale feature fusion generative adversarial network (MSFF-GAN) operating in the transform domain. The main contributions are as follows:
  • We designed two types of multi-scale feature extraction blocks (the MSID block and the GMS block). The MSID block can exploit the potential features from medical images by adaptively detecting the long- and short-path features at different scales, and the GMS block can achieve greater multi-scale feature extraction capabilities with less computing load at the granular level by expanding the range of receptive fields for each network layer.
  • We conducted experiments on our medical image dataset, which demonstrates that our method can achieve higher PSNR/SSIM values and preserve global topological structure and local texture detail more effectively than existing state-of-the-art methods.
The rest of this paper is structured as follows. Section 2 reviews related work, Section 3 details our method, Section 4 presents the experiments, and Section 5 concludes the paper.

2. Related Work

In this section, we review the research developments of CNN-based SR, GAN-based SR, and medical image SR methods, respectively.

2.1. CNN-Based SR

From the initial SRCNN [17], through SR networks built on dedicated feature extraction modules [20,21,22,23], to the latest attention-based SR learning [29], CNN-based models have greatly improved SR quality and are noticeably superior to traditional SR approaches. The performance of SRCNN was hindered by its shallow structure. Kim et al. proposed the deeper VDSR model [19], arguing that increasing network depth leads to improved performance. Later, highly complex models such as RCAN [21] were put forward and perform admirably on the SR problem. Models that incorporate dense connections, such as MemNet [20], have been introduced to further boost SR performance. More efficient CNN-based SR methods, including IDN [23] and MSRN [24], build the network by chaining similar feature extraction modules, which highlights the capability of each block. Pesavento et al. [29] propose the AMRSR network, which improves performance by using multiple reference images and attention-based sampling. Attention-based SR network models have also been proposed to efficiently handle image SR. For example, Du et al. [27] present a MARP network for image SR that better bridges the resolution gap between LR and reference images and enhances texture details by using attention and residual pooling. Xie et al. [28] introduce a multi-range attention transformer architecture for image SR that adaptively varies attention ranges to model local details and sparse global context, enriching the diversity and representational power of extracted features.
Multi-scale architectures [24,41,42] consistently deliver strong performance on tasks including semantic segmentation, object detection, and image classification. Li et al. [24] propose a multi-scale residual network for image super-resolution that adaptively captures image features across scales. Wang et al. [41] develop a deep multi-scale network for medical image SR that improves the representation of overall topology and local texture in HR medical scans. Liu et al. [42] develop an integrated position coding to improve multi-scale implicit learning in image SR and successfully recover low-resolution images. To obtain abundant attention maps at different granularity levels, Wang et al. [43] propose a multi-scale attention network that corrects large kernel attention with multi-scale and gate schemes; by jointly aggregating global and local information, this network enhances the performance of convolutional SR networks. However, the tremendous computational load brought on by their numerous parameters is a weakness shared by these multi-scale networks. Gao et al. [44] proposed Res2Net to address this issue: by creating hierarchical residual-like connections within individual residual blocks, it outperforms current state-of-the-art baselines. In this study, we adopt this improved Res2Net block and plug it into our network for medical image SR.
The techniques described above carry out image SR in the spatial domain, but they frequently produce extremely smooth output that omits textural features. On the other hand, transform-domain SR is able to preserve contextual and textural information at multiple levels, leading to improved reconstruction. In light of this, Guo et al. [45] create a deep wavelet super-resolution (DWSR) network that reconstructs HR details by predicting missing wavelet coefficients from LR inputs. Li et al. [46] further introduce a wavelet-based feature enhancement network to ensure accurate capture of both local and global textures and to mitigate distortion of high-frequency components.

2.2. GAN-Based SR

In addition to models built on CNNs, GAN-based models exhibit exceptional performance across a wide range of vision fields. They significantly enhance visual quality in super-resolution tasks, and the HF features of their outputs are more realistic than those of CNN-based models. The first study to apply the GAN model to SR reconstruction is SRGAN [30], which produced images with improved visual quality and more accurate HF features. Wang et al. [31] propose an improved SR generative adversarial network (ESRGAN). Dou et al. [32] propose PCA-SRGAN, which introduces incremental orthogonal projection discrimination in a PCA subspace for face super-resolution and improves perceptual fidelity. Zhang et al. [33] design RankSRGAN, which optimizes the generator toward non-differentiable perceptual metrics. Maeda [34] describes a method for training an unpaired SR network with generated pseudo-data pairs using a GAN, which increases the generalizability of the model. By creating a more intricate yet practical degradation model, Zhang et al. [36] successfully super-resolve both synthetic and real images with a variety of degradations. Angarano et al. [47] present an efficient GAN model for real-time SR that combines a tailored SRGAN architecture with model quantization, remaining fast and lightweight while maintaining fairly satisfactory image quality. Xiao et al. [48] address the slow sampling speed and small-step denoising issues in diffusion-model-based SR by integrating denoising diffusion models with GANs; their method models each denoising step with a GAN to achieve large-step denoising, enhancing sampling efficiency while maintaining image quality. Aloisi et al. [49] further introduce a Wavelet Diffusion-GAN for image super-resolution, performing denoising in the wavelet domain and coupling conditional diffusion with adversarial training to better preserve high-frequency details while remaining efficient. Park et al. [50] introduce a single-image SR framework aimed at perceptually faithful HR reconstruction, in which an objective estimator determines the best mix of objectives for each image patch and a generator synthesizes SR results that correspond to those targets. Duan et al. [51] propose a local implicit wavelet transformer (LIWT) to improve the reconstruction of high-frequency textures. Shrish et al. [52] present NVS-GAN, which adopts a lightweight GAN design for novel view synthesis to reduce trainable parameters while maintaining synthesis quality.

2.3. Medical Image SR

In the medical field, numerous medical image SR reconstruction techniques and applications have been developed, especially DL-based ones proposed in recent years. To solve the issue of cross-modal medical image synthesis, Huang et al. [53] suggest an innovative dual convolutional filter learning approach, which builds a closed-loop joint filter learning technique to make better use of the data while requiring less training data. Chaudhari et al. [1] propose a CNN-based SR method to produce thin-slice knee MR images from thicker input slices. Umehara et al. [2] apply SRCNN to enhance the resolution of chest CT images and achieve higher image restoration quality. To stabilize training and enhance the perceptual quality of the super-resolved results, Zhu et al. [3] present a multi-scale GAN that takes several magnification factors into account. Huang et al. [54] address SR and cross-modality medical image synthesis jointly by introducing a weakly supervised joint convolutional sparse coding method, achieving significant improvement in semantic interpretability scores. Iwamoto et al. [55] propose a novel unsupervised multimodal-priors-guided SR method for the resolution augmentation of MRI images on the basis of external learning. Pan et al. [56] design a Light-ESRGAN, which integrates a stochastic degradation modeling process and a GAN architecture to significantly enhance cell SR reconstruction. Nimitha et al. [57] propose a multi-slice MRI SR method that combines a GAN with a pre-trained slice interpolation network to achieve higher-quality MRI slices. Liu et al. [58] propose a multi-scale attention-guided progressive aggregation network (MAPANet), aiming to utilize multi-scale and appropriate non-local information to promote SR reconstruction. For multimodal medical image SR, Dharejo et al. [4] suggest a GAN with deep multi-attention modules and HF information learned via WT to produce promising results. Wei et al. [59] introduce a misalignment-robust deep unfolding network (MAR-DUN) to enhance multi-modal MRI SR reconstruction. Ji et al. [60] propose Deform-Mamba, an architecture that integrates local and global features across multiple scales to better capture image information and enhance MRI resolution.
Medical images indeed have more texture and stronger correlations than natural images, and thus more accurate feature extraction is required. In this work, we therefore use a new GAN-based network with multi-scale feature extraction in the transform domain for medical image SR.

3. Proposed Method

In this section, Section 3.1 overviews the network architecture, Section 3.2 details the MSFF-GAN structure, Section 3.3 describes the loss functions, and Section 3.4 introduces non-subsampled shearlet transform (NSST) prediction in the generator. Unlike conventional multi-scale or multi-modal SR frameworks that simply integrate existing feature-extraction modules, MSFF-GAN adopts a targeted design that combines two complementary multi-scale feature-extraction blocks, MSID and GMS, with transform-domain reconstruction via NSST. MSID captures both long- and short-path dependencies across scales, while GMS efficiently enlarges the receptive field. By operating in the transform domain, the proposed method mitigates over-smoothing and texture loss, a benefit supported by multi-scale signal representation theory and frequency-domain analysis, and leads to improved preservation of fine structural details.

3.1. Overview

We propose a new GAN-based medical image SR approach with transform-domain prediction, which fundamentally differs from native HR imaging by reconstructing enhanced details from LR inputs rather than capturing them directly from imaging devices. Our method places a series of cascaded deep multi-scale feature extraction (MSFE) blocks in the GAN generator, a mechanism for extracting features from images at multiple scales that captures both global and local information, forming MSFF-GAN to exploit abundant and latent features of medical images. Figure 1 depicts the pipeline of our network architecture. We first apply the NSST, a multi-scale and multi-directional image decomposition that combines the advantages of non-subsampled pyramid decomposition for multi-scale analysis and the shearlet transform for multi-directional representation, to the LR image, obtaining one LF sub-band and a series of HF sub-bands. The LF sub-band preserves the global topology information, while the HF sub-bands capture structure and texture information. These sub-bands are sent together to the MSFF-GAN generator to predict the transform coefficients of the generated HR image, which is then obtained via the inverse transform. After that, the sub-bands of the HR image and the generated sub-bands are fed jointly to the MSFF-GAN discriminator to distinguish real from generated samples; using sub-bands as the discriminator input promotes its discrimination ability better than spatial-domain images. Our model's total loss function L_Total^SR is defined as the weighted sum of the individual loss functions:
L_{Total}^{SR} = \alpha_1 L_{Con}^{SR} + \alpha_2 L_{Gen}^{SR} + \alpha_3 L_{TV}^{SR},
where α_i, i = 1, 2, 3, are the weighting parameters. L_Con^SR denotes the content loss, which is the most popular optimization target for image SR and the foundation of many cutting-edge methods [18,20,24]. L_Gen^SR denotes the adversarial loss [30,36] of the generative network, which aims to trick the discriminator network. L_TV^SR denotes the total variation (TV) loss, used to restrain noise and encourage spatially coherent solutions [6,30,61].

3.2. Multi-Scale Features Fusion Generative Adversarial Network (MSFF-GAN) Structure

For the medical image SR task, our aim is to reconstruct the SR image I^SR from the input LR image I^LR, where I^LR is the LR version of its HR counterpart I^HR. The HR images are accessible only during training, and I^LR is obtained from I^HR by bicubic downsampling. We describe the LR medical image I^LR as a tensor of size W × H and the corresponding I^HR and I^SR as tensors of size rW × rH, where r denotes the upscaling factor.
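As a concrete illustration of this degradation step, the sketch below generates LR inputs from an HR tensor by bicubic downscaling; it assumes PyTorch-style tensors of shape (B, C, rH, rW) and is a minimal example, not the exact preprocessing script used in our experiments.

```python
import torch.nn.functional as F

def make_lr(hr, r):
    """Bicubic downsampling of an HR batch (B, C, rH, rW) by the upscaling factor r,
    yielding the (B, C, H, W) LR inputs used for training."""
    return F.interpolate(hr, scale_factor=1.0 / r, mode="bicubic", align_corners=False)
```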
Our final objective is to train a generating function G that predicts the HR counterpart of a given LR input. To this end, we train a generator network as a feed-forward CNN G_θG parameterized by θ_G. Here θ_G = {w_1, w_2, …, w_p, b_1, b_2, …, b_p} denotes the weights and biases of the p convolutional layers and is obtained by optimizing our model's loss function L_Total^SR. For the given N training images I_i^HR, i = 1, …, N, with corresponding I_i^LR, we solve:
\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{i=1}^{N} L_{Total}^{SR}\left( G_{\theta_G}\left( I_i^{LR} \right), I_i^{HR} \right),
where the optimization uses the ADAM optimizer with an initial learning rate of 0.0001 (halved every 50 epochs) and runs until convergence, defined as the point at which the validation-set PSNR improves by less than 0.1 dB over 20 epochs.
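A minimal sketch of this optimization schedule follows; `train_one_epoch` and `evaluate_psnr` are caller-supplied placeholders for the inner training loop and the validation routine, and `max_epochs` is an assumed upper bound rather than a value from the paper.

```python
import torch

def train_generator(G, train_one_epoch, evaluate_psnr, max_epochs=500):
    """ADAM with lr = 1e-4 halved every 50 epochs; stop when the validation PSNR
    improves by less than 0.1 dB over 20 consecutive epochs."""
    opt = torch.optim.Adam(G.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.5)
    best, stall = float("-inf"), 0
    for _ in range(max_epochs):
        train_one_epoch(G, opt)     # one pass over the training set, minimizing L_Total
        sched.step()
        psnr = evaluate_psnr(G)     # validation PSNR in dB
        if psnr > best + 0.1:
            best, stall = psnr, 0
        else:
            stall += 1
        if stall >= 20:             # PSNR plateau: stop training
            break
    return G
```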
In our study, we explicitly design the perceptual loss L_Total^SR as a weighted combination of several loss components that model various desirable characteristics of the recovered SR image. The loss function L_Total^SR is used to reduce the difference between the reconstructed image G_θG(I^LR) and I^HR. More details on each loss function are given in Section 3.3.
In accordance with SRGAN [30], we define a discriminator network D_θD, which we optimize alternately with G_θG to tackle the adversarial min-max problem:
L_1 = \mathbb{E}_{I^{HR} \sim P_{data}(I^{HR})}\left[ \log D_{\theta_D}\left( I^{HR} \right) \right],
L_2 = \mathbb{E}_{I^{LR} \sim P_G(I^{LR})}\left[ \log\left( 1 - D_{\theta_D}\left( G_{\theta_G}\left( I^{LR} \right) \right) \right) \right],
L = \min_{\theta_G} \max_{\theta_D} \left( L_1 + L_2 \right),
where P_data(I^HR) denotes the real HR sample distribution and P_G(I^LR) denotes the generator distribution [62]. The core idea is to train a generative model G to deceive a differentiable discriminator D, which is optimized to distinguish real images from super-resolved ones. This adversarial setup encourages G to generate outputs that closely resemble real images and are difficult for D to classify.
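The following sketch shows one alternating update of this min-max game, assuming PyTorch modules `G` and `D` (with the discriminator ending in a sigmoid so that its output is a probability) and their optimizers; it illustrates the adversarial setup rather than our exact training script.

```python
import torch
import torch.nn as nn

def adversarial_step(G, D, opt_G, opt_D, lr_subbands, hr_subbands):
    """One alternating update of D and G for the min-max objective above."""
    bce = nn.BCELoss()

    # Discriminator: maximize log D(real) + log(1 - D(G(LR))).
    fake = G(lr_subbands).detach()
    d_real, d_fake = D(hr_subbands), D(fake)
    loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: minimize -log D(G(LR)) (the non-saturating form used in Section 3.3).
    d_fake = D(G(lr_subbands))
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```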
The core of our generator network G consists of two parts, as illustrated in Figure 1: the shallow feature extraction (SFE) module and the deep feature extraction (DFE) module. Specifically, we use two convolution layers to extract the shallow feature M_0 in a coarse manner from the NSST sub-bands S^LR of the input medical image I^LR. Therefore, we have:
M_0 = H_{SFE1}\left( H_{SFE2}\left( S^{LR} \right) \right),
where M is the number of HF directional sub-bands from the NSST (e.g., M = 4 at scale 1 and M = 8 at scale 2), and the shallow feature M_0 ∈ R^(W×H×64) is extracted from S^LR via two 3 × 3 convolutions with 64 filters each. H_SFE1 and H_SFE2 stand for the respective convolution operations of the two layers in the SFE module. After the SFE module, the shallow feature M_0 is fed to the DFE module, which contains a set of cascaded MSFE blocks, i.e., MSID blocks or GMS blocks that will be elaborated in Section 3.2.1 and Section 3.2.2, respectively. Each MSFE block extracts multi-scale features in a fine manner. The output information is then adaptively controlled using a 1 × 1 convolutional layer. We refer to this process as feature fusion, expressed as:
M_{GF} = H_{GFF}\left( \left[ M_1, M_2, \ldots, M_C \right] \right),
where [M_1, M_2, …, M_C] denotes the concatenation of the feature maps produced by MSFE blocks 1, 2, …, C, and H_GFF is the function of the 1 × 1 convolutional layer. After feature fusion, global residual learning is used to obtain the feature maps S^SR, which can be expressed as:
S^{SR} = M_{GF} + M_0 .
To produce the reconstructed medical image I^SR through the super-resolution reconstruction workflow, we apply the inverse NSST to the sub-bands S^SR. In the proposed MSFF-GAN generator network, all convolutional layers have 64 filters and are followed by ReLU activation [63].
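The structural sketch below summarizes this generator layout (SFE, cascaded MSFE blocks, 1 × 1 feature fusion, and global residual learning). It is an illustration of the data flow, not the released implementation: `block_factory` builds one MSFE block (for instance the MSID or GMS sketches in Section 3.2.1 and Section 3.2.2), `in_ch` is the number of stacked NSST sub-band channels, and the final 3 × 3 layer mapping features back to sub-band coefficients is our assumption.

```python
import torch
import torch.nn as nn

class MSFFGenerator(nn.Module):
    def __init__(self, in_ch, block_factory, channels=64, num_blocks=10):
        super().__init__()
        self.sfe1 = nn.Conv2d(in_ch, channels, 3, padding=1)       # first SFE convolution
        self.sfe2 = nn.Conv2d(channels, channels, 3, padding=1)    # second SFE convolution
        self.blocks = nn.ModuleList(block_factory(channels) for _ in range(num_blocks))
        self.fuse = nn.Conv2d(num_blocks * channels, channels, 1)  # H_GFF: 1x1 feature fusion
        self.out = nn.Conv2d(channels, in_ch, 3, padding=1)        # back to sub-band coefficients (assumption)
        self.act = nn.ReLU(inplace=True)

    def forward(self, s_lr):
        m0 = self.act(self.sfe2(self.act(self.sfe1(s_lr))))        # shallow feature M_0
        feats, m = [], m0
        for blk in self.blocks:                                    # cascaded MSFE blocks M_1 ... M_C
            m = blk(m)
            feats.append(m)
        s_sr = self.fuse(torch.cat(feats, dim=1)) + m0             # feature fusion + global residual
        return self.out(s_sr)                                      # predicted NSST coefficients S_SR
```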
We train a discriminator network to distinguish real HR images from generated SR samples. Note that we concatenate the HR sub-bands and the generated sub-bands and feed them to the MSFF-GAN discriminator, which promotes its discrimination ability better than spatial-domain images. We use ReLU activation and avoid max-pooling throughout the network. As in the VGG network [63], the discriminator has eight convolutional layers with a number of filter kernels that increases by a factor of 2 from 64 to 512. The resulting 512 feature maps are passed through two fully connected layers and a ReLU activation to compute the probability used for sample classification.
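A sketch of this discriminator is shown below. The eight convolutional layers with filter counts doubling from 64 to 512, the absence of max-pooling, the two fully connected layers, and the final probability follow the description above; the strides, the adaptive pooling, and the fully connected width of 1024 are assumptions.

```python
import torch.nn as nn

class MSFFDiscriminator(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        cfg = [(64, 1), (64, 2), (128, 1), (128, 2), (256, 1), (256, 2), (512, 1), (512, 2)]
        layers, prev = [], in_ch
        for ch, stride in cfg:                      # eight conv layers, 64 -> 512 filters, no max-pooling
            layers += [nn.Conv2d(prev, ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True)]
            prev = ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(                  # two fully connected layers + sigmoid probability
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1), nn.Sigmoid())

    def forward(self, subbands):                    # input: concatenated HR and generated NSST sub-bands
        return self.head(self.features(subbands))
```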

3.2.1. MSID Block for MSFE

Unlike conventional multi-scale convolutional modules with fixed receptive fields, the proposed MSID block employs an adaptive long short-path fusion to capture dependencies across scales, selectively retaining informative features while suppressing noise. This approach aligns with multi-scale signal decomposition theory, balancing contextual aggregation and texture preservation. As shown in Figure 2, each MSID block comprises two parts exploiting short- and long-path properties. Unlike the IDN model [23], each part contains three bypasses with distinct convolutional kernels, enabling multi-scale detection of both short- and long-path features.
Assuming M_{d−1} and O_P1 are the first part's input and output, we have:
k_1 = \sigma\left( Y_{3 \times 3}^{2}\left( M_{d-1} \right) \right),
k_2 = \sigma\left( Y_{5 \times 5}^{3}\left( M_{d-1} \right) \right),
k_3 = \sigma\left( Y_{7 \times 7}^{4}\left( M_{d-1} \right) \right),
O_{P1} = \sigma\left( Y_{1 \times 1}^{1}\left( \left[ k_1, k_2, k_3 \right] \right) \right),
where Y^1_{1×1}, Y^2_{3×3}, Y^3_{5×5}, and Y^4_{7×7} refer to the functions of the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutional layers in the first part, [·] denotes the concatenation of the feature maps from the different convolutional kernels, and σ denotes the ReLU function [64]. The 64-dimensional feature maps of O_P1 and the input M_{d−1} are then concatenated in the channel dimension,
R = C\left( S\left( O_{P1}, 64 \right), M_{d-1} \right),
where C and S stand for the operations of concatenation and slicing, respectively. For example, for a 4 × 4 × 64 input feature M_{d−1} and 4 × 4 × 64 multi-scale features O_P1, concatenation yields a 4 × 4 × 128 feature map, from which the slicing operation S extracts the last 64 channels (4 × 4 × 64). The goal is to merge the previous information with the current multi-scale information, which can be thought of as stored short-path knowledge. The remaining 64-dimensional feature maps are then used as the input of the second part, which mainly extracts long-path information further. Let q_i be the intermediate feature maps in Equation (14), produced by the second part's convolutional layers of sizes 1 × 1, 3 × 3, 5 × 5, and 7 × 7:
q_1 = \sigma\left( Y_{3 \times 3}^{6}\left( S\left( O_{P1}, 64 \right) \right) \right),
q_2 = \sigma\left( Y_{5 \times 5}^{7}\left( S\left( O_{P1}, 64 \right) \right) \right),
q_3 = \sigma\left( Y_{7 \times 7}^{8}\left( S\left( O_{P1}, 64 \right) \right) \right),
O_{P2} = \sigma\left( Y_{1 \times 1}^{5}\left( \left[ q_1, q_2, q_3 \right] \right) \right),
where Y^5_{1×1}, Y^6_{3×3}, Y^7_{5×5}, and Y^8_{7×7} refer to the functions of the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutional layers in the second part. The final step is to aggregate the input information, short-path information, and long-path information, which can be expressed as follows:
M_d = R + O_{P2},
where M_d indicates the output of the MSID block.
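A minimal sketch of the MSID block is given below. It keeps the two-part structure (parallel 3 × 3, 5 × 5, and 7 × 7 convolutions fused by a 1 × 1 layer, a retained short path, a long path, and the final aggregation), but the channel bookkeeping of the slicing step is simplified: the concatenated short-path features are compressed back to 64 channels by an extra 1 × 1 convolution, which is our assumption rather than the exact formulation above.

```python
import torch
import torch.nn as nn

class MSIDBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # first part: parallel multi-scale convolutions fused by a 1x1 layer
        self.c3a = nn.Conv2d(channels, channels, 3, padding=1)
        self.c5a = nn.Conv2d(channels, channels, 5, padding=2)
        self.c7a = nn.Conv2d(channels, channels, 7, padding=3)
        self.fuse_a = nn.Conv2d(3 * channels, channels, 1)
        # second part: long-path multi-scale convolutions fused by a 1x1 layer
        self.c3b = nn.Conv2d(channels, channels, 3, padding=1)
        self.c5b = nn.Conv2d(channels, channels, 5, padding=2)
        self.c7b = nn.Conv2d(channels, channels, 7, padding=3)
        self.fuse_b = nn.Conv2d(3 * channels, channels, 1)
        # 1x1 compression of the concatenated short-path features (simplification)
        self.short = nn.Conv2d(2 * channels, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        k = torch.cat([self.act(self.c3a(x)), self.act(self.c5a(x)), self.act(self.c7a(x))], dim=1)
        op1 = self.act(self.fuse_a(k))                 # multi-scale features O_P1
        r = self.short(torch.cat([op1, x], dim=1))     # stored short-path knowledge R (simplified slicing)
        q = torch.cat([self.act(self.c3b(op1)), self.act(self.c5b(op1)), self.act(self.c7b(op1))], dim=1)
        op2 = self.act(self.fuse_b(q))                 # long-path features O_P2
        return r + op2                                 # aggregation M_d = R + O_P2
```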

3.2.2. GMS Block for MSFE

In this subsection, we present the GMS block [44], a variant of the MSFE block. To capture finer-scale features and enlarge the receptive field of each network layer, it is integrated into the generator as part of the sequence of cascaded MSFE blocks. The block improves MSFE performance while retaining a light computational load. In particular, it replaces a single group of 3 × 3 filters with smaller groups of filters connected in a hierarchical residual-like manner, which can collectively be referred to as a Res2Net block.
The GMS block is depicted in Figure 3. Following the 1 × 1 convolution, the feature maps are uniformly separated into s feature map subsets, denoted by X_i, where i ∈ {1, 2, …, s}. Compared with the input feature map, each feature subset X_i has the same spatial extent but 1/s of the channels. Except for X_1, each X_i has a matching 3 × 3 convolution denoted by K_i(·), whose output is denoted by F_i. The feature subset X_i is added to the output of K_{i−1}(·) and then fed into K_i(·). The 3 × 3 convolution for X_1 can be skipped to decrease parameters while increasing s. Thus F_i can be expressed as:
F_i = \begin{cases} X_i, & i = 1; \\ K_i\left( X_i \right), & i = 2; \\ K_i\left( X_i + F_{i-1} \right), & 2 < i \le s. \end{cases}
Note that each 3 × 3 convolutional operator K_i(·) can receive feature information from all feature splits X_j with j ≤ i. Therefore, the output of each 3 × 3 convolutional operator may have a larger receptive field than its input feature split X_j. Owing to this combinatorial effect, the output of the Res2Net block covers a varying number and different combinations of receptive field sizes.
Splits are handled on multiple scales in the GMS module, making it possible to extract both global and local information. All splits are concatenated and then passed through a 1 × 1 convolution to better combine information at various scales. Convolutions can be forced to process features more successfully using the split and concatenation approach, and s may also be utilized as a scale dimension control parameter. A larger s may make it possible to learn features with wider receptive field sizes while incurring minimal computational overhead from concatenation.
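The sketch below illustrates this granular split-and-fuse scheme in the Res2Net style; s = 4 is assumed, and the outer residual connection around the block follows the usual Res2Net design rather than an explicit statement in the text.

```python
import torch
import torch.nn as nn

class GMSBlock(nn.Module):
    def __init__(self, channels=64, s=4):
        super().__init__()
        assert channels % s == 0
        self.s = s
        width = channels // s
        self.reduce = nn.Conv2d(channels, channels, 1)   # 1x1 convolution before the split
        self.convs = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1) for _ in range(s - 1))
        self.expand = nn.Conv2d(channels, channels, 1)   # 1x1 convolution after concatenation
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        splits = torch.chunk(self.act(self.reduce(x)), self.s, dim=1)
        outs = [splits[0]]                               # F_1 = X_1: no 3x3 convolution for the first split
        prev = None
        for i in range(1, self.s):
            inp = splits[i] if prev is None else splits[i] + prev   # hierarchical residual-like connection
            prev = self.act(self.convs[i - 1](inp))                 # F_i = K_i(X_i + F_{i-1})
            outs.append(prev)
        return self.expand(torch.cat(outs, dim=1)) + x   # fuse all scales; outer residual (assumption)
```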

3.3. Loss Function

Our loss function L_Total^SR, shown in Formula (1), is essential for the generator network's effectiveness and thus for the SR algorithm. We define L_Total^SR = α_1 L_Con^SR + α_2 L_Gen^SR + α_3 L_TV^SR as the weighted sum of the individual loss functions with weighting factors α_i, i = 1, 2, 3, and we specifically identify the following loss components: content loss, adversarial loss, and regularization loss.
The mean square error (MSE) function is the most widely used objective optimization function in image SR [20,22,65]. However, Lim et al. [66] have empirically shown that training with the MSE loss is not a wise choice. We instead employ the mean absolute error (MAE) loss as a superior substitute for the content loss L_Con^SR, avoiding unnecessary training tricks and reducing computation. It can be expressed as:
L_{Con}^{SR} = \frac{1}{N} \sum_{i=1}^{N} \left\| G_{\theta_G}\left( I_i^{LR} \right) - I_i^{HR} \right\|_1 .
In addition to the content loss, we incorporate a generative component into our loss. By attempting to trick the discriminator network, it encourages our network to favor solutions that lie on the manifold of realistic medical images. Based on the probabilities D_θD(G_θG(I_i^LR)) of the discriminator over all training samples, the generative loss L_Gen^SR is defined as follows:
L_{Gen}^{SR} = \sum_{i=1}^{N} -\log D_{\theta_D}\left( G_{\theta_G}\left( I_i^{LR} \right) \right),
where D_θD(G_θG(I_i^LR)) is the estimated probability that the reconstructed image G_θG(I_i^LR) is a real medical HR image. Note that we minimize −log D_θD(G_θG(I_i^LR)) instead of log(1 − D_θD(G_θG(I_i^LR))) for better gradient behavior [62].
We further employ a total variation (TV) regularizer to encourage spatially coherent solutions [6,67]. The regularization loss L_TV^SR is calculated as:
L_{TV}^{SR} = \frac{1}{r^2 W H} \sum_{m=1}^{rW} \sum_{n=1}^{rH} \left\| \nabla G_{\theta_G}\left( I^{LR} \right)_{m,n} \right\|,
where ∇ denotes the gradient operator, a vector operator that describes the rate of change and direction of a function at a given point.
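The three loss terms can be summarized by the sketch below; the weighting factors passed to `total_loss` are illustrative placeholders rather than the values used in our experiments, and `d_fake` denotes the discriminator probabilities D(G(I_LR)).

```python
import torch
import torch.nn.functional as F

def content_loss(sr, hr):
    """Mean absolute error between the reconstructed and ground-truth images."""
    return F.l1_loss(sr, hr)

def generative_loss(d_fake, eps=1e-8):
    """Adversarial term -log D(G(I_LR)) averaged over the batch."""
    return -torch.log(d_fake + eps).mean()

def tv_loss(sr):
    """Total-variation regularizer encouraging spatially coherent solutions."""
    dh = sr[..., 1:, :] - sr[..., :-1, :]
    dw = sr[..., :, 1:] - sr[..., :, :-1]
    return dh.abs().mean() + dw.abs().mean()

def total_loss(sr, hr, d_fake, a1=1.0, a2=1e-3, a3=2e-8):
    """Weighted sum L_Total = a1*L_Con + a2*L_Gen + a3*L_TV (weights are placeholders)."""
    return a1 * content_loss(sr, hr) + a2 * generative_loss(d_fake) + a3 * tv_loss(sr)
```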

3.4. Non-Subsampled Shearlet Transform (NSST) Prediction

Rather than performing SR reconstruction directly in the spatial domain, where GAN-based methods often lead to over-smoothing and texture loss, we formulate the task in the transform domain using NSST. NSST provides a mathematically grounded framework to decompose images into multi-scale, multi-directional components, effectively separating high-frequency directional details from low-frequency global structures. As a result, the generator can better preserve textures and edges. By combining NSST with adversarial learning, our approach explicitly targets both global topology and fine-grained texture, achieving a balance that is difficult to obtain with purely spatial-domain methods.
In the transform domain, images can be perceived with richer global topological structure and local texture details than in the spatial domain. The most common transform-domain methods include the wavelet transform (WT), the curvelet transform, the contourlet transform, the shearlet transform, and the NSST, which offers superior performance. Wavelets can sparsely represent one-dimensional signals by smoothing out point discontinuities in digital signals. However, for two-dimensional image signals, the commonly used two-dimensional wavelets have only a limited number of directions and cannot make full use of the geometric regularity of the image itself. As a result, wavelet analysis [67] cannot "optimally" represent visual features such as straight lines and curves.
In contrast, curvelets [68] capture the anisotropic regularity of surfaces along edges but lack a true multi-resolution geometric representation. Contourlets, composed of a Laplacian pyramid (LP) and a directional filter bank (DFB) in two filter-bank stages, offer limited directional diversity. The contourlet transform (CT) [69] samples at both the LP and DFB stages, making it shift-variant, an undesirable property for many multimedia applications. To address this, Cunha et al. [70] proposed the non-subsampled contourlet transform (NSCT), a translation-invariant form of CT that removes all sub-sampling at the cost of increased redundancy. Labate et al. [71] later introduced the shearlet transform (ST), providing multi-scale, multi-resolution geometric representation via general multi-resolution analysis. Building on ST, NSST [72] was developed to overcome its limitations while preserving its advantages.
In this paper, NSST decomposition consists of two parts: multi-scale decomposition via a non-subsampled Laplacian pyramid filter (NSLPF) and multi-directional decomposition via modified shearing filters. The LF sub-band is recursively decomposed by NSLPF to capture singularities, producing k + 1 HF sub-bands and one LF sub-band. Shearing filters are mapped from the pseudo-polar grid to the Cartesian coordinate system, with all operations performed directly in the transform domain. This design avoids the negative effects of sampling operations and provides the desirable properties of multi-scale, multi-directional, and translation-invariant representation, making it particularly suited for medical image SR. As noted in the introduction, spatial-domain HR image prediction often leads to over-smoothing and texture loss. Transform-domain SR can better preserve details. In this work, we integrate NSST with MSFF-GAN to construct a medical image SR network, formulating SR as the prediction of NSST coefficients (Figure 1), enabling richer structural detail preservation compared to spatial-domain methods. In Figure 4 and Figure 5, the high-frequency coefficients of NSST and WT are compared on the lung image and the head image, respectively. Figure 4a and Figure 5a indicate the original images, Figure 4b and Figure 5b indicate the low-frequency image of NSST, and Figure 4c,d and Figure 5c,d indicate the high-frequency image of NSST at different scales and orientations.
Specifically, in our implementation, the four directions at scale 1 correspond to orientation channels centered approximately at 0°, 45°, 90°, and 135°, while the eight directions at scale 2 correspond to orientation channels centered approximately at 0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135°, and 157.5°, as determined by the directional filter banks in the NSST. Figure 4e and Figure 5e show the fused image of the high-frequency coefficients of NSST, and Figure 4f and Figure 5f show the fused image of the high-frequency coefficients of WT. Comparing Figure 4a–f and Figure 5a–f, it can be clearly observed that NSST represents texture curvature and details more accurately.
Specifically, to fully mine the LF structure and HF detail information of CT images from the LR inputs, we decompose each LR image by NSST into an LF sub-band and several HF sub-bands, which are fed into the generation network: the LF branch learns the global structural mapping from LR to HR images, while the HF branch extracts HF details such as edges and textures through several cascaded multi-scale feature extraction stages operating on different levels of features. The outputs of the LF and HF branches are fused by a matching layer placed before the output, upsampled by convolutional layers, and the resulting HR LF and HF sub-bands are converted by the inverse NSST into the reconstructed image. Similarly, the discriminator input is changed from conventional spatial-domain data to frequency-domain sub-bands: it receives both the NSST sub-bands reconstructed by the generator and the NSST sub-bands of the original HR image. NSST can be incorporated into various SR networks and is a straightforward and effective way to improve performance. Further experiments in Section 4.5.3 reveal the role of NSST. See Ref. [73] for more implementation details of NSST.
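At a high level, the transform-domain pipeline of Figure 1 can be summarized as follows. The functions `nsst_decompose` and `nsst_reconstruct` are caller-supplied wrappers around an NSST implementation such as that of Ref. [73]; they are placeholders, not a named library API.

```python
import torch

def super_resolve(lr_image, G, nsst_decompose, nsst_reconstruct):
    """Transform-domain SR sketch: decompose, predict coefficients, invert."""
    low, highs = nsst_decompose(lr_image)               # one LF sub-band + directional HF sub-bands
    subbands = torch.cat([low] + list(highs), dim=1)    # stack sub-bands as generator input channels
    pred = G(subbands)                                  # predicted NSST coefficients of the HR image
    pred_low, pred_highs = pred[:, :1], pred[:, 1:]     # split the prediction back into LF/HF parts
    return nsst_reconstruct(pred_low, pred_highs)       # inverse NSST yields the reconstructed image
```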

4. Experimental Results and Analysis

In the experiments, the effectiveness of the suggested method is assessed in terms of both qualitative and quantitative factors. We begin by outlining the training and testing datasets. The implementation process and evaluation protocol are then covered. Finally, we perform a number of ablation investigations and compare our approach to a number of cutting-edge SR approaches.

4.1. Medical Image Datasets

To build a medical image SR dataset, we collected images from four anatomical regions: brain, lung, bone, and abdomen, totaling 8000 samples (2000 per region). Brain and lung images were obtained from the Cancer Imaging Archive (TCIA) [29], while bone and abdominal images were sourced from the radiology department of a people’s hospital in Liaoning Province, China. The dataset was split into 7000 training images (1750 per region) and 1000 test images.

4.2. Implementation Details

For the 7000 training images described in Section 4.1, data augmentation is performed. Inspired by [7,8], we rotate the training images by 90, 180, and 270 degrees and horizontally flip the original and each rotated version, generating seven augmented variants in addition to each original image. Our network contains either 10 MSID blocks or 10 GMS blocks. Training used the ADAM optimizer with an initial learning rate of 0.0001, reduced by half every 50 epochs, and converged after 200 epochs. The entire training process took about 9.5 h on a single Tesla P40 GPU.
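A minimal sketch of this augmentation scheme is shown below; it assumes NumPy image arrays and returns the seven augmented variants of one input image.

```python
import numpy as np

def augment(img):
    """Rotations by 90/180/270 degrees plus a horizontal flip of the original and of each rotation."""
    variants = []
    for k in range(4):                      # 0, 90, 180, 270 degree rotations
        rot = np.rot90(img, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))     # horizontally flipped version
    return variants[1:]                     # drop the untouched copy: seven augmented variants remain
```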
In the proposed MSFF-GAN, the selection of hyper-parameters such as convolution kernel size, number of feature channels, receptive field scale parameters (s and D), learning rate, and the number of multi-scale feature-extraction blocks is guided by both prior literature and empirical tuning. Initial values are set according to commonly adopted configurations in GAN-based super-resolution networks, while adjustments are made based on validation performance and computational efficiency. For example, a kernel size of 3 × 3 is used as a baseline because it balances receptive-field coverage and computational cost; the number of feature channels is fixed at 64 to provide sufficient representation capacity without excessive memory use; and the receptive-field scale parameters are set to s = 4 and D = 10 after comparative trials show that this combination yields optimal PSNR/SSIM gains within acceptable training time. During training, we observe a gradual decrease in content loss and adversarial loss, indicating stable convergence. The total loss typically reaches a plateau at approximately 150 epochs, and validation metrics (PSNR and SSIM) consistently improve until convergence, without obvious overfitting. These observations confirm the robustness and stability of the proposed training strategy.

4.3. Evaluation Protocols

We first employ two common evaluation measures, PSNR and SSIM, for the evaluation process [74]. These two metrics are important criteria currently used to evaluate advanced methods for SR. Better performance is indicated by higher values for both criteria.
PSNR, the ratio of the maximum possible signal power to the noise power, is an objective measurement. The PSNR between a predicted image of size M × N and the ground truth is calculated as:
PSNR\left( \hat{y}, y \right) = 10 \log_{10} \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} MAX_R^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} \left( \hat{y}(i,j) - y(i,j) \right)^2},
where MAX_R is the maximum signal value.
The similarity between two images is gauged by the SSIM. Compared with PSNR, SSIM can better match human perception. The SSIM of the two images can be found as follows:
SSIM\left( X, Y \right) = \frac{\left( 2 \mu_X \mu_Y + c_1 \right)\left( 2 \sigma_{XY} + c_2 \right)}{\left( \mu_X^2 + \mu_Y^2 + c_1 \right)\left( \sigma_X^2 + \sigma_Y^2 + c_2 \right)},
where μ_X and μ_Y are the means, σ_X^2 and σ_Y^2 are the variances, σ_XY is the covariance of X and Y, and c_1 and c_2 are constants used to maintain stability. The structural similarity lies between −1 and 1, and SSIM equals 1 when there is no difference between the two images.
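Both metrics can be computed as sketched below; the PSNR follows the formula above with MAX_R as the maximum signal value, and SSIM is delegated to scikit-image, whose windowed implementation approximates the global expression given here.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(pred, gt, max_val=255.0):
    """PSNR in dB between a predicted image and its ground truth (max_val corresponds to MAX_R)."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(pred, gt, max_val=255.0):
    """SSIM computed with scikit-image."""
    return structural_similarity(pred, gt, data_range=max_val)
```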
In addition, the test results of this article are evaluated by a senior radiologist with over 10 years of imaging diagnosis experience, using the mean opinion score (MOS) [3] as the subjective evaluation criterion.

4.4. Comparison with State-of-the-Art Methods

The training process of MSFF-GAN exhibits stable convergence: loss curves decline smoothly, and evaluation metrics improve steadily until they reach a plateau. Such training dynamics indicate that the proposed architecture reliably captures multi-scale and transform-domain features, without exhibiting instability or mode collapse commonly seen in adversarial training. We evaluate the method on four medical datasets (brain, lung, bone, and abdomen) using PSNR, SSIM, and MOS as evaluation metrics. To guarantee a fair comparison, all models are trained following the same protocol on identical training data, and we adopt publicly available implementations of baseline methods. Quantitative results presented in Table 1 and Table 2 demonstrate that our method achieves superior performance across metrics and datasets, and qualitative comparisons confirm improved preservation of fine textures and anatomical boundaries in the reconstructed images.
For the MOS evaluation, 120 images were randomly sampled from the 400-image test set, each paired with its HR ground truth. A senior radiologist from a tertiary hospital scored the SR images based on image quality factors, including excessive smoothing, artifacts, texture fidelity, and signal-to-noise ratio. A four-point scale was used, where 1 indicates poor, 2 fair, 3 good, and 4 very good. The MOS was computed as the mean and standard deviation of all scores. As reported in Table 3, our method achieved the highest MOS at the ×8 scale. Visual comparisons in Figure 6 further confirm that the proposed method preserves abundant structural and textural details, producing reconstructions that are visually close to the ground truth, thereby demonstrating its strong capability for medical image SR.

4.5. Ablation Study

4.5.1. Effectiveness of MSID Block

In this subsection, we compare the MSID block with single-scale feature extraction. As shown in Table 4, we found consistent performance improvements when using the MSID block, suggesting that multi-scale extraction works better than single-scale extraction. To achieve an optimal trade-off between computational complexity and reconstruction accuracy, we choose 10 MSID blocks as the standard configuration of the MSID network.

4.5.2. Effectiveness of GMS Block

In this subsection, we examine the effects of the quantity of feature map subsets (designated as s) and the quantity of Res2Net blocks (designated as D) in our GMS. As seen in Figure 7, we found that increasing either s or D consistently results in performance gains, demonstrating that deeper is better. Similarly, we chose the s = 4 and D = 10 combination, taking the trade-off between accuracy and speed into consideration.

4.5.3. Effectiveness of NSST

To evaluate the effectiveness of NSST, we incorporate MSFF-GAN in the spatial domain and combine it with wavelet transform [51], curvelet transform [70], and the proposed NSST. The experimental results across multiple datasets are shown in Figure 8. It can be observed that integrating MSFF-GAN with NSST significantly outperforms both the spatial-domain approach and the other two transform-domain methods, with this advantage remaining consistent across different network configurations and benchmark datasets. Furthermore, as shown in Table 5, when combined with the MSID block, it achieves the highest scores across all decomposition levels, and the performance improves consistently as the number of high-frequency decomposition levels increases, further validating the effectiveness of NSST in multi-scale high-frequency information modeling.

5. Conclusions

In this work, we presented MSFF-GAN, a novel medical image SR framework that embeds a deep multi-scale convolutional architecture within a GAN generator to reconstruct images through cascaded multi-scale feature extraction blocks in a coarse-to-fine manner. The SR task is reformulated as the prediction of transform-domain coefficients, where incorporating NSST into the network enables more accurate retention of structural information from LR medical images than spatial-domain approaches, thereby further boosting reconstruction quality. Experimental results, both qualitative and quantitative, confirm that the proposed approach outperforms existing state-of-the-art methods and significantly improves the fidelity of restored medical images.

Author Contributions

H.Y., Q.W. and Y.S.: conceptualization, analysis, methodology, and framework construction. H.Y. and Y.S.: methodology, writing, review and editing, and investigation. H.Y. and Q.W.: validation and visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by National College Students’ Innovation and Entrepreneurship Training Program under Grant [No. 202413235002], Key Teaching Research Project of City University of Wuhan under Grant [No. 2024CYZDJY008], Program for Excellent Young and Middle-Aged Science and Technology Innovation Teams in Colleges and Universities of Hubei Province under Grant [No. T2022060].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Acknowledgments

We acknowledge all supporting organizations.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Abbreviations

Abbreviation: Full Form (First Appearance)
HR: High-Resolution (Abstract)
CT: Computed Tomography (Abstract)
MSFF-GAN: Multi-Scale Features Fusion Generative Adversarial Network (Abstract)
SR: Super-Resolution (Abstract)
GANs: Generative Adversarial Networks (Abstract)
CNNs: Convolutional Neural Networks (Abstract)
MSID: Multi-Scale Information Distillation (Abstract)
GMS: Granular Multi-Scale (Abstract)
NSST: Non-Subsampled Shearlet Transform (Abstract)
SSIM: Structural Similarity Index (Abstract)
PSNR: Peak Signal-to-Noise Ratio (Abstract)
MRI: Magnetic Resonance Imaging (Introduction)
LR: Low-Resolution (Introduction)
SNR: Signal-to-Noise Ratio (Introduction)
DL: Deep Learning (Introduction)
SRCNN: Super-Resolution Convolutional Neural Network (Section 2.1)
VDSR: Very Deep Super-Resolution (Section 2.1)
RCAN: Residual Channel Attention Network (Section 2.1)
SRDenseNet: Super-Resolution Dense Network (Section 2.1)
MemNet: Memory Network (Section 2.1)
RDN: Residual Dense Network (Section 2.1)
IDN: Information Distillation Network (Section 2.1)
MSRN: Multi-Scale Residual Network (Section 2.1)
AMRSR: Attention-based Multi-Reference Super-Resolution (Section 2.1)
AdderSR: Adder-based Super-Resolution (Section 2.1)
MARP: Multi-scale Attention and Residual Pooling (Section 2.1)
SRGAN: Super-Resolution Generative Adversarial Network (Section 2.2)
ESRGAN: Enhanced Super-Resolution GAN (Section 2.2)
RankSRGAN: Ranker-guided SRGAN (Section 2.2)
CGAN: Conditional Generative Adversarial Network (Section 2.2)
EventSR: Event-based Super-Resolution (Section 2.2)
LIWT: Local Implicit Wavelet Transformer (Section 2.2)
DMSN: Deep Multi-Scale Network (Section 2.3)
MAPANet: Multi-scale Attention-guided Progressive Aggregation Network (Section 2.3)
SFE: Shallow Feature Extraction (Section 3.2)
DFE: Deep Feature Extraction (Section 3.2)
MSFE: Multi-Scale Feature Extraction (Section 3.2)
NSLPF: Non-Subsampled Laplacian Pyramid Filters (Section 3.4)
TCIA: The Cancer Imaging Archive (Section 4.1)
MOS: Mean Opinion Score (Section 4.3)
WT: Wavelet Transform (Section 3.4)
DCT: Discrete Cosine Transform (Section 2.1)
LP: Laplacian Pyramid (Section 3.4)
DFB: Directional Filter Bank (Section 3.4)
NSCT: Non-Subsampled Contourlet Transform (Section 3.4)
ST: Shearlet Transform (Section 3.4)
TV: Total Variation (Section 3.3)

Notation and Symbols

Symbol: Description (Shape and Notes)
x: Low-resolution (LR) input image (H × W × C)
y: High-resolution (HR) ground-truth image ((rH) × (rW) × C)
ŷ = G(x; θ_G): Generator output, i.e., the predicted HR image (same shape as y)
G(·; θ_G): Generator network with parameters θ_G
D(·; θ_D): Discriminator network with parameters θ_D (outputs a real/fake score)
θ_G, θ_D: Trainable parameters of G and D (vectors/tensors)
r: Upscaling factor (e.g., r ∈ {2, 4, 8})
↑_r(·): Ideal/implicit upsampling operator by factor r (conceptual; not necessarily implemented as naive interpolation)
↓_r(·): Downsampling operator by factor r
D = {(x_i, y_i)}_{i=1}^{N}: Training set of LR–HR pairs (N samples)
N: Number of training samples
B: Mini-batch size
p_data(y): True data distribution of HR images (used in the adversarial objective)
p_G(y): Model distribution induced by G

References

  1. Chaudhari, S.; Fang, Z.; Kogan, F.; Wood, J.; Stevens, K.J.; Gibbons, E.K. Super-resolution musculoskeletal MRI using deep learning. Magn. Reson. Med. 2018, 80, 2139–2154. [Google Scholar] [CrossRef]
  2. Umehara, K.; Ota, J.; Ishida, T. Application of super-resolution convolutional neural network for enhancing image resolution in chest CT. J. Digit. Imaging 2018, 31, 441–450. [Google Scholar] [CrossRef]
  3. Zhu, J.; Yang, G.; Lio, P. How Can We Make Gan Perform Better in Single Medical Image Super-Resolution? A Lesion Focused Multi-Scale Approach. In Proceedings of the 16th IEEE International Symposium on Biomedical Imaging, Venice, Italy, 8–11 April 2019; pp. 1669–1673. [Google Scholar]
  4. Dharejo, F.A.; Zawish, M.; Zhou, Y.C.; Dev, K.; Khowaja, S.A.; Qureshi, N.M.F. Multimodal-boost: Multimodal medical image super-resolution using multi-attention network with wavelet transform. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 2420–2433. [Google Scholar] [CrossRef] [PubMed]
  5. Aly, H.; Dubois, E. Regularized image up-sampling using a new observation model and the level set method. In Proceedings of the International Conference on Image Processing, Barcelona, Spain, 14–17 September 2003; pp. III–665. [Google Scholar]
  6. Aly, H.A.; Dubois, E. Image up-sampling using total-variation regularization with a new observation model. IEEE Trans. Image Process. 2005, 14, 1647–1659. [Google Scholar] [CrossRef] [PubMed]
  7. Su, C.Y.; Zhuang, Y.T.; Li, H.; Wu, F. Steerable pyramid-based face hallucination. Pattern Recognit. 2005, 38, 813–824. [Google Scholar] [CrossRef]
  8. Yang, J.C.; Wright, J.; Huang, T.; Ma, Y. Image super-resolution as sparse representation of raw image patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  9. Chan, T.M.; Zhang, J.; Pu, J. Neighbor embedding based super-resolution algorithm through edge detection and feature selection. Pattern Recognit. Lett. 2009, 30, 494–502. [Google Scholar] [CrossRef]
  10. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef]
  11. Zhang, K.; Gao, X.; Tao, D.; Li, X. Single image super-resolution with non-local means and steering kernel regression. IEEE Trans. Image Process 2012, 21, 4544–4556. [Google Scholar] [CrossRef]
  12. Khizar, H. Multimedia super-resolution via deep learning: A survey. Digit. Signal Process. 2018, 81, 198–217. [Google Scholar] [CrossRef]
  13. Cheng, B.; Xiao, R.; Wang, J.; Huang, T.; Zhang, L. High frequency residual learning for multi-scale image classification. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 9–12 September 2019. [Google Scholar]
  14. Borji, A.; Cheng, M.-M.; Hou, Q.; Jiang, H.; Li, J. Salient object detection: A survey. Comput. Vis. Media 2019, 5, 117–150. [Google Scholar] [CrossRef]
  15. Yang, W.; Zhang, X.; Tian, Y.; Wang, W.; Xue, J.-H.; Liao, Q. Deep Learning for Single Image Super-Resolution: A Brief Review. IEEE Trans. Multimed. 2019, 21, 3106–3121. [Google Scholar] [CrossRef]
  16. Anwar, S.; Khan, S.; Barnes, N. A deep journey into super-resolution: A survey. ACM Comput. Surv. 2021, 53, 1–34. [Google Scholar] [CrossRef]
  17. Wang, Z.; Chen, J.; Hoi, S.C.H. Deep Learning for Image Super-Resolution: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
  18. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 6–12 September 2014; Volume 8692. [Google Scholar]
  19. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  20. Tai, Y.; Yang, J.; Liu, X.; Xu, C. MemNet: A Persistent Memory Network for Image Restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4549–4557. [Google Scholar]
  21. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Volume 11211. [Google Scholar]
  22. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481. [Google Scholar]
  23. Hui, Z.; Wang, X.; Gao, X. Fast and Accurate Single Image Super-Resolution via Information Distillation Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 723–731. [Google Scholar]
  24. Li, J.C.; Fang, F.M.; Mei, K.F.; Zhang, G.X. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 527–542. [Google Scholar]
  25. Shu, Z.; Cheng, M.; Yang, B.; Su, Z.; He, X. Residual Magnifier: A Dense Information Flow Network for Super Resolution. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 646–651. [Google Scholar]
  26. Li, Z.; Yang, J.; Liu, Z.; Yang, X.; Jeon, G.; Wu, W. Feedback Network for Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019; pp. 3862–3871. [Google Scholar]
  27. Du, J.; Wang, M.; Wang, X.; Yang, Z.; Li, X.; Wu, X. Reference-based image super-resolution with attention extraction and pooling of residuals. J. Supercomput. 2024, 81, 240. [Google Scholar] [CrossRef]
  28. Xie, C.; Zhang, X.; Li, L.; Fu, Y.; Gong, B.; Li, T.; Zhang, K. MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
  29. Pesavento, M.; Volino, M.; Hilton, A. Attention-based Multi-Reference Learning for Image Super-Resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 14677–14686. [Google Scholar]
  30. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 105–114. [Google Scholar]
  31. Wang, X.T.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 63–79. [Google Scholar]
  32. Dou, H.; Chen, C.; Hu, X.; Xuan, Z.; Hu, Z.; Peng, S. PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-Resolution. In Proceedings of the 28th ACM International Conference on Multimedia (MM), Seattle, WA, USA, 12–16 October 2020; pp. 1891–1899. [Google Scholar]
  33. Zhang, W.; Liu, Y.; Dong, C.; Qiao, Y. RankSRGAN: Generative Adversarial Networks with Ranker for Image Super-Resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–3 November 2019; pp. 3096–3105. [Google Scholar]
  34. Maeda, S. Unpaired Image Super-Resolution Using Pseudo-Supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 288–297. [Google Scholar]
  35. Wang, L.; Kim, T.-K.; Yoon, K.-J. EventSR: From Asynchronous Events to Image Reconstruction, Restoration, and Super-Resolution via End-to-End Adversarial Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 8312–8322. [Google Scholar]
  36. Zhang, K.; Liang, J.; Van Gool, L.; Timofte, R. Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 4771–4780. [Google Scholar]
  37. Gao, S.; Zhuang, X. Bayesian Image Super-Resolution with Deep Modeling of Image Statistics. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1405–1423. [Google Scholar] [CrossRef]
  38. Liu, H.; Li, Z.; Shang, F.; Liu, Y.; Wan, L.; Feng, W.; Timofte, R. Arbitrary-scale Super-resolution via Deep Learning: A Comprehensive Survey. Inf. Fusion 2024, 102, 102015. [Google Scholar] [CrossRef]
  39. Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; Feng, J. Dual path networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA, 4–9 December 2017; pp. 4470–4478. [Google Scholar]
  40. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 2261–2269. [Google Scholar]
  41. Wang, C.; Wang, S.; Ma, B.; Li, J.; Dong, X.; Xia, Z. Transform Domain Based Medical Image Super-resolution via Deep Multi-scale Network. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2387–2391. [Google Scholar]
  42. Liu, Y.T.; Guo, Y.C.; Zhang, S.H. Enhancing multi-scale implicit learning in image super-resolution with integrated positional encoding. arXiv 2021, arXiv:2112.05756. [Google Scholar]
  43. Wang, Y.; Li, Y.; Wang, G.; Liu, X. Multi-scale attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 17–21 June 2024; pp. 5950–5960. [Google Scholar]
  44. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef]
  45. Guo, T.; Mousavi, H.S.; Vu, T.H.; Monga, V. Deep Wavelet Prediction for Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 22–25 July 2017; pp. 1100–1109. [Google Scholar]
  46. Li, W.; Guo, H.; Liu, X.; Liang, K.; Hu, J.; Ma, Z.; Guo, J. Efficient Face Super-Resolution via Wavelet-based Feature Enhancement Network. In Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), New York, NY, USA, 28 October–1 November 2024; pp. 4515–4523. [Google Scholar]
  47. Angarano, S.; Salvetti, F.; Martini, M.; Chiaberge, M. Generative Adversarial Super-Resolution at the edge with knowledge distillation. Eng. Appl. Artif. Intell. 2023, 123, 106407. [Google Scholar] [CrossRef]
  48. Xiao, H.; Wang, X.; Wang, J.; Cai, J.-Y.; Deng, J.-H.; Yan, J.-K.; Tang, Y.-D. Single image super-resolution with denoising diffusion GANS. Sci. Rep. 2024, 14, 4272. [Google Scholar] [CrossRef]
  49. Aloisi, L.; Sigillo, L.; Uncini, A.; Comminiello, D. A wavelet diffusion GAN for image super-resolution. arXiv 2024, arXiv:2410.17966. [Google Scholar] [CrossRef]
  50. Park, S.H.; Moon, Y.S.; Cho, N.I. Perception-Oriented Single Image Super-Resolution using Optimal Objective Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 1725–1735. [Google Scholar]
  51. Duan, M.; Qu, L.; Liu, S.; Wang, M. Local Implicit Wavelet Transformer for Arbitrary-Scale Super-Resolution. arXiv 2024, arXiv:2411.06442. [Google Scholar] [CrossRef]
  52. Shrisha, H.S.; Anupama, V. NVS-GAN: Benefit of Generative Adversarial Network on Novel View Synthesis. Int. J. Intell. Netw. 2024, 5, 184–195. [Google Scholar] [CrossRef]
53. Huang, Y.; Shao, L.; Frangi, A.F. DOTE: Dual cOnvolutional filTer lEarning for Super-Resolution and Cross-Modality Synthesis in MRI. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2017; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; pp. 89–98. [Google Scholar]
  54. Huang, Y.; Shao, L.; Frangi, A.F. Simultaneous Super-Resolution and Cross-Modality Synthesis of 3D Medical Images Using Weakly-Supervised Joint Convolutional Sparse Coding. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5787–5796. [Google Scholar]
  55. Iwamoto, Y.; Takeda, K.; Li, Y.; Shiino, A.; Chen, Y.-W. Unsupervised MRI Super Resolution Using Deep External Learning and Guided Residual Dense Network with Multimodal Image Priors. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 426–435. [Google Scholar] [CrossRef]
  56. Pan, B.; Du, Y.; Guo, X. Super-Resolution Reconstruction of Cell Images Based on Generative Adversarial Networks. IEEE Access 2024, 12, 72252–72263. [Google Scholar] [CrossRef]
  57. Nimitha, U.; Ameer, P.M. MRI super-resolution using similarity distance and multi-scale receptive field based feature fusion GAN and pre-trained slice interpolation network. Magn. Reson. Imaging 2024, 110, 195–209. [Google Scholar]
  58. Liu, L.; Liu, T.; Zhou, W.; Wang, Y.; Liu, M. MAPANet: A Multi-Scale Attention-Guided Progressive Aggregation Network for Multi-Contrast MRI Super-Resolution. IEEE Trans. Comput. Imaging 2024, 10, 928–940. [Google Scholar] [CrossRef]
  59. Wei, J.; Yang, G.; Wang, Z.; Liu, Y.; Liu, A.; Chen, X. Misalignment-Resistant Deep Unfolding Network for multi-modal MRI super-resolution and reconstruction. Knowl.-Based Syst. 2024, 296, 111866. [Google Scholar] [CrossRef]
60. Ji, Z.; Zou, B.; Kui, X.; Vera, P.; Ruan, S. Deform-Mamba Network for MRI Super-Resolution. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2024; Springer Nature: Cham, Switzerland, 2024; pp. 242–252. [Google Scholar]
  61. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  62. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 2, pp. 2672–2680. [Google Scholar]
  63. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  64. Johnson, J.; Alahi, A.; Li, F. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
65. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  66. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 22–25 July 2017; pp. 1132–1140. [Google Scholar]
67. Mallat, S. A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed.; Academic Press: Cambridge, MA, USA, 2008. [Google Scholar]
  68. Candès, E.J.; Donoho, D.L. New tight frames of curvelets and optimal representations of objects with piecewise C2 singularities. Commun. Pure Appl. Math. 2004, 57, 219–266. [Google Scholar] [CrossRef]
  69. Do, M.N.; Vetterli, M. The contourlet transform: An efficient directional multiresolution image representation. IEEE Trans. Image Process. 2005, 14, 2091–2106. [Google Scholar] [CrossRef] [PubMed]
  70. Da Cunha, A.L.; Zhou, J.; Do, M.N. The Nonsubsampled Contourlet Transform: Theory, Design, and Applications. IEEE Trans. Image Process. 2006, 15, 3089–3101. [Google Scholar] [CrossRef] [PubMed]
  71. Labate, D.; Lim, W.Q.; Kutyniok, G.; Weiss, G. Sparse multidimensional representation using shearlets. In Optics and Photonics; SPIE: San Diego, CA, USA, 2005; pp. 254–262. [Google Scholar]
  72. Hou, B.; Zhang, X.; Bu, X.; Feng, H. SAR Image Despeckling Based on Nonsubsampled Shearlet Transform. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 809–823. [Google Scholar] [CrossRef]
  73. Zhang, K.; Van Gool, L.; Timofte, R. Deep Unfolding Network for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 3214–3223. [Google Scholar]
  74. Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  75. Zhang, J.; Long, C.; Wang, Y.; Piao, H.; Mei, H.; Yang, X.; Yin, B. A Two-Stage Attentive Network for Single Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1020–1033. [Google Scholar] [CrossRef]
Figure 1. The architecture of our MSFF-GAN with transform prediction.
Figure 2. The architecture of the MSID block.
Figure 3. The architecture of the GMS Block.
Figure 4. Comparisons of NSST and WT on the lung image: (a) the original HR image; (b) the LF NSST coefficient image; (c) the HF coefficient images for the four directions of scale 1; (d) the HF coefficient images for the eight directions of scale 2; (e) the fused HF coefficient image of NSST; and (f) the fused HF coefficient image of the discrete WT ("Haar" type).
Figure 5. Comparisons of NSST and WT on the brain image: (a) the original HR image; (b) the LF NSST coefficient image; (c) the HF coefficient images for the four directions of scale 1; (d) the HF coefficient images for the eight directions of scale 2; (e) the fused HF coefficient image of NSST; and (f) the fused HF coefficient image of the discrete WT ("Haar" type).
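The Haar-wavelet baseline shown in panels (f) of Figures 4 and 5 can be approximated with off-the-shelf tools. The sketch below uses PyWavelets for a single-level Haar DWT and fuses the three detail subbands by absolute-value summation; the input path and the fusion rule are illustrative assumptions, not the exact procedure used to render the figures.

```python
import numpy as np
import pywt
from skimage import io

# Load a grayscale HR slice (the file path is illustrative).
img = io.imread("lung_hr.png", as_gray=True).astype(np.float64)

# Single-level Haar DWT: low-frequency approximation plus three
# high-frequency detail subbands (horizontal, vertical, diagonal).
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")

# Fuse the HF subbands by magnitude for visualization (assumed fusion rule).
hf_fused = np.abs(cH) + np.abs(cV) + np.abs(cD)

# Normalize to [0, 1] before display or saving.
hf_fused = (hf_fused - hf_fused.min()) / (np.ptp(hf_fused) + 1e-12)
```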
Figure 6. Qualitative results. From left to right: the LR image obtained by bicubic interpolation, the SR result of the MAT [28] method, the SR result of our method with the GMS block, and the original HR image; (b) is an enlarged view of the red boxed area in (a).
Figure 7. Average PSNR values (8×) for different numbers of D and s; SR performance improves consistently as D and s increase.
Figure 8. Effectiveness of NSST prediction.
Table 1. Comparison of PSNR for different methods on the medical dataset. Bold indicates the best; underline indicates the second best.

| Datasets | Scale | Bicubic | DWSR [45] | IDN [23] | MSRN [24] | RCAN [21] | DMSN [41] | USRNET [73] | TSAN [75] | Diff-GAN [48] | MapaNet [58] | MAT [28] | Ours (with MSID) | Ours (with GMS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abdomen | 4× | 27.902 | 29.934 | 30.341 | 30.407 | 30.673 | 30.523 | 30.696 | 30.913 | 30.927 | 31.051 | 31.338 | 31.654 | 32.186 |
| Abdomen | 8× | 24.694 | 26.095 | 26.199 | 26.325 | 26.900 | 26.591 | 26.964 | 27.217 | 27.436 | 27.514 | 27.911 | 28.134 | 28.625 |
| Bone | 4× | 26.414 | 28.451 | 28.555 | 28.631 | 28.822 | 28.685 | 28.863 | 28.995 | 29.073 | 29.231 | 29.352 | 29.877 | 30.175 |
| Bone | 8× | 24.478 | 25.759 | 25.790 | 25.956 | 26.165 | 26.031 | 26.204 | 26.297 | 26.220 | 26.319 | 26.431 | 26.962 | 27.224 |
| Brain | 4× | 26.517 | 28.268 | 28.454 | 28.841 | 29.374 | 28.964 | 29.443 | 29.570 | 29.773 | 29.816 | 30.029 | 30.582 | 30.838 |
| Brain | 8× | 22.338 | 23.926 | 24.184 | 24.317 | 24.952 | 24.733 | 24.957 | 25.016 | 24.958 | 25.326 | 25.634 | 25.851 | 26.413 |
| Lung | 4× | 27.338 | 29.196 | 29.323 | 29.455 | 30.356 | 29.958 | 30.506 | 30.693 | 30.718 | 30.936 | 30.942 | 31.314 | 31.793 |
| Lung | 8× | 23.795 | 25.027 | 25.231 | 25.402 | 25.744 | 25.685 | 25.831 | 25.948 | 25.943 | 26.164 | 26.232 | 26.696 | 27.088 |
Table 2. Comparison of SSIM for different methods on the medical datasets. Bold indicates the best; underline indicates the second best.

| Datasets | Scale | Bicubic | DWSR [45] | IDN [23] | MSRN [24] | RCAN [21] | DMSN [41] | USRNET [73] | TSAN [75] | Diff-GAN [48] | MapaNet [58] | MAT [28] | Ours (with MSID) | Ours (with GMS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abdomen | 4× | 0.796 | 0.852 | 0.856 | 0.857 | 0.860 | 0.858 | 0.861 | 0.863 | 0.863 | 0.865 | 0.867 | 0.870 | 0.875 |
| Abdomen | 8× | 0.673 | 0.717 | 0.722 | 0.728 | 0.740 | 0.730 | 0.741 | 0.743 | 0.745 | 0.746 | 0.749 | 0.751 | 0.757 |
| Bone | 4× | 0.427 | 0.644 | 0.649 | 0.662 | 0.666 | 0.661 | 0.669 | 0.670 | 0.671 | 0.673 | 0.674 | 0.678 | 0.681 |
| Bone | 8× | 0.342 | 0.368 | 0.372 | 0.377 | 0.381 | 0.380 | 0.382 | 0.384 | 0.383 | 0.384 | 0.385 | 0.389 | 0.392 |
| Brain | 4× | 0.831 | 0.865 | 0.868 | 0.872 | 0.880 | 0.875 | 0.883 | 0.884 | 0.886 | 0.886 | 0.889 | 0.894 | 0.896 |
| Brain | 8× | 0.704 | 0.755 | 0.759 | 0.763 | 0.776 | 0.764 | 0.780 | 0.785 | 0.784 | 0.788 | 0.790 | 0.792 | 0.798 |
| Lung | 4× | 0.895 | 0.869 | 0.871 | 0.874 | 0.881 | 0.879 | 0.884 | 0.888 | 0.889 | 0.891 | 0.891 | 0.895 | 0.899 |
| Lung | 8× | 0.739 | 0.779 | 0.783 | 0.786 | 0.795 | 0.791 | 0.798 | 0.802 | 0.802 | 0.804 | 0.805 | 0.810 | 0.814 |
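For context, PSNR and SSIM values such as those in Tables 1 and 2 are commonly computed per slice and averaged over a test set. The following is a minimal sketch using scikit-image on 8-bit grayscale arrays; it is not the authors' evaluation code, and the synthetic inputs are placeholders.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(hr: np.ndarray, sr: np.ndarray) -> tuple[float, float]:
    """PSNR (dB) and SSIM for one 8-bit grayscale HR/SR slice pair."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, data_range=255)
    return psnr, ssim

# Placeholder data; a real evaluation would load reconstructed SR slices
# and their HR references, then average the metrics over the whole dataset.
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
sr = np.clip(hr.astype(np.int16) + rng.integers(-5, 6, size=hr.shape), 0, 255).astype(np.uint8)
print(evaluate_pair(hr, sr))
```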
Table 3. Comparison (8×) of MOS for different methods. Bold indicates the best. The rating columns give the number of evaluation images (120 in total) assigned to each score.

| Methods | 1 (Poor) | 2 (Fair) | 3 (Good) | 4 (Very Good) | MOS (Mean ± Standard Deviation) |
|---|---|---|---|---|---|
| Bicubic | 94 | 26 | 0 | 0 | 1.22 ± 0.5932 |
| DWSR [45] | 16 | 42 | 58 | 4 | 2.42 ± 0.8457 |
| MSRN [24] | 9 | 34 | 70 | 7 | 2.63 ± 0.8514 |
| RCAN [21] | 9 | 31 | 73 | 7 | 2.65 ± 0.8433 |
| TSAN [75] | 7 | 30 | 75 | 8 | 3.70 ± 0.8526 |
| MAT [28] | 4 | 20 | 88 | 8 | 2.83 ± 0.7656 |
| Ours (with MSID) | 1 | 14 | 93 | 12 | 2.97 ± 0.7430 |
| Ours (with GMS) | 1 | 4 | 101 | 14 | 3.07 ± 0.7146 |
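The mean opinion scores in Table 3 follow from the rating counts as a weighted average of the 1–4 scores. A minimal sketch for the bicubic row is shown below; the reported standard deviations are not recoverable from the counts alone, so only the mean is expected to match.

```python
import numpy as np

# Rating counts for the bicubic row of Table 3
# (1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good), 120 evaluation images in total.
scores = np.array([1, 2, 3, 4])
counts = np.array([94, 26, 0, 0])

mos = np.average(scores, weights=counts)  # weighted mean of the ratings
print(f"MOS ≈ {mos:.2f}")                 # ≈ 1.22, matching the tabulated mean
```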
Table 4. Average PSNR values (8×) under different numbers of MSID blocks.

| Number of MSID Blocks | 2 | 4 | 6 | 8 | 10 | 12 |
|---|---|---|---|---|---|---|
| Average PSNR Value (dB) | 26.983 | 27.136 | 27.351 | 27.465 | 27.568 | 27.583 |
Table 5. PSNR values for different high-frequency levels of decomposition based on our method with the MSID block (Scale: 8×, T = 10).

| High-Frequency Level of NSST Decomposition | Abdomen | Bone | Brain | Lung |
|---|---|---|---|---|
| 2 levels (decomposition directions: 2 and 4) | 27.284 | 26.751 | 25.419 | 26.273 |
| 3 levels (decomposition directions: 4 and 8) | 27.634 | 26.762 | 25.435 | 26.296 |
| 4 levels (decomposition directions: 2, 4, and 8) | 27.288 | 26.764 | 25.438 | 26.297 |