Bidirectional T1–T2 Brain MRI Synthesis Using a Fusion U-Net Transformer for Real-World Clinical Data

Cantemir, Zeynep; Karacan, Hacer; Cindil, Emetullah; Kalafat, Burak

doi:10.3390/app16083674

¹ Graduate School of Natural and Applied Sciences, Gazi University, 06500 Ankara, Turkey

² Department of Computer Engineering, Faculty of Engineering, Gazi University, 06570 Ankara, Turkey

³ Department of Radiology, Faculty of Medicine, Gazi University, 06500 Ankara, Turkey

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3674; https://doi.org/10.3390/app16083674

Submission received: 26 February 2026 / Revised: 5 April 2026 / Accepted: 6 April 2026 / Published: 9 April 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Obtaining multiple MRI contrasts for each patient prolongs scan acquisition time, increases healthcare costs, and may not always be feasible due to patient-specific constraints. Deep learning-based MRI contrast synthesis offers a potential solution, yet most existing approaches are evaluated on preprocessed public benchmarks that do not reflect real-world clinical variability. In this study, we propose a fusion U-Net transformer framework for bidirectional T1-weighted ↔ T2-weighted brain MRI synthesis trained and evaluated exclusively on retrospectively acquired clinical data. The proposed architecture integrates multiscale convolutional feature extraction with axial attention mechanisms and a transformer bottleneck for efficient global context modeling. A fusion refinement block is incorporated to mitigate skip connection artifacts. An adversarial training strategy with the least squares GAN objective and a hybrid loss combining L1 reconstruction and structural similarity (SSIM) is employed to promote both pixel-level accuracy and perceptual fidelity. The model is evaluated using SSIM and PSNR metrics alongside qualitative expert assessment conducted by two board-certified radiologists. For both synthesis directions, the framework achieves competitive quantitative performance against baseline models under the challenging conditions of clinical data. Expert evaluation confirms high anatomical fidelity and clinically acceptable image quality across both synthesis directions. These results indicate that the proposed framework represents a promising approach for multi-contrast MRI synthesis in clinically heterogeneous data environments.

Keywords: medical image synthesis; brain MRI; MRI contrast synthesis; transformer; fusion U-Net; bidirectional image translation

1. Introduction

Magnetic resonance imaging (MRI) plays a vital role in modern clinical diagnosis by offering superior soft tissue contrast and multiparametric imaging capabilities [1]. Different MRI sequences, such as T1-weighted and T2-weighted imaging, provide complementary anatomical and pathological information essential for disease assessment and treatment planning [2]. However, the acquisition of multi-contrast MRI protocols presents several challenges. These include extended scanning times that cause patient discomfort, motion artifacts, and increased healthcare costs. The synthesis of MRI contrasts from available sequences has attracted considerable research interest as a potential solution to these challenges [3].

Cross-modality medical image synthesis aims to generate high quality target contrast images from source-modality scans [4]. With recent advances in deep learning, convolutional neural networks (CNNs) have significantly improved performance in medical image synthesis tasks. Variants of the U-Net architecture have become the de facto standard for medical image-to-image translation due to their ability to capture both local and contextual features through encoder–decoder structures with skip connections [5]. However, pure CNN approaches face inherent limitations in modeling long-range dependencies and global context, which are crucial for maintaining anatomical consistency across the entire image [6].

More recently, generative adversarial networks (GANs) [7] and transformer-based architectures [8] have demonstrated strong performance in medical image synthesis. Vision transformers have emerged as a promising alternative, leveraging self-attention mechanisms to capture global dependencies and complex spatial relationships [6,9]. While transformers excel at modeling global context, they often struggle with fine-grained local details and require substantial computational resources [10]. This has motivated the development of hybrid architectures that combine the complementary strengths of CNNs, GANs, and transformers [11,12]. Despite these advances, most existing methods are evaluated on publicly available preprocessed datasets, which may not fully reflect real-world clinical variability.

In this work, we present a hybrid deep learning framework for T1 ↔ T2 brain MRI synthesis that addresses these challenges through three key contributions. First, we introduce a fusion U-Net transformer architecture that integrates multiscale convolutional features with axial attention mechanisms and a bottleneck transformer, enabling the effective capture of both local structural details and global anatomical context. Second, we employ an adversarial training strategy with a hybrid loss that combines L1 reconstruction and SSIM to enhance perceptual quality and anatomical fidelity. Third, our study leverages a clinically acquired dataset processed directly from raw DICOM data using a dedicated preprocessing pipeline. This approach distinguishes our work from studies that rely on publicly available preprocessed datasets. By operating on raw clinical data, the proposed framework explicitly addresses real-world challenges such as acquisition variability, patient-specific artifacts, and intensity inhomogeneities, which are critical for clinical translation. Experimental results demonstrate competitive performance and promising visual quality, with synthesized images achieving clinically acceptable fidelity according to both quantitative metrics and qualitative evaluation by board-certified radiologists.

The remainder of this paper is organized as follows. Section 2 reviews related work on traditional methods, GAN-based approaches, and hybrid transformer architectures for MRI synthesis. Section 3 presents the proposed fusion U-Net transformer architecture. Section 4 describes the clinical dataset and preprocessing pipeline. Section 5 details the experimental setup and evaluation metrics. Section 6 reports the quantitative and qualitative results. Section 7 provides a discussion of the findings, and Section 8 concludes this paper and outlines directions for future work.

2. Related Work

2.1. Traditional Methods

Early approaches to cross-modality medical image synthesis can be broadly categorized into registration-based and intensity transformation-based methods. Registration-based methods typically rely on atlas frameworks, where images with known modality pairs are nonlinearly registered to a target subject and the missing contrast is synthesized through label or intensity fusion across the aligned atlases. However, these approaches are highly sensitive to registration errors and have limited ability to capture anatomical variations [13,14]. In contrast, intensity transformation-based techniques learn a mapping between source and target contrasts from local neighborhoods to model the relationship between paired multi-contrast training data [15]. While these methods reduce dependence on registration, they may lose global context and produce artifacts [3].

2.2. Deep Learning-Based MRI Synthesis

With recent advances in deep learning, convolutional neural networks (CNNs) have shown superior performance compared with traditional methods in MRI image synthesis. The U-Net architecture, originally proposed for biomedical image segmentation, has been widely adapted for synthesis tasks. U-Net’s effective encoder–decoder structure with skip connections preserves spatial information [5]. Supervised U-Net models trained on paired multi-contrast datasets have demonstrated accurate synthesis of missing MRI sequences and outperform traditional methods [16,17]. However, convolutional architectures rely on hierarchical local feature extraction. As a result, they may struggle to capture long-range dependencies and global anatomical contexts, which are crucial for preserving structural consistency across the entire image [3].

2.3. Generative Adversarial Networks for MRI Synthesis

Generative adversarial networks (GANs) have further improved medical image synthesis by introducing adversarial training, which encourages perceptually realistic outputs through competition between generator and discriminator networks [7]. The pix2pix framework established a foundation for paired image-to-image translation tasks by combining conditional GANs with L1 reconstruction loss [18]. Building upon pix2pix, several studies have successfully applied conditional GANs to MRI contrast synthesis tasks. Nie et al. introduced context-aware GANs that incorporate multiscale anatomical information and spatial context into the synthesis process. Their work demonstrated improved structural consistency in brain MRI synthesis compared with conventional CNN-based methods [19]. Extending this work, Dar et al. proposed a refined context-aware GAN framework for multi-contrast MRI synthesis, incorporating perceptual loss functions to further improve anatomical detail preservation [20]. Sharma et al. introduced MM-GAN, a framework capable of synthesizing missing MRI sequences from any combination of available contrasts [21].

Bidirectional T1 ↔ T2 synthesis has also been specifically addressed by several dedicated methods. Kawahara and Nagata explored T1 ↔ T2 synthesis using conditional GAN frameworks such as pix2pix-style architectures, investigating optimal input preprocessing strategies, including image resolution and grayscale conversion methods [22]. Their work highlighted the importance of preprocessing considerations that are often overlooked in medical image synthesis pipelines. Xu et al. proposed Bi-MGAN, a multi-generative multi-adversarial framework that simultaneously learns two nonlinear mappings between T1 ↔ T2 MRI in both paired and unpaired fashion, ensuring pathological invariance through auxiliary label information [23]. However, this approach requires two generators and two discriminators for each synthesis direction, limiting scalability. More recently, Lei et al. proposed MD-GAN, a modal disentangled GAN that employs a single generative adversarial system to achieve bidirectional T1 ↔ T2 synthesis by learning disentangled morphological and modal feature representations in a shared latent space [24]. However, MD-GAN requires all training samples to be processed simultaneously, limiting its applicability to streaming or incrementally acquired clinical data. Moreover, the majority of these methods are evaluated on preprocessed public datasets, leaving their robustness under realistic clinical acquisition variability largely unaddressed.

Alongside paired synthesis, Zhu et al. introduced CycleGAN, an unpaired image-to-image translation framework based on cycle consistency constraints that enables cross-modal synthesis without requiring aligned image pairs [25]. While CycleGAN and its medical imaging adaptations have demonstrated utility in settings where paired data are scarce, anatomical fidelity may be compromised due to the absence of direct pixel-level supervision [3,26].

Despite these advances, GAN training instability remains a significant concern due to sensitivity to hyperparameter selection and network architecture choices [25]. Adversarial training dynamics can also lead to hallucinated features, particularly in regions with complex anatomy or pathology [3]. To address training stability, least squares GANs (LSGANs) replace the cross-entropy objective with a least squares loss, leading to more stable optimization and mitigating vanishing gradients [27]. Skandarani et al. showed that GAN performance varies substantially across architectures, and that simpler models often fail to adequately capture the diversity and structural richness of medical imaging datasets [28]. Nevertheless, pure CNN-based generators remain limited in capturing long-range spatial dependencies, which is particularly critical for maintaining global anatomical consistency in medical imaging [3].

2.4. Vision Transformers and Hybrid Architectures for Medical Image Synthesis

Vision transformers (ViTs) have achieved competitive or superior performance compared with convolutional neural networks (CNNs) on image classification tasks [6]. Unlike convolutional operations that process local neighborhoods, transformers leverage self-attention mechanisms to model relationships between all spatial positions simultaneously. As a result, long-range dependencies and global contextual information can be effectively captured [8]. This fundamental difference has attracted significant interest in medical imaging tasks, and several transformer-based architectures have been proposed. The Swin transformer introduced hierarchical feature representations through shifted window-based self-attention, achieving improved computational efficiency while maintaining the ability to model multiscale features [11]. Cao et al. adapted this architecture for medical image segmentation in Swin-UNet, demonstrating that a pure transformer architecture without any convolutions could achieve competitive segmentation performance on multiple medical imaging benchmarks [12]. However, these pure transformer approaches often require substantial computational resources and large-scale training datasets. This limits their applicability in clinical settings where data availability is constrained and inference efficiency is critical.

For medical image synthesis specifically, hybrid CNN–transformer architectures have emerged as a promising approach to combine the complementary strengths of both paradigms. Chen et al. proposed TransUNet, which combines a transformer encoder with a CNN decoder through skip connections, demonstrating that hybrid architectures could effectively process medical images despite challenges related to data efficiency and computational complexity [29]. Dalmaz et al. proposed ResViT (residual vision transformer), a generative adversarial framework for multimodal medical image synthesis that integrates convolutional and transformer modules [30]. ResViT demonstrated strong performance in synthesizing missing sequences in multi-contrast MRI and in MRI-to-CT translation, outperforming both pure CNN-based GANs and standalone transformer architectures.

Despite notable progress, several challenges remain in applying hybrid architectures to MRI synthesis. First, effective fusion of multiscale features from both CNN and transformer branches requires careful architectural design to preserve complementary spatial and semantic features. Second, computational efficiency must be maintained for practical clinical deployment. Third, transformers typically require larger training datasets than CNNs due to their weaker inductive bias, which poses a significant limitation in medical imaging where annotated data are often scarce. Moreover, most existing studies rely on curated public datasets with controlled imaging conditions. This leaves robust bidirectional MRI synthesis on raw clinical data, with its inherent intensity variability, acquisition artifacts, and inter-subject heterogeneity, relatively underexplored.

3. Materials and Methods

3.1. Problem Definition and Overview

This study addresses the bidirectional T1-weighted ↔ T2-weighted MRI synthesis problem using clinically acquired MRI data. Given an input MRI slice x ∈ R^(1×H×W) from one contrast domain (T1 or T2), the objective is to learn a mapping function

G: x → ŷ,

(1)

where ŷ approximates the corresponding target slice y ∈ R^(1×H×W) in the opposite contrast domain (T2 or T1). The proposed framework enables flexible synthesis in both T1 → T2 and T2 → T1 directions within a unified model.

Unlike prior work relying on preprocessed public datasets, our approach operates on raw clinical MRI data, which introduces critical challenges absent in benchmark studies, including intensity heterogeneity, imperfect slice correspondence, artifact diversity, and bit depth inconsistencies. These factors significantly complicate reliable contrast translation and limit generalizability. To address these clinical deployment challenges, we propose a fusion U-Net transformer framework that integrates convolutional inductive biases with global contextual modeling and adversarial learning. The proposed design incorporates the following key elements:

Convolutional inductive biases for local texture modeling.
Axial attention for efficient long-range dependency capture.
Transformer-based global context reasoning at the bottleneck.
Adversarial training with LSGAN for perceptual realism.
Feature fusion refinement to mitigate skip connection artifacts.

The complete framework is illustrated in Figure 1.

The proposed architectural components were designed in direct response to the specific challenges introduced by raw clinical MRI data. Intensity heterogeneity arising from multi-scanner acquisition and the absence of protocol standardization motivated the adoption of group normalization over batch normalization, as group-wise statistics are more robust to batch-level intensity variation. Residual slice-level misalignment inherent in retrospective paired acquisitions motivated the incorporation of the fusion refinement block. This component is designed to mitigate artifacts introduced by skip connection fusion under imperfect spatial correspondence. The axial attention mechanism and transformer bottleneck address the global anatomical consistency requirements that are particularly critical when local intensity references are unreliable due to acquisition variability. Together, these design choices reflect a deliberate alignment between architectural decisions and the practical constraints of clinical data deployment.

3.2. Generator Architecture: Fusion U-Net Transformer

The generator G combines U-Net’s proven efficacy in medical image segmentation with modern attention mechanisms for enhanced global context modeling.

3.2.1. Encoder Path with Hierarchical Feature Extraction

The encoder follows a hierarchical U-Net design composed of convolutional blocks and downsampling operations [5]. The standard U-Net design is modified by replacing batch normalization with group normalization [31], which improves training stability under the small batch sizes commonly encountered in medical imaging. Formally, the hierarchical feature extraction process is defined as:

f_i = ConvBlock (Pool (f_{i−1})), i ∈ {1, 2, 3},

(2)

where f_i denotes the feature map at the i-th encoder level and Pool (·) represents a spatial downsampling operation. Each convolutional block consists of two consecutive 3 × 3 convolutional layers, each followed by group normalization (GN) and SiLU activation function [32]. The convolutional block is defined as:

ConvBlock (x) = SiLU (GN (Conv_3×3 (SiLU (GN (Conv_3×3 (x)))))).

(3)

Feature dimensions progress as: 64, 128, 256, 512 with spatial resolutions [H, H/2, H/4, H/8], respectively, where H = 256.

We employ group normalization (GN) [31] to avoid dependence on batch statistics, which is critical for small medical imaging batches. We use 8 groups per layer to balance normalization effectiveness and computational efficiency. For activation functions, we adopt Swish/SiLU [32], which has demonstrated superior gradient flow compared with ReLU in deep medical imaging networks, particularly for complex synthesis tasks.
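The encoder described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `ConvBlock` and `Encoder` are hypothetical, and max pooling is assumed for Pool (·), since the text only specifies a spatial downsampling operation.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by GroupNorm and SiLU (Eq. 3)."""

    def __init__(self, in_ch: int, out_ch: int, groups: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(groups, out_ch),   # 8 groups per layer, as in the text
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(groups, out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)


class Encoder(nn.Module):
    """Hierarchical encoder: 64 -> 128 -> 256 -> 512 channels, H -> H/8."""

    def __init__(self, in_ch: int = 1, widths=(64, 128, 256, 512)):
        super().__init__()
        self.stem = ConvBlock(in_ch, widths[0])
        # Assumption: 2x2 max pooling implements the downsampling step.
        self.pool = nn.MaxPool2d(2)
        self.stages = nn.ModuleList(
            [ConvBlock(c_in, c_out) for c_in, c_out in zip(widths[:-1], widths[1:])]
        )

    def forward(self, x: torch.Tensor):
        feats = [self.stem(x)]                          # f_0 at full resolution
        for stage in self.stages:
            feats.append(stage(self.pool(feats[-1])))   # f_i = ConvBlock(Pool(f_{i-1}))
        return feats
```

For a 256 × 256 input, the four returned feature maps have spatial sizes 256, 128, 64, and 32, matching the resolutions listed above.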

3.2.2. Axial Attention

At the H/4 resolution level (64 × 64), we incorporate an axial attention block [33] to efficiently capture long-range spatial dependencies. The axial attention mechanism factorizes two-dimensional self-attention into sequential row- and column-wise operations. Formally, given a feature map f, axial attention is defined as:

AxialAttn (f) = ColAttn (RowAttn (f)),

(4)

where row-wise attention operates on reshaped feature sequences of size (B·H, W, C), while column-wise attention is applied to sequences of size (B·W, H, C). This factorized attention reduces complexity from O((HW)²) to O(HW(H + W)), enabling global context modeling without prohibitive memory costs.
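As a worked instance of this reduction, consider the 64 × 64 feature map at which the block is applied. The counts below are pairwise token interactions only; constant factors, channel width, and head count are ignored.

```latex
\begin{align*}
\text{full self-attention:}\quad & (HW)^2 = (64 \cdot 64)^2 = 4096^2 = 16{,}777{,}216 \\
\text{axial attention:}\quad & HW\,(H + W) = 4096 \cdot 128 = 524{,}288 \\
\text{ratio:}\quad & \frac{(HW)^2}{HW\,(H + W)} = \frac{HW}{H + W} = \frac{4096}{128} = 32
\end{align*}
```

At this resolution, the factorized form therefore evaluates roughly 32 times fewer token pairs than full two-dimensional self-attention.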

Axial attention is placed at the H/4 resolution level for two reasons. From a computational perspective, full self-attention at higher spatial resolutions is prohibitively expensive for medical imaging tasks. From a semantic perspective, feature maps at 64 × 64 resolution encode mid-level anatomical structures, such as ventricles, cortical folding patterns, and white/gray matter boundaries. At this scale, modeling global spatial relationships is essential for preserving structural consistency.
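The row-then-column factorization of Eq. (4) can be sketched with standard PyTorch attention modules. The class name `AxialAttention` and the head count are assumptions made for illustration. The sketch follows Eq. (4) literally, without residual connections, normalization, or positional encodings, any of which the actual block may include.

```python
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    """AxialAttn(f) = ColAttn(RowAttn(f)) on a (B, C, H, W) feature map."""

    def __init__(self, channels: int, heads: int = 8):  # heads: assumed value
        super().__init__()
        self.row_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        # Row-wise: each of the B*H rows is a sequence of W tokens.
        rows = f.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows, need_weights=False)
        f = rows.reshape(b, h, w, c)
        # Column-wise: each of the B*W columns is a sequence of H tokens.
        cols = f.permute(0, 2, 1, 3).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols, need_weights=False)
        # Restore the (B, C, H, W) layout.
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
```

Each attention call sees sequences of length W or H rather than H·W, which is where the O(HW(H + W)) cost comes from.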

3.2.3. Transformer Bottleneck

At the bottleneck (H/8 resolution, 32 × 32), the encoder feature map is reshaped into a sequence of spatial tokens and processed by a transformer encoder [8]. Let f₃ ∈ R^(C×32×32) denote the bottleneck feature map. It is flattened into 1024 tokens, each with C channels, enabling global context modeling across the entire spatial domain:

z = Reshape (f₃) ∈ R^(N×C), N = 32 × 32 = 1024,

(5)

z′ = TransformerEncoder(z),

(6)

f′₃ = Reshape (z′) ∈ R^(C×32×32).

(7)

The transformer encoder comprises two layers, each with eight-head self-attention, a position-wise feed-forward network with an expansion factor of four, and pre-layer normalization [34] for improved training stability. This design enables direct modeling of brain anatomical context through self-attention, capturing long-range correlations beyond the reach of convolutional operations alone. The hybrid design combines the CNN's efficient local texture modeling in the encoder with transformer-based global reasoning at the bottleneck, following recent successful applications in medical vision transformers [29].

The transformer bottleneck configuration was intentionally kept lightweight (two encoder layers with eight attention heads) to balance global context modeling capacity and computational efficiency. Given the limited size and heterogeneity of the clinical MRI dataset used in this study, deeper transformer stacks were not adopted, in order to reduce overfitting risk and maintain stable training behavior. This design choice is consistent with recent hybrid CNN–transformer architectures that employ transformer modules primarily for global context aggregation rather than full-resolution feature modeling [29].
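Equations (5)–(7) and the stated configuration (two layers, eight heads, feed-forward expansion of four, pre-layer normalization) map directly onto PyTorch's built-in encoder. The sketch below is an assumption-laden illustration: the class name `TransformerBottleneck` is hypothetical, positional encodings are omitted because the text does not describe them, and dropout is left at the library default.

```python
import torch
import torch.nn as nn


class TransformerBottleneck(nn.Module):
    """Flatten (B, C, H, W) into H*W tokens, apply a transformer, reshape back."""

    def __init__(self, channels: int = 512, heads: int = 8,
                 layers: int = 2, ff_mult: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels,
            nhead=heads,
            dim_feedforward=ff_mult * channels,  # expansion factor of four
            batch_first=True,
            norm_first=True,                     # pre-layer normalization
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=layers, enable_nested_tensor=False
        )

    def forward(self, f3: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f3.shape
        z = f3.flatten(2).transpose(1, 2)             # Eq. 5: (B, N, C), N = H*W
        z = self.encoder(z)                           # Eq. 6
        return z.transpose(1, 2).reshape(b, c, h, w)  # Eq. 7
```

With a 32 × 32 bottleneck, each attention layer operates on N = 1024 tokens, so the attention matrix per head has 1024 × 1024 entries.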

3.2.4. Decoder Path with Fusion Refinement Mechanism

The decoder mirrors the encoder structure using bilinear upsampling followed by skip connections to progressively recover spatial resolution. To address feature mismatch between encoder and decoder representations, a known issue in vanilla U-Net architectures [5], we introduce a fusion refinement block. This block performs three sequential operations: (i) channel compression of concatenated skip and decoder features, (ii) feature recalibration via a squeeze-and-excitation gating mechanism [35], and (iii) residual refinement using 3 × 3 convolutions with skip connections. Let e_i and d_i denote the encoder and decoder feature maps at scale i, respectively. The fusion refinement operation is defined as:

f_i = Concat (e_i, d_i),

(8)

{\tilde{f}}_{i} = SE ({Conv}_{1 \times 1} (f_{i})),

(9)

{d^{'}}_{i} = {\tilde{f}}_{i} + {Conv}_{3 \times 3} ({\tilde{f}}_{i}),

(10)

where Conv_1×1 performs channel compression, SE (⋅) denotes squeeze- and excitation-based channel-wise recalibration [35], and the residual refinement with 3 × 3 convolutions enhances local consistency. The refined feature d′_i is then propagated to the next decoder stage.

This design improves cross-scale information flow and mitigates artifacts commonly introduced by naive skip connection fusion, particularly when operating on raw clinical MRI data.
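A compact PyTorch sketch of Eqs. (8)–(10) follows. The names `SqueezeExcite` and `FusionRefine`, the SE reduction ratio of 16, and the use of a single 3 × 3 convolution in the residual branch are assumptions. The decoder feature is expected to be upsampled to the encoder resolution before the call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SqueezeExcite(nn.Module):
    """Channel-wise gating: global average pool -> bottleneck MLP -> sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction: assumed
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)


class FusionRefine(nn.Module):
    """Concat -> 1x1 compression -> SE recalibration -> residual 3x3 refinement."""

    def __init__(self, enc_ch: int, dec_ch: int, out_ch: int):
        super().__init__()
        self.compress = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=1)
        self.se = SqueezeExcite(out_ch)
        self.refine = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, e: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        f = torch.cat([e, d], dim=1)             # Eq. 8
        f_tilde = self.se(self.compress(f))      # Eq. 9
        return f_tilde + self.refine(f_tilde)    # Eq. 10


def upsample2x(d: torch.Tensor) -> torch.Tensor:
    """Bilinear upsampling applied to the decoder feature before fusion."""
    return F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=False)
```

A typical call at one decoder stage would be `FusionRefine(256, 512, 256)(e_i, upsample2x(d_prev))`, where the channel counts are illustrative.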

3.2.5. Output Layer

The final prediction is generated through a 1 × 1 convolution followed by sigmoid activation, producing normalized T1- or T2-weighted intensity values in the [0, 1] range.

3.3. Discriminator Architecture: Conditional PatchGAN

To encourage realistic local texture synthesis, we employ a conditional PatchGAN discriminator [18]:

D (x, y) ∈ R^(H′×W′),

(11)

where the discriminator receives concatenated input–target pairs [x, y] and outputs a patch-wise realism map rather than a single scalar. This allows the discriminator to focus on high-frequency details and local anatomical consistency.

The discriminator architecture, as summarized in Table 1, consists of a sequence of convolutional layers with progressively increasing channel depth, instance normalization, and LeakyReLU activations, except for the first layer where normalization is omitted.

We employ instance normalization [36] rather than batch normalization for improved performance in image-to-image translation tasks. Normalization is omitted in the first layer to preserve input intensity distributions. LeakyReLU activation is used to allow gradient flow for negative activations. The final output is a 30 × 30 patch-wise prediction map, where each element corresponds to a receptive field of approximately 70 × 70 pixels in the input image. This configuration balances local detail discrimination with computational efficiency and aligns with the PatchGAN principle introduced in pix2pix [18]. Owing to its emphasis on local realism rather than global image classification, this discriminator design is well-suited for MRI synthesis tasks, where preserving fine-grained anatomical texture is critical.
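The discriminator described above is consistent with the standard pix2pix 70 × 70 PatchGAN layout, which the sketch below assumes. The channel widths (64, 128, 256, 512), the 4 × 4 kernels, and the LeakyReLU slope of 0.2 are taken from that standard layout rather than from this paper, and the class name `PatchDiscriminator` is hypothetical.

```python
import torch
import torch.nn as nn


class PatchDiscriminator(nn.Module):
    """Conditional PatchGAN: maps a concatenated (x, y) pair to a patch map."""

    def __init__(self, in_ch: int = 2, base: int = 64):
        super().__init__()

        def block(c_in, c_out, stride, norm=True):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_ch, base, stride=2, norm=False),  # no normalization in layer 1
            *block(base, base * 2, stride=2),
            *block(base * 2, base * 4, stride=2),
            *block(base * 4, base * 8, stride=1),
            nn.Conv2d(base * 8, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Condition on the source slice by channel-wise concatenation.
        return self.net(torch.cat([x, y], dim=1))
```

For 256 × 256 inputs the spatial size evolves as 256 → 128 → 64 → 32 → 31 → 30, giving the 30 × 30 output map mentioned above.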

3.4. Loss Functions and Training Strategies

The choice of loss functions plays a critical role in determining synthesis quality in medical imaging synthesis tasks. Pixel-wise losses such as L1 and L2, while ensuring basic structural fidelity, often produce over-smoothed results lacking high-frequency details [37]. Perceptual losses computed on deep feature representations can better capture semantic content but may introduce artifacts [38]. In contrast, SSIM has been widely adopted in medical imaging due to its alignment with the human perception of image quality and structural preservation [39]. Adversarial training further encourages the synthesis of realistic local details by introducing a discriminator network that distinguishes between real and synthesized images [7]. Recent works have demonstrated that combining multiple complementary losses yields successful results [21].

3.4.1. Adversarial Learning with LSGAN Objective

We adopt the least squares GAN (LSGAN) [27] over the standard GAN to mitigate gradient saturation and improve training stability. The discriminator loss is defined as:

L_{D} = \frac{1}{2} E_{x, y} [(D (x, y) - 1)^{2}] + \frac{1}{2} E_{x, ŷ} [D (x, ŷ)^{2}]

(12)

while the generator adversarial loss is given by:

L_{adv} = \frac{1}{2} E_{x, ŷ} [(D (x, ŷ) - 1)^{2}] .

(13)

The least squares objective alleviates vanishing gradients and empirically demonstrates superior training stability in medical imaging applications compared with the binary cross-entropy GAN objective. The adversarial term is introduced after a warm-up period, allowing the generator to first learn a stable reconstruction prior.
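Equations (12) and (13) translate into two short functions operating on the discriminator's patch maps. This is a minimal sketch; the function names are hypothetical, and expectations are approximated by the mean over the batch and over all patches.

```python
import torch


def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Eq. 12: push D(x, y) toward 1 and D(x, y_hat) toward 0."""
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()


def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Eq. 13: push D(x, y_hat) toward 1 from the generator's side."""
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```

When updating the discriminator, `d_fake` would normally be computed on a detached generator output so that no gradient flows back into the generator during that step.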

3.4.2. L1 Loss for Pixel-Level Accuracy

We employ L1 loss rather than L2 (mean squared error) for two principal reasons rooted in medical imaging characteristics. First, L1 loss exhibits superior robustness to intensity outliers, which frequently occur in clinical MRI due to motion artifacts, susceptibility distortions, and scanner-specific noise patterns [37]. Second, L1 minimization naturally encourages sharper edge preservation compared with L2, which tends to produce overly smooth reconstructions. Formally, the L1 reconstruction loss is defined as:

L_L1 = E_(x,y) [‖ŷ − y‖₁],

(14)

where y denotes the ground truth target image and ŷ represents the synthesized output generated by the model.

3.4.3. SSIM Loss for Structural Similarity

Following Wang et al. [39], we compute SSIM using an 11 × 11 Gaussian window with σ = 1.5:

SSIM (y, ŷ) = ((2μ_yμ_ŷ + C₁)(2σ_yŷ + C₂))/((μ_y² + μ_ŷ² + C₁)(σ_y² + σ_ŷ² + C₂)),

(15)

where μ_y and μ_ŷ denote local mean intensities, σ_y² and σ_ŷ² represent local variances, σ_yŷ is the local covariance, and C₁ = (0.01)² and C₂ = (0.03)² are stability constants to avoid division by zero.

The incorporation of SSIM loss is particularly important for MRI synthesis tasks, as it captures perceptual structural similarity by jointly considering luminance, contrast, and structural correlations within local neighborhoods [39]. These properties align closely with human visual perception and radiological interpretation. Furthermore, clinical diagnosis fundamentally relies on structural anatomical patterns and tissue contrast relationships rather than exact pixel-level intensity reproduction, making SSIM a more clinically relevant optimization target than pure L1 or L2 losses.
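A differentiable SSIM of this form can be sketched with Gaussian-weighted local statistics computed by convolution. The sketch assumes single-channel images scaled to [0, 1], so that the dynamic range is 1 and the constants reduce to the values given above. It uses unpadded ("valid") convolution and averages the SSIM map to a scalar; both are implementation choices not specified in the text.

```python
import torch
import torch.nn.functional as F


def gaussian_window(size: int = 11, sigma: float = 1.5) -> torch.Tensor:
    """Normalized 2D Gaussian kernel of shape (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return (g[:, None] * g[None, :])[None, None]


def ssim(y: torch.Tensor, y_hat: torch.Tensor,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Mean SSIM for images of shape (B, 1, H, W) with intensities in [0, 1]."""
    w = gaussian_window().to(device=y.device, dtype=y.dtype)
    mu_y = F.conv2d(y, w)                          # local means
    mu_p = F.conv2d(y_hat, w)
    var_y = F.conv2d(y * y, w) - mu_y ** 2         # local variances
    var_p = F.conv2d(y_hat * y_hat, w) - mu_p ** 2
    cov = F.conv2d(y * y_hat, w) - mu_y * mu_p     # local covariance
    num = (2 * mu_y * mu_p + c1) * (2 * cov + c2)
    den = (mu_y ** 2 + mu_p ** 2 + c1) * (var_y + var_p + c2)
    return (num / den).mean()
```

The corresponding loss term is `1 - ssim(y, y_hat)`, which is zero for identical images and grows as local structure diverges.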

3.4.4. Hybrid Reconstruction Loss Function

To ensure both voxel-wise accuracy and perceptual structural fidelity, we employ a hybrid reconstruction loss:

L_recon = L_L1 + λ_SSIM (1 − SSIM),

(16)

where λ_SSIM controls the contribution of structural similarity.

The final generator loss is defined as:

L_G = L_recon + λ_adv L_adv,

(17)

This formulation balances pixel-level accuracy with perceptual structural consistency and adversarial realism.
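The way the terms combine, including the adversarial warm-up mentioned in Section 3.4.1, can be sketched as below. The sketch assumes the adversarial term enters the generator loss as λ_adv · L_adv. The weights `lambda_ssim` and `lambda_adv` and the warm-up length are placeholder values chosen for illustration, since the text does not report them here.

```python
def generator_loss(l1, ssim_value, adv, epoch,
                   lambda_ssim=1.0,    # placeholder weight, not from the paper
                   lambda_adv=0.01,    # placeholder weight, not from the paper
                   warmup_epochs=5):   # placeholder warm-up length
    """Combine reconstruction and adversarial terms for one training step.

    Works with Python floats or with scalar tensors.
    """
    recon = l1 + lambda_ssim * (1.0 - ssim_value)        # Eq. 16
    # The adversarial term is switched on only after the warm-up period.
    adv_weight = lambda_adv if epoch >= warmup_epochs else 0.0
    return recon + adv_weight * adv                      # Eq. 17
```

During warm-up the generator is trained on the reconstruction loss alone, which gives it a stable starting point before the discriminator signal is added.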

4. Clinical Dataset and Preprocessing

4.1. Data Acquisition and Dataset Characteristics

The imaging data utilized in this study were retrospectively obtained from the clinical MRI archive of the Department of Radiology, Faculty of Medicine at our institution. Prior to data acquisition, ethical approval was obtained from the institutional ethics committee, and all procedures were conducted in accordance with institutional guidelines. All personally identifiable information was removed in accordance with the ethical approval. Due to the retrospective design and fully anonymized nature of the data, demographic and clinical attributes (e.g., age, sex, diagnosis) were not retained or used in this study.

The dataset consists of anonymized clinical brain MRI examinations of 100 subjects. Subjects were retrospectively selected from existing clinical archives by board-certified radiologists. For each subject, T1-weighted and T2-weighted sequences were acquired in close temporal proximity, yielding a total of 2147 axial slice pairs. MRI data were acquired using three Siemens scanners (Aera, MAGNETOM Lumina, and Verio) at 1.5 T and 3.0 T field strengths. T1-weighted sequences were obtained with TR ranging from 220 to 447 ms and TE ranging from 2.46 to 11.00 ms. T2-weighted sequences were obtained with TR ranging from 3100 to 9000 ms and TE ranging from 90 to 113 ms. Slice thickness ranged from 4.5 to 5.5 mm for both sequences. In-plane resolution ranged from 0.43 to 0.78 mm/pixel for T1 and 0.38 to 0.75 mm/pixel for T2. The heterogeneity in acquisition parameters across scanners and field strengths reflects the variability inherent in real-world clinical practice.

A distinctive characteristic of this dataset is that it is derived from raw clinical acquisitions rather than public benchmarks, which constitutes an important source of originality for the proposed study. Unlike commonly used intensity standardized and spatially optimized public datasets, our dataset preserves the inherent variability of real-world clinical imaging environments. While this variability substantially increases synthesis difficulty, it significantly enhances the clinical relevance and translational value of the proposed approach.

Subjects were identified among patients presenting to the emergency department with headache complaints and no radiologically significant pathological findings. Case selection was performed by board-certified radiologists to ensure the absence of structural abnormalities that could confound the synthesis task.

4.2. Preprocessing Pipeline

All imaging data were initially stored in native DICOM format. Each DICOM series was converted to NIfTI format using the SimpleITK library. After format conversion, volumes were rigidly registered to a reference space using SimpleITK’s registration framework to establish voxel-wise correspondence between modalities [40]. This registration step ensures that corresponding anatomical structures across modalities are spatially aligned prior to slice extraction and learning. Following registration, 2D axial slices were extracted from the co-registered volumes. Slice pairing was performed at the patient level using center-aligned overlap pairing within anatomical bands to ensure that each T1 slice corresponded anatomically to its T2 counterpart. Border slices were excluded to avoid partial volume artifacts and noninformative samples. To enable batch-based training and architectural consistency, all slices were resampled to a fixed size of 256 × 256 pixels. Each slice was normalized independently using percentile-based intensity normalization, mapping voxel intensities to the [0, 1] range.
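The per-slice normalization step can be illustrated with a minimal sketch. The percentile bounds (`p_low`, `p_high`) and the helper names are assumptions introduced here for illustration; the text does not state which percentiles were used, and the actual pipeline may differ.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def normalize_slice(slice_2d, p_low=1.0, p_high=99.0):
    """Clip a 2D slice to the [p_low, p_high] percentiles and rescale to [0, 1].

    p_low and p_high are assumed values; the paper does not state them.
    """
    flat = sorted(v for row in slice_2d for v in row)
    lo = percentile(flat, p_low)
    hi = percentile(flat, p_high)
    if hi <= lo:  # constant slice: avoid division by zero
        return [[0.0 for _ in row] for row in slice_2d]
    scale = hi - lo
    return [[min(max((v - lo) / scale, 0.0), 1.0) for v in row] for row in slice_2d]


# Toy 2 x 3 "slice" with one bright outlier; the bounds are chosen for readability.
demo = normalize_slice([[0, 10, 20], [30, 40, 4000]], p_low=0.0, p_high=80.0)
```

Clipping at percentiles rather than at the raw minimum and maximum keeps a single bright outlier from compressing the remaining intensities into a narrow band, which matters for unstandardized clinical acquisitions.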

This carefully designed preprocessing strategy enables robust training on real-world clinical data. The final dataset exhibits substantial heterogeneity representative of clinical practice, including scanner variability, acquisition-related artifacts observed during quality control, and incidental findings such as white matter hyperintensities and chronic infarcts.

5. Experiments

5.1. Experimental Setup

The proposed model was implemented in PyTorch 2.5.1 and trained on a single NVIDIA GeForce RTX 4080 GPU. Training and inference were performed on 256 × 256 axial slices, consistent with the preprocessing pipeline described in Section 4.2. The dataset was split into training, validation, and test sets using a patient-level separation strategy to prevent information leakage across splits. In total, 70% of subjects were assigned to the training set, while the remaining subjects were randomly divided between the validation (15%) and test (15%) sets. This strategy ensured that slices from the same subject did not appear in different subsets, providing a realistic assessment of the model’s generalization performance on unseen subjects. Mini-batch training was employed with a batch size of 8, selected to balance computational efficiency and memory constraints. Both the generator and discriminator networks were initialized using default PyTorch weight initialization schemes.
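A patient-level split of this kind can be sketched as follows. The subject identifiers, the `slice_index` mapping, and the random seed are hypothetical and only serve to show that the split is made over subjects first and that slices are gathered afterwards.

```python
import random


def split_subjects(subject_ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle subject IDs once and cut them into train/val/test lists."""
    ids = sorted(subject_ids)          # deterministic starting order
    random.Random(seed).shuffle(ids)   # seed is an illustrative assumption
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def slices_for(subjects, slice_index):
    """Collect the slice identifiers belonging to the given subjects only."""
    return [s for subj in subjects for s in slice_index[subj]]


# Hypothetical index: subject ID -> list of slice identifiers.
slice_index = {f"sub{i:03d}": [f"sub{i:03d}_z{k}" for k in range(3)] for i in range(100)}
train_ids, val_ids, test_ids = split_subjects(slice_index.keys())
train_slices = slices_for(train_ids, slice_index)
```

Because the partition is defined on subject IDs, no slice from a given subject can land in more than one subset, regardless of how many slices each subject contributes.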

To assess the statistical significance of performance differences, we conducted two-sided Wilcoxon signed-rank tests between the proposed method and each baseline model. Tests were performed on the test set, with each slice treated as an independent observation. A significance threshold of p < 0.05 was adopted.
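For readers unfamiliar with the test, the following is a minimal standard-library sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. It omits tie and continuity corrections, so its p-values are approximate; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used. The per-slice scores are made-up toy values, not results from this study.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test with a normal approximation.

    Returns (W, p) where W = min(W+, W-). Zero differences are dropped and
    tied absolute differences receive average ranks. No tie or continuity
    correction is applied, so p is approximate.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0   # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail probability
    return w, p


# Hypothetical per-slice SSIM values for two methods on the same ten slices.
proposed = [0.80, 0.79, 0.82, 0.78, 0.81, 0.83, 0.77, 0.80, 0.79, 0.84]
baseline = [0.75, 0.77, 0.76, 0.74, 0.80, 0.78, 0.75, 0.73, 0.76, 0.79]
w_stat, p_value = wilcoxon_signed_rank(proposed, baseline)
```

The test is paired: each slice contributes one difference between the two methods, and only the signs and ranks of those differences enter the statistic.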

5.2. Training Details

Two separate Adam optimizers were employed for the generator and discriminator, each with independently tuned learning rates. The generator employed a conservative learning rate of 5 × 10⁻⁵ to prevent mode collapse, while the discriminator used 1 × 10⁻⁴ (2× ratio) to ensure sufficient adversarial feedback during training. Momentum parameters were set to β₁ = 0.5 (reduced from the typical 0.9 to dampen oscillations in adversarial dynamics) and β₂ = 0.999 for second moment estimation. To stabilize optimization, gradient clipping was applied to both networks, preventing gradient explosion and improving convergence behavior.
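The stated optimizer settings and one common form of gradient clipping are sketched below in plain Python. The text does not say whether clipping was applied by norm or by value, nor what threshold was used; the global L2 norm variant and `max_grad_norm = 1.0` are therefore assumptions. In PyTorch, the corresponding utility would be `torch.nn.utils.clip_grad_norm_`.

```python
import math

# Learning rates and betas are taken from the text; max_grad_norm is NOT given
# there and is a hypothetical value used only for this illustration.
OPTIM = {
    "generator": {"lr": 5e-5, "betas": (0.5, 0.999)},
    "discriminator": {"lr": 1e-4, "betas": (0.5, 0.999)},
    "max_grad_norm": 1.0,
}


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their joint L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads), total
    scale = max_norm / total
    return [g * scale for g in grads], total


clipped, norm_before = clip_by_global_norm([3.0, 4.0], OPTIM["max_grad_norm"])
lr_ratio = OPTIM["discriminator"]["lr"] / OPTIM["generator"]["lr"]
```

Norm-based clipping preserves the direction of the update and only shortens it, which is why it is a frequent choice for stabilizing adversarial training.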

Training was performed for 30 epochs, with the generator and discriminator updated alternately at each iteration. The proposed framework was trained in an adversarial manner using the least squares GAN (LSGAN) objective. The hybrid generator loss combined an L1 reconstruction term, an SSIM-based structural similarity term, and the LSGAN adversarial term, with λ_SSIM = 0.1 and λ_adv = 0.01; the adversarial term was introduced after the first epoch to prevent early generator collapse.
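The weighting scheme can be sketched with scalar stand-ins for the individual loss terms. Several details here are assumptions rather than statements from the text: an L1 weight of 1.0, an SSIM loss of the form 1 − SSIM, zero-based epoch counting, LSGAN targets of 1 (real) and 0 (fake), and the factor 0.5 in the discriminator term.

```python
LAMBDA_SSIM = 0.1   # from the text
LAMBDA_ADV = 0.01   # from the text
LAMBDA_L1 = 1.0     # assumed: the text does not state an L1 weight


def lsgan_generator_term(d_fake):
    """Least squares generator term: mean of (D(G(x)) - 1)^2."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def lsgan_discriminator_term(d_real, d_fake):
    """Least squares discriminator term with targets 1 (real) and 0 (fake)."""
    real = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real + fake)


def generator_loss(l1, ssim, d_fake, epoch):
    """Hybrid loss; the adversarial term is switched on after epoch 0."""
    adv_weight = LAMBDA_ADV if epoch >= 1 else 0.0   # epochs counted from 0 (assumed)
    return (LAMBDA_L1 * l1
            + LAMBDA_SSIM * (1.0 - ssim)
            + adv_weight * lsgan_generator_term(d_fake))


# Toy scalar inputs: L1 = 0.05, SSIM = 0.80, discriminator outputs of 0.5.
warmup = generator_loss(l1=0.05, ssim=0.80, d_fake=[0.5, 0.5], epoch=0)
later = generator_loss(l1=0.05, ssim=0.80, d_fake=[0.5, 0.5], epoch=1)
```

With these toy inputs the adversarial term changes the total only slightly, which illustrates how a small λ_adv keeps the reconstruction terms dominant while still providing adversarial feedback.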

5.3. Evaluation Metrics

Model performance was assessed using both quantitative image similarity metrics and qualitative visual inspection. Quantitative evaluation focused on the structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR), which are widely used metrics in medical image synthesis tasks. PSNR quantifies overall reconstruction accuracy as

PSNR = 10 log₁₀(MAX²/MSE), (18)

where MAX = 1.0 for normalized images. While PSNR emphasizes pixel-wise fidelity, it may not strongly correlate with perceptual quality in medical imaging contexts. SSIM measures perceptual structural similarity using an 11 × 11 Gaussian window (σ = 1.5) as defined in Section 3.4.3.
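As a worked illustration of these definitions, the sketch below computes PSNR for images normalized to [0, 1] and builds the normalized 11 × 11 Gaussian window (σ = 1.5) that weights the local statistics in SSIM. The function names are introduced here for illustration, and the full SSIM computation is intentionally left out.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB for two equally sized 2D images with intensities in [0, max_val]."""
    n = sum(len(row) for row in target)
    mse = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n
    if mse == 0.0:
        return math.inf   # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def gaussian_window(size=11, sigma=1.5):
    """Normalized 2D Gaussian window (outer product of a 1D kernel with itself)."""
    c = (size - 1) / 2.0
    g = [math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    s = sum(g)
    g = [v / s for v in g]
    return [[a * b for b in g] for a in g]


value_db = psnr([[0.5, 0.5]], [[0.0, 1.0]])   # MSE = 0.25, so PSNR = 10 log10(4)
window = gaussian_window()
```

Because MAX = 1.0, the expression reduces to −10 log₁₀(MSE), so a tenfold reduction in mean squared error corresponds to a 10 dB gain.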

In addition to quantitative metrics, qualitative visual assessment was conducted by board-certified radiologists to evaluate anatomical fidelity and clinical plausibility of the synthesized images.

5.4. Model Complexity Analysis

The proposed fusion U-Net transformer contains 16.21 M trainable parameters, with a computational cost of 53.07 GMACs. At inference time, the model processes an MRI slice in 9.68 ms on a single NVIDIA RTX 4080 GPU. Together, these figures suggest a computationally efficient model suitable for clinical deployment scenarios. For comparison, baseline model parameter counts are: U-Net (31.04 M), Pix2pix (54.41 M), CycleGAN (11.37 M), and ResViT (44.41 M). The relatively compact parameter count is achieved through architectural design, where axial attention and the transformer bottleneck operate on reduced spatial representations rather than full-resolution feature maps. This strategy constrains computational overhead while preserving global anatomical context modeling capacity.
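The reported figures can be put into perspective with simple arithmetic. The sketch below derives throughput from the per-slice latency and compares parameter counts; the weight-memory estimate assumes 4 bytes per parameter (float32) and ignores activations and buffers, which is an assumption not stated in the text.

```python
LATENCY_MS = 9.68                       # per-slice inference time from the text
PARAMS_M = {                            # parameter counts in millions, from the text
    "Proposed": 16.21, "U-Net": 31.04, "Pix2pix": 54.41,
    "CycleGAN": 11.37, "ResViT": 44.41,
}

slices_per_second = 1000.0 / LATENCY_MS
relative_size = {name: PARAMS_M["Proposed"] / p for name, p in PARAMS_M.items()}
# Assumes float32 weights (4 bytes each); activations and buffers are ignored.
weights_mb = PARAMS_M["Proposed"] * 1e6 * 4 / 1e6

for name, ratio in relative_size.items():
    print(f"{name:9s} proposed/baseline parameter ratio = {ratio:.2f}")
print(f"throughput ~ {slices_per_second:.1f} slices/s, weights ~ {weights_mb:.1f} MB")
```

The ratios show that the proposed model is smaller than U-Net, Pix2pix, and ResViT but larger than CycleGAN, so "compact" here is relative to most, not all, of the baselines.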

6. Results

6.1. Quantitative Results

The quantitative performance of the proposed bidirectional MRI synthesis framework and baseline models is summarized in Table 2. All models were trained and evaluated on the same dataset and patient-level splits to ensure a fair comparison. Results are reported as mean ± standard deviation across independent runs using SSIM and PSNR metrics. For the T1 → T2 synthesis direction, the proposed method achieved an SSIM of 0.787 ± 0.015 and a PSNR of 21.06 ± 0.13 dB. In the reverse T2 → T1 direction, higher quantitative performance was observed, with an SSIM of 0.830 ± 0.005 and a PSNR of 24.03 ± 0.072 dB. The relatively stronger performance in the T2 → T1 direction may reflect inherent differences in contrast characteristics and signal distribution between T1- and T2-weighted imaging. The proposed method consistently outperformed pix2pix, CycleGAN, ResViT, and Simple U-Net across both synthesis directions and both metrics. Statistical significance of all pairwise comparisons was confirmed via two-sided Wilcoxon signed-rank tests (p < 0.001 across all baseline comparisons).

To further investigate the contribution of individual architectural components and loss functions, an ablation study was conducted, and the results are shown in Table 3.

Excluding the SSIM loss leads to a moderate decrease in SSIM across both synthesis directions, while PSNR exhibits only minor variation. This suggests that the SSIM loss contributes primarily to structural similarity optimization rather than pixel-level reconstruction accuracy. Removing the adversarial loss also results in a reduction in SSIM in both directions, indicating that GAN supervision contributes to improved structural consistency beyond pure reconstruction objectives. The effect is particularly noticeable in the T1 → T2 direction, suggesting that adversarial learning plays an important role in modeling complex contrast transformations. In addition to improving quantitative similarity, the adversarial term promotes sharper local texture representation and enhanced perceptual realism.

When axial attention is removed, the overall quantitative metrics remain largely comparable in both synthesis directions, with only minor performance fluctuations. This behavior is consistent with its intended role in the framework: axial attention provides an efficient mechanism for incorporating global anatomical context while avoiding the quadratic computational complexity of full self-attention. These results indicate that axial attention helps preserve quantitative performance while reducing computational complexity.
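The complexity argument can be made concrete by counting query-key interactions. Full self-attention over an H × W feature map compares every position with every other position, whereas axial attention restricts each position to its own row and then its own column. The 32 × 32 map size below is a hypothetical example, not a value taken from the architecture.

```python
def full_attention_pairs(h, w):
    """Query-key pairs when every position attends to every position."""
    n = h * w
    return n * n


def axial_attention_pairs(h, w):
    """Query-key pairs for one row-wise pass plus one column-wise pass."""
    # Row pass: each of the h*w positions attends to the w positions in its row.
    # Column pass: each of the h*w positions attends to the h positions in its column.
    return h * w * w + w * h * h


# 32 x 32 is a hypothetical feature-map size, not a value from the paper.
H, W = 32, 32
full = full_attention_pairs(H, W)
axial = axial_attention_pairs(H, W)
saving = full / axial
```

In general the ratio is HW / (H + W), so the saving grows with the spatial resolution at which attention is applied.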

Removing the transformer bottleneck leads to a consistent performance drop in both directions, indicating that global context modeling at the bottleneck contributes meaningfully to overall synthesis performance. This degradation further suggests that the transformer bottleneck plays a critical role in modeling long-range spatial dependencies that cannot be effectively captured through local convolutional operations alone.

Finally, excluding the fusion refinement block results in a consistent reduction in PSNR for both synthesis directions, accompanied by a modest decrease in SSIM. This degradation suggests that the refinement stage plays a critical role in enhancing local texture fidelity and stabilizing intensity reconstruction. The refinement mechanism helps mitigate artifacts propagated through skip connections and improves overall signal consistency.

Overall, the ablation results demonstrate that the combined integration of architectural components leads to more balanced and robust performance across structural similarity and signal fidelity metrics.

6.2. Qualitative Results

Representative qualitative results for both synthesis directions (T1 → T2 and T2 → T1) are presented in Figure 2 and Figure 3. In each figure, multiple axial slices from different subjects are shown to illustrate the visual performance of the proposed model across varying anatomical levels.

Visual inspection indicates that the proposed framework successfully reconstructs the characteristic contrast properties of the target modality while maintaining global anatomical integrity. Major structural elements are consistently preserved across subjects and slice levels throughout both synthesis directions. The synthesized images demonstrate strong structural correspondence with the ground truth.

The absolute difference maps provide a detailed visualization of residual discrepancies between synthesized and ground truth images. Differences are primarily localized to high-frequency edge regions and subtle intensity transitions, while overall anatomical consistency is preserved. This pattern suggests that the model effectively captures global anatomical structure while minor deviations occur in fine-scale intensity variations.

Overall, the qualitative analysis indicates that the proposed framework maintains global anatomical integrity while preserving fine-grained structural details across both synthesis directions. These findings are further examined through expert radiological evaluation in the following section.

6.3. Expert Radiological Evaluation

To assess clinical relevance beyond quantitative similarity metrics, a double-blinded expert evaluation was conducted by two board-certified radiologists with 12 and 4 years of clinical experience in radiology, respectively. The radiologists independently reviewed synthesized T1- and T2-weighted MRI images, and the evaluation focused on anatomical consistency, signal intensity characteristics, presence of artifacts, and overall diagnostic image quality.

Across both synthesis directions (T1 → T2 and T2 → T1), the radiologists reported that the synthesized images demonstrated high anatomical fidelity and preserved major structural landmarks, including cortical boundaries, ventricular morphology, deep gray matter structures, cerebellar architecture, and brainstem anatomy. In the T1 → T2 direction, hyperintense cerebrospinal fluid (CSF) regions and parenchymal signal intensity patterns were reproduced consistently, while in the T2 → T1 direction, appropriate white/gray matter differentiation and physiologically reversed fluid signal intensity were restored. The experts noted that signal transitions appeared coherent and clinically plausible.

Importantly, the retrospective clinical nature of the dataset introduced inherent slice-level misalignment in certain subjects. Due to acquisition time differences between sequences and minor patient motion during prolonged MRI scans, exact voxel-wise correspondence between T1 and T2 slices was not always achievable, even after registration. In several cases, the radiologists observed that the synthesized output appeared anatomically more consistent with the expected slice position than the nominal ground truth reference. In other words, when mild inter-slice misalignment was present in the paired data, the model generated anatomically coherent structures that aligned with the underlying anatomy rather than replicating ground truth positional inconsistencies.

The perception that T2 images offer greater visual detail and clarity compared with T1 images is primarily attributable to a higher contrast-to-noise ratio (CNR) and a broader dynamic range of inter-tissue signal differences rather than superior spatial resolution. Consequently, in cross-modality image synthesis, translating from T2 to T1 tends to yield more detailed results than the reverse process (T1 to T2), even when both source images are of optimal diagnostic quality.

The radiologists further emphasized that minor residual differences were primarily localized to high-frequency cortical edges and subtle intensity transitions, which are expected given the different physical contrast mechanisms underlying T1- and T2-weighted imaging. They also noted that signal variability between synthesis directions may partly reflect these fundamental MR physics differences and the heterogeneity of raw clinical acquisition parameters.

Overall, both experts concluded that the synthesized images were structurally consistent, diagnostically interpretable, and of clinically acceptable quality. The reconstructions demonstrated minimal perceptible artifacts and preserved essential anatomical detail necessary for routine radiological assessment. These findings support the translational potential of the proposed framework, demonstrating its efficacy in bidirectional synthesis for the generation of various MRI sequences within realistic clinical environments.

7. Discussion

This study presents a bidirectional MRI synthesis framework evaluated on retrospectively acquired raw clinical data, addressing both T1 → T2 and T2 → T1 translation tasks under realistic acquisition conditions. Unlike many prior studies that rely on pre-aligned public datasets with standardized imaging protocols, the proposed framework was assessed on heterogeneous clinical scans, introducing challenges related to slice-level misalignment, intensity variability, and acquisition-dependent contrast differences.

A substantial body of literature has investigated MRI-to-MRI synthesis using deep learning, often reporting strong quantitative performance on curated public benchmark datasets. Conditional GAN-based frameworks, such as those proposed by Dar et al. [20] and Sharma et al. [21], demonstrated high SSIM and PSNR values for multi-contrast and missing sequence synthesis tasks under controlled acquisition and preprocessing settings. Similarly, Kawahara and Nagata [22] specifically addressed bidirectional T1 ↔ T2 synthesis using a pix2pix-style GAN, reporting competitive quantitative results while emphasizing the strong dependence of synthesis performance on input preprocessing strategies, resolution, and intensity scaling. More recent hybrid architectures incorporating transformer components have further advanced MRI synthesis performance. Transformer-enhanced models, including ResViT [30], have demonstrated improved global context modeling and high quantitative accuracy on public datasets through the integration of residual vision transformers with convolutional backbones.

However, these approaches are primarily evaluated on carefully curated datasets that undergo extensive preprocessing, including rigid or deformable registration, skull stripping, intensity normalization, and slice selection. In contrast, the proposed framework is evaluated on retrospectively acquired raw clinical MRI data, where acquisition timing differences, protocol heterogeneity, and patient motion introduce unavoidable slice-level misalignment and intensity variability. Under these realistic conditions, voxel-wise similarity metrics such as SSIM and PSNR become highly sensitive to minor spatial inconsistencies, making direct numerical comparisons with results reported on curated benchmark datasets inherently limited.

Within this clinical setting, the proposed framework achieved an SSIM of 0.787 and a PSNR of 21.06 dB for T1 → T2 synthesis, and an SSIM of 0.830 and a PSNR of 24.03 dB for T2 → T1 synthesis. The framework further demonstrates stable quantitative performance across synthesis directions, strong qualitative anatomical fidelity, and favorable expert radiological assessment. To further contextualize these results, the proposed framework was evaluated against representative baseline architectures, including pix2pix, CycleGAN, ResViT, and a standard U-Net, under identical training and evaluation conditions on the same heterogeneous clinical dataset. The proposed method consistently and statistically significantly (p < 0.001) outperformed all baseline models in both synthesis directions, indicating that the observed metric values primarily reflect dataset difficulty rather than architectural limitations alone.

Notably, the simple U-Net yields the lowest SSIM values, underscoring the importance of adversarial and perceptual supervision in synthesis tasks. CycleGAN, despite enabling unpaired synthesis, consistently underperforms paired methods, reflecting the known limitation of cycle-consistency constraints in preserving anatomical fidelity without direct pixel-level supervision. Pix2pix, as a paired conditional GAN framework, achieves intermediate performance relative to the proposed method, demonstrating that adversarial supervision improves synthesis quality over pure reconstruction-based training. ResViT, as a modern hybrid transformer-based architecture, achieves competitive results but is still outperformed by the proposed method. Together, these results indicate that the proposed framework maintains robust synthesis performance even under acquisition variability that typically limits the effectiveness of benchmark-optimized architectures. Compared with baseline methods, the proposed fusion U-Net transformer achieves more consistent structural preservation and contrast representation through the integration of convolutional inductive biases, transformer-based global context modeling, and refinement mechanisms.

An important observation emerging from the expert radiological evaluation concerns the impact of slice-level misalignment in the ground truth data on quantitative similarity metrics. Due to the retrospective clinical acquisition protocol, exact voxel-wise correspondence between paired T1 and T2 slices could not always be guaranteed, even after registration. In several cases, the radiologists reported that the synthesized images appeared anatomically more consistent with the expected slice position than the nominal ground truth reference. This finding suggests that the proposed model does not simply replicate voxel-level intensity patterns from imperfectly aligned targets but instead learns anatomically plausible representations conditioned on the input image. As a result, minor deviations between synthesized and ground truth images, particularly in cases of inter-slice mismatch, may lead to lower SSIM and PSNR values despite visually and clinically coherent reconstructions.

These observations highlight an important limitation of voxel-wise quantitative metrics when applied to raw clinical datasets, where ground truth images may themselves contain positional inconsistencies. In such scenarios, SSIM and PSNR can penalize anatomically correct predictions that diverge from misaligned references. The expert feedback therefore indicates that the reported quantitative results likely represent a conservative estimate of the true synthesis quality, reinforcing the importance of complementary qualitative and clinical evaluation.
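The sensitivity of voxel-wise metrics to small positional inconsistencies can be illustrated with a toy example: a synthetic step edge compared against an identical reference and against the same edge displaced by one pixel. The images and helper names are constructed for illustration only and do not represent study data.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB for two equally sized 2D images with intensities in [0, max_val]."""
    n = sum(len(r) for r in target)
    mse = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n
    return math.inf if mse == 0.0 else 10.0 * math.log10(max_val ** 2 / mse)


def step_edge(size, edge_col):
    """size x size image: 0 left of edge_col, 1 from edge_col onward."""
    return [[1.0 if c >= edge_col else 0.0 for c in range(size)] for _ in range(size)]


prediction = step_edge(8, 4)       # edge at the anatomically "correct" position
aligned_ref = step_edge(8, 4)      # perfectly aligned reference
shifted_ref = step_edge(8, 5)      # same content, displaced by one pixel

psnr_aligned = psnr(prediction, aligned_ref)
psnr_shifted = psnr(prediction, shifted_ref)
```

In this toy case a one-pixel displacement of the reference changes one column of eight pixels (MSE = 8/64), which takes PSNR from unbounded to roughly 9 dB although the depicted structure is identical. Real images are far less extreme, but the direction of the effect is the same.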

Furthermore, the quantitative results demonstrate consistent and stable performance across both synthesis directions, with higher SSIM and PSNR values observed for the T2 → T1 task compared with T1 → T2. This asymmetry aligns with known differences in the contrast mechanisms underlying T1- and T2-weighted imaging. T2-weighted images typically exhibit higher contrast-to-noise ratios and more pronounced tissue intensity separation, providing richer contextual information for learning the inverse T2 → T1 mapping. In contrast, T1 → T2 synthesis requires the generation of fluid-sensitive contrast patterns that are inherently more susceptible to intensity ambiguity and noise.

The ablation study provides insight into the complementary roles of individual architectural components. Removing the transformer bottleneck resulted in a measurable reduction in SSIM across both synthesis directions, highlighting its role in capturing long-range anatomical dependencies beyond local convolutional receptive fields. Removing axial attention resulted in only minor quantitative variations across both synthesis directions, indicating that its primary contribution is not direct metric maximization. Instead, consistent with its intended role in the framework, axial attention enables efficient global context modeling while avoiding the quadratic computational complexity of full self-attention. By decomposing full self-attention into sequential axial operations, this mechanism allows the network to capture long-range dependencies in a computationally efficient manner, which is particularly important for medical imaging tasks. The observed performance further suggests that this attention formulation helps preserve structural consistency while reducing computational complexity, making the architecture more suitable for practical clinical deployment without sacrificing synthesis performance.

Excluding the fusion refinement block led to a consistent reduction in PSNR and a modest decrease in SSIM across both synthesis directions. This finding highlights the importance of refinement stages in stabilizing intensity reconstruction and enhancing local texture fidelity. By mitigating artifacts propagated through skip connections, the refinement mechanism improves signal consistency and reduces high-frequency noise, directly contributing to improved reconstruction quality.

The ablation results further reveal the complementary roles of the loss function components. Removing the SSIM loss leads to a decrease in SSIM across both synthesis directions, confirming its role in structural similarity optimization beyond what pixel-level L1 loss alone can achieve. Removing the adversarial loss results in a direction-dependent reduction in SSIM, with a more pronounced effect in the T1 → T2 direction, suggesting that GAN supervision is particularly important for modeling complex contrast transformations. In addition, excluding the adversarial loss led to smoother but less detailed reconstructions, indicating that GAN-based supervision contributes primarily to texture realism and edge sharpness. Together, these observations demonstrate that the combined use of reconstruction-based, structural, and adversarial supervision terms enables a balanced trade-off between intensity accuracy and anatomical realism.

The ablation results collectively indicate that no single component dominates overall performance; rather, the proposed framework benefits from the complementary interaction of global context modeling, refinement mechanisms, and multi-objective supervision. Together, these components lead to more balanced and robust performance across structural similarity and signal fidelity metrics. The full model configuration is consequently well-suited to heterogeneous clinical imaging conditions, where robustness across varying acquisition settings is essential for reliable synthesis.

Qualitative evaluation further supports the quantitative findings. Visual inspection reveals that the proposed framework preserves global anatomical integrity across subjects and slice levels, with consistent reconstruction of cortical boundaries, ventricular morphology, deep gray matter structures, and posterior fossa anatomy. Residual discrepancies observed in the absolute difference maps are primarily localized to high-frequency edge regions and subtle intensity transitions, which are expected given the intrinsic differences in T1 and T2 contrast mechanisms.

The expert radiological evaluation provides critical clinical validation beyond numerical similarity metrics. Radiologists reported that synthesized images were anatomically faithful, diagnostically interpretable, and exhibited coherent signal transitions consistent with expected tissue characteristics. Differences in perceived synthesis quality between directions were also noted, with T2 → T1 synthesis generally exhibiting sharper anatomical delineation, consistent with both quantitative findings and known MRI physics.

Overall, the proposed bidirectional MRI synthesis framework demonstrates stable quantitative performance, strong qualitative fidelity, and clinically acceptable reconstruction quality under realistic clinical data conditions. The integration of efficient global context modeling and targeted refinement mechanisms enables robust performance without excessive computational complexity, supporting the translational potential of the approach for multi-contrast MRI synthesis in real-world clinical environments.

8. Conclusions

In this work, we proposed a fusion U-Net transformer for bidirectional T1 ↔ T2 MRI synthesis operating on MRI data derived from clinical acquisitions. By addressing real-world challenges such as acquisition heterogeneity, residual slice misalignment, and intensity variability, the proposed framework moves beyond idealized benchmark settings toward realistic clinical deployment. The proposed method achieved an SSIM of 0.787 ± 0.015 and PSNR of 21.06 ± 0.13 dB for T1 → T2 synthesis, and an SSIM of 0.830 ± 0.005 and PSNR of 24.03 ± 0.072 dB for T2 → T1 synthesis, with statistically significant improvements over all baseline models including pix2pix, CycleGAN, ResViT, and simple U-Net (p < 0.001 in all comparisons). Quantitative and qualitative evaluations demonstrate that the model achieves competitive performance under challenging clinical conditions, while expert radiologist assessment confirms the preservation of diagnostically relevant anatomical features. The integration of convolutional inductive biases, axial attention, transformer-based global context modeling, and fusion refinement enables effective and computationally efficient synthesis, supporting practical clinical deployment.

To the best of our knowledge, this study represents one of the early efforts to address unified bidirectional MRI contrast synthesis on raw clinical data using a modern hybrid GAN–transformer architecture. The results demonstrate that robust bidirectional T1 ↔ T2 synthesis can be achieved under heterogeneous clinical acquisition conditions, supporting the applicability of the proposed framework in realistic clinical imaging environments. Although the dataset used in this study already incorporates variability across multiple scanner models, field strengths, and acquisition protocols, external validation on independent multi-center datasets remains an important direction for future research. Future work will focus on cross-institutional validation and extending the framework to additional MRI contrasts and CT to MRI synthesis tasks.

Author Contributions

Conceptualization, Z.C. and H.K.; methodology, Z.C.; software, Z.C.; validation, Z.C., E.C. and B.K.; formal analysis, Z.C.; investigation, Z.C.; resources, Z.C., E.C. and B.K.; data curation, Z.C., E.C. and B.K.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C., H.K., E.C. and B.K.; visualization, Z.C.; supervision, H.K.; project administration, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Gazi University Rectorate (protocol code E-77082166-302.08.01-1239445 and date of approval 14 May 2025).

Informed Consent Statement

Patient consent was waived due to the retrospective nature of this study and the use of fully anonymized MRI data, as approved by the Ethics Committee of Gazi University Rectorate.

Data Availability Statement

The data used in this study consist of retrospective clinical MRI scans obtained from Gazi University Faculty of Medicine. Due to privacy and ethical considerations, and institutional regulations, the data are not publicly available. Access to the data may be granted by the corresponding author upon reasonable request and subject to approval by the relevant ethics committee.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
LSGAN	Least Squares GAN
MRI	Magnetic Resonance Imaging
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index Measure

References

Smith-Bindman, R.; Miglioretti, D.L.; Larson, E.B. Rising use of diagnostic medical imaging in a large integrated health system. Health Aff. 2008, 27, 1491–1502.
Bitar, R.; Leung, G.; Perng, R.; Tadros, S.; Moody, A.R.; Sarrazin, J.; Roberts, T.P. MR pulse sequences: What every radiologist wants to know but is afraid to ask. Radiographics 2006, 26, 513–537.
Dayarathna, S.; Islam, K.; Uribe, S.; Yang, G.; Hayat, M.; Chen, Z. Deep learning based synthesis of MRI, CT and PET: Review and analysis. Med. Image Anal. 2023, 92, 103046.
Fard, A.S.; Reutens, D.C.; Vegh, V. From CNNs to GANs for cross-modality medical image estimation. Comput. Biol. Med. 2022, 146, 105556.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tao, D. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 1–41.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
Jog, A.; Carass, A.; Roy, S.; Pham, D.L.; Prince, J.L. Random forest regression for magnetic resonance image synthesis. Med. Image Anal. 2017, 35, 475–488.
Roy, S.; Carass, A.; Prince, J.L. Magnetic resonance image example-based contrast synthesis. IEEE Trans. Med. Imaging 2013, 32, 2348–2363.
Huang, Y.; Shao, L.; Frangi, A.F. Cross-modality image synthesis via weakly coupled and geometry co-regularized joint dictionary learning. IEEE Trans. Med. Imaging 2017, 37, 815–827.
Osman, A.F.I.; Tamam, N.M. Deep learning-based convolutional neural network for intramodality brain MRI synthesis. J. Appl. Clin. Med. Phys. 2022, 23, e13530.
Li, Z.; Huang, X.; Zhang, Z.; Liu, L.; Wang, F.; Li, S.; Xia, J. Synthesis of magnetic resonance images from computed tomography data using convolutional neural network with contextual loss function. Quant. Imaging Med. Surg. 2022, 12, 3151–3164.
Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
Nie, D.; Trullo, R.; Lian, J.; Petitjean, C.; Ruan, S.; Wang, Q.; Shen, D. Medical image synthesis with context-aware generative adversarial networks. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Quebec, QC, Canada, 10–14 September 2017; pp. 417–425.
Dar, S.U.; Yurt, M.; Karacan, L.; Erdem, A.; Erdem, E.; Çukur, T. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Trans. Med. Imaging 2019, 38, 2375–2388. [Google Scholar] [CrossRef]
Sharma, A.; Hamarneh, G. Missing MRI pulse sequence synthesis using multi-modal generative adversarial network. IEEE Trans. Med. Imaging 2019, 39, 1170–1183. [Google Scholar] [CrossRef] [PubMed]
Kawahara, D.; Nagata, Y. T1-weighted and T2-weighted MRI image synthesis with convolutional generative adversarial networks. Rep. Pract. Oncol. Radiother. 2021, 26, 35–42. [Google Scholar] [CrossRef] [PubMed]
Xu, L.; Zhang, H.; Song, L.; Lei, Y. Bi-MGAN: Bidirectional T1-to-T2 MRI images prediction using multi-generative multi-adversarial nets. Biomed. Signal Process. Control 2022, 78, 103994. [Google Scholar] [CrossRef]
Xu, L.; Lei, Y.; Shao, J.; Zeng, X.; Li, W. Modal disentangled generative adversarial networks for bidirectional magnetic resonance image synthesis. Eng. Appl. Artif. Intell. 2025, 141, 109817. [Google Scholar] [CrossRef]
Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar] [CrossRef]
Yang, H.; Sun, J.; Carass, A.; Zhao, C.; Lee, J.; Xu, Z.; Prince, J. Unpaired brain MR-to-CT synthesis using a structure-constrained CycleGAN. In Proceedings of the Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; pp. 174–182. [Google Scholar] [CrossRef]
Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.K.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2794–2802. [Google Scholar] [CrossRef]
Skandarani, Y.; Jodoin, P.-M.; Lalande, A. GANs for medical image synthesis: An empirical study. J. Imaging 2023, 9, 69. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
Dalmaz, O.; Yurt, M.; Çukur, T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans. Med. Imaging 2022, 41, 2598–2614. [Google Scholar] [CrossRef]
Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar] [CrossRef]
Wang, H.; Zhu, Y.; Green, B.; Adam, H.; Yuille, A.; Chen, L.-C. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 108–126. [Google Scholar] [CrossRef]
Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On layer normalization in the transformer architecture. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 10524–10533. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar] [CrossRef]
Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2017, 3, 47–57. [Google Scholar] [CrossRef]
Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711. [Google Scholar] [CrossRef]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Yaniv, Z.; Lowekamp, B.C.; Johnson, H.J.; Beare, R. SimpleITK image-analysis notebooks: A collaborative environment for education and reproducible research. J. Digit. Imaging 2018, 31, 290–303. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of the proposed fusion U-Net transformer architecture for bidirectional MRI synthesis.

Figure 2. Qualitative visualization of T1 → T2 MRI synthesis results. From left to right: input T1-weighted image, ground truth (GT) T2-weighted image, synthesized T2-weighted image, and absolute difference map (|Pred − GT|) visualized using a perceptually uniform colormap. Examples from multiple subjects and axial slice levels are shown to illustrate structural consistency and contrast preservation across the brain. In the error maps, darker regions indicate lower reconstruction error and brighter regions indicate greater deviation from the ground truth.

Figure 3. Qualitative visualization of T2 → T1 MRI synthesis results. From left to right: input T2-weighted image, ground truth (GT) T1-weighted image, synthesized T1-weighted image, and absolute difference map (|Pred − GT|) visualized using a perceptually uniform colormap. Examples from multiple subjects and axial slice levels are shown to illustrate structural consistency and contrast preservation across the brain. In the error maps, darker regions indicate lower reconstruction error and brighter regions indicate greater deviation from the ground truth.

Table 1. Architectural details of the discriminator.

Layer	Operation	Output Shape
1	Conv (2 → 64, k = 4, s = 2) + LeakyReLU (0.2)	(B, 64, 128, 128)
2	Conv (64 → 128, k = 4, s = 2) + InstanceNorm + LeakyReLU	(B, 128, 64, 64)
3	Conv (128 → 256, k = 4, s = 2) + InstanceNorm + LeakyReLU	(B, 256, 32, 32)
4	Conv (256 → 512, k = 4, s = 1) + InstanceNorm + LeakyReLU	(B, 512, 31, 31)
5	Conv (512 → 1, k = 4, s = 1)	(B, 1, 30, 30)
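
The output shapes in Table 1 follow from the standard convolution size formula, out = floor((in + 2p − k) / s) + 1. The short Python sketch below reproduces them under two assumptions that the table does not state: a 256 × 256 input and a padding of 1 in every layer. Both are inferred from the listed output shapes, and the function names are illustrative only.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride) for the five layers listed in Table 1.
LAYERS = [(64, 4, 2), (128, 4, 2), (256, 4, 2), (512, 4, 1), (1, 4, 1)]


def discriminator_shapes(input_size: int = 256, padding: int = 1):
    """Return (channels, height, width) after each layer.

    input_size=256 and padding=1 are assumptions, not values from the table.
    """
    shapes = []
    size = input_size
    for channels, kernel, stride in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes


if __name__ == "__main__":
    for idx, shape in enumerate(discriminator_shapes(), start=1):
        print(idx, shape)
```

Under these assumptions the final layer yields a 30 × 30 map, so each output value scores one overlapping patch of the input pair rather than the whole image.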

Table 2. Quantitative results of the proposed method against baseline models for bidirectional T1 → T2 and T2 → T1 MRI synthesis, evaluated using SSIM and PSNR (mean ± standard deviation).

Model	Direction	SSIM	PSNR
Simple U-Net	T1 → T2	0.5775 ± 0.010	20.53 ± 0.14 dB
Simple U-Net	T2 → T1	0.6323 ± 0.005	23.71 ± 0.21 dB
CycleGAN	T1 → T2	0.6425 ± 0.005	17.77 ± 0.02 dB
CycleGAN	T2 → T1	0.6541 ± 0.002	17.83 ± 0.26 dB
Pix2pix	T1 → T2	0.6895 ± 0.075	19.36 ± 0.11 dB
Pix2pix	T2 → T1	0.7415 ± 0.018	22.10 ± 0.29 dB
ResViT	T1 → T2	0.7200 ± 0.021	20.14 ± 0.30 dB
ResViT	T2 → T1	0.7670 ± 0.004	22.50 ± 0.18 dB
Our Model	T1 → T2	0.7870 ± 0.015	21.06 ± 0.13 dB
Our Model	T2 → T1	0.8300 ± 0.005	24.03 ± 0.07 dB

The best-performing results for each direction and metric (bold in the original table) are those of Our Model.
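
For reference, the PSNR of a single image pair can be computed from the mean squared error as sketched below. This is a minimal pure-Python illustration, assuming intensities scaled to [0, 1] so that the peak value is 1.0. The helper names `psnr` and `mean_std` are hypothetical, and the evaluation code behind Table 2 may differ, for example in normalization or in whether the sample or population standard deviation is reported.

```python
import math
from statistics import mean, pstdev


def psnr(pred, target, peak=1.0):
    """PSNR in dB between two equally sized flat sequences of intensities."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def mean_std(values):
    """Mean and population standard deviation (an assumed convention)."""
    return mean(values), pstdev(values)


if __name__ == "__main__":
    scores = [psnr([0.0, 1.0], [0.1, 0.9]), psnr([0.2, 0.8], [0.25, 0.75])]
    print(mean_std(scores))
```

Because PSNR is logarithmic in the error, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.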

Table 3. Ablation study analyzing the contribution of individual architectural and loss components on quantitative performance for bidirectional MRI synthesis.

Ablation Study	Direction	SSIM	PSNR
w/o ¹ SSIM loss	T1 → T2	0.7578 ± 0.0017	20.82 ± 0.11 dB
w/o ¹ SSIM loss	T2 → T1	0.8100 ± 0.0030	23.98 ± 0.15 dB
w/o ¹ GAN loss	T1 → T2	0.7674 ± 0.0011	21.06 ± 0.06 dB
w/o ¹ GAN loss	T2 → T1	0.8254 ± 0.0019	24.05 ± 0.02 dB
w/o ¹ Axial Attention	T1 → T2	0.7885 ± 0.0190	20.87 ± 0.01 dB
w/o ¹ Axial Attention	T2 → T1	0.8320 ± 0.0060	24.11 ± 0.05 dB
w/o ¹ Transformer	T1 → T2	0.7601 ± 0.0004	20.59 ± 0.10 dB
w/o ¹ Transformer	T2 → T1	0.8148 ± 0.0035	23.73 ± 0.08 dB
w/o ¹ Fusion Refine Block	T1 → T2	0.7740 ± 0.0170	20.74 ± 0.08 dB
w/o ¹ Fusion Refine Block	T2 → T1	0.8212 ± 0.0014	23.93 ± 0.11 dB

¹ w/o denotes without.
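
To read Table 3 against the full model in Table 2, it can help to tabulate the change in each metric when a component is removed. The sketch below does this arithmetic on the reported mean values only; it ignores the standard deviations, so small differences should not be read as significant. The dictionary layout and the name `ablation_deltas` are illustrative choices, not part of the original study.

```python
# Mean SSIM and PSNR (dB) copied from Tables 2 and 3; standard deviations omitted.
FULL = {"T1_to_T2": (0.7870, 21.06), "T2_to_T1": (0.8300, 24.03)}

ABLATIONS = {
    "SSIM loss": {"T1_to_T2": (0.7578, 20.82), "T2_to_T1": (0.8100, 23.98)},
    "GAN loss": {"T1_to_T2": (0.7674, 21.06), "T2_to_T1": (0.8254, 24.05)},
    "Axial Attention": {"T1_to_T2": (0.7885, 20.87), "T2_to_T1": (0.8320, 24.11)},
    "Transformer": {"T1_to_T2": (0.7601, 20.59), "T2_to_T1": (0.8148, 23.73)},
    "Fusion Refine Block": {"T1_to_T2": (0.7740, 20.74), "T2_to_T1": (0.8212, 23.93)},
}


def ablation_deltas(full=FULL, ablations=ABLATIONS):
    """Return {(component, direction): (delta_ssim, delta_psnr)}, ablated minus full."""
    deltas = {}
    for component, rows in ablations.items():
        for direction, (ssim, psnr) in rows.items():
            ref_ssim, ref_psnr = full[direction]
            deltas[(component, direction)] = (
                round(ssim - ref_ssim, 4),
                round(psnr - ref_psnr, 2),
            )
    return deltas


if __name__ == "__main__":
    for key, value in ablation_deltas().items():
        print(key, value)
```

Negative values mean that removing the component lowered the metric. On the reported means, removing the transformer bottleneck gives the largest T1 → T2 PSNR drop (−0.47 dB), while removing axial attention changes SSIM by only +0.0015 and +0.0020, which is within the reported standard deviations.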


