1. Introduction
Magnetic Resonance Imaging (MRI) has established itself as an indispensable diagnostic tool in modern medicine, owing to its exceptional soft-tissue contrast and non-invasive nature. However, the inherent trade-off between image quality, spatial resolution, and acquisition time remains a fundamental challenge in clinical practice [1]. In particular, long scan times increase patient discomfort and induce motion artifacts from involuntary movements such as breathing or heartbeat, which can severely compromise diagnostic accuracy. To address these problems, MRI acceleration intentionally undersamples k-space data and subsequently reconstructs a high-quality image from this incomplete information. The reconstruction problem in accelerated MRI is intrinsically ill-posed, admitting infinitely many potential solutions due to the undersampling of k-space, which has driven the development of diverse methodological approaches to this inverse problem [2,3,4,5,6,7].
Early MRI acceleration techniques were based on explicit physical and mathematical models [6]. Parallel imaging (PI) methods such as SENSE (SENSitivity Encoding) [2] and GRAPPA (Generalized Autocalibrating Partially Parallel Acquisitions) [3] have been widely adopted in clinical settings, leveraging the spatial sensitivity information of multiple receiver coils to compensate for missing phase-encoding steps [4]. While these methods achieve significant acceleration factors, they are fundamentally limited by their linear reconstruction models and suffer from noise amplification (g-factor) at higher acceleration factors [8].
The advent of deep learning has revolutionized medical image reconstruction, offering data-driven approaches that can learn complex, non-linear mappings from undersampled data to fully sampled images [9]. In particular, Convolutional Neural Network (CNN) architectures such as the UNet have shown remarkable success in MRI reconstruction, owing to their ability to capture local spatial features and hierarchical representations. However, because of their local receptive fields, CNNs are inherently limited in modeling the long-range dependencies that are essential for removing aliasing artifacts across the entire image and for understanding the global structural context [10,11]. The Vision Transformer (ViT) has emerged as an alternative that overcomes this limitation of CNNs [12,13,14]. The ViT treats an image as a sequence of patches and models the global relationships among all patches simultaneously using a self-attention mechanism. This capability gives the ViT a strong advantage in understanding global context, and recent research has demonstrated the potential of Transformer-based architectures in various medical imaging applications, including MRI reconstruction [10,11,15]. The ability of Transformers to process entire sequences in parallel, combined with their superior handling of long-range dependencies, makes them particularly well suited to the global nature of the MRI reconstruction problem [11,16].
However, purely Transformer-based models often require massive datasets and lack the inductive bias needed to capture local high-frequency details effectively. To mitigate these limitations, hybrid architectures that combine CNNs for local feature extraction with Transformers for global context modeling have recently attracted significant attention [17]. Nevertheless, the majority of these hybrid approaches still operate predominantly in the image domain, treating k-space data merely as a constraint rather than exploiting their inherent sequential correlations for feature learning.
Overcoming this limitation therefore calls for a hybrid architecture that operates in both the image and k-space domains. While the ViT addresses the image domain, an effective mechanism is required to model the k-space data, which exhibit strong sequential correlations along the phase-encoding direction. Recurrent Neural Networks (RNNs) have been explored in MRI reconstruction to model the iterative nature of optimization algorithms or to capture dependencies in dynamic MRI sequences [18,19]. Unlike CNNs or ViTs, which treat data as static 2D or 3D volumes, RNNs are naturally suited to processing sequential data. Since MRI data acquisition in k-space is inherently sequential (i.e., line-by-line or shot-by-shot), RNN-based approaches offer a theoretical advantage in modeling raw signal dependencies [7]. However, standalone RNN models often struggle with a high computational burden and may fail to capture complex spatial semantics once the data are transformed into the image domain.
The integration of Bidirectional Recurrent Neural Networks (BiRNNs) with Transformer architectures represents a novel approach to leveraging both sequential processing capabilities and global attention mechanisms. BiRNNs excel at capturing temporal dependencies in sequential data by processing information in both forward and backward directions, enabling the model to utilize future context when making predictions about current states. In the context of MRI reconstruction, k-space data exhibit inherent sequential characteristics that can benefit from bidirectional processing to extract domain-transformed latent representations.
This work aims to propose and evaluate an optimized ViT-based reconstruction model for parallel MRI that addresses the limitations of standard Transformer architectures for image reconstruction tasks. We introduce a novel autoencoder framework that combines the global attention capabilities of Vision Transformers with the sequential processing strengths of Bidirectional RNNs, specifically designed to handle the unique characteristics of k-space data in MRI reconstruction.
2. Methods
The ViT architecture was originally developed for classification tasks, typically concluding with an MLP that outputs a single label. However, this design is not well suited for image reconstruction, which requires precise spatial detail and dense pixel-wise prediction. Therefore, the final stage of the ViT must be restructured to accommodate reconstruction tasks.
In MRI, each image pixel is influenced by the entirety of k-space data—a property that initially motivated the use of fully connected layers in early reconstruction frameworks such as AUTOMAP. However, fully connected layers are computationally expensive due to the large number of parameters. To achieve a more efficient yet globally aware transformation, our method employs BiRNNs. A BiRNN consists of two RNNs processing the sequence in forward and reverse directions, enabling bidirectional context aggregation across the data. By applying BiRNN layers that alternately sweep horizontally and vertically over the 2D k-space, we construct a domain-transformed latent representation that preserves global structure while significantly reducing parameter complexity. In our approach, we introduce a ViT-based autoencoder architecture optimized for direct MRI reconstruction from undersampled k-space data. While the ViT encoder is leveraged to learn global contextual features from image patches, a key component of our design is the integration of BiRNNs for domain transformation from k-space to image space. To assess the effectiveness of our proposed design, we evaluate three architectures:
Model 1: The original ViT structure, in which the final MLP directly outputs the reconstructed image.
Model 2: A ViT-based autoencoder that includes a Transformer encoder–decoder pair but excludes any recurrent layers. The ViT encoder extracts patch-level features, and a Transformer decoder reconstructs the image [14].
Model 3 (Proposed): An enhanced autoencoder structure where the decoder is augmented with additional inputs consisting of folded images and domain-transformed latent features obtained from the BiRNN block. This integration of BiRNNs with the ViT enables the model to learn representations that reflect the sequential and global characteristics of k-space data.
Figure 1 illustrates the structure of the proposed Model 3, highlighting the nested configurations of Model 1 and Model 2. Model 1 corresponds to the encoder-only portion of Model 3 (indicated by the purple dotted line), while Model 2 shares the same encoder–decoder architecture as Model 3 but omits the BiRNN module.
For each model, the following experimental parameters were used. Model 1 applied the ‘ViT-Huge’ parameters [12]: layers = 32, hidden size = 1280, MLP size = 5120, heads = 16. Model 2 used the ‘ViT-Large’ parameters [12] for both the encoder and decoder: layers = 24, hidden size = 1024, MLP size = 4096, heads = 16. Model 3 used the ‘ViT-Base’ parameters [12] for the encoder; for the decoder and the BiRNNs, the following settings were used: layers = 12, hidden size = 1280, MLP size = 5120, heads = 8, BiRNN hidden size = 384 × 10. With these settings, Model 1 contains 940,579,877 trainable parameters in total. Model 2 contains 643,630,553 parameters in total, including 337,035,240 in the encoder and 302,237,696 in the decoder. Model 3 comprises 1,024,630,973 parameters in total, with 111,143,144 in the encoder, 188,899,840 in the decoder, and 637,102,080 across the BiRNN modules.
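For clarity, these hyperparameter settings can be collected into plain configuration dictionaries, as in the Python sketch below. The key names are ours, and the ViT-Base values listed for the Model 3 encoder are the standard ones defined in [12] rather than restated in this paper.

```python
# Illustrative configuration summary (key names are ours; numeric values from the text,
# except the ViT-Base encoder entries, which follow the standard definition in [12]).
MODEL_CONFIGS = {
    "model_1": {"variant": "ViT-Huge",
                "layers": 32, "hidden_size": 1280, "mlp_size": 5120, "heads": 16},
    "model_2": {"variant": "ViT-Large",  # shared by encoder and decoder
                "layers": 24, "hidden_size": 1024, "mlp_size": 4096, "heads": 16},
    "model_3": {
        "encoder": {"variant": "ViT-Base",
                    "layers": 12, "hidden_size": 768, "mlp_size": 3072, "heads": 12},
        "decoder": {"layers": 12, "hidden_size": 1280, "mlp_size": 5120, "heads": 8},
        "birnn_hidden_size": 384 * 10,
    },
}
```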
Our proposed model (Model 3) is designed based on an encoder–decoder framework. The encoder is tasked with learning a potent latent representation from the input images, while the decoder utilizes this representation, along with supplementary k-space data, to reconstruct the final, high-fidelity image. The encoder processes a batch of folded input images, denoted as $X \in \mathbb{R}^{N \times H \times W \times C}$, where $N$ is the number of images, $H$ and $W$ are the height and width, respectively, and $C$ is the number of receive (Rx) channels. Each input image is partitioned into a sequence of flattened 2D patches. These patches are then mapped to a latent $D$-dimensional embedding space through a trainable linear projection. A learnable position embedding is added to these patch embeddings to retain positional information. The token embeddings are then passed through a series of $L$ standard Transformer blocks. Each block consists of a Multi-Head Self-Attention (MSA) module and a feed-forward Multilayer Perceptron (MLP). Layer Normalization (LN) is applied before each module, and a residual connection is employed after each module. Finally, the sequence of encoded tokens from the last Transformer block is processed by a final MLP layer to produce the encoder result, which encapsulates the high-level features extracted from the input images. The whole process of the encoder can be written as

$$Z = f_{\mathrm{enc}}(X),$$

where $f_{\mathrm{enc}}$ represents the ViT encoder, including patch embedding, position embedding, and MSAs. The output $Z$ consists of $N$ patch tokens with embedding dimension $D$.
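A minimal PyTorch sketch of the encoder path described above is given below. It assumes a square input, a convolutional patch embedding, and standard `nn.TransformerEncoderLayer` blocks; the module names and patch size are our own choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ViTEncoderSketch(nn.Module):
    """Patch embedding + position embedding + L pre-LN Transformer blocks + final MLP."""

    def __init__(self, img_size=384, patch_size=16, in_ch=16,
                 dim=768, depth=12, heads=12, mlp_size=3072):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # Trainable linear projection of flattened 2D patches to D-dimensional tokens
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable position embedding added to the patch embeddings
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_size,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))  # final MLP

    def forward(self, x):                                       # x: (N, C, H, W) folded images
        tok = self.patch_embed(x).flatten(2).transpose(1, 2)    # (N, patches, D)
        tok = tok + self.pos_embed                              # retain positional information
        tok = self.blocks(tok)                                  # L Transformer blocks (MSA + MLP)
        return self.head(tok)                                   # encoder result Z
```

With img_size = 384 and patch_size = 16, this sketch yields 576 tokens per image.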
The decoder is a hybrid architecture designed to synthesize the final image by integrating the learned features from the encoder with auxiliary data streams. It receives three inputs: the encoder result, the k-space data, and the original folded images. The encoder result first passes through a feature embedding layer before being processed by $M$ successive Transformer blocks, and the resulting features are then processed by an MLP and a tail module to produce the Transformer path's final output. The process of the ViT decoder and Up-tail can be written as

$$F_{\mathrm{img}} = f_{\mathrm{up}}\bigl(f_{\mathrm{mlp}}\bigl(f_{\mathrm{dec}}(Z)\bigr)\bigr),$$

where $f_{\mathrm{dec}}$ represents the ViT decoder layers, $f_{\mathrm{mlp}}$ projects the features to match the required output dimensions, and $f_{\mathrm{up}}$ represents the upsampling sequence that generates the image-domain features $F_{\mathrm{img}}$.
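The Transformer path of the decoder can be sketched in the same way. The feature-embedding layer, the token-to-patch "unpatchify" step, and the small convolutional tail below are our assumptions about how $f_{\mathrm{mlp}}$ and $f_{\mathrm{up}}$ map tokens back to image-domain features.

```python
import torch.nn as nn

class ViTDecoderPathSketch(nn.Module):
    """Feature embedding -> M Transformer blocks -> MLP projection -> up-tail to F_img."""

    def __init__(self, enc_dim=768, dim=1280, depth=12, heads=8, mlp_size=5120,
                 patch=16, out_ch=16):
        super().__init__()
        self.patch, self.out_ch = patch, out_ch
        self.embed = nn.Linear(enc_dim, dim)               # feature embedding of encoder tokens
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_size,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.mlp = nn.Linear(dim, patch * patch * out_ch)  # f_mlp: project tokens to patch pixels
        self.tail = nn.Sequential(                         # f_up: light convolutional refinement
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))

    def forward(self, z, grid_hw):                         # z: (N, tokens, enc_dim)
        gh, gw = grid_hw                                   # patch-grid height and width
        t = self.mlp(self.blocks(self.embed(z)))           # (N, gh*gw, patch*patch*out_ch)
        n = t.shape[0]
        img = t.reshape(n, gh, gw, self.patch, self.patch, self.out_ch)
        img = img.permute(0, 5, 1, 3, 2, 4).reshape(n, self.out_ch,
                                                    gh * self.patch, gw * self.patch)
        return self.tail(img)                              # image-domain features F_img
```

For a 384 × 384 crop with 16 × 16 patches, `grid_hw` would be (24, 24).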
Concurrently, the k-space data are processed by a sequence of two BiRNNs. This path is designed to capture contextual information from the frequency domain (k-space). Let $K \in \mathbb{C}^{H \times W \times C}$ denote the input k-space; it is transformed into a sequence of column vectors $\{k_i\}_{i=1}^{W}$, where each $k_i$ represents the flattened vertical column at width index $i$. The horizontal sweep of the BiRNN can be written as

$$S = f_{\mathrm{BiRNN}}^{h}\bigl(\{k_i\}_{i=1}^{W}\bigr),$$

where $f_{\mathrm{BiRNN}}^{h}$ represents the horizontally sweeping BiRNN layer (Figure 2a), which sequentially generates the k-space–image hybrid-domain features $S$. Here, $S$ denotes the intermediate latent features with hidden dimension $D$, where $D$ is a user-defined parameter.
To capture dependencies along the vertical axis (phase-encoding direction), the output of the horizontal sweep is rearranged as shown in Figure 2b. The intermediate feature tensor $S$ is permuted and reshaped into a sequence of row vectors $\{s_j\}_{j=1}^{H}$, where each $s_j$ contains the flattened horizontal hidden features across the entire width for the $j$-th row. The vertical sweep can then be written as

$$V = f_{\mathrm{BiRNN}}^{v}\bigl(\{s_j\}_{j=1}^{H}\bigr),$$

where $f_{\mathrm{BiRNN}}^{v}$ represents the vertically sweeping BiRNN layer, which sequentially generates the output latent features $V$ with hidden dimension $D$, where $D$ is again a user-defined parameter. The image-domain features $F_{\mathrm{krnn}}$ are obtained as a permuted and reshaped version of $V$.
The outputs from the Transformer and BiRNN paths ($F_{\mathrm{img}}$ and $F_{\mathrm{krnn}}$) are fused with the original folded images to form a comprehensive, multi-modal representation. Finally, the fused features are processed by a 2D convolutional layer, which maps this representation to the pixel space and generates the final reconstructed image.
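The following PyTorch sketch implements one reading of the BiRNN path and the final fusion described above: a horizontal sweep over flattened k-space columns ($f_{\mathrm{BiRNN}}^{h}$), a rearrangement into row sequences, a vertical sweep along the phase-encoding axis ($f_{\mathrm{BiRNN}}^{v}$), and a concatenation with the ViT-path features and folded images followed by a 2D convolution. The GRU cells, channel counts, and exact reshaping order are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KSpaceBiRNNPathSketch(nn.Module):
    def __init__(self, h=384, w=396, kch=32, d=10, vit_ch=16, img_ch=16):
        super().__init__()
        self.d = d
        # Horizontal sweep: sequence axis = width, element = one flattened k-space column.
        # Hidden size h*d mirrors the reported "BiRNN hidden size = 384 x 10".
        self.rnn_h = nn.GRU(input_size=h * kch, hidden_size=h * d,
                            bidirectional=True, batch_first=True)
        # Vertical sweep: sequence axis = height, element = one row's hidden features across all W.
        self.rnn_v = nn.GRU(input_size=w * 2 * d, hidden_size=w * d,
                            bidirectional=True, batch_first=True)
        # Fusion: concatenate BiRNN image-domain features, ViT-path features, and folded images.
        self.fuse = nn.Conv2d(2 * d + vit_ch + img_ch, 1, kernel_size=3, padding=1)

    def forward(self, kspace, vit_feat, folded):
        # kspace: (N, kch, H, W), real/imaginary parts stacked along the channel axis (assumption)
        n, c, h, w = kspace.shape
        cols = kspace.permute(0, 3, 1, 2).reshape(n, w, c * h)            # W column vectors k_i
        hyb, _ = self.rnn_h(cols)                                         # (N, W, 2*h*d) hybrid features S
        rows = hyb.reshape(n, w, 2, h, self.d)                            # split into (direction, row, d)
        rows = rows.permute(0, 3, 1, 2, 4).reshape(n, h, w * 2 * self.d)  # row vectors s_j
        out, _ = self.rnn_v(rows)                                         # (N, H, 2*w*d) latent features V
        feat = out.reshape(n, h, 2, w, self.d).permute(0, 2, 4, 1, 3)     # -> (N, 2, d, H, W)
        feat = feat.reshape(n, 2 * self.d, h, w)                          # image-domain features F_krnn
        fused = torch.cat([feat, vit_feat, folded], dim=1)                # multi-stream fusion
        return self.fuse(fused)                                           # final reconstructed image
```

Note that, in this sketch, the parameter count of the path is dominated by the large GRU input and hidden matrices, which is consistent with the BiRNN modules accounting for most of Model 3's parameters.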
For the performance evaluation, we utilized neuro-MRI images from the ‘FastMRI’ dataset [20]. Specifically, T2w images acquired on a 3T MRI (Siemens, Skyra) with a matrix size of 384 (FE) × 396 (PE) × 16 (Rx) were used to train the model. Fully sampled k-space is available in FastMRI only for the training and validation sets. Thus, the models were trained on the training subset, and evaluation was carried out on the validation subset, which includes the fully sampled reference images needed for quantitative metrics. Undersampled data were retrospectively generated using two acceleration factors (AFs): R = 4 and R = 8. The R = 4 condition used 32 auto-calibration signal (ACS) lines, while the R = 8 condition used 16 ACS lines. For both acceleration factors, experiments were conducted using two different undersampling schemes: a regular Cartesian pattern and a random pattern that was newly generated for each sample. For random sampling, the remaining phase-encoding positions outside the ACS region were selected uniformly at random to achieve the desired acceleration factor. Quantitative assessment was conducted using three metrics: the normalized mean square error (nMSE), the structural similarity index (SSIM) [21], and the visual information fidelity (VIF) [22].
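The undersampling described here can be emulated with a simple mask generator. The NumPy sketch below reflects our reading of the text: a centred ACS block plus either every R-th phase-encoding line or uniformly random lines outside the ACS region; the exact mask conventions (grid offset, rounding of the target line count) are assumptions.

```python
import numpy as np

def cartesian_mask(n_pe=396, accel=4, n_acs=32, pattern="regular", rng=None):
    """Retrospective 1D undersampling mask over the phase-encoding axis."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(n_pe, dtype=bool)
    centre = n_pe // 2
    mask[centre - n_acs // 2: centre + n_acs // 2] = True   # fully sampled ACS region
    if pattern == "regular":
        mask[::accel] = True                                 # every R-th PE line
    else:  # "random": draw the remaining lines outside the ACS region
        target = n_pe // accel                               # total sampled lines for AF = R
        extra = max(target - int(mask.sum()), 0)
        candidates = np.flatnonzero(~mask)
        mask[rng.choice(candidates, size=extra, replace=False)] = True
    return mask

# Example: R = 8 with 16 ACS lines and a random pattern, regenerated per sample
mask_r8_random = cartesian_mask(accel=8, n_acs=16, pattern="random")
```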
All models were trained using the Adam optimizer, with an initial learning rate of
. A cosine annealing learning rate schedule was applied throughout training to gradually reduce the step size and stabilize convergence. No data augmentation strategies were used, as the focus of this study was to assess the intrinsic reconstruction capability of each architecture without introducing additional variability in the input distribution. Each model was trained for 50 epochs on an NVIDIA TITAN RTX GPU, with each epoch requiring approximately 7 h to complete. The training loss was computed as the pixel-wise L1 loss between the reconstructed and reference images. For inference, the model processes a single slice in approximately 0.504 s, with a peak GPU memory consumption of 6.14 GB, demonstrating its computational feasibility for clinical workflows. We also reconstructed images using UNet (depth = 5, number of channels in the first convolutional layer = 64) and VarNet for comparison [23].
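A minimal training-loop sketch consistent with this setup (Adam, cosine annealing, pixel-wise L1 loss, 50 epochs) is shown below. The learning-rate value, the batch layout, and the `model`/`train_loader` interfaces are placeholders, not values or code taken from the paper.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, epochs=50, lr=1e-4, device="cuda"):
    # lr is a placeholder; the paper's initial learning rate is not restated here.
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # gradually decay the step size
    criterion = torch.nn.L1Loss()                           # pixel-wise L1 vs. the reference image
    for _ in range(epochs):
        for folded, kspace, reference in train_loader:      # assumed batch layout
            optimizer.zero_grad()
            recon = model(folded.to(device), kspace.to(device))  # assumed model signature
            loss = criterion(recon, reference.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()                                    # one cosine step per epoch
    return model
```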
3. Results
Figure 3 illustrates sample images and the corresponding error maps from Models 1–3, UNet, and VarNet for the R = 4 regular sampling pattern. The first column shows the reference (label) images, while columns 2–6 show the reconstructed images and the corresponding error maps (10× amplification) from UNet, VarNet, and Models 1, 2, and 3, respectively.
Table 1 provides the quantitative results for the R = 4 regular sampling pattern.
A qualitative visual assessment of the reconstructed images in Figure 3 reveals significant performance differences among the models. The proposed Model 3 demonstrates exceptional reconstruction quality, achieving high fidelity to the reference image that is visually comparable to the performance of the UNet. Furthermore, the proposed model exhibits competitive performance against VarNet. As shown in Table 1, Model 3 achieves higher SSIM and VIF than VarNet, indicating superior perceptual quality and structural fidelity. The error map for Model 3 shows minimal structural error and low residual noise, indicating successful recovery of fine anatomical details. In contrast, Model 1, which simply uses a final MLP for reconstruction, fails to capture essential anatomical structures, resulting in a severely blurred image with a high degree of structured error, as evidenced by its error map. Model 2, which employs a Transformer-based decoder, shows a substantial improvement over Model 1 by reconstructing the overall brain morphology. However, its error map contains more noticeable residual artifacts and noise than that of Model 3, suggesting less complete removal of aliasing effects.
To further evaluate the robustness of the proposed model, we conducted additional experiments under a higher acceleration factor (R = 8) and with different undersampling patterns (regular and random). For the random sampling scenarios, a new undersampling mask was randomly generated for each sample, which prevents the model from overfitting to a fixed pattern and enhances generalization. The qualitative results are presented in Figure 3, Figure 4 and Figure 5, and the corresponding quantitative analysis is summarized in Table 1. At the higher acceleration factor of R = 8 (Figure 4 and Figure 5), the performance gap between the models becomes more pronounced. While all models exhibit increased artifacts compared to R = 4, Model 3 consistently preserves anatomical structures more effectively than the other models, particularly in the more challenging random sampling scenario (Figure 5). The quantitative results in Table 1 align with the visual assessment. Across all tested scenarios (R = 4 random, R = 8 regular, and R = 8 random), Model 3 consistently achieves the lowest nMSE and the highest SSIM and VIF scores.
4. Discussion
This study compares ViT-based image reconstruction models and introduces a ViT-based autoencoder with BiRNNs. The results show that a decoding Transformer improves reconstruction compared with an MLP decoder. To ensure a fair comparison, Models 1, 2, and 3 were set to a similar total parameter count. Model 3's superior performance stems from its effective use of the BiRNN path, which extracts domain-transformed latent representations directly from k-space. The hybrid, dual-domain architecture allows the model to synergistically process features from both the image and k-space domains. The ViT encoder captures global spatial context, which is crucial for structural coherence, while the BiRNNs model the sequential nature of k-space data, which is vital for removing complex aliasing artifacts. This dual-domain approach leads to higher reconstruction fidelity, especially in challenging high-acceleration scenarios.
The comparison between Model 2 and Model 3 effectively serves as an ablation study, isolating the contribution of the BiRNN module. The significant performance improvement observed in Model 3, particularly under high acceleration (R = 8) and random sampling conditions (Table 2), underscores the critical role of processing k-space data directly. While the Transformer decoder in Model 2 can reconstruct global structures from the encoder's latent features, it struggles with the complex, non-local aliasing artifacts inherent in undersampled data. The BiRNN module in Model 3 addresses this by interpreting k-space as sequential data, effectively capturing the structured correlations along the phase-encoding directions. This allows the model to disentangle aliasing patterns from true anatomical features before the final fusion step, resulting in superior artifact suppression and detail preservation, as visually confirmed in Figure 4 and Figure 5.
Furthermore, the model's robustness to different sampling patterns demonstrates its generalization capability. The regular sampling pattern (Figure 5) produces coherent, line-like artifacts, whereas the random pattern (Figure 4 and Figure 6) generates more incoherent, noise-like artifacts. Model 3's consistently high performance in both scenarios indicates that the hybrid architecture, which processes global image-domain context via the ViT and models frequency-domain sequential dependencies via the BiRNNs, adapts well to varying artifact structures. This adaptability is a significant advantage over methods optimized for only a specific type of artifact texture.