Isotropic Reconstruction of Anisotropic vEM Volumes with ViT-Guided Diffusion

Qiu, Junchao; Wan, Guojia; Zhou, Zhengyun; Liao, Minghui; Liu, Xiangdong; Li, Xinyuan; Du, Bo

doi:10.3390/electronics15061181

by Junchao Qiu, Guojia Wan *, Zhengyun Zhou, Minghui Liao, Xiangdong Liu, Xinyuan Li and Bo Du

National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(6), 1181; https://doi.org/10.3390/electronics15061181

Submission received: 9 February 2026 / Revised: 6 March 2026 / Accepted: 9 March 2026 / Published: 12 March 2026

(This article belongs to the Topic Theoretical Foundations and Applications of Deep Learning Techniques)

Abstract

Volume electron microscopy (vEM) provides nanometer-scale 3D imaging, yet its axial (z) resolution is often much lower than the in-plane ($x y$) resolution, yielding anisotropic volumes that hinder segmentation and connectomic reconstruction. We present a two-stage cross-axial super-resolution framework for isotropic reconstruction that combines a conditional diffusion model and domain-specific self-supervised pretraining of a vision transformer (ViT). First, the student–teacher self-distillation paradigm of DINOv3 is adopted to learn representations from large sets of high-resolution $x y$ sections, capturing vEM-specific texture statistics and ultrastructural patterns. Second, a conditional diffusion denoiser is trained with supervised anisotropic degradation simulated by z-downsampling, while a perceptual loss based on frozen ViT feature distances constrains generated slices to match real-section distributions. These constraints recover axial high-frequency details and reduce hallucinated textures and inter-slice drift, improving cross-slice consistency. Experiments on two public vEM datasets show improved fidelity, perceptual quality, and membrane-boundary continuity over interpolation and learning-based baselines.

Keywords:

volume electron microscopy; isotropic reconstruction; self-supervised learning; vision transformer

1. Introduction

Volume electron microscopy (vEM) is a key tool in structural cell biology, enabling three-dimensional imaging of cells and tissues at nanometer-scale resolution. As acquisition throughput increases, vEM datasets routinely reach terabyte (TB) to petabyte (PB) scales [1,2,3]. By resolving complex intracellular and tissue-level architectures, vEM captures fine spatial relationships among organelles, synapses, and neural circuits [4,5,6]. This capability is essential for connectomics, where accurate reconstruction of neural wiring at synaptic resolution is required [4,5]. Beyond neuroscience, vEM is widely used in cancer research [7], immunology [8], infectious disease, and developmental biology to study three-dimensional ultrastructural changes underlying complex biological processes [9].

However, current vEM imaging modalities suffer from severe anisotropy. In many acquisition setups, the in-plane ($x y$) pixel size is only a few nanometers (typically 3–5 nm), whereas the inter-slice sampling interval along z is much coarser (typically 30–70 nm). Consequently, the effective axial resolution is degraded by roughly 6–20× compared with the in-plane resolution [1,2]. This imbalance arises from the imaging physics and sectioning limitations. For serial-section TEM (ssTEM), the axial resolution is determined largely by the section thickness [4,5,10]. For serial block-face SEM (SBF-SEM), it is constrained by the achievable cutting precision of the diamond knife [1,2,10]. FIB-SEM can yield near-isotropic volumes in some configurations, but the attainable volume and throughput remain limited; high equipment and maintenance costs further hinder large-scale acquisition [11].

Traditionally, missing z-direction slices are filled using linear or bicubic interpolation [12]. However, the axial information loss is not a simple decimation process; it is a complex degradation that combines volumetric averaging (dictated by the physical section thickness) with severe missing data caused by coarse inter-slice sampling gaps. Interpolation only enforces smooth transitions between observed samples and therefore cannot recover unacquired high-frequency structures (e.g., sharp membrane boundaries and continuous tubular topologies) [13]. As a result, over-smoothing in the $x z / y z$ planes, inter-slice discontinuities, and stair-step artifacts are common, and the reconstruction becomes more sensitive to registration errors and sample deformation [14].

Recent studies therefore favor learning-based reconstruction [15]. These methods exploit high-resolution $x y$-plane texture statistics as priors and impose explicit cross-slice consistency to complete missing z-direction details without additional physical sections [16]. Generative models such as GANs [17,18] and diffusion models are particularly promising because they can model complex texture distributions while maintaining inter-slice coherence, thereby improving isotropic reconstruction quality without increasing acquisition cost [19].

Despite this progress, learning-based methods still face two key challenges in vEM. First, without domain-consistent perceptual priors, models may generate biologically implausible textures [20]. Second, insufficient cross-slice consistency can lead to inter-slice jumps and topological breaks, which directly impact downstream segmentation and tracing [5,10].

To address these issues, we propose a hybrid framework that combines conditional diffusion with domain-specific self-supervised ViT pretraining. We select DINOv3’s self-distillation paradigm [21] as the pretraining recipe for several reasons: (i) its student–teacher framework with exponential moving average provides stable, collapse-free training on single-domain data without labels; (ii) the resulting features capture both local patch-level details and global structural layout through the multi-crop strategy, which aligns well with the multi-scale nature of vEM ultrastructures; (iii) its knowledge-distillation objective produces features with strong spatial correspondence, suited for pixel-level perceptual constraints in super-resolution. We first perform self-supervised representation learning on abundant high-resolution $x y$ slices to obtain vEM-adapted feature priors capturing ultrastructural statistics [16]. We then train a conditional diffusion super-resolution model under simulated anisotropic degradation and inject ViT features as perceptual constraints. This design suppresses biologically implausible details and reduces cross-slice drift [22].

Our main contributions are summarized as follows:

We propose a two-stage training framework for cross-axial super-resolution in vEM. It completes axial details while enhancing cross-slice structural consistency. It effectively reduces biologically implausible pseudo-textures.
We perform self-supervised pretraining on large-scale high-resolution $x y$ slices. This yields representation priors adapted to vEM texture statistics and ultrastructural patterns.
We validate the framework’s effectiveness on vEM datasets through 3D reconstruction experiments. Results demonstrate improvements in both quantitative metrics and visual quality.

2. Related Work

Isotropic reconstruction for volume electron microscopy (vEM) and related microscopy data typically relies only on high-resolution $x y$ slices. In contrast, the axial planes ($y z / x z$) often suffer from severe undersampling and blurring. Existing methods generally fall into three broad categories: traditional/2D super-resolution, video interpolation/implicit representations, and generative diffusion models. Each faces distinct limitations when applied to vEM data.

Traditional and 2D Super-Resolution. Traditional linear or bicubic interpolation is stable and easy to use; however, it cannot recover unacquired high-frequency information. To recover these missing details, deep learning techniques have recently become mainstream choices. Classic 2D super-resolution networks, such as SRCNN, sub-pixel convolution, and Transformer variants, are frequently used as building blocks for EM super-resolution (EMSR) [23,24,25,26]. However, these methods often do not explicitly enforce volumetric cross-slice constraints. While some approaches emphasize combining generative priors with cross-slice alignment to improve large upscaling-factor EMSR, they still struggle with unpaired structural consistency [27].

Video Interpolation and Implicit Representations. Another line of work formulates cross-axial reconstruction as z-direction super-resolution or frame interpolation. These methods adapt video interpolation or optical flow estimation to generate intermediate slices [28]. However, they depend heavily on the accuracy of deformation estimation. Large deformations or low-contrast structures can easily propagate errors and lead to topological discontinuities. Beyond explicit voxel generation, implicit neural representations, such as niiv and Gaussian splatting-based rendering, enable continuous-resolution reconstruction [29,30,31,32,33]. Overall, existing methods either incur high computational costs for large volumes, or still suffer from hallucinated details and inter-slice drift under unpaired conditions.

Generative Models and Our Differentiation. Under more realistic unpaired or weakly paired settings, diffusion models have gained popularity for isotropic reconstruction because their progressive denoising process enables more powerful detail synthesis. Methods like DiffuseIR and EMDiffuse showed axial detail enhancement without requiring isotropic training data [16,19]. Other studies further discussed suppressing hallucinated details in the absence of ground-truth references [22].

Motivated by these limitations, we combine conditional diffusion models with domain-specific self-supervised ViT pretraining to improve detail authenticity while explicitly constraining cross-slice structural consistency. Unlike existing diffusion baselines that rely on generic perceptual networks or lack domain-specific constraints, our approach explicitly pretrains a ViT on in-domain $x y$ slices via self-distillation, yielding feature priors specifically adapted to vEM ultrastructural statistics. This domain-adapted perceptual constraint more effectively suppresses biologically implausible hallucinated textures and reduces cross-slice drift.

3. Preliminaries

Task Definition (Cross-axial Super-resolution Reconstruction). Cross-axial super-resolution aims to address anisotropy in vEM data: the $x y$ planes exhibit high resolution, while the z direction suffers from sparse sampling and blurring. Given an anisotropic observed volume (or slice stack) $V_{LR}$, the goal is to learn a mapping $F$ producing an isotropic reconstruction $\hat{V}_{HR} = F(V_{LR})$. The output should complete z-direction textures and structural details while remaining consistent with the observed high-resolution $x y$ slices.

Formally, let the ideal isotropic high-resolution volume be $V_{HR} \in \mathbb{R}^{H \times W \times D}$. We only observe the degraded anisotropic data $V_{LR} \in \mathbb{R}^{H \times W \times (D / s)}$, where $s$ denotes the axial downsampling factor (e.g., $s = 2$). The degradation operator $A(\cdot)$ models the process from high-resolution to low-resolution observation. Let $D_{x y}$ denote the high-resolution $x y$ slice dataset. During training, we construct sample pairs $(x_{LR}, x_{HR})$, where $x_{HR}$ represents high-resolution slices cropped from $x y$ planes and $x_{LR} = A(x_{HR})$ denotes the corresponding low-resolution condition, obtained by undersampling along the z axis followed by upsampling for structural guidance. The learning objective is to make $\hat{V}_{HR} = F(V_{LR})$ approximate $V_{HR}$ as closely as possible.
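As a rough illustration of the degradation-then-upsampling pipeline described above, the sketch below keeps every $s$-th slice along z and then linearly interpolates back to the target depth. The function names, the choice of the last array axis as z, and the linear interpolation are assumptions made for illustration; the Gaussian smoothing that the experiments use to mimic PSF effects is omitted here.

```python
import numpy as np

def degrade_axial(volume: np.ndarray, s: int = 2) -> np.ndarray:
    """Illustrative A(.): keep every s-th slice along z (assumed to be the last axis)."""
    return volume[:, :, ::s]

def upsample_axial(volume_lr: np.ndarray, depth: int) -> np.ndarray:
    """Linearly interpolate along z back to `depth` slices.

    Simplification: the first and last low-resolution slices are mapped to the
    first and last output slices, so sample positions are only approximate.
    """
    d_lr = volume_lr.shape[2]
    z_src = np.linspace(0.0, d_lr - 1, depth)      # fractional source index per output slice
    lo = np.floor(z_src).astype(int)
    hi = np.minimum(lo + 1, d_lr - 1)
    w = (z_src - lo).astype(volume_lr.dtype)       # interpolation weight per output slice
    return (1.0 - w) * volume_lr[:, :, lo] + w * volume_lr[:, :, hi]

rng = np.random.default_rng(0)
v_hr = rng.random((32, 32, 16)).astype(np.float32)  # random stand-in for V_HR
v_lr = degrade_axial(v_hr, s=2)                      # V_LR with depth D / s
cond = upsample_axial(v_lr, depth=v_hr.shape[2])     # condition at the target depth
```

The upsampled volume has the same shape as the target, which is what allows it to be passed to the denoiser as a conditioning input.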

4. Methodology

4.1. Framework Overview

Cross-axial super-resolution reconstructs high-frequency details in the $y z / x z$ planes. The input contains only high-resolution $x y$ slices with sparse, blurred z-direction sampling. The goal is to complete missing details while preserving 3D structural coherence and topological consistency. Unlike conventional 2D super-resolution, this task faces three practical challenges:

Axial information loss represents irreversible systematic degradation. Section thickness, PSF, and sampling intervals jointly cause low-pass blurring in $y z / x z$ directions. This makes boundaries blunt and structures discontinuous.
Cross-slice consistency serves as a critical constraint. Inconsistent generated details between adjacent slices cause organelle boundary jumps. They also break thin elongated structures, creating 3D artifacts that affect segmentation and tracing.
Domain distribution differs significantly from natural images. vEM images exhibit distinct contrast, noise patterns, and texture statistics. Directly transferring generic perceptual networks often fails to provide reliable structural priors. It may even amplify pseudo-textures.

To address these issues, we propose a hybrid architecture (Figure 1). It combines a conditional diffusion model and domain self-supervised ViT pretraining:

Conditional diffusion model enables multi-solution generation through progressive denoising while providing multi-scale local inductive bias (via the denoiser backbone). This alleviates over-smoothing and artifact accumulation from single-step reconstruction. It also enhances the perceptual authenticity of fine details.
Self-supervised ViT features serve as domain priors. They construct perceptual constraints in feature space. These constraints suppress biologically implausible hallucinated details. They also reduce cross-slice drift by enforcing structural consistency.

We adopt a two-stage training strategy. Stage 1 performs self-supervised representation learning on abundant high-resolution $x y$ slices, which yields a vEM-adapted feature space capturing domain statistics. Stage 2 trains conditional diffusion super-resolution under simulated anisotropic degradation. We use the frozen ViT features from Stage 1 to define a perceptual loss, which explicitly reinforces cross-axial structural consistency during detail completion.

4.2. Self-Supervised ViT Feature Pretraining

We adopt the student–teacher self-distillation paradigm of DINOv3 [21] for self-supervised learning. Training occurs on abundant unlabeled high-resolution $x y$ slices. The frozen teacher network serves as a feature prior extractor $f_{ϕ}$ in Stage 2.

Motivation for domain priors. In real acquisition, $x y$ planes are more reliable and higher resolution than axial directions. Thus, the texture and structural statistics of $x y$ slices provide valuable priors for cross-axial reconstruction. Directly using ImageNet-pretrained perceptual networks risks mismatched features, which may amplify pseudo-textures or introduce implausible details. Conversely, self-supervised pretraining on in-domain $x y$ slices yields ViT features adapted to vEM. We then use feature distances as perceptual constraints in Stage 2, which enhances detail authenticity and structural consistency.

(1) Multi-crop view augmentation. For vEM data, augmentation must preserve ultrastructural morphology while improving robustness. We adopt global/local crop-based multi-view strategies. Controlled brightness/contrast perturbations and light noise injection further enhance robustness while avoiding damage to critical topology such as membrane structures. Given an input slice $x \in \mathbb{R}^{H \times W \times 1}$, we define the random augmentation operator $T(\cdot)$ as:

$$T_{crop}(x) = \mathrm{Crop}(x; Ω), \quad Ω \sim \mathcal{U}(Ω_{global} \cup Ω_{local}), \tag{1}$$

$$T_{photo}(x) = \mathrm{clip}(a x + b, 0, 1), \quad a \sim \mathcal{U}(1 - δ_{c}, 1 + δ_{c}), \quad b \sim \mathcal{U}(-δ_{b}, δ_{b}), \tag{2}$$

$$T_{noise}(x) = \mathrm{clip}(x + η, 0, 1), \quad η \sim \mathcal{N}(0, σ^{2}), \tag{3}$$

$$T = T_{noise} \circ T_{photo} \circ T_{crop}, \tag{4}$$

where $T_{crop}$ denotes global/local random cropping, $T_{photo}$ applies controlled brightness/contrast perturbations, and $T_{noise}$ injects light noise. $Ω$ represents the crop window, $a$ and $b$ denote the contrast and brightness perturbations, $η$ represents the injected noise, and $\mathrm{clip}(\cdot)$ restricts pixel values to the valid dynamic range. Based on $T(\cdot)$, we generate view sets from the same $x$:

$$V(x) = \{x_{g}^{(1)}, x_{g}^{(2)}\} \cup \{x_{l}^{(m)}\}_{m = 1}^{M}, \tag{5}$$

where $x_{g}^{(1)}, x_{g}^{(2)}$ denote the two global views (larger crop scales) and $x_{l}^{(m)}$ denote the $M$ local views (smaller crop scales). In vEM, this strategy covers both global structural layout (e.g., organelle morphology) and local texture details (e.g., membrane-boundary statistics).
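The augmentation pipeline of Equations (1)–(5) can be sketched roughly as follows. The crop sizes (224 and 96) and the ±15% contrast range come from the implementation details in Section 5.1; the brightness range `delta_b`, noise level `sigma`, number of local views `n_local`, and the use of fixed-size crops (rather than random-resized crops) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_crop(x, size):
    """Eq. (1): crop a randomly placed square window of side `size`."""
    h, w = x.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return x[top:top + size, left:left + size]

def t_photo(x, delta_c=0.15, delta_b=0.05):
    """Eq. (2): random contrast gain a and brightness offset b, then clip to [0, 1]."""
    a = rng.uniform(1 - delta_c, 1 + delta_c)
    b = rng.uniform(-delta_b, delta_b)   # delta_b is an assumed value
    return np.clip(a * x + b, 0.0, 1.0)

def t_noise(x, sigma=0.02):
    """Eq. (3): additive Gaussian noise (sigma is an assumed value), then clip."""
    return np.clip(x + rng.normal(0.0, sigma, x.shape), 0.0, 1.0)

def augment(x, size):
    """Eq. (4): T = T_noise o T_photo o T_crop."""
    return t_noise(t_photo(t_crop(x, size)))

def make_views(x, n_local=6, global_size=224, local_size=96):
    """Eq. (5): two global views plus n_local local views of the same slice."""
    global_views = [augment(x, global_size) for _ in range(2)]
    local_views = [augment(x, local_size) for _ in range(n_local)]
    return global_views, local_views

x = rng.random((512, 512))               # random stand-in for a slice scaled to [0, 1]
global_views, local_views = make_views(x)
```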

(2) ViT encoder (patch embedding + Transformer blocks). We adopt a Vision Transformer as the encoder. An input view $v \in V(x)$ is divided into $P \times P$ patches, giving a patch count of $N = \frac{H W}{P^{2}}$. The $i$-th patch token representation is:

$$z_{i} = E \cdot \mathrm{flatten}(p_{i}) + e_{i}, \quad z_{i} \in \mathbb{R}^{d}, \tag{6}$$

where $\mathrm{flatten}(\cdot)$ unfolds the 2D patch $p_{i}$ (of size $P \times P$) into a length-$P^{2}$ vector, $E$ denotes a linear projection matrix, $e_{i}$ represents the positional encoding, and $d$ is the token dimension. After $L$ Transformer blocks, we obtain the token sequence $Z^{(L)} = \{z_{i}^{(L)}\}_{i = 1}^{N}$. Average pooling yields the image representation:

$$h = \frac{1}{N} \sum_{i = 1}^{N} z_{i}^{(L)} \in \mathbb{R}^{d}. \tag{7}$$
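To make the shapes in Equations (6) and (7) concrete, the following sketch patchifies a single-channel view, applies a linear projection with positional encodings, and mean-pools the tokens. The projection and positional encodings are random stand-ins rather than trained parameters, and pooling is applied directly to the embedded tokens here (the Transformer blocks are skipped), so this only illustrates the bookkeeping.

```python
import numpy as np

def patchify(view, p):
    """Split an (H, W) view into N = HW / p^2 flattened patches of length p^2."""
    h, w = view.shape
    assert h % p == 0 and w % p == 0, "view size must be divisible by the patch size"
    # (H/p, p, W/p, p) -> (H/p, W/p, p, p) -> (N, p*p), row-major over the patch grid
    return view.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3).reshape(-1, p * p)

rng = np.random.default_rng(0)
P, d = 16, 768                                     # patch size and token dimension
view = rng.random((224, 224))                      # random stand-in for a global view
patches = patchify(view, P)                        # N = 196 patches, each of length 256
E = rng.normal(0.0, 0.02, (P * P, d))              # projection matrix (random stand-in)
pos = rng.normal(0.0, 0.02, (patches.shape[0], d)) # positional encodings (random stand-in)
tokens = patches @ E + pos                         # Eq. (6)
h_repr = tokens.mean(axis=0)                       # Eq. (7), applied here without Transformer blocks
```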

For the $ℓ$-th Transformer block (Pre-LN form):

$$\tilde{Z}^{(ℓ)} = Z^{(ℓ - 1)} + \mathrm{MSA}(\mathrm{LN}(Z^{(ℓ - 1)})), \tag{8}$$

$$Z^{(ℓ)} = \tilde{Z}^{(ℓ)} + \mathrm{MLP}(\mathrm{LN}(\tilde{Z}^{(ℓ)})), \tag{9}$$

where $\mathrm{MSA}$ denotes multi-head self-attention and $\mathrm{MLP}$ represents the feed-forward network. Both teacher and student branches project representations to the distillation space via MLP heads. A temperature-scaled softmax yields the distributions:

$$P_{ϕ, ψ}(v; τ) = \mathrm{softmax}\left(\frac{\mathrm{MLP}_{ψ}(h(v))}{τ}\right). \tag{10}$$
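A minimal NumPy sketch of the Pre-LN block in Equations (8) and (9) is given below. It is simplified on purpose: layer normalization has no learnable scale or shift, biases are omitted, ReLU stands in for the usual GELU, and the weights are random. The small dimensions are assumptions chosen to keep the example fast.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """LN over the feature axis (no learnable scale/shift in this sketch)."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, wq, wk, wv, wo, heads):
    """Multi-head self-attention over an (N, d) token matrix."""
    n, d = z.shape
    dh = d // heads
    def split(m):                                   # (N, d) -> (heads, N, dh)
        return m.reshape(n, heads, dh).transpose(1, 0, 2)
    q, k, v = split(z @ wq), split(z @ wk), split(z @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)        # merge heads back to (N, d)
    return out @ wo

def mlp(z, w1, w2):
    return np.maximum(z @ w1, 0.0) @ w2             # ReLU used as a stand-in for GELU

def pre_ln_block(z, p, heads):
    z_tilde = z + msa(layer_norm(z), p["wq"], p["wk"], p["wv"], p["wo"], heads)  # Eq. (8)
    return z_tilde + mlp(layer_norm(z_tilde), p["w1"], p["w2"])                  # Eq. (9)

rng = np.random.default_rng(0)
n, d, heads = 196, 64, 4                            # small illustrative sizes
p = {k: rng.normal(0.0, 0.02, (d, d)) for k in ("wq", "wk", "wv", "wo")}
p["w1"] = rng.normal(0.0, 0.02, (d, 4 * d))
p["w2"] = rng.normal(0.0, 0.02, (4 * d, d))
z = rng.normal(size=(n, d))
z_out = pre_ln_block(z, p, heads)
```

Because normalization is applied before each sub-layer, the residual path carries the unnormalized tokens straight through, which is the defining property of the Pre-LN form.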

(3) Student–teacher dual branches and cross-view alignment loss. Let the teacher output be $P_{t} = P_{ϕ_{t}, ψ_{t}}(v_{t}; τ_{t})$ and the student output be $P_{s} = P_{ϕ_{s}, ψ_{s}}(v_{s}; τ_{s})$. Cross-view alignment uses the cross-entropy loss $H(\cdot, \cdot)$:

$$L_{DINO}(v_{t}, v_{s}) = H(P_{t}, P_{s}) = -\sum_{k} P_{t}^{(k)} \log P_{s}^{(k)}. \tag{11}$$

We sum (or average) over all pairs between teacher global views $v_{t} \in \{x_{g}^{(1)}, x_{g}^{(2)}\}$ and all student views $v_{s} \in V(x)$, which yields the self-supervised loss $L_{SSL}$. Teacher parameters are updated via an exponential moving average (EMA) without backpropagation:

$$ϕ_{t} \leftarrow m ϕ_{t} + (1 - m) ϕ_{s}, \quad ψ_{t} \leftarrow m ψ_{t} + (1 - m) ψ_{s}, \tag{12}$$

where $m \in (0, 1)$ is the momentum coefficient. After pretraining, we discard the projection heads and freeze the teacher encoder $f_{ϕ_{t}}$ for the Stage-2 perceptual constraints. Specifically, for the Stage-2 reconstruction $\hat{x}_{0}$ and reference $x_{0}$, we define the perceptual loss in the teacher encoder’s feature space:

$$L_{per} = \| f_{ϕ_{t}}(\hat{x}_{0}) - f_{ϕ_{t}}(x_{0}) \|_{2}^{2}. \tag{13}$$

This injects the Stage-1 vEM structural and textural priors into diffusion super-resolution. It suppresses domain-inconsistent hallucinated details and enhances cross-slice structural consistency.
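Equations (11)–(13) can be sketched in a few lines. The temperatures `tau_t` and `tau_s` and the momentum `m` below are commonly used self-distillation defaults and are assumptions, not values reported in this article; any teacher-output centering or similar stabilization used by the actual recipe is omitted, and `feature_fn` is a hypothetical stand-in for the frozen teacher encoder.

```python
import numpy as np

def softmax(logits, tau):
    """Temperature-scaled softmax over the last axis, as in Eq. (10)."""
    a = logits / tau
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(teacher_logits, student_logits, tau_t=0.04, tau_s=0.1):
    """Eq. (11): cross-entropy H(P_t, P_s), averaged over the batch."""
    p_t = softmax(teacher_logits, tau_t)
    log_p_s = np.log(softmax(student_logits, tau_s) + 1e-12)
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

def ema_update(teacher, student, m=0.996):
    """Eq. (12): teacher <- m * teacher + (1 - m) * student, per parameter."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

def perceptual_loss(feature_fn, x_hat0, x0):
    """Eq. (13): squared L2 distance between features of x_hat0 and x0."""
    diff = feature_fn(x_hat0) - feature_fn(x0)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 32))
loss_value = dino_loss(logits, logits)

teacher = ema_update({"w": np.zeros(3)}, {"w": np.ones(3)}, m=0.9)

flatten_features = lambda x: x.reshape(-1)   # trivial stand-in for the frozen encoder
l_per_same = perceptual_loss(flatten_features, np.ones((4, 4)), np.ones((4, 4)))
```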

4.3. Conditional Diffusion Denoising Reconstruction with ViT Perceptual Constraints

Network architecture. We adopt a U-Net as the diffusion denoiser $ϵ_{θ}$. Key vEM information (membrane boundaries, fine tubular structures, vesicle textures) exhibits strong locality and multi-scale characteristics, and the U-Net’s skip connections help preserve both fine textures and coarse morphology during denoising. Inputs include the noisy sample $x_{t}$, the timestep $t$, and the conditional input $x_{LR}$. Here, $x_{LR}$ is the anisotropic observation upsampled along the z axis to the target depth; it provides low-frequency structural guidance for the $y z / x z$ views. The network learns to complete missing high-frequency details given cross-slice context. During training, we organize samples as 3D blocks/neighborhoods, which lets the model exploit redundancy between adjacent slices, enhances cross-slice consistency, and reduces inter-slice jumps.

Diffusion training objective. We train the conditional diffusion model in pixel space. The forward diffusion process is defined as:

$$q(x_{t} \mid x_{t - 1}) = \mathcal{N}(x_{t}; \sqrt{1 - β_{t}}\, x_{t - 1}, β_{t} I), \quad t \in \{1, \dots, T\}. \tag{14}$$

Here, $T$ denotes the total number of diffusion steps, $t \in \{1, \dots, T\}$ the timestep, $x_{0}$ a clean sample, and $x_{t}$ the noisy sample at step $t$. The injected noise is $ϵ \sim \mathcal{N}(0, I)$, where $I$ is the identity matrix, and $β_{t} \in (0, 1)$ is the noise variance coefficient at step $t$.
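Composing the per-step transition in Equation (14) gives the standard closed form $x_{t} = \sqrt{\bar{α}_{t}}\, x_{0} + \sqrt{1 - \bar{α}_{t}}\, ϵ$ with $\bar{α}_{t} = \prod_{i \le t} (1 - β_{i})$, which is the form used in line 14 of Algorithm 1. The sketch below builds a linear schedule with $T = 1000$; the endpoint values `1e-4` and `0.02` are widely used defaults and are assumptions, since the article does not report them.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule; endpoints are assumed values
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)           # alpha_bar[t - 1] holds \bar{alpha}_t for t = 1..T

def q_sample(x0, t, eps):
    """Draw x_t directly from x_0 using the closed form of the forward process."""
    ab = alpha_bar[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, (64, 64))    # random stand-in for a normalized slice
eps = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 1, eps)           # almost identical to x0
x_late = q_sample(x0, T, eps)            # almost pure noise
```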

The denoiser $ϵ_{θ}(x_{t}, t, x_{LR})$ predicts the noise $ϵ$ added to $x_{0}$. We adopt the following joint loss:

$$L_{total} = L_{diff} + λ L_{per}, \tag{15}$$

where $λ$ balances diffusion denoising and ViT feature constraints. Specifically,

$$L_{diff} = \mathbb{E}_{x_{0}, ϵ, t}\left[\| ϵ - ϵ_{θ}(x_{t}, t, x_{LR}) \|_{2}^{2}\right] \tag{16}$$

is the simplified diffusion denoising MSE loss. The perceptual loss $L_{per}$ uses frozen Stage-1 teacher ViT features (Equation (13) with $ℓ_{2}$ distance). We first construct the reconstruction estimate $\hat{x}_{0}$ from the current noisy sample $x_{t}$ and the noise prediction:

$$\hat{x}_{0} = \frac{x_{t} - \sqrt{1 - \bar{α}_{t}}\, ϵ_{θ}(x_{t}, t, x_{LR})}{\sqrt{\bar{α}_{t}}}, \tag{17}$$

where $α_{t} \triangleq 1 - β_{t}$ and $\bar{α}_{t} \triangleq \prod_{i = 1}^{t} α_{i}$. This estimate is used to compute the perceptual loss $L_{per}$, which constrains the structural statistics of diffusion outputs in feature space. Compared to pixel-level loss alone, this perceptual constraint better captures EM-specific patterns such as membrane continuity, texture granularity, and repetitive internal organelle structures. It therefore suppresses diffusion-generated hallucinated artifacts that are inconsistent with domain statistics and enhances cross-slice structural consistency during cross-axial reconstruction.

We summarize the two-stage training in Algorithm 1. Lines 1–8 describe DINO-based ViT self-supervised pretraining and teacher freezing. Lines 9–20 describe conditional diffusion SR training with ViT perceptual constraints. The corresponding reverse diffusion inference appears in Algorithm 2. Lines 1–2 construct the condition and initialize the noise. Lines 3–7 perform progressive denoising and output the reconstructed volume.

Algorithm 1 Two-stage Training for Cross-axial SR

Require: High-res $x y$ slices dataset $D_{x y}$; degradation operator $A(\cdot)$; diffusion steps $T$; perceptual weight $λ$
Ensure: Trained teacher ViT feature extractor $f_{ϕ}$ and denoiser $ϵ_{θ}$
1: Stage-1 (Self-supervised ViT pretraining)
2: Initialize student $f_{ϕ_{s}}$ and teacher $f_{ϕ_{t}}$
3: for each SSL iteration do
4:     Sample $x \sim D_{x y}$; generate global/local views $(x_{g}, x_{l})$
5:     Update $ϕ_{s}$ by minimizing DINO loss $H(P_{t}(x_{g}), P_{s}(x_{l}))$
6:     Update teacher by EMA: $ϕ_{t} \leftarrow \mathrm{EMA}(ϕ_{t}, ϕ_{s})$
7: end for
8: Freeze $f_{ϕ} \leftarrow f_{ϕ_{t}}$
9: Stage-2 (Conditional diffusion SR training)
10: Initialize denoiser $ϵ_{θ}$ (UNet)
11: for each SR training iteration do
12:     Sample $x_{0} \sim D_{x y}$ and construct condition $x_{LR} \leftarrow A(x_{0})$
13:     Sample $t \sim \{1, \dots, T\}$ and noise $ϵ \sim \mathcal{N}(0, I)$
14:     Form noisy input $x_{t} \leftarrow \sqrt{\bar{α}_{t}}\, x_{0} + \sqrt{1 - \bar{α}_{t}}\, ϵ$
15:     Predict noise $\hat{ϵ} \leftarrow ϵ_{θ}(x_{t}, t, x_{LR})$
16:     $L_{diff} \leftarrow \| ϵ - \hat{ϵ} \|_{2}^{2}$
17:     Reconstruct $\hat{x}_{0} \leftarrow \frac{x_{t} - \sqrt{1 - \bar{α}_{t}}\, \hat{ϵ}}{\sqrt{\bar{α}_{t}}}$
18:     $L_{per} \leftarrow \| f_{ϕ}(\hat{x}_{0}) - f_{ϕ}(x_{0}) \|_{2}^{2}$
19:     Update $θ$ by minimizing $L_{diff} + λ L_{per}$
20: end for

Algorithm 2 Inference (Cross-axial SR via Reverse Diffusion)

Require: Anisotropic volume $V_{LR}$; trained denoiser $ϵ_{θ}$; diffusion steps $T$
Ensure: Reconstructed isotropic volume $\hat{V}_{HR}$
1: Construct condition volume $C \leftarrow \mathrm{Upsample}(V_{LR})$
2: Initialize $x_{T} \sim \mathcal{N}(0, I)$
3: for $t = T$ down to 1 do
4:     $\hat{ϵ} \leftarrow ϵ_{θ}(x_{t}, t, C)$
5:     Sample $x_{t - 1}$ from the reverse transition $p(x_{t - 1} \mid x_{t}, \hat{ϵ})$
6: end for
7: Output $\hat{V}_{HR} \leftarrow x_{0}$
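Algorithm 2 leaves the reverse transition abstract. One possible instantiation, sketched below, is a deterministic DDIM-style update over a 50-step subsequence of the 1000 training steps, matching the sampler named in Section 5.1. The schedule endpoints, the evenly spaced timestep selection, and the `denoiser(x, t, cond)` callable are assumptions; the oracle denoiser in the usage part exists only to show that the loop recovers a known target when the noise prediction is exact.

```python
import numpy as np

def ddim_sample(denoiser, cond, alpha_bar, steps=50, seed=0):
    """Deterministic DDIM-style sampling (eta = 0) conditioned on `cond`."""
    rng = np.random.default_rng(seed)
    total = len(alpha_bar)
    ts = np.linspace(total, 1, steps).round().astype(int)   # decreasing timesteps T..1
    x = rng.standard_normal(cond.shape)                      # x_T ~ N(0, I)
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t - 1]
        # After the last step there is no noise left, so \bar{alpha}_prev = 1.
        ab_prev = alpha_bar[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
        eps_hat = denoiser(x, t, cond)
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule

target = np.random.default_rng(1).uniform(-1.0, 1.0, (16, 16))

def oracle_denoiser(x, t, cond):
    """Returns the exact noise implied by a known target (for checking only)."""
    ab = alpha_bar[t - 1]
    return (x - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)

sample = ddim_sample(oracle_denoiser, cond=np.zeros_like(target), alpha_bar=alpha_bar)
```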

5. Experiments

5.1. Experimental Setup

Dataset description. We evaluate on two representative vEM datasets: a mouse cerebral cortex EM dataset [19] and the FANC dataset [34]. Both datasets exhibit typical vEM challenges. Slices contain complex ultrastructures such as mitochondrial cristae, synaptic vesicles, and ER networks. During training, we simulate anisotropic degradation by downsampling isotropic ground truth along the z-axis and applying Gaussian smoothing to mimic point-spread-function (PSF) effects.

Implementation details. Our framework follows the two-stage training paradigm in Section 4. Stage 1 uses a ViT-Base/16 encoder ($L = 12$ Transformer blocks, 12 attention heads, embedding dimension $d = 768$, patch size $P = 16$) following the DINOv3 self-distillation recipe [21]. We generate structure-preserving augmentations, including global crops ($224 \times 224$) and local crops ($96 \times 96$). Augmentations include limited rotation ($\pm 5^{\circ}$), contrast perturbation ($\pm 15\%$), and Poisson noise injection. Training uses 4 NVIDIA RTX PRO 6000 Blackwell GPUs with a batch size of 256 for 1000 epochs. Stage 2 adopts a UNet denoiser backbone with a base channel count of 64. The UNet architecture features channel multipliers $[1, 2, 4, 8]$, two residual blocks per resolution level, self-attention at resolution 16 with 32 head channels, and dropout rate $0.2$. Training images are resized to $256 \times 256$ via bicubic interpolation and normalized to $[-1, 1]$ (mean $= 0.5$, std $= 0.5$). Data augmentation includes random horizontal flipping ($p = 0.5$), $90^{\circ}$ rotation ($p = 0.5$), and Gaussian blur with radius 3 applied to the input image only ($p = 0.5$). Diffusion uses a linear noise schedule with $T = 1000$ steps during training. Inference employs a DDIM sampler with 50 steps. We use the Adam optimizer ($β_{1} = 0.9$, $β_{2} = 0.999$). The initial learning rate is $2 \times 10^{-4}$ with cosine decay. An exponential moving average (EMA) with decay $0.9999$ is maintained over the denoiser weights and used at inference. Stage 2 training uses per-GPU batch size 16 across 4 GPUs (effective batch size 64); models are validated every 10 epochs and checkpointed every 20 epochs.

Evaluation metrics. We assess reconstruction quality from three perspectives: pixel fidelity, perceptual similarity, and domain structural plausibility. Table 1 reports pixel-level metrics, including PSNR (dB) and SSIM (↑ higher is better), as well as error metrics MSE and MAE (↓ lower is better). Values are reported as “mean ± std” across test samples. These metrics measure intensity consistency and local structural agreement against reference slices. FSIM (Feature Similarity Index) measures structural similarity based on phase congruency and gradient magnitude. LPIPS (Learned Perceptual Image Patch Similarity) measures deep feature space differences (lower is better). DISTS (Deep Image Structure and Texture Similarity) evaluates structure and texture similarity with geometric robustness. CLIPIQA (CLIP-based Image Quality Assessment) provides semantic quality scores derived from CLIP features (higher is better).
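For reference, the pixel-level metrics can be computed as in the sketch below, using the standard definitions. The `data_range` argument is an assumption: the article does not state the intensity range used for PSNR, and the reported values depend on it. The learned metrics (LPIPS, DISTS, CLIPIQA) and FSIM require dedicated implementations and are not reproduced here.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((x - y) ** 2))

def mae(x, y):
    """Mean absolute error between two arrays of equal shape."""
    return float(np.mean(np.abs(x - y)))

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the assumed intensity span."""
    err = mse(x, y)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

def mean_std(values):
    """Summarize per-sample scores as (mean, std), as reported in the tables."""
    v = np.asarray(values, dtype=float)
    return float(v.mean()), float(v.std())

reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
scores = (mse(reference, estimate), mae(reference, estimate), psnr(reference, estimate))
```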

Baseline methods. We select representative baselines covering four categories: (1) Bicubic: classic bicubic interpolation as a non-learning baseline; (2) SRCNN [23] and Subpixel CNN [24]: early convolutional super-resolution methods; (3) EDSR [35] and WDSR [36]: residual/wide-activation networks representing strong 2D restoration backbones; (4) FNO [37]: a Fourier neural operator baseline for larger-receptive-field modeling. All baselines are re-implemented under identical degradation settings, train/test splits, and evaluation metrics.

5.2. Quantitative Results and Comparison

We compare our method against bicubic interpolation and learning-based baselines. Table 1 shows $2\times$ axial super-resolution performance on the mouse cerebral cortex EM dataset [19].

As Table 1 shows, our method achieves the best results across all four pixel-level metrics. Compared to Bicubic, PSNR improves from $12.98$ to $20.25$, SSIM increases from $0.280$ to $0.451$, and MSE/MAE decrease from $0.377 / 0.489$ to $0.0387 / 0.1546$. This indicates significant improvements in intensity fidelity and structural consistency. Against learning-based baselines (SRCNN, Subpixel CNN, EDSR, WDSR), our method maintains a lead in PSNR/SSIM and substantially reduces the error metrics. This demonstrates more effective recovery of high-frequency details lost in the axial direction, while avoiding the over-smoothing of simple interpolation.

Table 1. Quantitative comparison of $2\times$ axial super-resolution on the mouse cerebral cortex EM dataset (mean ± std). Baselines include bicubic interpolation and multiple 2D super-resolution methods. Higher is better for PSNR/SSIM; lower is better for MSE/MAE; bold indicates the best results.

Model	Parameters (M)	PSNR (↑)	SSIM (↑)	MSE (↓)	MAE (↓)
Bicubic	/	12.9804 ± 0.4148	0.2799 ± 0.0213	0.3766 ± 0.0366	0.4894 ± 0.0252
SRCNN [23]	0.0573	15.4597 ± 0.4392	0.3669 ± 0.0226	0.2008 ± 0.0161	0.3500 ± 0.0165
Subpixel CNN [24]	0.2270	15.4310 ± 0.4087	0.3646 ± 0.0214	0.2017 ± 0.0147	0.3506 ± 0.0154
FNO [37]	4.7520	14.5686 ± 0.2274	0.2967 ± 0.0146	0.2405 ± 0.0059	0.3880 ± 0.0078
EDSR [35]	1.3676	15.5603 ± 0.3947	0.3737 ± 0.0208	0.1957 ± 0.0132	0.3452 ± 0.0144
WDSR [36]	1.3345	15.4533 ± 0.3863	0.3651 ± 0.0204	0.2002 ± 0.0133	0.3496 ± 0.0143
Ours	15.677	20.2509 ± 0.9655	0.4510 ± 0.0394	0.0387 ± 0.0091	0.1546 ± 0.0137

DiffuseIR [16] and EMDiffuse [19] are diffusion-based methods designed specifically for vEM isotropic reconstruction; however, their default training configurations differ from ours. To provide a direct comparison with these vEM-specific diffusion methods, we conduct a separate evaluation on the FANC dataset under a unified protocol. Figure 2 presents 3D reconstruction results on FANC. Our method outperforms both DiffuseIR and EMDiffuse, achieving the highest PSNR ($12.56$ dB) and SSIM ($0.3117$), which confirms its effectiveness against diffusion-based competitors.

5.3. Qualitative Visual Analysis

We present two voxel-level case studies. Figure 3 shows a reconstruction schematic. The model takes two adjacent input slices (Input (Low Z) and Input (High Z)) as conditions and reconstructs the intermediate missing slices (Recon 1–Recon 5). Figure 4 compares consecutive $x y$-plane slices along the z axis against ground truth (GT) slices. Our method accurately recovers key ultrastructures such as membrane boundaries and maintains relatively smooth structural transitions along the z direction. For continuous tubular structures, reconstructions show fewer breaks. This indicates that domain perceptual constraints help suppress implausible hallucinated details.

5.4. Ablation Study

Figure 5 quantifies the contribution of each component via systematic ablation. Linear interpolation retains some low-frequency structures; however, it over-smooths fine details, leading to degraded perceptual quality. Replacing the domain-specific ViT prior with a generic ImageNet-pretrained one (Ours w/o Pretrain) reduces PSNR/SSIM, while increasing LPIPS/DISTS. These results suggest that, without domain constraints, the generator tends to mistake high-frequency noise for biological texture, thereby amplifying perceptual distortion. In contrast, the full model achieves a more balanced performance across all metrics, indicating better alignment with real EM image statistics. Overall, the domain-specific perceptual prior effectively suppresses biologically implausible hallucinated details while preserving structural plausibility.

Beyond the binary ablation above, we further investigate the individual contributions of the loss function design and the perceptual loss hyperparameters. Table 2 compares training with MSE loss only versus MSE combined with perceptual loss. Adding the domain-specific perceptual term improves SSIM from 0.2550 to 0.4060 and reduces LPIPS/DISTS from 0.5407/0.3682 to 0.4627/0.2930, confirming that the ViT-based perceptual constraint improves structural fidelity beyond pixel-level supervision alone.

Table 3 reports a grid search over the perceptual loss weight $λ$ and start step $t_{s}$. The default configuration ($λ = 1.0$, $t_{s} = 100$) achieves the best SSIM (0.4060) and lowest LPIPS/DISTS (0.4627/0.2930), although the $t_{s} = 50$ setting attains the highest PSNR (20.6844 dB) and FSIM (0.7552).
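The role of the two hyperparameters can be sketched in a few lines. According to the caption of Table 3, the perceptual term is applied only when the diffusion timestep satisfies t < t_s. The additive combination below, the squared-distance feature metric, and the names `feature_distance` and `training_loss` are assumptions made for illustration and may differ from the actual implementation.

```python
def feature_distance(feat_pred, feat_true):
    """Mean squared distance between two flat feature vectors (assumed metric)."""
    if len(feat_pred) != len(feat_true):
        raise ValueError("feature vectors must have the same length")
    return sum((p - q) ** 2 for p, q in zip(feat_pred, feat_true)) / len(feat_pred)

def training_loss(mse, perceptual, t, lam=1.0, t_s=100):
    """Pixel loss plus a gated perceptual term.

    mse: pixel-level loss value for the current sample.
    perceptual: distance between frozen-ViT features of prediction and target.
    t: diffusion timestep of the sample; the perceptual term is used only
       when t < t_s, i.e., in the low-noise refinement phase.
    lam: perceptual loss weight (lambda in Table 3).
    """
    gate = 1.0 if t < t_s else 0.0
    return mse + lam * gate * perceptual
```

For example, `training_loss(0.04, 0.5, t=50)` includes the perceptual term, whereas `training_loss(0.04, 0.5, t=100)` falls back to the pixel loss alone because the threshold is exclusive.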

5.5. ViT Feature Similarity Heatmap Visualization

Figure 6 shows a patch similarity heatmap. Relative to the query patch, feature similarity is higher near membrane boundaries and around periodic ultrastructures, and lower in relatively homogeneous cytoplasmic regions. This pattern supports the hallucination-suppression intuition from a feature-space perspective: during diffusion training, the perceptual loss penalizes deviations more strongly at these key structural patterns, which encourages reconstructed textures to align better with biological topology.
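A heatmap of this kind can be sketched as follows. The caption of Figure 6 states that the map is softmax normalized; the use of cosine similarity, the `temperature` parameter, and the function names are assumptions, and the patch features are assumed to have already been extracted by the frozen ViT as flat vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards zero vectors

def similarity_heatmap(patch_features, query_index, temperature=1.0):
    """Softmax-normalized similarity of every patch to the query patch."""
    query = patch_features[query_index]
    sims = [cosine(query, f) / temperature for f in patch_features]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned list has one weight per patch and sums to one; reshaping it to the patch grid gives the map that is rendered with a warm-to-cool colormap.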

5.6. Training Dynamics Analysis

Figure 7 compares training loss curves with and without ViT perceptual constraints. The model with the ViT prior (blue curve) converges more smoothly. The inset shows the perceptual loss component decreasing overall. This confirms that generated textures progressively align with domain structural statistics in feature space. It validates the effectiveness of our two-stage training strategy.

6. Limitations

We acknowledge several limitations of the current work. First, while our model demonstrates robust performance on specific datasets, its generalizability across different imaging modalities and biological species is not yet fully guaranteed. Specifically, variations in imaging physics (e.g., FIB-SEM vs. serial-section SEM) introduce distinct noise profiles and contrast distributions—often referred to as the instrumental domain gap. Furthermore, the structural heterogeneity across different biological tissues or species poses a biological domain gap, where the structural priors learned from one specimen (e.g., mouse brain) may not perfectly align with others (e.g., Drosophila or plant tissues). Second, our quantitative evaluation relies on paired datasets constructed via synthetically simulated degradation (z-downsampling followed by Gaussian smoothing). Evaluating natively anisotropic volumes without paired isotropic ground truth remains an important direction for future work.
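The synthetic pairing on which the quantitative evaluation relies can be illustrated with a short sketch. It covers only the z-downsampling step: every s-th slice of an isotropic stack is kept as an input and the slices in between serve as targets. The Gaussian smoothing step is omitted because its parameters are not stated in this section, and the function name `make_training_pairs` is hypothetical.

```python
def make_training_pairs(volume, s=6):
    """Split an isotropic z-stack into (low, high, targets) triples.

    volume: sequence of slices ordered along z.
    s: simulated anisotropy factor; s - 1 target slices lie between
       each pair of retained input slices.
    """
    if s < 2:
        raise ValueError("s must be at least 2")
    pairs = []
    for z in range(0, len(volume) - s, s):
        low, high = volume[z], volume[z + s]
        targets = list(volume[z + 1 : z + s])  # ground-truth intermediates
        pairs.append((low, high, targets))
    return pairs
```

Because the targets come from the same volume as the inputs, the resulting metrics describe performance under simulated rather than native anisotropy, which is the limitation noted above.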

7. Conclusions

This paper addresses the loss of fine ultrastructural details in vEM volumes caused by axial (z) undersampling and blurring. We proposed an axial isotropic reconstruction framework that couples a conditional diffusion model with domain-specific self-supervised ViT pretraining. The method follows a two-stage pipeline. In Stage 1, we performed self-supervised representation learning on abundant high-resolution $xy$ slices to obtain feature priors aligned with vEM texture statistics and ultrastructural patterns. In Stage 2, we trained a conditional diffusion denoiser under simulated anisotropic degradation and added a ViT-feature perceptual loss computed with a frozen ViT. This explicitly encourages cross-slice structural consistency during axial high-frequency detail synthesis and mitigates hallucinated textures and inter-slice drift. Experiments on two representative vEM datasets show consistent improvements over baselines across pixel-level fidelity, perceptual quality, and structural plausibility metrics. Future work will explore scaling the framework to larger connectomics volumes and downstream analysis tasks.

Author Contributions

Conceptualization, J.Q. and G.W.; methodology, J.Q.; software, J.Q. and X.L. (Xiangdong Liu); validation, M.L. and X.L. (Xinyuan Li); formal analysis, Z.Z.; investigation, J.Q., M.L., X.L. (Xiangdong Liu), X.L. (Xinyuan Li) and Z.Z.; resources, Z.Z.; data curation, M.L.; writing—original draft preparation, J.Q.; writing—review and editing, G.W.; visualization, X.L. (Xiangdong Liu); supervision, G.W.; project administration, G.W.; funding acquisition, B.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62225113, in part by the National Key Research and Development Program of China under Grants 2025ZD01907901 and 2023YFC2705700, in part by the Innovative Research Group Project of Hubei Province under Grant 2025BBA008, and in part by the Science and Technology Major Project of Hubei Province under Grant 2025BCB026.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and code presented in this study are openly available on GitHub at https://github.com/FlyGraph/SR2026 (accessed on 8 March 2026).

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive feedback. We also thank the laboratory staff at the Department of Computer Science, Wuhan University, for their technical assistance during the experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Denk, W.; Horstmann, H. Serial block-face scanning electron microscopy to reconstruct three-dimensional tissue nanostructure. PLoS Biol. 2004, 2, e329.
2. Peddie, C.J.; Collinson, L.M. Exploring the third dimension: Volume electron microscopy comes of age. Micron 2014, 61, 9–19.
3. Lichtman, J.W.; Pfister, H.; Shavit, N. The big data challenges of connectomics. Nat. Neurosci. 2014, 17, 1448–1454.
4. Kasthuri, N.; Hayworth, K.J.; Berger, D.R.; Schalek, R.L.; Conchello, J.A.; Knowles-Barley, S.; Lee, D.; Vázquez-Reina, A.; Kaynig, V.; Jones, T.R.; et al. Saturated reconstruction of a volume of neocortex. Cell 2015, 162, 648–661.
5. Helmstaedter, M.; Briggman, K.L.; Turaga, S.C.; Jain, V.; Seung, H.S.; Denk, W. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature 2013, 500, 168–174.
6. Motta, A.; Berning, M.; Boergens, K.M.; Staffler, B.; Beining, M.; Loomba, S.; Hennig, P.; Wissler, H.; Helmstaedter, M. Dense connectomic reconstruction in layer 4 of the somatosensory cortex. Science 2019, 366, eaay3134.
7. Lichtman, J.W.; Denk, W. The big and the small: Challenges of imaging the brain’s circuits. Science 2011, 334, 618–623.
8. Helmstaedter, M. Cellular-resolution connectomics: Challenges of dense neural circuit reconstruction. Nat. Methods 2013, 10, 501–507.
9. Hua, Y.; Laserstein, P.; Helmstaedter, M. Large-volume en-bloc staining for electron microscopy-based connectomics. Nat. Commun. 2015, 6, 7923.
10. Briggman, K.L.; Bock, D.D. Volume electron microscopy for neuronal circuit reconstruction. Curr. Opin. Neurobiol. 2012, 22, 154–161.
11. Hayworth, K.J.; Xu, C.S.; Lu, Z.; Knott, G.W.; Fetter, R.D.; Tapia, J.C.; Lichtman, J.W.; Hess, H.F. Ultrastructurally smooth thick partitioning and volume stitching for large-scale connectomics. Nat. Methods 2015, 12, 319–322.
12. Keys, R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160.
13. Unser, M. Splines: A perfect fit for signal and image processing. IEEE Signal Process. Mag. 2002, 16, 22–38.
14. Saalfeld, S.; Fetter, R.; Cardona, A.; Tomancak, P. Elastic volume reconstruction from series of ultra-thin microscopy sections. Nat. Methods 2012, 9, 717–720.
15. Deng, S.; Fu, X.; Xiong, Z.; Chen, C.; Liu, D.; Chen, X.; Ling, Q.; Wu, F. Isotropic reconstruction of 3D EM images with unsupervised degradation learning. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2020; pp. 163–173.
16. Pan, M.; Gan, Y.; Zhou, F.; Liu, J.; Zhang, Y.; Wang, A.; Zhang, S.; Li, D. DiffuseIR: Diffusion models for isotropic reconstruction of 3D microscopic images. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 323–332.
17. Yang, H.; Wei, Q.; Sang, Y. Transform Domain Based GAN with Deep Multi-Scale Features Fusion for Medical Image Super-Resolution. Electronics 2025, 14, 3726.
18. Liu, Q.; Chen, L.; Sun, Y.; Liu, L. SwinT-SRGAN: Swin Transformer Enhanced Generative Adversarial Network for Image Super-Resolution. Electronics 2025, 14, 3511.
19. Lu, C.; Chen, K.; Qiu, H.; Chen, X.; Chen, G.; Qi, X.; Jiang, H. Diffusion-based deep learning method for augmenting ultrastructural imaging and volume electron microscopy. Nat. Commun. 2024, 15, 4677.
20. Kazimi, B.; Ruzaeva, K.; Sandfeld, S. Self-supervised learning with generative adversarial networks for electron microscopy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 71–81.
21. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104.
22. Lee, K.; Jeong, W.K. Reference-free isotropic 3D EM reconstruction using diffusion models. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 235–245.
23. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
24. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
25. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844.
26. Huang, H.; Abbas, H. Cross-Modality Guided Super-Resolution for Weak-Signal Fluorescence Imaging via a Multi-Channel SwinIR Framework. Electronics 2026, 15, 204.
27. Shou, J.; Xiao, Z.; Deng, S.; Huang, W.; Shi, P.; Zhang, R.; Xiong, Z.; Wu, F. Learning large-factor EM image super-resolution with generative priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 11313–11322.
28. Ferede, F.A.; Khalighifar, A.; John, J.; Venkataraman, K.; Khairy, K. Z-upscaling: Optical Flow Guided Frame Interpolation for Isotropic Reconstruction of 3D EM Volumes. In Proceedings of the 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5.
29. Troidl, J.; Liang, Y.; Beyer, J.; Tavakoli, M.; Danzl, J.; Hadwiger, M.; Pfister, H.; Tompkin, J. niiv: Interactive Self-supervised Neural Implicit Isotropic Volume Reconstruction. In Proceedings of the International Workshop on Efficient Medical Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2025; pp. 257–267.
30. He, Y.; Zhou, Z.; Zheng, Y.; Liang, C.; Wang, Y.; Yang, X. EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy. arXiv 2025, arXiv:2512.06684.
31. Zhang, Y.; Zhen, J.; Sun, S.; Liu, T.; Huo, L.; Wang, T. SCAFNet: A Semantic Compensated Adaptive Fusion Network for Remote Sensing Images Change Detection. IEEE Geosci. Remote Sens. Lett. 2026, 23, 6003405.
32. Zhang, Y.; Wang, T.; Xue, L.; Lian, W.; Tao, R. ORSI Salient Object Detection via Progressive Interaction and Saliency-Guided Enhancement. IEEE Geosci. Remote Sens. Lett. 2025, 23, 6002105.
33. Zhang, Y.; Liu, T.; Zhen, J.; Kang, Y.; Cheng, Y. Adaptive downsampling and scale enhanced detection head for tiny object detection in remote sensing image. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6003605.
34. Phelps, J.S.; Hildebrand, D.G.C.; Graham, B.J.; Kuan, A.T.; Thomas, L.A.; Nguyen, T.M.; Buhmann, J.; Azevedo, A.W.; Sustar, A.; Agrawal, S.; et al. Reconstruction of motor control circuits in adult Drosophila using automated transmission electron microscopy. Cell 2021, 184, 759–774.
35. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
36. Fan, Y.; Yu, J.; Huang, T.S. Wide-activated deep residual networks based restoration for bpg-compressed images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2621–2624.
37. Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; Anandkumar, A. Fourier neural operator for parametric partial differential equations. arXiv 2020, arXiv:2010.08895.

Figure 1. Framework schematic for cross-axial super-resolution reconstruction. $x_{HR}$ denotes high-resolution ground truth slices. $\tilde{x}_{HR}$ represents reconstructed high-resolution slices from the reverse diffusion denoising network (i.e., estimation of clean sample $x_{0}$, corresponding to $\hat{x}_{0}$ in main text). $x_{T}$ denotes noise state at diffusion timestep $T$. $\phi(\cdot)$ represents the ViT feature extractor.

Figure 2. 3D reconstruction comparison on the FANC dataset. Pixel-level metrics, including PSNR and SSIM (↑ indicates higher is better), are reported for the diffusion-based vEM reconstruction methods DiffuseIR and EMDiffuse.

Figure 3. 3D reconstruction schematic. Taking $s = 6$ as an example, given two adjacent input slices (Input (Low Z) and Input (High Z)), the model predicts intermediate missing slices shown as Recon 1–Recon 5.

Figure 4. Continuous $xy$-plane slice reconstruction comparison along the z axis on the FANC dataset. Given two adjacent input slices (Input (Low Z) and Input (High Z)), the model predicts intermediate slices (6× super-resolution) shown as Recon 1–Recon 5. Corresponding ground truth (GT) slices appear below each prediction for reference. Our method more accurately recovers cellular ultrastructural details during cross-slice interpolation. Red arrows highlight the features where our reconstruction closely matches the ground truth.

Figure 5. Ablation study on perceptual quality and structural fidelity. The upward arrows (↑) indicate that a higher value represents better performance, while downward arrows (↓) indicate that a lower value is preferable. PSNR is measured in dB, while SSIM, FSIM, LPIPS, DISTS, and CLIPIQA are dimensionless scores in the [0, 1] range. Lower LPIPS/DISTS indicates better perceptual similarity, and higher CLIPIQA indicates better biological plausibility. The legend denotes “Linear Interpolation”, “Ours (w/o Pretrain)” (ImageNet-pretrained, without large-scale vEM pre-training), and “Ours (full model)”.

Figure 6. Domain-specific ViT patch similarity heatmap visualization (softmax normalized). Warm colors indicate higher feature similarity with the query patch (indicated by the red box). High-similarity regions concentrate near membrane boundaries and periodic ultrastructures. This provides structural consistency constraints for diffusion reconstruction and helps suppress biologically implausible hallucinated textures.

Figure 7. Training loss curve comparison. Blue: full model (with ViT perceptual constraints); gray: ablation variant (without ViT). Inset shows the perceptual loss component convergence process.

Table 2. Ablation study on loss functions (MSE baseline vs. adding ViT perceptual loss).

Method	PSNR ↑	SSIM ↑	LPIPS ↓	DISTS ↓
MSE only	19.8774	0.2550	0.5407	0.3682
MSE + $L_{per}$ (Ours)	19.9742	0.4060	0.4627	0.2930

Table 3. Ablation study on perceptual loss hyperparameters ($λ$: perceptual weight; $t_{s}$: step threshold; the perceptual loss is applied only when the diffusion timestep $t < t_{s}$, i.e., in the low-noise refinement phase).

Method	MAE ↓	MSE ↓	PSNR ↑	SSIM ↑	FSIM ↑	LPIPS ↓	DISTS ↓
$λ = 0.1$ , $t_{s} = 100$	0.1594	0.0406	19.9318	0.1991	0.7018	0.5579	0.3677
$λ = 0.5$ , $t_{s} = 100$	0.2063	0.0716	17.4730	0.1375	0.6462	0.6252	0.4098
$λ = 1.0$ , $t_{s} = 100$ (Ours)	0.1644	0.0413	19.9742	0.4060	0.7006	0.4627	0.2930
$λ = 2.0$ , $t_{s} = 100$	0.1560	0.0404	19.9602	0.2217	0.7270	0.5561	0.3659
$λ = 1.0$ , $t_{s} = 50$	0.1458	0.0342	20.6844	0.2752	0.7552	0.5395	0.3624
$λ = 1.0$ , $t_{s} = 200$	0.1610	0.0424	19.7508	0.2492	0.7403	0.5164	0.3368

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
