Tomography

29 pages, 4755 KB

Open Access Article

DenseViT-OCT: A Hybrid CNN-Transformer Architecture with Multi-Scale Dense Feature Aggregation for Automated Epiretinal Membrane Severity Classification

by Elif Yusufoğlu, Salih Taha Alperen Özçelik, Orhan Atila, Numan Halit Guldemir and Abdulkadir Sengur

Tomography 2026, 12(6), 76; https://doi.org/10.3390/tomography12060076 - 22 May 2026

Abstract

Background/Objectives: Epiretinal membrane (ERM) is a common vitreoretinal disorder characterized by fibrocellular proliferation on the inner retinal surface, often leading to progressive visual impairment. Accurate grading of ERM severity using optical coherence tomography (OCT) is critical for treatment planning and surgical decision-making; however, manual grading is labor-intensive and subjective. This study aims to develop an automated and reliable deep learning-based method for ERM severity classification. Methods: We propose DenseViT-OCT, a hybrid deep learning model that integrates dense convolutional neural networks (CNN) and vision transformers (ViT). The model introduces three key modules: Multi-Scale Dense Feature Aggregation (MDFA) for capturing hierarchical features across multiple spatial scales, Adaptive Feature Calibration (AFC) for enhancing feature discrimination through channel and spatial attention, and Cross-Attention Feature Fusion (CAFF) for enabling bidirectional interaction between convolutional and transformer representations. The model was trained and evaluated on 2195 OCT B-scan images obtained from 397 patients. Results: DenseViT-OCT achieved an overall accuracy of 94.76% on the internal four-class test set, outperforming 19 benchmark models, including ConvNeXt, EfficientNet, ViT, and Swin Transformers. The model demonstrated balanced performance with a macro-averaged precision of 93.76%, recall of 93.22%, F1-score of 93.47%, Cohen’s kappa of 92.62%, and macro-Area Under the Curve (AUC) of 98.95%. Ablation experiments confirmed the contribution of the proposed MDFA, AFC, CAFF, and deep supervision components, with the full model consistently outperforming reduced variants and standalone DenseNet121 and ViT-B/16 backbones. In repeated experiments across five random seeds, DenseViT-OCT also achieved the best mean accuracy (0.9399 ± 0.0052). External validation on the public multicenter OCTDL dataset, performed as binary ERM-versus-normal classification because of label availability, yielded 90.76% accuracy and 97.61% AUC, indicating promising generalization beyond the development cohort. Conclusions: DenseViT-OCT provides a robust framework for automated ERM severity classification from OCT B-scans. The combination of local CNN features, global transformer context, and dedicated fusion modules improves classification performance and yields clinically meaningful error patterns. Although further stage-wise multicenter validation, volumetric OCT analysis, and prospective clinical assessment are required, the proposed method shows promise as a research-oriented decision-support framework for B-scan-level ERM assessment.
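
For readers unfamiliar with the summary metrics quoted in the Results, the following is a minimal sketch of how accuracy, macro-averaged precision, recall, F1-score, and Cohen's kappa are commonly derived from a multi-class confusion matrix. It is not the authors' evaluation code: the function name `macro_metrics` and the 4x4 matrix `toy_cm` are hypothetical, the counts are invented for illustration and are not the paper's results, and macro-F1 is assumed here to be the mean of per-class F1 scores, which may differ from the convention used in the article.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Summary metrics from a confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j (hypothetical helper, for illustration only).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    true_counts = [sum(cm[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        p = tp / pred_counts[c] if pred_counts[c] else 0.0
        r = tp / true_counts[c] if true_counts[c] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    # Observed agreement (accuracy) and agreement expected by chance.
    accuracy = sum(cm[c][c] for c in range(k)) / total
    expected = sum(true_counts[c] * pred_counts[c] for c in range(k)) / (total * total)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,  # assumption: mean of per-class F1
        "kappa": kappa,
    }


# Invented four-class example (10 samples per class); NOT data from the study.
toy_cm = [
    [9, 1, 0, 0],
    [1, 8, 1, 0],
    [0, 1, 8, 1],
    [0, 0, 1, 9],
]
metrics = macro_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Macro averaging weights every severity class equally regardless of how many B-scans it contains, which is why it is often reported alongside overall accuracy when class sizes differ. Cohen's kappa additionally discounts the agreement expected by chance given the class frequencies.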

(This article belongs to the Special Issue Medical Image Analysis in CT Imaging)
