J. Imaging, Volume 12, Issue 1 (January 2026) – 53 articles

Cover Story: Synthesizing photorealistic images of a scene from arbitrary viewpoints and under arbitrary lighting environments is an important research topic in computer vision and graphics. While novel view synthesis based on Neural Radiance Fields (NeRFs) has achieved remarkable success for standard materials, it remains challenging for fluorescent materials. Due to their excitation–emission behavior, the appearance of fluorescent materials under novel illumination strongly depends on their spectra and cannot be reproduced via conventional white balance adjustment. By leveraging active illumination, e.g., a display–camera system, and the superposition principle of images, this study achieves NeRF-based novel view synthesis of fluorescent materials under novel light source colors and spectra.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the table of contents of newly released issues.
  • PDF is the official format for papers, which are published in both HTML and PDF forms. To view a paper in PDF format, click the "PDF Full-text" link and open it with the free Adobe Reader.
15 pages, 647 KB  
Study Protocol
Non-Invasive Detection of Prostate Cancer with Novel Time-Dependent Diffusion MRI and AI-Enhanced Quantitative Radiological Interpretation: PROS-TD-AI
by Baltasar Ramos, Cristian Garrido, Paulette Narváez, Santiago Gelerstein Claro, Haotian Li, Rafael Salvador, Constanza Vásquez-Venegas, Iván Gallegos, Víctor Castañeda, Cristian Acevedo, Gonzalo Cárdenas and Camilo G. Sotomayor
J. Imaging 2026, 12(1), 53; https://doi.org/10.3390/jimaging12010053 - 22 Jan 2026
Viewed by 385
Abstract
Prostate cancer (PCa) is the most common malignancy in men worldwide. Multiparametric MRI (mpMRI) improves the detection of clinically significant PCa (csPCa); however, it remains limited by false-positive findings and inter-observer variability. Time-dependent diffusion (TDD) MRI provides microstructural information that may enhance csPCa characterization beyond standard mpMRI. This prospective observational diagnostic accuracy study protocol describes the evaluation of PROS-TD-AI, an in-house developed AI workflow integrating TDD-derived metrics for zone-aware csPCa risk prediction. PROS-TD-AI will be compared with PI-RADS v2.1 in routine clinical imaging using MRI-targeted prostate biopsy as the reference standard. Full article
(This article belongs to the Section Medical Imaging)
22 pages, 7186 KB  
Article
Multi-Frequency GPR Image Fusion Based on Convolutional Sparse Representation to Enhance Road Detection
by Liang Fang, Feng Yang, Yuanjing Fang and Junli Nie
J. Imaging 2026, 12(1), 52; https://doi.org/10.3390/jimaging12010052 - 22 Jan 2026
Viewed by 270
Abstract
Single-frequency ground penetrating radar (GPR) systems are fundamentally constrained by a trade-off between penetration depth and resolution, alongside issues like narrow bandwidth and ringing interference. To break this limitation, we have developed a multi-frequency data fusion technique grounded in convolutional sparse representation (CSR). The proposed methodology involves spatially registering multi-frequency GPR signals and fusing them via a CSR framework, where the convolutional dictionaries are derived from simulated high-definition GPR data. Extensive evaluation using information entropy, average gradient, mutual information, and visual information fidelity demonstrates the superiority of our method over traditional fusion approaches (e.g., weighted average, PCA, 2D wavelets). Tests on simulated and real data confirm that our CSR-based fusion successfully synergizes the deep penetration of low frequencies with the fine resolution of high frequencies, leading to substantial gains in GPR image clarity and interpretability. Full article
(This article belongs to the Section Image and Video Processing)
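Two of the fusion-quality metrics named in the abstract, information entropy and average gradient, have compact standard definitions. The sketch below is a minimal illustration of those definitions, not the authors' implementation; the function names and toy images are invented for the example.

```python
import numpy as np

def information_entropy(img, levels=256):
    """Shannon entropy (bits) of an 8-bit image's gray-level histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return float(-np.sum(p * np.log2(p)))

def average_gradient(img):
    """Mean magnitude of horizontal/vertical finite differences,
    a common sharpness proxy for fused images."""
    img = np.asarray(img, dtype=np.float64)
    gx = np.diff(img, axis=1)[:-1, :]  # crop both to a common shape
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

# A uniform image carries no information and no gradient,
# so both metrics are (near) zero for it.
flat = np.full((64, 64), 128, dtype=np.uint8)
noisy = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(np.uint8)
print(information_entropy(flat), information_entropy(noisy))
print(average_gradient(flat), average_gradient(noisy))
```

Higher values of both metrics on the fused image (relative to the single-frequency inputs) are what the comparison against weighted-average, PCA, and wavelet fusion rests on.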
19 pages, 1304 KB  
Article
Interpretable Diagnosis of Pulmonary Emphysema on Low-Dose CT Using ResNet Embeddings
by Talshyn Sarsembayeva, Madina Mansurova, Ainash Oshibayeva and Stepan Serebryakov
J. Imaging 2026, 12(1), 51; https://doi.org/10.3390/jimaging12010051 - 21 Jan 2026
Viewed by 416
Abstract
Accurate and interpretable detection of pulmonary emphysema on low-dose computed tomography (LDCT) remains a critical challenge for large-scale screening and population health studies. This work proposes a quality-controlled and interpretable deep learning pipeline for emphysema assessment using ResNet-152 embeddings. The pipeline integrates automated lung segmentation, quality-control filtering, and extraction of 2048-dimensional embeddings from mid-lung patches, followed by analysis using logistic regression, LASSO, and recursive feature elimination (RFE). The embeddings are further fused with quantitative CT (QCT) markers, including %LAA, Perc15, and total lung volume (TLV), to enhance robustness and interpretability. Bootstrapped validation demonstrates strong diagnostic performance (ROC-AUC = 0.996, PR-AUC = 0.962, balanced accuracy = 0.931) with low computational cost. The proposed approach shows that ResNet embeddings pretrained on CT data can be effectively reused without retraining for emphysema characterization, providing a reproducible and explainable framework suitable as a research and screening-support framework for population-level LDCT analysis. Full article
(This article belongs to the Section Medical Imaging)
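The QCT markers the pipeline fuses (%LAA and Perc15) have simple standard definitions; a sketch follows, assuming the conventional −950 HU emphysema threshold, which the abstract does not specify.

```python
import numpy as np

def laa_percent(lung_hu, threshold=-950.0):
    """%LAA: percentage of lung voxels below an attenuation threshold (HU)."""
    return float(100.0 * np.mean(np.asarray(lung_hu) < threshold))

def perc15(lung_hu):
    """Perc15: the 15th percentile of the lung attenuation histogram (HU)."""
    return float(np.percentile(np.asarray(lung_hu), 15))

# Toy example: four voxels, one of which is emphysema-like.
voxels = [-1000.0, -900.0, -850.0, -700.0]
print(laa_percent(voxels))  # 25.0
print(perc15(voxels))
```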
26 pages, 7486 KB  
Article
ADAM-Net: Anatomy-Guided Attentive Unsupervised Domain Adaptation for Joint MG Segmentation and MGD Grading
by Junbin Fang, Xuan He, You Jiang and Mini Han Wang
J. Imaging 2026, 12(1), 50; https://doi.org/10.3390/jimaging12010050 - 21 Jan 2026
Viewed by 278
Abstract
Meibomian gland dysfunction (MGD) is a leading cause of dry eye disease, assessable through gland atrophy degree. While deep learning (DL) has advanced meibomian gland (MG) segmentation and MGD classification, existing methods treat these tasks independently and suffer from domain shift across multi-center imaging devices. We propose ADAM-Net, an attention-guided unsupervised domain adaptation multi-task framework that jointly models MG segmentation and MGD classification. Our model introduces structure-aware multi-task learning and anatomy-guided attention to enhance feature sharing, suppress background noise, and improve glandular region perception. For the cross-domain tasks MGD-1K→{K5M, CR-2, LV II}, this study systematically evaluates the overall performance of ADAM-Net from multiple perspectives. The experimental results show that ADAM-Net achieves classification accuracies of 77.93%, 74.86%, and 81.77% on the target domains, significantly outperforming current mainstream unsupervised domain adaptation (UDA) methods. The F1-score and the Matthews correlation coefficient (MCC-score) indicate that the model maintains robust discriminative capability even under class-imbalanced scenarios. t-SNE visualizations further validate its cross-domain feature alignment capability. These demonstrate that ADAM-Net exhibits strong robustness and interpretability in multi-center scenarios and provide an effective solution for automated MGD assessment. Full article
(This article belongs to the Special Issue Imaging in Healthcare: Progress and Challenges)
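The abstract's class-imbalance claim leans on the Matthews correlation coefficient; for reference, MCC is computed from binary confusion-matrix counts as below (the standard formula, not code from the paper).

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts.
    Ranges from -1 (total disagreement) to +1 (perfect prediction) and
    remains informative under class imbalance."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # define MCC as 0 for degenerate rows/cols

print(mcc(tp=45, fp=5, fn=10, tn=40))
```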
11 pages, 1398 KB  
Article
Chest Radiography Optimization: Identifying the Optimal kV for Image Quality in a Phantom Study
by Ioannis Antonakos, Kyriakos Kokkinogoulis, Maria Giannopoulou and Efstathios P. Efstathopoulos
J. Imaging 2026, 12(1), 49; https://doi.org/10.3390/jimaging12010049 - 21 Jan 2026
Viewed by 295
Abstract
Chest radiography remains one of the most frequently performed imaging examinations, highlighting the need for optimization of acquisition parameters to balance image quality and radiation dose. This study presents a phantom-based quantitative evaluation of chest radiography acquisition settings using a digital radiography system (AGFA DR 600). Measurements were performed at three tube voltage levels across simulated patient-equivalent thicknesses generated using PMMA slabs, with a Leeds TOR 15FG image quality phantom positioned centrally in the imaging setup. Image quality was quantitatively assessed using signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR), which were calculated from mean pixel values obtained from repeated acquisitions. Radiation exposure was evaluated through estimation of entrance surface dose (ESD). The analysis demonstrated that dose-normalized performance metrics favored intermediate tube voltages for slim and average patient-equivalent thicknesses, while higher voltages were required to maintain image quality in obese-equivalent conditions. Overall, image quality and dose were found to be strongly dependent on the combined selection of tube voltage and phantom thickness. These findings indicate that modest adjustments to tube voltage selection may improve the balance between image quality and radiation dose in chest radiography. Nevertheless, as the present work is based on phantom measurements, further validation using clinical images and observer-based studies is required before any modification of routine radiographic practice. Full article
(This article belongs to the Section Medical Imaging)
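SNR and CNR computed from ROI pixel statistics are commonly defined as below. This is a generic sketch (the study's exact ROI protocol and noise estimate may differ); the Gaussian "ROIs" are made up to stand in for phantom measurements.

```python
import numpy as np

def snr(roi):
    """Signal-to-noise ratio of a region of interest: mean / std of pixel values."""
    roi = np.asarray(roi, dtype=np.float64)
    return float(roi.mean() / roi.std())

def cnr(roi_signal, roi_background):
    """Contrast-to-noise ratio: contrast between two ROIs,
    normalised by the background noise."""
    s = np.asarray(roi_signal, dtype=np.float64)
    b = np.asarray(roi_background, dtype=np.float64)
    return float(abs(s.mean() - b.mean()) / b.std())

rng = np.random.default_rng(1)
detail = rng.normal(150.0, 5.0, (20, 20))  # hypothetical phantom-detail ROI
bg = rng.normal(100.0, 5.0, (20, 20))      # hypothetical background ROI
print(f"SNR={snr(detail):.1f}  CNR={cnr(detail, bg):.1f}")
```

Dose-normalized figures of merit (e.g., CNR²/ESD) are then what make intermediate kV settings favorable for slim and average thicknesses.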
18 pages, 1198 KB  
Article
Graph-Enhanced Expectation Maximization for Emission Tomography
by Ryosuke Kasai and Hideki Otsuka
J. Imaging 2026, 12(1), 48; https://doi.org/10.3390/jimaging12010048 - 20 Jan 2026
Viewed by 217
Abstract
Emission tomography, including single-photon emission computed tomography (SPECT), requires image reconstruction from noisy and incomplete projection data. The maximum-likelihood expectation maximization (MLEM) algorithm is widely used due to its statistical foundation and non-negativity preservation, but it is highly sensitive to noise, particularly in low-count conditions. Although total variation (TV) regularization can reduce noise, it often oversmooths structural details and requires careful parameter tuning. We propose a Graph-Enhanced Expectation Maximization (GREM) algorithm that incorporates graph-based neighborhood information into an MLEM-type multiplicative reconstruction scheme. The method is motivated by a penalized formulation combining a Kullback–Leibler divergence term with a graph Laplacian regularization term, promoting local structural consistency while preserving edges. The resulting update retains the multiplicative structure of MLEM and preserves the non-negativity of the image estimates. Numerical experiments using synthetic phantoms under multiple noise levels, as well as clinical 99mTc-GSA liver SPECT data, demonstrate that GREM consistently outperforms conventional MLEM and TV-regularized MLEM in terms of PSNR and MS-SSIM. These results indicate that GREM provides an effective and practical approach for edge-preserving noise suppression in emission tomography without relying on external training data. Full article
(This article belongs to the Special Issue Advances in Photoacoustic Imaging: Tomography and Applications)
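For context, the standard MLEM update that GREM builds on, and a graph-Laplacian-penalized objective of the kind the abstract describes, can be written as follows (the paper's exact update rule is not given here; this is the textbook formulation):

```latex
% MLEM multiplicative update for system matrix A = (a_{ij}), data y, image x
x_j^{(k+1)} \;=\; \frac{x_j^{(k)}}{\sum_i a_{ij}} \sum_i a_{ij}\,
  \frac{y_i}{\sum_{j'} a_{ij'}\, x_{j'}^{(k)}}

% Penalized formulation: KL data-fidelity term plus graph-Laplacian
% smoothness, with L = D - W the Laplacian of a pixel-neighborhood graph
\min_{x \ge 0} \; \mathrm{KL}\!\left(y \,\|\, Ax\right)
  \;+\; \lambda\, x^{\top} L x
```

The multiplicative form is what preserves non-negativity: each iterate is the previous one scaled by a non-negative correction factor.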
18 pages, 2295 KB  
Article
Automatic Retinal Nerve Fiber Segmentation and the Influence of Intersubject Variability in Ocular Parameters on the Mapping of Retinal Sites to the Pointwise Orientation Angles
by Diego Luján Villarreal and Adriana Leticia Vera-Tizatl
J. Imaging 2026, 12(1), 47; https://doi.org/10.3390/jimaging12010047 - 19 Jan 2026
Viewed by 312
Abstract
The current study investigates the influence of intersubject variability in ocular characteristics on the mapping of visual field (VF) sites to the pointwise directional angles in retinal nerve fiber layer (RNFL) bundle traces. In addition, the performance efficacy on the mapping of VF sites to the optic nerve head (ONH) was compared to ground truth baselines. Fundus photographs of 546 eyes of 546 healthy subjects (with no history of ocular disease or diabetic retinopathy) were enhanced digitally and RNFL bundle traces were segmented based on the Personalized Estimated Segmentation (PES) algorithm’s core technique. A 24-2 VF grid pattern was overlaid onto the photographs in order to relate VF test points to intersecting RNFL bundles. The PES algorithm effectively traced RNFL bundles in fundus images, achieving an average accuracy of 97.6% relative to the Jansonius map through the application of 10th-order Bezier curves. The PES algorithm assembled an average of 4726 RNFL bundles per fundus image based on 4975 sampling points, obtaining a total of 2,580,505 RNFL bundles based on 2,716,321 sampling points. The influence of ocular parameters could be evaluated for 34 out of 52 VF locations. The ONH-fovea angle and the ONH position in relation to the fovea were the most prominent predictors for variations in the mapping of retinal locations to the pointwise directional angle (p < 0.001). The variation explained by the model (R2 value) ranges from 27.6% for visual field location 15 to 77.8% in location 22, with a mean of 56%. Significant individual variability was found in the mapping of VF sites to the ONH, with a mean standard deviation (95% limit) of 16.55° (median 17.68°) for 50 out of 52 VF locations, ranging from less than 1° to 44.05°. The mean entry angles differed from previous baselines by a range of less than 1° to 23.9° (average difference of 10.6° ± 5.53°), and RMSE of 11.94. Full article
(This article belongs to the Section Medical Imaging)
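The 10th-order Bezier curves used to trace RNFL bundles can be evaluated with de Casteljau's algorithm, sketched below with invented control points (illustrative only; not the PES implementation).

```python
def bezier_point(ctrl, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] via de Casteljau's
    algorithm; a 10th-order curve takes 11 control points."""
    pts = [tuple(p) for p in ctrl]
    while len(pts) > 1:  # repeatedly interpolate adjacent point pairs
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

# Endpoints are interpolated exactly, whatever the curve order.
ctrl = [(float(i), (i % 3) * 0.5) for i in range(11)]  # 11 arbitrary control points
print(bezier_point(ctrl, 0.0), bezier_point(ctrl, 1.0))
```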
24 pages, 5019 KB  
Article
A Dual Stream Deep Learning Framework for Alzheimer’s Disease Detection Using MRI Sonification
by Nadia A. Mohsin and Mohammed H. Abdul Ameer
J. Imaging 2026, 12(1), 46; https://doi.org/10.3390/jimaging12010046 - 15 Jan 2026
Viewed by 349
Abstract
Alzheimer’s Disease (AD) is an advanced brain illness that affects millions of individuals across the world. It causes gradual damage to the brain cells, leading to memory loss and cognitive dysfunction. Although Magnetic Resonance Imaging (MRI) is widely used in AD diagnosis, the existing studies rely solely on the visual representations, leaving alternative features unexplored. The objective of this study is to explore whether MRI sonification can provide complementary diagnostic information when combined with conventional image-based methods. In this study, we propose a novel dual-stream multimodal framework that integrates 2D MRI slices with their corresponding audio representations. MRI images are transformed into audio signals using a multi-scale, multi-orientation Gabor filtering, followed by a Hilbert space-filling curve to preserve spatial locality. The image and sound modalities are processed using a lightweight CNN and YAMNet, respectively, then fused via logistic regression. The experimental results of the multimodal achieved the highest accuracy in distinguishing AD from Cognitively Normal (CN) subjects at 98.2%, 94% for AD vs. Mild Cognitive Impairment (MCI), and 93.2% for MCI vs. CN. This work provides a new perspective and highlights the potential of audio transformation of imaging data for feature extraction and classification. Full article
(This article belongs to the Section AI in Imaging)
19 pages, 2936 KB  
Article
A Cross-Device and Cross-OS Benchmark of Modern Web Animation Systems
by Tajana Koren Ivančević, Trpimir Jeronim Ježić and Nikolina Stanić Loknar
J. Imaging 2026, 12(1), 45; https://doi.org/10.3390/jimaging12010045 - 15 Jan 2026
Viewed by 418
Abstract
Although modern web technologies increasingly rely on high-performance rendering methods to support rich visual content across a range of devices and operating systems, the field remains significantly under-researched. The performance of animated visual elements is affected by numerous factors, including browsers, operating systems, GPU acceleration, scripting load, and device limitations. This study systematically evaluates animation performance across multiple platforms using a unified set of circle-based animations implemented with eight web-compatible technologies, including HTML, CSS, SVG, JavaScript, Canvas, and WebGL. Animations were evaluated under controlled feature combinations involving random motion, distance, colour variation, blending, and transformations, with object counts ranging from 10 to 10,000. Measurements were conducted on desktop operating systems (Windows, macOS, Linux) and mobile platforms (iOS, Android), using CPU utilisation, GPU memory usage, and frame rate (FPS) as key metrics. Results show that DOM-based approaches maintain stable performance at 100 animated objects but exhibit notable degradation by 500 objects. Canvas-based rendering extends usability to higher object counts, while WebGL demonstrates the most stable performance at large scales (5000–10,000 objects). These findings provide concrete guidance for selecting appropriate animation technologies based on scene complexity and target platform. Full article
(This article belongs to the Section Visualization and Computer Graphics)
23 pages, 5097 KB  
Article
A Deep Feature Fusion Underwater Image Enhancement Model Based on Perceptual Vision Swin Transformer
by Shasha Tian, Adisorn Sirikham, Jessada Konpang and Chuyang Wang
J. Imaging 2026, 12(1), 44; https://doi.org/10.3390/jimaging12010044 - 14 Jan 2026
Viewed by 358
Abstract
Underwater optical images are the primary carriers of underwater scene information, playing a crucial role in marine resource exploration, underwater environmental monitoring, and engineering inspection. However, wavelength-dependent absorption and scattering severely deteriorate underwater images, leading to reduced contrast, chromatic distortions, and loss of structural details. To address these issues, we propose a U-shaped underwater image enhancement framework that integrates Swin-Transformer blocks with lightweight attention and residual modules. A Dual-Window Multi-Head Self-Attention (DWMSA) in the bottleneck models long-range context while preserving fine local structure. A Global-Aware Attention Map (GAMP) adaptively re-weights channels and spatial locations to focus on severely degraded regions. A Feature-Augmentation Residual Network (FARN) stabilizes deep training and emphasizes texture and color fidelity. Trained with a combination of Charbonnier, perceptual, and edge losses, our method achieves state-of-the-art results in PSNR and SSIM, the lowest LPIPS, and improvements in UIQM and UCIQE on the UFO-120 and EUVP datasets, with average metrics of PSNR 29.5 dB, SSIM 0.94, LPIPS 0.17, UIQM 3.62, and UCIQE 0.59. Qualitative results show reduced color cast, restored contrast, and sharper details. Code, weights, and evaluation scripts will be released to support reproducibility. Full article
(This article belongs to the Special Issue Underwater Imaging (2nd Edition))
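Of the three training losses listed, the Charbonnier term has a particularly compact form; a minimal sketch follows (the perceptual and edge terms are omitted, and the ε value is an assumption, since the abstract does not state it).

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, everywhere-differentiable variant of L1,
    sqrt(diff^2 + eps^2) averaged over all pixels."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

# Behaves like L2 near zero error and like L1 for large errors,
# which tempers the influence of outlier pixels during training.
print(charbonnier_loss([0.0, 1.0], [0.0, 0.0]))
```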
24 pages, 6383 KB  
Article
FF-Mamba-YOLO: An SSM-Based Benchmark for Forest Fire Detection in UAV Remote Sensing Images
by Binhua Guo, Dinghui Liu, Zhou Shen and Tiebin Wang
J. Imaging 2026, 12(1), 43; https://doi.org/10.3390/jimaging12010043 - 13 Jan 2026
Viewed by 533
Abstract
Timely and accurate detection of forest fires through unmanned aerial vehicle (UAV) remote sensing target detection technology is of paramount importance. However, multiscale targets and complex environmental interference in UAV remote sensing images pose significant challenges during detection tasks. To address these obstacles, this paper presents FF-Mamba-YOLO, a novel framework based on the principles of Mamba and YOLO (You Only Look Once) that leverages innovative modules and architectures to overcome these limitations. Specifically, we introduce MFEBlock and MFFBlock based on state space models (SSMs) in the backbone and neck parts of the network, respectively, enabling the model to effectively capture global dependencies. Second, we construct CFEBlock, a module that performs feature enhancement before SSM processing, improving local feature processing capabilities. Furthermore, we propose MGBlock, which adopts a dynamic gating mechanism, enhancing the model’s adaptive processing capabilities and robustness. Finally, we enhance the structure of Path Aggregation Feature Pyramid Network (PAFPN) to improve feature fusion quality and introduce DySample to enhance image resolution without significantly increasing computational costs. Experimental results on our self-constructed forest fire image dataset demonstrate that the model achieves 67.4% mAP@50, 36.3% mAP@50:95, and 64.8% precision, outperforming previous state-of-the-art methods. These results highlight the potential of FF-Mamba-YOLO in forest fire monitoring. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
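The reported mAP@50 and mAP@50:95 figures are anchored in box intersection-over-union; as a reminder of the underlying quantity (standard formula, toy boxes, nothing specific to FF-Mamba-YOLO):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    mAP@50 counts a detection correct when IoU with a ground-truth box
    exceeds 0.5; mAP@50:95 averages over thresholds 0.5 to 0.95."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```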
20 pages, 33907 KB  
Article
GLCN: Graph-Aware Locality-Enhanced Cross-Modality Re-ID Network
by Junjie Cao, Yuhang Yu, Rong Rong and Xing Xie
J. Imaging 2026, 12(1), 42; https://doi.org/10.3390/jimaging12010042 - 13 Jan 2026
Viewed by 346
Abstract
Cross-modality person re-identification faces challenges such as illumination discrepancies, local occlusions, and inconsistent modality structures, leading to misalignment and sensitivity issues. We propose GLCN, a framework that addresses these problems by enhancing representation learning through locality enhancement, cross-modality structural alignment, and intra-modality compactness. Key components include the Locality-Preserved Cross-branch Fusion (LPCF) module, which combines Local–Positional–Channel Gating (LPCG) for local region and positional sensitivity; Cross-branch Context Interpolated Attention (CCIA) for stable cross-branch consistency; and Graph-Enhanced Center Geometry Alignment (GE-CGA), which aligns class-center similarity structures across modalities to preserve category-level relationships. We also introduce Intra-Modal Prototype Discrepancy Mining Loss (IPDM-Loss) to reduce intra-class variance and improve inter-class separation, thereby creating more compact identity structures in both RGB and IR spaces. Extensive experiments on SYSU-MM01, RegDB, and other benchmarks demonstrate the effectiveness of our approach. Full article
24 pages, 3126 KB  
Article
Calibrated Transformer Fusion for Dual-View Low-Energy CESM Classification
by Ahmed A. H. Alkurdi and Amira Bibo Sallow
J. Imaging 2026, 12(1), 41; https://doi.org/10.3390/jimaging12010041 - 13 Jan 2026
Viewed by 333
Abstract
Contrast-enhanced spectral mammography (CESM) provides low-energy images acquired in standard craniocaudal (CC) and mediolateral oblique (MLO) views, and clinical interpretation relies on integrating both views. This study proposes a dual-view classification framework that combines deep CNN feature extraction with transformer-based fusion for breast-side classification using low-energy (DM) images from CESM acquisitions (Normal vs. Tumorous; benign and malignant merged). The evaluation was conducted using 5-fold stratified group cross-validation with patient-level grouping to prevent leakage across folds. The final configuration (Model E) integrates dual-backbone feature extraction, transformer fusion, MC-dropout inference for uncertainty estimation, and post hoc logistic calibration. Across the five held-out test folds, Model E achieved a mean accuracy of 96.88% ± 2.39% and a mean F1-score of 97.68% ± 1.66%. The mean ROC-AUC and PR-AUC were 0.9915 ± 0.0098 and 0.9968 ± 0.0029, respectively. Probability quality was supported by a mean Brier score of 0.0236 ± 0.0145 and a mean expected calibration error (ECE) of 0.0334 ± 0.0171. An ablation study (Models A–E) was also reported to quantify the incremental contribution of dual-view input, transformer fusion, and uncertainty calibration. Within the limits of this retrospective single-center setting, these results suggest that dual-view transformer fusion can provide strong discrimination while also producing calibrated probabilities and uncertainty outputs that are relevant for decision support. Full article
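The calibration metrics reported here (Brier score and expected calibration error) follow standard definitions; below is a sketch using equal-width confidence bins, which is an assumption, as the paper's binning scheme is not stated in the abstract.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary labels."""
    p = np.asarray(probs, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then sum each bin's gap between
    mean predicted probability and empirical accuracy, weighted by occupancy."""
    p = np.asarray(probs, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include 1.0 in the last bin
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(ece)

probs, labels = [0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Lower is better for both: Brier rewards sharp, accurate probabilities, while ECE isolates the match between stated confidence and observed accuracy.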
16 pages, 64671 KB  
Article
A Dual-UNet Diffusion Framework for Personalized Panoramic Generation
by Jing Shen, Leigang Huo, Chunlei Huo and Shiming Xiang
J. Imaging 2026, 12(1), 40; https://doi.org/10.3390/jimaging12010040 - 11 Jan 2026
Viewed by 320
Abstract
While text-to-image and customized generation methods demonstrate strong capabilities in single-image generation, they fall short in supporting immersive applications that require coherent 360° panoramas. Conversely, existing panorama generation models lack customization capabilities. In panoramic scenes, reference objects often appear as minor background elements and may be multiple in number, while reference images across different views exhibit weak correlations. To address these challenges, we propose a diffusion-based framework for customized multi-view image generation. Our approach introduces a decoupled feature injection mechanism within a dual-UNet architecture to handle weakly correlated reference images, effectively integrating spatial information by concurrently feeding both reference images and noise into the denoising branch. A hybrid attention mechanism enables deep fusion of reference features and multi-view representations. Furthermore, a data augmentation strategy facilitates viewpoint-adaptive pose adjustments, and panoramic coordinates are employed to guide multi-view attention. The experimental results demonstrate our model’s effectiveness in generating coherent, high-quality customized multi-view images. Full article
(This article belongs to the Section AI in Imaging)
15 pages, 2956 KB  
Article
Self-Supervised Learning of Deep Embeddings for Classification and Identification of Dental Implants
by Amani Almalki, Abdulrahman Almalki and Longin Jan Latecki
J. Imaging 2026, 12(1), 39; https://doi.org/10.3390/jimaging12010039 - 9 Jan 2026
Viewed by 389
Abstract
This study proposes an automated system using deep learning-based object detection to identify implant systems, leveraging recent progress in self-supervised learning, specifically masked image modeling (MIM). We advocate for self-pre-training, emphasizing its advantages when acquiring suitable pre-training data is challenging. The proposed Masked Deep Embedding (MDE) pre-training method, extending the masked autoencoder (MAE) transformer, significantly enhances dental implant detection performance compared to baselines. Specifically, the proposed method achieves a best detection performance of AP = 96.1, outperforming supervised ViT and MAE baselines by up to +2.9 AP. In addition, we address the absence of a comprehensive dataset for implant design, enhancing an existing dataset under dental expert supervision. This augmentation includes annotations for implant design, such as coronal, middle, and apical parts, resulting in a unique Implant Design Dataset (IDD). The contributions encompass employing self-supervised learning for limited dental radiograph data, replacing MAE’s patch reconstruction with patch embeddings, achieving substantial performance improvement in implant detection, and expanding possibilities through the labeling of implant design. This study paves the way for AI-driven solutions in implant dentistry, providing valuable tools for dentists and patients facing implant-related challenges. Full article
(This article belongs to the Section Medical Imaging)
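As a loose illustration of the MAE-style random masking that MDE builds on (the grid size, token dimension, and `mask_ratio` below are hypothetical, not taken from the paper), a sketch in NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens,
    returning the visible tokens plus the indices needed to restore order."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    return tokens[keep_idx], keep_idx

patches = rng.standard_normal((196, 64))   # 14 x 14 grid of 64-d patch tokens
visible, keep_idx = random_masking(patches)
print(visible.shape)  # (49, 64)
```

In MDE, the prediction target for the masked positions is the patch embedding rather than the raw pixels MAE reconstructs.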
24 pages, 2570 KB  
Article
SCT-Diff: Seamless Contextual Tracking via Diffusion Trajectory
by Guohao Nie, Xingmei Wang, Debin Zhang and He Wang
J. Imaging 2026, 12(1), 38; https://doi.org/10.3390/jimaging12010038 - 9 Jan 2026
Viewed by 343
Abstract
Existing detection-based trackers exploit temporal contexts by updating appearance models or modeling target motion. However, the sequential one-shot integration of temporal priors risks amplifying error accumulation, as frame-level template matching restricts comprehensive spatiotemporal analysis. To address this, we propose SCT-Diff, a video-level framework that holistically estimates target trajectories. Specifically, SCT-Diff processes video clips globally via a diffusion model to incorporate bidirectional spatiotemporal awareness, where reverse diffusion steps progressively refine noisy trajectory proposals into optimal predictions. Crucially, SCT-Diff enables iterative correction of historical trajectory hypotheses by observing future contexts within a sliding time window. This closed-loop feedback from future frames preserves temporal consistency and breaks the error propagation chain under complex appearance variations. For joint modeling of appearance and motion dynamics, we formulate trajectories as unified discrete token sequences. The designed Mamba-based expert decoder bridges visual features with language-formulated trajectories, enabling lightweight yet coherent sequence modeling. Extensive experiments demonstrate SCT-Diff’s superior efficiency and performance, achieving 75.4% AO on GOT-10k while maintaining real-time computational efficiency. Full article
(This article belongs to the Special Issue Object Detection in Video Surveillance Systems)
39 pages, 14025 KB  
Article
Degradation-Aware Multi-Stage Fusion for Underwater Image Enhancement
by Lian Xie, Hao Chen and Jin Shu
J. Imaging 2026, 12(1), 37; https://doi.org/10.3390/jimaging12010037 - 8 Jan 2026
Viewed by 396
Abstract
Underwater images frequently suffer from color casts, low illumination, and blur due to wavelength-dependent absorption and scattering. We present a practical two-stage, modular, and degradation-aware framework designed for real-time enhancement, prioritizing deployability on edge devices. Stage I employs a lightweight CNN to classify inputs into three dominant degradation classes (color cast, low light, blur) with 91.85% accuracy on an EUVP subset. Stage II applies three scene-specific lightweight enhancement pipelines and fuses their outputs using two alternative learnable modules: a global Linear Fusion and a LiteUNetFusion (spatially adaptive weighting with optional residual correction). Compared to the three single-scene optimizers (average PSNR = 19.0 dB; mean UCIQE ≈ 0.597; mean UIQM ≈ 2.07), the Linear Fusion improves PSNR by +2.6 dB on average and yields roughly +20.7% in UCIQE and +21.0% in UIQM, while maintaining low latency (~90 ms per 640 × 480 frame on an Intel i5-13400F (Intel Corporation, Santa Clara, CA, USA)). The LiteUNetFusion further refines results: it raises PSNR by +1.5 dB over the Linear model (23.1 vs. 21.6 dB), brings modest perceptual gains (UCIQE from 0.72 to 0.74, UIQM from 2.5 to 2.8) at a runtime of ≈125 ms per 640 × 480 frame, and better preserves local texture and color consistency in mixed-degradation scenes. We release implementation details for reproducibility and discuss limitations (e.g., occasional blur/noise amplification and domain generalization) together with future directions. Full article
(This article belongs to the Section Image and Video Processing)
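A global Linear Fusion of this kind can be pictured as a softmax-weighted blend of the three scene-specific outputs. A minimal sketch, assuming one learned scalar logit per branch (the logit values and image sizes below are made up for illustration):

```python
import numpy as np

def linear_fusion(outputs, logits):
    """Globally blend enhancement branches with softmax-normalized weights."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * out for wi, out in zip(w, outputs))

# Three constant "enhanced" images standing in for the color-cast,
# low-light, and blur branches.
branches = [np.full((4, 6, 3), v) for v in (0.2, 0.5, 0.8)]
logits = np.array([0.0, 2.0, 0.0])   # hypothetical learned per-branch logits
fused = linear_fusion(branches, logits)
print(fused.shape)  # (4, 6, 3)
```

The LiteUNetFusion variant replaces the global scalars with a spatially varying weight map predicted per pixel.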
27 pages, 13798 KB  
Article
A Hierarchical Deep Learning Architecture for Diagnosing Retinal Diseases Using Cross-Modal OCT to Fundus Translation in the Lack of Paired Data
by Ekaterina A. Lopukhova, Gulnaz M. Idrisova, Timur R. Mukhamadeev, Grigory S. Voronkov, Ruslan V. Kutluyarov and Elizaveta P. Topolskaya
J. Imaging 2026, 12(1), 36; https://doi.org/10.3390/jimaging12010036 - 8 Jan 2026
Cited by 1 | Viewed by 781
Abstract
The paper focuses on automated diagnosis of retinal diseases, particularly Age-related Macular Degeneration (AMD) and diabetic retinopathy (DR), using optical coherence tomography (OCT), while addressing three key challenges: disease comorbidity, severe class imbalance, and the lack of strictly paired OCT and fundus data. We propose a hierarchical modular deep learning system designed for multi-label OCT screening with conditional routing to specialized staging modules. To enable DR staging when fundus images are unavailable, we use cross-modal alignment between OCT and fundus representations. This approach involves training a latent bridge that projects OCT embeddings into the fundus feature space. We enhance clinical reliability through per-class threshold calibration and implement quality control checks for OCT-only DR staging. Experiments demonstrate robust multi-label performance (macro-F1 = 0.989 ± 0.006 after per-class threshold calibration) and reliable calibration (ECE = 2.1 ± 0.4%), and OCT-only DR staging is feasible in 96.1% of cases that meet the quality control criterion. Full article
(This article belongs to the Section Medical Imaging)
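The latent-bridge idea, projecting OCT embeddings into the fundus feature space, can be sketched as fitting a linear map on toy paired embeddings. This is only an illustration: the paper works without strictly paired data, and the dimensions and noise level here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy embeddings: a hidden linear relation plays the role of the true
# OCT -> fundus correspondence.
oct_emb = rng.standard_normal((200, 16))
true_map = rng.standard_normal((16, 8))
fundus_emb = oct_emb @ true_map + 0.01 * rng.standard_normal((200, 8))

# The "latent bridge": a least-squares linear projection of OCT embeddings
# into the fundus feature space.
W, *_ = np.linalg.lstsq(oct_emb, fundus_emb, rcond=None)
mse = np.mean((oct_emb @ W - fundus_emb) ** 2)
print(mse < 1e-3)  # True
```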
9 pages, 1650 KB  
Communication
Comparison of the Radiomics Features of Normal-Appearing White Matter in Persons with High or Low Perivascular Space Scores
by Onural Ozturk, Sibel Balci and Seda Ozturk
J. Imaging 2026, 12(1), 35; https://doi.org/10.3390/jimaging12010035 - 8 Jan 2026
Viewed by 383
Abstract
The clinical significance of perivascular spaces (PVS) remains controversial. Radiomics refers to the extraction of quantitative features from medical images using pixel-based computational approaches. This study aimed to compare the radiomics features of normal-appearing white matter (NAWM) in patients with low and high PVS scores to reveal microstructural differences that are not visible macroscopically. Adult patients who underwent cranial MRI over a one-month period were retrospectively screened and divided into two groups according to their global PVS score. Radiomics feature extraction from NAWM was performed at the level of the centrum semiovale on FLAIR and ADC images. Radiomics features were selected using Least Absolute Shrinkage and Selection Operator (LASSO) regression during the initial model development phase, and predefined radiomics scores were evaluated for both sequences. A total of 160 patients were included in the study. Radiomics scores derived from normal-appearing white matter demonstrated good discriminative performance for differentiating high vs. low perivascular space (PVS) burden (AUC = 0.853 for FLAIR and AUC = 0.753 for ADC). In age- and scanner-adjusted multivariable models, radiomics scores remained independently associated with high PVS burden. These findings suggest that radiomics analysis of NAWM can capture subtle white matter alterations associated with PVS burden and may serve as a non-invasive biomarker for early detection of microvascular and inflammatory changes. Full article
(This article belongs to the Special Issue Progress and Challenges in Biomedical Image Analysis—2nd Edition)
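LASSO's feature selection comes from the soft-thresholding step of coordinate descent, which drives small coefficients exactly to zero. A minimal sketch on synthetic data (not the study's radiomics pipeline; all sizes and the penalty value are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink toward zero; inputs with |z| <= lam become exactly zero,
    which is what performs the feature selection."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal LASSO via cyclic coordinate descent (assumes roughly
    standardized columns)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            beta[j] = soft_threshold(X[:, j] @ r / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.standard_normal(100)
beta = lasso_cd(X, y, lam=0.2)
print(np.round(beta, 2))   # only features 0 and 2 survive
```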
33 pages, 4122 KB  
Article
Empirical Evaluation of UNet for Segmentation of Applicable Surfaces for Seismic Sensor Installation
by Mikhail Uzdiaev, Marina Astapova, Andrey Ronzhin and Aleksandra Figurek
J. Imaging 2026, 12(1), 34; https://doi.org/10.3390/jimaging12010034 - 8 Jan 2026
Viewed by 403
Abstract
The deployment of wireless seismic nodal systems necessitates the efficient identification of optimal locations for sensor installation, considering factors such as ground stability and the absence of interference. Semantic segmentation of satellite imagery has advanced significantly, yet its application to this specific task remains unexplored. This work presents a baseline empirical evaluation of the U-Net architecture for the semantic segmentation of surfaces applicable for seismic sensor installation. We utilize a novel dataset of Sentinel-2 multispectral images, specifically labeled for this purpose. The study investigates the impact of pretrained encoders (EfficientNetB2, Cross-Stage Partial Darknet53—CSPDarknet53, and Multi-Axis Vision Transformer—MAxViT), different combinations of Sentinel-2 spectral bands (Red, Green, Blue (RGB), RGB+Near Infrared (NIR), 10-bands with 10 and 20 m/pix spatial resolution, full 13-band), and a technique for improving small object segmentation by modifying the input convolutional layer stride. Experimental results demonstrate that the CSPDarknet53 encoder generally outperforms the others (IoU = 0.534, Precision = 0.716, Recall = 0.635). The combination of RGB and Near-Infrared bands (10 m/pixel resolution) yielded the most robust performance across most configurations. Reducing the input stride from 2 to 1 proved beneficial for segmenting small linear objects like roads. The findings establish a baseline for this novel task and provide practical insights for optimizing deep learning models in the context of automated seismic nodal network installation planning. Full article
(This article belongs to the Special Issue Image Segmentation: Trends and Challenges)
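The stride modification works because the first convolution's stride sets the resolution of every later feature map. The standard output-size formula makes the effect concrete (the 224-px input and 3-px kernel are illustrative, not the paper's exact configuration):

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# A stride-2 stem halves every later feature map; stride 1 keeps full
# resolution, so thin objects such as roads keep more pixels downstream.
print(conv_out_size(224, kernel=3, stride=2, padding=1))  # 112
print(conv_out_size(224, kernel=3, stride=1, padding=1))  # 224
```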
18 pages, 4519 KB  
Article
A Unified Complex-Fresnel Model for Physically Based Long-Wave Infrared Imaging and Simulation
by Peter ter Heerdt, William Keustermans, Ivan De Boi and Steve Vanlanduit
J. Imaging 2026, 12(1), 33; https://doi.org/10.3390/jimaging12010033 - 7 Jan 2026
Viewed by 391
Abstract
Accurate modelling of reflection, transmission, absorption, and emission at material interfaces is essential for infrared imaging, rendering, and the simulation of optical and sensing systems. This need is particularly pronounced across the short-wave to long-wave infrared (SWIR–LWIR) spectrum, where many materials exhibit dispersion- and wavelength-dependent attenuation described by complex refractive indices. In this work, we introduce a unified formulation of the full Fresnel equations that directly incorporates wavelength-dependent complex refractive-index data and provides physically consistent interface behaviour for both dielectrics and conductors. The approach reformulates the classical Fresnel expressions to eliminate sign ambiguities and numerical instabilities, resulting in a stable evaluation across incidence angles and for strongly absorbing materials. We demonstrate the model through spectral-rendering simulations that illustrate realistic reflectance and transmittance behaviour for materials with different infrared optical properties. To assess its suitability for thermal-infrared applications, we also compare the simulated long-wave emission of a heated glass sphere with measurements from a LWIR camera. The agreement between measured and simulated radiometric trends indicates that the proposed formulation offers a practical and physically grounded tool for wavelength-parametric interface modelling in infrared imaging, supporting applications in spectral rendering, synthetic data generation, and infrared system analysis. Full article
(This article belongs to the Section Visualization and Computer Graphics)
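The central idea, evaluating the Fresnel equations directly with a complex refractive index so dielectrics and conductors need no separate formulas, can be shown in a few lines. This is a simplified sketch of the classical expressions, not the paper's full reformulation, and the complex index used below is an arbitrary metal-like example:

```python
import cmath
import math

def fresnel_reflectance(n1, n2, theta_i):
    """Unpolarized power reflectance at an interface between media with
    (possibly complex) refractive indices n1 and n2, at incidence angle
    theta_i (radians). Complex arithmetic handles absorbing media."""
    cos_i = math.cos(theta_i)
    # Snell's law with a complex transmitted angle: keep everything complex
    # so absorbing media (|sin_t| may exceed 1) are handled uniformly.
    sin_t = n1 * math.sin(theta_i) / n2
    cos_t = cmath.sqrt(1 - sin_t * sin_t)
    # Fresnel amplitude coefficients for s- and p-polarization.
    rs = (n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)
    rp = (n2 * cos_i - n1 * cos_t) / (n2 * cos_i + n1 * cos_t)
    return 0.5 * (abs(rs) ** 2 + abs(rp) ** 2)

# Normal incidence on glass (n = 1.5): R = ((1.5 - 1)/(1.5 + 1))^2 = 0.04
print(round(fresnel_reflectance(1.0, 1.5, 0.0), 4))  # 0.04
# A metal-like complex index gives high reflectance.
print(fresnel_reflectance(1.0, complex(0.2, 3.4), 0.0) > 0.9)  # True
```

Note that `cmath.sqrt` picks the principal branch; the paper's contribution includes resolving exactly such sign ambiguities in a numerically stable way.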
15 pages, 979 KB  
Article
Hybrid Skeleton-Based Motion Templates for Cross-View and Appearance-Robust Gait Recognition
by João Ferreira Nunes, Pedro Miguel Moreira and João Manuel R. S. Tavares
J. Imaging 2026, 12(1), 32; https://doi.org/10.3390/jimaging12010032 - 7 Jan 2026
Viewed by 326
Abstract
Gait recognition methods based on silhouette templates, such as the Gait Energy Image (GEI), achieve high accuracy under controlled conditions but often degrade when appearance varies due to viewpoint, clothing, or carried objects. In contrast, skeleton-based approaches provide interpretable motion cues but remain sensitive to pose-estimation noise. This work proposes two compact 2D skeletal descriptors—Gait Skeleton Images (GSIs)—that encode 3D joint trajectories into line-based and joint-based static templates compatible with standard 2D CNN architectures. A unified processing pipeline is introduced, including skeletal topology normalization, rigid view alignment, orthographic projection, and pixel-level rendering. Core design factors are analyzed on the GRIDDS dataset, where depth-based 3D coordinates provide stable ground truth for evaluating structural choices and rendering parameters. An extensive evaluation is then conducted on the widely used CASIA-B dataset, using 3D coordinates estimated via human pose estimation, to assess robustness under viewpoint, clothing, and carrying covariates. Results show that although GEIs achieve the highest same-view accuracy, GSI variants exhibit reduced degradation under appearance changes and demonstrate greater stability under severe cross-view conditions. These findings indicate that compact skeletal templates can complement appearance-based descriptors and may benefit further from continued advances in 3D human pose estimation. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
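A joint-based static template of this kind can be sketched as rasterizing normalized joint trajectories into an accumulation grid. This is a toy version (the paper's pipeline adds topology normalization, view alignment, and projection first), with an arbitrary grid size:

```python
def render_joint_template(trajectories, size=32):
    """Rasterize normalized 2D joint trajectories into a static template:
    each visited cell accumulates hits, then the grid is scaled to [0, 1]
    so it can be fed to an ordinary 2D CNN."""
    grid = [[0] * size for _ in range(size)]
    for joint in trajectories:      # one (x, y) sequence per joint
        for x, y in joint:          # coordinates assumed normalized to [0, 1)
            grid[int(y * size)][int(x * size)] += 1
    peak = max(max(row) for row in grid) or 1
    return [[v / peak for v in row] for row in grid]

template = render_joint_template([[(0.5, 0.1), (0.5, 0.2)], [(0.25, 0.5)]])
print(template[3][16])  # 1.0
```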
22 pages, 3045 KB  
Article
Deep Learning-Assisted Autofocus for Aerial Cameras in Maritime Photography
by Haiying Liu, Yingchao Li, Shilong Xu, Haoyu Wang, Qiang Fu and Huilin Jiang
J. Imaging 2026, 12(1), 31; https://doi.org/10.3390/jimaging12010031 - 7 Jan 2026
Viewed by 310
Abstract
To address the unreliable autofocus problem of drone-mounted visible-light aerial cameras in low-contrast maritime environments, this paper proposes an autofocus system that combines deep-learning-based coarse focusing with traditional search-based fine adjustment. The system uses a built-in high-contrast resolution test chart as the signal source. Images captured by the imaging sensor are fed into a lightweight convolutional neural network to regress the defocus distance, enabling fast focus positioning. This avoids the weak signal and inaccurate focusing often encountered when adjusting focus directly on low-contrast sea surfaces. In the fine-focusing stage, a hybrid strategy integrating hill-climbing search and inverse correction is adopted. By evaluating the image sharpness function, the system accurately locks onto the optimal focal plane, forming intelligent closed-loop control. Experiments show that this method, which combines imaging of the built-in calibration target with deep-learning-based coarse focusing, significantly improves focusing efficiency. Compared with traditional full-range search strategies, the focusing speed is increased by approximately 60%. While ensuring high accuracy and strong adaptability, the proposed approach effectively enhances the overall imaging performance of aerial cameras in low-contrast maritime conditions. Full article
(This article belongs to the Section Computational Imaging and Computational Photography)
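The fine-focusing stage, hill-climbing on an image sharpness function with inverse correction, can be sketched as follows. The gradient-energy sharpness measure and the simulated camera are illustrative stand-ins, not the paper's actual components:

```python
def sharpness(image):
    """Gradient-energy focus measure: sum of squared horizontal and
    vertical intensity differences."""
    h = sum((row[j + 1] - row[j]) ** 2 for row in image for j in range(len(row) - 1))
    v = sum((image[i + 1][j] - image[i][j]) ** 2
            for i in range(len(image) - 1) for j in range(len(image[0])))
    return h + v

def hill_climb_focus(capture, start, step, max_steps=50):
    """Hill-climb the focus motor: advance while sharpness improves; on a
    drop, reverse direction and halve the step (inverse correction)."""
    pos, best = start, sharpness(capture(start))
    for _ in range(max_steps):
        cand = pos + step
        s = sharpness(capture(cand))
        if s > best:
            pos, best = cand, s
        else:
            step = -step // 2          # reverse and shrink
            if step == 0:
                break
    return pos

def make_capture(peak):
    """Simulated camera: image contrast (hence sharpness) peaks at `peak`."""
    def capture(pos):
        c = max(0, 20 - abs(pos - peak))
        return [[0, c], [c, 0]]
    return capture

print(hill_climb_focus(make_capture(10), start=0, step=4))  # 10
```

In the paper, a deep network first regresses the defocus distance from a built-in test chart, so this search only has to refine a position that is already close.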
34 pages, 15414 KB  
Article
From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
by Vasiliy Kudryavtsev, Kirill Borodin, German Berezin, Kirill Bubenchikov, Grach Mkrtchian and Alexander Ryzhkov
J. Imaging 2026, 12(1), 30; https://doi.org/10.3390/jimaging12010030 - 7 Jan 2026
Viewed by 364
Abstract
Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification. Full article
(This article belongs to the Section Biometrics, Forensics, and Security)
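A gated fusion mechanism of this kind can be sketched as a learned sigmoid gate that mixes the vision and text embeddings per dimension. The dimensions and weights below are hypothetical; in the paper the embeddings come from SigLIP2-Giant and E5-Small-v2 and the gate is trained end to end:

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_fusion(v, t, W, b):
    """Per-dimension sigmoid gate over the concatenated embeddings decides
    how much the vision (v) and text (t) modalities each contribute."""
    g = 1.0 / (1.0 + np.exp(-(W @ np.concatenate([v, t]) + b)))
    return g * v + (1.0 - g) * t

d = 8
v = rng.standard_normal(d)                 # vision embedding (toy)
t = rng.standard_normal(d)                 # text embedding (toy)
W = 0.1 * rng.standard_normal((d, 2 * d))  # gate weights, learned in practice
b = np.zeros(d)
fused = gated_fusion(v, t, W, b)
print(fused.shape)  # (8,)
```

Because the gate is in (0, 1), each fused dimension is a convex combination of the two modalities, which is what lets the model adapt to complementarity and redundancy per input.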
31 pages, 2157 KB  
Article
DynMultiDep: A Dynamic Multimodal Fusion and Multi-Scale Time Series Modeling Approach for Depression Detection
by Jincheng Li, Menglin Zheng, Jiongyi Yang, Yihui Zhan and Xing Xie
J. Imaging 2026, 12(1), 29; https://doi.org/10.3390/jimaging12010029 - 6 Jan 2026
Viewed by 430
Abstract
Depression is a prevalent mental disorder that imposes a significant public health burden worldwide. Although multimodal detection methods have shown potential, existing techniques still face two critical bottlenecks: (i) insufficient integration of global patterns and local fluctuations in long-sequence modeling and (ii) static fusion strategies that fail to dynamically adapt to the complementarity and redundancy among modalities. To address these challenges, this paper proposes a dynamic multimodal depression detection framework, DynMultiDep, which combines multi-scale temporal modeling with an adaptive fusion mechanism. The core innovations of DynMultiDep lie in its Multi-scale Temporal Experts Module (MTEM) and Dynamic Multimodal Fusion module (DynMM). On one hand, MTEM employs Mamba experts to extract long-term trend features and utilizes local-window Transformers to capture short-term dynamic fluctuations, achieving adaptive fusion through a long-short routing mechanism. On the other hand, DynMM introduces modality-level and fusion-level dynamic decision-making, selecting critical modality paths and optimizing cross-modal interaction strategies based on input characteristics. The experimental results demonstrate that DynMultiDep outperforms existing state-of-the-art methods in detection performance on two widely used large-scale depression datasets. Full article
12 pages, 2605 KB  
Article
Ultrashort Echo Time Quantitative Susceptibility Source Separation in Musculoskeletal System: A Feasibility Study
by Sam Sedaghat, Jin Il Park, Eddie Fu, Annette von Drygalski, Yajun Ma, Eric Y. Chang, Jiang Du, Lorenzo Nardo and Hyungseok Jang
J. Imaging 2026, 12(1), 28; https://doi.org/10.3390/jimaging12010028 - 6 Jan 2026
Viewed by 365
Abstract
This study aims to demonstrate the feasibility of ultrashort echo time (UTE)-based susceptibility source separation for musculoskeletal (MSK) imaging, enabling discrimination between diamagnetic and paramagnetic tissue components, with a particular focus on hemophilic arthropathy (HA). Three key techniques were integrated to achieve UTE-based susceptibility source separation: Iterative decomposition of water and fat with echo asymmetry and least-squares estimation for B0 field estimation, projection onto dipole fields for local field mapping, and χ-separation for quantitative susceptibility mapping (QSM) with source decomposition. A phantom containing varying concentrations of diamagnetic (CaCO3) and paramagnetic (Fe3O4) materials was used to validate the method. In addition, in vivo UTE-QSM scans of the knees and ankles were performed on five HA patients using a 3T clinical MRI scanner. In the phantom, conventional QSM underestimated susceptibility values because the mixed sources cancelled each other's contributions. In contrast, source-separated maps provided distinct diamagnetic and paramagnetic susceptibility values that correlated strongly with CaCO3 and Fe3O4 concentrations (r = −0.99 and 0.95, p < 0.05). In vivo, paramagnetic maps enabled improved visualization of hemosiderin deposits in joints of HA patients, which were poorly visualized or obscured in conventional QSM due to susceptibility cancellation by surrounding diamagnetic tissues such as bone. This study demonstrates, for the first time, the feasibility of UTE-based quantitative susceptibility source separation for MSK applications. The approach enhances the detection of paramagnetic substances like hemosiderin in HA and offers potential for improved assessment of bone and joint tissue composition. Full article
(This article belongs to the Section Medical Imaging)
23 pages, 7137 KB  
Article
Vision-Based People Counting and Tracking for Urban Environments
by Daniyar Nurseitov, Kairat Bostanbekov, Nazgul Toiganbayeva, Aidana Zhalgas, Didar Yedilkhan and Beibut Amirgaliyev
J. Imaging 2026, 12(1), 27; https://doi.org/10.3390/jimaging12010027 - 5 Jan 2026
Viewed by 656
Abstract
Population growth and expansion of urban areas increase the need for the introduction of intelligent passenger traffic monitoring systems. Accurate estimation of the number of passengers is an important condition for improving the efficiency, safety and quality of transport services. This paper proposes an approach to the automatic detection and counting of people using computer vision and deep learning methods. While YOLOv8 and DeepSORT have been widely explored individually, our contribution lies in a task-specific modification of the DeepSORT tracking pipeline, optimized for dense passenger environments, strong occlusions, and dynamic lighting, as well as in a unified architecture that integrates detection, tracking, and automatic event-log generation. On our new proprietary dataset of 4047 images and 8918 labeled objects, the system achieved 92% detection accuracy and 85% counting accuracy, which confirms the effectiveness of the solution. Compared to Mask R-CNN and DETR, the YOLOv8 model demonstrates an optimal balance between speed, accuracy, and computational efficiency. The results confirm that computer vision can become an efficient and scalable replacement for traditional sensor-based passenger counting systems. The developed architecture (YOLO + Tracking) combines recognition, tracking and counting of people into a single system that automatically generates annotated video streams and event logs. In the future, it is planned to expand the dataset, introduce support for multicamera integration, and adapt the model for embedded devices to improve the accuracy and energy efficiency of the solution in real-world conditions. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
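In a detect-and-track pipeline like this, the counting step often reduces to checking when a track's centroid crosses a virtual line. A minimal sketch (the paper's exact counting logic is not specified here; the line position and tracks are invented):

```python
def count_crossings(tracks, line_y):
    """Count tracked people whose centroid crosses a virtual horizontal line.
    `tracks` maps track_id -> list of (x, y) centroids over time."""
    entered = exited = 0
    for pts in tracks.values():
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if y0 < line_y <= y1:       # moved downward across the line
                entered += 1
            elif y1 < line_y <= y0:     # moved upward across the line
                exited += 1
    return entered, exited

tracks = {
    1: [(10, 5), (12, 9), (13, 12)],   # crosses y = 10 downward
    2: [(40, 15), (41, 11), (43, 8)],  # crosses y = 10 upward
    3: [(70, 2), (70, 4)],             # never crosses
}
print(count_crossings(tracks, 10))  # (1, 1)
```

Counting per track ID rather than per detection is what makes a stable tracker (here, the modified DeepSORT) essential: a person re-detected in every frame is still counted once.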
19 pages, 1885 KB  
Article
A Hierarchical Multi-Resolution Self-Supervised Framework for High-Fidelity 3D Face Reconstruction Using Learnable Gabor-Aware Texture Modeling
by Pichet Mareo and Rerkchai Fooprateepsiri
J. Imaging 2026, 12(1), 26; https://doi.org/10.3390/jimaging12010026 - 5 Jan 2026
Viewed by 548
Abstract
High-fidelity 3D face reconstruction from a single image is challenging, owing to the inherently ambiguous depth cues and the strong entanglement of multi-scale facial textures. In this regard, we propose a hierarchical multi-resolution self-supervised framework (HMR-Framework), which reconstructs coarse-, medium-, and fine-scale facial geometry progressively through a unified pipeline. A coarse geometric prior is first estimated via 3D morphable model regression, followed by medium-scale refinement using a vertex deformation map constrained by a global–local Markov random field loss to preserve structural coherence. In order to improve fine-scale fidelity, a learnable Gabor-aware texture enhancement module has been proposed to decouple spatial–frequency information and thus improve sensitivity for high-frequency facial attributes. Additionally, we employ a wavelet-based detail perception loss to preserve the edge-aware texture features while mitigating noise commonly observed in in-the-wild images. Extensive qualitative and quantitative evaluations on benchmark datasets indicate that the proposed framework provides better fine-detail reconstruction than existing state-of-the-art methods, while maintaining robustness over pose variations. Notably, the hierarchical design increases semantic consistency across multiple geometric scales, providing a functional solution for high-fidelity 3D face reconstruction from monocular images. Full article
20 pages, 2351 KB  
Article
A Slicer-Independent Framework for Measuring G-Code Accuracy in Medical 3D Printing
by Michel Beyer, Alexandru Burde, Andreas E. Roser, Maximiliane Beyer, Sead Abazi and Florian M. Thieringer
J. Imaging 2026, 12(1), 25; https://doi.org/10.3390/jimaging12010025 - 4 Jan 2026
Viewed by 505
Abstract
In medical 3D printing, accuracy is critical for fabricating patient-specific implants and anatomical models. Although printer performance has been widely examined, the influence of slicing software on geometric fidelity is less frequently quantified. The slicing step, which converts STL files into printer-readable G-code, may introduce deviations that affect the final printed object. To quantify slicer-induced G-code deviations, G-code-derived geometries were compared with their reference STL models. Twenty mandibular models were processed using five slicers (PrusaSlicer (version 2.9.1.), Cura (version 5.2.2.), Simplify3D (version 4.1.2.), Slic3r (version 1.3.0.) and Fusion 360 (version 2.0.19725)). A custom Python workflow converted the G-code into point clouds and reconstructed STL meshes through XY and Z corrections, marching cubes surface extraction, and volumetric extrusion. A calibration object enabled coordinate normalization across slicers. Accuracy was assessed using Mean Surface Distance (MSD), Root Mean Square (RMS) deviation, and Volume Difference. MSD ranged from 0.071 to 0.095 mm, and RMS deviation from 0.084 to 0.113 mm, depending on the slicer. Volumetric differences were slicer-dependent. PrusaSlicer yielded the highest surface accuracy; Simplify3D and Slic3r showed the best repeatability. Fusion 360 produced the largest deviations. The slicers introduced geometric deviations below 0.1 mm that represent a substantial proportion of the overall error in the FDM workflow. Full article
(This article belongs to the Section Medical Imaging)
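Mean Surface Distance between a G-code-derived point cloud and its reference can be sketched as a symmetric nearest-neighbour average. This brute-force 2D toy stands in for the paper's workflow, which reconstructs full meshes first and would use a KD-tree for speed:

```python
import math

def mean_surface_distance(pts_a, pts_b):
    """Symmetric mean surface distance: average nearest-neighbour distance
    from A to B and from B to A."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(pts_a, pts_b) + one_way(pts_b, pts_a))

# Corners of a unit square vs. the same square shifted 0.1 mm along x:
# every point's nearest neighbour is its shifted twin, so MSD = 0.1 mm.
a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
b = [(x + 0.1, y) for x, y in a]
print(round(mean_surface_distance(a, b), 3))  # 0.1
```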
26 pages, 9792 KB  
Article
LLM-Based Pose Normalization and Multimodal Fusion for Facial Expression Recognition in Extreme Poses
by Bohan Chen, Bowen Qu, Yu Zhou, Han Huang, Jianing Guo, Yanning Xian, Longxiang Ma, Jinxuan Yu and Jingyu Chen
J. Imaging 2026, 12(1), 24; https://doi.org/10.3390/jimaging12010024 - 4 Jan 2026
Viewed by 509
Abstract
Facial expression recognition (FER) technology has progressively matured over time. However, existing FER methods are primarily optimized for frontal face images, and their recognition accuracy significantly degrades when processing profile or large-angle rotated facial images. Consequently, this limitation hinders the practical deployment of FER systems. To mitigate the interference caused by large pose variations and improve recognition accuracy, we propose a FER method based on profile-to-frontal transformation and multimodal learning. Specifically, we first leverage the visual understanding and generation capabilities of Qwen-Image-Edit to transform profile images into frontal viewpoints, preserving key expression features while standardizing facial poses. Second, we introduce the CLIP model to enhance the semantic representation capability of expression features through vision–language joint learning. The qualitative and quantitative experiments on the RAF (89.39%), EXPW (67.17%), and AffectNet-7 (62.66%) datasets demonstrate that our method outperforms existing approaches. Full article
(This article belongs to the Special Issue AI-Driven Image and Video Understanding)