Journal Description
Journal of Imaging is an international, multi/interdisciplinary, peer-reviewed, open access journal of imaging techniques, published online monthly by MDPI.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), PubMed, PMC, dblp, Inspec, Ei Compendex, and other databases.
- Journal Rank: JCR - Q2 (Imaging Science and Photographic Technology) / CiteScore - Q1 (Radiology, Nuclear Medicine and Imaging)
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 18 days after submission; the time from acceptance to publication is 3.6 days (median values for papers published in this journal in the second half of 2025).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 3.3 (2024); 5-Year Impact Factor: 3.3 (2024)
Latest Articles
A Dual Stream Deep Learning Framework for Alzheimer’s Disease Detection Using MRI Sonification
J. Imaging 2026, 12(1), 46; https://doi.org/10.3390/jimaging12010046 - 15 Jan 2026
Abstract
Alzheimer’s Disease (AD) is a progressive brain disorder that affects millions of individuals worldwide. It causes gradual damage to brain cells, leading to memory loss and cognitive dysfunction. Although Magnetic Resonance Imaging (MRI) is widely used in AD diagnosis, existing studies rely solely on visual representations, leaving alternative features unexplored. The objective of this study is to explore whether MRI sonification can provide complementary diagnostic information when combined with conventional image-based methods. We propose a novel dual-stream multimodal framework that integrates 2D MRI slices with their corresponding audio representations. MRI images are transformed into audio signals using multi-scale, multi-orientation Gabor filtering, followed by a Hilbert space-filling curve traversal to preserve spatial locality. The image and sound modalities are processed using a lightweight CNN and YAMNet, respectively, and then fused via logistic regression. The multimodal framework achieved the highest accuracy of 98.2% in distinguishing AD from Cognitively Normal (CN) subjects, 94% for AD vs. Mild Cognitive Impairment (MCI), and 93.2% for MCI vs. CN. This work provides a new perspective and highlights the potential of audio transformation of imaging data for feature extraction and classification.
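For readers unfamiliar with the sonification step, the sketch below shows a Hilbert space-filling-curve scan that linearizes a square image while keeping nearby pixels nearby in 1D, which is the role the curve plays in the pipeline described above. It is a minimal, generic illustration (the Gabor filtering, CNN, and YAMNet branches are omitted, and all function names are hypothetical), not the authors' code.

```python
import numpy as np

def d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) coordinates on an n x n grid (n a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan(image):
    """Flatten a square power-of-two image into a 1D signal along the Hilbert curve,
    so that pixels close in 2D stay close in the resulting 1D 'audio' signal."""
    n = image.shape[0]
    assert image.shape == (n, n) and n & (n - 1) == 0, "expects a square power-of-two image"
    return np.array([image[y, x] for x, y in (d2xy(n, d) for d in range(n * n))])

# Late fusion, as described in the abstract, could then stack the per-modality class
# probabilities and fit scikit-learn's LogisticRegression on them as a meta-classifier.
```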
Full article
(This article belongs to the Section AI in Imaging)
Open Access Article
A Cross-Device and Cross-OS Benchmark of Modern Web Animation Systems
by Tajana Koren Ivančević, Trpimir Jeronim Ježić and Nikolina Stanić Loknar
J. Imaging 2026, 12(1), 45; https://doi.org/10.3390/jimaging12010045 - 15 Jan 2026
Abstract
Although modern web technologies increasingly rely on high-performance rendering methods to support rich visual content across a range of devices and operating systems, the field remains significantly under-researched. The performance of animated visual elements is affected by numerous factors, including browsers, operating systems, GPU acceleration, scripting load, and device limitations. This study systematically evaluates animation performance across multiple platforms using a unified set of circle-based animations implemented with eight web-compatible technologies, including HTML, CSS, SVG, JavaScript, Canvas, and WebGL. Animations were evaluated under controlled feature combinations involving random motion, distance, colour variation, blending, and transformations, with object counts ranging from 10 to 10,000. Measurements were conducted on desktop operating systems (Windows, macOS, Linux) and mobile platforms (iOS, Android), using CPU utilisation, GPU memory usage, and frame rate (FPS) as key metrics. Results show that DOM-based approaches maintain stable performance at 100 animated objects but exhibit notable degradation by 500 objects. Canvas-based rendering extends usability to higher object counts, while WebGL demonstrates the most stable performance at large scales (5000–10,000 objects). These findings provide concrete guidance for selecting appropriate animation technologies based on scene complexity and target platform.
Full article
(This article belongs to the Section Visualization and Computer Graphics)
Open Access Article
A Deep Feature Fusion Underwater Image Enhancement Model Based on Perceptual Vision Swin Transformer
by Shasha Tian, Adisorn Sirikham, Jessada Konpang and Chuyang Wang
J. Imaging 2026, 12(1), 44; https://doi.org/10.3390/jimaging12010044 - 14 Jan 2026
Abstract
Underwater optical images are the primary carriers of underwater scene information, playing a crucial role in marine resource exploration, underwater environmental monitoring, and engineering inspection. However, wavelength-dependent absorption and scattering severely deteriorate underwater images, leading to reduced contrast, chromatic distortions, and loss of structural details. To address these issues, we propose a U-shaped underwater image enhancement framework that integrates Swin-Transformer blocks with lightweight attention and residual modules. A Dual-Window Multi-Head Self-Attention (DWMSA) in the bottleneck models long-range context while preserving fine local structure. A Global-Aware Attention Map (GAMP) adaptively re-weights channels and spatial locations to focus on severely degraded regions. A Feature-Augmentation Residual Network (FARN) stabilizes deep training and emphasizes texture and color fidelity. Trained with a combination of Charbonnier, perceptual, and edge losses, our method achieves state-of-the-art results in PSNR and SSIM, the lowest LPIPS, and improvements in UIQM and UCIQE on the UFO-120 and EUVP datasets, with average metrics of PSNR 29.5 dB, SSIM 0.94, LPIPS 0.17, UIQM 3.62, and UCIQE 0.59. Qualitative results show reduced color cast, restored contrast, and sharper details. Code, weights, and evaluation scripts will be released to support reproducibility.
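For a concrete picture of the training objective named above (a combination of Charbonnier, perceptual, and edge losses), here is a minimal PyTorch sketch. The loss weights, the gradient-based edge term, and the placeholder perceptual extractor are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier (smooth L1-like) loss: mean of sqrt(diff^2 + eps^2)."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def edge_loss(pred, target):
    """Charbonnier loss on horizontal and vertical image gradients, encouraging sharp edges."""
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    pdx, pdy = grads(pred)
    tdx, tdy = grads(target)
    return charbonnier(pdx, tdx) + charbonnier(pdy, tdy)

def total_loss(pred, target, perceptual_features=None, w=(1.0, 0.1, 0.05)):
    """Weighted sum of Charbonnier, perceptual, and edge terms. `perceptual_features`
    is a frozen feature extractor (e.g. early VGG layers); the weights `w` are illustrative."""
    loss = w[0] * charbonnier(pred, target) + w[2] * edge_loss(pred, target)
    if perceptual_features is not None:
        loss = loss + w[1] * F.l1_loss(perceptual_features(pred), perceptual_features(target))
    return loss
```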
Full article
(This article belongs to the Special Issue Underwater Imaging (2nd Edition))
Open Access Article
FF-Mamba-YOLO: An SSM-Based Benchmark for Forest Fire Detection in UAV Remote Sensing Images
by Binhua Guo, Dinghui Liu, Zhou Shen and Tiebin Wang
J. Imaging 2026, 12(1), 43; https://doi.org/10.3390/jimaging12010043 - 13 Jan 2026
Abstract
Timely and accurate detection of forest fires through unmanned aerial vehicle (UAV) remote sensing target detection technology is of paramount importance. However, multiscale targets and complex environmental interference in UAV remote sensing images pose significant challenges during detection tasks. To address these obstacles, this paper presents FF-Mamba-YOLO, a novel framework based on the principles of Mamba and YOLO (You Only Look Once) that leverages innovative modules and architectures to overcome these limitations. Specifically, we introduce MFEBlock and MFFBlock based on state space models (SSMs) in the backbone and neck parts of the network, respectively, enabling the model to effectively capture global dependencies. Second, we construct CFEBlock, a module that performs feature enhancement before SSM processing, improving local feature processing capabilities. Furthermore, we propose MGBlock, which adopts a dynamic gating mechanism, enhancing the model’s adaptive processing capabilities and robustness. Finally, we enhance the structure of Path Aggregation Feature Pyramid Network (PAFPN) to improve feature fusion quality and introduce DySample to enhance image resolution without significantly increasing computational costs. Experimental results on our self-constructed forest fire image dataset demonstrate that the model achieves 67.4% mAP@50, 36.3% mAP@50:95, and 64.8% precision, outperforming previous state-of-the-art methods. These results highlight the potential of FF-Mamba-YOLO in forest fire monitoring.
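As a loose illustration of what a dynamic gating block can look like in practice, the sketch below re-weights feature channels from global context; it follows a standard squeeze-and-excitation pattern and is only a stand-in for the MGBlock idea named above, not the authors' module.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    """Illustrative dynamic gating block: a squeeze-and-excitation style gate that
    re-weights channels from pooled global context."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # gate values in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)

# Example: gated_feats = DynamicGate(64)(torch.randn(1, 64, 80, 80))
```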
Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
Open Access Article
GLCN: Graph-Aware Locality-Enhanced Cross-Modality Re-ID Network
by Junjie Cao, Yuhang Yu, Rong Rong and Xing Xie
J. Imaging 2026, 12(1), 42; https://doi.org/10.3390/jimaging12010042 - 13 Jan 2026
Abstract
Cross-modality person re-identification faces challenges such as illumination discrepancies, local occlusions, and inconsistent modality structures, leading to misalignment and sensitivity issues. We propose GLCN, a framework that addresses these problems by enhancing representation learning through locality enhancement, cross-modality structural alignment, and intra-modality compactness. Key components include the Locality-Preserved Cross-branch Fusion (LPCF) module, which combines Local–Positional–Channel Gating (LPCG) for local region and positional sensitivity; Cross-branch Context Interpolated Attention (CCIA) for stable cross-branch consistency; and Graph-Enhanced Center Geometry Alignment (GE-CGA), which aligns class-center similarity structures across modalities to preserve category-level relationships. We also introduce Intra-Modal Prototype Discrepancy Mining Loss (IPDM-Loss) to reduce intra-class variance and improve inter-class separation, thereby creating more compact identity structures in both RGB and IR spaces. Extensive experiments on SYSU-MM01, RegDB, and other benchmarks demonstrate the effectiveness of our approach.
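To make the center-alignment idea above more tangible, here is a minimal PyTorch sketch that compares class-center similarity structures across the two modalities. It is an illustrative reading of the GE-CGA objective (the graph enhancement and the other loss terms are omitted), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def center_geometry_alignment(rgb_feats, ir_feats, labels, num_classes):
    """Compute per-class feature centers in each modality, build their cosine-similarity
    matrices, and penalize the gap between the two structures."""
    def class_centers(feats, labels):
        centers = []
        for c in range(num_classes):
            mask = labels == c
            # fall back to the batch mean if a class is absent from this batch
            centers.append(feats[mask].mean(dim=0) if mask.any() else feats.mean(dim=0))
        return torch.stack(centers)

    c_rgb = F.normalize(class_centers(rgb_feats, labels), dim=1)
    c_ir = F.normalize(class_centers(ir_feats, labels), dim=1)
    return F.mse_loss(c_rgb @ c_rgb.t(), c_ir @ c_ir.t())
```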
Full article
(This article belongs to the Special Issue Infrared Image Processing with Artificial Intelligence: Progress and Challenges)
Open Access Article
Calibrated Transformer Fusion for Dual-View Low-Energy CESM Classification
by Ahmed A. H. Alkurdi and Amira Bibo Sallow
J. Imaging 2026, 12(1), 41; https://doi.org/10.3390/jimaging12010041 - 13 Jan 2026
Abstract
Contrast-enhanced spectral mammography (CESM) provides low-energy images acquired in standard craniocaudal (CC) and mediolateral oblique (MLO) views, and clinical interpretation relies on integrating both views. This study proposes a dual-view classification framework that combines deep CNN feature extraction with transformer-based fusion for breast-side classification using low-energy (DM) images from CESM acquisitions (Normal vs. Tumorous; benign and malignant merged). The evaluation was conducted using 5-fold stratified group cross-validation with patient-level grouping to prevent leakage across folds. The final configuration (Model E) integrates dual-backbone feature extraction, transformer fusion, MC-dropout inference for uncertainty estimation, and post hoc logistic calibration. Across the five held-out test folds, Model E achieved a mean accuracy of 96.88% ± 2.39% and a mean F1-score of 97.68% ± 1.66%. The mean ROC-AUC and PR-AUC were 0.9915 ± 0.0098 and 0.9968 ± 0.0029, respectively. Probability quality was supported by a mean Brier score of 0.0236 ± 0.0145 and a mean expected calibration error (ECE) of 0.0334 ± 0.0171. An ablation study (Models A–E) was also reported to quantify the incremental contribution of dual-view input, transformer fusion, and uncertainty calibration. Within the limits of this retrospective single-center setting, these results suggest that dual-view transformer fusion can provide strong discrimination while also producing calibrated probabilities and uncertainty outputs that are relevant for decision support.
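The uncertainty and calibration steps mentioned above follow well-known recipes; the sketch below shows a generic MC-dropout inference loop and a Platt-style post hoc logistic calibrator fitted on held-out predictions. It is a minimal illustration under the assumption of a binary sigmoid output, not the authors' pipeline.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def enable_mc_dropout(model):
    """Keep only the dropout layers stochastic at inference time (everything else in eval mode)."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

def mc_dropout_predict(model, x, passes=20):
    """Average sigmoid outputs over several stochastic forward passes; the standard deviation
    across passes serves as a simple per-case uncertainty estimate."""
    enable_mc_dropout(model)
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(passes)])
    return probs.mean(dim=0), probs.std(dim=0)

def fit_calibrator(mean_probs, labels):
    """Platt-style post hoc calibration: fit a logistic regression on the logit of the mean
    probability using a held-out fold, then apply it to test-fold predictions."""
    logits = np.log((mean_probs + 1e-8) / (1 - mean_probs + 1e-8)).reshape(-1, 1)
    return LogisticRegression().fit(logits, labels)
```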
Full article
(This article belongs to the Topic Transformer and Deep Learning Applications in Image Processing)
Open Access Article
A Dual-UNet Diffusion Framework for Personalized Panoramic Generation
by Jing Shen, Leigang Huo, Chunlei Huo and Shiming Xiang
J. Imaging 2026, 12(1), 40; https://doi.org/10.3390/jimaging12010040 - 11 Jan 2026
Abstract
While text-to-image and customized generation methods demonstrate strong capabilities in single-image generation, they fall short in supporting immersive applications that require coherent 360° panoramas. Conversely, existing panorama generation models lack customization capabilities. In panoramic scenes, reference objects often appear as minor background elements and may be multiple in number, while reference images across different views exhibit weak correlations. To address these challenges, we propose a diffusion-based framework for customized multi-view image generation. Our approach introduces a decoupled feature injection mechanism within a dual-UNet architecture to handle weakly correlated reference images, effectively integrating spatial information by concurrently feeding both reference images and noise into the denoising branch. A hybrid attention mechanism enables deep fusion of reference features and multi-view representations. Furthermore, a data augmentation strategy facilitates viewpoint-adaptive pose adjustments, and panoramic coordinates are employed to guide multi-view attention. The experimental results demonstrate our model’s effectiveness in generating coherent, high-quality customized multi-view images.
Full article
(This article belongs to the Section AI in Imaging)
Open Access Article
Self-Supervised Learning of Deep Embeddings for Classification and Identification of Dental Implants
by Amani Almalki, Abdulrahman Almalki and Longin Jan Latecki
J. Imaging 2026, 12(1), 39; https://doi.org/10.3390/jimaging12010039 - 9 Jan 2026
Abstract
This study proposes an automated system using deep learning-based object detection to identify implant systems, leveraging recent progress in self-supervised learning, specifically masked image modeling (MIM). We advocate for self-pre-training, emphasizing its advantages when acquiring suitable pre-training data is challenging. The proposed Masked Deep Embedding (MDE) pre-training method, extending the masked autoencoder (MAE) transformer, significantly enhances dental implant detection performance compared to baselines. Specifically, the proposed method achieves the best detection performance of AP = 96.1, outperforming supervised ViT and MAE baselines by up to +2.9 AP. In addition, we address the absence of a comprehensive dataset for implant design, enhancing an existing dataset under dental expert supervision. This augmentation includes annotations for implant design, such as coronal, middle, and apical parts, resulting in a unique Implant Design Dataset (IDD). The contributions encompass employing self-supervised learning for limited dental radiograph data, replacing MAE’s patch reconstruction with patch embeddings, achieving substantial performance improvement in implant detection, and expanding possibilities through the labeling of implant design. This study paves the way for AI-driven solutions in implant dentistry, providing valuable tools for dentists and patients facing implant-related challenges.
Full article
(This article belongs to the Section Medical Imaging)
Open Access Article
SCT-Diff: Seamless Contextual Tracking via Diffusion Trajectory
by Guohao Nie, Xingmei Wang, Debin Zhang and He Wang
J. Imaging 2026, 12(1), 38; https://doi.org/10.3390/jimaging12010038 - 9 Jan 2026
Abstract
Existing detection-based trackers exploit temporal contexts by updating appearance models or modeling target motion. However, the sequential one-shot integration of temporal priors risks amplifying error accumulation, as frame-level template matching restricts comprehensive spatiotemporal analysis. To address this, we propose SCT-Diff, a video-level framework that holistically estimates target trajectories. Specifically, SCT-Diff processes video clips globally via a diffusion model to incorporate bidirectional spatiotemporal awareness, where reverse diffusion steps progressively refine noisy trajectory proposals into optimal predictions. Crucially, SCT-Diff enables iterative correction of historical trajectory hypotheses by observing future contexts within a sliding time window. This closed-loop feedback from future frames preserves temporal consistency and breaks the error propagation chain under complex appearance variations. For joint modeling of appearance and motion dynamics, we formulate trajectories as unified discrete token sequences. The designed Mamba-based expert decoder bridges visual features with language-formulated trajectories, enabling lightweight yet coherent sequence modeling. Extensive experiments demonstrate SCT-Diff’s superior efficiency and performance, achieving 75.4% AO on GOT-10k while maintaining real-time computational efficiency.
Full article
(This article belongs to the Special Issue Object Detection in Video Surveillance Systems)
Open Access Article
Degradation-Aware Multi-Stage Fusion for Underwater Image Enhancement
by Lian Xie, Hao Chen and Jin Shu
J. Imaging 2026, 12(1), 37; https://doi.org/10.3390/jimaging12010037 - 8 Jan 2026
Abstract
Underwater images frequently suffer from color casts, low illumination, and blur due to wavelength-dependent absorption and scattering. We present a practical two-stage, modular, and degradation-aware framework designed for real-time enhancement, prioritizing deployability on edge devices. Stage I employs a lightweight CNN to classify inputs into three dominant degradation classes (color cast, low light, blur) with 91.85% accuracy on an EUVP subset. Stage II applies three scene-specific lightweight enhancement pipelines and fuses their outputs using two alternative learnable modules: a global Linear Fusion and a LiteUNetFusion (spatially adaptive weighting with optional residual correction). Compared to the three single-scene optimizers (average PSNR = 19.0 dB; mean UCIQE ≈ 0.597; mean UIQM ≈ 2.07), the Linear Fusion improves PSNR by +2.6 dB on average and yields roughly +20.7% in UCIQE and +21.0% in UIQM, while maintaining low latency (~90 ms per 640 × 480 frame on an Intel i5-13400F (Intel Corporation, Santa Clara, CA, USA)). The LiteUNetFusion further refines results: it raises PSNR by +1.5 dB over the Linear model (23.1 vs. 21.6 dB), brings modest perceptual gains (UCIQE from 0.72 to 0.74, UIQM from 2.5 to 2.8) at a runtime of ≈125 ms per 640 × 480 frame, and better preserves local texture and color consistency in mixed-degradation scenes. We release implementation details for reproducibility and discuss limitations (e.g., occasional blur/noise amplification and domain generalization) together with future directions.
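As an illustration of the global Linear Fusion idea described above, the sketch below blends the outputs of the three scene-specific pipelines with a learnable set of softmax-normalized weights; the branch count and normalization are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class LinearFusion(nn.Module):
    """Global linear fusion: blend the outputs of several enhancement branches with
    learnable weights that are softmax-normalized to sum to one."""
    def __init__(self, num_branches=3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, branch_outputs):
        # branch_outputs: list of tensors of shape (B, 3, H, W), one per pipeline
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * out for wi, out in zip(w, branch_outputs))

# Example: fused = LinearFusion()([out_color_cast, out_low_light, out_deblur])
```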
Full article
(This article belongs to the Section Image and Video Processing)
Open Access Article
A Hierarchical Deep Learning Architecture for Diagnosing Retinal Diseases Using Cross-Modal OCT to Fundus Translation in the Lack of Paired Data
by Ekaterina A. Lopukhova, Gulnaz M. Idrisova, Timur R. Mukhamadeev, Grigory S. Voronkov, Ruslan V. Kutluyarov and Elizaveta P. Topolskaya
J. Imaging 2026, 12(1), 36; https://doi.org/10.3390/jimaging12010036 - 8 Jan 2026
Abstract
The paper focuses on automated diagnosis of retinal diseases, particularly Age-related Macular Degeneration (AMD) and diabetic retinopathy (DR), using optical coherence tomography (OCT), while addressing three key challenges: disease comorbidity, severe class imbalance, and the lack of strictly paired OCT and fundus data. We propose a hierarchical modular deep learning system designed for multi-label OCT screening with conditional routing to specialized staging modules. To enable DR staging when fundus images are unavailable, we use cross-modal alignment between OCT and fundus representations. This approach involves training a latent bridge that projects OCT embeddings into the fundus feature space. We enhance clinical reliability through per-class threshold calibration and implement quality control checks for OCT-only DR staging. Experiments demonstrate robust multi-label performance (macro-F1 after per-class threshold calibration) and reliable calibration (ECE ), and OCT-only DR staging is feasible in 96.1% of cases that meet the quality control criterion.
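The "latent bridge" described above can be pictured as a small projection network trained to map OCT embeddings into the fundus feature space. The sketch below is one plausible reading with assumed embedding sizes and a cosine alignment objective; it is not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentBridge(nn.Module):
    """Project OCT embeddings into the fundus feature space via a small MLP
    (dimensions are illustrative assumptions)."""
    def __init__(self, oct_dim=512, fundus_dim=512, hidden=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(oct_dim, hidden), nn.ReLU(), nn.Linear(hidden, fundus_dim))

    def forward(self, oct_emb):
        return self.proj(oct_emb)

def alignment_loss(bridge, oct_emb, fundus_emb):
    """Cosine alignment between projected OCT embeddings and target fundus embeddings."""
    return 1 - F.cosine_similarity(bridge(oct_emb), fundus_emb, dim=-1).mean()
```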
Full article
(This article belongs to the Section Medical Imaging)
Open Access Communication
Comparison of the Radiomics Features of Normal-Appearing White Matter in Persons with High or Low Perivascular Space Scores
by Onural Ozturk, Sibel Balci and Seda Ozturk
J. Imaging 2026, 12(1), 35; https://doi.org/10.3390/jimaging12010035 - 8 Jan 2026
Abstract
The clinical significance of perivascular spaces (PVS) remains controversial. Radiomics refers to the extraction of quantitative features from medical images using pixel-based computational approaches. This study aimed to compare the radiomics features of normal-appearing white matter (NAWM) in patients with low and high PVS scores to reveal microstructural differences that are not visible macroscopically. Adult patients who underwent cranial MRI over a one-month period were retrospectively screened and divided into two groups according to their global PVS score. Radiomics feature extraction from NAWM was performed at the level of the centrum semiovale on FLAIR and ADC images. Radiomics features were selected using Least Absolute Shrinkage and Selection Operator (LASSO) regression during the initial model development phase, and predefined radiomics scores were evaluated for both sequences. A total of 160 patients were included in the study. Radiomics scores derived from normal-appearing white matter demonstrated good discriminative performance for differentiating high vs. low perivascular space (PVS) burden (AUC = 0.853 for FLAIR and AUC = 0.753 for ADC). In age- and scanner-adjusted multivariable models, radiomics scores remained independently associated with high PVS burden. These findings suggest that radiomics analysis of NAWM can capture subtle white matter alterations associated with PVS burden and may serve as a non-invasive biomarker for early detection of microvascular and inflammatory changes.
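For readers unfamiliar with LASSO-based radiomics scores, the sketch below shows a generic scikit-learn pipeline: an L1-penalized logistic regression selects a sparse feature subset, and its decision function is then used as the radiomics score. The cross-validation setup and hyper-parameters are assumptions, not the study's exact protocol.

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_radiomics_score(X, y):
    """X: (n_patients, n_radiomics_features) matrix; y: binary PVS burden labels.
    The L1 penalty drives most feature coefficients to zero (LASSO-style selection)."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5, scoring="roc_auc"),
    )
    model.fit(X, y)
    return model  # model.decision_function(X_new) then serves as the radiomics score
```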
Full article
(This article belongs to the Special Issue Progress and Challenges in Biomedical Image Analysis—2nd Edition)
Open Access Article
Empirical Evaluation of UNet for Segmentation of Applicable Surfaces for Seismic Sensor Installation
by Mikhail Uzdiaev, Marina Astapova, Andrey Ronzhin and Aleksandra Figurek
J. Imaging 2026, 12(1), 34; https://doi.org/10.3390/jimaging12010034 - 8 Jan 2026
Abstract
The deployment of wireless seismic nodal systems necessitates the efficient identification of optimal locations for sensor installation, considering factors such as ground stability and the absence of interference. Semantic segmentation of satellite imagery has advanced significantly, yet its application to this specific task remains unexplored. This work presents a baseline empirical evaluation of the U-Net architecture for the semantic segmentation of surfaces applicable for seismic sensor installation. We utilize a novel dataset of Sentinel-2 multispectral images, specifically labeled for this purpose. The study investigates the impact of pretrained encoders (EfficientNetB2, Cross-Stage Partial Darknet53—CSPDarknet53, and Multi-Axis Vision Transformer—MAxViT), different combinations of Sentinel-2 spectral bands (Red, Green, Blue (RGB), RGB+Near Infrared (NIR), 10-bands with 10 and 20 m/pix spatial resolution, full 13-band), and a technique for improving small object segmentation by modifying the input convolutional layer stride. Experimental results demonstrate that the CSPDarknet53 encoder generally outperforms the others (IoU = 0.534, Precision = 0.716, Recall = 0.635). The combination of RGB and Near-Infrared bands (10 m/pixel resolution) yielded the most robust performance across most configurations. Reducing the input stride from 2 to 1 proved beneficial for segmenting small linear objects like roads. The findings establish a baseline for this novel task and provide practical insights for optimizing deep learning models in the context of automated seismic nodal network installation planning.
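The stride modification mentioned in the results can be pictured as the small PyTorch tweak below, which sets the encoder's first (stem) convolution to stride 1 so that fine linear structures keep more spatial resolution; the helper is hypothetical and not tied to the specific encoder implementations used in the paper.

```python
import torch.nn as nn

def reduce_stem_stride(model, stride=1):
    """Set the stride of the first Conv2d found in `model` (assumed to be the stem) to `stride`.
    Note that all downstream feature maps become larger, which raises memory and compute cost."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            m.stride = (stride, stride)
            return m
    raise ValueError("no Conv2d layer found in the model")
```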
Full article
(This article belongs to the Special Issue Image Segmentation: Trends and Challenges)
Open Access Article
A Unified Complex-Fresnel Model for Physically Based Long-Wave Infrared Imaging and Simulation
by Peter ter Heerdt, William Keustermans, Ivan De Boi and Steve Vanlanduit
J. Imaging 2026, 12(1), 33; https://doi.org/10.3390/jimaging12010033 - 7 Jan 2026
Abstract
Accurate modelling of reflection, transmission, absorption, and emission at material interfaces is essential for infrared imaging, rendering, and the simulation of optical and sensing systems. This need is particularly pronounced across the short-wave to long-wave infrared (SWIR–LWIR) spectrum, where many materials exhibit dispersion- and wavelength-dependent attenuation described by complex refractive indices. In this work, we introduce a unified formulation of the full Fresnel equations that directly incorporates wavelength-dependent complex refractive-index data and provides physically consistent interface behaviour for both dielectrics and conductors. The approach reformulates the classical Fresnel expressions to eliminate sign ambiguities and numerical instabilities, resulting in a stable evaluation across incidence angles and for strongly absorbing materials. We demonstrate the model through spectral-rendering simulations that illustrate realistic reflectance and transmittance behaviour for materials with different infrared optical properties. To assess its suitability for thermal-infrared applications, we also compare the simulated long-wave emission of a heated glass sphere with measurements from a LWIR camera. The agreement between measured and simulated radiometric trends indicates that the proposed formulation offers a practical and physically grounded tool for wavelength-parametric interface modelling in infrared imaging, supporting applications in spectral rendering, synthetic data generation, and infrared system analysis.
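As background for the interface model discussed above, the sketch below evaluates the textbook Fresnel power reflectances for s- and p-polarization with complex refractive indices. The paper's unified reformulation addresses sign-convention and numerical-stability issues that this naive version does not, and the example index value is purely illustrative.

```python
import numpy as np

def fresnel_reflectance(n1, n2, theta_i):
    """Power reflectances (Rs, Rp) at an interface between media with (possibly complex)
    refractive indices n1 and n2, for incidence angle theta_i in radians."""
    n1, n2 = complex(n1), complex(n2)
    cos_i = np.cos(theta_i)
    sin_t = (n1 / n2) * np.sin(theta_i)   # complex form of Snell's law
    cos_t = np.sqrt(1 - sin_t ** 2)       # principal square-root branch
    rs = (n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)
    rp = (n2 * cos_i - n1 * cos_t) / (n2 * cos_i + n1 * cos_t)
    return abs(rs) ** 2, abs(rp) ** 2

# Example: a strongly absorbing (metal-like) index at 45 degrees incidence (illustrative value).
Rs, Rp = fresnel_reflectance(1.0, 12 + 55j, np.radians(45))
```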
Full article
(This article belongs to the Section Visualization and Computer Graphics)
Open Access Article
Hybrid Skeleton-Based Motion Templates for Cross-View and Appearance-Robust Gait Recognition
by João Ferreira Nunes, Pedro Miguel Moreira and João Manuel R. S. Tavares
J. Imaging 2026, 12(1), 32; https://doi.org/10.3390/jimaging12010032 - 7 Jan 2026
Abstract
Gait recognition methods based on silhouette templates, such as the Gait Energy Image (GEI), achieve high accuracy under controlled conditions but often degrade when appearance varies due to viewpoint, clothing, or carried objects. In contrast, skeleton-based approaches provide interpretable motion cues but remain sensitive to pose-estimation noise. This work proposes two compact 2D skeletal descriptors—Gait Skeleton Images (GSIs)—that encode 3D joint trajectories into line-based and joint-based static templates compatible with standard 2D CNN architectures. A unified processing pipeline is introduced, including skeletal topology normalization, rigid view alignment, orthographic projection, and pixel-level rendering. Core design factors are analyzed on the GRIDDS dataset, where depth-based 3D coordinates provide stable ground truth for evaluating structural choices and rendering parameters. An extensive evaluation is then conducted on the widely used CASIA-B dataset, using 3D coordinates estimated via human pose estimation, to assess robustness under viewpoint, clothing, and carrying covariates. Results show that although GEIs achieve the highest same-view accuracy, GSI variants exhibit reduced degradation under appearance changes and demonstrate greater stability under severe cross-view conditions. These findings indicate that compact skeletal templates can complement appearance-based descriptors and may benefit further from continued advances in 3D human pose estimation.
Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
Open Access Article
Deep Learning-Assisted Autofocus for Aerial Cameras in Maritime Photography
by Haiying Liu, Yingchao Li, Shilong Xu, Haoyu Wang, Qiang Fu and Huilin Jiang
J. Imaging 2026, 12(1), 31; https://doi.org/10.3390/jimaging12010031 - 7 Jan 2026
Abstract
To address the unreliable autofocus problem of drone-mounted visible-light aerial cameras in low-contrast maritime environments, this paper proposes an autofocus system that combines deep-learning-based coarse focusing with traditional search-based fine adjustment. The system uses a built-in high-contrast resolution test chart as the signal source. Images captured by the imaging sensor are fed into a lightweight convolutional neural network to regress the defocus distance, enabling fast focus positioning. This avoids the weak signal and inaccurate focusing often encountered when adjusting focus directly on low-contrast sea surfaces. In the fine-focusing stage, a hybrid strategy integrating hill-climbing search and inverse correction is adopted. By evaluating the image sharpness function, the system accurately locks onto the optimal focal plane, forming intelligent closed-loop control. Experiments show that this method, which combines imaging of the built-in calibration target with deep-learning-based coarse focusing, significantly improves focusing efficiency. Compared with traditional full-range search strategies, the focusing speed is increased by approximately 60%. While ensuring high accuracy and strong adaptability, the proposed approach effectively enhances the overall imaging performance of aerial cameras in low-contrast maritime conditions.
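The fine-focusing stage described above combines a sharpness metric with a hill-climbing search; the sketch below is a generic version with a gradient-based sharpness function and a hypothetical capture(pos) callback, not the camera's actual control loop.

```python
import numpy as np

def sharpness(image):
    """Tenengrad-style sharpness: mean squared gradient magnitude of a grayscale image."""
    gy, gx = np.gradient(image.astype(float))
    return float(np.mean(gx ** 2 + gy ** 2))

def hill_climb_focus(capture, start, step=8, min_step=1, max_iters=200):
    """Hill-climbing fine focus around a coarse starting position.
    `capture(pos)` is a hypothetical callback returning an image at focus-motor position `pos`."""
    pos, best = start, sharpness(capture(start))
    direction = 1
    for _ in range(max_iters):
        if step < min_step:
            break
        cand = pos + direction * step
        score = sharpness(capture(cand))
        if score > best:
            pos, best = cand, score         # keep climbing in the current direction
        elif direction == 1:
            direction = -1                  # sharpness dropped: try the other direction
        else:
            direction, step = 1, step // 2  # both directions failed at this step: refine
    return pos
```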
Full article
(This article belongs to the Section Computational Imaging and Computational Photography)
Open Access Article
From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
by Vasiliy Kudryavtsev, Kirill Borodin, German Berezin, Kirill Bubenchikov, Grach Mkrtchian and Alexander Ryzhkov
J. Imaging 2026, 12(1), 30; https://doi.org/10.3390/jimaging12010030 - 7 Jan 2026
Abstract
Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
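The gated fusion mechanism referred to above can be sketched as a small PyTorch module in which a learned gate mixes the projected vision and text embeddings per dimension; the embedding sizes below are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Fuse a vision embedding and a text embedding: project both to a shared space and
    mix them with a sigmoid gate computed from their concatenation."""
    def __init__(self, img_dim=1536, txt_dim=384, out_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.txt_proj = nn.Linear(txt_dim, out_dim)
        self.gate = nn.Linear(2 * out_dim, out_dim)

    def forward(self, img_emb, txt_emb):
        v, t = self.img_proj(img_emb), self.txt_proj(txt_emb)
        g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))  # per-dimension mixing weights
        return F.normalize(g * v + (1 - g) * t, dim=-1)          # unit-norm identity embedding
```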
Full article
(This article belongs to the Section Biometrics, Forensics, and Security)
Open Access Article
DynMultiDep: A Dynamic Multimodal Fusion and Multi-Scale Time Series Modeling Approach for Depression Detection
by Jincheng Li, Menglin Zheng, Jiongyi Yang, Yihui Zhan and Xing Xie
J. Imaging 2026, 12(1), 29; https://doi.org/10.3390/jimaging12010029 - 6 Jan 2026
Abstract
Depression is a prevalent mental disorder that imposes a significant public health burden worldwide. Although multimodal detection methods have shown potential, existing techniques still face two critical bottlenecks: (i) insufficient integration of global patterns and local fluctuations in long-sequence modeling and (ii) static fusion strategies that fail to dynamically adapt to the complementarity and redundancy among modalities. To address these challenges, this paper proposes a dynamic multimodal depression detection framework, DynMultiDep, which combines multi-scale temporal modeling with an adaptive fusion mechanism. The core innovations of DynMultiDep lie in its Multi-scale Temporal Experts Module (MTEM) and Dynamic Multimodal Fusion module (DynMM). On one hand, MTEM employs Mamba experts to extract long-term trend features and utilizes local-window Transformers to capture short-term dynamic fluctuations, achieving adaptive fusion through a long-short routing mechanism. On the other hand, DynMM introduces modality-level and fusion-level dynamic decision-making, selecting critical modality paths and optimizing cross-modal interaction strategies based on input characteristics. The experimental results demonstrate that DynMultiDep outperforms existing state-of-the-art methods in detection performance on two widely used large-scale depression datasets.
Full article
(This article belongs to the Special Issue Towards Deeper Understanding of Image and Video Processing and Analysis)
Open Access Article
Ultrashort Echo Time Quantitative Susceptibility Source Separation in Musculoskeletal System: A Feasibility Study
by Sam Sedaghat, Jin Il Park, Eddie Fu, Annette von Drygalski, Yajun Ma, Eric Y. Chang, Jiang Du, Lorenzo Nardo and Hyungseok Jang
J. Imaging 2026, 12(1), 28; https://doi.org/10.3390/jimaging12010028 - 6 Jan 2026
Abstract
This study aims to demonstrate the feasibility of ultrashort echo time (UTE)-based susceptibility source separation for musculoskeletal (MSK) imaging, enabling discrimination between diamagnetic and paramagnetic tissue components, with a particular focus on hemophilic arthropathy (HA). Three key techniques were integrated to achieve UTE-based susceptibility source separation: Iterative decomposition of water and fat with echo asymmetry and least-squares estimation for B0 field estimation, projection onto dipole fields for local field mapping, and χ-separation for quantitative susceptibility mapping (QSM) with source decomposition. A phantom containing varying concentrations of diamagnetic (CaCO3) and paramagnetic (Fe3O4) materials was used to validate the method. In addition, in vivo UTE-QSM scans of the knees and ankles were performed on five HA patients using a 3T clinical MRI scanner. In the phantom, conventional QSM underestimated susceptibility values due to cancellation between the mixed diamagnetic and paramagnetic sources. In contrast, source-separated maps provided distinct diamagnetic and paramagnetic susceptibility values that correlated strongly with CaCO3 and Fe3O4 concentrations (r = −0.99 and 0.95, p < 0.05). In vivo, paramagnetic maps enabled improved visualization of hemosiderin deposits in joints of HA patients, which were poorly visualized or obscured in conventional QSM due to susceptibility cancellation by surrounding diamagnetic tissues such as bone. This study demonstrates, for the first time, the feasibility of UTE-based quantitative susceptibility source separation for MSK applications. The approach enhances the detection of paramagnetic substances like hemosiderin in HA and offers potential for improved assessment of bone and joint tissue composition.
Full article
(This article belongs to the Section Medical Imaging)
Open Access Article
Vision-Based People Counting and Tracking for Urban Environments
by Daniyar Nurseitov, Kairat Bostanbekov, Nazgul Toiganbayeva, Aidana Zhalgas, Didar Yedilkhan and Beibut Amirgaliyev
J. Imaging 2026, 12(1), 27; https://doi.org/10.3390/jimaging12010027 - 5 Jan 2026
Abstract
Population growth and expansion of urban areas increase the need for the introduction of intelligent passenger traffic monitoring systems. Accurate estimation of the number of passengers is an important condition for improving the efficiency, safety and quality of transport services. This paper proposes an approach to the automatic detection and counting of people using computer vision and deep learning methods. While YOLOv8 and DeepSORT have been widely explored individually, our contribution lies in a task-specific modification of the DeepSORT tracking pipeline, optimized for dense passenger environments, strong occlusions, and dynamic lighting, as well as in a unified architecture that integrates detection, tracking, and automatic event-log generation. On our new proprietary dataset of 4047 images and 8918 labeled objects, the system achieved 92% detection accuracy and 85% counting accuracy, which confirms the effectiveness of the solution. Compared to Mask R-CNN and DETR, the YOLOv8 model demonstrates an optimal balance between speed, accuracy, and computational efficiency. The results confirm that computer vision can become an efficient and scalable replacement for traditional sensor-based passenger counting systems. The developed architecture (YOLO + Tracking) combines recognition, tracking and counting of people into a single system that automatically generates annotated video streams and event logs. In the future, we plan to expand the dataset, introduce support for multi-camera integration, and adapt the model for embedded devices to improve the accuracy and energy efficiency of the solution in real-world conditions.
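Independently of the detector and tracker used, the counting and event-log step can be reduced to simple line-crossing logic over tracked centroids, as in the sketch below; the data layout is hypothetical and this is not the deployed system's code.

```python
# Count a track when its centroid crosses a virtual line (e.g. a door threshold).
# `tracks` maps a track ID from the tracking stage to the current centroid (cx, cy).
def update_counts(tracks, line_y, state, counts):
    """state: last known side of the line per track ID; counts: {"in": int, "out": int}."""
    for tid, (cx, cy) in tracks.items():
        side = "below" if cy > line_y else "above"
        prev = state.get(tid)
        if prev is not None and prev != side:
            counts["in" if side == "below" else "out"] += 1  # log one crossing event
        state[tid] = side
    return counts

# Example: counts = update_counts({7: (320, 250)}, line_y=240,
#                                 state={7: "above"}, counts={"in": 0, "out": 0})
```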
Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
Topics
Topic in AI, Applied Sciences, Bioengineering, Healthcare, IJERPH, JCM, Clinics and Practice, J. Imaging
Artificial Intelligence in Public Health: Current Trends and Future Possibilities, 2nd Edition
Topic Editors: Daniele Giansanti, Giovanni Costantini
Deadline: 15 March 2026
Topic in Applied Sciences, Computers, Electronics, Information, J. Imaging
Visual Computing and Understanding: New Developments and Trends
Topic Editors: Wei Zhou, Guanghui Yue, Wenhan Yang
Deadline: 31 March 2026
Topic in Applied Sciences, Electronics, J. Imaging, MAKE, Information, BDCC, Signals
Applications of Image and Video Processing in Medical Imaging
Topic Editors: Jyh-Cheng Chen, Kuangyu Shi
Deadline: 30 April 2026
Topic in Diagnostics, Electronics, J. Imaging, Mathematics, Sensors
Transformer and Deep Learning Applications in Image Processing
Topic Editors: Fengping An, Haitao Xu, Chuyang Ye
Deadline: 31 May 2026
Special Issues
Special Issue in J. Imaging
Image Segmentation: Trends and Challenges
Guest Editor: Nikolaos Mitianoudis
Deadline: 31 January 2026
Special Issue in J. Imaging
Progress and Challenges in Biomedical Image Analysis—2nd Edition
Guest Editors: Lei Li, Zehor Belkhatir
Deadline: 31 January 2026
Special Issue in J. Imaging
Computer Vision for Medical Image Analysis
Guest Editors: Rahman Attar, Le Zhang
Deadline: 15 February 2026
Special Issue in J. Imaging
Emerging Technologies for Less Invasive Diagnostic Imaging
Guest Editors: Francesca Angelone, Noemi Pisani, Armando Ricciardi
Deadline: 28 February 2026