Search Results (242)

Search Parameters:
Keywords = EfficientNet and transformer architectures

19 pages, 7222 KB  
Article
Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
J. Imaging 2025, 11(10), 341; https://doi.org/10.3390/jimaging11100341 - 1 Oct 2025
Viewed by 190
Abstract
Early, non-invasive detection of Parkinson’s Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract and fuse three complementary time–frequency representations—mel spectrogram, constant-Q transform (CQT), and gammatone spectrogram—into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and Vision Transformer using the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results showed that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods using handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information. Our approach implies that multi-channel fusion could enhance the detection of discriminative acoustic biomarkers, potentially offering a more robust and effective framework for speech-based PD screening, though further validation is needed before clinical application. Full article
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
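
As an illustration of the three-channel fusion described in this abstract, here is a minimal sketch assuming librosa; the gammatone channel is approximated with a second mel filterbank as a labeled stand-in, since a true gammatone spectrogram would come from a separate filterbank toolbox:

```python
# Minimal sketch of the three-channel spectro-temporal fusion idea:
# stack mel, CQT, and gammatone-style spectrograms as an RGB-like input.
import numpy as np
import librosa

def fused_representation(path, sr=16000, n_bins=128, hop=256):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop))
    cqt = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y, sr=sr, hop_length=hop, n_bins=n_bins, bins_per_octave=24)))
    # Stand-in for the gammatone channel: a mel filterbank over a shifted range;
    # a genuine gammatone filterbank implementation would replace this step.
    gamma = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop,
                                       fmin=50, fmax=sr // 2))
    T = min(mel.shape[1], cqt.shape[1], gamma.shape[1])
    chans = []
    for s in (mel, cqt, gamma):
        s = s[:, :T]
        s = (s - s.min()) / (s.max() - s.min() + 1e-8)  # per-channel min-max scaling
        chans.append(s)
    return np.stack(chans, axis=-1)  # (n_bins, T, 3), ready for an ImageNet-style CNN/ViT
```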

17 pages, 930 KB  
Article
Investigation of the MobileNetV2 Optimal Feature Extraction Layer for EEG-Based Dementia Severity Classification: A Comparative Study
by Noor Kamal Al-Qazzaz, Sawal Hamid Bin Mohd Ali and Siti Anom Ahmad
Algorithms 2025, 18(10), 620; https://doi.org/10.3390/a18100620 - 1 Oct 2025
Viewed by 148
Abstract
Diagnosing dementia and recognizing substantial cognitive decline are challenging tasks. Thus, the objective of this study was to classify electroencephalograms (EEGs) recorded during a working memory task in 15 patients with mild cognitive impairment (MCogImp), 5 patients with vascular dementia (VasD), and 15 healthy controls (NC). Before spectrogram images were created from the EEG dataset, the data were preprocessed using conventional filters and the discrete wavelet transform. The convolutional neural network (CNN) MobileNetV2 was employed in our investigation to identify features and assess the severity of dementia. The features were extracted from five layers of the MobileNetV2 CNN architecture: the convolutional (‘Conv-1’), batch normalization (‘Conv-1-bn’), clipped ReLU (‘out-relu’), 2D Global Average Pooling (‘global-average-pooling2d1’), and fully connected (‘Logits’) layers. This was carried out to identify the most effective feature-extraction layer for assessing dementia severity from EEGs. The features from MobileNetV2’s five layers were classified using decision tree (DT) and k-nearest neighbor (KNN) machine learning (ML) classifiers, in conjunction with the MobileNetV2 deep learning (DL) network. The study’s findings show that the DT classifier performed best using features derived from MobileNetV2 with the 2D Global Average Pooling (global-average-pooling2d-1) layer, achieving an accuracy score of 95.9%. Features from the fully connected (Logits) layer came second, with a score of 95.3%. The findings of this study endorse the utilization of deep processing algorithms, offering a viable approach for improving early dementia identification with high precision, hence facilitating the differentiation among NC individuals, VasD patients, and MCogImp patients. Full article
(This article belongs to the Special Issue Machine Learning in Medical Signal and Image Processing (3rd Edition))
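
A minimal sketch of the pipeline this abstract describes, assuming a Keras MobileNetV2 whose global-average-pooling output feeds a scikit-learn decision tree (image and label arrays are placeholders):

```python
# Sketch: MobileNetV2 global-average-pooling features feeding a decision tree.
import numpy as np
import tensorflow as tf
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# include_top=False with pooling='avg' exposes the 2D global-average-pooling output (1280-D).
backbone = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):  # images: (N, 224, 224, 3) EEG spectrogram images, values in [0, 255]
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images)
    return backbone.predict(x, verbose=0)

# Placeholder data: spectrogram images and severity labels (NC / VasD / MCogImp).
images = (np.random.rand(60, 224, 224, 3) * 255.0).astype("float32")
labels = np.random.randint(0, 3, size=60)

feats = extract_features(images)
clf = DecisionTreeClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, feats, labels, cv=5).mean())
```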

22 pages, 4583 KB  
Article
MemGanomaly: Memory-Augmented Ganomaly for Frost- and Heat-Damaged Crop Detection
by Jun Park, Sung-Wook Park, Yong-Seok Kim, Se-Hoon Jung and Chun-Bo Sim
Appl. Sci. 2025, 15(19), 10503; https://doi.org/10.3390/app151910503 - 28 Sep 2025
Viewed by 135
Abstract
Climate change poses significant challenges to agriculture, leading to increased crop damage owing to extreme weather conditions. Detecting and analyzing such damage is crucial for mitigating its effects on crop yield. This study proposes a novel autoencoder (AE)-based model, termed “Memory Ganomaly,” designed to detect and analyze weather-induced crop damage under conditions of significant class imbalance. The model integrates memory modules into the Ganomaly architecture, thereby enhancing its ability to identify anomalies by focusing on normal (undamaged) states. The proposed model was evaluated using apple and peach datasets, which included both damaged and undamaged images, and was compared with existing robust Convolutional neural network (CNN) models (ResNet-50, EfficientNet-B3, and ResNeXt-50) and AE models (Ganomaly and MemAE). Although these CNN models are not the latest technologies, they are still highly effective for image classification tasks and are deemed suitable for comparative analyses. The results showed that CNN and Transformer baselines achieved very high overall accuracy (94–98%) but completely failed to identify damaged samples, with precision and recall equal to zero under severe class imbalance. Few-shot learning partially alleviated this issue (up to 75.1% recall in the 20-shot setting for the apple dataset) but still lagged behind AE-based approaches in terms of accuracy and precision. In contrast, the proposed Memory Ganomaly delivered a more balanced performance across accuracy, precision, and recall (Apple: 80.32% accuracy, 79.4% precision, 79.1% recall; Peach: 81.06% accuracy, 83.23% precision, 80.3% recall), outperforming AE baselines in precision and recall while maintaining comparable accuracy. This study concludes that the Memory Ganomaly model offers a robust solution for detecting anomalies in agricultural datasets, where data imbalance is prevalent, and suggests its potential for broader applications in agricultural monitoring and beyond. While both Ganomaly and MemAE have shown promise in anomaly detection, they suffer from limitations—Ganomaly often lacks long-term pattern recall, and MemAE may miss contextual cues. Our proposed Memory Ganomaly integrates the strengths of both, leveraging contextual reconstruction with pattern recall to enhance detection of subtle weather-related anomalies under class imbalance. Full article
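
The memory-augmentation idea can be illustrated with a MemAE-style addressing layer; this is a generic sketch under assumed dimensions, not the authors' exact module:

```python
# Sketch of a MemAE-style memory module: latent codes are reconstructed as
# attention-weighted combinations of learned "normal" prototypes, so anomalous
# (damaged-crop) inputs reconstruct poorly and yield large errors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    def __init__(self, mem_size=100, feat_dim=256, shrink=0.0025):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_size, feat_dim))  # prototype slots
        self.shrink = shrink  # hard-shrinkage threshold promotes sparse addressing

    def forward(self, z):                      # z: (B, feat_dim) latent from the encoder
        attn = F.softmax(F.linear(F.normalize(z, dim=1),
                                  F.normalize(self.memory, dim=1)), dim=1)  # cosine attention
        attn = F.relu(attn - self.shrink)      # hard shrinkage
        attn = F.normalize(attn, p=1, dim=1)   # re-normalize to a distribution
        return attn @ self.memory              # reconstructed latent fed to the decoder

# Usage: z_hat = MemoryModule()(encoder(x)); loss = mse(decoder(z_hat), x)
```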

12 pages, 4847 KB  
Article
Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features
by Manish Kansana, Elias Hossain, Shahram Rahimi and Noorbakhsh Amiri Golilarz
Information 2025, 16(10), 839; https://doi.org/10.3390/info16100839 - 27 Sep 2025
Viewed by 280
Abstract
Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and Principal Component Analysis (PCA)-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference than the other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and a Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.7271 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest that Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition. The results also underscore the effectiveness of integrating feature learning, cross-modal attention, and transformer-based fusion in capturing the complementary strengths of tactile and visual modalities. Full article
(This article belongs to the Special Issue AI-Based Image Processing and Computer Vision)
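
A minimal sketch of the cross-modal attention idea, assuming tactile tokens attending to PCA-reduced ResNet-50 embeddings; the dimensions, token counts, and class count are illustrative assumptions, not the paper's configuration:

```python
# Sketch of cross-modal attention: tactile tokens attend to PCA-reduced
# visual embeddings (the mirrored direction could be added symmetrically).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tactile, vision):          # (B, Nt, dim), (B, Nv, dim)
        fused, _ = self.attn(query=tactile, key=vision, value=vision)
        x = self.norm(tactile + fused)           # residual connection
        return self.norm(x + self.ffn(x))

# Hypothetical inputs: 8 tactile feature tokens and 4 PCA-reduced vision tokens per sample.
tactile = torch.randn(2, 8, 128)
vision = torch.randn(2, 4, 128)
logits = nn.Linear(128, 10)(CrossModalBlock()(tactile, vision).mean(dim=1))  # 10 surface classes
```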

19 pages, 1027 KB  
Article
A Convolutional-Transformer Residual Network for Channel Estimation in Intelligent Reflective Surface Aided MIMO Systems
by Qingying Wu, Junqi Bao, Hui Xu, Benjamin K. Ng, Chan-Tong Lam and Sio-Kei Im
Sensors 2025, 25(19), 5959; https://doi.org/10.3390/s25195959 - 25 Sep 2025
Viewed by 395
Abstract
Intelligent Reflective Surface (IRS)-aided Multiple-Input Multiple-Output (MIMO) systems have emerged as a promising solution to enhance spectral and energy efficiency in future wireless communications. However, accurate channel estimation remains a key challenge due to the passive nature and high dimensionality of IRS channels. This paper proposes a lightweight hybrid framework for cascaded channel estimation by combining a physics-based Bilinear Alternating Least Squares (BALS) algorithm with a deep neural network named ConvTrans-ResNet. The network integrates convolutional embeddings and Transformer modules within a residual learning architecture to exploit both local and global spatial features effectively while ensuring training stability. A series of ablation studies is conducted to optimize architectural components, resulting in a compact configuration with low parameter count and computational complexity. Extensive simulations demonstrate that the proposed method significantly outperforms state-of-the-art neural models such as HA02, ReEsNet, and InterpResNet across a wide range of SNR levels and IRS element sizes in terms of the Normalized Mean Squared Error (NMSE). Compared to existing solutions, our method achieves better estimation accuracy with improved efficiency, making it suitable for practical deployment in IRS-aided systems. Full article
(This article belongs to the Section Communications)
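
A rough sketch of the kind of hybrid block the abstract names (convolutional embedding, Transformer modules, residual learning), applied to a coarse channel estimate; all layer sizes are illustrative assumptions rather than the paper's architecture:

```python
# Sketch of a convolutional-Transformer residual refinement block for a
# coarse cascaded-channel estimate treated as a 2-channel (real/imag) image.
import torch
import torch.nn as nn

class ConvTransResBlock(nn.Module):
    def __init__(self, ch=2, embed=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Conv2d(ch, embed, kernel_size=3, padding=1)       # local features
        enc = nn.TransformerEncoderLayer(d_model=embed, nhead=heads,
                                         dim_feedforward=4 * embed, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=layers)  # global context
        self.project = nn.Conv2d(embed, ch, kernel_size=3, padding=1)

    def forward(self, h_coarse):                       # (B, 2, N_rx, N_irs)
        x = self.embed(h_coarse)
        b, c, hgt, wid = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens = self.transformer(tokens)
        x = tokens.transpose(1, 2).reshape(b, c, hgt, wid)
        return h_coarse + self.project(x)              # residual: refine the coarse estimate

refined = ConvTransResBlock()(torch.randn(4, 2, 16, 32))
```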

17 pages, 2844 KB  
Article
HazChemNet: A Deep Learning Model for Hazardous Chemical Prediction
by Nan Zhang, Hexiang Qiu, Hongxia Cai, Zhiru Li, Yutong Li, Zinan Li, Lijuan Qi, Hongju Du, Yan Pan, Haiming Jing, Junyu Ning, Bo Xian and Shan Gao
Int. J. Mol. Sci. 2025, 26(19), 9288; https://doi.org/10.3390/ijms26199288 - 23 Sep 2025
Viewed by 339
Abstract
The identification of hazardous chemicals is critical for mitigating environmental and health risks, yet existing methods often lack efficiency and accuracy. This study presents HazChemNet, a deep learning model integrating attention-based autoencoders and mixture-of-experts architectures, designed to predict chemical hazardousness from molecular structures. The study utilized a dataset of 2428 hazardous compounds from China’s 2015 hazardous chemical list. Features were derived from molecular fingerprints and physicochemical descriptors, with external validation on 52 unseen chemicals achieving 92.3% accuracy for hazardous and 84.6% for non-hazardous classifications. Experimental validation using C. elegans assays confirmed model predictions for critical compounds. Ablation studies confirmed hydrogen bonding features as pivotal predictors, alongside molecular fingerprints. This work bridges the gap between AI-driven innovation and chemical safety, offering a transformative tool for sustainable industrial practices and proactive risk management in a rapidly evolving global landscape. Full article
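
As a worked illustration of the mixture-of-experts component mentioned above, here is a generic gated-experts classification head over fingerprint/descriptor vectors; the feature size, expert count, and hidden width are assumptions:

```python
# Sketch of a mixture-of-experts head over molecular fingerprint + descriptor
# features: a gating network soft-selects among small expert MLPs.
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    def __init__(self, in_dim=2048, hidden=256, n_experts=4, n_classes=2):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
            for _ in range(n_experts))

    def forward(self, x):                              # x: (B, in_dim) feature vectors
        weights = torch.softmax(self.gate(x), dim=-1)  # (B, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n_experts, n_classes)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # gated combination

logits = MoEHead()(torch.randn(8, 2048))  # 8 molecules -> hazardous / non-hazardous logits
```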

20 pages, 4847 KB  
Article
Deep Learning-Based Approach to Automated Monitoring of Defects and Soiling on Solar Panels
by Ahmed Hamdi, Hassan N. Noura and Joseph Azar
Future Internet 2025, 17(10), 433; https://doi.org/10.3390/fi17100433 - 23 Sep 2025
Viewed by 535
Abstract
The reliable operation of photovoltaic (PV) systems is often compromised by surface soiling and structural damage, which reduce energy efficiency and complicate large-scale monitoring. To address this challenge, we propose a two-tiered image-classification framework that combines Vision Transformer (ViT) models, lightweight convolutional neural networks (CNNs), and knowledge distillation (KD). In Tier 1, a DINOv2 ViT-Base model is fine-tuned to provide robust high-level categorization of solar-panel images into three classes: Normal, Soiled, and Damaged. In Tier 2, two enhanced EfficientNetB0 models are introduced: (i) a KD-based student model distilled from a DINOv2 ViT-S/14 teacher, which improves accuracy from 96.7% to 98.67% for damage classification and from 90.7% to 92.38% for soiling classification, and (ii) an EfficientNetB0 augmented with Multi-Head Self-Attention (MHSA), which achieves 98.73% accuracy for damage and 93.33% accuracy for soiling. These results demonstrate that integrating transformer-based representations with compact CNN architectures yields a scalable and efficient solution for automated monitoring of the condition of PV systems, offering high accuracy and real-time applicability in inspections on solar farms. Full article
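
A minimal sketch of the knowledge-distillation objective used for the student model; the temperature and loss weighting are typical values, not the paper's, and model loading is omitted:

```python
# Sketch of the knowledge-distillation loss: the student matches the teacher's
# softened class distribution in addition to the hard labels.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Training step (teacher frozen, e.g. a DINOv2 ViT head; student an EfficientNetB0):
#   with torch.no_grad(): t_logits = teacher(images)
#   loss = kd_loss(student(images), t_logits, labels)
```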

29 pages, 9358 KB  
Article
Deep Ensemble Learning and Explainable AI for Multi-Class Classification of Earthstar Fungal Species
by Eda Kumru, Aras Fahrettin Korkmaz, Fatih Ekinci, Abdullah Aydoğan, Mehmet Serdar Güzel and Ilgaz Akata
Biology 2025, 14(10), 1313; https://doi.org/10.3390/biology14101313 - 23 Sep 2025
Viewed by 382
Abstract
The current study presents a multi-class, image-based classification of eight morphologically similar macroscopic Earthstar fungal species (Astraeus hygrometricus, Geastrum coronatum, G. elegans, G. fimbriatum, G. quadrifidum, G. rufescens, G. triplex, and Myriostoma coliforme) using deep learning and explainable artificial intelligence (XAI) techniques. For the first time in the literature, these species are evaluated together, providing a highly challenging dataset due to significant visual overlap. Eight different convolutional neural network (CNN) and transformer-based architectures were employed, including EfficientNetV2-M, DenseNet121, MaxViT-S, DeiT, RegNetY-8GF, MobileNetV3, EfficientNet-B3, and MnasNet. The accuracy scores of these models ranged from 86.16% to 96.23%, with EfficientNet-B3 achieving the best individual performance. To enhance interpretability, Grad-CAM and Score-CAM methods were utilised to visualise the rationale behind each classification decision. A key novelty of this study is the design of two hybrid ensemble models: EfficientNet-B3 + DeiT and DenseNet121 + MaxViT-S. These ensembles further improved classification stability, reaching 93.71% and 93.08% accuracy, respectively. Based on metric-based evaluation, the EfficientNet-B3 + DeiT model delivered the most balanced performance, with 93.83% precision, 93.72% recall, 93.73% F1-score, 99.10% specificity, a log loss of 0.2292, and an MCC of 0.9282. Moreover, this modeling approach holds potential for monitoring symbiotic fungal species in agricultural ecosystems and supporting sustainable production strategies. This research contributes to the literature by introducing a novel framework that simultaneously emphasises classification accuracy and model interpretability in fungal taxonomy. The proposed method successfully classified morphologically similar puffball species with high accuracy, while explainable AI techniques revealed biologically meaningful insights. All evaluation metrics were computed exclusively on a 10% independent test set that was entirely separate from the training and validation phases. Future work will focus on expanding the dataset with samples from diverse ecological regions and testing the method under field conditions. Full article
(This article belongs to the Section Bioinformatics)
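
A minimal sketch of the two-model ensembling idea (e.g., EfficientNet-B3 + DeiT) via soft voting over softmax probabilities; the timm model names are assumptions used for illustration:

```python
# Sketch of a two-model soft-voting ensemble: average the softmax
# probabilities of a CNN and a transformer classifier.
import torch
import timm

n_classes = 8  # eight Earthstar species
cnn = timm.create_model("efficientnet_b3", pretrained=True, num_classes=n_classes).eval()
vit = timm.create_model("deit_small_patch16_224", pretrained=True, num_classes=n_classes).eval()

@torch.no_grad()
def ensemble_predict(images):                 # images: (B, 3, 224, 224), already normalized
    p = torch.softmax(cnn(images), dim=1) + torch.softmax(vit(images), dim=1)
    return (p / 2).argmax(dim=1)

preds = ensemble_predict(torch.randn(4, 3, 224, 224))
```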

18 pages, 1694 KB  
Article
FAIR-Net: A Fuzzy Autoencoder and Interpretable Rule-Based Network for Ancient Chinese Character Recognition
by Yanling Ge, Yunmeng Zhang and Seok-Beom Roh
Sensors 2025, 25(18), 5928; https://doi.org/10.3390/s25185928 - 22 Sep 2025
Viewed by 312
Abstract
Ancient Chinese scripts—including oracle bone carvings, bronze inscriptions, stone steles, Dunhuang scrolls, and bamboo slips—are rich in historical value but often degraded due to centuries of erosion, damage, and stylistic variability. These issues severely hinder manual transcription and render conventional OCR techniques inadequate, as they are typically trained on modern printed or handwritten text and lack interpretability. To tackle these challenges, we propose FAIR-Net, a hybrid architecture that combines the unsupervised feature learning capacity of a deep autoencoder with the semantic transparency of a fuzzy rule-based classifier. In FAIR-Net, the deep autoencoder first compresses high-resolution character images into low-dimensional, noise-robust embeddings. These embeddings are then passed into a Fuzzy Neural Network (FNN), whose hidden layer leverages Fuzzy C-Means (FCM) clustering to model soft membership degrees and generate human-readable fuzzy rules. The output layer uses Iteratively Reweighted Least Squares Estimation (IRLSE) combined with a Softmax function to produce probabilistic predictions, with all weights constrained as linear mappings to maintain model transparency. We evaluate FAIR-Net on CASIA-HWDB1.0, HWDB1.1, and ICDAR 2013 CompetitionDB, where it achieves a recognition accuracy of 97.91%, significantly outperforming baseline CNNs (p < 0.01, Cohen’s d > 0.8) while maintaining the tightest confidence interval (96.88–98.94%) and lowest standard deviation (±1.03%). Additionally, FAIR-Net reduces inference time to 25 s, improving processing efficiency by 41.9% over AlexNet and up to 98.9% over CNN-Fujitsu, while preserving >97.5% accuracy across evaluations. To further assess generalization to historical scripts, FAIR-Net was tested on the Ancient Chinese Character Dataset (9233 classes; 979,907 images), achieving 83.25% accuracy—slightly higher than ResNet101 but 2.49% lower than SwinT-v2-small—while reducing training time by over 5.5× compared to transformer-based baselines. Fuzzy rule visualization confirms enhanced robustness to glyph ambiguities and erosion. Overall, FAIR-Net provides a practical, interpretable, and highly efficient solution for the digitization and preservation of ancient Chinese character corpora. Full article
(This article belongs to the Section Sensing and Imaging)
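
The soft-membership computation that FCM-based hidden layers rely on can be written directly from the standard fuzzy C-means membership formula; this sketch is not the authors' code, and the embedding/center arrays are placeholders:

```python
# Sketch of fuzzy C-means soft memberships: each autoencoder embedding gets a
# membership degree to every cluster centre (fuzzifier m > 1), the quantity a
# fuzzy rule layer uses to fire its rules.
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-8):
    # X: (N, D) embeddings, centers: (C, D) cluster centres
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + eps  # (N, C)
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))            # (N, C, C)
    return 1.0 / ratio.sum(axis=-1)                                          # rows sum to 1

U = fcm_memberships(np.random.rand(5, 16), np.random.rand(3, 16))
print(U.sum(axis=1))   # ~1.0 for every sample
```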

24 pages, 6470 KB  
Article
A Method for Improving the Efficiency and Effectiveness of Automatic Image Analysis of Water Pipes
by Qiuping Wang, Lei Lu, Shuguang Liu, Qunfang Hu, Guihui Zhong, Zhan Su and Shengxin Xu
Water 2025, 17(18), 2781; https://doi.org/10.3390/w17182781 - 20 Sep 2025
Viewed by 456
Abstract
The integrity of urban water supply pipelines, an essential element of municipal infrastructure, is frequently undermined by internal defects such as corrosion, tuberculation, and foreign matter. Traditional inspection methods relying on CCTV are time-consuming, labor-intensive, and prone to subjective interpretation, which hinders the timely and accurate assessment of pipeline conditions. This study proposes YOLOv8-VSW, a systematically optimized and lightweight model based on YOLOv8 for automated defect detection in in-service pipelines. The framework is twofold: First, to overcome data limitations, a specialized defect dataset was constructed and augmented using photometric transformation, affine transformation, and noise injection. Second, the model architecture was improved on three levels: a VanillaNet backbone was adopted for lightweighting, a C2f-Star module was introduced to enhance multi-scale feature fusion, and the WIoUv3 dynamic loss function was employed to improve robustness under complex imaging conditions. Experimental results demonstrate the superior performance of the proposed YOLOv8-VSW model. This study validates the framework on a curated, real-world image dataset, where YOLOv8-VSW achieved mAP@50 of 83.5%, a 4.0% improvement over the baseline. Concurrently, GFLOPs were reduced by approximately 38.9%, while the inference speed was increased to 603.8 FPS. The findings validate the effectiveness of the proposed method, delivering a solution that effectively balances detection accuracy, computational efficiency, and model size. The results establish a strong technical basis for the intelligent and automated control of safety in urban water supply systems. Full article
(This article belongs to the Section Urban Water Management)
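
A minimal sketch of the three augmentation families named above (photometric transformation, affine transformation, noise injection), shown here with torchvision as an illustration; the parameter values are assumptions, not the paper's settings:

```python
# Sketch of dataset augmentation: photometric, affine, and noise-injection transforms.
import torch
from torchvision import transforms

def add_gaussian_noise(img, sigma=0.03):       # img: float tensor in [0, 1]
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),          # photometric
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),   # affine
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),                                           # noise injection
])

# augmented = augment(pil_image)  # applied to each CCTV frame before training
```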

18 pages, 3374 KB  
Article
Evaluation of Apical Closure in Panoramic Radiographs Using Vision Transformer Architectures: ViT-Based Apical Closure Classification
by Sümeyye Coşgun Baybars, Merve Daldal, Merve Parlak Baydoğan and Seda Arslan Tuncer
Diagnostics 2025, 15(18), 2350; https://doi.org/10.3390/diagnostics15182350 - 16 Sep 2025
Viewed by 386
Abstract
Objective: To evaluate the performance of vision transformer (ViT)-based deep learning models in the classification of open apex on panoramic radiographs (orthopantomograms (OPGs)) and compare their diagnostic accuracy with conventional convolutional neural network (CNN) architectures. Materials and Methods: OPGs were retrospectively collected and labeled by two observers based on apex closure status. Two ViT models (Base Patch16 and Patch32) and three CNN models (ResNet50, VGG19, and EfficientNetB0) were evaluated using eight classifiers (support vector machine (SVM), random forest (RF), XGBoost, logistic regression (LR), K-nearest neighbors (KNN), naïve Bayes (NB), decision tree (DT), and multi-layer perceptron (MLP)). Performance metrics (accuracy, precision, recall, F1 score, and area under the curve (AUC)) were computed. Results: ViT Base Patch16 384 with MLP achieved the highest accuracy (0.8462 ± 0.0330) and AUC (0.914 ± 0.032). Although CNN models like EfficientNetB0 + MLP performed competitively (0.8334 ± 0.0479 accuracy), ViT models demonstrated more balanced and robust performance. Conclusions: ViT models outperformed CNNs in classifying open apex, suggesting their integration into dental radiologic decision support systems. Future studies should focus on multi-center and multimodal data to improve generalizability. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
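
A minimal sketch of one of the ViT + classifier combinations evaluated above: frozen ViT-Base/16 (384) embeddings classified by an MLP; the timm model name and the placeholder arrays are assumptions for illustration:

```python
# Sketch: frozen ViT-Base/16 (384) embeddings classified by a scikit-learn MLP.
import torch
import timm
from sklearn.neural_network import MLPClassifier

vit = timm.create_model("vit_base_patch16_384", pretrained=True, num_classes=0).eval()  # 768-D features

@torch.no_grad()
def embed(images):                       # images: (N, 3, 384, 384), ImageNet-normalized OPG crops
    return vit(images).cpu().numpy()

X_train = embed(torch.randn(16, 3, 384, 384))   # placeholder radiograph batches
X_test = embed(torch.randn(4, 3, 384, 384))
y_train = [0, 1] * 8                            # 0 = closed apex, 1 = open apex

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500).fit(X_train, y_train)
print(clf.predict(X_test))
```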

37 pages, 6540 KB  
Article
Intelligent Systems for Autonomous Mining Operations: Real-Time Robust Road Segmentation
by Claudio Urrea and Maximiliano Vélez
Systems 2025, 13(9), 801; https://doi.org/10.3390/systems13090801 - 13 Sep 2025
Viewed by 571
Abstract
Intelligent autonomous systems in open-pit mining operations face critical challenges in perception and decision-making due to sensor-based visual degradations, particularly lens soiling and sun glare, which significantly compromise the performance and safety of integrated mining automation systems. We propose a comprehensive intelligent framework leveraging single-domain generalization with traditional data augmentation techniques, specifically Photometric Distortion (PD) and Contrast Limited Adaptive Histogram Equalization (CLAHE), integrated within the BiSeNetV1 architecture. Our systematic approach evaluated four state-of-the-art backbones: ResNet-50, MobileNetV2 (Convolutional Neural Networks (CNN)-based), SegFormer-B0, and Twins-PCPVT-S (ViT-based) within an end-to-end autonomous system architecture. The model was trained on clean images from the AutoMine dataset and tested on degraded visual conditions without requiring architectural modifications or additional training data from target domains. ResNet-50 demonstrated superior system robustness with mean Intersection over Union (IoU) of 84.58% for lens soiling and 80.11% for sun glare scenarios, while MobileNetV2 achieved optimal computational efficiency for real-time autonomous systems with 55.0 Frames Per Second (FPS) inference speed while maintaining competitive accuracy (81.54% and 71.65% mIoU respectively). Vision Transformers showed superior stability in system performance but lower overall performance under severe degradations. The proposed intelligent augmentation-based approach maintains high accuracy while preserving real-time computational efficiency, making it suitable for deployment in autonomous mining vehicle systems. Traditional augmentation approaches achieved approximately 30% superior performance compared to advanced GAN-based domain generalization methods, providing a practical solution for robust perception systems without requiring expensive multi-domain training datasets. Full article
(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)
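
A minimal sketch of the CLAHE augmentation step applied to the luminance channel of each frame, using OpenCV; the clip limit and tile size are common defaults rather than the paper's values:

```python
# Sketch of the CLAHE augmentation: contrast-limited adaptive histogram
# equalization on the luminance channel of a BGR frame.
import cv2

def clahe_augment(bgr, clip_limit=2.0, tile=(8, 8)):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

# frame = cv2.imread("mine_road.png"); augmented = clahe_augment(frame)
```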

25 pages, 4660 KB  
Article
Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
by Sanghun Jeon, Jieun Lee and Yong-Ju Lee
AI 2025, 6(9), 222; https://doi.org/10.3390/ai6090222 - 9 Sep 2025
Viewed by 1060
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and Conformer designed to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation of a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusing phonemes, such as diphthongs (/ai/, /au/) and labio-dental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, surpassing those of CNN-based architectures by more than 6%. Although the model’s large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings. Full article
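
As a worked example of the character error rate (CER) metric reported above, a plain edit-distance implementation looks like this (generic metric code, not tied to the paper's toolkit):

```python
# Character error rate (CER): Levenshtein distance between hypothesis and
# reference characters, normalized by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / max(len(r), 1)

print(cer("visual speech", "visual speach"))  # one substitution -> ~0.077
```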

26 pages, 6612 KB  
Article
A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis
by Leonardo Scabini, Andre Sacilotti, Kallil M. Zielinski, Lucas C. Ribas, Bernard De Baets and Odemir M. Bruno
J. Imaging 2025, 11(9), 304; https://doi.org/10.3390/jimaging11090304 - 5 Sep 2025
Cited by 1 | Viewed by 947
Abstract
Texture, a significant visual attribute in images, plays an important role in many pattern recognition tasks. While Convolutional Neural Networks (CNNs) have been among the most effective methods for texture analysis, alternative architectures such as Vision Transformers (ViTs) have recently demonstrated superior performance on a range of visual recognition problems. However, the suitability of ViTs for texture recognition remains underexplored. In this work, we investigate the capabilities and limitations of ViTs for texture recognition by analyzing 25 different ViT variants as feature extractors and comparing them to CNN-based and hand-engineered approaches. Our evaluation encompasses both accuracy and efficiency, aiming to assess the trade-offs involved in applying ViTs to texture analysis. Our results indicate that ViTs generally outperform CNN-based and hand-engineered models, particularly when using strong pre-training and in-the-wild texture datasets. Notably, BeiTv2-B/16 achieves the highest average accuracy (85.7%), followed by ViT-B/16-DINO (84.1%) and Swin-B (80.8%), outperforming the ResNet50 baseline (75.5%) and the hand-engineered baseline (73.4%). As a lightweight alternative, EfficientFormer-L3 attains a competitive average accuracy of 78.9%. In terms of efficiency, although ViT-B and BeiT(v2) have a higher number of GFLOPs and parameters, they achieve significantly faster feature extraction on GPUs compared to ResNet50. These findings highlight the potential of ViTs as a powerful tool for texture analysis while also pointing to areas for future exploration, such as efficiency improvements and domain-specific adaptations. Full article
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)

35 pages, 6026 KB  
Article
A Comparative Analysis of the Mamba, Transformer, and CNN Architectures for Multi-Label Chest X-Ray Anomaly Detection in the NIH ChestX-Ray14 Dataset
by Erdem Yanar, Furkan Kutan, Kubilay Ayturan, Uğurhan Kutbay, Oktay Algın, Fırat Hardalaç and Ahmet Muhteşem Ağıldere
Diagnostics 2025, 15(17), 2215; https://doi.org/10.3390/diagnostics15172215 - 1 Sep 2025
Viewed by 939
Abstract
Background/Objectives: Recent state-of-the-art advances in deep learning have significantly improved diagnostic accuracy in medical imaging, particularly in chest radiograph (CXR) analysis. Motivated by these developments, a comprehensive comparison was conducted to investigate how architectural choices affect performance of 14 deep learning models across Convolutional Neural Networks (CNNs), Transformer-based models, and Mamba-based State Space Models. Methods: These models were trained and evaluated under identical conditions on the NIH ChestX-ray14 dataset, a large-scale and widely used benchmark comprising 112,120 labeled CXR images with 14 thoracic disease categories. Results: It was found that recent hybrid architectures—particularly ConvFormer, CaFormer, and EfficientNet—deliver superior performance in both common and rare pathologies. ConvFormer achieved the highest mean AUROC of 0.841 when averaged across all 14 thoracic disease classes, closely followed by EfficientNet and CaFormer. Notably, AUROC scores of 0.94 for hernia, 0.91 for cardiomegaly, and 0.88 for edema and effusion were achieved by the proposed models, surpassing previously reported benchmarks. Conclusions: These results not only highlight the continued strength of CNNs but also demonstrate the growing potential of Transformer-based architectures in medical image analysis. This work contributes to the literature by providing a unified, state-of-the-art benchmarking of diverse deep learning models, offering valuable guidance for researchers and practitioners developing clinically robust AI systems for radiology. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
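
A minimal sketch of the per-class AUROC evaluation this comparison is reported in (mean AUROC over the 14 findings); the label names and probability arrays are placeholders:

```python
# Sketch of per-class AUROC for multi-label chest X-ray predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

classes = ["Cardiomegaly", "Edema", "Effusion", "Hernia"]       # subset of the 14 labels
y_true = np.random.randint(0, 2, size=(100, len(classes)))       # placeholder ground truth
y_prob = np.random.rand(100, len(classes))                       # placeholder sigmoid outputs

per_class = {c: roc_auc_score(y_true[:, i], y_prob[:, i]) for i, c in enumerate(classes)}
print(per_class, "mean AUROC:", np.mean(list(per_class.values())))
```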