Search Results (1,288)

Search Parameters:
Keywords = multimodality fusion

17 pages, 3612 KiB  
Article
MPVT: An Efficient Multi-Modal Prompt Vision Tracker for Visual Target Tracking
by Jianyu Xie, Yan Fu, Junlin Zhou, Tianxiang He, Xiaopeng Wang, Yuke Fang and Duanbing Chen
Appl. Sci. 2025, 15(14), 7967; https://doi.org/10.3390/app15147967 - 17 Jul 2025
Abstract
Visual target tracking is a fundamental task in computer vision. Combining multi-modal information with tracking leverages complementary cues, which improves the precision and robustness of trackers. Traditional multi-modal tracking methods typically employ a full fine-tuning scheme, i.e., fine-tuning pre-trained single-modal models on multi-modal tasks. However, this approach suffers from low transfer-learning efficiency, catastrophic forgetting, and high cross-task deployment costs. To address these issues, we propose the multi-modal prompt vision tracker (MPVT), an efficient model based on a prompt-tuning paradigm. The model comprises three key components: a decoupled input enhancement module, a dynamic adaptive prompt fusion module, and a fully connected head network module. The decoupled input enhancement module enhances input representations via positional and type embeddings. The dynamic adaptive prompt fusion module achieves efficient prompt tuning and multi-modal interaction using scaled convolution and low-rank cross-modal attention mechanisms. The fully connected head network module avoids the shortcomings of traditional convolutional head networks, such as their inductive biases. Experimental results on RGB-T, RGB-D, and RGB-E scenarios show that MPVT outperforms state-of-the-art methods. Moreover, MPVT reduces GPU memory usage by 43.8% and training time by 62.9% compared with a full-parameter fine-tuned model. Full article
(This article belongs to the Special Issue Advanced Technologies Applied for Object Detection and Tracking)
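
As a rough illustration of the low-rank cross-modal attention used for prompt fusion described above, the sketch below lets RGB patch tokens attend to auxiliary-modality tokens through low-rank projections and injects the result as a residual prompt. All dimensions, layer names, and the learnable gate are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a low-rank cross-modal attention block in the spirit of
# MPVT's "dynamic adaptive prompt fusion". Names and sizes are assumptions.
import torch
import torch.nn as nn

class LowRankCrossModalPrompt(nn.Module):
    def __init__(self, dim=768, rank=32, num_heads=4):
        super().__init__()
        # Low-rank down/up projections keep the number of tunable parameters small,
        # which is the point of prompt tuning versus full fine-tuning.
        self.down_q = nn.Linear(dim, rank)
        self.down_kv = nn.Linear(dim, rank)
        self.attn = nn.MultiheadAttention(rank, num_heads, batch_first=True)
        self.up = nn.Linear(rank, dim)
        self.scale = nn.Parameter(torch.tensor(0.1))  # learnable gate on the prompt (assumption)

    def forward(self, rgb_tokens, aux_tokens):
        # rgb_tokens: (B, N, dim) from the RGB backbone
        # aux_tokens: (B, M, dim) from the auxiliary modality (thermal/depth/event)
        q = self.down_q(rgb_tokens)
        kv = self.down_kv(aux_tokens)
        prompt, _ = self.attn(q, kv, kv)                   # RGB queries attend to auxiliary keys/values
        return rgb_tokens + self.scale * self.up(prompt)   # inject the prompt residually

x = torch.randn(2, 196, 768)   # RGB patch tokens
z = torch.randn(2, 196, 768)   # e.g., thermal patch tokens
print(LowRankCrossModalPrompt()(x, z).shape)  # torch.Size([2, 196, 768])
```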

27 pages, 6371 KiB  
Article
Growth Stages Discrimination of Multi-Cultivar Navel Oranges Using the Fusion of Near-Infrared Hyperspectral Imaging and Machine Vision with Deep Learning
by Chunyan Zhao, Zhong Ren, Yue Li, Jia Zhang and Weinan Shi
Agriculture 2025, 15(14), 1530; https://doi.org/10.3390/agriculture15141530 - 15 Jul 2025
Viewed by 30
Abstract
To noninvasively and precisely discriminate among the growth stages of multiple cultivars of navel oranges simultaneously, near-infrared (NIR) hyperspectral imaging (HSI) is fused with machine vision (MV) and deep learning. NIR reflectance spectra and hyperspectral and RGB images are collected for 740 Gannan navel oranges of five cultivars. Based on preprocessed spectra, optimally selected hyperspectral images, and registered RGB images, a dual-branch multi-modal feature fusion convolutional neural network (CNN) model is established. In this model, a spectral branch is designed to extract spectral features reflecting internal compositional variations, while the image branch extracts external color and texture features from the integration of hyperspectral and RGB images. Finally, growth stages are determined via the fusion of these features. To validate the effectiveness of the proposed method, various machine-learning and deep-learning models are compared on single-modal and multi-modal data. The results demonstrate that multi-modal feature fusion of HSI and MV combined with the constructed dual-branch CNN yields excellent growth-stage discrimination in navel oranges, achieving an accuracy, recall, precision, F1 score, and kappa coefficient on the testing set of 95.95%, 96.66%, 96.76%, 96.69%, and 0.9481, respectively, providing an effective way to precisely monitor the growth stages of fruits. Full article
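
A minimal sketch of the dual-branch idea described in this abstract: one 1D-convolutional branch for NIR spectra, one 2D-convolutional branch for images, with feature-level concatenation before a growth-stage classifier. Input sizes, channel counts, and the five-class output are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DualBranchFusionCNN(nn.Module):
    def __init__(self, n_bands=256, n_stages=5):
        super().__init__()
        # Spectral branch: 1D convolutions over the NIR reflectance spectrum.
        self.spec = nn.Sequential(
            nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten(), nn.Linear(16 * 32, 128))
        # Image branch: 2D convolutions over selected HSI bands / registered RGB images.
        self.img = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, 128))
        self.head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, n_stages))

    def forward(self, spectrum, image):
        # Feature-level fusion: concatenate spectral and image embeddings.
        f = torch.cat([self.spec(spectrum), self.img(image)], dim=1)
        return self.head(f)

model = DualBranchFusionCNN()
logits = model(torch.randn(4, 1, 256), torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 5])
```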

30 pages, 2023 KiB  
Review
Fusion of Computer Vision and AI in Collaborative Robotics: A Review and Future Prospects
by Yuval Cohen, Amir Biton and Shraga Shoval
Appl. Sci. 2025, 15(14), 7905; https://doi.org/10.3390/app15147905 - 15 Jul 2025
Viewed by 56
Abstract
The integration of advanced computer vision and artificial intelligence (AI) techniques into collaborative robotic systems holds the potential to revolutionize human–robot interaction, productivity, and safety. Despite substantial research activity, a systematic synthesis of how vision and AI are jointly enabling context-aware, adaptive cobot capabilities across perception, planning, and decision-making remains lacking (especially in recent years). Addressing this gap, our review unifies the latest advances in visual recognition, deep learning, and semantic mapping within a structured taxonomy tailored to collaborative robotics. We examine foundational technologies such as object detection, human pose estimation, and environmental modeling, as well as emerging trends including multimodal sensor fusion, explainable AI, and ethically guided autonomy. Unlike prior surveys that focus narrowly on either vision or AI, this review uniquely analyzes their integrated use for real-world human–robot collaboration. Highlighting industrial and service applications, we distill the best practices, identify critical challenges, and present key performance metrics to guide future research. We conclude by proposing strategic directions—from scalable training methods to interoperability standards—to foster safe, robust, and proactive human–robot partnerships in the years ahead. Full article

35 pages, 1458 KiB  
Article
User Comment-Guided Cross-Modal Attention for Interpretable Multimodal Fake News Detection
by Zepu Yi, Chenxu Tang and Songfeng Lu
Appl. Sci. 2025, 15(14), 7904; https://doi.org/10.3390/app15147904 - 15 Jul 2025
Viewed by 50
Abstract
In order to address the pressing challenge posed by the proliferation of fake news in the digital age, we emphasize its profound and harmful impact on societal structures, including the misguidance of public opinion, the erosion of social trust, and the exacerbation of social polarization. Current fake news detection methods are largely limited to superficial text analysis or basic text–image integration, which face significant limitations in accurately identifying deceptive information. To bridge this gap, we propose the UC-CMAF framework, which comprehensively integrates news text, images, and user comments through an adaptive co-attention fusion mechanism. The UC-CMAF workflow consists of four key subprocesses: multimodal feature extraction, cross-modal adaptive collaborative attention fusion of news text and images, cross-modal attention fusion of user comments with news text and images, and finally, the feeding of the fused features into a fake news detector. Specifically, we introduce multi-head cross-modal attention heatmaps and comment importance visualizations to provide interpretability support for the model’s predictions, revealing key semantic areas and user perspectives that influence judgments. Through the cross-modal adaptive collaborative attention mechanism, UC-CMAF achieves deep semantic alignment between news text and images and uses social signals from user comments to build an enhanced credibility evaluation path, offering a new paradigm for interpretable fake news detection. Experimental results demonstrate that UC-CMAF consistently outperforms 15 baseline models across two benchmark datasets, achieving F1 scores of 0.894 and 0.909. These results validate the effectiveness of its adaptive cross-modal attention mechanism and the incorporation of user comments in enhancing both detection accuracy and interpretability. Full article
(This article belongs to the Special Issue Explainable Artificial Intelligence Technology and Its Applications)
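
The co-attention fusion described above can be approximated by a bidirectional cross-attention block in which text tokens query image tokens and vice versa; the returned attention weights are what heatmap-style visualizations would be built from. The sketch below uses assumed dimensions and is not the UC-CMAF code.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Bidirectional (co-)attention between text and image token sequences."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries image
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)  # image queries text
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, text_tok, img_tok):
        t_ctx, t_attn = self.t2v(text_tok, img_tok, img_tok)   # attention maps usable as heatmaps
        v_ctx, v_attn = self.v2t(img_tok, text_tok, text_tok)
        fused = torch.cat([t_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)  # pooled fusion feature
        return self.classifier(fused), (t_attn, v_attn)

logits, maps = CoAttentionFusion()(torch.randn(2, 40, 256), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 2]) -> real vs. fake
```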

14 pages, 1509 KiB  
Article
A Multi-Modal Deep Learning Approach for Predicting Eligibility for Adaptive Radiation Therapy in Nasopharyngeal Carcinoma Patients
by Zhichun Li, Zihan Li, Sai Kit Lam, Xiang Wang, Peilin Wang, Liming Song, Francis Kar-Ho Lee, Celia Wai-Yi Yip, Jing Cai and Tian Li
Cancers 2025, 17(14), 2350; https://doi.org/10.3390/cancers17142350 - 15 Jul 2025
Viewed by 58
Abstract
Background: Adaptive radiation therapy (ART) can improve prognosis for nasopharyngeal carcinoma (NPC) patients. However, the inter-individual variability in anatomical changes, along with the resulting extension of treatment duration and increased workload for the radiologists, makes the selection of eligible patients a persistent challenge in clinical practice. The purpose of this study was to predict eligible ART candidates prior to radiation therapy (RT) for NPC patients using a classification neural network. By leveraging the fusion of medical imaging and clinical data, this method aimed to save time and resources in clinical workflows and improve treatment efficiency. Methods: We collected retrospective data from 305 NPC patients who received RT at Hong Kong Queen Elizabeth Hospital. Each patient sample included pre-treatment computed tomographic (CT) images, T1-weighted magnetic resonance imaging (MRI) data, and T2-weighted MRI images, along with clinical data. We developed and trained a novel multi-modal classification neural network that combines ResNet-50, cross-attention, multi-scale features, and clinical data for multi-modal fusion. The patients were categorized into two labels based on their re-plan status: patients who received ART during RT treatment, as determined by the radiation oncologist, and those who did not. Results: The experimental results demonstrated that the proposed multi-modal deep prediction model outperformed other commonly used deep learning networks, achieving an area under the curve (AUC) of 0.9070. These results indicated the ability of the model to accurately classify and predict ART eligibility for NPC patients. Conclusions: The proposed method showed good performance in predicting ART eligibility among NPC patients, highlighting its potential to enhance clinical decision-making, optimize treatment efficiency, and support more personalized cancer care. Full article
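
A hedged sketch of the general imaging-plus-clinical fusion pattern the abstract describes: an image backbone (a plain ResNet-50 here, without the paper's cross-attention and multi-scale components) produces an embedding that is concatenated with an encoded clinical vector before a two-class re-plan head. Stacking CT/T1/T2 slices as channels and the clinical feature count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ARTEligibilityNet(nn.Module):
    def __init__(self, n_clinical=12):
        super().__init__()
        self.backbone = resnet50(weights=None)     # image encoder (CT/T1/T2 slices stacked as 3 channels here)
        self.backbone.fc = nn.Identity()           # expose the 2048-d pooled feature
        self.clinical = nn.Sequential(nn.Linear(n_clinical, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Sequential(nn.Linear(2048 + 64, 256), nn.ReLU(),
                                  nn.Dropout(0.3), nn.Linear(256, 2))  # re-plan vs. no re-plan

    def forward(self, image, clinical):
        f_img = self.backbone(image)               # (B, 2048)
        f_cli = self.clinical(clinical)            # (B, 64)
        return self.head(torch.cat([f_img, f_cli], dim=1))

model = ARTEligibilityNet()
print(model(torch.randn(2, 3, 224, 224), torch.randn(2, 12)).shape)  # torch.Size([2, 2])
```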

20 pages, 5700 KiB  
Article
Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features
by Hyeonuk Bhin and Jongsuk Choi
Electronics 2025, 14(14), 2837; https://doi.org/10.3390/electronics14142837 - 15 Jul 2025
Viewed by 155
Abstract
Personality is a fundamental psychological trait that exerts a long-term influence on human behavior patterns and social interactions. Automatic personality recognition (APR) has exhibited increasing importance across various domains, including Human–Robot Interaction (HRI), personalized services, and psychological assessments. In this study, we propose a multimodal personality recognition model that classifies the Big Five personality traits by extracting features from three heterogeneous sources: audio processed using Wav2Vec2, video represented as Skeleton Landmark time series, and text encoded through Bidirectional Encoder Representations from Transformers (BERT) and Doc2Vec embeddings. Each modality is handled through an independent Self-Attention block that highlights salient temporal information, and these representations are then summarized and integrated using a late fusion approach to effectively reflect both the inter-modal complementarity and cross-modal interactions. Compared to traditional recurrent neural network (RNN)-based multimodal models and unimodal classifiers, the proposed model achieves an improvement of up to 12 percent in the F1-score. It also maintains a high prediction accuracy and robustness under limited input conditions. Furthermore, a visualization based on t-distributed Stochastic Neighbor Embedding (t-SNE) demonstrates clear distributional separation across the personality classes, enhancing the interpretability of the model and providing insights into the structural characteristics of its latent representations. To support real-time deployment, a lightweight thread-based processing architecture is implemented, ensuring computational efficiency. By leveraging deep learning-based feature extraction and the Self-Attention mechanism, we present a novel personality recognition framework that balances performance with interpretability. The proposed approach establishes a strong foundation for practical applications in HRI, counseling, education, and other interactive systems that require personalized adaptation. Full article
(This article belongs to the Special Issue Explainable Machine Learning and Data Mining)
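
The per-modality self-attention followed by late fusion described above might look roughly like the following, where each modality's time series passes through its own Transformer encoder layer, is average-pooled, and the pooled summaries are concatenated for Big Five prediction. Feature dimensions and sequence lengths are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class LateFusionPersonality(nn.Module):
    def __init__(self, dims=None, hidden=128):
        super().__init__()
        dims = dims or {'audio': 768, 'video': 132, 'text': 768}  # assumed per-modality feature sizes
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.attn = nn.ModuleDict({m: nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
                                   for m in dims})
        self.head = nn.Linear(hidden * len(dims), 5)     # Big Five trait logits

    def forward(self, feats):
        pooled = []
        for m, x in feats.items():                       # x: (B, T, dim_m)
            h = self.attn[m](self.proj[m](x))            # per-modality self-attention
            pooled.append(h.mean(dim=1))                 # temporal average pooling
        return self.head(torch.cat(pooled, dim=-1))      # late fusion of modality summaries

model = LateFusionPersonality()
batch = {'audio': torch.randn(2, 100, 768),
         'video': torch.randn(2, 60, 132),
         'text':  torch.randn(2, 1, 768)}
print(model(batch).shape)  # torch.Size([2, 5])
```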

21 pages, 4147 KiB  
Article
AgriFusionNet: A Lightweight Deep Learning Model for Multisource Plant Disease Diagnosis
by Saleh Albahli
Agriculture 2025, 15(14), 1523; https://doi.org/10.3390/agriculture15141523 - 15 Jul 2025
Viewed by 71
Abstract
Timely and accurate identification of plant diseases is critical to mitigating crop losses and enhancing yield in precision agriculture. This paper proposes AgriFusionNet, a lightweight and efficient deep learning model designed to diagnose plant diseases using multimodal data sources. The framework integrates RGB and multispectral drone imagery with IoT-based environmental sensor data (e.g., temperature, humidity, soil moisture), recorded over six months across multiple agricultural zones. Built on the EfficientNetV2-B4 backbone, AgriFusionNet incorporates Fused-MBConv blocks and Swish activation to improve gradient flow, capture fine-grained disease patterns, and reduce inference latency. The model was evaluated using a comprehensive dataset composed of real-world and benchmarked samples, showing superior performance with 94.3% classification accuracy, 28.5 ms inference time, and a 30% reduction in model parameters compared to state-of-the-art models such as Vision Transformers and InceptionV4. Extensive comparisons with both traditional machine learning and advanced deep learning methods underscore its robustness, generalization, and suitability for deployment on edge devices. Ablation studies and confusion matrix analyses further confirm its diagnostic precision, even in visually ambiguous cases. The proposed framework offers a scalable, practical solution for real-time crop health monitoring, contributing toward smart and sustainable agricultural ecosystems. Full article
(This article belongs to the Special Issue Computational, AI and IT Solutions Helping Agriculture)
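
As a simplified sketch of fusing drone imagery with IoT sensor readings for disease classification, the code below pairs a torchvision EfficientNetV2-S backbone (a stand-in for the paper's EfficientNetV2-B4) with a small MLP over [temperature, humidity, soil moisture] and concatenates the two embeddings. The class count and sensor dimensionality are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

class PlantDiseaseFusion(nn.Module):
    def __init__(self, n_sensor=3, n_classes=10):
        super().__init__()
        self.cnn = efficientnet_v2_s(weights=None)   # stand-in for the EfficientNetV2-B4 backbone
        self.cnn.classifier = nn.Identity()          # expose the 1280-d image embedding
        self.sensor = nn.Sequential(nn.Linear(n_sensor, 32), nn.ReLU(), nn.Linear(32, 32))
        self.head = nn.Linear(1280 + 32, n_classes)

    def forward(self, image, sensors):
        # image: RGB (or multispectral bands reduced to 3 channels, an assumption);
        # sensors: [temperature, humidity, soil moisture] per sample.
        return self.head(torch.cat([self.cnn(image), self.sensor(sensors)], dim=1))

model = PlantDiseaseFusion()
print(model(torch.randn(2, 3, 224, 224), torch.randn(2, 3)).shape)  # torch.Size([2, 10])
```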

19 pages, 1635 KiB  
Article
Integrating AI-Driven Wearable Metaverse Technologies into Ubiquitous Blended Learning: A Framework Based on Embodied Interaction and Multi-Agent Collaboration
by Jiaqi Xu, Xuesong Zhai, Nian-Shing Chen, Usman Ghani, Andreja Istenic and Junyi Xin
Educ. Sci. 2025, 15(7), 900; https://doi.org/10.3390/educsci15070900 - 15 Jul 2025
Viewed by 77
Abstract
Ubiquitous blended learning, leveraging mobile devices, has democratized education by enabling autonomous and readily accessible knowledge acquisition. However, its reliance on traditional interfaces often limits learner immersion and meaningful interaction. The emergence of the wearable metaverse offers a compelling solution, promising enhanced multisensory experiences and adaptable learning environments that transcend the constraints of conventional ubiquitous learning. This research proposes a novel framework for ubiquitous blended learning in the wearable metaverse, aiming to address critical challenges, such as multi-source data fusion, effective human–computer collaboration, and efficient rendering on resource-constrained wearable devices, through the integration of embodied interaction and multi-agent collaboration. The framework leverages a real-time multi-modal data analysis architecture, powered by the MobileNetV4 and xLSTM neural networks, to facilitate a dynamic understanding of the learner’s context and environment. Furthermore, we introduce a multi-agent interaction model, utilizing CrewAI and spatio-temporal graph neural networks, to orchestrate collaborative learning experiences and provide personalized guidance. Finally, we incorporate lightweight SLAM algorithms, augmented with visual perception techniques, to enable accurate spatial awareness and seamless navigation within the metaverse environment. This innovative framework aims to create immersive, scalable, and cost-effective learning spaces within the wearable metaverse. Full article

23 pages, 3492 KiB  
Article
A Multimodal Deep Learning Framework for Accurate Biomass and Carbon Sequestration Estimation from UAV Imagery
by Furkat Safarov, Ugiloy Khojamuratova, Misirov Komoliddin, Xusinov Ibragim Ismailovich and Young Im Cho
Drones 2025, 9(7), 496; https://doi.org/10.3390/drones9070496 - 14 Jul 2025
Viewed by 88
Abstract
Accurate quantification of above-ground biomass (AGB) and carbon sequestration is vital for monitoring terrestrial ecosystem dynamics, informing climate policy, and supporting carbon neutrality initiatives. However, conventional methods—ranging from manual field surveys to remote sensing techniques based solely on 2D vegetation indices—often fail to capture the intricate spectral and structural heterogeneity of forest canopies, particularly at fine spatial resolutions. To address these limitations, we introduce ForestIQNet, a novel end-to-end multimodal deep learning framework designed to estimate AGB and associated carbon stocks from UAV-acquired imagery with high spatial fidelity. ForestIQNet combines dual-stream encoders for processing multispectral UAV imagery and a voxelized Canopy Height Model (CHM), fused via a Cross-Attentional Feature Fusion (CAFF) module, enabling fine-grained interaction between spectral reflectance and 3D structure. A lightweight Transformer-based regression head then performs multitask prediction of AGB and CO2e, capturing long-range spatial dependencies and enhancing generalization. The proposed method achieves an R2 of 0.93 and an RMSE of 6.1 kg for AGB prediction, compared with an R2 of 0.78 and RMSE of 11.7 kg for XGBoost and an R2 of 0.73 and RMSE of 13.2 kg for Random Forest. Despite its architectural complexity, ForestIQNet maintains a low inference cost (27 ms per patch) and generalizes well across species, terrain, and canopy structures. These results establish a new benchmark for UAV-enabled biomass estimation and provide scalable, interpretable tools for climate monitoring and forest management. Full article
(This article belongs to the Special Issue UAVs for Nature Conservation Tasks in Complex Environments)
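
The Cross-Attentional Feature Fusion plus Transformer regression head described above could be sketched as follows: spectral tokens attend to CHM tokens, the fused sequence passes through a small Transformer encoder, and a pooled multitask head outputs AGB and CO2e. Token counts, dimensions, and layer depths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CAFFRegressor(nn.Module):
    """Spectral tokens attend to voxelized-CHM tokens, then a small Transformer
    regresses AGB and CO2e jointly (multitask head)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=heads, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, 2)   # [AGB (kg), CO2e]

    def forward(self, spec_tok, chm_tok):
        fused, _ = self.cross(spec_tok, chm_tok, chm_tok)     # cross-attentional feature fusion
        h = self.encoder(fused + spec_tok)                    # residual connection + long-range context
        return self.head(h.mean(dim=1))                       # pooled multitask regression

out = CAFFRegressor()(torch.randn(2, 64, 128), torch.randn(2, 64, 128))
print(out.shape)  # torch.Size([2, 2])
```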

24 pages, 1076 KiB  
Article
Visual–Tactile Fusion and SAC-Based Learning for Robot Peg-in-Hole Assembly in Uncertain Environments
by Jiaxian Tang, Xiaogang Yuan and Shaodong Li
Machines 2025, 13(7), 605; https://doi.org/10.3390/machines13070605 - 14 Jul 2025
Viewed by 140
Abstract
Robotic assembly, particularly peg-in-hole tasks, presents significant challenges in uncertain environments where pose deviations, varying peg shapes, and environmental noise can undermine performance. To address these issues, this paper proposes a novel approach combining visual–tactile fusion with reinforcement learning. By integrating multimodal data (RGB images, depth maps, tactile force information, and robot body pose data) via an autoencoder-based fusion network, we provide the robot with a more comprehensive perception of its environment. Furthermore, we enhance the robot’s assembly capability by using the Soft Actor–Critic (SAC) reinforcement learning algorithm, which allows the robot to adapt its actions to dynamic environments. We evaluate our method through experiments, which show clear improvements in three key aspects: higher assembly success rates, reduced task completion times, and better generalization across diverse peg shapes and environmental conditions. The results suggest that the combination of visual and tactile feedback with SAC-based learning provides a viable and robust solution for robotic assembly in uncertain environments, paving the way for scalable and adaptable industrial robots. Full article
(This article belongs to the Section Robotics, Mechatronics and Intelligent Machines)
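
A minimal sketch of the autoencoder-style fusion stage only (the SAC agent itself is omitted): RGB, depth, tactile, and pose inputs are encoded into one latent vector that a policy could use as its state, with a small decoder kept to encourage an informative latent. Input sizes and the latent width are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class MultimodalFusionAE(nn.Module):
    """Encode RGB, depth, tactile force, and robot pose into one latent vector that
    a SAC policy could consume as its state; a decoder reconstructs the vector inputs."""
    def __init__(self, tactile_dim=6, pose_dim=7, latent=64):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 128))
        self.depth_enc = nn.Sequential(nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 64))
        self.vec_enc = nn.Linear(tactile_dim + pose_dim, 32)
        self.to_latent = nn.Linear(128 + 64 + 32, latent)
        self.vec_dec = nn.Linear(latent, tactile_dim + pose_dim)  # reconstruction keeps the latent informative

    def forward(self, rgb, depth, tactile, pose):
        z = self.to_latent(torch.cat([self.rgb_enc(rgb), self.depth_enc(depth),
                                      self.vec_enc(torch.cat([tactile, pose], dim=1))], dim=1))
        return z, self.vec_dec(z)

z, recon = MultimodalFusionAE()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64),
                                torch.randn(2, 6), torch.randn(2, 7))
print(z.shape)  # torch.Size([2, 64]) -> state vector for the SAC agent
```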

19 pages, 709 KiB  
Article
Fusion of Multimodal Spatio-Temporal Features and 3D Deformable Convolution Based on Sign Language Recognition in Sensor Networks
by Qian Zhou, Hui Li, Weizhi Meng, Hua Dai, Tianyu Zhou and Guineng Zheng
Sensors 2025, 25(14), 4378; https://doi.org/10.3390/s25144378 - 13 Jul 2025
Viewed by 127
Abstract
Sign language is a complex and dynamic visual language that requires the coordinated movement of various body parts, such as the hands, arms, and limbs—making it an ideal application domain for sensor networks to capture and interpret human gestures accurately. To address the intricate task of precise and efficient sign language recognition (SLR) from raw videos, this study introduces a novel deep learning approach by devising a multimodal framework for SLR. Specifically, feature extraction models are built based on two modalities: skeleton and RGB images. In this paper, we first propose a Multi-Stream Spatio-Temporal Graph Convolutional Network (MSGCN) that relies on three modules: a decoupling graph convolutional network, a self-emphasizing temporal convolutional network, and a spatio-temporal joint attention module. These modules are combined to capture the spatio-temporal information in multi-stream skeleton features. Second, we propose a 3D ResNet model based on deformable convolution (D-ResNet) to model complex spatial and temporal sequences in the original raw images. Finally, a gating mechanism-based Multi-Stream Fusion Module (MFM) is employed to merge the results of the two modalities. Extensive experiments are conducted on the public datasets AUTSL and WLASL, achieving competitive results compared to state-of-the-art systems. Full article
(This article belongs to the Special Issue Intelligent Sensing and Artificial Intelligence for Image Processing)
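
The gating mechanism-based Multi-Stream Fusion Module could be approximated as below: a learned sigmoid gate produces per-dimension weights that blend the skeleton-stream and RGB-stream representations before classification. The 512-d features are an assumption; the 226 classes correspond to AUTSL.

```python
import torch
import torch.nn as nn

class GatedMultiStreamFusion(nn.Module):
    """Learn a per-sample gate that weighs the skeleton-stream and RGB-stream
    representations before classification (in the spirit of the paper's MFM)."""
    def __init__(self, dim=512, n_classes=226):   # 226 sign classes in AUTSL
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, f_skel, f_rgb):
        g = self.gate(torch.cat([f_skel, f_rgb], dim=-1))   # gate values in (0, 1)
        fused = g * f_skel + (1.0 - g) * f_rgb              # convex combination of the two streams
        return self.classifier(fused)

logits = GatedMultiStreamFusion()(torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 226])
```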

25 pages, 4948 KiB  
Review
A Review of Visual Grounding on Remote Sensing Images
by Ziyan Wang, Lei Liu, Gang Wan, Wei Zhang, Binjian Zhong, Haiyang Chang, Xinyi Li, Xiaoxuan Liu and Guangde Sun
Electronics 2025, 14(14), 2815; https://doi.org/10.3390/electronics14142815 - 13 Jul 2025
Viewed by 197
Abstract
Remote sensing visual grounding, a pivotal technology bridging natural language and high-resolution remote sensing images, holds significant application value in disaster monitoring, urban planning, and related fields. However, it faces critical challenges due to the inherent scale heterogeneity, semantic complexity, and annotation scarcity of remote sensing data. This paper first reviews the development history of remote sensing visual grounding, providing an overview of the basic background knowledge, including fundamental concepts, datasets, and evaluation metrics. Then, it categorizes methods by whether they employ large language models as a pedestal, and provides in-depth analyses of the innovations and limitations of Transformer-based and multimodal large language model-based methods. Furthermore, focusing on remote sensing image characteristics, it discusses cutting-edge techniques such as cross-modal feature fusion, language-guided visual optimization, multi-scale, and hierarchical feature processing, open-set expansion and efficient fine-tuning. Finally, it outlines current bottlenecks and proposes valuable directions for future research. As the first comprehensive review dedicated to remote sensing visual grounding, this work is a reference resource for researchers to grasp domain-specific concepts and track the latest developments. Full article

23 pages, 3614 KiB  
Article
A Multimodal Semantic-Enhanced Attention Network for Fake News Detection
by Weijie Chen, Yuzhuo Dang and Xin Zhang
Entropy 2025, 27(7), 746; https://doi.org/10.3390/e27070746 - 12 Jul 2025
Viewed by 272
Abstract
The proliferation of social media platforms has triggered an unprecedented increase in multimodal fake news, creating pressing challenges for content authenticity verification. Current fake news detection systems predominantly rely on isolated unimodal analysis (text or image), failing to exploit critical cross-modal correlations or leverage latent social context cues. To bridge this gap, we introduce the SCCN (Semantic-enhanced Cross-modal Co-attention Network), a novel framework that synergistically combines multimodal features with refined social graph signals. Our approach innovatively combines text, image, and social relation features through a hierarchical fusion framework. First, we extract modality-specific features and enhance semantics by identifying entities in both text and visual data. Second, an improved co-attention mechanism selectively integrates social relations while removing irrelevant connections to reduce noise and explore latent informative links. Finally, the model is optimized via cross-entropy loss with entropy minimization. Experimental results on the benchmark datasets PHEME and Weibo show that SCCN consistently outperforms existing approaches, achieving relative accuracy improvements of 1.7% and 1.6% over the best-performing baseline methods on each dataset. Full article
(This article belongs to the Section Multidisciplinary Applications)
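
The training objective mentioned above (cross-entropy with entropy minimization) can be written compactly as below; the weighting factor is an assumption, since the abstract does not give one.

```python
import torch
import torch.nn.functional as F

def sccn_style_loss(logits, labels, lam=0.1):
    """Supervised cross-entropy plus an entropy-minimization term that pushes the
    model toward confident predictions (lam is an assumed weighting factor)."""
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1).mean()
    return ce + lam * entropy

logits = torch.randn(8, 2, requires_grad=True)   # real vs. fake logits
labels = torch.randint(0, 2, (8,))
loss = sccn_style_loss(logits, labels)
loss.backward()
print(float(loss))
```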

31 pages, 529 KiB  
Review
Advances and Challenges in Respiratory Sound Analysis: A Technique Review Based on the ICBHI2017 Database
by Shaode Yu, Jieyang Yu, Lijun Chen, Bing Zhu, Xiaokun Liang, Yaoqin Xie and Qiurui Sun
Electronics 2025, 14(14), 2794; https://doi.org/10.3390/electronics14142794 - 11 Jul 2025
Viewed by 268
Abstract
Respiratory diseases present significant global health challenges. Recent advances in respiratory sound analysis (RSA) have shown great potential for automated disease diagnosis and patient management. The International Conference on Biomedical and Health Informatics 2017 (ICBHI2017) database stands as one of the most authoritative open-access RSA datasets. This review systematically examines 135 technical publications utilizing the database, and a comprehensive and timely summary of RSA methodologies is offered for researchers and practitioners in this field. Specifically, this review covers signal processing techniques including data resampling, augmentation, normalization, and filtering; feature extraction approaches spanning time-domain, frequency-domain, joint time–frequency analysis, and deep feature representation from pre-trained models; and classification methods for adventitious sound (AS) categorization and pathological state (PS) recognition. Current achievements for AS and PS classification are summarized across studies using official and custom data splits. Despite promising technique advancements, several challenges remain unresolved. These include a severe class imbalance in the dataset, limited exploration of advanced data augmentation techniques and foundation models, a lack of model interpretability, and insufficient generalization studies across clinical settings. Future directions involve multi-modal data fusion, the development of standardized processing workflows, interpretable artificial intelligence, and integration with broader clinical data sources to enhance diagnostic performance and clinical applicability. Full article

30 pages, 8143 KiB  
Article
An Edge-Deployable Multi-Modal Nano-Sensor Array Coupled with Deep Learning for Real-Time, Multi-Pollutant Water Quality Monitoring
by Zhexu Xi, Robert Nicolas and Jiayi Wei
Water 2025, 17(14), 2065; https://doi.org/10.3390/w17142065 - 10 Jul 2025
Viewed by 216
Abstract
Real-time, high-resolution monitoring of chemically diverse water pollutants remains a critical challenge for smart water management. Here, we report a fully integrated, multi-modal nano-sensor array, combining graphene field-effect transistors, Ag/Au-nanostar surface-enhanced Raman spectroscopy substrates, and CdSe/ZnS quantum dot fluorescence, coupled to an edge-deployable CNN-LSTM architecture that fuses raw electrochemical, vibrational, and photoluminescent signals without manual feature engineering. The 45 mm × 20 mm microfluidic manifold enables continuous flow-through sampling, while 8-bit-quantised inference executes in 31 ms at <12 W. Laboratory calibration over 28,000 samples achieved limits of detection of 12 ppt (Pb2+), 17 pM (atrazine) and 87 ng L−1 (nanoplastics), with R2 ≥ 0.93 and a mean absolute percentage error <6%. A 24 h deployment in the Cherwell River reproduced natural concentration fluctuations with field R2 ≥ 0.92. SHAP and Grad-CAM analyses reveal that the network bases its predictions on Dirac-point shifts, characteristic Raman bands, and early-time fluorescence-quenching kinetics, providing mechanistic interpretability. The platform therefore offers a scalable route to smart water grids, point-of-use drinking water sentinels, and rapid environmental incident response. Future work will address sensor drift through antifouling coatings, enhance cross-site generalisation via federated learning, and create physics-informed digital twins for self-calibrating global monitoring networks. Full article
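
A hedged sketch of a CNN-LSTM fusing raw multi-channel sensor streams into pollutant-concentration estimates, in the spirit of the architecture described above; the channel count, sequence length, and the three regression targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMFusion(nn.Module):
    """1D-CNN front end feeds an LSTM; the three input channels stand in for the
    electrochemical, Raman, and fluorescence streams concatenated channel-wise."""
    def __init__(self, in_channels=3, n_targets=3):   # e.g., Pb2+, atrazine, nanoplastics
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(in_channels, 32, 7, padding=3), nn.ReLU(),
                                 nn.MaxPool1d(2),
                                 nn.Conv1d(32, 64, 5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        self.head = nn.Linear(64, n_targets)          # predicted concentrations

    def forward(self, x):                 # x: (B, channels, time)
        h = self.cnn(x).transpose(1, 2)   # -> (B, time/2, 64) for the LSTM
        out, _ = self.lstm(h)
        return self.head(out[:, -1])      # regress from the last hidden state

model = CNNLSTMFusion()
print(model(torch.randn(2, 3, 512)).shape)  # torch.Size([2, 3])
```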
