Search Results (1,870)

Search Parameters:
Keywords = attention mechanism feature fusion

18 pages, 1811 KiB  
Article
A Multimodal Deep Learning Framework for Consistency-Aware Review Helpfulness Prediction
by Seonu Park, Xinzhe Li, Qinglong Li and Jaekyeong Kim
Electronics 2025, 14(15), 3089; https://doi.org/10.3390/electronics14153089 - 1 Aug 2025
Abstract
Multimodal review helpfulness prediction (MRHP) aims to identify the most helpful reviews by leveraging both textual and visual information. However, prior studies have primarily focused on modeling interactions between these modalities, often overlooking the consistency between review content and ratings, which is a key indicator of review credibility. To address this limitation, we propose CRCNet (Content–Rating Consistency Network), a novel MRHP model that jointly captures the semantic consistency between review content and ratings while modeling the complementary characteristics of text and image modalities. CRCNet employs RoBERTa and VGG-16 to extract semantic and visual features, respectively. A co-attention mechanism is applied to capture the consistency between content and rating, and a Gated Multimodal Unit (GMU) is adopted to integrate consistency-aware representations. Experimental results on two large-scale Amazon review datasets demonstrate that CRCNet outperforms both unimodal and multimodal baselines in terms of MAE, MSE, RMSE, and MAPE. Further analysis confirms the effectiveness of content–rating consistency modeling and the superiority of the proposed fusion strategy. These findings suggest that incorporating semantic consistency into multimodal architectures can substantially improve the accuracy and trustworthiness of review helpfulness prediction. Full article
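
For a concrete reference point on the fusion step named above, here is a minimal PyTorch sketch of a Gated Multimodal Unit in its standard formulation (Arevalo et al., 2017), which the abstract adopts; the feature sizes and toy usage below are illustrative assumptions, not the CRCNet implementation.

```python
import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Gated Multimodal Unit: a learned sigmoid gate decides, per hidden
    dimension, how much of each modality enters the fused representation."""
    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.gate = nn.Linear(text_dim + image_dim, hidden_dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        h_text = torch.tanh(self.text_proj(text_feat))     # (B, H)
        h_image = torch.tanh(self.image_proj(image_feat))  # (B, H)
        z = torch.sigmoid(self.gate(torch.cat([text_feat, image_feat], dim=-1)))
        return z * h_text + (1.0 - z) * h_image            # gated blend of modalities

# toy usage: RoBERTa-sized text features (768) and VGG-16 fc-layer features (4096)
fused = GatedMultimodalUnit(768, 4096, 512)(torch.randn(2, 768), torch.randn(2, 4096))
print(fused.shape)  # torch.Size([2, 512])
```
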
16 pages, 4587 KiB  
Article
FAMNet: A Lightweight Stereo Matching Network for Real-Time Depth Estimation in Autonomous Driving
by Jingyuan Zhang, Qiang Tong, Na Yan and Xiulei Liu
Symmetry 2025, 17(8), 1214; https://doi.org/10.3390/sym17081214 - 1 Aug 2025
Abstract
Accurate and efficient stereo matching is fundamental to real-time depth estimation from symmetric stereo cameras in autonomous driving systems. However, existing high-accuracy stereo matching networks typically rely on computationally expensive 3D convolutions, which limit their practicality in real-world environments. In contrast, real-time methods often sacrifice accuracy or generalization capability. To address these challenges, we propose FAMNet (Fusion Attention Multi-Scale Network), a lightweight and generalizable stereo matching framework tailored for real-time depth estimation in autonomous driving applications. FAMNet consists of two novel modules: Fusion Attention-based Cost Volume (FACV) and Multi-scale Attention Aggregation (MAA). FACV constructs a compact yet expressive cost volume by integrating multi-scale correlation, attention-guided feature fusion, and channel reweighting, thereby reducing reliance on heavy 3D convolutions. MAA further enhances disparity estimation by fusing multi-scale contextual cues through pyramid-based aggregation and dual-path attention mechanisms. Extensive experiments on the KITTI 2012 and KITTI 2015 benchmarks demonstrate that FAMNet achieves a favorable trade-off between accuracy, efficiency, and generalization. On KITTI 2015, with the incorporation of FACV and MAA, the prediction accuracy of the baseline model is improved by 37% and 38%, respectively, and a total improvement of 42% is achieved by our final model. These results highlight FAMNet’s potential for practical deployment in resource-constrained autonomous driving systems requiring real-time and reliable depth perception. Full article
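
FACV's full design is not reproduced in this listing; as a hedged sketch of one ingredient the abstract names, channel reweighting of a compact correlation cost volume, a squeeze-and-excitation-style block could look like the following (the disparity count and reduction ratio are assumptions).

```python
import torch
import torch.nn as nn

class CostVolumeChannelReweight(nn.Module):
    """Squeeze-and-excitation-style reweighting: global statistics of the
    correlation cost volume drive per-channel attention weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        # cost: (B, D, H, W), one channel per candidate disparity
        w = self.mlp(cost.mean(dim=(2, 3)))   # global average pool -> (B, D)
        return cost * w[:, :, None, None]     # reweight each disparity channel

cost = torch.randn(1, 48, 64, 128)            # 48 candidate disparities (assumed)
print(CostVolumeChannelReweight(48)(cost).shape)  # torch.Size([1, 48, 64, 128])
```
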
25 pages, 10331 KiB  
Article
Forest Fire Detection Method Based on Dual-Branch Multi-Scale Adaptive Feature Fusion Network
by Qinggan Wu, Chen Wei, Ning Sun, Xiong Xiong, Qingfeng Xia, Jianmeng Zhou and Xingyu Feng
Forests 2025, 16(8), 1248; https://doi.org/10.3390/f16081248 - 31 Jul 2025
Abstract
There are significant scale and morphological differences between fire and smoke features in forest fire detection. This paper proposes a detection method based on a dual-branch multi-scale adaptive feature fusion network (DMAFNet). In this method, a convolutional neural network (CNN) and a transformer form a dual-branch backbone that extracts local texture and global context information, respectively. To overcome the differences in feature distribution and response scale between the two branches, a feature correction module (FCM) is designed; through spatial and channel correction mechanisms, it adaptively aligns the features of the two branches. A Fusion Feature Module (FFM) is further introduced to fully integrate the dual-branch features via a two-way cross-attention mechanism while effectively suppressing redundant information. Finally, a Multi-Scale Fusion Attention Unit (MSFAU) is designed to enhance multi-scale detection of fire targets. Experimental results show that the proposed DMAFNet achieves significant mAP (mean average precision) improvements over existing mainstream detection methods. Full article
(This article belongs to the Section Natural Hazards and Risk Management)
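
The FFM's two-way cross-attention can be pictured with a short PyTorch sketch like the one below; the token counts, width, and head count are illustrative assumptions rather than the published module.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    """Two-way cross-attention between a CNN branch and a transformer branch:
    each branch queries the other for complementary information."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cnn_to_tr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tr_to_cnn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, cnn_tokens: torch.Tensor, tr_tokens: torch.Tensor) -> torch.Tensor:
        # each input: (B, N, dim) flattened feature tokens
        a, _ = self.cnn_to_tr(cnn_tokens, tr_tokens, tr_tokens)  # CNN queries transformer
        b, _ = self.tr_to_cnn(tr_tokens, cnn_tokens, cnn_tokens) # transformer queries CNN
        return self.out(torch.cat([a, b], dim=-1))

fused = BidirectionalCrossAttentionFusion(256)(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(fused.shape)  # torch.Size([2, 196, 256])
```
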
25 pages, 21958 KiB  
Article
ESL-YOLO: Edge-Aware Side-Scan Sonar Object Detection with Adaptive Quality Assessment
by Zhanshuo Zhang, Changgeng Shuai, Chengren Yuan, Buyun Li, Jianguo Ma and Xiaodong Shang
J. Mar. Sci. Eng. 2025, 13(8), 1477; https://doi.org/10.3390/jmse13081477 - 31 Jul 2025
Abstract
Focusing on the problem of insufficient detection accuracy caused by blurred target boundaries, variable scales, and severe noise interference in side-scan sonar images, this paper proposes a high-precision detection network named ESL-YOLO, which integrates edge perception and adaptive quality assessment. Firstly, an Edge Fusion Module (EFM) is designed, which integrates the Sobel operator into depthwise separable convolution. Through a dual-branch structure, it realizes effective fusion of edge features and spatial features, significantly enhancing the ability to recognize targets with blurred boundaries. Secondly, a Self-Calibrated Dual Attention (SCDA) module is constructed. By means of feature cross-calibration and multi-scale channel attention fusion, it achieves adaptive fusion of shallow detail and deep semantic features, improving detection accuracy for small targets and targets with complex shapes. Finally, a Location Quality Estimator (LQE) is introduced, which quantifies localization quality using the statistical characteristics of the bounding box distribution, effectively reducing false and missed detections. Experiments on the SIMD dataset show that ESL-YOLO reaches 84.65% mAP@0.5, with precision and recall of 87.67% and 75.63%, respectively. Generalization experiments on additional sonar datasets further validate the effectiveness of the proposed method across different data distributions and target types, providing an effective technical solution for side-scan sonar image target detection. Full article
(This article belongs to the Section Ocean Engineering)
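
A minimal sketch of the EFM idea, a fixed Sobel depthwise branch fused with a depthwise-separable spatial branch, might look like this; the exact layer layout in ESL-YOLO may differ.

```python
import torch
import torch.nn as nn

class EdgeFusionModule(nn.Module):
    """Dual-branch block: a frozen Sobel depthwise convolution extracts edges,
    a depthwise-separable branch keeps spatial features; a 1x1 conv fuses both."""
    def __init__(self, channels: int):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.edge = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.edge.weight.data = sobel_x.repeat(channels, 1, 1, 1)  # one Sobel kernel per channel
        self.edge.weight.requires_grad_(False)                     # fixed edge operator
        self.spatial = nn.Sequential(                              # depthwise-separable conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.spatial(x), self.edge(x)], dim=1))

print(EdgeFusionModule(32)(torch.randn(1, 32, 80, 80)).shape)  # torch.Size([1, 32, 80, 80])
```
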
21 pages, 1681 KiB  
Article
Cross-Modal Complementarity Learning for Fish Feeding Intensity Recognition via Audio–Visual Fusion
by Jian Li, Yanan Wei, Wenkai Ma and Tan Wang
Animals 2025, 15(15), 2245; https://doi.org/10.3390/ani15152245 - 31 Jul 2025
Abstract
Accurate evaluation of fish feeding intensity is crucial for optimizing aquaculture efficiency and the healthy growth of fish. Previous methods rely mainly on a single modality (e.g., audio or visual). However, the complex underwater environment poses significant challenges for single-modal monitoring: visual systems are severely affected by water turbidity, lighting conditions, and fish occlusion, while acoustic systems suffer from background noise. Although existing studies have attempted to combine acoustic and visual information, most adopt simple feature-level fusion strategies, which fail to fully exploit the complementary advantages of the two modalities under different environmental conditions and lack dynamic mechanisms for evaluating modality reliability. To address these problems, we propose the Adaptive Cross-modal Attention Fusion Network (ACAF-Net), a cross-modal complementarity learning framework with a two-stage attention fusion mechanism: (1) a cross-modal enhancement stage that enriches individual representations through Low-rank Bilinear Pooling and learnable fusion weights; (2) an adaptive attention fusion stage that dynamically weights acoustic and visual features based on complementarity and environmental reliability. Our framework incorporates dimension alignment strategies and attention mechanisms to capture the temporal–spatial complementarity between acoustic feeding signals and visual behavioral patterns. Extensive experiments demonstrate superior performance compared with single-modal and conventional fusion approaches, with a 6.4% accuracy improvement. The results validate the effectiveness of exploiting cross-modal complementarity for underwater behavioral analysis and establish a foundation for intelligent aquaculture monitoring systems. Full article
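
As a sketch of the adaptive fusion stage described above, reliability-weighted blending of two aligned modality embeddings can be written in a few lines; the scoring head and dimensions are assumptions, not the ACAF-Net design.

```python
import torch
import torch.nn as nn

class AdaptiveAudioVisualFusion(nn.Module):
    """Reliability-weighted fusion: a small shared head scores each modality,
    and a softmax over the scores yields adaptive fusion weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(inplace=True), nn.Linear(dim // 2, 1))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (B, dim) embeddings already aligned to a common width
        logits = torch.cat([self.score(audio), self.score(visual)], dim=1)  # (B, 2)
        w = torch.softmax(logits, dim=1)
        return w[:, :1] * audio + w[:, 1:] * visual

fused = AdaptiveAudioVisualFusion(256)(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```
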
20 pages, 1536 KiB  
Article
Graph Convolution-Based Decoupling and Consistency-Driven Fusion for Multimodal Emotion Recognition
by Yingmin Deng, Chenyu Li, Yu Gu, He Zhang, Linsong Liu, Haixiang Lin, Shuang Wang and Hanlin Mo
Electronics 2025, 14(15), 3047; https://doi.org/10.3390/electronics14153047 - 30 Jul 2025
Abstract
Multimodal emotion recognition (MER) is essential for understanding human emotions from diverse sources such as speech, text, and video. However, modality heterogeneity and inconsistent expression pose challenges for effective feature fusion. To address this, we propose a novel MER framework combining a Dynamic Weighted Graph Convolutional Network (DW-GCN) for feature disentanglement and a Cross-Attention Consistency-Gated Fusion (CACG-Fusion) module for robust integration. DW-GCN models complex inter-modal relationships, enabling the extraction of both common and private features. The CACG-Fusion module subsequently enhances classification performance through dynamic alignment of cross-modal cues, employing attention-based coordination and consistency-preserving gating mechanisms to optimize feature integration. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method achieves state-of-the-art performance, significantly improving the ACC7, ACC2, and F1 scores. Full article
(This article belongs to the Section Computer Science & Engineering)
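
One way to picture a consistency-gated fusion step is to gate cross-attended features by cross-modal agreement; the sketch below is a simplified stand-in for CACG-Fusion, not the published module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyGatedFusion(nn.Module):
    """Consistency-preserving gate: cosine agreement between a modality and its
    cross-attended counterpart scales how much of the cross-modal cue is kept."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text, audio: (B, N, dim) token sequences
        cross, _ = self.attn(text, audio, audio)           # text attends to audio
        agree = F.cosine_similarity(text, cross, dim=-1)   # (B, N) in [-1, 1]
        gate = agree.clamp(min=0).unsqueeze(-1)            # admit only consistent cues
        return text + gate * cross

out = ConsistencyGatedFusion(128)(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
print(out.shape)  # torch.Size([2, 10, 128])
```
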
35 pages, 4940 KiB  
Article
A Novel Lightweight Facial Expression Recognition Network Based on Deep Shallow Network Fusion and Attention Mechanism
by Qiaohe Yang, Yueshun He, Hongmao Chen, Youyong Wu and Zhihua Rao
Algorithms 2025, 18(8), 473; https://doi.org/10.3390/a18080473 - 30 Jul 2025
Abstract
Facial expression recognition (FER) is a critical research direction in artificial intelligence, widely used in intelligent interaction, medical diagnosis, security monitoring, and other domains; these applications highlight its considerable practical value and social significance. FER models often need to run efficiently on mobile or edge devices, so lightweight FER is particularly important. However, the feature extraction and classification methods of current lightweight convolutional neural network algorithms are rarely optimized for the characteristics of facial expression images and fail to make full use of the feature information those images contain. To address the lack of FER models that are both lightweight and optimized for expression-specific feature extraction, this study proposes LightExNet, a lightweight convolutional neural network built on the MobileNet V2 backbone that combines deep–shallow feature fusion, an attention mechanism, and a joint loss function. First, deep and shallow features are fused to fully exploit the shallow features of the original image, reduce information loss, alleviate vanishing gradients as the number of convolutional layers grows, and achieve multi-scale feature fusion; the MobileNet V2 architecture is also streamlined to integrate the deep and shallow paths seamlessly. Second, a new channel and spatial attention mechanism, designed around the characteristics of expression features, encodes as much information as possible from the different expression regions, effectively improving recognition accuracy. Finally, an improved center loss is superimposed to further improve classification accuracy, with measures taken to significantly reduce the computational cost of the joint loss. LightExNet is evaluated on three mainstream FER datasets: Fer2013, CK+, and RAF-DB. It has 3.27 M parameters and 298.27 M FLOPs and reaches accuracies of 69.17%, 97.37%, and 85.97%, respectively; its overall performance surpasses current mainstream lightweight expression recognition algorithms such as MobileNet V2, IE-DBN, Self-Cure Net, Improved MobileViT, MFN, Ada-CM, and Parallel CNN (Convolutional Neural Network). The results confirm that LightExNet improves recognition accuracy and computational efficiency while reducing energy consumption and enhancing deployment flexibility, underscoring its potential for real-world lightweight FER applications. Full article
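
The joint loss described above builds on the standard center loss; a minimal sketch of combining it with cross-entropy follows (the paper's improved, reduced-cost variant is not shown, and the weighting factor is an assumption).

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: pulls each feature toward a learned per-class center.
    Joint objective: L = L_ce + lambda * L_center."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

# toy batch: 8 samples, 64-d features, 7 expression classes
feats, labels, logits = torch.randn(8, 64), torch.randint(0, 7, (8,)), torch.randn(8, 7)
loss = nn.CrossEntropyLoss()(logits, labels) + 0.01 * CenterLoss(7, 64)(feats, labels)
print(loss.item())
```
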
27 pages, 10182 KiB  
Article
Storage Life Prediction of High-Voltage Diodes Based on Improved Artificial Bee Colony Algorithm Optimized LSTM-Transformer Framework
by Zhongtian Liu, Shaohua Yang and Bin Suo
Electronics 2025, 14(15), 3030; https://doi.org/10.3390/electronics14153030 - 30 Jul 2025
Abstract
Storage life prediction of high-voltage diodes, key devices in power electronic systems, is important for system reliability and preventive maintenance. In this paper, we propose a hybrid modeling framework that integrates a Long Short-Term Memory network (LSTM) with a Transformer structure and is hyperparameter-optimized by an Improved Artificial Bee Colony algorithm (IABC), aiming to realize high-precision modeling and prediction of high-voltage diode storage life. The framework combines the advantages of the LSTM in modeling temporal dependencies with the global feature extraction capability of the Transformer's self-attention mechanism, and improves feature learning under small-sample conditions through a deep fusion strategy. Meanwhile, a parameter-type-aware IABC search mechanism is introduced to efficiently optimize the model hyperparameters. The experimental results show that, compared with the unoptimized model, the average mean squared error (MSE) of the proposed model is reduced by 33.7% (from 0.00574 to 0.00402) and the coefficient of determination (R²) is improved by 3.6% (from 0.892 to 0.924) in 10-fold cross-validation. The average predicted lifetime of the sample was 39,403.3 h, and the mean relative uncertainty of prediction was 12.57%. This study provides an efficient tool for power electronics reliability engineering and has important applications for smart grid and new-energy system health management. Full article
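
A minimal sketch of the LSTM-plus-Transformer hybrid described above, an LSTM front-end followed by a transformer encoder and a regression head, is shown below; the depths and widths are assumptions, and the IABC hyperparameter search is not included.

```python
import torch
import torch.nn as nn

class LSTMTransformer(nn.Module):
    """LSTM captures local temporal dependencies; a transformer encoder adds
    global self-attention over the sequence; a linear head regresses life."""
    def __init__(self, in_dim: int, hidden: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        enc = nn.TransformerEncoderLayer(hidden, heads, dim_feedforward=4 * hidden,
                                         batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)          # (B, T, hidden)
        h = self.encoder(h)          # global attention across time steps
        return self.head(h[:, -1])   # predict from the final step

# toy usage: batch of 2 degradation sequences, 50 steps, 6 measured parameters
print(LSTMTransformer(6)(torch.randn(2, 50, 6)).shape)  # torch.Size([2, 1])
```
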
20 pages, 19642 KiB  
Article
SIRI-MOGA-UNet: A Synergistic Framework for Subsurface Latent Damage Detection in ‘Korla’ Pears via Structured-Illumination Reflectance Imaging and Multi-Order Gated Attention
by Baishao Zhan, Jiawei Liao, Hailiang Zhang, Wei Luo, Shizhao Wang, Qiangqiang Zeng and Yongxian Lai
Spectrosc. J. 2025, 3(3), 22; https://doi.org/10.3390/spectroscj3030022 - 29 Jul 2025
Abstract
Bruising in ‘Korla’ pears is a prevalent phenomenon that leads to progressive fruit decay and substantial economic losses. Detecting early-stage bruising is challenging because there are no visible external characteristics, and existing deep learning models are limited in extracting weak features under complex optical interference. To address the postharvest latent damage detection challenges in ‘Korla’ pears, this study proposes a collaborative detection framework integrating structured-illumination reflectance imaging (SIRI) with multi-order gated attention mechanisms. Initially, an SIRI optical system was constructed, employing 150 cycles·m⁻¹ spatial frequency modulation and a three-phase demodulation algorithm to extract subtle interference signal variations, thereby generating RT (relative transmission) images with significantly enhanced contrast in subsurface damage regions. To improve detection accuracy in latent damage areas, the MOGA-UNet model was developed with three key innovations: (1) a lightweight VGG16 encoder structure is integrated into the feature extraction network to improve computational efficiency while retaining detail; (2) a multi-order gated aggregation module is added at the end of the encoder to fuse features at different scales through a dedicated convolution scheme; and (3) a channel attention mechanism is embedded in the decoding stage to dynamically increase the weight of damage-related feature channels. Experimental results demonstrate that the proposed model achieves 94.38% mean Intersection over Union (mIoU) and a 97.02% Dice coefficient on RT images, outperforming the baseline UNet model by 2.80% and offering superior segmentation accuracy and boundary localization compared with mainstream models. This approach provides an efficient and reliable technical solution for intelligent postharvest agricultural product sorting. Full article
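
Three-phase demodulation has a standard closed form: with images I₁, I₂, I₃ captured at phase shifts of 0, 2π/3, and 4π/3, the modulated (AC) and uniform (DC) components are recovered as below; how the paper then forms its RT images from these components is not reproduced here.

```python
import numpy as np

def demodulate_three_phase(i1, i2, i3):
    """Standard three-phase structured-illumination demodulation:
    AC = (sqrt(2)/3) * sqrt((I1-I2)^2 + (I2-I3)^2 + (I3-I1)^2), DC = mean."""
    ac = (2.0 ** 0.5 / 3.0) * np.sqrt((i1 - i2) ** 2 + (i2 - i3) ** 2 + (i3 - i1) ** 2)
    dc = (i1 + i2 + i3) / 3.0
    return ac, dc

# toy usage on random images standing in for the three phase-shifted captures
rng = np.random.default_rng(0)
i1, i2, i3 = (rng.random((256, 256)) for _ in range(3))
ac, dc = demodulate_three_phase(i1, i2, i3)
print(ac.shape, dc.shape)  # (256, 256) (256, 256)
```
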
25 pages, 2518 KiB  
Article
An Efficient Semantic Segmentation Framework with Attention-Driven Context Enhancement and Dynamic Fusion for Autonomous Driving
by Jia Tian, Peizeng Xin, Xinlu Bai, Zhiguo Xiao and Nianfeng Li
Appl. Sci. 2025, 15(15), 8373; https://doi.org/10.3390/app15158373 - 28 Jul 2025
Abstract
In recent years, a growing number of real-time semantic segmentation networks have been developed to improve segmentation accuracy. However, these advancements often come at the cost of increased computational complexity, which limits their inference efficiency, particularly in scenarios such as autonomous driving, where strict real-time performance is essential. Achieving an effective balance between speed and accuracy has thus become a central challenge in this field. To address this issue, we present a lightweight semantic segmentation model tailored for the perception requirements of autonomous vehicles. The architecture follows an encoder–decoder paradigm, which not only preserves the capability for deep feature extraction but also facilitates multi-scale information integration. The encoder leverages a high-efficiency backbone, while the decoder introduces a dynamic fusion mechanism designed to enhance information interaction between different feature branches. Recognizing the limitations of convolutional networks in modeling long-range dependencies and capturing global semantic context, the model incorporates an attention-based feature extraction component. This is further augmented by positional encoding, enabling better awareness of spatial structures and local details. The dynamic fusion mechanism employs an adaptive weighting strategy, adjusting the contribution of each feature channel to reduce redundancy and improve representation quality. To validate the effectiveness of the proposed network, experiments were conducted on a single RTX 3090 GPU. The Dynamic Real-time Integrated Vision Encoder–Segmenter Network (DriveSegNet) achieved a mean Intersection over Union (mIoU) of 76.9% and an inference speed of 70.5 FPS on the Cityscapes test dataset, 74.6% mIoU and 139.8 FPS on the CamVid test dataset, and 35.8% mIoU with 108.4 FPS on the ADE20K dataset. The experimental results demonstrate that the proposed method achieves an excellent balance between inference speed, segmentation accuracy, and model size. Full article
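
The dynamic fusion mechanism described above, adaptive per-channel weighting of two feature branches, can be sketched as follows; the weight-prediction head is an assumption, not the DriveSegNet design.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Dynamic fusion of two decoder branches: per-channel weights are predicted
    from the concatenated features, so each branch's contribution adapts to the
    input and redundant channels are down-weighted."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, 1),
            nn.Sigmoid())

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        w = self.weight_head(torch.cat([a, b], dim=1))  # (B, 2C, 1, 1)
        wa, wb = w.chunk(2, dim=1)
        return wa * a + wb * b

out = DynamicFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```
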
25 pages, 837 KiB  
Article
DASF-Net: A Multimodal Framework for Stock Price Forecasting with Diffusion-Based Graph Learning and Optimized Sentiment Fusion
by Nhat-Hai Nguyen, Thi-Thu Nguyen and Quan T. Ngo
J. Risk Financial Manag. 2025, 18(8), 417; https://doi.org/10.3390/jrfm18080417 - 28 Jul 2025
Abstract
Stock price forecasting remains a persistent challenge in time series analysis due to complex inter-stock relationships and dynamic textual signals such as financial news. While Graph Neural Networks (GNNs) can model relational structures, they often struggle with capturing higher-order dependencies and are sensitive to noise. Moreover, sentiment signals are typically aggregated using fixed time windows, which may introduce temporal bias. To address these issues, we propose DASF-Net (Diffusion-Aware Sentiment Fusion Network), a multimodal framework that integrates structural and textual information for robust prediction. DASF-Net leverages diffusion processes over two complementary financial graphs—one based on industry relationships, the other on fundamental indicators—to learn richer stock representations. Simultaneously, sentiment embeddings extracted from financial news using FinBERT are aggregated over an empirically optimized window to preserve temporal relevance. These modalities are fused via a multi-head attention mechanism and passed to a temporal forecasting module. DASF-Net integrates daily stock prices and news sentiment, using a 3-day sentiment aggregation window, to forecast stock prices over daily horizons (1–3 days). Experiments on 12 large-cap S&P 500 stocks over four years demonstrate that DASF-Net outperforms competitive baselines, achieving up to 91.6% relative reduction in Mean Squared Error (MSE). Results highlight the effectiveness of combining graph diffusion and sentiment-aware features for improved financial forecasting. Full article
(This article belongs to the Special Issue Machine Learning Applications in Finance, 2nd Edition)
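
A hedged sketch of the multi-head attention fusion step, where diffusion-learned stock embeddings query a 3-day window of sentiment embeddings, might look like this; the widths and head count are illustrative, not the DASF-Net configuration.

```python
import torch
import torch.nn as nn

class GraphSentimentFusion(nn.Module):
    """Multi-head attention fusion: each stock's (graph-diffused) embedding
    queries its recent sentiment embeddings; a residual + LayerNorm keeps the
    structural signal when sentiment is uninformative."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, stock: torch.Tensor, sentiment: torch.Tensor) -> torch.Tensor:
        # stock: (B, 1, dim); sentiment: (B, 3, dim) for a 3-day window
        fused, _ = self.attn(stock, sentiment, sentiment)
        return self.norm(stock + fused)

# toy usage: 12 stocks, 64-d embeddings, 3 days of FinBERT-style sentiment vectors
out = GraphSentimentFusion(64)(torch.randn(12, 1, 64), torch.randn(12, 3, 64))
print(out.shape)  # torch.Size([12, 1, 64])
```
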
24 pages, 3480 KiB  
Article
MFPI-Net: A Multi-Scale Feature Perception and Interaction Network for Semantic Segmentation of Urban Remote Sensing Images
by Xiaofei Song, Mingju Chen, Jie Rao, Yangming Luo, Zhihao Lin, Xingyue Zhang, Senyuan Li and Xiao Hu
Sensors 2025, 25(15), 4660; https://doi.org/10.3390/s25154660 - 27 Jul 2025
Abstract
To improve semantic segmentation performance for complex urban remote sensing images with multi-scale object distribution, class similarity, and small object omission, this paper proposes MFPI-Net, an encoder–decoder-based semantic segmentation network. It includes four core modules: a Swin Transformer backbone encoder, a diverse dilation rates attention shuffle decoder (DDRASD), a multi-scale convolutional feature enhancement module (MCFEM), and a cross-path residual fusion module (CPRFM). The Swin Transformer efficiently extracts multi-level global semantic features through its hierarchical structure and window attention mechanism. The DDRASD’s diverse dilation rates attention (DDRA) block combines convolutions with diverse dilation rates and channel-coordinate attention to enhance multi-scale contextual awareness, while Shuffle Block improves resolution via pixel rearrangement and avoids checkerboard artifacts. The MCFEM enhances local feature modeling through parallel multi-kernel convolutions, forming a complementary relationship with the Swin Transformer’s global perception capability. The CPRFM employs multi-branch convolutions and a residual multiplication–addition fusion mechanism to enhance interactions among multi-source features, thereby improving the recognition of small objects and similar categories. Experiments on the ISPRS Vaihingen and Potsdam datasets show that MFPI-Net outperforms mainstream methods, achieving 82.57% and 88.49% mIoU, validating its superior segmentation performance in urban remote sensing. Full article
(This article belongs to the Section Sensing and Imaging)
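
The diverse-dilation idea in the DDRA block can be pictured as parallel dilated convolutions whose outputs are concatenated and projected back; the sketch below omits the channel-coordinate attention, and the rates are assumptions.

```python
import torch
import torch.nn as nn

class DiverseDilationBlock(nn.Module):
    """Parallel 3x3 convolutions with diverse dilation rates widen the receptive
    field at several scales; a 1x1 projection fuses the branches."""
    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(len(rates) * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

print(DiverseDilationBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```
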
29 pages, 3125 KiB  
Article
Tomato Leaf Disease Identification Framework FCMNet Based on Multimodal Fusion
by Siming Deng, Jiale Zhu, Yang Hu, Mingfang He and Yonglin Xia
Plants 2025, 14(15), 2329; https://doi.org/10.3390/plants14152329 - 27 Jul 2025
Abstract
Precisely recognizing diseases in tomato leaves plays a crucial role in enhancing the health, productivity, and quality of tomato crops. However, disease identification methods that rely on single-modality information often suffer from insufficient accuracy and weak generalization. Therefore, this paper proposes FCMNet, a tomato leaf disease recognition framework based on multimodal fusion, which combines tomato leaf disease images with textual descriptions to enhance the capture of disease characteristics. A Fourier-guided Attention Mechanism (FGAM) is designed, which, for the first time, systematically embeds Fourier frequency-domain information into a spatial-channel attention structure; the spectral transform enhances the stability and noise resistance of feature representations, and multi-scale fusion of local and global features enables more accurate lesion localization. To realize deep semantic interaction between the image and text modalities, a Cross Vision–Language Alignment module (CVLA) is further proposed. This module generates visual representations compatible with BERT embeddings by utilizing block segmentation and feature mapping, and incorporates a probability-based weighting mechanism for enhanced multimodal fusion, significantly strengthening the model's comprehension of semantic relationships across modalities. Furthermore, to improve both training efficiency and parameter optimization, we introduce a Multi-strategy Improved Coati Optimization Algorithm (MSCOA), which integrates Good Point Set initialization with a Golden Sine search strategy, boosting global exploration, accelerating convergence, and effectively preventing entrapment in local optima; it exhibits robust adaptability and stable performance in high-dimensional search spaces. Experimental results show that, on a self-built tomato leaf disease dataset, FCMNet improves accuracy and precision by 2.61% and 2.85% and recall and F1 score by 3.03% and 3.06%, respectively, over the baseline model, significantly outperforming existing methods. This research provides a new solution for the identification of tomato leaf diseases and has broad potential for agricultural applications. Full article
(This article belongs to the Special Issue Advances in Artificial Intelligence for Plant Research)
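
A simplified sketch of a Fourier-guided attention step, computing channel weights from FFT amplitudes so that spectral statistics rather than raw spatial detail drive the attention, is given below; this illustrates the general idea only and is not FGAM as published.

```python
import torch
import torch.nn as nn

class FourierChannelAttention(nn.Module):
    """Frequency-guided channel attention: per-channel weights come from the
    mean FFT amplitude of each feature map, which is less sensitive to
    spatially localized noise than raw pooling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        amp = torch.fft.rfft2(x, norm="ortho").abs().mean(dim=(2, 3))  # (B, C)
        return x * self.mlp(amp)[:, :, None, None]

print(FourierChannelAttention(32)(torch.randn(2, 32, 64, 64)).shape)  # (2, 32, 64, 64)
```
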
25 pages, 4296 KiB  
Article
StripSurface-YOLO: An Enhanced Yolov8n-Based Framework for Detecting Surface Defects on Strip Steel in Industrial Environments
by Haomin Li, Huanzun Zhang and Wenke Zang
Electronics 2025, 14(15), 2994; https://doi.org/10.3390/electronics14152994 - 27 Jul 2025
Abstract
Recent advances in precision manufacturing and high-end equipment technologies have imposed ever more stringent requirements on the accuracy, real-time performance, and lightweight design of online steel strip surface defect detection systems. To reconcile the persistent trade-off between detection precision and inference efficiency in complex industrial environments, this study proposes StripSurface-YOLO, a novel real-time defect detection framework built upon YOLOv8n. The core architecture integrates an Efficient Cross-Stage Local Perception module (ResGSCSP), which synergistically combines GSConv lightweight convolutions with a one-shot aggregation strategy, markedly reducing both model parameters and computational complexity. To further enhance multi-scale feature representation, this study introduces an Efficient Multi-Scale Attention (EMA) mechanism at the feature-fusion stage, enabling the network to attend more effectively to critical defect regions. Moreover, conventional nearest-neighbor upsampling is replaced by DySample, which produces deeper, high-resolution feature maps enriched with semantic content, improving both inference speed and fusion quality. To heighten sensitivity to small-scale and low-contrast defects, the model adopts Focal Loss, dynamically adjusting to sample difficulty. Extensive evaluations on the NEU-DET dataset demonstrate that StripSurface-YOLO reduces FLOPs by 11.6% and parameter count by 7.4% relative to the baseline YOLOv8n, while achieving respective improvements of 1.4%, 3.1%, 4.1%, and 3.0% in precision, recall, mAP50, and mAP50:95. Under adverse conditions, including contrast variations, brightness fluctuations, and Gaussian noise, StripSurface-YOLO outperforms the baseline model, delivering improvements of 5.0% in mAP50 and 4.7% in mAP50:95, attesting to its robust resistance to interference. These findings underscore the potential of StripSurface-YOLO to meet the rigorous performance demands of real-time surface defect detection in the metal forging industry. Full article
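
The Focal Loss the model adopts has a standard binary form (Lin et al., 2017), sketched below with the usual alpha/gamma parameters; the per-anchor wiring inside a YOLO head is not shown.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples so training
    concentrates on hard, low-contrast defects."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits, targets = torch.randn(16, 1), torch.randint(0, 2, (16, 1)).float()
print(focal_loss(logits, targets).item())
```
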
21 pages, 5527 KiB  
Article
SGNet: A Structure-Guided Network with Dual-Domain Boundary Enhancement and Semantic Fusion for Skin Lesion Segmentation
by Haijiao Yun, Qingyu Du, Ziqing Han, Mingjing Li, Le Yang, Xinyang Liu, Chao Wang and Weitian Ma
Sensors 2025, 25(15), 4652; https://doi.org/10.3390/s25154652 - 27 Jul 2025
Abstract
Segmentation of skin lesions in dermoscopic images is critical for the accurate diagnosis of skin cancers, particularly malignant melanoma, yet it is hindered by irregular lesion shapes, blurred boundaries, low contrast, and artifacts such as hair interference. Conventional deep learning methods, typically based on UNet or Transformer architectures, often fail to fully exploit lesion features and incur high computational costs, compromising precise lesion delineation. To overcome these challenges, we propose SGNet, a structure-guided network integrating a hybrid CNN–Mamba framework for robust skin lesion segmentation. SGNet employs the Visual Mamba (VMamba) encoder to efficiently extract multi-scale features, followed by the Dual-Domain Boundary Enhancer (DDBE), which refines boundary representations and suppresses noise through spatial and frequency-domain processing. The Semantic-Texture Fusion Unit (STFU) adaptively integrates low-level texture with high-level semantic features, while the Structure-Aware Guidance Module (SAGM) generates coarse segmentation maps to provide global structural guidance. The Guided Multi-Scale Refiner (GMSR) further optimizes boundary details through a multi-scale semantic attention mechanism. Comprehensive experiments on the ISIC2017, ISIC2018, and PH2 datasets demonstrate SGNet's superior performance, with average improvements of 3.30% in mean Intersection over Union (mIoU) and 1.77% in Dice Similarity Coefficient (DSC) over state-of-the-art methods. Ablation studies confirm the effectiveness of each component, highlighting SGNet's exceptional accuracy and robust generalization for computer-aided dermatological diagnosis. Full article
(This article belongs to the Section Biomedical Sensors)
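
One half of the dual-domain boundary idea, frequency-domain enhancement, can be illustrated by suppressing the lowest frequencies and blending the resulting high-pass detail back in; the cutoff and learned blend below are assumptions, not the DDBE design.

```python
import torch
import torch.nn as nn

class FrequencyBoundaryEnhancer(nn.Module):
    """Frequency-domain boundary enhancement: zero out the lowest frequencies,
    keep the high-pass (edge-carrying) residual, and blend it back into the
    input with a learned scalar weight."""
    def __init__(self, cutoff: int = 4):
        super().__init__()
        self.cutoff = cutoff
        self.blend = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        h, w = x.shape[-2:]
        cy, cx, c = h // 2, w // 2, self.cutoff
        mask = torch.ones(h, w, device=x.device)
        mask[cy - c:cy + c, cx - c:cx + c] = 0   # suppress the low-frequency block
        high = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1)),
                               norm="ortho").real
        return x + self.blend * high             # blend edge detail back in

print(FrequencyBoundaryEnhancer()(torch.randn(1, 3, 64, 64)).shape)  # (1, 3, 64, 64)
```
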