Search Results (113)

Search Parameters:
Keywords = modality consistency constraint

34 pages, 3799 KB  
Article
Simulation of 2D Shallow-Sea Acoustic Fields Using a Physics-Informed Residual Network
by Ziyue Wang, Lingyi Cong, Luotao Zhang, Shuyue Liu and Xiaobo Zhang
J. Mar. Sci. Eng. 2026, 14(13), 1154; https://doi.org/10.3390/jmse14131154 (registering DOI) - 23 Jun 2026
Abstract
Acoustic propagation in stratified shallow seas is governed by finite-depth waveguiding, impedance contrasts at the seawater–seabed interface, and coupled space–time wave dynamics. Conventional numerical solvers are accurate but often require detailed environmental priors, mesh generation, and explicit time marching, increasing the cost of simulations involving complex boundaries or repeated evaluations. This study proposes a physics-informed residual network (ResNet-PINN) for continuous simulation of two-dimensional acoustic fields in shallow-sea stratified media. The framework embeds a variable-density, variable-sound-speed acoustic pressure wave equation, initial and boundary constraints, and interface-focused collocation into network training. A Gaussian initial wave packet and temporal gating are incorporated through the output transformation to improve early-time physical consistency. The model is validated against SPECFEM2D simulations and a stratified semi-analytical modal benchmark. The results show that it captures source-region spreading, main wavefront evolution, and transmission–reflection structures near the seawater–seabed interface at an equivalent frequency of approximately 477 Hz. Supplementary tests with sloping and arched interfaces and modified boundary conditions indicate adaptability to smooth interface variations. Overall, the framework provides a physically consistent neural network strategy for continuous shallow-sea acoustic field simulation and a complementary basis for future extensions to higher-frequency propagation, more complex environments, and dynamically varying ocean conditions. Full article
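The abstract says the Gaussian initial wave packet and temporal gating enter through the output transformation but does not give the formula. The sketch below shows one plausible form, purely as an assumption: the predicted pressure blends a Gaussian pulse with the raw network output through a gate that is zero at t = 0. The names `gaussian_packet`, `gate`, `transformed_output`, and the parameters `sigma` and `tau` are hypothetical and do not come from the paper.

```python
import math

def gaussian_packet(x, z, x0=0.0, z0=0.0, sigma=1.0):
    """Gaussian initial pressure pulse centred at (x0, z0) (assumed shape)."""
    r2 = (x - x0) ** 2 + (z - z0) ** 2
    return math.exp(-r2 / (2.0 * sigma ** 2))

def gate(t, tau=1.0):
    """Smooth temporal gate: exactly 0 at t = 0, approaching 1 as t grows."""
    return 1.0 - math.exp(-(t / tau) ** 2)

def transformed_output(network, x, z, t, tau=1.0):
    """Blend the initial packet with the raw network output.

    At t = 0 the gate is 0, so the result equals the Gaussian packet
    regardless of what the network returns.
    """
    g = gate(t, tau)
    return (1.0 - g) * gaussian_packet(x, z) + g * network(x, z, t)
```

With a transformation of this kind the initial condition holds by construction, so the training loss only has to handle the wave-equation residual and the boundary and interface terms.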
26 pages, 12724 KB  
Article
A Hierarchical Semantic Consistency Constraint Framework for Hyperspectral and LiDAR Data Joint Classification
by Jie Shen, Yimeng Ma and Houqun Yang
Remote Sens. 2026, 18(12), 2058; https://doi.org/10.3390/rs18122058 (registering DOI) - 22 Jun 2026
Abstract
Hyperspectral image (HSI) and LiDAR data fusion is valuable for land-cover classification in complex surface scenes. Existing methods typically extract features from each modality independently and then consider how to fuse them, ignoring the semantic consistency between features of different modalities and across different hierarchical levels. Moreover, fully mining and exploiting the complementary information between multimodal remote sensing data remains a critical issue. To address these challenges, this paper proposes a hierarchical semantic consistency constraint (HSCC) framework for HSI and LiDAR data joint classification. The framework is co-constructed by a progressive interactive fusion network (PIFNet) and a semantic consistency constraint (SCC) strategy. Specifically, PIFNet progressively calibrates the semantic representations of multimodal features at different abstraction levels through Cross-Modal Shared Attention and Symmetric Cross-Attention mechanisms, promoting information parity in deep interactions. The SCC strategy establishes multi-level semantic associations and employs a semantic consistency constraint loss to guide the network to autonomously maintain the consistency of the same land-cover object across heterogeneous feature representations, thereby further enhancing the discriminative power of the fused features. Experiments on three public datasets, MUUFL, Houston2013, and Augsburg, demonstrate that HSCC outperforms current state-of-the-art methods, validating its effectiveness in multi-source remote sensing data fusion classification tasks. Full article
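The abstract does not define the semantic consistency constraint loss. As a rough illustration only, one common choice is to penalise the cosine distance between the HSI and LiDAR feature vectors of the same sample at each hierarchical level and average over levels. The function names and the list-of-vectors input are assumptions made for this sketch, and feature vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def consistency_loss(hsi_levels, lidar_levels):
    """Mean of (1 - cosine similarity) over paired per-level features."""
    terms = [1.0 - cosine_similarity(h, l)
             for h, l in zip(hsi_levels, lidar_levels)]
    return sum(terms) / len(terms)
```

The loss is 0 when the two modalities point in the same direction at every level and 1 when they are orthogonal.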
14 pages, 4300 KB  
Article
DeepFlare: Weakly Supervised Cross-Modality Translation and Segmentation for Immunohistochemistry and Immunofluorescence Imaging
by Md. Tamim, Aditto Rahman, Redwan Hossain, Tausib Abrar and Riasat Khan
BioMedInformatics 2026, 6(3), 37; https://doi.org/10.3390/biomedinformatics6030037 (registering DOI) - 22 Jun 2026
Abstract
Immunohistochemistry (IHC) is a widely used method for detecting specific proteins in tissue samples, helping diagnose diseases such as cancer. Traditional analysis methods rely heavily on human interpretation, which can lead to inconsistencies. In this study, we propose DeepFlare, a weakly supervised deep learning framework for cross-modality translation and segmentation of immunofluorescence and immunohistochemistry images. The proposed method utilizes multiplex immunofluorescence (mpIF) and co-registered IHC images, combined with preprocessing techniques such as affine transformation, stain normalization, noise reduction, and artifact removal. Multiple imaging channels, including hematoxylin, DAPI, Lap2, and nuclear envelope signals, are leveraged to generate segmentation masks using a U-Net++ architecture. The final segmentation mask is obtained through weighted fusion of modality-specific outputs. A generative adversarial network (GAN) is employed to measure translation fidelity between generated and real images. Weakly supervised learning techniques, including image-level supervision and consistency constraints, are applied to enhance performance under limited annotation scenarios. Pretrained pathology foundation encoders such as UNI and Virchow are integrated to extract multi-scale morphological and contextual features. Explainable AI techniques are incorporated to highlight critical regions and refine model attention. Experimental results demonstrate strong performance, achieving an SSIM of 0.7077 for image translation and a Dice score of 0.7424 for segmentation. The integration of the UNI encoder provides marginal improvement over the baseline (0.72 Dice score), indicating limited domain adaptation without fine-tuning on the dataset of 1264 training samples. Full article
(This article belongs to the Section Imaging Informatics)
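Two quantities in this abstract have simple closed forms: the weighted fusion of modality-specific outputs and the Dice score. The sketch below assumes flattened per-pixel probability maps and a fixed threshold of 0.5. The fusion weights and the threshold are illustrative assumptions, not values from the paper.

```python
def fuse_masks(prob_maps, weights, threshold=0.5):
    """Weighted average of per-modality probability maps, then threshold."""
    total = sum(weights)
    n = len(prob_maps[0])
    fused = [sum(w * m[i] for w, m in zip(weights, prob_maps)) / total
             for i in range(n)]
    return [1 if p >= threshold else 0 for p in fused]

def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if denom == 0 else 2.0 * intersection / denom
```

For example, fusing `[0.9, 0.2, 0.6]` and `[0.7, 0.4, 0.2]` with equal weights gives the mask `[1, 0, 0]`, which scores a Dice of 2/3 against the target `[1, 1, 0]`.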
35 pages, 4624 KB  
Article
MCF-YOLO: Consistency-Guided Cross-Modal Attention for Small-Object RGB-IR Detection
by Xiang Yang, Mengyue Yang and Xiaolan Xie
Sensors 2026, 26(12), 3938; https://doi.org/10.3390/s26123938 (registering DOI) - 21 Jun 2026
Viewed by 128
Abstract
In low-light, occluded, and cluttered environments, single-modality RGB detectors are prone to false positives and missed detections. While infrared (IR) imaging provides relatively stable target visibility under poor illumination, it lacks texture and color information and is susceptible to background thermal noise and imaging variations. To address these limitations, this paper proposes an RGB–IR object detection network, named MCF-YOLO, consisting of three core components. First, the Cross-Modal Hierarchical Fusion (CMHF) module performs stage-wise alignment and fusion on multi-scale features, jointly modeling RGB texture details and IR thermal responses to exploit the structural and semantic complementarity between the two modalities. Second, the Soft Attention Regularization based on Attention Prior (SAR-AP) module derives attention priors from IR features to impose soft constraints on cross-modal attention maps. This mechanism helps the network maintain attention on target-relevant regions, thereby suppressing attention drift caused by low-light noise and complex backgrounds. Third, the Small-Object-Sensitive Detection Head (SOS-Head) processes high-resolution features to strengthen the representation of small targets, improving detection capability in long-range and occluded scenarios. In evaluations on two RGB–IR benchmarks—M3FD and VEDAI—MCF-YOLO achieves improvements of 2.7% in mAP@0.5 and 1.1% in mAP@0.5:0.95 on M3FD, and 5.4% and 4.4%, respectively, on VEDAI. These results suggest that consistency-guided cross-modal fusion and high-resolution small-target modeling are beneficial for RGB–IR detection in low-visibility and cluttered scenes. Full article
(This article belongs to the Section Sensing and Imaging)
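The abstract describes SAR-AP as a soft constraint that pulls cross-modal attention maps toward a prior derived from IR features, without giving the loss. One way such a regulariser could look, assumed here for illustration, is a weighted KL divergence from the normalised IR prior to the normalised attention map. The function names, the `eps` smoothing, and the `weight` parameter are hypothetical.

```python
import math

def normalize(values, eps=1e-8):
    """Turn non-negative scores into a probability distribution."""
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]

def attention_prior_penalty(attention, ir_prior, weight=0.1):
    """weight * KL(prior || attention) over flattened spatial positions."""
    p = normalize(ir_prior)
    q = normalize(attention)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return weight * kl
```

The penalty is zero when the attention map matches the prior and grows as attention drifts toward regions the IR prior considers unimportant. Because it is added to the detection loss rather than imposed as a hard mask, the network can still deviate from the prior when the RGB evidence is strong.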
21 pages, 1456 KB  
Article
A Camera-Based Multimodal Defect Sensing Framework for Substation Equipment Monitoring via Cross-Modal Feature Mapping
by Ziquan Liu, Hai Xue, Chengbo Hu, Chao Wei and Can Zhang
Sensors 2026, 26(12), 3935; https://doi.org/10.3390/s26123935 (registering DOI) - 21 Jun 2026
Viewed by 126
Abstract
To address the limitations of vision-only defect detection, image–semantic misalignment, and spatial-logic conflicts in complex substation inspection scenarios, this paper proposes a camera-sensor-based multimodal defect sensing framework with cross-modal feature mapping for substation equipment monitoring. The proposed framework integrates field inspection images acquired by camera sensors, defect textual descriptions, and equipment topology knowledge and establishes a unified domain-adaptive pre-training–bidirectional cross-modal mapping–hierarchical reasoning workflow. First, a Contrastive Language–Image Pre-training (CLIP)-based domain-adaptive pre-training strategy is developed to enhance the representation of equipment categories, defect attributes, and inspection-scene semantics. Second, a bidirectional cross-modal feature mapping network is constructed to model fine-grained interactions between candidate visual regions and textual semantics, where uncertainty-aware fusion and prototype constraints are introduced to improve semantic alignment and defect discrimination. Third, a hierarchical neuro-symbolic reasoning module incorporates equipment topology and spatial rules for posterior verification, logical consistency checking, and false-positive suppression. Experiments on a substation inspection image dataset demonstrate that the proposed method achieves 90.8% mAP@0.5, 68.7% mAP@0.5:0.95, and 89.4% F1-score, outperforming mainstream and recent detection models. Full article
26 pages, 3157 KB  
Article
Geometric Scene Formalization in Vision-Based Educational Sensing via Multimodal Large Models
by Yanjing Cao and Lian Chen
Appl. Sci. 2026, 16(12), 6172; https://doi.org/10.3390/app16126172 - 18 Jun 2026
Viewed by 135
Abstract
This paper studies geometric scene formalization in vision-based educational sensing environments, where textual conditions and geometric diagram images jointly constitute heterogeneous perceptual inputs. The goal is to convert multimodal sensed information into standardized formal representations for machine understandable educational analysis. Existing methods remain limited by unstable cross modal alignment, inadequate expression of geometric relational constraints, and insufficient verifiability of generated outputs. To overcome these challenges, a unified modeling framework is proposed based on multimodal large models with structure-aware prompting and verification feedback. A geometry-oriented structure prompt injection mechanism is first introduced to encode prior cues of geometric entities, relational patterns, and constraint dependencies, which enhances the intrinsic alignment among textual descriptions, visually sensed diagram regions, and formal symbolic representations. In addition, an external verification feedback strategy is employed to constrain and iteratively refine the initial outputs, thereby improving structural consistency, syntactic correctness, and target proposition accuracy. To support this task, a new vision-based multimodal geometry formalization dataset is further constructed for model training and evaluation. Extensive experiments show that the proposed method can more effectively accomplish the transformation from multimodal sensed educational inputs to executable formal expressions, while also demonstrating stronger robustness and reliability in complex visual conditions. These results indicate that the proposed framework offers a feasible solution for structured scene interpretation, automatic problem analysis, error diagnosis, and intelligent feedback in vision-based educational systems. Full article
22 pages, 6454 KB  
Article
Research on Multimodal Sentiment Analysis Based on Bidirectional Cross-Modal Interaction and Text-Guided Fusion
by Junhao Wu and Xizhong Shen
Electronics 2026, 15(12), 2685; https://doi.org/10.3390/electronics15122685 - 17 Jun 2026
Viewed by 168
Abstract
Multimodal sentiment analysis (MSA) has become a key research area in artificial intelligence, aiming to predict sentiment polarity or intensity by jointly modeling text, audio, and visual information. However, traditional methods still face significant challenges due to inherent heterogeneity among modalities, semantic representation discrepancies, and insufficient cross-modal interaction. To address these issues, this paper proposes a multimodal sentiment classification model that integrates bidirectional cross-modal attention with multi-level constraint optimization. Specifically, a unified multimodal feature encoding (UMFE) module combining BiLSTM and transformer architectures is first constructed to jointly model and extract robust unimodal representations from text, audio, and visual modalities, thereby enhancing both robustness and discriminative ability. On this basis, we introduce a bidirectional cross-modal attention mechanism, which performs Query–Key attention between modalities, enabling each modality to selectively aggregate complementary information and capture cross-modal semantic dependencies. Furthermore, a cross-modal re-fusion transformer (HMRT) module treats the textual modality as dominant to guide the deep fusion of high-level semantic features after cross-modal interaction, producing a compact unified representation. Finally, a multi-task joint optimization framework with uncertainty-based adaptive weighting dynamically balances unimodal supervision loss, cross-modal consistency loss, and sentiment classification loss, which helps improve representation learning and generalization ability. Full article
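The uncertainty-based adaptive weighting mentioned at the end of the abstract is not specified further. A widely used formulation, shown here as an assumption about what such a scheme may look like, gives each task a learnable log-variance s_i and minimises the sum of exp(-s_i) * L_i + s_i. In practice the `log_vars` would be trainable parameters. Here they are plain floats.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses with learnable log-variance weights.

    Each term exp(-s) * loss down-weights a noisy task, while the
    additive + s keeps s from growing without bound.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))
```

With all log-variances at zero this reduces to the plain sum of the unimodal, consistency, and classification losses. Raising one task's log-variance shrinks that task's contribution.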
33 pages, 3372 KB  
Article
A Genomics-Guided Multimodal Contrastive Learning Framework for Clinically Significant Prostate Cancer Risk Stratification with Missing Clinical Data
by Abdullah, Muhammad Shahid, Muhammad Ateeb Ather, Zulaikha Fatima, Carlos Guzmán Sánchez Mejorada, Miguel Jesús Torres Ruiz, Rolando Quintero Téllez, Miguel Félix Mata-Rivera and Roberto Zagal-Flores
Cancers 2026, 18(12), 1952; https://doi.org/10.3390/cancers18121952 - 16 Jun 2026
Viewed by 238
Abstract
Background: Heterogeneous data integration remains a major challenge in intelligent information systems, particularly under missing-modality and cross-domain conditions. Existing multimodal fusion approaches often rely on complete datasets and weak alignment mechanisms, limiting their robustness and practical applicability. Objectives: This study aims to develop and evaluate a genomics-guided multimodal representation learning framework that enables robust heterogeneous data fusion, reliable cross-modal correspondence, and accurate prediction under incomplete-data conditions. Methods: We propose a multimodal learning architecture that models genomics as the primary biological anchor and learns conditional projections to imaging modalities, including multiparametric MRI and whole-slide histopathology (WSI). The framework formulates multimodal fusion as a genomics-guided contrastive learning problem, incorporates domain-specific optimization constraints, and learns a latent shared-state representation to support inference without requiring fully paired datasets. Evaluation was conducted using public datasets, including TCGA-PRAD and TCIA, across low-risk versus higher-risk/clinically significant prostate cancer (csPCa) discrimination, Gleason-based risk stratification, and clinically significant outcome prediction tasks under realistic multimodal and missing-modality scenarios. Results: In the adequately powered Genomics+WSI cohort (n = 486), the framework achieved an AUROC of 0.985 ± 0.005 for low-risk versus higher-risk/csPCa discrimination (p < 0.001). Exploratory analysis in a small, matched Genomics+MRI cohort (n = 28) yielded an AUROC of 0.980 ± 0.006 for the same endpoint; these findings are reported descriptively with bootstrap confidence intervals due to limited sample size. 
Because the negative reference group consisted of low-risk prostate cancer cases rather than cancer-free controls, results are interpreted as within-cancer risk discrimination rather than de novo cancer detection. The framework achieved weighted accuracy up to 92.1%, Cohen’s κ up to 0.86, and reduced critical decision errors by 58%. Calibration remained strong (ECE 0.021–0.024), and decision-curve analysis indicated improved utility with reduced unnecessary invasive workups in retrospective modeling. Robustness analysis demonstrated AUROC degradation below 0.04 under domain shifts. Single-modality inference using genomics alone maintained AUROC > 0.90. Interpretability analysis revealed feature attributions aligned with domain-relevant genomic markers. Conclusions: The proposed framework provides a scalable and generalizable solution for heterogeneous multimodal data fusion, supporting reliable prediction, robustness to missing modalities, and applicability to complex information systems beyond the studied domain. Full article
(This article belongs to the Section Molecular Cancer Biology)
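The calibration figures in the abstract are expected calibration errors (ECE). As a reminder of what that metric measures, the sketch below computes the standard equal-width binned ECE from predicted confidences and 0/1 correctness flags. The bin count of 10 is an assumption, since the abstract does not say how the authors binned their predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

Four predictions made at 0.95 confidence with three correct give an ECE of about 0.2, so values in the 0.02 range indicate that stated confidence and observed accuracy differ by roughly two percentage points on average.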
25 pages, 3097 KB  
Article
Boosting RGB-D Pear Detection via Depth-Constraint Enhanced Gaussian Prior
by Feng Ling, Yunfeng Lin, Weijie Mao, Weizhong Xu and Wenzheng Xiao
Plants 2026, 15(12), 1852; https://doi.org/10.3390/plants15121852 - 15 Jun 2026
Viewed by 131
Abstract
Accurate pear detection in complex orchard environments is essential for automated harvesting, yet it remains challenging due to frequent occlusion, overlapping fruits, cluttered backgrounds, and highly variable illumination. Although RGB-D sensing provides complementary geometric information beyond RGB imagery, existing methods often fail to fully exploit depth cues and rarely account for the inherently elliptical shape of pears. To address these issues, we propose a multimodal pear detection framework that jointly models RGB and depth information using a Siamese convolutional backbone and a unified Transformer-based fusion architecture. The proposed method contains three key components. First, Gaussian Prior Boxes are introduced to represent pear instances with Gaussian-shaped priors, enabling better alignment with pear contours and more precise localization than conventional rectangular boxes. Second, a Depth-Aware Constraint is designed to enforce depth consistency within the predicted regions, which improves robustness in cluttered orchard scenes. Third, a Robust Cross-Modal Token Exchange strategy is incorporated during training to strengthen feature interaction between RGB and depth modalities and reduce over-reliance on any single modality. Extensive experiments on pear detection demonstrate that the proposed method achieves an AP50 of 0.961, a precision of 0.941, a recall of 0.951, and an F1-score of 0.942. Compared with a strong recent YOLOv8-l RGB baseline (AP50 = 0.918) and a YOLOv8-l RGB-D variant (AP50 = 0.932) trained on the same dataset, our framework yields a notable improvement of +4.3 and +2.9 AP50, respectively. We further validate generalization on the publicly available KFuji RGB-DS apple dataset, where MMGFormer attains AP50 = 0.927, exceeding the previously reported state-of-the-art (AP50 = 0.901). In addition, the model runs at 41.2 FPS, indicating a favorable balance between detection accuracy and real-time performance. 
These results show the potential of the proposed framework for practical deployment in automated pear harvesting systems.
(This article belongs to the Special Issue Advances in Artificial Intelligence for Plant Research—2nd Edition)
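The Depth-Aware Constraint is described only as enforcing depth consistency within predicted regions. One simple reading, offered here as an assumption rather than the authors' formulation, is to penalise the variance of depth values sampled inside a predicted region, since a single fruit should occupy a narrow depth range while a box that straddles fruit and background will not.

```python
def depth_consistency_penalty(depths):
    """Variance of the depth samples inside one predicted region."""
    mean = sum(depths) / len(depths)
    return sum((d - mean) ** 2 for d in depths) / len(depths)
```

A region with uniform depth incurs no penalty, while one mixing samples at 1.0 m and 3.0 m incurs a penalty of 1.0.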
19 pages, 2430 KB  
Article
LMFusion: Breaking the Computational Barrier for Multimodal Classification in Remote Sensing
by Shenbo Zhou, Sibo He, Daixun Li, Weiying Xie and Yunsong Li
Remote Sens. 2026, 18(12), 1972; https://doi.org/10.3390/rs18121972 - 13 Jun 2026
Viewed by 139
Abstract
Multi-modal land cover classification plays an important role in remote sensing applications such as urban monitoring and environmental analysis. By integrating complementary information from hyperspectral imagery (HSI) and LiDAR data, multimodal learning can significantly improve classification performance. However, existing Transformer-based fusion methods often suffer from high computational complexity and inefficient cross-modal interaction modeling, which limits their applicability in resource-constrained scenarios. To address these challenges, we propose LMFusion, an efficient framework for multimodal feature learning. Specifically, LMFusion enables efficient bidirectional feature interaction through a linear-complexity cross-attention mechanism and enhances long-range spatial-spectral representation learning with Mamba-based state space modeling, thereby achieving effective multimodal dependency modeling with linear computational complexity. In addition, a selective quantization-aware optimization strategy is introduced to support multiple bit-width settings (down to 1-bit), yielding a more compact and efficient model while improving representation robustness under low-bit constraints. Extensive experiments on the Houston2013, MUUFL, and Augsburg datasets demonstrate the effectiveness of LMFusion. It achieves overall accuracies of 95.84%, 94.95%, and 99.05%, respectively, consistently outperforming representative multimodal classification methods and showing strong potential for accurate and efficient multimodal remote sensing classification. Full article
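The abstract mentions bit widths down to 1-bit but not the quantiser itself. For orientation, a common 1-bit scheme replaces each weight with its sign multiplied by the mean absolute weight of the tensor. The sketch below shows that generic scheme. It is not the paper's selective quantization-aware strategy, and the function name is hypothetical.

```python
def binarize(weights):
    """1-bit quantisation: sign(w) scaled by the mean absolute weight."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]
```

Only the signs and one floating-point scale need to be stored per tensor, which is where the memory saving at 1-bit comes from.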
17 pages, 2212 KB  
Article
Robust Manipulation of Randomly Stacked Jenga Blocks via a Strategy-Driven Framework Using a Single RGB-D Sensor
by Dongwoon Song, Yeri Park, Minseong Jo, Wonje Hwang, Gijae Ahn and Seung-Joon Yi
Sensors 2026, 26(12), 3767; https://doi.org/10.3390/s26123767 - 12 Jun 2026
Viewed by 273
Abstract
Robust manipulation of small, densely stacked objects remains a challenging problem due to severe occlusions and geometric ambiguities, particularly under single-view sensing conditions. When observed using a single RGB-D sensor, adjacent surfaces of featureless cuboid objects, such as Jenga blocks, often merge in depth measurements, making reliable instance separation and pose estimation difficult. This paper presents a strategy-driven perception and manipulation framework for the robotic rearrangement of randomly stacked Jenga blocks under single RGB-D sensor constraints. The proposed approach employs a heightmap-based perception pipeline that integrates color filtering with geometric reasoning to segment individual blocks and estimate manipulation-compatible poses. Beyond perception, the proposed system determines robot actions through a structured manipulation policy consisting of region-wise search for directly executable grasps, grasp candidate evaluation based on accessibility and collision risk, selective local regrasping for workspace reconfiguration, and placement mode selection between direct insertion and sliding-assisted placement. In this framework, controlled grasp-and-release actions are applied only when no directly executable candidate is found within the currently scanned region and a suitable recovery target can be identified, thereby transforming cluttered local arrangements into more executable states without requiring additional sensing modalities. Experimental results, conducted under competition-equivalent conditions, demonstrate a high task success rate of 99.02%, confirming the robustness and reliability of the proposed framework. The results show that strategy-driven manipulation can effectively compensate for perception limitations in single RGB-D sensor environments, enabling stable and efficient pick-and-place operations in dense clutter. Full article
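The heightmap-based perception step can be pictured as projecting the RGB-D point cloud onto a top-down grid and keeping the highest point per cell. The sketch below is a minimal version under assumed conventions: points are (x, y, z) tuples in the workspace frame, and the grid origin sits at (0, 0). The cell size, grid dimensions, and function name are illustrative and not taken from the paper.

```python
import math

def build_heightmap(points, cell=1.0, width=4, height=4):
    """Top-down grid holding the maximum z seen in each (row, col) cell."""
    grid = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        col = math.floor(x / cell)
        row = math.floor(y / cell)
        # Points outside the workspace grid are ignored.
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = max(grid[row][col], z)
    return grid
```

Segmentation and pose estimation would then operate on this 2D grid, for instance by combining height discontinuities with colour cues to separate adjacent blocks.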
32 pages, 7334 KB  
Article
Text Semantic Guided Spatial–Frequency Fusion Network for HSI–LiDAR Land-Cover Classification
by Aili Wang, Manman Yao, Haoran Lv and Haisong Chen
Remote Sens. 2026, 18(12), 1957; https://doi.org/10.3390/rs18121957 - 12 Jun 2026
Viewed by 205
Abstract
Joint classification of hyperspectral images (HSI) and light detection and ranging (LiDAR) data is important for land-cover recognition, as it can exploit both spectral discrimination and structural elevation information. However, existing methods mainly focus on visual feature fusion and insufficiently utilize class-level semantic priors, which limits their discriminative capability in complex boundaries, visually similar categories, and limited-sample scenarios. To address these issues, this paper proposes a text-guided multimodal semantic fusion network for HSI–LiDAR classification. Specifically, a Channel-Modulated Mobile Convolution Module (CMMC) is designed to extract modality-specific features, a Spatial–Frequency Feature Enhancement Module (SFFE) is introduced to enhance spatial-boundary and frequency-domain structural representations, and a Bidirectional Cross-Modal Fusion Module (BCMF) is developed to promote complementary interaction between spectral and structural information. Meanwhile, class-level textual descriptions are constructed from class names, color attributes, and geographical contexts, and a text encoder is employed to obtain semantic prototypes. Furthermore, a multi-branch vision–text semantic alignment mechanism projects HSI features, LiDAR features, and fused features into a shared semantic space for joint constraints, improving semantic consistency and class separability. Experiments on the Houston2013, Augsburg, and Trento datasets demonstrate the effectiveness of the proposed method. It achieves an overall accuracy of 98.76% on Houston2013, with improvements of 0.62%, 0.52%, and 0.67 in overall accuracy, average accuracy, and Kappa coefficient × 100 over the best competing results, respectively. The proposed method also obtains the best overall metrics on Augsburg and Trento, and ablation studies verify the effectiveness of the proposed components. Full article
34 pages, 3907 KB  
Systematic Review
Meta-Learning in Land Use and Land Cover Classification: Review and Perspective
by Wei He, Lianfa Li, Haoxiong Wu, Xilin Gao, Yichen Yang, Zixuan Zhang, Xiaomei Yang and Yong Ge
Remote Sens. 2026, 18(12), 1879; https://doi.org/10.3390/rs18121879 - 7 Jun 2026
Viewed by 349
Abstract
Deep learning has exhibited potential in land use and land cover (LULC) classification applications. However, the effectiveness of deep learning remains constrained by the availability and quality of annotated training data. The persistent scarcity of labeled samples and spatial heterogeneity of remote sensing imagery hinder the robustness and generalization of trained models. Meta-learning, commonly referred to as “learning to learn”, is a paradigm that trains models over a distribution of tasks to acquire transferable knowledge, enabling rapid adaptation to new tasks with only a few labeled samples. This cross-task learning capability makes meta-learning a promising solution to data scarcity and spatial heterogeneity in the remote sensing context. This paper provides a systematic review of meta-learning applications in LULC classification, identifying a total of 70 relevant studies between 2018 and 2025. Three mainstream meta-learning paradigms (memory-augmented, optimization-based, and metric-based) are reviewed, and the applications are analyzed across four core challenges in LULC remote sensing: label scarcity, cross-region and cross-domain distribution shifts, temporal dynamics modeling, and multimodal data integration. The review reveals that optimization-based and metric-based methods dominate current research, with MAML and its variants being the most widely adopted due to the model-agnostic property, while memory-augmented methods remain underexplored. A consistent finding is that meta-learning outperforms conventional pre-training followed by fine-tuning under significant domain shifts across multiple data modalities. Current limitations, including computational overhead, episodic training constraints, and the lack of standardized evaluation protocols, are discussed. Future directions in cross-domain generalization, integration with foundation models, novel architectures, and standardized benchmarks are identified. Full article
(This article belongs to the Section AI Remote Sensing)
21 pages, 2624 KB  
Article
Enhancing Fashion Retrieval with Constraint Verification
by Tina Aminian and Jessica Chen
Algorithms 2026, 19(6), 462; https://doi.org/10.3390/a19060462 - 6 Jun 2026
Viewed by 229
Abstract
Composed Image Retrieval (CIR) aims to search a target database for images that best align with a user’s intent, conditioned on a reference image paired with modification requirements. Existing CIR architectures typically treat the visual reference and the textual modification as symmetrical inputs, fusing their features into a shared latent embedding space. From a user-centric perspective, however, these multi-modal inputs serve fundamentally asymmetric roles: the reference image acts as a soft semantic anchor, whereas the modification text functions as an explicit requirement specifying precise visual changes. Because current models optimize composition predominantly at a global representation level, these non-negotiable logical constraints are frequently violated during inference, leading to retrieval results that fail to satisfy the user’s explicit instructions. To mitigate this limitation, we introduce a novel, training-free verification framework for fashion retrieval that enforces textual constraint adherence without sacrificing the expressive flexibility of open-vocabulary natural language. Our approach leverages schema-conditioned large language models to extract explicit, structured logical constraints from raw queries during post-processing. A downstream vision-language agent subsequently verifies these constraints against the top retrieved candidate pool to penalize non-compliant images and optimize candidate ordering. Extensive evaluations across standard fashion benchmarks demonstrate that our plug-in framework consistently and significantly enhances the recall metrics of state-of-the-art supervised and zero-shot CIR baselines.
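The re-ranking step described in this abstract (penalizing non-compliant images within the top retrieved pool) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' method: the `Candidate` type, the `rerank` function, the linear per-violation penalty, and the `penalty` and `top_n` parameters are all hypothetical, and the count of violated constraints is assumed to come from some external verifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one retrieved image."""
    image_id: str
    score: float      # similarity score from the base retrieval model
    violations: int   # constraints an external verifier judged unmet (assumed input)


def rerank(candidates: List[Candidate],
           penalty: float = 0.1,
           top_n: Optional[int] = None) -> List[Candidate]:
    """Re-order the first top_n candidates by penalized score.

    The penalized score is `score - penalty * violations` (an assumed,
    purely illustrative rule). Candidates beyond top_n keep their order.
    """
    n = len(candidates) if top_n is None else min(top_n, len(candidates))
    head = sorted(candidates[:n],
                  key=lambda c: c.score - penalty * c.violations,
                  reverse=True)
    return head + candidates[n:]


# Usage: "a" has the best raw score but violates two constraints.
pool = [Candidate("a", 0.90, 2), Candidate("b", 0.85, 0), Candidate("c", 0.50, 0)]
reranked_ids = [c.image_id for c in rerank(pool, penalty=0.1, top_n=2)]
```

Restricting the re-ordering to the first `top_n` entries mirrors the abstract's statement that verification is applied to the top retrieved candidate pool and not to the whole database.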
20 pages, 2070 KB  
Article
Temporal-Enhanced and Visual-Text Adaptive Fusion for Weakly Supervised Video Anomaly Detection in Public Safety
by Jin Si, Qifen Dong and Xue Yang
J. Imaging 2026, 12(6), 249; https://doi.org/10.3390/jimaging12060249 - 6 Jun 2026
Viewed by 258
Abstract
In the realm of public safety, the automated identification of potential threats from voluminous surveillance streams is pivotal for developing intelligent security systems. Manual monitoring of such massive video feeds is highly inefficient, prone to human fatigue, and often leads to missed detections or false alarms. Leveraging deep learning for automatic anomaly detection is therefore essential to improve response efficiency and mitigate security risks. Weakly supervised video anomaly detection (WS-VAD) has emerged as a critical yet challenging task in this domain. In this study, we propose the Temporal-Enhanced and Visual-Text Adaptive Fusion (TE-VTAF) model for robust WS-VAD. Specifically, a Dynamic Local–Global Temporal Adaptive Module (DLG-TAM) is designed to capture multi-scale temporal dependencies and extract high-level video semantics. Concurrently, a Visual-Text Adaptive Fusion Module (VTAFM) is introduced to aggregate complementary cross-modal features, utilizing a competitive activation mechanism to suppress redundant information and enhance the discriminative power between normal and anomalous events. To further refine the learning process within the Multiple Instance Learning (MIL) framework, we incorporate a Top-K outer bag loss and a K-maxmin inner bag loss. These constraints effectively maximize the inter-class separability while suppressing label noise from normal instances within positive bags, thereby bolstering the detector’s robustness. Extensive experiments demonstrate that the proposed TE-VTAF consistently outperforms state-of-the-art methods on two large-scale benchmarks, achieving an AUC of 88.93% on UCF-Crime and an AP of 85.62% on XD-Violence.
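For readers unfamiliar with Top-K pooling in Multiple Instance Learning, the sketch below shows the generic idea: a video (bag) is scored by averaging its k highest snippet (instance) scores, and that bag score is trained against the video-level label. This is a common textbook formulation and only an assumption about what a "Top-K" bag loss might resemble. It is not the paper's Top-K outer bag loss or its K-maxmin inner bag loss, whose definitions are not given in the abstract. The function names and the binary cross-entropy choice are hypothetical.

```python
import math
from typing import Sequence


def topk_bag_score(instance_scores: Sequence[float], k: int) -> float:
    """Mean of the k largest instance scores (k is clamped to the bag size)."""
    if not instance_scores:
        raise ValueError("bag must contain at least one instance")
    k = max(1, min(k, len(instance_scores)))
    top = sorted(instance_scores, reverse=True)[:k]
    return sum(top) / k


def topk_bag_bce(instance_scores: Sequence[float], label: int, k: int,
                 eps: float = 1e-7) -> float:
    """Binary cross-entropy between the Top-K bag score and the bag label.

    Assumes instance scores are probabilities in [0, 1]; label is 1 for an
    anomalous video and 0 for a normal one.
    """
    p = min(max(topk_bag_score(instance_scores, k), eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Usage: a bag with two high-scoring snippets among mostly normal ones.
scores = [0.1, 0.9, 0.8, 0.2]
bag_score = topk_bag_score(scores, k=2)
loss_if_anomalous = topk_bag_bce(scores, label=1, k=2)
loss_if_normal = topk_bag_bce(scores, label=0, k=2)
```

Averaging over k instances instead of taking only the maximum makes the bag score less sensitive to a single noisy snippet, which is the usual motivation for Top-K pooling in weakly supervised settings.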
(This article belongs to the Section Computer Vision and Pattern Recognition)