Search Results (406)

Search Parameters:
Keywords = vision foundation models

18 pages, 2592 KB  
Article
Knee Osteoarthritis Severity Grading Using Contrastive Learning Image Pre-Training
by Sedigh Abdalla Bashir, Rabeeah S. Altarhouni, Mohamed Burid Milad, Fauzia Ali Abuhtna, Mansor Masaud Wafi, Ellafi. A. Elbahri, Esam Alsadiq Alshareef, Mohammad Khaleel Sallam Ma’aitah, Esraa Alsariera and Ainur Toigozhinova
J. Pers. Med. 2026, 16(6), 314; https://doi.org/10.3390/jpm16060314 - 12 Jun 2026
Viewed by 159
Abstract
Background/Objectives: Accurate evaluation of knee osteoarthritis (KOA) severity is critical for optimal patient care, yet manual radiographic grading remains subject to observer variability. This study aims to evaluate the performance of a fine-tuned contrastive language–image pre-training (CLIP) framework designed to assist clinicians in grading KOA severity in plain radiographs using the Kellgren–Lawrence (KL) classification system (Grades 0–4). Methods: The model operates by projecting visual features from radiographs and standard textual clinical descriptions into a shared embedding space. Training was conducted using 8260 posterior–anterior (PA) fixed-flexion X-ray images from the Osteoarthritis Initiative (OAI) dataset. For robust external evaluation across distinct data distributions, the model was tested on an independent dataset consisting of 1650 plain radiographs. Results: When evaluated on the external validation dataset, the fine-tuned CLIP model achieved an accuracy of 76.94% and an F1-score of 76.66%. Comparative analysis demonstrates that these aligned vision-language representations provide competitive, stable diagnostic capabilities even when applied to an entirely independent data distribution. Conclusions: Fine-tuned CLIP architectures offer a viable and valuable foundation for semantically transparent, computer-aided evaluation of KOA. Full article

21 pages, 21987 KB  
Article
A Spatial Distribution Probability-Guided Detection Framework for Underwater Sonar Imagery
by Dayu Jia, Yan Huang, Jianan Qiao, Zhenyu Wang, Hao Feng and Jiancheng Yu
Remote Sens. 2026, 18(12), 1906; https://doi.org/10.3390/rs18121906 - 9 Jun 2026
Viewed by 135
Abstract
Underwater target detection via side-scan sonar is vital for defense and economy but hindered by sparse targets, high data costs, and feature extraction difficulties due to textureless acoustic data and limited samples. To overcome these limitations, particularly for few-shot, small-object detection, we propose a Spatial Distribution Probability-Guided Detection Framework to aid Unmanned Underwater Vehicles (UUVs) in precise localization and clustering. The framework features a novel module that leverages a pre-trained Vision Foundation Model (DINOv3) to generate spatial distribution probability maps, guiding a Transformer-based network for accurate detection with scarce data. Additionally, it incorporates a Target Position Calculation Module and a DBSCAN-based post-processing module to determine global geographic coordinates and cluster discrete points, respectively. Experiments were conducted on both a Public Mine Detection Dataset and a self-collected dataset containing simulated mines and buoys. Ablation studies and comparison experiments demonstrated that the proposed guidance mechanism significantly improves detection performance. Furthermore, two comb-search missions verified that the system could accurately locate and cluster targets, distinguishing real targets from false detections (noise). These results confirm the framework’s efficacy in enabling high-precision perception and autonomous operations for complex underwater inspection tasks. Full article

35 pages, 1263 KB  
Systematic Review
Advances in Artificial Intelligence-Enabled Crop Pest and Disease Detection: A Systematic Review
by Zhen Ma, Cundeng Wang, Xinzhong Wang and Xuegeng Chen
Agriculture 2026, 16(12), 1262; https://doi.org/10.3390/agriculture16121262 - 7 Jun 2026
Viewed by 444
Abstract
The detection technology of crop diseases and pests is transitioning from single-sensor monitoring to intelligent perception and multimodal fusion. This paper follows the PRISMA 2020 standard and systematically reviews the relevant core literature. It summarizes the development history of spectral sensing technology and analyzes the physical mechanisms of hyperspectral and multispectral imaging in early identification of crop diseases. The focus is on the architectural evolution of deep learning models, including lightweight convolutional neural networks (CNNs), vision transformers (ViTs) with long-range dependency modeling capabilities, and the computationally efficient state space model Mamba. In addition, the research progress of spatial–spectral joint learning, heterogeneous data fusion, and vision-language models (VLMs) in improving system robustness and interpretability is introduced. By synthesizing the integrated applications of UAV remote sensing, Internet of Things (IoT) edge computing and intelligent robots in staple and cash crops, this paper summarizes the implementation of the integrated system of perception, decision-making and execution. To address the issues of insufficient cross-domain generalization ability and uneven allocation of computing resources in existing models, this paper provides perspectives on the future development of agricultural artificial intelligence (AI) towards foundation-model-driven, edge-intelligent collaborative, and green, sustainable directions, which can provide a theoretical reference for engineering applications in the field of intelligent plant protection. Full article
(This article belongs to the Section Crop Protection, Diseases, Pests and Weeds)

20 pages, 10192 KB  
Article
Leaf Image Segmentation in Urochloa Pastures: A Comparative Analysis of Preprocessing Strategies Using Smartphone Imagery
by Isabel Felizardo Chambingo, Matheus de Godoi Bertin, Wilson Manuel Castro Silupu, Murilo Mesquita Baesso, Lilian Elgalise Techio Pereira and Adriano Rogério Bruno Tech
AgriEngineering 2026, 8(6), 232; https://doi.org/10.3390/agriengineering8060232 - 7 Jun 2026
Viewed by 209
Abstract
Smartphone-based proximal sensing has emerged as a promising low-cost approach for pasture monitoring. A critical component of this methodology is accurate leaf segmentation, as it directly affects the reliability of subsequent image-based analyses. Despite advances in computer vision, the role of preprocessing strategies in segmentation performance remains insufficiently explored, particularly under resource-constrained conditions. This study presents a systematic comparative evaluation of three preprocessing pipelines based on HSV and CIELab color spaces for the segmentation of Urochloa grass leaves (Urochloa hybrid Mavuno and Urochloa decumbens) using smartphone imagery acquired under field conditions. The pipelines were assessed using a multi-criteria framework, including the Fisher Discriminant Ratio (FDR), Intersection over Union (IoU), Overlap Error (OE), Structural Similarity Index (SSIM), and Edge Preservation Index (EPI), complemented by discordance map analysis. The results demonstrate that preprocessing design significantly influences segmentation stability, boundary preservation, and robustness to illumination variability. Pipelines based on HSV channels showed high sensitivity to shadows and non-uniform lighting, leading to reduced segmentation consistency. In contrast, the CIELab-based pipeline relying on the a* channel achieved superior performance, with higher discriminative capacity, improved edge preservation, and lower computational cost. These findings highlight that carefully designed classical preprocessing strategies remain highly effective for low-cost, real-time applications, even in the absence of computationally intensive models. This work establishes a robust segmentation foundation for future integration with advanced analytical methods, including machine learning approaches, and supports the development of scalable smartphone-based tools for pasture monitoring. Full article

20 pages, 6566 KB  
Communication
Consistency-Guided Distillation from Vision Foundation Models for Zero-Shot Airborne Point Cloud Segmentation
by Yuan Gao, Jindong Zhao, Shaobo Xia, Sheng Nie, Cheng Wang and Xiaohuan Xi
Remote Sens. 2026, 18(12), 1875; https://doi.org/10.3390/rs18121875 - 6 Jun 2026
Viewed by 175
Abstract
Semantic segmentation of large-scale airborne point clouds traditionally relies on labor-intensive 3D manual annotations. While recent zero-shot methods attempt to alleviate this burden by distilling knowledge from 2D Vision–Language Models (VLMs) via 2D-to-3D projection, they suffer from performance degradation in complex urban environments. Specifically, lacking 3D geometric awareness, 2D VLMs frequently exhibit “semantic bleeding”, where large-scale background categories (e.g., ground) erroneously submerge small-scale targets (e.g., vehicles and street elements). To address this issue, we propose a geometry-constrained pseudo-label generation and purification framework. Our approach tackles the problem through a dual-branch design: extracting open-vocabulary semantics via SAM3-based multi-view projection while simultaneously deriving sharp, class-agnostic instances using SAM2 on Gamma-transformed elevation maps. By introducing a geometric–semantic consistency module, we evaluate the internal semantic purity and external spatial homogeneity of these instances, detecting and filtering out semantic misclassifications. The purified pseudo-labels are then used to supervise a 3D sparse convolutional network via a Masked Cross-Entropy Loss. Experiments on the H3D and Turin3D datasets demonstrate that our method recovers small-scale targets that are prone to being submerged, outperforming existing zero-shot baselines by improving mIoU from 52.15% to 63.45% on H3D and from 29.52% to 58.51% on Turin3D, thereby narrowing the performance gap with fully-supervised approaches. Full article
(This article belongs to the Section AI Remote Sensing)

21 pages, 4497 KB  
Article
Growth at What Cost? Energy Use, Investment, and Emissions in the Saudi Economy
by Uzma Khan and Aarif Mohammad Khan
Economies 2026, 14(6), 208; https://doi.org/10.3390/economies14060208 - 4 Jun 2026
Viewed by 237
Abstract
This study examines the long-run and distributional determinants of CO2 emissions in Saudi Arabia (1970–2021) by integrating economic growth, energy consumption, foreign direct investment (FDI), natural resource rents, and urbanisation within a unified framework. Johansen cointegration, Fully Modified Ordinary Least Squares (FMOLS), Dynamic Ordinary Least Squares (DOLS), Canonical Cointegrating Regression (CCR), and a Vector Error Correction Model (VECM) establish long-run relationships and causality; quantile regression identifies distributional heterogeneity. The principal novel finding is a countercyclical, regime-dependent mitigation role for natural resource rents: rents exert no significant effect at low-emission quantiles but generate negative effects from the median quantile onward (−0.04 to −0.06), precisely when emissions and oil revenues are simultaneously elevated. This distributional asymmetry, invisible to mean-based estimators, implies that hydrocarbon revenues provide a high-regime fiscal buffer for environmental investment. Aggregate FDI is environmentally neutral across all specifications, indicating the technique effect operates through fiscal channels rather than investment channels. Energy consumption drives emissions with near-unity elasticity, confirming carbon lock-in, and economic growth shows no decoupling. These findings provide quantitative foundations for fiscal rules linking oil revenue windfalls to green investment under Vision 2030. Full article

59 pages, 6209 KB  
Review
Deep Human Pose Estimation: A Conceptual Review of Paradigms, Progress, and Frontiers
by Kassim B. Diallo and Moulay A. Akhloufi
Computers 2026, 15(6), 366; https://doi.org/10.3390/computers15060366 - 4 Jun 2026
Viewed by 219
Abstract
Pose estimation is a major problem in computer vision, enabling the direct transformation of an input image into a hierarchical representation of the human skeleton for applications in virtual/augmented reality and human–machine interaction tasks. Research in this field expanded rapidly between 2018 and 2025, and traditional taxonomies such as 2D versus 3D or top-down versus bottom-up are no longer sufficient to capture the essence of the evolution of ideas. To solve this problem, we propose a conceptual review in the field of pose estimation, focusing on the intellectual evolution of methods and architecture rather than the standard flat classifications of papers. We divide recent advances into five structural pillars: Representation, which traces the evolution from pixel coordinate regression to heatmaps and probabilistic representation; Architecture, which analyzes the transition from multi-stage CNNs to transformers and state space models (SSMs); Ambiguity and Generalization, which analyzes how self-supervised, uncertainty-aware, and diffusion models address 3D depth ambiguity, occlusion, and domain gaps by modeling multiple plausible poses and reducing dependence on fully supervised in-the-wild 3D labels; Context Extension, which covers temporal dynamics, multi-view fusion, and potential sensors; and Applications, which links algorithms to efficiency, privacy, and foundation models. By detailing these pillars in depth, we provide a unified view of the evolution of research paradigms that define human pose estimation and enable the identification of future problems and solutions in pose estimation and human-centered tasks. Full article

25 pages, 366 KB  
Article
A National Framework for Giftedness: The Mawhiba Model for Identification and Nurturing Gifted Students
by Fahad S. Alfaiz
Educ. Sci. 2026, 16(6), 884; https://doi.org/10.3390/educsci16060884 - 3 Jun 2026
Viewed by 191
Abstract
This study conducted a comparative analysis of the King Abdulaziz and His Companions Foundation for Giftedness and Creativity (Mawhiba) as a national model for gifted education and talent development. Despite global recognition of nurturing giftedness, the literature lacks comprehensive, sustainable national talent development systems. Using a qualitative instrumental case study and documentary analysis, the research evaluated how Mawhiba differs from ten global gifted education institutions. Findings showed that Mawhiba's design overcomes limitations of traditional models through endowment-based funding, national scale with decentralized delivery, and a multi-sector partnership aligned with Saudi Vision 2030. The study examined how Mawhiba's integrated system translated identification and nurturing processes into measurable outcomes. Results demonstrated that Mawhiba operationalized contemporary giftedness frameworks via a multi-tiered ecosystem, from large-scale identification using the Mawhiba Multiple Cognitive Aptitude Test to diverse nurturing pathways including enrichment, research development, and university preparation. The model’s efficacy was evidenced by identifying over 244,000 gifted students and earning hundreds of international scientific awards, positioning Mawhiba as a paradigm shift in strategic human capital cultivation. Full article
(This article belongs to the Special Issue Unlocking Potential: The Future of Gifted and Talented Education)
49 pages, 2508 KB  
Review
Sensing the Action: Rethinking Sensor Modalities and Multi-Modal Fusion in Vision–Language–Action Models for Robotic Manipulation
by Byoung Chul Ko
Sensors 2026, 26(11), 3541; https://doi.org/10.3390/s26113541 - 3 Jun 2026
Viewed by 342
Abstract
Recent Vision–Language–Action (VLA) models have rapidly emerged as general-purpose robotic policies that integrate language understanding, visual perception, and robot control. However, prior studies and surveys have primarily emphasized backbone architectures, action decoders, training recipes, and benchmark performance, whereas relatively limited systematic attention has been given to sensor modality selection, heterogeneous signal alignment and fusion, and their connection to action generation, all of which are critical to the performance and safety of real-world robotic manipulation. This survey addresses this gap by reinterpreting VLA within the framework of a sensor–fusion–action pipeline. This study first presents a systematic taxonomy of major sensor modalities, including RGB, depth, tactile sensing, force/torque, proprioception and inertial measurement unit, multi-spectral/thermal, and event-based vision, and compares them in terms of the physical information they provide, their characteristic failure modes, and their deployment constraints. This survey further reviews teleoperation-, human video-, and simulation-based data collection pipelines, together with representative dataset configurations, and analyzes the multi-modal design space from a sensor-centric perspective, including early and late fusion, cross-attention, token-level fusion, adapters, mixture of experts, and multi-rate action representations. In addition, this study identifies a strong bias in existing benchmarks toward RGB-centric inputs and single success-rate metrics and emphasizes the need for a multidimensional evaluation framework incorporating robustness, worst-case performance, safety, latency, and efficiency. By shifting the focus away from a model-centric narrative and explicitly accounting for real-world sensor complexity, this survey seeks to establish a sensor-centered foundation for the next generation of Physical AI. Full article
(This article belongs to the Special Issue Feature Review Papers in Sensors and Robotics)

19 pages, 1272 KB  
Article
Foundation Model-Based One-Shot Anatomical Landmark Detection with Mamba and Graph Refinement
by Yinbing Tian, Ziyang Wang and Li Guo
Electronics 2026, 15(11), 2414; https://doi.org/10.3390/electronics15112414 - 2 Jun 2026
Viewed by 141
Abstract
Accurate anatomical landmark detection is important for orthodontic analysis, surgical planning, and morphometric measurement, but fully supervised methods usually require large expert-annotated datasets. This work studies a one-shot setting, where only a single annotated template image is used for training. We propose a foundation-model-based landmark detection framework using a frozen DINO Vision Transformer (ViT) backbone. The proposed framework integrates three complementary components: a Multi-Layer Multi-Facet (MLMF) module that adaptively fuses key and value features from multiple ViT layers through global source-wise reweighting; a Mamba-Based Long-Range Context Aggregation (MLCA) module that injects global anatomical context into fused patch descriptors with linear complexity; and a Topology-Constrained Graph Refinement (TCGR) module that refines the predicted landmark configuration using anatomical graph constraints. Experiments on the Cephalometric dataset and the Hand X-ray dataset demonstrate that the proposed method achieves strong performance. Overall, the results show that jointly exploiting multi-source foundation-model representations, efficient long-range context aggregation, and topology-aware refinement improves annotation-efficient anatomical landmark detection. Full article

16 pages, 5001 KB  
Article
Safety Assessment Method for Engineering Structures Based on Modal Curvature
by Fang Dong, Nan Jin, Yue Liu, Rumian Zhong and Qingrui Yue
Buildings 2026, 16(11), 2203; https://doi.org/10.3390/buildings16112203 - 29 May 2026
Viewed by 425
Abstract
To address the limitations of traditional structural damage identification methods in terms of reliance on high-fidelity baseline models and sensitivity to minor damage, this paper proposes a novel physics-informed and data-driven approach based on the modal curvature variation coefficient. A damage-sensitive feature derived from the rate of change in the radius of curvature is established, providing a clear mathematical and physical interpretation to reduce model error interference and enhance local damage localization. The effectiveness of the proposed method is validated through a 1:20 scale model experiment of a main truss from a large stadium steel roof. A total of 33 experimental cases were designed, simulating single and multiple damage scenarios with varying severity levels (large, medium, and small). Multi-source monitoring techniques, including millimeter-wave radar interferometry, laser displacement sensors, high-resolution vision-based measurement, and accelerometers, were integrated. Modal parameters were extracted using the Stochastic Subspace Identification (SSI) method, and the finite element model was updated via a high-order response surface methodology. Numerical simulations and experimental results demonstrate that the proposed modal curvature variation coefficient is highly sensitive to local stiffness degradation and accurately locates both single and multiple large/medium damage regions. In cases involving multiple minor damages, the method effectively identifies the damaged areas but exhibits a risk of false positives in undamaged sections. The millimeter-wave radar measurements exhibit strong agreement with laser displacement data, confirming its viability for non-contact structural health monitoring. This research provides a robust technical framework and experimental foundation for condition assessment and early damage warning in large-scale engineering structures. Full article

59 pages, 1676 KB  
Review
Vision–Language–Action (VLA) Models for Unmanned Aerial Robotics and Bimanual Manipulation: A Review
by Inkyu Sa, Chanoh Park, Hea-Min Lee, Donghee Noh and Ho Seok Ahn
Drones 2026, 10(6), 412; https://doi.org/10.3390/drones10060412 - 26 May 2026
Viewed by 274
Abstract
Vision–Language–Action (VLA) models unify visual perception, natural-language understanding, and action generation within a single foundation model, allowing a robot to follow instructions such as “fold the towel” or “fly to the red building” directly from camera images. Because VLAs inherit world knowledge from internet-scale pre-training, they have become the dominant framework for learning-based manipulation, with bimanual coordination serving as the most demanding testbed: two arms with 7+ degrees of freedom each must move in concert to fold, assemble, and reorient objects. Unmanned aerial robotics faces a structurally similar challenge: a drone must coordinate thrust, attitude, and, increasingly, gripper commands from visual observations under strict latency and payload constraints. This review covers 183 contributions spanning 2017–2026 and is organized along seven dimensions: VLA architectures, training recipes, action representations, bimanual coordination (2022–2026), unmanned aerial vehicle (UAV) navigation and control (2017–2026), language grounding, and cross-cutting concerns including memory and world models. We show that the coordination strategies, training recipes, and action representations developed for bimanual VLAs transfer to unmanned aerial systems and identify fourteen research directions across both domains. Full article

37 pages, 3939 KB  
Article
Reasoning-Centric Framework for Open-Set Wild Plant Recognition
by Dongkai Qi, Chia Sien Lim and Sivakumar Vengusamy
Appl. Sci. 2026, 16(11), 5292; https://doi.org/10.3390/app16115292 - 25 May 2026
Viewed by 193
Abstract
Open-set recognition of wild plants in natural complex scenes is an important task for plant conservation, ecological monitoring, and precision agriculture. Traditional closed-set learning methods struggle to handle unseen species not covered by the training set and complex environmental interferences, while existing open-vocabulary methods lack knowledge-driven reasoning capabilities and cannot provide interpretable recognition for unknown categories. This research proposes the Reasoning-Aware Perceptual Framework that integrates open-vocabulary vision-language models, foundation mask-generation tools, and domain knowledge reasoning to achieve known/unknown category recognition, online perception, and interpretable reasoning of unknown wild plant species. Centered on a five-stage closed loop of Perception-Retrieval-Reasoning-Decision-Iteration, the framework captures open concepts through vision-language feature alignment, completes evidence-based reasoning and confidence evaluation in combination with a botanical domain knowledge base, and finally outputs species classification decisions, interpretable reasoning reports with family/genus-level taxonomic affinity, and uncertainty-calibrated confidence scores. The unknown category estimation with family/genus-level taxonomic affinity in this framework refers to a general unknown label combined with taxonomic affinity at the family/genus level, which can clearly reflect the evolutionary relationship between unknown species and known species. Experiments on the self-constructed WildPlantOpenSet-10K dataset and public benchmark datasets report an F1-score of 84.7% for unknown species recognition, AUROC of 0.93 for known/unknown discriminability, and mean F1 of 87.0% across all categories. This framework focuses on open-set wild plant recognition and interpretable reasoning, using off-the-shelf instance extraction to acquire visual features for downstream reasoning. It maintains stable robustness in complex scenarios such as occlusion, strong light, and multi-species coexistence, and can adapt to the open-world environment without relying on large-scale pixel annotations, providing a research prototype for interpretable open-set recognition in complex natural environments. Full article
(This article belongs to the Special Issue Application of AI, Sensors, and IoT in Modern Agriculture)

26 pages, 2914 KB  
Review
A Review of Multimodal Image Feature Fusion Technology and Application
by Pingping Cao, Yuting Zhao, Tao Duan, Linguo Li, Chaole Xian and Shujing Li
Appl. Sci. 2026, 16(11), 5290; https://doi.org/10.3390/app16115290 - 25 May 2026
Viewed by 170
Abstract
Multimodal image fusion has emerged as a core technology for complex perception systems, such as autonomous driving, remote sensing monitoring, and medical diagnosis, by integrating complementary information from heterogeneous sensors. Given the rapid technological evolution within this field, particularly driven by the emergence of Mamba architectures, Generative Diffusion Models, and Vision Foundation Models (VFMs), traditional classification methods no longer fully encompass the ongoing paradigm shifts. Following the PRISMA guidelines to ensure the objectivity and reproducibility of the findings, this paper provides a systematic literature review and data extraction for multimodal image feature fusion. Under this standardized framework, a five-dimensional decoupling classification architecture is proposed to deconstruct models across fusion hierarchy, backbone architecture, fusion operator, supervision paradigm, and deployment constraints. Specifically, the analysis highlights the linear computational efficiency of Mamba in long-sequence modeling, the high-fidelity reconstruction capabilities of diffusion models via generative priors, and the universal semantic alignment achieved by VFMs. Furthermore, this study summarizes qualitative and quantitative evaluation metrics alongside cross-domain public datasets for performance benchmarking while discussing critical future directions, including cross-modal alignment in complex environments, parameter-efficient fine-tuning of large models, and real-time inference at the edge.

18 pages, 566 KB  
Review
Modelling and Measuring Professional Vision in Medical Education: A Cognitive Process Framework
by Tina Seidel, Christian Kosel, Ricardo Böheim, Martin Gartmeier and Pascal O. Berberat
Int. Med. Educ. 2026, 5(2), 52; https://doi.org/10.3390/ime5020052 - 22 May 2026
Viewed by 428
Abstract
Physicians routinely operate in environments that require the rapid processing of complex and dynamic visual information to diagnose patient conditions, communicate effectively, and make informed decisions. Despite the central role of visual attention in clinical practice, these processes are rarely conceptualized or systematically measured in medical education research. In other professional domains, such abilities are described as professional vision (PV)—the situated capacity to selectively attend to relevant cues and interpret them considering domain-specific knowledge. Although the term professional vision foregrounds visual attention, we use it here to cover the multimodal clinical perception in which visual cues are typically embedded—predominantly visual, but in many tasks also auditory and verbal—with visual attention as the analytic anchor. This paper introduces a cognitive process model of professional vision for medical education (PV-CP) that specifies the perceptual and cognitive subprocesses underlying how physicians perceive and interpret clinically relevant information. Building on this model, we propose a theory-driven framework for the measurement of professional vision using multimodal indicators. Central to our argument is the assumption that professional vision represents a latent, temporally unfolding construct that cannot be validly captured through single behavioral metrics or outcome measures. Instead, robust measurement requires the coordinated analysis of gaze-based indicators of visual attention and cognitive indicators of reasoning, each reflecting distinct subprocesses of professional vision. By systematically linking families of indicators to specific subprocesses and clarifying their respective inferential strengths and limitations, the PV-CP model advances a process-oriented approach to studying professional vision in medical education. 
The framework provides a conceptual basis for integrating multimodal data sources and supports more precise interpretations of gaze and reasoning data in expertise research. In doing so, the model contributes to the theoretical refinement of professional vision and offers a structured foundation for future empirical research and the design of learning environments aimed at fostering clinically relevant perceptual–cognitive skills.