Search Results (510)

Search Parameters:
Keywords = multi-modal reasoning

19 pages, 1808 KB  
Article
VLM-MCPDD: An Interpretable Vision Language Model for Multi-Crop Pests and Disease Diagnosis
by Liang Zhao, Mengwei Li, Xu Ren, Yuting Cheng and Zongxi Hu
Appl. Sci. 2026, 16(11), 5719; https://doi.org/10.3390/app16115719 - 5 Jun 2026
Viewed by 95
Abstract
Deep convolutional neural networks have made substantial progress in automated crop disease diagnosis. However, their practical application remains constrained by limited interpretability and insufficient structured reasoning, as these models largely operate as black boxes. Although they are effective in extracting visual features, they often fail to provide semantically grounded explanations, which may reduce their reliability in complex and open agricultural environments. To address these issues, this study constructs a Vision Language Model for Multi-Crop Pest and Disease Diagnosis (VLM-MCPDD). Specifically, the LLaVA-1.5 model is fine-tuned using low-rank adaptation (LoRA) to better align visual symptom representations with domain-specific agricultural knowledge. In addition, a Pests and Diseases Semantic Dataset (PDSD) is constructed to support multimodal learning. Based on PDSD, a chain-of-thought (CoT) mechanism is introduced to simulate the diagnostic workflow of agronomists, covering symptom observation, causal analysis, and final decision-making. The experimental results show that VLM-MCPDD achieves better overall performance than comparison models such as Swin Transformer and ConvNeXt and can serve as a reference for pest and disease diagnosis in intelligent agriculture.
39 pages, 10699 KB  
Article
SCPA-Net: Text-Enhanced Cross-Platform Framework with Semantic Consistency Enhancement for Pine Wilt Detection
by Shicong He, Weizhi Zhao, Peng Wang and Mingfang He
Plants 2026, 15(11), 1744; https://doi.org/10.3390/plants15111744 - 4 Jun 2026
Viewed by 191
Abstract
With the rapid development of UAV and satellite remote sensing, in combination with deep learning, high-efficiency monitoring of pine wilt disease (PWD) for forest health management is now feasible. Accurate detection, however, has not yet been realised. Sensing platforms differ in spatial range, observation area, and imaging orientation. At the same time, PWD targets often have weak phenotypic features, are easily affected by complex forest backgrounds, and show irregular data distributions at different stages of the disease. These factors limit the performance of traditional methods based only on general visual features. To address these problems, we propose SCPA-Net, a cross-platform semantic-consistent and phenotype-adaptive detection network for high-precision PWD detection in both UAV and satellite images. First, we construct a cross-platform multimodal framework that integrates remote sensing images with disease-related text descriptions. This design adds semantic prior knowledge, expanding the model's capacity to extract high-level phenotypic attributes without direct observation. Second, to reduce the semantic gap caused by the different platforms, the network improves the semantic consistency of UAV and satellite images, strengthens discriminative feature channels and salient regions, and addresses cross-platform misalignment. Third, since the targets are often embedded in complex forest environments, target-context relational modeling is enhanced and irrelevant interference is suppressed to reduce the impact of non-causal attributes. Finally, because pine wilt disease symptoms gradually progress from mild to severe (e.g., crown discoloration, texture variation, and wilting severity), differences among disease stages may lead to learning imbalance and knowledge forgetting; a staged adaptation strategy is therefore proposed, in which the model first learns from relatively easy examples and then progressively learns from more difficult ones to enhance generalization performance. Experiments on a self-built cross-platform dataset, a satellite dataset, the PDT public dataset, and the Roboflow dataset show that the proposed method achieves better detection accuracy and generalization. The framework addresses PWD detection in difficult-to-process forestry remote sensing data reasonably well.
(This article belongs to the Special Issue Advances in Artificial Intelligence for Plant Research—2nd Edition)

30 pages, 3776 KB  
Review
Multimodal Sensor Fusion in Autonomous Vehicles: Technologies, Architectures, and Open Challenges
by Patrik Viktor and Gabor Kiss
Sensors 2026, 26(11), 3528; https://doi.org/10.3390/s26113528 - 2 Jun 2026
Viewed by 260
Abstract
The rapid progress of sensing technologies, artificial intelligence, and embedded computing has significantly accelerated the development of autonomous vehicles. Among the core challenges of higher-level driving automation, reliable environmental perception remains one of the most critical. This review presents a systematic PRISMA-based analysis of multimodal sensor technologies and fusion architectures applied in autonomous driving, based on 66 peer-reviewed studies published between 2014 and 2025. The study examines the operational characteristics, advantages, and limitations of major sensing modalities, including cameras, LiDAR, radar, ultrasonic sensors, and GNSS/IMU-based localization systems. Particular attention is given to multimodal fusion strategies, covering early, mid-level, high-level, and transformer-based architectures that combine complementary sensor information to improve perception robustness and decision reliability. The review further synthesizes current evidence on performance under adverse environmental conditions, benchmark validation practices, real-time computational constraints, and the growing role of functional safety frameworks such as ISO 26262 and SOTIF. Emerging research directions, including 4D radar, self-supervised long-range fusion, foundation models, and cooperative V2X perception, are also discussed. The findings indicate that multimodal sensor fusion is a highly effective architectural strategy for improving scalability, fail-operational robustness, and certifiable safety in autonomous driving systems, particularly in higher-level automation scenarios. Future research should focus on uncertainty-aware fusion, explainable cross-modal reasoning, large-scale real-world validation, and efficient hardware–software co-design to support robust Level 4–5 vehicle autonomy. Full article
(This article belongs to the Section Vehicular Sensing)

39 pages, 3309 KB  
Review
Security in Collaborative Driving: A Survey of Threats, Defenses, and Emerging Trends
by Sahil Nayak, Onat Gungor and Tajana Rosing
Electronics 2026, 15(11), 2389; https://doi.org/10.3390/electronics15112389 - 1 Jun 2026
Viewed by 166
Abstract
Collaborative driving, in which autonomous vehicles cooperate with other vehicles and roadside infrastructure to improve safety, perception, and traffic efficiency, is emerging as a key paradigm for next-generation transportation systems. While such collaboration enhances situational awareness, it also introduces new security vulnerabilities across perception, communication, planning, decision-making, and control layers. In this survey, we present a unified taxonomy of security threats and defense mechanisms in collaborative driving systems, systematically organizing attacks and countermeasures across system layers. We further examine the integration of language models, including vision-based and multimodal reasoning models, into collaborative driving pipelines, highlighting the resulting security risks and design challenges. Finally, we identify key open research challenges, including cross-layer and end-to-end security, uncertainty-aware defenses, and real-world validation, outlining promising directions for future work toward secure and resilient collaborative autonomous mobility. Full article

30 pages, 45966 KB  
Article
DriveTDPA: Trajectory-Decision Preference Alignment for Vision-Language Autonomous Driving Planning
by Dingqi Liu and Jiayu Qin
Electronics 2026, 15(11), 2378; https://doi.org/10.3390/electronics15112378 - 1 Jun 2026
Viewed by 181
Abstract
Autonomous driving planning requires not only accurate trajectory prediction but also coherent semantic alignment across perception, decision making, and motion generation. Existing vision-language-based approaches predominantly focus on improving trajectory accuracy, which may lead to limited behavioral consistency. In this paper, we reformulate planning as a structured autoregressive generation task, where reasoning, actions, and future trajectories are jointly produced from multimodal observations. Based on this formulation, we propose Trajectory-Decision Joint Preference Optimization (TDJPO), which is a rollout-based alignment framework equipped with a unified reward that simultaneously captures physical trajectory quality and decision-level coherence. Starting from a supervised fine-tuned model, we construct preference pairs through stochastic rollouts and optimize the model using direct preference optimization. Experimental results on the NuScenes-TP benchmark demonstrate that our approach consistently enhances both trajectory accuracy and semantic consistency compared with supervised fine tuning, trajectory-only optimization, and lightweight vision-language baselines. These findings emphasize the necessity of jointly aligning physical feasibility and decision-level reasoning for achieving coherent and human-like autonomous driving behavior. Full article
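The abstract above names direct preference optimization over preference pairs built from stochastic rollouts. As a minimal sketch of that general objective (not the authors' TDJPO implementation), the standard DPO loss for a single pair might look like the following. The function name, the `beta` value, and the example log-probabilities are illustrative assumptions, and the unified trajectory and decision reward used to rank rollouts is not reproduced here.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. `beta` is a hypothetical setting.
    """
    # How much more the policy likes each sample than the reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) computed stably as softplus(-logits).
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities: the policy prefers the chosen rollout
# more strongly than the reference does, so the loss drops below log(2).
neutral = dpo_pair_loss(-10.0, -10.0, -10.0, -10.0)
aligned = dpo_pair_loss(-8.0, -12.0, -10.0, -10.0)
print(neutral, aligned)
```

A pair on which policy and reference agree gives a loss of log 2; the loss falls as the policy raises the chosen rollout relative to the rejected one.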

11 pages, 612 KB  
Perspective
Bayes at the Bedside: Biomarkers in Situations of Clinical Uncertainty
by Uwe Klaus Zettl and Michael Hecker
Diagnostics 2026, 16(11), 1699; https://doi.org/10.3390/diagnostics16111699 - 31 May 2026
Viewed by 124
Abstract
Laboratory biomarkers influence a large proportion of clinical decision-making, yet their application is often limited by incomplete validation and context-dependent interpretability. Serum neurofilament light chain (sNfL), a biomarker of neuroaxonal injury in multiple sclerosis (MS), exemplifies this challenge. Although associated with inflammatory activity, lesion burden, and disability progression at the population level, its translation into individual patient management remains problematic. In this Perspective, we synthesise current literature on sNfL in MS and apply Bayesian diagnostic reasoning as a conceptual framework for its interpretation in individualised MS care. The need for such a framework arises from the heterogeneity of MS pathology, in which subclinical inflammation and neurodegeneration may occur as partly dissociated processes that are incompletely captured by clinical and radiological measures. Consequently, substantial uncertainty persists in disease monitoring and therapeutic decision-making. In this setting, sNfL may provide complementary information, but its interpretation is complicated by biological variability, methodological differences, confounding factors (e.g., age, body mass index, and comorbidities), and the absence of universally validated thresholds. We argue that sNfL should be interpreted within a Bayesian framework, in which biomarker results modify rather than determine the probability of disease activity. Its clinical utility is likely greatest when the pre-test probability is intermediate but remains constrained by uncertainty in both test characteristics and clinical context, leading to uncertainty propagation. Overall, sNfL should be interpreted longitudinally and within multimodal clinical decision models. Further prospective studies are needed to better define its role in individualised MS management. Full article
(This article belongs to the Section Clinical Laboratory Medicine)
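The Bayesian reading described in this abstract, in which a biomarker result modifies rather than determines the probability of disease activity, can be illustrated with the odds form of Bayes' theorem. The sketch below is a generic worked example; the sensitivity, specificity, and pre-test probabilities are made-up illustrative numbers, not validated sNfL test characteristics.

```python
def post_test_probability(pre_test_prob, sensitivity, specificity, positive=True):
    """Update a pre-test probability with a test result via likelihood ratios."""
    if positive:
        # LR+ = P(test positive | disease) / P(test positive | no disease)
        likelihood_ratio = sensitivity / (1.0 - specificity)
    else:
        # LR- = P(test negative | disease) / P(test negative | no disease)
        likelihood_ratio = (1.0 - sensitivity) / specificity
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical test characteristics applied at low, intermediate, and high
# pre-test probability to show how much a positive result moves each one.
for pre in (0.05, 0.50, 0.95):
    post = post_test_probability(pre, sensitivity=0.80, specificity=0.90)
    print(f"pre-test {pre:.2f} -> post-test {post:.2f} (shift {post - pre:+.2f})")
```

With these assumed numbers the absolute shift is largest at the intermediate pre-test probability, which matches the abstract's point about where a biomarker is most informative.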

24 pages, 2572 KB  
Article
SGR-Net: Learning Multimodal Embeddings for Traffic Accident Prediction via Geometry–State Attentive Fusion
by Yuliang Jin, Duanyang Li, Zhiwu Li and Naiqi Wu
Appl. Sci. 2026, 16(11), 5426; https://doi.org/10.3390/app16115426 - 29 May 2026
Viewed by 200
Abstract
Traffic accident prediction is a key challenge in road safety and requires accurately identifying high-risk road sections from different data sources. Although graph neural networks (GNNs) model road network topology well, they ignore the visual and environmental cues available from physical road conditions. This paper addresses this gap by proposing a Sequential Geometric Reasoning Network (SGR-Net), a deep learning framework for multimodal accident prediction. Unlike prior GNN-based approaches, SGR-Net introduces a Geometry–State Attentive Fusion (GSAF) module, its main novelty, which dynamically integrates visual features from satellite imagery with structural graph contexts. The framework also includes a stability-aware training objective and meta-learning for cross-region generalization. We evaluate on a large-scale dataset covering six U.S. states with over nine million accidents and one million satellite images. SGR-Net achieves strong results, with AUROC up to 96.8% and MAE as low as 0.08 in Delaware. Ablations confirm the GSAF module is essential: removing it reduces AUROC by 2.7% and increases MAE by over 40%. The framework establishes a new state of the art for multimodal traffic accident prediction.
(This article belongs to the Section Transportation and Future Mobility)

91 pages, 6222 KB  
Review
A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision–Language Tasks
by Chia Xin Liang, Pu Tian, Caitlyn Heqi Yin, Yao Yua, An-Hou Wei, Ming Li, Xinyuan Song, Tianyang Wang, Ziqian Bi, Ming Liu, Riyang Bao and Pengbin Feng
Computation 2026, 14(6), 125; https://doi.org/10.3390/computation14060125 - 29 May 2026
Viewed by 561
Abstract
This survey provides a comprehensive guide to Multimodal Large Language Models (MLLMs) with a focus on vision–language tasks, including image captioning, visual question answering, cross-modal retrieval, visual grounding, multi-image reasoning, long-video understanding, and embodied AI. We examine architectures, training pipelines, and practical applications, covering visual encoders, language model backbones, connector modules, contrastive pre-training, instruction tuning, and preference alignment. We also foreground first-principles constraints—information bottlenecks, data-processing limits, and statistical co-occurrence bias—that shape architecture, robustness, and evaluation. This survey centers on vision–language systems and does not cover audio-only models or code-generation tools without visual inputs. Through task-level analysis and system-level case studies, we examine prominent MLLM implementations while addressing key challenges in scalability, memory, energy use, inference cost, robustness, and cross-modal learning. We present a unified taxonomy of the MLLM design space, a comparative overview of representative models and evaluation benchmarks, and a discussion of open problems. Concluding with ethical considerations and responsible AI development, this survey offers theoretical frameworks and practical insights for researchers, practitioners, and students working at the intersection of natural language processing and computer vision. Full article

14 pages, 584 KB  
Review
Review of Management of Clinical Stage I Small Cell Lung Cancer: The Rising Role of Surgical Resection
by Gabriella R. Rasmussen, Eric Klipsch and Kathryn E. Engelhardt
Cancers 2026, 18(11), 1781; https://doi.org/10.3390/cancers18111781 - 29 May 2026
Viewed by 409
Abstract
Background: Small cell lung cancer (SCLC) is an aggressive malignancy that has traditionally been treated as a systemic disease, with surgery largely excluded from standard management. A small subset of patients, however, present with clinical Stage I disease (T1–2N0M0). With improvements in imaging, staging, and systemic therapy, local therapy warrants consideration. Methods: We performed a narrative review of the literature focused on clinical Stage I SCLC, prioritizing studies addressing epidemiology, tumor biology, diagnostic workup, staging, treatment approaches, and surveillance. Emphasis was placed on current guideline recommendations and contemporary retrospective data relevant to surgical and non-surgical local therapies. Results: Clinical Stage I SCLC is rare and is frequently upstaged on complete diagnostic evaluation, highlighting the need for thorough staging and pathologic confirmation of node-negative disease when surgery is considered. Even in presumed local disease, distant metastases are often evident on a proper staging workup. Retrospective analyses suggest potential for long-term disease control in carefully selected Stage I patients treated with surgical resection, particularly lobectomy, as part of multimodality therapy that includes adjuvant systemic therapy. For patients who are not surgical candidates, stereotactic body radiation therapy combined with systemic therapy is a reasonable alternative. The role of prophylactic cranial irradiation and optimal surveillance strategies in Stage I disease remain areas of uncertainty. Conclusions: Clinical Stage I SCLC affects a small and distinct group of patients for whom traditional treatment strategies may need to be reconsidered. Taken together, retrospective evidence suggests a survival benefit for surgery in carefully selected patients, although prospective validation is lacking. Surgery warrants consideration in appropriately staged, operable patients, while recognizing the limitations of existing data and the need for further study in this rare population.
(This article belongs to the Special Issue State-of-the-Art Surgical Treatment for Lung Cancers)

61 pages, 7242 KB  
Review
Agricultural AI Agents: Architecture Design, Business Processes, Key Technologies, and Future Challenges
by Xuehua Song, Li Han, Yi Zhu, Qianxiang Wei, Zijun Yang and Xiaoming Jiang
Appl. Sci. 2026, 16(11), 5389; https://doi.org/10.3390/app16115389 - 28 May 2026
Viewed by 286
Abstract
Agricultural AI agents play a crucial role in the evolution of smart agriculture, from single-point automated applications to intelligent systems driven by tasks, collaborative decision-making, and closed-loop execution. However, their practical implementation still faces key challenges, such as heterogeneous agricultural data processing, insufficient cross-scenario generalization ability, complexity of multi-agent collaboration, difficulties in integrating software and hardware, and insufficient security and trust guarantees in real agricultural environments. This paper presents a systematic review of the architecture design, business processes, key technologies, and future challenges of agricultural AI agents. Agricultural AI agents are classified into two types: virtual agricultural AI agents and embodied agricultural AI agents. The paper summarizes a four-layer system architecture consisting of the infrastructure layer, agent management layer, agent collaboration layer, and application layer. The paper also analyzes the model capabilities required by agricultural AI agents from four typical business dimensions: perception and state understanding, knowledge memory and experience management, reasoning decision-making and task planning, and collaborative execution and resource scheduling. The review shows that technologies such as multimodal perception, knowledge graphs, retrieval-augmented generation, digital twins, reinforcement learning, and multi-agent collaboration can provide important support for agricultural AI agents to enhance their environmental understanding, knowledge reuse, autonomous decision-making, and physical execution capabilities. Future research should focus on robust perception in open environments, long-term memory and knowledge evolution, reliable multi-agent collaboration, edge-cloud collaborative deployment, and secure and trustworthy human–machine collaboration. Integrating agricultural domain knowledge with intelligent agent technology is an important direction for promoting the large-scale, adaptive, and sustainable application of agricultural AI agents.
(This article belongs to the Section Agricultural Science and Technology)

19 pages, 1764 KB  
Article
Automated Dataset Construction for Composed Video Retrieval in Soccer
by Riku Yoshida, Ryota Goka, Keisuke Maeda, Takahiro Ogawa and Miki Haseyama
Appl. Sci. 2026, 16(11), 5360; https://doi.org/10.3390/app16115360 - 27 May 2026
Viewed by 184
Abstract
Composed Video Retrieval (CoVR) enables flexible video search by retrieving a target video that reflects a specified modification to a query video. The triplet datasets—consisting of query videos, query text, and target videos—required for model training have been collected manually. Recent studies have explored automatic construction of training triplets for CoVR; however, most existing approaches rely heavily on caption similarity. This limitation is particularly problematic in soccer videos, where identical or highly similar captions can correspond to visually distinct situations, making it difficult to construct triplets with appropriate relationships. To address this issue, this paper proposes a multimodal triplet construction framework specialized for soccer videos. The key idea is to explicitly incorporate visual similarity alongside textual similarity. Specifically, candidate target videos are selected by combining visual similarity with commentary caption filtering, enabling the identification of videos that are visually similar yet semantically different. The semantic difference between videos is then generated as query text using a large language model (LLM) without manual annotation. Furthermore, a multimodal large language model (MLLM) is introduced to estimate whether the generated modification is visually and semantically consistent with the video pair. Rather than replacing human verification, this step provides an automated screening signal to identify potentially unreliable triplets. The experiments show that the proposed framework automatically constructs triplets with reasonable validity under limited human validation. These results demonstrate the potential of scalable triplet construction for CoVR in soccer videos. Full article
(This article belongs to the Collection Computer Science in Sport)

29 pages, 19613 KB  
Article
Cross-Modal Graph Attention for Bridge SHM Data Imputation
by Jiawei Xiong, Liangliang Hu, Xiaolin Meng, Xiangdong An and Yilin Xie
Sensors 2026, 26(11), 3339; https://doi.org/10.3390/s26113339 - 25 May 2026
Viewed by 267
Abstract
Bridge structural health monitoring (SHM) systems often suffer large-scale data loss due to sensor faults, communication interruptions, and other causes during long-term operation, which seriously restricts the reliability of structural state assessment and maintenance decision-making. Conventional single-channel independent modeling strategies commonly used for data imputation neglect the spatial correlations and cross-modal causal associations among multi-source heterogeneous monitoring data such as displacement, wind speed, and temperature, which constrains imputation capability, particularly when the target channel suffers long-term continuous data loss. To address these problems, this paper proposes a collaborative imputation framework integrating a graph attention network (GAT), a modal-aware cross-attention (MACA) mechanism, and a temporal encoder–decoder architecture (ITimeGAN). First, the sensor feature topological graph is constructed based on the Pearson correlation coefficient, and the spatial dependency among multi-source features is adaptively learned through the GAT. Then, the MACA module is introduced, which takes the target displacement as Query and environmental loads as Key/Value, and dynamically aggregates cross-modal driving information through multi-head attention. Finally, a bidirectional LSTM encoder and a unidirectional LSTM decoder are adopted to capture long-range temporal dependencies and accurately reconstruct the missing displacement data. Validated on 9-dimensional real-world monitoring data from the GeoSHM system of the Forth Road Bridge (UK) under both random missing (10–50%) and continuous long-term missing (1–10 days) scenarios, ITimeGAN achieves an R² of 0.9950 (MAE = 4.25 mm) for longitudinal displacement and 0.9759 (MAE = 6.70 mm) for vertical displacement even under 10 consecutive days of complete data absence. Ablation analysis further reveals that incorporating the graph attention and cross-modal attention modules reduces the longitudinal displacement MAE by 57% over the baseline, and that the ranking of imputation performance across the three displacement directions is fully consistent with the underlying physical correlation strengths, confirming the effectiveness of the proposed cross-modal collaborative strategy.
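The first step named in this abstract builds a sensor feature graph from Pearson correlation coefficients. A minimal sketch of that step follows. The channel names, toy readings, and the 0.5 threshold are hypothetical, and the abstract does not say how (or whether) correlations are thresholded before being passed to the GAT.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_graph(channels, threshold=0.5):
    """Return {(channel_a, channel_b): r} for pairs with |r| >= threshold."""
    names = list(channels)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(channels[a], channels[b])
            if abs(r) >= threshold:  # hypothetical edge criterion
                edges[(a, b)] = r
    return edges


# Toy readings, invented for illustration only.
channels = {
    "displacement": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [10.0, 12.1, 13.9, 16.2, 18.0],
    "wind_speed": [3.0, 1.0, 4.0, 1.0, 3.0],
}
edges = correlation_graph(channels)
print(edges)
```

In this toy data only displacement and temperature are strongly correlated, so they form the single edge; a learned attention layer would then weight such edges rather than treat them as fixed.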

17 pages, 2312 KB  
Article
Scaling Regulatory Compliance: A Multi-Agent System with Multimodal RAG for Automated Electrical Installation Inspection Under NOM and NEC Standards
by Francisco Manuel García-Reyes, Gustavo Castellanos-Guzman, Luis García-Reyes, Fausto Balderas-Jaramillo, Roberto Flores-Guerrero and Liliana Gonzalez-Gámez
Appl. Sci. 2026, 16(11), 5253; https://doi.org/10.3390/app16115253 - 24 May 2026
Viewed by 197
Abstract
Electrical systems must comply with regulations that guarantee the safety and proper functioning of the installation. Validating compliance requires certifications issued by inspectors who visually review the electrical installations. This article presents a multi-agent artificial intelligence system based on multimodal Retrieval-Augmented Generation (RAG) that verifies compliance with electrical standards. The system is made up of agents specialized in visual perception, automatic retrieval of the applicable standards, and the drafting of a technical opinion; the opinion is based on image processing checked mainly against the NOM and NEC standards, together with complementary standards such as NMX. The system was validated in real environments through 103 inspections, reducing inspection time from the usual 18.4 h to only 7.3 min with the system, a 99.3% improvement in time efficiency. In addition, consistency among inspectors (Cohen's kappa) increased from 0.68 to 0.94, indicating a high degree of standardization in opinions. These results show that integrating large language models (LLMs) and multi-agent architectures not only improves the productivity of inspection processes but also gives greater confidence in the assessment of the physical condition of electrical installations.
(This article belongs to the Special Issue AI Applications in Modern Industrial Systems)
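The two headline figures in this abstract, the time saving and the inter-inspector agreement, can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names and the pass/fail verdicts are hypothetical and are not taken from the article, whose own evaluation procedure is not described here. Only the 18.4 h and 7.3 min values come from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after` (same units)."""
    return (before - after) / before


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Times reported in the abstract, converted to minutes.
time_saving = relative_reduction(18.4 * 60, 7.3)
print(f"{time_saving:.1%}")  # 99.3%

# Hypothetical verdicts from two inspectors (toy data, not from the article).
inspector_a = ["pass", "pass", "fail", "fail", "pass", "fail"]
inspector_b = ["pass", "pass", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(inspector_a, inspector_b)
print(f"{kappa:.2f}")  # 0.67
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a rise from 0.68 to 0.94 is a stronger claim than a rise in simple percentage agreement.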
29 pages, 5803 KB  
Article
NS-Dep-KAN: An Explainable Neuro-Symbolic Framework with Kolmogorov–Arnold Networks for DSM-Guided Depression Assessment
by Qiong Hong, Lailatul Qadri Zakaria and Sabrina Tiun
Information 2026, 17(6), 516; https://doi.org/10.3390/info17060516 - 22 May 2026
Viewed by 173
Abstract
Automated depression assessment is critical for scalable mental healthcare but faces dual challenges: the lack of clinical interpretability in “black-box” deep learning models and the excessive computational cost of large-scale fusion architectures. To bridge this gap, we propose NS-Dep-KAN, a novel neuro-symbolic framework that harmonizes DSM-5-guided reasoning with Kolmogorov–Arnold Networks (KANs). Our approach leverages a Large Language Model (LLM) to extract symbolic symptom evidence aligned with diagnostic criteria, which then guides the aggregation of multimodal features from frozen pretrained encoders (WavLM and Qwen). Unlike traditional Multi-Layer Perceptrons, the proposed KAN prediction head employs learnable B-spline activation functions to capture complex nonlinear symptom–severity mappings with extreme parameter efficiency. Evaluations on the DAIC-WOZ benchmark demonstrate that NS-Dep-KAN achieves state-of-the-art performance among audio-text models (MAE 2.69, a 13.5% improvement over the three-modality baseline MSGAF at MAE 3.11), with only ∼4.9 K trainable parameters. Moreover, the framework offers inherent interpretability, revealing granular symptom contribution profiles that align with clinical intuition. This work establishes a path toward explainable, trustworthy AI for mental health screening.
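As a quick check of the reported 13.5% figure, the sketch below computes mean absolute error (MAE) and the relative improvement between two MAE values. The function names and the toy severity scores are hypothetical. Only the MAE values 2.69 and 3.11 come from the abstract.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae: float, model_mae: float) -> float:
    """Fractional MAE reduction relative to the baseline (lower MAE is better)."""
    return (baseline_mae - model_mae) / baseline_mae


# Toy severity scores, for illustration only (not from DAIC-WOZ).
toy_mae = mean_absolute_error([10, 4, 7], [8, 5, 7])
print(toy_mae)  # 1.0

# MAE values reported in the abstract: baseline MSGAF vs. NS-Dep-KAN.
gain = relative_improvement(3.11, 2.69)
print(f"{gain:.1%}")  # 13.5%
```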
18 pages, 566 KB  
Review
Modelling and Measuring Professional Vision in Medical Education: A Cognitive Process Framework
by Tina Seidel, Christian Kosel, Ricardo Böheim, Martin Gartmeier and Pascal O. Berberat
Int. Med. Educ. 2026, 5(2), 52; https://doi.org/10.3390/ime5020052 - 22 May 2026
Viewed by 302
Abstract
Physicians routinely operate in environments that require the rapid processing of complex and dynamic visual information to diagnose patient conditions, communicate effectively, and make informed decisions. Despite the central role of visual attention in clinical practice, these processes are rarely conceptualized or systematically measured in medical education research. In other professional domains, such abilities are described as professional vision (PV)—the situated capacity to selectively attend to relevant cues and interpret them considering domain-specific knowledge. Although the term professional vision foregrounds visual attention, we use it here to cover the multimodal clinical perception in which visual cues are typically embedded—predominantly visual, but in many tasks also auditory and verbal—with visual attention as the analytic anchor. This paper introduces a cognitive process model of professional vision for medical education (PV-CP) that specifies the perceptual and cognitive subprocesses underlying how physicians perceive and interpret clinically relevant information. Building on this model, we propose a theory-driven framework for the measurement of professional vision using multimodal indicators. Central to our argument is the assumption that professional vision represents a latent, temporally unfolding construct that cannot be validly captured through single behavioral metrics or outcome measures. Instead, robust measurement requires the coordinated analysis of gaze-based indicators of visual attention and cognitive indicators of reasoning, each reflecting distinct subprocesses of professional vision. By systematically linking families of indicators to specific subprocesses and clarifying their respective inferential strengths and limitations, the PV-CP model advances a process-oriented approach to studying professional vision in medical education. 
The framework provides a conceptual basis for integrating multimodal data sources and supports more precise interpretations of gaze and reasoning data in expertise research. In doing so, the model contributes to the theoretical refinement of professional vision and offers a structured foundation for future empirical research and the design of learning environments aimed at fostering clinically relevant perceptual–cognitive skills.