Search Results (857)

Search Parameters:
Keywords = multi-modal language model

18 pages, 4314 KB  
Article
Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning
by Yu-Hsuan Li, Yu-Cheng Cheng, Chih-Yun Chang and I-Te Lee
Diagnostics 2026, 16(12), 1931; https://doi.org/10.3390/diagnostics16121931 (registering DOI) - 22 Jun 2026
Abstract
Objectives: Multimodal large language models (MLLMs) have shown potential for medical image classification. We evaluated four optimization strategies in two MLLMs—GPT-4o (gpt-4o-2024-08-06) and Gemini 2.5 Flash-Lite—for ultrasound-based thyroid nodule malignancy classification using two public datasets and a clinical cohort of nodules with atypia of undetermined significance (AUS) cytology. Methods: Text prompting, few-shot learning, fine-tuning, and a hybrid strategy combining fine-tuning with few-shot learning were evaluated for each model. Performance was assessed using the Digital Database of Thyroid Images (DDTI; n = 80), a 1000-image test subset of TN5000, and an institutional AUS cohort with surgical pathology (n = 84). In the AUS cohort, the best-performing strategy was compared with the consensus classification of three endocrinologists and the American Thyroid Association (ATA) ultrasound risk stratification. Results: For GPT-4o, the hybrid strategy achieved the highest area under the receiver operating characteristic curve (AUC) in DDTI (0.866), TN5000 (0.689), and the AUS cohort (0.836). In the AUS cohort, its specificity was higher than that of endocrinologist consensus and ATA risk stratification when only high-suspicion nodules were classified as malignant (95.1% vs. 70.7% and 70.7%; p = 0.002 and p = 0.001, respectively), while sensitivity did not differ significantly (72.1% vs. 74.4% and 79.1%, respectively; both p > 0.05). However, the hybrid model misclassified 12 of 43 malignant nodules, corresponding to a false-negative rate of 27.9%. When high- and intermediate-suspicion ATA categories were classified as malignant, ATA sensitivity increased to 83.7% and specificity decreased to 56.1%; the hybrid model had a higher AUC than ATA risk stratification (0.836 vs. 0.749; p = 0.017). For Gemini 2.5 Flash-Lite, few-shot learning, fine-tuning, and the hybrid strategy did not improve AUC relative to text prompting in any dataset. 
Conclusions: The hybrid strategy produced the most consistent performance gains for GPT-4o across the three datasets but did not improve Gemini 2.5 Flash-Lite. The optimized GPT-4o model achieved high specificity in the diagnostically challenging AUS cohort, although its false-negative rate limits its use as a stand-alone diagnostic tool. Further validation in larger, prospective multicenter cohorts is required before clinical use.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
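The AUS-cohort percentages in the abstract above can be reproduced from a few counts. The sketch below is only an arithmetic check, not the authors' analysis code: the 12 of 43 misclassified malignant nodules and n = 84 come from the abstract, while the benign count (84 - 43 = 41) and the true-negative count (39) are inferred here because they are consistent with the reported 95.1% specificity.

```python
# Arithmetic check of the AUS-cohort figures quoted in the abstract.
# Reported in the abstract: cohort size 84, 43 malignant nodules, 12 missed.
# Inferred (NOT reported): benign count and true-negative count.
cohort_size = 84
malignant = 43
false_negatives = 12

benign = cohort_size - malignant   # 41, assuming all 84 nodules were analysed
true_negatives = 39                # assumption: 39/41 matches the reported 95.1%

true_positives = malignant - false_negatives  # 31

sensitivity = true_positives / malignant
false_negative_rate = false_negatives / malignant
specificity = true_negatives / benign

print(f"sensitivity         = {sensitivity:.1%}")          # 72.1%
print(f"false-negative rate = {false_negative_rate:.1%}")  # 27.9%
print(f"specificity         = {specificity:.1%}")          # 95.1%
```

Sensitivity and the false-negative rate sum to 1 by construction, which is why the 72.1% sensitivity and the 27.9% false-negative rate describe the same 12 missed nodules.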
23 pages, 7802 KB  
Article
A Latent-Guided Framework for Text-Based Full-Body Human Motion Generation
by Jannatul Nayeem, Hak-Bum Lee and Young-Ho Seo
Electronics 2026, 15(12), 2738; https://doi.org/10.3390/electronics15122738 (registering DOI) - 22 Jun 2026
Abstract
Text-to-motion generation aims to synthesize realistic human motion sequences that accurately reflect natural language descriptions. While recent approaches have improved motion quality, achieving strong semantic alignment between text and motion, especially for fine-grained articulations, remains a significant challenge. In this work, we propose a latent-guided text-to-motion generation framework that strengthens the interaction between textual representations and motion latent sequences. The proposed method integrates a structured motion latent space with a text-conditioned variational generation module, enhanced by a cross-modal attention mechanism. This design enables the model to effectively capture both global motion dynamics and detailed semantic information from text. Extensive experiments on the Motion-X dataset demonstrate that the proposed approach achieves strong semantic alignment, as reflected by improved R-precision and competitive matching performance. In addition, the model improves multi-modality, indicating its ability to generate diverse motion patterns under the same textual condition. Qualitative results further show that the generated motions preserve core action semantics and exhibit coherent temporal dynamics across different motion categories. Overall, the proposed framework provides an effective solution for improving text–motion alignment in high-dimensional motion spaces, highlighting the importance of latent-guided modeling for realistic and semantically consistent motion generation.
(This article belongs to the Topic AI-Based Interactive and Immersive Systems)
14 pages, 6425 KB  
Article
Improving Entity Understanding for Vision-Language Pre-Training via Active Learning
by Qunbo Wang, Sen Zhang, Boxuan Shao, Xize Guo, Jiayong An, Chao Fan, Yuanjun Jing, Junxian Li and Wenjun Wu
Big Data Cogn. Comput. 2026, 10(6), 198; https://doi.org/10.3390/bdcc10060198 (registering DOI) - 22 Jun 2026
Abstract
Although many researchers use pre-trained models to better solve downstream tasks, further exploration of more effective pre-training methods remains necessary, especially for multi-modal pre-training, where high-quality training data is more difficult to obtain. This work aims to improve knowledge-learning performance in multi-modal pre-training. Some researchers focus on injecting entity knowledge into pre-trained language models through masked entity model (MEM) training, which masks entities at random and lets the model recover them. These methods cannot guarantee good performance because they do not consider which entities are more valuable for learning. Moreover, in multi-modal training data, some entities may be unrelated to the visual content. In this work, for the vision-language pre-trained model, we propose a Masked Entity Model pre-training method based on Active learning (ActiveMEM). It is designed to actively mask important entities (those that are both informative and uncertain) for the model to recover, thereby encouraging it to extract more valuable knowledge from the data. The proposed method is evaluated using three pre-training datasets and four downstream datasets, and the experimental results demonstrate the effectiveness of our method.
32 pages, 1694 KB  
Review
Comprehensive Review of Nystagmus and Vertigo Diagnostics: From Pathological Foundations to AI-Driven Telemedicine
by Kowshik Balasubramanian, Ali Danesh and Abhijit Pandya
Sensors 2026, 26(12), 3949; https://doi.org/10.3390/s26123949 (registering DOI) - 22 Jun 2026
Abstract
Nystagmus, the involuntary rhythmic oscillation of the eyes, is a critical diagnostic marker in vestibular medicine, distinguishing life-threatening central disorders such as stroke from benign peripheral conditions including Benign Paroxysmal Positional Vertigo (BPPV). Despite its clinical importance, accurate nystagmus assessment has long been constrained by expensive infrared video-oculography equipment such as videonystagmography, specialist dependency, and the episodic nature of vestibular symptoms that are often resolved before a clinical encounter. This review synthesizes approximately 50 papers published between 1952 and 2026 across four thematic domains: AI-driven nystagmus analysis, clinical medicine, smartphone and portable hardware innovations, and telemedicine and remote monitoring. On the AI front, classical machine learning models achieve up to 98.77% nystagmus recognition accuracy using ensemble methods, while deep learning frameworks spanning CNNs, U-Nets, LSTMs, and optical flow networks demonstrate clinical-grade slow-phase velocity measurement equivalent to gold standard video-oculography on standard smartphone RGB video. Large language and vision models including GPT-4V and Gemini 2.0 show early-stage promise as zero-shot triage tools but currently fall well below specialist-level diagnostic accuracy. Concurrently, portable hardware innovations ranging from 3D-printed goggle systems to ARKit-based smartphone applications are narrowing the accessibility gap, while telemedicine frameworks enable ictal recording and cloud-based specialist review outside the clinic. Across all domains, the common barriers to clinical translation are dataset scarcity for rare BPPV subtypes, sensitivity to ambient conditions, and the absence of explainable AI mechanisms. 
This review maps the current state of the field and identifies multimodal data fusion, prospective clinical validation, and interpretable AI as the critical next steps toward equitable, specialist-independent vestibular diagnostics.
(This article belongs to the Section Biomedical Sensors)
21 pages, 1456 KB  
Article
A Camera-Based Multimodal Defect Sensing Framework for Substation Equipment Monitoring via Cross-Modal Feature Mapping
by Ziquan Liu, Hai Xue, Chengbo Hu, Chao Wei and Can Zhang
Sensors 2026, 26(12), 3935; https://doi.org/10.3390/s26123935 (registering DOI) - 21 Jun 2026
Abstract
To address the limitations of vision-only defect detection, image–semantic misalignment, and spatial-logic conflicts in complex substation inspection scenarios, this paper proposes a camera-sensor-based multimodal defect sensing framework with cross-modal feature mapping for substation equipment monitoring. The proposed framework integrates field inspection images acquired by camera sensors, defect textual descriptions, and equipment topology knowledge and establishes a unified domain-adaptive pre-training–bidirectional cross-modal mapping–hierarchical reasoning workflow. First, a Contrastive Language–Image Pre-training (CLIP)-based domain-adaptive pre-training strategy is developed to enhance the representation of equipment categories, defect attributes, and inspection-scene semantics. Second, a bidirectional cross-modal feature mapping network is constructed to model fine-grained interactions between candidate visual regions and textual semantics, where uncertainty-aware fusion and prototype constraints are introduced to improve semantic alignment and defect discrimination. Third, a hierarchical neuro-symbolic reasoning module incorporates equipment topology and spatial rules for posterior verification, logical consistency checking, and false-positive suppression. Experiments on a substation inspection image dataset demonstrate that the proposed method achieves 90.8% mAP@0.5, 68.7% mAP@0.5:0.95, and 89.4% F1-score, outperforming mainstream and recent detection models.
43 pages, 1242 KB  
Review
Machine-Learning-Driven Molecular Design and Structure–Property–Performance Relationships in Pharmaceutical Chemistry
by Aisulu Zh. Kabdraisova, Almagul K. Umbetova, Gulfairuz Zh. Kairalapova, Yuliya A. Litvinenko, Larissa R. Sassykova, Nazym S. Yelibayeva, Gauhar Sh. Burasheva, Aliya E. Berganayeva, Zhanibek S. Assylkhanov, Meruyert D. Dauletova, Dmitriy Yu. Korulkin, Marzhan A. Baiburkutova and Aigerim M. Sadvakas
Molecules 2026, 31(12), 2162; https://doi.org/10.3390/molecules31122162 - 19 Jun 2026
Viewed by 170
Abstract
This review examines the emerging role of machine learning (ML) in pharmaceutical chemistry, with emphasis on molecular design, synthetic feasibility, and structure–property–performance (SPP) relationships. By enabling pre-synthesis prediction of physicochemical properties, reaction pathways, and pharmaceutical performance, ML can reduce empirical trial-and-error experimentation and support more efficient exploration of chemical space. A structured narrative review design with PRISMA-aligned systematic search elements was used to evaluate 101 studies, enabling transparent literature identification, eligibility screening, and thematic synthesis across heterogeneous ML applications in pharmaceutical chemistry. This review examines structure–property relationships (SPRs) and property–performance relationships (PPRs), with emphasis on key pharmaceutical endpoints such as solubility, permeability, stability, dissolution, and bioavailability. An integrated SPP framework is proposed to connect molecular structure, intermediate properties, and final performance outcomes while incorporating retrosynthetic analysis, experimental feedback, and closed-loop optimization. Recent frontier developments are also discussed, including molecular foundation models, multimodal language–graph models, diffusion-based molecular generation, E(3)-equivariant models, and MolMIM-like latent-space optimization. This review also covers co-folding and joint ligand–protein modeling, Boltz-2-like affinity prediction, AlphaFold 3-related biomolecular interaction modeling, and absorption, distribution, metabolism, excretion, and toxicity (ADMET) prediction. Key limitations include dataset leakage, benchmark inconsistency, assay variability, conformational and protonation-state effects, reproducibility challenges, regulatory constraints, and the gap between computational prediction and prospective experimental validation. 
Future progress is expected to depend on hybrid physics–ML models, uncertainty-aware prospective validation, autonomous experimentation, explainable artificial intelligence, and sustainability-aware molecular design. Overall, ML is evolving from a predictive tool into a chemically informed decision-support framework for rational, synthesis-aware, and experimentally validated pharmaceutical development.
(This article belongs to the Section Organic Chemistry)
32 pages, 25402 KB  
Article
MLLMto3D: An MCP-Driven Closed-Loop Framework for Architectural 3D Generation
by Dong Yao, Bingcheng He and Xiaoxi Zhao
Buildings 2026, 16(12), 2437; https://doi.org/10.3390/buildings16122437 - 18 Jun 2026
Viewed by 93
Abstract
Multimodal large language models can read architectural images and design instructions, but they still struggle to turn architectural rules into editable, executable models in professional modeling environments. To address this limitation, this paper presents MLLMto3D, an MCP-driven closed-loop framework that connects multimodal reasoning with Rhino-based modeling, feedback, and revision. The framework consists of five phases: visual parsing, JSON-based intent serialization, code synthesis, MCP-driven Rhino execution and feedback, and verification with bounded repair. Its core mechanism is JSON-based intent serialization, which converts image-derived architectural information into machine-readable modeling parameters under a predefined JSON schema. The schema separates geometric and compositional constraints, including height, bay rhythm, facade zones, and alignment rules, from design variables such as materials, openings, and ornament. Building on this mechanism, Skills modules externalize facade typology knowledge and safe Rhino scripting patterns, providing callable professional constraints for code synthesis to reduce design-intent deviation and API hallucination. The framework is evaluated through an experimental design case study on a site in Shanghai’s Hengfu Historic District, where the generation of new façades is informed by a nearby heritage architectural reference. The results show that MLLMto3D can generate a parametrically adjustable Rhino model while preserving the main compositional constraints, thereby advancing AI-assisted architectural 3D generation toward a controllable, verifiable, and iterative modeling process.
(This article belongs to the Section Construction Management, and Computers & Digitization)
23 pages, 643 KB  
Article
VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval
by Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah and Hyun Soo Kang
Mathematics 2026, 14(12), 2197; https://doi.org/10.3390/math14122197 - 18 Jun 2026
Viewed by 159
Abstract
Reasoning-intensive multimodal retrieval suffers from a counter-intuitive bottleneck: on MM-BRIGHT multimodal-to-text (Query+Image → Documents), the strongest dense multimodal encoder reaches only 27.6 nDCG@10 and the rest of the dense vision–language retrievers cluster between 10.0 and 23.0. The visual signal, encoded as a dense vector, adds noise rather than evidence; even augmenting strong text retrievers with raw image captions degrades performance by up to 12.0 points. We propose VISA, a Visual Symbolic Agent that re-casts multimodal-to-text as text retrieval over three parallel streams. A Vision LLM is dispatched in three roles via separate prompts: a zero-shot router that classifies the query image into up to three parser types from a fixed taxonomy of nine (chart, circuit, equation, screenshot, code, figure, diagram, map, photograph); typed parsers that extract structured text per type; and a holistic captioner. The agent constructs three text streams (raw query, query ⊕ symbolic, query ⊕ caption), scores each with a single frozen 4B-parameter retrieval LLM, and fuses the per-document scores via Reciprocal Rank Fusion or a confidence-weighted linear combination. The whole agent contains no trainable parameters. The key novelty is a change of substrate: rather than projecting the query image into a dense multimodal vector that competes with text, VISA is, to our knowledge, the first retrieval system to convert the image into typed symbolic text and keep retrieval entirely text-side, so that a frozen text retriever can match the literal tokens (axis values, variable names, function signatures) that answering documents actually contain. Across all 29 MM-BRIGHT multimodal-to-text domains, VISA achieves 32.4 nDCG@10, an absolute improvement of +4.8 over the strongest dense multimodal encoder and substantially larger margins over the remaining six dense vision–language baselines. 
Per-domain analysis shows VISA maintains its margin across STEM and software domains where image content is structure-heavy. In practical terms, VISA is training-free and model-agnostic: it requires no fine-tuning, reuses any off-the-shelf vision LLM and text retriever, caches all per-image parsing so re-runs cost only three query encodes, and can therefore be dropped into an existing text-retrieval stack to add reasoning-intensive multimodal capability without building or training a multimodal encoder.
(This article belongs to the Special Issue New Advances in Image Processing and Computer Vision)
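The VISA abstract above mentions fusing three text streams with Reciprocal Rank Fusion. As a rough illustration of that general technique (not the authors' implementation), the sketch below fuses three ranked lists. The function name, the document IDs, and the smoothing constant `k = 60` are assumptions chosen for the example; the abstract does not state which constant is used.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives sum(1 / (k + rank)) over the lists that contain
    it, with ranks starting at 1. k = 60 is a commonly used default and is
    an assumption here, not a value taken from the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for the three streams named in the abstract.
raw_query_ranking = ["d1", "d2", "d3"]
symbolic_ranking = ["d2", "d1", "d4"]
caption_ranking = ["d2", "d3", "d1"]

fused = reciprocal_rank_fusion(
    [raw_query_ranking, symbolic_ranking, caption_ranking]
)
print(fused)  # ['d2', 'd1', 'd3', 'd4']
```

Because the method uses only ranks, streams whose raw scores sit on different scales can be combined without calibration. The confidence-weighted linear combination mentioned as an alternative would instead need comparable scores.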
27 pages, 682 KB  
Review
Cancer-Related Cognitive Impairment in Breast Cancer: Current State of Knowledge, Mechanisms, Diagnosis, Prevention and Treatment
by Federica Andreis, Chiara Deori, Valentina Giubileo, Chiara Abeni, Irene Caramella, Sara Cherri, Brunella Di Biasi, Michela Libertini, Silvia Noventa, Chiara Ogliosi, Ester Oneda, Tiziana Prochilo, Fausto Angelo Meriggi and Alberto Zaniboni
Cancers 2026, 18(12), 1974; https://doi.org/10.3390/cancers18121974 - 17 Jun 2026
Viewed by 160
Abstract
Cancer-related cognitive impairment (CRCI), also known as chemobrain or chemofog, is characterized by subjective and/or objective changes in attention, executive functions, memory, and processing speed in patients with non-CNS cancers, particularly women with breast cancer. This structured narrative review synthesizes current evidence on mechanisms, neuropsychological assessment, neuroimaging correlates, clinical and demographic risk factors, emerging artificial intelligence and machine learning applications, and non-pharmacological approaches to CRCI in breast cancer. A structured literature search was conducted using PubMed/MEDLINE, PsycInfo, and Clinical Key up to May 2026, with emphasis on studies published between 2023 and 2026. Peer-reviewed English-language studies involving adult breast cancer populations and addressing predefined thematic domains of CRCI were considered. Given the heterogeneity of study designs, assessment tools, interventions, and outcomes, the findings were synthesized narratively. Current evidence supports a multifactorial model of CRCI involving neurobiological, treatment-related, psychological, and behavioral mechanisms. Neuroinflammation, endocrine disruption, oxidative stress, glial alterations, and structural or functional brain changes may contribute to cognitive symptoms; however, the strength of evidence varies, and many findings remain correlational or preclinical. Non-pharmacological interventions, including cognitive training, physical activity, mindfulness-based and psychological approaches, and multimodal digital programs, appear promising as supportive strategies. However, evidence remains heterogeneous, with benefits more consistently reported for patient-reported outcomes, fatigue, emotional distress, and quality of life than for objective neuropsychological performance. 
CRCI in breast cancer should be approached as a heterogeneous condition requiring early recognition, standardized assessment, and multidisciplinary supportive care. Future research should prioritize longitudinal designs, harmonized endpoints, and a clearer distinction between subjective and objective outcomes.
(This article belongs to the Section Cancer Survivorship and Quality of Life)
30 pages, 21418 KB  
Article
Semantic Translation and LLM-RAG Fusion of Multi-Source Heterogeneous Data for Production Cognition in Discrete Manufacturing
by Pingwen Zheng, Liping Wang, Changchun Liu and Dunbing Tang
Electronics 2026, 15(12), 2692; https://doi.org/10.3390/electronics15122692 - 17 Jun 2026
Viewed by 97
Abstract
Multi-source heterogeneous data in discrete manufacturing shop floors, including vibration signals, equipment logs, visual monitoring data, and handwritten production reports, exhibit significant differences in modality and semantic representation. Traditional fusion methods often fail to bridge the semantic gap between low-level sensing signals and high-level manufacturing cognition, limiting intelligent anomaly analysis and decision-making capability. To address this issue, this paper proposes a semantic translation and fusion framework for industrial heterogeneous data based on Knowledge Graph (KG), Retrieval-Augmented Generation (RAG), and Large Language Models (LLMs). First, a unified semantic translation mechanism is developed to convert multimodal industrial data into structured semantic representations for cross-modal alignment. Second, an industrial knowledge graph and RAG mechanism are introduced to integrate process knowledge, maintenance manuals, and historical fault records into the reasoning process. Third, an LLM-driven reasoning framework is designed for multimodal semantic fusion, anomaly identification, causal analysis, and optimization recommendation generation. In addition, a digital twin-based visualization interface is constructed to realize real-time interaction between production lines, industrial data, and intelligent cognitive reports. Experimental results demonstrate that the proposed framework significantly improves industrial reasoning accuracy, anomaly analysis correctness, and response efficiency compared with general-purpose LLMs, providing an effective solution for intelligent cognition and decision-making in discrete manufacturing systems.
(This article belongs to the Section Computer Science & Engineering)
14 pages, 8910 KB  
Article
The Backend as a Possible Functional Analogue of Consciousness: Redirecting Attention from the Language Model to the Orchestrating Layer
by Pavel Straňák
Philosophies 2026, 11(3), 98; https://doi.org/10.3390/philosophies11030098 - 17 Jun 2026
Viewed by 150
Abstract
Discussion of consciousness and artificial intelligence has hitherto focused on the question of whether a large language model (LLM) exhibits signs of consciousness or understanding. This paper proposes to redirect attention elsewhere: not to the model itself, but to the orchestrating layer that governs the model. This layer is the backend, understood here as the collection of mechanisms (context management, retrieval, evaluation, planning, and tool-use control) that structure the model’s operation. We argue that the backend performs a function analogous to the role of consciousness in the human brain: it stabilizes generative processes, directs attention, maintains context, and mitigates the entropic disintegration of thought. Consciousness fulfills this function through the phenomenal layer (qualia), which creates a persistent subjective “inner canvas”, used here as a metaphor for a more general multimodal phenomenal space. The backend fulfills it only algorithmically, without phenomenal quality. We further show that computation is an informationally conservative process in the sense of Shannon’s Data Processing Inequality (DPI), and therefore cannot increase Shannon information, even though it may yield novel or pragmatically useful recombinations of existing information. We conclude by proposing the hypothesis that consciousness constitutes a phenomenon orthogonal to computation: not an emergent property of complexity, but a qualitative leap into a different dimension. This hypothesis, which builds on the author’s prior work in this Special Issue and in Symmetry, is presented as a conceptual contribution rather than a formal theory, and may have implications for how future artificial intelligence research conceptualizes the limits of computational architectures.
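For reference, the Data Processing Inequality invoked in the abstract above is usually stated as follows. This is the standard textbook form, given here as background rather than as the paper's own formulation; the symbols X, Y, Z and g are generic and not taken from the article.

```latex
% Data Processing Inequality: if X -> Y -> Z forms a Markov chain,
% then processing Y cannot increase the information it carries about X.
\[
  X \to Y \to Z
  \quad\Longrightarrow\quad
  I(X;Z) \le I(X;Y)
\]
% Special case: Z = g(Y) for any deterministic function g.
\[
  I\bigl(X; g(Y)\bigr) \le I(X;Y)
\]
```

Here I(·;·) denotes Shannon mutual information. The inequality bounds the information about X that survives processing. It does not limit how useful a particular recombination of that information may be, which matches the distinction the abstract draws.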
31 pages, 8778 KB  
Article
An Explainable Multimodal Deep Learning Framework for Thyroid Nodule Diagnosis in Ultrasound Imaging Using Hybrid Vision Transformers and Med-PaLM
by Sathya Jayaraman, Ramkumar Sivasakthivel, Jayapriya Jayapal and Balakrishnan Chinnaiyan
Computation 2026, 14(6), 138; https://doi.org/10.3390/computation14060138 - 16 Jun 2026
Viewed by 224
Abstract
Thyroid tumors rank among the most frequently occurring endocrine cancers, and early detection helps doctors deliver effective treatments that lead to better patient outcomes. Ultrasound imaging enables the detection of thyroid nodules, yet medical professionals struggle to differentiate between benign and malignant nodules in diagnostic testing. This study introduces a new medical framework for thyroid nodule diagnosis from ultrasound imaging. The proposed model combines advanced segmentation with feature extraction, classification, and reasoning components to create a complete system. The specialized segmentation method detects nodule boundaries accurately, which leads to better analysis of specific regions. The Hybrid Vision Transformer (HVT) captures detailed textural information together with broader contextual patterns, which improves its classification ability. The proposed framework incorporates a Large Language Model (LLM), specifically Med-PaLM, to provide context-aware clinical reasoning and interpretation. The structured evaluation process uses Thyroid Imaging Reporting and Data System (TI-RADS)-based feature scoring to compare model results with designated clinical standards. The language model enhances the diagnostic process by supplying contextual understanding and deriving useful information from the extracted features. The proposed model achieves an accuracy of 98.5%, a precision of 98.7%, a recall of 98.4%, and an F1-score of 98.5%, which demonstrates accurate and comparable performance across classes. The experimental results demonstrate that the model achieves better results than existing methods. The combination of multimodal data with clinical reasoning improves both the accuracy and the user experience of the system. 
The proposed framework provides an efficient, interpretable, and scalable solution for thyroid nodule diagnosis. Full article
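As a quick consistency check on the figures quoted in this abstract, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below is only an illustration using the two reported percentages; the function name `f1_score` is a hypothetical helper, not code from the article.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (both as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
reported_precision = 0.987
reported_recall = 0.984

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.985"
```

The recomputed value agrees with the reported F1-score of 98.5% to the stated precision.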
(This article belongs to the Section Computational Biology)

14 pages, 536 KB  
Review
Advancing Pediatric Radiology Through Artificial Intelligence: Global Progress and Implications for Middle- and Low-Income Countries
by Sana Amreen, Ahmed Khairy, Fakeha Masood, Ngan Chu, Anju Paudel, Abdelrahman Aly Mohamed, Ayantoyinbo Oluwabusayomi and Yossef Alnasser
AI 2026, 7(6), 222; https://doi.org/10.3390/ai7060222 - 16 Jun 2026
Viewed by 274
Abstract
Background: Radiology underpins diagnosis and treatment across pediatrics, yet most artificial intelligence (AI) tools are developed for adults and validated on adult datasets only. Of more than 200 AI systems cleared by the United States (U.S.) Food and Drug Administration (FDA), only about 3% include pediatric validation. Because children differ from adults in anatomy, physiology, pathology, epidemiology, and imaging protocols, adult-trained models often perform sub-optimally in pediatric settings. Methods: A narrative review of peer-reviewed literature from 2000 to 2025 was conducted using PubMed, MEDLINE, Google Scholar, and Scopus. Studies involving AI applications in pediatric X-ray, ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), echocardiography, and point-of-care ultrasound with quantitative performance metrics were included. Findings were synthesized by imaging modality, clinical task, and differences between high-income countries (HICs) and low- and middle-income countries (LMICs). Results: AI demonstrated strong performance across multiple pediatric imaging tasks. In X-ray interpretation, AI detected fractures with area under the curve (AUC) values up to 0.96 (sensitivity, 90.8%; specificity, 88.7%). Pneumonia classification achieved 76.5% accuracy, and foreign body aspiration detection showed 95.3% specificity in HICs. In ultrasound, AI improved junior sonographers’ detection of intussusception (AUC 0.857 to 0.966) and reduced scan time by more than 50%. AI-assisted bone age estimation achieved a mean error of 0.39 years. In echocardiography, AI-derived ejection fraction showed excellent agreement with expert measurements (intraclass correlation coefficient [ICC], 0.983), and AI support improved atrioventricular septal defect detection (84.4% to 86.5%). In MRI, the use of AI enhanced lesion detection and supported quantitative analysis. 
Deep-learning models trained on routine T1- and T2-weighted sequences predicted liver stiffness across multi-site datasets, while advanced neuroimaging pipelines improved the identification of subtle epileptogenic lesions that are often missed on conventional pediatric MRI. However, adult-trained models showed limited generalizability to children. Still, excluding children under the age of two years improved the reading accuracy of pediatric chest X-rays (CXRs) by adult-trained models from 88% to 97%. AI faces challenges beyond the development of age-specific models. Substantial heterogeneity, limited pediatric-specific datasets, and unresolved medicolegal responsibility further restrict adoption worldwide. Challenges are amplified in LMICs, where unstable electricity, limited radiology resources, weak digital infrastructure, and scarce pediatric providers limit implementation. Additionally, many large language models underperform and lack inclusive algorithms suitable for pediatric radiology in many LMICs. Conclusions: AI can enhance diagnostic accuracy, efficiency, and access to pediatric imaging, particularly in resource-limited settings, through task-shifting and decision support. However, it cannot replace pediatric radiologists as of today. Safe adoption requires pediatric-specific model development, standardized validation metrics, diverse datasets that include LMIC populations, stronger digital infrastructure, robust radiologist training in AI capabilities, and the establishment of clear guidelines and medicolegal policies. Full article

20 pages, 2378 KB  
Article
Beyond Accuracy: A Multi-dimensional Cognitive Audit of Medical Large Vision–Language Models in Fundus Image Interpretation
by Jingling Zhang, Shuting Zheng, Xiangfei Liu and Jia Gu
Appl. Sci. 2026, 16(12), 6064; https://doi.org/10.3390/app16126064 - 15 Jun 2026
Viewed by 142
Abstract
Reliance on standalone accuracy limits credible assessment of fundus-focused large vision–language models (LVLMs), as high scores often stem from linguistic shortcuts rather than real visual reasoning. This work develops the Cognitive Audit Framework (CAF), a four-module automated auditing pipeline that dissects model reasoning flaws: Visual–Linguistic Decoupling (textual dependency via modality ablation), Hierarchical Logical Consistency (lesion–diagnosis contradiction detection), Reasoning Fidelity Gap (chain-of-thought unfaithfulness scoring), and Contextual Robustness (positional bias under option permutation). Experiments on six 7B–31B LVLMs over FunBench reveal a notable gap between benchmark accuracy and reasoning quality: high accuracy coexists with measurable textual dependency, logical inconsistencies across diagnostic levels, limited chain-of-thought faithfulness, and non-trivial positional sensitivity. CAF serves as a reproducible complement to pure accuracy metrics for validating clinical competence of ophthalmic multimodal models. Full article
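To make the idea of positional bias under option permutation concrete, the sketch below measures how often a model's chosen answer changes when the same options are merely reordered. It is a toy illustration, not the CAF's actual Contextual Robustness metric: the function `positional_sensitivity`, the two stand-in "models", and the option labels are all hypothetical.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def positional_sensitivity(
    answer_fn: Callable[[List[str]], int], options: Sequence[str]
) -> float:
    """Fraction of option orderings whose selected option differs from the
    most frequently selected option (0.0 means order never matters)."""
    choices = []
    for order in permutations(options):
        picked_index = answer_fn(list(order))  # model returns an index
        choices.append(order[picked_index])    # map index back to content
    most_common = max(set(choices), key=choices.count)
    return 1 - choices.count(most_common) / len(choices)


# Hypothetical stand-ins for a model: one always picks the first slot,
# the other picks by content regardless of position.
def always_first(opts: List[str]) -> int:
    return 0


def content_based(opts: List[str]) -> int:
    return opts.index("glaucoma")


options = ["glaucoma", "cataract", "normal"]
biased = positional_sensitivity(always_first, options)
robust = positional_sensitivity(content_based, options)
print(f"always_first: {biased:.2f}, content_based: {robust:.2f}")
```

A purely position-driven responder scores well above zero, while a responder that tracks option content scores exactly zero, which is the contrast such an audit is meant to expose.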

30 pages, 5804 KB  
Article
How Does Progressive Visual Feedback Enhance Controllability? An Empirical Study of LLM-Driven, Culturally Sensitive Sustainable Rural Landscape Design
by Chang-Yu Liu, Xuan-Qi Qiao, Yan-Qiang Ding and Zhen-Chao Zhao
Sustainability 2026, 18(12), 6160; https://doi.org/10.3390/su18126160 - 15 Jun 2026
Viewed by 199
Abstract
As artificial intelligence (AI) becomes increasingly important in rural revitalization, building consensus among multiple stakeholders and developing participatory digital co-creation platforms have become urgent tasks. However, existing large language model (LLM) systems predominantly adopt a one-shot generation paradigm, which makes it difficult to capture villagers’ cultural aspirations accurately and frequently results in a significant disconnect between design outputs and community expectations. This situation reveals deficiencies in progressive deliberation mechanisms and cultural controllability. To address these issues, this study proposes a multimodal Participatory Landscape Demand Generation (PLDG) system to enhance AI-generated dialogue controllability, facilitate effective cultural translation in sensitive rural contexts, and promote sustainable development where landscape design both drives and reflects rural revitalization. The system leverages LLMs to simulate stakeholder participatory interactions in village landscape design scenarios. Using culturally distinctive Chinese villages as case studies, the research conducts multi-role simulated dialogues, multimodal semantic extraction, and iterative consensus-building, and evaluates the resultant data to generate landscape design proposals. The results indicate that the PLDG system significantly improves participation efficiency among diverse design stakeholders and enhances the sustainability of design decisions. Compared to conventional methods, metrics such as cultural compatibility, villager participation, and design innovation show substantial improvements. These findings demonstrate the considerable potential of human-AI collaboration in future rural planning. 
This study introduces the Culture Constraint-Driven Rural Landscape AI Collaborative Design Framework (PLDG), validating its practical efficacy in identifying culturally sensitive elements, ensuring cultural congruence, facilitating community participation, and fostering design innovation. Consequently, it provides a reusable, iterative operational tool for the digital renewal of sustainable rural landscapes. Full article
(This article belongs to the Section Tourism, Culture, and Heritage)
