Search Results (36)

Search Parameters:
Keywords = visual symbolic reasoning

28 pages, 731 KB  
Article
Research on an Automatic Classification Method for Art Film Scenes Based on Image and Audio Deep Features
by Zhaojun An and Heinz D. Fill
Appl. Sci. 2025, 15(23), 12603; https://doi.org/10.3390/app152312603 - 28 Nov 2025
Viewed by 349
Abstract
This paper addresses the challenging task of automatic scene classification in art films, a genre characterized by symbolic visuals, asynchronous audio, and non-linear storytelling. We propose Styloformer, a multimodal transformer architecture designed to integrate visual, auditory, textual, and curatorial signals into a unified representation space. The model combines cross-modal attention, stylistic clustering, influence prediction, and canonicality estimation to handle the semantic and historical complexity of art cinema. Additionally, we introduce a novel module called Historiographic Navigation, which embeds ontological priors and temporal logic to support interpretive reasoning. Evaluated on multiple benchmarks, Styloformer achieves state-of-the-art performance, including 91.85% accuracy and 94.31% AUC on the MovieNet dataset—outperforming baselines such as CLIP and ViT. Ablation studies further demonstrate the importance of each architectural component. Unlike general-purpose video models, our system is tailored to the aesthetic and narrative structure of art films, making it suitable for applications in digital curation and computational film analysis. Styloformer represents a scalable and interpretable approach to understanding artistic media, bridging machine learning with art historical reasoning.
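The abstract names cross-modal attention as the mechanism that fuses the visual, auditory, textual, and curatorial streams. As a generic illustration of that mechanism, not the authors' implementation (the dimensions, the residual-plus-norm arrangement, and the PyTorch framing are all assumptions), visual tokens can attend to audio tokens like this:

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Visual tokens attend to audio tokens; a generic fusion pattern."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # Query: visual tokens; key/value: audio tokens.
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)  # residual connection

# Toy usage: 8 visual tokens and 12 audio tokens per scene.
block = CrossModalBlock()
v = torch.randn(1, 8, 256)
a = torch.randn(1, 12, 256)
print(block(v, a).shape)  # torch.Size([1, 8, 256])
```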

39 pages, 4244 KB  
Article
A Neuro-Symbolic Multi-Agent Architecture for Digital Transformation of Psychological Support Systems via Artificial Neurotransmitters and Archetypal Reasoning
by Gerardo Iovane, Iana Fominska and Raffaella Di Pasquale
Algorithms 2025, 18(11), 721; https://doi.org/10.3390/a18110721 - 15 Nov 2025
Viewed by 704
Abstract
The digital transformation in the treatment of mental health and emotional disharmony requires artificial intelligence architectures that overcome the limitations of purely neural approaches, such as temporal inconsistency, opacity, and lack of theoretical foundations. Assuming the existence and use of generalist LLMs currently used in clinical settings and considering the appropriate limitations indicated by experts, this article aims to offer clinicians an alternative Neuro-symbolic-Psychological multi-agent architecture (NSPA-AI), which integrates archetypal symbolic reasoning with neurobiological modelling, based on our established framework of artificial neurotransmitters for the modelling and analysis of affective-emotional stimuli to enable interpretable AI-assisted psychological intervention. The system implements a hub-and-spoke topology that coordinates five specialized agents (symbolic, psychological, neurofunctional, decision fusion, learning) that process heterogeneous information via SPADE protocols. Seven archetypal constructs from Jungian psychology and narrative identity theory provide stable symbolic frameworks for longitudinal therapeutic consistency. An empirical study of 156 university students demonstrated significant improvements in depression (Cohen’s d = 1.03), stress (d = 0.89), and narrative identity integration (d = 0.75), which were maintained at a 12-week follow-up and superior to GPT-4 controls (d = 0.34). Neurofunctional correlations—downregulation of cortisol (r = 0.71 with stress reduction), increase in serotonin (r = −0.68 with depression improvement)—validated the neurobiological basis of the entropy-energy framework. Qualitative analysis revealed the following four mechanisms of improvement: symbolic emotional support (93%), increased self-awareness through neurotransmitter visualization (84%), non-judgmental AI interaction (98%), and archetypal narrative organization (87%). The results establish that neuro-symbolic architectures are viable alternatives to large language models for digital mental health, providing the interpretability and clinical validity essential for adoption in the healthcare sector.
(This article belongs to the Special Issue Algorithms in Multi-Sensor Imaging and Fusion)
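The abstract describes five specialized agents coordinated over SPADE protocols in a hub-and-spoke topology. A minimal sketch of that pattern with the SPADE library follows; the agent JIDs, message fields, and routing logic are placeholders, and a running XMPP server would be required, so treat this as a shape, not the paper's system:

```python
# Requires: pip install spade, and a reachable XMPP server; all JIDs and
# credentials below are placeholders.
from spade.agent import Agent
from spade.behaviour import CyclicBehaviour
from spade.message import Message

SPOKES = ["symbolic@server", "psychological@server", "neurofunctional@server",
          "fusion@server", "learning@server"]

class RouteStimuli(CyclicBehaviour):
    async def run(self):
        msg = await self.receive(timeout=10)  # incoming affective stimulus
        if msg is None:
            return
        # Hub-and-spoke: fan the stimulus out to every specialist agent.
        for jid in SPOKES:
            fwd = Message(to=jid, body=msg.body)
            fwd.set_metadata("performative", "inform")
            await self.send(fwd)

class HubAgent(Agent):
    async def setup(self):
        self.add_behaviour(RouteStimuli())
```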

37 pages, 4859 KB  
Review
Eyes of the Future: Decoding the World Through Machine Vision
by Svetlana N. Khonina, Nikolay L. Kazanskiy, Ivan V. Oseledets, Roman M. Khabibullin and Artem V. Nikonorov
Technologies 2025, 13(11), 507; https://doi.org/10.3390/technologies13110507 - 7 Nov 2025
Viewed by 3238
Abstract
Machine vision (MV) is reshaping numerous industries by giving machines the ability to understand what they “see” and respond without human intervention. This review brings together the latest developments in deep learning (DL), image processing, and computer vision (CV). It focuses on how these technologies are being applied in real operational environments. We examine core methodologies such as feature extraction, object detection, image segmentation, and pattern recognition. These techniques are accelerating innovation in key sectors, including healthcare, manufacturing, autonomous systems, and security. A major emphasis is placed on the deepening integration of artificial intelligence (AI) and machine learning (ML) into MV. We particularly consider the impact of convolutional neural networks (CNNs), generative adversarial networks (GANs), and transformer architectures on the evolution of visual recognition capabilities. Beyond surveying advances, this review also takes a hard look at the field’s persistent roadblocks, above all the scarcity of high-quality labeled data, the heavy computational load of modern models, and the unforgiving time limits imposed by real-time vision applications. In response to these challenges, we examine a range of emerging fixes: leaner algorithms, purpose-built hardware (like vision processing units and neuromorphic chips), and smarter ways to label or synthesize data that sidestep the need for massive manual operations. What distinguishes this paper, however, is its emphasis on where MV is headed next. We spotlight nascent directions, including edge-based processing that moves intelligence closer to the sensor, early explorations of quantum methods for visual tasks, and hybrid AI systems that fuse symbolic reasoning with DL, not as speculative futures but as tangible pathways already taking shape. Ultimately, the goal is to connect cutting-edge research with actual deployment scenarios, offering a grounded, actionable guide for those working at the front lines of MV today.
(This article belongs to the Section Information and Communication Technologies)

24 pages, 3721 KB  
Article
Interactive Environment-Aware Planning System and Dialogue for Social Robots in Early Childhood Education
by Jiyoun Moon and Seung Min Song
Appl. Sci. 2025, 15(20), 11107; https://doi.org/10.3390/app152011107 - 16 Oct 2025
Viewed by 466
Abstract
In this study, we propose an interactive environment-aware dialog and planning system for social robots in early childhood education, aimed at supporting the learning and social interaction of young children. The proposed architecture consists of three core modules. First, semantic simultaneous localization and mapping (SLAM) accurately perceives the environment by constructing a semantic scene representation that includes attributes such as position, size, color, purpose, and material of objects, as well as their positional relationships. Second, the automated planning system enables stable task execution even in changing environments through planning domain definition language (PDDL)-based planning and replanning capabilities. Third, the visual question answering module leverages scene graphs and SPARQL conversion of natural language queries to answer children’s questions and engage in context-based conversations. The experiment conducted in a real kindergarten classroom with children aged 6 to 7 years validated the accuracy of object recognition and attribute extraction for semantic SLAM, the task success rate of the automated planning system, and the natural language question answering performance of the visual question answering (VQA) module. The experimental results confirmed the proposed system’s potential to support natural social interaction with children and its applicability as an educational tool.
(This article belongs to the Special Issue Robotics and Intelligent Systems: Technologies and Applications)
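The VQA module reportedly converts natural-language questions into SPARQL over a scene graph. A toy rendering of that idea with rdflib, where the namespace, triples, and query are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF

SCENE = Namespace("http://example.org/scene#")  # hypothetical namespace
g = Graph()

# A tiny scene graph of the kind semantic SLAM might emit.
g.add((SCENE.ball1, RDF.type, SCENE.Ball))
g.add((SCENE.ball1, SCENE.color, Literal("red")))
g.add((SCENE.ball1, SCENE.onTopOf, SCENE.table1))

# "What color is the ball?" after conversion to SPARQL.
q = """
SELECT ?color WHERE {
    ?obj a <http://example.org/scene#Ball> ;
         <http://example.org/scene#color> ?color .
}
"""
for row in g.query(q):
    print(row.color)  # -> red
```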

23 pages, 4988 KB  
Article
Contextual Object Grouping (COG): A Specialized Framework for Dynamic Symbol Interpretation in Technical Security Diagrams
by Jan Kapusta, Waldemar Bauer and Jerzy Baranowski
Algorithms 2025, 18(10), 642; https://doi.org/10.3390/a18100642 - 10 Oct 2025
Viewed by 503
Abstract
This paper introduces Contextual Object Grouping (COG), a specific computer vision framework that enables automatic interpretation of technical security diagrams through dynamic legend learning for intelligent sensing applications. Unlike traditional object detection approaches that rely on post-processing heuristics to establish relationships between the detected elements, COG embeds contextual understanding directly into the detection process by treating spatially and functionally related objects as unified semantic entities. We demonstrate this approach in the context of Cyber-Physical Security Systems (CPPS) assessment, where the same symbol may represent different security devices across different designers and projects. Our proof-of-concept implementation using YOLOv8 achieves robust detection of legend components (mAP50 ≈ 0.99, mAP50–95 ≈ 0.81) and successfully establishes symbol–label relationships for automated security asset identification. The framework introduces a new ontological class—the contextual COG class that bridges atomic object detection and semantic interpretation, enabling intelligent sensing systems to perceive context rather than infer it through post-processing reasoning. This proof-of-concept appears to validate the COG hypothesis and suggests new research directions for structured visual understanding in smart sensing environments, with applications potentially extending to building automation and cyber-physical security assessment.
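COG's pairing of a legend symbol with its text label can be approximated post hoc from YOLOv8 detections by nearest-center matching, which is only a rough stand-in for the unified detection the paper proposes; the weights file, class ids, and image path below are hypothetical:

```python
# Assumes: pip install ultralytics, plus a model fine-tuned to detect legend
# symbols (class 0) and text labels (class 1); weights, ids, and the image
# path are hypothetical.
from ultralytics import YOLO

model = YOLO("cog_legend.pt")
boxes = model("security_floorplan.png")[0].boxes

def center(b):
    x1, y1, x2, y2 = b
    return ((x1 + x2) / 2, (y1 + y2) / 2)

dets = list(zip(boxes.xyxy.tolist(), boxes.cls.tolist()))
symbols = [b for b, c in dets if c == 0]
labels = [b for b, c in dets if c == 1]

# Fuse each legend symbol with its nearest text label into one COG entity.
for s in symbols:
    if not labels:
        break
    sx, sy = center(s)
    nearest = min(labels, key=lambda l: (center(l)[0] - sx) ** 2
                                        + (center(l)[1] - sy) ** 2)
    print("COG pair:", s, "<->", nearest)
```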

18 pages, 1040 KB  
Article
Analysis of Curricular Treatment of the Relationship Between Area and Perimeter in Two U.S. Curricula
by Jane-Jane Lo and Lili Zhou
Educ. Sci. 2025, 15(10), 1342; https://doi.org/10.3390/educsci15101342 - 10 Oct 2025
Viewed by 618
Abstract
This study examines how two widely used elementary mathematics curricula, Bridges in Mathematics and Eureka Math, support grade 3 students’ conceptual understanding of the relationship between area and perimeter. Drawing on the mathematical treatment and emphasis component of the analytic framework, we identified distinct instructional strategies and learning opportunities: Bridges in Mathematics emphasizes hands-on exploration, pattern recognition, and student-led reasoning using real-world contexts. In contrast, Eureka Math employs a more structured and symbolic approach, using multiplication, factor pairs, and line plots to support generalization and data-driven reasoning. Both curricula share strengths such as the use of visual supports, real-world contexts, and attention to student reasoning, yet they differ in how they scaffold conceptual development. Rather than recommending one curriculum over the other, the study highlights how each curriculum sequences ideas and supports mathematical reasoning, offering insights into curricular design and the learning experiences they foster.
(This article belongs to the Special Issue Curriculum Development in Mathematics Education)
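The mathematical relationship both curricula build toward fits in a few lines: rectangles sharing one measure need not share the other. A quick factor-pair enumeration in Python, loosely in the spirit of Eureka Math's multiplication-based treatment:

```python
# Same area, different perimeters: the factor-pair view of rectangles.
AREA = 24
for w in range(1, AREA + 1):
    if AREA % w == 0:
        h = AREA // w
        if w <= h:  # skip mirrored orientations
            print(f"{w} x {h}: area = {AREA}, perimeter = {2 * (w + h)}")
# 1 x 24 gives perimeter 50; 4 x 6 gives perimeter 20.
```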

14 pages, 954 KB  
Article
A Benchmark for Symbolic Reasoning from Pixel Sequences: Grid-Level Visual Completion and Correction
by Lei Kang, Xuanshuo Fu, Mohamed Ali Souibgui, Andrey Barsky, Lluis Gomez, Javier Vazquez-Corral, Alicia Fornés, Ernest Valveny and Dimosthenis Karatzas
Mathematics 2025, 13(17), 2851; https://doi.org/10.3390/math13172851 - 4 Sep 2025
Viewed by 873
Abstract
Grid-structured visual data such as forms, tables, and game boards require models that pair pixel-level perception with symbolic consistency under global constraints. Recent Pixel Language Models (PLMs) map images to token sequences with promising flexibility, yet we find they generalize poorly when observable evidence becomes sparse or corrupted. We present GridMNIST-Sudoku, a benchmark that renders large numbers of Sudoku instances with style-diverse handwritten digits and provides parameterized stress tracks for two tasks: Completion (predict missing cells) and Correction (detect and repair incorrect cells) across difficulty levels ranging from 1 to 90 altered positions in a 9 × 9 grid. Attention diagnostics on PLMs trained with conventional one-dimensional positional encodings reveal weak structure awareness outside the natural Sudoku sparsity band. Motivated by these findings, we propose a lightweight Row-Column-Box (RCB) positional prior that injects grid-aligned coordinates and combine it with simple sparsity and corruption augmentations. Trained only on the natural distribution, the resulting model substantially improves out-of-distribution accuracy across wide sparsity and corruption ranges while maintaining strong in-distribution performance.
(This article belongs to the Section E1: Mathematics and Computer Science)
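The Row-Column-Box prior is easy to reproduce in spirit: each of the 81 cells gets the sum of a row, a column, and a 3 × 3 box embedding. A minimal PyTorch sketch, with the embedding width assumed and the integration into the PLM omitted:

```python
import torch
import torch.nn as nn

class RCBPrior(nn.Module):
    """Sum of row, column, and 3x3-box embeddings for each of the 81 cells."""
    def __init__(self, dim=64):  # width is an assumption, not the paper's value
        super().__init__()
        self.row = nn.Embedding(9, dim)
        self.col = nn.Embedding(9, dim)
        self.box = nn.Embedding(9, dim)

    def forward(self):
        idx = torch.arange(81)
        r, c = idx // 9, idx % 9
        b = (r // 3) * 3 + (c // 3)  # box index 0..8
        return self.row(r) + self.col(c) + self.box(b)  # (81, dim)

prior = RCBPrior()
print(prior().shape)  # torch.Size([81, 64]) -- added to cell token embeddings
```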

32 pages, 6394 KB  
Article
Neuro-Bridge-X: A Neuro-Symbolic Vision Transformer with Meta-XAI for Interpretable Leukemia Diagnosis from Peripheral Blood Smears
by Fares Jammal, Mohamed Dahab and Areej Y. Bayahya
Diagnostics 2025, 15(16), 2040; https://doi.org/10.3390/diagnostics15162040 - 14 Aug 2025
Cited by 4 | Viewed by 1262
Abstract
Background/Objectives: Acute Lymphoblastic Leukemia (ALL) poses significant diagnostic challenges due to its ambiguous symptoms and the limitations of conventional methods like bone marrow biopsies and flow cytometry, which are invasive, costly, and time-intensive. Methods: This study introduces Neuro-Bridge-X, a novel neuro-symbolic hybrid model designed for automated, explainable ALL diagnosis using peripheral blood smear (PBS) images. Leveraging two comprehensive datasets, ALL Image (3256 images from 89 patients) and C-NMC (15,135 images from 118 patients), the model integrates deep morphological feature extraction, vision transformer-based contextual encoding, fuzzy logic-inspired reasoning, and adaptive explainability. To address class imbalance, advanced data augmentation techniques were applied, ensuring equitable representation across benign and leukemic classes. The proposed framework was evaluated through 5-fold cross-validation and fixed train-test splits, employing Nadam, SGD, and Fractional RAdam optimizers. Results: Results demonstrate exceptional performance, with SGD achieving near-perfect accuracy (1.0000 on ALL, 0.9715 on C-NMC) and robust generalization, while Fractional RAdam closely followed (0.9975 on ALL, 0.9656 on C-NMC). Nadam, however, exhibited inconsistent convergence, particularly on C-NMC (0.5002 accuracy). A Meta-XAI controller enhances interpretability by dynamically selecting optimal explanation strategies (Grad-CAM, SHAP, Integrated Gradients, LIME), ensuring clinically relevant insights into model decisions. Conclusions: Visualizations confirm that SGD and RAdam models focus on morphologically critical features, such as leukocyte nuclei, while Nadam struggles with spurious attributions. Neuro-Bridge-X offers a scalable, interpretable solution for ALL diagnosis, with potential to enhance clinical workflows and diagnostic precision in oncology.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
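The evaluation pairs 5-fold cross-validation with class-imbalance handling. A generic scikit-learn sketch of the stratified folding, with random placeholders standing in for the PBS image features and labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Random placeholders; in the paper these would be PBS image features.
X = np.random.rand(200, 32)
y = np.random.randint(0, 2, size=200)  # benign vs. leukemic

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Stratification keeps the class ratio stable in every fold, which
    # matters under the class imbalance the authors describe.
    print(f"fold {fold}: test size {len(test_idx)}, "
          f"positive ratio {y[test_idx].mean():.2f}")
```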

31 pages, 2406 KB  
Article
Enhancing Mathematical Knowledge Graphs with Large Language Models
by Antonio Lobo-Santos and Joaquín Borrego-Díaz
Modelling 2025, 6(3), 53; https://doi.org/10.3390/modelling6030053 - 24 Jun 2025
Viewed by 1736
Abstract
The rapid growth in scientific knowledge has created a critical need for advanced systems capable of managing mathematical knowledge at scale. This study presents a novel approach that integrates ontology-based knowledge representation with large language models (LLMs) to automate the extraction, organization, and reasoning of mathematical knowledge from LaTeX documents. The proposed system enhances Mathematical Knowledge Management (MKM) by enabling structured storage, semantic querying, and logical validation of mathematical statements. The key innovations include a lightweight ontology for modeling hypotheses, conclusions, and proofs, and algorithms for optimizing assumptions and generating pseudo-demonstrations. A user-friendly web interface supports visualization and interaction with the knowledge graph, facilitating tasks such as curriculum validation and intelligent tutoring. The results demonstrate high accuracy in mathematical statement extraction and ontology population, with potential scalability for handling large datasets. This work bridges the gap between symbolic knowledge and data-driven reasoning, offering a robust solution for scalable, interpretable, and precise MKM.
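The lightweight ontology models hypotheses, conclusions, and proofs as linked statements that can then be queried semantically. A toy rdflib version, with the IRI, property names, and theorem invented for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF

MATH = Namespace("http://example.org/math#")  # hypothetical ontology IRI
g = Graph()

# One theorem node linked to its hypotheses and conclusion.
g.add((MATH.thm1, RDF.type, MATH.Theorem))
g.add((MATH.thm1, MATH.hasHypothesis, Literal("f is continuous on [a, b]")))
g.add((MATH.thm1, MATH.hasHypothesis, Literal("f(a) * f(b) < 0")))
g.add((MATH.thm1, MATH.hasConclusion, Literal("f has a root in (a, b)")))

# Semantic query: which theorems assume continuity?
q = """
SELECT ?t WHERE {
    ?t <http://example.org/math#hasHypothesis> ?h .
    FILTER CONTAINS(STR(?h), "continuous")
}
"""
for row in g.query(q):
    print(row.t)
```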

18 pages, 14746 KB  
Article
PRJ: Perception–Retrieval–Judgement for Generated Images
by Qiang Fu, Zonglei Jing, Zonghao Ying and Xiaoqian Li
Electronics 2025, 14(12), 2354; https://doi.org/10.3390/electronics14122354 - 9 Jun 2025
Viewed by 865
Abstract
The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception–Retrieval–Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.
(This article belongs to the Special Issue Trustworthy Deep Learning in Practice)
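The dynamic scoring mechanism weights per-category harm evidence by a contextual toxicity risk matrix. A schematic numpy version in which the categories, contexts, and every number are illustrative rather than the paper's values:

```python
import numpy as np

# Hypothetical axes: harm categories x context factors; weights are made up.
CATEGORIES = ["violence", "hate", "sexual", "copyright"]
CONTEXTS = ["news", "satire", "advocacy"]

risk_matrix = np.array([  # severity multiplier per (category, context)
    [0.6, 0.4, 1.0],
    [0.8, 0.5, 1.0],
    [0.9, 0.7, 1.0],
    [0.3, 0.3, 0.8],
])

def toxicity_score(evidence, context):
    """evidence: per-category strengths in [0, 1] from the judgement stage."""
    j = CONTEXTS.index(context)
    return float(np.dot(evidence, risk_matrix[:, j]))

print(toxicity_score(np.array([0.1, 0.0, 0.0, 0.9]), "news"))  # 0.33
```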

17 pages, 244 KB  
Hypothesis
Proprioceptive Resonance and Multimodal Semiotics: Readiness to Act, Embodied Cognition, and the Dynamics of Meaning
by Marco Sanna
NeuroSci 2025, 6(2), 42; https://doi.org/10.3390/neurosci6020042 - 12 May 2025
Cited by 2 | Viewed by 2825
Abstract
This paper proposes a theoretical model of meaning-making grounded in proprioceptive awareness and embodied imagination, arguing that human cognition is inherently multimodal, anticipatory, and sensorimotor. Drawing on Peircean semiotics, Lotman’s model of cultural cognition, and current research in neuroscience, we show that readiness to act—a proprioceptively grounded anticipation of movement—plays a fundamental role in the emergence of meaning, from perception to symbolic abstraction. Contrary to traditional approaches that reduce language to a purely symbolic or visual system, we argue that meaning arises through the integration of sensory, motor, and affective processes, structured by axial proprioceptive coordinates (vertical, horizontal, sagittal). Using Peirce’s triadic model of interpretants, we identify proprioception as the modulatory interface between sensory stimuli, emotional response, and logical reasoning. A study on skilled pianists supports this view, showing that mental rehearsal without physical execution improves performance via motor anticipation. We define this process as proprioceptive resonance, a dynamic synchronization of embodied states that enables communication, language acquisition, and social intelligence. This framework allows for a critique of linguistic abstraction and contributes to ongoing debates in semiotics, enactive cognition, and the origin of syntax, challenging the assumption that symbolic thought precedes embodied experience.
(This article belongs to the Topic Language: From Hearing to Speech and Writing)
12 pages, 233 KB  
Article
The Colors of Curiosity: Ekphrasis from Marguerite de Navarre to María de Zayas’ Tarde llega el desengaño
by Frederick A. De Armas
Humanities 2025, 14(4), 85; https://doi.org/10.3390/h14040085 - 9 Apr 2025
Viewed by 1314
Abstract
María de Zayas’ Tarde llega el desengaño, the fourth tale in her Desengaños amorosos (1641), is one of the most studied novellas in the collection. The reader’s curiosity may stem in part from the main model for the tale, the Apuleian story of Cupid and Psyche, which has curiositas as its central motivation. Nevertheless, this essay argues that one of the reasons that the tale has attracted so much attention has to do with the vividness of its scenes, the chromatic design that Zayas uses to write for the eyes and the relationship of these topics to curiosity. The text induces characters and readers to marvel not only at a colorful scene but also to seek to understand the choice of colors in eight striking ekphrases in the novella. These colors color emotions and arouse our curiosity regarding scene, symbol, shade, and character. In addition, Zayas alludes to a painting included in one of Marguerite de Navarre’s novellas to further arouse curiosity and visual memory.
(This article belongs to the Special Issue Curiosity and Modernity in Early Modern Spain)
23 pages, 3133 KB  
Article
Integrating Textual Queries with AI-Based Object Detection: A Compositional Prompt-Guided Approach
by Silvan Ferreira, Allan Martins, Daniel G. Costa and Ivanovitch Silva
Sensors 2025, 25(7), 2258; https://doi.org/10.3390/s25072258 - 3 Apr 2025
Cited by 1 | Viewed by 1170
Abstract
While object detection and recognition have been extensively adopted by many applications in decision-making, new algorithms and methodologies have emerged to enhance the automatic identification of target objects. In particular, the rise of deep learning and language models has opened many possibilities in this area, although challenges in contextual query analysis and human interactions persist. This article presents a novel neuro-symbolic object detection framework that aligns object proposals with textual prompts using a deep learning module while enabling logical reasoning through a symbolic module. By integrating deep learning with symbolic reasoning, object detection and scene understanding are considerably enhanced, enabling complex, query-driven interactions. Using a synthetic 3D image dataset, the results demonstrate that this framework effectively generalizes to complex queries, combining simple attribute-based descriptions without explicit training on compound prompts. We present the numerical results and comprehensive discussions, highlighting the potential of our approach for emerging smart applications.
(This article belongs to the Special Issue Digital Imaging Processing, Sensing, and Object Recognition)
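The compositional generalization the authors report, answering compound queries such as "small blue cube" without training on compound prompts, falls out naturally if a query is treated as a conjunction of attribute predicates over neural detections. A toy sketch with invented records:

```python
# Invented detection records standing in for the neural module's output.
detections = [
    {"label": "cube",   "color": "red",  "size": "large"},
    {"label": "sphere", "color": "blue", "size": "small"},
    {"label": "cube",   "color": "blue", "size": "small"},
]

def matches(det, query):
    # Compound queries are a logical AND over attribute constraints.
    return all(det.get(k) == v for k, v in query.items())

# "small blue cube": a compound query built from simple attributes.
query = {"label": "cube", "color": "blue", "size": "small"}
print([d for d in detections if matches(d, query)])
```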

26 pages, 44831 KB  
Article
Challenges in Generating Accurate Text in Images: A Benchmark for Text-to-Image Models on Specialized Content
by Zenab Bosheah and Vilmos Bilicki
Appl. Sci. 2025, 15(5), 2274; https://doi.org/10.3390/app15052274 - 20 Feb 2025
Cited by 2 | Viewed by 12094
Abstract
Rapid advances in text-to-image (T2I) generative models have significantly enhanced visual content creation. However, evaluating these models remains challenging, particularly when assessing their ability to handle complex textual content. The primary aim of this research is to develop a systematic evaluation framework for assessing T2I models’ capabilities in generating specialized content, with emphasis on measuring text rendering accuracy and identifying model limitations across diverse domains. The framework utilizes carefully crafted prompts that require precise formatting, semantic alignment, and compositional reasoning to evaluate model performance. Our evaluation methodology encompasses a comprehensive assessment across six critical domains: mathematical equations, chemical diagrams, programming code, flowcharts, multi-line text, and paragraphs, with each domain tested through specifically designed challenge sets. GPT-4 serves as an automated evaluator, assessing outputs based on key metrics such as text accuracy, readability, formatting consistency, visual design, contextual relevance, and error recovery. Weighted scores generated by GPT-4 are compared with human evaluations to measure alignment and reliability. The results reveal that current T2I models face significant challenges with tasks requiring structural precision and domain-specific accuracy. Notable difficulties include symbol alignment in equations, bond angles in chemical diagrams, syntactical correctness in code, and the generation of coherent multi-line text and paragraphs. This study advances our understanding of fundamental limitations in T2I model architectures while establishing a novel framework for the systematic evaluation of text rendering capabilities. Despite these limitations, the proposed benchmark provides a clear pathway for evaluating and tracking improvements in T2I models, establishing a standardized framework for assessing their ability to generate accurate and reliable structured content for specialized applications.
(This article belongs to the Section Computing and Artificial Intelligence)
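The GPT-4 evaluator aggregates the listed metrics into weighted scores. A schematic version of that aggregation; the weights and the 0-10 scale are illustrative, since the paper's values are not given here:

```python
# Metric names from the abstract; the weights themselves are illustrative.
WEIGHTS = {
    "text_accuracy": 0.30, "readability": 0.15, "formatting": 0.15,
    "visual_design": 0.15, "contextual_relevance": 0.15, "error_recovery": 0.10,
}

def weighted_score(ratings):
    """ratings: per-metric scores in [0, 10], e.g. parsed from GPT-4's output."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[m] * ratings[m] for m in WEIGHTS)

sample = {"text_accuracy": 4, "readability": 7, "formatting": 6,
          "visual_design": 8, "contextual_relevance": 7, "error_recovery": 5}
print(weighted_score(sample))  # 5.9
```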

24 pages, 6475 KB  
Article
Towards AI-Assisted Mapmaking: Assessing the Capabilities of GPT-4o in Cartographic Design
by Abdulkadir Memduhoğlu
ISPRS Int. J. Geo-Inf. 2025, 14(1), 35; https://doi.org/10.3390/ijgi14010035 - 17 Jan 2025
Cited by 6 | Viewed by 4125
Abstract
Cartographic design is fundamental to effective mapmaking, requiring adherence to principles such as visual hierarchy, symbolization, and color theory to convey spatial information accurately and intuitively. While Artificial Intelligence (AI) and Large Language Models (LLMs) have transformed various fields, their application in cartographic design remains underexplored. This study assesses the capabilities of a multimodal advanced LLM, GPT-4o, in understanding and suggesting cartographic design elements, focusing on adherence to established cartographic principles. Two assessments were conducted: a text-to-text evaluation and an image-to-text evaluation. In the text-to-text assessment, GPT-4o was presented with 15 queries derived from key concepts in cartography, covering classification, symbolization, visual hierarchy, color theory, and typography. Each query was posed multiple times under different temperature settings to evaluate consistency and variability. In the image-to-text evaluation, GPT-4o analyzed maps containing deliberate cartographic errors to assess its ability to identify issues and suggest improvements. The results indicate that GPT-4o demonstrates general reliability in text-based tasks, with variability influenced by temperature settings. The model showed proficiency in classification and symbolization tasks but occasionally deviated from theoretical expectations. In visual hierarchy and layout, the model performed consistently, suggesting appropriate design choices. In the image-to-text assessment, GPT-4o effectively identified critical design flaws such as inappropriate color schemes, poor contrast, and misuse of shape and size variables, offering actionable suggestions for improvement. However, limitations include dependency on input quality and challenges in interpreting nuanced spatial relationships. The study concludes that LLMs like GPT-4o have significant potential in cartographic design, particularly for tasks involving creative exploration and routine design support. Their ability to critique and generate cartographic elements positions them as valuable tools for enhancing human expertise. Further research is recommended to enhance their spatial reasoning capabilities and expand their use of visual variables beyond color, thereby improving their applicability in professional cartographic workflows.
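Posing each query multiple times under different temperature settings, as the text-to-text assessment does, is a short loop with the OpenAI Python client; the temperatures and the prompt are illustrative, not the study's:

```python
# Requires: pip install openai and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
QUERY = "Suggest a color scheme for a choropleth map of population density."

for temp in (0.2, 0.7, 1.2):  # illustrative settings, not the study's values
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": QUERY}],
        temperature=temp,
    )
    print(f"T={temp}: {resp.choices[0].message.content[:80]}...")
```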
