Search Results (40)

Search Parameters:
Keywords = scene knowledge reasoning

24 pages, 2114 KB  
Article
An Integrated Framework for Automated Identification of Workers’ Safety Violation Based on Knowledge Graph
by Yifan Zhu, Yewei Ouyang, Rui Pan, Zhanhui Sun, Yang Zhou, Rui Ma, Baoquan Cheng and Wen Wang
Buildings 2026, 16(5), 1037; https://doi.org/10.3390/buildings16051037 - 6 Mar 2026
Viewed by 265
Abstract
Automatic identification of worker safety violations can substantially strengthen construction-site safety management by enabling continuous, real-time monitoring. Although recent advances have made automated detection feasible, many existing systems still suffer from poor adaptability and limited extensibility. To address these limitations, this study proposes an integrated, knowledge graph-based framework for automatic identification of workers’ safety violations. The framework comprises two principal components: (1) a knowledge graph construction module that encodes domain knowledge (safety regulations, task–hazard relationships, and contextual constraints) into a machine-readable graph structure and (2) a graph-enabled violation identification module that maps structured scene descriptions of worker and environmental states to the knowledge graph and performs semantic inference to detect violations. In this study, these structured scene descriptions are manually specified and simulated as subject–predicate–object triplets; integration with raw sensing data is left for future work. For validation, we construct a knowledge graph containing 1200 safety rules and evaluate the violation identification module on 500 annotated examples representing realistic worker scenarios. Using this curated knowledge graph and structured inputs, the proposed approach achieves an identification accuracy of 97.6% for unsafe worker behaviors. Experimental analysis shows that the knowledge graph representation substantially improves the system’s expandability and interpretability compared with traditional hard-coded rules, facilitating easier incorporation of new rules and multimodal sensing inputs. The results indicate that knowledge graph-driven reasoning offers a practical, scalable pathway for robust, context-aware safety violation detection in varied construction environments.
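
As an illustration of the graph-based matching this abstract describes, here is a minimal, hypothetical Python sketch in which manually specified subject–predicate–object triplets are checked against rule patterns. The rule set, predicates, and scene encoding are invented for illustration and do not reflect the paper's actual schema.

```python
# A safety rule encoded as a (subject, predicate, object) context pattern
# plus the condition that makes the observed scene a violation.
# All rule and scene data below are hypothetical.
RULES = [
    {"pattern": ("worker", "located_in", "lifting_zone"),
     "violated_if": ("worker", "wearing_helmet", "false"),
     "description": "Workers in a lifting zone must wear helmets."},
    {"pattern": ("worker", "operating", "scaffold"),
     "violated_if": ("worker", "harness_attached", "false"),
     "description": "Workers on scaffolds must attach safety harnesses."},
]

def find_violations(scene_triplets):
    """Match scene triplets against rule patterns and flag violations."""
    scene = set(scene_triplets)
    violations = []
    for rule in RULES:
        # A rule fires only when its context pattern holds in the scene
        # and the violating condition is also observed.
        if rule["pattern"] in scene and rule["violated_if"] in scene:
            violations.append(rule["description"])
    return violations

# Example structured scene description (manually specified, as in the paper).
scene = [
    ("worker", "located_in", "lifting_zone"),
    ("worker", "wearing_helmet", "false"),
]
print(find_violations(scene))
# ['Workers in a lifting zone must wear helmets.']
```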

23 pages, 1579 KB  
Article
Exploring Difference Semantic Prior Guidance for Remote Sensing Image Change Captioning
by Yunpeng Li, Xiangrong Zhang, Guanchun Wang and Tianyang Zhang
Remote Sens. 2026, 18(2), 232; https://doi.org/10.3390/rs18020232 - 11 Jan 2026
Viewed by 585
Abstract
Understanding complex change scenes is a crucial challenge in the remote sensing field. The remote sensing image change captioning (RSICC) task has emerged as a promising approach for translating the changes that appear between bi-temporal remote sensing images into textual descriptions, enabling users to make accurate decisions. Current RSICC methods frequently encounter difficulties in maintaining contextual consistency and in exploiting semantic prior guidance. Therefore, this study explores a difference semantic prior guidance network that reasons out context-rich sentences capturing the observed visual changes. Specifically, a context-aware difference module is introduced to guarantee the consistency of unchanged/changed context features, strengthening multi-level change information to improve semantic change feature representation. Moreover, to effectively mine higher-level cognitive ability for reasoning about salient/weak changes, we employ difference comprehension over shallow change information to realize semantic change knowledge learning. In addition, the designed parallel cross refined attention in the Transformer decoder balances visual differences and semantic knowledge for implicit knowledge distillation, enabling fine-grained perception of semantic detail changes and reducing pseudo-changes. Compared with advanced algorithms on the LEVIR-CC and Dubai-CC datasets, experimental results validate the outstanding performance of the designed model on RSICC tasks. Notably, on the LEVIR-CC dataset, it reaches a CIDEr score of 143.34%, representing a 3.11% improvement over the most competitive method, SAT-cap.

19 pages, 4107 KB  
Article
Structured Prompting and Collaborative Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
by Yunxiang Yang, Ningning Xu and Jidong J. Yang
Computers 2025, 14(11), 490; https://doi.org/10.3390/computers14110490 - 9 Nov 2025
Cited by 1 | Viewed by 1547
Abstract
Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and multi-agent collaborative knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large vision–language models (VLMs), GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured role-aware supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.
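
The distillation recipe described above can be pictured with a hedged Python sketch: two teacher models produce complementary outputs under a structured chain-of-thought prompt, and the merged record becomes supervision for the student VLM. Both teacher calls are stubbed, and the prompt wording, function names, and record fields are assumptions rather than the paper's implementation.

```python
# Hypothetical structured CoT prompt; the paper's actual prompt is not shown.
COT_PROMPT = (
    "Step 1: List the agents and road context visible in the clip.\n"
    "Step 2: Describe their interactions over time.\n"
    "Step 3: Assess the traffic risk and justify it."
)

def teacher_scene_description(video_frames):
    # Stand-in for a GPT-4o call with COT_PROMPT and the frames attached.
    return "Two vehicles merge onto a wet highway; visibility is low."

def teacher_risk_assessment(scene_description):
    # Stand-in for an o3-mini call reasoning over the description.
    return "Elevated risk: reduced traction and short merge gap."

def build_distillation_record(video_id, video_frames):
    """Merge teacher outputs into one pseudo-annotated training example
    for supervised fine-tuning of the student VLM."""
    desc = teacher_scene_description(video_frames)
    risk = teacher_risk_assessment(desc)
    return {
        "video": video_id,
        "prompt": COT_PROMPT,
        "target": f"{desc}\nRisk assessment: {risk}",
    }

record = build_distillation_record("clip_0001", video_frames=None)
print(record["target"])
```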

16 pages, 579 KB  
Article
IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation
by Donglin Zhang, Weixiang Shi, Boyuan Ma, Weiqing Min and Xiao-Jun Wu
Foods 2025, 14(21), 3697; https://doi.org/10.3390/foods14213697 - 30 Oct 2025
Viewed by 1062
Abstract
In recent years, food nutrition estimation has received growing attention due to its critical role in dietary analysis and public health. Traditional nutrition assessment methods often rely on manual measurements and expert knowledge, which are time-consuming and not easily scalable. With the advancement of computer vision, RGB-based methods have been proposed, and more recently, RGB-D-based approaches have further improved performance by incorporating depth information to capture spatial cues. While these methods have shown promising results, they still face challenges in complex food scenes, such as limited ability to distinguish visually similar items with different ingredients and insufficient modeling of spatial or semantic relationships. To solve these issues, we propose an Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation. The method introduces an ingredient-guided module that encodes ingredient information using a pre-trained language model and aligns it with visual features via cross-modal attention. At the same time, an internal semantic modeling component is designed to enhance structural understanding through dynamic positional encoding and localized attention, allowing for fine-grained relational reasoning. On the Nutrition5k dataset, our method achieves PMAE values of 12.2% for Calories, 9.4% for Mass, 19.1% for Fat, 18.3% for Carb, and 16.0% for Protein. These results demonstrate that our IGSMNet consistently outperforms existing baselines, validating its effectiveness.
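
The ingredient-guided cross-modal attention idea can be sketched in a few lines of PyTorch: visual patch features attend over ingredient text embeddings before regression. The dimensions, pooling, and regression head below are assumptions for illustration, not IGSMNet's actual architecture.

```python
import torch
import torch.nn as nn

class IngredientGuidedFusion(nn.Module):
    """Toy cross-modal fusion: image tokens query ingredient embeddings."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Regression head for the five nutrition targets
        # (calories, mass, fat, carbohydrate, protein).
        self.head = nn.Linear(dim, 5)

    def forward(self, visual_tokens, ingredient_embeddings):
        # Visual tokens query the ingredient embeddings, so the image
        # representation is refined by ingredient semantics.
        fused, _ = self.cross_attn(
            query=visual_tokens,
            key=ingredient_embeddings,
            value=ingredient_embeddings,
        )
        fused = self.norm(fused + visual_tokens)   # residual connection
        return self.head(fused.mean(dim=1))        # pool tokens, then regress

model = IngredientGuidedFusion()
visual = torch.randn(2, 49, 256)       # batch of 7x7 patch features
ingredients = torch.randn(2, 10, 256)  # embeddings of up to 10 ingredients
print(model(visual, ingredients).shape)  # torch.Size([2, 5])
```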
(This article belongs to the Section Food Nutrition)

25 pages, 15383 KB  
Article
SplitGround: Long-Chain Reasoning Split via Modular Multi-Expert Collaboration for Training-Free Scene Knowledge-Guided Visual Grounding
by Xilong Qin, Yue Hu, Wansen Wu, Xinmeng Li and Quanjun Yin
Big Data Cogn. Comput. 2025, 9(8), 209; https://doi.org/10.3390/bdcc9080209 - 14 Aug 2025
Viewed by 1226
Abstract
Scene Knowledge-guided Visual Grounding (SK-VG) is a multi-modal detection task built upon conventional visual grounding (VG) for human–computer interaction scenarios. It utilizes an additional passage of scene knowledge apart from the image and context-dependent textual query for referred object localization. Due to the inherent difficulty in directly establishing correlations between the given query and the image without leveraging scene knowledge, this task imposes significant demands on a multi-step knowledge reasoning process to achieve accurate grounding. Off-the-shelf VG models underperform under such a setting due to the requirement of detailed description in the query and a lack of knowledge inference based on implicit narratives of the visual scene. Recent Vision–Language Models (VLMs) exhibit improved cross-modal reasoning capabilities. However, their monolithic architectures, particularly in lightweight implementations, struggle to maintain coherent reasoning chains across sequential logical deductions, leading to error accumulation in knowledge integration and object localization. To address the above-mentioned challenges, we propose SplitGround—a collaborative framework that strategically decomposes complex reasoning processes by fusing the input query and image with knowledge through two auxiliary modules. Specifically, it implements an Agentic Annotation Workflow (AAW) for explicit image annotation and a Synonymous Conversion Mechanism (SCM) for semantic query transformation. This hierarchical decomposition enables VLMs to focus on essential reasoning steps while offloading auxiliary cognitive tasks to specialized modules, effectively splitting long reasoning chains into manageable subtasks with reduced complexity. Comprehensive evaluations on the SK-VG benchmark demonstrate the significant advancements of our method. Remarkably, SplitGround attains an accuracy improvement of 15.71% on the hard split of the test set over the previous training-required SOTA, using only a compact VLM backbone without fine-tuning, which provides new insights for knowledge-intensive visual grounding tasks.
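
The modular decomposition can be pictured as a short pipeline. In this hedged Python sketch every module is a stub (the real AAW and SCM would wrap VLM and agent calls), and the example query, knowledge passage, and boxes are invented.

```python
def agentic_annotation_workflow(image):
    """AAW stand-in: annotate entities in the image with names drawn from
    the scene-knowledge passage (e.g., via an agent loop over a VLM)."""
    return {"person_left": (40, 60, 180, 300),
            "person_right": (260, 50, 400, 310)}

def synonymous_conversion_mechanism(query, knowledge):
    """SCM stand-in: rewrite the knowledge-dependent query into an explicit
    referring expression that a VLM can ground directly."""
    # e.g., "the manager" -> "the person on the left wearing a grey suit"
    return "the person on the left wearing a grey suit"

def ground(image, explicit_query, annotations):
    """Final grounding step; a real system would match the explicit query
    against the annotated image with a VLM call (stubbed here)."""
    return annotations["person_left"]

def splitground(image, query, knowledge):
    annotations = agentic_annotation_workflow(image)
    explicit_query = synonymous_conversion_mechanism(query, knowledge)
    return ground(image, explicit_query, annotations)

box = splitground(
    image=None,
    query="Where is the manager?",
    knowledge="The manager, in a grey suit, stands left of the intern.")
print(box)  # (40, 60, 180, 300)
```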

31 pages, 3266 KB  
Article
Context-Driven Recommendation via Heterogeneous Temporal Modeling and Large Language Model in the Takeout System
by Wei Deng, Dongyi Hu, Zilong Jiang, Peng Zhang and Yong Shi
Systems 2025, 13(8), 682; https://doi.org/10.3390/systems13080682 - 11 Aug 2025
Viewed by 1208
Abstract
On food delivery platforms, user decisions are often driven by dynamic contextual factors such as time, intent, and lifestyle patterns. Traditional context-aware recommender systems struggle to capture such implicit signals, especially when user behavior spans heterogeneous long- and short-term patterns. To address this, we propose a context-driven recommendation framework that integrates a hybrid sequence modeling architecture with a Large Language Model for post hoc reasoning and reranking. Specifically, the solution tackles several key issues: (1) integration of multimodal features to achieve explicit context fusion through a hybrid fusion strategy; (2) introduction of a context capture layer and a context propagation layer to effectively encode implicit contextual states hidden in heterogeneous long- and short-term behavior; (3) cross-attention mechanisms that facilitate context retrospection, allowing implicit contexts to be uncovered; and (4) leveraging the reasoning capabilities of DeepSeek-R1 as a post-processing step to perform open knowledge-enhanced reranking. Extensive experiments on a real-world dataset show that our approach significantly outperforms strong baselines in both prediction accuracy and Top-K recommendation quality. Case studies further demonstrate the model's ability to uncover nuanced, implicit contextual cues, such as family roles and holiday-specific behaviors, making it particularly effective for personalized, dynamic recommendations in high-frequency scenarios.
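
The LLM reranking step (4) might look roughly like the following Python sketch. The prompt format, the stubbed DeepSeek-R1 call, and the reply parsing are assumptions for illustration, not the paper's implementation.

```python
def build_rerank_prompt(user_context, candidates):
    """Pack the sequence model's top-K candidates and contextual cues
    into a reranking prompt (wording is hypothetical)."""
    lines = [
        "You are reranking takeout recommendations.",
        f"Context: {user_context}",
        "Candidates:",
    ]
    lines += [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    lines.append("Return the candidate numbers, best first, comma-separated.")
    return "\n".join(lines)

def call_llm(prompt):
    # Stand-in for a DeepSeek-R1 API call.
    return "2, 1, 3"

def rerank(user_context, candidates):
    """Parse the LLM's ordering back into a reranked candidate list."""
    reply = call_llm(build_rerank_prompt(user_context, candidates))
    order = [int(tok) - 1 for tok in reply.split(",")]
    return [candidates[i] for i in order]

print(rerank("weekday lunch, office worker, prefers light meals",
             ["hotpot set", "salad bowl", "fried chicken"]))
# ['salad bowl', 'hotpot set', 'fried chicken']
```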

18 pages, 14746 KB  
Article
PRJ: Perception–Retrieval–Judgement for Generated Images
by Qiang Fu, Zonglei Jing, Zonghao Ying and Xiaoqian Li
Electronics 2025, 14(12), 2354; https://doi.org/10.3390/electronics14122354 - 9 Jun 2025
Viewed by 1129
Abstract
The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception–Retrieval–Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.
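
The three-stage flow can be summarized in a toy Python sketch. The stage internals are stubbed, and the harm categories, contexts, and risk-matrix weights below are illustrative assumptions, not the paper's actual rules or values.

```python
RISK_MATRIX = {
    # (harm category, context) -> severity weight (hypothetical values)
    ("violence", "realistic"): 0.9,
    ("violence", "stylized"): 0.4,
    ("hate_symbol", "realistic"): 1.0,
}

def perceive(image):
    """Perception: turn the image into a descriptive caption (VLM stub)."""
    return "a realistic depiction of a street fight"

def retrieve(description):
    """Retrieval: look up harm categories and traits for the description."""
    if "fight" in description:
        context = "realistic" if "realistic" in description else "stylized"
        return [("violence", context)]
    return []

def judge(evidence):
    """Judgement: score severity using the contextual risk matrix."""
    return max((RISK_MATRIX.get(e, 0.0) for e in evidence), default=0.0)

score = judge(retrieve(perceive(image=None)))
print(f"toxicity score: {score}")  # toxicity score: 0.9
```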
(This article belongs to the Special Issue Trustworthy Deep Learning in Practice)

25 pages, 6410 KB  
Article
Multi-View Stereo Using Perspective-Aware Features and Metadata to Improve Cost Volume
by Zongcheng Zuo, Yuanxiang Li, Yu Zhou and Fan Mo
Sensors 2025, 25(7), 2233; https://doi.org/10.3390/s25072233 - 2 Apr 2025
Viewed by 3553
Abstract
Feature matching is pivotal when using multi-view stereo (MVS) to reconstruct dense 3D models from calibrated images. This paper proposes PAC-MVSNet, which integrates perspective-aware convolution (PAC) and metadata-enhanced cost volumes to address the challenges posed by reflective and texture-less regions. PAC dynamically aligns convolutional kernels with scene perspective lines, while metadata (e.g., camera pose distance) enables geometric reasoning during cost aggregation. In PAC-MVSNet, we introduce feature matching with long-range tracking, which applies attention both within individual images and across multiple images to integrate extensive contextual information. To further strengthen this matching, the perspective-aware convolution module directs the convolutional kernel to capture features along perspective lines, extracting perspective-aware features that improve the matching quality. Finally, we crafted a dedicated 2D CNN that fuses image priors, integrating keyframes and geometric metadata within the cost volume to evaluate depth planes. To our knowledge, this is the first attempt to embed existing physical-model knowledge into a network for MVS tasks, and it achieved top performance across multiple benchmark datasets.
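
The idea of sampling features along perspective lines toward a vanishing point can be loosely sketched in PyTorch as follows. This is a simplified stand-in (fixed sampling toward a given vanishing point, fused by channel concatenation), not the PAC module proposed in the paper.

```python
import torch
import torch.nn.functional as F

def perspective_line_features(feat, vanishing_point, k=5, step=0.05):
    """Sample each location's features at k points along the line running
    toward a given vanishing point, approximating perspective-aligned
    receptive fields. feat: (B, C, H, W); vanishing_point: (x, y) in the
    normalized [-1, 1] coordinates used by grid_sample."""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1)                  # (H, W, 2)
    dirs = torch.tensor(vanishing_point) - base           # toward the VP
    samples = []
    for i in range(k):
        grid = (base + dirs * (i * step)).unsqueeze(0).expand(B, -1, -1, -1)
        samples.append(F.grid_sample(feat, grid, align_corners=True))
    # Concatenate along channels; a 1x1 conv would normally fuse these.
    return torch.cat(samples, dim=1)                      # (B, C*k, H, W)

feat = torch.randn(1, 8, 32, 32)
out = perspective_line_features(feat, vanishing_point=(0.0, -0.8))
print(out.shape)  # torch.Size([1, 40, 32, 32])
```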

28 pages, 8967 KB  
Article
Adaptive Global Dense Nested Reasoning Network into Small Target Detection in Large-Scale Hyperspectral Remote Sensing Image
by Siyu Zhan, Yuxuan Yang, Muge Zhong, Guoming Lu and Xinyu Zhou
Remote Sens. 2025, 17(6), 948; https://doi.org/10.3390/rs17060948 - 7 Mar 2025
Cited by 1 | Viewed by 1569
Abstract
Small and dim target detection is a critical challenge in hyperspectral remote sensing, particularly in complex, large-scale scenes where spectral variability across diverse land cover types complicates the detection process. In this paper, we propose a novel target reasoning algorithm named Adaptive Global Dense Nested Reasoning Network (AGDNR). This algorithm integrates spatial, spectral, and domain knowledge to enhance the detection accuracy of small and dim targets in large-scale environments and simultaneously enables reasoning about target categories. The proposed method involves three key innovations. Firstly, we develop a high-dimensional, multi-layer nested U-Net that facilitates cross-layer feature transfer, preserving high-level features of small and dim targets throughout the network. Secondly, we present a novel approach for computing physicochemical parameters, which enhances the spectral characteristics of targets while minimizing environmental interference. Thirdly, we construct a geographic knowledge graph that incorporates both target and environmental information, enabling global target reasoning and more effective detection of small targets across large-scale scenes. Experimental results on three challenging datasets show that our method outperforms state-of-the-art approaches in detection accuracy and achieves successful classification of different small targets. Consequently, the proposed method offers a robust solution for the precise detection of hyperspectral small targets in large-scale scenarios.

23 pages, 1882 KB  
Article
Attention Mechanism-Based Cognition-Level Scene Understanding
by Xuejiao Tang and Wenbin Zhang
Information 2025, 16(3), 203; https://doi.org/10.3390/info16030203 - 5 Mar 2025
Viewed by 1673
Abstract
Given a question–image input, a visual commonsense reasoning (VCR) model predicts an answer with a corresponding rationale, which requires inference abilities grounded in real-world knowledge. The VCR task, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge, is a cognition-level scene understanding challenge. It has attracted researchers' interest due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task have generally relied on pre-training or on memory-based models that encode long-term dependency relationships. However, these approaches suffer from limited generalizability and a loss of information in long sequences. In this work, we propose a parallel attention-based cognitive VCR network, termed PAVCR, which fuses visual–textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, it provides an intuitive interpretation of visual commonsense reasoning.
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)

17 pages, 81622 KB  
Article
A Hierarchical Spatiotemporal Data Model Based on Knowledge Graphs for Representation and Modeling of Dynamic Landslide Scenes
by Juan Li, Jin Zhang, Li Wang and Ao Zhao
Sustainability 2024, 16(23), 10271; https://doi.org/10.3390/su162310271 - 23 Nov 2024
Viewed by 1627
Abstract
Representing and modeling dynamic landslide scenes is essential for understanding and managing them effectively. Existing models, which focus on a single scale, struggle to fully express the complex, multi-scale spatiotemporal processes within landslide scenes. To address these issues, we propose a hierarchical spatiotemporal data model, named HSDM, to enhance the representation of geographic scenes. Specifically, we introduce a spatiotemporal object model that integrates both the structural and process information of objects. Furthermore, we extend the process definition to capture complex spatiotemporal processes. We sort out the relationships used in HSDM and define four types of spatiotemporal correlation relations to represent the connections between spatiotemporal objects. Meanwhile, we construct a three-level graph model of geographic scenes based on these concepts and relationships. Finally, we represent and model a dynamic landslide scene in Heifangtai using HSDM and implement complex querying and reasoning with Neo4j's Cypher language. The experimental results demonstrate our model's capabilities in modeling and reasoning about complex multi-scale information and spatiotemporal processes within landslide scenes. Our work contributes to landslide knowledge representation, inventory, and dynamic simulation.
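
Since the paper queries its scene graph with Neo4j's Cypher, a flavor of such querying from Python might look like the sketch below. The node labels, relationship types, property names, and example data are assumptions; the abstract does not give the actual HSDM schema.

```python
from neo4j import GraphDatabase

# Hypothetical schema: spatiotemporal objects participate in processes,
# and processes are chained by PRECEDES relationships.
QUERY = """
MATCH (o:SpatiotemporalObject {name: $name})
      -[:PARTICIPATES_IN]->(p:Process)
      -[:PRECEDES*0..3]->(later:Process)
RETURN DISTINCT later.name AS process, later.start_time AS start
ORDER BY start
"""

def trace_processes(uri, user, password, object_name):
    """Follow chains of landslide processes, up to three steps onward,
    starting from the processes a given object takes part in."""
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(QUERY, name=object_name)
            return [(r["process"], r["start"]) for r in result]
    finally:
        driver.close()

# Example (requires a populated local Neo4j instance):
# for name, start in trace_processes("bolt://localhost:7687", "neo4j",
#                                    "password", "Heifangtai_slope_3"):
#     print(name, start)
```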

19 pages, 20082 KB  
Article
An Ontology-Based Vehicle Behavior Prediction Method Incorporating Vehicle Light Signal Detection
by Xiaolong Xu, Xiaolin Shi, Yun Chen and Xu Wu
Sensors 2024, 24(19), 6459; https://doi.org/10.3390/s24196459 - 6 Oct 2024
Cited by 2 | Viewed by 2017
Abstract
Although deep learning techniques have shown potential in vehicle behavior prediction, they struggle to integrate traffic rules and environmental information, and their black-box nature makes the prediction process opaque and difficult to interpret, limiting acceptance in practical applications. In contrast, ontology reasoning, which can utilize human domain knowledge and mimic human reasoning, can provide reliable explanations for inferred results. To address these limitations, this paper proposes a front-vehicle behavior prediction method that combines deep learning techniques with ontology reasoning. Specifically, YOLOv5s is first selected as the base model for recognizing the brake-light status of vehicles. To further enhance the model's performance in complex scenes and on small targets, the Convolutional Block Attention Module (CBAM) is introduced. In addition, to balance feature information across scales more efficiently, a weighted bi-directional feature pyramid network (BiFPN) replaces the original PANet structure in YOLOv5s. Next, using a four-lane intersection as an application scenario, multiple factors affecting vehicle behavior are analyzed, and an ontology model for predicting front-vehicle behavior is constructed from these factors. Finally, to validate the proposed method, we built our own brake-light detection dataset. The accuracy and mAP@0.5 of the improved model on this dataset are 3.9% and 2.5% higher, respectively, than those of the original model. In representative validation scenarios, the ontology model accurately inferred that the target vehicle would slow down until stopping and then turn left, verifying the reasonableness and practicality of the proposed front-vehicle behavior prediction method.
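
The ontology inference step can be caricatured with plain if-rules in Python: detected perceptual facts (e.g., brake-light state from the improved YOLOv5s) combine with scene facts to yield a behavior prediction. A real implementation would encode these as OWL axioms and run a reasoner; all fact and behavior names below are assumptions.

```python
def predict_front_vehicle_behavior(facts):
    """Toy rule-based stand-in for ontology reasoning over observed facts."""
    behaviors = []
    if facts.get("brake_light") == "on" and facts.get("near_intersection"):
        behaviors.append("decelerate_until_stop")
    if (facts.get("lane") == "left_turn_only"
            and facts.get("signal") == "green_left"):
        behaviors.append("turn_left")
    return behaviors or ["maintain_speed"]

observed = {
    "brake_light": "on",         # from the vision model
    "near_intersection": True,   # from map / scene context
    "lane": "left_turn_only",
    "signal": "green_left",
}
print(predict_front_vehicle_behavior(observed))
# ['decelerate_until_stop', 'turn_left']
```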
(This article belongs to the Section Vehicular Sensing)

20 pages, 6718 KB  
Article
Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events
by Mohammad Abu Tami, Huthaifa I. Ashqar, Mohammed Elhenawy, Sebastien Glaser and Andry Rakotonirainy
Vehicles 2024, 6(3), 1571-1590; https://doi.org/10.3390/vehicles6030074 - 2 Sep 2024
Cited by 42 | Viewed by 7381
Abstract
Traditional approaches to safety event analysis in autonomous systems have relied on complex machine and deep learning models and extensive datasets to achieve high accuracy and reliability. However, the emergence of multimodal large language models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities. Our framework leverages the logical and visual reasoning power of MLLMs, directing their output through object-level question–answer (QA) prompts to ensure accurate, reliable, and actionable insights for safety-critical event detection and analysis. By incorporating models such as Gemini-Pro-Vision 1.5, we aim to automate safety-critical event detection and analysis while mitigating common issues such as hallucinations in MLLM outputs. The results demonstrate the framework's potential in different in-context learning (ICL) settings, such as zero-shot and few-shot learning; we also investigate settings such as self-ensemble learning and a varying number of frames. The few-shot learning model consistently outperformed the other learning models, achieving the highest overall accuracy of about 79%. A comparative analysis with previous studies on visual reasoning revealed that earlier models showed only moderate performance on driving safety tasks, while our proposed model significantly outperformed them. To the best of our knowledge, the proposed MLLM model is the first of its kind capable of handling multiple tasks for each safety-critical event: it can identify risky scenarios, classify diverse scenes, determine car directions, categorize agents, and recommend appropriate actions, setting a new standard in safety-critical event management. This study shows the significance of MLLMs in advancing the analysis of naturalistic driving videos to improve safety-critical event detection and the understanding of interactions in complex environments.
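
The object-level QA prompting strategy might be assembled roughly as follows in Python. The question wording, the few-shot example, and the stubbed model call are assumptions for illustration, not the paper's actual prompts.

```python
# Constrained, object-level questions steer the MLLM toward short,
# checkable answers, which helps limit hallucination.
QA_TEMPLATE = [
    "Q1: Is this scene safety-critical? Answer yes or no.",
    "Q2: Which agents are involved? List vehicle/pedestrian/cyclist only.",
    "Q3: What direction is the ego vehicle moving?",
    "Q4: What action should the ego vehicle take? One short imperative.",
]

# A single few-shot exemplar (hypothetical) for the few-shot setting.
FEW_SHOT_EXAMPLE = (
    "Example frames: car braking hard ahead in rain.\n"
    "A1: yes\nA2: vehicle\nA3: forward\nA4: Increase following distance."
)

def build_prompt(num_frames):
    header = f"You are shown {num_frames} frames from a driving video."
    return "\n\n".join([header, FEW_SHOT_EXAMPLE, *QA_TEMPLATE])

def query_mllm(prompt, frames):
    # Stand-in for a multimodal API call (e.g., Gemini-Pro-Vision 1.5)
    # with the frames attached.
    return {"Q1": "yes", "Q2": "vehicle, pedestrian",
            "Q3": "forward", "Q4": "Brake and yield."}

print(build_prompt(num_frames=8))
print(query_mllm(build_prompt(num_frames=8), frames=None))
```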
(This article belongs to the Special Issue Vehicle Design Processes, 2nd Edition)

26 pages, 3960 KB  
Article
Ontology-Based Deep Learning Model for Object Detection and Image Classification in Smart City Concepts
by Adekanmi Adeyinka Adegun, Jean Vincent Fonou-Dombeu, Serestina Viriri and John Odindi
Smart Cities 2024, 7(4), 2182-2207; https://doi.org/10.3390/smartcities7040086 - 2 Aug 2024
Cited by 10 | Viewed by 5529
Abstract
Object detection in remotely sensed (RS) satellite imagery has gained significance in smart city concepts, which include urban planning, disaster management, and environmental monitoring. Deep learning techniques have shown promising outcomes in object detection and scene classification from RS satellite images, surpassing traditional methods reliant on hand-crafted features. However, these techniques lack the ability to provide in-depth comprehension of RS images and the enhanced interpretation needed to analyze intricate urban objects with functional structures and environmental contexts. To address this limitation, this study proposes a framework that integrates a deep learning-based object detection algorithm with ontology models for effective knowledge representation and analysis. The framework can automatically and accurately detect objects and classify scenes in remotely sensed satellite images and can also perform semantic description and analysis of the classified scenes. It couples a knowledge-guided ontology reasoning module with a YOLOv8 object detection model. This study demonstrates that the proposed framework can detect objects in varying environmental contexts captured by a remote sensing satellite and incorporate efficient knowledge representation and inference with a less complex ontology model.

11 pages, 3199 KB  
Communication
Accurate Determination of Camera Quantum Efficiency from a Single Image
by Yuri Rzhanov
J. Imaging 2024, 10(7), 169; https://doi.org/10.3390/jimaging10070169 - 16 Jul 2024
Cited by 2 | Viewed by 2867
Abstract
Knowledge of spectral sensitivity is important for high-precision comparison of images taken by different cameras and for the recognition of objects and interpretation of scenes for which color is an important cue. Direct estimation of quantum efficiency curves (QECs) is a complicated and tedious process requiring specialized equipment, and many camera manufacturers do not make spectral characteristics publicly available. This has led to the development of indirect techniques that are unreliable because they are highly sensitive to noise in the input data, and they often require the imposition of additional ad hoc conditions, some of which do not always hold. We demonstrate the reason for this lack of stability in the determination of QECs and propose an approach that guarantees stable QEC reconstruction, even in the presence of noise. A device for realizing this approach is also proposed. The reported results formed the basis for a granted US patent.
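
To see why indirect QEC estimation is noise-sensitive, consider this self-contained NumPy sketch: smooth, correlated scene spectra make the linear system ill-conditioned, so unregularized inversion amplifies noise, while a standard smoothness penalty stabilizes it. This illustrates the instability only; it is not the stabilization approach the paper proposes, and all spectra here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.linspace(400, 700, 31)                # 10 nm sampling
q_true = np.exp(-((wavelengths - 550) / 60) ** 2)      # synthetic "true" QEC

# Smooth, correlated scene spectra make the system matrix ill-conditioned.
S = np.array([np.exp(-((wavelengths - c) / 80) ** 2)
              for c in rng.uniform(420, 680, 60)])
y = S @ q_true + rng.normal(0.0, 1e-3, 60)             # noisy camera responses

# Second-difference operator: penalizes wiggly (non-smooth) QEC estimates.
D = np.diff(np.eye(len(wavelengths)), n=2, axis=0)

def reconstruct(lam):
    """Tikhonov-smoothed least squares: argmin ||S q - y||^2 + lam ||D q||^2."""
    return np.linalg.solve(S.T @ S + lam * D.T @ D, S.T @ y)

for lam in (0.0, 1e-4):
    q_hat = reconstruct(lam)
    err = np.linalg.norm(q_hat - q_true) / np.linalg.norm(q_true)
    print(f"lambda={lam:g}: relative reconstruction error {err:.3g}")
# The unregularized solve amplifies the tiny noise enormously; the
# smoothness penalty restores a stable estimate.
```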
(This article belongs to the Special Issue Color in Image Processing and Computer Vision)
