MDPI - Publisher of Open Access Journals

24 pages, 2105 KB

Open Access Article

A Multi-Stage Hybrid Retrieval Framework for the Scientific Literature with Cross-Encoder Re-Ranking

by Walaa Al-Joofi, Alaa Sagheer and Hala Hamdoun

Appl. Sci. 2026, 16(10), 4813; https://doi.org/10.3390/app16104813 - 12 May 2026

Viewed by 320

Effective scientific literature retrieval requires moving beyond surface-level term matching toward structured semantic reasoning. This paper presents a controlled empirical study of multi-stage retrieval for scientific literature, integrating lexical matching, dense semantic modeling, hybrid fusion, and cross-encoder re-ranking within a unified evaluation framework. The study is designed to analyze the interactions, trade-offs, and failure modes of these components in claim-based scientific search. Experiments on the SciFact benchmark demonstrate that dense models capture semantic similarity but remain insufficient when used in isolation. Hybrid fusion broadens the candidate pool but does not consistently outperform the best standalone dense retriever, as RRF-based fusion can dilute strong dense rankings when lexical and semantic signals diverge. Cross-encoder re-ranking proves to be the primary driver of final performance gains, with the best configuration, Hybrid (SciNCL + BM25) + Cross-Encoder, reaching NDCG@10 of 0.523, MAP@10 of 0.479, Recall@10 of 0.642, and MRR@10 of 0.497. Ablation analysis shows that lexical pseudo-relevance feedback (RM3) introduces query drift in claim-focused retrieval, and that passage-level max pooling weakens effectiveness by fragmenting document-level evidence. Cross-domain evaluation on SciFact, PubMedQA, and SciDocs demonstrates that the relative ranking of retrieval paradigms remains stable across datasets with varying difficulty levels, while also revealing that the RRF dilution effect intensifies on harder retrieval tasks. These findings suggest that effective scientific retrieval benefits from integrated multi-stage pipelines, and that understanding component-level interactions is essential for designing robust retrieval systems. Full article
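
The reciprocal rank fusion (RRF) step named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the document IDs, and the constant k = 60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs.

    Each document receives the sum over lists of 1 / (k + rank),
    where rank starts at 1. k = 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from a lexical and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # → ['d1', 'd3', 'd5', 'd7']
```

Because RRF uses only ranks, a document ranked first by one retriever alone (score 1/61 ≈ 0.0164) scores below a document ranked second by both (2/62 ≈ 0.0323), which is one way a strong dense ranking can be diluted when the two lists disagree.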

30 pages, 502 KB

Open Access Article

S-Gens: Structure-Aware Synthetic Data Generation for Enhancing Reasoning-Intensive Dense Retrieval

by Zhou Lei, Yanqi Xu and Shengbo Chen

Information 2026, 17(5), 413; https://doi.org/10.3390/info17050413 - 26 Apr 2026

Viewed by 211

Abstract

Dense retrievers rely heavily on high-quality training triplets, yet existing data construction strategies remain inadequate for reasoning-intensive retrieval tasks involving multi-hop reasoning, entity relation tracing, and implicit evidence composition. Positive samples are often based on shallow semantic relevance and fail to capture explicit reasoning chains, while negative samples are typically sampled from lexical overlap or random candidates and therefore provide limited supervision for learning clear decision boundaries. To address these issues, we propose S-Gens, a structure-aware synthetic data generation framework for enhancing reasoning-intensive dense retrieval. S-Gens uses relation paths in an external knowledge graph to synthesize queries and structurally consistent positive samples, and further constructs semantically similar but structurally inconsistent hard negatives. To improve data reliability, we introduce a Siamese graph neural network-based consistency filtering mechanism. Because S-Gens operates entirely during offline supervision construction, it remains model-agnostic, preserves the original inference architecture, and is complementary to graph-guided retrieval or RAG pipelines that inject structure online. Experiments on five benchmark datasets show that S-Gens consistently improves multiple trainable retrievers, with the largest gains on multi-hop reasoning tasks such as WebQSP and HotpotQA. These results indicate that structure-aware synthetic supervision can effectively improve dense retrieval in reasoning-intensive settings. Full article

(This article belongs to the Special Issue Advanced Retrieval-Augmented Generation Systems Based on Large Language Models)

18 pages, 1014 KB

Open Access Article

Context-Aware Semantic Retrieval for Ancient Texts: A Native Reasoning Approach Based on In-Memory Knowledge Graph

by Tianrui Li and Hongyu Yuan

Electronics 2026, 15(9), 1827; https://doi.org/10.3390/electronics15091827 - 25 Apr 2026

Viewed by 238

Abstract

This paper presents a lightweight semantic retrieval framework driven by an in-memory knowledge graph (IMKG) to overcome the limitations of traditional keyword matching and the prohibitive hardware costs of deep learning models in digitizing ancient Chinese literature. By extracting structured metadata from canonical texts, we construct a dense, bidirectional graph schema. Diverging from resource-intensive neural architectures, our system abandons heavyweight vector embeddings in favor of a highly optimized, template-based heuristic matching engine natively implemented in Java. This purely symbolic approach ensures deterministic execution, zero-dependency deployment, and seamless operation on standard CPU-only servers. To handle complex historical inquiries, the framework integrates a context-aware dialogue manager for multi-turn anaphora and ellipsis resolution, alongside a synergistic tiered caching mechanism. Extensive evaluations on a benchmark of 13,652 annotated queries demonstrate that the system achieves an exceptional intent recognition accuracy of 97.14%, robust context retention, and ultra-low response latency (≤17 ms). Ultimately, this architecture provides a sustainable, highly reproducible, and cost-effective paradigm for the semantic exploration of classical textual heritage, exceptionally suited for small-to-medium cultural institutions. Full article

27 pages, 1222 KB

Open Access Article

Query-Adaptive Hybrid Search

by Pavel Posokhov, Stepan Skrylnikov, Sergei Masliukhin, Alina Zavgorodniaia, Olesia Koroteeva and Yuri Matveev

Mach. Learn. Knowl. Extr. 2026, 8(4), 91; https://doi.org/10.3390/make8040091 - 5 Apr 2026

Viewed by 1317

Abstract

The modern information retrieval field increasingly relies on hybrid search systems combining sparse retrieval with dense neural models. However, most existing hybrid frameworks employ static mixing coefficients and independent component training, failing to account for the specific needs of individual queries and corpus heterogeneity. In this paper, we introduce an adaptive hybrid retrieval framework featuring query-driven alpha prediction that dynamically calibrates the mixing weights based on latent query representations. The predictor is instantiated in two variants, a lightweight low-latency configuration and a full-capacity encoder-scale predictor, enabling flexible trade-offs between computational efficiency and retrieval accuracy without relying on resource-inefficient LLM-based online evaluation. Furthermore, we propose antagonist negative sampling, a novel training paradigm that optimizes the dense encoder to resolve the systematic failures of the lexical retriever, prioritizing hard negatives where BM25 exhibits high uncertainty. Empirical evaluations on large-scale multilingual benchmarks (MLDR and MIRACL) indicate that our approach achieves superior average performance compared to state-of-the-art models such as BGE-M3 and mGTE, reaching an nDCG@10 of 74.3 on long-document retrieval. Notably, our framework recovers up to 92.5% of the theoretical oracle performance and yields significant improvements in nDCG@10 across 16 languages, particularly in challenging long-context scenarios. Full article

(This article belongs to the Special Issue Trustworthy AI: Integrating Knowledge, Retrieval, and Reasoning)
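
The idea of a per-query mixing weight can be sketched as a convex combination of normalized sparse and dense scores. This is an illustrative sketch only, not the paper's method: the min-max normalization, the convention that alpha weights the dense score, the function names, and the example scores are all assumptions, and alpha is passed in by hand rather than predicted by a model.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(sparse, dense, alpha):
    """Mix normalized scores: alpha * dense + (1 - alpha) * sparse.

    alpha is assumed to lie in [0, 1]; in a query-adaptive system it
    would be predicted per query instead of being fixed.
    """
    s, d = min_max(sparse), min_max(dense)
    docs = set(s) | set(d)
    return {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in docs
    }


# Hypothetical retriever outputs for one query.
sparse = {"a": 12.0, "b": 8.0, "c": 4.0}  # e.g. BM25 scores
dense = {"a": 0.2, "b": 0.9, "c": 0.5}    # e.g. embedding similarities

semantic_query = hybrid_scores(sparse, dense, alpha=0.8)
keyword_query = hybrid_scores(sparse, dense, alpha=0.1)
print(max(semantic_query, key=semantic_query.get))
print(max(keyword_query, key=keyword_query.get))
```

With a high alpha the dense retriever's favorite document wins, and with a low alpha the lexical favorite wins, which is the behavior a static coefficient cannot provide for both kinds of query at once.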

24 pages, 1885 KB

Open Access Article

A Lightweight and Scalable Conversational AI Framework for Intelligent Employee Onboarding

by Deborah Olaniyan, Samson Akinpelu, Serestina Viriri, Julius Olaniyan and Adesola Thanni

Appl. Sci. 2025, 15(21), 11754; https://doi.org/10.3390/app152111754 - 4 Nov 2025

Cited by 1 | Viewed by 2613

Abstract

Employee onboarding is a key process in workforce integration but is typically manual, time-consuming, and department-specific. This paper presents OnboardGPT v1.0, an intelligent, scalable conversational AI platform that addresses this problem by delivering an automated and personalized onboarding experience through lightweight neural components. The platform uses a feedforward intent classification model, dense semantic retrieval through cosine similarity, and user-profile-aware personalization to deliver context-sensitive and relevant output. A proprietary dataset of 500 onboarding questions with annotated answers was constructed to simulate real enterprise conversations across various roles and departments. The platform was deployed with a Flask-based web interface that does not depend on third-party APIs and supports multi-turn dialogue, knowledge base search, and role-aware task instruction. Experimental evaluation on performance indicators such as task success rate, intent classification accuracy, BLEU score, and simulated user satisfaction demonstrates that the system is effective in offering coherent and actionable onboarding support. The contribution of this work is a modular, explainable, and deployable AI pipeline suitable for onboarding automation at the enterprise level, which lays the foundation for future extensions such as multilingual support, long-term memory, and backend system interoperability. Full article
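
Dense retrieval through cosine similarity, as mentioned in this abstract, can be sketched in a few lines. The knowledge-base entries, the three-dimensional vectors, and the function names below are hypothetical; a real system would obtain the vectors from an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, knowledge_base, top_k=1):
    """Return the texts of the top_k entries most similar to the query vector."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(query_vec, entry[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# Hypothetical (text, embedding) pairs standing in for an onboarding knowledge base.
knowledge_base = [
    ("How do I set up payroll?", [0.9, 0.1, 0.0]),
    ("Where is the IT helpdesk?", [0.1, 0.8, 0.3]),
]
query_vec = [0.2, 0.7, 0.2]  # assumed embedding of a user question about IT support

best = retrieve(query_vec, knowledge_base)
print(best)
```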

19 pages, 14252 KB

Open Access Article

Physical-Guided Transfer Deep Neural Network for High-Resolution AOD Retrieval

by Debao Chen, Hong Guo, Xingfa Gu, Jinnian Wang, Yan Liu, Yuecheng Li and Yifan Wu

Remote Sens. 2025, 17(21), 3606; https://doi.org/10.3390/rs17213606 - 31 Oct 2025

Cited by 2 | Viewed by 1365

Abstract

Urban-scale aerosol pollution monitoring is of critical importance for both climate regulation and public health. To overcome the limitations of conventional kilometer-scale satellite aerosol optical depth (AOD) products in resolving urban pollution heterogeneity, this study develops a physical-guided transfer deep neural network (PT-DNN) model based on high-resolution Landsat 8 data. The PT-DNN introduces a novel physics-guided training framework, in which radiative transfer simulations are integrated to physically constrain the AOD retrieval. Pre-training was conducted using multi-scenario radiative transfer simulations, with subsequent fine-tuning via ground-based AERONET measurements. The model architecture integrates a convolutional neural network (CNN) with residual connections. Validation results over impervious surfaces indicate that the PT-DNN model outperforms conventional data-driven models, with the coefficient of determination (R²) increasing from 0.81 to 0.86 and the root mean square error (RMSE) decreasing from 0.122 to 0.104. Moreover, the AOD distributions retrieved at a high spatial resolution of 30 m effectively reveal fine-scale pollution gradients within urban environments, especially in densely built-up and industrial areas. Full article

(This article belongs to the Special Issue The Advancements in Aerosol, Cloud and Cloud–Aerosol Interaction by Remote Sensing)
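
The two validation metrics quoted in this abstract, R² and RMSE, can be computed as in the sketch below. The function names and the sample values are made up for illustration and are not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical AOD values: ground measurements versus model retrievals.
observed = [0.2, 0.4, 0.6, 0.8]
retrieved = [0.25, 0.35, 0.65, 0.75]

print(round(r_squared(observed, retrieved), 3))  # → 0.95
print(round(rmse(observed, retrieved), 3))       # → 0.05
```

A higher R² together with a lower RMSE, as reported for PT-DNN relative to the data-driven baselines, means the retrievals explain more of the observed variance while deviating less from the ground measurements on average.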

37 pages, 1895 KB

Open Access Review

A Review of Artificial Intelligence and Deep Learning Approaches for Resource Management in Smart Buildings

by Bibars Amangeldy, Timur Imankulov, Nurdaulet Tasmurzayev, Gulmira Dikhanbayeva and Yedil Nurakhov

Buildings 2025, 15(15), 2631; https://doi.org/10.3390/buildings15152631 - 25 Jul 2025

Cited by 14 | Viewed by 8336

Abstract

This comprehensive review maps the fast-evolving landscape in which artificial intelligence (AI) and deep-learning (DL) techniques converge with the Internet of Things (IoT) to manage energy, comfort, and sustainability across smart environments. A PRISMA-guided search of four databases retrieved 1358 records; after applying inclusion criteria, 143 peer-reviewed studies published between January 2019 and April 2025 were analyzed. This review shows that AI-driven controllers, especially deep-reinforcement-learning agents, deliver median energy savings of 18–35% for HVAC and other major loads, consistently outperforming rule-based and model-predictive baselines. The evidence further reveals a rapid diversification of methods: graph-neural-network models now capture spatial interdependencies in dense sensor grids, federated-learning pilots address data-privacy constraints, and early integrations of large language models hint at natural-language analytics and control interfaces for heterogeneous IoT devices. Yet large-scale deployment remains hindered by fragmented and proprietary datasets, unresolved privacy and cybersecurity risks associated with continuous IoT telemetry, the growing carbon and compute footprints of ever-larger models, and poor interoperability among legacy equipment and modern edge nodes. The reviewed studies therefore converge on several priorities: open, high-fidelity benchmarks that marry multivariate IoT sensor data with standardized metadata and occupant feedback; energy-aware, edge-optimized architectures that lower latency and power draw; privacy-centric learning frameworks that satisfy tightening regulations; hybrid physics-informed and explainable models that shorten commissioning time; and digital-twin platforms enriched by language-model reasoning to translate raw telemetry into actionable insights for facility managers and end users. Addressing these gaps will be pivotal to transforming isolated pilots into ubiquitous, trustworthy, and human-centered IoT ecosystems capable of delivering measurable gains in efficiency, resilience, and occupant wellbeing at scale. Full article

(This article belongs to the Section Building Energy, Physics, Environment, and Systems)

25 pages, 2296 KB

Open Access Article

Multimedia Graph Codes for Fast and Semantic Retrieval-Augmented Generation

by Stefan Wagenpfeil

Electronics 2025, 14(12), 2472; https://doi.org/10.3390/electronics14122472 - 18 Jun 2025

Cited by 3 | Viewed by 3047

Abstract

Retrieval-Augmented Generation (RAG) has become a central approach to enhance the factual consistency and domain specificity of large language models (LLMs) by incorporating external context at inference time. However, most existing RAG systems rely on dense vector-based similarity, which fails to capture complex semantic structures, relational dependencies, and multimodal content. In this paper, we introduce Graph Codes—a matrix-based encoding of Multimedia Feature Graphs—as an alternative retrieval paradigm. Graph Codes preserve semantic topology by explicitly encoding entities and their typed relationships from multimodal documents, enabling structure-aware and interpretable retrieval. We evaluate our system in two domains: multimodal scene understanding (200 annotated image-question pairs) and clinical question answering (150 real-world medical queries with 10,000 structured knowledge snippets). Results show that our method outperforms dense retrieval baselines in precision (+9–15%), reduces hallucination rates by over 30%, and yields higher expert-rated answer quality. Theoretically, this work demonstrates that symbolic similarity over typed semantic graphs provides a more faithful alignment mechanism than latent embeddings. Practically, it enables interpretable, modality-agnostic retrieval pipelines deployable in high-stakes domains such as medicine or law. We conclude that Graph Code-based RAG bridges the gap between structured knowledge representation and neural generation, offering a robust and explainable alternative to existing approaches. Full article

(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)

23 pages, 13542 KB

Open Access Article

A Lightweight Neural Network for Denoising Wrapped-Phase Images Generated with Full-Field Optical Interferometry

by Muhammad Awais, Younggue Kim, Taeil Yoon, Wonshik Choi and Byeongha Lee

Appl. Sci. 2025, 15(10), 5514; https://doi.org/10.3390/app15105514 - 14 May 2025

Cited by 2 | Viewed by 2066

Abstract

Phase wrapping is a common phenomenon in optical full-field imaging or measurement systems. It arises from large phase retardations and results in wrapped-phase maps that contain essential information about surface roughness and topology. However, these maps are often degraded by noise, such as speckle and Gaussian noise, which reduces the measurement accuracy and complicates phase reconstruction. Denoising such data is a fundamental problem in computer vision and plays a critical role in biomedical imaging modalities like Full-Field Optical Interferometry. In this paper, we propose WPD-Net (Wrapped-Phase Denoising Network), a lightweight deep learning-based neural network specifically designed to restore phase images corrupted by high noise levels. The network architecture integrates a shallow feature extraction module, a series of Residual Dense Attention Blocks (RDABs), and a dense feature fusion module. The RDABs incorporate attention mechanisms that help the network focus on critical features and suppress irrelevant noise, especially in high-frequency or complex regions. Additionally, WPD-Net employs a growth-rate-based feature expansion strategy to enhance multi-scale feature representation and improve phase continuity. We evaluate the model’s performance on both synthetic and experimentally acquired datasets and compare it with other state-of-the-art deep learning-based denoising methods. The results demonstrate that WPD-Net achieves superior noise suppression while preserving fine structural details, even with mixed speckle and Gaussian noise. The proposed method is expected to enable fast image processing, allowing unwrapped biomedical images to be retrieved in real time. Full article

(This article belongs to the Special Issue Computer-Vision-Based Biomedical Image Processing)

19 pages, 13012 KB

Open Access Article

Neural Network-Based Temporal Ensembling of Water Depth Estimates Derived from SuperDove Images

by Milad Niroumand-Jadidi, Carl J. Legleiter and Francesca Bovolo

Remote Sens. 2025, 17(7), 1309; https://doi.org/10.3390/rs17071309 - 6 Apr 2025

Cited by 1 | Viewed by 1314

Abstract

CubeSats provide a wealth of high-frequency observations at a meter-scale spatial resolution. However, most current methods of inferring water depth from satellite data consider only a single image. This approach is sensitive to the radiometric quality of the data acquired at that particular instant in time, which could be degraded by various confounding factors, such as sun glint or atmospheric effects. Moreover, using single images in isolation fails to exploit recent improvements in the frequency of satellite image acquisition. This study aims to leverage the dense image time series from the SuperDove constellation via an ensembling framework that helps to improve empirical (regression-based) bathymetry retrieval. Unlike previous studies that only ensembled the original spectral data, we introduce a neural network-based method that instead ensembles the water depths derived from multi-temporal imagery, provided the data are acquired under steady flow conditions. We refer to this new approach as NN-depth ensembling. First, every image is treated individually to derive multi-temporal depth estimates. Then, we use another NN regressor to ensemble the temporal water depths. This step serves to automatically weight the contribution of the bathymetric estimates from each time instance to the final bathymetry product. Unlike methods that ensemble spectral data, NN-depth ensembling mitigates the propagation of uncertainties in spectral data (e.g., noise due to sun glint) to the final bathymetric product. The proposed NN-depth ensembling is applied to temporal SuperDove imagery of reaches from the American, Potomac, and Colorado rivers with depths of up to 10 m and evaluated against in situ measurements. The proposed method provided more accurate and robust bathymetry retrieval than single-image analyses and other ensembling approaches. Full article

(This article belongs to the Special Issue Advances in Remote Sensing of the Inland and Coastal Water Zones II)

22 pages, 1390 KB

Open Access Article

Emotion-Aware Embedding Fusion in Large Language Models (Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation

by Abdur Rasool, Muhammad Irfan Shahzad, Hafsa Aslam, Vincent Chan and Muhammad Ali Arshad

AI 2025, 6(3), 56; https://doi.org/10.3390/ai6030056 - 13 Mar 2025

Cited by 46 | Viewed by 8451

Abstract

Empathetic and coherent responses are critical in automated chatbot-facilitated psychotherapy. This study addresses the challenge of enhancing the emotional and contextual understanding of large language models (LLMs) in psychiatric applications. We introduce Emotion-Aware Embedding Fusion, a novel framework integrating hierarchical fusion and attention mechanisms to prioritize semantic and emotional features in therapy transcripts. Our approach combines multiple emotion lexicons, including NRC Emotion Lexicon, VADER, WordNet, and SentiWordNet, with state-of-the-art LLMs such as Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4. Therapy session transcripts, comprising over 2000 samples, are segmented into hierarchical levels (word, sentence, and session) using neural networks, while hierarchical fusion combines these features with pooling techniques to refine emotional representations. Attention mechanisms, including multi-head self-attention and cross-attention, further prioritize emotional and contextual features, enabling the temporal modeling of emotional shifts across sessions. The processed embeddings, computed using BERT, GPT-3, and RoBERTa, are stored in the Facebook AI Similarity Search (FAISS) vector database, which enables efficient similarity search and clustering across dense vector spaces. Upon user queries, relevant segments are retrieved and provided as context to LLMs, enhancing their ability to generate empathetic and contextually relevant responses. The proposed framework is evaluated across multiple practical use cases to demonstrate real-world applicability, including AI-driven therapy chatbots. The system can be integrated into existing mental health platforms to generate personalized responses based on retrieved therapy session data. The experimental results show that our framework enhances empathy, coherence, informativeness, and fluency, surpassing baseline models while improving LLMs’ emotional intelligence and contextual adaptability for psychotherapy. Full article

(This article belongs to the Special Issue Multimodal Artificial Intelligence in Healthcare)

25 pages, 13698 KB

Open Access Editor’s Choice Article

Self-Supervised Foundation Model for Template Matching

by Anton Hristov, Dimo Dimov and Maria Nisheva-Pavlova

Big Data Cogn. Comput. 2025, 9(2), 38; https://doi.org/10.3390/bdcc9020038 - 11 Feb 2025

Cited by 5 | Viewed by 3749

Abstract

Finding a template location in a query image is a fundamental problem in many computer vision applications, such as localization of known objects, image registration, image matching, and object tracking. Currently available methods fail when insufficient training data are available or when the images exhibit large variations in texture, different modalities, or weak visual features, which limits their application to real-world tasks. We introduce Self-Supervised Foundation Model for Template Matching (Self-TM), a novel end-to-end approach to self-supervised learning for template matching. The idea behind Self-TM is to learn hierarchical features incorporating localization properties from images without any annotations. Deeper in the convolutional neural network (CNN) layers, filters begin to react to more complex structures and their receptive fields increase. This leads to a loss of localization information relative to the early layers. The hierarchical propagation of the last layers back to the first layer results in precise template localization. Due to its zero-shot generalization capabilities on tasks such as image retrieval, dense template matching, and sparse image matching, our pre-trained model can be classified as a foundation model. Full article

(This article belongs to the Special Issue Perception and Detection of Intelligent Vision)

23 pages, 4874 KB

Open Access Article

Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization

by Shakhnoza Muksimova, Sabina Umirzakova, Murodjon Sultanov and Young Im Cho

Sensors 2025, 25(3), 707; https://doi.org/10.3390/s25030707 - 24 Jan 2025

Cited by 16 | Viewed by 4268

Abstract

Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, the current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on the benchmark datasets YouCook2, Flickr30k, and ActivityNet Captions, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks. Full article

(This article belongs to the Section Sensing and Imaging)

20 pages, 3911 KB

Open Access Article

Ground-Based Hyperspectral Retrieval of Soil Arsenic Concentration in Pingtan Island, China

by Meiduan Zheng, Haijun Luan, Guangsheng Liu, Jinming Sha, Zheng Duan and Lanhui Wang

Remote Sens. 2023, 15(17), 4349; https://doi.org/10.3390/rs15174349 - 4 Sep 2023

Cited by 13 | Viewed by 3082

Abstract

The optimal selection of characteristic bands and retrieval models for the hyperspectral retrieval of soil heavy metal concentrations poses a significant challenge. Additionally, satellite-based hyperspectral retrieval encounters several issues, including atmospheric effects, limitations in temporal and radiometric resolution, and data acquisition, among others. Given this, the retrieval performance of the soil arsenic (As) concentration in Pingtan Island, the largest island in Fujian Province and the fifth largest in China, is currently unclear. This study aimed to elucidate this issue by identifying optimal characteristic bands from the full spectrum from both statistical and physical perspectives. We tested three linear models, namely Multiple Linear Regression (MLR), Partial Least Squares Regression (PLSR) and Geographically Weighted Regression (GWR), as well as three nonlinear machine learning models, including Back Propagation Neural Network (BP), Support Vector Machine Regression (SVR) and Random Forest Regression (RFR). We then retrieved soil arsenic content using ground-based soil full spectrum data on Pingtan Island. Our results indicate that the RFR model consistently outperformed all others when using both original and optimal characteristic bands. This superior performance suggests a complex, nonlinear relationship between soil arsenic concentration and spectral variables, influenced by diverse landscape factors. The GWR model, which considers spatial non-stationarity and heterogeneity, outperformed traditional models such as BP and SVR. This finding underscores the potential of incorporating spatial characteristics to enhance traditional machine learning models in geospatial studies. When evaluating retrieval model accuracy based on optimal characteristic bands, the RFR model maintained its top performance, and linear models (MLR, PLSR and GWR) showed notable improvement. Specifically, the GWR model achieved the highest r value for the validation data, indicating that selecting optimal characteristic bands based on high Pearson’s correlation coefficients (e.g., abs(Pearson’s correlation coefficient) ≥ 0.45) and high sensitivity to soil active materials successfully mitigates uncertainties linked to characteristic band selection solely based on Pearson’s correlation coefficients. Consequently, two effective retrieval models were generated: the best-performing RFR model and the improved GWR model. Our study on Pingtan Island provides theoretical and technical support for monitoring and evaluating soil arsenic concentrations using satellite-based spectroscopy in densely populated, relatively independent island towns in China and worldwide. Full article

(This article belongs to the Special Issue New Advances in Machine Learning for Soil Properties Prediction and Mapping)

15 pages, 5524 KB

Open AccessArticle

Content-Based Image Retrieval for Traditional Indonesian Woven Fabric Images Using a Modified Convolutional Neural Network Method

by Silvester Tena, Rudy Hartanto and Igi Ardiyanto

J. Imaging 2023, 9(8), 165; https://doi.org/10.3390/jimaging9080165 - 18 Aug 2023

Cited by 13 | Viewed by 4465

Abstract

A content-based image retrieval system, as an Indonesian traditional woven fabric knowledge base, can be useful for artisans and trade promotions. However, creating an effective and efficient retrieval system is difficult because no dataset of traditional Indonesian woven fabrics exists and because the fabrics' unique characteristics are not considered simultaneously. One type of traditional Indonesian fabric is ikat woven fabric. Thus, this study collected images of this traditional Indonesian woven fabric to create the TenunIkatNet dataset. The dataset consists of 120 classes and 4800 images. The images were captured perpendicularly, and the ikat woven fabrics were placed on different backgrounds, hung, and worn on the body, according to the utilization patterns. A feature extraction method based on a modified convolutional neural network (MCNN) learns the unique features of Indonesian traditional woven fabrics. The experimental results show that the modified CNN model outperforms other pretrained CNN models (i.e., ResNet101, VGG16, DenseNet201, InceptionV3, MobileNetV2, Xception, and InceptionResNetV2) in top-5, top-10, top-20, and top-50 accuracies with scores of 99.96%, 99.88%, 99.50%, and 97.60%, respectively. Full article

(This article belongs to the Special Issue Advances in Image Analysis: Shapes, Textures and Multifractals)
