Search Results (219)

Search Parameters:
Keywords = video frame selection

49 pages, 6627 KB  
Article
LEARNet: A Learning Entropy-Aware Representation Network for Educational Video Understanding
by Chitrakala S, Nivedha V V and Niranjana S R
Entropy 2026, 28(1), 3; https://doi.org/10.3390/e28010003 - 19 Dec 2025
Viewed by 199
Abstract
Educational videos contain long periods of visual redundancy, where only a few frames convey meaningful instructional information. Conventional video models, which are designed for dynamic scenes, often fail to capture these subtle pedagogical transitions. We introduce LEARNet, an entropy-aware framework that models educational video understanding as the extraction of high-information instructional content from low-entropy visual streams. LEARNet combines a Temporal Information Bottleneck (TIB) for selecting pedagogically significant keyframes with a Spatial–Semantic Decoder (SSD) that produces fine-grained annotations refined through a proposed Relational Consistency Verification Network (RCVN). This architecture enables the construction of EVUD-2M, a large-scale benchmark with multi-level semantic labels for diverse instructional formats. LEARNet achieves substantial redundancy reduction (70.2%) while maintaining high annotation fidelity (F1 = 0.89, mAP@50 = 0.88). Grounded in information-theoretic principles, LEARNet provides a scalable foundation for tasks such as lecture indexing, visual content summarization, and multimodal learning analytics.
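The entropy-driven selection idea can be illustrated with a minimal Python sketch (not the authors' TIB module): score each frame by its grayscale histogram entropy and keep a frame only when its entropy shifts noticeably relative to the last kept frame. The threshold value and the OpenCV-based frame reading are assumptions for illustration.

```python
import cv2
import numpy as np

def frame_entropy(gray):
    """Shannon entropy of an 8-bit grayscale frame's intensity histogram."""
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_keyframes(video_path, jump=0.35):
    """Keep indices of frames whose entropy changes noticeably vs. the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    kept, last_e, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        e = frame_entropy(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if last_e is None or abs(e - last_e) > jump:
            kept.append(idx)
            last_e = e
        idx += 1
    cap.release()
    return kept
```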

12 pages, 2468 KB  
Article
A Real-World Underwater Video Dataset with Labeled Frames and Water-Quality Metadata for Aquaculture Monitoring
by Osbaldo Aragón-Banderas, Leonardo Trujillo, Yolocuauhtli Salazar, Guillaume J. V. E. Baguette and Jesús L. Arce-Valdez
Data 2025, 10(12), 211; https://doi.org/10.3390/data10120211 - 18 Dec 2025
Viewed by 537
Abstract
Aquaculture monitoring increasingly relies on computer vision to evaluate fish behavior and welfare under farming conditions. This dataset was collected in a commercial recirculating aquaculture system (RAS) integrated with hydroponics in Queretaro, Mexico, to support the development of robust visual models for Nile tilapia (Oreochromis niloticus). More than ten hours of underwater recordings were curated into 31 clips of 30 s each, a duration selected to balance representativeness of fish activity with a manageable size for annotation and training. Videos were captured using commercial action cameras at multiple resolutions (1920 × 1080 to 5312 × 4648 px), frame rates (24–60 fps), depths, and lighting configurations, reproducing real-world challenges such as turbidity, suspended solids, and variable illumination. For each recording, physicochemical parameters were measured, including temperature, pH, dissolved oxygen, and turbidity; these are provided in a structured CSV file. In addition to the raw videos, the dataset includes 3520 extracted frames annotated using a polygon-based JSON format, enabling direct use for training object detection and behavior recognition models. This dual resource of unprocessed clips and annotated images enhances reproducibility, benchmarking, and comparative studies. By combining synchronized environmental data with annotated underwater imagery, the dataset contributes a non-invasive and versatile resource for advancing aquaculture monitoring through computer vision.
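A minimal sketch of how such a paired resource could be loaded, assuming a LabelMe-style polygon JSON per frame and hypothetical file names (water_quality.csv, clip01_frame0001.json), since the abstract does not specify the exact schema:

```python
import csv
import json
from pathlib import Path

def load_water_quality(csv_path):
    """Read per-recording physicochemical parameters (temperature, pH, DO, turbidity)."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def load_polygons(json_path):
    """Read polygon annotations from a LabelMe-style JSON file (assumed layout)."""
    data = json.loads(Path(json_path).read_text())
    return [(shape.get("label"), shape.get("points")) for shape in data.get("shapes", [])]

# Hypothetical usage: pair an annotated frame with its clip's metadata row.
metadata = load_water_quality("water_quality.csv")
annotations = load_polygons("frames/clip01_frame0001.json")
```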

13 pages, 918 KB  
Article
Self-Supervised Spatio-Temporal Network for Classifying Lung Tumor in EBUS Videos
by Ching-Kai Lin, Chin-Wen Chen, Hung-Chih Tu, Hung-Jen Fan and Yun-Chien Cheng
Diagnostics 2025, 15(24), 3184; https://doi.org/10.3390/diagnostics15243184 - 13 Dec 2025
Viewed by 244
Abstract
Background: Endobronchial ultrasound-guided transbronchial biopsy (EBUS-TBB) is a valuable technique for diagnosing peripheral pulmonary lesions (PPLs). Although computer-aided diagnostic (CAD) systems have been explored for EBUS interpretation, most rely on manually selected 2D static frames and overlook temporal dynamics that may provide important cues for differentiating benign from malignant lesions. This study aimed to develop an artificial intelligence model that incorporates temporal modeling to analyze EBUS videos and improve lesion classification. Methods: We retrospectively collected EBUS videos from patients undergoing EBUS-TBB between November 2019 and January 2022. A dual-path 3D convolutional network (SlowFast) was employed for spatiotemporal feature extraction, and contrastive learning (SwAV) was integrated to enhance model generalizability on clinical data. Results: A total of 465 patients with corresponding EBUS videos were included. On the validation set, the SlowFast + SwAV_Frame model achieved an AUC of 0.857, accuracy of 82.26%, sensitivity of 93.18%, specificity of 55.56%, and F1-score of 88.17%, outperforming pulmonologists (accuracy 70.97%, sensitivity 77.27%, specificity 55.56%, F1-score 79.07%). On the test set, the model achieved an AUC of 0.823, accuracy of 76.92%, sensitivity of 84.85%, specificity of 63.16%, and F1-score of 82.35%. The proposed model also demonstrated superior performance compared with conventional 2D architectures. Conclusions: This study introduces the first CAD framework for real-time malignancy classification from full-length EBUS videos, which reduces reliance on manual image selection and improves diagnostic efficiency. In addition, given its higher accuracy compared with pulmonologists’ assessments, the framework shows strong potential for clinical applicability.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
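The dual-path (SlowFast) idea can be conveyed by how frames are drawn for the two pathways: a dense "fast" stream and a sparse "slow" stream from the same clip. The clip length, pathway sizes, and stride ratio below are generic assumptions, not the study's settings.

```python
import numpy as np

def sample_slowfast(frames, alpha=4, num_fast=32):
    """Split a clip into a dense 'fast' pathway and a sparse 'slow' pathway.

    frames: array of shape (T, H, W, C); alpha: temporal stride ratio between pathways.
    """
    t = len(frames)
    fast_idx = np.linspace(0, t - 1, num_fast).astype(int)   # dense temporal sampling
    slow_idx = fast_idx[::alpha]                              # every alpha-th fast frame
    return frames[slow_idx], frames[fast_idx]

clip = np.zeros((120, 224, 224, 3), dtype=np.uint8)  # placeholder EBUS clip
slow, fast = sample_slowfast(clip)
print(slow.shape, fast.shape)  # (8, 224, 224, 3) (32, 224, 224, 3)
```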

32 pages, 2917 KB  
Article
Robust Real-Time Sperm Tracking with Identity Reassignment Using Extended Kalman Filtering
by Mahdieh Gol Hassani, Mozafar Saadat and Peiran Lei
Sensors 2025, 25(24), 7539; https://doi.org/10.3390/s25247539 - 11 Dec 2025
Viewed by 518
Abstract
Accurate and real-time sperm tracking is essential for automation in Intracytoplasmic Sperm Injection (ICSI) and fertility diagnostics, where maintaining correct identities across frames improves the reliability of sperm selection. However, identity fragmentation, overcounting, and tracking instability remain persistent challenges in crowded and low-contrast microscopy conditions. This study presents a robust two-layer tracking framework that integrates BoT-SORT with an Extended Kalman Filter (EKF) to enhance identity continuity. The EKF models sperm trajectories using a nonlinear state that includes position, velocity, and heading, allowing it to predict motion across occlusions and correct fragmented or duplicate IDs. We evaluated the framework on microscopy videos from the VISEM dataset using standard multi-object tracking (MOT) metrics and trajectory statistics. Compared to BoT-SORT, the proposed EKF-BoT-SORT achieved notable improvements: IDF1 increased from 80.30% to 84.84%, ID switches reduced from 176 to 132, average track duration extended from 74.4 to 91.3 frames, and ID overcount decreased from 68.75% to 37.5%. These results confirm that the EKF layer significantly improves identity preservation without compromising real-time feasibility. The method may offer a practical foundation for integrating computer vision into ICSI workflows and sperm motility analysis systems.
(This article belongs to the Section Biomedical Sensors)
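A minimal sketch of an extended Kalman filter over the nonlinear state named in the abstract (position, velocity, heading); the noise covariances and the position-only measurement model are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

class SpermEKF:
    """Extended Kalman filter over state [x, y, speed v, heading theta]."""

    def __init__(self, x0, dt=1.0):
        self.x = np.asarray(x0, dtype=float)      # [x, y, v, theta]
        self.P = np.eye(4)
        self.Q = np.diag([0.5, 0.5, 0.1, 0.05])   # process noise (assumed values)
        self.R = np.diag([1.0, 1.0])              # measurement noise (assumed values)
        self.dt = dt

    def predict(self):
        x, y, v, th = self.x
        dt = self.dt
        # Nonlinear constant-speed motion model along the current heading.
        self.x = np.array([x + v * np.cos(th) * dt,
                           y + v * np.sin(th) * dt, v, th])
        F = np.array([[1, 0, np.cos(th) * dt, -v * np.sin(th) * dt],
                      [0, 1, np.sin(th) * dt,  v * np.cos(th) * dt],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1]])              # Jacobian of the motion model
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z):
        """z: measured (x, y) centroid from the detector."""
        H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])
        innov = np.asarray(z, dtype=float) - H @ self.x
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innov
        self.P = (np.eye(4) - K @ H) @ self.P
```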

28 pages, 8872 KB  
Article
Development and Application of an Intelligent Recognition System for Polar Environmental Targets Based on the YOLO Algorithm
by Jun Jian, Zhongying Wu, Kai Sun, Jiawei Guo and Ronglin Gao
J. Mar. Sci. Eng. 2025, 13(12), 2313; https://doi.org/10.3390/jmse13122313 - 5 Dec 2025
Viewed by 323
Abstract
As global climate warming enhances the navigability of Arctic routes, their navigation value has become prominent, yet ships operating in ice-covered waters face severe threats from sea ice and icebergs. Existing manual observation and radar monitoring remain limited, highlighting an urgent need for efficient target recognition technology. This study focuses on polar environmental target detection by constructing a polar dataset with 1342 JPG images covering four classes (sea ice, icebergs, ice channels, and ships), obtained via web collection and video frame extraction. The “Grounding DINO pre-annotation + LabelImg manual fine-tuning” strategy is employed to improve annotation efficiency and accuracy, with data augmentation further enhancing dataset diversity. After comparing YOLOv5n, YOLOv8n, and YOLOv11n, YOLOv8n is selected as the baseline model and improved by introducing the CBAM/SE attention mechanism, SCConv/AKConv convolutions, and BiFPN network. Among these models, the improved YOLOv8n + SCConv achieves the best performance in polar target detection, with a mean average precision (mAP) of 0.844, 1.4% higher than the original model. It effectively reduces missed detections of sea ice and icebergs, thereby enhancing adaptability to complex polar environments. The experimental results demonstrate that the improved model exhibits good robustness in images of varying resolutions, scenes with water surface reflections, and AI-generated images. In addition, a visual GUI with image/video detection functions was developed to support real-time monitoring and result visualization. This research provides essential technical support for safe navigation in ice-covered waters, polar resource exploration, and scientific activities.
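Inference with such a detector might look like the sketch below, assuming the ultralytics package is available; "yolov8n_polar.pt" is a hypothetical fine-tuned checkpoint standing in for the improved YOLOv8n + SCConv model, which is not distributed with the abstract.

```python
from ultralytics import YOLO  # assumes the ultralytics package is installed

# Hypothetical checkpoint trained on the four polar classes named in the abstract.
model = YOLO("yolov8n_polar.pt")

# Run detection on a single image at a modest confidence threshold.
results = model.predict("polar_scene.jpg", conf=0.25)
for box in results[0].boxes:
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```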

22 pages, 12844 KB  
Article
Toward Energy-Safe Industrial Monitoring: A Hybrid Language Model Framework for Video Captioning
by Qianwen Cao, Che Li and Hangyuan Shi
Appl. Sci. 2025, 15(23), 12848; https://doi.org/10.3390/app152312848 - 4 Dec 2025
Viewed by 336
Abstract
In the energy industry, particularly in industrial monitoring scenarios, generative AI-based video captioning is crucial for event understanding and safety analysis. Current approaches typically rely on a single language model to decode visual semantics from video frames. Lightweight pre-trained generative models often produce overly generic captions that omit domain-specific details like energy equipment states or procedural steps. Conversely, multimodal large generative AI models can capture fine-grained visual cues but are prone to distraction from complex backgrounds, resulting in hallucinated descriptions that reduce reliability in high-risk energy workflows. To bridge this gap, we propose a collaborative video captioning framework, EnerSafe-Cap (Energy-Safe Video Captioning), which introduces domain-aware prompt engineering to integrate the efficient summarization of lightweight models with the fine-grained analytical capability of large models, enabling multi-level semantic understanding and thereby improving the accuracy and completeness of video content expression. Furthermore, to fully exploit the strengths of both small and large models, we design a dual-path heterogeneous sampling module. The large model receives key frames selected according to inter-frame motion dynamics, while the lightweight model processes densely sampled frames at fixed intervals, thereby capturing complementary spatiotemporal cues: global event semantics from salient moments and fine-grained procedural continuity from uniform sampling. Experimental results on commonly used benchmark datasets show that our model outperforms baseline models. Specifically, on the VATEX dataset, our model surpasses the lightweight pre-trained language model SwinBERT by 19.49 in the SentenceBERT metric, and outperforms the multimodal large language model Qwen2-vl-2b by 8.27, validating the effectiveness of the method.
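The dual-path heterogeneous sampling can be approximated with a simple frame-differencing sketch: motion-salient key frames feed the large model, while fixed-interval dense frames feed the lightweight model. The stand-in motion measure and parameter values are assumptions, not the paper's module.

```python
import numpy as np

def dual_path_sample(frames, k_key=8, dense_stride=4):
    """Pick motion-salient key frames plus uniformly strided dense frames.

    frames: (T, H, W) grayscale array. A simplified stand-in for the paper's
    dual-path heterogeneous sampling module.
    """
    # Mean absolute frame difference as a crude inter-frame motion score.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    key_idx = np.sort(np.argsort(diffs)[-k_key:]) + 1     # frames after the largest motion
    dense_idx = np.arange(0, len(frames), dense_stride)   # fixed-interval sampling
    return key_idx, dense_idx

frames = np.random.randint(0, 255, (96, 72, 128), dtype=np.uint8)
key_idx, dense_idx = dual_path_sample(frames)
```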

25 pages, 3453 KB  
Article
High-Frame-Rate Camera-Based Vibration Analysis for Health Monitoring of Industrial Robots Across Multiple Postures
by Tuniyazi Abudoureheman, Hayato Otsubo, Feiyue Wang, Kohei Shimasaki and Idaku Ishii
Appl. Sci. 2025, 15(23), 12771; https://doi.org/10.3390/app152312771 - 2 Dec 2025
Viewed by 386
Abstract
Accurate vibration measurement is crucial for maintaining the performance, reliability, and safety of automated manufacturing environments. Abnormal vibrations caused by faults in gears or bearings can degrade positional accuracy, reduce productivity, and, over time, significantly impair production efficiency and product quality. Such vibrations may also disrupt supply chains, cause financial losses, and pose safety risks to workers through collisions, falling objects, or other operational hazards. Conventional vibration measurement techniques, such as wired accelerometers and strain gauges, are typically limited to a few discrete measurement points. Achieving multi-point measurements requires numerous sensors, which increases installation complexity, wiring constraints, and setup time, making the process both time-consuming and costly. The integration of high-frame-rate (HFR) cameras with Digital Image Correlation (DIC) enables non-contact, multi-point, full-field vibration measurement of robot manipulators, effectively addressing these limitations. In this study, HFR cameras were employed to perform non-contact, full-field vibration measurements of industrial robots. The HFR camera recorded the robot’s vibrations at 1000 frames per second (fps), and the resulting video was decomposed into individual frames according to the frame rate. Each frame, with a resolution of 1920 × 1080 pixels, was divided into 128 × 128 pixel blocks with a 64-pixel stride, yielding 435 sub-images. This setup effectively simulates the operation of 435 virtual vibration sensors. By applying mask processing to these sub-images, eight key points representing critical robot components were selected for multi-point DIC displacement measurements, enabling effective assessment of vibration distribution and real-time vibration visualization across the entire manipulator. This approach allows simultaneous capture of displacements across all robot components without the need for physical sensors. The transfer function is defined in the frequency domain as the ratio between the output displacement of each robot component and the input excitation applied by the shaker mounted on the end-effector. The frequency-domain transfer functions were computed for multiple robot components, enabling accurate and full-field vibration analysis during operation.
(This article belongs to the Special Issue Innovative Approaches to Non-Destructive Evaluation)
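The 435 virtual sensors follow from sliding-window arithmetic: with a 128 × 128 block and a 64-pixel stride over a 1920 × 1080 frame, there are (1920 - 128)/64 + 1 = 29 horizontal positions and floor((1080 - 128)/64) + 1 = 15 vertical positions, i.e., 29 × 15 = 435 blocks. A short check:

```python
def block_grid(width, height, block=128, stride=64):
    """Number of sliding-window positions along each axis (floor division)."""
    nx = (width - block) // stride + 1
    ny = (height - block) // stride + 1
    return nx, ny, nx * ny

print(block_grid(1920, 1080))  # (29, 15, 435) virtual vibration sensors
```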

35 pages, 125255 KB  
Article
VideoARD: An Analysis-Ready Multi-Level Data Model for Remote Sensing Video
by Yang Wu, Chenxiao Zhang, Yang Lu, Yaofeng Su, Xuping Jiang, Zhigang Xiang and Zilong Li
Remote Sens. 2025, 17(22), 3746; https://doi.org/10.3390/rs17223746 - 18 Nov 2025
Viewed by 758
Abstract
Remote sensing video (RSV) provides continuous Earth observations at high spatiotemporal resolution that are increasingly important for environmental monitoring, disaster response, infrastructure inspection, and urban management. Despite this potential, operational use of video streams is hindered by very large data volumes, heterogeneous acquisition platforms, inconsistent preprocessing practices, and the absence of standardized formats that deliver data ready for immediate analysis. These shortcomings force repeated low-level computation, complicate semantic extraction, and limit reproducibility and cross-sensor integration. This manuscript presents a principled multi-level analysis-ready data (ARD) model for remote sensing video, named VideoARD, along with VideoCube, a spatiotemporal management and query infrastructure that implements and operationalizes the model. VideoARD formalizes semantic abstraction at scene, object, and event levels and defines minimum and optimal readiness configurations for each level. The proposed pipeline applies stabilization, georeferencing, key frame selection, object detection, trajectory tracking, event inference, and entity materialization. VideoCube places the resulting entities into a five-dimensional structure indexed by spatial, temporal, product, quality, and semantic dimensions, and supports earth observation OLAP-style operations to enable efficient slicing, aggregation, and drill down. Benchmark experiments and three application studies, covering vessel speed monitoring, wildfire detection, and near-real-time three-dimensional reconstruction, quantify system performance and operational utility. Results show that the proposed approach achieves multi-gigabyte-per-second ingestion under parallel feeds, sub-second scene retrieval for typical queries, and second-scale trajectory reconstruction for short tracks. Case studies demonstrate faster alert generation, improved detection consistency, and substantial reductions in preprocessing and manual selection work compared with on-demand baselines. The principal trade-off is an upfront cost for materialization and storage that becomes economical when queries are repeated or entities are reused. The contribution of this work lies in extending the analysis-ready data concept from static imagery to continuous video streams and in delivering a practical, scalable architecture that links semantic abstraction to high-performance spatiotemporal management, thereby improving responsiveness, reproducibility, and cross-sensor analysis for Earth observation.
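A toy sketch of the five-dimensional indexing idea (spatial, temporal, product, quality, semantic); the real VideoCube is a spatiotemporal database infrastructure rather than an in-memory dictionary, so the names and structure here are purely illustrative.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class VideoEntity:
    """A materialized scene/object/event entity, simplified from the VideoARD idea."""
    level: str        # "scene", "object", or "event"
    geometry: tuple   # e.g. a lon/lat bounding box
    payload: dict

# Toy five-dimensional index; keys are (tile, time_bin, product, quality, semantic).
cube = defaultdict(list)

def ingest(entity, tile, time_bin, product, quality, semantic):
    cube[(tile, time_bin, product, quality, semantic)].append(entity)

def slice_query(**fixed):
    """OLAP-style slice: return entities matching the fixed dimensions."""
    dims = ("tile", "time_bin", "product", "quality", "semantic")
    return [e for key, ents in cube.items()
            for e in ents
            if all(key[dims.index(d)] == v for d, v in fixed.items())]

ingest(VideoEntity("event", (120.1, 30.2, 120.3, 30.4), {"type": "wildfire"}),
       tile="T32", time_bin="2025-06-01T12", product="L2", quality="good", semantic="event")
print(slice_query(tile="T32", semantic="event"))
```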

22 pages, 3532 KB  
Article
Dual Weakly Supervised Anomaly Detection and Unsupervised Segmentation for Real-Time Railway Perimeter Intrusion Monitoring
by Donghua Wu, Yi Tian, Fangqing Gao, Xiukun Wei and Changfan Wang
Sensors 2025, 25(20), 6344; https://doi.org/10.3390/s25206344 - 14 Oct 2025
Viewed by 656
Abstract
The high operational velocities of high-speed trains impose constraints on their onboard track intrusion detection systems for real-time capture and analysis, including limited computational resources and motion blur. This emphasizes the critical necessity of track perimeter intrusion monitoring systems. Consequently, an intelligent monitoring system employing trackside cameras is constructed, integrating weakly supervised video anomaly detection and unsupervised foreground segmentation, which offers a solution for monitoring foreign objects on high-speed train tracks. To address the challenges of complex dataset annotation and unidentified target detection, weakly supervised detection is proposed to track foreign object intrusions based on video. The pretraining of Xception3D and the integration of multiple attention mechanisms have markedly enhanced the feature extraction capabilities. The Top-K sample selection alongside the amplitude score/feature loss function effectively discriminates abnormal from normal samples, incorporating time-smoothing constraints to ensure detection consistency across consecutive frames. Once abnormal video frames are identified, a multiscale variational autoencoder is proposed for the positioning of foreign objects. A downsampling/upsampling module is optimized to increase feature extraction efficiency. The pixel-level background weight distribution loss function is engineered to jointly balance background authenticity and noise resistance. Ultimately, the experimental results indicate that the video anomaly detection model achieved an AUC of 0.99 on the track anomaly detection dataset and processed 2 s video segments in 0.41 s. The proposed foreground segmentation algorithm achieved an F1 score of 0.9030 on the track anomaly dataset and 0.8375 on CDnet2014, at 91 frames per second, confirming its efficacy.
(This article belongs to the Section Sensing and Imaging)
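The Top-K selection typical of weakly supervised video anomaly detection can be sketched as averaging the K highest snippet scores per video (bag); the ranking-style loss below, written in PyTorch, is a generic stand-in for the paper's amplitude score/feature loss.

```python
import torch

def topk_bag_score(snippet_scores, k=3):
    """Mean of the K largest per-snippet anomaly scores in one video (bag)."""
    topk = torch.topk(snippet_scores, k=min(k, snippet_scores.numel())).values
    return topk.mean()

def mil_margin_loss(abnormal_scores, normal_scores, k=3, margin=1.0):
    """Push abnormal bag scores above normal ones by a margin (generic MIL ranking loss)."""
    s_abnormal = topk_bag_score(abnormal_scores, k)
    s_normal = topk_bag_score(normal_scores, k)
    return torch.clamp(margin - s_abnormal + s_normal, min=0.0)

loss = mil_margin_loss(torch.rand(32), torch.rand(32))
```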

28 pages, 3456 KB  
Article
Learning to Partition: Dynamic Deep Neural Network Model Partitioning for Edge-Assisted Low-Latency Video Analytics
by Yan Lyu, Likai Liu, Xuezhi Wang, Zhiyu Fan, Jinchen Wang and Guanyu Gao
Mach. Learn. Knowl. Extr. 2025, 7(4), 117; https://doi.org/10.3390/make7040117 - 13 Oct 2025
Viewed by 1880
Abstract
In edge-assisted low-latency video analytics, a critical challenge is balancing on-device inference latency against the high bandwidth costs and network delays of offloading. Ineffectively managing this trade-off degrades performance and hinders critical applications like autonomous systems. Existing solutions often rely on static partitioning or greedy algorithms that optimize for a single frame. These myopic approaches adapt poorly to dynamic network and workload conditions, leading to high long-term costs and significant frame drops. This paper introduces a novel partitioning technique driven by a Deep Reinforcement Learning (DRL) agent on a local device that learns to dynamically partition a video analytics Deep Neural Network (DNN). The agent learns a farsighted policy to dynamically select the optimal DNN split point for each frame by observing the holistic system state. By optimizing for a cumulative long-term reward, our method significantly outperforms competitor methods, demonstrably reducing overall system cost and latency while nearly eliminating frame drops in our real-world testbed evaluation. The primary limitation is the initial offline training phase required by the DRL agent. Future work will focus on extending this dynamic partitioning framework to multi-device and multi-edge environments.
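The agent's decision can be pictured as choosing one DNN split point per frame from value estimates conditioned on the system state; the candidate split set, epsilon-greedy rule, and reward weights below are illustrative assumptions, not the paper's trained policy.

```python
import random

SPLIT_POINTS = list(range(0, 9))  # candidate layer indices where the DNN can be cut (assumed)

def choose_split(q_values, epsilon=0.1):
    """Epsilon-greedy selection of a DNN split point from per-action value estimates."""
    if random.random() < epsilon:
        return random.choice(SPLIT_POINTS)
    return max(SPLIT_POINTS, key=lambda a: q_values[a])

def reward(latency_ms, dropped, bandwidth_mb, w_lat=1.0, w_drop=10.0, w_bw=0.5):
    """Toy long-term cost signal: penalize latency, frame drops, and uplink traffic."""
    return -(w_lat * latency_ms + w_drop * dropped + w_bw * bandwidth_mb)

q = {a: -3.0 * a for a in SPLIT_POINTS}  # placeholder value estimates from the DRL agent
print(choose_split(q))
```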

14 pages, 1942 KB  
Article
Vocal Fold Disorders Classification and Optimization of a Custom Video Laryngoscopy Dataset Through Structural Similarity Index and a Deep Learning-Based Approach
by Elif Emre, Dilber Cetintas, Muhammed Yildirim and Sadettin Emre
J. Clin. Med. 2025, 14(19), 6899; https://doi.org/10.3390/jcm14196899 - 29 Sep 2025
Cited by 1 | Viewed by 836
Abstract
Background/Objectives: Video laryngoscopy is one of the primary methods used by otolaryngologists for detecting and classifying laryngeal lesions. However, the diagnostic process of these images largely relies on clinicians’ visual inspection, which can lead to overlooked small structural changes, delayed diagnosis, and interpretation errors. Methods: AI-based approaches are becoming increasingly critical for accelerating early-stage diagnosis and improving reliability. This study proposes a hybrid Convolutional Neural Network (CNN) architecture that eliminates repetitive and clinically insignificant frames from videos, utilizing only meaningful key frames. Video data from healthy individuals, patients with vocal fold nodules, and those with vocal fold polyps were summarized using three different threshold values with the Structural Similarity Index Measure (SSIM). Results: The resulting key frames were then classified using a hybrid CNN. Experimental findings demonstrate that selecting an appropriate threshold can significantly reduce the model’s memory usage and processing load while maintaining accuracy. In particular, a threshold value of 0.90 provided richer information content thanks to the selection of a wider variety of frames, resulting in the highest success rate. Fine-tuning the last 20 layers of the MobileNetV2 and Xception backbones, combined with the fusion of extracted features, yielded an overall classification accuracy of 98%. Conclusions: The proposed approach provides a mechanism that eliminates unnecessary data and prioritizes only critical information in video-based diagnostic processes, thus helping physicians accelerate diagnostic decisions and reduce memory requirements.
(This article belongs to the Special Issue Artificial Intelligence and Deep Learning in Medical Imaging)
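The SSIM-based summarization step can be sketched as keeping a frame only when its structural similarity to the last kept frame drops below the threshold (0.90 is the value the abstract reports as working best); OpenCV and scikit-image are assumed for frame reading and SSIM computation.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def ssim_keyframes(video_path, threshold=0.90):
    """Keep frames that differ structurally from the last kept frame (SSIM < threshold)."""
    cap = cv2.VideoCapture(video_path)
    kept, last_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_gray is None or ssim(last_gray, gray) < threshold:
            kept.append(idx)
            last_gray = gray
        idx += 1
    cap.release()
    return kept
```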

20 pages, 14512 KB  
Article
Dual-Attention-Based Block Matching for Dynamic Point Cloud Compression
by Longhua Sun, Yingrui Wang and Qing Zhu
J. Imaging 2025, 11(10), 332; https://doi.org/10.3390/jimaging11100332 - 25 Sep 2025
Viewed by 724
Abstract
The irregular and highly non-uniform spatial distribution inherent to dynamic three-dimensional (3D) point clouds (DPCs) severely hampers the extraction of reliable temporal context, rendering inter-frame compression a formidable challenge. Inspired by two-dimensional (2D) image and video compression methods, existing approaches attempt to model the temporal dependence of DPCs through a motion estimation/motion compensation (ME/MC) framework. However, these approaches represent only preliminary applications of this framework; point consistency between adjacent frames is insufficiently explored, and temporal correlation requires further investigation. To address this limitation, we propose a hierarchical ME/MC framework that adaptively selects the granularity of the estimated motion field, thereby ensuring a fine-grained inter-frame prediction process. To further enhance motion estimation accuracy, we introduce a dual-attention-based KNN block-matching (DA-KBM) network. This network employs a bidirectional attention mechanism to more precisely measure the correlation between points, using closely correlated points to predict inter-frame motion vectors and thereby improve inter-frame prediction accuracy. Experimental results show that the proposed DPC compression method achieves a significant improvement (a gain of 70%) in the BD-Rate metric on the 8iFVBv2 dataset compared with the standardized Video-based Point Cloud Compression (V-PCC) v13 method, and a 16% gain over the state-of-the-art deep learning-based inter-mode method.
(This article belongs to the Special Issue 3D Image Processing: Progress and Challenges)
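A purely geometric stand-in for KNN block matching between consecutive point cloud frames is shown below: each current point's motion is estimated from its k nearest neighbours in the previous frame. The paper's DA-KBM network replaces this with a learned bidirectional attention mechanism; the parameters here are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_motion_estimate(prev_pts, curr_pts, k=8):
    """Crude inter-frame motion estimate: mean offset to the k nearest previous points."""
    tree = cKDTree(prev_pts)
    _, idx = tree.query(curr_pts, k=k)
    neighbours = prev_pts[idx]                    # (N, k, 3)
    motion = curr_pts[:, None, :] - neighbours    # displacement to each neighbour
    return motion.mean(axis=1)                    # (N, 3) per-point motion vectors

prev = np.random.rand(1024, 3).astype(np.float32)
curr = prev + 0.01 * np.random.randn(1024, 3).astype(np.float32)
flow = knn_motion_estimate(prev, curr)
```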

18 pages, 6253 KB  
Article
Exploring Sign Language Dataset Augmentation with Generative Artificial Intelligence Videos: A Case Study Using Adobe Firefly-Generated American Sign Language Data
by Valentin Bercaru and Nirvana Popescu
Information 2025, 16(9), 799; https://doi.org/10.3390/info16090799 - 15 Sep 2025
Viewed by 1270
Abstract
Currently, high-quality datasets focused on Sign Language Recognition are either private, proprietary, or difficult to obtain due to cost. Therefore, we aim to mitigate this problem by augmenting a publicly available dataset with artificially generated data in order to obtain a richer, more diverse dataset. The performance of Sign Language Recognition (SLR) systems is highly dependent on the quality and diversity of training datasets. However, acquiring large-scale and well-annotated sign language video data remains a significant challenge. This experiment explores the use of Generative Artificial Intelligence (GenAI), specifically Adobe Firefly, to create synthetic video data for American Sign Language (ASL) fingerspelling. Thirteen letters out of 26 were selected for generation, and short videos representing each sign were synthesized and processed into static frames. These synthetic frames replaced approximately 7.5% of the original dataset and were integrated into the training data of a publicly available Convolutional Neural Network (CNN) model. After retraining the model with the augmented dataset, the accuracy did not drop. Moreover, the validation accuracy was approximately the same. The resulting model achieved a maximum accuracy of 98.04%. While the performance gain was limited (less than 1%), the approach illustrates the feasibility of using GenAI tools to generate training data and supports further research into data augmentation for low-resource SLR tasks.
(This article belongs to the Section Artificial Intelligence)
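The augmentation step can be sketched as swapping roughly 7.5% of the real training frames for synthetic ones; random selection is an assumption, since the abstract does not state how the replaced frames were chosen.

```python
import random

def mix_synthetic(real_frames, synthetic_frames, synthetic_fraction=0.075):
    """Replace roughly 7.5% of the real training frames with GenAI-generated ones.

    Both inputs are lists of (image_path, label) pairs; selection is random, which is
    an assumption rather than the authors' documented procedure.
    """
    n_replace = min(int(len(real_frames) * synthetic_fraction), len(synthetic_frames))
    keep = random.sample(real_frames, len(real_frames) - n_replace)
    return keep + random.sample(synthetic_frames, n_replace)
```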

15 pages, 1786 KB  
Article
Application of Gaussian SVM Flame Detection Model Based on Color and Gradient Features in Engine Test Plume Images
by Song Yan, Yushan Gao, Zhiwei Zhang and Yi Li
Sensors 2025, 25(17), 5592; https://doi.org/10.3390/s25175592 - 8 Sep 2025
Viewed by 1126
Abstract
This study presents a flame detection model based on real experimental data collected during turbopump hot-fire tests of a liquid rocket engine. In these tests, a MEMRECAM ACS-1 M40 high-speed camera, serving as an optical sensor within the test instrumentation system, captured plume images for analysis. To detect abnormal flame phenomena in the plume, a Gaussian support vector machine (SVM) model was developed using image features derived from both color and gradient information. Six representative frames containing visible flames were selected from a single test failure video. These images were segmented in the YCbCr color space using the k-means clustering algorithm to distinguish flame and non-flame pixels. A 10-dimensional feature vector was constructed for each pixel and then reduced to five dimensions using the Maximum Relevance Minimum Redundancy (mRMR) method. The reduced vectors were used to train the Gaussian SVM model. The model achieved a 97.6% detection accuracy despite being trained on a limited dataset. It has been successfully applied in multiple subsequent engine tests, and it has proven effective in detecting ablation-related anomalies. By combining real-world sensor data acquisition with intelligent image-based analysis, this work enhances the monitoring capabilities in rocket engine development.
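The classifier itself corresponds to an RBF-kernel ("Gaussian") SVM over five-dimensional per-pixel features; the sketch below uses scikit-learn with random stand-in data, and the hyperparameters are illustrative rather than the trained model's values.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: per-pixel feature vectors already reduced to 5 dimensions (stand-in data);
# y: 1 for flame pixels, 0 otherwise.
X = np.random.rand(2000, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```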

15 pages, 2479 KB  
Article
Inter- and Intraobserver Variability in Bowel Preparation Scoring for Colon Capsule Endoscopy: Impact of AI-Assisted Assessment Feasibility Study
by Ian Io Lei, Daniel R. Gaya, Alexander Robertson, Benedicte Schelde-Olesen, Alice Mapiye, Anirudh Bhandare, Bei Bei Lui, Chander Shekhar, Ursula Valentiner, Pere Gilabert, Pablo Laiz, Santi Segui, Nicholas Parsons, Cristiana Huhulea, Hagen Wenzek, Elizabeth White, Anastasios Koulaouzidis and Ramesh P. Arasaradnam
Cancers 2025, 17(17), 2840; https://doi.org/10.3390/cancers17172840 - 29 Aug 2025
Viewed by 992
Abstract
Background: Colon capsule endoscopy (CCE) has seen increased adoption since the COVID-19 pandemic, offering a non-invasive alternative for lower gastrointestinal investigations. However, inadequate bowel preparation remains a key limitation, often leading to higher conversion rates to colonoscopy. Manual assessment of bowel cleanliness is inherently subjective and marked by high interobserver variability. Recent advances in artificial intelligence (AI) have enabled automated cleansing scores that not only standardise assessment and reduce variability but also align with the emerging semi-automated AI reading workflow, which highlights only clinically significant frames. As full video review becomes less routine, reliable and consistent cleansing evaluation is essential, positioning bowel preparation AI as a critical enabler of diagnostic accuracy and scalable CCE deployment. Objective: This CESCAIL sub-study aimed to (1) evaluate interobserver agreement in CCE bowel cleansing assessment using two established scoring systems, and (2) determine the impact of AI-assisted scoring, specifically a TransUNet-based segmentation model with a custom Patch Loss function, on both interobserver and intraobserver agreement compared to manual assessment. Methods: As part of the CESCAIL study, twenty-five CCE videos were randomly selected from 673 participants. Nine readers with varying CCE experience scored bowel cleanliness using the Leighton–Rex and CC-CLEAR scales. After a minimum 8-week washout, the same readers reassessed the videos using AI-assisted CC-CLEAR scores. Interobserver variability was evaluated using bootstrapped intraclass correlation coefficients (ICC) and Fleiss’ Kappa; intraobserver variability was assessed with weighted Cohen’s Kappa, paired t-tests, and Two One-Sided Tests (TOSTs). Results: Leighton–Rex showed poor to fair agreement (Fleiss = 0.14; ICC = 0.55), while CC-CLEAR demonstrated fair to excellent agreement (Fleiss = 0.27; ICC = 0.90). AI-assisted CC-CLEAR achieved only moderate agreement overall (Fleiss = 0.27; ICC = 0.69), with weaker performance among less experienced readers (Fleiss = 0.15; ICC = 0.56). Intraobserver agreement was excellent (ICC > 0.75) for experienced readers but variable in others (ICC 0.03–0.80). AI-assisted scores were significantly lower than manual reads by 1.46 points (p < 0.001), potentially increasing conversion to colonoscopy. Conclusions: AI-assisted scoring did not improve interobserver agreement and may even reduce consistency amongst less experienced readers. The maintained agreement observed in experienced readers highlights its current value in experienced hands only. Further refinement, including spatial analysis integration, is needed for robust overall AI implementation in CCE.
(This article belongs to the Section Methods and Technologies Development)
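The intraobserver statistic named in the abstract, weighted Cohen's kappa, can be computed as in the sketch below; the scores are random stand-ins, and the score range and quadratic weighting are assumptions about how the CC-CLEAR scale would be compared here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical cleansing scores from one reader on two occasions, for 25 videos.
first_read = np.random.randint(0, 11, size=25)
second_read = np.clip(first_read + np.random.randint(-1, 2, size=25), 0, 10)

kappa = cohen_kappa_score(first_read, second_read, weights="quadratic")
print(f"intraobserver weighted kappa: {kappa:.2f}")
```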
