Search Results (58)

Search Parameters:
Keywords = keyframe selection

49 pages, 6627 KB  
Article
LEARNet: A Learning Entropy-Aware Representation Network for Educational Video Understanding
by Chitrakala S, Nivedha V V and Niranjana S R
Entropy 2026, 28(1), 3; https://doi.org/10.3390/e28010003 - 19 Dec 2025
Viewed by 370
Abstract
Educational videos contain long periods of visual redundancy, where only a few frames convey meaningful instructional information. Conventional video models, which are designed for dynamic scenes, often fail to capture these subtle pedagogical transitions. We introduce LEARNet, an entropy-aware framework that models educational video understanding as the extraction of high-information instructional content from low-entropy visual streams. LEARNet combines a Temporal Information Bottleneck (TIB) for selecting pedagogically significant keyframes with a Spatial–Semantic Decoder (SSD) that produces fine-grained annotations refined through a proposed Relational Consistency Verification Network (RCVN). This architecture enables the construction of EVUD-2M, a large-scale benchmark with multi-level semantic labels for diverse instructional formats. LEARNet achieves substantial redundancy reduction (70.2%) while maintaining high annotation fidelity (F1 = 0.89, mAP@50 = 0.88). Grounded in information-theoretic principles, LEARNet provides a scalable foundation for tasks such as lecture indexing, visual content summarization, and multimodal learning analytics. Full article
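As an illustration of the information-theoretic idea behind this kind of keyframe selection (not LEARNet's actual Temporal Information Bottleneck), here is a minimal sketch that scores lecture frames by grayscale-histogram entropy and keeps a frame whenever the entropy jumps, e.g. when a new slide or diagram appears. The jump threshold and sampling stride are illustrative assumptions.

```python
import cv2
import numpy as np

def frame_entropy(gray):
    """Shannon entropy (bits) of an 8-bit grayscale histogram."""
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_keyframes(video_path, jump=0.5, sample_every=10):
    """Keep frames whose entropy changes by more than `jump` bits
    relative to the last kept frame (e.g., a new slide appears)."""
    cap = cv2.VideoCapture(video_path)
    kept, last_h, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            h = frame_entropy(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
            if last_h is None or abs(h - last_h) > jump:
                kept.append(idx)
                last_h = h
        idx += 1
    cap.release()
    return kept
```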

28 pages, 2836 KB  
Article
MA-EVIO: A Motion-Aware Approach to Event-Based Visual–Inertial Odometry
by Mohsen Shahraki, Ahmed Elamin and Ahmed El-Rabbany
Sensors 2025, 25(23), 7381; https://doi.org/10.3390/s25237381 - 4 Dec 2025
Viewed by 645
Abstract
Indoor localization remains a challenging task due to the unavailability of reliable global navigation satellite system (GNSS) signals in most indoor environments. One way to overcome this challenge is through visual–inertial odometry (VIO), which enables real-time pose estimation by fusing camera and inertial measurements. However, VIO suffers from performance degradation under high-speed motion and in poorly lit environments. In such scenarios, motion blur, sensor noise, and low temporal resolution reduce the accuracy and robustness of the estimated trajectory. To address these limitations, we propose a motion-aware event-based VIO (MA-EVIO) system that adaptively fuses asynchronous event data, frame-based imagery, and inertial measurements for robust and accurate pose estimation. MA-EVIO employs a hybrid tracking strategy combining sparse feature matching and direct photometric alignment. A key innovation is its motion-aware keyframe selection, which dynamically adjusts tracking parameters based on real-time motion classification and feature quality. This motion awareness also enables adaptive sensor fusion: during fast motion, the system prioritizes event data, while under slow or stable motion, it relies more on RGB frames and feature-based tracking. Experimental results on the DAVIS240c and VECtor benchmarks demonstrate that MA-EVIO outperforms state-of-the-art methods, achieving a lower mean position error (MPE) of 0.19 on DAVIS240c compared to 0.21 (EVI-SAM) and 0.24 (PL-EVIO), and superior performance on VECtor with MPE/mean rotation error (MRE) of 1.19%/1.28 deg/m versus 1.27%/1.42 deg/m (EVI-SAM) and 1.93%/1.56 deg/m (PL-EVIO). These results validate the effectiveness of MA-EVIO in challenging dynamic indoor environments. Full article
(This article belongs to the Special Issue Multi-Sensor Integration for Mobile and UAS Mapping)
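A minimal sketch of what "motion-aware" keyframing and sensor weighting can look like in practice. The IMU-based motion classifier, the threshold values, and the 0.8/0.2 fusion weights are illustrative assumptions, not the MA-EVIO implementation.

```python
import numpy as np

def motion_class(gyro_rad_s, accel_m_s2, fast_gyro=1.5, fast_accel=3.0):
    """Crude motion classifier from IMU magnitudes (thresholds are illustrative)."""
    w = np.linalg.norm(gyro_rad_s)
    a = np.linalg.norm(accel_m_s2) - 9.81       # roughly remove gravity magnitude
    return "fast" if (w > fast_gyro or abs(a) > fast_accel) else "slow"

def fusion_weights(motion, n_tracked_features, min_features=40):
    """Weight event-based vs. frame-based residuals; favour events when
    motion is fast or frame features are scarce (e.g., motion blur)."""
    if motion == "fast" or n_tracked_features < min_features:
        return {"event": 0.8, "frame": 0.2}
    return {"event": 0.3, "frame": 0.7}

def is_keyframe(rel_translation, rel_rotation_deg, motion,
                t_slow=0.25, t_fast=0.10, r_slow=15.0, r_fast=6.0):
    """Motion-aware keyframe test: insert keyframes more densely during fast motion."""
    t_thr, r_thr = (t_fast, r_fast) if motion == "fast" else (t_slow, r_slow)
    return np.linalg.norm(rel_translation) > t_thr or rel_rotation_deg > r_thr
```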

14 pages, 1310 KB  
Article
Stereo-GS: Online 3D Gaussian Splatting Mapping Using Stereo Depth Estimation
by Junkyu Park, Byeonggwon Lee, Sanggi Lee and Soohwan Song
Electronics 2025, 14(22), 4436; https://doi.org/10.3390/electronics14224436 - 14 Nov 2025
Viewed by 1843
Abstract
We present Stereo-GS, a real-time system for online 3D Gaussian Splatting (3DGS) that reconstructs photorealistic 3D scenes from streaming stereo pairs. Unlike prior offline 3DGS methods that require dense multi-view input or precomputed depth, Stereo-GS estimates metrically accurate depth maps directly from rectified stereo geometry, enabling progressive, globally consistent reconstruction. The frontend combines a stereo implementation of DROID-SLAM for robust tracking and keyframe selection with FoundationStereo, a generalizable stereo network that needs no scene-specific fine-tuning. A two-stage filtering pipeline improves depth reliability by removing outliers using a variance-based refinement filter followed by a multi-view consistency check. In the backend, we selectively initialize new Gaussians in under-represented regions flagged by low PSNR during rendering and continuously optimize them via differentiable rendering. To maintain global coherence with minimal overhead, we apply a lightweight rigid alignment after periodic bundle adjustment. On EuRoC and TartanAir, Stereo-GS attains state-of-the-art performance, improving average PSNR by 0.22 dB and 2.45 dB over the best baseline, respectively. Together with superior visual quality, these results show that Stereo-GS delivers high-fidelity, geometrically accurate 3D reconstructions suitable for real-time robotics, navigation, and immersive AR/VR applications. Full article
(This article belongs to the Special Issue Real-Time Computer Vision)
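The two-stage depth filtering described above can be approximated as follows: a local-variance outlier filter followed by a reprojection-based multi-view consistency check. This is a generic sketch with illustrative thresholds, not the authors' pipeline; `K` is the camera intrinsic matrix and `R_ab`, `t_ab` the assumed relative pose from view A to view B.

```python
import cv2
import numpy as np

def variance_filter(depth, win=5, max_var=0.05):
    """Stage 1: discard depth pixels whose local variance is high
    (likely outliers near boundaries or stereo mismatches)."""
    d = depth.astype(np.float32)
    mean = cv2.blur(d, (win, win))
    mean_sq = cv2.blur(d * d, (win, win))
    var = np.maximum(mean_sq - mean * mean, 0.0)
    out = d.copy()
    out[var > max_var] = 0.0          # 0 marks invalid depth
    return out

def consistency_check(depth_a, depth_b, K, R_ab, t_ab, rel_tol=0.02):
    """Stage 2: keep a pixel of view A only if its depth, warped into view B,
    agrees with B's depth map within a relative tolerance."""
    h, w = depth_a.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_a
    valid = z > 0
    # back-project into A's camera frame, then transform into B
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    p_b = np.stack([x, y, z], axis=-1) @ R_ab.T + t_ab
    zb = p_b[..., 2]
    ub = np.round(fx * p_b[..., 0] / np.maximum(zb, 1e-6) + cx).astype(int)
    vb = np.round(fy * p_b[..., 1] / np.maximum(zb, 1e-6) + cy).astype(int)
    inside = valid & (zb > 0) & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    db = np.zeros_like(zb)
    db[inside] = depth_b[vb[inside], ub[inside]]
    ok = inside & (db > 0) & (np.abs(db - zb) < rel_tol * zb)
    out = np.zeros_like(depth_a)
    out[ok] = depth_a[ok]
    return out
```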

22 pages, 1773 KB  
Article
ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency
by Shaoqian Yu, Xingyu Chen, Yuzhe Sheng, Han Zhang, Xinlong Li and Sijia Yu
Electronics 2025, 14(22), 4420; https://doi.org/10.3390/electronics14224420 - 13 Nov 2025
Viewed by 775
Abstract
The alarming realism of Deepfake presents a significant challenge to digital authenticity, yet its inherent difficulty in synchronizing the emotional cues between facial expressions and speech offers a critical opportunity for detection. However, most existing approaches rely on general-purpose backbones for unimodal feature extraction, resulting in an inadequate representation of fine-grained dynamic emotional expressions. Although a limited number of studies have explored cross-modal emotional consistency of deepfake detection, they typically employ shallow fusion techniques which limit latent expressiveness. To address this, we propose ACE-Net, a novel framework that identifies forgeries via multimodal emotional inconsistency. For the speech modality, we design a bidirectional cross-attention mechanism to fuse acoustic features from a lightweight CNN-based model with textual features, yielding a representation highly sensitive to fine-grained emotional dynamics. For the visual modality, a MobileNetV3-based perception head is proposed to adaptively select keyframes, yielding a representation focused on the most emotionally salient moments. For multimodal emotional consistency discrimination, we develop a multi-dimensional fusion strategy to deeply integrate high-level emotional features from different modalities within a unified latent space. For unimodal emotion recognition, both the audio and visual branches outperform baseline models on the CREMA-D dataset. Building on this, the complete ACE-Net model achieves a state-of-the-art AUC of 0.921 on the challenging DFDC benchmark. Full article
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Machine Learning)
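A minimal NumPy sketch of bidirectional cross-attention fusion between an acoustic sequence and a textual sequence, written as a single unprojected head with mean pooling. ACE-Net's actual fusion is learned and more elaborate; the shared feature dimension `d` is an assumption.

```python
import numpy as np

def attend(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def bidirectional_cross_attention(audio_feats, text_feats):
    """Fuse acoustic (Ta, d) and textual (Tt, d) sequences in both
    directions and pool into a single (2d,) fused embedding."""
    a2t = attend(audio_feats, text_feats, text_feats)   # audio queries attend to text
    t2a = attend(text_feats, audio_feats, audio_feats)  # text queries attend to audio
    return np.concatenate([a2t.mean(axis=0), t2a.mean(axis=0)])
```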

22 pages, 3921 KB  
Article
Tightly Coupled LiDAR-Inertial Odometry for Autonomous Driving via Self-Adaptive Filtering and Factor Graph Optimization
by Weiwei Lyu, Haoting Li, Shuanggen Jin, Haocai Huang, Xiaojuan Tian, Yunlong Zhang, Zheyuan Du and Jinling Wang
Machines 2025, 13(11), 977; https://doi.org/10.3390/machines13110977 - 23 Oct 2025
Viewed by 1381
Abstract
Simultaneous Localization and Mapping (SLAM) has become a critical tool for fully autonomous driving. However, current methods suffer from inefficient data utilization and degraded navigation performance in complex and unknown environments. In this paper, an accurate and tightly coupled method of LiDAR-inertial odometry is proposed. First, a self-adaptive voxel grid filter is developed to dynamically downsample the original point clouds based on environmental feature richness, aiming to balance navigation accuracy and real-time performance. Second, keyframe factors are selected based on thresholds of translation distance, rotation angle, and time interval and then introduced into the factor graph to improve global consistency. Additionally, high-quality Global Navigation Satellite System (GNSS) factors are selected and incorporated into the factor graph through linear interpolation, thereby improving the navigation accuracy in complex and unknown environments. The proposed method is evaluated on the KITTI dataset over various scales and environments. Results show that the proposed method outperforms other methods such as ALOAM, LIO-SAM, and SC-LeGO-LOAM. In urban scenes in particular, its trajectory accuracy improves by 33.13%, 57.56%, and 58.4% over these methods, respectively, demonstrating excellent navigation and positioning capability. Full article
(This article belongs to the Section Vehicle Engineering)
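The keyframe-factor criterion described above (translation, rotation, and time-interval thresholds) reduces to a simple test on consecutive poses; the threshold values below (1 m, 10°, 1 s) are illustrative, not the paper's tuned settings.

```python
import numpy as np

def should_add_keyframe(prev_pose, cur_pose, prev_time, cur_time,
                        d_thr=1.0, r_thr_deg=10.0, t_thr=1.0):
    """Add a keyframe factor when translation, rotation, or elapsed time
    exceeds its threshold. Poses are 4x4 homogeneous transforms."""
    rel = np.linalg.inv(prev_pose) @ cur_pose
    d = np.linalg.norm(rel[:3, 3])
    # rotation angle from the trace of the relative rotation matrix
    cos_a = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_a))
    return d > d_thr or angle_deg > r_thr_deg or (cur_time - prev_time) > t_thr
```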

28 pages, 32292 KB  
Article
Contextual Feature Fusion-Based Keyframe Selection Using Semantic Attention and Diversity-Aware Optimization for Video Summarization
by Chitrakala S and Aparyay Kumar
Symmetry 2025, 17(10), 1737; https://doi.org/10.3390/sym17101737 - 15 Oct 2025
Viewed by 1372
Abstract
Training-free video summarization tackles the challenge of selecting the most informative keyframes from a video without relying on costly training or complex deep models. This work introduces C2FVS-DPP (Contextual Feature Fusion Video Summarization with Determinantal Point Process), a lightweight framework that generates concise video summaries by jointly modeling semantic importance, visual diversity, temporal structure, and symmetry. The design centers on a symmetry-aware fusion strategy, where appearance, motion, and semantic cues are aligned in a unified embedding space, and on a reward-guided optimization logic that balances representativeness and diversity. Specifically, appearance features from ResNet-50, motion cues from optical flow, and semantic representations from BERT-encoded BLIP captions are fused into a contextual embedding. A Transformer encoder assigns importance scores, followed by shot boundary detection and K-Medoids clustering to identify candidate keyframes. These candidates are refined through a reward-based re-ranking mechanism that integrates semantic relevance, representativeness, and visual uniqueness, while a Determinantal Point Process (DPP) enforces globally diverse selection under a keyframe budget. To enable reliable evaluation, enhanced versions of the SumMe and TVSum50 datasets were curated to reduce redundancy and increase semantic density. On these curated benchmarks, C2FVS-DPP achieves F1-scores of 0.22 and 0.43 and fidelity scores of 0.16 and 0.40 on SumMe and TVSum50, respectively, surpassing existing models on both metrics. In terms of compression ratio, the framework records 0.9959 on SumMe and 0.9940 on TVSum50, remaining highly competitive with the best-reported values of 0.9981 and 0.9983. These results highlight the strength of C2FVS-DPP as an inference-driven, symmetry-aware, and resource-efficient solution for video summarization. Full article
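A minimal sketch of DPP-based diverse selection under a keyframe budget, using a quality-times-similarity kernel and greedy MAP inference. The kernel construction here (cosine similarity of frame features scaled by importance scores) is an illustrative stand-in for the paper's fused contextual embeddings and reward-based scores.

```python
import numpy as np

def dpp_greedy(features, scores, budget):
    """Greedy MAP selection under a DPP kernel L = diag(q) S diag(q),
    where S is cosine similarity between frame features and q are
    importance scores. Returns indices of a diverse, relevant subset."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    S = f @ f.T
    q = np.asarray(scores, dtype=float)
    L = (q[:, None] * S * q[None, :]) + 1e-6 * np.eye(len(q))  # jitter for stability
    selected = []
    for _ in range(min(budget, len(q))):
        best_i, best_val = None, -np.inf
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best_val, best_i = logdet, i
        if best_i is None:
            break
        selected.append(best_i)
    return selected
```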

28 pages, 10315 KB  
Article
DKB-SLAM: Dynamic RGB-D Visual SLAM with Efficient Keyframe Selection and Local Bundle Adjustment
by Qian Sun, Ziqiang Xu, Yibing Li, Yidan Zhang and Fang Ye
Robotics 2025, 14(10), 134; https://doi.org/10.3390/robotics14100134 - 25 Sep 2025
Viewed by 1783
Abstract
Reliable navigation for mobile robots in dynamic, human-populated environments remains a significant challenge, as moving objects often cause localization drift and map corruption. While Simultaneous Localization and Mapping (SLAM) techniques excel in static settings, issues like keyframe redundancy and optimization inefficiencies further hinder their practical deployment on robotic platforms. To address these challenges, we propose DKB-SLAM, a real-time RGB-D visual SLAM system specifically designed to enhance robotic autonomy in complex dynamic scenes. DKB-SLAM integrates optical flow with Gaussian-based depth distribution analysis within YOLO detection frames to efficiently filter dynamic points, crucial for maintaining accurate pose estimates for the robot. An adaptive keyframe selection strategy balances map density and information integrity using a sliding window, considering the robot’s motion dynamics through parallax, visibility, and matching quality. Furthermore, a heterogeneously weighted local bundle adjustment (BA) method leverages map point geometry, assigning higher weights to stable edge points to refine the robot’s trajectory. Evaluations on the TUM RGB-D benchmark and, crucially, on a mobile robot platform in real-world dynamic scenarios, demonstrate that DKB-SLAM outperforms state-of-the-art methods, providing a robust and efficient solution for high-precision robot localization and mapping in dynamic environments. Full article
(This article belongs to the Special Issue SLAM and Adaptive Navigation for Robotics)
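A rough sketch of Gaussian-style depth analysis inside detection boxes for dynamic-point filtering: points whose depth lies near the dominant depth of a detected object are treated as belonging to that (possibly moving) object. The 1.5-sigma rule and minimum point count are illustrative assumptions, and the real system additionally uses optical flow.

```python
import numpy as np

def filter_dynamic_points(points_uv, depths, boxes, k_sigma=1.5):
    """Return a mask of points kept as static. A point is marked dynamic if it
    falls inside a detection box and its depth lies within k_sigma of the box's
    mean depth (i.e., it sits on the detected object, not the background)."""
    dynamic = np.zeros(len(points_uv), dtype=bool)
    for (x1, y1, x2, y2) in boxes:
        inside = ((points_uv[:, 0] >= x1) & (points_uv[:, 0] <= x2) &
                  (points_uv[:, 1] >= y1) & (points_uv[:, 1] <= y2))
        if inside.sum() < 5:
            continue
        mu, sigma = depths[inside].mean(), depths[inside].std() + 1e-6
        on_object = inside & (np.abs(depths - mu) < k_sigma * sigma)
        dynamic |= on_object
    return ~dynamic
```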

23 pages, 3485 KB  
Article
MSGS-SLAM: Monocular Semantic Gaussian Splatting SLAM
by Mingkai Yang, Shuyu Ge and Fei Wang
Symmetry 2025, 17(9), 1576; https://doi.org/10.3390/sym17091576 - 20 Sep 2025
Cited by 1 | Viewed by 2383
Abstract
With the iterative evolution of SLAM (Simultaneous Localization and Mapping) technology in the robotics domain, the SLAM paradigm based on three-dimensional Gaussian distribution models has emerged as the current state-of-the-art technical approach. This research proposes a novel MSGS-SLAM system (Monocular Semantic Gaussian Splatting SLAM), which innovatively integrates monocular vision with three-dimensional Gaussian distribution models within a semantic SLAM framework. Our approach exploits the inherent spherical symmetries of isotropic Gaussian distributions, enabling symmetric optimization processes that maintain computational efficiency while preserving geometric consistency. Current mainstream three-dimensional Gaussian semantic SLAM systems typically rely on depth sensors for map reconstruction and semantic segmentation, which not only significantly increases hardware costs but also limits the deployment potential of systems in diverse scenarios. To overcome this limitation, this research introduces a depth estimation proxy framework based on Metric3D-V2, which effectively addresses the inherent deficiency of monocular vision systems in depth information acquisition. Additionally, our method leverages architectural symmetries in indoor environments to enhance semantic understanding through symmetric feature matching. Through this approach, the system achieves robust and efficient semantic feature integration and optimization without relying on dedicated depth sensors, thereby substantially reducing the dependency of three-dimensional Gaussian semantic SLAM systems on depth sensors and expanding their application scope. Furthermore, this research proposes a keyframe selection algorithm based on semantic guidance and proxy depth collaborative mechanisms, which effectively suppresses pose drift errors accumulated during long-term system operation, thereby achieving robust global loop closure correction. Through systematic evaluation on multiple standard datasets, MSGS-SLAM achieves comparable technical performance to existing three-dimensional Gaussian model-based semantic SLAM systems across multiple key performance metrics including ATE RMSE, PSNR, and mIoU. Full article
(This article belongs to the Section Engineering and Materials)

24 pages, 1751 KB  
Article
Robust JND-Guided Video Watermarking via Adaptive Block Selection and Temporal Redundancy
by Antonio Cedillo-Hernandez, Lydia Velazquez-Garcia, Manuel Cedillo-Hernandez, Ismael Dominguez-Jimenez and David Conchouso-Gonzalez
Mathematics 2025, 13(15), 2493; https://doi.org/10.3390/math13152493 - 3 Aug 2025
Viewed by 1231
Abstract
This paper introduces a robust and imperceptible video watermarking framework designed for blind extraction in dynamic video environments. The proposed method operates in the spatial domain and combines multiscale perceptual analysis, adaptive Just Noticeable Difference (JND)-based quantization, and temporal redundancy via multiframe embedding. Watermark bits are embedded selectively in blocks with high perceptual masking using a QIM strategy, and the corresponding DCT coefficients are estimated directly from the spatial domain to reduce complexity. To enhance resilience, each bit is redundantly inserted across multiple keyframes selected based on scene transitions. Extensive simulations over 21 benchmark videos (CIF, 4CIF, HD) validate that the method achieves superior performance in robustness and perceptual quality, with an average Bit Error Rate (BER) of 1.03%, PSNR of 50.1 dB, SSIM of 0.996, and VMAF of 97.3 under compression, noise, cropping, and temporal desynchronization. The system outperforms several recent state-of-the-art techniques in both quality and speed, requiring no access to the original video during extraction. These results confirm the method’s viability for practical applications such as copyright protection and secure video streaming. Full article
(This article belongs to the Section E: Applied Mathematics)
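For reference, the QIM embedding/extraction idea with a JND-scaled quantization step can be sketched as follows. The JND value and the step scaling are illustrative; the actual method applies this to spatially estimated DCT coefficients of selected blocks with redundancy across keyframes.

```python
import numpy as np

def qim_embed(coeff, bit, delta):
    """Quantization Index Modulation: quantize a coefficient onto one of two
    interleaved lattices (offset 0 for bit 0, delta/2 for bit 1)."""
    offset = 0.0 if bit == 0 else delta / 2.0
    return np.round((coeff - offset) / delta) * delta + offset

def qim_extract(coeff, delta):
    """Blind extraction: pick the lattice whose nearest point is closer."""
    d0 = abs(coeff - qim_embed(coeff, 0, delta))
    d1 = abs(coeff - qim_embed(coeff, 1, delta))
    return 0 if d0 <= d1 else 1

# Usage with a perceptually scaled step (delta proportional to a JND estimate)
jnd = 4.0                      # illustrative JND value for one block
delta = 2.0 * jnd              # larger JND -> stronger yet still-invisible embedding
marked = qim_embed(57.3, bit=1, delta=delta)
assert qim_extract(marked, delta) == 1
```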

34 pages, 5774 KB  
Article
Approach to Semantic Visual SLAM for Bionic Robots Based on Loop Closure Detection with Combinatorial Graph Entropy in Complex Dynamic Scenes
by Dazheng Wang and Jingwen Luo
Biomimetics 2025, 10(7), 446; https://doi.org/10.3390/biomimetics10070446 - 6 Jul 2025
Viewed by 931
Abstract
In complex dynamic environments, the performance of SLAM systems on bionic robots is susceptible to interference from dynamic objects or structural changes in the environment. To address this problem, we propose a semantic visual SLAM (vSLAM) algorithm based on loop closure detection with combinatorial graph entropy. First, in terms of the dynamic feature detection results of YOLOv8-seg, the feature points at the edges of the dynamic object are finely judged by calculating the mean absolute deviation (MAD) of the depth of the pixel points. Then, a high-quality keyframe selection strategy is constructed by combining the semantic information, the average coordinates of the semantic objects, and the degree of variation in the dense region of feature points. Subsequently, the unweighted and weighted graphs of keyframes are constructed according to the distribution of feature points, characterization points, and semantic information, and then a high-performance loop closure detection method based on combinatorial graph entropy is developed. The experimental results show that our loop closure detection approach exhibits higher precision and recall in real scenes compared to the bag-of-words (BoW) model. Compared with ORB-SLAM2, the absolute trajectory accuracy in high-dynamic sequences improved by an average of 97.01%, while the number of extracted keyframes decreased by an average of 61.20%. Full article
(This article belongs to the Special Issue Artificial Intelligence for Autonomous Robots: 3rd Edition)
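A minimal sketch of the MAD-based depth test used to judge feature points at object edges: a large mean absolute deviation of depth in a small window indicates a depth discontinuity at a (possibly dynamic) object boundary, so the point is rejected. The window size and threshold are illustrative assumptions.

```python
import numpy as np

def mad_depth_test(depth, u, v, win=3, mad_thr=0.15):
    """Return True if the point (u, v) is kept as a reliable static feature,
    based on the mean absolute deviation (MAD) of depth in its neighbourhood."""
    h, w = depth.shape
    patch = depth[max(0, v - win):min(h, v + win + 1),
                  max(0, u - win):min(w, u + win + 1)]
    patch = patch[patch > 0]                     # ignore invalid depths
    if patch.size == 0:
        return False                             # no evidence, reject
    mad = np.mean(np.abs(patch - patch.mean()))
    return mad < mad_thr
```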

35 pages, 1553 KB  
Article
Efficient Learning-Based Robotic Navigation Using Feature-Based RGB-D Pose Estimation and Topological Maps
by Eder A. Rodríguez-Martínez, Jesús Elías Miranda-Vega, Farouk Achakir, Oleg Sergiyenko, Julio C. Rodríguez-Quiñonez, Daniel Hernández Balbuena and Wendy Flores-Fuentes
Entropy 2025, 27(6), 641; https://doi.org/10.3390/e27060641 - 15 Jun 2025
Viewed by 3148
Abstract
Robust indoor robot navigation typically demands either costly sensors or extensive training data. We propose a cost-effective RGB-D navigation pipeline that couples feature-based relative pose estimation with a lightweight multi-layer-perceptron (MLP) policy. RGB-D keyframes extracted from human-driven traversals form nodes of a topological map; edges are added when visual similarity and geometric–kinematic constraints are jointly satisfied. During autonomy, LightGlue features and SVD give six-DoF relative pose to the active keyframe, and the MLP predicts one of four discrete actions. Low visual similarity or detected obstacles trigger graph editing and Dijkstra replanning in real time. Across eight tasks in four Habitat-Sim environments, the agent covered 190.44 m, replanning when required, and consistently stopped within 0.1 m of the goal while running on commodity hardware. An information-theoretic analysis over the Multi-Illumination dataset shows that LightGlue maximizes per-second information gain under lighting changes, motivating its selection. The modular design attains reliable navigation without metric SLAM or large-scale learning, and seamlessly accommodates future perception or policy upgrades. Full article
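The "LightGlue features and SVD give six-DoF relative pose" step is essentially a Kabsch alignment of matched, back-projected 3D points. A minimal sketch (without RANSAC or the policy on top), assuming `p_src` and `p_dst` are corresponding Nx3 point sets from RGB-D back-projection:

```python
import numpy as np

def relative_pose_svd(p_src, p_dst):
    """Rigid transform (R, t) aligning matched 3D points p_src -> p_dst via SVD."""
    c_src, c_dst = p_src.mean(axis=0), p_dst.mean(axis=0)
    H = (p_src - c_src).T @ (p_dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # fix a reflection if it occurs
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t
```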

24 pages, 11944 KB  
Article
YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences
by Nada Alzahrani, Ouiem Bchir and Mohamed Maher Ben Ismail
Sensors 2025, 25(10), 3013; https://doi.org/10.3390/s25103013 - 10 May 2025
Cited by 3 | Viewed by 2811
Abstract
Automated action recognition has become essential in the surveillance, healthcare, and multimedia retrieval industries owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that enhances the object detection capabilities of YOLOv8 to efficiently manage complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. The model depicts essential temporal dynamics without the computational overhead of continuous frame processing by leveraging the adaptive selection of three keyframes representing the beginning, middle, and end of the actions. Compared with state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), YOLO-Act exhibits superior performance with a mean average precision (mAP) of 73.28 in experiments conducted on the AVA dataset, resulting in a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly lower FLOPs, demonstrating its efficiency in computational resource utilization. The results highlight the advantages of incorporating precise tracking, effective spatial detection, and temporal consistency to address the challenges associated with video-based action detection. Full article
(This article belongs to the Section Sensing and Imaging)
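A minimal sketch of begin/middle/end keyframe selection and class fusion over one action tracklet; the simple averaging of per-keyframe class scores is an illustrative stand-in for the paper's fusion step.

```python
import numpy as np

def pick_keyframes(track_frame_ids):
    """Pick the beginning, middle, and end frames of one action tracklet."""
    ids = sorted(track_frame_ids)
    return [ids[0], ids[len(ids) // 2], ids[-1]]

def fuse_classes(per_keyframe_scores):
    """Average per-keyframe class probabilities into one action label.
    per_keyframe_scores: (3, n_classes) array of detector confidences."""
    mean_scores = np.asarray(per_keyframe_scores).mean(axis=0)
    return int(mean_scores.argmax()), float(mean_scores.max())
```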

20 pages, 3071 KB  
Article
A Keyframe Extraction Method for Assembly Line Operation Videos Based on Optical Flow Estimation and ORB Features
by Xiaoyu Gao, Hua Xiang, Tongxi Wang, Wei Zhan, Mengxue Xie, Lingxuan Zhang and Muyu Lin
Sensors 2025, 25(9), 2677; https://doi.org/10.3390/s25092677 - 23 Apr 2025
Viewed by 2242
Abstract
In modern manufacturing, cameras are widely used to record the full workflow of assembly line workers, enabling video-based operational analysis and management. However, these recordings are often excessively long, leading to high storage demands and inefficient processing. Existing keyframe extraction methods typically apply uniform strategies across all frames, which are ineffective in detecting subtle movements. To address this, we propose a keyframe extraction method tailored for assembly line videos, combining optical flow estimation with ORB-based visual features. Our approach adapts extraction strategies to actions with different motion amplitudes. Each video frame is first encoded into a feature vector using the ORB algorithm and a bag-of-visual-words model. Optical flow is then calculated using the DIS algorithm, allowing frames to be categorized by motion intensity. Adjacent frames within the same category are grouped, and the appropriate number of clusters, k, is determined based on the group’s characteristics. Keyframes are finally selected via k-means++ clustering within each group. The experimental results show that our method achieves a recall rate of 85.2%, with over 90% recall for actions involving minimal movement. Moreover, the method processes an average of 274 frames per second. These results highlight the method’s effectiveness in identifying subtle actions, reducing redundant content, and delivering high accuracy with efficient performance. Full article
(This article belongs to the Section Sensing and Imaging)
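A compact sketch of the main ingredients: ORB descriptors encoded as a bag-of-visual-words histogram, DIS optical flow as a motion-intensity signal, and k-means++ keyframe selection within one group. Here `vocabulary` is assumed to be a scikit-learn KMeans model already fitted on ORB descriptors, and the grouping of adjacent frames by motion intensity is omitted.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

orb = cv2.ORB_create(nfeatures=500)
dis = cv2.DISOpticalFlow_create()

def bovw_vector(gray, vocabulary):
    """Encode a frame as a normalized bag-of-visual-words histogram of ORB descriptors."""
    _, desc = orb.detectAndCompute(gray, None)
    hist = np.zeros(len(vocabulary.cluster_centers_), dtype=np.float32)
    if desc is not None:
        for w in vocabulary.predict(desc.astype(np.float32)):
            hist[w] += 1
    return hist / max(hist.sum(), 1.0)

def motion_intensity(prev_gray, gray):
    """Mean DIS optical-flow magnitude between consecutive frames."""
    flow = dis.calc(prev_gray, gray, None)
    return float(np.linalg.norm(flow, axis=2).mean())

def keyframes_for_group(frame_vectors, frame_ids, k):
    """k-means++ clustering within one motion-intensity group; the frame
    closest to each cluster centre is returned as a keyframe."""
    km = KMeans(n_clusters=min(k, len(frame_ids)), init="k-means++", n_init=10)
    labels = km.fit_predict(frame_vectors)
    keyframes = []
    for c in range(km.n_clusters):
        members = np.where(labels == c)[0]
        d = np.linalg.norm(frame_vectors[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(frame_ids[members[d.argmin()]])
    return sorted(keyframes)
```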

20 pages, 5975 KB  
Article
Fast Tongue Detection Based on Lightweight Model and Deep Feature Propagation
by Keju Chen, Yun Zhang, Li Zhong and Yongguo Liu
Electronics 2025, 14(7), 1457; https://doi.org/10.3390/electronics14071457 - 3 Apr 2025
Viewed by 1309
Abstract
While existing tongue detection methods have achieved good accuracy, the problems of low detection speed and excessive noise in the background area still exist. To address these problems, a fast tongue detection model based on a lightweight model and deep feature propagation (TD-DFP) is proposed. Firstly, a color channel is added to the RGB tongue image to introduce more prominent tongue features. To reduce the computational complexity, keyframes are selected through inter-frame differencing, while optical flow maps are used to achieve feature alignment between non-keyframes and keyframes. Secondly, a convolutional neural network with feature pyramid structures is designed to extract multi-scale features, and object detection heads based on depth-wise convolutions are adopted to achieve real-time tongue region detection. In addition, a knowledge distillation module is introduced to improve performance during the training phase. TD-DFP achieves 82.8% mean average precision (mAP) and 61.88 frames per second (FPS) on the tongue dataset. The experimental results indicate that TD-DFP delivers efficient, accurate, and real-time tongue detection. Full article
(This article belongs to the Special Issue Mechanism and Modeling of Graph Convolutional Networks)
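A minimal sketch of the two cheap operations described above: inter-frame differencing for keyframe selection, and optical-flow-based propagation of keyframe features to non-keyframes. Farneback flow, the difference threshold, and the (H, W, C) float32 feature layout are illustrative assumptions; the paper uses its own flow maps and network features.

```python
import cv2
import numpy as np

def is_keyframe(prev_gray, gray, diff_thr=12.0):
    """Inter-frame differencing: a frame becomes a keyframe when the mean
    absolute intensity difference from the previous keyframe is large."""
    return float(cv2.absdiff(gray, prev_gray).mean()) > diff_thr

def propagate_features(key_gray, gray, key_features):
    """Warp keyframe features to a non-keyframe using dense optical flow
    (from the current frame back to the keyframe), instead of re-running
    the backbone. key_features: (H, W, C) float32 feature map."""
    flow = cv2.calcOpticalFlowFarneback(gray, key_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # sample keyframe features at the locations corresponding to current pixels
    return cv2.remap(key_features, map_x, map_y, cv2.INTER_LINEAR)
```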

26 pages, 8883 KB  
Article
Enhancing Machine Learning Techniques in VSLAM for Robust Autonomous Unmanned Aerial Vehicle Navigation
by Hussam Rostum and József Vásárhelyi
Electronics 2025, 14(7), 1440; https://doi.org/10.3390/electronics14071440 - 2 Apr 2025
Cited by 1 | Viewed by 1603
Abstract
This study introduces a real-time visual SLAM system designed for small indoor environments. The system demonstrates resilience against significant motion clutter and supports wide-baseline loop closing, re-localization, and automatic initialization. Leveraging state-of-the-art algorithms, the approach presented in this article utilizes adapted Oriented FAST and Rotated BRIEF features for tracking, mapping, re-localization, and loop closing. In addition, the research uses an adaptive threshold to find putative feature matches that provide efficient map initialization and accurate tracking. The task is to process visual information from the camera of a DJI Tello drone to construct an indoor map and estimate the camera trajectory. In a 'survival of the fittest' style, the algorithms selectively retain points and keyframes for reconstruction. This yields robustness and a concise, traceable map that grows as scene content emerges, making lifelong operation possible. The results show an improved RMSE (3.280) for the adaptive ORB algorithm with the adaptive threshold, whereas the standard ORB algorithm failed to complete the mapping process. Full article
(This article belongs to the Special Issue Development and Advances in Autonomous Driving Technology)
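A minimal sketch of an adaptive threshold for putative ORB matches, where the Hamming-distance cutoff adapts to the median match distance of the current frame pair instead of being fixed. The 0.8 scale factor is an illustrative assumption, not the paper's exact rule.

```python
import cv2
import numpy as np

def adaptive_putative_matches(desc1, desc2, scale=0.8):
    """Cross-checked brute-force ORB matches, filtered by an adaptive cutoff
    of scale * median(distance), so the threshold tracks scene texture and
    image quality rather than using one global constant."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc1, desc2)
    if not matches:
        return []
    dists = np.array([m.distance for m in matches])
    threshold = scale * np.median(dists)
    return [m for m in matches if m.distance < threshold]
```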
