Search Results (192)

Search Parameters:
Keywords = image and video reconstruction

20 pages, 4569 KiB  
Article
Lightweight Vision Transformer for Frame-Level Ergonomic Posture Classification in Industrial Workflows
by Luca Cruciata, Salvatore Contino, Marianna Ciccarelli, Roberto Pirrone, Leonardo Mostarda, Alessandra Papetti and Marco Piangerelli
Sensors 2025, 25(15), 4750; https://doi.org/10.3390/s25154750 - 1 Aug 2025
Abstract
Work-related musculoskeletal disorders (WMSDs) are a leading concern in industrial ergonomics, often stemming from sustained non-neutral postures and repetitive tasks. This paper presents a vision-based framework for real-time, frame-level ergonomic risk classification using a lightweight Vision Transformer (ViT). The proposed system operates directly on raw RGB images without requiring skeleton reconstruction, joint angle estimation, or image segmentation. A single ViT model simultaneously classifies eight anatomical regions, enabling efficient multi-label posture assessment. Training is supervised using a multimodal dataset acquired from synchronized RGB video and full-body inertial motion capture, with ergonomic risk labels derived from RULA scores computed on joint kinematics. The system is validated on realistic, simulated industrial tasks that include common challenges such as occlusion and posture variability. Experimental results show that the ViT model achieves state-of-the-art performance, with F1-scores exceeding 0.99 and AUC values above 0.996 across all regions. Compared to a previous CNN-based system, the proposed model improves classification accuracy and generalizability while reducing complexity and enabling real-time inference on edge devices. These findings demonstrate the model’s potential for unobtrusive, scalable ergonomic risk monitoring in real-world manufacturing environments. Full article
(This article belongs to the Special Issue Secure and Decentralised IoT Systems)
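The listing does not include code, but the core idea (a single lightweight ViT backbone with one multi-label head covering eight anatomical regions) can be illustrated with a minimal PyTorch sketch; the module sizes, patch size, and head layout below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyViTPostureClassifier(nn.Module):
    """Minimal ViT-style multi-label classifier: one backbone, eight anatomical regions."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=6, heads=3, num_regions=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_regions)   # one logit per region (multi-label)

    def forward(self, x):                          # x: (B, 3, H, W) raw RGB frame
        tok = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tok], dim=1) + self.pos_embed)
        return self.head(z[:, 0])                  # (B, 8) per-region risk logits

frame = torch.randn(2, 3, 224, 224)
risk = torch.sigmoid(TinyViTPostureClassifier()(frame))   # per-region risk scores in [0, 1]
print(risk.shape)  # torch.Size([2, 8])
```

Training such a head would typically use a per-region binary loss (for example BCEWithLogitsLoss), matching the multi-label framing described in the abstract.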

20 pages, 2776 KiB  
Article
Automatic 3D Reconstruction: Mesh Extraction Based on Gaussian Splatting from Romanesque–Mudéjar Churches
by Nelson Montas-Laracuente, Emilio Delgado Martos, Carlos Pesqueira-Calvo, Giovanni Intra Sidola, Ana Maitín, Alberto Nogales and Álvaro José García-Tejedor
Appl. Sci. 2025, 15(15), 8379; https://doi.org/10.3390/app15158379 - 28 Jul 2025
Viewed by 156
Abstract
This research introduces an automated 3D virtual reconstruction system tailored for architectural heritage (AH) applications, contributing to the ongoing paradigm shift from traditional CAD-based workflows to artificial intelligence-driven methodologies. It reviews recent advancements in machine learning and deep learning—particularly neural radiance fields (NeRFs) and their successor, Gaussian splatting (GS)—as state-of-the-art techniques in the domain. The study advocates for replacing point cloud data in heritage building information modeling workflows with image-based inputs, proposing a novel “photo-to-BIM” pipeline. A proof-of-concept system is presented, capable of processing photographs or video footage of ancient ruins—specifically, Romanesque–Mudéjar churches—to automatically generate 3D mesh reconstructions. The system’s performance is assessed using both objective metrics and subjective evaluations of mesh quality. The results confirm the feasibility and promise of image-based reconstruction as a viable alternative to conventional methods. The study successfully developed a system for automated 3D mesh reconstruction of AH from images. It applied GS and Mip-Splatting for NeRFs, which proved superior in noise reduction for subsequent mesh extraction, and used surface-aligned Gaussian splatting for efficient 3D mesh reconstruction. This photo-to-mesh pipeline signifies a viable step towards HBIM. Full article
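As a rough illustration of the final meshing stage only, the sketch below assumes the optimized Gaussian centers are available as a point set and uses Open3D's Poisson reconstruction as a stand-in for the surface-aligned GS meshing described above; the file names and parameters are hypothetical.

```python
import numpy as np
import open3d as o3d

# Assumed input: centers of the optimized 3D Gaussians exported as an (N, 3) array.
centers = np.load("gaussian_centers.npy")          # hypothetical file name

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(centers)
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))

# Poisson surface reconstruction as a stand-in for surface-aligned GS meshing.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)

# Trim poorly supported (low-density) vertices and save the result.
d = np.asarray(densities)
mesh.remove_vertices_by_mask(d < np.quantile(d, 0.05))
o3d.io.write_triangle_mesh("church_mesh.ply", mesh)
```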

17 pages, 7786 KiB  
Article
Video Coding Based on Ladder Subband Recovery and ResGroup Module
by Libo Wei, Aolin Zhang, Lei Liu, Jun Wang and Shuai Wang
Entropy 2025, 27(7), 734; https://doi.org/10.3390/e27070734 - 8 Jul 2025
Viewed by 323
Abstract
With the rapid development of video encoding technology in the field of computer vision, the demand for tasks such as video frame reconstruction, denoising, and super-resolution has been continuously increasing. However, traditional video encoding methods typically focus on extracting spatial or temporal domain information, often facing challenges of insufficient accuracy and information loss when reconstructing high-frequency details, edges, and textures of images. To address this issue, this paper proposes an innovative LadderConv framework, which combines discrete wavelet transform (DWT) with spatial and channel attention mechanisms. By progressively recovering wavelet subbands, it effectively enhances the video frame encoding quality. Specifically, the LadderConv framework adopts a stepwise recovery approach for wavelet subbands, first processing high-frequency detail subbands with relatively less information, then enhancing the interaction between these subbands, and ultimately synthesizing a high-quality reconstructed image through inverse wavelet transform. Moreover, the framework introduces spatial and channel attention mechanisms, which further strengthen the focus on key regions and channel features, leading to notable improvements in detail restoration and image reconstruction accuracy. To optimize the performance of the LadderConv framework, particularly in detail recovery and high-frequency information extraction tasks, this paper designs an innovative ResGroup module. By using multi-layer convolution operations along with feature map compression and recovery, the ResGroup module enhances the network’s expressive capability and effectively reduces computational complexity. The ResGroup module captures multi-level features from low level to high level and retains rich feature information through residual connections, thus improving the overall reconstruction performance of the model. In experiments, the combination of the LadderConv framework and the ResGroup module demonstrates superior performance in video frame reconstruction tasks, particularly in recovering high-frequency information, image clarity, and detail representation. Full article
(This article belongs to the Special Issue Rethinking Representation Learning in the Age of Large Models)
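A minimal sketch of the subband-ladder idea using PyWavelets: decompose a frame into DWT subbands, refine the high-frequency details first, then synthesize with the inverse transform. The refine step below is a placeholder for the learned LadderConv/ResGroup processing, not the paper's network.

```python
import numpy as np
import pywt

def refine(subband):
    """Placeholder for the learned per-subband enhancement (LadderConv + attention)."""
    return subband  # identity here; a CNN would process each subband in the real model

def ladder_reconstruct(frame):
    # Single-level 2D DWT: approximation cA plus detail subbands (cH, cV, cD).
    cA, (cH, cV, cD) = pywt.dwt2(frame, "haar")
    # Ladder order: start from the sparse high-frequency details, then the approximation.
    cD, cV, cH = refine(cD), refine(cV), refine(cH)
    cA = refine(cA)
    # Inverse DWT synthesizes the enhanced frame from the recovered subbands.
    return pywt.idwt2((cA, (cH, cV, cD)), "haar")

frame = np.random.rand(128, 128).astype(np.float32)
out = ladder_reconstruct(frame)
print(out.shape, np.allclose(out, frame, atol=1e-5))  # (128, 128) True (identity refine)
```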

20 pages, 2149 KiB  
Article
Accelerating Facial Image Super-Resolution via Sparse Momentum and Encoder State Reuse
by Kerang Cao, Na Bao, Shuai Zheng, Ye Liu and Xing Wang
Electronics 2025, 14(13), 2616; https://doi.org/10.3390/electronics14132616 - 28 Jun 2025
Viewed by 410
Abstract
Single image super-resolution (SISR) aims to reconstruct high-quality images from low-resolution inputs, a persistent challenge in computer vision with critical applications in medical imaging, satellite imagery, and video enhancement. Traditional diffusion model-based (DM-based) methods, while effective in restoring fine details, suffer from computational inefficiency due to their iterative denoising process. To address this, we introduce the Sparse Momentum-based Faster Diffusion Model (SMFDM), designed for rapid and high-fidelity super-resolution. SMFDM integrates a novel encoder state reuse mechanism that selectively omits non-critical time steps during the denoising phase, significantly reducing computational redundancy. Additionally, the model employs a sparse momentum mechanism, enabling robust representation capabilities while utilizing only a fraction of the original model weights. Experiments demonstrate that SMFDM achieves an impressive 71.04% acceleration in the diffusion process, requiring only 15% of the original weights, while maintaining high-quality outputs with effective preservation of image details and textures. Our work highlights the potential of combining sparse learning and efficient sampling strategies to enhance the practical applicability of diffusion models for super-resolution tasks. Full article
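The encoder-state-reuse idea can be sketched independently of the full model: during iterative denoising, the encoder half is re-run only on a sparse subset of time steps and its cached features are reused in between. The toy modules and update rule below are placeholders, not SMFDM itself.

```python
import torch
import torch.nn as nn

# Stand-in encoder/decoder halves of a denoiser; the real model is a diffusion U-Net.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(32, 3, 3, padding=1)

T = 50
refresh_steps = set(range(0, T, 5))          # re-run the encoder on only 1 of every 5 steps
x = torch.randn(1, 3, 64, 64)                # start from noise
cached_feats = None

with torch.no_grad():
    for t in reversed(range(T)):
        if t in refresh_steps or cached_feats is None:
            cached_feats = encoder(x)        # expensive path: recompute encoder states
        eps_hat = decoder(cached_feats)      # cheap path: reuse the cached encoder states
        x = x - eps_hat / T                  # placeholder update, not a real DDPM/DDIM step
print(x.shape)                               # torch.Size([1, 3, 64, 64])
```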

25 pages, 2892 KiB  
Article
Focal Correlation and Event-Based Focal Visual Content Text Attention for Past Event Search
by Pranita P. Deshmukh and S. Poonkuntran
Computers 2025, 14(7), 255; https://doi.org/10.3390/computers14070255 - 28 Jun 2025
Viewed by 304
Abstract
Every minute, vast amounts of video and image data are uploaded worldwide to the internet and social media platforms, creating a rich visual archive of human experiences—from weddings and family gatherings to significant historical events such as war crimes and humanitarian crises. When properly analyzed, this multimodal data holds immense potential for reconstructing important events and verifying information. However, challenges arise when images and videos lack complete annotations, making manual examination inefficient and time-consuming. To address this, we propose a novel event-based focal visual content text attention (EFVCTA) framework for automated past event retrieval using visual question answering (VQA) techniques. Our approach integrates a Long Short-Term Memory (LSTM) model with convolutional non-linearity and an adaptive attention mechanism to efficiently identify and retrieve relevant visual evidence alongside precise answers. The model is designed with robust weight initialization, regularization, and optimization strategies and is evaluated on the Common Objects in Context (COCO) dataset. The results demonstrate that EFVCTA achieves the highest performance across all metrics (88.7% accuracy, 86.5% F1-score, 84.9% mAP), outperforming state-of-the-art baselines. The EFVCTA framework demonstrates promising results for retrieving information about past events captured in images and videos and can be effectively applied to scenarios such as documenting training programs, workshops, conferences, and social gatherings in academic institutions. Full article
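A toy sketch of the attention core described above (an LSTM question encoding attending over per-region visual features); all dimensions and layer choices are assumptions rather than the EFVCTA architecture.

```python
import torch
import torch.nn as nn

class TextVisualAttentionVQA(nn.Module):
    """Toy VQA head: LSTM question encoding + soft attention over image region features."""
    def __init__(self, vocab=5000, emb=300, hidden=512, feat_dim=2048, n_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.att = nn.Linear(hidden + feat_dim, 1)
        self.classifier = nn.Linear(hidden + feat_dim, n_answers)

    def forward(self, question_ids, region_feats):
        # question_ids: (B, L) token ids; region_feats: (B, R, feat_dim) per-region features
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                                # (B, hidden) question summary
        q_tiled = q.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        scores = self.att(torch.cat([q_tiled, region_feats], dim=-1))   # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)                     # focal attention weights
        visual = (alpha * region_feats).sum(dim=1)               # attended visual evidence
        return self.classifier(torch.cat([q, visual], dim=-1)), alpha.squeeze(-1)

model = TextVisualAttentionVQA()
logits, attn = model(torch.randint(0, 5000, (2, 12)), torch.randn(2, 36, 2048))
print(logits.shape, attn.shape)  # torch.Size([2, 1000]) torch.Size([2, 36])
```

The attention weights double as the "visual evidence" pointer: the highest-weighted regions indicate which parts of the image supported the answer.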

21 pages, 16441 KiB  
Article
Video Compression Using Hybrid Neural Representation with High-Frequency Spectrum Analysis
by Jian Hua Zhao, Xue Jun Li and Peter Han Joo Chong
Electronics 2025, 14(13), 2574; https://doi.org/10.3390/electronics14132574 - 26 Jun 2025
Viewed by 376
Abstract
Recent advancements in implicit neural representations have shown substantial promise in various domains, particularly in video compression and reconstruction, due to their rapid decoding speed and high adaptability. Building upon the state-of-the-art Neural Representations for Videos, the Expedite Neural Representation for Videos and Hybrid Neural Representation for Videos primarily enhance performance by optimizing and expanding the embedded input of the Neural Representations for Videos network. However, the core module in the Neural Representations for Videos network, responsible for video reconstruction, has garnered comparatively less attention. This paper introduces a novel High-frequency Spectrum Hybrid Network, which leverages high-frequency information from the frequency domain to generate detailed image reconstructions. The central component of this approach is the High-frequency Spectrum Hybrid Network block, an innovative extension of the module in the Neural Representations for Videos network, which integrates the High-frequency Spectrum Convolution Module into the original framework. The high-frequency spectrum convolution module emphasizes the extraction of high-frequency features through a frequency domain attention mechanism, significantly enhancing both performance and the recovery of local details in video images. As an enhanced module in the Neural Representations for Videos network, it demonstrates exceptional adaptability and versatility, enabling seamless integration into a wide range of existing Neural Representations for Videos network architectures without requiring substantial modifications to achieve improved results. In addition, this work introduces the High-frequency Spectrum loss function and the Multi-scale Feature Reuse Path to further mitigate the issue of blurriness caused by the loss of high-frequency details during image generation. Experimental evaluations confirm that the proposed High-frequency Spectrum Hybrid Network surpasses the performance of the Neural Representations for Videos, the Expedite Neural Representation for Videos, and the Hybrid Neural Representation for Videos, achieving improvements of +5.75 dB, +4.53 dB, and +1.05 dB in peak signal-to-noise ratio, respectively. Full article
(This article belongs to the Special Issue Image Processing Based on Convolution Neural Network: 2nd Edition)
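A minimal PyTorch sketch of a frequency-domain high-frequency emphasis block, in the spirit of the high-frequency spectrum convolution module described above; the cutoff, gain parameterization, and naming are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class HighFreqSpectrumBlock(nn.Module):
    """Toy frequency-domain block: learnable emphasis on high-frequency coefficients."""
    def __init__(self, channels, cutoff=0.1):
        super().__init__()
        self.cutoff = cutoff                                   # normalized frequency threshold
        self.gain = nn.Parameter(torch.zeros(channels, 1, 1))  # learned high-frequency gain

    def forward(self, x):                                      # x: (B, C, H, W) feature map
        spec = torch.fft.rfft2(x, norm="ortho")                # complex half-spectrum
        fy = torch.fft.fftfreq(x.shape[-2], device=x.device).abs()
        fx = torch.fft.rfftfreq(x.shape[-1], device=x.device)
        high = ((fy[:, None] ** 2 + fx[None, :] ** 2).sqrt() > self.cutoff).float()
        boosted = spec * (1.0 + self.gain * high)              # amplify only the high band
        return torch.fft.irfft2(boosted, s=x.shape[-2:], norm="ortho")

feat = torch.randn(2, 64, 32, 32)
print(HighFreqSpectrumBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```

Initializing the gain at zero makes the block an identity at the start of training, so it can be dropped into an existing reconstruction network without disturbing its initial behavior.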

32 pages, 4311 KiB  
Article
DRGNet: Enhanced VVC Reconstructed Frames Using Dual-Path Residual Gating for High-Resolution Video
by Zezhen Gai, Tanni Das and Kiho Choi
Sensors 2025, 25(12), 3744; https://doi.org/10.3390/s25123744 - 15 Jun 2025
Viewed by 463
Abstract
In recent years, with the rapid development of the Internet and mobile devices, the high-resolution video industry has ushered in a booming golden era, making video content the primary driver of Internet traffic. This trend has spurred continuous innovation in efficient video coding technologies, such as Advanced Video Coding/H.264 (AVC), High Efficiency Video Coding/H.265 (HEVC), and Versatile Video Coding/H.266 (VVC), which significantly improve compression efficiency while maintaining high video quality. However, during the encoding process, compression artifacts and the loss of visual details remain unavoidable challenges, particularly in high-resolution video processing, where the massive amount of image data tends to introduce more artifacts and noise, ultimately affecting the user’s viewing experience. Therefore, effectively reducing artifacts, removing noise, and minimizing detail loss have become critical issues in enhancing video quality. To address these challenges, this paper proposes a post-processing method based on a Convolutional Neural Network (CNN) that improves the quality of VVC-reconstructed frames through deep feature extraction and fusion. The proposed method is built upon a high-resolution dual-path residual gating system, which integrates deep features from different convolutional layers and introduces convolutional blocks equipped with gating mechanisms. By ingeniously combining gating operations with residual connections, the proposed approach ensures smooth gradient flow while enhancing feature selection capabilities. It selectively preserves critical information while effectively removing artifacts. Furthermore, the introduction of residual connections reinforces the retention of original details, achieving high-quality image restoration. Under the same bitrate conditions, the proposed method significantly improves the Peak Signal-to-Noise Ratio (PSNR) value, thereby optimizing video coding quality and providing users with a clearer and more detailed visual experience. Extensive experimental results demonstrate that the proposed method achieves outstanding performance across Random Access (RA), Low Delay B-frame (LDB), and All Intra (AI) configurations, achieving BD-Rate improvements of 6.1%, 7.36%, and 7.1% for the luma component, respectively, due to the remarkable PSNR enhancement. Full article
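A minimal PyTorch sketch of the gating-plus-residual idea for post-filtering decoded frames; block counts, channel widths, and names are assumptions, not the DRGNet implementation.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Toy dual-path block: a feature path modulated by a sigmoid gate, plus a residual skip."""
    def __init__(self, channels=64):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.feature(x) * torch.sigmoid(self.gate(x))  # keep details, damp artifacts

class SimplePostFilter(nn.Module):
    """Toy VVC post-filter: shallow feature extraction, stacked gated blocks, residual output."""
    def __init__(self, blocks=4, channels=64):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(*[GatedResidualBlock(channels) for _ in range(blocks)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, decoded):                      # decoded: VVC-reconstructed frame
        return decoded + self.tail(self.body(self.head(decoded)))  # predict a residual correction

frame = torch.rand(1, 3, 256, 256)
print(SimplePostFilter()(frame).shape)  # torch.Size([1, 3, 256, 256])
```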

18 pages, 15092 KiB  
Article
Ultra-Low Bitrate Predictive Portrait Video Compression with Diffusion Models
by Xinyi Chen, Weimin Lei, Wei Zhang, Yanwen Wang and Mingxin Liu
Symmetry 2025, 17(6), 913; https://doi.org/10.3390/sym17060913 - 10 Jun 2025
Viewed by 694
Abstract
Deep neural video compression codecs have shown great promise in recent years. However, there are still considerable challenges for ultra-low bitrate video coding. Inspired by recent attempts to use diffusion models for image and video compression, we leverage diffusion models for ultra-low bitrate portrait video compression. In this paper, we propose a predictive portrait video compression method that leverages the temporal prediction capabilities of diffusion models. Specifically, we develop a temporal diffusion predictor based on a conditional latent diffusion model, with the predicted results serving as decoded frames. We symmetrically integrate a temporal diffusion predictor at the encoding and decoding sides. When the perceptual quality of the predicted results at the encoding end falls below a predefined threshold, a new frame sequence is employed for prediction, while the predictor at the decoding side directly generates predicted frames as the reconstruction based on the evaluation results. This symmetry ensures that the prediction frames generated at the decoding end are consistent with those at the encoding end. We also design an adaptive coding strategy that incorporates frame quality assessment and adaptive keyframe control. To ensure consistent quality of subsequent predicted frames and achieve high perceptual reconstruction quality, this strategy dynamically evaluates the visual quality of the predicted results during encoding, retains the predicted frames that meet the quality threshold, and adaptively adjusts the length of the keyframe sequence based on motion complexity. The experimental results demonstrate that, compared with traditional video codecs and other popular methods, the proposed scheme provides superior compression performance at ultra-low bitrates while maintaining competitiveness in visual effects, achieving more than 24% bitrate savings compared with VVC in terms of perceptual distortion. Full article
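The adaptive keyframe control can be sketched as a simple encoder-side loop: keep diffusion-predicted frames while their quality stays above a threshold, otherwise code a new keyframe. In the sketch below, PSNR stands in for the paper's perceptual quality measure and the predictor is a placeholder; thresholds and run limits are assumptions.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def encode_sequence(frames, predict_next, quality_thresh=30.0, max_run=16):
    """Toy encoder loop: keep predicted frames while quality holds, else send a new keyframe."""
    bitstream, run = [("key", frames[0])], 0
    for t in range(1, len(frames)):
        predicted = predict_next(bitstream[-1][1])          # diffusion predictor stand-in
        if psnr(predicted, frames[t]) >= quality_thresh and run < max_run:
            bitstream.append(("predicted", predicted))      # no coded frame transmitted
            run += 1
        else:
            bitstream.append(("key", frames[t]))            # reset with a coded keyframe
            run = 0
    return bitstream

frames = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(8)]
stream = encode_sequence(frames, predict_next=lambda prev: prev)   # trivial "copy" predictor
print([kind for kind, _ in stream])
```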

23 pages, 1894 KiB  
Article
ViViT-Prob: A Radar Echo Extrapolation Model Based on Video Vision Transformer and Spatiotemporal Sparse Attention
by Yunan Qiu, Bingjian Lu, Wenrui Xiong, Zhenyu Lu, Le Sun and Yingjie Cui
Remote Sens. 2025, 17(12), 1966; https://doi.org/10.3390/rs17121966 - 6 Jun 2025
Viewed by 489
Abstract
Weather radar, as a crucial component of remote sensing data, plays a vital role in convective weather forecasting through radar echo extrapolation techniques. To address the limitations of existing deep learning methods in radar echo extrapolation, this paper proposes a radar echo extrapolation model based on a video vision transformer and spatiotemporal sparse attention (ViViT-Prob). The model takes historical sequences as input and initially maps them into a fixed-dimensional vector space through 3D convolutional patch encoding. Subsequently, a multi-head spatiotemporal fusion module with sparse attention encodes these vectors, effectively capturing spatiotemporal relationships between different regions in the sequences. The sparse constraint enables better utilization of data structural information, enhanced focus on critical regions, and reduced computational complexity. Finally, a parallel output decoder generates all time step predictions simultaneously, which are then mapped back to the prediction space through a deconvolution module to reconstruct high-resolution images. Our experimental results on the Moving MNIST and real radar echo datasets demonstrate that the proposed model achieves superior performance in spatiotemporal sequence prediction and improves prediction accuracy while maintaining structural consistency in radar echo extrapolation tasks, providing an effective solution for short-term precipitation forecasting. Full article
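A minimal sketch of the tokenization/decoding shell: 3D-convolutional patch encoding, a transformer encoder, and a parallel deconvolution decoder. The sparse spatiotemporal attention itself is replaced by a standard dense encoder here, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyVideoTransformer(nn.Module):
    """Toy extrapolation shell: 3D-conv patch tokens, transformer encoder, parallel deconv decode."""
    def __init__(self, dim=128, patch=(2, 8, 8)):
        super().__init__()
        self.embed = nn.Conv3d(1, dim, kernel_size=patch, stride=patch)   # spatiotemporal patches
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.ConvTranspose3d(dim, 1, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, 1, T, H, W) past radar frames
        tok = self.embed(x)                     # (B, dim, t, h, w) patch tokens
        b, d, t, h, w = tok.shape
        z = self.encoder(tok.flatten(2).transpose(1, 2))   # dense attention stands in for sparse
        z = z.transpose(1, 2).reshape(b, d, t, h, w)
        return self.decode(z)                   # prediction volume produced in one parallel pass

x = torch.rand(2, 1, 10, 64, 64)
print(TinyVideoTransformer()(x).shape)  # torch.Size([2, 1, 10, 64, 64])
```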

18 pages, 8414 KiB  
Article
Fish Body Pattern Style Transfer Based on Wavelet Transformation and Gated Attention
by Hongchun Yuan and Yixuan Wang
Appl. Sci. 2025, 15(9), 5150; https://doi.org/10.3390/app15095150 - 6 May 2025
Viewed by 416
Abstract
To address temporal jitter, low segmentation accuracy, and the lack of high-precision transformations for specific object classes in video generation, we propose the fish body pattern sync-style network (FSSNet) for ornamental fish videos. This network innovatively integrates dynamic texture transfer with instance segmentation, adopting a two-stage processing architecture. First, high-precision video frame segmentation is performed using Mask2Former to eliminate background elements that do not participate in the style transfer process. Then, we introduce the wavelet-gated styling network, which reconstructs a multi-scale feature space via discrete wavelet transform, enhancing the granularity of multi-scale style features during the image generation phase. Additionally, we embed a convolutional block attention module within the residual modules, not only improving the realism of the generated images but also effectively reducing boundary artifacts in foreground objects. Furthermore, to mitigate the frame-to-frame jitter commonly observed in generated videos, we incorporate a contrastive coherence preserving loss into the training process of the style transfer network. This enhances the perceptual loss function, thereby preventing video flickering and ensuring improved temporal consistency. In real-world aquarium scenes, compared to state-of-the-art methods, FSSNet effectively preserves localized texture details in generated videos and achieves competitive SSIM and PSNR scores. Moreover, temporal consistency is significantly improved. The flow warping error index decreases to 1.412. We chose FNST (fast neural style transfer) as our baseline model and demonstrate improvements in both model parameter count and runtime efficiency. According to user preferences, 43.75% of participants preferred the dynamic effects generated by this method. Full article
(This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision)
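The temporal-consistency figure quoted above (flow warping error) can be computed roughly as below with OpenCV's Farnebäck optical flow; this sketches the evaluation metric, not the style-transfer network, and the flow parameters are common defaults rather than the paper's settings.

```python
import cv2
import numpy as np

def flow_warping_error(prev_frame, cur_frame):
    """Warp the previous stylized frame to the current one with optical flow and measure the gap."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    # Backward flow: for each pixel in the current frame, where it came from in the previous one.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = cur_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(prev_frame, map_x, map_y, cv2.INTER_LINEAR)
    return float(np.mean(np.abs(cur_frame.astype(np.float32) - warped_prev.astype(np.float32))))

f0 = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
f1 = np.roll(f0, 2, axis=1)                      # fake small horizontal motion
print(flow_warping_error(f0, f1))
```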

19 pages, 2647 KiB  
Article
FDI-VSR: Video Super-Resolution Through Frequency-Domain Integration and Dynamic Offset Estimation
by Donghun Lim and Janghoon Choi
Sensors 2025, 25(8), 2402; https://doi.org/10.3390/s25082402 - 10 Apr 2025
Cited by 1 | Viewed by 873
Abstract
The increasing adoption of high-resolution imaging sensors across various fields has led to a growing demand for techniques to enhance video quality. Video super-resolution (VSR) addresses this need by reconstructing high-resolution videos from lower-resolution inputs; however, directly applying single-image super-resolution (SISR) methods to video sequences neglects temporal information, resulting in inconsistent and unnatural outputs. In this paper, we propose FDI-VSR, a novel framework that integrates spatiotemporal dynamics and frequency-domain analysis into conventional SISR models without extensive modifications. We introduce two key modules: the Spatiotemporal Feature Extraction Module (STFEM), which employs dynamic offset estimation, spatial alignment, and multi-stage temporal aggregation using residual channel attention blocks (RCABs); and the Frequency–Spatial Integration Module (FSIM), which transforms deep features into the frequency domain to effectively capture global context beyond the limited receptive field of standard convolutions. Extensive experiments on the Vid4, SPMCs, REDS4, and UDM10 benchmarks, supported by detailed ablation studies, demonstrate that FDI-VSR not only surpasses conventional VSR methods but also achieves competitive results compared to recent state-of-the-art methods, with improvements of up to 0.82 dB in PSNR on the SPMCs benchmark and notable reductions in visual artifacts, all while maintaining lower computational complexity and faster inference. Full article
(This article belongs to the Section Sensing and Imaging)
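A minimal sketch of the dynamic-offset alignment idea using torchvision's deformable convolution; the module name, channel sizes, and offset head are assumptions, not the STFEM implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DynamicOffsetAlign(nn.Module):
    """Toy alignment: predict per-pixel offsets from a neighbor/reference pair, then deform-sample."""
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.k = k
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)  # (dx, dy) per tap
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)

    def forward(self, neighbor_feat, ref_feat):
        offsets = self.offset_pred(torch.cat([neighbor_feat, ref_feat], dim=1))
        return deform_conv2d(neighbor_feat, offsets, self.weight, padding=self.k // 2)

ref, nbr = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
aligned = DynamicOffsetAlign()(nbr, ref)
print(aligned.shape)  # torch.Size([1, 64, 32, 32]) neighbor features aligned to the reference
```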

29 pages, 8325 KiB  
Article
Insights into Mosquito Behavior: Employing Visual Technology to Analyze Flight Trajectories and Patterns
by Ning Zhao, Lifeng Wang and Ke Wang
Electronics 2025, 14(7), 1333; https://doi.org/10.3390/electronics14071333 - 27 Mar 2025
Cited by 1 | Viewed by 543
Abstract
Mosquitoes, as vectors of numerous serious infectious diseases, require rigorous behavior monitoring for effective disease prevention and control. Simultaneously, precise surveillance of flying insect behavior is also crucial in agricultural pest management. This study proposes a three-dimensional trajectory reconstruction method for mosquito behavior analysis based on video data. By employing multiple synchronized cameras to capture mosquito flight images, using background subtraction to extract moving targets, applying Kalman filtering to predict target states, and integrating the Hungarian algorithm for multi-target data association, the system can automatically reconstruct three-dimensional mosquito flight trajectories. Experimental results demonstrate that this approach achieves high-precision flight path reconstruction, with a detection accuracy exceeding 95%, an F1-score of 0.93, and fast processing speeds that enable real-time tracking. The mean error of three-dimensional trajectory reconstruction is only 10 ± 4 mm, offering significant improvements in detection accuracy, tracking robustness, and real-time performance over traditional two-dimensional methods. These findings provide technological support for optimizing vector control strategies and enhancing precision pest control and can be further extended to ecological monitoring and agricultural pest management, thus bearing substantial significance for both public health and agriculture. Full article
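The track-level machinery (constant-velocity Kalman prediction plus Hungarian data association) is standard and can be sketched as follows; the noise parameters and frame rate are assumptions, and the 3-D detections are placeholders for the triangulated multi-camera measurements.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class ConstantVelocityKF:
    """Minimal 3-D constant-velocity Kalman filter for one mosquito track."""
    def __init__(self, xyz, dt=1 / 120, q=1e-3, r=4e-6):
        self.x = np.hstack([xyz, np.zeros(3)])              # state: position + velocity
        self.P = np.eye(6) * 1e-2
        self.F = np.eye(6)
        self.F[:3, 3:] = np.eye(3) * dt                     # position advances by velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])   # we only observe position
        self.Q, self.R = np.eye(6) * q, np.eye(3) * r

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P

def associate(tracks, detections):
    """Hungarian assignment between predicted track positions and new 3-D detections."""
    preds = np.array([t.predict() for t in tracks])
    cost = np.linalg.norm(preds[:, None, :] - detections[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    for r, c in zip(rows, cols):
        tracks[r].update(detections[c])
    return list(zip(rows, cols))

tracks = [ConstantVelocityKF(np.array([0.0, 0.0, 1.0])), ConstantVelocityKF(np.array([0.2, 0.1, 1.1]))]
detections = np.array([[0.21, 0.1, 1.11], [0.01, 0.0, 1.0]])
print(associate(tracks, detections))   # e.g. [(0, 1), (1, 0)]
```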

19 pages, 26378 KiB  
Article
2D to 3D Human Skeleton Estimation Based on the Brown Camera Distortion Model and Constrained Optimization
by Lan Ma and Hua Huo
Electronics 2025, 14(5), 960; https://doi.org/10.3390/electronics14050960 - 27 Feb 2025
Viewed by 1378
Abstract
In the rapidly evolving field of computer vision and machine learning, 3D skeleton estimation is critical for applications such as motion analysis and human–computer interaction. While stereo cameras are commonly used to acquire 3D skeletal data, monocular RGB systems attract attention due to benefits including cost-effectiveness and simple deployment. However, persistent challenges remain in accurately inferring depth from 2D images and reconstructing 3D structures using monocular approaches. Current 2D to 3D skeleton estimation methods rely too heavily on deep training over datasets, neglecting the intrinsic structure of the human body and the principles of camera imaging. To address this, this paper introduces an innovative 2D to 3D gait skeleton estimation method that leverages the Brown camera distortion model and constrained optimization. Utilizing the Azure Kinect depth camera for capturing gait video, the Azure Kinect Body Tracking SDK was employed to effectively extract 2D and 3D joint positions. The camera’s distortion properties were analyzed using the Brown camera distortion model, which is suitable for this scenario, and iterative methods were applied to compensate for the distortion of the 2D skeleton joints. By integrating the geometric constraints of the human skeleton, an optimization algorithm was applied to achieve precise 3D joint estimations. Finally, the framework was validated through comparisons between the estimated 3D joint coordinates and corresponding measurements captured by depth sensors. Experimental evaluations confirmed that this training-free approach achieved superior precision and stability compared to conventional methods. Full article
(This article belongs to the Special Issue 3D Computer Vision and 3D Reconstruction)
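The iterative compensation of Brown-model distortion for 2D joints can be sketched with a standard fixed-point inversion; the intrinsics and distortion coefficients below are illustrative, not the paper's calibration.

```python
import numpy as np

def undistort_points(pts, K, dist, iters=10):
    """Iteratively invert the Brown distortion model (k1, k2, p1, p2, k3) for 2-D joint pixels."""
    k1, k2, p1, p2, k3 = dist
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (pts[:, 0] - cx) / fx                     # distorted normalized coordinates
    y = (pts[:, 1] - cy) / fy
    xu, yu = x.copy(), y.copy()                   # initial guess: undistorted == distorted
    for _ in range(iters):
        r2 = xu ** 2 + yu ** 2
        radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
        dx = 2 * p1 * xu * yu + p2 * (r2 + 2 * xu ** 2)
        dy = p1 * (r2 + 2 * yu ** 2) + 2 * p2 * xu * yu
        xu = (x - dx) / radial                    # fixed-point update toward the undistorted point
        yu = (y - dy) / radial
    return np.stack([xu * fx + cx, yu * fy + cy], axis=1)

K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
dist = np.array([-0.25, 0.07, 0.001, -0.0005, 0.0])
joints_2d = np.array([[350.0, 260.0], [410.0, 300.0]])       # hypothetical detected joints
print(undistort_points(joints_2d, K, dist))
```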

19 pages, 9180 KiB  
Article
Accurate Real-Time Live Face Detection Using Snapshot Spectral Imaging Method
by Zhihai Wang, Shuai Wang, Weixing Yu, Bo Gao, Chenxi Li and Tianxin Wang
Sensors 2025, 25(3), 952; https://doi.org/10.3390/s25030952 - 5 Feb 2025
Cited by 3 | Viewed by 1483
Abstract
Traditional facial recognition is realized by algorithms based on 2D or 3D digital images; it is well developed and has found wide application in identity verification. In this work, we propose a novel live face detection (LFD) method by utilizing snapshot spectral imaging technology, which takes advantage of the distinctive reflected spectra from human faces. By employing a computational spectral reconstruction algorithm based on Tikhonov regularization, a rapid and precise spectral reconstruction with a fidelity of over 99% for the color checkers and various types of “face” samples has been achieved. The flat face areas were extracted exactly from the “face” images with Dlib face detection and Euclidean distance selection algorithms. A large quantity of spectra were rapidly reconstructed from the selected areas and compiled into an extensive database. The convolutional neural network model trained on this database demonstrates an excellent capability for predicting different types of “faces” with an accuracy exceeding 98%, and, according to a series of evaluations, the system’s detection time consistently remained under one second, much faster than other spectral imaging LFD methods. Moreover, a pixel-level liveness detection test system is developed and an LFD experiment shows good agreement with theoretical results, which demonstrates the potential of our method to be applied in other recognition fields. The superior performance and compatibility of our method provide an alternative solution for accurate, highly integrated video LFD applications. Full article
(This article belongs to the Special Issue Advances in Optical Sensing, Instrumentation and Systems: 2nd Edition)
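The Tikhonov-regularized spectral reconstruction step can be sketched in a few lines: with a known per-channel spectral response matrix A and filter measurements y, the spectrum estimate is (AᵀA + λI)⁻¹ Aᵀ y. The sensing dimensions and test spectrum below are assumptions, not the paper's hardware.

```python
import numpy as np

def reconstruct_spectrum(A, y, lam=1e-3):
    """Tikhonov-regularized inversion: recover spectrum s from measurements y = A s + noise."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Toy sensing setup: 16 broadband filter channels sampling a 31-band spectrum (assumed sizes).
rng = np.random.default_rng(0)
A = rng.random((16, 31))                     # per-channel spectral response matrix
true_spectrum = np.exp(-0.5 * ((np.linspace(400, 700, 31) - 560) / 40) ** 2)  # smooth test spectrum
y = A @ true_spectrum + rng.normal(0, 1e-3, 16)

est = reconstruct_spectrum(A, y)
fidelity = np.dot(est, true_spectrum) / (np.linalg.norm(est) * np.linalg.norm(true_spectrum))
print(round(float(fidelity), 4))             # cosine similarity of recovered vs. true spectrum
```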

28 pages, 9307 KiB  
Article
Application Framework and Optimal Features for UAV-Based Earthquake-Induced Structural Displacement Monitoring
by Ruipu Ji, Shokrullah Sorosh, Eric Lo, Tanner J. Norton, John W. Driscoll, Falko Kuester, Andre R. Barbosa, Barbara G. Simpson and Tara C. Hutchinson
Algorithms 2025, 18(2), 66; https://doi.org/10.3390/a18020066 - 26 Jan 2025
Cited by 3 | Viewed by 3388
Abstract
Unmanned aerial vehicle (UAV) vision-based sensing has become an emerging technology for structural health monitoring (SHM) and post-disaster damage assessment of civil infrastructure. This article proposes a framework for monitoring structural displacement under earthquakes by reprojecting image points obtained from UAV-captured videos to the 3-D world space based on the world-to-image point correspondences. To identify optimal features in the UAV imagery, geo-reference targets with various patterns were installed on a test building specimen, which was then subjected to earthquake shaking. A feature point tracking-based algorithm for square checkerboard patterns and a Hough Transform-based algorithm for concentric circular patterns are developed to ensure reliable detection and tracking of image features. Photogrammetry techniques are applied to reconstruct the 3-D world points and extract structural displacements. The proposed methodology is validated by monitoring the displacements of a full-scale 6-story mass timber building during a series of shake table tests. Reasonable accuracy is achieved in that the overall root-mean-square errors of the tracking results are at the millimeter level compared to ground truth measurements from analog sensors. Insights on optimal features for monitoring structural dynamic response are discussed based on statistical analysis of the error characteristics for the various reference target patterns used to track the structural displacements. Full article
(This article belongs to the Special Issue Algorithms for Image Processing and Machine Vision)
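A minimal sketch of the image-to-world reprojection step: estimate the camera pose from known target coordinates with cv2.solvePnP, then intersect the ray through a tracked pixel with the target plane. All coordinates and intrinsics are illustrative, and the plane intersection is a simplification of the photogrammetric reconstruction described above.

```python
import cv2
import numpy as np

K = np.array([[1200.0, 0, 960], [0, 1200.0, 540], [0, 0, 1]])     # assumed UAV camera intrinsics
world_pts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float64)  # target corners (m)
image_pts = np.array([[900, 500], [1020, 505], [1015, 620], [898, 615]], dtype=np.float64)

ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, None)       # camera pose from correspondences
R, _ = cv2.Rodrigues(rvec)

def image_to_world_on_plane(uv, plane_z=0.0):
    """Back-project a tracked pixel onto the z = plane_z world plane."""
    ray_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])     # viewing ray in the camera frame
    ray_world = R.T @ ray_cam                                      # rotate the ray into world axes
    cam_center = (-R.T @ tvec).ravel()                             # camera center in world coordinates
    s = (plane_z - cam_center[2]) / ray_world[2]                   # scale to reach the target plane
    return cam_center + s * ray_world

print(image_to_world_on_plane([960.0, 560.0]))                     # world coordinates of a tracked point
```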
