Search Results (559)

Search Parameters:
Keywords = video fusion

22 pages, 6785 KiB  
Article
Spatiality–Frequency Domain Video Forgery Detection System Based on ResNet-LSTM-CBAM and DCT Hybrid Network
by Zihao Liao, Sheng Hong and Yu Chen
Appl. Sci. 2025, 15(16), 9006; https://doi.org/10.3390/app15169006 - 15 Aug 2025
Viewed by 206
Abstract
As information technology advances, digital content has become widely adopted across diverse fields such as news broadcasting, entertainment, commerce, and forensic investigation. However, the availability of sophisticated multimedia editing tools has significantly increased the risk of video and image forgery, raising serious concerns about content authenticity at both societal and individual levels. To address the growing need for robust and accurate detection methods, this study proposes a novel video forgery detection model that integrates both spatial and frequency-domain features. The model is built on a ResNet-LSTM framework enhanced by a Convolutional Block Attention Module (CBAM) for spatial feature extraction, and further incorporates Discrete Cosine Transform (DCT) to capture frequency domain information. Comprehensive experiments were conducted on several mainstream benchmark datasets, encompassing a wide range of forgery scenarios. The results demonstrate that the proposed model achieves superior performance in distinguishing between authentic and manipulated videos. Additional ablation and comparative studies confirm the contribution of each component in the architecture, offering deeper insight into the model’s capacity. Overall, the findings support the proposed approach as a promising solution for enhancing the reliability of video authenticity analysis under complex conditions. Full article
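
The model above pairs spatial ResNet-LSTM-CBAM features with DCT frequency-domain cues. As a purely illustrative sketch (not the authors' code), block-wise DCT coefficients of a grayscale frame could be extracted as follows; the 8×8 block size and the log-magnitude normalization are assumptions.

```python
import numpy as np
from scipy.fft import dctn  # 2-D type-II DCT

def blockwise_dct_features(frame: np.ndarray, block: int = 8) -> np.ndarray:
    """Return log-magnitude DCT coefficients for each non-overlapping block.

    `frame` is a 2-D grayscale image; the block size and normalization are
    illustrative assumptions, not the paper's exact settings.
    """
    h, w = frame.shape
    h, w = h - h % block, w - w % block           # crop to a multiple of the block size
    frame = frame[:h, :w].astype(np.float64)
    feats = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            coeffs = dctn(frame[y:y + block, x:x + block], norm="ortho")
            feats.append(np.log1p(np.abs(coeffs)).ravel())  # compress dynamic range
    return np.stack(feats)                        # (num_blocks, block * block)

# Example: features for a synthetic 64x64 frame.
demo = blockwise_dct_features(np.random.rand(64, 64))
print(demo.shape)  # (64, 64)
```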

17 pages, 918 KiB  
Article
LTGS-Net: Local Temporal and Global Spatial Network for Weakly Supervised Video Anomaly Detection
by Minghao Li, Xiaohan Wang, Haofei Wang and Min Yang
Sensors 2025, 25(16), 4884; https://doi.org/10.3390/s25164884 - 8 Aug 2025
Viewed by 269
Abstract
Video anomaly detection has important applications in intelligent surveillance; however, because anomalous events are sparse and labeling them is expensive, weakly supervised methods have become a research hotspot. Most current methods process temporal and spatial features independently, which makes it difficult to capture their complex interdependencies and limits detection accuracy and robustness. To address this, we propose the Local Temporal and Global Spatial Network (LTGS) for weakly supervised video anomaly detection. The LTGS architecture incorporates a clip-level temporal feature relation module and a video-level spatial feature module, which collaboratively enhance discriminative representations. Through joint training of these modules, we develop a feature encoder specifically tailored for video anomaly detection. To further refine clip-level annotations and better align them with actual events, we employ a dynamic label updating strategy. These updated labels are utilized to optimize the model and enhance its robustness. Extensive experiments on two widely used public datasets, ShanghaiTech and UCF-Crime, validate the effectiveness of the proposed LTGS method. Experimental results demonstrate that the LTGS achieves an AUC of 96.69% on the ShanghaiTech dataset and 82.33% on the UCF-Crime dataset, outperforming various state-of-the-art algorithms in anomaly detection tasks. Full article
(This article belongs to the Section Sensor Networks)
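
The dynamic label updating strategy mentioned above refines clip-level pseudo-labels under weak, video-level supervision. The abstract does not spell out the update rule, so the sketch below shows one plausible scheme as a labeled assumption: clips of anomalous videos are pseudo-labeled by thresholding the current model's scores and smoothed with a moving average, while clips of normal videos stay labeled normal.

```python
import numpy as np

def update_clip_labels(clip_scores, video_is_anomalous, old_labels,
                       threshold=0.7, momentum=0.9):
    """Hypothetical dynamic label update for weakly supervised anomaly detection.

    clip_scores:        (num_clips,) anomaly scores from the current model in [0, 1]
    video_is_anomalous: the weak, video-level label
    old_labels:         previous soft clip labels, same shape as clip_scores
    The threshold and momentum values are assumptions.
    """
    if not video_is_anomalous:
        return np.zeros_like(old_labels)             # normal videos: every clip is normal
    hard = (clip_scores >= threshold).astype(float)  # clips confidently scored as anomalous
    # An exponential moving average keeps labels stable across epochs (assumption).
    return momentum * old_labels + (1.0 - momentum) * hard

scores = np.array([0.1, 0.8, 0.95, 0.3])
print(update_clip_labels(scores, True, np.zeros(4)))
```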

25 pages, 6821 KiB  
Article
Hierarchical Text-Guided Refinement Network for Multimodal Sentiment Analysis
by Yue Su and Xuying Zhao
Entropy 2025, 27(8), 834; https://doi.org/10.3390/e27080834 - 6 Aug 2025
Viewed by 345
Abstract
Multimodal sentiment analysis (MSA) benefits from integrating diverse modalities (e.g., text, video, and audio). However, challenges remain in effectively aligning non-text features and mitigating redundant information, which may limit potential performance improvements. To address these challenges, we propose a Hierarchical Text-Guided Refinement Network (HTRN), a novel framework that refines and aligns non-text modalities using hierarchical textual representations. We introduce Shuffle-Insert Fusion (SIF) and the Text-Guided Alignment Layer (TAL) to enhance crossmodal interactions and suppress irrelevant signals. In SIF, empty tokens are inserted at fixed intervals in unimodal feature sequences, disrupting local correlations and promoting more generalized representations with improved feature diversity. The TAL guides the refinement of audio and visual representations by leveraging textual semantics and dynamically adjusting their contributions through learnable gating factors, ensuring that non-text modalities remain semantically coherent while retaining essential crossmodal interactions. Experiments demonstrate that the HTRN achieves state-of-the-art performance with accuracies of 86.3% (Acc-2) on CMU-MOSI, 86.7% (Acc-2) on CMU-MOSEI, and 80.3% (Acc-2) on CH-SIMS, outperforming existing methods by 0.8–3.45%. Ablation studies validate the contributions of SIF and the TAL, showing 1.9–2.1% performance gains over baselines. By integrating these components, the HTRN establishes a robust multimodal representation learning framework. Full article
(This article belongs to the Section Information Theory, Probability and Statistics)
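
Shuffle-Insert Fusion is described above as inserting empty tokens at fixed intervals into a unimodal feature sequence to disrupt local correlations. A minimal PyTorch sketch of that insertion step follows; the zero vector used as the "empty" token and the interval of 4 are assumptions, and the shuffling and downstream fusion are omitted.

```python
import torch

def insert_empty_tokens(seq: torch.Tensor, interval: int = 4) -> torch.Tensor:
    """Insert a zero 'empty' token after every `interval` real tokens.

    seq: (T, D) unimodal feature sequence. The zero token and the interval
    are illustrative assumptions, not the paper's exact configuration.
    """
    chunks, empty = [], torch.zeros(1, seq.size(1))
    for start in range(0, seq.size(0), interval):
        chunks.append(seq[start:start + interval])
        chunks.append(empty)                      # break up local correlations
    return torch.cat(chunks, dim=0)

print(insert_empty_tokens(torch.randn(10, 16)).shape)  # torch.Size([13, 16])
```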

24 pages, 23817 KiB  
Article
Dual-Path Adversarial Denoising Network Based on UNet
by Jinchi Yu, Yu Zhou, Mingchen Sun and Dadong Wang
Sensors 2025, 25(15), 4751; https://doi.org/10.3390/s25154751 - 1 Aug 2025
Viewed by 312
Abstract
Digital image quality is crucial for reliable analysis in applications such as medical imaging, satellite remote sensing, and video surveillance. However, traditional denoising methods struggle to balance noise removal with detail preservation and lack adaptability to various types of noise. We propose a novel three-module architecture for image denoising, comprising a generator, a dual-path-UNet-based denoiser, and a discriminator. The generator creates synthetic noise patterns to augment training data, while the dual-path-UNet denoiser uses multiple receptive field modules to preserve fine details and dense feature fusion to maintain global structural integrity. The discriminator provides adversarial feedback to enhance denoising performance. This dual-path adversarial training mechanism addresses the limitations of traditional methods by simultaneously capturing both local details and global structures. Experiments on the SIDD, DND, and PolyU datasets demonstrate superior performance. We compare our architecture with the latest state-of-the-art GAN variants through comprehensive qualitative and quantitative evaluations. These results confirm the effectiveness of noise removal with minimal loss of critical image details. The proposed architecture enhances image denoising capabilities in complex noise scenarios, providing a robust solution for applications that require high image fidelity. By enhancing adaptability to various types of noise while maintaining structural integrity, this method provides a versatile tool for image processing tasks that require preserving detail. Full article
(This article belongs to the Section Sensing and Imaging)

20 pages, 1536 KiB  
Article
Graph Convolution-Based Decoupling and Consistency-Driven Fusion for Multimodal Emotion Recognition
by Yingmin Deng, Chenyu Li, Yu Gu, He Zhang, Linsong Liu, Haixiang Lin, Shuang Wang and Hanlin Mo
Electronics 2025, 14(15), 3047; https://doi.org/10.3390/electronics14153047 - 30 Jul 2025
Viewed by 335
Abstract
Multimodal emotion recognition (MER) is essential for understanding human emotions from diverse sources such as speech, text, and video. However, modality heterogeneity and inconsistent expression pose challenges for effective feature fusion. To address this, we propose a novel MER framework combining a Dynamic Weighted Graph Convolutional Network (DW-GCN) for feature disentanglement and a Cross-Attention Consistency-Gated Fusion (CACG-Fusion) module for robust integration. DW-GCN models complex inter-modal relationships, enabling the extraction of both common and private features. The CACG-Fusion module subsequently enhances classification performance through dynamic alignment of cross-modal cues, employing attention-based coordination and consistency-preserving gating mechanisms to optimize feature integration. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method achieves state-of-the-art performance, significantly improving the ACC7, ACC2, and F1 scores. Full article
(This article belongs to the Section Computer Science & Engineering)

17 pages, 1603 KiB  
Perspective
A Perspective on Quality Evaluation for AI-Generated Videos
by Zhichao Zhang, Wei Sun and Guangtao Zhai
Sensors 2025, 25(15), 4668; https://doi.org/10.3390/s25154668 - 28 Jul 2025
Viewed by 640
Abstract
Recent breakthroughs in AI-generated content (AIGC) have transformed video creation, empowering systems to translate text, images, or audio into visually compelling stories. Yet reliable evaluation of these machine-crafted videos remains elusive because quality is governed not only by spatial fidelity within individual frames but also by temporal coherence across frames and precise semantic alignment with the intended message. The foundational role of sensor technologies is critical, as they determine the physical plausibility of AIGC outputs. In this perspective, we argue that multimodal large language models (MLLMs) are poised to become the cornerstone of next-generation video quality assessment (VQA). By jointly encoding cues from multiple modalities such as vision, language, sound, and even depth, the MLLM can leverage its powerful language understanding capabilities to assess the quality of scene composition, motion dynamics, and narrative consistency, overcoming the fragmentation of hand-engineered metrics and the poor generalization ability of CNN-based methods. Furthermore, we provide a comprehensive analysis of current methodologies for assessing AIGC video quality, including the evolution of generation models, dataset design, quality dimensions, and evaluation frameworks. We argue that advances in sensor fusion enable MLLMs to combine low-level physical constraints with high-level semantic interpretations, further enhancing the accuracy of visual quality assessment. Full article
(This article belongs to the Special Issue Perspectives in Intelligent Sensors and Sensing Systems)

18 pages, 4836 KiB  
Article
Deep Learning to Analyze Spatter and Melt Pool Behavior During Additive Manufacturing
by Deepak Gadde, Alaa Elwany and Yang Du
Metals 2025, 15(8), 840; https://doi.org/10.3390/met15080840 - 28 Jul 2025
Viewed by 607
Abstract
To capture the complex metallic spatter and melt pool behavior during the rapid interaction between the laser and metal material, high-speed cameras are applied to record the laser powder bed fusion process and generate a large volume of image data. In this study, four deep learning algorithms are applied: YOLOv5, Fast R-CNN, RetinaNet, and EfficientDet. They are trained on the recorded videos to learn and extract information on spatter and melt pool behavior during the laser powder bed fusion process. The well-trained models achieved high accuracy and low loss, demonstrating strong capability in accurately detecting and tracking spatter and melt pool dynamics. A stability index is proposed and calculated based on the melt pool length change rate; a greater index value reflects a more stable melt pool. We found that more spatters were detected for the unstable melt pool, while fewer spatters were found for the stable melt pool. A spatter's size affects its initial ejection speed: large spatters are ejected slowly, while small spatters are ejected rapidly. In addition, more than 58% of detected spatters have their initial ejection angle in the range of 60–120°. These findings provide a better understanding of spatter and melt pool dynamics and behavior, uncover the influence of melt pool stability on spatter formation, and demonstrate the correlation between the spatter size and its initial ejection speed. This work will contribute to the extraction of important information from high-speed recorded videos for additive manufacturing to reduce waste, lower cost, enhance part quality, and increase process reliability. Full article
(This article belongs to the Special Issue Machine Learning in Metal Additive Manufacturing)
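
The stability index above is stated to be derived from the melt pool length change rate, with larger values indicating a more stable melt pool. The exact formula is not given in the abstract, so the reciprocal-of-mean-absolute-change-rate form below, and the variable names, are assumptions used purely for illustration.

```python
import numpy as np

def melt_pool_stability_index(lengths_um, frame_interval_s, eps=1e-9):
    """Hypothetical stability index: inverse of the mean absolute length change rate.

    lengths_um:       melt pool length per video frame (micrometres)
    frame_interval_s: time between consecutive high-speed camera frames (seconds)
    A nearly constant melt pool length yields a large index (stable);
    rapid fluctuation yields a small index (unstable).
    """
    rates = np.abs(np.diff(np.asarray(lengths_um, dtype=float))) / frame_interval_s
    return 1.0 / (rates.mean() + eps)

stable = melt_pool_stability_index([100, 101, 100, 102, 101], 1e-4)
unstable = melt_pool_stability_index([100, 140, 90, 150, 95], 1e-4)
print(stable > unstable)  # True
```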

28 pages, 8982 KiB  
Article
Decision-Level Multi-Sensor Fusion to Improve Limitations of Single-Camera-Based CNN Classification in Precision Farming: Application in Weed Detection
by Md. Nazmuzzaman Khan, Adibuzzaman Rahi, Mohammad Al Hasan and Sohel Anwar
Computation 2025, 13(7), 174; https://doi.org/10.3390/computation13070174 - 18 Jul 2025
Viewed by 402
Abstract
The United States leads the world in corn production and consumption, with an estimated value of USD 50 billion per year. There is a pressing need for novel and efficient techniques that enhance the identification and eradication of weeds in a manner that is both environmentally sustainable and economically advantageous. Weed classification for autonomous agricultural robots is a challenging task for a single-camera-based system due to noise, vibration, and occlusion. To address this issue, in this paper we present a multi-camera system with decision-level sensor fusion that overcomes the limitations of a single-camera-based system. This study uses a convolutional neural network (CNN) pre-trained on the ImageNet dataset, which was re-trained on a limited weed dataset to classify three distinct weed species: Xanthium strumarium (Common Cocklebur), Amaranthus retroflexus (Redroot Pigweed), and Ambrosia trifida (Giant Ragweed). These weed species are frequently encountered within corn fields. The test results showed that the re-trained VGG16 with a transfer-learning-based classifier exhibited acceptable accuracy (99% training, 97% validation, 94% testing), and its inference time for weed classification from the video feed was suitable for real-time implementation. However, the accuracy of CNN-based classification from a single-camera video feed was found to deteriorate due to noise, vibration, and partial occlusion of weeds, making it insufficiently reliable for the spray system of an agricultural robot (AgBot). To improve the accuracy of the weed classification system and to overcome the shortcomings of single-sensor-based CNN classification, an improved Dempster–Shafer (DS)-based decision-level multi-sensor fusion algorithm was developed and implemented. The proposed algorithm improves on CNN-based weed classification when the weed is partially occluded. It can also detect whether a sensor within an array of sensors is faulty and improves the overall classification accuracy by penalizing the evidence from a faulty sensor. Overall, the proposed fusion algorithm showed robust results in challenging scenarios, overcoming the limitations of a single-sensor-based system. Full article
(This article belongs to the Special Issue Moving Object Detection Using Computational Methods and Modeling)
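
The decision-level fusion above builds on Dempster–Shafer evidence combination across cameras. The sketch below implements the standard Dempster rule for singleton hypotheses (the three weed classes) as a generic illustration; the paper's improved, fault-penalizing variant is not reproduced, and the class names and mass values are invented.

```python
def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions over mutually exclusive singleton hypotheses.

    m1, m2: {hypothesis: mass}, each summing to 1. Returns the normalized
    combined masses; raises if the evidence is in total conflict.
    """
    conflict = sum(m1[a] * m2[b] for a in m1 for b in m2 if a != b)
    if conflict >= 1.0:
        raise ValueError("total conflict between sources")
    return {a: m1[a] * m2.get(a, 0.0) / (1.0 - conflict) for a in m1}

# Hypothetical per-camera CNN beliefs over the three weed classes.
cam1 = {"cocklebur": 0.6, "pigweed": 0.3, "ragweed": 0.1}
cam2 = {"cocklebur": 0.5, "pigweed": 0.4, "ragweed": 0.1}
print(dempster_combine(cam1, cam2))  # cocklebur mass sharpened by agreement
```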

22 pages, 4033 KiB  
Article
Masked Feature Residual Coding for Neural Video Compression
by Chajin Shin, Yonghwan Kim, KwangPyo Choi and Sangyoun Lee
Sensors 2025, 25(14), 4460; https://doi.org/10.3390/s25144460 - 17 Jul 2025
Viewed by 447
Abstract
In neural video compression, an approximation of the target frame is predicted, and a mask is subsequently applied to it. Then, the masked predicted frame is subtracted from the target frame and fed into the encoder along with the conditional information. However, this structure has two limitations. First, in the pixel domain, even if the mask is perfectly predicted, the residuals cannot be significantly reduced. Second, reconstructed features with abundant temporal context information cannot be used as references for compressing the next frame. To address these problems, we propose Conditional Masked Feature Residual (CMFR) Coding. We extract features from the target frame and the predicted features using neural networks. Then, we predict the mask and subtract the masked predicted features from the target features. Thereafter, the difference is fed into the encoder with the conditional information. Moreover, to more effectively remove conditional information from the target frame, we introduce a Scaled Feature Fusion (SFF) module. In addition, we introduce a Motion Refiner to enhance the quality of the decoded optical flow. Experimental results show that our model achieves an 11.76% bit saving over the model without the proposed methods, averaged over all HEVC test sequences, demonstrating the effectiveness of the proposed methods. Full article
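
The central coding step described above, subtracting masked predicted features from the target-frame features before encoding, can be illustrated with toy tensors as below; the feature extractors, mask predictor, and tensor shapes are assumptions rather than the CMFR implementation.

```python
import torch

def masked_feature_residual(target_feat, pred_feat, mask):
    """Residual passed to the encoder: target features minus the masked prediction.

    target_feat, pred_feat: (C, H, W) features of the target frame and the
    temporal prediction; mask: (1, H, W) values in [0, 1] predicted by the model.
    All shapes here are illustrative.
    """
    return target_feat - mask * pred_feat

def reconstruct(residual, pred_feat, mask):
    """Decoder side: add the masked prediction back to the decoded residual."""
    return residual + mask * pred_feat

C, H, W = 8, 4, 4
target, pred = torch.randn(C, H, W), torch.randn(C, H, W)
mask = torch.sigmoid(torch.randn(1, H, W))
res = masked_feature_residual(target, pred, mask)
print(torch.allclose(reconstruct(res, pred, mask), target))  # True (lossless toy case)
```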

23 pages, 1187 KiB  
Article
Transmit and Receive Diversity in MIMO Quantum Communication for High-Fidelity Video Transmission
by Udara Jayasinghe, Prabhath Samarathunga, Thanuj Fernando and Anil Fernando
Algorithms 2025, 18(7), 436; https://doi.org/10.3390/a18070436 - 16 Jul 2025
Cited by 1 | Viewed by 273
Abstract
Reliable transmission of high-quality video over wireless channels is challenged by fading and noise, which degrade visual quality and disrupt temporal continuity. To address these issues, this paper proposes a quantum communication framework that integrates quantum superposition with multi-input multi-output (MIMO) spatial diversity techniques to enhance robustness and efficiency in dynamic video transmission. The proposed method converts compressed videos into classical bitstreams, which are then channel-encoded and quantum-encoded into qubit superposition states. These states are transmitted over a 2×2 MIMO system employing varied diversity schemes to mitigate the effects of multipath fading and noise. At the receiver, a quantum decoder reconstructs the classical information, followed by channel decoding to retrieve the video data, and the source decoder reconstructs the final video. Simulation results demonstrate that the quantum MIMO system significantly outperforms equivalent-bandwidth classical MIMO frameworks across diverse signal-to-noise ratio (SNR) conditions, achieving a peak signal-to-noise ratio (PSNR) of up to 39.12 dB, a structural similarity index (SSIM) of up to 0.9471, and a video multi-method assessment fusion (VMAF) score of up to 92.47, with improved error resilience across various group of pictures (GOP) formats. These results highlight the potential of quantum MIMO communication for enhancing the reliability and quality of video delivery in next-generation wireless networks. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
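
Quality above is reported with PSNR, SSIM, and VMAF. As a small generic reference, not code from the paper's simulations, per-frame PSNR can be computed as follows; the 8-bit peak value is an assumption.

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two frames of identical shape."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                       # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
noisy = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(round(psnr(ref, noisy), 2))
```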

20 pages, 5700 KiB  
Article
Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features
by Hyeonuk Bhin and Jongsuk Choi
Electronics 2025, 14(14), 2837; https://doi.org/10.3390/electronics14142837 - 15 Jul 2025
Viewed by 586
Abstract
Personality is a fundamental psychological trait that exerts a long-term influence on human behavior patterns and social interactions. Automatic personality recognition (APR) has exhibited increasing importance across various domains, including Human–Robot Interaction (HRI), personalized services, and psychological assessments. In this study, we propose a multimodal personality recognition model that classifies the Big Five personality traits by extracting features from three heterogeneous sources: audio processed using Wav2Vec2, video represented as Skeleton Landmark time series, and text encoded through Bidirectional Encoder Representations from Transformers (BERT) and Doc2Vec embeddings. Each modality is handled through an independent Self-Attention block that highlights salient temporal information, and these representations are then summarized and integrated using a late fusion approach to effectively reflect both the inter-modal complementarity and cross-modal interactions. Compared to traditional recurrent neural network (RNN)-based multimodal models and unimodal classifiers, the proposed model achieves an improvement of up to 12 percent in the F1-score. It also maintains a high prediction accuracy and robustness under limited input conditions. Furthermore, a visualization based on t-distributed Stochastic Neighbor Embedding (t-SNE) demonstrates clear distributional separation across the personality classes, enhancing the interpretability of the model and providing insights into the structural characteristics of its latent representations. To support real-time deployment, a lightweight thread-based processing architecture is implemented, ensuring computational efficiency. By leveraging deep learning-based feature extraction and the Self-Attention mechanism, we present a novel personality recognition framework that balances performance with interpretability. The proposed approach establishes a strong foundation for practical applications in HRI, counseling, education, and other interactive systems that require personalized adaptation. Full article
(This article belongs to the Special Issue Explainable Machine Learning and Data Mining)
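
The model above runs an independent self-attention block per modality and then late-fuses the summarized representations. A minimal PyTorch sketch of that general pattern follows; the layer sizes, mean pooling, and concatenation-plus-linear fusion head are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Per-modality self-attention followed by late fusion (illustrative only)."""

    def __init__(self, dims=(128, 96, 256), hidden=64, num_traits=5):
        super().__init__()
        # One projection + self-attention block per modality (audio, video, text).
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(hidden, num_heads=4, batch_first=True) for _ in dims]
        )
        self.head = nn.Linear(hidden * len(dims), num_traits)  # Big Five scores

    def forward(self, sequences):
        pooled = []
        for x, proj, attn in zip(sequences, self.proj, self.attn):
            h = proj(x)                       # (B, T, hidden)
            h, _ = attn(h, h, h)              # self-attention over time
            pooled.append(h.mean(dim=1))      # summarize each modality
        return self.head(torch.cat(pooled, dim=-1))  # late fusion

model = LateFusionClassifier()
audio, video, text = torch.randn(2, 50, 128), torch.randn(2, 30, 96), torch.randn(2, 20, 256)
print(model([audio, video, text]).shape)  # torch.Size([2, 5])
```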

19 pages, 709 KiB  
Article
Fusion of Multimodal Spatio-Temporal Features and 3D Deformable Convolution Based on Sign Language Recognition in Sensor Networks
by Qian Zhou, Hui Li, Weizhi Meng, Hua Dai, Tianyu Zhou and Guineng Zheng
Sensors 2025, 25(14), 4378; https://doi.org/10.3390/s25144378 - 13 Jul 2025
Viewed by 454
Abstract
Sign language is a complex and dynamic visual language that requires the coordinated movement of various body parts, such as the hands, arms, and limbs, making it an ideal application domain for sensor networks to capture and interpret human gestures accurately. To address the intricate task of precise and expedient sign language recognition (SLR) from raw videos, this study introduces a novel deep learning approach built on a multimodal framework. Specifically, feature extraction models are built based on two modalities: skeleton and RGB images. In this paper, we first propose a Multi-Stream Spatio-Temporal Graph Convolutional Network (MSGCN) that relies on three modules: a decoupling graph convolutional network, a self-emphasizing temporal convolutional network, and a spatio-temporal joint attention module. These modules are combined to capture the spatio-temporal information in multi-stream skeleton features. Second, we propose a 3D ResNet model based on deformable convolution (D-ResNet) to model complex spatial and temporal sequences in the original raw images. Finally, a gating mechanism-based Multi-Stream Fusion Module (MFM) is employed to merge the results of the two modalities. Extensive experiments are conducted on the public datasets AUTSL and WLASL, achieving competitive results compared to state-of-the-art systems. Full article
(This article belongs to the Special Issue Intelligent Sensing and Artificial Intelligence for Image Processing)

12 pages, 4368 KiB  
Article
A Dual-Branch Fusion Model for Deepfake Detection Using Video Frames and Microexpression Features
by Georgios Petmezas, Vazgken Vanian, Manuel Pastor Rufete, Eleana E. I. Almaloglou and Dimitris Zarpalas
J. Imaging 2025, 11(7), 231; https://doi.org/10.3390/jimaging11070231 - 11 Jul 2025
Viewed by 609
Abstract
Deepfake detection has become a critical issue due to the rise of synthetic media and its potential for misuse. In this paper, we propose a novel approach to deepfake detection by combining video frame analysis with facial microexpression features. The dual-branch fusion model utilizes a 3D ResNet18 for spatiotemporal feature extraction and a transformer model to capture microexpression patterns, which are difficult to replicate in manipulated content. We evaluate the model on the widely used FaceForensics++ (FF++) dataset and demonstrate that our approach outperforms existing state-of-the-art methods, achieving 99.81% accuracy and a perfect ROC-AUC score of 100%. The proposed method highlights the importance of integrating diverse data sources for deepfake detection, addressing some of the current limitations of existing systems. Full article

21 pages, 7528 KiB  
Article
A Fine-Tuning Method via Adaptive Symmetric Fusion and Multi-Graph Aggregation for Human Pose Estimation
by Yinliang Shi, Zhaonian Liu, Bin Jiang, Tianqi Dai and Yuanfeng Lian
Symmetry 2025, 17(7), 1098; https://doi.org/10.3390/sym17071098 - 9 Jul 2025
Viewed by 369
Abstract
Human Pose Estimation (HPE) aims to accurately locate the positions of human key points in images or videos. However, the performance of HPE is often significantly reduced in practical application scenarios due to environmental interference. To address this challenge, we propose a ladder side-tuning method for the Vision Transformer (ViT) pre-trained model based on multi-path feature fusion to improve the accuracy of HPE in highly interfering environments. First, we extract the global features, frequency features and multi-scale spatial features through the ViT pre-trained model, the discrete wavelet convolutional network and the atrous spatial pyramid pooling network (ASPP). By comprehensively capturing the information of the human body and the environment, the ability of the model to analyze local details, textures, and spatial information is enhanced. In order to efficiently fuse these features, we devise an adaptive symmetric feature fusion strategy, which dynamically adjusts the intensity of feature fusion according to the similarity among features to achieve the optimal fusion effect. Finally, a multi-graph feature aggregation method is developed. We construct graph structures of different features and deeply explore the subtle differences among the features based on the dual fusion mechanism of points and edges to ensure the information integrity. The experimental results demonstrate that our method achieves 4.3% and 4.2% improvements in the AP metric on the MS COCO dataset and a custom high-interference dataset, respectively, compared with the HRNet. This highlights its superiority for human pose estimation tasks in both general and interfering environments. Full article
(This article belongs to the Special Issue Symmetry and Asymmetry in Computer Vision and Graphics)
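
The adaptive symmetric feature fusion strategy above is said to adjust fusion intensity according to the similarity among features. Since the abstract does not give the weighting rule, the sketch below uses cosine similarity and a softmax over mean similarities as a labeled assumption.

```python
import torch
import torch.nn.functional as F

def similarity_adaptive_fusion(features):
    """Fuse feature vectors with weights derived from their mutual cosine similarity.

    features: list of (D,) tensors (e.g., ViT global, wavelet, and ASPP features).
    The softmax-over-mean-similarity weighting is an illustrative assumption.
    """
    stacked = torch.stack(features)                       # (N, D)
    sims = F.cosine_similarity(stacked.unsqueeze(1), stacked.unsqueeze(0), dim=-1)
    weights = torch.softmax(sims.mean(dim=1), dim=0)      # more consistent features weigh more
    return (weights.unsqueeze(1) * stacked).sum(dim=0)    # (D,)

feats = [torch.randn(32), torch.randn(32), torch.randn(32)]
print(similarity_adaptive_fusion(feats).shape)  # torch.Size([32])
```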

17 pages, 7292 KiB  
Article
QP-Adaptive Dual-Path Residual Integrated Frequency Transformer for Data-Driven In-Loop Filter in VVC
by Cheng-Hsuan Yeh, Chi-Ting Ni, Kuan-Yu Huang, Zheng-Wei Wu, Cheng-Pin Peng and Pei-Yin Chen
Sensors 2025, 25(13), 4234; https://doi.org/10.3390/s25134234 - 7 Jul 2025
Viewed by 410
Abstract
As AI-enabled embedded systems such as smart TVs and edge devices demand efficient video processing, Versatile Video Coding (VVC/H.266) becomes essential for bandwidth-constrained Multimedia Internet of Things (M-IoT) applications. However, its block-based coding often introduces compression artifacts. While CNN-based methods effectively reduce these artifacts, maintaining robust performance across varying quantization parameters (QPs) remains challenging. Recent QP-adaptive designs like QA-Filter show promise but are still limited. This paper proposes DRIFT, a QP-adaptive in-loop filtering network for VVC. DRIFT combines a lightweight frequency fusion CNN (LFFCNN) for local enhancement and a Swin Transformer-based global skip connection for capturing long-range dependencies. LFFCNN leverages octave convolution and introduces a novel residual block (FFRB) that integrates multiscale extraction, QP adaptivity, frequency fusion, and spatial-channel attention. A QP estimator (QPE) is further introduced to mitigate double enhancement in inter-coded frames. Experimental results demonstrate that DRIFT achieves BD rate reductions of 6.56% (intra) and 4.83% (inter), with an up to 10.90% gain on the BasketballDrill sequence. Additionally, LFFCNN reduces the model size by 32% while slightly improving the coding performance over QA-Filter. Full article
(This article belongs to the Special Issue Multimodal Sensing Technologies for IoT and AI-Enabled Systems)
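
BD-rate figures such as the 6.56% and 4.83% reductions reported above are conventionally computed with the Bjøntegaard-delta method: fit log bitrate as a polynomial of PSNR for both codecs and integrate over the overlapping quality range. The sketch below is a generic implementation of that metric, not code from the paper, and the sample rate/PSNR points are invented.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjøntegaard-delta rate (%): average bitrate change of 'test' vs. 'ref'.

    Negative values mean the test codec needs less bitrate at equal PSNR.
    Uses the usual cubic fit of log10(rate) as a function of PSNR.
    """
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0

# Invented sample points (four rate points each): the test codec saves bitrate.
print(round(bd_rate([1000, 2000, 4000, 8000], [32.0, 35.0, 38.0, 41.0],
                    [900, 1800, 3600, 7200], [32.1, 35.1, 38.1, 41.1]), 2))
```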