MDPI - Publisher of Open Access Journals

21 pages, 1566 KB

Open Access Article

A Scene-Adaptive Super-Resolution Framework for Video Compression

by Qiyu Zha and Jiangling Guo

J. Imaging 2026, 12(5), 200; https://doi.org/10.3390/jimaging12050200 - 5 May 2026

Viewed by 675


Video compression is central to large-scale video delivery, where better rate–distortion efficiency directly reduces bandwidth and storage cost. A practical way to improve efficiency is to encode a low-resolution video stream with a standard codec and restore high-resolution details with a learned super-resolution model at the decoder. However, prior SR-assisted compression methods usually update the reconstruction model at fixed temporal intervals, which can waste bitrate when those update boundaries do not match actual scene changes. In this paper, we present SASVC, a scene-adaptive super-resolution video compression framework for offline codec-augmented compression. SASVC detects scene changes using frame-wise grayscale differences, updates only compact adapter modules when a content transition is observed, and compresses the resulting model updates with chained differencing, quantization, and entropy coding. In this way, the method reduces unnecessary model-stream overhead while preserving scene-specific reconstruction fidelity. Experimental results on both long-form and short-form datasets show that SASVC consistently outperforms SRVC-style baselines and conventional codec-based alternatives under the Bjontegaard delta rate based on peak signal-to-noise ratio (BD-rate/PSNR) criterion. Complementary rate–distortion (RD) comparisons in terms of structural similarity index measure (SSIM) and Video Multi-Method Assessment Fusion (VMAF) show the same overall trend, indicating that the gain is not limited to a single distortion metric. Specifically, SASVC achieves BD-rate gains of −41.33% and −53.49% on Vimeo and Xiph, respectively, and further reaches −51.53% and −39.83% on UVG and MCL-JCV. The decoder also maintains real-time 1080p reconstruction at 125 frames per second (FPS) on an NVIDIA RTX 3080 Ti GPU, indicating that scene-aligned model updates can improve compression efficiency while keeping decoder-side deployment practical.
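The abstract above says scene changes are detected from frame-wise grayscale differences. A minimal sketch of that idea follows, assuming a mean-absolute-difference score and a fixed threshold; the function names, the threshold value of 30.0, and the toy frames are illustrative assumptions and are not taken from the paper.

```python
def mean_abs_diff(prev, curr):
    """Mean absolute difference between two equally sized grayscale frames.

    Frames are lists of rows of pixel intensities (0-255).
    """
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(p - c)
            count += 1
    return total / count


def detect_scene_changes(frames, threshold=30.0):
    """Return indices of frames whose difference to the previous frame
    exceeds the threshold (hypothetical value, chosen for illustration)."""
    cuts = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts


# Toy 2x2 frames: two dark frames followed by two bright frames.
dark = [[10, 12], [11, 9]]
bright = [[200, 198], [205, 199]]
frames = [dark, dark, bright, bright]
print(detect_scene_changes(frames))  # [2]
```

In a scheme like the one described, an adapter update would be triggered only at the returned indices rather than at fixed intervals.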

(This article belongs to the Section Image and Video Processing)


29 pages, 1779 KB

Open Access Article

BWT-Enhanced Compression for GIS Raster Data: A Hybrid AV1-Inspired Approach with Burrows–Wheeler Transform

by Yair Wiseman

Big Data Cogn. Comput. 2026, 10(5), 140; https://doi.org/10.3390/bdcc10050140 - 1 May 2026

Viewed by 532

Abstract


The AVIF (AV1 Image File Format) is a modern, royalty-free image format that leverages the AV1 video codec for superior compression efficiency, supporting both lossy and lossless modes. Its entropy encoding relies on a multi-symbol context-adaptive arithmetic coder (range coding with adaptive cumulative distribution functions (CDFs)), which is effective for general imagery but may not optimally exploit the repetitive structures common in Geographic Information System (GIS) maps/data. This paper proposes replacing AVIF’s entropy encoder with the Burrows–Wheeler Transform (BWT), a reversible preprocessing algorithm that rearranges data to create runs of similar symbols, enhancing subsequent compression. We detail the technical steps for modification, drawing from AV1’s open-source implementation, and explain why BWT is advantageous for GIS raster maps/data, which often feature large uniform areas, limited color palettes, and spatial redundancies. Empirical evidence from related studies on BWT-based image compression shows improvements in lossless scenarios, potentially reducing file sizes considerably relative to standard methods while preserving data integrity critical for geospatial analysis. This swap could improve storage, transmission, and processing efficiency in GIS applications, such as remote sensing and cartography. The discussion includes challenges like computational overhead and compatibility, with recommendations for implementations. The resulting BWT-AVIF hybrid produces a non-standard AV1 bit-stream that is not compliant with the AV1 or AVIF specifications and therefore requires custom decoders. It is presented here as a research prototype for GIS-specific compression rather than a compliant AVIF extension.
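To illustrate why the Burrows–Wheeler Transform is reversible and tends to group equal symbols, here is a naive textbook sketch operating on strings. It is not the paper's AV1 integration: the sentinel character "$" and the function names are assumptions made for this example, and the rotation-sorting approach is quadratic and only suitable for small inputs.

```python
def bwt_forward(text, eos="$"):
    """Naive BWT: sort all rotations of text+eos and take the last column.

    `eos` is a sentinel assumed to be absent from the input and to sort
    before every other symbol.
    """
    if eos in text:
        raise ValueError("input must not contain the sentinel symbol")
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_inverse(last_column, eos="$"):
    """Invert the transform by repeatedly prepending the last column
    to the table and re-sorting it."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


encoded = bwt_forward("banana")
print(encoded)               # annb$aa
print(bwt_inverse(encoded))  # banana
```

The transform itself does not shrink the data; it only reorders symbols so that a later stage (for example run-length or entropy coding) can exploit the resulting runs.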

(This article belongs to the Special Issue Intelligent Communication and Sensor Networks for Advanced Signal Processing)


31 pages, 5378 KB

Open Access Article

FUSEPOP: A Multi-Modal Fusion with Mutual Information Weighting and Stacked Ensemble for Social Media Popularity Prediction

by Ömer Ayberk Şencan, İsmail Atacak, İbrahim Alper Doğru, Sinan Toklu, Necaattin Barışçı and Kazım Kılıç

Appl. Sci. 2026, 16(9), 4160; https://doi.org/10.3390/app16094160 - 23 Apr 2026

Viewed by 841

Abstract


Short-form video content has gained importance as a popular form of digital media due to the rising popularity of social media platforms and the decreasing attention spans of consumers. However, a major obstacle to popularity detection in short-form content is the heterogeneous nature of the data, encompassing textual, visual, and metadata components. To tackle this challenge, we propose FUSEPOP, a robust multi-modal architecture. The proposed framework utilizes ResNet-50 for visual feature extraction and XLM-RoBERTa for encoding multilingual textual information. FUSEPOP employs a mutual information-based modality weighting mechanism with logarithmic smoothing and a 0.7 weight ceiling to balance contributions from each input stream. Furthermore, FUSEPOP implements a robust stacked generalization strategy trained via stratified 5-fold cross-validation. This approach utilizes a logistic regression meta-learner to dynamically synthesize predictions from random forest, XGBoost, and a neural network-based classifier. Experimental results show that this architecture significantly outperforms benchmark models, achieving an accuracy of 0.980 and an average F1-score of 0.964 on the feature configuration selected for this study, and remains competitive on a literature-aligned alternative configuration. These findings confirm that the proposed model successfully detects popularity on short-form social media content. Full article

(This article belongs to the Special Issue Advances in Machine Learning and Data Mining: Emerging Trends and Applications)


25 pages, 5157 KB

Open Access Article

HDC-RTDETR: Instrument Detection Model for Intelligent Inspection of Wind Farm Switching Stations Under Fog, Light, or Noise Conditions

by Wenshuo Shang, Xiaoqiang Jia, Ying Cui and Yu Jia

Symmetry 2026, 18(4), 595; https://doi.org/10.3390/sym18040595 - 31 Mar 2026

Viewed by 549

Abstract


The continuous expansion of wind farms and the escalating demand for automated operation and maintenance have established the efficient and accurate performance of intelligent inspection systems for switching stations as a critical factor for ensuring power facility safety and stability. However, the intelligent inspection trolleys deployed in such settings are frequently hampered by suboptimal instrument detection accuracy and limited robustness, attributable primarily to environmental interference from fog, variable lighting conditions, or image noise. This paper proposes a multi-module-integrated real-time object detection model, termed HDC-RTDETR (HSAN + DBlockC3 + CGAFusion + RT-DETR). The model is grounded in the intelligent inspection principle of “clear visibility precedes efficient inspection”, with the primary objective of enabling reliable instrument identification under the influence of fog, changing lighting conditions or image noise. Specifically, building upon the RT-DETR architecture, we introduce three targeted enhancements: (1) the HSAN module adaptively fuses grayscale, edge, and color features to improve robustness against composite degradations (e.g., fog, illumination variations, noise) by enhancing target responses while suppressing background clutter; (2) DBlockC3 captures and integrates multi-scale contextual information, refining the discrimination of fine-grained instrument details under complex lighting; and (3) the CGAFusion module strengthens hierarchical feature integration within the encoder, effectively mitigating fog-induced blurring effects. Experimental validation on a Custom Dataset demonstrates that the proposed model achieves a mAP@50 of 95.566% (representing an improvement of 3.390 percentage points) and a precision of 90.557% (an increase of 11.20 percentage points). Furthermore, on an Industrial Instrument Needle Dataset, it attains a mAP@50 of 98.130% (+2.242%) and a precision of 95.130% (+4.269%). 
In addition, we validated its edge deployment capabilities on the Jetson AGX Orin, achieving real-time inference at 16.5 FPS, which meets the near-real-time video streaming processing requirements of many application scenarios. These results confirm that the HDC-RTDETR model exhibits superior detection performance and environmental adaptability in complex industrial scenarios, thereby establishing a high-confidence localization foundation for subsequent instrument reading extraction tasks.

(This article belongs to the Section Engineering and Materials)


19 pages, 10157 KB

Open Access Article

DiffVP: A Diffusion Model with Explicit Coordinate-Temporal Encoding for Viewport Prediction in 360° Videos

by Huimin Zheng, Lina Du, Xiushan Nie and Fei Dong

Electronics 2026, 15(6), 1326; https://doi.org/10.3390/electronics15061326 - 23 Mar 2026

Viewed by 451

Abstract


Viewport prediction is a key component in tile-based 360° video streaming. Existing viewport prediction models based on Long Short-Term Memory (LSTM) networks or Transformers typically output a single deterministic future trajectory through deterministic mapping, which fails to capture the inherent randomness in viewing behavior. Moreover, when encoding trajectory features, such models often map trajectory coordinates directly into a high-dimensional space while neglecting the spatial information inherent in the coordinates themselves. Additionally, they exhibit limitations in capturing cross-modal relationships between visual and trajectory features. To address these issues, this paper proposes DiffVP, a diffusion model for viewport prediction in 360° videos. Under the constraints of historical viewing trajectories and video saliency maps, DiffVP leverages Denoising Diffusion Implicit Models (DDIMs) to model future viewing trajectories in the form of probability distributions, generating diverse and reasonable prediction results. In the denoising network, DiffVP employs Explicit Coordinate-Time Encoding (ECTE) to model the temporal dependencies of trajectories and the spatial relationships among coordinates; moreover, a Coordinate-Aware Saliency Features Fusion (CASF) module is proposed to achieve cross-modal alignment and interactive fusion of saliency and trajectory features. Experimental results on three public datasets demonstrate that DiffVP achieves the best accuracy for 2–5 s viewport prediction without sacrificing the performance of short-term (<1 s) prediction.


27 pages, 1058 KB

Open Access Article

An AI-Driven Multimodal Sensor Fusion Framework for Fraud Perception in Short-Video and Live-Streaming Platforms

by Ruixiang Zhao, Xuanhao Zhang, Jinfan Yang, Haofei Li, Zhengjia Lu, Wenrui Xu and Manzhou Li

Sensors 2026, 26(5), 1525; https://doi.org/10.3390/s26051525 - 28 Feb 2026

Viewed by 1013

Abstract


With the rapid proliferation of short-video platforms and live-streaming commerce ecosystems, marketing activities are increasingly manifested through complex multimodal sensing signals. These heterogeneous sensor data streams exhibit strong temporal dependency, high cross-modal coupling, and progressive evolutionary characteristics, making early-stage fraud perception particularly challenging for conventional unimodal or static analytical paradigms. Existing approaches often fail to effectively capture weak anomalous cues emerging across multimodal channels during the initial stages of fraudulent campaigns. To address these limitations, an artificial intelligence-driven multimodal sensor perception framework is proposed for temporal fraud detection in short-video environments. A multimodal temporal alignment module is first designed to synchronize heterogeneous sensor signals with inconsistent sampling granularities. Subsequently, a shared temporal encoding network is constructed to learn evolution-aware representations across multimodal sensor sequences. On this basis, a cross-modal temporal attention fusion mechanism is introduced to dynamically weight sensor contributions at different behavioral stages. Finally, a fraud evolution modeling and early risk prediction module is developed to characterize the progressive intensification of fraudulent activities and to enable risk assessment under incomplete temporal observations. Extensive experiments conducted on real-world datasets collected from multiple mainstream short-video platforms demonstrate the effectiveness of the proposed AI-driven sensing framework. The model achieves an overall accuracy of 0.941, precision of 0.865, recall of 0.812, and F1 score of 0.838, with the AUC further reaching 0.956, significantly outperforming text-based, vision-based, temporal, and conventional multimodal baselines. 
In early-stage detection scenarios utilizing only the first 30% of video content, the framework maintains stable performance advantages, achieving a precision of 0.812, recall of 0.704, and F1 score of 0.754, validating its capability for proactive fraud warning.
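The reported F1 scores can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short sketch below does that arithmetic for both the full-video and the early-stage (first 30%) settings; the helper name is an assumption, and the inputs are the figures quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


full_video = f1_score(0.865, 0.812)
early_stage = f1_score(0.812, 0.704)

print(round(full_video, 3))   # 0.838
print(round(early_stage, 3))  # 0.754
```

Both values agree with the F1 scores stated in the abstract to three decimal places.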

(This article belongs to the Special Issue Artificial Intelligence-Driven Sensing)


28 pages, 2555 KB

Open Access Article

Deep Learning-Based Video Watermarking: A Robust Framework for Spatial–Temporal Embedding and Retrieval

by Antonio Cedillo-Hernandez, Lydia Velazquez-Garcia, Francisco Javier Garcia-Ugalde and Manuel Cedillo-Hernandez

Future Internet 2026, 18(2), 104; https://doi.org/10.3390/fi18020104 - 16 Feb 2026

Cited by 2 | Viewed by 1390

Abstract


This paper introduces a deep learning-based framework for video watermarking that achieves robust, imperceptible, and fast embedding under a wide range of visual and temporal conditions. The proposed method is organized into seven modules that collaboratively perform frame encoding, semantic region analysis, block selection, watermark transformation, and spatiotemporal injection, followed by decoding and multi-objective optimization. A key component of the framework is its ability to learn a visual importance map, which guides a saliency-based block selection strategy. This allows the model to embed the watermark in perceptually redundant regions while minimizing distortion. To enhance resilience, the watermark is distributed across multiple frames, leveraging temporal redundancy to improve recovery under frame loss, insertion, and reordering. Experimental evaluations conducted on a large-scale video dataset demonstrate that the proposed method achieves high fidelity while preserving low decoding error rates under compression, noise, and temporal distortions. The proposed method processes 38 video frames per second on a standard GPU. Additional ablation studies confirm the contribution of each module to the system’s robustness. This framework offers a promising solution for watermarking in streaming, surveillance, and content verification applications.

(This article belongs to the Section Big Data and Augmented Intelligence)


23 pages, 6932 KB

Open Access Article

RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems

by Jaro Meyer, Frédéric Giraud, Joschua Wüthrich, Marc Pollefeys, Philipp Fürnstahl and Lilian Calvet

Sensors 2026, 26(3), 1036; https://doi.org/10.3390/s26031036 - 5 Feb 2026

Cited by 1 | Viewed by 852

Abstract


Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional- and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities. This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications. Full article

(This article belongs to the Section Sensing and Imaging)


5 pages, 398 KB

Open Access Proceeding Paper

A Lightweight Deep Learning Framework for Robust Video Watermarking in Adversarial Environments

by Antonio Cedillo-Hernandez, Lydia Velazquez-Garcia and Manuel Cedillo-Hernandez

Eng. Proc. 2026, 123(1), 25; https://doi.org/10.3390/engproc2026123025 - 5 Feb 2026

Viewed by 640

Abstract


The widespread distribution of digital videos in social networks, streaming services, and surveillance systems has increased the risk of manipulation, unauthorized redistribution, and adversarial tampering. This paper presents a lightweight deep learning framework for robust and imperceptible video watermarking designed specifically for cybersecurity environments. Unlike heavy architectures that rely on multi-scale feature extractors or complex adversarial networks, our model introduces a compact encoder–decoder pipeline optimized for real-time watermark embedding and recovery under adversarial attacks. The proposed system leverages spatial attention and temporal redundancy to ensure robustness against distortions such as compression, additive noise, and adversarial perturbations generated via Fast Gradient Sign Method (FGSM) or recompression attacks from generative models. Experimental simulations using a reduced Kinetics-600 subset demonstrate promising results, achieving an average PSNR of 38.9 dB, SSIM of 0.967, and Bit Error Rate (BER) below 3% even under FGSM attacks. These results suggest that the proposed lightweight framework achieves a favorable trade-off between resilience, imperceptibility, and computational efficiency, making it suitable for deployment in video forensics, authentication, and secure content distribution systems. Full article

(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)


19 pages, 1787 KB

Open Access Article

Event-Based Machine Vision for Edge AI Computing

by Paul K. J. Park, Junseok Kim, Juhyun Ko and Yeoungjin Chang

Sensors 2026, 26(3), 935; https://doi.org/10.3390/s26030935 - 1 Feb 2026

Cited by 2 | Viewed by 1354

Abstract


Event-based sensors provide sparse, motion-centric measurements that can reduce data bandwidth and enable always-on perception on resource-constrained edge devices. This paper presents an event-based machine vision framework for smart-home AIoT that couples a Dynamic Vision Sensor (DVS) with compute-efficient algorithms for (i) human/object detection, (ii) 2D human pose estimation, and (iii) hand posture recognition for human–machine interfaces. The main methodological contributions are timestamp-based, polarity-agnostic recency encoding that preserves moving-edge structure while suppressing static background, and task-specific network optimizations (architectural reduction and mixed-bit quantization) tailored to sparse event images. With a fixed downstream network, the recency encoding improves action recognition accuracy over temporal accumulation (0.908 vs. 0.896). In a 24 h indoor monitoring experiment (640 × 480), the raw DVS stream is about 30× smaller than conventional CMOS video and remains about 5× smaller after standard compression. For human detection, the optimized event processing reduces computation from 5.8 G to 81 M FLOPs and runtime from 172 ms to 15 ms (more than 11× speed-up). For pose estimation, a pruned HRNet reduces model size from 127 MB to 19 MB and inference time from 70 ms to 6 ms on an NVIDIA Titan X while maintaining comparable accuracy (mAP from 0.95 to 0.94) on MS COCO 2017 using synthetic event streams generated by an event simulator. For hand posture recognition, a compact CNN achieves 99.19% recall and 0.0926% FAR with 14.31 ms latency on a single i5-4590 CPU core using 10-frame sequence voting. These results indicate that event-based sensing combined with lightweight inference is a practical approach to privacy-friendly, real-time perception under strict edge constraints.
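One way to picture a timestamp-based, polarity-agnostic recency encoding is to keep, for each pixel, the time of its most recent event regardless of polarity, and map that age to an intensity. The sketch below is an assumption-laden illustration, not the paper's formulation: the linear decay, the `window` parameter, the event tuple layout `(x, y, t, polarity)`, and the function name are all hypothetical.

```python
def recency_image(events, width, height, t_now, window):
    """Build an image in [0, 1] where recently active pixels are bright.

    events: iterable of (x, y, t, polarity); polarity is ignored.
    Pixels with no event, or whose last event is older than `window`,
    stay at 0.0, which suppresses static background.
    """
    last_seen = [[None] * width for _ in range(height)]
    for x, y, t, _polarity in events:
        if t <= t_now and (last_seen[y][x] is None or t > last_seen[y][x]):
            last_seen[y][x] = t

    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            t = last_seen[y][x]
            if t is not None:
                age = t_now - t
                # Linear decay is an illustrative choice, not the paper's.
                image[y][x] = max(0.0, 1.0 - age / window)
    return image


# Four toy events on a 2x2 sensor; timestamps in arbitrary units.
events = [(0, 0, 90, +1), (1, 0, 50, -1), (1, 0, 100, +1), (0, 1, 0, -1)]
img = recency_image(events, width=2, height=2, t_now=100, window=100)
for row in img:
    print(row)
```

In contrast to plain temporal accumulation, which counts events per pixel, this representation depends only on when a pixel last fired, so a moving edge leaves a fading trail.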

(This article belongs to the Special Issue Next-Generation Edge AI in Wearable Devices)


22 pages, 3601 KB

Open Access Article

On Exploiting Tile Partitioning to Reduce Bitrate and Processing Time in VVC Surveillance Streams with Object Detection

by Panagiotis Belememis, Maria Koziri and Thanasis Loukopoulos

Mathematics 2026, 14(2), 368; https://doi.org/10.3390/math14020368 - 22 Jan 2026

Viewed by 551

Abstract


One of the main targets in video surveillance systems is to detect and possibly identify objects within monitoring range. This entails analyzing the video stream by applying object detection techniques on one or more frames. Regardless of the output, the stream is usually archived for future use. Real-time requirements, network bandwidth, and storage constraints play a significant role in overall performance. As video resolution increases, so does the video stream size. To cope with such an increase, newer video compression standards offer sophisticated coding tools that aim at reducing video size with minimal quality loss. However, as the achievable compression ratio increases, so does the computational complexity. In this paper, we propose a methodology to reduce both the bitrate and the processing time of video surveillance streams on which object detection is performed. The method takes advantage of tile partitioning, with the aim of (i) reducing the scope and the invocation frequency of the object detection module, (ii) encoding different blocks of a frame at different quality levels, depending on whether objects exist or not, and (iii) encoding and transmitting only tiles containing objects. Experimental results using the UA-DETRAC dataset and the VVenC encoder demonstrate that exploiting tile partitioning in the proposed manner reduces bitrate and processing time at the expense of only tiny losses in accuracy.

(This article belongs to the Special Issue Advanced Optimization Modeling and Algorithms for Planning and Scheduling)


21 pages, 2372 KB

Open Access Article

IDG-ViolenceNet: A Video Violence Detection Model Integrating Identity-Aware Graphs and 3D-CNN

by Hong Huang and Qingping Jiang

Sensors 2025, 25(20), 6272; https://doi.org/10.3390/s25206272 - 10 Oct 2025

Cited by 5 | Viewed by 2703

Abstract


Video violence detection plays a crucial role in intelligent surveillance and public safety, yet existing methods still face challenges in modeling complex multi-person interactions. To address this, we propose IDG-ViolenceNet, a dual-stream video violence detection model that integrates identity-aware spatiotemporal graphs with three-dimensional convolutional neural networks (3D-CNN). Specifically, the model utilizes YOLOv11 for high-precision person detection and cross-frame identity tracking, constructing a dynamic spatiotemporal graph that encodes spatial proximity, temporal continuity, and individual identity information. On this basis, a GINEConv branch extracts structured interaction features, while an R3D-18 branch models local spatiotemporal patterns. The two representations are fused in a dedicated module for cross-modal feature integration. Experimental results show that IDG-ViolenceNet achieves accuracies of 97.5%, 99.5%, and 89.4% on the Hockey Fight, Movies Fight, and RWF-2000 datasets, respectively, significantly outperforming state-of-the-art methods. Additionally, ablation studies validate the contributions of key components in improving detection accuracy and robustness. Full article

(This article belongs to the Section Sensing and Imaging)


19 pages, 7875 KB

Open Access Article

SATSN: A Spatial-Adaptive Two-Stream Network for Automatic Detection of Giraffe Daily Behaviors

by Haiming Gan, Xiongwei Wu, Jianlu Chen, Jingling Wang, Yuxin Fang, Yuqing Xue, Tian Jiang, Huanzhen Chen, Peng Zhang, Guixin Dong and Yueju Xue

Animals 2025, 15(19), 2833; https://doi.org/10.3390/ani15192833 - 28 Sep 2025

Viewed by 1066

Abstract


The daily behavioral patterns of giraffes reflect their health status and well-being. Behaviors such as licking, walking, standing, and eating are not only essential components of giraffes’ routine activities but also serve as potential indicators of their mental and physiological conditions. This is particularly relevant in captive environments such as zoos, where certain repetitive behaviors may signal underlying well-being concerns. Therefore, developing an efficient and accurate automated behavior detection system is of great importance for scientific management and welfare improvement. This study proposes a multi-behavior automatic detection method for giraffes based on YOLO11-Pose and the spatial-adaptive two-stream network (SATSN). Firstly, YOLO11-Pose is employed to detect giraffes and estimate the keypoints of their mouths. Observation-Centric SORT (OC-SORT) is then used to track individual giraffes across frames, ensuring temporal identity consistency based on the keypoint positions estimated by YOLO11-Pose. In the SATSN, we propose a region-of-interest extraction strategy for licking behavior to extract local motion features and perform daily behavior classification. In this network, the original 3D ResNet backbone in the slow pathway is replaced with a video transformer encoder to enhance global spatiotemporal modeling, while a Temporal Attention (TA) module is embedded in the fast pathway to improve the representation of fast motion features. To validate the effectiveness of the proposed method, a giraffe behavior dataset consisting of 420 video clips (10 s per clip) was constructed, with 336 clips used for training and 84 for validation. Experimental results show that for the detection tasks of licking, walking, standing, and eating behaviors, the proposed method achieves a mean average precision (mAP) of 93.99%. This demonstrates the strong detection performance and generalization capability of the approach, providing robust support for automated multi-behavior detection and well-being assessment of giraffes. It also lays a technical foundation for building intelligent behavioral monitoring systems in zoos.

(This article belongs to the Special Issue Mathematical Modeling and Computer Vision in Animal Activity or Behavior: 2nd Edition)

19 pages, 1201 KB

Open Access Article

Design of a Low-Latency Video Encoder for Reconfigurable Hardware on an FPGA

by Pablo Perez-Tirador, Jose Javier Aranda, Manuel Alarcon Granero, Francisco J. J. Quintanilla, Gabriel Caffarena and Abraham Otero

Technologies 2025, 13(10), 433; https://doi.org/10.3390/technologies13100433 - 25 Sep 2025

Viewed by 3233

Abstract

The growing demand for real-time video streaming in power-constrained embedded systems, such as drone navigation and remote surveillance, requires encoding solutions that prioritize low latency. In these applications, even small delays in video transmission can impair the operator’s ability to react in time, leading to instability in closed-loop control systems. To mitigate this, encoding must be lightweight and designed so that streaming can start as soon as possible, ideally even while frames are still being processed, thereby ensuring continuous and responsive operation. This paper presents the design of a hardware implementation of the Logarithmic Hop Encoding (LHE) algorithm on a Field-Programmable Gate Array (FPGA). The proposed architecture is deeply pipelined and parallelized to achieve sub-frame latency. It employs adaptive compression by dividing frames into regions of interest and uses a quantized differential system to minimize data transmission. Our design achieves an encoding latency of between 1.87 ms and 2.1 ms with a power consumption of only 2.7 W when implemented on an FPGA clocked at 150 MHz. Compared to a parallel GPU implementation of the same algorithm, this represents a 6.6-fold reduction in latency at approximately half the power consumption. These results show that FPGA-based LHE is a highly effective solution for low-latency, real-time video applications and establish a robust foundation for its deployment in embedded systems.

23 pages, 3606 KB

Open Access Article

Dual-Stream Attention-Enhanced Memory Networks for Video Anomaly Detection

by Weishan Gao, Xiaoyin Wang, Ye Wang and Xiaochuan Jing

Sensors 2025, 25(17), 5496; https://doi.org/10.3390/s25175496 - 4 Sep 2025

Cited by 5 | Viewed by 2513

Abstract

Weakly supervised video anomaly detection (WSVAD) aims to identify unusual events using only video-level labels. However, current methods face several key challenges, including ineffective modelling of complex temporal dependencies, indistinct feature boundaries between visually similar normal and abnormal events, and high false alarm rates caused by an inability to distinguish salient events from complex background noise. This paper proposes a novel method that systematically enhances feature representation and discrimination to address these challenges. The proposed method first builds robust temporal representations by employing a hierarchical multi-scale temporal encoder and a position-aware global relation network to capture both local and long-range dependencies. The core of this method is the dual-stream attention-enhanced memory network, which achieves precise discrimination by learning distinct normal and abnormal patterns via dual memory banks, while utilising bidirectional spatial attention to mitigate background noise and focus on salient events before memory querying. The models underwent a comprehensive evaluation utilising solely RGB features on two demanding public datasets, UCF-Crime and XD-Violence. The experimental findings indicate that the proposed method attains state-of-the-art performance, achieving 87.43% AUC on UCF-Crime and 85.51% AP on XD-Violence. This result demonstrates that the proposed “attention-guided prototype matching” paradigm effectively resolves the aforementioned challenges, enabling robust and precise anomaly detection.

(This article belongs to the Section Sensing and Imaging)

Search Results (77)
