Search Results (270)

Search Parameters:
Keywords = real-time video stream

28 pages, 5699 KiB  
Article
Multi-Modal Excavator Activity Recognition Using Two-Stream CNN-LSTM with RGB and Point Cloud Inputs
by Hyuk Soo Cho, Kamran Latif, Abubakar Sharafat and Jongwon Seo
Appl. Sci. 2025, 15(15), 8505; https://doi.org/10.3390/app15158505 - 31 Jul 2025
Viewed by 41
Abstract
Recently, deep learning algorithms have been increasingly applied in construction for activity recognition, particularly for excavators, to automate processes and enhance safety and productivity through continuous monitoring of earthmoving activities. These deep learning algorithms analyze construction videos to classify excavator activities for earthmoving purposes. However, previous studies have solely focused on single-source external videos, which limits the activity recognition capabilities of the deep learning algorithm. This paper introduces a novel multi-modal deep learning-based methodology for recognizing excavator activities, utilizing multi-stream input data. It processes point clouds and RGB images using a two-stream convolutional neural network-long short-term memory (CNN-LSTM) method to extract spatiotemporal features, enabling the recognition of excavator activities. A comprehensive dataset comprising 495,000 video frames of synchronized RGB and point cloud data was collected across multiple construction sites under varying conditions. The dataset encompasses five key excavator activities: Approach, Digging, Dumping, Idle, and Leveling. To assess the effectiveness of the proposed method, the performance of the two-stream CNN-LSTM architecture is compared with that of single-stream CNN-LSTM models trained separately on the same RGB and point cloud datasets. The results demonstrate that the proposed multi-stream approach achieved an accuracy of 94.67%, outperforming existing state-of-the-art single-stream models, which achieved 90.67% accuracy for the RGB-based model and 92.00% for the point cloud-based model. These findings underscore the potential of the proposed activity recognition method, making it highly effective for automatic real-time monitoring of excavator activities, thereby laying the groundwork for future integration into digital twin systems for proactive maintenance and intelligent equipment management.
(This article belongs to the Special Issue AI-Based Machinery Health Monitoring)
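As a rough illustration of the two-stream fusion idea, here is a minimal PyTorch sketch assuming illustrative layer sizes, late fusion by concatenation, and the point cloud rendered as a single-channel range image; it is a sketch under those assumptions, not the authors' exact architecture.

```python
# Hedged sketch: a two-stream CNN-LSTM for activity recognition.
# Layer sizes, the range-image input, and the fusion head are assumptions.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Per-frame CNN features followed by an LSTM over the clip."""
    def __init__(self, in_channels, feat_dim=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, clip):                 # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)         # h: (num_layers, B, hidden)
        return h[-1]

class TwoStreamCNNLSTM(nn.Module):
    """Late fusion of an RGB stream and a point-cloud-projection stream."""
    def __init__(self, num_classes=5):       # Approach/Digging/Dumping/Idle/Leveling
        super().__init__()
        self.rgb = StreamEncoder(in_channels=3)
        self.pcd = StreamEncoder(in_channels=1)   # assumed range image from the point cloud
        self.head = nn.Linear(256 * 2, num_classes)

    def forward(self, rgb_clip, pcd_clip):
        fused = torch.cat([self.rgb(rgb_clip), self.pcd(pcd_clip)], dim=1)
        return self.head(fused)

logits = TwoStreamCNNLSTM()(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 16, 1, 64, 64))
```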

27 pages, 1128 KiB  
Article
Adaptive Multi-Hop P2P Video Communication: A Super Node-Based Architecture for Conversation-Aware Streaming
by Jiajing Chen and Satoshi Fujita
Information 2025, 16(8), 643; https://doi.org/10.3390/info16080643 - 28 Jul 2025
Viewed by 245
Abstract
This paper proposes a multi-hop peer-to-peer (P2P) video streaming architecture designed to support dynamic, conversation-aware communication. The primary contribution is a decentralized system built on WebRTC that eliminates reliance on a central media server by employing super node aggregation. In this architecture, video streams from multiple peer nodes are dynamically routed through a group of super nodes, enabling real-time reconfiguration of the network topology in response to conversational changes. To support this dynamic behavior, the system leverages WebRTC data channels for control signaling and overlay restructuring, allowing efficient dissemination of topology updates and coordination messages among peers. A key focus of this study is the rapid and efficient reallocation of network resources immediately following conversational events, ensuring that the streaming overlay remains aligned with ongoing interaction patterns. While the automatic detection of such events is beyond the scope of this work, we assume that external triggers are available to initiate topology updates. To validate the effectiveness of the proposed system, we construct a simulation environment using Docker containers and evaluate its streaming performance under dynamic network conditions. The results demonstrate the system’s applicability to adaptive, naturalistic communication scenarios. Finally, we discuss future directions, including the seamless integration of external trigger sources and enhanced support for flexible, context-sensitive interaction frameworks.
(This article belongs to the Special Issue Second Edition of Advances in Wireless Communications Systems)
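To make the super node aggregation idea concrete, below is a toy Python sketch of overlay reconfiguration, assuming a hypothetical least-loaded assignment rule and externally triggered conversation events (as the paper assumes); WebRTC media transport and data-channel signaling are omitted.

```python
# Hedged sketch of super-node-based overlay reconfiguration. The event model,
# node roles, and routing rule are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class Overlay:
    super_nodes: list
    routes: dict = field(default_factory=dict)   # speaker -> forwarding super node

    def assign(self, speaker: str) -> str:
        # Least-loaded super node aggregates and relays this speaker's stream.
        load = {s: sum(1 for v in self.routes.values() if v == s) for s in self.super_nodes}
        chosen = min(self.super_nodes, key=lambda s: load[s])
        self.routes[speaker] = chosen
        return chosen

    def on_conversation_event(self, new_speakers):
        # External trigger: rebuild routes so the overlay follows the
        # current conversation pattern.
        self.routes.clear()
        return {sp: self.assign(sp) for sp in new_speakers}

overlay = Overlay(super_nodes=["sn-1", "sn-2"])
print(overlay.on_conversation_event(["alice", "bob", "carol"]))
```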

27 pages, 705 KiB  
Article
A Novel Wavelet Transform and Deep Learning-Based Algorithm for Low-Latency Internet Traffic Classification
by Ramazan Enisoglu and Veselin Rakocevic
Algorithms 2025, 18(8), 457; https://doi.org/10.3390/a18080457 - 23 Jul 2025
Viewed by 306
Abstract
Accurate and real-time classification of low-latency Internet traffic is critical for applications such as video conferencing, online gaming, financial trading, and autonomous systems, where millisecond-level delays can degrade user experience. Existing methods for low-latency traffic classification, reliant on raw temporal features or static statistical analyses, fail to capture dynamic frequency patterns inherent to real-time applications. These limitations hinder accurate resource allocation in heterogeneous networks. This paper proposes a novel framework integrating wavelet transform (WT) and artificial neural networks (ANNs) to address this gap. Unlike prior works, we systematically apply WT to commonly used temporal features, such as throughput, slope, ratio, and moving averages, transforming them into frequency-domain representations. This approach reveals hidden multi-scale patterns in low-latency traffic, akin to structured noise in signal processing, which traditional time-domain analyses often overlook. These wavelet-enhanced features train a multilayer perceptron (MLP) ANN, enabling dual-domain (time–frequency) analysis. We evaluate our approach on a dataset comprising FTP, video streaming, and low-latency traffic, including mixed scenarios with up to four concurrent traffic types. Experiments demonstrate 99.56% accuracy in distinguishing low-latency traffic (e.g., video conferencing) from FTP and streaming, outperforming k-NN, CNNs, and LSTMs; in mixed-traffic scenarios, the model achieves 74.2–92.8% accuracy. Notably, the method eliminates reliance on deep packet inspection (DPI), offering Internet Service Providers a privacy-preserving and scalable way to prioritize time-sensitive flows. By bridging signal processing and deep learning, this work advances efficient bandwidth allocation and improves quality of service in heterogeneous network environments.
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
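A minimal sketch of the wavelet-feature pipeline in the spirit of the abstract, assuming PyWavelets with a db4 wavelet, per-sub-band energy and variance statistics, and synthetic data; the authors' exact features and network configuration may differ.

```python
# Hedged sketch: wavelet-domain features over a per-flow throughput series,
# classified with an MLP. Wavelet family, level, and statistics are assumptions.
import numpy as np
import pywt                                   # PyWavelets
from sklearn.neural_network import MLPClassifier

def wavelet_features(throughput, wavelet="db4", level=3):
    coeffs = pywt.wavedec(throughput, wavelet, level=level)
    # Energy and variance per sub-band expose multi-scale frequency patterns.
    return np.array([f(c) for c in coeffs for f in (lambda x: np.sum(x**2), np.var)])

rng = np.random.default_rng(0)
X = np.stack([wavelet_features(rng.normal(size=256)) for _ in range(200)])
y = rng.integers(0, 3, size=200)              # e.g., low-latency / streaming / FTP
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, y)
```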

40 pages, 1540 KiB  
Review
A Survey on Video Big Data Analytics: Architecture, Technologies, and Open Research Challenges
by Thi-Thu-Trang Do, Quyet-Thang Huynh, Kyungbaek Kim and Van-Quyet Nguyen
Appl. Sci. 2025, 15(14), 8089; https://doi.org/10.3390/app15148089 - 21 Jul 2025
Viewed by 505
Abstract
The exponential growth of video data across domains such as surveillance, transportation, and healthcare has raised critical challenges in scalability, real-time processing, and privacy preservation. While existing studies have addressed individual aspects of Video Big Data Analytics (VBDA), an integrated, up-to-date perspective remains limited. This paper presents a comprehensive survey of system architectures and enabling technologies in VBDA. It categorizes system architectures into four primary types: centralized, cloud-based, edge computing, and hybrid cloud–edge. It also analyzes key enabling technologies, including real-time streaming, scalable distributed processing, intelligent AI models, and advanced storage for managing large-scale multimodal video data. In addition, the study provides a functional taxonomy of core video processing tasks, including object detection, anomaly recognition, and semantic retrieval, and maps these tasks to real-world applications. Based on the survey findings, the paper proposes ViMindXAI, a hybrid AI-driven platform that combines edge and cloud orchestration, adaptive storage, and privacy-aware learning to support scalable and trustworthy video analytics. Our analysis highlights emerging trends such as the shift toward hybrid cloud–edge architectures, the growing importance of explainable AI and federated learning, and the urgent need for secure and efficient video data management. These trends point to key directions for designing next-generation VBDA platforms that enhance real-time, data-driven decision-making in domains such as public safety, transportation, and healthcare, facilitating timely insights, rapid response, and regulatory alignment through scalable and explainable analytics. This work provides a robust conceptual foundation for future research on adaptive and efficient decision-support systems in video-intensive environments.

39 pages, 2628 KiB  
Article
A Decentralized Multi-Venue Real-Time Video Broadcasting System Integrating Chain Topology and Intelligent Self-Healing Mechanisms
by Tianpei Guo, Ziwen Song, Haotian Xin and Guoyang Liu
Appl. Sci. 2025, 15(14), 8043; https://doi.org/10.3390/app15148043 - 19 Jul 2025
Viewed by 436
Abstract
The rapid growth in large-scale distributed video conferencing, remote education, and real-time broadcasting poses significant challenges to traditional centralized streaming systems, particularly regarding scalability, cost, and reliability under high concurrency. Centralized approaches often encounter bottlenecks, increased bandwidth expenses, and diminished fault tolerance. This paper proposes a novel decentralized real-time broadcasting system employing a peer-to-peer (P2P) chain topology based on IPv6 networking and the Secure Reliable Transport (SRT) protocol. By exploiting the global addressing capability of IPv6, our solution simplifies direct node interconnections, effectively eliminating complexities associated with Network Address Translation (NAT). Furthermore, we introduce an innovative chain-relay transmission method combined with distributed node management strategies, substantially reducing reliance on central servers and minimizing deployment complexity. Leveraging SRT’s low-latency UDP transmission, packet retransmission, congestion control, and AES-128/256 encryption, the proposed system ensures robust security and high video stream quality across wide-area networks. Additionally, a WebSocket-based real-time fault detection algorithm coupled with a rapid fallback self-healing mechanism is developed, enabling millisecond-level fault detection and swift restoration of disrupted links. Extensive evaluations using Video Multi-Resolution Fidelity (VMRF) metrics across geographically diverse and heterogeneous environments confirm significant performance gains. Specifically, our approach achieves substantial improvements in latency, video quality stability, and fault tolerance over existing P2P methods, along with over tenfold enhancements in frame rates compared with conventional RTMP-based solutions, thereby demonstrating its efficacy, scalability, and cost-effectiveness for real-time video streaming applications.
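The fault detection and self-healing idea can be pictured as a heartbeat monitor over the relay chain: when a relay misses heartbeats beyond a timeout, downstream nodes reattach to the nearest live upstream node. The timeout, chain model, and bridging rule below are illustrative assumptions, not the paper's exact WebSocket protocol.

```python
# Hedged sketch of heartbeat-based fault detection and chain self-healing.
import time

HEARTBEAT_TIMEOUT = 0.3   # seconds; "millisecond-level" detection would tune this down

class ChainMonitor:
    def __init__(self, chain):
        self.chain = chain                         # ordered relay node ids
        self.last_seen = {n: time.monotonic() for n in chain}

    def heartbeat(self, node):                     # called on each heartbeat message
        self.last_seen[node] = time.monotonic()

    def heal(self):
        """Drop dead relays; each survivor reattaches to its nearest live upstream."""
        now = time.monotonic()
        alive = [n for n in self.chain if now - self.last_seen[n] < HEARTBEAT_TIMEOUT]
        self.chain = alive
        return list(zip(alive, alive[1:]))         # new (upstream, downstream) links

mon = ChainMonitor(["src", "relay-1", "relay-2", "viewer"])
time.sleep(0.35)
for n in ("src", "relay-2", "viewer"):
    mon.heartbeat(n)                               # "relay-1" missed its heartbeat
print(mon.heal())                                  # [('src', 'relay-2'), ('relay-2', 'viewer')]
```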

21 pages, 4859 KiB  
Article
Improvement of SAM2 Algorithm Based on Kalman Filtering for Long-Term Video Object Segmentation
by Jun Yin, Fei Wu, Hao Su, Peng Huang and Yuetong Qixuan
Sensors 2025, 25(13), 4199; https://doi.org/10.3390/s25134199 - 5 Jul 2025
Viewed by 516
Abstract
The Segment Anything Model 2 (SAM2) has achieved state-of-the-art performance in pixel-level object segmentation for both static and dynamic visual content. Its streaming memory architecture maintains spatial context across video sequences, yet struggles with long-term tracking due to its static inference framework. SAM2’s fixed temporal window approach indiscriminately retains historical frames, failing to account for frame quality or dynamic motion patterns. This leads to error propagation and tracking instability in challenging scenarios involving fast-moving objects, partial occlusions, or crowded environments. To overcome these limitations, this paper proposes SAM2Plus, a zero-shot enhancement framework that integrates Kalman filter prediction, dynamic quality thresholds, and adaptive memory management. The Kalman filter models object motion using physical constraints to predict trajectories and dynamically refine segmentation states, mitigating positional drift during occlusions or velocity changes. Dynamic thresholds, combined with multi-criteria evaluation metrics (e.g., motion coherence, appearance consistency), prioritize high-quality frames while adaptively balancing confidence scores and temporal smoothness. This reduces ambiguities among similar objects in complex scenes. SAM2Plus further employs an optimized memory system that prunes outdated or low-confidence entries and retains temporally coherent context, ensuring constant computational resources even for infinitely long videos. Extensive experiments on two video object segmentation (VOS) benchmarks demonstrate SAM2Plus’s superiority over SAM2. It achieves an average improvement of 1.0 in J&F metrics across all 24 direct comparisons, with gains exceeding 2.3 points on SA-V and LVOS datasets for long-term tracking. The method delivers real-time performance and strong generalization without fine-tuning or additional parameters, effectively addressing occlusion recovery and viewpoint changes. By unifying motion-aware physics-based prediction with spatial segmentation, SAM2Plus bridges the gap between static and dynamic reasoning, offering a scalable solution for real-world applications such as autonomous driving and surveillance systems.
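The motion-prediction component can be illustrated with a constant-velocity Kalman filter over an object's center that carries the state through occluded frames; the noise settings below are illustrative, and SAM2Plus's quality thresholds and memory pruning are not reproduced.

```python
# Hedged sketch: constant-velocity Kalman filter over a tracked object's center.
import numpy as np

dt = 1.0                                    # one frame
F = np.array([[1, 0, dt, 0],                # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)         # we observe position only
Q, R = np.eye(4) * 1e-2, np.eye(2) * 1.0    # illustrative noise magnitudes

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    return x, (np.eye(4) - K @ H) @ P

x, P = np.zeros(4), np.eye(4)
for z in ([1.0, 0.5], [2.1, 1.0], None, [4.0, 2.1]):   # None = occluded frame
    x, P = predict(x, P)
    if z is not None:
        x, P = update(x, P, np.asarray(z))
print(x[:2])   # the predicted center carries through the occlusion
```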

25 pages, 5088 KiB  
Article
Improved Perceptual Quality of Traffic Signs and Lights for the Teleoperation of Autonomous Vehicle Remote Driving via Multi-Category Region of Interest Video Compression
by Itai Dror and Ofer Hadar
Entropy 2025, 27(7), 674; https://doi.org/10.3390/e27070674 - 24 Jun 2025
Viewed by 707
Abstract
Autonomous vehicles are a promising solution to traffic congestion, air pollution, accidents, wasted time, and resources. However, remote driver intervention may be necessary in extreme situations to ensure safe roadside parking or complete remote takeover. In these cases, high-quality real-time video streaming is crucial for remote driving. In a preliminary study, we presented a region of interest (ROI) High-Efficiency Video Coding (HEVC) method where the image was segmented into two categories: ROI and background. This involved allocating more bandwidth to the ROI, which yielded an improvement in the visibility of classes essential for driving while transmitting the background at a lower quality. However, migrating the bandwidth to the large ROI portion of the image did not substantially improve the quality of traffic signs and lights. This study proposes a method that categorizes ROIs into three tiers: background, weak ROI, and strong ROI. To evaluate this approach, we utilized a photo-realistic driving scenario database created with the Cognata self-driving car simulation platform. We used semantic segmentation to categorize the compression quality of a Coding Tree Unit (CTU) according to its pixel classes. A background CTU contains only sky, trees, vegetation, or building classes. Essentials for remote driving include classes such as pedestrians, road marks, and cars. Difficult-to-recognize classes, such as traffic signs (especially textual ones) and traffic lights, are categorized as a strong ROI. We applied thresholds to determine whether the number of pixels in a CTU of a particular category was sufficient to classify it as a strong or weak ROI and then allocated bandwidth accordingly. Our results demonstrate that this multi-category ROI compression method significantly enhances the perceptual quality of traffic signs (especially textual ones) and traffic lights by up to 5.5 dB compared to a simpler two-category (background/foreground) partition. This improvement in critical areas is achieved by reducing the fidelity of less critical background elements, while the visual quality of other essential driving-related classes (weak ROI) is at least maintained.
(This article belongs to the Special Issue Information Theory and Coding for Image/Video Processing)
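A minimal sketch of the three-tier CTU classification: count semantic classes inside each 64x64 CTU and map the tier to a QP offset (lower QP = more bits). The class ids, thresholds, and QP offsets are hypothetical stand-ins for the paper's bandwidth allocation.

```python
# Hedged sketch: per-CTU ROI tiering from a semantic segmentation map.
import numpy as np

CTU = 64
STRONG = {6, 7}            # assumed ids: traffic sign, traffic light
WEAK = {2, 3, 4}           # assumed ids: pedestrian, car, road mark
QP_OFFSET = {"strong": -8, "weak": -4, "background": +6}   # illustrative values

def classify_ctus(seg, strong_thr=32, weak_thr=256):
    h, w = seg.shape
    tiers = {}
    for y in range(0, h, CTU):
        for x in range(0, w, CTU):
            block = seg[y:y+CTU, x:x+CTU]
            if np.isin(block, list(STRONG)).sum() >= strong_thr:
                tiers[(y, x)] = "strong"       # a few sign/light pixels suffice
            elif np.isin(block, list(WEAK)).sum() >= weak_thr:
                tiers[(y, x)] = "weak"
            else:
                tiers[(y, x)] = "background"
    return {pos: QP_OFFSET[t] for pos, t in tiers.items()}

seg = np.zeros((128, 128), np.int32)
seg[10:20, 70:80] = 6                          # a small traffic sign
print(classify_ctus(seg)[(0, 64)])             # -> -8 (strong ROI)
```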

20 pages, 2223 KiB  
Article
ChatGPT-Based Model for Controlling Active Assistive Devices Using Non-Invasive EEG Signals
by Tais da Silva Mota, Saket Sarkar, Rakshith Poojary and Redwan Alqasemi
Electronics 2025, 14(12), 2481; https://doi.org/10.3390/electronics14122481 - 18 Jun 2025
Viewed by 563
Abstract
With an anticipated 3.6 million Americans who will be living with limb loss by 2050, the demand for active assistive devices is rapidly increasing. This study investigates the feasibility of leveraging a ChatGPT-based (Version 4o) model to predict motion based on input electroencephalogram (EEG) signals, enabling the non-invasive control of active assistive devices. To achieve this goal, three objectives were set. First, the model’s capability to derive accurate mathematical relationships from numerical datasets was validated to establish a foundational level of computational accuracy. Next, synchronized arm motion videos and EEG signals were introduced, which allowed the model to filter, normalize, and classify EEG data in relation to distinct text-based arm motions. Finally, the integration of marker-based motion capture data provided motion information, which is essential for inverse kinematics applications in robotic control. The combined findings highlight the potential of ChatGPT-generated machine learning systems to effectively correlate multimodal data streams and serve as a robust foundation for the intuitive, non-invasive control of assistive technologies using EEG signals. Future work will focus on applying the model to real-time control applications while expanding the dataset’s diversity to enhance the accuracy and performance of the model, with the ultimate aim of improving the independence and quality of life of individuals who rely on active assistive devices.
(This article belongs to the Special Issue Advances in Intelligent Control Systems)

27 pages, 1880 KiB  
Article
UAV-Enabled Video Streaming Architecture for Urban Air Mobility: A 6G-Based Approach Toward Low-Altitude 3D Transportation
by Liang-Chun Chen, Chenn-Jung Huang, Yu-Sen Cheng, Ken-Wen Hu and Mei-En Jian
Drones 2025, 9(6), 448; https://doi.org/10.3390/drones9060448 - 18 Jun 2025
Viewed by 659
Abstract
As urban populations expand and congestion intensifies, traditional ground transportation struggles to satisfy escalating mobility demands. Unmanned Electric Vertical Take-Off and Landing (eVTOL) aircraft, as a key enabler of Urban Air Mobility (UAM), leverage low-altitude airspace to alleviate ground traffic while offering environmentally sustainable solutions. However, supporting high-bandwidth, real-time video applications, such as Virtual Reality (VR), Augmented Reality (AR), and 360° streaming, remains a major challenge, particularly within bandwidth-constrained metropolitan regions. This study proposes a novel Unmanned Aerial Vehicle (UAV)-enabled video streaming architecture that integrates 6G wireless technologies with intelligent routing strategies across cooperative airborne nodes, including unmanned eVTOLs and High-Altitude Platform Systems (HAPS). By relaying video data from low-congestion ground base stations to high-demand urban zones via autonomous aerial relays, the proposed system enhances spectrum utilization and improves streaming stability. Simulation results validate the framework’s capability to support immersive media applications in next-generation autonomous air mobility systems, aligning with the vision of scalable, resilient 3D transportation infrastructure.
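One way to picture the relay routing is as a widest-path search that maximizes a route's bottleneck bandwidth from a lightly loaded base station to a demand zone through eVTOL/HAPS relays. The graph, capacities, and criterion below are illustrative assumptions, not the paper's routing algorithm.

```python
# Hedged sketch: widest-path (max-bottleneck-bandwidth) relay selection.
import heapq

def widest_path(graph, src, dst):
    """graph: {node: {neighbor: available_bandwidth}}; returns (bottleneck, path)."""
    best = {src: float("inf")}
    heap = [(-best[src], src, [src])]
    while heap:
        neg_bw, node, path = heapq.heappop(heap)
        if node == dst:
            return -neg_bw, path
        for nxt, cap in graph[node].items():
            bw = min(-neg_bw, cap)
            if bw > best.get(nxt, 0.0):
                best[nxt] = bw
                heapq.heappush(heap, (-bw, nxt, path + [nxt]))
    return 0.0, []

air = {                                      # hypothetical capacities (Mbps)
    "bs-low": {"evtol-1": 80, "haps": 120},
    "evtol-1": {"zone": 60},
    "haps": {"evtol-2": 100},
    "evtol-2": {"zone": 90},
    "zone": {},
}
print(widest_path(air, "bs-low", "zone"))    # (90, ['bs-low', 'haps', 'evtol-2', 'zone'])
```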

802 KiB  
Proceeding Paper
Video Surveillance and Artificial Intelligence for Urban Security in Smart Cities: A Review of a Selection of Empirical Studies from 2018 to 2024
by Abdellah Dardour, Essaid El Haji and Mohamed Achkari Begdouri
Comput. Sci. Math. Forum 2025, 10(1), 15; https://doi.org/10.3390/cmsf2025010015 - 16 Jun 2025
Viewed by 106
Abstract
The rapid growth of information and communication technologies, in particular big data, artificial intelligence (AI), and the Internet of Things (IoT), has made smart cities a tangible reality. In this context, real-time video surveillance plays a crucial role in improving public safety. This article presents a systematic review of studies focused on the detection of acts of aggression and crime in these cities. Based on a study of 100 indexed scientific articles dating from 2018 to 2024, we examine the most recent methods and techniques, with an emphasis on the use of machine learning and deep learning for the processing of real-time video streams. The works examined cover several technological axes, such as convolutional neural networks (CNNs), fog computing, and integrated IoT systems, while also addressing challenges in anomaly detection, which is frequently complicated by its contextual and uncertain nature. Finally, this article offers suggestions to guide future research, with the aim of improving the accuracy and efficiency of intelligent monitoring systems.

17 pages, 1587 KiB  
Article
Accelerating Visual Anomaly Detection in Smart Manufacturing with RDMA-Enabled Data Infrastructure
by Yifan Wang, Tiancheng Yuan, Yuting Yang, Miao He, Richard Wu and Kenneth P. Birman
Electronics 2025, 14(12), 2427; https://doi.org/10.3390/electronics14122427 - 13 Jun 2025
Viewed by 487
Abstract
Industrial Artificial Intelligence (IAI) services are increasingly integral to smart manufacturing, especially in quality assurance tasks like defect detection. This paper presents the design, implementation, and evaluation of a video-based visual anomaly detection (VAD) system that runs at inspection stations on a smart shop floor. Our system processes real-time video streams from multiple cameras mounted around a conveyor belt to detect surface-level defects in mechanical components. To meet stringent latency and accuracy requirements, an edge-cloud architecture powered by AI accelerators and InfiniBand networking is adopted. The IAI service features key frame extraction modules, fine-tuned lightweight VAD models, and optimization techniques such as batching and microservice-level parallelism. The design choices of AI modules are carefully evaluated to balance effectiveness and efficiency. As a result, system latency is reduced by 57%. In addition to the high-performance solution, a cost-efficient alternative that still completes the task within the required time frame is also suggested.
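The key frame extraction module can be sketched as inter-frame change gating, so the VAD model runs only on frames that differ enough from the last processed one; the grayscale difference metric and threshold below are assumptions, not the paper's exact criterion.

```python
# Hedged sketch: key frame extraction by mean absolute inter-frame change.
import numpy as np

def key_frames(frames, thr=8.0):
    """Yield (index, frame) whenever mean absolute gray-level change > thr."""
    prev = None
    for i, frame in enumerate(frames):
        gray = frame.mean(axis=2)              # cheap grayscale
        if prev is None or np.abs(gray - prev).mean() > thr:
            prev = gray
            yield i, frame

rng = np.random.default_rng(1)
static = rng.integers(0, 255, (48, 64, 3)).astype(np.float32)
stream = [static, static, static + 40, static + 40]   # scene change at frame 2
print([i for i, _ in key_frames(stream)])             # -> [0, 2]
```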

35 pages, 16759 KiB  
Article
A Commodity Recognition Model Under Multi-Size Lifting and Lowering Sampling
by Mengyuan Chen, Song Chen, Kai Xie, Bisheng Wu, Ziyu Qiu, Haofei Xu and Jianbiao He
Electronics 2025, 14(11), 2274; https://doi.org/10.3390/electronics14112274 - 2 Jun 2025
Viewed by 516
Abstract
Object detection algorithms have evolved from two-stage to single-stage architectures, with foundation models achieving sustained improvements in accuracy. However, in intelligent retail scenarios, small object detection and occlusion issues still lead to significant performance degradation. To address these challenges, this paper proposes an improved model based on YOLOv11, focusing on resolving insufficient multi-scale feature coupling and occlusion sensitivity. First, a multi-scale feature extraction network (MFENet) is designed. It splits input feature maps into dual branches along the channel dimension: the upper branch performs local detail extraction and global semantic enhancement through secondary partitioning, while the lower branch integrates CARAFE (content-aware reassembly of features) upsampling and SENet (squeeze-and-excitation network) channel weight matrices to achieve adaptive feature enhancement. The three feature streams are fused to output multi-scale feature maps, significantly improving small object detail retention. Second, a convolutional block attention module (CBAM) is introduced during feature fusion, dynamically focusing on critical regions through channel–spatial dual attention mechanisms. A fuseModule is designed to aggregate multi-level features, enhancing contextual modeling for occluded objects. Additionally, the extreme-IoU (XIoU) loss function replaces the traditional complete-IoU (CIoU), combined with XIoU-NMS (extreme-IoU non-maximum suppression) to suppress redundant detections, optimizing convergence speed and localization accuracy. Experiments demonstrate that the improved model achieves a mean average precision (mAP50) of 0.997 (0.2% improvement) and mAP50-95 of 0.895 (3.5% improvement) on the RPC product dataset and the 6th Product Recognition Challenge dataset. The recall rate increases to 0.996 (0.6% improvement over baseline). Although frames per second (FPS) decreased compared to the original model, the improved model still meets real-time requirements for retail scenarios. The model exhibits stable noise resistance in challenging environments and achieves 84% mAP in cross-dataset testing, validating its generalization capability and engineering applicability. Video streams were captured using a Zhongweiaoke camera operating at 60 fps, satisfying real-time detection requirements for intelligent retail applications.
(This article belongs to the Special Issue Emerging Technologies in Computational Intelligence)
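For reference, a compact PyTorch sketch of a CBAM block (channel then spatial attention) of the kind introduced during feature fusion; the reduction ratio and kernel size follow common CBAM defaults and may differ from the authors' configuration.

```python
# Hedged sketch: CBAM with a shared channel-attention MLP and a spatial conv.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared channel-attention MLP
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # channel attention
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True),         # spatial attention
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

y = CBAM(32)(torch.randn(2, 32, 20, 20))                # shape preserved: (2, 32, 20, 20)
```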

18 pages, 4439 KiB  
Article
Combining Infrared Thermography with Computer Vision Towards Automatic Detection and Localization of Air Leaks
by Ângela Semitela, João Silva, André F. Girão, Samuel Verdasca, Rita Futre, Nuno Lau, José P. Santos and António Completo
Sensors 2025, 25(11), 3272; https://doi.org/10.3390/s25113272 - 22 May 2025
Viewed by 634
Abstract
This paper proposes an automated system integrating infrared thermography (IRT) and computer vision for air leak detection and localization in end-of-line (EOL) testing stations. The system consists of (1) a leak tester for the detection and quantification of leaks, (2) an infrared camera for real-time thermal image acquisition, and (3) an algorithm for automatic leak localization. The Python-based algorithm acquires thermal frames from the camera’s streaming video, identifies potential leak regions by selecting a region of interest, mitigates environmental interferences via image processing, and pinpoints leaks by employing pixel intensity thresholding. A closed circuit with an embedded leak system simulated relevant leakage scenarios, varying leak apertures (ranging from 0.25 to 3 mm) and camera–leak system distances (0.2 and 1 m). Results confirmed that (1) the leak tester effectively detected and quantified leaks, with larger apertures generating higher leak rates; (2) the IRT performance was highly dependent on leak aperture and camera–leak system distance, confirming that shorter distances improve localization accuracy; and (3) the algorithm localized all leaks in both lab and industrial environments, regardless of the camera–leak system distance, mostly achieving accuracies higher than 0.7. Overall, the combined system demonstrated great potential for long-term implementation in EOL leakage stations in the manufacturing sector, offering an effective and economical alternative to manual inspections.
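A minimal sketch of the localization step described above: restrict to a region of interest, threshold pixel intensities, and take a blob centroid. The cold-blob assumption (escaping air cooling the surface), ROI, and threshold are illustrative; the paper's processing chain is richer.

```python
# Hedged sketch: leak localization by ROI thresholding and blob centroid.
import numpy as np

def locate_leak(thermal, roi, thr):
    """thermal: 2D temperature map; roi: (y0, y1, x0, x1); returns (y, x) or None."""
    y0, y1, x0, x1 = roi
    window = thermal[y0:y1, x0:x1]
    mask = window < thr                        # assumed: leak pixels are locally cold
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return y0 + ys.mean(), x0 + xs.mean()      # centroid in image coordinates

frame = np.full((120, 160), 24.0)              # ambient ~24 degC
frame[60:64, 80:84] = 18.0                     # simulated leak cooling
print(locate_leak(frame, roi=(40, 100, 60, 120), thr=20.0))   # ~ (61.5, 81.5)
```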

16 pages, 5532 KiB  
Article
Intelligent System Study for Asymmetric Positioning of Personnel, Transport, and Equipment Monitoring in Coal Mines
by Diana Novak, Yuriy Kozhubaev, Hengbo Kang, Haodong Cheng and Roman Ershov
Symmetry 2025, 17(5), 755; https://doi.org/10.3390/sym17050755 - 14 May 2025
Viewed by 443
Abstract
The paper presents a study of an intelligent system for personnel positioning, transport, and equipment monitoring in the mining industry using a convolutional neural network (CNN) and OpenPose technology. The proposed framework operates through a three-stage pipeline: OpenPose-based skeleton extraction from surveillance video streams, capturing 18 key body joints at 30 fps; multimodal feature fusion, combining skeletal key points and proximity sensor data to achieve environmental context awareness and obtain relevant feature values; and hierarchical pose alerting, using an attention-enhanced bidirectional LSTM (trained on 5000 annotated fall instances) for fall warning. The experiments conducted demonstrated that the combined use of the aforementioned technologies allows the system to determine the location and behavior of personnel, calculate the distance to hazardous areas in real time, and analyze personnel postures to identify possible risks such as falls or immobility. The system’s capacity to track the location of vehicles and equipment enhances operational efficiency, thereby mitigating the risk of accidents. Additionally, the system provides real-time alerts, identifying abnormal behavior, equipment malfunctions, and safety hazards, thus promoting enhanced mine management efficiency, improved safe working conditions, and a reduction in accidents.
(This article belongs to the Special Issue Symmetry and Asymmetry in Computer Vision and Graphics)
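The real-time distance-to-hazard check can be sketched from OpenPose keypoints as below; the keypoint index, zone geometry, and alert distance are hypothetical values for illustration, not the paper's calibration.

```python
# Hedged sketch: proximity alert from an OpenPose skeleton (18 joints).
import numpy as np

HIP = 8                                        # assumed index of a hip joint
ALERT_DIST = 50.0                              # pixels; would be calibrated to meters

def hazard_alert(keypoints, zone_center):
    """keypoints: (18, 2) array of image coordinates; returns (distance, alert?)."""
    ref = keypoints[HIP]                       # body reference point
    d = float(np.linalg.norm(ref - np.asarray(zone_center)))
    return d, d < ALERT_DIST

pose = np.zeros((18, 2))
pose[HIP] = [210.0, 300.0]
print(hazard_alert(pose, zone_center=(240.0, 310.0)))   # (~31.6, True)
```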

21 pages, 4777 KiB  
Article
Harnessing Semantic and Trajectory Analysis for Real-Time Pedestrian Panic Detection in Crowded Micro-Road Networks
by Rongyong Zhao, Lingchen Han, Yuxin Cai, Bingyu Wei, Arifur Rahman, Cuiling Li and Yunlong Ma
Appl. Sci. 2025, 15(10), 5394; https://doi.org/10.3390/app15105394 - 12 May 2025
Viewed by 399
Abstract
Pedestrian panic behavior is a primary cause of overcrowding and stampede accidents in public micro-road network areas with high pedestrian density. However, reliably detecting such behaviors remains challenging due to their inherent complexity, variability, and stochastic nature. Current detection models often rely on single-modality features, which limits their effectiveness in complex and dynamic crowd scenarios. To overcome these limitations, this study proposes a contour-driven multimodal framework that first employs a CNN (CDNet) to estimate density maps and, by analyzing steep contour gradients, automatically delineates a candidate panic zone. Within these potential panic zones, pedestrian trajectories are analyzed through LSTM networks to capture irregular movements, such as counterflow and nonlinear wandering behaviors. Concurrently, semantic recognition based on Transformer models is utilized to identify verbal distress cues extracted through Baidu AI’s real-time speech-to-text conversion. The three embeddings are fused through a lightweight attention-enhanced MLP, enabling end-to-end inference at 40 FPS on a single GPU. To evaluate branch robustness under streaming conditions, the UCF Crowd dataset (150 videos without panic labels) is processed frame-by-frame at 25 FPS solely for density assessment, whereas full panic detection is validated on 30 real Itaewon-Stampede videos and 160 SUMO/Unity simulated emergencies that include explicit panic annotations. The proposed system achieves 91.7% accuracy and 88.2% F1 on the Itaewon set, outperforming all single- or dual-modality baselines and offering a deployable solution for proactive crowd safety monitoring in transport hubs, festivals, and other high-risk venues.
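A minimal PyTorch sketch of an attention-enhanced MLP fusing the three modality embeddings (density, trajectory, semantic); the embedding sizes and the per-modality attention form are assumptions, not the paper's exact fusion layer.

```python
# Hedged sketch: attention-weighted fusion of three modality embeddings + MLP head.
import torch
import torch.nn as nn

class AttnFusionMLP(nn.Module):
    def __init__(self, dim=64, num_classes=2):        # panic / no-panic
        super().__init__()
        self.score = nn.Linear(dim, 1)                # per-modality attention score
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, density, traj, sem):            # each: (B, dim)
        stack = torch.stack([density, traj, sem], dim=1)        # (B, 3, dim)
        w = torch.softmax(self.score(stack), dim=1)             # (B, 3, 1)
        return self.mlp((w * stack).sum(dim=1))                 # attention-weighted sum

e = lambda: torch.randn(4, 64)
logits = AttnFusionMLP()(e(), e(), e())               # (4, 2)
```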
