Search Results (1,376)

Search Parameters:
Keywords = image and video processing

24 pages, 897 KB  
Article
Neural Encoding Strategies for Neuromorphic Computing
by Michael Liu, Honghao Zheng and Yang Yi
Electronics 2026, 15(6), 1221; https://doi.org/10.3390/electronics15061221 - 14 Mar 2026
Abstract
Neuromorphic computing seeks to mimic the structure and function of biological neural systems to enable energy-efficient, adaptive information processing. A critical component of this paradigm is neural encoding—the translation of analog or digital input data into spike-based representations suitable for spiking neural networks (SNNs). This paper provides a comprehensive overview of the major neural encoding schemes used in neuromorphic systems, including rate and temporal encoding, as well as latency, interspike interval, phase, and multiplexed encoding. The purpose of this paper is to explore the use of these encoding techniques for deep learning applications. We discuss the underlying principles of spike encoding approaches, their biological inspiration, computational efficiency, power consumption, integrated circuit design and implementation, and suitability for various neuromorphic applications. We also present our research on a hardware-and-software co-design platform for different encoding schemes and demonstrate their performance. By comparing their strengths, limitations, and implementation challenges, we aim to provide insights that will guide the development of more efficient and application-specific neuromorphic systems. We also performed an encoder performance analysis via Python 3.12 simulations to compare classification accuracies across these spike encoders on three popular image and video datasets, analyzing the performance of neural encoders working with both deep neural networks (DNNs) and SNNs. Our performance data are largely consistent with the benchmark data on image classification from other papers, while only limited performance data on the University of Central Florida 101 (UCF-101) video dataset were found in comparable studies on spike encoders.
Based on our encoder performance data, the Interspike Interval (ISI) encoder performs well across all three datasets, preserving continuous, detailed spike timing and richer temporal information for standard classification tasks. For image classification, multiplexing encoders outperform other spike encoders because they simplify timing patterns by enforcing phase locking, improving stability and robustness to noise. Within the SNN testbenches, the ISI-Phase encoder achieved the highest accuracy on the Modified National Institute of Standards and Technology (MNIST) dataset, surpassing the Time-To-First-Spike (TTFS) encoder by 1.9%. On the Canadian Institute For Advanced Research (CIFAR-10) dataset, the ISI encoder achieved the highest accuracy, 22.7% above the TTFS encoder. The ISI encoder also performed best on the UCF-101 dataset, outperforming the TTFS encoder by 12.7%. Full article
(This article belongs to the Section Artificial Intelligence)
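The rate and time-to-first-spike (latency) schemes surveyed in this abstract can be sketched for a single normalized pixel intensity as follows. This is a generic textbook form of the two encoders, not the paper's circuit implementations; the function names, the 20-step time window, and the Bernoulli rate model are illustrative assumptions.

```python
import numpy as np

def rate_encode(intensity, n_steps=20, rng=None):
    """Rate coding: each time step spikes with probability equal to the
    intensity in [0, 1], so brighter pixels produce more spikes."""
    rng = rng or np.random.default_rng(0)
    return (rng.random(n_steps) < intensity).astype(int)

def ttfs_encode(intensity, n_steps=20):
    """Time-to-first-spike (latency) coding: one spike whose timing carries
    the value -- brighter inputs fire earlier in the window."""
    train = np.zeros(n_steps, dtype=int)
    if intensity > 0:
        t = int(round((1.0 - intensity) * (n_steps - 1)))
        train[t] = 1
    return train

bright = ttfs_encode(0.9)  # single early spike
dim = ttfs_encode(0.2)     # single late spike
```

Under this sketch, rate coding is robust but spike-hungry, while TTFS uses at most one spike per input, which is the efficiency trade-off the paper's comparison explores.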
21 pages, 7166 KB  
Article
Geometric Reliability of AI-Enhanced Super-Resolution in Video-Based 3D Spatial Modeling
by Marwa Mohammed Bori, Zahraa Ezzulddin Hussein, Zainab N. Jasim and Bashar Alsadik
ISPRS Int. J. Geo-Inf. 2026, 15(3), 125; https://doi.org/10.3390/ijgi15030125 - 13 Mar 2026
Abstract
Video-based photogrammetric reconstruction is increasingly used when high-resolution still images are unavailable. However, limited spatial resolution, compression artifacts, and motion blur often reduce geometric accuracy. Recent advances in learning-based image super-resolution (SR) offer a promising preprocessing method, but their geometric reliability within photogrammetric workflows remains poorly understood. This study provides a controlled quantitative evaluation of learning-based super-resolution for video-based 3D reconstruction. Low-resolution video frames are enhanced using two representative methods: an open-source real-world SR model (Real-ESRGAN ×4) and a commercial solution (Topaz Video AI ×4). All datasets are processed with the same Structure-from-Motion and Multi-View Stereo pipelines and tested against terrestrial laser scanning (TLS) reference data. Results show that super-resolution significantly increases reconstruction density and improves the recovery of fine-scale surface details, while also leading to greater local surface variability compared with reconstructions from the original video; photogrammetric stability remains consistent despite these changes. The findings highlight a fundamental trade-off between reconstruction completeness and local geometric accuracy and clarify when enhanced video imagery via super-resolution can be a reliable source for 3D reconstruction. These results are especially important for spatial data science workflows and AI-powered 3D modeling and digital twin applications. Full article
(This article belongs to the Special Issue Urban Digital Twins Empowered by AI and Dataspaces)
30 pages, 3812 KB  
Review
Video-Based 3D Reconstruction: A Review of Photogrammetry and Visual SLAM Approaches
by Ali Javadi Moghadam, Abbas Kiani, Reza Naeimaei, Shirin Malihi and Ioannis Brilakis
J. Imaging 2026, 12(3), 128; https://doi.org/10.3390/jimaging12030128 - 13 Mar 2026
Abstract
Three-dimensional (3D) reconstruction using images is one of the most significant topics in computer vision and photogrammetry, with wide-ranging applications in robotics, augmented reality, and mapping. This study investigates methods of 3D reconstruction using video (especially monocular video) data and focuses on techniques such as Structure from Motion (SfM), Multi-View Stereo (MVS), Visual Simultaneous Localization and Mapping (V-SLAM), and videogrammetry. Based on a statistical analysis of SCOPUS records, these methods collectively account for approximately 6863 journal publications up to the end of 2024. Among these, about 80 studies are analyzed in greater detail to identify trends and advancements in the field. The study also shows that the use of video data for real-time 3D reconstruction is commonly addressed through two main approaches: photogrammetry-based methods, which rely on precise geometric principles and offer high accuracy at the cost of greater computational demand; and V-SLAM methods, which emphasize real-time processing and provide higher speed. Furthermore, the application of IMU data and other indicators, such as color quality and keypoint detection, for selecting suitable frames for 3D reconstruction is investigated. Overall, this study compiles and categorizes video-based reconstruction methods, emphasizing the critical step of keyframe extraction. By summarizing and illustrating the general approaches, the study aims to clarify and facilitate the entry path for researchers interested in this area. Finally, the paper offers targeted recommendations for improving keyframe extraction methods to enhance the accuracy and efficiency of real-time video-based 3D reconstruction, while also outlining future research directions in addressing challenges like dynamic scenes, reducing computational costs, and integrating advanced learning-based techniques. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
22 pages, 1747 KB  
Review
Talking Head Generation Through Generative Models and Cross-Modal Synthesis Techniques
by Hira Nisar, Salman Masood, Zaki Malik and Adnan Abid
J. Imaging 2026, 12(3), 119; https://doi.org/10.3390/jimaging12030119 - 10 Mar 2026
Abstract
Talking Head Generation (THG) is a rapidly advancing field at the intersection of computer vision, deep learning, and speech synthesis, enabling the creation of animated human-like heads that can produce speech and express emotions with high visual realism. The core objective of THG systems is to synthesize coherent and natural audio–visual outputs by modeling the intricate relationship between speech signals, facial dynamics, and emotional cues. These systems find widespread applications in virtual assistants, interactive avatars, video dubbing for multilingual content, educational technologies, and immersive virtual and augmented reality environments. Moreover, the development of THG has significant implications for accessibility technologies, cultural preservation, and remote healthcare interfaces. This survey paper presents a comprehensive and systematic overview of the technological landscape of Talking Head Generation. We begin by outlining the foundational methodologies that underpin the synthesis process, including generative adversarial networks (GANs), motion-aware recurrent architectures, and attention-based models. A taxonomy is introduced to organize the diverse approaches based on the nature of input modalities and generation goals. We further examine the contributions of various domains such as computer vision, speech processing, and human–robot interaction, each of which plays a critical role in advancing the capabilities of THG systems. The paper also provides a detailed review of datasets used for training and evaluating THG models, highlighting their coverage, structure, and relevance. In parallel, we analyze widely adopted evaluation metrics, categorized by their focus on image quality, motion accuracy, synchronization, and semantic fidelity. Operating parameters such as latency, frame rate, resolution, and real-time capability are also discussed to assess deployment feasibility. 
Special emphasis is placed on the integration of generative artificial intelligence (GenAI), which has significantly enhanced the adaptability and realism of talking head systems through more powerful and generalizable learning frameworks. Full article
15 pages, 7693 KB  
Article
Effects of Overload Current on the Ignition and Burning Hazards of Polyethylene-Insulated Wires
by Heran Song, Qingwen Lin, Zhurong Dong, Songfeng Liang, Ruichao Wei, Zhanyu Li, Shenshi Huang, Yiting Yan and Yang Li
Polymers 2026, 18(5), 641; https://doi.org/10.3390/polym18050641 - 5 Mar 2026
Abstract
To quantitatively elucidate the effects of overload current on the ignition and burning hazards of polyethylene-insulated wires, 2.5 mm² polyethylene-insulated copper wires used commercially were tested in an electrical fire fault simulation system. Experiments were conducted to study the evolution of overloads, ignition, and burning. The entire process, from insulation smoking and ignition to sustained burning and final extinction driven by wire fusing, was recorded using synchronized digital and high-speed imaging. Video-based measurements were used to extract the following: smoking emission duration, ignition time, burning duration, maximum flame height, and segmented flame width. The results show that stable ignition and sustained burning occur when the overload current is greater than or equal to 180 A. As the current increases, ignition occurs earlier, while the smoking stage becomes shorter but exhibits nonmonotonic fluctuations. The burning duration shows a staged response. It first increases, then decreases toward a relatively stable level. This reflects the competition between enhanced Joule heating and accelerated wire melting and fusing. Maximum flame height and segmented flame width vary nonmonotonically with current, and the segmented flame width peaks at 200 A. A multi-indicator fire hazard evaluation framework was established and an entropy-weight TOPSIS method was applied to integrate the quantification and ranking. The overall fire hazard is greatest at 200 A. These findings provide experimental insight into overload-induced ignition and combustion behavior and contribute to a quantitative understanding of fire hazard evolution in overloaded electrical wires. Full article
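The entropy-weight TOPSIS ranking mentioned in this abstract is a standard multi-criteria method and can be sketched as follows. The decision-matrix values are made-up placeholders (not the paper's measurements), and all indicators are treated as benefit-type for simplicity.

```python
import numpy as np

def entropy_weight_topsis(X):
    """Rank alternatives (rows) over benefit-type indicators (columns).

    Entropy weighting: indicators with more dispersion across alternatives
    receive larger weights. TOPSIS: score each alternative by its relative
    closeness to the ideal and anti-ideal solutions.
    """
    X = np.asarray(X, dtype=float)
    m, _ = X.shape
    P = X / X.sum(axis=0)                    # column-wise probability distributions
    with np.errstate(divide="ignore", invalid="ignore"):
        E = -np.nansum(P * np.log(P), axis=0) / np.log(m)  # normalized entropy
    w = (1 - E) / (1 - E).sum()              # entropy weights
    V = w * X / np.linalg.norm(X, axis=0)    # weighted normalized matrix
    ideal, anti = V.max(axis=0), V.min(axis=0)
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)           # closeness: higher = more hazardous

# Hypothetical hazard indicators per current level (rows), e.g.
# [ignition speed, burn duration, flame height, flame width]:
scores = entropy_weight_topsis([
    [0.2, 0.3, 0.4, 0.3],
    [0.6, 0.8, 0.9, 1.0],   # dominant row, analogous to the 200 A case
    [0.5, 0.6, 0.7, 0.6],
])
```

A row that dominates every indicator receives closeness 1, which is how a single current level can emerge as the overall worst case.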
16 pages, 5863 KB  
Article
A Rapid Aerial Image Mosaic Method for Multiple Drones Based on Key Frames
by Xiuzhen Wu, Yahui Qi, Liang Qin, Shi Yan and Jianxiu Zhang
Automation 2026, 7(2), 43; https://doi.org/10.3390/automation7020043 - 5 Mar 2026
Abstract
Due to their advantages of being low-cost, lightweight, and flexible, and having wide shooting coverage, UAVs play an important role in situational awareness for disaster prevention and mitigation, urban planning and management, and related fields. In these applications, UAV aerial photography is limited by the field of view, so high-definition panoramic images of the complete target area cannot be obtained directly. Image mosaic technology is therefore essential, but a mosaic built from a single UAV cannot meet the high real-time requirements of situational awareness. To address these problems, this paper proposes a fast multi-UAV aerial image mosaic method based on key frames. First, the multi-UAV area-coverage flight strategy is determined according to the size of the task area and the UAV flight parameters; then, the field of view of the pod, the flight speed, and the flight altitude are used to determine the key-frame extraction time period during aerial photography. An image matching-rate calculation method is designed to extract key frames during this period, and the key frames are transmitted to the ground-based visual mosaic system, where an improved Laplacian pyramid method quickly fuses and stitches the key frames from each UAV into a panoramic map. Experiments show that the method can quickly obtain high-precision real-scene map information of the task area. Compared with the single-UAV method and the multi-UAV full-video-stream stitching method, this method greatly reduces computing-power consumption and communication-bandwidth requirements and improves the efficiency and real-time performance of panoramic map acquisition. Full article
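Laplacian pyramid fusion, the blending step named in this abstract, can be sketched in plain NumPy. This is the generic textbook blend, not the paper's improved variant; the average-pool downsampling, pixel-repeat upsampling, and three-level depth are simplifying assumptions standing in for Gaussian filtering.

```python
import numpy as np

def down(img):
    """2x downsample by average pooling (a stand-in for Gaussian blur + decimate)."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(img, shape):
    """2x upsample by pixel repetition, cropped to the target shape."""
    big = img.repeat(2, axis=0).repeat(2, axis=1)
    return big[:shape[0], :shape[1]]

def laplacian_blend(a, b, mask, levels=3):
    """Blend a and b: low frequencies mixed smoothly across the seam,
    high-frequency detail kept sharp on each side."""
    ga, gb, gm = a.astype(float), b.astype(float), mask.astype(float)
    pyramid = []
    for _ in range(levels):
        da, db, dm = down(ga), down(gb), down(gm)
        la = ga - up(da, ga.shape)    # Laplacian band of a at this level
        lb = gb - up(db, gb.shape)
        pyramid.append((la, lb, gm))
        ga, gb, gm = da, db, dm
    out = gm * ga + (1 - gm) * gb     # blend the coarsest (low-pass) level
    for la, lb, m in reversed(pyramid):
        out = up(out, la.shape) + m * la + (1 - m) * lb
    return out

a, b = np.ones((16, 16)), np.zeros((16, 16))
mask = np.zeros((16, 16))
mask[:, :8] = 1.0                     # take the left half from a
blended = laplacian_blend(a, b, mask)
```

Blending band-by-band is what hides the exposure seams between frames from different UAVs while keeping each frame's detail crisp.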
3 pages, 127 KB  
Editorial
Information Theory and Coding for Image and Video Processing
by Ofer Hadar
Entropy 2026, 28(3), 294; https://doi.org/10.3390/e28030294 - 5 Mar 2026
Abstract
Recent advances in image and video processing have been profoundly influenced by developments in information theory, source coding, and learning-based representations [...] Full article
(This article belongs to the Special Issue Information Theory and Coding for Image/Video Processing)
24 pages, 879 KB  
Review
A Survey of Diffusion Models: Methods and Applications
by HaoYu Ma and Hon-Cheng Wong
Appl. Sci. 2026, 16(5), 2482; https://doi.org/10.3390/app16052482 - 4 Mar 2026
Abstract
Diffusion models have emerged as the state-of-the-art generative paradigm, surpassing GANs in synthesizing high-fidelity images, videos, and audio. However, their reliance on iterative denoising processes imposes substantial computational burdens and memory overheads, creating a significant barrier to their deployment on resource-constrained edge devices. Unlike existing surveys that broadly cover general methodologies, this paper provides a focused review with a specific emphasis on efficient and lightweight diffusion models. We systematically analyze the trade-offs between generation quality and computational cost, categorizing acceleration techniques into sampling optimization, architectural compression, and knowledge distillation. Furthermore, we explore the integration of diffusion models with emerging architectures (e.g., Mamba) and their evolution towards general-purpose world simulators. This survey aims to provide a roadmap for “Green AI,” bridging the gap between high-end academic research and practical, real-world applications. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
32 pages, 4404 KB  
Article
Revisiting Text-Based Person Retrieval: Mitigating Annotation-Induced Mismatches with Multimodal Large Language Models
by Zihang Han, Chao Zhu and Mengyin Liu
Sensors 2026, 26(5), 1599; https://doi.org/10.3390/s26051599 - 4 Mar 2026
Abstract
Text-based person retrieval (TBPR) aims to search for target person images from large-scale video clips or image databases based on textual descriptions. The quality of benchmarks is critical to accurately evaluating TBPR models on their cross-modal matching ability. However, we find that existing TBPR benchmarks share a common annotation problem that often leads to ambiguities: multiple images of persons with different identities have very similar or even identical textual descriptions. As a consequence, although TBPR models correctly retrieve the images corresponding to a given description, such matches may be erroneously evaluated as mismatches due to this annotation problem. We argue that its main cause is that each person image is annotated individually without reference to other similar images, making it challenging to provide distinctive descriptions for each image. To address this problem, we propose an effective and efficient annotation refinement framework to improve the annotation quality of TBPR benchmarks and thereby mitigate annotation-induced mismatches. Firstly, sets of images prone to mismatches are automatically identified by TBPR models. Then, by leveraging multimodal large language models (MLLMs), multiple images are processed simultaneously and distinctive descriptions are generated for each image. Finally, the original descriptions are replaced to improve the annotation quality. Extensive experiments on three popular TBPR benchmarks (CUHK-PEDES, RSTPReid and ICFG-PEDES) validate the effectiveness of our proposed method for improving the quality of annotations, and demonstrate that the resulting more discriminative captions can truly benefit mainstream TBPR models. The improved annotations of these benchmarks will be released publicly. Full article
26 pages, 30049 KB  
Article
HVIFormer: A Dual-Stage Low-Light Image Enhancement Method Based on HVI Representation
by Yimei Li, Liuhong Luo and Hongjun Li
Appl. Sci. 2026, 16(5), 2450; https://doi.org/10.3390/app16052450 - 3 Mar 2026
Abstract
Low-light image enhancement improves the quality of video surveillance and image analysis and, as a result, has long been a hot topic in image processing. However, current research on this topic faces a difficult challenge—effectively suppressing noise while improving brightness and maintaining color consistency, especially in extremely dark scenes, where dark noise amplification, uneven exposure, and color shifts often interact, leading to detail loss and color distortion. To address the issue, we propose a dual-stage low-light enhancement framework based on the HVI (Horizontal/Vertical-Intensity) color space. The low-light image is first mapped to the HVI space, obtaining the intensity component I and the HVI-based feature map, with I being explicitly extracted as an intensity prior. A Transformer-based pre-recovery module is introduced for global dependency modeling, guided by the intensity prior I through an Intensity-Conditioned Block (ICB) for conditional feature interaction. Subsequently, a dual-branch enhancement network utilizes lightweight Complementary Cross-Attention (CCA) blocks for brightness refinement and color denoising. Finally, the enhanced image is remapped to the sRGB color space. The proposed framework decouples global brightness recovery and feature preprocessing from detail enhancement and color refinement, improving stability in extremely dark and high-noise scenarios. Through 18 quantitative and qualitative experiments, we demonstrate that our proposed method achieves superior performance in dark noise suppression and color restoration across multiple low-light datasets. Full article
24 pages, 8953 KB  
Article
Face Recognition System Using CLIP and FAISS for Scalable and Real-Time Identification
by Antonio Labinjan, Sandi Baressi Šegota, Ivan Lorencin and Nikola Tanković
Math. Comput. Appl. 2026, 31(2), 36; https://doi.org/10.3390/mca31020036 - 1 Mar 2026
Abstract
Face recognition is increasingly being adopted in industries such as education, security, and personalized services. This research introduces a face recognition system that leverages the embedding capabilities of the CLIP model. The model is trained on multimodal data, such as images and text, and generates high-dimensional features, which are then stored in a vector index for subsequent queries. The system is designed to facilitate accurate real-time identification, with potential applications in areas such as attendance tracking and security screening. Specific use cases include event check-ins, implementation of advanced security systems, and more. The process involves encoding known faces into high-dimensional vectors, indexing them using the FAISS vector index, and comparing them to unknown images based on L2 (Euclidean) distance. Experimental results demonstrate accuracy exceeding 90% and show efficient scalability and good performance even on datasets with a high volume of entries. Notably, the system exhibits superior computational efficiency compared to traditional deep convolutional neural networks (CNNs), significantly reducing CPU load and memory consumption while maintaining competitive inference speeds. In the first iteration of experiments, the system achieved over 90% accuracy on live video feeds where each identity had a single reference video for both training and validation; however, when tested on a more challenging dataset with many low-quality classes, accuracy dropped to approximately 73%, highlighting the impact of dataset quality and variability on performance. Full article
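The core lookup this pipeline performs — exact L2 nearest-neighbor search over stored embeddings, which FAISS's `IndexFlatL2` accelerates — can be sketched in plain NumPy. The 512-dimensional random vectors below are stand-ins for CLIP face embeddings, not outputs of the actual model.

```python
import numpy as np

def l2_search(gallery, query, k=1):
    """Exact k-nearest-neighbor search by squared L2 distance
    (the same quantity faiss.IndexFlatL2 computes, minus its optimizations)."""
    d2 = ((gallery - query) ** 2).sum(axis=1)      # squared distance to every entry
    idx = np.argsort(d2)[:k]                       # k closest gallery rows
    return idx, d2[idx]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))             # stand-ins for enrolled embeddings
query = gallery[42] + 0.01 * rng.normal(size=512)  # noisy probe of identity 42
idx, dist = l2_search(gallery, query, k=3)         # idx[0] recovers identity 42
```

In the real system the brute-force scan is replaced by the FAISS index, which gives the same exact result for a flat index but with far better throughput at scale.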
37 pages, 3787 KB  
Article
PDGV-DETR: Object Detection for Secure On-Site Weapon and Personnel Location Based on Dynamic Convolution and Cross-Scale Semantic Fusion
by Nianfeng Li, Peizeng Xin, Jia Tian, Xinlu Bai, Hongjie Ding, Zhiguo Xiao and Qian Liu
Sensors 2026, 26(5), 1542; https://doi.org/10.3390/s26051542 - 28 Feb 2026
Abstract
In public safety scenarios, the precise detection and positioning of prohibited weapons such as firearms and knives along with the involved personnel are the core pre-requisite technologies for violent risk warning and emergency response. However, in security surveillance scenarios, there are common problems such as object occlusion, difficulty in capturing small-sized weapons, and complex background interference, which lead to the shortcomings of existing general object detection models in the tasks of detecting and locating security-related objects, including poor adaptability, low detection accuracy, and insufficient robustness in complex scenarios. Therefore, this paper proposes a threat object detection framework for security scenarios (PDGV-DETR) based on adaptive dynamic convolution and cross-scale semantic fusion, specifically optimized for the detection and positioning tasks of weapons and personnel objects in static security surveillance images. This research focuses on category recognition at the object level and pixel-level spatial positioning, and does not involve the classification and identification of violent behaviors based on temporal information. There are clear technical boundaries and scene limitations between the two. This framework is optimized through three core modules: designing a dynamic hierarchical channel interaction convolution module to reduce computational complexity while enhancing the ability to detect occluded and incomplete objects; constructing an improved bidirectional hybrid feature pyramid network, combining the cross-scale fusion module to strengthen multi-scale feature expression, and adapting to the simultaneous detection requirements of small weapon objects and large personnel objects; and introducing a global semantic weaving and elastic feature alignment network to solve the problem of low discrimination between objects and complex backgrounds. 
Under the same experimental configuration, the proposed model is verified against current mainstream models on typical datasets. On a dataset of 2421 images of violent conflict scenes involving personnel, the peak average precision mAP50 of PDGV-DETR reached 85.9%. Through statistical verification, compared with the baseline model RT-DETR (mean ± standard deviation of 0.840 ± 0.007), PDGV-DETR reached 0.858 ± 0.004, a statistically significant improvement (p < 0.01). The model can accurately locate personnel object areas, and compared with Deformable DETR, the accuracy improvement reached 15.1%; on the weapon-specific dataset OD-WeaponDetection, the mAP for gun and knife detection reached 93.0%, improving by 2.2% over RT-DETR. In contrast to the performance fluctuations of other general object detection models in complex security scenarios, PDGV-DETR not only achieves better detection and positioning accuracy for security-related objects but also significantly improves generalization and stability. The results show that PDGV-DETR effectively balances positioning accuracy, detection accuracy, and computational efficiency, accurately completing end-to-end detection and positioning of weapon and personnel objects in static security surveillance images. It demonstrates highly competitive performance in detecting and locating security-related objects, providing core object-level pre-processing support for public-area monitoring, intelligent video monitoring, and early warning of violent risks, as well as basic data for subsequent violent behavior recognition based on temporal data. Full article
21 pages, 1469 KB  
Article
Development of Surveillance Robots Based on Face Recognition Using High-Order Statistical Features and Evidence Theory
by Slim Ben Chaabane, Rafika Harrabi, Anas Bushnag and Hassene Seddik
J. Imaging 2026, 12(3), 107; https://doi.org/10.3390/jimaging12030107 - 28 Feb 2026
Abstract
The recent advancements in technologies such as artificial intelligence (AI), computer vision (CV), and the Internet of Things (IoT) have significantly advanced various fields, particularly surveillance systems. These innovations enable real-time facial recognition processing, enhancing security and ensuring safety. Mobile robots are commonly employed in surveillance systems to handle risky tasks that are beyond human capability. In this paper, we present a prototype of a cost-effective mobile surveillance robot built on the Raspberry Pi 4, designed for integration into various industrial environments. This smart robot detects intruders using IoT and face recognition technology. The proposed system is equipped with a passive infrared (PIR) sensor and a camera for capturing live-streaming video and photos, which are sent to the control room through IoT technology. Additionally, the system uses face recognition algorithms to differentiate between company staff and potential intruders. The face recognition method combines high-order statistical features and evidence theory to improve facial recognition accuracy and robustness. High-order statistical features are used to capture complex patterns in facial images, enhancing discrimination between individuals. Evidence theory is employed to integrate multiple information sources, allowing for better decision-making under uncertainty. This approach effectively addresses challenges such as variations in lighting, facial expressions, and occlusions, resulting in a more reliable and accurate face recognition system. When the system detects an unfamiliar individual, it sends alert notifications and emails with the captured picture to the control room using IoT. A web interface has also been set up to control the robot remotely over Wi-Fi. The proposed face recognition method is evaluated, and a comparative analysis with existing techniques is conducted.
Experimental results with 400 test images of 40 individuals demonstrate that combining various attribute images improves face recognition performance, with the algorithm identifying faces at an accuracy of 98.63%. Full article
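The evidence-theory fusion described in this abstract presumably builds on Dempster's rule of combination, the standard way to merge mass functions from independent sources. The sketch below is illustrative only and is not the authors' implementation: the identity labels, mass values, and the two "sources" (a statistical-feature matcher and a second cue) are hypothetical.

```python
# Hypothetical sketch: Dempster's rule of combination for fusing two
# sources of evidence over a small frame of discernment (candidate IDs).
# Mass functions map frozensets of hypotheses to belief mass summing to 1.

def combine_dempster(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    # Normalize by (1 - K), redistributing the conflicting mass.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# Two illustrative sources each assign masses over staff identities A and B;
# the full frame {A, B} carries each source's residual uncertainty.
staff = frozenset({"A", "B"})
m_features = {frozenset({"A"}): 0.7, staff: 0.3}
m_context = {frozenset({"A"}): 0.6, frozenset({"B"}): 0.1, staff: 0.3}

fused = combine_dempster(m_features, m_context)
# Fusion sharpens the shared belief in "A" while keeping some uncertainty.
```

Note that the fused masses still sum to one, and agreement between the two sources concentrates belief on the common hypothesis, which is the behavior the abstract relies on for decision-making under uncertainty.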

17 pages, 1563 KB  
Article
Feasibility of Drone-Mounted Camera for Real-Time MA-rPPG in Smart Mirror Systems
by Mohammad Afif Kasno, Yong-Sik Choi and Jin-Woo Jung
Appl. Sci. 2026, 16(5), 2307; https://doi.org/10.3390/app16052307 - 27 Feb 2026
Viewed by 173
Abstract
Remote photoplethysmography (rPPG) enables contactless estimation of cardiovascular signals from video, but most existing studies assume a fixed, stationary camera. This study investigates the feasibility of performing real-time moving-average rPPG (MA-rPPG) using a drone-mounted camera, where platform motion, vibration, and viewing distance introduce additional challenges. Building on our previously validated real-time MA-rPPG smart mirror platform, we reuse the smart mirror interface as a unified frontend for visualization, synchronization, and logging while adapting the MA-rPPG pipeline to operate on live video streamed from an off-the-shelf DJI Tello micro-drone. Feasibility experiments were conducted with 10 participants under controlled indoor lighting and constrained flight conditions, where the drone maintained a stable hover in front of a standing subject and facial video was processed in real time to estimate heart rate from a forehead region of interest. To avoid cross-modality bias and clarify the effect of the aerial imaging platform, drone-derived MA-rPPG outputs were compared against a fixed desktop-camera MA-rPPG reference using the same trained model, enabling a controlled, like-for-like evaluation. The results indicate that continuous heart-rate estimation from a drone camera is feasible in our controlled hover-only setup, while agreement tended to vary with hover stability and effective facial resolution. This work is presented strictly as a feasibility-stage investigation and does not claim clinical validity. The findings provide an experimental baseline and operating-envelope insight for future motion-robust rPPG on mobile and aerial health-sensing platforms. Full article
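As a rough illustration of the moving-average idea behind MA-rPPG (the paper's trained model, drone pipeline, and ROI tracking are not reproduced here), the sketch below smooths a synthetic forehead-ROI intensity trace and counts peaks to estimate heart rate. The sampling rate, trace, and window size are all assumptions for the example.

```python
import math

def moving_average(x, w):
    """Simple trailing moving average; a stand-in for the MA smoothing step."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - w + 1)
        out.append(sum(x[lo:i + 1]) / (i + 1 - lo))
    return out

def estimate_bpm(signal, fps):
    """Estimate heart rate by counting local maxima of the smoothed trace."""
    smooth = moving_average(signal, w=5)
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i - 1] < smooth[i] >= smooth[i + 1]]
    duration_s = len(signal) / fps
    return 60.0 * len(peaks) / duration_s

# Synthetic 10 s "forehead ROI" trace: a clean 72 bpm pulse sampled at 30 fps.
fps, bpm = 30, 72
trace = [math.sin(2 * math.pi * (bpm / 60) * t / fps) for t in range(fps * 10)]
print(round(estimate_bpm(trace, fps)))  # → 72
```

On a real drone feed the trace would be far noisier, which is why the study's agreement varied with hover stability and effective facial resolution; peak counting on an unsmoothed trace would overestimate the rate.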

31 pages, 4772 KB  
Article
Benchmark Operational Condition Multimodal Dataset Construction for the Municipal Solid Waste Incineration Process
by Yapeng Hua, Jian Tang and Hao Tian
Sustainability 2026, 18(5), 2282; https://doi.org/10.3390/su18052282 - 27 Feb 2026
Viewed by 161
Abstract
Municipal solid waste incineration (MSWI) is a typical complex industrial process for achieving sustainable development of the global environment. It implements a “perception–prediction–control” mode based on domain experts using multimodal information. To harness the complementary value of different modal data, prevent information conflicts or fusion failures caused by misalignment, and ensure the availability of multimodal datasets and the reliability of analytical conclusions, constructing a benchmark operational condition multimodal dataset is essential. The objective of this work was to create a multimodal reference database for the operational status of MSWI processes. Based on a description of the MSWI process and an analysis of the characteristics of the multimodal data, the process data are first preprocessed under different missing-data scenarios through missing-value handling and outlier handling. Then, single-frame images of the flame video are captured on a minute scale, and the combustion lines are quantized using machine vision technology. Finally, the combustion line quantization (CLQ) values are aligned with the minute time scale of the process data through a multimodal time synchronization module. Taking an MSWI power plant in Beijing as the research object, combustion flame video and process data under benchmark operating conditions were collected. A hybrid missing-value management strategy combining linear interpolation with the LRDT model improved data integrity, and a spatiotemporally aligned multimodal dataset was constructed. The standardized benchmark operating condition multimodal data support combustion state analysis during the incineration process, pollutant generation prediction, and process optimization.
These efforts support the municipal solid waste objectives of “reduction, harmlessness, and resource utilization”, help address land-resource shortages, protect the ecological environment, and promote the dual-carbon goal, while providing data and technical support for environmental and urban sustainable development. Full article
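A minimal sketch of two of the data-handling steps this abstract names: linear interpolation of missing process values, and minute-scale alignment of process rows with CLQ values. The LRDT model for longer gaps is not reproduced, and all series, timestamps, and values below are illustrative.

```python
def interpolate_missing(values):
    """Linearly fill interior None gaps in a process-data series.
    (A stand-in for the hybrid strategy's interpolation branch; the LRDT
    model used for harder missing-data scenarios is not sketched here.)"""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if 0 < i and j < len(out):  # interior gap with both neighbors
                lo, hi = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = lo + (k - i + 1) / (j - i + 1) * (hi - lo)
            i = j
        else:
            i += 1
    return out

def align_by_minute(process_rows, clq_rows):
    """Join process data and combustion-line (CLQ) values on minute keys,
    keeping only minutes present in both modalities."""
    clq = dict(clq_rows)
    return [(t, v, clq[t]) for t, v in process_rows if t in clq]

series = [10.0, None, 14.0, None, 18.0]
filled = interpolate_missing(series)  # → [10.0, 12.0, 14.0, 16.0, 18.0]

rows = [("12:00", 1.1), ("12:01", 1.3), ("12:02", 1.2)]
clq = [("12:00", 0.52), ("12:02", 0.49)]
aligned = align_by_minute(rows, clq)
```

Dropping minutes absent from either modality (as `align_by_minute` does) is one simple policy; the paper's synchronization module presumably handles richer cases such as clock drift between the video and process streams.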
(This article belongs to the Section Waste and Recycling)
