Machine Learning in Image/Video Processing and Sensing

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: 30 September 2026 | Viewed by 17890

Special Issue Editor

Prof. Dr. Yibo Fan
Guest Editor
College of Microelectronics, Fudan University, Shanghai 201203, China
Interests: image processing; video coding; machine learning; associated VLSI architecture

Special Issue Information

Dear Colleagues,

In recent years, machine learning methods have been increasingly applied to video and image processing tasks such as video compression, image denoising, super-resolution, and image generation. At the algorithmic level, a growing number of video and image processing algorithms are based on machine learning, achieving better video and image quality as well as improved video compression rates. At the same time, these algorithms face challenges in computational complexity and real-time processing capability. At the hardware level, machine learning methods are gradually being applied to processor design, for example in AI-ISP and AI-Codec processors. These designs, in turn, face the challenge of integrating traditional hardware modules with machine learning acceleration modules.

This Special Issue is focused on machine learning in image/video processing and sensing technologies, addressing (but not limited to) the following topics:

  • Machine learning in image and video compression;
  • Machine learning in image processing and enhancement;
  • Hardware design of accelerators for machine learning;
  • Hardware design of AI-Codec;
  • Hardware design of AI-ISP.

Prof. Dr. Yibo Fan
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image/video processing
  • video compression
  • machine learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (10 papers)

Research

23 pages, 10822 KB  
Article
Off-Road Autonomous Vehicle Semantic Segmentation and Spatial Overlay Video Assembly
by Itai Dror, Omer Aviv and Ofer Hadar
Sensors 2026, 26(6), 1944; https://doi.org/10.3390/s26061944 - 19 Mar 2026
Viewed by 781
Abstract
Autonomous systems are expanding rapidly, driving a demand for robust perception technologies capable of navigating challenging, unstructured environments. While urban autonomy has made significant progress, off-road environments pose unique challenges, including dynamic terrain and limited communication infrastructure. This research addresses these challenges by introducing a novel three-part solution for off-road autonomous vehicles. First, we present a large-scale off-road dataset curated to capture the visual complexity and variability of unstructured environments, providing a realistic training ground that supports improved model generalization. Second, we propose a Confusion-Aware Loss (CAL) that dynamically penalizes systematic misclassifications based on class-level confusion statistics. When combined with cross-entropy, CAL improves segmentation mean Intersection over Union (mIoU) on the off-road test set from 68.66% to 70.06% and achieves cross-domain gains of up to ~0.49% mIoU on the Cityscapes dataset. Third, leveraging semantic segmentation as an intermediate representation, we introduce a spatial overlay video encoding scheme that preserves high-fidelity RGB information in semantically critical regions while compressing non-essential background regions. Experimental results demonstrate Peak Signal-to-Noise Ratio (PSNR) improvements of up to +5 dB and Video Multi-Method Assessment Fusion (VMAF) gains of up to +40 points under lossy compression, enabling efficient and reliable off-road autonomous operation. This integrated approach provides a robust framework for real-time remote operation in bandwidth-constrained environments. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)
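
The abstract above reports segmentation quality as mean Intersection over Union (mIoU). As a reading aid, here is a minimal sketch of how mIoU is commonly computed from a class-level confusion matrix. The function name `mean_iou` and the toy matrix are illustrative assumptions, not taken from the article, and the sketch does not reproduce the authors' Confusion-Aware Loss.

```python
from typing import List


def mean_iou(confusion: List[List[int]]) -> float:
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j (an assumed convention).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                               # correctly labelled pixels of class c
        fn = sum(confusion[c]) - tp                        # class-c pixels predicted as something else
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:                                      # skip classes absent from both truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (hypothetical numbers).
toy_confusion = [[8, 2],
                 [1, 9]]
toy_miou = mean_iou(toy_confusion)
```

For the toy matrix, the per-class IoU values are 8/11 and 9/12, and mIoU is their average.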

29 pages, 14346 KB  
Article
LRCFuse: Infrared and Visible Image Fusion Based on Low-Rank Representation and Convolutional Sparse Learning
by Jingjing Liu, Yujie Zhu, Yuhao Zhang, Aiying Guo, Mengjiao Li and Jianhua Zhang
Sensors 2026, 26(6), 1771; https://doi.org/10.3390/s26061771 - 11 Mar 2026
Viewed by 505
Abstract
With the development of cross-modal image fusion in multi-sensor systems, current fusion technologies have made significant progress in feature extraction, facilitating more effective image analysis. However, insufficient fusion information may degrade the correlation between the source and fused images, often resulting in the omission of critical features from the original modalities. Therefore, in order to preserve as much information as possible, especially for the complete extraction of effective feature information in source images, this paper proposes a new cross-modal image fusion method, named LRCFuse, based on low-rank representation and convolutional sparse learning. Firstly, learned low-rank representation (LLRR) blocks are employed to perform dimensionality reduction on the source images while simultaneously extracting their low-rank and sparse feature components. Nevertheless, considering that the low-rank representation has insufficient modeling ability for different modal images, we introduce common feature preservation module (CFPM) blocks based on convolutional sparse coding. By leveraging the CFPM blocks, LRCFuse recovers common features from both source images to mitigate the loss caused by the imperfect assumptions of low-rank representation. Based on this, a multi-level optimization strategy incorporating pixel loss, shallow-level loss, mid-level loss, deep-level loss, and Sobel loss is proposed to hierarchically learn and refine diverse image features. Quantitative and qualitative evaluations are conducted across various datasets, revealing that LRCFuse can effectively detect salient infrared targets, preserve additional details in visible images, and achieve better fusion results for subsequent downstream tasks. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

26 pages, 3735 KB  
Article
On Demand Secure Scalable Video Streaming for Both Human and Machine Applications
by Alaa Zain, Yibo Fan and Jinjia Zhou
Sensors 2026, 26(4), 1285; https://doi.org/10.3390/s26041285 - 16 Feb 2026
Viewed by 658
Abstract
Scalable video coding plays an essential role in supporting heterogeneous devices, network conditions, and application requirements in modern video streaming systems. However, most existing scalable coding approaches primarily optimize human perceptual quality and provide limited support for data privacy, as well as for machine analyses and the integration of heterogeneous sensor data. This limitation motivated the development of adaptive scalable video coding frameworks. The proposed approach is designed to serve both human viewers and automated analysis systems while ensuring high security and compression efficiency. The method adaptively encrypts selected layers during transmission to protect sensitive content without degrading decoding or analysis performance. Experimental evaluations on benchmark datasets demonstrate that the proposed framework achieves superior rate-distortion efficiency and reconstruction quality, while also improving machine analysis accuracy compared to existing traditional and learning-based codecs. In video surveillance scenarios, where the base layer is preserved for analysis, the proposed scalable human machine coding (SHMC) method outperforms the scalable extension of H.265/High Efficiency Video Coding (HEVC), Scalable High Efficiency Video Coding (SHVC), reducing the average bits per pixel (bpp) by 26.38%, 30.76%, and 60.29% at equivalent mean Average Precision (mAP), Peak Signal-to-Noise Ratio (PSNR), and Multi-Scale Structural Similarity (MS-SSIM) levels, respectively. These results confirm the effectiveness of integrating scalable video coding with intelligent encryption for secure and efficient video transmission. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

14 pages, 47836 KB  
Article
Flow-Multi: A Flow-Matching Multi-Reward Framework for Text-to-Image Generation
by Jaegun Lee and Janghoon Choi
Sensors 2026, 26(4), 1120; https://doi.org/10.3390/s26041120 - 9 Feb 2026
Viewed by 1257
Abstract
Recent approaches in text-to-image (T2I) generation have actively adopted reinforcement learning (RL) techniques for human preference alignment. However, existing approaches primarily rely on a single reward function, which can lead to overfitting on specific metrics, resulting in issues such as reward hacking and imbalanced optimization among multiple objectives. To address this, we propose Flow-Multi: a flow-matching multi-reward framework for text-to-image generation. Our method builds upon flow-matching-based group-relative policy optimization (GRPO) learning. Each sample is evaluated by four reward models—based on text-to-image alignment, human preference, aesthetic quality, and GenEval—to create a multi-dimensional reward vector. We then utilize the Pareto dominance relationship to remove dominated samples and update the policy using only the non-dominated set. Additionally, we introduce advantage masking during training to suppress the contribution of low-reward samples, ensuring that only high-quality rewards are reflected in policy optimization. Experimental results demonstrate that Flow-Multi achieves balanced improvements across multiple reward criteria compared to the existing Flow-GRPO, validating the effectiveness of the multi-reward reinforcement learning framework for stable alignment in text-to-image generation. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)
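
The abstract above describes removing Pareto-dominated samples before the policy update. The following is a minimal sketch of that general filtering step, assuming higher reward is better on every axis. The function names and the toy reward vectors are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every reward and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(rewards: List[Sequence[float]]) -> List[int]:
    """Indices of samples that no other sample dominates (the Pareto front)."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]


# Hypothetical four-dimensional reward vectors, one per generated sample.
toy_rewards = [
    (0.9, 0.2, 0.5, 0.7),
    (0.8, 0.1, 0.4, 0.6),  # worse than the first sample on every axis
    (0.3, 0.9, 0.6, 0.5),
]
front = non_dominated(toy_rewards)
```

In this toy case the second sample is dominated by the first, so only the first and third samples would remain in the non-dominated set. The pairwise check is quadratic in the number of samples, which is usually acceptable for the small groups used in group-relative methods.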

20 pages, 5153 KB  
Article
A Practical Method for Red-Edge Band Reconstruction for Landsat Image by Synergizing Sentinel-2 Data with Machine Learning Regression Algorithms
by Yuan Zhang, Zhekui Fan, Wenjia Yan, Chentian Ge and Huasheng Sun
Sensors 2025, 25(11), 3570; https://doi.org/10.3390/s25113570 - 5 Jun 2025
Cited by 1 | Viewed by 2720
Abstract
Red-edge bands are the most essential spectral data for multispectral remote sensing images and play a critical role in monitoring vegetation growth status at regional and global scales. However, the absence of red-edge bands limits the applicability of Landsat images, the most widely used remote sensing data, to vegetation monitoring. This study proposes an innovative method to reconstruct Landsat’s red-edge bands. The consistency in corresponding bands of Landsat OLI and Sentinel-2 MSI was first investigated using different resampling approaches and atmospheric correction algorithms. Three machine learning algorithms (ridge regression, gradient boosted regression tree (GBRT), and random forest regression) were then employed to build the red-edge reconstruction model for different vegetation types. With the optimal model, three red-edge bands of Landsat OLI were subsequently obtained in alignment with their derived vegetation indices. Our results showed that bilinear interpolation resampling, in combination with the LaSRC atmospheric correction algorithm, achieved high consistency between the matching bands of OLI and MSI (R² > 0.88). With the GBRT algorithm, three simulated OLI red-edge bands were highly consistent with those of MSI, with an R² > 0.96 and an RMSE < 0.0122. The derived Landsat red-edge indices coincide with those of Sentinel-2, with an R² of 0.78 to 0.95 and an rRMSE of 3.37% to 21.64%. This study illustrates that the proposed red-edge reconstruction method can extend the spectral domain of Landsat OLI and enhance its applicability in global vegetation remote sensing. Meanwhile, it provides potential insight into historical Landsat TM/ETM+ data enhancement for improving time-series vegetation monitoring. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

20 pages, 1569 KB  
Article
IESSP: Information Extraction-Based Sparse Stripe Pruning Method for Deep Neural Networks
by Jingjing Liu, Lingjin Huang, Manlong Feng, Aiying Guo, Luqiao Yin and Jianhua Zhang
Sensors 2025, 25(7), 2261; https://doi.org/10.3390/s25072261 - 3 Apr 2025
Cited by 4 | Viewed by 1284
Abstract
Network pruning is a deep learning model compression technique aimed at reducing model storage requirements and decreasing computational resource consumption. However, mainstream pruning techniques often encounter challenges such as limited precision in feature selection and a diminished feature extraction capability. To address these issues, we propose an information extraction-based sparse stripe pruning (IESSP) method. This method introduces an information extraction module (IEM), which enhances stripe selection through a mask-based mechanism, promoting inter-layer interactions and directing the network’s focus toward key features. In addition, we design a novel loss function that links output loss to stripe selection, enabling an effective balance between accuracy and efficiency. This loss function also supports the adaptive optimization of stripe sparsity during training. Experimental results on benchmark datasets demonstrate that the proposed method outperforms existing techniques. Specifically, when applied to prune the VGG-16 model on the CIFAR-10 dataset, the proposed method achieves a 0.29% improvement in accuracy while reducing FLOPs by 75.88% compared to the baseline. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

17 pages, 19409 KB  
Article
Wavelet-Based Topological Loss for Low-Light Image Denoising
by Alexandra Malyugina, Nantheera Anantrasirichai and David Bull
Sensors 2025, 25(7), 2047; https://doi.org/10.3390/s25072047 - 25 Mar 2025
Cited by 3 | Viewed by 1954
Abstract
Despite significant advances in image denoising, most algorithms rely on supervised learning, with their performance largely dependent on the quality and diversity of training data. It is widely assumed that digital image distortions are caused by spatially invariant Additive White Gaussian Noise (AWGN). However, the analysis of real-world data suggests that this assumption is invalid. Therefore, this paper tackles image corruption by real noise, providing a framework to capture and utilise the underlying structural information of an image along with the spatial information conventionally used for deep learning tasks. We propose a novel denoising loss function that incorporates topological invariants and is informed by textural information extracted from the image wavelet domain. The effectiveness of this proposed method was evaluated by training state-of-the-art denoising models on the BVI-Lowlight dataset, which features a wide range of real noise distortions. Adding a topological term to common loss functions leads to a significant increase in the LPIPS (Learned Perceptual Image Patch Similarity) metric, with the improvement reaching up to 25%. The results indicate that the proposed loss function enables neural networks to learn noise characteristics better. We demonstrate that they can consequently extract the topological features of noise-free images, resulting in enhanced contrast and preserved textural information. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

20 pages, 2654 KB  
Article
DCAN: Dynamic Channel Attention Network for Multi-Scale Distortion Correction
by Jianhua Zhang, Saijie Peng, Jingjing Liu and Aiying Guo
Sensors 2025, 25(5), 1482; https://doi.org/10.3390/s25051482 - 28 Feb 2025
Cited by 2 | Viewed by 1971
Abstract
Image distortion correction is a fundamental yet challenging task in image restoration, especially in scenarios with complex distortions and fine details. Existing methods often rely on fixed-scale feature extraction, which struggles to capture multi-scale distortions. This limitation results in difficulties in achieving a balance between global structural consistency and local detail preservation on distorted images with varying levels of complexity, resulting in suboptimal restoration quality for highly complex distortions. To address these challenges, this paper proposes a dynamic channel attention network (DCAN) for multi-scale distortion correction. Firstly, DCAN employs a multi-scale design and utilizes the optical flow network for distortion feature extraction, effectively balancing global structural consistency and local detail preservation under varying levels of distortion. Secondly, we present the channel attention and fusion selective module (CAFSM), which dynamically recalibrates feature importance across multi-scale distortions. By embedding CAFSM into the upsampling stage, the network enhances its ability to refine local features while preserving global structural integrity. Moreover, to further improve detail preservation and structural consistency, a comprehensive loss function is designed, incorporating structural similarity loss (SSIM Loss) to balance local and global optimization. Experimental results on the widely used Places2 dataset demonstrate that DCAN achieves state-of-the-art performance, with an average improvement of 1.55 dB in PSNR and 0.06 in SSIM compared with existing methods. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)
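
Several abstracts in this list report gains in Peak Signal-to-Noise Ratio (PSNR) in dB. For reference, a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE), is given below. It assumes 8-bit pixel values flattened into plain lists, and the function name `psnr` and the sample values are illustrative only.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], distorted: Sequence[float], max_value: float = 255.0) -> float:
    """PSNR in dB between two equally sized, flattened images."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values: every pixel is off by exactly 1, so MSE = 1.
toy_psnr = psnr([100, 100, 50, 50], [101, 99, 51, 49])
```

With an MSE of 1 and a peak value of 255, the result is 20·log10(255), roughly 48.13 dB. Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the MSE.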

28 pages, 10234 KB  
Article
Estimating QoE from Encrypted Video Conferencing Traffic
by Michael Sidorov, Raz Birman, Ofer Hadar and Amit Dvir
Sensors 2025, 25(4), 1009; https://doi.org/10.3390/s25041009 - 8 Feb 2025
Cited by 1 | Viewed by 2404
Abstract
Traffic encryption is vital for internet security but complicates analytical applications like video delivery optimization or quality of experience (QoE) estimation, which often rely on clear text data. While many models address the problem of QoE prediction in video streaming, the video conferencing (VC) domain remains underexplored despite rising demand for these applications. Existing models often provide low-resolution predictions, categorizing QoE into broad classes such as “high” or “low”, rather than providing precise, continuous predictions. Moreover, most models focus on clear-text rather than encrypted traffic. This paper addresses these challenges by analyzing a large dataset of Zoom sessions and training five classical machine learning (ML) models and two custom deep neural networks (DNNs) to predict three QoE indicators: frames per second (FPS), resolution (R), and the naturalness image quality evaluator (NIQE). The models achieve mean error rates of 8.27%, 7.56%, and 2.08% for FPS, R, and NIQE, respectively, using a 10-fold cross-validation technique. This approach advances QoE assessment for encrypted traffic in VC applications. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

18 pages, 13728 KB  
Article
BG-YOLO: A Bidirectional-Guided Method for Underwater Object Detection
by Ruicheng Cao, Ruiteng Zhang, Xinyue Yan and Jian Zhang
Sensors 2024, 24(22), 7411; https://doi.org/10.3390/s24227411 - 20 Nov 2024
Cited by 14 | Viewed by 3192
Abstract
Degraded underwater images decrease the accuracy of underwater object detection. Existing research uses image enhancement methods to improve the visual quality of images, which may not be beneficial for underwater object detection and may even lead to serious degradation in detector performance. To alleviate this problem, we propose a bidirectional guided method for underwater object detection, referred to as BG-YOLO. In the proposed method, a network is organized by constructing an image enhancement branch and an object detection branch in a parallel manner. The image enhancement branch consists of a cascade of an image enhancement subnet and an object detection subnet. The object detection branch consists only of a detection subnet. A feature-guided module connects the shallow convolution layers of the two branches. When training the image enhancement branch, the object detection subnet in the enhancement branch guides the image enhancement subnet to be optimized in the direction that is most conducive to the detection task. The shallow feature map of the trained image enhancement branch is output to the feature-guided module, constraining the optimization of the object detection branch through consistency loss and prompting the object detection branch to learn more detailed information about the objects. This enhances the detection performance. During detection, only the object detection branch is retained, so that no additional computational cost is introduced. Extensive experiments demonstrate that the proposed method significantly improves the detection performance of the YOLOv5s object detection network (the mAP is increased by up to 2.9%) and maintains the same inference speed as YOLOv5s (132 fps). Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)