1. Introduction to Image Processing Based on Convolutional Neural Networks
In today’s digital age, images serve as vital information carriers that are widely deployed across a diverse range of fields, including healthcare, security, transportation, and entertainment. Image processing technologies aim to extract meaningful insights from raw image data through manipulation, facilitating comprehension and analysis in order to meet the demands of differing application scenarios. Traditional image processing methods, such as techniques based on filtering [1,2], edge detection [3,4], and frequency features [5,6], have gradually shown their limitations when confronted with complex and dynamic image scenarios and with the increasing demand for high precision. Moreover, these conventional approaches often depend heavily on manually designed feature extractors, whose performance hinges on domain expertise and which exhibit limited generalization capabilities.
The advent of Convolutional Neural Networks (CNNs) has revolutionized the field of image processing [7,8,9]. Firstly, CNNs possess formidable feature learning capabilities, enabling them to autonomously extract deep-level features from raw image data without requiring the manual design of feature extraction algorithms. This significantly enhances the accuracy and efficiency of feature extraction. Secondly, CNNs demonstrate excellent generalization capabilities. By training on large-scale image datasets, the model learns universal image features, thereby achieving good performance when processing new, unseen images. Furthermore, CNNs employ an end-to-end learning approach, optimizing the entire process from image input to final output as a unified whole. This avoids the cumulative error issues inherent in traditional methods, where errors accumulate across sequential processing steps.
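To make the end-to-end principle concrete, the following minimal PyTorch sketch (purely illustrative; the architecture and hyperparameters are our own assumptions, not drawn from any cited work) maps raw pixels directly to class scores, with every layer optimised jointly through a single loss:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutional feature extractor learned from data, with no hand-crafted features.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = TinyCNN()
x = torch.randn(4, 3, 64, 64)                        # a dummy batch of RGB images
loss = nn.CrossEntropyLoss()(model(x), torch.randint(0, 10, (4,)))
loss.backward()                                       # gradients flow end to end through all layers
```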
CNNs can autonomously learn task-driven feature representations from large datasets, significantly improving performance in core tasks such as image classification [10,11,12], object detection [13,14,15], and semantic segmentation [16,17,18]. They have also substantially advanced the automation and intelligence of image processing technologies. Through continuous innovation in network architectures and the integration of techniques like attention mechanisms, generative adversarial networks, and Transformers, CNNs have demonstrated breakthrough results across numerous domains including image super-resolution [19,20,21], image denoising [22,23,24], image generation [25,26,27], medical image analysis [28,29,30], and autonomous driving visual perception [31,32].
However, CNN-based image processing techniques also face certain challenges and issues. For instance, CNN models typically demand substantial computational resources and storage capacity, limiting their deployment on resource-constrained devices. The training process requires extensive labelled datasets, yet acquiring high-quality annotated data is often costly and time-consuming. Furthermore, CNN models exhibit poor interpretability, making it difficult to understand their internal decision-making processes. This poses a pressing concern in applications with stringent security requirements.
Currently, research in CNN-based image processing is advancing further in areas such as lightweight architectures [33,34,35], interpretability [36,37], cross-modal fusion [38,39,40], few-shot learning [41,42,43], and robustness to complex real-world scenarios (such as occlusion, low-light conditions, and adverse weather) [44,45]. Its continuous evolution not only pushes the boundaries of computer vision technology but also provides a robust technical foundation for industrial applications.
2. Overview of This Special Issue
This Special Issue, dedicated to “Image Processing Based on Convolutional Neural Networks,” has garnered significant attention and interest from the research community. Throughout the double-blind review process, submissions were assessed with great rigor, considering thematic relevance, innovation, research depth, and practical implications. Following this thorough and meticulous evaluation, thirteen outstanding papers were selected for publication in this issue. The following offers a concise introduction to these contributions, with the aim of encouraging readers to engage with the full articles and explore their details in depth.
The research paper by Wang et al. is the first contribution to this Special Issue. It proposes a bidirectional convolutional neural network based on a spatial-channel attention mechanism, designed to address the increasingly complex challenge of accurately identifying and localising fake faces (such as DeepFakes and FaceSwaps) arising from advances in deep learning and artificial intelligence technologies. The network integrates strengths from the DenseNet and Xception architectures to enhance feature extraction, and it highlights tampered regions by applying weighted attention across both the spatial and channel dimensions of the feature maps. Utilising attention maps as supervisory signals further strengthens the model’s focus on tampering features. Experimental validation was conducted across three widely used public face forgery datasets. The results demonstrate that the proposed method achieves significant improvements over existing approaches in both classification accuracy and tampering-region localisation: on the FaceForensics++ dataset, it achieves an AUC significantly superior to that of the baseline models. The paper indicates that future work could focus on enhancing detection performance for highly compressed and low-resolution images.
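As a rough illustration of spatial-channel attention, a generic CBAM-style block (an assumption for exposition, not the authors’ exact design) re-weights a feature map first along its channels and then along its spatial positions, with the spatial map available as a supervisable attention signal:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Re-weights a feature map along channel and spatial dimensions (generic CBAM-style sketch)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention derived from global average pooling.
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention from per-pixel mean and max over channels.
        sa_in = torch.cat([x.mean(dim=1, keepdim=True),
                           x.max(dim=1, keepdim=True).values], dim=1)
        sa = torch.sigmoid(self.spatial_conv(sa_in))   # this map could also be supervised directly
        return x * sa

feat = torch.randn(2, 64, 32, 32)
out = ChannelSpatialAttention(64)(feat)                # same shape, attention-weighted features
```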
The second contribution by Sun et al. addresses the issues of dynamic blur and insufficient brightness in smart cabinets, arising from consumers’ rapid pickup actions and low-light environments. They propose MIMO-IMF, a deblurring method that integrates frequency-domain attention and intensity enhancement. Built upon the MIMO-UNet architecture, the proposed method incorporates three core modules: the Low-Light Luminance Information Extraction Module (IFEM) enhances feature capture in dimly lit conditions; the Frequency-Domain Adaptive Fusion Module (FDAFM) reinforces high-frequency detail recovery through Fourier transforms and frequency-domain attention; and the Multi-Residual Block (MRB) further optimises detail reconstruction. Experiments conducted on the public GoPro dataset and the self-built MBSI dataset demonstrate that MIMO-IMF outperforms existing mainstream methods in both PSNR and SSIM metrics. It performs particularly well in low-light scenarios and detail recovery, offering a promising solution for enhancing the performance of smart retail cabinets in night-time or low-light environments.
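The frequency-domain attention idea can be sketched as follows: transform a feature map with a Fourier transform, learn weights over its magnitude spectrum, and transform back. The module below is a minimal illustration under our own assumptions and does not reproduce the actual FDAFM structure:

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Re-weights the magnitude spectrum of a feature map so that informative
    (typically high-frequency, detail-carrying) bands are emphasised."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution acting on the magnitude spectrum to produce per-bin weights.
        self.weight = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x, norm="ortho")            # complex spectrum of the features
        mag, phase = spec.abs(), spec.angle()
        mag = mag * torch.sigmoid(self.weight(mag))        # learned frequency-domain attention
        spec = torch.polar(mag, phase)                     # recombine magnitude and phase
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

x = torch.randn(2, 32, 64, 64)
y = FrequencyAttention(32)(x)                              # same size, frequency-reweighted features
```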
The third contribution to the Special Issue is the paper by Hu et al. It proposes an enhanced YOLOv7-based method for detecting faults in electrical insulators, addressing challenges in aerial photography caused by complex backgrounds, variable perspectives, and shifting lighting conditions. The approach incorporates a Contextual Transformer Network (CoTNet) into the YOLOv7 backbone; by integrating static contextual information capture with dynamic self-attention, it enhances the model’s perception of global context within images. Concurrently, an EMA attention mechanism is integrated into the output channels, leveraging multi-scale parallel sub-networks to establish long- and short-range feature dependencies and thereby improve multi-scale object localisation accuracy. Experiments conducted on real aerial datasets with geometrically augmented data demonstrate that the proposed method significantly improves overall detection performance and robustness while maintaining high accuracy. The paper acknowledges limitations in extreme-weather detection and false positives for visually similar objects (e.g., birds), and it suggests future research directions including multimodal data fusion, temporal information utilisation, and the development of hierarchical validation frameworks.
The fourth contribution by Ding et al. investigates DBD-Net, a dual-branch decoder network for visible-light ship segmentation, addressing the degradation in segmentation accuracy caused by variable lighting, background interference, and ship size variations in complex maritime surveillance scenarios. Specifically, DBD-Net captures multi-scale features through the multi-scale cascaded residual modules (encoder blocks) of its encoder to accommodate target size variations. The decoding stage employs two parallel branches: one integrates spatial and channel attention mechanisms with multi-scale convolutions to suppress background interference, while the other enhances key features to improve the precision and robustness of the feature representation. Comprehensive validation on the public MariBoatsSubclass and SeaShipsSeg datasets demonstrates that DBD-Net achieves leading performance across metrics including Dice, Recall, MCC, and Jaccard. Notably, on MariBoatsSubclass it attains a Dice score that significantly outperforms baseline models such as U-Net.
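A simplified sketch of dual-branch decoding, pairing an attention-gated branch with a residual enhancement branch before fusion and upsampling, is given below; the concrete layer choices are illustrative assumptions rather than DBD-Net’s actual design:

```python
import torch
import torch.nn as nn

class DualBranchDecoderStage(nn.Module):
    """One decoding stage with two parallel branches: an attention branch that suppresses
    background clutter and an enhancement branch that sharpens key features."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),        # per-pixel gating weights
        )
        self.enhance_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gated = x * self.attn_branch(x)                            # branch 1: attention-gated features
        enhanced = self.enhance_branch(x) + x                      # branch 2: residual enhancement
        return self.up(self.fuse(torch.cat([gated, enhanced], dim=1)))

out = DualBranchDecoderStage(64)(torch.randn(1, 64, 32, 32))       # -> (1, 64, 64, 64)
```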
The paper by Niu et al. is the fifth contribution to this Special Issue. It addresses computational redundancy and excessive parameters in existing crowd counting methods for UAV RGB-T images by introducing a lightweight multimodal convolutional neural network, PII-GCNet. The approach reduces redundant feature computations by designing a Partial Information Interaction Convolution (PIIConv) module that retains only key channels for cross-modal interaction, reducing the computational load to one-eighth of that of conventional convolutions. Concurrently, a Global Collaborative Fusion (GCFusion) module is introduced, employing spatial attention to extract modality-specific features and adaptively fuse them, thereby enhancing multimodal representation capabilities. Experimental results demonstrate that PII-GCNet achieves the best counting accuracy and the fastest inference speed on the DroneRGBT and RGBT-CC datasets, while significantly reducing computational overhead and energy consumption. The paper notes that the method’s localisation accuracy and anomaly handling capabilities require further optimisation; future work will focus on enhancing multi-scale feature augmentation and improving the modality alignment mechanism.
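The core saving can be illustrated by a partial-channel convolution that mixes the two modalities only on a small slice of “key” channels and passes the remaining channels through unchanged. The sketch below assumes a 1/8 ratio to mirror the reported cost reduction and is not the authors’ PIIConv implementation:

```python
import torch
import torch.nn as nn

class PartialInteractionConv(nn.Module):
    """Convolves only the first 1/ratio of the channels (used here to mix the two modalities)
    and passes the remaining channels through untouched, cutting the convolution cost
    roughly by the ratio. Illustrative interpretation only."""
    def __init__(self, channels: int, ratio: int = 8):
        super().__init__()
        self.active = channels // ratio
        self.conv = nn.Conv2d(self.active, self.active, kernel_size=3, padding=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        # Cross-modal interaction restricted to the 'key' channel slice.
        mixed = self.conv(rgb[:, :self.active] + thermal[:, :self.active])
        return torch.cat([mixed, rgb[:, self.active:]], dim=1)

rgb = torch.randn(1, 64, 40, 40)
thermal = torch.randn(1, 64, 40, 40)
fused = PartialInteractionConv(64, ratio=8)(rgb, thermal)          # (1, 64, 40, 40)
```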
The sixth contribution by Dai et al. presents CECL-Net, a contrastive learning and edge-reconstruction-driven complementary learning network designed to address the tendency of existing image forgery localisation methods to over-rely on foreground features while neglecting background semantic information. Specifically, the approach employs a complementary learning strategy that integrates foreground and background features. It utilises an edge extractor (EE) to generate precise edge artefacts and an edge-guided feature reconstruction (EGFR) module to reconstruct complementary features, thereby enhancing the model’s holistic understanding of tampered images. Experiments conducted across five benchmark datasets, including artificial forgeries and real-world challenge datasets, demonstrate that CECL-Net outperforms seven state-of-the-art models in F1 score, IoU, and AUC, performing particularly well on complicated forgeries and edge details. The paper points out that the method still has limitations when confronted with highly covert tampering, interference from semantically similar objects, and extremely small tampering targets; future work should further optimise the model’s sensitivity to subtle tampering traces and its robustness against interference.
The seventh contribution to this Special Issue is a research article by Zhuang et al. It proposes PGD-Trap, a proactive defence mechanism against deepfakes that addresses the critical issue of existing adversarial signals being easily weakened by image preprocessing techniques such as Gaussian blurring. The approach combines intensity-adaptive PGD-Dark, which generates the initial adversarial signals, with frequency-domain inverse filtering, which produces trap signals; consequently, when subjected to linear filtering attacks, these signals are amplified rather than weakened. Concurrently, the paper introduces ILVR-A, an adversarial signal embedding method based on the denoising diffusion probabilistic model (DDPM) and iterative latent variable refinement (ILVR), which permits reconstruction of the original image within a reasonable visual range and enhances the flexibility of signal embedding. Experimental validation demonstrates that PGD-Trap maintains its success rate under Gaussian blur attacks, significantly outperforming traditional baseline methods. The study indicates that future research will focus on optimising the diffusion models, enhancing robustness to multiple pre-processing steps, and improving computational efficiency.
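For readers unfamiliar with the underlying attack, a standard projected gradient descent (PGD) perturbation loop is sketched below; this is the generic algorithm only, and neither PGD-Dark’s intensity adaptation nor the trap-signal construction is shown:

```python
import torch

def pgd_perturb(model, x, target, steps: int = 10, eps: float = 8 / 255, alpha: float = 2 / 255):
    """Standard PGD: repeatedly step along the sign of the gradient of the loss and
    project the perturbation back into an L-infinity ball of radius eps around x."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(model(x + delta), target)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # gradient ascent on the loss
            delta.clamp_(-eps, eps)                  # projection into the eps-ball
            delta.grad.zero_()
    return (x + delta).detach()

# Usage (illustrative): adv = pgd_perturb(torchvision.models.resnet18(num_classes=2), imgs, labels)
```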
The eighth contribution, by Rathee et al., proposes a hybrid machine learning solution integrating kernel operations with customised transfer learning strategies. The approach aims to enhance the efficiency of safety inspection in transport infrastructure by automatically detecting, via computer vision, whether the metal dowels connecting concrete segments are in unsafe positions. The methodology utilises video data captured by high-speed cameras mounted on barrier-transfer vehicles. Multi-stage preprocessing extracts spatio-temporal regions of interest (ROIs), followed by spatio-temporal feature analysis and real-time monitoring via an enhanced 3D convolutional ResNet-50 architecture. A semi-automated synthetic data generation method is also proposed, effectively mitigating the scarcity of real-world anomaly data through background cloning. Experimental validation employing three-fold hierarchical cross-validation demonstrates that the proposed method maintains high detection robustness under complex lighting and weather conditions. This research offers a promising solution for the safety inspection challenges associated with the movable concrete barriers (MCBs) on the Auckland Harbour Bridge.
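To illustrate how such spatio-temporal ROIs can be consumed by a 3D convolutional network, the snippet below feeds a short clip of cropped frames to torchvision’s 18-layer 3D ResNet as a stand-in for the paper’s enhanced 3D ResNet-50; the clip size and the binary labels are assumptions for illustration:

```python
import torch
from torchvision.models.video import r3d_18

# Stand-in for the paper's enhanced 3D ResNet-50: an 18-layer 3D ResNet that consumes a
# clip of cropped ROI frames shaped (batch, channels, time, height, width).
model = r3d_18(weights=None, num_classes=2)      # two illustrative classes: dowel safe / unsafe
clip = torch.randn(1, 3, 16, 112, 112)           # 16 ROI frames from the high-speed camera
logits = model(clip)                             # per-clip class scores
```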
Liu et al.’s study, the ninth contribution to this Special Issue, proposes a fine-grained few-shot image classification method based on dual-feature reconstruction. It addresses the limited classification performance arising from substantial intra-class variation and minimal inter-class distinction in fine-grained images. The approach incorporates a Mixed Residual Attention Block (MRA Block), combining channel attention with a window-based self-attention mechanism to enhance the capture of local detail. It further designs a Dual-Reconstruction Feature Fusion (DRFF) module, which improves the model’s adaptability to both inter-class and intra-class variations through bidirectional feature reconstruction and cross-scale feature fusion. The approach was validated on five-way one-shot and five-way five-shot tasks across the CUB-200-2011, Stanford Cars, and Stanford Dogs datasets, where its peak accuracies significantly outperformed multiple state-of-the-art methods. The paper concludes that the method is still limited by its reliance on pre-trained embedding networks and its relatively high model complexity; future research will explore lighter-weight designs and cross-domain applicability.
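Feature-reconstruction-based few-shot classifiers typically score a query by how well its features can be rebuilt from a class’s support features. The sketch below shows a generic ridge-regression reconstruction in one direction, offered as an illustration of this family of methods rather than of the DRFF module itself:

```python
import torch

def reconstruct(support: torch.Tensor, query: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Ridge-regression reconstruction of query features from one class's support features.
    Shapes: support (n_s, d), query (n_q, d). Returns the reconstructed queries (n_q, d)."""
    # Solve W = argmin ||W S - Q||^2 + lam ||W||^2, then reconstruct Q_hat = W S.
    s, q = support, query
    gram = s @ s.t() + lam * torch.eye(s.size(0))
    w = q @ s.t() @ torch.linalg.inv(gram)
    return w @ s

support = torch.randn(5, 64)                       # five support embeddings of one class
query = torch.randn(10, 64)                        # ten query embeddings
q_hat = reconstruct(support, query)
score = -((query - q_hat) ** 2).sum(dim=1)         # higher score = closer to this class
```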
The tenth contribution by Cai et al. proposes an enhanced single-stage neural network based on an improved YOLOv7-tiny architecture, providing a lightweight yet accurate real-time detection solution for transmission line inspection and addressing the missed detections and false positives caused by unconventional human target postures, complex backgrounds, and interference from small objects. The approach optimises the original ELAN architecture by introducing lightweight GSConv modules to reduce the parameter count, designs a combined CSPNeXt-GSConv module (ELAN-CSPGS) to enhance deep feature extraction for unconventionally posed targets, and employs the WIoU loss to bolster detection robustness in complex backgrounds. Experiments demonstrate that the improved model maintains a high frame rate (117 FPS) with a reduced parameter count and an improved mean average precision (mAP), significantly lowering both false positive and false negative rates compared to the original YOLOv7-tiny.
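A lightweight GSConv-style block, in the spirit of the module named above, pairs a dense convolution with a cheap depthwise convolution and shuffles the two halves together; the specific kernel sizes and activations below are assumptions:

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """GSConv-style lightweight convolution: a standard convolution produces half of the
    output channels, a depthwise convolution produces the other half cheaply, and the
    concatenated result is channel-shuffled."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        half = out_ch // 2
        self.dense = nn.Sequential(nn.Conv2d(in_ch, half, 3, stride, 1),
                                   nn.BatchNorm2d(half), nn.SiLU())
        self.cheap = nn.Sequential(nn.Conv2d(half, half, 5, 1, 2, groups=half),
                                   nn.BatchNorm2d(half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.dense(x)
        b = self.cheap(a)
        y = torch.cat([a, b], dim=1)
        # Channel shuffle so dense and depthwise features interleave.
        n, c, h, w = y.shape
        return y.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)

y = GSConv(64, 128)(torch.randn(1, 64, 40, 40))    # -> (1, 128, 40, 40)
```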
The eleventh contribution addresses the challenge of online defect detection on irregularly shaped surfaces of pet food products. Ding et al. propose an image data augmentation method based on Deep Convolutional Generative Adversarial Networks (DCGANs). Specifically, this approach constructs a Residual Block–Hybrid Attention Mechanism (ResB-HAM)-DCGAN model. This model aims to augment the training dataset by generating high-quality defect images, thereby supporting the training of deep learning defect detection models. Residual blocks (ResBs) are introduced into both the generator and discriminator to enhance deep feature extraction capabilities and mitigate gradient vanishing issues. Concurrently, a hybrid attention mechanism (HAM) integrating efficient channel attention and spatial attention is embedded to reinforce critical feature information and improve the quality of generated image details. Experiments conducted on a real-world beef jerky stick defect dataset demonstrate that the proposed model outperforms mainstream approaches across IS, FID, and SSIM metrics.
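As a rough analogue of combining residual blocks and attention inside a GAN generator, one upsampling stage might be composed as follows; this is an illustrative composition under our own assumptions, not the ResB-HAM-DCGAN architecture:

```python
import torch
import torch.nn as nn

class ResAttnUpBlock(nn.Module):
    """One generator stage combining a residual convolution path with a simple
    channel-attention gate, followed by transposed-convolution upsampling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.BatchNorm2d(in_ch), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.BatchNorm2d(in_ch),
        )
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(in_ch, in_ch, 1), nn.Sigmoid())
        self.up = nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                                nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(x + self.res(x))   # residual path eases gradient flow
        x = x * self.ca(x)                # channel attention re-weights features
        return self.up(x)                 # 2x spatial upsampling toward the generated image

img_feat = ResAttnUpBlock(128, 64)(torch.randn(1, 128, 16, 16))   # -> (1, 64, 32, 32)
```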
Han et al.’s research constitutes the twelfth contribution to the Special Issue, proposing a novel fuzz testing method named ESFuzzer, which leverages the concepts of Equivalent Statement and Equivalent Exchange to efficiently test for vulnerabilities in WebAssembly interpreters. The approach characterises the impact of instructions on the stack through an intermediate representation (IR) known as the Effect-Array, and it incorporates a Stack Repair Algorithm to dynamically restore the stack state after test cases are generated or mutated, thereby ensuring that the generated test cases pass WebAssembly’s stringent type and stack validation checks. Experimental validation on V8 engine WebAssembly modules demonstrated that all test cases generated by ESFuzzer passed verification, achieving over twice the code coverage of libFuzzer and a tenfold improvement in execution efficiency. Further experiments revealed that ESFuzzer significantly outperformed libFuzzer in both average sample size and execution rate during 24-hour testing. The research indicates that the approach has not yet fully resolved the fundamental issues concerning code coverage, necessitating further optimisation of coverage capabilities in future work.
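The stack-repair idea can be illustrated with a small, hypothetical Python sketch: each instruction is described by its stack effect (values popped and pushed), and a mutated sequence is padded so that it leaves the stack in a valid state. The instruction set and effects below are illustrative and do not reflect the ESFuzzer implementation:

```python
# A tiny 'Effect-Array' analogue: instruction -> (values popped, values pushed).
STACK_EFFECT = {
    "i32.const": (0, 1),
    "i32.add":   (2, 1),
    "drop":      (1, 0),
}

def repair_stack(program: list[str], required_depth: int = 0) -> list[str]:
    """Pads a possibly invalid instruction sequence so it passes stack validation:
    constants are injected when operands are missing, and leftovers are dropped."""
    depth = 0
    fixed = []
    for op in program:
        pops, pushes = STACK_EFFECT[op]
        while depth < pops:                # not enough operands: inject constants
            fixed.append("i32.const")
            depth += 1
        fixed.append(op)
        depth += pushes - pops
    while depth > required_depth:          # too many leftover values: drop them
        fixed.append("drop")
        depth -= 1
    return fixed

print(repair_stack(["i32.add", "i32.add"]))   # operands are injected so the sequence validates
```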
Finally, the thirteenth contribution to this Special Issue is the research by Yu et al., which presents FRNet, a feature reduction network based on convolutional neural networks for single-image defogging. Its core concept is to progressively eliminate the informative features within the image until only non-informative noise remains, thereby effectively restoring fog-free imagery. Addressing the limitation of traditional end-to-end defogging models, which treat outputs as noise while neglecting feature extraction capabilities, the approach introduces unique Frequency Residual Blocks (FRBlocks) and subtraction skip connections to more accurately represent and eliminate haze distributions. It also incorporates Selective Kernel Fusion (SK Fusion) layers and large convolution windows to optimise feature fusion and reconstruction quality. The method was validated across multiple datasets, with experimental results demonstrating superiority over current mainstream methods in both PSNR and SSIM, whilst exhibiting a lower parameter count and lower computational latency. FRNet effectively balances fog distribution modelling against computational efficiency, providing an efficient, scalable solution for image defogging tasks.
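One reading of the subtraction skip connection is sketched below: rather than adding skip features back, each stage subtracts the block’s estimate of the haze-related component from the incoming features. The module is an illustrative assumption, not FRNet’s actual block design:

```python
import torch
import torch.nn as nn

class SubtractionSkip(nn.Module):
    """Instead of adding the block's output back (a standard residual connection), the
    output is subtracted, so each stage progressively removes the component the network
    attributes to haze."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x - self.block(x)          # peel away the estimated haze component

clean_feat = SubtractionSkip(32)(torch.randn(1, 32, 64, 64))
```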
3. Conclusions
The Guest Editors of this Special Issue firmly believe that CNNs, with their formidable hierarchical visual feature learning capabilities, will continue to drive innovation in image processing theory and algorithm design while continually expanding their application boundaries. We anticipate that the papers compiled in this Special Issue will provide inspiration and reference for researchers in related fields, collectively advancing the progress and development of intelligent image processing technologies.
Funding
This work was supported by the National Key Research and Development Program of China [grant number 2024YFF0907404].
Conflicts of Interest
The authors declare no conflicts of interest.
List of Contributions
- Wang, X.; Song, W.; Hao, C.; Liu, S.; Liu, F. Supervised Face Tampering Detection Based on Spatial Channel Attention Mechanism. Electronics 2025, 14, 500. https://doi.org/10.3390/electronics14030500.
- Sun, Y.; Hu, S.; Xie, K.; Wen, C.; Zhang, W.; He, J. Enhanced Deblurring for Smart Cabinets in Dynamic and Low-Light Scenarios. Electronics 2025, 14, 488. https://doi.org/10.3390/electronics14030488.
- Hu, J.; Wan, W.; Qiao, P.; Zhou, Y.; Ouyang, A. Power Insulator Defect Detection Method Based on Enhanced YOLOV7 for Aerial Inspection. Electronics 2025, 14, 408. https://doi.org/10.3390/electronics14030408.
- Ding, X.; Jiang, X.; Jiang, X. DBD-Net: Dual-Branch Decoder Network with a Multiscale Cascaded Residual Module for Ship Segmentation. Electronics 2025, 14, 209. https://doi.org/10.3390/electronics14010209.
- Niu, Z.; Pi, H.; Jing, D.; Liu, D. PII-GCNet: Lightweight Multi-Modal CNN Network for Efficient Crowd Counting and Localization in UAV RGB-T Images. Electronics 2024, 13, 4298. https://doi.org/10.3390/electronics13214298.
- Dai, G.; Chen, K.; Huang, L.; Chen, L.; An, D.; Wang, Z.; Wang, K. CECL-Net: Contrastive Learning and Edge-Reconstruction-Driven Complementary Learning Network for Image Forgery Localization. Electronics 2024, 13, 3919. https://doi.org/10.3390/electronics13193919.
- Zhuang, Z.; Tomioka, Y.; Shin, J.; Okuyama, Y. PGD-Trap: Proactive Deepfake Defense with Sticky Adversarial Signals and Iterative Latent Variable Refinement. Electronics 2024, 13, 3353. https://doi.org/10.3390/electronics13173353.
- Rathee, M.; Bačić, B.; Doborjeh, M. Hybrid Machine Learning for Automated Road Safety Inspection of Auckland Harbour Bridge. Electronics 2024, 13, 3030. https://doi.org/10.3390/electronics13153030.
- Liu, S.; Zhong, W.; Guo, F.; Cong, J.; Gu, B. Fine-Grained Few-Shot Image Classification Based on Feature Dual Reconstruction. Electronics 2024, 13, 2751. https://doi.org/10.3390/electronics13142751.
- Cai, C.; Nie, J.; Tong, J.; Chen, Z.; Xu, X.; He, Z. An Enhanced Single-Stage Neural Network for Object Detection in Transmission Line Inspection. Electronics 2024, 13, 2080. https://doi.org/10.3390/electronics13112080.
- Ding, S.; Guo, Z.; Chen, X.; Li, X.; Ma, F. DCGAN-Based Image Data Augmentation in Rawhide Stick Products’ Defect Detection. Electronics 2024, 13, 2047. https://doi.org/10.3390/electronics13112047.
- Han, J.; Zhang, Z.; Du, Y.; Wang, W.; Chen, X. ESFuzzer: An Efficient Way to Fuzz WebAssembly Interpreter. Electronics 2024, 13, 1498. https://doi.org/10.3390/electronics13081498.
- Yu, H.; Yuan, X.; Jiang, R.; Feng, H.; Liu, J.; Li, Z. Feature Reduction Networks: A Convolution Neural Network-Based Approach to Enhance Image Dehazing. Electronics 2023, 12, 4984. https://doi.org/10.3390/electronics12244984.
References
- Shreyamsha Kumar, B.K. Image Fusion Based on Pixel Significance Using Cross Bilateral Filter. Signal Image Video Process. 2015, 9, 1193–1204.
- Rao, Y.; Zhao, W.; Zhu, Z.; Lu, J.; Zhou, J. Global Filter Networks for Image Classification. Adv. Neural Inf. Process. Syst. 2021, 34, 980–993.
- Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a Deep Multi-Scale Feature Ensemble and an Edge-Attention Guidance for Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 105–119.
- Mochurad, L. Approach for Enhancing the Accuracy of Semantic Segmentation of Chest X-ray Images by Edge Detection and Deep Learning Integration. Front. Artif. Intell. 2025, 8, 1522730.
- Zhang, W.; Li, X.; Huang, Y.; Xu, S.; Tang, J.; Hu, H. Underwater Image Enhancement via Frequency and Spatial Domains Fusion. Opt. Lasers Eng. 2025, 186, 108826.
- Zhang, Y.; Xing, K.; Bai, R.; Sun, D.; Meng, Z. An Enhanced Convolutional Neural Network for Bearing Fault Diagnosis Based on Time–Frequency Image. Measurement 2020, 157, 107667.
- Liu, X.; Ghazali, K.H.; Han, F.; Mohamed, I.I. Review of CNN in Aerial Image Processing. Imaging Sci. J. 2023, 71, 1–13.
- Kshatri, S.S.; Singh, D. Convolutional Neural Network in Medical Image Analysis: A Review. Arch. Comput. Methods Eng. 2023, 30, 2793–2810.
- Archana, R.; Jeevaraj, P.S.E. Deep Learning Models for Digital Image Processing: A Review. Artif. Intell. Rev. 2024, 57, 11.
- Hussain, W.; Mushtaq, M.F.; Shahroz, M.; Akram, U.; Ghith, E.S.; Tlija, M.; Kim, T.; Ashraf, I. Ensemble Genetic and CNN Model-Based Image Classification by Enhancing Hyperparameter Tuning. Sci. Rep. 2025, 15, 1003.
- Xi, B.; Zhang, Y.; Li, J.; Zheng, T.; Zhao, X.; Xu, H.; Xue, C.; Li, Y.; Chanussot, J. MCTGCL: Mixed CNN-Transformer for Mars Hyperspectral Image Classification with Graph Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503214.
- Xi, B.; Zhang, Y.; Li, J.; Li, Y.; Li, Z.; Chanussot, J. CTF-SSCL: CNN-Transformer for Few-Shot Hyperspectral Image Classification Assisted by Semi-Supervised Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5532617.
- Liu, H.; Tseng, Y.; Chang, K.; Wang, P.; Shuai, H.; Cheng, W. A Denoising FPN with Transformer R-CNN for Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415.
- Sagar, A.S.S.; Chen, Y.; Xie, Y.; Kim, H.S. MSA R-CNN: A Comprehensive Approach to Remote Sensing Object Detection and Scene Understanding. Expert Syst. Appl. 2024, 241, 122788.
- Yue, G.; Jiao, G.; Li, C.; Xiang, J. When CNN Meet with ViT: Decision-Level Feature Fusion for Camouflaged Object Detection. Vis. Comput. 2025, 41, 3957–3972.
- Jin, T.; Kang, S.M.; Kim, N.R.; Kim, H.R.; Han, X. Comparative Analysis of CNN-Based Semantic Segmentation for Apple Tree Canopy Size Recognition in Automated Variable-Rate Spraying. Agriculture 2025, 15, 789.
- Elgamily, K.M.; Mohamed, M.A.; Abou-Taleb, A.M.; Ata, M.M. A Novel W13 Deep CNN Structure for Improved Semantic Segmentation of Multiple Objects in Remote Sensing Imagery. Neural Comput. Appl. 2025, 37, 5397–5427.
- Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612.
- Gao, G.; Xu, Z.; Li, J.; Yang, J.; Zeng, T.; Qi, G.J. CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution. IEEE Trans. Image Process. 2023, 32, 1978–1991.
- Kang, L.; Tang, B.; Huang, J.; Li, J. 3D-MRI Super-Resolution Reconstruction Using Multi-Modality Based on Multi-Resolution CNN. Comput. Methods Programs Biomed. 2024, 248, 108110.
- Liu, K.; Yin, L.; Liu, T.; Chen, Z.; Yu, W.; Long, X.; Wu, G. A Super-Resolution Algorithm of Ghost Imaging Using CNN with Grouped Orthonormalization Algorithm Constraint. Opt. Laser Technol. 2025, 181, 111847.
- Bai, X.; Wan, Y.; Wang, W. CEPDNet: A Fast CNN-Based Image Denoising Network Using Edge Computing Platform. J. Supercomput. 2025, 81, 100.
- Hou, R.; Li, F. Hyperspectral Image Denoising via Cooperated Self-Supervised CNN Transform and Nonconvex Regularization. Neurocomputing 2025, 616, 128912.
- Zhuang, L.; Ng, M.K.; Gao, L.; Wang, Z. Eigen-CNN: Eigenimages Plus Eigennoise Level Maps Guided Network for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5512018.
- Li, X.; Li, X.; Zhang, M.; Dong, Q.; Zhang, G.; Wang, Z.; Wei, P. SugarcaneGAN: A Novel Dataset Generating Approach for Sugarcane Leaf Diseases Based on Lightweight Hybrid CNN-Transformer Network. Comput. Electron. Agric. 2024, 219, 108762.
- Beyan, E.V.P.; Rossy, A.G.C. A Review of AI Image Generator: Influences, Challenges, and Future Prospects for Architectural Field. J. Artif. Intell. Archit. 2023, 2, 53–65.
- Park, S.; Shin, Y.-G. Generative Convolution Layer for Image Generation. Neural Netw. 2022, 152, 370–379.
- Salehi, A.W.; Khan, S.; Gupta, G.; Alabduallah, B.I.; Almjally, A.; Alsolai, H.; Siddiqui, T.; Mellit, A. A Study of CNN and Transfer Learning in Medical Imaging: Advantages, Challenges, Future Scope. Sustainability 2023, 15, 5930.
- Yao, W.; Bai, J.; Liao, W.; Chen, Y.; Liu, M.; Xie, Y. From CNN to Transformer: A Review of Medical Image Segmentation Models. J. Imaging Inform. Med. 2024, 37, 1529–1547.
- Zhang, L.; Guo, X.; Sun, H.; Wang, W.; Yao, L. Alternate Encoder and Dual Decoder CNN-Transformer Networks for Medical Image Segmentation. Sci. Rep. 2025, 15, 8883.
- Shi, R.; Yang, S.; Chen, Y.; Wang, R.; Zhang, M.; Lu, J.; Cao, Y. CNN-Transformer for Visual-Tactile Fusion Applied in Road Recognition of Autonomous Vehicles. Pattern Recognit. Lett. 2023, 166, 200–208.
- Turay, T.; Vladimirova, T. Toward Performing Image Classification and Object Detection with Convolutional Neural Networks in Autonomous Driving Systems: A Survey. IEEE Access 2022, 10, 14076–14119.
- Chen, P.; Liu, F.; Zhang, J.; Wang, B. MFEM-CIN: A Lightweight Architecture Combining CNN and Transformer for the Classification of Pre-Cancerous Lesions of the Cervix. IEEE Open J. Eng. Med. Biol. 2024, 5, 216–225.
- Gursesli, M.C.; Lombardi, S.; Duradoni, M.; Bocchi, L.; Guazzini, A.; Lanata, A. Facial Emotion Recognition (FER) through Custom Lightweight CNN Model: Performance Evaluation in Public Datasets. IEEE Access 2024, 12, 45543–45559.
- Begum, M.; Shuvo, M.H.; Nasir, M.K.; Hossain, A.; Hossain, M.J.; Ashraf, I.; Uddin, J.; Samad, M.A. LCNN: Lightweight CNN Architecture for Software Defect Feature Identification Using Explainable AI. IEEE Access 2024, 12, 55744–55756.
- Dhore, V.; Bhat, A.; Nerlekar, V.; Chavhan, K.; Umare, A. Enhancing Explainable AI: A Hybrid Approach Combining GradCAM and LRP for CNN Interpretability. arXiv 2024, arXiv:2405.12175.
- Alizamir, M.; Heddam, S.; Kim, S. A Robust and Explainable Deep Learning Model Based on an LSTM-CNN Framework for Reliable FDOM Prediction in Water Quality Monitoring: Incorporating SHAP Analysis for Enhanced Interpretability. Process Saf. Environ. Prot. 2025, 107594.
- Li, M.; Hao, R.; Shi, S.; Yu, Z.; He, Q.; Zhan, J. A CNN-Transformer Approach for Image-Text Multimodal Classification with Cross-Modal Feature Fusion. In Proceedings of the 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE), Xi’an, China, 21–23 March 2025; pp. 1182–1186.
- Yi, L.; Huang, Y.; Zhan, J.; Wang, Y.; Sun, T.; Long, J.; Liu, J.; Chen, B. CNN-ELMNet: Fault Diagnosis of Induction Motor Bearing Based on Cross-Modal Vector Fusion. Meas. Sci. Technol. 2024, 35, 115114.
- Sun, K.; Ding, J.; Li, Q.; Chen, W.; Zhang, H.; Sun, J.; Jiao, Z.; Ni, X. CMAF-Net: A Cross-Modal Attention Fusion-Based Deep Neural Network for Incomplete Multi-Modal Brain Tumor Segmentation. Quant. Imaging Med. Surg. 2024, 14, 4579.
- Azad, R.; Fayjie, A.R.; Kauffmann, C.; Ben Ayed, I.; Pedersoli, M.; Dolz, J. On the Texture Bias for Few-Shot CNN Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 2674–2683.
- An, D.-J.; Yoo, I.-S.; Jo, J.-M.; Lee, W.-J.; Yu, H.-J.; Park, S. Few-Shot-Learning for Scar Recognition: A CNN-Based Binary Classification Approach. In Proceedings of the 2024 International Technical Conference on Circuits/Systems Computers, and Communications (ITC-CSCC), Jeju, Republic of Korea, 2–5 July 2024; pp. 1–5.
- Liu, Y.; Xiao, F.; Zheng, X.; Deng, W.; Ma, H.; Su, X.; Wu, L. Integrating Deformable CNN and Attention Mechanism into Multi-Scale Graph Neural Network for Few-Shot Image Classification. Sci. Rep. 2025, 15, 1306.
- Madhan, K.; Shanmugapriya, N. Object Detection in Unfavourable Weather Conditions Using CNN-Diffusion Neural Networks. Signal Image Video Process. 2025, 19, 516.
- Pei, X.; Huang, Y.; Su, W.; Zhu, F.; Liu, Q. FFTFormer: A Spatial-Frequency Noise Aware CNN-Transformer for Low Light Image Enhancement. Knowl. Based Syst. 2025, 314, 113055.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).