Topic Editors

Dr. Fengping An
School of Automation and Software Engineering, Shanxi University, Taiyuan 030006, China
Dr. Haitao Xu
Department of Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Dr. Chuyang Ye
School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China

Transformer and Deep Learning Applications in Image Processing

Abstract submission deadline: closed (31 March 2026)
Manuscript submission deadline: 31 May 2026
Viewed by 17747

Topic Information

Dear Colleagues,

Convolutional Neural Networks (CNNs) are a class of deep learning architectures designed to process spatial data such as images and videos. Because they extract features automatically, maintain translational invariance, and capture local patterns, CNNs have found extensive applications in image classification, object recognition, object tracking, and medical image processing. However, CNNs cannot model long-range dependencies effectively and struggle to capture long-distance features of a tracked target, which limits the efficiency and accuracy of target tracking. Since the release of GPT-3, built on the transformer architecture, in 2020, transformers have demonstrated strong capabilities in handling sequential data.
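As a minimal illustration of this contrast (a sketch under assumed tensor shapes, not code from any specific submission), the following PyTorch fragment compares the local receptive field of a convolution with the global, pairwise interactions of self-attention:

```python
# Illustrative sketch only: shapes and module choices are assumptions.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)           # batch, channels, height, width

# A 3x3 convolution mixes information only within a local neighbourhood per layer.
local_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
y_local = local_conv(x)                  # (1, 64, 32, 32)

# Self-attention treats the 32x32 grid as a sequence of 1024 tokens, so every
# position can attend to every other position in a single layer (long-range dependencies).
tokens = x.flatten(2).transpose(1, 2)    # (1, 1024, 64)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
y_global, _ = attn(tokens, tokens, tokens)
print(y_local.shape, y_global.shape)
```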

Although CNN models have achieved significant success in image processing over the years, many challenges remain in practical applications, such as complex-scene image classification, specific object recognition and tracking, and medical image processing. This situation highlights a noticeable gap between theoretical advances and practical applications in the field.

Therefore, we invite submissions of studies on theoretical research and practical applications related to transformer and deep learning architectures in the fields of medical image analysis, image classification, recognition, and tracking.

We welcome submissions on topics including, but not limited to, the following:

  • Novel architectures and variations of transformers and deep learning models;
  • Fine-tuning strategies for pre-trained transformers and deep learning models;
  • Image classification based on transformers and deep learning models;
  • Image recognition based on transformers and deep learning models;
  • Medical image processing based on transformers and deep learning models;
  • Object tracking based on transformers and deep learning models;
  • Transformers and deep learning for sciences;
  • Transformers and convolutional neural network fusion architecture;
  • Transformers and deep learning for diverse machine learning tasks;
  • Natural language processing.

Prof. Dr. Fengping An
Prof. Dr. Haitao Xu
Dr. Chuyang Ye
Topic Editors

Keywords

  • image processing
  • medical image processing
  • transformer
  • deep learning
  • CNN

Participating Journals

Journal Name         Impact Factor   CiteScore   Launched Year   First Decision (median)   APC
Diagnostics          3.3             5.9         2011            21.6 Days                 CHF 2600
Electronics          2.6             6.1         2012            16.4 Days                 CHF 2400
Journal of Imaging   3.3             6.7         2015            18 Days                   CHF 1800
Mathematics          2.2             4.6         2013            17.3 Days                 CHF 2600
Sensors              3.5             8.2         2001            17.8 Days                 CHF 2600

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

  1. Share your research immediately: Disseminate your ideas prior to publication and establish priority for your work.
  2. Safeguard your intellectual contribution: Protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
  3. Boost visibility and impact: Increase the reach and influence of your research by making it accessible to a global audience.
  4. Gain early feedback: Receive valuable input and insights from peers before submitting to a journal.
  5. Ensure broad indexing: Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (9 papers)

24 pages, 3126 KB  
Article
Calibrated Transformer Fusion for Dual-View Low-Energy CESM Classification
by Ahmed A. H. Alkurdi and Amira Bibo Sallow
J. Imaging 2026, 12(1), 41; https://doi.org/10.3390/jimaging12010041 - 13 Jan 2026
Cited by 1 | Viewed by 634
Abstract
Contrast-enhanced spectral mammography (CESM) provides low-energy images acquired in standard craniocaudal (CC) and mediolateral oblique (MLO) views, and clinical interpretation relies on integrating both views. This study proposes a dual-view classification framework that combines deep CNN feature extraction with transformer-based fusion for breast-side classification using low-energy (DM) images from CESM acquisitions (Normal vs. Tumorous; benign and malignant merged). The evaluation was conducted using 5-fold stratified group cross-validation with patient-level grouping to prevent leakage across folds. The final configuration (Model E) integrates dual-backbone feature extraction, transformer fusion, MC-dropout inference for uncertainty estimation, and post hoc logistic calibration. Across the five held-out test folds, Model E achieved a mean accuracy of 96.88% ± 2.39% and a mean F1-score of 97.68% ± 1.66%. The mean ROC-AUC and PR-AUC were 0.9915 ± 0.0098 and 0.9968 ± 0.0029, respectively. Probability quality was supported by a mean Brier score of 0.0236 ± 0.0145 and a mean expected calibration error (ECE) of 0.0334 ± 0.0171. An ablation study (Models A–E) was also reported to quantify the incremental contribution of dual-view input, transformer fusion, and uncertainty calibration. Within the limits of this retrospective single-center setting, these results suggest that dual-view transformer fusion can provide strong discrimination while also producing calibrated probabilities and uncertainty outputs that are relevant for decision support.
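The exact architecture is not reproduced here; the sketch below only illustrates the general pattern the abstract describes (two per-view feature vectors fused by a transformer encoder, with Monte Carlo dropout kept active at inference). All layer sizes, the two-token fusion scheme, and the number of MC samples are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of dual-view transformer fusion with MC-dropout inference.
import torch
import torch.nn as nn

class DualViewFusion(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           dropout=0.2, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Dropout(0.2), nn.Linear(feat_dim, num_classes))

    def forward(self, cc_feat, mlo_feat):
        # Treat the CC and MLO feature vectors as a two-token sequence and let
        # self-attention fuse them before classification.
        tokens = torch.stack([cc_feat, mlo_feat], dim=1)   # (B, 2, feat_dim)
        fused = self.fusion(tokens).mean(dim=1)            # (B, feat_dim)
        return self.head(fused)

def mc_dropout_predict(model, cc_feat, mlo_feat, n_samples=20):
    # Keep dropout active at inference and average softmax outputs; the spread
    # across samples serves as a simple uncertainty proxy.
    model.train()
    with torch.no_grad():
        probs = torch.stack([model(cc_feat, mlo_feat).softmax(-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)

model = DualViewFusion()
cc, mlo = torch.randn(4, 256), torch.randn(4, 256)
mean_p, std_p = mc_dropout_predict(model, cc, mlo)
```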

22 pages, 1777 KB  
Article
DP2PNet: Diffusion-Based Point-to-Polygon Conversion for Single-Point Supervised Oriented Object Detection
by Peng Li, Limin Zhang and Tao Qu
Sensors 2026, 26(1), 329; https://doi.org/10.3390/s26010329 - 4 Jan 2026
Viewed by 530
Abstract
Rotated Bounding Boxes (RBBs) for oriented object detection are labor-intensive and time-consuming to annotate. Single-point supervision offers a cost-effective alternative but suffers from insufficient size and orientation information, leading existing methods to rely heavily on complex priors and fixed refinement stages. In this paper, we propose DP2PNet (Diffusion-Point-to-Polygon Network), the first diffusion model-based framework for single-point supervised oriented object detection. DP2PNet features three key innovations: (1) A multi-scale consistent noise generator that replaces manual or external model priors with Gaussian noise, reducing dependency on domain-specific information; (2) A Noise Cross-Constraint module based on multi-instance learning, which selects optimal noise point bags by fusing receptive field matching and object coverage; (3) A Semantic Key Point Aggregator that aggregates noise points via graph convolution to form semantic key points, from which pseudo-RBBs are generated using convex hulls. DP2PNet supports dynamic adjustment of refinement stages without retraining, enabling flexible accuracy optimization. Extensive experiments on DOTA-v1.0 and DIOR-R datasets demonstrate that DP2PNet achieves 53.82% and 53.61% mAP50, respectively, comparable to methods relying on complex priors. It also exhibits strong noise robustness and cross-dataset generalization.
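As a hedged illustration of the final step mentioned above, the sketch below turns a set of 2D key points into a rotated pseudo-box via a convex hull and a minimum-area rectangle; the use of OpenCV and these specific functions are implementation assumptions, not the authors' code.

```python
# Hedged sketch: aggregated key points -> convex hull -> rotated pseudo-box (RBB).
import numpy as np
import cv2

def points_to_pseudo_rbb(points):
    """points: (N, 2) float array of semantic key points in image coordinates."""
    pts = np.asarray(points, dtype=np.float32)
    hull = cv2.convexHull(pts)                        # convex hull of the key points
    (cx, cy), (w, h), angle = cv2.minAreaRect(hull)   # minimum-area rotated rectangle
    return cx, cy, w, h, angle                        # rotated pseudo-label

keypoints = np.array([[10, 12], [42, 15], [40, 38], [12, 35], [25, 22]], dtype=np.float32)
print(points_to_pseudo_rbb(keypoints))
```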

18 pages, 2632 KB  
Article
Adverse-Weather Image Restoration Method Based on VMT-Net
by Zhongmin Liu, Xuewen Yu and Wenjin Hu
J. Imaging 2025, 11(11), 376; https://doi.org/10.3390/jimaging11110376 - 26 Oct 2025
Viewed by 1417
Abstract
To address global semantic loss, local detail blurring, and spatial–semantic conflict during image restoration under adverse weather conditions, we propose an image restoration network that integrates Mamba with Transformer architectures. We first design a Vision-Mamba–Transformer (VMT) module that combines the long-range dependency modeling of Vision Mamba with the global contextual reasoning of Transformers, facilitating the joint modeling of global structures and local details, thus mitigating information loss and detail blurring during restoration. Second, we introduce an Adaptive Content Guidance (ACG) module that employs dynamic gating and spatial–channel attention to enable effective inter-layer feature fusion, thereby enhancing cross-layer semantic consistency. Finally, we embed the VMT and ACG modules into a U-Net backbone, achieving efficient integration of multi-scale feature modeling and cross-layer fusion, significantly improving reconstruction quality under complex weather conditions. The experimental results show that on Snow100K-S/L, VMT-Net improves PSNR over the baseline by approximately 0.89 dB and 0.36 dB, with SSIM gains of about 0.91% and 0.11%, respectively. On Outdoor-Rain and Raindrop, it performs similarly to the baseline and exhibits superior detail recovery in real-world scenes. Overall, the method demonstrates robustness and strong detail restoration across diverse adverse-weather conditions.
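The reported gains are in PSNR (and SSIM); for reference, the snippet below shows the standard PSNR computation for 8-bit images. It reflects the metric's definition only, not the paper's evaluation code.

```python
# Standard PSNR for 8-bit images (metric definition, not the paper's code).
import numpy as np

def psnr(reference, restored, max_val=255.0):
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, noisy):.2f} dB")
```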

21 pages, 6219 KB  
Article
Model-Free Transformer Framework for 6-DoF Pose Estimation of Textureless Tableware Objects
by Jungwoo Lee, Hyogon Kim, Ji-Wook Kwon, Sung-Jo Yun, Na-Hyun Lee, Young-Ho Choi, Goobong Chung and Jinho Suh
Sensors 2025, 25(19), 6167; https://doi.org/10.3390/s25196167 - 5 Oct 2025
Cited by 1 | Viewed by 1379
Abstract
Tableware objects such as plates, bowls, and cups are usually textureless, uniform in color, and vary widely in shape, making it difficult to apply conventional pose estimation methods that rely on texture cues or object-specific CAD models. These limitations present a significant obstacle to robotic manipulation in restaurant environments, where reliable six-degree-of-freedom (6-DoF) pose estimation is essential for autonomous grasping and collection. To address this problem, we propose a model-free and texture-free 6-DoF pose estimation framework based on a transformer encoder architecture. This method uses only geometry-based features extracted from depth images, including surface vertices and rim normals, which provide strong structural priors. The pipeline begins with object detection and segmentation using a pretrained video foundation model, followed by the generation of uniformly partitioned grids from depth data. For each grid cell, centroid positions and surface normals are computed and processed by a transformer-based model that jointly predicts object rotation and translation. Experiments with ten types of tableware demonstrate that the method achieves an average rotational error of 3.53 degrees and a translational error of 13.56 mm. Real-world deployment on a mobile robot platform with a manipulator further validated its ability to autonomously recognize and collect tableware, highlighting the practicality of the proposed geometry-driven approach for service robotics.
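A hedged sketch of the kind of geometry features described above: back-projected depth points are partitioned into a grid, and a centroid plus a PCA-based surface normal are computed per cell. The grid size and the PCA normal estimate are illustrative assumptions rather than the paper's exact pipeline.

```python
# Hedged sketch: per-grid-cell centroid and surface-normal features from a point cloud.
import numpy as np

def grid_geometry_features(points, grid=8):
    """points: (N, 3) array of 3D points from one segmented object."""
    xy_min, xy_max = points[:, :2].min(0), points[:, :2].max(0)
    cell = np.floor((points[:, :2] - xy_min) / (xy_max - xy_min + 1e-9) * grid)
    cell = np.clip(cell, 0, grid - 1).astype(int)
    feats = []
    for gx in range(grid):
        for gy in range(grid):
            pts = points[(cell[:, 0] == gx) & (cell[:, 1] == gy)]
            if len(pts) < 3:
                continue
            centroid = pts.mean(0)
            # Surface normal: direction of smallest variance of the cell's points.
            _, _, vt = np.linalg.svd(pts - centroid)
            feats.append(np.concatenate([centroid, vt[-1]]))
    return np.stack(feats)          # (num_valid_cells, 6) tokens for a transformer

cloud = np.random.rand(2000, 3)
print(grid_geometry_features(cloud).shape)
```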

20 pages, 2226 KB  
Article
RST-Net: A Semantic Segmentation Network for Remote Sensing Images Based on a Dual-Branch Encoder Structure
by Na Yang, Chuanzhao Tian, Xingfa Gu, Yanting Zhang, Xuewen Li and Feng Zhang
Sensors 2025, 25(17), 5531; https://doi.org/10.3390/s25175531 - 5 Sep 2025
Cited by 2 | Viewed by 1989
Abstract
High-resolution remote sensing images often suffer from inadequate fusion between global and local features, leading to the loss of long-range dependencies and blurred spatial details, while also exhibiting limited adaptability to multi-scale object segmentation. To overcome these limitations, this study proposes RST-Net, a semantic segmentation network featuring a dual-branch encoder structure. The encoder integrates a ResNeXt-50-based CNN branch for extracting local spatial features and a Shunted Transformer (ST) branch for capturing global contextual information. To further enhance multi-scale representation, the multi-scale feature enhancement module (MSFEM) is embedded in the CNN branch, leveraging atrous and depthwise separable convolutions to dynamically aggregate features. Additionally, the residual dynamic feature fusion (RDFF) module is incorporated into skip connections to improve interactions between encoder and decoder features. Experiments on the Vaihingen and Potsdam datasets show that RST-Net achieves promising performance, with MIoU scores of 77.04% and 79.56%, respectively, validating its effectiveness in semantic segmentation tasks.
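MIoU is the headline metric here; the snippet below shows the standard mean IoU computation from a confusion matrix (the metric's definition, not the authors' evaluation code).

```python
# Standard mean IoU (MIoU) from a confusion matrix (metric definition only).
import numpy as np

def mean_iou(pred, target, num_classes):
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, t in zip(pred.ravel(), target.ravel()):
        conf[t, p] += 1
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)   # guard against empty classes
    return iou.mean()

pred = np.random.randint(0, 6, (256, 256))
target = np.random.randint(0, 6, (256, 256))
print(f"MIoU: {mean_iou(pred, target, num_classes=6):.4f}")
```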

16 pages, 9189 KB  
Article
SEND: Semantic-Aware Deep Unfolded Network with Diffusion Prior for Multi-Modal Image Fusion and Object Detection
by Rong Zhang, Mao-Yi Xiong and Jun-Jie Huang
Mathematics 2025, 13(16), 2584; https://doi.org/10.3390/math13162584 - 12 Aug 2025
Viewed by 1286
Abstract
Multi-modality image fusion (MIF) aims to integrate complementary information from diverse imaging modalities into a single comprehensive representation and serves as an essential processing step for downstream high-level computer vision tasks. The existing deep unfolding-based processes demonstrate promising results; however, they often rely on deterministic priors with limited generalization ability and usually decouple from the training process of object detection. In this paper, we propose Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND), a novel framework designed for transparent and effective multi-modality fusion and object detection. SEND consists of a Denoising Prior Guided Fusion Module and a Fusion Object Detection Module. The Denoising Prior Guided Fusion Module does not utilize the traditional deterministic prior but combines the diffusion prior with deep unfolding, leading to improved multi-modal fusion performance and generalization ability. It is designed with a model-based optimization formulation for multi-modal image fusion, which is unfolded into two cascaded blocks: a Diffusion Denoising Fusion Block to generate informative diffusion priors and a Data Consistency Enhancement Block that explicitly aggregates complementary features from both the diffusion priors and input modalities. Additionally, SEND incorporates the Fusion Object Detection Module with the Denoising Prior Guided Fusion Module for object detection task optimization using a carefully designed two-stage training strategy. Experiments demonstrate that the proposed SEND method outperforms state-of-the-art methods, achieving superior fusion quality with improved efficiency and interpretability.
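The deep-unfolding idea can be summarized as a fixed number of stages that alternate a learned prior step with a data-consistency step against the input modalities. The sketch below is a heavily simplified stand-in for that pattern: the convolutional "prior" blocks and the averaging-style consistency update are placeholders, not SEND's actual Diffusion Denoising Fusion or Data Consistency Enhancement blocks.

```python
# Heavily hedged stand-in for a deep-unfolded fusion loop; all blocks are placeholders.
import torch
import torch.nn as nn

class UnfoldedFusion(nn.Module):
    def __init__(self, stages=3, channels=1):
        super().__init__()
        self.prior_blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(stages)])

    def forward(self, modality_a, modality_b):
        x = 0.5 * (modality_a + modality_b)          # simple initialization
        for block in self.prior_blocks:
            x = block(x)                             # stand-in for the learned prior step
            # Data-consistency step: pull the estimate back toward both modalities.
            x = x + 0.25 * ((modality_a - x) + (modality_b - x))
        return x

ir, vis = torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64)
print(UnfoldedFusion()(ir, vis).shape)
```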

25 pages, 3449 KB  
Article
CSANet: Context–Spatial Awareness Network for RGB-T Urban Scene Understanding
by Ruixiang Li, Zhen Wang, Jianxin Guo and Chuanlei Zhang
J. Imaging 2025, 11(6), 188; https://doi.org/10.3390/jimaging11060188 - 9 Jun 2025
Cited by 1 | Viewed by 1781
Abstract
Semantic segmentation plays a critical role in understanding complex urban environments, particularly for autonomous driving applications. However, existing approaches face significant challenges under low-light and adverse weather conditions. To address these limitations, we propose CSANet (Context Spatial Awareness Network), a novel framework that effectively integrates RGB and thermal infrared (TIR) modalities. CSANet employs an efficient encoder to extract complementary local and global features, while a hierarchical fusion strategy is adopted to selectively integrate visual and semantic information. Notably, the Channel–Spatial Cross-Fusion Module (CSCFM) enhances local details by fusing multi-modal features, and the Multi-Head Fusion Module (MHFM) captures global dependencies and calibrates multi-modal information. Furthermore, the Spatial Coordinate Attention Mechanism (SCAM) improves object localization accuracy in complex urban scenes. Evaluations on benchmark datasets (MFNet and PST900) demonstrate that CSANet achieves state-of-the-art performance, significantly advancing RGB-T semantic segmentation.
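As a loose, hedged illustration of channel- and spatial-attention fusion of RGB and thermal features, the sketch below reweights a concatenated feature map along channels and then along spatial positions; all module names, kernel sizes, and the gating design are assumptions, not CSANet's implementation.

```python
# Hedged sketch of generic channel/spatial attention fusion for RGB-T features.
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, tir_feat):
        x = torch.cat([rgb_feat, tir_feat], dim=1)            # (B, 2C, H, W)
        x = x * self.channel_gate(x)                           # reweight channels
        spatial = torch.cat([x.mean(1, keepdim=True),
                             x.amax(1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(spatial)                     # reweight spatial positions
        return self.project(x)                                 # fused (B, C, H, W)

rgb, tir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(ChannelSpatialFusion()(rgb, tir).shape)
```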

25 pages, 12873 KB  
Article
License Plate Detection Based on Improved YOLOv8n Network
by Ruizhe Zhu, Qiyang He, Hai Jin, Yonghua Han and Kejian Jiang
Electronics 2025, 14(10), 2065; https://doi.org/10.3390/electronics14102065 - 20 May 2025
Cited by 2 | Viewed by 4491
Abstract
To address the challenges of complex backgrounds, varying target scales, and dense targets in license plate detection within surveillance scenarios, we propose an enhanced method based on an improved YOLOv8n network. This approach involves redesigning key components of the YOLOv8n architecture, including the C2f module, the SPPF module, and the detection head. Additionally, we optimize the WIoU loss function, replacing the original CIoU loss function, which leads to improved bounding box feature extraction and enhanced regression accuracy. To evaluate the model’s robustness in complex environments with varying lighting, backgrounds, angles, and vehicle types, we created a custom surveillance license plate dataset. Experimental results show that the improved model achieves a notable increase in detection accuracy, with mAP@0.5 rising from 90.9% in the baseline model to 94.4%, precision improving from 90.2% to 92.8%, and recall increasing from 82.9% to 87.9%. Additionally, the model’s parameters are reduced from 3.1 M to 2.1 M, significantly enhancing computational efficiency. Moreover, the model achieves an inference speed of 86 FPS, maintaining high precision and meeting real-time detection requirements. This demonstrates that our method provides an efficient and reliable solution for license plate detection in surveillance scenarios.
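For reference, the snippet below sketches the CIoU regression loss mentioned as the baseline (boxes as x1, y1, x2, y2), following the published CIoU definition; it is not the repository's exact implementation, and the WIoU variant adopted by the authors is not reproduced here.

```python
# Hedged sketch of the CIoU loss (published definition, not the authors' code).
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Normalized centre-distance penalty over the enclosing box diagonal
    cx_p = (pred[:, 0] + pred[:, 2]) / 2; cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2; cy_t = (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - (iou - rho2 / c2 - alpha * v)).mean()

pred = torch.tensor([[10., 10., 50., 60.]]); gt = torch.tensor([[12., 8., 48., 62.]])
print(ciou_loss(pred, gt))
```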

15 pages, 3571 KB  
Article
Lightweight UAV Landing Model Based on Visual Positioning
by Ning Zhang, Junnan Tan, Kaichun Yan and Sang Feng
Sensors 2025, 25(3), 884; https://doi.org/10.3390/s25030884 - 31 Jan 2025
Cited by 5 | Viewed by 1978
Abstract
In order to enhance the precision of UAV (unmanned aerial vehicle) landings and realize the convenient and rapid deployment of the model to the mobile terminal, this study proposes a Land-YOLO lightweight UAV-guided landing algorithm based on the YOLOv8n model. Firstly, GhostConv replaces standard convolutions in the backbone network, leveraging existing feature maps to create additional “ghost” feature maps via low-cost linear transformations, thereby lightening the network structure. Additionally, the CSP structure of the neck network is enhanced by incorporating the PartialConv structure. This integration allows for the transmission of certain channel characteristics through identity mapping, effectively reducing both the number of parameters and the computational load of the model. Finally, the bidirectional feature pyramid network (BiFPN) module is introduced, improving the precision and mean average precision of landing-mark recognition through its bidirectional and weighted feature fusion mechanism. The experimental results show that, on landing-sign datasets collected in real and virtual environments, the Land-YOLO algorithm is 1.4% higher in precision and 0.91% higher in mAP@0.5 than the original YOLOv8n baseline, meeting the detection requirements for landing signs. The model’s memory usage and floating-point operations (FLOPs) are reduced by 42.8% and 32.4%, respectively, making it more suitable for deployment on the mobile terminal of a UAV.
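A hedged sketch of a GhostConv-style block as described above: a primary convolution produces half of the output channels, and a cheap depthwise convolution derives the remaining "ghost" channels from them. The kernel sizes and 1:1 split are illustrative assumptions, not the exact Land-YOLO configuration.

```python
# Hedged sketch of a GhostConv-style block (details are assumptions).
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=1, stride=1):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, stride,
                      kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.SiLU())
        # Cheap operation: a depthwise 5x5 convolution derives the ghost maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, primary_ch, 5, 1, 2,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(primary_ch), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # (B, out_ch, H, W)

block = GhostConv(64, 128)
print(block(torch.randn(1, 64, 40, 40)).shape)        # torch.Size([1, 128, 40, 40])
```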
