Topic Editors

Dr. Fengping An
School of Automation and Software Engineering, Shanxi University, Taiyuan 030006, China
Dr. Haitao Xu
Department of Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Dr. Chuyang Ye
School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China

Transformer and Deep Learning Applications in Image Processing

Abstract submission deadline: closed (31 March 2026)
Manuscript submission deadline: 31 May 2026
Viewed by 17747

Topic Information

Dear Colleagues,

Convolutional Neural Networks (CNNs) are a class of deep learning architectures designed to process spatial data such as images and videos. Because they extract features automatically, maintain translational invariance, and capture local patterns, CNNs have found extensive applications in image classification, object recognition, object tracking, and medical image processing. However, CNNs cannot model long-range dependencies effectively and struggle to capture long-distance features of a tracked target, which limits the efficiency and accuracy of target tracking. Since the release of GPT-3, built on the transformer architecture, in 2020, transformers have demonstrated strong capabilities in handling sequential data.
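As a minimal illustration of this contrast (a sketch under assumed tensor shapes, not code from any specific submission), the following PyTorch fragment compares the local receptive field of a convolution with the global, pairwise interactions of self-attention:

```python
# Illustrative sketch only: shapes and module choices are assumptions.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)           # batch, channels, height, width

# A 3x3 convolution mixes information only within a local neighbourhood per layer.
local_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
y_local = local_conv(x)                  # (1, 64, 32, 32)

# Self-attention treats the 32x32 grid as a sequence of 1024 tokens, so every
# position can attend to every other position in a single layer (long-range dependencies).
tokens = x.flatten(2).transpose(1, 2)    # (1, 1024, 64)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
y_global, _ = attn(tokens, tokens, tokens)
print(y_local.shape, y_global.shape)
```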

Although CNN models have achieved significant success in image processing over the years, many challenges remain in practical applications, such as complex-scene image classification, specific object recognition and tracking, and medical image processing. This situation highlights a noticeable gap between theoretical advances and practical applications in the field.

Therefore, we invite submissions of studies on theoretical research and practical applications related to transformer and deep learning architectures in the fields of medical image analysis, image classification, recognition, and tracking.

We welcome submissions on topics including, but not limited to, the following:

  • Novel architectures and variations of transformers and deep learning models;
  • Fine-tuning strategies for pre-trained transformers and deep learning models;
  • Image classification based on transformers and deep learning models;
  • Image recognition based on transformers and deep learning models;
  • Medical image processing based on transformers and deep learning models;
  • Object tracking based on transformers and deep learning models;
  • Transformers and deep learning for sciences;
  • Transformers and convolutional neural network fusion architecture;
  • Transformers and deep learning for diverse machine learning tasks;
  • Natural language processing.

Prof. Dr. Fengping An
Prof. Dr. Haitao Xu
Dr. Chuyang Ye
Topic Editors

Keywords

  • image processing
  • medical image processing
  • transformer
  • deep learning
  • CNN

Participating Journals

Journal Name         Impact Factor   CiteScore   Launched Year   First Decision (median)   APC
Diagnostics          3.3             5.9         2011            21.6 Days                 CHF 2600
Electronics          2.6             6.1         2012            16.4 Days                 CHF 2400
Journal of Imaging   3.3             6.7         2015            18 Days                   CHF 1800
Mathematics          2.2             4.6         2013            17.3 Days                 CHF 2600
Sensors              3.5             8.2         2001            17.8 Days                 CHF 2600

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

  1. Share your research immediately: Disseminate your ideas prior to publication and establish priority for your work.
  2. Safeguard your intellectual contribution: Protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
  3. Boost visibility and impact: Increase the reach and influence of your research by making it accessible to a global audience.
  4. Gain early feedback: Receive valuable input and insights from peers before submitting to a journal.
  5. Ensure broad indexing: Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (9 papers)

24 pages, 3126 KB  
Article
Calibrated Transformer Fusion for Dual-View Low-Energy CESM Classification
by Ahmed A. H. Alkurdi and Amira Bibo Sallow
J. Imaging 2026, 12(1), 41; https://doi.org/10.3390/jimaging12010041 - 13 Jan 2026
Cited by 1 | Viewed by 634
Abstract
Contrast-enhanced spectral mammography (CESM) provides low-energy images acquired in standard craniocaudal (CC) and mediolateral oblique (MLO) views, and clinical interpretation relies on integrating both views. This study proposes a dual-view classification framework that combines deep CNN feature extraction with transformer-based fusion for breast-side classification using low-energy (DM) images from CESM acquisitions (Normal vs. Tumorous; benign and malignant merged). The evaluation was conducted using 5-fold stratified group cross-validation with patient-level grouping to prevent leakage across folds. The final configuration (Model E) integrates dual-backbone feature extraction, transformer fusion, MC-dropout inference for uncertainty estimation, and post hoc logistic calibration. Across the five held-out test folds, Model E achieved a mean accuracy of 96.88% ± 2.39% and a mean F1-score of 97.68% ± 1.66%. The mean ROC-AUC and PR-AUC were 0.9915 ± 0.0098 and 0.9968 ± 0.0029, respectively. Probability quality was supported by a mean Brier score of 0.0236 ± 0.0145 and a mean expected calibration error (ECE) of 0.0334 ± 0.0171. An ablation study (Models A–E) was also reported to quantify the incremental contribution of dual-view input, transformer fusion, and uncertainty calibration. Within the limits of this retrospective single-center setting, these results suggest that dual-view transformer fusion can provide strong discrimination while also producing calibrated probabilities and uncertainty outputs that are relevant for decision support.
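The exact architecture is not reproduced here; the sketch below only illustrates the general pattern the abstract describes (two per-view feature vectors fused by a transformer encoder, with Monte Carlo dropout kept active at inference). All layer sizes, the two-token fusion scheme, and the number of MC samples are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of dual-view transformer fusion with MC-dropout inference.
import torch
import torch.nn as nn

class DualViewFusion(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           dropout=0.2, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Dropout(0.2), nn.Linear(feat_dim, num_classes))

    def forward(self, cc_feat, mlo_feat):
        # Treat the CC and MLO feature vectors as a two-token sequence and let
        # self-attention fuse them before classification.
        tokens = torch.stack([cc_feat, mlo_feat], dim=1)   # (B, 2, feat_dim)
        fused = self.fusion(tokens).mean(dim=1)            # (B, feat_dim)
        return self.head(fused)

def mc_dropout_predict(model, cc_feat, mlo_feat, n_samples=20):
    # Keep dropout active at inference and average softmax outputs; the spread
    # across samples serves as a simple uncertainty proxy.
    model.train()
    with torch.no_grad():
        probs = torch.stack([model(cc_feat, mlo_feat).softmax(-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)

model = DualViewFusion()
cc, mlo = torch.randn(4, 256), torch.randn(4, 256)
mean_p, std_p = mc_dropout_predict(model, cc, mlo)
```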

22 pages, 1777 KB  
Article
DP2PNet: Diffusion-Based Point-to-Polygon Conversion for Single-Point Supervised Oriented Object Detection
by Peng Li, Limin Zhang and Tao Qu
Sensors 2026, 26(1), 329; https://doi.org/10.3390/s26010329 - 4 Jan 2026
Viewed by 530
Abstract
Rotated Bounding Boxes (RBBs) for oriented object detection are labor-intensive and time-consuming to annotate. Single-point supervision offers a cost-effective alternative but suffers from insufficient size and orientation information, leading existing methods to rely heavily on complex priors and fixed refinement stages. In this paper, we propose DP2PNet (Diffusion-Point-to-Polygon Network), the first diffusion model-based framework for single-point supervised oriented object detection. DP2PNet features three key innovations: (1) A multi-scale consistent noise generator that replaces manual or external model priors with Gaussian noise, reducing dependency on domain-specific information; (2) A Noise Cross-Constraint module based on multi-instance learning, which selects optimal noise point bags by fusing receptive field matching and object coverage; (3) A Semantic Key Point Aggregator that aggregates noise points via graph convolution to form semantic key points, from which pseudo-RBBs are generated using convex hulls. DP2PNet supports dynamic adjustment of refinement stages without retraining, enabling flexible accuracy optimization. Extensive experiments on DOTA-v1.0 and DIOR-R datasets demonstrate that DP2PNet achieves 53.82% and 53.61% mAP50, respectively, comparable to methods relying on complex priors. It also exhibits strong noise robustness and cross-dataset generalization.
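As a hedged illustration of the final step mentioned above, the sketch below turns a set of 2D key points into a rotated pseudo-box via a convex hull and a minimum-area rectangle; the use of OpenCV and these specific functions are implementation assumptions, not the authors' code.

```python
# Hedged sketch: aggregated key points -> convex hull -> rotated pseudo-box (RBB).
import numpy as np
import cv2

def points_to_pseudo_rbb(points):
    """points: (N, 2) float array of semantic key points in image coordinates."""
    pts = np.asarray(points, dtype=np.float32)
    hull = cv2.convexHull(pts)                        # convex hull of the key points
    (cx, cy), (w, h), angle = cv2.minAreaRect(hull)   # minimum-area rotated rectangle
    return cx, cy, w, h, angle                        # rotated pseudo-label

keypoints = np.array([[10, 12], [42, 15], [40, 38], [12, 35], [25, 22]], dtype=np.float32)
print(points_to_pseudo_rbb(keypoints))
```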

18 pages, 2632 KB  
Article
Adverse-Weather Image Restoration Method Based on VMT-Net
by Zhongmin Liu, Xuewen Yu and Wenjin Hu
J. Imaging 2025, 11(11), 376; https://doi.org/10.3390/jimaging11110376 - 26 Oct 2025
Viewed by 1417
Abstract
To address global semantic loss, local detail blurring, and spatial–semantic conflict during image restoration under adverse weather conditions, we propose an image restoration network that integrates Mamba with Transformer architectures. We first design a Vision-Mamba–Transformer (VMT) module that combines the long-range dependency modeling of Vision Mamba with the global contextual reasoning of Transformers, facilitating the joint modeling of global structures and local details, thus mitigating information loss and detail blurring during restoration. Second, we introduce an Adaptive Content Guidance (ACG) module that employs dynamic gating and spatial–channel attention to enable effective inter-layer feature fusion, thereby enhancing cross-layer semantic consistency. Finally, we embed the VMT and ACG modules into a U-Net backbone, achieving efficient integration of multi-scale feature modeling and cross-layer fusion, significantly improving reconstruction quality under complex weather conditions. The experimental results show that on Snow100K-S/L, VMT-Net improves PSNR over the baseline by approximately 0.89 dB and 0.36 dB, with SSIM gains of about 0.91% and 0.11%, respectively. On Outdoor-Rain and Raindrop, it performs similarly to the baseline and exhibits superior detail recovery in real-world scenes. Overall, the method demonstrates robustness and strong detail restoration across diverse adverse-weather conditions.
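The reported gains are in PSNR (and SSIM); for reference, the snippet below shows the standard PSNR computation for 8-bit images. It reflects the metric's definition only, not the paper's evaluation code.

```python
# Standard PSNR for 8-bit images (metric definition, not the paper's code).
import numpy as np

def psnr(reference, restored, max_val=255.0):
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, noisy):.2f} dB")
```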

21 pages, 6219 KB  
Article
Model-Free Transformer Framework for 6-DoF Pose Estimation of Textureless Tableware Objects
by Jungwoo Lee, Hyogon Kim, Ji-Wook Kwon, Sung-Jo Yun, Na-Hyun Lee, Young-Ho Choi, Goobong Chung and Jinho Suh
Sensors 2025, 25(19), 6167; https://doi.org/10.3390/s25196167 - 5 Oct 2025
Cited by 1 | Viewed by 1379
Abstract
Tableware objects such as plates, bowls, and cups are usually textureless, uniform in color, and vary widely in shape, making it difficult to apply conventional pose estimation methods that rely on texture cues or object-specific CAD models. These limitations present a significant obstacle to robotic manipulation in restaurant environments, where reliable six-degree-of-freedom (6-DoF) pose estimation is essential for autonomous grasping and collection. To address this problem, we propose a model-free and texture-free 6-DoF pose estimation framework based on a transformer encoder architecture. This method uses only geometry-based features extracted from depth images, including surface vertices and rim normals, which provide strong structural priors. The pipeline begins with object detection and segmentation using a pretrained video foundation model, followed by the generation of uniformly partitioned grids from depth data. For each grid cell, centroid positions and surface normals are computed and processed by a transformer-based model that jointly predicts object rotation and translation. Experiments with ten types of tableware demonstrate that the method achieves an average rotational error of 3.53 degrees and a translational error of 13.56 mm. Real-world deployment on a mobile robot platform with a manipulator further validated its ability to autonomously recognize and collect tableware, highlighting the practicality of the proposed geometry-driven approach for service robotics.
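A hedged sketch of the kind of geometry features described above: back-projected depth points are partitioned into a grid, and a centroid plus a PCA-based surface normal are computed per cell. The grid size and the PCA normal estimate are illustrative assumptions rather than the paper's exact pipeline.

```python
# Hedged sketch: per-grid-cell centroid and surface-normal features from a point cloud.
import numpy as np

def grid_geometry_features(points, grid=8):
    """points: (N, 3) array of 3D points from one segmented object."""
    xy_min, xy_max = points[:, :2].min(0), points[:, :2].max(0)
    cell = np.floor((points[:, :2] - xy_min) / (xy_max - xy_min + 1e-9) * grid)
    cell = np.clip(cell, 0, grid - 1).astype(int)
    feats = []
    for gx in range(grid):
        for gy in range(grid):
            pts = points[(cell[:, 0] == gx) & (cell[:, 1] == gy)]
            if len(pts) < 3:
                continue
            centroid = pts.mean(0)
            # Surface normal: direction of smallest variance of the cell's points.
            _, _, vt = np.linalg.svd(pts - centroid)
            feats.append(np.concatenate([centroid, vt[-1]]))
    return np.stack(feats)          # (num_valid_cells, 6) tokens for a transformer

cloud = np.random.rand(2000, 3)
print(grid_geometry_features(cloud).shape)
```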

20 pages, 2226 KB  
Article
RST-Net: A Semantic Segmentation Network for Remote Sensing Images Based on a Dual-Branch Encoder Structure
by Na Yang, Chuanzhao Tian, Xingfa Gu, Yanting Zhang, Xuewen Li and Feng Zhang
Sensors 2025, 25(17), 5531; https://doi.org/10.3390/s25175531 - 5 Sep 2025
Cited by 2 | Viewed by 1989
Abstract
High-resolution remote sensing images often suffer from inadequate fusion between global and local features, leading to the loss of long-range dependencies and blurred spatial details, while also exhibiting limited adaptability to multi-scale object segmentation. To overcome these limitations, this study proposes RST-Net, a semantic segmentation network featuring a dual-branch encoder structure. The encoder integrates a ResNeXt-50-based CNN branch for extracting local spatial features and a Shunted Transformer (ST) branch for capturing global contextual information. To further enhance multi-scale representation, the multi-scale feature enhancement module (MSFEM) is embedded in the CNN branch, leveraging atrous and depthwise separable convolutions to dynamically aggregate features. Additionally, the residual dynamic feature fusion (RDFF) module is incorporated into skip connections to improve interactions between encoder and decoder features. Experiments on the Vaihingen and Potsdam datasets show that RST-Net achieves promising performance, with MIoU scores of 77.04% and 79.56%, respectively, validating its effectiveness in semantic segmentation tasks.
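MIoU is the headline metric here; the snippet below shows the standard mean IoU computation from a confusion matrix (the metric's definition, not the authors' evaluation code).

```python
# Standard mean IoU (MIoU) from a confusion matrix (metric definition only).
import numpy as np

def mean_iou(pred, target, num_classes):
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, t in zip(pred.ravel(), target.ravel()):
        conf[t, p] += 1
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)   # guard against empty classes
    return iou.mean()

pred = np.random.randint(0, 6, (256, 256))
target = np.random.randint(0, 6, (256, 256))
print(f"MIoU: {mean_iou(pred, target, num_classes=6):.4f}")
```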

16 pages, 9189 KB  
Article
SEND: Semantic-Aware Deep Unfolded Network with Diffusion Prior for Multi-Modal Image Fusion and Object Detection
by Rong Zhang, Mao-Yi Xiong and Jun-Jie Huang
Mathematics 2025, 13(16), 2584; https://doi.org/10.3390/math13162584 - 12 Aug 2025
Viewed by 1286
Abstract
Multi-modality image fusion (MIF) aims to integrate complementary information from diverse imaging modalities into a single comprehensive representation and serves as an essential processing step for downstream high-level computer vision tasks. The existing deep unfolding-based processes demonstrate promising results; however, they often rely on deterministic priors with limited generalization ability and usually decouple from the training process of object detection. In this paper, we propose Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND), a novel framework designed for transparent and effective multi-modality fusion and object detection. SEND consists of a Denoising Prior Guided Fusion Module and a Fusion Object Detection Module. The Denoising Prior Guided Fusion Module does not utilize the traditional deterministic prior but combines the diffusion prior with deep unfolding, leading to improved multi-modal fusion performance and generalization ability. It is designed with a model-based optimization formulation for multi-modal image fusion, which is unfolded into two cascaded blocks: a Diffusion Denoising Fusion Block to generate informative diffusion priors and a Data Consistency Enhancement Block that explicitly aggregates complementary features from both the diffusion priors and input modalities. Additionally, SEND incorporates the Fusion Object Detection Module with the Denoising Prior Guided Fusion Module for object detection task optimization using a carefully designed two-stage training strategy. Experiments demonstrate that the proposed SEND method outperforms state-of-the-art methods, achieving superior fusion quality with improved efficiency and interpretability.
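The deep-unfolding idea can be summarized as a fixed number of stages that alternate a learned prior step with a data-consistency step against the input modalities. The sketch below is a heavily simplified stand-in for that pattern: the convolutional "prior" blocks and the averaging-style consistency update are placeholders, not SEND's actual Diffusion Denoising Fusion or Data Consistency Enhancement blocks.

```python
# Heavily hedged stand-in for a deep-unfolded fusion loop; all blocks are placeholders.
import torch
import torch.nn as nn

class UnfoldedFusion(nn.Module):
    def __init__(self, stages=3, channels=1):
        super().__init__()
        self.prior_blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(stages)])

    def forward(self, modality_a, modality_b):
        x = 0.5 * (modality_a + modality_b)          # simple initialization
        for block in self.prior_blocks:
            x = block(x)                             # stand-in for the learned prior step
            # Data-consistency step: pull the estimate back toward both modalities.
            x = x + 0.25 * ((modality_a - x) + (modality_b - x))
        return x

ir, vis = torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64)
print(UnfoldedFusion()(ir, vis).shape)
```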

25 pages, 3449 KB  
Article
CSANet: Context–Spatial Awareness Network for RGB-T Urban Scene Understanding
by Ruixiang Li, Zhen Wang, Jianxin Guo and Chuanlei Zhang
J. Imaging 2025, 11(6), 188; https://doi.org/10.3390/jimaging11060188 - 9 Jun 2025
Cited by 1 | Viewed by 1781
Abstract
Semantic segmentation plays a critical role in understanding complex urban environments, particularly for autonomous driving applications. However, existing approaches face significant challenges under low-light and adverse weather conditions. To address these limitations, we propose CSANet (Context Spatial Awareness Network), a novel framework that effectively integrates RGB and thermal infrared (TIR) modalities. CSANet employs an efficient encoder to extract complementary local and global features, while a hierarchical fusion strategy is adopted to selectively integrate visual and semantic information. Notably, the Channel–Spatial Cross-Fusion Module (CSCFM) enhances local details by fusing multi-modal features, and the Multi-Head Fusion Module (MHFM) captures global dependencies and calibrates multi-modal information. Furthermore, the Spatial Coordinate Attention Mechanism (SCAM) improves object localization accuracy in complex urban scenes. Evaluations on benchmark datasets (MFNet and PST900) demonstrate that CSANet achieves state-of-the-art performance, significantly advancing RGB-T semantic segmentation.
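As a loose, hedged illustration of channel- and spatial-attention fusion of RGB and thermal features, the sketch below reweights a concatenated feature map along channels and then along spatial positions; all module names, kernel sizes, and the gating design are assumptions, not CSANet's implementation.

```python
# Hedged sketch of generic channel/spatial attention fusion for RGB-T features.
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, tir_feat):
        x = torch.cat([rgb_feat, tir_feat], dim=1)            # (B, 2C, H, W)
        x = x * self.channel_gate(x)                           # reweight channels
        spatial = torch.cat([x.mean(1, keepdim=True),
                             x.amax(1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(spatial)                     # reweight spatial positions
        return self.project(x)                                 # fused (B, C, H, W)

rgb, tir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(ChannelSpatialFusion()(rgb, tir).shape)
```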

25 pages, 12873 KB  
Article
License Plate Detection Based on Improved YOLOv8n Network
by Ruizhe Zhu, Qiyang He, Hai Jin, Yonghua Han and Kejian Jiang
Electronics 2025, 14(10), 2065; https://doi.org/10.3390/electronics14102065 - 20 May 2025
Cited by 2 | Viewed by 4491
Abstract
To address the challenges of complex backgrounds, varying target scales, and dense targets in license plate detection within surveillance scenarios, we propose an enhanced method based on an improved YOLOv8n network. This approach involves redesigning key components of the YOLOv8n architecture, including the C2f module, the SPPF module, and the detection head. Additionally, we optimize the WIoU loss function, replacing the original CIoU loss function, which leads to improved bounding box feature extraction and enhanced regression accuracy. To evaluate the model’s robustness in complex environments with varying lighting, backgrounds, angles, and vehicle types, we created a custom surveillance license plate dataset. Experimental results show that the improved model achieves a notable increase in detection accuracy, with mAP@0.5 rising from 90.9% in the baseline model to 94.4%, precision improving from 90.2% to 92.8%, and recall increasing from 82.9% to 87.9%. Additionally, the model’s parameters are reduced from 3.1 M to 2.1 M, significantly enhancing computational efficiency. Moreover, the model achieves an inference speed of 86 FPS, maintaining high precision and meeting real-time detection requirements. This demonstrates that our method provides an efficient and reliable solution for license plate detection in surveillance scenarios.
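For reference, the snippet below sketches the CIoU regression loss mentioned as the baseline (boxes as x1, y1, x2, y2), following the published CIoU definition; it is not the repository's exact implementation, and the WIoU variant adopted by the authors is not reproduced here.

```python
# Hedged sketch of the CIoU loss (published definition, not the authors' code).
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Normalized centre-distance penalty over the enclosing box diagonal
    cx_p = (pred[:, 0] + pred[:, 2]) / 2; cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2; cy_t = (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - (iou - rho2 / c2 - alpha * v)).mean()

pred = torch.tensor([[10., 10., 50., 60.]]); gt = torch.tensor([[12., 8., 48., 62.]])
print(ciou_loss(pred, gt))
```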

15 pages, 3571 KB  
Article
Lightweight UAV Landing Model Based on Visual Positioning
by Ning Zhang, Junnan Tan, Kaichun Yan and Sang Feng
Sensors 2025, 25(3), 884; https://doi.org/10.3390/s25030884 - 31 Jan 2025
Cited by 5 | Viewed by 1978
Abstract
In order to enhance the precision of UAV (unmanned aerial vehicle) landings and realize the convenient and rapid deployment of the model to the mobile terminal, this study proposes a Land-YOLO lightweight UAV-guided landing algorithm based on the YOLOv8n model. Firstly, GhostConv replaces standard convolutions in the backbone network, leveraging existing feature maps to create additional “ghost” feature maps via low-cost linear transformations, thereby lightening the network structure. Additionally, the CSP structure of the neck network is enhanced by incorporating the PartialConv structure. This integration allows for the transmission of certain channel characteristics through identity mapping, effectively reducing both the number of parameters and the computational load of the model. Finally, the bidirectional feature pyramid network (BiFPN) module is introduced, improving the precision and mean average precision of landing-mark recognition through its bidirectional and weighted feature fusion mechanism. The experimental results show that, on landing-sign datasets collected in real and virtual environments, the Land-YOLO algorithm is 1.4% higher in precision and 0.91% higher in mAP@0.5 than the original YOLOv8n baseline, meeting the detection requirements for landing signs. The model’s memory usage and floating-point operations (FLOPs) are reduced by 42.8% and 32.4%, respectively, making it more suitable for deployment on the mobile terminal of a UAV.
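A hedged sketch of a GhostConv-style block as described above: a primary convolution produces half of the output channels, and a cheap depthwise convolution derives the remaining "ghost" channels from them. The kernel sizes and 1:1 split are illustrative assumptions, not the exact Land-YOLO configuration.

```python
# Hedged sketch of a GhostConv-style block (details are assumptions).
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=1, stride=1):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, stride,
                      kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.SiLU())
        # Cheap operation: a depthwise 5x5 convolution derives the ghost maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, primary_ch, 5, 1, 2,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(primary_ch), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # (B, out_ch, H, W)

block = GhostConv(64, 128)
print(block(torch.randn(1, 64, 40, 40)).shape)        # torch.Size([1, 128, 40, 40])
```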
