AI/Machine Learning in Computer Vision/Image Processing and Natural Language Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 15 May 2025

Special Issue Editors


Guest Editor
1. ITI/Larsys, Agência Regional para o Desenvolvimento da Investigação, Tecnologia e Inovação, Caminho da Penteada, 9020-125 Funchal, Madeira, Portugal
2. Biomedical Engineering Group, Department of Cybernetics and Biomedical Engineering, Faculty of Electrical Engineering and Computer Science, VSB – Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava, Czech Republic
3. Department of Engineering and Exact Sciences, University of Madeira, Caminho da Penteada, 9020-125 Funchal, Madeira, Portugal
Interests: computer vision; deep learning; artificial intelligence

Guest Editor
1. ITI/Larsys/Madeira Interactive Technologies Institute, 9020-105 Funchal, Portugal
2. Institute for Technological Development and Innovation in Communications, Universidad de Las Palmas de Gran Canaria, 35001 Las Palmas de Gran Canaria, Spain
Interests: data analysis; signal processing; artificial intelligence

Guest Editor
Biomedical Engineering Group, Department of Cybernetics and Biomedical Engineering, Faculty of Electrical Engineering and Computer Science, VSB – Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava, Czech Republic
Interests: contactless vital signs monitoring; assistive technologies; telemedicine; fuzzy logic

Special Issue Information

Dear Colleagues,

With the rapid pace of innovation in generic and robust artificial intelligence (AI) and machine learning (ML) algorithms and methods, it is challenging to keep track of recent developments in domains such as computer vision (CV) and natural language processing (NLP). Equally important is the exploration of these algorithms and their robustness in application areas such as healthcare, agriculture, remote sensing, marine diversity exploration, signal processing, and vital signs monitoring. This Special Issue therefore aims to serve as a single reference point for the AI community dedicated to developing robust AI/ML methods and algorithms in CV and NLP and to applying them to real-time problems in diverse domains. It will compile significant upcoming research developments in AI and machine learning, covering both high-level tasks (pattern recognition and sophisticated reasoning) and low-level tasks (feature creation and extraction) across the two domains.

This Special Issue covers all aspects of AI and ML, including but not limited to the following topics:

  • Image captioning, object detection and recognition, and scene understanding;
  • Text summarization, machine translation, and sentiment analysis;
  • Large language/vision models and foundation models in the context of CV/NLP;
  • Causal and/or explainable artificial intelligence in the context of CV/NLP;
  • Edge detection and image or video denoising and deblurring;
  • Generative modelling and its applications in the context of CV/NLP;
  • Reinforcement learning and its applications in the context of CV/NLP;
  • Computer vision-based remote health monitoring, such as remote photoplethysmography or vital signs monitoring;
  • AI and machine learning methods for biomedical imaging;
  • NLP in healthcare.

We look forward to your contributions to this Special Issue.

Regards,

Dr. Ankit Gupta
Dr. Morgado Dias
Dr. Antonio G. Ravelo-Garcia
Dr. Martin Černý
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • computer vision
  • reinforcement learning
  • explainable AI
  • natural language processing
  • multimodal AI

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

22 pages, 1959 KiB  
Article
DMFormer: Dense Memory Linformer for Image Captioning
by Yuting He and Zetao Jiang
Electronics 2025, 14(9), 1716; https://doi.org/10.3390/electronics14091716 - 23 Apr 2025
Abstract
Image captioning is a task at the intersection of computer vision and natural language processing that aims to describe image content in natural language. Existing methods still have deficiencies in modeling the spatial location and semantic correlation between image regions, and they often exhibit insufficient interaction between image features and text features. To address these issues, we propose a Linformer-based image captioning method, the Dense Memory Linformer for Image Captioning (DMFormer), which has lower time and space complexity than the traditional Transformer architecture. The DMFormer contains two core modules: the Relation Memory Augmented Encoder (RMAE) and the Dense Memory Augmented Decoder (DMAD). In the RMAE, we propose Relation Memory Augmented Attention (RMAA), which combines explicit and implicit spatial perception: it explicitly uses geometric information to model the geometric correlation between image regions and implicitly constructs memory unit matrices to learn the contextual information of image region features. In the DMAD, we introduce Dense Memory Augmented Cross Attention (DMACA). This module fully utilizes the low-level and high-level features generated by the RMAE through dense connections and constructs memory units to store prior knowledge of images and text, learning the cross-modal associations between visual and linguistic features through an adaptive gating mechanism. Experimental results on the MS-COCO dataset show that the descriptions generated by the DMFormer are richer and more accurate, with significant improvements in various evaluation metrics compared to mainstream methods.
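The memory-augmented attention described above extends standard attention by letting queries also attend to a set of learnable memory slots that store prior knowledge. The PyTorch fragment below is only a minimal sketch of that general idea, not the authors' RMAA/DMACA implementation; the module name, slot count, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Generic sketch: multi-head attention whose keys/values are extended
    with learnable memory slots that can store dataset-level priors."""
    def __init__(self, dim=512, num_heads=8, num_memory_slots=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable memory slots, shared across all samples in a batch.
        self.mem_k = nn.Parameter(torch.randn(1, num_memory_slots, dim) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, num_memory_slots, dim) * 0.02)

    def forward(self, x):                        # x: (B, N, dim) region features
        b = x.size(0)
        k = torch.cat([x, self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([x, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(x, k, v)              # queries attend to regions + memory
        return out

features = torch.randn(2, 50, 512)               # 2 images, 50 regions each
print(MemoryAugmentedAttention()(features).shape)  # torch.Size([2, 50, 512])
```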

19 pages, 1320 KiB  
Article
SkinSavvy2: Augmented Skin Lesion Diagnosis and Personalized Medical Consultation System
by Hyungjoon Kim, Yunju Kim and Wonho Song
Electronics 2025, 14(5), 969; https://doi.org/10.3390/electronics14050969 - 28 Feb 2025
Abstract
The shortage of medical personnel and the busy lives of modern people have increased the demand for the self-diagnosis of diseases, and the latest large language models and image recognition technologies have the potential to meet this demand. Skin diseases in particular have visually distinguishable symptoms, making self-diagnosis and self-care feasible. In this paper, we propose a system that classifies skin diseases from images and combines the predictions with individual attributes such as age, skin type, and gender for self-diagnosis. First, we design a skin disease classifier based on recent deep learning models that distinguishes six types of skin disease using the HAM10000 dataset, and we generate prompts by combining the classification result with the personal information provided by the user. Using a Generative Pre-trained Transformer (GPT) model, the system then generates personalized care recommendations from these prompts. We measured the accuracy of the classification model and validated the effectiveness of the proposed method through user evaluations.
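To illustrate how a classifier's prediction might be combined with the user's personal information into a prompt for a GPT-style model, the snippet below gives a minimal sketch; the paper does not specify its prompt format, so the template and field names here are assumptions.

```python
def build_care_prompt(predicted_lesion: str, confidence: float,
                      age: int, gender: str, skin_type: str) -> str:
    """Combine the classifier output with user-supplied attributes into a
    single prompt for a generative language model (illustrative only)."""
    return (
        f"A skin image classifier predicts '{predicted_lesion}' "
        f"with confidence {confidence:.0%}. "
        f"The user is {age} years old, {gender}, with {skin_type} skin. "
        "Suggest general self-care guidance and note when to see a dermatologist."
    )

prompt = build_care_prompt("benign keratosis", 0.87, 34, "female", "dry")
print(prompt)  # this prompt would then be sent to a GPT-style model
```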

20 pages, 12240 KiB  
Article
Character Can Speak Directly: An End-to-End Character Region Excavation Network for Scene Text Spotting
by Yan Li, Yan Shu, Binyang Li and Ruifeng Xu
Electronics 2025, 14(5), 851; https://doi.org/10.3390/electronics14050851 - 21 Feb 2025
Abstract
End-to-end scene text spotting methods have garnered significant research attention due to their promising results. However, most existing approaches are not well suited for real-world applications because of their inherently complex pipelines. In this paper, we propose an end-to-end Character Region Excavation Network (CRENet) to streamline the text spotting pipeline. Our contributions are threefold: (i) Pipeline simplification: For the first time, we eliminate the text region retrieval step, allowing characters to be directly spotted from scene images. (ii) ROA layer: We introduce a novel RoI (Region of Interest) feature sampling layer for multi-oriented character region feature sampling, significantly enhancing the recognizer’s performance. (iii) Progressive learning strategy: We propose a progressive learning strategy to gradually bridge the gap between synthetic data and real-world images, addressing the challenge posed by the high cost of character-level annotations required during training. Extensive experiments demonstrate that our proposed method is robust and effective across horizontal, oriented, and curved text, achieving results comparable to state-of-the-art methods on ICDAR 2013, ICDAR 2015, Total-Text, and ReCTS.
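The progressive learning strategy mentioned in (iii) gradually shifts training from synthetic, character-annotated data toward real scene images. The sketch below shows one simple way such a schedule could be expressed; the linear ramp, epoch thresholds, and function names are assumptions rather than the authors' recipe.

```python
import random

def real_data_fraction(epoch: int, warmup_epochs: int = 10, ramp_epochs: int = 20) -> float:
    """Fraction of real (character-annotated) images mixed into each batch.
    Starts at 0 (pure synthetic), then ramps linearly up to 1.0."""
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs) / ramp_epochs)

def sample_batch(synthetic_pool, real_pool, batch_size, epoch):
    """Draw a mixed batch according to the current schedule."""
    n_real = round(batch_size * real_data_fraction(epoch))
    batch = random.sample(real_pool, n_real)
    batch += random.sample(synthetic_pool, batch_size - n_real)
    return batch

print([round(real_data_fraction(e), 2) for e in (0, 10, 20, 30, 40)])  # [0.0, 0.0, 0.5, 1.0, 1.0]
```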

21 pages, 4068 KiB  
Article
Three-Dimensional Mesh Character Pose Transfer with Neural Sparse-Softmax Skinning Blending
by Siqi Liu, Mengxiao Yin, Ming Li, Feng Zhan and Bei Hua
Electronics 2025, 14(3), 589; https://doi.org/10.3390/electronics14030589 - 1 Feb 2025
Abstract
Three-dimensional mesh pose transfer transforms the pose of a source model into the pose of a reference model while preserving the source model’s identity (body detail), and has great potential in computer graphics. Current neural network-based methods focus primarily on extracting pose and body features without fully exploiting the articulated body structure of humans and animals. To address this, we propose an end-to-end pose transfer network based on skinning deformation. The network first extracts skinning weights and joint features, which are then decoded to transfer the source model to a pose similar to that of the reference model while preserving the source model’s characteristics. During feature extraction, we use features from k-nearest and one-ring neighborhoods so that the network learns the body details of the model better. We further apply the skinning weights and joint features to capture the change of the source pose relative to the reference pose, and use a decoding network in place of linear blend skinning to obtain the target model. Experiments on the SMPL, SMAL, FAUST, DYNA, and MG datasets show that our method achieves the best quantitative results, transferring poses efficiently while better preserving the identity of the source model.
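For reference, the classical linear blend skinning that the learned decoder replaces computes each deformed vertex as a weighted sum of per-joint rigid transforms, v_i' = Σ_j w_ij (R_j v_i + t_j). The NumPy sketch below implements only that baseline operation, not the paper's network.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """Classical LBS: v_i' = sum_j w_ij * (R_j @ v_i + t_j).

    vertices:     (V, 3) rest-pose vertex positions
    weights:      (V, J) skinning weights, rows sum to 1
    rotations:    (J, 3, 3) per-joint rotation matrices
    translations: (J, 3)   per-joint translations
    """
    # Transform every vertex by every joint: (J, V, 3)
    per_joint = np.einsum('jab,vb->jva', rotations, vertices) + translations[:, None, :]
    # Blend the J candidate positions with the per-vertex weights: (V, 3)
    return np.einsum('vj,jva->va', weights, per_joint)

V, J = 4, 2
verts = np.random.rand(V, 3)
w = np.random.rand(V, J); w /= w.sum(axis=1, keepdims=True)
R = np.stack([np.eye(3)] * J)            # identity rotations
t = np.zeros((J, 3))
assert np.allclose(linear_blend_skinning(verts, w, R, t), verts)  # identity pose keeps the mesh unchanged
```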

17 pages, 16060 KiB  
Article
Channel-Wise Attention-Enhanced Feature Mutual Reconstruction for Few-Shot Fine-Grained Image Classification
by Qianying Ou and Jinmiao Zou
Electronics 2025, 14(2), 377; https://doi.org/10.3390/electronics14020377 - 19 Jan 2025
Cited by 1
Abstract
Fine-grained image classification faces the challenge of significant intra-class differences and high similarity between classes, with only a limited amount of labelled data. Previous few-shot learning approaches, however, often fail to recognize the discriminative details, such as a bird’s eyes and beak. In this paper, we propose a channel-wise attention-enhanced feature mutual reconstruction mechanism that helps to alleviate these problems for fine-grained image classification. The mechanism first employs a channel-wise attention module (CAM) to learn channel weights for both the support and query features, using channel-wise self-attention to assign greater importance to object-relevant channels. This helps the model focus on subtle yet discriminative details, which is essential for classification. We then introduce a feature mutual reconstruction module (FMRM): support features are reconstructed from a support-weight-enhanced feature map to reduce intra-class variation, and query features are reconstructed from a query-weight-enhanced feature map to increase inter-class variation. Classification is based on the similarity between the reconstructed and enhanced features. We evaluated the method on four fine-grained image datasets using Conv-4 and ResNet-12 backbones, and the experimental results show that it outperforms previous few-shot fine-grained classification methods. This demonstrates that our method improves fine-grained image classification performance while balancing inter-class and intra-class variation.
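Channel-wise attention of this kind generally follows the squeeze-and-excitation pattern: a per-channel descriptor is pooled from the feature map, a small MLP produces channel weights, and the map is rescaled. The sketch below shows that generic pattern with hypothetical layer sizes; it is not the paper's exact CAM.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic channel-wise attention (squeeze-and-excitation style)."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W) feature map
        w = self.mlp(x.mean(dim=(2, 3)))     # squeeze -> (B, C) channel weights
        return x * w[:, :, None, None]       # rescale object-relevant channels

feat = torch.randn(5, 64, 10, 10)            # e.g. 5 support images
print(ChannelAttention()(feat).shape)        # torch.Size([5, 64, 10, 10])
```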

25 pages, 3210 KiB  
Article
In-Depth Collaboratively Supervised Video Instance Segmentation
by Yunnan Deng, Yinhui Zhang and Zifen He
Electronics 2025, 14(2), 363; https://doi.org/10.3390/electronics14020363 - 17 Jan 2025
Abstract
Video instance segmentation (VIS) is hampered by the high cost of pixel-level annotation and the shortcomings of weakly supervised segmentation, creating an urgent need for a trade-off between annotation cost and performance. We propose a novel In-Depth Collaboratively Supervised video instance segmentation (IDCS) method with efficient training. A collaboratively supervised training pipeline is designed to route samples with different labeling levels through multimodal training, in which instance clues obtained from mask-annotated instances guide the box-annotated training via an in-depth collaborative paradigm: (1) a trident learning method is proposed that leverages temporal consistency in video to match instances with multimodal annotations across frames, enabling effective instance relation learning without additional network parameters; (2) spatial clues in the first frames are captured to perform multidimensional pixel affinity evaluation of box-annotated instances and to augment the noise-disturbed spatial affinity map. Experiments on YouTube-VIS, with mask-annotated instances in the first frames and bounding-box-annotated samples in the remaining frames, validate the performance of IDCS. IDCS achieves up to 92.0% of fully supervised performance while training on average 1.4 times faster and scoring 2.2% higher mAP than the weakly supervised baseline. The results show that IDCS can efficiently utilize multimodal data while providing guidance for an effective trade-off in VIS training.
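Matching instances across frames by exploiting temporal consistency can be illustrated, in its simplest form, by greedy IoU matching of detections in consecutive frames. The sketch below is only this simplified illustration, not the trident learning method itself; the threshold and function names are assumptions.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match_instances(prev_boxes, curr_boxes, thr=0.5):
    """Greedily match instances between consecutive frames by box IoU."""
    matches, used = [], set()
    for i, p in enumerate(prev_boxes):
        ious = [box_iou(p, c) if j not in used else -1.0
                for j, c in enumerate(curr_boxes)]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thr:
            matches.append((i, j))
            used.add(j)
    return matches

print(match_instances([(0, 0, 10, 10)], [(1, 1, 11, 11), (50, 50, 60, 60)]))  # [(0, 0)]
```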

19 pages, 2167 KiB  
Article
Robust Bi-Orthogonal Projection Learning: An Enhanced Dimensionality Reduction Method and Its Application in Unsupervised Learning
by Xianhao Qin, Chunsheng Li, Yingyi Liang, Huilin Zheng, Luxi Dong, Yarong Liu and Xiaolan Xie
Electronics 2024, 13(24), 4944; https://doi.org/10.3390/electronics13244944 - 15 Dec 2024
Abstract
This paper introduces a robust bi-orthogonal projection (RBOP) learning method for dimensionality reduction (DR). The proposed RBOP enhances the flexibility, robustness, and sparsity of the embedding framework, extending beyond traditional DR methods such as principal component analysis (PCA), neighborhood preserving embedding (NPE), and locality preserving projection (LPP). Unlike conventional approaches that rely on a single type of projection, RBOP employs two types of projections: a “true” projection and a “counterfeit” projection. These projections are crafted to be orthogonal, offering enhanced flexibility for the “true” projection and facilitating more precise data transformation during subspace learning. Through sparse reconstruction, the learned true projection can map the data into a low-dimensional subspace while efficiently maintaining sparsity. Observing that the two projections share many similar data structures, the method aims to preserve the similarity structure of the data through distinct reconstruction processes. Additionally, the incorporation of a sparse component allows the method to handle noise-corrupted data, compensating for noise during the DR process. Within this framework, several new unsupervised DR techniques are developed, namely RBOP_PCA, RBOP_NPE, and RBOP_LPP. Experimental results on both natural and synthetic datasets indicate that these proposed methods surpass existing, well-established DR techniques.
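As background, the classical PCA projection that RBOP_PCA extends can be obtained from the singular value decomposition of the centered data: the projection matrix has orthonormal columns spanning the top principal directions. The sketch below shows only this baseline, not the RBOP algorithm.

```python
import numpy as np

def pca_projection(X, k):
    """Classical PCA: return a (D, k) orthogonal projection onto the
    top-k principal directions of the centered data X with shape (N, D)."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T                           # columns are orthonormal

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
P = pca_projection(X, k=2)
Z = (X - X.mean(axis=0)) @ P                  # low-dimensional embedding (200, 2)
print(Z.shape, np.allclose(P.T @ P, np.eye(2)))   # (200, 2) True
```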
