Artificial Intelligence: Deep Learning and Computer Vision


A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "E1: Mathematics and Computer Science".

Deadline for manuscript submissions: 31 January 2026 | Viewed by 5659

Special Issue Editors

Prof. Dr. Juan Manuel Rendón-Mancha

Guest Editor

Instituto de Investigación en Ciencias Básicas y Aplicadas, Centro de Investigación en Ciencias, Universidad Autónoma Del Estado de Morelos, Cuernavaca 62209, Mexico
Interests: computer vision; image analysis; deep learning

Prof. Dr. Edgar Roman-Rangel

Guest Editor

Instituto Tecnológico Autónomo de México, Ciudad de Mexico 01080, Mexico
Interests: machine learning; representation learning; computer vision

Special Issue Information

Dear Colleagues,

Computer vision applications powered by advances in deep learning are turning what was once imagined only in science fiction novels and films into reality. Moreover, since 2015, computers have outperformed human experts in complex vision tasks. These achievements rest mainly on powerful artificial intelligence models based on deep neural networks.

This Special Issue will present recent applications of artificial intelligence for computer vision. Special attention is devoted to deep learning methods using convolutional and transformer-based architectures.

The Special Issue is an opportunity for authors and researchers to present their work while discussing the novel capabilities of the deep neural network (DNN) revolution in operational settings, as well as the reliability and limitations of these methods.

This Special Issue will accept high-quality papers containing original research results and review articles in the following fields:

Image or video segmentation using deep neural networks;
Image or video classification using deep neural networks;
Image or video restoration or reconstruction using deep neural networks;
Image to X and image from X reconstruction using deep neural networks;
Autonomous robot or vehicle navigation using deep neural networks for images;
Object detection in images or video using deep neural networks;
Three-dimensional reconstruction or depth estimation using deep neural networks;
Generative deep learning for images;
Image-based deep reinforcement learning;
Computational methods for computer vision using deep neural networks;
Optimization algorithms for computer vision using deep neural networks;
Intelligent systems for computer vision using deep neural networks.

Prof. Dr. Juan Manuel Rendón-Mancha
Prof. Dr. Edgar Roman-Rangel
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

artificial intelligence
deep learning
computer vision
image classification
image segmentation
image restoration
robot or vehicle autonomous navigation
image-based deep reinforcement learning
generative deep models
computational methods for computer vision
intelligent systems
optimization algorithms for computer vision

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies is available on the MDPI website.

Published Papers (6 papers)

Research

31 pages, 3735 KiB

Open Access | Feature Paper | Article

An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures

by Andrzej D. Dobrzycki, Ana M. Bernardos and José R. Casar

Mathematics 2025, 13(15), 2539; https://doi.org/10.3390/math13152539 - 7 Aug 2025

Viewed by 332

Abstract

The You Only Look Once (YOLO) architecture is crucial for real-time object detection. However, deploying it in resource-constrained environments such as unmanned aerial vehicles (UAVs) requires efficient transfer learning. Although layer freezing is a common technique, the specific impact of various freezing configurations on contemporary YOLOv8 and YOLOv10 architectures remains unexplored, particularly with regard to the interplay between freezing depth, dataset characteristics, and training dynamics. This research addresses this gap by presenting a detailed analysis of layer-freezing strategies. We systematically investigate multiple freezing configurations across YOLOv8 and YOLOv10 variants using four challenging datasets that represent critical infrastructure monitoring. Our methodology integrates a gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to provide deeper insights into training dynamics under different freezing strategies. Our results reveal that there is no universal optimal freezing strategy but, rather, one that depends on the properties of the data. For example, freezing the backbone is effective for preserving general-purpose features, while a shallower freeze is better suited to handling extreme class imbalance. These configurations reduce graphics processing unit (GPU) memory consumption by up to 28% compared to full fine-tuning and, in some cases, achieve mean average precision (mAP@50) scores that surpass those of full fine-tuning. Gradient analysis corroborates these findings, showing distinct convergence patterns for moderately frozen models. Ultimately, this work provides empirical findings and practical guidelines for selecting freezing strategies. It offers a practical, evidence-based approach to balanced transfer learning for object detection in scenarios with limited resources.

(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)

16 pages, 3426 KiB

Open Access | Article

Noise Improves Multimodal Machine Translation: Rethinking the Role of Visual Context

by Xinyu Ma, Jun Rao and Xuebo Liu

Mathematics 2025, 13(11), 1874; https://doi.org/10.3390/math13111874 - 3 Jun 2025

Viewed by 489

Abstract

Multimodal Machine Translation (MMT) has long been assumed to outperform traditional text-only MT by leveraging visual information. However, recent studies challenge this assumption, showing that MMT models perform similarly even when tested without images or with mismatched images. This raises fundamental questions about the actual utility of visual information in MMT, which this work aims to investigate. We first revisit commonly used image-must and image-free MMT approaches, identifying that suboptimal performance may stem from insufficiently robust baseline models. To further examine the role of visual information, we propose a novel visual type regularization method and introduce two probing tasks (Visual Contribution Probing and Modality Relationship Probing) to analyze whether and how visual features influence a strong MMT model. Surprisingly, our findings on a mainstream dataset indicate that the gains from visual information are marginal. We attribute this improvement primarily to a regularization effect, which can be replicated using random noise. Our results suggest that the MMT community should critically re-evaluate baseline models, evaluation metrics, and dataset design to advance multimodal learning meaningfully.

22 pages, 3388 KiB

Open Access | Article

Aggregating Image Segmentation Predictions with Probabilistic Risk Control Guarantees

by Joaquin Alvarez and Edgar Roman-Rangel

Mathematics 2025, 13(11), 1711; https://doi.org/10.3390/math13111711 - 23 May 2025

Viewed by 427

Abstract

In this work, we introduce a framework to combine arbitrary image segmentation algorithms from different agents under data privacy constraints to produce an aggregated prediction set satisfying finite-sample risk control guarantees. We leverage distribution-free uncertainty quantification techniques in order to aggregate deep neural networks for image segmentation tasks. Our method can be applied in settings to merge the predictions of multiple agents with arbitrarily dependent prediction sets. Moreover, we perform experiments in medical imaging tasks to illustrate our proposed framework. Our results show that the framework reduced the empirical false positive rate by 50% without compromising the false negative rate, with respect to the false positive rate of any of the constituent models in the aggregated prediction algorithm.

15 pages, 1990 KiB

Open Access | Article

Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models

by Longbin Jin, Hyuntaek Jung, Hyo Jin Jon and Eun Yi Kim

Mathematics 2025, 13(9), 1365; https://doi.org/10.3390/math13091365 - 22 Apr 2025

Viewed by 747

Abstract

Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available.

16 pages, 1939 KiB

Open Access | Article

Auto-Probabilistic Mining Method for Siamese Neural Network Training

by Arseniy Mokin, Alexander Sheshkus and Vladimir L. Arlazarov

Mathematics 2025, 13(8), 1270; https://doi.org/10.3390/math13081270 - 12 Apr 2025

Viewed by 476

Abstract

Training deep learning models for classification with limited data and computational resources remains a challenge when the number of classes is large. Metric learning offers an effective solution to this problem. However, it has its own shortcomings due to the known imperfections of widely used loss functions such as contrastive loss and triplet loss, as well as sample mining methods. This paper addresses these issues by proposing a novel mining method and metric loss function. Firstly, this paper presents an auto-probabilistic mining method designed to automatically select the most informative training samples for Siamese neural networks. Combined with a previously proposed auto-clustering technique, the method improves model training by optimizing the utilization of available data and reducing computational overhead. Secondly, this paper proposes a novel cluster-aware triplet-based metric loss function that addresses the limitations of contrastive and triplet loss, enhancing the overall training process. To evaluate the proposed methods, experiments were conducted with the optical character recognition task using the PHD08 and Omniglot datasets. The proposed loss function with the random-mining method achieved 82.6% classification accuracy on the PHD08 dataset with full training on the Korean alphabet, surpassing the known baseline. The same experiment, using a reduced training alphabet, set a new baseline of 88.6% on the PHD08 dataset. The application of the novel mining method further enhanced the accuracy to 90.6% (+2.0%) and, combined with auto-clustering, achieved 92.3% (+3.7%) compared with the new baseline. On the Omniglot dataset, the proposed mining method reached 92.32%, rising to 93.17% with auto-clustering. These findings highlight the potential effectiveness of the developed loss function and mining method in addressing a wide range of pattern recognition challenges.
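For readers unfamiliar with the two baseline losses named in this abstract, the following is a minimal sketch of one common textbook form of contrastive loss and triplet loss, written in plain Python. It is not the cluster-aware loss or the mining method proposed in the paper; the function names, the Euclidean distance, and the default margin of 1.0 are assumptions chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """One common form of contrastive loss for a single pair.

    Same-class pairs are penalized by their squared distance; different-class
    pairs are penalized only when they are closer than `margin`.
    """
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """One common form of triplet loss for a single (anchor, positive, negative).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings, chosen so distances are easy to check by hand.
    anchor, near, far = [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]
    print(triplet_loss(anchor, near, far))   # well-separated triplet
    print(triplet_loss(anchor, far, near))   # violated triplet
```

In this form, a triplet that already satisfies the margin contributes no gradient, which is one reason the choice of training samples (mining) matters for this family of losses.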

23 pages, 1774 KiB

Open Access | Article

Adaptive Transformer-Based Deep Learning Framework for Continuous Sign Language Recognition and Translation

by Yahia Said, Sahbi Boubaker, Saleh M. Altowaijri, Ahmed A. Alsheikhy and Mohamed Atri

Mathematics 2025, 13(6), 909; https://doi.org/10.3390/math13060909 - 8 Mar 2025

Cited by 1 | Viewed by 1969

Abstract

Sign language recognition and translation remain pivotal for facilitating communication among the deaf and hearing communities. However, end-to-end sign language translation (SLT) faces major challenges, including weak temporal correspondence between sign language (SL) video frames and gloss annotations and the complexity of sequence alignment between long SL videos and natural language sentences. In this paper, we propose an Adaptive Transformer (ADTR)-based deep learning framework that enhances SL video processing for robust and efficient SLT. The proposed model incorporates three novel modules: Adaptive Masking (AM), Local Clip Self-Attention (LCSA), and Adaptive Fusion (AF) to optimize feature representation. The AM module dynamically removes redundant video frame representations, improving temporal alignment, while the LCSA module learns hierarchical representations at both local clip and full-video levels using a refined self-attention mechanism. Additionally, the AF module fuses multi-scale temporal and spatial features to enhance model robustness. Unlike conventional SLT models, our framework eliminates the reliance on gloss annotations, enabling direct translation from SL video sequences to spoken language text. The proposed method was evaluated using the ArabSign dataset, demonstrating state-of-the-art performance in translation accuracy, processing efficiency, and real-time applicability. The achieved results confirm that ADTR is a highly effective and scalable deep learning solution for continuous sign language recognition, positioning it as a promising AI-driven approach for real-world assistive applications.
