Research on Machine Learning in Computer Vision

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 25 July 2025

Special Issue Editors


Dr. Eleonora Iotti
Guest Editor
Department of Mathematical, Physical and Computer Sciences, University of Parma, 43124 Parma, Italy
Interests: computer science; feature extraction; deep learning; meta-learning; computer vision

Special Issue Information

Dear Colleagues,

This Special Issue is dedicated to exploring the latest advances in Machine Learning (ML) as applied to computer vision. The rapid progress and adoption of ML techniques have significantly enhanced the capabilities of computer vision systems, enabling them to interpret visual data with unprecedented effectiveness.

The aim of this Special Issue is to examine how the most recent ML approaches, including but not limited to deep learning, are being successfully applied to computer vision tasks such as object detection, image retrieval, segmentation, and recognition.

Of particular interest are ML techniques such as meta-learning, reinforcement learning, and unsupervised and semi-supervised learning. We especially welcome contributions that address the challenges encountered in deploying these techniques, such as the demand for large datasets and high computational power, and that discuss and propose potential solutions, with a specific focus on one-shot or few-shot approaches. Contributions that highlight the impact of these advancements on application domains such as healthcare, autonomous vehicles, and surveillance are also welcome.

Dr. Eleonora Iotti
Prof. Dr. João M. F. Rodrigues
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • machine learning
  • computer vision
  • one- and few-shot learning
  • meta-learning
  • reinforcement learning
  • unsupervised and semi-supervised learning
  • ML-based computer vision applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (9 papers)


Research

21 pages, 2876 KiB  
Article
A Computer-Aided Approach to Canine Hip Dysplasia Assessment: Measuring Femoral Head–Acetabulum Distance with Deep Learning
by Pedro Franco-Gonçalo, Pedro Leite, Sofia Alves-Pimenta, Bruno Colaço, Lio Gonçalves, Vítor Filipe, Fintan McEvoy, Manuel Ferreira and Mário Ginja
Appl. Sci. 2025, 15(9), 5087; https://doi.org/10.3390/app15095087 - 3 May 2025
Abstract
Canine hip dysplasia (CHD) screening relies on radiographic assessment, but traditional scoring methods often lack consistency due to inter-rater variability. This study presents an AI-driven system for automated measurement of the femoral head center to dorsal acetabular edge (FHC/DAE) distance, a key metric in CHD evaluation. Unlike most AI models that directly classify CHD severity using convolutional neural networks, this system provides an interpretable, measurement-based output to support a more transparent evaluation. The system combines a keypoint regression model for femoral head center localization with a U-Net-based segmentation model for acetabular edge delineation. It was trained on 7967 images for hip joint detection, 571 for keypoints, and 624 for acetabulum segmentation, all from ventrodorsal hip-extended radiographs. On a test set of 70 images, the keypoint model achieved high precision (Euclidean Distance = 0.055 mm; Mean Absolute Error = 0.0034 mm; Mean Squared Error = 2.52 × 10⁻⁵ mm²), while the segmentation model showed strong performance (Dice Score = 0.96; Intersection over Union = 0.92). Comparison with expert annotations demonstrated strong agreement (Intraclass Correlation Coefficients = 0.97 and 0.93; Weighted Kappa = 0.86 and 0.79; Standard Error of Measurement = 0.92 to 1.34 mm). By automating anatomical landmark detection, the system enhances standardization, reproducibility, and interpretability in CHD radiographic assessment. Its strong alignment with expert evaluations supports its integration into CHD screening workflows for more objective and efficient diagnosis and CHD scoring.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
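The overlap metrics reported above (Dice Score and Intersection over Union) are standard and easy to reproduce. Below is a minimal NumPy sketch using hypothetical masks in place of the acetabular edge segmentations; it illustrates the metrics only and is not the authors' code.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Compute Dice Score and IoU for two binary segmentation masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (np.logical_or(pred, target).sum() + eps)
    return dice, iou

# Hypothetical masks standing in for acetabular edge segmentations.
pred = np.zeros((256, 256), dtype=np.uint8); pred[100:150, 80:200] = 1
gt = np.zeros((256, 256), dtype=np.uint8); gt[105:150, 85:200] = 1
print(dice_and_iou(pred, gt))
```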

20 pages, 2777 KiB  
Article
Video Human Action Recognition Based on Motion-Tempo Learning and Feedback Attention
by Yalong Liu, Chengwu Liang, Songqi Jiang and Peiwang Zhu
Appl. Sci. 2025, 15(8), 4186; https://doi.org/10.3390/app15084186 - 10 Apr 2025
Abstract
In video human action-recognition tasks, motion tempo describes the dynamic patterns and temporal scales of human motion. Different categories of actions are typically composed of sub-actions with varying motion tempos. Effectively capturing sub-actions with different motion tempos and distinguishing category-specific sub-actions are crucial for improving action-recognition performance. Convolutional Neural Network (CNN)-based methods have attempted to address this challenge by embedding feedforward attention modules to enhance the action's dynamic representation learning. However, feedforward attention modules rely only on local information from low-level features and lack the contextual information needed to generate attention weights. Therefore, we propose a Sub-action Motion information Enhancement Network (SMEN) based on motion-tempo learning and feedback attention, which consists of a Multi-Granularity Adaptive Fusion Module (MgAFM) and a Feedback Attention-Guided Module (FAGM). MgAFM enhances the model's ability to capture crucial intrinsic sub-action information by extracting and adaptively fusing motion dynamic features at different granularities. FAGM leverages high-level features that contain contextual information in a feedback manner to guide low-level features in generating attention weights, enhancing the model's ability to extract more discriminative spatio-temporal and channel-wise features. Experiments are conducted on three datasets: the proposed SMEN achieves top-1 accuracies of 52.4% and 63.3% on the Something-Something V1 and V2 datasets, respectively, and 76.9% on the Kinetics-400 dataset. Ablation studies, evaluations, and visualizations demonstrate that the proposed SMEN is effective for sub-action motion tempo and representation learning and outperforms competing methods for video action recognition.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
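As a rough illustration of the feedback idea (high-level, context-rich features guiding attention over low-level features), here is a toy PyTorch module. It is a simplified sketch under our own assumptions, not the published FAGM; the module name and shapes are invented for the example.

```python
import torch
import torch.nn as nn

class FeedbackChannelAttention(nn.Module):
    """Toy feedback attention: context from high-level features generates
    channel weights that re-weight low-level features (a sketch of the
    general idea, not the paper's FAGM)."""
    def __init__(self, high_channels: int, low_channels: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(high_channels, low_channels),
            nn.Sigmoid(),
        )

    def forward(self, low_feat, high_feat):
        context = high_feat.mean(dim=(2, 3))        # global context: (B, C_high)
        weights = self.fc(context)                  # channel weights: (B, C_low)
        return low_feat * weights[:, :, None, None] # broadcast over H, W

# Example with random feature maps.
low = torch.randn(2, 64, 56, 56)
high = torch.randn(2, 256, 7, 7)
out = FeedbackChannelAttention(256, 64)(low, high)
print(out.shape)  # torch.Size([2, 64, 56, 56])
```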

16 pages, 6883 KiB  
Article
Integrated AI System for Real-Time Sports Broadcasting: Player Behavior, Game Event Recognition, and Generative AI Commentary in Basketball Games
by Sunghoon Jung, Hanmoe Kim, Hyunseo Park and Ahyoung Choi
Appl. Sci. 2025, 15(3), 1543; https://doi.org/10.3390/app15031543 - 3 Feb 2025
Abstract
This study presents an AI-based sports broadcasting system capable of real-time game analysis and automated commentary. The model first acquires essential background knowledge, including the court layout, game rules, team information, and player details. YOLO-based segmentation is applied to the local camera view to enhance court recognition accuracy. Player action detection and ball tracking are performed with YOLO models: in each frame, a YOLO detection model produces bounding boxes for the players, and our tracking algorithm computes the IoU between detections in consecutive frames and links them to follow each player's movement path. Player behavior recognition uses the R(2+1)D action recognition model, covering actions such as running, dribbling, shooting, and blocking. The system demonstrates high performance, achieving an average accuracy of 97% in court calibration, 92.5% in player and object detection, and 85.04% in action recognition. Key game events are identified based on positional and action data, with broadcast lines generated using GPT APIs and converted to natural audio commentary via Text-to-Speech (TTS). This system offers a comprehensive framework for automating sports broadcasting with advanced AI techniques.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
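The IoU-based association step described above is a common tracking-by-detection pattern. The sketch below shows one plausible greedy implementation in plain Python; the threshold, box format, and function names are our assumptions, not the authors' code.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_detections(prev_tracks, detections, threshold=0.3):
    """Greedily assign each detection to the unused previous-frame track
    with the highest IoU, linking boxes across frames."""
    assignments, used = {}, set()
    for det_idx, det in enumerate(detections):
        scores = [(iou(trk, det), trk_idx)
                  for trk_idx, trk in enumerate(prev_tracks)
                  if trk_idx not in used]
        if scores:
            best_score, best_idx = max(scores)
            if best_score >= threshold:
                assignments[det_idx] = best_idx
                used.add(best_idx)
    return assignments  # detection index -> matched track index

prev = [(10, 10, 50, 90), (200, 40, 240, 120)]
curr = [(12, 12, 52, 92), (205, 42, 245, 122)]
print(link_detections(prev, curr))  # {0: 0, 1: 1}
```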

23 pages, 20134 KiB  
Article
The Development and Validation of an Artificial Intelligence Model for Estimating Thumb Range of Motion Using Angle Sensors and Machine Learning: Targeting Radial Abduction, Palmar Abduction, and Pronation Angles
by Yutaka Ehara, Atsuyuki Inui, Yutaka Mifune, Kohei Yamaura, Tatsuo Kato, Takahiro Furukawa, Shuya Tanaka, Masaya Kusunose, Shunsaku Takigami, Shin Osawa, Daiji Nakabayashi, Shinya Hayashi, Tomoyuki Matsumoto, Takehiko Matsushita and Ryosuke Kuroda
Appl. Sci. 2025, 15(3), 1296; https://doi.org/10.3390/app15031296 - 27 Jan 2025
Abstract
An accurate assessment of thumb range of motion is crucial for diagnosing musculoskeletal conditions, evaluating functional impairments, and planning effective rehabilitation strategies. In this study, we aimed to enhance the accuracy of estimating thumb range of motion using a combination of MediaPipe, an AI-based posture estimation library, and machine learning methods, taking the values obtained using angle sensors as the ground truth. Radial abduction, palmar abduction, and pronation angles were estimated using MediaPipe based on coordinates detected from videos of 18 healthy participants (nine males and nine females with an age range of 30–49 years) selected to reflect a balanced distribution of height and other physical characteristics. A conical thumb movement model was constructed, and parameters were generated based on the coordinate data. Five machine learning models were evaluated, with LightGBM achieving the highest accuracy across all metrics. Specifically, for radial abduction, palmar abduction, and pronation, the root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), and correlation coefficient were 4.67°, 3.41°, 0.94, and 0.97; 4.63°, 3.41°, 0.95, and 0.98; and 5.69°, 4.17°, 0.88, and 0.94, respectively. These results demonstrate that, when estimating thumb range of motion, the AI model trained on angle sensor data with LightGBM achieved accuracy that was high and comparable to that of prior methods involving MediaPipe and a protractor.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
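To make the evaluation pipeline concrete, here is a hedged sketch of training a LightGBM regressor and computing RMSE, MAE, and R² with scikit-learn. The synthetic features stand in for the MediaPipe-derived parameters of the conical thumb movement model, which the abstract does not spell out.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for geometric features and angle-sensor ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=2.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LGBMRegressor(n_estimators=200).fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"RMSE={rmse:.2f}  MAE={mean_absolute_error(y_te, pred):.2f}  "
      f"R2={r2_score(y_te, pred):.2f}")
```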

16 pages, 2038 KiB  
Article
Enhancing Colony Detection of Microorganisms in Agar Dishes Using SAM-Based Synthetic Data Augmentation in Low-Data Scenarios
by Kim Mennemann, Nikolas Ebert, Laurenz Reichardt and Oliver Wasenmüller
Appl. Sci. 2025, 15(3), 1260; https://doi.org/10.3390/app15031260 - 26 Jan 2025
Abstract
In many medical and pharmaceutical processes, continuous hygiene monitoring relies on manual detection of microorganisms in agar dishes by skilled personnel. While deep learning offers the potential for automating this task, it often faces limitations due to insufficient training data, a common issue in colony detection. To address this, we propose a simple yet efficient SAM-based pipeline for Copy-Paste data augmentation to enhance detection performance, even with limited data. This paper explores a method where annotated microbial colonies from real images were copied and pasted into empty agar dish images to create new synthetic samples. These new samples inherited the annotations of the colonies inserted into them so that no further labeling was required. The resulting synthetic datasets were used to train a YOLOv8 detection model, which was then fine-tuned on just 10 to 1000 real images. The best fine-tuned model, trained on only 1000 real images, achieved an mAP of 60.6, while a base model trained on 5241 real images achieved 64.9. Although far fewer real images were used, the fine-tuned model performed comparably well, demonstrating the effectiveness of the SAM-based Copy-Paste augmentation. This approach matches or even exceeds the performance of the current state of the art in synthetic data generation in colony detection and can be expanded to include more microbial species and agar dishes.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
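The Copy-Paste step can be pictured as pasting masked colony pixels onto an empty dish image while carrying the annotation along. The following NumPy sketch shows one simple way to do this; in the paper the colony masks come from SAM, whereas here the images and mask are fabricated for illustration.

```python
import numpy as np

def paste_colony(background, crop, mask, x, y):
    """Paste the masked colony pixels of `crop` onto `background` at (x, y)
    and return the augmented image with the inherited bounding box."""
    h, w = crop.shape[:2]
    out = background.copy()
    region = out[y:y + h, x:x + w]      # view into the output image
    region[mask > 0] = crop[mask > 0]   # copy only the colony pixels
    return out, (x, y, x + w, y + h)    # box in (x1, y1, x2, y2) format

# Fabricated 8-bit RGB dish and colony crop; in the paper the mask
# would come from SAM rather than being all-ones.
dish = np.full((512, 512, 3), 200, dtype=np.uint8)
colony = np.full((20, 20, 3), 90, dtype=np.uint8)
mask = np.ones((20, 20), dtype=np.uint8)
augmented, box = paste_colony(dish, colony, mask, x=100, y=150)
print(box)  # (100, 150, 120, 170)
```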

26 pages, 1303 KiB  
Article
On Explainability of Reinforcement Learning-Based Machine Learning Agents Trained with Proximal Policy Optimization That Utilizes Visual Sensor Data
by Tomasz Hachaj and Marcin Piekarczyk
Appl. Sci. 2025, 15(2), 538; https://doi.org/10.3390/app15020538 - 8 Jan 2025
Abstract
In this paper, we address the explainability of reinforcement learning-based machine learning agents trained with Proximal Policy Optimization (PPO) that utilize visual sensor data. We propose an algorithm that allows an effective and intuitive approximation of the PPO-trained neural network (NN), and we conduct several experiments to confirm its effectiveness. Our proposed method works well for scenarios where semantic clustering of the scene is possible. The approach is based on the solid theoretical foundations of Gradient-weighted Class Activation Mapping (GradCAM) and Classification and Regression Trees (CART), with additional proxy geometry heuristics. It excels at explaining behavior in a virtual simulation system based on relatively low-resolution video input. Depending on the convolutional feature extractor of the PPO-trained neural network, our method approximates the black-box model with an accuracy of 0.945 to 0.968. The proposed method has important practical applications: through its use, it is possible to estimate the causes of specific decisions made by the neural network given the current state of the observed environment. This estimation makes it possible to determine whether the network makes decisions as expected (i.e., decision-making is related to the model's observation of objects belonging to different semantic classes in the environment) and to detect unexpected, seemingly chaotic behavior that might be, for example, the result of data bias, bad design of the reward function, or insufficient generalization ability of the model. We publish all source code so that our experiments can be reproduced.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
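The decision-tree approximation of a black-box policy can be illustrated with scikit-learn: fit a CART model on features describing the agent's observations and measure how faithfully it reproduces the agent's actions. Everything below (the features, the stand-in policy, the tree depth) is a hypothetical setup, not the authors' pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: each row holds per-class activation strengths
# (e.g., GradCAM energy per semantic cluster); each label is the action
# the PPO-trained network chose for that observation.
rng = np.random.default_rng(1)
activations = rng.random((2000, 5))
agent_actions = (activations[:, 0] > activations[:, 1]).astype(int)  # stand-in policy

tree = DecisionTreeClassifier(max_depth=4).fit(activations, agent_actions)
fidelity = accuracy_score(agent_actions, tree.predict(activations))
print(f"approximation fidelity: {fidelity:.3f}")  # analogous to the reported 0.945-0.968
```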

13 pages, 1853 KiB  
Article
Optimizing Deep Learning Acceleration on FPGA for Real-Time and Resource-Efficient Image Classification
by Ahmad Mouri Zadeh Khaki and Ahyoung Choi
Appl. Sci. 2025, 15(1), 422; https://doi.org/10.3390/app15010422 - 5 Jan 2025
Cited by 3
Abstract
Deep learning (DL) has revolutionized image classification, yet deploying convolutional neural networks (CNNs) on edge devices for real-time applications remains a significant challenge due to constraints in computation, memory, and power efficiency. This work presents an optimized implementation of VGG16 and VGG19, two widely used CNN architectures, for classifying the CIFAR-10 dataset using transfer learning on field-programmable gate arrays (FPGAs). Utilizing the Xilinx Vitis-AI and TensorFlow2 frameworks, we adapt VGG16 and VGG19 for FPGA deployment through quantization, compression, and hardware-specific optimizations. Our implementation achieves high classification accuracy, with Top-1 accuracy of 89.54% and 87.47% for VGG16 and VGG19, respectively, while delivering significant reductions in inference latency (7.29× and 6.6× compared to CPU-based alternatives). These results highlight the suitability of our approach for resource-efficient, real-time edge applications. Key contributions include a detailed methodology for combining transfer learning with FPGA acceleration, an analysis of hardware resource utilization, and performance benchmarks. This work underscores the potential of FPGA-based solutions to enable scalable, low-latency DL deployments in domains such as autonomous systems, IoT, and mobile devices.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
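The transfer-learning stage of such a pipeline can be sketched in TensorFlow 2 as below; the FPGA-specific quantization, compression, and compilation are performed with vendor tooling (Vitis-AI) and are omitted here. Layer sizes and training settings are illustrative assumptions, not the authors' configuration.

```python
import tensorflow as tf

# Transfer-learning stage only: reuse frozen ImageNet features from VGG16
# and train a small classification head on CIFAR-10.
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_small = tf.keras.applications.vgg16.preprocess_input(
    x_train[:5000].astype("float32"))  # small subset keeps the demo light
y_small = y_train[:5000]

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(32, 32, 3))
base.trainable = False  # freeze the convolutional backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_small, y_small, epochs=1, batch_size=128)  # short demo run
```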

15 pages, 1426 KiB  
Article
Attention Score Enhancement Model Through Pairwise Image Comparison
by Yeong Seok Ju, Zong Woo Geem and Joon Shik Lim
Appl. Sci. 2024, 14(21), 9928; https://doi.org/10.3390/app14219928 - 30 Oct 2024
Abstract
This study proposes the Pairwise Attention Enhancement (PAE) model to address the limitations of the Vision Transformer (ViT). While the ViT effectively models global relationships between image patches, it encounters challenges in medical image analysis, where fine-grained local features are crucial. Although the ViT excels at capturing global interactions within the entire image, it may underperform due to its inadequate representation of local features such as color, texture, and edges. The proposed PAE model enhances local features by calculating the cosine similarity between the attention maps of training and reference images and integrating attention maps in regions with high similarity. This approach complements the ViT's global capture capability, allowing for a more accurate reflection of subtle visual differences. Experiments using Clock Drawing Test data demonstrated that the PAE model achieved a precision of 0.9383, a recall of 0.8916, an F1-score of 0.9133, and an accuracy of 92.69%, a 12% improvement over API-Net and a 1% improvement over the ViT. This study suggests that the PAE model can enhance performance in computer vision fields where local features are crucial by overcoming the limitations of the ViT.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
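One simplified reading of the PAE mechanism is: compute patch-wise cosine similarity between two attention maps and blend them where the similarity is high. The NumPy sketch below encodes that reading; the threshold, blend weight, and shapes are our assumptions rather than the paper's implementation.

```python
import numpy as np

def enhance_attention(train_attn, ref_attn, threshold=0.8, alpha=0.5):
    """Blend a reference attention map into a training image's attention map
    wherever their patch-wise cosine similarity is high (a simplified
    reading of the PAE idea, not the authors' implementation)."""
    # train_attn, ref_attn: (num_patches, dim) per-patch attention vectors.
    num = (train_attn * ref_attn).sum(axis=1)
    denom = (np.linalg.norm(train_attn, axis=1) *
             np.linalg.norm(ref_attn, axis=1) + 1e-9)
    sim = num / denom                      # cosine similarity per patch
    enhanced = train_attn.copy()
    high = sim >= threshold
    enhanced[high] = (1 - alpha) * train_attn[high] + alpha * ref_attn[high]
    return enhanced

train = np.random.rand(196, 64)   # hypothetical ViT attention per patch
ref = np.random.rand(196, 64)
print(enhance_attention(train, ref).shape)  # (196, 64)
```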

15 pages, 1919 KiB  
Article
A Multimodal Recommender System Using Deep Learning Techniques Combining Review Texts and Images
by Euiju Jeong, Xinzhe Li, Angela (Eunyoung) Kwon, Seonu Park, Qinglong Li and Jaekyeong Kim
Appl. Sci. 2024, 14(20), 9206; https://doi.org/10.3390/app14209206 - 10 Oct 2024
Cited by 3
Abstract
Online reviews that consist of texts and images are an essential source of information for alleviating data sparsity in recommender system studies. Although texts and images convey different types of information, they can offer complementary or substitutive advantages. However, most studies make limited use of the complementary effect between texts and images in recommender systems. Specifically, they have overlooked the informational value of images and proposed recommender systems based solely on textual representations. To address this research gap, this study proposes a novel recommender model that captures the dependence between texts and images. This study uses the RoBERTa and VGG-16 models to extract textual and visual information from online reviews and applies a co-attention mechanism to capture the complementarity between the two modalities. Extensive experiments were conducted using Amazon datasets, confirming the superiority of the proposed model. Our findings suggest that the complementarity of texts and images is crucial for enhancing recommendation accuracy and performance.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
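A minimal co-attention block of the kind described can be written as a pair of cross-modal softmax attentions over an affinity matrix. The PyTorch sketch below is generic; the feature dimensions and the projection of RoBERTa/VGG-16 outputs to a shared space are assumed, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def co_attention(text_feat, img_feat):
    """Minimal co-attention: an affinity matrix between text tokens and image
    regions yields cross-modal attention in both directions (a generic
    sketch; the paper builds on RoBERTa and VGG-16 features)."""
    # text_feat: (B, T, D), img_feat: (B, R, D), already in a shared space.
    affinity = torch.bmm(text_feat, img_feat.transpose(1, 2))  # (B, T, R)
    attended_img = torch.bmm(F.softmax(affinity, dim=-1), img_feat)   # (B, T, D)
    attended_text = torch.bmm(F.softmax(affinity.transpose(1, 2), dim=-1),
                              text_feat)                              # (B, R, D)
    return attended_img, attended_text

text = torch.randn(2, 50, 256)   # e.g., projected RoBERTa token features
image = torch.randn(2, 49, 256)  # e.g., projected VGG-16 7x7 region features
a_img, a_text = co_attention(text, image)
print(a_img.shape, a_text.shape)
```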
