Machine Learning for Object Detection and Scene Description in Images and Videos

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 March 2026 | Viewed by 7367

Special Issue Editors


Guest Editor
Institute of Control and Industrial Electronics, Warsaw University of Technology, Ul. Koszykowa 75, 00-662 Warszawa, Poland
Interests: computer vision; machine learning; deep learning; image processing
Department of Electronic Engineering, Yeungnam University, Gyeongsan 35841, Republic of Korea
Interests: image processing; computer vision; signal, image and video processing

Special Issue Information

Dear Colleagues,

Object detection and scene description are fundamental to computer vision as a tool for automatically understanding the human environment. Machines must recognize and interpret objects and scenes in order to understand and interact with the world meaningfully, and this capability underpins more complex tasks such as image and video analysis, autonomous navigation, and interactive systems. These technologies have applications across many industries, including healthcare, robotics, automotive, and security. Object detection and scene description also make the interaction between humans and computers more intuitive. In big data settings, these methods enable the analysis and interpretation of visual data, which constitutes the majority of the data generated today. The complexity of real-world scenes and the variety of objects present ongoing challenges, making this an active and exciting area of research. Improving the accuracy, speed, and robustness of object detection and scene description models remains crucial and drives innovation in machine learning algorithms and computational strategies. This Special Issue aims to present recent advances in object detection, semantic and instance segmentation, image captioning, visual question answering, scene modeling, object tracking, video summarizing, action recognition, and related fields of machine learning.

Dr. Marcin Iwanowski
Dr. Sungho Kim
Prof. Dr. Zhaoqing Pan
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • machine learning
  • scene description
  • object detection
  • image segmentation
  • semantic segmentation
  • image captioning
  • video summarizing
  • robot vision
  • action recognition

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)

Research

28 pages, 583 KB  
Article
Multiple Large AI Models’ Consensus for Object Detection—A Survey
by Marcin Iwanowski and Marcin Gahbler
Appl. Sci. 2025, 15(24), 12961; https://doi.org/10.3390/app152412961 - 9 Dec 2025
Viewed by 287
Abstract
The rapid development of large artificial intelligence (AI) models, namely large language models (LLMs), multimodal large language models (MLLMs), and vision–language models (VLMs), has enabled instruction-driven visual understanding, where a single foundation model can recognize and localize arbitrary objects from natural-language prompts. However, predictions from individual models remain inconsistent: LLMs hallucinate nonexistent entities, while VLMs exhibit limited recall and unstable calibration compared to purpose-trained detectors. To address these limitations, a new paradigm termed “multiple large AI models’ consensus” has emerged. In this approach, multiple heterogeneous LLMs, MLLMs, or VLMs process a shared visual–textual instruction and generate independent structured outputs (bounding boxes and categories). Their results are then merged through consensus mechanisms. This cooperative inference improves spatial accuracy and semantic correctness, making it particularly suitable for generating high-quality training datasets for fast real-time object detectors. This survey provides a comprehensive overview of multiple large AI models’ consensus for object detection. We formalize the concept, review related literature on ensemble reasoning and multimodal perception, and categorize existing methods into four frameworks: prompt-level, reasoning-to-detection, box-level, and hybrid consensus. We further analyze fusion algorithms, evaluation metrics, and benchmark datasets, highlighting their strengths and limitations. Finally, we discuss open challenges (vocabulary alignment, uncertainty calibration, computational efficiency, and bias propagation) and identify emerging trends such as consensus-aware training, structured reasoning, and collaborative perception ecosystems.
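
As a rough illustration of what a box-level consensus step might look like, the sketch below groups same-label boxes from different models by intersection over union (IoU) and keeps a group only if enough distinct models agree. It is not taken from the survey: the `Detection` class, the `consensus` function, the greedy grouping, the coordinate averaging, and the default values `iou_thr=0.5` and `min_votes=2` are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical container: box is (x1, y1, x2, y2) in pixels.
    box: tuple
    label: str
    model: str  # name of the model that produced this detection


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consensus(detections, iou_thr=0.5, min_votes=2):
    """Greedily group same-label boxes, keep groups backed by enough models.

    Returns a list of (label, averaged_box, number_of_agreeing_models).
    """
    clusters = []
    for det in detections:
        for cluster in clusters:
            ref = cluster[0]  # compare against the first box of the group
            if det.label == ref.label and iou(det.box, ref.box) >= iou_thr:
                cluster.append(det)
                break
        else:
            clusters.append([det])  # no matching group: start a new one

    merged = []
    for cluster in clusters:
        voters = {d.model for d in cluster}
        if len(voters) >= min_votes:
            n = len(cluster)
            avg_box = tuple(sum(d.box[i] for d in cluster) / n for i in range(4))
            merged.append((cluster[0].label, avg_box, len(voters)))
    return merged


if __name__ == "__main__":
    dets = [
        Detection((10, 10, 50, 50), "cat", "model_a"),
        Detection((12, 12, 52, 52), "cat", "model_b"),
        Detection((200, 200, 240, 240), "dog", "model_c"),
    ]
    print(consensus(dets))
```

In this toy input, two models agree on the "cat" box, so it is kept with averaged coordinates, while the "dog" box proposed by a single model is dropped. Real systems may weight boxes by confidence or align differing label vocabularies first, which this sketch omits.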

19 pages, 4507 KB  
Article
Automated Weld Defect Classification Enhanced by Synthetic Data Augmentation in Industrial Ultrasonic Images
by Amir-M. Naddaf-Sh, Vinay S. Baburao, Zina Ben-Miled and Hassan Zargarzadeh
Appl. Sci. 2025, 15(23), 12811; https://doi.org/10.3390/app152312811 - 3 Dec 2025
Viewed by 367
Abstract
Automated ultrasonic testing (AUT) serves as a vital method for evaluating critical infrastructure in industries such as oil and gas. However, a significant challenge in deploying artificial intelligence (AI)-based interpretation methods for AUT data lies in improving their reliability and effectiveness, particularly due to the inherent scarcity of real-world defective data. This study directly addresses data scarcity in a weld defect classification task, specifically concerning the detection of lack of fusion (LOF) defects in weld inspections using a proprietary industrial ultrasonic B-scan image dataset. This paper leverages state-of-the-art generative models, including Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPM) (StyleGAN3, VQGAN with an unconditional transformer, and Stable Diffusion), to produce realistic B-scan images depicting LOF defects. The fine-tuned Transformer-based models, including ViT-Base, Swin-Tiny, and MobileViT-Small classifiers, on the regular B-scan image dataset are then applied to retain only high-confidence positive synthetic samples from each method. The impact of these synthetic images on the classification performance of a ResNet-50 model is evaluated, where it is fine-tuned with cumulative additions of synthetic images, ranging from 10 to 200 images. Its accuracy on the test set increases by 38.9% relative to the baseline with the addition of either 80 synthetic images using VQGAN with an unconditional transformer or 200 synthetic images by StyleGAN3 to the training set, and by 36.8% with the addition of 150 synthetic images by Stable Diffusion. This also outperforms Transformer-based vision models that are trained on regular training data. 
Concurrently, knowledge distillation experiments involve training ResNet-50 as a student model, leveraging the expertise of ViT-Base and Swin-Tiny as teacher models to demonstrate the effectiveness of adding the synthetic data to the training set, where the greatest enhancement is observed to be 34.7% relative to the baseline. This work contributes to advancing robust, AI-assisted tools for critical infrastructure inspection and offers practical pathways for enhancing available models in resource-constrained industrial environments.

15 pages, 5494 KB  
Article
Classification of OCT Images of the Human Eye Using Mobile Devices
by Agnieszka Stankiewicz, Tomasz Marciniak, Nina Budna, Róża Chwałek and Marcin Dziedzic
Appl. Sci. 2025, 15(6), 2937; https://doi.org/10.3390/app15062937 - 8 Mar 2025
Cited by 1 | Viewed by 1375
Abstract
The aim of this study was to develop a mobile application for Android devices dedicated to the classification of pathological changes in human eye optical coherence tomography (OCT) B-scans. The classification process is conducted using convolutional neural networks (CNNs). Six models were trained during the study: a simple convolutional neural network with three convolutional layers, VGG16, InceptionV3, Xception, Joint Attention Network + MobileNetV2 and OpticNet-71. All of these models were converted to TensorFlow Lite format for integration into a mobile application. For this purpose, three models with the best parameters were chosen, taking accuracy, precision, recall, F1-score and confusion matrix into consideration. The Android application designed for the classification of OCT images was developed using the Kotlin programming language within the Android Studio integrated development environment. With the application, classification can be performed on an image chosen from the user’s files or an image acquired using the photo-taking function. The results of the classification are displayed for three neural networks, along with the respective classification times for each neural network and the associated image undergoing the classification task. The mobile application has been tested using various smartphones. The testing phase included an evaluation of image classification times and score accuracy, considering factors such as image acquisition method, i.e., camera or gallery.

13 pages, 13678 KB  
Article
Improving CNN Fish Detection and Classification with Tracking
by Boubker Zouin, Jihad Zahir, Florian Baletaud, Laurent Vigliola and Sébastien Villon
Appl. Sci. 2024, 14(22), 10122; https://doi.org/10.3390/app142210122 - 5 Nov 2024
Cited by 4 | Viewed by 2707
Abstract
The regular and consistent monitoring of marine ecosystems and fish communities is becoming more and more crucial due to increasing human pressures. To this end, underwater camera technology has become a major tool for collecting large amounts of marine data. As the size of the data collected outgrew the ability to process it, new means of automatic processing have been explored. Convolutional neural networks (CNNs) have been the most popular method for automatic underwater video analysis for the last few years. However, such algorithms are rather image-based and do not exploit the potential of video data. In this paper, we propose a method of coupling video tracking and CNN image analysis to perform a robust and accurate fish classification on deep sea videos and improve automatic classification accuracy. Our method fuses CNNs with tracking methods, allowing us to detect 12% more individuals than with a CNN alone.

16 pages, 6525 KB  
Article
Recurrent and Concurrent Prediction of Longitudinal Progression of Stargardt Atrophy and Geographic Atrophy towards Comparative Performance on Optical Coherence Tomography as on Fundus Autofluorescence
by Zubin Mishra, Ziyuan Chris Wang, Emily Xu, Sophia Xu, Iyad Majid, SriniVas R. Sadda and Zhihong Jewel Hu
Appl. Sci. 2024, 14(17), 7773; https://doi.org/10.3390/app14177773 - 3 Sep 2024
Cited by 2 | Viewed by 1768
Abstract
Stargardt atrophy and geographic atrophy (GA) represent pivotal endpoints in FDA-approved clinical trials. Predicting atrophy progression is crucial for evaluating drug efficacy. Fundus autofluorescence (FAF), the standard 2D imaging modality in these trials, has limitations in patient comfort. In contrast, spectral-domain optical coherence tomography (SD-OCT), a 3D imaging modality, is more patient friendly but suffers from lower image quality. This study has two primary objectives: (1) develop an efficient predictive modeling for the generation of future FAF images and prediction of future Stargardt atrophic (as well as GA) regions and (2) develop an efficient predictive modeling with advanced 3D OCT features at ellipsoid zone (EZ) for the comparative performance in the generation of future enface EZ maps and prediction of future Stargardt atrophic regions on OCT as on FAF. To achieve these goals, we propose two deep neural networks (termed ReConNet and ReConNet-Ensemble) with recurrent learning units (long short-term memory, LSTM) integrating with a convolutional neural network (CNN) encoder–decoder architecture and concurrent learning units integrated by ensemble/multiple recurrent learning channels. The ReConNet, which incorporates LSTM connections with CNN, is developed for the first goal on longitudinal FAF. The ReConNet-Ensemble, which incorporates multiple recurrent learning channels based on enhanced EZ enface maps to capture higher-order inherent OCT EZ features, is developed for the second goal on longitudinal OCT. Using FAF images at months 0, 6, and 12 to predict atrophy at month 18, the ReConNet achieved mean (±standard deviation, SD) and median Dice coefficients of 0.895 (±0.086) and 0.922 for Stargardt atrophy and 0.864 (±0.113) and 0.893 for GA. Using SD-OCT images at months 0 and 6 to predict atrophy at month 12, the ReConNet-Ensemble achieved mean and median Dice coefficients of 0.882 (±0.101) and 0.906 for Stargardt atrophy. 
The prediction performance on OCT images is comparable to that on FAF. These results underscore the potential of SD-OCT for efficient and practical assessment of atrophy progression in clinical trials and retina clinics, complementing or surpassing the widely used FAF imaging technique.
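
For readers unfamiliar with the metric, the Dice coefficient reported in this abstract measures the overlap between a predicted region A and a reference region B as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `dice_coefficient`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested lists of 0/1."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as atrophy in both masks.
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    reference = [[1, 0, 0],
                 [0, 1, 1]]
    print(dice_coefficient(predicted, reference))
```

A value of 1 indicates identical regions and 0 indicates no overlap, so the mean values near 0.9 quoted above correspond to a large overlap between predicted and reference atrophy regions.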