Recent Trends in Automatic Image Captioning Systems

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 February 2025 | Viewed by 14361

Special Issue Editors


Dr. Mourad Oussalah
Guest Editor
Faculty of Information Technology and Electrical Engineering (ITEE), University of Oulu, 90570 Oulu, Finland
Interests: social media; data mining; robotics; data fusion; computer vision

Prof. Dr. Rachid Jennane
Guest Editor
University of Orleans, IDP Laboratory, UMR CNRS 7013, 45100 Orleans, France
Interests: image analysis; medical image analysis; image retrieval systems

Special Issue Information

Dear Colleagues,

The rapid development of digitalization, tagging and user-generated content has brought about a substantial increase in datasets where images are accompanied by related text, giving rise to automatic image captioning (AIC), a process that seeks to automatically generate properly formed English sentences or captions that describe the content of images. AIC research has a great impact on various domains, such as virtual assistants, image indexing, recommendation systems, and medical diagnosis systems, among others. Image captioning in particular relies on computer vision for image comprehension and natural language processing to generate textual descriptions that are semantically and linguistically sound.

Various techniques have been pursued to improve automatic image captioning, including deep learning, automatic text generation, retrieval-based image captioning, and template-based image captioning. Nevertheless, research in this field has been hampered by several inherent challenges: the limitations of the theoretical frameworks employed, the quality of the datasets used for machine learning development, and the difficulty of cross-domain generalization. Further research and advancement in the field is therefore urgently needed. This Special Issue will present state-of-the-art research in the field of automatic image captioning, highlighting the latest developments in both the theoretical and practical applications of this emerging technology. Topics of interest include, but are not limited to:

  • Machine learning and deep learning techniques for image captioning;
  • Theoretical frameworks for automatic image captioning;
  • Text summarization for image captioning;
  • Review papers on image captioning;
  • New datasets and evaluation frameworks for automatic image captioning;
  • Preprocessing and filtering techniques for image captioning;
  • Cross-domain generalization in image captioning;
  • Recent advances in automatic medical image captioning.

Dr. Mourad Oussalah
Prof. Dr. Rachid Jennane
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image captioning
  • natural language processing
  • text generation
  • image content analysis
  • image tagging

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

21 pages, 1298 KiB  
Article
A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning
by Jiajia Peng and Tianbing Tang
Appl. Sci. 2024, 14(6), 2657; https://doi.org/10.3390/app14062657 - 21 Mar 2024
Viewed by 804
Abstract
Image captioning, also recognized as the challenge of transforming visual data into coherent natural language descriptions, has persisted as a complex problem. Traditional approaches often suffer from semantic gaps, wherein the generated textual descriptions lack depth, context, or the nuanced relationships contained within the images. In an effort to overcome these limitations, we introduce a novel encoder–decoder framework called A Unified Visual and Linguistic Semantics Method. Our method comprises three key components: an encoder, a mapping network, and a decoder. The encoder employs a fusion of CLIP (Contrastive Language–Image Pre-training) and SegmentCLIP to process and extract salient image features. SegmentCLIP builds upon CLIP’s foundational architecture by employing a clustering mechanism, thereby enhancing the semantic relationships between textual and visual elements in the image. The extracted features are then transformed by a mapping network into a fixed-length prefix. A GPT-2-based decoder subsequently generates a corresponding Chinese language description for the image. This framework aims to harmonize feature extraction and semantic enrichment, thereby producing more contextually accurate and comprehensive image descriptions. Our quantitative assessment reveals that our model exhibits notable enhancements across the intricate AIC-ICC, Flickr8k-CN, and COCO-CN datasets, evidenced by a 2% improvement in BLEU@4 and a 10% uplift in CIDEr scores. Additionally, it demonstrates acceptable efficiency in terms of simplicity, speed, and reduction in computational burden. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
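For readers who want a concrete picture of the prefix idea described in this abstract, the sketch below maps a CLIP-style image embedding to a fixed-length sequence of decoder-sized embeddings that a GPT-2 decoder could consume as a prefix. The dimensions, prefix length, and MLP shape are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class PrefixMappingNetwork(nn.Module):
    """Map a single image embedding to a fixed-length prefix of
    decoder-sized token embeddings (illustrative only)."""

    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, clip_dim) -> (batch, prefix_len, gpt_dim)
        batch = clip_features.size(0)
        return self.mlp(clip_features).view(batch, self.prefix_len, self.gpt_dim)

if __name__ == "__main__":
    dummy_clip = torch.randn(4, 512)           # stand-in for CLIP image features
    prefix = PrefixMappingNetwork()(dummy_clip)
    print(prefix.shape)                        # torch.Size([4, 10, 768])
```

In a full pipeline of this kind, the returned prefix would be concatenated with the embedded caption tokens and fed to the language-model decoder during training.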

15 pages, 6866 KiB  
Article
IFE-Net: An Integrated Feature Extraction Network for Single-Image Dehazing
by Can Leng and Gang Liu
Appl. Sci. 2023, 13(22), 12236; https://doi.org/10.3390/app132212236 - 11 Nov 2023
Cited by 2 | Viewed by 886
Abstract
In recent years, numerous single-image dehazing algorithms have made significant progress; however, dehazing still presents a challenge, particularly in complex real-world scenarios. In fact, single-image dehazing is an inherently ill-posed problem, as scene transmission relies on unknown and nonhomogeneous depth information. This study proposes a novel end-to-end single-image dehazing method called the Integrated Feature Extraction Network (IFE-Net). Instead of estimating the transmission matrix and atmospheric light separately, IFE-Net directly generates the clean image using a lightweight CNN. During the dehazing process, texture details are often lost. To address this issue, an attention mechanism module is introduced in IFE-Net to handle different information impartially. Additionally, a new nonlinear activation function is proposed in IFE-Net, known as a bilateral constrained rectifier linear unit (BCReLU). Extensive experiments were conducted to evaluate the performance of IFE-Net. The results demonstrate that IFE-Net outperforms other single-image haze removal algorithms in terms of both PSNR and SSIM. In the SOTS dataset, IFE-Net achieves a PSNR value of 24.63 and an SSIM value of 0.905. In the ITS dataset, the PSNR value is 25.62, and the SSIM value reaches 0.925. The quantitative results of the synthesized images are either superior to or comparable with those obtained via other advanced algorithms. Moreover, IFE-Net also exhibits significant subjective visual quality advantages. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
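The abstract names a bilateral constrained rectified linear unit (BCReLU) but does not give its functional form; the sketch below shows one plausible reading, a ReLU-like activation clipped on both sides, with the bounds chosen arbitrarily rather than taken from the paper.

```python
import torch
import torch.nn as nn

class BCReLU(nn.Module):
    """One plausible reading of a 'bilaterally constrained' ReLU:
    activations are clipped to a bounded range on both sides (assumed form)."""

    def __init__(self, lower: float = 0.0, upper: float = 1.0):
        super().__init__()
        self.lower = lower
        self.upper = upper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(x, min=self.lower, max=self.upper)

if __name__ == "__main__":
    x = torch.linspace(-2.0, 2.0, steps=9)
    print(BCReLU()(x))   # values squashed into [0, 1]
```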

13 pages, 4074 KiB  
Article
Machine Vision-Based Chinese Walnut Shell–Kernel Recognition and Separation
by Yongcheng Zhang, Xingyu Wang, Yang Liu, Zhanbiao Li, Haipeng Lan, Zhaoguo Zhang and Jiale Ma
Appl. Sci. 2023, 13(19), 10685; https://doi.org/10.3390/app131910685 - 26 Sep 2023
Cited by 2 | Viewed by 1222
Abstract
Walnut shell–kernel separation is an essential step in the deep processing of walnuts, and its difficulty is a crucial factor restricting their added value and industrial development. This study proposes a walnut shell–kernel detection method based on YOLOX deep learning, using machine vision and deep-learning technology to address common issues such as incomplete shell–kernel separation in current airflow screening and the high cost and low efficiency of manually assisted screening. A dataset was produced using Labelme by acquiring walnut shell and kernel images following shellshock, and this dataset was transformed into the COCO dataset format. Next, 110 epochs of training were performed on the network. When the intersection over union threshold was 0.5, the average precision (AP), the average recall rate (AR), the model size, and the floating point operations per second were 96.3%, 84.7%, 99 MB, and 351.9, respectively. Compared with the YOLOv3, Faster Region-based Convolutional Neural Network (Faster R-CNN), and Single Shot MultiBox Detector (SSD) algorithms, the AP value of the proposed algorithm was increased by 2.1%, 1.3%, and 3.4%, respectively; similarly, the AR was increased by 10%, 2.3%, and 9%. Walnut shell–kernel detection was also performed under different situations, such as distinct species, supplementary lighting, and shielding conditions, and the model exhibited high recognition and positioning precision and strong robustness throughout. Moreover, the small size of the model is beneficial for migration applications. These results can provide a technological reference for developing faster walnut shell–kernel separation methods. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
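The reported AP and AR values are computed at an intersection-over-union (IoU) threshold of 0.5; as a reminder of what that criterion means, here is a minimal, framework-free IoU check (box coordinates and values are illustrative only).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive under an AP@0.5 criterion
# when its IoU with a ground-truth box is at least 0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)) >= 0.5)  # False (IoU = 25/175 ≈ 0.14)
```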

17 pages, 5406 KiB  
Article
Bi-LS-AttM: A Bidirectional LSTM and Attention Mechanism Model for Improving Image Captioning
by Tian Xie, Weiping Ding, Jinbao Zhang, Xusen Wan and Jiehua Wang
Appl. Sci. 2023, 13(13), 7916; https://doi.org/10.3390/app13137916 - 6 Jul 2023
Cited by 4 | Viewed by 2673
Abstract
The discipline of automatic image captioning represents an integration of two pivotal branches of artificial intelligence, namely computer vision (CV) and natural language processing (NLP). The principal functionality of this technology lies in transmuting the extracted visual features into semantic information of a higher order. The bidirectional long short-term memory (Bi-LSTM) has garnered wide acceptance in executing image captioning tasks. Of late, scholarly attention has been focused on modifying suitable models for innovative and precise subtitle captions, although tuning the parameters of the model does not invariably yield optimal outcomes. Given this, the current research proposes a model that effectively employs the bidirectional LSTM and attention mechanism (Bi-LS-AttM) for image captioning endeavors. This model exploits the contextual comprehension from both anterior and posterior aspects of the input data, synergistically with the attention mechanism, thereby augmenting the precision of visual language interpretation. The distinctiveness of this research is embodied in its incorporation of Bi-LSTM and the attention mechanism to engender sentences that are both structurally innovative and accurately reflective of the image content. To enhance temporal efficiency and accuracy, this study substitutes convolutional neural networks (CNNs) with fast region-based convolutional networks (Fast RCNNs). Additionally, it refines the process of generation and evaluation of common space, thus fostering improved efficiency. Our model was tested for its performance on Flickr30k and MSCOCO datasets (80 object categories). Comparative analyses of performance metrics reveal that our model, leveraging the Bi-LS-AttM, surpasses unidirectional and Bi-LSTM models. When applied to caption generation and image-sentence retrieval tasks, our model manifests time economies of approximately 36.5% and 26.3% vis-a-vis the Bi-LSTM model and the deep Bi-LSTM model, respectively. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
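To make the Bi-LS-AttM idea concrete, the toy module below runs a bidirectional LSTM over word embeddings and lets each step attend over a set of image-region features with additive attention. The dimensions, the additive scoring function, and the random tensors standing in for Fast R-CNN region features are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMWithAttention(nn.Module):
    """Toy decoder: a bidirectional LSTM over word embeddings whose states
    attend over a set of image-region features (additive attention)."""

    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, region_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.query = nn.Linear(2 * hidden_dim, hidden_dim)
        self.key = nn.Linear(region_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim + region_dim, vocab_size)

    def forward(self, tokens, regions):
        # tokens: (B, T) word ids, regions: (B, R, region_dim) image features
        states, _ = self.bilstm(self.embed(tokens))            # (B, T, 2H)
        q = self.query(states).unsqueeze(2)                    # (B, T, 1, H)
        k = self.key(regions).unsqueeze(1)                     # (B, 1, R, H)
        attn = F.softmax(self.score(torch.tanh(q + k)).squeeze(-1), dim=-1)  # (B, T, R)
        context = torch.bmm(attn, regions)                     # (B, T, region_dim)
        return self.out(torch.cat([states, context], dim=-1))  # (B, T, vocab)

if __name__ == "__main__":
    logits = BiLSTMWithAttention()(torch.randint(0, 1000, (2, 7)), torch.randn(2, 36, 512))
    print(logits.shape)  # torch.Size([2, 7, 1000])
```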

24 pages, 10221 KiB  
Article
ACapMed: Automatic Captioning for Medical Imaging
by Djamila Romaissa Beddiar, Mourad Oussalah, Tapio Seppänen and Rachid Jennane
Appl. Sci. 2022, 12(21), 11092; https://doi.org/10.3390/app122111092 - 1 Nov 2022
Cited by 4 | Viewed by 2718
Abstract
Medical image captioning is a very challenging task that has been rarely addressed in the literature on natural image captioning. Some existing image captioning techniques exploit objects present in the image next to the visual features while generating descriptions. However, this is not possible for medical image captioning when one requires following clinician-like explanations in image content descriptions. Inspired by the preceding, this paper proposes using medical concepts associated with images, in accordance with their visual features, to generate new captions. Our end-to-end trainable network is composed of a semantic feature encoder based on a multi-label classifier to identify medical concepts related to images, a visual feature encoder, and an LSTM model for text generation. Beam search is employed to ensure the best selection of the next word for a given sequence of words based on the merged features of the medical image. We evaluated our proposal on the ImageCLEF medical captioning dataset, and the results demonstrate the effectiveness and efficiency of the developed approach. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
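The abstract mentions beam search for choosing the next word from the merged image features; the fragment below is a generic, model-agnostic beam search over a toy next-token table (tokens and probabilities are invented for illustration), not the authors' implementation.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Generic beam search: step_fn(sequence) returns a list of
    (token, log_probability) continuations for that sequence."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                finished.append((seq, score))
                continue
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

# Tiny toy "model": each token proposes a couple of continuations.
toy = {"<s>": [("chest", math.log(0.6)), ("bone", math.log(0.4))],
       "chest": [("x", math.log(0.9)), ("</s>", math.log(0.1))],
       "bone": [("scan", math.log(0.8)), ("</s>", math.log(0.2))],
       "x": [("rays", math.log(1.0))],
       "scan": [("</s>", math.log(1.0))],
       "rays": [("</s>", math.log(1.0))]}
print(beam_search(lambda seq: toy[seq[-1]], "<s>", "</s>"))
```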

18 pages, 6632 KiB  
Article
Metaheuristics Optimization with Deep Learning Enabled Automated Image Captioning System
by Mesfer Al Duhayyim, Sana Alazwari, Hanan Abdullah Mengash, Radwa Marzouk, Jaber S. Alzahrani, Hany Mahgoub, Fahd Althukair and Ahmed S. Salama
Appl. Sci. 2022, 12(15), 7724; https://doi.org/10.3390/app12157724 - 31 Jul 2022
Cited by 10 | Viewed by 1880
Abstract
Image captioning is a popular topic in the domains of computer vision and natural language processing (NLP). Recent advancements in deep learning (DL) models have enabled the improvement of the overall performance of the image captioning approach. This study develops a metaheuristic optimization with a deep learning-enabled automated image captioning technique (MODLE-AICT). The proposed MODLE-AICT model focuses on the generation of effective captions for the input images by using two processes involving an encoding unit and a decoding unit. Initially, at the encoding part, the salp swarm algorithm (SSA), with a HybridNet model, is utilized to generate an effectual input image representation using fixed-length vectors, which constitutes the novelty of the work. Moreover, the decoding part includes a bidirectional gated recurrent unit (BiGRU) model used to generate descriptive sentences. The inclusion of an SSA-based hyperparameter optimizer helps in attaining effectual performance. For inspecting the enhanced performance of the MODLE-AICT model, a series of simulations were carried out, and the results were examined under several aspects. The experimental values demonstrate the superiority of the MODLE-AICT model over recent approaches. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
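The SSA-based hyperparameter optimizer is described only at a high level here, so the sketch below implements a generic salp swarm optimizer and runs it on a toy objective; substituting a validation-loss objective and hyperparameter bounds would be the reader's own adaptation, not the authors' setup.

```python
import numpy as np

def salp_swarm_optimize(objective, lb, ub, n_salps=20, n_iter=100, seed=0):
    """Minimal salp swarm algorithm minimizing `objective` over a box [lb, ub]."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    pos = rng.uniform(lb, ub, size=(n_salps, dim))
    fitness = np.apply_along_axis(objective, 1, pos)
    best, best_fit = pos[fitness.argmin()].copy(), fitness.min()
    for t in range(1, n_iter + 1):
        c1 = 2.0 * np.exp(-(4.0 * t / n_iter) ** 2)
        for i in range(n_salps):
            if i == 0:  # leader moves around the food source (best so far)
                c2, c3 = rng.uniform(size=dim), rng.uniform(size=dim)
                step = c1 * ((ub - lb) * c2 + lb)
                pos[i] = np.where(c3 >= 0.5, best + step, best - step)
            else:       # followers move toward the salp in front of them
                pos[i] = 0.5 * (pos[i] + pos[i - 1])
        pos = np.clip(pos, lb, ub)
        fitness = np.apply_along_axis(objective, 1, pos)
        if fitness.min() < best_fit:
            best_fit = fitness.min()
            best = pos[fitness.argmin()].copy()
    return best, best_fit

if __name__ == "__main__":
    # Toy stand-in for a hyperparameter objective (e.g. a validation loss).
    sphere = lambda x: float(np.sum(x ** 2))
    best, fit = salp_swarm_optimize(sphere, lb=[-5, -5], ub=[5, 5])
    print(best, fit)
```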

20 pages, 10487 KiB  
Article
Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning
by Deepika Kumar, Varun Srivastava, Daniela Elena Popescu and Jude D. Hemanth
Appl. Sci. 2022, 12(13), 6733; https://doi.org/10.3390/app12136733 - 2 Jul 2022
Cited by 9 | Viewed by 2246
Abstract
Image captioning is oriented towards describing an image with the best possible use of words that can provide a semantic, relatable meaning of the scenario inscribed. Different models can be used to accomplish this arduous task depending on the context and requirements of what needs to be achieved. An encoder–decoder model which uses the image feature vectors as an input to the encoder is often marked as one of the appropriate models to accomplish the captioning process. In the proposed work, a dual-modal transformer has been used which captures the intra- and inter-modality interactions in a simultaneous manner within an attention block. The transformer architecture is quantitatively evaluated on the publicly available Microsoft Common Objects in Context (MS COCO) dataset, yielding a Bilingual Evaluation Understudy (BLEU)-4 score of 85.01. The efficacy of the model is further evaluated on the Flickr 8k, Flickr 30k, and MS COCO datasets, and the results are compared and analysed against state-of-the-art methods. The results show that the proposed model outperforms conventional models, such as the encoder–decoder model and the attention model. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
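As a rough illustration of simultaneous intra- and inter-modality attention, the toy block below applies self-attention within the image and text token streams and cross-attention between them; the dimensions, residual and normalization placement, and ordering are assumptions for illustration rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DualModalBlock(nn.Module):
    """Toy attention block combining intra-modality self-attention with
    inter-modality cross-attention for image and text token sequences."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img, txt):
        # Intra-modality: each modality attends to itself.
        img = img + self.self_img(img, img, img)[0]
        txt = txt + self.self_txt(txt, txt, txt)[0]
        # Inter-modality: each modality attends to the other.
        img = self.norm_img(img + self.cross_img(img, txt, txt)[0])
        txt = self.norm_txt(txt + self.cross_txt(txt, img, img)[0])
        return img, txt

if __name__ == "__main__":
    img_tokens = torch.randn(2, 49, 256)   # e.g. a 7x7 grid of visual features
    txt_tokens = torch.randn(2, 12, 256)   # embedded caption tokens
    out_img, out_txt = DualModalBlock()(img_tokens, txt_tokens)
    print(out_img.shape, out_txt.shape)
```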
