Article

Making Images Speak: Human-Inspired Image Description Generation

LRIT, Faculty of Sciences in Rabat, Mohammed V University in Rabat, Rabat 10000, Morocco
* Authors to whom correspondence should be addressed.
Information 2025, 16(5), 356; https://doi.org/10.3390/info16050356
Submission received: 21 December 2024 / Revised: 26 February 2025 / Accepted: 20 March 2025 / Published: 28 April 2025

Abstract

Despite significant advances in deep learning-based image captioning, many state-of-the-art models still struggle to balance visual grounding (i.e., accurate object and scene descriptions) with linguistic coherence (i.e., grammatical fluency and appropriate use of non-visual tokens such as articles and prepositions). To address these limitations, we propose a hybrid image captioning framework that integrates handcrafted and deep visual features. Specifically, we combine local descriptors—Scale-Invariant Feature Transform (SIFT) and Bag of Features (BoF)—with high-level semantic features extracted using ResNet50. This dual representation captures both fine-grained spatial details and contextual semantics. The decoder employs Bahdanau attention refined with an Attention-on-Attention (AoA) mechanism to optimize visual-textual alignment, while GloVe embeddings and a GRU-based sequence model ensure fluent language generation. The proposed system is trained on 200,000 image-caption pairs from the MS COCO train2014 dataset and evaluated on 50,000 held-out MS COCO pairs plus the Flickr8K benchmark. Our model achieves a CIDEr score of 128.3 and a SPICE score of 29.24, reflecting clear improvements over baselines in both semantic precision—particularly for spatial relationships—and grammatical fluency. These results validate that combining classical computer vision techniques with modern attention mechanisms yields more interpretable and linguistically precise captions, addressing key limitations in neural caption generation.
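The abstract describes a decoder in which Bahdanau attention is refined by an Attention-on-Attention (AoA) gate before driving a GRU-based language model. The sketch below illustrates one plausible form of that attention step in TensorFlow/Keras; the layer sizes, tensor shapes, and names (e.g., ATT_UNITS, BahdanauAoAAttention) are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only: Bahdanau attention refined with an
# Attention-on-Attention (AoA) gate. Dimensions and names are assumed.
import tensorflow as tf

ATT_UNITS = 256  # assumed attention dimensionality

class BahdanauAoAAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W_feat = tf.keras.layers.Dense(units)    # projects visual features
        self.W_hidden = tf.keras.layers.Dense(units)  # projects decoder state
        self.V = tf.keras.layers.Dense(1)             # scores each image region
        # AoA refinement: an information vector modulated by a sigmoid gate
        self.info = tf.keras.layers.Dense(units)
        self.gate = tf.keras.layers.Dense(units, activation="sigmoid")

    def call(self, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        hidden_exp = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W_feat(features) + self.W_hidden(hidden_exp)))
        weights = tf.nn.softmax(scores, axis=1)          # attention over regions
        context = tf.reduce_sum(weights * features, axis=1)
        # AoA: gate the information vector using the context and decoder state
        concat = tf.concat([context, hidden], axis=-1)
        refined = self.gate(concat) * self.info(concat)
        return refined, weights

# Usage on dummy tensors (shapes assumed for illustration)
att = BahdanauAoAAttention(ATT_UNITS)
ctx, w = att(tf.random.normal((2, 49, 2048)), tf.random.normal((2, 512)))
print(ctx.shape, w.shape)  # (2, 256) (2, 49, 1)
```

The gated refinement filters the attended context before it reaches the GRU, which is the role the abstract attributes to the AoA mechanism in improving visual-textual alignment.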
Keywords: image captioning; deep learning; visual attention; neural networks; ResNet; GloVe embeddings; SIFT; Bag of Features; CIDEr; assistive technologies; MS COCO; NLP

