Article

Image Captioning Using Topic Faster R-CNN-LSTM Networks

Department of Computer Science and Information Engineering, National Chiayi University, Chiayi City 60004, Taiwan
*
Author to whom correspondence should be addressed.
Information 2025, 16(9), 726; https://doi.org/10.3390/info16090726
Submission received: 2 July 2025 / Revised: 4 August 2025 / Accepted: 22 August 2025 / Published: 25 August 2025
(This article belongs to the Special Issue Information Processing in Multimedia Applications)

Abstract

Image captioning is an important cross-modal research task with numerous applications. It aims to capture the semantic content of an image and express it in a linguistically and contextually appropriate sentence. However, existing models tend to focus on the topic generated by the most conspicuous foreground objects, so other topics in the image are often ignored. To address these limitations, we propose a model that can generate richer semantic content and more diverse captions. The proposed model captures not only the main topics derived from coarse-grained objects but also fine-grained visual information from background or minor foreground objects. Our image captioning system combines the ResNet, LSTM, and topic feature models. The ResNet model extracts fine-grained image features and enriches the description of objects. The LSTM model provides a longer context for semantics, increasing the fluency and semantic completeness of the generated sentences. The topic model determines multiple topics based on the image and text content, and these topics provide directions for the model to generate different sentences. We evaluate our model on the MSCOCO dataset. The results show that, compared with other models, our model achieves a consistent improvement in higher-order BLEU scores and a significant improvement in the CIDEr score.

1. Introduction

Image captioning is a crucial task in multimodal image–text research and serves as a foundational component in vision–language applications, including image/text generation, image retrieval, visual question answering, and image understanding. Image captioning can generate contextually relevant textual captions from an image, so an image–text pair dataset can be created automatically by caption generation. In image retrieval, image captioning can produce caption sentences from which keywords are extracted, allowing images to be located through keyword search. In visual question answering, some domains lack sufficient data; image captioning can augment the training data with new sentences expressing the same content in different ways. Therefore, image captioning plays a significant role in enhancing the performance, scalability, and generalizability of various multimodal tasks involving both images and text.
Early studies employed template-based generation models for image captioning. However, template-based generation models suffer from several limitations. The baby talk system proposed by Kulkarni et al. [1] adopted a template generation model: it generated a template and filled the detected image objects into the blanks of that template. The advantage of this approach is its readability, but it lacks flexibility, and the generated caption sentences cannot express interactions between objects. Following the template models, encoder–decoder models that combine a convolutional neural network (CNN) with a recurrent neural network (RNN) were proposed to address this inflexibility. However, existing models predominantly generate caption sentences that describe objects occupying large areas of the image, whereas smaller objects have a lower probability of becoming the main subject of a generated sentence due to the model's weighting. Thus, while current models can produce diverse sentences, the topic scope of these sentences remains relatively limited. In this paper, we aim to address this issue of limited topic scope in generated sentences.
We propose an image captioning system that integrates a Faster R-CNN, an LSTM model, and a topic model to generate semantically diverse and context-aware captions. We employ the Faster R-CNN as the object detection model: it extracts object visual and location information, referred to as attribute (appearance) and spatial features. The attribute features serve as a crucial basis for sentence generation and topic detection. We utilize non-negative matrix factorization (NMF) to extract latent topics from the textual and attribute features of the training data. Subsequently, we leverage the extracted topic information to train a ResNet model capable of identifying the topics associated with image objects. For the language generation component, we employ an LSTM model to generate captions conditioned on both topic and visual features. This dual-modality input enhances the model's ability to generate diverse and semantically relevant sentences aligned with both the visual content and the topics of the image.
The structure of this paper is as follows. Section 2 reviews the related works. In Section 3, we present the proposed method and provide details for each module. Section 4 discusses the experimental results. Finally, we draw conclusions in Section 5.

2. Related Works

2.1. Image Caption Generation

Several studies have explored image caption generation. Kulkarni et al. [1] concentrated on identifying sentence templates by analyzing sentence constructions, such as "There is/are…". They integrated these templates with an image detection model to finalize the sentences. Despite offering consistent descriptive sentence patterns, their approach had fixed structures and lacked adaptability. In comparison, image captioning models that combine a convolutional neural network (CNN) and a recurrent neural network (RNN) can produce more adaptable descriptive sentences, including descriptions of interactions among objects in an image. Çaylı et al. [2] implemented an image captioning model that can be executed on smartphones. Their proposed model is based on Inception-v3 and multi-layer recurrent neural networks. Inception-v3 adopts a CNN-based architecture and utilizes multi-scale convolutions to extract visual features; it also leverages factorization and sparse connection techniques to reduce model complexity. This method achieved high accuracy while alleviating the model's computational burden, enabling it to run on smartphones with limited capabilities. Additionally, they employed three layers of GRUs in the language model, allowing the model to compute more complex representations. Sangolgi et al. [3] proposed a multilingual voice-based image captioning system built on a CNN-LSTM architecture. The system first extracts image features using a CNN and then generates image captions using an LSTM as the language model. To make the system accessible to visually impaired users, they integrated Google Text-to-Speech to convert the generated captions into audio, and they employed the Google Translate API to translate the synthesized speech into different languages. Huang et al. [4] proposed a character-level recurrent neural network (c-RNN) model for image captioning. To avoid biases introduced by word tokenization algorithms, the authors employed characters as token units and embedded these character tokens as input to the RNN. The use of characters as input units also enabled the language model to dynamically infer word pronunciations and grammatical rules, leading to more expressive and complex sentence structures. Bineeshia [5] employed a CNN + RNN model for image captioning, utilizing a CNN to extract image features and an RNN to generate caption sentences, and evaluated the model with BLEU-1, BLEU-2, BLEU-3, and BLEU-4. Khamparia et al. [6] proposed a video captioning model that generates semantically rich descriptions by combining a CNN and an LSTM. Their system consists of three parts: a CNN extracts image objects and assigns corresponding words, an LSTM language model generates candidate sentences, and a semantic ranking mechanism selects the candidate sentence with the highest semantic score as the output. Consequently, their model can obtain the most semantically relevant image caption. Verma et al. [7] proposed a model that combines a VGG and an LSTM to generate grammatically complete image captions. Bartosiewicz et al. [8] aimed to find the best combination of components for an image captioning model. They selected CNN-based models such as ResNet and VGG to extract visual features and adopted an RNN-based model as the language model; among the tested encoders, the combination of Xception and DenseNet201 achieved the best performance on the image captioning task.
In addition to these traditional architectures, Wang et al. [9] combined a CNN and a transformer for image caption generation; the self-attention mechanism of the transformer aids in handling image features and generating caption sentences. Image captioning has also shown promise in professional fields. Zeng et al. [10] applied it in the medical domain, utilizing region detection to identify lesion areas in ultrasound images and describing disease information using an LSTM. Furthermore, image captioning has extended to languages other than English: Palash et al. [11] and Alam et al. [12] investigated image captioning in Bengali, while Mishra et al. [13] applied it to Hindi. Image captioning has also been employed to assist the visually impaired. Wadhwa et al. [14] described the surroundings of visually impaired users in words and converted them into Braille, and Kulkarni et al. [15] converted image descriptions into sound to help visually impaired users perceive their environment. Yang et al. [16] proposed FS-StyleCap, a visual captioning system that utilizes a conditional language model and a visual projection module; the conditional language model integrates visual features with a stylized sentence format from a corpus to guide caption generation. Du et al. [17] used a graph convolutional network (GCN) in an image captioning task; the GCN models the relations between objects and generates more relation-aware captions. Li et al. [18] and Peng et al. [19] integrated patch-based image encoders with large language models to generate captions with stronger local perception.

2.2. Topic Model

Topic modeling is a statistical method commonly used for analyzing textual data. It assumes that a text is composed of multiple topics and automatically groups words into topics by analyzing their co-occurrence patterns. Topic models can serve as prior knowledge to predict the topic of a new sentence. In our research, we aim to promote diversity in the generated sentences; to achieve this, we utilize topics as sentence themes to guide the model to generate sentences with different focuses based on the specific topic. Common topic models include Probabilistic Latent Semantic Analysis (PLSA) [20], Latent Dirichlet Allocation (LDA) [21], and non-negative matrix factorization (NMF) [22]. LDA is a model based on Bayesian inference; it estimates the probability distribution between texts and topics using Bayesian methods to achieve topic classification. NMF is a dimensionality reduction technique based on matrix decomposition; it decomposes high-dimensional textual data into low-dimensional topic and word matrices. NMF imposes non-negativity constraints, ensuring that both topic and word matrices are non-negative, which facilitates interpretation and understanding. Additionally, NMF is computationally efficient and consumes minimal resources, making it a popular choice for various systems. PLSA is a probabilistic text topic model that estimates the probability distribution between texts and topics using the Expectation-Maximization (EM) algorithm, enabling topic discovery and text classification. Topic models have been applied in various domains. For instance, English et al. [23] employed LDA for document classification: they determined the classification of documents by detecting their topics, calculated the semantic distance between document topics to determine topic clusters, and performed coarse classification based on these clusters. In the field of Artificial Intelligence (AI), topic models have also become a popular choice. Yu et al. [24] compiled a comprehensive analysis of the usage trends of topic models in AI from 1990 to 2021, covering Approximate Reasoning, Computational Theory, Intelligent Automation, Artificial Neural Networks, Machine Learning, Natural Language Processing, and Computer Vision. This demonstrates the significant role of topic models in AI-related applications. Inoue et al. [25] employed LDA topic models to investigate the impact of COVID-19 on healthcare research: they performed PCA in conjunction with LDA topic detection on hospital survey documents and extracted topics describing the negative impact of COVID-19 on healthcare research to analyze the effects of large-scale epidemics on healthcare personnel. Abdalgader et al. [26] introduced a framework with dynamic word embeddings for short-text topic classification; they proposed temporal-aware word embeddings that incorporate a time parameter to capture semantic drift.
In recent years, there has been a growing trend in utilizing topic information to generate sentences with semantic consistency. Wu et al. [27] proposed a Constrained Topic-aware Model to address the issue of text generation for poetry. They utilized a visual semantic vector to embed visual content in the absence of paired image–poetry data. To tackle the problem of topic deviation in generated content, they introduced a topic-aware poetry generation model. Finally, they designed an Anti-frequency Decoding (AFD) scheme to constrain high-frequency characters in the generation process. The authors observed an improvement in readability and overall semantic consistency. Their proposed approach even achieved good performance with unsupervised learning. Wang et al. [28] applied the topic model to image captioning generation for aerial images. They used pre-established topic libraries or artificially defined topics to guide the model in generating sentences that aligned with the specified topics. Li et al. [29] provided a multiple-view summarization framework (MVSF) to generate multiple summaries from the same social media posts. The framework analyzes perspectives and divides the posts into a variety of topics. According to these topics, the model generates multiple viewpoint summaries.

3. Methods

In this section, the system is divided into three components: object detection, topic detection, and text generation. First, Faster R-CNN is employed for object detection to extract individual objects and their attributes from images; these serve as the basis for text generation. Subsequently, we train the topic model to ensure that specific topics are present during text generation. Finally, we utilize the attributes, objects, and topics to train the language model. The goal is to generate text captions that focus on specific topics. The system flowchart is illustrated in Figure 1.
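To make the flow concrete, the sketch below outlines how the three components interact at inference time; the function names and signatures are illustrative placeholders of ours, not the paper's released implementation.

```python
def generate_caption(image, detector, topic_model, language_model):
    """Illustrative end-to-end flow of the proposed system (names are hypothetical)."""
    # Section 3.1: Faster R-CNN returns objects, attribute features, and spatial features.
    objects, attribute_feats, spatial_feats = detector(image)
    # Section 3.2: NMF-guided ResNet predicts topic features for the image.
    topic_feats = topic_model(image, attribute_feats)
    # Section 3.3: the topic-conditioned LSTM decodes the caption word by word.
    caption = language_model(attribute_feats, spatial_feats, topic_feats)
    return caption
```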

3.1. Object Detection

The purpose of object detection is to find the locations of objects in an image and to extract features from them. To achieve this, we need to train a model that extracts both spatial features and attribute (appearance) features: the spatial features represent the location of an object in the image, and the attribute features represent its appearance. In caption generation, the language model requires detailed and extensive object attributes to generate precise and informative textual captions. Therefore, the Faster R-CNN (Region-based Convolutional Neural Network) [30] model is adopted. In contrast to plain CNN and ResNet models, which focus on hierarchical feature extraction, the Faster R-CNN combines a CNN backbone with a Region Proposal Network (RPN) and therefore pays more attention to accurate region locations and object attributes. This makes the Faster R-CNN more suitable for object detection tasks requiring high precision, so we adopt it as the backbone of the object detection module in our system.
The Faster R-CNN comprises five convolutional layers, a Region Proposal Network (RPN), a Region of Interest (ROI) pooling layer, and a softmax layer. The entire image serves as the input to the Faster R-CNN. We resize the image such that its shorter side is s = 800 pixels and its longer side does not exceed 1200 pixels. The resized image is processed by the convolutional layers to extract visual features into a feature map. This feature map is fed into the RPN to generate candidate regions representing potential object areas, and ROI pooling is employed to obtain proposal feature maps guided by these candidate regions. Finally, the proposal feature maps are input into the softmax layer, which outputs objects and attributes. The image features combined with the detected objects and attributes serve as one of the inputs to the language model. The architecture of the Faster R-CNN is illustrated in Figure 2.
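The resizing rule above can be written as a short preprocessing step. The following sketch, which assumes Pillow for image handling (the function name is ours), scales the shorter side to 800 pixels while keeping the longer side at or below 1200 pixels.

```python
from PIL import Image

def resize_for_detector(image_path, short_side=800, max_long_side=1200):
    """Resize so the shorter side equals `short_side` and the longer side
    does not exceed `max_long_side`, as described above."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    scale = short_side / min(w, h)           # bring the shorter side to 800 px
    if scale * max(w, h) > max_long_side:    # cap the longer side at 1200 px
        scale = max_long_side / max(w, h)
    new_size = (int(round(w * scale)), int(round(h * scale)))
    return img.resize(new_size, Image.BILINEAR)
```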

3.2. Topic Detection

A single image contains an immense amount of information. Language models tend to generate captions with more emphasis on objects that occupy a larger proportion of the image. While such captions generally align with the main content, they often overlook the equally important but smaller details in the image. Therefore, our system introduces the topic concept and generates a variety of text captions around the different topics. In our system, topic detection is divided into two parts: text and image detection. Non-negative matrix factorization (NMF) is adopted as the text topic detection model and ResNet is used as the image topic detection model. Since the caption dataset does not include labels for image topics, we guide the training of the image topic model based on the text topics. The neural network architecture is illustrated in Figure 3.
Textual topic detection: Due to its ease of implementation and computational efficiency, NMF is employed as the topic detection model for textual data. NMF is a matrix factorization method. In our implementation, we treat the captions in the dataset as documents and utilize NMF to decompose them into a topic–word probability matrix and a topic–document probability matrix. Subsequently, we extract the top-K topics with the highest relevance probabilities for each document from the topic–document matrix. These top-K topics serve as the textual topic features used to train the ResNet.
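A minimal sketch of this step using scikit-learn is shown below; the number of topics and the value of K are illustrative choices on our part, not values reported in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def extract_caption_topics(captions, n_topics=20, top_k=3):
    """Factorize the caption-term matrix and return the top-K topics per caption."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(captions)            # documents x terms
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    W = nmf.fit_transform(X)                          # document-topic weights
    H = nmf.components_                               # topic-word weights
    top_topics = np.argsort(-W, axis=1)[:, :top_k]    # top-K topics per document
    return top_topics, W, H
```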
Image topic detection: Most current captions primarily focus on the main objects in the images and lack descriptions of the background and minor foreground objects. Therefore, we aim to strengthen this aspect by extracting a variety of semantic topic features. In the training phase, due to the absence of image topic labels in the dataset, we design a text-topic-guided image topic model: the textual topics from the NMF model guide the model to learn image topics from visual features. The image topic model aims to learn these explicit relationships to identify image topics. Additionally, we retain the probabilities of other object topics preserved by object detection as implicit topics. In implementation, the visual features of the image and the text topics from the NMF model are fed into a ResNet model to train the image topic model, and the output of the fully connected layer of the ResNet serves as the image topic feature representation. The textual and image topics are then utilized as features for training the language model.
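The sketch below shows one way to realize the text-topic-guided image topic model with a Keras ResNet50 backbone; the input size, head width, and soft-label cross-entropy loss are our assumptions rather than settings stated in the paper.

```python
import tensorflow as tf

def build_image_topic_model(n_topics, input_shape=(224, 224, 3)):
    """ResNet backbone with a topic head, trained against the NMF-derived
    topic distribution of each image's captions (used as soft labels)."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    x = tf.keras.layers.Dense(512, activation="relu")(backbone.output)
    topic_probs = tf.keras.layers.Dense(n_topics, activation="softmax")(x)
    model = tf.keras.Model(backbone.input, topic_probs)
    # Soft topic targets make categorical cross-entropy a natural choice here.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

In this sketch, the penultimate dense layer would play the role of the image topic feature representation described above.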

3.3. Language Model

The language model needs to generate a caption with contextual semantics using attributes, object relationships, and topic features. Therefore, we adopt a recurrent neural network (RNN)-based model, because an RNN can capture contextual relationships. A long short-term memory (LSTM) network is particularly suitable for our needs. The advantage of an LSTM lies in the design of its input, forget, and output gates, which allows it to better retain relevant information and address the issue of vanishing gradients in long sentences. Additionally, the forget gate and memory cell connect the context in longer sentences and prevent semantic errors or inconsistent phrases. Thus, an LSTM can capture complete contextual relationships. Formulas (1)–(6) define the LSTM model.
$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$ (1)
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$ (2)
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$ (4)
$h_t = o_t \odot c_t$ (5)
$p_{t+1} = \mathrm{Softmax}(h_t)$ (6)
$i_t$ represents the input gate, $f_t$ represents the forget gate, and $o_t$ represents the output gate. $c_t$ is a register that determines whether to update the memory based on the input gate and forget gate. $h_t$ is the hidden layer, which is fed into the softmax layer to obtain the probability distribution of possible words.
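For clarity, a minimal NumPy sketch of one decoding step following Formulas (1)–(6) is given below; the weight and bias naming is ours, and in a full system the softmax would typically be applied to a vocabulary-sized projection of the hidden state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Formulas (1)-(6); W and b are dicts of weights/biases."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])      # input gate, Eq. (1)
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])      # forget gate, Eq. (2)
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])      # output gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])  # Eq. (4)
    h_t = o_t * c_t                                               # Eq. (5)
    p_next = np.exp(h_t - h_t.max())
    p_next /= p_next.sum()                                        # softmax, Eq. (6)
    return h_t, c_t, p_next
```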

4. Experiments

4.1. Dataset

The Microsoft Common Objects in Context (MSCOCO) dataset, created by Microsoft, comprises real-world images paired with corresponding annotations. It encompasses 91 object categories organized into 11 primary classes and 80 subcategories. The 11 primary classes are: Person & Accessory, Vehicle, Outdoor Obj, Animal, Sports, Kitchenware, Food, Furniture, Electronics, Appliance, and Indoor Obj. The MSCOCO dataset supports tasks such as object detection, image segmentation, image captioning, and visual question answering. It comprises 164,062 images, each accompanied by five caption sentences, and is divided into training, validation, and test sets containing 82,783, 40,504, and 40,775 images, respectively. An example of the MSCOCO dataset for the image captioning task is shown in Table 1.
The Flickr30k dataset is another benchmark for the image captioning task. It contains 31,783 real-world images from the Flickr website, each annotated with five captions. The dataset is divided into training, validation, and test sets containing 29,783, 1000, and 1000 image–caption pairs, respectively. In our experiments, we conduct a transfer evaluation: the model trained on the MSCOCO dataset is evaluated on Flickr30k.

4.2. Experimental Setup

To reduce the training time required for the model, this research uses a GPU. The experiments are conducted on a system equipped with an Intel Core i7-6700 3.4 GHz CPU, 32 GB of RAM, and a GeForce GTX 1080 Ti GPU with 11 GB of GDDR5X memory. The proposed model is implemented in Python 3.6 with TensorFlow 1.13.0 and Keras 2.3.1.
The model is trained with the following hyperparameter settings: a vocabulary size of 150,000, batch size of 64, embedding size of 1024, initial learning rate of 0.0005 for the Inception layer, learning rate decay factor of 0.25, number of epochs per decay of 8.0, word count threshold of 4, and maximum sentence length of 32. We use the Adam optimizer to optimize this model.
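For reference, the hyperparameters above can be collected into a single configuration; the dictionary keys are our own naming, and the optimizer call is written against the current tf.keras API rather than the exact TensorFlow 1.13/Keras 2.3.1 versions used in the paper.

```python
import tensorflow as tf

# Hyperparameters reported in Section 4.2 (key names are ours).
config = {
    "vocab_size": 150_000,
    "batch_size": 64,
    "embedding_size": 1024,
    "initial_learning_rate": 0.0005,     # for the Inception layer
    "learning_rate_decay_factor": 0.25,
    "num_epochs_per_decay": 8.0,
    "word_count_threshold": 4,
    "max_caption_length": 32,
}

optimizer = tf.keras.optimizers.Adam(learning_rate=config["initial_learning_rate"])
```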

4.3. Evaluation Metrics

Evaluating image caption generation requires considering criteria such as readability, fluency, accuracy, relevance, and diversity to align with human expectations. These aspects can be assessed through both manual review and evaluation metrics. Following the works of Liu et al. [31] and Zhu et al. [32], we employed the Bilingual Evaluation Understudy (BLEU) metric, widely used in image captioning. BLEU assesses the n-gram accuracy between generated captions and ground truth sentences, thereby verifying the accuracy and readability of the generated captions. The BLEU is defined by Formulas (7)–(9).
$p = \dfrac{count(\text{n-gram})}{num_{sys}}$ (7)

$p_n = \dfrac{\sum_{C \in \{Candidates\}} \sum_{\text{n-gram} \in C} Count_{clip}(\text{n-gram})}{\sum_{C \in \{Candidates\}} \sum_{\text{n-gram} \in C} Count(\text{n-gram})}$ (8)

$BLEU = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \quad BP = \begin{cases} 1, & l_c > l_s \\ e^{(1 - l_s/l_c)}, & l_c \le l_s \end{cases}$ (9)
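The clipped n-gram precision of Formula (8) and the brevity penalty of Formula (9) can be computed directly; the sketch below handles a single candidate sentence with uniform weights $w_n = 1/N$, which is an assumption on our part.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n from Formula (8) for one candidate."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
    return clipped / max(sum(cand.values()), 1)

def sentence_bleu(candidate, references, max_n=4):
    """BLEU with uniform weights and the brevity penalty of Formula (9)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    l_c = len(candidate)
    l_s = len(min(references, key=lambda r: abs(len(r) - l_c)))  # closest reference length
    bp = 1.0 if l_c > l_s else math.exp(1.0 - l_s / l_c)
    return bp * math.exp(log_avg)
```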
In addition to the BLEU, we employ two recall-based metrics: Metric for Evaluation of Translation with Explicit Ordering (METEOR) and Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence (Rouge-L). METEOR calculates the precision and recall of the caption sentences, considering synonyms. Its definition is provided in Formula (10).
$Pen = \gamma \left(\dfrac{ch}{m}\right)^{\theta}, \quad F_{mean} = \dfrac{PR}{\alpha P + (1-\alpha)R}, \quad P = \dfrac{|m|}{\sum_k h_k(c_i)}, \quad R = \dfrac{|m|}{\sum_k h_k(s_{ij})}, \quad Meteor = (1 - Pen) \cdot F_{mean}$ (10)
$Pen$ represents the penalty function, $ch$ represents the number of chunks, and $m$ represents the number of matched unigrams. $P$ denotes the precision and $R$ denotes the recall. Rouge-L calculates the similarity between a caption sentence and a ground truth sentence using precision and recall. Formula (11) defines Rouge-L.
$R_{lcs} = \dfrac{LCS(X,Y)}{m}, \quad P_{lcs} = \dfrac{LCS(X,Y)}{n}, \quad F_{lcs} = \dfrac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}, \quad \beta = \dfrac{P_{lcs}}{R_{lcs}}$ (11)
LCS represents the Longest Common Subsequence between a caption sentence and a ground truth sentence. m represents the length of the caption sentence, and n represents the length of the ground truth sentence.
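A short sketch of the ROUGE-L computation is given below. We follow the usual convention of normalizing recall by the reference length and precision by the candidate length, and we set $\beta = P_{lcs}/R_{lcs}$ as in Formula (11); treat the exact normalization as our assumption where it differs from the authors' notation.

```python
def lcs_length(x, y):
    """Length of the Longest Common Subsequence between token lists x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference):
    """ROUGE-L F-score following Formula (11)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    r_lcs = lcs / len(reference)     # recall, normalized by the reference length
    p_lcs = lcs / len(candidate)     # precision, normalized by the candidate length
    beta = p_lcs / r_lcs
    return (1 + beta ** 2) * r_lcs * p_lcs / (r_lcs + beta ** 2 * p_lcs)
```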
Furthermore, we employed the CIDEr score, which calculates the similarity between generated captions and ground truth sentences based on n-gram matching. Each n-gram receives a different TF-IDF weight to calculate the similarity, helping assess the relevance of the sentences. Formulas (12)–(14) define the CIDEr score.
$CIDEr_n(c_i, S_i) = \dfrac{1}{m} \sum_j \dfrac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}$ (12)

$CIDEr(c_i, S_i) = \dfrac{1}{N} \sum_{n=1}^{N} CIDEr_n(c_i, S_i)$ (13)

$g_k(s_{ij}) = \dfrac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})} \log\!\left(\dfrac{|I|}{\sum_{I_p \in I} \min\!\left(1, \sum_q h_k(s_{pq})\right)}\right)$ (14)

$c_i$ represents the i-th caption sentence, $s_{ij}$ represents the j-th reference (ground truth) sentence for the i-th image, and n represents the word length of the n-gram.

4.4. Comparison with Other Models

To evaluate its effectiveness, the proposed approach was compared with five state-of-the-art baseline systems: the c-RNN proposed by Huang and Hu [4], the hybrid CNN–RNN model proposed by Khamparia et al. [6], the stylized captioning framework introduced by Yang et al. [16], the Multi-Relational Weighted Graph Convolutional Network (MR-WGCN) by Du et al. [17], and the integrated CNN and adaptive LSTM model by Zhao et al. [33]. The first baseline is the fine-grained recurrent language model c-RNN [4], which focuses on linguistic structure for image captioning: it uses character-level tokenization and lets the language model dynamically infer word pronunciations and syntactic rules. The model in [6] integrates a convolutional network and a long short-term memory (LSTM) network to generate image captions based on visual features and sequential dependencies, forming a strong baseline for conventional captioning; it leverages the long-term memory capabilities of LSTMs to generate more semantically rich and contextually relevant captions. In contrast, the approach by Yang et al. [16] extends captioning flexibility by conditioning outputs on stylized sentence patterns, enabling guided and diverse caption generation. The approach by Du et al. [17] adopts a GCN model to connect the relations between objects in the image.
As shown in Table 2, the proposed method achieves strong performance in higher-order BLEU scores, particularly BLEU-4 (0.304), surpassing the c-RNN [4] (0.237) and the model of Yang et al. [16] (0.136). Although Khamparia et al. [6] report a high BLEU-1 score (0.852), they do not provide higher-order BLEU metrics, making a comprehensive comparison difficult. In the experiments, the proposed model outperforms the c-RNN [4] in BLEU-1, -2, -3, and -4. The c-RNN shows a strong capability to capture individual words from the caption sentences in the test dataset, whereas our model is more inclined to generate synonymous expressions based on topic features; consequently, the c-RNN achieves a BLEU-1 score close to ours. However, the images in the dataset are complex, and the long-term memory retention of the LSTM is better than that of the RNN. The proposed model therefore generates more semantically complete and syntactically accurate captions than the c-RNN model and achieves higher BLEU-2, BLEU-3, and BLEU-4 scores. Notably, our model exhibits well-balanced results across BLEU-1 to BLEU-4, reflecting its capability to generate coherent, multi-word captions. This improvement stems from the synergy between semantic object detection via the Faster R-CNN and topic-informed sequential captioning through the LSTM. Similarly, our model shows the same trend in BLEU scores on the Flickr30k dataset (Table 3). However, when evaluated directly on Flickr30k without fine-tuning, the BLEU-3 and BLEU-4 scores are lower than those on the MSCOCO dataset. We suggest that our model generates more synonymous expressions that do not exactly match the ground truth because of unmatched topic features, which leads to lower BLEU-3 and BLEU-4 scores. Nevertheless, the higher BLEU-1 and BLEU-2 scores show that the generated captions still contain relevant and accurate keywords.
Further analysis using semantic metrics, shown in Table 4 and Table 5, demonstrates the robustness of the proposed approach. The CIDEr score of the proposed approach is higher than those of the models proposed by Khamparia et al. [6] and Yang et al. [16] on the MSCOCO dataset. This suggests that the topic model provides enriched semantics to our language model and that the Faster R-CNN extracts more detailed object features. As a result, our model achieves superior performance on the CIDEr metric, which emphasizes semantic richness and content diversity. Although Khamparia et al. report a notably high ROUGE-L score (0.863), this metric alone does not fully reflect semantic depth. Our model achieves a more balanced performance across METEOR (0.190) and ROUGE-L (0.305), indicating improved fluency and lexical variation. Conversely, the results are worse than Khamparia's model in BLEU-1 and ROUGE-L, because both metrics emphasize word order and precise n-gram matching. Compared to the c-RNN [4], which focuses on fine-grained language modeling but lacks explicit visual–semantic grounding, our topic-enhanced Faster R-CNN + LSTM model generates captions that are not only grammatically coherent but also contextually richer and more informative. However, a decrease in the CIDEr score is observed on the Flickr30k dataset. Our model achieves a higher METEOR score than the c-RNN [4], the method by Du et al. [17], and the model by Zhao et al. [33], indicating that the generated captions align more closely with human semantic expressions. The decrease in the CIDEr score may be due to the topic features being stylized by the MSCOCO dataset, which leads to distinct style expressions.
In Table 6, we compare the captions generated for an image by the proposed model (ResNet + Oe + ATTLSTM + SCST) and by the c-RNN model. The results indicate that our model generates more detailed and accurate captions, especially regarding attributes. For example, our caption "A man in a white shirt holding a hot dog and a hamburger" describes the color attribute "white shirt" and the detailed action "holding a hot dog and a hamburger". In contrast, the c-RNN model generates simple captions, such as "A man holding food". This example demonstrates that our topic-enhanced Faster R-CNN + LSTM model is not only rich in content but also contextually precise.
Overall, our approach outperforms existing baselines in both linguistic quality and semantic accuracy, validating the advantages of incorporating visual semantics and topic guidance into the captioning process. Because we adopt topic features aimed at enhancing sentence diversity, we do not achieve significant improvements in the BLEU-1 and ROUGE-L metrics; nevertheless, our model still extracts relevant keywords and achieves competitive BLEU scores. Moreover, the combination of the topic model and the LSTM model enables our model to outperform other models in terms of semantic completeness, sentence fluency, and grammatical correctness.

5. Conclusions

In this study, we present an image captioning system based on the Faster R-CNN, LSTM, and topic models. The system uses the Faster R-CNN for fine-grained object detection, a non-negative matrix factorization (NMF) model together with a ResNet for topic modeling, and an LSTM for sentence generation. The captioning system achieves the goal of enhancing sentence diversity by using topic-guided semantic information and fine-grained visual features. The experimental results demonstrate the competitive performance of our topic-guided framework on the higher-order BLEU and CIDEr metrics and validate its strength in generating semantically rich and diverse captions. Specifically, the combination of the topic model and the LSTM model enhances the CIDEr score by 0.065, and the improvement in the higher-order BLEU scores reflects sentence fluency, lexical diversity, and semantic completeness. However, an examination of the generated captions also indicates that the semantic information remains fragmented and incomplete. Therefore, in future work we aim to improve the integrity and coherence of the semantic information and to seek more specialized metrics for evaluating semantic diversity in order to present our results more accurately.

Author Contributions

Conceptualization, J.-F.Y. and C.-C.C.; methodology, K.-M.L.; software, C.-C.C. and K.-M.L.; validation, J.-F.Y., C.-C.C. and K.-M.L.; formal analysis, K.-M.L.; investigation, K.-M.L.; resources, K.-M.L.; data curation, K.-M.L.; writing—original draft preparation, K.-M.L.; writing—review and editing, J.-F.Y.; visualization, K.-M.L.; supervision, J.-F.Y.; project administration, J.-F.Y.; funding acquisition, J.-F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

National Science and Technology Council: 111-2221-E-415-012-MY3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://cocodataset.org/#home; accessed on 22 April 2025.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Baby talk: Understanding and generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1601–1608. [Google Scholar] [CrossRef] [PubMed]
  2. Çaylı, Ö.; Makav, B.; Kılıç, V.; Onan, A. Mobile application based automatic caption generation for visually impaired. In Intelligent and Fuzzy Techniques: Smart and Innovative Solutions, Proceedings of the INFUS 2020 Conference, Istanbul, Turkey, 21–23 July 2020; Springer: Cham, Switzerland, 2020; pp. 1532–1539. [Google Scholar]
  3. Sangolgi, V.A.; Patil, M.B.; Vidap, S.S.; Doijode, S.S.; Mulmane, S.Y.; Vadaje, A.S. Enhancing Cross-Linguistic Image Caption Generation with Indian Multilingual Voice Interfaces using Deep Learning Techniques. Procedia Comput. Sci. 2023, 233, 547–557. [Google Scholar] [CrossRef]
  4. Huang, G.; Hu, H. c-RNN: A Fine-Grained Language Model for Image Captioning. Neural Process Lett. 2019, 49, 683–691. [Google Scholar] [CrossRef]
  5. Bineeshia, J. Image Caption Generation Using CNN-LSTM Based Approach. In Proceedings of the ICCAP 2021, Chennai, India, 7–8 December 2021; p. 352. [Google Scholar]
  6. Khamparia, A.; Pandey, B.; Tiwari, S.; Gupta, D.; Khanna, A.; Rodrigues, J.J. An integrated hybrid CNN–RNN model for visual description and generation of captions. Circuits Syst. Signal Process. 2020, 39, 776–788. [Google Scholar] [CrossRef]
  7. Verma, A.; Yadav, A.K.; Kumar, M. Automatic image caption generation using deep learning. Multimed. Tools Appl. 2024, 83, 5309–5325. [Google Scholar] [CrossRef]
  8. Bartosiewicz, M.; Iwanowski, M. The Optimal Choice of the Encoder–Decoder Model Components for Image Captioning. Information 2024, 15, 504. [Google Scholar] [CrossRef]
  9. Wang, S.; Zhu, Y. A Novel Image Caption Model Based on Transformer Structure. In Proceedings of the ICICSE, Chengdu, China, 19–21 March 2021; pp. 144–148. [Google Scholar] [CrossRef]
  10. Zeng, X.; Wen, L.; Liu, B.; Qi, X. Deep learning for ultrasound image caption generation based on object detection. Neurocomputing 2020, 392, 132–141. [Google Scholar] [CrossRef]
  11. Palash, M.A.H.; Nasim, M.D.; Saha, S.; Afrin, F.; Mallik, R.; Samiappan, S. Bangla Image Caption Generation through CNN-Transformer Based Encoder-Decoder Network. In Proceedings of the International Conference on Fourth Industrial Revolution and Beyond 2021, Dhaka, Bangladesh, 10–11 December 2021; Springer: Singapore, 2021; Volume 437, pp. 631–644. [Google Scholar]
  12. Alam, M.D.S.; Rahman, M.D.S.; Hosen, M.D.I.; Mubin, K.A.; Hossen, S.; Mridha, M.F. Bahdanau Attention Based Bengali Image Caption Generation. In Proceedings of the 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand, 23–25 March 2022; pp. 1073–1077. [Google Scholar] [CrossRef]
  13. Mishra, S.K.; Dhir, R.; Saha, S.; Bhattacharyya, P. A Hindi image caption generation framework using deep learning. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 1–19. [Google Scholar] [CrossRef]
  14. Wadhwa, V.; Gupta, B.; Gupta, S. AI Based Automated Image Caption Tool Implementation for Visually Impaired. In Proceedings of the 2021 International Conference on Industrial Electronics Research and Applications (ICIERA), New Delhi, India, 22–24 December 2021; pp. 1–6. [Google Scholar] [CrossRef]
  15. Kulkarni, C.; Monika, P.; Preeti, B.; Shruthi, S. A novel framework for automatic caption and audio generation. Mater. Today Proc. 2022, 65, 3248–3252. [Google Scholar] [CrossRef]
  16. Yang, D.; Chen, H.; Hou, X.; Ge, T.; Jiang, Y.; Jin, Q. Visual captioning at will: Describing images and videos guided by a few stylized sentences. In Proceedings of the 31st ACM International Conference on Multimedia, New York, NY, USA, 29 October–3 November 2023; pp. 5705–5715. [Google Scholar] [CrossRef]
  17. Du, S.; Zhu, H.; Zhang, Y.; Wang, D.; Shi, J.; Xing, N.; Lin, G.; Zhou, H. Controllable Image Captioning with Feature Refinement and Multilayer Fusion. Appl. Sci. 2023, 13, 5020. [Google Scholar] [CrossRef]
  18. Li, Y.; Zhang, X.; Zhang, T.; Wang, G.; Wang, X.; Li, S. A patch-level region-aware module with a multi-label framework for remote sensing image captioning. Remote Sens. 2024, 16, 3987. [Google Scholar] [CrossRef]
  19. Peng, R.; He, H.; Wei, Y.; Wen, Y.; Hu, D. Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 3963–3973. [Google Scholar]
  20. Hofmann, T. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 30 July–1 August 1999; pp. 289–296. [Google Scholar]
  21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  22. Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13, 556–562. [Google Scholar]
  23. English, J.A.; Kossarian, M.M.; McManis, C.E.; Smith, D.A. Phenomenological Semantic Distance from Latent Dirichlet Allocations (LDA) Classification. U.S. Patent 10,242,002, 1 February 2018. [Google Scholar]
  24. Yu, D.; Xiang, B. Discovering topics and trends in the field of Artificial Intelligence: Using LDA topic modeling. Expert Syst. Appl. 2023, 225, 120114. [Google Scholar] [CrossRef]
  25. Inoue, M.; Fukahori, H.; Matsubara, M.; Yoshinaga, N.; Tohira, H. Latent Dirichlet allocation topic modeling of free-text responses exploring the negative impact of the early COVID-19 pandemic on research in nursing. Jpn. J. Nurs. Sci. 2023, 20, e12520. [Google Scholar] [CrossRef]
  26. Abdalgader, K.; Matroud, A.A.; Al-Doboni, G. Temporal Dynamics in Short Text Classification: Enhancing Semantic Understanding Through Time-Aware Model. Information 2025, 16, 214. [Google Scholar] [CrossRef]
  27. Wu, L.; Xu, M.; Qian, S.; Cui, J. Image to Modern Chinese Poetry Creation via a Constrained Topic-aware Model. ACM Trans. Multimedia Comput. Commun. Appl. 2020, 16, 1–21. [Google Scholar] [CrossRef]
  28. Wang, B.; Zheng, X.; Qu, B.; Lu, X. Retrieval topic recurrent memory network for remote sensing image captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 13, 256–270. [Google Scholar] [CrossRef]
  29. Li, C.Y.; Chun, S.A.; Geller, J. Perspective-based Microblog Summarization. Information 2025, 16, 285. [Google Scholar] [CrossRef]
  30. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Proc. NIPS 2015, 28, 91–99. [Google Scholar] [CrossRef]
  31. Liu, X.; Xu, Q.; Wang, N. A survey on deep neural network-based image captioning. Vis. Comput. 2019, 35, 445–470. [Google Scholar] [CrossRef]
  32. Zhu, Y.; Lu, S.; Zheng, L.; Guo, J.; Zhang, W.; Wang, J.; Yu, Y. Texygen: A benchmarking platform for text generation models. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 1097–1100. [Google Scholar] [CrossRef]
  33. Zhao, W.; Wu, X.; Luo, J. Cross-domain image captioning via cross-modal retrieval and model adaptation. IEEE Trans. Image Process 2020, 30, 1180–1192. [Google Scholar] [CrossRef]
Figure 1. System flowchart. The Faster R-CNN extracts image objects and attributes, the topic model extracts specific topics from images and captions, and the LSTM model generates the topic-guided captions.
Figure 2. The architecture of the Faster R-CNN. The model adopts five convolutional layers to extract objects and attributes. The Region Proposal Network (RPN) generates candidate object regions, which are classified and refined to extract precise attributes.
Figure 3. Architecture of the topic neural network. The NMF model is applied to extract textual topics and guide the ResNet model in learning image topics. Finally, the text and image topic embeddings are integrated to form the topic representation.
Table 1. An example of (a) the MSCOCO dataset and (b) the Flickr30k dataset in the image caption task.

(a) MSCOCO dataset | (b) Flickr30k dataset
a picture of an airplane parked at the loading docks | Several men in hard hats are operating a giant pulley system.
a jet airplane on the tarmac at the airport | Workers look down from up above on a piece of equipment.
a plane on the runway at the airport | Two men working on a machine wearing hard hats.
a jumbo jet american airlines plane sitting in a waiting area. | Four men on top of a tall structure.
a large jet liner sitting on top of an airport runway. | Three men on a large rig.
Table 2. Comparison of different topic models on the MSCOCO dataset (BLEU).

Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
c-RNN [4] | 0.612 | 0.449 | 0.323 | 0.237
Khamparia et al. [6] | 0.852 | - | - | -
VGGNet + LSTM [16] | 0.510 | 0.322 | 0.207 | 0.136
The proposed approach | 0.638 | 0.507 | 0.405 | 0.304

Bold values indicate the best-performing model for each metric.
Table 3. Comparison of different topic models on the Flickr30k dataset (BLEU).

Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
c-RNN [4] | 0.604 | 0.431 | 0.304 | 0.215
Du et al. [17] | - | - | - | 0.146
Zhao et al. [33] | 0.690 | 0.493 | 0.347 | 0.241
The proposed approach | 0.643 | 0.435 | 0.277 | 0.180

Bold values indicate the best-performing model for each metric.
Table 4. Comparison of different topic models on the MSCOCO dataset (METEOR, Rouge-L, and CIDEr).

Model | METEOR | Rouge-L | CIDEr
c-RNN [4] | 0.247 | - | -
Khamparia et al. [6] | - | 0.863 | 0.493
VGGNet + LSTM [16] | 0.170 | - | 0.654
The proposed approach | 0.223 | 0.484 | 0.827

Bold values indicate the best-performing model for each metric.
Table 5. Comparison of different topic models on the Flickr30k dataset (METEOR, Rouge-L, and CIDEr).

Model | METEOR | Rouge-L | CIDEr
c-RNN [4] | 0.190 | - | -
Du et al. [17] | 0.190 | 0.521 | 1.169
Zhao et al. [33] | 0.203 | 0.465 | 0.528
The proposed approach | 0.374 | 0.311 | 0.676

Bold values indicate the best-performing model for each metric.
Table 6. Qualitative comparison of generated captions.

ResNet + Oe + ATTLSTM + SCST (ours) | c-RNN [4]
穿白襯衫的男人拿著熱狗堡及漢堡 (A man in a white shirt holding a hot dog and a hamburger) | 一個男人拿著食物 (A man holding food)
穿白襯衫的男人雙手各拿著食物 (A man in a white shirt holding food in each hand) | 一個男人拿著三明治 (A man holding a sandwich)
穿黑衣的男人拿著冰淇淋和飲料 (A man in black clothes holding ice cream and a drink) | 一個男人拿著啤酒 (A man holding a beer)
穿黃衣的男子站在冰箱前面 (A man in yellow clothes standing in front of a fridge) | 一個穿著白衣服的男子拿著食物 (A man in white clothes holding food)
冰箱上有一個紫色的背包 (A purple backpack is on top of the fridge) | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

