Bi-LS-AttM: A Bidirectional LSTM and Attention Mechanism Model for Improving Image Captioning

Abstract: The discipline of automatic image captioning integrates two pivotal branches of artificial intelligence, namely computer vision (CV) and natural language processing (NLP). Its principal function is to transform extracted visual features into higher-order semantic information. The bidirectional long short-term memory (Bi-LSTM) has gained wide acceptance for image captioning tasks. Recent research has focused on modifying suitable models to produce novel and precise captions, although tuning the parameters of a model does not always yield optimal outcomes. Given this, the current research proposes a model that combines the bidirectional LSTM and an attention mechanism (Bi-LS-AttM) for image captioning. The model exploits contextual information from both the preceding and following parts of the input, together with the attention mechanism, thereby improving the precision of visual language interpretation. The distinctiveness of this research lies in combining Bi-LSTM and the attention mechanism to generate sentences that are both structurally novel and accurately reflective of the image content. To improve time efficiency and accuracy, this study substitutes convolutional neural networks (CNNs) with fast region-based convolutional networks (Fast RCNNs). Additionally, it refines the generation and evaluation of the common space, further improving efficiency. Our model was tested on the Flickr30k and MSCOCO (80 object categories) datasets. Comparative analyses of performance metrics reveal that our model, leveraging the Bi-LS-AttM, surpasses unidirectional and Bi-LSTM models. When applied to caption generation and image-sentence retrieval tasks, our model achieves time savings of approximately 36.5% and 26.3% compared with the Bi-LSTM model and the deep Bi-LSTM model, respectively.


Introduction
Image captioning is a hot topic involving several fields such as computer vision (CV) and natural language processing (NLP), and is also known as image semantic description or "talking about pictures" [1][2][3][4][5][6]. Image captioning technology needs not only to recognize the objects in an image and the relationships between them but also to learn how to integrate them into reasonable sentence descriptions. Traditional methods accomplish image captioning with models based on visual space search, sentence templates, or retrieval of the best-matching sentence in the dataset. The disadvantages of these methods are their low efficiency in generating realistic and accurate sentences and their poor ability to generate structurally novel sentences. In recent research [7][8][9][10][11], visual and language information was first embedded into a common space via recurrent neural networks (RNNs). Convolutional neural networks (CNNs) were then embedded within the visual space and combined with long short-term memory (LSTM) to produce more effective results.
Most models extract image features by embedding the CNN into visual space. While this method can achieve good results, the feature extraction is neither highly accurate nor efficient and consumes considerable time. Many models embed LSTM and Bi-LSTM into language space to generate sentences, but the results are not accurate enough. It is therefore challenging for captioning models to generate novel captions while performing accurate and efficient image-sentence retrieval.
To address these issues, we propose a model leveraging a bidirectional LSTM coupled with an attention mechanism (Bi-LS-AttM). This model substitutes the region convolutional neural network (RCNN), commonly used for feature extraction, with a more efficient fast region convolutional neural network (Fast RCNN). This adjustment enhances the identification and extraction of features within the image's regions of interest (RoIs). The optimized model is then applied to refine the LSTM network's performance. By comparing forward and backward outcomes and incorporating attention, the Bi-LS-AttM is able to predict word vectors with greater precision and generate more fitting image captions.
Why do we use this model? We employ it to break through the limits of the traditional Bi-LSTM model, which does not focus enough on comparing historical and future word results. In traditional LSTM cells, the next word $x_t$ is predicted from the visual context $V$ and the historical context $x_{1:t-1}$ by estimating $\log P(x_t \mid V, x_{1:t-1})$. In the Bi-LS-AttM, however, the prediction of the word $x_t$ depends on the forward and backward results of separately maximizing $\log P(x_t \mid V, x_{1:t-1})$ and $\log P(x_t \mid V, x_{t+1:T})$ at time $t$. By combining the Bi-LSTM with the attention model, the model focuses more strongly on comparing historical and future word results and uses their dependencies to predict and generate appropriate image captions. Figure 1 shows an example image for which the Bi-LS-AttM model generates a sentence, supporting our hypothesis that the Bi-LS-AttM model can generate more complementary and focused captions.
Appl. Sci. 2023, 13, x FOR PEER REVIEW
We tested the efficiency of our model on the Flickr30K and MSCOCO datasets and performed a qualitative analysis. The analysis showed that the method performs efficiently and that the proposed Bi-LS-AttM model outperforms other published models. The principal contributions of this paper are threefold:

• We proposed a trainable model incorporating a bidirectional LSTM and attention mechanism. This model embeds image captions and scores into a region by capitalizing on the long-term forward and backward context.
• We upgraded the feature extraction mechanism, replacing the conventional CNN and RCNN with a Fast RCNN. This improvement enhances the model's ability to rapidly detect and extract features from items within an image's regions of interest.
• We verified the efficiency of the framework on two datasets, Flickr30K and MSCOCO. The evaluation demonstrated that the Bi-LSTM and attention mechanism model achieved highly competitive performance relative to current techniques in the tasks of caption generation and image-sentence retrieval.

Related Works
Initially, researchers used computers to analyze identified content in images, which was the original task of image recognition [12][13][14]. Later, they introduced additional requirements such as processing and determining object properties, identifying object relationships, and describing image content in natural language. Since then, numerous image captioning techniques have been introduced, broadly categorized into three groups: template-based, retrieval-based, and deep-learning-based methods.
Template-based methods, which use fixed templates for sentence generation, identify image elements such as objects, actions, and scenes based on visual dependency grammar. For instance, Farhadi [15] used a support vector machine (SVM) [16,17] to detect image items and pre-established templates for sentence descriptions. However, the limitations of datasets and template algorithms impeded their performance. Similarly, in 2011, Li [18] employed Web-scale N-grams to extract phrases linked to objects, actions, and relationships. Later, Kulkarni [19] used a conditional random field (CRF) [20,21] to extract data from a large pool of visual descriptive text, thereby improving computer vision recognition and sentence generation. Despite these efforts, the performance of these methods was suboptimal due to the inherent restrictions of template-based approaches.
The retrieval-based method stores all image descriptions in a collection. The image to be described is then compared with the training set and filtered to find similar images. The description of a similar image is then used as a candidate and modified accordingly. Kuznetsova [22] proposed searching for images with attached titles on the Internet and obtaining expressive phrases as tree fragments from the test images. New descriptions are then composed by filtering and merging the extracted tree fragments. Mason [23] proposed a nonparametric density estimation (NDE) technique that estimates the frequency of visual content words for the image to be described and transforms caption generation into an extractive summarization problem. Sun [24] proposed an automatic concept recognition method that uses parallel text and visual corpora. It filters text terms by matching the visual characteristics of similar images in the image library against the image to be described. Retrieval-based methods can produce more natural-sounding language, but their heavy reliance on the capacity of the database makes it difficult to generate sentences for specific images.
In recent years, with the continued advancement of deep learning, neural networks have been extensively used in image captioning tasks. In 2014, Kiros [25] first used deep neural networks and LSTM to construct two different multimodal neural network models, continuously integrating semantic information to generate words. For the encoding part, they applied an RNN to convert vocabulary into D-dimensional word vectors. The sentence described can be written as a $V \times D$ matrix, where V is the number of words in a sentence and D is the size of the word vector. Finally, they used a decoder consisting of LSTM cells to generate the final caption word by word, combining the image features with the language model sentence by sentence. In subsequent research, Xu et al. [26] incorporated attention mechanisms into the encoder and decoder structures to describe images. By establishing an attention matrix, the model can automatically focus on different areas when predicting different words at different times, which enhances its descriptive effectiveness. Bo [27] used generative adversarial networks to generate diversified descriptions by controlling random noise vectors.
In contemporary research, Ayoub [28] deployed the Bahdanau attention mechanism and transfer learning techniques for image caption generation. They incorporated a pretrained image feature extractor alongside the attention mechanism, thus improving captioning quality and precision. Muhammad [29] proposed a model blending the attention mechanism and object features for image captioning, enhancing the model's ability to leverage object features extracted from images. Chun [30] demonstrated a deep learning approach for image captioning that combined a CNN for image feature extraction and an RNN for caption generation, enhanced by the attention mechanism. This method facilitated the automated creation of comprehensive bridge damage descriptions. Lastly, Wu [31] addressed the challenge of describing novel objects through a switchable attention mechanism and multimodal fusion approach, resulting in the generation of accurate and meaningful descriptions.
Wang [8] used a Bi-LSTM model to perform image captioning and, building on it, developed a deep Bi-LSTM model that achieved good results. Fazin [9] simplified Wang's model by reducing the number of parameters and improving the efficiency of the model. Unlike the above models, the mapping relationship between vision and language in our Bi-LS-AttM model is reverse-crossed, and the forward and backward attention of visual language are dynamically learned. As shown in Section 4, this has proven to be highly beneficial for image captioning and image-sentence retrieval.

Methodology
This section outlines our proposed model, an enhanced version of the deep Bi-LSTM model for image captioning as proposed by Wang [8] and Fazin [9]. In our design, we replace the RCNN used in the models of Wang and Fazin with Fast RCNN to expedite the feature extraction process. Furthermore, we substitute the Bi-LSTM with the Bi-LS-AttM, which is our main contribution in this study.
Our model framework comprises three components: a Fast RCNN for detecting objects within images; a Bi-LSTM paired with an attention mechanism to provide an attentional representation for each word; and a common space to compile all sentences and their respective final scores. The specifics of each module are elaborated in the subsequent sections.

Detect Object by Fast RCNN
In this section, we adopt the method proposed by Girshick [32] for feature extraction and recognition. The selective search algorithm is used to extract candidate regions from the input image. These regions are then mapped to the final convolutional feature layer based on their spatial positional relationship.
For each candidate region on the convolutional feature layer, RoI pooling is performed to obtain fixed-dimensional features. These extracted features are then fed into the fully connected layers for the subsequent classification and regression tasks.
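To make the "fixed-dimensional features" step concrete, the following is a minimal sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the RoI format `(top, left, bottom, right)`, and the 2 x 2 output grid are assumptions chosen for illustration; they are not taken from the paper's implementation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map into a fixed out_h x out_w grid.

    feature_map: list of rows (H x W) of numbers, one channel only.
    roi: (top, left, bottom, right) in feature-map cells; bottom/right exclusive.
    Both the RoI format and the default grid size are illustrative assumptions.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Bin start uses floor, bin end uses ceil, so each bin has >= 1 cell.
        r0 = top + (i * height) // out_h
        r1 = max(top + ((i + 1) * height + out_h - 1) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = max(left + ((j + 1) * width + out_w - 1) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Regions of different sizes map to the same output shape.
fm = [[4 * r + c for c in range(4)] for r in range(4)]
full = roi_max_pool(fm, (0, 0, 4, 4))
part = roi_max_pool(fm, (1, 1, 4, 4))
```

Whatever the size of the candidate region, the output grid has the same shape, which is what allows the features to be passed to fully connected layers.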
Fast RCNN outputs the probability of each category for each candidate region, as well as the position of each candidate box computed through regression. For each candidate region, the following loss function is calculated:

$$L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda\,[u \geq 1]\,L_{\mathrm{loc}}(t^{u}, v)$$

where $p$ is the probability of each category for the candidate region and $u$ is the ground truth category; $t^{u}$ is the predicted position for category $u$, and $v$ is the ground truth position for the candidate region. $L_{\mathrm{cls}}(p, u) = -\log p_u$ is the classification log loss, $L_{\mathrm{loc}}$ is the bounding-box regression loss, which is applied only to non-background categories ($u \geq 1$), and $\lambda$ balances the two terms [32]. Compared with the previous RCNN, Fast RCNN improves the computation speed, saving the time and cost of object detection. Fast RCNN combines classification and regression into a common network, enabling consistent training. In particular, its main enhancement over RCNN is that it eliminates the separate SVM classifiers and bounding-box regressors, which greatly improves speed.
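As a rough illustration of the per-region loss described above, the sketch below follows the multi-task form given by Girshick [32]: a log loss on the true category plus a smooth L1 box-regression term that is skipped for background regions. The function names, the convention that category 0 is background, and the default weight `lam=1.0` are assumptions made for this example.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """Multi-task loss for one candidate region.

    p:   list of category probabilities (index 0 assumed to be background).
    u:   ground truth category index.
    t_u: predicted box offsets for category u.
    v:   ground truth box offsets.
    lam: weight of the localization term (assumed default of 1.0).
    """
    cls_loss = -math.log(p[u])
    # The localization term only applies to non-background regions (u >= 1).
    loc_loss = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v)) if u >= 1 else 0.0
    return cls_loss + lam * loc_loss


example = fast_rcnn_loss([0.1, 0.7, 0.2], 1, [0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```

Because both terms are computed from one forward pass, classification and box regression can be trained jointly in a single network.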

Long Short-Term Memory
The LSTM cells form the basis of this work. They are a special form of RNN able to memorize long-term dependencies. Figure 2 shows that an LSTM cell is made up of four important components: a memory cell g and three gates (i is the input gate, f is the forget gate, and o is the output gate) [9].
In the formulas below, $f(t)$, $i(t)$, and $o(t)$ are the values of the forget, input, and output gates at time $t$, respectively, and $a(t)$ is the intermediate feature extracted from $h_{t-1}$ and $x_t$ at time $t$:

$$f(t) = \sigma(W_f h_{t-1} + U_f x_t + b_f)$$
$$i(t) = \sigma(W_i h_{t-1} + U_i x_t + b_i)$$
$$a(t) = \tanh(W_a h_{t-1} + U_a x_t + b_a)$$
$$o(t) = \sigma(W_o h_{t-1} + U_o x_t + b_o)$$

where $x_t$ is the input, $h_{t-1}$ is the hidden state value at time $t-1$, $\sigma$ is the sigmoid function, and $W$, $U$, and $b$ denote the weight matrices and bias vectors of each gate. The results of the forget and input gates act on the cell state, as expressed in the formula below:

$$c(t) = c(t-1) \odot f(t) + i(t) \odot a(t)$$

where $\odot$ represents the Hadamard product. Finally, the hidden state $h(t)$ at time $t$ is obtained by multiplying the output gate $o(t)$ and the current cell state $c(t)$ using the Hadamard product:

$$h(t) = o(t) \odot \tanh(c(t))$$

The following equation uses the parameters $W_s$ and $b_s$ to predict the next word:

$$F(p_{ti}; W_s, b_s) = \frac{\exp(W_s h_{ti} + b_s)}{\sum_{j=1}^{K} \exp(W_s h_{tj} + b_s)}$$

where $p_{ti}$ is the probability of the predicted word and $K$ is the number of candidate words.
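The gate updates described in this section can be traced with a small, dependency-free sketch of one LSTM time step. It uses the standard LSTM formulation; the parameter layout (a dictionary with one `(W, U, b)` triple per gate) and the helper names are assumptions for illustration and do not reflect the authors' code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, h_prev, x):
    """Return W.h_prev + U.x + b as a list, one entry per hidden unit."""
    return [
        sum(w * h for w, h in zip(W[k], h_prev))
        + sum(u * xi for u, xi in zip(U[k], x))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(params, h_prev, c_prev, x):
    """One LSTM step. params maps 'f', 'i', 'o', 'a' to (W, U, b) triples."""
    f = [sigmoid(z) for z in affine(*params["f"], h_prev, x)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], h_prev, x)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], h_prev, x)]    # output gate
    a = [math.tanh(z) for z in affine(*params["a"], h_prev, x)]  # candidate features
    # Cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * ak for fk, ck, ik, ak in zip(f, c_prev, i, a)]
    # Hidden state: output gate applied to the squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step; this makes a convenient sanity check.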

Bi-LSTM
Both RNN and LSTM units leverage past temporal information to predict forthcoming outputs. However, in some instances, the desired output is associated not just with the previous state but also with the future state. For instance, predicting a missing word within a text requires comprehension of both the preceding and succeeding context. This dual-directional context analysis provides a more comprehensive and accurate interpretation of the context on which the decision is based.
In the traditional LSTM, the word $x_t$ is predicted from the visual context $V$ and the historical context $x_{1:t-1}$ by estimating $\log P(x_t \mid V, x_{1:t-1})$. In the Bi-LSTM with attention, however, the prediction of the word $x_t$ depends on the forward and backward results of separately maximizing $\log P(x_t \mid V, x_{1:t-1})$ and $\log P(x_t \mid V, x_{t+1:T})$ at time $t$.
In the Bi-LSTM cell structure, the input sequence is processed in both forward and backward directions by two distinct LSTM cells to extract features. As illustrated in Figure 3, the output vectors generated are combined to form the final word representation. The core concept behind the Bi-LSTM cell is to capture features at any given time point that encompass information from both preceding and succeeding time steps. It is worth noting that the two LSTM units within the Bi-LSTM cell operate with independent parameters while sharing a common word-embedding vector space.
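The bidirectional processing described above can be sketched independently of the cell type: one recurrent step function is run from the first input to the last, a second one from the last to the first, and the two outputs for each position are concatenated. The helper names and the toy step functions below are assumptions for illustration; in the model itself each direction would be an LSTM cell with its own parameters.

```python
def run_direction(step, init_state, inputs):
    """Apply a recurrent step over inputs and collect the state after each one."""
    state, outputs = init_state, []
    for x in inputs:
        state = step(state, x)
        outputs.append(state)
    return outputs


def bidirectional(step_fwd, step_bwd, init_state, inputs):
    """Concatenate forward and backward outputs position by position."""
    fwd = run_direction(step_fwd, init_state, inputs)
    # The backward pass reads the sequence from t = T down to t = 1;
    # its outputs are reversed again so that index t lines up with fwd[t].
    bwd = run_direction(step_bwd, init_state, inputs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation


# Toy recurrent cells with independent "parameters" (0.5 and 0.25).
fwd_cell = lambda s, x: [0.5 * s[0] + x]
bwd_cell = lambda s, x: [0.25 * s[0] + x]
features = bidirectional(fwd_cell, bwd_cell, [0.0], [1.0, 2.0, 4.0])
```

Each entry of `features` then depends on the whole sequence: the first half on the inputs up to that position, the second half on the inputs from that position onward.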

Architecture Model
The general layout of the model is illustrated in Figure 4. It is mainly composed of three modules: a Fast RCNN for encoding the image input, the Bi-LS-AttM for encoding the sentence input, and a common space into which the image and caption are embedded and decoded into image captions and evaluation scores.
The Bi-LS-AttM generates word vectors by comparing similarity using the context information from the forward and backward directions. More accurate words are selected after passing through attention. In our work, the model calculates the forward hidden vector $\overrightarrow{h}$ and the backward hidden vector $\overleftarrow{h}$. The forward cell starts at $t = 1$, while the backward cell starts at $t = T$. For an initial input image $I$, the Fast RCNN and the Bi-LSTM perform the encoding, each with its corresponding weight coefficients, and the learned forward and backward vectors $\overrightarrow{h}$ and $\overleftarrow{h}$ are input into attention. The bilinear scoring procedure is applied to calculate the correlation between the query $q$ and $\overrightarrow{h}$ and $\overleftarrow{h}$. Next, a SoftMax is applied to these scores to normalize them and obtain the attention distribution $a = [a_1, a_2, \ldots, a_T]$. The bilinear scoring function and SoftMax are defined as follows:

$$s(h_i, q) = h_i^{\top} W q$$
$$a_i = \mathrm{softmax}(s(h_i, q)) = \frac{\exp(s(h_i, q))}{\sum_{j=1}^{T} \exp(s(h_j, q))}$$

where $W$ is a trainable parameter matrix and $s$ is the bilinear scoring function.
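The scoring and normalization steps of the attention module can be sketched as follows. The bilinear score and the SoftMax follow the description above; the final weighted sum of hidden states (`context`) is an assumption about how the attention distribution is consumed, and all function names are hypothetical.

```python
import math


def bilinear_score(h, W, q):
    """Bilinear score s(h, q) = h^T W q for one hidden vector h."""
    Wq = [sum(w * qj for w, qj in zip(row, q)) for row in W]
    return sum(hi * wi for hi, wi in zip(h, Wq))


def softmax(scores):
    """Numerically stable SoftMax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, W, q):
    """Return the attention distribution and an (assumed) weighted context."""
    a = softmax([bilinear_score(h, W, q) for h in hidden_states])
    dim = len(hidden_states[0])
    context = [sum(a_t * h[k] for a_t, h in zip(a, hidden_states))
               for k in range(dim)]
    return a, context


W = [[1.0, 0.0], [0.0, 1.0]]  # trainable in practice; identity here
a, context = attend([[2.0, 0.0], [0.0, 1.0]], W, [1.0, 0.0])
```

Hidden states that correlate more strongly with the query receive larger weights, and the weights always sum to one.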

After training, the model can predict the word $x_t$ given the image context $V$ and the word context, either in the forward direction using $P(x_t \mid x_{1:t-1}, V)$ or in the backward direction using $P(x_t \mid x_{t+1:T}, V)$. For both the forward and backward directions, $x_1 = x_T = 0$ is set at the starting point. Finally, for the sentences generated from both directions, the final sentence for the given image, $P(x_{1:T} \mid V)$, is determined by the sum of all words' probabilities in the caption. The Bi-LSTM module and its training parameters are similar to those presented in Wang [8]. The difference is that an attention mechanism is added, so that the model can focus more on comparing the forward and backward context information to obtain the attention distribution. When extracting features, the Fast RCNN is more efficient and saves time.

Dataset
Experiments were performed to validate the effectiveness, generality, and robustness of the model compared to other methods on two datasets, Flickr30K [33] and MSCOCO [34].
Flickr30K. This is an extended version of Flickr8K. The dataset can be accessed via the following link: http://shannon.cs.illinois.edu/DenotationGraph (accessed 2 May 2023). It contains 31,783 images, each with 5 captions. The dataset does not explicitly categorize the images into different types or categories. We followed the dataset partitioning proposed by Karpathy [4]. In this split of the dataset, 29,000/1000/1000 images were used for training, validation, and testing, respectively.
MSCOCO. The dataset can be accessed via the following link: https://cocodataset.org (accessed 2 May 2023). This dataset, published several years ago, includes 82,783 training, 40,504 validation, and 40,775 test images. The dataset contains 80 different object categories. Five sentences are annotated for each image. The focus is on describing all important parts of the scene rather than unimportant details. In the absence of a standard split, we follow the split of Wang et al. [8], which uses 80,000 images for training and 5000 for validation and testing.

Evaluation Metrics
The evaluation criteria are borrowed from machine translation: the generated sentences are matched against human descriptions to obtain a similarity score that measures the accuracy of the task.
For caption generation, we continue previous work and use the BLEU-N score [35], in which d represents the candidate description, which is compared with the reference description; b is the penalty function; k represents the probability of selecting a specific caption; and p represents the accuracy measurement function. We also compare results on METEOR [36] and CIDEr [37]: METEOR can overcome the inherent deficiencies of the BLEU standard, while CIDEr computes the closeness of the reference and generated descriptions as the evaluation standard.
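For readers unfamiliar with BLEU-N, the sketch below computes a sentence-level score in its commonly used form: clipped n-gram precisions (the role of p above), combined with uniform weights and multiplied by a brevity penalty (the role of b). It is a simplified illustration with no smoothing, not the evaluation script used in the experiments; the function names and the tie-breaking rule for the reference length are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-N with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "a man rides a horse".split()
score = bleu("a man rides".split(), [reference], max_n=2)
```

In the usage line, every unigram and bigram of the short candidate appears in the reference, so the score is determined entirely by the brevity penalty.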
In image-sentence retrieval, we use R@K and Medr as assessment scores. R@K is the recall rate of the ground truth among the top K retrieved results. Medr is the median rank of the first retrieved ground-truth image or caption.
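Both retrieval scores can be computed from one list that holds, for each query, the 1-based rank at which the first ground-truth item was retrieved. The sketch below assumes that list has already been produced by the retrieval step; the function names and the example ranks are made up for illustration.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose first ground-truth hit is within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the first ground-truth hit (lower is better)."""
    return median(ranks)


# Hypothetical ranks for five queries.
ranks = [1, 3, 12, 2, 7]
summary = {
    "R@1": recall_at_k(ranks, 1),
    "R@5": recall_at_k(ranks, 5),
    "R@10": recall_at_k(ranks, 10),
    "Medr": median_rank(ranks),
}
```

Higher R@K and lower Medr both indicate better retrieval.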

Implementation Details
During the image encoding process, we use the VggNet model [38] for pre-training and employ Fast RCNN to obtain the features from the final fully connected layer. This allows Fast RCNN to share features and parameters in the feature extraction and RoI pooling stages, thereby enhancing processing efficiency. The Bi-LS-AttM is deployed for training the language module. In addition, we selected the widely used and enhanced VggNet [38] and GoogleNet [39] models for our experiments. We tested our model on two datasets specifically designed for image captioning: Flickr30k and MSCOCO.
The server hardware configuration was as follows: Intel(R) Core(TM) i5-6200U 2.30 GHz CPU, NVIDIA RTX3080Ti GPU, and Win10 OS. The software versions required are Python 3.9, Torch 1.13.1, Scipy 1.2.1, H5py, and Tqdm. The parameters set for the models are shown in Tables 1-3. All words in the caption are taken from the vector used to generate the caption. Words appearing fewer than five times in the training set are marked and removed. This yields vocabularies of 7200 and 8600 words for the Flickr30K and MSCOCO datasets, respectively. Additionally, 048 Bi-LSTM hidden units are used, and the initialization range of the weight coefficients is set to [−0.06, 0.06].
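The vocabulary filtering and weight initialization described above can be sketched with the standard library alone. The threshold of five occurrences and the range [−0.06, 0.06] come from the text; the `<unk>` placeholder for removed words, whitespace tokenization, lowercasing, and all function names are assumptions for illustration.

```python
import random
from collections import Counter


def build_vocab(captions, min_count=5, unk="<unk>"):
    """Keep words seen at least min_count times; others map to unk (assumed)."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate([unk] + kept)}


def encode(caption, vocab, unk="<unk>"):
    """Map a caption to word ids, replacing rare words with the unk id."""
    return [vocab.get(w, vocab[unk]) for w in caption.lower().split()]


def init_weights(rows, cols, scale=0.06, seed=0):
    """Uniform initialization in [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


captions = ["a dog runs"] * 5 + ["a cat sleeps"] * 2
vocab = build_vocab(captions)
ids = encode("a cat runs", vocab)
embedding = init_weights(len(vocab), 8)
```

In this toy corpus, "cat" and "sleeps" occur only twice, so they fall below the threshold and are encoded with the placeholder id.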

Experimental Results on the Generated Image Caption
Our image captioning model's efficacy was evaluated through comparative experiments using the BLEU-N metric, with the results shown in Table 4. The additional attention layer implemented within our model contributed significantly to its strong performance on both evaluated datasets. Substituting AlexNet with VggNet [40] resulted in substantial performance improvements across all BLEU metrics. Our model ranks predominantly within the top two positions across these metrics. While our model lags marginally behind the top-rated Hard-attention [24] model in the B-1 metric, it surpasses the performance of both the Bi-LSTM and deep Bi-LSTM models in all other assessed metrics.
Figure 5 illustrates the comparison of our model with others on the METEOR and CIDEr metrics. We compared our model with the Bi-LSTM [9] and deep Bi-LSTM [8] models. As shown in the figure, the Bi-LS-AttM outperformed these leading-edge methods on both metrics. On the Flickr30K dataset, we improved the METEOR and CIDEr scores by about 8.0% and 12.5%, respectively. On the MSCOCO dataset, our model improved the METEOR and CIDEr scores by about 6.8% and 15.6%, respectively. We speculate that it may give better results on larger datasets.
We performed a comparative evaluation of our model against other advanced models on the METEOR and CIDEr metrics, as shown in Table 5. This analysis reveals that our Bi-LS-AttM model is highly competitive on the MSCOCO Karpathy split. On the METEOR score, our model trails slightly behind Wu's model, potentially due to their superior parameter optimization techniques and the dataset's compatibility with their switchable novel object captioner. This calls for further investigations into novel datasets. However, our model excels in the CIDEr score. A considerable score variation is evident when comparing the performance of our model with and without the attention mechanism.
Table 5. Comparison of the METEOR and CIDEr scores between our and the state-of-the-art models. The results shown in bold type are the best values.
Similarly, we compared the baseline model, which lacks an attention mechanism, with our proposed model across several evaluation metrics. Table 6 reports the performance of the baseline model relative to ours. Our model consistently surpasses the baseline on both the MSCOCO and Flickr30k test sets. The line graph in Figure 6 further illustrates the advantage of our model across the evaluation metrics. Notably, the advantage is more pronounced on the MSCOCO test set. We attribute this observation to the larger scale of the MSCOCO test set, which enables a more comprehensive evaluation and, subsequently, yields higher performance scores.


Experimental Results on Image-Sentence Retrieval
In assessing image-sentence retrieval, we primarily focused on retrieval scores. Table 7 presents the R@K and Medr scores from our model's image-sentence retrieval across the datasets. Generally, our model surpasses previous methodologies on most metrics, with a particularly strong performance on the MSCOCO dataset. Notably, the Bi-LS-AttM outstrips advanced models in both image-to-sentence and sentence-to-image retrieval tasks. However, some metrics reveal suboptimal performance; for instance, the Mind's Eye model [42], which effectively integrates image and text features, outperforms our model on the Flickr30K dataset. We posit that incorporating an adaptive attention mechanism could enhance efficiency in image-sentence retrieval tasks, a hypothesis we aim to investigate in future research.
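As an illustration of how the R@K and Medr scores discussed above are typically derived, the sketch below computes them from a query-by-candidate similarity matrix. It is not the evaluation code of this study. It assumes, for simplicity, exactly one correct candidate per query (located at the same index as the query), whereas Flickr30K and MSCOCO provide several captions per image; the function names and the toy similarity values are hypothetical.

```python
from statistics import median


def rank_of_match(scores, correct_index):
    """1-based rank of the correct candidate when sorted by descending score."""
    target = scores[correct_index]
    return 1 + sum(1 for s in scores if s > target)


def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Return (R@K in percent for each K, median rank).

    similarity[i][j] is the shared-space score of query i against
    candidate j; the correct candidate for query i is assumed to be j == i.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(similarity)]
    recall = {k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)


if __name__ == "__main__":
    # Toy scores for three queries and three candidates (made-up values).
    similarity = [
        [0.9, 0.1, 0.2],
        [0.3, 0.2, 0.8],
        [0.1, 0.4, 0.7],
    ]
    recall, medr = retrieval_metrics(similarity, ks=(1, 3))
    print(recall, medr)
```

A higher R@K means the correct item appears within the top K results more often, while a lower Medr means the correct item tends to be ranked nearer the top.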
We conducted additional experiments to validate our model's performance in image-sentence retrieval tasks. Figure 7 presents examples from several retrieval experiments on the MSCOCO validation set. For each query, the model retrieves visually congruent images and captions, illustrating its proficiency in discerning the visual-textual association in image-caption ranking. The upper dashed line represents image retrieval based on keywords, while the lower one represents sentence retrieval based on images.

In the sentence-image retrieval task, we generated three images that mirror the keywords and sentences and then selected the matching image based on its similarity. For the image-sentence retrieval task, we produced three appropriate captions grounded on the images and then chose the generated sentences based on their high scores in the shared space. The given examples underscore the efficiency of our model in executing image-sentence retrieval tasks.

Discussion
Effect of Bi-LS-AttM: To gauge the influence of the Bi-LS-AttM, we compared the computational time of the Bi-LS-AttM model with that of the Bi-LSTM and deep Bi-LSTM models for caption creation and image-to-sentence retrieval tasks. Table 8 lists the computational durations of these models for the respective tasks. We randomly selected 20 images from the Flickr30K validation set and assessed each model ten times for caption creation and image-to-sentence retrieval. The table reports the average time across the ten trials, excluding model initialization and training time. The caption generation time covers the extraction of image features, bidirectional caption content sampling, computation of the final caption result, and caption accuracy evaluation. The retrieval time, in turn, covers the computation of the image-to-sentence retrieval score, the image and sentence queries, and sorting in descending order. By employing the Fast RCNN framework and fine-tuning the relevant parameters, our model achieves significant time savings on these tasks. Table 8 shows that our model saves about 36.5% and 26.3% of the time compared to the Bi-LSTM and deep Bi-LSTM models, respectively, in solving image captioning and image-sentence retrieval tasks. These results confirm the efficiency of the Bi-LS-AttM.
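The timing protocol described above (averaging repeated runs and expressing the gain as a percentage of the reference model's time) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmarking code used for Table 8: the helper names `average_runtime` and `time_saving_percent`, the placeholder workload, and the example durations are all hypothetical.

```python
import time
from statistics import mean


def average_runtime(task, trials=10):
    """Run `task` (a zero-argument callable) `trials` times and return the
    mean wall-clock duration in seconds. Model initialization and training
    are assumed to have happened before this function is called."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def time_saving_percent(reference_time, our_time):
    """Relative time saved with respect to the reference model, in percent."""
    return 100.0 * (reference_time - our_time) / reference_time


if __name__ == "__main__":
    # Placeholder workload standing in for captioning a batch of images.
    elapsed = average_runtime(lambda: sum(range(100_000)), trials=10)
    print(f"mean runtime: {elapsed:.6f} s")

    # Made-up durations, chosen only to show the arithmetic.
    print(f"saving: {time_saving_percent(2.00, 1.27):.1f}%")
```

Under this definition, a saving of 36.5% means the proposed model needs 63.5% of the reference model's time for the same workload.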
Effect of image caption: We used the Bi-LS-AttM model to generate realistic, accurate, and novel image descriptions. Figure 8 compares the captions generated by the baseline model and by our model on the datasets. We evaluated the generated captions from several perspectives. In some descriptions, the relationships between objects are well expressed (e.g., "A hot dog and a red bottle of drink are on the table"). In this example, the objects "hot dog" and "table" are accurately identified, and the relationship between them is established. In other cases, the image is described accurately and in a novel way (e.g., "A boy dressed in black surfs the sea with a red surfboard."). The baseline model, by contrast, only provides plain descriptive accounts of the images and does not generate novel, expressive sentences. From the perspective of object detection, the object recognized by the baseline model is "bread" rather than the more accurate "hot dog". These results show that our model has good efficiency and can achieve a balance between performance and efficiency.
Examples of failed experiments: Figure 9 shows several anomalies arising from our experimental approach. These inaccuracies primarily originate from the Flickr30K validation set, which we hypothesize may be due to the limited range and diversity of the training data from this source. For instance, our model exhibits imprecision in object identification (e.g., identifying "white clothes" when the man is not clothed). Another example shows an illogical caption suggesting a man cycling on water. We surmise that these limitations could be ameliorated through improvements in our visual feature extraction.
We see these anomalies as potential avenues for further research rather than setbacks. Nonetheless, it is important to emphasize that a substantial number of the remaining cases were accurately represented, as exemplified in Figure 8.

Conclusions
In this study, we have introduced a model that leverages the capabilities of the Bi-LS-AttM approach to generate captions that are precise, inventive, and context-sensitive. This was accomplished by incorporating bidirectional information and an attention mechanism. For the dual purposes of feature extraction and time optimization, we utilized the Fast RCNN. Additionally, to provide a comprehensive understanding of the proposed model's structure, we generated a detailed visualization outlining the word generation process at consecutive timesteps. The model's robustness and stability were thoroughly assessed across various datasets pertinent to image captioning and image-sentence retrieval tasks.
In future work, we intend to explore more specialized domains of image captioning, including remote sensing and medical imaging. We also anticipate broadening the application scope of our model to other captioning tasks, such as video captioning. Furthermore, we plan to explore the integration of multitask learning methodologies to enhance the model's general applicability.

Figure 1. Example captions generated by the model. (a) Caption generation (by the unidirectional model (upper) and by our model (lower)) on Flickr30K. (b) Caption generation (by the unidirectional model (upper) and by our model (lower)) on MSCOCO.

Appl. Sci. 2023, 13, 7916
where FM and B represent Fast RCNN and Bi-LSTM, respectively, and $\theta_m$ and $\theta_n$ are their corresponding weight coefficients. $\overrightarrow{h}$ and $\overleftarrow{h}$ are forward and backward vectors learned from the neural network, respectively. Afterward, the obtained vectors $\overrightarrow{h}$ and $\overleftarrow{h}$ ...

Figure 5. (a) Comparison of METEOR scores of three models on two benchmark datasets; (b) comparison of CIDEr scores of three models on two benchmark datasets.

Table 6. Performance scores of the baseline model and our model across various metrics. The results shown in bold type are the best values.

Figure 7. Example of using our model for image retrieval and caption retrieval on the MSCOCO validation set. (a) Searching for three images using captions. (b) Searching for three captions using images.

Figure 8. Examples of image captioning for the baseline and our model on the datasets. The captions generated by the baseline model are above, while the captions generated by our model are below.

Figure 9. Examples of failed experiments: (a) feature extraction error, (b) image representation error, (c) caption logic error. We mark the extracted features on the image with red boxes and use blue fonts to distinguish errors.

Table 4. Comparison of the BLEU scores of each model on Flickr30K and MSCOCO. The results shown in bold type are the best values.


Table 7. R@K (a high score is good) and Medr (a low score is good) comparison of each model on Flickr30K and MSCOCO. The results shown in bold type are the best values.

Table 8. The time cost of checking 10 images on Flickr30K. The results shown in bold type are the best values.