Metaheuristics Optimization with Deep Learning Enabled Automated Image Captioning System

Abstract: Image captioning is a popular topic in the domains of computer vision and natural language processing (NLP). Recent advancements in deep learning (DL) models have improved the overall performance of image captioning approaches. This study develops a metaheuristic optimization with a deep learning-enabled automated image captioning technique (MODLE-AICT). The proposed MODLE-AICT model focuses on generating effective captions for input images using two processes involving an encoding unit and a decoding unit. Initially, at the encoding part, the salp swarm algorithm (SSA), with a HybridNet model, is utilized to generate an effective input image representation using fixed-length vectors, showing the novelty of the work. Moreover, the decoding part includes a bidirectional gated recurrent unit (BiGRU) model used to generate descriptive sentences. The inclusion of an SSA-based hyperparameter optimizer helps in attaining effective performance. For inspecting the enhanced performance of the MODLE-AICT model, a series of simulations were carried out, and the results were examined under several aspects. The experimental values suggest the superiority of the MODLE-AICT model over recent approaches.


Introduction
Presently, a significant number of images are produced from many sources such as advertisements, the internet, document diagrams, and news articles. These sources yield images that viewers must analyze themselves [1]. Many images do not come with descriptions; however, human beings can mostly understand them without detailed captions. A machine, in contrast, must interpret the image in some form whenever human beings require automated captions from it. Image captioning is considered significant on numerous grounds [2]. For instance, it is utilized for automated image indexing. Image indexing plays a vital role in content-based image retrieval (CBIR) and thus is implemented in numerous areas, including digital libraries, biomedicine, the military, education, web searching, and commerce. Mass media platforms such as Twitter and Facebook could straight away produce descriptions from images [3]. The descriptions might involve places (e.g., beach, cafe), things that are worn, and, most significantly, the activities that are taking place.
Image captioning basically involves natural language processing (NLP) and computer vision. Computer vision helps in recognizing and understanding the situation in an image [4]; NLP transforms this semantic knowledge into a descriptive sentence. Retrieving the semantic content of an image and communicating it in a structure which human beings can understand is extremely complex. The complete image captioning method not only gives information about the objects, but also reveals the connections among them [5]. Image captioning has numerous applications, for example, as an aid for guiding persons with visual disabilities while traveling alone [6]. This is made possible by converting the scene into text and transforming the text into voice messages. Image captioning is also utilized in mass communication for the automatic generation of captions for posted images or to explain videos [7,8]. Moreover, automated image captioning might enhance the Google image search method by converting the image into a caption and then utilizing the keywords for additional related searches [9].
Image realization mostly relies on acquiring image features. The methods utilized for understanding images fall into two categories: deep learning (DL)-related methods and conventional machine learning (ML)-related methods [2]. In conventional ML, handcrafted features, namely the histogram of oriented gradients (HOG), local binary patterns (LBPs), and scale-invariant feature transform (SIFT), and combinations of these features, were broadly utilized. In such methods, features are derived from the input unit [10]. They are passed afterward to a classifier such as a support vector machine (SVM) for object classification. Furthermore, real-world data such as videos and images are complicated and contain diverse semantic interpretations [11]. Conversely, in DL-related methods, features are learned automatically from training data, and they can manage big and varied sets of videos and images. For instance, convolutional neural networks (CNNs) are broadly employed for feature learning, and a classifier such as Softmax can be utilized for categorization. A CNN is usually followed by a recurrent neural network (RNN) for generating captions [12].
This study develops a metaheuristic optimization with a deep learning-enabled automated image captioning technique (MODLE-AICT). The proposed MODLE-AICT model aims to generate effective captions for input images using two processes involving an encoding unit and a decoding unit. At the encoding part, the salp swarm algorithm (SSA) with a HybridNet model is utilized to generate an effective input image representation using fixed-length vectors. Then, the decoding part includes a bidirectional gated recurrent unit (BiGRU) model to generate descriptive sentences. For examining the enhanced performance of the MODLE-AICT model, a series of simulations were carried out, and the results were examined under several aspects.

Prior Image Captioning Techniques
In Zhao et al. [13], a fine-grained, structured attention-based technique was suggested to exploit the structural characteristics of semantic content in high-resolution remote sensing images. The segmentation is jointly trained with captioning in a unified framework with no need for pixel-wise annotations. Hoxha et al. [14] provide an RSIR technique which mainly focuses on exploiting and producing textual descriptions to precisely define the relations among the objects and their features in RS images, including captions (e.g., sentences). The initial stage focuses on encoding the image's visual characteristics and later converting the encoded features to a textual description which summarizes the image content. The next stage focuses on converting the produced textual descriptions into semantically useful feature vectors. Lastly, estimating the similarity between the textual description vectors of query images and those of archive images retrieves images highly similar to the query image.
Wang et al. [15] suggested an end-to-end trainable deep bidirectional LSTM (Bi-LSTM) method for addressing the issue. By combining two separate LSTM networks and a deep CNN (DCNN), this methodology can learn long-term visual-language interactions with the help of future and historical context data in a high-level semantic space. In Chang et al. [16], an advanced image captioning method combining image captioning, object detection, and color analysis was suggested for the automated generation of textual descriptions of images. In an encoder-decoder method for image captioning, VGG16 is utilized as the encoder and an LSTM network is employed as the decoder.
Xiong et al. [17] recommended a hierarchical transformer-based medical imaging report generation technique. This presented technique has two parts: one is an image encoder that extracts heuristic visual features through a bottom-up attention algorithm; the other is a non-recurrent captioning decoder that enhances computational efficiency through parallel computation. Wang et al. [18] suggested a novel methodology to implicitly model the association between regions of interest in an image using a graph NN, along with a novel context-aware attention mechanism for guiding attention selection by fully memorizing formerly attended visual contents.
In Al-Malla et al. [19], the authors introduced a new method applying a generative adversarial network to sequence generation. The greedy decoding method is utilized for generating an effective baseline reward for self-critical training. The visual and semantic relationships of diverse objects are combined into local-relation attention. The authors in [20] developed an attention-based encoder-decoder deep model which utilizes convolutional features derived from a CNN model pre-trained on ImageNet (Xception), along with object features derived by the YOLOv4 model pre-trained on MS COCO. The authors also introduced a novel positional encoding scheme for object features, termed the "importance factor".

The Proposed Model
In this study, a new MODLE-AICT technique has been developed for generating effective captions for input images using two processes involving an encoding unit and a decoding unit. Primarily, at the encoding part, the SSA with a HybridNet model is utilized to generate an effective input image representation using fixed-length vectors. In addition, the decoding part includes a BiGRU model that is used to generate descriptive sentences. Figure 1 showcases the block diagram of the MODLE-AICT algorithm.

Data Pre-Processing
At the preliminary level, data pre-processing is performed in different stages as given below.

•	Lower case conversion;
•	Removal of punctuation marks to decrease complexity;
•	Removal of numeric values;
•	Tokenization;
•	Vectorization (to turn the original strings into integer sequences where each integer represents the index of a word in a vocabulary).
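The stages above can be sketched as a small pipeline; the tokenizer and vocabulary-building details here are illustrative assumptions, not the exact implementation used in the paper:

```python
import re

def preprocess_captions(captions):
    """Apply the listed pre-processing stages to a list of caption strings."""
    cleaned = []
    for text in captions:
        text = text.lower()                   # lower case conversion
        text = re.sub(r"[^\w\s]", " ", text)  # removal of punctuation marks
        text = re.sub(r"\d+", "", text)       # removal of numeric values
        cleaned.append(text.split())          # tokenization
    # Vectorization: map each word to its index in a vocabulary (index 0 kept for padding)
    words = sorted({w for tokens in cleaned for w in tokens})
    vocab = {w: i + 1 for i, w in enumerate(words)}
    return [[vocab[w] for w in tokens] for tokens in cleaned], vocab

seqs, vocab = preprocess_captions(["Three people ride bikes!", "A dog runs."])
```

The resulting integer sequences are what the encoder-decoder pipeline consumes in place of raw strings.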

Feature Extraction: HybridNet Model
In this work, the HybridNet model is utilized for generating visual features of the input images. Generally, classification requires intra-class invariant features, while reconstruction requires the preservation of all the data. To overcome these shortcomings, HybridNet includes an unsupervised path (E_u and D_u) and a discriminative path (E_c and D_c). The two encoders E_c and E_u take an input image x and generate the representations h_c and h_u, whereas the decoders D_c and D_u take h_c and h_u, respectively, as input to generate the partial reconstructions x̂_c and x̂_u. At last, the classifier C produces a class prediction by means of the discriminative features: ŷ = C(h_c). Although both paths may have analogous architectures, they have to perform complementary and different roles. The discriminative path needs to extract discriminative features h_c that must ultimately be well crafted to effectively execute the classification task, and it produces a partial reconstruction x̂_c that need not be accurate; retaining all the data is not a behavior that we want to encourage [21]. As a result, the role of the unsupervised path is complementary to the discriminative path, capturing in h_u the data lost in h_c. Consequently, it produces the complementary reconstruction x̂_u; integrating x̂_c and x̂_u, the final reconstruction x̂ is closer to x. The architecture of HybridNet is formulated by using the below expressions:

x̂_c = D_c(E_c(x)), x̂_u = D_u(E_u(x)), x̂ = x̂_c + x̂_u, ŷ = C(E_c(x))
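The two-path structure described above can be sketched with placeholder single-layer encoders and decoders; the layer sizes and random linear maps are illustrative assumptions, standing in for HybridNet's deep convolutional paths:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Single random linear map standing in for a deep encoder/decoder path."""
    W = rng.standard_normal((d_out, d_in)) * 0.1
    return lambda v: W @ v

d, h, n_classes = 8, 4, 3
E_c, E_u = linear(d, h), linear(d, h)   # discriminative and unsupervised encoders
D_c, D_u = linear(h, d), linear(h, d)   # corresponding decoders
C = linear(h, n_classes)                # classifier on the discriminative features

x = rng.standard_normal(d)
h_c, h_u = E_c(x), E_u(x)               # two complementary representations of x
x_c, x_u = D_c(h_c), D_u(h_u)           # partial reconstructions
x_hat = x_c + x_u                       # final reconstruction integrates both paths
y_hat = C(h_c)                          # class prediction uses only h_c
```

The design point the sketch makes concrete is that only h_c feeds the classifier, while both h_c and h_u contribute to reconstructing x.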

It should be noted that the ultimate role of the reconstruction is to act as a regularizer for the discriminative encoder. The major contribution and challenge of the study is to establish a method that guarantees that both paths actually perform in such a way.
The two major problems that we address are making the discriminative path emphasize discriminative features and making both paths contribute and cooperate in the reconstruction. In fact, within this framework, one might obtain two paths that work independently: a reconstruction path x̂ = x̂_u = D_u(E_u(x)) and a classification path ŷ = C(E_c(x)) with x̂_c = 0. We resolve this issue through the encoder and decoder architecture along with a proper training and loss function. The HybridNet model has two data paths, with one generating a class prediction and both generating partial reconstructions that need to be integrated. In this subsection, we address the problem of training this architecture efficiently. The loss encompasses terms for stability (Ω_stability); classification (L_cls); final reconstruction (L_rec); and intermediate reconstructions (L_rec-inter,b,l, for layer l and branch b).
Moreover, it is followed by a branch complementarity training model. All the terms are weighted through respective λ variables. The HybridNet architecture is trained on partially labelled data comprising an unlabeled set D_unsup and a labelled set D_sup. Each batch comprises n instances, separated into n_u unlabeled images from D_unsup and n_s labelled images from D_sup. The classification term is a regular cross-entropy term employed only on the n_s labelled instances, as follows:
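A possible sketch of the combined training objective, with cross-entropy applied only to labelled instances; the λ weights and the simple squared-error stand-ins for the reconstruction and stability terms are assumptions for illustration:

```python
import numpy as np

def hybridnet_loss(y_true, logits, x, x_hat, inter_pairs,
                   lam_cls=1.0, lam_rec=1.0, lam_inter=0.1, lam_stab=0.01):
    """Weighted sum of the four loss terms; y_true is None for unlabeled samples."""
    l_cls = 0.0
    if y_true is not None:                  # cross-entropy on labelled instances only
        p = np.exp(logits - logits.max())   # numerically stable softmax
        p /= p.sum()
        l_cls = -np.log(p[y_true])
    l_rec = np.mean((x - x_hat) ** 2)       # final reconstruction term L_rec
    l_inter = sum(np.mean((a - b) ** 2) for a, b in inter_pairs)  # intermediate terms
    omega = sum(np.sum(h ** 2) for h, _ in inter_pairs)           # stability stand-in
    return lam_cls * l_cls + lam_rec * l_rec + lam_inter * l_inter + lam_stab * omega

loss = hybridnet_loss(1, np.array([0.1, 2.0, -1.0]),
                      np.ones(4), 0.9 * np.ones(4),
                      [(np.zeros(2), np.zeros(2))])
```

Passing `y_true=None` drops the classification term, which is how the unlabeled instances in a batch would contribute only to the reconstruction and stability terms.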

Hyperparameter Optimization
In order to effectively tune the hyperparameters related to the HybridNet model, the SSA is exploited. The motion behavior of salps can be mathematically modelled to resolve optimization problems [22]. Salps are sea creatures that have barrel-shaped, jelly-like bodies and move from place to place by driving water through their bodies from one side to the other. They exist as colonies and travel together like chains. Leaders and followers are the two most important classes of salps: the leader leads the chain in a forward direction, while followers follow the leader synchronously and in harmony. Like other swarm intelligence models, the SSA begins with an arbitrary initialization of a swarm of N salps in an n-dimensional search space, where x symbolizes the position of a salp and y defines the food source, i.e., the target of the swarm in the search region. The leader salp updates its position by the subsequent formula:

x_i^1 = y_i + r_1((ub_i − lb_i)r_2 + lb_i),  r_3 ≥ 0.5
x_i^1 = y_i − r_1((ub_i − lb_i)r_2 + lb_i),  r_3 < 0.5    (3)

In Equation (3), for the i-th parameter, x_i^1 is the position of the first (leader) salp and y_i is the position of the food source; ub_i and lb_i are the upper and lower bounds, and r_1, r_2, r_3 are arbitrary numbers.
Among the three arbitrary numbers, r_1 occupies the lead position because it balances exploitation and exploration during the search process. It can be formulated as follows:

r_1 = 2e^(−(4l/L)^2)    (4)

In Equation (4), l shows the existing iteration and L is the previously determined maximum number of iterations; r_2 and r_3 are arbitrary numbers lying within [0, 1]. To update the locations of the followers according to Newton's law of motion, the following mathematical expression is utilized:

x_j^i = (1/2) a t^2 + δ_0 t    (5)

where j ≥ 2, x_j^i is the position of the j-th salp in the i-th parameter, t is time, and δ_0 is the initial speed.
Assuming that δ_0 = 0 and taking t as the iteration counter of the optimization problem, the abovementioned formula is transformed into the succeeding expression:

x_j^i = (1/2)(x_j^i + x_{j−1}^i)    (7)

In Equation (7), j ≥ 2. This equation demonstrates that follower salps update their positions according to the preceding salp and their own position. When some salps escape from the restricted search space, they are carried back within the limits as follows:

x_j^i = lb_i if x_j^i ≤ lb_i;  x_j^i = ub_i if x_j^i ≥ ub_i;  x_j^i otherwise    (8)

The abovementioned expressions are repeatedly executed until the ending condition is met. Note that the food source is upgraded by exploring and exploiting the space around the existing solution, which may determine a better solution. During optimization, salp chains have the capacity to move toward the global optimum, as illustrated in Algorithm 1.
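The SSA update rules above can be sketched as a small box-constrained optimizer; the sphere objective and the population settings are illustrative choices, not those used for the HybridNet hyperparameters:

```python
import numpy as np

def ssa_optimize(f, lb, ub, n_salps=20, iters=100, seed=1):
    """Minimal salp swarm optimizer minimizing f over the box [lb, ub]."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    X = rng.uniform(lb, ub, size=(n_salps, dim))     # random initial swarm
    food = X[np.argmin([f(x) for x in X])].copy()    # best solution so far = food source
    for l in range(1, iters + 1):
        r1 = 2 * np.exp(-(4 * l / iters) ** 2)       # Eq. (4): exploration/exploitation balance
        for j in range(n_salps):
            if j == 0:                               # leader update, Eq. (3)
                r2, r3 = rng.random(dim), rng.random(dim)
                step = r1 * ((ub - lb) * r2 + lb)
                X[j] = np.where(r3 >= 0.5, food + step, food - step)
            else:                                    # follower update, Eq. (7)
                X[j] = (X[j] + X[j - 1]) / 2
            X[j] = np.clip(X[j], lb, ub)             # bring escapees back, Eq. (8)
            if f(X[j]) < f(food):                    # greedy food-source update
                food = X[j].copy()
    return food

best = ssa_optimize(lambda x: np.sum(x ** 2),
                    lb=np.array([-5.0, -5.0]), ub=np.array([5.0, 5.0]))
```

In the MODLE-AICT pipeline, `f` would be a validation-loss evaluation of the HybridNet model under a candidate hyperparameter vector rather than the toy sphere function shown here.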

Image Captioning
In this study, the decoding part includes the BiGRU model to generate descriptive sentences. A recurrent neural network (RNN) has been successfully used to handle data sequences in different areas [23]. In an RNN, the input sequence x = (x_1, . . . , x_T), hidden vector sequence h = (h_1, . . . , h_T), and output vector sequence y = (y_1, . . . , y_T) are related by the given equations:

h_t = Φ(U x_t + W h_{t−1} + b)    (9)
y_t = V h_t + c    (10)

Let Φ be the activation function; the popular choice is an element-wise application of the sigmoid function. U refers to the input-hidden weight matrix, W stands for the hidden-hidden weight matrix, and b denotes the hidden bias vector in Equation (9); in Equation (10), V signifies the hidden-output weight matrix and c denotes the output bias vector. It is nearly impossible to capture long-term dependencies with a plain RNN, as the gradient tends to explode or vanish. Therefore, researchers have made efforts to develop more complex activation functions to resolve this shortcoming. For instance, the LSTM unit was initially proposed for capturing long-term dependencies. In recent years, other variants of the recurrent unit, such as the GRU, have also been devised, which are easier to compute and generalize better than the LSTM unit. Figure 2 depicts the framework of the GRU.
LSTM makes use of an output gate for controlling the exposure of the quantity of memory content.
In Equation (11), the output gate is represented as o_t:

h_t = o_t ⊙ tanh(c_t)    (11)

It is calculated as follows:

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (12)

In Equation (12), the logistic function is indicated as σ. The memory cell c_t is preserved by adding some new memories and eliminating (forgetting) current memories:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t    (13)

The new memories c̃_t are given by:

c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)    (14)

The extent to which memories are added and removed is controlled by the input gate i_t and the forget gate f_t. The forget gate is calculated by the following equation:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (15)

and i_t is calculated as follows:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (16)

In these equations, the corresponding bias vector is indicated as b. As with the LSTM unit, the GRU uses gates for controlling the data stream inside a unit; however, there is no memory cell. The hidden state h_t is a linear combination of the new hidden state h̃_t and the preceding hidden state h_{t−1}:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t    (17)

In Equation (17), the update gate z_t controls how much the new activation is incorporated. It is calculated as follows:

z_t = σ(W_z x_t + U_z h_{t−1})    (18)

The new activation h̃_t is calculated as follows:

h̃_t = tanh(W x_t + U(r_t ⊙ h_{t−1}))    (19)

In Equation (19), the reset gate r_t plays a role similar to the forget gate in the LSTM:

r_t = σ(W_r x_t + U_r h_{t−1})    (20)

While a typical RNN exploits only the preceding data, the bi-directional RNN (BRNN) processes information in two directions. The output y of the BRNN is attained by combining the forward hidden sequence →h_t and the backward hidden sequence ←h_t as follows:

→h_t = Φ(U_→ x_t + W_→ →h_{t−1} + b_→)    (21)
←h_t = Φ(U_← x_t + W_← ←h_{t+1} + b_←)    (22)
y_t = V_→ →h_t + V_← ←h_t + c    (23)

Integrating the BRNN with the GRU yields the BiGRU, which is utilized for accessing long-term data sequences in two directions. Caption generation at each step can generally be regarded as a classification problem, and cross-entropy is adopted as the loss function. The weighted cross-entropy is represented as:

L(θ) = −(1/N) Σ_{i=1}^{N} Σ_{m=1}^{M} y_{i,m} log ŷ_{i,m}    (24)

In Equation (24), θ indicates the neural network parameters, N represents the sample count, M represents the number of classes, y_i indicates the true label, and ŷ_i represents the predicted probability.
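A minimal BiGRU sketch following Equations (17)-(23), with the forward and backward states concatenated per time step; the parameter initialization and the omission of bias terms are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(params, x, h):
    """One GRU step: gated mix of the previous state and a candidate activation."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)               # update gate, Eq. (18)
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate, Eq. (20)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # new activation, Eq. (19)
    return (1 - z) * h + z * h_tilde           # linear combination, Eq. (17)

def bigru(params_f, params_b, xs, h_dim):
    """BiGRU: forward and backward passes over xs, states concatenated per step."""
    hf, fwd = np.zeros(h_dim), []
    for x in xs:                               # forward direction
        hf = gru_cell(params_f, x, hf)
        fwd.append(hf)
    hb, bwd = np.zeros(h_dim), []
    for x in reversed(xs):                     # backward direction
        hb = gru_cell(params_b, x, hb)
        bwd.append(hb)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d_in, d_h = 3, 2
def init_params():
    shapes = [(d_h, d_in), (d_h, d_h)] * 3     # (Wz, Uz, Wr, Ur, Wh, Uh)
    return [rng.standard_normal(s) * 0.1 for s in shapes]

xs = [rng.standard_normal(d_in) for _ in range(4)]
states = bigru(init_params(), init_params(), xs, d_h)
```

Each concatenated state thus carries both past and future context, which is what the decoder exploits when emitting the next caption word.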

Performance Validation
The experimental validation of the MODLE-AICT model is tested using the Flickr8K dataset (https://www.kaggle.com/adityajn105/flickr8k/activity, accessed on 13 March 2022) and the MS-COCO 2014 dataset [24]. A comparison study is also made with recent models [25-31]. A few sample images are depicted in Table 1. The Flickr8K dataset contains 8000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

Performance Measures
To validate the performance of the presented model, a set of four metrics is utilized: BLEU, METEOR, CIDEr, and Rouge-L. BLEU [25] is a commonly utilized metric for estimating the quality of the produced text. For an effective image captioning outcome, BLEU values are required to be high, and it is defined using Equation (25):

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n),  BP = 1 if c > r, else e^(1−r/c)    (25)

where BP denotes the penalty factor, and r and c represent the lengths of the reference and generated sentences, respectively. The METEOR metric relies on word recall rate and single-precision weighted harmonic mean. It computes the harmonic mean of precision and recall between the optimum candidate and reference translations. It is defined as follows:

METEOR = (1 − γ · frag^θ) · (P · R)/(α · P + (1 − α) · R)    (26)

where α, γ, and θ denote default parameters. The CIDEr index treats each sentence as a "document" represented in the form of a TF-IDF vector. It computes the cosine similarity between the generated caption s_ij and the original caption to obtain a score value.
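The BLEU score of Equation (25) can be sketched for a single reference as follows; uniform n-gram weights and a clipped-count precision are assumed:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with brevity penalty, single-reference sketch."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))   # penalty factor BP
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())  # clipped matches
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    return bp * math.exp(sum(log_precisions) / max_n)    # uniform weights w_n = 1/N
```

A candidate identical to its reference scores 1.0; shorter candidates are penalized by BP, matching the requirement that higher BLEU indicates better captions.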
Table 1. Sample captions for one Flickr8K image.
•	People on ATVs and dirt bikes are traveling along a worn path in a field surrounded by trees
•	Three people are riding around on ATVs and motorcycles
•	Three people on motorbikes follow a trail through dry grass
•	Three people on two dirt bikes and one four-wheeler are riding through brown grass
•	Three people ride off-road bikes through a field surrounded by trees
ROUGE is another similarity measurement model that is mainly based on the recall rate. It determines the co-occurrence probability of N-grams in the reference translation and the translation under investigation. It is defined using Equation (28):

ROUGE-N = (Σ_{S∈Refs} Σ_{gram_n∈S} Count_match(gram_n)) / (Σ_{S∈Refs} Σ_{gram_n∈S} Count(gram_n))    (28)

Table 2 and Figure 3 inspect a detailed result analysis of the MODLE-AICT model on the test Flickr8K dataset [23-28]. The results implied that the MODLE-AICT model has gained effective outcomes over the other models. For instance, based on BLEU-1, the MODLE-AICT model obtained a higher BLEU-1 of 69.06, whereas the M-RNN, G-NICG, L-Bilinear, DVS, ResNet50, VGA-16, and HPTDL models attained lower BLEU-1 values, beginning at 59.18.

Table 3. Classification analysis of the MODLE-AICT algorithm with recent approaches on the Flickr8K dataset.

A comparison study of the MODLE-AICT model with recent models on the Flickr8K dataset is shown in Figure 5. The figure implied that the SCST-IN and SCST-ALL models have obtained lower performance than other models. This was followed by the G-NIC, A-NIC, and DenseNet models, which attained moderately closer results. Along with that, the HPTDL model accomplished a reasonable performance. However, the MODLE-AICT model has shown enhanced performance over other models on the test Flickr8K dataset.
The training accuracy (TA) and validation accuracy (VA) attained by the MODLE-AICT approach on the Flickr8K dataset are demonstrated in Figure 6. The experimental outcome implied that the MODLE-AICT technique has gained maximum values of TA and VA. Specifically, the VA seemed to be higher than TA.
The training loss (TL) and validation loss (VL) achieved by the MODLE-AICT methodology on the Flickr8K dataset are established in Figure 7. The experimental outcome inferred that the MODLE-AICT system accomplished the lowest values of TL and VL. Specifically, the VL seemed to be lower than TL. Table 4 and Figure 8 report the METEOR, CIDEr, and Rouge-L analysis of the MODLE-AICT model with recent approaches on the Flickr8K dataset.

Table 4. METEOR, CIDEr, and Rouge-L analysis on the Flickr8K dataset.

Model | METEOR | CIDEr | Rouge-L
SCST-IN Model [29] | 20.00 | 161.00 | 49.00
SCST-ALL Model [29] | 23.00 | 154.00 | 42.00
A comparison study of the MODLE-AICT technique with recent models on the MS-COCO 2014 dataset is shown in Figure 10. The figure implied that the SCST-IN and SCST-ALL methodologies acquired a lower performance than the other models. Then, the G-NIC, A-NIC, and DenseNet approaches gained moderately closer results. Moreover, the HPTDL approach accomplished a reasonable performance. However, the MODLE-AICT system has shown an enhanced performance over the other models on the test MS-COCO 2014 dataset.
The TA and VA attained by the MODLE-AICT technique on the MS-COCO 2014 dataset are demonstrated in Figure 11. The experimental outcome implied that the MODLE-AICT method has gained maximum values of TA and VA. Specifically, the VA seemed to be higher than TA. The TL and VL achieved by the MODLE-AICT approach on the MS-COCO 2014 dataset are established in Figure 12. The experimental outcome inferred that the MODLE-AICT methodology has accomplished the least values of TL and VL. Specifically, the VL seemed to be lower than TL. Table 5 and Figure 9 review a detailed classification analysis of the MODLE-AICT technique on the MS-COCO 2014 dataset. For instance, based on Rouge-L, the MODLE-AICT technique acquired a higher Rouge-L of 63, whereas the SCST-IN, SCST-ALL, G-NIC, A-NIC, DenseNet, and HPTDL algorithms acquired a lower Rouge-L of 51, 59, 51, 58, 57, and 60, correspondingly. From the detailed results and discussion, it is assured that the proposed model has shown effective outcomes in the image captioning process.

Table 5. METEOR, CIDEr, and Rouge-L analysis on the MS-COCO 2014 dataset.

Model | METEOR | CIDEr | Rouge-L
SCST-IN Model [29] | 22.00 | 109.00 | 51.00
SCST-ALL Model [29] | 25.00 | 114.00 | 59.00
G-NIC Model [26] | 21.00 | 111.00 | 51.00
A-NIC Model [26] | 24.00 | 110.00 | 58.00
DenseNet Model [24] | 24.00 | 122.00 | 57.00
HPTDL Model [25] | 34.00 | 125.00 | 60.00
MODLE-AICT | 37.00 | 129.00 | 63.00

Conclusions
In this study, a novel MODLE-AICT technique was developed for generating effective captions for input images using two processes involving an encoding unit and a decoding unit. Primarily, at the encoding part, the SSA with a HybridNet model is utilized to generate an effective input image representation using fixed-length vectors. In addition, the decoding part includes a BiGRU model used to generate descriptive sentences. The inclusion of an SSA-based hyperparameter optimizer helps in attaining effective performance. For inspecting the enhanced performance of the MODLE-AICT model, a series of simulations were carried out, and the results were examined under several aspects. The experimental values implied the superiority of the MODLE-AICT model over recent approaches. Thus, the presented MODLE-AICT technique can be exploited as an effective approach for image captioning. In the future, ensemble DL-based fusion models can be designed to enhance the performance.