Modeling of Hyperparameter Tuned Deep Learning Model for Automated Image Captioning

Image processing remains a hot research topic among research communities due to its applicability in several areas. An important application of image processing is automatic image captioning, which aims to generate a proper description of an image in a natural language automatically. Image captioning is a recently developed research topic that has started to receive significant attention in the fields of computer vision and natural language processing (NLP). Since image captioning is considered a challenging task, recently developed deep learning (DL) models have attained significant performance, albeit with increased complexity and computational cost. Keeping these issues in mind, in this paper, a novel hyperparameter-tuned DL for automated image captioning (HPTDL-AIC) technique is proposed. The HPTDL-AIC technique encompasses two major parts, namely an encoder and a decoder. The encoder part utilizes Faster SqueezeNet with the RMSProp model to generate an effective depiction of the input image by encoding it into a predefined-length vector. At the same time, the decoder unit employs the bird swarm algorithm (BSA) with a long short-term memory (LSTM) model to concentrate on the generation of description sentences. The design of RMSProp and BSA for the hyperparameter tuning of the Faster SqueezeNet and LSTM models for image captioning constitutes the novelty of the work and helps to accomplish enhanced image captioning performance. The experimental validation of the HPTDL-AIC technique is carried out on two benchmark datasets, and the extensive comparative study points out the improved performance of the HPTDL-AIC technique over recent approaches.


Introduction
Over recent years, image processing and computer vision (CV) systems have made significant progress in various fields such as object detection and image classification. Benefiting from these advances, it has become possible to automatically generate one or more sentences describing the visual content of an image, a task called image captioning. Automatically creating natural and complete descriptions of images has great impact in areas such as describing healthcare images, captioning news images, making data accessible to blind users, human-robot communication, and text-based image retrieval [1]. These applications give image captioning significant practical and theoretical research value. Hence, image captioning is a sophisticated but useful task in the era of artificial intelligence (AI) technology.
Given a novel image, an image captioning method needs to output a description of the image at a semantic level; for instance, for an input image containing waves and people, the method must describe the scene accordingly. The key contributions of this study are as follows:
• Develops a novel HPTDL-AIC technique for the automated image captioning process;
• Aims to create correct descriptions for the input images by the use of an encoder-decoder structure;
• Employs the Faster SqueezeNet with RMSProp model for the extraction of visual features that exist in the image;
• Presents a BSA with LSTM as a language modeling tool to generate description sentences and decode the vector into sentences;
• Validates the performance of the HPTDL-AIC technique using two benchmark datasets and inspects the results under several aspects.
The rest of the study is organized as follows. Section 2 offers a detailed description of the HPTDL-AIC technique, and its experimental analysis takes place in Section 3. Lastly, Section 4 draws the major findings of the study with future scope.

Literature Review
Ren et al. [9] presented an advanced decision-making architecture for captioning images. They employ value and policy networks to collectively produce captions. The value network acts as lookahead and global guidance by estimating each possible extension of the present state, while the policy network acts as local guidance by offering the confidence of forecasting the following word given the present state. This alters the aim from forecasting the correct word towards creating captions similar to the ground-truth caption. Kesavan et al. [10] systematically analyzed distinct DNN-based pre-trained models and image caption generation methods to accomplish effective models by fine-tuning. The examined models include variants with and without 'attention' mechanisms for optimizing caption generation capacity. Each model is trained on the same dataset for a fair comparison.
Wang et al. [11] presented a multi-layer dense attention approach for the image captioning process. The authors utilized a faster region-based convolutional neural network (Faster R-CNN) for the extraction of image features as the coding layer, a long short-term memory (LSTM) attention model for decoding in the multi-layer dense attention approach, and thereby the description text is created. The hyperparameters of the model are tuned by the use of policy gradient optimization in reinforcement learning. The utilization of the dense attention scheme at the coding layer eliminates the interference of non-salient information and selectively outputs the respective description text for the decoding procedure.
Sharma [12] proposed a novel image captioning method that considers the text present in the image. The study employs the idea of word morphology and, therefore, creates Fisher-Vector-based word morphology features. The presented method is evaluated on two open-source datasets, and the captions generated by the presented approach are comparable to those of state-of-the-art captioning models. Cheng et al. [13] proposed a semi-supervised DL approach named the N-gram + Pseudo Label NIC technique. The approach integrates existing DNN systems, for example, pseudo labels, N-grams, and the NIC (Neural Image Caption) method. This technique produces pseudo labels by the N-gram search method and enhances the effectiveness of the method by employing people's descriptive habits and prior knowledge from the N-gram tables.
Zeng et al. [14] developed a technique for ultrasound image captioning based on region detection. This technique simultaneously encodes and detects the focus region in ultrasound images, uses the LSTM to decode the vector, and produces annotation text data to describe the disease contents in an ultrasound image. Shen et al. [15] designed a Variational Autoencoder and Reinforcement Learning-based Two-stage Multi-task Learning Model (VRTMM) for the remote sensing image captioning process. Initially, the VAE and CNN models are fine-tuned. Next, the transformer generates text descriptions with semantic and spatial features. Then, the Reinforcement Learning (RL) approach is used to enhance the quality of the sentences. Although several models are available, the proposed model focuses on the design of encoding and decoding units to generate an effective depiction of the input image by encoding it into a predefined-length vector and to concentrate on the generation of description sentences.

The Proposed Image Captioning Model
For an effective and automated image captioning process, a novel HPTDL-AIC technique has been developed, which aims to produce appropriate descriptions for input images by the use of an encoder-decoder structure. In particular, the encoder unit includes the Faster SqueezeNet with RMSProp model for generating a one-dimensional vector representation of the input image. Then, the BSA with the LSTM model is utilized as a decoder to produce description sentences and decode the vector into a sentence. In addition, the RMSProp and BSA techniques are applied to appropriately tune the hyperparameters involved. Figure 1 showcases the overall working process of the HPTDL-AIC technique. The steps involved in the proposed model are listed as follows.
Step 1: Preprocessing. At the primary stage, the actual input data are transformed into a useful format by the inclusion of several subprocesses such as lower-case conversion, punctuation mark removal, tokenization, and vectorization.
Step 2: Feature Extraction. After data preprocessing, the feature extraction process is performed by using the Faster SqueezeNet with RMSProp model, which is utilized to generate visual features.
Step 3: Image Caption Generation. Finally, the textual description of the images is automatically generated by the use of the LSTM model, and the hyperparameters of the LSTM model are appropriately adjusted by the use of BSA.

Pre-Processing
Initially, data preprocessing occurs at several levels, as listed here:

• The dataset text has words with distinct letter cases, which creates issues since words with varying capitalization are regarded as different. This inflates the vocabulary and results in complexity. Therefore, it is essential to convert the entire text to lower case in order to prevent this problem.

• The presence of punctuation increases the complexity of the problem; therefore, punctuation marks are removed from the dataset.
• Numerical data present in the text pose an issue to the component, as they increase the extracted vocabulary.

• Indicating initial and final order: the word tokens '<start>' and '<end>' are added at the beginning and end of every sentence to represent the initial and final tokens of the predicted sequence to the component.
• Tokenization: the cleaned text is separated into constituent words, and a dictionary covering the entire vocabulary, with word-to-index and index-to-word equivalents, is obtained.
• Vectorization: to resolve differing sentence lengths, shorter sentences are padded to the length of the longest sentence.
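The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and token names are hypothetical, and it assumes numerals are stripped along with punctuation.

```python
import string

START, END = "<start>", "<end>"

def preprocess_caption(caption):
    # Lower-case, strip punctuation and numerals, then wrap with order tokens.
    caption = caption.lower()
    caption = caption.translate(str.maketrans("", "", string.punctuation + string.digits))
    return [START] + caption.split() + [END]

def build_vocab(captions):
    # Word-to-index and index-to-word dictionaries over the whole vocabulary.
    words = sorted({w for c in captions for w in preprocess_caption(c)})
    word2idx = {w: i + 1 for i, w in enumerate(words)}  # 0 reserved for padding
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def vectorize(captions, word2idx):
    # Pad shorter sentences with 0 to the length of the longest one.
    seqs = [[word2idx[w] for w in preprocess_caption(c)] for c in captions]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]
```

The cleaned, padded index sequences produced this way are what the language model consumes.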

Feature Extraction: Optimal Faster SqueezeNet Model
At this stage, the Faster SqueezeNet model is utilized to generate visual features of the applied images. Faster SqueezeNet was originally presented to improve the performance of an electronic module classifier. To prevent overfitting, BatchNorm and residual frameworks are used. Simultaneously, as in DenseNet, it utilizes concatenation for connecting distinct layers to enhance the expressiveness of the initial layers of the network. Faster SqueezeNet has one BatchNorm layer, three block layers, four convolutional layers, and a global average pooling layer. Faster SqueezeNet is mainly enhanced in the following ways. In order to enhance the data flow amongst layers, it imitates the DenseNet framework and presents various connection modes. It contains a pooling layer and a fire module, and, eventually, the two concat layers are linked to the next convolutional layer. The current layer obtains every feature map of the earlier layers and utilizes x_0, ..., x_{l−1} as input; then, x_l is given as follows:

x_l = H_l([x_0, x_1, ..., x_{l−1}]),

where [x_0, x_1, ..., x_{l−1}] signifies the concatenation of the feature maps created by layers 0, 1, ..., l − 1, and H_l concatenates multiple inputs [16]. Without excessively increasing the number of network variables, the efficiency of the network is improved from the initial phases; simultaneously, some two-layer blocks are directly connected to the data. To ensure optimal network convergence, it learns from the ResNet framework and presents various structure blocks that have pooling layers and fire modules. At last, the two layers are summed and linked to the next convolutional layer. Figure 2 illustrates the framework of the SqueezeNet model. In the ResNet model, the shortcut links utilize identity mapping, indicating that the input of the convolutional stack is provided directly to the output of the convolutional stack.
Formally, denoting the desired underlying mapping as H(x), the stacked non-linear layers fit another mapping F(x) := H(x) − x. The original mapping is thus recast as F(x) + x, which is realized by a framework named a shortcut connection from the actual encoded method. Shortcut connections generally skip one or more layers. Thus, the residual framework of ResNet can be utilized to address the gradient vanishing problem without increasing the number of network parameters.
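The two connection styles described above, DenseNet-style concatenation and ResNet-style identity shortcuts, can be illustrated with a minimal NumPy sketch. The function names and shapes here are hypothetical, not the paper's implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_connection(features, weight):
    # DenseNet-style: layer l receives the concatenation [x0, ..., x_{l-1}]
    # as input, i.e. x_l = H_l([x0, x1, ..., x_{l-1}]).
    x = np.concatenate(features)      # association of earlier feature maps
    return relu(weight @ x)

def residual_shortcut(x, weight):
    # ResNet-style: the stacked layers fit F(x) := H(x) - x, so the desired
    # mapping is recovered as F(x) + x via an identity shortcut connection.
    fx = relu(weight @ x)
    return fx + x                     # shortcut skips the stacked layer
```

The identity shortcut adds no parameters, which is why the residual framework counters vanishing gradients without enlarging the network.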
In order to properly adjust the hyperparameters of the Faster SqueezeNet model, RMSProp is utilized. RMSProp (root mean square propagation) is an optimization method presented by Geoffrey E. Hinton in his Coursera course. To further optimize the loss function by suppressing extreme swings in the updates and accelerating convergence, the RMSProp method uses an exponentially weighted average of the squared gradients of the weight W and bias b. Consequently, it makes greater advancement in directions where the variable space is gentler: the sum of squares of the historical gradients is small in the gentler direction, which results in a smaller drop in the learning rate. For iteration t, the process is described as follows:

s_dW = β·s_dW + (1 − β)·(dW)²,
s_db = β·s_db + (1 − β)·(db)²,
W = W − α·dW / (√s_dW + ε),
b = b − α·db / (√s_db + ε),

where s_dW and s_db represent the gradient momentum accumulated by the loss function in the preceding iteration t − 1, and β is the exponential decay factor of the gradient accumulation. To avoid the situation where the denominator becomes zero, a small value ε is added. RMSProp helps to remove the directions of larger swing and is employed to correct the swing in order to render the swing in all dimensions small. At the same time, it makes the network converge faster.
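A single RMSProp update for one parameter can be sketched as follows, as a minimal NumPy illustration of the equations above; the learning rate and β value are assumed defaults, not taken from the paper.

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, lr=0.001, beta=0.9, eps=1e-8):
    # Accumulate an exponentially weighted average of the squared gradient,
    # then scale the update by its root: large-swing directions are damped,
    # gentle directions (small squared-gradient history) keep a larger step.
    s_dw = beta * s_dw + (1.0 - beta) * dw ** 2
    w = w - lr * dw / (np.sqrt(s_dw) + eps)  # eps keeps the denominator non-zero
    return w, s_dw
```

The same update applies elementwise to the bias b with its own accumulator s_db.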

Language Modeling for Image Caption Generation
Finally, the LSTM model is applied to produce effective description sentences of the applied input images. The LSTM network is effectively utilized for accomplishing tasks such as machine translation and sequence generation. In this structure, the LSTM is applied as a language model for generating suitable captions dependent upon the input vector produced by the encoder output:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),

where the resultant vector of the preceding cell h_{t−1} is concatenated with a novel element of the sequence x_t [17].
From the two generated vectors, the cell state is updated from C_{t−1} to C_t. Thus, the past state is multiplied by f_t to forget the data identified as unnecessary in the preceding step; afterwards, i_t * C̃_t is added.
The input gate defines which values are updated, and the tanh layer generates a vector of novel candidate values C̃_t that may be added to the cell state.
The obtained values of C_t and h_t are transferred to the network input at time t + 1.
The multiplicative gates permit efficient training of the LSTM because they are effective at preventing exploding as well as vanishing gradients. Non-linearity is offered by the sigmoid σ(·) and the hyperbolic tangent tanh(·). In the final formula, h_t is fed to the softmax function for calculating the probability distribution p_t over every word. This function is computed and optimized on the entire training dataset. The word with maximal likelihood is chosen at every time step and passed to the succeeding steps for generating complete sentences.
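One step of the LSTM cell described above can be sketched in NumPy as follows. This is a minimal illustration with a hypothetical packing of the four gate weights into one matrix; a real implementation would use a DL framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenation [h_{t-1}, x_t] to the four gate pre-activations.
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t, i_t, o_t, g_t = np.split(z, 4)
    f_t, i_t, o_t = sigmoid(f_t), sigmoid(i_t), sigmoid(o_t)  # gates
    g_t = np.tanh(g_t)              # candidate values C̃_t for the cell state
    c_t = f_t * c_prev + i_t * g_t  # forget old state, add new candidates
    h_t = o_t * np.tanh(c_t)        # h_t is later fed to a softmax over words
    return h_t, c_t
```

Iterating this step while feeding the softmax's most likely word back in as x_{t+1} yields the full caption.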
For proper hyperparameter tuning of the LSTM model, BSA is applied in such a manner that the overall performance is improved. BSA is a nature-inspired technique stimulated by the social behavior and social interactions of birds in a swarm. It simulates the foraging, vigilance, and flight behavior of birds. Therefore, swarm behavior can be effectively derived from a swarm of birds for optimization processes. The bird swarm technique can be summarized by five rules:
• Rule 1: All birds switch between vigilance and foraging behaviors. Whether a bird forages or keeps vigilance is a stochastic decision.
• Rule 2: While foraging, each bird records and updates its previous best experience and the swarm's previous best experience. This experience is utilized for searching for food, and social information is shared concurrently amongst the entire swarm.
• Rule 3: While keeping vigilance, each bird attempts to move toward the center of the swarm. This behavior can be affected by disturbance due to swarm competition: birds with higher reserves tend to lie closer to the swarm center than birds with lower reserves.
• Rule 4: Birds fly to other locations frequently. Upon flying to another place, a bird frequently switches between producing and scrounging. The bird with the maximum reserves becomes a producer, the one with the minimum reserves becomes a scrounger, and the other birds are arbitrarily chosen to be producers or scroungers.
• Rule 5: Producers actively seek food. Scroungers arbitrarily follow a producer in searching for food.
Based on Rule 1, the time interval of the birds' flight behavior FQ, the probability of foraging behavior P (P ∈ (0, 1)), and a uniform random value δ ∈ (0, 1) are determined. When the number of iterations is less than FQ and δ ≤ P, the bird performs foraging. Rule 2 is expressed mathematically as follows [18]:

x_{i,j}^{t+1} = x_{i,j}^t + (p_{i,j} − x_{i,j}^t)·C·rand(0, 1) + (g_j − x_{i,j}^t)·S·rand(0, 1),

where C and S are two positive numbers; the former is known as the cognitive accelerated coefficient, and the latter is termed the social accelerated coefficient. Here, p_{i,j} implies the best previous position of the ith bird, and g_j stands for the best previous position of the swarm. When the number of iterations is less than FQ and δ > P, the bird performs vigilance. Rule 3 is expressed mathematically as follows:

x_{i,j}^{t+1} = x_{i,j}^t + A_1·(mean_j − x_{i,j}^t)·rand(0, 1) + A_2·(p_{k,j} − x_{i,j}^t)·rand(−1, 1),
A_1 = a_1·exp(−(pFit_i / (sumFit + ε))·N),
A_2 = a_2·exp(((pFit_i − pFit_k) / (|pFit_k − pFit_i| + ε))·(N·pFit_k / (sumFit + ε))),

where a_1 and a_2 are two positive constants between 0 and 2, pFit_i signifies the best fitness value of the ith bird, sumFit indicates the sum of the swarm's best fitness values, N is the swarm size, and k (k ≠ i) is a randomly chosen bird. When the number of iterations is equal to FQ, the bird has flight behavior, which is separated into producer and scrounger behaviors by fitness. Rules 4 and 5 are formulated as a mathematical model as follows:

x_{i,j}^{t+1} = x_{i,j}^t + randn(0, 1)·x_{i,j}^t (producer),
x_{i,j}^{t+1} = x_{i,j}^t + (x_{k,j}^t − x_{i,j}^t)·FL·rand(0, 1) (scrounger),

where FL (FL ∈ [0, 2]) denotes the extent to which the scrounger follows the producer in searching for food. The BSA approach adopts a fitness function (FF) for attaining enhanced classifier efficiency; it assigns a positive value representing the efficiency of a candidate solution. In this analysis, the minimized classification error rate is assumed as the FF, as provided in Equation (17): a better solution has a lower error rate, and a worse solution obtains a higher error rate.
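The foraging and scrounging updates, together with the error-rate fitness, can be sketched as follows. This is a minimal illustration; the coefficient values and function names are assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def foraging_update(x, p, g, C=1.5, S=1.5):
    # Rule 2: each bird moves toward its own best position p and the swarm's
    # best position g, weighted by the cognitive (C) and social (S)
    # accelerated coefficients and uniform random numbers in (0, 1).
    return x + (p - x) * C * rng.random(x.shape) + (g - x) * S * rng.random(x.shape)

def scrounger_update(x, x_producer, FL=1.0):
    # Rule 5: a scrounger follows a producer in searching for food; FL in [0, 2].
    return x + (x_producer - x) * FL * rng.random(x.shape)

def fitness(error_rate):
    # The fitness function (FF) is the classification error rate to be
    # minimized; a better candidate solution attains a lower error rate.
    return error_rate
```

In the tuning loop, each candidate position encodes LSTM hyperparameters, and the FF is obtained by evaluating the resulting model's classification error.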

Performance Validation
In this section, the performance validation of the HPTDL-AIC technique takes place by using two benchmark datasets, namely the Flickr8K [19] and MSCOCO [20] caption datasets. The results are examined under various measures, namely BLEU, METEOR, CIDEr, and ROUGE-L.

Implementation Data
The Flickr dataset family generally comprises the Flickr8k and Flickr30k datasets. The Flickr8K dataset includes 8000 images, which exhibit human activities. Each image in the dataset is paired with five sentences of textual description. The MSCOCO dataset collects data with many objects and scenarios. Figure 3 showcases a sample set of test images that exist in the dataset. The dataset file contains image captions for every sample image.



Performance Measures
In this study, four measures are used for experimental validation, namely BLEU, METEOR, CIDEr, and ROUGE-L. BLEU [21] is a widely employed measure to estimate the quality of the generated text; its value needs to be high for better machine translation performance. It can be determined as follows:

BLEU = BP · exp( Σ_{n=1}^{N} w_n·log p_n ), BP = 1 if c > r, otherwise BP = e^{1 − r/c},

where BP indicates the brevity penalty factor, p_n the modified n-gram precision with weights w_n, and r and c denote the lengths of the reference and generated sentences, respectively. The METEOR [22] measure mainly depends upon the word recall rate and a single-precision weighted harmonic mean. It determines the harmonic mean of precision and recall between the optimal candidate and reference translations. It can be computed by using Equation (19):

METEOR = (1 − Pen)·F_mean, F_mean = P·R / (α·P + (1 − α)·R), Pen = γ·(ch/m)^θ,

where α, γ, and θ denote the default parameters, P and R the unigram precision and recall, ch the number of chunks, and m the number of matched unigrams. The CIDEr [23] index considers every sentence as a "document" and represents it in the form of a TF-IDF vector. It determines the cosine similarity between the generated caption s_ij and the reference captions using a score value defined as follows:

CIDEr_n(s_ij, S_i) = (1/m) Σ_k ( g^n(s_ij) · g^n(s_ik) ) / ( ‖g^n(s_ij)‖ · ‖g^n(s_ik)‖ ),

where g^n(·) denotes the TF-IDF vector of n-grams.
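A minimal pure-Python sketch of BLEU with the brevity penalty described above follows; the function names are illustrative, and production work would use an established implementation such as NLTK's.

```python
import math
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    # Modified n-gram precision: candidate n-gram counts are clipped by the
    # reference counts so repeated words cannot inflate the score.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=2):
    # BLEU = BP * exp(mean of log n-gram precisions); BP penalizes candidates
    # shorter than the reference (r = reference length, c = candidate length).
    r, c = len(reference), len(candidate)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, while a correct but shorter candidate is discounted by the brevity penalty.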
ROUGE [24] (Recall-Oriented Understudy for Gisting Evaluation) is a similarity measurement approach that depends upon the recall rate. It determines the co-occurrence probability of N-grams in the reference translation and the translation to be examined. ROUGE-L can be mathematically formulated as follows:

R_lcs = LCS(X, Y)/m, P_lcs = LCS(X, Y)/n, F_lcs = (1 + β²)·R_lcs·P_lcs / (R_lcs + β²·P_lcs),

where LCS(X, Y) is the length of the longest common subsequence of the reference X (of length m) and the candidate Y (of length n). Figure 4 visualizes sample image captioning results obtained by the HPTDL-AIC technique. Figure 4a shows the sample test image, and the respective generated image caption is provided in Figure 4b. The figure implies that the HPTDL-AIC technique has properly provided the textual description of the image as "man with dog in the mountain".
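The LCS-based ROUGE-L measure above can be sketched as follows; a minimal illustration in which the β value is an assumed default.

```python
def lcs_length(x, y):
    # Dynamic-programming longest common subsequence, the core of ROUGE-L.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.2):
    # Recall and precision over the LCS, combined into an F-measure that,
    # with beta > 1, leans toward recall as ROUGE intends.
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference)
    p = lcs / len(candidate)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```

Unlike n-gram overlap, the LCS rewards in-order matches even when the matched words are not contiguous.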

Visualization Results
A brief comparative study of the HPTDL-AIC technique with recent methods on the test Flickr8K dataset is provided in Table 2 and Figure 6.
The accuracy analysis of the HPTDL-AIC system on the test Flickr8K dataset is demonstrated in Figure 9. The results showcase that the HPTDL-AIC technique results in increased training and validation accuracies. It is clear that the HPTDL-AIC technique is able to gain maximum validation accuracy over training accuracy.
The loss analysis of the HPTDL-AIC technique on the test Flickr8K dataset is provided in Figure 10. The figure shows that the HPTDL-AIC technique reaches lower training and validation losses. It can be noted that the HPTDL-AIC method is capable of accomplishing reduced validation loss over training loss.
Table 3 and Figure 11 offer a brief comparative BLEU analysis of the HPTDL-AIC technique on the test MS COCO 2014 dataset. The results demonstrate that the M-RNN and DVS techniques attained the worst outcomes with the lowest BLEU. In addition, the GoogleNICG and ResNet50 models reached somewhat improved values of BLEU. Next to that, the L-Bilinear and VGG-16 models resulted in moderately sensible values of BLEU. However, the proposed HPTDL-AIC technique reported improved outcomes over the other methods with higher BLEU-1, BLEU-2, BLEU-3, and BLEU-4 of 0.742, 0.587, 0.428, and 0.343, respectively.
A detailed comparative analysis of the HPTDL-AIC technique with recent methods is portrayed in Table 4 and Figure 12 [15].
Subsequently, a comparative CIDEr analysis of the HPTDL-AIC technique on the test MS COCO 2014 dataset is performed in Figure 13. The results state that the A-NIC technique results in inferior performance with a CIDEr value of 106. At the same time, the SCST-IN, SCST-ALL, and Google NIC techniques accomplished considerable CIDEr values of 111, 114, and 108, respectively. Although the DenseNet model has depicted a high CIDEr value of 118, the HPTDL-AIC technique has presented a better outcome with a higher CIDEr value of 111.
The accuracy analysis of the HPTDL-AIC technique on the test MSCOCO 2014 dataset is displayed in Figure 15. The results showcase that the HPTDL-AIC method results in maximum training and validation accuracy. It can be stated that the HPTDL-AIC method attains increased validation accuracy over training accuracy.
The loss analysis of the HPTDL-AIC technique on the test MSCOCO 2014 dataset is offered in Figure 16. The figure portrays that the HPTDL-AIC technique gains reduced training and validation losses. It is noticeable that the HPTDL-AIC approach is capable of accomplishing decreased validation loss over training loss. From the analysis of the abovementioned results, it is apparent that the HPTDL-AIC technique can be employed as an efficient method for image captioning applications in real time.

Conclusions
In this study, a novel HPTDL-AIC technique has been developed to generate image captions automatically. The HPTDL-AIC technique intends to create correct descriptions for input images by the use of an encoder-decoder structure. In particular, the encoder unit includes the Faster SqueezeNet with RMSProp model for generating a one-dimensional vector representation of the input image. Then, the BSA with the LSTM model is utilized as a decoder to produce description sentences and decode the vector into a sentence. For examining the enhanced outcomes of the HPTDL-AIC technique, a series of simulations was performed on two benchmark datasets, and the extensive comparative study pointed out the improvement of the HPTDL-AIC technique over recent approaches. The experimental results state that the inclusion of the hyperparameter tuning process results in improved captioning performance compared to other methods. Therefore, the HPTDL-AIC technique can be utilized as an effective tool for image captioning in NLP tasks. In the future, hybrid DL models can be employed for language modeling to boost overall performance.

Data Availability Statement: Data sharing is not applicable to this article as no datasets were generated during the current study.