Hybrid of Deep Learning and Word Embedding in Generating Captions: Image-Captioning Solution for Geological Rock Images

Captioning is the process of assembling a description for an image. Previous research on captioning has usually focused on foreground objects. In captioning concepts, there are two main objects for discussion: the background object and the foreground object. In contrast to previous image-captioning research, generating captions from geological images of rocks focuses more on the background of the images. This study proposes an image-captioning model constructed from a convolutional neural network (CNN), long short-term memory (LSTM), and word2vec that generates words from an image and gives a dense output of 256 units. To make the result grammatical, the sequence of predicted words is reconstructed into a sentence by the beam search algorithm with K = 3. The pre-trained baseline model VGG16 and our proposed CNN-A, CNN-B, CNN-C, and CNN-D models were evaluated with N-gram BLEU scores. The BLEU scores achieved for BLEU-1 using these models were 0.5515, 0.6463, 0.7012, 0.7620, and 0.5620, respectively. BLEU-2 showed scores of 0.6048, 0.6507, 0.7083, 0.8756, and 0.6578, respectively. BLEU-3 showed scores of 0.6414, 0.6892, 0.7312, 0.8861, and 0.7307, respectively. Finally, BLEU-4 showed scores of 0.6526, 0.6504, 0.7345, 0.8250, and 0.7537, respectively. Our CNN-C model outperformed the other models, especially the baseline model. Furthermore, there remain several future challenges in studying captions, such as geological sentence structure, geological sentence phrasing, and constructing words with a geological tagger.


Introduction
Geological observation involves field research by a geologist. One of the tasks is to write about a rock's content and take a photo of the rock; each picture is paired with its description. This task requires an expert to write carefully and accurately about each image. Each description should contain the rock's characteristics, including its color, shape, and constituents. This process is essential because the information supports decisions on other activities, such as mining, land fertilization, field surveillance, and drilling. Sometimes, a geologist finds the same characteristics in rocks that correspond to previous descriptions and repeatedly writes the same descriptions. It is therefore interesting to adapt these tasks into research on how to generate descriptions for the content of a photo image. The geologist's experience can be viewed as the basis for creating a description for each rock image. In computer vision terms, these activities are defined as captioning. One challenge in captioning is how to pair the images with their descriptions. With captioning in place, we can predict and describe the content of other photos. Adopting a geologist's knowledge in an intelligent system for captioning is one aspect that can be explored. How to identify rocks and create descriptions of the content of rock images has therefore been proposed as a research topic.

With regard to the problem and the proposed model, we propose contributions to solve some problems relating to captioning. Our contributions can be presented as follows:
• Using geological field exploration to support captioning and building a model that produces a caption from an image. We collected that geological knowledge and used it to construct the algorithm and the architecture of the captioning model.

• Building a corpus for captioning that contains the pairwise images of rocks and their captions.
• Building a captioning model that can interpret images of rocks and achieve outcomes similar in accuracy to a geologist's annotation. Our models outperform the baseline model on the BLEU score and acquire captions that are similar to a geologist's annotations.
We arranged the sections of this paper as follows. The paper leads with an introduction, which covers the urgency of this type of research and the research problems. The methods section details the underlying theory and related research, such as on convolutional neural networks, long short-term memory, BLEU score measurement, and the estimation function. The subsequent section presents the proposed model for solving the current problems. The paper then conveys the outcomes of experiments using the proposed model, and the discussion section explains the results. The last section presents the conclusions.

Long Short-Term Memory (LSTM)
The LSTM language model can be used as part of the process of generating captions automatically. Another method of sentence representation is the RNN, which is simpler than the LSTM. Karpathy used the same number of dimensions as the length of the words in the image descriptions from the experts [2]. The RNN is a simple approach for producing the sentence representation of a detected object, but it does not consider the order or context of the information in a sentence; consequently, the resulting sequences are not grammatically arranged. To overcome this, Karpathy used bi-gram techniques, or two dependency relations, to generate a sentence with support from either the beam search algorithm or the greedy search algorithm [2].
Long short-term memory (LSTM) is a language model, the successor of the RNN, that enables long-term learning. The LSTM unit has an additional hidden state as a nonlinear mechanism that allows a state to perform a backpropagation process without modification, change, or reset. Learning in LSTM uses simple function gates and has been applied to speech recognition and language translation [19]. Figure 2 shows a simple form of the recurrent neural network (RNN) and the LSTM. The LSTM processes a sequence by repeating each computation it performs. In the LSTM, $\sigma(x) = (1 + e^{-x})^{-1}$ is a sigmoid function bounded between 0 and 1, whereas the nonlinear hyperbolic function $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$ has an output range between −1 and 1. The LSTM updates at time t for the inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ following the functions below:

$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i)$
$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)$
$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)$
$g_t = \tanh(W_g [x_t, h_{t-1}] + b_g)$
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$

where $i_t$ stands for the input gate at timestamp t, $f_t$ is the forget gate at timestamp t, $o_t$ stands for the output gate at timestamp t, $g_t$ is the hyperbolic-function candidate, $\odot$ is the elementwise operation, $c_t$ is the cell state, and $h_t$ is the output state.
The RNN updates its values at time t for the inputs $x_t$ and $h_{t-1}$ following the formulas below:

$h_t = g(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$
$z_t = W_{hz} h_t + b_z$

where g is an elementwise nonlinearity, such as a sigmoid function or hyperbolic tangent; $x_t$ is an input; $h_t \in \mathbb{R}^N$ is a hidden state with N hidden units; and $z_t$ is the output at time t. For an input sequence $\langle x_1, x_2, \ldots, x_T \rangle$ of length T, the updates are computed sequentially as $h_1, z_1, h_2, z_2, \ldots, h_T, z_T$ (with the initial state $h_0$ ignored, i.e., set to zero).
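For concreteness, below is a minimal NumPy sketch of one LSTM step and one RNN step following the equations above; the dimensions and random weight initialization are illustrative assumptions, not values from the study.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = (1 + e^{-x})^{-1}, bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update; W maps the concatenated [x_t, h_prev] to the 4 gates."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    N = h_prev.shape[0]
    i_t = sigmoid(z[0:N])            # input gate
    f_t = sigmoid(z[N:2*N])          # forget gate
    o_t = sigmoid(z[2*N:3*N])        # output gate
    g_t = np.tanh(z[3*N:4*N])        # hyperbolic-function candidate
    c_t = f_t * c_prev + i_t * g_t   # elementwise cell-state update
    h_t = o_t * np.tanh(c_t)         # output state
    return h_t, c_t

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hz, b_h, b_z):
    """One simple RNN update: hidden state, then output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    z_t = W_hz @ h_t + b_z
    return h_t, z_t

# Tiny usage example with arbitrary dimensions (D inputs, N hidden units).
rng = np.random.default_rng(0)
D, N = 100, 256
W = rng.normal(scale=0.1, size=(4 * N, D + N))
b = np.zeros(4 * N)
h, c = np.zeros(N), np.zeros(N)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape, c.shape)  # (256,) (256,)
```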

Part One Architectures
Our pipeline model was constructed and strengthened by: (1) image preprocessing, including resizing, reducing, and cropping; (2) ensuring that the image could be recognized without losing pixel information when using the reduce function in the CNN; and (3) finding a suitable CNN for our domain, considering the CNN layers, dropout, pooling, and dense units. The captioning architecture is essential for generating words that approximate the human description. Identifying the rocks in an image differs with regard to the colors in the image. Figure 3 depicts the geology caption model and divides it into parts: image extraction, word embedding, generating words, and assembling the captions. The image extraction is part one and the word embedding is part two. The outcome is 256 units for each learning stage, and we concatenated both outcomes, the image extraction and the LSTM unit. Furthermore, after compiling with an ADAM optimizer and a learning rate of 0.0001, we acquired 12,746,112, 2,397,504, 20,482,432, and 104,867,300 training parameters for CNN-A, CNN-B, CNN-C, and CNN-D, respectively. These parameter counts resulted from the reengineering of VGG16 and the word embedding [2].
As the first significant step, the captioning model always starts with a feature extraction model. An input image had a size of 224 × 224 with RGB color and three channels. We worked on the recognized image to convert it into an RGB value with three channels. After that, we extracted the image features using a CNN, as shown in Figure 3 [52]. The architecture classification is the classification model that identifies every single rock [53]. The output of the convolution process was a dense layer with 4096 units. The convolution utilized the dropout function, with a rate of 0.5, to avoid overfitting. Every value in a feature map was scaled up using the formula x/(1 − 0.5), where x is the single feature map value.
After the dropout function, the process continued by leveraging the linear activation of ReLU, giving an outcome of 256 units [17]. The ReLU function computes the maximum of the vector feature value and 0, i.e., max(x, 0): it returns 0 if the feature value is smaller than 0 and otherwise the original value, x. The outcome of part one was thus a sequence of 256 neuron units.
The pipeline process of part one can be written as the following pseudocode:
1. Input images (i1, i2, i3, ..., in) - n stands for the number of collected images, i;
2. Reduce image - a function that reduces each image to a size of 224 × 224;
3. For I ∈ (i1, i2, i3, ..., in) - iterating over the collection of images:
   a. Image_feature = CNN(I, C, F) - I is the image with a size of 224 × 224; C represents the channels of the convolutions; F is the filter matrix size, which can be 3 × 3 or 5 × 5;
   b. Image_vector = ReLU(Dense(Dropout(Dense(Image_feature, 4096), 0.5), 256)) - the 4096-unit dense layer, dropout at 0.5, and ReLU activation described above produce the 256-unit image vector.
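Below is a minimal Keras sketch of this part-one pipeline (convolutions on a 224 × 224 RGB input, a 4096-unit dense layer, dropout at 0.5, and a 256-unit ReLU output). The number of convolution blocks and their channel counts are illustrative assumptions, since the exact CNN-A to CNN-D layouts are given in Figure 5, and `build_part_one` is a hypothetical name.

```python
from tensorflow.keras import layers, models

def build_part_one(kernel=5):
    # 224 x 224 RGB input, as produced by the reduce step.
    inp = layers.Input(shape=(224, 224, 3))
    x = inp
    # CNN(I, C, F): C channels per block, F x F filters (3 x 3 or 5 x 5).
    for channels in (32, 64, 128, 128):
        x = layers.Conv2D(channels, kernel, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)  # pooling shrinks H x W to limit overfitting
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)   # dense output of the convolutions
    x = layers.Dropout(0.5)(x)                     # kept values scaled by x/(1 - 0.5)
    out = layers.Dense(256, activation="relu")(x)  # 256-unit image feature vector
    return models.Model(inp, out, name="part_one_cnn")

model = build_part_one()
model.summary()
```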

Part Two Architectures
In Figure 3, part two creates a vector feature from a geologist's description. Word2vec is an important embedding model that provides a vector of values for each vocabulary unit. Word2vec produced a 306 × 100-dimension matrix for the word embedding [54]. After the embedding process, the vector word embedding was transformed into a feature vector for the 22-word captions, with 22 × 100 dimensions. The LSTM used those vectors to generate a word that matched the image feature. Furthermore, the process continued to the dense layer to obtain the max value via the ReLU function. The last process in part two carried an output of 256 units. This study used the SoftMax function, Equation (14), to acquire a probability value and select the highest probability as the proper word. Equation (14) defines the SoftMax function $\sigma : \mathbb{R}^K \to [0, 1]^K$ as follows:

$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$

where $i = 1, \ldots, K$ and $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$; $\sigma(z)_i$ is the probability value for the unit at index i; and $e^{z_i}$ is the exponential of the vector component z at index i. We propose the following process flow for part two of our architecture, written as the following schema:
1. C = Input(caption) - input the corpus from the geologist's annotations;
2. X = 'start_seq' - initialization of the word sequence;
3. U = unique_word(C) - building a vocabulary of unique words from the corpus, C;
4. C_Index = Making_Index(C) - providing an index for each word of the vocabulary;
5. E = Word2Vec(U) - mapping each vocabulary word to its 100-dimension embedding vector.
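Below is a minimal Keras sketch of this part-two branch, assuming the stated sizes (a 306-word vocabulary, 100-dimension vectors, 22-word captions) and a placeholder `w2v_matrix` standing in for the trained word2vec vectors.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB, DIM, MAXLEN = 306, 100, 22        # 306 x 100 word2vec matrix, 22-word captions
w2v_matrix = np.random.rand(VOCAB, DIM)  # stand-in for the trained word2vec vectors

cap_in = layers.Input(shape=(MAXLEN,), name="caption_indices")
emb = layers.Embedding(VOCAB, DIM, name="word2vec_embedding")
x = emb(cap_in)
x = layers.LSTM(256)(x)                            # LSTM generates the text feature
cap_out = layers.Dense(256, activation="relu")(x)  # 256-unit part-two output
part_two = models.Model(cap_in, cap_out, name="part_two_text")
emb.set_weights([w2v_matrix])  # load the pretrained word2vec vectors into the layer
part_two.summary()
```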
After completing the processing of both parts, the process continues by adding both units into one pairwise vector feature. In Figure 4, there is an operation ADD(X1, X2): both units, flattened from the image and text extraction, are combined to create one flattened vector feature. After that, the last process applies the SoftMax function to acquire the word classification. The word classification process aligns the predicted word with the image feature.
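Below is a minimal Keras sketch of this merge step, assuming the 256-unit outputs of parts one and two; the add operation and SoftMax classifier follow the ADD(X1, X2) description, and the ADAM learning rate of 0.0001 is taken from the training setup above.

```python
from tensorflow.keras import layers, models, optimizers

VOCAB = 306  # vocabulary size from the word2vec embedding

img_feat = layers.Input(shape=(256,), name="part_one_image_feature")
txt_feat = layers.Input(shape=(256,), name="part_two_text_feature")
merged = layers.add([img_feat, txt_feat])  # ADD(X1, X2): one pairwise feature vector
word_probs = layers.Dense(VOCAB, activation="softmax")(merged)  # word classification
decoder = models.Model([img_feat, txt_feat], word_probs)
# ADAM optimizer with the 0.0001 learning rate used during training.
decoder.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                loss="categorical_crossentropy", metrics=["accuracy"])
decoder.summary()
```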

Word Embedding
To provide vector value embedding, we used Word2Vec as a word embedding model that provides a map of values for the words. Word2vec was introduced by Tomas Mikolov and consists of two processes: continuous bag of words (CBOW) and continuous skip-gram. Each process has a unique task in handling a word. CBOW acts as a neural network process that gives a probability value and selects the higher probability as a candidate value. On the other hand, the continuous skip-gram process takes the current word as the input and tries to accurately predict the words before and after this current word [54]. This study used 100 dimensions for each word, and there was a 306 × 100 vector space for all words.
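A minimal sketch of training such an embedding with gensim under the settings stated above (100 dimensions) follows; the two tokenized captions are illustrative, and gensim is an assumption rather than a library named by the study. Setting sg=0 selects CBOW, while sg=1 selects the skip-gram variant.

```python
from gensim.models import Word2Vec

# Tokenized geologist captions (illustrative sentences, not the actual corpus).
corpus = [
    "singkapan batugamping klastik dengan lensa rijang".split(),
    "batuan sedimen klastik dengan bidang perlapisan".split(),
]
# sg=0 trains CBOW; sg=1 trains the continuous skip-gram variant.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)
vec = model.wv["klastik"]  # one 100-dimension vector per vocabulary word
print(vec.shape)           # (100,)
print(model.wv.most_similar("klastik", topn=3))
```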

Results
This section reports the experiments' achievements and is separated into two subsections: dataset and experiments. We used Google Colaboratory Pro, Python version 3.6, TensorFlow, a GPU, and 25 GB of RAM for the experiments. The GPU provided by default in Google Colaboratory is an NVIDIA T4 or P100.

Dataset
We started by collecting the images and carrying out the data preprocessing. Following the proposed pipeline model in Figure 3, the first step was to reduce each image to 224 × 224 pixels with RGB color. As input to the CNN, all images were set to 32 channels in the first convolution process.
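A minimal sketch of this reduction step using Pillow follows; `reduce_image` is a hypothetical helper, and the scaling of pixels to [0, 1] is a common convention assumed here rather than stated in the paper.

```python
from PIL import Image
import numpy as np

def reduce_image(path):
    """Reduce an image to 224 x 224 pixels with RGB color, per the pipeline."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    # Scaling pixel values to [0, 1] is an assumed, common convention.
    return np.asarray(img, dtype=np.float32) / 255.0
```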
We collected 1297 images of geological rocks and divided them into two datasets: a training dataset of 1001 images and a validation dataset of 296 images. In addition, we added five captions for each image, acquired from a geologist, as the geological corpus. The captions were completed by the geologist following the guidelines for writing lithology descriptions and arranged into the names of the rock, its color, and the dominant rocks [55]. Figure 4 shows examples of images from the dataset and their captions. The captions were written in the Indonesian language, and we translated the text to make the captions clear. This study focused on developing a model in the Indonesian language because of its usefulness for further study; the translation is included to make the paper easy to read.

Experiments
The proposed models were assembled by reengineering VGG16 into our own models, called CNN-n. The name CNN-n is based on the experiments, which consisted of several layers, or shallow learning. Our model was divided into the CNN-A, CNN-B, CNN-C, and CNN-D sub-models. The CNN-n models were introduced from image classification results for rock types [56]. CNN-D, using conv(32,5) and conv(64,5), produced a larger number of output parameters than the CNN-A, CNN-B, and CNN-C models.
Simonyan and Zisserman stated that the filter plays an important role in extracting an image [57]. Filters are square matrices with odd dimensions, such as 1 × 1, 3 × 3, 5 × 5, and 7 × 7 [57]. In several studies on image extraction, scholars used many 3 × 3 filters. This particular filter proved effective in extracting the image and providing a feature map for recognizing the image [53,58]. Figure 5 shows several architectures from our re-engineered CNN models. We obtained outputs for VGG16, CNN-A, CNN-B, CNN-C, and CNN-D of 134,260,544, 12,746,112, 2,397,504, 20,482,432, and 104,867,300 training parameters, respectively.
We applied duplicate-operation pruning to each model at the last layer of unit classification. These actions were taken because we needed the weights of the units for the subsequent process. The process continued with a concatenation operation between the FC units from the CNN and the LSTM units to obtain a predicted word. This model follows the likelihood function $P(S \mid I) = \prod_{t=1}^{|S|} P(w_t \mid w_1, w_2, \ldots, w_{t-1}, I)$, where $w_1, \ldots, w_{t-1}$ stands for the previous words, I is an image region, and $w_t$ is the predicted word. Figure 6 shows the accuracy curve for each CNN model and depicts the comparison between loss and accuracy. We observed that the accuracy increased at 80 epochs. Our CNN-A, CNN-B, CNN-C, CNN-D, and VGG16 architectures had accuracies of 0.9137, 0.9148, 0.9178, 0.9206, and 0.9228, respectively. Figure 6 also presents how much the experiments were influenced by the number of CNN layers and parameters in the domain under study. The curves depend on the receptive field and channel settings.
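As a worked illustration of this likelihood, the sketch below scores a generated caption from per-step SoftMax probabilities in NumPy; the probability values are invented for the example and do not come from the study.

```python
import numpy as np

# P(w_t | w_1..w_{t-1}, I): per-step probabilities of the chosen words (illustrative).
step_probs = np.array([0.91, 0.85, 0.78, 0.88, 0.95])

likelihood = np.prod(step_probs)           # P(S | I) = prod_t P(w_t | w_{1:t-1}, I)
neg_log_lik = -np.sum(np.log(step_probs))  # the loss minimized during training
print(f"P(S|I) = {likelihood:.4f}, loss = {neg_log_lik:.4f}")
```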
Here is an illustration of measuring a caption. For instance, we acquired a caption such as "singkapan batugamping klastik dengan berukuran butir lempungan dengan lensa rijang" (an outcrop of clastic limestone with clay-sized grains and chert lenses). We calculated the BLEU score with the following algorithm [60]:
Calculate the variables "count" and "clip_count" from the reference tokens and the candidate tokens; see Figure 7 and the worked example below.

Word      Candidate  Ref-1  Ref-2  Ref-3  Ref-4  Ref-5  Max Ref  Count  Clip-Count
batuan    0          1      1      1      1      1      1        0      0
sedimen   0          1      1      1      1      1      1        0      0
klastik   1          1      1      1      1      1      1        1      1
dengan    2          1      1      0      0      1      1        2      1
bidang    0          1      …      …      …      …      …        …      …

Figure 8 shows an instance of the captioning from our training models. We validated the training model using the validation dataset, as shown in Figure 8. Different layers, filters, and training parameters caused different results. Varying the parameters supports the creation of an accurately predicted word that aligns with the image area [2]. This is a point of discussion because, in a CNN operation, every mapping process results in a new block value that is smaller than before. Reducing the H × W × D of the CNN block avoided overfitting [5].
We compared the generated captions with ground-truth captions.
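A minimal sketch of this comparison with NLTK's sentence-level BLEU, which performs the count/clip-count computation above internally, follows; the tokenized candidate and the two shortened references are illustrative, and NLTK is an assumption rather than the tool named by the study.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = ("singkapan batugamping klastik dengan berukuran "
             "butir lempungan dengan lensa rijang").split()
# Each image has five geologist references in our corpus (two shown, shortened).
references = [
    "singkapan batugamping klastik dengan lensa rijang".split(),
    "batuan sedimen klastik dengan bidang perlapisan".split(),
]
smooth = SmoothingFunction().method1
# Cumulative weights select BLEU-1 through BLEU-4.
for n, weights in enumerate([(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                             (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)], start=1):
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```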

Discussion
Several points merit discussion. The objective was to find the best captions by obtaining a high BLEU score. We followed the proposed method as a strategy for observing objects in the background [60]. The experiments produced different outcomes. A change in the layers can cause various outcomes, such as in the resulting caption, training parameters, and training time. The convolution process and the filter size can result in different feature maps. The number of channels in the CNN influences the feature maps' colors, shapes, and sizes. We set a significant number of channels, which provided opportunities for various gradations of color. In rock images, observing the color is necessary to distinguish the content of the rocks; even if two rocks have a similar color, their content is probably different [53].
VGG16 successfully identified the object and predicted the classification [63,64,65]. Meanwhile, our models successfully classified the rock object within an efficient time, particularly CNN-B and CNN-D [66]. Table 1 shows that VGG16 is suitable for our rock images, but the model has a longer processing time than the CNN-B and CNN-D models. Table 3 compares the accuracy and loss of the models. The results showed that VGG16 achieved the best accuracy and that the layers and structure processes applied in the CNN were robust in differentiating color. The theoretical concept of VGG16 is to extract the image through deeper layers and provide outcomes with various features [65]. In image recognition, detecting similar features is essential to achieving the best accuracy. In our problem domain, however, when the image features are concatenated with the text features, many pairing mistakes can occur and cause a wrong predicted word. A large number of distinct feature maps cannot guarantee that the captions will be optimized. The spread of various colors causes a bias when the object area is paired with the sentence, and the many features created cause probable bias in predicting words. We used a limited set of objects in our domain, which is why pairwise testing of the sequence of image descriptors against the text feature vectors is important.
LSTM, as a word generator, was chosen as the language model because of its ability to separate outcomes into three gates. The LSTM generates words based on previous words by relating them to the corpus and selecting the word with the best probability from the many produced words. The previous word makes the LSTM machine more powerful in predicting the next word [67]. This study found erroneous captions from the VGG16 + word2vec + LSTM model when predicting a word. Simonyan and Zisserman's models have successfully generated a word when the object is a person, an animal, a car, etc. [57]. In contrast, the models make errors in predicting a word when the object appears in the background.
The likelihood function $\log P(w_t \mid w_{1:t-1}, I; \theta)$ is important for making a correctly predicted word when generating a word. Re-engineering the log-likelihood function will result in different captions. We acquired a loss value of $0 \le -\sum_{(I,S) \in X} \sum_{t=1}^{|S|} \log P(w_t \mid w_{1:t-1}, I; \theta) \le 0.1358$, which means that the function succeeded in maximizing the detection of images and predicting the words by minimizing the residual loss values [68]. Figure 9 shows an error caption and compares VGG16 + Word2Vec + LSTM to our models [69]. This study successfully presented an error result from the baseline. It proves that many CNN layers do not always acquire the best outcome; indeed, the result is sometimes a mistake. The image in Figure 9 is of sandstone; nevertheless, the baseline presented a mistaken caption of chert. The many parameters produced features that made it difficult for VGG16 to pair the words with the feature maps. Geological rock images always show a variety of colors within the rocks, and color differentiation is necessary to identify the rocks separately.
This study shows a need to recognize and analyze the image relating to captions by using image preprocessing, such as reducing the image size. In the caption, it is necessary to pay attention to the text marks when reading. Sometimes, captions use the "-" sign to form adjectives or derivatives of rock types. In text preprocessing, the text is cleaned by removing stop words, signs, numbers, and symbols, as sketched below. Figure 3 shows the word vectorizing using Word2Vec as a text feature. The difference between a one-hot vector and word2vec lies in the matrix values. A one-hot vector consists of 0 or 1 values and has the same length as the defined word length [28,31]. In contrast, word2vec creates decimal values with the defined length and dimensions [25,69].
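A minimal sketch of the text cleaning described above follows; the stop-word list is illustrative, and `clean_caption` is a hypothetical helper. The "-" sign is kept so that adjectives and derivatives of rock types survive cleaning.

```python
import re

STOP_WORDS = {"yang", "di", "ke"}  # illustrative Indonesian stop words

def clean_caption(text):
    """Remove stop words, signs, numbers, and symbols, keeping '-' compounds."""
    text = text.lower()
    # Keep letters, spaces, and the "-" sign used for rock-type derivatives.
    text = re.sub(r"[^a-z\- ]+", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_caption("Singkapan batu-gamping 12, dengan lensa rijang!"))
# -> "singkapan batu-gamping dengan lensa rijang"
```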
Regarding the process, the annotation of the rocks was accompanied by their properties, such as carbonate mudstone and clay sandstone [72]. The classification and interpretation of images created a caption based on feature maps and the text feature. This scheme is an essential part of captioning because this research's target was a caption similar to the geologist's description [53].

Conclusions
Our models outperform the architecture of the baseline model. A CNN(32,5), with a 5 × 5 filter and 32 channels, produced a meaningful caption. The metric of the model was directed more toward the precision of the caption than accuracy; accuracy is only needed to measure the image classification and how closely the extracted feature maps compare with the actual feature maps. The experiments proved that shallow layers effectively solved our domain problem. The baseline VGG16 and our proposed CNN-A, CNN-B, CNN-C, and CNN-D models were evaluated with N-gram BLEU scores. The BLEU scores achieved for B@1 were 0.5515, 0.6463, 0.7012, 0.7620, and 0.5620, respectively. B@2 showed scores of 0.6048, 0.6507, 0.7083, 0.8756, and 0.6578, respectively. B@3 had scores of 0.6414, 0.6892, 0.7312, 0.8861, and 0.7307, respectively. Finally, B@4 showed scores of 0.6526, 0.6504, 0.7345, 0.8250, and 0.7537, respectively. The CNN-C architecture encouraged our model to produce a high B@4 score of 0.8250, but it had a time deficiency. The BLEU score measurement depended on precision and the word embedding. The combination of the CNN and word2vec embedding increased the speed and produced precise words. Constructing the caption with beam search supported the creation of proper sentences for the caption. There are several considerations for building the architectures, such as the optimum number of layers, a precise ReLU function, a suitable SoftMax function, and an ADAM optimizer tuned to acquire good results. On the other hand, the accuracy score is used to measure how precisely the image detection matches the image references. The metrics used for measuring the success of image detection are similar for captioning.
Some images could not be adequately classified by VGG16, thus resulting in a low BLEU score. However, it is understandable to assume that VGG16 is accurate because it has deeper layers than our models. The impact of this is that many feature maps eventually become biased when pairing the image feature extraction with the word embedding. Image feature extraction is decisive for feature identification. The number of CNN layers can convey the success of feature extraction, and many receptive field channels supply more space for feature assortments [70,71]. This can help to recognize objects in an image to a certain degree. Meanwhile, the pooling layer also helps the feature map values avoid overfitting.
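A compact sketch of the beam search used to construct the captions (K = 3, as stated in the abstract) follows; `next_word_probs` is a hypothetical function returning the merged model's SoftMax distribution for a prefix, so this is an illustration rather than the study's exact implementation.

```python
import numpy as np

K = 3  # beam width used in this study

def beam_search(image_vec, next_word_probs, vocab, max_len=22):
    """Assemble a caption by keeping the K best partial sentences each step."""
    beams = [(["start_seq"], 0.0)]  # (word sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == "end_seq":      # finished captions are carried forward
                candidates.append((seq, logp))
                continue
            probs = next_word_probs(image_vec, seq)  # SoftMax over the vocabulary
            for idx in np.argsort(probs)[-K:]:       # expand the K best next words
                candidates.append((seq + [vocab[idx]], logp + np.log(probs[idx])))
        # Keep only the K highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beams[0][0]
```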
Relating to the results, we identified several challenges for future research. This study addressed not just layers, filters, strides, and pooling methods but also language generators. Language models, such as structured language and paraphrasing models, are important subjects of research. Assembling captions based on geological sentence arrangement and geological sentence representations, and assembling words by geological sentence tagging, are challenging topics. In all captioning models, the target is word precision, which is an indicator of success when generating a caption, measured with a high BLEU score or another language metric.