Article

Image-Caption Model Based on Fusion Feature

College of Electronic and Information Engineering, Liaoning University of Technology, Jinzhou 121000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(19), 9861; https://doi.org/10.3390/app12199861
Submission received: 8 August 2022 / Revised: 24 September 2022 / Accepted: 25 September 2022 / Published: 30 September 2022
(This article belongs to the Special Issue AI-Based Image Processing)

Abstract

The encoder–decoder framework is the main framework for image captioning. A convolutional neural network (CNN) is usually used to extract grid-level features of the image, and a graph convolutional neural network (GCN) is used to extract region-level features. Grid-level features are poor in semantic information, such as the relationships and locations of objects, while region-level features lack fine-grained information about the image. To address this problem, this paper proposes a fusion-features-based image-captioning model, which consists of a fusion-feature encoder and an LSTM decoder. The fusion-feature encoder is divided into a grid-level feature encoder and a region-level feature encoder. The grid-level feature encoder is a convolutional neural network with embedded squeeze-and-excitation operations so that the model can focus on features that are highly correlated with the caption. The region-level encoder employs node-embedding matrices to enable the model to understand different node types and gain richer semantics. The two kinds of features are then weighted and fused by an attention mechanism to guide the LSTM decoder in generating the image caption. Our model was trained and tested on the MS COCO2014 dataset, achieving BLEU-4 and CIDEr scores of 0.399 and 1.311, respectively. The experimental results indicate that the model can describe images in detail.

1. Introduction

With the rapid development of multimedia technology and computer networks, the multi-modal character of data has become more prominent, and the amount of data has exploded. There is a growing demand for multi-modal data processing in academia and business circles.
Farhadi et al. [1] proposed the task of image captioning: converting an image into text through a model, formalized as a pair (I, S), where the model maps the image modality I (Image) to the text modality S (Sentence). Image captioning is a cross-modal task between Computer Vision (CV) and Natural Language Processing (NLP). The task is straightforward for humans with ordinary life experience, but it poses a considerable challenge for computers, which must not only understand the content of the image but also generate sentences that follow human language habits. Image captioning is important in image understanding, human–computer interaction, assisting visually impaired people, and journalism. Because of its importance and difficulty, it has become a hotspot in artificial intelligence.
In the early stages of research on this task, most researchers used machine-learning methods that required considerable human effort to annotate datasets, extract features manually, and evaluate results, leading to slow progress, poor model performance, and low caption quality. With the rise of deep learning, the encoder–decoder approach has become mainstream, and the task has progressed significantly. This method employs a CNN as the encoder to extract image features and then feeds them to a Long Short-Term Memory network (LSTM), which decodes the features into a caption. This approach avoids wasting resources and produces more flexible, higher-quality captions. Recently, many works have used a GCN as the encoder to improve model performance. The difference between CNN and GCN is that the CNN extracts grid-level features of the image. Grid-level features contain contextual and fine-grained information about the image but lack semantic information such as the interrelationships and spatial relationships among multiple objects. The GCN extracts region-level features of the image and relies on target-detection techniques to accurately capture semantic information such as targets, target attributes, and the interrelationships between targets. However, it is object-oriented and ignores the fine-grained and contextual information of the rest of the image.
To solve this problem, this paper proposes an image-captioning model based on fusion features. The model employs a fused-feature approach in which an attention mechanism dynamically captures grid-level and region-level features related to the caption, and the two are weighted and fused. The fused features better capture the semantic relationships between objects while retaining a certain degree of fine-grained and contextual information. Our captions are richer in information than traditional captions and can more fully express the relationships between objects in the images. The main contributions of this paper are as follows:
(1)
The grid-level features of an image are extracted by a ResNet-101 with squeeze-and-excitation operations, and the GCN extracts the region-level features. The feature-fusion module weights and fuses the two kinds of features with a joint attention mechanism so that the model learns image features and semantic information better.
(2)
The fusion-feature vectors are processed by a two-layer LSTM with an added attention mechanism, which obtains the contextual information of the caption and dramatically improves caption quality.
(3)
The algorithm was trained and tested on the MS COCO dataset, and the experimental results show that it achieves excellent performance.
The remaining part of the article is structured as follows. Section 2 describes related work, Section 3 details the model and method, Section 4 presents the experimental results, and Section 5 concludes the paper.

2. Related Work

2.1. Image Captioning

The methods of image captioning fall into three basic approaches. (1) The first is the template-based approach [1,2], proposed by Farhadi et al. in 2010. Following a syntactic specification, this method sets a sentence template and the triplet “object, action, scene”. The model detects all possible values of the scene, objects, object attributes, and actions in the image via target detection; it then uses the Conditional Random Field (CRF) algorithm to predict the correct triplet to fill into the template, forming the basic structure of the caption, and finally uses related algorithms to fill in the rest of the template to create the image caption. The method requires manually designed syntactic templates and relies on hard-decoded visual concepts; the captions it generates have a single grammatical form and limited variety. (2) The second is the retrieval-based approach of Kuznetsova et al. [3,4]. Given an input image, the method is driven by the whole database: it retrieves similar images, synthesizes phrases describing the image, and selectively combines them into an image caption. The caption quality depends on the similarity between the input image and the images in the database, and the semantic correctness of the caption is hard to guarantee. (3) The third is the encode–decode-based approach [5,6,7,8,9,10,11], proposed by Vinyals et al. in 2015. The model uses a CNN and a Long Short-Term Memory network (LSTM) as the encoder and decoder, respectively. Given an input image, it extracts image features with the CNN, initializes the decoder with these features, generates words sequentially, and combines them into an image caption. These captions are semantically rich, syntactically flexible, and aligned with human language logic. The encode–decode-based approach is the current mainstream image-captioning method.

2.2. Grid-Level Feature

A CNN, as the encoder, extracts grid-level features of images. To improve the image-caption model, research has focused on the attention mechanism. Xu et al. [7] first applied the attention mechanism to the image-captioning field. They proposed an attention-based image-caption model with two variants, “soft attention” and “hard attention”. Soft attention assigns weights between 0 and 1 to all grids at each decoding step and is trained with backpropagation. Hard attention focuses on only one image grid at each decoding step; it uses one-hot coding, which takes less time but is not differentiable, so the gradient is usually estimated with Monte Carlo sampling before backpropagation. The attention mechanism commonly used in image captioning is mainly the “soft” variant. Attention-based models refine the CNN-extracted representation from a global feature to grid features. Before generating a word, the model computes the correlation between the word to be generated and each grid in the image via the attention mechanism, selects the grid features with high correlation, and passes them to the decoder to guide caption generation. The introduction of the attention mechanism gives the image-caption model the ability to focus on key grids. However, this mechanism forces every word to correspond to an image grid; function words such as “of” and “the” are also forced to correspond to grids, wasting computation. Therefore, Lu et al. [8] proposed an adaptive attention mechanism that introduces a “visual sentinel” vector, which indicates the relevance of the generated word to the visual information on a scale from 0 to 1. When a word is directly related to the image, the value is 1 and the model focuses on the image grids to generate the word; words such as “of” and “the” are inferred directly by the language model. In recent years, various attention mechanisms have been applied to grid-level features to continuously improve model performance.
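To make the soft-attention computation concrete, the following PyTorch sketch shows one common formulation: an additive scoring MLP over grid features conditioned on the decoder's hidden state. The dimensions and the scoring network are illustrative assumptions, not the exact configuration of any model cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Minimal additive ("soft") attention over grid features.

    Hypothetical sketch: feature_dim, hidden_dim and the scoring MLP are
    illustrative choices, not the exact configuration used in the paper.
    """

    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, grid_feats: torch.Tensor, hidden: torch.Tensor):
        # grid_feats: (batch, num_grids, feature_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(grid_feats)
                                  + self.hidden_proj(hidden).unsqueeze(1)))  # (batch, num_grids, 1)
        alpha = F.softmax(e, dim=1)                  # weights in [0, 1], summing to 1 over grids
        context = (alpha * grid_feats).sum(dim=1)    # weighted sum of grid features
        return context, alpha.squeeze(-1)
```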

2.3. Region-Level Feature

After the encoder–decoder model was proposed, research on extracting high-level image semantics progressed slowly due to technical limitations. Kipf et al. [12] then proposed the GCN, which shows high performance in extracting features from non-Euclidean data (such as graph structures) and remains effective even when untrained [13]. Graph structures have an advantage over other data structures in representing semantic information: they can express the high-level semantics of an image, such as the objects it contains, their attributes, and the interrelationships between objects.
Yao et al. [14] proposed a GCN–LSTM architecture for image captioning, the first application of GCNs to the image-caption-generation field. The model relies on target-detection techniques such as Faster-RCNN: it first detects the objects, object attributes, and relationships between objects in the image and then constructs a graph structure. A GCN is used to extract features of the graph structure to guide the LSTM in generating the caption. Graph convolutional neural networks are now widely used in image captioning. It is worth mentioning that, with their advent, image features have shifted from directly extracted grid-level features to region-level features obtained by first applying target-detection techniques.
Yao et al. [13] proposed the Hierarchy Parsing (HIP) architecture. It combines Faster-RCNN and Mask-RCNN for region-level and instance-level segmentation of images and organizes the image into a tree structure, where I represents the image, R represents region-level objects, and M represents instance-level objects, with the edges representing the relationships in the tree. A GCN then extracts features of the tree structure, and these features are passed to a top-down attention mechanism to compute the most relevant objects. The model parses the image hierarchically to extract semantics at three levels; the acquired semantics are richer, and the model generalizes better than regular models. However, the tree structure has limitations in expressing the complex relationships between objects in an image. Therefore, Shi et al. [15] proposed the Caption-Guided Visual Relationship Graph (CGVRG) framework. The framework first extracts objects from the image with Faster-RCNN and relational triplets from the caption with a text scene-graph parser, then constructs the CGVRG by aligning objects and predicates through weakly supervised learning. The CGVRG is fed into a GCN, which extracts its features and context vectors; these are used to guide the decoder to generate captions, giving the model better semantic information than regular models.

3. Models and Methods

The image-caption model based on fusion features, shown in Figure 1, is designed to generate fluent and semantically informative sentences for a given image I. The model adopts an encoder–decoder structure consisting of a fusion-feature encoder and a decoder, described in Section 3.1 and Section 3.2, respectively.

3.1. Encoder

The fused-feature encoder proposed in this paper consists of three parts: squeeze and excitation module (SE), node embedding module (NE), and feature-fusion module based on the attention mechanism. The SE module can extract the image’s grid-level features, and the NE module extracts region-level features of the image. Then they are weighted and fused by attention mechanisms so that the model obtains more fine-grained and richer semantic information.

3.1.1. Squeeze and Excitation Module (SE)

The framework of the SE module is shown in Figure 2. The encoder dynamically captures the dependency of each channel's feature map and selects highly dependent feature maps to guide the decoder.
Inspired by Hu et al. [16], we embed the SE module in front of the convolutional operation of ResNet-101. The feature of the l-th convolutional layer is expressed as $x^l = \{x_1^l, \dots, x_C^l\}$, where $x^l$ is composed of several feature maps $x_m^l$ of size $H \times W$, and C is the number of channels. The squeeze operation is calculated as follows:
$$z_m = F_{sq}(x_m^l) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_m^l(i, j)$$
First, all values of $x_m^l$ are summed and divided by its size, compressing the feature map into a single value $z_m$. Then $z_m$ and a dual activation function are used to calculate the weight of the m-th channel's feature map as follows:
$$s_m = F_{ex}(z_m, W) = \sigma\left(W_2\, \delta(W_1 z_m)\right) = \frac{1}{1 + e^{-W_2 \max(0,\, W_1 z_m)}}$$
where $s_m$ is the grid-level feature attention weight of $z_m$; $\delta$ is the ReLU function, which ensures a positive output; $W_1$ and $W_2$ are weight parameters to be learned; and $\sigma$ is the sigmoid activation function, which maps the attention weight into the range [0, 1]. The two fully connected layers enable the model to better capture the dependencies among feature maps.
Finally, the module multiplies the dependency of the feature map, $s_m$, with the feature values in the feature map $x_m^l$:
$$f_m = F_{scale}(x_m^l, s_m) = s_m \cdot x_m^l$$
where $f_m$ represents the m-th channel's feature map after the SE operation. After the SE operation, the feature of the l-th convolutional layer is expressed as $f^l = \{f_1^l, \dots, f_C^l\}$. We then apply the convolution operation to $f^l$ to obtain the feature maps $x^{l+1}$:
$$x^{l+1} = \mathrm{Residual}(f^l)$$
Thus, the grid-level feature is $F_g = \{s_1, \dots, s_C\}$, the output of the final layer.
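As a concrete illustration, a minimal squeeze-and-excitation block in PyTorch might look as follows. It mirrors the squeeze (global average pooling), excitation (two fully connected layers with ReLU and sigmoid), and scaling steps above; the reduction ratio of 16 is the common default from Hu et al. and an assumption here, not a value taken from this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block placed before a residual convolution.

    A minimal sketch following Hu et al.; the reduction ratio is an assumed default.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # F_sq: global average pooling
        self.excite = nn.Sequential(                    # F_ex: two FC layers (delta, then sigma)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)                  # z_m, one scalar per channel
        s = self.excite(z).view(b, c, 1, 1)             # s_m in (0, 1)
        return x * s                                    # F_scale: reweight each feature map
```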

3.1.2. Node Embedding (NE) Module

The NE module of the model is shown in Figure 3. It relies on the MS COCO dataset processed by target-detection techniques; the processed dataset includes the objects in each image, object attributes, the relationships between objects, and the corresponding box positions. The model uses this information to construct a directed graph structure and initially encodes the graph nodes as $X = \{x_1, \dots, x_V\}$, where $x_i$ is the visual feature of a node and V is the number of nodes.
Unlike a traditional GCN, which directly extracts the visual features of nodes, this module incorporates a node-embedding operation. By embedding different types of nodes, the model can distinguish node types and better capture the semantic information of each node and its neighbors. The node embedding is computed as follows:
$$x_i^0 = \begin{cases} v_i W_r[0], & \text{if } i \in o \\ v_i W_r[1] + pos_i, & \text{if } i \in a \\ v_i W_r[2], & \text{if } i \in r \end{cases}$$
where o, a, and r represent object, attribute, and relationship nodes, respectively; $W_r \in \mathbb{R}^{3 \times d}$ is the role-embedding matrix; d is the feature dimension; $W_r[k]$ denotes the k-th row of $W_r$; and $pos_i$ is the positional embedding that distinguishes the order of the different attribute nodes connected to the same object.
After node embedding, the model obtains the graph structure $G = (V, \varepsilon, R)$. The model then encodes the graph context in G by using the GCN as follows:
$$x_i^{l+1} = \sigma\left( W_0^l x_i^l + \sum_{\tilde{r} \in R} \sum_{j \in N_i^{\tilde{r}}} \frac{1}{\left| N_i^{\tilde{r}} \right|} W_{\tilde{r}}^l x_j^l \right)$$
where $N_i^{\tilde{r}}$ denotes the neighbors of the i-th node under the relation $\tilde{r} \in R$, $\sigma$ is the ReLU activation function, and $W^l$ is the parameter to be learned by the l-th GCN layer. A single layer brings each node context from its direct neighbors, while stacking multiple layers encodes a broader range of contexts in the graph. The model stacks L layers and uses the output of the final layer as the final node embeddings X. The region-level feature of the image, $F_r$, is obtained by averaging X.
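A minimal relational GCN layer corresponding to the update rule above could be sketched as follows. The per-relation adjacency tensor and the three relation types are illustrative assumptions about how the graph is represented.

```python
import torch
import torch.nn as nn

class RelationalGCNLayer(nn.Module):
    """One relational GCN layer in the spirit of the update rule above.

    Minimal sketch with an assumed dense adjacency per relation type; the
    normalization by neighbor count follows the text, but tensor shapes are
    illustrative choices for this example.
    """

    def __init__(self, dim: int, num_relations: int = 3):
        super().__init__()
        self.self_loop = nn.Linear(dim, dim, bias=False)                 # W_0
        self.rel = nn.ModuleList(nn.Linear(dim, dim, bias=False)         # W_r~ per relation
                                 for _ in range(num_relations))

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim); adj: (num_relations, num_nodes, num_nodes),
        # adj[r, i, j] = 1 if node j is a neighbor of node i under relation r.
        out = self.self_loop(x)
        for r, linear in enumerate(self.rel):
            deg = adj[r].sum(dim=1, keepdim=True).clamp(min=1)           # |N_i^r|
            out = out + (adj[r] / deg) @ linear(x)                       # mean over neighbors
        return torch.relu(out)
```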

3.1.3. Feature-Fusion Module

The feature-fusion module employs an attention mechanism to weight the fusion of the two levels of features. For the fusion method, we chose vector concatenation to avoid the fusion-noise problem that arises after fusing the two levels of features.
First, the model concatenates the two levels of features:
$$F = \mathrm{concat}(F_c, F_r)$$
The model computes the attention weights for the features with the following equation:
$$\beta = \sigma\left(W_2\, \delta(W_1 F)\right)$$
where $W_1$ and $W_2$ are learned parameters. Finally, the image features are multiplied by their attention weights:
$$F = \beta \cdot F$$
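The fusion module can be sketched as follows in PyTorch: the two feature vectors are concatenated, passed through two linear layers with ReLU and sigmoid to obtain the attention weights $\beta$, and rescaled element-wise. The hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Attention-weighted fusion of grid-level and region-level features.

    Minimal sketch of the fusion equations above; dimensions are assumptions.
    """

    def __init__(self, grid_dim: int, region_dim: int, hidden_dim: int = 512):
        super().__init__()
        fused_dim = grid_dim + region_dim
        self.attn = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.ReLU(inplace=True),          # delta
            nn.Linear(hidden_dim, fused_dim),
            nn.Sigmoid(),                   # sigma -> beta in (0, 1)
        )

    def forward(self, f_grid: torch.Tensor, f_region: torch.Tensor) -> torch.Tensor:
        f = torch.cat([f_grid, f_region], dim=-1)    # F = concat(F_c, F_r)
        beta = self.attn(f)                          # attention weights
        return beta * f                              # F = beta * F
```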

3.2. Decoder

The decoder adopts a two-layer LSTM with an attention mechanism. This increases the depth of the model, extracts deeper features, improves prediction accuracy, and adds an attention weight vector to the LSTM operation, which is calculated as follows:
$$h_t = \sigma_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = \sigma_y(W_{hy} h_t + b_y)$$
$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C)$$
where $x_t$ is the input vector of the LSTM and $h_{t-1}$ is the output vector of the LSTM at the previous time step. The LSTM module at each time step consists of an Att–LSTM and a Lan–LSTM. The input of the Att–LSTM at time step t is $x_t^1 = [h_{t-1}^2, W_{t-1}]$, where $h_{t-1}^2$ is the output of the Lan–LSTM at the previous time step and $W_{t-1}$ is the embedding vector of the word output at the previous time step; the two are concatenated to obtain $x_t^1$. Feeding $x_t^1$ into the Att–LSTM yields the output $h_t^a$. The input of the Lan–LSTM is $x_t^2 = [F, h_t^a]$, which consists of $h_t^a$ and the image features obtained above, and its output is $h_t^2$. Using the notation $y_{1:T}$ to denote the word sequence $(y_1, \dots, y_T)$, the probability distribution of the output word at time step t is given by the following equation:
$$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}(W_p h_t^2 + b_p)$$
where $W_p$ and $b_p$ are the learned weight and bias parameters, respectively.
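The two-layer decoding step can be sketched as follows. The Att–LSTM consumes the previous Lan–LSTM state and the previous word embedding, the Lan–LSTM consumes the fused image feature together with the Att–LSTM output, and a linear layer followed by softmax produces the word distribution. All dimensions are illustrative assumptions, and the attention over fused features is folded into a single fused vector here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerDecoder(nn.Module):
    """Sketch of one time step of the two-layer (Att-LSTM + Lan-LSTM) decoder."""

    def __init__(self, vocab_size: int, embed_dim: int, fused_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_lstm = nn.LSTMCell(hidden_dim + embed_dim, hidden_dim)  # input: [h^2_{t-1}, W_{t-1}]
        self.lan_lstm = nn.LSTMCell(fused_dim + hidden_dim, hidden_dim)  # input: [F, h^a_t]
        self.out = nn.Linear(hidden_dim, vocab_size)                     # W_p, b_p

    def step(self, word_prev, fused_feat, state_att, state_lan):
        # word_prev: (batch,) previous word ids; fused_feat: (batch, fused_dim)
        w_emb = self.embed(word_prev)
        x1 = torch.cat([state_lan[0], w_emb], dim=-1)        # x^1_t
        h_a, c_a = self.att_lstm(x1, state_att)              # Att-LSTM output h^a_t
        x2 = torch.cat([fused_feat, h_a], dim=-1)            # x^2_t
        h2, c2 = self.lan_lstm(x2, state_lan)                # Lan-LSTM output h^2_t
        logits = self.out(h2)                                # unnormalized p(y_t | y_{1:t-1})
        return F.log_softmax(logits, dim=-1), (h_a, c_a), (h2, c2)
```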

3.3. Dataset, Evaluation, and Loss Function

We performed experiments on the MS COCO2014 [17] dataset, which aims to advance the state of the art in object recognition by placing the object-recognition problem in the context of broader scene understanding and by collecting images containing common objects in natural environments. The images were described manually by professional annotators, with 5 or 15 reference descriptions per image, and the annotation set is typically saved in JSON format. The dataset has over 330,000 images and 200,000 annotated descriptions covering 91 object classes, with a total of 2.5 million labeled instances in 328,000 images, making it the largest semantic-segmentation dataset available.
BLEU (Bilingual Evaluation Understudy) [18], METEOR, ROUGE-L, and CIDEr [19] were used to evaluate the algorithm. The BLEU algorithm rates the difference between generated captions and manually annotated ones, scoring outputs between 0 and 1, and has become one of the most widely used metrics in image description. For each image, the caption generated by the model is evaluated against the set of five manually annotated reference captions:
$$CP_n(C, S) = \frac{\sum_i \sum_k \min\left(h_k(c_i),\, \max_j h_k(s_{ij})\right)}{\sum_i \sum_k h_k(c_i)}$$
$$b(C, S) = \begin{cases} 1, & \text{if } l_C > l_S \\ e^{1 - l_S / l_C}, & \text{if } l_C \le l_S \end{cases}$$
$$BLEU_N(C, S) = b(C, S) \exp\left( \sum_{n=1}^{N} \omega_n \log CP_n(C, S) \right)$$
where $h_k(s_{ij})$ and $h_k(c_i)$ denote the number of occurrences of the n-gram $\omega_k$ in the manual reference caption $s_{ij}$ and in the candidate caption $c_i$, respectively; $l_C$ is the length of the caption to be evaluated, and $l_S$ is the length of the manual reference caption. The higher the BLEU score, the better the performance.
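For illustration, a minimal sentence-level BLEU computation following the equations above might look as follows (uniform n-gram weights, clipped counts, and the brevity penalty). Production evaluations normally use the official coco-caption toolkit or NLTK implementations instead of a hand-rolled version like this one.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Minimal BLEU-N for a single candidate caption, sketching the formulas above.

    Clips each n-gram count by the maximum count over the references and applies
    the brevity penalty when the candidate is shorter than the closest reference.
    """
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        max_ref_counts = Counter()
        for ref in references:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty: compare against the reference length closest to the candidate.
    l_c = len(candidate)
    l_s = min((len(r) for r in references), key=lambda l: (abs(l - l_c), l))
    bp = 1.0 if l_c > l_s else math.exp(1 - l_s / max(l_c, 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

# Example: one tokenized candidate caption against two references.
print(bleu("a dog runs on the grass".split(),
           ["a dog is running on the grass".split(), "a dog runs across the grass".split()]))
```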
The METEOR metric takes recall into account and matches synonyms, word stems, and affixes; its results correlate better with manual evaluation. It is calculated as follows:
$$F_{means} = \frac{P R}{\alpha P + (1 - \alpha) R}$$
$$P = \frac{m}{c}$$
$$R = \frac{m}{r}$$
$$Pen = \frac{\#chunks}{m}$$
$$METEOR = (1 - Pen) \cdot F_{means}$$
where R is the recall; P is the precision; m is the total number of matched word pairs; c is the length of the candidate caption; r is the length of the reference caption; and Pen is a penalty factor that accounts for word order: matched words that are adjacent in both sentences are grouped into the same chunk, and #chunks is the total number of such chunks.
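Given precomputed alignment statistics, the simplified METEOR defined by these equations reduces to the short sketch below; real METEOR implementations additionally perform stemming and synonym matching and use a tuned α and a powered penalty, so this is only a direct transcription of the formulas in the text, with α = 0.9 as an assumed default.

```python
def meteor_score(m, c, r, num_chunks, alpha=0.9):
    """Simplified METEOR following the formulas above.

    m: matched unigram pairs, c/r: candidate/reference lengths,
    num_chunks: contiguous matched chunks; alpha is an assumed default.
    """
    if m == 0:
        return 0.0
    p, rec = m / c, m / r
    f_means = (p * rec) / (alpha * p + (1 - alpha) * rec)
    pen = num_chunks / m
    return (1 - pen) * f_means
```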
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics proposed by Lin, mainly including ROUGE-N, ROUGE-L, ROUGE-S, ROUGE-W, and ROUGE-SU; users can choose the appropriate metric according to their needs. ROUGE-L is generally used to evaluate model performance in the image-captioning field, and it addresses the fact that the BLEU metric ignores recall. Like BLEU, it measures caption quality by the overlap between generated and reference captions, but it adds a recall factor to the algorithm. Its calculation formula is as follows:
$$ROUGE\text{-}L = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$
$$R_{lcs} = \frac{LCS(X, Y)}{m}$$
$$P_{lcs} = \frac{LCS(X, Y)}{n}$$
where X denotes the candidate caption, Y denotes the reference caption, LCS(X, Y) denotes the length of the longest common subsequence of the candidate and reference captions, m denotes the length of the reference caption, and n denotes the length of the candidate caption.
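A direct implementation of ROUGE-L from these formulas only needs the longest-common-subsequence length; the sketch below uses β = 1.2, which is a commonly used value in captioning evaluation and an assumption here rather than a value taken from the paper.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L per the formulas above; beta is an assumed default."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference)   # R_lcs (m = reference length)
    p = lcs / len(candidate)   # P_lcs (n = candidate length)
    return ((1 + beta ** 2) * r * p) / (r + beta ** 2 * p)
```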
CIDEr is specially designed to evaluate the image-captioning model; it obtains the similarity between the captions to be evaluated and the reference captions by calculating the TF-IDF weights of each n-tuple to evaluate the effectiveness of the image captioning.
The number of occurrences of the n-tuple $\omega_k$ in the manual reference caption $s_{ij}$ and in the candidate caption $c_i$ is denoted as $h_k(s_{ij})$ and $h_k(c_i)$, respectively, and the TF-IDF weight $g_k(s_{ij})$ of the n-tuple is computed as follows:
$$g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})} \log\left( \frac{|I|}{\sum_{I_p \in I} \min\left(1, \sum_q h_k(s_{pq})\right)} \right)$$
$$CIDEr_n(c_i, S_i) = \frac{1}{m} \sum_j \frac{g^n(c_i) \cdot g^n(s_{ij})}{\left\| g^n(c_i) \right\| \left\| g^n(s_{ij}) \right\|}$$
$$CIDEr(c_i, S_i) = \sum_{n=1}^{N} \omega_n\, CIDEr_n(c_i, S_i)$$
The higher the CIDEr score, the better the quality of the generated caption.
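A simplified CIDEr computation is sketched below: TF-IDF vectors are built per n-gram order, cosine similarities against each reference are averaged, and the per-order scores are combined with uniform weights. The document-frequency table is assumed to be precomputed over the reference corpus, and the length penalty and count clipping of the CIDEr-D variant used in practice are omitted.

```python
import math
from collections import Counter

def tfidf_vector(caption, n, doc_freq, num_images):
    """TF-IDF weights g^n for one caption's n-grams, following the formula above.

    doc_freq maps an n-gram to the number of images whose references contain it.
    """
    counts = Counter(zip(*[caption[i:] for i in range(n)]))
    total = sum(counts.values())
    return {g: (c / total) * math.log(num_images / max(doc_freq.get(g, 1), 1))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider(candidate, references, doc_freq_by_n, num_images, max_n=4):
    """Simplified CIDEr with uniform weights over n-gram orders 1..max_n."""
    score = 0.0
    for n in range(1, max_n + 1):
        g_c = tfidf_vector(candidate, n, doc_freq_by_n[n], num_images)
        sims = [cosine(g_c, tfidf_vector(r, n, doc_freq_by_n[n], num_images))
                for r in references]
        score += (1.0 / max_n) * (sum(sims) / len(references))
    return score
```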
Given the ground-truth word sequence $y^*_{1:T}$ and the parameters $\theta$ obtained by training the model in this paper, a cross-entropy function is used to minimize the loss, calculated as follows:
$$L(\theta) = -\sum_{t=1}^{T} \log\left( p_\theta\left(y_t^* \mid y_{1:t-1}^*\right) \right)$$
where $y_t^*$ denotes the t-th word of the ground-truth description.
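With the decoder returning log-probabilities, the cross-entropy objective above is the sum (or mean) of negative log-likelihoods of the ground-truth words. A minimal PyTorch version, assuming a padding token id that is an illustrative detail rather than something specified in the paper, is:

```python
import torch
import torch.nn.functional as F

def caption_loss(log_probs: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy (negative log-likelihood) over a caption, matching L(theta) above.

    log_probs: (batch, T, vocab) log-softmax outputs of the decoder;
    targets: (batch, T) ground-truth word ids. pad_id is an assumed padding
    token that is ignored in the loss.
    """
    return F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                      targets.reshape(-1),
                      ignore_index=pad_id)
```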

4. Results

4.1. Experimental Environment

The experimental environment is based on Ubuntu 18.04, an Intel i9-9900K CPU, an NVIDIA GeForce RTX 3080 GPU, 12 GB RAM, and a PyTorch deep-learning environment with Python 3.8 and CUDA 11.1. When processing the annotation file, non-letter characters were deleted, the remaining characters were converted to lowercase, and all words appearing fewer than five times were replaced by the special token UNK, yielding a final corpus of 10,942 words for the MS COCO dataset. The maximum caption length is set to 16, at which the algorithm scores highest on the evaluation metrics. Dropout is used to prevent overfitting, with the parameter set to 0.5. The model is trained with the Adam optimizer, with a learning rate of 3 × 10−4, weight decay of 1 × 10−6, a batch size of 128, and 30 training epochs.
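For reference, the stated training configuration maps onto PyTorch roughly as follows; the stand-in module is only there so the snippet runs on its own and is not the actual fusion-feature model.

```python
import torch.nn as nn
from torch.optim import Adam

# Minimal sketch of the training configuration described above.
model = nn.LSTM(input_size=512, hidden_size=512)   # stand-in for the fusion-feature captioning model
optimizer = Adam(model.parameters(), lr=3e-4, weight_decay=1e-6)

config = {
    "epochs": 30,            # training rounds
    "batch_size": 128,
    "max_caption_len": 16,
    "dropout": 0.5,          # dropout probability used to prevent overfitting
    "vocab_size": 10942,     # words appearing at least five times; rare words mapped to UNK
}
```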

4.2. Experimental Results and Analysis

First, the validity of the model is verified. Six state-of-the-art image-captioning models from the last three years are selected for comparison, and the results are shown in Table 1. All scores are on the MS COCO dataset with five reference captions. (1) High-Level Attention is the image-caption-generation model proposed by Ding et al. [20]. It adopts a CNN–LSTM architecture with an embedded bottom-up attention mechanism; the model scores low-level features (contrast, sharpness, and clarity) and high-level features (face impact) of image regions and combines the scores to determine the regions to which attention should be directed. (2) CGVRG is an image-caption-generation model based on a caption-guided visual relationship graph proposed by Shi et al. [15]. It uses the features and context vectors of the graph structure to guide the decoder to generate captions, giving the model better semantic information. (3) M2-Transformer is a model proposed by Cornia et al. [21] that adopts the Transformer architecture. The framework learns a multilevel representation of the relationships between image regions and uses mesh-like connections to exploit low-level and high-level features in the decoding phase, reducing model complexity and mitigating the semantic gap. (4) RDN (Reflective Decoding Network) is a model proposed by Ke et al. [22] that embeds a CNN–LSTM architecture; the framework strengthens the decoder's long-sequence dependence and improves its long-sequence modeling capability. (5) Entangle-Transformer is a model proposed by Li et al. [23] that also adopts the Transformer [24,25] architecture and alleviates the semantic gap. (6) DLCT is a Transformer-based model proposed by Luo et al. [26] that uses two Transformers to extract two levels of features and fuse them.
Compared with the above models, the results on the five evaluation metrics consistently show that our work compares favorably with other state-of-the-art techniques, including CNN-based (High-Level Attention and RDN), GCN-based (CGVRG), and Transformer-based models. Our model's CIDEr and BLEU-4 scores reach 131.1% and 39.9%, which are 2.4% and 0.5% better than the best comparison models, respectively. Furthermore, inspired by the fact that all of the stronger comparison models contain attention mechanisms, our model adds a fusion-feature attention mechanism at a later stage to improve performance. From the analysis of model architectures and their performance, current image-caption-generation models based mainly on GCNs and Transformers have gradually replaced CNNs as the mainstream encoder because of their excellent performance. We performed a second set of ablation experiments to further explore the model's advantages over traditional CNNs and GCNs, as shown in Table 2.
To investigate the superiority of this model over traditional models, several ablation experiments were carried out. First, a CNN–LSTM architecture with no embedded attention mechanism was used as the baseline. The first set of ablation experiments used a CNN–LSTM architecture with an embedded MLP attention mechanism; the results show that this attention-based ablation model scores well. GCN–LSTM was the framework for the second set of experiments. The data show that the GCN has a significant performance advantage over the CNN: its CIDEr score is 39.1% and 33.3% higher than those of CNN–LSTM and CNN–LSTM–Attention, respectively. Given the excellent performance of the attention mechanism, we embedded it in the GCN–LSTM architecture, and the CIDEr metric increased by another 2.6%, further validating the attention mechanism in image-caption-generation models. Next, we performed three sets of comparison experiments to verify the effectiveness of the fused features. First, we verified the additive fusion of region features and channel features with Grid + Region (add); its performance exceeds the traditional CNN–LSTM–Attention architecture, which verifies the feasibility of fused features. Because of the fusion-noise problem in additive and multiplicative fusion, the best performance, measured by CIDEr, was obtained with the concatenation approach Grid + Region (concat), which is 3.2% higher than the best of the comparison experiments.
The above ablation experiments verify the superiority of the fusion-feature-based image-caption-generation model. To further explore the role and importance of each part of the model, we performed a third set of ablation experiments, using a plain CNN + GCN + LSTM architecture as the baseline, without the SE module, the node-embedding module, or the attention-based fusion module, and fusing the feature vectors by plain concatenation, as shown in Table 3. SE refers to the SE channel feature-extraction module, NE refers to the node-embedding region-feature module, and Att refers to the attention-based feature-fusion module.
As can be seen from the experimental results, among the various components of the model, the attention mechanism of the feature-fusion module improves model performance the most, by 1% and 3.4% over the baseline on BLEU-4 and CIDEr, respectively. When only SE or only NE is added, more semantic and contextual information is obtained, but the lack of an attention mechanism in the fusion process introduces noise and the guidance information performs poorly. We therefore embedded the attention mechanism into the SE-only and NE-only variants and repeated the experiments; both show a more significant performance increase than the corresponding variants without the attention mechanism. In the SE + NE + Att case, the model obtains rich semantic and contextual information while the attention mechanism mitigates the fusion-noise problem, achieving the best performance.
After verifying the model's performance through these experiments, two images were randomly selected and fed into the model trained in this paper and into the NIC model; the resulting captions are shown in Figure 4. As can be seen from the generated captions, the captions generated by our model are more detailed and richer in semantic information.

5. Conclusions

This paper presents an image-description algorithm based on fusion features. The model innovatively integrates grid-level and region-level features so that the extracted image features have a visual focus, and the streamlined model parameters effectively reduce training time. The algorithm further unifies cross-modal features between visual images and language comprehension. Experiments show that the algorithm performs excellently on all evaluation metrics. In future work, we will study combining the graph convolutional neural network with the Transformer to strengthen the connections between targets in the image.

Author Contributions

Conceptualization, Y.G.; methodology, Y.G.; software, H.M.; validation, Y.G., H.M. and X.Z.; formal analysis, X.Z.; investigation, Y.G.; resources, Y.G.; data curation, Y.G.; writing—original draft preparation, H.M.; writing—review and editing, X.X.; visualization, X.X.; supervision, X.X.; project administration, Y.G.; funding acquisition, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Innovation Project of Postgraduate Education Reform of Liaoning University of Technology (No. YJG2021020), Liaoning Natural Science Foundation Mentoring Program Project (No. 2019-ZD-0700), Liaoning Education Department Scientific Research Project (No. JZL202015404, No. LJKZ0625), General project of Liaoning Provincial Department of Education (No. LJKZ0618), and Liaoning Higher Education Innovation Talent Support Project (No. LR2019034).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Farhadi, A.; Hejrati, M.; Sadeghi, M.A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–29. [Google Scholar]
  2. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2891–2903. [Google Scholar] [CrossRef]
  3. Kuznetsova, P.; Ordonez, V.; Berg, A.; Berg, T.; Choi, Y. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju, Korea, 8–14 July 2012; pp. 359–368. [Google Scholar]
  4. Ordonez, V.; Kulkarni, G.; Berg, T. Im2text: Describing images using 1 million captioned photographs. Adv. Neural Inf. Process. Syst. 2011, 24, 1143–1151. [Google Scholar]
  5. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  6. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  7. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
  8. Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
  9. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
  10. Wu, Q.; Shen, C.; Liu, L.; Dick, A.; Van Den Hengel, A. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 203–212. [Google Scholar]
  11. Tanti, M.; Gatt, A.; Camilleri, K.P. What is the role of recurrent neural networks (RNNs) in an image caption generator? arXiv 2017, arXiv:1708.02043. [Google Scholar]
  12. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  13. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Hierarchy parsing for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 2621–2629. [Google Scholar]
  14. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
  15. Shi, Z.; Zhou, X.; Qiu, X. Improving image captioning with better use of captions. arXiv 2020, arXiv:2006.11807. [Google Scholar]
  16. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  17. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  18. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  19. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  20. Ding, S.; Qu, S.; Xi, Y.; Sangaiah, A.K.; Wan, S. Image caption generation with high-level image features. Pattern Recognit. Lett. 2019, 123, 89–95. [Google Scholar] [CrossRef]
  21. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10578–10587. [Google Scholar]
  22. Ke, L.; Pei, W.; Li, R. Reflective decoding network for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 8888–8897. [Google Scholar]
  23. Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 8928–8937. [Google Scholar]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
  26. Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.W.; Ji, R. Dual-level collaborative transformer for image captioning. arXiv 2021, arXiv:2101.06462. [Google Scholar] [CrossRef]
Figure 1. The network architecture of the fusion-feature model.
Figure 2. Squeeze and excitation module.
Figure 3. Node embedding (NE) module.
Figure 4. Display diagram of caption-generation results.
Table 1. Model performance on the MS COCO dataset.

| Model | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
| High-Level Attention, 2019 [20] | 0.746 | 0.317 | 1.103 | 0.265 | 0.535 |
| CGVRG, 2020 [15] | 0.814 | 0.386 | 1.267 | 0.286 | 0.588 |
| M2-Transformer, 2020 [21] | 0.816 | 0.397 | 1.293 | 0.294 | 0.592 |
| RDN, 2019 [22] | 0.775 | 0.368 | 1.153 | 0.272 | 0.568 |
| Entangle-Transformer, 2019 [23] | 0.776 | 0.378 | 1.193 | 0.284 | 0.574 |
| DLCT, 2021 [26] | 0.816 | 0.392 | 1.297 | 0.298 | 0.598 |
| Ours | 0.821 | 0.399 | 1.311 | 0.297 | 0.601 |
Table 2. Performance of frame structure.

| Model | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
| CNN–LSTM | 0.666 | 0.246 | 0.862 | 0.201 | - |
| CNN–LSTM–Attention | 0.726 | 0.310 | 0.920 | 0.235 | - |
| GCN–LSTM | 0.808 | 0.387 | 1.253 | 0.285 | 0.585 |
| GCN–LSTM–Attention | 0.816 | 0.393 | 1.279 | 0.288 | 0.590 |
| Grid + Region (add) | 0.801 | 0.344 | 1.237 | 0.273 | 0.577 |
| Grid + Region (mul) | 0.811 | 0.389 | 1.271 | 0.280 | 0.591 |
| Grid + Region (concat) | 0.821 | 0.399 | 1.311 | 0.297 | 0.601 |
Table 3. Comparison of the performance of each part of the model.

| Model | BLEU-4 | CIDEr |
| Baseline | 0.792 | 1.183 |
| Grid + Region + SE | 0.797 | 1.199 |
| Grid + Region + NE | 0.802 | 1.217 |
| Grid + Region + Att | 0.811 | 1.233 |
| Grid + Region + SE + NE | 0.807 | 1.229 |
| Grid + Region + SE + Att | 0.818 | 1.253 |
| Grid + Region + NE + Att | 0.816 | 1.267 |
| Grid + Region + SE + NE + Att | 0.821 | 1.311 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
