Article

Associating Images with Sentences Using Recurrent Canonical Correlation Analysis

1
Institute of Biophysics, Chinese Academy of Sciences, Beijing 100190, China
2
The College of Automation and Electronic Information, Xiangtan University, Xiangtan 411105, China
3
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(16), 5516; https://doi.org/10.3390/app10165516
Submission received: 5 July 2020 / Revised: 28 July 2020 / Accepted: 5 August 2020 / Published: 10 August 2020
(This article belongs to the Section Applied Physics General)

Abstract

Associating images with sentences has drawn much attention recently. Existing methods commonly represent an image by indistinctively describing all of its contents in a one-time static way, which ignores two facts: (1) the association analysis can only be effective for the partial salient contents and the associated sentence, and (2) visual information acquisition is a dynamic rather than static process. To deal with this issue, we propose a recurrent canonical correlation analysis (RCCA) method for associating images with sentences. RCCA includes a contextual attention-based LSTM-RNN which can selectively attend to salient regions of an image at each time step, and thus represent all the salient contents within a few steps. Different from existing attention-based models, our model focuses on modelling the contextual visual attention mechanism for the task of association analysis. RCCA also includes a conventional LSTM-RNN for sentence representation learning. The resulting representations of images and sentences are fed into CCA to maximize their linear correlation, where the parameters of the LSTM-RNNs and CCA are jointly learned. Due to the effective image representation learning, our model can well associate images having complex contents with sentences, and achieves better performance in terms of image annotation and retrieval.

1. Introduction

Associating images with sentences plays an important role in many applications, e.g., finding sentences given an image query for image annotation and captioning, and retrieving images with a sentence query for image search. Recently, various methods have been proposed to deal with this problem. Most of them first extract representations for images and sentences, and then use either a structured objective [1,2] or a canonical correlation objective [3,4] to learn a latent joint space for association analysis, where matched images and sentences have small distance or high correlation.
Obtaining effective representations for images is crucial for the association analysis. An image usually contains very abundant contents, but its associated sentence can only describe partial salient contents due to its limited length. Thus the association analysis can only be effective for these salient contents and their corresponding textual descriptions, since only they share the same semantics. As a result, the remaining undescribed contents in the image are redundant, and tend to introduce noisy side effects into the association analysis. However, most existing methods [1,3,4,5,6,7,8] ignore this fact and represent an image by simultaneously processing its entire contents. Although other methods [2,9,10] consider associating the salient contents (regions) of the image with the corresponding descriptions (words) in the sentence, they have to pre-establish the salient regions using manual annotations or detection algorithms. In addition, existing methods usually represent the visual information in a one-time static process, which is in contrast to the neuroscience finding [11] that visual information acquisition is actually a dynamic rather than static process. The selective attention mechanism plays an essential role in this process. In particular, the human visual system does not focus its attention on an entire image only once. Instead, it selectively attends to salient regions at different locations to collect all the important information in a sequential way.
Recently, some attention-based models have been proposed in various vision applications [12,13,14,15], but the modelling of the attention mechanism in the context of association analysis remains under-investigated. In fact, existing models mostly focus on recognizing salient image regions using dense supervision information, which makes them less applicable to the association analysis that aims to fuse the information of all the salient regions into a unified representation. Moreover, they rarely consider the modelling of context [16], which actually has a prominent effect on the attention mechanism. Ba et al. [17] consider exploiting the context information only at the first time step during the attention process, but they ignore the fact that context information modulates the attention mechanism at all the time steps [18].
In this work, we propose a recurrent canonical correlation analysis (RCCA) method for associating images with sentences using visual attention. The proposed RCCA includes a contextual attention-based LSTM-RNN for image representation learning, which can selectively focus on salient regions of an image by referring to context information at each time step, and sequentially aggregates the local features corresponding to the attended regions over all time steps to obtain the final image representation. RCCA also exploits a conventional LSTM-RNN to learn a representation for the associated sentence by taking its words in order as sequential inputs. The obtained representations of images and sentences are then fed into CCA for association analysis, where the parameters of the LSTM-RNNs and CCA are jointly learned by maximizing the linear correlation between images and sentences. To demonstrate the effectiveness of the proposed RCCA, we perform several experiments of image annotation and retrieval on three publicly available datasets, and achieve satisfactory results.
Our main contributions can be summarized as follows. We propose the recurrent canonical correlation analysis for associating images with sentences, which models the contextual visual attention mechanism and learns image representations in a biologically plausible dynamic process. We systematically study the modelling of context in conjunction with the attention mechanism, and demonstrate that using dense context information helps considerably. We achieve better performance in the tasks of image annotation and retrieval.

2. Related Work

2.1. Association between Images and Sentences

Existing methods for associating images with sentences usually include two steps: (1) feature extraction for images and sentences, and (2) association analysis based on the obtained features. Many recent methods [2,3,4,6,7,8,9,10,19] exploit CNNs [20,21] to obtain image features, i.e., the 4096-dimensional output vector of the last fully-connected layer. For sentences, which are sequential data, most methods use LSTM-RNNs [22] to model their long-range temporal dependency, and exploit the hidden state at the last step as their representation. After obtaining features, they use either a structured objective [1,2,7] or a canonical correlation objective [3,4,9] to learn a latent joint space, in which matched images and sentences have small distance or high correlation.
It is believed that obtaining good image representations, such as by CNNs and Fisher vectors (FV) [4,23], can greatly facilitate the subsequent association analysis. However, these methods indistinctively describe all the contents of an image, which might be suboptimal since only the salient contents and their associated textual descriptions can be effective in association analysis. Although other works [2,9,10] consider associating these salient regions with their descriptions, they have to explicitly use object detectors or manual annotations to pre-establish the salient regions. Different from them, our model can automatically select and attend to salient regions in the image by modelling the contextual attention mechanism, which is more efficient and biologically plausible.
The most related work to ours is [24], but it has to perform a one-to-one similarity measurement during testing, which makes it much more time-consuming than our model. Note that we do not make comparisons with recent state-of-the-art methods [25,26,27], since they all use advanced image representation modules such as ResNet [28] and bottom-up features [29].

2.2. Attention-Based Models

There have been several works simulating the attention mechanism. Graves [30] exploits RNNs and differentiable Gaussian filters to simulate the attention mechanism and applies it to handwriting synthesis; the model is able to predict a word conditioned on an auxiliary annotation sequence. Gregor et al. [12] introduce the deep recurrent attentive writer for image generation, which develops a novel spatial attention mechanism based on 2-dimensional Gaussian filters to mimic the foveation of human eyes. Larochelle and Hinton [31] propose high-order Boltzmann machines to learn how to accumulate information from an image over several fixations. Mnih et al. [32] present an RNN-based model that selects a sequence of regions and processes only the selected regions of an image to extract information. Ba et al. [17] present a recurrent attention model for multiple object recognition; trained with reinforcement learning, the model can attend to the most label-relevant regions of an image. Bahdanau et al. [13] propose a neural machine translator which can search for relevant parts of a source sentence to predict a target word, without explicit alignment. Inspired by it, Xu et al. [15] develop an attention-based captioning model which can automatically learn to fix gazes on salient objects in an image and generate the corresponding annotated words. In addition, this work is also related to recent works [33,34,35] on semantic segmentation with attention-based networks. To the best of our knowledge, we make the first attempt to consider the attention mechanism in the context of association between images and sentences, and systematically study the modelling of the contextual attention mechanism.

3. Association between Images and Sentences via Recurrent Canonical Correlation Analysis

In this section, we introduce the proposed recurrent canonical correlation analysis (RCCA) for associating images with sentences, by describing its three modules in detail: (1) a contextual attention-based LSTM-RNN for image representation learning, (2) a conventional LSTM-RNN for sentence representation learning, and (3) canonical correlation analysis on the resulting image and sentence representations for model learning.

3.1. Image Representation Using Contextual Attention-Based LSTM-RNN

The contextual attention-based LSTM-RNN is illustrated in Figure 1 (at the bottom). Given an image, we first use a CNN [36] to extract two kinds of features: (1) multiple local features $\{\mathbf{a}_l \in \mathbb{R}^F\}_{l=1,\dots,L}$, i.e., feature maps from the last convolutional layer, where $\mathbf{a}_l$ describes the $l$-th local region in the image and $L$ is the total number of local regions, and (2) a context feature $\mathbf{m} \in \mathbb{R}^D$, i.e., a feature vector from the last fully-connected layer, describing the global context of the image. Based on these features, we exploit a contextual attention scheme to obtain sequential representations $\{\mathbf{e}_t\}_{t=1,\dots,T}$ for the static image, where $\mathbf{e}_t$ is a combined local feature describing the attended salient regions at the $t$-th time step. Then we use an LSTM-RNN to model the long-range temporal dependency in the sequential representations, by feeding them as inputs to the hidden states $\{\mathbf{h}_t \in \mathbb{R}^{D_x}\}_{t=1,\dots,T}$ at the corresponding time steps. As time goes on, all the hidden states propagate their values along the time axis until the end, so the hidden state at the last time step, $\mathbf{h}_T$, is the fused representation of all the attended salient regions.
The details of the contextual attention scheme at the $t$-th time step can be formulated as follows. As illustrated in Figure 2a, given the context feature $\mathbf{m}$, the local feature set $\{\mathbf{a}_l \in \mathbb{R}^F\}_{l=1,\dots,L}$ and the hidden state at the previous time step $\mathbf{h}_{t-1}$, we compute $\{p_{t,l} \in [0,1]\}_{l=1,\dots,L}$ for all the local regions, where $p_{t,l}$ indicates the probability that the $l$-th region will be attended to at the $t$-th time step:
$$ p_{t,l} = \frac{e^{\hat{p}_{t,l}}}{\sum_{l'=1}^{L} e^{\hat{p}_{t,l'}}}, \qquad \hat{p}_{t,l} = f_{gated}\big([\hat{\mathbf{h}}_{t-1}, \mathbf{m}, \mathbf{a}_l]\big) $$
where $f_{gated}(\cdot)$ is a gated fusion unit [37] that fuses all the context information, taking $\hat{\mathbf{h}}_{t-1}$, $\mathbf{m}$ and $\mathbf{a}_l$ as input, and $\hat{\mathbf{h}}_{t-1}$ is obtained by modulating the previous hidden state $\mathbf{h}_{t-1}$ with the context information $\mathbf{m}$. Note that the use of context information is necessary, since it not only provides complementary global information about the currently attended regions, but also encourages the model to focus on other unexplored regions under the principle of information maximization.
After obtaining the probabilities, we exploit the "soft attention" scheme [13] to compute the attended local feature $\mathbf{e}_t$. We perform element-wise multiplication between each local feature $\mathbf{a}_l$ and its corresponding probability $p_{t,l}$, and sum all the products together:
$$ \mathbf{e}_t = \sum_{l=1}^{L} p_{t,l}\, \mathbf{a}_l $$
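The attention step above can be made concrete with a short sketch in PyTorch (the platform used in our experiments). The gated fusion below is a simplified stand-in for the gated fusion unit of [37] (a tanh branch gated by a sigmoid branch), and the module name, layer sizes and the exact way the inputs are combined are our own assumptions rather than the implementation used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualAttention(nn.Module):
    """Sketch of one contextual attention step: score every local region from
    [h_{t-1}, m, a_l], softmax over regions, then form the attended feature e_t
    as the probability-weighted sum of local features."""
    def __init__(self, feat_dim=512, ctx_dim=4096, hid_dim=1024):
        super().__init__()
        in_dim = hid_dim + ctx_dim + feat_dim
        self.proj = nn.Linear(in_dim, hid_dim)   # feature branch
        self.gate = nn.Linear(in_dim, hid_dim)   # gating branch (stand-in for [37])
        self.score = nn.Linear(hid_dim, 1)       # scalar score for each region

    def forward(self, a, m, h_prev):
        # a: (B, L, F) local features, m: (B, D) context, h_prev: (B, H) previous hidden state
        B, L, _ = a.shape
        h_exp = h_prev.unsqueeze(1).expand(B, L, h_prev.size(-1))
        m_exp = m.unsqueeze(1).expand(B, L, m.size(-1))
        x = torch.cat([h_exp, m_exp, a], dim=-1)
        fused = torch.tanh(self.proj(x)) * torch.sigmoid(self.gate(x))
        p = F.softmax(self.score(fused).squeeze(-1), dim=-1)   # p_{t,l} over the L regions
        e = (p.unsqueeze(-1) * a).sum(dim=1)                    # attended feature e_t
        return e, p
```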
Then, we can use an LSTM-RNN [38] to model the temporal dependency in the sequential representations $\{\mathbf{e}_t\}_{t=1,\dots,T}$, as shown in Figure 2b, where the memory state $\mathbf{c}_t$ (with initial state $\mathbf{c}_0$), the hidden state $\mathbf{h}_t$ (with initial state $\mathbf{h}_0$), the input gate $\mathbf{i}_t$, the forget gate $\mathbf{f}_t$ and the output gate $\mathbf{o}_t$ are computed as follows:
$$ \mathbf{c}_0 = f_{MLP,c}\Big(\tfrac{1}{L}\textstyle\sum_{l=1}^{L}\mathbf{a}_l\Big), \qquad \mathbf{h}_0 = f_{MLP,h}\Big(\tfrac{1}{L}\textstyle\sum_{l=1}^{L}\mathbf{a}_l\Big) $$
$$ \mathbf{i}_t = \sigma(W_{ei}\mathbf{e}_t + W_{hi}\mathbf{h}_{t-1} + \mathbf{b}_i), \qquad \mathbf{f}_t = \sigma(W_{ef}\mathbf{e}_t + W_{hf}\mathbf{h}_{t-1} + \mathbf{b}_f) $$
$$ \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(W_{ec}\mathbf{e}_t + W_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c) $$
$$ \mathbf{o}_t = \sigma(W_{eo}\mathbf{e}_t + W_{ho}\mathbf{h}_{t-1} + \mathbf{b}_o), \qquad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) $$
where $f_{MLP,c}(\cdot)$ and $f_{MLP,h}(\cdot)$ are two multilayer perceptrons associated with $\mathbf{c}$ and $\mathbf{h}$, respectively, $\sigma$ denotes the sigmoid activation function, and $\odot$ denotes element-wise multiplication.
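Continuing the sketch above (and reusing the ContextualAttention module), the image branch can be wired up as follows; the single-layer tanh initializers are a simplification of the two MLPs, and the default T = 3 follows Section 4.4:

```python
class AttentiveImageEncoder(nn.Module):
    """Sketch of the image branch: initial LSTM states from the mean local feature,
    then T rounds of contextual attention followed by an LSTM cell; h_T is the fused
    image representation. Builds on the ContextualAttention sketch above."""
    def __init__(self, feat_dim=512, ctx_dim=4096, hid_dim=1024, steps=3):
        super().__init__()
        self.steps = steps
        self.init_c = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.Tanh())
        self.init_h = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.Tanh())
        self.attend = ContextualAttention(feat_dim, ctx_dim, hid_dim)
        self.cell = nn.LSTMCell(feat_dim, hid_dim)  # gates as in the equations above

    def forward(self, a, m):
        # a: (B, L, F) local features, m: (B, D) context feature
        a_mean = a.mean(dim=1)
        h, c = self.init_h(a_mean), self.init_c(a_mean)
        probs = []
        for _ in range(self.steps):
            e, p = self.attend(a, m, h)      # attended feature e_t and probabilities p_{t,l}
            h, c = self.cell(e, (h, c))
            probs.append(p)
        return h, torch.stack(probs, dim=1)  # h_T and the (B, T, L) attention history
```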

3.2. Sentence Representation Using Conventional LSTM-RNN

Since sentences are naturally sequential data consisting of ordered words, and a conventional LSTM-RNN can model long-range contextual information well, we directly exploit the conventional LSTM-RNN to obtain sentence representations. As shown in Figure 1 (at the top), the conventional LSTM-RNN takes one word at a time as the input to the corresponding hidden state at each time step, and all the hidden states propagate their values along the time axis. The hidden state at the last time step collects all the information in the sentence, and can thus be regarded as the representation of the entire sentence.
We denote $[\mathbf{w}_1, \dots, \mathbf{w}_t, \dots, \mathbf{w}_T]$ as a sentence associated with the image, $\mathbf{w}_t \in \{0,1\}^M$ as the $t$-th word in the sentence, $T$ as the number of words in the sentence (also the number of time steps in the conventional LSTM-RNN), $M$ as the size of the word vocabulary, and $\{\mathbf{h}_t \in \mathbb{R}^{D_y}\}_{t=1,\dots,T}$ as the hidden states at all time steps. Since the original representations of words are high-dimensional "one-hot" vectors, we embed them into low-dimensional real-valued vectors using an embedding matrix $E \in \mathbb{R}^{U \times M}$. The details of the conventional LSTM-RNN can be formulated as follows:
$$ \mathbf{i}_t = \sigma(W_{wi}E\mathbf{w}_t + W_{hi}\mathbf{h}_{t-1} + \mathbf{b}_i), \qquad \mathbf{f}_t = \sigma(W_{wf}E\mathbf{w}_t + W_{hf}\mathbf{h}_{t-1} + \mathbf{b}_f) $$
$$ \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(W_{wc}E\mathbf{w}_t + W_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c) $$
$$ \mathbf{o}_t = \sigma(W_{wo}E\mathbf{w}_t + W_{ho}\mathbf{h}_{t-1} + \mathbf{b}_o), \qquad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) $$
Different from the contextual attention-based LSTM-RNN, here we do not explicitly define the initial memory and hidden states, similar to [38].
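The sentence branch amounts to an embedding layer followed by a standard LSTM. A minimal sketch, continuing the PyTorch code above, is given below; the variable-length batching via packing is an implementation detail we assume rather than take from the paper:

```python
class SentenceEncoder(nn.Module):
    """Sketch of the sentence branch: embedding matrix E followed by an LSTM;
    the hidden state at the last real time step is the sentence representation."""
    def __init__(self, vocab_size, embed_dim=300, hid_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # rows of E, with U = 300
        self.lstm = nn.LSTM(embed_dim, hid_dim, batch_first=True)

    def forward(self, words, lengths):
        # words: (B, T_max) integer word indices, lengths: (B,) true sentence lengths
        x = self.embed(words)
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        return h_n[-1]   # (B, D_y) last hidden state for each sentence
```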

3.3. Model Learning with Canonical Correlation Analysis

We regard the last hidden states of the two modality-specific LSTM-RNNs as the representations of images and sentences, denoted by $X \in \mathbb{R}^{D_x \times N}$ and $Y \in \mathbb{R}^{D_y \times N}$, respectively, where $N$ is the number of given image-sentence pairs, and $D_x$ and $D_y$ are the dimensions of the obtained image and sentence representations, respectively. We feed these representations to canonical correlation analysis (CCA) to learn two linear projections $W_x$ and $W_y$ for $X$ and $Y$, respectively, so that their correlation is high:
$$ (W_x^*, W_y^*) = \arg\max_{W_x, W_y} \mathrm{corr}(X, Y) = \arg\max_{W_x, W_y} \frac{W_x^T \Sigma_{xy} W_y}{\sqrt{W_x^T \Sigma_{xx} W_x \; W_y^T \Sigma_{yy} W_y}} $$
The covariances $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$ are estimated as $\Sigma_{xx} = \frac{1}{N-1}\bar{X}\bar{X}^T + \lambda_x I$, $\Sigma_{yy} = \frac{1}{N-1}\bar{Y}\bar{Y}^T + \lambda_y I$ and $\Sigma_{xy} = \frac{1}{N-1}\bar{X}\bar{Y}^T$, respectively, where $\bar{X}$ and $\bar{Y}$ are the centered representations, and $\lambda_x I$ and $\lambda_y I$ are regularization terms that ensure the positive definiteness of $\Sigma_{xx}$ and $\Sigma_{yy}$.
Since the objective is invariant to the scaling of $W_x$ and $W_y$, the projections can be constrained to have unit variance. Then, by assembling the top $k$ projection vectors into the columns of the projection matrices $\hat{W}_x$ and $\hat{W}_y$, the objective can be written as:
$$ \max_{\hat{W}_x, \hat{W}_y} \mathrm{tr}\big(\hat{W}_x^T \Sigma_{xy} \hat{W}_y\big), \qquad \text{s.t.} \quad \hat{W}_x^T \Sigma_{xx} \hat{W}_x = \hat{W}_y^T \Sigma_{yy} \hat{W}_y = I $$
Then according to [39], the optimum of this objective is attained at:
$$ (\hat{W}_x^*, \hat{W}_y^*) = \big(\Sigma_{xx}^{-1/2} U_k, \; \Sigma_{yy}^{-1/2} V_k\big) $$
where $U_k$ and $V_k$ are the matrices of the first $k$ left- and right-singular vectors of $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$, respectively. As a result, the objective is equivalent to the sum of the top $k$ singular values of $T$, i.e., the trace norm:
$$ \mathrm{corr}(X, Y) = \lVert T \rVert_{tr} = \mathrm{tr}\big((T^T T)^{1/2}\big) $$
It should be noted that the parameters of the two modality-specific LSTM-RNNs and CCA are jointly learned, where the parameters of the LSTM-RNNs are adaptively trained to optimize the CCA objective using stochastic gradient descent. The gradients of $\mathrm{corr}(X, Y)$ with respect to the parameters of the LSTM-RNNs can be computed by first obtaining the gradients with respect to $X$ and $Y$ as follows, and then performing backpropagation:
$$ \frac{\partial\, \mathrm{corr}(X, Y)}{\partial X} = \frac{1}{N-1}\big(2\nabla_{xx}\bar{X} + \nabla_{xy}\bar{Y}\big), \qquad \frac{\partial\, \mathrm{corr}(X, Y)}{\partial Y} = \frac{1}{N-1}\big(2\nabla_{yy}\bar{Y} + \nabla_{yx}\bar{X}\big) $$
$$ \nabla_{xx} = -\frac{1}{2}\, \Sigma_{xx}^{-1/2} U D U^T \Sigma_{xx}^{-1/2}, \qquad \nabla_{xy} = \Sigma_{xx}^{-1/2} U V^T \Sigma_{yy}^{-1/2} $$
$$ \nabla_{yy} = -\frac{1}{2}\, \Sigma_{yy}^{-1/2} V D V^T \Sigma_{yy}^{-1/2}, \qquad \nabla_{yx} = \Sigma_{yy}^{-1/2} V U^T \Sigma_{xx}^{-1/2} $$
where $U D V^T$ is the singular value decomposition (SVD) of $T$.
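In a framework with automatic differentiation, the same correlation objective can also be written directly as a loss and backpropagated without hand-coding these gradients. A minimal sketch, continuing the PyTorch code above and assuming a recent version with torch.linalg, computes the mini-batch estimate of the objective:

```python
def cca_corr(X, Y, lam_x=0.1, lam_y=0.1):
    """corr(X, Y) as the trace norm of T, computed from a mini-batch.
    X: (D_x, N) image codes, Y: (D_y, N) sentence codes; maximize the returned value."""
    N = X.shape[1]
    Xc = X - X.mean(dim=1, keepdim=True)          # centered representations
    Yc = Y - Y.mean(dim=1, keepdim=True)
    Sxx = Xc @ Xc.t() / (N - 1) + lam_x * torch.eye(X.shape[0], device=X.device)
    Syy = Yc @ Yc.t() / (N - 1) + lam_y * torch.eye(Y.shape[0], device=Y.device)
    Sxy = Xc @ Yc.t() / (N - 1)
    # inverse square roots of the symmetric, regularized covariances
    ex, Vx = torch.linalg.eigh(Sxx)
    ey, Vy = torch.linalg.eigh(Syy)
    Sxx_inv_sqrt = Vx @ torch.diag(ex.clamp_min(1e-12).rsqrt()) @ Vx.t()
    Syy_inv_sqrt = Vy @ torch.diag(ey.clamp_min(1e-12).rsqrt()) @ Vy.t()
    T = Sxx_inv_sqrt @ Sxy @ Syy_inv_sqrt
    return torch.linalg.svdvals(T).sum()          # trace norm = sum of singular values
```

Minimizing the negative of this value with SGD in effect reproduces the analytic gradients given above.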
Similar to [40,41], although the objective of CCA is a function of the entire training set, we perform mini-batch based optimization due to the high computational cost. With such an implementation, directly optimizing the parameters of the LSTM-RNNs from random initialization often fails to obtain good results. To provide a better initialization for the parameters, we exploit a structured objective for parameter pretraining, which has been widely used in cross-modal learning [2,7,10,41]:
$$ W_{LSTM\text{-}RNN}^{*} = \arg\min_{W_{LSTM\text{-}RNN}} \sum_{i=1}^{N} \sum_{k=1}^{K} \Big[ \max\big(0,\; m - d(x_i, y_i) + d(x_i, y_k)\big) + \max\big(0,\; m - d(x_i, y_i) + d(x_k, y_i)\big) \Big] $$
where $W_{LSTM\text{-}RNN}$ denotes all the parameters of the two LSTM-RNNs, $x_i$ and $y_i$ are the representations of the $i$-th matched image and sentence, $x_k$ is a mismatched (contrastive) image representation for $y_i$ and, vice versa, $y_k$ is a mismatched sentence representation for $x_i$, $K$ is the total number of contrastive samples, $m$ is a tuning margin, and $d(\cdot, \cdot)$ is the cosine similarity. This structured objective encourages representations of matched images and sentences to be more similar than those of mismatched ones. Such initial representations are helpful for the subsequent CCA optimization, since maximizing the correlation between two already similar representations is much easier.
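A compact sketch of this pretraining loss is given below (reusing torch and F from the earlier sketches). For simplicity it uses all other pairs in the mini-batch as contrastive samples, whereas the paper samples K = 100 contrastive examples; the margin default follows m = 0.2 from Section 4.3:

```python
def structured_loss(x, y, margin=0.2):
    """Bidirectional ranking loss with cosine similarity.
    x: (B, D) image codes, y: (B, D) sentence codes; row i of x matches row i of y."""
    x = F.normalize(x, dim=1)
    y = F.normalize(y, dim=1)
    sim = x @ y.t()                                   # sim[i, j] = d(x_i, y_j)
    pos = sim.diag().unsqueeze(1)                     # matched similarity d(x_i, y_i)
    cost_s = (margin - pos + sim).clamp(min=0)        # contrastive sentences y_k
    cost_im = (margin - pos.t() + sim).clamp(min=0)   # contrastive images x_k
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_im.masked_fill(mask, 0).sum()
```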
In addition, we add a doubly stochastic regularization [15] to the objective, which encourages the contextual attention-based LSTM-RNN to pay equal attention to every local region of the image during the dynamic attention process:
$$ R = \lambda \sum_{l=1}^{L} \Big(1 - \sum_{t=1}^{T} p_{t,l}\Big)^{2} $$
where λ is a balancing parameter. In our experiments, we find that using this regularization can further improve the performance.
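Given the attention probabilities returned by the image encoder sketched earlier (a tensor of shape (B, T, L)), this regularizer is a one-liner; the default value mirrors λ = 10 from Section 4.3:

```python
def attention_regularizer(probs, lam=10.0):
    """Doubly stochastic penalty: pushes the attention mass accumulated over the
    T steps towards 1 for every local region. probs: (B, T, L) probabilities."""
    return lam * ((1.0 - probs.sum(dim=1)) ** 2).sum()
```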

4. Experimental Results

To demonstrate the effectiveness of the proposed RCCA, we perform experiments of image annotation and retrieval on three publicly available datasets.

4.1. Datasets and Protocols

The three evaluation datasets and their corresponding experimental protocols are described as follows. (1) Flickr8k [42] consists of 8000 images collected from the Flickr website, each of which may contain people or animals performing actions, and is accompanied by five sentences describing the content of the image. This dataset provides standard training, validation and testing splits, with 6000, 1000 and 1000 images, respectively. (2) Flickr30k [43] is an extension of Flickr8k, consisting of 31,783 images collected from the Flickr website. Each image is also accompanied by five sentences, annotated in a similar way as those in Flickr8k. We use the public training, validation and testing splits, which contain 28,000, 1000 and 1000 images, respectively. (3) Microsoft COCO [44] consists of 82,783 training and 40,504 validation images, each of which is associated with five sentences. We use the public training, validation and testing splits [45], with 82,783, 4000 and 1000 images, respectively.

4.2. Experimental Details

During model learning, we use stochastic gradient descent (SGD) to optimize the parameters, with a learning rate of 0.005, a batch size of 128, and gradients clipped to [−10, 10]. The model is trained for 30 epochs to guarantee convergence, implemented on the PyTorch platform [46] and accelerated with an NVIDIA Titan X GPU.
The tasks of image annotation and retrieval can be jointly formulated as cross-modal retrieval, i.e., given an image, the goal is to retrieve highly matched sentences for annotation, and vice versa. The commonly used evaluation criteria are "R@1", "R@5" and "R@10", i.e., the recall rates at the top 1, 5 and 10 results. Another is "Med r", the median rank of the first ground-truth result. We also compute an additional criterion "Sum", the sum of all six recall rates, to evaluate the overall performance for both image annotation and retrieval.
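These criteria can be computed directly from a cross-modal similarity matrix. The sketch below does this for the image-annotation direction and assumes the usual convention that sentences 5i .. 5i+4 are the ground truth for image i; swapping the roles of images and sentences gives the retrieval direction:

```python
import numpy as np

def recall_metrics(sim, ks=(1, 5, 10)):
    """R@K and Med r for image annotation.
    sim: (n_images, 5 * n_images) matrix, sim[i, j] = score(image i, sentence j)."""
    ranks = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                  # best-scoring sentences first
        gt = np.arange(5 * i, 5 * i + 5)             # ground-truth sentence indices
        ranks.append(np.where(np.isin(order, gt))[0].min() + 1)
    ranks = np.asarray(ranks)
    scores = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    scores["Med r"] = float(np.median(ranks))
    return scores
```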

4.3. Image and Sentence Matching

To systematically study the modelling of the contextual attention mechanism, as shown in Figure 3, we propose four variants of RCCA: (a) RCCA-nc does not use any context information at any time step in the contextual attention-based LSTM-RNN, (b) RCCA-fc uses the context information only at the first time step, (c) RCCA-lc uses the context information only at the last time step, and (d) RCCA-na does not use the attention mechanism but only the context information. We denote RCCA-ac as the original model in Figure 1, which uses the context information at all the time steps. We also develop an ensemble model RCCA* in a similar way as [8], by summing together the five cross-modal similarity matrices generated from the five RCCAs.
For all the RCCAs, we use the 19-layer VGG network [21] to extract 512 feature maps (with a size of 14 × 14) from the "conv5-4" layer as multiple local features, and a feature vector from the "fc7" layer as the context feature. Therefore, the dimensions of the local and context features are F = 512 and D = 4096, respectively, and the total number of local regions is L = 196. The hyperparameters of the RCCAs are set as follows: D_x = 1024, D_y = 1024, λ_x = 0.1, λ_y = 0.1, λ = 10, K = 100, U = 300 and m = 0.2. For the pretraining, we find that using 3 pretraining epochs always achieves satisfactory performance. The number of time steps in the contextual attention-based LSTM-RNN is T = 3 (more details in Section 4.4).
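A sketch of this feature extraction with torchvision's VGG-19 is shown below (torch and nn as imported in the earlier sketches). The layer indices, the weights enum and the use of the post-ReLU fc7 activation are assumptions about the torchvision implementation, not details taken from the paper; images are expected as ImageNet-normalized (B, 3, 224, 224) tensors:

```python
import torchvision.models as models

def extract_vgg_features(images):
    """Return (a, m): a = (B, 196, 512) conv5-4 local features,
    m = (B, 4096) fc7 context feature, with gradients disabled."""
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
    conv5_4 = nn.Sequential(*list(vgg.features.children())[:36])   # up to conv5-4 + ReLU
    fc7 = nn.Sequential(vgg.avgpool, nn.Flatten(),
                        *list(vgg.classifier.children())[:5])      # fc6 -> ReLU -> drop -> fc7 -> ReLU
    with torch.no_grad():
        maps = conv5_4(images)                   # (B, 512, 14, 14) feature maps
        a = maps.flatten(2).transpose(1, 2)      # (B, L = 196, F = 512) local features
        m = fc7(vgg.features(images))            # (B, 4096); separate full pass for fc7
    return a, m
```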
We compare the RCCAs with several previous state-of-the-art methods on the three datasets in Table 1, Table 2 and Table 3, respectively. All the compared methods can be classified into two classes: (1) global matching and (2) local matching. The global matching methods [1,3,4,6,7,45,47,48] directly learn global representations for the image and sentence, and then associate them by computing the global similarity in a one-to-one manner. The local matching methods [2,5,8,9,10] first detect local objects and words in the image and sentence, and then summarize their local similarities to obtain the global similarity.
From Table 1, we can see that the pretraining results already perform better than most compared methods. After CCA optimization, the performance of all the RCCAs is further improved. Note that our best RCCA* still performs a little worse than FV*. This is mainly attributed to the fact that our model has a large number of learnable parameters, which cannot be well fitted with the limited amount of training data in the Flickr8k dataset.
When exploiting the larger-scale training data of the Flickr30k (an extension of Flickr8k) and Microsoft COCO datasets in Table 2 and Table 3, RCCA* achieves much better performance than all the compared methods. Our best single model RCCA-ac outperforms both the best compared single model FV (HGLMM) on the Flickr30k dataset and the best compared single model m-CNN-st on the Microsoft COCO dataset. These observations demonstrate that dynamically learning representations for images is more suitable for association analysis.
When comparing among all the RCCAs, we can conclude as follows. (1) Using context information to modulate the attention mechanism is very necessary, since RCCA-fc, RCCA-lc and RCCA-ac all perform much better than RCCA-nc. (2) Exploiting only context information without the attention mechanism, RCCA-na achieves worse results than RCCA-fc, RCCA-lc and RCCA-ac, which verifies the effectiveness of the attention mechanism. (3) RCCA-ac performs the best among all the RCCA variants, which indicates that the best way of modelling the context is to use dense context information at all the time steps. (4) The ensemble of the five RCCA variants, RCCA*, greatly improves the performance.
To evaluate the cross-dataset generalization of the proposed model, we use the training splits of the Flickr30k and Microsoft COCO datasets to train two models, RCCA*-f30k and RCCA*-coco, respectively, and then test on the test split of the Flickr8k dataset. We can see that both RCCA*-f30k and RCCA*-coco achieve worse performance than the original RCCA* trained on Flickr8k. This mainly results from the fact that the image and sentence contents differ considerably across datasets. RCCA*-coco performs much better than RCCA*-f30k, because the training split of Microsoft COCO is larger than that of Flickr30k.

4.4. Analysis of the Number of Time Steps

For a sequential sentence, the number of time steps in the conventional LSTM-RNN is naturally the length of the sentence. For a static image, however, we have to manually set the number of time steps T. Ideally, T should roughly equal the number of salient objects appearing in the image, so that the model can separately attend to these objects within T steps to collect all the information. In the following, we gradually increase T from 1 to 9, and examine the impact of different numbers of time steps on the performance of RCCA-ac and RCCA-nc.
From Table 4 we can observe that both RCCA-ac and RCCA-nc achieve their best performance when the number of time steps is 3. This indicates that they can gather all the important information in an image by iteratively visiting it three times. Intuitively, most images contain no more than three salient objects, which is consistent with the optimal number of time steps. Note that when T becomes larger than 3, the performance of RCCA-ac and RCCA-nc drops slightly, which results from the fact that an overly complex network architecture can lead to overfitting. When comparing RCCA-ac with RCCA-nc, we find that using context information greatly improves the performance in all cases, regardless of the number of time steps. Especially for image annotation, with the aid of context information, RCCA-ac seems to exploit more time steps to attend to more contents of the image, so that it can produce more accurate results.

4.5. Visualization of Dynamical Attention Maps

To verify whether the proposed model can attend to salient regions of an image at different time steps, we visualize the dynamical attention maps predicted by the contextual attention-based LSTM-RNN. For the predicted probabilities $\{p_{t,l}\}_{l=1,\dots,196}$ at the $t$-th time step, we first reshape them into a probability map of size 14 × 14, whose layout is equivalent to that of the feature maps extracted from "conv5-4" of the VGG network. Then we resize the probability map to the same size as the corresponding original image, so that each probability in the resized map measures the importance of the image pixel at the same location. We then perform element-wise multiplication between the resized probability map and its corresponding image to obtain an attention map, where lighter areas indicate attended regions.
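A small sketch of this visualization step is given below (reusing torch.nn.functional as F from the earlier sketches); the bilinear upsampling and the rescaling of the map to a maximum of 1 are our own display choices and are not specified in the paper:

```python
def attention_overlay(image, p_t):
    """Overlay one step's attention on an image.
    image: (3, H, W) tensor in [0, 1]; p_t: (196,) attention probabilities."""
    amap = p_t.reshape(1, 1, 14, 14)                      # back to the conv5-4 grid
    amap = F.interpolate(amap, size=image.shape[1:],      # resize to the image size
                         mode="bilinear", align_corners=False)
    amap = amap / amap.max()                              # lighter = more attended
    return image * amap.squeeze(0)                        # element-wise product
```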
Dynamical attention maps at three successive time steps produced by RCCA-ac and RCCA-nc are shown in Table 5. We can see that RCCA-ac attends to different objects in the images at different time steps. Note that, without the aid of context information, RCCA-nc cannot produce attention maps as accurate as those of RCCA-ac. In particular, it cannot attend well to the "cow" and the "clock" in the two images, respectively. In fact, RCCA-nc always finishes attending to salient objects within the first two steps, and no longer focuses on meaningful regions at the third step. Different from it, RCCA-ac uses more steps to focus on more salient objects, which is also consistent with the observation in Section 4.4. These observations again demonstrate that the modelling of context can greatly facilitate the attention mechanism.

4.6. Error Analysis

By analyzing the results of our proposed model, we find that it cannot generalize well to images containing very complex content. To illustrate this, we present results of image retrieval with sentence queries in Figure 4, where the numbers in the top left corner are the returned ranks (the smaller, the better) of the ground-truth images. We can see that the ranks of the left three examples are very low and the corresponding image contents are relatively simple, while the ranks of the right three examples are very high, which indicates that our model cannot find the matched images. According to a recent state-of-the-art work [50], this is mainly because the complex image contents include many few-shot objects and attributes, which cannot be well associated with the few-shot words in the sentences.

5. Conclusions and Future Work

In this paper, we have proposed the recurrent canonical correlation analysis (RCCA) method for associating images with sentences. Our main contribution is the modelling of contextual attention mechanism for dynamically learning image representations. We have applied our model to image annotation and retrieval, and achieved better results.
In the future, we will exploit a similar attention-based scheme to learn good representations for sentences, and further verify its effectiveness on more datasets. In addition, we will try to combine the two tasks of image and sentence matching and image phrase localization into a joint framework to exploit the complementary advantages of global and local matching. To pursue higher performance, we will also consider exploiting the more detailed annotations between objects and words in Flickr30k Entities [9] for fine-grained image and sentence matching.

Author Contributions

Data curation, H.Y.; Investigation, Y.G., H.Y. and K.Z.; Methodology, Y.G.; Project administration, Y.G.; Validation, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.A.; Mikolov, T. Devise: A deep visual-semantic embedding model. In Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Lake Tahoe, NV, USA, 2013. [Google Scholar]
  2. Karpathy, A.; Joulin, A.; Li, F.F. Deep fragment embeddings for bidirectional image sentence mapping. In Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Montreal, QC, Canada, 2014. [Google Scholar]
  3. Yan, F.; Mikolajczyk, K. Deep Correlation for Matching Images and Text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  4. Klein, B.; Lev, G.; Sadeh, G.; Wolf, L. Associating Neural Word Embeddings with Deep Image Representations using Fisher Vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  5. Socher, R.; Karpathy, A.; Le, Q.V.; Manning, C.D.; Ng, A.Y. Grounded compositional semantics for finding and describing images with sentences. In Transactions of the Association for Computational Linguistics (TACL); MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  6. Chen, X.; Zitnick, C.L. Learning a recurrent visual representation for image caption generation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Columbus, OH, USA, 24–27 June 2014. [Google Scholar]
  7. Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539. [Google Scholar]
  8. Ma, L.; Lu, Z.; Shang, L.; Li, H. Multimodal Convolutional Neural Networks for Matching Image and Sentence. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  9. Plummer, B.; Wang, L.; Cervantes, C.; Caicedo, J.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  10. Karpathy, A.; Li, F.F. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  11. Rensink, R.A. The dynamic representation of scenes. Vis. Cogn. 2000, 7, 17–42. [Google Scholar] [CrossRef]
  12. Gregor, K.; Danihelka, I.; Graves, A.; Wierstra, D. DRAW: A recurrent neural network for image generation. arXiv 2015, arXiv:1502.04623. [Google Scholar]
  13. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  14. Sharma, S.; Kiros, R.; Salakhutdinov, R. Action Recognition using Visual Attention. arXiv 2015, arXiv:1511.04119. [Google Scholar]
  15. Xu, K.; Ba, J.; Kiros, R.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. arXiv 2015, arXiv:1502.03044. [Google Scholar]
  16. Albright, T.D.; Stoner, G.R. Contextual influences on visual processing. Ann. Rev. Neurosci. 2002, 25, 339–379. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Ba, J.; Mnih, V.; Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv 2014, arXiv:1412.7755. [Google Scholar]
  18. Wang, W.; Chen, C.; Wang, Y.; Jiang, T.; Fang, F.; Yao, Y. Simulating human saccadic scanpaths on natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  19. Jo, Y.; Wi, J.; Kim, M.; Lee, J.Y. Flexible Fashion Product Retrieval Using Multimodality-Based Deep Learning. Appl. Sci. 2020, 10, 1569. [Google Scholar] [CrossRef] [Green Version]
  20. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Lake Tahoe, NV, USA, 2012, pp. 1106–1114. [Google Scholar]
  21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  22. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  23. Perronnin, F.; Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, 17–22 June 2007. [Google Scholar]
  24. Huang, Y.; Wang, W.; Wang, L. Instance-aware image and sentence matching with selective multimodal lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2310–2318. [Google Scholar]
  25. Huang, Y.; Wu, Q.; Wang, W.; Wang, L. Image and sentence matching via semantic concepts and order learning. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2020, 42, 636–650. [Google Scholar] [CrossRef] [PubMed]
  26. Li, K.; Zhang, Y.; Li, K.; Li, Y.; Fu, Y. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 4654–4662. [Google Scholar]
  27. Nguyen, D.K.; Okatani, T. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10492–10501. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  29. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
  30. Graves, A. Generating sequences with recurrent neural networks. arXiv 2013, arXiv:1308.0850. [Google Scholar]
  31. Larochelle, H.; Hinton, G.E. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Vancouver, BC, Canada, 2010. [Google Scholar]
  32. Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Montreal, QC, Canada, 2014. [Google Scholar]
  33. Hu, X.; Yang, K.; Fei, L.; Wang, K. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Chicago, IL, USA, 22–25 September 2019; pp. 1440–1444. [Google Scholar]
  34. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  35. Yang, K.; Hu, X.; Chen, H.; Xiang, K.; Wang, K.; Stiefelhagen, R. Ds-pass: Detail-sensitive panoramic annular semantic segmentation through swaftnet for surrounding sensing. arXiv 2019, arXiv:1909.07721. [Google Scholar]
  36. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  37. Arevalo, J.; Solorio, T.; Montes-y Gómez, M.; González, F.A. Gated multimodal units for information fusion. arXiv 2017, arXiv:1702.01992. [Google Scholar]
  38. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  39. Davis, S.B.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. (TASSP) 1980, 28, 357–366. [Google Scholar] [CrossRef] [Green Version]
  40. Andrew, G.; Arora, R.; Bilmes, J.; Livescu, K. Deep canonical correlation analysis. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  41. Yan, F.; Mikolajczyk, K. Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  42. Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [Google Scholar] [CrossRef] [Green Version]
  43. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. (TACL) 2014, 2, 67–78. [Google Scholar] [CrossRef]
  44. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV); Springer: Zurich, Switzerland, 2014. [Google Scholar]
  45. Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Explain images with multimodal recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  46. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Vancouver, BC, Canada, 2019; pp. 8026–8037. [Google Scholar]
  47. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 24–27 June 2014. [Google Scholar]
  48. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 24–27 June 2014. [Google Scholar]
  49. Kiros, R.; Zhu, Y.; Salakhutdinov, R.R.; Zemel, R.; Urtasun, R.; Torralba, A.; Fidler, S. Skip-thought vectors. In Neural Information Processing Systems (NeurIPS); NeurIPS Foundation: Montreal, QC, Canada, 2015. [Google Scholar]
  50. Huang, Y.; Wang, L. Acmm: Aligned cross-modal memory for few-shot image and sentence matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5774–5783. [Google Scholar]
Figure 1. The proposed recurrent canonical correlation analysis (RCCA).
Figure 2. Illustrations of contextual attention scheme and LSTM-RNN at the t-th time step.
Figure 3. Different usages of context information in the four RCCA variants.
Figure 4. Example results of sentence-based image retrieval. The top left numbers are returned ranks of groundtruth matched images.
Table 1. Comparison results of image annotation and retrieval on the Flickr8K dataset (* indicates ensemble or multi-model methods). Columns: image annotation (R@1, R@5, R@10, Med r), image retrieval (R@1, R@5, R@10, Med r), and Sum.

| Method | R@1 | R@5 | R@10 | Med r | R@1 | R@5 | R@10 | Med r | Sum |
|---|---|---|---|---|---|---|---|---|---|
| DeViSE [1] | 4.8 | 16.5 | 27.3 | 28.0 | 5.9 | 20.1 | 29.6 | 29 | 104.2 |
| SDT-RNN [5] | 6.0 | 22.7 | 34.0 | 23.0 | 6.6 | 21.6 | 31.7 | 25 | 122.6 |
| Deep Fragment [2] | 12.6 | 32.9 | 44.0 | 14 | 9.7 | 29.6 | 42.5 | 15 | 171.3 |
| RVP (T+I) [6] | 11.7 | 34.8 | 48.6 | 11.2 | 11.4 | 32.0 | 46.2 | 11 | 184.7 |
| m-RNN [45] | 14.5 | 37.2 | 48.5 | 11 | 11.5 | 31.0 | 42.4 | 15 | 185.0 |
| DCCA [3] | 17.9 | 40.3 | 51.9 | 9 | 12.7 | 31.2 | 44.1 | 13 | 197.9 |
| DVSA (BRNN) [10] | 16.5 | 40.6 | 54.2 | 7.6 | 11.8 | 32.1 | 44.7 | 12.4 | 199.9 |
| MNLM [7] | 18.0 | 40.9 | 55.0 | 8 | 12.5 | 37.0 | 51.5 | 10 | 214.9 |
| NIC [47] | 20.0 | - | 61.0 | 6 | 19.0 | - | 64.0 | 5 | - |
| m-CNN-st [8] | 18.1 | 44.1 | 57.9 | 7 | 14.6 | 38.5 | 53.5 | 9 | 226.7 |
| m-CNN* [8] | 24.8 | 53.7 | 67.1 | 5 | 20.3 | 47.6 | 61.7 | 5 | 275.2 |
| FV (HGLMM) [4] | 28.5 | 58.4 | 71.7 | 4 | 20.6 | 49.4 | 64.0 | 6 | 292.6 |
| FV* [4] | 31.0 | 59.3 | 73.7 | 4 | 21.3 | 50.0 | 64.8 | 5 | 300.1 |
| Ours: | | | | | | | | | |
| RCCA-nc | 19.5 | 44.8 | 58.2 | 7 | 14.7 | 39.9 | 53.0 | 9 | 230.1 |
| RCCA-na | 22.8 | 48.9 | 63.3 | 6 | 16.6 | 41.1 | 54.0 | 9 | 246.7 |
| RCCA-fc | 23.3 | 49.4 | 64.0 | 6 | 17.0 | 41.2 | 53.5 | 8 | 248.4 |
| RCCA-lc | 25.6 | 51.9 | 65.4 | 5 | 17.7 | 41.2 | 53.7 | 9 | 255.5 |
| RCCA-ac | 26.5 | 55.0 | 67.9 | 4 | 18.2 | 45.3 | 58.2 | 7 | 271.1 |
| RCCA*-f30k | 28.1 | 56.1 | 66.3 | 4 | 18.5 | 43.8 | 59.6 | 7 | 272.4 |
| RCCA*-coco | 29.0 | 57.7 | 68.2 | 4 | 19.7 | 45.3 | 61.1 | 6 | 281.0 |
| RCCA* | 30.3 | 59.6 | 69.7 | 3 | 20.6 | 47.9 | 62.1 | 6 | 290.2 |
Table 2. Comparison results of image annotation and retrieval on the Flickr30K dataset (* indicates ensemble or multi-model methods, and † indicates using manual annotations). Columns: image annotation (R@1, R@5, R@10, Med r), image retrieval (R@1, R@5, R@10, Med r), and Sum.

| Method | R@1 | R@5 | R@10 | Med r | R@1 | R@5 | R@10 | Med r | Sum |
|---|---|---|---|---|---|---|---|---|---|
| DeViSE [1] | 4.5 | 18.1 | 29.2 | 26 | 6.7 | 21.9 | 32.7 | 25 | 113.1 |
| SDT-RNN [5] | 9.6 | 29.8 | 41.1 | 16 | 8.9 | 29.8 | 41.1 | 16 | 160.3 |
| RVP (T+I) [6] | 12.1 | 27.8 | 47.8 | 11 | 12.7 | 33.1 | 44.9 | 12.5 | 178.4 |
| Deep Fragment [2] | 14.2 | 37.7 | 51.3 | 10 | 10.2 | 30.8 | 44.2 | 14 | 188.4 |
| DCCA [3] | 16.7 | 39.3 | 52.9 | 8 | 12.6 | 31.0 | 43.0 | 15 | 195.5 |
| NIC [47] | 17.0 | - | 56.0 | 7 | 17.0 | - | 57.0 | 7 | - |
| DVSA (BRNN) [10] | 22.2 | 48.2 | 61.4 | 4.8 | 15.2 | 37.7 | 50.5 | 9.2 | 235.2 |
| MNLM [7] | 23.0 | 50.7 | 62.9 | 5 | 16.8 | 42.0 | 56.5 | 8 | 251.9 |
| LRCN [48] | - | - | - | - | 17.5 | 40.3 | 50.8 | 9 | - |
| m-RNN [45] | 35.4 | 63.8 | 73.7 | 3 | 22.8 | 50.7 | 63.1 | 5 | 309.5 |
| FV (HGLMM) [4] | 34.4 | 61.0 | 72.3 | 3 | 24.4 | 52.1 | 65.6 | 5 | 309.8 |
| FV* [4] | 35.0 | 62.0 | 73.8 | 3 | 25.0 | 52.7 | 66.0 | 5 | 314.5 |
| m-CNN-st [8] | 27.0 | 56.4 | 70.1 | 4 | 19.7 | 48.4 | 62.3 | 6 | 283.9 |
| m-CNN* [8] | 33.6 | 64.1 | 74.9 | 3 | 26.2 | 56.3 | 69.6 | 4 | 324.7 |
| RTP* [9] | 37.4 | 63.1 | 74.3 | - | 26.0 | 56.0 | 69.3 | - | 326.1 |
| Ours: | | | | | | | | | |
| RCCA-nc | 27.5 | 53.3 | 66.9 | 5 | 20.9 | 46.7 | 58.8 | 7 | 274.1 |
| RCCA-na | 34.4 | 61.6 | 71.2 | 3 | 23.9 | 51.1 | 61.8 | 5 | 304.0 |
| RCCA-fc | 32.2 | 60.4 | 72.3 | 3 | 23.7 | 53.1 | 65.0 | 5 | 306.7 |
| RCCA-lc | 32.2 | 61.2 | 72.4 | 3 | 23.9 | 53.4 | 65.8 | 5 | 308.9 |
| RCCA-ac | 36.0 | 65.8 | 75.6 | 3 | 25.8 | 53.9 | 65.7 | 5 | 322.8 |
| RCCA* | 39.3 | 68.7 | 78.2 | 2 | 28.7 | 57.2 | 69.8 | 4 | 341.9 |
Table 3. Comparison results of image annotation and retrieval on the Microsoft COCO dataset (* indicates ensemble or multi-model methods). Columns: image annotation (R@1, R@5, R@10, Med r), image retrieval (R@1, R@5, R@10, Med r), and Sum.

| Method | R@1 | R@5 | R@10 | Med r | R@1 | R@5 | R@10 | Med r | Sum |
|---|---|---|---|---|---|---|---|---|---|
| STD (bi-skip) [49] | 32.7 | 67.3 | 79.6 | 3 | 24.2 | 57.1 | 73.2 | 4 | 334.1 |
| STD* [49] | 33.8 | 67.7 | 82.1 | 3 | 25.9 | 60.0 | 74.6 | 4 | 344.1 |
| m-RNN [45] | 41.0 | 73.0 | 83.5 | 2 | 29.0 | 42.2 | 77.0 | 3 | 345.7 |
| FV (HGLMM) [4] | 37.7 | 66.6 | 79.1 | 3 | 24.9 | 58.8 | 76.5 | 4 | 343.6 |
| FV* [4] | 39.4 | 67.9 | 80.9 | 2 | 25.1 | 59.8 | 76.6 | 4 | 349.7 |
| DVSA [10] | 38.4 | 69.9 | 80.5 | 1 | 27.4 | 60.2 | 74.8 | 3 | 351.2 |
| MNLM [7] | 43.4 | 75.7 | 85.8 | 2 | 31.0 | 66.7 | 79.9 | 3 | 382.5 |
| m-CNN-st [8] | 38.3 | 69.6 | 81.0 | 2 | 27.4 | 63.4 | 79.5 | 3 | 359.2 |
| m-CNN* [8] | 42.8 | 73.1 | 84.1 | 2 | 32.6 | 68.6 | 82.8 | 3 | 384.0 |
| Ours: | | | | | | | | | |
| RCCA-nc | 37.4 | 70.3 | 81.5 | 2 | 29.7 | 65.3 | 79.8 | 3 | 364.0 |
| RCCA-na | 40.7 | 71.0 | 84.6 | 2 | 32.9 | 68.8 | 81.0 | 3 | 379.0 |
| RCCA-fc | 41.9 | 73.5 | 84.1 | 2 | 33.4 | 68.1 | 81.8 | 3 | 382.8 |
| RCCA-lc | 42.3 | 77.8 | 87.9 | 2 | 34.3 | 68.4 | 81.0 | 3 | 391.7 |
| RCCA-ac | 44.9 | 79.6 | 87.7 | 2 | 35.8 | 71.2 | 83.3 | 2 | 402.5 |
| RCCA* | 49.4 | 80.1 | 89.5 | 2 | 37.9 | 73.5 | 84.9 | 2 | 415.3 |
Table 4. Different numbers of time steps on the Microsoft COCO dataset. T: number of time steps in the contextual attention-based LSTM-RNN. Columns: image annotation (R@1, R@5, R@10, Med r), image retrieval (R@1, R@5, R@10, Med r), and Sum.

| Method | R@1 | R@5 | R@10 | Med r | R@1 | R@5 | R@10 | Med r | Sum |
|---|---|---|---|---|---|---|---|---|---|
| RCCA-ac: | | | | | | | | | |
| T = 1 | 42.5 | 74.9 | 86.0 | 2 | 32.1 | 67.7 | 81.4 | 3 | 384.6 |
| T = 3 | 44.9 | 79.6 | 87.7 | 2 | 35.8 | 71.2 | 83.3 | 2 | 402.5 |
| T = 5 | 44.0 | 78.0 | 86.8 | 2 | 35.4 | 71.0 | 83.1 | 2 | 398.3 |
| T = 7 | 44.1 | 76.9 | 86.7 | 2 | 35.5 | 71.1 | 83.2 | 2 | 397.5 |
| T = 9 | 43.2 | 76.5 | 87.1 | 2 | 34.7 | 71.0 | 83.1 | 2 | 395.6 |
| RCCA-nc: | | | | | | | | | |
| T = 1 | 38.5 | 70.1 | 80.9 | 2 | 28.0 | 63.2 | 76.1 | 3 | 356.8 |
| T = 3 | 37.4 | 70.3 | 81.5 | 2 | 29.7 | 65.3 | 79.8 | 3 | 364.0 |
| T = 5 | 34.9 | 67.1 | 78.9 | 3 | 28.8 | 64.5 | 77.5 | 3 | 351.7 |
| T = 7 | 36.2 | 65.6 | 77.9 | 3 | 28.9 | 64.0 | 77.2 | 3 | 350.9 |
| T = 9 | 32.6 | 64.7 | 76.6 | 3 | 27.1 | 62.6 | 75.7 | 3 | 339.2 |
Table 5. Visualization of dynamical attention maps on the Microsoft COCO dataset. Columns: input image, followed by the attention maps of RCCA-ac and RCCA-nc at the 1st, 2nd and 3rd time steps. (The two example rows consist of images only and are not reproduced here.)
