Associating Images with Sentences Using Recurrent Canonical Correlation Analysis

Associating images with sentences has drawn much attention recently. Existing methods commonly represent an image by indistinctively describing all of its contents in a one-time static way, which ignores two facts: (1) association analysis can only be effective for the partial salient contents described by the associated sentence, and (2) visual information acquisition is a dynamic rather than static process. To deal with this issue, we propose a recurrent canonical correlation analysis (RCCA) method for associating images with sentences. RCCA includes a contextual attention-based LSTM-RNN which can selectively attend to salient regions of an image at each time step, and thereby represent all the salient contents within a few steps. Different from existing attention-based models, our model focuses on modelling the contextual visual attention mechanism for the task of association analysis. RCCA also includes a conventional LSTM-RNN for sentence representation learning. The resulting representations of images and sentences are fed into CCA to maximize their linear correlation, where the parameters of the LSTM-RNNs and CCA are jointly learned. Due to its effective image representation learning, our model can well associate images having complex contents with sentences, and achieves better performance in image annotation and retrieval.


Introduction
Associating images with sentences plays an important role in many applications, e.g., finding sentences given an image query for image annotation and captioning, and retrieving images with a sentence query for image search. Recently, various methods have been proposed to deal with this problem. Most of them first extract representations for images and sentences, and then use either a structured objective [1,2] or a canonical correlation objective [3,4] to learn a latent joint space for association analysis, in which matched images and sentences have a small distance or high correlation.
Obtaining effective representations for images is crucial for association analysis. An image usually contains very abundant contents, but its associated sentence can only describe the partial salient contents due to its limited length. Thus the subsequent association analysis can only be effective for these salient contents and their corresponding textual descriptions, since only they share the same semantics. As a result, the remaining undescribed contents in the image are redundant, and tend to bring noisy side effects into the association analysis. However, most existing methods [1,[3][4][5][6][7][8] ignore this fact and represent an image by simultaneously processing its entire contents. Although other methods [2,9,10] consider associating the salient contents (regions) of the image with the corresponding descriptions (words) of the sentence, they have to pre-establish the salient regions using manual annotations or detection algorithms. In addition, existing methods usually represent the visual information in a one-time static process, which contradicts the neuroscience fact [11] that visual information acquisition is actually a dynamic rather than static process, in which the selective attention mechanism plays an essential role. In particular, the human visual system does not focus its attention on an entire image only once. Instead, it selectively attends to salient regions at different locations to collect all important information in a sequential way.
Recently, some attention-based models have been proposed for various vision applications [12][13][14][15], but the modelling of the attention mechanism in the context of association analysis is still underexplored. In fact, existing models mostly focus on recognizing salient image regions using dense supervision information, which makes them less applicable to association analysis, which aims to fuse the information of all the salient regions into a unified representation. Moreover, they rarely consider the modelling of context [16], which actually has a prominent effect on the attention mechanism. Ba et al. [17] consider exploiting the context information only at the first time step of the attention process, but they ignore the fact that context information modulates the attention mechanism at all time steps [18].
In this work, we propose a recurrent canonical correlation analysis (RCCA) method for associating images with sentences using visual attention. The proposed RCCA includes a contextual attention-based LSTM-RNN for image representation learning, which can selectively focus on salient regions of an image by referring to context information at each time step, and sequentially aggregate the local features corresponding to the attended regions over all time steps to obtain the final image representation. RCCA also exploits a conventional LSTM-RNN to learn a representation for the associated sentence, taking its words in order as sequential inputs. The obtained representations of images and sentences are then fed into CCA for association analysis, where the parameters of the LSTM-RNNs and CCA are jointly learned by maximizing the linear correlation between images and sentences. To demonstrate the effectiveness of the proposed RCCA, we perform several experiments on image annotation and retrieval on three publicly available datasets, and achieve satisfactory results.
Our main contributions can be summarized as follows. We propose recurrent canonical correlation analysis for associating images with sentences, which models the contextual visual attention mechanism and learns image representations in a biologically plausible dynamic process. We systematically study the modelling of context in conjunction with the attention mechanism, and demonstrate that using dense context information helps substantially. We achieve better performance on the tasks of image annotation and retrieval.

Association between Images and Sentences
Existing methods for associating images with sentences usually include two steps: (1) feature extraction for images and sentences, and (2) association analysis based on the obtained features. Many recent methods [2][3][4][6][7][8][9][10][19] exploit CNNs [20,21] to obtain image features, i.e., the 4096-dimensional output vector of the last fully-connected layer. For sentences, since they are sequential data, most methods use LSTM-RNNs [22] to model their long-range temporal dependencies, and take the hidden state at the last step as the representation. After obtaining features, they use either a structured objective [1,2,7] or a canonical correlation objective [3,4,9] to learn a latent joint space, in which matched image-sentence pairs have a small distance or high correlation.
It is believed that obtaining good image representations, such as those from CNNs and Fisher vectors (FV) [4,23], can greatly facilitate the subsequent association analysis. However, these methods indistinctively describe all the contents of an image, which might be suboptimal since only the salient contents and their associated textual descriptions can be effective in association analysis. Although other works [2,9,10] consider associating these salient regions with their descriptions, they have to explicitly use object detectors or manual annotations to pre-establish the salient regions. Different from them, our model can automatically select and attend to salient regions in the image by modelling the contextual attention mechanism, which is more efficient and biologically plausible.
The most related work to ours is [24], but it has to perform a one-to-one similarity measurement during testing, which makes it much more time-consuming than ours. Note that we do not make comparisons with recent state-of-the-art methods [25][26][27], since they all use advanced image representation modules such as ResNet [28] and bottom-up features [29].

Attention-Based Models
There have been several works simulating the attention mechanism. Graves [30] exploits RNNs and differentiable Gaussian filters to simulate the attention mechanism, and applies it to handwriting synthesis, where the model can generate output conditioned on an auxiliary annotation sequence. Gregor et al. [12] introduce the deep recurrent attentive writer for image generation, which develops a novel spatial attention mechanism based on 2-dimensional Gaussian filters to mimic the foveation of human eyes. Larochelle and Hinton [31] propose high-order Boltzmann machines to learn how to accumulate information from an image over several fixations. Mnih et al. [32] present an RNN-based model that selects a sequence of regions and then processes only the selected regions of an image to extract information. Ba et al. [17] present a recurrent attention model for multiple object recognition; trained with reinforcement learning, the model can attend to the most label-relevant regions of an image. Bahdanau et al. [13] propose a neural machine translator which can search for relevant parts of a source sentence to predict a target word, without explicit alignment. Inspired by it, Xu et al. [15] develop an attention-based captioning model which can automatically learn to fix gazes on salient objects in an image and generate the corresponding annotated words. In addition, this work is also related to recent works [33][34][35] on semantic segmentation with attention-based networks. To the best of our knowledge, we make the first attempt to consider the attention mechanism in the context of association between images and sentences, and systematically study the modelling of the contextual attention mechanism.

Association between Images and Sentences via Recurrent Canonical Correlation Analysis
In this section, we introduce the proposed recurrent canonical correlation analysis (RCCA) for associating images with sentences, by describing its three constituent modules in detail: (1) a contextual attention-based LSTM-RNN for image representation learning, (2) a conventional LSTM-RNN for sentence representation learning, and (3) canonical correlation analysis on the resulting image and sentence representations for model learning.

Image Representation Using Contextual Attention-Based LSTM-RNN
The contextual attention-based LSTM-RNN is illustrated in Figure 1 (at the bottom). Given an image, we first use a CNN [36] to extract two kinds of features: (1) multiple local features {a_l | a_l ∈ R^F}_{l=1,...,L}, i.e., feature maps from the last convolutional layer, where a_l describes the l-th local region in the image and L is the total number of local regions, and (2) a context feature m ∈ R^D, i.e., a feature vector from the last fully-connected layer, describing the global context information of the image. Based on these features, we exploit a contextual attention scheme to obtain sequential representations {e_t}_{t=1,...,T} for the static image, where e_t is a combined local feature describing the attended salient regions at the t-th time step. Then we use an LSTM-RNN to model the long-range temporal dependency in the sequential representations, by separately taking them as inputs to the hidden states {h_t ∈ R^{D_x}}_{t=1,...,T} at different time steps. As time goes on, all the hidden states propagate their values along the time axis until the end, and thus the hidden state at the last time step, h_T, is the fused representation of all the attended salient regions.
The details of the contextual attention scheme at the t-th time step can be formulated as follows. As illustrated in Figure 2a, given the context feature m, the local feature set {a_l | a_l ∈ R^F}_{l=1,...,L} and the hidden state at the previous time step h_{t−1}, we compute attention probabilities {p_{t,l} | p_{t,l} ∈ [0, 1]}_{l=1,...,L} for all the local regions, where p_{t,l} indicates the probability that the l-th region will be attended to at the t-th time step. It is computed via a gated fusion unit f_gated(·) [37] that fuses all the context information, taking ĥ_{t−1}, m and a_l as input, where ĥ_{t−1} is obtained by modulating the previous hidden state h_{t−1} with the context information m.
Note that the usage of context information is necessary, since it not only provides complementary global information about the currently attended regions, but also encourages the model to focus on other unexplored regions under the principle of information maximization. After obtaining the probabilities, we exploit the "soft attention" scheme [13] to compute the attended local feature e_t: we perform element-wise multiplication between each local feature a_l and its corresponding probability p_{t,l}, and sum all the products together:

e_t = Σ_{l=1}^{L} p_{t,l} · a_l.

Then we use an LSTM-RNN [38] to model the temporal dependency in the sequential representations {e_t}_{t=1,...,T}, as shown in Figure 2b, where the memory state c_t (with initial state c_0), the hidden state h_t (with initial state h_0), the input gate i_t, the forget gate f_t and the output gate o_t follow the standard LSTM formulation; c_0 and h_0 are given by two multilayer perceptrons, f_MLP,c(·) and f_MLP,h(·), associated with c and h, respectively; σ denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication.
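As a minimal sketch of one attention step, consider the following NumPy code. The paper's gated fusion unit f_gated is replaced here by a simple additive scoring MLP, and the parameter names (W_a, W_m, W_h, v) are illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_step(a, m, h_prev, W_a, W_m, W_h, v):
    """One step of a contextual soft-attention scheme (a sketch).

    a:      (L, F) local features, one row per region
    m:      (D,)   global context feature
    h_prev: (Dx,)  previous hidden state h_{t-1}
    W_a, W_m, W_h, v: hypothetical parameters of an additive scoring MLP
    """
    # Score every region from its local feature, the context and h_{t-1}.
    scores = np.tanh(a @ W_a + m @ W_m + h_prev @ W_h) @ v   # (L,)
    p = softmax(scores)                                      # p_{t,l}, sums to 1
    # "Soft attention": weight each local feature by its probability and sum.
    e = (p[:, None] * a).sum(axis=0)                         # attended feature e_t
    return p, e
```

In the full model, e_t would be fed as the input of the LSTM-RNN at step t, and p would be reused as the attention map for visualization.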

Sentence Representation Using Conventional LSTM-RNN
Considering that sentences are naturally sequential data consisting of ordered words, and that a conventional LSTM-RNN can well model long-range contextual information, we directly exploit the conventional LSTM-RNN to obtain sentence representations. As shown in Figure 1 (at the top), the conventional LSTM-RNN takes one word as the input to the corresponding hidden state at each time step, and all the hidden states propagate their values along the time axis. The hidden state at the last time step collects all the information in the sentence, and can thus be regarded as the representation of the entire sentence.
We denote [w_1, . . . , w_t, . . . , w_T] as a sentence associated with the image, where w_t ∈ {0, 1}^M is the t-th word in the sentence, T is the number of words in the sentence (also the number of time steps in the conventional LSTM-RNN), M is the size of the word vocabulary, and {h_t ∈ R^{D_y}}_{t=1,...,T} are the hidden states at all time steps. Since the original representations of words are high-dimensional "one-hot" vectors, we embed them into low-dimensional real-valued vectors using an embedding matrix E ∈ R^{U×M}. The conventional LSTM-RNN then follows the standard formulation; different from the contextual attention-based LSTM-RNN, here we do not include any explicit definition of the initial memory state and hidden state, similar to [38].
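Since each word is a one-hot vector, multiplying it by E simply selects a column of E. A tiny sketch (the function name is illustrative):

```python
import numpy as np

def embed_words(word_ids, E):
    """Map one-hot word indices to low-dimensional vectors with the
    embedding matrix E of shape (U, M); E @ onehot(w) is just a column
    selection, so we index columns directly."""
    return E[:, word_ids].T        # (T, U) sequence of embedded word vectors
```

The resulting (T, U) sequence is what the conventional LSTM-RNN consumes, one row per time step.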

Model Learning with Canonical Correlation Analysis
We regard the last hidden states of the two modality-specific LSTM-RNNs as the representations of images and sentences, denoted by X ∈ R^{D_x×N} and Y ∈ R^{D_y×N}, respectively, where N is the number of given image-sentence pairs, and D_x and D_y are the dimensions of the obtained representations of images and sentences, respectively. We feed these representations to canonical correlation analysis (CCA) to learn two linear projections W_x and W_y for X and Y, respectively, such that their correlation is high:

(W_x*, W_y*) = argmax_{W_x, W_y} corr(W_x^T X, W_y^T Y) = argmax_{W_x, W_y} (W_x^T Σ_xy W_y) / sqrt((W_x^T Σ_xx W_x)(W_y^T Σ_yy W_y)).

The covariances are estimated as Σ_xx = (1/(N−1)) X̄X̄^T + λ_x I, Σ_yy = (1/(N−1)) ȲȲ^T + λ_y I and Σ_xy = (1/(N−1)) X̄Ȳ^T, respectively, where X̄ and Ȳ are the centered representations, and λ_x I and λ_y I are regularization terms to ensure the positive definiteness of Σ_xx and Σ_yy. Since the objective is invariant to the scaling of W_x and W_y, the projections can be constrained to have unit variance. Then, by assembling the top k projection vectors into the columns of the projection matrices Ŵ_x and Ŵ_y, the objective can be written as:

maximize tr(Ŵ_x^T Σ_xy Ŵ_y) subject to Ŵ_x^T Σ_xx Ŵ_x = Ŵ_y^T Σ_yy Ŵ_y = I.

According to [39], the optimum of this objective is attained at Ŵ_x = Σ_xx^{−1/2} U_k and Ŵ_y = Σ_yy^{−1/2} V_k, where U_k and V_k are the matrices of the first k left- and right-singular vectors of T = Σ_xx^{−1/2} Σ_xy Σ_yy^{−1/2}, respectively. As a result, the objective is equivalent to the sum of the top k singular values of T; with k equal to the full dimension, this is exactly the trace norm:

corr(X, Y) = ||T||_tr = tr((T^T T)^{1/2}).

It should be noted that the parameters of the two modality-specific LSTM-RNNs and CCA are jointly learned, where the parameters of the LSTM-RNNs are adaptively trained to optimize the CCA objective using stochastic gradient descent. The gradients of corr(X, Y) with respect to the parameters of the LSTM-RNNs can be computed by first obtaining the gradients with respect to X and Y as follows, and then performing backpropagation.
∂corr(X, Y)/∂X = (1/(N−1)) (2 ∇_xx X̄ + ∇_xy Ȳ), with ∇_xy = Σ_xx^{−1/2} U V^T Σ_yy^{−1/2} and ∇_xx = −(1/2) Σ_xx^{−1/2} U D U^T Σ_xx^{−1/2},

where UDV^T is the singular value decomposition (SVD) of T; the gradient with respect to Y is symmetric. Similar to [40,41], although the objective of CCA is a function of the entire training set, we perform mini-batch based optimization due to the high computational cost. With such an implementation, directly optimizing the parameters of the LSTM-RNNs from random initialization often cannot obtain good results. To provide a better initialization for the parameters, we exploit a structured objective for parameter pretraining, which has been widely used in cross-modal learning [2,7,10,41]:

L(W_LSTM-RNN) = Σ_i Σ_{k=1}^{K} [ max(0, m − d(x_i, y_i) + d(x_k, y_i)) + max(0, m − d(x_i, y_i) + d(x_i, y_k)) ],

where W_LSTM-RNN denotes all the parameters of the two LSTM-RNNs, (x_i, y_i) are the obtained representations of the i-th matched image and sentence, x_k is a mismatched (contrastive) image representation for y_i (and vice versa for y_k), K is the total number of contrastive samples, m is a margin parameter and d(·, ·) is a cosine similarity measurement. This structured objective encourages the representations of matched images and sentences to be more similar than those of mismatched ones. Such initial representations are helpful for the subsequent CCA optimization, since maximizing the correlation between two already-similar representations is much easier.
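A minimal sketch of this bidirectional max-margin pretraining objective is below. The exact form used in the paper is paraphrased from the description above, and the function and argument names are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity d(., .) between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def structured_loss(x, y, x_neg, y_neg, m=0.2):
    """Bidirectional ranking loss: matched pairs (x[i], y[i]) should score
    at least margin m above any mismatched pair.

    x, y:         lists of matched image / sentence representations
    x_neg, y_neg: per-pair lists of contrastive image / sentence vectors
    """
    loss = 0.0
    for i in range(len(x)):
        pos = cosine(x[i], y[i])
        for xk in x_neg[i]:        # mismatched images for sentence y_i
            loss += max(0.0, m - pos + cosine(xk, y[i]))
        for yk in y_neg[i]:        # mismatched sentences for image x_i
            loss += max(0.0, m - pos + cosine(x[i], yk))
    return loss
```

When a matched pair is already m more similar than every contrastive pair, the hinge terms vanish and the loss is zero, which is exactly the initialization the CCA stage benefits from.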
In addition, we add a doubly stochastic regularization [15] to the objective, which forces the contextual attention-based LSTM-RNN to pay equal attention to every local region of the image during the dynamic attention process:

λ Σ_{l=1}^{L} (1 − Σ_{t=1}^{T} p_{t,l})^2,

where λ is a balancing parameter. In our experiments, we find that using this regularization further improves the performance.
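To make the learning objectives of this section concrete, here is a minimal NumPy sketch, an illustration rather than the training implementation, of the total correlation corr(X, Y) (the sum of singular values of T) and the doubly stochastic penalty:

```python
import numpy as np

def cca_corr(X, Y, lam_x=0.1, lam_y=0.1):
    """Total CCA correlation: sum of singular values of
    T = Sigma_xx^{-1/2} Sigma_xy Sigma_yy^{-1/2}, with X (Dx, N), Y (Dy, N)."""
    N = X.shape[1]
    Xb = X - X.mean(axis=1, keepdims=True)          # centered representations
    Yb = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xb @ Xb.T / (N - 1) + lam_x * np.eye(X.shape[0])
    Syy = Yb @ Yb.T / (N - 1) + lam_y * np.eye(Y.shape[0])
    Sxy = Xb @ Yb.T / (N - 1)

    def inv_sqrt(S):                                # S is symmetric positive definite
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False).sum()

def doubly_stochastic_penalty(p, lam=10.0):
    """Regularizer pushing sum_t p_{t,l} toward 1 for every region l.
    p holds attention probabilities with shape (T, L)."""
    return lam * ((1.0 - p.sum(axis=0)) ** 2).sum()
```

In training, the correlation is maximized (its negative minimized) while the penalty is added to the loss; an automatic-differentiation framework would supply the gradients derived above.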

Experimental Results
To demonstrate the effectiveness of the proposed RCCA, we perform experiments of image annotation and retrieval on three publicly available datasets.

Datasets and Protocols
The three evaluation datasets and their corresponding experimental protocols are described as follows. (1) Flickr8k [42] consists of 8000 images collected from the Flickr website, each of which may contain people or animals performing actions, and is accompanied by five sentences describing the content of the image. This dataset provides standard training, validation and testing splits, with 6000, 1000 and 1000 images, respectively. (2) Flickr30k [43] is an extension of Flickr8k, which consists of 31,783 images collected from the Flickr website. Each image is also accompanied by five sentences, annotated in a similar way as those in Flickr8k. We use the public training, validation and testing splits, which contain 28,000, 1000 and 1000 images, respectively.
(3) Microsoft COCO [44] consists of 82,783 training and 40,504 validation images, each of which is associated with five sentences. We use the public training, validation and testing splits [45], with 82,783, 4000 and 1000 images, respectively.

Experimental Details
During model learning, we use Stochastic Gradient Descent (SGD) to optimize the parameters, with a learning rate of 0.005, a batch size of 128 and gradient clipping to [−10, 10]. The model is trained for 30 epochs to guarantee convergence, using the PyTorch framework [46] accelerated with an NVIDIA Titan X GPU.
The tasks of image annotation and retrieval can be jointly formulated as cross-modal retrieval, i.e., given an image, the goal is to retrieve highly matched sentences for annotation, and vice versa. The commonly used evaluation criteria are "R@1", "R@5" and "R@10", i.e., the recall rate at the top 1, 5 and 10 results. Another is "Med r", the median rank of the first ground-truth result. We also compute an additional criterion, "Sum", to evaluate the overall performance of both image annotation and retrieval.
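These criteria can be computed directly from a cross-modal similarity matrix; a small sketch (assuming the ground-truth match of query i is gallery item i, and the function name is illustrative):

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute R@K and Med r from a similarity matrix sim (N_query, N_gallery),
    where the ground-truth match of query i is gallery item i."""
    N = sim.shape[0]
    # Rank of the ground truth = 1 + number of gallery items scored higher.
    ranks = np.array([1 + (sim[i] > sim[i, i]).sum() for i in range(N)])
    recalls = {k: float((ranks <= k).mean()) for k in ks}
    med_r = float(np.median(ranks))
    return recalls, med_r
```

Running it on the image-to-sentence similarity matrix gives the annotation scores, and on its transpose the retrieval scores; "Sum" adds the six recall values.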

Image and Sentence Matching
To systematically study the modelling of the contextual attention mechanism, as shown in Figure 3, we propose four variants of RCCA: (a) RCCA-nc does not use any context information at any time step in the contextual attention-based LSTM-RNN, (b) RCCA-fc uses the context information only at the first time step, (c) RCCA-lc uses the context information only at the last time step, and (d) RCCA-na does not use the attention mechanism but only the context information. We denote by RCCA-ac the original model in Figure 1, which uses the context information at all time steps. We also develop an ensemble model, RCCA*, in a similar way as [8], by summing the five cross-modal similarity matrices generated from the five RCCAs. For all the RCCAs, we use the 19-layer VGG Network [21] to extract 512 feature maps (with a size of 14 × 14) from the "conv5-4" layer as multiple local features, and a feature vector from the "fc7" layer as the context feature. Therefore, the dimensions of the local and context features are F = 512 and D = 4096, respectively, and the total number of local regions is L = 196. The hyperparameters of the RCCAs are set as follows: D_x = 1024, D_y = 1024, λ_x = 0.1, λ_y = 0.1, λ = 10, K = 100, U = 300 and m = 0.2. For the pretraining, we find that 3 pretraining epochs always achieve satisfactory performance. The number of time steps in the contextual attention-based LSTM-RNN is T = 3 (more details in Section 4.4). We compare the RCCAs with several previous state-of-the-art methods on the three datasets in Tables 1-3, respectively. All the compared methods can be classified into two classes: (1) global matching and (2) local matching. The global matching methods [1,3,4,6,7,45,47,48] directly learn global representations for images and sentences, and then associate them by computing the global similarity in a one-to-one manner.
The local matching methods [2,5,[8][9][10]] first detect local objects and words in the image and sentence, respectively, and then summarize their local similarities to obtain the global similarity. From Table 1, we can see that the pretraining results already perform better than most compared methods. After CCA optimization, the performance of all the RCCAs is further improved. Note that our best model, RCCA*, still performs a little worse than FV*. This is mainly attributed to the fact that our model has a large number of learnable parameters, which cannot be well fitted using only the limited amount of training data in the Flickr8k dataset.
When exploiting the larger-scale training data of the Flickr30k (an extension of Flickr8k) and Microsoft COCO datasets in Tables 2 and 3, RCCA* achieves much better performance than all the compared methods. Our best single model, RCCA-ac, outperforms both the best single model FV (HGLMM) on the Flickr30k dataset and the best single model m-CNN-st on the Microsoft COCO dataset. These observations demonstrate that dynamically learning representations for images is more suitable for association analysis. Table 2. Comparison results of image annotation and retrieval on the Flickr30K dataset (* indicates ensemble or multi-model methods, and † indicates using manual annotations).

When comparing among all the RCCAs, we can conclude as follows.
(1) Using context information to modulate the attention mechanism is necessary, since RCCA-fc, RCCA-lc and RCCA-ac all perform much better than RCCA-nc. (2) Exploiting only context information without the attention mechanism, RCCA-na achieves worse results than RCCA-fc, RCCA-lc and RCCA-ac, which verifies the effectiveness of the attention mechanism. (3) RCCA-ac performs the best among all the RCCA variants, which indicates that the best way of modelling the context is to use dense context information at all time steps. (4) The ensemble of the five RCCA variants, RCCA*, greatly improves the performance.
To evaluate the cross-dataset generalization of the proposed model, we use the training splits of the Flickr30k and Microsoft COCO datasets to train two models, RCCA*-f30k and RCCA*-coco, respectively, and then test on the test split of the Flickr8k dataset. Both RCCA*-f30k and RCCA*-coco achieve worse performance than the original RCCA*, mainly because the image and sentence contents differ considerably across datasets. RCCA*-coco performs much better than RCCA*-f30k, because the training split of Microsoft COCO is larger than that of Flickr30k.

Analysis of the Number of Time Steps
For a sequential sentence, the number of time steps in the conventional LSTM-RNN is naturally the length of the sentence. For a static image, however, we have to manually set the number of time steps T. Ideally, T should roughly equal the number of salient objects appearing in the image, so that the model can separately attend to these objects within T steps to collect all the information. In the following, we gradually increase T from 1 to 9, and test the impact of different numbers of time steps on the performance of RCCA-ac and RCCA-nc.
From Table 4, we can observe that both RCCA-ac and RCCA-nc achieve their best performance when the number of time steps is 3. This indicates that they can gather all the important information in an image by iteratively visiting it three times. Intuitively, most images contain at most three salient objects, which is consistent with the optimal number of time steps. Note that when T becomes larger than 3, the performance of RCCA-ac and RCCA-nc slightly drops, which results from the fact that an overly complex network architecture can lead to overfitting. When comparing RCCA-ac with RCCA-nc, we find that the usage of context information greatly improves the performance regardless of the number of time steps. Especially for image annotation, with the aid of context information, RCCA-ac appears to exploit more time steps to attend to more contents of the image, producing more accurate results.

Visualization of Dynamical Attention Maps
To verify whether the proposed model can attend to salient regions of an image at different time steps, we visualize the dynamical attention maps predicted by the contextual attention-based LSTM-RNN. For the predicted probabilities {p_{t,l}}_{l=1,...,196} at the t-th time step, we first reshape them into a probabilistic map with a size of 14 × 14, whose layout is equivalent to that of the feature maps extracted from "conv5-4" of the VGG Network. Then we resize the probabilistic map to the same size as the corresponding original image, so that each probability in the resized map measures the importance of the image pixel at the same location. We then perform element-wise multiplication between the resized probabilistic map and its corresponding image to obtain an attention map, where lighter areas indicate attended regions.
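This reshape-resize-multiply procedure can be sketched as follows; nearest-neighbour repetition stands in for the (unspecified) resizing method, and the function name is illustrative:

```python
import numpy as np

def attention_map(p_t, image, grid=14):
    """Turn step-t probabilities p_t (length grid*grid, here 196) into an
    attention map over `image` (H, W): reshape to the conv5-4 grid, upsample
    to the image size, and multiply the image by the map so lighter areas
    mark attended regions."""
    H, W = image.shape[:2]
    pm = p_t.reshape(grid, grid)                 # 14 x 14 probabilistic map
    # Nearest-neighbour upsampling to the image size.
    rows = (np.arange(H) * grid // H).clip(max=grid - 1)
    cols = (np.arange(W) * grid // W).clip(max=grid - 1)
    resized = pm[np.ix_(rows, cols)]             # (H, W) probability map
    return image * resized                       # element-wise attention map
```

With a grayscale image (or one color channel), the returned array can be rescaled and displayed directly as the attention visualization.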
Dynamical attention maps at three successive time steps produced by RCCA-ac and RCCA-nc are shown in Table 5. We can see that RCCA-ac attends to different objects in the images at different time steps. Note that without the aid of context information, RCCA-nc cannot produce attention maps as accurate as those of RCCA-ac. In particular, it cannot well attend to the "cow" and "clock" in the two images, respectively. In fact, RCCA-nc always finishes attending to salient objects within the first two steps, and no longer focuses on meaningful regions at the third step. Different from it, RCCA-ac uses more steps to focus on more salient objects, which is also consistent with the observation in Section 4.4. This evidence again demonstrates that the modelling of context can greatly facilitate the attention mechanism.

Error Analysis
By analyzing the results of our proposed model, we find that it cannot generalize well to images containing very complex content. To illustrate this, we present the results of image retrieval with sentence queries in Figure 4, where the numbers in the top left corner are the returned ranks (the smaller, the better) of the ground-truth images. We can see that the ranks of the left three examples are very low and the corresponding image contents are relatively simple, while the ranks of the right three examples are very high, which indicates that our model cannot find the matched images. According to a recent state-of-the-art work [50], this is mainly because the complex image contents include many few-shot objects and attributes, which cannot be well associated with few-shot words in the sentences. (The failure-case queries in Figure 4 include "the people are quietly listening while the story of the ice cabin was explained to them", "battling between the sexes who will win" and "a woman acts out a dramatic scene in public behind yellow caution tape".)

Conclusions and Future Work
In this paper, we have proposed the recurrent canonical correlation analysis (RCCA) method for associating images with sentences. Our main contribution is the modelling of contextual attention mechanism for dynamically learning image representations. We have applied our model to image annotation and retrieval, and achieved better results.
In the future, we will exploit a similar attention-based scheme to learn good representations for sentences, and further verify its effectiveness on more datasets. In addition, we will try to combine the two tasks of image-sentence matching and image phrase localization into a joint framework to exploit the complementary advantages of global and local matching. To pursue higher performance, we will also consider exploiting the more detailed annotations between objects and words in Flickr30k Entities [9] for fine-grained image and sentence matching.