DA-GAN: Dual Attention Generative Adversarial Network for Cross-Modal Retrieval

Cross-modal retrieval aims to search samples of one modality via queries of another modality, and it is a hot issue in the multimedia community. However, two main challenges, i.e., the heterogeneity gap and semantic interaction across different modalities, have not been solved effectively. Reducing the heterogeneity gap can improve cross-modal similarity measurement, while modeling cross-modal semantic interaction can capture semantic correlations more accurately. To this end, this paper presents a novel end-to-end framework called Dual Attention Generative Adversarial Network (DA-GAN). This technique is an adversarial semantic representation model with a dual attention mechanism, i.e., intra-modal attention and inter-modal attention. Intra-modal attention focuses on the important semantic features within a modality, while inter-modal attention explores the semantic interaction between different modalities and thus represents high-level semantic correlations more precisely. A dual adversarial learning strategy is designed to generate modality-invariant representations, which reduces cross-modal heterogeneity efficiently. Experiments on three commonly used benchmarks show that DA-GAN outperforms its competitors.


Introduction
Cross-modal retrieval [1,2] is a hot issue in the field of multimedia [3]. As shown in Figure 1, it aims to find objects of one modality via queries of another modality. Multimedia data [4] are growing exponentially and are widely used in many scenarios, such as information retrieval, recommendation systems [5], social networks [6], etc. As a result, this problem has attracted increasing interest from a growing number of researchers.
The main challenge of cross-modal retrieval is how to eliminate the heterogeneity between multimedia objects and how to bridge the semantic gap [7,8] by understanding cross-modal consistent semantic concepts. In the existing literature, the classic way to overcome this challenge is to construct a common latent subspace [9], in which multimedia instances are represented in the same form and their semantic features can be aligned [10]. As a traditional approach, Canonical Correlation Analysis (CCA) [11] has been adopted by many studies [12][13][14][15] to learn correlations between cross-modal instances with the same category label. Although these CCA-based methods are supported by classical statistical theory, they cannot represent complex non-linear semantic correlations. To break this limitation, non-linear extensions such as KCCA [11], RCCA [16], and LPCCA [17] have been proposed to enhance cross-modal representation.
Thanks to the powerful representation ability of deep learning models [18][19][20][21], cross-modal semantic representation learning has been boosted significantly. For instance, several CCA-based approaches, e.g., deep CCA [22], DisDCCA [23], and DCCAE [24], are extended by integrating CCA with DNNs. In recent years, attention mechanisms have been exploited to support cross-modal feature learning, which are used to discover more significant semantic features.
Our method. To implement the above idea, this paper proposes a new approach, named Dual Attention Generative Adversarial Network (DA-GAN). This method combines adversarial learning with intra-modal and inter-modal attention mechanisms to improve cross-modal representation capability. Specifically, the inputs are divided into three groups: an image-text pair ⟨I_i, T_i, L_i⟩ with category label L_i, a group of images, and a group of texts with the same label L_i. For the generator, we utilize a visual CNN and a textual CNN to generate visual and textual feature vectors, respectively. These feature vectors are fed into a two-channel intra-attention model (one channel per modality) to learn intra-modal high-level semantic feature representations with the help of the groups of images and texts. On top of this model, a two-channel encoder implemented by DNNs learns modality-consistent representations, above which an inter-attention model captures the important semantic features across different modalities. Besides, a two-channel decoder reconstructs the feature representations for intra-modal adversarial learning. In addition, two types of discriminators form a dual adversarial learning strategy to narrow the heterogeneity gap.
Contributions. This paper has three-fold contributions, which are listed as follows.
• We propose a novel Dual Attention Generative Adversarial Network (DA-GAN) for cross-modal retrieval, which integrates adversarial learning with a dual attention mechanism.
• To narrow the semantic gap and learn high-level semantic features, a dual attention mechanism is designed to capture important semantic features from cross-modal instances in both the intra-modal view and the inter-modal view, which enhances abstract concept learning across different modalities.
• To reduce the heterogeneity gap, a cross-modal adversarial learning model is employed to learn consistent feature distributions via intra-modal and inter-modal adversarial losses.
Roadmap. The rest of this paper is organized as follows: related work on cross-modal retrieval, attention models, and generative adversarial networks is introduced in Section 2. Section 3 presents the problem definition and related concepts. In Section 4, we discuss the details of the proposed DA-GAN. Section 5 presents the experiments and results. Finally, Section 6 concludes this paper.

Cross-Modal Retrieval
The main challenge of cross-modal retrieval [30][31][32][33] is to diminish the heterogeneity gap and the semantic gap by learning a consistent semantic subspace, in which cross-modal similarity can be directly measured. The existing methods include CCA-based methods, deep learning-based methods, and hashing-based methods. We review them briefly as follows.
CCA-Based Methods. Rasiwasia et al. [34] were the first to use CCA [11] for cross-modal correlation learning. Following this work, several CCA-based methods have been proposed to enhance cross-modal representation learning. For example, Sharma et al. [14] studied a supervised extension of CCA, which is a general multi-view and kernelizable feature learning method. Pereira et al. [12] proposed three CCA-based approaches, namely correlation matching (CM), semantic matching (SM), and semantic correlation matching (SCM). Gong et al. [13] presented a three-view CCA model in which abstract semantic information is learned by a third-view module to support semantic correlation learning. In [15], the cluster-CCA method is developed to generate discriminant cross-modal representations.
Deep Learning-Based Methods. Recently, deep learning [18,19,35] techniques have made great progress, which empowers multimedia analysis [36][37][38][39] and cross-modal representation [40,41]. To learn non-linear correlations from different data modalities, Andrew et al. [42] proposed to integrate deep neural networks into the CCA method; it is a two-channel model, with one channel per modality. Benton et al. [22] introduced Deep Generalized Canonical Correlation Analysis (DGCCA) to learn non-linear transformations of arbitrarily many views. Gu et al. [43] designed generative processes to learn global and local features from cross-modal samples. Zhen et al. [44] introduced a method named Deep Supervised Cross-modal Retrieval (DSCMR) with a weight-sharing strategy to explore the cross-modal consistent relationship.

Attention Models
Attention mechanisms [45] are widely applied in image captioning [46], action recognition [47], fine-grained image classification [48], visual question answering [49], cross-modal retrieval [25], etc. For example, Wu et al. [50] introduced a deep attention-based spatially recursive model to consider spatial dependencies during feature learning. Sudhakaran et al. [51] proposed the Long Short-Term Attention method to capture features from spatially relevant parts across video frames. For cross-modal tasks, Peng et al. [25] proposed a modality-specific cross-modal similarity approach by using a recurrent attention network. Wang et al. [52] designed a hierarchically aligned cross-modal attention (HACA) model to fuse both global and local temporal dynamics of different modalities. Xu et al. [26] developed a Cross-modal Attention with Semantic Consistency (CASC) method to realize local alignment and multi-label prediction for image-text matching. Liu et al. [53] proposed a cross-modal attention-guided erasing approach to comprehend and align cross-modal information for referring expression grounding. Huang et al. [54] used object-oriented encoders along with inter-modal and intra-modal attention networks to improve inter-modal dependencies. Fang et al. [27] introduced a subjective attention-based multi-task auxiliary cross-modal fusion method to enhance the robustness and contextual awareness of image fusion.

Generative Adversarial Network
The generative adversarial network (GAN) was devised by Goodfellow et al. [55]; it is a powerful generative model applied in various multimedia tasks [56]. Wang et al. [57] were the first to employ GANs to learn modality-invariant features to diminish cross-modal heterogeneity. Liu et al. [58] presented an adversarial learning-based image-text embedding method to make the distributions of different modalities consistent. Huang et al. [59] studied an adversarial-based transfer model to realize knowledge transfer and generate modality-indiscriminative representations.
With the support of GANs, many works have proposed effective cross-modal hashing methods to realize efficient retrieval in the binary Hamming space [60,61]. For example, in [62], a GAN-based semi-supervised cross-modal hashing approach is presented, which learns semantic correlations from unlabeled samples via a minimax game.

Preliminaries
In this section, the formal problem definition and related notions are presented. Then, we review the theory of generative adversarial networks, which is the basis of the proposed technique. Table 1 summarizes the mathematical notations used in this paper; for example, A^(i) denotes an attention map, U a cross-modal semantic correlation matrix, ζ^(i)_I the reconstructed representation of the i-th image, and ζ^(i)_T the reconstructed representation of the i-th text.

Problem Definition
This work considers two common modalities: image and text.
Let D = {⟨I_i, T_i, L_i⟩}^n_{i=1} be a multimedia dataset that contains n image-text pairs, where I_i ∈ R^{λ_I} and T_i ∈ R^{λ_T} represent the i-th image sample and text sample in their original spaces, respectively, and λ_I and λ_T are the dimensions of the image and text original spaces. Each pair is assigned a semantic label vector, denoted as L_i ∈ R^{λ_L}, where λ_L is the number of semantic categories in D. If I_i and T_i belong to the j-th semantic category, then L^(j)_i = 1; otherwise L^(j)_i = 0. Cross-modal retrieval aims to search multimedia instances that are different from the modality of the query Q but similar enough to Q. If the query is an image, denoted as Q_I, we call this type of cross-modal retrieval image-to-text (I2T) retrieval; otherwise, it is text-to-image (T2I) retrieval. In the following, the definitions of I2T and T2I retrieval are formulated.

Definition 1. Cross-Modal Retrieval. Given a multimedia dataset D and two queries Q_I and Q_T, the I2T retrieval is to return the set of the k texts in D that are most similar to Q_I, i.e., R(Q_I) = argmax^k_{T∈D} Sim(Q_I, T), where Sim(·,·) denotes the similarity function and k is the number of results. T2I retrieval is defined symmetrically, returning R(Q_T) = argmax^k_{I∈D} Sim(I, Q_T).
Apparently, Definition 1 indicates that the key problem of cross-modal retrieval is to realize the function Sim(·,·). However, due to the heterogeneity gap and the semantic gap, it is hard to measure the semantic similarity between instances of different modalities in their original spaces. Therefore, two non-linear mappings Φ_I(·): R^{λ_I} → R^{λ_C} and Φ_T(·): R^{λ_T} → R^{λ_C} need to be learned, which transform images and texts into a λ_C-dimensional common semantic subspace. Thus, the heterogeneity of different modalities can be diminished, and the cross-modal representations can be described by a set of semantic concepts C = {C_l}^{λ_C}_{l=1}. As a result, the cross-modal similarity can be measured accurately by the following function.

Definition 2. Cross-Modal Similarity Function.
Given a multimedia dataset D, an image I ∈ D and a text T ∈ D, the cross-modal similarity between I and T is defined as the normalized inner product of their common representations, Sim(I, T) = Σ^{λ_C}_{i=1} Φ_I(I)^(i) Φ_T(T)^(i) / (‖Φ_I(I)‖_2 ‖Φ_T(T)‖_2), where Φ_I(I) and Φ_T(T) denote the cross-modal representations in the common semantic subspace, and Φ_I(I)^(i) and Φ_T(T)^(i) are the i-th elements of the representation vectors, respectively.
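As a minimal illustration, assuming the similarity takes the cosine form given above, Definitions 1 and 2 can be sketched in NumPy as follows (the array names, dimensions, and toy values are illustrative, not from the paper):

```python
import numpy as np

def sim(phi_i, phi_t):
    """Cosine similarity between an image and a text common representation."""
    return float(np.dot(phi_i, phi_t) / (np.linalg.norm(phi_i) * np.linalg.norm(phi_t)))

def i2t_retrieval(q_img, text_reps, k):
    """Return indices of the k texts most similar to the image query (Definition 1)."""
    scores = np.array([sim(q_img, t) for t in text_reps])
    return np.argsort(-scores)[:k]

# toy common-subspace representations (lambda_C = 4)
texts = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(i2t_retrieval(query, texts, k=2))  # indices of the two nearest texts
```

T2I retrieval is the symmetric call with the roles of image and text representations exchanged.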
To learn these two non-linear mappings, we propose a deep architecture by using adversarial learning, which generates modality-invariant representations from multi-modality data and realizes cross-modal semantic augmentation via a dual attention mechanism.

Review of Generative Adversarial Networks
As a powerful technique, generative adversarial networks (GANs) [55] have been utilized in many multimedia tasks, such as image synthesis, video generation, motion generation, face aging, etc. A GAN consists of two components: a generator G(·; θ_G) and a discriminator D(·; θ_D), where θ_G and θ_D are the model parameter vectors. During training, the generator G(·; θ_G) tries to make synthetic images realistic enough to fool the discriminator D(·; θ_D), while the discriminator D(·; θ_D) strives to distinguish fake samples from real samples. In other words, G(·; θ_G) and D(·; θ_D) are pitted directly against each other.
Specifically, let I be a real image sample that obeys the natural data distribution P_data(I), and z ∈ R^{λ_z} be a random noise vector drawn from the distribution P_z(z). After being fed into the generator G(·; θ_G), z is transformed into a synthetic sample G(z; θ_G) that obeys the generative distribution P_G. The discriminator receives the real sample I and the synthetic sample G(z; θ_G) as inputs and outputs a discriminant result, the probability that its input is a real sample rather than one produced by the generator. This adversarial process can be formulated as

arg min_G max_D V(D, G) = E_{I∼P_data(I)}[log D(I; θ_D)] + E_{z∼P_z(z)}[log(1 − D(G(z; θ_G); θ_D))],

where E_{I∼P_data(I)}[·] and E_{z∼P_z(z)}[·] denote mathematical expectations. In the training process, the generator G(·; θ_G), on the one hand, synthesizes images as authentic as possible to fool the discriminator D(·; θ_D) by minimizing the objective; on the other hand, the discriminator D(·; θ_D) does its utmost to recognize fake samples from real samples by maximizing the objective.
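To make the minimax objective concrete, a small numerical sketch (assuming two known 1-D densities in place of P_data and P_G, with no neural networks) checks the classical result that, for a fixed generator, V(D, G) is maximized by D*(x) = P_data(x) / (P_data(x) + P_G(x)):

```python
import numpy as np

# Two fixed 1-D Gaussian densities standing in for P_data and P_G (illustrative choice).
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

xs = np.linspace(-6.0, 6.0, 2001)
dx = xs[1] - xs[0]
p_data = gaussian_pdf(xs, 0.0, 1.0)
p_g = gaussian_pdf(xs, 2.0, 1.0)

def value(d):
    """V(D, G) = E_data[log D] + E_G[log(1 - D)], evaluated by numerical integration."""
    return np.sum(p_data * np.log(d) * dx) + np.sum(p_g * np.log(1.0 - d) * dx)

# For a fixed generator, V is maximized by D*(x) = P_data(x) / (P_data(x) + P_G(x)).
d_star = p_data / (p_data + p_g)
v_star = value(d_star)

# Any perturbed discriminator achieves a lower (or equal) value.
rng = np.random.default_rng(0)
for _ in range(5):
    d_other = np.clip(d_star + rng.normal(0.0, 0.05, d_star.shape), 1e-6, 1.0 - 1e-6)
    assert value(d_other) <= v_star + 1e-9
print(round(v_star, 3))
```

At the optimum, V(D*, G) equals −log 4 plus twice the Jensen–Shannon divergence between the two distributions, so it reaches −log 4 exactly when P_G matches P_data.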

Methodology
In this section, we discuss the proposed Dual Attention Generative Adversarial Network (DA-GAN). This method learns cross-modal non-linear mappings in an adversarial manner, in which a dual attention mechanism is developed to mine important semantic details to bridge the heterogeneity gap and the semantic gap. Section 4.1 introduces the overview of DA-GAN, and Sections 4.2 and 4.3 discuss the multi-modal feature learning and the adversarial learning with the dual attention mechanism. The implementation details are described in Section 4.4. Figure 2 illustrates the framework of DA-GAN. It consists of three layers: the input layer, the generation layer, and the discrimination layer.

Overview of DA-GAN
The Input Layer. The input layer is responsible for training data preparation. To capture more semantic knowledge, two types of samples are selected from the training dataset. One type is the image-text sample pairs ⟨I_i, T_i, L_i⟩, and the other type is a group of images {⟨I_j, L_i⟩}^m_{j=1} and a group of texts {⟨T_j, L_i⟩}^m_{j=1} that have the same semantic label. They are fed into the generation layer to produce the common semantic representations.
The Generation Layer. The generation layer is a deep cross-modal generative model with intra-modal attention (intra-attention) and inter-modal attention (inter-attention). Specifically, the visual and textual features are extracted by a two-channel multi-modal feature learning model, ImgCNN(·; θ^I_Fea) and TxtCNN(·; θ^T_Fea), one channel per modality, where θ^I_Fea and θ^T_Fea denote parameter vectors. For the image modality, the model consists of several convolutional layers, which generate the visual convolutional representations ξ^(i)_I; the textual channel generates the textual convolutional representations ξ^(i)_T. To narrow the heterogeneity gap, a two-channel encoder with a weight-sharing strategy over the two branches follows the intra-attention model. Under the weight-sharing constraint, it generates the λ_C-dimensional visual and textual representations F^(i)_I and F^(i)_T.
The Discrimination Layer. In the discrimination layer, there are three types of discriminators, i.e., the semantic category discriminator D_S(·; θ_S), the intra-modal discriminator D_Intra(·; θ_Intra), and the inter-modal discriminator D_Inter(·; θ_Inter), which conduct semantic, intra-modality, and inter-modality discrimination, respectively. D_S(·; θ_S) and D_Intra(·; θ_Intra) are two-channel models (one channel per modality); the former predicts the semantic labels of the convolutional and common semantic representations.
Figure 2. The framework of DA-GAN. For each pair, a group of images {⟨I_j, L_i⟩}^m_{j=1} and a group of texts {⟨T_j, L_i⟩}^m_{j=1} that have the same semantic label are selected from the multimedia dataset. The generation layer consists of a two-channel CNN-based multi-modal feature learning model, a two-channel intra-attention model, a two-channel encoder, a two-channel decoder, as well as an inter-attention model. The discrimination layer includes a two-channel intra-modal discriminator to discriminate the convolutional feature representations and common semantic representations, a two-channel semantic discriminator, and an inter-modal discriminator to distinguish the common semantic representations of different modalities.

Multi-Modal Feature Learning
The multi-modal feature learning model consists of two channels: a visual feature learning model ImgCNN(·; θ^I_Fea) and a textual feature learning model TxtCNN(·; θ^T_Fea), which generate convolutional representations of image and text samples.

Visual Feature Learning
The visual feature learning model projects visual samples from the original data space into the convolutional feature space. Formally, ξ^(i)_I = ImgCNN(I_i; θ^I_Fea). We use a pre-trained AlexNet [64] to implement visual feature learning and refine this model on the training dataset via a squared loss. Suppose the training set contains n image samples; the ground-truth probability vector of the i-th sample is denoted as p(I_i) = L_i/‖L_i‖_1, where ‖·‖_1 is the L1 norm, and the predictive probability vector is p̂(I_i). Thus, the objective function is

arg min_{θ^I_Fea} (1/n) Σ^n_{i=1} ‖p̂(I_i) − p(I_i)‖²_2.   (4)
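The label normalization and the per-sample squared loss used for this refinement can be sketched as follows (the function names and the toy multi-label example are illustrative):

```python
import numpy as np

def label_to_prob(label):
    """Ground-truth probability vector p(I_i) = L_i / ||L_i||_1."""
    return label / np.abs(label).sum()

def squared_loss(pred_prob, label):
    """Per-sample squared loss between predicted and ground-truth probability vectors."""
    return float(np.sum((pred_prob - label_to_prob(label)) ** 2))

# a multi-label example with two active categories
L = np.array([1.0, 0.0, 1.0, 0.0])      # label vector -> p = [0.5, 0, 0.5, 0]
p_hat = np.array([0.4, 0.1, 0.4, 0.1])  # a hypothetical network prediction
print(squared_loss(p_hat, L))
```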

Textual Feature Learning
The textual feature learning model is a combination of a Word2Vec model, a BiLSTM model, and a textual convolutional network [65]. It generates the textual convolutional representations ξ^(i)_T. More concretely, a Word2Vec model Word2Vec(·; θ_w2v) generates a word embedding w_j for each word in T_i. Suppose the length of each text sample T_i ∈ D is l (padded if necessary); then its embedding is denoted as w_1 ⊕ w_2 ⊕ · · · ⊕ w_l, where ⊕ denotes the vector concatenation operator. The word embeddings are fed into a BiLSTM model to encode the contextual semantic information from both the previous and future context in the forward and reverse directions. The following textual CNN model receives the BiLSTM output h(t) at time t and encodes local semantic information. Let the convolutional kernels be {K_j}^κ_{j=1} with size λ_B × m. For the d-th window of the input covered by the j-th kernel K_j, namely (h(t), h(t + 1), . . . , h(t + m − 1)), the value of the convolution is

ĥ_j(t) = σ(K_j ∗ (h(t), . . . , h(t + m − 1)) + β),

where σ(·): R → R denotes an activation function, ∗ denotes the convolution operator, and β is a bias term. For the j-th kernel, the result of the convolution over all windows is the vector ĥ_j. Then, a max-pooling operation is conducted on all the vectors (ĥ_1, ĥ_2, . . . , ĥ_κ):

v = (max(ĥ_1), max(ĥ_2), . . . , max(ĥ_κ)),

where max(·) is the function that chooses the maximal element of a vector. This κ-dimensional vector is fed into the last FC layer with drop-out to restrain over-fitting:

ξ^(i)_T = (W_fs v + β) ⊙ Ω,

where W_fs is the parameter matrix of the FC layer, β is the bias term, ⊙ denotes the element-wise multiplication operator, and Ω is a mask to realize drop-out.
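The convolution-and-pooling step above can be sketched in NumPy as follows (the shapes, the ReLU choice, and the random matrix standing in for the BiLSTM outputs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

l, lam_B = 10, 8          # sequence length, BiLSTM output size
m, kappa = 3, 5           # window width, number of kernels

H = rng.normal(size=(l, lam_B))             # stand-in for BiLSTM outputs h(1)..h(l)
kernels = rng.normal(size=(kappa, m, lam_B))
beta = 0.1

def relu(x):
    return np.maximum(x, 0.0)

def conv_maxpool(H, kernels, beta):
    """Convolve each kernel over all length-m windows of H, then max-pool over time."""
    seq_len, width = H.shape[0], kernels.shape[1]
    feats = []
    for K in kernels:
        # h_hat_j(t) = relu(K * window + beta) for each window start t
        responses = [relu(np.sum(K * H[t:t + width]) + beta)
                     for t in range(seq_len - width + 1)]
        feats.append(max(responses))
    return np.array(feats)                   # kappa-dimensional text feature

v = conv_maxpool(H, kernels, beta)
print(v.shape)  # (5,)
```

The final FC layer with a drop-out mask would then map this κ-dimensional vector to ξ^(i)_T.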

Semantic Grouping of Samples
As described in Section 4.1, for each pair ⟨I_i, T_i, L_i⟩, the input layer produces a group of images and a group of texts that belong to the same semantic category as ⟨I_i, T_i, L_i⟩. In other words, it randomly samples α images {⟨I_j, L_i⟩}^α_{j=1} and α texts {⟨T_j, L_i⟩}^α_{j=1} according to the semantic label L_i from the training set D. After that, these two groups are fed into the visual and textual feature learning models, respectively. The final convolutional representations of the two groups are the averages of the individual representations, i.e., ξ̄^(i)_I = (1/α) Σ^α_{j=1} ImgCNN(I_j; θ^I_Fea) and ξ̄^(i)_T = (1/α) Σ^α_{j=1} TxtCNN(T_j; θ^T_Fea). In this work, ξ̄^(i)_I and ξ̄^(i)_T are used to represent the common semantic features of the category labeled by L_i.
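Assuming the per-sample convolutional features have already been computed, the group averaging reduces to a single mean over the group axis (the shapes and random values below are illustrative):

```python
import numpy as np

def group_representation(features):
    """Average the convolutional representations of an equally-labeled group."""
    return np.mean(features, axis=0)

# alpha = 4 per-sample features of dimension 6, drawn at random for illustration
rng = np.random.default_rng(7)
group_feats = rng.normal(size=(4, 6))
xi_bar = group_representation(group_feats)
print(xi_bar.shape)  # (6,)
```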

Adversarial Learning with Dual Attention
In DA-GAN, a novel dual attention mechanism is proposed to learn more discriminative representations via modeling intra-modal and inter-modal semantic correlations by two attention models: intra-attention and inter-attention. Besides, three types of discriminative models are integrated into the framework to achieve modality-invariant representations in an adversarial manner.

Intra-Attention
The intra-attention model aims to learn more discriminative feature representations by modeling the intra-modal semantic correlations. In our method, it is a two-channel model, one channel per modality. Since images and texts are processed in the same way, we take the image intra-attention as an example. Consider the feature representation pair ξ^(i)_I and ξ̄^(i)_I, where w, h, and d denote the width, height, and depth of the tensors. For convenience of discussion, we reshape these two tensors as matrices and compute the semantic correlation matrix

M^(i)_I = (ξ^(i)_I / ‖ξ^(i)_I‖_2) ⊗ (ξ̄^(i)_I / ‖ξ̄^(i)_I‖_2),

where ‖·‖_2 is the L2 norm and the notation ⊗ is called the semantic correlation multiplication. Obviously, M^(i)_I encodes the semantic correlation between the single sample I_i and the corresponding group {I_j}^α_{j=1}. We reshape M^(i)_I and convolve it with a kernel m_I ∈ R^{p×1}, which is learned from the inputs ξ^(i)_I and ξ̄^(i)_I by meta learning as follows:

m_I = σ(W_2 σ(W_1 [ξ^(i)_I, ξ̄^(i)_I])),

where W_1 and W_2 denote model parameter vectors and σ(·) is a non-linear activation function; here we employ the ReLU function. Then a softmax operation is conducted on the convolution result to generate the intra-attention map A^(i)_I:

A^(i)_I = softmax((m_I ∗ M^(i)_I)/Γ),

where Γ is the temperature hyperparameter that influences the entropy. In the same way, the intra-attention map of the text modality A^(i)_T is achieved. Finally, a residual attention mechanism is utilized to calculate the results for both modalities:

ξ̃^(i)_I = (1 + A^(i)_I) ⊙ ξ^(i)_I,   ξ̃^(i)_T = (1 + A^(i)_T) ⊙ ξ^(i)_T,

where ⊙ is the element-wise multiplication. Following the intra-attention model, a two-channel encoder E(·; θ^I_Enc) and E(·; θ^T_Enc) generates the common representations F^(i)_I and F^(i)_T. In this model, a weight-sharing constraint is applied in the last few layers to learn the cross-modal consistent joint distribution, which diminishes heterogeneity effectively.
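A toy NumPy sketch of the intra-attention pipeline (normalized correlation, temperature softmax, residual attention) may help; the vector shapes and the way the correlation matrix is pooled into per-position scores are simplifying assumptions, not the paper's exact operators:

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def intra_attention(xi, xi_bar, temperature=1.0):
    """Toy intra-attention: correlate a sample feature with its group feature,
    turn the correlation into an attention map, and apply residual attention."""
    # semantic correlation between L2-normalized features (outer-product form)
    M = np.outer(xi / np.linalg.norm(xi), xi_bar / np.linalg.norm(xi_bar))
    scores = M.sum(axis=1)                   # one score per feature position
    A = softmax(scores, temperature)         # intra-attention map
    return (1.0 + A) * xi                    # residual attention output

rng = np.random.default_rng(0)
xi = rng.normal(size=8)        # single-sample convolutional feature
xi_bar = rng.normal(size=8)    # group (category-level) feature
out = intra_attention(xi, xi_bar, temperature=0.5)
print(out.shape)  # (8,)
```

Lowering the temperature Γ sharpens the attention map, concentrating the residual boost on fewer feature positions.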

Inter-Attention
To realize semantic augmentation in the common representation subspace, an inter-attention model is designed to learn the semantic relationship between image and text. Similar to the intra-attention mechanism, it calculates the cross-modal semantic correlation matrix U^(i) from F^(i)_I and F^(i)_T and then generates one attention map per modality in a manner similar to Equation (14).
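Following the same toy conventions as the intra-attention sketch, the cross-modal step can be illustrated as below; the axis-wise pooling of U into two per-modality attention maps is an assumption, since the paper's exact equations are not reproduced here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def inter_attention(f_img, f_txt):
    """Toy inter-attention: build a cross-modal correlation matrix, then derive
    one attention map per modality by pooling U along each axis."""
    U = np.outer(f_img / np.linalg.norm(f_img),
                 f_txt / np.linalg.norm(f_txt))      # cross-modal correlation U
    a_img = softmax(U.sum(axis=1))                    # attend image positions
    a_txt = softmax(U.sum(axis=0))                    # attend text positions
    return (1.0 + a_img) * f_img, (1.0 + a_txt) * f_txt

rng = np.random.default_rng(1)
f_i, f_t = rng.normal(size=5), rng.normal(size=5)
g_i, g_t = inter_attention(f_i, f_t)
print(g_i.shape, g_t.shape)  # (5,) (5,)
```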

Discriminative Model
Three types of discrimination models are integrated into the DA-GAN framework: (1) a semantic discriminator D_S(·; θ_S) to realize semantic discrimination, (2) a two-channel intra-modal discriminator D_I(·; θ^I_D) and D_T(·; θ^T_D), and (3) a two-channel inter-modal discriminator D̃_I(·; θ̃^I_D) and D̃_T(·; θ̃^T_D) to realize intra-modal and inter-modal adversarial learning.
Semantic Discriminator. The semantic discriminator D_S(·; θ_S) is used to recognize the semantic category of an instance in the common semantic representation subspace. To this end, a two-channel network with a softmax function is added on top of the inter-attention model (one channel per modality), which takes F^(i)_I and F^(i)_T as inputs, where θ_S = (θ^I_G, θ^T_G, θ_C) denotes the parameter vector of this model, θ_C is the parameter vector of the classifier, and θ^I_G and θ^T_G denote the parameter vectors of the image and text generation models, respectively.
Intra-Modal Discriminator. The intra-modal discriminator tries to discriminate the real representations ξ^(i)_I and ξ^(i)_T from the reconstructed representations ζ^(i)_I and ζ^(i)_T produced by the decoder. This branch of the adversarial network is denoted as GAN1.
Inter-Modal Discriminator. Similar to the intra-modal discriminator, the inter-modal discriminator has two channels: the subnetwork for the image modality is to recognize the visual common representation as the real sample, while the subnetwork for the text modality aims to recognize the textual common representation as the real sample. This branch of the adversarial network is denoted as GAN2.
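Both the GAN1 and GAN2 branches rest on the standard binary adversarial objective; a minimal sketch of the discriminator-side quantity (the epsilon and the scalar toy inputs are illustrative):

```python
import numpy as np

def disc_objective(d_real, d_fake):
    """Discriminator objective (to maximize): log D(real) + log(1 - D(fake))."""
    eps = 1e-12  # guard against log(0)
    return float(np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))

# A perfect discriminator (real -> 1, fake -> 0) attains the maximum, 0;
# a maximally confused one (both -> 0.5) attains -log 4.
perfect = disc_objective(np.array([1.0]), np.array([0.0]))
confused = disc_objective(np.array([0.5]), np.array([0.5]))
print(perfect, confused)
```

The generators are trained with the opposite sign, pushing the discriminators toward the confused regime, which is exactly what makes the learned representations modality-invariant.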

Optimization
According to the above discussion, the DA-GAN model is optimized by alternating minimization and maximization of the adversarial objectives. For discrimination in GAN1, the intra-modal discriminator takes the convolutional representations ξ^(i)_I and ξ^(i)_T as real instances and the reconstructed representations ζ^(i)_I and ζ^(i)_T as fake instances, and its parameters are updated by stochastic gradient ascent. For discrimination in GAN2, the subnetwork for the image modality receives the image common representation F^(i)_I as the real instance and the text common representation F^(i)_T as the fake instance, and vice versa for the text subnetwork; the stochastic gradient ascent is calculated accordingly. The two-channel generative model aims to generate more authentic data from the original samples to fit the real semantic distribution by minimizing the objective function; both subnetworks are optimized by stochastic gradient descent (SGD). Besides, the generative model is optimized via the semantic discrimination loss to learn abstract semantic concepts. Here, η denotes the learning rate and m denotes the number of samples in each mini-batch. The pseudocode for optimizing the proposed model is shown in Algorithm 1. Before training GAN1 and GAN2, we pre-train the multi-modal feature learning model and the intra-attention model for both image and text on the training set, which prevents instability in training GAN1 and GAN2. The minimax game is implemented with Adam [66].
Algorithm 1: Optimization of DA-GAN.
1: Input: training set, mini-batch size m, the number of generative model training steps k, learning rate η.
2: Pre-train ImgCNN(·; θ_Fea^I) and IntraAtt_I(·; θ_Intra^I);
3: Pre-train TxtCNN(·; θ_Fea^T) and IntraAtt_T(·; θ_Intra^T);
4: repeat until convergence:
5:   for k steps do
6:     Update the parameters of the generator for image, θ_G^I, by Equation (30);
7:     Update the parameters of the generator for text, θ_G^T, by Equation (31);
8:     Update the parameters of the generators for both image and text, θ_G^I and θ_G^T, by Equation (32);
9:   end for
10:  Update the parameters of the intra-modal discriminator for image, θ_D^I, by Equation (26);
11:  Update the parameters of the intra-modal discriminator for text, θ_D^T, by Equation (27);
12:  Update the parameters of the inter-modal discriminator for image, θ̃_D^I, by Equation (28);
13:  Update the parameters of the inter-modal discriminator for text, θ̃_D^T, by Equation (29);
14: Output: the optimized DA-GAN model.
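The control flow of Algorithm 1 can be sketched as a training-loop skeleton in which callbacks stand in for the gradient updates of Equations (26)–(32); the callback-based structure and all names here are illustrative, not the paper's code.

```python
def train_dagan(update_gen_img, update_gen_txt, update_gen_joint,
                update_intra_img, update_intra_txt,
                update_inter_img, update_inter_txt,
                k, epochs):
    """Skeleton of Algorithm 1: k generator steps per round, then one
    update of each of the four discriminator channels."""
    for _ in range(epochs):          # stands in for "repeat until convergence"
        for _ in range(k):
            update_gen_img()         # Eq. (30)
            update_gen_txt()         # Eq. (31)
            update_gen_joint()       # Eq. (32)
        update_intra_img()           # Eq. (26), GAN1 image channel
        update_intra_txt()           # Eq. (27), GAN1 text channel
        update_inter_img()           # Eq. (28), GAN2 image channel
        update_inter_txt()           # Eq. (29), GAN2 text channel
```

Running the generators k times per discriminator round mirrors the standard GAN trick of keeping the discriminators from overpowering the generators early in training.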

Implementation Details
Multi-Modal Feature Learning Model. The image feature learning model is implemented with AlexNet [64] pre-trained on the ImageNet dataset. Each input is resized to 256 × 256 without cropping, and 227 × 227 patches are extracted randomly from the inputs. The 4096-dimensional feature maps from the fc7 layer are treated as the outputs. To improve the learning performance, we fine-tune this model on the training dataset with a squared loss. The mini-batch size is 128, and the learning rates of the convolutional layers and fully-connected layers are set to 0.001 and 0.002, respectively. The momentum, weight decay, and drop-out rate are set to 0.9, 0.0005, and 0.5, respectively. The textual feature learning model includes a word2vec Skip-gram model pre-trained on a Wikipedia corpus containing over 1.8 billion words, which outputs 300-dimensional word vectors. The textual CNN contains a filter with a size of 3 × 300, followed by a one-layer fully-connected network; the last fully-connected layer has 4096 dimensions. The drop-out rate is set to 0.5 to avoid over-fitting, and the learning rate is set to 0.01.
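For reference, the hyperparameters above can be collected in one place. The dictionary layout and names below are our own; the values are those stated in the text.

```python
# Hyperparameters of the multi-modal feature learning model, as stated in
# the implementation details; the dict structure itself is illustrative.
IMG_CFG = {
    "backbone": "AlexNet (pre-trained on ImageNet)",
    "input_size": (227, 227),   # random crops from 256 x 256 inputs
    "feature_dim": 4096,        # fc7 outputs
    "batch_size": 128,
    "lr_conv": 0.001,
    "lr_fc": 0.002,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "dropout": 0.5,
}

TXT_CFG = {
    "word2vec_dim": 300,        # Skip-gram pre-trained on Wikipedia
    "kernel_size": (3, 300),    # textual CNN filter
    "feature_dim": 4096,        # last fully-connected layer
    "lr": 0.01,
    "dropout": 0.5,
}
```

Both branches end in 4096-dimensional features, so the image and text representations enter the encoder with matching dimensionality.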
Encoder and Decoder. The two-channel encoder is implemented as a two-layer fully-connected network. For each channel, both fully-connected layers are 1024-dimensional, and the weights of the second layer are shared across the two branches to model the cross-modal joint distribution. Each branch of the decoder is also a two-layer fully-connected network, whose layers have 1024 and 4096 dimensions, respectively.
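The weight-sharing scheme can be sketched in NumPy: each modality has its own first layer, while the second layer's weight matrix is a single shared object. This is a minimal sketch under assumed ReLU activations and random initialization; it is not the paper's implementation.

```python
import numpy as np

# Two-channel encoder sketch: modality-specific first layers (4096 -> 1024),
# one shared second layer (1024 -> 1024). Dimensions follow the paper;
# activations and initialization are our own assumptions.
rng = np.random.default_rng(0)
W1_img = rng.standard_normal((4096, 1024)) * 0.01
W1_txt = rng.standard_normal((4096, 1024)) * 0.01
W2_shared = rng.standard_normal((1024, 1024)) * 0.01  # shared over branches

def encode(x, W1):
    h = np.maximum(x @ W1, 0.0)           # modality-specific first layer
    return np.maximum(h @ W2_shared, 0.0) # shared second layer
```

Because both branches pass through `W2_shared`, gradients from either modality update the same parameters, which is how the shared layer couples the two distributions.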
Intra-Modal and Inter-Modal Discriminator. Each branch of the intra-modal discriminator is constructed with one fully-connected layer. To discriminate the convolutional representations from the reconstructed representations, the former are labeled with tag 1 and the latter with tag 0. For the inter-modal discriminator, each of the two channels is a two-layer fully-connected network: the first layer has 1024 dimensions, and the second layer, with a sigmoid activation function, calculates the predicted score for each input representation. In the image channel, the common representations of the image modality are labeled 1 and those of the text modality are labeled 0; in the text channel, these two types of representations are labeled the opposite way.
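One inter-modal discriminator channel can be sketched as a 1024-dimensional hidden layer followed by a sigmoid score. Again this is an illustrative sketch with assumed ReLU hidden activation and random weights, not the trained model.

```python
import numpy as np

# One inter-modal discriminator channel: fc (1024 -> 1024) + sigmoid score.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((1024, 1024)) * 0.01
w2 = rng.standard_normal(1024) * 0.01

def predict_score(f):
    """Probability that the 1024-d common representation f is 'real'
    for this channel (i.e., comes from the channel's own modality)."""
    h = np.maximum(f @ W1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))
```

When the generators succeed, both modalities' common representations drive this score toward 0.5, i.e., the discriminator can no longer tell the modalities apart.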

Datasets
All the experiments are conducted on three widely-used benchmark datasets: Wikipedia [34], NUS-WIDE [67] and Pascal Sentences [68]. Some image and text samples of these three datasets are shown in Figure 3.
• CCA [69] is a statistical method that learns linear correlations between samples of different modalities.
• KCCA [11] is a non-linear extension of CCA, which employs kernel functions to improve the performance of common subspace learning.
• MCCA [70] is a generalization of CCA to more than two views, which is used to recognize similar patterns across multiple domains.
• MvDA [71] jointly learns multiple view-specific linear transforms so as to construct a common subspace for multiple views.
• MvDA-VC [72] is an extension of MvDA with view consistency, which utilizes the structure similarity of views corresponding to the same object.
• JRL [73] uses a sparse projection matrix and semi-supervised regularization to explore correlations of labeled and unlabeled cross-modal samples.
• DCCA [42] is implemented by deep neural networks to learn non-linear correlations. It has two separate DNNs, one branch per modality.
• DCCAE [24] is a DCCA extension that integrates the CCA model and an autoencoder-based model to realize multi-view representation learning.
• CCL [74] realizes a hierarchical network to combine multi-grained fusion and cross-modal correlation exploitation. It includes two learning stages to realize representation learning and intrinsic relevance exploitation.
• CMDN [75] contains two learning stages to model the complementary separate representations of different modalities, and combines cross-modal representations to generate rich cross-media correlation.
• ACMR [57] is an adversarial learning-based method that constructs a common subspace for different modalities by generating modality-invariant representations.
• DSCMR [44] exploits semantic discriminative features from both the label space and the common representation space by supervised learning, and minimizes a modality invariance loss via weight-sharing to generate modality-invariant representations.
• CM-GANs [76] models cross-modal joint distributions by two parallel GANs to generate modality-invariant representations.
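As background for the CCA-family baselines above, the core computation of classical linear CCA (whitening the two views and taking an SVD of the cross-covariance) can be sketched in a few lines of numpy. The toy paired features, dimensions, and function names below are our own illustrative assumptions, not taken from any of the cited implementations:

```python
import numpy as np

def cca(X, Y, n_components=2, reg=1e-6):
    """Minimal linear CCA: returns projections Wx, Wy mapping the two
    views into a shared subspace with maximally correlated components."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # (regularized) covariance and cross-covariance matrices
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # inverse square root of a symmetric positive-definite matrix
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # singular vectors of the whitened cross-covariance give the canonical directions
    U, _, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy
```

Projecting paired "image" and "text" features with `X @ Wx` and `Y @ Wy` yields representations whose matched columns are maximally correlated; the non-linear extensions (KCCA, DCCA) replace the linear maps with kernels or deep networks.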

Performance Metrics
Two tasks are considered, i.e., (1) I2T retrieval and (2) T2I retrieval, both of which are defined in Definition 1. Besides, we utilize PR curves and mAP scores to measure the retrieval performance.
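For reference, the mAP score used throughout the comparisons is the mean over all queries of the average precision of the ranked gallery. A minimal sketch of this computation (function and variable names are our own, not from the paper's evaluation code):

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP for one query: ranked_labels are the class labels of the
    retrieved gallery items, ordered by descending similarity."""
    hits = 0
    precisions = []
    for i, lbl in enumerate(ranked_labels, start=1):
        if lbl == query_label:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant rank
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(sim, query_labels, gallery_labels):
    """mAP given a (num_queries x num_gallery) similarity matrix."""
    aps = []
    for q in range(sim.shape[0]):
        order = np.argsort(-sim[q])      # rank gallery by similarity
        aps.append(average_precision(gallery_labels[order], query_labels[q]))
    return float(np.mean(aps))
```

For I2T retrieval, `sim` would hold image-query-to-text-gallery similarities in the learned common subspace; T2I swaps the roles.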

Results on Wikipedia Dataset
The mAP scores of DA-GAN and the 13 competitors on the Wikipedia dataset are reported in Table 2. For both I2T and T2I tasks, the proposed DA-GAN outperforms all these state-of-the-art methods, achieving mAP scores of 54.3% and 63.9% respectively, higher than the two best competitors, i.e., DSCMR [44] (I2T mAP = 52.1%) and CM-GANs [76] (T2I mAP = 62.1%). Besides, the average mAP of DA-GAN is the highest, 3% higher than that of CM-GANs. The main reason is that the combination of intra- and inter-modal attention captures more single-modal and cross-modal semantic correlations. Although both DSCMR and CM-GANs extract semantic information by supervised learning, they do not learn the inter-modal semantic correlation effectively enough to realize cross-modal semantic augmentation. On the other hand, the deep learning-based methods generally outperform the traditional ones, except for DCCA and DCCAE, whose mAPs (I2T mAP = 44.4% and 43.5%, T2I mAP = 39.6% and 38.5%) are a bit lower than those of JRL (I2T mAP = 44.9%, T2I mAP = 41.8%). The per-category results are illustrated in Figure 6. Obviously, for all these approaches, there are big differences between the retrieval precisions of different categories. Specifically, for both I2T and T2I tasks, the performances on "biology", "geography & places", "sport & recreation" and "warfare" are better than those on the other categories. That is mainly because the samples in these categories are semantically independent of the other categories and have more obvious distinguishing features. In contrast, the categories "art & architecture", "history" and "royalty & nobility" are related to each other in abstract semantics, so their samples have more confusing features.
From Figures 4 and 5, it is clear that DA-GAN has better semantic recognition ability. For example, the highest I2T and T2I mAP scores of DA-GAN on "biology", "sport & recreation" and "warfare" are nearly 83% and 85%, higher than those of competitive rivals such as DSCMR (I2T mAP = 78%, T2I mAP = 73%), CCL (I2T mAP = 73%, T2I mAP = 69%) and CM-GANs (I2T mAP = 74%, T2I mAP = 82%).

Results on Nus-Wide Dataset
The mAP scores on NUS-WIDE of DA-GAN and the competitors are reported in Table 3. Compared with the results on Wikipedia, the precisions of all these methods are relatively higher. The proposed method performs well on this dataset, defeating CM-GANs (I2T mAP = 78.1%, T2I mAP = 72.4%, average mAP = 75.3%) and DSCMR (I2T mAP = 61.1%, T2I mAP = 61.5%, average mAP = 61.3%) with I2T mAP = 79.7%, T2I mAP = 75.2% and average mAP = 77.5%. It indicates that the dual attention mechanism can discover more important semantic features between different modalities to generate more discriminant representations. On the other hand, we observe that the performances of the other traditional and deep learning-based approaches are far behind our method, even though their precisions are obviously higher than their results on Wikipedia. The PR curves of DA-GAN and the state-of-the-art methods are presented in Figure 7b,c. We can find that the trends of the precisions on NUS-WIDE are different from the situations on Wikipedia. For the I2T task (shown in Figure 7b), the precisions of DA-GAN and the competitors decline markedly in the interval [0.0, 0.2]. After that, the downward trend tends to be gentle. When the recall is larger than 0.8, fast performance degradation occurs, except for three traditional methods, i.e., CCA, KCCA, and MCCA. At all levels of recall, the precision of DA-GAN is higher than that of all the rivals. For the T2I task (shown in Figure 7c), the performance of all these approaches shows a gradual downward trend. Although the precision of CM-GANs is slightly higher than that of our method in the interval [0.1, 0.2], it cannot defeat DA-GAN when the recall is larger than 0.2. The retrieval accuracies of the other approaches, as expected, are much lower than those of DA-GAN.

Results on Pascal Sentences Dataset
The comparison of the mAP scores of DA-GAN and the 13 state-of-the-art methods on the Pascal Sentences dataset is shown in Table 4. Once again, DA-GAN is the winner in this contest: it achieves I2T mAP = 72.9%, T2I mAP = 73.5% and average mAP = 73.2%, defeating the runner-up DSCMR (I2T mAP = 71.0%, T2I mAP = 72.2%, average mAP = 71.6%) by 2.5%, 1.3% and 1.6%, respectively. Different from the above comparisons, CM-GANs (I2T mAP = 61.2%, T2I mAP = 61.0%, average mAP = 61.1%) performs evidently worse than DA-GAN and DSCMR. As analyzed above, the performance improvement mainly comes from the integration of intra- and inter-modal attention as well as adversarial learning. Figures 8-10 illustrate the I2T, T2I and average mAP scores of each approach on the 20 categories of the Pascal Sentences dataset, respectively. For both I2T and T2I tasks, all these approaches have poor cross-modal retrieval performance in some categories, such as "bottle" and "chair". It is mainly because the objects in these categories are relatively small. By contrast, the precisions on "aeroplane", "bird", "cat", "horse", "motorbike", "sheep" and "train" are obviously higher, since these samples contain much more discriminative semantic features. Specifically, for the I2T task, the mAP of DA-GAN reaches nearly 90%, 91% and 92% on "aeroplane", "cat" and "train", respectively. For the T2I task, it achieves nearly 92%, 93%, and 95% on these three categories. From Figure 10 we observe that the semantic recognition performance of DA-GAN is the best among these 14 approaches. Figure 7c,f show the PR curves of DA-GAN and the 13 state-of-the-art methods for the I2T and T2I tasks, respectively. On both tasks, it is clear that the performance trends of DA-GAN and CM-GANs are very similar. Although CM-GANs shows good performance, it cannot surpass our method. For the I2T task, the precision of DA-GAN declines slowly when the recall increases from 0.2 to 0.8. After that, it drops sharply.
In contrast, the performance of our method shows a significant downward trend for the T2I task, but it is still the best.

Conclusions
We present a new deep adversarial model for cross-modal retrieval, called Dual Attention Generative Adversarial Network (DA-GAN). This method utilizes a novel dual attention mechanism to focus on important semantic details in both a uni-modal and a cross-modal manner, which can effectively learn high-level semantic interaction across different modalities. Besides, a dual adversarial learning method that learns modality-consistent representations is proposed to reduce the heterogeneity gap. Comprehensive experiments on three commonly used multimedia datasets demonstrate the superior performance of the proposed method.