Recommendations for Different Tasks Based on the Uniform Multimodal Joint Representation

Content curation social networks (CCSNs), such as Pinterest and Huaban, are interest driven and content centric. On CCSNs, user interests are represented by a set of boards, and a board is composed of various pins. A pin is an image with a description. All entities, such as users, boards, and categories, can be represented as a set of pins. Therefore, it is possible to implement entity representation and the corresponding recommendations on a uniform representation space from pins. Furthermore, many pins are re-pinned from others, and the re-pin sequences of pins are recorded on CCSNs. In this paper, a framework which can learn the multimodal joint representation of pins, including text representation, image representation, and multimodal fusion, is proposed. Image representations are extracted from a multilabel convolutional neural network. The multiple labels of pins are automatically obtained by the category distributions in the re-pin sequences, which benefits from the network architecture. Text representations are obtained with the word2vec tool. Two modalities are fused with a multimodal deep Boltzmann machine. On the basis of the pin representation, different recommendation tasks are implemented, including recommending pins or boards to users, recommending thumbnails to boards, and recommending categories to boards. Experimental results on a dataset from Huaban demonstrate that the multimodal joint representation of pins contains the information of user interests. Furthermore, the proposed multimodal joint representation outperformed unimodal representation in different recommendation tasks. Experiments were also performed to validate the effectiveness of the proposed recommendation methods.

As is well known, CCSNs are content-centric social networks [10]. Different from user-centric social networks, users on CCSNs pay more attention to the contents that users collect, which are not only communication carriers but also carriers of user interests. Taking one of the best known CCSNs, Pinterest, as an example, a "pin" is an image with a brief text description supplied by the users. A "board" is a collection of pins with a similar style. In other words, the pins are curated into "boards" by categories [19]. On CCSNs, the collection of a user is composed of several boards, and a board is composed of pins. The relationships between users, boards, categories, and pins are shown in Figure 1. A "user" represents a user on CCSNs. A "board" represents a container of pins and is organized into different categories. A "category" is the category of the board, which is given by the user. A "pin", which is created by a user, is the basic unit, composed of an image and a corresponding brief text description. Users on CCSNs can "follow" the users they are interested in, as on Twitter or Facebook. "Re-pin" is an action like a "repost" or a "retweet", whereby users can re-save pins and re-organize them with new descriptions and new categories into their own boards. "Create" is similar to "post" on Twitter or Weibo, allowing users to post their original contents on CCSNs. From the figure, we can see that the pin is the basic unit in CCSNs. Besides the content collections, there are also abundant social behaviors on CCSNs. Users can follow other users or other users' boards, and they can also re-pin other users' pins and collect them into their own boards. Furthermore, the re-pin path is recorded in CCSNs. All the users who have re-pinned a pin can be connected using a re-pin path. As shown in Figure 2, all users of the same re-pin path have collected the same image, but they have organized it into different boards and different categories.
On content-centric CCSNs, most user activities are related to the pins. Liu et al. [25] found, from statistics on Huaban, that only 30% of pins are re-pinned from their followers. Furthermore, users often do not follow the users from whose boards they re-pin the pins [4]. A non-trivial number of pins are collected from non-followees [5], and those from native followees are more than those from cross-domain followees [3]. These observations suggest that social relationships are not the main motivation of content discovery on CCSNs. On the contrary, user interests represented by pins play an important role in user behaviors on CCSNs. It is possible that content-based recommender algorithms will be more effective than social behavior-based algorithms such as collaborative filtering. Inspired by this, we implement recommendations for different tasks based on identical representations of pins. The problem can be broken down into two questions: how to represent a given pin effectively, and how to implement the different tasks with the obtained representation.
Figure 2. Illustration of a re-pin tree composed of some re-pin paths. Each star represents a pin, and the $C_i$ next to it is the category given by the corresponding user. Note that all pins in the same re-pin tree have an identical image.
As shown in Figure 3, a pin is an image with its text description, hence it is obvious that both modalities should be utilized for complete representations. In order to fully utilize two modalities, we propose a framework that can learn the multimodal joint representation of pins. Image representations and text representations are obtained separately by deep models and are then fused to form multimodal joint representations. An intermediate layer of a convolutional neural network (CNN) is used to extract image representations. In order to establish the relation between image representations and user interests, some chosen images are annotated with their category distributions, which are the statistics of selections of users, to fine-tune the CNN. Text representations are means of word vectors in a word2vec model trained on public text corpora. Then, a multimodal deep Boltzmann machine (DBM) is trained with two modalities as inputs and the activation probabilities of the top layer are extracted as the final representation of pins. Recommendation tasks include recommending pins to users, recommending thumbnails to boards, recommending categories to boards, and recommending boards to users. On the basis of the representation of pins, pin recommendation becomes a problem of similarity measurement in the representation space, which can be solved by ranking the similarities between the candidates and the target pin. A board thumbnail consists of representative pins that can be selected by clustering pins in the board. The board category, which is the coarse interest selected by its owner, is considered to be the accumulation of the category distribution of its pins, and the category distribution can be obtained with a trained multidimensional logistic regression (LR). Boards and users are treated as pin collections, modeled as the Fisher vector (FV) of all their pin representations, and recommended to target users, similar to the pin recommendation method.
This paper makes the following contributions:
• On the basis of the characteristics of CCSNs, an easy-to-accomplish annotation method is proposed to automatically label the images by the category distributions on the re-pin tree of the corresponding pins. On the basis of the images and the corresponding labels, a multilabel CNN was fine-tuned, which significantly enhances the capability of image representation;
• We designed a framework which combines deep features of images and texts into a joint representation to maintain both the consistent information and the specific characteristics of different modalities. On this basis, a uniform recommendation scheme was designed for different tasks on CCSNs;
• The experimental results demonstrate that the proposed multimodal representation is more effective than representations learned from unimodal information. Furthermore, the proposed method performs better than existing multimodal representation learning methods on multiple recommendation tasks.

Related Work
With the rise of CCSNs, several studies have been performed, of which those on search engines, user modeling, and recommender systems are the most relevant. Most prior work only studied monomodal data. Yang et al. [16] recommended boards by using image representations to re-rank the boards obtained from a text representation model. Liu et al. [22] recommended pins with two unimodal representations separately. Cinar et al. [11] predicted categories of pins with two kinds of unimodal representations and fused the results of the two modalities using decision fusion. All these models are late fusion models that do not consider multimodal joint representations.
Multimodal joint representation includes unimodal representation models and multimodal fusion schemes. For image representation, CNNs have achieved remarkable performance in the field of computer vision. Creating a large labelled dataset is the key to training CNNs. Cinar et al. [11] and You et al. [12] directly used a pin's category as its label. However, this label may not be absolutely correct since the same image may have different categories selected by different users. Geng et al. [10] trained a multitask CNN with ontological concepts, but the ontology was constructed in the fashion domain and was difficult to extend to all other domains. Zhai et al. [21] extracted more detailed labels on Pinterest by taking top text search queries, but the quality and cost of this annotation depend heavily on the search engine. Inspired by the fact that the predefined categories on CCSNs are not independent objects but related notions, we use labels formed by statistical category distributions and fine-tune a CNN as a multilabel regressor. With regard to text representation, one-hot representations [13,16] and distributed representations, such as the word2vec tool [11], have been used. From the practical point of view, the word2vec tool [26], which can capture syntactic and semantic relationships between words in the corpus, is more scalable. In addition, mean vectors [27] of word2vec word vectors provide a usable text representation without further learning.
Several multimodal fusion studies have been performed on classification and retrieval. Apart from directly concatenating modalities, most existing schemes are designed based on models such as CNNs [28] and recurrent neural networks [8]. These models mainly learn the consistency between multiple modalities and cannot deal with missing input modalities well. On the generative side, latent Dirichlet allocations (LDAs) [29], restricted Boltzmann machines (RBMs) [30], deep autoencoders (DAEs) [31], and deep Boltzmann machines (DBMs) [32] have been proven to be feasible methods for learning the consistency and complementarity between modalities and can easily deal with some missing modalities. However, limited studies have focused on fusing features obtained from these deep learning models. Zhang et al. [33] used a DAE for fusing the textual features extracted by training the Word2Vec tool [26] and visual features generated by the sixth layer of AlexNet [34]. However, there are no existing studies that have used information from all modalities from CCSNs for recommendation tasks. In this paper, we trained a multimodal DBM to handle a situation in which the data from CCSNs are unlabeled and some modality inputs are missing, and we used features obtained by deep learning as the input to make our multimodal representation more accurate and compact.
Compared to pin and board category recommendation, few studies have been performed on board and user recommendations. Kamath et al. and Wu et al. [13,23] model boards and users, respectively, with text data, and some collaborative filtering methods [15,20,25] recommend users based on user behaviors, but they do not take images, which are the essential content on CCSNs, into account. Yang et al. [17] represent boards by sparse coding the descriptors of images, but similarly to Yang et al. [16] as mentioned above, their methods require cross-domain information. Moreover, a sparse code based on a cluster dictionary loses more information than the FV based on a Gaussian mixture model (GMM). Furthermore, no studies on board thumbnails have been published.
Existing research focuses on a single recommendation task, whereas the method in this paper uses an identical pin representation to accomplish different recommendations, which simplifies the problems and saves resources.

Multimodal Joint Representation of Pins
A pin, that is, an image with text descriptions, is the basic item and the carrier of user interests on CCSNs. The purpose of this section is to learn the representation of pins from both modalities. As the foundation of further applications, the representation should contain the information of user interests.
The proposed framework for learning the multimodal joint representation of pins is shown in Figure 4. The process can be divided into three parts: image representation learning, text representation learning, and multimodal fusion. For given pins, their images are fed into our modified CNN, which is fine-tuned on an automatically annotated image dataset, and one of the intermediate layers of the CNN is extracted as the image representation. Meanwhile, text representations are computed by applying mean pooling on word vectors derived from a word2vec model that is pre-trained on some text corpora. Then, a modified multimodal DBM is trained on both image and text representations. Finally, the activation probabilities of the last hidden layer of the multimodal DBM are inferred as the multimodal joint representation of pins.

Image Representation
Image representation aims to learn image features which not only maintain intrinsic characteristics, but also reflect user interests on CCSNs. CNNs have become the dominant approach in computer vision. Top layers of CNNs can extract high-level image features interpreted as color, material, scene, texture, object, and so on by various means. Intermediate layers of CNNs, especially fully connected (FC) layers, are often used for image representation and for further applications. As supervised learning models, CNNs can capture the relationships between user interests and images if user interests are trained as labels during the training process.
As a typical deep learning framework, a CNN requires a training set with a large number of images with corresponding labels. Social networks are good sources for collecting the images, but noisy labels are always a primary problem. On CCSNs, all of the pins are collected by users and the categories are assigned by users; therefore, the categories can be seen as labels with a high level of confidence. Users can create boards, and then create or collect pins into the boards to exhibit their interests. When users create a board, they are asked to choose one of the predefined categories on CCSNs, and the chosen category is the category of all the pins on the board. That is to say, every pin has a user-selected category. Since the category can reflect the theme of the board and the pin, it can describe coarse-grained user interests and can be used as the label of an image during training.
On CCSNs, different users may select different categories for the same image. For example, the pin in Figure 3a is re-pinned by 50 users. Because the image is a poster of the video game NBA 2K12, 26 users categorized the pin into the category 'sports', 16 users categorized it into 'entertainments', and the other eight users categorized it into 'design'. On the basis of the statistical distribution of predefined categories given by users, the category distribution of a pin can be computed as
$$p_{C_i} = \frac{f_{C_i}}{\sum_{j=1}^{N_C} f_{C_j}}, \quad i = 1, \dots, N_C, \tag{1}$$
where $f_{C_i}$ denotes the frequency of the $i$-th category ($C_i$), and $N_C$ is the total number of predefined categories. Because the minority opinion is sometimes hard to understand and spammers exist, in practice, before computing the category distribution, we filter the category frequencies with a rule based on $M_C$, the total number of chosen categories that appear in the re-pin tree of $I$, to remove spam and make the distribution represent the majority opinion. Using the proposed annotation method, we were able to acquire labels of collected images without any additional human labor. Furthermore, compared to expensive human-labeled data, we believe the category distribution contributed by the collective user intelligence from re-pin trees is more suitable as the label of a pin. In contrast to existing image representation learning methods, which rely on high-quality label supervision, our category distribution of pins is acquired by mining the rich re-pin relationships from inexhaustible CCSN contents. We then fine-tuned a pretrained CNN model to accelerate the training process. A deeper and wider architecture commonly performs better, but it usually consumes more time and space. Thus, AlexNet [34] was chosen as the basis. The core visual deep model could be replaced by any other state-of-the-art model, such as GoogLeNet or ResNet. AlexNet, with weights pretrained on ImageNet [35], is commonly used to classify independent objects, whereas we needed a multilabel regressor model.
Accordingly, the loss layer was changed from a softmax with a logarithmic loss to a sigmoid with a cross-entropy loss. We define the loss function as
$$L = -\sum_{i=1}^{N_C} \left[ p_{C_i} \log \hat{p}_{C_i} + \left(1 - p_{C_i}\right) \log \left(1 - \hat{p}_{C_i}\right) \right],$$
where $p_{C_i}$ denotes the percentage in Equation (1), and $\hat{p}_{C_i}$ is the corresponding sigmoid output. After fine-tuning the CNN, its weights are stored for feature extraction. Then, the image representations are the activation values of the FC layer.
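The automatic annotation and the multilabel loss described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the short category list are hypothetical, the spam-filtering step applied before Equation (1) is omitted, and the loss is written for a single image without any batch normalization factor.

```python
import math
from collections import Counter

def category_distribution(repin_categories, all_categories):
    """Normalized category frequencies over one re-pin tree (Equation (1))."""
    counts = Counter(repin_categories)
    total = sum(counts[c] for c in all_categories)
    if total == 0:
        raise ValueError("re-pin tree contains none of the predefined categories")
    return [counts[c] / total for c in all_categories]

def sigmoid_cross_entropy(targets, logits):
    """Cross entropy between soft targets and sigmoid outputs, summed over categories.

    Uses the numerically stable form max(z, 0) - z * p + log(1 + exp(-|z|)).
    """
    loss = 0.0
    for p, z in zip(targets, logits):
        loss += max(z, 0.0) - z * p + math.log1p(math.exp(-abs(z)))
    return loss

# Hypothetical subset of the predefined categories.
categories = ["sports", "entertainments", "design", "other"]
# The NBA 2K12 poster example: 26 + 16 + 8 = 50 re-pins.
repins = ["sports"] * 26 + ["entertainments"] * 16 + ["design"] * 8

labels = category_distribution(repins, categories)
loss_at_zero_logits = sigmoid_cross_entropy(labels, [0.0] * len(categories))
```

With all logits at zero, every sigmoid output is 0.5, so each category contributes log 2 to the loss regardless of its target value.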

Text Representation
The text description is an important personalized supplement to the image representation. Similar to the image representation, we generate the text representation for the purpose of discovering the relationships between the descriptions and categories of the pins. Contrary to the case of images, descriptions of the same pin may be different. Therefore, it is not easy to build a large high-quality labelled dataset on CCSNs.
Since words used on CCSNs have no obvious difference with those in common situations, we trained a word2vec [26] model on some public corpora for encoding words. The efficient shallow model was designed for studying word representations. The learned word vectors capture a large amount of syntactic word relationships and meaningful semantic relationships. The training dictionary should include words from the category words and the text description to represent the relationships between the text representation and the categories. In addition, word vectors, which encode words into compact vector spaces, are more scalable than one-hot representations, because the vocabulary of natural language is extremely wide. Both the training speed and the quality of the vectors could be improved by several extensions including the hierarchical softmax, negative sampling, noise contrastive estimation, and subsampling of frequent words [36]. For details of the Word2Vec model, please refer to the original paper.
Because of the diverse lengths of the texts, it is necessary to generate vectors with a constant dimension from a set of word vectors to represent a complete text. Some pooling methods, such as mean pooling [27], have been proven feasible in solving this problem. For a text $T = \{\mathrm{Word}_1, \mathrm{Word}_2, \cdots, \mathrm{Word}_{M_T}\}$, we compute the mean vector in Equation (4) as its text representation,
$$\mathrm{Text}_T = \frac{1}{M_T} \sum_{i=1}^{M_T} \mathrm{KeyedVector}_{\mathrm{Word}_i}, \tag{4}$$
where $\mathrm{KeyedVector}_{\mathrm{Word}_i}$ denotes the vector of the $i$-th word ($\mathrm{Word}_i$), and $M_T$ is the text length.
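Mean pooling of word vectors can be sketched in a few lines of Python. The toy vocabulary, the function name, and the handling of out-of-vocabulary words (they are skipped, and an all-zero vector is returned when no word is known) are assumptions made for illustration; the paper does not specify how unknown words are treated.

```python
def mean_text_vector(words, word_vectors, dim):
    """Average the vectors of the in-vocabulary words of a tokenized text."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        # Assumption: a text without any known word maps to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional word vectors standing in for a trained word2vec model.
toy_vectors = {
    "nba": [1.0, 0.0],
    "poster": [0.0, 1.0],
    "game": [1.0, 1.0],
}

text_vec = mean_text_vector(["nba", "poster", "unknownword"], toy_vectors, dim=2)
empty_vec = mean_text_vector(["unknownword"], toy_vectors, dim=2)
```

Because the output dimension equals the word-vector dimension, texts of any length map into the same space and can be fed to the fusion model directly.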

Multimodal Fusion
Different modalities can provide both consistent and complementary information, while their distinct statistical properties make it difficult to combine them into a joint representation that maintains their specific characteristics using a shallow architecture. A multimodal DBM [32] can effectively model a joint distribution over modalities, which adds a shared hidden layer on top of DBMs to combine them.
As illustrated in Figure 5, a multimodal DBM is an undirected graphical model with fully bipartite connections between adjacent layers. Each of its pathways is a DBM, which is structured by stacking two restricted Boltzmann machines (RBMs) in a hierarchical manner. All layers, except the two bottom layers, use standard binary units. An RBM with hidden units $H = (h_j) \in \{0, 1\}^F$ and real-valued visible units $V = (v_i) \in \mathbb{R}^D$ defines the energy function as follows:
$$E(V, H; \theta) = \sum_{i=1}^{D} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{i=1}^{D} \sum_{j=1}^{F} \frac{v_i}{\sigma_i} w_{ij} h_j - \sum_{j=1}^{F} b_j h_j,$$
where $\sigma_i$ denotes the standard deviation of the $i$-th visible unit and $\theta = \{(w_{ij}) \in \mathbb{R}^{D \times F}, (a_i) \in \mathbb{R}^D, (b_j) \in \mathbb{R}^F, (\sigma_i) \in \mathbb{R}^D\}$. During the unsupervised pretraining process of the multimodal DBM, modalities can be thought of as labels for each other. Each of the multimodal DBM layers makes a small contribution to eliminating modality-specific correlations. Therefore, in contrast to the modality-full input layers, the top layer can learn representations that are relatively modality free. The joint distribution over the image and text inputs, with all model parameters denoted by $\theta$, is obtained by marginalizing over the hidden layers of both pathways and the shared top layer; the reader may refer to the original paper [32] for its explicit form and for more details of multimodal DBMs. An advantage of multimodal DBMs is that they can deal with the absence of some modalities. After training our multimodal DBM, even though some pins may have no descriptions, the activation probabilities of $H^3$, which are used as our final multimodal joint representation of pins, can be inferred from different conditional distributions with the standard Gibbs sampler. In addition, the multimodal DBM is used to generate the missing text representation in a similar manner. Moreover, multimodal DBMs can be trained in a supervised manner by connecting additional label layers on top of them.
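For reference, the conditional distributions that a Gibbs sampler alternates between in a single bottom-layer RBM can be written out explicitly. The block below assumes the standard Gaussian-Bernoulli form of the RBM energy, which is suggested by the per-unit standard deviations $\sigma_i$ among the parameters but is not spelled out here; it covers one RBM only, not the full multimodal DBM.

```latex
% Conditionals of a Gaussian-Bernoulli RBM with energy
%   E(V,H) = \sum_i (v_i - a_i)^2 / (2\sigma_i^2)
%            - \sum_{i,j} (v_i/\sigma_i) w_{ij} h_j - \sum_j b_j h_j
\begin{align}
P(h_j = 1 \mid V) &= \operatorname{sigm}\!\Big( b_j + \sum_{i=1}^{D} w_{ij} \frac{v_i}{\sigma_i} \Big),
  \qquad \operatorname{sigm}(x) = \frac{1}{1 + e^{-x}}, \\
P(v_i \mid H) &= \mathcal{N}\!\Big( v_i \;\Big|\; a_i + \sigma_i \sum_{j=1}^{F} w_{ij} h_j,\; \sigma_i^2 \Big).
\end{align}
```

Both conditionals factorize over units, which is what makes block Gibbs sampling between adjacent layers inexpensive.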

Implementation of Recommendations for Different Tasks
Once the representations of pins were obtained, we then aimed to apply them to the recommender system. According to the practical applications on CCSNs, there are four recommendation tasks: recommending pins or boards to users, recommending thumbnails to boards, and recommending board categories to boards. All the recommendation methods are content-based.

Pin Recommendation
Pin recommendation is a crucial function for content discovery on CCSNs. It can be inferred that pins with similar interests are close in the representation vector space. Considering that the different boards a user collects have different characteristics, accordingly, given a target user, the similarity between pins in a board and the candidate pins is computed in the vector space, and the pins are ranked by similarities in descending order. For different boards, different pins are recommended. Most similarity metrics can be used; cosine similarities were computed in our work. Pins are ranked according to the similarity score and the most similar pins are selected as candidates.
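The ranking step can be sketched as follows. This is a minimal illustration that scores candidates against a single target pin vector; the pin identifiers and the 2-dimensional vectors are made up, and how the scores of several pins in one board are aggregated is left open, since it is not specified above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend_pins(target, candidates, top_n):
    """Return the ids of the top_n candidates most similar to the target pin."""
    scored = [(cosine(target, vec), pin_id) for pin_id, vec in candidates.items()]
    scored.sort(key=lambda item: item[0], reverse=True)  # descending similarity
    return [pin_id for _, pin_id in scored[:top_n]]

# Hypothetical joint representations of three candidate pins.
candidates = {
    "pin_a": [2.0, 0.0],
    "pin_b": [1.0, 1.0],
    "pin_c": [0.0, 1.0],
}
top_two = recommend_pins([1.0, 0.0], candidates, top_n=2)
```

Cosine similarity ignores vector length, so `pin_a` ranks first even though its norm differs from that of the target.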

Board Thumbnail Recommendation
Boards are displayed as thumbnails on all public and personal home pages. A thumbnail includes a cover and two or three small images, or just six small images. A well-designed thumbnail can attract other users to access the board. Both Pinterest and Huaban allow users to select a cover from the pins of the board, but they do not recommend candidates to users. As illustrated in Figure 6a, if a cover is selected, the small images will automatically be selected from the two latest pins. If the user has not selected an image for the cover, the thumbnail will be composed of the six latest pins. It is difficult for a user to select a suitable image to represent the board without any recommendation. Furthermore, the thumbnail consists of the latest pins, which may not represent the board. Boards like the bottom two have such wide interests that the images in the thumbnail cannot fully express them. Similarly, thumbnails on Huaban, one of which consists of the cover and the three latest pins, have the same drawbacks, as shown in Figure 6b,c.
In view of the above, we defined a new task for recommending board thumbnails. The mean vector of the pins in the board, which is the center of the board, is computed. The pins nearest to the center of the board are selected as the cover candidates. Then, we cluster the pins of the board, and the images closest to the cluster centers are selected as substitutes for the latest pins.
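The two selection steps can be sketched with NumPy as below. The clustering algorithm is not named above, so plain k-means with a deterministic farthest-first initialization is an assumption of this sketch, as are the function names and the toy 2-dimensional pin vectors.

```python
import numpy as np

def cover_candidates(pins, n_cover=1):
    """Indices of the pins closest to the board center (mean pin vector)."""
    center = pins.mean(axis=0)
    dist = np.linalg.norm(pins - center, axis=1)
    return np.argsort(dist, kind="stable")[:n_cover].tolist()

def farthest_first_init(pins, k):
    """Deterministic seeding: start at pin 0, then repeatedly add the farthest pin."""
    chosen = [0]
    while len(chosen) < k:
        d = np.linalg.norm(pins[:, None, :] - pins[chosen][None, :, :], axis=2)
        chosen.append(int(d.min(axis=1).argmax()))
    return pins[chosen].copy()

def kmeans(pins, k, n_iter=100):
    """Plain Lloyd iterations; an empty cluster keeps its previous center."""
    centers = farthest_first_init(pins, k)
    for _ in range(n_iter):
        d = np.linalg.norm(pins[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            pins[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def thumbnail_pins(pins, k):
    """Indices of the pins closest to each of the k cluster centers."""
    centers = kmeans(pins, k)
    d = np.linalg.norm(pins[:, None, :] - centers[None, :, :], axis=2)
    return sorted(set(d.argmin(axis=0).tolist()))

# Toy board with two well-separated groups of pins.
board = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                  [10.0, 10.0], [10.2, 10.0], [10.0, 10.2]])
cover = cover_candidates(board, n_cover=1)
small_images = thumbnail_pins(board, k=2)
```

On a board with several distinct themes, the cluster-based picks cover each theme once, whereas the latest pins may all come from the same theme.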

Board Category Recommendation
On CCSNs, every board should be assigned a category, though some boards with no category were created before the choice of a category became mandatory. However, leaving a board without a category is unnecessary, because even if it is difficult to choose a single board category among different user interests, users can select the category "other". Board category recommendation makes category choice more convenient, not only at the first selection but also for further editing.
As mentioned in Section 3.1, interests associated with an image can be spread over the categories which occur in the re-pin tree. The only way to estimate the user preference for one image is to analyze its description and category. Because individuals understand certain notions differently, even manual analysis cannot always determine which single category the user intends to describe, which is to be expected in this setting. We consider that personalization on CCSNs is mainly formed by the way users organize their boards. Hence, similarly to how user interests are reflected by pins, the user interests reflected by a board should span more than one category. With the increasing number of pins, the category preference of the board in the majority opinion is reinforced. The interest distribution of a board $B$ can be calculated as the average of all its pin interest distributions,
$$\mathrm{Interest}_B = \frac{1}{N_B} \sum_{i=1}^{N_B} \mathrm{Interest}_{I_i},$$
where $N_B$ denotes the pin count of $B$, and $\mathrm{Interest}_{I_i} = (p_{iC_j}) \in [0, 1]^{N_C}$ is the interest distribution of the $i$-th pin $I_i$. In order to infer $\mathrm{Interest}_{I_i}$, we trained a multidimensional LR between the representation of pins and the labels obtained in Equation (1). The generated $\mathrm{Interest}_B$ should be normalized immediately. The recommended category is the category with the highest value in the board interest distribution. This method can also be used for computing the interest distribution of a user. As an important part of the user profile, the interest distribution of a user can be intuitively represented by normalizing a frequency distribution of categories of boards or pins. However, this distribution has many limitations. First, it cannot deal with the absence of some categories, although the absence of a category does not mean that the user is not interested in it.
Secondly, the ratio between categories may not be accurate, not only because categories are related, but also because images related to certain fine-grained interests are rarer than others and the user cannot collect enough pins related to these interests. Thirdly, it cannot be used to represent the interests of a board, as the categories of pins in it are the same. Our interest distribution of a target user $U$ is computed as
$$\mathrm{Interest}_U = \frac{1}{N_U} \sum_{i=1}^{N_U} \mathrm{Interest}_{I_i},$$
where $N_U$ denotes the pin count of $U$. Because $\mathrm{Interest}_{I_i}$ actually spreads over all the categories, $\mathrm{Interest}_U$ does not suffer from the absence of some categories. In addition, to some extent, the ratio error, which is caused by the imbalance between pin counts of boards, is reduced, since the strong categories have faster accumulation processes than the weak categories.
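The aggregation of pin interest distributions into a board (or user) distribution, followed by the category choice, can be sketched as below. The per-pin distributions are assumed to be given, standing in for the outputs of the trained LR, and the category names and values are invented for illustration.

```python
def interest_distribution(pin_interests):
    """Average the per-pin interest distributions and renormalize to sum to 1."""
    n = len(pin_interests)
    if n == 0:
        raise ValueError("a board or user needs at least one pin")
    mean = [sum(column) / n for column in zip(*pin_interests)]
    total = sum(mean)
    return [m / total for m in mean] if total > 0 else mean

def recommend_category(pin_interests, categories):
    """Return the category with the highest aggregated interest, plus the distribution."""
    dist = interest_distribution(pin_interests)
    best = max(range(len(categories)), key=lambda j: dist[j])
    return categories[best], dist

# Hypothetical LR outputs for three pins over three categories.
categories = ["sports", "design", "travel"]
pin_interests = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
]
best_category, board_interest = recommend_category(pin_interests, categories)
```

In this toy board, 'sports' is the top category of two pins out of three, yet 'design' is recommended because the averaged distribution also counts the secondary interests of every pin.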

Board and User Recommendation
As the pins are assembled, the theme of a board emerges. Users can easily collect pins with well-organized boards. For this reason, board recommendation is another important function for content discovery on CCSNs. In this section, we discuss how to model boards and users using the acquired multimodal joint representations of pins.
There is an analogy between user contents on CCSNs and articles, as user contents consist of boards which consist of pins, while articles are composed of paragraphs or sentences that are composed of words. One clear difference between them is that the order of pins or boards may not be that important. Therefore, the loss of order information is not an issue when modeling. Inspired by this, we consider that applying pooling methods to transform a different number of pins into a constant dimension vector, as we mentioned in Section 3.2, is reasonable for board and user modeling. Among pooling methods, the Fisher vector (FV) was chosen as our solution for board and user modeling.
The FV [37] was designed for encoding patch descriptors of an image into a high-dimensional vector. Since boards and users are image collections, a pin can be treated as a descriptor of them. A common method to encode a set of descriptors is to assign them into a visual dictionary, which is composed of prototypical elements such as cluster centers, while the FV approximates the distribution of descriptors with a GMM, whose Gaussian distributions can be treated as a universal probabilistic visual dictionary. For the representation of a pin $V_{P_i} = (v_{ij}) \in \mathbb{R}^J$, the GMM is defined as
$$p(V_{P_i}) = \sum_{k=1}^{K} \omega_k \, \mathrm{norm}_k(V_{P_i}),$$
where $\mathrm{norm}_k$ denotes the $k$-th multivariate normal distribution, $\omega_k$ is the weight of the $k$-th mixture component and is subject to the constraints $\forall k: \omega_k \geq 0$ and $\sum_{k=1}^{K} \omega_k = 1$, and $K$ is the number of mixture components. The parameters of the GMM also include $(\mu_{kj}) \in \mathbb{R}^J$ and $\Sigma_k$, which are the mean vector and covariance matrix of the $k$-th mixture component, respectively. The FV first computes the partial derivatives of the logarithm of the GMM with respect to its parameters, and then it normalizes them with the Fisher information matrix. The simplified normalized partial derivatives of a board $B$ are given by
$$\mathcal{G}^B_{\omega_k} = \frac{1}{\sqrt{\omega_k}} \sum_{i=1}^{N_B} \left( \gamma_{ik} - \omega_k \right), \tag{11}$$
$$\mathcal{G}^B_{\mu_{kj}} = \frac{1}{\sqrt{\omega_k}} \sum_{i=1}^{N_B} \gamma_{ik} \, \frac{v_{ij} - \mu_{kj}}{\sigma_{kj}}, \tag{12}$$
$$\mathcal{G}^B_{\sigma_{kj}} = \frac{1}{\sqrt{2\omega_k}} \sum_{i=1}^{N_B} \gamma_{ik} \left[ \frac{(v_{ij} - \mu_{kj})^2}{\sigma_{kj}^2} - 1 \right], \tag{13}$$
where $\sigma_{kj}$ denotes the standard deviation of the $j$-th dimension of the $k$-th mixture component, and $\gamma_{ik}$ is the soft assignment of $V_{P_i}$ to the $k$-th mixture component, which is written as
$$\gamma_{ik} = \frac{\omega_k \, \mathrm{norm}_k(V_{P_i})}{\sum_{l=1}^{K} \omega_l \, \mathrm{norm}_l(V_{P_i})}$$
and is also known as the posterior probability or responsibility. All partial derivatives are concatenated to compose the FV. Since one of the $\omega_k$ is redundant because of the constraints, the dimension of the FV is $(2J + 1)K - 1$. Power normalization and L2-normalization [38] are applied to improve the quality of the FV as follows:
$$\mathrm{FV} \leftarrow \operatorname{sign}(\mathrm{FV}) \, |\mathrm{FV}|^{\rho}, \qquad \mathrm{FV} \leftarrow \frac{\mathrm{FV}}{\lVert \mathrm{FV} \rVert_2},$$
where $\rho \in [0, 1]$ is the normalization parameter and the power is applied element-wise. The FV of a user can be computed in the same manner. Please refer to the original paper for more details regarding the FV. In essence, the FV is the gradient of the log-likelihood of a board.
Notice that the computations of Equations (11)-(13) can be simplified with
$$S^0_k = \sum_{i=1}^{N_B} \gamma_{ik}, \qquad S^1_k = \sum_{i=1}^{N_B} \gamma_{ik} V_{P_i}, \qquad S^2_k = \sum_{i=1}^{N_B} \gamma_{ik} V_{P_i}^2,$$
where $S^0_k$, $S^1_k$, and $S^2_k$ are the zeroth order, first order, and second order statistics of the board, respectively, and the square is taken element-wise. Accordingly, the FV preserves more information than other pooling methods, such as the vector of locally aggregated descriptors (VLAD) and sparse coding, with the same dictionary capacity. It actually measures not only which words in the visual dictionary the pins belong to, but also the differences between the mean vectors of the GMM and the board or user. On the other hand, the FV uses a relatively small dictionary to generate a vector of the same dimension as the others, such that the computational complexity is lower. In addition, the FV is interpretable. If we consider the mean vectors as the centers of interests, increasing $K$ will make the FV more fine-grained, although the curse of dimensionality is a significant limitation of the FV. For the sake of large-scale applications, the FV could be compressed by sparsity encoding with product quantization [39].
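The board encoding can be sketched with NumPy as below, using the statistics $S^0_k$, $S^1_k$, and $S^2_k$. The sketch assumes a diagonal-covariance GMM whose parameters are passed in explicitly (in practice they would be fitted on pin representations), drops the last weight derivative as the redundant one, and follows the commonly used form of the normalized gradients; whether this matches every normalization detail of the paper is an assumption.

```python
import numpy as np

def fisher_vector(pins, weights, means, sigmas, rho=0.5):
    """Encode a set of pin vectors (N, J) with a diagonal GMM of K components.

    weights: (K,), means and sigmas: (K, J). Returns a vector of size (2J + 1)K - 1.
    """
    n, j = pins.shape
    # Log-density of every pin under every component, shape (N, K).
    diff = (pins[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    log_norm = (-0.5 * np.sum(diff ** 2, axis=2)
                - np.sum(np.log(sigmas), axis=1)
                - 0.5 * j * np.log(2.0 * np.pi))
    # Soft assignments (responsibilities), computed stably in the log domain.
    log_post = np.log(weights) + log_norm
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Zeroth, first, and second order statistics of the board.
    s0 = gamma.sum(axis=0)        # (K,)
    s1 = gamma.T @ pins           # (K, J)
    s2 = gamma.T @ pins ** 2      # (K, J)

    sqrt_w = np.sqrt(weights)
    g_w = (s0 - n * weights) / sqrt_w
    g_mu = (s1 - means * s0[:, None]) / (sqrt_w[:, None] * sigmas)
    g_sigma = ((s2 - 2.0 * means * s1 + (means ** 2 - sigmas ** 2) * s0[:, None])
               / (np.sqrt(2.0 * weights)[:, None] * sigmas ** 2))

    # Drop one redundant weight derivative (assumption: the last one).
    fv = np.concatenate([g_w[:-1], g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.abs(fv) ** rho          # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv          # L2-normalization

# Toy board: 20 pins with J = 3, encoded with a hand-set K = 2 GMM.
rng = np.random.default_rng(0)
toy_pins = rng.normal(size=(20, 3))
toy_weights = np.array([0.4, 0.6])
toy_means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
toy_sigmas = np.ones((2, 3))
board_fv = fisher_vector(toy_pins, toy_weights, toy_means, toy_sigmas)
```

Because the output size depends only on $J$ and $K$, boards and users with different pin counts are mapped to vectors of the same dimension and can be compared with the same similarity metric.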
After modeling, boards can be recommended according to the similarity metrics between them and the target board. Because users can be considered image collections with wider interests than boards, user recommendation done in this same manner is also helpful for content discovery, although users on CCSNs are not very interested in following.

Experiments and Results
In this section, the datasets and implementation details are introduced first. Then, the performance of our representation of pins is evaluated in an interest analysis. Thereafter, the results of experiments on real-world datasets are presented to verify the feasibility and effectiveness of our recommendation methods.

Datasets and Implementation Details
We crawled the data used in the experiments from Huaban, a typical Chinese CCSN. Huaban provides functions similar to those of Pinterest. The main differences between the two networks are as follows: Huaban offers "like" operations for pins and boards, whereas Pinterest does not; and Huaban records both the users and the paths in a re-pin tree, whereas Pinterest only records the users involved and the user who originally created the pin.
We first crawled the pins of 5957 users without images, and then sampled 88 users according to board categories and pin counts. We confirmed that the sample included some extremely active users and some cold-start users, to make our dataset diverse and to take the influence of pin counts into account. We then crawled all images of the sampled users and all their "like" pins. In addition, for each category we crawled the top 1000 pins recommended by the system, which were used to fine-tune AlexNet, together with their re-pin paths, which were used for automatic annotation. The dataset for recommendation included 151,631 pins from 1694 boards, which were categorized into 33 categories, and the number of unique images for both fine-tuning and recommendation was 167,747. All pins were used as supplementary data for obtaining the category distributions of the recommended pins. The average re-pin path length was 47.57.
After a little manual label balancing, the labelled images were split into 80% for training and validation and the remaining 20% for testing. Because the input dimension of AlexNet must be constant, every image was first rescaled so that its shorter side was 256 pixels, and then the central 256 × 256 patch of the rescaled image was cropped out. The loss layer of our AlexNet was replaced to accommodate multilabel training. As a comparison, the most frequent category was used as the label to fine-tune a multiclass AlexNet. The dimensions of the FC8 layers of both AlexNets were changed to 33. Image representations were generated from the FC7 layer of the multilabel AlexNet.
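The rescale-and-crop step can be illustrated by computing only its geometry. The sketch below is an assumption about how the step might be implemented, not the authors' code: the function name is hypothetical, it returns the rescaled size and the central crop box, and the actual pixel resampling is left to whatever image library is used.

```python
def resize_and_center_crop_geometry(width, height, target=256):
    """Return the rescaled (width, height) and the central crop box.

    The shorter side is scaled to `target` pixels, then a target x target
    patch is taken from the center. The box is (left, top, right, bottom).
    """
    scale = target / min(width, height)
    # Guard against rounding the shorter side below the target size.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

size, box = resize_and_center_crop_geometry(640, 480)
print(size, box)  # (341, 256) (42, 0, 298, 256)
```

For a landscape image the crop discards equal margins on the left and right, so content near the horizontal edges of wide images is not seen by the network.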
We trained our Word2Vec model on Wikipedia dumps (https://dumps.wikimedia.org/) and the Sogou Lab dataset (http://www.sogou.com/labs/resource/list_news.php) with the CBOW (Continuous Bag of Words) model and negative sampling. The vector dimension was 300, and words with a frequency lower than five were ignored. Word preprocessing, including punctuation removal, conversion between traditional and simplified Chinese, word tokenization, machine translation, and stop word removal, was applied to the pin descriptions.
All image and text representations were used for training the multimodal DBM. The dimensions of $H_{T1}$, $H_{T2}$, and $H_{V1}$ were the same as those of their corresponding visible inputs, and the dimensions of $H_{V2}$ and $H_3$ were set to 2048 to compress the vectors, because the FV would later increase the dimension. Each layer was pretrained using a contrastive divergence strategy to accelerate the training of the DBM. Then, missing text representations were inferred using a Gibbs sampler, and the multimodal joint representation of pins was inferred.
K in Equation (10) was set to 1 such that the dimension of the FV of a board was twice that of the pin vector. α in Equation (15) was set to 0.5.
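As a quick check of this setting, the FV dimension formula given earlier can be evaluated at $K = 1$. The value $J = 2048$ below is only an illustrative assumption for the pin vector dimension.

```latex
\begin{align}
\dim(\mathrm{FV}) &= (2J + 1)K - 1 \\
                  &= (2J + 1)\cdot 1 - 1 = 2J \qquad (K = 1) \\
                  &= 4096 \qquad \text{for an assumed } J = 2048
\end{align}
```

With a single component, the soft assignment of every pin equals 1 and the only weight is fixed by the sum-to-one constraint, so only the $J$ mean gradients and the $J$ standard deviation gradients remain.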
To evaluate the effectiveness of the proposed model, we compared it with the following multimodal deep architectures: the Multimodal Autoencoder (MAE), which was proposed in [31] and connects the deep autoencoders of the two modalities through a shared hidden layer; and ICMAE, which imposes Independent Component Analysis (ICA) constraints on the MAE architecture to de-correlate the variables. All the baseline methods had the same number of layers, and we used the same features as inputs to ensure that the comparisons were fair.

Analysis of Interests Represented by Pins
Analysis of interests based on pins is a prerequisite for analysis of interests based on boards and users. As mentioned above, it is hard to measure the interest distribution of a single pin. Hence, we treated the interest distribution of its image as an approximation, even though the text description could improve the estimate for some categories.
Multidimensional LRs were trained on the dataset for all unimodal and multimodal representations. Table 1 shows the results, together with those of multiclass classification with softmax. The mean nonzero error is the average error between all nonzero categories and the corresponding predictions. The accuracy of the dominant category measures whether the most frequent category agrees between labels and predictions. The comparison of the multiclass and multilabel CNNs shows that our method with multilabel annotation improves the accuracy significantly. This is not only because the interference of related categories can be eliminated by category distributions, but also because more information from the users' collective intelligence is provided for learning. Although the performance of text representations was not comparable to that of image representations, the multimodal joint model performed better than image representations alone, which indicates that the two modalities are complementary. The results show that all unimodal and multimodal representations contain information about user interests, and that our method performed best because our joint representation contains richer information than the other methods. Our method could also be used to analyze the interests of images on other networks. The comparison with MAE and ICMAE shows that the joint representation of pins learned by our method has a higher correlation with their categories.
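The two metrics reported in Table 1 can be sketched as follows. The text does not specify the error function or how the average is pooled, so this sketch assumes absolute error pooled over all nonzero label entries; the function names are hypothetical.

```python
def mean_nonzero_error(labels, preds):
    """Average absolute error over categories whose label proportion is nonzero.

    labels, preds: lists of per-pin category distributions of equal length.
    Pooling over all pins (rather than averaging per pin first) is an assumption.
    """
    errors = [abs(y - p)
              for ys, ps in zip(labels, preds)
              for y, p in zip(ys, ps) if y > 0]
    return sum(errors) / len(errors)

def dominant_category_accuracy(labels, preds):
    """Fraction of pins whose most frequent label category is also predicted top."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    hits = sum(argmax(ys) == argmax(ps) for ys, ps in zip(labels, preds))
    return hits / len(labels)

labels = [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]
preds = [[0.4, 0.4, 0.2], [0.2, 0.8, 0.0]]
print(dominant_category_accuracy(labels, preds))  # 0.5
```

The first metric ignores categories that never occur in a pin's re-pin path, so it rewards getting the observed proportions right rather than predicting zeros well.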

Pin Recommendation
We invited 10 users to take part in the evaluation of pin recommendation. Each user was given 200 randomly selected target images and the corresponding recommendation results of the different methods. For each target image, they were asked to decide whether they would pin any of the three candidate images, provided that they were not the owner of the target pin. Table 2 shows the precision of the recommendations. A simple content-based filtering method, which randomly selects an image with the same category as the target image, was implemented as a reference. All other methods achieved higher precision than the category-based method because they utilized more information to reduce the effect of related categories. The object-based and interest-based methods used the probability layer from the original AlexNet and the multilabel AlexNet, respectively. The results of those two methods were comparable, while interest distributions were more compact than object distributions. This indicates that, on CCSNs, even the coarse-grained interests of an image were slightly more important than what the image depicted. The other methods computed the cosine similarity between representations. We note that using only dominant categories as the label to fine-tune AlexNet led to a decline, which may have been caused by confusion between similar images with different categories. Notice that the performance of multimodal features was worse than that of image features. We believe that the descriptions could not completely describe all the interests and characteristics of the images. Our text representations were clearly not as effective as our image representations; therefore, image representations were more suitable for image recommendation. Figure 7 illustrates 10 images and their recommendation results. Intrinsic characteristics such as background, scene, pattern, texture, color, object, and material are maintained in the image representation and usually had an effect on the recommendation, especially for the images in the left panel.
Images in the right panel show that some abstract notions, for example, style and user interest, influenced the results. All these high-level image features learned from CNNs could significantly improve the accuracy and diversity of recommendations. From the recommendation results, we can see that our model recommended images of similar styles and types. This means that our model could achieve good results in content-based recommendation, and it further illustrates that the features extracted by our multimodal joint representation model were effective. The recommendation data were different from the training data, which suggests that our model generalizes well.
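The similarity-based methods in this experiment rank candidates by the cosine similarity between representations and present three candidates per target. A minimal sketch of that ranking step is given below; the function names and the dictionary layout of the candidate set are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def recommend(target, candidates, k=3):
    """Return the ids of the k candidates most similar to the target vector.

    candidates: dict mapping a pin id to its representation vector.
    """
    ranked = sorted(candidates,
                    key=lambda pin_id: cosine(target, candidates[pin_id]),
                    reverse=True)
    return ranked[:k]

pins = {"a": [2.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
print(recommend([1.0, 0.0], pins))  # ['a', 'b', 'c']
```

Cosine similarity ignores vector magnitude, so pin "a" ranks first even though it is twice as long as the target vector.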

Board Thumbnail Recommendation
In this experiment, we recommended the board thumbnail according to the interest distributions of pins and the representation of pins. Because Huaban does not yet offer a function for editing thumbnails, we manually re-pinned all pins from the original board and changed the order of the pins to display our results. In Figure 6b, the pins of the board are album covers of a music group. The four pins in the original thumbnail were all from the same album. The strongest categories for this board were "film music books" (20.87%), "design" (16.24%), and "architecture" (11.50%), while those for the cover in Figure 8a were "film music books" (20.44%), "design" (14.15%), and "architecture" (10.81%). Three clusters, whose centers mainly belonged to "photography" (15.72%), "film music books" (80.93%), and "architecture" (47.19%), contained 30, 7, and 4 pins, respectively. This indicates that the recommendation results are consistent with the target board. In contrast to the original thumbnail, the four components of the recommended thumbnail were from different albums. Similar to the result generated with interest distributions, the result generated with image features also comprised pins from different albums, partly because image representations are also related to interests. Our results also indicate that even narrow interests can be divided. Recommending thumbnails for a board covering wide interests was easier; the recommendations for Figure 6c are shown in Figure 8c,d. We believe that our recommended thumbnails, which depict more interests, are more attractive.

Board Category Recommendation
The ground truth of board category recommendation is the crawled board category. The performance metric of the experiment was mean reciprocal rank (MRR). We only give the top MRR because there was only one accurate selection of the board category recommendation. The results are shown in Table 3.
From the table, we can see that our model had the highest MRR. Because the board category recommendation results were based on different features but the same classifier, the best result indicates the most informative features. These results illustrate that multimodal representations, with the benefit of personalized text representations, performed better than the baselines.

Board Recommendation
Every board was divided into two halves based on the order of its pins. Because the two halves come from the same board, they must be similar to each other, and a user who collected one half should be interested in the other half and would naturally like, follow, or re-pin from it. Based on this observation, for each half, the other half of the same board was treated as the only correct recommendation result, and we retrieved its index in the similarity ranking. Because five pins are exhibited in the top row on Huaban at common screen resolutions, the top-five MRR is also reported. Table 4 shows the experimental results, from which two observations can be made. First, for the same feature, encoding with the FV performed best. For example, except for text vectors, pin vectors encoded with the FV performed better than the corresponding pin vectors combined with the mean vector. The better performance is due to the utilization of higher-order statistics. Second, when different features were encoded with the same method, our representation demonstrated the best performance. The results also illustrate that multimodal joint representations have a better board modeling performance than the unimodal representations with lower dimensions.
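The MRR used for board categories and boards, and its top-five variant, can be sketched as follows. The interpretation of the top-five MRR as a reciprocal rank that counts as zero when the correct item falls outside the first five positions is an assumption, as are the function names.

```python
def reciprocal_rank(ranked_ids, correct_id, cutoff=None):
    """Return 1/rank of the correct item, or 0.0 if it is absent or beyond the cutoff."""
    top = ranked_ids if cutoff is None else ranked_ids[:cutoff]
    for rank, item in enumerate(top, start=1):
        if item == correct_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, correct_ids, cutoff=None):
    """Average the reciprocal rank over all queries (one correct item per query)."""
    scores = [reciprocal_rank(r, c, cutoff) for r, c in zip(rankings, correct_ids)]
    return sum(scores) / len(scores)

# Two query halves; the matching half is ranked 2nd and 6th, respectively.
rankings = [["a", "b", "c"], ["x", "y", "z", "w", "v", "u"]]
correct = ["b", "u"]
print(mean_reciprocal_rank(rankings, correct, cutoff=5))  # 0.25
```

Without a cutoff, the second query still contributes 1/6, which shows how the top-five variant penalizes results that would not appear in the first displayed row.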

Conclusions
We proposed a framework for multimodal joint representation learning of pins on CCSNs. The obtained representation contains information about user interests, which is useful for recommender systems and user modeling. We modeled boards and users with the FV and proposed a series of recommendation methods for different recommendation tasks, including a novel board thumbnail recommendation task that we defined on the basis of our pin recommendation. The experimental results show that the obtained representations interpret pin-level interests better than unimodal representations with lower dimensions, and that our recommendation methods based on the multimodal representation are effective in recommending pins, board thumbnails, board categories, and boards.

Conflicts of Interest:
The authors declare no conflict of interest.