On the effectiveness of convolutional autoencoders on image-based personalized recommender systems

Recommender systems (RS) are increasingly present in our daily lives, especially since the advent of Big Data, which allows for storing all kinds of information about users' preferences. Personalized RS are successfully applied in platforms such as Netflix, Amazon or YouTube. However, they are missing in gastronomic platforms such as TripAdvisor, where we can nevertheless find millions of images tagged with users' tastes. This paper explores the potential of using those images as sources of information for modeling users' tastes and proposes an image-based classification system to obtain personalized recommendations, using a convolutional autoencoder as feature extractor. The proposed architecture is applied to TripAdvisor data, using users' reviews that can be defined as a triad composed of a user, a restaurant, and an image of it taken by the user. Since the dataset is highly unbalanced, the use of data augmentation on the minority class is also considered in the experimentation. Results on data from three cities of different sizes (Santiago de Compostela, Barcelona and New York) demonstrate the effectiveness of using a convolutional autoencoder as feature extractor, instead of the standard deep features computed with convolutional neural networks.


I. INTRODUCTION
The digital revolution has undoubtedly changed our lifestyle. With the advent of e-commerce, social networks dedicated to sharing reviews and recommendations about different products started to become popular, mainly due to the large number of options that consumers had within their reach, as well as the distrust generated by not being able to directly see what is being purchased. Because of this, we can find millions of images tagged with the tastes of users in these social networking web services. This, among other factors, led to the integration of different personalized recommendation systems with great success in several online platforms.
This research has been financially supported in part by European Union ERDF funds, by the Spanish Ministerio de Economía y Competitividad (research project TIN2015-65069-C2), by the Xunta de Galicia (research projects GRC2014/035 and ED431G/01), and by the Principado de Asturias Regional Government (research project IDI-2018-000176). We would like to express our gratitude to the CESGA for the provided resources that allowed this research, and the support of NVIDIA Corporation with the donation of the Titan Xp GPUs.
On the other hand, due to the boom that culinary culture has experienced in social networks in recent years, platforms such as TripAdvisor or FourSquare are becoming very popular. If we consider that TripAdvisor has approximately 460 million unique monthly visitors and more than 830 million opinions, we can get an idea of the economic impact of using these data in a personalized recommender system.
In recent years, the use of images in recommender systems (RS) has been increasing [1], [2], specifically through feature extraction [3], [4]. Some of the approaches that use images in their recommendations also employ text or metadata to complement them [5], others do not make personalized recommendations [6], and those that use personalization do not consider images [7], [8]. To the best of our knowledge, our paper is the first one that explores the use of images in a personalized RS for restaurants, taking advantage of convolutional autoencoders for feature extraction.
Therefore, the goal of this paper is to study how well an RS works when using convolutional autoencoders to model users and items through their images. In this context, we propose a model that uses the images uploaded by users of a gastronomic platform in order to build a personalized recommendation. To do so, we need to find the feature vector that best defines a given image, with the aim of improving the performance of a personalized restaurant RS. Convolutional autoencoders are commonly used in various image-related tasks, such as compression [9], reconstruction [10], noise reduction [11], and feature extraction [12]. However, to the best of our knowledge, their use has not been explored in depth in the context of image-based RS. Unlike existing approaches, the main contribution of this work is to use a convolutional autoencoder as feature extractor for the images that feed a personalized RS.
We will demonstrate that our approach is sound because (1) it makes use of the context of the problem, (2) it works better than standard approaches that use a pre-trained convolutional neural network (CNN), and (3) it is less computationally expensive than integrating and fine-tuning a CNN.
The remainder of this paper is organized as follows. Section II includes a review of the latest studies related to restaurant recommendations, as well as the use of images in RS. Section III presents the architecture and implementation of the proposed model. In Section IV, we describe the three datasets used, the different experiments carried out, and the performance achieved. Finally, in Section V the most relevant aspects that we could extract from experimentation are presented, as well as various lines of interesting future research.

II. RELATED WORK
As mentioned in Section I, there are several attempts aiming to build an RS based on available TripAdvisor reviews. Zhang et al. [7], [8] used TripAdvisor data, but without considering the visual information at all. Amis [6] used TripAdvisor images to select the photos to be displayed in search results employing a convolutional network, but not in a personalized way. Regarding restaurant recommendations, Chu et al. [5] investigated the visual effects of images using blog photographs along with the accompanying text. Feature extraction was carried out using a CNN, but with a support vector machine to previously classify images into different categories. Therefore, we have not found other related studies in the available literature with which to compare our results.
Regarding the use of visual information on personalized RS, one of the first attempts was proposed by He et al. [3], using a CNN to extract the deep features of product images, which are then processed in an RS based on matrix factorization. Notice that the main difference with respect to our proposal is that we intend to use images in order to define not only the user profile but also the item to recommend. Fant et al. [1] developed an image-based RS to allow personalized recommendations via exploratory search, from large-scale collections of manually annotated Flickr images. These collections were first organized at a semantic level, and then series of points of interest and their scale-invariant feature transform (SIFT) features were used to characterize the local visual properties of the images. Kurt et al. [4] proposed an image-based RS using the bag of words technique and feature descriptors such as SIFT, SURF (speeded-up robust features) and LBP (local binary patterns) [13] to characterize the shoe images of the users. Finally, Xu et al. [2] tried to predict user attention for images, dividing them into segments and using content similarity.

III. METHODOLOGY
This work proposes a method to provide gastronomic recommendations based on users' tastes, which can be inferred from the images they post. In general terms, the problem at hand can be defined as a classification task in which each triad is assigned one of two labels:
$$(u, r, i) \rightarrow \{0, 1\} \tag{1}$$
where u is a user who visited the restaurant r and took the image i. Regarding the two labels, 0 means that the user u does not like the restaurant r, whilst 1 means the opposite. Aiming at solving this binary classification task, we propose the model depicted in Figure 1: a network that learns on triads of users, restaurants and photos (u, r, i) to provide an image-based personalized recommendation. The codification of each input triad is defined as follows:
• Users and restaurants are represented by a one-hot codification.
• Photos are represented using a convolutional encoder, described in Section III-A.
The three codifications are then mapped to their corresponding embedding representations, each one composed of 512 features, by using embedding layers for users and restaurants, and a fully connected (FC) layer for images. These three embedding codes are then concatenated, obtaining a 1536-dimensional vector. Given that the features of each individual input (u, r, i) are in different scales, a batch normalization [14] layer is used to normalize this vector, thus speeding up the learning process. This layer is followed by an FC layer that maps the input vector into a 1024-dimensional vector.
The vector obtained includes an individual representation of each element in the triad, (u, r, i), and so a subsequent processing is necessary to obtain a joint representation. This step is carried out by means of two reduction blocks, whose structure is illustrated in Figure 2. As can be observed, a reduce block maps an input vector into an output vector of half the size, and is composed of a sequence of layers that includes FC, dropout [15] with a probability p = 0.5, and rectified linear unit (ReLU) [16] as the activation function.
After the two reduction blocks in Figure 1, there is an FC layer that reduces the size of the vector by half; finally, there is another FC layer with a sigmoid activation function that generates a probability output in the range [0, 1], where 0 means dislike and 1 means like.
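The sizes quoted in this section can be traced step by step. The following is a minimal sketch rather than the implementation used in the paper: the stage names are hypothetical, and a single output unit is assumed for the final sigmoid layer since the text only states that it produces one probability.

```python
# Illustrative trace of the vector sizes through the recommendation network.
# Stage names are hypothetical; the sizes follow the description in the text.

EMBEDDING_SIZE = 512  # size of each embedding (user, restaurant, image)


def reduce_block(size: int) -> int:
    """A reduce block (FC + dropout + ReLU) outputs a vector of half the size."""
    return size // 2


def trace_sizes() -> list[tuple[str, int]]:
    """Return (stage, output size) pairs from the concatenation to the output."""
    stages = []
    size = 3 * EMBEDDING_SIZE  # user, restaurant and image embeddings
    stages.append(("concatenation + batch normalization", size))
    size = 1024
    stages.append(("fully connected", size))
    for k in (1, 2):
        size = reduce_block(size)
        stages.append((f"reduce block {k}", size))
    size //= 2  # FC layer that halves the vector again
    stages.append(("fully connected", size))
    # Assumption: one sigmoid unit produces the like/dislike probability.
    stages.append(("fully connected + sigmoid", 1))
    return stages


for stage, size in trace_sizes():
    print(f"{stage}: {size}")
```

Under these assumptions the joint representation shrinks from 1536 to 128 features before the final probability is produced.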

A. Image encoding
As mentioned above, the proposed network needs a vectorial representation of the input image to combine the visual information with the user and restaurant codifications. For this purpose, we propose to use a convolutional encoder.
An autoencoder [17] is an unsupervised machine learning algorithm that takes the input data and aims to reconstruct them using a lower-dimensional representation. It has a symmetrical structure formed by two main components: the encoder and the decoder. The encoder transforms the input x into a representation h, also known as code, using a deterministic function of the type
$$h = f_{\theta}(x) = \sigma(Wx + b) \tag{2}$$
with parameters θ = {W, b}, where W is the weight matrix, b is the bias vector, and σ is an element-wise activation function. The decoder uses the code h to generate r, a reconstruction of the input x, using a reverse mapping of f:
$$r = g_{\theta'}(h) = \sigma(W'h + b') \tag{3}$$
with parameters θ' = {W', b'}, where W' is the weight matrix, b' is the bias vector, and σ is an element-wise activation function.
It is worth noting that the autoencoder is trained to minimize the difference between input x and its reconstruction r. Based on the premise that the reconstruction generated by the autoencoder is good enough, we can use the code generated by the encoder as a feature vector that represents the input.
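The training objective described above can be written compactly. The expression below is only a sketch in assumed notation: f and g denote the encoder and decoder mappings with parameters θ and θ', x^(1), ..., x^(n) are the training inputs, and L is a reconstruction loss. The text does not state which loss is used, so the squared error shown on the right is an assumption (a common choice), not the authors' stated setting.

```latex
\theta^{*}, \theta'^{*}
  = \arg\min_{\theta,\, \theta'} \; \frac{1}{n} \sum_{k=1}^{n}
    L\!\left(x^{(k)},\, g_{\theta'}\!\left(f_{\theta}\!\left(x^{(k)}\right)\right)\right),
\qquad
L(x, r) = \lVert x - r \rVert_{2}^{2} \quad \text{(assumed loss)}
```

Once the minimization has converged, only the encoder f is kept, and its output h serves as the feature vector.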
When images are used as input data, convolutional autoencoders (CAE) [18] are the natural choice. Broadly speaking, they have the same conceptual structure as any other autoencoder. Therefore, when a CAE is trained, its encoder part can be used as a highly effective feature extractor [17]. The main difference is that the encoder uses convolutional and pooling layers to extract features and reduce the size of the input volume (image), respectively, to finally produce a lower-dimensional representation (code); next, the decoder takes the code and processes it to obtain the final reconstruction by means of convolutional and upsampling layers. Figure 3 illustrates the general architecture of a CAE.
Table I details the architecture of the CAE used in this research, based on the one proposed by Chollet [19]. The core part of the CAE is a building block that consists of a sequence of three layers: convolutional, batch normalization, and ReLU. Regarding the whole architecture, it is composed of a downsampling path (encoder) that includes a sequence of convolutional and max-pooling layers, and an upsampling path (decoder) that includes a sequence of convolutional and upsampling layers. The code generated by the CAE corresponds to the feature maps of the bottleneck, i.e., the intermediate representation obtained between both paths. Once the CAE is trained, its encoder part can be used as a feature extractor. More specifically, the encoder is followed by a flatten layer that converts the code generated (feature maps) into a feature vector, which represents the image encoding used in our model (see Figure 1).

This section first describes the datasets used to evaluate the performance of our proposed method. Next, we detail the experimentation carried out, which includes two alternative approaches to compute the feature vector of input images. Finally, the results obtained are presented and analyzed in depth, including an ablation study.

A. Dataset
The data used in this work were collected in 2018 and 2019 from the TripAdvisor reviews published by users about restaurants in cities of different sizes: 1) Santiago de Compostela (Spain), a small city located on the Atlantic coast, with a population of 95800 inhabitants; 2) Barcelona (Spain), one of the largest and most touristic cities on the Mediterranean coast, with a population of 1.7 million inhabitants; and 3) New York (USA), a very large and popular city of approximately 8.4 million inhabitants, located on the East Coast. Given that we need to store the relationships between users and restaurants, large sparse matrices are generated. As the number of individuals increases, the problem becomes more pronounced. Considering that the number of users and items at a city level is representative enough to make a good recommendation, three different datasets are considered, one per city. Table II shows the magnitude of the three datasets, including the number of users, restaurants and reviews made by users. Notice that this research is focused on images, thus only the reviews with images are considered for experimentation. Therefore, the number of reviews is: 7003 in Santiago, 66904 in Barcelona, and 111415 in New York. As each review can include one or more images, the number of photos available for experimentation is as follows: 16168 in Santiago, 153707 in Barcelona, and 234689 in New York.
These datasets present some peculiarities that may hinder the recommendation task. According to the figures presented in Table II, there is a high imbalance between the positive class (1, like), and the negative class (0, dislike). In particular, the ratios between positive and negative classes are approximately 6:1 for Santiago, 5:1 for Barcelona, and 7:1 for New York.
Consequently, it will be more difficult for any system to learn how to distinguish cases corresponding to negative samples.
Regarding the data available on TripAdvisor, it is worth noting that each review includes the identifier of the user who published it, the identifier of the restaurant reviewed, a set of images (optional), and a score in terms of stars (from one to five). As detailed above, only reviews with images are used in this research and thus, each review considered contains at least one image. Regarding the scores, reviews with one to three stars are labeled as 0 (dislike), whilst those with four or five stars are labeled as 1 (like). Note that these labels are the output of the binary classification system proposed in Section III.
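The labeling rule above amounts to a single threshold on the star score. A minimal sketch follows; the function name is hypothetical and not taken from the authors' code.

```python
def label_from_stars(stars: int) -> int:
    """Map a 1-5 star score to the binary label: 1-3 -> 0 (dislike), 4-5 -> 1 (like)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"stars must be between 1 and 5, got {stars}")
    return 1 if stars >= 4 else 0


# Every image of a review inherits the label of that review.
labels = [label_from_stars(stars) for stars in (1, 2, 3, 4, 5)]
```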
For experimentation purposes, the datasets were split into train and test sets. Figure 4 illustrates the procedure carried out to create these partitions. Note that it is a user-dependent problem and, thus, some restrictions must be considered. The complete procedure is detailed as follows. For each user:
1) If there are several reviews that belong to the same pair (u, r), all the corresponding images are fed to the train set in order to avoid opposite ratings.
2) The remaining reviews of the user u are divided into two groups based on their ratings, positive or negative. For each group, all the images of one review are assigned to the test set and the images belonging to the rest of the reviews to the train set. Note that if a user has a single review, then it is assigned to the train set. In this manner, it is guaranteed that all the users evaluated in the test set have been considered in the learning process with the train set.
3) If there is any review in the test set that belongs to a restaurant that is not included in the train set, then all its images are moved to the train set. Again, the idea is to guarantee that all the restaurants evaluated in the test set are also in the train set.
Due to experimentation requirements, the initial train set was split again following the same procedure, thus obtaining a new train set and a validation set. From now on, we will refer to this new train set as the train set. Table III shows the number of images per partition for the three datasets, including the ratios between positive and negative samples.
As the datasets are highly unbalanced, a strategy must be applied to reduce the impact of the imbalance on the model performance. Data augmentation [20] consists in increasing the number of samples (images) of the train set, without collecting new data. Particularly, image data augmentation involves expanding the size of the train set by creating modified versions of the original images. The objective of this technique is not only to increase the amount of data available, but also their variability, thus improving the robustness of the learning models.
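The three-step partition procedure for train and test sets can be sketched as follows. This is an illustrative reading of the text, not the authors' code, and it rests on several assumptions: reviews are given as hypothetical `(review_id, user, restaurant, label)` tuples, all images of a review follow the review, the held-out review of each group is simply the first one in input order, and a group gives up a review only when it contains at least two (one interpretation that keeps every test user present in the train set).

```python
from collections import defaultdict


def split_reviews(reviews):
    """Split reviews into train and test lists of review ids.

    `reviews` holds (review_id, user, restaurant, label) tuples with
    label 0 (dislike) or 1 (like).
    """
    by_user = defaultdict(list)
    for review in reviews:
        by_user[review[1]].append(review)

    train, test = [], []
    for user_reviews in by_user.values():
        # Step 1: several reviews of the same (user, restaurant) pair go to train.
        by_restaurant = defaultdict(list)
        for review in user_reviews:
            by_restaurant[review[2]].append(review)
        remaining = []
        for pair_reviews in by_restaurant.values():
            if len(pair_reviews) > 1:
                train.extend(pair_reviews)
            else:
                remaining.extend(pair_reviews)

        # Step 2: per rating group, hold out one review and train on the rest.
        for label in (0, 1):
            group = [r for r in remaining if r[3] == label]
            if len(group) > 1:  # assumption: never empty a group into test
                test.append(group[0])  # assumption: first review is held out
                train.extend(group[1:])
            else:
                train.extend(group)

    # Step 3: test reviews of restaurants unseen in train move back to train.
    train_restaurants = {r[2] for r in train}
    kept_test = []
    for review in test:
        if review[2] in train_restaurants:
            kept_test.append(review)
        else:
            train.append(review)

    return [r[0] for r in train], [r[0] for r in kept_test]


# Toy data: reviews 1 and 4 share the pair (u1, r1); restaurant r9 appears once.
toy_reviews = [
    (1, "u1", "r1", 1), (2, "u1", "r2", 1), (3, "u1", "r3", 1),
    (4, "u1", "r1", 0), (5, "u2", "r2", 1),
    (6, "u3", "r9", 0), (7, "u3", "r3", 0),
]
train_ids, test_ids = split_reviews(toy_reviews)
```

In the toy data, review 6 is first held out but then moved back to the train set because restaurant r9 would otherwise be unseen during training. Applying the same function again to the resulting train reviews would yield the train and validation sets.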
Data augmentation can be applied to all the samples in the train set but, following an over-sampling perspective [21], we only over-sampled the minority class by applying four different transformations: rotation (-30), flipping (x axis), rescaling (1.25) and translation ([5, 5]). As a result, the imbalance problem is alleviated and the ratios between the positive and the negative classes are very close to 1:1 (1.2:1 for Santiago de Compostela, 1.02:1 for Barcelona, and 1.39:1 for New York). Notice that all the experiments that entail training the proposed model use the augmented train set.
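The effect of this over-sampling on the class ratio can be checked with simple arithmetic. The sketch below assumes that each of the four transformations is applied exactly once to every minority-class image, so the negative class grows fivefold (originals plus four transformed copies); the function name is hypothetical.

```python
def ratio_after_oversampling(positives: float, negatives: float,
                             n_transforms: int = 4) -> float:
    """Positive-to-negative ratio once each negative image gains n_transforms copies."""
    augmented_negatives = negatives * (1 + n_transforms)
    return positives / augmented_negatives


# Approximate ratios quoted for the full datasets (positives per negative).
initial_ratios = {"Santiago": 6, "Barcelona": 5, "New York": 7}
balanced_ratios = {
    city: ratio_after_oversampling(ratio, 1) for city, ratio in initial_ratios.items()
}
```

Starting from the approximate 6:1, 5:1 and 7:1 ratios, this gives 1.2:1, 1:1 and 1.4:1, in line with the reported 1.2:1, 1.02:1 and 1.39:1. The exact figures depend on the train-set counts of Table III, which are not reproduced here.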

B. Experiments
This section presents the details of the experimentation carried out to evaluate the proposed model, showing how an RS that models users' preferences by means of their images works in terms of prediction. It also describes the two baseline methods considered to provide a comparative study in order to contrast the effectiveness of using a convolutional encoder as feature extractor. Both of them are based on CNNs, the most commonly used approach in the related work described in Section II.
CNNs are considered a benchmark in the supervised classification of images since Krizhevsky et al. won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with the architecture AlexNet [22]; and they have been successfully applied to different computer vision tasks, such as object detection [23], [24] or image segmentation [25], [26]. In terms of image classification, a CNN is composed of, among others, convolutional layers that extract features followed by fully connected layers that perform the final classification. Notice that, once the CNN is trained, the convolutional base of the network can be used as a feature extractor.
The baseline methods are used to measure the impact of using a CAE in the image encoding step (see Figure 1). Both methods use the convolutional base of the ResNet50 [27], with weights pre-trained on ImageNet. In the first scheme, the pre-trained ResNet50 is used as feature extractor to generate the codification of the images. That is, the image encoding of our model is performed by the convolutional base of the ResNet50, instead of using the CAE. In the second one, ResNet50 FT, the convolutional base of the ResNet50 is integrated in our model to get the image encoding. In this manner, its weights are fine-tuned for the problem at hand.
In order to evaluate the performance of the proposed model, either using the feature vector generated by the CAE or the CNN, a complete set of performance measures was used:
• Sensitivity, also known as recall: probability that the system correctly classifies a sample of the positive class.
• Specificity: probability that the system correctly classifies a sample of the negative class.
• Precision: ratio of positive samples correctly classified among the total of samples classified as positive.
• F1-score: the harmonic mean of precision and recall.
Precision and recall are two of the most popular metrics in RS. Precision does not work well for unbalanced datasets, as is our case, and the F1-score is commonly used in these scenarios. The main problem with the F1-score is that it gives more importance to the positive class, and thus problems with the negative class may go unnoticed. In a recommendation system, it is important not only to recommend those items that the user likes, but also not to recommend those items that the user does not like. Therefore, we look for a system with high values for both sensitivity and specificity. For this purpose, we propose to use the balanced score (B-score), a variation of the F1-score that calculates the harmonic mean of sensitivity and specificity:
$$\text{B-score} = \frac{2 \cdot \text{sensitivity} \cdot \text{specificity}}{\text{sensitivity} + \text{specificity}} \tag{4}$$
If we rely on metrics such as precision or F1-score to select the best model, we could choose the one that best detects the items that the user likes, even if the performance is very low when classifying the items that he/she does not like (e.g., 0.99 sensitivity and 0.40 specificity: F1-score = 0.94, B-score = 0.56), rather than one with more balanced and desirable predictions (e.g., 0.85 sensitivity and 0.70 specificity: F1-score = 0.89, B-score = 0.76). Using the B-score, the model chosen will be the one with the best balance between sensitivity and specificity.
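The metrics above can be computed directly from a confusion matrix. The following sketch reproduces the two scenarios mentioned in the text with hypothetical counts: 600 positive and 100 negative samples, a 6:1 ratio chosen here as an assumption because the F1-score depends on the class balance.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Compute the evaluation metrics from confusion-matrix counts.

    No guard against empty classes is included; every denominator is
    assumed to be non-zero.
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    b_score = 2 * sensitivity * specificity / (sensitivity + specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "b_score": b_score,
    }


# Hypothetical counts: 600 positives and 100 negatives in both scenarios.
skewed = classification_metrics(tp=594, fn=6, tn=40, fp=60)      # 0.99 / 0.40
balanced = classification_metrics(tp=510, fn=90, tn=70, fp=30)   # 0.85 / 0.70

for name, metrics in (("skewed", skewed), ("balanced", balanced)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

With these counts the skewed model obtains the higher F1-score (about 0.947 against 0.895), whereas the B-score prefers the balanced one (about 0.768 against 0.570). The values agree with the figures quoted in the text up to truncation to two decimals.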
Although we focus on the B-score value throughout the experiments, and sensitivity and specificity are shown for a more complete quick view of the results, we also keep precision and F1-score since they are standard metrics. However, it is important to emphasize that these metrics do not correctly represent the behavior we are looking for.
All the experiments were performed on a computer equipped with a GeForce Titan Xp 12GB GPU from NVIDIA, an Intel Core i7-4790 CPU @ 3.60GHz x 8, and 16 GiB memory. The implementation of the model and baseline methods is in Keras (https://keras.io/), with TensorFlow as backend, and the code will be publicly available after paper acceptance. The models' weights were initialized using the HeUniform initialization [28], except for the weights of the ResNet50, as previously mentioned; and the Adam algorithm [29] was used as optimizer. The training process was carried out by setting a batch size of 32, and the outputs were monitored by using the B-score (see Eq. 4), with a patience of 12 epochs and a maximum of 100 epochs. Note that the original images were resized to 224 × 224 × 3 to meet the input size requirements of the ResNet50. Although CAEs do not impose any input size restriction, we considered the same input size when applying the architecture of Table I. We performed a grid search to determine the best hyperparameters of our system, using the validation scheme described in Section IV-A, monitoring the learning process with a patience of 6, and focusing on the results obtained in the train and validation sets. After trying different options for the learning rate (0.001, 0.0001) and the dimension of user/restaurant embeddings (128, 256, 512), the best ones were 0.001 and 512, respectively. Once these hyperparameters were established, the models were trained with the train set and then evaluated with the test set. Moreover, an ablation study was carried out.
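The search space and the stopping rule described above can be illustrated in a few lines. This is a sketch under stated assumptions, not the training code: the function name is hypothetical, the stopping rule halts once the monitored score has failed to improve for `patience` consecutive epochs, and the per-epoch scores are made-up values used only to exercise the logic.

```python
from itertools import product

# Hyperparameter grid reported in the text: 2 learning rates x 3 embedding sizes.
LEARNING_RATES = (0.001, 0.0001)
EMBEDDING_SIZES = (128, 256, 512)
grid = list(product(LEARNING_RATES, EMBEDDING_SIZES))


def train_with_patience(scores, patience=12, max_epochs=100):
    """Return (best_epoch, best_score, last_epoch) for a score to be maximized.

    `scores` is the sequence of per-epoch validation scores (e.g. B-score).
    Training stops after `patience` consecutive epochs without improvement
    or when `max_epochs` is reached, whichever comes first.
    """
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, last_epoch


# Made-up scores: improvement during three epochs, then a plateau.
fake_scores = [0.50, 0.60, 0.70] + [0.65] * 30
final_run = train_with_patience(fake_scores, patience=12)
search_run = train_with_patience(fake_scores, patience=6)
```

With the made-up scores, both runs keep epoch 2 as the best one; the run with patience 12 stops at epoch 14 and the one with patience 6 at epoch 8, which shows why the shorter patience makes the six-configuration grid search cheaper.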

C. Results
In order to train and evaluate the proposed model, the autoencoder must be trained first. For its training, the original train set (i.e., the train set without data augmentation) is used, although later on, the augmented train set is used for model training. As detailed in Section IV-B, input images were resized to 224 × 224 × 3. Taking into account the input dimensions and the architecture proposed in Table I, the encoder part generates three feature maps of size 28×28, thus resulting in a 2352-dimensional feature vector used as image encoding.
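The size of the image encoding follows from the input size and the downsampling of the encoder. The sketch below assumes three 2×2 max-pooling stages and size-preserving convolutions, which is consistent with going from 224 to 28 pixels per side, although Table I is not reproduced here and the actual layer configuration may differ; the function names are hypothetical.

```python
def encoder_code_shape(input_side: int = 224, n_poolings: int = 3,
                       n_code_maps: int = 3) -> tuple[int, int, int]:
    """Shape (height, width, maps) of the bottleneck code.

    Assumes each pooling halves the spatial size and that convolutions
    preserve it ("same" padding).
    """
    side = input_side
    for _ in range(n_poolings):
        side //= 2  # a 2x2 max pooling halves each spatial dimension
    return side, side, n_code_maps


def flattened_size(shape: tuple[int, int, int]) -> int:
    """Length of the feature vector produced by the flatten layer."""
    height, width, maps = shape
    return height * width * maps


code_shape = encoder_code_shape()
code_length = flattened_size(code_shape)
print(code_shape, code_length)
```

Under these assumptions the code has shape 28 × 28 × 3, and flattening it gives the 2352 features mentioned in the text.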
Regarding the learning process, the autoencoder was trained with a batch size of 32, a patience of 6 and a maximum of 100 epochs, using the loss function to monitor the results. Figure 5 illustrates some examples of input images and their respective outputs generated by the trained autoencoder.
In order to demonstrate the adequacy of our model based on autoencoders, we provide a comparison with two baseline methods based on CNNs (see Section IV-B). Table IV shows the comparative results with the three approaches considered: the model proposed in this research, which uses an autoencoder as feature extractor; and two alternative methods that replace the autoencoder with CNNs (pre-trained ResNet50 as feature extractor, and integrated and fine-tuned ResNet50). The three approaches were tested on the three different datasets (Santiago de Compostela, Barcelona, and New York). As can be observed, our proposal obtains the best trade-off between sensitivity and specificity for the three datasets, i.e., it provides the best performance in terms of B-score. In contrast, the least competitive approach is the one that uses the ResNet50 pre-trained on ImageNet without fine-tuning (ResNet50), the only one that was trained out of context. In particular, the ResNet50 approach has trouble correctly classifying the minority class, providing very low specificity values. When applying fine-tuning (ResNet50 FT), the results improved, since this approach adapts the ResNet50 weights to the problem. In this sense, the convolutional layers are able to adapt to the context using the information associated with the images and become more specialized, being able to discard information that is irrelevant to the problem at hand, which would inevitably be used when applying transfer learning.
Analyzing the differences between the three datasets, it can be seen that the autoencoder works better than the other two approaches, in terms of specificity, when more data is available. Taking into account that the strongest point of our approach is the use of images as a source for modeling users' preferences, the convolutional autoencoder is crucial in this process. Compared to ResNet50, it is able to improve the classification of the minority class without incurring significant losses when detecting the majority class. Furthermore, there is an important reduction in the runtime required to train the autoencoder compared to adjusting the weights of the filters during the training of the ResNet50 FT, which provides the second best performance, making our proposal the most affordable one in terms of training time.

D. Ablation study
An ablation study was carried out to assess the influence of reducing the number of reduce blocks in the proposed model (see Figure 1). Table V shows the performance of the proposed model based on its depth, checking the effects of eliminating a reduce block from its structure. Using two reduce blocks exerts a great influence on the detection of cases belonging to the minority class, but at the cost of losing sensitivity. In the case of Santiago and Barcelona, increasing the depth of the model decreases the sensitivity (by 0.1435 and 0.0763, respectively), but greatly improves the specificity (by 0.2715 and 0.0988, respectively). It is worth mentioning that when dealing with the largest dataset (New York), both sensitivity and specificity were improved when using two reduce blocks. In general, looking at the B-score metric, which balances sensitivity and specificity, the architecture with two reduce blocks is the best option for the three datasets.

V. CONCLUSION
Despite the great success that RS have shown in recent years, there is no approach, in the literature or commercially, that explores in depth the use of images in the context of personalized restaurant recommendations. In this work, we explored the potential of modeling both users and restaurants using images as the single source. Specifically, we presented an image-based RS that makes use of TripAdvisor data to predict users' tastes, with the particularity that it employs a convolutional autoencoder as feature extractor. Due to the nature of the problem, with typically more positive than negative reviews, the datasets used are highly unbalanced. For this reason, we proposed to use data augmentation on the minority class. We also found that metrics such as the F1-score are not appropriate for our problem, since they do not take into account both the sensitivity and the specificity. For this reason, we have used the B-score, which balances both. This metric is especially recommended in this case, since we have a binary classification system with the positive class as the majority one, and the misclassification of the negative class must be penalized rather than hidden behind a high success rate in the positive class.
In the three cities considered, we achieve a B-score ≈ 70% in the worst case, thus showing the potential of using images in the context of personalized recommendations. Our approach, which uses an autoencoder as feature extractor, was compared with the standard approach of using CNNs for image encoding. In particular, we considered the popular ResNet50, pre-trained on ImageNet with and without parameter fine-tuning. The experimental results demonstrated the effectiveness of using the autoencoder as feature extractor, since it improved the detection of the minority class, thus obtaining the highest B-score values for the three datasets. As expected, the worst option was to use the ResNet50 without fine-tuning because, although it has been trained with a much larger database (ImageNet), the purpose of the training was to solve an image classification task that is not related to the problem at hand; thus, it did not take advantage of the context that involves users and restaurants. Therefore, we can conclude that the context is very important when learning to detect this type of features, having to choose quality over quantity, if necessary.
Regarding the fine-tuned CNN, it improves the results obtained with the pre-trained CNN since, in the first case, the parameters are fine-tuned for the problem at hand. However, if we compare the fine-tuned CNN with the autoencoder, it can be seen that the former cannot achieve the competitive results of the latter. This fact seems to indicate that the feature extraction does not depend on the user, although the subsequent recommendation does, i.e., the features are intrinsic to the restaurant, not to the user. That is, taking into account the subjectivity that characterizes tastes, there are images of the same restaurant with the same visual characteristics that present contrary ratings for different users. With so little data for each user and a highly marked class imbalance, the interclass separation may not be sufficiently defined with this input data. This type of situation is common in the field of online recommendations, where users do not usually offer a rich and sufficiently representative view of their tastes. For this reason, our model is better at predicting user preferences when the architecture is deeper, because it needs to make a great computational effort to learn how to detect so many different cases with similar characteristics and very few examples.
As future work, we plan to test the adequacy of performing the feature extraction step depending on restaurants, rather than on users, taking as a reference the images of the restaurants and their overall rating. The idea is to learn to detect relevant characteristics using the general taste, and then apply this knowledge in a personalized way to each particular user. Our future research also involves the integration of the proposed model into a more complete RS that takes into account additional information, such as the text included in the reviews or sociodemographic data. Finally, we are also considering applying our method to other RS that deal with images.