Deep Learning Model with Transfer Learning to Infer Personal Preferences in Images

Featured Application: Image-based personal recommendation system superior to human experts on customized interiors.
Abstract: In this paper, we propose a deep convolutional neural network model with transfer learning that reflects personal preferences from inter-domain databases of images having atypical visual characteristics. The proposed model utilized three public image databases (Fashion-MNIST, Labeled Faces in the Wild [LFW], and Indoor Scene Recognition) that include images with atypical visual characteristics in order to train and infer personal visual preferences. The effectiveness of transfer learning for incremental preference learning was verified by experiments using inter-domain visual datasets with different visual characteristics. Moreover, a gradient-weighted class activation mapping (Grad-CAM) approach was applied to the proposed model, providing possible explanations of personal visual preferences. Experiments showed that the proposed preference-learning model using transfer learning outperformed a preference model not using transfer learning. In terms of the accuracy of preference recognition, the proposed model showed a maximum improvement of about 7.6% for the LFW database and of about 9.4% for the Indoor Scene Recognition database, compared to the model that did not use transfer learning.


Introduction
Humans generally shape their individual preferences for complex and diverse visual information through various visual experiences in childhood. In this process, by learning from visual information in very different fields, a person forms preferences for visual features that are common across domains. Therefore, in order to implement a preference model for individual visual information, it is necessary to form personal preferences through learning using image data with different visual characteristics. The user's preference classification problem is actively researched in the field of recommendation systems [1]. This research helps the e-commerce market to grow continuously, and its influence is gradually expanding. From the standpoint of providing specific products or services, it is essential to predict and infer a customer's preferences. Thus, most recommendation systems serve results obtained by predicting the user's preferences [2][3][4][5]. Visual preferences are more difficult to predict than text-based preferences, since visual-preference information is very sparse within each domain [1] and is difficult to express as formal information [5]. Nevertheless, most existing studies mainly used formal information and object features such as color, shape, and style to infer visual preference [6][7][8][9][10][11][12]. The problems with these studies are that it is difficult to transfer visual-preference features between domains [5,13], and it is difficult to express visual preference information when a preference has an atypical feature [5]. Furthermore, most human visual preferences have atypical features [6].
In this work, we propose a learning and classification method for these atypical visual preferences. In order to design a model for inferring visual preferences, the deep convolutional neural network (DCNN) [14] was applied, which has shown significant achievements in object-recognition/classification applications. In order to analyze the inference possibility and characteristics of inter-domain visual preferences, transfer learning was applied to the proposed model [6,15]. In addition, three datasets for verification of the proposed preference inference model were used: Fashion-MNIST [16], Labeled Faces in the Wild (LFW) [17], and Indoor Scene Recognition [18]. Images from the datasets were reclassified according to personal preferences. Additionally, in order to infer personal preferences from datasets with different morphological characteristics [6], we applied transfer learning to the proposed preference inference model, and analyzed the data characteristics according to cross-domain preferences [6]. Thus, we verified the possibility of classifying common atypical visual features of inter-domain data with completely different morphological and categorical characteristics.
In general, humans have specific personal preferences for visual information, but they often cannot explain accurately why they prefer a given piece of visual information. In this work, we applied the gradient-weighted class activation mapping (Grad-CAM) method [19] to partially explain why the proposed model forms personal preferences for visual information. Grad-CAM can be utilized to identify salient areas of an image that are important for determining user preferences. A partial explanation of the user's preferences might then be obtained from the visual characteristics of the salient area that leads to the preference decision. Moreover, the Grad-CAM approach was applied to graphically express the atypical visual features preferred by a specific user [19]. Through this method, the localization and patterns of atypical visual features were analyzed according to the preference classification result of the proposed model.
The rest of this paper is organized as follows. In Section 2, the proposed model is described in detail. For verification, Section 3 shows the experimental results from the proposed model's performance. Section 4 offers some conclusions and describes further work.

Deep Convolutional Neural Networks
A DCNN model was utilized to implement the proposed personal visual-preference model. The model is composed of three types of layers: convolution layers, pooling layers, and fully connected layers [6,20]. We mainly used [3 × 3] and [1 × 1] kernels, as in VGGNet, to construct a deeper neural network for improving classification performance [6,20]. Pooling layers reduce the computational cost by reducing the dimensions of the feature map and the number of parameters in the network [6,20]. Fully connected layers classify the feature maps generated in the final feature-extraction layer; they were composed of two hidden layers for sufficient classification performance [6,20]. In addition, we utilized a softmax function [6,21] as the activation function of the output layer, and adaptive moment estimation (Adam) as the optimizer of the objective function to improve the classification performance [6,22]. Figure 1 shows the overall architecture of the proposed DCNN-based model for preference classification.
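To make the roles of the layer types concrete, the following minimal sketch traces how a 28 × 28 grayscale input shrinks through a VGG-style stack of [3 × 3] and [1 × 1] convolutions with 2 × 2 pooling, counts the convolution parameters, and evaluates a softmax over two output scores. The layer order and channel widths in `LAYERS` are hypothetical choices for illustration; they are not the configuration shown in Figure 1.

```python
import math

def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stack (kind, kernel, out_channels); NOT the paper's exact design.
LAYERS = [
    ("conv", 3, 32), ("conv", 3, 32), ("pool", 2, None),
    ("conv", 3, 64), ("conv", 1, 64), ("pool", 2, None),
]

def trace(size, channels, layers=LAYERS):
    """Return final spatial size, channel count, and total conv parameters."""
    total = 0
    for kind, kernel, c_out in layers:
        if kind == "conv":
            # "same" padding keeps the spatial size for odd kernels
            size = conv_out(size, kernel, padding=kernel // 2)
            total += conv_params(channels, c_out, kernel)
            channels = c_out
        else:
            # non-overlapping pooling halves each spatial dimension
            size = conv_out(size, kernel, stride=kernel)
    return size, channels, total

size, channels, n_params = trace(28, 1)   # Fashion-MNIST-sized input
probs = softmax([2.0, 0.5])               # two classes: preferred, non-preferred
print(size, channels, n_params, probs)
```

In this sketch each pooling stage halves the feature map (28 to 14 to 7), which is the dimensionality reduction described above, while the [1 × 1] convolution adds depth at a small parameter cost compared with a [3 × 3] layer of the same width.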

Transfer Learning
In order to mimic a human-like incremental learning mechanism, we considered transfer learning for incrementally training and inferring visual preferences in inter-domain images with atypical visual characteristics. In general, when a deep learning model does not have sufficient data to reach good classification performance, transfer learning can improve performance by reusing part of a network pre-trained on a dataset from a similar domain [6,15]. Thus, we reused the feature extraction layers of the pre-trained network, connected them to a newly defined classification layer, and fine-tuned all the parameters of the network [6], as shown in Figure 2.
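The reuse step described above can be sketched with a toy model held in plain Python dictionaries. This is only an illustration under stated assumptions: `init_model`, `transfer`, the layer sizes, and the `trainable` flags are hypothetical names and values, and no real training is performed.

```python
import copy
import random

def init_model(n_in, n_hidden, n_classes, seed):
    """Toy two-part model: a feature extractor and a classification head."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "features": {"w": matrix(n_in, n_hidden)},
        "classifier": {"w": matrix(n_hidden, n_classes)},
    }

def transfer(pretrained, n_classes, seed):
    """Copy the pre-trained feature extractor and attach a fresh classifier."""
    n_in = len(pretrained["features"]["w"])
    n_hidden = len(pretrained["features"]["w"][0])
    model = init_model(n_in, n_hidden, n_classes, seed)
    # Reuse the learned feature-extraction weights (deep copy, so later
    # fine-tuning does not modify the source model).
    model["features"] = copy.deepcopy(pretrained["features"])
    # All parameters remain trainable, matching the fine-tuning of the
    # whole network described in the text.
    model["trainable"] = {"features": True, "classifier": True}
    return model

# Source model with 10 outputs; target model with 2 outputs
# (preferred / non-preferred). Sizes are illustrative only.
source = init_model(n_in=8, n_hidden=4, n_classes=10, seed=0)
target = transfer(source, n_classes=2, seed=1)
```

Keeping every parameter trainable distinguishes fine-tuning from the alternative of freezing the copied layers and training only the new classifier.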

Grad-CAM
Grad-CAM was utilized as a method to analyze and interpret the cause of classification results in a deep learning-based model [19]. In the class activation mapping (CAM) model, the attention map can be generated only when global average pooling is applied (instead of the fully connected layer) to the feature map passing through the convolution layers [23]. Grad-CAM, however, can also be applied to a network with fully connected layers, not only to one with a global average pooling layer [19]. We therefore applied Grad-CAM to analyze and interpret the personal preference results, so that the preference inference model has plausible explanatory possibilities. Figure 3 shows the structure of Grad-CAM for generating a heat map reflecting the user's preference based on atypical visual features.
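As a minimal sketch of the Grad-CAM computation [19], the function below takes the feature maps of the last convolution layer and the gradients of the class score with respect to those maps, averages each channel's gradients to obtain a channel weight, forms the weighted sum of the feature maps, and applies ReLU. The inputs are assumed to be already available as nested lists; obtaining them from a trained network, and the final scaling to [0, 1], are outside what the text specifies.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from K feature maps and their gradients.

    activations, gradients: lists of K channels, each an H x W nested list.
    Returns an H x W heat map scaled to [0, 1] (all zeros if nothing is positive).
    """
    h = len(activations[0])
    w = len(activations[0][0])
    # Channel weights: global average of the class-score gradients per channel.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum of the feature maps over channels.
    heat = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(h):
            for j in range(w):
                heat[i][j] += alpha * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    heat = [[max(0.0, v) for v in row] for row in heat]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat

# Two 2 x 2 channels: the first supports the class, the second opposes it.
toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
toy_heat = grad_cam(toy_activations, toy_gradients)
print(toy_heat)  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy example the location activated by the opposing channel is suppressed by the ReLU, so only the top-left cell remains highlighted.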

Three Image Databases
In this paper, three well-known open image databases with atypical visual characteristics were considered for training and inferring personal visual preferences: Fashion-MNIST, LFW, and Indoor Scene Recognition [16][17][18]. Fashion-MNIST is an image dataset of fashion products, made up of 60,000 training images and 10,000 test images in 10 categories of 28 × 28 grayscale images [6,16]. Figure 4 shows the 10 class labels of Fashion-MNIST and some sample images. LFW is a public benchmark dataset for face recognition and classification [6,17]. LFW is made up of 13,233 facial images of 5749 people, and consists of 150 × 150 color (RGB) images [6,17]. Figure 5 shows some of the facial images sampled from LFW.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 12
The Indoor Scene Recognition dataset is a group of images of various indoor scenes. The dataset contains 67 indoor categories and a total of 15,620 images. On average, there are more than 100 images per class. The images vary in resolution and are also RGB [18]. Figure 6 shows several categories and examples of the data.

Experiment Results
For verification of the inference accuracy of personal visual preferences from the proposed model, we utilized the three public benchmark datasets Fashion-MNIST, LFW, and Indoor Scene Recognition [16][17][18]. Before training the proposed model, all images from these datasets were classified and labeled as preferred or non-preferred, based on the preferences of a specific user. Table 1 shows the ratios of the two classes (preferred and non-preferred) for the Fashion-MNIST, LFW, and Indoor Scene Recognition databases, according to the user's preferences. The subject was asked to decide a preference for each image within 1 s in order to exclude, as far as possible, interference from other factors, including the subject's intrinsic episodic memory and other non-visual factors. In addition, in order to minimize the effect of the subject's fatigue, the entire image database was divided into several subgroups. The subject was also asked to decide each preference instantly and intuitively, based on visual characteristics as much as possible. We considered two different models. Model 1 was trained and tested using only a training dataset and a test dataset. Model 2 utilized a validation dataset as well as training and test datasets.
In the experiments for measuring classification performance of the proposed preference inference model using the Fashion-MNIST database, Model 1 used 60,000 training images and 10,000 test images. However, for Model 2, training was conducted with 50,000 training images, 10,000 validation images, and 10,000 test images, as shown in Figure 7. Table 2 shows the preference classification results for Model 1 and Model 2 with the Fashion-MNIST dataset.
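The two data configurations can be sketched as index splits. The text does not say how the 10,000 validation images were chosen from the original 60,000 training images, so the shuffled split below (and the names `split_indices` and `seed`) is an assumption made only for illustration.

```python
import random

def split_indices(n_total, n_val, seed=0):
    """Shuffle indices 0..n_total-1 and hold out n_val of them for validation."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[n_val:], idx[:n_val]

N_TRAIN, N_TEST = 60_000, 10_000   # Fashion-MNIST sizes stated in the text

# Model 1: all 60,000 training images, no validation set.
model1_train = list(range(N_TRAIN))

# Model 2: 50,000 training images and 10,000 validation images
# (assumed here to be a random hold-out from the training set).
model2_train, model2_val = split_indices(N_TRAIN, 10_000)

print(len(model1_train), len(model2_train), len(model2_val), N_TEST)
```

Both models are evaluated on the same 10,000 test images, so the comparison in Table 2 differs only in whether a validation set is held out during training.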
From the experimental results, we concluded that both Model 1 and Model 2 performed well in classifying preferences with the Fashion-MNIST dataset. The classification performance of Model 2 was 95.07%, which was superior to Model 1 for the test dataset. To evaluate the performance of the proposed personal preference model, we applied the proposed model (pre-trained with the Fashion-MNIST database) to preference classification with the LFW database and the Indoor Scene Recognition database. In this experiment, Model 1 and Model 2 were applied to classify the LFW data and Indoor Scene Recognition data into only two classes (preferred and non-preferred) without any additional training process for fine-tuning. The preference classification experiment using the LFW dataset was conducted by randomly selecting 9000 images from the 13,233 images in LFW. In addition, in the experiments using the Indoor Scene Recognition dataset [18], 15,000 images were randomly selected from the 15,620 images. Table 3 shows the experimental results of preference classification with the LFW dataset and the Indoor Scene Recognition dataset by Model 1 and Model 2.
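The random selection of evaluation images and the accuracy measure used for the two-class results can be sketched as follows. The helper names, the fixed seed, and the toy labels are illustrative assumptions; the sampling procedure actually used for the experiments is not described beyond "randomly selecting".

```python
import random

def sample_subset(n_total, n_keep, seed=0):
    """Draw n_keep distinct image indices out of n_total without replacement."""
    return sorted(random.Random(seed).sample(range(n_total), n_keep))

def accuracy(y_true, y_pred):
    """Fraction of images whose predicted class matches the user's label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

lfw_idx = sample_subset(13_233, 9_000)       # LFW subset size from the text
indoor_idx = sample_subset(15_620, 15_000)   # Indoor Scene subset size from the text

# Toy labels: 1 = preferred, 0 = non-preferred; three of four predictions match.
toy_acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(len(lfw_idx), len(indoor_idx), toy_acc)
```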
According to the experimental results in Table 3, the proposed model showed plausible preference classification performance for two databases with different characteristics from the prior-learning database.
Additionally, in order to verify the effectiveness of transfer learning between inter-domain datasets with different visual characteristics, transfer learning was applied to the learning process of the proposed preference inference model. Table 4 shows the experimental results of preference classification with the LFW and Indoor Scene Recognition datasets by the proposed model considering transfer learning. As shown in Table 4, the proposed preference classification model reflecting transfer learning showed better performance than the model that did not reflect transfer learning, as shown in Table 3.
From these experimental results, we can conclude that the proposed model was able to properly classify preferences for datasets with different visual characteristics through transfer learning. This transfer learning process might be recognized as playing an important role in constructing the personality of each person, reflecting visual preferences by incrementally determining the individually preferred characteristics for visual information. Thus, we were able to conclude that, through numerous experiences with visual information that is tremendously diverse in characteristics, each human shapes their own preferences for visual data.
Finally, we applied a Grad-CAM model for the purpose of explaining the reasons for the proposed model's preference classifications [19]. Grad-CAM generated an attention heat map to provide explanations for the classification results of the specific user's preferences [19]. Figure 8 shows the visual attention map for the results of preferred and non-preferred classifications from the LFW dataset. Figure 8a shows some heat map results obtained by Grad-CAM for the correctly classified preferred images in the LFW dataset. Figure 8b shows some of the heat map results obtained by Grad-CAM for the correctly classified non-preferred images in the LFW dataset. According to the subject, non-facial visual features such as a suit or tie had some influence when deciding the preference for each face image of LFW. In addition, the subject judged that race or the presence of acquaintances also partly affected the decisions, even though the subject was asked to decide each preference instantly and intuitively, based on the visual appearance of the face as much as possible. However, some of these influences might be acceptable because we aimed to develop a model that learns and infers individual preferences for general visual features.
Figure 8. Visual expressions from applying Grad-CAM to user preference-classification results with LFW: (a) sample heat maps for the results in preferred data, and (b) sample heat maps for the results in non-preferred data.
Figure 9 shows a visual attention heat map for the results of preferred and non-preferred classifications from the Indoor Scene Recognition dataset. Figure 9a shows some heat map results obtained by Grad-CAM for the correctly classified preferred images from the Indoor Scene Recognition dataset. Figure 9b shows the heat map results obtained by Grad-CAM for the correctly classified non-preferred images from the Indoor Scene Recognition dataset. As shown in the experimental results, the heat map provided by Grad-CAM generally falls on a meaningful feature area of the image where human preferences can be found. Therefore, it might be possible to explain a user's preferences using heat maps generated by Grad-CAM. In order to confirm the explanatory feasibility of the subject's preference using the results of Grad-CAM, we analyzed how well the area that influenced the subject's preference decision matched the attention area of Grad-CAM. As shown in Table 5, analysis was performed on the test datasets of LFW and Indoor Scene Recognition. In the case of the LFW dataset, the Grad-CAM results properly reflected the subject's preference for 312 of the 465 images correctly classified as preferred among the 1000 test images. Likewise, the Grad-CAM results properly reflected the subject's disfavor for 238 of the 371 images correctly classified as non-preferred. In addition, the experimental results showed that the attention heat maps of the Grad-CAM results matched the subject's preference decision for 66.68% of the preferred images and 60.74% of the non-preferred images.
Accordingly, the Grad-CAM results might be utilized to partially explain personal preference, even though the current Grad-CAM results alone are insufficient to describe the subject's visual preference.
Figure 9. Visual expressions from applying Grad-CAM to user preference-classification results with Indoor Scene Recognition: (a) sample heat maps for the results in preferred data, and (b) sample heat maps for the results in non-preferred data.
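The LFW counts quoted in the text can be turned into agreement ratios with a few lines of arithmetic. The sketch below uses only the LFW numbers given above (312 of 465 preferred, 238 of 371 non-preferred, 1000 test images). The resulting LFW-only ratios are not the same as the 66.68% and 60.74% quoted in the text, which presumably summarize Table 5 over a different set of images; that reading is an assumption, since Table 5 is not reproduced here.

```python
def ratio(matched, total):
    """Share of correctly classified images whose heat map matched the subject."""
    return matched / total

# LFW counts stated in the text.
pref_matched, pref_correct = 312, 465
nonpref_matched, nonpref_correct = 238, 371
n_test = 1000

lfw_pref_agreement = ratio(pref_matched, pref_correct)
lfw_nonpref_agreement = ratio(nonpref_matched, nonpref_correct)
# Correctly classified images of both classes over the 1000 LFW test images.
lfw_test_accuracy = (pref_correct + nonpref_correct) / n_test

print(f"{lfw_pref_agreement:.2%}")     # 67.10%
print(f"{lfw_nonpref_agreement:.2%}")  # 64.15%
print(f"{lfw_test_accuracy:.2%}")      # 83.60%
```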

Conclusions and Further Work
In this paper, we proposed a deep CNN-based transfer learning model for inferring personal visual preferences in visual images with atypical characteristics. For performance verification of the proposed model, we considered three public benchmark datasets (Fashion-MNIST, LFW, and Indoor Scene Recognition) that have atypical visual characteristics compared with other inter-domain datasets. Experimental results showed that the proposed model properly inferred personal preferences for visual inter-domain images. Moreover, the transfer learning process used by the proposed model yielded better performance for preference prediction with atypical datasets. In addition, the proposed model applied a Grad-CAM approach in order to try to explain personal preferences. These explanations are only partial, since personal preferences are typically difficult to explain, even for humans.
As further work, we are considering many more experiments with multiple users in order to verify the generality of the proposed model. Moreover, we need to use different datasets with different atypical characteristics. Finally, we plan to enhance the proposed model so that it offers more plausible explanations of personal preference inferences. For a theoretical contribution on developing a personal visual preference model, we need to identify more biological mechanisms related to generating attention, or indirectly obtain insights from known biological mechanisms, since the attention mechanism of humans is very complex in nature. As long-term further work, we are considering combining the personal visual preference model with the Grad-CAM based explanatory model to develop an automatically explainable personality model. We also plan to apply the proposed model to visual personal recommendation systems. For example, the proposed personal preference inference model could be applied to image-based personal recommendation systems that might be superior to human experts on customized interiors.