Dataset for the Aesthetic Value Automatic Prediction

One of the most relevant issues in the prediction and classification of the aesthetic value of an image is the sample set used to train and validate the computational system. This document describes the limitations found in different datasets used to classify and predict aesthetic values, and proposes a new dataset with images from the DPChallenge.com portal, evaluated by three different populations.


Introduction
Different research groups have tried to create computer systems capable of learning the aesthetic perception of a group of human beings as part of a generative system, with the intention of using them in the selection or automatic ordering of images. Due to the subjective nature of the aesthetic problem, the selection of the dataset with which the system is trained is especially relevant. After analyzing the generalization degree of some datasets in previous research [1,2], it was concluded that they are not sufficient to serve as a reference in the training of automatic image classification and prediction systems. In order to provide a solution to the problems detected, this paper describes the creation of a new dataset from the DPChallenge.com portal, with greater statistical coherence. In addition, this new dataset has been evaluated according to aesthetic and quality criteria by one group of participants under controlled experimental conditions and by another group, from the USA, through online surveys.

Limitations Found in the Datasets Available
Some datasets have been used several times for image classification, among them Photo.net [3][4][5], DPChallenge.com [6,7] and the one created by Cela-Conde et al. [8][9][10]. However, when their generalization capacity is studied, it turns out that they cannot be considered representative for carrying out image experiments. In some cases, the correlation is greater when the validation set belongs to the same data source as the training set, and it drops markedly when the validation set comes from a different source than the training set. In addition, the sample sets built from evaluations on photographic portals have some defects: the evaluation system does not have the same control as a psychological test, because it is not possible to obtain all the information about the evaluators or about the evaluation conditions; the number of images could be insufficient, since there is no justified reason for the chosen sample size, and the number of people who rate each image varies very widely; and user ratings can be easily conditioned by personal tastes, by personal relationships with the creator of the work, or by the momentary popularity of certain styles. Lastly, in one of the cases [3] it has been shown that the users of these portals do not have sufficient grounds to differentiate between the aesthetics and originality criteria, with a Pearson correlation coefficient of 0.891 between the two. The dataset created by Ke et al. [6] has another limitation: the web portal DPChallenge.com works as a photo contest and does not specify any evaluation criteria, so each user rates the images according to their own judgment, which need not be related to that of other users. On the other hand, in the dataset created by Cela-Conde et al. [8] the number of images presented per category is not balanced, so the results obtained cannot be considered representative of the set.
In addition, it is made up of a considerable number of image subsets, so the dataset ends up being split into several independent datasets that are smaller and have less internal consistency.

A New Dataset
After detecting the limitations described above, a new dataset for the aesthetic prediction of images was built. The creation method yields a dataset with greater statistical coherence from the evaluation results collected on the DPChallenge.com photography website. The dataset is later evaluated by two different population types, which makes it possible to analyze the correlation between the results obtained with subjects in controlled circumstances and those obtained through online surveys. First, a set of images was compiled from the DPChallenge.com photo portal. This portal has been used previously to obtain data for aesthetic classification experiments [6,7]. Only images with a minimum of 100 ratings were selected, so that the average value assigned to each image is as unbiased as possible. Once this selection is made, the images are organized in groups according to the average evaluation received on DPChallenge.com. The selected images were classified into 9 scoring ranges, one for each whole evaluation value allowed. Each group is then required to have a minimum number of images, which in our case was 200. There are no sufficiently large groups of images with average evaluations lower than 3 or higher than 8, so the groups used were those in the range [≥3, <8]. From each of these groups, the 200 images with the smallest standard deviation were selected, that is, those whose votes show the greatest internal coherence. This process provides a set of images with the same number of elements in each range and with high voting coherence.
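The selection procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used to build the dataset: the function name `select_balanced_subset`, the input layout (a dictionary mapping an image identifier to its list of individual votes) and the use of the population standard deviation are all assumptions, since the text does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def select_balanced_subset(ratings_by_image, min_ratings=100, per_bin=200,
                           low=3, high=8):
    """Return {score_bin: [image_id, ...]} with `per_bin` images per bin.

    ratings_by_image: dict mapping an image id to a list of individual votes
    (hypothetical layout, assumed for illustration).
    """
    bins = defaultdict(list)
    for image_id, votes in ratings_by_image.items():
        # Keep only images with enough votes for a stable average.
        if len(votes) < min_ratings:
            continue
        avg = mean(votes)
        # Keep only averages in the range [low, high).
        if not (low <= avg < high):
            continue
        # One bin per whole score value: [3,4), [4,5), ...
        # Population standard deviation is an assumption here.
        bins[int(avg)].append((pstdev(votes), image_id))

    selected = {}
    for score_bin, candidates in bins.items():
        # Discard bins that cannot supply the required number of images.
        if len(candidates) < per_bin:
            continue
        # Smallest standard deviation first: most coherent votes.
        candidates.sort(key=lambda item: item[0])
        selected[score_bin] = [image_id for _, image_id in candidates[:per_bin]]
    return selected
```

With the default parameters, the function keeps images that have at least 100 votes and an average in [3, 8), and returns the 200 images with the lowest vote dispersion for every whole-score bin that contains at least 200 candidates.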

Evaluation
The dataset proposed above was evaluated by a group of Spanish participants under controlled experimental conditions. The evaluations were carried out by student volunteers from the Universidade da Coruña, Spain. Ninety-nine participants (33 men and 66 women) took part in this study, with an average age of 18.7 years and an age range of 18-30. Each participant evaluated at least 200 images in the presence of members of the research group and under the same viewing conditions. For each image, participants assessed its aesthetics and its quality independently. Later, another experiment was conducted through online surveys with a USA population, using the Amazon Mechanical Turk tool. A total of 525 people evaluated the images, 39% men and 61% women, with an average age of 32.6 years and an age range of 18-70. The same images were used as in the on-site experiment, and the evaluators had to score the aesthetics and quality criteria independently, in the same way.

Results
The correlation between the evaluations made in person and those recorded on the DPChallenge.com platform has been calculated. The correlation between the average score on DPChallenge.com and the average evaluation of the aesthetic value is 0.692 according to Pearson and 0.69 according to Spearman. The correlation between the average score on DPChallenge.com and the average evaluation of the quality value is 0.748 according to Pearson and 0.756 according to Spearman. Finally, the correlation between the two measurements obtained in the on-site experiment (aesthetics/quality) is 0.787 according to Pearson and 0.786 according to Spearman. When analyzing the correlation between the on-site evaluations and the USA online survey, a correlation of 0.76 was found between the aesthetic criteria of both experiments, and of 0.85 between the quality criteria. The correlation between the aesthetic and the quality criteria in the USA evaluations is 0.89, the same correlation that exists between criteria in the experiment carried out by Datta et al. [3]. With this new dataset, different models based on Machine Learning have been trained using different metrics for the automatic prediction of the aesthetic and quality values. The highest correlation obtained with these models is 0.58, using SVM [11].
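For reference, the two coefficients reported above can be computed from per-image average scores as in the following sketch. It is a minimal, dependency-free illustration: the function names are hypothetical, and Spearman is computed here as the Pearson correlation of average ranks (the usual tie-aware definition), which is an assumption because the text does not state how ties were handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `x` would hold, for example, the average DPChallenge.com score of each image and `y` the average aesthetic (or quality) score given to the same images in one of the experiments, in the same image order.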

Conclusions
The correlation results suggest that the DPChallenge.com evaluation is closer to a quality criterion than to an aesthetic one and that, likewise, all evaluators agree more closely when evaluating the quality criterion than the aesthetic one. In addition, it can be deduced that evaluators differentiate better between the criteria to be evaluated when the difference can be explained to them in person. It should be noted that the complex systems used predict quality better than aesthetics, perhaps because quality has a lower subjective component and a closer relationship with the intrinsic characteristics of the images.