Perspective Morphometric Criteria for Facial Beauty and Proportion Assessment

Abstract: Common sense usually considers the assessment of female human attractiveness to be subjective. Nevertheless, in the past decades, several studies and experiments have shown that an objective component in beauty assessment exists and that it is closely related to, although it does not coincide with, the proportions of facial features. Proportions can be studied through face analysis, which relies on landmarks, i.e., specific points on the facial surface that are shared by everyone, and on measurements between them. In this work, measures have been gathered from studies in the literature that consider datasets of beautiful women, in order to build a set of measures that can be regarded as indicative of female attractiveness. The resulting set consists of 29 measures and has been applied to a public dataset, the Bosphorus database, whose faces have been both analyzed by the developed methodology, based on the expanded set of measures, and judged by human observers. Results show that the chosen set of measures is significant for attractiveness evaluation, confirming the key role of proportions in beauty assessment; furthermore, the measures have been ranked to identify the most significant canons involved in the evaluation.


Introduction
The beauty and proportion of the human face have always been an object of interest, as evidenced by Greek sculptures, ancient Egyptian paintings, and even works from prehistory [1]. Some studies were also carried out during the Renaissance by well-known artists, such as Leonardo da Vinci [2] and Michelangelo Buonarroti [3], but it is only from the 20th century that systematic studies focused on objective assessment began.
Several studies of facial aesthetics have focused on specific aspects, such as the role of the divine proportion, a legacy of the ancient Greeks [4-6], the average face intended as a means of removing imperfections [7,8], the average combined with exaggerated sex-specific traits to discriminate between beauty and attractiveness [9], and the relative importance of some traits of the face [10]. The presence of a strong objective component in beauty assessment has been clearly shown by several experiments, including Iliffe's [11], Udry's [12] and Cunningham's [13].
Cultural influence on beauty assessment is very strong, as shown by the different canons adopted through the ages to assess female beauty. During the Paleolithic period, "Venus" statuettes representing women were sculpted as full-figured to symbolize fecundation, fertility and regeneration [14]. Ancient Egyptians considered a large forehead and well-defined mandibles attractive [15]. In contrast, the Greeks preferred an oval facial shape for both men and women, and a forehead as small as possible to highlight the hair [16]. In the middle [...]
Some databases of human faces have been built to be analyzed and to provide new suggestions for further feature extraction and proportion studies, but also to validate results already obtained. Two available databases are the 3D Facial Norm (3DFN) [50] and the Bosphorus database; the latter, employed in the present work, provides faces of more than one hundred subjects in various poses, expressions and types of occlusion [51].
The present work gathers facial measures that previous studies have identified as representative of female facial attractiveness, with the purpose of classifying female faces. The Bosphorus database, which contains faces of normal women, i.e., women not selected for their attractiveness, has then been used to test whether the final set of canons is suitable and sufficient for evaluating women's attractiveness. The novelties of this work are a methodology that combines data from different sources in the literature into an expanded set of measures covering the works published to date, and a ranking of the measures in the expanded set that shows which are the most significant in attractiveness assessment. Results confirm that the evaluation based on the selected measures corresponds closely to human assessments, which makes it possible to analyze women's beauty quantitatively; moreover, a ranking showing the influence of each measure on women's facial beauty and proportion assessment has been drawn up. The importance of ratios between measures and the higher relevance of vertical measures compared with horizontal ones have been highlighted.

Materials and Methods
Face analysis is the discipline that studies human faces based on the identification of landmarks, i.e., specific points common to everyone that can be identified on the face. Landmarks can be recognized on the hard tissue through palpation or on the soft tissue through observation, although some of the landmarks positioned on the soft tissue depend on landmarks positioned on the hard tissue. In order to evaluate women's attractiveness quantitatively, an expanded set of measures relying on landmark positions has been defined.
The most difficult step in evaluating female beauty is to identify a ground truth against which measures of women's faces can be assessed. Significant experiments have demonstrated the presence of a strong objective component: Iliffe's [11] and Udry's [12] works showed that almost identical rankings are obtained when people are asked to judge and rank beauty in specific photo galleries, and Cunningham [13] stated that beauty cannot be defined as a quality resting only in the eye of the beholder.
In the past decades, several works have aimed at establishing which human facial traits influence an observer's assessment of beauty. The works considered most influential in this field, and therefore given greater consideration in the present work, were carried out by Farkas et al. [52], Ferrario et al. [23], Sarver et al. [37], Sforza et al. [24] and Galantucci et al. [25].
Each of those works was carried out by positioning landmarks live on female subjects before acquiring the point cloud required to compute the measures. Manual placement directly on the subjects was chosen to achieve the best possible accuracy. Acquisitions were made using an RGB-D camera, i.e., a camera that acquires both color and depth images. Subjects were actresses, participants in beauty contests, and common women; the latter were included to validate the results. In more detail, the purpose of computing and comparing facial measures was to highlight the correlation among measures in the set of attractive women and their differences from the same measures in the set of normal women. Whenever a difference was evaluated as significant, the corresponding measure was classified as influential for women's facial beauty and proportion assessment. As can be seen in Table 1, the vast majority of the measures belonging to the expanded set show a statistically significant difference between Galantucci's attractive and normal women datasets.
[...] [25], a t-test has been performed between the measures in this dataset and those in Bosphorus. The confidence level has been fixed at 95%, which means that differences between measures are significant if the p-value is lower than 0.05.
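As an illustration of the significance test mentioned above, the following minimal sketch computes a two-sample t statistic from the summary statistics (mean, standard deviation, sample size) of two groups. The text does not say which variant of the t-test was used; Welch's unequal-variance form is assumed here, and the function name `welch_t` and the numbers in the example are hypothetical.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom.

    Assumption: unequal variances (Welch); the paper does not specify the variant.
    """
    var_a = sd_a ** 2 / n_a  # squared standard error of group A
    var_b = sd_b ** 2 / n_b  # squared standard error of group B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical summary statistics for one measure (in mm), not taken from the paper.
t, df = welch_t(52.0, 3.0, 20, 50.0, 3.5, 44)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The p-value then follows from the Student's t distribution with `df` degrees of freedom; in practice a statistics library would be used for that step (for example, `scipy.stats.ttest_ind_from_stats` with `equal_var=False`), and the difference is called significant when p < 0.05.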
The landmark framework considered in this work is reported in Table 2 and shown in Figure 1. [...] (al_l) and 27 (al_r) have been discarded owing to the lack of available measures involving them. The suffixes _l and _r indicate the left and the right landmark, respectively.
All the measures considered in this analysis have been acquired using the abovementioned landmarks and can be subdivided into three categories: linear measures, angular measures, and ratios.
Linear measures, described in Table 3 and shown in Figure 2, are Euclidean distances between two landmarks or between a landmark and another specific point. One such point is the point on the E-line that is closest to the Labiale superius (or Labiale inferius), where the E-line is the line passing through the Pronasal and the Pogonion.
Angular measures, described in Table 4 and shown in Figure 3, are angles at a vertex defined by three landmarks or, exceptionally for the Interlabial distance, angles between two lines lying on the same plane and identified by four landmarks, two for each line.
Ratios between linear distances (Table 5) allow quantitative evaluations of proportions. With ratios, face analysis moves from a local to a global point of view, since what is considered is not the absolute value of a single measure but the combined effect of two measures. Intuitively, ratios are the quantitative way to represent the big picture.

Table 5. Ratios of Euclidean distances. The third column reports the work from which the measure has been taken.

| Measure | Description | References |
|---|---|---|
| (t_r-t_l)/(n-pg) | Middle facial width to facial height | Galantucci et al. [25] |
| (n-sn)/(n-pg) | Nasion-Subnasale/Nasion-Pogonion | Ferrario et al. [23], Galantucci et al. [25] |
| (sn-pg)/(n-pg) | Subnasale-Pogonion/Nasion-Pogonion | Ferrario et al. [23], Galantucci et al. [25] |
| (t_r-n)/(t_r-sn) | Right Tragion-Nasion/Right Tragion-Subnasale | Galantucci et al. [25] |
| (sn-pg)/(n-sn) | Lower to upper facial height | Galantucci et al. [25] |
| (sn-gn)/(n-gn) | Lower third/facial height | Farkas et al. [52], Sarver et al. [37] |
| (sto-gn)/(sn-gn) | Mandibula/lower third | Farkas et al. [52], Sarver et al. [37] |

As a result of the literature review, twenty-nine measures have been identified. The present work employs data from the Bosphorus database, namely women's pictures, 3D models, and the related landmark coordinates. Because the studies cited above were carried out by research groups with different expertise, slightly different sets of landmarks had been adopted. A landmarking expert identified some missing landmarks on the 2D pictures and 3D models of the Bosphorus database in order to complete the landmark framework. Nonetheless, some landmarks had to be discarded because they rely on the hard tissue and can only be identified through palpation; consequently, some measures could not be included in the expanded set. An example is the zygion (zy), which is the most lateral point on the soft tissue contour of each zygomatic arch.
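The three categories of measures described above (linear, angular, and ratios) can be sketched in a few lines once the 3D coordinates of the landmarks are available. The snippet below is only an illustration: the function names and the landmark coordinates are hypothetical, and the distance from a lip landmark to the E-line is computed as the ordinary point-to-line distance.

```python
import math


def distance(p, q):
    """Euclidean distance between two landmarks."""
    return math.dist(p, q)


def angle_deg(a, vertex, b):
    """Angle in degrees at `vertex` between the rays towards `a` and `b`."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    v = [bi - vi for bi, vi in zip(b, vertex)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def distance_to_line(p, a, b):
    """Shortest distance from point `p` to the line through `a` and `b`."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    t = sum(x * y for x, y in zip(ap, ab)) / sum(x * x for x in ab)
    foot = [ai + t * x for ai, x in zip(a, ab)]  # closest point on the line
    return math.dist(p, foot)


# Hypothetical landmark coordinates (mm); not taken from any dataset.
lm = {
    "n": (0.0, 50.0, 10.0), "sn": (0.0, 0.0, 12.0), "pg": (0.0, -55.0, 5.0),
    "prn": (0.0, 8.0, 30.0), "ls": (0.0, -15.0, 14.0),
    "ex_l": (-45.0, 48.0, -5.0), "ex_r": (45.0, 48.0, -5.0),
}

upper_height = distance(lm["n"], lm["sn"])                     # linear: n-sn
eye_angle = angle_deg(lm["ex_l"], lm["n"], lm["ex_r"])         # angular: ex_l-n-ex_r
ls_to_eline = distance_to_line(lm["ls"], lm["prn"], lm["pg"])  # ls to E-line
ratio = upper_height / distance(lm["n"], lm["pg"])             # ratio: (n-sn)/(n-pg)
print(upper_height, eye_angle, ls_to_eline, ratio)
```

Ratios are dimensionless, so they are unaffected by the overall scale of the acquisition, which is one practical reason they capture proportion rather than size.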
The works of Galantucci et al. [25], Farkas et al. [52], Sarver et al. [37], Ferrario et al. [23] and Sforza et al. [24] provided the mean value and standard deviation of every measure. Some measures are common to different sources, although the related mean value and standard deviation differ slightly depending on the study. The small sample size of the sets of attractive women is due to the difficulty of finding available databases that provide not only images but also 3D models and information such as landmark positions. Indeed, one of the purposes of the present study is to merge the information coming from these sources (Figure 4) and build an overall measuring methodology. A score, obtained as a sum of penalties, has been computed for each woman in the Bosphorus database. For each measure, if the value of the considered woman was within the range mean value ± standard deviation, no penalty was added. For the sake of clarity, the mean value taken into consideration is that of the measures belonging to the attractive sets of women, not that of a generic non-selected group of women (normal women). Conversely, if the value was out of range, a penalty was added, computed as the distance of the measure from the mean value normalized by the mean value (1).
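The penalty rule described in words above can be written compactly as follows. This is a reconstruction from the verbal description only, since the original equation is not reproduced in this text; the symbols x (the value measured on the woman), \mu and \sigma (mean value and standard deviation of the attractive set) are notation assumed here.

```latex
p(x) =
\begin{cases}
0, & |x - \mu| \le \sigma, \\[4pt]
\dfrac{|x - \mu|}{\mu}, & |x - \mu| > \sigma.
\end{cases}
\tag{1}
```

For example, under this reading a measure with \mu = 50 and \sigma = 3 gives no penalty for x = 52, and a penalty of 6/50 = 0.12 for x = 56.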
When a measure has more than one mean value, because more than one source in the literature refers to it, the computed penalty value is then multiplied by a weight (2).
This weight has been introduced to account for the different degree of confidence assigned to the different studies found in the literature. Since in those studies the canons were extracted from sets of attractive women, analyses based on a greater number of subjects were considered more robust. Thus, the weights have been computed as the ratio between the number of women involved in a single study and the total number of women involved in all the studies related to the same measure. For instance, consider a hypothetical measure identified by two studies, A and B, which involve x and y women, respectively. The weight related to study A will be x/(x+y), while the weight related to study B will be y/(x+y).
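The penalty and weighting scheme described above can be sketched as follows. This is one possible reading rather than the authors' implementation: it assumes that, for a measure reported by several studies, the penalty is evaluated against each study's mean and standard deviation and the weighted penalties are summed. All names and numbers are hypothetical.

```python
def penalty(value, mean, sd):
    """Zero inside mean ± sd; otherwise the deviation normalized by the mean."""
    deviation = abs(value - mean)
    return 0.0 if deviation <= sd else deviation / mean


def weights(sample_sizes):
    """Weight of each study: its number of women over the total for the measure."""
    total = sum(sample_sizes)
    return [n / total for n in sample_sizes]


def measure_penalty(value, sources):
    """`sources` is a list of (mean, sd, n_women), one entry per study.

    Assumption: weighted penalties from the different studies are summed.
    With a single source the weight is 1, so the plain penalty is returned.
    """
    ws = weights([n for _, _, n in sources])
    return sum(w * penalty(value, m, s) for w, (m, s, _) in zip(ws, sources))


def score(face, reference):
    """Sum of the penalties over all measures of the expanded set."""
    return sum(measure_penalty(face[name], srcs) for name, srcs in reference.items())


# Hypothetical reference values (mm) and one hypothetical face.
reference = {
    "n-sn": [(50.0, 3.0, 20), (52.0, 2.0, 60)],  # two studies, A and B
    "ls-sto": [(7.0, 1.0, 20)],                  # one study
}
face = {"n-sn": 56.0, "ls-sto": 7.5}
print(score(face, reference))
```

In this example study A has weight 20/80 = 0.25 and study B has weight 60/80 = 0.75, matching the x/(x+y) and y/(x+y) rule; a lower score means a face closer to the reference ranges.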
After all the scores were obtained, a cluster analysis was performed with the k-means method, subdividing the dataset into five classes. The purpose of this step was to identify which women were closer to the well-proportioned standard face in terms of compliance with the measures in the expanded set; given the descending order adopted, faces belonging to class 5 are closer to the well-proportioned standard face than faces belonging to class 1. The number of classes was chosen to match the 5-point Likert scale. Indeed, in parallel, a qualitative evaluation of the women in Bosphorus was made by a focus group using precisely the Likert scale. The results obtained with the developed methodology were then compared with the qualitative evaluation given by the human observers (Figure 5).
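A minimal sketch of this clustering step is given below, assuming that k-means is applied directly to the one-dimensional penalty scores. The deterministic initialization (evenly spaced order statistics) is a choice made here for reproducibility and is not stated in the text; clusters are then relabelled so that class 5 has the lowest mean score and class 1 the highest.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values; returns (centroids, assignment)."""
    ordered = sorted(values)
    # Assumed initialization: evenly spaced order statistics of the scores.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Recompute each centroid as the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, assignment


def classes_from_scores(scores, k=5):
    """Map clusters to classes k..1: the lowest-penalty cluster becomes class k."""
    centroids, assignment = kmeans_1d(scores, k)
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {c: k - i for i, c in enumerate(order)}
    return [rank[a] for a in assignment]


# Hypothetical penalty scores for ten faces.
scores = [0.10, 0.12, 0.50, 0.52, 1.00, 1.05, 1.50, 1.55, 2.00, 2.10]
print(classes_from_scores(scores))
```

Since the score is a sum of penalties, the cluster with the smallest centroid contains the faces that deviate least from the reference ranges, which is why it receives the highest class label.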
Finally, the results of the developed methodology in terms of cluster subdivision have been analyzed to rank the influence of each measure of the expanded set in women's facial beauty and proportion assessment.

Results
The measures selected from the literature and included in the expanded set are reported in Table 6. A measure may have more than one mean value and standard deviation, because the same measure can be retrieved from different sources in the literature; the last column reports the reference number of the source considered.
After the expanded set was defined, each of the forty-four women in the Bosphorus database was measured, and the scores were computed by summing all the penalties. Scores were subdivided into five clusters, so that the faces closer to the well-proportioned standard face can be easily identified. Clusters are arranged in descending order: faces belonging to cluster 5 are the closest to the well-proportioned standard face, whereas faces belonging to cluster 1 are the furthest.
Subsequently, a focus group composed of 10 people, 6 men and 4 women aged between 19 and 39 years, met to evaluate the attractiveness of the Bosphorus women. For every face, a discussion among the focus group members led to a final score; in other words, the whole group provided a single score for each face analyzed. The aim was to compare the developed methodology with human judgment. The results showed that the cluster label and the Likert scale label are the same, i.e., their difference equals 0, for 15/44 women (very high correspondence); the difference is 1 for 16/44 women (high correspondence); the difference is 2 for 9/44 women (moderate correspondence); the difference is 3 for 3/44 women (low correspondence); and the difference is 4 for 1/44 women (very low correspondence).
Very high correspondence means that the result of the method matches the focus group's outcome, i.e., a woman labeled with very high correspondence has been judged in the same way by the method and by the focus group.
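The correspondence levels defined above amount to tallying the absolute difference between the cluster label and the Likert label of each face. A minimal sketch follows; the function name and the three example label pairs are hypothetical, while the counts in `reported` are those stated in the text.

```python
from collections import Counter

LEVELS = {0: "very high", 1: "high", 2: "moderate", 3: "low", 4: "very low"}


def correspondence(cluster_labels, likert_labels):
    """Count faces per correspondence level from two lists of labels in 1..5."""
    diffs = Counter(abs(c - l) for c, l in zip(cluster_labels, likert_labels))
    return {LEVELS[d]: diffs.get(d, 0) for d in LEVELS}


# Hypothetical labels for three faces.
print(correspondence([5, 4, 1], [5, 3, 5]))

# Counts reported in the text for the 44 Bosphorus women.
reported = {"very high": 15, "high": 16, "moderate": 9, "low": 3, "very low": 1}
total = sum(reported.values())
share = (reported["very high"] + reported["high"]) / total
print(total, f"{share:.1%}")  # 44 70.5%
```

The reported counts sum to 44, and the high and very high levels together account for 31/44 of the faces.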
All those results are reported in Table 7.
Finally, the results obtained through the cluster analysis were analyzed in more depth. The purpose of this step was to rank the measures in order to understand which are the most important in women's facial beauty and proportion assessment. In particular, the focus was on cluster 4 and cluster 5, which contain attractive and very attractive women, respectively. Given the cardinality of the clusters (17 women when clusters 4 and 5 are taken together), the number of out-of-range values was computed for each measure; a measure is considered more influential the fewer women have it out of range. For instance, referring to Table 8, the Euclidean distance ls-sto is the most influential measure, because only two women (12%) belonging to cluster 4 or cluster 5 are out of range; conversely, ch_r-ch_l is one of the least influential measures, because only 5 out of 17 women are within range.
Table 8. The measures rankings. The cardinality of each cluster is reported in the header of the table. The first column lists the measures, and the second column reports the number of women belonging to cluster 4 or 5 that are out of range. For the sake of completeness, the other columns report the number of out-of-range women in each cluster.
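The ranking procedure just described can be sketched as follows. For simplicity the sketch assumes a single (mean, standard deviation) pair per measure; how out-of-range values were counted for measures with several sources is not specified in the text. Names and values are hypothetical.

```python
def rank_measures(faces, classes, reference, selected=(4, 5)):
    """Rank measures by how few women in the selected clusters are out of range.

    faces:     list of dicts, measure name -> value
    classes:   list of cluster labels, one per face
    reference: dict, measure name -> (mean, sd)  (single source assumed)
    Returns (ranking, n_selected); the most influential measure comes first.
    """
    chosen = [f for f, c in zip(faces, classes) if c in selected]
    counts = {
        name: sum(1 for f in chosen if abs(f[name] - mean) > sd)
        for name, (mean, sd) in reference.items()
    }
    ranking = sorted(counts.items(), key=lambda item: item[1])
    return ranking, len(chosen)


# Hypothetical data: three faces, two measures (mm).
faces = [
    {"ls-sto": 7.0, "ch_r-ch_l": 60.0},
    {"ls-sto": 7.2, "ch_r-ch_l": 50.0},
    {"ls-sto": 9.5, "ch_r-ch_l": 58.0},
]
classes = [5, 4, 2]
reference = {"ch_r-ch_l": (50.0, 3.0), "ls-sto": (7.0, 1.0)}

ranking, n = rank_measures(faces, classes, reference)
for name, out in ranking:
    print(f"{name}: {out}/{n} out of range")
```

The rationale is that a measure which nearly all faces in the top clusters satisfy is a measure that discriminates those clusters well.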

Discussion
The present work has focused on gathering measures validated in past studies, building the expanded set of measures defined on facial landmarks and analyzing a public database of normal women, the Bosphorus. Unfortunately, some of the landmarks used in the literature were not present in the set of data of the Bosphorus database. For this reason, an expert has manually added those missing landmarks lying on soft tissue, but some of the hard tissue landmarks have not been considered due to the impossibility of identifying them without live palpation. Consequently, a limited set of measures that could have been included in the expanded set has been discarded.
In order to analyze the Bosphorus database, a methodology able to integrate information provided by different sources was required. Thus, the penalty-based algorithm was developed, and the results were clustered so that faces can be classified into 5 levels. Clusters are numbered from 5 to 1, i.e., from the highest to the lowest compliance with the well-proportioned standard face. This numbering was chosen so that the quantitative outcome can be compared directly with the qualitative assessment: Likert's 5-level scale has allowed us to analyze the correspondence between the developed methodology and people's assessments.
The results displayed in Figure 6 show the levels of correspondence in women's evaluation from very high to very low, namely from a 0-level difference to a 4-level difference between quantitative and qualitative evaluation. Summing high and very high correspondence gives 31/44, slightly more than 70%, which supports the theories mentioned in the introduction stating the presence of objective elements that are unconsciously but incontrovertibly considered in evaluating women's attractiveness. Cluster analysis has led this study to identify part of those elements in terms of proportions between significant measures, ranking them from the most to the least influential. Some interesting observations have arisen from the ranking analysis. Firstly, vertical measures are typically more meaningful than horizontal ones: examples of influential vertical measures are the thickness of the upper lip (ls-sto), the height of the central part of the face (n-sn), and the height of the lower part of the face (sn-gn). In more detail, the lower part of the face in attractive women turned out to be taller than the central part, consistent with the fact that, when the two have identical values, faces appear rounded, a characteristic not considered attractive. The upper lip was more meaningful than the lower lip, but this does not mean it should be larger; rather, it means that a full lower lip is more ordinary, whereas also having a full upper lip is less common and is thus a distinctive feature of attractiveness. The angle between the two exterior corners of the eyes and the nasion, i.e., the point that separates the upper third of the face from the middle third (ex_l-n-ex_r), and the mouth width (ch_r-ch_l), at the end of the table, are examples of less meaningful horizontal measures.
This does not diminish the importance of mouth width in the assessment of women's faces; simply, the analysis has shown a not particularly relevant difference in mouth width between attractive and normal women. Secondly, beauty turned out to be strictly connected to proportions: 5 out of the 7 ratios present in the expanded set of measures are in the top half of the ranking, confirming that women's beauty and well-proportioned faces also depend on relationships between measures and thus involve a holistic process. Another indication is the relatively low importance of the absolute value of the facial width, t_r-t_l, compared with the ratio between facial width and facial height, (t_r-t_l)/(n-pg), which is ranked among the 10 most influential measures.
In total, the 70% of high and very high correspondence between quantitative and qualitative evaluation is a non-negligible result, but the remaining 30% of non-corresponding results needs to be investigated in future research. A larger set of attractive and normal women would allow the expanded set of measures to be enlarged, and live acquisition of all the needed landmarks would provide data that are uniform in terms of colors, poses, and expressions and that could be more properly evaluated by humans. Moreover, texture analysis in terms of eye color and skin imperfections could further reduce the percentage of non-correspondence and bridge the gap between perceived beauty and facial proportions.

Conclusions
The present study has been carried out considering experimental evidence proving that beauty assessment is not only subjective, but also relies on objective elements.
Several previous studies have identified relevant measures to assess women's facial beauty and proportion, comparing sets of attractive and normal women by measuring Euclidean distances, angles, and ratios; the measures have been computed from landmarks identified live on the subjects' faces. All these works have considered different sets of measures to find differences between the normal and attractive populations.
The penalty-based nature of the developed methodology highlights the importance of the size of the attractive women datasets. The methodology was designed to combine data provided by different sets of attractive women, so that a larger number of measures can be considered. Future work should focus precisely on this aspect: the literature lacks datasets of attractive women, and establishing a larger database will make it possible to deepen the analysis of the measures characteristic of attractiveness.
The current work has defined an expanded set of measures gathering the information presented in the literature, together with a methodology to merge the results and to classify faces. For each woman, the methodology provides a score computed as a sum of penalties, where a penalty is assigned whenever a woman's measure falls outside the range identified by the mean value and standard deviation. The procedure was then applied to the set of normal women of the Bosphorus public database and, after a cluster analysis, it was possible to classify women's faces according to their distance from the well-proportioned standard face, i.e., the expanded set of measures.
After that, a deeper investigation of the results provided by the cluster analysis made it possible to rank the measures from the most to the least influential, in order to understand which measures weigh most in the assessment of women's beauty. Vertical Euclidean distances were shown to be very impactful, as were ratio measures; indeed, the direct comparison between measures proved to be even more significant than the value of a single measure, as shown by the facial width. Upper lip thickness proved to be the most meaningful measure in the mouth area.
The present work could make a significant contribution to female beauty and proportion assessment, and it proposes some research hints for future works on female attractiveness, as well as for other studies that have been discouraged by the misconception that beauty is merely, or mainly, subjective.