Article

Beauty in the Eyes of Machine: A Novel Intelligent Signal Processing-Based Approach to Explain the Brain Cognition and Perception of Beauty Using Uncertainty-Based Machine Voting

1 School of Electrical Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
2 Department of Psychology, Imam Mohammad Ibn Saud University, Riyadh 11314, Saudi Arabia
* Author to whom correspondence should be addressed.
Electronics 2023, 12(1), 48; https://doi.org/10.3390/electronics12010048
Submission received: 28 November 2022 / Revised: 17 December 2022 / Accepted: 20 December 2022 / Published: 23 December 2022

Abstract

The most mysterious question humans have attempted to answer for centuries is, “What is beauty, and how does the brain decide what beauty is?”. The main problem is that beauty is subjective, and the concept changes across cultures and generations; thus, subjective observation is necessary to derive a general conclusion. In this research, we propose a novel approach utilizing deep learning and image processing to investigate how humans perceive beauty and make decisions in a quantifiable manner. We propose a novel approach using uncertainty-based ensemble voting to determine the specific features that the brain most likely depends on to make beauty-related decisions. Furthermore, we propose a novel approach to prove the relationship between the golden ratio and facial beauty. The results show that beauty is more correlated with the right side of the face and, specifically, with the right eye. Our study and findings push the boundaries between different scientific fields and enable numerous industrial applications in various fields such as medicine and plastic surgery, cosmetics, social applications, personalized treatment, and entertainment.

1. Introduction

Plato once wrote in the Symposium that “if there is anything worth living for, it is to behold beauty”. For centuries, beauty has captivated humans, and philosophers have attempted to explain this mysterious human phenomenon without any appropriate means of measuring and comprehending beauty. Facial attractiveness (or facial beauty) profoundly affects numerous facets of a person’s social life. In the US alone, money spent on beauty products exceeds what is spent on both social services and education [1]. Researchers from many different disciplines, including engineering, human science, and medicine, have been interested in beauty [2]. The field of computing has only recently begun studying facial beauty, whereas the psychological community has researched it thoroughly [3]. Gustav Fechner, a pioneer in experimental psychology, was interested in quantifying beauty [4]. However, in contemporary models of aesthetic experiences, beauty remains a mysterious aesthetic response [5,6].
With advancements in psychology, scientists have detected internal and external factors that affect beauty judgment. Fertility and hormone levels are examples of internal influences, whereas social contexts, temporal contexts (e.g., long-term versus short-term relationships), environmental contexts, and visual experiences (e.g., parental familiarity) are examples of external elements [7,8]. Another study has found that facial symmetry is a major factor in determining facial beauty, as the human brain is sensitive to symmetry [9]. Furthermore [10], in addition to facial symmetry and averageness, the major attributes that affect facial attractiveness are full lips, thin eyebrows, tiny nose, high forehead, high cheekbones, and thick hair, which are probably applicable in the case of the attractiveness of men as well. However, the face is more feminine and attractive because of certain major characteristics, such as large lips, a small nose, and high cheekbones, which are indicators of youth, fertility, and high estrogen levels [10].
Existing studies explain the common features that contribute to beauty perception but fail to explain and quantify why the human brain judges a specific face to be beautiful, or why two people can disagree about whether full lips are beautiful.
The pioneering neurobiologist Semir Zeki, who has researched the question of beauty from a neuroscience perspective for many years, defined beauty as follows: “Beauty is an experience that correlates quantitatively with neural activity in a specific part of the emotional brain, namely, in the field A1 of the medial orbito-frontal cortex (A1mOFC); the more intense the declared experience of beauty, the more intense the neural activity there” [11].
In one recent fMRI-based study [12], Zeki et al. [11] investigated how beautiful faces are manifested in the brain and whether beauty can be decoded on the basis of pattern activity. They showed 120 faces to seventeen volunteers, who rated the attractiveness of each face on a seven-point Likert scale while having their brain activity measured. They discovered that the sense of beauty results in the emergence of various patterns of activity in the fusiform face area (FFA) and occipital face area (OFA), and the concurrent emergence of activity in the medial orbital frontal cortex (mOFC). The right FFA (61.17%) and the mOFC (62.35%) exhibited the highest cross-participant classifications of brain activity. However, the researchers concluded: “The precise features that render a face beautiful, beyond the accepted general properties of symmetry, proportions, and precise relationships—which are not in themselves necessarily sufficient to render a face beautiful—may be unknown”.
The irrational number approximately equal to 1.618, known as the golden ratio and symbolized by Φ (phi), was first referenced in writing in Euclid’s Elements in 300 BC [13].
Many have considered the golden ratio, commonly referred to as the divine proportion, to be the solution to the phenomenon of aesthetics; regularity; alignment; and human physiology, psychology, and beauty [14,15,16]. The golden ratio has also been used in surgical anatomy [17], dental and maxillofacial surgery [18,19], and in cosmetic surgery as a paradigm to measure aesthetics at both the soft tissue and hard tissue (cephalometric) levels [20,21]. However, is a face that perfectly matches the golden ratio considered beautiful, and who decides whether the face is beautiful or not? Holland’s research [22] shows that there are insufficient studies supporting the relationship between the golden ratio and facial beauty.
Owing to extensive advancements in the artificial intelligence field and strong links between deep learning techniques and the human brain [23,24], in recent years, the automatic human-like facial attractiveness prediction (FAP) or face beauty prediction (FBP) and its applications in machine learning and computer vision have attracted considerable research interest [25]. Despite significant advancements in determining and estimating the beauty of faces, the majority of studies have concentrated on predicting beauty without a thorough grasp of what constitutes beauty. A machine approach to understanding how the human brain processes such deep human concepts is necessary, as it is still not comprehensively understood in almost all fields [26]. Few studies [3,27,28] have attempted to investigate the constituents of beauty and understand the human brain’s judgments of beauty. However, such studies are all based on objective datasets obtained by either voting or average scores.
To appropriately understand beauty, we propose a novel methodology to train the machine regarding beauty and a new framework to explain the features it depends on when deciding which faces are beautiful. Table 1 summarizes the related studies. The research is organized as follows: Section 2 reviews related work, Section 3 presents the details of the proposed approach, Section 4 presents the experimental results, Section 5 discusses the results, and Section 6 concludes the research and mentions future studies. In summary, the major contributions of this paper include the following:
  • We extract objective general patterns of facial beauty attributes learned subjectively by deep convolutional neural networks (CNNs).
  • This is the first study wherein correlations between beauty and facial attributes are subjectively analyzed using a quantitative approach; we derive general patterns of statistically significant attributes of attractiveness.
  • We propose a novel framework and algorithm to train CNNs, visualize and extract the learnings of the machine about beauty, and automatically explain every machine decision in the entire dataset.
  • We validate existing psychological, biological, and neurobiological studies of beauty and discover new patterns.
  • We propose a novel method to prove the relationship between the golden ratio and facial beauty.

2. Related Work

In a recent study [35], researchers used deep learning to build a facial attractiveness assessment model by training a CNN on the SCUT-FBP5500 dataset [34], and they used transfer learning techniques to improve the training accuracy. SCUT-FBP5500 [34] is a new dataset that contains 5500 Asian and Caucasian male and female faces for a multi-paradigm facial beauty prediction. In addition to a shallow prediction model with hand-crafted features, a deep learning model is used to predict facial beauty scores based on the input of 60 volunteers who labelled faces with a score ranging from 0 to 5, where 5 represents the most attractive face. However, does the perception of the 60 volunteers adequately represent beauty in general? Does the predicted score of 5 accurately indicate an extremely beautiful face and a lower score a less beautiful face? What does less beauty mean and according to whom? The main issue with this approach is that it attempts to establish a general beauty prediction model while the concept of beauty is not addressed. First, we need to understand why each volunteer judges a face to be beautiful or not and draw conclusions and establish a model upon this. In addition, the dataset used to train the machine would never manifest how the majority perceive beauty, and considering an average score does not satisfy anyone, including the 60 volunteers themselves, which causes the prediction output to be random and meaningless.
A geometry-based beauty assessment has been proposed to test the effect of geometric shape representation on facial beauty, regardless of other appearance features [28]. Apart from the fact that the proposed method is based on average score data, two faces can have similar geometric shapes; yet the beauty decision is different.
In addition to building beauty prediction models, researchers have attempted to visualize what CNNs trained on Asian faces learn about beauty by visualizing their filters, with the goal of understanding the reasons behind beauty decisions. However, the visualization of filters did not explain the reasons behind beauty decisions or show features of beauty because filters alone are uninterpretable [32]. The correlation between facial features was investigated by a different approach [3]. In this approach, researchers pre-assumed that the correlation between facial features renders a face beautiful; accordingly, their neural network was trained not to learn beauty but to extract and estimate facial features based on majority voting and average score datasets, without investigating the reasons behind beauty decisions. Recent research [27] used an effective approach based on the gradient-weighted class activation mapping (Grad-CAM) technique [36] to visualize what neural networks learn about beautiful faces. This study used the previously mentioned SCUT-FBP5500 dataset [34] to train the CNN model and obtain activation maps, and the face activation maps of the top 50 images with the highest attractiveness scores were averaged by overlaying them. However, this approach has two limitations: first, training is based on average score data; second, an average of activation maps cannot convey how the machine determines beauty for each face and based on what features.

3. Methods

Deep human feelings, such as beauty, are common and highly subjective and are affected by many internal and external factors such as hormones, media, and environment, or even certain scenes, sounds, and smells [37,38]. The way these experiences evolve and accumulate causes an individual’s decisions to be unique and subjective. Therefore, to study such human phenomena and arrive at a general pattern and conclusion, we propose closely investigating each decision subjectively.
In our approach, we assume that beauty decisions are the result of brain cognition and perception and inherit all the accumulated information that eventually leads to such decisions. Since there is insufficient information on the constituents of beauty across different fields of science, and since there is a strong relation between deep learning and the human brain, if we successfully train a deep learning network to mimic the human brain by identifying and classifying beauty subjectively for a specific person, then we can claim that studying the learnings of the machine can explain what the individual’s brain learned.
To perform this study, data were subjectively collected and fed to the machine to teach it to mimic each participant’s brain decisions using deep learning. Once the machine successfully adopts the human brain’s decisions, we can study the human brain by studying the machine. Instead of assuming predefined features, the machine completely learns the features that contribute to beauty decisions. The proposed methodology is illustrated in Figure 1.

3.1. Data Preparation

In previous studies, the datasets used in training CNNs, determining facial ratios, and assessing facial geometric features were based on either group voting or average score rating, by assuming that beauty is universal, which is not necessarily true. Figure 2 illustrates the problem of such an approach, which can be summarized as follows:
  • Predictions are unsatisfactory and meaningless.
  • Even when a majority rates or votes a face as beautiful, the voters still disagree on the reasons behind their votes, which aggravates the problem and renders predictions meaningless.
  • Because the model is trained on average scores and/or voting and produces meaningless predictions, the model learning is ambiguous, and it is impossible to interpret neurons to explain beauty, as there is no basis to justify beauty accordingly.
Since “beauty lies in the eyes of the beholder”, we believe every face is beautiful, at least to someone. Thus, in this study, we use the terms “Beauty” and “Different Beauty” to indicate the subjectivity of beauty. In this study, we collected 50,000 images of female faces and included 10 male participants. The participants were from the Middle East and were aged between 25 and 35 years. The dataset contained different ethnicities, ages, accessories, and backgrounds. Images were crawled from sources under permissive licenses and automatically aligned and cropped using image processing techniques. To avoid any potential ethical or copyright issues, all faces illustrated in this paper have been synthetically generated by a computer and do not belong to any real person. We developed an interactive GUI application that all participants utilized to automate the classification process and data collection. This resulted in a clear and simple classification approach in which all participants employed the same classification strategy for the same dataset and were provided the same instructions. They were told to base their judgment solely on facial beauty. To avoid order effects, photos were shown on separate pages, each in random order. Participants could look at a face without time limits and were free to return to faces that they had already seen and change their evaluation. The GUI is programmed to show each face with three labels: “Beauty”, “Different Beauty”, and “Another”, which a participant can choose if the face is perceived to belong to neither the “Beauty” nor the “Different Beauty” category. Every participant collected 3000 female face images for each class; the total number of faces per participant was 6000, and the average labeling time was 28 h. Since the 50,000-face dataset was displayed in random order for each participant, it was not necessary that all participants rated the same faces, as our main objective was to construct a subjective dataset for each person and subsequently derive a general pattern at the end of the study.

3.2. Ensemble Learning

Ensemble learning promotes the “wisdom of crowds”, which holds that the judgment of a larger group of people is often superior to that of a single expert. Similarly, ensemble learning describes a collection (or ensemble) of basic learners or models who collaborate to produce a more accurate final prediction. A single model, also known as a base or weak learner, may not function effectively alone, owing to excessive variation or strong bias. However, weak learners can collaborate to build a strong learner, which outperforms any individual base model, since their combination minimizes bias or variation [39,40].
Let $S$ be a sample space containing samples $x_i$ $(i = 1, 2, \dots, s)$ that belong to classes $c_j$ $(j = 1, 2, \dots, N)$, which comprise the class set $C = \{c_1, c_2, \dots, c_N\}$. Consider $K$ independent voters (basic classifiers) $f_k$ $(k = 1, 2, \dots, K)$. Given sample $x_i$, the prediction (vote) by $f_k$ is $f_k(x_i)$. Accordingly, the simple ensemble voter (classifier) system is described as
$$F(x_i) = \begin{cases} c_j^{*} & \text{if } c_j^{*} \in C \;\wedge\; T_F(x_i \to c_j^{*}) = \max_{1 \le j \le N} T_F(x_i \to c_j) \;\wedge\; T_F(x_i \to c_j^{*}) > \alpha \times K \\ \text{reject} & \text{else}, \end{cases}$$

$$T_F(x_i \to c_j) = \sum_{k=1}^{K} y_k(x_i \to c_j), \quad j = 1, 2, \dots, N,$$

$$y_k(x_i \to c_j) = \begin{cases} 1 & \text{if } f_k(x_i) = c_j \; (j = 1, 2, \dots, N) \\ 0 & \text{else}, \end{cases}$$
where $\alpha \times K$ represents the threshold for voting, and $\wedge$ denotes the logical AND operator. In this approach, each voter has the same weight, regardless of their performance. However, by examining each voter’s strengths in recognition and assigning different weights, the ensemble system judges more accurately and produces considerably better performance. The ensemble-weighted majority voting system is described as follows:
$$F(x_i) = c_j^{*} \quad \text{if } c_j^{*} \in C \;\wedge\; T_F(x_i \to c_j^{*}) = \max_{1 \le j \le N} T_F(x_i \to c_j),$$

$$T_F(x_i \to c_j) = \sum_{k=1}^{K} w_k(x_i \to c_j), \quad j = 1, 2, \dots, N,$$

$$V_{winner}(x_i \to c_j^{*}) = \{f_1, f_2, \dots, f_k\},$$

$$W(x_i \to c_j^{*}) = \{w_1, w_2, \dots, w_k\},$$
where $w_k(x_i \to c_j)$ represents the weight of voter $f_k$ voting for $c_j$ given $x_i$, and $W(x_i \to c_j^{*})$ represents the list of weights of the majority votes of the winner voters $V_{winner}(x_i \to c_j^{*})$. For optimum learning, we propose using varied training data with k-fold cross-validation for each voter, wherein for each fold model $m$ we create $s$ snapshots. The primary idea behind creating model snapshots is to train a single model while continuously lowering the learning rate to reach a local minimum and to save a snapshot of the current model weights. Later, to move away from the current local minimum, the learning rate is actively increased again. The process continues until all cycles are completed. One of the primary techniques for producing model snapshots for CNNs is to gather multiple models during a single training session using cyclic cosine annealing [41]. The cyclic cosine annealing approach begins with the initial learning rate, progressively drops to the bare minimum, and then rapidly shoots back up. The learning rate at each epoch for cyclic cosine annealing is given by the following expression:
$$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \,\mathrm{mod}(t-1, \lceil T/s \rceil)}{\lceil T/s \rceil}\right) + 1\right),$$
where $T$ is the total number of training iterations, $s$ is the number of cycles, $\alpha(t)$ is the learning rate at epoch $t$, and $\alpha_0$ is the initial learning rate. The weight of the snapshot model is defined as the weight at the bottom of each cycle. The next learning-rate cycle reuses these weights but allows the learning algorithm to reach different conclusions, thereby producing a variety of snapshot models. We obtained $s$ model snapshots when $s$ training cycles were completed, each of which was used for ensemble prediction. Using the $s$ model snapshot predictions, we applied simple average ensemble predictions for each fold model $m$ of each base classifier (voter). The output of the simple average ensemble predictions for each voter was used to obtain the final weighted voting ensemble, as illustrated in Algorithm 1. In the next section, we discuss the use of different policies as voting weights.
Algorithm 1: Machine-weighted voting algorithm
Input: data set $D = \{(x, y): x \in \mathbb{R}^{n \times p}, y \in \mathbb{R}^{n}\}$; $K$ base learning algorithms
Output: machine votes
Initialization;
for $k = 1, \dots, K$ do
  1: Split $D$ into $D_{Train}$, $D_{Test}$
  for $i = 1, \dots, m$ splits do
    Split $D_{Train}$ into $D_i^{train}$, $D_i^{validation}$ for the $i$-th split.
    Train basic classifier $f_k$ on $D_i^{train}$ and validate on $D_i^{validation}$
    $P_i^{k}$ = predict on $D_{Test}$
  end
  Concatenate the $m$ predictions on $D_{Test}$: $\hat{Y}_k = (P_1^{k}, \dots, P_m^{k})$
  Compute optimal voting weights $(w_1^{*}, \dots, w_k^{*})$
  Apply weighted class voting (Equation (4)).
end
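To make the snapshot-ensemble mechanics concrete, the following minimal Python sketch computes the cyclic cosine annealing rate of Equation (8) and combines voter predictions by weighted voting as in Algorithm 1; the array shapes, the weight normalization, and the example schedule are illustrative assumptions rather than the implementation used in this study.

```python
import numpy as np

def cyclic_cosine_lr(t, T, s, alpha_0):
    """Learning rate at epoch t (1-indexed) for cyclic cosine annealing, Equation (8):
    T = total training epochs, s = number of cycles, alpha_0 = initial learning rate."""
    cycle_len = int(np.ceil(T / s))
    return alpha_0 / 2.0 * (np.cos(np.pi * ((t - 1) % cycle_len) / cycle_len) + 1.0)

def weighted_vote(voter_probs, weights):
    """Weighted voting over class-probability predictions of several voters/snapshots.
    voter_probs: array of shape (K, n_samples, n_classes); weights: array of shape (K,)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                        # normalize weights to sum to one
    combined = np.tensordot(w, voter_probs, axes=1)        # (n_samples, n_classes)
    return combined.argmax(axis=1)                         # winning class per sample

# Example: learning-rate schedule for 90 epochs split into 3 snapshot cycles
lrs = [cyclic_cosine_lr(t, T=90, s=3, alpha_0=0.1) for t in range(1, 91)]
```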

3.3. Optimal Voting Weights

3.3.1. Best Combination

Grid searching for weight values between 0 and 1 for each ensemble member, such that the weights across all ensemble members sum to one, is the simplest and possibly the most exhaustive approach. According to [39], however, a more effective approach is the optimization model proposed by Perrone et al. (1992), which combines the predictions of the fundamental classifiers by choosing the weights such that the resulting ensemble minimizes the overall expected prediction error (MSE) [42].
$$\min \; \mathrm{MSE}\left(w_1 \hat{Y}_1 + w_2 \hat{Y}_2 + \dots + w_K \hat{Y}_K,\; Y\right) \quad \text{s.t.} \quad \sum_{k=1}^{K} w_k = 1, \; w_k \ge 0, \; \forall f_k \; (k = 1, 2, \dots, K),$$
where $\hat{Y}_k$ denotes the vector of predictions of the basic classifier $f_k$ on the cross-validation samples, $Y$ is the vector of true response values, and $w_k$ is the weight corresponding to the base model $f_k$ $(k = 1, 2, \dots, K)$. Assuming that $n$ is the total number of instances, $y_i$ is the true value of observation $i$, and $\hat{y}_i^{k}$ is the prediction of observation $i$ by the base model $f_k$, the optimization model is as follows:
$$\min \; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \sum_{k=1}^{K} w_k \hat{y}_i^{k}\right)^{2} \quad \text{s.t.} \quad \sum_{k=1}^{K} w_k = 1, \; w_k \ge 0, \; \forall f_k \; (k = 1, 2, \dots, K)$$
The aforementioned formulation is a nonlinear convex optimization problem: computing the Hessian matrix shows that the objective function is convex, and the constraints are linear. Therefore, the best solution to this problem is the global best solution, because a local optimum of a convex function (the objective function) over a convex feasible region (the feasible region of the preceding formulation) is guaranteed to be a global optimum [43].
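The constrained weight optimization above can be solved with an off-the-shelf solver. The sketch below uses SciPy's SLSQP method and randomly generated toy predictions; both are illustrative assumptions and not the setup used in this study.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_voting_weights(preds, y_true):
    """Solve the convex problem of Equation (10): find non-negative weights summing to 1
    that minimize the MSE of the weighted combination of base-model predictions.
    preds: array of shape (K, n) with cross-validation predictions of K base models.
    y_true: array of shape (n,) with the true responses."""
    K = preds.shape[0]

    def mse(w):
        return np.mean((y_true - w @ preds) ** 2)

    constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * K
    w0 = np.full(K, 1.0 / K)                      # start from equal weights
    res = minimize(mse, w0, bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x

# Example with three hypothetical base models of varying noise levels
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
preds = np.stack([y + rng.normal(0, s, 200) for s in (0.3, 0.5, 0.8)])
print(optimal_voting_weights(preds, y))
```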

3.3.2. Priori Recognition Performance Statistics

A basic classifier is assigned more weight based on how well it recognizes patterns [40]. Let the confusion matrix of voter $f_k$ be
$$CM_k = \begin{bmatrix} n_{11}^{k} & n_{12}^{k} & \cdots & n_{1N}^{k} \\ n_{21}^{k} & \ddots & & n_{2N}^{k} \\ \vdots & & n_{j_1 j_2}^{k} & \vdots \\ n_{N1}^{k} & n_{N2}^{k} & \cdots & n_{NN}^{k} \end{bmatrix} \quad (k = 1, 2, \dots, K).$$

When $j_1 = j_2$, $n_{j_1 j_2}^{k}$ denotes the count of samples that belong to class $c_{j_1}$ and are classified accurately as $c_{j_1}$ by voter $f_k$. When $j_1 \ne j_2$, $n_{j_1 j_2}^{k}$ represents the number of samples belonging to class $c_{j_1}$ that are misclassified as $c_{j_2}$ by voter $f_k$. The number of instances classified as $c_{j_2}$ becomes

$$n_{j_2}^{k} = \sum_{j_1 = 1}^{N} n_{j_1 j_2}^{k} \quad (j_2 = 1, 2, \dots, N).$$

Consequently, the conditional likelihood that such a sample belongs to class $c_{j_1}$ is reflected in the matrix

$$PM_k = \begin{bmatrix} P_{11}^{k} & P_{12}^{k} & \cdots & P_{1N}^{k} \\ P_{21}^{k} & \ddots & & P_{2N}^{k} \\ \vdots & & P_{j_1 j_2}^{k} & \vdots \\ P_{N1}^{k} & P_{N2}^{k} & \cdots & P_{NN}^{k} \end{bmatrix} \quad (k = 1, 2, \dots, K).$$

When classifier $f_k$ classifies instance $x_i$ as class $c_{j_2}$, its voting weight for class $c_{j_1}$ is $P_{j_1 j_2}^{k}$.
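As a small illustration of Equations (12) and (13), the following sketch column-normalizes a confusion matrix to obtain the conditional probabilities used as voting weights; it assumes true classes in rows and predicted classes in columns, which is an assumption made for illustration.

```python
import numpy as np

def conditional_probability_matrix(cm):
    """Given a confusion matrix cm with cm[j1, j2] = count of samples of true class j1
    predicted as class j2, return PM with PM[j1, j2] = P(true class = j1 | predicted = j2),
    i.e. each column of cm normalized by its column sum (Equations (12)-(13))."""
    cm = np.asarray(cm, dtype=float)
    col_sums = cm.sum(axis=0, keepdims=True)        # n_{j2}^k for every predicted class
    col_sums[col_sums == 0] = 1.0                   # avoid division by zero for empty columns
    return cm / col_sums

# Example: a voter that is right 8 times out of 10 when it predicts class 0
cm = np.array([[8, 3],
               [2, 7]])
print(conditional_probability_matrix(cm))
```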

3.3.3. Model Calibration

Neural networks are usually calibrated inadequately [44], implying that they are overconfident in their predictions. In the classification process, neural networks produce “confidence” scores along with the predictions. Ideally, these confidence levels coincide with the actual likelihood of correctness. For instance, if we make 100 predictions each with a confidence level of 80%, we would anticipate that 80% of the predictions will be correct. In such a case, the network is calibrated. Model calibration is the process of taking a trained model and applying a post-processing procedure to improve its probability estimates.
Let input images $X \in \mathcal{X}$ and class labels $Y \in \mathcal{Y} = \{1, \dots, k\}$ be random variables following the joint ground-truth distribution $\pi(X, Y) = \pi(Y \mid X)\,\pi(X)$, and let $h$ be a CNN with $h(X) = (\hat{Y}, \hat{P})$, where $\hat{Y}$ is the predicted class and $\hat{P} \in [0, 1]$ is the attributed confidence level. The objective is to calibrate $\hat{P}$ such that it represents the true class probability. In practice, the accuracy of deep learning networks is typically lower than their confidence. Perfect calibration is defined as

$$\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1]$$

The calibration error, which describes the deviation in expectation between confidence and accuracy, is

$$\mathbb{E}_{\hat{P}}\left[\left|\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) - p\right|\right].$$
Calibration techniques for classifiers seek to convert an uncalibrated confidence score into a calibrated one $\hat{Q} \in [0, 1]$ that corresponds to the precision for a specific level of confidence [45]. This calibration is a post-processing technique that requires a separate learning phase to establish a mapping $g: \hat{P} \to \hat{Q}$ along with $\hat{\theta}$, which denotes the calibration parameters, and can be considered a probabilistic model $\hat{\pi}(Y \mid \hat{P}, \hat{\theta})$. The calibration parameters $\hat{\theta}$ are typically estimated using maximum likelihood (ML) for all scaling methods while minimizing the NLL loss. The calibration parameter $\hat{\theta}$ can be calculated in this case by applying an uninformative Gaussian prior $\pi(\theta)$ with a wide variance over the parameters and inferring the posterior by
$$\pi(\theta \mid \hat{P}, Y) = \frac{\pi(Y \mid \hat{P}, \theta)\,\pi(\theta)}{\int_{\Theta} \pi(Y \mid \hat{P}, \theta)\,\pi(\theta)\, d\theta},$$

where $\pi(Y \mid \hat{P}, \theta)$ is the likelihood. We can map a new input $\hat{p}^{*}$ with the posterior predictive distribution defined by

$$f(y^{*} \mid \hat{p}^{*}, \hat{P}, Y) = \int_{\theta} \pi(y^{*} \mid \hat{p}^{*}, \theta)\,\pi(\theta \mid \hat{P}, Y)\, d\theta.$$
We modeled the epistemic uncertainty of the calibration mapping, as opposed to Bayesian neural networks (BNNs). The distribution $f_i$ acquired by calibration for a sample with index $i$ reflects the uncertainty of the classifier for a specific degree of confidence rather than the model uncertainty for a particular prediction. A distribution is obtained as a calibrated estimate. We utilized stochastic variational inference (SVI) as an approximation because the posterior cannot be calculated analytically. SVI uses a variational distribution (often a Gaussian distribution) whose structure is simple to evaluate [45]. We sample $T$ sets of weights and utilize them to construct a sample distribution consisting of $T$ estimates for a new single input $\hat{p}^{*}$, with the parameters of the variational distribution optimized to match the true posterior using the evidence lower bound (ELBO) loss.
We utilized the vector scaling method which is an extension of Platt scaling [46]. Temperature scaling is a popular method for calibrating deep learning models [44]. Temperature scaling is a parametric calibration approach optimized with respect to negative-log-likelihood (NLL) on validation data [44]. It learns a single parameter temperature ( T ) for all classes to produce the calibrated confidences:
$$\hat{q}_i = \max_{k}\, \sigma_{SM}\!\left(\frac{z_i}{T}\right)^{(k)},$$

where $k$ is the class label $(k = 1, \dots, K)$, $z_i$ is the logit vector, and $\sigma_{SM}(z_i)$ is the predicted confidence. As $T \to \infty$, $\hat{q}_i$ approaches the minimum, which indicates maximum uncertainty.
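A minimal sketch of fitting the temperature in Equation (18) on held-out logits by minimizing the NLL is shown below; the SciPy-based optimization and the bounds on T are illustrative assumptions, not the implementation used in this study.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Find the temperature T that minimizes the NLL on a held-out validation set,
    as in temperature scaling (Equation (18)). logits: (n, K); labels: (n,) class indices."""
    def nll(T):
        probs = softmax(logits, T)
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x

# Toy example: two validation samples with their logits and true labels
logits = np.array([[2.0, 0.5], [0.1, 1.5]])
labels = np.array([0, 1])
T = fit_temperature(logits, labels)
calibrated_conf = softmax(logits, T).max(axis=1)     # calibrated confidences
```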

Calibration Evaluation

A typical metric for determining the calibration error of neural networks is the expected calibration error (ECE) [44]. Let $B_m$ be the set of sample indices whose predicted confidence falls within the interval $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$, $m = 1, \dots, M$. The accuracy of $B_m$ is

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i),$$

where $\hat{y}_i$ and $y_i$ are the predicted and true labels of instance $i$, respectively. The average predicted confidence of bin $B_m$ can be formulated as

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}_i,$$

where $\hat{p}_i$ is the confidence of sample $i$.
The expected calibration error (ECE) takes the weighted average of the bins’ accuracy/confidence differences over the $n$ samples:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|.$$

The maximum calibration error (MCE) [44] is more relevant for high-risk applications, where the maximum accuracy/confidence difference, i.e., the worst-case scenario, matters more than the average:

$$\mathrm{MCE} = \max_{m \in \{1, \dots, M\}}\left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|$$
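The binned ECE and MCE of Equations (19)-(22) can be computed as in the following sketch, which assumes equal-width confidence bins:

```python
import numpy as np

def ece_mce(confidences, predictions, labels, M=10):
    """Expected and maximum calibration error using M equal-width confidence bins.
    confidences: (n,) max softmax scores; predictions, labels: (n,) class indices."""
    bins = np.linspace(0.0, 1.0, M + 1)
    n = len(labels)
    ece, mce = 0.0, 0.0
    for m in range(M):
        in_bin = (confidences > bins[m]) & (confidences <= bins[m + 1])
        if not in_bin.any():
            continue
        acc = np.mean(predictions[in_bin] == labels[in_bin])      # acc(B_m)
        conf = np.mean(confidences[in_bin])                        # conf(B_m)
        gap = abs(acc - conf)
        ece += (in_bin.sum() / n) * gap                            # weighted average term
        mce = max(mce, gap)                                        # worst-case bin
    return ece, mce
```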

Uncertainty Evaluation

Prediction interval coverage probability (PICP) is a metric used for Bayesian models to determine the quality of uncertainty estimates, that is, the likelihood that an instance’s true value falls within the predictive range. The mean prediction interval width (MPIW) is another metric used to measure the mean width of all prediction intervals to evaluate the sharpness of the uncertainty estimates.
A prediction interval around the mean estimate can be used to express epistemic uncertainty. We obtained the interval boundaries by selecting quantile-based constraints of the range for a specific confidence level τ , while assuming a normal distribution.
$$C_{\tau, i} = (l_i, u_i): \quad \mathbb{P}\left(l_i \le \mathrm{prec}(i) \le u_i\right) = 1 - \tau,$$

where $\mathrm{prec}(i)$ represents the observed precision of sample $i$ for a specific $\hat{p}_i$. The uncertainty is appropriately calibrated if the measured accuracies lie within a $100(1-\tau)\%$ prediction interval (PI) approximately $100(1-\tau)\%$ of the time [45], with $g$ being a calibration model that produces a PDF $f_i$ for the input with index $i$ out of $N$ samples. The PICP is calculated as follows:

$$\mathrm{PICP} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(\mathrm{prec}(i) \in C_{\tau, i}\right)$$

The definition of the PICP is usually applied when performing calibrated regression and when the true target value is known. However, the true precision of the classification is not easily obtainable. As a result, we apply a binning method to all available quantities $\hat{P}$ with $N$ samples to estimate the precision for each sample. For flawless uncertainty calibration, it is necessary that $\mathrm{PICP} \to (1-\tau)$ as $N \to \infty$. Using this concept, we can quantify the uncertainty by measuring the difference between the PICP and $(1-\tau)$. The PI width for a certain $C_{\tau, i}$ is averaged over all $N$ samples to obtain the MPIW, which is a complementary metric. With regard to the two metrics, we want the models to have larger PICP values while reducing the MPIW. By utilizing PICP and MPIW, we can assess both the quality of the calibration mapping and the epistemic uncertainty quantification. In our study, we propose using PICP as an ensemble voting weight.
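A minimal sketch of the two uncertainty metrics, assuming the interval bounds and observed precisions are already available as arrays:

```python
import numpy as np

def picp_mpiw(lower, upper, observed):
    """Prediction interval coverage probability and mean prediction interval width.
    lower, upper: (N,) interval bounds C_{tau,i}; observed: (N,) observed precision per sample."""
    inside = (observed >= lower) & (observed <= upper)
    picp = inside.mean()                    # fraction of samples covered by their interval
    mpiw = (upper - lower).mean()           # average interval width (sharpness)
    return picp, mpiw
```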

3.4. Proposed Framework to Explain CNNs

Even though CNNs achieve extremely high accuracy in many detection and classification problems, they are considered “black boxes”. Although we understand the CNN architecture and process and how features are extracted, it is still difficult for humans to know how the network decides its classification and on what features the decision is based. This is extremely important in critical areas where the reason for a decision matters, such as military and medical applications. Explainable AI (XAI) research has grown significantly in recent years owing to the rapid advancement in deep learning and the need for reliable machine decisions. Deep Taylor decomposition is a powerful technique for explaining CNN decisions by identifying the features in an input vector that have the greatest impact on a neural network’s output based on redistributed relevance [47]. In a recent Alzheimer’s disease detection study, deep Taylor decomposition produced more reasonable and accurate results than Grad-CAM [48]. Layer-wise relevance propagation (LRP) is the foundation for deep Taylor decomposition [49], which seeks to create a relevance metric $R$ over the input vector such that the network output can be represented as the sum of the values of $R$, $f(x) = \sum_{d=1}^{V} R_d$, where $f$ is the neural network forward-pass function. We can use the deep Taylor decomposition function in terms of its partial derivatives to approximate the relevance propagation function illustrated in Figure 3.
Consider a neuron taking input vector x i that outputs
$$x_j = \max\left(0, \sum_{i} x_i w_{ij} + b_j\right), \quad b_j \le 0, \; x_j \ge 0.$$
To decompose the neuron output in terms of the input variables, the output is rewritten as a first-order Taylor expansion:
$$x_j = \sum_{i}\left.\frac{\partial x_j}{\partial x_i}\right|_{(x_i)_i = (\tilde{x}_i)_i}(x_i - \tilde{x}_i),$$

where $(\tilde{x}_i)_i$ is the root point of the forward-propagation function. The root points of the forward-propagation function lie on the local decision boundary; therefore, the gradients along that boundary provide the greatest detail regarding how the function categorizes the input. The deep Taylor decomposition equation can be rewritten as follows:

$$R_i = \left.\frac{\partial x_j}{\partial x_i}\right|_{(x_i)_i = (\tilde{x}_i)_i}(x_i - \tilde{x}_i), \qquad \sum_{i} R_i = R_j.$$
By searching for the root point along the direction $v_{ij}$ in the input space, we obtain the explicit decomposition equation

$$R_i = \sum_{j} \frac{q_{ij}}{\sum_{i} q_{ij}}\, R_j, \qquad q_{ij} = v_{ij}\, w_{ij}.$$
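For intuition, the following sketch performs one relevance-redistribution step for a ReLU layer under the z+ rule, a common instantiation of deep Taylor decomposition for ReLU networks; the function name and shapes are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def lrp_zplus_step(x, W, R_out, eps=1e-9):
    """Redistribute relevance R_out of a ReLU layer's outputs back to its inputs
    using the z+ rule. x: (d_in,) layer input; W: (d_in, d_out) weights;
    R_out: (d_out,) relevance of the layer outputs."""
    Wp = np.maximum(W, 0.0)                 # only positive weights contribute (z+ rule)
    z = x @ Wp + eps                        # denominators per output neuron
    s = R_out / z                           # relevance per unit of activation
    R_in = x * (Wp @ s)                     # redistribute back to the inputs
    return R_in                             # conservation: R_in.sum() is approximately R_out.sum()
```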
Another contemporary XAI method based on the popular local interpretable model-agnostic explanations (LIME) [50] is Bayesian local interpretable model-agnostic explanations (BayLIME) [51]. This method can be used with any machine-learning model because it is model-agnostic. By changing the input of the data samples and observing how the predictions vary, the technique aims to understand the model. The output of LIME is a list of explanations indicating the contribution of each characteristic to the prediction of a data sample. This offers local interpretability and makes it possible to identify the feature changes that have the greatest influence on the prediction. Let $X = \{x_1, \dots, x_n\}$ be the input set with $n$ samples, where $x_i$ is instance $i$ with $m$ features (i.e., $X = (x_{ij}) \in \mathbb{R}^{n \times m}$), and let $Y = [y_1, \dots, y_n]^{T}$ be the corresponding $n$ target values. Accordingly, the maximum likelihood estimate for the weighted samples $(X, Y)$ under the linear regression model is $\beta = (X^{T} W X)^{-1} X^{T} W Y$, where $W = \mathrm{diag}(w_1, \dots, w_n)$ is the diagonal matrix determined by a kernel function based on the proximity of the new samples to the original instance. BayLIME proposes embedding prior knowledge as follows:
$$\mu_n = \left(\lambda I_m + \alpha X^{T} W X\right)^{-1} \lambda I_m \mu_0 + \left(\lambda I_m + \alpha X^{T} W X\right)^{-1} \alpha X^{T} W X\, \beta,$$

where $\mu_0$ is the prior mean vector to be estimated, $\lambda I_m$ is the “pseudo-count” of the prior sample size based on which we form our prior estimates of $\mu_0$ ($I_m$ is the $m \times m$ identity matrix), and $\alpha X^{T} W X$ is the “accurate-actual-count” of the observation data size, i.e., the true inference of the $n$ perturbed data $X^{T} W X$ scaled by the precision $\alpha$. Let $m = 1$, which symbolizes a single feature, with a simpler kernel function that yields a constant weight $w_c$; then $\mu_n$ becomes

$$\mu_n = \frac{\lambda}{\lambda + \alpha w_c n}\,\mu_0 + \frac{\alpha w_c n}{\lambda + \alpha w_c n}\,\beta.$$
Using $\lambda$ data points prior to our new experiment, we formed our prior estimate of $\mu_0$. We collected $n$ samples in the experiments and obtained the MLE estimate $\beta$ by considering the precision $\alpha$ and weights $w_c$ of the new samples. Depending on the effective data sizes employed, that is, $\lambda$ and $\alpha w_c n$, we combine $\mu_0$ and $\beta$. Finally, all of the effective samples employed capture the confidence in our new posterior estimate, that is, $\lambda + \alpha w_c n$ (the posterior precision).
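A minimal sketch of the BayLIME posterior mean in Equation (28), assuming the perturbed samples, proximity weights, and prior mean are already available; this is an illustrative implementation, not the authors' code.

```python
import numpy as np

def bayes_lime_posterior_mean(X, Y, W, mu_0, lam, alpha):
    """Posterior mean of Equation (28): combine a prior mean mu_0 (pseudo-count lam) with
    the weighted least-squares estimate from n perturbed samples (precision alpha).
    X: (n, m) perturbed samples; Y: (n,) model outputs; W: (n,) proximity weights; mu_0: (m,)."""
    m = X.shape[1]
    Wd = np.diag(W)
    XtWX = X.T @ Wd @ X
    beta = np.linalg.solve(XtWX, X.T @ Wd @ Y)        # weighted MLE of the local surrogate
    A = lam * np.eye(m) + alpha * XtWX                # combined (posterior) precision matrix
    mu_n = np.linalg.solve(A, lam * mu_0 + alpha * (XtWX @ beta))
    return mu_n
```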
In our research, we also use another effective method that uses perturbation masks and Sobol indices in the context of black-box models to determine which parts of a sample have an impact on the output predictions [52]. Let $(X_1, \dots, X_d)$ be independent variables, and assume that $f \in L^2(\mathcal{X}, \mathbb{P})$. Let $\mathcal{U} = \{1, \dots, d\}$, let $u$ be a subset of $\mathcal{U}$ with complement $\sim u$, and let $\mathbb{E}(\cdot)$ denote the expectation over the perturbation space.
Hoeffding decomposition enables the expression of the function f as sums of increasing dimensions, with the symbol f u signifying the partial contribution of the variables X u = ( X i ) i u to the score f ( X ) :
$$f(X) = \sum_{u \subseteq \mathcal{U}} f_u(X_u), \qquad \mathrm{Var}(f(X)) = \sum_{u \subseteq \mathcal{U}} \mathrm{Var}\left(f_u(X_u)\right),$$

subject to the constraint

$$\forall (u, v) \subseteq \mathcal{U}^2 \; \text{s.t.} \; u \ne v, \quad \mathbb{E}\left(f_u(X_u)\, f_v(X_v)\right) = 0.$$
The sensitivity index $S_u$, which quantifies the contribution of the variable set $X_u$ to the model response $f(X)$ in terms of output fluctuation, is given by the Sobol indices:

$$S_u = \frac{\mathrm{Var}\left(f_u(X_u)\right)}{\mathrm{Var}(f(X))} = \frac{\mathrm{Var}\left(\mathbb{E}(f(X) \mid X_u)\right) - \sum_{v \subset u} \mathrm{Var}\left(\mathbb{E}(f(X) \mid X_v)\right)}{\mathrm{Var}(f(X))}$$
With regard to the model decision, Sobol indices quantify the significance of any subset of features, where $\sum_{u \subseteq \mathcal{U}} S_u = 1$.
The total Sobol index $S_{T_i}$, which quantifies the contribution of variable $X_i$ to the variance of the model output together with its interactions of any order with any other input variables, is as follows:

$$S_{T_i} = \sum_{\substack{u \subseteq \mathcal{U} \\ i \in u}} S_u = 1 - \frac{\mathrm{Var}_{X_{\sim i}}\left(\mathbb{E}_{X_i}\left(f(X) \mid X_{\sim i}\right)\right)}{\mathrm{Var}(f(X))} = \frac{\mathbb{E}_{X_{\sim i}}\left(\mathrm{Var}_{X_i}\left(f(X) \mid X_{\sim i}\right)\right)}{\mathrm{Var}(f(X))}$$
A more efficient estimator uses the Jansen estimator with a quasi-Monte Carlo (QMC) sampling strategy. Let $A_j^{i}$ and $B_j^{i}$ be the elements of two matrices $A$ and $B$, where $i = 1, \dots, d$ indexes the variables investigated and $j = 1, \dots, N$ indexes the instances, with both matrices of the same size as the perturbed inputs. A new matrix $C^{(i)}$ is formed that is identical to $A$, except that the column corresponding to variable $i$ is replaced by the corresponding column of $B$. We define $\bar{f} = \frac{1}{2N}\sum_{j=1}^{N} f(A_j)$ and the empirical variance $\hat{V} = \frac{1}{N-1}\sum_{j=1}^{N}\left(f(A_j) - \bar{f}\right)^2$. The empirical estimators for the first-order index $(\hat{S}_i)$ and the total-order index $(\hat{S}_{T_i})$ can be formulated as
$$\hat{S}_i = \frac{\hat{V} - \frac{1}{2N}\sum_{j=1}^{N}\left(f(B_j) - f(C_j^{(i)})\right)^2}{\hat{V}}, \qquad \hat{S}_{T_i} = \frac{\frac{1}{2N}\sum_{j=1}^{N}\left(f(A_j) - f(C_j^{(i)})\right)^2}{\hat{V}}.$$
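A minimal sketch of the Jansen estimators in Equation (34), assuming a black-box scoring function f and two independently drawn sample matrices A and B (e.g., from a quasi-Monte Carlo sequence); the function and variable names are illustrative.

```python
import numpy as np

def jansen_sobol_indices(f, A, B):
    """First-order and total Sobol indices via the Jansen estimators.
    f: callable mapping an (N, d) matrix to (N,) outputs; A, B: (N, d) sample matrices."""
    N, d = A.shape
    fA, fB = f(A), f(B)
    var = np.var(fA, ddof=1)                          # empirical variance V_hat
    S, ST = np.zeros(d), np.zeros(d)
    for i in range(d):
        C = A.copy()
        C[:, i] = B[:, i]                             # replace column i of A with that of B
        fC = f(C)
        S[i] = (var - np.mean((fB - fC) ** 2) / 2.0) / var      # first-order index
        ST[i] = np.mean((fA - fC) ** 2) / 2.0 / var             # total-order index
    return S, ST
```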

3.5. Proposed Approach to Explain Facial Beauty

XAI techniques are excellent tools for inferring and visualizing the learnings of neurons; however, examining every output manually is a tedious process, and it is an almost impossible task in the case of enormous datasets. Furthermore, examining only some samples provides less information about the learning of the machine, and in some cases it can be extremely dangerous, as in medical diagnosis and military equipment. In general, to have a reliable machine system, we need to know what the neurons learned and how they made their decision for every input before a real application deployment. To overcome this challenge and to visualize and understand what the machine learned about beauty, we propose the most voted feature (MVF) algorithm, Algorithm 2: a novel approach that enables the explanation of every decision a machine makes and the determination of a general learned pattern, taking the dataset as the input and outputting the learned features that the machine decisions depend on over the entire dataset. The proposed MVF algorithm, acting as an independent component, can be applied as part of any CNN and XAI method, and it has the potential to be deployed to different machine learning tasks.
The MVF algorithm starts the process by reading an image as input data in RGB channels $I: u \to [0, 1]^3$ and obtaining the coordinate points of the area of interest $f_n = [(x_1, y_1), \dots, (x_n, y_n)]$ using object and landmark detection techniques. Subsequently, these coordinate points are encoded and grouped into all possible features of interest to examine, $F_n = [f_1, \dots, f_n]$. The possible beauty features are presented in Table 2. The next process is based on the output of the XAI technique. For activation visualization-based methods such as Sobol and Deep Taylor, we predict the class and subsequently obtain the learned features using a pre-trained model. The output is normalized and converted into a new grayscale image $I_N: u \to [0, 1]$, to which Gaussian convolution is applied (Equation (35)). Gaussian convolution is used to average the image pixels and obtain the coordinate of the maximum pixel value, which represents the most learned feature coordinate $M_p$ (Equation (36)).
$$G[I_N]_p := \sum_{q \in u} G_{\sigma}\left(\lVert p - q \rVert\right)(I_N)_q, \quad p = (p_x, p_y), \; q = (q_x, q_y), \; \forall p \in u,$$

$$M_p = \underset{p}{\arg\max}\,(I_N)_p$$
For segmentation visualization-based methods such as BayLIME, we first obtain the contours [53] as objects N z ( z = 1 ,   2 , , n ) , and subsequently, calculate the raw moments. For a 2D continuous function f ( x , y ) , the moment of order ( p + q ) is defined as
$$M_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^{p} y^{q} f(x, y)\, dx\, dy;$$

adapting this to a scalar (grayscale) image with pixel intensities $I(x, y)$, the raw image moments $M^{ij}$ for a segment $N_z$ are calculated by

$$M_{N_z}^{ij} = \sum_{x}\sum_{y} x^{i} y^{j} I(x, y)$$

Once we calculate the raw moments for each object, we can obtain the maximum contour area $M_{max}^{ij} = \max\{M_{N_1}^{ij}, \dots, M_{N_z}^{ij}\}$ and derive different image properties, such as the centroid coordinate $M_p = \{\bar{x}, \bar{y}\} = \left\{\frac{M_{max}^{10}}{M_{max}^{00}}, \frac{M_{max}^{01}}{M_{max}^{00}}\right\}$.
After obtaining the most learned feature coordinate $M_p$, we calculate all distances $D = [d(M_p, F_1), \dots, d(M_p, F_n)]$ between the features of interest $F_n$ and the most learned feature $M_p$ using the $\ell_2$ norm (Equation (39)). All calculations are based on the Cartesian pixel coordinate system, where the origin (0, 0) is in the upper-left corner. The final step is to obtain the most dependent feature (MVF), which corresponds to the minimum distance (Equation (40)).

$$d(M_p, F_i) = \lVert M_p - F_i \rVert_2 = \sqrt{(M_{p,x} - F_{i,x})^2 + (M_{p,y} - F_{i,y})^2},$$

$$MVF = \min(D)$$

Beauty Feature Voting

After obtaining the learned features of each voter, we propose weighted feature voting. Let $MLF = \{l_1^{k}, \dots, l_m^{K}\}$ be the set of features output by each voter $f_k \in V_{winner}$ for a given sample $x_i$. Equation (4) becomes

$$E_k(x_i) = l_m^{k*} \quad \text{if } l^{k*} \in MLF \;\wedge\; T_E(x_i \to l^{k*}) = \max_{1 \le j \le m} T_E(x_i \to l_m^{k}),$$

$$T_E(x_i \to l_m^{k}) = \sum_{k=1}^{K} w_k(x_i \to l_m^{k}),$$

where $w_k(x_i \to l_m^{k})$ represents the weight of the voter $f_k \in V_{winner}$ voting for feature $l$ explained by explainer $m$. The framework of the proposed method is shown in Figure 4 and Algorithm 2.
Algorithm 2: Most voted feature (MVF) algorithm
Input: $I: u \to [0, 1]^3$; $V_{winner}$; $w_k^{*}$; landmark detector
Output: the most voted feature (MVF)
initialization;
for $k = 1, \dots, K$ do
    for $j = 1, \dots, m$ explainers do
        1: Read image $I: u \to [0, 1]^3$.
        2: Get object feature points with the landmark detector, $f_n = [(x_1, y_1), \dots, (x_n, y_n)]$;
        3: Group and encode feature points into all possible/interest features $F_n = [f_1, \dots, f_n]$;
        4: Get class using voters
        If activation-based method do
            5: Get learned features (activations) by $V_{winner}$
            6: Normalize the output and convert it to a new grayscale image $I_N: u \to [0, 1]$;
            7: Apply Gaussian convolution $G[I_N]_p := \sum_{q \in u} G_{\sigma}(\lVert p - q \rVert)(I_N)_q$;
            8: Get the most learned feature coordinate $M_p = \arg\max_p (I_N)_p$;
            9: Get $D = [d(M_p, F_1), \dots, d(M_p, F_n)]$, $d(M_p, F_i) = \lVert M_p - F_i \rVert_2$
            10: Get the most learned feature $l_m^{K} = \min(D)$
        Else if segmentation-based method do
            5: Get learned features (segmentation mask) by $V_{winner}$.
            6: Get contour objects $N_z$ $(z = 1, 2, \dots, n)$
            7: Get raw image moments for each contour object $M_{N_z}^{ij} = \sum_x \sum_y x^i y^j I(x, y)$
            8: Get maximum contour area $M_{max}^{ij} = \max\{M_{N_1}^{ij}, \dots, M_{N_z}^{ij}\}$ and centroid coordinate $M_p = \{\bar{x}, \bar{y}\} = \{M_{max}^{10}/M_{max}^{00},\, M_{max}^{01}/M_{max}^{00}\}$
            9: Get $D = [d(M_p, F_1), \dots, d(M_p, F_n)]$, $d(M_p, F_i) = \lVert M_p - F_i \rVert_2$
            10: Get the most learned feature $l_m^{K} = \min(D)$
    11: Get the most learned features $MLF = \{l_1^{k}, \dots, l_m^{K}\}$
    12: Apply weighted feature voting and get the most voted feature (MVF) (Equation (4)).
end
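To illustrate the activation-based branch of the MVF idea, the following sketch smooths an explanation heat map, locates its peak, and returns the nearest landmark feature; the landmark names, coordinates, and toy relevance map are purely illustrative assumptions and not part of the actual pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def most_voted_feature(explanation_map, feature_points, sigma=5.0):
    """Core of the activation-based MVF branch: smooth an explanation heat map,
    find its peak, and return the landmark feature closest to that peak.
    explanation_map: 2D array from an XAI method (e.g. Deep Taylor relevance);
    feature_points: dict mapping feature names to (x, y) pixel coordinates."""
    m = explanation_map.astype(float)
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)            # normalize to [0, 1]
    m = gaussian_filter(m, sigma=sigma)                         # Gaussian convolution (Eq. (35))
    peak_y, peak_x = np.unravel_index(np.argmax(m), m.shape)    # most learned coordinate M_p
    # Distance of M_p to each candidate facial feature, l2 norm (Eq. (39))
    dists = {name: np.hypot(peak_x - x, peak_y - y) for name, (x, y) in feature_points.items()}
    return min(dists, key=dists.get)                            # minimum distance (Eq. (40))

# Example with hypothetical landmark coordinates on a 224 x 224 face image
features = {"right_eye": (60, 80), "left_eye": (140, 80), "nose": (100, 120), "lips": (100, 160)}
heat = np.zeros((224, 224)); heat[75:85, 55:65] = 1.0            # toy relevance blob near the right eye
print(most_voted_feature(heat, features))                        # -> "right_eye"
```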

3.6. Golden Ratio

Two line segments are in the golden ratio if the ratio of their sum to the longer segment equals the ratio of the longer segment to the shorter one [54], as illustrated in Figure 5. To investigate the golden ratio effect and its relation to facial beauty, we propose calculating 23 golden facial geometric features in addition to the main facial features, including the eyes, nose, and mouth, as illustrated in Figure 6 and Table 3. Unlike previous studies that examined facial symmetry, the 23 geometric features cover almost all possible facial features, which provides sufficient information to test feature symmetry based on the golden ratio and a suitable assessment of the overall facial symmetry.
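For reference, two segments $a > b > 0$ are in the golden ratio when

$$\frac{a + b}{a} = \frac{a}{b} = \varphi, \qquad \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618.$$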

3.7. Evaluation Metrics

During classification training, the evaluation metric is crucial for obtaining the best classifier. Therefore, choosing an appropriate assessment measure is crucial for differentiating and achieving the best classifier [55]. To statistically evaluate our proposed methods, we used multiple metrics to evaluate both performance and correlation. To measure the correlation between features and beauty, we applied the point-biserial correlation coefficient (rpb) [56], which can be considered a special case of the Pearson correlation coefficient (PCC) for a dichotomous variable. The dichotomous variable $Y$ is assumed by rpb to have two values, 0 and 1, which in our study correspond to the two binary classes of Beauty and Different Beauty. The point-biserial correlation coefficient is calculated as follows: the data set is divided into two groups, the Beauty group, which receives the value “1” on $Y$, and the Different Beauty group, which receives the value “0” on $Y$.
$$r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}},$$

$$s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2},$$
where M 0 represents the mean value of the continuous variable X for the total sample points in Group 2, M 1 is the mean value of the continuous variable X for the total sample points in Group 1, and s n is the standard deviation. The number of data samples in Group 1 is n 1 , those in Group 2 are n 0 , and the overall sample size is n . The point-biserial correlation coefficient, which ranges from −1 to +1, assesses the degree of relationship between two variables. A value of −1 denotes a perfect negative association, a value of +1 denotes a perfect positive association, and a value of 0 indicates no association.
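In practice, the point-biserial correlation can be computed directly, for example with SciPy; the toy data below are illustrative only.

```python
import numpy as np
from scipy.stats import pointbiserialr

# y: binary class labels (1 = Beauty, 0 = Different Beauty); x: a continuous facial
# attribute (e.g. one of the golden-ratio features) for the same samples.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
x = 1.6 + 0.05 * y + rng.normal(0, 0.1, size=500)   # toy attribute slightly shifted for class 1

r_pb, p_value = pointbiserialr(y, x)                 # point-biserial correlation, Eq. (44)
print(f"r_pb = {r_pb:.3f}, p = {p_value:.3g}")
```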
For performance evaluation, we applied the Matthews correlation coefficient (MCC) [57] to evaluate how well binary (two-class) classifications were performed. The MCC returns a number between −1 and +1 and is essentially a correlation coefficient between the observed and predicted binary classifications. A coefficient of +1 denotes a perfect prediction, 0 denotes a prediction that is no better than chance, and −1 denotes a complete discrepancy between prediction and observation. In addition, we considered the following evaluation measures to statistically assess the efficacy of our suggested method: area under the curve (AUC), a popular ranking metric; accuracy, which measures the ratio of correct predictions over the total number of instances evaluated; precision, which measures the proportion of predicted positive patterns that are correctly classified; recall, which measures the proportion of positive patterns that are correctly predicted; and F1-score, which represents the harmonic mean of recall and precision [55]. Let $TP$, $TN$ represent true positives and true negatives, respectively, and $FP$, $FN$ represent false positives and false negatives, respectively. Accordingly, the performance metrics are calculated as follows:
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

$$ACC = \frac{TP + TN}{P + N},$$

$$Precision = \frac{TP}{TP + FP},$$

$$Recall = \frac{TP}{TP + FN},$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN}.$$
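These metrics are available in standard libraries; the following sketch computes them with scikit-learn on toy predictions (the data are illustrative only).

```python
import numpy as np
from sklearn.metrics import (matthews_corrcoef, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# y_true: ground-truth labels (1 = Beauty, 0 = Different Beauty);
# y_prob: predicted probability of the Beauty class; y_pred: thresholded predictions.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=300)
y_prob = np.clip(y_true * 0.6 + rng.uniform(0, 0.5, size=300), 0, 1)   # toy probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("MCC      ", matthews_corrcoef(y_true, y_pred))
print("Accuracy ", accuracy_score(y_true, y_pred))
print("Precision", precision_score(y_true, y_pred))
print("Recall   ", recall_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_prob))
```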

4. Empirical Results

4.1. Learning Beauty

Quantitative evaluations of the proposed approach were performed for both beauty classification and explanation. In our study, we deployed ResNet50 [58], SENet [59], and VGG16 [60] as base models. VGG16 is a convolutional neural network (CNN) architecture that was among the top performers in the 2014 ILSVRC (ImageNet) competition and is regarded as one of the best vision model architectures created to date. The most distinctive feature of VGG16 is that it emphasizes convolution layers of 3 × 3 filters with a stride of 1 and consistently uses the same padding and MaxPool layers of 2 × 2 filters with a stride of 2. Throughout the entire architecture, the convolution and max pooling layers are arranged in the same manner. It ends with two fully connected (FC) layers and a softmax output layer. The 16 in VGG16 indicates that it has 16 layers with weights. This network incorporates 138.4 million parameters, rendering it sizable. ResNet50 comprises 48 convolutional layers, one MaxPool layer, and one average pooling layer. The framework ResNets introduced enabled the training of extremely deep neural networks, implying that the network may include hundreds or thousands of layers and still function suitably. The squeeze-and-excitation network (SENet), which adds a channel-wise transform to existing deep neural network (DNN) building blocks such as the residual unit, has achieved outstanding image classification results. SENet provides CNNs with a novel channel-wise attention mechanism to enhance channel interdependencies. The network includes a parameter that adjusts each channel weight, such that it is more responsive to important features and less sensitive to unimportant ones.
To accomplish the model formulation and evaluation processes, we used three-fold cross-validation. For each fold cross-validation, three snapshot models were established, which resulted in nine models for the three folds, and the data were split into three equivalent subsets that were mutually exclusive for the three-fold cross-validation. Two of the three subsets were used as the training set in each iteration, and the third subset was used as the validation set. The final evaluation of each model depended on the average ensemble performance of nine snapshots created for each model. Before the final weighted voting ensemble stage, each base model was calibrated to obtain the base model performance, calibration errors, model confidence, and uncertainty. All experiments were repeated 10 times in order to reduce statistical variability. The calibration output and performance of our experiments are reported in Table 4, Table 5, Table 6 and Table 7.

4.2. Most Dependent Features

In the previous section, we enabled the machine to understand beauty; here, we expect the machine to teach us what leads to beauty decisions. We must study the patterns and understand why and how the machine made its decisions and on what information they were based. Using our proposed Algorithm 2, we are able to detect which features the model mostly depends on for each face in the entire dataset for both the Beauty and Different Beauty classes. Table 8 shows the Top-5 facial attribute correlations with beauty decisions. Figure 7 shows the output of Algorithm 2: the most dependent feature (MVF) for both classes.
Figure 8 shows the face’s side correlation with the machine’s sentimental decisions based on the side of the face the machine paid more attention to while deciding and categorizing into Beauty and Different Beauty classes.

4.3. Golden Beauty

In this section, we investigate the relationship between the golden ratio and beauty by calculating the golden ratios of the proposed 23 golden facial geometric features for every face in the datasets in both classes. The results of the golden ratio for every feature are shown in Figure 9, and the total mean golden ratio of both classes for each participant is shown in Figure 10.
Using the MVF results from the previous section, we found that beauty correlates more with the right side of the face and, specifically, with the right eye, and that the golden ratio correlates more with Beauty-class decisions. However, it is not understood what causes the right eye to be more influential in beauty decisions than the left eye. To investigate this further, we calculated the average golden ratio of both the right and left eyes over the datasets of all participants. The results of the average eye representation, along with its golden ratio, are shown in Table 9.

5. Discussion

The experimental results presented in the previous section show a unique and interesting pattern of beauty. We noticed that most attention was paid to the eyes for both classes of Beauty and Different Beauty. This is extremely interesting as a recent social psychology study concluded that eyes are not only a “window to the soul, but also a benchmark of beauty” [61]. However, in our study, we were able not only to detect the most important feature in beauty decisions, but also to identify which sides of the eyes the beauty decision depends on. The decisions pertaining to the Beauty class were mostly based on the right eye area, while the decisions pertaining to the Different Beauty class were based on the left eye area beside the nose and lip areas. Another interesting finding is the side of the face on which beauty decisions mostly depend. The decisions pertaining to the Beauty class correlated with the right side of the face, while those regarding the Different Beauty class correlated more with the left side of the face. Interestingly, the fMRI-based study mentioned earlier in the Introduction [12] concluded that the cross-participant classification of the activity in the brain was higher on the right side of the fusiform face area (FFA). In another recent fMRI-based study, researchers found that the left dorsal lateral prefrontal cortex (dlPFC) strongly correlates with the Different Beauty class [62]. In the second part of the experiment, we investigated facial symmetry based on the golden ratio. The results show that a specific feature-based golden ratio is not necessarily an indicator of beauty. However, the overall golden ratio confirms a strong relationship between beauty and the golden ratio. In addition, the most interesting finding is that the right eye features a higher golden ratio than the left eye in both classes. Unlike previous approaches that classify beauty based on the golden ratio, we claim that our approach is a novel empirical approach that proves the correlation between beauty and the golden ratio without any previous assumption of beauty and is based on subjective pre-classified faces. These results could be affected by dataset quality, participant bias, and other factors such as education, environment, and personality characteristics of left-handed versus right-handed persons. A single source of face images of the same quality and poses should be considered in future studies. In addition, having different participants of different ethnicities and including additional metadata regarding participants in the study can produce more informative results. Reproducible results are available at https://github.com/waleed-aldhahi/LSLS/ (accessed on 20 December 2022).

6. Conclusions and Future Work

Existing psychological and biological studies explain the common features that contribute to beauty perception but fail to explain and quantify the reasons behind a person’s judgement of a specific face to be beautiful and why two persons could argue about whether full lips are beautiful. Despite considerable progress in estimating and predicting facial beauty, most studies have focused on predicting beauty without an adequate understanding of beauty. In this research, we propose a novel approach based on deep learning to address the question of beauty and how people process and perceive beauty. We evaluated state-of-the-art related research in psychology, neuroscience, biology, and engineering. To explain how the brain determines beauty, we proposed a subjective approach to teaching the machine to learn beauty using uncertainty-based ensemble machine voting. To obtain a quantifiable measure of beauty, we propose a novel algorithm and framework that addresses the limitations of the current explainable AI (XAI) and provides us the ability to understand every decision the machine makes and obtain the general learned patterns. In addition, we propose a novel approach to prove the relationship between beauty and facial symmetry based on the golden ratio. In future work, we will develop an additional dataset with male faces, conduct a larger experiment involving more participants of both genders and diverse ethnicities and ages, and apply our approach to obtain a deeper and general conclusion of beauty decision patterns. In deep learning, the network weights are updated and accumulated by each input; analogically, accumulated human experiences render beauty decisions highly subjective and distinct, even for the same person. Thus, there might be no ultimate truth or absolute correctness of the essence of beauty.

Author Contributions

Conceptualization, W.A.; methodology, W.A. and T.A.; software, W.A.; validation, W.A., T.A., and S.S.; formal analysis, W.A.; investigation, W.A. and T.A.; resources, W.A.; data curation, W.A.; writing—original draft preparation, W.A.; writing—review and editing, W.A. and T.A.; visualization, W.A.; supervision, S.S.; project administration, W.A. and S.S.; funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to acknowledge the Saudi Arabian Ministry of Higher Education and Korea University. The authors are thankful to the editor and anonymous reviewers for their time and critical reading and are specifically grateful to all the participants in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Adamson, P.A.; Doud Galli, S.K. Modern Concepts of Beauty. Plast. Surg. Nurs. 2009, 29, 5–9. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, S.; Fan, Y.-Y.; Samal, A.; Guo, Z. Advances in Computational Facial Attractiveness Methods. Multimed. Tools Appl. 2016, 75, 16633–16663. [Google Scholar] [CrossRef]
  3. Liu, X.; Li, T.; Peng, H.; Ouyang, I.C.; Kim, T.; Wang, R. Understanding Beauty via Deep Facial Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  4. Ortlieb, S.A.; Kügel, W.A.; Carbon, C.-C. Fechner (1866): The Aesthetic Association Principle—A Commented Translation. i-Perception 2020, 11. [Google Scholar] [CrossRef] [PubMed]
  5. Pelowski, M.; Markey, P.S.; Forster, M.; Gerger, G.; Leder, H. Move Me, Astonish Me… Delight My Eyes and Brain: The Vienna Integrated Model of Top-down and Bottom-up Processes in Art Perception (VIMAP) and Corresponding Affective, Evaluative, and Neurophysiological Correlates. Phys. Life Rev. 2017, 21, 80–125. [Google Scholar] [CrossRef]
  6. Leder, H.; Nadal, M. Ten Years of a Model of Aesthetic Appreciation and Aesthetic Judgments: The Aesthetic Episode—Developments and Challenges in Empirical Aesthetics. Br. J. Psychol. 2014, 105, 443–464. [Google Scholar] [CrossRef]
  7. Baker, S.B.; Patel, P.K.; Weinzweig, J. Aesthetic Surgery of the Facial Skeleton; Elsevier: London, UK, 2021. [Google Scholar]
  8. Little, A.C.; Jones, B.C.; DeBruine, L.M. Facial Attractiveness: Evolutionary Based Research. Philos. Trans. R. Soc. B Biol. Sci. 2011, 366, 1638–1659. [Google Scholar] [CrossRef] [Green Version]
  9. Little, A.C.; Jones, B.C. Attraction Independent of Detection Suggests Special Mechanisms for Symmetry Preferences in Human Face Perception. Proc. Biol. Sci. 2006, 273, 3093–3099. [Google Scholar] [CrossRef] [Green Version]
  10. Buggio, L.; Vercellini, P.; Somigliana, E.; Viganò, P.; Frattaruolo, M.P.; Fedele, L. “You Are so Beautiful”: Behind Women’s Attractiveness towards the Biology of Reproduction: A Narrative Review. Gynaecol. Endocrinol. 2012, 28, 753–757. [Google Scholar] [CrossRef]
  11. Zeki, S. Notes towards a (Neurobiological) Definition of Beauty. Gestalt Theory 2019, 41, 107–112. [Google Scholar] [CrossRef] [Green Version]
  12. Yang, T.; Formuli, A.; Paolini, M.; Zeki, S. The Neural Determinants of Beauty. bioRxiv 2021, 4999. [Google Scholar] [CrossRef]
  13. Vegter, F.; Hage, J.J. Clinical Anthropometry and Canons of the Face in Historical Perspective. Plast. Reconstr. Surg. 2000, 106, 1090–1096. [Google Scholar] [CrossRef] [PubMed]
  14. Bashour, M. History and Current Concepts in the Analysis of Facial Attractiveness. Plast. Reconstr. Surg. 2006, 118, 741–756. [Google Scholar] [CrossRef] [PubMed]
  15. Marquardt, S.R. Dr. Stephen R. Marquardt on the Golden Decagon and Human Facial Beauty. Interview by Dr. Gottlieb. J. Clin. Orthod. 2002, 36, 339–347. [Google Scholar]
  16. Iosa, M.; Morone, G.; Paolucci, S. Phi in Physiology, Psychology and Biomechanics: The Golden Ratio between Myth and Science. Biosystems 2018, 165, 31–39. [Google Scholar] [CrossRef] [PubMed]
  17. Petekkaya, E.; Ulusoy, M.; Bagheri, H.; Şanlı, Ş.; Ceylan, M.S.; Dokur, M.; Karadağ, M. Evaluation of the Golden Ratio in Nasal Conchae for Surgical Anatomy. Ear Nose Throat J. 2021, 100, NP57–NP61. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Bragatto, F.P.; Chicarelli, M.; Kasuya, A.V.; Takeshita, W.M.; Iwaki-Filho, L.; Iwaki, L.C. Golden Proportion Analysis of Dental–Skeletal Patterns of Class II and III Patients Pre and Post Orthodontic-Orthognathic Treatment. J. Contemp. Dent. Pract. 2016, 17, 728–733. [Google Scholar] [CrossRef]
  19. Kawakami, S.; Tsukada, S.; Hayashi, H.; Takada, Y.; Koubayashi, S. Golden Proportion for Maxillofacial Surgery in Orientals. Ann. Plast. Surg. 1989, 23, 95. [Google Scholar] [CrossRef]
  20. Stein, R.; Holds, J.B.; Wulc, A.E.; Swift, A.; Hartstein, M.E. Phi, Fat, and the Mathematics of a Beautiful Midface. Ophthal. Plast. Reconstr. Surg. 2018, 34, 491–496. [Google Scholar] [CrossRef]
  21. Jefferson, Y. Facial Beauty—Establishing a Universal Standard. Int. J. Orthod. Milwaukee 2004, 15, 9–22. [Google Scholar]
  22. Holland, E. Marquardt’s Phi Mask: Pitfalls of Relying on Fashion Models and the Golden Ratio to Describe a Beautiful Face. Aesthetic Plast. Surg. 2008, 32, 200–208. [Google Scholar] [CrossRef]
  23. Krauss, P.; Maier, A. Will We Ever Have Conscious Machines? Front. Comput. Neurosci. 2020, 14. [Google Scholar] [CrossRef] [PubMed]
  24. Kuzovkin, I.; Vicente, R.; Petton, M.; Lachaux, J.-P.; Baciu, M.; Kahane, P.; Rheims, S.; Vidal, J.R.; Aru, J. Activations of Deep Convolutional Neural Networks Are Aligned with Gamma Band Activity of Human Visual Cortex. Commun. Biol. 2018, 1, 107. [Google Scholar] [CrossRef] [PubMed]
  25. Bougourzi, F.; Dornaika, F.; Taleb-Ahmed, A. Deep Learning Based Face Beauty Prediction via Dynamic Robust Losses and Ensemble Regression. Knowl.-Based Syst. 2022, 242, 108246. [Google Scholar] [CrossRef]
  26. Savage, N. How AI and Neuroscience Drive Each Other Forwards. Nature 2019, 571, S15–S17. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  27. Sano, T. Visualization of Facial Attractiveness Factors Using Gradient-weighted Class Activation Mapping to Understand the Connection between Facial Features and Perception of Attractiveness. Int. J. Affect. Eng. 2022, 21, 111–116. [Google Scholar] [CrossRef]
  28. Zhang, L.; Zhang, D.; Sun, M.-M.; Chen, F.-M. Facial Beauty Analysis Based on Geometric Feature: Toward Attractiveness Assessment Application. Expert Syst. Appl. 2017, 82, 252–265. [Google Scholar] [CrossRef]
  29. Gunes, H.; Piccardi, M. Assessing Facial Beauty through Proportion Analysis by Image Processing and Supervised Learning. Int. J. Hum. Comput. Stud. 2006, 64, 1184–1199. [Google Scholar] [CrossRef] [Green Version]
  30. Chen, F.; Zhang, D. A Benchmark for Geometric Facial Beauty Study. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; pp. 21–32. [Google Scholar]
  31. Fan, J.; Chau, K.P.; Wan, X.; Zhai, L.; Lau, E. Prediction of Facial Attractiveness from Facial Proportions. Pattern Recognit. 2012, 45, 2326–2334. [Google Scholar] [CrossRef]
  32. Xu, J.; Jin, L.; Liang, L.; Feng, Z.; Xie, D. A New Humanlike Facial Attractiveness Predictor with Cascaded Fine-Tuning Deep Learning Model. arXiv 2015, arXiv:1511.02465. [Google Scholar]
  33. Zhang, D.; Chen, F.; Xu, Y. Computer Models for Facial Beauty Analysis; Springer International Publishing: Cham, Switzerland, 2016; pp. 143–163. [Google Scholar] [CrossRef]
  34. Liang, L.; Lin, L.; Jin, L.; Xie, D.; Li, M. SCUT-FBP5500: A Diverse Benchmark Dataset for Multi-Paradigm Facial Beauty Prediction. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 1598–1603. [Google Scholar]
  35. Lebedeva, I.; Guo, Y.; Ying, F. Transfer Learning Adaptive Facial Attractiveness Assessment. J. Phys. Conf. Ser. 2021, 1922, 012004. [Google Scholar] [CrossRef]
  36. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef] [Green Version]
  37. Feng, G.; Lei, J. The Effect of Odor Valence on Facial Attractiveness Judgment: A Preliminary Experiment. Brain Sci. 2022, 12, 665. [Google Scholar] [CrossRef] [PubMed]
  38. He, D.; Workman, C.I.; He, X.; Chatterjee, A. What Is Good Is Beautiful (and What Isn’t, Isn’t): How Moral Character Affects Perceived Facial Attractiveness. Psychol. Aesthet. Creat. Arts 2022. [CrossRef]
  39. Shahhosseini, M.; Hu, G.; Pham, H. Optimizing Ensemble Weights and Hyperparameters of Machine Learning Models for Regression Problems. Mach. Learn. Appl. 2022, 7, 100251. [Google Scholar] [CrossRef]
  40. Sun, J.; Li, H. Listed Companies’ Financial Distress Prediction Based on Weighted Majority Voting Combination of Multiple Classifiers. Expert Syst. Appl. 2008, 35, 818–827. [Google Scholar] [CrossRef]
  41. Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J.E.; Weinberger, K.Q. Snapshot Ensembles: Train 1, Get M for Free. arXiv 2017, arXiv:1704.00109. [Google Scholar]
  42. Perrone, M.P.; Cooper, L.N.; National Science Foundation U.S. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks; U.S. Army Research Office: Research Triangle Park, NC, USA, 1992. [Google Scholar]
  43. Boyd, S. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  44. Liang, G.; Zhang, Y.; Wang, X.; Jacobs, N. Improved Trainable Calibration Method for Neural Networks on Medical Imaging Classification. arXiv 2020, arXiv:2009.04057. [Google Scholar]
  45. Küppers, F.; Kronenberger, J.; Schneider, J.; Haselhoff, A. Bayesian Confidence Calibration for Epistemic Uncertainty Modelling. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 11–17 July 2021; pp. 466–472. [Google Scholar] [CrossRef]
  46. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1321–1330. [Google Scholar] [CrossRef]
  47. Kauffmann, J.; Müller, K.-R.; Montavon, G. Towards Explaining Anomalies: A Deep Taylor Decomposition of One-Class Models. Pattern Recognit. 2020, 101, 107198. [Google Scholar] [CrossRef]
  48. Dyrba, M.; Pallath, A.H.; Marzban, E.N. Comparison of CNN Visualization Methods to Aid Model Interpretability for Detecting Alzheimer’s Disease. In Informatik Aktuell; Springer Fachmedien Wiesbaden: Wiesbaden, Germany, 2020; pp. 307–312. [Google Scholar]
  49. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; Samek, W. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 2015, 10, e0130140. [Google Scholar] [CrossRef] [Green Version]
  50. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  51. Zhao, X.; Huang, W.; Huang, X.; Robu, V.; Flynn, D. BayLIME: Bayesian Local Interpretable Model-Agnostic Explanations. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Virtual Event, 7–8 April 2022; pp. 887–896. [Google Scholar]
  52. Fel, T.; Cadène, R.; Chalvidal, M.; Cord, M.; Vigouroux, D.; Serre, T. Look at the Variance! Efficient Black-Box Explanations with Sobol-Based Sensitivity Analysis. Adv. Neural Inf. Process Syst. 2021, 34. [Google Scholar] [CrossRef]
  53. Suzuki, S.; Abe, K. Topological Structural Analysis of Digitized Binary Images by Border Following. Comput. Vis. Graph. Image Process. 1985, 29, 396. [Google Scholar] [CrossRef]
  54. Weisstein, E.W. Golden Ratio. Available online: https://mathworld.wolfram.com/GoldenRatio.html (accessed on 29 October 2020).
  55. Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5. [Google Scholar] [CrossRef]
  56. Linacre, J.M.; Rasch, G. The expected value of a point-biserial (or similar) correlation. Rasch Meas. Trans. 2008, 22, 1154. [Google Scholar]
  57. Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [Green Version]
  58. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  59. Hu, J.; Shen, L.; Sun, G. Squeeze-And-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  60. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–11 September 2015. [Google Scholar] [CrossRef] [Green Version]
  61. Zhang, P.; Chen, Y.; Zhu, Y.; Wang, H. Eye Region as a Predictor of Holistic Facial Aesthetic Judgment: An Event-Related Potential Study. Soc. Behav. Pers. 2021, 49. [Google Scholar] [CrossRef]
  62. Lan, M.; Peng, M.; Zhao, X.; Li, H.; Yang, J. Neural Processing of the Physical Attractiveness Stereotype: Ugliness Is Bad vs. Beauty Is Good. Neuropsychologia 2021, 155, 107824. [Google Scholar] [CrossRef]
Figure 1. An illustration of the research methodology.
Figure 2. Illustration of the objective approach problem. (a) Approach based on average scores. (b) Approach based on voting.
Figure 3. Computational flow of deep Taylor decomposition.
Figure 4. Framework of the proposed method.
Figure 5. Golden ratio.
Figure 6. Facial geometric features used in Table 3.
Figure 7. Most Dependent Feature (MVF). (a) Top-5 features the machine depended on when making the Beauty class decision. (b) Top-5 features the machine depended on when making the Different Beauty class decision.
Figure 8. Correlation of each side of the face with the machine's sentiment decisions.
Figure 9. Golden ratios of individual features.
Figure 10. Mean of the total golden ratio for both classes.
Table 1. Review of certain existing research on facial beauty models.

Research | Data Size | Participants | Facial Data Property | Data Type | Explaining Beauty Decisions | Learned Features Visualization
(Gunes et al., 2006) [29] | 215 | 46 | Diverse | Score on average rating | No | No
(Chen et al., 2010) [30] | 23,412 | Unknown | Asian | Unknown | No | No
(Fan et al., 2012) [31] | 432 | 30 | Computer generated | Score on average rating | No | No
(Xu et al., 2015) [32] | 500 | 75 | Asian | Score on average rating | No | CNN filters
(Zhang et al., 2016) [33] | 799 | 25 | Diverse | Score on average rating | No | No
(Zhang et al., 2017) [28] | 9415 | Unknown | Asian/computer generated | Score on average rating | No | No
(Liang et al., 2018) [34] | 5500 | 60 | Asian/Caucasian | Score on average rating | No | No
(Liu et al., 2019) [3] | No data proposed | 0 | Diverse/computer generated | Voting/score on average rating | No | No
(Lebedeva et al., 2021) [35] | No data proposed | 0 | Asian/Caucasian | Score on average rating | No | No
(Sano, 2022) [27] | No data proposed | 0 | Asian/Caucasian | Score on average rating | No | Averaged Grad-CAM
(Bougourzi et al., 2022) [25] | No data proposed | 0 | Asian/Caucasian | Score on average rating | No | No
Proposed | 50,000 | 10 | Diverse | Subjective | Yes | Proposed MVF algorithm
Table 2. Facial features.

Feature | Area
Feature 1 | Right eye
Feature 2 | Left eye
Feature 3 | Right eyebrow
Feature 4 | Left eyebrow
Feature 5 | Nose
Feature 6 | Center of lips
Feature 7 | Right of lips
Feature 8 | Left of lips
Feature 9 | Chin
Feature 10 | Around hairline
Feature 11 | Right cheek
Feature 12 | Left cheek
Table 3. Proposed 23 golden facial geometric features.

Feature | Distance | Description
Feature 1 | D3 vs. D5 | Midpoint between eyes to nose tip vs. nose tip to chin
Feature 2 | D7 vs. D19 | Left eye length vs. distance to eyes midpoint
Feature 3 | D8 vs. D18 | Right eye length vs. eyes midpoint
Feature 4 | D4 vs. D6 | Nose tip to lips center vs. lips center to chin
Feature 5 | D3 vs. D4 | Nose length vs. nose width
Feature 6 | D1 vs. D2 | Length of the face vs. width of the face
Feature 7 | D11 vs. D17 | Hairline to right pupil vs. pupil to lips center
Feature 8 | D12 vs. D16 | Hairline to left pupil vs. pupil to lips center
Feature 9 | D5 vs. D6 | Nose tip to chin vs. lips to chin
Feature 10 | D5 vs. D14 | Nose tip to chin vs. right pupil to nose tip
Feature 11 | D5 vs. D15 | Nose tip to chin vs. left pupil to nose tip
Feature 12 | D4 vs. D20 | Nose width vs. nose tip to lips
Feature 13 | D21 vs. D11 | Outside eyes vs. hairline to the right pupil
Feature 14 | D21 vs. D12 | Outside eyes vs. hairline to the left pupil
Feature 15 | D4 vs. D21 | Nose width vs. lips length
Feature 16 | D13 vs. D1 | Forehead vs. face length
Feature 17 | D10 vs. D8 | Right pupil to eyebrow vs. right eye length
Feature 18 | D23 vs. D7 | Left pupil to eyebrow vs. left eye length
Feature 19 | D9 vs. D8 | Right eyebrow length vs. right eye length
Feature 20 | D22 vs. D7 | Left eyebrow length vs. left eye length
Feature 21 | D26 vs. D8 | Right eye height vs. width
Feature 22 | D25 vs. D7 | Left eye height vs. width
Feature 23 | D24 vs. D21 | Height of lips vs. width
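The sketch below shows one plausible way a per-feature golden-ratio score could be derived from the landmark distances listed in Table 3: each pair of distances is turned into a ratio and its percentage deviation from φ ≈ 1.618 [54] is reported. The helper names and the example coordinates are hypothetical; the exact scoring used in the experiments is not restated here.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, ~1.618 (Weisstein [54])

def euclidean(p, q):
    """Euclidean distance between two (x, y) landmark points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def golden_ratio_deviation(d_a: float, d_b: float) -> float:
    """Percentage deviation of a distance ratio from the golden ratio.
    0% means the two distances are in exact golden proportion."""
    ratio = max(d_a, d_b) / max(min(d_a, d_b), 1e-9)
    return abs(ratio - PHI) / PHI * 100.0

# Hypothetical example for Feature 6 (face length D1 vs. face width D2),
# using made-up landmark coordinates in pixels.
top_of_forehead, chin = (250, 40), (250, 420)
left_face_edge, right_face_edge = (130, 230), (370, 230)
d1 = euclidean(top_of_forehead, chin)            # face length
d2 = euclidean(left_face_edge, right_face_edge)  # face width
print(f"Feature 6 deviation from phi: {golden_ratio_deviation(d1, d2):.1f}%")
```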
Table 4. Average calibration output of each voter for all participants.

Voters | ECE-B | ECE-A | MCE-B | MCE-A | PICP | MPIW
VGG16 | 35.99 (34.81–37.02) | 12.66 (11.45–13.68) | 41.05 (40.26–41.14) | 26.93 (26.1–27.71) | 81.29 (80.31–81.93) | 13.57 (9.76–17.91)
SENet50 | 44.40 (43.56–45.08) | 3.93 (2.92–4.93) | 51.35 (49.79–52.73) | 30.66 (29.67–32.5) | 81.23 (79.37–83.76) | 15.85 (12.87–17.66)
ResNet50 | 41.49 (39.74–42.85) | 8.61 (6.79–10.38) | 55.21 (54.96–56.48) | 17.43 (16.46–18.49) | 81.39 (80.69–82.86) | 17.06 (14.55–19.99)
In each cell M (P25–P75), M is the median, P25 is the 25th percentile, and P75 is the 75th percentile of 10 tests. For each model, the best performance of each column is indicated in bold. A indicates “after calibration”, while B indicates “before calibration”.
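ECE and MCE in Table 4 follow the standard binned definitions of expected and maximum calibration error (cf. Guo et al. [46]). As a minimal illustration, and not the code used in our pipeline, the following sketch computes both quantities from a voter's predicted confidences and correctness indicators; the bin count and toy inputs are arbitrary.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=15):
    """Expected (ECE) and maximum (MCE) calibration error, in percent,
    computed over equally spaced confidence bins.

    confidences : predicted probability of the chosen class, in [0, 1]
    correct     : 1 if the prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap   # gap weighted by bin population
        mce = max(mce, gap)          # worst-case bin
    return 100 * ece, 100 * mce

# Toy usage with a handful of overconfident predictions.
conf = np.array([0.95, 0.90, 0.85, 0.80, 0.99, 0.70])
hit  = np.array([1,    0,    1,    0,    1,    1   ])
print(calibration_errors(conf, hit, n_bins=10))
```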
Table 5. Average performance of each voter for all participants.

Voters | Precision | Recall | F1 | MCC | ACC
VGG16 | 79.00 (77.41–80.01) | 77.40 (76.58–78.48) | 78.58 (78.02–79.1) | 60.88 (59.31–62.43) | 79.48 (77.42–81.73)
SENet50 | 83.01 (81.53–84.46) | 87.14 (85.6–87.49) | 85.47 (83.1–89.44) | 71.67 (70.28–76.26) | 84.51 (82.74–85.37)
ResNet50 | 85.94 (84.96–87.09) | 87.20 (84.16–88.64) | 86.31 (85.67–87.82) | 74.11 (72.3–75.89) | 87.34 (86.46–88.22)
In each cell M (P25–P75), M is the median, P25 is the 25th percentile, and P75 is the 75th percentile of 10 tests. For each model, the best performance of each column is indicated in bold.
Table 6. Calibration output of the ensembled models on the test data.

Strategy | ECE-B | ECE-A | MCE-B | MCE-A | PICP | MPIW
Majority Voting | 34.04 (32.82–35.36) | 4.98 (3.52–6.25) | 49.57 (48.01–51.58) | 14.22 (11.78–17.51) | 52.55 (51.79–53.25) | 23.47 (22.02–25.39)
Best Combination | 31.22 (30.53–32.02) | 4.57 (4–5.15) | 52.39 (50–54.17) | 7.89 (6.2–9.21) | 48.76 (45.92–50.69) | 26.12 (25.53–26.64)
Priori Recognition Performance | 31.98 (31.16–33.67) | 4.84 (4.25–5.89) | 41.02 (39.12–43.84) | 9.80 (8.57–11.43) | 48.58 (46.28–50.91) | 23.10 (22.41–23.64)
ECE (ours) | 32.87 (30.65–35.23) | 2.83 (1.38–2.95) | 47.25 (44.15–50.41) | 8.42 (5.6–9.74) | 45.46 (44.22–46.49) | 24.52 (23.29–25.48)
MCE (ours) | 26.18 (25.33–26.9) | 2.37 (1.72–2.73) | 51.60 (48.98–53.34) | 5.80 (4.17–7.22) | 48.56 (47.31–50.04) | 25.19 (23.93–26.33)
PICP (ours) | 34.37 (33.95–35.24) | 4.52 (3.11–6.43) | 49.51 (48.68–50.24) | 6.79 (5.69–8.37) | 45.90 (44.72–46.59) | 24.39 (23.63–25.34)
In each cell M (P25–P75), M is the median, P25 is the 25th percentile, and P75 is the 75th percentile of the 10 tests. For each model, the best performance of each column is in bold. A indicates “after calibration”, while B indicates “before calibration”.
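The rows labelled "(ours)" in Table 6 weight the voters by their calibration behaviour rather than by a simple majority. The exact weighting rule is not restated in this table, so the sketch below is only a plausible illustration under the assumption that each voter's weight is inversely proportional to a calibration error such as ECE and that class probabilities are combined by a weighted average; the function name and the toy ECE values are hypothetical.

```python
import numpy as np

def uncertainty_weighted_vote(probs, calib_errors):
    """Combine per-voter class probabilities with weights derived from
    calibration error: better-calibrated voters count more.

    probs        : array of shape (n_voters, n_classes), softmax outputs
    calib_errors : array of shape (n_voters,), e.g. ECE in percent
    """
    probs = np.asarray(probs, dtype=float)
    err = np.asarray(calib_errors, dtype=float)
    weights = 1.0 / (err + 1e-6)        # lower error -> larger weight
    weights = weights / weights.sum()   # normalise weights to sum to 1
    combined = weights @ probs          # weighted average over voters
    return combined.argmax(), combined

# Toy example: three voters (e.g., VGG16, SENet50, ResNet50) on the two
# classes "Beauty" vs. "Different Beauty", with hypothetical ECE values.
p = [[0.60, 0.40], [0.30, 0.70], [0.45, 0.55]]
ece = [12.7, 3.9, 8.6]
label, dist = uncertainty_weighted_vote(p, ece)
print(label, dist.round(3))
```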
Table 7. Performance of the ensembled models on the test data.

Strategy | Precision | Recall | ACC | F1 | MCC | AUC
Majority Voting | 88.73 (88.21–90.02) | 89.97 (89.15–91.21) | 89.81 (87.01–91.52) | 90.26 (89.29–90.82) | 78.09 (77.86–78.52) | 90.01 (88.43–92.69)
Best Combination | 90.00 (89.96–90.03) | 88.35 (87.78–88.63) | 89.42 (89.09–90.09) | 89.15 (88.63–89.44) | 78.25 (77.02–79.32) | 89.95 (89.71–90.35)
Priori Recognition Performance | 90.48 (88.33–91.33) | 88.75 (87.63–89.66) | 89.03 (87.3–91.28) | 89.43 (89.22–89.59) | 78.69 (78.29–79.13) | 89.93 (89.8–90.09)
ECE (ours) | 91.40 (90.18–92.63) | 90.74 (89.56–91.67) | 91.02 (90.94–91.1) | 91.00 (90.06–91.83) | 82.38 (81.17–83.5) | 91.93 (91.26–93.07)
MCE (ours) | 90.21 (89.36–90.88) | 90.94 (90.75–91.18) | 90.96 (90.52–91.75) | 90.57 (90.29–90.81) | 81.01 (80.57–81.49) | 91.57 (91.09–93.25)
PICP (ours) | 93.89 (93.2–94.71) | 92.16 (90.09–92.42) | 92.25 (91.8–92.71) | 92.98 (92.3–93.75) | 85.56 (85.26–86.16) | 93.66 (93.43–94.01)
In each cell M (P25–P75), M is the median, P25 is the 25th percentile, and P75 is the 75th percentile of the 10 tests. For each model, the best performance of each column is in bold.
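The metrics in Tables 5 and 7 (precision, recall, accuracy, F1, MCC [57], and AUC) follow their standard definitions. The minimal scikit-learn sketch below, with purely hypothetical labels and scores, shows how such figures are typically obtained.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical binary labels: 1 = "Beauty", 0 = "Different Beauty".
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.92, 0.12, 0.81, 0.45, 0.30, 0.77, 0.25, 0.60, 0.88, 0.15]

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))  # see Chicco & Jurman [57]
print("ACC      :", accuracy_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))     # AUC needs scores, not labels
```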
Table 8. Top-five facial attributes correlation with beauty decision.

Feature | Variance | Standard Deviation | r_pb | t-Test | p-Value | Correlation
Right eye area | 13,232.09 | 115.03 | 0.87 | 7.37 | <0.001 | Positive
Left eye area | 13,024.69 | 114.13 | −0.93 | 10.60 | <0.001 | Negative
Right eyebrow area | 623.15 | 24.96 | 0.58 | 3.06 | 0.007 | Positive
Nose | 245.13 | 15.66 | −0.53 | 2.63 | 0.02 | Negative
Right cheek area | 627.25 | 25.04 | 0.06 | 0.27 | 0.79 | Positive
Right corner of lips | 181.16 | 13.46 | 0.11 | 0.48 | 0.64 | Positive
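The r_pb column in Table 8 is the point-biserial correlation between the binary beauty decision and each feature's contribution, with significance assessed via a t-test [56]. The sketch below, using synthetic data and SciPy, illustrates the computation; the variable names and values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: a binary beauty decision (1 = Beauty) and a continuous
# per-image relevance score for the "right eye" feature area.
decision = np.array([0] * 10 + [1] * 10)
right_eye_relevance = 50 + 30 * decision + rng.normal(0, 10, size=20)

# Point-biserial correlation is Pearson's r with one dichotomous variable.
r_pb, p_value = stats.pointbiserialr(decision, right_eye_relevance)

# Equivalent significance test via the t statistic with df = n - 2.
n = len(decision)
t_stat = r_pb * np.sqrt((n - 2) / (1 - r_pb**2))
print(f"r_pb = {r_pb:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```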
Table 9. Beauty class average eye golden ratio of 10 participants.

Class | Right Eye Average Representation | Golden Ratio | Left Eye Average Representation | Golden Ratio
Beauty | (image) | 19.60% | (image) | 19.60%
Different Beauty | (image) | 10.95% | (image) | 7.31%
