Hyper-Parameter Optimization of Stacked Asymmetric Auto-Encoders for Automatic Personality Traits Perception

In this work, a method for automatic hyper-parameter tuning of the stacked asymmetric auto-encoder is proposed. In previous work, the deep learning ability to extract personality perception from speech was shown, but hyper-parameter tuning was attained by trial-and-error, which is time-consuming and requires machine learning knowledge. Therefore, obtaining hyper-parameter values is challenging and places limits on deep learning usage. To address this challenge, researchers have applied optimization methods. Although there were successes, the search space is very large due to the large number of deep learning hyper-parameters, which increases the probability of getting stuck in local optima. Researchers have also focused on improving global optimization methods. In this regard, we suggest a novel global optimization method based on the cultural algorithm, multi-island and the concept of parallelism to search this large space smartly. At first, we evaluated our method on three well-known optimization benchmarks and compared the results with recently published papers. Results indicate that the convergence of the proposed method speeds up due to the ability to escape from local optima, and the precision of the results improves dramatically. Afterward, we applied our method to optimize five hyper-parameters of an asymmetric auto-encoder for automatic personality perception. Since inappropriate hyper-parameters lead the network to over-fitting and under-fitting, we used a novel cost function to prevent over-fitting and under-fitting. As observed, the unweighted average recall (accuracy) was improved by 6.52% (9.54%) compared to our previous work and had remarkable outcomes compared to other published personality perception works.


Introduction
Whether deep or shallow, the operation of artificial neural networks (ANNs) depends on their hyper-parameters and parameters [1][2][3]. Hyper-parameters are the variables that define the network structure, such as the number of layers [2], or control the training process, such as the learning rate [4]. In contrast, the trainable variables pertaining to layer connections and tuned during the training process, which are weights and biases, are called parameters [5][6][7]. Although parameter tuning alone may yield acceptable results, it does not yield notable results without hyper-parameter tuning (HPT).
The importance of HPT became more manifest with the development of deep learning algorithms. Deep learning is a type of machine learning (ML) technique with diverse hyper-parameters that severely affect its performance [8][9][10]. Since HPT is an arduous task and requires data and network knowledge [11,12], it is often performed by empirical methods (trial-and-error), which is time-consuming and does not guarantee an optimal configuration. The approaches reported in the literature can be grouped as follows:
1. HPT with the classical method and parameter optimization [13][14][15][16]: The fine-tuning of weights and biases (parameters) can provide useful information about the problem, but their size and initial value rely on HPT. Moreover, the number of parameters in deep neural networks (DNNs) and high-dimensional datasets is enormous, and calculating the optimum value of these parameters is complicated, not easily implemented, and requires computational systems with remarkable capabilities.
2. Hyper-parameter and parameter optimization [17][18][19]: Adaptive hyper-parameters are obtained by parameter training. The critical disadvantage is that the parameters must be optimized for each candidate vector of hyper-parameters, which causes runtime errors in the computational system and requires expensive training and large storage capacity to save the best parameter values over epochs. Additionally, evaluating all possible combinations of hyper-parameters is computationally infeasible. Hence, this method is not applicable to large models such as deep learning [20,21].
3. Hyper-parameter optimization (HPO) and parameter tuning with backpropagation [4,11,22]: The main drawback is that although optimization methods are efficient in finding global optima, the gradient may vanish when back-propagating. As a result, not all network parameters are tuned well, which impacts results [23].
To tackle the poor tuning of deep neural network parameters, an asymmetric auto-encoder (AsyAE) was presented in our previous work for automatic personality perception (APP) from speech [24]. We showed that AsyAE could improve the model results compared with conventional auto-encoders by semi-supervised training of parameters, and that it can be effectively employed in deep learning. However, the stacked asymmetric auto-encoder (SAAE) hyper-parameters were chosen by trial-and-error, which was time-consuming, and two personality traits achieved lower accuracy than in prior research [24].
Thus, the aims of the present work were to (1) propose a novel optimization method based on cultural evolution and parallel computing, (2) obtain the near-optimal values of hyper-parameters of SAAE, and (3) classify five personality traits.
The rest of the article is organized as follows. In Section 2, some related works of HPO in deep learning and APP are explained. In Section 3, the dataset is introduced, and the summary of the feature extraction method is presented in Section 4. The new optimization method is proposed in Section 5. The simulation results of the new method, which is applied to three benchmark functions of finding global optima, are presented in Section 6. In addition, this section discusses the outcomes of applying the proposed method to SAAE for automatic personality perception classification.

Related Works
Given that this article examines HPO methods in order to find a proper one to optimize the hyper-parameters of SAAE for automatic personality trait perception, the related works section is divided into two parts. The focus of the first part is on recently published methods of neural network hyper-parameter tuning, regardless of the application in which it is used. Thus, the works related to the investigation of HPO in ML are summarized in the first part. Since the aim of our research was HPT of SAAE to classify five personality traits from speech, the second part is related to studying HPO in machine learning methods applied in the field of personality trait perception.

Hyper-Parameter Tuning in ML
Deep learning hyper-parameter types are vast and can be divided into three groups: integer, real, and categorical. The integer group consists of variables such as the number of layers (whether hidden or convolutional) [25], the number of neurons [8], the size of the kernel [26], the number of kernels [27], batch size, pooling size, and number of maximum epochs [9]. The real group includes the learning rate [25], dropout rate [25], regularization factor [25], network weight initialization [5], and momentum [4]. The categorical group comprises activation function type [8] and optimization method [8].
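The three groups can be illustrated with a small search-space sketch. The hyper-parameter names and ranges below are hypothetical and chosen only for illustration; they are not taken from any of the cited works.

```python
import random

# Hypothetical search space: names and ranges are illustrative assumptions.
SEARCH_SPACE = {
    "num_layers": ("int", 1, 5),
    "num_neurons": ("int", 8, 256),
    "learning_rate": ("real", 1e-5, 1.0),
    "dropout_rate": ("real", 0.0, 0.9),
    "activation": ("cat", ["sigmoid", "tanh", "relu"]),
}


def sample_config(space, rng):
    """Draw one configuration, respecting the type of each hyper-parameter."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # inclusive bounds
        elif kind == "real":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "cat":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown hyper-parameter type: {kind}")
    return config


rng = random.Random(0)
config = sample_config(SEARCH_SPACE, rng)
```

Each group needs its own sampling (and, in an evolutionary method, its own mutation) rule, which is one reason mixed search spaces are harder to explore than purely continuous ones.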
Considering that a change in the value of each hyper-parameter changes the values of the neural network parameters that affect the output of the network, and also that examination of any possible combination of hyper-parameters is time-consuming, expensive and practically impossible, studies have investigated the effect of adjusting and optimizing some of the most important hyper-parameters.
In this regard, the article in [4] employed the HPO method for bearing fault diagnosis in mechanical equipment. Parallel computing was used to find hyper-parameters of the deep belief network (DBN). The learning rate and momentum were optimized, while other hyper-parameters were predefined and kept constant. Additionally, Wu Deng et al. used quantum-inspired differential evolution (DE) to optimize DBN parameters. Results showed an improvement in global search and avoiding premature convergence for fault classification [28].
The numbers of hidden neurons as a hyper-parameter and of the weights/biases as parameters were optimized in a feed-forward ANN by Gray wolf optimizer in [18]. Feed-forward ANNs (not back-propagation) were used because adjusted parameters were achieved by the optimization method.
Y. Peng et al. proposed an HPO method based on a fuzzy system in [8]. They optimized the number of hidden layers and the number of neurons in each layer of a DNN. The activation function type and optimization method, including Genetic Algorithms (GA), Bayesian search, grid search, random search, and quasi-random search, were selected automatically during HPO. For preventing over-fitting, the dropout technique was used. The proposed method was tested in three rainfall prediction datasets.
The authors of [29] suggested a distributed particle swarm optimization (PSO) for the HPO of a convolutional neural network (CNN). They were concerned about the time-consuming population search based on distributed PSO, and parallel computing was employed to speed up the algorithm. They optimized the number and size of the kernels, the type of pooling (max or average) for two convolutional layers, the activation function type in convolutional layers, the number of neurons, learning rate, and the dropout rate of the fully connected layers.
Time-series prediction of congestion in highway systems based on long short-term memory (LSTM) was investigated in [9]. To obtain the proper model and structure, the authors recommended an HPO method by applying the Bayesian optimization (BO) method. Five hyper-parameters were automatically obtained, including learning rate, the number of hidden layers, the number of neurons in each layer, batch size, and dropout rate.
The intention of [25] was to examine the robustness of one HPO method over six benchmarks, contrary to other works that designed an algorithm to fit a single problem. In other work, the authors used BO, a long-established HPO method, in CNN [1] and applied four strategies to alleviate the drawbacks of BO. They tuned the hyper-parameters of two convolutional layers and two fully connected layers in this way.
In [26], an intuitive architecture design using GA was proposed for CNN. The obtained model was evaluated on a CNN with a single convolutional layer and a fully connected layer. Additionally, some hyper-parameters, including maximum epochs, batch size, initial learning rate, regularization, and momentum were optimized by PSO to prepare a CNN for expression recognition in [30].
Since the success of neural networks depends on their structure, the article in [31] proposed a micro-canonical optimization algorithm for overcoming large parameter spaces and optimizing hyper-parameters of a CNN. Hyper-parameters were the number of convolution layers, activation function type, batch size, pooling type, and dropout rate. The method was evaluated by six image recognition datasets and exhibited accuracy improvement.
State-of-health estimation and remaining usable life prediction in battery prognosis were examined in [32] by a deep convolution neural network. The authors addressed hyper-parameter tuning that affected DNN performance. They improved the algorithm by using the BO method.
Anjir A. Chowdhury et al. concentrated on the role of hyper-parameter optimization in the performance and reliability of deep learning outcomes [33]. They compared several HPO algorithms to obtain better validation accuracy in DNNs and concluded that most of them are computationally expensive. Finally, a greedy approach-based HPO algorithm was proposed for enabling faster computing on edge devices for on-the-fly learning applications. The VGG and ResNet architectures were used, and their hyper-parameters such as epochs, number of hidden layers, number of units per layer, activation function, dropout rate, batch size, and learning rate were optimized.
The Gray wolf optimization was employed to optimize the parameters of the kernel extreme learning machine to realize a hyperspectral image classification method in [34].

Automatic Personality Perception
In psychology, the big five inventory (BFI) is a well-known theory of personality with five traits, including openness to experience (Ope.), conscientiousness (Con.), extraversion (Ext.), agreeableness (Agr.), and neuroticism (Neu.). These traits are present in an individual simultaneously with different scores and can generally be measured by a BFI questionnaire [35,36].
Due to the importance of personality in daily life, computer science researchers have investigated personality trait identification by multimodal media (audio, text, video, image) recently. Here, we focus on studies structured by deep learning methods.
A multimodal approach for perceiving personality traits was proposed by employing well-known deep structures (ResNet-v2-101 and VGGish) [37]. An LSTM network was added at the end to use temporal information. The authors optimized only the learning rate, while other hyper-parameters were configured manually. The structure of these deep methods is fixed, and the weights and biases are pre-trained. Therefore, the hyper-parameters of these networks are not tuned for each dataset.
Given the fact that personality traits can influence appearance, MobileNetv2 and ResNeSt50 networks were employed in [38] for facial feature extraction and classification. Results showed that a single pre-trained network such as MobileNetv2 is inappropriate for classifying all five personality traits. This indicates that each trait must be classified by a specific model, which means different hyper-parameters are necessary. However, the authors did not mention it directly and applied a combination of two pre-trained deep networks to build a complex deep model.
Onno Kampman et al. examined feature extraction and the classification of five personality traits by applying a one-dimensional CNN to a raw audio dataset. The HPT of the deep network, covering regularization factors and kernel size, was performed manually [39].
One of the personality detection applications is discovering interpersonal communication skills. Article [40] investigated this aspect from a video interview using a semi-supervised CNN in which HPT was performed by trial-and-error. The authors concentrated on video processing, and a fixed hyper-parameter set was utilized for all traits.
The study in [41] analyzed the acoustic and lexical features of a speech signal that were affected by BFI traits. Additionally, it designed six models based on recurrent neural networks for classifying those traits. Hyper-parameters such as hidden size, learning rate, batch size, and dropout percentage were defined, but tuning them was not discussed.

Dataset
The SSPNet speaker personality corpus (SPC) is a well-known automatic personality perception dataset introduced in 2010. This dataset originally contained 640 recorded speech signals of 322 native French speakers. There is one speaker in each clip recorded for 10 s. Due to the studies on the effect of mental factors on speech signals [42], the collected clips were emotionally neutral, and to confirm that lexical content did not affect the personality scores, evaluators who were foreign to the French language were selected. Therefore, eleven assessors who did not understand French evaluated each clip based on the BFI questionnaire. The average score of these assessors was considered as the final score for each clip. Hence, five scores were obtained for each clip [43].
Although the SPC dataset has been applied in several works and is a proper dataset for comparison with new methods, the number of samples it provides is too low to train the enormous number of parameters of a DNN. This important challenge was addressed in our previous work [24], where we showed that the sample size of speech signals could be enlarged with data augmentation methods based on a spectrogram so that the prosodic content of speech is preserved. Data augmentation is a popular technique to expand the size of the dataset artificially and is widely used in image processing. However, using this technique on speech is not as easy as on images. In other words, we needed to choose transformations that maintain the speaker's personality, and we had to be confident that such manipulations in the spectrogram do not interfere with the extracted features related to personality traits. In this regard, frequency masking and time warping were selected as data augmentation methods, and the number of clips increased up to 640,000. For more details, please see [24].
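As a rough illustration of frequency masking on a spectrogram, a minimal sketch is given below. It is not the procedure of [24]: the spectrogram layout (one row per frequency bin), the `max_width` parameter, and the fill value are assumptions made for this example, and time warping is omitted.

```python
import random


def frequency_mask(spectrogram, max_width, rng, fill=0.0):
    """Return a copy with one random band of consecutive frequency bins masked.

    spectrogram: list of rows, one row per frequency bin (bins x frames).
    max_width:   largest number of bins that may be masked (assumed parameter).
    """
    n_bins = len(spectrogram)
    width = rng.randint(0, min(max_width, n_bins))  # number of bins to mask
    start = rng.randint(0, n_bins - width)          # first masked bin
    masked = [row[:] for row in spectrogram]        # leave the input untouched
    for b in range(start, start + width):
        masked[b] = [fill] * len(masked[b])
    return masked


spec = [[1.0] * 4 for _ in range(6)]                # toy 6-bin, 4-frame input
augmented = frequency_mask(spec, max_width=2, rng=random.Random(1))
```

Calling the function repeatedly with different random states yields many variants of one clip, which is how a small corpus can be enlarged artificially.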

Feature Extraction
Despite DNN's ability to perform automatic feature extraction from raw speech signals, deep learning methods have generally been applied to hand-crafted audio features. This is mainly because deep learning methods require a large volume of data to perform well. Nevertheless, building a dataset with a large number of labeled samples is costly, time-consuming, and laborious in the automatic personality perception field, which restricts the applicable methods. Therefore, previous studies have used hand-crafted features for the DNN input [44].

Proposed Method
This section is divided into two parts. In the first part, we thoroughly describe the new optimization method mathematically. In order to apply our optimization method to the SAAE, we had to address several problems. The second part deals with this issue and its solution.

The Proposed Optimization Method
HPO of deep learning is a time-consuming task in practice that depends on the network depth, the number of parameters, the processor system, and the speed of the optimization algorithm [5]. Applying HPO to deep learning is challenging. The difficulty can stem from (1) the unsupervised learning of most deep learning methods, which causes trouble for optimization and imperfect tuning of parameters [47], (2) a large model with enormous numbers of trainable parameters, which leads the processing system to runtime errors [5,8], and (3) an intricate search space created by different types of hyper-parameter domains (categorical, continuous, and integer value), causing inherent computational complexity [5]. A larger search space gives rise to a longer search time.
Parallel evaluation can partly reduce optimization time [48], and culture speeds up the population's evolution more than chromosomes alone (each chromosome represents a solution in the population space) [49]. Accumulated experience that is potentially accessible to all individuals is called culture, which is used in problem-solving activities [50]. The knowledge extracted by identifying patterns in the population's problem-solving experiences influences the generation of new solutions [51]. Therefore, the combination of the cultural algorithm (CA) and parallel computing can facilitate the exploration of the search space [52]. In this regard, researchers are interested in combining CA with other optimization algorithms. Sun et al. combined a cultural algorithm with two PSO populations that shared their belief space. Their results indicated that sharing the knowledge of the belief space can improve performance by avoiding local optima [53]. A single-population and a multi-population method based on CA were proposed in [54]. A PSO population-based method with interactive belief space was introduced by [49]. A hybrid evolutionary optimization method coupling CA with GAs was defined in [55]. Fuzzy operations were employed to exchange individuals between belief space and population space in [56].
From this perspective, we proposed a four-island approach based on the parallel evaluation and CA.
Although CA and parallel computing can perform better than the basic optimization algorithms [57], they do not provide enough convergence speed alone for deep learning. Thus, three driving force factors were applied to population space for creating interactive space between four island population spaces. Creating interactive population space causes interactive belief space, which can determine the direction and step size faster than traditional optimization methods. In this regard, our proposed method is called the multi-island interactive cultural (MIC) algorithm.
The MIC method is illustrated in Figure 1. In this method, the control parameters are configured first. The initial population X[m, D] is generated randomly in the feasible space. The variable m indicates the population size (the number of chromosomes or individuals), and D is the chromosome dimension (the number of genes). After preparing the random initial population, it is transferred into the four islands in parallel (gray lines): GA, PSO, DE, and evolution strategy (ES). The GA and PSO are the optimization algorithms widely applied to HPO studies in deep learning [1,8]. GA is far more successful in complex networks such as CNNs, but eliminates previous information by changing the population every iteration [50]. PSO shares information between the particles and is popular on smaller networks [29]. The DE algorithm is utilized in optimization problems due to its high convergence speed and few control parameters when searching for global optima. It is suitable for nonlinear search spaces [28]. The ES is less popular among the global optimization algorithms because it is a simple mutation-selection method, but it is helpful in making small changes [48]. It should be noted that in the first iteration, the population of the four islands is the same.
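The initialization described above can be sketched as follows. This is a minimal illustration under assumed settings: the gene bounds, the values m = 8 and D = 3, and the function names are hypothetical.

```python
import random

ISLANDS = ("GA", "PSO", "DE", "ES")


def init_population(m, bounds, rng):
    """Create X[m, D]: m chromosomes, each with D genes inside its bounds."""
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(m)]


def distribute(population):
    """Give every island its own copy of the same initial population."""
    return {name: [chrom[:] for chrom in population] for name in ISLANDS}


rng = random.Random(42)
bounds = [(-5.0, 5.0)] * 3            # D = 3 genes, hypothetical feasible range
X = init_population(8, bounds, rng)   # m = 8 individuals
islands = distribute(X)
```

Copying the chromosomes, rather than sharing references, lets each island evolve its population independently after the first iteration.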
The four islands were evaluated individually and in parallel. Then, some individuals of each island were transferred into an interactive belief space (InBS) through an acceptance function (colored arrows). Here, the acceptance function selected the best 25% of the individuals of each island, so the four islands together filled a belief space of size y[m, D].
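A minimal sketch of this acceptance step is shown below. It assumes a minimization problem (lower fitness is better), and the toy `sphere` fitness and the island contents are hypothetical.

```python
def accept(islands, fitness, fraction=0.25):
    """Collect the best `fraction` of each island into the belief space."""
    belief_space = []
    for population in islands.values():
        ranked = sorted(population, key=fitness)          # lower is better
        n_best = max(1, int(fraction * len(population)))
        belief_space.extend(chrom[:] for chrom in ranked[:n_best])
    return belief_space


def sphere(chrom):
    """Toy fitness used only for this illustration."""
    return sum(g * g for g in chrom)


toy_islands = {
    "GA": [[0.0, 1.0], [3.0, 3.0], [1.0, 1.0], [2.0, 2.0]],
    "PSO": [[4.0, 0.0], [0.5, 0.5], [2.0, 1.0], [3.0, 1.0]],
    "DE": [[1.0, 2.0], [0.0, 0.0], [5.0, 5.0], [1.0, 3.0]],
    "ES": [[2.0, 0.0], [2.0, 3.0], [0.1, 0.2], [4.0, 4.0]],
}
belief = accept(toy_islands, sphere)
```

With four islands of m individuals each and a 25% acceptance rate, the belief space holds m individuals, which matches the size y[m, D] stated above.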
The InBS consisted of normative (N[D]) and situational knowledge (S) of all islands. Knowledge of different islands in the belief space causes the chromosomes to move away from unwanted regions and get closer to the optimal points by using different experiences faster than previously published works. InBS can be used effectively to prune the population space.
Normative knowledge represents the range of the best solutions by determining the upper and lower bounds of each gene of a chromosome and is used to steer the search efforts toward the promising ranges. In other words, it computes the range of each gene that leads the individual to a good solution.
The offspring affected by normative knowledge are generated by Equation (1), where $u_j$ is the upper and $l_j$ is the lower bound of InBS for the jth gene, β is a constant value, t is the current iteration, and N(0, 1) is the standard normal distribution. For each gene, the structure contains the upper bound ($u_j^t$), the lower bound ($l_j^t$), the upper bound value ($U_j^t$), and the lower bound value ($L_j^t$), which are obtained by Equations (2)-(5), respectively,
where $y_{i,j}$ is the jth gene in the ith individual of InBS, and $f(y_i)$ is the value of the individual $y_i$ calculated by the fitness function. The fitness function (loss function), which is determined by the problem description, evaluates the individuals of each island separately. The situational knowledge, as seen in Equation (6), adjusts the mutation step size relative to the distance between the current best individual and the other individuals. The greater the distance between the ith individual, $y_i$, and the current best individual, the greater the step size and vice versa.
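To make the role of the normative knowledge concrete, the sketch below uses the textbook cultural-algorithm update rule and a Gaussian step scaled by the width of the accepted range. This is an assumption: it stands in for Equations (1)-(5), whose exact form is not reproduced here, and the function names are hypothetical.

```python
import math
import random


def new_normative(dim):
    """One entry per gene: bounds l, u and the fitness scores L, U seen there."""
    return [{"l": math.inf, "u": -math.inf, "L": math.inf, "U": math.inf}
            for _ in range(dim)]


def update_normative(norm, belief_space, fitness):
    """Textbook normative update (assumed form, minimization)."""
    for chrom in belief_space:
        score = fitness(chrom)
        for j, gene in enumerate(chrom):
            entry = norm[j]
            if gene <= entry["l"] or score < entry["L"]:
                entry["l"], entry["L"] = gene, score
            if gene >= entry["u"] or score < entry["U"]:
                entry["u"], entry["U"] = gene, score
    return norm


def normative_offspring(chrom, norm, beta, rng):
    """Assumed step: parent gene + beta * (u_j - l_j) * N(0, 1)."""
    return [g + beta * (norm[j]["u"] - norm[j]["l"]) * rng.gauss(0.0, 1.0)
            for j, g in enumerate(chrom)]


def sphere(chrom):
    """Toy fitness used only for this illustration."""
    return sum(g * g for g in chrom)


norm = update_normative(new_normative(1), [[1.0], [3.0], [2.0]], sphere)
child = normative_offspring([1.5], norm, beta=0.5, rng=random.Random(0))
```

The step size shrinks as the accepted range narrows, so the search becomes finer as the islands agree on a promising region.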
Updating the situational knowledge adds the InBS's best individual to the situational knowledge if it outperforms the current best individual, as described in Equation (6).
Here, $y_{best}^t$ is the best individual in the InBS at iteration t. The influence rule can be represented with Equation (7) (for i = 1, . . . , m and j = 1, . . . , D),
where $E_j$ is the jth gene in the best individual, β is a constant factor, N(0, 1) is the standard normal distribution, and $y_{p+i,j}$ is the offspring of the individual $y_{i,j}$. After updating InBS with new generations, some individuals are transferred into each island population space by the influence function. The individuals of InBS contain the knowledge of all of the islands, which is the main strength of the proposed method. Various studies have shown that the efficiency of optimization methods varies across problems. In other words, choosing an optimization method for a problem is a challenge that some researchers consider as a kind of hyper-parameter that needs to be tuned. Hence, the best 25% of the individuals of InBS replaced the worst 25% of the population on each island. Offspring generation then starts on each island separately, and the offspring are evaluated through the fitness function.
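The situational update and the replacement of the worst individuals can be sketched as follows. The sketch assumes minimization; the function names and the toy data are hypothetical, and the offspring rule of Equation (7) is not modeled here.

```python
def update_situational(best, belief_space, fitness):
    """Keep the InBS best individual only if it beats the stored one."""
    candidate = min(belief_space, key=fitness)
    if best is None or fitness(candidate) < fitness(best):
        return candidate[:]
    return best


def influence(population, belief_space, fitness, fraction=0.25):
    """Replace the worst `fraction` of an island with the best of the InBS."""
    n = max(1, int(fraction * len(population)))
    migrants = sorted(belief_space, key=fitness)[:n]
    survivors = sorted(population, key=fitness)[:len(population) - n]
    return survivors + [chrom[:] for chrom in migrants]


def sphere(chrom):
    """Toy fitness used only for this illustration."""
    return sum(g * g for g in chrom)


island = [[0.0], [1.0], [2.0], [5.0]]
inbs = [[0.5], [3.0]]
best = update_situational(None, inbs, sphere)
new_island = influence(island, inbs, sphere)
```

Because the belief space is filled by all four islands, the migrants can carry knowledge that an island's own algorithm did not discover.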
If the algorithm reaches the stopping criterion, the process will be stopped. Otherwise, interactive population space is created by three driving forces in order to promote cooperation among the islands and increase diversity.
The three driving-force methods are named the elitism method (EM), merge method (MM), and lambda method (LM).
In the interactive population space, all individuals of each island are considered. In EM, the m best individuals are preserved and replace the old population on each island. With this method, the populations of the next generation are the same on every island. This driving force makes the four basic algorithms create the interactive space only from the best individuals of the four islands.
In MM, after considering all individuals of each island, a random number a, a ∈ (0, 1), is produced. The a × m best individuals are merged with (1 − a) × m individuals of the old population on each island. Consequently, each island has a unique new population in this interactive space.
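A minimal sketch of EM and MM is given below, assuming minimization. How the (1 − a) × m old individuals are chosen is not specified above, so the random choice used here is an assumption, as are the function names and the toy islands.

```python
import random


def pooled_ranking(islands, fitness):
    """All individuals of all islands, best first."""
    return sorted((c for pop in islands.values() for c in pop), key=fitness)


def elitism_method(islands, fitness, m):
    """EM: every island receives the same m best individuals overall."""
    elite = pooled_ranking(islands, fitness)[:m]
    return {name: [c[:] for c in elite] for name in islands}


def merge_method(islands, fitness, m, rng):
    """MM: a*m best individuals overall plus (1-a)*m old ones per island."""
    pool = pooled_ranking(islands, fitness)
    a = rng.random()                       # single draw, a in [0, 1)
    n_best = round(a * m)
    merged = {}
    for name, old in islands.items():
        kept = rng.sample(old, m - n_best)  # assumed: random old individuals
        merged[name] = [c[:] for c in pool[:n_best]] + [c[:] for c in kept]
    return merged


def sphere(chrom):
    """Toy fitness used only for this illustration."""
    return sum(g * g for g in chrom)


toy = {"GA": [[0.0], [4.0]], "PSO": [[1.0], [5.0]],
       "DE": [[2.0], [6.0]], "ES": [[3.0], [7.0]]}
em = elitism_method(toy, sphere, m=2)
mm = merge_method(toy, sphere, m=2, rng=random.Random(7))
```

EM maximizes exploitation because all islands restart from the same elite, whereas MM keeps part of each island's own population and therefore preserves more diversity.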
In LM, two of the islands are selected randomly, according to two random numbers µ, µ ∈ (0, 1), and λ, λ ∈ (0, 1), representing emigration and immigration, respectively. The random numbers of individuals based on µ and λ of each island indicate which individuals can immigrate to and emigrate from another random island. This method forces islands to cooperate with the best individual and the worst one to create interactive space.
Due to the interaction and sharing of individuals among the four islands, if one algorithm becomes trapped in a local optimum, the others can lead MIC toward the global optimum because the result does not depend on a single algorithm. This feature allows the MIC to be used for various global optimization problems and to escape local optima efficiently.
The MIC strategy is presented step by step below (Algorithm 1).

Algorithm 1: Implementation of MIC
Step 1: Set the MIC parameters randomly.
Step 2: Generate the initial population randomly.
Step 3: Transfer 25% of the best individuals of each island into InBS (Accept).

Else
Go to Step 7.
Step 7: Create interactive population space by using the following three methods:
EM: The m best individuals of the four islands are selected and replace the old population.
MM: The a × m best individuals are selected and merged with (1 − a) × m individuals obtained from the old population of each island.
LM: According to two random numbers, µ and λ, some individuals of a random island can immigrate to and emigrate from another random island.

Stacked Asymmetric Auto-Encoder HPO Using MIC
Since our work aimed to obtain the SAAE near-optimal structure, a brief overview of this method is presented below.

(1) Stacked asymmetric auto-encoder
The AsyAE is a semi-supervised DNN that tackles the curse of dimensionality. The schematic of the AsyAE is illustrated in Figure 2.
Figure 2. Schematic of the asymmetric auto-encoder [24].
In this type, one neuron is added in the decoder part of the conventional auto-encoder with the desired value of the problem, which is the studied personality score in our field. This single neuron disrupts the symmetry of the encoder and decoder parts and makes the network asymmetric. The feed-forward equations of the AsyAE are similar to the conventional ones, as follows. For representing encoder and decoder layers, superscripts of 1 and 2 were used, respectively.
\[ net^{(1)} = W^{(1)} X \qquad (8) \]
\[ O^{(1)} = f\left(net^{(1)}\right) \qquad (9) \]
where $W^{(1)}$ indicates the encoder weight matrix, X is the input matrix, $O^{(1)}$ is the encoder output matrix, and f is the activation function.
\[ net^{(2)} = W^{(2)} O^{(1)} \qquad (10) \]
\[ O^{(2)} = f\left(net^{(2)}\right) \qquad (11) \]
where $W^{(2)}$ and $O^{(2)}$ are the weight and output matrices of the decoder layer, respectively. The error back-propagation related to the encoder and decoder weight matrices is calculated by Equation (12).
where $e_t$ is the error vector of AsyAE at time t, which is described by Equation (13), and k is the number of neurons in the decoder output layer.
The desired output vector at time t is denoted by $d_t$, which belongs to the matrix D. D is the desired output matrix of AsyAE, which is produced by combining the desired labels with the AsyAE input.
Here, $x_{ij}$ is an element of the AsyAE input matrix, and L is the desired label of the problem. A stacked asymmetric auto-encoder is the result of stacking several AsyAEs.
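The asymmetry can be illustrated with a small forward-pass sketch. Several details are assumptions made only for this example: a sigmoid activation, no bias terms, the label appended as the last element of the desired vector, an error defined as desired minus actual output, and arbitrary toy weights.

```python
import math


def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def asyae_forward(W1, W2, x):
    """Encoder then decoder; the decoder has one more output than the input."""
    o1 = [sigmoid(n) for n in matvec(W1, x)]    # encoder output
    o2 = [sigmoid(n) for n in matvec(W2, o1)]   # decoder output
    return o1, o2


def desired_output(x, label):
    """Desired vector: the input itself plus the personality label."""
    return list(x) + [label]


x = [0.2, 0.7, 0.1]                                        # 3 input features
W1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]                  # 2 hidden x 3 inputs
W2 = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.5], [0.1, 0.1]]    # (3 + 1) x 2 hidden
o1, o2 = asyae_forward(W1, W2, x)
d = desired_output(x, label=1.0)
error = [di - oi for di, oi in zip(d, o2)]
```

The extra decoder neuron is what makes training semi-supervised: the reconstruction part of the error is unsupervised, while the last component depends on the label.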

(2) Optimizing some hyper-parameters of a stacked asymmetric auto-encoder
Given that the number of DNN hyper-parameters is large, the simultaneous optimization of all of them complicates the computation and requires high-performance computing systems. Hence, we compromised between MIC and expertise for calculating the six critical DNN hyper-parameters as follows:
1. number of neurons in each hidden layer
2. learning rate value
3. initial value of trainable parameters
4. number of hidden layers
5. maximum epoch of network training
6. preventing over-fitting and under-fitting
For HPO of SAAE, the following principles apply. Figure 3 illustrates the flowchart of the proposed method in detail.
Figure 3. Flowchart of SAAE hyper-parameter optimization.
Determining the number of neurons in each hidden layer: In our work, $N_i$ indicates the number of neurons in the ith hidden layer, which is optimized by the MIC method. So, the first variable of MIC is $N_i$, an integer value with $N_i \in [1, m]$, where m is equal to the input size of the AsyAE. This forces the AsyAE to be an undercomplete network, meaning that the encoder layer has fewer neurons than the input layer.
Determining the learning rate in each hidden layer: µ i specifies the learning rate in the i th hidden layer, which will be optimized by the MIC method. Therefore, the second variable of the MIC population is a real value between zero and one, µ i ∈ (0, 1). It should be mentioned that we set the decimal digit of µ i equal to 5 to examine its effect on SA AE performance.
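As an illustration of how one (N i , µ i ) pair could be mapped onto the ranges stated above, the following Python sketch clips two raw optimizer values into N i ∈ [1, m] and µ i ∈ (0, 1) with five decimal digits. The function name `decode_individual`, the clipping strategy, and the small `eps` margin are assumptions made for illustration and are not taken from the MIC implementation.

```python
def decode_individual(raw_n, raw_mu, m, digits=5):
    """Map two raw optimizer values to an (N_i, mu_i) pair.

    raw_n  : real number proposed for the neuron count
    raw_mu : real number proposed for the learning rate
    m      : input size of the asymmetric auto-encoder
    digits : number of decimal digits kept for mu_i
    """
    # N_i is an integer clipped to the closed interval [1, m].
    n_i = min(max(int(round(raw_n)), 1), m)
    # mu_i is clipped into the open interval (0, 1) using a margin of
    # 10**(-digits) (an assumption), then rounded to `digits` decimals.
    eps = 10 ** (-digits)
    mu_i = round(min(max(raw_mu, eps), 1 - eps), digits)
    return n_i, mu_i
```

Out-of-range proposals are clipped rather than rejected here; a real optimizer might instead resample them.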
Initial value of trainable parameters: Although deep learning methods perform well in various problems, they are complicated to train because many factors strongly influence them. One of the critical factors is initialization.
The DNN parameters need a starting point in the feasible area to be trained. The proper initial parameters can accelerate the convergence. Contrarily, random initialization can trap the network in the local optima.
Optimization algorithms such as GA and PSO can be used in this field. However, the number of DNN parameters (weights and biases) is vast, e.g., 10^15, and producing chromosomes with these dimensions causes a memory error in the processor system and is not efficient in practice. Another method, suggested by Hinton et al., applies the restricted Boltzmann machine (RBM) network to tune the auto-encoder's initial parameters [58,59].
According to the ANN-base of an Asy AE and RBM, the Asy AE can be interpreted as two consecutive RBMs, as illustrated in Figure 4. The input layer is the visible unit, and the encoder layer is the hidden unit for the RBM 1 . In the RBM 2 , the encoder layer is the visible unit, and the decoder layer is the hidden unit. The conventional RBM is based on binary visible and hidden units, called Bernoulli-Bernoulli RBM (BBRBM). If both visible and hidden units have a Gaussian distribution, the Gaussian-Gaussian RBM (GGRBM) is employed [60]. Since the Asy AE input and parameters are real values, we used the GGRBM equations.
The energy function of the GGRBM is defined as Equation (14), where v presents visible units and h shows hidden units. It should be noted that the Asy AE input and the encoder output are the visible units of RBM 1 and RBM 2 , respectively.
where a i and b j are the visible and hidden unit biases, respectively, and σ i and σ j are their standard deviations. W i,j is the weight between the visible and hidden units. A probability value is assigned to each possible pair of visible and hidden units by Equation (15). Here, Z is the normalization constant calculated by Equation (16).
Equation (17) shows the objective function, which must be maximized. The updating functions are as follows, where < • > data and < • > model are the expected values under the sample data and the model probability distribution, respectively, and ζ is the learning rate.
Having described the GGRBM briefly, we now apply it. For a traditional auto-encoder, the initial parameters of the encoder layer are first selected randomly and then trained by the GGRBM method. The trained parameters are taken as the encoder layer's initial parameters, and their transpose is employed for the decoder layer. However, in the Asy AE , the encoder and decoder parameters are not symmetric and have to be obtained individually. So, the same principle is applied separately to the decoder layer to obtain its initial parameters.
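For orientation, one form of the Gaussian-Gaussian RBM energy and joint probability that is commonly given in the literature is written below using the symbols defined above. This is a sketch under the assumption that the standard formulation applies; the authors' Equations (14) to (16) may differ in detail (for example, in how the standard deviations scale the interaction term, or in writing Z as a sum rather than an integral).

```latex
E(v,h) = \sum_{i} \frac{(v_i - a_i)^2}{2\sigma_i^2}
       + \sum_{j} \frac{(h_j - b_j)^2}{2\sigma_j^2}
       - \sum_{i}\sum_{j} \frac{v_i}{\sigma_i}\, W_{i,j}\, \frac{h_j}{\sigma_j}

P(v,h) = \frac{e^{-E(v,h)}}{Z}, \qquad
Z = \int\!\!\int e^{-E(v,h)}\, dv\, dh
```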
The number of hidden layers: The value of this hyper-parameter is dependent on the performance of Asy AE s. The classification performance of each Asy AE is examined in MIC for each pair of (N i , µ i ). For the next Asy AE , the performance has to be better than that of the previous one. If the performance of Asy AE(i+1) is better than that of Asy AE(i) , the MIC algorithm is continued.
The performance criterion differs from one problem to another. The Unweighted Average (UA) recall criterion, which is frequently used in personality perception studies, is calculated by Equation (21). Here, recall Low is the recall of detecting the low degree of the studied personality trait, and recall High is the recall of detecting its high degree.
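The UA recall for the two-class (low/high) setting can be sketched in Python as the plain mean of the per-class recalls. The function name and the string class labels `"low"` and `"high"` are illustrative assumptions, not names from the original work.

```python
def unweighted_average_recall(y_true, y_pred, classes=("low", "high")):
    """Mean of per-class recalls; every class counts equally,
    regardless of how many instances it has."""
    recalls = []
    for c in classes:
        # Indices of the instances whose true label is class c.
        idx = [k for k, y in enumerate(y_true) if y == c]
        if not idx:
            raise ValueError(f"no instances of class {c!r}")
        hits = sum(1 for k in idx if y_pred[k] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

For an imbalanced set with four "low" and one "high" instance, a classifier that always predicts "low" reaches 80% accuracy but only 50% UA recall, which is why UA recall is preferred when class sizes differ.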
The maximum epoch of network training: Generally, the DNN training process proceeds until the maximum epoch (updating time) is reached [40]. As discussed in [24], the best data separation does not necessarily occur at the maximum epoch. Thus, the variation of a separability criterion J is employed as a stopping criterion, so that training finishes at the epoch in which the maximum separation is achieved.
J is calculated as follows, where S W is the within-class scattering matrix, and S B is the between-class scattering matrix [61]. det represents the determinant of a matrix.
Here, n i is the number of instances of the ith class, X is the encoder output matrix, c is the number of classes, µ is the mean of all instances, and µ i is the mean of the instances of the ith class.
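The scatter matrices described above can be sketched in pure Python as follows. The exact form of J is not reproduced here, so the ratio det(S W + S B ) / det(S W ) used below is an assumption: it is one common determinant-based separability measure, chosen because det(S B ) alone is zero whenever the feature dimension exceeds c − 1. All function names are illustrative.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def add_outer(acc, d, weight=1.0):
    """acc += weight * d d^T, in place."""
    for r in range(len(d)):
        for s in range(len(d)):
            acc[r][s] += weight * d[r] * d[s]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[pivot][k]) < 1e-12:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            result = -result
        result *= a[k][k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for s in range(k, n):
                a[r][s] -= f * a[k][s]
    return result

def separability_j(X, labels):
    """Assumed criterion: det(S_W + S_B) / det(S_W)."""
    dim = len(X[0])
    mu = mean_vec(X)  # mean of all instances
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    for c in sorted(set(labels)):
        rows = [x for x, y in zip(X, labels) if y == c]
        mu_c = mean_vec(rows)  # class mean
        for x in rows:
            add_outer(s_w, [xi - mi for xi, mi in zip(x, mu_c)])
        # The between-class term is weighted by the class size n_i.
        add_outer(s_b, [mi - m for mi, m in zip(mu_c, mu)], weight=len(rows))
    d_w = det(s_w)
    if d_w == 0.0:
        raise ValueError("within-class scatter matrix is singular")
    total = [[s_w[r][s] + s_b[r][s] for s in range(dim)] for r in range(dim)]
    return det(total) / d_w
```

Evaluating such a criterion on the encoder output after every epoch and keeping the parameters from the epoch with the largest value would correspond to the stopping rule described above.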
Preventing over-fitting and under-fitting problems: Over-fitting happens when a model fits the training dataset well but performs poorly on the testing dataset. Under-fitting occurs when a model performs poorly on both the training and testing samples.
The number of layers and the number of neurons in each layer can push a model toward over-fitting or under-fitting, and both change whenever the structure changes. More neurons and layers make the model more complex, whereas too few cannot capture the data pattern. Therefore, this is one of the problems that has to be dealt with in designing an optimum structure. So, a new loss function is defined to guide the model toward good fitting.
where a is the training threshold, and b is the validation threshold. We already discussed the UA recall criterion used chiefly in personality perception. We applied the loss function defined in Equation (25) instead of Equation (21). The aim is the maximization of Equation (25). We set a = 0.8 and b = 0.6 because a UA train of more than 80% and UA val of more than 60% are acceptable. The loss value can be in the range of [2.08, 0]. So, the set of (N i , µ i ) is acceptable to be maximized in Equation (25).
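The role of the two thresholds can be illustrated with a small Python check. This is not Equation (25); it is only a simplified reading of the acceptance rule stated above (UA train above a and UA val above b), and the function name and the three returned labels are assumptions made for illustration.

```python
def fit_status(ua_train, ua_val, a=0.8, b=0.6):
    """Classify a candidate (N_i, mu_i) by the two UA recall thresholds.

    a : training threshold (0.8 in the text)
    b : validation threshold (0.6 in the text)
    """
    if ua_train <= a:
        # The model does not even fit the training data well enough.
        return "under-fitting"
    if ua_val <= b:
        # Good on training data, poor on validation data.
        return "over-fitting"
    return "acceptable"
```

For example, `fit_status(0.95, 0.55)` flags a candidate that fits the training data but generalizes poorly.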

Final algorithm:
The pseudo-code of optimizing SA AE hyper-parameters is described in Algorithm 2.

End if
Set the encoder layer output of the ith Asy AE as the input of the (i + 1)th Asy AE .
i = i + 1.
End while
End while
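The outer loop of this procedure, in which a new Asy AE is added only while the performance keeps improving and each encoder output feeds the next layer, can be sketched in Python as follows. The callable `optimize_layer` is hypothetical: it stands in for the whole inner MIC search over (N i , µ i ) for one layer and is assumed to return the encoder output, the achieved score, and the chosen configuration. The `max_layers` cap is also an assumption added as a safety limit.

```python
def stack_greedily(x, optimize_layer, max_layers=10):
    """Greedy layer-wise stacking.

    x              : input data for the first auto-encoder
    optimize_layer : hypothetical callable, x -> (encoded, score, config)
    max_layers     : safety cap on the depth (an assumption)
    """
    layers = []
    best_score = float("-inf")
    while len(layers) < max_layers:
        encoded, score, config = optimize_layer(x)
        # Stop as soon as the new layer does not improve on the previous one.
        if score <= best_score:
            break
        layers.append(config)
        best_score = score
        # The encoder output of layer i becomes the input of layer i + 1.
        x = encoded
    return layers, best_score
```

With per-layer scores of 0.60, 0.70 and 0.65, the loop keeps the first two layers and discards the third.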

Simulations and Results
In this section, firstly, the results of the MIC method on three benchmarks and comparison with other published methods will be discussed. Then, the MIC will be used to design the structure of five individual DNNs for classifying five personality traits. A final comparison can be found at the end of this section.

The Results of the MIC on Three Optimization Benchmarks
Three well-known, multimodal, continuous, and non-separable benchmark functions with a global minimum value of zero, namely Rastrigin [52], Ackley [62], and Griewank [62], are used to validate the MIC method.
The multimodal property means that the function has many local optima or peaks, which tests the ability of an algorithm to avoid being stuck in a local minimum. Non-separable means that the variables of the function are interdependent. If all variables were independent, each could be optimized separately and the function would thereby be optimized [62]; for non-separable functions this is not possible. Therefore, these three functions are demanding problems for evaluating the performance of any new optimization algorithm.
The formula, feasible range of variables, and the global optima points of three functions are summarized in Table 2.
Here, n indicates the dimension of the function, which is n ≥ 2 for all mentioned functions. Figure 5 shows the shape of the functions described in Table 2. As can be seen, all three functions have many local optima and are suitable to show the ability of optimization methods to escape from being stuck in local optima. In order to show the performance of MIC against the conventional optimization methods, the comparison results of the mentioned four islands and MIC are reported in Table 3.
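For readers who want to experiment with the three benchmarks named above, the following Python sketch uses their standard textbook definitions. It is assumed, not confirmed, that these match the formulas in Table 2; the function names are illustrative.

```python
import math

def rastrigin(x):
    """Rastrigin function; global minimum 0 at x = (0, ..., 0)."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)

def ackley(x):
    """Ackley function; global minimum 0 at x = (0, ..., 0)."""
    n = len(x)
    mean_sq = sum(xi * xi for xi in x) / n
    mean_cos = sum(math.cos(2 * math.pi * xi) for xi in x) / n
    return -20 * math.exp(-0.2 * math.sqrt(mean_sq)) - math.exp(mean_cos) + 20 + math.e

def griewank(x):
    """Griewank function; global minimum 0 at x = (0, ..., 0)."""
    quad = sum(xi * xi for xi in x) / 4000
    # The product of cosines couples the variables, which makes the
    # function non-separable.
    prod = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return 1 + quad - prod
```

Each function accepts a list of any length n ≥ 2, so the same code covers both the 10D and 30D settings.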
Problem complexity increases with dimensionality: increasing the number of variables (the dimension) enlarges the search space, which makes finding the best solution more difficult [62]. To investigate the effect of dimension on search quality in MIC, we compared the results for 10 and 30 dimensions (10D and 30D) in Table 3.
For a fair comparison, all parameters and initial populations for the basic algorithms and MIC were set to the same values.
The following six criteria were utilized for a more reliable analysis. It should be mentioned that these criteria are common in optimization problems.

• The average number of iterations at which the stop criterion is reached, used to examine convergence speed (AvI).
• The average of the obtained best optima points (AvP).
• The smallest iteration at which the stop criterion occurs (SI).
• The standard deviation (SD), used to assess the efficiency and robustness of the algorithm.
• The number of successful runs divided by the total number of runs, called the success rate (SR).

Table 3 shows the simulation outcomes of MIC and four basic optimization algorithms.
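The listed criteria can be computed from repeated runs as sketched below in Python. Several details are assumptions made for illustration: each run is represented as a pair (iterations until the stop criterion, best function value found), a run counts as successful when its best value is within `tol` of the global minimum of zero, and SD is taken as the population standard deviation of the best values.

```python
from statistics import mean, pstdev

def summarize_runs(runs, tol=1e-6):
    """Summarize repeated optimization runs.

    runs : list of (iterations_to_stop, best_value) pairs
    tol  : success tolerance around the global minimum 0 (an assumption)
    """
    iters = [it for it, _ in runs]
    bests = [best for _, best in runs]
    successes = sum(1 for best in bests if abs(best) <= tol)
    return {
        "AvI": mean(iters),           # average iterations to stop
        "AvP": mean(bests),           # average best value found
        "SI": min(iters),             # smallest iteration count
        "SD": pstdev(bests),          # spread of the best values
        "SR": successes / len(runs),  # success rate
    }
```

A small SD together with an SR close to 1 indicates that the algorithm reaches the optimum repeatably rather than occasionally.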
The AvI results show that MIC reaches more accurate solutions with a faster convergence speed than the traditional algorithms for n = 10. For n = 30, the MM performance diminishes, whereas LM and EM preserve their performance with increasing complexity. This demonstrates that LM and EM improve solutions steadily for a long time without getting stuck in local minima. MIC is therefore more powerful than each of the four basic algorithms alone when it comes to solving global optimization problems.
According to the AvP values in n = 10 and n = 30, traditional algorithms are often unsuccessful in finding favorable solutions in comparison to MIC, especially EM. Additionally, it can be concluded from AvP that the MIC speeds up the convergence to the global optima. The AvP values in n = 30 in comparison to n = 10 decreased about 0.1 in Rastrigin and remained constant for the other two functions in LM and EM. The change in the AvP values in MM is meaningful, which indicates getting stuck in the local optimum with the increase in the complexity of the problem, like the traditional methods.
Our SI outcomes show that the MIC method, especially EM and LM, reaches the stop criterion in a few iterations, which means that the MIC method speeds up convergence. Moreover, the SI criterion shows that although the MM method performs better than the basic optimization methods on simpler functions (n = 10), its performance drops on complex functions (n = 30). LM and EM are effective not only on simple functions but also on complex problems compared to the other methods.
The evaluation results of criterion BOP show that the LM and EM methods achieve the global optimal value more accurately than the basic methods in n = 10 and n = 30. However, MM implementation results decrease with increasing complexity.
It can be seen that the SD values of MIC, except for MM, are very small in comparison to those of the four basic algorithms in n = 10 and n = 30. This indicates the repeatability and robustness of the new algorithm, which we attribute to pruning of the search space.
The SR results indicate that MIC is more reliable than the traditional algorithms: LM, EM, and MM reached the desired value of the function in 100% of the runs for n = 10. As the complexity of the function increases (n = 30), the LM and EM methods are still successful in reaching the desired value.
From Table 3, it is concluded that as the dimension increases, the results of all algorithms degrade, except those of EM and LM.
Our study indicates that the quality of the solutions found by our proposed method for widely used global optimization benchmark functions is higher than that of the solutions provided by traditional algorithms. This is due to a more appropriate tradeoff between exploring new individuals and exploiting highly fit individuals found at the parallelism level. The three widely used test functions show that the new method has great potential for substantial improvement in search performance.
Due to the wide usage of these benchmarks, a comparison with other published works is presented in Table 4. It can be observed that LM and EM achieved the best solution in the Ackley and Griewank functions (30D).
After the successful outcomes of the MIC method in finding the global optima of three complex benchmark functions, we applied our novel method to find near-optimal values of the hyper-parameters for classifying five personality traits. We use "near-optimal" rather than "optimal" structure because the hyper-parameters of MIC itself, such as the mutation and crossover rates, were chosen randomly.
Taking into account that different personality traits have different effects on speech characteristics [24,42], using the same DNN structure for all traits to extract features is not recommended. Assuming the five personality traits were independent, five separate neural networks were designed and trained to classify the five personality traits.
Hence, the network's depth was determined by classifying the output of each Asy AE encoder layer with an SVM with a radial basis function kernel. The Asy AE with the highest classification result is taken as the output layer of the SA AE .
Table 5 compares the results of our proposed method with other works on the SPC dataset in terms of UA recall and accuracy (N/A means not available). In our previous work, the structure of the SA AE was chosen by trial-and-error, which was time-consuming, and two traits (extraversion and openness) achieved lower accuracy than reported by other research [24]. In the present study, not only was the accuracy of extraversion and openness improved, but the UA recalls also increased. This indicates that the performance and robustness of trained models are highly dependent on their hyper-parameter settings.

Conclusions
HPT is among the most challenging aspects of ANN studies and is mostly done by trial-and-error, which affects network performance. This article proposed a new approach based on cultural evolution and parallel computing to achieve a near-optimal structure of the SA AE in a reasonable time for automatic personality perception. We used the concept of parallelism and information on different regions of the search space to improve the search spaces in MIC and exchanged them between islands to provide greater population diversity. The proposed approach was implemented on three complex benchmarks, and its performance was evaluated with six criteria in comparison with four basic optimization methods. The results showed that our approach outperforms traditional optimization and newly published algorithms in four aspects: (1) convergence speed, (2) precision, (3) escaping from entrapment in local optima, and (4) repeatability. To test our method further, we increased the problem complexity by increasing the number of variables up to 30. The outcomes demonstrated the reliability of the MIC method, especially LM and EM. Subsequently, five hyper-parameters of the SA AE were optimized. Since the tuning of hyper-parameters affects over-fitting and under-fitting, we introduced a new cost function to control them during the optimization process.