Article

Hyperparameter Optimization EM Algorithm via Bayesian Optimization and Relative Entropy

Dawei Zou, Chunhua Ma, Peng Wang and Yanqiu Geng

1 School of Information Engineering, Suihua University, Suihua 152061, China
2 Engineering Technology Research Center of Artificial Intelligence Innovation Application, Suihua University, Suihua 152061, China
* Authors to whom correspondence should be addressed.
Entropy 2025, 27(7), 678; https://doi.org/10.3390/e27070678
Submission received: 9 May 2025 / Revised: 20 June 2025 / Accepted: 23 June 2025 / Published: 25 June 2025
(This article belongs to the Special Issue Entropy in Machine Learning Applications, 2nd Edition)

Abstract

Hyperparameter optimization (HPO), also called hyperparameter tuning, is a vital component of developing machine learning models. Hyperparameters regulate the behavior of a machine learning algorithm, cannot be learned directly from the given training data, and can significantly affect the performance of the model. In the context of relevance vector machine hyperparameter optimization, zero-mean Gaussian weight priors have been used to derive iterative equations through evidence function maximization. For a general Gaussian weight prior and Bayesian linear regression, we similarly derive iterative reestimation equations for the hyperparameters through evidence function maximization. Subsequently, using relative entropy and Bayesian optimization, the aforementioned non-closed-form reestimation equations can be partitioned into E and M steps, providing a clear mathematical and statistical explanation of the iterative reestimation equations for the hyperparameters. The experimental results show the effectiveness of the EM algorithm for hyperparameter optimization, and the algorithm also has the merit of fast convergence, except when the covariance of the posterior distribution is a singular matrix, which affects the increase in the likelihood.

1. Introduction

In machine learning, parameters are classified into two categories: model parameters, which are internal and configurable, and hyperparameters, which are external and cannot be estimated from data. Model parameters include, for example, the weights of a deep neural network. Hyperparameters include the batch size, the learning rate, and the number of hidden layers in a neural network [1]. Hyperparameter optimization (HPO) is a pivotal aspect of machine learning model training: the process of fine-tuning a model's hyperparameters to improve its performance. Hyperparameters are set before the learning process begins, and tuning them can significantly affect a model's accuracy and generalization ability [2]. The main hyperparameter optimization methods are as follows.

Grid search: Grid search is the brute-force way of searching hyperparameters [3], with defined lower and upper bounds along with specific steps [4]. It works on the Cartesian product of the different sets of values, evaluates every configuration, and returns the combination with the best performance [5]. Grid search is simple to implement but can be highly inefficient for large search spaces because of its exhaustive nature, a problem that is further compounded as data dimensionality increases.

Random search: Random search samples hyperparameter combinations at random from a predefined search space. While less computationally intensive than grid search, it can often identify superior hyperparameter configurations because it explores the search space more efficiently [6,7].

Genetic algorithms: Genetic algorithms are inspired by the process of natural selection, whereby a population of candidate solutions (hyperparameter configurations) evolves over multiple generations [8,9]. They are useful for exploring large search spaces and can handle discrete and nonlinear search spaces effectively.

Gradient-based optimization: Gradient-based optimization treats hyperparameters as continuous variables [10,11,12]. This approach is often used for neural network architectures and can be efficient for optimizing large-scale models.

Bayesian optimization: Bayesian optimization employs a probabilistic model of the objective function (model performance) to identify the most promising hyperparameter configurations for evaluation. It is more efficient than grid and random search, particularly in the context of high-dimensional and noisy optimization problems [13,14,15,16].

Recently, automated machine learning (AutoML) platforms have also been applied to practical problems. They automate the entire machine learning pipeline, including data preprocessing, model selection, feature engineering, and hyperparameter optimization, and leverage various optimization techniques to find the best model configuration automatically. By automating repetitive tasks and reducing the need for manual intervention, they make machine learning more accessible to users with limited expertise in data science, allowing experts to focus on understanding the problem domain and interpreting the results rather than on technical details [17,18,19,20]. Particle swarm optimization (PSO) has also recently become a popular algorithm for hyperparameter optimization; it is simple to implement and explores the global search space efficiently [21,22,23].
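As a concrete contrast between the first two methods, the sketch below runs grid search over the Cartesian product of two value sets and random search over the same space. The objective function and the value ranges are hypothetical stand-ins for a real validation score.

```python
import itertools
import random

# Hypothetical validation score as a function of two hyperparameters.
def score(lr, batch_size):
    return -(lr - 0.01) ** 2 - 1e-6 * (batch_size - 64) ** 2

grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64, 128]}

# Grid search: exhaustively evaluate the Cartesian product of the value sets.
best_grid = max(itertools.product(grid["lr"], grid["batch_size"]),
                key=lambda cfg: score(*cfg))

# Random search: sample configurations at random from the same space.
random.seed(0)
samples = [(10 ** random.uniform(-3, -1), random.choice(grid["batch_size"]))
           for _ in range(8)]
best_random = max(samples, key=lambda cfg: score(*cfg))
print(best_grid, best_random)
```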
However, gradient-based methods, grid search, random search, genetic algorithms, and particle swarm optimization all lack a rigorous mathematical explanation; they are driven more by computational heuristics. Additionally, grid search, random search, and particle swarm optimization are brute-force methods that are time-consuming and labor-intensive. Our proposed hyperparameter optimization EM algorithm, based on Bayesian optimization theory and relative entropy, has a strict mathematical derivation and explanation. The simulation results show that the algorithm has the advantage of fast convergence.
The relevance vector machine (RVM) is a Bayesian sparse kernel technique developed for regression tasks. The RVM model for regression is linear, with a zero-mean Gaussian weight prior of the following form [24]:

p(\omega \mid \eta) = \prod_{i=1}^{M} \mathcal{N}(\omega_i \mid 0, \eta_i^{-1}),  (1)

where \eta_i is a hyperparameter and \eta = (\eta_1, \ldots, \eta_M)^T.

In this paper, we propose a Bayesian hyperparameter optimization for a more general form of the weight prior, defined as follows:

p(\omega \mid \eta, \mu) = \prod_{i=1}^{M} \mathcal{N}(\omega_i \mid \mu_i, \eta_i^{-1}),  (2)

where \mu_i is also a hyperparameter and \mu = (\mu_1, \ldots, \mu_M)^T.

We designate the weight prior (2) as the general Gaussian weight prior (GGWP). First, leveraging evidence function maximization and the GGWP, we derive non-closed-form iterative reestimation equations for the hyperparameters. We then partition these non-closed-form reestimation equations into an E step and an M step, elucidating the hyperparameter reestimation equations mathematically and statistically.
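As a brief illustration, drawing a weight vector from the GGWP (2) amounts to one independent Gaussian draw per component; the dimension and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5
mu = rng.normal(size=M)            # prior means mu_i
eta = rng.uniform(0.5, 2.0, M)     # prior precisions eta_i

# One draw from the GGWP (2): omega_i ~ N(mu_i, eta_i^{-1}).
omega = rng.normal(loc=mu, scale=1.0 / np.sqrt(eta))
```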

2. Related Mathematical Knowledge

Lemma 1.
\frac{\partial}{\partial x} \ln |M| = \mathrm{Tr}\left(M^{-1} \frac{\partial M}{\partial x}\right).
Lemma 2.
\frac{\partial}{\partial x}\left(M^{-1}\right) = -M^{-1} \frac{\partial M}{\partial x} M^{-1}.
For proofs of Lemmas 1 and 2, see [24,25,26,27].
Definition 1.
Let \nabla denote the gradient and h^T \nabla the operator such that, for k \in \mathbb{N},
(h^T \nabla)^{k+1} f = (h^T \nabla)\left((h^T \nabla)^{k} f\right),
where (h^T \nabla)^{0} f = f.
These symbolic powers are operators that act upon a multivariate function of n variables.
Lemma 3
(multivariate Taylor's expansion). Let f : B_{x,r} \to \mathbb{R} be a multivariate function, where B_{x,r} \subseteq \mathbb{R}^k and f \in C^{n}(B_{x,r}). For h \in \mathbb{R}^k with \|h\| < r, there exists \theta \in (0, 1) such that
f(x + h) = \sum_{l=0}^{n-1} \frac{1}{l!}\left((h^T \nabla)^{l} f\right)(x) + \frac{1}{n!}\left((h^T \nabla)^{n} f\right)(x + \theta h).
Definition 1 and a detailed proof of Lemma 3 are given in [28].
Corollary 1.
A quadratic function f : B_{w,r} \to \mathbb{R} is defined as follows:
f(x) = x^T A x + x^T b + c,
where c is a scalar constant, b and x = (x_1, \ldots, x_n)^T are both n-dimensional column vectors, and A is an n \times n invertible symmetric matrix. Then, at the stationary point x_0 = -\frac{1}{2} A^{-1} b, the Taylor expansion of the function f(x) is
f(x) = f(x_0) + \frac{1}{2}(x - x_0)^T H (x - x_0),
where the elements of the Hessian matrix H are defined by
H_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.
A detailed proof of Corollary 1 is given in [29].
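Corollary 1 states that the second-order expansion around the stationary point is exact for a quadratic. The following sketch checks this numerically; constructing A as A A^T plus a scaled identity is an arbitrary device to guarantee a symmetric invertible matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
A = A @ A.T + n * np.eye(n)          # symmetric and invertible
b = rng.normal(size=n)
c = 0.7

f = lambda x: x @ A @ x + x @ b + c
x0 = -0.5 * np.linalg.solve(A, b)    # stationary point x0 = -(1/2) A^{-1} b
H = 2 * A                            # Hessian of the quadratic

x = rng.normal(size=n)
# The expansion f(x0) + (1/2)(x - x0)^T H (x - x0) reproduces f exactly.
assert np.isclose(f(x), f(x0) + 0.5 * (x - x0) @ H @ (x - x0))
```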

3. HPO via Maximization of the Evidence Function

3.1. Bayesian Linear Regression and Linear Basis Function Models

The linear model for regression is given by a linear combination of fixed nonlinear functions of the input variables:

y(x, \omega) = \omega_0 + \sum_{i=1}^{M-1} \omega_i \psi_i(x),  (3)

where \omega_0 is called the bias parameter and \psi_i(x) is a basis function. For convenience, we set \psi_0(x) = 1.

We rewrite (3) in matrix form [24]:

y(x, \omega) = \omega^T \psi(x),  (4)

where \psi(x) = (\psi_0(x), \ldots, \psi_{M-1}(x))^T and \omega = (\omega_0, \ldots, \omega_{M-1})^T.
The target variable t for regression is defined as the sum of a deterministic function y(x, \omega) and random noise \varepsilon:

t = y(x, \omega) + \varepsilon,  (5)

where the noise \varepsilon is Gaussian with zero mean and precision \lambda (inverse variance). So, we obtain

p(t \mid x, \omega, \lambda) = \mathcal{N}(t \mid y(x, \omega), \lambda^{-1}).  (6)

Now, the input dataset X = (x_1, \ldots, x_N) is considered in conjunction with its target value vector T = (t_1, \ldots, t_N). By assumption, these data points are all drawn independently from the Gaussian distribution (6). Thus, the likelihood function is

p(T \mid X, \omega, \lambda) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid \omega^T \psi(x_i), \lambda^{-1}).  (7)
According to (7), we define the corresponding conjugate prior as follows:

p(\omega) = \mathcal{N}(\omega \mid m_0, C_0).  (8)

From (7) and (8), we directly obtain the following posterior distribution:

p(\omega \mid T, X, \lambda) = \mathcal{N}(\omega \mid m_N, C_N),  (9)

where

m_N = C_N (C_0^{-1} m_0 + \lambda \Psi^T T),  (10)

C_N^{-1} = C_0^{-1} + \lambda \Psi^T \Psi,  (11)

\Psi = \begin{pmatrix} \psi_0(x_1) & \cdots & \psi_{M-1}(x_1) \\ \vdots & \ddots & \vdots \\ \psi_0(x_N) & \cdots & \psi_{M-1}(x_N) \end{pmatrix}.  (12)

Detailed proofs of (10) and (11) are given in Appendix A of [29].
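As a sketch, (10) and (11) translate directly into a few lines of linear algebra; the function below is our illustration in this paper's notation, not library code.

```python
import numpy as np

def blr_posterior(Psi, T, m0, C0, lam):
    """Posterior N(m_N, C_N) of Bayesian linear regression, Equations (10)-(11)."""
    C0_inv = np.linalg.inv(C0)
    CN = np.linalg.inv(C0_inv + lam * Psi.T @ Psi)   # Equation (11)
    mN = CN @ (C0_inv @ m0 + lam * Psi.T @ T)        # Equation (10)
    return mN, CN
```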

3.2. Evidence Approximation and Bayesian Model Comparison

Let us suppose we are presented with a set of models M_k, k = 1, \ldots, L. We compare these models and choose the optimal one from a Bayesian perspective, aiming to mitigate the risk of overfitting commonly associated with maximum likelihood approaches. We express uncertainty through a prior probability p(M_k). In the context of a training set, it is reasonable to assume that all models have the same prior probability, consistent with the notion that, in practice, there should be no inherent preference for any specific model. The objective is thus to assess the following posterior distribution given the dataset D:

p(M_k \mid D) \propto p(M_k)\, p(D \mid M_k).  (13)

The term p(D \mid M_k) represents the model evidence, also known as the marginal likelihood, and can be interpreted as a likelihood function over the space of models [24].
In this section, we present a fully Bayesian treatment based on the introduction of priors over the hyperparameters \eta, \lambda, and \mu. Predictions are derived by marginalizing over these hyperparameters and the weight parameter. Nevertheless, it is difficult to marginalize all the variables completely in closed form. Consequently, we adopt an approximation: the hyperparameters \eta, \lambda, and \mu are identified by maximizing the marginal likelihood function obtained by integrating over the weight parameters. In statistics, this method is commonly referred to as empirical Bayes [30,31], type 2 maximum likelihood [32], or generalized maximum likelihood [33]. In machine learning, it is frequently referred to as the evidence approximation [24,34]. Introducing a hyperprior over the parameters \eta, \lambda, and \mu, the following predictive distribution is obtained:

p(t \mid T) = \int p(t \mid x, \omega, \lambda)\, p(\eta, \lambda, \mu \mid T)\, p(\omega \mid T, \eta, \lambda, \mu)\, d\eta\, d\lambda\, d\mu\, d\omega,  (14)

where p(t \mid x, \omega, \lambda) is given by (6) and p(\omega \mid T, \eta, \lambda, \mu) is determined from (9).

If the posterior p(\eta, \lambda, \mu \mid T) exhibits a sharp peak around \hat{\eta}, \hat{\lambda}, and \hat{\mu}, the predictive distribution can be derived by marginalizing over \omega with the hyperparameters set to the values \hat{\eta}, \hat{\lambda}, and \hat{\mu}:

p(t \mid T) \approx p(t \mid T, \hat{\eta}, \hat{\lambda}, \hat{\mu}) = \int p(\omega \mid T, \hat{\eta}, \hat{\lambda}, \hat{\mu})\, p(t \mid x, \omega, \hat{\lambda})\, d\omega.  (15)

The posterior for the hyperparameters is expressed as

p(\eta, \lambda, \mu \mid T) \propto p(\eta, \lambda, \mu)\, p(T \mid \eta, \lambda, \mu).  (16)

The values of \hat{\eta}, \hat{\lambda}, and \hat{\mu} can be derived by maximizing the marginal likelihood function p(T \mid \eta, \lambda, \mu) according to the evidence approximation, particularly when the prior is relatively flat. By evaluating the marginal likelihood, its maxima can be identified, enabling the values of these hyperparameters to be determined solely from the training dataset.

3.3. The Evidence Function Evaluation

From (7), we derive the following result:

p(T \mid X, \omega, \lambda) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid \omega^T \psi(x_i), \lambda^{-1}) = \left(\frac{\lambda}{2\pi}\right)^{N/2} \exp\left(-\frac{\lambda}{2}\sum_{i=1}^{N}\left(t_i - \omega^T \psi(x_i)\right)^2\right) = \left(\frac{\lambda}{2\pi}\right)^{N/2} \exp\left(-\frac{\lambda}{2}\|T - \Psi \omega\|^2\right).  (17)

From (2), the following result is obtained:

p(\omega \mid \eta, \mu) = \prod_{i=1}^{M} \mathcal{N}(\omega_i \mid \mu_i, \eta_i^{-1}) = \frac{\prod_{i=1}^{M} \eta_i^{1/2}}{(2\pi)^{M/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^{M}\eta_i (\omega_i - \mu_i)^2\right) = \frac{|\Lambda|^{1/2}}{(2\pi)^{M/2}} \exp\left(-\frac{1}{2}(\omega - \mu)^T \Lambda (\omega - \mu)\right).  (18)

It can be shown that the posterior over the weight parameter \omega is Gaussian:

p(\omega \mid T, X, \eta, \lambda, \mu) = \mathcal{N}(\omega \mid m, C),  (19)

where

m = (\Lambda + \lambda \Psi^T \Psi)^{-1} (\Lambda \mu + \lambda \Psi^T T),  (20)

C = (\Lambda + \lambda \Psi^T \Psi)^{-1},  (21)

and \Lambda = \mathrm{diag}(\eta_1, \ldots, \eta_M).

The proofs of both (20) and (21) are detailed in Appendix A.
The evidence approximation is used to evaluate the hyperparameters. We obtain the marginal likelihood function (evidence function) by integrating out the weight parameters:

p(T \mid X, \eta, \lambda, \mu) = \int p(T \mid X, \omega, \lambda)\, p(\omega \mid \eta, \mu)\, d\omega.  (22)

Then, we obtain

p(T \mid X, \eta, \lambda, \mu) = \left(\frac{\lambda}{2\pi}\right)^{N/2} |\Lambda|^{1/2} |C|^{1/2} \exp(-G(m)),  (23)

where G(m) = \frac{\lambda}{2}\|T - \Psi m\|^2 + \frac{1}{2}(m - \mu)^T \Lambda (m - \mu).

The proof of (23) is detailed in Appendix B.

Taking the logarithm of the marginal likelihood function (23), also called the log marginal likelihood function, yields

\ln p(T \mid X, \eta, \lambda, \mu) = \frac{N}{2}\ln\lambda - G(m) + \frac{1}{2}\sum_{i=1}^{M}\ln\eta_i + \frac{1}{2}\ln|C| - \frac{N}{2}\ln(2\pi).  (24)

For convenience, we abbreviate the marginal likelihood function as the likelihood and the log marginal likelihood function as the log likelihood. The log likelihood will now be maximized.
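For later use in the experiments, (24), together with (20), (21), and G(m) from (23), can be transcribed directly into NumPy; the following minimal sketch is our own transcription.

```python
import numpy as np

def log_evidence(Psi, T, eta, lam, mu):
    """Log marginal likelihood (24) under the GGWP."""
    N = len(T)
    Lam = np.diag(eta)
    C = np.linalg.inv(Lam + lam * Psi.T @ Psi)       # Equation (21)
    m = C @ (Lam @ mu + lam * Psi.T @ T)             # Equation (20)
    G = (0.5 * lam * np.sum((T - Psi @ m) ** 2)
         + 0.5 * (m - mu) @ Lam @ (m - mu))          # G(m) from (23)
    return (0.5 * N * np.log(lam) - G + 0.5 * np.sum(np.log(eta))
            + 0.5 * np.linalg.slogdet(C)[1] - 0.5 * N * np.log(2 * np.pi))
```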

3.4. The Evidence Function Maximization

We now maximize p(T \mid X, \eta, \lambda, \mu) with respect to \eta_i. From (24) and Lemma 1, we derive

\frac{\partial}{\partial \eta_i}\ln|C| = \mathrm{Tr}\left(C^{-1}\frac{\partial C}{\partial \eta_i}\right) = \mathrm{Tr}\left((\Lambda + \lambda \Psi^T \Psi)\, \frac{\partial}{\partial \eta_i}\left((\Lambda + \lambda \Psi^T \Psi)^{-1}\right)\right).  (25)

By using (21), (25), and Lemma 2, we get

\frac{\partial}{\partial \eta_i}\ln|C| = -\mathrm{Tr}\left(\left(\frac{\partial}{\partial \eta_i}(\Lambda + \lambda \Psi^T \Psi)\right) C\right),  (26)

where we have used

\frac{\partial}{\partial \eta_i}\left((\Lambda + \lambda \Psi^T \Psi)^{-1}\right) = -(\Lambda + \lambda \Psi^T \Psi)^{-1}\left(\frac{\partial}{\partial \eta_i}(\Lambda + \lambda \Psi^T \Psi)\right)(\Lambda + \lambda \Psi^T \Psi)^{-1}.

Since \partial(\Lambda + \lambda \Psi^T \Psi)/\partial \eta_i has a single nonzero entry, namely a one in position (i, i), (26) reduces to

\frac{\partial}{\partial \eta_i}\ln|C| = -C_{ii},  (27)

where C_{ii} is the ith element of the principal diagonal of the covariance matrix C.

The following result is also obtained:

\frac{\partial G(m)}{\partial \eta_i} = \frac{1}{2}m_i^2.  (28)

By making use of (27) and (28), the stationary point with respect to \eta_i satisfies

\frac{1}{2\eta_i} - \frac{1}{2}C_{ii} - \frac{1}{2}m_i^2 = 0,  (29)

in which m_i is the ith element of the mean vector m.

From (29), we obtain the following result:

\eta_i = \frac{1}{m_i^2 + C_{ii}}.  (30)
The derivative of \ln|C| with respect to \lambda follows from (21) and Lemma 1:

\frac{\partial}{\partial \lambda}\ln|C| = \mathrm{Tr}\left(C^{-1}\frac{\partial C}{\partial \lambda}\right) = \mathrm{Tr}\left((\Lambda + \lambda \Psi^T \Psi)\, \frac{\partial}{\partial \lambda}\left((\Lambda + \lambda \Psi^T \Psi)^{-1}\right)\right).  (31)

The application of Lemma 2 yields

\frac{\partial}{\partial \lambda}\ln|C| = -\mathrm{Tr}(\Psi^T \Psi C).  (32)

We also get

\frac{\partial G(m)}{\partial \lambda} = \frac{1}{2}\|T - \Psi m\|^2.  (33)

From (32) and (33), the stationary point with respect to \lambda satisfies

\frac{N}{\lambda} - \|T - \Psi m\|^2 - \mathrm{Tr}(\Psi^T \Psi C) = 0.  (34)

From (34), we obtain

\lambda = \frac{N}{\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)}.  (35)

Setting \partial G(m)/\partial \mu = \Lambda \mu - \Lambda m = 0, we obtain

\mu = m.  (36)

From the above derivation, we know that (30), (35), and (36) are non-closed-form reestimation equations for \eta, \lambda, and \mu, so an iterative procedure is adopted. Given initial values of \eta, \lambda, and \mu, the mean and covariance are first computed using (20) and (21); the hyperparameters are then reestimated alternately by employing (30), (35), and (36), after which (20) and (21) are used again to recompute the mean and covariance, and this process is repeated until a suitable convergence criterion is met. Nevertheless, it remains unclear why this iterative reestimation procedure is justified. To gain a clearer understanding of the iterative reestimation equations, we turn to the EM algorithm and analyze them from both mathematical and statistical perspectives.

4. HPO EM Algorithm via Bayesian Optimization and Relative Entropy

The expectation maximization (EM) algorithm is a highly effective and sophisticated approach for identifying maximum likelihood solutions within probabilistic models that incorporate latent variables [35,36,37].

The EM algorithm is derived by first treating the weights as latent variables, which is straightforward here. The log marginal likelihood function is then obtained by marginalizing the joint distribution p(T, \omega \mid X, \eta, \lambda, \mu) over the weight parameters:

L = \ln p(T \mid X, \eta, \lambda, \mu) = \ln \int p(T, \omega \mid X, \eta, \lambda, \mu)\, d\omega,  (37)

where p(T, \omega \mid X, \eta, \lambda, \mu) is equal to the product of the weight prior and the likelihood:

p(T, \omega \mid X, \eta, \lambda, \mu) = p(\omega \mid \eta, \mu)\, p(T \mid X, \omega, \lambda).  (38)

From Jensen's inequality, we obtain a lower bound on L as follows:

L = \ln \int v(\omega)\, \frac{p(T, \omega \mid X, \eta, \lambda, \mu)}{v(\omega)}\, d\omega \geq \int v(\omega) \ln \frac{p(T, \omega \mid X, \eta, \lambda, \mu)}{v(\omega)}\, d\omega = F(v, \eta, \lambda, \mu, \theta),  (39)

where v(\omega) is a variational probability distribution over the weights and \theta denotes all other parameters. The EM algorithm is designed to maximize the log marginal likelihood function L by iteratively maximizing the lower bound. In the E step, we maximize F with respect to the probability distribution v(\omega) for fixed hyperparameters \eta, \lambda, and \mu. In the M step, we maximize F with respect to the hyperparameters \eta, \lambda, and \mu for a fixed probability distribution v(\omega). To enhance the understanding of the E step, we rewrite the lower bound F as

F = L - \mathrm{KL}(v(\omega)\, \|\, p(\omega \mid T, X, \eta, \lambda, \mu)),

where \mathrm{KL}(v(\omega) \| p(\omega \mid T, X, \eta, \lambda, \mu)) is the relative entropy, also called the Kullback–Leibler (KL) divergence. The KL divergence is always greater than or equal to zero, and it is zero if and only if the two distributions are equal. The E step thus corresponds to equating v(\omega) with the posterior over \omega, that is, v(\omega) = p(\omega \mid T, X, \eta, \lambda, \mu), which gives F = L [38]. Given that the posterior probability distribution is Gaussian, the E step reduces to the computation of its mean vector m and covariance matrix C, defined by (20) and (21), respectively.
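For intuition about the E step, the KL divergence between two multivariate Gaussians has the standard closed form sketched below; it vanishes exactly when the two Gaussians coincide, which is what setting v(\omega) equal to the posterior achieves.

```python
import numpy as np

def kl_gaussians(m0, C0, m1, C1):
    """KL(N(m0, C0) || N(m1, C1)), the standard closed form for Gaussians."""
    k = len(m0)
    C1_inv = np.linalg.inv(C1)
    d = m1 - m0
    return 0.5 * (np.trace(C1_inv @ C0) + d @ C1_inv @ d - k
                  + np.linalg.slogdet(C1)[1] - np.linalg.slogdet(C0)[1])
```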
To carry out the M step, F is rewritten in a different form:

F = \int v(\omega) \ln p(T, \omega \mid X, \eta, \lambda, \mu)\, d\omega + H(v(\omega)),  (40)

in which H(v(\omega)) is the entropy of v(\omega) and clearly does not depend on \eta, \lambda, or \mu.

In order to perform the M step, from (40), we obtain

F = \int v(\omega) \ln p(T, \omega \mid X, \eta, \lambda, \mu)\, d\omega + H(v(\omega)) = \int v(\omega)\left(\ln p(\omega \mid \eta, \mu) + \ln p(T \mid X, \omega, \lambda)\right) d\omega + H(v(\omega)) = \int v(\omega) \ln p(\omega \mid \eta, \mu)\, d\omega + \int v(\omega) \ln p(T \mid X, \omega, \lambda)\, d\omega + H(v(\omega)).  (41)

Finally, we obtain the following result:

F = \frac{1}{2}\ln|\Lambda| - \frac{M+N}{2}\ln(2\pi) - \frac{1}{2}m^T \Lambda m - \frac{1}{2}\mathrm{Tr}(\Lambda C) + \mu^T m - \frac{1}{2}\mu^T \mu + \frac{N}{2}\ln\lambda - \frac{\lambda}{2}\left(\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)\right) + H(v(\omega)).  (42)
The proof of (42) is derived in detail in Appendix C.
The stationary points of (42) with respect to \eta_i, \lambda, and \mu are obtained with ease:

\frac{\partial F}{\partial \eta_i} = \frac{1}{2\eta_i} - \frac{1}{2}(m_i^2 + C_{ii}) = 0,  (43)

\frac{\partial F}{\partial \lambda} = \frac{N}{2\lambda} - \frac{1}{2}\left(\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)\right) = 0,  (44)

\frac{\partial F}{\partial \mu} = m - \mu = 0.  (45)

So, the update laws are obtained as follows:

\eta_i^{\mathrm{new}} = \frac{1}{m_i^2 + C_{ii}},  (46)

\lambda^{\mathrm{new}} = \frac{N}{\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)},  (47)

\mu^{\mathrm{new}} = m.  (48)

The goal of the update equations is to maximize the log marginal likelihood by the proposed EM algorithm, which is the same as maximizing the evidence function.

Finally, let us provide an intuitive explanation of our proposed EM algorithm for hyperparameter optimization. First, we generate random initial hyperparameter values \eta, \lambda, and \mu, which amounts to an initial "guess" of the hyperparameters. Then, we compute (guess) \eta, \lambda, and \mu again by (46), (47), and (48), and the log marginal likelihood is evaluated. By repeating this process, the algorithm alternates between refining the guesses while the log marginal likelihood gradually increases; once it no longer changes, that is, once it has converged, the algorithm stops making further guesses for the hyperparameters \eta, \lambda, and \mu. This also implies that the hyperparameters have converged. Checking the convergence of the log likelihood function is easier to implement programmatically than checking the convergence of the hyperparameters. Different initial values of the hyperparameters may lead to different local optima even though the likelihood converges.

5. Experimental Set-Up

5.1. Synthetic Data

The weight parameter \omega is either integrated out or treated as a latent variable; after it is sampled from the Gaussian prior, it is used only to construct the synthetic targets.

We first generate the hyperparameters \eta, \lambda, and \mu and the dataset X = (x_1, \ldots, x_N) randomly and choose a suitable linear basis function \psi_j(x_n) to obtain \Psi and T. Subsequently, the EM algorithm is applied in order to maximize the likelihood or the log likelihood. The procedure is presented in detail as Algorithm 1:
Algorithm 1: HPO EM algorithm for synthetic data
1. Randomly generate hyperparameter values \eta, \lambda, and \mu and the dataset X = (x_1, \ldots, x_N).
2. Choose a suitable linear basis function \psi(x_n) to obtain an N \times M matrix \Psi by using (12).
3. Generate the parameter vector \omega by sampling randomly from the prior (2).
4. Generate the N-dimensional noise vector \varepsilon by sampling randomly from the Gaussian distribution \mathcal{N}(0, \lambda^{-1}) and then generate T = (t_1, \ldots, t_N) by (4) and (5), respectively.
5. E step. Compute the mean m and covariance C using the current hyperparameter values:

m = (\Lambda + \lambda \Psi^T \Psi)^{-1} (\Lambda \mu + \lambda \Psi^T T),  (49)

C = (\Lambda + \lambda \Psi^T \Psi)^{-1}.  (50)

6. M step. Reestimate the hyperparameters by employing the mean m and covariance C obtained in step 5 and the following update equations:

\eta_i^{\mathrm{new}} = \frac{1}{m_i^2 + C_{ii}},  (51)

\lambda^{\mathrm{new}} = \frac{N}{\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)},  (52)

\mu^{\mathrm{new}} = m.  (53)

7. Compute the likelihood function or the log likelihood function given by

p(T \mid X, \eta, \lambda, \mu) = \left(\frac{\lambda}{2\pi}\right)^{N/2} |\Lambda|^{1/2} |C|^{1/2} \exp(-G(m))

or

\ln p(T \mid X, \eta, \lambda, \mu) = \frac{N}{2}\ln\lambda - G(m) + \frac{1}{2}\sum_{i=1}^{M}\ln\eta_i + \frac{1}{2}\ln|C| - \frac{N}{2}\ln(2\pi)

and then check the convergence of the hyperparameters or of the likelihood. If convergence is not satisfied, go back to step 5. If the likelihood or the log likelihood has converged, the algorithm stops; its computational complexity is O(M).
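A minimal end-to-end sketch of Algorithm 1 follows. It reuses the log_evidence helper sketched after Equation (24); the Gaussian basis centers and width, the initial hyperparameter ranges, and the tolerance are all illustrative assumptions.

```python
import numpy as np

def hpo_em(Psi, T, eta, lam, mu, max_iter=100, tol=1e-8):
    """Steps 5-7 of Algorithm 1; log_evidence is the sketch after Equation (24)."""
    N = len(T)
    prev = -np.inf
    for _ in range(max_iter):
        # E step, Equations (49)-(50): posterior mean and covariance.
        Lam = np.diag(eta)
        C = np.linalg.inv(Lam + lam * Psi.T @ Psi)
        m = C @ (Lam @ mu + lam * Psi.T @ T)
        # M step, Equations (51)-(53): reestimate the hyperparameters.
        eta = 1.0 / (m ** 2 + np.diag(C))
        lam = N / (np.sum((T - Psi @ m) ** 2) + np.trace(Psi.T @ Psi @ C))
        mu = m
        # Step 7: stop once the log likelihood (24) no longer increases.
        ll = log_evidence(Psi, T, eta, lam, mu)
        if abs(ll - prev) < tol:
            break
        prev = ll
    return eta, lam, mu, ll

# Steps 1-4: synthetic data with a Gaussian basis (illustrative choices).
rng = np.random.default_rng(42)
N, M = 100, 8
eta0, lam0, mu0 = rng.uniform(0.5, 2.0, M), 2.0, rng.normal(size=M)
X = rng.uniform(-1.0, 1.0, N)
centers = np.linspace(-1.0, 1.0, M)
Psi = np.exp(-0.5 * ((X[:, None] - centers) / 0.3) ** 2)   # Gaussian basis
omega = rng.normal(mu0, 1.0 / np.sqrt(eta0))               # weights from prior (2)
T = Psi @ omega + rng.normal(0.0, 1.0 / np.sqrt(lam0), N)  # targets via (4)-(5)

eta, lam, mu, ll = hpo_em(Psi, T, eta0, lam0, mu0)
```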
We apply Algorithm 1 to optimize the hyperparameter values \eta, \lambda, and \mu, where the linear basis function \psi(x) is a Gaussian basis function. Because directly judging the convergence of the hyperparameter values \eta, \lambda, and \mu is difficult, we instead judge the convergence of the likelihood or the log likelihood. This is reasonable because convergence of the hyperparameter values means the likelihood or the log likelihood remains constant, in other words, that the likelihood has converged, and it is much easier to implement programmatically.

From Figure 1, we see that the likelihood converges after approximately 10 iterations, so we can conclude that the EM algorithm has the advantage of fast convergence, which stems from its strict mathematical and statistical interpretation.
During the experiments, the covariance C of the posterior distribution defined by (50) can easily become a singular matrix, so we have to use a pseudo-inverse to evaluate C, which may produce imprecise results that affect the increase in the likelihood function, as shown in Figure 2.

The singularity of C arises from the initial hyperparameter values \eta, \lambda, and \mu. The simplest and most intuitive remedy is to randomly regenerate the initial values of \eta, \lambda, and \mu whenever singularity occurs, repeating until the convergence criterion is met, as in Figure 2.
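One way to implement this restart heuristic is sketched below; it reuses the hpo_em routine from the previous sketch and simply redraws the initial values whenever the E-step inversion fails. The sampling ranges and restart limit are illustrative assumptions.

```python
import numpy as np

def hpo_em_with_restarts(Psi, T, n_restarts=20, seed=0):
    """Redraw random initial hyperparameters whenever C turns singular."""
    rng = np.random.default_rng(seed)
    M = Psi.shape[1]
    for _ in range(n_restarts):
        eta = rng.uniform(0.1, 10.0, M)
        lam = rng.uniform(0.1, 10.0)
        mu = rng.normal(size=M)
        try:
            return hpo_em(Psi, T, eta, lam, mu)   # sketch after Algorithm 1
        except np.linalg.LinAlgError:
            continue  # singular Lam + lam * Psi.T @ Psi: try fresh initial values
    raise RuntimeError("no non-singular initialization found")
```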
Particle swarm optimization (PSO) is an intuitive and computationally efficient metaheuristic that is highly effective for hyperparameter optimization across a wide range of machine learning models. For the same likelihood function, we used PSO to optimize the hyperparameter values \eta, \lambda, and \mu, as shown in Figure 3. After 10,000 iterations, the likelihood function still had not converged. This highlights the fast convergence of our proposed method, which is attributed to its rigorous mathematical foundation.
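For reference, a minimal PSO maximizer of the kind used for this comparison is sketched below; the inertia and acceleration coefficients (0.7, 1.5, 1.5) are common textbook defaults and are assumptions, not the settings of our experiments. The objective f would be the log likelihood (24), viewed as a function of a packed vector of \eta, \lambda, and \mu.

```python
import numpy as np

def pso_maximize(f, dim, n_particles=30, iters=200, seed=0):
    """Minimal particle swarm optimization maximizer (illustrative baseline)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))    # particle positions
    v = np.zeros_like(x)                              # particle velocities
    pbest = x.copy()                                  # personal bests
    pbest_val = np.array([f(p) for p in x])
    g = pbest[np.argmax(pbest_val)]                   # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[np.argmax(pbest_val)]
    return g, pbest_val.max()
```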

5.2. The Diabetes Dataset

The diabetes dataset contains data from 442 diabetes patients, with each patient having measurements for 10 baseline variables, including age, sex, body mass index, average blood pressure, and six blood serum measurements. In addition, each patient has a response variable that represents a quantitative measure of disease progression one year after the baseline. This dataset is commonly used for predicting the progression of the disease and is one of the most frequently used datasets in machine learning. It is particularly well suited for research on regression problems.
First, from the diabetes dataset, we obtain the input dataset X = (x_1, \ldots, x_{442}), an 8 × 442 matrix, and the target value vector T = (t_1, \ldots, t_{442}). We then generate the hyperparameters \eta, \lambda, and \mu randomly and choose a suitable linear basis function \psi_j(x_n) to obtain \Psi. Subsequently, the EM algorithm is applied in order to maximize the likelihood or the log likelihood. The procedure is presented in detail as Algorithm 2:
Algorithm 2: HPO EM algorithm for the diabetes dataset
1. Randomly generate hyperparameter values \eta, \lambda, and \mu.
2. Choose a suitable linear basis function \psi(x_n) to obtain a 442 × 8 matrix \Psi by using (12).
3. E step. Compute the mean m and covariance C using the current hyperparameter values:

m = (\Lambda + \lambda \Psi^T \Psi)^{-1} (\Lambda \mu + \lambda \Psi^T T),  (54)

C = (\Lambda + \lambda \Psi^T \Psi)^{-1}.  (55)

4. M step. Reestimate the hyperparameters by employing the mean m and covariance C obtained in step 3 and the following update equations:

\eta_i^{\mathrm{new}} = \frac{1}{m_i^2 + C_{ii}},  (56)

\lambda^{\mathrm{new}} = \frac{N}{\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)},  (57)

\mu^{\mathrm{new}} = m.  (58)

5. Compute the likelihood function or the log likelihood function given by

p(T \mid X, \eta, \lambda, \mu) = \left(\frac{\lambda}{2\pi}\right)^{N/2} |\Lambda|^{1/2} |C|^{1/2} \exp(-G(m))

or

\ln p(T \mid X, \eta, \lambda, \mu) = \frac{N}{2}\ln\lambda - G(m) + \frac{1}{2}\sum_{i=1}^{M}\ln\eta_i + \frac{1}{2}\ln|C| - \frac{N}{2}\ln(2\pi)

and then check the convergence of the hyperparameters or of the likelihood. If the convergence criterion is not satisfied, go back to step 3.
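As a usage sketch, the diabetes dataset can be loaded from scikit-learn and passed to the hpo_em routine from Section 5.1. Using the raw 10-variable design matrix as \Psi is an illustrative choice standing in for the 8-column basis described above.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, T = load_diabetes(return_X_y=True)   # 442 patients, 10 baseline variables
Psi = X                                  # identity basis: one column per variable

rng = np.random.default_rng(0)
M = Psi.shape[1]
eta0 = rng.uniform(0.5, 2.0, M)
lam0 = 1.0
mu0 = rng.normal(size=M)

# hpo_em as sketched after Algorithm 1.
eta, lam, mu, ll = hpo_em(Psi, T, eta0, lam0, mu0)
print("log likelihood after EM:", ll)
```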
From Figure 4, we can draw the same conclusion: on this well-known machine learning dataset, the EM algorithm has the advantage of fast convergence, essentially after the first iteration.

6. Conclusions

In this paper, we present the general Gaussian weight prior rather than the zero-mean Gaussian weight prior for hyperparameter optimization in machine learning. We first derive non-closed-form iterative reestimation equations for the hyperparameters by evidence function maximization. Although these equations tell us how to optimize the hyperparameters, they do not by themselves explain why the iterative procedure works, so we resort to the EM algorithm. Using Bayesian optimization theory and relative entropy, the EM algorithm partitions the iterative reestimation equations into E and M steps, which provides a clear interpretation of the iterative reestimation equations both mathematically and statistically. The experimental results show the effectiveness of the EM algorithm for hyperparameter optimization, and the EM algorithm also has the advantage of fast convergence, except when the covariance C of the posterior distribution defined by (50) is a singular matrix, which affects the increase in the likelihood.

Author Contributions

Conceptualization, D.Z. and C.M.; methodology, D.Z.; software, D.Z. and P.W.; validation, D.Z., C.M. and P.W.; formal analysis, D.Z.; investigation, D.Z.; resources, Y.G.; data curation, Y.G.; writing—original draft preparation, D.Z. and C.M.; writing—review and editing, D.Z. and P.W.; visualization, D.Z.; supervision, C.M.; project administration, C.M.; funding acquisition, Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Universities in Heilongjiang Province (Project Numbers: YWF10236220242, YWF10236240126).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

From (17) and (18), the following results are obtained:

p(T \mid X, \omega, \lambda) = \mathcal{N}(T \mid \Psi \omega, \lambda^{-1} I)

and

p(\omega \mid \eta, \mu) = \mathcal{N}(\omega \mid \mu, \Lambda^{-1}).

From (10) and (11), using m_0 = \mu and C_0 = \Lambda^{-1}, we obtain

C^{-1} = C_N^{-1} = C_0^{-1} + \lambda \Psi^T \Psi = \Lambda + \lambda \Psi^T \Psi,

m = C_N (C_0^{-1} m_0 + \lambda \Psi^T T) = (\Lambda + \lambda \Psi^T \Psi)^{-1} (\Lambda \mu + \lambda \Psi^T T).

Appendix B

By substituting (17) and (18) into (22), the following result is obtained:

p(T \mid X, \eta, \lambda, \mu) = \left(\frac{\lambda}{2\pi}\right)^{N/2} \frac{|\Lambda|^{1/2}}{(2\pi)^{M/2}} \int \exp(-G(\omega))\, d\omega,

together with

G(\omega) = \frac{\lambda}{2}\|T - \Psi\omega\|^2 + \frac{1}{2}(\omega - \mu)^T \Lambda (\omega - \mu) = \frac{\lambda}{2}(T - \Psi\omega)^T (T - \Psi\omega) + \frac{1}{2}\omega^T \Lambda \omega - \omega^T \Lambda \mu + \frac{1}{2}\mu^T \Lambda \mu = \frac{1}{2}\omega^T (\Lambda + \lambda \Psi^T \Psi)\omega - \omega^T (\lambda \Psi^T T + \Lambda \mu) + \frac{1}{2}\mu^T \Lambda \mu + \frac{\lambda}{2} T^T T.

First, we compute the gradient and Hessian of G(\omega):

\nabla G(\omega) = (\Lambda + \lambda \Psi^T \Psi)\omega - (\Lambda \mu + \lambda \Psi^T T),

\nabla\nabla G(\omega) = \Lambda + \lambda \Psi^T \Psi = C^{-1}.

Setting

\nabla G(\omega) = (\Lambda + \lambda \Psi^T \Psi)\omega - (\Lambda \mu + \lambda \Psi^T T) = 0,

we obtain the stationary point

\omega = (\Lambda + \lambda \Psi^T \Psi)^{-1} (\Lambda \mu + \lambda \Psi^T T) = m.

We obtain the following result by using Corollary 1:

G(\omega) = G(m) + \frac{1}{2}(\omega - m)^T C^{-1} (\omega - m),

along with

G(m) = \frac{\lambda}{2}\|T - \Psi m\|^2 + \frac{1}{2}(m - \mu)^T \Lambda (m - \mu).

Thus, we obtain

p(T \mid X, \eta, \lambda, \mu) = \left(\frac{\lambda}{2\pi}\right)^{N/2} \frac{|\Lambda|^{1/2}}{(2\pi)^{M/2}} \exp(-G(m)) \int \exp\left(-\frac{1}{2}(\omega - m)^T C^{-1} (\omega - m)\right) d\omega = \left(\frac{\lambda}{2\pi}\right)^{N/2} |\Lambda|^{1/2} |C|^{1/2} \exp(-G(m)),

where we have employed the following result:

\frac{1}{(2\pi)^{M/2}|C|^{1/2}} \int \exp\left(-\frac{1}{2}(\omega - m)^T C^{-1} (\omega - m)\right) d\omega = 1.

Appendix C

From the first term of (41), we obtain

\int v(\omega) \ln p(\omega \mid \eta, \mu)\, d\omega = \int v(\omega)\left(\frac{1}{2}\ln|\Lambda| - \frac{M}{2}\ln(2\pi) - \frac{1}{2}(\omega - \mu)^T \Lambda (\omega - \mu)\right) d\omega = \frac{1}{2}\ln|\Lambda| - \frac{M}{2}\ln(2\pi) - \frac{1}{2}\int (\omega - \mu)^T \Lambda (\omega - \mu)\, v(\omega)\, d\omega,

where \int v(\omega)\, d\omega = 1 is used.

We get the following result:

\int (\omega - \mu)^T \Lambda (\omega - \mu)\, v(\omega)\, d\omega = \int \mathrm{Tr}\left(\Lambda \omega\omega^T - \omega\mu^T - \mu\omega^T + \mu\mu^T\right) v(\omega)\, d\omega = \mathrm{Tr}\left(\Lambda \int \omega\omega^T v(\omega)\, d\omega\right) - 2\,\mathrm{Tr}\left(\left(\int \omega\, v(\omega)\, d\omega\right)\mu^T\right) + \mathrm{Tr}(\mu\mu^T) = \mathrm{Tr}(\Lambda E_v(\omega\omega^T)) - 2\,\mathrm{Tr}(E_v(\omega)\mu^T) + \mathrm{Tr}(\mu\mu^T) = \mathrm{Tr}(\Lambda(mm^T + C)) - 2\,\mathrm{Tr}(m\mu^T) + \mathrm{Tr}(\mu\mu^T) = m^T \Lambda m + \mathrm{Tr}(\Lambda C) - 2\mu^T m + \mu^T \mu,

where we have used E_v(\omega\omega^T) = mm^T + C; here E_v(\cdot) denotes the expectation over the given probability distribution v(\omega), and \mathrm{Tr}(\cdot) denotes the trace of a matrix.

Finally, we get

\int v(\omega) \ln p(\omega \mid \eta, \mu)\, d\omega = \frac{1}{2}\ln|\Lambda| - \frac{M}{2}\ln(2\pi) - \frac{1}{2}m^T \Lambda m - \frac{1}{2}\mathrm{Tr}(\Lambda C) + \mu^T m - \frac{1}{2}\mu^T \mu.

The second term of (41) is now computed:

\int v(\omega) \ln p(T \mid X, \omega, \lambda)\, d\omega = \int v(\omega)\left(\frac{N}{2}\ln\lambda - \frac{N}{2}\ln(2\pi) - \frac{\lambda}{2}\|T - \Psi\omega\|^2\right) d\omega = \frac{N}{2}\ln\lambda - \frac{N}{2}\ln(2\pi) - \frac{\lambda}{2}\int \left(T^T T - 2\omega^T \Psi^T T + \omega^T \Psi^T \Psi \omega\right) v(\omega)\, d\omega.

The remaining integral evaluates as

\int \left(T^T T - 2\omega^T \Psi^T T + \omega^T \Psi^T \Psi \omega\right) v(\omega)\, d\omega = T^T T - 2 E_v(\omega^T)\Psi^T T + \mathrm{Tr}\left(\Psi E_v(\omega\omega^T)\Psi^T\right) = T^T T - 2 m^T \Psi^T T + m^T \Psi^T \Psi m + \mathrm{Tr}(\Psi C \Psi^T) = \|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C),

where E_v(\omega\omega^T) = mm^T + C and the cyclic property of the trace are used.

The result is obtained as follows:

\int v(\omega) \ln p(T \mid X, \omega, \lambda)\, d\omega = \frac{N}{2}\left(\ln\lambda - \ln(2\pi)\right) - \frac{\lambda}{2}\left(\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)\right).

Finally, we get

F = \frac{1}{2}\ln|\Lambda| - \frac{M+N}{2}\ln(2\pi) - \frac{1}{2}m^T \Lambda m - \frac{1}{2}\mathrm{Tr}(\Lambda C) + \mu^T m - \frac{1}{2}\mu^T \mu + \frac{N}{2}\ln\lambda - \frac{\lambda}{2}\left(\|T - \Psi m\|^2 + \mathrm{Tr}(\Psi^T \Psi C)\right) + H(v(\omega)).

References

  1. Alibrahim, H.; Ludwig, S.A. Hyperparameter Optimization: Comparing Genetic Algorithm against Grid Search and Bayesian Optimization. In Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC), Krakow, Poland, 28 June–1 July 2021. [Google Scholar]
  2. Algorain, F.T.; Alnaeem, A.S. Deep Learning Optimisation of Static Malware Detection with Grid Search and Covering Arrays. Telecom 2023, 4, 249–264. [Google Scholar] [CrossRef]
  3. Claesen, M.; Simm, J.; Popovic, D.; Moreau, Y.; Moor, B.D. Easy Hyperparameter Search Using Optunity. arXiv 2014. [Google Scholar] [CrossRef]
  4. Syarif, I.; Prugel-Bennett, A.; Wills, G. SVM parameter optimization using grid search and genetic algorithm to improve classification performance. TELKOMNIKA (Telecommun. Comput. Electron. Control.) 2016, 14, 1502–1509. [Google Scholar] [CrossRef]
  5. Liu, B. A Very Brief and Critical Discussion on AutoML. arXiv 2018. [Google Scholar] [CrossRef]
  6. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  7. Snoek, J.; Larochelle, H.; Adams, R. Practical Bayesian Optimization of Machine Learning Algorithms. Adv. Neural Inf. Process. Syst. 2012, 4, 1–9. [Google Scholar]
  8. Di Francescomarino, C.; Dumas, M.; Federici, M.; Ghidini, C.; Maggi, F.M.; Rizzi, W.; Simonetto, L. Genetic Algorithms for Hyperparameter Optimization in Predictive Business Process Monitoring. Inf. Syst. 2018, 74, 67–83. [Google Scholar] [CrossRef]
  9. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1997–2017. [Google Scholar]
  10. Thiede, L.A.; Parlitz, U. Gradient based hyperparameter optimization in Echo State Networks. Neural Netw. 2019, 115, 23–29. [Google Scholar] [CrossRef] [PubMed]
  11. Bengio, Y. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade, 2nd ed.; Montavon, G., Orr, G.B., Müller, K.-R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 437–478. [Google Scholar]
  12. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016. [Google Scholar] [CrossRef]
  13. Bergstra, J.; Bardenet, R.; Kégl, B.; Bengio, Y. Algorithms for Hyper-Parameter Optimization. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011. [Google Scholar]
  14. Snoek, J.; Rippel, O.; Swersky, K.; Kiros, R.; Satish, N.; Sundaram, N.; Patwary, M.; Prabhat, M.; Adams, R. Scalable Bayesian Optimization Using Deep Neural Networks. arXiv 2015. [Google Scholar] [CrossRef]
  15. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; Freitas, N.D. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 2016, 104, 148–175. [Google Scholar] [CrossRef]
  16. Yao, C.; Cai, D.; Bu, J.; Chen, G. Pre-training the deep generative models with adaptive hyperparameter optimization. Neurocomputing 2017, 247, 144–155. [Google Scholar] [CrossRef]
  17. Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; Hutter, F. Efficient and robust automated machine learning. Adv. Neural Inf. Process. Syst. 2015, 28, 2944–2952. [Google Scholar]
  18. Kaul, A.; Maheshwary, S.; Pudi, V. AutoLearn—Automated Feature Generation and Selection. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017. [Google Scholar]
  19. Jin, H.; Song, Q.; Hu, X. Auto-Keras: An Efficient Neural Architecture Search System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar] [CrossRef]
  20. Salehin, I.; Islam, M.S.; Saha, P.; Noman, S.M.; Tuni, A.; Hasan, M.M.; Baten, M.A. AutoML: A systematic review on automated machine learning with neural architecture search. J. Inf. Intell. 2024, 2, 52–81. [Google Scholar] [CrossRef]
  21. Chartcharnchai, P.; Jewajinda, Y.; Praditwong, K. A Categorical Particle Swarm Optimization for Hyperparameter Optimization in Low-Resource Transformer-Based Machine Translation. In Proceedings of the 28th International Computer Science and Engineering Conference (ICSEC), Khon Kaen, Thailand, 6–8 November 2024. [Google Scholar]
  22. Indrawati, A.; Wahyuni, I.N. Enhancing Machine Learning Models through Hyperparameter Optimization with Particle Swarm Optimization. In Proceedings of the International Conference on Computer, Control, Informatics and Its Applications (IC3INA), Bandung, Indonesia, 4–5 October 2023. [Google Scholar]
  23. Marchisio, A.; Ghillino, E.; Curri, V.; Carena, A.; Bardella, P. Particle swarm optimization hyperparameters tuning for physical-model fitting of VCSEL measurements. In Proceedings of the SPIE OPTO, San Francisco, CA, USA, 27 January–1 February 2024. [Google Scholar]
  24. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  25. Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2013. [Google Scholar]
  26. Lütkepohl, H. Handbook of Matrices; John Wiley & Sons: Hoboken, NJ, USA, 1997. [Google Scholar]
  27. Aggarwal, C. Linear Algebra and Optimization for Machine Learning; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  28. Simovici, D. Mathematical Analysis for Machine Learning and Data Mining; World Scientific: Singapore, 2018. [Google Scholar]
  29. Zou, D.; Tong, L.; Wang, J.; Fan, S.; Ji, J. A Logical Framework of the Evidence Function Approximation Associated with Relevance Vector Machine. Math. Probl. Eng. 2020, 2020, 2548310. [Google Scholar] [CrossRef]
  30. Bernardo, J.M.; Smith, A.F. Bayesian Theory; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  31. Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 1995. [Google Scholar]
  32. Berger, J.O. Statistical Decision Theory and Bayesian Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  33. Wahba, G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann. Stat. 1985, 13, 1378–1402. [Google Scholar] [CrossRef]
  34. MacKay, D.J.C. Bayesian Interpolation. Neural Comput. 1992, 4, 415–447. [Google Scholar] [CrossRef]
  35. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  36. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2008. [Google Scholar]
  37. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 1977, 39, 1–22. [Google Scholar] [CrossRef]
  38. Quinonero-Candela, J. Sparse probabilistic linear models and the RVM. In Learning with Uncertainty: Gaussian Processes and Relevance Vector Machines; Technical University of Denmark: Lyngby, Denmark, 2004. [Google Scholar]
Figure 1. Illustration of the convergence of the EM algorithm for synthetic data.
Figure 2. Illustration of singular covariance C affecting the increase in the likelihood.
Figure 3. Illustration of the convergence of the likelihood based on PSO.
Figure 4. Illustration of the convergence of the EM algorithm for the diabetes dataset.