Fast Model Selection and Hyperparameter Tuning for Generative Models

Generative models have gained significant attention in recent years. They are increasingly used to estimate the underlying structure of high-dimensional data and artificially generate various kinds of data similar to those from the real world. The performance of generative models depends critically on a good set of hyperparameters. Yet, finding the right hyperparameter configuration can be an extremely time-consuming task. In this paper, we focus on speeding up the hyperparameter search through adaptive resource allocation, early stopping underperforming candidates quickly and allocating more computational resources to promising ones by comparing their intermediate performance. The hyperparameter search is formulated as a non-stochastic best-arm identification problem where resources like iterations or training time constrained by some predetermined budget are allocated to different hyperparameter configurations. A procedure which uses hypothesis testing coupled with Successive Halving is proposed to make the resource allocation and early stopping decisions and compares the intermediate performance of generative models by their exponentially weighted Maximum Means Discrepancy (MMD). The experimental results show that the proposed method selects hyperparameter configurations that lead to a significant improvement in the model performance compared to Successive Halving for a wide range of budgets across several real-world applications.


Introduction
The performance of the generative models depends heavily on so-called hyperparameters which include the model architecture, the choice of training objective, regularization and training algorithms.However, the choice of these hyperparameters is often problemdependent, and it is unknown a priori which configuration would produce the best results in terms of a specific distance or divergence measure.With the rich set of objective functions and training algorithms proposed in recent years and the growing number of tuning parameters associated with them, it is crucial to develop computationally efficient search methods for hyperparameter configurations that yield models with a desired performance within a fixed budget constraint.
The problem of efficient model search and hyperparameter optimization has recently been dominated by Bayesian optimization approaches, e.g., [1][2][3], which speed up the search for good configurations by modeling the underlying structure of the search space.These approaches select and evaluate hyperparameter configurations in an adaptive manner, trying to find good configurations faster than baselines such as random search or grid search.While Bayesian optimization is efficient in tuning few hyperparameters, its efficiency often degrades significantly when the search dimension becomes much higher.Wang et al. [4] showed that for high-dimensional problems, standard Bayesian optimization methods perform similarly to random search.Moreover, traditional Bayesian optimization methods (that are often based on Gaussian processes) can only work on continuous hyperparameters, but not categorical ones (e.g., the choice of the training objective).The vast majority of these hyperparameter selection procedures consider the training of machine learning models to be black-box procedures, and only evaluate models after they have been trained to convergence.It thus seems natural to ask the following question: Can we terminate some of these poor-performing hyperparameter settings early to speed up hyperparameter optimization?This is one of the primary questions that we address in this work.
In fact, there is a line of research that perceives hyperparameter optimization as an adaptive computational resource allocation problem, where the type of resources can be iterations, execution time, data samples, or even total money to spend with a cloud computing provider.These approaches evaluate partially trained models and make decisions on the fly, allocating more resources to promising hyperparameter configurations while early stopping those that are not.They allow for the training of multiple models simultaneously.And by quickly eliminating unpromising ones and paralleling the model training for different hyperparameter configurations, more hyperparameter configurations can be examined.Swersky et al. [5], Domhan et al. [6] and Klein et al. [7] made parametric assumptions on the convergence behavior of learning curves to devise early stopping rules.However, these assumptions are strong and restrictive for the kinds of learning curves that are typically found in training machine learning models.In contrast, Sparks et al. [8] cast it as a multi-armed bandit problem, viewing each hyperparameter configuration as an 'arm', and resources constrained by some predetermined budget are allocated among them by some heuristic rule.Jamieson and Talwalkar [9] and Li et al. [10] studied a similar problem but in the non-stochastic setting and based their resource allocation strategies on the Successive Halving algorithm originally proposed in Karnin et al. [11] for stochastic settings.
However, most existing methodologies on fast model selection and hyperparameter tuning, including all of those discussed above, mainly focus on supervised learning tasks where the data are labeled and the performance of the partially trained models is represented by their losses on a hold-out test set.However, for unsupervised learning tasks, such as many generative modeling tasks that are performed to generate high-quality samples, e.g., [12], fast model search approaches, though important, have been largely scarce in the literature.In light of the rapidly growing literature on generative modeling, it is important to be able to perform efficient model search and hyperparameter tuning based on the quality of samples generated from the candidate models.
In this work, we focus on the best hyperparameter configuration identification problem for generative models.We perceive it as an adaptive resource allocation problem with a given budget (e.g., iteration and training time).We build our approach upon the nonstochastic multi-armed bandit formulation proposed by Jamieson and Talwalkar [9], which makes minimal assumptions on the convergence behavior of the model performance during the training process.Note that, in this work, we restrict the term "Generative Models" to represent models that learn the underlying probability distribution in input data to generate similar samples.To evaluate the performance of generative models, we compute a sample-based distance metric between the samples generated from partially trained models and those from the reference distribution.Particularly, we base our evaluation criterion on the Maximum Mean Discrepancy (MMD) [13] that measures the closeness of the generated samples to the reference distribution.By incorporating statistical tests between partially trained models into the evaluation process, our method effectively identifies the poor-performing configurations early on and allocates more resources to promising configurations.
The remainder of this paper is organized as follows.In Section 2.1, we review the non-stochastic best-arm identification problem and the Successive Halving algorithm.In Sections 2.2 and 2.3, we present the proposed Adaptive Successive Halving algorithm (AdaptSH) and provide intuition for it through an example.In Section 3, we present empirical results comparing AdaptSH with Successive Halving.We conclude with a discussion in Section 4.

Methods
In this section, we present the AdaptSH algorithm.We start with a brief review of the non-stochastic best-arm identification problem and Successive Halving.We subsequently introduce our choice of the metric for comparing the intermediate performance of generative models to decide which models should be trained further, and we provide intuition for our choice via a simple example.We then introduce our proposed statistical test and how we incorporate it into Successive Halving to help us distinguish between candidate models and make early stopping decisions.

Non-Stochastic Best-Arm Identification and Successive Halving
The non-stochastic best-arm identification problem, originally proposed in Jamieson and Talwalkar [9], considers a very general setting that encompasses the hyperparameter optimization problem of interest.It only assumes that the sequence of the losses of each arm (hyperparameter configuration) eventually converges without making any assumptions on the rate of the convergence, monotonicity, or smoothness of the sequence.Hence, it is generally applicable to a wide variety of problems including minimizing a non-convex objective using stochastic gradient descent or some other iterative algorithms.Let K denote the total number of arms and let ℓ k,j denote the validation error of the kth arm after training for j units of resources (e.g., iterations).For all k ∈ {1, 2, . . ., K}, assume ν k = lim j→∞ ℓ k,j exists.The goal is to identify arg min k ν k when the resources are constrained by some predetermined budget.Successive Halving, shown in Algorithm 1, is proposed by Jamieson and Talwalkar [9] to solve the above problem.The strategy of Successive Halving follows its name: given a set of K arms and a budget B, it splits the given budget evenly across log 2 (K) elimination rounds, uniformly allocates the resources to remaining arms at each round, evaluates their intermediate performance, throws out the worst half until one arm remains.By the design of the algorithm, it allocates exponentially more resources to more promising configurations.

Algorithm 1 Successive Halving
Sort the intermediate losses of the models in S i such that While in Successive Halving, half of the configurations are discarded at each elimination round, it is not entirely clear why we should do so.Indeed, it is not clear what proportion to discard in each round would lead to better results without prior knowledge about the convergence behavior of the sequences of losses.We propose to use statistical tests to detect when two models have separated in their performance to make on-the-fly elimination decisions.Before we jump into the statistical test, we first introduce our choice of metric to represent the intermediate performance of generative models, based on which we develop our statistical test to distinguish between their model performance.
While for supervised learning tasks, models are usually compared by their validation errors on a hold-out set, there is no such straightforward measure for generative model comparisons.Although there are a number of evaluation measures for generative models that have been proposed in recent years including the average log-likelihood, different variants of the Wasserstein distance [14], Fréchet Inception Distance (FID) [15] and Maximum Mean Discrepancy (MMD), there is no consensus as to which measure best captures the strengths and limitations of generative models and should be used for a fair model comparison.Indeed, there are a number of desired properties for a good measure, including the ability to distinguish generated samples from real ones, favoring models that generate diverse samples, and having low computational and sample complexity.It is unlikely that a single measure can cover all aspects.Since different applications require different trade-offs among the desired properties, it has been argued that the evaluation metric should be chosen with respect to specific applications [16].On the other hand, previous works have shown through empirical studies that MMD performs well in terms of the discriminability, robustness and efficiency compared to other metrics when it operates in the feature space [17,18].Moreover, the empirical estimate of MMD enjoys favorable statistical properties such as asymptotic normality, making it a favorable choice to construct two-sample and three-sample tests that compares probability distributions.Therefore, we base our model selection criterion on MMD.
MMD is a metric of probability measures which falls within the family of integral probability metrics (IPMs) [19].For two probability measures, P and Q, over X ⊂ R d , IPM is defined as which is the maximum difference between the mean function values on the two probability measures.The choice of the witness function class F determines the probability metric.The MMD is defined as the IPM with F being the unit ball in a reproducing kernel Hilbert space (RKHS) H, with a positive definite kernel k(•, •) : X × X → R, It can be interpreted as the distance between the mean embeddings of P and Q into H.It can be shown that the square of the MMD can be expressed as where X and X ′ are independent random variables having distribution P, and Y and Y ′ are independent random variables having distribution Q [13].It immediately follows that MMD has a straightforward unbiased empirical estimator: where X m := {x 1 , . . ., x m } and Y n := {y 1 , . . ., y n } are i.i.d.samples from P and Q, respectively.Let v i := (x i , y i ), i = 1, . . ., m be i.i.d samples from P × Q, when m = n.Then, ) converges weakly to a Gaussian distribution as m → ∞, when P ̸ = Q and E h 2 < ∞.MMD and the Wasserstein distance are two extremes of the Sinkhorn divergences, an entropic regularized variant of the Wasserstein distance, e.g., [20,21].
Instead of simply using the MMD 2 at the current iteration to represent the intermediate performance of each candidate model, we consider an exponentially weighted average of MMD 2 that takes into account the model performance at previous training iterations, i.e., where h ≥ 1 is the window size, β ∈ (0, 1) is the decay rate, P is the reference/target distribution and Q R k denotes the distributions of the samples generated from model k after being trained for R units of resources.An exponentially weighted average smooths the learning curves.As a result, the models' performance represented by the smoothed MMD 2 are more distinguishable from each other.Figure 1 shows an illustrative example.Two Generative Adversarial Network (GAN) models with different training objectives are trained on the Half Moons dataset, respectively (details of the dataset can be found in Section 3).At the early stage of training, the variation of the loss tends to be large.While any sequences of losses (under the convergence assumption) would eventually stabilize and be separated from each other even without smoothing, smoothed loss is able to distinguish between the two models at a much earlier stage, which can help us make the right decision about which models should be trained further.

Adaptive Successive Halving with Hypothesis Testing
Now, we introduce our proposed algorithm called Adaptive Successive Halving (AdaptSH), shown in Algorithm 2. As the name of our algorithm suggests, instead of following a predetermined elimination schedule as in Successive Halving, we base our decision on test results and adaptively change the elimination schedule based on the remaining budget and remaining number of arms.In particular, the major difference between AdaptSH and the original Successive Halving lies in line 8 of Algorithm 2, where we perform a sequence of statistical tests to compare the intermediate performance of the current "best" arm and each of other remaining arms by taking samples from them and compare their relative similarity to the target distribution.The corresponding p-values are adjusted using the procedure proposed by Benjamini and Yekutieli [22], which controls the false discovery rate.To avoid inflating the type I error rate, we use two independent samples from each model to sort the models (line 7) and to perform the statistical tests (line 8), respectively.The algorithm stops allocating further resources to models that perform significantly worse than the current "best" model, as measured by a desired loss criteria (e.g., MMD).It should be noted that our proposed algorithm can perhaps be used if an alternative loss (e.g., the Wasserstein distance and alike) is chosen, but developing an asymptotic theory required by the statistical tests can be challenging.
Sort the intermediate losses of the models in S i such that lβ,h , where σ i (•) is a bijection from {1, 2, . . ., n} to S i 8: Compare M σ i (1) against M σ i (j) (j = 2, . . ., n) using three-sample tests comparing the relative closeness of their generated samples to the validation dataset with Benjamini and Yekutieli [22] correction to obtain the adjusted p-values p i = i + 1 13: end while The statistical test in line 8 is a three-sample relative similarity test that aims to determine with high significance whether the samples generated by the current "best" model are closer to the evaluation data set than those of each remaining model.While there is rich literature on two-sample test problems for multivariate data, statistical tests for three-sample relative similarity are rarely studied in the literature.Bounliphone et al. [23] propose a relative similarity test: which tests the null hypothesis that two distributions Q and T are equally close to a target distribution P against the alternative hypothesis that Q is closer to P than T.They propose the following test statistic: where X m := {x 1 , . . ., x m }, Y m := {y 1 , . . ., y m } and Z m := {z 1 , . . ., z m } are iid samples from P, Q and T, respectively.The test statistic is asymptotically Gaussian, which directly follows from Hoeffding [24] (Theorem 7.1), which states that the joint distribution of several U-statistics converges weakly to a multivariate Gaussian distribution as m → ∞.We generalize their test to compare the averages of MMD 2 between two arms.In Section 2.2, we defined the exponentially weighted MMD 2 , lβ,h k,R that we use to represent the performance of arm k after being trained for R units of resources.In particular, we want to determine with high significance whether one arm has a smaller exponentially weighted MMD 2 than the other, which translates to the following null and alternative hypothesis: where {Q R i } R>0 , i = 1, 2 denote two sequences of distributions corresponding to two arms that are being compared.It is then natural to use the following test statistic: where The following theorem states the asymptotic normality of the joint distribution of multiple unbiased estimators of MMD 2 s, which follows directly from Hoeffding [24] (Theorem 7.1).
where 0 2h denotes a vector of zeros with length 2h, and Σ h,β m denotes the covariance matrix.
The explicit form of Σ h,β m and its empirical estimate are given in Appendix A. Then, the p-value can be approximated by where T and Φ(•) is the cumulative distribution function of a standard normal distribution.When h = 1, our proposed test reduces to the three-sample relatively similarity test proposed in Bounliphone et al. [23].Notice that as we have a closed form expression to compute the p-values using the asymptotic distribution (as given in the Appendix A), there is not much additional overhead in using the statistical tests within our proposed AdaptSH compared to the traditional SH.We use the same example as we used in Section 2.2 to illustrate the effect of using exponentially weighted average on the results of the statistical test between arms.We apply the test based only on the current MMD 2 and our proposed test respectively to the two models shown in Figure 1.The tests are performed every 10 iterations, and the alternative hypothesis considered is that Model 1 (represented in blue in Figure 1) has smaller (exponentially weighted) MMD 2 s.The resulting p-values are shown in Figure 2. Suppose the significance level α = 0.01 is considered.Then, a p-value less than 0.01 indicates that the Model 2 is significantly worse than Model 1 and will be stopped from further training based on our model search algorithm.And a p-value greater than 0.99 indicates that Model 2 is significantly better then Model 1 and that Model 1 will be early stopped, since it is equivalent to a p-value less than 0.01 if the opposite alternative hypothesis were considered.Figure 3 shows that increasing h increases the ease of making the right decision during the training process.For this particular example, when h ≥ 7, the chance of early stopping in favor of the worse model reduces to zero.

Experimental Results
In this section, we compare our proposed algorithm to Successive Halving on two hyperparameter optimization problems for GAN models.In particular, we consider a number of GAN models that are trained using different variants of the Sliced Wasserstein distance.The space of the models to search includes the set of distance metrics to search over and the reasonable ranges of their associated hyperparameters.We consider seven different variants of the Sliced Wasserstein distance and a set of different combinations of hyperparameters for each of them, which sum up to 30 GAN models in total.The details of the hyperparameter configurations considered for each distance metric can be found in Appendix B. The same generator and discriminator architectures are used for all 30 models (See Appendix B).
To evaluate the different search algorithms' performance under different budgets, we set the total budget of the iterations to a sequence of values, and for each budget let the search algorithms decide how to allocate it amongst the different arms.The final performance of the models is represented by their final losses, defined as ν k = lim j→∞ ℓ k,j for each model k.As ν k is unknown, and we approximate it by the loss after training the model for some finite units of resources R. In real-world applications, R is often determined by the maximum amount of resources that one wants to allocate to any given configuration, which are often inferred from previous training experiences on similar tasks or determined by the time and money one wants to spend on training one model for a particular task.To address the oscillation around the optimal value when the models are trained by stochastic optimization methods, we use the average loss within a small window [R, R + ∆) to approximate ν k , k = 1, . . ., K, i.e, vk = 1 ∆ ∑ R+∆−1 j=R lk,j .

2-D Half Moons
We first apply the searching algorithms to compare the 30 GAN models being trained on the Half Moons dataset.We uniformly sample 1000 points on the half moons as the training set and another 1000 points as the validation set.These sample sizes are used for just for numerical illustration but the results are not sensitive to these choices unless we use very small sample sizes.The training data are shown in Figure 4.At each elimination round, 500 points are sampled from each model to compute MMD 2  u that is used for sorting, and another 500 points are sampled for statistical testing.We set one unit of resources to 10 training iterations.And we vary the budget from 400 to 1400 units by a step of 50.And for each budget, we record the estimated final loss of the selected arm by each searching algorithm.We repeat the experiment for 20 trials.For each searching algorithm and each budget, we compute the average loss of the 20 trials, μi,B = 1 20 ∑ 20 j=1 vj k(i,B,j) , where i, B and j denote the search algorithm, budget and trial number, respectively, k(i, B, j) denotes the index of the model selected by algorithm i using budget B at trial j, and vj k denotes the estimated final loss of the model k at trial j.The tuning parameters for the exponential weighting are chosen as β = 0.9 and h = 6.For the choice of the kernel function in MMD, we follow Bounliphone et al. [23] and use a Gaussian kernel with bandwidth selected as the median pairwise distance between data points.But one can use alternative kernel functions (e.g., the semantic-aware deep kernel proposed by Gao et al. [25]).Additional numerical results for other tuning parameter settings are provided in Appendix C.
As is shown in Figure 5, for the majority of the budgets considered, our proposed method selects arms with smaller final losses on average.To further compare the losses of the arms selected by the two searching algorithms, we conduct a Mann-Whitney U test between the two samples { vj k(1,B,j) } 20 j=1 and { vj k(2,B,j) } 20 j=1 for each B with the alternative hypothesis that the final loss obtained by AdaptSH is stochastically smaller than that of Successive Halving.The p-values of the tests are shown in Figure 5.For budgets over 600, the p-value stays around 0.05, which indicates that our algorithm outperforms SH with a high confidence for a large range of budgets.

3-D Swiss Roll
We next compare AdaptSH to Successive Halving on the Swiss Roll dataset.A total of 1000 points are sampled on the Swiss Roll as the training set and another 2000 points as the validation set.The training data are shown in Figure 6.At each elimination round, 1000 points are sampled from each model to compute MMD 2  u that is used for sorting, and another 1000 points are sampled for statistical testing.The experiment setting is the same as that in the previous example, as well as the choice of the tuning parameters for the exponential weighting.We repeat the experiment for 20 trials.And the results are shown in Figure 7.Our proposed method selects arms with smaller final losses on most of the budgets, and the advantage is significant for most moderate budgets.

Conclusions
We have presented a method for hyperparameter optimization for generative models.We cast it as a non-stochastic best-arm identification problem with a fixed budget and identify the clearly underperforming arms early on through statistical tests between average empirical MMD 2 s of partially trained models.In our experiments, we showed that our procedure leads to a significant improvement in the performance of the selected configurations compared to Successive Halving on a wide range of budgets and is robust to the choice of the tuning parameters.Our method does not make any assumptions about the parametric form of the learning curves, nor about their monotonicity or smoothness, which makes it general enough to be suitable for most hyperparameter optimization tasks.
Then the diagonal terms of Σ h,β m are given by e T Kyy e where e is a vector of 1s with appropriate size, K xx , K yy and K xy are the kernel matrices, and K indicates that the diagonal terms have been set to zero.For example, Kxx ij = [K xx ] ij for all i ̸ = j and Kxx ij = 0 for i = j.
And the off-diagonal terms of Σ h,β m are given by     Figure A5 illustrates the difference between the resource allocation of Successive Halving and that of AdaptSH.Successive Halving has exhausted the budget and has to make the final decision at around iteration 1000, while AdaptSH is able to train some of the models to 2000 iterations before making the final decision, which results in overall smaller MMDs of the selected models (see Figure A6).

Figure 2 .
Figure 2. Upper: p-values of the statistical tests when only the most recent MMD 2 is considered; lower: p-values of the statistical tests when historical MMD 2 s within the moving window with h = 10 and β = 0.9 are considered.The two horizontal dashed lines correspond to p-values of 0.01 and 0.99 respectively.

Figure 3 .
Figure 3. Upper: The change of the percentage of tests that have a p-value less than 0.01 with h; lower: The change of the percentage of tests that have a p-value greater than 0.99 with h.

Figure 4 .
Figure 4.One thousand points sampled on the "Half Moons".

Figure 5 .
Figure 5. Results on the 2-D Half Moons dataset.Upper: Final loss of the selected models averaged over 20 trials against the input budget.Lower: p-values of the Mann-Whitney U tests.Two horizontal dashed lines represent p-value = 0.05 and 0.5, respectively.

Figure 6 .
Figure 6.One thousand points sampled on the "Swiss Roll".

Figure 7 .
Figure 7. Results on the 3-D Swiss Roll dataset.Upper: Final loss of the selected model averaged over 20 trials against the input budget.Lower: p-values of the Mann-Whitney U tests.Two horizontal dashed lines represent p-value = 0.05 and 0.5 respectively.

2 − 2 1 m 2 2 1 m 2 e T K xy e 2 + 1 m
(m − 1) e T Kyy K yx e − 1 n 2 (m − 1)m e T Kyy ee T K xy e + 1 m 3 e T K yx K xy e − 3 e T K xy K yx e,

Figure A1 .
Figure A1.Results on the 2-D Half Moons dataset with β = 0.9 and varying h.Upper: Final loss of the selected models averaged over 20 trials against the input budget.Lower: p-values of the Mann-Whitney U tests.Two horizontal dashed lines represent p-value = 0.05 and 0.5 respectively.

Figure A2 .
Figure A2.Results on the 2-D Half Moons dataset with h = 6 and varying β.Upper: Final loss of the selected models averaged over 20 trials against the input budget.Lower: p-values of the Mann-Whitney U tests.Two horizontal dashed lines represent p-value = 0.05 and 0.5 respectively.

Figure A3 .
Figure A3.Results on the 3-D Swiss Roll dataset with β = 0.9 and varying h.Upper: Final loss of the selected models averaged over 20 trials against the input budget.Lower: p-values of the Mann-Whitney U tests.Two horizontal dashed lines represent p-value = 0.05 and 0.5 respectively.

Figure A4 .
Figure A4.Results on the 3-D Swiss Roll dataset with h = 6 and varying β.Upper: Final loss of the selected models averaged over 20 trials against the input budget.Lower: p-values of the Mann-Whitney U tests.Two horizontal dashed lines represent p-value = 0.05 and 0.5 respectively.

Figure A5 .
Figure A5.Comparison of learning curves of AdaptSH and Successive Halving on the Half Moons dataset with budget B = 700.The plots contain all learning curves from the 20 runs.Each curve is truncated at the time of early stopping.The blue curves represent the final selected models of the 20 runs respectively.