Article

Deep Learning for Predicting Late-Onset Breast Cancer Metastasis: The Single-Hyperparameter Grid Search (SHGS) Strategy for Meta-Tuning a Deep Feed-Forward Neural Network

Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, USA
* Author to whom correspondence should be addressed.
Bioengineering 2025, 12(11), 1214; https://doi.org/10.3390/bioengineering12111214
Submission received: 10 September 2025 / Revised: 27 October 2025 / Accepted: 1 November 2025 / Published: 7 November 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Complex Diseases)

Abstract

Background: While machine learning has advanced in medicine, its widespread use in clinical applications, especially in predicting breast cancer metastasis, is still limited. We have been dedicated to constructing a deep feed-forward neural network (DFNN) model to predict breast cancer metastasis n years in advance. However, the challenge lies in efficiently identifying optimal hyperparameter values through grid search, given the constraints of time and resources. Issues such as the infinite possibilities for continuous hyperparameters like L1 and L2, as well as the time-consuming and costly process, further complicate the task. Methods: To address these challenges, we developed the Single-Hyperparameter Grid Search (SHGS) strategy, serving as a preselection method before grid search. Our experiments with SHGS applied to DFNN models for breast cancer metastasis prediction focused on analyzing eight target hyperparameters (epochs, batch size, dropout, L1, L2, learning rate, decay, and momentum). Results: We created three figures, each depicting the experimental results obtained from one of the three LSM-I-10+-year datasets. These figures illustrate the relationship between model performance and the target hyperparameter values. Our experiments achieved maximum test AUC scores of 0.770, 0.762, and 0.886 for the 10-year, 12-year, and 15-year datasets, respectively. For each hyperparameter, we analyzed whether changes in its value affect model performance, examined whether there were specific patterns, and explored how to choose values for the hyperparameter. Conclusions: Our experimental findings reveal that the optimal value of a hyperparameter is not only dependent on the dataset but is also significantly influenced by the settings of other hyperparameters. Additionally, our experiments suggest a reduced range of values for each target hyperparameter, which may be helpful for “low-budget” grid search. This approach serves as a foundation for the subsequent use of grid search to enhance model performance.

1. Introduction

Machine learning (ML) has always been an important research topic in AI, and we have seen successful cases of using AI-based learning techniques such as deep learning to conduct various tasks and solve a wide range of problems in the biomedical domain. Machine learning and deep learning methods have been used in applications such as predicting drug–drug interactions using real-world data collected via large-scale projects [1], modeling miRNA–mRNA interactions that cause phenotypic abnormalities in breast cancer patients [2], prioritizing disease-causing genes using sequence-based feature candidates [3,4,5], and predicting ubiquitination sites using physicochemical properties of protein sequence data [6,7]. Recent studies have also demonstrated the growing role of ML in diverse medical prediction tasks. For example, radiomics-based models have been used to identify specific breast cancer subtypes, helping to create more individualized approaches to diagnosis and therapy [8], Bayesian and machine learning models have been applied to predict how long heart surgery patients will stay in the hospital, helping hospitals plan resources and make treatment decisions [9], and bio-inspired algorithms have performed well in detecting Type 2 diabetes [10]. Furthermore, various learning methods have also been developed and applied to cancer-related prediction, such as predicting local breast cancer recurrence using natural language processing, identifying risk factors of prostate cancer recurrence using high-dimensional gene and clinical data [11], and predicting acute lymphoblastic leukemia (ALL) relapse in children [12].
An artificial neural network (ANN) is a machine learning framework that is designed to recognize patterns using a model loosely resembling the human brain [13,14]. Deep neural networks (DNNs), whose use is commonly referred to as deep learning, are neural networks composed of more than one hidden layer [15,16,17]. DNNs have achieved significant success in commercial applications such as voice and pattern recognition, computer vision, and image processing [18,19,20,21]. However, their power has not been fully explored or demonstrated in clinical applications, such as the prediction of breast cancer metastasis, in part due to modeling challenges that result from the sheer number of variables involved in these problems [22].
We have previously developed deep feed-forward neural network (DFNN) models that are able to predict n-year breast cancer metastasis [23]. Figure 1 illustrates the structure and the inner connections of a DFNN that we have developed. It is a four-layer neural network that contains one input layer, two hidden layers, and one output layer. The input nodes to this neural network represent the clinical features contained in the input patient data, which we also refer to as predictors, and the output layer contains one node representing the binary status of n-year breast cancer metastasis, which is also referred to as the target variable in this context.
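For concreteness, a minimal Keras sketch of such a four-layer DFNN is shown below. The hidden-layer widths, activation functions, and dropout placement are illustrative assumptions rather than the exact architecture reported in [23]; the input dimension corresponds to the number of clinical predictors.

```python
# A minimal sketch (not the authors' exact architecture): one input layer,
# two hidden layers, and a single sigmoid output node for the binary
# n-year metastasis status. Layer widths are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.metrics import AUC

def build_example_dfnn(n_predictors, dropout_rate=0.2):
    model = Sequential([
        Dense(32, activation="relu", input_shape=(n_predictors,)),  # first hidden layer
        Dropout(dropout_rate),
        Dense(16, activation="relu"),                               # second hidden layer
        Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),                             # binary metastasis output
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=[AUC()])
    return model
```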
One of the challenges of deep learning is that there are many hyperparameters that must be tuned to obtain good prediction models [24]. In deep learning, hyperparameters are the variables that determine the model’s architecture and directly influence the training process and output model performance [25]. Hyperparameters are set before training begins and remain constant throughout the training process, unaffected by learning [24]. Tuning hyperparameters for a given dataset is a repetitive process of identifying an optimal set of hyperparameter values that produce good prediction results. We call one value assignment to the set of hyperparameters of the DFNN models a hyperparameter setting [23]. Grid search is a systematic way to tune hyperparameters in which each of a predetermined set of hyperparameter settings is used to train models and all trained models are compared to identify the best-performing model [23]. We previously conducted grid searches to optimize the prediction performance of our DFNN models [26]. During the grid searches, DFNN models were trained and tested for each of the preselected hyperparameter settings, one at a time, and the average training and testing results per setting were saved.
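The grid-search procedure described above can be sketched as a simple exhaustive loop over the Cartesian product of candidate values. The candidate lists and the build_and_evaluate() helper below are hypothetical placeholders, not the code used in [26].

```python
# A minimal sketch of grid search over preselected hyperparameter values.
# The value lists and build_and_evaluate() are hypothetical placeholders.
from itertools import product

param_grid = {
    "learning_rate": [0.001, 0.01, 0.03],
    "batch_size": [16, 64, 256],
    "dropout": [0.0, 0.2, 0.5],
}

results = []
for values in product(*param_grid.values()):
    setting = dict(zip(param_grid.keys(), values))            # one hyperparameter setting
    mean_train_auc, mean_test_auc = build_and_evaluate(setting)  # hypothetical: train/evaluate a DFNN
    results.append({**setting,
                    "mean_train_auc": mean_train_auc,
                    "mean_test_auc": mean_test_auc})

best = max(results, key=lambda r: r["mean_test_auc"])          # best-performing setting
```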
Other approaches have been proposed to address the above-described challenge of grid search’s inefficiency. For instance, random search improves efficiency by exploring a random subset of hyperparameter combinations, achieving broader coverage while keeping computational demands low [27]. Bayesian optimization guides the search toward promising regions of the hyperparameter space using probability models, reducing the number of trials needed. Although both methods generally outperform grid search in terms of optimization speed, their primary aim is to identify the optimal combination of hyperparameters.
In contrast to these optimization-focused methods, the proposed Single-Hyperparameter Grid Search (SHGS) strategy has a different primary goal. Its focus is on systematically narrowing the search space by separating and evaluating the effect of individual hyperparameters under random background conditions. Rather than replacing existing optimization methods, SHGS complements existing methods by providing a clearer view of hyperparameter behavior, narrowing the search to more promising areas, and later fine-tuning with grid search, random search, or Bayesian optimization.
A grid search takes as input a finite number of preselected values for each of the tunable hyperparameters. Some DFNN hyperparameters are continuous, making it difficult to predefine a suitable finite set of candidate values, so grid search can become time-consuming and computationally expensive as the number of settings increases. With a low-budget grid search, for which the available computation time and resources are limited, the number of values that each hyperparameter can take is very small, often making it a difficult task to preselect the input hyperparameter settings, especially for a hyperparameter that has a very large or even an infinite number of possible values [26].
During the preselection of the input hyperparameter settings, we often face unanswered or under-answered questions, such as: will changing the values of a particular hyperparameter truly have a significant effect on model performance? How does model performance change as the value of a hyperparameter increases, and are there particular patterns? If there are performance patterns associated with a hyperparameter, would they be dataset-dependent? When selecting a hyperparameter value, should we concentrate on lower values, mid-range values, or higher values? With a low-budget grid search, what would be a rule of thumb for selecting a promising set of values for a hyperparameter that can take a very large number of values?
The primary motivation of this study is to address this challenge of hyperparameter preselection, which often constrains model optimization and reproducibility. To this end, we introduce the SHGS strategy, which we describe in detail in Section 2 below. The main goals of this work are as follows: (1) to provide an efficient pre-search approach that guides the selection of promising hyperparameter ranges; (2) to analyze the sensitivity of model performance to specific hyperparameters; and (3) to offer empirical insights for tuning DFNN models in clinical prediction tasks. We conducted experiments using SHGS with our DFNN models for predicting late-onset breast cancer metastasis, which we define as metastasis that occurred at least 10 years after the initial treatment. We identified eight target hyperparameters that have a very large or an infinite number of possible values (epochs, batch size, dropout, L1, L2, learning rate, decay, and momentum).

2. Methods

2.1. The SHGS Framework

The purpose of running an SHGS is to look for a promising but reduced range of values for a particular hyperparameter, which we call the target hyperparameter in the SHGS, to assist in the task of preselecting input values for the subsequent grid search. In an SHGS, we first specify a large range of values for the target hyperparameter, which may have an infinite number of possible values (such as L1 and L2), while each of the other hyperparameters takes only one preselected value. Thus, an SHGS can be considered a special type of grid search in which the number of hyperparameter settings used for model training equals the number of values of the target hyperparameter. In this study, we identified 8 hyperparameters that can take a very large or infinite number of values and treated them as our target hyperparameters. These 8 hyperparameters are epochs, batch size, learning rate, dropout rate, momentum, decay, L1 weight decay, and L2 weight decay. Table 1 shows a description of the 8 target hyperparameters and the values that we studied via our SHGS experiments. In these experiments, to avoid bias, the fixed value used by each of the non-target hyperparameters was pre-picked randomly, following a uniform distribution, from the pool of all values considered for that hyperparameter. We call one value assignment of these non-target hyperparameters a background hyperparameter setting. Table 1 also shows all values of the non-target hyperparameters that were considered. For each of the target hyperparameters, we repeated the SHGS experiment 10 times, each time using a different randomly selected background hyperparameter setting from all possible settings, in order to see how a different background setting can affect the results. Repeating each experiment 10 times helped to reveal consistent hyperparameter effects while mitigating randomness from varying background settings and balancing stability with computational cost. We used Python along with the TensorFlow and Keras libraries to implement SHGS, which is now available at [https://pypi.org/project/SHGS], accessed on 31 October 2025.
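The core idea can be sketched as follows. The value pools, the illustrative L1 range, and the train_and_score() helper are hypothetical placeholders, and this sketch does not reproduce the interface of the released SHGS package.

```python
# A minimal sketch of one SHGS study for a single target hyperparameter.
# value_pools, the target value range, and train_and_score() are hypothetical
# placeholders; this is not the API of the released SHGS package.
import random

value_pools = {                      # candidate values for non-target hyperparameters
    "batch_size": [16, 32, 64, 128, 256],
    "learning_rate": [0.001, 0.01, 0.03, 0.1],
    "dropout": [0.0, 0.1, 0.3, 0.5],
    "momentum": [0.1, 0.5, 0.9],
}

def sample_background(target):
    """Uniformly pick one fixed value for every hyperparameter except the target."""
    return {name: random.choice(vals)
            for name, vals in value_pools.items() if name != target}

target = "L1"
target_values = [round(i * 0.003, 3) for i in range(101)]   # illustrative large range (0 to 0.3)

records = []
for run in range(10):                            # 10 repetitions, new background each time
    background = sample_background(target)
    for v in target_values:
        setting = {**background, target: v}
        test_auc = train_and_score(setting)      # hypothetical: 5-fold CV + hold-out test
        records.append({"run": run, target: v, "test_auc": test_auc})
```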

2.2. Experiments

Figure 2 presents the workflow of the proposed SHGS framework, summarizing its key components (iteration control, random background hyperparameter selection, model training, and result recording). The SHGS framework takes three main inputs (the number of iterations, the target hyperparameter, and the value ranges for all hyperparameters). The number of iterations determines how many times the SHGS process is executed for a given target hyperparameter. In each iteration, random values are assigned to all non-target hyperparameters to form a background hyperparameter setting. For the target hyperparameter, each value within its specified range is systematically explored in conjunction with the background setting. The SHGS algorithm then sequentially updates the target hyperparameter, performs 5-fold cross-validation (see details below) to train models under each target hyperparameter setting, tests the resulting model on an independent test dataset, and saves the outcomes.

2.3. The Evaluation of Model Performance Using Five-Fold Cross-Validation

For a given binary diagnostic test, a receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate for all possible cutoff values [28]. The area under an ROC curve (AUC) measures the discrimination performance of a model [29]. We followed a 5-fold cross-validation (CV) procedure to train and evaluate models under each hyperparameter setting in an SHGS. The dataset was first split into two subsets (an 80% training dataset for optimizing the model’s performance and a separate 20% independent test dataset to evaluate its generalization ability). The entire training dataset was then divided evenly into 5 sub-datasets. The division was conducted randomly while ensuring that each sub-dataset contained approximately 20% of the positive cases and 20% of the negative cases, preserving the overall class distribution. For each target hyperparameter setting, we conducted both training and testing five times. Each time, a model was learned from a different combination of 4 sub-datasets and then tested using the remaining sub-dataset as the validation set. The average training and testing AUCs across all five folds, called mean_train_auc and mean_test_auc, respectively, were reported via the grid search package. Additionally, to evaluate the model’s performance completely independently, we used the 20% set-aside test dataset to test the output model refitted using the entire training dataset under each target hyperparameter setting, obtaining a test_auc.
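A minimal sketch of this evaluation procedure is shown below, assuming the predictors X and binary labels y are already loaded as NumPy arrays and that build_dfnn() is a hypothetical helper returning a compiled Keras model for a given hyperparameter setting; the exact implementation in the SHGS package may differ.

```python
# A minimal sketch of stratified 5-fold CV plus an independent hold-out test.
# X and y are assumed NumPy arrays; build_dfnn() is a hypothetical builder.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score

# 80/20 split into training data and an independent hold-out test set,
# stratified to preserve the positive/negative ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

def evaluate_setting(setting):
    train_aucs, val_aucs = [], []
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for tr_idx, va_idx in cv.split(X_train, y_train):
        model = build_dfnn(setting)                       # hypothetical model builder
        model.fit(X_train[tr_idx], y_train[tr_idx],
                  epochs=setting["epochs"],
                  batch_size=setting["batch_size"], verbose=0)
        train_aucs.append(roc_auc_score(y_train[tr_idx],
                                        model.predict(X_train[tr_idx]).ravel()))
        val_aucs.append(roc_auc_score(y_train[va_idx],
                                      model.predict(X_train[va_idx]).ravel()))
    # Refit on the full training set and score the independent hold-out set.
    final_model = build_dfnn(setting)
    final_model.fit(X_train, y_train, epochs=setting["epochs"],
                    batch_size=setting["batch_size"], verbose=0)
    test_auc = roc_auc_score(y_test, final_model.predict(X_test).ravel())
    return np.mean(train_aucs), np.mean(val_aucs), test_auc  # mean_train_auc, mean_test_auc, test_auc
```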
While the mean test AUC from cross-validation was used as an internal validation metric, the test AUC obtained from the independent hold-out set was adopted as the final evaluation metric to analyze performance trends across a wide range of target hyperparameter values—the primary goal of our exploratory study. All experiments were executed using the same computing environment to ensure consistent runtimes and performance evaluations. Each grid search and SHGS experiment was conducted on a shared pool of machines, with each job allocated 16 CPU cores and 64 GB of RAM. No GPU acceleration was used, and early stopping was deliberately disabled to allow for full observation of convergence behavior. The implementation was developed in Python using the Keras (v2.4.3) and Scikit-Learn (v0.24.2) packages. All runs were executed under identical software configurations to ensure reproducibility. We used this procedure for all datasets involved in this research. Given the class imbalance in our datasets and the need to compare many hyperparameter settings efficiently, AUC was chosen as the primary evaluation metric due to its robustness, threshold-independence, and interpretability in imbalanced settings.

2.4. Datasets

In this research, we used three datasets, LSM-I-10Year, LSM-I-12Year, and LSM-I-15Year, which include breast cancer patients with at least 10, 12, and 15 years of follow-up, respectively. These datasets were extracted, based on follow-up duration, directly from the broader LSM-I-nYear dataset, which had already undergone data cleaning, missing value handling, and feature selection in our prior study [30]. As such, no additional preprocessing was necessary in the current work.
The LSM-I-nYear dataset was originally constructed in our previous study [31] using clinical records from the Lynn Sage Database (LSDB) and the Northwestern Medicine Enterprise Data Warehouse (NMEDW) at Northwestern Medicine. In that study, we applied the Model-Based Interaction Learning (MBIL) method to identify predictors whose interactions are strongly associated with the disease outcome, such as metastasis in this study’s context. This method, first proposed in our earlier work [30], is based on Bayesian network modeling and information-theoretic criteria and is designed to discover predictive feature interactions relevant to clinical outcomes. Table 2 below shows the counts of the cases and predictors included in the three datasets. A description of these predictors is included in Supplementary Tables S1–S3 and a previously published paper [23].

3. Results

Figure 3, Figure 4 and Figure 5 show the results obtained for the LSM-I-10Year, LSM-I-12Year, and LSM-I-15Year datasets, respectively. Each figure consists of eight panels of scatter plots, one for each of the eight target hyperparameters. Within each panel are ten individual scatter plots, each showing the results of a specific SHGS experiment conducted using the corresponding dataset for the corresponding target hyperparameter. Such a scatter plot demonstrates how model performance, as measured by test_auc, changes when the values of a target hyperparameter increase during an SHGS experiment.
We analyzed performance trends using three-segment piecewise linear regression fitted with the pwlf package, which effectively captures segmented relationships in hyperparameter spaces. To reduce noise, trend lines were displayed only when the mean squared error (MSE) between the fitted curve and observed test AUC values was below 0.001; otherwise, the figure was labeled as “no clear trend”. Interestingly, some hyperparameters exhibit clear trends in one dataset but not in others, suggesting that the effectiveness of a hyperparameter is influenced by the background setting. Conversely, a few hyperparameters demonstrate consistent trends across all three datasets, indicating they may exert a more robust influence on model performance.
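The trend-fitting step can be reproduced with a short pwlf call, as sketched below; x and y stand for the target-hyperparameter values and the corresponding test AUC scores from one SHGS run, and the MSE threshold follows the value stated above.

```python
# A minimal sketch of the three-segment piecewise linear fit used to draw
# trend lines; x = target-hyperparameter values, y = test_auc scores.
import numpy as np
import pwlf

def fit_trend(x, y, n_segments=3, mse_threshold=0.001):
    model = pwlf.PiecewiseLinFit(np.asarray(x), np.asarray(y))
    model.fit(n_segments)                      # three-segment piecewise linear fit
    y_hat = model.predict(np.asarray(x))
    mse = float(np.mean((np.asarray(y) - y_hat) ** 2))
    if mse < mse_threshold:                    # display the trend line only when the fit is tight
        return y_hat, mse
    return None, mse                           # otherwise label the plot "no clear trend"
```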
Based on the results shown in Figure 3, Figure 4 and Figure 5, the scatter plots of the 10 different experiments for the same target hyperparameter are quite different. This is perhaps because each experiment used a different background setting that was randomly selected by the SHGS scheme. Note that, with certain randomly generated background hyperparameter configurations, the prediction performance was very poor, as indicated by consistently low test_AUC values (approximately 0.5), as seen, for example, in Figure 3a, experiments 7 and 8. An AUC value of 0.5 indicates that the model is unable to distinguish between positive and negative class points, performing no better than random guessing [32].
The contrast between successful runs (e.g., Figure 3a(5)), which reveal clear performance trends, and failed runs (e.g., Figure 3a(7,8)), where performance remains near chance level (AUC ≈ 0.5), reveals that SHGS results are highly dependent on the background hyperparameter configurations used. Successful runs occur when the background configuration provides a stable setting for optimization, allowing the effect of the target hyperparameter to be observed clearly. In contrast, failed runs arise when suboptimal background combinations disrupt the optimization process, obscuring the true influence of the target hyperparameter and illustrating how background variability can affect the reliability of SHGS results.
As to epochs (Figure 3a, Figure 4a and Figure 5a), the highest test_auc values for the 10-year, 12-year, and 15-year data are 0.683 (Figure 3a(5)), 0.727 (Figure 4a(8)), and 0.852 (Figure 5a(1)), respectively. We notice that optimal test_auc values are often reached at low epochs (e.g., Figure 3a(5) and Figure 5a(1)). In addition, we see the following three different patterns: (1) test_auc decreases once the number of epochs surpasses a specific threshold (e.g., Figure 3a(2) and Figure 5a(3)), (2) test_auc plateaus after passing a certain point (e.g., Figure 4a(8)) or throughout (e.g., Figure 5a(9)), and (3) test_auc steadily goes up as the number of epochs passes a certain point (e.g., Figure 4a(1) and Figure 5a(10)). We also notice that, in some cases, test_auc has high variance and fluctuates rapidly while the number of epochs increases (e.g., Figure 4a(6)), while it has very low variance in some other cases (e.g., Figure 5a(9)).
In the batch size experiments (Figure 3b, Figure 4b and Figure 5b), the best test_auc values are 0.7 (Figure 3b(6)) for the 10-year dataset, 0.738 (Figure 4b(8)) for the 12-year dataset, and 0.856 (Figure 5b(6)) for the 15-year dataset. There tends to be a short warm-up period (batch size < 50) when performance is unstable but reaches a peak quickly, and, in most experiments and over all three datasets, performance reaches the peak before the batch size increases to one-fifth of its highest value. Overall, we see the following three patterns after performance reaches its peak: (1) a slight negative correlation between performance and batch size (Figure 3b(7), Figure 4b(9) and Figure 5b(8,10)); (2) a sharp dip when the batch size reaches about half of its highest value, but the trend remains stable at either side of the sharp dipping point (Figure 3b(5,9), Figure 4b(3,7) and Figure 5b(7)); and (3) the trend remains constant, with or without large variance, and this is seen in most cases, indicating that a larger batch size tends to have a limited influence on performance improvement.
Figure 3c, Figure 4c and Figure 5c show the patterns in dropout rate. The highest scores are 0.77 (10-year dataset), 0.726 (12-year dataset), and 0.886 (15-year dataset). Similar to batch size, in most cases, the performance trend remains constant or becomes worse after it quickly reaches a peak. The performance tends to become very poor after the dropout passes 0.5, with some exceptions (e.g., Figure 3c(5), Figure 4c(1) and Figure 5c(4)).
Regarding the L1 and L2 regularization parameters, denoted as L1 and L2, respectively, in subsequent discussions, the best test_auc values for L1 in Figure 3d, Figure 4d and Figure 5d are 0.689, 0.734, and 0.872, and the best test_auc results for L2 in Figure 3e, Figure 4e and Figure 5e are 0.706, 0.734, and 0.86. The results of L1 reveal no increasing trend in performance as measured by test_auc when L1 values increase in any of the 10 experiments. Based on these results, using a small L1 value (<0.03) in grid search can be sufficient to obtain the best-performing model. The L2 results demonstrate similar trends in most cases with some exceptions, in which it requires a larger L2 value to reach the performance peak in experiments (Figure 3e(2,10), Figure 4e(2,4) and Figure 5e(9)), but none of these experiments gave the best test_auc values.
The highest test_auc scores for learning rate (Figure 3f, Figure 4f and Figure 5f) are 0.734 (Figure 3f(6)), 0.762 (Figure 4f(4)), and 0.832 (Figure 5f(1)), respectively. The best test_auc values in most experiments were obtained when the learning rate was below 0.03. After reaching the peak, the performance tends to stabilize (e.g., Figure 4f(2)), or, alternatively, it exhibits a significant decline with increased fluctuations after passing a certain point (e.g., Figure 5f(4)).
The best test_auc values achieved for decay are 0.706 (Figure 3g(10) for the 10-year dataset), 0.733 (Figure 4g(2) for the 12-year dataset), and 0.87 (Figure 5g(4) for the 15-year dataset). All experiments that achieved the best test_auc values reached the best performance almost immediately and the increase in the decay rate had no significant impact on performance. Certain patterns are seen for other experiments, such as the eighth experiment of Figure 4g, where the test_auc value initially rises as the decay rate increases but drops abruptly at a point (around 0.03) and remains suboptimal. Overall, the performance reaches the peak when the decay rate assumes a small value (below 0.03), then the trend either stays constant or goes down as the decay rate increases, with the exceptions seen in Figure 5g(8,10), where it takes longer for the performance to reach the peak.
For momentum, the best test_auc values are 0.675 (Figure 3h(3) for the 10-year dataset), 0.72 (Figure 4h(6) for the 12-year dataset), and 0.835 (Figure 5h(3) for the 15-year dataset). Considering all three datasets, the performance trend demonstrates two patterns (stays constant or goes up after momentum at least passes the half-way point), but the well-performing models tend to appear with the second pattern. For example, for the 10-year data, the performance reaches the peak after momentum passes 0.5 in the experiment in which we identified the best model (Figure 3h(3)), and, for the 15-year data, we obtained the best model after momentum passed 0.7. For the 12-year dataset, the best test_auc was obtained when momentum was below 0.3 and the trend after the peak test_auc went down as an exception, but we obtained near-best models in other experiments when the momentum passed 0.5 (Figure 4h(1)) and when it passed 0.7 (Figure 4h(2)).
To further evaluate the stability and reliability of our experimental results, we conducted a statistical analysis on the performance of each hyperparameter. For every value of the tested hyperparameter, we computed the mean and standard deviation of the AUC across multiple trials. As shown in Figure 6, Figure 7, Figure 8 and Figure 9, the blue curve represents the smoothed performance trend, while the shaded area indicates the variance across experiments. To focus on valid configurations, we excluded models with AUC < 0.55, which are typically caused by ineffective random parameter combinations. Figure 6, Figure 7, Figure 8 and Figure 9 omit low-performing cases (shown in Figure 3, Figure 4 and Figure 5) to better reveal optimal hyperparameters, performance trends, and stability, as including the numerous failed runs shown in Figure 3, Figure 4 and Figure 5 would obscure the underlying meaningful patterns. In addition, the red marker denotes the highest AUC observed across all 10 experiments, along with its corresponding hyperparameter value.
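The aggregation behind Figures 6–9 can be sketched as follows, assuming the SHGS results are collected in a pandas DataFrame with columns "value" (the target-hyperparameter value) and "test_auc" (one row per trial); the column names and the smoothing window are illustrative assumptions.

```python
# A minimal sketch of the per-value aggregation with exclusion of failed runs.
# Column names ("value", "test_auc") and the smoothing window are assumptions.
import pandas as pd

def summarize(results: pd.DataFrame, auc_floor: float = 0.55) -> pd.DataFrame:
    valid = results[results["test_auc"] >= auc_floor]          # drop failed runs (AUC < 0.55)
    summary = (valid.groupby("value")["test_auc"]
                    .agg(mean_auc="mean", std_auc="std")
                    .reset_index())
    # Rolling mean gives the smoothed trend line; std_auc drives the shaded band.
    summary["smoothed"] = summary["mean_auc"].rolling(window=5, center=True,
                                                      min_periods=1).mean()
    return summary
```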
Based on Figure 6, Figure 7, Figure 8 and Figure 9, we observed several hyperparameters exhibiting consistent performance trends across all three datasets, including batch size, momentum, L2 regularization, learning rate, and dropout rate, while others—specifically the number of training epochs, weight decay, and L1 regularization—demonstrated dataset-specific effects. As shown in Figure 6d–f, a medium-to-small batch size consistently yielded superior AUC performance. Similarly, high momentum values were beneficial across all datasets (Figure 7d–f), supporting accelerated convergence. Learning rate exhibited a consistent preference for smaller values in the range of 0.01–0.08 across datasets (Figure 8a–c). Likewise, lower dropout rates (0.05–0.2) outperformed higher rates consistently (Figure 8d–f). L2 regularization was broadly effective across datasets (Figure 9d–f), with the AUC generally improving as L2 increased, though the trend often plateaued beyond a moderate value.
In contrast, certain hyperparameters required dataset-specific tuning. The optimal number of training epochs varied; both the LSM-I-10Year and LSM-I-15Year datasets favored fewer epochs, likely to prevent overfitting, whereas the LSM-I-12Year dataset benefited from extended training. Weight decay exhibited a similar divergence, with the 10Year and 12Year datasets performing better under weaker regularization, while the 15Year dataset achieved improved performance with stronger decay. For L1 regularization, both the 10Year and 15Year datasets preferred minimal or no L1 penalty, indicating that feature sparsity was not beneficial. In contrast, the 12Year dataset showed marginal performance gains with increased L1 strength.
Table 3 displays the number of hyperparameter settings and corresponding running times for the SHGS experiments conducted using the three LSM-I datasets. Table 4 below reports the total experimental time for each of the eight target hyperparameters studied across the same datasets. Surprisingly, the experiments targeting L2 exhibited the longest duration for all three datasets.
We also studied how running time changes with each of the hyperparameters, but, as shown in Supplementary Figures S1–S3, there are no apparent correlations between running time and the target hyperparameter in most cases. There are a few notable correlations between running time and target hyperparameters such as batch size, epochs, and L1, as shown in Figure 10 below. Based on Figure 10a, running time decreases quickly as the batch size increases until it reaches a turning point, after which the running time either remains unchanged or decreases very slowly and slightly. The turning point often occurs when the batch size is very small, mostly below 50. Figure 10b shows that there is often an apparent positive correlation between running time and epochs. Figure 10c demonstrates that occasionally there is a positive correlation between running time and L1, but, in most experiments, increasing the value of L1 does not affect the running time much.
These patterns align with the design of the SHGS framework, which separates one target hyperparameter while randomizing others to expose how each hyperparameter independently contributes to model performance under diverse background settings. Furthermore, the piecewise regression analysis and Figure 6, Figure 7, Figure 8 and Figure 9 quantitatively show that SHGS captures non-linear and segmented response patterns in hyperparameter–performance relationships, identifying appropriate value ranges and highlighting the stability or context-dependence of each parameter. The runtime analysis (Table 3 and Table 4, Figure 10) demonstrates SHGS’s efficiency, providing comparable insights at a fraction of the cost of a full grid search, consistent with its role as a preselection step for meta-tuning.

4. Discussion

Previous studies reported that the best model performance may occur at either very high epochs [33] or at very low values (below 10) [34] depending on dataset characteristics [35]. Our findings refine this understanding by not only confirming the dataset-dependent nature of optimal epoch values (Figure 6a–c) but also revealing that model performance is substantially influenced by background hyperparameter settings. As shown in Figure 3a, the effects of epochs can vary dramatically across different background configurations, even within the same dataset.
In addition, our results (Figure 10b) show that running time is often correlated with the number of epochs, so using a high number of epochs could significantly increase the computation time, which renders it a lesser choice especially when the time available to run a grid search is limited. These observations indicate the importance of adaptive training schedules. Instead of statically assigning epoch values across the board, future systems could benefit from early stopping mechanisms or dynamic adjustment strategies based on convergence signals or overfitting risk [35,36].
The optimal batch size remains an open question, with prior studies reporting conflicting findings. Some suggest that smaller batch sizes tend to yield improved results [37], while others observed a higher test accuracy with larger batches [38]. This inconsistency may arise from differences in datasets and background hyperparameter configurations. Additionally, performance often peaks when the batch size is less than half the dataset size (Figure 6d–f), supporting claims that large batches may converge to sharp minima and impair generalization [36,39]. As such, constraining batch size values to the lower range (e.g., <50% of the maximum tested value) may serve as an effective strategy for low-budget grid searches, particularly when computational resources or time are limited [40].
Finding a dropout rate that works perfectly for all datasets and situations is difficult, as it is influenced by various factors [37,38,41,42]. The dataset size seems to be a key factor. Research indicates that smaller dropout rates are beneficial for small datasets [43] and remain effective for large datasets. Our results show that the performance trend mostly remains constant or declines after it reaches the peak, which often occurs before the dropout rate reaches 0.5. This substantiates the suggestion in [22] that a dropout rate of p = 0.5 works well for various networks. Combining our experimental results and the previous findings, using a dropout rate of 0.5 or a few values that are less than or equal to 0.5 might be a good initial choice for a low-budget grid search.
L1 regularization is suggested to be more effective than L2 regularization and dropout [44,45,46]. The authors of [44] suggest that an L1 value of 0.001 is preferable when compared with 0.0001, 0.01, and 0.1. According to [45], if the L1 value is too high, the model is too simple, leading to underfitting, while if it is too low, the model becomes too complex and may overfit. Based on our experimental results, regardless of the dataset, the result is consistently better when using a smaller L1 value, and the optimal test_auc values were consistently achieved when L1 was less than 0.03. L2 regularization was more uniformly beneficial, with performance improving as the penalty increased up to a point. Several experiments (e.g., Figure 3e(2,10), Figure 4e(2,4) and Figure 5e(4)) achieved the peak AUC at relatively large L2 values (>0.03), although these settings did not produce the highest test AUCs overall. Therefore, for low-budget grid search settings, a narrow L1 range near zero (e.g., 0–0.01) is advisable, while a broader L2 range (e.g., 0–0.3) may be necessary to capture the regularization benefit before performance declines.
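To make concrete where these penalties enter a Keras DFNN, the snippet below attaches L1/L2 kernel regularizers to a hidden layer. The paper does not specify exactly which weights were regularized, so the placement and the example strengths are assumptions.

```python
# A hedged illustration of applying L1/L2 penalties in Keras; the layer choice
# and the example strengths (l1=0.001, l2=0.01) are assumptions.
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1_l2

hidden = Dense(
    32,
    activation="relu",
    kernel_regularizer=l1_l2(l1=0.001, l2=0.01),  # small L1 near zero, moderate L2
)
```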
Selecting an appropriate learning rate for grid search is nontrivial, as it must balance convergence speed and stability—being large enough to ensure efficiency, yet small enough to avoid overshooting or oscillation [47,48]. In each of our experiments, we tested a large range of learning rate values (0 to 0.3), but, across all three datasets, the best test_auc values most often occurred before the learning rate reached 0.03. We also noticed that once test_auc reached a peak, it rarely went up as the learning rate further increased. These experimental results suggest that a suitable range of learning rate values for a low-budget grid search is below 0.03.
Researchers currently tend to determine the values of learning rate decay (referred to as decay in this context) based on empirical considerations, including 0.1, 0.01, and 0.001 [46]. Due to the relationship between decay and learning rate, the magnitude of the learning rate is a crucial factor to consider when determining the value of decay [46]. While many experiments achieved the peak AUC at low decay values (typically <0.03), suggesting that minimal regularization is often adequate, this pattern was not consistent across all datasets. The 15-year dataset experiments (e.g., Figure 7c) exhibited markedly better performance at high decay values (e.g., >0.5), indicating that stronger regularization can be beneficial in certain contexts. Therefore, for low-budget grid searches, decay values between 0 and 0.003 (approximately one-tenth of the upper-bound learning rate) are generally sufficient. However, when resources permit, extending the search range up to 0.8 may better accommodate dataset-specific variations.
Momentum directly influences optimization during model training and expedites convergence by incorporating historical gradients [49]. Previous research suggests that values around 0.9 seem to be good values [45,46]. Some studies suggest that the interaction between momentum and learning rate matters more than their individual values [50]. In our experiments, we evaluated 81 momentum values ranging from 0.1 to 0.9 to ensure comprehensive coverage. Results indicate that optimal model performance most frequently occurs when momentum exceeds 0.5, with two of the three highest test AUC scores obtained at values above 0.7. Nonetheless, isolated cases showed strong performance at lower momentum settings; for instance, the best result for the LSM-I-12Year dataset was achieved with momentum below 0.3. Taken together, our findings suggest that a momentum range of 0.7 to 0.9 is a strong initial choice for grid search. When computational resources allow, exploring lower momentum values (e.g., 0.01–0.03) may further uncover dataset-specific optima. These findings also highlight a key limitation of a fully randomized design: some hyperparameters, such as momentum, only take effect under specific conditions (e.g., with an SGD-type optimizer). This reinforces that a hyperparameter’s relevance and influence are inherently context-dependent.
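Since momentum (and decay) only take effect with an SGD-type optimizer, the snippet below shows how these hyperparameters would be wired into a Keras 2.x SGD optimizer; the paper does not report its exact optimizer call, and the values shown are illustrative.

```python
# A hedged illustration of how learning rate, momentum, and decay interact in
# a Keras 2.x SGD optimizer; the values are illustrative only.
from tensorflow.keras.optimizers import SGD

optimizer = SGD(
    learning_rate=0.01,  # kept below ~0.03, consistent with the ranges observed above
    momentum=0.9,        # high momentum, within the recommended 0.7-0.9 band
    decay=0.001,         # legacy per-step learning-rate decay argument in Keras 2.x
)
```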
An unexpected but notable finding from our experiments was the significantly longer runtime observed for L2 regularization. Because L1 regularization did not exhibit a similar slowdown, this cannot be explained by a simple computational overhead. We hypothesize that this behavior reflects an interaction effect between the L2 penalty and certain unfavorable background hyperparameter configurations. Unlike L1, which promotes sparsity by pushing many weights to exactly zero and can simplify the effective network structure, L2 applies a continuous constraint on all weights. This universal pressure, when combined with a poor background setting—such as a high learning rate or an unstable weight initializer—could create highly unstable gradient dynamics. Consequently, the optimizer may be forced to perform substantially more internal computations at each step to manage these conflicting update signals, leading to a prolonged runtime. This result highlights that hyperparameter interactions may affect not only performance outcomes but also computational efficiency—an aspect often overlooked in practice.

5. Conclusions

We examined how eight key hyperparameters affect the prediction of late-onset breast cancer metastasis, aiming to define efficient search ranges under limited computational resources. Our results show that optimal hyperparameter values vary across datasets and depend on the configuration of other parameters, referred to here as the background setting. Despite this variability, we provide practical guidance for conducting low-budget grid searches in similar tasks.
We propose the following reduced value ranges as a strong initial choice: epochs, fewer than 100; batch size, less than half the number of data points; dropout rate, ≤0.5 is commonly an effective initial choice; L1, 0 to 0.03; L2, a range of 0 to 0.3 can be explored; learning rate, <0.03; decay, 0 to 0.003 is often sufficient, though extending the search up to 0.8 may be beneficial if resources permit; momentum, 0.7 to 0.9 is recommended, while broader ranges (e.g., 0.01–0.03) may be included if resources allow. It is crucial to note, however, that these recommendations were empirically derived from our specific task of predicting late-onset breast cancer metastasis using a DFNN, so they should be interpreted as task- and model-specific observations rather than universally optimal settings. Furthermore, for this kind of clinical prediction, time-to-event or survival models could offer a more realistic approach worth investigating in future studies.
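One way to turn these reduced ranges into a concrete low-budget grid is sketched below; the particular values chosen within each range are illustrative, and n_datapoints is a placeholder for the size of the training set at hand.

```python
# An illustrative low-budget grid based on the reduced ranges above; the
# specific values are examples only and are task- and model-specific.
n_datapoints = 1000  # placeholder: replace with the number of training datapoints

low_budget_grid = {
    "epochs":        [20, 50, 100],                 # fewer than ~100 epochs
    "batch_size":    [16, 64, n_datapoints // 2],   # up to half the number of data points
    "dropout":       [0.1, 0.3, 0.5],               # at or below 0.5
    "l1":            [0.0, 0.01, 0.03],             # 0 to 0.03
    "l2":            [0.0, 0.1, 0.3],               # 0 to 0.3
    "learning_rate": [0.001, 0.01, 0.03],           # below 0.03
    "decay":         [0.0, 0.001, 0.003],           # 0 to 0.003
    "momentum":      [0.7, 0.8, 0.9],               # 0.7 to 0.9
}
```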
Nevertheless, the SHGS framework itself is versatile and can be adapted to other prediction tasks to narrow hyperparameter ranges before performing full-scale optimization. In practice, running a few SHGS trials beforehand can help locate high-potential regions and improve overall tuning efficiency. Although this study focused on using a DFNN to predict late-onset breast cancer metastasis, the SHGS framework holds potential for broader applications. Future research may extend SHGS to other architectures, such as convolutional and recurrent neural networks, and to multimodal data (e.g., imaging, text, and omics data) in order to evaluate its generalizability across domains. Additionally, since the current SHGS relies on random background sampling, future work could develop more sophisticated sampling strategies. A promising direction is to enhance the SHGS framework with a constraint system or conditional sampling to ensure that only valid and meaningful hyperparameter combinations are explored, thereby increasing the efficiency and interpretability of the exploratory process. Finally, enhancing SHGS from its current PyPI implementation into a more comprehensive and modular open-source toolkit could further improve its usability and broaden its potential applicability in biomedical and general machine learning research.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering12111214/s1, Figure S1: Scatter plots: running time vs. the values taken by the target hyperparameters concerning LSM-I-10Year; Figure S2: Scatter plots: running time vs. the values taken by the target hyperparameters concerning LSM-I-12Year; Figure S3: Scatter plots: running time vs. the values taken by the target hyperparameters concerning LSM-I-15Year; Table S1: A Description of the Variables of the LSM-10Year Dataset; Table S2: The variables of the LSM-12Year Dataset; Table S3: The variables of the LSM-15Year Dataset.

Author Contributions

X.J. originated the study and designed the methods. Y.Z., O.A.-J., and X.J. wrote the first draft of the manuscript. X.J. and Y.Z. implemented the methods. Y.Z. conducted experiments. Y.Z., O.A.-J., and X.J. prepared and analyzed the results. All authors contributed to the preparation and revision of the manuscript. All work was conducted at the University of Pittsburgh. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported in this paper was supported by the U.S. Department of Defense through the Breast Cancer Research Program under Award No. W81XWH-19-1-0495 (to XJ). Other than supplying funds, the funding agencies played no role in the research.

Institutional Review Board Statement

Ethics approval and consent to participate. The study was approved by the University of Pittsburgh Institutional Review Board (IRB # 196003) and the U.S. Army Human Research Protection Office (HRPO # E01058.1a). The need for patient consent was waived by the ethics committees because the study used only de-identified data that are publicly available.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available in datadryad.org at DOI 10.5061/dryad.64964m0 (https://datadryad.org/dataset/doi:10.5061/dryad.64964m0).

Acknowledgments

OAJ was involved in the work as a trainee in XJ’s Artificial Intelligence research lab in summer 2023 via the UPMC Hillman Cancer Center Academy.

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Zitnik, M.; Agrawal, M.; Leskovec, J. Modeling Polypharmacy Side Effects with Graph Convolutional Networks. Bioinformatics 2018, 34, i457–i466. [Google Scholar] [CrossRef]
  2. Lee, S.; Jiang, X. Modeling MiRNA-MRNA Interactions That Cause Phenotypic Abnormality in Breast Cancer Patients. PLoS ONE 2017, 12, e0182666. [Google Scholar] [CrossRef]
  3. Han, P.; Shang, S.; Yang, P.; Liu, Y.; Zhao, P.; Zhou, J.; Gao, X.; Kalnis, P. GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 705–713. [Google Scholar]
  4. Rogers, M.F.; Shihab, H.A.; Mort, M.; Cooper, D.N.; Gaunt, T.R.; Campbell, C. FATHMM-XF: Accurate Prediction of Pathogenic Point Mutations via Extended Features. Bioinformatics 2018, 34, 511–513. [Google Scholar] [CrossRef]
  5. Hu, J.; Ng, P.C. Predicting the Effects of Frameshifting Indels. Genome Biol. 2012, 13, R9. [Google Scholar] [CrossRef]
  6. Tung, C.W.; Ho, S.Y. Computational Identification of Ubiquitylation Sites from Protein Sequences. BMC Bioinform. 2008, 9, 310. [Google Scholar] [CrossRef] [PubMed]
  7. Cai, B.; Jiang, X. Computational Methods for Ubiquitination Site Prediction Using Physicochemical Properties of Protein Sequences. BMC Bioinform. 2016, 17, 116. [Google Scholar] [CrossRef] [PubMed]
  8. Comert, R.G.; Durak, G.; Yilmaz, R.; Aktas, H.E.; Tuz, Z.; Pan, H.; Zeng, J.; Bayram, A.; Mollavelioglu, B.; Erturk, S.M.; et al. Radiomics for Detecting Metaplastic Histology in Triple-Negative Breast Cancer: A Step Towards Personalized Therapy. Bioengineering 2025, 12, 973. [Google Scholar] [CrossRef] [PubMed]
  9. Abdurrab, I.; Mahmood, T.; Sheikh, S.; Aijaz, S.; Kashif, M.; Memon, A.; Ali, I.; Peerwani, G.; Pathan, A.; Alkhodre, A.B.; et al. Predicting the Length of Stay of Cardiac Patients Based on Pre-Operative Variables—Bayesian Models vs. Machine Learning Models. Healthcare 2024, 12, 249. [Google Scholar] [CrossRef]
  10. Al-Tawil, M.; Mahafzah, B.A.; Al Tawil, A.; Aljarah, I. Bio-Inspired Machine Learning Approach to Type 2 Diabetes Detection. Symmetry 2023, 15, 764. [Google Scholar] [CrossRef]
  11. Long, Q.; Chung, M.; Moreno, C.S.; Johnson, B.A. Risk Prediction for Prostate Cancer Recurrence through Regularized Estimation with Simultaneous Adjustment for Nonlinear Clinical Effects. Ann. Appl. Stat. 2011, 5, 2003–2023. [Google Scholar] [CrossRef]
  12. Pan, L.; Liu, G.; Lin, F.; Zhong, S.; Xia, H.; Sun, X.; Liang, H. Machine Learning Applications for Prediction of Relapse in Childhood Acute Lymphoblastic Leukemia. Sci. Rep. 2017, 7, 7402. [Google Scholar] [CrossRef]
  13. McCulloch, W.S.; Pitts, W. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
  14. Farley, B.G.; Clark, W.A. Simulation of Self-Organizing Systems by Digital Computer. IRE Prof. Group Inf. Theory 1954, 4, 76–84. [Google Scholar] [CrossRef]
  15. Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2014, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  16. Fernández, S.; Graves, A.; Schmidhuber, J. An Application of Recurrent Neural Networks to Discriminative Keyword Spotting. In Proceedings of the International Conference on Artificial Neural Networks, Porto, Portugal, 9–13 September 2007; Springer Berlin Heidelberg: Berlin/Heidelberg, Germany, 2007; pp. 220–229. [Google Scholar]
  17. Neapolitan, R.E.; Jiang, X. Neural Network and Deep Learning; Springer: Cham, Switzerland, 2018; pp. 389–411. [Google Scholar]
  18. Ran, L.; Zhang, Y.; Zhang, Q.; Yang, T. Convolutional Neural Network-Based Robot Navigation Using Uncalibrated Spherical Images. Sensors 2017, 17, 1341. [Google Scholar] [CrossRef]
  19. Lundervold, A.S.; Lundervold, A. An Overview of Deep Learning in Medical Imaging Focusing on MRI. Z. Med. Phys. 2019, 29, 102–127. [Google Scholar] [CrossRef]
  20. Fiddy, M.A. Regularized Image Reconstruction Using SVD and a Neural Network Method for Matrix Inversion. IEEE Trans. Signal Process. 1993, 41, 3074–3077. [Google Scholar] [CrossRef]
  21. Deng, L.; Tur, G.; He, X.; Hakkani-Tur, D. Use of Kernel Deep Convex Networks and End-to-End Learning for Spoken Language Understanding. In Proceedings of the 2012 IEEE Workshop on Spoken Language Technology, SLT 2012—Proceedings, Miami, FL, USA, 2–5 December 2012. [Google Scholar]
  22. Glorot, X.; Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference On Artificial Intelligence And Statistics, Sardinia, Italy, 13–15 May 2010. [Google Scholar]
  23. Jiang, X.; Xu, C. Deep Learning and Machine Learning with Grid Search to Predict Later Occurrence of Breast Cancer Metastasis Using Clinical Data. J. Clin. Med. 2022, 11, 5772. [Google Scholar] [CrossRef]
  24. Shen, H. Towards a Mathematical Understanding of the Difficulty in Learning with Feedforward Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 811–820. [Google Scholar]
  25. Mitchell, T.M.; Tom, M. Machine Learning; McGraw-Hill: New York, NY, USA, 1997. [Google Scholar]
  26. Xu, C.; Coen-Pirani, P.; Jiang, X. Empirical Study of Overfitting in Deep Learning for Predicting Breast Cancer Metastasis. Cancers 2023, 15, 1969. [Google Scholar] [CrossRef]
  27. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  28. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  29. Swets, J.A. Measuring the Accuracy of Diagnostic Systems. Science 1988, 240, 1285–1293. [Google Scholar] [CrossRef]
  30. Jiang, X.; Wells, A.; Brufsky, A.; Shetty, D.; Shajihan, K.; Neapolitan, R.E. Leveraging Bayesian Networks and Information Theory to Learn Risk Factors for Breast Cancer Metastasis. BMC Bioinform. 2020, 21, 298. [Google Scholar] [CrossRef]
  31. Jiang, X.; Wells, A.; Brufsky, A.; Neapolitan, R. A Clinical Decision Support System Learned from Data to Personalize Treatment Recommendations towards Preventing Breast Cancer Metastasis. PLoS ONE 2019, 14, e0213292. [Google Scholar] [CrossRef]
  32. Bhandari, A. Guide to AUC ROC Curve in Machine Learning: What Is Specificity? Available online: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/ (accessed on 31 October 2025).
  33. Brownlee, J. Difference Between a Batch and an Epoch in a Neural Network. Available online: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/ (accessed on 31 October 2025).
  34. Kayley, M. How Does Epoch Affect Accuracy. Available online: https://deepchecks.com/question/how-does-epoch-affect-accuracy/ (accessed on 31 October 2025).
  35. Afaq, S.; Rao, S. Significance Of Epochs On Training A Neural Network. Int. J. Sci. Technol. Res. 2020, 9, 485–488. [Google Scholar]
  36. Yang, L.; Shami, A. On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice. Neurocomputing 2020, 415, 295–316. [Google Scholar] [CrossRef]
  37. Kandel, I.; Castelli, M. The Effect of Batch Size on the Generalizability of the Convolutional Neural Networks on a Histopathology Dataset. ICT Express 2020, 6, 312–315. [Google Scholar] [CrossRef]
  38. Radiuk, P.M. Impact of Training Set Batch Size on the Performance of Convolutional Neural Networks for Diverse Datasets. Inf. Technol. Manag. Sci. 2018, 20, 20–24. [Google Scholar] [CrossRef]
  39. Ajayi, O.G.; Ashi, J. Effect of Varying Training Epochs of a Faster Region-Based Convolutional Neural Network on the Accuracy of an Automatic Weed Classification Scheme. Smart Agric. Technol. 2023, 3, 100128. [Google Scholar] [CrossRef]
  40. Keskar, N.S.; Nocedal, J.; Tang, P.T.P.; Mudigere, D.; Smelyanskiy, M. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017—Conference Track Proceedings, Toulon, France, 24–26 April 2017. [Google Scholar]
  41. Phaisangittisagul, E. An Analysis of the Regularization Between L2 and Dropout in Single Hidden Layer Neural Network. In Proceedings of the International Conference on Intelligent Systems, Modelling and Simulation, ISMS, Bangkok, Thailand, 25–27 January 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 174–179. [Google Scholar]
  42. Thakur, A. What’s the Optimal Batch Size to Train a Neural Network. Available online: https://wandb.ai/ayush-thakur/dl-question-bank/reports/What-s-the-Optimal-Batch-Size-to-Train-a-Neural-Network---VmlldzoyMDkyNDU (accessed on 31 October 2025).
  43. Pauls, A.; Yoder, J.A. Determining Optimum Drop-out Rate for Neural Networks. In Proceedings of the Midwest Instructional Computing Symposium, Duluth, MN, USA, 6–7 April 2018. [Google Scholar]
  44. Obaidat, M.S.; Mohammad, S. Deep Learning Regularization in Imbalanced Data. In Proceedings of the 2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI), Sharjah, United Arab Emirates, 3–5 November 2020. [Google Scholar]
  45. Oppermann, A. Regularization in Deep Learning—L1, L2, and Dropout. Available online: https://www.e2enetworks.com/blog/regularization-in-deep-learning-l1-l2-dropout (accessed on 31 October 2025).
  46. Brownlee, J. How to Use Weight Decay to Reduce Overfitting of Neural Network in Keras. Available online: https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-learning-with-weight-regularization/ (accessed on 31 October 2025).
  47. Ba, J.; Frey, B. Adaptive Dropout for Training Deep Neural Networks. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–8 December 2013. [Google Scholar]
  48. Sun, C.; Sharma, J.; Maiti, M. Investigating the Relationship Between Dropout Regularization and Model Complexity in Neural Networks. arXiv 2021, arXiv:2108.06628. [Google Scholar] [CrossRef]
  49. Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv 2012, arXiv:1212.5701v1. [Google Scholar] [CrossRef]
  50. Mitliagkas, I.; Zhang, C.; Hadjis, S.; Ré, C. Asynchrony Begets Momentum, with an Application to Deep Learning. arXiv 2016, arXiv:1605.09774v2. [Google Scholar] [CrossRef]
Figure 1. The structure of an example DFNN model that contains two hidden layers. The arrows indicate the direction of information flow from one layer to the next in the network. The three vertically arranged dots represent additional neurons omitted for clarity.
Figure 2. A flowchart illustrating the proposed SHGS strategy and its experimental procedure.
Figure 3. Scatter plots: test_auc vs. a target hyperparameter concerning LSM-I-10Year. Each panel (a–h) corresponds to one of the eight target hyperparameters examined in the SHGS experiments. Within each panel, the ten numbered subplots (1–10) represent results from ten independent SHGS runs, each conducted under a distinct random background hyperparameter configuration. The red line represents the piecewise linear regression fit illustrating the trend of model performance across the hyperparameter range.
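The red trend lines in Figures 3–5 are piecewise linear regression fits. A minimal way to produce such a fit, assuming the breakpoints are fixed in advance (how the breakpoints were chosen is not specified here), is ordinary least squares on a hinge basis:

```python
# Piecewise linear regression with pre-specified breakpoints, fitted by
# least squares on a hinge basis: y ~ a + b*x + sum_k c_k * max(x - t_k, 0).
import numpy as np

def piecewise_linear_fit(x, y, breakpoints):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x] +
                        [np.maximum(x - t, 0.0) for t in breakpoints])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(x_new):
        x_new = np.asarray(x_new, dtype=float)
        X_new = np.column_stack([np.ones_like(x_new), x_new] +
                                [np.maximum(x_new - t, 0.0) for t in breakpoints])
        return X_new @ coef

    return predict
```

Calling `piecewise_linear_fit(x, y, breakpoints=[...])` returns a callable that can be evaluated on a dense grid of hyperparameter values to draw the trend line.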
Figure 4. Scatter plots of test_auc vs. a target hyperparameter for the LSM-I-12Year dataset. Plot elements are defined as in Figure 3.
Figure 5. Scatter plots of test_auc vs. a target hyperparameter for the LSM-I-15Year dataset. Plot elements are defined as in Figure 3.
Figure 6. Impact of epochs and batch size on test AUC. The blue line represents the smoothed performance trend across the ten SHGS experiments, the shaded area indicates variance across trials, and the red marker denotes the highest observed AUC and its corresponding hyperparameter value.
Figure 7. Impact of decay and momentum on test AUC. Plot elements are defined as in Figure 6.
Figure 8. Impact of learning rate and dropout rate on test AUC. Plot elements are defined as in Figure 6.
Figure 9. Impact of L1 and L2 on test AUC. Plot elements are defined as in Figure 6.
Figure 10. Selected scatter plots of running time vs. a target hyperparameter: (a) batch size with the LSM-I-10Year dataset; (b) epochs with the LSM-I-12Year dataset; (c) L1 regularization with the LSM-I-15Year dataset. The blue line represents the smoothed trend of running time across the hyperparameter range.
Table 1. Hyperparameters and their values used in SHGS experiments.
Hyperparameter | Description | Values Used in SHGS Experiments | Number of Values Studied per SHGS Experiment (Target Hyperparameter Only)
Epochs | Number of times the model is trained on the full training dataset | 5~1001, step size 3 | 333
Batch size | Number of samples processed together in a single forward and backward pass during training | 1 to the number of datapoints in a dataset | 730 in LSM-I-10Year dataset; 446 in LSM-I-12Year dataset; 300 in LSM-I-15Year dataset
Learning rate | Controls the learning and parameter update speed during optimization | 0.001~0.3, step size 0.001 | 300
Dropout rate | Mitigates overfitting and reduces training time by randomly ignoring nodes | 0~0.9, step size 0.01 | 91
Momentum | Speeds up optimization by incorporating historical gradients into parameter updates; applicable only within the SGD optimizer | 0.1~0.9, step size 0.01 | 81
Decay | Iteratively decays the learning rate by applying a decreasing factor at each epoch | 0~0.3, step size 0.001 | 301
L1 weight | Controls the strength of L1 regularization in model training | 0~0.3, step size 0.001 | 301
L2 weight | Controls the strength of L2 regularization in model training | 0~0.3, step size 0.001 | 301
# of Hidden Layers | The depth of a DFNN model | 1, 2, 3, 4 | Non-target hyperparameter
# of Hidden Nodes | Number of neurons in a hidden layer | All integers from 1 to the number of datapoints of each dataset | Non-target hyperparameter
Optimizer | Optimizes model parameters during training toward minimizing the loss | SGD, Adam, Adagrad, Nadam, Adamax | Non-target hyperparameter
Weight Initializer | Technique used to assign initial values to the weights of the connections between neurons in a neural network | Constant, Glorot_normal, Glorot_uniform, He_normal, He_uniform | Non-target hyperparameter
Input Layer Activation Function | Function applied to the input data of the network's input layer | Relu, Sigmoid, Softmax, Tanh | Non-target hyperparameter
Hidden Layer Activation Function | Function applied to the output of a hidden layer | Relu, Sigmoid, Softmax, Tanh | Non-target hyperparameter
Output Layer Activation Function | Function applied to the output layer, producing the final prediction of the network | Sigmoid | Non-target hyperparameter
Loss Function | Method of evaluating the dissimilarity between the predicted output of the model and the actual values | Binary_crossentropy | Non-target hyperparameter
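For convenience, the target-hyperparameter grids in Table 1 can be written out directly. The sketch below is a transcription of the table using NumPy ranges, not code taken from the study; rounding is applied because fractional step sizes would otherwise accumulate floating-point error, and the batch-size grid is dataset-dependent.

```python
# Target-hyperparameter grids transcribed from Table 1 (illustrative only).
import numpy as np

target_grids = {
    "epochs":        np.arange(5, 1002, 3),                         # 5-1001, step 3 (333 values)
    "learning_rate": np.round(np.arange(0.001, 0.3001, 0.001), 3),  # 300 values
    "dropout_rate":  np.round(np.arange(0.0, 0.901, 0.01), 2),      # 91 values
    "momentum":      np.round(np.arange(0.1, 0.901, 0.01), 2),      # 81 values
    "decay":         np.round(np.arange(0.0, 0.3001, 0.001), 3),    # 301 values
    "l1":            np.round(np.arange(0.0, 0.3001, 0.001), 3),    # 301 values
    "l2":            np.round(np.arange(0.0, 0.3001, 0.001), 3),    # 301 values
}
# Batch size ranges from 1 to the number of datapoints in a dataset, so its
# grid depends on the dataset, e.g. np.arange(1, n_datapoints + 1).
```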
Table 2. Case and predictor count of the three LSM-I-10+-year datasets.
Dataset | Total # of Cases | # Positive Cases | # Negative Cases | # of Predictors
LSM-I-10year | 1827 | 572 | 1255 | 18
LSM-I-12year | 1115 | 588 | 527 | 20
LSM-I-15year | 751 | 608 | 143 | 17
Table 3. Number of hyperparameter settings and running time per dataset in SHGS experiments.
Dataset | Number of Hyperparameter Settings | Unit Running Time per Hyperparameter Setting | Total Running Time per Dataset
LSM-I-10Year | 24,380 | 93.875 s | 635.74 h
LSM-I-12Year | 21,540 | 71.454 s | 427.532 h
LSM-I-15Year | 20,080 | 76.849 s | 428.646 h
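As a consistency check on Table 3, the total running time per dataset is the number of hyperparameter settings multiplied by the unit running time; for LSM-I-10Year, for example, 24,380 × 93.875 s ≈ 2,288,673 s ≈ 635.74 h, which matches the reported total.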
Table 4. Running time (in hours) per target hyperparameter.
Target Hyperparameter | Total Running Time | LSM-I-10Year | LSM-I-12Year | LSM-I-15Year
Epochs | 178.003 | 113.047 | 29.568 | 35.388
Batch size | 145.69 | 77.542 | 59.293 | 8.855
L1 | 170.957 | 115.336 | 47.719 | 7.903
L2 | 466.878 | 103.453 | 161.839 | 201.586
Dropout rate | 102.28 | 51.569 | 10.393 | 40.318
Learning rate | 125.213 | 23.894 | 64.082 | 37.237
Momentum | 107.467 | 85.205 | 5.818 | 16.444
Decay | 195.429 | 65.694 | 48.82 | 80.915
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
