1. Introduction
Machine learning (ML) has emerged as a transformative technology, revolutionizing a wide array of academic disciplines and industries through its ability to derive valuable insights from complex and large datasets. Integrating statistical techniques, computational algorithms, and artificial intelligence, ML has attracted widespread attention due to its robust capacity to identify patterns, make predictions, and automate decision-making processes. Machine learning is becoming more commonly used in urban sciences; in such settings, it is used to analyze complex data, uncover patterns, and generate predictions across diverse areas such as transportation, urban meteorology, and real estate. The versatility and scalability of ML have positioned it as an indispensable tool for researchers and professionals aiming to unlock the full potential of their datasets.
Several studies demonstrate the application of ML in transportation research, primarily for improving transportation systems. Deep neural networks, specifically convolutional and long short-term memory (LSTM) layers, are used to predict urban bus travel times [1]. This approach captures spatiotemporal correlations and uses an encoder/decoder architecture for accurate multi-time-step predictions, outperforming traditional methods and even Google’s model in the Greater Copenhagen Area. Another study uses Context-dependent Random Forests to predict traffic flow based on Global Positioning System (GPS) data from a small percentage of vehicles, significantly improving prediction accuracy for intelligent GPS navigation systems [2].
ML models like Stochastic Gradient Boosting and boosted regression trees are employed to identify hazardous traffic conditions and analyze crash data [3,4]. Stochastic Gradient Boosting is particularly effective at handling different types of predictors and missing values, making it suitable for real-time risk assessment, and it can identify crash cases with high accuracy and a low false-positive rate [3]. Boosted regression tree models, on the other hand, are used to investigate the complex, nonlinear relationships in high-variance crash data, providing improved transferability over traditional logistic regression models [4]. Gradient boosting regression trees are also used to analyze and model freeway travel time, offering improved prediction accuracy and model interpretability by strategically correcting mistakes from previous models [5].
In engineering- and physics-based models, a data-driven subspace identification technique, which is a form of ML, can be used to estimate damage to structures by determining damage coefficients that define the reduction of stiffness or damping [6]. The minimization of a residual function helps provide accurate damage coefficient values. AI, a broader field encompassing ML, has also been reviewed for its impact on concrete mix design, where it offers improvements in optimal proportioning, property prediction, and quality control [7]. It helps overcome the limitations of traditional approaches, though challenges remain with regard to data availability and model interpretability.
In climate and weather modeling, ML provides a powerful way to recognize complex patterns and model nonlinear dynamics in climate and weather processes [8]. Researchers have successfully incorporated physics and domain knowledge into ML models to improve emulation, downscaling, and forecasting. An algorithm based on generalized Hidden Markov Models has been used to track the best-performing climate model over time, with its average prediction loss matching that of the best model in hindsight and outperforming traditional methods [9].
There has been a growing consensus that complex scientific and engineering problems require methodologies that integrate traditional physics-based models with state-of-the-art ML techniques [10]. These physics-guided ML models can lead to greater physical consistency, reduced training time, and better generalization. For example, physics-based morphodynamic modeling is used to quantify the impact of climate change on coastal erosion [11].
ML has also become an essential analytical framework to interpret and manage the complexities of contemporary urbanization. As cities generate increasingly large and heterogeneous datasets, from traffic sensors and satellite imagery to housing market transactions and environmental monitors, ML algorithms provide the computational power and flexibility to make sense of this information. They enable researchers to explore the underlying structures and interrelationships among various urban phenomena, such as housing density, transportation networks, and public health [12,13].
The predictive capabilities of ML allow for dynamic modeling of urban systems. These include forecasting urban growth, simulating the impacts of policy interventions, and evaluating sustainability metrics with a data-driven foundation. Machine learning can support the study of spatial segregation and gentrification [14], allowing researchers and policymakers to identify patterns and predict future trends in these areas. For example, by analyzing demographic and socioeconomic data, ML models can pinpoint neighborhoods at risk of gentrification, providing a basis for targeted policy interventions.
ML is instrumental in addressing urban environmental concerns. Its approaches are used to forecast the hourly concentration of air pollutants like ozone, particulate matter (PM2.5), and sulfur dioxide [15]. These predictions are vital for public health warnings and for evaluating the effectiveness of air quality regulations. ML models can also monitor urban expansion and detect land use changes, as demonstrated by a study in Cameroon [16]. This capability helps urban planners optimize land allocation and infrastructure development to promote sustainable growth. In a similar vein, an artificial neural network model is used to predict urban growth in five major Greek cities by 2030, offering a tool for local policymakers to analyze future development scenarios across Europe [17].
While supervised ML methods are prevalent, some research also considers the potential of unsupervised methods and algorithm selection paradigms [18]. The integration of various ML approaches can lead to more accurate results and more thorough analyses, especially in areas like image processing and analytics. However, it is important to acknowledge that certain aspects, such as image acquisition and decision-making, still require human supervision [18]. This highlights the ongoing need for an expert-guided approach, ensuring that ML tools are used to augment, rather than replace, human expertise in urban planning and decision-making.
The application of ML to urban planning has recently advanced through the integration of reinforcement learning and graph neural networks. A notable example is an AI model that represents urban spaces as graphs and solves spatial planning as a sequential decision-making problem [19]. This model outperforms human experts in objective evaluations and supports collaborative workflows for planning, demonstrating the profound potential of computational methods to tackle complex urban challenges.
The emergence of mixed-use functional area detection using spatiotemporal data and entropy-based methods [20], as well as the growing use of unsupervised learning (UL) approaches such as clustering and topic modeling [21], further highlights ML’s expanding role in urban sciences. UL methods, in particular, allow for the discovery of hidden structures in unlabeled urban datasets, revealing insights into urbanization patterns, built environments, sustainability, and urban dynamics. Despite certain challenges pertaining to data quality, interpretability, and validation, UL provides a promising pathway for understanding and managing urban complexity.
In addition, ML is reshaping the broader urban discourse by contributing to the emergence of new paradigms in city governance and development. Recent studies suggest a shift from “smart urbanism” to “autonomous cities,” where AI systems increasingly mediate urban life, raise ethical concerns, and challenge traditional notions of control and transparency in governance [22].
A critical element underpinning the success of predictive ML models in urban sciences is hyperparameter tuning. Hyperparameters are user-defined settings that govern the learning process, such as tree depth in decision trees, learning rates in gradient boosting machines, or the number of clusters in unsupervised algorithms. Unlike model parameters learned from data, hyperparameters must be optimized externally to improve model accuracy, robustness, and generalizability. In high-stakes urban applications where predictions may influence transportation systems, housing policies, or environmental monitoring, well-tuned models can lead to reliable insights and guided decisions.
Systematic hyperparameter optimization (e.g., via Random Search, Grid Search, or advanced techniques like BayesOpt and Optuna) enhances the predictive performance of ML models by identifying the best combination of settings. This is especially important in applications such as sustainable transport modeling [23], where ML models classify freight transport sustainability levels with high accuracy, and real estate valuation [24,25,26,27], where ML-based approaches outperform traditional methods in forecasting property prices and identifying investment opportunities.
In summary, the synergy between machine learning and urban sciences is yielding novel frameworks for understanding, modeling, and managing cities. As ML methods become increasingly sophisticated, the role of hyperparameter tuning becomes more central, not only for achieving high-performance predictions but also for ensuring the interpretability, reliability, and applicability of ML models to real-world urban challenges.
Despite the growing integration of machine learning techniques in urban studies, several methodological gaps persist. Many existing studies continue to rely on traditional hyperparameter tuning methods such as Random Search and Grid Search [28,29,30,31]. While these approaches have been widely adopted, they become increasingly inefficient on large datasets: Grid Search evaluates every combination exhaustively, and Random Search does not use the results of earlier trials to guide later ones. As such, they may hinder computational efficiency and limit the scalability of machine learning applications in urban research contexts.
A critical gap lies in the limited adoption of more advanced, automated optimization frameworks such as Optuna [32]. Built on Bayesian optimization and equipped with pruning techniques, Optuna offers a more intelligent and adaptive search strategy, leading to faster convergence and improved predictive performance. However, its application remains underrepresented in urban analytics, particularly among researchers less familiar with recent developments in machine learning.
Another notable shortcoming in current urban studies is the predominant focus on test set results, with little to no evaluation of the model’s performance on the training data. Several recent real estate studies [33,34,35,36], which employed machine learning algorithms combined with Shapley values to predict property prices, reported performance metrics based solely on the test set. While test set evaluation provides insight into the model’s predictive accuracy on unseen data, it offers a limited view of the model’s overall behavior. Without examining training set metrics, it becomes difficult to assess the model’s generalization capability or to identify problems such as overfitting (where the model memorizes training data) or underfitting (where it fails to capture patterns). A more comprehensive evaluation should include a diverse set of metrics for both training and test sets. This dual reporting allows for a clearer understanding of how well the model learns from data and whether it maintains accuracy across different subsets, and it supports the robustness, reliability, and practical applicability of the model in real-world real estate prediction tasks.
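The value of dual reporting can be illustrated with a minimal sketch. The data, the 150/50 split, and the one-nearest-neighbour model below are all hypothetical choices made for illustration and are not the models used in this study. A model that memorizes its training data attains zero training error, so only the gap between the training and test scores reveals the overfitting.

```python
import random

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def one_nn_predict(train_x, train_y, x):
    """Predict with the target of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

random.seed(0)
# Hypothetical data: a linear signal plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2 * x + random.gauss(0, 1) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

train_pred = [one_nn_predict(train_x, train_y, x) for x in train_x]
test_pred = [one_nn_predict(train_x, train_y, x) for x in test_x]

train_mae = mae(train_y, train_pred)
test_mae = mae(test_y, test_pred)
# Reporting both numbers side by side exposes the generalization gap.
print(f"train MAE = {train_mae:.3f}, test MAE = {test_mae:.3f}")
```

Reporting the test score alone would hide the fact that this model reproduces its training targets exactly.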
These research gaps highlight the critical need for more efficient hyperparameter tuning frameworks and more rigorous model evaluation protocols within the urban research domain. As machine learning becomes increasingly integrated into urban sciences, ranging from real estate to transportation and environmental monitoring, ensuring that models are both accurate and generalizable is essential. Current practices often lack transparency and consistency, particularly in model validation and performance reporting. By adopting systematic tuning approaches and comprehensive evaluation metrics, researchers can better assess model reliability, identify overfitting or underfitting, and ultimately produce more trustworthy and actionable insights to support urban planning, policy development, and decision-making processes.
2. Literature Review
This section explores two machine learning algorithms frequently used in urban problems. We start with Gradient Boosting Machine (GBM), a powerful ensemble learning algorithm that combines multiple weak models to create a strong and accurate predictor. It is effective for complex datasets and can be used for regression, classification, and feature selection. Next, we examine Random Forest (RF), a popular ensemble learning method that combines multiple decision trees to create a robust and accurate model. It is known for handling large datasets and can be used for various tasks, including classification, regression, and clustering.
GBMs have been widely adopted in real-world applications due to their strong predictive capabilities. Using machine learning to detect and prevent Internet of Things (IoT) attacks has been suggested by [37]. Their study focuses on detecting noise level anomalies in a city suburb using machine learning algorithms, including Gradient Boosting Machine and Deep Learning. Two types of attacks are tested: sudden and gradual changes in noise levels. Their results show that their approach can effectively detect such anomalies, even for small changes in noise levels. This can help cities protect their infrastructure and ensure a safer and more sustainable environment for citizens.
Three machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machine (GBM), are employed to predict property prices in the new town “Tseung Kwan O” in Hong Kong [38]. This study applies these methods to a dataset of 39,554 housing transactions in Hong Kong during the period between June 1996 and August 2014, and then compares their results. In terms of predictive power, GBM and RF outperform SVM, as demonstrated by their higher values of $R^2$ and lower values of standard error metrics.
Based on GBM, a non-parametric method has been introduced to measure how connected financial institutions are, and how this affects their risk of failing [39]. This paper studies the connections between banks in China, and finds that banks with similar characteristics tend to be more connected and riskier, especially those heavily involved in interbank activities like industrial banking. The banking system becomes more connected during times of financial stress, and state-owned banks are more likely to be the source of risk during instability.
GBM can effectively impute missing values related to monitoring and reporting greenhouse gas (GHG) emissions in healthcare facilities [40]. By using larger datasets and GBM, researchers can improve the accuracy of filling missing values, providing a more precise depiction of GHG emissions from healthcare facilities. This could influence policy-making and regulatory practices for monitoring and reporting GHG emissions. In this paper, hyperparameters are optimized by tuning with Grid Search.
RF has seen extensive application in multiple domains, with one notable example being urban planning. Although urban planning is becoming increasingly complex due to the vast and growing volume of available data, the integration of advanced information technologies into the field has advanced at a relatively slow pace. This is particularly noticeable in the recent emergence and implementation of ‘smart city’ technologies within urban environments [41,42]. Despite the rapid development of information and communication technologies (ICTs) in other sectors, their adoption in urban planning practices has lagged behind, falling short of leveraging the full potential these innovations offer. This gap presents both challenges and opportunities for urban planners, policymakers, and researchers to bridge the divide between cutting-edge technologies and their practical application in creating more efficient, sustainable, and livable cities. One promising solution lies in the use of machine learning techniques. For example, a study combines data from various sources and applies a Random Forest classifier alongside other machine learning models to evaluate their performance [43]. The findings indicate that the Random Forest model outperforms others in terms of accuracy, precision, and other evaluation metrics.
To demonstrate how machine learning can be used to deliver more accurate pricing predictions, the real estate market has been taken as an example [44]. Utilizing 24,936 housing transaction records, this paper employs Extra Trees, k-Nearest Neighbors, and RF, followed by hyperparameter tuning by Optuna, to predict property prices and then compares their results with those of a hedonic price model. Their results suggest that these three algorithms markedly outperform the traditional statistical techniques in terms of explanatory power and error minimization.
A study by [45] analyzes survey data from 2016 to 2021 and finds that Islamic banks’ accuracy in survey responses improves with a country’s level of development. The study also finds a significant reduction in credit portfolio risk due to improved risk management practices, global economic growth, and stricter regulations. Additionally, concerns about terrorism financing and cybersecurity threats have decreased due to strengthened anti-money-laundering regulations and investments in cybersecurity infrastructure and education.
An approach to analyze large amounts of exposome data from long-term cohort studies has been proposed, using machine learning to identify predictors of health and applied to the 30-year Doetinchem Cohort Study [46]. Utilizing Random Forest (RF), their study finds that 9 exposures, from different areas, are most important for predicting health, while 87 exposures have little impact. The approach shows an acceptable ability to distinguish between good and poor health, with an accuracy of 70.7%.
3. Methodology
To ensure a fair comparison, we apply three hyperparameter tuning methods (Random Search, Grid Search, and Optuna) to two distinct machine learning models: GBM and RF. Each model uses its own hyperparameter space, as detailed in Table 1. Experiments are conducted on a dataset that reflects real-world scenarios to ensure the relevance of our findings. We assess the performance of each tuning method based on two primary criteria: computational efficiency and model performance metrics.
Figure 1 illustrates the workflow of the methodology adopted in this study.
3.1. Materials
Gradient Boosting Machine (GBM), introduced by Friedman [47], is a powerful ensemble learning technique that constructs a predictive model by sequentially combining multiple weak learners, typically decision trees. The fundamental idea behind GBM is to convert weak learners, which perform poorly on their own, into a strong predictive model by iteratively correcting the errors of previous models. During each iteration, the algorithm computes the gradient of the loss function with respect to the model predictions and adjusts the parameters of the new tree to minimize this loss. This process is repeated for a specified number of iterations, often referred to as boosting rounds, which is one of the key hyperparameters of the model.
GBMs are highly effective in handling heterogeneous data types, capturing complex feature interactions, and modeling nonlinear relationships. These properties make them particularly suitable for applications across a wide range of fields, including urban analytics, finance, healthcare, and environmental studies. The flexibility of GBM allows it to adapt to different data structures and learning tasks, making it a preferred choice for predictive modeling in both research and industry.
However, the performance of GBMs is heavily dependent on careful hyperparameter tuning. Important hyperparameters include the maximum depth of trees, the learning rate, the number of boosting iterations, and regularization parameters that control model complexity and prevent overfitting. Given the large and high-dimensional hyperparameter space, manual tuning can be labor-intensive and inefficient. Techniques such as Grid Search, Random Search, or automated optimization methods like Optuna are often employed to identify optimal parameter settings. Despite this complexity, when properly tuned, GBMs consistently deliver high predictive accuracy and robustness, which has contributed to their widespread adoption in applied machine learning.
Random Forest (RF), introduced by Breiman in 2001 [48], is a widely used ensemble learning method that combines bootstrap aggregation, or bagging, with decision trees to produce robust predictive models. As a supervised learning algorithm, RF can be applied to both classification and regression tasks. The algorithm works by constructing a large number of decision trees during the training phase, with each tree trained on a randomly sampled subset of the data. The final prediction is obtained by aggregating the outputs of all individual trees, typically through majority voting for classification or averaging for regression. This ensemble approach improves predictive accuracy and stability compared to a single decision tree, reducing variance and mitigating overfitting.
RF offers several practical advantages that contribute to its popularity in applied research. It can effectively handle large datasets with high dimensionality, including cases with numerous irrelevant features, without a significant loss in predictive power. The algorithm also manages missing values naturally, making it suitable for real-world datasets where incomplete information is common. Furthermore, the averaging of predictions across multiple trees enhances generalization and decreases sensitivity to noise in the training data.
Despite its robustness, RF includes a variety of hyperparameters that must be carefully tuned to achieve optimal performance. Key hyperparameters include the maximum depth of trees, the number of trees, and regularization parameters that control model complexity. While tuning RF hyperparameters is generally less time-consuming and complex compared to GBM, careful selection is still important to ensure that the model balances bias and variance effectively. Overall, the combination of accuracy, stability, and flexibility makes RF a highly versatile tool in diverse domains, from urban planning and finance to healthcare and environmental studies.
3.2. Gradient Boosting Machine
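For a continuous target variable $Y$, the GBM model constructs an approximation $\hat{F}(x)$ as a weighted sum of weak learner functions $h_k(x)$:

$$\hat{F}(x) = \sum_{k=1}^{K} \gamma_k h_k(x) + \text{const}.$$

To minimize the average loss function over the training set $\{(x_i, y_i)\}_{i=1}^{n}$, we begin with a constant function model,

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),$$

and iteratively update it according to

$$F_k(x) = F_{k-1}(x) + \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} L\bigl(y_i, F_{k-1}(x_i) + h(x_i)\bigr).$$

Here, $h \in \mathcal{H}$ represents the base learner function, with $h$ being the optimal function at each step. Since finding the exact function $h$ that minimizes the loss function $L$ is computationally infeasible [46], we use a steepest descent approach. The model is updated as follows:

$$F_k(x) = F_{k-1}(x) - \gamma_k \sum_{i=1}^{n} \nabla_{F_{k-1}} L\bigl(y_i, F_{k-1}(x_i)\bigr),$$

where

$$\gamma_k = \arg\min_{\gamma} \sum_{i=1}^{n} L\Bigl(y_i, F_{k-1}(x_i) - \gamma \nabla_{F_{k-1}} L\bigl(y_i, F_{k-1}(x_i)\bigr)\Bigr)$$

and

$$\nabla_{F_{k-1}} L\bigl(y_i, F_{k-1}(x_i)\bigr) = \left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_{k-1}}.$$

Because this method only approximates the optimal update at each step, GBM estimates remain approximations.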
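The steepest descent update can be sketched for the special case of squared-error loss, where the negative gradient is simply the residual. This is a minimal illustration under several assumptions that are not taken from the study: the weak learners are one-split regression stumps on a single feature, a fixed learning rate stands in for an optimized step length, and the data are synthetic.

```python
import math

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [r for x, r in zip(xs, residuals) if x <= c]
        right = [r for x, r in zip(xs, residuals) if x > c]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, c, lv, rv)
    return best[1:]

def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for the loss 0.5 * (y - F)^2 with stump learners."""
    f0 = sum(ys) / len(ys)  # the constant that minimizes squared error
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        c, lv, rv = fit_stump(xs, residuals)
        stumps.append((c, lv, rv))
        preds = [p + learning_rate * (lv if x <= c else rv)
                 for x, p in zip(xs, preds)]
    return f0, learning_rate, stumps

def predict(model, x):
    """Sum the constant model and all shrunken stump corrections."""
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(lv if x <= c else rv for c, lv, rv in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Synthetic nonlinear target, used only to exercise the functions above.
xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]
model = fit_gbm(xs, ys)
constant_mse = mse(ys, [model[0]] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(f"constant model MSE = {constant_mse:.4f}, boosted MSE = {boosted_mse:.4f}")
```

Each round fits the new stump to the current residuals, so the number of rounds and the learning rate jointly control how closely the ensemble follows the training data. This is why both appear among the tuned hyperparameters.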
Table 1 documents the hyperparameter spaces for GBM, which are critical for optimizing model performance and generalization in the experiments described. These hyperparameters control the model’s complexity, learning behavior, and computational efficiency, directly influencing the accuracy and robustness of predictions for the continuous target variable $Y$. Hyperparameters for GBM are optimized using various techniques such as Random Search, Grid Search, and Optuna to further enhance model accuracy and generalization. Well-tuned hyperparameters significantly improve a model’s performance, efficiency, and robustness, enabling better capture of data patterns and more reliable predictions.
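The performance of the model is evaluated using the coefficient of determination ($R^2$), mean absolute error (MAE), median absolute error (MedAE), mean squared error (MSE), mean absolute percentage error (MAPE), symmetric mean absolute percentage error (sMAPE), root mean squared error (RMSE), and root mean squared log error (RMSLE) on the test set, defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2},$$

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \lvert \hat{y}_i - y_i \rvert,$$

$$\mathrm{MedAE} = \operatorname{median}\bigl(\lvert \hat{y}_1 - y_1 \rvert, \ldots, \lvert \hat{y}_m - y_m \rvert\bigr),$$

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2,$$

$$\mathrm{MAPE} = \frac{100\%}{m} \sum_{i=1}^{m} \left\lvert \frac{\hat{y}_i - y_i}{y_i} \right\rvert,$$

$$\mathrm{sMAPE} = \frac{100\%}{m} \sum_{i=1}^{m} \frac{\lvert \hat{y}_i - y_i \rvert}{(\lvert y_i \rvert + \lvert \hat{y}_i \rvert)/2},$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2},$$

$$\mathrm{RMSLE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \bigl(\log(1 + \hat{y}_i) - \log(1 + y_i)\bigr)^2},$$

where $\hat{y}_i$ and $y_i$ represent the predicted value and actual value of the target variable $Y$, respectively; $\bar{y}$ is the mean of the actual values; and $m$ represents the number of observations in the test data.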
In this research, we employ MAE, MedAE, MSE, MAPE, sMAPE, RMSE, and RMSLE as evaluation metrics because they are widely accepted and conventional measures in machine learning analysis. Each metric offers a different perspective on model performance. MAE provides a straightforward interpretation of average prediction error, while MedAE is the median of the absolute differences between predicted and actual values. Unlike MAE, MedAE is more robust to outliers because it uses the median instead of the mean. MSE penalizes larger errors more heavily, making it sensitive to outliers. MAPE measures relative error, allowing for intuitive comparisons across different scales. sMAPE is a percentage-based error metric that calculates the absolute difference between predicted and actual values, divided by the average of the two. It is “symmetric” because it treats over- and under-predictions equally, avoiding bias from large actual values (unlike MAPE). RMSE expresses error in the same units as the target variable, which aids interpretability. RMSLE measures the square root of the average squared differences between the log-transformed predicted and actual values. It penalizes under-predictions more than over-predictions, and is less sensitive to large differences in absolute values.
Using this set of complementary metrics ensures a more comprehensive evaluation of model accuracy and robustness. These metrics are standard in both academic literature and practical machine learning applications, enabling comparisons with other studies and supporting the credibility and reproducibility of the findings.
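As a minimal sketch of how these metrics can be computed together, the function below uses only the Python standard library. The function name and the sample values are hypothetical. It assumes the common conventions that MAPE and sMAPE are reported in percent and that RMSLE applies log(1 + value) to each term, and it requires non-zero actual values that are not all identical.

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Return the eight evaluation metrics as a dict (MAPE and sMAPE in percent)."""
    m = len(y_true)
    errors = [p - a for a, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = sum(y_true) / m
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)   # total sum of squares
    mse = ss_res / m
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs_errors) / m,
        "MedAE": statistics.median(abs_errors),
        "MSE": mse,
        "MAPE": 100 * sum(ae / abs(a) for ae, a in zip(abs_errors, y_true)) / m,
        "sMAPE": 100 * sum(
            ae / ((abs(a) + abs(p)) / 2)
            for ae, a, p in zip(abs_errors, y_true, y_pred)
        ) / m,
        "RMSE": math.sqrt(mse),
        "RMSLE": math.sqrt(
            sum((math.log1p(p) - math.log1p(a)) ** 2
                for a, p in zip(y_true, y_pred)) / m
        ),
    }

# Hypothetical actual and predicted values, for illustration only.
scores = regression_metrics([100, 200, 300, 400], [110, 190, 330, 380])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Calling the same function on the training and test predictions gives directly comparable score tables for both sets.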
3.3. Random Forest
Random Forest is a supervised learning algorithm that employs an ensemble approach for both classification and regression. It builds a specified number of decision trees and aggregates their predictions to create a more accurate and robust model than a single tree. The algorithm uses random sampling with replacement, a technique known as “bagging” [49,50], to reduce variance. Given a training set with features $X$ and outputs $Y$, bagging iteratively selects random samples from the training set $K$ times and fits trees to these samples. For each tree, a random sample of instances is drawn with replacement, corresponding to a unique random vector $\Theta_k$ which shapes a specific tree. This variability introduces slight differences among the trees. The prediction of the $k$-th tree for input $X$ is given by

$$\hat{h}_k(X) = h(X, \Theta_k), \quad k = 1, \ldots, K.$$

At each node during tree splitting, features are randomly selected to minimize correlations. A node $S$ is split into subsets $S_1$ and $S_2$ using a threshold $c$ that minimizes the sum of squared errors:

$$\mathrm{SSE} = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2,$$

where $\bar{y}_1$ and $\bar{y}_2$ are the mean outputs in $S_1$ and $S_2$. The final prediction is the average of all tree outputs:

$$\hat{Y} = \frac{1}{K} \sum_{k=1}^{K} \hat{h}_k(X).$$
Table 1 also documents the hyperparameter spaces for RF, which are critical for optimizing model performance and generalization in the experiments described. Hyperparameters are optimized using techniques such as Random Search, Grid Search, and Optuna to enhance model performance and generalization. Finally, model performance is evaluated using the coefficient of determination ($R^2$), MAE, MedAE, MSE, MAPE, sMAPE, RMSE, and RMSLE on the test set.
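The bagging procedure described in this subsection can be sketched with one-split trees (stumps). This is a deliberately reduced illustration, not the configuration used in the experiments: each tree has a single node, the random feature subset is drawn once per tree, and the data set, the function names, and the `max_features` parameter are hypothetical.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(X, y, feature_ids):
    """Split node S into S1, S2 on the (feature, threshold) with the lowest total SSE."""
    best = None
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            c = (lo + hi) / 2
            s1 = [t for row, t in zip(X, y) if row[j] <= c]
            s2 = [t for row, t in zip(X, y) if row[j] > c]
            total = sse(s1) + sse(s2)
            if best is None or total < best[0]:
                best = (total, j, c, sum(s1) / len(s1), sum(s2) / len(s2))
    if best is None:  # no valid split: fall back to a constant leaf
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]

def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Fit stumps on bootstrap samples, each restricted to a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feature_ids = rng.sample(range(p), max_features)
        forest.append(fit_stump(Xb, yb, feature_ids))
    return forest

def predict(forest, row):
    """Average the outputs of all trees."""
    outputs = [left if row[j] <= c else right for j, c, left, right in forest]
    return sum(outputs) / len(outputs)

# Hypothetical data: the target depends on feature 0 only; feature 1 is noise.
X = [[i, (i * 37) % 100] for i in range(100)]
y = [10.0 if row[0] >= 50 else 0.0 for row in X]
forest = fit_forest(X, y)
print(predict(forest, [90, 5]), predict(forest, [10, 5]))
```

Setting `max_features` to the total number of features reduces the sketch to plain bagging, while smaller values decorrelate the trees at the cost of weaker individual splits.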
3.4. Hyperparameter Tuning
Hyperparameter tuning is the process of finding the best combination of hyperparameters for a machine learning model. It plays a crucial role in maximizing performance and generalization while avoiding overfitting. However, the operation of hyperparameter tuning can be computationally expensive, especially for complex models with plenty of hyperparameters.
Several techniques have been proposed for hyperparameter tuning, including Random Search, Grid Search, Optuna, and more. Traditional methods like Random Search and Grid Search often struggle to efficiently explore the high-dimensional hyperparameter space, leading to suboptimal results or excessive computational costs. Random Search involves randomly sampling hyperparameters from specified distributions. Despite its simplicity, Random Search often surpasses Grid Search in terms of efficiency as it explores the search space more diversely [51]. However, there is no guarantee of finding the global optimum, and the quality of the solution heavily relies on the number of random samples taken.
In contrast, Grid Search is a method that involves exhaustively evaluating preset hyperparameter values to determine the optimal combination. Although Grid Search is guaranteed to find the best combination among the values on the predefined grid, it cannot find better settings that lie between grid points, and it can incur high computational costs, particularly when dealing with numerous hyperparameters or continuous parameters requiring precise resolutions.
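The difference between the two strategies can be sketched on a toy problem. The objective below is a hypothetical stand-in for a cross-validated loss, and the hyperparameter names, grid values, and sampling ranges are assumptions made for illustration. They are not the search spaces of Table 1.

```python
import itertools
import random

def validation_loss(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated loss; minimum at (0.1, 5)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 5) ** 2

def grid_search(grid):
    """Evaluate every combination on the grid; return (best_params, best_loss, n_evals)."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((validation_loss(**params), params))
    best_loss, best_params = min(results, key=lambda r: r[0])
    return best_params, best_loss, len(results)

def random_search(n_trials, seed=0):
    """Sample each hyperparameter independently; return (best_params, best_loss, n_evals)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.2),
            "max_depth": rng.randint(3, 7),
        }
        results.append((validation_loss(**params), params))
    best_loss, best_params = min(results, key=lambda r: r[0])
    return best_params, best_loss, len(results)

grid = {"learning_rate": [0.01, 0.05, 0.1, 0.2], "max_depth": [3, 5, 7]}
grid_best, grid_loss, grid_evals = grid_search(grid)
# Give Random Search the same evaluation budget as the grid (4 x 3 = 12).
rand_best, rand_loss, rand_evals = random_search(n_trials=grid_evals)
print("grid:  ", grid_best, grid_loss, grid_evals)
print("random:", rand_best, rand_loss, rand_evals)
```

The number of grid evaluations is the product of the list lengths, so adding a third hyperparameter with five values would raise the cost from 12 to 60 evaluations, whereas the Random Search budget is set independently of the dimensionality.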
Optuna represents a recent advancement in hyperparameter tuning techniques, introducing a more streamlined and automated approach centered around Bayesian optimization and tree-structured Parzen estimators (TPEs). Diverging from the conventional Random Search and Grid Search methods, Optuna builds probabilistic models to capture the link between hyperparameters and the objective function, enabling informed decisions regarding which hyperparameters to investigate next.
Through the application of Bayesian optimization principles, Optuna intelligently navigates the hyperparameter space by emphasizing promising areas while disregarding less fruitful ones. This adaptive sampling methodology leads to notable reductions in computational expenses and faster convergence towards the optimal solution compared to traditional approaches. A key strength of Optuna lies in its lightweight and versatile design, facilitating seamless integration with a variety of machine learning libraries and frameworks. Its Pythonic approach to defining search spaces and simple parallelization capabilities further support its user-friendliness and scalability [52].
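The idea behind a tree-structured Parzen estimator can be conveyed with a heavily simplified one-dimensional sketch. This is not Optuna’s implementation: the fixed bandwidth, the quantile `gamma`, the candidate count, and the toy objective are all assumptions chosen for illustration. Past trials are split into a “good” and a “bad” group, a kernel density is fitted to each, and the next trial is the candidate where the ratio of the good density to the bad density is highest.

```python
import math
import random

def kde(points, x, bandwidth):
    """Gaussian kernel density estimate built from `points`, evaluated at x."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def tpe_minimize(objective, low, high, n_startup=10, n_trials=40,
                 gamma=0.25, n_candidates=24, seed=0):
    """Simplified TPE-style search over one continuous hyperparameter."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10
    # Start with purely random trials to seed the density estimates.
    trials = [(x, objective(x))
              for x in (rng.uniform(low, high) for _ in range(n_startup))]
    while len(trials) < n_trials:
        ranked = sorted(trials, key=lambda t: t[1])
        n_good = max(1, int(gamma * len(ranked)))
        good = [x for x, _ in ranked[:n_good]]   # best-performing trials
        bad = [x for x, _ in ranked[n_good:]]    # remaining trials
        # Draw candidates from the density of the good trials, clipped to bounds.
        candidates = [
            min(max(rng.gauss(rng.choice(good), bandwidth), low), high)
            for _ in range(n_candidates)
        ]
        # Keep the candidate with the highest good-to-bad density ratio.
        chosen = max(
            candidates,
            key=lambda c: kde(good, c, bandwidth) / (kde(bad, c, bandwidth) + 1e-12),
        )
        trials.append((chosen, objective(chosen)))
    return min(trials, key=lambda t: t[1]), trials

# Toy objective with its minimum at x = 2, standing in for a validation loss.
(best_x, best_loss), history = tpe_minimize(lambda x: (x - 2.0) ** 2, -10.0, 10.0)
print(f"best x = {best_x:.3f}, best loss = {best_loss:.5f}")
```

In practice the same loop is delegated to the library: an objective function receives a trial object, requests values through calls such as `trial.suggest_float`, and is passed to `study.optimize` on a study created with `optuna.create_study`.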
4. Data Definitions and Sources
In this study, we focus on the analysis of a specific private housing estate, Ocean Shores, located in the Tseung Kwan O district of Hong Kong. This estate is officially designated as one of the “selected popular residential developments” by the Rating and Valuation Department of the Hong Kong SAR Government. Ocean Shores comprises 15 residential blocks and a total of 5,726 domestic units. Our dataset spans from January 2010 to December 2020, encompassing 7,652 intertemporal and cross-sectional observations.
The dataset includes detailed information on individual buildings, locations, transaction dates, occupation permits, sale prices, square footage, and other property characteristics (e.g., the inclusion of a parking space). These records are maintained by the government and compiled by a commercial data provider known as EPRC. All property prices are inflation-adjusted using the private housing estate price index published by the Rating and Valuation Department.
To ensure data quality, we exclude transaction records with incorrect transaction dates or missing values for key variables such as sale price or gross floor area. Furthermore, to enhance the robustness of the analysis and reduce the impact of outliers and potential data errors, we apply a trimming approach, one of the standard methods to handle outliers, by removing the bottom and top 5% of observations based on transaction prices.
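The trimming step can be sketched as follows. This is a minimal illustration that assumes trimming is done by rank on the price field and that each record is a dictionary with a hypothetical `price` key; the exact cut-off rule and the handling of ties in the original analysis are not specified here.

```python
def trim_by_price(records, fraction=0.05):
    """Drop the lowest and highest `fraction` of records ranked by price."""
    ordered = sorted(records, key=lambda r: r["price"])
    k = int(len(ordered) * fraction)  # number of records cut from each tail
    return ordered[k:len(ordered) - k]


# Toy data: 100 transactions priced 1..100 (illustrative only).
sample = [{"id": i, "price": float(i)} for i in range(1, 101)]
trimmed = trim_by_price(sample)
print(len(trimmed), trimmed[0]["price"], trimmed[-1]["price"])
```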
The variables used in the analysis are defined as follows:
Price: the total transaction price of residential property i during time period t, measured in HK dollars and inflation-adjusted.
Floor area: the net floor area of residential property i.
Age: the age of residential property i in years during time period t, obtained as the difference between the date of issue of the occupation permit and the date of sale.
Floor level: the floor level of residential property i.
Orientation: indicator variables for the eight orientations that residential property i may face. Each is assigned 1 if the property faces that orientation and 0 otherwise. The omitted category is Northwest, so that coefficients may be interpreted relative to this category.
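The definitions above can be turned into a feature row as in the following sketch. The field names, the example transaction, and the assumption that the eight orientations are the eight compass points are all hypothetical. With Northwest omitted, the three numeric variables plus seven indicators give ten features.

```python
from datetime import date

# Assumed set of eight compass orientations; Northwest is the reference category.
ORIENTATIONS = ["North", "Northeast", "East", "Southeast",
                "South", "Southwest", "West", "Northwest"]
REFERENCE = "Northwest"


def build_features(record):
    """Return area, age, floor level, and orientation indicators for one sale."""
    # Age in years: time elapsed between the occupation permit and the sale.
    age_years = (record["sale_date"] - record["permit_date"]).days / 365.25
    features = {
        "area": record["area"],
        "age": age_years,
        "floor": record["floor"],
    }
    for name in ORIENTATIONS:
        if name != REFERENCE:  # the reference category gets no indicator
            features[name.lower()] = 1 if record["orientation"] == name else 0
    return features


example = {  # hypothetical transaction
    "sale_date": date(2011, 1, 1),
    "permit_date": date(2001, 1, 1),
    "area": 650,
    "floor": 30,
    "orientation": "Southeast",
}
row = build_features(example)
print(len(row), round(row["age"], 2), row["southeast"])
```

A property facing Northwest has all seven indicators equal to 0, which is what makes it the baseline for interpreting the other orientation coefficients.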
In this study, our primary objective is not to predict property prices per se but to systematically evaluate and compare the performance of different hyperparameter tuning methods, namely Random Search, Grid Search, and Optuna, using machine learning algorithms GBM and RF. We employ a relatively small dataset comprising 7,652 housing transaction records from the property market solely as a practical case study to demonstrate and illustrate the differences in computational efficiency and predictive performance among these tuning strategies. While the dataset provides a concrete and relatable context, it is important to note that the methodology and insights derived here are broadly applicable to any dataset or domain where hyperparameter tuning is necessary.
Despite its modest size, the dataset still imposes significant computational demands, particularly for exhaustive methods such as Grid Search. For example, tuning a GBM model on 7,652 rows and 10 features using Grid Search can take nearly two hours, and the required time increases substantially with larger datasets or more complex models. This highlights the importance of efficient hyperparameter optimization techniques in real-world scenarios, where computational resources and time are often limited.
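The cost of an exhaustive search grows multiplicatively with the number of candidate values per hyperparameter. The sketch below counts the model fits implied by a small grid under five-fold cross-validation. The candidate values are hypothetical and are not the grids used in this study.

```python
from itertools import product

# Hypothetical candidate values, chosen only to illustrate the growth in cost.
grid = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_leaf": [10, 12, 13, 15],
    "min_samples_split": [10, 12, 15],
    "n_estimators": [200, 400, 600],
}

combinations = list(product(*grid.values()))
n_folds = 5
n_fits = len(combinations) * n_folds  # one fit per combination per fold
print(len(combinations), n_fits)
```

Adding one more hyperparameter with three candidate values would triple both counts, which is why exhaustive search becomes expensive even on a moderately sized dataset.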
We have selected GBM and RF as representative algorithms due to their widespread adoption across various fields and their distinct computational characteristics. This choice enables a meaningful evaluation of the trade-offs between tuning efficiency and model accuracy. Our approach underscores the practical challenges researchers and practitioners face in selecting hyperparameter optimization methods, even when working with datasets of moderate size.
By focusing on GBM and RF and using the property market as a familiar example, this study provides valuable insights into optimizing model selection processes that are transferable across domains. The findings can guide practitioners in balancing the competing demands of computational speed and predictive performance, making them relevant for a wide range of applications beyond real estate analytics.
Exploratory Data Analysis
Exploratory statistics is a field that focuses on exploring and summarizing datasets to gain insights into patterns, trends, and characteristics before modeling or hypothesis testing. By using graphical and numerical methods, researchers can conduct exploratory data analysis to examine key characteristics.
Figure 2, Figure 3 and Figure 4 show histograms, correlations, and a correlation matrix for property prices and the selected features. The results reveal a high correlation between floor area and property price (0.8), moderate correlations between property price and both the southeast orientation (0.4) and property age (−0.4), and weaker correlations between property price and both floor level and the south orientation (0.2).
Table 2 provides a summary of descriptive statistics for the features analyzed in this study, offering a concise overview of the dataset.
5. Results and Discussions
In this study, the dataset is first split into 70% for training and 30% for testing. Hyperparameter tuning is performed on the training set using five-fold cross-validation to identify the optimal parameter values. In each iteration, 80% of the training data (equivalent to 56% of the entire dataset) is used for model training, while the remaining 20% (14% of the full dataset) serves as the internal validation fold. Each fold is used exactly once as the validation fold, so every training observation contributes to validation. Each hyperparameter tuning method selects the hyperparameters that maximize the average performance across the five folds. After the optimal hyperparameters are selected, the model is retrained on the full training set (70%), and its predictive performance is assessed on the unseen test set (30%), providing an estimate of out-of-sample performance.
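The partitioning described above can be sketched with plain index arithmetic. The shuffling seed and the way rows are assigned to folds are assumptions made for illustration; only the 70/30 split, the five folds, and the 7,652 observations come from the text.

```python
import random


def train_test_indices(n, test_fraction=0.30, seed=0):
    """Shuffle row indices and split them into training and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (fit, validation) index lists; each fold validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


n = 7652  # number of observations reported for the dataset
train, test = train_test_indices(n)
for fit, validation in k_fold(train):
    # Shares of the full dataset: roughly 0.56 for fitting, 0.14 for validation.
    print(len(fit), len(validation),
          round(len(fit) / n, 2), round(len(validation) / n, 2))
```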
Before presenting the optimized model results, it is useful to first compare the performance of the untuned GBM and RF models to highlight key differences in their predictive behavior. As shown in Table 3, the untuned GBM achieves a very high training R² of 0.98732, but its test R² decreases to 0.91444, indicating overfitting. This is further evidenced by the substantial percentage increases between the training and the test sets for MAE (170.36%), MedAE (165.43%), MSE (586.48%), and RMSE (162.01%). By contrast, the RF model exhibits a more balanced performance, with a training R² of 0.91645 and test R² of 0.89652, reflecting only a modest decline of 2.18%. Nevertheless, its error metrics (MAE, MSE, and RMSE) still point to signs of overfitting. With respect to relative percentage errors (MAPE and sMAPE), both untuned models perform similarly, though GBM attains slightly lower test values. Overall, both models show overfitting in their untuned forms.
5.1. Optimized Gradient Boosting Machine
Table 4 compares the hyperparameter selections made by three tuning methods after estimating the GBM. The hyperparameters criterion, learning_rate, loss, max_features, and subsample are fixed, while the tuning methods independently select optimal values for max_depth, min_impurity_decrease, min_samples_leaf, min_samples_split, min_weight_fraction_leaf and n_estimators. The results indicate that the three methods do not converge on the same optimal hyperparameters. In particular, Optuna selects a max_depth of eight while the other two tuning methods select a value of six. All three methods agree on a min_impurity_decrease of 0.1, a min_weight_fraction_leaf of 0.1 and an n_estimators of 600. Optuna selects a min_samples_leaf of 13, while Random Search and Grid Search select values of 12 and 10, respectively. Optuna selects a min_samples_split of 12, while Random Search and Grid Search select values of 15 and 10, respectively.
Table 5 presents the results of using GBM optimized by each method. Our findings indicate that Optuna outperforms Random Search and Grid Search in terms of computational efficiency and model performance. Optuna completes the hyperparameter optimization process in approximately 3.10 min, which is 6.77 times faster than Random Search (21.02 min) and 34.50 times faster than Grid Search (107.08 min). This substantial reduction in computation time is particularly valuable in situations where resources are limited or time-sensitive applications are involved.
Furthermore, GBM models optimized using Optuna exhibit superior performance metrics compared to those optimized using Random Search and Grid Search. Evaluation metrics such as MAE, MedAE, MSE, MAPE, sMAPE, RMSE, and RMSLE consistently display lower values for models optimized by Optuna on the test set, suggesting that Optuna effectively traverses the hyperparameter space to identify optimal configurations that lead to models with enhanced predictive accuracy and generalization capabilities (see Table 5).
The acceptable threshold for overfitting varies depending on the application and dataset. As a general guideline, a difference of less than 5% between the training and test performance metrics is considered negligible, while differences greater than 10% may suggest overfitting. In this study, the observed differences across the three hyperparameter tuning methods for GBM range from 0.34% to 8.48%, indicating only minimal overfitting and supporting the conclusion that the models are well optimized for generalization. The consistently low error metrics on both the training and the test sets, together with high R² values, further confirm the absence of underfitting, where a model fails to capture essential data patterns. Taken together, the integration of hyperparameter tuning, cross-validation, and comprehensive performance evaluation has yielded an optimized GBM model that demonstrates strong predictive accuracy and robust generalization across all three tuning methods.
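The guideline above can be expressed as a small check. This is a sketch under two assumptions: the gap is measured relative to the training value, and gaps between the two thresholds receive their own label, which the guideline itself does not name. The MAE values passed to `relative_gap` are hypothetical, while the three percentages in the loop are figures reported in the text.

```python
def relative_gap(train_value, test_value):
    """Percentage difference of the test metric relative to the training metric."""
    return abs(test_value - train_value) / abs(train_value) * 100


def assess(gap_pct, negligible=5.0, warning=10.0):
    """Apply the rule-of-thumb thresholds quoted in the text."""
    if gap_pct < negligible:
        return "negligible"
    if gap_pct > warning:
        return "possible overfitting"
    return "between thresholds"


# Hypothetical training and test MAE values.
print(round(relative_gap(200.0, 217.0), 2))

# Smallest and largest tuned-GBM gaps, and the untuned-GBM MAE increase.
for pct in (0.34, 8.48, 170.36):
    print(pct, assess(pct))
```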
5.2. Optimized Random Forest
Table 4 also presents a comparison of the hyperparameter selections made by three different tuning methods after estimating a Random Forest (RF) model. In our experimental setup, the hyperparameters bootstrap, criterion, and max_features are fixed, while each tuning method independently determines optimal values for max_depth, min_impurity_decrease, min_samples_leaf, min_samples_split, min_weight_fraction_leaf, and n_estimators. Interestingly, Optuna does not select the same value for all hyperparameters when compared to the other two methods. Random Search and Grid Search select the same value for max_depth of 8, min_impurity_decrease of 0.1, min_weight_fraction_leaf of 0.03, and n_estimators of 90. For the optimal value of min_samples_leaf and min_samples_split, Random Search selects 18 and 15 while Grid Search selects 15 and 10, respectively.
In Table 5, our results based on RF indicate that Optuna surpasses both Random Search and Grid Search in terms of computational efficiency and model performance metrics. Notably, Optuna is significantly faster, completing the hyperparameter optimization process in approximately 0.79 min, which is 7.25 times faster than Random Search (5.74 min) and 108.92 times faster than Grid Search (86.27 min).
RF optimized using Optuna also displays superior performance metrics compared to those optimized using Random Search and Grid Search. Evaluation metrics such as MAE, MedAE, MSE, MAPE, sMAPE, RMSE, and RMSLE consistently show lower values for Optuna-optimized models based on the test set, suggesting that Optuna effectively traverses the hyperparameter space to identify optimal configurations that lead to models with enhanced predictive accuracy and generalization capabilities.
In this study, the observed differences across the three hyperparameter tuning methods for RF range from 0.37% to 8.63%, indicating only minimal overfitting and supporting the conclusion that the models are well optimized for generalization. The consistently low error metrics on both the training and the test sets, together with high R² values, further confirm the absence of underfitting. Taken together, the integration of hyperparameter tuning, cross-validation, and comprehensive performance evaluation has yielded an optimized RF model that demonstrates strong predictive accuracy and robust generalization across all three tuning methods.
5.3. Five-Fold Cross-Validation
Table 5 also presents the fold-wise cross-validation results for GBM and RF under different hyperparameter optimization strategies. For GBM, test R² values ranged from 0.831 to 0.855, exhibiting modest variability across folds. In contrast, RF showed slightly higher variability, with larger fold-to-fold differences. Despite these fluctuations, the mean performance of each tuning strategy is consistent across models (GBM: R² ≈ 0.845; RF: R² ≈ 0.776), indicating that the relative ranking of Random Search, Grid Search, and Optuna is robust. Comparison of training and test scores shows minimal differences, with the average gap between train and test R² ranging from approximately −0.95% to −0.97% for GBM and from −0.89% to −1.93% for RF. These small differences indicate no signs of overfitting and suggest good model generalization across all three tuning methods. The reported standard deviations underscore the sensitivity of cross-validation to fold splits. Across the three tuning methods, the average standard deviation of test R² over the five folds is 0.015 to 0.016 for GBM and 0.043 for RF, further highlighting that RF predictions are slightly more sensitive to training-validation partitioning than GBM. Overall, these results emphasize the stability and generalizability of the models across folds and tuning strategies.
5.4. Discussions
The findings of this study extend beyond technical improvements in machine learning workflows and have direct implications for urban planning and policy-making. Predictive models are increasingly used to support urban scholarship and decision-making in areas such as housing affordability, gentrification dynamics, property valuation, land-use change, and transportation demand. In these contexts, efficiency and accuracy are critical, as models are often updated frequently with new data and must produce timely, reliable insights.
By empirically evaluating the efficiency and predictive accuracy of different hyperparameter tuning methods using a real-world housing dataset, our study demonstrates how more advanced approaches such as Optuna can significantly reduce computation time while enhancing model performance. Although hyperparameter tuning is a standard procedure in machine learning, its computational demands can become a critical bottleneck, particularly in urban research where datasets are increasingly large, complex, and high-dimensional. This issue is rarely addressed explicitly in the urban science literature. Our study fills this gap by illustrating how the choice of tuning method affects both the speed and accuracy of predictive modeling tasks.
The implications for urban researchers and policymakers are substantial. Our findings show that Optuna dramatically reduces computation time compared to Random Search and Grid Search (e.g., 186.22 s versus 1,261.07 and 6,425.08 s, respectively, for GBM and 47.52 s versus 344.39 and 5,176.05 s for RF), while consistently achieving lower error values and higher average R². Efficient tuning methods such as Optuna therefore allow for more timely model deployment, enabling decision-makers to act on insights without undue computational delays. They also make it feasible to analyze more extensive or more granular datasets without requiring high-end computing resources.
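The speed-up factors quoted for RF follow directly from these runtimes, as the short calculation below illustrates. It uses only the RF timings in seconds reported above; the function name and the dictionary layout are illustrative.

```python
# RF tuning times in seconds, as reported in the text.
rf_seconds = {"Optuna": 47.52, "Random Search": 344.39, "Grid Search": 5176.05}


def speedup(baseline_seconds, method_seconds):
    """How many times faster the method is than the baseline."""
    return baseline_seconds / method_seconds


for name in ("Random Search", "Grid Search"):
    ratio = speedup(rf_seconds[name], rf_seconds["Optuna"])
    print(f"Optuna vs {name}: {ratio:.2f} times faster")
```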
For example, housing market models tuned with Optuna could provide more reliable short-term forecasts of property prices, which in turn could inform affordability policies or investment strategies. Similarly, models of urban growth or land-use change optimized with Optuna could improve the design of zoning policies or infrastructure planning. By demonstrating that Optuna outperforms traditional hyperparameter tuning methods, our study highlights how methodological advances in machine learning can translate into more robust, efficient, and actionable tools for evidence-based urban planning and policy-making.
Table 6 summarizes the main advantages and limitations of the machine learning algorithms (GBM, RF) and hyperparameter optimization methods (Random Search, Grid Search and Optuna) considered in this study. While GBM and RF differ in flexibility, interpretability, and computational cost, the choice of hyperparameter tuning strategy strongly influences model performance and efficiency. The comparison highlights the trade-offs between simplicity, accuracy, and scalability that should guide method selection in practical applications.
6. Conclusions
This study compares three hyperparameter optimization strategies—Random Search, Grid Search, and Optuna—using a housing estate dataset as a case study. The results highlight the advantages of Optuna in achieving higher predictive accuracy while reducing computational time, underscoring its potential for advancing methodological rigor in applied urban research. By facilitating more efficient model development, Optuna enables researchers to uncover complex patterns in urban data and generate insights that can inform policy and planning decisions in areas such as housing markets, traffic management, and land use. The adoption of such advanced optimization frameworks can therefore enhance the practical relevance of machine learning in urban sciences, where timely and data-driven decision-making is critical.
The potential applications of Optuna extend beyond housing data to a wide range of urban challenges. Its efficiency and scalability can support the development of cost-effective, data-driven approaches to urban planning, governance, and resource allocation. At the same time, its limitations should not be overlooked. Optuna’s performance is influenced by model complexity, dataset characteristics, and search space configuration. In certain cases, alternative frameworks such as HyperOpt, BayesOpt, or Ray Tune may prove equally effective or even superior. Users must also be mindful that Optuna’s default pruning strategies and parameter settings are not universally optimal, requiring careful adjustment to specific tasks.
This study has two limitations that should also be acknowledged. First, the analysis relies on a single housing estate dataset. While this dataset provides a concrete and well-defined context, its use inevitably restricts the external validity of the findings. Here, the housing market data serves primarily as a case study to demonstrate and compare the performance of different hyperparameter optimization strategies: Random Search, Grid Search, and Optuna. Its purpose is not to develop a universally generalizable housing price prediction model. Consequently, the empirical results should be interpreted as illustrative of methodological differences rather than definitive for all housing markets. Nevertheless, the comparative insights are transferable to a wide range of urban data science problems where predictive modeling and hyperparameter tuning are central. Second, the computational efficiency results are dependent on the specific experimental environment, including hardware and software configurations. Different platforms may yield different runtimes, although relative comparisons across optimization methods are expected to remain informative.
Future research should build upon this study by conducting comparative evaluations of Optuna and other state-of-the-art frameworks, including HyperOpt, BayesOpt, and Ray Tune, across multiple algorithms, datasets, and tasks. Investigations into hybrid approaches that integrate the strengths of different optimizers also hold promise for improving robustness and adaptability. In addition, further work should examine the interpretability, reproducibility, and usability of these tools in applied urban sciences. By addressing these avenues, future studies can provide deeper guidance on selecting and deploying hyperparameter optimization strategies to strengthen both methodological foundations and real-world impact in urban analytics.