Article

A Study on Estimating Land Value Distribution for the Talingchan District, Bangkok Using Points-of-Interest Data and Machine Learning Classification

1
Geography and Geo-Informatics Program, Faculty of Humanities and Social Sciences, Suan Sunandha Rajabhat University, Bangkok 10300, Thailand
2
Technology Computer Application in Architecture Program, Faculty of Industrial Technology, Suan Sunandha Rajabhat University, Bangkok 10300, Thailand
3
Remote Sensing and GIS Field of Study, School of Engineering and Technology, Asian Institute of Technology, Klong Luang, Pathumthani 12120, Thailand
*
Author to whom correspondence should be addressed.
Academic Editor: Pietro Picuno
Appl. Sci. 2021, 11(22), 11029; https://doi.org/10.3390/app112211029
Received: 12 October 2021 / Revised: 15 November 2021 / Accepted: 17 November 2021 / Published: 21 November 2021
(This article belongs to the Section Earth Sciences and Geography)

Abstract

Land is an essential factor in real estate developments, and each location has its unique characteristics. Land value is a vital cost of real estate developments. Higher land costs mean that project developers must create higher valued products to cover the higher land costs and to maintain a profit level from their developments. Land values vary according to surrounding factors, such as environmental, social, and economic situations. Machine learning is a popular data estimation technique that enables a system to learn from sample data; however, there are few studies on its use for estimating land value distribution. Therefore, we aim to apply machine learning to estimate land value and to investigate the factors affecting land value in the Talingchan district, Bangkok. We used land value level as the dependent variable, with other factors affecting land value levels as the independent variables. Ten points of interest were chosen from Google Places API. Then, three machine learning algorithms, namely CART, random forest, and support vector machine, were applied. For this study, we selected 45,032 land parcels as the experimental data and randomly divided them into two groups. The first 70% of the land parcels was used to create the training area. The other 30% of the land parcels was used to create the testing area to verify the accuracy of the land value estimation from the applied machine learning techniques. Random forest produced the most accurate results and was then used to measure the factor importance. The academic group factor was school, and the commercial group factors were clothing store, pharmacy, convenience store, hawker stall, grocery store, automatic teller machine, supermarket, restaurant, and company.
Keywords: land value; machine learning; Talingchan

1. Introduction

Since Thailand’s 5th National Economic and Social Development Plan, the government has implemented the policy of distributing the main activities from the city center of Bangkok to the central communities of the five perimeter provinces, namely Nonthaburi, Nakhon Pathom, Pathum Thani, Samut Prakan, and Samut Sakhon, and thus has continued to spread prosperity to the provincial communities of the country. The critical measure is to distribute some activities to the provinces and to develop a transportation system capable of increasing the connections between Bangkok and the central communities of the provinces, especially the expansions along the BTS/MRT lines. As a result, economic growth in the metropolitan communities is higher than in other communities [1].
Urban expansion is an important phenomenon because it affects political and economic development planning and also affects the livelihoods of people in each community. Nowadays, the population of urban areas has increased dramatically, where people use the land for the basics of everyday life. In 1991, Ginsburg et al. [2] described the areas with expanding urban activities around important metropolises or cities in Asia as “extended metropolises”, where urban and countryside activities coexisted. The Talingchan district is located on the west side of Bangkok, which is expanding along the central road axis; the outstanding characteristic of this area is the high level of interactions between the countryside and urban areas that are connected via effective transportation routes. The combined agriculture and non-agricultural activities, such as commercial, transportation, and industrial activities, are increasing in the area that previously had been agricultural, especially rapidly increasing housing estates. This expansion of the area with Bangkok as the center creates a megacity much like other spectacular cities in the world.
“Land” is the essential factor in real estate developments, and each location has its unique characteristics. Land value is a vital cost of real estate developments. Higher land costs mean that project developers must create higher valued products to cover the higher land costs, and thus maintain profit levels from their developments. In 2020, the market value of vacant land around the BTS/MRT stations increased continuously, by 25–30% for some land parcels, to around 1.7–3.1 million Baht per square wah; experts from one real estate company have referred to this situation as an “expensive land situation” [3]. Vacant land in the same area can have different values; therefore, land value estimation in the Talingchan district is essential for assisting developments by the government, individuals, and the public sector.
Accurate land value distribution mapping is the basis of urban studies. First, land value distribution is used to evaluate how neighborhoods and location factors impact real estate prices and subsequently to explain inhabitants’ facility choices using their spatial division [4,5]. Second, mapping indicators associated with land value distribution (such as home rentals and prices) and modelling their spatiotemporal patterns are critical stages in analyzing and monitoring the local residential land market [6]. Third, variations in land value distribution mapping over time can provide insight into urban growth direction and patterns [7,8]. Despite the critical points stated above, predicting fine-scale land value distribution in developing countries remains difficult. While the land transaction sample is recognized as a trustworthy data source for modelling land value distribution, it frequently does not include all plots in a city. As a result, more capable methodologies and tools are necessary to ensure complete coverage when estimating or forecasting the land value distribution in unseen areas. Additionally, the models must be capable of effectively learning and fitting the complicated nonlinear interactions between real estate values and potential causes such as the natural environment, economic policy, geography, and community characteristics [9,10]. Land value varies according to surrounding factors, such as environmental, social, and economic situations [11,12]. By analyzing land value variation, many studies have found that when social and economic conditions are poor, the land value is usually low [13]. Some studies have found that land value was related to the quality of the schools in an area, whereas others have found that land value was related to commercial districts [14,15].
In addition, other studies have found that public transportation [16] and the environment in the community, such as wetlands [17] and green spaces [18,19], had some effects on the land value. Therefore, the focus of this study is mainly on using environmental factors to predict land value.
Methodologically, many studies have used interpolation methods such as inverse distance weighting and Kriging to study the distribution of land value [20,21,22]. Some studies have applied interpolation methods in many areas to confirm data anomaly [23]. In this study, we used the spatial value estimation technique to create the spatial value for the effect of each factor on land value.
Machine learning is one of the popular data estimation techniques that enables a system to learn from sample data without a programmer’s involvement. Another benefit of this technique is that it can learn from existing data to determine the correct result. In recent years, these techniques have been widely used in various parts of the world, as evidenced by studies on image classification, climate prediction, road traffic injuries, landslide prediction, and poverty prediction [24,25,26,27].
Moreover, many studies have found that machine learning algorithms have higher accuracy than traditional statistical methods [9], and some studies have also used machine learning to estimate house values [28]. We also found that machine learning has been used to determine the factors that affect land value in New York, United States [29], and to predict property prices in Hong Kong [30], but there have been relatively few studies of this type. Therefore, we aim to apply machine learning techniques to estimate the land value and to identify the factors affecting land value in the Talingchan district, Bangkok.

2. Materials and Methods

2.1. Study Area

In this study, we chose the Talingchan district as the study area. The Talingchan district is one of the fifty districts of Bangkok, located in the outer area on the west side of the Chao Phraya River called “Thon Buri Side”, and generally contains rural and agricultural conservation areas mixed with low-density residential areas (Figure 1). Currently, however, the agricultural area is shrinking due to the construction of residential areas and transportation routes to support city expansion.

2.2. Dataset

In this study, the independent variables were the factors that may affect the land value, collected from the Google Places application programming interface (API), and the dependent variable was the assessed land value from the Treasury Department of Thailand; both were taken from 2021 data.
The Treasury Department provides land value estimates for roadside parcels only, and there is no reference land value estimate for the remaining parcels. Of the 138,963 land parcels in the Talingchan district, 45,032 (32.4%) are roadside parcels and were used for the experiment. The other 93,931 parcels (67.6%) are not roadside parcels and were not used for the experiment.

2.2.1. Points-of-Interest Data

In this study, we mainly used the Google Places application programming interface (API) to collect the factors, since we found that the points-of-interest data from Google Places API were kept up to date, unlike the databases from other organizations that were updated only periodically. Through an examination of documents, we identified the factors related to land value: school, fashion store, pharmacy, grocery store, convenience store, automated teller machine (ATM), company, market, and restaurant. After deriving the points of interest from Google Places API, we estimated the spatial density of each factor using kernel density estimation (Figure 2 and Figure 3).
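The kernel density step can be illustrated with a minimal sketch. The paper does not state which kernel or bandwidth was used, so the Gaussian kernel, the 200 m bandwidth, and the pharmacy coordinates below are all assumptions chosen for illustration only.

```python
import math

def gaussian_kde_2d(points, query, bandwidth):
    """Gaussian kernel density estimate at `query` from 2-D `points`.

    The Gaussian kernel and the bandwidth are illustrative assumptions;
    GIS packages often default to other kernels.
    """
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        sq_dist = (px - qx) ** 2 + (py - qy) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total

# Hypothetical projected coordinates (metres) of three pharmacies.
pharmacies = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
near = gaussian_kde_2d(pharmacies, (30.0, 30.0), bandwidth=200.0)
far = gaussian_kde_2d(pharmacies, (2000.0, 2000.0), bandwidth=200.0)
```

A parcel close to the cluster (`near`) receives a higher density than a distant one (`far`); a density of this kind, evaluated per parcel and per POI category, is what would serve as an independent variable.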

2.2.2. Land Parcel and Land Value Data

The land parcel data were obtained from the Department of Lands for the dependent variable in polygon format. The land value data were obtained from the Treasury Department in datasheet format. The features and tables of the land value dataset were joined with the land parcel ID in GIS software. The land value dataset covers only the main roads in the district (Figure 4). In some places, the lowest-priced parcel (blue parcel) lies close to the highest-priced parcel (red parcel), because the lowest-priced parcel is a plot in a small alley off the major road; these values come directly from the Treasury Department dataset.
Descriptive statistics of the independent variables of the Talingchan district, Bangkok, are given in Table 1, which shows the number of points in each POI dataset in the study area, from school to grocery store.
Descriptive statistics of the dependent variables or land value of the Talingchan district, Bangkok are provided in Table 2 and show the frequency of the land value levels. The table includes the number of parcels in land parcel units.

2.3. Methods

2.3.1. Modeling Framework

We selected 45,032 land parcels as the experimental data for this study, or 32.4% of the 138,963 land parcels throughout the Talingchan district. These were the land parcels that had land value data from the Treasury Department. Then, the other independent factors that affected land value were verified from the primary documents and added to the experimental data. The next step was to divide the experimental data into two groups randomly. The first 70% of the land parcels was used to create the training area. The other 30% of the land parcels was used to create the testing area to verify the accuracy of the land value estimation from applied machine learning techniques. Then, we compared the accuracy of all of the models to find out which one had the highest accuracy and to identify the factors that were related to the model and the land values (Figure 5).
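The random 70/30 division can be sketched as follows. The function name, the fixed seed, and the use of integer parcel IDs are assumptions for illustration; the paper does not describe how the random split was implemented.

```python
import random

def split_parcels(parcel_ids, train_fraction=0.7, seed=42):
    """Shuffle parcel IDs and cut them into training and testing groups."""
    shuffled = list(parcel_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# 45,032 roadside parcels, identified here by hypothetical integer IDs.
train_ids, test_ids = split_parcels(range(45032))
```

With these settings, the training group holds 31,522 parcels and the testing group 13,510, and no parcel appears in both groups.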

2.3.2. Imbalanced Data

The dataset of this study was imbalanced: the land value levels contained unequal numbers of parcels, which could affect the learning process of the machine learning algorithms and, therefore, the study’s accuracy. To address this, we applied the synthetic minority oversampling technique (SMOTE), a data sampling technique that generates additional samples instead of reusing existing data. SMOTE synthesizes new data from the existing data using the nearest neighbor principle to expand the model’s decision boundary, which affects the mean and standard deviation of the existing data. The training area distribution of each land value level and the factors after additional sampling are shown in detail in Table 3.
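The nearest-neighbor interpolation behind SMOTE can be sketched in a few lines. This is a simplified illustration rather than the implementation used in the study; the function name, the value of `k`, and the sample points are assumptions.

```python
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority-class neighbours (simplified SMOTE step)."""
    idx = rng.randrange(len(minority))
    base = minority[idx]
    # Rank the other minority samples by squared Euclidean distance to base.
    others = [p for i, p in enumerate(minority) if i != idx]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Hypothetical feature vectors of an under-represented land value level.
minority_class = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
rng = random.Random(7)
synthetic = [smote_sample(minority_class, k=2, rng=rng) for _ in range(5)]
```

Each synthetic point lies on a line segment between two existing minority samples, which is why the class mean and standard deviation shift slightly after oversampling.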

2.3.3. Machine Learning Models

1. CART
The classification and regression tree (CART) is a predictive tree model for investigating data structure. It creates visualized decision rules for predicting a categorical or continuous variable. Unlike the classification tree, which splits the input space of many variables into subspaces, each associated with a specific class of the output variable, the regression tree does not generate classes of the dependent variable. Instead, the dependent variable holds the response value for each observation in the matrix of independent variables. Because regression trees do not contain preassigned classes, the output of this stage is a response value for each new observation. The splitting rule in regression trees is obtained by minimizing the squared residuals, which means that the summed variances of the two resulting nodes should be minimized.
The classification and regression tree, proposed by Breiman et al. [31], is one of the most widely used approaches for dealing with classification and regression issues. CART models execute the Gini and the least-squared deviation measurements for categorical and numerical issues, respectively [32]. Let the $p$th sample be illustrated as $(I_{p,1}, I_{p,2}, \ldots, I_{p,n}, O_p)$, where $I_{p,n}$ is the value of the $p$th sample with $n$ features, and $O_p$ is the corresponding output value of the sample. The minimization of the least-squared deviation measure of impurity given by Equation (1) serves as a choice to decide the split-up of trees into branches for a CART regression issue.
$$\frac{1}{N}\sum_{V \in U_r}\left(O_p - \bar{O}_r\right)^2 + \frac{1}{N}\sum_{V \in U_l}\left(O_p - \bar{O}_l\right)^2 \tag{1}$$
$U_r$ and $U_l$ are training data sets pointing to the right and left child nodes, respectively, and $N$ is the total number of training samples. The output values of the right and left nodes are represented by $\bar{O}_r$ and $\bar{O}_l$, respectively.
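The split criterion of Equation (1) can be illustrated with a short sketch that scans candidate thresholds on one feature. The helper names and the toy density and value figures are assumptions, and a full CART implementation would repeat the search over all features and recurse into the child nodes.

```python
def split_cost(outputs_left, outputs_right):
    """Least-squared deviation of a candidate split, following Equation (1)."""
    n_total = len(outputs_left) + len(outputs_right)
    cost = 0.0
    for node in (outputs_left, outputs_right):
        if node:
            mean = sum(node) / len(node)
            cost += sum((o - mean) ** 2 for o in node) / n_total
    return cost

def best_threshold(feature, outputs):
    """Try each midpoint between sorted feature values; return (threshold, cost)."""
    pairs = sorted(zip(feature, outputs))
    best = None
    for (x0, _), (x1, _) in zip(pairs, pairs[1:]):
        if x0 == x1:
            continue  # identical values cannot be separated
        threshold = (x0 + x1) / 2.0
        left = [o for x, o in pairs if x <= threshold]
        right = [o for x, o in pairs if x > threshold]
        cost = split_cost(left, right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical POI density (feature) and land value (output) per parcel.
density = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
value = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
threshold, cost = best_threshold(density, value)
```

For this toy data, the lowest cost is reached by splitting between the densities 0.3 and 0.8, which separates the low-value parcels from the high-value ones.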
2. Random Forest
The random forest classifier comprises many tree classifiers, each of which is created using a random vector sampled independently from the input vector and casts a unit vote for the most popular class to categorize an input vector [33]. The random forest classifier employed in this work grows a tree by randomly selecting or combining features at each node. For each feature or feature combination selected, bagging was employed, a method that generates a training dataset by randomly drawing N instances with replacement, where N is the size of the original training set. Each instance is categorized by selecting the class with the most votes from all the tree predictors in the forest (Breiman). Designing a decision tree requires the selection of an attribute selection measure and a pruning technique. There are several ways to select attributes for decision tree induction, and most approaches assign a quality measure directly to the attribute.
The information gain ratio criteria and the Gini index are the most often utilized attribute selection metrics in decision tree induction.
The random forest classifier employs the Gini index as an attribute selection metric, quantifying an attribute’s impurity with respect to the classes. For a given training set T, the Gini index can be expressed as:
$$\sum\sum_{j \neq i}\left(f(C_i, T)/|T|\right)\left(f(C_j, T)/|T|\right) \tag{2}$$
where $f(C_i, T)/|T|$ is the probability that the selected instance belongs to class $C_i$.
Each time, a tree is built to its maximum depth utilizing a mix of features on new training data. These fully grown trees are not pruned, which is a significant benefit of the random forest classifier over other decision tree techniques.
The findings reported in [34] indicate that the pruning strategies used, rather than the attribute selection criteria, affect the performance of tree-based classifiers.
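Two ingredients of the random forest described above, bagging and majority voting, can be sketched as follows. The tree-growing step is omitted, and the function names, the seed, and the example votes are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) instances with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the class that received the most votes from the trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(1)
training_rows = list(range(10))           # hypothetical row indices
bag = bootstrap_sample(training_rows, rng)

# Hypothetical votes cast by five trees for one parcel.
predicted_level = majority_vote(["low", "high", "low", "medium", "low"])
```

Because sampling is done with replacement, a bag has the same size as the training set but usually repeats some rows and leaves others out, which is what makes the individual trees differ.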
3. Support Vector Machine
The support vector machine (SVM) is a machine learning algorithm first introduced by Vapnik et al. [35]. SVM is a supervised classification method that transforms a nonlinear problem into a linear one, so that the classes can be separated in a simple, tractable way by generating a hyperplane. The kernel function is the mathematical function that is utilized for this data transformation. Using the training dataset, SVM translates the original input into a high-dimensional feature space. A separating hyperplane is produced between the points of the different classes in the original space of n coordinates. SVM calculates the maximum separation between classes and creates a classification hyperplane in the center of the maximum margin. If a point is above the hyperplane, it is categorized as +1; otherwise, it is classified as −1. Support vectors are the training points closest to the hyperplane. Once the decision surface has been obtained, the properties of new data can be used to forecast which group a new record should belong to. The technique is specified over a vector space. The decision surface for linearly separable space is a hyperplane, which can be represented as Equation (3):
$$w \cdot x + b = 0 \tag{3}$$
The vector $w$ and constant $b$ are learnt using a training set of linearly separable items, where $x$ is an arbitrary object to be categorized. SVM training is posed as a linearly constrained quadratic programming problem, Equation (4), with the result that the SVM solution is always globally optimal.
$$\min_{w}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \tag{4}$$
with constraints
$$y_i\,(x_i \cdot w + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\ \forall i \tag{5}$$
For linearly inseparable objects, the original input data are transformed into a higher-dimensional space using a nonlinear mapping, and a linearly separating hyperplane can then be found in the new space; by using the kernel function, this is done without increasing the computational complexity of the quadratic programming problem [36]. In other words, for the linearly inseparable case, the kernel function is used to compute the similarities between vectors in the higher-dimensional space directly from the original lower-dimensional space.
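The linear decision rule and the soft-margin objective can be made concrete with a small sketch. It only evaluates the objective for a given hyperplane and does not solve the quadratic programme; the weights, samples, labels, and the value of `C` are hypothetical.

```python
def decision(w, b, x):
    """Signed score w . x + b of a sample relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """+1 if the point lies above the hyperplane, otherwise -1."""
    return 1 if decision(w, b, x) > 0 else -1

def soft_margin_objective(w, b, samples, labels, C):
    """Return 0.5 * ||w||^2 + C * sum(slack) and the slack values, where each
    slack is the smallest value satisfying the margin constraints."""
    slacks = [max(0.0, 1.0 - y * decision(w, b, x))
              for x, y in zip(samples, labels)]
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical hyperplane and two-feature samples with labels +1 / -1.
w, b = [1.0, 1.0], -3.0
samples = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.0, 1.5)]
labels = [-1, 1, 1, -1]
objective, slacks = soft_margin_objective(w, b, samples, labels, C=1.0)
```

Only the last sample falls inside the margin, so it is the only one with a non-zero slack; a solver would adjust `w` and `b` to trade this slack against the margin width.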

2.3.4. Model Evaluation

Confusion between the classes can be examined with the confusion matrix, a table that overlays the classes predicted by any method on the corresponding reference (ground-truth) data and then compares them to assess accuracy. The structure of the multi-class confusion matrix is shown in Figure 6 [37].
The confusion elements for each class are given by Equations (6)–(9):
$$TP_i = c_{ii} \tag{6}$$
$$FP_i = \sum_{l=1}^{n} c_{li} - TP_i \tag{7}$$
$$FN_i = \sum_{l=1}^{n} c_{il} - TP_i \tag{8}$$
$$TN_i = \sum_{l=1}^{n}\sum_{k=1}^{n} c_{lk} - TP_i - FP_i - FN_i \tag{9}$$
True Positive (TP) is the number of cases in which the predicted class is the same as the actual class.
False Positive (FP) is the sum of the values in the corresponding row, excluding the TP value.
False Negative (FN) is the sum of the values in the corresponding column, excluding the TP value.
True Negative (TN) is the sum of the values in all rows and columns, excluding the row and column of the class being evaluated.
Once the TP, FP, FN, and TN values are known, the p-value and the Kappa coefficient are checked.
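The four confusion elements can be extracted from a multi-class matrix as sketched below. The layout `matrix[predicted][actual]` is an assumption that follows the verbal definitions (FP from the row, FN from the column); how the subscripts of $c$ map to rows and columns depends on Figure 6. The 3-class matrix is hypothetical.

```python
def confusion_elements(matrix, i):
    """Return (TP, FP, FN, TN) for class i of a square confusion matrix.

    Assumed layout: matrix[predicted][actual].
    """
    n = len(matrix)
    tp = matrix[i][i]
    fp = sum(matrix[i][k] for k in range(n)) - tp  # rest of row i
    fn = sum(matrix[l][i] for l in range(n)) - tp  # rest of column i
    total = sum(sum(row) for row in matrix)
    tn = total - tp - fp - fn                      # everything else
    return tp, fp, fn, tn

# Hypothetical confusion matrix for three land value levels.
matrix = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 60],
]
first_level = confusion_elements(matrix, 0)
second_level = confusion_elements(matrix, 1)
```

For the first class, the result is 50 true positives, 5 false positives, 5 false negatives, and 111 true negatives, which together account for all 171 samples.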
A p-value was used to evaluate the significance of the results in comparison to the null hypothesis when carrying out statistical tests. The null hypothesis asserts that no link exists between the two variables under investigation (one variable does not affect the other). The alternative hypothesis asserts that the independent variable did influence the dependent variable, and the results are important in terms of corroborating the theory under investigation. Statistical significance is defined as a p-value below a chosen threshold (usually 0.05). Such a p-value provides significant evidence against the null hypothesis, since results at least this extreme would be expected less than 5% of the time if the null hypothesis were true. As a result, the null hypothesis is rejected, and the alternative hypothesis is accepted.
Cohen’s Kappa coefficient, which is frequently used to measure reliability, may be used for training–testing reliability. The Kappa coefficient reflects the degree of agreement between the frequencies of two sets of data obtained on two distinct occasions in training–testing. The scale of Kappa value interpretation is shown in Table 4 [38].
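As an illustration of how the Kappa coefficient is obtained from a confusion matrix, the sketch below uses the standard definition (observed agreement corrected for chance agreement). The function name and the 2-class matrix are hypothetical.

```python
def cohens_kappa(matrix):
    """Cohen's Kappa from a square confusion matrix (rows vs. columns)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)

kappa = cohens_kappa([[40, 10], [10, 40]])
```

Here the observed agreement is 0.8 and the chance agreement is 0.5, giving a Kappa of 0.6; a perfectly diagonal matrix would give 1.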
The accuracy of the model is based on the following Equations (10)–(14):
$$\text{Sensitivity} = \frac{TP_i}{TP_i + FN_i} \tag{10}$$
$$\text{Specificity} = \frac{TN_i}{TN_i + FP_i} \tag{11}$$
$$\text{Positive prediction value} = \frac{TP_i}{TP_i + FP_i} \tag{12}$$
$$\text{Negative prediction value} = \frac{TN_i}{TN_i + FN_i} \tag{13}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{14}$$
The ratio of correctly identified positives to the total number of positive samples is known as sensitivity, true positive rate, or recall. Sensitivity is a crucial metric for assessing and comparing classifiers, since it indicates the correct classification rate of the class.
The ratio of properly categorized negatives to the total number of negative samples is specificity. Specificity is analogous to accuracy, but it is computed over the negative samples only.
The ratio of correctly identified positives to the total number of samples labelled as positives is the precision or positive predictive value.
The ratio of correctly identified negatives to the total number of samples classified as negatives is the negative predictive value. It should not be confused with the fallout, sometimes known as the false-positive rate, which is the ratio of negatives incorrectly labelled as positives to the total number of negative samples; that rate complements specificity and shows the percentage of “false alarms.”
The capacity of a classification test to accurately identify or exclude an outcome is measured by accuracy (ACC), which is the proportion of correct predictions to total samples. When the dataset is significantly skewed, overall accuracy is insufficient to describe the model’s performance, because overall accuracy can be high simply because most samples are categorized into the majority class.
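Equations (10)–(13) can be checked with a short sketch. The counts are those reported in Section 3.1 for the CART model at the 0–15,000 THB/Wah² level (TP = 322, FP = 73, FN = 35, TN = 13,111); the function name and dictionary keys are assumptions, and the fallout is added only to show its relation to specificity.

```python
def class_metrics(tp, fp, fn, tn):
    """Per-class metrics from the four confusion elements."""
    return {
        "sensitivity": tp / (tp + fn),   # Equation (10)
        "specificity": tn / (tn + fp),   # Equation (11)
        "ppv": tp / (tp + fp),           # Equation (12)
        "npv": tn / (tn + fn),           # Equation (13)
        "fallout": fp / (fp + tn),       # false-positive rate
    }

# Counts reported in Section 3.1 for the CART model, lowest value level.
cart = class_metrics(tp=322, fp=73, fn=35, tn=13111)
```

Rounded to three decimals, the sketch gives a sensitivity of 0.902, a specificity of 0.994, a positive prediction value of 0.815, and a negative prediction value of 0.997, and the fallout equals one minus the specificity.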
In addition, the mean absolute percentage error (MAPE) was used to assess the effectiveness of the machine learning model in terms of land value price. The mathematical formula of MAPE is Equation (15):
$$\text{MAPE} = \frac{1}{N}\sum_{t=1}^{N}\left|\frac{V_t - P_t}{V_t} \times 100\right| \tag{15}$$
where $N$ is the number of predicting periods, $V_t$ is the actual value at period $t$, and $P_t$ is the predicted value at period $t$.
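Equation (15) translates directly into code. The actual and predicted land values below are hypothetical and serve only to show the calculation.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, following Equation (15)."""
    return sum(
        abs((v - p) / v * 100.0) for v, p in zip(actual, predicted)
    ) / len(actual)

# Hypothetical actual and predicted land values (THB per square wah).
actual_values = [20000.0, 40000.0, 50000.0]
predicted_values = [18000.0, 40000.0, 55000.0]
error = mape(actual_values, predicted_values)
```

The three percentage errors are 10%, 0%, and 10%, so the MAPE is about 6.67%. Note that the formula is undefined when an actual value is zero.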

2.3.5. GINI Index

The GINI index, created by Breiman, determines the purity of a specific class following a split on a particular attribute [22]. The best split increases the purity of the sets resulting from the split. If $L$ is a dataset with $j$ different class labels, GINI is defined as Equation (16):
$$\text{GINI}(L) = 1 - \sum_{i=1}^{j} p_i^2 \tag{16}$$
where $p_i$ is the relative frequency of class $i$ in $L$. If the dataset is split on attribute $A$ into two subsets $L_1$ and $L_2$, with sizes $N_1$ and $N_2$, respectively, GINI is calculated as Equation (17):
$$\text{GINI}_A(L) = \frac{N_1}{N}\,\text{GINI}(L_1) + \frac{N_2}{N}\,\text{GINI}(L_2) \tag{17}$$
Reduction in impurity is calculated as Equation (18):
$$\Delta\text{GINI}(A) = \text{GINI}(L) - \text{GINI}_A(L) \tag{18}$$
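Equations (16)–(18) can be followed step by step in a short sketch. The function names and the "low"/"high" labels are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """GINI(L) = 1 - sum of squared class frequencies, Equation (16)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Impurity reduction of a split, Equations (17) and (18)."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical land value labels before and after two candidate splits.
parent = ["low"] * 4 + ["high"] * 4
perfect = gini_reduction(parent, ["low"] * 4, ["high"] * 4)
mixed = gini_reduction(
    parent,
    ["low", "low", "low", "high"],
    ["low", "high", "high", "high"],
)
```

The balanced parent has an impurity of 0.5. A split that separates the classes perfectly removes all of it, whereas the mixed split removes only 0.125, so the first split would be preferred.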

3. Results

3.1. Validating the Land Value Predictive Value Obtained by Applying the Machine Learning Technique Model

The land value predictions obtained by applying the machine learning models were validated with the multi-class confusion matrix to evaluate the accuracy of the prediction. Table 5, Table 6 and Table 7 show the multi-class confusion matrix of each model.
The TP, FP, FN, and TN values can be read from these tables. For example, for the CART model (Table 6) at the land value level of 0–15,000 THB/Wah², the true positive value (row 1, column 1) was 322, the false positive value was 73 (2 + 42 + 12 + 17), the false negative value was 35 (6 + 18 + 11), and the true negative value was 13,111.
Regarding the accuracy derived from the confusion matrix of each model, the RF model had the highest accuracy, at 96.9%, followed by the SVM model at 90.26%, whereas the CART model had an accuracy of only 62.56%. Cohen’s Kappa coefficient showed the same ordering: the RF model had the highest value, at 95.66%, followed by 86.87% and 54.3% for SVM and CART, respectively. Overall, the RF model had the highest accuracy and consistency values, followed by SVM and CART, and all models had p-values below 0.000000000000000000022.
In addition, sensitivity, specificity, positive prediction value, and negative prediction value can be calculated from the confusion matrix of each model (Table 8). For example, for the CART model at the 0–15,000 THB/Wah² level, the sensitivity was 0.902 (322/(322 + 35)), the specificity was 0.99 (13,111/(13,111 + 73)), the positive prediction value was 0.815 (322/(322 + 73)), and the negative prediction value was 0.997 (13,111/(13,111 + 35)). The same quantities can be calculated for all models and all land value levels.
Figure 7 shows the distribution of sensitivity values for all models. The RF and SVM models had high sensitivity at all land value levels except 32,001–40,000 THB/Wah², where SVM was significantly lower than RF. The CART model had low sensitivity.
Figure 8 shows the distribution of specificity values for all models. All models had high specificity at all land value levels; in particular, the RF model reached a value equal to 1 at all levels.
Figure 9 shows the distribution of positive prediction values for all models. The RF and SVM models had high positive prediction values at all land value levels except 15,001–18,000 THB/Wah² and 50,001–80,000 THB/Wah², where SVM was noticeably lower than RF; the positive prediction values of the CART model were poor.
Figure 10 shows the distribution of negative prediction values for all models. All models had high negative prediction values; in particular, RF reached a value equal to 1 at almost every land value level, the exception being 32,001–40,000 THB/Wah².
Lewis [39] claims that MAPE values of less than 10% indicate a high level of accuracy, while values of 10% to 20% indicate a reasonable level of accuracy. Thus, RF (1.906) and SVM (5.242) achieve a high level of accuracy, while CART (20.391), marginally above the 20% bound, is treated here as a reasonable level of accuracy. Table 9 shows the average MAPE value of the machine learning models.
After all models have been validated, each model’s predicted probabilities of land value estimation can be exported in a data frame format. The testing feature class and the data frame of predicted probabilities are then joined using the feature ID field by GIS programming, and the estimated land value map is created from the joined data. According to the land values estimated using points-of-interest data and the machine learning classification algorithms (Figure 11), the RF (Figure 11C) and SVM (Figure 11D) maps are the closest to the testing map (Figure 11A), whereas the CART map (Figure 11B) matches the testing map less closely.

3.2. The Factors That Influence the Land Value Parcel

This section describes which predictor factors the RF model found to be the most important for predicting land value. Factor importance was measured using RF with the mean decrease in Gini method. The most important factor was school, and the commercial group also appeared to be important to the model, including clothing store, pharmacy, convenience store, hawker stall, grocery store, automatic teller machine, supermarket, restaurant, and company, as shown in Figure 12.

4. Discussion

Wen et al. [15] studied the effect of education institutions on the real estate market. They found that education facilities had a very high effect on real estate value, especially when the quality of the school was considered with the real estate value. Many studies [40] have found that house values in South Korea are affected by education facilities.
Hu et al. [10] found that stores and commercial groups had a very high effect on the land value, especially when the commercial group was within 15 min of walking. Their results were also produced using machine learning, and they found that the random forest regressor (RFR) and extra-trees regressor (ETR) produced the best house rent predictions. Wu et al. [41] also found that random forest produced the best classification results related to a real estate asset.
The results mentioned above found that education facilities, stores, and commercial groups affected real estate values. In this study, the predictive results were accurate by using data generated from Google Places API to solve outdated data from the government sector that was prepared the following fiscal year. We used Google Places API to solve this problem, such as a study by Wu et al. [42] that found that using POI data from the private sector produced data analysis results with more accuracy.
The MAPE value of past research is listed in Table 10. The RF model in this work outperforms several prior models in terms of MAPE, as seen in the table. However, Pai et al. [43] models were better than this work, but this work uses Google Places API to solve the problems. Google Places API is convenient and up to date. It is an exciting issue if researchers want to solve the problem without relying on a dataset that requires government permission. In this issue, other researchers need to consider the pros and cons of the methodology between Pai’s and this work.
Various factors might affect real estate success, including countries, cultures, market trends, and economic situations. This research therefore presents a viable and comparable option for land value estimation. To retain stability and feasibility over time, forecasting models should be modified and enhanced regularly.

5. Conclusions

Machine learning algorithms produced accurate land value estimates in this study, with random forest reaching 96.9% accuracy (Kappa 95.66%) in the testing area. Moreover, their parameters can be tuned experimentally to find appropriate values, which allows the method to differentiate the data appropriately.
In this study, we found that machine learning algorithms provided accurate predictive results and identified the factors that affected the predicted land value of each land parcel. The advantages of this method are that (1) machine learning can predict the land value of each land parcel, (2) the method identifies the factors that affect the land value prediction for each land parcel, (3) machine learning is a helpful tool to support decisions, and (4) machine learning could improve the prediction of land value at the macro level in the future.
A limitation of this study is that the land value data from the Treasury Department are in a static rather than a dynamic format and must be updated regularly. If the Treasury Department published land values online through an API, future studies would benefit.
The set of factors used in this study could be improved by adding alternative factors, such as the number of transit stations, public facilities, street networks, and walkability, which may play essential roles in predicting land value.
This study confirms that machine learning can produce accurate land value predictions. Furthermore, point-of-interest data from the Google Places API are useful for estimating land value distribution. In future work, machine learning and point-of-interest data from the Google Places API could be applied in other fields of science to produce more information. For example, such data and techniques could be used to predict the direction of urban expansion, identify public utility needs, and predict crime locations.

Author Contributions

Conceptualization, M.W. and K.T.; methodology, M.W.; software, M.W.; validation, M.W.; formal analysis, M.W.; investigation, M.W. and K.T.; data curation, M.W. and K.T.; writing—original draft preparation, M.W.; writing—review and editing, M.W.; visualization, M.W.; supervision, S.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from The Institute of Research and Development, Suan Sunandha Rajabhat University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors thank all organizations that permitted the use of their data, including the Department of Land and the Treasury Department.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pruksanubal, B. Rural-Urban Linkages Pertaining to Rural Trade in Bangkok Mega-Urban Region: A Case Study of Phathumthani Province; Chulalongkorn University: Bangkok, Thailand, 2007.
  2. Ginsburg, N.S.; Koppel, B.; McGee, T.G.; East-West Environment and Policy Institute. The Extended Metropolis: Settlement Transition in Asia; University of Hawaii Press: Honolulu, HI, USA, 1991.
  3. Yoonan, M. Forcasting Model for Bangkok CBD Vacant Land Prices Using MLR Methods; Thammasat University: Bangkok, Thailand, 2018.
  4. Hu, S.; Yang, S.; Li, W.; Zhang, C.; Xu, F. Spatially non-stationary relationships between urban residential land price and impact factors in Wuhan city, China. Appl. Geogr. 2016, 68, 48–56.
  5. Qu, S.; Hu, S.; Li, W.; Zhang, C.; Li, Q.; Wang, H. Temporal variation in the effects of impact factors on residential land prices. Appl. Geogr. 2020, 114, 102124.
  6. Davis, M.A.; Oliner, S.D.; Pinto, E.J.; Bokka, S. Residential land values in the Washington, DC metro area: New insights from big data. Reg. Sci. Urban Econ. 2017, 66, 224–246.
  7. Mendonça, R.; Roebeling, P.; Martins, F.; Fidélis, T.; Teotónio, C.; Alves, H.; Rocha, J. Assessing economic instruments to steer urban residential sprawl, using a hedonic pricing simulation modelling approach. Land Use Policy 2020, 92, 104458.
  8. Yang, S.; Hu, S.; Wang, S.; Zou, L. Effects of rapid urban land expansion on the spatial direction of residential land prices: Evidence from Wuhan, China. Habitat Int. 2020, 101, 102186.
  9. Chen, Y.; Liu, X.; Li, X.; Liu, Y.; Xu, X. Mapping the fine-scale spatial pattern of housing rent in the metropolitan area by using online rental listings and ensemble learning. Appl. Geogr. 2016, 75, 200–212.
  10. Hu, L.; He, S.; Han, Z.; Xiao, H.; Su, S.; Weng, M.; Cai, Z. Monitoring housing rental prices based on social media: An integrated approach of machine-learning algorithms and hedonic modeling to inform equitable housing policies. Land Use Policy 2019, 82, 657–673.
  11. Glumac, B.; Herrera-Gomez, M.; Licheron, J. A hedonic urban land price index. Land Use Policy 2019, 81, 802–812.
  12. Wen, H.; Goodman, A.C. Relationship between urban land price and housing price: Evidence from 21 provincial capitals in China. Habitat Int. 2013, 40, 9–17.
  13. Huang, H.; Yin, L. Creating sustainable urban built environments: An application of hedonic house price models in Wuhan, China. Neth. J. Hous. Environ. Res. 2014, 30, 219–235.
  14. Burge, G. The capitalization effects of school, residential, and commercial impact fees on undeveloped land values. Reg. Sci. Urban Econ. 2014, 44, 1–13.
  15. Wen, H.; Xiao, Y.; Hui, E.C.; Zhang, L. Education quality, accessibility, and housing price: Does spatial heterogeneity exist in education capitalization? Habitat Int. 2018, 78, 68–82.
  16. Cervero, R.; Kang, C.D. Bus rapid transit impacts on land uses and land values in Seoul, Korea. Transp. Policy 2011, 18, 102–116.
  17. Du, X.; Huang, Z. Spatial and temporal effects of urban wetlands on housing prices: Evidence from Hangzhou, China. Land Use Policy 2018, 73, 290–298.
  18. Glaesener, M.-L.; Caruso, G. Neighborhood green and services diversity effects on land prices: Evidence from a multilevel hedonic analysis in Luxembourg. Landsc. Urban Plan. 2015, 143, 100–111.
  19. Kim, H.-S.; Lee, G.-E.; Lee, J.-S.; Choi, Y. Understanding the local impact of urban park plans and park typology on housing price: A case study of the Busan metropolitan region, Korea. Landsc. Urban Plan. 2019, 184, 1–11.
  20. Liu, Q.; Xu, Q.; Zheng, V.W.; Xue, H.; Cao, Z.; Yang, Q. Multi-task learning for cross-platform siRNA efficacy prediction: An in-silico study. BMC Bioinform. 2010, 11, 181.
  21. Schläpfer, F.; Waltert, F.; Segura, L.; Kienast, F. Valuation of landscape amenities: A hedonic pricing analysis of housing rents in urban, suburban and periurban Switzerland. Landsc. Urban Plan. 2015, 141, 24–40.
  22. Higgins, C.D. A 4D spatio-temporal approach to modelling land value uplift from rapid transit in high density and topographically-rich cities. Landsc. Urban Plan. 2019, 185, 68–82.
  23. Hu, S.; Cheng, Q.; Wang, L.; Xu, D. Modeling land price distribution using multifractal IDW interpolation and fractal filtering method. Landsc. Urban Plan. 2013, 110, 25–35.
  24. Chan, T.-H.; Jia, K.; Gao, S.; Lu, J.; Zeng, Z.; Ma, Y. PCANet: A Simple Deep Learning Baseline for Image Classification? IEEE Trans. Image Process. 2015, 24, 5017–5032.
  25. Worachairungreung, M.; Ninsawat, S.; Witayangkurn, A.; Dailey, M. Identification of Road Traffic Injury Risk Prone Area Using Environmental Factors by Machine Learning Classification in Nonthaburi, Thailand. Sustainability 2021, 13, 3907.
  26. Chen, W.; Xie, X.; Wang, J.; Pradhan, B.; Hong, H.; Bui, D.T.; Duan, Z.; Ma, J. A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility. Catena 2017, 151, 147–160.
  27. Li, Q.; Yu, S.; Échevin, D.; Fan, M. Is poverty predictable with machine learning? A study of DHS data from Kyrgyzstan. Socio-Econ. Plan. Sci. 2021, 101195.
  28. Park, B.; Bae, J.K. Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Syst. Appl. 2015, 42, 2928–2934.
  29. Ma, J.; Cheng, J.C.; Jiang, F.; Chen, W.; Zhang, J. Analyzing driving factors of land values in urban scale based on big data and non-linear machine learning techniques. Land Use Policy 2020, 94, 104537.
  30. Ho, W.K.; Tang, B.-S.; Wong, S.W. Predicting property prices with machine learning algorithms. J. Prop. Res. 2021, 38, 48–70.
  31. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: Oxfordshire, UK, 1984.
  32. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  34. Pal, M.; Mather, P.M. An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sens. Environ. 2003, 86, 554–565.
  35. Vapnik, V.; Golowich, S.E.; Smola, A. Support vector method for function approximation, regression estimation and signal processing. In Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference, Denver, CO, USA, 1–6 December 1997; MIT Press: Cambridge, MA, USA, 1998; pp. 281–287.
  36. Aizerman, M.A. Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Autom. Remote Control. 1964, 25, 821–837.
  37. Manliguez, C. Generalized Confusion Matrix for Multiple Classes. 2016. Available online: https://www.researchgate.net/publication/310799885_Generalized_Confusion_Matrix_for_Multiple_Classes (accessed on 11 October 2021).
  38. Nichols, T.R.; Wisner, P.M.; Cripe, G.; Gulabchand, L. Putting the Kappa Statistic to Use. Qual. Assur. J. 2010, 13, 57–61.
  39. Lewis, C.D. Industrial and Business Forecasting Methods: A Practical Guide to Exponential Smoothing and Curve Fitting; Butterworth-Heinemann: Oxford, UK, 1982.
  40. Bae, H.; Chung, I.H. Impact of school quality on house prices and estimation of parental demand for good schools in Korea. KEDI J. Educ. Policy 2013, 10, 43–61.
  41. Wu, R.; Wang, J.; Zhang, D.; Wang, S. Identifying different types of urban land use dynamics using Point-of-interest (POI) and Random Forest algorithm: The case of Huizhou, China. Cities 2021, 114, 103202.
  42. Wu, M.; Pei, T.; Wang, W.; Guo, S.; Song, C.; Chen, J.; Zhou, C. Roles of locational factors in the rise and fall of restaurants: A case study of Beijing with POI data. Cities 2021, 113, 103185.
  43. Pai, P.-F.; Wang, W.-C. Using Machine Learning Models and Actual Transaction Data for Predicting Real Estate Prices. Appl. Sci. 2020, 10, 5832.
  44. Del Giudice, V.; De Paola, P.; Forte, F. Using Genetic Algorithms for Real Estate Appraisals. Buildings 2017, 7, 31.
  45. Plakandaras, V.; Gupta, R.; Gogas, P.; Papadimitriou, T. Forecasting the U.S. real house price index. Econ. Model. 2015, 45, 259–267.
  46. Antipov, E.A.; Pokryshevskaya, E.B. Mass appraisal of residential apartments: An application of Random forest for valuation and a CART-based approach for model diagnostics. Expert Syst. Appl. 2012, 39, 1772–1778.
  47. Kuşan, H.; Aytekin, O.; Özdemir, I. The use of fuzzy logic in predicting house selling price. Expert Syst. Appl. 2010, 37, 1808–1813.
Figure 1. The study area selected 45,032 land parcels from all over the Talingchan district as the experimental data.
Figure 2. Kernel density estimation of independent variables.
Figure 3. Kernel density estimation of independent variables.
Figure 4. The study area consisting of 45,032 land parcels selected from all over the Talingchan district as the experimental data.
Figure 5. Modeling Framework.
Figure 6. The multi-class confusion matrix.
Figure 7. Sensitivity values distribution of all models.
Figure 8. Specificity values distribution of all models.
Figure 9. The positive prediction values distribution of all models.
Figure 10. The negative prediction values distribution of all models.
Figure 11. A study on the estimation land value using points-of-interest data and the machine learning classification algorithms for the Talingchan district, Bangkok.
Figure 12. The factor importance is measured using random forest.
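Table 1. Independent variables.
| Point of Interest | Frequency | Percentage | Data Attribute | Data Source |
|---|---|---|---|---|
| School | 175 | 7.22 | Points | Google Places API |
| Fashion Store | 41 | 1.69 | | |
| Pharmacy | 20 | 0.82 | | |
| Restaurant | 116 | 4.78 | | |
| Hawker stall | 599 | 24.70 | | |
| Convenience store | 76 | 3.13 | | |
| Automatic Teller Machine | 434 | 17.90 | | |
| Company | 99 | 4.08 | | |
| Market | 151 | 6.23 | | |
| Grocery Store | 714 | 29.44 | | |
| Total | 2425 | 100.00 | | |
Note: The data attribute and data source apply to all rows.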
Table 2. Dependent variable or land value.
| Land Value Level (THB/wah²) | Frequency | Percentage | Data Attribute | Data Source |
|---|---|---|---|---|
| 0–15,000 | 1148 | 2.55 | Polygons | Land parcel from the Department of Land; parcel value level from the Treasury Department |
| 15,001–18,000 | 886 | 1.97 | | |
| 18,001–20,000 | 3334 | 7.40 | | |
| 20,001–25,000 | 3596 | 7.99 | | |
| 25,001–32,000 | 3011 | 6.69 | | |
| 32,001–40,000 | 22,973 | 51.01 | | |
| 40,001–45,000 | 2091 | 4.64 | | |
| 45,001–50,000 | 993 | 2.21 | | |
| 50,001–80,000 | 3393 | 7.53 | | |
| 80,001–130,000 | 3607 | 8.01 | | |
| Total | 45,032 | 100.00 | | |
Note: THB/wah² equals THB per 4 m². The data attribute and data source apply to all rows.
Table 3. The synthetic minority oversampling technique.
| Class | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| Land value range (THB/wah²) | 0–15,000 | 15,001–18,000 | 18,001–20,000 | 20,001–25,000 | 25,001–32,000 | 32,001–40,000 | 40,001–45,000 | 45,001–50,000 | 50,001–80,000 | 80,001–130,000 |
| Training | 833 | 529 | 2166 | 2409 | 2253 | 16,084 | 1416 | 584 | 2686 | 2566 |
| SMOTE | 3153 | 3153 | 3153 | 3152 | 3152 | 3153 | 3152 | 3153 | 3153 | 3152 |
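The oversampling step behind Table 3 can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is not the authors' implementation. The function name `smote`, its parameters, and the two toy features are assumptions for illustration. Table 3 also shows the majority class (F) dropping from 16,084 to 3153 samples, which implies that it was reduced as well; the sketch covers only the oversampling half.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Neighbours of base sorted by squared Euclidean distance, excluding base itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class with two POI-density features (toy values, not from the study).
minority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.15, 0.25)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by that class rather than duplicating existing parcels.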
Table 4. The scale of Kappa value interpretation.
| Kappa | Interpretation |
|---|---|
| <0% | No agreement |
| 0.01–20% | Slight |
| 21–40% | Fair |
| 41–60% | Moderate |
| 61–80% | Substantial |
| 81–100% | Perfect |
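As a small sketch, the scale in Table 4 can be applied to the Kappa values reported in the notes of Tables 5 to 7. The function name `interpret_kappa` is hypothetical. Two boundary choices are assumptions, because the table does not cover them: a value of exactly 0% is treated as no agreement, and a value falling between two bands (for example 20.5%) is assigned to the higher band.

```python
def interpret_kappa(kappa_percent):
    """Map a Kappa value in percent to the agreement label used in Table 4."""
    if kappa_percent <= 0:
        return "No agreement"  # assumption: exactly 0% is grouped with negative values
    bands = [(20, "Slight"), (40, "Fair"), (60, "Moderate"), (80, "Substantial"), (100, "Perfect")]
    for upper, label in bands:
        if kappa_percent <= upper:
            return label
    raise ValueError("Kappa cannot exceed 100%")


# Kappa values as reported in the notes of Tables 5-7.
reported = [("CART", 54.3), ("RF", 95.66), ("SVM", 86.87)]
labels = {name: interpret_kappa(k) for name, k in reported}
```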
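Table 5. CART.
| Predicted (rows) / Actual (columns), THB/wah² | 0–15,000 | 15,001–18,000 | 18,001–20,000 | 20,001–25,000 | 25,001–32,000 | 32,001–40,000 | 40,001–45,000 | 45,001–50,000 | 50,001–80,000 | 80,001–130,000 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0–15,000 | 322 | 2 | 0 | 0 | 0 | 42 | 0 | 0 | 12 | 17 |
| 15,001–18,000 | 6 | 205 | 40 | 88 | 42 | 428 | 0 | 0 | 59 | 91 |
| 18,001–20,000 | 0 | 0 | 563 | 0 | 44 | 550 | 5 | 2 | 178 | 12 |
| 20,001–25,000 | 0 | 9 | 64 | 926 | 117 | 794 | 0 | 0 | 25 | 53 |
| 25,001–32,000 | 0 | 8 | 0 | 7 | 733 | 15 | 0 | 0 | 5 | 123 |
| 32,001–40,000 | 0 | 0 | 40 | 0 | 11 | 3489 | 12 | 13 | 58 | 0 |
| 40,001–45,000 | 0 | 0 | 119 | 0 | 1 | 585 | 589 | 0 | 52 | 0 |
| 45,001–50,000 | 0 | 0 | 0 | 0 | 0 | 428 | 0 | 235 | 84 | 0 |
| 50,001–80,000 | 18 | 0 | 86 | 11 | 0 | 181 | 0 | 0 | 654 | 69 |
| 80,001–130,000 | 11 | 2 | 16 | 0 | 17 | 381 | 0 | 0 | 23 | 734 |
Note: THB/wah² equals THB per 4 m². Accuracy: 62.56%. p-Value [Acc > NIR]: <2.2 × 10⁻¹⁶. Kappa: 54.3%.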
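Table 6. Random Forest.
| Predicted (rows) / Actual (columns), THB/wah² | 0–15,000 | 15,001–18,000 | 18,001–20,000 | 20,001–25,000 | 25,001–32,000 | 32,001–40,000 | 40,001–45,000 | 45,001–50,000 | 50,001–80,000 | 80,001–130,000 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0–15,000 | 354 | 0 | 0 | 0 | 0 | 17 | 0 | 0 | 0 | 9 |
| 15,001–18,000 | 0 | 221 | 0 | 1 | 14 | 6 | 0 | 0 | 0 | 17 |
| 18,001–20,000 | 0 | 0 | 916 | 0 | 0 | 83 | 6 | 0 | 10 | 0 |
| 20,001–25,000 | 0 | 0 | 0 | 1029 | 1 | 41 | 0 | 0 | 1 | 0 |
| 25,001–32,000 | 0 | 2 | 0 | 0 | 929 | 12 | 0 | 0 | 1 | 15 |
| 32,001–40,000 | 1 | 0 | 2 | 1 | 0 | 6619 | 2 | 1 | 4 | 1 |
| 40,001–45,000 | 0 | 0 | 2 | 0 | 0 | 29 | 598 | 0 | 0 | 0 |
| 45,001–50,000 | 0 | 0 | 1 | 0 | 0 | 23 | 0 | 248 | 12 | 0 |
| 50,001–80,000 | 0 | 0 | 7 | 1 | 1 | 42 | 0 | 1 | 1120 | 4 |
| 80,001–130,000 | 2 | 3 | 0 | 0 | 20 | 21 | 0 | 0 | 2 | 1053 |
Note: THB/wah² equals THB per 4 m². Accuracy: 96.9%. p-Value [Acc > NIR]: <2.2 × 10⁻¹⁶. Kappa: 95.66%.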
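The accuracy and Kappa in the note to Table 6 can be recomputed from the confusion matrix. The sketch below uses the standard definitions of overall accuracy and Cohen's Kappa; the function name `accuracy_and_kappa` is hypothetical, and the cell boundaries of the matrix are read from the table as extracted, so they should be treated as an assumption.

```python
# Table 6 (random forest): rows are predicted classes, columns are actual classes.
# Cell boundaries are read from the extracted table and are therefore an assumption.
RF_MATRIX = [
    [354, 0, 0, 0, 0, 17, 0, 0, 0, 9],
    [0, 221, 0, 1, 14, 6, 0, 0, 0, 17],
    [0, 0, 916, 0, 0, 83, 6, 0, 10, 0],
    [0, 0, 0, 1029, 1, 41, 0, 0, 1, 0],
    [0, 2, 0, 0, 929, 12, 0, 0, 1, 15],
    [1, 0, 2, 1, 0, 6619, 2, 1, 4, 1],
    [0, 0, 2, 0, 0, 29, 598, 0, 0, 0],
    [0, 0, 1, 0, 0, 23, 0, 248, 12, 0],
    [0, 0, 7, 1, 1, 42, 0, 1, 1120, 4],
    [2, 3, 0, 0, 20, 21, 0, 0, 2, 1053],
]


def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's Kappa from a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return observed, (observed - expected) / (1 - expected)


accuracy, kappa = accuracy_and_kappa(RF_MATRIX)
print(f"accuracy = {accuracy:.4f}, kappa = {kappa:.4f}")
```

With the cells as listed, the matrix contains 13,506 test parcels, and the computed values agree with the reported accuracy of 96.9% and Kappa of 95.66% after rounding.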
Table 7. Support Vector Machine.
| Predicted (rows) / Actual (columns), THB/wah² | 0–15,000 | 15,001–18,000 | 18,001–20,000 | 20,001–25,000 | 25,001–32,000 | 32,001–40,000 | 40,001–45,000 | 45,001–50,000 | 50,001–80,000 | 80,001–130,000 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0–15,000 | 353 | 0 | 0 | 0 | 0 | 60 | 0 | 0 | 0 | 21 |
| 15,001–18,000 | 0 | 223 | 0 | 1 | 48 | 34 | 0 | 0 | 0 | 35 |
| 18,001–20,000 | 0 | 0 | 876 | 0 | 0 | 268 | 1 | 0 | 17 | 0 |
| 20,001–25,000 | 0 | 3 | 0 | 1029 | 0 | 92 | 0 | 0 | 7 | 0 |
| 25,001–32,000 | 0 | 0 | 0 | 1 | 886 | 8 | 0 | 0 | 5 | 27 |
| 32,001–40,000 | 0 | 0 | 2 | 0 | 0 | 5887 | 1 | 0 | 3 | 2 |
| 40,001–45,000 | 0 | 0 | 22 | 0 | 0 | 70 | 604 | 0 | 0 | 0 |
| 45,001–50,000 | 0 | 0 | 14 | 0 | 0 | 337 | 0 | 250 | 35 | 0 |
| 50,001–80,000 | 0 | 0 | 14 | 1 | 0 | 91 | 0 | 0 | 1078 | 9 |
| 80,001–130,000 | 4 | 0 | 0 | 0 | 31 | 46 | 0 | 0 | 5 | 1005 |
Note: THB/wah² equals THB per 4 m². Accuracy: 90.26%. p-Value [Acc > NIR]: <2.2 × 10⁻¹⁶. Kappa: 86.87%.
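Table 8. Model development in the testing area.
| Metric | Model | 0–15,000 | 15,001–18,000 | 18,001–20,000 | 20,001–25,000 | 25,001–32,000 | 32,001–40,000 | 40,001–45,000 | 45,001–50,000 | 50,001–80,000 | 80,001–130,000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sensitivity | CART | 0.902 | 0.907 | 0.607 | 0.897 | 0.760 | 0.506 | 0.972 | 0.940 | 0.569 | 0.668 |
| Sensitivity | RF | 0.992 | 0.978 | 0.987 | 0.997 | 0.963 | 0.960 | 0.987 | 0.992 | 0.974 | 0.958 |
| Sensitivity | SVM | 0.989 | 0.987 | 0.944 | 0.997 | 0.918 | 0.854 | 0.997 | 1.000 | 0.937 | 0.914 |
| Specificity | CART | 0.994 | 0.943 | 0.937 | 0.915 | 0.987 | 0.980 | 0.941 | 0.961 | 0.970 | 0.964 |
| Specificity | RF | 0.998 | 0.997 | 0.992 | 0.997 | 0.998 | 0.998 | 0.998 | 0.997 | 0.995 | 0.996 |
| Specificity | SVM | 0.994 | 0.991 | 0.977 | 0.992 | 0.997 | 0.999 | 0.993 | 0.971 | 0.991 | 0.993 |
| Pos Pred Value | CART | 0.815 | 0.214 | 0.416 | 0.466 | 0.823 | 0.963 | 0.438 | 0.315 | 0.642 | 0.620 |
| Pos Pred Value | RF | 0.932 | 0.853 | 0.902 | 0.960 | 0.969 | 0.998 | 0.951 | 0.873 | 0.952 | 0.956 |
| Pos Pred Value | SVM | 0.813 | 0.654 | 0.754 | 0.910 | 0.956 | 0.999 | 0.868 | 0.393 | 0.904 | 0.921 |
| Neg Pred Value | CART | 0.997 | 0.998 | 0.970 | 0.991 | 0.982 | 0.656 | 0.999 | 0.999 | 0.960 | 0.970 |
| Neg Pred Value | RF | 1.000 | 1.000 | 0.999 | 1.000 | 0.997 | 0.960 | 0.999 | 1.000 | 0.998 | 0.996 |
| Neg Pred Value | SVM | 1.000 | 1.000 | 0.996 | 1.000 | 0.994 | 0.868 | 1.000 | 1.000 | 0.994 | 0.992 |
Note: Column headings are land value ranges in THB/wah²; THB/wah² equals THB per 4 m².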
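The per-class values in Table 8 follow from treating each land value range as a one-vs-rest problem on the confusion matrix. A minimal sketch under the standard definitions is shown below; the function name `one_vs_rest_metrics` and the 3-class matrix are hypothetical and only illustrate the calculation.

```python
def one_vs_rest_metrics(matrix, k):
    """Sensitivity, specificity, PPV and NPV for class k.

    Rows are predicted classes and columns are actual classes.
    """
    n = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # predicted k, actually another class
    fn = sum(row[k] for row in matrix) - tp   # actually k, predicted another class
    tn = n - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical 3-class confusion matrix (toy counts, not from the study).
toy = [
    [8, 1, 1],
    [2, 7, 0],
    [0, 2, 9],
]
metrics = one_vs_rest_metrics(toy, 0)
```

For a dominant class such as 32,001–40,000, many misclassified parcels from that class land in the rows of smaller classes, which is why those smaller classes can show high sensitivity but a low positive predictive value.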
Table 9. Average MAPE value of machine learning models.
| Models | CART | RF | SVM |
|---|---|---|---|
| MAPE (%) | 20.39105 | 1.906275 | 5.242138 |
Table 10. A list of MAPE values of previous studies and this study.
| Prediction Models | MAPE (%) |
|---|---|
| Del Giudice et al. [44] | 10.62 |
| Plakandaras et al. [45] | 2.15 |
| Antipov and Pokryshevskaya [46] | 13.95 |
| Kuşan et al. [47] | 3.65 |
| Pai and Wang [43] models without attribute selection | 1.676 |
| Pai and Wang [43] models with attribute selection | 0.228 |
| RF * | 1.906 |
Note: * The random forest model in this work.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.