Applying Comparable Sales Method to the Automated Estimation of Real Estate Prices

In this paper, we propose a novel procedure designed to apply the comparable sales method to the automated price estimation of real estate, in particular apartments. Apartments are the most popular residential housing type in Korea. The price of a single apartment is influenced by many factors, making it hard to estimate accurately. Moreover, as an apartment is purchased for living, with a sizable amount of money, it is traded infrequently. Thus, its past transaction price may not be particularly helpful to the estimation after a certain period of time. For these reasons, the up-to-date price of an apartment is commonly estimated by certified appraisers, who typically rely on the comparable sales method (CSM). CSM requires comparable properties to be identified and used as references in estimating the current price of the property in question. In this research, we develop a procedure to systematically apply this method to the automated estimation of apartment prices and assess its applicability using nine years’ real transaction data from the capital city and the most-populated province in South Korea and multiple scenarios designed to reflect the conditions of low and high fluctuations of housing prices. The results from extensive evaluations show that the proposed approach is superior to the traditional approach of relying on real estate professionals and also to the baseline machine learning approach.


Introduction
Since the subprime mortgage crisis [1], which was caused by over-inflated house prices, the valuation of real estate prices has emerged as a critical economic activity directly tied to national economic health [2]. Traditionally, the valuation of a real estate property has been done by certified appraisers and thus is time-consuming and expensive, yet often producing biased and inconsistent outcomes [3]. Computational approaches to real estate valuation are much more efficient and free from individual human biases. In addition, these automated approaches are likely to become essential for the design and operation of smart cities [4,5].
The difficulties in accurately estimating the value of a property are due to a number of factors including idiosyncratic personal circumstances that influence the transaction price, which are hard to capture systematically. Furthermore, houses are purchased for living, with a sizable amount of money in most cases. Hence, the past selling price of a house may not be particularly helpful in identifying its current value after a certain period of time. To overcome these limitations and difficulties, we propose a new automated procedure developed on the basis of comparable sales method [6,7], which can be applied to any real estate valuation settings, and test its effectiveness in comparison to human expert estimations, involving the most populated areas in Korea using two different scenarios.
Automated estimation of housing prices has been undertaken in a limited scope in recent machine learning studies, mostly focusing on the traditional single housing type or real estate price index (e.g., [8,9]). In this research, we broaden the horizon by focusing on the problem of apartment price prediction. An apartment is a self-contained housing unit, occupying part of a larger building. Apartments are a very common form of residential housing, representing 67.5% of the total housing transactions of South Korea in the 4th quarter of the year 2018 [10]. Accurate estimation of apartment prices has wide-encompassing implications for financial service, tax collection, economic planning, and policy-making, not to mention the selling and buying transactions of the apartments themselves.
To explore the possibility of enhancing the existing valuation approaches, we adopt the intuition of comparable sales method (CSM) and devise new ways comparable sales transactions can be computationally conducted. There can be various criteria for the two real estate properties to be considered similar or compatible. Our approach is based on two assumptions about the characteristics of apartment price volatility, which refers to the fluctuation of apartment prices over time. First, we assume that "apartments located nearby exhibit similar price volatility". Our proposed method based on this assumption regards geographic proximity as a primary consideration for identifying comparable sales transactions, which is further refined with additional considerations of intrinsic characteristics of the apartments to make the similarity computation more meaningful. The second assumption is that "apartments priced similarly in the past show similar price volatility". Our proposed method based on this assumption regards apartment transaction prices as a primary consideration, which is further adjusted by time and real estate price index to make the price comparisons more accurate.
To evaluate the proposed approach, we collected the apartment sales data from the "Transaction Price Open System" [11], provided by the Ministry of Land, Infrastructure, and Transport (MOLIT). We built two prediction models using LightGBM [12] and CatBoost [13] and assessed their performances by computing Mean Absolute Percentage Error (MAPE) and Root Mean Squared Percentage Error (RMSPE), as well as the percentage of low-error predictions. The key contributions of our work can be summarized as follows:
1. We propose a novel computational approach designed to estimate the prices of real estate, in particular those of apartments, on the basis of CSM. To the best of our knowledge, this is the first work that applies CSM to machine learning.
2. The proposed approach, albeit tested in Korea, is easily applicable to other areas, in particular where high-rise residential buildings are closely located, given the popularity of CSM and the general nature of the approach.
3. We conduct experiments using real-world datasets with multiple scenarios, empirically showing the superiority of the proposed approach over the existing methods.
4. A fully working system based on the proposed approach reported in this paper is currently under development for public service by one of the largest banking companies in South Korea.

Housing Price Estimation
One of the widely used methods in real estate valuation is hedonic pricing modeling [14][15][16]. Hedonic pricing modeling is based on the assumption that the price of a good is determined by both the internal characteristics of the item being sold and the environmental factors surrounding the item in an additive fashion. For example, in case of apartments, a hedonic price model can be built on the basis of the intrinsic attributes of the apartment, such as the number of bedrooms and bathrooms, and environmental attributes such as the average of neighborhood annual income. The strength of the hedonic pricing model is the interpretative power of features by avoiding intractability of multicommodity. However, one of the weaknesses is the lower prediction accuracy in comparison to machine learning approaches as it is commonly modeled via regression.
Currently, various machine learning and deep learning algorithms, such as Random Forest [17], Support Vector Machines (SVM) [18], and Long Short-Term Memory (LSTM) [19], have been used to predict real estate prices, showing that the performance of real estate valuation can be improved by extending the predictor set to include mortgage contract rate [8], features extracted from home interiors and exteriors [20], or satellite images [21,22]. While some prior work focused on the estimation of a single property price, others applied the methods to predicting the trend of a real estate price index such as the Zillow Home Value Index (ZHVI) [9]. ZHVI is a well-known index for understanding the volatility of housing prices in a specific region but can be less informative when analyzing the price of a single property. Very little, if any, research effort has been devoted to applying machine learning to the price prediction of apartments or high-rise building units on a massive scale.

Price Estimation Models
Among various ensemble methods, boosting-based methods are designed to reduce bias and variance by utilizing sequential learners. Prior research [19] found a boosting model to be superior with a higher estimation accuracy than a traditional hedonic model (regression model). Further, gradient boosting tree-based (GBT) methods are known to be more accurate than bootstrap aggregating (bagging) methods such as Random Forest for difficult cases [23,24]. However, GBT training takes generally longer because it generates trees sequentially. Recently proposed XGBoost [25], LightGBM [12] and CatBoost [13] implemented with GPU-acceleration are more efficient, successfully overcoming this limitation. The XGBoost and LightGBM algorithms are different in their strategy of growing the trees: the first performs it level-wise and the latter does it leaf-wise. LightGBM shows generally faster and better performance than XGBoost, but it tends to be sensitive to overfitting in case of small datasets. Similar to XGBoost, CatBoost grows trees level-wise. CatBoost can be considered as an improved version of XGBoost as it implements ordered boosting with a fraction of data in random shuffling and provides extra support for categorical data processing such as target encoding, categorical feature combination, and one-hot encoding for low cardinality feature. LightGBM also supports direct processing of categorical data.

Comparable Sales Method Valuation
In the real estate domain, real estate assessors commonly use comparable sales method [6,7] to overcome the lack of the information about the variables influencing house prices. It is also known as Comparative Market Analysis [26]. The basic concept of CSM is to estimate the property value by examining and comparing the prices of similar houses, usually located in high proximity to the property in question. The similarity can be measured by comparing the number of rooms, area, quality of the neighborhood, proximity to schools, etc. The methodology of CSM can be applied to various fields such as tax valuation [27], groundwater valuation [28], and timberland valuation [29]. No study has explicated how CSM can be applied to real estate valuation using machine learning techniques.

Comparable Sales Method
Comparable sales method is a valuation technique to derive property prices based on recently sold similar properties. We note that the method operates with the following two assumptions.
• Nearby Apartment Transactions: The price of the apartment will follow the transaction (sales) prices of neighboring apartments with similar characteristics.
• Similar Price Transactions: Apartments priced similarly in the past will exhibit similar prices in the future.

Comparable Sales Features Based on Nearby Apartment Transactions
In this part we describe the algorithm we used to extract comparable sales features based on the nearby apartment transactions assumption. Using the existing features of each transaction in the sales dataset, we identified most similar transactions in the nearby apartments and added their prices as additional features. Initially, our dataset comprised 50 features, which can be semantically categorized into three groups.
Distance features reflect the external conditions of an apartment and are represented by the Euclidean distances from the center of the apartment complex to the closest public facilities such as subway stations, schools, and parks. The second group consists of intrinsic features describing the apartment complex, such as the total number of households, location coordinates, and heating method. The third group includes intrinsic features related to the apartment itself, such as the number of rooms, the number of bathrooms, and housing type. The algorithm to find comparable properties involves comparing all of these features. To reduce the computational complexity, we first identify n nearby apartment buildings and then select similar transactions from the transactions that occurred in those nearby buildings.

Handling Distance and Intrinsic Features' Similarity
Let A be the set of apartment complexes located inside the district. Suppose we want to predict the price of an apartment in the building $A_i$; then $\bar{A}_i = A \setminus \{A_i\}$ denotes the apartment buildings other than $A_i$. Closer buildings are more likely to share similar price fluctuation characteristics with $A_i$. Therefore, using the location coordinates available in our dataset, we measure the Euclidean distance between $A_i$ and each of the buildings in $\bar{A}_i$:
$$d = \sqrt{(\Delta\lambda)^2 + (\Delta\phi)^2}$$
where $\Delta\lambda$ is the difference in latitude of two apartment buildings and $\Delta\phi$ is the difference in longitude. Then we select the n buildings that are closest to $A_i$. We want to keep n small to make sure that the identified neighboring buildings are not too far from $A_i$. We repeat these steps for every apartment building in A. Algorithm A1 (in Appendix A) describes the procedures involved.
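The neighbor search can be illustrated with a minimal sketch. It is not Algorithm A1 itself; the function name, the dictionary layout, and the coordinates are hypothetical, and the distance is the plain Euclidean distance on latitude and longitude differences described in the text.

```python
import math


def nearest_buildings(coords, target, n):
    """Return the ids of the n complexes closest to `target`.

    coords: dict mapping a complex id to (latitude, longitude).
    The ids and coordinates used below are illustrative only.
    """
    lat0, lon0 = coords[target]
    distances = []
    for cid, (lat, lon) in coords.items():
        if cid == target:
            continue  # the target complex is excluded from its own neighbors
        d = math.hypot(lat - lat0, lon - lon0)  # sqrt(dlat^2 + dlon^2)
        distances.append((d, cid))
    distances.sort()  # closest first
    return [cid for _, cid in distances[:n]]


coords = {
    "A": (37.50, 127.00),
    "B": (37.51, 127.00),
    "C": (37.50, 127.03),
    "D": (37.60, 127.10),
}
print(nearest_buildings(coords, "A", 2))
```

Keeping n small, as the text suggests, limits both the search cost in the later similarity step and the risk of pulling in distant complexes.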

Extracting Prices of Similar Apartments
Here we define similar apartments to be the apartments that share similar intrinsic characteristics. Let $T_a$ be the set of the apartments in the building $A_i$ and $t_j \in T_a$ be the apartment for which we want to predict the price. We denote the n neighboring buildings found for $A_i$ as $B = \{B_k \mid B_k \in A \setminus \{A_i\},\ 1 \le k \le n\}$. For each building in B, we retrieve its apartments $T_b$ for which we know the transaction prices. We then compare the intrinsic features of each apartment in $T_b$ with those of $t_j$, the apartment for which we want to predict the price. To measure the similarity between two apartments, we calculate the cosine similarity of their respective feature vectors:
$$\cos(\theta) = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}$$
where, in this formula only, $A_i$ and $B_i$ denote the i-th feature values of the two apartments' feature vectors. A feature vector for each apartment includes values of the transaction's specific floor, PY (area in square meter), total number of households, highest floor, and transaction date. Finally, we extract the prices of the apartments with the highest cosine similarity and add them as features to our dataset. Algorithms A2 and A3 (in Appendix A) describe the procedures of extracting prices of similar apartments and adding them as features, respectively.
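As a rough illustration of this step (not a reproduction of Algorithms A2 and A3), the sketch below ranks candidate apartments by cosine similarity to the target and returns the prices of the top n. The function names, feature order, and numbers are hypothetical, and the transaction date is omitted from the toy vectors for brevity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar_prices(target_vec, candidates, n):
    """candidates: list of (feature_vector, price) from nearby buildings."""
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(target_vec, c[0]),
        reverse=True,  # most similar first
    )
    return [price for _, price in ranked[:n]]


# Hypothetical vectors: (floor, area, households, highest floor)
target = [5, 84, 500, 20]
candidates = [
    ([15, 59, 1200, 35], 700),
    ([6, 84, 520, 20], 910),
    ([2, 114, 300, 12], 1200),
]
print(most_similar_prices(target, candidates, 2))
```

With raw values, large-magnitude features such as the number of households dominate the cosine. The text does not say whether features were scaled beforehand, so any normalization would be an additional assumption.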

Comparable Sales Features Based on Similar Price Transactions
In this part we describe the algorithm we used to extract comparable sales features based on the similar price transactions assumption. Let A be the set of apartment complexes located in a predefined district. Let $T = \{t_0, t_1, \ldots, t_n\}$ be the set of transactions that occurred in the building $A_i \in A$ over a specific time period. For each transaction $t_j$, we find a previous transaction $t_k$ such that $t_k$ occurred before $t_j$ and the respective areas of the apartment units in $t_j$ and $t_k$ are the same. We refer to $t_j$ as the actual transaction and to $t_k$ as its corresponding previous transaction. The price of the previous transaction is added to the set of features that describe the actual transaction. The predictive power of the previous transaction price feature comes from the fact that the price of the previous transaction is similar to the price of the actual transaction if both transactions happened close in time.
However, as the time gap between the two transactions increases, the price of the previous transaction deviates significantly from the price of the actual transaction, diminishing the effectiveness of the feature. To overcome this limitation, we create a new set of price features that are similar to the price of the actual transaction. The process of constructing these new features consists of four stages: (1) building the candidate transaction set; (2) time-based filtering; (3) price-based filtering; and (4) price adjustment using the KB (Kookmin Bank) Index. If the time difference between the previous transaction and the current transaction is more than m days, we proceed to add the new features; otherwise, we set the new feature values to the previous transaction price. In this way, the new features take effect only when the time gap is substantial (i.e., more than m days).

Building Candidate Transaction Set
Let $S = \{t_0, t_1, \ldots, t_m\}$ be the set of transactions that occurred across all apartment buildings in A, meaning that $T \subseteq S$. We would like to select a subset of candidate transactions from S that are similar in price to the actual transaction $t_j$. However, as the price of the actual transaction is available only during training, it is impossible to select candidate transactions by evaluating price similarity with respect to the actual transaction price during testing. Thus, similarity is measured between the candidate transaction price and the price of the previous transaction $t_k$ of $t_j$. From all transactions in S we select a subset of candidate transactions $C \subseteq S$. The set of candidate transactions does not include the actual transaction $t_j$ or its corresponding previous transaction $t_k$, i.e., $t_j, t_k \notin C$. In addition, candidate transactions comprise only those transactions that occurred before the actual transaction.

Time-Based Filtering
The price of the previous transaction deviates significantly from the price of the current transaction if the previous transaction occurred long before. The goal of time-based filtering is to restrict the candidate transactions from being too far in time from the transaction for which we want to estimate its price. We define a day offset α, which is used to calculate lower and upper time thresholds. The thresholds are calculated based on the date of the aforementioned previous transaction t k .
The candidate transactions in C that occur before the earliest possible date as defined by the lower bound or after the latest possible date as defined by the upper bound are filtered out.
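The threshold formula itself is not reproduced in this text, so the following sketch assumes a symmetric, inclusive window of α days around the date of the previous transaction. That window shape, the function name, and the record layout are assumptions made for illustration.

```python
from datetime import date, timedelta


def time_filter(candidates, prev_date, alpha_days):
    """Keep candidates dated within +/- alpha_days of prev_date.

    ASSUMPTION: symmetric, inclusive bounds around the previous
    transaction date; the exact bounds may differ in the original.
    """
    lower = prev_date - timedelta(days=alpha_days)
    upper = prev_date + timedelta(days=alpha_days)
    return [c for c in candidates if lower <= c["date"] <= upper]


candidates = [
    {"id": "a", "date": date(2018, 1, 11)},
    {"id": "b", "date": date(2018, 2, 11)},
    {"id": "c", "date": date(2018, 4, 1)},
    {"id": "d", "date": date(2018, 4, 21)},
]
kept = time_filter(candidates, date(2018, 3, 1), alpha_days=40)
print([c["id"] for c in kept])
```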

Price-Based Filtering
Up to now, the set of candidate transactions comprises transactions that may significantly vary in price. To restrict the candidate set to include only those transactions with potentially similar prices, we define a price offset β. The lower and upper bounds of price filter are defined based on the price of the previous transaction t k and calculated as shown below.
Those transactions with prices less than the lower bound or greater than the upper bound are removed from the candidate set C. As the new price features deviate from the previous transaction price by at most β, we expect the features to fall closer to the actual transaction price. The next step is to select n transactions from the candidate set and add the corresponding prices to the feature vector of the actual transaction. All transactions in the candidate set are sorted in ascending order of the absolute price difference, which is calculated by subtracting the respective candidate price from the previous transaction price, and the top n transactions are selected.
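The bound formula is likewise missing from this text. The sketch below assumes multiplicative bounds, i.e., the previous price times (1 − β) and (1 + β), which fits a price offset such as 0.05 but is not confirmed by the source. The function name and prices are hypothetical.

```python
def price_filter_top_n(candidate_prices, prev_price, beta, n):
    """Filter by price band, then keep the n prices closest to prev_price.

    ASSUMPTION: the band is [prev_price*(1-beta), prev_price*(1+beta)].
    """
    lower = prev_price * (1 - beta)
    upper = prev_price * (1 + beta)
    kept = [p for p in candidate_prices if lower <= p <= upper]
    # ascending order of absolute difference from the previous price
    kept.sort(key=lambda p: abs(prev_price - p))
    return kept[:n]


prices = [900, 960, 1010, 1049, 1100, 995]
print(price_filter_top_n(prices, prev_price=1000, beta=0.05, n=3))
```

Because the actual price is unknown at test time, the previous transaction price serves as the anchor for both the band and the ranking.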

Price Adjustment Using KB Index
After finishing the steps above, we obtain n comparable sales price features from transactions that occurred within the last m days. These price features need to be adjusted so that their values are comparable to the current transaction price (here, "current" means current in reference to the estimation time). For the adjustment, we used the KB Index from the "Market Price Open System" [30]. The index, which is provided by Kookmin Bank (KB), is a weekly indicator that quantifies the overall state of the real estate market based on the housing price changes in a particular region. Thus, we used the KB Index to calibrate past prices so that they are properly adjusted to the current price level. The adjustment is called momentum, which is computed as follows.
This momentum value is then used to compensate for the differences of similar price features in time. Algorithm A4 (in Appendix A) describes how this procedure is used in creating the CSM features based on similar price transactions.
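Since the momentum formula is not reproduced in this text, the sketch below assumes one common form of index-based adjustment: the ratio of the index at estimation time to the index at the time of the comparable transaction. This ratio form, the function names, and the index values are assumptions, not the authors' confirmed definition.

```python
def momentum(index_then, index_now):
    """ASSUMED form: ratio of the current index to the past index."""
    return index_now / index_then


def adjust_price(past_price, index_then, index_now):
    """Scale a past comparable price to the current price level."""
    return past_price * momentum(index_then, index_now)


# Hypothetical values: the regional index rose from 100.0 to 102.0
print(adjust_price(1000.0, index_then=100.0, index_now=102.0))
```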
The KB Index is measured weekly but is published one or two weeks later. To overcome the time delay associated with this index, we decided to predict the future values of the index using a Long Short-Term Memory (LSTM) [31] model, which is a popular choice for time-series data. The model was implemented using the Keras framework [32]. We trained the LSTM model for 1000 epochs on the index values from the training period, and the model parameters were learned using the Adam optimizer. We predicted the KB Index for the test period with various time windows. Empirically, the best results were found with a window of 3 (3 weeks). Thus, this modeling setting was used to predict the KB Index, which was then used in the subsequent experiment.
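The windowing step that feeds such a model can be sketched without any deep learning library. The function below only prepares (input window, next value) pairs with a window of 3 weeks. The index values are made up, and the Keras model itself is not shown because its architecture is not described here.

```python
def make_windows(series, window=3):
    """Turn a weekly series into (X, y) pairs for one-step-ahead prediction.

    Each X[i] holds `window` consecutive values and y[i] is the value
    that immediately follows them.
    """
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return X, y


weekly_index = [100.0, 100.5, 101.0, 101.2, 101.9]  # hypothetical values
X, y = make_windows(weekly_index, window=3)
print(X)
print(y)
```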

Target Cities and Periods
For the evaluation of the proposed approach, we selected one city, Seoul, and one province, Gyeonggi, in Korea. Seoul is the capital of the Republic of Korea (Korea) and is one of the most populated cities in the world. High population density corresponds to growing demand for residential properties, which in turn increases the volatility of housing prices [33], making the price estimation of residential properties inside the city challenging. Gyeonggi, one of the nine provinces of Korea, is the most populated province in the country. Geographically, Seoul is located inside Gyeonggi province, which covers a larger area but has a lower population density. Table 1 shows the selected area information (size, population, and density) in 2018.
As mentioned earlier, densely populated cities tend to exhibit high volatility in residential property prices. It is important to evaluate the robustness of the model during periods of high fluctuations. Therefore, we selected two time periods that differ in the level of price volatility. In the second half of 2018, prices of Seoul apartments showed a sharp increase [34]. Thus, we divided the year 2018 into two periods: the first is the period of a stable market (Scenario 1) and the second is the period of a rising market (Scenario 2). Scenario 1 uses the first half of 2018 as the test period. Scenario 2 uses the second half of 2018 as the test period. Table 2 shows the exact time frames of the data used for the train and test set in each scenario.
We collected apartment sales data of the selected areas from the "Transaction Price Open System" [11], provided by the Ministry of Land, Infrastructure, and Transport (MOLIT) of Korea. Table 3 shows the number of transaction records in the train and test set for each scenario.

Market Price
We collected apartment market price data from the "Market Price Open System" [30] run by KB, which is the largest bank in Korea.
KB obtains the expected apartment sales prices from its large network of licensed real estate agent partners. For each apartment complex, there are two or more real estate agents who provide their estimations of the apartment prices for each different type of apartments in the complex to the system each week. Based on the inputs of these agents, the system is updated weekly with the hundreds of thousands of records of the expected minimum, maximum, and common (most likely) price estimates for each of the apartment types (differentiated by its size and floor) in each apartment complex across the country. Then, the final estimated price of an apartment is obtained by averaging the input prices from the multiple agents. Even though the apartment is not on the market for sale, this estimated price is used as a reference point for the home mortgage loans and other property-based loans connected to the apartment. Further, KB sells these estimates of apartment properties to other banks in Korea as they also are in need of immediate access to the estimated values for their mortgage loan services.

Apartment Transaction Features
We briefly cover all the apartment transaction features used by the base model below, and summarize them in tables.

Apartment Intrinsic Features and Transaction Prices
These features comprise the apartment's intrinsic characteristics and transaction price, and can be collected from the Transaction Price Open System [11] and the Market Price Open System [30]. Table 4 shows the detailed features and their descriptions. An apartment complex contains several housing types or buildings. Table 5 shows the numbers of districts ("Gu") and neighborhoods ("Dong") in Seoul and Gyeonggi.

Time-Variant Features
The real estate market changes with time. Thus, we include time-variant features. These features are assumed to describe the state of the real estate market at the time of the transaction. According to prior research in real estate [35,36], the transaction volume may be related to the changes in housing price. Thus, we collected monthly transaction volume data for each district from Korea Appraisal Board (KAB) (Available: https://www.r-one.co.kr/rone/resis/statistics/statisticsViewer.do (Date last accessed on 21 June 2020)). Each entry in the transaction volume dataset consists of the transaction volume and the associated month of estimation. We connected the transaction volume data and apartment sales data by matching the month of the transaction with the closest month in the transaction volume dataset. In addition, floating population in a region could be one of the factors influencing housing price [37]. We obtained the data about floating population from Korean Statistical Information Service (KOSIS) (Available: http://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1B26001_A01 (Date last accessed on 21 June 2020)) and joined it with the sales data in a similar manner. Lastly, we included the age of the apartment, represented as the number of months from the date of construction to the date of transaction. Table 6 shows the time-variant features and their corresponding descriptions.
The price of a house is affected by the quality of its neighborhood, including transport accessibility [38] and the availability of certain public amenities such as schools [39], public service facilities [40], and parks [41]. Accordingly, we include the distance features to capture the accessibility of nearby facilities from the apartment in question. With the dataset, it was not possible to pinpoint the exact location of each particular apartment, as that information is not revealed for privacy protection. Thus, the distance features were calculated by measuring the Euclidean distance between the center location coordinates of the apartment complex and the coordinates of the facility to capture the relative difference in distance. Table 7 lists the distance features and their corresponding descriptions.
To consider the time-series characteristics of transactions in determining apartment prices, we added the prices of the two immediately preceding transactions of the apartment to the predictor set. Each transaction record is linked to its corresponding previous transactions if the transactions happened in apartments of the same housing type located in the same complex. Table 8 shows the detailed previous transaction features and their descriptions.
In observance of the related Korean government regulations, the Transaction Price Open System [11] does not disclose the exact dates of the transactions. Instead, each transaction record is mapped to the ten-day period during which the transaction occurred. The system-provided (start date, end date) pairs are (1, 10), (11, 20), and (21, 28/29/30/31). Given the nature of the data, we used the start date (the 1st, 11th, or 21st) as the transaction date in our dataset.
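The ten-day bucketing can be made concrete with a small sketch. The source system already reports only the period, so the helper below (a hypothetical name) simply shows which start date any calendar day would fall under.

```python
from datetime import date


def period_start(d):
    """Map a calendar date to the start date of its ten-day period.

    Days 1-10 map to the 1st, 11-20 to the 11th, and 21 through
    month end to the 21st.
    """
    if d.day <= 10:
        start = 1
    elif d.day <= 20:
        start = 11
    else:
        start = 21
    return d.replace(day=start)


print(period_start(date(2018, 3, 7)))
print(period_start(date(2018, 2, 28)))
```

One consequence of this convention is that a recorded transaction date can precede the true date by up to ten days.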

Report Days for the Real Estate Transaction
In Korea, a real estate transaction must be reported within 60 days of when the transaction occurred, meaning that on any date on which we want to estimate the price of an apartment (estimation date), there may be transactions that have occurred but have not yet been reported. In Figure 1, there are two previous transactions, A and B, but we can only identify previous transaction A on the specific estimation date because previous transaction B has not been reported yet. To properly incorporate this issue into the evaluation of the proposed and compared models, we excluded from the test set those transactions that were reported after the estimation date when considering previous transactions. That is, in our dataset, we did not use previous transaction B in Figure 1 as an instance of a previous transaction, so that the tested models are directly applicable to real-world usage settings, in which transactions that have occurred but have not yet been reported at the time of estimation cannot be utilized as comparables.
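A minimal sketch of this exclusion rule follows. It assumes each record carries both an occurrence date and a report date; the field names `occurred` and `reported` and the dates are hypothetical.

```python
from datetime import date


def visible_previous(transactions, estimation_date):
    """Return transactions already reported by the estimation date.

    A transaction that occurred earlier but was reported later is
    treated as unknown at estimation time.
    """
    visible = [t for t in transactions if t["reported"] <= estimation_date]
    visible.sort(key=lambda t: t["occurred"])  # oldest first
    return visible


transactions = [
    {"id": "A", "occurred": date(2018, 1, 11), "reported": date(2018, 2, 1)},
    {"id": "B", "occurred": date(2018, 3, 1), "reported": date(2018, 4, 20)},
]
seen = visible_previous(transactions, estimation_date=date(2018, 4, 1))
print([t["id"] for t in seen])
```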

Report Days for Time-Variant Features
As mentioned above, we use two time-variant features: transaction volume and floating population. Although both datasets are reported monthly, their reporting dates differ. As a result of the differences in data gathering and preprocessing, the transaction volume data are generally published at the beginning of the month, whereas the floating population data are typically published toward the end of the month. Therefore, we matched each transaction with the floating population data of the previous month.

Date Difference between Transaction and Market Price
As mentioned before, the start date of the ten-day period is treated as the transaction date. This means that every transaction price is recorded on the 1st, 11th, or 21st day of the month. As described earlier, the market price dataset contains estimations of the apartment values made by real estate agents. These estimations are used to assess the accuracy of the expert-based approach in our evaluation. However, the estimation dates in this market price dataset do not necessarily match the sales transaction dates in the apartment sales dataset. Therefore, each transaction in the sales dataset is mapped to the most recent entry in the market price dataset that was recorded earlier than or on the same date as the transaction. Table 9 shows the day differences between the transaction date (in the sales dataset) and the market price estimation date (in the market price dataset) in the two scenarios.
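This mapping amounts to an "as-of" lookup, sketched below with a binary search over sorted estimate dates. The function name and sample data are hypothetical, and picking the most recent qualifying entry is one reading of "earlier or on the same date".

```python
from bisect import bisect_right
from datetime import date


def latest_estimate_on_or_before(estimates, tx_date):
    """estimates: list of (date, price) tuples sorted by date.

    Returns the most recent entry dated on or before tx_date,
    or None when no such entry exists.
    """
    dates = [d for d, _ in estimates]
    i = bisect_right(dates, tx_date)  # count of entries with date <= tx_date
    if i == 0:
        return None
    return estimates[i - 1]


estimates = [
    (date(2018, 3, 5), 900),
    (date(2018, 3, 12), 905),
    (date(2018, 3, 19), 910),
]
match = latest_estimate_on_or_before(estimates, date(2018, 3, 11))
print(match, (date(2018, 3, 11) - match[0]).days)
```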

Methods
To implement the proposed approaches and assess the relative improvements made by the CSM features, we used two gradient tree boosting methods, LightGBM [12] and CatBoost [13], each of which was implemented in Python. Considering that it is difficult to trace how specific features affect the estimation performance using a neural network modeling approach, we did not include a neural network model in this experiment.
For comparison, we used two prices, "Common" and "MinMax" (average of the minimum and the maximum estimates), derived from the market prices reported by the professional real estate agents (two human-expert conditions). In addition, we included the baseline that uses only the features covered in Section 4.1.3. This baseline (Basic condition) represents a regular machine learning approach, exclusive of any CSM features. On top of the baseline, we test the additional effects of the two types of CSM features (features from nearby apartment transactions and features from similar price transactions), separately and jointly. Hence, we evaluate the performance improvement made by each of the two types of CSM features and by the two types in combination, over and above the baseline and the two human-expert estimation conditions (Common and MinMax).

Experiment Settings
For the two boosting methods, we use the default parameter settings with the following exceptions: we set n_estimators to 1000, objective to 'Root Mean Square Error (RMSE)', and subsample to 0.8 to avoid overfitting. For CatBoost, we set bootstrap_type to 'Bernoulli'. We then run the model 10 times for each comparison case, and report the mean value of the 10 trials and the unpaired t-test statistic relative to the baseline (Basic (B)). Additionally, we empirically set the number of comparable sales features n to 5, the day interval m to 100, the day offset α to 40, and the price offset β to 0.05.
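The comparison against the baseline can be sketched as follows. The text only says "unpaired t-test", so the Welch (unequal-variance) form used below is an assumption, and the error values are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Unpaired t statistic with unequal variances (Welch form).

    ASSUMPTION: the source does not state which unpaired variant was used.
    """
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))


# Hypothetical MAPE values from repeated runs of two conditions
baseline_runs = [5.9, 6.1, 6.0, 6.2, 5.8]
csm_runs = [5.1, 5.3, 5.2, 5.0, 5.4]
print(round(welch_t(csm_runs, baseline_runs), 3))
```

A negative statistic here indicates a lower mean error for the CSM condition than for the baseline.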

Experiment Metrics
In our experiment, we used two evaluation criteria, Mean Absolute Percentage Error (MAPE) and Root Mean Squared Percentage Error (RMSPE), because the magnitude of the average error differs across districts.
• MAPE measures the average deviation of the predicted values from the corresponding actual values, in percent. More specifically, it is computed as
\[ \mathrm{MAPE} = \frac{100}{N} \sum_{k=1}^{N} \left| \frac{y_k - \hat{y}_k}{y_k} \right|, \]
where N is the number of apartment transactions, \hat{y}_k is the kth predicted value, and y_k is the corresponding actual value.
• RMSPE measures the standard deviation of the predicted values from the corresponding actual values, in percent. Compared to MAPE, this measure penalizes outliers more heavily. It is computed as
\[ \mathrm{RMSPE} = 100 \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( \frac{y_k - \hat{y}_k}{y_k} \right)^2}, \]
where N is the number of apartment transactions, \hat{y}_k is the kth predicted value, and y_k is the corresponding actual value.
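A minimal Python sketch of the two metrics follows, assuming their standard definitions (mean absolute relative error and root mean squared relative error, both scaled to percent). The function names and the toy values are illustrative only.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    total = sum(abs((y - p) / y) for y, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmspe(actual, predicted):
    """Root Mean Squared Percentage Error, in percent."""
    total = sum(((y - p) / y) ** 2 for y, p in zip(actual, predicted))
    return 100.0 * sqrt(total / len(actual))


# Toy example: one exact prediction and one that is 20% too high.
actual_prices = [100.0, 100.0]
predicted_prices = [100.0, 120.0]
toy_mape = mape(actual_prices, predicted_prices)
toy_rmspe = rmspe(actual_prices, predicted_prices)
```

In the toy example the single large error raises RMSPE above MAPE, which illustrates the heavier penalty on outliers.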
Additionally, for practical reasons, we also report the cumulative percentage of predictions whose error is within 10%, as practitioners in this field commonly use this criterion.
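One plausible reading of this criterion is the share of transactions whose absolute percentage error does not exceed 10%. The sketch below follows that reading; the function name, the inclusive threshold, and the toy values are assumptions rather than details given in the paper.

```python
def share_within(actual, predicted, threshold=0.10):
    """Percentage of predictions whose absolute relative error is <= threshold."""
    hits = sum(1 for y, p in zip(actual, predicted) if abs(y - p) / y <= threshold)
    return 100.0 * hits / len(actual)


# Toy example: relative errors of 5%, 11%, 20%, and 0%.
toy_share = share_within([100.0, 100.0, 100.0, 100.0], [105.0, 89.0, 120.0, 100.0])
```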

Performance Results
The experimental results consistently show that the machine learning models are superior to the human expert estimations. In terms of MAPE (Table 10), RMSPE (Table 11), and cumulative percentage of 10% (Table 12), the machine learning models we developed produce more accurate predictions of apartment transaction prices than the prices estimated by the real estate professionals (i.e., Market Price in Tables 10-12). These professionals work in real estate offices located close to the apartment complex and serve as intermediaries for the sale and purchase transactions of the apartments in the complex. The largest bank in Korea, KB Kookmin Bank, relies on a large number of these real estate agents to produce and update its assessments of real estate properties every week. It shares this information with other major banks across the country, which then use it to determine the maximum amount of the mortgage loan associated with a property. As evidenced by MAPE, RMSPE, and cumulative percentage of 10%, our automated approach is a more efficient and less biased solution.
For both Seoul and Gyeonggi, adding the full set of CSM-derived price features yielded the largest improvement over the baseline, compared with the conditions in which only the nearby apartment transaction features or only the similar price transaction features were added. This confirms that the CSM-based features improve the prediction accuracy of apartment prices over and above the regular machine learning approach, and that the nearby apartment transaction features and the similar price transaction features have independent effects that become greater when combined. In terms of MAPE, the t-test results confirm significant improvements in all of the comparison cases. Similarly, in terms of RMSPE and cumulative percentage, almost all of the comparison cases showed significant improvements according to the t-test results, and slight improvements were still observable in the cases that were not significant. In addition, in terms of RMSPE, the baseline condition mostly produces worse estimates than the market price estimates made by human experts, which indicates that the baseline approach generates estimates with higher variance than the market price estimates. In contrast, adding the two types of CSM features in combination (baseline plus nearby features plus similar price features) decreases variance in every scenario and district, without exception, relative to both the estimates made by human experts and the baseline models.
The two types of CSM-derived price features differ in their impact on the estimation of housing prices. Specifically, the features based on the similar price transaction assumption show greater predictive efficacy than those based on the nearby transaction assumption, probably because similarly priced apartments appeal to similar groups of people and share common features that contribute to the formation of their comparable prices.
Further, between the two boosting techniques, the results show that LightGBM performs better for Gyeonggi and CatBoost performs better for Seoul, consistently across all of the performance metrics used in this study.

Discussion
For future smart cities, computational approaches to the large-scale valuation of real estate properties are likely to be essential for intelligent planning and policy making [4,5]. In this study, we explored the possibility of applying the comparable sales method (CSM) to the automated price estimation of real estate properties. To the best of our knowledge, no previous attempt has been made to apply CSM to the automated estimation of house prices. The study results consistently show that the computational approaches developed in this research are superior to human-based approaches in periods of both low and high price fluctuation. Furthermore, among the computational approaches, the CSM-based approaches, particularly the one involving both nearby-located comparables and similarly-priced comparables, are superior to the traditional machine learning approach.
The objective of this study was to develop a computational method that implements the idea of CSM currently used by real estate experts. In doing so, we have articulated two approaches that help locate effective comparables in the market. One approach is to consider proximity information on nearby apartment transactions. To accurately identify those nearby comparables, we utilized not only distance information but also intrinsic and transaction characteristics. This approach closely resembles the current practice of human assessors. The other approach, in contrast, considers past price information and uses it to identify comparable properties that can serve as references against price fluctuations over time.
The proposed CSM method may also be implemented as an extension of the Ordinary Least Squares (OLS) methodology. In that case, CSM can provide additional perspectives on expanding the variable set and explicating the model's estimation results. However, because an OLS model assumes that the variables are linearly related, its performance is likely to suffer. In general, boosting-based methodologies show significantly higher predictive performance than other conventional methodologies [42][43][44].
In conclusion, in this paper we propose a novel approach inspired by CSM that infers the prices of similar properties and uses them as predictors of apartment transaction prices. To measure the predictive power of the newly proposed features, we collected an apartment sales dataset for the capital city and the most populated province in Korea, and constructed machine learning models with alternative combinations of features. The experiment results show that the proposed CSM-based approaches are superior both to the traditional methods involving human experts and to the regular machine learning approach that uses internal and external factors but excludes the CSM-derived features. Based on the significant results of this research, the proposed method that incorporates the two types of CSM features in combination is currently under development by KB Kookmin Bank for its public service. In our future work, we plan to expand the prediction model by incorporating other types of information, such as news articles and policy changes, and by applying deep learning approaches to the modelling of comparable sales features.

...
for j = 0 to length(neighbors) do
    price ← GetPrice(neighbors_j, T, T_i)
    price_features ← append price
...
if prev_date_interval > m then
    search_start ← prev_date − α
    search_end ← prev_date + α
    lower ← prev_price × (1 − β)
    upper ← prev_price × (1 + β)
    for i = 0 to length(B) do
        transaction ← B_i
        ...
        if price > lower & price < upper then
            is_valid_price ← True
        if is_valid_date & is_valid_price then
            candidates ← append price
    if isEmpty(candidates) then
        for i = 0 to length(n) do
            F ← append previous_price
    else
        candidates ← GetTopNTransactions(candidates, prev_price, n)
        for i = 0 to length(n) do
            F ← append candidates_i × (KB Index_{near date} / KB Index_{near prev_date})
return F
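The similar-price search in the pseudocode fragment can be approximated in Python as follows. This is a hedged sketch rather than the authors' procedure, because several steps are missing from the fragment. The function name, the `index_at` callable standing in for the KB Index lookup, the inclusive date window, the ranking of candidates by closeness to the previous price, and the handling of fewer than n candidates are all assumptions. The check of the previous transaction's age against m is left to the caller.

```python
from datetime import date, timedelta


def similar_price_features(prev_date, prev_price, target_date, transactions,
                           index_at, n=5, alpha=40, beta=0.05):
    """Build up to n similar-price features for one target transaction.

    transactions: iterable of (transaction_date, price) pairs.
    index_at: hypothetical callable returning a price index for a date.
    """
    search_start = prev_date - timedelta(days=alpha)
    search_end = prev_date + timedelta(days=alpha)
    lower = prev_price * (1 - beta)
    upper = prev_price * (1 + beta)

    # Assumed date test: inclusive window around the previous transaction.
    candidates = [price for day, price in transactions
                  if search_start <= day <= search_end and lower < price < upper]

    if not candidates:
        # Fallback shown in the pseudocode: repeat the previous price n times.
        return [prev_price] * n

    # Assumed ranking: candidates closest to the previous price come first.
    candidates.sort(key=lambda price: abs(price - prev_price))
    ratio = index_at(target_date) / index_at(prev_date)
    return [price * ratio for price in candidates[:n]]


# Toy data: a flat index of 100 that rises to 110 from March 2018 onward.
def toy_index(day):
    return 100.0 if day < date(2018, 3, 1) else 110.0


toy_transactions = [(date(2018, 1, 11), 102.0), (date(2018, 1, 21), 110.0),
                    (date(2018, 4, 1), 100.0), (date(2017, 12, 21), 97.0)]
features = similar_price_features(date(2018, 1, 1), 100.0, date(2018, 6, 1),
                                  toy_transactions, toy_index)
```

With the toy data, only the two transactions that fall inside both the date window and the price band are kept, and their prices are scaled by the ratio of the index at the target date to the index at the previous transaction date.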