Peer-Review Record

Application of Explainable Artificial Intelligence (XAI) in Urban Growth Modeling: A Case Study of Seoul Metropolitan Area, Korea

by Minjun Kim 1, Dongbeom Kim 2, Daeyong Jin 3 and Geunhan Kim 1,*
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 31 December 2022 / Revised: 1 February 2023 / Accepted: 2 February 2023 / Published: 6 February 2023

Round 1

Reviewer 1 Report

This paper discusses the application of artificial intelligence to urban growth modeling in the context of Korea. The topic itself is interesting and brings new perspectives to the academic community. The research idea can be extended to other, similar settings. I have the following feedback to address before the paper can be accepted for publication.

Change the name to "machine learning" instead of "artificial intelligence".

Describe the "study area" in the context of this study, for example, the reasons for the city's expansion (job opportunities, education, a better life, a lack of opportunity in nearby locations/cities), so that the research method and results can be replicated and used as a reference for similar cities.

Presenting Tables 1 and 2 as a series of maps would make the data easier to understand.

Present the growth models for 2007 and 2019 as maps.

I would like to see a classified map of urban and rural areas for both 2007 and 2019.

Why did you predict for 2031 rather than another year, e.g., 2030, 2032, or even 2050? I am trying to understand the rationale for choosing 2031.

Do you have sufficient data to predict change for 2031? You may not have enough data if you are using only two time points, 2007 and 2019, to predict for 2031. You may want to at least mention the accuracy of the prediction as a limitation in your results or discussion.

Author Response

This paper discusses the application of artificial intelligence to urban growth modeling in the context of Korea. The topic itself is interesting and brings new perspectives to the academic community. The research idea can be extended to other, similar settings. I have the following feedback to address before the paper can be accepted for publication.

Change the name to "machine learning" instead of "artificial intelligence".

  • Thanks for the suggestion. Does your comment mean that we need to change every instance of the term 'artificial intelligence' to 'machine learning', or only the title of this paper? Since our study describes the adaptation and development of an explainable artificial intelligence (XAI) model throughout, we think the term 'artificial intelligence' is more suitable than 'machine learning'. In the next review round, could you explain the reasoning behind the suggested change of term? Thanks again for your insightful review.

Describe the "study area" in the context of this study, for example, the reasons for the city's expansion (job opportunities, education, a better life, a lack of opportunity in nearby locations/cities), so that the research method and results can be replicated and used as a reference for similar cities.

  • Thanks for the comment. The study area has experienced unprecedented urban growth over the last decades due to the construction of several new towns and the corresponding explosive inflow of population. We added this background to the methodology section (pp. 2-3).

Presenting Tables 1 and 2 as a series of maps would make the data easier to understand. & Present the growth models for 2007 and 2019 as maps. & I would like to see a classified map of urban and rural areas for both 2007 and 2019.

  • Thanks for the suggestions. Following these related suggestions, we added maps of the dependent and independent variables in Figure 3 (p. 5) and maps of the urban/rural classification and growth patterns from 2007 to 2019 in Figure 5 (p. 9).

Why did you predict for 2031 rather than another year, e.g., 2030, 2032, or even 2050? I am trying to understand the rationale for choosing 2031.

  • Thanks for the comment. We predicted the urban growth map for 2031 because our urban growth model was trained on the period from 2007 to 2019. While the process of urbanization is not linear, we simply project the tendency of the previous 12 years forward to predict the next 12 years in the study area. We added this explanation to the methodology section (p. 6).

Do you have sufficient data to predict change for 2031? You may not have enough data if you are using only two time points, 2007 and 2019, to predict for 2031. You may want to at least mention the accuracy of the prediction as a limitation in your results or discussion.

  • Thanks for the suggestion. As mentioned above, we simply project the urban growth trend between the two previous time points to predict the urban growth map for 2031. We added this limitation to the conclusion section (p. 12).

Reviewer 2 Report

In its current form, the manuscript cannot be accepted for publication. A serious revision and satisfactory responses to my concerns are required.

Lines 56-60: This part is a bit confusing. You use the term "machine learning" for both white-box and black-box models at the same time. While decision trees are white-box models, they are also ML methods. Please give other examples of black-box models, such as SVM, KNN, ANN, and their derivatives.
 
Lines 56–62: I agree that black-box models are not interpretable and explainable. However, because performance metrics depend entirely on the input data, dimensionality, hyperparameters, and complexity of the feature variables, we cannot declare a champion model for a specific task. Besides that, you may have seen papers claiming that white-box models produce less predictive power, but we cannot generalize this across all ML domains. So, please revise this paragraph by emphasizing only the explainability, not the predictive power.
 
Land Cover Data: Your land cover map has a good resolution. Is it satellite-imagery-based, or is it based on a field survey? Please be more informative for non-Korean scholars. Furthermore, you listed nine types of LULC in Table 1 and showed seven types in Figure 1. However, I see that you inserted 41 types of LULC into the input data frame (as you indicated in Line 125, and as can be seen in the SHAP dot plot). This is a bit complicated.

What is more, in ML-based studies, scholars generally define why and on what basis they select their independent variables. I see that there is no justification for your variable list. You may refer to other studies to back up your selection.

On the other hand, I suggest a multicollinearity test using all input variables. I personally know that bagging- and boosting-based ensemble tree models (like XGBoost) are not susceptible to multicollinearity. However, your input data frame seems huge (41 LULC classes plus others), and I see very close absolute SHAP values, which raises a question of multicollinearity in your data and overtraining in your model. If you don't have enough time to implement a multicollinearity test (which may lead to a large table including all TOL and VIF values), you can alternatively omit the variables producing very low mean SHAP values and then re-train your learners with the filtered data frame.

Line 155: k-fold cross-validation alone cannot tune the hyperparameters. Did you combine it with GridSearchCV?
 
Lines 167–168: The statement that XGBoost is "a decision tree-based algorithm that sequentially combines a number of weak learners to construct a strong learner" is valid for all boosting-based algorithms (LightGBM, AdaBoost, Gradient Boosting Machines, etc.). What makes XGBoost create a more powerful learner? Second-order derivatives? Taylor expansion? Histogram-based decision rules? Please convince the readers of your algorithm selection.
 
Section 2.3.3: Your literature review on XAI is very weak. The use of SHAP analysis is very novel and insightful in urban studies (especially in your case, urban growth prediction). However, readers who are not familiar with XAI will be left in the dark as they read through your article. In its current form, it does not elaborate on the contributions of XAI approaches to GIS and urban studies. Most importantly, it does not express what it means to be explainable or interpretable. You need to delve more deeply into this concept. You may contrast SHAP with feature importance, beta coefficients, LIME, etc. I have listed recent studies on urban issues using SHAP below. Please scan through them and discuss the potential of SHAP:
https://doi.org/10.1016/j.habitatint.2022.102660
https://doi.org/10.1016/j.seta.2022.102888
https://doi.org/10.1016/j.compenvurbsys.2022.101845
https://doi.org/10.1016/j.scitotenv.2022.155070
 
Lines 224–225: It is fine to train different models using different buffer radii (grid sizes). However, I am quite confused because you listed a single hyperparameter dictionary. By changing the buffer radius, you obtain different input data frames, so the models should behave differently and the optimal tuned parameters should change. Please clarify this issue without contradicting the theory.
 
Figure 4: What is your accuracy metric? Please provide its formula. Furthermore, are you sure that only one metric is sufficient for understanding predictive performance? Certainly, your learning curve shows that as the buffer radius increases, the accuracy declines. But you need to support this with other metrics derived from the confusion matrix. Since your task is a binary one, metrics like precision, recall, and F1 score will tell us more about the predictive power.
 
Lines 237-239 and Figure 5(b): Sometimes SHAP dot plots can be misinterpreted. When red dots lie in the positive part of the plot, it means that a high value of that variable contributes positively to the model. Similarly, any blue dot (a lower value of that variable) can also lie in the positive part of the plot and contribute positively. So, please reconsider your sentence.

On the other hand (and this is the bottleneck of your SHAP analysis), your numerical variables (Slope, GRDP, etc.) seem to have a logical SHAP distribution, but your categorical variables (LULC classes) seem strange. For instance, considering that you have already done label encoding, your input data has a column for farm areas containing 0s (yes, there is a farm in this sample) and 1s (no, there is not a farm in this sample). I don't know which one is 0 or 1. But the 1s (red dots) contribute positively to the model, and the 0s (blue dots) remain neutral (accumulating a near-zero Shapley value). Since we cannot resample the LULC categorical data between two integers, I guess your SHAP dot value somehow shows the importance of the existence or non-existence of that LULC class. The trend I mention holds for almost all LULC classes. Therefore, the dot plot may be quite problematic here. Your mean SHAP values (Fig. 5a) enable the reader to understand the strength of each variable's contribution. However, you may need to colorize your bar plot depending on the direction (positive or negative) of that contribution. Please see the colorized bar plot examples in the papers I listed above; the code is available on GitHub and Kaggle.
 
Figure 6 and Line 263: Since you built a binary predictive model (urban pixels or non-urban pixels), how did you calculate the 90% and 95% probabilities rather than 100% and 0%?

Author Response

In its current form, the manuscript cannot be accepted for publication. A serious revision and satisfactory responses to my concerns are required.

Lines 56-60: This part is a bit confusing. You use the term "machine learning" for both white-box and black-box models at the same time. While decision trees are white-box models, they are also ML methods. Please give other examples of black-box models, such as SVM, KNN, ANN, and their derivatives.

  • Thanks for your comment. As you pointed out, the sentence was incorrect, so we corrected the Introduction section following your comment [p. 2, Lines 56-63].

Lines 56–62: I agree that black-box models are not interpretable and explainable. However, because performance metrics depend entirely on the input data, dimensionality, hyperparameters, and complexity of the feature variables, we cannot declare a champion model for a specific task. Besides that, you may have seen papers claiming that white-box models produce less predictive power, but we cannot generalize this across all ML domains. So, please revise this paragraph by emphasizing only the explainability, not the predictive power.

  • Thanks for your suggestion. As you mentioned, black-box models such as SVM (Support Vector Machine) and DNN (Deep Neural Network) do not necessarily guarantee good performance, and many real problems can be solved by simple white-box models with high explanatory power. However, high-dimensional black-box models tend to show better performance empirically, as they can capture complex functional forms. We corrected the Introduction section accordingly [p. 2, Lines 56-63].

Land Cover Data: Your land cover map has a good resolution. Is it satellite-imagery-based, or is it based on a field survey? Please be more informative for non-Korean scholars. Furthermore, you listed nine types of LULC in Table 1 and showed seven types in Figure 1. However, I see that you inserted 41 types of LULC into the input data frame (as you indicated in Line 125, and as can be seen in the SHAP dot plot). This is a bit complicated.

  • Thanks for the comments. In fact, we utilized a middle-classification land-cover map, which consists of 22 LULC types. We revised all statements from 41 to 22 LULC types and added all LULC types to the tables and figures.

What is more, in ML-based studies, scholars generally define why and on what basis they select their independent variables. I see that there is no justification for your variable list. You may refer to other studies to back up your selection.

  • Thanks for the suggestion. We added reference studies to support our selection of independent variables in the methodology section (p. 5).

On the other hand, I suggest a multicollinearity test using all input variables. I personally know that bagging- and boosting-based ensemble tree models (like XGBoost) are not susceptible to multicollinearity. However, your input data frame seems huge (41 LULC classes plus others), and I see very close absolute SHAP values, which raises a question of multicollinearity in your data and overtraining in your model. If you don't have enough time to implement a multicollinearity test (which may lead to a large table including all TOL and VIF values), you can alternatively omit the variables producing very low mean SHAP values and then re-train your learners with the filtered data frame.

  • Thanks for the insightful comment. As you mentioned, the independent variables used in the XGBoost model are not susceptible to multicollinearity issues. To check for overfitting, however, we estimated the VIF value of each independent variable (as shown in the table below, with a computation sketch after it) and found no significant multicollinearity issues in general. While a few variables exceed a VIF of 10, we considered that a re-training process was not necessary.

Variable             VIF      Variable              VIF
Residential          1.572    NaturalGrassland      1.262
Industrial           1.161    ArtificialGrassland   1.160
Commercial           1.083    InlandWetland         1.269
Recreational         1.014    CoastalWetland        1.686
Transportation       1.347    NaturalBareland       1.014
PublicFacility       1.160    ArtificialBareland    3.056
RicePaddy            9.411    InlandWater           2.712
Farm                 6.528    OceanWater            1.152
CultivatedFacility   1.201    DEM                   2.552
Orchard              1.555    Eco                   3.357
OtherCultivated      1.777    GRDP                  1.212
BroadleafForest      11.554   Law                   1.472
ConiferousForest     9.224    PopDen                1.130
MixedStandForest     6.675    Slope                 5.294
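For readers who wish to reproduce this check, a minimal sketch of how such VIF values can be computed with statsmodels is given below. The DataFrame `X` and its column names are assumptions for illustration, not the authors' actual code.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance inflation factor for every column of X (VIF > 10 is the usual flag)."""
    return pd.DataFrame({
        "Variable": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

# Hypothetical usage with the 28 independent variables listed above:
# vif = vif_table(X)
# print(vif[vif["VIF"] > 10])   # would flag, e.g., BroadleafForest (11.554)
```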

Line 155: k-fold cross-validation alone cannot tune the hyperparameters. Did you combine it with GridSearchCV?

  • Thanks for the comment. While we did not directly use the GridSearchCV algorithm, we instead adopted the 'PyCaret' package in Python, which technically inherits the basic form of GridSearchCV. This package automatically tunes many hyperparameters when only the number of folds is specified. Since this study focuses on urban growth modeling rather than detailed model specification, we added a brief explanation of the hyperparameter-tuning process to the methods section (p. 5).
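As a reference for readers, the following is a hedged sketch of the kind of PyCaret workflow described above, not the authors' actual script; the input file name and the target column 'urbanized' are assumed.

```python
import pandas as pd
from pycaret.classification import setup, create_model, tune_model

df = pd.read_csv("training_cells.csv")  # hypothetical file: one row per raster cell

# `fold` fixes the number of cross-validation folds; tune_model then searches
# the hyperparameter space automatically, much like GridSearchCV.
setup(data=df, target="urbanized", fold=10, session_id=42)
xgb = create_model("xgboost")   # baseline XGBoost classifier
tuned_xgb = tune_model(xgb)     # cross-validated hyperparameter search
```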

Lines 167–168: The statement that XGBoost is "a decision tree-based algorithm that sequentially combines a number of weak learners to construct a strong learner" is valid for all boosting-based algorithms (LightGBM, AdaBoost, Gradient Boosting Machines, etc.). What makes XGBoost create a more powerful learner? Second-order derivatives? Taylor expansion? Histogram-based decision rules? Please convince the readers of your algorithm selection.

  • Thanks for your comment. Technically, XGBoost has a mechanism to prevent overfitting through regularization and is generally better at preventing overfitting than plain gradient boosting. Also, XGBoost is slower than LightGBM but is usually fast enough and has better performance (accuracy). Furthermore, since the problem is expressed in a linear form via Taylor expansion, it is suitable for parallel processing, and it offers high flexibility in parameter setting. Therefore, many studies utilize the XGBoost model, which strikes a good balance between speed and performance. However, the fundamental reason we chose XGBoost is that it is one of the most commonly used models; it is also one of the popular models that have won many awards, including Kaggle competitions. No matter which of the boosting models above we chose, we would not expect much difference in the results. As the reviewer points out, the technical details of the algorithm are also very important, but it is more important to choose an algorithm widely utilized in many fields, which is why we ended up choosing XGBoost. We corrected the methods section accordingly, partially reflecting your opinion [p. 6, Lines 172-180].
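To make the regularization argument concrete, here is an illustrative XGBoost configuration showing the built-in regularization parameters the response refers to; the values are assumptions, not the authors' tuned settings.

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    reg_lambda=1.0,   # L2 penalty on leaf weights (XGBoost's built-in regularization)
    reg_alpha=0.0,    # L1 penalty on leaf weights
    gamma=0.1,        # minimum loss reduction required to make a split
)
# model.fit(X_train, y_train)   # X_train, y_train: hypothetical training data
```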

Section 2.3.3: Your literature review on XAI is very weak. The use of SHAP analysis is very novel and insightful in urban studies (especially in your case, urban growth prediction). However, readers who are not familiar with XAI will be left in the dark as they read through your article. In its current form, it does not elaborate on the contributions of XAI approaches to GIS and urban studies. Most importantly, it does not express what it means to be explainable or interpretable. You need to delve more deeply into this concept. You may contrast SHAP with feature importance, beta coefficients, LIME, etc. I have listed recent studies on urban issues using SHAP below. Please scan through them and discuss the potential of SHAP:
https://doi.org/10.1016/j.habitatint.2022.102660
https://doi.org/10.1016/j.seta.2022.102888
https://doi.org/10.1016/j.compenvurbsys.2022.101845
https://doi.org/10.1016/j.scitotenv.2022.155070

  • Thanks for the suggestion. As you suggested, we added a justification for adopting SHAP analysis to the study. By comparing LIME and SHAP, which are among the most widely utilized XAI techniques, we emphasized that the independent variables we adopted are more suitable for SHAP analysis (p. 7).
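For unfamiliar readers, a minimal sketch of a SHAP analysis for a tree ensemble follows; the fitted model `model` and feature matrix `X` are assumed names, not the authors' code.

```python
import shap

explainer = shap.TreeExplainer(model)   # exact TreeSHAP for tree ensembles like XGBoost
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)                   # dot plot, as in Figure 5(b)
shap.summary_plot(shap_values, X, plot_type="bar")  # mean |SHAP| bars, as in Figure 5(a)
```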

Lines 224–225: It is fine to train different models using different buffer radii (grid sizes). However, I am quite confused because you listed a single hyperparameter dictionary. By changing the buffer radius, you obtain different input data frames, so the models should behave differently and the optimal tuned parameters should change. Please clarify this issue without contradicting the theory.

  • Thanks for the insightful comment. Since we developed several XGBoost models with different buffer radii, the optimal hyperparameters also differed for each model. However, to keep the study's results simple and clear, we described only the hyperparameter set for the model with a 50 m buffer radius rather than all hyperparameter sets. We highlighted this explanation in the results section (p. 8).
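A hedged sketch of the per-radius experiment described above: one tuning run per buffer radius, so each radius can yield different optimal hyperparameters. The helper `build_features`, the radii, and the parameter grid are hypothetical.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1]}  # assumed grid

best_params = {}
for radius_m in [50, 100, 200]:                  # assumed candidate buffer radii
    X, y = build_features(radius_m)              # hypothetical: features aggregated within the buffer
    search = GridSearchCV(XGBClassifier(), param_grid, cv=10)
    search.fit(X, y)
    best_params[radius_m] = search.best_params_  # generally differs per radius
```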

Figure 4: What is your accuracy metric? Please provide its formula. Furthermore, are you sure that only one metric is sufficient for understanding predictive performance? Certainly, your learning curve shows that as the buffer radius increases, the accuracy declines. But you need to support this with other metrics derived from the confusion matrix. Since your task is a binary one, metrics like precision, recall, and F1 score will tell us more about the predictive power.

  • Thanks for the suggestions. First, the accuracy metric we originally used was the ratio of correct predictions: (N of predicted urban & actual urban + N of predicted non-urban & actual non-urban) / (total N), i.e., (TP + TN) / N. As you suggested, we also added other accuracy metrics (precision, recall, and F1 score) to test the sensitivity of the XGBoost model. All accuracy measures showed similar trends as the buffer radius varied. We added these explanations and revised the figure in the results section (pp. 8-9).
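The four metrics can be computed as below; the toy label arrays are stand-ins for the actual urban/non-urban labels and predictions, which are not reproduced here.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # toy stand-in: actual labels (1 = urban)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # toy stand-in: predicted labels

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / N, the Figure 4 metric
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```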

Lines 237-239 and Figure 5(b): Sometimes SHAP dot plots can be misinterpreted. When red dots lie in the positive part of the plot, it means that a high value of that variable contributes positively to the model. Similarly, any blue dot (a lower value of that variable) can also lie in the positive part of the plot and contribute positively. So, please reconsider your sentence.

  • Thanks for the suggestion. We revised the sentences explaining the SHAP dot plots after an additional review of the SHAP values (p. 11).

On the other hand (and this is the bottleneck of your SHAP analysis), your numerical variables (Slope, GRDP, etc.) seem to have a logical SHAP distribution, but your categorical variables (LULC classes) seem strange. For instance, considering that you have already done label encoding, your input data has a column for farm areas containing 0s (yes, there is a farm in this sample) and 1s (no, there is not a farm in this sample). I don't know which one is 0 or 1. But the 1s (red dots) contribute positively to the model, and the 0s (blue dots) remain neutral (accumulating a near-zero Shapley value). Since we cannot resample the LULC categorical data between two integers, I guess your SHAP dot value somehow shows the importance of the existence or non-existence of that LULC class. The trend I mention holds for almost all LULC classes. Therefore, the dot plot may be quite problematic here. Your mean SHAP values (Fig. 5a) enable the reader to understand the strength of each variable's contribution. However, you may need to colorize your bar plot depending on the direction (positive or negative) of that contribution. Please see the colorized bar plot examples in the papers I listed above; the code is available on GitHub and Kaggle.

  • Thanks for the insightful comment. In this study, the only categorical variable used in the XGBoost model was the ECVAM value. The ECVAM values indicate the level of environmental preservation value and are thus not dummy variables. The LULC classes you mentioned were adopted as continuous numerical variables, calculated as the total area of each class within a certain buffer radius from each raster cell.

Figure 6 and Line 263: Since you built a binary predictive model (urban pixels or non-urban pixels), how did you calculate the 90% and 95% probabilities rather than 100% and 0%?

  • Thanks for the question. As you mentioned, the dependent variable of our urban growth model is whether a given raster cell was urbanized between 2007 and 2019. In this kind of binary decision-tree (XGBoost) model, the prediction output is a number between 0 (not urbanized at all) and 1 (100% certain of urbanization). The 90% and 95% probabilities of urban growth indicate raster cells whose urban growth probability exceeds 0.9 and 0.95, respectively. To increase interpretability, we added this explanation to the results section (p. 9).
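In code terms, the thresholding the response describes can be sketched as follows; the probability array is a toy stand-in for the classifier's actual output.

```python
import numpy as np

# proba = model.predict_proba(X)[:, 1]   # P(urbanized) per raster cell, in [0, 1]
proba = np.array([0.12, 0.91, 0.97, 0.40, 0.95])  # toy stand-in values

cells_90 = proba >= 0.90   # raster cells shown in the 90% probability map
cells_95 = proba >= 0.95   # raster cells shown in the 95% probability map
```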

Round 2

Reviewer 2 Report

Dear Editor and Authors,
Thank you for inviting me to evaluate the revised manuscript. In my opinion, the authors have made a significant improvement in the paper and have attempted to address all of my concerns. I think the paper is now ready to go to print. 
