Simulation of Dynamic Urban Growth with Partial Least Squares Regression-Based Cellular Automata in a GIS Environment

We developed a geographic cellular automata (CA) model based on partial least squares (PLS) regression (termed PLS-CA) to simulate dynamic urban growth in a geographical information systems (GIS) environment. The PLS method extends multiple linear regression models that are used to define the unique factors driving urban growth by eliminating multicollinearity among the candidate drivers. The key factors (the spatial variables) extracted are uncorrelated, resulting in effective transition rules for urban growth modeling. The PLS-CA model was applied to simulate the rapid urban growth of Songjiang District, an outer suburb in the Shanghai Municipality of China from 1992 to 2008. Among the three components acquired by PLS, the first two explained more than 95% of the total variance. The results showed that the PLS-CA simulated pattern of urban growth matched the observed pattern with an overall accuracy of 85.8%, as compared with 83.5% of a logistic-regression-based CA model for the same area. The PLS-CA model is readily applicable to simulations of urban growth in other rapidly urbanizing areas to generate realistic land use patterns and project future scenarios.


Introduction
The cellular automata (CA) method is a discrete dynamic modeling technique that has been widely applied in fields related to spatiotemporal distributions [1–4]. Classical CA formalism has been extended to accommodate the complexity of many systems [5,6]. Geographical information systems (GIS) based CA models have attracted extensive attention because of their ability to simulate urban growth and land use change [7–9], following the pioneering work of Tobler [10].
Over the past two decades, remarkable achievements have been made in geographical CA-based dynamic urban growth and land use change modeling, particularly in rapidly urbanizing areas [11–16]. Substantial progress has also been made in CA methodology, including transition rules retrieval, neighborhood configuration, scale effects, and results assessment [17–20]. One important issue in CA modeling is the quantification of the impacts of the factors that drive urban growth and land use change at both global and local scales. Many approaches have been developed to define CA transition rules and each is aimed at improving the overall accuracy and reducing errors of simulation [21–24]. These approaches vary widely in theoretical assumptions, underlying methodologies, and spatio-temporal resolutions and extents [25]. For example, a CA model based on artificial neural networks (ANN) was developed to calculate land conversion probabilities and model dynamic land use in a GIS environment [21]. This model was used to simulate the multiple land use changes in a rapidly growing area of Guangdong Province, China. A heuristic CA model of urban land use change was proposed based on a simulated annealing (SA) algorithm and was successfully applied to simulate the urban growth in one of Shanghai's outer suburbs [22]. This model was built around a function that minimizes the difference (residual) between observed and simulated land use patterns, resulting in improved locational accuracy when compared to a logistic-regression-based CA model (named logistic-CA). Other heuristic optimization algorithms such as genetic algorithms (GA) and particle swarm optimization (PSO) have been used to optimize CA parameters from logistic regression and calibrate CA models [22,26–28]. A landscape expansion index was incorporated into CA (LEI-CA) to simulate both the adjacent and outlying urban growth of Dongguan City in southern China [15]. This approach demonstrated an improvement when compared to the logistic-CA model in terms of urban simulation accuracy. A random forest based CA model was used to simulate urban growth in Harare Metropolitan Province, Zimbabwe from 1984 to 2013 [24]. This model outperformed CA models based on support vector machine (SVM) and logistic regression in the study area. Markov chain integrated CA (CA-Markov) models are another class of methods developed in the last decade to simulate multiple land use changes [29,30]. The CA-Markov has become increasingly popular in geographical modeling since it was included in IDRISI. Most of these proposed new models perform better than earlier models, substantially advancing CA-based modeling of urban growth/expansion and land use change across the world. Current trends of CA model calibration, such as ANN, SVM, GA, SA and PSO, have become more complex [2,27,29]. Therefore, reconsideration of the statistical approaches is necessary for CA-based urban modeling.
Statistical approaches such as logistic regression and principal components analysis (PCA) are relatively simple and easy to implement using modern software packages. As a classical method, logistic regression has proved to be reliable in CA modeling [11,18,31,32]; however, most of the studies were conducted without consideration of correlation among variables. Moreover, the logistic regression method is incapable of eliminating the negative effects of the multicollinearity among variables [21,28]. By adding an auto-covariate term, logistic regression can be used to reduce the effect of correlation and, hence, increase its predictive accuracy in modeling land use change. A case study of the Paochiao watershed region in Taiwan shows that auto-logistic regression performs better than logistic regression [33]. PCA was used to reduce the effect of multicollinearity among spatial variables and obtain more reasonable CA parameters [34], yielding an improvement in performance when compared to the logistic-CA model. Statisticians have pointed out that the PCA method produces principal components that reflect only the covariance structure among the independent variables [35], and, as a consequence, the extracted components may only weakly explain the variance of the dependent variable in the regression.
The issue of variable multicollinearity, therefore, has continuously pushed researchers to develop more accurate, justifiable, and defensible models for simulating urban growth. Partial least squares (PLS) regression appears to be useful in addressing correlation because it integrates and generalizes features from PCA and multiple regression methods [36,37]. The method offers three advantages: (1) it removes data redundancies and extracts components from highly correlated spatial variables that better represent and explain the dependent variables (land conversion); (2) it avoids the detrimental effects in modeling due to multicollinearity and can regress when the number of observations is less than the number of variables; and (3) it integrates the basic functions of regression models, PCA, and canonical correlation analysis. In summary, PLS searches for the principal components that explain as much as possible of the covariance between the independent and dependent variables. The parameters obtained using the PLS method might then better explain the dependent variables, i.e., the conversion probability of urban growth.
This paper presents a novel CA model based on the PLS approach that we call PLS-CA. This approach was used to derive principal components of the spatial variables for regressing CA parameters. Compared to logistic regression, PLS extracts components that are mutually uncorrelated while still capturing the relationship between the explanatory and response variables. The result is the discovery of important transition rules from a number of driving factors that may be highly correlated. Our PLS-CA model was applied to simulate urban growth in the Songjiang district, an outer suburb of Shanghai Municipality, from 1992 to 2008. For comparison, a logistic-CA model was also applied to simulate the urban growth in the same study area.

Study Area and Data
Songjiang is an outer suburb in the southwest part of Shanghai Municipality that is centered at 121°45′E and 31°00′N. Songjiang has a total area of 598.5 km², 15.5 km² of which is water (Figure 1). Over the past two decades, the urban area of Songjiang has grown rapidly with a significant increase in economic activity and concomitant dramatic land use change. According to the local government census, the total registered population has increased from 498,600 in 1995 to 1,074,200 in 2008. Rapid population growth has resulted in an explosive expansion of the urban area [38]. Such large-scale land use change and rapid urban growth have led to the degradation of the landscape, environment, and ecosystem [39,40].
Two Landsat-5 TM/ETM+ images acquired on 18 July 1992 and 24 March 2008 were collected to derive the changes in patterns in the study area. Other essential ancillary datasets, including 1:5000 administrative, topographic, and transportation maps, were also collected from the local government. A total of 21 ground control points (GCPs) were identified on the remote sensing images using the topographic map as a reference source. A polynomial method was adopted for geometric rectification and the resulting accuracy obtained was 0.34 and 0.28 pixels for 1992 and 2008, respectively. Finally, the areal extent of Songjiang was clipped from the rectified Landsat images using the administrative map as the boundary.
ISPRS Int. J. Geo-Inf. 2016, 5, 243

Input Variables
Nine factors affecting land use change were chosen to model urban growth in Songjiang from 1992 to 2008. These factors were distance-based variables, neighborhood, constraints, and a stochastic factor (Table 1); all are closely related to urban development and land use changes [2,41,42]. We then visualized the spatial variables and constraints in ArcGIS and produced them as input layers for the PLS-CA model (Figure 2).

Topographic data play an important role in generating spatial variables for CA models. As an example, it is sometimes difficult to convert rural land on a steep slope into urban use. As a result, a slope factor should be included in any credible model. However, the Songjiang study area lies on very flat land in the Yangtze River Delta [43], and, therefore, the impact of slope can be omitted in the modeling. Distance-based variables and neighborhood reflect the agglomerative effect of urban development and the attractive power of infrastructure [44]. Spatial variables used in the PLS-CA model can be categorized as positive and negative distances. Positive distances include distances to urban center, town centers, and main roads; these factors are significant "push" forces to urban growth. Conversely, the negative distances, such as distances to agricultural land and green space, yield a "repellent" effect on urban development.
Apart from the aforementioned quantifiable factors, there are still many uncertainties and errors in modeling urban growth, resulting in the departure of actual urban growth from some well-known trajectories. Some of these uncertainties are intangible and can be difficult to identify and/or quantify. To represent these uncertainties, a stochastic factor was introduced into our CA model (Table 1). The real values of these spatial variables were acquired from both remotely sensed imagery and vector maps. The conversion probability (y) was calculated by detecting land use change using the thematic mapper (TM) images from 1992 to 2008.

A Generic CA Model
The global probability of land conversion from non-urban to urban can be calculated as the combined effect of the static probability, neighborhood effect, constraints, and random impact [9,45]. A general form of the global conversion probabilities for $u \times v$ cells (in a lattice) is:
$$P^t_{ij} = P_d \times P^t_{\Omega,ij} \times \mathrm{con}(\cdot) \times \left(1 + (-\ln(Rnd))^{\beta}\right) \tag{1}$$
where $P^t_{ij}$ is the global probability of rural-to-urban conversion for cell $ij$ at time $t$; $P_d$ is the static probability determined by spatial distances [11,34]; $\mathrm{con}(\cdot)$ is a constraint function which returns either 0 or 1 [46]; and $P^t_{\Omega,ij}$ is the neighborhood effect for cell $ij$ at time $t$ within an $l \times l$ neighborhood $\Omega$, calculated by
$$P^t_{\Omega,ij} = \frac{\sum_{l \times l} \mathrm{con}\left(S^t_{ij} = \text{urban}\right)}{l \times l - 1} \tag{2}$$
where $\mathrm{con}(S^t_{ij} = \text{urban})$ returns 1 if the state of the cell is urban and 0 otherwise. The term $1 + (-\ln(Rnd))^{\beta}$ is the stochastic factor [47], where $Rnd$ is a random real number ranging from 0 to 1, and $\beta$ is a parameter ranging from 0 to 10 that adjusts the influence of the stochastic factor.
The global conversion probability, therefore, consists of: (1) the conversion probability based on spatial variables, (2) cell conversion constraints including planning regulation, protected farmland, and water bodies, (3) neighborhood effects, and (4) a stochastic factor. The first component is the observed conversion probability $P_d$ [18,41]:
$$P_d = \frac{1}{1 + \exp\left[-\left(\alpha_0 + \alpha_1 x_1 + \cdots + \alpha_p x_p\right)\right]} \tag{3}$$
where $\alpha_0 + \alpha_1 x_1 + \cdots + \alpha_p x_p$ represents the comprehensive impacts of distance-based variables on cell $ij$; $x_i$ ($i = 1, \ldots, p$) are the distances from the cell $ij$ to key features such as the urban center, town centers, and main roads; and $\alpha_i$ ($i = 0, 1, \ldots, p$) are their corresponding parameters. These distances are also defined as spatial or independent variables in our CA modeling.
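As an illustration of how the four components combine, the following minimal Python sketch evaluates the global conversion probability for a single cell. It assumes the product form described above, a logistic static probability, and a square neighborhood that excludes the central cell; the function names (`static_probability`, `neighborhood_effect`, `stochastic_factor`, `global_probability`) and the treatment of off-grid neighbors as non-urban are illustrative assumptions, not part of the original model code.

```python
import math
import random


def static_probability(distances, alphas):
    """Logistic static probability P_d (assumed form).

    alphas[0] is the intercept; alphas[1:] pair with the normalized distances.
    """
    z = alphas[0] + sum(a * x for a, x in zip(alphas[1:], distances))
    return 1.0 / (1.0 + math.exp(-z))


def neighborhood_effect(grid, i, j, size=3):
    """Share of urban cells (value 1) among the neighbors of cell (i, j).

    Uses a size x size window without the central cell; neighbors that fall
    outside the grid are treated as non-urban (an assumption of this sketch).
    """
    half = size // 2
    urban = 0
    for di in range(-half, half + 1):
        for dj in range(-half, half + 1):
            if di == 0 and dj == 0:
                continue
            r, c = i + di, j + dj
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                urban += grid[r][c]
    return urban / (size * size - 1)


def stochastic_factor(beta, rng=random):
    """Stochastic factor 1 + (-ln(Rnd))^beta with Rnd drawn from (0, 1]."""
    rnd = 1.0 - rng.random()  # avoids log(0)
    return 1.0 + abs(math.log(rnd)) ** beta


def global_probability(p_d, p_omega, allowed, beta, rng=random):
    """Product of static probability, neighborhood effect, constraint, and noise."""
    constraint = 1 if allowed else 0
    return p_d * p_omega * constraint * stochastic_factor(beta, rng)
```

Because the constraint enters as a 0/1 multiplier, a protected cell (for example, basic farmland or water) receives a probability of exactly zero regardless of the other terms.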

The PLS Method
We assume that $y = (y_1, \ldots, y_q)_{n \times q}$ is a set of dependent variables (i.e., the observed rural-to-urban conversion), where $n$ is the sample size and $q$ is the number of dependent variables, and that $x = (x_1, \ldots, x_p)_{n \times p}$ is a set of independent variables, with $p$ as the number of independent variables. We also assume that $E_0 = (E_{01}, \ldots, E_{0p})_{n \times p}$ and $F_0 = (F_{01}, \ldots, F_{0q})_{n \times q}$ are the normalized (mean-centered and variance-scaled) matrix forms of $x$ and $y$, respectively; that $t_1$ is the first principal component vector of $E_0$, i.e., $t_1 = E_0 w_1$, where $w_1$ is the corresponding unit weight vector with $\|w_1\| = 1$; and that $u_1$ is the first principal component vector of $F_0$, i.e., $u_1 = F_0 c_1$, where $c_1$ is the corresponding unit weight vector with $\|c_1\| = 1$. In PLS regression, the goal is to obtain a first pair of vectors $t_1 = E_0 w_1$ and $u_1 = F_0 c_1$, under the conditions $\|w_1\| = 1$ and $\|c_1\| = 1$, that maximizes $t_1^T u_1$. The objective can be re-written as an optimization problem [36,37]:
$$\max_{w_1, c_1} \; \langle E_0 w_1, F_0 c_1 \rangle \quad \text{s.t.} \quad w_1^T w_1 = 1, \; c_1^T c_1 = 1 \tag{4}$$
By applying the Lagrange multiplier method, we obtain the eigenvalue equations that determine the first pair of weight vectors $w_1$ and $c_1$:
$$E_0^T F_0 F_0^T E_0 \, w_1 = \theta_1^2 w_1, \qquad F_0^T E_0 E_0^T F_0 \, c_1 = \theta_1^2 c_1 \tag{5}$$
where $w_1$ and $c_1$ are the unit eigenvectors of the matrices $E_0^T F_0 F_0^T E_0$ and $F_0^T E_0 E_0^T F_0$, respectively, $\theta_1^2$ is the corresponding eigenvalue, and $\theta_1 = w_1^T E_0^T F_0 c_1$. According to Equation (4), $\theta_1$ is supposed to be maximal in the sense of PLS regression.
We compute the first pair of component vectors $t_1 = E_0 w_1$ and $u_1 = F_0 c_1$, and run the regressions of $E_0$ and $F_0$ on $t_1$:
$$E_0 = t_1 p_1^T + E_1, \qquad F_0 = t_1 r_1^T + F_1 \tag{6}$$
where $E_1$ and $F_1$ are the residual matrices, and $p_1$ and $r_1$ are the coefficient vectors given by:
$$p_1 = \frac{E_0^T t_1}{\|t_1\|^2}, \qquad r_1 = \frac{F_0^T t_1}{\|t_1\|^2} \tag{7}$$
Substituting the residual matrices $E_1$ and $F_1$ for $E_0$ and $F_0$ and repeating the above method, we obtain the second component vectors $t_2$ and $u_2$ as:
$$t_2 = E_1 w_2, \qquad u_2 = F_1 c_2 \tag{8}$$
where $w_2$ and $c_2$ are the unit eigenvectors of the matrices $E_1^T F_1 F_1^T E_1$ and $F_1^T E_1 E_1^T F_1$, respectively, corresponding to the maximum eigenvalue $\theta_2^2$. Running the regressions of $E_1$ and $F_1$ on $t_2$, we have:
$$E_1 = t_2 p_2^T + E_2, \qquad F_1 = t_2 r_2^T + F_2 \tag{9}$$
where the coefficient vectors $p_2$ and $r_2$ are calculated from:
$$p_2 = \frac{E_1^T t_2}{\|t_2\|^2}, \qquad r_2 = \frac{F_1^T t_2}{\|t_2\|^2}$$
The procedure is iterated until the residual matrix of $E_0$ becomes a null matrix, and the number $m$ of final components $t_i$ ($i = 1, \ldots, m$) is determined by cross-validation. Therefore, we have the following equations:
$$E_0 = t_1 p_1^T + \cdots + t_m p_m^T + E_m, \qquad F_0 = t_1 r_1^T + \cdots + t_m r_m^T + F_m \tag{10}$$
Since $t_1, \ldots, t_m$ can be represented as linear combinations of the original variables $E_{01}, \ldots, E_{0p}$, $F_0$ in Equation (10) can be recovered as the regression of $y^*_k = F_{0k}$ ($k = 1, \ldots, q$) on $x^*_j = E_{0j}$ ($j = 1, \ldots, p$) as follows:
$$y^*_k = \alpha_{k1} x^*_1 + \cdots + \alpha_{kp} x^*_p + F_{mk} \tag{11}$$
where $\alpha_{k1}, \ldots, \alpha_{kp}$ are the corresponding coefficients and $F_{mk}$ is the $k$th column of the residual matrix $F_m$. Cross-validation checks the contributions of the extracted principal components to determine how well the regression model predicts the data. The cross-validation statistic for the component $t_h$ is:
$$Q_h^2 = 1 - \frac{PRESS_h}{SS_{h-1}} \tag{12}$$
where $PRESS_h$ is the sum of squares of the prediction error with a total of $h$ components ($t_1, \ldots, t_h$), and $SS_{h-1}$ is the sum of squares of the fitting error of $y$ with the first ($h-1$) components ($t_1, \ldots, t_{h-1}$).
If $PRESS_h / SS_{h-1} \leq 0.95^2$, i.e., $Q_h^2 \geq 1 - 0.95^2 = 0.0975$, the marginal contribution of the newly added component $t_h$ is significant and the component is retained; the iteration stops at the first component for which $Q_h^2 < 0.0975$ [36,37].
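The extraction and deflation steps can be sketched in a few lines of Python for the single-response case ($q = 1$), which is the case used later for the conversion probability. With one response, the dominant eigenvector of $E_0^T F_0 F_0^T E_0$ is simply $E_0^T F_0$ scaled to unit length, so no eigen-solver is needed. This is a minimal sketch under that assumption; the function names `standardize`, `pls1_components`, and `is_significant` are hypothetical, and the values of $PRESS_h$ and $SS_{h-1}$ are taken as given rather than computed by an actual cross-validation loop.

```python
import math


def standardize(columns):
    """Mean-center each column and scale it to unit sample variance."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out.append([(v - mean) / sd for v in col])
    return out


def pls1_components(x_columns, y, n_components):
    """Extract PLS score vectors t_h and unit weight vectors w_h for one response.

    x_columns is a list of p columns (each of length n); y has length n.
    """
    E = standardize(x_columns)
    f = standardize([y])[0]
    n, p = len(f), len(E)
    scores, weights = [], []
    for _ in range(n_components):
        # w is proportional to E^T f, scaled to unit length
        w = [sum(e * v for e, v in zip(col, f)) for col in E]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # score vector t = E w
        t = [sum(w[j] * E[j][i] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        # regression coefficients of E and f on t
        p_load = [sum(e * v for e, v in zip(col, t)) / tt for col in E]
        r = sum(a * b for a, b in zip(f, t)) / tt
        # deflation: replace E and f by their residuals
        E = [[E[j][i] - t[i] * p_load[j] for i in range(n)] for j in range(p)]
        f = [f[i] - r * t[i] for i in range(n)]
        scores.append(t)
        weights.append(w)
    return scores, weights


def is_significant(press_h, ss_prev, limit=0.0975):
    """Cross-validation test: Q2_h = 1 - PRESS_h / SS_(h-1) >= limit."""
    return 1.0 - press_h / ss_prev >= limit
```

A useful property to check on any data set is that successive score vectors are orthogonal, which is what makes the extracted components uncorrelated.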

PLS-Based CA Model
Since the conversion probability of each cell in CA is a single real-valued variable, Equation (11) can be re-written as [36,37]:
$$y = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_p x_p \tag{13}$$
where $\alpha_i$ ($i = 0, 1, \ldots, p$) is the $i$th regression estimator. The form of Equation (13) retrieved by the PLS method is similar to that from the PCA method, but the regression estimators $\alpha_i$ obtained from the PLS method contain information about the dependent variable $y$, as shown in Equation (7), while those obtained by PCA do not contain any contribution of the response variable $y$ [34,36,37]. Although data redundancy can be eliminated by PCA, the components it extracts are not informed by the dependent variable and, thus, are less able to explain $y$. PLS is therefore more robust than PCA at explaining the response variable $y$.
Integrating Equations (1), (2) and (13), we derived the global conversion probability in the PLS-CA model, in which the static probability $P_d$ of Equation (1) is determined by the PLS regression estimators of Equation (13). If the calculated global probability $P^t_{ij}$ exceeds a predefined threshold $P_{thd}$ (between 0 and 1), the cell $ij$ at time $t$ will be converted to urban land use at time $t + 1$. Otherwise, it will retain its current state at time $t + 1$ [18,41].
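The threshold rule amounts to a simple per-cell comparison, sketched below in Python. The sketch assumes that urban cells (value 1) persist and that only non-urban cells (value 0) can change state; `ca_step` and `simulate` are hypothetical names, and `probability_fn` stands for any user-supplied function that returns the grid of global probabilities for the current state.

```python
def ca_step(state, probability, threshold):
    """Apply the transition rule once.

    A non-urban cell (0) becomes urban (1) when its global probability
    exceeds the threshold; every other cell keeps its current state.
    """
    return [
        [1 if cell == 1 or p > threshold else 0
         for cell, p in zip(row_state, row_prob)]
        for row_state, row_prob in zip(state, probability)
    ]


def simulate(state, probability_fn, threshold, iterations):
    """Run the rule for a fixed number of iterations.

    probability_fn(state) must return a grid of global probabilities with
    the same shape as state (a placeholder for the model described above).
    """
    for _ in range(iterations):
        state = ca_step(state, probability_fn(state), threshold)
    return state
```

Because the comparison is strict, a cell whose probability equals the threshold exactly is not converted.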

Structure of the PLS-CA Model
The PLS-CA model workflow consists of five steps: raw data collection, data processing, CA rule discovery with PLS, determination of other CA factors, and model implementation and results assessment (Figure 3). Each step of the model plays a distinct role in the modeling as follows:
(1) Raw data collection: Data used in the model include historical raster images such as remotely sensed images, an administrative vector map, a topographic map, and a transportation map.
(2) Data processing: Spatial variables were extracted from raw data using the ArcGIS Spatial Analyst tool. These spatial variables included the distance to the urban center ($D_{urban}$), town centers ($D_{town}$), main roads ($D_{mrd}$), agricultural land ($D_{agri}$), and green space ($D_{gs}$). The five spatial variables were normalized by:
$$D_{norm} = \frac{D_{ori}}{D_{max}}$$
where $D_{max}$ is the maximum value of the spatial variable, $D_{ori}$ is the original distance value from the raw data, and $D_{norm}$ is the normalized value in the range from 0 to 1. Normalization enables a precise interpretation of the geographic meaning of the parameters. For instance, if a cell is situated at the urban center, its normalized $D_{urban}$ value will be 0, whereas if the cell is situated far from the urban center, its normalized $D_{urban}$ will approach 1.
(3) CA rule discovery: This module derives uncorrelated spatial components using PLS. It determines whether each derived component satisfies the cross-validation criterion $Q_h^2 \geq 0.0975$; the components that do are used to define the CA parameters (i.e., weights of spatial variables) from which the land conversion probability $P_d$ can be obtained. The PLS regression was conducted using the "PLSR" package of R-language [48].
(4) Other CA factors: These include non-spatial factors such as neighborhood effect, constraints of basic farmland, and a stochastic factor.
(5) PLS-CA implementation and assessment: This module enables the simulation of the PLS-CA model and incorporates simulation accuracy assessment by generating overall accuracy, producer's accuracy, user's accuracy, Kappa coefficient, and the compared urban growth rate (CUGR). The module also displays and exports simulation outcomes.
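The normalization in step (2) can be illustrated with a short Python sketch. It assumes that each distance is divided by the maximum distance of the same variable, which matches the stated behavior (0 at the feature itself, approaching 1 far away); `normalize_distances` is a hypothetical helper name, and the distances in the usage note are made-up values.

```python
def normalize_distances(values):
    """Scale raw distances by their maximum so that results lie in [0, 1]."""
    d_max = max(values)
    if d_max == 0:
        # every cell coincides with the feature; nothing to scale
        return [0.0 for _ in values]
    return [v / d_max for v in values]
```

For example, raw distances of 0 m, 500 m, and 2000 m to the urban center would map to 0.0, 0.25, and 1.0.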
The simulated area of each category from the CA modeling was not exactly equal to the actual area. Therefore, an indicator termed the compared urban growth rate (CUGR) is calculated to assess the accuracy of the PLS-CA model by comparing the observed and simulated urban growth rates. The CUGR indicator was computed as:
$$CUGR = \frac{S_{sim2008}}{S_{obs2008}} \times 100\%$$
where CUGR compares the observed and simulated areas of each category in terms of growth rate, $S_{sim2008}$ is the simulated area of the urban or non-urban category in 2008, and $S_{obs2008}$ is the statistical area of observed urban growth in 2008 or non-urban loss in 1992, respectively.

Assessment of Correlation
A total of 5000 samples were randomly selected from spatial variables and the classified land use patterns in 1992 and 2008 to determine the CA transition rules. The correlation matrix of spatial variables was calculated using the samples (Table 2), showing significant correlations among these spatial variables. Traditional methods, such as the multi-criteria evaluation (MCE) technique and logistic regression, are not able to avoid the negative effects of multicollinearity and are relatively weak in providing correct weights for the variables. We, therefore, applied PLS to extract the uncorrelated principal components from spatial variables to achieve more reasonable CA transition rules and improve the performance of the CA model.
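A correlation check of this kind can be reproduced with a few lines of Python. The sketch below computes Pearson correlation coefficients between sampled variables; the function names and the tiny sample values in the usage note are illustrative assumptions and are unrelated to the 5000 samples or to Table 2.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(variables):
    """Return a nested dict of pairwise correlations keyed by variable name."""
    names = list(variables)
    return {
        a: {b: pearson(variables[a], variables[b]) for b in names}
        for a in names
    }
```

With made-up samples such as `{"D_urban": [0.1, 0.2, 0.3, 0.4], "D_town": [0.2, 0.4, 0.6, 0.8]}`, the off-diagonal coefficient equals 1, the extreme case of the multicollinearity that PLS is meant to handle.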

CA Transition Rules
Among the three components acquired by PLS regression, only the first two satisfied the cross-validation requirement, yet together they explained more than 95% of the total variance (Table 3). The first component is mainly related to urban center, and the second component is principally related to main roads. For the third component, $Q_h^2$ is less than the critical value and, therefore, it is not a valid component. By comparison, PCA can reduce data redundancy but it extracts exactly five components for the same samples used in this research. The first three PCA components explained 84.268% of the variance, less than the share explained by the first two PLS components. Based on PLS regression, suitable weights for CA models can be easily defined, since the principal components are independent, avoiding the repeated counting that may occur in general MCE [21]. The CA parameters acquired by logistic regression are quite different, with $D_{mrd}$ (1.7590) being very large and $D_{town}$ (0.5846) relatively small (Table 4), as compared with those in the PLS regression. This indicates that the logistic-CA model over-weights $D_{mrd}$ but undervalues $D_{urban}$. In contrast, PLS regression generated CA parameters that more reasonably reflect the actual urban growth in Songjiang. In the PLS approach, the negative weights for the distance factors are $D_{urban}$ (−1.1063), $D_{mrd}$ (−0.8274), and $D_{town}$ (−0.5841), followed by the positive weights of $D_{agri}$ and $D_{gs}$, which reflect the factors that tend to prevent non-urban land from being developed. The land conversion potential was produced as map layers based on the calibrated logistic regression and PLS methods (Figure 4), and varied from 0.38 to 0.72 for logistic-CA and from 0.34 to 0.70 for PLS-CA.

Simulation Results
The PLS-CA model was applied to simulate urban growth of Songjiang from 1992 to 2008 (Figure 5). In the simulation, land use types were generalized as urban, non-urban, and water body.
Before running the model, the best combination of threshold value $P_{thd}$ and the number of iterations was determined for the calibration of the PLS-CA model. The meaning of each iteration should also be defined. As an initial trial, a $P_{thd}$ value of 0.40 was used to test if a non-urban cell can be converted to an urban cell. By running the model, simulated results that approximated the actual urban growth were realized within a certain number of iterations. For the next trial, $P_{thd}$ was increased by 0.02 and a simulation result with the highest overall accuracy was acquired with another number of iterations. $P_{thd}$ increased by 0.02 from 0.40 to 0.80, indicating that there were 21 trials for this model. After comparing the results of all trials with different $P_{thd}$ values, the PLS-CA model generated the highest overall simulation accuracy of 85.8% at $P_{thd}$ = 0.68 and Iteration = 16. By comparison, a logistic-CA model was also calibrated with the best overall simulation accuracy of 83.5% at $P_{thd}$ = 0.66 and Iteration = 16.
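The calibration described above is essentially a grid search over the threshold and the number of iterations. The Python sketch below illustrates the idea under stated assumptions: `run_model` is a hypothetical callback that runs the CA for a given threshold and iteration count and returns the overall accuracy, and the upper bound on iterations is an arbitrary choice of this sketch rather than a value reported here.

```python
def threshold_grid(start=0.40, stop=0.80, step=0.02):
    """Evenly spaced thresholds from start to stop (inclusive)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 2) for k in range(count)]


def calibrate(run_model, thresholds, max_iterations):
    """Return (accuracy, threshold, iterations) with the highest accuracy.

    run_model(threshold, iterations) is a user-supplied function that
    returns the overall accuracy of one simulation run.
    """
    best = None
    for thd in thresholds:
        for iterations in range(1, max_iterations + 1):
            accuracy = run_model(thd, iterations)
            if best is None or accuracy > best[0]:
                best = (accuracy, thd, iterations)
    return best
```

Stepping from 0.40 to 0.80 in increments of 0.02 yields 21 candidate thresholds, consistent with the 21 trials reported above.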
Visual inspection demonstrates a good match between the observed and simulated patterns, for both the logistic-CA and PLS-CA models (Figure 5).Further, all three observed and simulated urban patterns in 2008 show that urban growth of Songjiang occurred around the urban centers and northeastern areas in the late 1990s to early 2000s.

Accuracy Analysis
To quantitatively evaluate the simulation accuracy and the performance of the PLS-CA model, a pixel-by-pixel comparison was used to calculate a confusion matrix about the concordance between the simulated results and the observed pattern [11,21,49,50].The reference land use map illustrating the observed urban growth was the classified result using a supervised minimum distance classifier in ENVI 5.2.The confusion matrix derived from the simulated results and the observed reference map was produced for comparison (Table 5).Kappa coefficients were also calculated to quantify their actual degree of agreement [51,52].
User's accuracy was 72.9% for non-urban areas and 96.8% for urban areas in 2008, while the producer's accuracies for the non-urban and urban categories were 95.2% and 80.7%, respectively (Table 5). The user's accuracy for urban indicates that, of all the observed cells considered as urban in the classified pattern, 96.8% were correctly classified in the simulated pattern, and the probability of urban being mislabeled as non-urban (i.e., a commission error) was 3.2% [53]. For the same urban category in Table 5, of all the urban cells in the simulation result, 80.7% actually correspond to urban in the classified pattern. In other words, the probability of a cell being erroneously labeled as urban (i.e., an omission error) was 19.3%. The overall accuracy shows that 85.8% of all the cells under assessment were correctly categorized in the simulation result. The Kappa coefficient means that the simulation achieved an accuracy that was 70.9% better than what would be expected from the chance assignment of cells to categories.
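The accuracy measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' workflow: the function name `accuracy_report`, the matrix layout (rows for the simulated map, columns for the observed reference map), and the cell counts are all assumptions made for the example and are not taken from Table 5. If the table is arranged the other way round, the row-based and column-based accuracies swap.

```python
def accuracy_report(matrix):
    """Summarise a square confusion matrix given as a list of rows.

    Assumed layout (hypothetical): rows = simulated classes,
    columns = observed (reference) classes.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    diagonal = [matrix[i][i] for i in range(k)]

    # Share of all cells that fall on the diagonal.
    overall = sum(diagonal) / total
    # Correct cells divided by the row total of each class.
    users = [diagonal[i] / row_totals[i] for i in range(k)]
    # Correct cells divided by the column total of each class.
    producers = [diagonal[i] / col_totals[i] for i in range(k)]
    # Agreement expected from chance, based on the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    return {"overall": overall, "users": users,
            "producers": producers, "kappa": kappa}


# Hypothetical counts; class order is [non-urban, urban].
example = [[50, 10],
           [5, 35]]
report = accuracy_report(example)
for name, value in report.items():
    print(name, value)
```

Kappa rescales the overall accuracy by the agreement expected from the marginal totals alone, which is why it is always lower than the overall accuracy when the chance agreement is positive.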

Discussion
The detailed simulation accuracies for Songjiang in 2008 were calculated for the logistic-CA and PLS-CA models (Figure 6). The user's accuracy for non-urban generated by the PLS-CA model was 3.5 percentage points greater than that of the logistic-CA model, and the producer's accuracy for non-urban from the PLS-CA model was nearly equal to that of the logistic-CA model (94.9%). For the urban category, the user's and producer's accuracies of the logistic-CA model were 96.5% and 77.3%, respectively, both lower than those of the PLS-CA model. The overall accuracy of the simulation results in 2008 was 83.5% for the logistic-CA model, 2.3 percentage points lower than that of the PLS-CA model. The Kappa coefficient of the PLS-CA model was 70.9%, higher than that of the logistic-CA model (66.6%). The comparison suggests that the PLS-CA model generated more accurate results than the logistic-CA model.
The CUGR indicator illustrates the growth rate of the simulated urban area compared with the actual urban development. A CUGR for urban areas larger than 100% indicates that the simulated growth exceeds the observed growth; otherwise, the simulated growth is slower than the observed growth. A CUGR value approaching 100% suggests that the CA model performs well in terms of area control. Statistics for observed urban growth (excluding water bodies) were retrieved from remote sensing and corrected using data from the local government of Songjiang. The logistic-CA and PLS-CA models both simulated more urban growth than actually occurred (Table 6). The CUGR of the urban category was 114.2% for the PLS-CA model, lower than that of the logistic-CA model (121.1%). The CUGR of the non-urban category was 92.3% for the PLS-CA model, which is closer to 100% than that of the logistic-CA model (88.6%). This result suggests that the overall area control performance of the PLS-CA model was better than that of the logistic-CA model, although it still overestimated urban growth and underestimated the persistence of non-urban.
Urban development is a complex open system whose trajectory is affected by drivers that may be significantly spatially correlated. Integrated with the analytical functions of GIS, logistic regression can be used to evaluate the impact of these factors on urban growth. However, logistic regression cannot eliminate the correlation of spatial variables, while PCA can eliminate spatial correlation only to a certain degree, and the principal components found in the independent variables may not adequately explain the dependent variables [35]. The proposed PLS method can extract variables that are uncorrelated from amongst the explanatory variables, and also between the explanatory and response factors [36,37], resulting in the discovery of transition rules from a number of driving factors that are usually highly correlated. This property could explain why PLS-CA modeling outperformed the traditional logistic-CA model, at least from a theoretical point of view. Our results show that the simulated patterns of urban growth accord well with the actual urban pattern of Songjiang. Compared with the logistic-CA model, the PLS-CA model achieved better simulation accuracies in modeling the urban growth of Songjiang through time. We speculate that our model is better than the PCA-based CA model, as inferred from the principal components, but not necessarily better than auto-logistic regression, which performs nearly as well as ANN [33]. It still needs to be tested whether the PLS-CA model is more or less accurate than other CA models based on artificial intelligence and machine learning. However, our new model has the advantages that it is simpler than these models and generates parameters having clear physical meanings.
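The decorrelation property discussed above can be illustrated with a small, self-contained sketch. It uses a textbook NIPALS-style PLS1 procedure for a single response variable, which is an assumption made here for illustration and not necessarily the procedure the authors implemented. The helper names (`pls1_scores`, `correlation`) and the toy data, in which two predictors are nearly collinear, are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def pls1_scores(rows, y, n_components):
    """Return the component score vectors of a NIPALS-style PLS1 fit."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]  # centered predictors
    y_mean = sum(y) / n
    res = [v - y_mean for v in y]                            # centered response
    scores = []
    for _ in range(n_components):
        # Weights: direction of maximal covariance between X and the response.
        w = [sum(X[i][j] * res[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores: projection of each sample onto the weight vector.
        t = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        # Loadings, then deflation of X and of the response.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(res[i] * t[i] for i in range(n)) / tt
        X = [[X[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        res = [res[i] - q * t[i] for i in range(n)]
        scores.append(t)
    return scores


# Hypothetical predictors: x1 and x2 are nearly collinear, x3 is not.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
x3 = [8, 3, 6, 1, 7, 2, 5, 4]
y = [0.9, 0.8, 0.75, 0.6, 0.5, 0.35, 0.3, 0.1]
rows = [list(r) for r in zip(x1, x2, x3)]

t1, t2 = pls1_scores(rows, y, n_components=2)
print("corr(x1, x2):", correlation(x1, x2))
print("corr(t1, t2):", correlation(t1, t2))
```

Because each component is extracted from the predictor matrix after the previous component has been removed (deflation), successive score vectors are orthogonal, and with centered data this means they are uncorrelated even when the raw predictors are strongly correlated.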
In addition, CA models contain various types of uncertainty caused by several other factors such as sampling, neighborhood configuration, constraints, stochastic perturbation, and spatial scale [19,47,54,55]. We took only one group of samples for training the PLS-CA model in this study. Like any other CA model, our PLS-CA model could be sensitive to samples, which are determined by both the sampling method and the sample grouping [9]. The effect of sampling on the simulation results is reflected in the CA parameters of the drivers. Such an effect is relatively greater for statistically significant drivers, whereas it is much smaller for statistically non-significant drivers, which can even be excluded from modeling [32,56]. Neighborhood configuration, such as the shape and number of neighbors, influences CA models through local interactions [19,57]. A stochastic factor in CA transition rules is used to simulate less tangible uncertainties and perturbations that may affect the simulation results [11,42,47]. Moreover, raster-based CA models are also sensitive to cell size (grain size) in terms of simulation accuracy and landscape structure [39,55,58,59]. The proposed PLS-CA model is no exception because it depends on a rasterized space.
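To make the neighborhood sensitivity mentioned above concrete, the following sketch computes the share of urban cells around a focal cell for square (Moore-type) neighborhoods of different sizes. The function name `urban_fraction`, the binary grid, and the choice to ignore cells outside the grid are assumptions for illustration only; the neighborhood actually used in the PLS-CA model is defined elsewhere in the paper.

```python
def urban_fraction(grid, row, col, radius=1):
    """Share of urban cells (value 1) in a square neighborhood.

    The window spans `radius` cells in every direction, excludes the
    focal cell itself, and ignores positions outside the grid.
    """
    count = 0
    urban = 0
    for r in range(max(0, row - radius), min(len(grid), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(grid[0]), col + radius + 1)):
            if (r, c) == (row, col):
                continue
            count += 1
            urban += grid[r][c]
    return urban / count if count else 0.0


# Hypothetical 5 x 5 land use grid: 1 = urban, 0 = non-urban.
grid = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]

small = urban_fraction(grid, 2, 2, radius=1)  # 3 x 3 window
large = urban_fraction(grid, 2, 2, radius=2)  # 5 x 5 window
print(small, large)
```

For the same focal cell, the two window sizes give different neighborhood densities, so a transition rule that weights this term would respond differently depending on the configuration chosen.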

Conclusions
This paper demonstrates that CA models can accurately simulate urban growth using global and local constraints that reflect various environmental concerns. The advantages of urban growth simulation by GIS-based CA modeling include the identification of the driving factors of land use change and the identification of spatial patterns across space and over time. The most important part of developing these new models is to discover mature CA transition rules. Our PLS-CA model is capable of extracting uncorrelated factors from the candidate explanatory variables. PLS-CA can therefore effectively eliminate redundancy in the input data and allow for the discovery of better and more reasonable transition rules. The PLS-CA model was successfully applied to simulate the urban growth in Songjiang, demonstrating better simulation accuracy than a conventional logistic-CA model.
Further improvements could be made by testing the response and robustness of the PLS-CA model to sampling, neighborhood configuration, constraints, and spatial scale. In addition, advanced CA models could be packaged with simple, robust, and easily implementable modules such as real-time and dynamic display of simulation results.

Figure 1. The study area of Songjiang district in Shanghai, China. (a) Map of P. R. China and (b) Map of Shanghai.

Figure 2. Visualization of spatial variables used in the PLS-CA model. (a) D_urban; (b) D_town; (c) D_mrd; (d) D_agri; (e) D_gs; and (f) Constraint.

Figure 3. Structure of the PLS-CA model.

Figure 4. Land conversion potential based on spatial variables. (a) Logistic regression and (b) PLS.

Figure 5. The observed and simulated patterns in Songjiang. (a) The 1992 initial state; (b) The 2008 observed pattern; (c) The 2008 simulated pattern by logistic-CA; and (d) The 2008 simulated pattern by PLS-CA.

Figure 6. Simulation accuracy (%) of the two CA models in 2008.

Table 1. The spatial variables used to simulate urban growth in the partial least squares-based cellular automata (PLS-CA) model.

Table 2. Correlation matrix of spatial variables.

Table 3. Principal components derived from PLS.

Table 4. Comparison of CA parameters generated by PLS and logistic regression.

Table 5. Confusion matrix between remote sensing-based classification and simulated urban pattern using the PLS-CA model for Songjiang in 2008.

Table 6. Observed and simulated urban growth rates from 1992 to 2008 (CUGR stands for compared urban growth rate).