5. Methods
In this paper, we undertake a comprehensive data inspection and apply several regression techniques in order to investigate the interdependency between a dependent variable (e.g., degree of soil sealing) and one or more independent variables (e.g., influential factors). The relevant procedural steps are shown in Figure 2.
Data exploration consists of data inspection, data transformation and correlation analysis. The aim is to understand the distribution of each variable and to discover dependencies between variables. Data are inspected and their distributions analyzed by means of statistical methods and visualization techniques (histograms, density plots, quantile-quantile plots) [51,52,53,54]. Transformations chosen along the so-called “ladder of powers” can be applied to the data in order to obtain the approximately linear relationships between the dependent and independent variables that the regression analysis requires, as well as to reduce skewness in the distributions [52]. In this way, it is possible to describe nonlinear relationships, to make distribution patterns more symmetrical and to reduce the spread of data points [52]. The objective of the correlation analysis is to identify the strength and direction of the relationship between two variables. If the relationship is approximately linear, the Pearson product-moment correlation coefficient can be used [52,53]. Furthermore, scatter plots are a useful visual tool to analyze the relationship between two variables.
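The effect of a transformation on the Pearson coefficient can be illustrated with a minimal sketch. The variable names and values below are made up for illustration and are not taken from the study data. The response is constructed to depend linearly on the logarithm of a right-skewed predictor, so the correlation is weaker on the raw scale than after a log transformation.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: a right-skewed predictor and a response that is
# linear in log10 of that predictor (an assumption made for this example).
density = [10, 30, 100, 300, 1000, 3000, 10000]
sealing = [2.0 + 3.0 * math.log10(d) for d in density]

r_raw = pearson_r(density, sealing)                           # raw scale
r_log = pearson_r([math.log10(d) for d in density], sealing)  # after transformation
print(f"r (raw) = {r_raw:.3f}, r (log10) = {r_log:.3f}")
```

In practice, the choice of transformation would be guided by the histograms and quantile-quantile plots mentioned above rather than fixed in advance.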
In this study, three regression models (ordinary least squares regression, spatial lag regression and spatial error regression) are applied in order to determine which model is best suited to describing the relationships within the data. Ordinary least squares (OLS) regression is a linear regression model that is fitted by least-squares estimation, i.e., by minimizing the sum of the squared residuals [52]. It assumes an approximately linear dependence between the dependent and independent variables.
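For reference, the least-squares criterion can be written compactly. The notation is an assumption introduced here and does not appear in the original text: $y$ is the vector of observed values of the dependent variable, $X$ is the matrix of independent variables including a column of ones for the intercept, and $\beta$ is the coefficient vector. The closed form on the right holds when $X$ has full column rank.

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2
  = \left( X^{\top} X \right)^{-1} X^{\top} y
```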
The regression equation is developed in an iterative process and has to be checked for validity and consistency after every model fit. The F-test is used to examine the overall validity and significance of the model. A significant F-test indicates that the null hypothesis (i.e., that all slope coefficients of the model are 0) can be rejected and, thus, that the model possesses some explanatory value. The significance of the individual regression coefficients is determined by the t-test, whose test statistic is calculated by dividing each regression coefficient by its standard error. Here, the null hypothesis is that the respective independent variable does not influence the dependent variable. In the case at hand, the null hypothesis is rejected for all independent variables, which are therefore seen to significantly influence the dependent variable. Such regression diagnostics are necessary to ensure that the required model assumptions are met. Multicollinearity in regression equations inflates the standard errors of the coefficients and hence can have the effect that independent variables are declared to be statistically non-significant. The severity of multicollinearity can be assessed by the variance inflation factor (VIF). If the VIF of a variable is larger than 10, then this variable is a potential source of multicollinearity. A further criterion for the detection of multicollinearity is the so-called condition number. If the condition number is higher than 30, then the regression model may be affected by multicollinearity between the independent variables.
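The t-statistic and the VIF can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the study: it covers only a regression with a single predictor and a VIF for exactly two predictors, and all data values are hypothetical. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones.

```python
import math

def simple_ols(x, y):
    """Fit y = a + b*x by least squares.

    Returns the intercept a, the slope b, the standard error of b and
    the t-statistic b / se(b).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = rss / (n - 2)  # two estimated parameters
    se_b = math.sqrt(residual_variance / sxx)
    return a, b, se_b, b / se_b

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2), with R^2 from regressing x1 on x2.

    With exactly two predictors, both share this value.
    """
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
intercept, slope, se_slope, t_slope = simple_ols(x, y)

near_copy = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]  # almost the same as x
unrelated = [1, -1, 1, -1, 1, -1]           # weakly related to x
vif_collinear = vif_two_predictors(x, near_copy)
vif_unrelated = vif_two_predictors(x, unrelated)
print(f"slope = {slope:.3f}, t = {t_slope:.1f}")
print(f"VIF (collinear) = {vif_collinear:.1f}, VIF (unrelated) = {vif_unrelated:.2f}")
```

With more than two predictors, the auxiliary R² would come from a multiple regression of each predictor on all others, which requires a matrix-based fit.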
Statistical approaches generally assume that observations are statistically independent. However, data subject to spatial analysis are often found to be spatially autocorrelated, which means that the values of a variable cluster in space [55]. This reflects Waldo Tobler’s first law of geography: “Everything is related to everything else, but near things are more related than distant things.” [56]. To test data for spatial dependence, we can apply Moran’s I to measure global autocorrelation and the local indicator of spatial association (LISA) to measure local autocorrelation [57]. On this basis, spatial patterns in variables and their behavior (e.g., values that are spatially near are more similar) can be detected and visualized. Moran’s I measures global spatial autocorrelation, i.e., the correlation of a variable with itself across neighboring locations, by applying a matrix of spatial weights [58]. Anselin [59] defined LISA as an indicator of the extent of significant spatial clustering of similar values around an observation, determining that the mean of LISA is proportional to the global indicator of spatial association. LISA can identify local hotspots and can be used to detect clustering. The local Moran’s I can be visualized in a choropleth map showing potential spatial clustering and its significance. In a multiple regression with several independent variables, the primary focus is to determine which of these most strongly influences the dependent variable. Standardization of the regression coefficients allows the strength of their influence on the dependent variable to be compared by removing the various units of measurement.
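The global and local Moran’s I described above can be sketched as follows. This is a minimal illustration under stated assumptions: five hypothetical units arranged on a line, binary contiguity between adjacent units, and row-standardized weights. Significance testing (e.g., by permutation) is omitted. With row-standardized weights, the mean of the local values equals the global statistic, which illustrates the proportionality noted by Anselin.

```python
def row_standardize(adjacency):
    """Scale each row of a binary adjacency matrix so that it sums to 1."""
    return [[w / sum(row) for w in row] for row in adjacency]

def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)

def local_morans_i(values, weights):
    """Local Moran's I (a LISA statistic) for every observation."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi * zi for zi in z) / n  # second moment of the deviations
    return [
        z[i] / m2 * sum(weights[i][j] * z[j] for j in range(n))
        for i in range(n)
    ]

# Hypothetical example: five units on a line, neighbors are adjacent units.
adjacency = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
w = row_standardize(adjacency)

clustered = [1, 2, 3, 4, 5]    # similar values lie next to each other
alternating = [1, 5, 1, 5, 1]  # dissimilar values lie next to each other

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
lisa = local_morans_i(clustered, w)
print(i_clustered, i_alternating, lisa)
```

A positive value indicates clustering of similar values and a negative value indicates a checkerboard-like pattern.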
Spatial regression deals with spatial effects such as spatial dependence and spatial heterogeneity [57,60]. The spatial lag model and the spatial error model account for autocorrelation in linear models, since unmodeled autocorrelation can compromise the statistical explanatory power. The spatial lag model is essentially the OLS model with an additional term consisting of a spatial weights matrix and an autoregressive coefficient ρ, which determines the strength of the spatial autoregressive relation between the dependent variable and its spatially lagged values [61]. This model assumes autocorrelation in the dependent variable and includes an autoregressive term for the spatial autocorrelation [62]. The spatial error model assumes autocorrelation in the error term [61]. The Lagrange multiplier tests [63] provide information about whether spatial dependence exists and, if so, whether a lag or an error model is more appropriate. Based on the OLS residuals, the Lagrange multiplier tests check for a missing spatially lagged dependent variable (LM (lag)) and for dependencies in the error term (LM (error)). If both tests are significant, the robust forms of the tests determine which regression model is best suited.
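In a common textbook notation, the spatial lag and spatial error models described above can be written as follows. Apart from ρ, the symbols are assumptions introduced here for illustration: $W$ is the spatial weights matrix, $X$ the matrix of independent variables, $\beta$ the coefficient vector, $\lambda$ the autoregressive coefficient of the error process, $u$ the spatially dependent error and $\varepsilon$ an independent error term.

```latex
\begin{align}
  y &= \rho W y + X\beta + \varepsilon
    && \text{(spatial lag model)} \\
  y &= X\beta + u, \quad u = \lambda W u + \varepsilon
    && \text{(spatial error model)}
\end{align}
```

Setting $\rho = 0$ or $\lambda = 0$, respectively, reduces both models to the OLS specification, which is the null hypothesis examined by the corresponding Lagrange multiplier test.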
A regression model’s goodness-of-fit is measured by the coefficient of determination R² with the value range [0, 1], where 1 is a perfect fit. Several model assumptions must be verified in a linear regression, mostly through examination of the residuals. First, the residuals should be normally distributed; otherwise, the F-test and t-test are not valid. Second, the residuals should be independent, i.e., they should not show autocorrelation, which otherwise causes inefficiency in the least-squares estimation and incorrect standard errors, leading in turn to a false determination of significance. Third, the independent variables should not be strongly correlated with one another (a phenomenon called multicollinearity), as this reduces the precision of the estimators. Fourth, the residuals should have constant variance; if they do not, heteroscedasticity occurs, producing the same inefficiency as autocorrelation [52]. A frequently used criterion for assessing model fit is the Akaike information criterion (AIC), which balances goodness-of-fit against the number of estimated parameters [64]. It can be used for model selection and comparison, where the model with the lowest AIC is preferred.
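Both fit measures can be computed directly from observed and fitted values. The sketch below assumes normally distributed errors with the error variance estimated by maximum likelihood, and it counts the error variance as one additional parameter. Software packages differ in this counting convention, so absolute AIC values may not match across tools, although comparisons within one convention remain valid. All numbers are hypothetical.

```python
import math

def r_squared(y, fitted):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
    tss = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - rss / tss

def gaussian_aic(y, fitted, n_coefficients):
    """AIC = 2k - 2 ln(L) for a linear model with Gaussian errors.

    k counts the regression coefficients plus the error variance
    (a convention assumed for this sketch).
    """
    n = len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
    log_likelihood = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coefficients + 1
    return 2 * k - 2 * log_likelihood

# Hypothetical observations and fitted values from two competing models.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fit_a = [1.1, 1.9, 3.0, 4.1, 4.9]  # close fit
fit_b = [1.5, 1.5, 3.5, 3.5, 5.5]  # coarser fit

r2_a, r2_b = r_squared(y, fit_a), r_squared(y, fit_b)
aic_a = gaussian_aic(y, fit_a, n_coefficients=2)
aic_b = gaussian_aic(y, fit_b, n_coefficients=2)
print(f"R2: {r2_a:.3f} vs {r2_b:.3f}; AIC: {aic_a:.2f} vs {aic_b:.2f}")
```

Each additional parameter raises the AIC by 2, so a more complex model is preferred only if its likelihood improves by more than this penalty.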
7. Discussion
In order to limit the degree of soil sealing by means of spatial planning instruments, it is first necessary to obtain information on likely influential factors. In view of the constantly increasing mass of analyzable data, it is becoming ever more difficult to formulate individual hypotheses. Some spatial patterns may remain hidden if an overly narrow or biased approach is adopted. Due to these problems, it can happen that complex datasets are not examined with sufficient thoroughness, i.e., not all possible aspects are considered. Consequently, interesting interdependencies may be ignored. Against this backdrop, the present study adopts the method of urban data mining [65,66] to reveal logical or mathematical and partly complex descriptions of patterns and regularities inside a set of geospatial data. A large number of variables (
) was collected and inspected. On this basis, correlation and regression analyses were undertaken in order to identify diverse bundles of variables that characterize the degree of soil sealing. As a result, 25 variables were identified that have an approximately linear bivariate relationship with soil sealing.
For example, the hypothesis that the extent of sealed surface in Germany’s municipalities is dependent on the density of settlements and/or a high level of economic activity has been confirmed (cf. Hypothesis 1 in
Table 1). The following measures of density are significant in this regard: e.g., population density (
), road network density (
), settlement density (
), density of flats (
), daytime population density (
) and housing density (
). The tax capacity (
) and municipal revenues from commercial taxes (
) are also correlated with the extent of soil sealing (cf. Hypothesis 5 in
Table 1), as is (transport) accessibility. Driving times to schools (
) are clearly correlated to soil sealing and, thus, serve as a specific indicator for the development of infrastructure in a region (cf. Hypothesis 3 in
Table 1).
Currently, it is difficult to determine a clear dependency between lifestyle and consumption patterns (living space per inhabitant/household, journeys between home, work, shops and leisure areas) and the degree of soil sealing. There is only a moderate correlation between living space per inhabitant (
) and the degree of soil sealing and a similar (negative) correlation between the average commuting distance and soil sealing (
). Other variables should be taken into account in order to investigate the assumed relationships more precisely (cf. Hypothesis 5 in
Table 1). Regarding the formulated hypotheses on tourism infrastructure (cf. Hypothesis 2 in
Table 1), we note that the percentage of vacation homes in a municipality is not a useful influential factor to characterize soil sealing in a pan-German study. The dependencies between soil sealing and this variable are relatively weak when considering all of Germany’s municipalities (
). Thus, it is recommended that analysis be conducted at a different/smaller spatial scale and that supplementary variables be used as indicators for tourism infrastructure. Furthermore, no strong dependency could be identified between soil sealing and the attractiveness of the landscape or the underlying topography (cf. Hypothesis 9 in
Table 1). There is only a very weak correlation between relief diversity and the degree of soil sealing. In future investigations, terrain slope might be a suitable variable to investigate the assumed relationship more precisely.
In regard to the dimensions of public policy (cf. Hypothesis 10 in
Table 1), only very weak correlations were found with the degree of soil sealing. Here, further data must be gathered to permit quantitative analysis. Currently, such influences cannot be suitably illustrated, even if they are doubtless of considerable importance. Small-scale analyses are likely to be the best approach to uncovering potential dependencies.
Regarding methodology, the presented process of data analysis can be broken down into several stages: selecting the target data, pre-processing the data, applying transformations if necessary, performing correlation and regression analysis to extract relationships and then interpreting and assessing the results. Theory-driven data selection and, in particular, close data inspection, including transformation measures, are required to ensure good quality results. The presented approach leads to a deeper understanding of the distribution of each variable. It was observed that most of the selected variables follow a log-normal distribution. Against this background, the mean and standard deviation are appropriate measures to distinguish variable characteristics. For example, it was possible to distinguish between six different soil sealing classes and to discern finer spatial patterning at the level of German municipalities (central vs. peripheral municipalities). Once the (transformed) data had been confirmed to be approximately normally distributed, the strength and direction of the relationship could be measured using Pearson’s correlation coefficient. The data analytical process used scatter plots to check for linear dependence between the independent and dependent variables as a precondition for the ordinary least squares regression. In previous studies, ordinary least squares regression has often been applied to explain soil sealing or more general land consumption properties. In some cases, stepwise regression has been used to identify so-called relevant variables. In contrast to such approaches involving stepwise regression, here we have applied correlation measures and visual techniques, such as scatter plots, in combination with substantive considerations. Furthermore, two spatial regression approaches have been presented in this article to investigate the complex bundle of influential factors: the spatial lag model and the spatial error model. Such spatial regression methods possess a high explanatory value because they incorporate spatial characteristics, such as spatial autocorrelation, in the model.
Furthermore, geographically-weighted regression (GWR) should be discussed as a powerful technique to study influential factors at the local level. Through the use of local statistics, non-stationarity can be detected to show how several administrative units can serve to characterize the whole study area. Non-stationarity implies that phenomena can vary over space, and hence, it is necessary to deal with their spatial distribution [57]. GWR is a local regression model that estimates separate coefficient values for each unit, in contrast to OLS and the spatial autoregressive (SAR) models, which estimate one equation for the whole study area [67]. In this way, GWR addresses non-stationarity directly by providing a range of regression coefficients over the study area. However, it is rather difficult to create a model encompassing a large number of variables for the entire national territory of Germany. In order to devise such a model, the authors stress the importance of employing regression diagnostics, as well as the need to diagnose model collinearity by means of, e.g., the variance inflation factor and condition indexes (for more information, see [68,69,70,71]). Provided that small-scale data on influential factors are available, future work should focus on GWR applications in selected study areas (e.g., urban regions, other soil sealing hotspots). Furthermore, other spatial interpolation approaches (e.g., regression kriging/cokriging) might be appropriate to gain a deeper understanding of influential factors at the local level [72].
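The core idea of GWR, one locally weighted regression per unit, can be sketched as follows. This is a deliberately reduced illustration and not a full GWR implementation: it uses a single predictor, a Gaussian distance-decay kernel with a fixed bandwidth chosen by hand, and hypothetical data in which the slope is 1 in one half of the study area and 3 in the other. Bandwidth selection and local diagnostics are omitted.

```python
import math

def gwr_local_fits(coords, x, y, bandwidth):
    """Fit y = a + b*x separately at every location by weighted least squares.

    Observations are weighted by a Gaussian kernel of their distance to the
    location being fitted. Returns a list of (intercept, slope) pairs.
    """
    fits = []
    for u0, v0 in coords:
        w = [
            math.exp(-0.5 * (math.hypot(u - u0, v - v0) / bandwidth) ** 2)
            for u, v in coords
        ]
        total = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
        sxy = sum(
            wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)
        )
        slope = sxy / sxx
        fits.append((mean_y - slope * mean_x, slope))
    return fits

# Hypothetical study area: ten units along a west-east line.
coords = [(float(i), 0.0) for i in range(10)]
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
# The relationship differs between the western and eastern halves.
y = [1.0 * xi for xi in x[:5]] + [3.0 * xi for xi in x[5:]]

local_fits = gwr_local_fits(coords, x, y, bandwidth=1.5)
local_slopes = [slope for _, slope in local_fits]
print([round(s, 2) for s in local_slopes])
```

A single global regression on these data would report one intermediate slope, whereas the local fits recover the two regimes at the ends of the line and show a transition in between.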
In general, this paper has attempted to provide an overview of methodological approaches and related challenges. In the future, depending on the availability of new datasets, it should be possible to conduct deeper analysis into the influential factors of soil sealing at fixed time points (static perspective), as well as examining changes in the extent of soil sealing along multidimensional pathways (dynamic perspective).
The recently published remote sensing data of the European Environment Agency (EEA) open up new avenues to apply the techniques presented here to other European study regions in order to analyze the complex bundle of influential factors for the time frames 2006, 2009 and 2012. This is likely to reveal differences between Europe’s various spatial units. Future comparative studies, as well as case studies, should be undertaken to determine whether the values for sealed surfaces calculated for municipalities from EEA data are reliable.