# Analyzing and Predicting Micro-Location Patterns of Software Firms

## Abstract

## 1. Introduction

- RQ1: Are the effects of location factors, as reported by previous studies using aggregated spatial units, robust at the microgeographic level?
- RQ2: How does a firm location prediction model perform at the microgeographic level, and to what degree does it provide valuable new insights into the firm allocation process? What are the distinct requirements for the data and the statistical model?

## 2. Data

#### 2.1. OpenStreetMap Data

#### 2.2. Official Geodata

#### 2.3. The Mannheim Enterprise Panel

$G_i^*$) of low match rates in rural areas. However, there is only a minor positive correlation ($r_s$ = 0.006 ***) between the geocoding match rate and population density. Hence, known OSM data quality issues in rural areas (see above) do not seem to induce a systematic error in our geocoding results. To account for spatially varying geocoding completeness, we included a corresponding control variable in the regression analysis (geocoding match rate at postal code area level). We further used the MUP to identify the headquarters locations of the top 100 firms (by annual turnover) in Germany and included them as a location factor in the regression analysis.

## 3. Methods

#### 3.1. Exploratory Spatial Data Analysis

$r_s$ to measure the degree of monotonic relationship between variables. Quadrat analysis is used to evaluate the dispersion of point patterns by calculating their variance-to-mean ratio (VMR) on regular grids. The results of the quadrat analysis are used to assess whether the software firm location point pattern was produced by a random (homogeneous Poisson) process [44,45]. We measure global spatial autocorrelation using Moran’s Index I, arguably the most common measure for this purpose. We also use standardized Moran’s I z-values, which allow us to compare I values between different levels of spatial aggregation. The generalized local G autocorrelation statistic $G_i^*$ is used to evaluate local spatial association [46]. $G_i^*$ was selected because we are mostly interested in detecting local pockets of positive spatial autocorrelation (e.g., “hotspots of the software industry”). Measures of spatial autocorrelation require us to hypothesize the spatial relationships in the study area [47]. We use the topological contiguity method with the queen contiguity criterion (QNN) for our regular grids.
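The two global measures described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the grid is a plain list of rows of cell counts, the VMR uses the population variance, and Moran's I uses binary (not row-standardized) queen contiguity weights. The standardized z-values and the local $G_i^*$ statistic are not covered here.

```python
from statistics import mean, pvariance


def variance_to_mean_ratio(counts):
    """VMR of quadrat counts; values near 1 are expected under a
    homogeneous Poisson process, values above 1 indicate clustering."""
    return pvariance(counts) / mean(counts)


def queen_neighbors(row, col, n_rows, n_cols):
    """Yield the cells that share an edge or a corner with (row, col)."""
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if (d_row, d_col) == (0, 0):
                continue
            r, c = row + d_row, col + d_col
            if 0 <= r < n_rows and 0 <= c < n_cols:
                yield r, c


def morans_i(grid):
    """Global Moran's I on a regular grid with binary queen weights.

    Assumes the grid is rectangular and not constant (otherwise the
    denominator is zero).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mu = mean(values)
    denominator = sum((v - mu) ** 2 for v in values)
    cross_product = 0.0
    weight_sum = 0  # sum of all w_ij (each neighbor pair counted twice)
    for r in range(n_rows):
        for c in range(n_cols):
            for nr, nc in queen_neighbors(r, c, n_rows, n_cols):
                cross_product += (grid[r][c] - mu) * (grid[nr][nc] - mu)
                weight_sum += 1
    return (len(values) / weight_sum) * (cross_product / denominator)


# Toy grids (made-up counts, not data from the study).
clustered = [[5, 5, 0, 0], [5, 5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(variance_to_mean_ratio([v for row in clustered for v in row]))
print(morans_i(clustered), morans_i(checkerboard))
```

On the toy grids, the clustered pattern yields a positive Moran's I and the checkerboard a negative one, which is the qualitative behavior the statistic is meant to capture.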

#### 3.2. Count Data Regression Models

$y_i$ in cell $i$ is conditional on the local values of $\mathbf{x}_i$:

$$y_i \mid \mathbf{x}_i \sim \mathrm{Poisson}(\lambda_i), \qquad E(y_i \mid \mathbf{x}_i) = \lambda_i, \qquad \mathrm{Var}(y_i \mid \mathbf{x}_i) = \lambda_i$$

$\lambda_i$, given n location factors x, is then:

$_{n}$ change in $x_n$ is $e^{\hat{\beta}_n \Delta x_n}$ (ceteris paribus). Cameron and Trivedi [26] recommend using robust standard errors for Poisson models.
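For orientation, the multiplicative effect $e^{\hat{\beta}_n \Delta x_n}$ follows directly from the log-linear mean function that is standard for Poisson regression. The form below is the textbook specification and is given here as an assumption, not as a quotation of the authors' equation; the intercept $\beta_0$ and the double index $x_{ij}$ (value of factor $j$ in cell $i$) are notation introduced for this sketch.

$$\lambda_i = \exp\left(\beta_0 + \sum_{j=1}^{n} \beta_j x_{ij}\right)$$

Changing the $n$-th factor by $\Delta x_n$ while holding all other factors fixed gives

$$\frac{\lambda_i(x_{in} + \Delta x_n)}{\lambda_i(x_{in})} = \frac{\exp\left(\beta_0 + \sum_{j<n} \beta_j x_{ij} + \beta_n (x_{in} + \Delta x_n)\right)}{\exp\left(\beta_0 + \sum_{j<n} \beta_j x_{ij} + \beta_n x_{in}\right)} = e^{\beta_n \Delta x_n}$$

Evaluated at the estimate $\hat{\beta}_n$ and for $\Delta x_n = 1$, this ratio is the incidence rate ratio (IRR) $e^{\hat{\beta}_n}$.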

## 4. Results

#### 4.1. Exploratory Spatial Data Analysis Results

$G_i^*$) in the areas of Munich, Stuttgart, and Rhine-Main (around Frankfurt). In contrast, the absence of high software industry shares and hotspots in the very densely populated and large Ruhr area (around Essen) indicates that high population density alone does not necessarily imply large proportions of software firms in the local firm population.

$r_s$ = 0.94 ***) and the total stock of firms ($r_s$ = 0.97 ***) exhibit similarly strong monotonic relationships with population numbers. At the 1 km scale, software firm numbers show a distinctly lower correlation with local population numbers ($r_s$ = 0.38 ***) than the rest of the firm population ($r_s$ = 0.65 ***). This indicates that population numbers alone do not predict the number of software firms very well at low levels of geographic aggregation.
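The rank correlations reported above are Spearman coefficients. A minimal sketch of how such a coefficient can be computed is given below. It is not the authors' code: the function names `average_ranks` and `spearman_rs` are hypothetical, the input lists are made-up values, and ties (frequent in cell counts with many zeros) are handled with average ranks, which is an assumption about the tie treatment.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's r_s as the Pearson correlation of tie-averaged ranks.

    Assumes both inputs have the same length and are not constant.
    """
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x, mean_y = mean(rank_x), mean(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return covariance / (var_x * var_y) ** 0.5


# Made-up example: population and firm counts for five hypothetical cells.
population = [10, 40, 90, 200, 1500]
firms = [0, 1, 1, 3, 40]
print(spearman_rs(population, firms))
```

Because only ranks enter the calculation, the coefficient captures monotonic rather than strictly linear association, which suits highly skewed count data.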

#### 4.2. Regression Analysis Results

#### 4.2.1. Interpretation of Regression Coefficients

#### 4.2.2. Model Fit and Spatial Residual Analysis

$G_i^*$) still exist. For example, the described prediction disparity in Berlin is still present because Berlin was, by chance, divided uniformly into four cells (cf. MAUP as mentioned above). This results in significant (p < 0.05) clustering of negative residuals (overestimation) in the south of Berlin (coldspot) and a hotspot of positive residuals (underestimation) in the north. Interestingly, other residual clusters occur mainly in areas that were identified as hotspots of the software industry (see Figure 6b).

## 5. Discussion

#### 5.1. Discussion of Regression Coefficients

#### 5.1.1. Agglomeration Location Factors

#### 5.1.2. Infrastructure Location Factors

#### 5.1.3. Socio-Economic Location Factors

$r_s$ = 0.49 ***) with the proportion of university graduates among employees in the local workforce, which is found to have a strong positive effect on local software firm numbers (indicating multicollinearity). The software industry’s need for highly educated employees is further emphasized by the strong positive effect of nearby universities. The number of local public research institutes, however, has no significant effect. It should be kept in mind that some socio-economic location factors are measured at a low spatial resolution (district and municipality level). While this is of no concern for tax levels, the share of graduate employees and wages can differ significantly within districts (ecological fallacy [4,76]). The lack of socio-economic location factors at the microgeographic level could in fact be a major limitation of our model, as we discuss further below.

#### 5.1.4. Quality of Life and Amenities Location Factors

#### 5.1.5. Other Location Factors

#### 5.2. Discussion of Model Adequacy

## 6. Conclusions

#### 6.1. RQ1: Scale-Robust Location Factors

#### 6.2. RQ2: Microgeographic Location Prediction

## Acknowledgments

## Author Contributions

## Conflicts of Interest

**Figure 1.** Overview (5 km scale) and zoom (1 km scale; with selection of location factors for exemplary cell) of the software firm location pattern.

**Figure 2.** (**a**) Share of software firms in total stock of firms (25 km scale); (**b**) and standardized Moran’s I by 1 km, 5 km, 10 km, and 25 km level of aggregation.

**Figure 6.** (**a**) Standardized Moran’s I of regression residual aggregated at different levels of aggregation; (**b**) Significant clustering of regression residual aggregated at 25 km grid.

Scale | Obs. | Mean ($\overline{X}$) | Median ($\tilde{X}$) | SD | Min. | Max. | VMR |
---|---|---|---|---|---|---|---|
1 km | 361,453 | 0.19 | 0 | 1.64 | 0 | 211 | 14.12 |
5 km | 14,951 | 4.58 | 1 | 25.98 | 0 | 1604 | 147.39 |
10 km | 3,860 | 17.74 | 4 | 87.07 | 0 | 3265 | 427.35 |
25 km | 671 | 102.06 | 27 | 301.74 | 0 | 4105 | 892.11 |

Location Factor | Description | IRR |
---|---|---|
**Agglomeration Location Factors** | | |
Firm density | Number of local firms (in 10) | 1.028 *** (0.003) |
Firm density² | Squared number of local firms (in 10) | 0.999 *** (0.000) |
High-tech firms | Proportion of high-tech firms in local stock of firms (in %) | 1.021 *** (0.000) |
Major firms | Distance to next major firm (in km) | 0.998 *** (0.000) |
Commercial rent | Difference local rent to mean rent in neighborhood (in Euro) | 1.127 *** (0.12) |
Population | Population per cell (in 100) | 1.081 *** (0.003) |
Population² | Squared population per cell (in 100) | 0.999 *** (0.000) |
Population centrality | Urban Centrality Index (in 0.1 UCI); high value ≙ monocentricity | 1.079 *** (0.192) |
**Infrastructure Location Factors** | | |
Broadband Internet | Availability of ≥50 Mbit/s Internet (categories); high value ≙ low availability of Internet | 0.764 *** (0.009) |
Motorway | Distance to nearest motorway access (in km) | 0.977 *** (0.001) |
Railway | Distance to nearest main-line railway station (in km) | 0.998 *** (0.000) |
Airport | Distance to nearest main airport (in km) | 0.998 *** (0.000) |
Public transport | Weighted count of public transport stops | 1.000 (0.001) |
**Socio-economic Location Factors** | | |
Wages | Median income of full-time employee (in 100 Euro) | 1.005 (0.003) |
Universities | Distance to nearest university (in km) | 0.980 *** (0.000) |
Research institutes | Number of research institutes | 1.004 (0.036) |
Educated workforce | Proportion of graduate employees (in %) | 1.063 *** (0.006) |
Students | Proportion of students in local population (in %) | 0.986 *** (0.003) |
Business tax | Business tax factor (in 100); high value ≙ high taxes | 0.925 ** (0.023) |
**Quality of Life and Amenities Location Factors** | | |
Life expectancy | Mean life expectancy of population | 1.092 *** (0.012) |
Crime | Violent and street crime incidents per 1000 inhabitants | 1.021 (0.015) |
Recreation | Number of recreational, community, and sports facilities | 1.056 *** (0.008) |
Culture | Number of cultural facilities | 1.015 (0.017) |
Leisure | Number of gastronomy, nightlife, and general leisure facilities | 1.002 (0.002) |
**Other** | | |
Terrain | Difference in elevation to mean neighborhood elevation (in 100 m); high value ≙ hillside location | 0.919 *** (0.004) |
Geocoding control variable | Geocoding match rate (in %); high value ≙ high completeness | 1.018 *** (0.002) |
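The IRR values in the table can be turned into effects for changes larger than one unit by exponentiation, since $e^{\hat{\beta}\,\Delta x} = \mathrm{IRR}^{\Delta x}$. The short sketch below illustrates this arithmetic for two rows of the table. The helper name `expected_count_multiplier` is hypothetical, and the rule only applies ceteris paribus to factors that enter the model linearly; firm density and population also have squared terms, so it would not apply to them as written.

```python
def expected_count_multiplier(irr, delta_units):
    """Multiplicative change in the expected firm count.

    `irr` is the incidence rate ratio for a one-unit change, so a change of
    `delta_units` multiplies the expected count by irr ** delta_units.
    """
    return irr ** delta_units


# IRR values copied from the table above.
motorway_irr = 0.977   # per km of distance to the nearest motorway access
graduates_irr = 1.063  # per percentage point of graduate employees

ten_km_farther = expected_count_multiplier(motorway_irr, 10)
five_points_more = expected_count_multiplier(graduates_irr, 5)

print(f"10 km farther from a motorway access: x{ten_km_farther:.3f}")
print(f"+5 percentage points of graduates:    x{five_points_more:.3f}")
```

Under these assumptions, a cell 10 km farther from a motorway access is expected to hold roughly 21% fewer software firms, and a 5 percentage point higher graduate share corresponds to roughly 36% more.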

GoF Measure | Poisson | Negative Binomial |
---|---|---|
Pseudo-R² | 0.58 | 0.33 |
RMSE | 1.36 | 483,735 |
AIC | 211,603 | 179,705 |
BIC | 211,892 | 180,004 |

