Data-Driven Analysis on Inter-City Commuting Decisions in Germany

Understanding commuters’ behavior and influencing factors becomes more and more important every day. With the steady increase of the number of commuters, commuter traffic becomes a major bottleneck for many cities. Commuter behavior consequently plays an increasingly important role in city and transport planning and policy making. Although prior studies investigated a variety of potential factors influencing commuting decisions, most of them are constrained by the data scale in terms of limited time duration, space and number of commuters under investigation, largely owing to their dependence on questionnaires or survey panel data; as such only small sets of features can be explored and no predictions of commuter numbers have been made, to the best of our knowledge. To fill this gap, we collected inter-city commuting data in Germany between 1994 and 2018, and, along with other data sources, analyzed the influence of GDP, housing and the labor market on the decision to commute. Our analysis suggests that the access to employment opportunities, housing price, income and the distribution of the location’s industry sectors are important factors in commuting decisions. In addition, different age, gender and income groups have different commuting patterns. We employed several machine learning algorithms to predict the commuter number using the identified related features with reasonably good accuracy.


Introduction
With ongoing urbanization, commuting is becoming an increasingly important part of modern society. It is well known that during morning and evening peak commuting periods on weekdays, roads become highly congested due to a large number of commuters, placing a severe burden on transport infrastructure systems [1]. In the recent past, the number of inter-city commuters in Germany increased substantially (27.9%), from 2,442,630 in 2004 to 3,123,924 in 2014, while the country's whole population decreased slightly (from 81,646,474 to 81,450,370) during the same period [2]. The growth of inter-city commuters can lead to personal, environmental and societal changes such as increased traffic loads and frequent congestion, more road/railway work, higher levels of pollution, lower life satisfaction and the need for subsidies [3]. It has been demonstrated that urban planning is highly associated with commuting costs and with NOx and CO2 emissions from road traffic [4,5]. With the current discussion on environmental protection and sustainable societies, we believe that it is of high importance to understand inter-city commuting in more detail. It is especially vital to understand the volume and patterns of people's inter-city commuting (a commuting mode that typically connects the residents of the periphery with big cities), to find the underlying infrastructural bottlenecks and to suggest possible responses, as the majority of inter-city/regional commuting is by car [6].
In the scope of our study, inter-city commuters are socially insured employees whose work municipality differs from their residential municipality [7]. As commuters typically base their family and job location planning on several factors, we focus not only on the economic structure of the city but also on the living standard and commuting patterns, which have been largely ignored in previous studies. More specifically, we aim to conduct a data-driven analysis of the potential factors behind inter-city commuting decisions in Germany: the labor and real estate situation (without relying on questionnaires and surveys), commuting patterns, cities' economic structure such as gross domestic product (GDP) and industry sectors.
In this work, we use only publicly available data so that the data sources are easily accessible and our results can be replicated. By integrating multiple datasets from different sources covering more than two decades, we study features that have not been considered or were not available before but are very important for understanding inter-city commuting behavior, such as GDP, housing purchase and rental prices, the job market in different industry sectors and commuting patterns. In addition, with our time-series data, we leverage machine learning approaches to perform commuter prediction with reasonably good performance, which has not been seen in previous efforts.
Section 2 presents related works. After Section 3 describes our data sources and methods, Section 4 provides our in-depth analysis results on these data including commuter prediction results, and Section 5 discusses additional issues. Section 6 is the conclusion.

Literature Review
Over the decades, sociologists, economists, geographers and computer scientists have studied commuting from different angles. With the increasing importance of inter-city commuting, one focus of these studies has been the factors influencing inter-city commuting decisions.
First, income has been found to be a determining factor for long-haul commuting [8][9][10][11]. For instance, Dauth and Haller [11] showed that the willingness to pay for a shortened commuting distance is no lower than the income increase for the people who seek a job change for the same commuting distance.
Second, location is another determining factor for commuting decisions. Clark [12] observed that households prefer to move closer to the workplace if they lived far from the workplace before, and the commuting time is significant for relocation decisions. Kalter [9] noted that most long-haul commuters come from small municipalities. Eckey et al. [13] as well as Haas and Hamann [14] found that workers in west Germany are more willing to commute than those from east Germany. Andersson et al. [15] showed rural-to-urban long-distance commuting is rapidly increasing in Sweden, and rural residents working in large cities are better paid, better educated and younger than workers in rural municipalities.
Third, commuting distances play an important role in commuting decisions. Instead of focusing on residential or workplace location alone, Simpson [16] modeled both workplace and residential locations and found that such a joint model, which considers the commuting distance between the two locations, explains commuting behavior well. Levinson [17,18] also established that there is an interdependence between the workplace and residential locations. Kalter [9] showed long-haul commuters tend to remain in their current living place-workplace combination.
Fourth, different types of work influence commuting decisions differently. Huinink and Feldhau [19] showed that women with a part-time job and a long-distance commute have much lower fertility intention than women with full-time or self-employed jobs. Ding and Bagchi-Sen [10] found that workers in different industry categories differ in the distances they are willing to commute. Eckey et al. [13] found that in general, blue-collar workers are more willing to commute than white-collar ones. However, Haas and Hamann [14] noted that the most highly qualified employees tend to commute.
Fifth, gender differences play a regulatory role in commuting decisions. It has been found that male workers (80.5%) are more willing to commute than female workers [9], and males commute longer than their female partners [20]. Reuschke [21] showed that the vast majority (87.6%) of female commuters are childless; 35% of female commuters have a second residence due to their partners. However, for female workers fertility intention does not play a significant role in the decision to commute, while getting pregnant has a high negative correlation with commuting [19].
Other factors related to commuting decisions that have been studied include age [9], educational background [9], nationality [13], housing costs [22], household composition (with one or two workers) [23] and levels of well-being [24]. For example, Kalter [9] found that workers who are younger or with high school diplomas are more willing to commute. Eckey et al. [13] showed Germans are more willing to commute than foreigners in Germany. Mitra and Saphores [22] found that housing costs have a strong influence on long-distance commuting. Dickerson et al. [24] showed that longer commutes are not generally associated with lower levels of well-being.
An overview of different datasets, methods and factors studied in the related literature is given in Table 1. To summarize, while sociologists mostly focus on the reasons behind commuting on a personal basis, primarily based on surveys and questionnaires, economists focus on the trend of commuting at an aggregate level and place more emphasis on the economic backgrounds and cost benefits for the commuters and regions, using statistical data. The major data sources of both types of studies are panels and questionnaires, in addition to statistical data, and could be complemented by integrating multiple datasets available from heterogeneous sources, which forms the starting point of this paper.

Data Sources
We scraped the commuting data, employment data including industry sector data, unemployment rate and income data from the Federal Employment Agency [7], the house and apartment price data from ImmobilienScout24 [25] and the distance data from the Google Maps API [26] for each city and county in Germany, plus GDP data from GovData [27] at the county level. In total, we collected and computed 16 categories of data, and an overview of these data is shown in Table 2. They represent four perspectives (labor market, economic structure, real estate market and commuter pattern) which are of potential relevance for commuting decisions. In addition, auxiliary information such as age range, gender, nationality and GPS coordinates is included where available. For a better understanding of these data, besides their basic structure and some extreme cases, we chose four cities in State Lower Saxony (Göttingen, Braunschweig, Hannover and Wolfsburg) as examples. Taken together, these roughly represent the industry distribution of Germany: Hannover is the capital of State Lower Saxony; both Wolfsburg and Braunschweig are known for their industry, which has expanded since the 1990s (leading to an increased need for workforce); Göttingen is a representative German university city and is best known for its university.

Commuting Patterns
The commuting data on a municipality basis consist of about 14,000 municipalities from over two decades. Table 3 shows the basic statistics of commuters from the perspective of the total 11,385 German municipalities in 2017. It shows the commuter distribution is heavily unbalanced: a small number of cities have high numbers of commuters and heavily outweigh many small cities. With a mean of 2820 incoming and 3010 outgoing commuters, the median (50%) is only 232 incoming and 651 outgoing commuters. The 75% quartile of the incoming (outgoing) commuters is only 40.4% (63.2%) of the mean. There is also an extremely high standard deviation throughout the whole dataset. Typically, a county consists of a central city and more affordable peripheries (e.g., towns and villages), which generally do not provide as many jobs as the central city. Thus, on average, there are more incoming than outgoing commuters in the central cities. On the contrary, there are fewer incoming commuters than outgoing commuters in the peripheries.
For commuting distance, we used the Google Maps API to scrape the coordinates of all cities and counties in Germany. We then classify some cities as metropolitan regions based on GDP, and calculate the nearest metropolitan area for each city. The distances from cities to their nearest metropolis are calculated as great-circle distances (Table 3) on a sphere of radius R, where R is the approximate radius of the earth in km (6373), and lat_1, long_1, lat_2 and long_2 are the latitude and longitude GPS coordinates of the two cities, respectively. Using the coordinates, we are able to calculate the mean commuting distance for households living in each city. We use a weighted mean to take into account the number of commuters. For each city we calculate:

$$ \mathrm{mean} = \frac{\sum_i c_i \, d_i}{\sum_i c_i} $$

where c_i is the number of commuters between the current city and workplace i, and d_i is the distance between the two cities. Therefore, mean is the mean distance between the city and its workplaces, weighted by the number of commuters.
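The distance computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the haversine form of the great-circle distance (the text specifies only R and the two coordinate pairs), and the function names `haversine_km` and `weighted_mean_distance` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6373.0  # approximate earth radius R stated in the text


def haversine_km(lat1, long1, lat2, long2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees.

    Assumption: the haversine formula is used; the text only names R and
    the coordinates of the two cities.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(long2 - long1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Guard against rounding pushing sqrt(a) marginally above 1.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def weighted_mean_distance(flows):
    """Commuter-weighted mean distance for one city.

    flows: iterable of (c_i, d_i) pairs, where c_i is the number of
    commuters to workplace i and d_i is the distance to it in km.
    """
    flows = list(flows)
    total_commuters = sum(c for c, _ in flows)
    if total_commuters == 0:
        raise ValueError("city has no commuters")
    return sum(c * d for c, d in flows) / total_commuters
```

For instance, a city with 100 commuters travelling 10 km and 300 commuters travelling 50 km has a weighted mean distance of (100 · 10 + 300 · 50) / 400 = 40 km, whereas the unweighted mean of the two distances would be 30 km.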
The ratios of incoming and outgoing commuters to the resident population, expressed as percentages, in the four example cities are shown in Table 4: Wolfsburg has the highest percentage of incoming commuters, at 64%; the second highest, though standing at only 33%, is Hannover; Braunschweig and Göttingen are very close with 27% and 26%, respectively. The outgoing commuters do not vary significantly for the four cities, ranging between 8% and 14%. The county-level commuting data include the same type of information as the municipality data, with additional information such as gender and nationality. Note that they do not distinguish places within the same county (e.g., the distance between Herzberg am Harz and Hann. Münden is 70 km, but both are in the same Göttingen county). As shown in the statistics in Table 5, like the municipality data, the data on the county level are also very unbalanced, with the mean deviating heavily from the median. This is again due to few large counties and many small counties. Table 5 shows that there are more male commuters than female commuters (from the perspective of residence place), confirming the previous studies based on surveys and questionnaires [9,28,29]. It also shows that the number of commuters who are native Germans is about 8-9 times the number of commuters with foreign nationalities per German county on average in 2017, which is approximately the same as the ratio between the total number of native employees and that of foreign employees in Germany in the same year. Hence, we do not explore the nationality factor of commuters further here.

Labor Market
We scraped the employment (per sector) and unemployment data for each city and county from the Federal Employment Agency. Figure 1 shows four distinct exemplary cities within geographical proximity with their six most important industry branches. We can see that among all employed workers, most (84%) work in the tertiary sector (e.g., corporate management, healthcare, education), including less than 1% in the higher education sector, and only 15% in the secondary sector (e.g., machine and vehicle technology, construction work). Table 6 shows some example cities with different unemployment situations in 2017, including several big cities and four cities in the state of Lower Saxony.

Economic Structure
We scraped GDP data from "GovData" [27] for German cities from 2000 to 2016, including GDP per city, per employee, per resident and per industrial sector. An example of GDP data is shown in Table 7, which leaves out the GDP per industrial sector for simplicity. Table 8 shows example median incomes for the cities and counties with the highest and lowest median income. This shows an income disparity in Germany: more than two decades after the German reunification [30], the median income of eastern Germany is still 19% lower than in the west; the top ten cities with the highest income are all in western Germany, while all of the five regions with the lowest income are in eastern Germany. Given the continuously large number of workers moving from east Germany to west Germany [31], we conjecture that the median income difference between a large city and its adjacent regions will also influence commuting behavior, which will be examined in the next section.
For each county/city we obtained data about the median income of employees from the Federal Employment Agency, including the median incomes of men, women and the residents in each region (city/Stadt or county/Landkreis). They are further split into three age groups, "15 to 25", "25 to 55" and "55 to 65" years old, and three educational levels, "no professional degree", "recognized professional degree" and "academic degree". A small example of the data can be seen in Table 9. The dataset contains the aggregated information of all employees working in each region for the "place of work" field, including incoming commuters but excluding outgoing commuters; whereas, "place of residence" includes everybody living in the city and excludes incoming commuters. Interestingly, even though it differs on a regional level, on average men are earning 500 € more than women per month. This may be a possible factor to explain the observation in [13], where men are found to be typically more willing to commute than women.
Overall, we can see that the median gross income for the "place of work" is higher than the income for the "place of residence". This suggests that commuting is associated with higher income; it therefore supports the conjecture that commuting contributes to the income discrepancy between men and women (https://statistik.arbeitsagentur.de/Statistikdaten/Detail/201712/iiia6/beschaeftigung-sozbe-qheft/qheft-d-0-201712-xls.xls?blob=publicationFile&v=1, accessed on 11 April 2021).

Real Estate Market
We scraped the house and rental prices via the ImmobilienScout24 API [25]. Table 10 shows an example of apartment rental prices. With the differentiation between cities and counties we have 419 data points. In this example, we include Munich because it indicates the vast difference between the house and apartment prices in Germany.
Since we have only the present housing price data, we add a preprocessing step that incorporates knowledge from other sources to reflect the changes over time as much as possible. For example, we superimpose an increase in housing prices of 21.7% from 2015 to 2018 (this information is from the Federal Statistics Office [32]).
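One way to picture this adjustment is to back-cast an observed price using the reported total increase. The sketch below is an illustration under a stated assumption, not the authors' procedure: it spreads the 21.7% increase evenly as compound annual growth over the three years, and the function names are hypothetical.

```python
def annual_growth_factor(total_increase, span_years):
    """Constant yearly growth factor that compounds to the total increase.

    Assumption: growth is spread evenly (compounding) over the span; the
    text only reports the total increase between 2015 and 2018.
    """
    return (1.0 + total_increase) ** (1.0 / span_years)


def backcast_price(price_end, total_increase, years_back, span_years):
    """Estimate the price `years_back` years before the end of the span."""
    factor = annual_growth_factor(total_increase, span_years)
    return price_end / factor ** years_back


# Example with a hypothetical 2018 rent of 12.17 EUR per sqm:
rent_2015 = backcast_price(12.17, 0.217, years_back=3, span_years=3)
rent_2017 = backcast_price(12.17, 0.217, years_back=1, span_years=3)
```

Under this assumption, `rent_2015` comes out at about 10.00 EUR per sqm, since 12.17 / 1.217 = 10.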

Methods
We use statistical methods to pre-process the data to get an overall view of different potential factors including their dynamic characteristics (where available).
We use linear regression to analyze the influence of factors like housing prices, GDP and median income on commuting decisions, taking housing prices as a specific example.
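We use correlation to understand the potential factors related to commuting. To measure the correlation between variables x and y, Pearson's correlation coefficient is given by:

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where n is the number of observations and $\bar{x}$ and $\bar{y}$ are the sample means of x and y.

We use the following machine learning algorithms to predict the commuter number using the identified related features.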

• Linear regression: a simple regression approach used to predict a continuous output (here, commuter number) where there is a linear relationship between the features of the dataset and the output variable. It assumes the input features to be mutually independent.
• Decision trees: this approach first splits the dataset into smaller subsets and then makes predictions based on which subset a new example would fall into; it recursively runs this process until a good match is found. Decision trees make no assumptions about the distribution of the data and work well with collinearity between input features.
• Random forest: a random forest aggregates a multitude of decision trees at training time, each of which independently derives a prediction, and then returns the mean prediction (regression) of the individual trees. It is generally regarded as one of the more accurate general-purpose machine learning algorithms and works well for many datasets.
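The three model families listed above can be illustrated with a small standard-library sketch. This is a toy under stated assumptions, not the setup used in the study: it uses a single hypothetical feature, the function names are invented for illustration, and with only one feature the forest reduces to bootstrap aggregation of trees (a full random forest also samples features at each split). In practice a library implementation would normally be used instead.

```python
import random
from statistics import mean


def fit_linear(xs, ys):
    """One-feature least-squares line; returns a predict function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_tree(xs, ys, depth=3, min_leaf=2):
    """Regression tree on one feature: recursively choose the split that
    minimises the summed squared error around the two child means."""
    if depth == 0 or len(xs) < 2 * min_leaf:
        leaf = mean(ys)
        return lambda x: leaf
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = mean(left), mean(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:  # all feature values identical
        leaf = mean(ys)
        return lambda x: leaf
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    lo = fit_tree([x for x, _ in pairs[:i]], [y for _, y in pairs[:i]],
                  depth - 1, min_leaf)
    hi = fit_tree([x for x, _ in pairs[i:]], [y for _, y in pairs[i:]],
                  depth - 1, min_leaf)
    return lambda x: lo(x) if x <= threshold else hi(x)


def fit_forest(xs, ys, n_trees=25, seed=0, **tree_kwargs):
    """Average of trees, each fitted on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([xs[j] for j in idx], [ys[j] for j in idx],
                              **tree_kwargs))
    return lambda x: mean(t(x) for t in trees)
```

Each `fit_*` function returns a callable, so the three models can be compared on the same held-out points in the same way.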

Commuter Dynamics in Four Exemplar Cities
In the four example cities in State Lower Saxony (Göttingen, Braunschweig, Hannover and Wolfsburg), the number of commuters gradually increases, as shown in Figure 2. The blue line shows the number of incoming commuters, the yellow line shows the number of outgoing commuters and the gray line shows the number of employees living in the same places where they work over the years. We see that all commuter numbers increase, but the cities differ from each other. Wolfsburg shows the most visible increase, almost doubling its incoming commuters during 1994-2013 due to its increased employment opportunities. Braunschweig and Göttingen's increases of incoming commuters are more subtle but still easily observable. Hannover, on the other hand, seems to stagnate. For Göttingen, a small university town, the increase in incoming commuters is smaller.
The number of outgoing commuters stays almost the same for Wolfsburg, Hannover and Göttingen. Braunschweig, however, witnesses a big increase. Given its closeness to Wolfsburg and the massive increase in incoming commuters to Wolfsburg, this is likely due to the strengthened industry in Wolfsburg, to which many employees tend to commute.
Starting from 2014, the Federal Employment Agency provides additional information on how many people live in the same region as they work. It now treats the county of Göttingen as Göttingen instead of the city of Göttingen, as in the 1994-2013 data. This leads to an increase in both incoming and outgoing commuters for Göttingen in 2014.
From the information on "work (place) = residence (place)" in Figure 3, we can see that Göttingen and Braunschweig have the biggest proportions of non-commuting employees. Incoming commuters in Braunschweig grew close in number to the non-commuting employees in 2016-2017. Both Wolfsburg (as an industry city) and Hannover (as the state capital) have more incoming commuters than non-commuting employees. Incoming commuters in Wolfsburg are nearly twice the number of non-commuting employees.
To sum up, we can see that the increase of commuters depends heavily on the city's industry and economic development and the relationship with the adjacent cities.

Housing Prices: Statistics
Housing prices are important for commuting decisions [33,34]. We collected data for over 400 cities in 2019. For most of them, we have house and apartment prices, as well as the rental prices for each type. Furthermore, we have the mean living space and with that can calculate the mean price per sqm. This is the most important part of the data since it allows us to compare the cities based on their living price per sqm.
Housing prices differ greatly among German cities. Many regions in eastern Germany are known for having cheap property, as there are not as many jobs as in western Germany. In the industry sector data from the Federal Employment Agency, we see that there are a total of 150,000 reported jobs in eastern Germany, while there are 630,000 reported jobs in western Germany. The regions differ heavily in income as well. The median income for western Germany is 2700 € while the eastern Germany median is 2200 €. Therefore, it makes sense that the property prices for eastern Germany are lower than in western Germany. Due to the way ImmobilienScout24 returns the data, we could not classify the advertisements as belonging to eastern or western Germany. However, if we look at the cheapest property prices, we can verify that most of them are in regions in eastern Germany. This can be seen in Table 11. The five regions with the cheapest apartment rental prices, except for Grafschaft Bentheim, which is on the western border with the Netherlands, are in eastern Germany. Apart from some small secluded regions, this trend continues throughout our data.
It is well known that Munich is the most expensive city in Germany [35], followed by Frankfurt and Stuttgart; these three cities are important metropolises for the German industry. In Figure 4, we see the most expensive mean prices per sqm for buying or renting a house or apartment.
The bars indicate the renting price, and the graphs denote the buying price. We see that Munich is the most expensive city in which to either rent or buy an apartment or a house. The figure reflects the property market well, with Munich, Stuttgart, Frankfurt, Hamburg, Berlin, Cologne and Mainz in the top 20 most expensive properties in all four categories.

Commuting Distances: Statistics
Using the calculation method in Section 3.2, the statistical commuting distance data are computed in Table 12. The average commuting distance of 77 km is from the data on a regional level and therefore does not account for short-haul commuters. The minimal commuting distance is mostly for commuting between cities within the same county. Because the Federal Employment Agency lists them as different areas, they have a very short commuting distance with a very high number of commuters. The maximal value of 183.2 km is for Birkenfeld where many employees are commuting to Bad Kreuznach, which is 140 km away. With both the mean and the median at about 70 km, we can see that these data are balanced and represent the long-distance commuters well. The exact distribution of commuting distances can be seen in Figure 5. The diagram shows the number of cities corresponding to the average commuting distance. The x-axis shows the intervals the cities belong to. These buckets have a size of 15 km each. The y-axis denotes the number of cities that are part of the respective bucket. For example, the column on the far left has 84 cities with an average commuting distance of 53 km to 68 km. The orange line represents the cumulative total, which is almost at 50% after the first two bars. It further shows that most of the commuters are commuting medium distances, between 38 km and 98 km, accounting for 75% of the total data. We can also see that only 10% of the cities have very long or very short distance averages, below 38 km or over 113 km. Our results seem to deviate somewhat from Schulze [36], who found that most of the commuters commute up to 25 km. The reason is that Schulze used a different data source which can directly compute commuting distances, including for both intra-regional/city and inter-city commuters.
With the Federal Employment Agency dataset, we have only aggregated information about inter-city commuters; due to the data provider's privacy restrictions we had to calculate the commuting distance ourselves.
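The bucketing behind a histogram such as Figure 5 (fixed-width distance intervals plus a cumulative share line) can be sketched as follows. This is a minimal illustration with hypothetical function names and toy input values; the bucket start of 38 km in the example is an assumption taken from the interval boundaries mentioned in the text, and values outside the range are clamped into the edge buckets.

```python
def bucket_counts(distances, start, width, n_buckets):
    """Count how many cities fall into each fixed-width distance bucket.

    Bucket k covers [start + k * width, start + (k + 1) * width).
    Values outside the covered range are clamped into the first or last
    bucket (an illustrative choice, not taken from the text).
    """
    counts = [0] * n_buckets
    for d in distances:
        idx = int((d - start) // width)
        idx = max(0, min(n_buckets - 1, idx))
        counts[idx] += 1
    return counts


def cumulative_share(counts):
    """Running share of all cities after each bucket (the cumulative line)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no data")
    shares, running = [], 0
    for c in counts:
        running += c
        shares.append(running / total)
    return shares


# Toy example: five hypothetical city averages in km, 15 km buckets.
counts = bucket_counts([40, 55, 60, 70, 120], start=38, width=15, n_buckets=5)
shares = cumulative_share(counts)
```

Reading off `shares` at a given bucket gives the fraction of cities whose average commuting distance lies at or below that bucket's upper edge.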
Overall, the commuter data are not very balanced, with many small regions with few commuters and a smaller number of big cities with very many commuters. Additionally, the type of city plays a key role in the observable commuting patterns. With our regional data, we are able to validate the findings of previous studies, e.g., by confirming that male commuters outnumber female commuters.

Housing Prices vs. Commuters: Linear Regression Results
We investigate the influence of housing prices on the number of commuters. To illustrate this, we conduct regression studies on apartment rental prices vs. the ratios of incoming and outgoing commuters to the number of local employees. The cases of other prices (apartment buying prices, house rental prices, house buying prices) are similar and skipped here due to space limitations.
Two simple ordinary least squares (OLS) linear regression models are built for analyzing the relationship between apartment rental price (€ per sqm) and the ratio of commuters (against the local employees). The fit plots are shown in Figure 6. Both models suffer from heteroscedasticity, which we can detect from both White's test results (Table 13, p-value < 0.05) and the residual plots shown in Figure 7. To fix the heteroscedasticity, we apply the heteroscedasticity-consistent covariance matrix estimator [33].
Using OLS linear regression for the log transformation of the apartment rental price (€ per sqm), the resulting parameters are shown in Table 14; further model diagnosis reveals that the models' parameters are significant and that there is no heteroscedasticity anymore (p-value > 0.05). We can see the relationship between the number of commuters and the logged unit price to rent an apartment in Figure 8. Both figures show an increasing trend, indicating a higher average number of corresponding commuters for a higher rental price. Furthermore, the number of incoming commuters increases faster with a higher rent cost than the number of outgoing commuters. The number of outgoing commuters also increases, likely due to being in bigger cities with more inhabitants.
Again, we see that the incoming commuters increase quickly for higher apartment prices; the deviation is high for higher apartment prices due to the distribution of the data. Therefore, we rely on medium house prices and medium apartment prices.
Overall, the more expensive the real estate, the more employees will commute over long distances. This is in accordance with Boje et al. [34], who stated that according to location theory, rationally acting individuals compare the resulting benefit with the costs of commuting. If the costs outweigh the benefits, e.g., because they would have to pay a high percentage of their income for rent, they would give up renting in the workplace city and consider commuting instead. This behavior can be observed in our data, e.g., fewer employees in cities with low housing prices will decide to commute than in cities with higher housing prices.
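To illustrate the regression setup described in this subsection, the following minimal sketch fits a one-regressor OLS model on the logged rental price and compares the classical standard error of the slope with a heteroscedasticity-consistent one. It rests on several assumptions: the text does not state which variant of the estimator in [33] was applied, so the HC0 (White) form is used here; White's test itself is not reproduced; and the function name `ols_with_hc0` and all data values are hypothetical.

```python
import math


def ols_with_hc0(x, y):
    """Fit y = a + b*x by OLS; return (a, b, classical_se_b, hc0_se_b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    dx = [xi - x_mean for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - y_mean) for d, yi in zip(dx, y)) / sxx
    a = y_mean - b * x_mean
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Classical standard error: assumes a constant error variance.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # HC0 standard error: each squared residual is weighted by the squared
    # deviation of its x value from the mean, so no constant variance is assumed.
    se_hc0 = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return a, b, se_classical, se_hc0


# Hypothetical data: apartment rent (EUR per sqm) and the ratio of
# incoming commuters to local employees for eight cities.
rent = [5.0, 6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 18.0]
ratio = [0.20, 0.28, 0.25, 0.40, 0.35, 0.60, 0.45, 0.90]

log_rent = [math.log(r) for r in rent]  # log transformation as in the text
intercept, slope, se_plain, se_robust = ols_with_hc0(log_rent, ratio)
print(f"slope={slope:.3f}, classical SE={se_plain:.3f}, HC0 SE={se_robust:.3f}")
```

The point estimates are identical under both approaches; only the standard errors, and therefore the significance statements, change when the robust estimator is used.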

Housing Prices and Income
We also analyze the relationship between housing prices and the median income. Similar to the previous subsection, our first results show heteroscedasticity, which can again be corrected with the heteroscedasticity-consistent covariance matrix estimator. The detailed results are omitted here due to space limits; they show that with an increasing median income, the apartment rent rises as well.
The result is expected, as it is logical that the real estate market and the median income are related to each other. Nonetheless, as the income has a strong link to the apartment and housing prices, it indicates a link to the commuter data as well.

GDP and Median Income
In this subsection we take a closer look at our median income and GDP data. While the individual city-level GDP data depict the productivity of the city well, the aggregated GDP information on the state level (Figure 9) shows a clear trend in the German economy distribution. As shown in Figure 9, North Rhine-Westphalia has the highest GDP, followed by Bavaria and Baden-Württemberg. North Rhine-Westphalia is well known for the Ruhrgebiet, which is a composite of industrial cities and thus a big metropolitan area. Bavaria also has important cities for the German industry like Nuremberg and Munich. Hamburg and Berlin are in the 4th and 5th place, respectively. This is no surprise, as these two cities are the biggest in Germany and hence have a great influence on the German economy. Overall, we see that the states in west Germany have higher GDP than their counterparts in east Germany.

Correlation Results
After analyzing all the data separately, we study their correlation with each other with a focus on the correlation with the commuting data.
To understand the most important reason behind commuting, we limit the correlation matrix to the 16 most important factors (see Table 2). The result is shown in Figure 10.
Beyond the highest correlations between the jobs in any two of the three industrial sectors (primary, secondary and tertiary), another high correlation is found between incoming commuters and outgoing commuters (in percentage of local employees). Except for commuting-related factors, the highest negative correlation is found between median income and metropolitan distance.
Figure 10. Correlation Matrix.
Now we examine the factors behind commuting based on this correlation matrix:
1. The matrix shows that the most important factor behind commuting is the GDP per resident of the city, as among all factors it has the highest Pearson's correlation coefficient with incoming commuters in percentage of the local employees (0.57) and the lowest (and negative) coefficient with outgoing commuters in percentage of the local employees (−0.22). This is somewhat surprising, as we expected that the median income and housing prices may have a more important influence on commuting decisions.
2. The median incomes of work and living places are also important. The median income in the place of work is highly influential on incoming commuters, as more employees may commute if they receive a higher income. How much they earn in their residence is influential to both commuting groups. The income in the place of residence is a main factor of commuting, either leaving the city or coming there, because if it is high, many people will decide to commute there; if it is low, more people will leave the region to work somewhere else.
3. The third most important factor for incoming commuters is the apartment price; more expensive apartments seem to be a factor related to employees commuting. A plausible reason behind this relation is that if the cost-benefit ratio of buying an apartment is bad, the employees may consider commuting over longer distances. For outgoing commuters, the distance to the next metropolitan area is very important. This means that if the distance to the next metropolitan area increases, employees are less likely to leave their region to commute, given the cost-benefit ratio of long-haul commuting.
4. An interesting anti-correlation can be found between the outgoing commuters and the metropolitan distance. If the metropolitan distance increases, the outgoing commuters decrease, as their commuting distance would get longer and most likely become unprofitable.
5. A surprisingly high correlation can be found between commuters and the unemployment data. This has a big influence on both incoming and outgoing commuters. This may be related to the fact that most bigger cities tend to have a higher unemployment rate.
6. In regard to jobs (workplaces), the secondary and tertiary sectors are more influential on commuters than the primary sector, likely due to their high number of employees. For example, 82.3% of jobs were in the tertiary sector and 17.2% in the secondary sector, in contrast to 0.5% in the primary sector as of 2017 [32]. Workplaces in the primary sector even show an anti-correlation with commuters, indicating that most farmers tend to not commute.
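The Pearson coefficients discussed above can be computed as in the following sketch. It is not the analysis code of this study: the helper names `pearson_r` and `correlation_matrix`, the three example factors and all values are hypothetical and serve only to show how one cell of a correlation matrix is obtained.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    x_dev = math.sqrt(sum((a - x_mean) ** 2 for a in x))
    y_dev = math.sqrt(sum((b - y_mean) ** 2 for b in y))
    return cov / (x_dev * y_dev)


def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {(p, q): pearson_r(columns[p], columns[q]) for p in names for q in names}


# Hypothetical per-city factors (five cities).
factors = {
    "gdp_per_resident": [25.0, 30.0, 35.0, 50.0, 70.0],
    "incoming_pct": [20.0, 35.0, 30.0, 60.0, 80.0],
    "metro_distance_km": [90.0, 70.0, 75.0, 30.0, 10.0],
}
matrix = correlation_matrix(factors)
for (p, q), r in matrix.items():
    if p < q:  # print each unordered pair once
        print(f"{p} vs {q}: r = {r:+.2f}")
```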

Commuter Prediction Results
As commuting is an important part of social life, for city and infrastructure planners, it is helpful to predict the commuting trend for the next years. Since the data collection of the Federal Employment Agency changed in 2013, we mainly focus on predictions using data from 1994 to 2012 to predict the number of commuters in 2013 for each city.
First, we generate our time series data using the TimeSeriesSplit cross-validator of scikit-learn, which splits the commuter data into different time frames. We then train a linear regression model (with heteroscedasticity detection and correction procedures), a decision tree model and a random forest model (with 100 decision trees as baseline), respectively, to predict the incoming and outgoing commuters for each city in 2013.
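The idea behind such a time-ordered split can be sketched without any library: every fold trains on an initial block of years and tests on the block that follows, so no future data leaks into training. The helper `expanding_window_splits` below is a hypothetical stand-in written for illustration, not scikit-learn's implementation, and the choice of three splits is an assumption, since the number of splits is not stated in the text.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with training windows that only grow."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        # Train on everything before `start`, test on the next block.
        yield list(range(start)), list(range(start, start + test_size))


years = list(range(1994, 2013))  # 1994..2012, the training period named in the text
folds = [
    ([years[i] for i in train_idx], [years[i] for i in test_idx])
    for train_idx, test_idx in expanding_window_splits(len(years), n_splits=3)
]
for train_years, test_years in folds:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_years[0]}-{test_years[-1]}")
```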
Three metrics are used to measure the prediction accuracy: (1) The mean absolute error (MAE) states by how many commuters the prediction is off on average. (2) The mean squared error (MSE) measures the average of the squares of the errors; the closer to zero the MSE is, the better. (3) The root mean squared error (RMSE) is the square root of the MSE and measures the accuracy of a forecast; again, the closer to zero the better, where a value of zero would mean that the prediction is perfect.
The accuracy results for incoming and outgoing commuters in 2013 using the 1994-2012 data are shown in Table 15. The following observations can be made: In general, linear regression yields the worst performance as the input features do not hold collinearity; meanwhile, decision trees achieve much reduced MAE, MSE and RMSE. Random forest provides further improvements in prediction accuracy. One exception is that the MSE and RMSE for predicting outgoing commuters are better with linear regression than with the decision tree or random forest algorithms, which may be attributed to the limited features available for the better balanced outgoing commuter data; the concrete reasons remain to be investigated.
Overall accuracy is reasonably good, considering the mean and median (50th percentile) commuter numbers (see Table 3) of incoming commuters (2820 and 232) against their MAE (14.36 in the case of random forest, 18.38 for decision tree), and of outgoing commuters (3010 and 651) against their MAE (41.97 for random forest, 44.58 for decision tree). This reflects only roughly 0.5-6.8% of absolute errors on average in the prediction.
The prediction accuracy for incoming commuters is generally better than that for outgoing commuters. This is a consequence of the highly unbalanced commuter data: compared to outgoing commuters, the distribution of incoming commuters is skewed much more heavily toward cities with small numbers (see Table 3). When the overall number of incoming commuters for a city is small, it is easier to predict with a lower MAE than when the number is large.
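For reference, the three error metrics and the comparison of the MAE against the average commuter number can be written down in a few lines. The function name `regression_errors` and the commuter numbers below are hypothetical; they are not taken from Table 15.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE and RMSE for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Hypothetical commuter counts for four cities.
actual = [120, 250, 980, 4300]
predicted = [130, 240, 1000, 4250]

scores = regression_errors(actual, predicted)
# Relative error: MAE as a share of the mean number of commuters.
relative_mae = scores["mae"] / (sum(actual) / len(actual))
print(scores, f"relative MAE = {relative_mae:.2%}")
```

Because the errors are squared before averaging, a single large miss (here the city with 4300 commuters) dominates the MSE and RMSE much more than the MAE, which is one reason why the rankings of models can differ between the metrics.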
We then examine how the number of decision trees affects the prediction accuracy. We try values as low as 10 and as high as 300 decision trees. The corresponding MAE can be seen in Figure 11. We see that the MAE fluctuates between 14.2 and 14.7 after 90 estimators. Hence, it does not make much sense to increase the number of trees over 100. The low MAE at 70 estimators is most likely due to the randomness of the trees.
The feature importance of the decision trees (see Table 16) shows that the last two years (i.e., year 2011 and year 2012) and the fourth-to-last year (i.e., year 2009) are the most important ones. This result is expected, as we are working with a time series and the number of commuters of the next year is mostly influenced by the most recent data.

Discussion
Although this work focuses on the Germany case, we believe the methodology proposed in this paper can be extended for studying commuting behaviors in other countries, as most countries have published their per-city level employment, income, GDP and commuter information online, and there are abundant other sources like LinkedIn and Facebook as well as real estate market websites to gain access to further information.
Furthermore, it may be useful to include the total number of residents (rather than just socially insured employees) in the analysis, which covers the whole commuter population such as students, who may contribute to the peak hour congestion. More studies on commuting distances may also be useful to understand the commuting behavior from cost-benefit tradeoffs.
Additionally, the social, educational and medical facilities could be considered as potential additional factors. Including data like the number of hospitals, doctors or kindergartens, or even green areas and points of interest, may be helpful for better understanding commuter decisions and for the prediction of commuters. Moreover, with the increase in housing prices over the last years, we think that it could be interesting to perform an in-depth analysis of the connection between the real estate market and commuters. We only had the house price data for one year, so looking at other historic data sources may reveal new information.
Our commuter prediction is currently based only on our time-series commuter data during 1994-2013; it can be extended to the later (2014-2018) data, which contain richer information such as housing prices, GDP and jobs in different sectors in each county or city. The results can still be improved by future fine-tuning of the models and feature engineering, and are subject to further analysis on how individual factors affect the performance of commuter predictability. Nonetheless, our initial results show that even with simple methods a reasonably good prediction can be achieved. This will bring value as it helps city and infrastructure planners to better understand the commuting trend and deploy better countermeasures, e.g., for clogged roads or traffic jams in the short term, or developing alternative mobility options other than cars in the longer term.
Lastly, the current COVID-19 pandemic may significantly change commuting behavior. This may open a large body of new insights for future exploitation.

Conclusions
The question of what leads to commuting is a critical issue for modern society's development. Most prior studies focused on a small set of factors constrained by limited scale in terms of timespans, space and commuter numbers. To fill this gap, in this paper, we explored a big data approach, by collecting data from multiple publicly accessible sources and performing a systematic analysis on the potential influencing factors from four perspectives (the cities' economic structure, labor and real estate markets as well as commuting patterns). We found that the GDP, the median income and the price of buying or renting an apartment or a house in potential places for work and residence, as well as their distance to the next metropolitan area, are key factors in the decision to commute. We showed these main driving factors behind commuting in our data, confirming some findings in previous work and offering some new insights such as GDP, detailed categories of housing prices and job market in different sectors with the aid of much richer data sources. We hope that such a data-driven approach will open this field of study to more coverage in the future, as commuting is an important part of daily life in Germany (and worldwide).
Additionally, we leveraged several machine learning models to predict the number of commuters. Our results show it is possible to forecast the commuters quite precisely.
Author Contributions: Conceptualization, H.C. and X.F.; methodology, X.F. and H.C.; data collection and processing, S.V.; writing-original draft preparation, S.V. and H.C.; writing-review and editing, H.C. and X.F.; data curation, X.F.; supervision, H.C. and X.F. All authors have read and agreed to the published version of the manuscript.