Street Network Models and Measures for Every U.S. City, County, Urbanized Area, Census Tract, and Zillow-Defined Neighborhood

OpenStreetMap provides a valuable crowd-sourced database of raw geospatial data for constructing models of urban street networks for scientific analysis. This paper reports results from a research project that collected raw street network data from OpenStreetMap using the Python-based OSMnx software for every U.S. city and town, county, urbanized area, census tract, and Zillow-defined neighborhood. It constructed nonplanar directed multigraphs for each and analyzed their structural and morphological characteristics. The resulting data repository contains over 110,000 processed, cleaned street network graphs (which in turn comprise over 55 million nodes and over 137 million edges) at various scales — comprehensively covering the entire U.S. — archived as reusable open-source GraphML files, node/edge lists, and GIS shapefiles that can be immediately loaded and analyzed in standard tools such as ArcGIS, QGIS, NetworkX, graph-tool, igraph, or Gephi. The repository also contains measures of each network’s metric and topological characteristics common in urban design, transportation planning, civil engineering, and network science. No other such dataset exists. These data offer researchers and practitioners a new ability to quickly and easily conduct graph-theoretic circulation network analysis anywhere in the U.S. using standard, free, open-source tools.


Introduction
Urban planners and transportation engineers have examined and modeled street networks for decades to explore household travel behavior, accessibility and equity, urban form and design patterns, connectivity, and centrality. Complex networks such as street networks have also been explored from the perspective of statistical physics to assess structure and performance [27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44]. However, large volumes of cross-sectional street network data, in a format well-suited for graph-theoretic analysis, have often been difficult to come by, especially in an open-source, scalable, automatable way. This paper presents a new data repository to address this gap. It describes a research project that downloaded raw OpenStreetMap street network data for the entire U.S., cleaned these data, then constructed graph-theoretic models of these data at multiple scales for fast and rigorous urban analysis. It then calculated dozens of metric and topological measures of each network, nationwide. These models and measures can save researchers weeks or months of ad hoc data collection and analysis. They are even more useful for urban planners and policymakers, who often lack the technical capacity to write their own API scripts in custom query languages or to implement their own network science algorithms to understand the built form and urban circulation.
This new data repository provides four significant value-additions for urban science and analytics. First, the repository contains graph-theoretic models in common reusable formats, immediately suited data [77]. It is an open-source, worldwide, collaborative mapping project. OpenStreetMap provides geospatial information about streets and intersections, along with attribute data about road types, names, and (sometimes) speeds, widths, and numbers of lanes. However, its data cannot by default be automatically extracted into a graph-theoretic object for network analysis [78]. Furthermore, the network topology must be substantially cleaned to correctly represent nodes exclusively as intersections and dead-ends. Prior to this wider project, no tools or data repositories enabled the automatic, at-scale, configurable acquisition of OpenStreetMap data and construction of graph-theoretic data objects for analysis.

Methods
Given this background and motivation, this project originally developed a new open-source Python-based software toolkit called OSMnx [45]. OSMnx can download data from OpenStreetMap using configurable user queries, then construct a nonplanar, directed multigraph, and finally clean the topology [79]. Python was chosen to develop this tool for three reasons. First, Python is one of the most popular programming languages in the world, giving it a broad audience. Second, Python offers particularly simple and straightforward syntax, making it easy for newcomers to learn and lowering the scientific barriers to entry. Third, it has become a standard language for data science research and practice, with an extensive ecosystem of related packages for scientific, network, and geospatial analysis. Other similar tools in this space include dodgr (an R tool for distance calculations on weighted directed graphs), shp2graph (an R tool for converting spatial networks into igraph objects) [80], pandana (a Python tool for network accessibility queries) [81], the Urban Network Analysis Toolbox plugin for ArcGIS and Rhino [82], and GISF2E (a Python tool that processes shapefiles into edge lists) [78]. However, none of these offers the end-to-end capabilities of OSMnx to download network data directly, build models, clean the topology, and conduct statistical analyses and simulations.
To construct this data repository, this project used OSMnx to download network data and construct graphs for the drivable street networks of every U.S. city/town, county, urbanized area, census tract, and Zillow-defined neighborhood (Zillow is a large online real estate database company that defines neighborhood boundaries in many cities and towns). It saved these graphs as shapefiles, GraphML files (a standard, open-source format for graph serialization), and node/edge lists. Finally, it analyzed these networks to assess the geometric, topological, and morphological characteristics of U.S. street networks and how they reflect various urban planning eras, transportation technologies, economic conditions, and design paradigms. These study sites are presented in Figure 1.

Graph Production
To produce the data in the repository, we loaded five publicly-available input datasets defining these study sites into OSMnx version 0.  [84][85][86].
Zillow is a prominent real estate database company, and their boundaries dataset covers large U.S. cities. One at a time, for each city, county, urbanized area, tract, and neighborhood boundary defined in the above shapefiles, we downloaded the drivable public street network within its boundaries from OpenStreetMap using OSMnx. To acquire these raw data, OSMnx buffers each boundary polygon by 500 m then downloads the streets within this geometry, filtering them based on attribute data. It then constructs a nonplanar directed multigraph. In the case of one-way streets, a directed edge is added from the origin node to the destination node. However, for bidirectional streets, reciprocal directed edges are added in each direction between the two nodes.
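The edge-construction rule described above can be illustrated with a minimal sketch in plain Python. This is not OSMnx's implementation: the function name `build_directed_multigraph`, the record layout, and the toy street data are all hypothetical, chosen only to show how one-way streets yield a single directed edge, two-way streets yield reciprocal edges, and parallel streets between the same node pair are retained.

```python
from collections import defaultdict


def build_directed_multigraph(streets):
    """Build a directed multigraph from (u, v, oneway, length_m) records.

    Returns a dict mapping (u, v) to a list of parallel edges, so that
    two distinct streets between the same pair of nodes are both kept.
    """
    edges = defaultdict(list)
    for u, v, oneway, length_m in streets:
        # every street contributes an edge from its origin to its destination
        edges[(u, v)].append({"length": length_m, "oneway": oneway})
        if not oneway:
            # a bidirectional street also contributes the reciprocal edge
            edges[(v, u)].append({"length": length_m, "oneway": oneway})
    return dict(edges)


# hypothetical toy data: node IDs, one-way flag, length in meters
streets = [
    ("a", "b", True, 120.0),   # one-way street from a to b
    ("b", "c", False, 80.0),   # two-way street
    ("b", "c", False, 95.0),   # a second, parallel two-way street
]
graph = build_directed_multigraph(streets)
edge_count = sum(len(parallel) for parallel in graph.values())
print(edge_count)
```

Keying the dictionary on the ordered pair (u, v) and storing a list per key is one simple way to represent a multigraph; a library such as NetworkX provides a dedicated class for the same purpose.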
Next, OSMnx cleans the graph's topology to retain nodes only at intersections and dead-ends (detailed below). However, the full edge spatial geometry and length are retained in the cleaned graph. Then, it calculates node degrees and node types before truncating the graph to the original boundary polygon. This buffer/truncate workflow attenuates perimeter effects [87] and guarantees that true intersections are not incorrectly considered pseudo-nodes or dead-ends if an incident edge links to a node outside the boundary polygon. The final graph may be strongly connected, weakly connected, or neither. If it is not connected, OSMnx returns all connected components as a single graph object.
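The reason for computing node degrees before truncation can be seen in a small sketch. The example below is an illustration under simplifying assumptions rather than OSMnx's code: it uses a rectangular boundary instead of a buffered polygon, and the helper names, coordinates, and edges are hypothetical.

```python
def degrees(edges):
    """Count incident street segments per node from undirected (u, v) pairs."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def inside(xy, bounds):
    """Return True if point xy lies within the (minx, miny, maxx, maxy) box."""
    x, y = xy
    minx, miny, maxx, maxy = bounds
    return minx <= x <= maxx and miny <= y <= maxy


# hypothetical coordinates; node "d" lies just outside the boundary
coords = {"a": (1, 1), "b": (5, 1), "c": (5, 5), "d": (11, 1)}
edges = [("a", "b"), ("b", "c"), ("b", "d")]
boundary = (0, 0, 10, 10)

# Truncate first, then count: node b wrongly appears to join only two segments.
kept_edges = [
    (u, v) for u, v in edges
    if inside(coords[u], boundary) and inside(coords[v], boundary)
]
naive = degrees(kept_edges)

# Count on the larger graph, then truncate: node b keeps its true degree.
full = degrees(edges)
buffered = {n: d for n, d in full.items() if inside(coords[n], boundary)}

print(naive["b"], buffered["b"])
```

In this toy case the truncate-first ordering would misclassify the three-way intersection at b as a pseudo-node, which is the perimeter effect the buffer/truncate workflow is meant to avoid.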
To clean the graph's topology, OSMnx only retains nodes that represent the junction of multiple streets, as depicted in Figure 2. First, it identifies all non-intersection pseudo-nodes (i.e., all those that simply form an expansion graph). Next, it removes these pseudo-nodes while maintaining the true spatial geometry and attribute data of the street segment between the true intersection nodes. In strict mode, OSMnx considers two-way "intersections" to be topologically identical to a single street that bends around a curve. Conversely, to retain these intersections when the incident edges have different OpenStreetMap IDs, we can use OSMnx's non-strict mode. This cleaning step is critical to this dataset, providing additional value beyond extraction of data from OpenStreetMap itself by producing models more suitable to urban design/morphology and transportation analysis by representing intersections/dead-ends as nodes and linear blocks' sides as edges. Once we have constructed and cleaned the graph, we use OSMnx to save it to disk as node/edge lists (formatted as comma-separated values), as ESRI shapefiles to work with in GIS software, and as a GraphML file (an open-source, standard format for serializing graphs) to work with in common network analysis software packages such as NetworkX, Gephi, graph-tool, or igraph [88][89][90]. OSMnx's saved shapefiles include separate node and edge layers. When saving shapefiles, OSMnx simplifies the network to an undirected representation but preserves one-way origin-destination directionality as edge attributes for subsequent GIS-based routing applications.
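The pseudo-node removal step can be sketched on a small undirected graph as follows. This is a simplified stand-in for the procedure described above, not OSMnx's algorithm: it works on an undirected edge list, merges only segment lengths (the real cleaning also preserves full edge geometry and attributes), and behaves like strict mode in that it ignores OpenStreetMap IDs. The function name `simplify` and the toy data are hypothetical.

```python
def simplify(edges):
    """Remove degree-2 pseudo-nodes, summing the lengths of merged segments.

    edges: list of (u, v, length) undirected segments.
    Returns a new list of (u, v, length) segments between true nodes only.
    """
    edges = list(edges)
    while True:
        # index the segments incident to each node
        incident = {}
        for i, (u, v, _) in enumerate(edges):
            incident.setdefault(u, []).append(i)
            incident.setdefault(v, []).append(i)
        # a pseudo-node touches exactly two distinct segments
        # (a self-loop lists the same index twice and is skipped)
        pseudo = next(
            (n for n, idx in incident.items()
             if len(idx) == 2 and idx[0] != idx[1]),
            None,
        )
        if pseudo is None:
            return edges
        i, j = incident[pseudo]
        (u1, v1, len1), (u2, v2, len2) = edges[i], edges[j]
        # the far endpoints of the two segments become the merged edge's ends
        end1 = v1 if u1 == pseudo else u1
        end2 = v2 if u2 == pseudo else u2
        merged = (end1, end2, len1 + len2)
        edges = [e for k, e in enumerate(edges) if k not in (i, j)] + [merged]


# hypothetical street: a - p1 - p2 - b is one block with two bend points,
# and b is a true three-way intersection
raw = [
    ("a", "p1", 40.0),
    ("p1", "p2", 35.0),
    ("p2", "b", 25.0),
    ("b", "c", 60.0),
    ("b", "d", 50.0),
]
cleaned = simplify(raw)
remaining_nodes = {n for u, v, _ in cleaned for n in (u, v)}
print(sorted(length for _, _, length in cleaned))
```

Each pass removes one pseudo-node and reduces the edge count by one, so the loop terminates; a production implementation would instead trace whole paths between true nodes in a single pass.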

Graph Analysis
Finally, for this repository we calculate several metric and topological measures of each of these networks, common in the urban design, transportation planning, and network science disciplines. These measures include each network's area (km²), mean average neighborhood degree, mean average weighted neighborhood degree, average circuity, average clustering coefficient, average weighted clustering coefficient, average degree centrality, edge density (m/km²), average edge length (m), total edge length (m), count of intersections, intersection density (per km²), count of dead-ends, proportion of nodes that are dead-ends, count of three-way intersections, proportion of nodes that are three-way intersections, count of four-way intersections, proportion of nodes that are four-way intersections, average node degree k, count of edges m, count of nodes n, node density (per km²), maximum and minimum node PageRank (a measure of node importance based on the structure of incoming edges), proportion of edges that self-loop, street density (m/km²), average street segment length (m), total street length (m), count of street segments, and average number of streets per node.
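A few of these measures can be computed by hand on a toy network to make their definitions concrete. The sketch below is illustrative only and is not the code used to build the repository. It assumes an undirected, already-cleaned graph with planar coordinates in meters, and it takes average circuity to be total street length divided by the total straight-line distance between segment endpoints, which is an assumption about the definition. The function name `basic_measures` and all data values are hypothetical.

```python
import math


def basic_measures(coords, segments, area_km2):
    """Compute a few counts and densities for an undirected street graph.

    coords: node -> (x, y) in meters; segments: list of (u, v, length_m).
    """
    streets_per_node = {node: 0 for node in coords}
    for u, v, _ in segments:
        streets_per_node[u] += 1
        streets_per_node[v] += 1

    n = len(coords)
    dead_ends = sum(1 for k in streets_per_node.values() if k == 1)
    intersections = sum(1 for k in streets_per_node.values() if k > 1)
    four_way = sum(1 for k in streets_per_node.values() if k == 4)

    total_length = sum(length for _, _, length in segments)
    # straight-line distance between each segment's endpoints
    straight = sum(math.dist(coords[u], coords[v]) for u, v, _ in segments)

    return {
        "n": n,
        "dead_end_proportion": dead_ends / n,
        "four_way_proportion": four_way / n,
        "intersection_density_km2": intersections / area_km2,
        "street_length_avg_m": total_length / len(segments),
        "street_density_m_per_km2": total_length / area_km2,
        "circuity_avg": total_length / straight,
        "streets_per_node_avg": sum(streets_per_node.values()) / n,
    }


# hypothetical four-way intersection with four dead-end arms;
# the western arm is curved, so it is longer than its straight-line span
coords = {"x": (0, 0), "n": (0, 100), "s": (0, -100), "e": (100, 0), "w": (-100, 0)}
segments = [("x", "n", 100.0), ("x", "s", 100.0), ("x", "e", 100.0), ("x", "w", 125.0)]
stats = basic_measures(coords, segments, area_km2=0.04)
print(stats)
```

Measures such as clustering coefficients and PageRank require a graph library and are not reproduced here.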

Data and Code Availability
These data are freely available online at the Harvard Dataverse at https://dataverse.harvard.edu/osmnx-street-networks. This repository includes the graphs of the street networks for every U.S. city, county, urbanized area, census tract, and Zillow neighborhood as GraphML files, node/edge shapefiles, and node/edge lists. These files can be loaded into any standard GIS software or network analysis package. The repository also contains the analytical measures for every street network. All the code to download and analyze these street networks with OSMnx (version 0.8.1) is open-source and available on GitHub at https://github.com/gboeing/dataverse-street-networks. The OSMnx software itself is also open-source and freely available for download and installation from GitHub (https://github.com/gboeing/osmnx), the PyPI package repository (https://pypi.org/project/OSMnx/), and the Anaconda package repository (https://anaconda.org/conda-forge/osmnx).

Discussion
This paper described a new urban science data repository constructed from raw OpenStreetMap data that provides four significant value-additions. First, the repository contains graph-theoretic models in common reusable formats, immediately more useful than typical raw geometry data downloads out of the box. Second, these models all have meaningful spatial extents (municipalities, counties, urbanized areas, census tracts, and neighborhoods) that correspond to administrative boundaries or social units for urban analysis and simulation. Third, these graphs have substantially cleaned-up topologies such that nodes exclusively represent intersections and dead-ends and edges represent the street segments connecting them. Fourth, this repository contains dozens of metric and topological measures calculated for each graph; no such database previously existed. These measures can be used to analyze the urban fabric's texture and walkability (via intersection density, node/edge density, and average street segment length), connectivity (via average number of streets per node and circuity), and network resilience and importance (via centrality measures and PageRank).
In total, this data repository contains over 110,000 street networks, which collectively comprise over 55 million nodes and over 137 million edges. These data fill a gap by helping researchers quickly ramp up graph-theoretic street network analyses in the U.S. without spending weeks developing their own ad hoc data collection, modeling, and analysis workflows. They fill a larger gap by opening up these scientific modes of urban analysis to planners and policymakers who otherwise lack the individual technical or institutional capacity to conduct them. Yet for these data to be used accordingly, they must reasonably model the real world. Validation of these data can be considered from two perspectives. The first considers how faithfully the OpenStreetMap data represent the real-world street network. The second considers how faithfully the repository's graphs in turn represent the OpenStreetMap street network.
Regarding the former perspective, various authors have explored this subject in detail [76][107][108][109][110][111][112][113][114]. OpenStreetMap's road data quality is generally quite high: for example, Garmin consumer GPS devices can use OpenStreetMap roads data for navigation. Although data coverage varies worldwide, it is generally good when compared to corresponding estimates from the CIA World Factbook. In the U.S., OpenStreetMap imported the 2005 TIGER/Line roads in 2007 as a foundational data source. Since then, numerous corrections and improvements have been made. More importantly, many additions have been made beyond what TIGER/Line captures, including richer attribute data describing the characteristics of features and finer-grained codes for classifying streets. Of course, many of these data are crowd-sourced and user-generated, and errors thus occasionally exist. However, the data are validated and vetted by the OpenStreetMap community, resulting in high quality overall. Most relevant to this project's study area, the U.S. network is essentially complete on OpenStreetMap [76].
Regarding the latter perspective, we comprehensively tested the final dataset's quality, adapting the methodologies of [76][78] against three reference datasets: the TIGER/Line roads, Google Earth satellite imagery, and the OpenStreetMap raw source data. First, we used QGIS to compare these data spatially against the TIGER/Line roads shapefile in a random sample of 100 cities to identify any edges appearing in one of these datasets but missing in the other. Discrepancies were then checked one at a time to ensure they correctly matched the OpenStreetMap source, and secondarily against Google Earth satellite imagery for real-world verification. Finally, tests were performed on each graph to ensure it could be loaded, analyzed, and routed. The validation confirmed that this project's algorithms reconstructed the OpenStreetMap data properly, with nodes exclusively at intersections and dead-ends.
Comprehensive documentation for using OSMnx is available at https://osmnx.readthedocs.org, and tutorials, examples, and demonstrations are available at https://github.com/gboeing/osmnx-examples. To reuse this dataset, researchers can install OSMnx according to the installation instructions in the documentation. Then, they can load the GraphML files using OSMnx's load_graphml function. The graphs may be similarly loaded in NetworkX, graph-tool, igraph, and other similar network analysis tools. To load these graphs with Gephi, first load the GraphML file in OSMnx, re-export it using OSMnx's save_graphml function with argument gephi=True (to add additional customization for Gephi compatibility), then open the exported file in Gephi. The shapefiles may be loaded in standard fashion in any GIS software, such as QGIS, ArcGIS, or geopandas.
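Because GraphML is an open XML format, the files can also be inspected without any graph library. The sketch below reads a small GraphML document using only the Python standard library. It is not a substitute for OSMnx's load_graphml: the function name `read_graphml` is hypothetical, and the sample document, including the attribute names x, y, and length, is made up for illustration and is not taken from the repository's files.

```python
import xml.etree.ElementTree as ET

# the XML namespace defined by the GraphML format
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_graphml(text):
    """Parse GraphML text into node and edge attribute records."""
    root = ET.fromstring(text)
    # map each <key> id (e.g. "d0") to its human-readable attribute name
    names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}

    def attrs(element):
        return {names[d.get("key")]: d.text for d in element.findall("g:data", NS)}

    graph = root.find("g:graph", NS)
    nodes = {n.get("id"): attrs(n) for n in graph.findall("g:node", NS)}
    edges = [
        (e.get("source"), e.get("target"), attrs(e))
        for e in graph.findall("g:edge", NS)
    ]
    return nodes, edges


# hypothetical sample: one two-way street stored as reciprocal directed edges
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="x" attr.type="string"/>
  <key id="d1" for="node" attr.name="y" attr.type="string"/>
  <key id="d2" for="edge" attr.name="length" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="1"><data key="d0">-122.27</data><data key="d1">37.80</data></node>
    <node id="2"><data key="d0">-122.26</data><data key="d1">37.81</data></node>
    <edge source="1" target="2"><data key="d2">141.5</data></edge>
    <edge source="2" target="1"><data key="d2">141.5</data></edge>
  </graph>
</graphml>"""

nodes, edges = read_graphml(SAMPLE)
total_length = sum(float(data["length"]) for _, _, data in edges)
print(len(nodes), len(edges), total_length)
```

Attribute values arrive as strings in this sketch and must be converted by the caller, which is one of the conveniences a dedicated loader would be expected to handle.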
Funding: This research received no specific external funding.