Characterizing the Heterogeneity of the OpenStreetMap Data and Community

OpenStreetMap (OSM) constitutes an unprecedented, free, geographic information source contributed by millions of individuals, resulting in a database of great volume and heterogeneity. In this study, we characterize the heterogeneity of the entire OSM database and historical archive in the context of big data. We consider all users, geographic elements, and user contributions from an eight-year data archive, at a size of 692 GB. We rely on some nonlinear methods such as power-law statistics and head/tail breaks to uncover and illustrate the underlying scaling properties. All three aspects (users, elements, and contributions) demonstrate striking power laws or heavy-tailed distributions. The heavy-tailed distributions imply that there are far more small elements than large ones, far more inactive users than active ones, and far more lightly edited elements than heavily edited ones. Furthermore, about 500 users in the core group of the OSM are highly networked in terms of collaboration.
Keywords: OpenStreetMap, big data, power laws, head/tail breaks, ht-index


Introduction
Twenty-first century society benefits considerably from, and is increasingly driven by, two forces characterized by the head and the tail of a long-tail distribution (Anderson 2006). For example, while the telephone industry was dominated by national telecoms such as AT&T, we now have services such as Skype. The Encyclopedia Britannica was very popular, but we now have a free, more popular counterpart in Wikipedia. Information was controlled by governments and mass media giants such as CNN, but WikiLeaks or OpenLeaks recently made history by freely sharing information. In the same vein, volunteered geographic information (VGI) (Goodchild 2007) emerged as a counterpart to geographic information, which is conventionally collected and maintained by national mapping agencies. As part of user-generated content in the era of Web 2.0, VGI is unique in providing georeferenced location information. OpenStreetMap is the most successful and well-known project of VGI. It attracts significant sustained interest in academia, industry, and government.
In this article, we study all OSM data collected over the past decade, submitted by about 1 million registered users up to February 2013. Previous studies showed that both the data and the user community are very heterogeneous. For example, only a small percentage of users make almost all the contributions, including creation and edits (Neis and Zipf 2012, Mooney and Corcoran 2012a, 2012b). In terms of data concentration and accuracy, the OSM data varies dramatically from urban to rural areas, or from country to country (Neis et al. 2011, Neis and Zielstra 2014). However, these previous studies were conducted mostly at country and city levels. They lack quantitative indicators about heterogeneity or variation. In contrast, we examined all the OSM data and its history to present a holistic picture of OSM based on power-law statistics and the head/tail breaks-induced ht-index. More specifically, we illustrate and quantify the underlying heterogeneity of the OSM elements, the users, and their contributions through a set of quantitative metrics such as α, p value, and ht-index.
Power-law statistics is based on robust maximum-likelihood estimation, which differs from the conventional least-squares estimation (Clauset et al. 2009) (see Section 3 for more details). The maximum-likelihood estimation provides two metrics: α (degree of heterogeneity) and p value (goodness of fit). On the other hand, head/tail breaks (Jiang 2013) is a newly developed classification scheme for data with a heavy-tailed distribution. It is also an efficient and effective visualization tool for big data (Jiang 2015). Head/tail breaks partitions the whole around the average into two parts: many small things in the tail, which are the majority, and a few large ones in the head, which are the minority. This partition continues recursively for the head (the large things) until the notion of far more small things than large ones is violated. Eventually, the number of times that the pattern of far more small things than large ones recurs is defined as the ht-index (Jiang and Yin 2014), which characterizes the complexity or hierarchical levels of the whole.
This paper's contribution is three-fold. First, we situated the study in the context of big data and extracted the related historical and attribute information from the entire OSM databases and users' historic archive. Second, based on the extraction, we characterized the heterogeneity of OSM databases and discovered very striking scaling patterns for both users and data. Third, we built up the co-contribution networks over the eight-year timespan of the data and found the underlying nonlinear characteristics of user collaboration networks.
The remainder of the paper is organized as follows: Section 2 presents the OSM history, data, and the working procedure of processing the huge data set. Section 3 briefly introduces the methodology for conducting the scaling analysis, including power-law statistics, detection, and head/tail breaks. Section 4 shows the statistical results of the scaling patterns and other results. Section 5 further discusses this study's implications. Finally, Section 6 draws conclusions and points to future work.

Data and data processing
Started in July 2004, and motivated by the great success of Wikipedia, OSM aimed to provide free editable maps for the entire world (Bennett 2010). A large number of volunteers relied on GPS receivers to collect trajectory data and transformed it into map data using online editing tools. This mapping process was time-consuming and tedious. In 2006, Yahoo! donated digital images to the OSM community, so that mapping could be done directly from the images. Later on, OSM obtained free data sets from companies and countries, such as a complete road data set of the Netherlands donated by Automotive Navigation Data, and the transformation of a US Census TIGER road data set. Over the past decade, OSM became one of the largest geodata sources and most famous VGI platforms, with around 1.8 million users and billions of geographic elements.
The OSM data is freely accessible on the Internet in a number of supported formats, such as XML and shapefiles. This study uses the complete, global OSM data-history dump (http://planet.openstreetmap.org/planet/full-history/). The dump is large, at 692 GB, and covers data collected from April 9, 2005 to Feb. 5, 2013. It mainly includes, and is structured sequentially by, the three basic types of geographic elements of OSM data: node, way, and relation. Nodes are point features that store location information as longitude and latitude coordinates. Ways are polylines and polygons that contain a set of ordered nodes. Relations denote the geographic relationships among the three types of elements. Each element contains a variety of information, such as user and element ID, timestamp of creation or edits, contributing user, version number, and different kinds of tags. The historical information is organized by version numbers with the attribute name version, which increases by 1 each time there is a new version of this element.
It is difficult to work with such a big file, since simply running through it takes several hours on a state-of-the-art desktop computer. We developed a working procedure (Figure 1) to extract both historical and attribute information for each element of the entire database for further analysis. For the historical information, we collected element ID, timestamp, contributing user ID, and version number at each version. Attribute information of each element was taken from the latest version. For each node element, we extracted its coordinate pair (latitude and longitude), and for each way and relation element, we collected their member IDs. The whole process took three days on a desktop with an eight-core, 3.4-GHz CPU and 32 GB of memory. The extraction was organized as a big table and formatted as a .txt file with a size of about 150 GB, including approximately 2.1 billion elements consisting of 1.9 billion nodes, 0.2 billion ways, and 2 million relations. For further analysis, we calculated the number of users detailed efficienc results b
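The extraction of historical information described above can be illustrated with a minimal Python sketch. It is not the authors' procedure, which is shown only in Figure 1. It assumes the history dump follows the OSM XML layout, in which every version of a node, way, or relation is a separate XML element carrying `id`, `version`, `timestamp`, and `uid` attributes. The tiny `SAMPLE` document and the helper name `extract_versions` are hypothetical, and all IDs and values are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-made sample in the OSM XML layout; all IDs and values are invented.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" version="1" timestamp="2006-01-01T00:00:00Z" uid="10" user="a" lat="59.0" lon="18.0"/>
  <node id="1" version="2" timestamp="2007-01-01T00:00:00Z" uid="11" user="b" lat="59.1" lon="18.1"/>
  <way id="5" version="1" timestamp="2008-01-01T00:00:00Z" uid="10" user="a">
    <nd ref="1"/>
    <nd ref="2"/>
  </way>
</osm>
"""

ELEMENT_TYPES = ("node", "way", "relation")


def extract_versions(stream):
    """Yield (type, id, version, timestamp, uid) for each element version.

    The file is read as a stream, so the whole document is never
    materialized as one tree at once.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag in ELEMENT_TYPES:
            yield (
                elem.tag,
                int(elem.get("id")),
                int(elem.get("version")),
                elem.get("timestamp"),
                elem.get("uid"),  # may be None if the attribute is absent
            )
            elem.clear()  # drop children and attributes already processed


rows = list(extract_versions(io.BytesIO(SAMPLE)))

# Per-element counts: number of versions (edits) and distinct contributing users.
edits = defaultdict(int)
users = defaultdict(set)
for etype, eid, _version, _timestamp, uid in rows:
    edits[(etype, eid)] += 1
    users[(etype, eid)].add(uid)

for key in sorted(edits):
    print(key, edits[key], len(users[key]))
```

For a file of several hundred gigabytes, a sketch like this would still need the emptied elements to be detached from the root, and the rows written to disk rather than collected in a list.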

Meth
The conventional least-squares estimation suffers from the messy tail at the very end of the distribution. Clauset et al. (2009) introduced a rigorous statistical test based on maximum likelihood and the Kolmogorov-Smirnov (KS) test for power-law detection. It yields two parameters: an estimated exponent α and a goodness-of-fit index p, which indicate the power-law fit and the goodness of the fit, respectively. This method has been widely used and proven robust for detecting power-law distributions in a wide range of complex systems (Marta et al. 2008, Jiang et al. 2009, Jiang and Jia 2011).

Pow
Simply put, the estimated exponent α shapes the power-law distribution and the acceptance range is from 1 to 3, given by:

$$\alpha = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}$$

in which α denotes the estimated exponent, $n$ is the number of observations $x_i \ge x_{\min}$, and $x_{\min}$ is the smallest value above which the power-law fit is held. We adopted a modified KS test to assess how data fits a power-law distribution (goodness of fit). It is based on the idea of the maximum distance ($D$) between the cumulative density functions (CDF) of the data and the fitted model:

$$D = \max_{x \ge x_{\min}} \left| f(x) - g(x) \right|$$

in which $f(x)$ is the CDF of the data for the observations with a value of at least $x_{\min}$, and $g(x)$ is the CDF for the power-law model that best fits the data in the region in which $x \ge x_{\min}$.
Usually, 1,000 synthetic data sets are then generated with the fitted model, each containing data whose values above $x_{\min}$ perfectly follow a power-law distribution. Conversely, values below $x_{\min}$ are not power-law distributed. The maximum difference $D_i$ is re-calculated between the fitted model and each synthetic data set. The goodness-of-fit index p is the fraction of the 1,000 values $D_i$ that are greater than $D$. The higher the p value, the better the fit with the power law: the closer p gets to 1, the more readily the data is accepted as following a power-law distribution. The acceptable threshold for goodness of fit is 0.05.
Power-law detection is probably the toughest statistical estimation, since it must differentiate power laws from other alternatives, such as lognormal, exponential, and other variants. In contrast to the rigorous power-law detection, the head/tail breaks provides a simple solution to reveal the underlying scaling. It applies to all kinds of heavy-tailed distributions, as long as the scaling pattern of far more small things than large ones recurs multiple times.
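The detection procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' code. It assumes the continuous maximum-likelihood estimator of Clauset et al. (2009), treats the lower bound `xmin` as given rather than estimated, and simulates only the values at or above `xmin`. All function names are hypothetical.

```python
import math
import random


def fit_alpha(xs, xmin):
    """Maximum-likelihood exponent for the values at or above xmin."""
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(xs, xmin, alpha):
    """Maximum distance between the empirical and the power-law CDF."""
    tail = sorted(x for x in xs if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model - i / n), abs(model - (i + 1) / n))
    return d


def sample_power_law(n, xmin, alpha, rng):
    """Draw n values from a continuous power law by inverse transform."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def p_value(xs, xmin, n_sets=1000, seed=0):
    """Return (alpha, D, p), with p the share of synthetic D_i above D."""
    rng = random.Random(seed)
    alpha = fit_alpha(xs, xmin)
    d_obs = ks_distance(xs, xmin, alpha)
    n_tail = sum(1 for x in xs if x >= xmin)
    larger = 0
    for _ in range(n_sets):
        synth = sample_power_law(n_tail, xmin, alpha, rng)
        d_i = ks_distance(synth, xmin, fit_alpha(synth, xmin))
        if d_i > d_obs:
            larger += 1
    return alpha, d_obs, larger / n_sets


# Usage on synthetic data drawn from a power law with exponent 2.5.
data = sample_power_law(500, 1.0, 2.5, random.Random(7))
alpha, d, p = p_value(data, xmin=1.0, n_sets=200)
print(round(alpha, 2), round(d, 3), p)
```

Fixing `xmin` keeps the sketch short. A complete implementation would also search for the `xmin` that minimizes the distance and would mix resampled values below `xmin` into each synthetic data set.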

Head/tail breaks
The head/tail breaks originates from the main characteristic of heavy-tailed distributions. Given data with a heavy-tailed distribution, the arithmetic mean, or average, splits all the data values into two unbalanced parts: a minority of big values above the mean, called the head, and a majority of small values below the mean, called the tail. This process continues recursively for the head until the notion of far more small values than large ones is violated; see the following recursive function, namely head/tail breaks. The threshold for the head percentage is set at 40 percent, which implies that the tail percentage is at least 60 percent. The number of times the data can be split, plus 1, is the ht-index (Jiang and Yin 2014). It captures how many times the scaling pattern of far more small things than large ones recurs in the data, and thus quantifies the scaling characteristic of the data. The higher the ht-index, the more hierarchical levels in the data.
... elements, in that a minority of users/elements accounts for a majority of contributions/edits. The major difference between our work and previous studies is that we conducted an in-depth quantitative analysis on all users and elements at the global scale. This enabled us to see something that was not illustrated in previous works. To the best of our knowledge, the scaling patterns have never been examined for the OSM data set at such a massive level. In this connection, we believe that this study can be extended to other user-generated content such as Wikipedia (Voss 2005).

Recursive function
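The listing of the recursive function did not survive in this copy of the text. As a stand-in, the following Python sketch implements the procedure as described in the prose: split the data around the mean, and recurse on the head as long as the head holds no more than 40 percent of the values. The function names, the `head_limit` parameter, and the treatment of values equal to the mean (assigned to the tail) are assumptions made for illustration.

```python
def head_tail_breaks(values, head_limit=0.40):
    """Return the successive means at which the data are split."""
    data = list(values)
    if len(data) < 2:
        return []
    mean = sum(data) / len(data)
    head = [v for v in data if v > mean]  # values equal to the mean stay in the tail
    # Stop when "far more small values than large ones" no longer holds.
    if not head or len(head) / len(data) > head_limit:
        return []
    return [mean] + head_tail_breaks(head, head_limit)


def ht_index(values, head_limit=0.40):
    """Number of successful splits plus one."""
    return len(head_tail_breaks(values, head_limit)) + 1


# Ten values 1, 1/2, ..., 1/10: the first mean (about 0.29) leaves 3 values
# in the head, the second mean (about 0.61) leaves 1, and the process stops.
print(ht_index([1 / k for k in range(1, 11)]))  # 3
```

Each returned mean can serve as a class break, so the same function yields both the classification and the ht-index.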
This paper applies the scaling analysis to characterize the heterogeneity of the global OSM database. Apart from examining the power-law statistics for detecting scaling patterns, other heavy-tailed distributions were observed and measured by the ht-index. It is widely known that data from real-world phenomena is very likely to be heavy-tail distributed, as is the case with the OSM data. The data naturally evolves and accumulates from individuals from the bottom up, rather than being imposed by authorities from the top down. As a result, the data of all aspects generally follows power laws or heavy-tailed distributions. Therefore, conventional linear methods such as Gaussian statistics show some inadequacies in characterizing this kind of heterogeneity. There is no typical mean or scale to characterize the heterogeneity. Instead, scaling characterizes the diversity or heterogeneity. Our study argues that, in the big-data era, geospatial analysis requires a new way of thinking, or Paretian thinking (Jiang 2015b), to better understand geographic forms and processes.
Big data, due to its diversity and heterogeneity, is likely to demonstrate the scaling pattern of far more small things than large ones. The large and small things constitute the head and tail, respectively, of a long-tailed distribution. Interestingly, the scaling pattern recurs multiple times, which implies that the things in the head recursively demonstrate the scaling pattern of far more small things than large ones. This recurring scaling pattern is what underlies the new classification scheme called head/tail breaks (Jiang 2013). The head/tail breaks divides things around an average into a few large things in the head and many small things in the tail, and continues the dividing process recursively for the head until the notion of far more small things than large ones is violated. The head/tail breaks can efficiently and effectively filter out data that is too big to handle by conventional means. This filtering function is also what underlies the visualization function of the head/tail breaks (Jiang 2015a). We believe that the head/tail thinking behind the head/tail breaks is very promising for big data and its analytics.

Conclusion
OSM data is essentially very heterogeneous, either at the local or global scale. This is because geographic space, or the earth's surface, is very heterogeneous with no average location on the earth's surface. In this paper, we studied the entire OSM database and found that this heterogeneity can be fairly illustrated and measured from elements, users, and their collaborations. For the users, both their contributions and the degree of the co-contribution networks exhibit a clear power-law distribution, which means that there are far more inactive users than active ones. There are also far more small elements than large ones, since their attribute values throughout three categories (number of users, edits, and sizes) are heavy-tail distributed. In addition, the elements assigned to individual countries demonstrate a striking power law. Such a pattern also remains at the country level concerning the spatial distribution of all elements. The head/tail breaks can analyze and visualize the big data in capturing the underlying scaling hierarchies and complement the mathematical power-law detection.
To summarize, the OSM data clearly shows the scaling property, and its great heterogeneity can be well characterized through the power-law fitting metrics and the underlying scaling hierarchical levels.
The study was conducted from the big-data perspective, which focuses on the entire database and data-intensive computing (Hey et al. 2009). Therefore, we created a comprehensive image of the heterogeneity of the OSM data and obtained a valuable database with respect to the historical and attribute information of all elements at a certain time point. Interested researchers are always welcome to contact us for further detailed information on the data processing. As for future work, two things should be done. The first is to take the tag information of each element into account and conduct a scaling analysis on them. The second is to study the nonlinear dynamics of both spatial and attribute information of each element at different temporal granularities (such as year, month, and week) to find the underlying mechanism of the evolution of both OSM-community and user-mapping activities.

Appendix: The head/tail breaks statistics for users, edits, sizes
To supplement the description of the results presented in Section 4.1, this appendix contains the detailed statistics on the head/tail breaks process for the three aspects: users, edits, and sizes. As we can see, all the data have more than 12 hierarchical levels, shown in the level column, and the mean head percentages of all three aspects are less than 30%, which is far less than the default threshold of 40%. Note that for the results of each element size (Table A3), 4,356 elements are excluded from the calculation, so the number of elements is 2,138,154,220 - 4,356 = 2,138,149,864.