1. Introduction
Volunteered Geographic Information (VGI), a term coined by Goodchild [1], describes the recent empowerment of citizens in the collaborative collection of geographic information. He argues that VGI has enormous potential to become a “significant source of geographers' understanding of the surface of the Earth”. Crucially, “by motivating individuals to act voluntarily, it is far cheaper than any alternative, and its products are almost invariably freely available”. OpenStreetMap (OSM) is a collaborative project to create a free editable map database of the world and is probably the best-known example of VGI [
2]. Spatial data is contributed to OSM from: portable GPS devices, tracing shape outlines from aerial photography, import of free spatial data, or simply from local knowledge [
3]. Ciepluch
et al. [
4] and Haklay and Weber [
5] provide detailed introductions to the OSM project. Real world geographic objects are represented in OSM as points, lines, and polygons. In OSM these are referred to as nodes and ways. Ways is a collective term for both polylines and polygons. Spatial attributes for these objects are stored as
tags. An object can be tagged with any number of tags. On the OSM wiki [
6] there is a community maintained page (see [
7]) detailing the most-popular tags. This amounts to something close to an OSM-community generated ontology for the spatial objects in the OSM database. Volunteers, who collect and contribute data to OSM, have the freedom to add their own arbitrary tags if necessary to annotate their data. However it is usually only the tags listed on the
map features list [
7] that are supported by GIS software capable of consuming OSM data and by cartographic software for rendering OSM data as map image tiles. With its growing spatial coverage and high-quality content [8, 9], OSM has branched beyond “the converted” and has gained enthusiastic endorsement from the likes of Yahoo, ESRI, MapQuest [
10], and Microsoft [
11]. Yahoo! and Bing have agreed to let OSM use their aerial imagery for the purposes of volunteers tracing the outline of objects. However, the concept of local knowledge is at the very core of VGI and OSM. De Leeuw
et al. [
12] describe results showing that volunteers with local knowledge classified roads from high-resolution aerial imagery with over 92% accuracy on average, irrespective of surveying background, and always better than professional surveyors without local knowledge. The combination of highly motivated volunteers with detailed local knowledge [
13] has been instrumental in seeing OSM grow to become a very large global database of spatial data, frequently edited, and continually growing in size.
We feel that one of the most exciting aspects of OSM is the collaborative editing and development of the OSM database by contributors. This provides motivation for this research work. As will be outlined in
Section 2, all current studies of OSM in the literature consider only the most recent available version of the OSM database. This literature discusses quality evaluation, accuracy measurement, or applications. Every few months OSM makes the “planet.osm/full” file available, which includes almost all OSM data ever collected [
14]. The key motivation of this paper is to use this historical data to investigate if there are any special characteristics that may be observed from analysis of “heavily edited” objects. After the literature review section of the paper (
Section 2) we show how the full history for spatial objects can be extracted and processed from the “planet.osm/full” file (
Section 3). In this section we also outline carefully how we selected our dataset of “heavily edited” objects. In
Section 4 we provide the results of analysis performed on these objects. This includes: rates of contribution (
Section 4.2), editing and tagging (
Section 4.3 and
Section 4.5), and changes to object geometry (
Section 4.6). Access to the historical trail of edits for objects allows us to investigate how these features have evolved from their first version to their current version. The full OSM history data for the UK and Ireland is extracted from the “planet.osm/full” file and is used as the case-study data.
Section 5 closes the paper with some conclusions from the analysis performed and opportunities for further research.
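The extraction step outlined above can be illustrated with a minimal sketch. It assumes the history data is available in the standard OSM XML layout, where each version of a way appears as its own `<way>` element carrying `id`, `version`, `timestamp`, and `user` attributes with `<nd>` and `<tag>` children. The function names, the tiny in-memory sample, and the inclusive treatment of the version threshold are illustrative assumptions and not the processing pipeline used in this paper.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in OSM history XML layout (illustrative data only).
SAMPLE = """<osm version="0.6">
  <way id="1" version="1" timestamp="2009-01-01T00:00:00Z" uid="10" user="alice">
    <nd ref="100"/><nd ref="101"/>
    <tag k="highway" v="residential"/>
  </way>
  <way id="1" version="2" timestamp="2009-06-01T00:00:00Z" uid="11" user="bob">
    <nd ref="100"/><nd ref="101"/><nd ref="102"/>
    <tag k="highway" v="tertiary"/>
  </way>
  <way id="1" version="3" timestamp="2010-01-01T00:00:00Z" uid="10" user="alice">
    <nd ref="100"/><nd ref="101"/><nd ref="102"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Example Street"/>
  </way>
  <way id="2" version="1" timestamp="2010-02-01T00:00:00Z" uid="12" user="carol">
    <nd ref="200"/><nd ref="201"/>
    <tag k="building" v="yes"/>
  </way>
</osm>"""


def collect_way_versions(source):
    """Group every <way> version found in a history file by way id."""
    history = defaultdict(list)
    # iterparse streams the file, which matters for very large history dumps.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "way":
            continue
        history[elem.get("id")].append({
            "version": int(elem.get("version")),
            "user": elem.get("user"),
            "timestamp": elem.get("timestamp"),
            "nodes": [nd.get("ref") for nd in elem.findall("nd")],
            "tags": {t.get("k"): t.get("v") for t in elem.findall("tag")},
        })
        elem.clear()  # drop the children of the processed way to save memory
    return history


def heavily_edited(history, min_versions=15):
    """Return ids of ways with at least min_versions versions.

    Whether the threshold is inclusive is an assumption of this sketch.
    """
    return sorted(way_id for way_id, versions in history.items()
                  if len(versions) >= min_versions)


history = collect_way_versions(io.BytesIO(SAMPLE.encode("utf-8")))
# A low threshold is used only so that the small sample selects something.
selected = heavily_edited(history, min_versions=3)
```

Keeping every version as a plain dictionary makes it straightforward to compare consecutive versions of the same way later, for example to count tag changes or added nodes.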
2. Literature Overview
Currently there is only a small body of literature published on analysis of the spatial data contents of the OSM database. However these publications have delivered important contributions on many issues related to OSM in general. In classical GIS methodologies there have been several accuracy and ground-truth comparisons of OSM data with authoritative sources of spatial data. These include Haklay [
15] with the Ordnance Survey UK, Zielstra and Zipf [
16] with TeleAtlas data in Germany, Girres and Touya [17] with the French OS dataset (IGN), and Mooney
et al. [
18] with land cover features in Ordnance Survey Ireland datasets. In terms of using OSM as a primary source of geographic information there are several examples in the literature. Goetz and Zipf [
19] present an extensive 3D building ontology based on OSM, making it possible to map indoor spaces in addition to the outdoor environment. The richness of spatial data for urban areas in VGI, particularly for buildings, has seen some authors (such as Goetz and Zipf [20]) use VGI to create 3D building models for virtual city models. Over
et al. [
8] also discuss the generation of 3D models. Ciepluch
et al. [
4] present a framework for distribution of environmental information using OSM. Some work has been presented on the motivations of those volunteers who contribute to VGI projects [
9,
21]. Finally some authors have investigated the development of applications based upon the local spatial knowledge inherent in VGI. Pultar
et al. [
22] develop software for wildfire evacuation modelling and travel scenarios of urban environments using VGI as an input data source. De Leeuw
et al. [
12] argue that local knowledge demonstrated in VGI (coupled with the enthusiasm and dedication of contributors as emphasized by Goodchild [
23]) could potentially be very rich extending to the point “where there is reason to consider engaging local expertise in the production and updating of NMA topographic maps”.
The issue of the spatial data quality in OSM is dominant amongst the available literature. Goodchild [
23] argues that in the case of OSM the rather unique task of compiling independently contributed pieces of a (geographic) patchwork necessarily imposes some degree of quality control. He adds that “one might term this process structured, to distinguish it from the essentially unstructured process by which entries in Wikimapia and other VGI gazetteers are compiled”. The quality of VGI is now a hot topic in GIS [
18]. Qian
et al. [
24] argue that in VGI “since general users can add and change data, the stored data should be updated frequently, resulting in an abundant and updated geographic dataset”. This has “reversed the traditional top-down flow of information” [
25]. Qian
et al. [
24] conclude by arguing that one of the most serious disadvantages of OSM is that the underlying data is acquired by non-professionals with non-professional equipment, meaning that there is no guarantee of quality about the data unless it can be compared to some other source. Flanagan and Metzger [
26] argue that as the amount of VGI continues to grow “the issues of credibility and quality should assume a prominent place on the research agenda”. This will require a multi-disciplinary approach combining knowledge from geography, computer science, and social sciences to understand the credibility of VGI. Metadata is also an issue, with Bulterman [27] suggesting that the “complete disregard for documentation of data resources” has made it almost impossible to perform a fitness-for-use or fitness-for-purpose evaluation on available data resources. Without some quantitative measures for assessing the quality of OSM data, the GIS community has been slow to consider OSM as a serious source of data [
18]. Imports of government or National Mapping Agency (NMA) data into OSM, as mentioned in Section 1, are the exception rather than the rule [
28]. The “contributors are spontaneous and the density of data is unpredictable” and consequently the spatial distribution of the data itself will continue to be uneven and inconsistent. Flanagan and Metzger [
29] remark that for VGI in general the “professional and scientific gate-keeping that usually filters and reviews digital information may not be present (in sufficient forms or structures)” and subsequently can lead to information which is prone to being “poorly organized, out-of-date, incomplete, or inaccurate”. Ballatore and Bertolotto [
30] call OSM “spatially-rich and semantically-poor”. Brando and Bucher [
31] suggest that the quality of VGI is enhanced if proper metadata is created and maintained which details: types of changes and edits, methods of survey and collection, and finally a fitness for purpose statement. The recent study by Zielstra and Zipf [
16] of OSM and TeleAtlas for Germany shows that “while professional data is not without its faults the coverage of OSM in rural areas is too small to be seriously considered a sophisticated alternative for
any applications”. However the study does conclude that for larger cities (Berlin, Frankfurt, Munich) the data diversity is so rich that “OSM is replacing proprietary data for many projects”. Heterogeneity of the spatial data coverage in OSM is a real barrier. Neis
et al. [
32] show that the difference between the OSM street network for car navigation in Germany and a comparable proprietary dataset was only 9% in June 2011.
In this study we analyse “heavily edited” objects in OpenStreetMap. We are unaware of any similar example in other sources of crowdsourced spatial data or VGI. In the next section of this literature review we make connections to research work carried out in Wikipedia. Similar to OSM it is possible to extract the entire history of edits to articles in Wikipedia. A growing body of literature is available on this topic. The crowdsourced collection and collaborative editing of spatial data in OSM is unique and is certainly non-traditional for geographical data. The literature available dealing with these topics in relation to OSM is still rather limited. There is a very substantial collection of literature on Wikipedia where the equivalent to the OSM collaboratively edited object is the
article [
33]. We feel that it is useful to briefly introduce some key outcomes from research studies of the collaborative editing nature of Wikipedia. Heavily edited articles in Wikipedia are usually those that gain the status of “featured article”. Featured articles are recognized as articles of high quality, with a long history of collaborative editing, and have become relatively stable (no major recent edits) [
34]. In Korfiatis
et al. [
35] the authors introduce a measure of article quality for featured articles based on the succeeding edits by other users. Subsequent edits (deletions) and roll-backs to earlier versions are considered as disapproval whereas maintaining the edits of previous editors is deemed as a sign of approval. Wesler
et al. [
33] analyzed the history of edits of Wikipedia and identified a number of specific users, most notably
Substantive editors. Substantive editors contributed, at minimum, between 30 and 80 percent of all content edits to pages. Overall, research indicates that there is no “standard contributor” to Wikipedia and no standard pattern of contribution. Yang and Lai [
36] conclude that frequent Wikipedia users (like
Substantive editors, as defined by Wesler et al. [33]) may contribute knowledge by making minor changes to Wikipedia entries. Conversely, some users who contribute infrequently may provide extremely rich content. Antin [
37] reports in a survey of Wikipedia contributors that many contributors are worried about how “individual agendas could shape editing behaviors” particularly those edits from very frequent contributors. This is supported by Hecht and Gergle [
38] who provide evidence that many Wikipedia users continually re-edit their “pet pages” very frequently.
5. Conclusions and Future Work
In this paper 15,640 ways (polygons and polylines) resulting in 316,949 unique versions of these objects were analyzed from the OSM database for the UK and Ireland. In our analysis we only considered “heavily edited” objects in OSM: objects which have been edited over 15 times. We motivated the selection of this threshold in
Section 3.2 and we feel that this provided us with a good representative sample of OSM activity in the UK and Ireland. There is good spatial distribution of the selected objects as illustrated in
Figure 1. As stated by Zielstra and Zipf [
16] OSM data is found in the largest quantities and coverage in urban areas. The map of the locations of our “heavily edited” objects shows greater concentration of these objects around the cities of Dublin, London, Belfast, Cardiff, and Glasgow. To our knowledge, and following an extensive literature search, this is the first study of its type of historical OSM data. Kessler
et al. [
53], Roick
et al. [
54], and van Exel
et al. [
55] consider version history but only for visualization purposes.
5.1. Conclusions
Our analysis of OSM history data has given us some interesting research results. In
Section 4.2 we showed that 11% of contributors created or edited 87% of the spatial data in the 15,640 “heavily edited” objects. Assignment of values to attributes or tag keys is another area where a historical analysis of edits demonstrates issues in the collaborative nature of OSM. 4.1% of objects have the assigned value to their “name” attribute changed 3 or more times. This rises to 25% of objects which have the assigned value to their “name” attribute changed 2 or more times. Disputes and disagreements occur.
Table 4 shows a long and protracted dispute in Germany over the assignment of a classification to the highway tag.
Table 3 shows an example from the UK where a street is assigned 5 different names. The use of two well known string matching metrics in
Figure 3 shows that changes to “name” attributes are not subtle single character changes but major edits to the value assigned to “name”. This will form a useful basis for future work to investigate if this behaviour extends to other OSM regions and communities. The uncertainty introduced by frequent changes to “name” or “highway” attributes has implications for the development of gazetteers from OSM and location-based services (LBS) which will need to be evaluated [
56]. In Over
et al. [
8] the authors comment that the quality control of OSM differs fundamentally from professionally edited maps. The community-based approach allows anyone to upload and alter map data. However, due to the huge number of editors, errors and conflicts are usually quickly resolved. A long-term historical analysis, following on from this work, will provide evidence to support this hypothesis that eventually OSM data “stabilizes” for an area. Some tools are beginning to appear for visualization of the history of Wikipedia pages [
57]. Some tools are beginning to emerge for OSM history but they are still in their infancy and lack powerful information visualisation functionality.
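The comparison of successive “name” values discussed above can be sketched as follows. The two string matching metrics behind Figure 3 are not named in this section, so Levenshtein edit distance is used here purely as an assumed stand-in. The function names and the example street names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_change_distances(names):
    """Edit distance for each pair of consecutive, differing name values."""
    return [(old, new, levenshtein(old, new))
            for old, new in zip(names, names[1:])
            if old != new]


# Hypothetical sequence of values assigned to the "name" tag across versions.
name_history = ["Main St", "Main St", "Main Street", "Church Road"]
changes = name_change_distances(name_history)
```

A distance of one or two would point to a spelling correction, whereas a distance close to the length of the longer string indicates that the name was replaced outright.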
Section 4.5 discusses the number of tags, or metadata, assigned to each object at its final (current) version. The mean number of tags assigned to features is 3.45. There is no discernible statistical relationship between increasing numbers of contributors and number of tags nor is there a statistical relationship between the number of versions created for an object and the number of tags. This could indicate that new contributors to an object passively accept the current set of tags without adding any additional tags. The increase in versions, without an apparent correlated increase in the number of tags, is probably a result of the “tag flip-flopping” we discussed in
Section 4.3 or edits to the geometry of the object (
Section 4.6). As
Table 6 shows, nodes are added to objects in 79% of edits. In
Section 4.6 we showed in
Table 5 that consecutive edits to the same object create and maintain valid spatial geometries in 91% of cases. However it is worth noting that in 8% of cases an object with an invalid geometry is edited and this invalidity problem is not fixed. In 87% of these cases the same contributor is responsible. This raises potential issues surrounding the understanding these contributors have of the need for valid geometries (avoiding self intersections,
etc.) in a spatial dataset.
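As an illustration of the kind of validity problem mentioned above, the following minimal sketch tests whether the node sequence of a way crosses itself. It assumes planar coordinates and detects only proper crossings between non-adjacent segments; segments that merely touch or overlap collinearly are not reported, and a full validity check covers more cases than this. The function names are hypothetical and this is not the validity test applied in Section 4.6.

```python
def orientation(p, q, r):
    """Cross product of (q - p) and (r - p).

    Positive for a left turn, negative for a right turn, zero if collinear.
    """
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True if segment ab and segment cd cross at a single interior point."""
    return (orientation(a, b, c) * orientation(a, b, d) < 0
            and orientation(c, d, a) * orientation(c, d, b) < 0)


def has_self_intersection(coords):
    """Check every pair of non-adjacent segments of a way for a crossing."""
    segments = list(zip(coords, coords[1:]))
    for i in range(len(segments)):
        # Start at i + 2 so that segments sharing a node are skipped.
        for j in range(i + 2, len(segments)):
            if segments_cross(*segments[i], *segments[j]):
                return True
    return False


# A closed square is free of crossings; a "bow-tie" ring crosses itself.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
bow_tie = [(0, 0), (1, 1), (1, 0), (0, 1), (0, 0)]
```

The pairwise comparison is quadratic in the number of segments, which is acceptable for individual ways but would need a sweep-line approach for very large geometries.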
5.2. Future Work
While the paper does not specifically provide “measurements” of the quality of OSM data, we believe that this work could provide a platform for future studies on OSM data quality which would consider the lineage or evolutionary history of the OSM data as part of quality assessments. A survey of contributors to OSM, particularly large-scale contributors, is required to gain a better understanding of the rationale behind some of the tagging behaviour we have observed in this paper. We have used the threshold of 15 versions as the qualifying criterion for “heavily edited” objects in OSM. As part of ongoing work we are investigating the effects of revising this threshold downwards whilst attempting to understand the factors that are related to new versions of objects being created in OSM. Relations in OSM are one of the core data elements. They consist of one or more tags and an ordered list of one or more nodes and/or ways as members. They are used to define logical or geographical relationships between elements. We decided against investigating relations in this research as we wanted to maintain focus on the editing of ways as a first step towards understanding the characteristics of heavily edited objects in OSM. Just under 30% of the 15,640 ways we analysed in this research were marked explicitly as members of relations. As part of future work we intend to carry out an analysis of the characteristics of relations in OSM. In this paper we considered an edit as a composite record of edits to an object's geometry and tags. We shall be investigating the types of edits recorded: edits to geometry and then edits to tagging. We believe this will help us understand the editing behaviour of contributors: do some contributors contribute geometry but never perform tagging, or do some contributors only correct or update tagging?
Finally, an analysis of a larger number of “heavily edited” objects is required to validate our findings here and to show the existence of other characteristics (spatial autocorrelation and spatial interaction). This will be the subject of our immediate future work.