OSM provides a simple and open mechanism for tagging OSM elements with key-value pairs. Contributors can add whatever key-value pairs they want, but there are best practices to follow. For instance, some keys are intended to classify OSM entities into classes: building, highway, amenity, shop, etc., while other keys play the role of attributes: name, maxspeed, created_by, etc. Values of attribute keys set attribute values, while values of class keys classify class members into categories: residential, hotel, monument, etc. As observed in [4], some of the contributed entities can be assigned to wrong or implausible classes, due to (1) individual interpretation of the submitted data, or (2) misunderstanding of commonly used classes. The ambiguous nature of entities and the large number of users with diverse motivations and backgrounds can cause such situations. The open and loose tagging mechanism of OSM leads to many incompletely classified or wrongly attributed entities.
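The distinction between class keys and attribute keys can be illustrated with a minimal Python sketch. The names `CLASS_KEYS` and `split_tags` are hypothetical and chosen only for illustration; the set of class keys simply reuses the examples given above and is not an official OSM vocabulary.

```python
# Illustrative sketch only: CLASS_KEYS and split_tags are hypothetical names,
# and the key list below just repeats the examples from the text.
CLASS_KEYS = {"building", "highway", "amenity", "shop"}


def split_tags(tags):
    """Split the tags of one entity into (classes, attributes) dictionaries."""
    classes = {k: v for k, v in tags.items() if k in CLASS_KEYS}
    attributes = {k: v for k, v in tags.items() if k not in CLASS_KEYS}
    return classes, attributes


# A made-up entity: the class key "building" has the category "hotel",
# while "name" and "created_by" act as attributes.
entity = {"building": "hotel", "name": "Hotel Example", "created_by": "JOSM"}
classes, attributes = split_tags(entity)
```

Under this reading, an entity whose `classes` dictionary is empty would be one candidate for the "incompletely classified" case mentioned above.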
Focusing on the conceptual classification of OSM entities, the work of [6] establishes several parameters for assessing the quality of OSM: Accuracy: distance between conceptualization and domain knowledge, which can be seen as the degree of correctness in the classification of features into classes; Granularity: level of thematic description present in the data, moving from very abstract to very specific concepts; Completeness: coverage in the conceptualization of the features of interest, with a distinction between class completeness and attribute completeness; Consistency: degree of homogeneity in the descriptions of geographic features; Compliance: degree of adherence of an attribute, a feature, or a set of features to a given source, ranging from non-compliance to full compliance; and Richness: amount and variety of dimensions that are included in the description of the real-world entity.
In this paper, we present a framework for assessing the quality of OpenStreetMap, comprising a batch of methods to analyze the quality of entity tagging. Our approach uses Taginfo as reference base. Taginfo collects the keys most used by OSM contributors, the most used combinations of keys, and the most used values for keys. In our approach, Taginfo is used to assess the quality of the tagging process in the study area.
Our approach aims to provide a number of indicators/scores to measure the quality of any OSM dataset independently of the zone, country or continent to be analyzed. It thus allows any pair (or set) of cities to be compared. In addition, our approach assumes that the main goal of quality assessment is to ensure suitable query processing, focused on POI search and navigational queries. Some expected query results can be discarded due to missing, unusual or heterogeneous information. Contributors are free to label OSM maps following local, regional or country-level agreements, but this complicates query formulation and comparison. For instance, some local communities can agree to assume default values for maxspeed and oneway and omit them. The goal of the approach is to evaluate, for instance, completeness independently of local common practices, by assessing the occurrence of certain tags in certain entities. Thus, a completeness result can be interpreted in two ways: as a lack of information (missing/unknown), which can be crucial for the usage of OSM datasets, or as a common practice of OSM contributors in a given area. Distinguishing missing/unknown information from a local common practice is not always possible. Moreover, our approach takes Taginfo as the basis of agreement, since it has been built from OSM contributions from all over the world. Taginfo is used to evaluate the compliance of the OSM dataset in terms of the percentage of commonly used keys as well as commonly used combinations of keys. Consistency (i.e., agreement in the number of tags for specific entities) is also analyzed. Again, this indicator is crucial for POI search purposes. Granularity and richness analyze the quality of the classification and description of entities, which in turn supports the filtering of answers when queries are formulated. Finally, trust is analyzed in terms of the experience of contributors as well as the number of revisions of OSM items.
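As a rough illustration of assessing "the occurrence of certain tags in certain entities", the following Python sketch computes the share of entities of a class that carry a given attribute key. The function name `completeness`, the dictionary representation of entities and the sample data are all assumptions made for this example; the paper's own indicators are implemented as XOSM queries and may be defined differently.

```python
# Minimal sketch under stated assumptions: entities are plain dicts of tags,
# and completeness is read as "share of class members carrying a given key".
def completeness(entities, class_key, attribute_key):
    """Fraction of entities tagged with class_key that also have attribute_key.

    Returns None when no entity belongs to the class (score undefined).
    """
    members = [e for e in entities if class_key in e]
    if not members:
        return None
    return sum(1 for e in members if attribute_key in e) / len(members)


# Made-up sample: three highways, two of which carry maxspeed.
sample = [
    {"highway": "residential", "name": "Calle A", "maxspeed": "30"},
    {"highway": "residential", "name": "Calle B"},
    {"highway": "primary", "maxspeed": "50"},
    {"amenity": "cafe", "name": "Cafe C"},
]
maxspeed_share = completeness(sample, "highway", "maxspeed")
```

A low score computed this way cannot by itself tell missing information apart from a locally agreed default, which is the limitation discussed above.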
The framework has been used to evaluate the quality of the current status (i.e., at the time of the study) of OSM in Spain. We have selected a set of Spanish cities of different sizes (and populations) and with different historical, economic and cultural backgrounds. We have also compared Spanish cities with some major European cities.
We have also developed a Web tool called QXOSM, available at http://xosm.ual.es:8080/qxosm, which enables the quality analysis of the tagging process in any area of the planet. QXOSM has been built on top of XOSM, also a Web system (http://xosm.ual.es/XOSM/), developed by our group in recent years [8]. XOSM allows the querying of OSM with the XQuery database query language (https://www.w3.org/XML/Query/) and is equipped with an XQuery library of operators enabling spatial and keyword-based queries as well as aggregation queries. The quality analysis of any OSM dataset is carried out by executing a batch of XOSM queries against the dataset, wherein each XOSM query represents a quality indicator. The Web tool allows the user to select any area of the OSM planet and to assess the quality of the selected area in real time. The dataset of the selected area is retrieved with the help of the Overpass API (https://wiki.openstreetmap.org/wiki/Overpass_API). Once retrieved, the Web tool analyzes completeness, compliance, consistency, granularity, richness and trust of the selected area. Moreover, the Web tool offers great flexibility, allowing the selection of the entities, categories and attributes to be analyzed. Entities, categories and attributes are retrieved with the help of Taginfo. The results of the analysis are shown in two forms: aggregated (numeric results) and disaggregated (charts). For instance, in the granularity analysis, the average and median of the number of attributes of a certain entity are reported, and a pie chart shows the percentage of entity instances for each set of attributes. The back-end of the tool has been implemented in XQuery, while the front-end has been implemented in Vaadin (https://vaadin.com/). The Web tool caches fetched data in order to improve the response time of the analysis tasks. Unfortunately, for performance reasons, the Web tool has limitations on the size of the selected area. The analysis presented in the paper for some cities has been made offline.
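The aggregated and disaggregated outputs described for the granularity analysis can be sketched in Python as follows. This is not the tool's XQuery implementation: the function `granularity_summary`, the choice to exclude the class key itself from the attribute count, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median


def granularity_summary(entities, class_key):
    """Return (average, median, shares) for the entities tagged with class_key.

    average/median: number of attribute keys per entity (class key excluded,
    which is an assumption of this sketch).
    shares: percentage of entities per distinct set of attribute keys, i.e.
    the data a pie chart would display.
    """
    members = [e for e in entities if class_key in e]
    if not members:
        return None
    attr_sets = [frozenset(k for k in e if k != class_key) for e in members]
    counts = [len(s) for s in attr_sets]
    shares = {
        tuple(sorted(s)): 100.0 * n / len(members)
        for s, n in Counter(attr_sets).items()
    }
    return mean(counts), median(counts), shares


# Made-up sample of four amenities with 1, 2, 1 and 0 attributes.
pois = [
    {"amenity": "cafe", "name": "A"},
    {"amenity": "cafe", "name": "B", "opening_hours": "Mo-Fr 08:00-18:00"},
    {"amenity": "bank", "name": "C"},
    {"amenity": "bank"},
]
avg, med, shares = granularity_summary(pois, "amenity")
```

The two numbers correspond to the aggregated form of the result, while the `shares` dictionary corresponds to the disaggregated form.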
1.1. Related Work
In the literature, there are several works whose goal is to analyze the quality of OSM for a specific country (or group of countries). As far as we know, no such study exists for Spain. Unfortunately, there is no standard method for analyzing OSM quality.
Among the proposed methods, we can distinguish those carrying out an extrinsic quality analysis, which compare OSM data with an authoritative dataset that acts as a ground truth dataset. The main goal of existing extrinsic methods is to validate the accuracy of OSM geometries with regard to real-world objects. This is important, for instance, when OSM maps are used for navigational purposes, and the study of the accuracy of OSM geometries is mainly focused on OSM ways. There is also strong interest in the precision of OSM nodes, which has a high impact when OSM maps are used for locating POIs. However, as pointed out by several authors (see, for instance, [6]), a dataset may contain highly accurate geometries, but if the description of the entities and their attributes is not clear, articulate, rich and complete enough, the value of the data for consumers will be severely curtailed. On the other hand, extrinsic analysis is not always possible, and new approaches explore the possibility of assessing quality using intrinsic dimensions [23].
A number of methods for intrinsic quality analysis have been proposed in which OSM data are validated assuming some rules of quality. The newest intrinsic quality evaluation methods [24] use the number of contributors, heavily edited objects, and combinations of factors (number of versions and number of users, as well as confirmations and tag corrections) to assess the quality of OSM. In some cases the history of OSM contributors’ updates is used for the analysis. While the analysis of the update history is interesting, its extraction is currently a difficult and time-consuming process. On the other hand, the concept of “crowdquality” identifies two characteristics: the quality of the user and the quality of the geographic information. Along this line, the authors of [26] investigate which indicators influence trust, focusing on intrinsic properties that do not require any comparison with a ground truth dataset. High numbers of contributors are considered positive indicators, while corrections have a negative influence on trustworthiness. The “many eyes principle” [7] is crucial for trust: quality is more likely to be higher if more people have worked on a feature. Trust is closely related to reputation (a subjective perception of trustworthiness), inferred from the historical behavior of a certain contributor. Related to trust, credibility comprises accuracy, authority and competence in addition to trustworthiness.
Taginfo has been previously used as reference base in some works. For instance, in [28] compliance with the OSM Map Features suggestions and recommendations is analyzed for 40 cities, revealing that it is generally average or poor. They selected the 30 most frequently occurring keys in Taginfo. They state that the co-occurrence of keys suggested by the OSM Map Features does not always happen. The tagging process is also analyzed in [22], in a study carried out on 25,000 objects from the OSM databases of Ireland, United Kingdom. The selected objects are the so-called heavily edited ones, having 15 or more versions. They analyze the number of unique tags and values assigned to name
A more elaborate, extrinsic analysis can be found in [13], in which the authors study the quality of the French OSM dataset with regard to geometric accuracy (positioning and geometry resolution with respect to the ground reality), attribute accuracy (accuracy of quantitative attributes, correctness of non-quantitative attributes and classification of features), completeness (absence of data, i.e., omission, and excess of data, i.e., commission), logical consistency (internal consistency: modeling rules, specifications, integrity constraints, distinguishing intra-theme consistency from inter-theme consistency), semantic accuracy (correspondence with real-world objects), temporal accuracy (actuality of the objects relative to changes in the real world), lineage (capture and evolution of objects) and usage (how well the database fits the use that will be made of it). They conclude that the number of tags increases linearly with the number of contributors. Therefore, the more contributors, the better the quantitative attribute quality. Additionally, they conclude that smaller objects are more likely to be missing, and that contributors are more focused on capturing attractive objects. They also find that territories are best represented in rich areas and/or areas with a young population, and that completeness becomes very problematic in rural areas.
Intrinsic analysis using the OSM history of contributors’ updates can be found in [26], in which a subset of the entities in the study area is selected based on the number of versions/editions that each entity has undergone. Trust assessment is based on the provenance of the entity, as well as on the number of users involved in creating it. Indirect confirmations are also taken into account by looking at all revisions made in the immediate vicinity of an entity after its last revision. In addition, tag corrections and rollbacks (when the value for a certain tag is changed) decrease trustworthiness.
The OSM history of contributors’ updates is also used in [29] for intrinsic analysis. They highlight several quality measures via the iOSMAnalyzer according to six categories: (1) general information on the study area: development of OSM entities and tags, currentness of data, comparison of newly created and edited objects, syntactic attribute accuracy, positional accuracy of junctions; (2) user information and behavior: number of contributors, contributor activity, distribution of contributors, user profiles; (3) routing and navigation: road network completeness, attribute accuracy of roads, road network currentness, logical consistency and positional accuracy of the road network, roads without a name or route number; (4) map applications: geometric polygon representation, untouched OSM features, invalid polygons, logical consistency of landuse polygons, development of selected polygons; (5) geocoding: development of address information, completeness of address annotations, completeness of house numbers tagged to buildings; (6) points of interest search: development of POIs, average number of POI tags, attribute completeness of POIs. iOSMAnalyzer is mainly focused on geometries (categories (3) and (4)), but it includes some elements of completeness analysis (categories (1), (5) and (6)) and contributor analysis (category (2)).
There are some other studies about OSM quality analysis in specific areas of the world. For instance, in [30] an extension of the Quantum GIS (QGIS) processing toolbox is presented that makes it possible to assess the completeness of the spatial data using intrinsic indicators, proposing a heuristic approach to test the road navigability of Punjab (India). This is also the case of [31], in which some areas of Brazil are studied in terms of length of rural roads, density of urban roads, number of buildings, percentage of classified roads, number of days since the last edit, and number of versions/editions, and a comparison is established with economic and developmental variables.
There are also a number of tools developed with different goals, ranging from tagging process improvement (recommender and data validation systems) to visualization of OSM data and quality metrics. Among tagging process improvement tools, the tag recommender system called OSMantic, developed as a JOSM plugin, automatically suggests relevant tags to contributors during the editing process. It suggests tags that could be added to better describe map entities, and detects tags associated with the same feature that appear too dissimilar. Relations between tags are computed based on the semantic similarity between tags and the number of times a tag has been used in the database. Taginfo and the OSM semantic network are used by OSMantic.
Data validation systems include, for instance, Keep Right, which performs several data consistency checks and shows the results on a map, including geometry checks (non-closed areas, dead-ended one-ways, etc.) as well as some tag-related checks (deprecated tags, missing tags, points of interest without name, etc.). It has mechanisms for reporting false positives and for labelling a bug as fixed. Osmose (Open Street Map Oversight Search Engine) is similar to Keep Right but offers a larger number of data consistency checks, including missing and wrong tags. JOSM Validator, integrated with the JOSM editor, checks data loaded into the editor, highlights errors and warnings, and applies some automatic fixes on request. It checks all objects modified in a session by the user, reporting errors even if the user is not responsible for them. Several checks are done, including bad keys and values, untagged ways, and missing and duplicated names. OSM inspector is an error debugging tool which is part of the GeoFabrik tools. It shows a map with several layers: geometry, routing, tagging, places, highways, areas, coastline, addresses, water, public transport stops and public transport routes. In the case of tagging, it detects empty tag keys, empty tag values, tag keys with spaces, unusual characters, unusual key lengths, elements tagged with “FIXME”, and name/description without feature tags. For highways and addresses, several data consistency checks are carried out. Finally, in [4] two methods based on constraint checking and machine learning are proposed in order to check the integrity of VGI data: hierarchical consistency and classification plausibility. The methods can be applied to check the validity of data during contribution or data correction.
Visualization of OSM with the aim of analyzing the quality of maps has been studied by [33], whose system OSMatrix is a Web-based approach to visually explore quality metrics of OSM. MVP OSM is a tool designed to highlight areas where contributors provide a high level of spatial detail. MVP OSM scores local knowledge (ignoring extraction from aerial imagery), mapping experience (time spent: number of edits and months) and community recognition (frequency with which a contributor updates data within a given area). OSM Tag History visualizes the usage of a tag in the OSM database with a line chart. OSMstats examines further statistical data about the OSM datasets and the users, and visualizes the data with line charts. The geospatial distribution of elements tagged as buildings or roads can be examined with OpenStreetMap Analytics.
1.2. Comparison with Related Work
Our approach can be considered an extension of the analysis proposed in [22], since we have carried out the analysis on a large number of entities and have extended it to consistency, granularity, richness and trust. Our study can also be considered similar to the one proposed in [13], focused here on the Spanish OSM dataset, but without the use of an authoritative dataset. As we will show, this does not prevent a detailed analysis of the tagging process. With the exception of accuracy, quality aspects such as completeness, compliance, consistency, granularity and richness can be analyzed in our approach. Moreover, some other aspects related to trust, such as the number of versions and the local and global experience of contributors, can be analyzed. Instead of using OSM Map Features suggestions and recommendations as in [28], our approach compares the tagging process of Spanish cities with Taginfo in two cases: (1) occurrence of keys, and (2) co-occurrence of keys. In (1), our approach analyzes whether the most used keys (the Taginfo top 300 keys) are also used in Spain. The occurrence of rare keys (ranked beyond the top 300) is considered non-compliant tagging. In (2), our approach analyzes the use of the top 300 combinations of keys in Taginfo. The occurrence of unusual key combinations is considered non-compliant tagging. We have decided to carry out an intrinsic analysis of the current status (at the time of the study) of the Spanish OSM dataset, accepting that this may impose some limitations on the analysis. However, a reference base (i.e., Taginfo) is used to assess the tagging process. In our opinion, even though OSM data are constantly evolving, the analysis of the dataset at a certain instant can offer an adequate picture of data quality. Moreover, our main aim is to compare the data quality of a group of cities, and to establish data quality indicators for current and future analysis. Additionally, an intrinsic analysis makes it possible to develop a Web tool that can analyze pieces of the OSM planet in real time. The number of quality indicators we are able to analyze in real time justifies, in our opinion, this decision. Given that our approach is intrinsic, thematic accuracy (i.e., the degree of correctness in the classification of entities with regard to real-world objects) cannot be measured. Finally, trustworthiness is analyzed as in [26], but in terms of the number of versions and the local and global experience of the contributor. We assume that the more experienced the contributor, the more reliable the entity. A higher number of versions also increases reliability.
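The two compliance cases, occurrence and co-occurrence of keys, can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements about the actual method: the function names are hypothetical, key combinations are modelled as unordered pairs of keys, percentages are computed over occurrences rather than over distinct keys, and the tiny reference sets stand in for the Taginfo top 300 lists, which are not reproduced here.

```python
from itertools import combinations


def key_compliance(entities, reference_keys):
    """Percentage of key occurrences whose key belongs to reference_keys."""
    used = [k for e in entities for k in e]
    if not used:
        return None
    return 100.0 * sum(1 for k in used if k in reference_keys) / len(used)


def combination_compliance(entities, reference_pairs):
    """Percentage of co-occurring key pairs that belong to reference_pairs."""
    pairs = [
        frozenset(p) for e in entities for p in combinations(sorted(e), 2)
    ]
    if not pairs:
        return None
    return 100.0 * sum(1 for p in pairs if p in reference_pairs) / len(pairs)


# Placeholder reference data (NOT real Taginfo rankings).
reference_keys = {"building", "highway", "name"}
reference_pairs = {
    frozenset({"highway", "name"}),
    frozenset({"building", "name"}),
}

# Made-up dataset: the second entity uses a rare key, "my_custom_key".
dataset = [
    {"highway": "residential", "name": "Calle A"},
    {"building": "yes", "name": "B", "my_custom_key": "x"},
]
key_score = key_compliance(dataset, reference_keys)
pair_score = combination_compliance(dataset, reference_pairs)
```

In this toy dataset the rare key lowers both scores, which mirrors the idea that rare keys and unusual key combinations count as non-compliant tagging.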
The structure of the paper is as follows. Section 2 presents the framework of OSM quality analysis and the main results of the analysis of Spanish cities. Section 3 compares Spanish cities with some major European cities. Section 4 presents the Web tool. Finally, Section 5 concludes and presents future work.