An Investigation into the Completeness of, and the Updates to, OpenStreetMap Data in a Heterogeneous Area in Brazil

The integration of user-generated content produced in collaborative environments is increasingly considered a valuable input to reference maps, even by official mapping agencies such as USGS and Ordnance Survey. In Brazil, decades of underinvestment have resulted in topographic map coverage that is both outdated and unequally distributed throughout the territory. This paper analyzes the spatial distribution of OpenStreetMap updates in rural and urban areas of the country in order to understand the patterns of user updates and their correlation with economic and developmental variables. This analysis contributes to the knowledge needed to consider the use of these data as part of a reference layer of the National Spatial Database Infrastructure, as well as to design strategies that encourage user action in specific areas.


Introduction
Web technologies enable people without training in map design or production to become potential cartographers or "produsers" [1][2][3][4][5][6][7][8][9]. Because many types of individuals are involved in the use and production of geoinformation, Volunteered Geographic Information (VGI) has increased in importance due to two main factors. The first is the emergent interest of individuals in using web 2.0 media such as Facebook or OpenStreetMap to generate content and disseminate their own information [3,6,7,10,11]. The second is the interest of official mapping agencies in updating their geodatabases with this rich crowd-sourced content [9][10][11][12][13][14][15][16][17]. This second factor is our research motivation. Here, the concerns about adopting VGI content for official purposes relate to the lack of methods for measuring the reliability of this kind of data [18,19].
The integration of user-generated content produced in collaborative environments is increasingly considered a valuable input to reference maps, even by official mapping agencies such as USGS and Ordnance Survey [16,20,21]. While systems like OpenStreetMap, Wikimapia or Google Maps have triggered a powerful revolution, transforming "map users" into "map makers", researchers and mapping agencies have studied VGI in order to understand how these systems could provide information for official databases [16,17,21]. VGI systems could be a viable way to speed up national information updating processes in developing countries such as Brazil, where decades of underinvestment have resulted in topographic map coverage that is both outdated and unequally distributed throughout the territory [22,23].
In 1994, Estes and Mooneyhan [22] presented a critique of national mapping coverage in developing countries. They used strong words to describe the absence of official geoinformation about these territories: "in many developing countries, even the most basic information related to resources and the environment does not exist" [22]. In Brazil, the scenario remains the same: scarce investment in mapping agencies results in outdated and unequally distributed map coverage of the territory [23]. Figure 1 shows the topographic mapping coverage in Brazil at different scales. At a 1:250,000 scale, almost the whole territory is mapped. In contrast, at a 1:25,000 scale few maps are available. In Brazil, topographic mapping is a shared responsibility of the Geographic Service of the Brazilian Army (DSG) and the Brazilian Institute of Geography and Statistics (IBGE). Brazil is the largest country in South America, with over 8.5 million km², which makes mapping projects expensive, especially government-funded ones.
Nevertheless, a connection between official mapping agencies and VGI systems will still depend on several factors, such as quality tests and standardization [18,21]. To establish a first perspective on VGI quality, Haklay [21] examined the positional accuracy of VGI content in OpenStreetMap. He found that the volunteered information lies, on average, within about 6 m of the position recorded by the Ordnance Survey in the UK [21]. After Haklay [21], several researchers tested the quality of crowd-sourced data, in terms of positional as well as semantic accuracy, with similar findings [20,25-29]. What matters here is that researchers and mapping agencies have repeatedly tested Volunteered Geographic Information with the aim of using this rich source [18] and bringing VGI and official initiatives together [15].
The critical issue in VGI seems to be the evaluation of quality [18,19]. Considerable effort has gone into investigating how reliable this content is, because reliability is considered a matter of quality control [16,18,19,30]. By reliability, we mean "the correctness or accuracy of the information", as stated by Comber et al. [19]. Goodchild and Li [18] argue that there are three alternatives for assuring the quality of VGI content, in contrast with those well known from the classical approach (Guptill and Morrison [31]) and employed by traditional mapping agencies such as Ordnance Survey and USGS. We have focused our attention on the second and third of these alternatives because they relate to "the ability of a group to validate and correct the error that an individual might make" (a Linus's Law approach) and may be the key element for understanding reliability [32].
Goodchild and Li [18] have suggested that this approach, Linus's Law [32], can be applied to quality assurance of VGI. A similar idea is provided by Flanagin and Metzger [30], whose statement is relevant for understanding how to manage VGI content and assure its reliability or, as they prefer, its "credibility". They indicate that the more visited a place is, the more accurate the information about it will be. Taking these two points of view into account [18,30], we believe that the reliability of VGI can be observed and measured by the level of completeness and by how often the database content is updated. In this case, areas with a high level of urbanization might be more visited in VGI systems, because individuals prefer to describe the geographic region in which they live or know something about, digitizing their personal experiences [33]; this is supported by the concept of Topophilia [34]. Additionally, Haklay [21] pointed out some relevant results on a type of segregation phenomenon (the difference between the availability of geoinformation in highly urbanized areas and in rural areas), observing that few users post data in OpenStreetMap. He indicates that "the centers of big cities in England (such as London, Manchester, Birmingham, Newcastle, and Liverpool) are well mapped", while suburban areas, especially the boundary between the city and rural areas, are not. Thus, Haklay [21] stated that "it is important to know which areas are well covered and which are not - otherwise, the data can be assumed to be unusable" when considering the integration of VGI and official databases. If this kind of discrepancy exists in countries with a tradition of reliable maps such as the United Kingdom, it is reasonable to expect that this heterogeneity will be an important issue in developing countries around the world, such as Brazil. Figure 2 shows a comparative scenario between areas with high and low levels of urbanization in Brazil, as represented in OpenStreetMap. The visual contrast between the features mapped in both situations ("a" and "b") is relevant to us because it points to findings similar to those of Haklay [21].
While Brazil has quite recently reached over 200 million inhabitants [35], the distribution of this population over the large territory is extremely unequal, as described by Carvalho [36]. The majority of Brazilians live in urban centers such as São Paulo, Rio de Janeiro, Curitiba, and other state capitals. Although the state capitals remain the largest cities, medium-sized cities have been the focus of demographic investigations because they have attracted internal migrants for reasons such as diversity in economic activities, decentralization of industry and better quality-of-life indicators [37][38][39].
Municipalities form the lowest level of administrative units, after the federation and the states. The concept of "city" in Brazil, as defined by law, comprises the most representative urban agglomeration. As such, every municipality must incorporate at least one city and, in most cases, adjacent rural areas [40].
Therefore, an interesting research subject is to understand which attributes of a geographic region with heterogeneous levels of urbanization are associated with the reliability of VGI content for that region. Moreover, several countries, as pointed out previously [22], could benefit substantially from that understanding once the reliability problem starts to be solved and VGI content can fill gaps in mapping coverage.
Accordingly, the research problem addressed in this paper is the following: do demographic and economic characteristics of a geographic region have a relationship with the level of completeness of, and the frequency of updates to, its VGI content? Our hypothesis is that the more editors work in a single region, the more likely the mapping is to be kept accurate over time. It follows that areas with a high level of urbanization should have the best reliability levels, in terms of completeness and of how up to date the data are. A second premise is that the level of completeness of a VGI system, and how up to date its content is, can be measured by the representation of roads and buildings. We advocate this because, when no maps at a suitable scale exist for a region, roads and buildings are among the first geographic features that individuals represent in a VGI system, especially in emergency situations such as those described by Zook et al. [41] and Liu and Palen [14]. In addition, we argue that the proposed method is suited to areas where no other database is available for comparison, a situation that is common in many developing countries.
Thus, this paper aims to understand the patterns of user updates and their correlation with economic and developmental variables provided by IBGE census data. This type of research can lead to strategies for using VGI that consider local characteristics and needs and that rely on open data, standards and software to obtain the best spatial reference data in support of much-needed decision-making processes. Such an analysis can also contribute to the knowledge needed to consider the use of these data as part of a reference layer of the National Spatial Database Infrastructure, as well as to design strategies that encourage user action in specific areas. This paper describes initial efforts to investigate a way to assess the reliability of VGI content in Brazil or in other developing countries.

The Case Study
Considering how diverse and challenging the Brazilian territory is (a case of heterogeneity that is repeated in many developing countries [22]), we selected a study area as a first attempt to understand how VGI data could benefit official map coverage [22]. The selected study area comprises the Metropolitan Mesoregion of Curitiba (Figure 3), a good example of how diverse Brazil, or any other developing country, might be. This study area is highly heterogeneous: it contains large and small cities (the urban areas of municipalities), highly industrialized areas and mainly agricultural municipalities. There are also areas dependent on tourism, and areas protected for their environmental importance that have no urban occupation of any kind. We considered this heterogeneity a good first challenge for testing our hypothesis.
Furthermore, we chose this region because its municipalities are of particular interest to the Federal University of Paraná community, as our university is involved in developing local strategies that benefit the population in the surrounding areas. The Metropolitan Mesoregion of Curitiba is a unit of Brazilian territory, although not an administrative area. The Brazilian Institute of Geography and Statistics (IBGE) created mesoregions for statistical purposes, grouping municipalities with similar characteristics in terms of proximity, population and economy [42]. The selected region comprises 37 municipalities (Table 1), all within the State of Paraná (Figure 3).
It is worth highlighting once more (see Table 1) that the selected area has a high degree of heterogeneity in terms of population, distribution of financial power and distribution of urban and rural population; these are the characteristics we refer to as heterogeneity indicators. The main city (Curitiba) accounts for over 50% of the total population of the Mesoregion, and the 10 largest municipalities account for almost 90% of the total inhabitants. There are cities with high or medium Human Development Index (HDI) levels (e.g., Curitiba, Pinhais, Paranaguá, São José dos Pinhais) and cities with low levels (e.g., Doutor Ulysses, Itaperuçu, Tunas do Paraná, Tijucas do Sul). Cities with good indicators, such as high GDP per capita and HDI, are also those with a larger urban population and those closer to the state capital, Curitiba. In addition, these cities tend to have a variety of industries (e.g., Campo Largo, Araucária, Pinhais, and São José dos Pinhais).

Methodology and Results
Figure 5 summarizes the workflow we adopted. In this section we describe all the procedures illustrated in that figure.
In the first step of our research, we selected the case study with the aim of testing our procedures. The second step comprised data extraction from OpenStreetMap, from the official Brazilian topographic database (IBGE [43]) and from demographic sources (IBGE [34,44] and others [45]). The OSM data used was an extract from Planet.osm [46] taken on 25 January 2015, along with the change sets available in JOSM on the same date. Geofabrik extracts large portions of the OSM database and makes them available for download, whereas downloading directly from the OSM database is subject to a size limit. Demographic data, such as population and average household income, come from the 2010 census [35]. The Human Development Index (HDI) is calculated by UNDP, also using 2010 data [45]. The municipal GDP is 2011 data from IBGE [44]. Census spatial data, also from the IBGE census [35], provided the basis for area calculations and for the division between official rural and urban areas, along with the official municipality boundaries. Topographic mapping is available as a geodatabase from IBGE only at 1:250,000 scale [43].
In the third step, we defined the data quality components that would compose the assessment and support the discussion during the fourth step (data analysis). Without detailed reference data, no established methodology exists for assessing crowd-sourced geoinformation, so we adopted two parameters of comparison from ISO 19157 [47] that match the main goal of this work: completeness and temporal quality. For completeness, we used four groups of evaluations: one case for rural areas and three cases for urban ones. For temporal quality, we considered two topics. Table 2 shows the parameters and their definitions. All parameters were designed to allow comparison between the municipalities, not as absolute quality measures.
As Table 2 shows, the first element studied was completeness in rural areas (completeness 1 in Table 2). This parameter was calculated by dividing the total length of roads in the road layers of the official 1:250,000 IBGE topographic database by the total length of roads in OSM. The OSM total excluded footpaths and railroads, which are stored in different layers in the IBGE database. Additionally, major roads represented by double lines in OSM had their length divided by two, because the official database, at this scale, represents them as single lines. Figure 6 shows the results of this approach; Table 3 shows the total length of roads in rural areas by municipality. Both representations (Figure 6 and Table 3) show that municipalities close to the capital (Curitiba) are mapped in more detail, and this is the focus of the data analysis for this topic.
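The rural completeness calculation described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the `OsmRoad` record, its `kind` labels and the `dual_carriageway` flag are hypothetical names introduced here, not fields of the OSM or IBGE databases, and the ratio follows the direction given in the text (official length divided by adjusted OSM length).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class OsmRoad:
    """One OSM line feature (hypothetical, simplified record)."""
    length_km: float
    kind: str                       # assumed labels: "road", "footpath", "railway"
    dual_carriageway: bool = False  # True when drawn as a double line in OSM


# Feature kinds left out of the OSM total, as described in the text.
EXCLUDED_KINDS = {"footpath", "railway"}


def adjusted_osm_length(roads: Iterable[OsmRoad]) -> float:
    """Sum OSM road length, skipping excluded kinds and halving double lines."""
    total = 0.0
    for road in roads:
        if road.kind in EXCLUDED_KINDS:
            continue
        # Double-line roads are halved to match the single-line official data.
        total += road.length_km / 2 if road.dual_carriageway else road.length_km
    return total


def completeness_1(official_length_km: float,
                   osm_roads: Iterable[OsmRoad]) -> Optional[float]:
    """Official road length divided by adjusted OSM road length.

    Returns None when OSM has no roads, so the ratio is undefined.
    """
    osm_length = adjusted_osm_length(osm_roads)
    if osm_length == 0:
        return None
    return official_length_km / osm_length
```

With this direction of the ratio, values below 1 mean that OSM stores more road length than the official 1:250,000 database for the same rural area.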
For completeness in urban areas, however, we considered three more cases (see Table 2, completeness 2-4). This approach was chosen because no official database of urban areas exists for all municipalities. In fact, open data for urban areas in our study region has only been released for the state capital, Curitiba. For this reason, completeness in urban areas was evaluated with alternative methods that do not require official geoinformation. The first criterion adopted was the density of urban roads per square kilometre (completeness 2, Table 2). This may not be the optimal solution, as urban areas are legally defined in Brazil and show an uneven distribution of urbanization between municipalities. In future studies, satellite images could provide a better assessment of urban patterns. The second criterion was the total number of features in the building layer of OSM in each urban area (completeness 3, Table 2). The third evaluation was attribute completeness in urban areas (completeness 4, Table 2), which considered the percentage of roads with neither a name attribute nor a detailed description. This situation can arise when features are digitized purely from remotely sensed data, without field knowledge of their attributes. This parameter was therefore calculated as the percentage of unclassified features among all features.
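As a minimal sketch of the three urban completeness indicators described above, the functions below operate on plain Python dictionaries. The dictionary keys (`"layer"`, `"name"`, `"description"`) are assumptions made for illustration; they are not the actual OSM tagging schema, and the real processing was presumably done in GIS software.

```python
from typing import Iterable, Mapping, Optional


def road_density(total_road_km: float, urban_area_km2: float) -> float:
    """Completeness 2: kilometres of urban road per square kilometre."""
    if urban_area_km2 <= 0:
        raise ValueError("urban area must be positive")
    return total_road_km / urban_area_km2


def count_buildings(features: Iterable[Mapping[str, object]]) -> int:
    """Completeness 3: number of features in the (assumed) building layer."""
    return sum(1 for feature in features if feature.get("layer") == "building")


def unclassified_percentage(roads: Iterable[Mapping[str, object]]) -> Optional[float]:
    """Completeness 4: share of roads with neither a name nor a description.

    Empty strings and missing keys both count as "no attribute".
    Returns None when there are no roads at all.
    """
    roads = list(roads)
    if not roads:
        return None
    unclassified = sum(
        1 for road in roads
        if not road.get("name") and not road.get("description")
    )
    return 100.0 * unclassified / len(roads)
```

Because all three indicators are computed per municipality with the same rules, they support the relative comparison the study aims for, even though none of them is an absolute quality measure.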
The next two elements refer to temporal quality in urban areas. This assessment rests on the premise that the more editors work in a single region, the higher the likelihood that the mapping stays accurate over time. The number of editors was calculated by counting the editors who contributed to change sets, as supplied in JOSM, with data in urban areas. The date of the last edit was also recorded in order to compute the number of days since the last change. All results for urban areas are presented in Table 4 and Figure 7.
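The two temporal indicators can be illustrated as follows. This is a sketch under assumptions: each change set is represented as a dictionary with hypothetical `"user"` and `"date"` keys, and the reference date is taken to be the extraction date mentioned in the text (25 January 2015).

```python
from datetime import date
from typing import Mapping, Optional, Sequence, Tuple

# Extraction date stated in the text; used here as the reference "today".
REFERENCE_DATE = date(2015, 1, 25)


def temporal_quality(
    changesets: Sequence[Mapping[str, object]],
    reference_date: date = REFERENCE_DATE,
) -> Tuple[int, Optional[int]]:
    """Return (number of distinct editors, days since the last edit).

    The second value is None when the area has no change sets at all.
    """
    if not changesets:
        return 0, None
    editors = {cs["user"] for cs in changesets}        # distinct contributors
    last_edit = max(cs["date"] for cs in changesets)   # most recent change
    return len(editors), (reference_date - last_edit).days
```

An area with no change sets yields `(0, None)` rather than a large day count, which keeps "never edited" distinguishable from "edited long ago".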
The fifth and final step comprised the data analysis. Here we explored the data by relating the quality parameters to the demographic information, aiming to understand the dynamics of user updates in OSM and their correlation with economic and demographic variables such as GDP per capita and HDI.

Discussion
As stated earlier, the purpose of this research is to analyse the spatial distribution of OpenStreetMap updates across rural and urban areas in Brazil in order to understand the patterns of user updates and their correlation with economic and development variables. Up to this point, we have presented the results for the quality parameters investigated. The main focus, however, is to compare these results with the demographic parameters shown in Table 1 in order to explore possible correlations between these variables. To do so, we calculated the Pearson correlation coefficient (Equation (1)). From this calculation, we obtained the data presented in Table 5, which shows the correlation between the quality parameters and the demographic data. For the purposes of this analysis, we considered a correlation to be strong above 0.70, moderate between 0.40 and 0.69, and weak below 0.39.
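For readers who wish to reproduce this step, the following is a minimal sketch of the Pearson coefficient and of the strength classes used in this analysis. Two details are assumptions rather than statements from the text: the classification is applied to the absolute value of the coefficient (so that a negative value such as −0.45 is also read as moderate), and the class boundaries are closed at 0.40 and 0.70 so that no value falls between classes.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("samples must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r: float) -> str:
    """Classify |r| as strong, moderate or weak (boundaries are assumptions)."""
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.40:
        return "moderate"
    return "weak"
```

In practice, one sample would hold a quality parameter for the 37 municipalities and the other a demographic variable such as population or HDI, in the same municipality order.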
Considering these strength thresholds, the first distinct correlations analysed were those of road-layer completeness in rural areas with total population (0.43) and with distance from Curitiba (−0.45). This suggests, although only to a moderate degree, that the most populated areas, nearer the capital, have a higher density of roads mapped in OSM, as Figure 6 indicated. More specifically, in cities adjacent to Curitiba with a large population, such as Araucária, Almirante Tamandaré, Colombo, Fazenda Rio Grande and Tijucas do Sul, OSM holds almost 10 times (1000%) the length of roads stored in the official (IBGE 1:250,000) database. On the other hand, smaller and more isolated municipalities, such as Adrianópolis, Campo do Tenente and Doutor Ulysses, have fewer roads represented in OSM than on official topographic maps. Flanagin and Metzger [30] suggested something similar: the more visited a place is, the more accurate the information about it is. In other words, our results indicate that the best-mapped municipalities, in terms of roads in rural regions, are also the more populated ones.
The urban quality parameters did not exhibit uniform behaviour in the data collected. However, in four of the five tests, the highest correlation found suggests a weak association with population, which points to a possible interpretation: the more populated an area is, the more data are available and the more often they are updated, as stated before.
Analyzing road completeness in urban areas, we note that highly dense urbanized areas returned higher scores, such as Curitiba (11.42) and São José dos Pinhais (8.48), with some outliers in small cities such as Tunas do Paraná and Bocaiúva do Sul. Even so, the highest correlation (0.69) is with population density, which is consistent with the nature of this parameter. The main issue here is that the legal designation of urban areas is not always consistent among local authorities and, therefore, does not necessarily imply uniform urban patterns.
The third urban data quality parameter was attribute completeness. This parameter presented the weakest correlation with demographic variables; the strongest was with population (0.29). The distribution is highly dispersed, with mainly small cities having almost no attributes on their streets (Campo do Tenente, 0%; Itaperuçu, 3%; Mandirituba, 1%; and Porto Amazonas, 0%). However, some of the less populated places have a very high rate, such as Doutor Ulysses with 75%, Guaraqueçaba with 84% and Pontal do Paraná with 71%. This could be due to individual efforts to provide attribute information in these places.
The next two observations compare the temporal aspects between areas. Individual contributors are highly concentrated in Curitiba, with 200 (around 0.01% of the population). The next municipalities by number of contributors are São José dos Pinhais (54), Colombo and Pinhais (41 each), which are, respectively, the second, third and fifth largest cities. This is also reflected in a strong correlation of 0.97, which shows that editors are more abundant in bigger cities, in proportion to an average 0.1% of the urban population.
The last urban analysis considered the number of days since the last edit. The idea was that an active community would add to the database more often, keeping it up to date and implying greater temporal quality. Fourteen cities had edits in the month prior to the study, among them Curitiba, São José dos Pinhais and Campo Largo. At the other extreme, Doutor Ulysses, Quitandinha and Itaperuçu had no updates at all in a year or more. These cities are among the poorest and most isolated areas. In fact, although the correlation for this parameter was not particularly strong, HDI and average income appear to have a stronger influence than population.
In the maps in Figure 7, we can observe the distinct spatial distribution patterns of the proposed urban parameters. Although at first glance they seem very different from each other, Curitiba, São José dos Pinhais and Colombo, the more populated areas, are often in the highest class of each parameter. Doutor Ulysses, Tunas do Paraná and Porto Amazonas, and other smaller, poorer or isolated areas, are, with some exceptions, mostly in the lower data quality classes. Rather than deriving a mathematical formula for how these parameters behave, this study achieved its aim of analysing the specifics of each area.

Conclusions
The aim of this paper was to observe the distribution of OpenStreetMap data in a significantly diverse region of Brazil. Comparing various parameters, we observed that this distribution is uneven and concentrated mainly in the areas with the largest population. One relevant point is that in these urban areas we could not measure absolute quality parameters, as no datasets were available to use as ground truth. Instead, we compared a number of data quality indicators, observed with homogeneous criteria, across the 37 municipalities studied.
In order to consider the use of OSM data as input to official spatial databases, these quality issues must be addressed. In urban areas, the places where data are more abundant and more often updated are also the cities with resources to invest in mapping initiatives. In the places where these data are most needed and where both financial and human resources are scarcer, contributors might need incentives or a specific call to concentrate their efforts, as such places do not seem to fall naturally within the scope of volunteer map makers. This study showed that, even without official databases available to assess absolute quality parameters, comparisons can be made between distinct areas, and these comparisons show that VGI alone cannot be the answer for providing data in poorer and isolated areas, precisely where the lack of official maps is most significant.
To expand this initial approach, a future research agenda could address aspects such as positional accuracy, correlation with other variables and the statistical significance of the correlations, including spatial statistics techniques. It is also important to define thresholds of acceptable quality in order to consider these data as part of official databases. An issue that could enhance the understanding of VGI updates is the peculiarity of tourist areas, such as the municipalities in the coastal and Serra do Mar region. These areas seem to have an increased number of contributions, which could be due to visitors and not only to the local population, but this effect was not part of the present study.
Open data, in conjunction with open standards and software, can help local authorities and the population to manage their space more efficiently, considering social and economic factors. VGI can play an important role in this process once we understand its nature and the quality aspects related to it.

Figure 2. The first window (a) shows an area with a high urbanization level as represented in OpenStreetMap, comprising the cities of Curitiba (South region of Brazil), São Paulo (Southeast region of Brazil) and Rio de Janeiro (Southeast region of Brazil). This first window (a) comprises over 25% of the total population of Brazil. In contrast, the second window (b) shows an area with a low level of urbanization, at the same scale as (a). This second, less populated area comprises two state capitals (Cuiabá and Goiânia) as well as the federal capital, Brasília. Looking at the pictures, one can see the large difference between the amounts of features mapped in the two cases. This is likely the result of the "segregation phenomenon" described by Haklay [21]. Source: Adapted from OpenStreetMap, 2015.

Figure 3. The Metropolitan Mesoregion of Curitiba within the Brazil and Paraná State context.

Figure 4. The selected municipalities comprising the Metropolitan Mesoregion of Curitiba.

Figure 6. Completeness 1: Length of rural roads from the OSM base vs. length of rural roads from the official topographic mapping (at 1:250,000 scale).

Figure 7. Five data quality elements in urban areas.

Table 1. The selected municipalities comprising the Metropolitan Mesoregion of Curitiba.
* Curitiba and Pinhais do not have a rural area.

Table 2. Parameters and definitions.

Table 3. Total length of roads in rural areas by municipality.
* Curitiba and Pinhais do not have rural areas.

Table 4. Urban data quality elements.