Using Latent Semantic Analysis to Identify Research Trends in OpenStreetMap

OpenStreetMap (OSM), based on collaborative mapping, has become a subject of great interest to the academic community, resulting in a considerable body of literature produced by many researchers. In this paper, we use Latent Semantic Analysis (LSA) to help identify the emerging research trends in OSM. An extensive corpus of 485 academic abstracts of papers published during the period 2007–2016 was used. Five core research areas and fifty research trends were identified in this study. In addition, potential future research directions have been provided to aid geospatial information scientists, technologists and researchers in undertaking future OSM research.


Introduction
OpenStreetMap (OSM), founded in 2004, provides a free editable map of the world, available under an Open Database License (ODbL). The project is supported by Web 2.0 technologies, which enable more coordinated efforts among web clients, different users and content suppliers [1]. These technologies render new methods for sharing information [1][2][3] by crowdsourcing developments such as Wikipedia [4]. In the context of geographic information, crowdsourcing is also known as Volunteered Geographic Information (VGI) [5] or collaborative mapping [6], of which websites such as Wikimapia and OSM [7] are good examples. The volunteers contributing to OSM use various devices to record GPX tracks and edit the information using online editors (e.g., iD, Potlatch) or offline editors (e.g., JOSM) [8].
The VGI researcher community has focused on devising methods and tools for the utilization of volunteered data. During the last decade, OSM has gained in maturity, and numerous papers on different aspects of OSM have been published. In particular, a trend towards the analysis and fitness-for-use of OSM in various application domains has been witnessed. However, such research trends are not well understood, and hence, it is important to identify them. Manual systematic reviews [9] or semi-automated topic modeling algorithms [10][11][12] are two methods that can be employed. Systematic reviews are more critical and can be biased, whereas semi-automated methods are more generic in finding the trends [13]. A comprehensive review has been published by Senaratne et al. [14], which focuses on text-, map- and image-based VGI, but this study involved a manual review of 56 papers describing quality assessment methods. In contrast, the current study uses a quantitative approach called Latent Semantic Analysis (LSA), which is a well-established method. LSA has been used previously by See et al. [15] to analyze trends in VGI, but the focus of our study is specifically on OSM. Hence, in this paper, we apply LSA to 485 abstracts of research papers published during the period 2007–2016 with the aim of discovering the core research areas of OSM, the trends and their relationships. In addition, we suggest future research directions. This set of 485 papers is considered sufficiently large for text mining, as explained in Evangelopoulos et al. [12]. On the basis of this study, five core research areas and fifty research trends have been identified.
A secondary aim of this study is to answer the following research questions framed by Kitchenham et al. [16]:
RQ1: Who is leading OSM research?
RQ2: Which research areas have been widely investigated by researchers?
RQ3: How has the focus of topics within each core research area changed over time?
RQ4: What are the potential future directions of OSM research?
This paper is divided into six sections. The next section describes the methodology adopted for collecting the research literature, as well as the steps undertaken in the application of the LSA approach. The third section discusses the results obtained from different topic solutions and maps the research trends to core research areas. The fourth section answers the research questions, while the fifth section considers the limitations of the study. The conclusions drawn from the findings of the study are provided in the last section.

Data Acquisition
Various bibliographic databases were used to collect the literature dataset. The articles were selected using "OpenStreetMap (OSM)" OR "volunteered geographic information (VGI)" OR "crowdsourced map" as search keywords. The open-source tool JabRef [17] was used for collection, screening, selection and corpus preparation. The bibliographic databases searched included IEEExplore, ScienceDirect, the DBLP computer science bibliography, ArXiv, the Directory of Open Access Journals (DOAJ), the Association for Computing Machinery Digital Library (ACM DL) and CiteSeerX. In addition, a manual search of Taylor and Francis, Wiley, the MDPI journal bibliographic database and the Zotero repository was undertaken, and any relevant literature found was added to the collection. The assembled literature was then manually reviewed in JabRef to identify the articles based on the inclusion and exclusion criteria set out in Table 1. The elimination of duplicate articles or those that were out of focus, as described in Table 2, resulted in 485 articles in the final literature dataset. This dataset included nineteen papers translated from German, Spanish and Italian into English. The dataset was exported to a CSV (comma-separated values) file using an export filter designed specifically for JabRef [18]. The exported file included titles, abstracts and year of publication.

Application of Latent Semantic Analysis
The literature dataset described in Section 2.1 was provided to the LSA model for uncovering the "latent" semantic structure [19][20][21]. LSA, which is a natural language processing approach, provides a methodology for automatically organizing, understanding, searching and summarizing a textual dataset. It examines the relationship between documents and terms in the dataset to reveal concepts. It is an unsupervised text-mining approach that uses Singular Value Decomposition (SVD) to create a low-dimensional space for finding relationships, revealing topics and comparing documents [13,19,22–26]. Moreover, it is an established approach for identifying research trends prevailing in a large literature dataset [12,20,27,28]. Recommendations for application of the methodology were taken from Evangelopoulos et al. [12]. Since the aim of the study was to find the latent structure of the corpus, the factor analysis extension to LSA was applied using the fast truncated incremental stochastic SVD algorithm with a single pass [29,30].
The application of LSA produces two matrices, namely a term-loading matrix and a document-loading matrix. The term-loading matrix represents the topics and their associated highly-loaded terms. The document-loading matrix represents the topics and their associated highly-loaded documents. Higher loading values indicate a stronger association with a topic [13]. Table 5 shows a five topic solution representing five latent classes, associated keywords and their labels. It also presents the highly-loading terms in the term-loading matrix, generated from an empirical analysis after the application of the LSA to the literature dataset. The detailed procedure followed for the semantic analysis is discussed in the following sections.

Table 5 (rows T5.3 to T5.5): topic, label and high-loading stemmed terms.
T5.3 | Applications to navigation and disaster | servic, mobil, user, map, devic, indoor, web, haptic, navig, collabor, spatial, visual, interfac, queri, disast
T5.4 | Traffic simulation and mobility | traffic, simul, activ, map, commun, contributor, citi, collabor, urban, contribut, swarm, sumo, time, real
T5.5 | Indoor navigation models | indoor, rout, build, qualiti, transport, land, public, plan, germani, trust, footprint, floor, accuraci, complet

Pre-Processing and Term-Filtering
The first step in the corpus preparation was pre-processing and term-filtering. Pre-processing of the literature dataset is a vital part of any text-mining algorithm. The characters, words and sentences discovered during the pre-processing step act as tokens for further processing by the LSA. This step helps to reduce the size of the dictionary and improves the efficiency and effectiveness of the text-mining approach [31,32]. It involved the removal of names, numbers, abbreviations, slang, acronyms, punctuation and N-character terms as recommended by Evangelopoulos et al. [12]. The following steps (developed in Python using the Natural Language Toolkit (NLTK), http://www.nltk.org) were followed for corpus preparation:
1. Sentences (titles and abstracts) for each publication (document) were tokenized.
2. Tokens in each document were converted to lowercase letters.
3. Punctuation, including periods, exclamation points, commas, apostrophes, question marks, quotation marks and hyphens, was eliminated.
4. Numbers were filtered out so that only textual terms remained.
5. N-character filtering was performed to remove all terms with fewer than three characters.
6. English stop-words (the stop-word list of the NLTK Python package) and the keywords common to all of the publications ("OpenStreetMap", "Volunteered", "Geographic", "Information", "Crowdsource", "Maps", "OSM", etc.) were removed. The dataset was then further refined to remove terms that occurred only once in a document. These terms were local to a particular document and were considered insignificant [33].
7. The SnowballC stemmer algorithm was applied to convert inflected words to the base stem of the tokens in each document.
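The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the stop-word set is a tiny stand-in for the NLTK list, the `stem` parameter defaults to the identity function (a real stemmer could be passed in), and the final filter reads "terms that occur only once" as terms appearing in a single document, which is an assumption.

```python
import re
from collections import Counter

# Tiny stand-in stop-word list (assumption); the study used NLTK's English list.
STOP_WORDS = {"the", "and", "for", "with", "from", "that", "this", "are", "was", "has"}
# Corpus-wide keywords named in the text, lowercased.
DOMAIN_WORDS = {"openstreetmap", "volunteered", "geographic", "information",
                "crowdsource", "maps", "osm"}


def preprocess(text, stem=lambda token: token):
    """Tokenise one title-plus-abstract string and apply the filtering steps."""
    # Steps 1-4: lowercase, then keep alphabetic runs only, which drops
    # punctuation and numbers in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 5: N-character filtering (fewer than three characters).
    tokens = [t for t in tokens if len(t) >= 3]
    # Step 6: remove stop-words and corpus-wide keywords.
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in DOMAIN_WORDS]
    # Step 7: stemming; the default is the identity, i.e. no stemming.
    return [stem(t) for t in tokens]


def drop_single_document_terms(docs):
    """Remove terms that occur in only one document (one reading of step 6)."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[t for t in doc if doc_freq[t] > 1] for doc in docs]


docs = [
    preprocess("OSM data quality in 2016: an assessment of the road network!"),
    preprocess("Indoor routing with OpenStreetMap data."),
]
print(docs)
print(drop_single_document_terms(docs))
```

Any callable that maps a token to its stem can be supplied as `stem`, so the choice of stemmer stays independent of the filtering logic.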
Initially, the dataset had 87,348 tokens. After the pre-processing step, the token count was reduced to 2510. In this study, 485 sparse vectors were created with the 2510 tokens. A dataset of 485 documents was thus converted to a vector space, where the rows represented the 2510 terms (dimensions) and each of the 485 columns corresponded to an article. Each document was then converted to a representation called "bag-of-words" [33]. This mapping process converted each term in the document to its integer identifier (ID), along with its count of occurrences in each document, producing a dictionary. This dictionary was passed to the next step of the process to create a weighted matrix.
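As an illustration of the bag-of-words mapping, the sketch below assigns an integer ID to each distinct term and represents every document as (ID, count) pairs. The function names and the toy documents are hypothetical; the study's own tooling is not shown in the text.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign a stable integer ID to every distinct term, in sorted order."""
    vocabulary = sorted({term for doc in docs for term in doc})
    return {term: idx for idx, term in enumerate(vocabulary)}


def to_bag_of_words(doc, dictionary):
    """Return sorted (term_id, count) pairs for one tokenised document."""
    counts = Counter(dictionary[t] for t in doc if t in dictionary)
    return sorted(counts.items())


# Two toy documents made of already stemmed tokens.
docs = [["road", "qualiti", "road"], ["indoor", "rout", "qualiti"]]
dictionary = build_dictionary(docs)
bows = [to_bag_of_words(doc, dictionary) for doc in docs]
print(dictionary)
print(bows)
```

Each document becomes a sparse vector: only the terms that actually occur are stored, which matters when 485 documents share a 2510-term vocabulary.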

Term Frequency-Inverse Document Frequency
A TF-IDF weighting scheme was utilized to reflect the significance of a given entity in comparison to other entities (term or document) in the corpus. It increased the weight in proportion to the number of occurrences of a word in the document, but this was offset by the frequency of the word in the corpus. This helped to adjust the weight of words appearing more frequently [13]. Furthermore, it resulted in a better topic analysis [12,34,35]. Various combinations of TF-IDF weighting schemes can be applied [36,37]. The approach followed in this study is represented in Equation (2), where W_{i,j}, tf, df and n_d describe the TF-IDF weight obtained, the term frequency, the number of documents in which the term appeared and the number of documents in the dataset, respectively. The term frequency (Equation (1)) represents the local component and measures the frequency of occurrence of a term in a document; the inverse document frequency is the global component and expresses the importance of a term in the document collection, i.e., log_2(n_d/df_i). To derive the term-document weighted matrix, the local component (term frequency) was multiplied by the global component (inverse document frequency). Using the weighting scheme given in Equation (2), a 2510 × 485 term-document weighted matrix was created, with one entry for the ith term in the jth document of the corpus of n_d documents. The same weighting scheme was used in all of the identified topic solutions.
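The weighting described above, a local term frequency multiplied by log_2(n_d/df_i), can be sketched as follows. The sketch assumes the raw count as the term frequency, since the exact form of Equation (1) is not shown here, and it assumes each term ID appears at most once per bag-of-words list. The function name and toy input are hypothetical.

```python
import math


def tfidf_matrix(bows, n_terms):
    """Build a term-by-document matrix with W[i][j] = tf[i][j] * log2(n_d / df[i])."""
    n_docs = len(bows)
    # Document frequency: in how many documents does each term appear?
    df = [0] * n_terms
    for bow in bows:
        for term_id, _count in bow:
            df[term_id] += 1
    # Rows are terms, columns are documents, matching the 2510 x 485 layout.
    weights = [[0.0] * n_docs for _ in range(n_terms)]
    for j, bow in enumerate(bows):
        for term_id, count in bow:
            weights[term_id][j] = count * math.log2(n_docs / df[term_id])
    return weights


# Two toy documents as (term_id, count) pairs over a four-term vocabulary.
bows = [[(1, 1), (2, 2)], [(0, 1), (1, 1), (3, 1)]]
W = tfidf_matrix(bows, n_terms=4)
for row in W:
    print(row)
```

Term 1 occurs in both documents, so its inverse document frequency is log_2(2/2) = 0 and its weight vanishes, which is the offsetting effect the paragraph describes.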

Singular Value Decomposition
The prepared TF-IDF weighted matrix was provided to the fast truncated SVD in order to perform rank lowering. The SVD model X = UΣV^T factorized the matrix X into three components: initial rotation U, scaling Σ and final rotation V [19,30,38,39], as described in Equations (3) and (4).
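As a worked identity (a sketch from the standard SVD properties, not a reproduction of the authors' numbered equations), substituting the factorization into the two matrix products and using the orthonormality of the columns of U and V gives:

```latex
\begin{align}
X &= U \Sigma V^{T} \\
X X^{T} &= U \Sigma V^{T} V \Sigma^{T} U^{T} = U \Sigma \Sigma^{T} U^{T} \\
X^{T} X &= V \Sigma^{T} U^{T} U \Sigma V^{T} = V \Sigma^{T} \Sigma V^{T}
\end{align}
```

The columns of U are therefore eigenvectors of XX^T (the term side) and the columns of V are eigenvectors of X^TX (the document side), with the squared singular values as the shared eigenvalues.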
The matrix products XX^T and X^TX provided the term loadings and document loadings with respect to the topics, and ΣΣ^T represented the weights of the topics (the squared singular values) in descending order. The maximum number of topics generated was equal to the number of documents in the corpus. For extracting a few topics (k), the topmost k singular values were taken from the matrix ΣΣ^T [12,40].
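A minimal NumPy sketch of the rank-k truncation follows. It uses an exact dense SVD rather than the incremental stochastic algorithm cited in the text, and it assumes that loadings are taken as the singular vectors scaled by the singular values, a common factor-analysis convention that the text does not spell out. The matrix values and the function name are hypothetical.

```python
import numpy as np


def truncated_svd(X, k):
    """Return rank-k term loadings, singular values and document loadings."""
    # full_matrices=False gives the thin SVD; s is sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_loadings = U[:, :k] * s[:k]     # one row per term, one column per topic
    doc_loadings = Vt[:k, :].T * s[:k]   # one row per document, one column per topic
    return term_loadings, s[:k], doc_loadings


# Toy term-by-document matrix: 4 terms, 3 documents.
X = np.array([[0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0]])
term_loadings, sigma, doc_loadings = truncated_svd(X, k=2)
print(term_loadings.shape, sigma, doc_loadings.shape)
```

With k equal to the full rank the factors reproduce X exactly; smaller k keeps only the strongest topics and discards the rest as noise.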

Dimensional Reduction: Selecting Optimal Topic Solutions
Dimensional reduction is the process of selecting the k largest singular values from the singular matrix obtained by applying SVD. Selecting an optimal dimensional reduction has always been an open issue [12], which requires extensive understanding and iterations to reach the optimal value. A low value of k is not sufficient to represent relationships between the terms and documents, whereas a large value induces noise.
As discussed by Deerwester et al. [19], Bradford [41] and Dumais [26], the optimal number of topics for 1000 documents is approximately 100. Based on these recommendations, which scale to roughly fifty topics for a corpus of 485 documents, a fifty topic solution was considered optimal for depicting the research trends in OSM; in addition, three, five and ten topic solutions were considered to describe the core research areas.

Selecting Threshold Values for Topic Solutions
The term-loading and document-loading matrices consisted of corresponding weights for the uncovered topics, i.e., each cell of a matrix represented the loading value corresponding to the term/document (row-wise) and topic (column-wise). The values in the loading matrices were both positive and negative. For interpreting the results, varimax rotation [42,43] was performed on both matrices. This resulted in increased loading for one topic relative to other topics [13,20]. The number of documents loaded for a particular topic describes the proximity to that topic. To distinguish between significant and insignificant loadings, a heuristic approach called an empirical tail distribution was applied to select the threshold values as discussed by Sidorova et al. [20] and Yalcinkaya and Singh [13]. For instance, to define the threshold value of documents for a ten topic solution, the loading values of the 485 documents on each of the ten topics were transformed from matrix form to a vector (a one-dimensional matrix with 4850 elements). After sorting this vector in descending order, the threshold value was obtained by retaining the 485th highest loading value of this vector. As per the calculations performed by the tail distribution, the threshold values for three, five, ten and fifty topic solutions were 0.133, 0.142, 0.162 and 0.183, respectively, for document loadings. Thus, any document having a loading of less than these values was considered insignificant for the topic. Furthermore, the terms and documents were loaded to only one topic.
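The tail-distribution heuristic can be illustrated with a short sketch: flatten the document-loading matrix, sort in descending order and take the value at the position equal to the number of documents. The toy loadings are hypothetical, the sketch sorts signed values because the text does not mention absolute values, and the single-topic assignment rule (highest loading wins) is an assumption.

```python
def tail_threshold(loadings):
    """Return the n-th largest loading, where n is the number of documents.

    `loadings` holds one row per document and one column per topic.
    """
    n_docs = len(loadings)
    flat = sorted((value for row in loadings for value in row), reverse=True)
    return flat[n_docs - 1]


def assign_topics(loadings, threshold):
    """Assign each document to its single highest-loading topic, or None."""
    assignments = []
    for row in loadings:
        best = max(range(len(row)), key=lambda j: row[j])
        # Loadings below the threshold are treated as insignificant.
        assignments.append(best if row[best] >= threshold else None)
    return assignments


# Toy document-loading matrix: 4 documents, 3 topics.
loadings = [[0.40, 0.05, -0.10],
            [0.02, 0.30, 0.25],
            [0.01, 0.03, 0.04],
            [0.10, -0.20, 0.35]]
threshold = tail_threshold(loadings)
print(threshold)
print(assign_topics(loadings, threshold))
```

For the ten topic solution in the text the same rule would pick the 485th largest of 4850 values, so on average one significant loading is retained per document.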

Topic Labeling
The loading values in the term-loading and document-loading matrices were sorted in descending order. An iterative approach was followed for topic labeling by examining highly loaded key-terms in the term-loading matrix and documents in the document-loading matrix for each topic solution. The highly-loaded values were grouped together to create a sensible label for each topic, as shown in Tables 6–8. The topic labeling was subject to the possibility of human bias, as the degree of topical coherence varied significantly.

Summary of Topic Solutions
The application of LSA resulted in three, five and ten topic solutions representing the core research areas, which are presented in Table 6 along with the topic labels and the number of publications for three different time periods within 2007–2016. Topic solutions are represented as Ti.j, which denotes the jth factor of the i topic solution, e.g., T10.4 represents the fourth factor of the ten topic solution. The number of articles associated with a particular topic represents the importance of the corresponding research area within that topic solution. The mapping presented in Table 9 shows the semantic connections between the core research areas and the trends established using the cross-loading analysis.

Core OSM Research Areas
The core research areas exhibited in Table 6 for the three topic solution focused on "quality assessment and analysis" (T3.1), "routing and navigation" (T3.2) and "miscellaneous" (T3.3). These articles emphasized the development of methods for the quality assessment of crowdsourced data and issues pertaining to routing and their applications. The core research areas that emerged in the five topic solution were "quality assessment and analysis" (T5.1), "assessment of contributors' behavior" (T5.2), "applications to navigation and disaster" (T5.3), "traffic simulation and mobility" (T5.4) and "indoor navigation models" (T5.5). The five highly-loaded documents for each topic along with the loading values are presented in Table 7.
The results revealed that numerous high-loading publications converged to one research area, i.e., "quality assessment and analysis", in the three, five and ten topic solutions. This is because OSM was originally developed in response to the high cost of government data and so represents an alternative source of open information. Hence, it is unsurprising that OSM has been compared to numerous proprietary map datasets. This topic appeared across (T3.1)-(T10.1), but the number of high-loading documents has decreased as new topics have emerged from the corpus. There has been an extensive discussion on quality assessment of OSM data since 2008 to understand the fitness of the data in various application areas [69]. Researchers have applied various established data quality indicators [70][71][72][73][74][75] for the assessment and analysis of OSM data. The evolution of OSM and its assessment in different regions of the world, particularly Europe, the USA and China, has been observed. The established assessment methods of comparing OSM data against authoritative data are not always feasible [14,76]; thus, researchers have explored intrinsic quality indicators to assess OSM data.

OSM Research Trends
The fifty topic solution uncovered OSM research trends as presented in Table 8, with a count of highly-loaded papers for each topic. Papers with a loading value of 0.183 or more were considered relevant for a particular topic. The distribution of highly-loaded articles presented in Table 8 shows that the "quality assessment" (T50.1) trend emerged as being highly explored in the fifty topic solution. This was consistent with the three, five and ten topic solutions, with 185 papers contained in this topic. Some of the highly-loaded papers compared OSM data with other authoritative and proprietary datasets [7,45,47,48, ...] based on data quality parameters as suggested by Guptill [73] and Longley et al. [72]. Others reported on the analysis and implementation of frameworks for the assessment of OSM data [102][103][104][105]. The research trend "land-use patterns" (T50.6) was reported in thirteen articles that were focused on the use of OSM in remote sensing applications, particularly land use mapping [106][107][108][109][110][111][112][113][114]. Another trend that emerged was "indoor navigation" (T50.2), which focused on mobile-enabled indoor navigation in transport services [65,66,115] and its augmentation with floor plans [68].
The "traffic simulation and management" (T50.4) trend uncovered the use of evolutionary methods to calculate real vehicle flows in cities [61,116] using Simulation of Urban Mobility (SUMO), the use of data mining techniques in the field of traffic simulation [59] and the development of models in SUMO and MATSim for traffic simulation and management using VANETs-based applications and protocols [62,117,118]. Another research trend that emerged related to traffic was "smart cities and mobility" (T50.8). These papers focused on traffic regulation using evolutionary algorithms to reduce travel times and greenhouse gas emissions [60,119–121]. The "shortest path computation" (T50.16) research trend revealed the use of shortest path algorithms [122,123] and the application of Hadoop to solve the shortest path problem in large complex datasets [124,125]. The trend of "OSM for routing" (T50.11) provided an analysis for checking the richness of vehicle routing [126] and routing with the fewest-turn map directions [127].
Another significant research trend that emerged was "disaster management" (T50.13), where OSM has played a vital role in disaster management efforts and has been utilized by community-based disaster response organizations and researchers. OSM is used in the simulation of impact modeling and disaster readiness analysis. Papers in this theme focused on the applications and case studies of OSM during the Haitian earthquake [57], the assessment of OSM for disaster management [128], and the development of a location-based early warning and evacuation system [129]. The trend "haptic for navigation" (T50.9) uncovered the use of haptic tools for navigation. The published papers concentrate on such tools and the exploration of street networks [54][55][56]58,130]. The research trends "contributors' pattern" (T50.26) and "trust in OSM data" (T50.23) addressed the motivation of contributors, their patterns and trust in OSM data. Other important research trends uncovered in the fifty topic solution were "location-based services" (T50.18), "data mining approaches for OSM data" (T50.33), "conflation of maps" (T50.29) and "OSM for autonomous navigation" (T50.39). From Table 8, it is clear that research activities in the field of OSM increased tremendously during the period 2012–2016.

Mapping of Core Research Areas and Research Trends
Table 9 presents the mapping of core research areas and trends. A manual connection was established between core research areas and trends on the basis of high-loading papers for the topic solutions, as discussed in a number of studies [13,20,131]. The mapping presents the relationship between low-aggregated topic solutions and highly-aggregated topic solutions by referring to the minimum loading value (threshold). In the current study, most of the articles clustered around one topic, i.e., "quality assessment" (T50.1). This effect may be attributed to the use of dominating keywords in these articles. For instance, the topic "land-use patterns" (T50.6) revealed thirteen articles [106][107][108][109][110][111][112][113][114][132][133][134][135]. Of these, eleven articles emerged from "quality assessment and analysis" (T5.1), whereas two of the papers were focused on "assessment of contributors' behavior" (T5.2) [110] and "indoor navigation models" (T5.5) [107], respectively. Hence, based on the high-loading articles, the research trend "land-use patterns" (T50.6) was mapped to the core research area of "quality assessment and analysis" (T5.1). The detailed discussion is presented in the following sections.
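The land-use example above can be expressed as a simple majority count. The study established the mapping manually, so the sketch below only illustrates the reasoning; the function name is hypothetical, and the paper identifiers are the reference numbers quoted in the text.

```python
from collections import Counter


def map_trend_to_core_area(core_area_of_paper, trend_papers):
    """Return the core area holding most of a trend's high-loading papers."""
    counts = Counter(core_area_of_paper[p] for p in trend_papers
                     if p in core_area_of_paper)
    area, _count = counts.most_common(1)[0]
    return area, dict(counts)


# Thirteen high-loading papers of "land-use patterns" (T50.6), by reference number.
land_use_papers = list(range(106, 115)) + list(range(132, 136))

# Core area of each paper in the five topic solution, as stated in the text:
# eleven papers in T5.1, [110] in T5.2 and [107] in T5.5.
core_area_of_paper = {paper: "T5.1" for paper in land_use_papers}
core_area_of_paper[110] = "T5.2"
core_area_of_paper[107] = "T5.5"

area, counts = map_trend_to_core_area(core_area_of_paper, land_use_papers)
print(area, counts)
```

Eleven of the thirteen papers fall in T5.1, which is why the trend is attached to "quality assessment and analysis".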

Assessment of Contributors' Behavior (T5.2)
Goodchild and Li [167] have outlined that crowdsourcing can enable a group of individuals to validate and correct the errors that others might have made, which could converge to the truth, based on Linus' Law [168]. Recent studies have shown that OSM data quality can be judged by assessing contributors' motivation and patterns [50,78,169]. Haklay et al. [170] have revealed that positional accuracy improves with the increase in the number of contributors. The two research trends that emerged from this core research area were "trust in OSM data" (T50.23) [50,171] and "contributors' patterns" (T50.26) [172].

Applications to Navigation and Disaster (T5.3)
This core research area provided research trends related to navigation and crisis management during natural disasters. Developer teams around the globe are working on the development of information models for navigation like "mobile-based services" (T50.3) and "haptic for navigation" (T50.9) [54][55][56]58,130], and web servers for using map data in crisis situations like "disaster management" (T50.13) [57,128,173], "evacuation modeling" (T50.34) [174,175] and "humanitarian efforts" (T50.32) [176]. Research on haptic models for navigation has been conducted by Jacob et al. [130] and Kaklanis et al. [58], who developed multi-modal haptic and audio feedback interfaces that vibrate according to the navigation path to assist users by touch. The research trend, viz. disaster management, elaborated on the effectiveness of applications in improving situational awareness and coordination among organizations and emergency response teams. This core research area also uncovered the work of Tully et al. [177,178], who have used OSM to create a 3D virtual environment for improving the quality and performance of decision support systems during any such crisis.

Traffic Simulation and Mobility (T5.4)
Simulation of traffic systems has emerged as one of the core research areas where OSM data have been used for solving relevant traffic problems, reducing the travel time of vehicles and addressing smart mobility problems. This core research area uncovered three trends, viz. "traffic simulation and management" (T50.4) [59,61,63], "smart cities and mobility" (T50.8) [60,119] and "reducing travel times" (T50.14) [121].

Indoor Navigation Models (T5.5)
The demand for indoor routing or navigation has increased vastly in recent years. This core research area included "indoor navigation" (T50.2) [65,66,68,115] and "indoor planning and simulation" (T50.5) [64,67]. In these trends, smartphone sensors have been used to acquire real-time parameters. Indoor navigation models integrate the acquired data with OSM by using various algorithms to provide navigation inside complex buildings. Research trends on devising methods for indoor navigation using augmented reality [179][180][181][182] and 3D models [183,184] have also been observed during the period 2012–2016.

Discussion and Potential Future Directions
In this section, we consider how the results from the LSA can be used to answer the four research questions provided in the introduction.

RQ1: Who Is Leading OSM Research?
... 3 present the top journals and authors publishing on OSM. Journals include the ISPRS International Journal of Geo-Information, Transactions in GIS, International Journal of Geographical Information Science and The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. The literature is dominated by publications from European researchers. Goodchild [5] and Haklay [2] are two prominent researchers in VGI. The preliminary assessments of OSM data have been performed by Kounadi [7], Ather [77] and Haklay [78]. Leading researchers from research groups in GIScience and Geoinformatics have worked on the assessment of OSM [185]; user contributions [53]; indoor-outdoor navigation and location-based services [64]; applications and tools for intrinsic analysis [103]; and assessment of land-use information in OSM [51]. Mooney [70] is a leading researcher in OSM and has suggested the need for quality metrics for assessing OSM in the absence of "trusted" sources of ground truth. Other researchers who have identified important topics include: Jacob et al. [56], who have developed haptic feedback navigation applications for pedestrians; Keßler and de Groot [50], who have discussed issues of trust in user contributions; Stolfi and Alba [60], who have worked on the use of OSM for smart mobility; Goetz [154], who has developed 3D models with OSM; Ballatore et al. [186], who have utilized LinkedGeoData, developed by Auer et al. [187], and presented a conceptual model for quality assessment and enrichment of OSM data; Ruta et al. [181], who have incorporated augmented reality for utilizing a point-of-interest (POI) discovery-based tool for indoor-outdoor navigation; and Jilani et al. [188], who have brought the concept of machine learning into VGI and presented automatic tag assessment and inference models. More than 90 universities and research groups have developed tools for the contribution to and assessment, visualization and application of OSM.

RQ2: Which Research Areas Have Been Widely Investigated by Researchers?
The results of the study show that "quality assessment and analysis" has been the most widely investigated topic in OSM research. OSM data quality is a matter of concern due to the lack of knowledge about the contributors. OSM quality has been assessed on the basis of established quality indicators [71,72,74,189] or intrinsic indicators for VGI in the absence of authoritative datasets [14,76,190] using contributors' motivation and patterns as inputs [50,169–171,191].
From an investigation of this core research area, we have identified the studies focusing on different quality indicators and the methods used for this purpose (Table 10).Positional accuracy, completeness, attribute accuracy, and semantic accuracy indicators have been widely explored by OSM researchers, whereas logical consistency, temporal accuracy, and lineage have gained less attention.

Attribute accuracy
Attribute accuracy represents the correctness of quantitative and non-quantitative attributes [79].

Semantic accuracy
Semantic accuracy is evaluated through tags and three measures as suggested by Vandecasteele and Devillers [190]:
• Data-centric measures: type and number of features and attributes of features [45,79,104,114,138,215].

Temporal accuracy
Analysis of the history file on the basis of temporal measures:
• Changes in community activity [209,229,230].
• Statistical correlation between number of contributors, date of capture and version of the captured objects [79].
• Number of editors and number of days since the last change [98].
• Lorenz curve and the Gini coefficient, quantile-based classification method, and Mann-Whitney-Wilcoxon test to assess contribution inequality, community changes, and productivity changes, respectively [52].

Lineage
Lineage is measured by analyzing the history file and checking the source information in the 'tag' attribute [79,84].

RQ3: How Has the Focus of Topics within Each Core Research Area Changed over Time?
The evolution of OSM research can be examined by looking at the shifts in focus across the five core research areas. From Table 8, OSM research has clearly gained momentum since 2011. The research area "quality assessment and analysis" remained a trending area over the period. This research area is tightly coupled with advances in VGI [14,103]. The assessment of OSM data has promoted, for example, the use of intrinsic quality parameters in situations where authoritative data are not available [76,103], models for OSM tag recommendations [149,190], and enrichment [186,225]. The maturity of and open access to OSM data have encouraged the use of the data in different application areas [51]. In addition, there has been a shift from OSM tool development to more application-oriented research.
A shift in computing technology has also been witnessed through the literature dataset. Numerous tools and custom-developed codes have been used in OSM research, including proprietary tools such as ArcGIS, Manifold GIS and MapInfo [7,45,77,91,97,201,205,206,232,233] and open-source tools such as QGIS, JOSM, OSMOSIS, OSMIUM, PostgreSQL and PostGIS [47,53,198,234]. These tools have improved significantly and now handle spatial data more efficiently.

RQ4: What Are the Potential Future Directions?
OSM research is a recent and emerging area in computational and geospatial sciences, and there are ample opportunities for further research.On the basis of the results of the LSA, some recommendations are made in the sections that follow.

General Recommendations
OSM is growing, and all components from data collection to data dissemination must be explored through longitudinal studies. In particular, these points need to be considered for further research:
• Development of a 'gamification' framework for motivating contributors to collect data while taking care of the reference scale and resolution.
• Development of a specification model to ensure consistency and quality of the contributed data.
• Identification of heuristic intrinsic quality indicators for the assessment of OSM data and the development of a framework for data assessment applicable to different domains.
There are various open issues related to the quality assessment of OSM as outlined in Senaratne et al. [14].Existing studies on OSM are far from complete.Researchers should view OSM as an opportunity to investigate computational research challenges.

Research Directions
This section outlines research directions that may help to inform future work on OSM.

Assessment of contributors' behavior:
The literature review suggested that methods and motivating factors are required to attract contributions to OSM [76,99,235]. Fritz et al. [236] identified open challenges in attracting the crowd to contribute. By incentivizing people using rewards, or by using 'gamification' of applications [237,238], the spatial coverage and the amount of participation can be increased. Contributors are the "gate-keepers" of the information [167]. Their motives, behaviors and patterns influence the quality of the data and the development of trust in the information [50,171]. However, existing studies analyzing user contributions [52,239] do not generalize well. Thus, the development of a comprehensive framework for user contribution analysis and reputation assessment is needed. This further depends upon the following open research questions:
• What are the motivational factors and patterns of user contributions?
• Which attributes should be considered for creating a user reputation system?

Quality assessment and analysis:
Researchers are using various established assessment methods that compare OSM with authoritative datasets, following the guidelines presented in [71]. But even these methods are not sufficient for assessing OSM data [14]. Therefore, there is scope for identifying new quality indicators for OSM in the absence of authoritative datasets. Recent work on OSM has introduced intrinsic quality indicators for assessing the data using history files [172,214], and three quality frameworks have been developed by Barron et al. [103], Ballatore and Zipf [104], and Rehrl and Gröchenig [105]. However, these frameworks need to be extended further, which confirms the research gap identified in the recent study by Senaratne et al. [14]: the need to develop a comprehensive framework for the assessment of OSM data using intrinsic, extrinsic and hybrid quality indicators.

Applications to navigation and disaster:
The application of OSM to navigation has been less investigated than other topics. The suitability of road networks for navigation can be assessed based on two parameters: topological consistency and semantic consistency. The topological consistency can be assessed by applying topological rules. The semantic consistency can be assessed by evaluating tag information, which is vital for navigation applications. Furthermore, inconsistent modeling of features is an area of concern since OSM does not enforce a uniform specification model on the data being uploaded. Current studies present the following issues to be explored:
• Handling of data imputation and incorrect values of tags such as turn-restrictions, one way streets, maximum speed, etc. [69].
• Identification of issues related to geometrical modeling such as divided highways modeled by double lines [69].
• Development of a framework to detect and correct the topological and semantic inconsistencies.
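As an illustration of the semantic consistency check described above, the following is a minimal sketch that flags suspicious tag values on road features. The accepted values are a deliberately simplified assumption made for this example (real OSM tagging allows more values), and the function name, way identifiers and tags are hypothetical rather than taken from any of the cited studies.

```python
import re

# Simplified assumption: only these oneway values are treated as valid here.
ONEWAY_VALUES = {"yes", "no", "-1"}
# Simplified assumption: maxspeed is a number, optionally followed by " mph".
MAXSPEED_PATTERN = re.compile(r"^\d+( mph)?$")


def check_way_tags(tags):
    """Return a list of issues found in the tag dictionary of one road feature."""
    issues = []
    if "highway" not in tags:
        issues.append("missing highway tag")
    oneway = tags.get("oneway")
    if oneway is not None and oneway not in ONEWAY_VALUES:
        issues.append(f"unexpected oneway value: {oneway!r}")
    maxspeed = tags.get("maxspeed")
    if maxspeed is not None and not MAXSPEED_PATTERN.match(maxspeed):
        issues.append(f"unparseable maxspeed value: {maxspeed!r}")
    return issues


# Hypothetical ways, keyed by made-up identifiers.
ways = {
    101: {"highway": "residential", "oneway": "yes", "maxspeed": "30"},
    102: {"highway": "primary", "oneway": "true", "maxspeed": "fast"},
    103: {"name": "Unnamed track"},
}
report = {way_id: check_way_tags(tags) for way_id, tags in ways.items()}
```

A rule table of this kind only detects inconsistencies; correcting them, as the last point in the list above suggests, would require additional evidence such as neighboring features or edit history.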
In addition, several heuristic aspects of routable OSM data can be investigated further, such as the correlation between the routes selected and their agreement with route length and geometry [126].
Routing applications developed for mobile devices can use a haptic feedback mechanism to convey information about the route [130]. People with vision impairment can be greatly helped by such technology. Haptic feedback uses variations in vibration frequency to present information about distance [56,240]. Researchers need to validate the success of such a framework in different situations. Thus, we suggest extending this work by using wireless-enabled hand-bands to assist people with visual impairments by providing path directions with a heuristic mechanism.
Other areas that have emerged from this core research area relate to disaster management and preparedness [57,128,173–176]. The OSM Tasking Manager is a mapping tool designed and built by the Humanitarian OSM Team (HOT) (http://tasks.hotosm.org/) for handling disaster situations. To aid rapid response, 3D models are being developed. The issues of data imputation and the quality of decisions during a crisis are still open areas for further research. As per the suggestions of Tully et al. [177], enhancing the performance of crisis decision support systems requires further studies on:
• Development of semi-automatic approaches to conflate multiple map data for better decision making.
• Selection of appropriate interpolation techniques needed for the large datasets.

Traffic simulation and mobility:
Transportation systems need to be developed and maintained to meet current and future needs. Various simulation methods are being used by researchers for better management of traffic flows, but this is not possible without in-depth knowledge of the intersections and their connections. Stolfi and Alba [61] have presented a traffic flow study by modeling traffic scenarios using sensor input data. Their methodology has certain issues that can be resolved by further research on:
• Capturing sensor data and optimization of various parameters (traffic lights, routes, etc.) for better results and generalization of the study [61].
Recent studies present the analysis of road networks using social network analysis (SNA). Such studies have used models from graph theory to investigate social phenomena [69,214,241,242]. SNA can be useful for presenting spatial and temporal characteristics [242]. Future research can apply SNA to OSM to study the centrality, density, clustering coefficient and other properties of existing road networks and their impact on specific application domains. We suggest that SNA can be further applied to OSM data to:
• Apply mathematical measurements that facilitate the analysis of quantitative relationships within the network.
• Uncover gaps and prevalent pain-areas from the configuration of roads and their spatial connectivity properties.
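The network properties named above (density, centrality and clustering coefficient) can be sketched on a toy road graph as follows. This is a minimal illustration under stated assumptions: the graph is undirected and unweighted, intersections are nodes and road segments are edges, degree centrality is used as one possible centrality measure, and the node labels are hypothetical rather than extracted from OSM.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Share of possible edges that are present: 2m / (n(n - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: (len(nbrs) / (n - 1) if n > 1 else 0.0)
            for node, nbrs in adj.items()}


def clustering_coefficient(adj, node):
    """Share of neighbor pairs of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Hypothetical intersections A to E joined by five road segments.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
road_graph = build_graph(edges)
```

In this toy graph, intersection C has the highest degree centrality (0.75) but a low clustering coefficient, the kind of structural contrast that could point to a bottleneck in a real network.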

Indoor navigation models:
Another potential research and application area is indoor navigation. Traditional and contemporary algorithms [65,66,68,115] are being supported by augmented reality [179][180][181][182] and 3D models [183,184] for this purpose. The current study suggests that there are several opportunities for further work on the following:
• Handling the GPS/heading accuracy issues during indoor mapping.
• Development of a mobile-based framework for 3D visualization and navigation for indoor maps.

Application of data mining, machine learning and big data to OSM research:
Crowdsourcing has greatly attracted the attention of the research community for quick and low cost data collection and tagging. Machine learning is appropriate for labeled, uncertain, vague, diverse, continuous and rapid data [243]. Researchers [50,171,188,200,215,219,228,244,245] have used various data and text mining approaches for assessment of and knowledge extraction from OSM data. Spatial data mining is still in its infancy in OSM research. Various data mining approaches such as classification and prediction, association rule mining, clustering, regionalization and point pattern analysis, and geo-visualization could be applied to improve the results in various application areas using OSM. We strongly recommend further application of data mining and machine learning to OSM data. Potential research areas include:
• Handling data imputation or incorrect names, inconsistent tag detection and data correction [69].
• Semantic analysis of attributes for user classification and reputation assessment [50].
• Development of a framework to analyze past contribution trends and future OSM contribution patterns [239].
• Prediction of labels of features from types of features [215].
• Evaluation of indirect and intrinsic indicators to identify fitness of a dataset for a particular domain [219].
• Identification of a prohibition sign based on the knowledge gained from data presented in OSM [228].
• Traffic simulation to reduce greenhouse gas emissions and travel times [59].
• Clustering of similar users for prediction, and finding associations and dependencies to characterize OSM data.
In addition, big data in geographic knowledge discovery is a key area for further research, especially given that some researchers [247,248] state that VGI exhibits many of the characteristics of big data. As OSM is growing, traditional methods need to be extended for data analysis in the era of big data [239,249,250]. New geographical methods for assessment will need to be developed in this changing computational paradigm. However, the potential for applications of big data in VGI is enormous once certain barriers are overcome [251]. SpatialHadoop, an extended MapReduce framework that supports spatial operations, is being developed [252,253]. Some studies [252][253][254] have used Hadoop for temporal analysis of OSM data. This opens new doors for future work, as geo-spatial analysis requires large computational resources. Thus, the current study emphasizes the use of big data in OSM, including but not limited to:
• Assessment and analysis of OSM as big data.
• Nonlinear temporal analysis of spatial and attribute information to retrieve knowledge about contributors' patterns.
Direct observation of forums and mailing lists on the development of SpatialHadoop reveals the following areas for extended research:
• Supporting the shapefile input format to handle spatial datasets other than OSM.
• Adding kNN join and distance-based join support.
• Developing a web-based interface to make it easier to explore datasets and use the system for non-technical users.
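To make the kNN join mentioned in the list above concrete, the following is a minimal brute-force sketch of the operation itself: for every feature in one dataset, find the k nearest features in another. It assumes planar coordinates and Euclidean distance, uses hypothetical feature names, and is not how SpatialHadoop implements or would implement the operation; a distributed system would rely on spatial partitioning and indexing instead of an exhaustive comparison.

```python
import math


def knn_join(left, right, k):
    """For each point in `left`, return the ids of its k nearest points in `right`.

    Brute force: every left point is compared with every right point.
    """
    result = {}
    for left_id, (lx, ly) in left.items():
        ranked = sorted(
            right.items(),
            key=lambda item: math.hypot(item[1][0] - lx, item[1][1] - ly),
        )
        result[left_id] = [right_id for right_id, _ in ranked[:k]]
    return result


# Hypothetical features with planar (x, y) coordinates.
schools = {"school": (0.0, 0.0)}
bus_stops = {"stop_a": (1.0, 0.0), "stop_b": (0.0, 3.0), "stop_c": (5.0, 5.0)}
nearest = knn_join(schools, bus_stops, k=2)
```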

Limitations of the Study
Some issues may have arisen in the compilation of the literature dataset on OSM. This relied upon factors such as the query used, the sources of the literature, and the selection and identification of the final literature used in the dataset. The keywords "OpenStreetMap", "volunteered geographic information" and "crowdsourced map" were used to find suitable publications. In order to generate a good dataset, the other bibliographic databases, which did not appear in the automated search, were manually checked. The papers collected for the present study were thoroughly checked, and the dataset was refined by applying the inclusion and exclusion criteria listed in Table 1. However, it is possible that a few relevant studies have been omitted.
There may be bias introduced when using LSA. To reduce the bias as much as possible, a heuristic approach was followed to identify a suitable threshold for use by the algorithm. Although LSA improves the vector space model by considering synonyms, the number of topic solutions cannot be determined statistically. To mitigate this, the number of topic solutions was chosen after thorough discussions with experts. Lastly, the topic labeling was performed based on human judgment, which may also introduce some subjective bias.
There may be limitations related to the generalization of the results. The topic solutions were extracted from the abstracts and titles of the articles that have a focus on OSM. The core research areas and research trends identified were based on an experimental design involving the selection of the literature, the pre-processing of the literature, the term-document matrix creation, the utilization of SVD for low rank approximation and the topic labeling. Each of these steps will influence the results. For example, the results will be affected by using abstracts and titles for preparing the dataset rather than the full text of the articles. However, verification of the dataset was conducted using a manual review, so the results should be robust enough to achieve generalization.

Conclusions
This study focused on the discovery of research trends in the OSM literature by analyzing 485 documents published in the academic literature. The approach generated k-topic solutions, corresponding terms and document loadings. These loadings explain the proximity to a given topic. Highly-loaded terms and documents above a given threshold were considered relevant to the topic. The results of the study revealed five core research areas and fifty research trends.
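The thresholding step summarized above can be sketched as follows. This is an illustrative example only: the loading values and the threshold of 0.40 are made up, and comparing the absolute value of a loading against the threshold is an assumption of this sketch rather than a detail reported for the study.

```python
def assign_by_loading(loadings, threshold):
    """Map each topic index to the items whose loading exceeds the threshold.

    `loadings` maps an item (term or document) to one loading per topic.
    Assumption: the absolute value of the loading is compared.
    """
    topics = {}
    for item, row in loadings.items():
        for topic_idx, value in enumerate(row):
            if abs(value) > threshold:
                topics.setdefault(topic_idx, []).append(item)
    return topics


# Hypothetical term loadings for a two-topic solution.
term_loadings = {
    "quality": [0.82, 0.05],
    "routing": [0.10, 0.77],
    "tag": [0.45, 0.41],
    "map": [0.20, 0.15],
}
relevant = assign_by_loading(term_loadings, threshold=0.40)
```

Here the term "tag" is assigned to both topics while "map" is assigned to none, which shows how the choice of threshold controls both the overlap between topics and the number of items left unassigned.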
The area entitled "quality assessment and analysis" has been widely investigated by researchers, although the studies are confined to specific regions of the world. Furthermore, researchers have been trying to identify intrinsic quality indicators, as established measures alone are not adequate for assessing the quality of OSM data. Another key research area that has emerged, "assessment of contributors' behavior", relates to the motivations and patterns of the contributors. Other research areas related to OSM usage and application were also identified, i.e., "application to navigation and disaster", "traffic simulation and mobility" and "indoor navigation models". The study has also provided general recommendations regarding research gaps that have emerged from the different core research areas, which could be used by researchers to further investigate OSM.

Figure 1
Figure 1 presents the distribution of publications over time. Based on the number of occurrences in the dataset, the top researchers with the most publications on OSM during the period 2007-2016 and the top fifteen journals publishing articles related to OSM are presented in Tables 3 and 4, respectively.
$$tf = \frac{\text{Number of times term appears in a document}}{\text{Total number of terms in the document}} \qquad (1)$$
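Equation (1) can be illustrated with a short sketch. The tokenized document below is a made-up example rather than an abstract from the corpus, and the function name is hypothetical; tokenization and stop-word removal are assumed to have been done beforehand.

```python
from collections import Counter


def term_frequency(tokens):
    """Return tf per term: occurrences of the term / total terms in the document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical, already-tokenized document.
doc = ["osm", "data", "quality", "osm", "assessment"]
tf = term_frequency(doc)
```

For this document, "osm" occurs twice among five tokens, so its term frequency is 2/5 = 0.4, and the values over all terms sum to 1.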

Figure 1 and Table 3 present the top journals and authors publishing on OSM. Journals include the ISPRS International Journal of Geo-Information, Transactions in GIS, International Journal of Geographical Information Science and The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. The literature is dominated by publications from European researchers. Goodchild [5] and Haklay [2] are two prominent researchers in VGI. Preliminary assessments of OSM data have been performed by Kounadi [7], Ather [77] and Haklay [78]. Leading researchers from research groups in GIScience and Geoinformatics have worked on the assessment of OSM [185]; user contributions [53]; indoor-outdoor navigation and location-based services [64]; applications and tools for intrinsic analysis [103]; and assessment of land-use information in OSM [51]. Mooney [70] is a leading researcher in OSM and has suggested the need for quality metrics for assessing OSM in the absence of "trusted" sources of ground truth. Other researchers who have identified important topics include: Jacob et al. [56], who have developed haptic feedback navigation applications for pedestrians; Keßler and de Groot [50], who have discussed issues of trust in user contributions; Stolfi and Alba [60], who have worked on the use of OSM for smart mobility; Goetz [154], who has developed 3D models with OSM; Ballatore et al. [186], who have utilized LinkedGeoData, developed by Auer et al. [187], and presented a conceptual model for quality assessment and enrichment of OSM data; Ruta et al. [181], who have incorporated augmented reality into a point-of-interest (POI) discovery-based tool for indoor-outdoor navigation; and Jilani et al. [188], who have brought the concept of machine learning into VGI and presented automatic tag assessment and inference models. More than 90 universities and research groups have developed tools for the contribution to and assessment, visualization and application of OSM.

Table 1. Inclusion and exclusion criteria.

Table 3. Top researchers in OSM research.

Table 5. Five topic-based term loading tokens.

Table 6. Core research areas for OSM.

Table 7. High-loading research papers for five topic solution.

Table 8. Research trends for OSM.

Table 9. Mapping of core research areas and trends.

Table 10. Data quality indicators and methods used.