Technical Guidelines to Extract and Analyze VGI from Different Platforms

Abstract: A growing number of Volunteered Geographic Information (VGI) and social media platforms continue to increase in size, providing massive amounts of georeferenced data in many forms, including textual information, photographs, and geoinformation. These georeferenced data have either been actively contributed (e.g., adding data to OpenStreetMap (OSM) or Mapillary) or collected more passively by enabling geolocation whilst using an online platform (e.g., Twitter, Instagram, or Flickr). The benefit of scraping and streaming these data in stand-alone applications is evident; however, many users find it difficult to script and scrape these diverse types of data. On 14 June 2016, a pre-conference workshop called “LINK-VGI: LINKing and analyzing VGI across different platforms” was held at the AGILE 2016 conference in Helsinki, Finland. The workshop provided an opportunity for interested researchers to share ideas and findings on cross-platform data contributions. One portion of the workshop was dedicated to a hands-on session that illustrated the basics of spatial data access through selected Application Programming Interfaces (APIs) and the extraction of summary statistics from the results.


Introduction
Since the emergence of Web 2.0 technologies, a massive amount of user-generated content (UGC) has been interactively generated by the public, providing an alternative source of information [1]. Even though the portion of explicitly geotagged UGC (where geographical coordinates are attached to each piece of data [2]) is seemingly not high (ranging from 2% in Twitter to 25% in Instagram [3,4]), this trend arguably generates a significant mass of data. This is highly important and useful for geography and the GIScience domain, as every citizen can be counted as a volunteer mapper or a "sensor". These volunteers can collect, edit, and share geographical information of their surroundings [5].
Data 2016, 1, 15
Thanks to the wide availability of GPS-enabled devices (e.g., smart phones) and the location-based services installed on them, as well as the professional infrastructure behind VGI projects, the quality of the collected geoinformation and, therefore, the usefulness of the VGI could potentially be comparable to, or even better than, authoritative datasets. This was shown, for example, in the case of OSM compared to map data from national agencies [6,7] and Mapillary images vs. Google StreetView [8]. This is because VGI can be more detailed in terms of attributes, whilst potentially being collected at a finer geographical scale and being more up-to-date [2,8,9]. Therefore, a new era in gathering geospatial information has begun in GIScience in the form of VGI [10]. The VGI topic has attracted a huge amount of attention in research across various disciplines ranging from environmental management to social sciences and has, in general, the potential to revolutionize science [11]. Recent trends also indicate that creators and users of VGI have started to use content from one VGI source to improve the quality of, or to enrich, another VGI source [12], which is a new research area to explore.
Because access to these data is highly important for end-users, there is an evident need for sharing mechanisms and tools that collect them with minimal effort from users whilst maintaining higher certainty about the data. Thus, this paper aims to openly present the guidelines, tools, and scripts for collecting VGI across multiple platforms which were presented and tested at the LINK-VGI workshop. Readers can use these tools and scripts for collecting VGI from various platforms.
This paper addresses methods for accessing explicitly geotagged UGC. However, as mentioned above, this is only a small fraction of user-generated content. It is important to note that there are other techniques that can be used to spatially locate UGC with spatially implicit information (where geography is expressed as place names or toponyms) [13]. Studies show that analyzing the textual information embedded in UGC can lead towards the extraction of more geographic data. For example, a prototype system relying solely on the textual information of Tweets and Flickr photos was able to geolocate social media contributions and then report forest fire locations without using any explicit geographic information [14,15]. Recent big data research has turned to increasingly sophisticated natural language processing methods, which can potentially be even more useful for enriching location information. Applying natural language processing tools to "geoparse" UGC is also an ongoing research direction and is mainly applied to Twitter [16-18]. A recent study also predicts location and estimates errors of Flickr and Twitter data [19].
The remainder of the paper is structured as follows: Section 2 presents a brief description of the sample data provided at the workshop, while Section 3, along with a series of Appendices, provides working examples of VGI extraction from various sources. Section 4 applies these examples to two case studies with potential applications. All materials (including sample datasets and code examples) along with working case studies can be found in a GitHub repository via the link provided in the Supplementary Materials section.

Data Description
Sample datasets (extracted from Twitter, Instagram, Flickr, Foursquare, Wheelmap, and Mapillary) are provided along with this paper in a GitHub folder (https://github.com/jlevente/link-vgi/tree/master/sample_datasets [20]). The geographic extent of these sample datasets covers Helsinki in Finland and Heidelberg in Germany (Figure 1). Table 1 describes these datasets in greater detail. These datasets serve as examples of VGI data that can be extracted from public APIs (Application Programming Interfaces). In some cases, additional attributes could have been extracted from these platforms according to each API's documentation; for the exercises in the workshop, however, these additional attributes were not required. It is also important to note that these services often change how data can be accessed. In practice, this means that methods to scrape VGI data, and the data that can be extracted, need to be revised from time to time. Potential applications based on these datasets include accessibility analysis, tourism analysis, city planning, and land use/cover monitoring, to name a few [21,22].

Software Requirements
Programmatically interacting with VGI data from public APIs requires a set of software components to be set up. During the LINK-VGI workshop, we illustrated the process with Python 2.7+ and a set of additional packages (tweepy, pyshp, psycopg2, python-flickrapi, python-instagram), along with R statistics and a few of its external libraries (ggplot2, ggmap, plyr, wordcloud). As for traditional GIS environments, QGIS and PostgreSQL (+PostGIS) were used. Detailed installation instructions can be found on the public GitHub repository of the workshop (https://github.com/jlevente/link-vgi/blob/master/requirements.md). In the remainder of this section, we illustrate each step with a series of Appendices. Appendices are code snippets written in Python (unless otherwise noted) with which one can reproduce results and explore the potential of VGI data extraction.

Interacting with APIs
An API standardizes the ways in which software components interact. In web environments, it is a well-defined request-response system where servers respond to client application requests. A common practice is to have different endpoints for different sets of functionalities. For example, http://api.twitter.com/1.1/users is responsible for operations related to users (e.g., recommending friends to follow), whereas http://api.twitter.com/1.1/geo has methods related to the geospatial domain (e.g., searching for Tweets in a given radius). APIs usually have different, well-documented methods (functions) implemented for different functionalities (e.g., querying data, inserting new data). The documentation defines the accepted parameters for each method along with the expected output (response). A response is usually a JSON or XML document that can be further processed. In addition, many APIs make use of the type of HTTP request (POST, PUT, GET, etc.), which can be set by the requesting agent. This type is then used by the responding server to identify the type of action being requested (e.g., using GET for requesting data, or PUT to update information in the server's data structure).
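The request-response pattern described above can be sketched with Python 3's standard library. This is a minimal illustration and not code from the workshop: the service at `api.example.org`, its endpoints, and its parameters are hypothetical, and the requests are only constructed, never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.org/v1"  # hypothetical service, not a real API


def build_request(endpoint, params=None, method="GET", body=None):
    """Build (but do not send) an HTTP request for one API endpoint."""
    url = f"{BASE_URL}/{endpoint}"
    if params:
        # Query parameters are encoded into the URL after a "?"
        url += "?" + urlencode(params)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(url, data=data, headers=headers, method=method)


def parse_response(raw_json):
    """Decode a JSON response body into Python objects."""
    return json.loads(raw_json)


# GET asks for data; the search parameters travel in the URL.
search = build_request("places/search", {"lat": 60.17, "lon": 24.94, "radius": 500})

# PUT updates an existing resource; the new values travel in the request body.
update = build_request("places/42", method="PUT", body={"name": "Cafe Example"})

# A response body as the server might return it (made-up sample).
sample = parse_response('{"places": [{"id": 42, "name": "Cafe Example"}]}')
print(search.get_method(), search.full_url)
print(update.get_method(), update.full_url)
print(sample["places"][0]["name"])
```

Sending such a request would be a matter of passing it to `urllib.request.urlopen`; real services differ in which endpoints, methods, and parameters they accept, so the API documentation remains the reference.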
A number of platforms require the developer to provide an API key (i.e., to register for the service). This is to protect users' privacy, monitor usage intensity, and, in general, to govern the different levels of data access. A general guideline is that an application should only be able to execute operations for which it is authorized. An example is OpenStreetMap's JOSM editor, where users can access data (e.g., download via the API) but are only able to upload changes once logged in with their credentials (i.e., once they have acquired permission to perform uploads). Different APIs implement different authentication systems, some being completely open and public (Overpass API (http://wiki.openstreetmap.org/wiki/Overpass_API)) and some requiring a registered application before any interaction (Instagram API). Table 2 lists the process and parameters needed for each platform. As mentioned above, different levels of access require different levels of authentication. For example, a registered basic Twitter application is not able to use the /geo endpoints, nor to pull data from a private Twitter account. However, if a user explicitly authorizes this application to make requests on their behalf (and, therefore, to reach these endpoints), data can be acquired. When developing an application for data mining, one should always consult the API documentation and the Terms of Service (or Terms of Usage). An example of setting up credentials for authenticated requests with Twitter is provided in Appendix A. The example allows the developer to acquire two parameters ("access_token" and "token_secret"), in addition to the parameters associated with the application, with which authenticated requests can be made.

Making Requests in a Python Environment
Although APIs can be used from any environment that can handle HTTP requests, it is beneficial to make use of existing API wrappers. These wrappers are usually developed based on the API documentation (often by third parties) for the purpose of tackling some technical challenges for developers and making interactions easier (e.g., handling the authentication process, implementing object classes, etc.). Using API wrappers in Python is an easy way to start. Some examples are tweepy (http://www.tweepy.org/) for Twitter and python-instagram (https://github.com/facebookarchive/python-instagram).
In some cases, however, such as with Wheelmap, easy-to-use API wrappers are not readily available. With some basic knowledge of HTTP communication, data can still be obtained relatively easily. The bulk of the HTTP communication can be done through built-in packages (urllib2 in Python, for example) specifically aimed at facilitating communication of data over HTTP connections in a straightforward manner.

API Methods
API methods are the most important elements of data interaction. Every API defines different functionalities, together with the list of properties that can be used and the expected output of the methods. When working with a platform, the API documentation is the starting point for finding out what options are available and which methods best suit the data collection task. The documentation for different platforms can be accessed on their websites. Some services have more thorough documentation than others. Appendix B illustrates the process of searching for Tweets within an area using the search/tweets method from Twitter's REST API. The request contains a geocode parameter that consists of a pair of latitude and longitude values and a radius value. This is also an example of using an existing API wrapper from Section 3.2.1.
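As a small illustration of the geocode parameter mentioned above: in Twitter's v1.1 search documentation it is a single string of the form "latitude,longitude,radius", with the radius given in miles (mi) or kilometres (km). The helper below is a hypothetical convenience function, not part of tweepy or of Appendix B; it only formats and validates the string.

```python
def geocode_param(lat, lon, radius_km):
    """Format a 'latitude,longitude,radius' search string with the radius in km."""
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError("latitude/longitude out of range")
    if radius_km <= 0:
        raise ValueError("radius must be positive")
    return f"{lat},{lon},{radius_km}km"


# Approximate centre of Helsinki with a 5 km search radius
helsinki = geocode_param(60.1699, 24.9384, 5)
print(helsinki)
```

The resulting string would then be passed as the geocode argument of the wrapper's search method; the exact method name depends on the wrapper and its version.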
As mentioned earlier, not all APIs have a wrapper available to make communication easy. One example of these is Wheelmap. In the case that a wrapper class is not available, data can normally be obtained from the API through direct HTTP requests. Appendix C showcases the interactions without an API wrapper. Within Python, there is the urllib2 built-in package that can perform such tasks. Using the urllib2 package, calls are made up of two components: a request and a response. The request is sent by the client (the Python script) to the API and basically asks for some specific action to be performed, along with additional information such as search parameters, authentication tokens, and the requested response data type. This request can be an action such as getting a list of features, creating a new feature, or any other process of the service that the API exposes. In this case, the simplest approach of asking for some data is used. The response component is what the API sends back to the client and contains the data obtained and/or response codes indicating whether the operation was successful. Compared to the previous examples, one can note that there are additional steps performed in this solution. Without available API wrappers, the programmer is responsible for building direct URLs and handling pagination for the response dataset, amongst other methods that may need to be implemented. Authentication for Wheelmap is done by providing the "api_key" parameter in the requests.
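The URL building and pagination handling described above can be sketched as follows. This is not the code of Appendix C: it uses Python 3 (where urllib2's functionality lives in urllib.request and urllib.parse), and the endpoint URL, the "bbox", "page", and "per_page" parameters, and the "meta"/"num_pages" and "nodes" response keys are assumptions about the Wheelmap API that should be checked against its documentation. The network call is injected as a function so that the paging logic can be tried offline.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://wheelmap.org/api/nodes"  # assumed endpoint; check the API docs


def build_page_url(api_key, bbox, page, per_page=100):
    """Build the URL for one page of results inside a bounding box."""
    params = {
        "api_key": api_key,
        "bbox": ",".join(str(v) for v in bbox),  # west,south,east,north (assumed order)
        "page": page,
        "per_page": per_page,
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_all_nodes(fetch_json, api_key, bbox, per_page=100):
    """Collect nodes from every result page.

    fetch_json is any callable that takes a URL and returns the decoded JSON
    response, for example a thin wrapper around urllib.request.urlopen.
    """
    nodes, page = [], 1
    while True:
        response = fetch_json(build_page_url(api_key, bbox, page, per_page))
        nodes.extend(response.get("nodes", []))
        num_pages = response.get("meta", {}).get("num_pages", 1)
        if page >= num_pages:
            break
        page += 1
    return nodes


def fake_fetch(url):
    """Offline stand-in for a network call: serves two pages of sample nodes."""
    page = int(parse_qs(urlparse(url).query)["page"][0])
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return {"meta": {"page": page, "num_pages": 2}, "nodes": pages[page]}


all_nodes = fetch_all_nodes(fake_fetch, "YOUR_API_KEY", (24.90, 60.15, 25.00, 60.20))
print(len(all_nodes), "nodes collected")
```

For real requests, `fake_fetch` would be replaced by a function that opens the URL and decodes the JSON body; keeping that step separate also makes it easier to add rate limiting or error handling later.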

Exporting Data from APIs
Since different APIs use different data structures that are often neither relational nor tabular, the next step in VGI data collection after acquiring the data from a service is usually exporting the result set into a widely used data format. The general idea is to reshape the data structure so it can be integrated into other systems. The process involves looping through the result set and adding each record to an output structure in a standard format. Outputs can be anything that can be created in programming environments. If the data fall in the geospatial domain, geospatial data formats can be used as well.

Plain CSV
Plain CSV files are often used as an exchange format as they are easy to handle. A Python code snippet to get tweets within a radius of a given point and write the results to a CSV file is given in Appendix D. In this example, the Python core package csv is used to create a CSV file of geocoded Tweets. Properties to export can be decided either when examining the result set or from the Result section of Twitter's API documentation (https://dev.twitter.com/rest/reference/get/search/tweets). For simplicity, this code snippet exports some basic information such as username, tweet id, message of the post, timestamp of the post, and a latitude-longitude coordinate pair corresponding to the location. As fields can be missing in the response set, it is often useful to implement some error handling in the code (e.g., try-except statements in Python).
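The export and error-handling pattern described above might look like the following sketch. It is not the code of Appendix D: it works on plain dictionaries shaped like Twitter v1.1 search results (with "user", "id", "text", "created_at", and a GeoJSON-style "coordinates" object), and the sample records are made up. Records without a point location are skipped instead of raising an error.

```python
import csv
import io

FIELDS = ["username", "tweet_id", "text", "created_at", "lat", "lon"]


def tweet_to_row(tweet):
    """Flatten one tweet-like dict; return None if it has no point location."""
    try:
        lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
        return {
            "username": tweet["user"]["screen_name"],
            "tweet_id": tweet["id"],
            "text": tweet["text"],
            "created_at": tweet["created_at"],
            "lat": lat,
            "lon": lon,
        }
    except (KeyError, TypeError):
        # A field is missing, or "coordinates" is None for non-geotagged tweets.
        return None


def write_tweets_csv(tweets, handle):
    """Write all geotagged tweets to an open text handle; return rows written."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for tweet in tweets:
        row = tweet_to_row(tweet)
        if row is not None:
            writer.writerow(row)
            written += 1
    return written


sample_tweets = [
    {"id": 1, "text": "Hello Helsinki", "created_at": "Tue Jun 14 09:00:00 +0000 2016",
     "user": {"screen_name": "alice"},
     "coordinates": {"type": "Point", "coordinates": [24.94, 60.17]}},
    {"id": 2, "text": "No location here", "created_at": "Tue Jun 14 09:05:00 +0000 2016",
     "user": {"screen_name": "bob"}, "coordinates": None},
]

buffer = io.StringIO()  # stands in for open("tweets.csv", "w", newline="")
rows_written = write_tweets_csv(sample_tweets, buffer)
print(buffer.getvalue())
```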

GeoJSON
GeoJSON is a common interchange format that has gained popularity in recent years. It is commonly used in web environments to transport geospatial data, but most desktop GIS software can read and write the format as well. Some API methods return GeoJSON directly, meaning that there is no need for conversion. Often, however, GeoJSON documents have to be built manually. The process of building a GeoJSON file is illustrated in Appendix E. This Python function can be used to extend Appendix C with export functionality so that all Wheelmap nodes can be further processed in desktop GIS software.
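Building a GeoJSON document manually, as described above, amounts to wrapping each record in a Feature with a Point geometry. The sketch below is an illustration rather than the function of Appendix E; it assumes each node is a dictionary with "lat" and "lon" keys (the other keys and the sample values are made up) and copies the remaining keys into the feature properties.

```python
import json


def nodes_to_geojson(nodes):
    """Convert dicts with 'lat' and 'lon' keys into a GeoJSON FeatureCollection."""
    features = []
    for node in nodes:
        properties = {k: v for k, v in node.items() if k not in ("lat", "lon")}
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON stores positions as [longitude, latitude]
                "coordinates": [float(node["lon"]), float(node["lat"])],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


sample_nodes = [
    {"id": 1, "name": "Cafe Example", "wheelchair": "yes", "lat": 60.1699, "lon": 24.9384},
    {"id": 2, "name": "Museum Example", "wheelchair": "limited", "lat": 60.1719, "lon": 24.9414},
]

collection = nodes_to_geojson(sample_nodes)
geojson_text = json.dumps(collection, indent=2)  # ready to be written to a .geojson file
print(geojson_text)
```

Note the coordinate order: GeoJSON expects longitude first, which is a frequent source of misplaced points when data arrive as latitude-longitude pairs.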

Shapefile
Shapefiles are well known to all GIS professionals; therefore, it is useful to learn how to create them in Python. Appendix F queries Mapillary photos in a given area and exports them as a shapefile. The example uses the search/im (https://a.mapillary.com/#get-searchim) method from the Mapillary API to access images near a specified location expressed by latitude and longitude coordinates. Similar to the urllib2 package seen in the Wheelmap example, another package called requests can be used to handle HTTP requests in the absence of an available API wrapper. The process starts with defining the request parameters and manually building the URL for the GET requests. The code snippet could be easily extended to handle multiple result pages as well. This example also uses the pyshp package for writing a shapefile.

PostGIS
PostgreSQL is the leading open source Relational Database Management System and, with its PostGIS extension, can be considered a geographic data store. In addition, it allows the execution of many geospatial operations in a highly customizable manner. The database can be integrated into the data collection process and can then be used not just as a data store but also as a powerful processing framework or even for collaborative work. Appendices G and H show the process of exporting locations where drinking water can be obtained from OSM's Overpass API. The first step is creating a database and then setting it up for the data. PostGIS is an extension that needs to be enabled for the database. Similarly, hstore can be used to store key-value pairs. Since PostgreSQL uses a database schema to describe data, the table structure needs to be defined first (Appendix G). Connecting to the database and populating it with data can then be done within a Python environment (Appendix H).
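To make the population step more concrete, the sketch below prepares a parameterized INSERT statement and its parameters for one Overpass-style element. It is not the code of Appendices G and H: the table name drinking_water and its columns (osm_id, tags, geom) are hypothetical, and nothing is executed against a database here. The SQL relies on the PostGIS functions ST_MakePoint and ST_SetSRID and on the hstore(text[], text[]) constructor, so both extensions would need to be enabled.

```python
# Hypothetical table: drinking_water(osm_id bigint, tags hstore, geom geometry(Point, 4326))
INSERT_SQL = (
    "INSERT INTO drinking_water (osm_id, tags, geom) "
    "VALUES (%s, hstore(%s::text[], %s::text[]), "
    "ST_SetSRID(ST_MakePoint(%s, %s), 4326))"
)


def element_to_params(element):
    """Turn one Overpass-style node dict into parameters for INSERT_SQL."""
    tags = element.get("tags", {})
    keys = list(tags.keys())
    values = [str(tags[k]) for k in keys]
    # ST_MakePoint expects x (longitude) first, then y (latitude).
    return (element["id"], keys, values, element["lon"], element["lat"])


sample_element = {
    "type": "node", "id": 101, "lat": 60.1699, "lon": 24.9384,
    "tags": {"amenity": "drinking_water", "access": "yes"},
}
params = element_to_params(sample_element)

# With an open psycopg2 cursor this would run as:
#     cursor.execute(INSERT_SQL, params)
print(INSERT_SQL)
print(params)
```

Passing values as parameters rather than formatting them into the SQL string lets the database driver handle quoting, which matters for free-text OSM tag values.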

Extracting Summary Statistics in an R Environment
In addition to accessing VGI platforms and standardizing formats, exploratory analysis provides a better understanding of the nature of these data. R statistics is a widely used statistical framework to analyze VGI sources. In general, however, standardized outputs scraped from APIs can be processed in any statistical software. The purpose of this step is to explore data, create charts, and apply statistical tests.
The following example uses data from the Instagram API. The harvested dataset contains information on photos posted to Instagram since 1 January 2015 in downtown Helsinki. A sample dataset of Instagram locations and Instagram photo metadata is provided in a GitHub folder (https://github.com/jlevente/link-vgi/tree/master/sample_datasets). Please note that the Instagram API changed its policy as of 1 June 2016 (http://developers.instagram.com/post/133424514006/instagram-platform-update). All registered applications start with limited access to data and thus the method presented above does not work with real data. However, the API methods have not been changed and, in theory, it is possible to obtain a higher level of access for Instagram.
These data allow us to extract insights about popular places in Helsinki by quantifying data upload intensity and extracting basic measures. This ultimately leads towards an understanding of how Instagram users post photos.

Data Access
The first step is to load the data into R. This can be done by connecting R to PostgreSQL with the RPostgreSQL package. Importing many other formats, such as shapefiles or even JSON documents, is also possible. This example uses the quickest way to get started, which is reading CSV files (Appendix I).
The import of data results in three data frames for different datasets. The head() and summary() functions can be called to check whether the data are correctly loaded. At this point, all the powerful functionalities of R can be used, such as nrow(locations), which shows that the data frame contains 819 locations. To visualize the spatial distribution, a map can be drawn as an R plot with the ggmap package, which extends the functionality of ggplot2 with handy tools to manage spatial data, such as loading background tiles [23]. Once all necessary packages are loaded, Appendix J can be used to generate a map of locations (Figure 2).

Data Exploration and Analysis
Since the data are already imported into an R workspace, further exploration is relatively easy. For example, the total number of photos and summary measures by location can be extracted with simple commands. Distributions can also be visualized as simple histograms. Appendix K shows a few examples of data exploration along with the extraction of the 20 most popular places in terms of unique users (Figure 3).
Similarly, upload intensity (number of photos per weekday) can be extracted as shown in Appendix L, and word clouds can be generated from the hashtags attached to each photo, revealing insights into the conversation during the data collection period (Figure 4, Appendix M). As R is a powerful statistical framework, we can easily test hypotheses that we generate. For example, we can fit a simple linear regression between "likes" and "users tagged" in each photo if we suspect any relationship between them (Figure 5, Appendix N).
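The workshop material performs these summaries in R (Appendices K and L). For readers who stay within Python for the whole workflow, the two summaries mentioned above, places ranked by unique users and photos per weekday, can be sketched with the standard library alone. The field names ("location", "user", "created"), the timestamp format, and the sample records are assumptions made for illustration and do not reflect the structure of the sample datasets.

```python
from collections import Counter, defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def top_places_by_unique_users(photos, n=20):
    """Rank locations by the number of distinct users who posted there."""
    users_by_place = defaultdict(set)
    for photo in photos:
        users_by_place[photo["location"]].add(photo["user"])
    # Sort by descending user count, then by name to keep ties deterministic.
    ranked = sorted(users_by_place.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(place, len(users)) for place, users in ranked[:n]]


def photos_per_weekday(photos):
    """Count photos by the weekday of their timestamp."""
    counts = Counter()
    for photo in photos:
        created = datetime.strptime(photo["created"], "%Y-%m-%d %H:%M:%S")
        counts[WEEKDAYS[created.weekday()]] += 1
    return dict(counts)


sample_photos = [
    {"location": "Senate Square", "user": "a", "created": "2016-06-14 09:00:00"},
    {"location": "Senate Square", "user": "b", "created": "2016-06-14 12:30:00"},
    {"location": "Senate Square", "user": "a", "created": "2016-06-18 18:45:00"},
    {"location": "Market Square", "user": "c", "created": "2016-06-18 10:15:00"},
]

top_places = top_places_by_unique_users(sample_photos, n=20)
weekday_counts = photos_per_weekday(sample_photos)
print(top_places)
print(weekday_counts)
```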

Case Studies
During the workshop, some basic case studies built around extracted VGI data were also provided. Step-by-step tutorials are accessible on the workshop's GitHub repository (https://github.com/jlevente/link-vgi/blob/master/workshop/tutorial.md) in the Case studies section. Below are some short interpretations of case studies in which data from VGI sources are linked and visualized, and secondary datasets are created.
In some instances, data scraped from APIs do not contain references to particular real-world places but, instead, indicate the locations of particular events, such as where photographs were taken. Flickr is such a dataset: a photograph can have data associated with it indicating where the image was captured. When looking at these data overlaid on a map, it is clear that a number of images are clustered in particular locations, which could be used as an indicator for points of interest (POIs) within the environment. Using clustering methods (see [24]), the Flickr data can be processed to provide such POIs. It should be noted that if only the location of the photograph is taken into account, and not the direction and focal distance, then the point extracted by the clustering methods will in fact only indicate where photos are taken and will likely not be at the location of the feature which forms the focus of the image. Figure 6 shows some potential POIs extracted from the Flickr dataset for the central Helsinki area.
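To give a feel for how clusters of photo locations can be turned into candidate POIs, the following sketch bins points into a regular grid and reports the centroid of every densely populated cell. This is a deliberately simple stand-in and not the clustering method of [24]; the cell size, the minimum photo count, and the sample coordinates are arbitrary assumptions, and the degree-to-metre conversion is only an approximation suitable for a city-sized area.

```python
import math
from collections import defaultdict


def grid_cluster_pois(points, cell_size_m=100.0, min_photos=3):
    """Bin (lat, lon) points into square cells and return dense-cell centroids.

    Returns a list of (lat, lon, photo_count) tuples, densest first.
    """
    if not points:
        return []
    # Approximate metres per degree near the mean latitude of the data.
    mean_lat = sum(lat for lat, _ in points) / len(points)
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(mean_lat))

    cells = defaultdict(list)
    for lat, lon in points:
        key = (math.floor(lat * m_per_deg_lat / cell_size_m),
               math.floor(lon * m_per_deg_lon / cell_size_m))
        cells[key].append((lat, lon))

    pois = []
    for members in cells.values():
        if len(members) >= min_photos:
            lat_c = sum(p[0] for p in members) / len(members)
            lon_c = sum(p[1] for p in members) / len(members)
            pois.append((lat_c, lon_c, len(members)))
    return sorted(pois, key=lambda poi: -poi[2])


sample_points = [
    (60.17000, 24.95000), (60.17005, 24.95005), (60.17010, 24.95010),  # tight group
    (60.18000, 24.90000),                                              # isolated photo
]
pois = grid_cluster_pois(sample_points, cell_size_m=100.0, min_photos=3)
print(pois)
```

A fixed grid can split a natural cluster across cell boundaries, which is one reason density-based clustering methods are usually preferred for real analyses.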

When looking at popular places, it is important for wheelchair users to know whether a place is accessible to them. For example, it is not particularly useful to plan a trip to several cafes during a vacation in a city if the person making the trip cannot use the cafes themselves.
Wheelmap is a service developed by Sozialhelden in Berlin that aims at allowing users to view and edit whether a place is accessible to wheelchair users. The base data used are POI features from OSM, with the accessibility information being stored against the POI in the OSM dataset. The possible values for this accessibility are "yes", "limited", "no", and "unknown". Wheelmap provides an API that can be used to harvest the information contributed to the service.
It is possible to create a basic link between the Wheelmap and Foursquare datasets based on the name given to the POI, though this is by no means perfect, as multiple places can have the same name or the same place can be spelled differently between datasets. Nevertheless, as a simple exercise for understanding the accessibility of popular places, it can generate a powerful representation. Using the datasets and QGIS, Figure 7 was produced for the central Helsinki area. In that figure, the size of the circle represents popularity based on Foursquare check-ins (larger = more check-ins), and the color represents the accessibility (red = not accessible, yellow = limited accessibility, green = accessible, and grey = unknown).
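The name-based link described above can be sketched as a simple dictionary join. The workshop performed this step with the sample datasets and QGIS; the code below is only an illustration, and the field names ("name", "wheelchair", "checkins") and the sample records are assumptions. Names are lower-cased and whitespace-normalized before matching, which handles only the most trivial spelling differences.

```python
def normalize_name(name):
    """Lower-case and collapse whitespace so trivial spelling variants match."""
    return " ".join(name.lower().split())


def link_by_name(wheelmap_nodes, foursquare_venues):
    """Pair Wheelmap nodes with Foursquare venues that share a normalized name."""
    venues_by_name = {}
    for venue in foursquare_venues:
        venues_by_name.setdefault(normalize_name(venue["name"]), []).append(venue)

    links = []
    for node in wheelmap_nodes:
        if not node.get("name"):
            continue  # unnamed nodes cannot be linked by name
        for venue in venues_by_name.get(normalize_name(node["name"]), []):
            links.append({
                "name": node["name"],
                "wheelchair": node.get("wheelchair", "unknown"),
                "checkins": venue.get("checkins", 0),
            })
    return links


sample_wheelmap = [
    {"name": "Cafe Example", "wheelchair": "yes"},
    {"name": "Old Library", "wheelchair": "no"},
    {"name": None, "wheelchair": "limited"},
]
sample_foursquare = [
    {"name": "cafe  example", "checkins": 1200},
    {"name": "Harbour Sauna", "checkins": 300},
]

links = link_by_name(sample_wheelmap, sample_foursquare)
print(links)
```

Because every node is paired with every venue of the same name, chains with several branches produce multiple (and partly wrong) links, which is exactly the limitation noted in the text.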
In addition to being able to portray the accessibility of places, the two datasets can be compared to determine the amount of spatial difference between the same POIs in each dataset. As the same POI (say, a particular cafe) is recorded in both datasets, but via different methods, there is often a discrepancy that makes linking by location difficult. By measuring the distance between two POIs with the same name, it is possible to get a limited understanding of how much difference is present. Figure 8 shows lines linking features with the same name. All lines greater than 200 m in length were removed from the data, as these most likely connect different places that share a name (e.g., McDonald's fast food restaurants). Basic statistics computed in QGIS on the remaining line features (not longer than 200 m) give an average displacement of 39 m between the POIs in the two datasets.
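The displacement statistic described above was computed in QGIS; as a rough stand-alone equivalent, the following Python sketch measures the great-circle distance between same-named POI pairs, discards pairs further apart than 200 m, and averages the rest. The haversine formula and the spherical Earth radius are substitutions made here for illustration, and the function names and the input layout (pairs of `(lat, lon)` tuples) are assumptions rather than part of the original workflow.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_displacement(pairs, max_distance_m=200.0):
    """Return (mean distance, number of pairs kept) for pairs within the threshold.

    `pairs` is an iterable of ((lat, lon), (lat, lon)) tuples, one per
    name-matched POI pair. Pairs further apart than `max_distance_m`
    are treated as different places that merely share a name.
    """
    distances = [haversine_m(a[0], a[1], b[0], b[1]) for a, b in pairs]
    kept = [d for d in distances if d <= max_distance_m]
    if not kept:
        return None, 0
    return sum(kept) / len(kept), len(kept)


# Hypothetical coordinate pairs for illustration only.
sample_pairs = [
    ((60.0, 24.0), (60.0, 24.0)),    # identical position
    ((60.0, 24.0), (60.001, 24.0)),  # roughly 111 m apart
    ((60.0, 24.0), (60.01, 24.0)),   # roughly 1.1 km apart, filtered out
]
mean_m, n_kept = mean_displacement(sample_pairs)
print(f"{n_kept} pairs kept, mean displacement {mean_m:.1f} m")
```

A fixed threshold such as 200 m is a pragmatic filter: it removes most false matches between chain outlets but will also discard any genuine match whose recorded positions differ by more than the threshold.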

Discussion
With the evolution of user-generated content on the Internet, researchers often face problems with data collection. The collection process requires unique solutions depending on the data source and type of data, and there is no standard way to extract data for research purposes. This can be confusing, especially without a background in programming. However, most VGI platforms provide public APIs, which make it possible to extract potentially useful data for answering many research questions in a relatively straightforward manner. In this paper, we provided an overview of methods used in API interactions when working with VGI data from multiple platforms. Starting from the description of APIs and continuing through the standardization of output formats, the paper presents some easy and generic ways to extract meaningful information from the datasets in a statistical environment. Our examples cover Twitter, Instagram, Mapillary, Wheelmap, Flickr, and OSM data and aim to encourage researchers to integrate these solutions into their research methodology. We also provide an open repository consisting of code and step-by-step case studies, available on a GitHub page, that can be further used for discussion and collaboration. The materials are based on the LINK-VGI pre-conference workshop held in Helsinki, Finland on 14 June 2016 during the AGILE 2016 conference. This paper serves as a starting point for researchers exploring public APIs of user-generated content providers, allowing them to start their own data collection campaigns. We also provided some sample datasets illustrating some possible outputs.
Although this paper focuses on the extraction and usage of VGI data, it does not address other common concerns about VGI, such as data quality and credibility [25,26] as well as legal and liability issues [27]. By nature, VGI datasets are not homogeneous and they are not representative of the whole population. For these reasons, we strongly advise against using data acquired from VGI platforms without further investigation and quality checks in research studies. Failure to address these concerns can lead to different and unreliable results when performing data analysis.
Author Contributions: Levente Juhász, Adam Rousell, and Jamal Jokar Arsanjani contributed to organizing the LINK-VGI hands-on session as well as the writing of the paper. The materials for the paper were prepared by Levente Juhász and Adam Rousell. All authors have read and approved the final manuscript.


Figure 2. Illustration of Instagram locations along with their names.

Figure 3. A bar plot showing user counts for the 20 most visited Instagram locations.

Figure 4. The word cloud of tags in Helsinki.

Figure 6. Potential points of interest (POIs) within Helsinki based on Flickr data.

Figure 7. Accessibility of popular places in central Helsinki.

Table 1. Some metadata of sample datasets.

Table 2. Information on the Volunteered Geographic Information (VGI) services and the corresponding Application Programming Interfaces (APIs).