POI Mining for Land Use Classification: A Case Study

Abstract: The modern planning and management of urban spaces is an essential topic for smart cities and depends on up-to-date and reliable information on land use and the functional roles of the places that make up urban areas. In the last few years, driven by the increased availability of geo-referenced data from social media, embedded sensors, and remote sensing images, various techniques have become popular for land use analysis. In this paper, we first highlight and discuss the different data types and methods usually adopted in this context, as well as their purposes. Then, based on a systematic state-of-the-art study, we focus on exploring the potential of points of interest (POIs), one of the most common categories of crowdsourced data, for land use classification. We developed an application to automatically collect POIs for the study area, creating a dataset that was used to generate a large number of features. We used a ranking technique to select, among them, the most suitable features for classifying land use. As ground truth data, we used CORINE Land Cover (CLC), which is a solid and reliable dataset available for the whole European territory. We used an artificial neural network (ANN) in different scenarios, and our results reveal values of more than 90% for the accuracy and F-score in one of the experiments performed. Our analysis suggests that POI data have promising potential to characterize geographic spaces. The work described here aims to provide an alternative to the current methodologies for land use and land cover (LULC) classification, which are usually time-consuming and depend on expensive data types.

of a specific type of crowdsourced data for land use classification, namely, POIs. We studied the of this to decide which was the most appropriate for our analysis. Based on a service available online, we developed an application to automatically collect POIs for our region of study. applied some data treatment to prepare the data for our experiments.


Introduction
With the recent and rapid development of cities, sustainability concerns have made smart growth an essential field of study. In general, smart growth is an effort to manage natural resources better by reducing and controlling their consumption [1]. Because of this, the need for urban land use planning and the efficient management of urban areas has become evident [2]. These points are directly connected with the design and development of smart cities, converging on a common objective: creating a high quality of life for people in a more sustainable world. With attention turned to urban spaces, land use analysis becomes an essential topic in this context.
Currently, urban spaces have also gained focus due to issues such as urban expansion, traffic control, wellbeing, population activity monitoring, construction projects, environmental preservation, hazard and pollution analysis, and economic analysis, in addition to public health care and other essential services, all of which are related to smart growth and smart cities. The work in these subjects often requires fine-grained maps to be designed and managed [2,3]. However, as urban areas change, keeping maps and information on infrastructures and functional zones up to date is a challenge that code of the application we developed to collect POIs, have been made publicly available and can be downloaded through the link available in Supplementary Materials section.
The remainder of this paper is structured as follows: Section 2 presents a detailed state-of-the-art description, including a comparative analysis of the data and methods utilized for land use classification. Section 3 describes the approach we proposed for characterizing geographic areas, including data collecting and preprocessing. In Section 4, we present and discuss the results we obtained via the work we performed based on various scenarios we planned. Finally, in Section 5, we present our conclusions and suggestions for future work.

State of the Art
Given the importance of up-to-date information related to LULC and urban functional regions, many efforts have recently been made regarding this topic, increasing the popularity of different types of data and methods for knowledge discovery in the context of land use analysis. In this section, we discuss the most common categories of data and the most frequent methods employed by authors in this subject.

Data
Many techniques can be used for land use analysis, based on different data types. An important task for researchers is improving the accuracy of the results generated by these techniques. The integration of features extracted from various data types can, to some extent, show better results. In this section, we present the main data types frequently used for urban functional region extraction and LULC classification. The data types presented in this subsection were used in at least two studies among the set of works analyzed during our survey.

Remote Sensing Images
Several methods used to update land cover (LC) maps are based on the interpretation of aerial photos and field surveys, which are time-consuming and difficult. Due to the recent development of remote sensing technologies, a large number of remote sensing images are available through sensors installed in aircraft or satellites [15]. In addition, remote sensing images are present in scientific datasets, in some cases provided by universities [16], research centers [3], and government agencies [17], among other organizations. Remote sensing images are often useful for extracting LC information and, combined with other data types, generally make it possible to identify plots of land used for various purposes (e.g., residential, commercial, or industrial). This identification is usually based on the physical properties of objects, such as spatial distribution, color, texture, and shape [2,15].
For land use and functional region analysis based on remote sensing images, low-level semantic features can be described as information that comes with the data, such as physical properties (e.g., color and texture), whereas high-level semantic features are directly related to specific "knowledge" for each user and application [2,4]. The semantic gap usually refers to the disparity between low-level and high-level semantic features. Using only low-level semantic features is probably less accurate because different objects may have the same physical properties and identical objects may have different attributes. Adding high-level semantic features to the classification, that is, attributes of the object given by the human user, will probably achieve better results. For example, consider a set of remote sensing images in which land cover objects (e.g., buildings) can be recognized based on a low-level description. In this case, high-level information provides good features for classifying functional zones, such as residential, commercial, or industrial areas [2]. The addition of high-level semantic features has been suggested by many authors (e.g., Zhong et al. [18]) as a way to achieve better results in land use classification.

Crowdsourced Data
Crowdsourced data are created voluntarily by users, mainly using mobile applications, to provide useful data for different domains and diverse types, especially when the data are opportunistic, based on the user context, such as POIs, social media data, taxi trajectories, cell phone usage, check-in activities from location-based social networks (LBSNs), and even text messages [9,19]. Around the world, every day, there are 7000 million check-ins on Foursquare, 500 million tweets are posted, and more than 80 million photos are uploaded to Instagram [20]. These rich and diversified sources of data potentially provide information on human activities and socioeconomic information, which has been the central idea of many studies to indicate urban functions [9]. Among various categories of crowdsourced data, the most frequently used are the following.

Points of Interest (POIs)
Specific urban functions are reflected by the spatial distributions and interactions of various types of POIs [6]. For different types of activities (e.g., working, studying, dining, shopping, or relaxing), people usually go to specific POIs. For the same reason, many scientific studies (e.g., [6,13,19,21,22]) have focused on extracting features from POIs, which are often fused with remote sensing images. POIs can be collected from various sources and are frequently available for free through social network application programming interfaces (APIs) and online map service providers. For example, Gong et al. [11] and Zhai et al. [19] used POIs extracted via APIs provided by Baidu Map Services, while Liu et al. [22] obtained them from http://www.dianping.com, one of China's largest online to offline companies, https://www.fang.com, one of China's largest online house information service providers, and additionally Baidu maps.

Text Messages
Crowdsourced data usually contain a very large number of text messages, which can be exploited to generate socioeconomic features [9]. Common examples of this type of crowdsourced data are tweets (or Twitter messages), used in [9,19,20,23,24]. Other services are used for the same purpose, such as the Sina Weibo application [23], which has the same asymmetric user structure as Twitter, where a user can follow anyone else without having to establish a friendship. Another category of textual data that can be considered in land use classification is the tags associated with each photo on Flickr [8]. Generally, users can attach geographical coordinates to messages and photos, which is the focus of location-based social networks (LBSNs), which often offer open APIs to download this content for free.

Check-in Activities from LBSNs
Considering the physical locations of users, several LBSNs have created traces of social interactions. Generally, in these social networks, users can check in at a location, rate it, and share their comments or tips [6]. Some examples of LBSNs are Gowalla, Foursquare, and Facebook Places, which are increasingly exploited as the dimensions of these services grow. Flickr can also be considered in this category [8], as a photo in a specific location is evidence of the user's presence at that time. Some studies (e.g., in [25]) have analyzed spatial, temporal, social, and textual aspects associated with hundreds of millions of user-driven check-in activities. Others (e.g., in [20]) have explored LBSN check-in activities as a factor of popularity for POIs. The use of such data was encouraged by Xing et al. [9] to improve results when other categories of data do not provide useful information.

OpenStreetMap (OSM) Datasets
In comparison with proprietary sources, for many developed regions, OSM datasets are almost always complete, according to a recent analysis on their street networks [26]. A set of user-friendly interfaces is accessible for volunteers. These interfaces offer some map editing capabilities that allow the user to delineate the geometric representation of features or areas of interest based on remote sensing images provided by the broad range of image libraries available for citizen science-based projects. In addition, individual attributes can be added to mapped features to enrich them. The establishment of these types of services has provided representative advances in digital maps thanks to the accurate GPS-based technology present in mobile devices, which may produce a new and more complete map experience. As a result, digital content has become significantly diversified, bringing much more spatial information thanks to volunteers. Because of the large number of users, much larger sets of digital content and information are free for community use. In this context, OSM is an ideal example of a collaborative project. For example, its datasets are used by Liu et al. [2] and Zhang et al. [3] in their studies.

Taxi Trajectories
Beyond the mentioned categories of data, many works have also used, with some frequency, taxi trajectories [22,27]. Taxi trajectories can easily provide pick-up and drop-off points, trip lengths, and the time of each trip. However, these points often do not represent the exact locations where users have their activities [28]. In most cases, passengers exit their taxi quite far from their final destination. Additionally, because the information provided by taxi trajectories does not contain an accurate indication of the passenger's purposes for their activities, it is challenging to deal with only this kind of information. This is the main reason why taxi trajectory data are commonly combined with other data types (e.g., building blocks or LBSN user information) to provide better results.

Building Blocks
Building blocks are often referred to as "street blocks", and although they represent a different category than taxi trajectories, for example, they are often used as complementary information in many studies. Building block information is normally provided by local administrations [14], but it is possible to extract these kinds of data from remote sensing images. This technique was utilized, for example, by Liu et al. [22], and the obtained building blocks were combined with social network records, taxi trajectories, and POIs to characterize mixed-use buildings. Another work, conducted by Huang et al. [15], also employed building blocks together with remote sensing images for urban land use mapping.

Methods
Driven by advances in computational resources, the availability of georeferenced data and modern tools provided by GIS applications, different techniques have become popular for land use analysis. In this subsection, we highlight the most common methods frequently adopted in this context. The methods covered in this subsection have been used in at least two scientific studies among the set of works analyzed in our survey.

Object-Oriented Classification (OOC)
Terms such as "object-oriented" and "object-specific" are also used to refer to object-based image analysis (OBIA). This scientific area emerged after the release of the first commercial software developed specifically for the design and analysis of "image objects", rather than individual pixels, in remote sensing images [7]. In this view, scene components or entities are distinguishable objects in a given image (e.g., a tree, house, or vehicle). In an object-oriented approach, pixel-based images are segmented into objects according to a user-defined homogeneity criterion, which is enforced within each image object during the segmentation process. At each step of the process, the pair of adjacent objects whose merger causes the smallest growth in the user-defined heterogeneity is merged, and the process stops when this smallest growth exceeds the scale parameter [10]. Currently, this is one of the most popular methods for extracting land use patterns from images through the physical features of ground objects [13]. Although many studies have used this method, object-oriented classification can only reveal land cover information based on low-level semantic features, where spatial relationships among ground objects are not considered.

Latent Dirichlet Allocation (LDA)
When analyzing abundant textual descriptions to discover thematic features and their respective structures, a large number of studies have used probabilistic topic models, among which the most common is LDA [6]. An LDA model is often used to extract socioeconomic information from crowdsourced data, providing explicit descriptions of human activities, as implemented by Xing et al. [9]. Moreover, it has been applied to elicit topics from the textual descriptions of Flickr photos in order to create features for a land use classification model [8]. LDA is an unsupervised, generative, probabilistic model that follows a bag-of-words approach, which means that the order of words in a document is not taken into account. The main idea of LDA is to represent each document as a probability distribution over latent topics, where each topic is in turn a distribution over words. To simplify the concept, probabilistic topic models, including LDA, can be generically described as a "random mixture of topics" [18].

K-Means
Among many clustering algorithms, K-means is one of the most common in data mining [18]. As a type of unsupervised learning, K-means clustering is used for unlabeled data (i.e., when the categories are not defined). The algorithm locates groups in the data using a parameter k that represents the number of groups. The clustering process is iterative, and at each iteration every data point is assigned to one of the k groups based on its attributes. One example of an application of K-means is given by Trevino [29], where feature similarity is used for clustering the data points. Moreover, clustering techniques have been successfully applied by many authors for various purposes, such as defining areas or regions [30], classifying features extracted from social media data [2], analyzing correlations between points of interest and zones [13], and aggregating similar formal regions in terms of region topic distributions [21].
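The iterative assignment described above can be illustrated with a minimal sketch of the standard K-means procedure (Lloyd's algorithm) on two-dimensional points. The function name `kmeans`, the optional `init` argument, and the toy coordinates are assumptions introduced here for illustration; they do not come from the studies cited above.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=50, seed=0, init=None):
    """Cluster 2-D points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from user-supplied centroids, or from k randomly chosen input points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty group
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids


# Toy data (hypothetical): two visibly separated groups of points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
```

In practice, the result depends on the initial centroids, which is why implementations usually run several random restarts and keep the best solution.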

Hierarchical Semantic Cognition (HSC)
HSC is a bottom-up Bayesian method with a hierarchical structure used to classify urban functional zones [4,14]. It consists of four semantic levels: functional zones, patterns of spatial objects, categories of objects, and visual features. In this model, each level uses conditional probabilities to characterize a relationship between two semantic layers. Thus, the first level can, for example, model the relationship between functional zones and patterns of spatial objects. Typically, different objects have different distributions of visual features in the same spatial object pattern, whereas within the same object type different patterns of spatial objects may exist, with small differences in their distributions of visual features. HSC is used for LULC and functional zone classification with data such as remote sensing images, POIs, and road blocks.

Random Forest (RF)
RF is a bagging ensemble learning algorithm that works by building multiple decision trees, each of which is based on a random sub-sample of the training dataset [3]. The model outputs the class that receives the most votes among the trees. As a tree-based ensemble method, this classifier can provide a higher accuracy than single decision trees, such as classification and regression trees (CART) or C4.5. In addition, in many cases, RF outperforms popular models, such as support vector machines, without the need to adjust numerous parameters. RF is well established in the literature and is widely used for land use and functional zone classification [4,8,19].
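The two ingredients named above, training each tree on a random sub-sample and taking a majority vote, can be illustrated with a deliberately simplified sketch. It bags one-feature decision stumps instead of full trees and omits the random feature selection used by a complete RF, so it should be read as an illustration of bagging and voting only. All names and the toy data (a hypothetical POI count per cell) are assumptions.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a depth-1 tree (one threshold on one feature) to (x, label) pairs."""
    best = None
    for threshold, _ in samples:
        left = [lab for x, lab in samples if x <= threshold]
        right = [lab for x, lab in samples if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def fit_bagged_stumps(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        trees.append(fit_stump(bootstrap))
    return trees


def predict(trees, x):
    """Return the class that receives the most votes among the trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): number of restaurant POIs in a cell and a coarse label.
samples = [(0, "rural"), (1, "rural"), (2, "rural"),
           (8, "urban"), (9, "urban"), (12, "urban")]
forest = fit_bagged_stumps(samples)
prediction_low = predict(forest, 0)
prediction_high = predict(forest, 12)
```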

Support Vector Machine (SVM)
In general, a support vector machine can be described as a supervised learning method that works as a discriminative classifier. The method creates a hyperplane or a set of hyperplanes that separate the inputs in a high-dimensional space, allowing them to be classified. Based on the training data, the algorithm outputs an optimal hyperplane. This hyperplane lies in an N-dimensional space, where N is the number of features used for training. For example, a hyperplane in a two-dimensional space is a line splitting it into two parts [31]. The model is based on the principle of structural risk minimization [32]. This method is used, for example, as a classifier in scene classification, to predict scene labels. The main idea of this technique is to train a linear classifier in a kernel space, taking into account both generalization and performance optimization, in order to solve the pattern classification problem [18]. SVM was chosen, for example, by Liu et al. [2] to identify urban land use types. The authors adopted SVM because previous studies (e.g., [33,34]) suggested that this method is highly efficient as a classifier when working with high-dimensional features.

Deep Convolutional Neural Network (DCNN)
One common approach for LULC classification is to use per-field methods to directly extract or classify low-level features of the physical properties of images. These methods can offer some advantages over per-pixel or object-based methods. However, per-pixel, object-based, and per-field land use and land cover classification techniques are based on manual feature descriptors and shallow architectures and cannot capture the fine features of complex land use images [15]. As a result, these methods generalize poorly, and none of them reaches the level of accuracy generally required for practical applications. Land use can be described at many levels in an LULC scheme, including the intensities of pixels, edges, objects, parts of objects, and parcels of land. Deep architectures can efficiently represent all these levels. Deep learning is a group of machine learning algorithms that aim to model high-level abstractions by employing deep architectures, which are compositions of multiple nonlinear transformations. Deep learning models are a highly promising approach to urban LULC classification problems, since they can model hierarchical representations of the features that describe urban LULC schemes. DCNNs consist of several convolutional layers and can learn high-level abstract features from the original pixel values of images [35]. Among many deep learning methods, the DCNN technique has achieved a high level of performance in land use classification based on remote sensing images.

Comparative Analysis
According to the information presented in Table 1, POIs are among the most common crowdsourced data types observed in the studies we analyzed. Many authors adopt them mainly because of their direct connection with human behavior, which reveals, to some extent, the ways people use spaces. Moreover, POIs are often related to LBSN user activities, and because of this, these two data types are frequently combined for land use classification [2,3]. The table also shows the use of various other crowdsourced data types generated by volunteers in their daily routines. In many countries, crowdsourced data are widely available, which encourages their use in cases in which other datasets (e.g., urban planning data or GPS data) are not available.
Regarding the methods, due to the absence of ground truth data for validating results, many researchers use unsupervised techniques, among which clustering methods, including spectral clustering and K-nearest neighbor (KNN), are common. However, considering the studies we analyzed, the most frequent technique in this category is K-means, given its simplicity and effectiveness for tasks such as grouping POIs, functional regions, or geographic spaces. Some examples of works in which K-means was used include [2,16]. In [2], the authors conclude that using this method together with others leads to satisfactory results for land use classification. During our survey, we also observed cases where ground truth data were available, leading authors to adopt supervised techniques. Among the supervised techniques, RF was one of the most common, given its effectiveness as a classifier and its balance between resource consumption and performance in tasks such as the classification of land use and functional zones. However, there are other reasons why RF is often chosen, as, for example, in [4,36]. In the specific case of the latter, the authors mentioned that they adopted the algorithm because they saw it as a scalable and powerful method for dealing with datasets containing a large number of features.
In general, a wide variety of methods was observed. As we can see in Table 1, the techniques used include Naïve Bayes (NB), extreme learning machines (ELM), Word2Vec, skeleton-based decomposition, multiresolution segmentation, Place2Vec, joint deep learning (JDL), and many others. Although there are cases where different methods were chosen for similar purposes, the datasets used were often different, making it difficult to compare the results and conclusions.

Proposed Approach
As presented in the state-of-the-art study, many different approaches for land use classification have been proposed. Many of them are based on image interpretation, which raises a major concern regarding cost: these techniques are both time-consuming and expensive. Although some researchers have utilized crowdsourced data in their analyses, these data are usually adopted only as a complement to image interpretation techniques. To address this concern, we tested an approach based only on POI data. As ground truth, we used an LULC dataset available for the whole European territory.

Study Area
For this case study, the Lisbon metropolitan area (LMA) was chosen. The LMA is a region in Portugal, centered on Lisbon, the capital and largest city of the country. The LMA is spread over 3015 km², with around 2.8 million inhabitants, which represents 27% of Portugal's population. The population density in the region is approximately 932 inh/km², the highest in the country and about eight times higher than the national average. Covering 18 municipalities, it is also the largest urban area in the country (the 10th largest in the European Union).

POI Mining
As mentioned by other authors, POIs are freely and widely available online, either through APIs or download services powered by crowdsourced data communities. It is also possible to get POI data from LBSNs. One example is a service provided by the Facebook API, where it is possible to perform a search given a center coordinate and a radius. The social network provides different options of software development kits (SDKs) to facilitate the development of applications that can interact with API services to get the data. Users can register their accounts as developers, allowing them to create these types of applications.
For this study, we chose to use POI data available from Facebook Places. When a company creates a page on Facebook, the LBSN allows it to add a coordinate that represents its geographic location. If such a location is added, the page becomes available as a POI dubbed a Facebook Place. These kinds of POIs are also accessible through an API service provided by the social network.
To automatically get the POI data available on Facebook Places, we developed a software application. In order to use the Facebook API, it was first necessary to define the coordinates to be used during the search. The software we developed calculated these points from four coordinates given by the user, which represent the limits of a rectangle projected over the area where the search is performed. In the specific case of the LMA, the bounding box adopted was composed of the following coordinates:
Once the set of coordinates was generated, we discarded the points outside the geographic bounds of the study area. This can be done by using a GIS or a spatial database, for example. However, as shown in Figure 1, given the limitations of this type of search, the coordinates need to respect a predefined spacing, leading to the problem of overlapping radii. In practice, this means that, during the search, the same POI could be returned as a result of different requests. We solved this problem by filtering the final dataset using the unique ID provided by Facebook for each POI.
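As an illustration of how search centers can be generated so that circles of a fixed radius cover a rectangle with limited overlap, the following sketch places the centers on a staggered (triangular) lattice, in which every other row is shifted by half a step. The exact spacing used by our application is not stated in the text, so the lattice constants, the function name `search_centers`, and the use of projected coordinates in metres are all assumptions made for this example.

```python
import math


def search_centers(min_x, min_y, max_x, max_y, radius):
    """Return circle centres on a staggered lattice that covers a rectangle.

    Assumption: coordinates and radius share one projected unit (e.g. metres).
    With a column step of sqrt(3)*radius and a row step of 1.5*radius, every
    point of the rectangle lies within `radius` of at least one centre.
    """
    dx = math.sqrt(3) * radius  # spacing between centres along a row
    dy = 1.5 * radius           # spacing between consecutive rows
    rows = int((max_y - min_y) / dy) + 2
    cols = int((max_x - min_x) / dx) + 3
    centers = []
    for row in range(rows):
        y = min_y + row * dy
        # Shift every other row by half a step to reduce the overlap.
        start_x = min_x - dx / 2 if row % 2 else min_x
        for col in range(cols):
            centers.append((start_x + col * dx, y))
    return centers


# Hypothetical 1 km x 1 km rectangle searched with a 100 m radius.
centers = search_centers(0.0, 0.0, 1000.0, 1000.0, 100.0)
```

Because neighboring circles still overlap, the same POI can be returned by more than one request, which is why the results must be deduplicated by ID afterwards.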
Our software stored the generated coordinates in a database table. Figure 2 shows the steps involved in filtering these points to make them ready for processing. Initially, the set of generated coordinates covers the entire region, according to the four parameters given by the user. This can be seen when creating a layer in GIS software using these points. In this study, we performed a spatial query directly on the database, based on a polygon saved in a table, which contains the geographic bounds of the study area. As a result, only the points within the limits were kept. Although we chose to manipulate the data directly with SQL, the same goal may be easily reached by performing an intersection in a GIS application.
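The filtering step above was performed with SQL in a spatial database. Purely as an illustration of the underlying predicate, the sketch below keeps only the points that fall inside a polygon, using the classical ray-casting test. The function names and the square "study area" are hypothetical, and the polygon is assumed to be simple (no holes, no self-intersections).

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first one.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_points(points, polygon):
    """Keep only the points located inside the polygon."""
    return [(x, y) for x, y in points if point_in_polygon(x, y, polygon)]


# Hypothetical study area (a square) and candidate search centres.
study_area = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
candidates = [(2.0, 2.0), (5.0, 2.0), (-1.0, 2.0), (1.0, 3.5)]
kept = filter_points(candidates, study_area)
```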
After generating the geographic coordinates, the developed application processed them by making searches through a service called "Places Search API for Web" [43,44], provided by Facebook, to collect the data. For each coordinate, the Facebook API returned all POIs within the given radius. The returned POI data were stored in a database table. As each request takes some seconds, the time necessary to collect the data depends on the size of the region and, hence, on the number of coordinates to be processed. After the collection finishes, it is necessary to filter the final dataset in order to eliminate repeated POIs. Once again, this was done by manipulating the table directly in the database. For solving this specific problem, we think this is the best approach, given its simplicity and efficacy.
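The collection loop and the removal of repeated POIs can be sketched as follows. In our work the deduplication was done in the database; the in-memory version below is only an illustration. The `fetch_places` callable is a hypothetical stand-in for the real search request (no actual API call, endpoint, or response format is implied), and the sample records are invented.

```python
def collect_pois(centers, radius, fetch_places):
    """Query every search centre and keep a single record per POI ID."""
    unique = {}
    for x, y in centers:
        for poi in fetch_places(x, y, radius):
            # Overlapping circles return the same POI more than once;
            # the first occurrence of each ID is kept.
            unique.setdefault(poi["id"], poi)
    return list(unique.values())


# Hypothetical responses for two overlapping search circles.
FAKE_RESPONSES = {
    (0.0, 0.0): [{"id": "a", "name": "Cafe"}, {"id": "b", "name": "School"}],
    (1.0, 0.0): [{"id": "b", "name": "School"}, {"id": "c", "name": "Pharmacy"}],
}


def fake_fetch(x, y, radius):
    """Stand-in for the real search service; returns canned records."""
    return FAKE_RESPONSES.get((x, y), [])


pois = collect_pois(list(FAKE_RESPONSES), radius=1.0, fetch_places=fake_fetch)
```

Passing the fetch function as a parameter keeps the deduplication logic independent of the service being queried.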
The data used in this work were collected in the scope of another project and, for this reason, were first obtained for the whole territory of the country. However, as described before, in the experiment we present here, our focus was the LMA. By using the software that we developed, we collected 171,177 POIs distributed across continental Portugal, of which 17,777 were located in the LMA. Considering that each POI can belong to more than one category, the collected dataset presents a total of 24,144 examples in the study area, which were used for extracting features to train the models.
The last step in the POI collection process is to get the Facebook Places taxonomy. We chose to

Figure 1. Coordinate generation. As seen in (a), if the distances between the points are too large, the application will not be able to cover the whole area, so the most reasonable configuration may be that in (b). By adopting this configuration, the final pattern is shown in (c), where we can see that the red points are slightly displaced when compared to the black points. This configuration allows the software to cover more area with less overlap.

The data used in this work were collected in the scope of another project and, for this reason, were first obtained for the whole national territory. However, as described before, in the experiment we present here, our focus was the LMA. By using the software that we developed, we collected 171,177 POIs distributed across continental Portugal, of which 17,777 were located in the LMA. Considering that each POI can belong to more than one category, the collected dataset presents a total of 24,144 examples in the study area, which were used for extracting features to train the models.
The last step in the POI collection process is to get the Facebook Places taxonomy. We chose to do this through the API, as this was simple and fast. Although each collected POI has an attribute containing a list of all categories to which it belongs, some operations are needed to link it to the taxonomy, which is obtained separately. We used SQL to establish these connections. This step is optional; however, it is highly recommended because the list of categories returned with each POI has no hierarchical structure by itself and, without the taxonomy, the final analysis may be less valuable.

Ground Truth Data
CORINE Land Cover (CLC) [45,46] is a dataset representing the use and coverage of geographic areas across Europe. It was created as an initiative of the European Environment Agency, in partnership with the member states. As a thematic cartography, CLC data are mainly produced by image interpretation, often using automatic or semi-automatic solutions available in GIS applications. The dataset can be downloaded in either raster or vector format, with a minimum mapping unit (MMU) of 25 hectares (ha) for areal phenomena and a minimum width of 100 m for linear phenomena. According to its official nomenclature [47], CLC data are divided into 44 classes over three hierarchical levels. It was first released in 1990 and was updated in 2000, 2006, 2012, and 2018. For our analysis, we used the latest version in vector format.

Data Preparation
We adopted a grid-based methodology [48] to represent the land parcels for classification. This method involved creating cells uniformly distributed over the study area. Using a GIS application, a grid in which each cell measures 250 m by 250 m was automatically built, as shown in Figure 3. When created, the grid is a square composed of polygons of the chosen size. In order to keep only the cells inside the study area, we performed an intersection operation. The resulting grid was stored in a database table. For each cell, a spatial attribute that represents its centroid was included. This attribute is essential when generating features for training the predictive model.
After creating the grid, we extracted the LULC class for each cell, based on the CORINE data. We assumed only one class per land parcel. For example, if 20% of a given cell is covered by class "111. Continuous urban fabric" and, in the remaining 80%, the predominant class is "112. Discontinuous urban fabric", we assign the class that covers most of the space, i.e., the latter. However, before extracting the dominant class for each cell, we reorganized the dataset class structure, keeping two levels so that the units could be easily classified later. Table 2 shows the new class structure. This operation was performed by using a "dissolve" tool, available in almost every modern GIS application, which merges polygons in vector data. The reorganization resulted in 15 classes and, after preparing the grid, we obtained 81,238 cells for the study area. Each cell is available for extracting features to be used as training data for the models. The exact number of examples used in each scenario is described in Section 4.
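The rule of one class per cell can be sketched as an area-weighted majority vote. The Python code below is only an illustration under stated assumptions: the input is a hypothetical list of (class code, overlap area) pairs for a single cell, as could be produced by intersecting the grid with the CLC polygons, and the tie-breaking behavior (first class encountered wins) is our own choice, not something prescribed by the method.

```python
from collections import defaultdict
from typing import List, Tuple


def dominant_class(overlaps: List[Tuple[str, float]]) -> str:
    """Return the class code covering the largest total area of one cell."""
    area_by_class = defaultdict(float)
    for code, area in overlaps:
        # A class may intersect the cell in several separate polygons.
        area_by_class[code] += area
    # On a tie, max() keeps the class that was encountered first.
    return max(area_by_class, key=area_by_class.get)


cell_area = 250.0 * 250.0  # area of one grid cell in square meters

# Hypothetical overlaps: 20% of class 111, 55% of class 112 (in two pieces),
# and 25% of class 2.
overlaps = [
    ("111", 0.20 * cell_area),
    ("112", 0.30 * cell_area),
    ("2", 0.25 * cell_area),
    ("112", 0.25 * cell_area),
]
label = dominant_class(overlaps)
```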

Feature Extraction
We extracted four distinct types of features based on the collected POI categories and their respective levels, which range from 1 to 6 in the Facebook Places taxonomy [49]. Through an exploratory analysis, we decided to focus on all categories belonging to levels 2, 3, 4, and 5, considering that level 1 is highly generic and level 6 is too specific, containing only restaurant service categories. Table 3 shows the extracted attribute groups. To choose the radius, we analyzed distances of 2500, 5000, and 10,000 m; the last one produced the best models. The proportion of POIs was computed for each category as the number of points inside the radius divided by the total number of POIs belonging to the category. Considering all feature sets, for all levels, we extracted 5928 attributes in total, with 500 of them belonging to group 1, 1808 to group 2, 2888 to group 3, and the remaining 732 to group 4.
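As a rough illustration of how such per-category metrics can be derived from a cell centroid, the sketch below computes the distance to the nearest POI, the number of POIs within a radius, and the proportion defined above. It assumes planar coordinates in meters (i.e., already projected) and treats the radius boundary as inclusive; the function names and sample points are hypothetical, and the sketch does not reproduce the exact feature groups of Table 3.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # planar (x, y) in meters


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two projected points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def category_features(centroid: Point, pois: List[Point],
                      radius: float) -> Dict[str, Optional[float]]:
    """Metrics for one cell centroid and the POIs of a single category."""
    distances = [distance(centroid, p) for p in pois]
    count = sum(1 for d in distances if d <= radius)
    return {
        # Distance from the centroid to the closest POI of the category.
        "nearest": min(distances) if distances else None,
        # Number of POIs of the category inside the radius.
        "count": count,
        # Points inside the radius divided by all POIs of the category.
        "proportion": count / len(pois) if pois else 0.0,
    }


# Hypothetical centroid and POIs of one category.
centroid = (0.0, 0.0)
cafes = [(3000.0, 4000.0), (6000.0, 7000.0), (20000.0, 0.0), (0.0, 9000.0)]
features = category_features(centroid, cafes, radius=10000.0)
```

Repeating this computation for every cell and every category of the selected taxonomy levels is what leads to the large number of attributes reported above.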

Feature Selection
After extracting the POI attributes, we first used a correlation matrix to analyze the relevance of each feature type. We noticed that attributes belonging to group 2 presented extremely low correlation with the LULC classes, so we discarded these features. Regarding the features in group 4, we verified that they present almost the same relevance as the attributes belonging to group 3. As the proportion of POIs demands more computational power to calculate than their count within a given radius of the cell, we decided to keep group 3 instead of group 4.
As we intended to explore the potential of metrics extracted from POI data for classifying LULC, we applied a ranking method [50] to the remaining 3388 attributes, for each scenario, in order to select the most valuable feature set. Different criteria, including information gain, gain ratio, and the Gini index, were tested in this process. We chose this approach because it is usually faster and consumes fewer resources than other methods. By using this technique, we selected different attribute sets, which were employed in the classification tests. The number of attributes selected in each scenario is presented in Section 4. A list of all feature sets used is available online at http://tiny.cc/9eq8rz.
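To make the ranking idea concrete, the following minimal sketch scores features by information gain and keeps the top-ranked ones. It is an illustration only, not the implementation used in our experiments: it assumes that numeric attributes have already been discretized into a few values, and the feature names, values, and labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values: Sequence, labels: Sequence[str]) -> float:
    """Entropy reduction obtained by splitting the labels on a feature."""
    total = len(labels)
    groups: Dict[object, List[str]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features: Dict[str, Sequence], labels: Sequence[str],
                  top_k: int) -> List[str]:
    """Return the names of the `top_k` features with the highest gain."""
    scores = {name: information_gain(vals, labels)
              for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical discretized features for four cells.
labels = ["urban", "urban", "rural", "rural"]
features = {
    "noise": ["a", "b", "a", "b"],
    "cafes_nearby": ["high", "high", "low", "low"],
}
ranking = rank_features(features, labels, top_k=2)
```

Because each feature is scored independently of the others, this kind of ranking is cheap to compute, which is consistent with the motivation given above.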

Classification
As presented in the state-of-the-art section, the studies we analyzed use a variety of techniques. Some of them are used for data preparation and preprocessing, and others for classification tasks. However, for creating a model capable of classifying land use from numeric attributes, we believe that it is possible to successfully employ a powerful method that is not usually seen in this context, namely, an artificial neural network (ANN). We tested different scenarios using an ANN, as presented in the next section. The parametrization details are shown in Table 4. As evaluation metrics, we adopted accuracy, F-score, kappa, precision, and recall. The models were trained and evaluated using 10-fold cross-validation. RapidMiner Studio [51] was used for the classification tests we conducted, and the details of the ANN implementation in this software can be found in the official documentation [52].
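For readers less familiar with these metrics, the sketch below derives accuracy, Cohen's kappa, and per-class precision, recall, and F-score from a confusion matrix. It uses the standard textbook definitions and is not taken from RapidMiner; how that software aggregates the values across folds is not reproduced here. The function name and the example matrix are hypothetical.

```python
from typing import Dict, List


def evaluate(matrix: List[List[int]]) -> Dict[str, object]:
    """Metrics from a confusion matrix.

    matrix[i][j] is the number of examples of true class i predicted as j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]                    # true counts
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    # Agreement expected by chance, used by Cohen's kappa.
    expected = sum(row_sums[k] * col_sums[k] for k in range(n)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    per_class = []
    for k in range(n):
        precision = matrix[k][k] / col_sums[k] if col_sums[k] else 0.0
        recall = matrix[k][k] / row_sums[k] if row_sums[k] else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall,
                          "f_score": f_score})
    return {"accuracy": accuracy, "kappa": kappa, "per_class": per_class}


# Hypothetical two-class confusion matrix with 20 examples.
results = evaluate([[8, 2], [1, 9]])
```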

Results and Discussion
In a previously conducted exploratory analysis, we noticed that a geographic proximity pattern seems to exist between classes "2. Agricultural areas", "3. Forest and semi-natural areas", and "112. Discontinuous urban fabric". Considering this evidence, we performed some preliminary experiments (not presented here), in which the models generated when using these three classes together tended to perform poorly. For this reason, in most of the experiments described below, we decided to consider these classes separately.
In the first classification test, we used a dataset containing 144 attributes. Table 5 shows the results of the model evaluation. For this experiment, we chose four distinct classes as the classification targets: (1) "2. Agricultural areas"; (2) "111. Continuous urban fabric"; (3) "112. Discontinuous urban fabric"; (4) "121. Industrial or commercial units and public facilities". The model was trained using 2667 examples from class 121 and 3000 examples from each of the remaining classes. As seen, the precision and recall of class 112 are both lower than those of the other classes, while the model distinguishes class 2 best.
In the second experiment, we again trained a model using four classes. However, in this test we used class "3. Forest and semi-natural areas" instead of class 2. Through the automatic selection method described before, we chose 157 attributes. The results obtained can be seen in Table 6. They show that, in general, the model presents almost no difference from the one created in the previous test. As before, we used 3000 examples from each class, except for class 121, for which we used 2667 examples.
In the third experiment, we also used classes 111 and 121. In order to investigate how well a model can distinguish forests, semi-natural areas, and agricultural land parcels from dense urban spaces and industrial areas, we decided to include classes 2 and 3 for training the model. As in the previous tests, we adopted a dataset composed of 11,667 examples, although containing 142 attributes, i.e., 15 fewer than in the previous test. According to the information in Table 7, the model evaluation shows that the classifier reaches better results in dense urban areas than for the other classes.
To complement the presented experiments, in a new scenario considering 221 attributes, we tested a model trained using three classes that we believed to be highly valuable for land use classification: (1) "2. Agricultural areas"; (2) "111. Continuous urban fabric"; (3) "121. Industrial or commercial units and public facilities". The results can be seen in Table 9. For this specific test, we used 4000 examples from class 2, 3043 from class 111, and 2667 from class 121.
Our last experiment was based on a dataset containing 93 attributes. The model was trained with 4000 examples from classes 111 and 112 together. Another 4000 examples from classes 2 and 3 were mapped to a new class, and, as usual, 2667 examples came from class 121. We decided to join examples representing forests and semi-natural regions with those that delimit agricultural areas because of the similarity observed between these two classes in an exploratory analysis we carried out. The results obtained are presented in Table 11.
Some observations can be drawn from these results. Considering the F-score per class, we can see that, in general, the models tend to show inferior results for examples that represent discontinuous urban areas. We can verify this by checking a confusion matrix, in which these land parcels are frequently misclassified. As presented in Table 12, the model often misclassifies examples of class 112 as class 2 or 111, while it also classifies many examples of agricultural plots as discontinuous urban zones. However, when analyzing the model evaluation of scenario 3, it is possible to observe that, when using classes 2, 3, 111, and 121, agricultural lands and forest parcels present the lowest F-scores. Examining the confusion matrices generated for each scenario, we found that these classes are frequently confused with each other.
To find out the main reason why forests, agricultural lands, and discontinuous urban areas are confused with one another, we conducted an exploratory analysis. As Figure 4 shows, there is geographical proximity between these classes. The attributes used represent, in some cases, the distance between each centroid and the nearest POI of each category and, in others, the number of points of each category within a radius of each cell center. Since many categories are concentrated in dense urban regions, we suspect that the distances between the cells of these three classes and the POIs are similar. In other words, the cells representing these three classes are often located far away from many POI categories. Following this idea, we also think that the number of POIs from most categories could be lower in these areas, making it difficult to distinguish between them using the features chosen here.

Conclusions and Future Work
In this work, we first conducted a survey of studies related to the analysis of geographic space. Through a systematic analysis, we highlighted and discussed the main data types and methods, as well as their uses in this context. From the state-of-the-art analysis, we chose to investigate the potential of a specific type of crowdsourced data for land use classification, namely, POIs. We carefully studied the available sources of this information in order to decide which was the most appropriate for our analysis. Based on a service available online, we developed an application to automatically collect POIs for our region of study. Additionally, we applied some data treatment to prepare the data for our experiments.
Based on the POI data, we extracted and analyzed different types of features in order to select the most relevant ones. As we had a large number of features, we adopted a technique well established in the literature to automatically rank them and select the most suitable for each experiment we performed. For land use classification, we chose a grid-based methodology based on a highly reliable ground truth dataset. We defined different scenarios for our tests and, for all of them, adopted a powerful algorithm for classification with numeric features.
Although the grid-based method we adopted to prepare the data was previously suggested in the literature, our study differs from the original technique in some respects. First, our work was based on vector data instead of raster data. We believe that vector data are more efficient for automatic or semi-automatic solutions applied to geographic analysis, because they can aggregate more information related to each parcel of land and can also be easily stored and manipulated directly in a database. Furthermore, the features we adopted are partially different from those used when the method was suggested. Our work provides solid evidence that this methodology works well, and we see these differences as our contribution to the state of the art.
By using an artificial neural network and POI-based data, through different classification scenarios, we achieved values of more than 90% for accuracy and F-score in the most successful case. In general, for most experiments, the values of these metrics are close to 80%, indicating that POIs have potential as a data type for land use classification. Being one of the most common types of crowdsourced data, they are widely and freely available for many countries and can provide relevant contributions by themselves or when combined with other data types, thereby improving results in studies related to the characterization of geographic spaces.
Regarding feature engineering, in this work, we extracted and analyzed four different groups of features, and verified that only two of them are useful for land use classification. Although we generated a significant number of features, they represent specific metrics extracted from the POI data. Thus, as a suggestion for future work, we recommend analyzing the potential of new metrics, for example, the semantic similarity among POIs based on their textual descriptions, which can be added as attributes for training the models, thereby possibly providing better results. Deep neural networks for natural language processing (NLP) have been demonstrated to be a powerful tool for extracting features from texts [13,19]. The adoption of such networks in our approach could improve the performance of land use classification.
As stated in the state-of-the-art analysis, POI data, as one of the most common crowdsourced data types, are widely available through API services and online repositories. However, as with most datasets generated by a large number of users or volunteers, these data often contain considerable noise. For this reason, we also suggest, as future work, assessing the quality of crowdsourced POI data available from different sources. We have noticed that using reliable datasets is key to reaching good results in many cases.
In this work, we explored the potential of POI data to characterize geographic spaces, i.e., for LULC classification. In the analysis we conducted, we adopted only one source for this kind of data. Thus, for future work, we also suggest investigating techniques for merging POIs collected from different sources in order to provide additional data enrichment. We believe that using a dataset that is as complete as possible can help to create efficient models.
Supplementary Materials: The datasets and figures used are available online at http://tiny.cc/gq6iqz and the source-code of the application developed to collect the POIs can be found at https://github.com/RibeiroSt/poicolector-fbp.
Author Contributions: Conceptualization, methodology, data curation, writing (review and editing): Renato Andrade and Ana Alves; investigation, software, writing (original draft preparation): Renato Andrade; supervision, funding acquisition and project administration: Ana Alves and Carlos Bento. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest. The data used were collected within a preliminary phase of a project that could be implemented by the funders in the future. However, the funders had no role in the design of the study; in the analyses, or interpretation of data; in the writing of the manuscript, nor in the decision to publish the results.