Land use identification through social network interaction

The Internet generates large volumes of data at a high rate, in particular, posts on social networks. Although social network data has numerous semantic adulterations, and is not intended to be a source of geo-spatial information, in the text of posts we find pieces of important information about how people relate to their environment, which can be used to identify interesting aspects of how human beings interact with portions of land based on their activities. This research proposes a methodology for the identification of land uses using Natural Language Processing (NLP) from the contents of the popular social network Twitter. It will be approached by identifying keywords with linguistic patterns from the text, and the geographical coordinates associated with the publication. Context-specific innovations are introduced to deal with data across South America and, in particular, in the city of Arequipa, Peru. The objective is to identify the five main land uses: residential, commercial, institutional-governmental, industrial-offices and unbuilt land. Within the framework of urban planning and sustainable urban management, the methodology contributes to the optimization of the identification techniques applied for the updating of land use cadastres, since the results achieved an accuracy of about 90%, which motivates its application in the real context. In addition, it would allow the identification of land use categories at a more detailed level, in situations such as a complex/mixed distribution building based on the amount of data collected. Finally, the methodology makes land use information available in a more up-to-date fashion and, above all, avoids the high economic cost of the non-automatic production of land use maps for cities, mostly in developing countries.


Introduction
The current dynamic nature of cities has generated substantial changes in urban environments, making the analysis of geo-spatial information more complex [Liu et al., 2016]. The urban management and planning of cities has been nourished for years by Geographic Information Systems (GIS) tools, since they allow storing, modeling, and analyzing geo-spatial data. On the other hand, urban planning faces challenges such as organizational changes, staff availability, updating resources and data availability [Yeh, 1999]. In this context, land use maps are one of the most demanded resources by authorities and researchers, for conducting urban planning and environmental sustainability studies [Anugraha and Chu, 2018]. However, the production of these maps is costly and very time-consuming [Ye et al., 2020]. Consequently, the production and management of land use maps evidence a technical problem [Sendra and García, 2000] caused by economic constraints, mostly in developing countries.
The Internet generates large volumes of data at a high rate, in particular, posts on social networks. Social networks have become a source of data for researchers, because this data contains information that allows studying the interaction of citizens with their environment [Lin and Geertman, 2019]. The concept of "user-generated content" [Thakur et al., 2018, Lei et al., 2018], which encompasses how citizens share excerpts from their daily lives, can be translated to different application domains: the representation of urban boundaries, relationships between weather conditions and traffic, citizens' activity patterns, urban transportation behavior and land use [Mora et al., 2018, Lin and Geertman, 2019], urban planning and spatial planning. Therefore, social networks are a valuable source of data that could be used to identify interesting aspects of a city's land use.
Although social network data has numerous semantic adulterations, and is not intended to be a source of geo-spatial information, in the text of posts we find pieces of important information about how people interact with their space, which can complement other sources in the process of identifying land uses [Tessore et al., 2019]. These data can be composed of various fields such as message text, location, and images. In this context, various approaches have been developed and applied to analyze this type of data, with semantic analysis of the message text (unstructured information) being the basis of much of the research related to social networks. However, the message text is an underexploited resource for obtaining and analyzing geographic information [Stock, 2018]. Current land use identification and analysis studies rely primarily on publication metadata, such as time and geographical coordinates (structured information).
The various studies on land use classification have rarely considered the content of posts as a source of data to infer the type of use. In the content of social network posts, people communicate attitudes, activities, relationships and other emotional states that are often complemented with a geo-spatial location. In written expressions it is possible to find words that allow inferring information about where a user is situated. For example, if a person posts the following message "having lunch with friends" on a social network, we could assume that they are in a restaurant, a house, a shopping mall, or another place where they can have lunch. In addition, if we associate two or more words such as "having lunch with friends in a restaurant", the connector "in" and the word "restaurant" provide us with a high probability that the place where he/she is located is, in fact, a restaurant. If the coordinates associated with the message are included, we can know the space where the restaurant is placed, and consequently, the type of land use to which that space belongs. Therefore, the potential of the content of social network publications to address the study of land use is remarkable.
Nevertheless, the language used in social network publications is very informal, includes many idioms, expressions specific to each region, misspellings, or other language deformations. For this reason, the processing of this type of information with specific objectives becomes an important challenge [Iglesias et al., 2016], in which numerous algorithmic techniques will concur, to form a methodology capable of extracting the valuable knowledge.
Most studies on land use are mainly oriented to classification tasks. When we use a data source such as the expressions of a language, it is necessary to propose a methodology that starts with data collection, followed by pre-processing and, finally, classification. The fundamental reason is that the expressions in each language and country (and even region) are different. For example, in order to carry out the present study, it was necessary to create a corpus and a dictionary for the Spanish language and for expressions specific to Peru. This paper proposes a methodology for the identification of land uses using Natural Language Processing (NLP) from the contents of the popular social network Twitter. It will be approached by identifying keywords with linguistic patterns from the text, and the geographical coordinates associated with the publication. The methodology is composed of the following stages: data collection, corpus and dictionary creation, pre-processing, feature extraction, learning, prediction and representation. At each stage, context-specific innovations are introduced to deal with data from South America and, in particular, the city of Arequipa, Peru. The objective is to identify the five main land uses: residential, commercial, institutional-governmental, industrial-offices and unbuilt land.
Within the framework of urban planning and sustainable urban management, the methodology proposed in this study contributes to the optimization of the identification techniques applied for the updating of land use cadastres, since the results achieved an accuracy of about 90%, which motivates its application in the real context. In addition, this model would allow the identification of land use categories at a more detailed level, in situations such as a complex/mixed distribution building based on the amount of data collected. Finally, the methodology makes land use information available in a more up-to-date and, above all, much less costly way.
The document is organized as follows: first, a review of the recent scientific literature related to the use of machine learning techniques applied to the categorization of land use, and specifically involving data from social networks, followed by an overview of the NLP-based methodology; next, a description of the application context in the city of Arequipa, containing examples of the results provided by the main phases of the methodology, together with the results achieved in comparison to several models; also, visual examples are presented to illustrate the contrast between results from our approach and what is registered in the cadastre; finally, conclusions and future work.

State-of-the-art

Land use and urban planning applications
Recent proposals for urban growth cover aspects such as the potential exploitation of land uses: those that can accommodate a high urban diversity. Land uses are thus defined as a measure of the diversity of uses contained in a given space [Hajna et al., 2014], which makes it possible to identify nearby uses or nearby activities developed in a limited spatial range. The term land use refers to the employment that human beings give to a portion of land based on the activities carried out in it. However, in order to use land uses in urban management and planning, an updated map with the most recent and accurate information is necessary [Ye et al., 2020, Terroso-Saenz and Muñoz, 2020] to make the right decisions [Da Silva, 2013]. Under this situation, several terms emerge, such as urban computing [Silva et al., 2018] and urban planning applications [Frias-Martinez et al., 2012], which have become an emerging research area where urban problems in cities are studied using different data sources such as electronic devices (mainly smartphones), location-based social media, and web pages (e.g., analyzed with Robotic Process Automation technologies), among other digital information.
Many approaches include the use of images (e.g. from satellites), sometimes together with some structured data (e.g. points of interest). For example, Liu et al. [Liu et al., 2016] use natural physical features from high spatial resolution images (HSR) and socio-economic semantic features (frequent characteristic words related to an urban land type) from social data to create a dictionary of land use words, including data from multiple sources such as OpenStreetMap road networks, Gaode's Points of Interest and Tencent's real-time user density. On the other hand, Zhan et al. [Zhan et al., 2014] use large-scale Twitter log data, and they propose a preprocessing method in which the raw data only contains coordinates and activity category information. One of the few papers that mentions data preprocessing is Thakur et al. [2018], who used Twitter posts and metadata. From each tweet they use text to analyze whether a space is a restaurant, airport, or stadium. They perform preprocessing of the text of each tweet before using a term frequency-based technique, previously removing emoticons and other non-ASCII characters, as well as hashtags and the "@" character.

Social Networks and Text Mining
The increasing use of social networks has been producing large amounts of data that are being extracted, analyzed, and structured [Stock, 2018]. This allows gathering useful information that can reflect different areas such as the human dynamics in a city, identifying how people live and interact with the environment, health applications, natural hazards management, tourism, environmental monitoring, crimes and disturbances [García-Palomares, 2018].
The ways to extract the information offered by a publication on social networks range from the analysis of its metadata (geo-tags, time and date, username, place name, etc.) and the profile of the user who publishes it, to the application of text mining to the message of the publication [Iglesias et al., 2016]. The message is considered the backbone of many of the investigations related to the inference of location because of its enormous potential [Ajao et al., 2015], although it presents several scientific challenges. Since the texts of social media are mostly generated on mobile devices and have no writing restrictions, they tend to be brief and leave the user a great margin for typographical error. These texts also include links, emoticons, informal language and idioms, spelling and grammatical errors, user mentions and hashtags, HTML tags, acronyms and abbreviations. Therefore, the main challenge is to clean up the large amount of noise in the messages, in addition to dealing with the unstructured format, in contrast with articles, web pages, or blogs, which have more content and make use of conventional grammar and semantic rules.

Text Mining
Text mining is a research field that aims to automatically discover or extract new knowledge from texts written in natural language [Wabula et al., 2017, Tandel et al., 2019]. According to the objective to be achieved, the text mining process has a set of stages that are classified into data collection, pre-processing, transformation, and analysis.

Data collection
The collection is the first step of the text mining process, where unstructured data is captured from different sources such as blogs, reviews, news, publications on social media, among others. The data is stored for future pre-processing and analysis.

Pre-Processing
The objective is to obtain clean and actionable data from the data collected through the application of different debugging and cleaning techniques. There is no specific order to perform the pre-processing task, so this must be determined empirically [Ragini et al., 2018], experimenting with each technique individually, comparing the results, and combining those that perform best. The pre-processing of social media data presents certain challenges due to the use of informal language and the length of the messages (for instance, tweets are very short in length) [Kateb and Kalita, 2015].
The techniques commonly used in text pre-processing are the following [Lansley and Longley, 2016, Tellez et al., 2017, Varma and Ahmad, 2018, Stock, 2018, Tessore et al., 2019, Kulkarni and Shivananda, 2019]:
a) Normalization and noise reduction: this process unifies the text and removes irrelevant and meaningless elements. The literature recommends the deletion of URLs, HTML tags, special characters, emojis, or emoticons.
b) Removal of punctuation marks: as punctuation marks do not add information, eliminating them helps reduce the size of the data and increase the efficiency of the model.
c) Removal of stopwords: these are terms that help build sentences but do not commonly provide meaning, like prepositions or articles.
d) Processing of abbreviations, acronyms, and entities.
e) Spell correction.
f) Tokenization: it splits text into significant words delimited by blank spaces, commas, periods, or any other special character.
g) Lemmatization: it is the search for each word's lemma to unify the terms that give the same information (for example, inflected forms of a verb).
h) Part of Speech Tagging (PoS): grammatical tagging of each word.

Data transformation
Data transformation involves the extraction and selection of characteristics (SC) from text data (strings) to generate a suitable representation for computational learning. In classification, the transformation is crucial, as the SC directly impacts the result of the classifier. The selection depends on the type of document and the classification chosen. The objective is to find characteristics with relevant information that improve the precision of the classifier. Then, the text is represented numerically, using either a binary representation or an SC technique such as Term Frequency (TF), Inverse Document Frequency (IDF), or Term Frequency-Inverse Document Frequency (TF-IDF). From these representations, the global feature space is generated from all the texts used for training. Some techniques to obtain the characteristics of a text are the following:
a) Term Frequency-Inverse Document Frequency (TF-IDF): it seeks a balance between the Term Frequency (the local importance of a word $w_k$ in a text $T$) and the IDF (a global measure of the importance of $w_k$ in the corpus) [Tellez et al., 2017].
b) N-grams: they are sequences of words grouped according to the value given to $N$. For example, the 1-grams (unigrams) of $T$ = "Tomorrow is Thursday" are $W_T^1$ = {Tomorrow, is, Thursday}, and the 2-grams (bigrams) are $W_T^2$ = {Tomorrow is, is Thursday}. In general, a text $T$ with $m$ words yields a set of $m - n + 1$ n-grams.
c) Bag-of-PoS: application of n-grams to the PoS labels of the text to be analyzed.
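The n-gram and TF-IDF definitions above can be illustrated with a minimal sketch. The weighting variant used here (term count divided by document length, multiplied by log(N / df)) is an assumption, since the text does not fix a particular formula; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list (m - n + 1 items for m tokens)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


tokens = "Tomorrow is Thursday".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Toy corpus (hypothetical): "en" appears in every document, so its weight is 0.
corpus = [["almuerzo", "en", "restaurante"], ["clase", "en", "universidad"]]
weights = tf_idf(corpus)
```

Applying the same `ngrams` function to a list of PoS labels instead of words gives a Bag-of-PoS representation.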

Data analysis
There exist different methods to analyze unstructured data, which are classified into information extraction, summary, grouping, and categorization [Maheswari, 2017, Tandel et al., 2019]. The present study uses categorization in order to classify a set of documents into topics or categories, which requires a correct combination of NLP and machine learning techniques.
In the field of machine learning, classification tasks are mainly supervised: the system is trained and then tested with information about the classes before the real classification process [Thangaraj and Sivakami, 2018]. Among the most popular techniques are Logistic Regression, Naive Bayes, Support Vector Machines, Decision Trees, Neural Networks, and k-Nearest Neighbor. The Naive Bayes approach is one of the most employed classifiers for analyzing text documents. It uses the probabilities of words and categories to determine the classes of given documents, and relies on assumptions that make the technique very fast: conditional independence among variables with respect to the target variable and, in its Gaussian variant, normal distribution of the independent variables. In summary, Bayesian classifiers are simple and powerful in terms of the degree of certainty, which makes them a good choice for approaching NLP problems.

Materials and Methods
The area of study is the city of Arequipa, which is the second most populated urban area in Peru, with more than a million inhabitants and a yearly growth rate of 2.3%, according to the last census conducted in 2017. This study covers the historical centre of Arequipa (see Fig. 1), with an approximate area of 3.46 square kilometers and about 2251 properties located within 56 blocks. The categorization made is based on the real estate properties declared in the area.
The Master Plan for Arequipa's Historic Center and Surrounding Area (PlaMCha 2017-2027) describes the geographical characteristics of the historical center, which has three defined zones: old zone, monumental zone, and buffer zone, located at the coordinates S: 16° 23' 53.33" W: 71° 32' 12.67" [Mercado, 2018]. The oldest area, where the main monuments and the Main Square of the city are located, is called Damero (transl. checkerboard) and is constituted by blocks of 111.4 m per side separated by streets of 10.3 m, which gives the characteristic unitary image and covers an area of 1.41 km² (141 ha). The Monumental Zone has 2.12 km² (212 ha), and the Buffer Zone has 3.46 km² (346 ha) and covers the first two zones.
The historical center was selected because it brings together different types of land use such as historical, artistic, cultural, workplaces, residential, commercial, and public spaces. The large influx of tourists and residents in this area makes it possible to collect a large amount of data from social networks, unlike in other parts of the city.

Types of land use
The land use map of the historical center of Arequipa, made by the PlaMCha, defines 14 categories of land use (see Table 1, left). However, in this work, the re-categorization carried out by [Hajna et al., 2014] is considered, which initially takes into account 48 categories of land use and then groups them into 5: residential, commercial, industrial-offices, institutional-government and unbuilt land (see Table 1, right). The residential category groups houses, apartments, and condominiums. The commercial category comprises properties that carry out typical activities of stores, bookstores, shopping centers, restaurants, bars, hotels, lodgings, parking lots, and entertainment venues. The industrial-offices category groups industrial, company and office land uses. The institutional-government category includes buildings related to education (institutes, academies, schools and universities), health (hospitals and clinics), culture (cultural centers and museums), management (administrative, financial or governmental grounds) and religious centers. Finally, the category of unbuilt land is mostly composed of agricultural land (crops), vacant land (buildings that do not exceed 1% of the surface), and the Chili river course, which runs north-southwest through the historical area.

Cartographic map
This study focuses on determining the categories of land use in the historical center of Arequipa, hence it is necessary to define the boundaries and characteristics of the area as a function of polygons. The information is obtained from the documents developed in the PlaMCha, and the cadastral maps of the historical center developed by the Technical Team of the Municipal Planning Institute in the Pilot Project named "Altura para la Cultura" (transl. Height for Culture) [Mercado, 2018]. Cadastral maps were provided as GIS data, representing sets of polygons for blocks and lots in the historical center (see Fig. 1).

Geo-tagged Twitter publications
Geo-tagged tweets allow the categorization of land uses at the places where they were posted. A tweet is a short message of 280 characters maximum composed of text, emojis and attachments, published on the Twitter platform, which is considered one of the largest sources of information fed by millions of users [Information Resources Management Association, 2019]. Twitter Application Programming Interfaces (APIs) allow capturing tweets encoded in JavaScript Object Notation (JSON) format with their associated attributes and values. A tweet can have around 150 associated attributes, although this research only requires the ID, user ID, text, timestamp, geodata, and language of the message.
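As an illustration of keeping only the attributes the study needs, the sketch below parses one JSON-encoded tweet. The field names (`id`, `user.id`, `text`, `created_at`, `coordinates`, `lang`) follow the classic v1.1 tweet object and are an assumption about the payload that was collected; the sample values and the `extract_fields` helper are hypothetical.

```python
import json

# Hypothetical tweet payload, trimmed to the fields of interest.
# The GeoJSON point stores [longitude, latitude], in that order.
raw = """{
  "id": 1,
  "text": "almorzando en un restaurante",
  "created_at": "Wed Oct 10 20:19:24 +0000 2018",
  "lang": "es",
  "user": {"id": 42},
  "coordinates": {"type": "Point", "coordinates": [-71.5369, -16.3981]}
}"""


def extract_fields(tweet):
    """Keep only ID, user ID, text, timestamp, geodata, and language."""
    point = tweet.get("coordinates") or {}          # may be null for non-geo tweets
    lon, lat = point.get("coordinates", (None, None))
    return {
        "id": tweet["id"],
        "user_id": tweet["user"]["id"],
        "text": tweet["text"],
        "timestamp": tweet["created_at"],
        "lon": lon,
        "lat": lat,
        "lang": tweet.get("lang"),
    }


record = extract_fields(json.loads(raw))
```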

Experimentation
The methodology is divided into a sequence of tasks, which begins with the data collection, followed by splitting the data for building the corpus and for validation. The validation data go through a previous classification using PoS patterns before entering the classification algorithm. For clearer interpretation, the tagged data are presented on a map of the study area according to the coordinates of each tweet. Finally, quality metrics are shown to validate the approach. The methodology is graphically illustrated in Fig. 2.

Data Collection
In this phase, the model connects to the Twitter Streaming API using the Tweepy library to capture geo-referenced tweets across South America. The Streaming API has certain limitations, but it allows the download of 100% of tweets that meet the defined filter, as long as they are less than 1% of the global volume of publications at a given time [Campan ...]. A corpus related to land use categorization was not available in Spanish, so it was necessary to build one. The collected data was divided into two sets: the first set consists of 3870 tweets located in the historical center of Arequipa, which will then be used to identify the type of land use. The second set is used for the creation of the corpus and consists of 42318 tweets located anywhere in South America.
These tweets are first processed to remove duplicates, blank messages, single-word messages, and messages containing only numbers.
As a result, we obtained a total of 24995 messages that were semi-automatically categorized into a land use type according to their content and geographical coordinates, leaving a total of 4538 useful tweets in different languages, which were randomly split into training and test sets. The result was distributed into the defined categories and divided into subcategories to avoid data imbalance (see Table 2).
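The initial clean-up of the corpus tweets (duplicates, blanks, single words, numbers only) could look like the following minimal sketch. The duplicate criterion (case-insensitive comparison after collapsing whitespace) is an assumption, and `clean_collection` is a hypothetical helper name.

```python
def clean_collection(texts):
    """Drop blank, single-word, numbers-only, and duplicate messages."""
    seen = set()
    kept = []
    for text in texts:
        words = text.split()
        if len(words) < 2:
            continue                      # blank or single-word message
        if all(w.replace(".", "").replace(",", "").isdigit() for w in words):
            continue                      # message made only of numbers
        key = " ".join(words).lower()     # assumed normalization for duplicates
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


sample = ["", "hola", "123 456", "en el mall", "En  el mall", "cena en casa"]
cleaned = clean_collection(sample)
```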

Pre-Processing
The pre-processing is divided into two phases: the first phase consists of cleaning the text to remove noise and correct data, and the second phase is the data filtering to identify only the tweets which are within a block of the historical center. The tweets to be used as classifier input go through the two pre-processing phases, while the corpus tweets only go through the first phase.
(a) Noise removal, language detection, and Spanish translation. URLs, symbols, and HTML tags are removed [Salas-Zárate et al., 2017, Ragini et al., 2018]. For translation and detection, the Googletrans library for Python is used, which allows processing large amounts of records without a query limit; the translation is of the entire text and not word by word so that the result is a meaningful sentence, as the Google Translate Ajax API does. Hashtags and mentions are excluded from the translation.
(b) Hashtag processing. Hashtags are words that tag a tweet to a topic that is generally related to its content [Asriadie et al., 2018]. In this study, these words are preserved and translated individually if they are in English.
(c) Elimination of punctuation marks and replacement of mentions. All punctuation marks are removed, except for the symbol @, which is replaced by the word "in", since it is useful in the context of location extraction from a text [Thakur et al., 2018].
An example of the result of applying the first three techniques is shown in Table 3.
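A minimal sketch of the cleaning part of these first three steps is given below. Translation is left out because it depends on an external service; the regular expressions, the lowercasing step, and the choice of "en" as the replacement for "@" (the Spanish counterpart of "in") are assumptions for illustration, and `clean_tweet` is a hypothetical helper.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
HTML_RE = re.compile(r"<[^>]+>")
# Every ASCII punctuation mark except "@" (handled separately), plus "¡" and "¿".
PUNCT_RE = re.compile("[" + re.escape(string.punctuation.replace("@", "")) + "¡¿]")


def clean_tweet(text, at_word="en"):
    """Apply noise removal, '@' replacement, and punctuation removal."""
    text = URL_RE.sub(" ", text)                # (a) drop URLs
    text = HTML_RE.sub(" ", text)               # (a) drop HTML tags
    text = text.replace("@", f" {at_word} ")    # (c) '@' becomes a location cue
    text = PUNCT_RE.sub(" ", text)              # (c) drop punctuation; '#' goes, the hashtag word stays
    return " ".join(text.lower().split())       # lowercase and collapse whitespace


example = clean_tweet(
    "Un dia cualquiera en Cevicheria Karloncho Oficia https://t.co/f9kdEEwdMx"
)
```

Hashtag words survive this step because only the "#" symbol is stripped, which matches the decision in step (b) to preserve them.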
(d) Processing of Abbreviations, Acronyms, Slang, and Establishment Names. The most recurrent abbreviations and slang terms within the corpus are identified and stored for the construction of a dictionary. The names of establishments, institutions and acronyms identified from Google Maps are added to this list, which includes bars, pharmacies, hotels, universities, etc. located within the Historical Center of Arequipa. The resulting dictionary is used to identify these words in the text of the publication and replace them with the related word [Tellez et al., 2017]. The result is shown in Table 4.

Table 4:
Original Text | Abbreviation Processing
I'm at Mallplaza Bellavista -@mallplazaperu in Bellavista, Callao https://t.co/brtyxSe8CY | estoy en centro comercial bellavista en centro comercial en bellavista callao
work breakfast! #friends #meeting en Universidad Jorge Tadeo Lozano https://t.co/sNYJhxG6cw | trabajo desayuno amigos reunión en universidad jorge tadeo lozano
Un dia cualquiera en Cevicheria Karloncho Oficia https://t.co/f9kdEEwdMx | un dia cualquiera en restaurante karloncho oficia
He venido a que mami me atiborre de comidaaaaaa (@ Residencial Parque Central in Lima) https://t.co/dCCBbEgvZm | he venido a que mamá me atiborre de comidaaaaaa en residencial parque central en lima

(e) Spell correction. [...] complex morphology, such as Spanish [Salas-Zárate et al., 2017]. The concept of Longest Common Subsequence (LCS) is applied to each of the options proposed by the spell checker to improve the result. Hunspell checks that each token in a publication is a valid word in the language; otherwise, the token is replaced if the LCS value of the word proposed by the spell checker is not less than 71%; if it is less, the word is deleted (see Table 5).
(f) Stopwords. All stopwords are removed, except the words "en" (transl. in) and "de" (transl. from) because these are spatial indicators that help identify that a user is posting the tweet from a location.
(g) Lemmatization and PoS Tagging. All publications are pre-checked and cleaned before lemmatization for better results. We used the Freeling tool, which is a library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, etc.) for a variety of languages [Padró and Stanilovsky, 2012]. The results are shown in Table 6.
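Steps (d) to (f) can be sketched together as follows. This is not the study's implementation: the dictionary entries, stopword list, and vocabulary are hypothetical, the `suggest` callable merely stands in for the spell checker's list of proposals, and the LCS value is assumed to be the LCS length divided by the length of the longer word.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(word, candidate):
    """Assumed normalization: LCS length over the length of the longer word."""
    return lcs_length(word, candidate) / max(len(word), len(candidate))


def correct_token(token, vocabulary, suggest, threshold=0.71):
    """(e) Keep valid words; otherwise use the best proposal or drop the token."""
    if token in vocabulary:
        return token
    candidates = suggest(token)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: lcs_ratio(token, c))
    return best if lcs_ratio(token, best) >= threshold else None


# Hypothetical resources for illustration only.
REPLACEMENTS = {"cevicheria": "restaurante", "mallplaza": "centro comercial"}
STOPWORDS = {"un", "una", "el", "la", "que", "me", "a", "en", "de"} - {"en", "de"}
VOCAB = {"cena", "en", "de", "un", "restaurante", "amigos", "centro", "comercial"}
SUGGESTIONS = {"restaurnte": ["restaurante", "restante"], "xqzt": ["que"]}


def suggest(token):
    return SUGGESTIONS.get(token, [])


def process(tokens, vocabulary, suggest):
    out = []
    for token in tokens:
        replaced = REPLACEMENTS.get(token, token)              # (d) dictionary lookup
        for word in replaced.split():
            word = correct_token(word, vocabulary, suggest)    # (e) spell correction
            if word is not None and word not in STOPWORDS:     # (f) keep "en" and "de"
                out.append(word)
    return out


result = process("cena en un restaurnte de amigos xqzt".split(), VOCAB, suggest)
```

In this toy run the misspelled "restaurnte" is replaced because its best proposal scores 10/11, while "xqzt" is deleted because its only proposal scores 1/4.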
After pre-processing, the tweets belonging to the application data are selected according to their geolocation (South America). Each tweet positioned in the radius of the city of Arequipa goes through two conditions: the first condition selects the tweets that are located within the polygon that represents the historical center, and the second condition selects the tweets that are located in a polygon that represents a block. To perform the first filtering, the shapefile containing the representation of the historical center is imported into the PostgreSQL database and the tweets geo-positioned in Arequipa (ID, latitude, and longitude) are loaded into a tweet_gps table. To select the tweets that are located in the polygon, the coordinates are transformed into geometry type data with the function st_geomfromtext. The SQL query used returned a total of 2343 tweets, distributed in the plane as shown in Fig. 3. For the second filter, the polygons of the blocks are stored in the PostgreSQL database as geometry type data. Tweets that are located in the middle of the street or outside the historic center are removed, leaving 924 tweets (see a zoomed area in Fig. 4).
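The study performs this two-stage spatial filter inside the PostgreSQL database. As a language-neutral illustration of the same idea, the sketch below swaps in a plain ray-casting point-in-polygon test in Python; it treats longitude and latitude as planar coordinates, which is a simplification that is only reasonable for an area as small as a city center. The polygons and tweets are toy data, and the function names are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray starting at the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_tweets(tweets, center, blocks):
    """First keep tweets inside the center polygon, then those inside a block."""
    in_center = [t for t in tweets if point_in_polygon(t["lon"], t["lat"], center)]
    in_block = [
        t for t in in_center
        if any(point_in_polygon(t["lon"], t["lat"], b) for b in blocks)
    ]
    return in_center, in_block


# Toy geometry: one square "center" containing two square "blocks".
center = [(0, 0), (10, 0), (10, 10), (0, 10)]
blocks = [[(1, 1), (4, 1), (4, 4), (1, 4)], [(6, 6), (9, 6), (9, 9), (6, 9)]]
tweets = [
    {"id": 1, "lon": 2, "lat": 2},    # inside a block
    {"id": 2, "lon": 5, "lat": 5},    # inside the center, but on a "street"
    {"id": 3, "lon": 12, "lat": 3},   # outside the center
]
in_center, in_block = filter_tweets(tweets, center, blocks)
```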

Feature Extraction
Using only one feature extraction method does not guarantee the best results, according to [Tessore et al., 2019]. As a consequence, this research explores the use of TF-IDF values associated with a multidimensional vector, the n-grams (unigrams, bigrams, and trigrams), which have been shown to work well in the classification of documents [Anzovino et al., 2018], and Bag-of-PoS, which works similarly to the n-grams but is based on the sequence patterns of the Part-of-Speech labels.
For example, according to [Sakaki et al., 2012], the "Verb-Preposition" pairs are commonly followed by the name of the place where a user is located, which allows it to be used as an indicator to identify whether the user is talking about an establishment or location. Once the data is numerical, the categorization of tweets becomes a standard machine learning classification problem.

Classification
The Multinomial Naïve-Bayes Classifier (MNB) is one of the most popular algorithms in social media data and text categorization [Kateb and Kalita, 2015, Anzovino et al., 2018]. It is based on the evaluation of the probabilities of each class. Each tweet in the corpus has associated vectors with the extracted characteristics and their respective label of the land use type; therefore, the classifier is trained with each one of the characteristics, evaluating them individually, and later different combinations are examined.
The corpus data were filtered during the labeling process, which ensures that every tweet refers to a location. The application data have no such guarantee, so before using the classifier it is necessary to identify which tweets refer to the user being in a location and which refer to other topics. For that purpose, with the tagged and untagged corpus data, we selected the PoS sequences with Bag-of-PoS and identified the i most frequent sequences that represent each data set, as is done in [Koto and Adriani, 2015]. Depending on whether the sequences of each class are present in the text ([it is / it is not] in a location), the tweet is classified by the MNB approach.
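To make the probability evaluation concrete, here is a minimal multinomial Naive Bayes sketch over token counts. The add-one (Laplace) smoothing, the handling of unseen terms, and the toy training tweets are assumptions for illustration; they are not taken from the study's configuration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(term | class) for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:      # terms never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical training tweets, already pre-processed and tokenized.
docs = [
    ["almuerzo", "en", "restaurante"],
    ["compras", "en", "centro", "comercial"],
    ["clase", "en", "universidad"],
    ["misa", "en", "iglesia"],
    ["descanso", "en", "casa"],
]
labels = ["commercial", "commercial", "institutional", "institutional", "residential"]

model = MultinomialNB().fit(docs, labels)
prediction = model.predict(["cena", "en", "restaurante"])
```

The features passed to `fit` and `predict` are plain tokens here; word n-grams or PoS n-grams could be passed in the same way.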
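The selection of the most frequent PoS sequences can be sketched as below. The tag names are generic placeholders rather than the labels produced by the tagger used in the study, and the decision rule (more location-typical than other-typical sequences) is an assumption, since the text only states that the presence of each set's sequences is checked.

```python
from collections import Counter


def pos_ngrams(tags, n=2):
    """Return the PoS n-grams of a tag list as tuples."""
    return [tuple(tags[k:k + n]) for k in range(len(tags) - n + 1)]


def top_sequences(tagged_docs, i, n=2):
    """Return the i most frequent PoS n-grams over a collection of tag lists."""
    counts = Counter(seq for tags in tagged_docs for seq in pos_ngrams(tags, n))
    return {seq for seq, _ in counts.most_common(i)}


def refers_to_location(tags, location_seqs, other_seqs, n=2):
    """Assumed rule: the tweet matches more location sequences than other ones."""
    seqs = pos_ngrams(tags, n)
    hits_location = sum(seq in location_seqs for seq in seqs)
    hits_other = sum(seq in other_seqs for seq in seqs)
    return hits_location > hits_other


# Hypothetical tagged data: tweets that mention a place vs. other topics.
location_docs = [["VERB", "ADP", "NOUN"], ["NOUN", "VERB", "ADP", "NOUN"]]
other_docs = [["PRON", "VERB", "ADJ"], ["ADV", "ADJ", "NOUN"]]

location_seqs = top_sequences(location_docs, i=2)
other_seqs = top_sequences(other_docs, i=4)
```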

Land use categorization
Class imbalance and expected results are some of the considerations taken into account in the selection of metrics to measure the performance of a classifier. There is no one metric that measures well the performance of a classifier in every scenario. In this research, the classifier is of the multi-class type (many possible results that are mutually exclusive) [Sokolova and Lapalme, 2009] , so it is also important to calculate the metrics for each class c 1 ...c n individually, and then calculate the classifier metrics generally, as proposed in [Tellez et al., 2017]: accuracy, precision, recall, and F1-score. Since the classes are unbalanced (some classes have more data, so they are more likely to appear than others) the prioritized indicator is the F1-score [Anzovino et al., 2018]. Results for each model are shown in Table 7.
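The per-class metrics and their aggregation can be computed as in the following sketch. Averaging the per-class F1-scores with equal weight (macro-averaging) is an assumption, since the text does not state which aggregation was used; the labels and predictions are toy data.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall, and F1-score for each class, one-vs-rest."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def summary(y_true, y_pred):
    """Overall accuracy and macro-averaged F1 (assumed aggregation)."""
    report = per_class_metrics(y_true, y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
    return accuracy, macro_f1


# Toy labels: the single residential tweet is misclassified, which barely
# moves accuracy but pulls the macro F1 down sharply.
y_true = ["com", "com", "com", "inst", "inst", "res"]
y_pred = ["com", "com", "inst", "inst", "inst", "com"]

report = per_class_metrics(y_true, y_pred)
accuracy, macro_f1 = summary(y_true, y_pred)
```

The toy example shows why F1 is prioritized under class imbalance: a minority class that is never predicted correctly contributes an F1 of zero, regardless of how many majority-class tweets are right.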
The corpus is used in two variations: one with the text in its lemmatized form and the other with the original text, without lemmatization. The values achieved with these two variations are very close to each other, but the lemmatized text always provides slightly better results (see Table 7, where two rows are shown for each feature, with and without lemma). Furthermore, a comparison of the models shows that the highest accuracy is achieved by using all n-grams (unigram + bigram + trigram) together. Therefore, this model is chosen for the classification of the application data, which yields labeled tweets that can be positioned on a Land Use Map (a sample of the results achieved with real data is shown in Table 8). Out of the 924 tweets used by the classifier, 327 were classified in the commercial category, 248 in the institutional category, 35 in residential, 14 in unbuilt land, and 8 in industrial. The remaining 292 tweets did not reach the cut-off point established for the classifier, so they were left unlabeled (non-classified) on the assumption that their text does not refer to a location.
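The cut-off rule can be illustrated with a short sketch. It assumes that the cut-off is applied to the posterior probability of the most likely class; the text does not specify the quantity or the value used, so both the rule and the `cutoff` value of 0.6 below are hypothetical.

```python
import math


def posterior(log_scores):
    """Turn per-class log scores into probabilities that sum to 1."""
    m = max(log_scores.values())  # subtract the maximum for numerical stability
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


def label_or_reject(log_scores, cutoff):
    """Return the best class, or 'non-classified' if its probability is below cutoff."""
    probs = posterior(log_scores)
    best = max(probs, key=probs.get)
    return best if probs[best] >= cutoff else "non-classified"


# Hypothetical scores: one confident tweet and one ambiguous tweet.
confident = {"commercial": math.log(0.9), "residential": math.log(0.1)}
ambiguous = {"commercial": math.log(0.5), "residential": math.log(0.5)}
print(label_or_reject(confident, cutoff=0.6))
print(label_or_reject(ambiguous, cutoff=0.6))
```

Raising the cut-off trades coverage for confidence: fewer tweets receive a land use label, but those that do are classified with a larger margin.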

Results visualization
The tweets were also presented on the cadastre map of the historic city center. In this way, it is possible to compare the land uses identified by the classifier against the land use category registered in the cadastre by the local municipality.
The map of the historic center is shown in Fig. 5 (on the left) with the classification of land uses according to the city's cadastre. On the map, each category of land use is associated with a color: residential, commercial, industrial-office, institutional-governmental, and unbuilt land are shown in yellow, red, blue, light blue, and green, respectively. The second map in Fig. 5 (on the right) shows the tweets labeled with the categories of land use. According to the information shown on the map, a large number of land uses corresponding to commercial, industrial, unbuilt land, and institutional-governmental were identified. However, only a small number of tweets carry the residential label, although according to the cadastre a significant percentage of land in the historic center is residential. Therefore, the cadastre and the automatic classification of land use do not correspond for the residential category.
To examine the inconsistencies between the two methods more closely, Figs. 6 and 7 show some of the differences in detail. The color scale for land use is the same in both figures, as depicted in the legend. Fig. 6 illustrates areas (light blue circle) labeled as residential in the cadastre but identified as vacant land or institutional. On the other hand, Fig. 7 shows a block (yellow) that contains the Santa Catalina Convent and is cataloged as residential in the cadastre. However, tweets indicate its potential use as unbuilt or institutional-governmental land (religious is a subcategory of the latter). Multiple cases of inconsistency were detected when comparing the cadastre with the classifier outcome, many of which were later verified by a site visit. In general, the results provided by our approach were much more accurate than those registered in the cadastre.

Conclusions
Social networks provide valuable data on urban dynamics, offering new opportunities for research in the field. In this study, we used Twitter data to analyze land use in the historic center of the city of Arequipa by capturing tweets from the area with the text of the publication, time, date, and coordinates. [...] buildings of the institutional-cultural category due to the historical character of the area. However, the methodology detected that many residential spaces registered in the cadastre currently have other activities or uses.
We conclude that Twitter data provides useful information to identify land uses in the geographic area where it is captured. Tweets are gathered in a simple and inexpensive way and provide information that can be used as an additional method by urban planning professionals and organizations interested in that area. The advantage of this model over traditional ones lies in its dynamism: it uses data that is constantly updated by the users and can therefore reveal the inconsistencies that arise in the maps generated by the cadastre due to the constant change of the environment. Also, the methodology can be easily transferred to other geographical areas by using dictionaries specific to the region under study. This knowledge might be used as a recommendation system for short-term supervision or updating of the cadastre.
Finally, this method depends on the amount of geo-located data available at the time of classification, so the capture of publications from other social networks should be considered for future work to maximize the effectiveness and usefulness of the results.