My-Trac: System for Recommendation of Points of Interest on the Basis of Twitter Profiles

Abstract: New mapping and location applications focus on offering improved usability and services based on multi-modal, door-to-door passenger experiences. This helps citizens develop greater confidence in, and adherence to, multi-modal transport services. These applications adapt to the needs of the user during their journey through the data, statistics, and trends extracted from their previous uses of the application. The My-Trac application is dedicated to the research and development of these user-centered services to improve the multi-modal experience using various techniques. Among these techniques are preference extraction systems, which extract user information from social networks, such as Twitter. In this article, we present a system that builds a profile of the preferences of each user on the basis of the tweets published on their Twitter account. The system extracts the tweets from the profile, analyzes them using the proposed algorithms, and returns the result in a document containing the categories and the degree of affinity that the user has with each category. In this way, the My-Trac application includes a recommender system in which the user receives preference-based suggestions about activities or services on the route to be taken.


Introduction
Humans are social beings; we always seek to be in contact with other people and to have as much information as possible about the world around us. The philosopher Aristotle (384-322 B.C.) in his phrase "Man is a social being by nature" states that human beings are born with the social characteristic and develop it throughout their lives, as they need others in order to survive. Socialization is a learning process; the ability to socialize means we are capable of relating with other members of the society with autonomy, self-realization and self-regulation. For example, the incorporation of rules associated with behavior, language, and culture improves our communication skills and the ability to establish relationships within a community.
In the search for improvement, communication, and relationships, human beings seek contact with other people and as much information as possible about their environment in order to achieve the above objectives. The emergence of the Internet has made it possible to define new forms of communication between people. It has also made a large amount of information on any subject available to the average user at any time. This is materialized in the development of social networks. The concept of social networking emerged in the 2000s as a place that allows for interconnection between people, and, very soon, the first social networking platforms appeared on the Internet.
There are many different systems that incorporate both recommendation and navigation. However, no existing system combines event recommendation and pedestrian navigation with (real-time) public transport, employs multi-modal navigation between different public transport modes (bus, train, carpooling, plane, etc.) in different countries, and uses information from the user's social network profile. Instead, current systems utilize a set of information initially entered into the application, which is not updated afterwards. Finally, Tables 1 and 2 present a review of similar works.

Table 1. Review of similar works.

Functionality: Mobile phone application that suggests events and places to the user and guides them via public transport.
Advantages: It combines event recommendation and pedestrian navigation with (real-time) public transport, which no other system does.
Shortcomings: It does not employ multi-modal navigation between different public transport modes (bus, train, carpooling, plane, etc.) in different countries, nor does it use information from the user's social network profile. It utilizes a set of information initially entered into the application, which is not updated afterwards.

Title / Publication: Systems for city-based tourism [2]
Functionality: A personalized travel route recommendation based on the road networks and users' travel preferences.
Advantages: The experimental results show that the proposed methods achieve better travel route recommendations than the shortest-distance-path method.
Shortcomings: It does not use information from public transport services in route recommendations.

Title / Publication: Tourism routes as a tool for the economic development of rural areas-vibrant hope or impossible dream? [3]
Functionality: This paper argues that the clustering of activities and attractions, and the development of rural tourism routes, stimulates co-operation and partnerships between local areas. The paper further discusses the development of rural tourism routes in South Africa and highlights the factors critical to its success.
Advantages: The article analyzes the realization of routes that include activities and attractions in a way that encourages and enhances rural development in Africa.
Shortcomings: A preliminary project that requires public cooperation (institutions, transport, services) for a comprehensive improvement of the proposal.
This article improves on a previous system for the extraction of information regarding Twitter users [4]. The system is capable of obtaining information about a particular user and of elaborating a profile with the user's preferences in a series of pre-established categories. Section 2 reviews the natural language processing techniques applied to Twitter profiles. Section 3 describes the proposal. Section 4 presents the evaluation and results. Section 5 shows how the system is integrated in the My-Trac app. Finally, Section 6 presents the conclusions.

Table 2. Review of similar works.

Title / Publication: Social Recommendations for Events [5]
Functionality: The Outlife recommender assists in finding the ideal event by providing recommendations based on the user's personal preferences.
Advantages: In addition to the user's own preferences, the recommender uses information from the user's group of friends to make event recommendations more satisfactory.
Shortcomings: Although it uses information from the user's groups of friends, no use is made of information from the user's social networks to complement the analysis and recommendation.

Title / Publication: Smart Discovery of Cultural and Natural Tourist Routes [6]
Functionality: This paper presents a system designed to utilize innovative spatial interconnection technologies for sites and events of environmental, cultural, and tourist interest. The system discovers and consolidates semantic information from multiple sources, providing the end user with the ability to organize and implement integrated and enhanced tours.
Advantages: The system adapts the services offered to meet the needs of specific individuals, or groups of users who share similar characteristics, such as visual, acoustic, or motor disabilities. Personalization is dynamic and takes place at the time and place of the service.
Shortcomings: A very comprehensive system that uses external services, scraping, crawling, and geo-positioning, but does not include information from social networks to complement the analysis and recommendation of events.

Title / Publication: Enhancing cultural recommendations through social and linked open data [7]
Functionality: A hybrid recommender system (RS) in the artistic and cultural heritage area that takes into account the activities on social media performed by the target user and her friends. The system integrates collaborative filtering and community-based algorithms with semantic technologies to exploit linked open data sources in the recommendation process.
Advantages: The proposed recommender provides the active user with personalized and context-aware itineraries among cultural points of interest.
Shortcomings: The main drawback is the absence of extensive control over the semantics, which are not taken into account. This generates difficulties in justifying, explaining, and hence analyzing the resulting scores.

Title / Publication: Personalized Tourist Route Generation [8]
Functionality: An intelligent routing system able to generate and customize personalized tourist routes in real time, taking public transportation into account.
Advantages: The tourist planning problem, integrating public transportation, is modeled as the Time Dependent Team Orienteering Problem with Time Windows (TDTOPTW), and a heuristic is designed that solves it in real time by precalculating the average travel times between each pair of POIs in a preprocessing step.
Shortcomings: Future work consists of extending the system to more cities with different public transport network topologies and of integrating an advanced recommendation system in a wholly functional PET. The system does not use social network capabilities that would allow tourists to store, share, and add travel experiences at the destination.

Natural Language Processing Techniques Applied to Twitter Profiles
In this section, we review the main techniques applied in the analysis that make it possible to learn the users' preferences through their tweets. This allows for recommendations to be made according to the user profile.

Word Embedding Techniques
NLP techniques allow computers to analyze human language, interpret it, and derive its meaning so that it can be used in practical ways. These techniques allow for tasks such as automatic text summarization, language translation, relation extraction, sentiment analysis, speech recognition, and item classification to be carried out. Currently, NLP is considered to be one of the great challenges of artificial intelligence and is one of the fields with the highest development activity, since it involves tasks of great complexity: how can a machine really understand the meaning of a text, or grasp neologisms, irony, jokes, or poetry? It is a challenge to apply techniques and algorithms that obtain the expected results.
One of the most commonly used NLP techniques is Topic Modeling. This technique is a type of statistical modeling that is used to discover the abstract "topics" that appear in a series of input texts. Topic modeling is a very useful text mining tool for discovering hidden semantic structures in texts. Generally, the text of a document deals with a particular topic, and the words related to that topic are likely to appear more frequently in the document than those that are unrelated to the text. Topic Modeling collects the set of more frequent words in a mathematical framework, which allows one to examine a set of text documents and discover, on the basis of the statistics of the words in each one, what the topics may be and what the balance is between the topics in each document.
The input of topic modeling is a document-term matrix, in which the order of words does not matter. Each row represents a document, each column represents a term (or word), and each cell contains the number of times that term appears in that document: "0" if the document does not contain the term, "1" if it contains it once, "2" if it contains it twice, and so on.
Algorithms such as Bag-of-words or TF-IDF, among others, make it possible to represent the words used by the models and to create the matrix defined above, with a token in each column and, in each row, the number of times that token appears in each sentence.
• Bag-of-words. This model makes it possible to extract features from texts (and also from images, audio, etc.); it is, therefore, a feature extraction model. The model consists of two parts: a representation of all the words in the text and a vector representing the number of occurrences of each word throughout the text, hence the name Bag-of-words. The model completely ignores the structure of the text; it simply counts the number of times each word appears in it. It has been implemented through the Gensim library [9].
• Term Frequency-Inverse Document Frequency (TF-IDF). This is the product of two measures that indicate, numerically, the degree of relevance that a word has in a document within a collection of documents [10]. It is broken down into two parts:
- Term frequency: This measures the frequency with which a term appears in a document. There are several measurement options, the simplest being the raw frequency, i.e., the number of times a term t appears in a document d. However, in order to avoid a bias towards long documents, the normalized frequency is used. As shown in Equation (1), the frequency of the term is divided by the maximum frequency of the terms in the document:

tf(t, d) = f(t, d) / max{f(t', d) : t' in d}. (1)

- Inverse document frequency: If a term appears very frequently in all of the analyzed documents, its weight is reduced; if it appears infrequently, it is increased. As shown in Equation (2), the total number of documents is divided by the number of documents containing the term:

idf(t, D) = log(|D| / |{d in D : t in d}|). (2)

- Term frequency-inverse document frequency: The entire formula is the product of the two, as shown in Equation (3):

tfidf(t, d, D) = tf(t, d) · idf(t, D). (3)
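The two measures and their product can be sketched directly in code. The toy corpus below is invented for illustration, and terms are assumed to appear in at least one document, so no zero-division guard is included:

```python
import math
from collections import Counter

def tf(term, doc):
    """Normalized term frequency: raw count divided by the max count in doc."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, corpus):
    """Log of (total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Toy corpus of pre-tokenized "documents" (invented for illustration).
corpus = [
    ["bus", "station", "bus"],
    ["train", "station"],
    ["plane", "airport"],
]
# "bus" is frequent in doc 0 but rare in the corpus, so it scores high there;
# "station" appears in two of the three documents, so its weight is lower.
print(tf_idf("bus", corpus[0], corpus))      # tf = 1.0, idf = log(3/1)
print(tf_idf("station", corpus[0], corpus))  # tf = 0.5, idf = log(3/2)
```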
Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning [11]. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers [12].

• Word2vec. This technique uses huge amounts of text as input and is able to identify which words appear to be similar across various contexts [13][14][15]. Once trained on a sufficiently big dataset, it generates 300-dimensional vectors for each word, forming a new vocabulary where "similar" words are placed close to each other. Pre-trained vectors are used, providing a wealth of information from which to understand the semantic meaning of the texts.
• Doc2vec. This technique is an extension of Word2vec that is applied to a document as a whole instead of to individual words; it uses an unsupervised learning approach to better understand documents as a whole [16]. The Doc2vec model, as opposed to the Word2vec model [17], is used to create a vectorized representation of a group of words taken collectively as a single unit, rather than a simple average of the word vectors in the sentence.
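The notion that "similar" words end up close to each other in the vector space can be illustrated with cosine similarity. The 3-dimensional vectors below are invented for illustration; real Word2vec embeddings are 300-dimensional and learned from large corpora:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings, invented for illustration only.
vectors = {
    "bus":   np.array([0.9, 0.1, 0.0]),
    "train": np.array([0.8, 0.2, 0.1]),
    "poem":  np.array([0.0, 0.1, 0.9]),
}

# Words used in similar contexts get vectors pointing in similar directions.
print(cosine_similarity(vectors["bus"], vectors["train"]))  # close to 1
print(cosine_similarity(vectors["bus"], vectors["poem"]))   # close to 0
```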

Topic Modeling
As already presented in the previous section, topic modeling is a tool that takes an individual text (or corpus) as input and looks for patterns in word usage; it is an attempt to find semantic meaning in the vocabulary of that text (or corpus).
This set of tools enables the extraction of topics from texts; a topic is a list of words that is presented in a way that is statistically significant. Topic modeling programs do not know anything about the meaning of the words in a text. Instead, they assume that each text fragment is composed (by an author) through the selection of words from possible word baskets, where each basket corresponds to a topic. If that is true, then it is possible to mathematically decompose a text into the baskets from which the words that compose it are most likely to come. The tool repeats the process over and over again until the most probable distribution of words within the baskets, the so-called topics, is established.
The techniques executed by the proposed system are used to discover the word usage patterns of each user on Twitter, and they make it possible to group users into different categories. To this end, a thorough review of the main tools for topic modeling has been carried out. Most of the algorithms are based on the unsupervised learning paradigm. These algorithms return a set of topics, as many as indicated during training, where each topic represents a cluster of terms that should be related to one of the categories. Precisely for this reason, a large number of tweets have been retrieved as training data, searching for keywords associated with each category. As part of this research, a total of three algorithms have been evaluated: LDA, LSI, and NMF. NMF obtained the best results, although the techniques applied in other works have been reviewed in order to contrast their results with this method.
Apart from the comparison itself, numerous studies have made similar comparisons between these techniques, so the decision is supported by related work. In 2019, Tunazzina Islam carried out an experiment similar to the one proposed in this paper [18]. In that work, Apache Kafka is employed to handle the big streaming data from Twitter. Tweets on yoga and veganism are extracted and processed in parallel with data mining by integrating Apache Kafka and Spark Streaming. Topic modeling is then used to obtain the semantic structure of the unstructured data (i.e., tweets). The three algorithms LSA, NMF, and LDA are then compared, with NMF being the best-performing model.
Another noteworthy work is that of Chen et al. [19], in which an experiment is carried out to detect topics in small text fragments. This is similar to the proposal made in this paper, since tweets can be considered small texts. In that work, a comparison is made between the LDA and NMF methods, the latter providing the best results.
• Latent Dirichlet Allocation (LDA). LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar [20][21][22]. For example, if observations are collections of words in documents, each document is a mixture of a small number of topics, and each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox.
• Nonnegative Matrix Factorization (NMF). NMF is an unsupervised learning algorithm belonging to the field of linear algebra. It reduces the dimensionality of an input matrix by factoring it into two matrices whose product approximates the original with a smaller rank: V ≈ WH. Let us suppose, observing Equation (4), a vectorization of P documents with an associated dictionary of N terms; that is, each document is represented as a vector of N dimensions. All the documents, therefore, correspond to a matrix V, where N is the number of rows in the matrix, each of them representing a term, while P is the number of columns, each of them representing a document.
Equations (5) and (6) show matrices W and H. The value r marks the number of topics to be extracted from the texts. Matrix W contains the characteristic vectors that make up these topics. The number of characteristics (dimensionality) of these vectors is identical to that of the data in the input matrix V. Since only a few topic vectors are used to represent many data vectors, it is ensured that these topic vectors discover latent structures in the text.
The H-matrix indicates how to reconstruct an approximation of the V-matrix by means of a linear combination with the W-columns.
In Equation (5), W has dimensions N × r: N is the number of rows in matrix W, each of them representing a term, and r is the number of columns, r being the number of characteristics (topics) to be extracted.
In Equation (6), H has dimensions r × P: r is the number of rows in matrix H, again the number of characteristics to be extracted, and P is the number of columns, with one column for each document. The result of the matrix product between W and H is, therefore, a matrix of dimensions N × P corresponding to a compressed version of V.
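A minimal numerical sketch of these shapes, using random nonnegative matrices in place of a learned factorization, is:

```python
import numpy as np

# Dimensional sketch of V ~ W H as described above (toy sizes).
N, P, r = 6, 4, 2   # N terms, P documents, r topics

rng = np.random.default_rng(0)
V = rng.random((N, P))   # term-document matrix: one column per document

# A real NMF would learn W (N x r) and H (r x P) by minimizing the
# reconstruction error; random nonnegative matrices suffice to show shapes.
W = rng.random((N, r))   # topic vectors expressed in term space
H = rng.random((r, P))   # per-document topic weights

approx = W @ H           # reconstruction, same shape as V
print(approx.shape)      # (6, 4) == (N, P)
```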
The use of Machine Learning techniques for the analysis of information extracted from Twitter is a very common case study today. It is convenient to study what kind of research is being carried out on this subject. One of the main applications is the use of Twitter and Natural Language Processing techniques in order to extract a user's opinion about what is being tweeted at a given time. The article "A system for real-time Twitter sentiment analysis of 2012 U.S. presidential election cycle", written by Hao Wang et al. [23], presents a system for real-time polarity analysis of tweets related to candidates for the 2012 U.S. elections.
The system collects tweets in real time, tokenizes and cleans them, identifies which user is being talked about in each tweet, and analyzes the polarity. For training, it applies Naïve Bayes, a statistical classifier, using hand-categorized tweets as input. Another similar study is the one proposed by J. M. Cotelo et al. from the University of Seville, "Tweet Categorization by combining content and structural knowledge" [24]. It proposes a method to extract users' opinions about the two main Spanish parties in the 2013 elections, using two processing pipelines: one based on the structural analysis of the tweets and the other based on the analysis of their content.
Another possible line of research is based on categorizing Twitter content. This is the case of the article "Twitter Trending Topic Classification" written by Kathy Lee et al. [25], which studies how to classify trending topics (highlighted hashtags) into 18 different categories. To this end, topic modeling techniques were used. The key point lies in providing a solution based on the analysis of the network underlying the hashtags and not only the text: "our main contribution lies in the use of the social network structure instead of using only textual information, which can often be noisy considering the social network context".
As can be seen, many current studies are oriented to the analysis of Twitter using machine learning tools. The challenge faced in this work is to find the optimal way of classifying users according to their tweets. The sections that follow describe the objectives of the project and detail the research and testing that led to the construction of a stable system fit for the purpose for which it has been designed.

Proposal
This section proposes a system for the extraction of information about Twitter users. The system is capable of obtaining information about a particular user and of elaborating a profile with the user's preferences in a series of pre-established categories. From an abstract point of view, the proposal could be seen as a processing pipeline, as shown in Figure 1. The different phases of this pipeline contribute to the achievement of the main objective: user classification.

Category Definition
Matching a given profile to a specific category or topic is one of the objectives of NLP algorithms. As a starting point, it is necessary to prepare the training dataset that is used when investigating the algorithmic model. The strategy followed is based on the model of the Interactive Advertising Bureau (IAB) association [26]. Today, IAB is a benchmark standard for the classification of digital content. In particular, the IAB Tech Lab has developed and released a content taxonomy on which the present categorization is based. This taxonomy proposes a total of 23 categories, with corresponding subcategories, covering the main topics of interest. In this way, 8000 tweets from each of these categories have been ingested. As a result, 23 datasets with examples of tweets related to each category were obtained; these datasets have been used to train the system at a later stage. The full list of topics is shown in Table 3.

Twitter Data Extraction
The Twitter data extraction mechanism is a fundamental element of the system. The goal of this mechanism is to recover two types of data.
On the one hand, the system extracts a set of anonymous tweets related to each of the defined preference categories; these tweets are used to train the data classification algorithms.
On the other hand, the mechanism extracts information about the given user for the analysis of their preferences.
Twitter's API enables developers to perform all kinds of operations on the social network, so it is necessary for our system to use this powerful API. One option would be to elaborate a module that makes HTTP requests directly against the endpoints of interest; however, this involves a remarkably high development cost.
Another option is to make use of one of the multiple Python libraries that encapsulate this logic and offer a simple interface to developers. The latter option has been chosen for the development of this system; more specifically, the Tweepy library [27].
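A sketch of this extraction step is shown below. The credentials and screen name are placeholders, and the Tweepy v4 calls (OAuth1UserHandler, API.user_timeline) are assumed to be available; the cleaning helper simply drops retweets and empty texts:

```python
def clean_timeline(statuses):
    """Keep only original, non-empty tweet texts (drop retweets)."""
    texts = []
    for s in statuses:
        text = s.get("full_text") or s.get("text") or ""
        if text and not text.startswith("RT @"):
            texts.append(text)
    return texts

def fetch_user_tweets(screen_name, count=200):
    """Fetch a user's recent tweets via Tweepy (requires real credentials;
    the strings below are placeholders, not working keys)."""
    import tweepy  # assumed Tweepy >= 4.0
    auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET",
                                    "ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)
    statuses = api.user_timeline(screen_name=screen_name, count=count,
                                 tweet_mode="extended")
    return clean_timeline([s._json for s in statuses])
```

The same cleaning helper can be reused for the anonymous training tweets and for the profile of the user under analysis.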

Preprocessing of Tweets
Once the data has been extracted, it must be prepared for the classification algorithms. Cleaning and preprocessing techniques must be applied, so that the text is prepared for topic modeling algorithms. Libraries, such as NLTK and Spacy, have been used, as can be observed in Listing 1.
The first step involves cleaning tweets, by removing content that does not provide information for language processing. More specifically, this task consists in eliminating URLs, hashtags, mentions, punctuation marks, etc.
Another of the techniques applied to obtain more information from tweets is the transformation of the emojis contained in the text into a format from which it is possible to extract information. To do this, a dictionary of emojis is used as a starting point for the conversion of the data. This dictionary contains a series of values that interpret each of the existing emojis when applying the corresponding analysis. In this way, it has been possible to identify and give a certain value to each emoji for its treatment.
The key activities performed during preprocessing are stopword removal and tokenization. Whether it is a paragraph, an entire document, or a simple tweet, every text contains a set of empty words or stopwords. This set of words is characterized by its continuous repetition in the document and its low value within the analysis; these words are mainly articles, determiners, pronouns, conjunctions, and others. Table 4 shows the results obtained after the tweets have gone through the preprocessing and preparation process carried out using the tools listed above.
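A minimal sketch of this preprocessing step is shown below. To keep the example self-contained, a small inline stopword list and a two-entry emoji dictionary stand in for the NLTK/Spacy resources and the full emoji dictionary used by the system:

```python
import re

# Stand-ins for the real resources (inline stopword list and a hypothetical
# two-entry emoji dictionary, for illustration only).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "i", "my"}
EMOJI_MAP = {"🚆": "train", "😍": "love"}

def preprocess(tweet):
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)         # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)              # remove mentions/hashtags
    for emoji, meaning in EMOJI_MAP.items():          # map emojis to words
        text = text.replace(emoji, f" {meaning} ")
    tokens = re.findall(r"[a-z]+", text)              # tokenize, drop punctuation
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

print(preprocess("I love the 🚆 to Madrid! #travel @renfe https://t.co/x"))
# -> ['love', 'train', 'madrid']
```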

Vectorization
Vectorization is the application of models that convert texts into numerical vectors so that the algorithms can work with the data. Two algorithms have been considered for this task, Bag-of-words and TF-IDF. Both are widely used in the field of NLP but, in general, creating TF-IDF weights from text works well and is not computationally expensive. Moreover, NMF expects a term-document matrix as input, typically TF-IDF normalized.
The vectorizer has been tuned manually with some parameters according to the dataset, as can be observed in Listing 2. min_df was set to 100 to ignore words that appear in fewer than 100 tweets. In the same way, max_df was set to 0.85 to ignore words that appear in more than 85% of the tweets. Thanks to these features, it is possible to remove words that introduce noise into the model. Finally, since the algorithm only takes single words into account by default, the parameter ngram_range was set to (1, 2) in order to include bigrams.

Topic Modeling
Topic Modeling is a typical NLP task that aims to discover abstract topics in texts. It is widely used to discover hidden semantic structures. In the present work, this technique has been used to discover the main topics of interest of the My-Trac application users based on their Twitter profiles, which should correspond to some of the previously defined categories.
Regarding the features of the model, in Section 3.4, the training tweets were vectorized to create a term-document matrix, which has been the input of the NMF model. In addition, NMF needs one important parameter, the number of topics to be discovered, "n_components". In this case, n_components was set manually to 23, which is the number of topics initially defined in the category taxonomy. Following this approach, the algorithm is trained with 184,000 tweets (8000 per category) with the aim of obtaining as many topics as categories were defined in the taxonomy. Once the model has been trained, it has been possible to determine which topics a user's profile fits into on the basis of their tweets. The topic modeling algorithm has been implemented with NMF using the SKLearn library, as detailed in Listing 3. Finally, it is worth mentioning some extra parameters that were set in the implementation of the model. The method used to initialize the procedure was set to "nndsvda", which works better with the tweet dataset since this kind of data is not sparse. alpha and l1_ratio are parameters that help define the regularization.

Evaluation and Results
In order to evaluate the results of the algorithm, the most relevant terms have been identified for each resulting topic. Then, by reviewing the main terms for each topic, it is possible to determine whether those words really represent the content of the topic. An example is shown in Figure 2, where the most relevant terms have been identified for four different topics (Topic 1: Travel; Topic 2: Arts & Entertainment; Topic 9: Personal Finance; Topic 10: Pets), showing how well the algorithm identifies the terms associated with each one. As can be seen, all of them are unambiguously related to their defined categories. The full list of topics and their top 10 related keywords identified by the algorithm can be seen in Table 5.
It should be noted that some of the categories previously defined in Table 3 have been removed during the evaluation of this model. This is due to the lack of tweets that would fit into those categories, as well as to the considerable overlap between some topics. The initially defined categories removed during the training and evaluation process are "Home & Garden", "Real Estate", "Society", and "News". In the same way, the algorithm has been able to discover new categories related to the original ones, such as "Movies", "Videogames", "Music", "Events", and "Medicine & Health", leaving a total of 23 categories in the system.
Once the resulting model has been evaluated and verified, the next step is to check the effectiveness of the model with real Twitter profiles. The tests have been performed by extracting 1200 tweets from different users and predicting the most related topics for each user based on their tweets. The final test results are shown in Table 6, where it can be observed how each profile matches the topics related to it.
As an example, the main topics for the profile "Tesla" are "Automotive", "Technology and computing", and "Travel".
Finally, in order to suggest the main topics of a specific user in the My-Trac app, the model returns, for each user, the associated categories along with the percentage of weight that each category has for that user. The lower the percentage, the weaker the relation between the user and the category. The results of the final classification using some well-known Twitter accounts are given in Table 7. It should be noted that only the three main categories are shown in the table (together with their associated percentages), as they are the most accurate for categorizing the user.
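The final per-user output can be sketched as a simple normalization step. The category names and weights below are invented for illustration; in the real system they come from the trained NMF model:

```python
def top_categories(weights, n=3):
    """Normalize raw category weights to percentages and keep the top n."""
    total = sum(weights.values())
    percents = {cat: 100 * w / total for cat, w in weights.items()}
    ranked = sorted(percents.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

# Hypothetical raw weights for one user (illustrative values only).
user_weights = {"Automotive": 0.42, "Technology & Computing": 0.31,
                "Travel": 0.17, "Pets": 0.10}
print(top_categories(user_weights))
```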

Final System Integration in My-Trac Application
After the entire research and evaluation process, a trained algorithm capable of classifying different Twitter accounts according to the defined and discovered categories has been obtained, together with a reliable data extraction method. The next step, therefore, consists of applying the algorithm in the My-Trac app to create a system that provides recommendations to My-Trac users based on their Twitter profiles, which is the objective of the present work.
The final system for the My-Trac app consists of a mobile app where the user logs in, as can be seen in Figure 3, and is asked to grant access to their Twitter data. Once the user signs in to the application, My-Trac searches for the optimal means of transport to reach a specific destination given by the user and suggests the best conveyance for the trip, as Figures 4 and 5 show.
Finally, when the user chooses the route and means of transport that best fits their trip, the My-Trac app, based on the present work, recommends activities and points of interest for the user along the way based on their Twitter information, as can be observed in Figures 6 and 7. Moreover, it is possible to obtain detailed information for each recommended activity, as Figure 8 shows. In this way, thanks to the My-Trac app, users can improve their experience not only by receiving suggestions for the best conveyance for the trip but also by receiving customized recommendations of activities and points of interest.

Conclusions and Future Work
This article presents a novel approach to extracting preferences from a Twitter profile, for use in mapping applications, by analyzing the tweets published by the user. With this approach, a consistent and representative list of categories has been successfully defined, and the mechanisms needed for information extraction have been developed, both for model training and for end-user analysis. It is a unique system, with which it has been possible to develop an important feature of the My-Trac app, whereby relevant points of interest can be recommended to end users.
Regarding future work on this system, many areas of improvement and development have been identified. Tweets are not the only source of information from which the interests of a profile can be discerned. It may be the case that a user only writes about football but follows many news-related and political accounts; the current system would only be able to extract the sports category. Therefore, one of the improvements would be the implementation of a model that analyzes followed users. This has already been started by extracting the followers and creating word clouds with the most relevant ones. Similarly, hashtags also provide additional information suitable for analysis. Another line of research is the training of a model that analyzes tweets individually. This would open the door to performing a polarity analysis that would make it possible to know whether a user who writes about a certain category does so in a positive, negative, or neutral way.
As for the limitations of the system, it is possible that, in some regions, there may be restrictive regulations on the use of information published on social networks for this type of analysis. Therefore, the user should carry out a study of data protection and the legal framework adapted to each region where the service is to be provided. Furthermore, in terms of performance, it is possible that specific context-dependent systems training an algorithm for each individual user may perform slightly better than the proposed solution.