1. Introduction
With the rapid expansion in smartphone use and advances in positioning technology, a large volume of data from Location-Based Social Networks (LBSNs), such as Foursquare, Flickr, and Facebook Places, has been produced [1,2,3,4,5]. These LBSN data contain an enormous number of check-in records that include location (latitude/longitude), time (timestamp), and comments. From these check-in records, we can discover information that can be used to identify Points of Interest (POIs). These POIs and the meaningful information hidden in the check-in data are not only helpful in understanding the social behaviors of users but also benefit urban planning and location recommendation [2,4,6,7]. Therefore, automatically discovering POIs and mapping them into categories (hereafter referred to as automatic POI annotation) has become a hot topic in research [7,8,9,10,11].
Current methods of automatic POI annotation can be classified into two main types according to the information patterns used: single pattern-based methods and hybrid pattern-based methods. The basic idea of single pattern-based methods is that they use only one information pattern during POI annotation. They are further divided into three kinds: temporal pattern-based methods, spatial pattern-based methods, and text pattern-based methods. Hybrid pattern-based methods combine multiple patterns, as in spatiotemporal pattern-based methods, to perform POI annotation [9,10,11].
The temporal pattern-based methods first discover regular temporal patterns from the LBSN data and then use the information inferred from these patterns to discover and annotate POIs. A variety of temporal patterns can be mined from LBSN data. Commonly used ones include the frequency of access to particular locations, the duration of visits, and the periodicity of visits. Many researchers have found that these temporal patterns can effectively distinguish POIs of different categories and have therefore exploited temporal regularity to annotate POIs [8,12,13]. The apparent periodicity of POIs, such as day versus night and weekday versus weekend, has been directly explored as a temporal pattern [14]. To discover implicit temporal patterns in users’ visitation trends, Shannon entropy and other mathematical concepts have been adopted [7]. Although methods based on temporal patterns can achieve high accuracy in annotating POIs, they only work well when the POIs’ temporal patterns are clear and easily discoverable. However, not all POIs hold clear temporal patterns; for example, sports games or concerts are temporary rather than regular events. Furthermore, it is difficult to discover useful temporal patterns from such temporary events without knowledge of the particular locale and its functions.
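As a minimal illustration of the entropy-based idea mentioned above, the following sketch computes the Shannon entropy of a location's hourly check-in distribution. The function name `hourly_entropy` and the use of hour-of-day bins are assumptions made for illustration only; they are not the formulation used in [7].

```python
import math
from collections import Counter

def hourly_entropy(hours):
    """Shannon entropy (in bits) of check-in hours (ints 0-23) for one location."""
    counts = Counter(hours)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A venue visited only at 20:00 has a sharply peaked pattern (low entropy);
# a venue visited evenly at four different hours has a flatter one.
peaked = hourly_entropy([20, 20, 20, 20])
spread = hourly_entropy([8, 12, 16, 20])
```

Under this reading, low entropy suggests a clear temporal pattern, while high entropy suggests visits spread across the day.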
To address these shortcomings of temporal pattern-based methods, researchers have tried to mine spatial patterns from the raw LBSN data to complete POI annotation. Spatial patterns can be discovered by clustering the check-in records, since check-in records with similar geographical characteristics can be allocated to similar groups with similar POIs [15]. People who live close to each other may have similar living habits [6]. Thus, the demographic spatial distribution can also be used as auxiliary information for mining spatial patterns [16]. After similar groups are obtained according to spatial patterns, the known POI categories in a group can be used to annotate the unknown POIs in the same spatial group. However, the accuracy of POI annotation based on spatial patterns is affected by the accuracy of the Global Positioning System (GPS) used, as deviations from true locations can be significant due to noise in mobile devices.
The text pattern-based methods make use of the text patterns in the comments to annotate POIs. For example, patrons of a tourist site might write to friends or family members about the things they saw at the site. This information reveals the nature of the location and thus can be used to annotate POIs. The key issue in text pattern-based methods is how to extract category information from comments. Current research provides two approaches: syntactic-based [17] and hot word-based [18]. The syntactic-based approach can find more specific geographic words through syntactic dependencies among words [17], but it is time-consuming, and its results are redundant for POI annotation. The hot word-based approach measures the relevance between comment words and check-in locations through Kernel Density Estimation (KDE) and then chooses the highest-ranked words to annotate related locations [18]. Compared with the syntactic-based approach, the hot word-based approach is more efficient, and many mature word extraction tools are available, but this method [18] directly uses hot words to annotate the location without specifying its category. The text-based methods focus only on the text perspective of LBSN data, and sometimes the texts may not be relevant to the locations the users visit. For example, a person might be writing to a friend about something that has nothing to do with his/her current location. In addition, text pattern-based methods omit important information hidden in temporal or spatial patterns.
The above three single pattern-based methods (temporal, spatial, and text) can accomplish POI annotation under certain conditions. However, their efficiency and accuracy in POI annotation are not satisfactory. This may be because each focuses on only one kind of information, that is, each uses one of the three patterns in isolation. Recently, some researchers proposed a hybrid method that considers both temporal and spatial patterns. They believe that combining temporal and spatial patterns (i.e., spatiotemporal patterns) can better find POIs and annotate them with specific categories, since a spatiotemporal pattern can discover POIs based on temporal and spatial thresholds, where a user stays at a POI long enough to meet these thresholds [9]. Besides, spatiotemporal patterns can also be mined from other perspectives, such as demographic spatial distributions and moving speed [16,19]. Information from these aspects helps to identify the POI category with higher accuracy. For example, with the help of information on movement speed, we can identify whether a POI is a transportation station. That is, if the moving speed of many users changes greatly at a POI, these users are likely traveling by some means of transport at this POI, and it can be inferred with high probability that this POI is a transportation station [19]. Based on the spatiotemporal pattern, we can first discover useful POIs and then annotate them with semantic information instead of using the whole raw data.
The method of POI annotation based on spatiotemporal patterns improves on the methods based on temporal or spatial patterns alone: it slightly enhances the accuracy and decreases the POI annotation time. However, limitations still exist. For example, it omits the rich information hidden in the text comments related to POIs. Considering that methods based on text patterns can mine POI context information with higher accuracy, we put forward a novel hybrid pattern-based method, which utilizes temporal, spatial, and text patterns in successive steps. The proposed method makes two main contributions. First, much of the redundant raw data and noisy data can be filtered out by making use of temporal periodicity and spatial aggregation, which reduces the computation time of processing the data. Second, our method can further improve the accuracy of POI annotation, since context information is mined from the text comments related to POIs. In summary, the proposed method is able to annotate the category of an unknown POI with less labeling time and higher accuracy.
The remainder of this paper is organized as follows. Section 2 explains our methodology, introducing the check-in dataset, essential pre-processing, and key aspects of the hybrid-based approach. Section 3 describes the experimental design, performance metrics, and experimental results. Section 4 discusses the merits and drawbacks of our method relative to the baseline method (named Falcone’s method) [7]. Finally, the conclusion and future work are presented in Section 5.
2. Methodology
2.1. Basic Idea
Multiple patterns mined from different perspectives of one object can be used together to understand the object better in data mining [20]. For POI annotation, we likewise believe that integrating multiple patterns can annotate POIs more accurately than using only one or two of them. This idea led to the development of our new method, which comprehensively integrates temporal, spatial, and text patterns to discover and annotate POIs.
The process of the proposed method can be divided into four parts: preprocessing data, discovering POIs, determining categories, and associating POIs with categories. To implement the above idea, we first preprocessed the check-in data based on temporal patterns. Second, we discovered POIs through spatial patterns by clustering these check-in data. Third, we built a keyword dictionary for discovering the categories of the POIs to be annotated by mining the text patterns. Finally, we associated a category with each POI.
2.2. Study Region and Flickr Data
In actual use, LBSN data are obtained through users’ check-in behavior. Check-in data contain a user’s profile, temporal and spatial information, as well as text comments. Making use of this information helps to discover the context information of check-in locations [21].
The check-in data used in this paper were collected from Flickr, an online photo management service that allows users to upload and share photos publicly or within groups [2]. The metadata in Flickr contain several properties: the record and user IDs, the upload date and time, the location where a photo was taken, a description, and tags. The properties of a record are shown in Table 1, where Flickr_id is the identifier of each record on the Flickr website. User_id is the user’s identifier on the Flickr website and represents each registered user. Create_date and Create_time represent the date and time a record was uploaded, respectively. Longitude and Latitude correspond to the geographical location, which is entered automatically by the GPS in the camera or manually by a user who has identified the photo’s location on the map. The last three fields (Title, Content, and User_tags) form the user’s comment text. We can infer from Table 1 that the example may belong to the “College & University” category, since its “Content” includes the word “University” and its coordinates (“Longitude” and “Latitude”) are located at New York University.
Flickr has released a publicly available dataset of 49 million geotagged photos [22]. We extracted the metadata of a subset of these records, created by users worldwide. This subset contained 1,605,291 records, distributed worldwide as shown in Figure 1a. We also used KDE to analyze the popularity of Flickr check-in records, as shown in Figure 1b. Popularity was noticeably high in several regions, such as North America and Europe, which are typically developed regions. The popularity of check-ins in the northern hemisphere is substantially higher than that in the southern hemisphere.
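To make the KDE step concrete, the sketch below evaluates a two-dimensional Gaussian kernel density estimate at a query location. It is a minimal illustration, not the procedure used to produce Figure 1b: the function name `gaussian_kde_2d`, the isotropic kernel, the bandwidth value, and the treatment of longitude/latitude degrees as planar coordinates are all assumptions.

```python
import math

def gaussian_kde_2d(points, query, bandwidth):
    """Density estimate at `query` from 2-D `points` using an isotropic Gaussian kernel."""
    if not points:
        return 0.0
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for x, y in points:
        d2 = (x - qx) ** 2 + (y - qy) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total

# Toy (longitude, latitude) pairs: three near New York, one far away.
checkins = [(-73.99, 40.73), (-73.98, 40.74), (-74.00, 40.72), (151.21, -33.87)]
dense = gaussian_kde_2d(checkins, (-73.99, 40.73), bandwidth=0.05)
sparse = gaussian_kde_2d(checkins, (0.0, 0.0), bandwidth=0.05)
```

Locations surrounded by many check-ins receive a higher density value, which is the sense in which KDE expresses check-in popularity.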
To simplify the subsequent computation and to verify our method, we selected two regions, New York and London, which are both well-known international cities where Flickr check-in records are quite dense. The coordinates of these two areas are (1) New York (
), (2) London (
). The spatial distributions of check-ins in the two regions are shown in
Figure 2.
2.3. Preprocessing Flickr Data Based on Temporal Patterns
It is common to preprocess raw data before further analysis to ensure the completeness and accuracy of the data. There were some errors in the raw Flickr dataset, and we exploited the temporal pattern to eliminate obviously wrong data. For example, the Create_date of some records was obviously wrong (e.g., dates in the 1980s even though Flickr was founded in 2004). As shown in Figure 3, except for those with date errors, most of the records were created from 2005 to 2014. Since the objective of this paper is to infer categories of POIs through spatiotemporal patterns and text information in check-in data, data that were not georeferenced or had errors in any field were filtered out. We selected records according to two constraints: (1) the comment text is longer than 30 words, and (2) the Create_date of the check-in is between 2005 and 2014.
Therefore, after deleting records that were invalid or had missing values, 25,205 records in New York and 22,567 records in London remained for our study. Each of these records was then described using the formalized definition:
Definition 1 (check-in record, ). Each check-in record can be expressed as , where denotes a coordinate point, denotes the check-in time, and denotes the user’s description of the photo.
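The selection constraints of this section can be sketched as a simple filter over check-in records. The `CheckIn` class below mirrors the three components of Definition 1 (coordinate point, check-in time, description), but its field names, the inclusive treatment of the 2005 to 2014 range, and whitespace-based word counting are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CheckIn:
    lon: float
    lat: float
    created: date
    text: str  # Title, Content, and User_tags joined into one string (assumed)

MIN_WORDS = 30              # constraint (1): comment text longer than 30 words
YEAR_RANGE = (2005, 2014)   # constraint (2): Create_date between 2005 and 2014

def is_valid(rec):
    """Return True if the record passes both selection constraints."""
    if rec.lon is None or rec.lat is None or rec.created is None:
        return False  # not georeferenced, or the date is missing
    if not (YEAR_RANGE[0] <= rec.created.year <= YEAR_RANGE[1]):
        return False
    return len(rec.text.split()) > MIN_WORDS

records = [
    CheckIn(-73.99, 40.73, date(2010, 5, 1), "word " * 31),
    CheckIn(-73.99, 40.73, date(1985, 1, 1), "word " * 31),  # date predates Flickr
    CheckIn(-0.12, 51.50, date(2012, 7, 4), "too short"),
]
kept = [r for r in records if is_valid(r)]
```

Only the first toy record survives: the second fails the date constraint and the third fails the text-length constraint.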
2.4. Discovering POIs through Spatial Patterns
Considering the spatial pattern, in which nearby things are more related to each other [6,23], check-in records with similar geographical characteristics can be allocated to similar groups with similar POIs. Besides, a POI might be represented by slightly different GPS coordinates because of coordinate deviation. Therefore, we can cluster adjacent check-in points to find a spatially representative POI and eliminate coordinate deviation. Here, we give the following formalized definition of a POI:
Definition 2 (POI, ). Each POI can be expressed as , where denotes a coordinate point and .
To discover POIs, we chose DBSCAN [24], an algorithm for finding density-based clusters in spatial data. Other algorithms could also have fulfilled this purpose. For instance, the OPTICS method and K-means have both been adapted to mine POIs [7,9]. However, in contrast to these approaches, DBSCAN can effectively deal with noisy points and detect clusters of any shape, which suits the geographical characteristics of check-in data. Here, noise refers to sporadic check-in points, which are likely caused by misoperation and are thus not suitable for the analysis. In DBSCAN, a cluster is defined as a set of densely connected points. There are two main parameters, the neighborhood radius EPS and the density threshold MINPTS, which together determine the minimum cluster size and which points are treated as noise. Taking the similarity of spatial density and spatial distance as measures of a POI, we merged similar points that met the density constraints and extracted clusters using DBSCAN [15]. The clustering result is a set of clusters
. Each cluster
represents one POI (i.e.,
), which is a set of check-in points within space/distance constraints. We used the coordinates of the cluster’s centroid
to identify each POI uniquely.
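The clustering step can be sketched with a small, dependency-free DBSCAN that labels check-in points and then takes each cluster's centroid as the POI location. This is an illustrative sketch rather than the implementation used in the paper: measuring `eps` in metres with a haversine distance, the parameter values in the example, and the helper names are all assumptions.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue             # already assigned to a cluster
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:   # j is a core point: keep expanding
                queue.extend(nb)
    return labels

def centroids(points, labels):
    """Mean (lon, lat) of each cluster; noise points (-1) are skipped."""
    groups = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            groups.setdefault(lab, []).append(p)
    return {lab: (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            for lab, g in groups.items()}

pts = [(-73.9965, 40.7295), (-73.9966, 40.7296), (-73.9964, 40.7294),
       (-73.9855, 40.7580), (-73.9856, 40.7581), (-73.9854, 40.7579),
       (-73.9000, 40.9000)]          # the last point is an isolated check-in
labels = dbscan(pts, eps=50.0, min_pts=3)
pois = centroids(pts, labels)
```

In this toy example the two tight groups of three points become two POIs, and the isolated check-in is labeled as noise and excluded.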
2.5. Determining Categories from the Text Pattern
2.5.1. Extraction of Text Patterns
In addition to mining POIs from spatiotemporal information, we used the text in the Title, Content, and User_tags fields of the Flickr dataset to extract text patterns. More specifically, a text pattern could be landmark information (e.g., a museum or an arena) or event information (e.g., sports games or concerts). Our task of extracting the text patterns included word segmentation and the extraction of geo-related named entities.
Named entity recognition (NER) tools are now widely used. Among them, the Stanford Named Entity Recognizer (Stanford NER) [25] is a Conditional Random Field (CRF)-based statistical NER system developed by the Stanford Natural Language Processing Group in 2005, which can conveniently perform word segmentation and named entity extraction. Therefore, we used Stanford NER to extract the keywords that represent the nouns of places and events. The output was a set of word vectors. Here, we give the formalized description of a word vector.
Definition 3 (word vector, ). For each record, there is a word vector w ∈ W containing geo-related words that are expressed as . Here, denotes a geo-related named entity and .
2.5.2. Determination of Category
To explore the meanings of POIs in a more standardized way, it is necessary to generalize the meanings into several categories. This is also common practice in related work on POI annotation [7,16,26]. We surveyed the categories used in different works, as shown in Table 2. Although these annotation methods differ, the categories used in each paper are very similar.
Like the work of Goodchild and Satman [27,28], we also used Google Places, which is widely used to provide location-related information and to represent ground truth labels. As a public data source, Google Places can provide detailed information about 100 million places across a wide range of categories, from the same database as Google Maps. The database of Google Places contains about 133 types to describe POIs for search queries. By contrast, as shown in Table 2, the number of categories was between 6 and 13 in most related works. Similarly, we generalized these types into seven categories after Majid et al. [26]: “College & University”, “Outdoors & Recreation”, “Office”, “Arts & Entertainment”, “Travel & Transportation”, “Shopping & Services”, and “Health”. In addition, we used “Other places” to represent all other types. Note that in the last two columns of Table 2, there is a slight difference between the categories in Falcone’s work [7] and ours, especially in the third row (i.e., “Professional & Other Places” and “Office”).
Additionally, for each category, we randomly selected some of the labeled samples to extract their high-frequency keywords, then we constructed a keyword dictionary for matching POI and category. The formulation of the dictionary is defined as follows:
Definition 4 (Dictionary, ). , , where represents the keywords contained in the ith category .
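The dictionary construction described above can be sketched as counting keyword frequencies per category over labeled samples and keeping the most frequent ones. The function name `build_dictionary`, the `top_k` cutoff, the lowercasing step, and the toy samples are assumptions for illustration; the paper does not state how many keywords were kept per category.

```python
from collections import Counter

def build_dictionary(labelled_samples, top_k=3):
    """Map each category to its `top_k` most frequent keywords.

    `labelled_samples` is an iterable of (category, word_vector) pairs.
    """
    counts = {}
    for category, words in labelled_samples:
        counts.setdefault(category, Counter()).update(w.lower() for w in words)
    return {cat: [w for w, _ in c.most_common(top_k)] for cat, c in counts.items()}

samples = [
    ("College & University", ["University", "Campus", "Library"]),
    ("College & University", ["University", "Campus"]),
    ("College & University", ["University"]),
    ("Health", ["Hospital", "Clinic"]),
    ("Health", ["Hospital"]),
]
dictionary = build_dictionary(samples, top_k=2)
```

Each category then maps to a short keyword list that can be matched against the word vectors of unlabeled check-ins.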
2.6. Association between POIs and Categories
So far, we have obtained POIs mined from check-in data based on temporal patterns (Section 2.3) and spatial patterns (Section 2.4), and categories determined from text patterns (Section 2.5). In this section, we associate each POI with its most likely category. The final output is a tuple that contains the location and category of a POI (i.e., an annotated POI).
Definition 5 (Annotated POI, ). An annotated POI represents a POI labeled with a category, which can be defined as , where s represents the category information used to describe the POI .
Our annotation process can be viewed as a matching process, which is described in Algorithm 1. First, for each POI, we traversed each check-in record in the region of this POI. Using the Knuth–Morris–Pratt (KMP) algorithm [
29], we matched each word vector
with a category in keyword dictionary
, then determined the category of this check-in. Furthermore, we defined the reliability for each category to which the POI belonged. The total match count for each category is calculated as follows:
where
and
. In addition, the reliability of the POI’s category
is defined as follows.
As described above, the category with the highest weight was chosen from the top of , which was used to measure the reliability of classification.
Algorithm 1. POI annotation algorithm
Input: // set of POI
 // set of word vector
Output: // set of annotated POI
1.
2. FOR each POI
3.   CREATE
4.   FOR each coordinate point and
5.
6.     IF MATCH(,) = THEN
7.
8.     END IF
9.   END FOR
10.   COMPUTE based on
11.   IF TOP THEN
12.     and
13.
14.   END IF
15. END FOR
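The matching process of Algorithm 1 can be sketched for a single POI as follows. Several details are assumptions rather than the paper's definitions: exact token lookup in a set stands in for KMP string matching, the reliability is taken to be the top category's share of all matches, and the `min_reliability` threshold and the function name `annotate_poi` are hypothetical.

```python
from collections import Counter

def annotate_poi(word_vectors, dictionary, min_reliability=0.5):
    """Return (category, reliability) for one POI, or None if no category qualifies.

    `word_vectors` holds one list of geo-related words per check-in in the POI.
    `dictionary` maps each category to its keyword list.
    """
    counts = Counter()
    for words in word_vectors:
        lowered = {w.lower() for w in words}
        for category, keywords in dictionary.items():
            # A check-in votes for a category if any of its keywords occurs in it.
            if any(k.lower() in lowered for k in keywords):
                counts[category] += 1
    total = sum(counts.values())
    if total == 0:
        return None                  # no check-in matched any category
    category, hits = counts.most_common(1)[0]
    reliability = hits / total       # assumed definition: share of all matches
    if reliability < min_reliability:
        return None
    return category, reliability

dictionary = {
    "College & University": ["university", "campus"],
    "Health": ["hospital"],
}
vectors = [["New", "York", "University"], ["campus"], ["Hospital"], ["park"]]
result = annotate_poi(vectors, dictionary)
```

Here two of the three matching check-ins vote for “College & University”, so the POI is annotated with that category and a reliability of about 0.67 under the assumed definition.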
We further analyzed the time complexity of Algorithm 1. Let be the number of POIs, be the number of coordinate points aggregated in one POI, and be the length of the word vector. The function MATCH(,) in Step 6 can be supported efficiently by the KMP algorithm, whose time complexity is . Note that the total number of categories is eight (seven categories plus “Other places”), that is, the maximum value of is the constant 8, which can be ignored. The time complexities of Steps 1 and 4 are and , respectively. Hence, for a given POI and word vector, the time complexity of our category matching method is , that is, Algorithm 1 has linear time complexity with respect to the number of POIs, the number of coordinate points, and the length of the word vector. In other words, relative to the scale of the problem, our algorithm reaches the optimal time complexity.