Automatic Identification of the Social Functions of Areas of Interest (AOIs) Using the Standard Hour-Day-Spectrum Approach

Abstract: The social function of areas of interest (AOIs) is crucial to the identification of urban functional zoning and land use classification, which has been a hot topic in fields such as urban planning and smart cities. Most existing studies on urban functional zoning and land use classification either rely largely on low-frequency remote sensing images, which are constrained to the block level by their spatial scale, or suffer from low accuracy and high uncertainty when using dynamic data such as social media and traffic data. This paper proposes an hour-day-spectrum (HDS) approach that generates six types of distribution waveforms of taxi pick-up and drop-off points, which serve as interpretation indicators of the social functions of AOIs. To achieve this goal, we first performed fine-grained cleaning of the drop-off points to eliminate the spatial errors caused by taxi drivers. Next, buffer analysis and spatial clustering were integrated to explore the associations between travel behavior and AOIs. Third, AOI types were identified using the standard HDS method combined with the k-nearest neighbor (KNN) algorithm. Finally, matching tests were carried out using similarity indexes between a standard HDS and a sample HDS, i.e., the Gaussian kernel function and the Pearson coefficient, to ensure matching accuracy. The experiment was conducted in the Chongchuan and Gangzha Districts, Nantong, Jiangsu Province, China. By training on 50 AOI samples, six types of standard HDS were obtained for residential districts, schools, hospitals, and shopping malls. Then, 108 AOI samples were tested, and the overall accuracy was found to be 90.74%. This approach generates value-added services from taxi trajectories and provides a continuously updated, fine-grained supplementary method for the identification of land use types.
In addition, the approach is object-oriented and based on AOIs, and can be combined with image interpretation and other methods to improve the identification effect.


Introduction
Urban functional zoning refers to the division of regions according to the dominant functions of a city, which is an organic whole with relatively independent functions and mutual connections. Land use type refers to a land resource unit with the same land use mode, and it is the basic regional unit reflecting use, property, and distribution law. Urban areas of interest (AOIs) refer to units with a social function that attract the attention of humans. Because AOIs are basic units in urban functional zoning and land use types, identifying their social function types plays an important role in land use classification and urban functional partition [1,2]. The identification and renewal of urban functional zones and land use types are also hot research topics, with broad applications ranging from transportation to urban planning and smart cities [3]. However, few studies have focused on the
The main innovations of this article include:
(1) Data preprocessing with double cleaning. After cleaning twice, the spatial accuracy of the pick-up and drop-off points is ensured, which benefits micro-level travel behavior analysis. This operation moves travel research from macro to micro, from city level to block level, and from community level to building level. Buffer analysis combined with DBSCAN automatically classifies the pick-up and drop-off points. Each cluster can be automatically associated with a neighboring AOI to determine the affiliation of the points with the AOI.
(2) A top-down method of automatic identification of AOI is designed. This method relies on the six types of HDS of AOIs. The standard HDS of the AOI is obtained by temporal analysis on the pick-up and drop-off points, and then the social function of the AOI is identified by the spectrum matching technique.
(3) Waveform recognition is performed using HDS pattern matching based on the Gaussian kernel function. Compared with matching based on cosine similarity or the Pearson correlation coefficient, the recognition accuracy improves, reaching 90.74%.
This method requires only the trajectory data of the taxicabs, and automatic identification of the functional type can be performed without adding new sensors and data sources. Because the trajectory in each city is continuously being updated in intelligent transportation systems (ITS), the implementation of this scheme can achieve long-term and dynamic monitoring of the social functions of AOIs. The results can be used alone or in combination with other schemes to complement the urban functional area identification.
The rest of the article is arranged as follows. Section 2 presents the methodology. The study area and data preprocessing, as well as results and analysis, are introduced in Section 3. Section 4 presents the discussion. Lastly, conclusions and future work are covered in Section 5.

Methodology
AOIs can be regarded as an important part of land use types, and the identification of their social function types is a hot topic. This article attempts to identify the social function types of AOIs through the spatiotemporal data mining of taxi GPS trajectories. This section introduces the principle and implementation of the supervised classification method in detail.

Study Area
The study area is presented in Figure 1. All experiments were conducted in the Chongchuan and Gangzha Districts, which cover about 234 square kilometers and had a permanent population of 0.884 million in 2017. These districts are the traditional main urban areas of Nantong, a prefecture-level city in Jiangsu Province, China, located on the northern bank of the Yangtze River. In 2018, Nantong had a gross domestic product of about 842.7 billion yuan, a growth of 8.95%, ranking 20th in the country. Because the subway in Nantong has yet to be constructed, buses and taxis are the main modes of public transportation, and taxis play an important role in citizens' lives. The total number of taxis in Nantong is about 1200 and the number of buses is about 3000.
Figure 2 illustrates the research framework and interrelated tasks of the proposed work, with the details given below.

Associating AOIs with Pick-Up and Drop-Off Points
Compared with other public travel modes, the taxi is the most maneuverable and flexible, and its drop-off points are relatively close to the destination. The drop-off points near AOI entrances were collected as sample data, using a different method for each entrance type: DBSCAN combined with a spatial buffer for closed entrances, and a spatial buffer alone for open entrances.
The road boundaries constrain the distribution of the drop-off points, so the buffer size should vary according to the width of each road. Because the spatial error of the GNSS device was five to ten meters, the buffer width perpendicular to the road was set to the road width plus ten meters so that more drop-off points would fall inside the buffer zone.

Closed Entrances
Closed entrances have doorplates, gateposts, fences, and other iconic objects, so we carried out DBSCAN clustering for pick-up and drop-off points in the buffer, removed noise points in the results, and then associated the pick-up and drop-off point clusters with AOIs. Combined with buffer analysis, we determined buffer widths parallel to the road by using the drop-off point density associated with AOIs, as shown in Figure 3.
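The clustering step for closed entrances can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the points have already been clipped to the buffer and projected to planar coordinates in meters, the values of `eps` and `min_pts` are hypothetical (the text does not report them), and the centroid-to-nearest-entrance rule in `nearest_aoi` is an assumed way of associating a cluster with an AOI.

```python
from math import hypot

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        # The point itself is included, as in the usual DBSCAN definition.
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


def nearest_aoi(cluster_pts, entrances):
    """Assumed rule: assign a cluster to the AOI whose entrance is closest
    to the cluster centroid. `entrances` maps AOI name -> (x, y)."""
    cx = sum(x for x, _ in cluster_pts) / len(cluster_pts)
    cy = sum(y for _, y in cluster_pts) / len(cluster_pts)
    return min(entrances,
               key=lambda name: hypot(entrances[name][0] - cx,
                                      entrances[name][1] - cy))
```

Points labeled `NOISE` would be dropped before the remaining clusters are associated with AOIs.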

Open Entrances
Some AOIs, such as shopping malls, have no clearly defined entrances, which means people can enter or leave the area from any position along the road. To allow for GPS error, the buffer width parallel to the road was obtained by adding ten meters to the AOI boundary length, as shown in Figure 4.

Conception of SHDS
There are several types of AOIs, but for a specific type, the composition of the population is relatively similar, and, therefore, people tend to travel short distances. This means that the temporal distribution law of the same type of AOI is similar. Across days, there are differences between weekdays and holidays. For example, primary and secondary schools are only open on weekdays, but scenic spots attract more tourists on Sundays. Hospitals operate all year round and have a relatively balanced flow of people. In the HDS, some AOIs have several peaks in a day. For instance, students have a fixed school time, which results in corresponding morning and evening peaks in residential districts. The differences between functional zones, like a person's unique fingerprints, can be used to identify the different types.
The flow of passengers can be expressed by the number of pick-up and drop-off points near the AOI entrance. Hence, the spectrum constructed from these points in each time period is a signature of the AOI's own characteristics. In terms of time, the HDS captures differences between holidays and weekdays. At the same time, the HDS captures differences between pick-up and drop-off points. Thus, the HDS can be divided into six types: total drop-offs, holiday drop-offs, weekday drop-offs, total pick-ups, holiday pick-ups, and weekday pick-ups. These characteristics can be described as curves in a two-dimensional coordinate system in which the horizontal axis represents the hours 0-23 and the vertical axis represents the flow of people; this curve is called the HDS in this paper.
Because AOIs of the same type have similar social systems and working characteristics, their spectra are almost identical. Under this circumstance, it is possible to use a standard spectrum to identify the characteristics of such AOIs. Corresponding to the aforementioned six types of HDSs, six types of standard spectrum can be generated, which are collectively called SHDSs.
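As a rough illustration of how the six spectra could be counted for one AOI, the sketch below bins timestamped events by hour. The function name, the event format, and the rule that weekends plus an optional set of public-holiday dates count as holidays are assumptions for illustration; the keys reuse the abbreviations PP, HPP, WPP, DP, HDP, and WDP from the Implementation section.

```python
from collections import Counter
from datetime import datetime

SPECTRUM_TYPES = ("DP", "HDP", "WDP", "PP", "HPP", "WPP")


def build_hds(events, holidays=frozenset()):
    """Count events per hour for the six spectrum types of one AOI.

    events:   iterable of (timestamp, kind), kind is "pickup" or "dropoff"
    holidays: optional set of datetime.date values treated as holidays
              (assumption: weekends are always treated as holidays)
    Returns a dict mapping each spectrum type to a 24-element list.
    """
    counts = {t: Counter() for t in SPECTRUM_TYPES}
    for ts, kind in events:
        base = "PP" if kind == "pickup" else "DP"
        is_holiday = ts.weekday() >= 5 or ts.date() in holidays
        counts[base][ts.hour] += 1                          # total spectrum
        counts[("H" if is_holiday else "W") + base][ts.hour] += 1
    return {t: [counts[t][h] for h in range(24)] for t in SPECTRUM_TYPES}
```

Each returned list can be plotted directly as a curve with the hour of day on the horizontal axis.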

Implementation
The SHDS should express the fundamental law of all AOIs of a given type. If only one example is taken as the SHDS, it carries greater subjectivity and deviation. In order to acquire an SHDS with strong universality and distinction, this paper uses a method based on sampling and interpolation. The implementation process is as follows:
(1) We extract the pick-up and drop-off points in each buffer of the AOI entrance. According to the 'Time' field, the number of points in each hour of the day is counted separately to calculate the time-interval spectrum, as shown in Figure 5. The distribution of the pick-up and drop-off points is usually uneven; the region close to the AOI entrance may have a high density. The DBSCAN algorithm is sensitive to the distribution density of points, so it can extract the clusters of points and remove noise points. Because the parking positions of vehicles are dispersed around a closed entrance, the DBSCAN algorithm is used to cluster the pick-up and drop-off points in the buffer, as shown in Figure 6. Then, after extracting the quantity sequence of pick-up and drop-off points, the data set can be divided into the following: $PP=\{p_1, p_2, \dots, p_n\}$ represents pick-up points, $HPP=\{hp_1, hp_2, \dots, hp_n\}$ represents pick-up points on holidays, $WPP=\{wp_1, wp_2, \dots, wp_n\}$ represents pick-up points on weekdays, $DP=\{d_1, d_2, \dots, d_n\}$ represents drop-off points, $HDP=\{hd_1, hd_2, \dots, hd_n\}$ represents drop-off points on holidays, and $WDP=\{wd_1, wd_2, \dots, wd_n\}$ represents drop-off points on weekdays.
(2) By calculating the spectrum of an AOI from the sequence data, the spectral sequence $S_j^{(k,h)}$ is $\{s_1, \dots, s_i, \dots, s_{24}\}$, where $i$ is the sequence number of the spectrum ($1 \le i \le 24$), $j$ is the index of the example, $k$ is the index of the class, $h$ is the index of the spectral type, $s_i$ is the quantity of pick-up and drop-off points at time $i$, and $S_j^{(k,h)}$ represents the spectral information of the $j$th example of the $k$th class of AOI and the $h$th class of the spectrum.
(3) The average hour-day-spectrum $\bar{S}^{(k,h)}$ is calculated for each type of AOI as
$$\bar{S}^{(k,h)} = \frac{1}{N}\sum_{j=1}^{N} S_j^{(k,h)},$$
where $N$ is the total number of examples of the $k$th class.
(4) The same type of HDS may sometimes have abnormal values, that is, the spectral shape shows sharp fluctuations. It is therefore necessary to take interval sampling of $\bar{S}^{(k,h)}$ with interval $m$ to reduce the influence of abnormal values. The result $\tilde{S}^{(k,h)}$ is $\{s_1, \dots, s_{i+m}, \dots, s_{24}\}$. In order to ensure the universality of SHDSs and retain the distinction of the original HDS, $m$ is less than 3.
(5) The dimension of the spectrum after sampling is less than 24, but the HDS of the AOI to be identified has 24 dimensions. Thus, it is necessary to interpolate the $\tilde{S}^{(k,h)}$ sequence to restore it to 24 dimensions, and the interpolation result is the final standard spectrum $\hat{S}^{(k,h)}$.
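Steps (3) to (5) can be sketched as follows. The function names are hypothetical, and two details are assumptions because the text does not fix them: the interpolation is taken to be linear, and the last hour is always kept as an anchor so that the restored sequence spans the whole day.

```python
def mean_spectrum(spectra):
    """Step (3): element-wise mean of equally long hourly spectra."""
    n = len(spectra)
    return [sum(s[i] for s in spectra) / n for i in range(len(spectra[0]))]


def sample_and_interpolate(spectrum, m=1):
    """Steps (4)-(5): keep every (m + 1)-th value, then linearly
    interpolate the skipped hours to restore the original length."""
    step = m + 1
    kept = list(range(0, len(spectrum), step))
    last = len(spectrum) - 1
    if kept[-1] != last:
        kept.append(last)              # assumption: anchor the final hour
    restored = [float(v) for v in spectrum]
    for left, right in zip(kept, kept[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            restored[i] = (1 - w) * spectrum[left] + w * spectrum[right]
    return restored
```

With `m=1`, an isolated spike at a skipped hour is replaced by the average of its two neighbors, while a spike at a kept hour survives; this is one reason to average over many samples before sampling.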

Automatic Identification of Social Function of AOIs with KNN and SHDSs
KNN is a common method used in data mining classification technology. Compared with other machine learning algorithms, it is especially suitable for multi-classification processing. However, it has a large time complexity when calculating the similarity between the sample and all training samples. Hence, the SHDS is used to replace the whole sample set; this means only the overall similarity (distance) of the HDS and the corresponding SHDS needs to be calculated, which can greatly reduce the algorithm complexity.

Concept of KNN
KNN means that the nearest k neighbors can represent each sample. If a sample has a majority of the k nearest neighbors belonging to a certain type in the feature space, the sample is also classified into this type. In the KNN algorithm, the selected neighbors are considered to have been correctly classified. The classification decision only depends on the type of the nearest one or several samples.
The following steps are performed for each point of unknown type in the dataset:
1. The distance between each point of known type and the current point is computed;
2. The distances are sorted in ascending order;
3. The k points with the smallest distance from the current point are chosen;
4. The occurrence frequency of each type among these k points is obtained;
5. The type with the highest frequency is returned as the classification of the current point.
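The five steps above map directly onto a short function. This is a generic KNN sketch using Euclidean distance, not the SHDS-based variant described in the next subsection; the function name and the default `k` are illustrative.

```python
from collections import Counter
from math import dist


def knn_classify(query, training, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    training: list of (vector, label) pairs with known labels.
    """
    # Steps 1-2: compute distances and sort in ascending order.
    ranked = sorted(training, key=lambda item: dist(query, item[0]))
    # Step 3: keep the k closest samples.
    top_labels = [label for _, label in ranked[:k]]
    # Steps 4-5: count label frequencies and return the most common one.
    return Counter(top_labels).most_common(1)[0][0]
```

Because every query is compared with every training sample, the cost grows with the size of the training set, which is the complexity issue that replacing the sample set with SHDSs is meant to address.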

Combination of KNN and SHDS
There are differences between different types of HDSs of the same AOI, and the same type of HDS of different AOI types may also be different. However, for a specific type of AOI, the shape of the six HDSs is relatively stable, meaning the spectral curves can be assembled and regarded as the identification of the AOI type. The number of pick-up and drop-off points near the entrance is recorded in the spectrum sequence. However, due to the differences in acquisition time and spatial regions, the error is high when calculating the similarity of the spectral sequence based on the absolute number, so the spectral sequence elements need to be normalized in advance. The steps of the identification method are as follows:
1. The training process of the SHDS. The spectrum sequence is converted into a 24-dimensional vector and normalized, and then six SHDSs of various types of AOIs are calculated. The normalization formula is
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}},$$
where $X$ denotes the vector form of this type of SHDS, $X_{\min}$ denotes the minimum value of the vector, and $X_{\max}$ denotes the maximum value of the vector.
2. Identify the type of AOI. Cosine similarity, the Pearson coefficient, and the Gaussian kernel function were each selected as the similarity (distance) function of KNN, and the best one was decided according to the sensitivity of the self-correlation of the AOIs' SHDSs.
The next step involves converting the AOI spectrum sequence to be identified into a normalized vector form, calculating the similarity with the SHDS vector of each type, integrating the six spectral similarities, and calculating the total similarity as the distance factor in the KNN algorithm. The calculation formula is
$$Sim_k = \sum_{i=1}^{6} sim_{type_i},$$
where $k$ denotes the index of the AOI types, $type_i$ denotes the type of spectrum (for example, the spectrum of weekdays), $sim_{type_i}$ denotes the similarity between the SHDS of $type_i$ and the HDS to be identified, and $Sim_k$ denotes the total similarity between the $k$th AOI type and the AOI to be identified.
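The identification step can be sketched as below, under stated assumptions: the six per-spectrum similarities are combined by an unweighted sum, the AOI is assigned to the type with the largest total, and `inverse_distance` is only a placeholder similarity so that the example is self-contained (the paper compares cosine similarity, the Pearson coefficient, and the Gaussian kernel). All function names are hypothetical.

```python
from math import dist


def min_max_normalize(vec):
    """Scale a spectrum to [0, 1]; a flat spectrum maps to all zeros."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0] * len(vec)
    return [(v - lo) / (hi - lo) for v in vec]


def inverse_distance(x, y):
    """Placeholder similarity in (0, 1]; not one of the paper's indicators."""
    return 1.0 / (1.0 + dist(x, y))


def total_similarity(hds, shds, similarity=inverse_distance):
    """Sum the similarity over all spectrum types (DP, HDP, ...).

    hds, shds: dicts mapping spectrum type -> 24-element list.
    """
    return sum(similarity(min_max_normalize(hds[t]), min_max_normalize(shds[t]))
               for t in shds)


def identify(hds, shds_by_type, similarity=inverse_distance):
    """Return the AOI type whose SHDS set is most similar to `hds`."""
    return max(shds_by_type,
               key=lambda k: total_similarity(hds, shds_by_type[k], similarity))
```

Normalizing before comparison means that a busy AOI and a quiet AOI with the same daily shape receive the same score, which is the purpose of the normalization described above.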

Result and Analysis
In order to validate the feasibility of the approach, 108 AOI samples were selected as test sets, and different similarity calculation methods were used to verify the results.

Trajectory Data of the Taxi
The original taxi GPS trajectory data covered about 1,400 taxis from September to October 2018 in Nantong, China. The attributes included the license plate number, the driver's call sign, latitude and longitude, etc., as shown in Table 1. Specifically, 'Time' indicates the time at which the trajectory point was recorded, 'Latitude and longitude' represent the current geographic location of the vehicle, 'Speed' records the current vehicle speed, and 'Direction' signifies the current direction. If 'State' is left empty, there are no passengers in the car. The designed sampling interval was 30 s, but the interval was shorter in practice because the records triggered by changes in passenger status were also collected.
We then extracted the pick-up and drop-off points according to the vehicle state. When the state of the vehicle changes from empty to heavy, the point is a pick-up point, and when it changes from heavy to empty, the point is a drop-off point, as shown in Figure 7. In practice, taxi drivers change the passenger status after passengers get on, leading to a relatively small error between the recorded pick-up point and the actual pick-up point. However, when approaching the destination, some drivers change the status in advance, resulting in a significant error between the recorded drop-off point and the actual drop-off point. Hence, this paper characterizes the empty point as the drop-off point when the vehicle state changes from heavy to empty and the distance between the two is less than 50 meters. This cleaning process ensures that the position accuracy of the drop-off points is sufficient for building-level identification. The distance calculation formula is shown in Equation (5):
$$d = 2R\arcsin\sqrt{\sin^2\frac{\varphi_2-\varphi_1}{2} + \cos\varphi_1\cos\varphi_2\sin^2\frac{\lambda_2-\lambda_1}{2}} \quad (5)$$
where $\varphi_1$ and $\varphi_2$ are latitude angles, $\lambda_1$ and $\lambda_2$ are longitude angles, and $R$ is the radius of the Earth.
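The extraction and cleaning logic can be illustrated with a short sketch. Several points are assumptions: the great-circle distance is computed with the haversine formula, each fix is reduced to a `(lat, lon, occupied)` tuple, and the 50-meter rule is read as "keep a drop-off only if the last occupied fix and the first empty fix are less than 50 m apart". The function names and the mean Earth radius constant are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def extract_events(track, max_gap_m=50.0):
    """Split one taxi's time-ordered fixes into pick-ups and drop-offs.

    track: list of (lat, lon, occupied) tuples, occupied is a bool.
    A drop-off is kept only if the state change happened within
    `max_gap_m` of the previous (occupied) fix.
    """
    pickups, dropoffs = [], []
    for (lat0, lon0, occ0), (lat1, lon1, occ1) in zip(track, track[1:]):
        if not occ0 and occ1:                       # empty -> heavy
            pickups.append((lat1, lon1))
        elif occ0 and not occ1:                     # heavy -> empty
            if haversine_m(lat0, lon0, lat1, lon1) < max_gap_m:
                dropoffs.append((lat1, lon1))
    return pickups, dropoffs
```

A difference of 0.0001 degrees of latitude is about 11 m, so two consecutive fixes that close together pass the 50 m test, while a jump of roughly 1 km between fixes does not.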

AOI Data
This paper used Nantong, China as the research area. AOI data were obtained using Amap API via web crawler technology. We selected several different types of AOIs for the experimental data, including shopping malls, schools, hospitals, and residential districts (Figure 8). The details of each type of AOI are shown in Table 2. In this experiment, 50 samples in the Chongchuan District were selected as the training set and 108 samples in the Gangzha District were collected as the validation set.

Training Results of the SHDSs
Fifty AOI samples were used to construct six types of HDSs. Taking the total drop-off HDS of residential districts as an example, 78% of residential districts were found to have peaks at 10 a.m. and 8 p.m. There were some abnormal fluctuations in different HDSs, but the overall trend was the same, as shown in Figure 9. Other types of AOIs, such as shopping malls, schools, and hospitals, show similar regularity and abnormal fluctuations. The SHDS of the corresponding spectrum sequence of each type of AOI was calculated, and sampling at interval points was performed. In this experiment, we set m equal to 1, meaning that every other point was sampled, followed by interpolation. After standardizing the results of the SHDSs, the standard spectrum of each type was obtained, as shown in Figure 10.
The Pearson correlation coefficient was used to calculate the correlation of the six SHDSs of each type of AOI. The average correlations of schools, communities, hospitals, and shopping malls were found to be 0.907, 0.743, 0.940, and 0.918, respectively. The SHDSs of each type of AOI show the same trend. Taking the hospital as an example, there are peaks at 9 a.m. and 3 p.m. Taking the SHDS of the DP (drop-off points) type as an example, as shown in Figure 11, the spectrum trends of different AOIs differ: there are two peaks for the hospital and three peaks for the school; the SHDS of the residential district shows an upward trend, while the other spectrum rises and then declines.

Social Functional Identification of AOIs
Appropriate similarity indicators are the key to the matching method, which can improve the accuracy of identification. Thus, this article compares three different similarity indicators.

Cosine Similarity
The smaller the angle between two vectors, the more similar the two vectors are. Cosine similarity follows this idea: it measures the similarity between vectors by calculating the cosine of the angle between them. The formula of cosine similarity is
$$\cos(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}},$$
where $X$ and $Y$ denote the vectors to be calculated, $x_i$ denotes the $i$th element in $X$, and $y_i$ denotes the $i$th element in $Y$, as shown in Figure 12.
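A minimal sketch of this indicator for two hourly spectra is given below; the function name and the convention of returning 0 for an all-zero vector are choices made for illustration.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                     # convention for an all-zero spectrum
    return dot / (norm_x * norm_y)
```

Because normalized spectra contain no negative values, the cosine of any two of them lies between 0 and 1, so spectra of different AOI types can still score fairly high.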

Pearson Correlation Coefficient
Pearson correlation, also known as product-moment correlation, is a method of calculating correlations which was proposed by the British statistician Karl Pearson in the late 19th century. The closer the correlation coefficient is to −1 or 1, the stronger the correlation, and the closer it is to 0, the weaker the correlation. In general, the correlation strength of variables can be determined using the following ranges: 0.
The coefficient is calculated as
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{N}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{Y})^2}},$$
where $E$ denotes the mathematical expectation, $\mathrm{cov}$ denotes the covariance, and $N$ denotes the number of variables.
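For reference, the sample form of the coefficient can be computed as in the sketch below. The function name and the convention of returning 0 when one vector is constant (where the coefficient is undefined) are illustrative choices.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0                     # undefined for a constant vector
    return cov / (dev_x * dev_y)
```

The coefficient is unchanged by shifting or rescaling either vector, so it compares only the shapes of two spectra and ignores differences in overall level.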

Gaussian Kernel Function
The Gaussian kernel function is defined as a monotone function of the Euclidean distance between $X$ and $Y$ in space, and is an effective method for calculating the similarity between vectors. The farther the distance, the greater the difference between individuals. Hence, this paper takes the Gaussian kernel function as the similarity indicator:
$$K(X, Y) = e^{-\frac{\|X - Y\|^2}{2\sigma^2}},$$
where $e$ denotes the base of the natural logarithm and $\sigma$ denotes the standard deviation.
Three similarity indicators were used to validate the SHDSs of the four types of AOIs, and their value ranges differ. The self-correlation values were therefore normalized for comparison, as shown in Figures 13-15. Heat maps are used to reflect the similarity of different AOIs' SHDSs: the darker the color, the higher the similarity. The main diagonal, whose values are 1, is the darkest, which shows that the correlation between HDSs of the same type is 100%. 'M', 'R', 'S' and 'H' in the maps represent shopping malls, residential districts, schools, and hospitals, respectively.
In the DP heat map of the Pearson correlation coefficients, as shown in Figure 12, the correlation coefficient between the SHDS of the residential district and that of the mall is 0.36, a weak correlation, while the correlation coefficient between the SHDS of the mall and that of the hospital is 0.87, an extremely strong correlation. The self-correlation of the Pearson correlation coefficient is the highest among the three indicators, so the matching accuracy based on it is significantly lower than that of the others. Also affected by strong self-correlation, the matching result based on cosine similarity is not ideal, at only 85.19%. By adjusting the value of $2\sigma^2$ in the Gaussian kernel function, an appropriate parameter can be found to tighten the constraint on HDS mutual matching during identification. The weight selection and accuracy comparison are shown in Figure 16.
When $2\sigma^2$ is less than 1.15, the accuracy shows an upward trend. When $2\sigma^2$ is greater than or equal to 1.15 and less than or equal to 1.52, the accuracy reaches its peak value of 90.74%. When $2\sigma^2$ is higher than 1.52, the accuracy gradually declines and eventually converges to 88.88%, as shown in Table 3. Hence, we selected the Gaussian kernel function with $2\sigma^2$ equal to 1.5 as the similarity indicator.
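The kernel and the role of its width parameter can be sketched as follows. The function name and the parameter name `two_sigma_sq` are hypothetical; the default of 1.5 mirrors the value selected in the text.

```python
from math import exp


def gaussian_kernel(x, y, two_sigma_sq=1.5):
    """Gaussian kernel similarity exp(-||x - y||^2 / (2 * sigma^2)).

    two_sigma_sq is the denominator 2 * sigma^2 as a single parameter.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / two_sigma_sq)
```

For a fixed pair of different spectra, a smaller `two_sigma_sq` lowers the similarity more sharply, which is how tuning this parameter tightens or relaxes the matching constraint.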

Discussion
Partial identification results of the AOIs obtained via the proposed method are shown in Table 4, where the experimental accuracy is 90.74%. The main reasons for the incorrect cases are as follows:
1. Mutual interference between different types of AOIs. AOIs are in fact on different levels. For example, hospitals can be divided into several levels. The higher the level, the greater the influence. If there is a significant level difference between two adjacent AOIs, the attribution of the surrounding trajectory data may be unclear. For example, Nantong First People's Hospital and Nantong First Middle School are adjacent, as shown in Figure 17, but the influence of the First People's Hospital is much stronger than that of the First Middle School. Hence, most of the trajectory data near the school were allocated to the hospital, meaning that the spectral information was not typical. In this case, we can consider accumulating data over an extended period and extracting data with a small buffer area for big data analysis, which is one of the research plans for the future.
2. AOIs are newly built or have an abnormal status. Exploring the correlation between AOIs and travel behavior requires a series of data points. Some buildings or residential districts are newly built or may not be open to the public, as shown in Figure 18. Due to the low occupancy rate, the number of drop-off points is insufficient to support the analysis of the spectrum. The entrance of an individual AOI may also need to be rebuilt, which could result in an abnormal status.
3. Impact of the spatial location. Theoretically, the closer an AOI is to the center of the city, the more prosperous the area and the stronger the regularity. Conversely, close to the edge of the city, the regularity is weakened.

Conclusions
A top-down supervised classification method has been proposed in this article that uses dynamic taxi pick-up and drop-off points to identify the social functional types of AOIs, so as to support the identification of urban functional partitions. Firstly, taxi trajectory data were used as a proxy for human travel behavior, and the relationship between the social function of AOIs and travel behavior was established. There was a strong correlation between the two, and the SHDS was obtained from the AOI samples. Then, using multiple SHDSs and the KNN method, automatic identification and monitoring of the social function of AOIs were implemented. Finally, the experimental accuracy, which reached 90.74%, was verified by various methods. Because taxi GPS trajectory data are collected continuously in many cities, this approach can serve as an effective long-term solution. Meanwhile, if this method is combined with image interpretation and other identification methods, better results can be achieved. Compared with cities such as Shanghai or Guangzhou, Nantong has a smaller population in its main urban area, so Nantong is a small city and its types of AOIs are less diverse. If the experiment can be carried out in a big city, it may achieve better results. Because most AOIs in China, such as buildings and residential districts, are enclosed by walls and are only accessible via one or more entrances, this method is more suitable for most cities in China than for those in other countries with an open management mode.