Identification of Urban Functional Areas by Coupling Satellite Images and Taxi GPS Trajectories

Urban functional area (UFA) recognition is one of the most important strategies for achieving sustainable city development. As remote-sensing and social-sensing data sources have increasingly become available, UFA recognition has received a significant amount of attention. Research on UFA recognition that uses a single dataset suffers from a low update frequency or low spatial resolution, while data fusion-based methods are limited in efficiency and accuracy. This paper proposes an integrated model to identify UFA using satellite images and taxi global positioning system (GPS) trajectories in four steps. First, blocks were generated as spatial units in the study area, and the spatiotemporal information entropy of the taxi GPS trajectory (STET) for each block was calculated. Second, a 24-hour time-frequency series was formed based on the pick-up and drop-off points extracted from taxi trajectories and used as the interpretation indicator of the blocks. The K-Means++ and k-Nearest Neighbor (kNN) algorithm were used to identify their social functions. Third, a multilabel classification method based on the residual neural network (MLC-ResNets) and “You Only Look Once” (YOLO) target detection algorithms were used to identify the features of the typical and atypical spatial textures, respectively, of the satellite images in the blocks. The confidence scores of the features of the blocks were categorized by the decision tree algorithm. Fourth, to find the best way to integrate the two sub-models for UFA identification, the 10-fold cross-validation method based on stratified random sampling was applied to determine the most optimal STET thresholds. The results showed that the average accuracy reached 82.0%, with an average kappa of 73.5%—significant improvements over most existing studies. This paper provides new insights into how the advantages of satellite images and taxi trajectories in UFA identification can be fully exploited to support sustainable city management.


Introduction
Urban systems have natural and social characteristics. With rapid urbanization and the intensification of human activities, the structure and characteristics of the city have become more complex, and the types of urban functional areas (UFAs) more diverse [1]. Scientific planning of UFAs has become one of the important strategies for regional development and national construction [2], and the delineation of UFAs is essential for the optimization of urban planning [3]. In nature, each functional area is spatially aggregated by diverse geographic objects, which are semantically extracted from land uses [4,5]. Unlike the traditional investigation methods, the automatic and semiautomatic methods for mapping UFAs have been in high demand with the rapid development of geographical information and remote-sensing technologies. On the one hand, remote-sensing data have been widely used to detect Land Use and Land Cover (LULC) and built-up areas, with good effectiveness and efficiency. For instance, Landsat 8 is used to monitor land use changes [6,7], and the Luojia1-01 and Radiometer Suite (VIIRS) day-night band carried by the Suomi National Polar-orbiting Partnership (NPP) satellite have been used to extract urban built-up areas [8][9][10]. However, satellite images can only monitor the physical characteristics of a city's land surface, and it is insufficient to recognize the social function and describe the spatiotemporal law of human mobility [11].
On the other hand, with the popularity of location-aware devices and related technologies, various types of social-sensing data have become available; these include vehicle trajectory data (e.g., those from taxis, bicycles and buses) [12][13][14]; social media sign-in data (e.g., Sina Weibo, Twitter and WeChat) [15][16][17]; points of interest and so on [18]. Based on these data, the laws of human mobility and the distribution pattern of the regional functions from the perspectives of time [19,20] and space [21,22] can be analyzed and established. However, due to the limitations of regional transportation, the economy and infrastructure construction, such sensing data, have not been fully available in all regions. The lack of social-sensing data makes it still difficult to accurately identify the type of UFAs [23].
In recent years, there has been a significant improvement in computer software and hardware [24]. Correspondingly, artificial intelligence (AI) algorithms have been implemented more easily, and a coupling analysis using multisource data has become a reality in the dynamic identification of UFAs [10,25]. AI enhances the methods of image recognition and contributes to the deeper mining of spatiotemporal laws. A multisource data coupling analysis can allow the advantages of the data themselves to be exploited and, furthermore, allow the data to complement each other. However, in the most recent research, multisource data have been directly input into an end-to-end artificial intelligence framework, which lacks the applicability of the evaluation and selection and may result in low accuracy and low efficiency.
A new integrated model for UFA identification is therefore proposed in this paper by coupling satellite images and taxi trajectory data and using AI algorithms. Firstly, an urban area was divided into blocks based on roads and rivers as the basic spatial units, and the spatiotemporal information entropy of the trajectory (STET) was calculated. A threshold was selected to divide the blocks into two groups based on STET. Then, two sub-models were developed to identify the functional types of the two groups of blocks using taxi GPS trajectories and satellite images. Based on the results of these two sub-models, a 10-fold cross-validation method based on stratified random sampling was used to adjust the threshold for determining the best way to integrate the two sub-models to obtain the best UFA identification strategy. The main innovations of this study include: (1) An integrated model of UFA identification was proposed. The functional type of some blocks was identified by the trajectory sub-model, while that of others was by the image sub-model. All these depend on the sufficiency of information of the trajectory data in the block. This new model can allow the advantages of social-sensing data and satellite images to be fully exploited and, thus, improves the identification accuracy. (2) A new index was designed and named STET, which was used as an index to measure the information of the trajectory data of blocks. A suitable sub-model was then selected to identify the UFA based on the STET index. (3) In the image sub-model, the multilabel classification method based on the residual neural network (MLC-ResNets) and You Only Look Once (YOLO) v3 algorithms were used to identify the land uses in the satellite image. Features with typical interpretation keys, such as schools, were identified using YOLO v3, while other features, such as residential areas, were identified using MLC-ResNets.
The rest of this paper is organized as follows. In Section 2, the study area and the dataset are briefly introduced. The methodology of the proposed model is illustrated in Section 3. The experimental results are presented and discussed in Section 4, and the conclusion is provided in Section 5.

Study Area
The study area is Chongchuan District, located at 31 • 58'48"N, 120 • 53'42"E in Nantong on the Southeast coast of Jiangsu Province, China ( Figure 1). This is where the Nantong Municipal Committee and Municipal Government are situated. The total area is around 215 square kilometers. In 2018, the resident population was 718,900, and the gross domestic product (GDP) was 81.951 billion Chinse Yuan. Since the subway in Nantong is yet to be constructed, taxis constitute one of the main travel modes for on-demand human mobility, with a total number of taxis of about 1200.

Datasets and Data Processing
Satellite images. This study uses satellite images from a Baidu map in 2018. The image has three RGB bands, with a resolution of 0.5 m/pixel. The preprocessing of the original image includes georeferencing and masking, as shown in Figure 2a. This image shows that the study area has a relatively higher proportion of construction land in the northwest and more green space in the east and south.
Taxi trajectory data. The GPS trajectory data, whose positioning mode is single-point positioning (SSP), are provided by the Nantong Taxi Management System from September, October and November 2018. The data are in a structured table file, which records the license plate number, phone number, time, longitude and latitude, speed, direction and passenger status. The sampling time interval is 30 s. The preprocessing includes the extraction of the pick-up and drop-off points, as shown in Figure 2b. The pick-up and drop-off points are determined by changes in the passenger status. When the status changes from empty to heavy, the point is the pick-up, while if it changes from heavy to empty, the point is the drop-off. Road, rivers network and cadastral data. Other datasets include road, river and cadastral data. The road and river networks are obtained from a Baidu Map Application Programming Interface (API) using a web crawler, and the cadastral data are obtained from the Nantong City Planning Bureau (http://nantong. gov.cn/ntsghj/). Through data preprocessing, such as topology correction, georeferencing, line-to-polygon conversion, etc., other vector datasets are obtained, as shown in Figure 2c. To achieve greater geometric accuracy, the local coordinate projection system, GCS_China_Geodetic_Coordinate_System_2000, is used. The cadastral data (Figure 2d) include 9 functional types.

Methodology
This paper proposes an integrated model for UFA identification (Figure 3). In the first step, we combine the road and river networks to generate blocks as spatial units and calculate the spatiotemporal information entropy of trajectory (STET) for each block. Secondly, we use taxi trajectory data and satellite images to develop two sub-models to be optimized in the process of the UFA identification. If the STET is higher than or equal to the threshold , then the trajectory sub-model Ψ(T i ) is used; otherwise, the image sub-model Φ(I i ) is used. Through the integrated model Γ(STET i , T i , I i ), defined as Equation (1), the identification of the UFA of each block is implemented. Finally, due to the imbalance of the urban function types of the blocks, the 10-fold cross-validation based on stratified random sampling is used to adjust the threshold, and the accuracy and kappa coefficient are used to evaluate the effectiveness of the model. Figure 3 shows the research framework and related work.
where n refers to the number of blocks, T i refers to the taxi trajectory data of block i , I i refers to the satellite image of block i and I (conditon) refers to the indicator function, whose value is 1 when the condition is true; otherwise, it is 0, and refers to the decision threshold between using the trajectory sub-model or using the image sub-model.

Generation of Blocks
According to the United States (U.S.) Census Bureau's definition of the block, a block is usually an area surrounded by humans and natural features, such as roads, rivers, lakes, mountains and cliffs [26]. It is the smallest granularity in urban planning and population statistics, so it is the smallest spatial unit in this study. Nantong is a city with developed traffic conditions along the river and sea. In this study, Nantong City is divided by a road network and river, generating several blocks. The road network and river have different levels. If the blocks are divided without distinguishing the levels, the block unit will be too small, which is inconsistent with the actual situation and will cause experimental difficulties. We recommend the roads and rivers of the third level for division, as shown in Figure 4, with a total of 482 blocks. The average area of the blocks is 17,3251 m 2 , ranging from the minimum area of 12,947 m 2 to 1,730,130 m 2 . The urban function type of this block is mainly the residential area, which is composed of some rural houses, divided by the Tongjia River, Tongjia Road and Shengli Road, as shown in Figure 4b. The maximum area of the block is 1,730,130 m 2 . The urban function type of this block is mainly the industrial area, which is composed of Nantong Tongxin Village, Nantong COSCO Shipbuilding Steel Structure Co., Ltd., and many other factories, as shown in Figure 4c.

STET Computing
Information entropy reflects the capacity of information in the data. The larger the amount of information in the data, the greater the information entropy, and when 0 < probability ≤ 1 e (e refers to the basis of the natural logarithm and equals to 2.71), the entropy tends to increase [27]. When the social-sensing data in the study area are sufficient, the reliability of using the data to infer the functional type of the area is higher [28]. This paper proposes a measure of the trajectory information of the blocks-namely, the spatiotemporal information entropy of the trajectory (STET) of each block-defined as Equation (2). The STET is calculated from the density of the pick-up and drop-off points in the block at each period: STET ∈ (0, 12.7]. The higher the STET, the higher the traffic density of the block-that is, the block is a hotspot area.
where n refers to the number of blocks, N ij refers to the number of pick-up and drop-off points in block i during the jth hour and S i refers to the area of block i , which should be e times greater than N ij after standardization.

Trajectory-Based Sub-Model
Taxis constitute one of the main means by which residents travel in the study area on-demand, and its GPS trajectory data can thus perceive the spatiotemporal laws of residents' travel behaviors. The spatiotemporal laws can effectively infer the distribution patterns of urban areas [19,20]. Therefore, when the trajectory data information is sufficient, no additional information needs to be added, and the trajectory sub-model Ψ(T i ) can be used to mine the urban function type and spatial distribution pattern-that is, extract the time-frequency series of the pick-up and drop-off points of each block and use the K-Means++ and kNN algorithm to identify the social functions of the block.

Time Frequency Series and K-Means++
Time-frequency series. The time-frequency series of the pick-up and drop-off points extracted from the taxi trajectory can reflect the flow of information in different periods of the block. By mining the spatiotemporal laws of human activities, the urban functional attributes can be effectively inferred. In a recent study of the social functions of the area of interest (AOI),  proposed the concept of the hour-day spectrum (HDS) approach, which performed well in identifying the pattern of the social functions in Nantong [19]. The study proposed six kinds of spectrums, reflecting the regularity of the region's changes over time. Since the taxi trajectories have a certain systematic error, in this paper, we generate block buffers according to the road widths [21] and generated the HDS for each block as a time-frequency sequence ( Figure 5). The six spectrum types are: PP i = pp i0 , . . . , pp ij , . . . , pp i23 , which represents pick-up points; pp ij , which represents the average number of pick-up points in the ith block during the jth hour; HPP i = hpp i0 , . . . , hpp ij , . . . , hpp i23 , which represents the pick-up points on holidays; WPP i = wpp i0 , . . . , wpp ij , . . . , wpp i23 , which represents the pick-up points on weekdays; DP i = dp i0 , . . . , dp ij , . . . , dp i23 , which represents the drop-off points; HDP i = hdp i0 , . . . , hdp ij , . . . , hdp i23 , which represents the drop-off points on holidays and WDP i = wdp i0 , . . . , wdp ij , . . . , wdp i23 , which represents the drop-off points on weekdays. With the difference in the block popularity or grade [19], although the trend of the HDS in the same type of block is almost the same, the numerical magnitude of the sequence is different, so normalization is required. This article uses Equation (3) to normalize the 6 kinds of HDS in various blocks: where n refers to the number of blocks; HDS iz refers to the zth HDS (PP, HPP, WPP, DP, HDP and WDP) of block i ; HDS iz refers to the normalized HDS iz ; max(HDS iz ) refers to the maximum value in HDS iz and min(HDS iz ) refers to the minimum value in HDS iz . K-Means++. Clustering algorithms are usually classified into partition-based, density-based and hierarchy-based types [29][30][31], among which the density-based clustering algorithm is often used to find data with density distribution features. It is not suitable for scattered data, and it has difficulty adjusting the parameters using prior knowledge from experiments [30]. The hierarchical clustering algorithm can partition data adaptively, but the model is inefficient and time-consuming [32]. The K-Means algorithm, as the representative of the partitioning data, has the advantages of effectively processing large data and high-dimensional data [33]. Compared with the K-Means algorithm, the K-Means++ algorithm optimizes the selection of clustering centers and reduces the impact of a poor selection of clustering centers [34]. The algorithm process is as follows: K samples are randomly selected as the clustering center, CC = {cc 1 , cc 2 , . . . , cc k }, ensuring that the distance between each cluster center is relatively far. The distance (similarity) between each cluster center and each sample x i is calculated. Each sample is classified as the nearest (highest similarity) cluster center. The mean value among the cluster samples is calculated as the cluster center. Then, one cycles through the abovementioned operations until the cluster center does not change or the maximum number of iterations is reached. The K-Means++ algorithm has two most important factors-namely, the measure of similarity between data and the number of clusters.

Clustering Analysis
We use Euclidean distance as the K-Means++ similarity index S iz , which is defined as Equation (4).
where n refers to the number of blocks; K refers to the number of clusters and cc kz refers to the zth HDS (PP, HPP, WPP, DP, HDP and WDP) of the kth cluster center. The number of K-Means++ clusters often depends on the external prior experience or internal aggregation indicators in the data [35]. In remote-sensing fields, when unsupervised methods are used to classify land use, scholars often set the number of initial classifications to 2 to 3 times the final result [36]. In similar studies, the silhouette coefficient or elbow method is used to determine the number of classifications. While these methods can measure the aggregation between samples and give a reasonable number of classifications, in fact, the number is lower than the actual number of categories, so only coarser-grained classifications can be performed, which is not convenient for more detailed work [37]. Therefore, we will use external prior empirical methods to determine the number of classification categories.
After clustering is completed, each cluster may contain multiple functional types, so we need to identify the UFA type represented by each cluster-that is, the clusters of the same urban functional types are merged based on the maximum proportion principle, which is defined as Equation (5).
where c k refers to the k th cluster, L C k refers to the original social functional attribute set of c k , K refers to the number of clusters, L c k refers to the final social functional attributes of c k , x refers to the social functional attribute element of c k and n x refers to the number of blocks with social functional attribute x.
After the K-Means++ training is completed, the kNN algorithm is used to classify the unknown data in combination with the classified results. Specifically, the similarity between each test datum and each classified datum is calculated, and the similarity index is shown in Equation (4). Degrees are sorted in ascending order. The top k categories of data with the lowest similarity to the test data are used as their categories.

Image-Based Sub-Model
When the trajectory and other social-sensing data cannot provide effective decisions due to insufficient information, the satellite image is the most effective way to identify UFA types. We can mine UFA types and morphological patterns based on the image sub-model Φ(I i ), with the detection of the distribution patterns of some characteristic landmarks or buildings in the area. That is, MLC-ResNets and YOLO v3 are used to identify the block image, and the confidence score and other information are generated. Based on the identification results, the decision tree algorithm is used to classify the UFA types.

MLC-ResNets
Compared with the classification of satellite images based on spectral features, convolutional neural networks can simulate human vision and make classifications based on the physical shape and texture features of images, and this has played an important role in modern computer vision [38][39][40][41]. ResNets has proved to be an important breakthrough in the field of deep learning in recent years. It is characterized by the addition of internal residual blocks using jump connections, which are easy to optimize and whose accuracy can be increased by adding more layers [42]. In the blocks with multiple feature types, for example, a block may include residential areas, factories, schools, etc. The general classification task will consider the block as one of them, and the multilabel classification task refers to a series of nonexclusive labels on blocks according to the probability distribution of the features. The result of a probability distribution is Φ m (I) = p m 1 , . . . , p m l , . . . , p m n , where p m l is the confidence score of the feature in the image I. In essence, the task of multilabel classification is to make a binary classification for each label. Therefore, when performing MLC-ResNets, the activation function at the end of the network needs to be set to the Sigmoid function, which has a value range of (0, 1) and is defined as Equation (6). The calculation result is often used to indicate the probability of things happening [43]. The loss function is set as the binary cross entropy, which is often used to measure the difference between two probability distributions and whether the model learning is sufficient. It is often combined with Sigmoid [44], and the combination is defined as Equation (7).
where σ(z) refers to Sigmoid function, z refers to the linear combination of the last layer input of the network and e refers to the basis of the natural logarithm (e = 2.71).
where J(θ) refers to the loss function, N refers to the sample size, x (i) refers to the ith sample, h θ x (i) refers to the activation function, which can be set to the Sigmoid function, and y (i) refers to the label of the ith sample.

YOLO v3
Some UFAs contain some typical geographic features, which can be used to infer the functional types of the areas, so these objects need to be detected based on satellite images. Recently, major breakthroughs have been made in object-detection algorithms in computer vision. The "You Only Look Once" (YOLO) algorithm is the representative one, which has a high performance and provides end-to-end prediction. YOLO v3, as the third generation of the YOLO algorithm, compared with the previous two generations, has a significantly improved classification accuracy and calculation speed and is suitable for detecting geographical entities that are not clustered [45][46][47][48]. The result is Φ y (I)= p y 1 , . . . , p y l , . . . , p y n , where p y l is the confidence score of target l in image I. The YOLO v3 algorithm uses a deep residual network to extract a series of multifeature layers of different sizes from the original picture and uses up-sampling to connect each feature layer. The network is trained by optimizing the comprehensive loss function, defined as Equations (8)- (13), to adjust the size, e.g., the width (w); height (h) and position, e.g., central coordinates (x, y) and category confidence (C) of the prior frame [47]. Loss = Loss 1 + Loss 2 + Loss 3 + Loss 4 + Loss 5 ) where Loss 1 , Loss 2 , Loss 3 , Loss 4 and Loss 5 refer to the central coordinate error, width height coordinate error, object confidence error, no object confidence error and classification error, respectively.

Decision Tree
In the case of insufficient trajectory information, Φ m , Φ y is used to mine useful information from the satellite image and calculate the confidence score of each type of object: P i = Φ m (I i ), Φ y (I i ) . Then, we combine with the traditional machine-learning algorithms Φ t to learn the classification of the hidden mode-namely, Φ(I i ) = Φ t (P i ).
Decision tree is a supervised classification method, which is based on a tree structure and has the advantages of high readability and speed. Decision tree usually consists of three steps: feature selection, tree generation and overfitting processing [49]. Feature selection belongs to feature engineering-namely, selecting reasonable and classifiable features for learning. The ID3 algorithm is often employed in the generation of a decision tree in the following way: starting from the root node, all possible information divergence is calculated, as defined in Equation (14). The greatest feature of the information divergence, its node characteristics, is set up by the characteristics of the node child nodes. These steps are repeated until the information divergence is small, or there are no features to choose from.
where D refers to the dataset, A refers to the feature, m refers to the number of labels, C j refers to the number of samples belonging to the jth label, |D| refers to the number of samples, n refers to the number of features, |D i | refers to the number of samples of the ith feature and C ij refers to the number of samples of the ith feature belonging to the jth label. The decision tree generation algorithm will generate the decision tree recursively, which leads to a high accuracy of the training set but a weak generalization ability. In other words, overfitting easily occurs, and it can be prevented by pruning or limiting the depth of the tree.

Image Analysis
Based on the field investigation and the prior background knowledge of Nantong, we found that most of the blocks that meet the requirements of the image sub-model are almost all factories, bare/farmland, rural land, nonopen schools and residential areas under construction. Based on this background, we performed an image classification for these features, which can greatly simplify the workload and allow for a quick prediction of the urban functional categories of the blocks.
Before the classification task, we analyzed all kinds of other areas and selected the appropriate method of image classification: (1) The spatial distribution of the school does not have aggregation and contains representative ground features, such as a playground. By identifying the playground and calculating the area proportion of the playground, the school type can be inferred. YOLO v3 has a better recognition effect on such objects, so it is suitable to use as the target detection algorithm.
(2) Since the physical characteristic of the factory and residential area are different and have spatial aggregation, it is difficult to use target detection, which is suitable for multitarget classification. (3) The spatial extent of bare/farmland, which is often used as a background area, is large, and the probability of residential factories can be compared to determine whether it belongs to this type. It is suitable for multi-objective classification tasks. Therefore, we identified playgrounds, factories, residential areas and bare/farmland as targets. Figure 6 shows the structure of YOLO v3 and MLC-ResNets.
After the classification based on the deep-learning algorithm, the confidence result of the target feature and the area proportion of the playground are obtained. Combined with the STET and the actual UFA type, the structured dataset is constructed, and classification learning experiments are conducted using the decision tree algorithm, as shown in Figure 7.

Stratified Random Sampling
Stratified random sampling first divides the overall samples into various types. Then, according to the ratio of the sample number of each type to the total number, the number of each type is determined. Finally, samples are drawn from each type according to the random principle. This method can ensure that, after data division, the proportion of categories in each dataset is consistent, and it is suitable for data with uneven sampling categories [50,51].

K-fold Cross-Validation
Cross-validation is often used to check the accuracy of the model. As the number of blocks is only a few, and the function type is unbalanced, it is easy to cause the verification results to be unrepresentative using simple cross-validation. Therefore, the applicable k-fold cross-validation divides the training set into k sub-samples, which means that a single sub-sample is reserved as the data of the validation model, and the other k-1 samples are used for training and are repeated k times. The average k-times result is used as the final estimation-among which, that of the 10-fold cross-validation is the most popular [52].

Stratified Random Sampling
Stratified random sampling first divides the overall samples into various types. Then, according to the ratio of the sample number of each type to the total number, the number of each type is

Kappa Coefficient
Kappa coefficient, defined as Equations (15)(16)(17), is used for consistency test the evaluation of the classification tasks of unbalanced data. Its value ranges from 0 to 1, of which 0.0~0.20 represents extremely low consistency, 0.21~0.40 represents general consistency, 0.41~0.60 represents moderate consistency, 0.61~0.80 represents high consistency and 0.81~1 represents almost complete consistency [53]. This index is often used in the study of land use classification in remote-sensing geoscience analyses. (17) where

Results and Discussion
All experiments in this part are conducted on the Jupyter Notebook, ArcGIS software platform, GeForce RTX 2080ti GPU, and other hardware platforms and Python libraries, such as Numpy, Pandas, Keras and Matplotlib, are used. Basic information on all blocks is shown in Table 1.  Figure 8 shows the STET value of each block. For example, the STET of South Street (a), Central South Century City (b) and Nantong East Passenger Station (c) are higher, indicating that there are more residential activities in this block, and the trajectory information is sufficient. The marginal region of the study area is mainly industrial or rural, and the STET is lower, indicating that there are fewer residential activities in this area, and there is little trajectory information. The STET results present a right-skewed distribution, so it is difficult to directly select . Therefore, based on the quantile of STET as the value of , in the following experiments (4.2 and 4.3), we will take the 50% quantile of STET (i.e., the median, = 0.0983) as the default threshold and adjust in Section 4.4 to select the optimal parameters of the model.  Figure 9 shows the HDS curves of different types of blocks, for which the STET values are greater than . It can be seen that different types of blocks have their unique spatiotemporal patterns, and there are some differences between the different types of HDS. Taking the business area as an example, the curves of the pick-up points show an upward trend from 6 a.m. to 10 p.m. and reach a peak at 10 p.m. This indicates that, with the time delay, residents leave the business area after shopping. On holidays, the curves are stable from 3 p.m. to 10 p.m. and are in a peak state, and on weekdays, the peak of the curve is often at 10 p.m., indicating that, during the holidays, residents have more free time to shop, while, on the weekdays, most of the residents shop after work. The curve of the drop-off points peaked at 10 a.m., indicating that most residents like shopping during this period. From 3 p.m. to 8 p.m., the curve of holidays is slightly higher than that of weekdays, which also confirms the difference between the above-mentioned residents' work and rest on holidays and weekdays. HDP, holiday drop-off points; WDP, weekday drop-off points; PP, pick-up points; HPP, holiday pick-up points and WPP, weekday pick-up points. Y-axis label: Res., residential area; Bus., business area; Edu., education area; Ind., industrial area; Adm., administrative area; Pub., public service area; Mix., mixed area; Sce., scenic spot and Bar., bare/farmland). Figure 10 shows that there are differences between the HDS of different types of blocks, and the similarity confusion matrix is calculated by Equation (4) based on these HDS. The lower the value, the higher the similarity. Among them, the similarity between the administrative area and the public service area is high, which indicates that the spatiotemporal characteristics of the two functional districts are similar, and they can provide some services for residents. Secondly, the similarity between the mixed area and the business area, the education area and the public service area is high, which indicates that there may be business land or schools in the mixed area.

HDS Result
However, due to the difference in the UFA level and property of the blocks, taking the public service area as an example, hospitals, stations and gymnasiums all belong to this function type, while these three types have their spatiotemporal laws, which leads to a large difference in the HDS trend, as shown in Figure 11. In order to distinguish their differences in the trajectory sub-model, we again divide education areas, residential areas and public service areas in a fine-grained way, as shown in Table 2.

Cluster Result
Taking Equation (4) as the similarity index and three times the fine-grained UFA number as the initial classification number to train the trajectory sub-model, we use a stratified random sampling method to select 80% of the data for training and 20% of the data for validation. Combined with the real cadastral data, we count the proportion of the original types in each cluster and merge the clusters by the principle of the maximum proportion. The results are shown in Table 3. Based on the results of the K-Means++ merging, the result is tested using the kNN algorithm. In this case, the accuracy and Kappa coefficient of the trajectory sub-model are 71% and 46%, respectively. Due to the low , the trajectory data information is insufficient, and the accuracy and kappa coefficient need to be improved. A comprehensive parameter adjustment is carried out in Section 4.4 in combination with the image model.

Results of the Image Sub-Model
The image recognition of UFA is a one-to-one process, which needs to be processed in blocks. Therefore, the satellite image of the block range that meets the conditions of the image model is used as the test sample, and part of the data is shown in Figure 12a. In this model, the education area, industrial area, residential area and bare/farmland are identified.  Figure 12b. The training and test images, which have been resized into 300*300 pixels, are labeled and trained with the network architecture shown in Figure 6b. Figure 13a shows the learning status of the network. While the curve of the verification set fluctuates greatly, the overall trend is downward. After the 50th epoch, the model tends to overfitting-that is, the loss of validation decreases and then rises, so the model of 50 epochs training is more suitable. Based on the same conditions, 400 pictures of the schools with playgrounds are selected as the training set. As shown in Figure 12c, the training samples are labeled and trained using the YOLO v3 network architecture of Figure 6a. Figure 13b shows the learning status of the network. The learning effect is good before the 90th epoch, and after that, the loss curve rises sharply, indicating that there is an exploding gradient problem. This is due to the large and complex structure of YOLO v3, which leads to the instability of the network weight update. Therefore, the model of 90 epochs training is adopted.

Decision Tree Result
After the recognition of the images, the confidence results of each image, the area rate of the playground and the STET value of the block are combined into a real category to establish a structured table. Taking part of the data shown in Figure 12a as an example, with its structured table shown in Table 4, through the probability of geographic features and other information, the urban functional area of the corresponding block can be inferred. For example, when the confidence result of the playground and the area rate is large, the real type is mostly like the education area. The stratified random sampling method is used to select 80% of the blocks that meet the conditions of the image sub-model as the training set and 20% as the test set. The decision tree algorithm is used to learn based on the structured table. The decision tree algorithm easily encounters an overfitting problem, which can be prevented by adjusting the tree height parameters. As shown in Figure 14, the model works best when the tree height is 5, with a test accuracy of 85% and a Kappa coefficient of 79%.

Model Parameter Adjustment
In this section, we use the 10-fold cross-validation method based on stratified random sampling to integrate the trajectory sub-model and the image sub-model by adjusting the parameter to achieve the optimal effect of the recognition model. Figure 15 shows the identification results of the integrated model based on different STET quantiles. The learning curve of the model rises with the increase of . When the quantile reaches 90% ( = 0.491), the identification effect of the model is the best, with an average test accuracy of 82.0% and kappa coefficient of 73.5%, which shows that the identification results are highly consistent, and the recognition effect of the model is poor under the condition of low data information. It is worth noting that, when the STET quantity = 0 or 1, this means that only the trajectory sub-model or image sub-model work, which indicates that the identification effect is not good when there is only either a single model or a single data source. Especially when only trajectory sub-models are used, the accuracy and kappa coefficient are the lowest.

Discussion
The threshold-adjusted model is generalized for the purposes of the urban functional area identification study of Gangzha District. The identification results are shown in Figure 16.
The identification result of UFA via the adjusted model is shown in Figure 16, where (a) denotes the real distribution of UFAs, and (b) illustrates the UFA identification result, which shows that the precision and Kappa coefficient are 78.9% and 71.2%, respectively. The experimental result shows that the models based on the divide and conquer strategy have strong generalizability. Since the main UFA types in Gangzha District are industrial areas, residential areas and farmlands/bare lands, the STET values are relatively low. Only five blocks satisfy the condition of the trajectory sub-model, four of which are in a business area, and the other one is a public service area (train station). The model can identify these business areas, but it is hard to deal with the train station, because there are fewer samples in the train sets. The main reasons for the erroneous results are explained below.

Data Error
The trajectory may contain a systematic error of 5-10 m [21], as shown in Figure 17. In fact, when approaching the destination, some drivers operate the metering device in advance, artificially changing the vehicle's heavy status to empty. This results in a large error between the partially extracted drop-off point and the actual drop-off point.

Multi-Functionality of the Block
Some blocks contain multiple kinds of geographical entities, so it is difficult to define them as social functional types. As shown in Figure 18, the dominant function type of this type of block is a residential area (including Sansan Flat, Yin Garden, Hering Garden and Hongfengyuan), but it also contains other types of geographic entities, such as schools and hotels, which affect the ability of the trajectory data to perceive the residents' activities. Recently, some experiments with UFA identification were conducted using social-sensing data or satellite images. For example, Liu (2020) studied the identifications and patterns of UFA using K-Medoids and kNN algorithms based on cab trajectories [20]. Some methods in his work were partly similar to our trajectory sub-model, but limitations still existed. For example, only a single data source was used, the data were not standardized and the cluster similarity index was selected using dynamic time warping, resulting in a high time complexity, etc. A combination of satellite images and cell phone-positioning mobiles were applied in recent studies [11,25]. Compared with these two studies, our research applies the model integration strategy, which fully exploits the advantages of satellite images and trajectory data. At the same time, only one data source is used in a certain area, which improves the operating efficiency of the model to a certain extent.

Conclusions
This paper proposed an integrated UFA identification model, which fully exploits the advantages of trajectory and image data. We divided an urban area into blocks based on road and river network data and treated the blocks as the research units for UFA identification. The STET was then calculated for each block, from which the trajectory or image sub-model was selected and analyzed. The trajectory sub-model based on K-Means++ and kNN worked in the blocks with enough trajectory data, and the image sub-model based on MLC-ResNets, YOLO v3 and decision tree played a complementary role in the remaining blocks. The proposed model was validated by conducting an experiment in Nantong City. By using a 10-fold cross-validation based on stratified random sampling, the credibility of the identification was increased. The results showed that the average accuracy reached 82.0%, with an average kappa of 73.5%, a significant improvement compared to most existing studies.
This paper applied machine-learning and deep-learning algorithms, as well as an integrated strategy to UFA recognition research, providing a novel approach to research in related fields. Particularly, the proposed new index STET can be extended to applications in other social-sensing data, which will make full use of social-sensing data and remote-sensing images in identifying urban functional areas. Future research in incorporating multisource data, such as urban bicycle data, mobile phone positioning data, and social media data, will further improve the accuracy and efficiency of this tool. Nevertheless, the present paper provides insights into the distribution patterns of urban areas and a more advanced approach by using big data mining. The results also suggest that, when based on multisource data mining, data precision and errors must be strictly checked to ensure high data quality.