Social Sensing for Urban Land Use Identification

Urban land use maps can reveal patterns of human behavior through the socioeconomic and demographic characteristics of urban land use. Remote sensing, which holds detailed and abundant information on spectral, textural, contextual, and spatial configurations, is crucial to obtaining land use maps that reveal changes in the urban environment. However, social sensing is essential to revealing the socioeconomic and demographic characteristics of urban land use. In this study, a data mining approach that combines data cleaning (outlier removal) and machine learning is used to achieve land use classification from remote and social sensing data. In bicycle and taxi density maps, the daytime destination and nighttime origin densities reflect work-related land uses, including commercial and industrial areas. By contrast, the nighttime destination and daytime origin densities capture residential areas. The accuracy assessment of the classified land use maps shows that the integration of remote and social sensing, using the decision tree and random forest methods, yields accuracies of 83% and 86%, respectively. Thus, this approach facilitates accurate urban land use classification. Urban land use identification can aid policy makers in linking human activities to the socioeconomic consequences of different urban land uses.


Introduction
Urban land use carries crucial information on human activity for urban planning, economic analysis, and hazard and pollution management [1]. Issues associated with land use have attracted considerable interest from communities because of the central role of land use as a cause and a consequence of human activity. The demand for urban land use maps by urban authorities, researchers, and citizens has steadily increased, especially in fast-developing regions where the timely acquisition of up-to-date land use information is crucial [2].
Land use is a human-environment system consisting of the relationship between human activities and the socioeconomic, environmental, and demographic characteristics of urban land. Uncovering land uses from remotely sensed imagery alone, however, is rather difficult [3,4] because spectral reflectance is not directly related to socioeconomic features and objects. Social sensing broadly refers to a set of sensing and data collection approaches in which data are collected from humans or from devices on their behalf [4]. Social sensing data in cities, such as traffic information and crowd movement, have become an important means of analyzing urban development issues. Social sensing intensity is usually high in urban centers, where human activities are relatively dense, and relatively low in suburbs. Such social sensing data include bike trajectories, taxi trajectories, smart card records in public transportation systems, social media or social networking data, and so on [4][5][6]. Previous studies have used social sensing data for mapping dynamic urban land use patterns [7,8] and traffic and transportation systems [5,6]. Human mobility can be
The Sentinel-2A remotely sensed imagery of New York acquired on 7 June 2016 was used as the remote sensing data source. Sentinel-2A imagery has thirteen spectral bands, ranging from the visible and the near-infrared to the shortwave infrared, at spatial resolutions of 10, 20, and 60 m on the ground. The bands at each spatial resolution have a different function: the 10 m bands are used for basic land cover classification, the 20 m bands are used to enhance the retrieval of geophysical parameters, and the 60 m bands are used for atmospheric correction and cirrus-cloud screening. The red, green, blue, and NIR bands (10 m spatial resolution) from the Level-1C (L1C) image are used in this study.

Social Sensing Data
Bike data and taxi data are chosen as the social sensing data in this study. Social sensing data are used to understand human behavior in the study area for the benefit of urban authorities and citizens. The bike and taxi datasets are openly accessible, and the retrieved data cover the whole month of June 2016, which was selected as the study period. The bike dataset for this study was retrieved from the CitiBike Bikeshare System of New York City. Each bike trip record contains the following fields:
The taxi dataset for this study was retrieved from the New York City Taxi and Limousine Commission. Each taxi trip record contains the following fields:  [11].
In this study, street network and point of interest (POI) data obtained from OpenStreetMap (OSM), which are openly accessible and licensed under the Open Data Commons Open Database License, are used because OSM provides a more detailed representation of land use/land cover than remote sensing does [12]. Figure 1 shows the two kinds of streets in the OSM street network. The roads in residential areas (highway = residential in OSM tags) are defined as residential streets, and the primary and secondary roads (highway = primary or secondary in OSM tags) are defined as nonresidential streets in this study.

Class Definition
Land use and land cover are key factors in understanding important issues such as climate change, natural resource management, and urban and regional planning [13]. A land cover map describes the physical and biological cover of an area (e.g., grassland or water), whereas a land use map reveals human activity (e.g., infrastructure) [14][15][16]. Defining classes is required to generate a land cover and land use map. The detailed definitions of land use categories and building categories are provided in Table 1. The six major categories of land cover and land use are water, open space, industrial, residential, office, and entertainment.

Method
The data mining framework adopted in this study, i.e., temporal and spatial analysis, data cleaning, and machine learning, is presented in Figure 2. The bike and taxi datasets were processed to understand when and where people are in the city and thereby discover human behavior patterns in the city. Temporal analysis was performed on the bike and taxi datasets to reveal when people are in the city and, in particular, the peak times at which people ride bikes and taxis in one day. Human behavior was divided into two categories, namely, human behavior during weekdays and human behavior during weekends. To visualize the average bike and taxi usage in one day, we divided a month's worth of bike and taxi usage into four, on the basis of the number of weeks in a month. The average bike and taxi data were split into 24 hourly bins to investigate the peak times during weekdays and weekends. Wednesday and Sunday were identified as the days on which people ride bikes and taxis the most. Therefore, the bike and taxi data on Wednesday, which represents a weekday, and on Sunday, which represents a weekend, were selected to generate bike and taxi usage.
ISPRS Int. J. Geo-Inf. 2020, 9, 550
Meanwhile, spatial analysis was carried out by analyzing the bike and taxi density maps to reveal where people are in the city. The density maps of bikes and taxis were generated using kernel density estimation (pixel size = 10 m, search radius = 100 m). The coordinate information gives the exact locations of the stations where people take and return bikes and of the places where taxi drivers pick up and drop off passengers. The bike and taxi density maps are derived from people's usage of bikes and taxis in the city. The origin density map indicates the locations where people depart from stations on their bikes or board taxis at pick-up locations. The destination density map indicates the locations where people return their bikes at arrival stations or alight from taxis at drop-off locations.
As the bike and taxi point distributions contain uncertain factors and outliers, data cleaning was performed to remove them. The outliers were removed using the street network from the OSM dataset. After data cleaning, effective bike and taxi density maps for land use were generated. All of these steps were used to reveal the human behavior patterns in the city and to classify land use through the integration of remote sensing and social sensing data. Finally, the decision tree and random forest methods were used for land use classification from the training samples. Training samples were obtained by overlaying the remotely sensed imagery and the bike and taxi density maps. The remotely sensed imagery and the bike and taxi density maps, e.g., the origins and destinations at the first and second peaks, were selected as the features for classification. For accuracy assessment, testing samples were drawn on the basis of MapPLUTO, which the New York government uses as a ground truth or reference map. The land cover and land use categories included water, open space, industrial, residential, office, and entertainment.
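The kernel density estimation step in the Method can be illustrated with a minimal sketch. The pixel size (10 m) and search radius (100 m) are taken from the text; the quartic kernel, the function name kernel_density, and the grid arguments are assumptions made for illustration, because the study does not state which kernel or software was used.

```python
import math


def kernel_density(points, x_min, y_min, n_cols, n_rows, cell=10.0, radius=100.0):
    """Return an n_rows x n_cols grid of density values (points per square metre).

    A quartic (biweight) kernel is assumed here; the study only reports
    the pixel size (10 m) and the search radius (100 m).
    """
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    # Normalisation so that the kernel of one point integrates to 1.
    norm = 3.0 / (math.pi * radius ** 2)
    reach = int(math.ceil(radius / cell))  # cells to visit on each side
    for px, py in points:
        col0 = int((px - x_min) // cell)
        row0 = int((py - y_min) // cell)
        for row in range(max(0, row0 - reach), min(n_rows, row0 + reach + 1)):
            cy = y_min + (row + 0.5) * cell  # y of the cell centre
            for col in range(max(0, col0 - reach), min(n_cols, col0 + reach + 1)):
                cx = x_min + (col + 0.5) * cell  # x of the cell centre
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                if d2 < radius ** 2:
                    grid[row][col] += norm * (1.0 - d2 / radius ** 2) ** 2
    return grid


# Hypothetical example: a single pick-up point in a 300 m x 300 m window.
density = kernel_density([(150.0, 150.0)], 0.0, 0.0, n_cols=30, n_rows=30)
```

Running the function once on origin points and once on destination points would give the two kinds of density map described above. Coordinates are assumed to be in a projected system with metre units.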

Data Cleaning
Data cleaning was performed to remove the outliers from the bike and taxi point distributions. The outliers, e.g., data points far away from the commercial and residential streets, were removed. As revealed in this study, most people in the study area tend to ride bikes and taxis during weekdays to go to their offices from their homes. By contrast, most people ride bikes and taxis during weekends to go to entertainment areas from their homes. Outliers arise because some people ride bikes and taxis during weekdays to go to places other than their offices and, likewise, during weekends to go to places other than entertainment areas. Therefore, OSM was used to remove these kinds of outliers. Two kinds of streets obtained from OSM, namely, nonresidential streets and residential streets, were used. Residential streets are located in residential areas, whereas nonresidential streets are located close to office areas. On the basis of the observation of OSM data in the study area, entertainment points are also located along the nonresidential streets.
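The street-based outlier removal can be sketched as a distance filter. The study does not report a distance threshold or an implementation, so the max_dist value, the function names, and the representation of streets as straight segments are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (all in metres)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def remove_outliers(points, street_segments, max_dist):
    """Keep only the points lying within max_dist of at least one street segment."""
    return [
        p for p in points
        if any(point_segment_distance(p, a, b) <= max_dist for a, b in street_segments)
    ]


# Hypothetical example: one residential street and three trip origins.
residential = [((0.0, 0.0), (100.0, 0.0))]
origins = [(50.0, 10.0), (50.0, 80.0), (130.0, 0.0)]
cleaned = remove_outliers(origins, residential, max_dist=50.0)
```

In this toy case, the point 80 m from the street is dropped and the other two are kept. In practice, the segment list would come from the OSM ways tagged as residential or as primary/secondary.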

Decision Tree
The decision tree is a supervised classification method that is often used in land use classification because it is an efficient algorithm for classifying large datasets. The decision tree can handle data measured on different scales without any assumptions regarding the frequency distribution of the data [17]. It is a top-down classification strategy based on automatically selected rules that partition a set of given entities into small classes [18]. Several decision tree learning algorithms have been proposed, including classification and regression tree (CART) [19], iterative dichotomizer version 3 (ID3) [20], and C4.5 (an industrial version of ID3). These algorithms differ in how they quantify the distinction between entities and in the splitting criteria they apply. CART, as implemented in the rpart package of the R programming environment, was used in this study.
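To illustrate how a CART-style tree selects a rule, the following sketch searches for the threshold on a single feature that minimizes the weighted Gini impurity, which is the default splitting criterion for classification in rpart. It is not the rpart implementation; the function names and the toy feature values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:  # no threshold can separate identical values
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((low + high) / 2.0, score)
    return best


# Hypothetical feature: normalised morning destination density per sample.
density = [0.1, 0.2, 0.3, 0.8, 0.9]
classes = ["residential", "residential", "residential", "office", "office"]
threshold, impurity = best_split(density, classes)
```

A full tree would apply this search to every feature at every node and recurse on the two resulting subsets.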

Random Forest
The random forest is an ensemble learning method that uses multiple independently constructed decision trees, each of which uses a unique bootstrap sample of the data [21]. The random forest algorithm has two levels of randomization in each tree [22]. The first is bootstrap aggregation, in which each tree is trained on a sample drawn with replacement from the training data; roughly two-thirds of the training data appear in each bootstrap sample, and the remaining portion is left out. The second level of randomization is at each node of the individual decision tree, where the best split is identified among a random subset of the variables. The classification result is obtained by a majority vote over all the decision trees, and the vote of each tree carries the same weight [23]. In this study, the random forest model is built in the R programming environment (randomForest package). The default parameter setting, i.e., ntree = 500 (number of trees), is used.
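The two levels of randomization and the majority vote can be sketched as follows. This is not the randomForest package; the helper names, the seed, and the feature counts are assumptions chosen for illustration, and the sample size of 700 only mirrors the number of training samples reported in this study.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (first level of randomization)."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, indices):
    """Indices that were never drawn and are therefore left out of the tree."""
    chosen = set(indices)
    return [i for i in range(n) if i not in chosen]


def random_feature_subset(n_features, m, rng):
    """Pick m candidate features for one node (second level of randomization)."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(predictions):
    """Combine equally weighted tree predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_indices(700, rng)
left_out = out_of_bag(700, sample)
oob_fraction = len(left_out) / 700  # close to one-third on average
candidates = random_feature_subset(n_features=8, m=2, rng=rng)
label = majority_vote(["office", "residential", "office"])
```

The left-out fraction is close to one-third, which corresponds to the "remaining portion of the data" mentioned above.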

Training and Testing Sampling
The New York government has an openly accessible land use map called MapPLUTO, which is obtained from the Department of City Planning of New York City (NYC Planning). MapPLUTO was created in 2016 and contains extensive land use and geographic data in shapefile and geodatabase formats. MapPLUTO merges the data from PLUTO, which contains extensive land use and geographic data at the tax lot level in a comma-separated value (CSV) file format and is maintained by city agencies, with the Department of Finance's Digital Tax Map (NYC, 2018). Therefore, the training and testing samples in this work were drawn from MapPLUTO, which serves as the reference map. The 700 training samples were used to construct the decision tree and random forest models. The 300 testing samples were used to measure the accuracy of the classified maps. Figure 3 shows MapPLUTO as the reference map.

Accuracy Assessment
Accuracy assessment is the most important step in any classification method because it reveals how accurate the classification results are [24]. In this study, accuracy assessment was performed to investigate how accurate the model is. Specifically, the percentage of the testing samples from the reference map that was correctly classified by the model was estimated. The confusion matrix method was selected as the accuracy assessment tool in this study. In general, the confusion matrix yields four accuracy measures, namely, overall accuracy, producer's accuracy, user's accuracy, and the kappa coefficient. Overall accuracy denotes the total percentage of the reference or actual data correctly classified by the model. User's accuracy denotes the percentage of the samples assigned to a class by the model that actually belong to that class on the reference map, i.e., the accuracy from the map user's point of view. Producer's accuracy denotes the percentage of the reference samples of a class that are correctly classified by the model, i.e., the accuracy from the map maker's (producer's) point of view [25]. The confusion matrix was established to investigate how good the model was by using the testing samples from MapPLUTO.
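The four measures can be computed from a confusion matrix as in the sketch below. The layout (rows as reference classes, columns as predicted classes), the function name, and the two-class matrix are assumptions made for illustration; the numbers are not results from this study.

```python
def accuracy_metrics(matrix):
    """Compute accuracy measures from a square confusion matrix.

    matrix[i][j] is the number of testing samples whose reference class
    is i and whose predicted class is j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(matrix[i]) for i in range(k)]  # reference totals
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # predicted totals

    overall = correct / total
    producers = [matrix[i][i] / row_totals[i] if row_totals[i] else 0.0 for i in range(k)]
    users = [matrix[j][j] / col_totals[j] if col_totals[j] else 0.0 for j in range(k)]

    # Agreement expected by chance, used by the kappa coefficient.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected) if expected < 1.0 else 0.0
    return {"overall": overall, "producers": producers, "users": users, "kappa": kappa}


# Made-up two-class matrix (e.g., residential vs. office), not study data.
toy = [[40, 10],
       [5, 45]]
result = accuracy_metrics(toy)
```

For this toy matrix, 85 of 100 samples lie on the diagonal, so the overall accuracy is 0.85; the chance agreement is 0.5, which gives a kappa coefficient of 0.7.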

Temporal Analysis of Bike and Taxi Data
Human behavior patterns can be determined through temporal analysis, which was performed in this study to reveal when people are in the city on the basis of bike and taxi usage. Figure 4 shows the average bike and taxi usage on a weekday, i.e., the average number of people who rode bikes and taxis in a 24-h period during weekdays in the whole month of June 2016. Interesting insights can be gleaned from the results. For example, the average bike and taxi usage on a weekday has two peak times, namely, 08:00-09:00 and 18:00-19:00. The former refers to the period when people leave the house to go to the office. The latter refers to the period when people leave the office to go home. Another interesting insight is the similarity of the graphs of bike and taxi usage for weekdays. Both graphs show a decline in usage from 00:00-01:00 to 03:00-04:00, indicating the limited use of bikes and taxis at those times. Both graphs also show an increase in usage from 04:00-05:00 to 08:00-09:00, indicating people's regular use of bikes and taxis to travel to the workplace. Another decline is noted from 09:00-10:00 to 12:00-13:00, indicating periods when people are working. An increase in usage is noted from 13:00-14:00 to 18:00-19:00, indicating people's lunch break and departure from the office, respectively.
Figure 5 shows the average bike and taxi usage during weekends, i.e., the average number of people who rode bikes and taxis in a 24-h period during weekends in the whole month of June 2016. Interesting insights can also be gleaned from these results. The average bike and taxi usage during weekends indicates different peak times: bike usage shows one peak time at 17:00-18:00, whereas taxi usage shows one peak time at 18:00-19:00. This result indicates a one-hour difference between the peaks of bike and taxi usage, although the two follow a similar pattern. Weekend peak times refer to the period when most people ride bikes and taxis to go to entertainment areas from their home and vice versa.
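The hourly usage profiles behind these graphs can be sketched as a simple binning of trip start times by hour and day type. The function names and the four example timestamps are hypothetical; real input would be the start-time field of the bike or taxi trip records.

```python
from datetime import datetime


def hourly_profile(start_times):
    """Average number of trips per hour of day for weekdays and weekends."""
    counts = {"weekday": [0] * 24, "weekend": [0] * 24}
    days = {"weekday": set(), "weekend": set()}
    for t in start_times:
        # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if t.weekday() >= 5 else "weekday"
        counts[kind][t.hour] += 1
        days[kind].add(t.date())
    # Divide by the number of distinct days observed to obtain daily averages.
    return {
        kind: [c / max(1, len(days[kind])) for c in counts[kind]]
        for kind in counts
    }


def peak_hour(profile):
    """Hour of day (0-23) with the highest average usage."""
    return max(range(24), key=lambda h: profile[h])


# Hypothetical trips: 1 June 2016 is a Wednesday, 5 June 2016 is a Sunday.
trips = [
    datetime(2016, 6, 1, 8, 15),
    datetime(2016, 6, 1, 8, 40),
    datetime(2016, 6, 1, 18, 5),
    datetime(2016, 6, 5, 17, 30),
]
profile = hourly_profile(trips)
weekday_peak = peak_hour(profile["weekday"])
weekend_peak = peak_hour(profile["weekend"])
```

An hour value of 8 corresponds to the 08:00-09:00 bin used in the text.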

Weekend Time
The graphs of bike and taxi usage during weekends are similar. The weekend graphs show declines in bike and taxi usage from 00:00-01:00 to 05:00-06:00 and to 06:00-07:00, respectively. This result indicates that people seldom ride bikes and taxis during those times. In addition, both graphs show increases in bike and taxi usage until the peak times of 17:00-18:00 and 18:00-19:00, respectively. This result indicates that people ride bikes and taxis at those times to go to entertainment areas. The graphs also show a decline until midnight, indicating the period when people leave entertainment areas to go home.

Spatial Analysis of Bike and Taxi Density Maps
The bike and taxi density maps for those peak times were obtained to reveal human behavior in the city by visualizing the areas where people ride bikes and taxis. The bike and taxi density maps were divided into two types, namely, the origin density map and the destination density map.

Weekday Time
Bike and taxi usage on weekdays showed two peak times. The first peak time is 08:00-09:00, and the second peak time is 18:00-19:00. The former indicates the period when people ride bikes and taxis the most to go to work from their homes, whereas the latter indicates the period when people ride bikes and taxis the most to go home. Therefore, the origin and destination density maps of bikes and taxis at the first peak time (08:00-09:00) and at the second peak time (18:00-19:00) during weekdays were generated to reveal human behavior in the city. Figures 6 and 7 show the bike and taxi density maps during the first and second weekday peak times in the origin and destination areas. At the first weekday peak time (08:00-09:00), the bike and taxi density hotspots are residential areas in the origin density map and office areas in the destination density map. At the second weekday peak time (18:00-19:00), the hotspots are office areas in the origin density map and residential areas in the destination density map.

Figure 6. Bike density maps during weekday peak times: (a) origin density map at 08:00-09:00; (b) destination density map at 08:00-09:00; (c) origin density map at 18:00-19:00; (d) destination density map at 18:00-19:00.

Weekend Time
During weekends, bike usage shows one peak time at 17:00-18:00, whereas taxi usage shows one peak time at 18:00-19:00. This result indicates a one-hour difference in peak times but a similar pattern. Weekend peak times refer to the periods when people ride bikes and taxis the most to go to entertainment areas from their home and vice versa. The origin and destination density maps of bike usage during the weekend peak time of 17:00-18:00 and of taxi usage during the weekend peak time of 18:00-19:00 were generated to reveal human behavior in the city.
In Figure 8, the origin and destination areas during the weekend peak times of bike and taxi usage are residential and entertainment areas, respectively. Moreover, during the second weekend peak time of bike usage, the origin and destination areas reflect entertainment and residential areas, respectively.
Generally, during the first peak time on a weekday, people start their journeys in residential areas, and their destinations are office areas. During the second peak time on a weekday, people start their journeys in office areas, and their destinations are residential areas. The weekend peak times refer to the periods when people ride bikes and taxis the most to go to entertainment areas from their homes and vice versa. On the basis of bike and taxi usage, human behavior in a city can be identified and analyzed. Bike and taxi usage hotspots in traffic source-sink areas are related to traffic intensities and land use patterns [4,10].

Effects of Data Cleaning on Point Distribution
The bike and taxi point distributions show the locations where people ride bikes and taxis. Data cleaning was performed to remove the outliers from the bike and taxi point distributions by using the OSM nonresidential and residential streets.

Weekday Time
Bike and taxi usage on weekdays shows the same peak times at 08:00-09:00 and 18:00-19:00. The former refers to the period when people ride bikes and taxis to leave the house and go to the office. The latter refers to the period when people ride bikes and taxis to leave the office and go home.
The data cleaning procedure was employed to remove the outliers of the point distributions in the origin and destination locations by utilizing the residential and nonresidential streets. At the peak time of 08:00-09:00, the origin points are expected in residential areas along residential streets, and the destination points are expected in office areas along nonresidential streets. Therefore, at the peak time of 08:00-09:00, the origin points not located in a residential street and the destination points not located in a nonresidential street are treated as outliers and removed (Figures 9 and 10). At the peak time of 18:00-19:00, the origin points are expected in office areas along nonresidential streets, whereas the destination points are expected in residential areas along residential streets. Therefore, at the peak time of 18:00-19:00, the origin points not located in a nonresidential street and the destination points not located in a residential street are removed. The bike and taxi point distributions at those same peak times before and after data cleaning were generated (Figures 11 and 12).

Weekend Time
The weekend peak times of bike and taxi usage are 17:00-18:00 and 18:00-19:00, respectively. During these weekend peak times, most people ride bikes and taxis from home to entertainment areas. The data cleaning procedure again uses the residential and nonresidential streets to remove outliers from the point distributions of the origin and destination locations. At these peak times, origins are expected in residential areas on residential streets, whereas destinations are expected in entertainment areas on nonresidential streets. Therefore, origin points not located on a residential street and destination points not located in an entertainment area on a nonresidential street are removed. The bike and taxi point distributions at the different peak times before and after data cleaning are shown in Figures 13 and 14, respectively.


Accuracy Assessment of Land Use Model
Table 2 shows the overall accuracy and kappa coefficient values of the urban land use models for the six classes. With remote sensing information only, the overall accuracy was 69% for the decision tree and 63% for the random forest. With remote and social sensing data, the overall accuracy increases to 78% for the decision tree and 81% for the random forest. After data cleaning, the overall accuracy reaches 83% for the decision tree and 86% for the random forest. The best kappa coefficients of the decision tree and random forest classified maps were 0.80 and 0.82, respectively. The results show that effective data cleaning of remote and social sensing data supports an accurate urban land use classification. The results also indicate that, for the six-class case, the random forest classified map is more accurate than the decision tree classified map.
Tables A1-A6 show the confusion matrices of the decision tree and random forest classified maps resulting from the integration of remote and social sensing with data cleaning. The results indicate a marked improvement from integrating social and remote sensing data with data cleaning, relative to the sole use of remote sensing data. With remote sensing alone, the user and producer accuracies are low in entertainment, industrial, commercial, and residential areas. Taking the decision tree as an example, user accuracy (commission error) improves in office, residential, and industrial areas (18%, 25%, and 21%, respectively), and producer accuracy (omission error) improves in residential and entertainment areas (34% and 40%, respectively).
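For readers who want to reproduce such summary statistics, the sketch below derives overall accuracy, Cohen's kappa, and producer and user accuracies from a confusion matrix. It assumes rows hold reference classes and columns hold mapped classes; the function name `accuracy_metrics` and the two-class matrix are hypothetical and do not come from Tables A1-A6.

```python
def accuracy_metrics(cm):
    """Compute summary accuracy statistics from a square confusion matrix.

    cm[i][j] is the number of samples with reference class i mapped as class j
    (an assumed layout). Every class needs at least one reference sample and one
    mapped sample, otherwise the per-class ratios divide by zero.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(k)]
    row_tot = [sum(cm[i]) for i in range(k)]                       # reference totals
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # mapped totals

    overall = sum(diag) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    kappa = (overall - expected) / (1 - expected)

    producers = [d / r for d, r in zip(diag, row_tot)]  # 1 - omission error
    users = [d / c for d, c in zip(diag, col_tot)]      # 1 - commission error
    return overall, kappa, producers, users


# Hypothetical two-class matrix (illustrative only).
cm = [
    [40, 10],
    [5, 45],
]
overall, kappa, producers, users = accuracy_metrics(cm)
print(round(overall, 2), round(kappa, 2))  # 0.85 0.7
```

In this example the chance agreement is 0.5, so an overall accuracy of 0.85 corresponds to a kappa of 0.7.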

Urban Land Use Map
The land use classified map is generated in three ways: using remote sensing data only, using remote and social sensing data without data cleaning, and using both data sources with data cleaning. Figures 15 and 16 compare the six-class land use classified maps from these three methods for decision tree and random forest classification. In both the decision tree and random forest maps produced from remote and social sensing data without data cleaning, the industrial areas in Brooklyn appear randomly distributed. When only remote sensing is used, the distribution of offices in the random forest classified map is more random than that in the decision tree classified map, which suggests that random forest relies on social sensing data to delineate office areas. With remote and social sensing data and data cleaning, the random forest classified map shows a higher density of office and residential areas in Central Manhattan than the decision tree classified map.
Additional social sensing data can improve model performance. The combination of remote and social sensing data is a promising way to map detailed urban land use in a large metropolitan area [9]. Bike and taxi density maps are applied to reliably estimate the intensity, duration, and frequency of human activity patterns [26]. However, some bias or uncertainty may exist in the use of social data to detect the urban land use types of a city [26]. For instance, people who use bikes or taxis on weekdays do not always travel from home to office areas; some travel from home to educational places. Nevertheless, most people use bikes and taxis to travel from their homes to office areas on weekdays. On the basis of this rule, outliers that do not belong to residential and office areas are removed from the density maps. After data cleaning, the land use maps from both models show an improvement of 5 percentage points in overall accuracy in comparison with those obtained without data cleaning.
Sensing-based land use identification can help policy makers link land use intensity and human activity to the socioeconomic consequences of different urban land uses within a landscape. On the basis of the proposed approach, we will conduct further research to identify the patterns of urban land use classes, such as compactness of a specified land use type, degree of urban change, and expansion rate [27].

Conclusions
This study considered the dynamic patterns associated with various human activities and behaviors for urban land use classification. Density maps were generated to show where most people are located in the city at peak times. The bike and taxi density maps of the origin and destination areas reflect the location of residential and office areas on weekdays and the location of residential and entertainment areas on weekends. These density map patterns help in understanding human behavior in a city. The daytime destination and nighttime origin densities reflect work-related land uses, including commercial and industrial areas. By contrast, the nighttime destination and daytime origin densities capture the pattern of residential areas. Bike and taxi usage also differs between weekdays and weekends. During weekdays, bike and taxi usage shows the same two peak times, 08:00-09:00 and 18:00-19:00, and the usage graphs follow nearly identical patterns. The shared peak times indicate that most people who ride bikes or taxis go to work and return home. During weekends, bike and taxi usage shows different peak times, and the usage graphs are similar, although less closely than on weekdays. The weekend peak times correspond to the periods when most people ride bikes and taxis from their homes to entertainment areas and back.
On the basis of the remote sensing and social sensing data, the machine learning methods generate a land use classified map of an urban area in New York, USA, with six classes, namely, water, open space, industrial, residential, office, and entertainment. A confusion matrix-based accuracy assessment is used to evaluate the decision tree and random forest classified maps. The overall accuracies of the decision tree and random forest classified maps for the six classes reach 83% and 86%, respectively. The land use classified maps are most accurate when remote sensing and social sensing data are integrated with data cleaning. The accuracy assessment indicates that effective mining of remote and social sensing data supports an accurate urban land use classification. In future work, area-based accuracy assessment and other information for data cleaning can be considered.
