Grid-Based Essential Urban Land Use Classification: A Data and Model Driven Mapping Framework in Xiamen City

Abstract: Accurate and timely mapping of essential urban land use categories (EULUC) is vital to understanding urban land use distribution, pattern, and composition. Recent advances in leveraging big open data and machine learning algorithms have demonstrated the possibility of large-scale, cost-effective mapping of EULUC. However, these approaches are still limited by the transferability of samples, models, and classification results across space, particularly across different cities. Given the heterogeneity of environmental and socioeconomic conditions among cities, in-depth studies of data and model adaptation for city-specific EULUC mapping are needed to support policy making and urban renewal planning and management. In addition, the need for timely and detailed processing of data for small land units at finer granularity is becoming increasingly important. We proposed a City Meta Unit (CMU) data model and classification framework driven by multisource data and artificial intelligence (AI) algorithms to address these challenges. The CMU framework was applied to systematically set up a grid-based data model and to classify urban land use with an AI algorithm improved by applying Moore neighborhood correlations. Specifically, we selected Xiamen, a coastal city in Fujian, China, as the testbed to implement the proposed framework and to apply an AI transfer learning technique for grid and parcel land use study. Experimental results showed that grid-based land use classification achieves overall accuracies of 81.17% and 76.55% for Level I (major classes) and Level II (minor classes), respectively, which are much higher than those of parcel-based land use classification (overall accuracies of 72.37% for Level I and 68.99% for Level II).
We further investigated the relationship between training sample size and classification performance and quantified the contribution of different data sources to urban land use classification. The CMU framework makes data collection and processing efficient at finer granularity, saving time and cost by using existing open social data. Combining the CMU framework with the proposed grid-based model is an effective new approach to urban land use classification that can be flexibly extended and applied to other cities.


Introduction
Rapid urbanization has profoundly changed the built environment and affected residents' daily life. To date, about 55% of the global population resides in cities, a share expected to reach 70% by 2050 [1][2][3]. In the meantime, urban areas consume about 78% of global energy and account for 60% of greenhouse gas emissions. Given the pressing urban environmental challenges, such as environment-related disease [4], heat island threat [5], air pollution, clean water shortage, renewable energy crisis, and climate change [6], global awareness of sustainable urban planning [7], design, and development has become increasingly far-reaching among academics, professionals, and society [8][9][10]. Among these, urban land use maps [11,12] that reflect socioeconomic functions and human activity attributes [13] are crucial for urban planning and management [14]. However, detailed urban land use classifications outlining the distribution, pattern, and composition of different land use types remain limited due to difficulties in: (i) coordinating financial support and professional manpower for on-site investigation and individual mapping; (ii) differentiating complex urban landscapes into different semantic land use types; and (iii) securing spatially and temporally explicit datasets of high-resolution urban scanning.
Urban diversity and social fabrics generate beauty in cities [15]. "Policy silos" exist [16]. To implement integrated policy making, effective information integration is needed. In addition, timely updated land use maps are required for urban renewal development, especially for land policy making, which plays a key role in the provision of housing [7].
Fortunately, remote sensing and satellite technology have greatly enhanced our ability to observe the Earth's surface on a large scale [17,18] and monitor its time-series changes [19,20]. Based on multisource remote sensing data [21], an accumulating body of research has been conducted in the field of urban land use classification, which can be categorized into: (i) pixel-based mapping [22]; (ii) object-based mapping [23]; and (iii) scene-based mapping [24]. With the advancement of Internet technology, multidimensional social data are being created at an exponential rate [24,25]. Research related to social data is expanding quickly, including the use of mobile phones [26], taxis [27], Weibo, Jingdongshenghuo, POI data, and grid statistical data [28]. By combining remote sensing data and social data, land use classification and land use maps are steadily improving and covering broader areas [29,30].
Notably, Gong et al. (2020) reported a new map of essential urban land use categories for the whole of China (EULUC-China) that uses 10-m satellite images, OpenStreetMap (OSM), nighttime lights, Point of Interest (POI), and Tencent location-based service data as feature inputs for machine learning based classification [31]. A crowdsourcing mapping approach was implemented by coordinating 68 research scientists from 21 research teams to collect training and validation samples for different cities in China. This work marks the beginning of a new paradigm of collaborative and collective urban land use mapping over large areas. It provides guidance for scaling down to city-specific characteristics of urban land use from a top-down perspective.
Although many progressive advances have been achieved in the campaign of urban land use classifications, the following needs from policy makers are still not satisfied.
(1) Timely, detailed, and smaller land unit-based data analyses are needed for urban planning for large urban areas. Currently, the average parcel sizes are fairly large, exceeding a square kilometer. For smaller land unit scales, intensive manual work is required to generate parcel land units and to collect and process data.
(2) Comparisons across cities and regions are needed for socioeconomic, environmental, and biodiversity analysis. In order to compare information across different regions, city-specific studies are needed, and using similar data models and data sources is an essential prerequisite.
(3) Digital simulations and projections of planned urban development require an accumulation of historical data for cities or urban areas. In order to accumulate data effectively, new approaches to model creation and data collection are needed.
Following the EULUC-China study, city-specific investigations by different research teams regarding sample sensitivity, feature engineering, method adaptation, and classification scheme were performed in Ningbo [32], Nanjing [33], Lanzhou [34], Shenzhen [35], and Hangzhou [36]. As one of the crowdsourcing teams conducting the EULUC study for Xiamen, we have carried out further studies for the city. As the area of Xiamen is much smaller than that of other cities in China, we found it useful and necessary to make the parcel unit size much smaller. Significant efforts were made to manually select 9741 parcels with an average size of about one ha. We noted that small parcel land classification is time-consuming and costly. The amount of information per land unit is proportional to its unit size. However, the amount of work needed to collect the same amount of meaningful data is inversely proportional to the unit size, i.e., to accurately predict land use for smaller land units, one must collect additional GIS and social data. In the meantime, the number of land units increases hundreds of times for the same coverage area, making data processing more challenging. For example, a large amount of data must be collected and processed even for small areas of a city.
From these findings, we identified the following aspects that require further in-depth study:
(i) Fine-grained urban land use classification with a smaller minimum classification unit. With sub-meter resolution satellite data and a large amount of social data, small land unit study becomes feasible. For instance, more precise land maps can be generated. This is important for urban study and planning, as larger land units often contain several different land types.
(ii) Grid-based land use classification studies are lacking, especially for small classification grids equal to or less than 1 ha. Most urban land use studies are based on parcel units segmented by roads. With small land units, delineating the parcels becomes labor-intensive or even impractical. In the meantime, most social data are available in grid form, as a grid-based data system is easier to create and expand.
(iii) Systematic, land use classification-oriented frameworks for storing, processing, and synthesizing multisource and high-dimensional data are limited. Since both satellite data and social data are complex and large, they can sometimes expand beyond 100,000 dimensions and millions of data sources accumulated over many years. Data-based solutions therefore pose challenges.
This led us to investigate and systematically set up a CMU data model to collect and pre-process the data. In this study, we proposed a City Meta Unit (CMU) data model and classification framework driven by multisource data and artificial intelligence (AI) algorithms to address the challenges mentioned above. With the CMU data model, information can be stored systematically over time. This is the basic building block for city or urban renewal simulation, policy making, planning, and development. We chose Xiamen City as our case study. The main contributions of this study are as follows: (1) We established a City Meta Unit (CMU) data and model framework for processing multisource datasets into abstract grid-based feature layers, which consists of multi-level functions, including a foundation layer, summation feature layer, density index function, visualization analysis layer, and application solution layer; (2) We developed a new approach to the grid-based data model for urban land use classification by using the city of Xiamen as a testbed, with a random forest (RF) algorithm improved by applying Moore neighborhood correlations; and (3) We analyzed the classification performance in parcel-based and grid-based mapping practices, attempted an AI transfer learning technique for grid and parcel land use prediction, and further investigated the relationship between training samples and classification accuracy [35,37,38]. The CMU framework is a first step toward a more effective methodology for addressing urban renewal and planning needs.
The remainder of this paper is structured as follows: Section 1 describes the background and reviews related work. Section 2 describes the study area and data sources in detail. Section 3 introduces the CMU framework. Section 4 introduces the methodology. Section 5 presents the experimental results and analysis. Section 6 discusses the results and future work. Section 7 presents the conclusions of this study.

Study Area
Xiamen, a prefecture-level coastal city (24°23′N to 24°54′N, 117°53′E to 118°26′E) of Fujian province, is located in the southeast region of China (Figure 1). Xiamen is well known for its mild climate, Minnan culture, and livable environment. It is also one of the most beautiful tourist destinations in China, with an area of 1700.61 square kilometers and a population of about 5.16 million. In 2021, Xiamen's GDP reached about 700 billion RMB, with per capita GDP above 140 k RMB. Rapid urbanization over the past few decades has brought significant changes to the land use patterns in Xiamen, posing increasing challenges to urban planning and the management of land, water, transportation, industry, energy, and development.

Datasets
We used multisource datasets, mainly categorized into two groups: (1) high-resolution satellite data from Gaofen-2 and Gaofen-7; and (2) social big data collected from publicly available Internet resources and different companies, such as Tsinghua 2861 DaaS Project, Zhihuizuji, Baidu, and Gaode.

Satellite Spectral and Textural Data
The Gaofen-2 satellite is the first civil optical remote sensing satellite developed in China with a spatial resolution of 1 m. The Gaofen-7 satellite is a high-resolution earth observation satellite, achieving sub-meter level stereo mapping accuracy. In this study, we collected Gaofen-2 data for 2015-2018 and Gaofen-7 data for 2021.
We used satellite data sources to extract spectral and textural features for the CMU summation feature layer, which will be described in Section 3.2. Spectral features were calculated. Texture features were calculated using the grey-level co-occurrence matrix (GLCM) [39] with the following parameters: the processing window has 3 rows and 3 columns, the co-occurrence shift in the X and Y dimensions is 1, and the greyscale quantization level is 64.
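The GLCM configuration above (3 × 3 window, shift of 1 in X and Y, 64 grey levels) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the 8-bit input range, the non-symmetric matrix, and the choice of contrast as the example statistic are assumptions made for illustration only.

```python
def quantize(image, levels=64, max_value=255):
    """Map raw grey values (assumed 0..max_value) onto `levels` grey levels."""
    return [[min(v * levels // (max_value + 1), levels - 1) for v in row]
            for row in image]


def glcm(window, dx=1, dy=1, levels=64):
    """Normalized co-occurrence matrix of a quantized window for shift (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    total = 0
    for r in range(rows - dy):
        for c in range(cols - dx):
            i = window[r][c]            # reference pixel
            j = window[r + dy][c + dx]  # neighbor at the given shift
            counts[i][j] += 1
            total += 1
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """GLCM contrast: sum of p(i, j) * (i - j)^2 (one common texture measure)."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# Hypothetical 3 x 3 window that is already quantized to 64 levels.
window = [[0, 0, 0],
          [0, 4, 0],
          [0, 0, 0]]
texture_contrast = contrast(glcm(window))
```

In practice, a library routine (for example, the GLCM functions in scikit-image) would normally replace the hand-written loops; the sketch only makes the role of the three parameters explicit.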

Social Big Data
POI data were collected from 2019 to 2021 from Shuijingzhu and contain information including name, location coordinates, and urban function attributes. A total of 437,085 POIs were retained in Xiamen after data cleaning and filtering. We checked the geospatial projection, mapped the POIs into four groups, and then calculated the proportion and total number of POIs in each grid, as shown in Figure 2a. The Tsinghua 2861 DaaS Project is an Internet-based data collection system. It takes crowdsourced Internet data as inputs and builds about 9.8 million information grids for China. The grid size of the 2861 index data is about 0.010869 × 0.008983 degree. There are 18 indexes in total. We used a mapping algorithm to calculate the corresponding index from the original index data. For example, Figure 2b shows the 2861 shopping convenience level.
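The per-grid POI statistics mentioned above (total number and proportion of each POI group in a grid) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the group labels, the sample coordinates, and the 0.001-degree cell size are hypothetical and are not taken from the authors' code.

```python
import math
from collections import Counter, defaultdict


def grid_index(lon, lat, cell_deg=0.001):
    """Integer (column, row) of the grid cell that contains a point."""
    return math.floor(lon / cell_deg), math.floor(lat / cell_deg)


def poi_grid_stats(pois, cell_deg=0.001):
    """Total POI count and per-group proportion for every occupied grid cell.

    `pois` is an iterable of (lon, lat, group) tuples.
    """
    counts = defaultdict(Counter)
    for lon, lat, group in pois:
        counts[grid_index(lon, lat, cell_deg)][group] += 1
    stats = {}
    for cell, by_group in counts.items():
        total = sum(by_group.values())
        stats[cell] = {
            "total": total,
            "proportion": {g: n / total for g, n in by_group.items()},
        }
    return stats


# Hypothetical POIs; the first two fall into the same 0.001-degree cell.
sample_pois = [
    (118.0895, 24.4795, "commercial"),
    (118.0897, 24.4792, "residential"),
    (118.0912, 24.4801, "public"),
]
poi_stats = poi_grid_stats(sample_pois)
```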
Mobile statistics of location-based service records are very useful in city studies. Compared with other types of data, they have the advantage of integrated, full coverage of activities in time and space. The grid size of the mobile data is 0.001 × 0.001 degree. We used data from the Zhihuizuji company to calculate the number of people who lived in, worked in, or visited a specific grid of Xiamen in December 2021. We set up the projected number of residents, workers, and visitors for each grid or parcel, as shown in Figure 3. We also collected the WorldPop population dataset (https://www.worldpop.org/, accessed on 10 January 2021), which provides the estimated number of people residing in each 100 × 100 m grid based on a random forest model and a global database of administrative unit-based census information [40].
Building data for Xiamen were downloaded from Shuijingzhu. The building data from Shuijingzhu were originally based on a combination of Baidu Map and Gaode Map. We used the data to calculate the number of buildings, the total coverage area, and the average number of building stories for each parcel or grid, as shown in Figure 4. Road data were obtained from the OSM platform (http://www.openstreetmap.org, accessed on 10 January 2021). The raw OSM road network comprises 27 categories of road types: primary, secondary, trunk, pedestrian, and so on. We included nine major road types in this study: primary, primary link, secondary, secondary link, residential, residential link, tertiary, tertiary link, and trunk (Figure 5).

CMU Framework
We proposed a City Meta Unit (CMU) framework for data processing with three specific objectives: (1) to enable a scalable and traceable multi-dimensional meta-model framework for collecting, storing, describing, and grouping citywide multisource data; (2) to process the data within this structure by calculating hierarchically grouped information that can be used as feature input for applications; and (3) to make solution-oriented AI algorithms more effective and to realize different AI algorithms as applications in the proposed framework. The diagram of the CMU framework is shown in Figure 6.

CMU Foundation Layer
A data layer is created in the data model for each data source. For example, satellite images from the Gaofen series, POIs, and human mobility from the location-based service data are sorted and stored. We grouped the data layers based on the nature of the data, for example, traffic, population, building, education, and environment. This scheme is called the CMU Foundation Layer (CMU FL).

CMU Summation Feature Layer
We grouped and summarized the data based on the geometric unit (grid or parcel), which serves as a feature collection for subsequent algorithm processing and application implementations. In the meantime, it also serves as a data abstraction to reduce storage needs and improve application efficiency. The grid size is scalable: a larger grid can be built by summating smaller grids. In this study, we created both 0.001° N × 0.001° E and 0.01° N × 0.01° E grids. Regarding the temporal dimension, historical data can be accumulated in parallel for model simulation and time-series analysis. For instance, the summation of remote sensing data is calculated for spectral and textural statistics, and the summation of POI data is calculated for commercial activity analysis. We called this the CMU Summation Feature Layer (CMU SFL).
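As a hedged illustration of how a coarser grid can be obtained by summating a finer one, the sketch below sums a count-type feature from 0.001-degree cells into 0.01-degree cells. The dictionary layout, the function name, and the sample values are assumptions; features that are not additive (for example, averages) would need a different aggregation rule.

```python
from collections import defaultdict


def summate_grid(fine, factor=10):
    """Sum an additive feature from fine cells into coarse cells.

    `fine` maps integer (column, row) indices of the fine grid to a value;
    `factor` is the number of fine cells per coarse cell along each axis
    (10 for 0.001 degree -> 0.01 degree).
    """
    coarse = defaultdict(float)
    for (col, row), value in fine.items():
        coarse[(col // factor, row // factor)] += value
    return dict(coarse)


# Hypothetical POI counts on the 0.001-degree grid.
fine_counts = {(118089, 24479): 2, (118085, 24471): 3, (118091, 24480): 1}
coarse_counts = summate_grid(fine_counts)
```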

CMU Density Index Layer
Based on the CMU SFL, it is useful to abstract or group certain features together to create a density function or index function layer. A feature density function is defined as the area of the feature divided by the area of the grid or parcel. For example, based on the NDVI of multispectral remote sensing data, we can create a greenspace density function to describe the spatial extent and magnitude of greenspace coverage in a grid or parcel. The same function can be applied to other land cover types such as water, road, and building. We also added a weighting factor function so that certain features, such as the number of POIs, are amplified to compensate for the part of a unit occupied by, for example, grass (greenspace density). An index function is defined for a specific attribute of a grid or parcel. For instance, urban environmental information such as PM 2.5, carbon consumption, and nighttime light (NTL) intensity can be added as spatially explicit indexes. We called this the CMU Density Index Layer (CMU DIL).

CMU Visualization Analysis Layer
We created a Visualization Analysis Layer to present data in two or three dimensions in space. For example, the number of POIs can be displayed in three dimensions (Figure 7). The number of POIs is much larger in the central urban area. We find visualization tools like this very useful when working with AI algorithms, as they correlate features with the study grid in a spatially explicit way. The data collection and preparation for a digital city can be massive, with data dimensions exceeding millions and data sources exceeding hundreds of millions. We therefore developed knowledge graph (KG) tools to describe the ontology of the data model. For example, we used a KG to describe POI information (Figure 8). We called this the CMU Visualization Analysis Layer (CMU VAL).

CMU Application Solution Layer
After setting up the data model, one can use it to study or solve application problems in city planning, traffic control, and renewable energy needs analysis. These applications can be added as part of the framework. We called this the CMU Application Solution Layer (CMU ASL). Section 4 uses Xiamen as a case study and applies the CMU framework to generate an urban land use classification.

CMU-Based Xiamen Land Use Study
We proposed a systematic approach to grid and parcel land use classification. For the land grid, the grid is set up as 0.001° N × 0.001° E and 0.01° N × 0.01° E. For land parcels, we use the OpenStreetMap road network to generate the land parcels [41,42]. We used the CMU data model as an application example to study Xiamen City land use and generate land use maps.

Proposed Method
In this study, we used a modified EULUC scheme (Table 1) for land use classification because there is insufficient information for analysis at a smaller grid size. For grid-level analysis, the scheme contains seven Level I land use classes (Residential, Commercial, Industrial, Public management and service, Road, Greenspace, Water) and 11 Level II land use classes (Residential for both low- and high-rise buildings, Business office, Commercial service, Public & Admin, Road first class, Road second class, Road third class, Greenspace, Water), as shown in Table 1. We named this modified EULUC scheme GULUC (Grid Urban Land Use Classification). For parcel-level analysis, the scheme contains four Level I land use classes (Residential, Commercial, Industrial, Public management and service) and seven Level II land use classes (Residential for both low- and high-rise buildings, Business office, Commercial service, Public & Admin, Greenspace). We named this modified EULUC scheme PULUC (Parcel Urban Land Use Classification).
The implementation of the proposed method is shown in Figure 9. We incorporated both grid-based and parcel-based land use classification. The grid-based study is a new study area, as a grid naturally combines different land cover types. We formulated the solution by using exclusion-inclusion techniques. First, road, green, and water density functions were used to identify road, greenspace, and water grids, respectively. Second, the identified grids were excluded, and a random forest algorithm was used to predict the remaining classes, including Residential, Public, Commercial, and Industrial. Lastly, a Moore neighborhood algorithm was applied to increase the accuracy of the RF algorithm.

Data Preparation
We used the CMU FL created in 4.1 and combined the layers in Table 2. Parcel data preparation was the same as grid data preparation, except that road features were not considered because OSM was used to segment and group the parcels [41][42][43]. All features in Table 2 were obtained from verifiable data sources, which were also verified by our team and have been widely used in other projects:
(1) Gaofen satellite images were processed using remote sensing image calibration. Visual verification was performed for specific sample points.
(2) For POI data, multiple POI points were selected for visual verification to ensure that they are consistent with the actual sites in the real world. We also developed a visualization tool to study POI characteristics in each land type.
(3) The 2861 index data were originally produced from open social data on the Internet with rigorous data processing. We verified the data manually in Xiamen.
(4) To verify the precision of the mobile data, we validated the distribution and trend of the data against actual activities on the ground by analyzing the heat map created from the raw grid data.
(5) The WorldPop population dataset was downloaded from the WorldPop website. The WorldPop team has complemented traditional population sources with dynamic, high-resolution data for mapping human population distributions since 2004. The dataset was cross-checked against Zhihuizuji and 2861 data.
(6) Visual verification was performed on building data using street view pictures from Baidu and online map information.
(7) We confirmed the precision of road data through field and visual verification based on remote sensing images.
(8) Haihang and Xiamen local teams verified 9741 parcel samples. Grid samples were initially processed through Tabulate Intersection selection and then verified manually.
In addition to the features in Table 2, we further derived road, greenspace, and water density functions for each grid.
(1) Road density function: OSM road data were used to calculate the area of roads. A road width value was specified for each road according to its type, and road areal vector data were obtained by using the buffer tool in ArcGIS. After drawing the buffer regions of all types of roads, areal vector data of roads in Xiamen were generated. The road density of each grid was calculated as the area of roads divided by the area of the grid (Figure 10).
(2) Green density function: We used Gaofen-7 data to construct the density function. Gaofen-7 satellite data contain four bands: RED, GREEN, BLUE, and NIR. We used these bands and grid vector data to calculate NDVI and the fractional green coverage. Specifically, the green density of each grid was calculated as the greenspace area divided by the grid's area (Figure 11).
(3) Water density function: Similar to the NDVI mentioned above, the NDWI was calculated, and the water density of each grid was calculated as the water area divided by the grid's area (Figure 12).
We first used exclusion-inclusion techniques [44]. Based on the road, green, and water density functions created above, we set the thresholds to 30%, 70%, and 75% for road, green, and water, respectively, as we concluded that these are the optimal choices after on-site and image verification. For example, we set a threshold of 0.7 on the green density index to determine whether a grid was greenspace. If the green density index was above 0.7, the grid was considered greenspace; otherwise, it was not.
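The density functions and thresholds described above can be sketched in a few lines. The code below is an illustrative sketch, not the authors' implementation: the NDVI formula is the standard one, while the NDWI form (green and NIR bands), the pixel-level cutoff of 0.3, the equal-area pixel assumption, the order in which the three thresholds are checked, and all names are assumptions.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - RED) / (NIR + RED)."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0


def ndwi(green, nir):
    """NDWI, assumed here in the green/NIR form: (GREEN - NIR) / (GREEN + NIR)."""
    denom = green + nir
    return (green - nir) / denom if denom else 0.0


def density(pixels, index_fn, cutoff):
    """Share of a grid's pixels whose index exceeds `cutoff`.

    With equal-area pixels this equals feature area divided by grid area.
    """
    hits = sum(1 for p in pixels if index_fn(p) > cutoff)
    return hits / len(pixels)


# Grid-level thresholds reported in the text (road 30%, green 70%, water 75%).
THRESHOLDS = {"road": 0.30, "greenspace": 0.70, "water": 0.75}


def exclude_grid(densities):
    """Return the excluded land type for a grid, or None if it goes to the RF step."""
    for land_type in ("road", "greenspace", "water"):  # check order is an assumption
        if densities.get(land_type, 0.0) > THRESHOLDS[land_type]:
            return land_type
    return None


# Hypothetical grid with four pixels, three of them vegetated.
pixels = [{"nir": 0.5, "red": 0.1}] * 3 + [{"nir": 0.2, "red": 0.2}]
green_density = density(pixels, lambda p: ndvi(p["nir"], p["red"]), cutoff=0.3)
label = exclude_grid({"road": 0.1, "greenspace": green_density, "water": 0.0})
```

Grids for which `exclude_grid` returns `None` would be the ones passed on to the random forest classifier in the exclusion-inclusion scheme.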
For road, green, and water, the following land grids were identified: 8380 road grids, 23,258 greenspace grids, and 17,292 water grids. Visual inspection was performed. The overall accuracy for road was 90.37%, with 374 testing samples. The overall accuracy for greenspace was 87.39%, with 333 testing samples. The overall accuracy for water was 79.28%, with 362 testing samples. We applied the same procedures in the parcel land use analysis to exclude green and water parcels.

Random Forest (RF) for Urban Land Use Classification
We collected 2800 grid training samples and 600 grid testing samples. In the meantime, we also collected 6284 parcel training samples and 699 parcel testing samples provided by the Haihang company, and arranged for a research team to conduct on-site investigations in Xiamen to verify the sample quality.
First, we completed 13,000 experiments for the grid and 7000 for the parcel. Each experiment was run 1000 times and used a unique combination of CMU SFL features from different sources, including satellite only, social data only, satellite + POI, satellite + 2861 index, satellite + WorldPop, satellite + Zhihuizuji mobile, and all features. Second, we conducted a total of 120,000 experiments to study the relationship between training sample size and testing accuracy, which also covered different combinations of CMU SFL features from different data sources. Third, we attempted a new method that uses parcel training samples to train the model and then applies it to grid-based land classification, and vice versa. Because samples are costly to process, such reuse helps to expand the studies to all cities in China.

Using Moore Neighborhood to Improve Land Use Prediction
We developed a method that uses the Moore neighborhood to increase the accuracy of the RF algorithm [45,46]. In the RF voting scheme, when two or more predictions have similar probabilities, the error rate measured on the testing samples is high, as described in detail in Section 5.1. We use the 3 × 3 Moore neighborhood to determine the type of any grid whose confidence level is equal to or less than 60%. The algorithm is as follows. We select the Moore neighborhood of eight cells (grids) around the uncertain grid. Among these eight cells, we count the grids whose type matches the most confident prediction of the uncertain grid as A, and the grids whose type matches the second most confident prediction as B. If A > B, the uncertain grid is assigned the most confident prediction; if B > A, it is assigned the second most confident prediction. If A and B are both zero, the most confident prediction wins. In the example of Figure 13, the most confident prediction of the uncertain grid is Industrial, and the second most confident prediction is Public. The Moore neighborhood contains three certain grids, i.e., grids whose confidence is higher than the threshold (here 60%): two are Industrial and one is Public. According to the algorithm, the uncertain grid is therefore classified as Industrial.
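The voting rule above can be expressed as a short sketch. It is an interpretation under stated assumptions rather than the authors' code: only neighbors whose own confidence exceeds the threshold are counted (as in the Figure 13 example), neighbors contribute their original RF labels, a non-zero tie between A and B keeps the most confident prediction, and the data layout and names are hypothetical.

```python
# Offsets of the eight cells surrounding a grid (Moore neighborhood).
MOORE_OFFSETS = [(dc, dr) for dc in (-1, 0, 1) for dr in (-1, 0, 1)
                 if (dc, dr) != (0, 0)]


def moore_refine(predictions, threshold=0.60):
    """Resolve uncertain RF predictions using confident Moore neighbors.

    `predictions` maps (column, row) to a list of (label, probability) pairs
    sorted by decreasing probability, with at least two entries per grid.
    Returns a dict mapping (column, row) to the final label.
    """
    refined = {}
    for cell, ranked in predictions.items():
        (first, p_first), (second, _) = ranked[0], ranked[1]
        refined[cell] = first
        if p_first > threshold:
            continue  # confident grid: keep the RF prediction
        a = b = 0
        for dc, dr in MOORE_OFFSETS:
            neighbor = predictions.get((cell[0] + dc, cell[1] + dr))
            if neighbor is None or neighbor[0][1] <= threshold:
                continue  # missing or uncertain neighbors do not vote
            if neighbor[0][0] == first:
                a += 1
            elif neighbor[0][0] == second:
                b += 1
        if b > a:
            refined[cell] = second  # ties and A = B = 0 keep `first`
    return refined


# Hypothetical reconstruction of the Figure 13 situation.
preds = {
    (1, 1): [("Industrial", 0.45), ("Public", 0.40)],  # uncertain grid
    (0, 0): [("Industrial", 0.90), ("Public", 0.10)],
    (0, 1): [("Industrial", 0.80), ("Public", 0.20)],
    (2, 2): [("Public", 0.70), ("Industrial", 0.30)],
}
final_labels = moore_refine(preds)
```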

Grid Experiments and Performance
Different data sources and feature combinations contribute differently to overall accuracy (OA) (Figure 14). We further analyzed the importance of the features contributing to the performance of the RF model. As shown in Figure 15, the top 5, top 10, and top 15 of all features achieve OAs of 68.91%, 75.47%, and 78.97%, respectively. Satellite data with the top five social features achieved 75.90%, and with the top 10 social features achieved 79.24%. The confusion matrix results are as follows. For Level I, the RF algorithm achieves an OA of 80.33% with a kappa coefficient of 0.7146 (Table 3). Applied on top of the RF algorithm, the Moore neighborhood algorithm achieved an OA of 81.17% with a kappa coefficient of 0.7253 (Table 4). For Level II, the OA reached 76.55% with a kappa coefficient of 0.6847 (Table 5). We find that Public Management & Services and Industrial are easily misclassified as Residential, and vice versa. By examining the RF prediction voting scheme, we find that prediction accuracy improves as the confidence of the most confident prediction increases (Figure 16a) or as the difference between the most confident prediction and the second most confident prediction increases (Figure 16b). For a total of 600 testing samples, Figure 16 shows that (a) as the confidence of the most confident prediction increases from 0.3 to 1 (30% to 100%), the share of correct predictions increases; and (b) as the difference between the most confident prediction and the second most confident prediction increases from 0 to 1, the share of correct predictions increases. When the first and second most confident predictions are almost the same, fewer than 50% of the predictions are correct. Therefore, we need to focus on these uncertain ranges of voting confidence to further improve classification accuracy. By setting the confidence threshold to 60%, we improved the OA by 0.84 percentage points. We then used the Moore neighborhood to further predict the grid type with different combinations of data sources. The Moore neighborhood results are shown in Table 6. We find that the lower the RF accuracy, the more significant the resulting improvement, ranging from 1% to 2%. When the accuracy surpasses 80%, the improvement becomes limited. Overall, the results support the algorithm's effectiveness in grid land use prediction.
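For readers less familiar with the two metrics reported above, the following sketch shows how overall accuracy and Cohen's kappa coefficient are derived from a confusion matrix. The 2 × 2 matrix is a made-up toy example and does not reproduce any table of this study; the function name is hypothetical.

```python
def overall_accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    `matrix[i][j]` is the number of samples of true class i predicted as class j.
    """
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy confusion matrix with 100 samples and two classes.
oa, kappa = overall_accuracy_and_kappa([[40, 10],
                                        [5, 45]])
```

Because kappa discounts the agreement expected by chance, it is always lower than the overall accuracy for an imperfect classifier, which is consistent with the OA and kappa pairs reported in this section.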

Parcel Experiments and Performance
In parcel-based land use classification, we also quantified the contributions of different feature combinations to overall accuracy (Figure 17). The OA reached 69.54% for the RF algorithm with a kappa coefficient of 0.55. Satellite data only achieved an OA of 57.68%, while social data only achieved 57.54%. Interestingly, the accuracies of these two scenarios were very close, indicating that social and satellite data are equally important to land use classification at the parcel level. The 2861 index data achieved an OA of 64.26%, higher than other social data, which points to the importance of higher data dimensionality. The confusion matrix results are as follows. For Level I, the RF algorithm achieves an OA of 72.37% with a kappa coefficient of 0.5841 (Table 7). For Level II, the OA reached 68.99% with a kappa coefficient of 0.5240 (Table 8). For the land grid study, a land use map was produced by combining the results of RF predictions with the road, green, and water grids. For the land parcel study, a land use map was produced by combining the results of RF predictions with green and water parcels. The parcel map is limited to the selected parcels in the area provided by Haihang. The detailed maps of Xiamen are presented in Figure 18.

Sensitivity of Training Sample Size
We compared two scenarios in which the training sample sizes of the different land types were either proportional to the raw data or balanced. The parcel data used in this experiment are shown in Table 9, and the results are shown in Figure 19. From 120,000 experiments, we conclude that, with the average parcel size of 6284, a total of 6000 training samples using all features reaches 69.54% accuracy in the first scenario and 73.25% in the second. Overfitting occurs in both scenarios as we continue to add more samples.
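The two sampling scenarios can be illustrated with a short sketch. It assumes simple random sampling within each class and a fixed seed for repeatability; the function name `draw_training_sample` and the rounding rule for class quotas are assumptions, since the text does not describe how the samples were drawn.

```python
import random
from collections import defaultdict


def draw_training_sample(labels, n_total, balanced, seed=0):
    """Pick training indices either balanced or proportional to class frequency.

    labels   : list of class labels, one per candidate sample
    n_total  : target number of training samples
    balanced : True for equal quotas per class, False for proportional quotas
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    chosen = []
    for lab, members in sorted(by_class.items()):
        if balanced:
            quota = n_total // len(by_class)
        else:
            # Rounded share; the quotas may not sum exactly to n_total.
            quota = round(n_total * len(members) / len(labels))
        # A class smaller than its quota contributes all of its members.
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen
```

With 80 samples of one class and 20 of another, requesting 20 training samples yields 10 and 10 in the balanced scenario and 16 and 4 in the proportional scenario.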

Grid and Parcel Exchange Experiments
We used grid-based land use classification results to predict parcel-based land use. We overlapped grids with land parcels and used the most dominant land use to determine the parcel land type. The confusion matrix results are listed in Table 10. We also tested predicting grid-based land use using an RF algorithm trained on parcel training samples. There are 6284 parcel training samples and 600 grid testing samples. The confusion matrix results are listed in Table 11.
We find that the combination of remote sensing and social data achieves the best land classification performance. Meanwhile, the satellite data plus the top 5 and top 10 social features achieved OAs of 75.90% and 79.24%, respectively, indicating redundancy among the multiple dimensions. PCA can be used to further the data model research [47,48]. The experiments also indicated that, based on the CMU framework, we can extend the experiments to other cities efficiently with a smaller selection of important data. Secondly, the CMU data model abstraction is multi-dimensional. Sparsity along the geometric dimension is common for both grid and parcel types. For example, few or no POI data exist in some green space or rural area grids. Therefore, it is crucial to study the data pattern using data visualization analysis tools such as CMU VAL. Thirdly, adding social data to fill the sparsity along specific dimensions is necessary. In some cases, when obtaining additional social information is difficult, one can explore crowdsourcing as a way of adding CMU data sources [31]. Lastly, from a data science perspective, data granularity and transparent feature structures provide data insight. A CMU framework enables such capabilities.
In addition, the proposed grid-based CMU framework can be flexibly extended to other cities in practice: (1) the digital structure of the CMU data model is reusable for other cities or urban areas; (2) data collection and processing work can be leveraged because of the availability of remote sensing and social data in grid format from satellite imagery, 2861, Zhihuizuji, Baidu, etc. [43], and area-specific data can be added as well; (3) for CMU ASL, either algorithms already trained or training samples for a specific city can be reused as a base for other cities. City-specific characteristics can be addressed by tuning the CMU ASL algorithms and adding relevant training samples. As more cities are added, training samples and CMU ASL solutions will accumulate for improved performance and future use; (4) the new methods of combining the Moore neighborhood and RF algorithm, and of grid and parcel exchange analysis in land use prediction, can also be generalized to other cities and regions in China or internationally.

Grid and Parcel Exchange Analysis
A large volume of data collection and costly data processing work is needed to improve prediction accuracy. The data characteristics, or training features, are the same for both parcel-based and grid-based training samples. Therefore, using grid-based training samples to train the RF model and applying it to parcel-based land classification, for example, saves time and cost. Such concepts and practices are common in transfer learning models [49]. In this way, one can leverage the work already done to collect training samples, which is labor intensive and costly.
Using grid-based land use classification results to predict parcel land use, we found that the OA reached 71.36%, which is comparable to, or in some cases even better than, that of the parcel-trained model itself. Because the grid-based model is more efficient to set up, this finding can be further expanded to use a grid-based model for parcel-based land use. On the other hand, when using the parcel-trained RF model to predict grid land use, an OA of 71.69% was achieved, which was lower than that of the aforementioned grid-based model; however, it verifies the feasibility of this new method.
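The grid-to-parcel step, in which the dominant land use among the overlapping grids determines the parcel type, can be sketched as below. The sketch assumes that dominance is measured by overlap area (the text does not say whether area or grid count was used) and that the overlaps have already been computed by a GIS overlay; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def parcel_labels_from_grids(overlaps, grid_label):
    """Assign each parcel the land use with the largest total overlap area.

    overlaps   : iterable of (parcel_id, grid_id, overlap_area) tuples
    grid_label : dict mapping grid_id to its predicted land use class
    """
    area = defaultdict(lambda: defaultdict(float))
    for parcel_id, grid_id, overlap_area in overlaps:
        # Accumulate overlap area per land use class within each parcel.
        area[parcel_id][grid_label[grid_id]] += overlap_area
    # The dominant class is the one with the largest accumulated area.
    return {p: max(by_class, key=by_class.get) for p, by_class in area.items()}
```

A parcel covered by two small residential grid fragments and one larger industrial fragment is labeled by whichever class has the larger summed area, not by the number of fragments.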

Sensitivity Analysis of Training Sample Size
We find that prediction accuracy improves as the training sample size increases, and improves much faster while the sample size is less than 1000. As shown in Figure 19a, it takes 80 samples to reach 70% of the accuracy range (with the range from 0% to 69.54%), 320 samples to reach 80%, and 1200 samples to reach 90%, which is consistent with the stable sampling concept of Gong et al. [50]. We verified that the accuracy curve covering these three key sample size points is the same for different feature combinations in the first scenario of sample size testing. For the second scenario, 40, 80, and 800 samples are needed to reach the same conditions, respectively. This is somewhat surprising, since it indicates that the sample size range needed for a given RF training accuracy can be accurately predicted.
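The "share of the accuracy range" reading used above can be made concrete with a small helper. It assumes the range runs from 0% to the accuracy at the largest tested sample size, as the text states for the first scenario. The function name is hypothetical, and the curve values in the usage note are toy numbers for illustration only, not measurements from Figure 19.

```python
def samples_to_reach(curve, fraction):
    """Return the smallest sample size whose accuracy reaches a share of the range.

    curve    : list of (sample_size, accuracy) pairs sorted by sample size
    fraction : share of the final accuracy to reach, e.g. 0.7 for 70%
    """
    final_accuracy = curve[-1][1]          # top of the accuracy range
    target = fraction * final_accuracy     # range is assumed to start at 0
    for sample_size, accuracy in curve:
        if accuracy >= target:
            return sample_size
    return None  # not reached within the tested sample sizes
```

For the toy curve `[(10, 30.0), (20, 45.0), (50, 52.0), (100, 58.0), (200, 60.0)]`, the helper returns 20, 50, and 100 samples for the 70%, 80%, and 90% levels, respectively.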

Limitations and Future Research
Our study was limited to the urban area of Xiamen city and has not yet examined rural areas or other cities. In addition, we used the RF algorithm in this study because it has been proven effective in land cover and land use classification. However, it is worth exploring other, potentially more effective AI algorithms. Although we collected and processed a large amount of data and clarified the contributions of the top data features, we concluded that additional data sources are needed to enhance solution accuracy. To expand the CMU model in volume and dimensionality, we believe that new methods that use AI and other means to automatically collect and process data define future research needs and digital technology trends.
The CMU framework can be used to address urban housing and renewal challenges in terms of analyzing quantity, quality, and distribution needs [7], and provides policy makers with timely updated information [51]. The CMU data model is multidimensional. Additional data can be added and accumulated. As more cities are added, comparisons with similar regions and cities can be made. As data accumulate, future growth projection and carrying capacity studies can be added [52]. Such capabilities are very important for policymaking and urban renewal planning. In addition, the CMU framework can be further used to study city-specific functions, such as traffic control, economic analysis, environmental management, renewable energy planning, etc.
In particular, there are several areas of research interest for the future: (1) Grid-based land use methodology can be extended to cover all cities. Knowledge sharing can be further studied for sharing CMU ASL AI algorithms and training samples, as has been done for global land cover studies [37]; (2) To set up the CMU data model for multiple cities with multimillion data dimensions, it becomes necessary to use automated AI methods to collect and process data for the model. Such studies are important for future digital and intelligent city research; (3) Different AI algorithms can be added and/or tested with the CMU data model, including SVM [53], ANN, CNN [54], etc.; (4) Applications using AI algorithms can be added to the framework in the CMU Application Solution Layer (CMU ASL) for urban renewal development, traffic management, renewable energy planning, etc.; (5) We used exclusion-inclusion techniques [44] to identify green, water, and road grids. By combining the results with RF predictions, the overall accuracy is 84.06%. Since roads, green space, and water cover more land area, further analysis can be done for rural areas, and combinations of CMU ASL algorithms can be studied further; and (6) As social data sources accumulate over time, city development projections and simulations can be added to the model [25,55–57].

Conclusions
To address urban land use classification challenges along the development curve from digital city to intelligent city and now meta city, we established a CMU framework for city-specific studies. First, it combines remote sensing data with open social data and can serve as a framework for meta-city study covering data collection, feature summation, and abstraction across space and time (historical simulation and future projections). The CMU framework consists of five layers of functions: the Foundation Layer (CMU FL), Summation Feature Layer (CMU SFL), Density Index Layer (CMU DIL), Visualization Analysis Layer (CMU VAL), and Application Solution Layer (CMU ASL). Second, we implemented the proposed CMU framework for a systematic grid-based land use classification, using the city of Xiamen as a testbed. Third, the large data size of meta-city analysis imposes challenges for data collection and processing. We studied the relationship between grid-based and parcel-based land classifications and concluded that the two models complement each other and can be reused robustly. Fourth, we also considered the fact that neighboring grids influence each other by applying the Moore neighborhood concept to the study. By integrating Moore neighborhood methods, we can improve RF accuracy by resolving RF tree prediction uncertainty, which can serve as a good strategy for guiding future land use classification practice. Finally, the proposed grid-based CMU framework can be flexibly extended to other cities in practice, mainly because of the availability of remote sensing and social data from satellite imagery, 2861, Zhihuizuji, Baidu, etc. Such a study should make urban land use analysis and planning more effective by leveraging fast-advancing digital twin technology, since most social data are more conveniently available in a grid format. It presents a detailed demonstration of a data-rich experiment and a model-driven framework for essential urban land use classification, which can be adapted to other cities across the globe.

Figure 2. POI and 2861 index data of Xiamen: (a) POI data; and (b) 2861 shopping convenience level.

Figure 3. Mobile data of Xiamen: (a) resident data; (b) working data; and (c) visit data.

Figure 4. Building data of Xiamen: (a) building data in the main urban area; and (b) zoomed-in area of the red frame.

Figure 7. Grid-based POI numbers: (a) POIs in the 3D display; and (b) zoomed-in area of the red frame.

Figure 8. (a) Ontology description of CMU data sources; and (b) zoomed-in area of the red frame.

Figure 9. Grid and Parcel Land Use Process.

Figure 10. (a) Road grids; and (b) zoomed-in area of the red frame.

Figure 11. (a) Greenspace grids; and (b) zoomed-in area of the red frame.

Figure 12. (a) Water grids; and (b) zoomed-in area of the red frame.

(1) We compared the classification results derived from satellite data and social data. Satellite-only data achieved 65.97%, indicating the importance of high-resolution images. Social data achieved 75.03%, indicating the importance of involving human activities in interpreting land use functions. The CMU data model framework was created to use social data more effectively.
(2) We quantified the contribution of each data type in detail. Satellite spectral data alone achieved 58.80%, which is almost the same as satellite texture. WorldPop data has only one feature, but it achieved 45.08%, which indicates that the corresponding feature has a greater contribution. POI data alone achieved 55.87%, which surpassed other social data such as WorldPop, Mobile statistic, and the 2861 index.
(3) We tested the satellite data in combination with different social data sources. The combination of satellite and 2861 index data achieved 75.86% and surprisingly surpassed other combinations, as shown in Experiment 2, which indicated that 2861 index data is more complementary to satellite data.
(4) By combining all satellite and social data, the overall accuracy reached 80.67% with a kappa coefficient of 0.7194, which indicates that the multiple dimensions of urban land use are important to complement high-level semantic urban land use differentiation.

Figure 14. Accuracy of different feature combinations for Level I (run 1000 times, 2800 training, and 600 testing samples): (a) single data source; and (b) features combinations.

Figure 15. Features importance and contributions for RF model (run 1000 times, 2800 training and 600 testing samples): (a) all features importance; (b) social features importance; and (c) features combinations.

Figure 16. Prediction probability distribution of RF model: (a) most confident prediction; and (b) difference of the first and second confident prediction.

Figure 17. Accuracy of different feature combinations for Level I (run 1000 times, 6000 training, and 699 testing samples).

Figure 18. (a) Land use grid-based urban area map of Xiamen; (b) zoomed-in area of the right red frame in graph (a); (c) zoomed-in area of the red frame in graph (b); (d) parcel-based map for the same area of graph (c); (e) zoomed-in area of the left red frame in graph (a); (f) parcel-based map for the same area of graph (e).

Figure 19. Accuracy for different training sample size (run 1000 times each): (a) training samples balanced; (b) from 0 to 400 samples of graph (a); (c) training samples proportionally; and (d) from 0 to 400 samples of graph (c).
Lands used for administrative, education, medical and sport related. 05 Road (grid only): 0501 Road first class; 0502 Road second class; 0503 Road third class. Paved roads including freeways. Major and minor city-roads.

Table 2. Summary of features from CMU FL.

Table 4. Grid confusion matrix for Level I (Moore Neighborhood addition).

Table 10. Confusion matrix of grid and parcel exchange experiment I.

Table 11. Confusion matrix of grid and parcel exchange experiment II.