Article

AI-Driven Multi-Model Classification of Rural Settlements for Targeted Rural Revitalization: A Case Study of Gaoqing County, Shandong Province, China

1 School of Humanities and Social Science, Xi’an Jiaotong University, Xi’an 710049, China
2 School of Human Settlement and Civil Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 Research Institute for Smart Cities, School of Architecture and Urban Planning, Shenzhen University, Shenzhen 518060, China
4 Ningbo Institute of Technology, School of Economics, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Land 2025, 14(12), 2298; https://doi.org/10.3390/land14122298
Submission received: 15 October 2025 / Revised: 18 November 2025 / Accepted: 19 November 2025 / Published: 21 November 2025

Abstract

Rural settlements are the fundamental socio-economic units of China’s countryside. In line with national strategies that emphasize place-based and category-specific pathways for rural revitalization, accurate classification of rural settlements is essential for differentiated planning and policy delivery. However, given the sheer number of settlements, manual classification is time-consuming and resource-intensive, limiting scalability. This study proposes an AI-driven, multi-model framework to automate rural settlement classification with high stability and accuracy. First, informed by a rigorous literature review, we construct a multidimensional indicator system that integrates natural conditions, socio-economic attributes, and land-use factors to capture spatial and functional characteristics at the settlement scale. Using Gaoqing County (Shandong Province) as the study area, we collect and curate survey data and apply outlier detection for preprocessing. We then benchmark multiple machine learning models and find that algorithms with native handling of missing values perform markedly better—a critical advantage given the prevalence of missingness in survey-based datasets. Finally, we assemble the three best-performing models—LightGBM, CatBoost, and XGBoost—into a weighted-voting ensemble, achieving an overall classification accuracy of approximately 88%. The results demonstrate that the refined indicator system, coupled with a multi-model ensemble, substantially improves both accuracy and robustness. This work provides a methodological foundation and empirical evidence to support differentiated planning and targeted rural revitalization at the settlement level, offering a scalable blueprint for broader regional and national implementation.

1. Introduction

Rural areas play a vital role in maintaining the socio-economic and ecological balance of China’s vast territory [1,2]. Despite the country’s rapid urbanization, villages remain the fundamental socio-economic units—the “cells” of the rural system [3]. In the administrative hierarchy, an administrative village typically encompasses several natural villages. These natural villages, also referred to as rural settlements, constitute the smallest spatial and social entities. They represent the primary spaces where daily life unfolds, land is utilized, and community interactions occur. While administrative villages function primarily as governance and statistical units, rural settlements more directly reflect the lived spatial structure and functional organization of rural society. Understanding the spatial and functional characteristics of these settlements is therefore crucial for achieving sustainable rural development and advancing rural revitalization. Within this context, China’s Rural Revitalization Strategy [4] provides a comprehensive national policy framework to promote balanced, place-based development across the countryside.
In recent years, the principle of “adapting measures to local conditions and implementing categorized strategies” has become one of the core guidelines of China’s national rural revitalization policy. Official planning documents explicitly identify four primary categories of rural development strategies: Agglomeration Upgrading, Urban-Periphery Integration, Characteristic-Protection, and Relocation-Consolidation [4]. This classification system aims to match different types of rural settlements with appropriate development pathways and resource allocation models. Scientific classification provides a solid foundation for targeted decision-making by revealing the structural diversity and spatial differentiation of rural settlements [5]. A well-designed classification framework enables differentiated planning objectives, optimized resource allocation, and the rational distribution of public services [6]. Therefore, classifying rural settlements provides a crucial methodological basis for evidence-based governance and spatial optimization. This process directly supports the broader objectives of the Rural Revitalization Strategy.
However, village classification practices in China still largely rely on expert-based manual assessments. Such qualitative approaches are time-consuming, labor-intensive, and highly dependent on subjective judgments, making them impractical for large-scale applications. Given the vast number of rural settlements nationwide, manually classifying every village is unrealistic. This inefficiency creates a critical bottleneck for data-driven analysis and spatial planning at the national scale. To address these limitations, there is an urgent need to establish an automated classification framework that integrates systematic feature engineering, data-driven modeling, and algorithmic learning, enabling scalable, objective, and reproducible analysis.
Several studies have begun to leverage machine learning (ML) for various rural classification tasks. These efforts have focused on diverse attributes, such as the spatial characteristics of rural settlements [7,8], aiming to reveal the dynamic mechanisms of spatial reconstruction and evolution. Other works have assessed rural suitability [9,10], which is fundamental for optimizing planning and layout. Concurrently, ML has also been applied to rural risk assessment [11,12]. While these studies demonstrate the growing utility of ML in rural research, a critical review reveals several significant limitations as shown in Figure 1. First, indicator systems in feature engineering are often fragmented and ad-hoc. They are frequently dictated by data availability rather than a standardized, comprehensive framework. This hinders the comparability and transferability of findings. Second, in data processing, the literature largely overlooks the need for robust, automated preprocessing, particularly for outlier detection. This is a crucial omission, as rural datasets are notoriously prone to noise, inconsistencies, and scarcity, which can severely compromise model reliability. Finally, regarding model selection, the predominant approach is to apply a single, pre-selected classic model (e.g., SVM or Random Forest) and report its performance. This practice not only lacks a systematic comparison of models but also limits the robustness and generalizability of the results.
To address these gaps, this study proposes a systematic and automated AI-driven multi-model classification framework, providing a scalable and robust solution for rural settlement classification. This approach not only enhances the automation and stability of the classification process but also provides valuable technical support for differentiated rural spatial planning and targeted revitalization strategies. Specifically, we first construct a comprehensive indicator system based on an extensive literature review. This system integrates natural conditions, socio-economic attributes, and land-use characteristics to represent both the spatial structure and functional attributes of rural settlements. We then select Gaoqing County in Zibo City, Shandong Province—an agricultural county representative of the central Shandong Plain—as the case study area. Multi-source survey data are collected from multiple settlements, covering both feature indicators and settlement type labels. To ensure data quality, we conduct rigorous preprocessing using outlier detection techniques, minimizing the potential impact of survey errors and noise. Subsequently, multiple machine learning models are trained and evaluated to examine the classification effectiveness and the complementarity between different algorithms. Finally, we integrate the best-performing models into an ensemble framework. The resulting multi-model system is robust and capable of handling missing values, complex spatial heterogeneity, and class imbalance. Overall, the main contributions of this paper are as follows:
1. We establish a comprehensive and scalable indicator system for rural settlement classification, addressing the fragmented and ad-hoc nature of feature engineering in prior studies.
2. We develop and validate a robust data preprocessing module with automated outlier management, ensuring data quality in the presence of noise and inconsistencies.
3. We perform a systematic multi-model evaluation and introduce an ensemble learning approach, demonstrating superior accuracy and robustness over traditional single-model methods.
4. We propose and validate, to our knowledge, the first fully automated end-to-end framework for rural settlement classification, integrating preprocessing, model comparison, and ensemble learning into a reproducible pipeline for data-driven rural analysis.

2. Background

In this section, we clarify the conceptual hierarchy of rural space in China, integrating the macro notion of the countryside and the micro understanding of rural settlements, and further distinguish between administrative villages and natural villages—the latter serving as the analytical focus of this study. Finally, we summarize the policy-oriented typologies of rural settlements that guide our classification framework.

2.1. Conceptual Framework of the Countryside and Rural Settlements in China

The countryside [13] denotes the broader rural region that integrates population, production, ecology, and culture beyond the contiguous built-up urban fabric; it encompasses numerous villages and dispersed residential clusters shaped by historical and environmental contexts. Within this space, a rural settlement is the smallest, place-based unit of human habitation and social life—typically a clustered, recognizable agglomeration of dwellings and ancillary facilities—where everyday living, land use, and local networks are organized [14]. In China’s practice, rural settlements correspond to what are commonly referred to as natural villages: organically evolved residential points with relatively stable spatial form, lineage- or neighborhood-based social ties, and close coupling to local landscapes. Understanding rural settlements at this fine scale is essential for evidence-based revitalization, since functional heterogeneity and service demands are most clearly expressed at the settlement (i.e., natural village) level rather than at aggregated administrative levels.
China’s rural governance distinguishes administrative villages—the basic units for statistics, elections, and public service delivery—from natural villages, which are spatially and socially coherent residential points formed historically; one administrative village typically contains several natural villages. Administrative villages are delineated for management and often reflect past rounds of consolidation, whereas natural villages preserve the on-the-ground settlement pattern, everyday interactions, and service catchments. Consequently, in this study we define rural settlements as natural villages and take them as the operative unit for classification and model evaluation. This choice aligns the analytical unit with the spatial scale at which infrastructure, land-use decisions, and facility accessibility most directly affect rural residents [15,16,17].

2.2. Typologies of Rural Settlements for Targeted Revitalization

Recent national planning documents promote “adapting measures to local conditions and implementing categorized strategies”, encouraging localities to differentiate development pathways by village type. Rural settlements are commonly organized into the following policy-oriented categories [4]:
  • Agglomeration-Upgrading Settlements. Focus on reinforcing endogenous industries and upgrading infrastructure and public services to consolidate their role as local hubs of population and functions.
  • Urban-Periphery Integration Settlements. Emphasize urban–rural integration by improving connectivity, shared services, and compatible land uses at the metropolitan fringe or county seats.
  • Characteristic-Protection Settlements. Prioritize heritage conservation and quality improvement—protecting traditional architecture, cultural landscapes, and vernacular environments while upgrading essential infrastructure.
  • Relocation-Consolidation Settlements. For settlements in ecologically fragile, disaster-prone, or severely shrinking areas, implement orderly relocation or merger with safeguards for livelihoods, employment, and ecological restoration.
These categories provide both a conceptual lens and an operational basis for our AI-driven classification: they articulate function-oriented development goals, match observable indicator patterns (e.g., industry mix, accessibility, land-use structure), and facilitate targeted policy design at the settlement scale.

3. Materials and Methods

3.1. Study Area and Data Sources

In studies of rural spatial optimization and settlement pattern analysis, the selection of a representative study area is essential to ensure the scientific validity and applicability of the research outcomes. A well-chosen region provides a realistic spatial context for testing methodological frameworks and facilitates the generalization of findings across different rural environments. As shown in Figure 2 and Figure 3, Gaoqing County, located in Zibo City, Shandong Province, China, was selected as the study area for this research. As a typical agricultural county on the central Shandong Plain, Gaoqing possesses dual locational advantages within both the Yellow River Basin and the provincial capital economic circle. It demonstrates representative characteristics in terms of population evolution, land use, and rural settlement development, making it a suitable case for examining spatial differentiation and optimization strategies of rural settlements. Therefore, this study takes the entire administrative territory of Gaoqing County as the research area and all rural settlements within the county as the research objects, with the aim of revealing spatial characteristics and proposing optimization directions at the county scale.
1. Geographical Location: Gaoqing County is situated in the northern part of the Shandong Plain, along the lower reaches of the Yellow River, with an average elevation of about 12 m. It lies within the Yellow River Delta Ecological Zone. The county is approximately 50 km from Binzhou, 120 km from Jinan, and 90 km from Zibo, forming a transportation hub of the northern Shandong region. Convenient access to national highways, expressways, and future railway and airport connections ensures good regional accessibility.
2. Natural Environment: The terrain of Gaoqing is flat, with gentle slopes descending from the central plain to the south. It has a warm temperate monsoon climate, characterized by hot, rainy summers and cold, dry winters. The annual average precipitation is around 600 mm, and the per capita water resource is approximately 320 m³. The Yellow River is the main water source, supplemented by small lakes and reservoirs. The ecological environment is relatively fragile, with saline-alkali soils and seasonal flooding affecting the landscape.
3. Socio-economic Conditions: As of 2020, Gaoqing County had a permanent population of approximately 312,200 and a registered population of about 347,000. The population structure shows that 31.3% are aged 41–65, and 21.6% are aged 65 or above, resulting in a clear aging trend. The aging rate reaches 26.5%, indicating that Gaoqing has entered a “moderate aging” stage. The total GDP of the county in 2020 reached 18.15 billion CNY, with a per capita GDP comparable to the provincial average. However, outmigration of young labor remains evident, particularly among those aged 20–40.
4. Land Use: The total land area of Gaoqing County is approximately 950 km², with 520 km² of arable land (about 55%), mainly concentrated in the southern and eastern plains along the Yellow River. The county has 110 km² of built-up land, accounting for 11.6% of the total area, and over 2 million m² of rural housing land. On average, construction land per capita reaches 263.8 m², with most settlements located within 100 m of main transportation corridors. The county has also established several industrial parks and an eco-tourism zone centered on the “Wetland Belt” of the Yellow River.
5. Spatial Pattern and Characteristics of Rural Settlements: Gaoqing County administers seven townships, 39 village-level administrative units, and a total of 767 rural settlements. Settlements are generally evenly distributed along the river corridors and transportation routes, with a spatial hierarchy characterized by clustering near township centers and dispersal in peripheral zones. Differences in development level and service accessibility are notable: northern settlements are more scattered and have weaker infrastructure, whereas southern and central settlements are more compact and better equipped. Overall, the county exhibits a clear spatial stratification of settlements, with evident contrasts in living standards, spatial compactness, and service accessibility across different zones.

3.2. Methodological Overview

As illustrated in Figure 4, the proposed research framework consists of three main stages: (1) indicator system construction and data compilation, (2) automated data preprocessing, and (3) multi-model classification and ensemble integration. First, a comprehensive indicator system is established through literature synthesis to capture the natural, socio-economic, and land-use characteristics of rural settlements. Based on this framework, multi-source data, including feature indicators and settlement-type labels, are compiled to form the analysis dataset. Second, an automated preprocessing pipeline is implemented to enhance data reliability. It incorporates outlier detection and treatment (e.g., IQR- and Z-score-based methods), normalization, and missing-value imputation, thereby minimizing the influence of noise, inconsistencies, and data sparsity. Third, multiple machine learning models, including Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, and CatBoost, are trained and evaluated. The best-performing models are subsequently integrated into an ensemble learning framework to enhance stability, robustness, and overall classification accuracy. This three-stage workflow enables a scalable, automated, and reproducible approach to rural settlement classification.

3.3. Step 1: Indicator System Construction and Data Compilation

Automated classification of rural settlements first requires a comprehensive indicator system that accurately reflects their multifaceted characteristics. Such an indicator system should capture key dimensions including socio-economic conditions, natural environment, land use, and supporting infrastructure, providing a structured representation of settlement attributes. In this study, we adopt a literature-based indicator construction approach, systematically reviewing existing research and policy documents to identify relevant variables and ensure conceptual completeness. Based on the established indicator system, representative study areas are then selected, and field surveys are conducted to collect high-quality data. This dataset serves as the foundation for the subsequent automated classification framework, ensuring that the classification results are both data-driven and contextually meaningful. The overall procedure of the literature review and indicator development process is summarized in Table 1.

3.3.1. Literature Search and Screening

Under the theme of “Applications of Machine Learning in Rural and Village Planning,” this study conducted a comprehensive review of theoretical and empirical research published from 2015 to the present. The purpose was to examine how machine learning has been applied to village classification, spatial pattern analysis, optimization, and suitability evaluation. The literature search covered journal articles and conference papers, with data obtained from CNKI and Web of Science (WOS).
The search strategy was designed to ensure comprehensive coverage across three dimensions: (1) Objects (e.g., “rural,” “settlement,” “village”), (2) Topics (e.g., “planning,” “optimization,” “layout”), and (3) Methods (e.g., “machine learning,” “artificial intelligence”). Logical keyword combinations were employed to maximize retrieval coverage. The search keywords and corresponding results are summarized in Table 2.
The screening process was conducted in three main steps to ensure the accuracy, relevance, and quality of the retrieved literature.
Step 1: Duplicate Removal. Duplicates were identified and removed by comparing titles, authors, publication years, and DOI or journal source information. When both conference and journal versions of the same study were available, the journal version was retained.
Step 2: Title and Keyword Screening. Literature was retained if it explicitly contained core terms such as “rural,” “village,” or “settlement” together with “machine learning” or “deep learning.” Studies unrelated to the research topic were excluded.
Step 3: Abstract and Full-Text Review. Each publication was evaluated based on whether it addressed the application of machine learning in rural or village planning, and whether it provided empirical or methodological insights related to village classification, spatial pattern identification, or suitability analysis.
The final literature set was verified through detailed cross-checking, ensuring inclusion of both domestic and international studies.

3.3.2. Indicator System Construction

The indicators identified in the reviewed studies were systematically extracted, organized, and classified to form the foundation of the proposed indicator system. First, indicators related to rural settlements and machine learning–based analytical methods were collected through word-frequency analysis and terminology consolidation, with standardized processing to eliminate semantic and linguistic inconsistencies. Second, all indicators were categorized by thematic relevance into five primary dimensions: settlement morphology, locational conditions, natural environment, socio-economic attributes, and historical–cultural context. Each primary dimension was further refined into specific secondary indicators, such as boundary compactness, road network proximity, elevation, slope, per-capita GDP, and the density of historical and cultural points, ensuring both conceptual coherence and empirical measurability.
Building upon the synthesis of the literature review and the empirical context of Gaoqing County, a comprehensive evaluation indicator system for the spatial pattern of rural settlements was established. This framework integrates multiple aspects of rural characteristics and captures their multi-dimensional attributes through a structured and data-driven approach. Indicators were derived through extensive literature review, keyword co-occurrence analysis, and expert reasoning. The overall procedure includes:
  • Reviewing domestic and international literature related to rural vitality, settlement morphology, and land-use evaluation;
  • Extracting potential indicators using keyword-frequency statistics and co-occurrence network analysis;
  • Refining and validating indicator selection through expert discussion and relevance testing.
This comprehensive methodological process ensures that the constructed indicator system is both theoretically grounded and empirically applicable. It provides a robust foundation for subsequent data compilation, preprocessing, and multi-model analysis, thereby linking conceptual understanding with the quantitative modeling framework developed in the following stages.

3.4. Automated Data Processing

Upon completing the construction of the indicator system, we obtain the basic dataset required for machine learning, consisting of the independent variables (indicator values) and the dependent variable (settlement-type labels). To ensure the reliability of subsequent classification results, a rigorous data preprocessing procedure is implemented to minimize noise, correct inconsistencies, and enhance overall data quality.
The dataset used in this study combines field-collected information with survey-based records. Due to the unique characteristics of rural settlements, the data compilation process is inherently challenging. First, the indicator system is highly multidimensional, covering socio-economic conditions, natural environment, land construction, and public service facilities, which increases the likelihood of inconsistent measurement and recording errors. Second, data collection typically involves multi-person, multi-stage collaboration in the field, leading to heterogeneous data quality, including variations in measurement precision, inconsistent units, and occasional merging errors during integration. Third, the availability of rural data is often constrained: some indicators may be absent from official records, while others must be obtained through indirect estimates, visual observations, or interviews—approaches that introduce additional uncertainty. Moreover, temporal inconsistencies (e.g., different collection periods) and spatial heterogeneity (e.g., varying infrastructure conditions across villages) may further amplify noise.
These characteristics make preprocessing a critical component of the entire workflow. If noise and abnormal values are not properly addressed, they will directly propagate into the classification models, producing unstable or biased outcomes. However, manual verification of large-scale rural datasets is both time-consuming and error-prone. Therefore, we develop an automated outlier detection pipeline to systematically identify and correct anomalous values, ensuring scalability and reproducibility when dealing with complex and error-prone rural datasets.

3.4.1. Automated Outlier Detection

To achieve robust performance while maintaining computational efficiency, we employ two complementary statistical detection methods: Interquartile Range (IQR)-based detection and Z-score-based detection.
  • IQR-based detection [18]: This method uses the 25th percentile ($Q_1$) and 75th percentile ($Q_3$) to compute the interquartile range $\mathrm{IQR} = Q_3 - Q_1$. Any data point below $Q_1 - 1.5 \cdot \mathrm{IQR}$ or above $Q_3 + 1.5 \cdot \mathrm{IQR}$ is flagged as an outlier. The IQR method is particularly robust to skewed distributions and is well suited for rural indicators with heterogeneous value ranges and non-Gaussian distributions.
  • Z-score detection [19]: This method standardizes data points and computes $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation, respectively. Points with $|z| > 3$ are marked as outliers. The Z-score method is effective for identifying global deviations from the mean in approximately normal distributions.
Using both methods allows the pipeline to capture distribution-tail anomalies and global deviation anomalies simultaneously, thereby improving detection sensitivity across diverse indicator types. The approach is also highly scalable to high-dimensional indicator systems, making it well suited for large rural datasets.
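The two rules can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the linear-interpolation percentile, the use of the population standard deviation, and the choice to flag a point when either rule fires are all assumptions made for illustration.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 1]) of a pre-sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def iqr_outliers(values, k=1.5):
    """Indices of points below Q1 - k*IQR or above Q3 + k*IQR."""
    s = sorted(values)
    q1, q3 = percentile(s, 0.25), percentile(s, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < low or v > high}


def zscore_outliers(values, threshold=3.0):
    """Indices of points with |z| > threshold (population standard deviation)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return set()  # constant column: nothing can deviate
    return {i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold}


def detect_outliers(values):
    """Union of both rules (an assumed way of combining them)."""
    return iqr_outliers(values) | zscore_outliers(values)


# Hypothetical indicator column: 20 plausible values plus one recording error.
areas = [50.0 + i for i in range(20)] + [5000.0]
flagged = detect_outliers(areas)
```

In this toy column only the last entry (index 20) is flagged by both rules. On small samples the Z-score rule can fail to fire at all, because a single extreme value inflates the standard deviation, which is one reason to pair it with the IQR rule.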

3.4.2. Outlier Handling and Missing Value Imputation

To prevent anomalous values from distorting model training, all detected outliers are uniformly converted into missing values. This standardization ensures consistency across all features and reduces the risk of overfitting to abnormal records. Since several machine learning models (e.g., SVM, logistic regression) do not support missing values directly, we construct an additional imputed dataset in which missing entries are replaced by their column-wise means. This enables consistent and fair comparisons across algorithms. For models with native missing-value handling capabilities (e.g., CatBoost, LightGBM), we retain the original missing indicators, allowing these models to leverage this information during training.
Overall, the preprocessing stage provides a rigorous and reproducible foundation for the subsequent classification tasks by systematically identifying and correcting data noise, ensuring broad algorithm compatibility through flexible missing-value treatment, and reducing the risk that raw-data inconsistencies propagate into model predictions.
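The two dataset variants described above can be sketched as follows. This is an illustrative sketch under stated assumptions: missing entries are represented as NaN, and the helper names `mask_outliers` and `mean_impute` are hypothetical rather than taken from the study.

```python
import math

NAN = float("nan")


def mask_outliers(column, outlier_idx):
    """Replace flagged entries with NaN so they are treated as missing."""
    return [NAN if i in outlier_idx else v for i, v in enumerate(column)]


def mean_impute(column):
    """Fill NaN entries with the mean of the observed (non-missing) values."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        return list(column)  # no observed value to compute a mean from
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]


# Hypothetical column: index 3 was flagged as an outlier, index 1 was never recorded.
raw = [10.0, NAN, 14.0, 900.0, 12.0]

# Variant 1: keep missing entries, for models with native missing-value handling.
with_missing = mask_outliers(raw, outlier_idx={3})

# Variant 2: column-wise mean imputation, for models such as SVM or logistic regression.
imputed = mean_impute(with_missing)
```

Here the imputed column becomes `[10.0, 12.0, 14.0, 12.0, 12.0]`. When a train/test split is used, computing the column mean on the training portion only is a common precaution against information leakage.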

3.5. Multi-Model Classification and Ensemble Integration

3.5.1. Model Selection

To explore which machine learning algorithms are most suitable for rural settlement classification, we selected a set of seven representative models. This selection was intentionally designed to be systematic, covering a broad spectrum of modeling approaches:
1. Traditional Statistical Baselines (Logistic Regression, Linear Discriminant Analysis)
2. Classic High-Performance ML (Support Vector Machines)
3. Tree-Based Algorithms (Random Forest)
4. Gradient Boosting Algorithms (XGBoost, LightGBM, CatBoost)
This diverse portfolio allows us to rigorously evaluate the relative performance of these distinct modeling paradigms. The selected algorithms combine classical interpretability, robustness to noise, and strong performance on tabular data. Table 3 summarizes the selected models, their support for missing values, their key characteristics, and their year of introduction.
Each model is briefly described below, with technical details and core mathematical formulations.
(1) Logistic Regression (LR).
Logistic regression is a classical linear model that predicts the probability of a sample belonging to a class using the sigmoid function. For binary classification, the model is defined as:
$$P(y = 1 \mid x) = \sigma(w^{\top} x + b) = \frac{1}{1 + \exp(-(w^{\top} x + b))},$$
where $x$ is the input feature vector, $w$ the weight vector, $b$ the bias term, and $\sigma(\cdot)$ the sigmoid activation function. For multi-class classification, the softmax function is used:
$$P(y = k \mid x) = \frac{\exp(w_k^{\top} x)}{\sum_{j=1}^{K} \exp(w_j^{\top} x)}.$$
LR serves as a performance baseline and provides interpretable coefficients indicating the relative importance of each indicator.
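As a small illustration of the multi-class formula, the sketch below evaluates the softmax probabilities for one sample. The weight matrix and the input vector are hypothetical values chosen for readability, and bias terms are omitted as in the softmax expression above.

```python
import math


def softmax_probs(weights, x):
    """P(y = k | x) for each class k, given one weight vector per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w_k, x)) for w_k in weights]
    m = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights for 4 settlement types over 2 standardized indicators.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
probs = softmax_probs(W, x=[2.0, 0.5])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The probabilities sum to one, and the class with the largest linear score (here class 0) receives the highest probability. With two classes and one weight vector fixed at zero, the expression reduces to the sigmoid form used for binary classification.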
(2) Linear Discriminant Analysis (LDA).
LDA is another cornerstone of statistical classification, originating from Fisher’s linear discriminant. Unlike the discriminative LR, LDA is a generative model that assumes features follow a Gaussian distribution. It finds a low-dimensional projection that maximizes the separation between class means while minimizing the variance within each class. The objective is to find a projection vector $\mathbf{w}$ that maximizes the Fisher–Rao ratio:
$$J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \mathbf{w}}{\mathbf{w}^{\top} S_W \mathbf{w}},$$
where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix. For $K$ classes, LDA projects the data onto a $(K-1)$-dimensional space. Classification is then performed by finding the class centroid closest to the new data point in the projected space. It serves as a strong traditional baseline, especially for linearly separable data.
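As a rough illustration of the ratio $J(\mathbf{w})$, the sketch below builds both scatter matrices for a toy two-class dataset and evaluates two candidate projection directions. The data are hypothetical, and weighting the between-class scatter by class size is one common convention that is assumed here rather than taken from the paper.

```python
import numpy as np

def fisher_ratio(X, y, w):
    """Compute J(w) = (w^T S_B w) / (w^T S_W w) for labelled data."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))  # between-class scatter
    S_W = np.zeros((d, d))  # within-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)   # assumed convention: weight by class size
        centered = X_c - mean_c
        S_W += centered.T @ centered
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# Toy data: the two classes differ only along the first feature.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])

j_first_axis = fisher_ratio(X, y, np.array([1.0, 0.0]))
j_second_axis = fisher_ratio(X, y, np.array([0.0, 1.0]))
```

Projecting onto the first axis separates the class means, so its ratio is large, whereas the second axis carries no between-class separation and yields a ratio of zero.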
(3) Support Vector Machine (SVM).
SVM aims to find the optimal separating hyperplane that maximizes the margin between different classes. Given training data $\{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{-1, 1\}$, the primal optimization problem is:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$
where $C$ is the penalty parameter and $\xi_i$ are slack variables allowing for misclassification. Nonlinear decision boundaries can be modeled using kernel functions such as RBF or polynomial kernels.
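To make the role of the slack variables concrete, the following sketch evaluates the primal objective for a fixed linear hyperplane, setting each $\xi_i$ to the smallest value that satisfies the constraints. The hyperplane, data, and `C` are hypothetical, and no optimization is performed.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi) with the smallest feasible slack values."""
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)  # xi_i = max(0, 1 - y_i (w^T x_i + b))
    objective = 0.5 * float(w @ w) + C * float(slack.sum())
    return objective, slack

# Three toy samples; the second one lies inside the margin.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])

objective, slack = svm_primal_objective(np.array([1.0, 0.0]), 0.0, X, y, C=1.0)
```

Only the sample inside the margin receives non-zero slack, so a larger `C` would penalize that violation more heavily relative to the margin term.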
(4) Random Forest (RF).
Random Forest is a decision-tree-based algorithm known for its robustness. It constructs multiple decision trees on bootstrapped samples of the data and aggregates their predictions through majority voting:
$$\hat{y} = \operatorname{mode}\{h_t(\mathbf{x})\}_{t=1}^{T},$$
where $h_t(\mathbf{x})$ denotes the prediction of the $t$-th decision tree. RF is robust to noise, reduces overfitting through averaging, and performs well with heterogeneous indicators.
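The aggregation rule can be sketched in a few lines. The per-tree votes below are hypothetical, and the helper only illustrates the mode operation in the formula, not any particular library's internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent class label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from T = 7 trees for one settlement (class labels 1 to 4).
votes = [1, 4, 1, 2, 1, 4, 1]
forest_prediction = majority_vote(votes)
```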
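(5) XGBoost.
Extreme Gradient Boosting (XGBoost) is an efficient implementation of gradient boosting decision trees (GBDT). The model is trained in an additive manner:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i),$$
where $f_t$ is the $t$-th regression tree. The objective function combines training loss and regularization:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{t=1}^{T} \Omega(f_t), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda\|\omega\|^2.$$
Here $l$ is the training loss and $\Omega$ penalizes tree complexity. Within $\Omega(f)$, $T$ denotes the number of leaves of tree $f$ and $\omega$ its vector of leaf weights, with $\gamma$ and $\lambda$ as regularization coefficients.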
XGBoost supports efficient handling of missing values and offers strong performance on tabular data.
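A minimal numerical sketch of the two ingredients above, the additive update and the complexity penalty $\Omega(f)$, is given below. All scores, leaf weights, and coefficients are hypothetical, and the learning-rate shrinkage used in practice is left out to match the formula as written.

```python
import numpy as np

def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * ||omega||^2, where T is the number of leaves."""
    leaf_weights = np.asarray(leaf_weights, dtype=float)
    n_leaves = leaf_weights.size
    return gamma * n_leaves + 0.5 * lam * float(leaf_weights @ leaf_weights)

# Additive update: the new tree's outputs are added to the previous raw scores.
previous_scores = np.array([0.2, -0.1, 0.4])
tree_outputs = np.array([0.05, 0.10, -0.20])
updated_scores = previous_scores + tree_outputs

# Penalty for a hypothetical tree with three leaves.
penalty = tree_complexity([0.5, -0.5, 1.0], gamma=1.0, lam=2.0)
```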
(6) LightGBM.
LightGBM is an optimized implementation of gradient boosting decision trees. Instead of using exact greedy algorithms, it adopts histogram-based feature binning and leaf-wise tree growth with depth constraints, which significantly improves training speed and reduces memory usage while maintaining high accuracy. These characteristics make LightGBM particularly suitable for high-dimensional tabular datasets.
(7) CatBoost.
CatBoost is a gradient boosting algorithm specifically designed to handle categorical features and missing values natively. It introduces ordered boosting to prevent target leakage and efficient encoding strategies for categorical variables. This allows CatBoost to achieve stable and robust performance on heterogeneous survey data without extensive preprocessing.
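Native handling of missing values can be pictured with a simplified sketch in the spirit of the default-direction idea documented for XGBoost: at a candidate split, samples with a missing feature are sent to whichever side gives the lower loss. Squared error is used here as a stand-in for the gradient-based gain, the data are hypothetical, and LightGBM and CatBoost use their own mechanisms, so this is an illustration rather than any library's actual implementation.

```python
import numpy as np

def best_default_direction(feature, target, threshold):
    """Choose where missing values should go at a split by comparing squared error."""
    missing = np.isnan(feature)
    go_left = np.zeros(feature.shape, dtype=bool)
    go_left[~missing] = feature[~missing] < threshold

    def sse(mask):
        # Sum of squared errors around the mean of the samples in this branch.
        if not mask.any():
            return 0.0
        return float(((target[mask] - target[mask].mean()) ** 2).sum())

    losses = {}
    for direction in ("left", "right"):
        left = (go_left | missing) if direction == "left" else go_left
        losses[direction] = sse(left) + sse(~left)
    return min(losses, key=losses.get), losses

# Hypothetical indicator with two missing entries that share the high-value target.
feature = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
target = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

direction, losses = best_default_direction(feature, target, threshold=5.0)
```

In this toy case the missing entries behave like the high-value group, so routing them to the right branch gives a perfect split; no imputed value is ever needed.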

3.5.2. Model Configuration and Training

  • Model Configuration. All models are implemented using Python 3.9 (scikit-learn, XGBoost, LightGBM, and CatBoost). Default hyperparameters serve as starting points, followed by light tuning to balance accuracy and computational efficiency.
  • Logistic Regression: solver = 'lbfgs', max_iter = 500, class_weight = 'balanced'.
  • Linear Discriminant Analysis: solver = 'svd', shrinkage = 'auto'.
  • SVM: kernel = 'rbf', C = 1.0, class_weight = 'balanced'.
  • Random Forest: n_estimators = 300, max_depth = None, min_samples_split = 2.
  • XGBoost: n_estimators = 500, learning_rate = 0.05, max_depth = 6, subsample = 0.8.
  • LightGBM: num_leaves = 31, learning_rate = 0.05, n_estimators = 500.
  • CatBoost: iterations = 500, learning_rate = 0.05, depth = 6, loss_function = 'MultiClass'.
  • Hyperparameters are determined through preliminary experiments to achieve a trade-off between accuracy, stability, and training time. All experiments are conducted on a workstation with an NVIDIA RTX GPU and 64 GB RAM.
  • Model Training. A total of N samples were collected from representative rural regions. To ensure both sufficient training data and an independent evaluation set, the dataset is randomly divided into training (90%) and testing (10%) subsets. A stratified sampling strategy is adopted to preserve the original class distribution across different settlement categories, thereby avoiding potential sampling bias. Following the split, an analysis of the training set revealed a significant class imbalance, with certain settlement categories being severely underrepresented. To mitigate the risk of models developing a bias towards the majority classes, we implemented a data-level rebalancing strategy using the Synthetic Minority Oversampling Technique (SMOTE) [27].
Crucially, the SMOTE algorithm was applied only to the 90% training set. The 10% testing set was left untouched in its original, imbalanced state to serve as a realistic representation of the real-world data distribution for our final evaluation. Unlike simple duplication, SMOTE generates new, synthetic samples for the minority classes. The procedure operates in the feature space:
1. It selects a minority class sample $\mathbf{x}_i$.
2. It identifies its $k$ nearest neighbors in the feature space (we used the standard $k = 5$).
3. It randomly selects one of these neighbors, $\mathbf{x}_{\text{nn}}$.
4. It generates a new synthetic sample $\mathbf{x}_{\text{new}}$ by interpolating along the line segment between the two samples:
$$\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda(\mathbf{x}_{\text{nn}} - \mathbf{x}_i), \quad \lambda \sim U(0, 1).$$
This process was applied to all underrepresented classes with sampling_strategy = "auto", which instructs SMOTE to oversample each minority class until it reaches the size of the majority class. As a result, all classes in the training set achieve an equal number of samples, forming a balanced 1:1:1:1 distribution. By enriching the training data with these synthetic yet plausible samples, the models are less biased and can learn more discriminative decision boundaries, particularly for the originally underrepresented settlement categories.
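The interpolation procedure can be sketched as follows. This is a simplified NumPy illustration of the four steps for a single minority class, not the implementation used in the study; the function name, the toy data, and the fixed random seed are assumptions made for the example.

```python
import numpy as np

def smote_oversample(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances; a sample is never its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                   # step 1: pick a minority sample
        nn = neighbours[i, rng.integers(k)]   # steps 2-3: pick one of its k neighbours
        lam = rng.random()                    # lambda drawn from U(0, 1)
        synthetic[row] = X_minority[i] + lam * (X_minority[nn] - X_minority[i])  # step 4
    return synthetic

# Hypothetical minority class with six samples and two features.
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                       [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
synthetic = smote_oversample(X_minority, n_new=10, k=5)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies within the range spanned by the observed minority data. To reach a balanced training set, `n_new` would be set to the gap between the majority-class count and the minority-class count.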

3.6. Ensemble Integration and SHAP Explainability

3.6.1. Ensemble Integration Strategy

While the multi-model comparison identifies the best-performing single models, relying on any individual model may still introduce model-specific biases or unstable behavior on certain subsets of the data. To improve overall robustness and stability, we adopt a simple top-N weighted ensemble approach [28]. The procedure consists of two steps and is summarized in Algorithm 1. First, all seven candidate models (LR, LDA, SVM, RF, XGBoost, LightGBM, CatBoost) are trained and evaluated on the test set. Based on their performance (e.g., overall accuracy), the top three models are automatically selected as the ensemble base learners, denoted by $\mathcal{M}_{\text{top}}$. Second, each selected model is assigned a weight proportional to its test-set performance. The weight for model $m$ is computed as:
$$w_m = \frac{\text{Score}(m)}{\sum_{j \in \mathcal{M}_{\text{top}}} \text{Score}(j)}. \tag{1}$$
Here, $\text{Score}(m)$ denotes the evaluation metric (e.g., accuracy) of model $m$ on the test set, and the normalization in Equation (1) ensures that the weights sum to one.
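As a worked illustration of Equation (1), assume that the score is overall accuracy and take the single-model accuracies reported in Section 4.3.1 (CatBoost 0.86, LightGBM 0.84, XGBoost 0.78). Under that assumption, the weights would be approximately:

```latex
\begin{aligned}
\sum_{j \in \mathcal{M}_{\text{top}}} \text{Score}(j) &= 0.86 + 0.84 + 0.78 = 2.48, \\
w_{\text{CatBoost}} &= 0.86 / 2.48 \approx 0.347, \\
w_{\text{LightGBM}} &= 0.84 / 2.48 \approx 0.339, \\
w_{\text{XGBoost}} &= 0.78 / 2.48 \approx 0.315.
\end{aligned}
```

Because the three scores are close to one another, the resulting weights are close to uniform.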
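Algorithm 1 Top-N Weighted Ensemble Framework
Require: Training data $D_{\text{train}}$, test data $D_{\text{test}}$
Require: Candidate models $\mathcal{M}_{\text{all}} = \{M_1, \dots, M_7\}$
Require: $N = 3$
 1: Train each $m_i \in \mathcal{M}_{\text{all}}$ on $D_{\text{train}}$
 2: for each model $m_i \in \mathcal{M}_{\text{all}}$ do
 3:   $S_i \leftarrow \text{Evaluate}(m_i, D_{\text{test}})$
 4: end for
 5: Select top-N models $\mathcal{M}_{\text{top}}$ based on $S_i$
 6: Compute weights $w_i = S_i / \sum_{m_j \in \mathcal{M}_{\text{top}}} S_j$
 7: Prediction for a new sample $\mathbf{x}$:
 8: $\hat{y} \leftarrow \arg\max_c \sum_{m_i \in \mathcal{M}_{\text{top}}} w_i \cdot P_i(y = c \mid \mathbf{x})$
 9: return $\hat{y}$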
The final predicted class label for each rural settlement i is obtained through weighted soft voting:
$$\hat{y}_i = \arg\max_c \sum_{m \in \mathcal{M}_{\text{top}}} w_m \cdot P_m(y_i = c),$$
where $P_m(y_i = c)$ denotes the probability assigned by model $m$ that settlement $i$ belongs to class $c$, and $w_m$ is the weight defined in Equation (1). After this integration, the ensemble can be treated as a single model. By combining the strengths of the top-performing base learners and weighting them according to their empirical performance, this ensemble model achieves more stable and robust classification results than any individual model across heterogeneous rural settlement categories. For further details, related code is available at: mult-model-rural-classification, https://github.com/hnurxn/mult-model-rural-classification (accessed on 18 November 2025).
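The selection, weighting, and soft-voting steps can be sketched compactly. The snippet below is an illustrative reading of the procedure, not the code in the linked repository; the model names, scores, and probability matrices are hypothetical, and class indices are 0-based.

```python
import numpy as np

def top_n_weighted_vote(scores, probabilities, n=3):
    """Select the n best-scoring models, normalize their scores, and soft-vote."""
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    total = sum(scores[m] for m in top)
    weights = {m: scores[m] / total for m in top}            # weights sum to one
    combined = sum(weights[m] * probabilities[m] for m in top)
    return top, weights, combined.argmax(axis=1)

# Hypothetical evaluation scores for four candidate models.
scores = {"model_a": 0.86, "model_b": 0.84, "model_c": 0.78, "model_d": 0.63}

# Hypothetical class probabilities: rows are settlements, columns are the 4 classes.
probabilities = {
    "model_a": np.array([[0.6, 0.2, 0.1, 0.1], [0.1, 0.2, 0.3, 0.4]]),
    "model_b": np.array([[0.5, 0.3, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1]]),
    "model_c": np.array([[0.3, 0.4, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]]),
    "model_d": np.array([[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]),
}

top_models, weights, predictions = top_n_weighted_vote(scores, probabilities, n=3)
```

The lowest-scoring model is excluded before voting, so its confident but divergent probabilities have no effect on the final labels.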

3.6.2. SHAP-Based Explanation

After determining the final ensemble classification model, we further apply SHAP (SHapley Additive exPlanations) [29] to interpret the contribution of each indicator and enhance the transparency of the model. SHAP provides a theoretically grounded framework for feature attribution based on cooperative game theory, where each indicator is treated as a “player” contributing to the prediction outcome.
For a given sample, the SHAP value of feature $x_j$ represents its marginal contribution to the model prediction and is defined as:
$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{j\}) - f(S) \right],$$
where $F$ denotes the full feature set, $S$ is any subset not containing feature $j$, and $f(\cdot)$ represents the model output. The term $f(S \cup \{j\}) - f(S)$ captures the marginal change in the model output when feature $j$ is added to subset $S$, while the combinatorial coefficient ensures that all subsets are weighted fairly. This formulation enables SHAP to provide additive and locally accurate feature attributions.
Applied to the ensemble model, SHAP allows us to derive both global and local explanations. Globally, SHAP summarizes which indicators consistently exert strong influence on predicting rural settlement types and how their contributions vary across categories. Locally, SHAP values reveal the specific indicators driving the classification of an individual village, offering insight into case-by-case decision mechanisms. These explanations bridge the gap between data-driven modeling and domain knowledge, ensuring that the automated classification results remain interpretable, diagnostically useful, and aligned with planning realities.
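For intuition, the Shapley formula can be evaluated exactly when the feature set is small. The sketch below enumerates every subset for a hypothetical three-feature value function; the feature names and the function itself are invented for illustration. Exhaustive enumeration grows exponentially with the number of features, so SHAP libraries rely on faster approximations or model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating all subsets S of F without feature j."""
    n = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(n):
            coeff = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += coeff * (value(s | {j}) - value(s))  # marginal contribution
        phi[j] = total
    return phi

def toy_model_output(subset):
    """Hypothetical model output: 'a' adds 2, 'b' adds 1, and a and b together add 1 more."""
    out = 0.0
    if "a" in subset:
        out += 2.0
    if "b" in subset:
        out += 1.0
    if "a" in subset and "b" in subset:
        out += 1.0
    return out

phi = shapley_values(["a", "b", "c"], toy_model_output)
```

The interaction bonus is split evenly between the two interacting features, the irrelevant feature receives zero, and the attributions add up to the difference between the full-feature output and the empty-set output.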

4. Results

4.1. Constructed Indicator System and Dataset Characteristics

4.1.1. Literature Review Results

The literature review (Table 4) shows that existing machine learning applications in rural and village planning predominantly revolve around five major categories of indicators: settlement morphology, locational conditions, natural environment, socio-economic attributes, and historical–cultural features. These categories collectively reflect the multidimensional characteristics that influence rural settlement patterns.
(1) Settlement Morphology.
This category captures the spatial form, structure, and scale of rural settlements. Commonly used indicators include boundary compactness, axial or road network patterns, and built-up area ratio. Prior studies employing models such as XGBoost, LightGBM, and GBDT demonstrate that morphological indicators effectively describe spatial structure and help reveal underlying development patterns.
(2) Locational Conditions.
Locational indicators are among the most frequently used in the literature. They typically measure distances to towns, roads, rivers, and public facilities, as well as accessibility and road network density. Indicators such as distance to main roads and distance to towns appear most consistently, underscoring the importance of transportation accessibility in shaping settlement distribution and functional clustering. Methods including Random Forest, GBDT, and SVM are commonly adopted for these analyses.
(3) Natural Environment.
Environmental indicators form the largest group, encompassing elevation, slope, terrain relief, land-cover characteristics (e.g., cultivated land ratio, NDVI), and climatic factors such as temperature and precipitation. Studies using Random Forest, XGBoost, and MGWR (Multiscale Geographically Weighted Regression) highlight the strong constraining effect of natural environmental conditions on the spatial differentiation and evolution of rural settlements.
(4) Socio-Economic Attributes.
These indicators describe variations in economic activity and population distribution. Common metrics include per-capita GDP, income level, population density, built-up land area, and nighttime light intensity. Analytical approaches such as MGWR, CNN-based models, and XGBoost–SHAP are frequently used to examine socio-economic disparities and their relationship with settlement form and function.
(5) Historical and Cultural Features.
Although less frequently emphasized, historical–cultural indicators capture elements such as cultural heritage sites, traditional architecture, and historical landmarks. Techniques including MGWR and BP Neural Networks have been employed to assess how these cultural characteristics contribute to rural settlement differentiation and identity.
Overall, the literature consistently demonstrates that rural settlement classification requires a multi-dimensional indicator system reflecting spatial, environmental, economic, and cultural characteristics. These findings support the comprehensiveness of the indicator framework constructed in this study.

4.1.2. Indicator System Results

Based on the synthesis of the literature review and the empirical characteristics of Gaoqing County, this study establishes a systematic indicator framework for evaluating the spatial patterns of rural settlements. As summarized in Table 5, the framework consists of four primary dimensions—socio-economic attributes, natural environment, land construction and utilization, and supporting public services—each further divided into detailed secondary indicators. Together, these indicators provide a structured and comprehensive representation of settlement characteristics.
(1) Socio-economic Attributes.
This dimension captures demographic conditions and industrial development. Key indicators include aging rate, permanent population, population outflow rate, and average annual population growth. In addition, per capita village income and the proportion of elderly agricultural labor reflect local economic vitality and labor structure. These indicators, shown in Table 5, characterize the human and economic foundation of each settlement.
(2) Natural Environment.
This dimension incorporates natural geographic and ecological resource conditions. Indicators include terrain factors, hydrological features, vegetation status, and ecological sensitivity. As listed in Table 5, these variables describe the environmental constraints and carrying capacity that influence both the distribution and potential development trajectories of rural settlements.
(3) Land Construction and Utilization.
This dimension reflects development intensity and land-use efficiency. Construction-related indicators—such as residential land aggregation, land-use intensity, and built-up area ratio—capture spatial compactness and physical development patterns. Land-use indicators, including farmland ratio, farmland transfer rate, and ecological red-line proportion, further describe human–land interactions and the degree of land consolidation. All corresponding indicators are detailed in Table 5.
(4) Supporting Public Services.
This dimension reflects the availability of infrastructure and public service facilities within each settlement. Indicators such as tap water access, natural gas access, waste collection points, farmers’ market access, community service center availability, and primary school availability capture the level of basic service provision. All corresponding indicators are detailed in Table 5.

4.1.3. Dataset Collection Results

After establishing the indicator system, a complete set of 33 feature variables (i.e., independent variables) was determined, covering socio-economic attributes, locational conditions, natural environment, land construction and utilization, and supporting public services. The dependent variable corresponds to the rural settlement categories defined by policy documents, comprising four types.
Following the determination of features and classification labels, a large-scale field survey was conducted across all towns and subdistricts of Gaoqing County. The survey covered 87 settlements in Gaocheng Town, 100 in Heilizhai Town, 103 in Muli Town, 73 in Changjia Town, 57 in Luhu Subdistrict, 111 in Qingcheng Town, 71 in Tianzhen Subdistrict, 72 in Huagou Town, and 73 in Tangfang Town, resulting in a total of 767 rural settlements. Among these, 361 settlements were assigned definitive category labels through expert validation based on policy criteria. Therefore, the final dataset used for modeling consists of 361 samples, each with 33 feature variables and one policy-defined settlement type belonging to one of the four categories.
Drawing on an extensive literature review and the characteristics of rural settlements, we constructed an indicator system with 4 primary dimensions and 33 secondary indicators. Based on the 4 policy-defined settlement categories, field surveys were conducted, yielding 361 valid rural settlements with complete features and confirmed labels for subsequent analysis.

4.2. Data Cleaning: Outlier Detection and Preprocessing Results

After obtaining the dataset, we applied an automated outlier detection procedure to address the inherent challenges of rural data collection, such as inconsistent recording practices and potential human errors. A data point was flagged as an outlier only when it was simultaneously detected by both the IQR-based and Z-score methods, ensuring high precision in anomaly identification. Table 6 summarizes the eight features with the highest number of detected outliers.
The first row lists the feature names, and the second row reports the corresponding outlier counts. Several indicators exhibit notably high anomaly frequencies, which can be traced to questionnaire filling mistakes, data type mismatches, or formatting inconsistencies during manual entry. For example, both Natural Environment and Land Abandonment are ordinal variables expected to take discrete values from 1 to 4. However, five and eight observations, respectively, were filled with the invalid value 0. Likewise, Ecological Redline Encroachment, also an ordinal variable with a valid range of 1–4, contains 20 entries incorrectly recorded as percentages.
Another typical issue arises with Land Transfer Rate, which should be expressed as a numeric percentage. Among the 34 outliers detected, 32 entries contain Chinese characters (a clear type error), and two values are near-zero extremes that likely resulted from incorrect reporting during field investigation. Moreover, Tap Water Access and Waste Collection Points are binary indicators (0 or 1), yet 7 and 11 entries, respectively, contain values outside this range, making their intended meaning ambiguous.
Apart from the features listed in Table 6, outlier counts for most other indicators remain relatively low. Notably, among all 33 features, only two exhibit no anomalies. To further validate the reliability of our automated outlier detection method, we manually inspected all anomalies identified in Table 6 and confirmed that every detected case corresponded to a genuine data error, with no false positives. This demonstrates the robustness and practical effectiveness of our detection approach.
Overall, these findings highlight the common challenges associated with large-scale rural survey data, including heterogeneous recording standards, the absence of validation rules during data entry, and misunderstandings of measurement scales among enumerators. Addressing such anomalies at the preprocessing stage is essential to ensuring the accuracy and stability of subsequent classification modeling.
Rural survey data are prone to recording errors, making automated outlier detection essential. Our approach identified anomalies in 31 out of 33 features, and manual verification confirmed all detections as true errors. This demonstrates that the proposed method can substantially reduce manual checking effort while ensuring data reliability.
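The dual-criterion rule can be sketched as follows. The thresholds (1.5 × IQR and |z| > 3) are conventional defaults assumed here, since the section does not restate them, and the example column is hypothetical: an ordinal indicator with valid values 1 to 4 in which one entry was recorded as a percentage.

```python
import numpy as np

def flag_outliers(values, iqr_factor=1.5, z_threshold=3.0):
    """Flag a value only when both the IQR rule and the Z-score rule mark it."""
    values = np.asarray(values, dtype=float)

    # IQR rule: outside [Q1 - factor * IQR, Q3 + factor * IQR].
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    iqr_flag = (values < q1 - iqr_factor * iqr) | (values > q3 + iqr_factor * iqr)

    # Z-score rule: more than z_threshold standard deviations from the mean.
    std = values.std()
    z_scores = (values - values.mean()) / std if std > 0 else np.zeros_like(values)
    z_flag = np.abs(z_scores) > z_threshold

    return iqr_flag & z_flag

# Hypothetical ordinal indicator (valid range 1-4) with one percentage-style entry.
column = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3, 3, 2, 4, 1, 3, 2, 3, 4, 2, 95]
flags = flag_outliers(column)
flagged_rows = np.flatnonzero(flags).tolist()
```

Requiring agreement between the two rules trades some sensitivity for precision: a value must be extreme relative to both the quartile spread and the overall standard deviation before it is reported.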

4.3. Classification Evaluation and Ensemble Model Performance

4.3.1. Multi-Model Performance Evaluation

Before evaluating model performance, it is important to clarify the class distribution of the training data. The original dataset exhibited substantial class imbalance, with the four rural settlement categories containing 194, 51, 21, and 95 samples, respectively. After applying SMOTE to the training set, all classes were resampled to match the majority class, resulting in an equal count of 194 samples for each category. This balanced dataset provides a more reliable foundation for comparing the effectiveness of different classifiers. Based on the performance comparison of the eight classifiers shown in Figure 5 (the seven candidate models plus a random baseline), this section first provides an overall analysis of the results, followed by a detailed model-by-model interpretation, and concludes with a discussion of practical model selection strategies for this task.
  • Overall Findings. The results clearly establish a performance hierarchy. At the baseline, the RandomClassifier performs as expected (Acc. ≈ 0.25), and the traditional statistical models, LDA (F1 0.39) and LR (F1 0.55), demonstrate that simple linear boundaries are insufficient for this complex task. The SVM_RBF (F1 0.53) performs similarly poorly, exhibiting a severe imbalance between high precision (0.75) and very low recall (0.41). Random Forest (F1 0.63) offers a moderate baseline but is clearly outperformed by the gradient boosting family.
The gradient boosting models (XGBoost, LightGBM, CatBoost) are the clear winners, occupying the Top 3 positions. CatBoost achieves the best all-around performance, leading in F1-score (0.88), Accuracy (0.86), and Recall (0.90). LightGBM is the strong runner-up, with the second-highest F1 (0.83) and Accuracy (0.84). XGBoost ranks third (F1 0.78; Acc. 0.78), still outperforming all non-boosting models by a significant margin. This outcome aligns with theoretical expectations. The rural survey dataset is characterized by complex, non-linear interactions and (as is common in such tasks) a high proportion of missing values. Models requiring complete data (e.g., LDA, LR, SVM) are vulnerable to imputation errors, which dilute informative patterns. Random Forest is also dependent on imputation quality. In contrast, LightGBM, XGBoost, and CatBoost can natively handle missing values, treating missingness as an informative signal during tree growth rather than as a preprocessing hurdle. This native handling is the primary explanation for their superior ability to maintain high performance.
  • Model-by-Model Analysis. The models are analyzed in order of increasing performance:
  • RandomClassifier. Serves as the chance-level baseline for a 4-class problem (F1/Acc. 0.25). All ML models significantly outperform this floor.
  • LDA. As the weakest-performing ML model (F1 0.39), LDA’s assumption of linear separability and Gaussian distributions is ill-suited for the dataset’s complexity.
  • SVM_RBF. Exhibits the most severe performance imbalance (F1 0.53). Its high precision (0.75) but exceptionally low recall (0.41) suggests it only classifies high-confidence samples, missing the majority of true positives. This reflects its sensitivity to the high-dimensional, noisy, and imputed data.
  • LR. Although balanced, its linear nature limits its performance (F1 0.55; Acc. 0.53), failing to capture the non-linear feature interactions critical to rural classification.
  • Random Forest. Represents a significant step up from linear models (F1 0.63). While bagging mitigates variance, its performance is capped by its reliance on data imputation, preventing it from leveraging missingness as a feature.
  • XGBoost. The first of the high-performance models (F1 0.78; Acc. 0.78). It confirms the power of gradient boosting, achieving strong, balanced results.
  • LightGBM. Demonstrates excellent, balanced performance (F1 0.83; Acc. 0.84). Its high recall (0.86) and precision (0.80) show a well-balanced trade-off, validating its histogram-based approach.
  • CatBoost. The clear top performer (F1 0.88; Acc. 0.86). Its state-of-the-art handling of missing values and high recall (0.90) make it exceptionally robust, achieving the best overall precision-recall balance for this task.
  • Implications for Model Selection. This comparison provides a clear strategy for this classification task:
  • When missingness is substantial, tree-boosting models with native NaN handling (CatBoost, LightGBM, XGBoost) should be prioritized. They avoid the risks of imputation bias and utilize all available information.
  • Models like LDA, LR, and SVM, while useful for benchmarking, are not suitable for high-performance deployment in this context due to their poor performance and sensitivity to imputation.
  • For a single-model deployment, CatBoost offers the best all-around performance (highest F1, Acc, and Recall). LightGBM presents a highly competitive and balanced alternative.
Given their superior and complementary performance, CatBoost, LightGBM, and XGBoost are confirmed as the optimal Top 3 base learners. This result provides the direct input for our automated ensemble framework (as detailed in Algorithm 1), which aims to combine their respective strengths for a final, state-of-the-art classification.
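The reported scores can be read against the usual definition of F1 as the harmonic mean of precision and recall. The sketch below assumes that this relationship holds between the reported aggregate values (the averaging scheme is not stated in this section) and uses only the precision and recall figures quoted above for SVM_RBF and LightGBM.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall values quoted in the model-by-model analysis.
svm_f1 = f1_score(0.75, 0.41)        # high precision, very low recall
lightgbm_f1 = f1_score(0.80, 0.86)   # balanced precision and recall

svm_f1_rounded = round(svm_f1, 2)
lightgbm_f1_rounded = round(lightgbm_f1, 2)
```

Under this assumption both values round to the reported F1 scores (0.53 and 0.83), which also shows how the harmonic mean pulls the SVM score toward its weak recall.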

4.3.2. Multi-Model Ensemble Evaluation

Building upon the insights from the multi-model comparison, the proposed Automated Top-N Weighted Ensemble Framework was deployed. As detailed in Table 7, the framework achieved an overall classification accuracy of 88%. This is an improvement of 2 percentage points over the best single model (CatBoost at 86% accuracy), which supports the efficacy of our ensemble strategy.
Beyond the impressive overall accuracy, the ensemble demonstrates robust and balanced performance across all four rural settlement categories:
  • High-Performance Categories: Classes 1 and 4 achieve exceptional accuracy levels (0.90 and 0.91, respectively). This performance suggests that the ensemble effectively synthesizes the distinct spatial, structural, and socio-economic characteristics defining these settlement types. It is also worth noting that these categories typically represent larger proportions of the original dataset, which likely contributes to the models’ enhanced ability to learn more robust patterns for them. The complementary strengths of the Top 3 models allow for more precise decision boundaries in these well-defined categories.
  • Robustness in Complex Categories: For Classes 2 and 3, where single models might struggle with more intricate patterns or higher inherent data uncertainty (as implied by their slightly lower individual model scores), the ensemble maintains a strong performance of 0.85 and 0.86, respectively. This demonstrates the framework’s superior generalization ability and resilience to noise and ambiguous feature interactions, even in potentially less represented or more ambiguous categories.
The synergy achieved by combining the complementary decision boundaries and expertise of the Top 3 gradient boosting models (CatBoost, LightGBM, and XGBoost) is evident. This ensemble strategy enhances adaptability to heterogeneous rural data, effectively mitigates individual model weaknesses, and significantly improves resilience against challenges posed by missing values and class imbalance.
In conclusion, the proposed fully automated multi-model evaluation and ensemble framework achieves the highest classification accuracy among the models tested (88%) and maintains stable performance across the four rural settlement categories. This makes it a suitable AI-driven decision-support tool for practical planning and policy formulation scenarios aimed at sustainable rural development.

4.3.3. SHAP-Based Feature Importance Analysis

To complement the accuracy-based evaluation and to further understand the internal logic of our ensemble model, we employed SHapley Additive exPlanations (SHAP) to conduct a feature importance analysis. SHAP provides a theoretically grounded approach for quantifying how each feature contributes to the predicted probability of each settlement category. The analysis is presented at two complementary levels: global feature importance (Figure 6) and class-specific feature influence (Figure 7).
  • Global Feature Importance.
Figure 6 ranks the features according to their mean absolute SHAP values across all classes, thereby identifying the variables that exert the strongest influence on the ensemble model’s overall decision-making. The results show that idle housing vacancy rate and natural gas access are the two most influential features. This indicates that the model heavily relies on indicators related to demographic stability (vacancy) and infrastructure development (gas access) when distinguishing between settlement types. Following these, farmers’ market access and community service center availability also exhibit high global importance, confirming the relevance of public service provision and local economic activity in shaping rural settlement characteristics. Overall, the Top 15 features reflect a coherent set of factors aligned with both spatial structure and functional attributes, validating the effectiveness of the proposed indicator system.
  • Class-Specific Feature Influence.
While global importance identifies influential features across the entire model, the class-wise SHAP beeswarm plots in Figure 7 reveal how specific features affect the probability of assigning a settlement to each category. In these plots, positive SHAP values (right side) indicate that the feature pushes the prediction toward the target class, while negative values (left side) indicate that it pushes the prediction away. The color scale reflects the original feature value (red = high, blue = low).
For example, in the case of Agglomeration-Upgrading Settlements, high values of natural gas access and farmers’ market access correspond to strongly positive SHAP values, indicating that well-developed infrastructure and active local services significantly increase the probability of being assigned to this category. In contrast, the presence of a sewage treatment plant contributes negatively, suggesting this facility is more indicative of other settlement types. Likewise, a higher level of primary school availability is positively associated with this class, aligning with the idea that “upgrading” settlements serve as local service hubs. Meanwhile, for Urban-Periphery Integration Settlements, the model demonstrates an opposite logic. Features such as farmers’ market, community service center, cultural activity center, and population outflow rate produce negative SHAP values when their levels are high, and positive values when low. This suggests that settlements in this category tend to lack in-village services, as residents often rely on nearby urban amenities. Additionally, a low population outflow rate is a strong positive signal, pointing to population stability or inflow driven by proximity to urban areas. These patterns reflect the model’s ability to capture both service dependency and spatial interaction in urban–rural fringe zones.

5. Discussion

5.1. Comparison of Rural Settlement Spatial Patterns

Figure 8 and Figure 9 illustrate the spatial distribution of rural settlements before and after applying the automated classification framework. A comparative analysis highlights substantial improvements in classification completeness, spatial continuity, and the structural logic of category distribution, offering a clearer understanding of spatial differentiation within the county.

5.1.1. Improved Classification Coverage

As shown in Figure 8, the original classification—based on manual surveys and administrative data—covered only 361 out of 767 natural villages (47.1%), leaving more than half unclassified. Specifically, 194 villages were labeled as Agglomeration-Upgrading, 51 as Urban-Periphery Integration, 21 as Characteristic-Protection, and 95 as Relocation-Consolidation. This incomplete coverage limited the validity and usability of subsequent planning analysis.
By contrast, the automated framework assigned a type to every previously unclassified village, achieving full coverage of all 767 settlements. The updated distribution comprises 493 Agglomeration-Upgrading, 88 Urban-Periphery Integration, 22 Characteristic-Protection, and 164 Relocation-Consolidation villages. Agglomeration-Upgrading, the dominant category, now accounts for 64.3% of all settlements, reinforcing its role as the foundational base for rural development; the other three types account for 11.5%, 2.9%, and 21.4%, respectively. These results confirm that the automated classification process not only addresses data incompleteness but also ensures consistency across the entire dataset.
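The coverage and share figures above follow directly from the reported village counts. The short check below recomputes them; the counts are taken from this section, while the variable and function names are illustrative.

```python
TOTAL_VILLAGES = 767

# Village counts per type, as reported for the original and automated classifications.
before = {
    "Agglomeration-Upgrading": 194,
    "Urban-Periphery Integration": 51,
    "Characteristic-Protection": 21,
    "Relocation-Consolidation": 95,
}
after = {
    "Agglomeration-Upgrading": 493,
    "Urban-Periphery Integration": 88,
    "Characteristic-Protection": 22,
    "Relocation-Consolidation": 164,
}

def percentage_shares(counts, total):
    """Share of each type in percent, rounded to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

classified_before = sum(before.values())
coverage_before = round(100 * classified_before / TOTAL_VILLAGES, 1)
shares_after = percentage_shares(after, TOTAL_VILLAGES)
# Villages that received a type only through the automated framework.
newly_classified = {name: after[name] - before[name] for name in after}

print(classified_before, coverage_before)
print(shares_after)
print(newly_classified)
```

The per-type differences also show where the 406 newly classified villages went: most were assigned to Agglomeration-Upgrading, and only one to Characteristic-Protection.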

5.1.2. More Cohesive and Optimized Spatial Structure

Compared with the original manual classification, the spatial patterns derived from the automated framework display markedly improved continuity and structural optimization. In the original map, villages of different types were scattered irregularly, producing a “salt-and-pepper” distribution. In contrast, the post-automation map shows smoother boundaries and higher spatial contiguity among same-type villages: the number of fragmented patches fell by roughly one-third, the average patch area increased by approximately 1.7 times, and the spatial agglomeration index improved by about 20%. These changes suggest that the model effectively captures spatial autocorrelation and enhances the structural readability of settlement patterns across the county.
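One simple way to quantify the difference between a “salt-and-pepper” pattern and a contiguous one is the share of each village's nearest neighbours that carry the same type. The sketch below is a stand-in proxy under that assumption; it is not the patch statistics or the agglomeration index reported above, whose exact definitions are not given in this section. The function name `same_type_neighbor_share` and the toy coordinates are hypothetical.

```python
from math import dist

def same_type_neighbor_share(points, labels, k=2):
    """Mean fraction of each point's k nearest neighbours that share its label."""
    n = len(points)
    if n <= k:
        raise ValueError("need more points than neighbours")
    total = 0.0
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        nearest = others[:k]
        total += sum(labels[j] == labels[i] for j in nearest) / k
    return total / n

# Six toy villages along a road, one unit apart.
points = [(float(x), 0.0) for x in range(6)]
clustered = ["A", "A", "A", "B", "B", "B"]      # contiguous blocks
alternating = ["A", "B", "A", "B", "A", "B"]    # salt-and-pepper pattern

share_clustered = same_type_neighbor_share(points, clustered)
share_alternating = same_type_neighbor_share(points, alternating)
print(round(share_clustered, 3), round(share_alternating, 3))
```

Higher values indicate that same-type villages sit next to each other, which is the qualitative change described for the post-automation map.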

5.1.3. Greater Regularity in Type-Specific Spatial Patterns

The automated classification also enhances the spatial regularity of each settlement type. Agglomeration-Upgrading villages (in red) form a large-scale continuous base, mainly in the eastern and southern parts of the county, reflecting strong development potential in terms of transportation, industry, and population concentration. Urban-Periphery Integration villages (in cyan) are primarily located in the central and western zones—particularly near Tianzhen Subdistrict and Qingcheng Town—forming a ring-like distribution around the county center that typifies transitional urban-rural areas. Characteristic-Protection villages (in dark green) remain sparsely distributed near the county’s borders, aligning with the cultural and ecological characteristics of Gaoqing County. Relocation-Consolidation villages (in gray) are concentrated in the southwestern area, forming a strip-like cluster, consistent with local conditions such as environmental constraints and weak industrial bases.

5.1.4. Type Differentiation at the Township Scale

Figure 10 and Figure 11 provide a more granular view of classification differences within individual townships, offering valuable insights into intra-county development gradients. Towns such as Tangfang and Luhu are dominated by Agglomeration-Upgrading villages, reflecting their central functional positioning. In contrast, Changjia and Heilizhai have a higher concentration of Relocation-Consolidation villages, likely due to topographic and developmental limitations. Qingcheng and Tianzhen show a strong presence of Urban-Periphery Integration types, consistent with their proximity to the county’s urban core. This township-level differentiation offers a practical basis for targeted policies and refined governance strategies.

5.2. Rural Renewal Strategies Based on Classification

The core contribution of this study lies in the development of a fully automated and highly accurate rural settlement classification framework. With this tool, large-scale classification can be conducted efficiently across diverse regions. Once settlement types are accurately identified, corresponding revitalization strategies can be systematically implemented, as each type is associated with specific spatial and policy needs. As illustrated in Figure 12, the classification outcomes can be directly translated into spatially explicit planning interventions, guiding land-use control, infrastructure investment, and governance mechanisms. Building on existing national and regional rural revitalization policies, we propose the following tailored renewal strategies for each settlement type.
For Agglomeration-Upgrading settlements, the strategy prioritizes the promotion of intelligent agriculture through the construction of 1–2 standardized planting bases (focusing on spinach and tomatoes), achieving Level 1 GAP dual certification, and establishing a “black cattle breeding + organic fertilizer” circular farming model. Additional measures include infrastructure upgrades and the implementation of a point-based community governance system.
For Urban-Periphery Integration settlements, the plan emphasizes the development of peri-urban agricultural experience parks and shared vegetable plots, the establishment of e-commerce-oriented processing warehouses, the extension of urban infrastructure and commuter transport services, and the provision of customized employment, healthcare, and educational support.
For Characteristic-Protection settlements, the approach focuses on constructing cultural tourism corridors along the Yellow River, supporting local homestays and specialty markets, advancing ecological protection through straw recycling and sewage treatment, and preserving cultural heritage via folk museums and agricultural festivals.
For Relocation-Consolidation settlements, the strategy includes centralized resettlement with comprehensive public facilities, converting land into high-standard farmland, implementing collective land transfer with dividend-sharing cooperatives, and enhancing livelihood security through social insurance transfer and vocational training programs.

5.3. Limitations

Although the automated classification framework substantially improves classification completeness and spatial coherence, several limitations are important to acknowledge, primarily concerning inherent methodological characteristics and the current scope of empirical validation.
First, regarding methodological specifics, the spatial smoothing effect inherent in the ensemble strategy may lead to very small or isolated settlements being absorbed into larger surrounding categories. This could weaken the model’s ability to capture fine-grained, micro-level differences in rural settlement patterns. Additionally, we observed that boundary areas between the Agglomeration-Upgrading and Urban-Periphery Integration types occasionally exhibit outward expansion. This phenomenon appears to be driven more by locally high accessibility or geographic proximity than by intrinsic settlement attributes, potentially leading to misclassifications in transition zones.
Second, concerning the framework’s transferability beyond the current study area, our model demonstrated robust generalization capabilities for rural settlements within the same region, confirmed through rigorous testing on a hold-out dataset (10% of the Gaoqing County data). This validation approach aligns with standard machine learning evaluation practices and provides a foundational indication of its potential for transferability to similar contexts. However, it is important to note that this study did not conduct external validation using independent datasets from geographically distinct regions. While our fully automated classification algorithm is designed for direct application, and the indicator system is theoretically universal, its empirical performance in areas with significantly different topographic, socioeconomic, or cultural contexts has yet to be comprehensively assessed. Therefore, although the framework possesses considerable promise, its efficacy in completely novel regions requires further empirical verification through subsequent data collection and field studies.
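The 10% hold-out evaluation mentioned above can be sketched as follows. Two points are assumptions made for illustration: that the 361 originally labelled villages form the labelled pool, and that the split is stratified by settlement type. The text states only that 10% of the Gaoqing County data was held out. The function name `stratified_holdout` and the seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.10, seed=42):
    """Return (train_idx, test_idx) holding out about test_fraction of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test sample per class so rare types are still evaluated.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Labelled counts per type from the original classification (Section 5.1.1).
labelled_counts = {
    "Agglomeration-Upgrading": 194,
    "Urban-Periphery Integration": 51,
    "Characteristic-Protection": 21,
    "Relocation-Consolidation": 95,
}
labels = [name for name, n in labelled_counts.items() for _ in range(n)]
train_idx, test_idx = stratified_holdout(labels)
test_types = {labels[i] for i in test_idx}
print(len(train_idx), len(test_idx), sorted(test_types))
```

Stratifying matters here because Characteristic-Protection has only 21 labelled villages; a purely random 10% draw could leave that type with almost no test samples. External validation would instead apply the trained model to labelled villages from a different county.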

5.4. Research Significance

5.4.1. Transferability and Replicability

The automated classification framework developed in this study demonstrates robust generalization capabilities for rural settlements within the study area, and shows promising transferability for automated classification in other regions. This potential for replicability stems from its meticulously designed components and standardized workflow.
First, regarding generalization and adaptability, the classification model was validated on an independent test set (10% of samples), which supports robust performance on unseen data and indicates that the model learns underlying patterns rather than overfitting. This establishes a strong foundation for its application to other rural settlements within the same study context. The indicator system is also designed with universality in mind, incorporating variables commonly used in global rural spatial studies (geographic, locational, morphological, land-use, and accessibility attributes). The indicator structure is therefore adaptable to new regions, requiring only local data collection without modifying the methodological core. Furthermore, the framework employs an automated multi-model ensemble strategy that dynamically selects the top-performing models, allowing it to adapt to varying data structures and rural settlement patterns across regions.
Second, regarding replicability and broader utility, the entire workflow, encompassing data preprocessing, indicator construction, model training, spatial consistency checking, classification verification, and strategy linkage, is highly standardized. This clear, systematic pipeline keeps operation simple and makes the framework readily replicable across cities and regions. It serves as a flexible and extensible digital tool for rural classification, spatial diagnosis, and policy formulation.
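The model-selection and weighted-voting step mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: the validation scores and class probabilities are hypothetical placeholders, weights are assumed to equal each model's validation score, and probabilities are combined by soft voting, a detail this section does not specify.

```python
CLASSES = [
    "Agglomeration-Upgrading",
    "Urban-Periphery Integration",
    "Characteristic-Protection",
    "Relocation-Consolidation",
]

def select_top_models(val_scores, k=3):
    """Names of the k models with the highest validation score."""
    return sorted(val_scores, key=val_scores.get, reverse=True)[:k]

def weighted_vote(probas, weights):
    """Combine per-model class probabilities for one sample by weighted averaging."""
    total_weight = sum(weights[m] for m in probas)
    n_classes = len(next(iter(probas.values())))
    combined = [
        sum(weights[m] * probas[m][c] for m in probas) / total_weight
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=combined.__getitem__)
    return best, combined

# Hypothetical validation scores for the seven benchmarked classifiers.
val_scores = {
    "LDA": 0.60, "LR": 0.62, "SVM_RBF": 0.65, "RandomForest": 0.78,
    "XGBoost": 0.84, "LightGBM": 0.86, "CatBoost": 0.85,
}
top_models = select_top_models(val_scores, k=3)

# Hypothetical predicted probabilities for a single village, one list per model.
sample_probas = {
    "LightGBM": [0.5, 0.3, 0.1, 0.1],
    "CatBoost": [0.4, 0.4, 0.1, 0.1],
    "XGBoost": [0.2, 0.6, 0.1, 0.1],
}
weights = {m: val_scores[m] for m in top_models}
label_idx, combined = weighted_vote(sample_probas, weights)
print(top_models, CLASSES[label_idx])
```

Because the selection is driven by validation scores rather than a fixed model list, the same procedure could pick a different trio when applied to data from another region.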
In summary, while primarily validated with data from Gaoqing County, this framework offers a powerful, standardized, and adaptable approach. It effectively provides automated classification for rural settlements within its validated region and holds significant promise for transferability to other regions, offering valuable support for rural spatial governance and revitalization planning.

5.4.2. Integration into Multi-Level Spatial Governance

The value of the classification results extends beyond local design applications and serves as a foundational component for building a multi-level rural spatial governance system. At the county level, the classification can guide differential spatial zoning, enabling the coordinated optimization of ecological, agricultural, and construction spaces. At the township level, it supports decisions on infrastructure allocation, industrial zoning, and public service deployment. At the village level, it provides a basis for improving living environments, optimizing land use, and guiding village-scale renewal strategies. From a governance perspective, the automated framework establishes a clear analytical bridge between spatial differentiation and hierarchical governance, translating the broad goals of the Rural Revitalization Strategy into quantifiable and operational management tools. By aligning classification outputs with differentiated policy intensities at multiple scales, the framework contributes to forming an interactive governance mechanism that integrates top-down planning with bottom-up local responses, thereby improving both model applicability and policy effectiveness.

5.4.3. Toward Adaptive and Data-Driven Rural Governance

Looking forward, the AI-driven classification framework should be viewed not merely as a static spatial identification tool, but as a key enabler of adaptive rural governance. The dynamic integration of data analytics and spatial decision-making enables continuous monitoring, evaluation, and adjustment of rural spaces, supporting a transition from experience-based planning to data-informed and feedback-driven governance. In the context of Gaoqing County, the adoption of this framework facilitates a shift from fragmented and reactive planning practices toward systematic, anticipatory, and adaptive governance. Through the combined effects of model-based classification, policy formulation, and community participation, the framework helps maintain a dynamic balance among economic development, ecological protection, and social inclusiveness. More broadly, this study demonstrates the methodological potential of artificial intelligence in rural spatial governance. By transforming spatial classification into an evidence-based decision-support mechanism, the framework offers a replicable technical pathway and governance paradigm for county-level and regional territorial spatial planning.

6. Conclusions

This study develops an AI-driven multi-model framework for the automated classification of rural settlements at the natural-village scale. By constructing a comprehensive indicator system that integrates socio-economic attributes, natural conditions, and land-use characteristics, the proposed framework enables fine-grained representation of settlement features. Through rigorous data preprocessing and multi-model comparison, the approach effectively addresses common challenges such as missing values, data noise, and class imbalance. The results demonstrate the feasibility and scalability of data-driven automated classification, providing a methodological basis for differentiated spatial planning and targeted rural revitalization.

Author Contributions

Conceptualization, J.H.; methodology, J.H.; software, J.H.; validation, J.H.; formal analysis, J.H.; investigation, X.W.; resources, X.W.; data curation, X.W.; writing—original draft preparation, J.H.; writing—review and editing, Y.Q. and D.Z.; visualization, J.J.; supervision, D.M. and J.Y.; project administration, Y.Q.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China [52408039, 52108030], the Innovation Capability Support Foundation of Shaanxi Province [2024ZC-YBXM-008], the Social Science Foundation of Shaanxi Province [2023J008, 2025J055], the Postdoctoral Research Foundation of Shaanxi Province [2023BSHEDZZ50], the Fundamental Research Funds for the Central Universities [SK2024021, xxj032025025], and the Special Research Project on Teaching Reform Empowered by Generative Artificial Intelligence of Xi’an Jiaotong University [24ZK25Z].

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Long, H.; Liu, Y.; Li, X.; Chen, Y. Building new countryside in China: A geographical perspective. Land Use Policy 2010, 27, 165–173.
  2. Li, Y.; Westlund, H.; Liu, Y. Why some rural areas decline while some others not: An overview of rural evolution in the world. J. Rural Stud. 2019, 68, 135–143.
  3. Long, H.; Tu, S.; Ge, D.; Li, T.; Liu, Y. The allocation and management of critical resources in rural China under restructuring: Problems and prospects. J. Rural Stud. 2016, 47, 392–412.
  4. State Council of the People’s Republic of China. Rural Revitalization Strategy Plan (2024–2027); Official Policy Document; State Council of the People’s Republic of China: Beijing, China, 2024.
  5. Tu, S.; Long, H.; Zhang, Y.; Ge, D.; Qu, Y. Rural restructuring at village level under rapid urbanization in metropolitan suburbs of China and its implications for innovations in land use policy. Habitat Int. 2018, 77, 143–152.
  6. Liu, Y.; Li, Y. Revitalize the world’s countryside. Nature 2017, 548, 275–277.
  7. Fu, P.; Xiao, J.; Zhao, Z.Q.; Xie, X. The Method of “Space-Dynamic” Coupling Mechanism of Rural Settlements Based on Machine Learning: Taking Liyang City, Jiangsu Province as an Example. J. Hum. Settl. West China 2022, 37, 1–9.
  8. Tang, Y.; Chen, C. Analysis of Factors Influencing the Evolution of Rural Settlements in Major Grain-Producing Areas Based on Explainable Machine Learning: A Case in Central China. Sci. Technol. Eng. 2023, 23, 9378–9387.
  9. Zhou, H.; Na, X.; Li, L.; Ning, X.; Bai, Y.; Wu, X.; Zang, S. Suitability Evaluation of Rural Settlements in a Farming–Pastoral Ecotone Area Based on Machine Learning Maximum Entropy. Ecol. Indic. 2023, 154, 110794.
  10. Huang, X.; Liu, Y.; Stouffs, R. Exploring Spatio-Temporal Heterogeneity of Rural Settlement Patterns on Carbon Emission across More Than 2800 Chinese Counties Using Multiple Supervised Machine Learning Models. J. Environ. Manag. 2025, 373, 123932.
  11. Shu, B.; Liu, Y.; Wang, C.; Zhang, H.; Amani-Beni, M.; Zhang, R. Geological Hazard Risk Assessment and Rural Settlement Site Selection Using GIS and Random Forest Algorithm. Ecol. Indic. 2024, 166, 112554.
  12. Kalaycıoğlu, O.; Akhanlı, S.E.; Menteşe, E.Y.; Kalaycıoğlu, M.; Kalaycıoğlu, S. Using Machine Learning Algorithms to Identify Predictors of Social Vulnerability in the Event of a Hazard: Istanbul Case Study. Nat. Hazards Earth Syst. Sci. 2023, 23, 2133–2156.
  13. Halfacree, K. Locality and social representation: Space, discourse and alternative definitions of the rural. In The Rural; MIT Press: Cambridge, MA, USA, 2017; pp. 245–260.
  14. Cloke, P. Conceptualizing rurality. In Handbook of Rural Studies; Sage: Thousand Oaks, CA, USA, 2006; pp. 18–28.
  15. Tu, S.; Long, H. Rural restructuring in China: Theory, approaches and research prospect. J. Geogr. Sci. 2017, 27, 1169–1184.
  16. Wang, Y.; Zhu, X.; Wei, T.; Xu, F.; Williams, T.K.A.; Zhang, H. Entity-based image analysis: A new strategy to map rural settlements from Landsat images. Remote Sens. Environ. 2025, 318, 114549.
  17. Liu, Y.; Zhou, Y.; Li, Y. Rural regional system and rural revitalization strategy in China. Acta Geogr. Sin. 2019, 74, 2511–2528.
  18. Tukey, J.W. Exploratory Data Analysis; Addison–Wesley: Reading, MA, USA, 1977.
  19. Barnett, V.; Lewis, T. Outliers in Statistical Data, 3rd ed.; Wiley: New York, NY, USA, 1994.
  20. Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; Wiley: Hoboken, NJ, USA, 2013.
  21. Fisher, R.A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 1936, 7, 179–188.
  22. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  23. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  24. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  25. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
  26. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. Adv. Neural Inf. Process. Syst. 2018, 31, 6639–6649.
  27. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  28. Dietterich, T.G. Ensemble Methods in Machine Learning. In Multiple Classifier Systems; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
  29. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
  30. Jiang, J.S.; Li, Z.; Bedra, K.B.; Long, C.R.; Wu, J.D.; Zhong, Q.K. Predicting Outdoor Thermal Comfort in Traditional Villages: An Explainable Machine Learning Framework Integrating Model Optimization, Seasonal Variability, and Tourist-Resident Insights. Build. Environ. 2025, 282, 113315.
  31. Fan, D.Z. Study on the Spatial Form Characteristics and Environmental Adaptability Mechanism of Rural Settlements in the Lower Reaches of the Yellow River. Master’s Thesis, Shandong Jianzhu University, Jinan, China, 2024.
  32. Zhao, Z.Y. Study on the Formation Mechanism of Village Spatial Form in the Lower Reaches of the Yellow River. Master’s Thesis, Shandong Jianzhu University, Jinan, China, 2023.
  33. Zhang, Y.; Duan, S.; Dong, L.; Ding, X.M. Spatial Sustainability of Agricultural Rural Settlements: An Analysis of Rural Spatial Patterns and Influencing Factors in Three Northeastern Provinces of China. Sustainability 2025, 17, 5597.
  34. Jiang, X.; Man, S.H.; Zhu, X.L.; Zhao, H.Y.; Yan, T.J. Sustainable Protection Strategies for Traditional Villages Based on a Socio-Ecological Systems Spatial Pattern Evaluation: A Case Study from Jiang River Basin in China. Sustainability 2024, 16, 7700.
  35. Pan, Y.P.; Zhao, X.; Wang, J. Identifying the Class of the Villages Based on SMOTE-RF Algorithm. J. Geo-Inf. Sci. 2023, 25, 163–176.
  36. Yang, X.; Pu, F. Spatial Cognitive Modeling of the Site Selection for Traditional Rural Settlements: A Case Study of Kengzi Village, Southern China. J. Urban. Plan. Dev. 2020, 146, 05020026.
  37. Chen, L.K.; Zhong, Q.K.; Li, Z. Analysis of Spatial Characteristics and Influence Mechanism of Human Settlement Suitability in Traditional Villages Based on Multi-Scale Geographically Weighted Regression Model: A Case Study of Hunan Province. Ecol. Indic. 2023, 154, 110828.
  38. Li, W.M.; Li, T.S.; Wu, P. Study on Layout Optimization of Rural Residential Areas Based on Gravity Model and Weighted Voronoi Diagram: A Case Study of Xiangqiao Street, Xi’an. Chin. J. Agric. Resour. Reg. Plan. 2018, 39, 77–82.
  39. Zhao, Z.; Lü, N.; Jiang, C.M. Village Classification and Development Strategy in the North Foot of Qinling Mountains Based on SOM Neural Network. J. Guilin Univ. Technol. 2023, 43, 608–616.
  40. Peng, J.J.; Kong, X.S.; Liu, Y.L.; Cui, J.X. Spatial Optimization Allocation of Rural Residential Areas Based on Agent-Based Model. Geogr. Geo-Inf. Sci. 2016, 32, 52–58.
  41. Liu, F.J.; Xu, W.; Niu, Q. Spatial Pattern of Traditional Villages in Remote Mountainous Areas and Their Development Potential Assessment: The Case of Enshi, China. Sustainability 2025, 17, 1138.
  42. Han, G.F.; Xiong, J.P.; Liu, G.X.; Li, L.; Lei, J.; Lu, Y.R. A Classification Method of Mountainous Villages Based on Logistic Model: A Case Study on Wuxi County, Chongqing Municipality. J. Hum. Settl. West China 2021, 36, 46–53.
  43. Zhang, C.; Teng, J.L.; Liu, P.L.; Liu, C.Q. Ecological suitability evaluation of traditional village locations in Jiangxi Province based on multi-model integration using artificial intelligence. PLoS ONE 2025, 20, 0332375.
  44. Wu, K.H.; Su, W.C.; Ye, S.A.; Li, W.; Cao, Y.; Jia, Z.Z. Analysis on the geographical pattern and driving force of traditional villages based on GIS and Geodetector: A case study of Guizhou, China. Sci. Rep. 2023, 13, 20659.
  45. Fan, L.; Zhang, D.Y. Study on Spatial Differentiation Characteristics and Influencing Factors of Traditional Villages in North China Based on MGWR Mode. Chin. Landsc. Archit. 2022, 38, 56–61.
  46. Tang, L.N.; Liu, Y.; Pan, Y.C.; Ren, Y.M. Evaluation and Zoning of Rural Regional Multifunction Based on BP Model and Ward Method: A Case in the Pinggu District of Beijing City. Sci. Geogr. Sin. 2016, 36, 1514–1521.
  47. Li, D.H.; Gao, X.C.; Lv, S.Y.; Zhao, W.W.; Yuan, M.; Li, P.T. Spatial distribution and influencing factors of traditional villages in Inner Mongolia Autonomous Region. Buildings 2023, 13, 2807.
  48. Niu, Y.L.; Wang, Y. Study on Spatial Differentiation Pattern and Influencing Mechanism of Traditional Villages in Taihang Mountain Area Based on MGWR Model. J. Arid Land Resour. Environ. 2024, 38, 87–96.
  49. Wu, S.L.; Di, B.F.; Ustin, S.L.; Stamatopoulos, C.A.; Li, J.R.; Zuo, Q.; Wu, X.; Ai, N.S. Classification and detection of dominant factors in geospatial patterns of traditional settlements in China. J. Geogr. Sci. 2022, 32, 873–891.
  50. Zhu, K.K.; Gu, Y.; Zhang, Y.T.; Song, Y.D.; Guo, Z.H.; Yan, X.Q.; Yao, Y.; Guan, Q.F.; Li, X. From Street View Imagery to the Countryside: Large-Scale Perception of Rural China Using Deep Learning. Ann. Am. Assoc. Geogr. 2025, 115, 1720–1741.
  51. Nie, Z.Y.; Chen, C.; Pan, W.; Dong, T. Exploring the dynamic cultural driving factors underlying the regional spatial pattern of Chinese traditional villages. Buildings 2023, 13, 3068.
  52. Hu, J.M.; Niu, J.Q.; Su, H.Y.; Han, G.F. A Classification Method of Mountainous Villages Based on BP Neural Network: A Case Study in Wuxi County, Chongqing City. Dev. Small Cities Town 2023, 41, 22–31.
  53. Lian, M.C.; Li, Y.J. The Spatial Patterns and Architectural Form Characteristics of Chinese Traditional Villages: A Case Study of Guanzhong, Shaanxi Province. Sustainability 2024, 16, 9491.
Figure 1. Key limitations across the main stages of current machine learning-based rural classification research.
Figure 2. Gaoqing County’s location in China.
Figure 3. Spatial pattern of rural settlements in Gaoqing County.
Figure 4. Overview of the proposed research framework, including indicator system construction, data collection and preprocessing, and multi-model analysis and integration.
Figure 5. Comparison of model performance across seven classifiers (LDA, LR, SVM_RBF, Random Forest, XGBoost, LightGBM, and CatBoost), together with a random baseline. Each subplot presents four evaluation metrics (Precision, Recall, F1-score, and Accuracy), providing a clear visual comparison of classifier effectiveness.
Figure 6. Overall Global Feature Importance (All Classes).
Figure 7. Feature Importance for Every Class.
Figure 8. Original spatial distribution of rural settlement categories based on survey and existing administrative records. The map illustrates the initial classification status prior to automated processing, revealing incomplete or missing category information in several areas.
Figure 9. Spatial distribution of rural settlement categories after applying the proposed automated classification framework. The classification results fill in previously missing categories and refine spatial patterns, providing a more complete and consistent settlement classification map.
Figure 10. Comparison of Township Numbers by Four Settlement Types.
Figure 11. Comparison of Different Settlement Types by Township.
Figure 12. Design Strategy.
Table 1. Steps and objectives of literature research.
| Step | Objective | Output/Number of Papers |
| --- | --- | --- |
| 1. Define search scope | Clarify research themes and establish initial database | Initial dataset: 160 (CNKI) + 69 (WOS) |
| 2. Eliminate duplicates | Remove redundant records and retain high-quality papers | 145 |
| 3. Title screening | Rapidly filter non-relevant topics | 105–128 |
| 4. Abstract review | Ensure thematic relevance and methodological contribution | 89–102 |
| 5. Full-text review | Confirm inclusion and extract core references | 69 |
| 6. Indicator extraction | Build the foundation for indicator construction | 69 |
Table 2. Search keywords and retrieved literature counts.
| Database | Search Keywords | Number of Records |
| --- | --- | --- |
| CNKI | Rural AND Machine Learning | 38 |
| CNKI | (Rural OR Village OR Traditional Settlement) AND (Machine Learning OR Deep Learning OR Artificial Intelligence) | 72 |
| CNKI | (Rural OR Settlement OR Traditional Village OR Village Classification OR Evaluation) AND (Machine Learning OR Deep Learning OR Artificial Intelligence OR Random Forest OR Neural Network OR GBDT) | 160 |
| WOS | “Rural settlements” + “Traditional village” AND “Machine learning” + “Random forest” + “Deep learning” + “Convolutional Neural Network” | 69 |
Table 3. Selected Classification Models and Their Characteristics.
| Model | Missing Value Support | Key Characteristics | Year |
| --- | --- | --- | --- |
| Logistic Regression (LR) [20] | No | Linear, interpretable, discriminative | 1958 |
| Linear Discriminant Analysis (LDA) [21] | No | Linear, statistical, generative | 1936 |
| Support Vector Machine (SVM) [22] | No | Nonlinear kernels, max-margin | 1995 |
| Random Forest (RF) [23] | Partial | Robust to noise, bagging | 2001 |
| XGBoost [24] | Yes | Gradient boosting | 2016 |
| LightGBM [25] | Yes | Histogram-based boosting | 2017 |
| CatBoost [26] | Yes | Handles categorical data natively | 2018 |
Table 4. Indicator frequency statistics of studies.
| Indicator Category | Keyword | Freq. | Authors/Years | ML Methods |
| --- | --- | --- | --- | --- |
| Settlement Morphology | Boundary | 7 | Jiang (2025) [30] | XGBoost, VOSM |
| | Axial pattern | 2 | Fan (2024) [31] | LightGBM |
| | Skeleton | 3 | Zhao (2023) [32]; Fan (2024) [31] | XGBoost, LightGBM |
| | Scale area | 6 | Zhang (2025) [33]; Jiang (2024) [34] | XGBoost, GBDT, BP |
| Locational Conditions | Distance to towns | 16 | Fu (2022) [7]; Pan (2023) [35] | GBDT, SMOTE, Random Forest |
| | Distance to roads | 19 | Zhou (2023) [9]; Yang (2020) [36] | Random Forest, SVM, GBDT |
| | Distance to rivers | 15 | Shu (2024) [11]; Chen (2023) [37] | Random Forest, GMM |
| | Distance to public facilities | 5 | Zhou (2023) [9]; Chen (2023) [37]; Li (2018) [38] | GBDT, MaxEnt |
| | Accessibility | 4 | Zhao (2023) [39]; Peng (2016) [40] | SOM, MGWR |
| | Road network density | 3 | Chen (2023) [37]; Liu (2025) [41]; Han (2021) [42] | Multiclass Logistic Regression, BP |
| Natural Environment | Elevation | 14 | Zhang (2025) [43]; Wu (2023) [44] | XGBoost, LightGBM |
| | Slope | 20 | Shu (2024) [11]; Wu (2023) [44] | GBWT, Random Forest |
| | Terrain relief | 15 | Fan (2022) [45]; Tang (2016) [46] | XGBoost, SOM |
| | Cultivated land area | 10 | Shu (2024) [11]; Li (2023) [47] | MGWR, XGBoost–SHAP |
| | Aspect (slope direction) | 8 | Shu (2024) [11]; Zhou (2023) [9]; Li (2023) [47] | MaxEnt, Random Forest |
| | NDVI | 6 | Zhang (2025) [33]; Chen (2023) [37] | K-Means, Random Forest, BP |
| | Annual precipitation | 6 | Zhang (2025) [33]; Shu (2024) [11] | MGWR |
| | Annual temperature | 5 | Chen (2023) [37]; Li (2023) [47] | MGWR, XGBoost–SHAP |
| | Annual sunshine duration | 4 | Niu (2024) [48]; Wu (2022) [49] | MGWR, Random Forest |
| | Water system density | 4 | Zhang (2025) [33]; Wu (2023) [44] | MGWR, XGBoost–SHAP |
| | Altitude | 4 | Wu (2023) [44] | K-Means, MaxEnt |
| | River flow direction | 2 | Wu (2022) [49]; Zhou (2023) [9] | XGBoost, LightGBM |
| Socio-economic Attributes | Per-capita GDP | 8 | Zhu (2025) [50]; Li (2023) [47]; Nie (2023) [51] | CNN, XGBoost–SHAP |
| | Per-capita income | 5 | Zhu (2025) [50]; Jiang (2024) [34]; Hu (2023) [52] | CNN, K-Means, Gravity Model, BP |
| | GDP | 7 | Zhang (2025) [33]; Lian (2024) [53]; Chen (2023) [37] | MGWR |
| | Population density | 7 | Niu (2024) [48]; Chen (2023) [37] | MGWR, SMOTE, Random Forest |
| | Night-time light intensity | 4 | Zhang (2025) [33]; Lian (2024) [53] | MGWR, XGBoost–SHAP, GBDT |
| | Cultivated land ratio | 10 | Chen (2023) [37]; Yang (2020) [36] | K-Means, Gravity Model |
| | Urbanization rate | 5 | Zhang (2025) [33]; Nie (2023) [51] | SMOTE, Random Forest, GBDT |
| Historical and Cultural Factors | Cultural heritage concentration | 2 | Nie (2023) [51]; Fan (2022) [45] | MGWR, XGBoost–SHAP |
| | Intangible cultural heritage | 2 | Li (2023) [47]; Chen (2023) [37] | MGWR, XGBoost–SHAP |
| | Historical cultural points | 3 | Hu (2023) [52]; Han (2021) [42] | BP, Multiclass Logistic Regression |
Table 5. Evaluation indicator framework for rural settlement spatial pattern.
Primary Dimension | Secondary Indicator | Calculation or Definition
Socio-economic Attributes | Aging rate | Ratio of population aged 65+ to total permanent population
Socio-economic Attributes | Permanent population | Registered permanent residents
Socio-economic Attributes | Population outflow rate | (Average 2017–2019 out-migration) / population in 2017
Socio-economic Attributes | Village per capita income | Mean income per village household
Socio-economic Attributes | Share of elderly agricultural labor | Share of agricultural workers aged 65+
Natural Environment | Current natural conditions | Graded: good (1), moderate (2), poor (3), very poor (4)
Natural Environment | Current resource conditions | Rich mountain and water resources (1), abundant tourism resources (2), prominent industrial resources (3), advantageous locational conditions (4), rich historical and cultural resources (5)
Land Construction and Utilization | Transportation conditions | Convenient (1), relatively convenient (2), average (3), poor (4)
Land Construction and Utilization | Idle housing vacancy rate | Number of idle/unused housing units (households)
Land Construction and Utilization | Residential land aggregation | Degree of clustering of residential land
Land Construction and Utilization | Built-up area ratio | Proportion of built-up area to total land (%)
Land Construction and Utilization | Land fragmentation | Graded: high (1), relatively high (2), low (3), none (4)
Land Construction and Utilization | Farmland transfer rate | Ratio of transferred farmland area (%)
Land Construction and Utilization | Basic farmland ratio | Share of basic farmland area (%)
Land Construction and Utilization | Farmland per capita | Total farmland area / permanent population
Land Construction and Utilization | Ecological red-line ratio | Graded: high (1), relatively high (2), low (3), none (4)
Supporting Public Services | Waste collection facility | 1 = present, 0 = absent
Supporting Public Services | Waste transfer station | 1 = present, 0 = absent
Supporting Public Services | Centralized heating | 1 = present, 0 = absent
Supporting Public Services | Natural gas access | 1 = present, 0 = absent
Supporting Public Services | Tap water supply | 1 = present, 0 = absent
Supporting Public Services | Sewage treatment plant | 1 = present, 0 = absent
Supporting Public Services | Sanitary toilets | 1 = present, 0 = absent
Supporting Public Services | Elderly care center | 1 = present, 0 = absent
Supporting Public Services | Elderly care station | 1 = present, 0 = absent
Supporting Public Services | Community service center | 1 = present, 0 = absent
Supporting Public Services | Farmers’ market | 1 = present, 0 = absent
Supporting Public Services | Cultural activity center | 1 = present, 0 = absent
Supporting Public Services | Fitness / sports venue | 1 = present, 0 = absent
Supporting Public Services | Clinic or health station | 1 = present, 0 = absent
Supporting Public Services | Kindergarten | 1 = present, 0 = absent
Supporting Public Services | Primary school | 1 = present, 0 = absent
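The ratio-type indicators in Table 5 reduce to simple arithmetic on survey fields. The following is a minimal sketch of that arithmetic, not the authors' processing code. All field names (`pop_65_plus`, `permanent_pop`, `outmigration_2017_2019`, `pop_2017`, `farmland_area`) and the example values are hypothetical. Averaging only the years that were actually reported, and returning `None` when an input is missing, are also assumptions made here so that gaps stay visible to a model that handles missing values natively.

```python
from typing import Optional, Sequence


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None if an input is missing or the denominator is zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def mean_or_none(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the reported values; return None if none were reported (assumed convention)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def derive_indicators(village: dict) -> dict:
    """Compute three Table 5 ratio indicators from hypothetical raw survey fields."""
    yearly_outflow = village.get("outmigration_2017_2019") or []
    return {
        # Population aged 65+ / total permanent population
        "aging_rate": safe_ratio(village.get("pop_65_plus"), village.get("permanent_pop")),
        # (Average 2017-2019 out-migration) / population in 2017
        "outflow_rate": safe_ratio(mean_or_none(yearly_outflow), village.get("pop_2017")),
        # Total farmland area / permanent population
        "farmland_per_capita": safe_ratio(village.get("farmland_area"), village.get("permanent_pop")),
    }


# Hypothetical village record, for illustration only.
example = {
    "pop_65_plus": 120,
    "permanent_pop": 600,
    "outmigration_2017_2019": [30, 36, 24],
    "pop_2017": 640,
    "farmland_area": 1500,
}
indicators = derive_indicators(example)
```

A record with a missing denominator, such as `{"pop_65_plus": 50}`, yields `None` for every ratio rather than raising an error.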
Table 6. Summary of Detected Outliers per Feature.
Feature | Number of Outliers
Land Transfer Rate | 34
Ecological Redline Encroachment | 20
Vacant Houses | 12
Resource Condition | 11
Waste Collection Points | 11
Land Abandonment | 8
Tap Water Access | 7
Natural Environment | 5
Table 7. Classification accuracy of the proposed weighted voting ensemble across different categories.
Method | Class 1 | Class 2 | Class 3 | Class 4 | Overall
Ensemble (Ours) | 0.90 | 0.85 | 0.86 | 0.91 | 0.88
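The weighted-voting step and the per-class figures in Table 7 can be illustrated with a short sketch. This is one common reading of weighted voting (soft voting over class probabilities), not necessarily the authors' exact scheme. The model weights, the probability vectors, and the toy labels below are all hypothetical, and treating per-class accuracy as the share of each true class that is predicted correctly is likewise an assumption.

```python
from typing import Dict, List, Sequence


def weighted_vote(probas: Dict[str, Sequence[float]], weights: Dict[str, float]) -> int:
    """Weighted soft voting: return the class index with the highest weighted mean probability."""
    n_classes = len(next(iter(probas.values())))
    total_weight = sum(weights[name] for name in probas)
    scores = [
        sum(weights[name] * probas[name][k] for name in probas) / total_weight
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=scores.__getitem__)


def per_class_accuracy(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> List[float]:
    """Share of samples of each true class that were predicted correctly (assumed definition)."""
    result = []
    for k in range(n_classes):
        idx = [i for i, label in enumerate(y_true) if label == k]
        correct = sum(1 for i in idx if y_pred[i] == k)
        result.append(correct / len(idx) if idx else float("nan"))
    return result


# Hypothetical class probabilities from the three models for one settlement (4 classes).
probas = {
    "lightgbm": [0.10, 0.60, 0.20, 0.10],
    "catboost": [0.15, 0.35, 0.40, 0.10],
    "xgboost": [0.05, 0.50, 0.35, 0.10],
}
# Hypothetical weights, e.g. chosen in proportion to validation performance.
weights = {"lightgbm": 0.4, "catboost": 0.3, "xgboost": 0.3}
predicted_class = weighted_vote(probas, weights)

# Toy labels showing how per-class and overall accuracy would be tabulated.
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 3]
class_acc = per_class_accuracy(y_true, y_pred, n_classes=4)
overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Soft voting lets a confident model outweigh two lukewarm ones, which hard majority voting cannot do. With real models, the probability vectors would come from each classifier's probability output on the same held-out samples.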
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

He, J.; Wang, X.; Qi, Y.; Jiang, J.; Zhou, D.; Ma, D.; Ying, J. AI-Driven Multi-Model Classification of Rural Settlements for Targeted Rural Revitalization: A Case Study of Gaoqing County, Shandong Province, China. Land 2025, 14, 2298. https://doi.org/10.3390/land14122298

