AI-Driven Multi-Model Classification of Rural Settlements for Targeted Rural Revitalization: A Case Study of Gaoqing County, Shandong Province, China
Abstract
1. Introduction
- 1. We establish a comprehensive and scalable indicator system for rural settlement classification, addressing the fragmented and ad-hoc nature of feature engineering in prior studies.
- 2. We develop and validate a robust data preprocessing module with automated outlier management, ensuring data quality in the presence of noise and inconsistencies.
- 3. We perform a systematic multi-model evaluation and introduce an ensemble learning approach, demonstrating superior accuracy and robustness over traditional single-model methods.
- 4. We propose and validate, to our knowledge, the first fully automated end-to-end framework for rural settlement classification, integrating preprocessing, model comparison, and ensemble learning into a reproducible pipeline for data-driven rural analysis.
2. Background
2.1. Conceptual Framework of the Countryside and Rural Settlements in China
2.2. Typologies of Rural Settlements for Targeted Revitalization
- Agglomeration-Upgrading Settlements. Focus on reinforcing endogenous industries and upgrading infrastructure and public services to consolidate their role as local hubs of population and functions.
- Urban-Periphery Integration Settlements. Emphasize urban–rural integration by improving connectivity, shared services, and compatible land uses at the metropolitan fringe or county seats.
- Characteristic-Protection Settlements. Prioritize heritage conservation and quality improvement—protecting traditional architecture, cultural landscapes, and vernacular environments while upgrading essential infrastructure.
- Relocation-Consolidation Settlements. For settlements in ecologically fragile, disaster-prone, or severely shrinking areas, implement orderly relocation or merger with safeguards for livelihoods, employment, and ecological restoration.
3. Materials and Methods
3.1. Study Area and Data Sources
- 1. Geographical Location: Gaoqing County is situated in the northern part of the Shandong Plain, along the lower reaches of the Yellow River, with an average elevation of about 12 m. It lies within the Yellow River Delta Ecological Zone. The county is approximately 50 km from Binzhou, 120 km from Jinan, and 90 km from Zibo, making it a transportation hub of the northern Shandong region. Access to national highways and expressways, together with planned railway and airport connections, ensures good regional accessibility.
- 2. Natural Environment: The terrain of Gaoqing is flat, with gentle slopes descending from the central plain to the south. It has a warm temperate monsoon climate, characterized by hot, rainy summers and cold, dry winters. The annual average precipitation is around 600 mm, and per capita water resources are approximately 320 m³. The Yellow River is the main water source, supplemented by small lakes and reservoirs. The ecological environment is relatively fragile, with saline-alkali soils and seasonal flooding affecting the landscape.
- 3. Socio-economic Conditions: As of 2020, Gaoqing County had a permanent population of approximately 312,200, alongside 347,000 registered residents. The population structure shows that 31.3% are aged 41–65, and 21.6% are aged 65 or above, resulting in a clear aging trend. The aging rate reaches 26.5%, indicating that Gaoqing has entered a “moderate aging” stage. The total GDP of the county in 2020 reached 18.15 billion CNY, with a per capita GDP comparable to the provincial average. However, outmigration of young labor remains evident, particularly among those aged 20–40.
- 4. Land Use: The total land area of Gaoqing County is approximately 950 km², with 520 km² of arable land (about 55%), mainly concentrated in the southern and eastern plains along the Yellow River. The county has 110 km² of built-up land, accounting for 11.6% of the total area, and over 2 million m² of rural housing land. On average, construction land per capita reaches 263.8 m², with most settlements located within 100 m of main transportation corridors. The county has also established several industrial parks and an eco-tourism zone centered on the “Wetland Belt” of the Yellow River.
- 5. Spatial Pattern and Characteristics of Rural Settlements: Gaoqing County administers seven townships, 39 village-level administrative units, and a total of 767 rural settlements. Settlements are generally evenly distributed along the river corridors and transportation routes, with a spatial hierarchy characterized by clustering near township centers and dispersal in peripheral zones. Differences in development level and service accessibility are notable: northern settlements are more scattered and have weaker infrastructure, whereas southern and central settlements are more compact and better equipped. Overall, the county exhibits a clear spatial stratification of settlements, with evident contrasts in living standards, spatial compactness, and service accessibility across different zones.
3.2. Methodological Overview
3.3. Step 1: Indicator System Construction and Data Compilation
3.3.1. Literature Search and Screening
3.3.2. Indicator System Construction
- Reviewing domestic and international literature related to rural vitality, settlement morphology, and land-use evaluation;
- Extracting potential indicators using keyword-frequency statistics and co-occurrence network analysis;
- Refining and validating indicator selection through expert discussion and relevance testing.
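The keyword-frequency and co-occurrence step listed above can be illustrated with a minimal sketch. It assumes each reviewed paper has already been reduced to a list of indicator keywords; the `papers` data and both function names are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per reviewed paper (illustrative only).
papers = [
    ["slope", "distance to roads", "elevation"],
    ["slope", "distance to roads", "ndvi"],
    ["distance to towns", "slope", "elevation"],
]

def keyword_frequency(docs):
    """Count in how many papers each keyword appears."""
    freq = Counter()
    for keywords in docs:
        freq.update(set(keywords))  # set() so a paper counts once per keyword
    return freq

def cooccurrence(docs):
    """Count how often each unordered keyword pair appears in the same paper."""
    pairs = Counter()
    for keywords in docs:
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

freq = keyword_frequency(papers)
edges = cooccurrence(papers)
print(freq["slope"])                           # papers mentioning "slope"
print(edges[("distance to roads", "slope")])   # papers mentioning both
```

The pair counts in `edges` can serve as edge weights of a co-occurrence network, while `freq` corresponds to the frequency column of a keyword table.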
3.4. Automated Data Processing
3.4.1. Automated Outlier Detection
- IQR-based detection [18]: This method uses the 25th percentile ($Q_1$) and 75th percentile ($Q_3$) to compute the interquartile range $\mathrm{IQR} = Q_3 - Q_1$. Any data point below $Q_1 - k \cdot \mathrm{IQR}$ or above $Q_3 + k \cdot \mathrm{IQR}$ is flagged as an outlier (conventionally $k = 1.5$). The IQR method is particularly robust to skewed distributions and is well suited for rural indicators with heterogeneous value ranges and non-Gaussian distributions.
- Z-score detection [19]: This method standardizes data points and computes $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation, respectively. Points with $|z| > \tau$ are marked as outliers (conventionally $\tau = 3$). The Z-score method is effective for identifying global deviations from the mean in approximately normal distributions.
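The two detection rules can be sketched for a single indicator column as follows. This is a minimal illustration rather than the study's implementation: the fence multiplier `k=1.5`, the threshold `3.0`, the quartile interpolation method, and the sample `values` are all assumptions chosen for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Flag points whose |z| exceeds the threshold; 3.0 is an assumed default."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # constant column: nothing can deviate from the mean
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical indicator values with one implausible entry (95).
values = [10, 11, 12, 13] * 5 + [95]
print(iqr_outliers(values))     # [95]
print(zscore_outliers(values))  # [95]
```

One caveat on the Z-score rule: with the population standard deviation, no point in a sample of size n can have |z| larger than sqrt(n - 1), so in very small samples a threshold of 3 can never be exceeded, whereas the IQR fences remain usable.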
3.4.2. Outlier Handling and Missing Value Imputation
3.5. Multi-Model Classification and Ensemble Integration
3.5.1. Model Selection
- 1. Traditional Statistical Baselines (Logistic Regression, Linear Discriminant Analysis)
- 2. Classic High-Performance ML (Support Vector Machines)
- 3. Tree-Based Algorithms (Random Forest)
- 4. Gradient Boosting Algorithms (XGBoost, LightGBM, CatBoost)
- (1) Logistic Regression (LR).
- (2) Linear Discriminant Analysis (LDA).
- (3) Support Vector Machine (SVM).
- (4) Random Forest (RF).
- (5) XGBoost.
- (6) LightGBM.
- (7) CatBoost.
3.5.2. Model Configuration and Training
- Model Configuration. All models are implemented using Python 3.9 (scikit-learn, XGBoost, LightGBM, and CatBoost). Default hyperparameters serve as starting points, followed by light tuning to balance accuracy and computational efficiency.
- Logistic Regression: solver = 'lbfgs', max_iter = 500, class_weight = 'balanced'.
- Linear Discriminant Analysis: solver = 'svd', shrinkage = 'auto'.
- SVM: kernel = 'rbf', , class_weight = 'balanced'.
- Random Forest: n_estimators = 300, max_depth = None, min_samples_split = 2.
- XGBoost: n_estimators = 500, learning_rate = 0.05, max_depth = 6, subsample = 0.8.
- LightGBM: num_leaves = 31, learning_rate = 0.05, n_estimators = 500.
- CatBoost: iterations = 500, learning_rate = 0.05, depth = 6, loss_function = MultiClass.
- Hyperparameters are determined through preliminary experiments to achieve a trade-off between accuracy, stability, and training time. All experiments are conducted on a workstation with an NVIDIA RTX GPU and 64 GB RAM.
- Model Training. A total of N samples were collected from representative rural regions. To ensure both sufficient training data and an independent evaluation set, the dataset is randomly divided into training (90%) and testing (10%) subsets. A stratified sampling strategy is adopted to preserve the original class distribution across different settlement categories, thereby avoiding potential sampling bias. Following the split, an analysis of the training set revealed a significant class imbalance, with certain settlement categories being severely underrepresented. To mitigate the risk of models developing a bias towards the majority classes, we implemented a data-level rebalancing strategy using the Synthetic Minority Oversampling Technique (SMOTE) [27].
- 1. It selects a minority class sample $x_i$.
- 2. It identifies its $k$ nearest neighbors in the feature space (we used the standard $k = 5$).
- 3. It randomly selects one of these neighbors, $x_{nn}$.
- 4. It generates a new synthetic sample $x_{\mathrm{new}}$ by interpolating along the line segment between the two samples: $x_{\mathrm{new}} = x_i + \lambda \, (x_{nn} - x_i)$, where $\lambda \in [0, 1]$ is a random number.
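The four SMOTE steps can be sketched for a single synthetic sample as follows. This is a plain-Python illustration of the interpolation idea in [27], not the library routine used in the study; the function name `smote_sample`, the Euclidean distance, and the toy `minority` points are assumptions.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic sample from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = rng or random.Random()
    x_i = rng.choice(minority)                            # step 1: pick a sample
    others = [x for x in minority if x is not x_i]
    neighbors = sorted(others, key=lambda x: math.dist(x, x_i))[:k]  # step 2
    x_nn = rng.choice(neighbors)                          # step 3: pick a neighbor
    lam = rng.random()                                    # interpolation factor
    # Step 4: move a fraction lam of the way from x_i towards x_nn.
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]

# Hypothetical minority-class points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [smote_sample(minority, k=5, rng=random.Random(s)) for s in range(100)]
```

Because each synthetic point lies on a segment between two real minority samples, every generated value stays within the range spanned by the minority class, here the unit square. Oversampling is applied to the training subset only, consistent with the split described above.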
3.6. Ensemble Integration and SHAP Explainability
3.6.1. Ensemble Integration Strategy
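| Algorithm 1 Top-N Weighted Ensemble Framework |
|---|
| Require: Training data $D_{\mathrm{train}}$, test data $D_{\mathrm{test}}$ |
| Require: Candidate models $\{M_1, \dots, M_K\}$ |
| Require: Number of selected models $N$ |
| 1: Train each $M_k$ on $D_{\mathrm{train}}$ |
| 2: for each model $M_k$ do |
| 3: $s_k \leftarrow \mathrm{Score}(M_k)$ |
| 4: end for |
| 5: Select top-$N$ models based on $s_k$ |
| 6: Compute weights $w_k = s_k / \sum_{j \in \mathrm{Top}\text{-}N} s_j$ |
| 7: Prediction for a new sample $x$: |
| 8: $\hat{y} = \arg\max_{c} \sum_{k \in \mathrm{Top}\text{-}N} w_k \, p_k(c \mid x)$ |
| 9: return $\hat{y}$ |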
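A minimal sketch of the selection and weighting logic in Algorithm 1 is given below. Two choices are assumptions made for illustration: weights are taken proportional to each selected model's score, and predictions are combined by weighted soft voting over class probabilities. The F1 scores are those reported in Section 4.3.1; the class probabilities in `probas` are hypothetical.

```python
def top_n_weighted_ensemble(scores, probas, n=3):
    """Combine the n best-scoring models by score-proportional soft voting.

    scores: {model_name: score}
    probas: {model_name: [p(class 0), p(class 1), ...]} for one sample
    """
    top = sorted(scores, key=scores.get, reverse=True)[:n]   # select top-N
    total = sum(scores[m] for m in top)
    weights = {m: scores[m] / total for m in top}            # weights sum to 1
    n_classes = len(probas[top[0]])
    combined = [sum(weights[m] * probas[m][c] for m in top)
                for c in range(n_classes)]
    predicted = max(range(n_classes), key=combined.__getitem__)
    return predicted, weights, combined

# F1 scores as reported for the individual models.
scores = {"CatBoost": 0.88, "LightGBM": 0.83, "XGBoost": 0.78,
          "RandomForest": 0.63, "LR": 0.55, "SVM_RBF": 0.53, "LDA": 0.39}

# Hypothetical class probabilities for one settlement (4 classes).
probas = {
    "CatBoost": [0.6, 0.2, 0.1, 0.1],
    "LightGBM": [0.3, 0.5, 0.1, 0.1],
    "XGBoost": [0.5, 0.3, 0.1, 0.1],
    "RandomForest": [0.1, 0.1, 0.1, 0.7],
}

predicted, weights, combined = top_n_weighted_ensemble(scores, probas, n=3)
print(predicted, sorted(weights))
```

In this example LightGBM alone would vote for the second class, but the weighted combination follows CatBoost and XGBoost and returns the first class (index 0). Random Forest is excluded because it falls outside the top three.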
3.6.2. SHAP-Based Explanation
4. Results
4.1. Constructed Indicator System and Dataset Characteristics
4.1.1. Literature Review Results
- (1) Settlement Morphology. This category captures the spatial form, structure, and scale of rural settlements. Commonly used indicators include boundary compactness, axial or road network patterns, and built-up area ratio. Prior studies employing models such as XGBoost, LightGBM, and GBDT demonstrate that morphological indicators effectively describe spatial structure and help reveal underlying development patterns.
- (2) Locational Conditions. Locational indicators are among the most frequently used in the literature. They typically measure distances to towns, roads, rivers, and public facilities, as well as accessibility and road network density. Indicators such as distance to main roads and distance to towns appear most consistently, underscoring the importance of transportation accessibility in shaping settlement distribution and functional clustering. Methods including Random Forest, GBDT, and SVM are commonly adopted for these analyses.
- (3) Natural Environment. Environmental indicators form the largest group, encompassing elevation, slope, terrain relief, land-cover characteristics (e.g., cultivated land ratio, NDVI), and climatic factors such as temperature and precipitation. Studies using Random Forest, XGBoost, and MGWR (Multiscale Geographically Weighted Regression) highlight the strong constraining effect of natural environmental conditions on the spatial differentiation and evolution of rural settlements.
- (4) Socio-Economic Attributes. These indicators describe variations in economic activity and population distribution. Common metrics include per-capita GDP, income level, population density, built-up land area, and nighttime light intensity. Analytical approaches such as MGWR, CNN-based models, and XGBoost–SHAP are frequently used to examine socio-economic disparities and their relationship with settlement form and function.
- (5) Historical and Cultural Features. Although less frequently emphasized, historical–cultural indicators capture elements such as cultural heritage sites, traditional architecture, and historical landmarks. Techniques including MGWR and BP Neural Networks have been employed to assess how these cultural characteristics contribute to rural settlement differentiation and identity.
4.1.2. Indicator System Results
- (1) Socio-economic Attributes. This dimension captures demographic conditions and industrial development. Key indicators include aging rate, permanent population, population outflow rate, and average annual population growth. In addition, per capita village income and the proportion of elderly agricultural labor reflect local economic vitality and labor structure. These indicators, shown in Table 5, characterize the human and economic foundation of each settlement.
- (2) Natural Environment. This dimension incorporates natural geographic and ecological resource conditions. Indicators include terrain factors, hydrological features, vegetation status, and ecological sensitivity. As listed in Table 5, these variables describe the environmental constraints and carrying capacity that influence both the distribution and potential development trajectories of rural settlements.
- (3) Land Construction and Utilization. This dimension reflects development intensity and land-use efficiency. Construction-related indicators, such as residential land aggregation, land-use intensity, and built-up area ratio, capture spatial compactness and physical development patterns. Land-use indicators, including farmland ratio, farmland transfer rate, and ecological red-line proportion, further describe human–land interactions and the degree of land consolidation. All corresponding indicators are detailed in Table 5.
- (4) Supporting Public Services. This dimension covers the public services and supporting facilities available to each settlement. All corresponding indicators are detailed in Table 5.
4.1.3. Dataset Collection Results
4.2. Data Cleaning: Outlier Detection and Preprocessing Results
4.3. Classification Evaluation and Ensemble Model Performance
4.3.1. Multi-Model Performance Evaluation
- Overall Findings. The results clearly establish a performance hierarchy. At the baseline, the RandomClassifier performs as expected (Acc. ≈ 0.25), and the traditional statistical models, LDA (F1 0.39) and LR (F1 0.55), demonstrate that simple linear boundaries are insufficient for this complex task. The SVM_RBF (F1 0.53) performs similarly poorly, exhibiting a severe imbalance between high precision (0.75) and very low recall (0.41). Random Forest (F1 0.63) offers a moderate baseline but is clearly outperformed by the gradient boosting family.
- Model-by-Model Analysis. The models are analyzed in order of increasing performance:
- RandomClassifier. Serves as the theoretical minimum baseline for a 4-class problem (F1/Acc. 0.25). All ML models significantly outperform this floor.
- LDA. As the weakest-performing ML model (F1 0.39), LDA’s assumption of linear separability and Gaussian distributions is ill-suited for the dataset’s complexity.
- SVM_RBF. Exhibits the most severe performance imbalance (F1 0.53). Its high precision (0.75) but exceptionally low recall (0.41) suggests it only classifies high-confidence samples, missing the majority of true positives. This reflects its sensitivity to the high-dimensional, noisy, and imputed data.
- LR. Although balanced, its linear nature limits its performance (F1 0.55; Acc. 0.53), failing to capture the non-linear feature interactions critical to rural classification.
- Random Forest. Represents a significant step up from linear models (F1 0.63). While bagging mitigates variance, its performance is capped by its reliance on data imputation, preventing it from leveraging missingness as a feature.
- XGBoost. The first of the high-performance models (F1 0.78; Acc. 0.78). It confirms the power of gradient boosting, achieving strong, balanced results.
- LightGBM. Demonstrates excellent, balanced performance (F1 0.83; Acc. 0.84). Its high recall (0.86) and precision (0.80) show a well-balanced trade-off, validating its histogram-based approach.
- CatBoost. The clear top performer (F1 0.88; Acc. 0.86). Its state-of-the-art handling of missing values and high recall (0.90) make it exceptionally robust, achieving the best overall precision-recall balance for this task.
- Implications for Model Selection. This comparison provides a clear strategy for this classification task:
- When missingness is substantial, tree-boosting models with native NaN handling (CatBoost, LightGBM, XGBoost) should be prioritized. They avoid the risks of imputation bias and utilize all available information.
- Models like LDA, LR, and SVM, while useful for benchmarking, are not suitable for high-performance deployment in this context due to their poor performance and sensitivity to imputation.
- For a single-model deployment, CatBoost offers the best all-around performance (highest F1, Acc, and Recall). LightGBM presents a highly competitive and balanced alternative.
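The reported scores can be cross-checked against the definition of F1 as the harmonic mean of precision and recall. The sketch below uses only the precision, recall, and F1 values quoted above for SVM_RBF and LightGBM; the helper name `f1_from` is hypothetical. If the reported metrics are class-averaged, F1 computed from averaged precision and recall need not match the averaged F1 exactly, so this is a plausibility check only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as quoted in the model-by-model analysis.
reported = {
    "SVM_RBF": (0.75, 0.41, 0.53),
    "LightGBM": (0.80, 0.86, 0.83),
}

for name, (p, r, f1) in reported.items():
    print(name, round(f1_from(p, r), 2), f1)
# SVM_RBF 0.53 0.53
# LightGBM 0.83 0.83
```

The SVM_RBF case shows why high precision alone is not enough: a recall of 0.41 pulls the harmonic mean down to 0.53 despite a precision of 0.75.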
4.3.2. Multi-Model Ensemble Evaluation
- High-Performance Categories: Classes 1 and 4 achieve exceptional accuracy levels (0.90 and 0.91, respectively). This performance suggests that the ensemble effectively synthesizes the distinct spatial, structural, and socio-economic characteristics defining these settlement types. It is also worth noting that these categories typically represent larger proportions of the original dataset, which likely contributes to the models’ enhanced ability to learn more robust patterns for them. The complementary strengths of the Top 3 models allow for more precise decision boundaries in these well-defined categories.
- Robustness in Complex Categories: For Classes 2 and 3, where single models might struggle with more intricate patterns or higher inherent data uncertainty (as implied by their slightly lower individual model scores), the ensemble maintains a strong performance of 0.85 and 0.86, respectively. This demonstrates the framework’s superior generalization ability and resilience to noise and ambiguous feature interactions, even in potentially less represented or more ambiguous categories.
4.3.3. SHAP-Based Feature Importance Analysis
- Global Feature Importance.
- Class-Specific Feature Influence.
5. Discussion
5.1. Comparison of Rural Settlement Spatial Patterns
5.1.1. Improved Classification Coverage
5.1.2. More Cohesive and Optimized Spatial Structure
5.1.3. Greater Regularity in Type-Specific Spatial Patterns
5.1.4. Type Differentiation at the Township Scale
5.2. Rural Renewal Strategies Based on Classification
5.3. Limitations
5.4. Research Significance
5.4.1. Transferability and Replicability
5.4.2. Integration into Multi-Level Spatial Governance
5.4.3. Toward Adaptive and Data-Driven Rural Governance
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Long, H.; Liu, Y.; Li, X.; Chen, Y. Building new countryside in China: A geographical perspective. Land Use Policy 2010, 27, 165–173.
- Li, Y.; Westlund, H.; Liu, Y. Why some rural areas decline while some others not: An overview of rural evolution in the world. J. Rural Stud. 2019, 68, 135–143.
- Long, H.; Tu, S.; Ge, D.; Li, T.; Liu, Y. The allocation and management of critical resources in rural China under restructuring: Problems and prospects. J. Rural Stud. 2016, 47, 392–412.
- State Council of the People’s Republic of China. Rural Revitalization Strategy Plan (2024–2027); Official Policy Document; State Council of the People’s Republic of China: Beijing, China, 2024.
- Tu, S.; Long, H.; Zhang, Y.; Ge, D.; Qu, Y. Rural restructuring at village level under rapid urbanization in metropolitan suburbs of China and its implications for innovations in land use policy. Habitat Int. 2018, 77, 143–152.
- Liu, Y.; Li, Y. Revitalize the world’s countryside. Nature 2017, 548, 275–277.
- Fu, P.; Xiao, J.; Zhao, Z.Q.; Xie, X. The Method of “Space-Dynamic” Coupling Mechanism of Rural Settlements Based on Machine Learning: Taking Liyang City, Jiangsu Province as an Example. J. Hum. Settl. West China 2022, 37, 1–9.
- Tang, Y.; Chen, C. Analysis of Factors Influencing the Evolution of Rural Settlements in Major Grain-Producing Areas Based on Explainable Machine Learning: A Case in Central China. Sci. Technol. Eng. 2023, 23, 9378–9387.
- Zhou, H.; Na, X.; Li, L.; Ning, X.; Bai, Y.; Wu, X.; Zang, S. Suitability Evaluation of Rural Settlements in a Farming–Pastoral Ecotone Area Based on Machine Learning Maximum Entropy. Ecol. Indic. 2023, 154, 110794.
- Huang, X.; Liu, Y.; Stouffs, R. Exploring Spatio-Temporal Heterogeneity of Rural Settlement Patterns on Carbon Emission across More Than 2800 Chinese Counties Using Multiple Supervised Machine Learning Models. J. Environ. Manag. 2025, 373, 123932.
- Shu, B.; Liu, Y.; Wang, C.; Zhang, H.; Amani-Beni, M.; Zhang, R. Geological Hazard Risk Assessment and Rural Settlement Site Selection Using GIS and Random Forest Algorithm. Ecol. Indic. 2024, 166, 112554.
- Kalaycıoğlu, O.; Akhanlı, S.E.; Menteşe, E.Y.; Kalaycıoğlu, M.; Kalaycıoğlu, S. Using Machine Learning Algorithms to Identify Predictors of Social Vulnerability in the Event of a Hazard: Istanbul Case Study. Nat. Hazards Earth Syst. Sci. 2023, 23, 2133–2156.
- Halfacree, K. Locality and social representation: Space, discourse and alternative definitions of the rural. In The Rural; Mit Press: Cambridge, MA, USA, 2017; pp. 245–260.
- Cloke, P. Conceptualizing rurality. In Handbook of Rural Studies; Sage: Thousand Oaks, CA, USA, 2006; pp. 18–28.
- Tu, S.; Long, H. Rural restructuring in China: Theory, approaches and research prospect. J. Geogr. Sci. 2017, 27, 1169–1184.
- Wang, Y.; Zhu, X.; Wei, T.; Xu, F.; Williams, T.K.A.; Zhang, H. Entity-based image analysis: A new strategy to map rural settlements from Landsat images. Remote Sens. Environ. 2025, 318, 114549.
- Liu, Y.; Zhou, Y.; Li, Y. Rural regional system and rural revitalization strategy in China. Acta Geogr. Sin. 2019, 74, 2511–2528.
- Tukey, J.W. Exploratory Data Analysis; Addison–Wesley: Reading, MA, USA, 1977.
- Barnett, V.; Lewis, T. Outliers in Statistical Data, 3rd ed.; Wiley: New York, NY, USA, 1994.
- Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; Wiley: Hoboken, NJ, USA, 2013.
- Fisher, R.A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 1936, 7, 179–188.
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. Adv. Neural Inf. Process. Syst. 2018, 31, 6639–6649.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Dietterich, T.G. Ensemble Methods in Machine Learning. In Multiple Classifier Systems; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
- Jiang, J.S.; Li, Z.; Bedra, K.B.; Long, C.R.; Wu, J.D.; Zhong, Q.K. Predicting Outdoor Thermal Comfort in Traditional Villages: An Explainable Machine Learning Framework Integrating Model Optimization, Seasonal Variability, and Tourist-Resident Insights. Build. Environ. 2025, 282, 113315.
- Fan, D.Z. Study on the Spatial Form Characteristics and Environmental Adaptability Mechanism of Rural Settlements in the Lower Reaches of the Yellow River. Master’s Thesis, Shandong Jianzhu University, Jinan, China, 2024.
- Zhao, Z.Y. Study on the Formation Mechanism of Village Spatial Form in the Lower Reaches of the Yellow River. Master’s Thesis, Shandong Jianzhu University, Jinan, China, 2023.
- Zhang, Y.; Duan, S.; Dong, L.; Ding, X.M. Spatial Sustainability of Agricultural Rural Settlements: An Analysis of Rural Spatial Patterns and Influencing Factors in Three Northeastern Provinces of China. Sustainability 2025, 17, 5597.
- Jiang, X.; Man, S.H.; Zhu, X.L.; Zhao, H.Y.; Yan, T.J. Sustainable Protection Strategies for Traditional Villages Based on a Socio-Ecological Systems Spatial Pattern Evaluation: A Case Study from Jiang River Basin in China. Sustainability 2024, 16, 7700.
- Pan, Y.P.; Zhao, X.; Wang, J. Identifying the Class of the Villages Based on SMOTE-RF Algorithm. J. Geo-Inf. Sci. 2023, 25, 163–176.
- Yang, X.; Pu, F. Spatial Cognitive Modeling of the Site Selection for Traditional Rural Settlements: A Case Study of Kengzi Village, Southern China. J. Urban. Plan. Dev. 2020, 146, 05020026.
- Chen, L.K.; Zhong, Q.K.; Li, Z. Analysis of Spatial Characteristics and Influence Mechanism of Human Settlement Suitability in Traditional Villages Based on Multi-Scale Geographically Weighted Regression Model: A Case Study of Hunan Province. Ecol. Indic. 2023, 154, 110828.
- Li, W.M.; Li, T.S.; Wu, P. Study on Layout Optimization of Rural Residential Areas Based on Gravity Model and Weighted Voronoi Diagram—A Case Study of Xiangqiao Street, Xi’an. Chin. J. Agric. Resour. Reg. Plan. 2018, 39, 77–82.
- Zhao, Z.; Lü, N.; Jiang, C.M. Village Classification and Development Strategy in the North Foot of Qinling Mountains Based on SOM Neural Network. J. Guilin Univ. Technol. 2023, 43, 608–616.
- Peng, J.J.; Kong, X.S.; Liu, Y.L.; Cui, J.X. Spatial Optimization Allocation of Rural Residential Areas Based on Agent-Based Model. Geogr. Geo-Inf. Sci. 2016, 32, 52–58.
- Liu, F.J.; Xu, W.; Niu, Q. Spatial Pattern of Traditional Villages in Remote Mountainous Areas and Their Development Potential Assessment: The Case of Enshi, China. Sustainability 2025, 17, 1138.
- Han, G.F.; Xiong, J.P.; Liu, G.X.; Li, L.; Lei, J.; Lu, Y.R. A Classification Method of Mountainous Villages Based on Logistic Model: A Case Study on Wuxi County, Chongqing Municipality. J. Hum. Settl. West China 2021, 36, 46–53.
- Zhang, C.; Teng, J.L.; Liu, P.L.; Liu, C.Q. Ecological suitability evaluation of traditional village locations in Jiangxi Province based on multi-model integration using artificial intelligence. PLoS ONE 2025, 20, 0332375.
- Wu, K.H.; Su, W.C.; Ye, S.A.; Li, W.; Cao, Y.; Jia, Z.Z. Analysis on the geographical pattern and driving force of traditional villages based on GIS and Geodetector: A case study of Guizhou, China. Sci. Rep. 2023, 13, 20659.
- Fan, L.; Zhang, D.Y. Study on Spatial Differentiation Characteristics and Influencing Factors of Traditional Villages in North China Based on MGWR Mode. Chin. Landsc. Archit. 2022, 38, 56–61.
- Tang, L.N.; Liu, Y.; Pan, Y.C.; Ren, Y.M. Evaluation and Zoning of Rural Regional Multifunction Based on BP Model and Ward Method: A Case in the Pinggu District of Beijing City. Sci. Geogr. Sin. 2016, 36, 1514–1521.
- Li, D.H.; Gao, X.C.; Lv, S.Y.; Zhao, W.W.; Yuan, M.; Li, P.T. Spatial distribution and influencing factors of traditional villages in Inner Mongolia Autonomous Region. Buildings 2023, 13, 2807.
- Niu, Y.L.; Wang, Y. Study on Spatial Differentiation Pattern and Influencing Mechanism of Traditional Villages in Taihang Mountain Area Based on MGWR Model. J. Arid Land Resour. Environ. 2024, 38, 87–96.
- Wu, S.L.; Di, B.F.; Ustin, S.L.; Stamatopoulos, C.A.; Li, J.R.; Zuo, Q.; Wu, X.; Ai, N.S. Classification and detection of dominant factors in geospatial patterns of traditional settlements in China. J. Geogr. Sci. 2022, 32, 873–891.
- Zhu, K.K.; Gu, Y.; Zhang, Y.T.; Song, Y.D.; Guo, Z.H.; Yan, X.Q.; Yao, Y.; Guan, Q.F.; Li, X. From Street View Imagery to the Countryside: Large-Scale Perception of Rural China Using Deep Learning. Ann. Am. Assoc. Geogr. 2025, 115, 1720–1741.
- Nie, Z.Y.; Chen, C.; Pan, W.; Dong, T. Exploring the dynamic cultural driving factors underlying the regional spatial pattern of Chinese traditional villages. Buildings 2023, 13, 3068.
- Hu, J.M.; Niu, J.Q.; Su, H.Y.; Han, G.F. A Classification Method of Mountainous Villages Based on BP Neural Network: A Case Study in Wuxi County, Chongqing City. Dev. Small Cities Town 2023, 41, 22–31.
- Lian, M.C.; Li, Y.J. The Spatial Patterns and Architectural Form Characteristics of Chinese Traditional Villages: A Case Study of Guanzhong, Shaanxi Province. Sustainability 2024, 16, 9491.












| Step | Objective | Output/Number of Papers |
|---|---|---|
| 1. Define search scope | Clarify research themes and establish initial database | Initial dataset: 160 (CNKI) + 69 (WOS) |
| 2. Eliminate duplicates | Remove redundant records and retain high-quality papers | 145 |
| 3. Title screening | Rapidly filter non-relevant topics | 105–128 |
| 4. Abstract review | Ensure thematic relevance and methodological contribution | 89–102 |
| 5. Full-text review | Confirm inclusion and extract core references | 69 |
| 6. Indicator extraction | Build the foundation for indicator construction | 69 |
| Database | Search Keywords | Number of Records |
|---|---|---|
| CNKI | Rural AND Machine Learning | 38 |
| CNKI | (Rural OR Village OR Traditional Settlement) AND (Machine Learning OR Deep Learning OR Artificial Intelligence) | 72 |
| CNKI | (Rural OR Settlement OR Traditional Village OR Village Classification OR Evaluation) AND (Machine Learning OR Deep Learning OR Artificial Intelligence OR Random Forest OR Neural Network OR GBDT) | 160 |
| WOS | “Rural settlements” + “Traditional village” AND “Machine learning” + “Random forest” + “Deep learning” + “Convolutional Neural Network” | 69 |
| Model | Missing Value Support | Key Characteristics | Year |
|---|---|---|---|
| Logistic Regression (LR) [20] | No | Linear, interpretable, discriminative | 1958 |
| Linear Discriminant Analysis (LDA) [21] | No | Linear, statistical, generative | 1936 |
| Support Vector Machine (SVM) [22] | No | Nonlinear kernels, max-margin | 1995 |
| Random Forest (RF) [23] | Partial | Robust to noise, bagging | 2001 |
| XGBoost [24] | Yes | Gradient boosting | 2016 |
| LightGBM [25] | Yes | Histogram-based boosting | 2017 |
| CatBoost [26] | Yes | Handles categorical data natively | 2018 |

| Indicator Category | Keyword | Freq. | Authors/Years | ML Methods |
|---|---|---|---|---|
| Settlement Morphology | Boundary | 7 | Jiang (2025) [30] | XGBoost, VOSM |
| | Axial pattern | 2 | Fan (2024) [31] | LightGBM |
| | Skeleton | 3 | Zhao (2023) [32]; Fan (2024) [31] | XGBoost, LightGBM |
| | Scale area | 6 | Zhang (2025) [33]; Jiang (2024) [34] | XGBoost, GBDT, BP |
| Locational Conditions | Distance to towns | 16 | Fu (2022) [7]; Pan (2023) [35] | GBDT, SMOTE, Random Forest |
| | Distance to roads | 19 | Zhou (2023) [9]; Xi (2020) [36] | Random Forest, SVM, GBDT |
| | Distance to rivers | 15 | Shu (2024) [11]; Chen (2023) [37] | Random Forest, GMM |
| | Distance to public facilities | 5 | Zhou (2023) [9]; Chen (2023) [37]; Li (2018) [38] | GBDT, MaxEnt |
| | Accessibility | 4 | Zhao (2023) [39]; Peng (2016) [40] | SOM, MGWR |
| | Road network density | 3 | Chen (2023) [37]; Liu (2025) [41]; Han (2021) [42] | Multiclass Logistic Regression, BP |
| Natural Environment | Elevation | 14 | Zhang (2025) [43]; Wu (2023) [44] | XGBoost, LightGBM |
| | Slope | 20 | Shu (2024) [11]; Wu (2023) [44] | GBWT, Random Forest |
| | Terrain relief | 15 | Fan (2022) [45]; Tang (2016) [46] | XGBoost, SOM |
| | Cultivated land area | 10 | Shu (2024) [11]; Li (2023) [47] | MGWR, XGBoost–SHAP |
| | Aspect (slope direction) | 8 | Shu (2024) [11]; Zhou (2023) [9]; Li (2023) [47] | MaxEnt, Random Forest |
| | NDVI | 6 | Zhang (2025) [33]; Chen (2023) [37] | K-Means, Random Forest, BP |
| | Annual precipitation | 6 | Zhang (2025) [33]; Shu (2024) [11] | MGWR |
| | Annual temperature | 5 | Chen (2023) [37]; Li (2023) [47] | MGWR, XGBoost–SHAP |
| | Annual sunshine duration | 4 | Niu (2024) [48]; Wu (2022) [49] | MGWR, Random Forest |
| | Water system density | 4 | Zhang (2025) [33]; Wu (2023) [44] | MGWR, XGBoost–SHAP |
| | Altitude | 4 | Wu (2023) [44] | K-Means, MaxEnt |
| | River flow direction | 2 | Wu (2022) [49]; Zhou (2023) [9] | XGBoost, LightGBM |
| Socio-economic Attributes | Per-capita GDP | 8 | Zhu (2025) [50]; Li (2023) [47]; Nie (2023) [51] | CNN, XGBoost–SHAP |
| | Per-capita income | 5 | Zhu (2025) [50]; Jiang (2024) [34]; Hu (2023) [52] | CNN, K-Means, Gravity Model, BP |
| | GDP | 7 | Zhang (2025) [33]; Lian (2024) [53]; Chen (2023) [37] | MGWR |
| | Population density | 7 | Niu (2024) [48]; Chen (2023) [37] | MGWR, SMOTE, Random Forest |
| | Night-time light intensity | 4 | Zhang (2025) [33]; Lian (2024) [53] | MGWR, XGBoost–SHAP, GBDT |
| | Cultivated land ratio | 10 | Chen (2023) [37]; Xi (2022) [36] | K-Means, Gravity Model |
| | Urbanization rate | 5 | Zhang (2025) [33]; Nie (2023) [51] | SMOTE, Random Forest, GBDT |
| Historical and Cultural Factors | Cultural heritage concentration | 2 | Nie (2023) [51]; Fan (2022) [45] | MGWR, XGBoost–SHAP |
| | Intangible cultural heritage | 2 | Li (2023) [47]; Chen (2023) [37] | MGWR, XGBoost–SHAP |
| | Historical cultural points | 3 | Hu (2023) [52]; Han (2021) [42] | BP, Multiclass Logistic Regression |

| Primary Dimension | Secondary Indicator | Calculation or Definition |
|---|---|---|
| Socio-economic Attributes | Aging rate | Ratio of population aged 65+ to total permanent population |
| | Permanent population | Registered permanent residents |
| | Population outflow rate | (Average 2017–2019 out-migration) / population in 2017 |
| | Village per capita income | Mean income per village household |
| | Share of elderly agricultural labor | Share of agricultural workers aged 65+ |
| Natural Environment | Current natural conditions | Graded: good (1), moderate (2), poor (3), very poor (4) |
| | Current resource conditions | Rich mountain and water resources (1), Abundant tourism resources (2), Prominent industrial resources (3), Advantageous locational conditions (4), Rich historical and cultural resources (5) |
| Land Construction and Utilization | Transportation conditions | Convenient (1), relatively convenient (2), average (3), poor (4) |
| | Idle housing vacancy rate | Number of idle/unused housing units (households) |
| | Residential land aggregation | Degree of clustering of residential land |
| | Built-up area ratio | Proportion of built-up area to total land (%) |
| | Land fragmentation | Graded: high (1), relatively high (2), low (3), none (4) |
| | Farmland transfer rate | Ratio of transferred farmland area (%) |
| | Basic farmland ratio | Share of basic farmland area (%) |
| | Farmland per capita | Total farmland area / permanent population |
| | Ecological red-line ratio | Graded: high (1), relatively high (2), low (3), none (4) |
| Supporting Public Services | Waste collection facility | 1 = present, 0 = absent |
| | Waste transfer station | 1 = present, 0 = absent |
| | Centralized heating | 1 = present, 0 = absent |
| | Natural gas access | 1 = present, 0 = absent |
| | Tap water supply | 1 = present, 0 = absent |
| | Sewage treatment plant | 1 = present, 0 = absent |
| | Sanitary toilets | 1 = present, 0 = absent |
| | Elderly care center | 1 = present, 0 = absent |
| | Elderly care station | 1 = present, 0 = absent |
| | Community service center | 1 = present, 0 = absent |
| | Farmers’ market | 1 = present, 0 = absent |
| | Cultural activity center | 1 = present, 0 = absent |
| | Fitness / sports venue | 1 = present, 0 = absent |
| | Clinic or health station | 1 = present, 0 = absent |
| | Kindergarten | 1 = present, 0 = absent |
| | Primary school | 1 = present, 0 = absent |
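
The definitions in the indicator table map directly onto simple arithmetic and lookup encodings. The sketch below shows how a single village record might be converted into numeric features for a handful of the indicators (ratio, ordinal grade, and binary presence). All field names in the input record are hypothetical, and only a subset of the indicators is encoded for brevity.

```python
from statistics import mean

# Ordinal grades copied from the "Transportation conditions" row.
TRANSPORT_GRADES = {
    "convenient": 1,
    "relatively convenient": 2,
    "average": 3,
    "poor": 4,
}


def encode_village(record: dict) -> dict:
    """Turn one raw survey record into numeric features.

    Every key read from `record` is a hypothetical field name chosen for
    this sketch; the formulas follow the definitions in the table.
    """
    population = record["permanent_population"]
    return {
        # Ratio of population aged 65+ to total permanent population.
        "aging_rate": record["population_65_plus"] / population,
        # (Average 2017-2019 out-migration) / population in 2017.
        "population_outflow_rate": (
            mean(record["out_migration_2017_2019"]) / record["population_2017"]
        ),
        # Total farmland area / permanent population.
        "farmland_per_capita": record["farmland_area"] / population,
        # Ordinal grade, 1 (convenient) to 4 (poor).
        "transportation_conditions": TRANSPORT_GRADES[record["transportation"]],
        # Binary facility indicators: 1 = present, 0 = absent.
        "tap_water_supply": int(bool(record["has_tap_water"])),
        "primary_school": int(bool(record["has_primary_school"])),
    }


# Made-up example record, for illustration only.
village = {
    "permanent_population": 600,
    "population_65_plus": 120,
    "out_migration_2017_2019": [30, 40, 50],
    "population_2017": 800,
    "farmland_area": 1200.0,
    "transportation": "relatively convenient",
    "has_tap_water": True,
    "has_primary_school": False,
}
features = encode_village(village)
```

The numbers in the example record are invented and do not come from the Gaoqing County dataset.
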

| Feature | Number of Outliers |
|---|---|
| Land Transfer Rate | 34 |
| Ecological Redline Encroachment | 20 |
| Vacant Houses | 12 |
| Resource Condition | 11 |
| Waste Collection Points | 11 |
| Land Abandonment | 8 |
| Tap Water Access | 7 |
| Natural Environment | 5 |
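
Per-feature outlier counts such as those reported above can be produced by applying a detection rule to each column independently. The sketch below uses the common 1.5 × IQR fence rule purely as a stand-in; the detection rule actually used in this study is not restated here, and the function name and multiplier are assumptions.

```python
from statistics import quantiles


def count_iqr_outliers(values, k: float = 1.5) -> int:
    """Count values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule and the default k = 1.5 are illustrative assumptions,
    not necessarily the rule behind the reported counts.
    Missing entries (None) are ignored.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return 0  # quartiles are undefined for fewer than two points
    q1, _, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in observed if v < lower or v > upper)


# Made-up columns, for illustration only.
columns = {
    "land_transfer_rate": [10, 11, 12, 11, 10, 12, 11, 95],
    "tap_water_access": [1, 2, 3, 4, 5],
}
outlier_counts = {name: count_iqr_outliers(col) for name, col in columns.items()}
```

Applying the same function to every column yields a feature-to-count mapping with the same shape as the table.
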
| Method | Class 1 | Class 2 | Class 3 | Class 4 | Overall |
|---|---|---|---|---|---|
| Ensemble (Ours) | 0.90 | 0.85 | 0.86 | 0.91 | 0.88 |
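
To make the per-class reporting concrete, the sketch below combines class probabilities from several base models by soft voting and then computes a per-class score. Both choices are assumptions made for illustration: the ensemble scheme and the metric behind the table are not restated here, so soft voting and per-class recall stand in as common options.

```python
def soft_vote(prob_sets):
    """Average class probabilities across models and pick the top class.

    prob_sets[m][i][c] is the probability that model m assigns to class c
    for sample i. Soft voting is an assumed combination rule.
    """
    n_models = len(prob_sets)
    predictions = []
    for i in range(len(prob_sets[0])):
        n_classes = len(prob_sets[0][i])
        averaged = [
            sum(prob_sets[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=averaged.__getitem__))
    return predictions


def per_class_recall(y_true, y_pred, n_classes):
    """Fraction of samples of each class that were predicted correctly.

    Returns None for classes with no true samples.
    """
    recalls = []
    for c in range(n_classes):
        members = [i for i, y in enumerate(y_true) if y == c]
        if not members:
            recalls.append(None)
            continue
        hits = sum(1 for i in members if y_pred[i] == c)
        recalls.append(hits / len(members))
    return recalls


# Two hypothetical models, two samples, two classes (made-up probabilities).
prob_sets = [
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.2, 0.8], [0.4, 0.6]],
]
y_pred = soft_vote(prob_sets)
recalls = per_class_recall([0, 1], y_pred, n_classes=2)
```

Averaging probabilities rather than counting hard votes lets a confident model outweigh an uncertain one, which is one reason soft voting is often preferred when base models output calibrated probabilities.
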
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
He, J.; Wang, X.; Qi, Y.; Jiang, J.; Zhou, D.; Ma, D.; Ying, J. AI-Driven Multi-Model Classification of Rural Settlements for Targeted Rural Revitalization: A Case Study of Gaoqing County, Shandong Province, China. Land 2025, 14, 2298. https://doi.org/10.3390/land14122298

