Article

Optimizing Data Preprocessing and Hyperparameter Tuning for Soil Organic Carbon Content Prediction Using Large Language Models: A Case Study of the Black Soil and Windblown Sandy Soil Regions in Northeast China

1 School of Environment, Shenyang University, Shenyang 110044, China
2 School of Agricultural Science and Practice, Royal Agricultural University, Cirencester GL7 6JS, Gloucestershire, UK
3 Key Laboratory of Eco-Restoration of Regional Contaminated Environment, Ministry of Education, Shenyang University, Shenyang 110044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3349; https://doi.org/10.3390/app16073349
Submission received: 3 March 2026 / Revised: 23 March 2026 / Accepted: 27 March 2026 / Published: 30 March 2026
(This article belongs to the Section Environmental Sciences)

Abstract

Soil organic carbon (SOC) content prediction currently faces two problems. Data preprocessing relies on fixed rules formulated from expert experience, which leads to a lack of uniform standards and insufficient consideration of regional soil heterogeneity, while hyperparameter tuning incurs high computational costs and excessively long runtimes. To address these problems, this study proposes an intelligent modeling workflow driven by Large Language Models (LLMs). The workflow optimizes two key aspects of SOC Random Forest modeling: data preprocessing and hyperparameter tuning. The LLM-defined rules achieved sample retention rates of 55.33% and 61.90% in the two regions, respectively, a larger between-region difference than that of traditional hard-coded rules (56.2% and 59.3%), and the mean SOC content deviations (30.27% and 20.05%) were both lower than those of traditional hard-coding. At the same time, the mean SOC content values in both regions were close to those obtained with the other methods, indicating that the LLM effectively captured regional soil differences. With only a single hyperparameter evaluation, the adaptive model achieved test set R² values of 0.394 and 0.694 in the black soil region and the windblown sandy soil region, respectively, with root mean square error (RMSE) values of 8.76 g/kg and 6.07 g/kg. This performance is comparable to that of Grid Search and Random Search, while computational efficiency improved by over 95%. In performance comparisons with eXtreme Gradient Boosting (XGBoost) and Partial Least Squares Regression (PLSR), the LLM-optimized Random Forest, with the R² and RMSE values reported above for the two regions, demonstrated practical application value.
Keywords: soil data preprocessing; LLM; hyperparameter tuning; random forest; regional adaptability
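
The quantities reported in the abstract (sample retention rate, R², RMSE, and the saving from a single hyperparameter evaluation) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy values, and the 27-candidate search size are hypothetical, and measuring the efficiency gain as the share of avoided model evaluations is an assumption, since the abstract does not state how the figure of over 95% was computed.

```python
import math


def retention_rate(n_kept, n_total):
    """Percentage of samples kept after preprocessing."""
    return 100.0 * n_kept / n_total


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target (here g/kg)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def evaluation_savings(n_baseline_evals, n_evals=1):
    """Percentage of model evaluations avoided relative to a baseline search.

    Assumption: efficiency is counted in evaluations, not wall-clock time.
    """
    return 100.0 * (1.0 - n_evals / n_baseline_evals)


# Hypothetical toy data (SOC in g/kg); not taken from the study.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(f"retention: {retention_rate(166, 300):.2f}%")
print(f"R2:        {r2_score(y_true, y_pred):.3f}")
print(f"RMSE:      {rmse(y_true, y_pred):.2f} g/kg")
# A hypothetical grid of 27 candidate settings versus one evaluation.
print(f"savings:   {evaluation_savings(27):.1f}%")
```

Under this evaluation-count assumption, any baseline search with at least 20 candidate settings yields a saving of 95% or more when replaced by a single evaluation.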

MDPI and ACS Style

Cui, H.; Chang, X.; Gang, S. Optimizing Data Preprocessing and Hyperparameter Tuning for Soil Organic Carbon Content Prediction Using Large Language Models: A Case Study of the Black Soil and Windblown Sandy Soil Regions in Northeast China. Appl. Sci. 2026, 16, 3349. https://doi.org/10.3390/app16073349
