Spatially Explicit Reconstruction of Cropland Using the Random Forest: A Case Study of the Tuojiang River Basin, China from 1911 to 2010

: A long-term, high-resolution cropland dataset plays an essential part in accurately and systematically understanding the mechanisms that drive cropland change and its effect on biogeochemical processes. However, current widely used spatially explicit cropland databases are developed according to a simple downscaling model and are associated with low resolution. By combining historical county-level cropland archive data with natural and anthropogenic variables, we developed a random forest model to spatialize the cropland distribution in the Tuojiang River Basin (TRB) during 1911–2010, using a resolution of 30 m. The reconstruction results showed that the cropland in the TRB increased from 1.13 × 10 4 km 2 in 1911 to 1.81 × 10 4 km 2 . In comparison with satellite-based data for 1980, the reconstructed dataset approximated the remotely sensed cropland distribution. Our cropland map could capture cropland distribution details better than three widely used public cropland datasets, due to its high spatial heterogeneity and improved spatial resolution. The most critical factors driving the distribution of TRB cropland include nearby-cropland, elevation, and climatic conditions. This newly reconstructed cropland dataset can be used for long-term, accurate regional ecological simulation, and future policymaking. This novel reconstruction approach has the potential to be applied to other land use and cover types via its ﬂexible framework and modiﬁable parameters. cropland Plain TRB, cropland a slight the cropland the and lower parts of the basin. Grids with cropland fractions of 0–10% and more than 60% decrease to 9.33% and 56.71% of the total, respectively. Our spatial cropland reconstruction results reveal that the most intensive area the is the upper of the TRB the Chengdu Plain area,


Introduction
The land use and cover change (LUCC) process has undergone drastic changes due to the population boom over the past century, during which rising demands for food and fiber have led to expansion of agricultural lands. As one of the predominant land use types [1], cropland serves as an important carbon budget component and has significant impacts on biogeochemical processes, such as food production, global change, regional ecosystem services, and biodiversity [2][3][4]. The long-term evolution of the cropland distribution plays a key role in understanding agricultural development trajectories and long-term environmental and economic strategies [5,6]. However, most currently used cropland datasets are developed at the large scale with medium-or low-resolution, and thus lack high-precision spatial information. This produces an important research gap because long-term, spatially explicit land-use maps are critical input data for many biogeochemical land-surface models (e.g., ORCHIDEE, DLEM) [7,8], which are used for the IPCC report and the annual global carbon budget estimate [9]. Therefore, there is an urgent need for a long-term, accurate cropland dataset that can improve ecological simulation accuracy and support detailed ecological analyses.
Although the remote sensing method provides a fairly accurate data source for mapping land change, it is resource-intensive and has only been available for the past four regions and socio-economic centers in the southwest of China. The cropland area has expanded at an unprecedented speed over the past century due to the population boom [26]. In addition, it has many problems, including over-intensive agriculture, human-land contradiction, and eco-environmental problems [27]. Thus, it is an ideal testbed for centurylong cropland reconstruction research.

Input Factors and Data Resources
Given the availability of variables for model training and prediction, ten representative natural and anthropogenic variables (elevation, slope, relief amplitude, climatic productive potential, nearby cropland, distance to rural settlements, distance to a river, flood risk, soil erosion, and soil fertility) were used as independent variables in an effort to describe the key factors that affect cropland probability. Population density, waterbody, and elevation were selected as three constraint factors.

Input Factors and Data Resources
Given the availability of variables for model training and prediction, ten representative natural and anthropogenic variables (elevation, slope, relief amplitude, climatic productive potential, nearby cropland, distance to rural settlements, distance to a river, flood risk, soil erosion, and soil fertility) were used as independent variables in an effort to describe the key factors that affect cropland probability. Population density, waterbody, and elevation were selected as three constraint factors.

County-Level Cropland Census
Based on historical cropland statistics and archives, a county-level TRB cropland area dataset was reconstructed for 1911-2010 via revision, calibration, and verification. In the Republic of China era, the data sources used for cropland reconstruction mainly included statistics and survey data from government departments. In addition to the local gazetteers, cropland data were obtained from the Republic of China Agricultural and Commercial Statistics (1912-1921) (Ministry of Agriculture and Commerce), China Land Use Statistics [28], and Sichuan Province Statistics Summary (1945). During the People's Republic of China period after 1949, the archives were superior to those that came before in terms of quantity and accuracy. Cropland data were obtained from the Sichuan Provincial Digital Local Gazetteers, Sichuan Provincial Statistical Yearbook, China's Rural Economic Statistics by Counties (1980Counties ( -1987 [29], and China Regional Economic Statistics Yearbook [30]. National Land Survey results were used as inventory data during this study. Finally, we get the dynamics of the statistical cropland area in the Tuojiang River Basin from 1911 to 2010 (Supplementary Materials Figure S1). Cropland distribution data from 2017 were obtained from the FROM-GLC10 product (Table 1), which was developed based on Sentinel-2 satellite imagery in conjunction with a random forest classifier [31]. The FROM-GLC10 dataset provided the most detailed spatial information regarding various land uses and offered 10 m spatial resolution in 2017. The TRB cropland distribution was clipped, resampled, and derived from the FROM-GLC10 product, and was later used as training data. Cropland layer data with a 30 m resolution in 1980 were derived from the GlobeLand30 dataset.

Other Ancillary Data
Ancillary raster data used for cropland spatialization included terrain, climate, soil, rural settlement, river, and population data. Topography data with a 30 m spatial resolution were obtained from NASA's ASTER GDEM version 3 (https://search.earthdata.nasa.gov/) (accessed on 20 May 2021), and the slope was calculated from Digital Elevation Model data. Meteorological data for the study region from 1951 to 2018 were provided by the National Meteorological Science Data Center in China (http://data.cma.cn/) (accessed on 22 March 2021). The climatic productive potential was calculated using photosynthetic, temperature, and moisture correction coefficients. These were used to represent comprehensive agricultural production conditions [32].  [33]. All raster data were resampled to a 30 m resolution using bilinear spatial interpolation.

Climatic Productive Potential
The climatic productive potential was calculated using the annual average solar radiation, photosynthetic correction coefficient, temperature correction coefficient, and moisture correction coefficient from meteorological datasets developed since 1951, from 40 stations in the TRB and surrounding areas [34].
where Climate is the climatic productive potential; Q is the total solar radiation; f (q) is the photosynthetic correction coefficient, which is set to 21.9; f (t) is the temperature correction coefficient, which is the number of frost-free days in the year; and f (w) is the moisture correction coefficient, which is the ratio of precipitation to evaporation.

Soil Fertility
The soil organic matter, total nitrogen, total phosphorus, and total potassium contents were used to calculate the soil fertility indicators. After the normalization of each factor, the comprehensive soil fertility index of each grid unit was calculated using the linearity synthetic method [18].
where Soil i is the soil fertility index of grid i; Soil j is the normalized soil nutrient content; and µ j is the weight coefficient of an individual single soil factor determined via the analytic hierarchy process.

Flood Risk
Flood disasters may have a certain degree of impact on agricultural activity due to the dense river networks in the TRB. The flood risk index was calculated from four indicators, including the multi-year average rainfall during flood season; altitude and terrain relief factors; river network distribution characteristics; and historical flood frequency [35]. Using the analytic hierarchy process [36], the weight coefficients of each factor were determined to be 0.239, 0.239, 0.433, and 0.089, respectively. Finally, the flood risk index of the TRB area was calculated.

Distance to Rural Settlements
The distance between the cropland fields and rural settlements is an important reference for farmers when selecting sites for land cultivation. The cultivation probability decreases as the distance increases, and vice versa. Therefore, the rural settlements present in 1911 were used to represent the rural settlements in the Republic of China period, and the rural settlements present in 1980 were used to represent settlements from 1949 to 1980.

Time Slice Selection and Cropland Area Calibration
We chose eight time slices, 1911, 1933, 1945, 1957, 1960, 1980, 2000, and 2010, based on data availability and representativeness to create cropland maps. Cropland area calibration mainly involves the unification and merging of administrative units from different historical periods, raster resampling, and unification of coordinate systems. In light of changes in administrative divisions, we calibrated the cropland area of each county using administrative maps from 2010. Adjacent counties that were involved in complex changes were merged and reallocated. Jintang county and Qingbaijiang district were merged into the Jinqing region, Jianyang city and Longquanyi district were merged into the Jianlong region, Dongxing and Shizhong districts were merged into the Neijiang region, Luxian and Longmatan were merged into the Longlu region, and the districts and counties in Zigong were merged into the Zigong region. All raster spatial covariates were projected onto WGS_1984_UTM_Zone_48 and resampled to produce a 30 m-resolution grid.

Cropland Data Reconciliation
There is a gap between the cropland archive area data and the cropland remote sensing area data. We used a reconciliation algorithm [37] to reconcile the archive data to the remote sensing data level. The reconciled cropland area from 2017 was derived from FROM-GLC10 data. The reconciliation algorithm is calculated as follows: where t1 and t2 are the current and previous years, respectively. A t1 C (k) is the cropland area from the archive in year t1 for county k; A t2 C (k) is the cropland area from the archive in year t2 for county k; A t1 R (k) is the reconciled cropland area from the archive in year t1 for county k; A t2 R (k) is the reconciled cropland area from the archive in year t2 for county k; and α(k) is the weight of the relative anomaly, as follows: The whole data preparation process is shown in a flowchart in Figure 2 and processed based on ArcGIS Pro 2.5. ( ) 2 t R Ak is the reconciled cropland area from the archive in year t2 for county k; and α(k) is the weight of the relative anomaly, as follows: The whole data preparation process is shown in a flowchart in Figure 2 and processed based on ArcGIS Pro 2.5.

Random Forest Classifier Algorithm
An RF algorithm is an ensemble machine learning and data mining approach composed of a set of decision trees. Created based on bootstrap resampling and a simple decision tree, the algorithm can be used for classification and regression and can achieve significant improvements in classification accuracy [38]. RF can handle high-dimensional data effectively, achieve high tolerance for noise and outliers, and reduce generalization error [22]. This approach is used to explore the occurrence probability associated with a specific land use or cover type in a grid pixel. The cropland distribution can be interpreted as binary data. Thus, contemporary cropland can be used as a binary dependent variable (cropland = 1, non-cropland = 0), and the driving factors that affect cropland can be used as independent variables. The RF algorithm is able to estimate variable importance via out-of-bag error measurements [39].

RF Model Training
A flowchart that describes model building and result analysis is presented in Figure  3. The natural and anthropogenic factors and cropland distribution map from FROM-GLC10 in 2017 were put into the RF model as independent variables and a dependent variable, respectively, for training. The RF model was executed using the "scikit-learn" package under the Python 3.6 environment in ArcGIS Pro 2.5. The RF model was used to model cropland probability. The training and testing sample sets were divided at an 8:2 ratio. Then, the predicted value of the random forest cropland gridding model after training was taken as the weight of the cropland grid allocation in the historical period. The number of decision trees (n_estimators) and the number of features (max_features) are the two critical parameters that affect RF classification accuracy. The grid search method was used to test each possibility in each parameter selection opportunity via loop traversal.

Random Forest Classifier Algorithm
An RF algorithm is an ensemble machine learning and data mining approach composed of a set of decision trees. Created based on bootstrap resampling and a simple decision tree, the algorithm can be used for classification and regression and can achieve significant improvements in classification accuracy [38]. RF can handle high-dimensional data effectively, achieve high tolerance for noise and outliers, and reduce generalization error [22]. This approach is used to explore the occurrence probability associated with a specific land use or cover type in a grid pixel. The cropland distribution can be interpreted as binary data. Thus, contemporary cropland can be used as a binary dependent variable (cropland = 1, non-cropland = 0), and the driving factors that affect cropland can be used as independent variables. The RF algorithm is able to estimate variable importance via out-of-bag error measurements [39].

RF Model Training
A flowchart that describes model building and result analysis is presented in Figure 3. The natural and anthropogenic factors and cropland distribution map from FROM-GLC10 in 2017 were put into the RF model as independent variables and a dependent variable, respectively, for training. The RF model was executed using the "scikit-learn" package under the Python 3.6 environment in ArcGIS Pro 2.5. The RF model was used to model cropland probability. The training and testing sample sets were divided at an 8:2 ratio. Then, the predicted value of the random forest cropland gridding model after training was taken as the weight of the cropland grid allocation in the historical period. The number of decision trees (n_estimators) and the number of features (max_features) are the two critical parameters that affect RF classification accuracy. The grid search method was used to test each possibility in each parameter selection opportunity via loop traversal. Then, the best parameters were used in the latter analysis as the model approached a steady state. In this study, n_estimators and max_features were set to 150 and 3, respectively. At runtime in each county, the model was evaluated between the prediction and the validation using the receiver operating characteristic (ROC) curve and the area under curve (AUC) value. The results indicate good predictive accuracy performance and an average AUC value of 0.92 (Supplementary Materials Figure S2). Then, the best parameters were used in the latter analysis as the model approached a steady state. In this study, n_estimators and max_features were set to 150 and 3, respectively. At runtime in each county, the model was evaluated between the prediction and the validation using the receiver operating characteristic (ROC) curve and the area under curve (AUC) value. The results indicate good predictive accuracy performance and an average AUC value of 0.92 (Supplementary Materials Figure S2).

Spatial Allocation of Cropland
We performed the following steps under a Python 3.6 environment in ArcGIS Pro 2.5. The reconciled cropland area was combined with the adjacent year cropland change and cropland probability layers generated by the RF model, and was allocated to pixels in each county iteratively. During cropland area increase years, the newly added cropland grid was determined according to the distribution probability from high to low. During cropland area decrease years, the cropland grid converted to non-cropland was determined according to the cropland probability from low to high. Several constraint factors were used to exclude impossible cropland areas. No cultivated land could be distributed in an area with population density of over 3000 people/km 2 , elevation over 2000 m, or waterbody grid. The model results for each year were aggregated into one dataset. Finally, we generated a spatio-temporal cropland dataset with 30 m resolution that described the past 100 years of the TRB.

Accuracy Assessment
We used a cropland cover map derived from satellite-based data from 1980 to perform the accuracy assessment. The comparison formula is as follows: where Differences(i) is the absolute difference between the modeled and satellite-based data in grid i, and Cm(i) and Cs(i) are the cropland fractions from grid i in this study and the satellite-based data, respectively. We also compared our results to three public global and regional datasets reported in previous studies ( Table 2). The three datasets were the (1) widely used HYDE 3.2 dataset [12], (2) Yang-dataset, a historical cropland dataset that covers 300 years of the

Spatial Allocation of Cropland
We performed the following steps under a Python 3.6 environment in ArcGIS Pro 2.5. The reconciled cropland area was combined with the adjacent year cropland change and cropland probability layers generated by the RF model, and was allocated to pixels in each county iteratively. During cropland area increase years, the newly added cropland grid was determined according to the distribution probability from high to low. During cropland area decrease years, the cropland grid converted to non-cropland was determined according to the cropland probability from low to high. Several constraint factors were used to exclude impossible cropland areas. No cultivated land could be distributed in an area with population density of over 3000 people/km 2 , elevation over 2000 m, or waterbody grid. The model results for each year were aggregated into one dataset. Finally, we generated a spatio-temporal cropland dataset with 30 m resolution that described the past 100 years of the TRB.

Accuracy Assessment
We used a cropland cover map derived from satellite-based data from 1980 to perform the accuracy assessment. The comparison formula is as follows: where Differences(i) is the absolute difference between the modeled and satellite-based data in grid i, and C m (i) and C s (i) are the cropland fractions from grid i in this study and the satellite-based data, respectively. We also compared our results to three public global and regional datasets reported in previous studies ( Table 2). The three datasets were the (1) widely used HYDE 3.2 dataset [12], (2) Yang-dataset, a historical cropland dataset that covers 300 years of the traditional Land 2021, 10, 1338 9 of 18 cultivated regions of China [18], and (3) ChinaCropland dataset, a cropland dataset from China that is based on multiple data sources [26]. Our dataset was up-scaled to a spatial resolution of 5 km for the grid-by-grid comparison. Notably, the cropland fraction was compared using the following formula: where Differences(i) is the absolute difference between the modeled data and the public dataset in grid i, and C m (i) and C p (i) are the cropland fractions of grid i in this study and the public dataset, respectively.

Changes in Cropland Area during 1911-2010
Overall, the cropland area in the TRB increases rapidly before 1980, and then decreases slowly after 1980 (Figure 4). The cropland area increases steadily from 1.13 × 10 4 km 2 in 1911 to 1.74 × 10 4 km 2 in 1957. This indicates an average annual growth rate of 1.17%. There is a sharp decrease in cropland area from 1957 to 1960, which indicates an an annual decrease rate of 1.81%. From 1960 to 1980, the amount of cropland increases rapidly from 1.64 × 10 4 km 2 , to a peak of 2.13 × 10 4 km 2 in 1980. This indicates an annual growth rate of 1.46%. After 1980, the cropland area decreases to 1.81 × 10 4 km 2 via a −0.51% average annual growth rate.
Land 2021, 10, x FOR PEER REVIEW 10 of 20 traditional cultivated regions of China [18], and (3) ChinaCropland dataset, a cropland dataset from China that is based on multiple data sources [26]. Our dataset was up-scaled to a spatial resolution of 5 km for the grid-by-grid comparison. Notably, the cropland fraction was compared using the following formula: where Differences(i) is the absolute difference between the modeled data and the public dataset in grid i, and Cm(i) and Cp(i) are the cropland fractions of grid i in this study and the public dataset, respectively.

Changes in Cropland Area during 1911-2010
Overall, the cropland area in the TRB increases rapidly before 1980, and then decreases slowly after 1980 ( Figure 4). The cropland area increases steadily from 1.13 × 10 4 km 2 in 1911 to 1.74 × 10 4 km 2 in 1957. This indicates an average annual growth rate of 1.17%. There is a sharp decrease in cropland area from 1957 to 1960, which indicates an an annual decrease rate of 1.81%. From 1960 to 1980, the amount of cropland increases rapidly from 1.64 × 10 4 km 2 , to a peak of 2.13 × 10 4 km 2 in 1980. This indicates an annual growth rate of 1.46%. After 1980, the cropland area decreases to 1.81 × 10 4 km 2 via a -0.51% average annual growth rate.

Spatial Distribution of Cropland Patterns
The cropland maps in the TRB from 1911 to 2010 (including 1911, 1933, 1945, 1960, 1980, 2000, and 2010) were reconstructed using the RF model ( Figure 5). In 1911, croplands are concentrated primarily in the upper reaches of the TRB with higher cropland fraction levels, particularly the Chengdu Plain area. In 1933, there is a cropland expansion in the middle of the basin (Figure 6). From then on, spatial agglomeration occurs in the upper reaches and spreads to the middle of the basin. In 1957, grids with cropland area fractions of 0-10%, more than 60%, and more than 80% account for 9.84%, 55.66%, and 25.18%,

Spatial Distribution of Cropland Patterns
The cropland maps in the TRB from 1911 to 2010 (including 1911, 1933, 1945, 1960, 1980, 2000, and 2010) were reconstructed using the RF model ( Figure 5). In 1911, croplands are concentrated primarily in the upper reaches of the TRB with higher cropland fraction levels, particularly the Chengdu Plain area. In 1933, there is a cropland expansion in the middle of the basin (Figure 6). From then on, spatial agglomeration occurs in the upper reaches and spreads to the middle of the basin. In 1957, grids with cropland area fractions of 0-10%, more than 60%, and more than 80% account for 9.84%, 55.66%, and 25.18%, respectively, of the total. During 1957-1960, the cropland area decreases sharply in most counties. After 1960, dramatic growth occurs in the middle reaches of the basin. Between 1960 and 1980, a sharp increase in the cropland area is observed in both the middle and lower reaches of the river basin. In the 1980s, the cropland area reaches its maximum level in all regions. At this point, 72.42% of grids have a cropland fraction of more than 60%. Between 1911 and 1980, there is a great expansion in the amount of intensive cropland from the upper reaches to the middle and lower reaches of the TRB, as the cropland coverage fraction increases continuously.   Grids with cropland area fractions of 0-10% account for 18.57% of the total and those with fractions of more than 60% represent 20.95% of the total (Figure 7). In 2000, the percentage of grids with cropland fractions of 0-10% increases to 6.57%, and only 64.68% have more than 60% cropland. The largest quantities of abandoned cropland are detected in the Chengdu Plain areas of the TRB, such as in the Jinqing and Jianlong areas. The spatial distribution of cropland in 2010 is similar to that in 2000. Only a slight decrease in the cropland fraction is noted in the middle and lower parts of the basin. Grids with cropland fractions of 0-10% and more than 60% decrease to 9.33% and 56.71% of the total, respectively. Our spatial cropland reconstruction results reveal that the most intensive cropland area over the past century is distributed in the upper reaches of the TRB basin, mainly in the Chengdu Plain area, because of its fertile soil, flat terrain, and developed socio-economic conditions. Land 2021, 10, x FOR PEER REVIEW 12 of 2 Grids with cropland area fractions of 0-10% account for 18.57% of the total and those with fractions of more than 60% represent 20.95% of the total (Figure 7). In 2000, the per centage of grids with cropland fractions of 0-10% increases to 6.57%, and only 64.68% have more than 60% cropland. The largest quantities of abandoned cropland are detected in the Chengdu Plain areas of the TRB, such as in the Jinqing and Jianlong areas. The spatial distribution of cropland in 2010 is similar to that in 2000. Only a slight decrease in the cropland fraction is noted in the middle and lower parts of the basin. Grids with cropland fractions of 0-10% and more than 60% decrease to 9.33% and 56.71% of the total respectively. Our spatial cropland reconstruction results reveal that the most intensive cropland area over the past century is distributed in the upper reaches of the TRB basin mainly in the Chengdu Plain area, because of its fertile soil, flat terrain, and developed socio-economic conditions.  1911,1933,1945,1957,1960,1980, 2000, and 2010.

Verification of Simulation Results
We used remotely sensed cropland data from 1980 to validate the newly reconstructed cropland dataset (Figure 8). Comparison of the cropland fractions produced by

Verification of Simulation Results
We used remotely sensed cropland data from 1980 to validate the newly reconstructed cropland dataset (Figure 8). Comparison of the cropland fractions produced by this study and GlobeLand30 (1 km resolution) shows that 63.05% of grids have differences of between −20% and 20%. The greater the difference rate, the lower the proportion of the grid. The difference is close to a Gaussian distribution (Figure 9). This indicates that our reconstructed result captures the actual cropland pattern. Further spatial details are generated under more restrictive factors (Supplementary Materials Figure S3). However, there remains misinterpretations of two places in the GlobeLand30, which affected the results comparison. Firstly, we can see there almost no reclamation in Zigong county area (Figure 8b), which is contrary to the fact. However, in reality, this area is an important agricultural area, and clear evidence could be found in the agricultural survey data. On the other hand, other land use categories may be misinterpreted as cropland in GlobeLand30, especially in the mountain area of Anyue, Renshou and Weiyuan Counties [40], which caused a higher reclamation in these areas. The other noticeable difference existed in the northern mountain areas, which indicated that the cropland fraction was overestimated in the mountainous area in the study.

Comparison to Public Cropland Datasets
To analyze the reasonableness of the reconstructed dataset further, three widely used regional and global datasets, ChinaCropland, the Yang-dataset, and HYDE 3.2, were se-

Comparison to Public Cropland Datasets
To analyze the reasonableness of the reconstructed dataset further, three widely used regional and global datasets, ChinaCropland, the Yang-dataset, and HYDE 3.2, were se-

Comparison to Public Cropland Datasets
To analyze the reasonableness of the reconstructed dataset further, three widely used regional and global datasets, ChinaCropland, the Yang-dataset, and HYDE 3.2, were selected for comparison with our results (Figure 10). For each year, the HYDE cropland is much smaller than the cropland area from our reconstructed result. This is consistent with the conclusion presented by Yu that the HYDE dataset underestimates the cropland area in the Sichuan Plain [26]. This difference may be caused by differences in data sources. The HYDE dataset aggregates statistical data at the country scale with no provincial-level inventory data, and allocates a cropland area directly based on demographic and socioeconomic parameters [12]. Since the Yang dataset only collects provincial cropland archives in Sichuan and Chongqing, it overestimates the total cropland area in the TRB. The cropland area in our study most closely matches that from the ChinaCropland dataset, because they are both derived from National inventory data after 1980. Both datasets show that the basin cropland area is maximized in 1980. Due to its consistent cropland trends, our cropland dataset was aggregated to 5 km resolution for spatial comparison with the ChinaCropland dataset at five time points (Figure 11). In 1911 and 1933, the ChinaCropland dataset indicates that cropland area is scarce in the upper reaches of the basin. However, this is inconsistent with fact. For comparison, the cropland area around Chengdu Plain in the upper reaches was considered in this study. The Chengdu Plain region cropland fractions identified via this study are significantly greater than those in the ChinaCropland database. For 1911-1933, 8% of grids exhibit a difference of more than 50% in the upper reaches of the basin. This indicates underestimation of the cropland area in the upper reaches. The ChinaCropland dataset shows that the cropland in the middle and southeastern portions of the basin expands significantly between 1933 and 2010. However, this is inconsistent with the actual changes in cropland area, and substantially different from the cropland spatial patterns in the study. The many extensive negative differences of more than 50% in the eastern portion of the basin indicate that ChinaCropland overestimates the cropland area in the region. Due to its consistent cropland trends, our cropland dataset was aggregated to 5 km resolution for spatial comparison with the ChinaCropland dataset at five time points ( Figure 11). In 1911 and 1933, the ChinaCropland dataset indicates that cropland area is scarce in the upper reaches of the basin. However, this is inconsistent with fact. For comparison, the cropland area around Chengdu Plain in the upper reaches was considered in this study. The Chengdu Plain region cropland fractions identified via this study are significantly greater than those in the ChinaCropland database. For 1911-1933, 8% of grids exhibit a difference of more than 50% in the upper reaches of the basin. This indicates underestimation of the cropland area in the upper reaches. The ChinaCropland dataset shows that the cropland in the middle and southeastern portions of the basin expands significantly between 1933 and 2010. However, this is inconsistent with the actual changes in cropland area, and substantially different from the cropland spatial patterns in the study. The many extensive negative differences of more than 50% in the eastern portion of the basin indicate that ChinaCropland overestimates the cropland area in the region. Land 2021, 10, x FOR PEER REVIEW 16 of 20 Figure 11. Comparison of the cropland spatial fractions presented by ChinaCropland and this study.

Policy Factors and Cropland Area Data
Cropland expansion consists of two phases. The first phase is the climax of cropland expansion after the application of the household responsibility system in the 1980s [41], and the other phase is mainly concentrated after 2010. Historical changes in the TRB cropland area might potentially be explained via association with social productive forces and government policies. In the early 1900s, Migration Reclamation Planning was used to guide refugees to reclaim land. The cropland area in the TRB increased, as did the amount of cropland throughout the Republic of China [42]. In the 1950s, the land reform movement greatly encouraged farmers to engage in agricultural production [43]. The establishment of agricultural cooperatives is conducive to the construction of new water conservation facilities and land improvements. The agricultural production capacity improved significantly during this stage. However, the great leap forward and three-year natural disasters during 1959-1961 delivered a two-punch blow to agricultural development in the TRB [44]. After 1980, urbanization drove cropland decline in most parts of the TRB ( Figure  5), in which plenty of cropland is occupied by non-agricultural construction land. The amount of cropland area in economically active regions has decreased during recent decades, due to rapid urban expansion. Since the implementation of the Grain for Green Project in the 1990s, some unsuitable cropland is converted into forestland [45]. The cropland area in the basin stabilized gradually after the Dynamic Balance of Total Farmland Area policy [46] was enacted in 1998 [47].

Policy Factors and Cropland Area Data
Cropland expansion consists of two phases. The first phase is the climax of cropland expansion after the application of the household responsibility system in the 1980s [41], and the other phase is mainly concentrated after 2010. Historical changes in the TRB cropland area might potentially be explained via association with social productive forces and government policies. In the early 1900s, Migration Reclamation Planning was used to guide refugees to reclaim land. The cropland area in the TRB increased, as did the amount of cropland throughout the Republic of China [42]. In the 1950s, the land reform movement greatly encouraged farmers to engage in agricultural production [43]. The establishment of agricultural cooperatives is conducive to the construction of new water conservation facilities and land improvements. The agricultural production capacity improved significantly during this stage. However, the great leap forward and three-year natural disasters during 1959-1961 delivered a two-punch blow to agricultural development in the TRB [44]. After 1980, urbanization drove cropland decline in most parts of the TRB (Figure 5), in which plenty of cropland is occupied by non-agricultural construction land. The amount of cropland area in economically active regions has decreased during recent decades, due to rapid urban expansion. Since the implementation of the Grain for Green Project in the 1990s, some unsuitable cropland is converted into forestland [45]. The cropland area in the basin stabilized gradually after the Dynamic Balance of Total Farmland Area policy [46] was enacted in 1998 [47].

Causes of Cropland Distribution
The RF machine learning method can handle complex correlations between dependent geographic variables effectively [48]. The feature importance represents the degree that the accuracy of the RF model is influenced when an independent variable is replaced with randomly distributed data. The higher the importance of the feature, the more important the independent variable, and vice versa. Nearby cropland, the elevation, and the climatic productive potential are the top three factors that influence the cropland distribution ( Figure 12). The feature importance of nearby cropland is 0.46, which is consistent with the principle that land is more likely to be reclaimed around existing cropland [16]. It also confirms that more cropland tends to intensify under the trend of agricultural development [49]. This is followed by the elevation and climatic productive potential data, where the feature importance values are 0.09 and 0.08, respectively. Natural factors such as the relief amplitude, soil erosion, and soil fertility are also of little influence. In addition, the quantity of ancillary data is vital to cropland distribution modeling. Our study adds more comprehensive variables that affect cropland distribution than traditional allocation methods [8,12]. These additional variables provide more complete and reliable data.

Causes of Cropland Distribution
The RF machine learning method can handle complex correlations between dependent geographic variables effectively [48]. The feature importance represents the degree that the accuracy of the RF model is influenced when an independent variable is replaced with randomly distributed data. The higher the importance of the feature, the more important the independent variable, and vice versa. Nearby cropland, the elevation, and the climatic productive potential are the top three factors that influence the cropland distribution ( Figure 12). The feature importance of nearby cropland is 0.46, which is consistent with the principle that land is more likely to be reclaimed around existing cropland [16]. It also confirms that more cropland tends to intensify under the trend of agricultural development [49]. This is followed by the elevation and climatic productive potential data, where the feature importance values are 0.09 and 0.08, respectively. Natural factors such as the relief amplitude, soil erosion, and soil fertility are also of little influence. In addition, the quantity of ancillary data is vital to cropland distribution modeling. Our study adds more comprehensive variables that affect cropland distribution than traditional allocation methods [8,12]. These additional variables provide more complete and reliable data.

Uncertainties and Prospects
First, the availability of temporally dynamic data limited the selection of influencing factors. Undoubtedly, the RF model performance may improve if more variables are accessible. Some of the anthropogenic variables were unavailable beyond recent decades, and remained unchanged in the study. Therefore, related driving factor datasets with high resolutions and long time-spans are urgently needed to support more precise modeling. In addition, urban areas without cropland are defined based on a population density threshold of 3000. However, historical areas with fewer than 3000 people might be similarly urban. The value used in the study may not be optimal for some regions. The altitude threshold is also set to the same value for mountainous area throughout the period, which may cause an overestimation of cropland in the history. As a result, it cannot be used as the basis for some local research. However, our flexible framework allows researchers to replace input data or revise model parameters to acquire the best results for their specific study areas. To improve the model further, one can seek to reduce the uncertainties of the most sensitive parameters, and include more process details and additional constraints as more data become available. Secondly, although inevitable misinterpretations exist in small local regions (Figure 8b), the remotely sensed data are still a reliable validation method. It also stresses the necessity of cropland reconstruction. In the future work, regional remotely sensed data with a higher precision can be used for verification.
This study can provide a method for the reconstruction of other land use types in a more extensive spatial range (continental to global scale), and at a longer time scale. In

Uncertainties and Prospects
First, the availability of temporally dynamic data limited the selection of influencing factors. Undoubtedly, the RF model performance may improve if more variables are accessible. Some of the anthropogenic variables were unavailable beyond recent decades, and remained unchanged in the study. Therefore, related driving factor datasets with high resolutions and long time-spans are urgently needed to support more precise modeling. In addition, urban areas without cropland are defined based on a population density threshold of 3000. However, historical areas with fewer than 3000 people might be similarly urban. The value used in the study may not be optimal for some regions. The altitude threshold is also set to the same value for mountainous area throughout the period, which may cause an overestimation of cropland in the history. As a result, it cannot be used as the basis for some local research. However, our flexible framework allows researchers to replace input data or revise model parameters to acquire the best results for their specific study areas. To improve the model further, one can seek to reduce the uncertainties of the most sensitive parameters, and include more process details and additional constraints as more data become available. Secondly, although inevitable misinterpretations exist in small local regions (Figure 8b), the remotely sensed data are still a reliable validation method. It also stresses the necessity of cropland reconstruction. In the future work, regional remotely sensed data with a higher precision can be used for verification.
This study can provide a method for the reconstruction of other land use types in a more extensive spatial range (continental to global scale), and at a longer time scale. In future work, we are challenged to incorporate more detailed indicators, which may improve the accuracy of mapping. There is also potential for improvement in the machine learning models. Even though RFs are used in this study, there are still many other machine learning methods, such as artificial networks, generalized linear models, deep learning models, and various ensemble learning algorithms. Notably, ensemble models have been shown to provide better modeling of land use change probabilities. They can improve the efficiencies of base classifiers that are used as ensemble classifiers and minimize bias, variance, and overfitting problems during the deforestation probability modeling process [50]. These simulation rules can be changed via the use of other programming languages that improve the simulation capabilities of the machine learning models [21].

Conclusions
This paper proposed an RF model workflow for historical cropland reconstruction. Based on calibrated historical cropland data, we developed a century-long historical TRB cropland dataset with 30 m resolution using a RF model. Our spatial cropland reconstruction result revealed that the most intensive cropland area over the past century was distributed in the upper reaches of the TRB. The model achieved precision superior to those of existing widely used public datasets and approximated actual cropland patterns. Nearby cropland, the elevation, and the climatic productive potential were the key factors that affected cropland distribution in the TRB. After accuracy assessments, the proposed method was validated as a promising way for the reconstruction of high-resolution cropland grids. This framework can be applied to larger areas with more complex conditions or other land use or cover changes via its flexible structure and parameters. In the future, we suggest that researchers reconstruct historical cropland patterns using ensemble machine learning algorithms, and compare the resulting data accuracy.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/land10121338/s1, Figure S1: Dynamics of the statistical cropland area in the Tuojiang River Basin from 1911 to 2010. Figure S2: Receiver operator characteristic curves and associated area under the curve for the random forest models for 20 counties. Figure

Data Availability Statement:
The data that support the findings of this study are available on request from the corresponding author.