Data-Driven Analysis of Forest–Climate Interactions in the Conterminous United States

: A predictive understanding of interactions between vegetation and climate has been a grand challenge in terrestrial ecology for over 200 years. Developed in recent decades, continental-scale monitoring of climate and forest dynamics enables quantitative examination of vegetation– climate relationships through a data-driven paradigm. Here, we apply a data-intensive approach to investigate forest–climate interactions across the conterminous USA. We apply multivariate statistical methods (stepwise regression, principal component analysis) including machine learning to infer signiﬁcant climatic drivers of standing forest basal area. We focus our analysis on the ecoregional scale. For most ecoregions analyzed, both stepwise regression and random forests indicate that factors related to precipitation are the most signiﬁcant predictors of forest basal area. In almost half of US ecoregions, precipitation of the coldest quarter is the single most important driver of basal area. The demonstrated data-driven approach may be used to inform forest-climate envelope modeling and the forecasting of large-scale forest dynamics under climate change scenarios. These results have important implications for climate, biodiversity, industrial forestry, and indigenous communities in a changing world.


Introduction
Understanding the interactions between vegetation and climate is a central question in ecology and biogeography [1,2]. The seminal work by von Humboldt and Bonpland [3] built a foundation for quantitative analysis of climatic factors controlling vegetation at a large scale. Climate-vegetation observations collected over the 19th century resulted in the Köppen climate classification, developed by Köppen in 1884Köppen in -1936. The Köppen classification was critically evaluated, developed, modified, and improved over the next century [7][8][9][10]. Several of its extensions are known as Köppen and Geiger [10,11] and Trewartha classifications [12,13]. The Köppen classification is also employed in the delineation of Bailey's US ecoregions [14]. Another bioclimatic scheme, called the Holdridge life zone system, was proposed by Holdridge in 1947 [15,16], initially for tropical regions and later extended to boreal and temperate zones [17]. These bioclimatic classification schemes served as a scientific basis and inspiration for modern developments, including climate envelope models (CEMs), species distribution models (SDMs), and dynamic global vegetation models (DGVMs), within Earth system models (ESMs), focusing on both the understanding of vegetation formations and the prediction of climate effects across various temporal and spatial scales [18,19].
Forested and atmospheric systems are complex adaptive systems, as they operate at multiple scales, preserve structural unity, and demonstrate self-organizational patterns [20][21][22][23]. The determination of macroscopic characteristics capable of capturing complex system dynamics is a challenging problem due to their multidimensional and multiscale nature [24][25][26][27]. While this complexity is broadly acknowledged, atmospheric systems in vegetation modeling are often characterized by macroscopic mean-field approximations [18]. In particular, average temperature and precipitation are widely used as climatic factors determining the distribution and abundance of plant communities, gradients in the vegetation continuum, and primary productivity [1]. In a seminal work, von Humboldt and Bonpland [3] used temperature as the primary factor and air pressure as the secondary factor to quantify altitude-related climatic controls on vegetation. The first Köppen classification scheme was based on the mean temperature, as precipitation was added as another factor in an updated classification [4]. Yet, mean temperature and precipitation patterns were insufficient for adequate climate-vegetation classification, and only the inclusion of interseasonal changes allowed differentiation between climatic zones [5,6]. The Holdrige system [15,16] employs three primary variables: (1) precipitation, (2) biotemperature or the annual mean temperature adjusted to the duration of the vegetation period, and (3) potential evapotranspiration ratio. Due to the complexity of atmospheric and vegetative systems, traditional reasoning often dictated the selection of mean temperature and precipitation as primary climatic parameters in large-scale climate-envelope and plant-species distribution models [19].
A data-intensive approach provides an opportunity to quantitatively evaluate relationships within and across complex ecological systems through data mining and data-driven modeling [28][29][30]. In particular, currently available large-scale spatially explicit climatic data sets and continental-level individual-based forest inventories allow a data-intensive analysis of climate-vegetation systems and data mining for possible connections between different quantitative characteristics of climate and vegetation [19,24,26]. In this work, we employ a data-intensive paradigm to examine the continental-scale effects of climate on forested ecosystems. Specifically, we investigate the relationship between forest basal area and climatic factors in the conterminous United States, for each ecological domain (Humid, Dry, and Humid Tropical Domains) and for each ecoregion as per Bailey's classification scheme [14,31,32]. Our goal is to rank climatic factors by their effect on forest basal areas across these ecoregions.
We employ a combination of two distinct approaches: (1) traditional multivariate statistical methods (correlation analysis, principal component analysis, and stepwise regression; Section 2.2.1) and (2) a recently developed machine learning approach (random forests; Section 2.2.2). Multiple regression models in combination with multivariate statistics is a commonly used statistical methodology for exploring climatic drivers of forested ecosystems. For example, the link between forest growth rate and temperature and precipitation patterns was explored in [33] using regression analysis. Stepwise linear regression was used in [34] to describe climate effects on a tree species. Various linear regressions were implemented to model stem basal area dynamics related to climatic factors in [35]. Random forests is an ensemble machine learning method [36], averaging over multiple decision trees to derive a robust classifier or regressor (Section 2.2.2). In this work, we perform an intercomparison between stepwise multiple regression and random forests through a data-intensive paradigm by applying both methods to infer forest-climate interactions. While the outcomes of both methods were similar, our intercomparison reveals that random forests is an efficient tool for data mining and modeling (see Section 3.3).

Data Mining
We mined two continental-scale databases: the USDA Forest Inventory and Analysis (FIA) database and the WorldClim climatic database (https://www.fia.fs.fed.us/ and https://www.worldclim.org/, last time accessed November 2020). These data sets cover three ecological domains and 36 ecological regions across continental USA (Appendix A). The WorldClim data set contains normals for different climatic variables computed with a spatial resolution of approximately 1 km since 1950 [37]. The FIA data set includes information on 211,949 forest plots observed over the 1968-2013 period. Figure A1 visualizes the spatial distribution of FIA forest plots across ecoregions. Table 1 summarizes the data used in our study. In this study, we characterize forested ecosystems using basal area (see [24,38,39] for details on forest basal area computation with the FIA data set). After mining the FIA data set, we obtained 409,868 observations of basal area across the conterminous USA. This includes different forested plots observed at irregular time intervals due to the sampling rotation rule implemented by the US Forest Service. To normalize the observations and to avoid duplicate records, we selected the data snapshot for the year 2000. Specifically, we used only one observation of every FIA forest inventory plot acquired near the year 2000. These forest basal area measurements were linked with 19 climatic variables of three types (temperature, precipitation, variability) using the WorldClim data set (see Table 1).

Data Analysis and Software
Our primary goal was to reveal large-scale climatic drivers of forest basal area dynamics. We employed stepwise linear regression and a machine learning approach known as random forests (Section 3.2, [36]) with climatic factors as the independent variables. In concert with stepwise regression, we applied traditional multivariate statistical techniques, correlation analysis (Section 3.1.1), and principal component analysis (Section 3.1.2). We employed multivariate traditional statistical and recent machine learning approaches to analyze the (a) conterminous USA, (b) Humid, Dry, and Humid Tropical Domains, and (c) 36 US ecoregions (Table 1 and Appendix A). We used the opensource R language for statistical computing (www.r-project.org), packages FactoMineR (http://factominer.free.fr/) and factoextra (https://rpkgs.datanovia.com/factoextra/index. html) to conduct and visualize the principal component analysis (PCA), and the popular Python library scitkit-learn (scikit-learn.org) to run the random forests analysis. Original software used in this study is freely available at the following GitHub repository: Olia8848/US-forest-dynamics-vs-climate.

Stepwise Linear Regression
We employed stepwise multiple regression to develop predictive models and to rank climatic factors with respect to their linear connection with forest basal area. We built stepwise regression models sequentially by adding climatic factors one-by-one and re-evaluating model parameters at every step.

Step 1. Simple linear regression
At the first step, we built a collection of simple linear regression models for all climatic variables in a focal ecoregion and chose the 'best' model with the highest coefficient of determination R 2 .
Let us consider an ecoregion with n forest inventory plots. Let y = (y 1 , y 2 , . . . , y n ) be the vector of basal area observations at n forest plot locations. We compute WorldClim values for all k = 19 climatic variables at these n locations. Let c j = (c j 1 , c j 2 , ..., c j n ) be the vector of values of a climatic characteristic c j , j = 1 . . . 19. We build a simple linear regression model of the following type: basal area ∼ single climatic characteristic for each c j , j = 1 . . . 19: where ε = (ε 1 , ε 2 , . . . , ε n ) is a random error vector, and α and β are the intercept and slope, respectively. We then select the 'best' model with the highest coefficient of determination R 2 among these 19 linear regression models. We consider the chosen model as the 'best' model per R 2 and the Euclidean distance of data points from the regression line (0 ≤ R 2 ≤ 1, and the statement R 2 = 1 means that all data points belong to the line). We determine the slope and intercept of the simple linear regression model (1), and, in this case, the coefficient of determination R 2 is equal to the square of the sample Pearson correlation coefficient, r y,c j . Therefore, at the first step, the 'best' model shows which climatic characteristic has the highest correlation with forest basal area in the given ecoregion.

Step 2. Regression with two climatic factors
At the second step, we construct 18 multiple regression models with two climatic factors, one of which is determined by the 'best' model at the previous step: basal area ∼ 2 climatic characteristics: where y = (y 1 , y 2 , . . . , y n ) is the basal area vector, c 1 is the climatic variable determined by the 'best' model selected in the 1st step, c 1 = (c 1 1 , c 1 2 , . . . , c 1 n ) is the vector of c 1 values, c j = (c j 1 , c j 2 , . . . , c j n ) is the vector of values of other climatic characteristics c j , j = 2 . . . 19, and ε = (ε 1 , ε 2 , . . . , ε n ) is the random error vector. The model parameters are: α-intercept term and β 1 , β 2 -regression coefficients.
As in the first step, we choose the 'best' model or that having the highest coefficient of determination R 2 among 18 multiple regression models. The coefficient of determination R 2 is equal to the square of the multiple correlation coefficient. This 'best' model defines two climatic variables that we will use at the next regression step.
Step k. Regression with k climatic factors Theoretically, we can continue adding new climatic variables until all 19 variables are incorporated in the linear regression model. In the presented work, we limited the process to the first five steps k = 5, as adding more than 5 climatic variables did not substantially improve the goodness of fit (measured by R 2 ). This was in agreement with the results of the principal component analysis of our data set (see Section 3.1.2). In general, at the k th step, we construct 19 − k + 1 multiple regression models, basal area ∼ k climatic characteristics: where y = (y 1 , y 2 , . . . , y n ) is the basal area vector, . . , c k n ) is the the k th climatic variable vector, and ε = (ε 1 , ε 2 , . . . , ε n ) is the random error vector. The model parameters are: α-intercept term and β 1 , β 2 , . . . , β kregression parameters.
In all 19 − k + 1 regression models (3), the first k − 1 climatic characteristics, c 1 , c 2 , . . . , c k−1 , were chosen at the previous k − 1 steps (as climatic characteristics with the maximal R 2 at their particular steps). At the current k step, we again determined the 'best' model (3) per the highest coefficient of determination R 2 from 19 − k + 1 models. This process determines the 'best' c k climatic factor. At the theoretical last step, k = 19, we will have only one variable left to include in the model.
As a form of model optimization, the statistical methodology described above allows us to reveal which climatic characteristics explain the majority of basal area variation. Implementing the abovementioned series of regression models, we obtain the ordered (by its influence on basal area) set of climatic factors important for basal area.

Random Forests
We apply the random forests algorithm [36] to determine the climatic factors driving basal area dynamics in the contiguous US. Random forests is a machine learning model based on the concept of regularization through bootstrap aggregation (i.e., bagging) with out-of-bag generalization error estimation. This allows it to average over several decision tree models learned by training on data subsets and obtain a robust ensemble model. The method has been widely used previously in investigating climate impact on vegetation [40][41][42][43][44][45][46][47]. For example, in [41], the authors investigate the effects of climate change on conifer tree species distributions. A similar analysis was performed in [40], where the impact of temperature-related bioclimatic factors on wine regions was investigated.
First, we perform data preparation. We correct for missing points and remove outliers. Next, we apply a low-variance filter. We remove climatic characteristics having low variance as noninformative. We also apply a high-correlation filter. We delete highly correlated climatic factors. For example, looking at the correlation plot for the conterminous USA ( Figure A1), we see that some climatic variables are highly correlated. Dropping one of two highly correlated variables, we filter data by reducing the number of climatic parameters.
We run the algorithm for both the contiguous USA and for each ecoregion using a random forests regressor. Using SciKit-Learn, we import the RandomForestRegressor class from the sklearn.ensemble module to instantiate a regressor object. For each tree, we use a default model fitness criterion of mean squared error (MSE); we will describe this parameter later) in our objective function. We select the number of trees to generate in our random forests using the parameter n¯estimators in RandomForestRegressor. Suppose that we start with three trees: n¯estimators = 3.
Denote y = (y 1 , y 2 , . . . , y n ) is the vector of basal area observations at n forest plot locations, and C 1 , C 2 , . . . , C 19 are the vectors of our 19 climatic characteristics (features). We fit the random forest regressor: basal area ∼ climatic factors: For each tree, we take a set of m (1 ≤ m ≤ 19, where m is a hyperparameter) randomly chosen (repetitions are allowed) climatic factors: (C j 1 , C j 2 , . . . , C j m ) For notational convenience, we use (C 1 , C 2 , . . . , C m ).
In order to make our explanations more transparent, suppose m = 2. Then, y = (y 1 , y 2 , . . . , y n ), We see that R contains n points, and it is a subset of R m+1 = R 3 . R is our bootstrap sample. Now, we can 'grow' a regression tree. Growing a regression tree within a random forest is equivalent to approximating the dependence y = f (C 1 , C 2 ) by a piecewise continuous function. We use a greedy (i.e., locally optimal) algorithm in order to approximate the relationship y = f (C 1 , C 2 ) (RandomForestRegressor uses a greedy algorithm by default).
The first step in the greedy algorithm (the first tree node) is the following. In the bootstrap sample R, we choose a feature C (C is either C 1 or C 2 , but we simply use notation C for convenience) and a split s of the values of C such that the error is minimized: where the constantŷ is an estimate of y i (y i is i-th coordinate of basal area y). Because we have a finite number of data points, we only need to consider finitely many splits s. At the first step, we split our data using the split s. This way, R is divided into two subsets: R 1 and R 2 (suppose R 1 has more points). By construction, this division produces the most separation between the data points. At the second step (consecutive tree nodes), we perform the same procedure for subset R 1 . We look for feature C and such a split s of R 1 , which together minimize the error (produce the most data separation) defined in our objective function: In practice, minimizing the errors (5), (6), and the consecutive ones, we build a stepwise function that approximates the basal area y. To build such a stepwise function, we cut R into smaller and smaller subsets (choosing feature C and its split s at every step). We perform cuts until we reach the case when we have only one point (or some small finite number of points) in a subset R k . This case corresponds to a leaf of the tree (a node with no offspring).
The random forests algorithm naturally splits data into training and testing sets through bagging and out-of-bag error estimation, similar to k-fold cross-validation. However, we explicitly constructed training and testing data sets to get a better notion of the generalization error. We used 5% of observations as the test set with the rest of the data serving as the training set. First, the model runs for a training set. Then, we used the testing set in order to evaluate model generalization (accuracy, overfitting). This gives a stronger overall picture of real-world model performance.
Random forests can optionally estimate the feature importance scores, determined by randomly permuting each feature and calculating the corresponding change to the out-ofbag error. The feature importance score of a climatic variable (feature) can be intuitively understood as a measure of its significance in explaining basal area. Consider a tree in a random forest. In an internal node of the tree, the algorithm chooses the feature that reduces the variance of basal area. In other words, it looks for the feature that decreases the impurity of the split. After we average a feature's importance scores over all of the trees, we obtain the final feature importance score. This is the percent reduction in classification accuracy compared to an out-of-bag error with all variables left intact [36].

Stepwise Regression and Multivariate Statistical Analysis
We applied multivariate statistical methods (correlation and principal component analyses), including stepwise regression, to reveal linear relationships between climatic variables and forest basal area. The correlation analysis and principal component analysis (PCA) allowed us to evaluate the intrinsic dimensionality of our data set and provided necessary ground information for stepwise regression modeling.

Correlation Analysis
We computed the Pearson correlation coefficient, r, between basal area and each of the 19 climatic variables for each ecoregion. In all ecoregions, basal area was weakly correlated with climatic variables, |r| < 0.3. At the same time, some climatic variables strongly covaried. A correlation matrix computed for the conterminous US is shown in Figure A2 in Appendix B. In most ecoregions, a large number of observations showed correlation coefficients statistically significant with p-values smaller than 0.05. Ecoregion 262 (California Dry Steppe Province) had only five observations; hence, we omitted results for this territory. Regarding Ecoregion 261 (California Dry Steppe Province), the only climatic factors showing significant correlation with basal area are Annual Precipitation (BIO12) and Precipitation of Wettest Month (BIO13).
Weak correlations between basal area and climatic factors are explained by high variation in local topography, disturbance, and land-use history within ecoregions. Natural and anthropogenic disturbances of varying magnitudes create complex patchwork mosaic patterns across forested landscapes [24,26]. This mosaic includes a large number of forest stands (i.e., patches) in different successional stages, leading to variation in forest basal area and biomass values. Forest inventory plots are designed to uniformly sample the landscape, and, therefore, they reflect some aspects of this patch mosaic [24,26,38]. In addition, many ecoregions include several forest and soil types, with different correlations between basal area and climatic factors for each. In the present analysis, we investigate correlations between climatic factors and forest basal area at the ecoregion level without controlling for the patchwork mosaic, forest type, or local topography. That low correlations are often shown between forest basal area and climatic factors is not surprising given the importance of local topography and land-use history, motivating future studies along these lines.

Principal Component Analysis
Principal component analysis (PCA) was performed on the set of 19 climatic factors together with forest basal area. The first five principal components explained most of the variation for all ecoregions (Figure 1), while the first three principal components retained much of this variation. On average, for all ecoregions, three principal components retained around 84% of the variation. On the other hand, climatic factors show low contributions or loadings of the principal components. Maximally contributing to Principal Component 1, climatic factors have on average contribution scores around 6%. In this way, principal axes describe much of the data variation, while each climatic factor does not contribute substantially to the primary principal component. PCA results are summarized in Table A1. We also performed principal component analysis (PCA) on the set of 19 climatic factors for every ecoregion without a basal area. The results were similar to the PCA with the basal area variable. In this case, principal components retain much of the data variation (about 90% for three main principal axes), while individual climatic variables show only a small contribution to each principal component (about 10%). We interpret this as a reflection of the gestalt of the climate system or that the whole is greater than the mean-field approximation of its parts (an analogy to the gestalt principle for ecosystems [14,31]).
Our PCA reveals that the system can be considered five-dimensional, while even a three-dimensional model would provide a relatively good approximation. At the same time, none of the climatic variables provide a large single contribution to the principal axes. This is consistent with the results of the correlation analysis, where none of the climatic factors demonstrated a strong correlation with the basal area. Theoretically, based on the PCA, five synthetic variables can be extracted as principal axes and employed in regression modeling, while unexplained variance can be considered as random noise. We considered this research direction; however, we decided to build stepwise regression models in five steps using the original climatic variables as independent variables. While the stepwise regression models using original climatic variables have less predictive power than models based on the synthetic variables determined by PCA, these models allow us to rank climatic factors using multiple correlation coefficients as a selection criterion.
The PCA results raise questions regarding the complexity of the climate-vegetation system. Our set of 19 climatic variables, including seasonal and interannual characteristics (Table 1), taken as a whole characterizes forest basal area well even in the linear modeling framework (Figure 1). At the same time, none of these variables can explain more than 10% of the variation in any given ecoregion, which are distinct with respect to vegetation and climate. Most of the traditional climate-vegetation theories and models are based on easily measurable climatic factors, most notably average temperature and precipitation. However, the PCA results indicate that none of these quantities are particularly important. Thus, there is a possibility that we often employ climatic variables that are convenient to measure and compute but that do not accurately represent essential climatic characteristics for forested ecosystems.

Stepwise Regression
We implemented the linear stepwise regression analysis for every ecoregion with a five-step depth. In line with the results of the correlation analysis, we observed low R 2 values (less than 15%) in 28 out of 35 ecoregions with a substantial number of observations for statistical analysis. Therefore, in these ecoregions, the stepwise regression did not reveal substantial linear relationships between forest basal area and climatic variables based on five regression steps. This is in the agreement with the PCA results, indicating that the first principal axes are not closely correlated with any particular climatic variables. Table A2 summarizes our stepwise regression results for 11 ecoregions, with a significantly high coefficient of determination after five steps (R 2 > 13%): 242, 261, M242, M261, M262, 313, 315, 321, 322, M331, and M341. For Ecoregion 242, we obtain that linear regression basal area ∼ Mean Temperature of Warmest Quarter gives R 2 = 12.3%, and multiple regression basal area ∼ BIO10, BIO11, BIO4, BIO6, BIO15 explains 13.1% of basal area variation. Using the division of 19 climatic characteristics into three groups (precipitation, temperature, variability; Table 1), we visualize the results in Figure 2. The map contains colored ecoregions, where the color depends on the climatic characteristic that maximally contributes to basal area variation in the given ecoregion. The gray areas are the locations where regressions do not capture essential basal area climate dependence (low R 2 ). It is noticeable that the majority of these 11 ecoregions are located in desert and arid provinces or in mountainous areas in the Pacific Northwest and Rocky Mountain regions (Table A2). It is not surprising that we observe the highest R 2 values in Ecoregion 322 (American Semidesert and Desert Province), where three of the five most significant climatic factors are precipitation related: Precipitation of Driest Month (BIO14), Precipitation of Wettest Month (BIO13), and Precipitation of Wettest Quarter (BIO16). These results clearly indicate that precipitation is a crucial factor for this territory.
The first line in Table A2 contains climatic factors maximally contributing to basal area variation. In Ecoregions 261, M242, M261, M262, 315, 321, 322, and M341, it is precipitation of the extreme (driest, wettest, coldest, or warmest) quarter or month. In Ecoregions 242 and M331, temperature of the warmest period is the most significant factor, while in Ecoregion 313 Annual Mean Temperature (BIO1), most contributes to basal area variation. There is a noticeable overlap between these 11 ecoregions and ecoregions where forest gap dynamics are not a primary driver of stand succession [39].

Random Forests
As an alternative approach to the linear multivariate and regression analysis, we applied the random forests (RF) algorithm [36] to infer climatic factors linked to forest basal area. The application of RF to the conterminous United States reveals that Precipitation of Coldest Quarter (BIO19) has the highest importance score, while Mean Temperature of Coldest Quarter (BIO11) has secondary importance (see Figure 3 for the top five climatic factors and their importance scores). Table A3 and Figure 4 summarize the results of our RF application at the ecoregion level. Table A3 contains the list of main climatic characteristics together with their feature importance scores for various ecoregions (Table A3).

Stepwise Linear Regression Versus Random Forests
In this work, we applied to the same general problem two different data-intensive methodologies: stepwise linear regression and random forests. Linear regression is able to capture only linear relationships between variables, while random forests can theoretically capture both linear and nonlinear relationships. The multilinear regression approach gave us nontrivial results only for several USA ecological regions 242, 261, M242, M261, M262, 313, 315, 321, 322, M331, and M341 (Table A2). Therefore, the intercomparison of these two methods is restricted to these ecoregions. In Ecoregions M242, M261, and 321, both the regression analysis and random forest reveals Precipitation of Coldest Quarter (BIO19) as the leading bioclimatic characteristics (Tables A2 and A3). With respect to Ecoregions 242, 261, M262, and 315, both methods suggest a key climatic factor belonging to the same group (temperature and precipitation groups), while the specific factors were different (Tables A2 and A3). Finally, the random forests algorithm and the regression approach give completely different key climatic characteristics in Ecoregions 313, 322, M331, and M341 (Tables A2 and A3).
From this analysis, we find that both methods give similar results. Random forests appears to be a more advanced and efficient method, yet it essentially remains a 'black-box' approach. Random forests appears to be an effective and powerful dimension reduction tool for large multidimensional data sets (dimension of 19 climatic characteristics in our case). As we see in Table A3, over all ecoregions, we have only 8 out of 19 climatic factors as the most essential factors. At the same time, the regression approach applied to the climatic characteristics did not substantially reduce dimensionality that can be observed in the comparison of R 2 values as PCA results (Table A2 and Figure 1). Overall, we conclude that random forests is the preferred method where possible.

Summary
In the presented research, we investigate how climatic changes may affect forest basal area in the USA. We apply various statistical and machine learning techniques in order to infer which bioclimatic factors mostly influence basal area in US forests. We built a series of multiple regression models based on linear regression. For some ecoregions, the models gave good results and we found the most influential bioclimatic factors for these areas. Precipitation-related factors turn out to be crucial for estimating basal area in the USA. However, there are many ecoregions where this linear analysis is inconclusive.
Principal component analysis gave us an interesting result: in all ecoregions, three main principal components described more than 80% of data variation, while each climatic factor's contribution to the main principal axis is low. We also used an advanced machine learning technique-the random forest algorithm. We generated a map of the US with ecological regions, which shows the climatic factor that most influences basal area. Comparing multiple regression with random forests, we see that the latter is a more suitable tool for data analysis of climate-forest relationships. However, random forests and multiple linear regression give similar results: precipitation-related factors are the most important climatic factors controlling basal area. In particular, Precipitation of Coldest Quarter (BIO19) is shown as a key factor in the majority of US ecoregions.

Future Research
We anticipate that the application of deep learning algorithmscombined with additional data mining may be productive. In particular, various neural-network-based dimensionality reduction techniques could be used. One could also apply backward feature elimination or forward feature selection for each ecoregion. This may be a good strategy, as we have relatively small data sets within an ecoregion. The other direction would be investigating data for nonlinearities and applying projection-based dimensionality reduction techniques.

Conclusions
We examined relationships between climate and forested ecosystems in the contiguous United States using a data-driven paradigm. Data mining of the fForest Inventory and Analysis and WorldClim data sets allowed us to link quantitative measurements of forested ecosystems and climate. We employed two data analysis approaches, multivariate statistics and machine learning to examine the multidimensional structure of the climate-forest complex system and to reveal linkages between variables. PCA reveals that our system can be considered five-dimensional; however, none of the climatic variables are tightly correlated with principal axes. Stepwise linear regression revealed leading climatic characteristics for 11 ecoregions, located in desert and semidesert provinces, the Pacific Northwest, and the southern Rocky Mountains. In these areas, the results obtained with the random forests algorithm and stepwise regression were similar. However, random forests appears to be a more versatile and powerful approach than traditional regression analysis. We anticipate that the application of more advanced deep learning methods with additional data mining may be useful for understanding and predicting forest-climate relationships. Acknowledgments: We would like to thank Andrey Sarantsev (University of Nevada) for his advice and help with manuscript editing and Adam Erickson (NASA) for help with the manuscript proofreading.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: