Chimera : A MultiTask Recurrent Convolutional Neural Network for Forest Classification and Structural Estimation

More consistent and current estimates of forest land cover type and forest structural metrics are needed to guide national policies on forest management, carbon sequestration, and ecosystem health. In recent years, the increased availability of high-resolution (<30 m) imagery and advancements in machine learning algorithms have opened up a new opportunity to fuse multiple datasets of varying spatial, spectral, and temporal resolutions. Here, we present a new model, based on a deep learning architecture, that performs both classification and regression concurrently, thereby consolidating what was previously several independent tasks and models into one stream. The model, a multi-task recurrent convolutional neural network that we call the Chimera, integrates varying resolution, freely available aerial and satellite imagery, as well as relevant environmental factors (e.g., climate, terrain) to simultaneously classify five forest cover types (‘conifer’, ‘deciduous’, ‘mixed’, ‘dead’, ‘none’ (non-forest)) and to estimate four continuous forest structure metrics (above ground biomass, quadratic mean diameter, basal area, canopy cover). We demonstrate the performance of our approach by training an ensemble of Chimera models on 9967 georeferenced (true locations) Forest Inventory and Analysis field plots from the USDA Forest Service within California and Nevada. Classification diagnostics for the Chimera ensemble on an independent test set produces an overall average precision, recall, and F1-score of 0.92, 0.92, and 0.92. Class-wise F1-scores were high for ‘none’ (0.99) and ‘conifer’ (0.85) cover classes, and moderate for the ‘mixed’ (0.74) class samples. This demonstrates a strong ability to discriminate locations with and without trees. Regression diagnostics on the test set indicate very high accuracy for ensembled estimates of above ground biomass (R2 = 0.84, RMSE = 37.28 Mg/ha), quadratic mean diameter (R2 = 0.81, RMSE = 3.74 inches), basal area (R2 = 0.87, RMSE = 25.88 ft2/ac), and canopy cover (R2 = 0.89, RMSE = 8.01 percent). Comparative analysis of the Chimera ensemble versus support vector machine and random forest approaches demonstrates increased performance over both methods. Future implementations of the Chimera ensemble on a distributed computing platform could provide continuous, annual estimates of forest structure for other forested landscapes at regional or national scales.


Introduction
Accurate estimates of above ground biomass (AGB) of vegetation in forests are fundamental for quantifying and monitoring forest conditions and trends. AGB and other forest structure metrics and ensembling. Multi-task learning is a method in which a neural network is trained to perform two or more similar tasks, such as identifying road characteristics while predicting steering direction for an automated driving system [43]. This shared task method of learning has been demonstrated to increase predictive ability for each individual task [44]. Ensembling, the technique of using multiple models together to make a single prediction, has also been demonstrated to increase predictive ability of many machine learning approaches, by reducing the bias of any single model prediction [45]. These technical advancements and characteristics present an opportunity to apply them in an ensemble of multi-task RCNNs as a case study, and test the ability of such an architecture to classify forest cover and predict AGB and other forest structure metrics against more conventional approaches.
In this study, we introduce an ensemble of individual RCNN models called "Chimera" (together called the Chimera ensemble, or CE) to perform a data fusion of high resolution NAIP imagery, moderate resolution time-varying Landsat, and ancillary climate and terrain (ANC) variables, and to build prediction tiles which can then be reassembled for spatially explicit mapping of larger areas. We present performance metrics based on field plots collected from the USDA forest service forest inventory and analysis (FIA) program dataset.

Objectives
The main objective of this study is to measure the performance of a novel multi-task RCNN architecture which simultaneously classifies forest and land cover ('conifer', 'deciduous', 'mixed', 'dead', or 'none') and estimates forest structure metrics (AGB, quadratic mean diameter (QMD), basal area, and canopy cover). Our performance experiments will involve: (1) comparing the different combinations of input datasets, specifically, the impact of including high resolution (1-m) imagery; (2) comparing the RCNN architecture with more commonly used random forest (RF) and support vector machine (SVM) models, with similar inputs; (3) assessing the potential improvement with ensembling of RCNN models compared to an individual best fit model. It is hoped that these objectives ( Figure 1) will better inform the application of deep learning approaches in forest structure and type estimation, and present a potential architecture upon which future modeling efforts might improve.

Region of Analysis
California and Nevada were chosen as the region of analysis for data sampling and modeling ( Figure 2). The California-Nevada region embodies a wide gradient of elevation from −86 m in Death Valley to 4421 m at the summit of Mt. Whitney. In California, about 40% of the area (approx. 13.35 million hectares) is covered by forests [46], while Nevada contains the largest national forest in the lower 48 states, the 6.3-million acre Humboldt-Toiyabe National Forest [47]. Both states contain a wide variety of ecosystems including alpine, montane and subalpine forests, coastal forests, mixed conifer-deciduous forests, chaparral, pinyon-juniper woodlands, and desert scrub. This variety of forest ecotypes make these states an excellent case study for forest structure estimation over an extensive, heterogeneous region. Case study region of California and Nevada for training and testing the Chimera model. The region of interest provides high variability of forested and non-forested cover types. Image variability creates a range of challenges for estimation of forest structure metrics. Training (blue) and testing (red) Forest Inventory and Analysis (FIA) plot accurate estimates of above ground biomass (AGB)'s are displayed here as scaled dots (minimum = 0 Mg/ha, maximum = 3240.2 Mg/ha) to demonstrate range of spatial variability within the study region (locations here are approximate to maintain FIA spatial confidentiality).

Predictor Variables-Optical Data
Our input variables for the RCNN architecture are formed from a combination of four different, open-source remote sensing datasets available in the Google Earth Engine (GEE) [48]. These datasets (NAIP imagery, LANDSAT imagery, PRISM climate data, and USGS NED terrain data) are combined across different spatial and temporal resolutions to form individual tensors (i.e., multi-dimensional array), each approximating a single 120 × 120-m FIA field measured sample for each of our predictor variables ( Figure A3). We chose the 120 × 120-m size to maximize the inclusion of the sampled area while allowing for any potential mismatches in either imagery georeferencing, or FIA plot location [49]. Imagery Program) USDA NAIP images are mostly cloud-free image tiles collected across the continental United States during the summer months by aerial fly-over in frequencies ranging from 5 years (in some rural areas) to 1 year [20]. NAIP image tiles are sized at 3.75 × 3.75 arc-minutes (approx. 6 km × 6 km) in GEE, with either three (red, green, blue; RGB) or four (red, green, blue, and near-infrared starting in 2007; RGBN) spectral bands, with a target of 1 m or smaller pixel size. NAIP is available across the entire time range of FIA inventory samples that we considered (2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017). In order to limit the cases of non-representative samples, locations without NAIP imagery available within a three year time window (one year forward, two years back from the date of FIA sampling) were not included in the training dataset. The image collected closest to the date of FIA field data collection was chosen and resampled to 1 meter resolution either through mean aggregation, or nearest neighbor resampling. Only RGB bands were used due to lack of uniform availability of RGBN images across the two states in our case study region.

Landsat 7
Landsat 7 is one of a series of satellites providing open-source imagery for scientific purposes through a joint effort by the USGS and NASA [15]. Landsat 7 imagery is collected at 16-day intervals through the entire period of interest for this study. For training purposes, we produced a 4 × 4-pixel image at the native 30-m resolution of all but the thermal bands, resampling the panchromatic band to 30 m from 15 m via mean aggregation, and utilizing the quality assessment (QA) band codes to mask noisy pixels and all but the lowest confidence cloudy pixels. This was accomplished using only tier 1, top-of-atmosphere (TOA) reflectance data. As a result of gaps in surface reflectance data and reduced efficacy of data correction over arid/snow covered regions in GEE, we used TOA to maintain the highest possible number of continuous temporal samples [50]. We gathered a three year time series (one year forward, two years back from FIA sampling) and produced a monthly average for each of the 12 calendar months of reflectance values with only high quality pixels to represent a single year. This method provided the best chance at capturing the distinct patterns produced by deciduous and mixed stands, as well as dead stands through an annual cycle.

Predictor Variables-Ancillary Data
Kane et al. [51] were able to demonstrate that forest structure patterns such as canopy cover percentage, could be predicted based on water balance and topographic position information. This follows the intuition that specific forest species with unique structural attributes can be delimited based on abiotic factors [52]. Therefore, we included ancillary (ANC) climate and terrain data to provide the CE with additional information for land cover classification and forest structure estimation.

PRISM (Parameter-Elevation Regressions on Independent Slopes Model)
In order to provide long term climate trend information to the CE for each sample, we utilized PRISM from Oregon State University [53]. PRISM aggregates continuous data from weather stations across the United States and uses sophisticated interpolation methods to produce gridded data for the entire United States on multiple time scales. We utilize AN81m, a monthly curated and quality-controlled dataset which provides dewpoint, vapor pressure deficit, temperature, and precipitation data at a coarse 2.5 arc-minute (approx. 4 km) scale. This dataset was resampled to a single 120-m pixel for each sample using the nearest neighbor method, and a mean/standard deviation across the 30-year period prior to the FIA inventory year associated with each training or test sample.

USGS NED (National Elevation Dataset)
The National Elevation Dataset (NED) provides gridded elevation at a 1/3 arc-second (approx. 10-15 m) resolution for the entire continental U.S. [54]. NED was produced from multiple individual datasets as a national aggregation and has been shown to have an RMSE ≈ 1.63 m from truth in vertical accuracy across the continental U.S. [55]. Terrain features including elevation, slope, and aspect, have been shown to be correlated with forest structural attributes [56,57]. We use elevation, slope, and aspect (which we calculate and apply a cosine transform on the fly directly from the NED) in the training process, represented as a single, 3-band, 120 m mean aggregated pixel for each sample.

Response Variables
Training Data Targets: FIA Sampling and Database Parameter Usage Continuous response variables were generated from the forested and non-forested 27,966 FIA Phase 2 plots inventoried from 2005-2017 within California and Nevada state boundaries. The FIA program applies a nationally consistent quasi-systematic sampling protocol [58] with a nominal sampling intensity of approximately one sample location per 2400 ha, to provide a source of information about the extent, condition, status, and trends of forests across the United States [8,59]. In Phase 2, field crews visited permanent plots that contain a forested land use (areas that have at least 10% tree canopy cover, are at least 0.4 ha in size, and at least 36.6 m wide) and collect information on individual trees and site variables [60]. In each plot, trees were sampled at the subplot (four 24 foot (7.31 m) diameter circles) or macroplot (58.9 foot (17.95 m)) level in a consistent orientation. Only larger trees are measured outside the subplots in the case of macroplot measurements by FIA field crews. We considered all live trees above 1 inch (2.54 cm) in diameter when calculating our response variables. Dry biomass components for each live tree were calculated via the national standard method (the component ratio method (CRM)) detailed in the current FIA handbook [60]. We summed the total above ground components of each tree together to represent the dry biomass of a sample plot, accounting for the variations of the CRM for woodland species, timber species, and saplings. These values were then converted to a density value using recorded plot geometry information. Basal area metrics were taken directly from FIA field measurements within the database. Canopy cover measurements were taken from the live canopy cover attribute of the database, which can be measured differently (in-situ or remotely sensed image interpretation) depending on the region and year of survey [61]. Plot-level QMD was given by the equation: where dbh is the diameter at breast height of each tree, i, for all measured live trees, i = 1, ..., n on the plot. For the land cover classification task of this study, we classified plots based on an 80% tree count threshold as 'conifer' or 'deciduous'. If there was not 80% trees of conifer or deciduous type, plots were labeled as 'mixed'. A label of 'dead' was given to plots where 80% of the trees were recorded as standing dead. Plots where no trees were present were labeled as 'none'. It should be noted that these classification labels are different from FIA definitions, which are land-use labels and not land cover labels. We additionally utilized FIA non-forested plots (20,660 plots of the 27,966). These plots provide a valuable baseline for image feature recognition of urban areas, bare land, and water. However, since these plots are non-forested by definition, these locations are not-visited and have no tree attributes recorded but may still have tree cover (e.g., urban and residential areas) [61]. Following the methods of Hogland et al. [35], who visually inspected plot locations against the corresponding reference NAIP image, we manually inspected images corresponding to these non-visited plots. For locations where there were clearly no trees visible in the imagery, we labeled them as 'none' and attributed them with AGB, QMD, basal area, and canopy cover values of zero.
After all manual inspections of plot-to-image correspondence were completed, plots were further filtered to those containing all four corresponding predictor variable data, leaving n = 9967 samples ( Table 1). The data were then randomly subdivided into training, validation, and test samples. We first set aside 500 plots for testing (5% of the total data) and then used the remaining data at a 4:1 ratio of training to validation data for fitting each individual Chimera model (7724 training samples, 1743 withheld validation samples, and 500 test samples). Training, validation, and test samples followed similar distributions for each response metric, and across all k-fold cross-validation subsets as discussed in the Model Ensembling section below ( Figure A1). Table 1. Summary of forest inventory and analysis (FIA) plot data: (a) distribution of FIA plot forest type classes for all plots used in training/validation and testing; (b) bulk statistics of FIA plot forest structure metrics (above ground biomass (AGB), quadratic mean diameter (QMD), basal area, and canopy cover) for all forested plots used in training (n = 3237) and testing (n = 170).

Chimera RCNN Architecture
We constructed a deep learning model architecture called Chimera, for estimating forest structure using the Keras (v2.1.3) package [62] for building TensorFlow (v1.3.0) CNNs in Python. We implemented a multi-task learning (MTL) neural network architecture that merges climate, terrain, Landsat, and NAIP imagery datasets as predictors to perform two tasks: (1) classification of land cover type ('none' (non-forested), 'deciduous', 'conifer', 'mixed', 'dead') and (2) regression of four continuous forest metrics (AGB, quadratic mean diameter (QMD), basal area, and canopy cover) simultaneously [44] ( Figure 3). Multi-task learning is a technique that has been demonstrated to improve accuracy by allowing convolutions to fit parameters more efficiently through understanding "easier" tasks [43,63,64]. This follows the intuition from another popular model called "you only look once" (YOLO) [65], that performs both object detection and classification simultaneously. In the YOLO architecture, a single trained CNN takes an image as an input and then performs discrete object identification (classification task) and draws a box around the object, thus performing an estimate of the box width, height, and center coordinates (regression tasks). Our architecture, similarly performs two tasks, but also allows the neural network to learn multiple differing input feature parameters simultaneously. To perform the classification and regression tasks, Chimera minimizes two loss functions in a single model; a cross-entropy loss function (H): where y i is the ground truth label of ith training sample instance andŷ i is the ith model prediction, and the L2 (mean squared error; MSE) loss function: Chimera's architecture utilized a five block DenseNet structure to serve as the shared layer that will identify features within a three channel NAIP image. The DenseNet architecture [66] allows improved information flow between sequential convolutional composite layers F j composed of a batch normalization, rectified linear units (ReLU), convolution, dropout, and pooling. We denoted the output of the j-th layer as x j by direct concatenation of the feature maps produced from proceeding layers: where [x 0 , x j , . . . , x (j−1) ] refers to the concatenation of the j previous layer outputs. This improved information flow additionally allows for higher performance, reduced parameter space, and faster model fitting compared to other well known architectures (e.g., ResNet, GoogLeNet, AlexNet) [32,37,67]. A three layer convolutional long short-term memory (LSTM) block was applied to time sequenced Landsat scenes to summarize temporal changes that would be conducive to identifying deciduous, conifer, and mixed forest stands [38,68,69]. The output of the LSTM layer was concatenated to the dense layer of the DenseNet for classification.
We concatenated the ancillary climate and DEM data to the output of the final DenseNet layer and LSTM layer. We then forked the concatenated layer in two for each task and run each fork through four fully connected layers with five outputs for the classification task and four outputs for the regression task. This is an example of a hard parameter sharing architecture, in which the same final concatenated layer is used for both tasks [44]. This allows the classification loss to influence regression parameters, and the regression loss to influence the classification parameters. For example, where classification was 'none', forest structure metrics were low to zero, indicating a strong relationship between the two tasks.
NAIP and Landsat image samples were augmented randomly using image rotation and mirroring during model training to increase robustness of the RCNN and to allow varying forest image feature geometries to represent the same forest structure metric values across different training epochs. Furthermore, to compensate for training sample class count imbalance, we applied the Eigen and Fergus [70] class weighting method to adjust the final cross-entropy loss layer.
Each individual Chimera RCNN was trained for 90 epochs at an 80 sample batch size using a stepped learning rate Adam optimizer [71] (learning rate stepped by a factor of 0.1 at every 30 epochs) to increase speed of model convergence during gradient descent. Average training time was 1h30m on a Microsoft Azure NC6 instance using a Intel (R) Xeon (R) E5-2690v3@2.80GHz CPU with 56 GB of RAM and a NVIDIA Tesla K80 GPU with 24 GB of GDDR5 memory.
Additionally, to understand if contributions of each input dataset improve performance, we fit seven different independent models with the complete training dataset, using all possible combinations of input data with the Chimera architecture (ANC only; NAIP only; NAIP and ANC; Landsat only; Landsat and ANC; NAIP and Landsat; all inputs).

Chimera Comparisons with RF and SVM
Using the same training and test data sets, we built inputs for random forest and support vector machine models with the Scikit-learn package implemented in Python to compare results [72]. Data were subsetted to accomplish each regression task (four models), and the classification task separately (one model), resulting in five models for each algorithm. Two forms of data aggregation were tested: one used a single Landsat pixel time series of aggregated arrays resulting in 84 features, while the other used the same full 4 × 4 Landsat image time series as the Chimera ensemble, resulting in 1344 features. In both cases, textural extraction of NAIP was performed identically: one layer of 4 × 4 standard deviation gray level (SDGL) and one layer of 4 × 4 mean-downsampled images were concatenated, resulting in 96 features. Climate and DEM inputs were kept in the same shape and all data were normalized through the same process as the Chimera Ensemble. The final shapes of the resultant flattened vectors were 197 and 1457 features. The RF models were parameterized with n estimators = 150 and the SVM models were parameterized with c = 250, γ = 0.5 after employing a grid search methodology and finding the best output RMSE values across all tasks.

Model Ensembling
We used a k-fold cross-validation methodology to divide the data into a 4:1 ratio training versus validation samples. Five separate Chimera models (k = 5) were fit with each fold of the training data. We performed diagnostics on each model to ensure all had converged on their loss functions and accuracy metrics (Figure 4). A super learner [45] approach was used to ensemble the five models weighted by the design matrix A in the form: whereŶ stack is a matrix of stacked column vectorsŶ k , for each fitted model k. We solved for A using a multiple linear regression as the meta-learning algorithm and applied these final weights to combine models for prediction. The stacking methodology has been demonstrated to increase prediction accuracy metrics over a single model alone, or a simple unweighted averaging of models [73].

Model Diagnostics and Input Experiments
Both the batch-level classification and regression loss functions converged for training and validation data for all k-folds within 60 epochs for a total of 1,353,593 parameters. Low differences in loss values between training and validation suggested that no individual Chimera model was overfit ( Figure 5). Experiments of differing Chimera model input combinations (Table 2) demonstrated that all data types on their own (ANC only, NAIP only, Landsat only), did not perform as well as models with two input types. NAIP + ANC data outperformed Landsat models in classification (+0.01 overall F1-score) and similarly to Landsat + ANC in regression performance (R 2 = 0.76). The full model case (NAIP + Landsat + ANC) had the best overall performance (overall classification F1-score = 0.90; overall regression R 2 = 0.81, normalized RMSE = 0.076) for the combined task of classification and regression simultaneously compared to each input combination considered separately, however the imagery-only model (NAIP + Landsat) performed slightly better in land use cover plot classification (overall classification F1-score = 0.92; overall regression R 2 = 0.78, normalized RMSE = 0.080).

Classification Task Diagnostics
We used accuracy metrics of precision, recall, and F1-scores to assess Chimera's classification abilities. Precision refers to number of true positives divided by total true and false positives, while recall refers to true positives divided by the sum of true positives and false negatives. The F1-Score is a combination of the precision and recall, defined by, (2 * precision * recall)/(precision + recall). Classification diagnostics on the test set reported an overall average precision, recall, and F1-score of 0.92, 0.92, and 0.92, respectively (Table 3) (Figure 6). Class-wise F1-scores were high for 'none' (0.99) and 'conifer' (0.86) classes. This indicated that the ensemble model was able to distinguish between forested and non-forested plots. The CE had difficulty accurately distinguishing the 'deciduous' class and 'mixed' class, with moderate F1-scores of 0.6 and 0.74 respectively. This difficulty emerged as a result of class label imbalance in the training data. The 'dead' class samples were insufficiently assessed due to the low sample number within the test samples (n = 1). Due to this limited number of dead samples in the test set, we assessed each individual k-fold Chimera model in CE for their predictive ability on their respective k-fold validation subset. F1-scores for these k-fold validations were as follows: (k 1 = 0.222, k 2 = 0.182, k 3 = 0.0, k 4 = 0.0, k 5 = 0.133), demonstrating a weak predictive ability for the dead class. Comparisons to SVM and RF reported higher levels of performance from the CE in all classes (except 'dead' class), with the largest difference occurring with the CE versus the SVM in classifying the 'deciduous' (F1-score difference of 0.26) and 'mixed classes' (F1-score difference of 0.13) ( Table 3). Table 3. Classification accuracy metrics for the best individual Chimera model (BIM), Chimera ensemble (CE), random forest (RF), and support vector machine (SVM) used in this study to predict forest type from FIA plot data. All statistics reported here are for the training-withheld test set of 500 plots. Precision refers to number of true positives divided by total true and false positives, while Recall refers to true positives divided by the sum of true positives and false negatives. The F1-Score is two times the precision times the recall, divided by the sum of precision and recall. Support refers to the number of samples for each class. Best scores (and best score ties) are denoted in bold.

Class
Precision Recall F1-Score Support

Regression Task Diagnostics
The CE presented a strong ability to predict forest structure metrics, with all metrics reporting R 2 's above 0.8 on the independent test data (n = 500) (Figure 7). Scores for the single best individual model (BIM) and ensemble were similar, but with the ensemble consistently outperforming the BIM at all forest structural metrics. Canopy cover was the best performing metric (R 2 = 0.89, RMSE = 8.01 percent), followed by basal area (R 2 = 0.87, RMSE = 25.88 ft 2 /ac), AGB (R 2 = 0.84, RMSE = 37.28 Mg/ha), and finally QMD being the lowest (R 2 = 0.81, RMSE = 3.74 inches) ( Table 4). This was expected, as elements of forest structure such as QMD underneath the canopy are hidden and can be difficult to estimate based on optically sensed spatial feature detection alone, whereas canopy cover features can often be recognized based on imagery alone.
Similarly, in spatially explicit prediction, plots demonstrated the ability for the model to be able to distinguish dense forest canopy cover and provided representative estimates of large and small diameter tree stands reflected with low QMD and basal area estimates (Figures 8 and 9)

Model Comparison Experiment
We explored two versions of SVM and RF models with a single-pixel seven-channel representation of Landsat data, and a 4 pixel × 4 pixel × seven-channel input. We found the single pixel variation resulted in the highest accuracy. RF outperformed SVM in all response variable estimates except basal area. The largest improvements by the Chimera ensemble were observed for AGB and basal area with a 0.08 increase in R 2 's respectively when compared to RF AGB and SVM basal area. Both the BIM k-fold and CE were superior to SVM and RF in all classification tasks except dead (Table 3). Highest classification difference was between the 'deciduous' type, with CE having a 0.14 increase in F-score compared to RF. CE also outperformed RF and SVM in regression in all forest structure metrics with the biggest improvements in basal area and AGB (Table 4). Post-hoc comparisons of saturation levels for estimated AGB values were modeled using a saturating exponential function [74,75] a ), to identify the data saturation asymptote (c 1 ) for AGB prediction. This comparison reported Chimera reaching AGB data saturation much later (416.7 Mg/ha) than RF (386.3 Mg/ha) and SVM (245.0 Mg/ha), which agrees with its ability to perform better at estimating AGB ( Figure A2).

Discussion
This research describes an effort to measure the ability of a novel deep learning approach for estimation of forest structure and classification simultaneously based on a fusion of multi-resolution and temporal data. There has been extensive research on estimating forest structure given the development of new modeling methodologies and the availability of free remote sensing data [6,7,76]. Implementations include utilization of various combinations of models and datasets to address forest structure estimates and classification [35,59,[77][78][79][80]. Blackard et al. [11], performed a national scale effort to model AGB and forest classification through a tree-based algorithm approach with impressive results for AGB estimation utilizing three eight-day composited MODIS images, classified land cover information, and climate/topographic data. Three issues were cited in the study: (1) over-prediction of areas of small AGB and under-prediction of areas of large AGB due to reflectance saturation, (2) not capturing the full range in variability due to pixel size mismatch (250-m MODIS derived data) to FIA plot size, and (3) error in forest/non-forest classification due to FIA plot-based estimates pertaining to forest land use, which does not necessarily have trees on them, while satellite image-based estimates portraying forest land cover. Wilson et al. [78] made another step forward in this effort utilizing a phenological gradient nearest neighbor approach which integrating MODIS imagery across multiple time intervals, rather than a single aggregated time-step, and ancillary environmental data that includes topography and climate, to initially estimate forest basal area at the species level and later include AGB [8]. Our research examines a method to further include high resolution NAIP imagery in the suite of predictor variables, and utilize a unique RCNN architecture we call Chimera.
Previous work in utilizing more moderate spatial resolutions (30-m) remote sensing for estimating AGB had found that applications of machine learning methodologies more accurately estimated AGB than classical statistical methods, with an R 2 ranging from 0.54-0.78 [81][82][83]. Zolkos et al. [6] states that, although there are no explicit required accuracy levels for AGB estimation, a map for global AGB should not exceed errors of 50 Mg/ha at 1-ha resolution. This de facto standard has led to an increased interest in more advanced sensors to improve AGB estimation. For example, Zhang et al. [84] utilized Landsat and the Geoscience Laser Altimeter System (GLAS) space-borne platform in a large-scale modeling effort to produce gridded leaf-area index maps and canopy height maps of California, which derived AGB at a 30-m resolution across the entire state. Modeled uncertainties via their Monte Carlo methodology were in the range of 50-150 Mg/ha across the entire state, with denser forest in the Sierra Nevada Mountains closer to 100-150 Mg/ha. Chen et al. [85] was able to report within-sample statistics of R 2 = 0.83 and RMSE = 72.2 Mg/ha using a model based solely on aerial LiDAR for the Lake Tahoe region. This effort represents one of the best models for AGB estimation in the region. The Chimera ensemble displays the ability to achieve similar performance in a comparable forested type landscape with high coefficient of determination (R 2 ) values (above 0.8) and low RMSE for all forest structure metrics.
As our first step at measuring the RCNN performance at the predictor level, we ran experiments comparing various permutations which included different combinations of the input NAIP, Landsat, and ANC data. We found that combining the full suite of data performed better overall than any of the input datatypes in isolation for forest structure estimation, specifically with the highest performance increases in AGB and basal area accuracy. The inclusion of high resolution information increased the data saturation level for estimates of AGB, where estimates reached saturation at 416.7 Mg/ha, rather than 354.0 Mg/ha, when only including the Landsat time-series and ancillary variables ( Figure A2). The combination of NAIP and Landsat are found to have the best performance at the classification task overall versus Landsat alone, with an overall F1-score improvement of 0.08, which suggests that inclusion of high resolution imagery aids in AGB estimation and identification of forested and non-forested plots due to the additional textural information. This is in agreement with previous studies which demonstrated that including high resolution imagery increases a machine learning model's ability to distinguish forest composition types [86,87].
Our experiments with RF and SVM models using the same training and testing datasets further demonstrated how the RCNN method of spatial feature detection can improve forest structure estimation and classification. Due to the non-convolutional characteristics of RF and SVM, even if textural summarizing functions such as SDGL were included in the predictor datasets, the RCNN model was still able to achieve higher accuracy in both classification and regression tasks. The convolutional nature is important in the regression task, where structural features such as the size of a large or small canopy can be relevant for estimates of basal area and QMD. This characteristic contributed to the CE's increased performance versus RF, with an overall 17% reduction in RMSE for all forest structure metrics (Table 4). Additionally, because the Chimera model utilizes a recurrent layer in its structure for Landsat images, features that are identified in the twelve month series are seen as a set of orderly and continuing sequences, rather than a one-dimensional vector of independent values [88]. This allowed substantial improvement in distinguishing of the land use class of 'deciduous', with CE improving the F1-score by 0.26 and 0.14 compared to SVM and RF, respectively. This indicates that the Chimera model, with its ability to automatically identify relevant image features and incorporate information regarding spatial-temporal relationships between pixels in prediction, can efficiently and substantially improve forest structure estimation using the same input datasets.
Model ensembling/stacking was found to be an improvement over a single best fit Chimera model. Although only a modest improvement was found versus the best individual k-fold model with an increase of 0.025 in overall F1-score for classification, and regression improvements of overall R 2 of 0.0125, and reduction of overall RMSE by 3.1%. These findings were consistent with the literature that stacking multiple models based on a linear regression weighting, can reduce individual model biases and increase the robustness of prediction [89]. These biases could include image distortion, object occlusion, and inconsistent lighting being abundant in NAIP training data due to off-nadir image acquisition, and Landsat seven commonly containing gaps due to scanline errors and poor quality pixels. By fitting multiple models and predicting in an ensemble, we show that the Chimera ensemble is able to qualitatively estimate land cover type and forest structure well (Figures A3 and A5).

Caveats Associated with the Training Data
Although 27,966 FIA plots were available for the CA-NV study area, we removed a large proportion (>50%) of the samples after manual visual inspection, in which non-forest land-use classification defined by FIA did not correspond to the associated land cover in the NAIP image. Though time-consuming and laborious, this exercise was essential to prevent the model from learning incorrect image texture representations of the forest classes and structural attributes measured in situ [21]. The performance of Chimera can mainly be attributed to its access to those 9967 quality data plots. CNNs are the most data-hungry method of machine learning, primarily due to the sheer number of parameters needed to fit the model, which in turn is based on the volume of training data. We also note that given the differences in FIA protocols over time and regions [61], some forest structure response attributes such as canopy cover were measured subject to variability (in situ versus remote sensed image interpretation), which could have downstream effects on the performance of CE ( Figure A5). In terms of portability and robustness, it should be mentioned that our tests of CE are exclusively in the CA-NV region. Although a variety of US Western ecotypes, from temperate forests to semi-arid woodlands are represented in our analysis, our approach would not be applicable to make predictions of forest class or structure in rainforest ecotypes nor the hardwood forests of the Eastern US, unless CE were retrained on data from those specific forest types.

Conclusions and Future Research
We demonstrate the performance of a novel multi-task, multi-input recurrent convolutional neural network called the Chimera model for forest land use classification and forest structure estimation. We summarized the results of three major objectives as follows: (1) performance of the Chimera model was the highest with the full input dataset that included NAIP, Landsat time-series, and ancillary climate and topography data; (2) the Chimera model outperforms SVM and RF models with the same input data for all classification and regression tasks; and (3) ensembling of multiple Chimera models modestly increased predictive performance compared to a single best fit model alone. These results represent, to our knowledge, the first application of a RCNN with multi-input for improving estimations of forest structure.
The ability of the Chimera ensemble to distinguish between forested and non-forested land cover images within California and Nevada presents a new approach for generating a time-series of change detection for both afforestation and deforestation based on high-resolution imagery. Since our model requires only freely accessible data, we can feed new acquisitions of NAIP and Landsat scenes through the existing model to generate new predictions. This potential for continuous prediction is progress towards work similar to Wilson et al. [8] who developed a model for imputing forest carbon stock at a nationally continuous scale. Given access to the national USFS FIA dataset, widely considered the "gold standard" of field collected forest metric samples within the entire United States, one would have a large enough sample size to parameterize multiple robust RCNN models focused on a forest of interest. A future iteration of the Chimera ensemble trained on a new subset of USFS FIA encompassing a region outside of California and Nevada could provide state-of-the-art estimates of forest structure in many completely different ecotypes. Additionally, with integration of advanced satellite technology including the GEDI and NISAR missions [90,91], more opportunities to combine LiDAR or radar information with existing NAIP and Landsat data could generate still better estimates of forest structure.
Finally, we see this study as just one indication of the promising application of deep learning architectures to ecology and remote sensing. Future work could include multi-task learning applied to land-cover and land-use mapping problems such as monitoring of woody encroachment on wetlands, or measuring forest loss due to wildfire or disease outbreaks. In addition to forest and stand structural characteristics, future research could inform the identification and delineation of key habitat attributes for wildlife [92]. Wall-to-wall forest structural maps can also be used as inputs for fire modeling applications that depend on canopy cover, canopy height, canopy crown bulk density, and crown base height to determine fire behavior [93]. Integration of our model in a "near real-time" wildfire probability model could provide estimates of fire risk and serve as a tool for prioritizing locations for fire prevention treatments [94]. Predictive modeling applications (e.g., coastline vegetation, wetlands, grasslands, or shrublands) often require both classification (e.g., presence/absence, type) and regression (e.g., abundance/cover conditional on presence).
We have demonstrated that the Chimera architecture is capable of fusing various types of data (sensor, climate, and geophysical data, both time-varying and fixed) that are now commonplace in many ecological applications, and using the information they provide to improve predictive ability.
By taking advantage of distributed cloud computing, future development of a continuously integrated pipeline could extend this model towards automatically updated landscape or global forest metric estimates. RCNN architectures similar to Chimera present an opportunity to combine disparate data types and move towards global-to-local level ecosystem monitoring.   Figure A3. Visual representations of a sample (not all inputs presented here) used for training the Chimera ensemble. White outlined area represents plot regions for FIA field measurement and total image represents 120-m × 120-m area at 1-m resolution. Fourteen total climate parameters, three terrain characteristics, and all Landsat bands except thermal bands were used for training, as well as RGB NAIP. Landsat images represent 4 × 4 pixels at 30-m resolution. In addition to AGB of live standing trees >= 1 inch (2.54 cm) in diameter, other prediction values generated from FIA samples included QMD, percent canopy cover, basal area, and forest plot classification. Examples of 'conifer' class plots, demonstrate the ability to identify textural features of both (a) heavy timber and (b) woodland conifer forests.  Figure A6. Example of challenging 'deciduous' and 'dead' stands. (a) 'deciduous' example was half populated with trees within the NAIP image sample. (b) 'dead' case reveals a mixed Landsat signal, leading to a misclassification of 'deciduous' with low prediction probability.