Deep Neural Networks with Transfer Learning for Forest Variable Estimation Using Sentinel-2 Imagery in Boreal Forest

: Estimation of forest structural variables is essential to provide relevant insights for public and private stakeholders in forestry and environmental sectors. Airborne light detection and ranging (LiDAR) enables accurate forest inventory, but it is expensive for large area analyses. Continuously increasing volume of open Earth Observation (EO) imagery from high-resolution (<30 m) satellites together with modern machine learning algorithms provide new prospects for spaceborne large area forest inventory. In this study, we investigated the capability of Sentinel-2 (S2) image and metadata, topography data, and canopy height model (CHM), as well as their combinations, to predict growing stock volume with deep neural networks (DNN) in four forestry districts in Central Finland. We focused on investigating the relevance of different input features, the effect of DNN depth, the amount of training data, and the size of image data sampling window to model prediction performance. We also studied model transfer between different silvicultural districts in Finland, with the objective to minimize the amount of new ﬁeld data needed. We used forest inventory data provided by the Finnish Forest Centre for model training and performance evaluation. Leaving out CHM features, the model using RGB and NIR bands, the imaging and sun angles, and topography features as additional predictive variables obtained the best plot level accuracy (RMSE% = 42.6%, | BIAS % | = 0.8%). We found 3 × 3 pixels to be the optimal size for the sampling window, and two to three hidden layer DNNs to produce the best results with relatively small improvement to single hidden layer networks. Including CHM features with S2 data and additional features led to reduced relative RMSE (RMSE% = 28.6–30.7%) but increased the absolute value of relative bias ( | BIAS % | = 0.9–4.0%). Transfer learning was found to be beneﬁcial mainly with training data sets containing less than 250 ﬁeld plots. The performance differences of DNN and random forest models were marginal. Our results contribute to improved structural variable estimation performance in boreal forests with the proposed image sampling and input feature concept.


Introduction
Forest structural variables, such as the growing stock volume (GSV), offer valuable information for forest conditions and long-term development trends [1,2]. This information can be utilized by various stakeholders, e.g., for sustainable forest management and commercial utilization, carbon balance assessment or predictions of ecosystem response to climate change [3,4]. In situ measurements provide transparent, comparable and consistent information on the availability of forest resources for the development of forest-associated policies and to support decision making [5]. Combined with remote sensing instruments, the sparse network of field plots can provide analyses for larger areas [6]. Airborne LiDAR measurements and aerial imagery enable accurate forest inventory and monitoring, but airborne acquisitions are expensive, limited in spatial and temporal coverage, and can contain data gaps, thus restricting their utility for wall-to-wall mapping of large areas [7]. Consequently, open satellite imagery becomes appealing for cost-effective wide-area forest inventories. However, there is a constant need to improve the accuracy of spaceborne forest inventory for it to be more attractive for a wider selection of users.
The Sentinel-2 mission of the European Copernicus program offers free optical imagery at 10 to 60 m spatial resolution, with 13 spectral bands and a revisit frequency of five days at the equator [8], enabling cost-effective and frequent forest monitoring at a large scale. The volume of openly available Earth Observation (EO) data through, e.g., ESA and USGS is increasing exponentially [9], which induces the need for reliable, robust, and transferable machine learning techniques for large-scale information extraction [10]. In recent years, the remote sensing community has shown great interest in non-parametric methods, particularly kernel-based regression methods and neural networks. For large area mapping, non-parametric methods are considered more versatile than traditional parametric methods [11] because they only rely on the available data and do not assume any fixed functional form [12]. Therefore, they typically perform better in the presence of a nonlinear relationship between the explanatory and response variables [13]. In particular, deep learning methods have seen a great increase in popularity for optical satellite image analysis [14][15][16]. Typical applications of deep learning in remote sensing include image classification and segmentation tasks, object detection or hyperspectral image analysis [17,18], and the most commonly used deep learning models include convolutional neural network (CNN) and recurrent neural network (RNN) [15].
Despite promising results for classification tasks, deep learning has not yet been used extensively for regression in environmental studies [19]. A recent review on CNNs in vegetation remote sensing [20] included just ten regression studies, only two of which used satellite images as the primary source. Several regression studies focus on estimating continuous biophysical variables in the agricultural sector. Reference [21] used CNNs and RNNs to predict soybean yield in the USA, while [22] used CNNs to estimate crop yields in India. In Finland, reference [23] estimated wheat and barley yield using CNNs.
Deep learning has also been tested in estimating strawberry yield [24] or predicting soil moisture [25].
Few studies have used deep learning for estimating forest aboveground biomass (AGB). For instance, [26] used stacked sparse autoencoders (SSAE) to estimate AGB in China, in mixed broadleaved and coniferous forests characterized by a subtropical monsoon climate. Their SSAE model outperformed traditional methods, such as RF, support vector machine (SVM), and k-nearest neighbor (k-NN). Their results were confirmed by [27] in the same area in China. Forest biomass has been estimated using deep learning also in the U.S. [28] and in Spain [29]. In both studies, the use of deep learning was found to be a promising approach for further investigation. Authors in [7] introduced a deep multi-input recurrent convolutional neural network called Chimera to classify forest and land cover to five classes, and to estimate forest structure metrics (including AGB) simultaneously using satellite and aerial imagery, and ancillary data as model inputs. The Chimera predictor outperformed RF and SVM in both tasks when using the same input data for all classification and regression tasks.
Other continuous forest variables that have been estimated using deep learning models recently are growing stock volume and vegetation height. Authors in [30] were able to predict GSV accurately (RMSE = 30.5 m 3 /ha) in China using tree pixel information extracted from on-ground digital camera images. Authors in [31] trained a CNN to extract suitable textural and spectral features from Sentinel-2 images and used the model for spatial prediction of vegetation height in Switzerland and Gabon. They obtained RMSE values of 3.4 m and 5.6 m for tree height estimates in the two countries respectively, with a CNN model consisting of nearly 20 million trainable weights. Recently in Finland, boreal forest variables have been estimated with neural networks [32] and kernel-based regression methods [33,34]. Although deep learning has been used for estimating continuous forest variables, studies focusing on the boreal forest biome are still lacking. The present study will contribute to closing this research gap.
Usually with machine learning algorithms, the assumption is that the distribution of the data used for model training is the same as the data that the model is applied to [35]. However, transfer learning [36,37] allows these distributions to be different. In short, transfer learning is a fine-tuning method that aims to improve learning in a new target data set utilizing information and knowledge learned in a data set from another domain. In remote sensing, transfer learning offers appealing means to refine models for new target distributions without the need to organize expensive and slow field data campaigns, or compromising on model accuracy. It may be used to refine an existing model for a new target data distribution, e.g., for a new geographical area, or to update a model outdated due to, e.g., forest growing or management. Transfer learning can also be used to transform an existing model for a new target variable, or with a new type of source data (e.g., another satellite). For deep learning models in remote sensing, transfer learning is a useful tool for training a comprehensive target network and avoiding over-fitting, even if the target domain data set would be smaller than the source data set [38].
Studies combining transfer learning and deep learning in remote sensing are still relatively rare, especially for regression. Reference [39] estimated soybean yield in Brazil utilizing transfer learning with an RNN model trained on Argentinian soybean harvests. Their transfer learning model outperformed all other used models, which were trained only on data from Brazil. Reference [40] found transfer learning to be extremely valuable in retrieving information on small-scale urban structures. They reported that transfer learning from one optical domain (QuickBird) to another optical domain (Sentinel-2) worked well and improved segmentation accuracy.
We applied deep neural networks for continuous forest variable estimation in the boreal forest biome. The motivation to use deep neural networks in our study was their ability to automatically extract useful features from complex data, which is beneficial when combining multiple data sources of different natures, and the simplicity to implement transfer learning. We investigated the benefit of combining the Sentinel-2 image data sampled from an extended spatial context of a target pixel with information of the satellite imaging and sun angles and with topography data. We assumed this to improve model performance when compared with the previously used methods such as simple neural networks with pixel-wise image sampling. Furthermore, as the potential of transfer learning in remote sensing has not been widely explored yet, especially in the forest sector, we investigated its applicability to refine models from one spatial region to another characterized with different data distribution, aiming to reduce the need for additional expensive field work. The special interest was to evaluate the amount of new training data needed to obtain models with accuracy close to models that are trained with extensive training data. An important goal of our study was also to obtain information on how to apply the developed deep learning concept and transfer learning in practice.
The main objective of this study was to investigate the performance and practical usability of deep neural networks for forest variable prediction in the boreal forest biome by combining predictive data from multiple sources. The experiments to achieve this objective were (1) to investigate the relevance of different input features, the effect of DNN depth, the amount of training data, and the size of image data sampling window to model prediction performance; (2) to examine the relation of available reference data (number of training samples) to forest variable prediction accuracy; (3) to examine the benefits of using transfer learning for a new geographic region with a limited set of reference data.

Study Site
Our study area (Figure 1) was located in Central Finland and covered an area of four forestry districts: Central Ostrobothnia (Keski-Pohjanmaa), Central Finland (Keski-Suomi), North Savo (Pohjois-Savo), and North Karelia (Pohjois-Karjala). The total area covered by the four districts is 68,315 km 2 , and it extended to southern and central boreal zones. The study sites are typical Finnish boreal forests. The topography is relatively flat with the changes in elevation increasing towards the east. The forest area is conifer-dominated with Scots pine (Pinus sylvestris) and Norway spruce (Picea abies) as the main overstory species. The deciduous trees usually occur as mixed-species stands. Birches are the most common deciduous species, especially Silver birch (Betula pendula) and Downy birch (Betula pubescens).

Field Reference Data
The Finnish Forest Centre provided the field plots that we used as reference data. The plots were distributed over the four forestry districts. The data included information on stem volume, mean diameter at breast height, mean height, basal area, age, stem number, and additional stand information, such as development class, dominant tree species, proportion on timber, and regeneration situation.
The reference data were acquired during 2016-2017 using three different plot radii: 9 m in young and advanced managed forests with a relatively high tree density, 12.62 m in a forest with a low stem density and typically high volume due to the mature development stage, and 5.64 m in seedling stands. A total of 42.7% field plots used in the study were on green heathland, 27.1% on dryish heathland, and 24.1% on grovy heathland, with the remaining 5.8% residing on dry heathland, groves, and barren heathlands. A total of 34.0% of the plots were located on grown up growing forests, 25.2% on young growing forests, 22.4% on recently planted forest stands, and 18.4% on mature forests. Table 1 shows the mean and standard deviation of major forest variables in the four forestry districts. We used four target variables (stem volume, basal area, (mean) stem diameter, and (mean) tree height) and their species-specific components (pine, spruce, and broadleaved) in our study. We focused on stem volume for most of our tests, but we also produced performance figures for all of them with our forest variable prediction concept.

Optical Satellite Imagery
We used the optical remote sensing data from the Sentinel-2 Multispectral Instrument (MSI). A set of 22 Level-1C products (https://sentinel.esa.int/web/sentinel/user-guides/ sentinel-2-msi/product-types/level-1c; accessed 6 June 2021) over Southern/Central Finland were downloaded from the Copernicus Open Access API Hub (Figure 1 To perform the necessary atmospheric correction, we could not rely on approaches requiring dark vegetation targets, such as the Dense Dark Vegetation (DDV) algorithm [41]. Dark objects, typically pure old spruce forest stands, were very rare in the target forestry districts mostly populated by managed pine forests on mineral soil. The selected Sentinel-2 images were also partly cloudy, which reduced the possibilities of finding the required reference dark target stands. Consequently, we computed the bottom-of-atmosphere (BOA) reflectances by applying the SMAC4 radiation transfer code [42] using a constant atmospheric optical density (AOD) parameter of 0.01, and masked out clouds in the image.
We used the S2 image channels at 10 m and 20 m spatial resolution, with the 20 m channels re-sampled to 10 m (Table 2). In addition, we included image metadata and auxiliary data as model input features. These data are listed in (Table 3). Near-Infrared (NIR) 785-900 10 9 Narrow NIR (nNir) 855-875 20 10 Water vapor 935-955 60 11 Shortwave infrared -Cirrus 1360-1390 60 12 Shortwave infrared (SWIR1) 1565-1655 20 13 Shortwave infrared (SWIR2) 2100-2280 20 Table 3. The different input data used as model predictor variables. Each data source was used as model inputs either separately (referred to by data set number) or in data groups (referred to by group code). The dimensions of the S2 image data varied depending on the image channels used as well as the image sampling window size (1 × 1, 3 × 3 or 5 × 5 pixels), which also affected the dimensions of terrain slope data. The CHM statistics (mean, median, stdev) were computed for 1 × 1 and 3 × 3 S2 window areas.

Canopy Height Model
The canopy height model (CHM) computed from LiDAR measurements was acquired at 1 m pixel resolution by the Finnish Forest Centre. The mean, median, and standard deviation values of CHM data were computed for 1 × 1 and 3 × 3 Sentinel-2 pixel windows (i.e., 10 m × 10 m and 30 m × 30 m). These data were included in combination with the other data sources to evaluate the approximate level of accuracy that can be obtained when they are available for forest variable modeling (Table 3).

Training, Validation, and Test Sets
The 2970 reference field plots were divided into training, validation, and test sets using stratified random sampling with five strata of equal bin widths and relative set sizes of 50%, 30%, and 20% for training, validation, and test sets, respectively. The stratification variable was total stem volume, and the bin edges were 0, 160, 320, 480, 640, and 800 m 3 ha −1 . The resulting numbers of field plots were 1488, 895, and 587 for training, validation, and test sets, respectively.
According to [43], the spatial auto-correlation effect in Finnish forests reduces to close to zero within 200-300 m. Adopting this information, we set the minimum distance requirement of field plots belonging to different data sets (i.e., training, validation, or test set) to 250 m, and we removed all field plots in training and validation sets that did not exceed this threshold. This reduced the number of field plots in the resulting training and validation sets to 1129 and 338, respectively. In order to keep enough field plots in the test set, we kept the set of test field plots intact. The resulting minimum distances between the three sets were 252.6 m (from training to validation), 251.1 m (from training to test), and 256.9 m (from validation to test).
Multiple Sentinel-2 images were included from the test area for the selected period, with the number of images for each field plot location varying from 1 to 22. After combining the image data with the field plot data, the resulting data set included in total 10,686 data vectors. The sizes for training, validation, and test sets were 5826, 1826, and 3034 data vectors, respectively.
Some of the Sentinel-2 images contained heavy cloud cover (up to 86% of image area), and the image pixels that were not masked out as clouds were often contaminated by haze or fog that could not be removed or compensated. To avoid the use of noisy data in model training, the maximum cloud cover percentage of the training and validation images was set to 70%. The same threshold was set to 10% for the test set data to use only the best quality data for computing performance measures.
With the maximum cloud percentage limitation, the number of data vectors in training, validation, and test set reduced to 5762, 1804, and 908, respectively. To obtain a unique test set, we randomly selected a single data vector from those field plot locations that included multiple data vectors, and thus the test data set size reduced from 908 to 368 vectors. The numbers of field plots in the resulting training, validation, and test sets were then 1129, 338, and 368, respectively.
This compound set of data that consisted of the field plots with the associated image data, image metadata, DEM, and CHM was used throughout the study. The data were combined into a feature matrix that was stored in text format. A summary of the reference data sets with statistics of main forest variables is shown in Table 4. To investigate the relationship between the prediction performance and the amount of data used for model training, we split the training and validation reference data into smaller subsets. We ordered the field plot data within these two sets according to increasing total stem volume and split the ordered data sets into 30 sub-groups (strata) with an equal number of field plots (37 or 11 for training and validation sets, respectively) in each subgroup. As the numbers of field plots in training and validation sets (1129 and 338, respectively) were not divisible by the number of strata, a 31st group included the remaining field plots with the highest stem volumes (19 plots for training, 8 for validation). Then we randomly assigned the subset labels (1 to 37) for training and (1 to 11) for validation data within the ordered subgroups. The resulting 37 training and 11 validation data subsets then contained 30 or 31 field plots, and their total stem volume distribution followed roughly the distribution of the entire field plot data set.

Study Concept
We defined a series of tests to investigate the effect of different factors on the model prediction performance. Such factors included the size of the Sentinel-2 image sampling window, the number of DNN hidden layers (DNN depth), the amount of reference data for model training, and the effect of auxiliary data (e.g., topography data, and imaging angles or acquisition dates). In addition, we used transfer learning to fine-tune DNN models to evaluate the applicability of the models in other geographical regions.
The input data for the forest variable models were a combination of data from Sentinel-2 satellite, topography data (digital elevation model (DEM) and terrain slope), and canopy height model (CHM). The data from Sentinel-2 included image pixel values from spectral bands with 10 and 20 m spatial resolution, acquisition date, time difference between acquisition and field measurement, imaging angle, sun angle, and target pixel location (Lat, Long). The spatial resolution of the model output was 10 m, i.e., the pixel size of Sentinel-2 RGBNIR bands.
The motivation for selecting group A auxiliary features was to compensate for seasonal effects (feature #2/A; Table 3) and to capture the effect of vegetation growth between image acquisition and field measurement dates (feature #6/A). For selecting group B or auxiliary feature #3 the idea was to capture areal variations in the landscape, mostly applying to the west-east direction in our study area. The reason to include groups C and D (features #4, 5, 7 and 9) was to compensate the differences in the pixel reflectance values due to variation in terrain elevation (effect of shadowing), if these data were combined with the windowed S2 image sampling.
The performance of the DNN models was compared with results produced with commonly used random forest (RF) models [44]. The performance of the transfer learning models was compared with the results obtained with models trained from scratch using the same training data. Total stem volume (V) was used as the target variable in most of the tests, although we produced DNN models also for a wider set of forest variables to assess the improvement obtained with the proposed data fusion concept when compared with a simple data model.

Deep Neural Network (DNN) Models
A fully connected deep neural net (DNN) implemented with Keras deep learning library [45] using Tensorflow [46] platform was used for the forest variable regression. We used rectified linear units (ReLU) as the activation functions for both hidden and output layer neurons.
We used the Adam algorithm [47] as the optimizer in the network training. The parameter values used for Adam were beta1 = 0.9 , beta2 = 0.999. To prevent the network from overfitting the training data, we used regularization and weight decay in combination with early stopping to interrupt training at an appropriate point. The L2 regularization method [48] was used, after checking that it produced roughly an equal regularization effect on network training as dropout [49]. The weight decay rate for the Adam algorithm was set to learning rate/number of epochs. As the early stopping criterion, the error measure to monitor was the mean absolute error on the validation set. We defined 0.001 as its minimum improvement (min_delta) with the patience parameter set to three epochs.

Transfer Learning
In transfer learning, a model developed for a given source data set is used as the starting point for tuning a model on a target data set. Two strategies are commonly adopted: either the network weights learned in the source data set are used as initial weights to retrain (or fine-tune) the entire network on the target data set, or a part of the network weights are fixed and only the non-fixed weights are retrained. The non-fixed weights may also be initialized randomly instead of using the pre-trained weight values. The network structure may also be changed by adding or removing layers and/or outputs to the network.
We tested transfer learning using a DNN model trained with reference forest data from the Central Ostrobothnia district, which has a lower mean stem volume and different species composition than the other three districts ( Table 1). The data used to fine-tune the pre-trained model consisted of the field plot data from the three other forestry districts: Central Finland, North Savo, and North Karelia. We used the field plot data subsets described in Section 2.5.1 to evaluate the effect of a limited amount of training data to transfer learning. We compared the transfer learning model with a model of equal structure, and that was trained starting from randomly initialized network weights ('training from scratch') with an identical training data set.

Hyper-Parameter Search
To find the best hyper-parameter values for model training, we used the empirical cost function presented by [32] that combined relative root-mean-square error, absolute value of relative bias, and coefficient of determination. We used a factor of 3 instead of 10 for the bias term in this study: The cost for each hyper-parameter combination was computed using the mean values of validation data set performance figures from 25 modeling iterations, and the hyperparameter combination with minimum cost was selected for subsequent model training. A modeling iteration wrapper [50] was then run with the found optimal hyper-parameters for 100 times for each separate test setup (see Figure 2).
The learning rate was searched among values [0.00001, 0.0001, 0.0003, 0.0005, 0.001, 0.005]. For mini-batch size, two ranges of search values were used: [8,16,32,64,128] for the smallest sets of training data, and [32,64,128,256] for others. The number of training epochs was set to 250, which was considered adequate for convergence as the early stopping criterion interrupted training before this limit was reached. We used an L2-regularization weight penalty value of L2 = 0.15 for each test run.

Test Setups
We defined different test setups to evaluate the effect of various factors to the DNN model performance. The contribution of each factor to model performance was tested by individually changing the parameters related to the factor being investigated, while fixing other parameters to default values that were found close to optimal for each factor in preliminary tests (Table 5).

Different Input Data Sources
Input data of the estimation model, i.e., the features that were used in the study, included the data sources listed in Table 3. To investigate the individual significance of different input data, each of the sources was added one at a time to the used Sentinel-2 image channels. For the Sentinel-2 data we used two different combinations: spectral bands with 10 m spatial resolution (i.e., B2, B3, B4, and B8) or spectral bands with 10 and 20 m spatial resolution (i.e., B2, B3, B4, B5, B6, B7, B8, B8A, B11, and B12).

Size of Image Sampling Window
Three different sampling sizes were used: 1 × 1, 3 × 3, and 5 × 5 pixels. The model performance was assumed to improve with the windowed approach, as the effects of shadows in images would be taken into account in combination with DEM and slope information.

Minimal Amount of Reference Data
In this experiment, our goal was to find the smallest amount of reference data (field plots) for proper training of the models. To evaluate this, the training and validation data were drawn randomly from the 37 training and 11 validation data subsets described in Section 2.5.1, while aiming at a nominal number of field plots N = 60, 90, 120, 150, 210, 300, 510, 750, and 990 in the combined training and validation sets. The ratio of training and validation data varied from 50/50 to 67/33 with the different nominal amounts of field plots, with a more equal split for the smallest nominal values.
This random selection and the subsequent modeling iteration run was repeated 10 times for each nominal number of field plots N (i.e., 10 random combinations of N field plots). For each of these 10 iterations, one optimal hyper-parameter search loop was performed, followed by 25 DNN training iteration runs with the same hyper-parameters. The results for the training data sub-sample tests were computed as mean values from the 10 × 25 modeling runs, with the standard deviation of the means from the 10 different training data combinations.

Number of DNN Hidden Layers
A selection of fully connected networks with a varying number of hidden layers was tested, between 1 and 20 hidden layers (Table 6). In the tests, we used two sets of model inputs: one set with all ten Sentinel-2 (S2) image channels and all auxiliary input features, and another set with the 10 m spatial resolution channels (i.e., B2, B3, B4, and B8) and S2 imaging/sun angles and DEM/slope features. The 3 × 3 image sampling was used for both input feature groups. The numbers of model inputs were 118 and 59, respectively.
The numbers of neurons in hidden layers were empirically searched for stable convergence of the network, aiming to keep the number of network parameters low, while preserving the network ability to learn the desired prediction task without overfitting to the fairly limited set of reference data. The number of neurons in the first hidden layer was set to <number of inputs+3> as this was found to be close to optimal for the single hidden layer network. The number of neurons in subsequent hidden layers followed a decreasing trend from the first to the last layer, as in [28]. For instance, the structure of an eight hidden layer DNN was [59-62-32-24-16-12-9-6-3-1] for the input configuration S2-4-CD (the number of neurons in hidden layers in bold; for the model structure coding, see Section 4). Table 6 shows the number of hidden layers and the number of weights in the used networks. Table 6. Different depths of neural networks tested in the study. Number of hidden layers shown with the number of trainable weights for two different sets of network inputs (S2-4-CD w3 and S2-10-ABCD w3) that were used in the tests. In addition, the ratio of the number of training data vectors (Nt = training and validation sets combined) and the number of network weights (Nw) are shown. For the model structure coding, see Section 4.

Random Forest (RF) Models
The Random forest algorithm [44] is an ensemble learning method that combines the bootstrap aggregation method ('bagging' [51]) with the random subspace method to train a set of decision trees that have a high variance and low bias [52]. The outputs of the trees are combined by mode (classification trees) or mean (regression trees) to produce the prediction output. Random forests avoid overfitting to the training set [51], provide fast processing [53], and means to rank the importance of the predictors [51,52]. The random forest models were implemented with scikit-learn [54], a frequently used machine learning library for Python programming language, using the function 'sklearn.ensemble. RandomForestRegressor'.
For the random forest models we tuned the hyper-parameters for tree maximum depth (search values = [20, 30,  . We also examined two alternative values for the RF parameter that controls the number of features to consider when looking for the best split. Setting this parameter value to the square root of the total number of inputs (predictor variables) speeded up modeling roughly by a factor of 10. However, this setting caused a slight increase in root-mean-square error and bias, when compared to the case of considering all features. Thus, we decided to use all features in node splitting, as speed was not an issue in testing. As the variance in validation set performance was very small for different iterations, the iteration count for random forest hyper-parameter search was set to 10.

Model Performance Evaluation
To evaluate model performance, we used the absolute and relative root-mean-square error (RMSE and RMSE%), bias (BIAS and BIAS%), and the coefficient of determination (R 2 ).
with y i the observed values from reference forest inventory andȳ their mean,ŷ i the model estimates, n the total number of measurements, SS tot the total sum of squares, and SS res the residual sum of squares; all sums are for i = 1 . . . n.
Due to the stochastic nature of the model training processes, the models performed slightly differently in separate training runs, even with the same model structure and hyperparameters. Consequently, we repeated the modeling iteration wrapper 100 times for each separate DNN model test setup, or 25 times for tests with a limited number of field plots (Section 3.

Results
The results presented in this section were obtained using total stem volume (V) as the target variable in each test. Sections 4.1-4.4 present the sensitivity analysis of various factors (Table 5)  For instance, a label S2-4 [CD] w3 3L indicates that model inputs were four Sentinel-2 channels with auxiliary inputs from groups C and D (DEM and slope data), and that the model used 3 × 3 pixel image sampling window and a DNN structure with three hidden layers. For the DNN structure we use the notation i-h 1 − h 2 − ...h n -o, where i = number of input nodes, h 1 − h n = number of hidden layer neurons, and o = number of output neurons. Table 7 summarizes the factors tested and parameters varied in the different test setups.

Factor Tested/Parameter Varied Parameter Values Section
Effect of different input features /DNN input feature combination 24 different input combinations: The tests were run with an HP power laptop PC with Intel® Core™ i7-10610U CPU @ 1.80 GHz 2.30 GHz processor, with 24 GB RAM memory installed, and with 64-bit operating system. The runtimes for the different tests, and for a single model, were from 30 min to 6 h, depending on the DNN size, hyper-parameter tuning, and the number of iterations defined for the test. Typical time for training a single model (with hyper-parameter tuning, no performance statistics collection) varied roughly between 10 and 15 min (training time for S2-4 [CD] w3 2L: hyper-parameter tuning: 10 min; model Training: 2 min).

Effect of Different Input Features
When testing the effect of different input features as model predictors, we chose a DNN structure with three hidden layers for the models (3L) and the 3 × 3 window (w3) for S2 image sampling. As a baseline model, we used a three hidden layer DNN having only ten S2 image channels with pixel-wise (1 × 1) image sampling as inputs, i.e., S2-10 [] w1 3L. Figure 3 shows the relative root-mean-square error (RMSE%) and the average of relative bias absolute values (BIAS% abs.) with different combinations of input features.
(a) RMSE% (b) BIAS% abs. Figure 3. Test set relative root-mean-square error (RMSE%) and average of relative bias absolute values (BIAS% abs.) for DNN models predicting total stem volume with different input feature combinations. The results are mean values from 100 iterations of the modeling wrapper with the same hyper-parameter values. The bars are ordered with decreasing performance (the best performing on top). The bar colors correspond to the same feature combinations in both figures: gray = the baseline model; blue = models with all ten Sentinel-2 image channels (S2-10) and various auxiliary features; green = models with four Sentinel-2 image channels (S2-4); orange = models including CHM data (auxiliary input group E). The 99% confidence intervals are shown for BIAS% abs. only, as they are not distinguishable for RMSE% at this scale. Using 3 × 3 image sampling and adding auxiliary features as model predictors improved the RMSE% accuracy in comparison to the baseline model. The best performance was obtained by adding LiDAR-based canopy height model (CHM) features to satellite data (models with orange bars), with 2022% (pp) improvement with respect to the baseline. The orange bar labeled with [E] shows the performance of the model using only the CHM features as predictive inputs (RMSE% = 34.9%, BIAS% abs. = 0.9%). When combining S2 data and auxiliary features with the CHM data the RMS error was 28.6-30.7% with different feature combinations.
In addition to the CHM features, almost all the auxiliary features caused a clear improvement in RMSE% accuracy, when added one by one as model inputs in addition to the ten S2 image channels (models S2-10 [2-7]). In general, the DNN models with a subset of four S2 image channels produced better RMSE% performance (and BIAS% as well) when compared to the models that included all ten S2 channels. The improvement with respect to the baseline RMSE% performance was 8

.1% (pp) with models S2-4 [CD] and S2-4 [ABCD].
The BIAS% (abs.) varied from 0.4% to 4.0% in the test set. The smallest bias was obtained with the model including four S2 image channels and all auxiliary data sources (S2-4 [ABCD] w3 3L). In addition, two other models using the reduced set of four S2 image channels as inputs (S2-4 [] * and S2-4 [CD] *) were among the four best performing models in terms of the bias. Unlike with RMSE%, adding auxiliary features to model inputs did not systematically improve the bias when compared to models S2-10 [] w3 3L or S2-10 [] w1 3L. This was most surprising with the CHM features, which obtained the worst performance in terms of bias, when added alone to the ten S2 image channels.
An explanation for this was found by examining the residual error plots of the test set estimates, for which the example with CHM features is shown in Figure 4. The CHM inputs reduced the underestimates above 200 m 3 /ha, but the slight overestimates at stem volumes lower than this caused an increased positive bias, as most of the test data volume is concentrated there. This was also the case with other auxiliary inputs that caused an increase in the bias when added to the S2 image data as model inputs. It must be noted, however, that the increase in BIAS% with respect to model S2-10 [] w3 3L was significant only for the four models shown in the bottom of Figure 3b, as indicated by the non-overlapping 99% confidence intervals.

Effect of DNN Depth
The RMSE% and BIAS% performance figures for DNNs with different numbers of hidden layers are shown in Figure 5 for two different input feature combinations with a 3 × 3 window as S2 image sampling scheme (S2-10 [ABCD] w3 * and S2-4 [CD] w3 *). These were the two models with the lowest RMSE%, including all input features and a reduced set of S2 image channel data (excluding models with CHM features; Figure 3a).  There was about a 1% (pp) decrease in RMSE% with both of the feature sets, when moving from one to two hidden layer networks. The minimum RMSE% values were obtained already with two or three hidden layer networks, after which the error increased slowly along with the number of hidden layers. The variance in the 100 model test results increased considerably for the models with 16 or 20 hidden layers.
In bias values (BIAS% abs.) we did not get a clear indication of the optimum network depth, although there seemed to be a minimum at five hidden layers with the model having the input set of S2-10 [ABCD], and at two hidden layers with the model having the input set of S2-4 [CD]. As with the RMSE% values, the variance in model performance increased with the deepest models (16 or 20 hidden layers).
Training failed occasionally and more frequently with the deepest networks. The proportion of the failed runs was zero for the one-or two-hidden-layer networks, and was 1% for the three-layer network, and then increased gradually, exceeding 50% for the 20layer network. In these training attempts the weights in the first two network layers were practically zero, and the network produced a zero output. Thus, the training suffered from vanishing gradients [55,56] along increasing depth of the DNN. We addressed this by replacing the ReLU activations with the Scaled Exponential Linear Unit (SELU) activation function, which reduced the proportion of failed training runs considerably. However, as the use of SELU activation caused an increase in RMSE% and bias variances, we used ReLU for the final results.

Effect of Image Sampling Window Size
The effect of the S2 image sampling window size on model performance is shown in Figure 6. We used six different sets of input features in these tests, combining two S2 image channel sets (S2-10 and S2-4) and three selections of auxiliary features [none, CD, and ABCD]. All DNN models had three hidden layers.
(a) RMSE% (b) BIAS% abs. Figure 6. Test set relative root-mean-square error (RMSE%) and average of relative bias absolute values (BIAS% abs.) for DNN models with six different input feature combinations. All DNN networks consisted of three hidden layers. The whiskers show the 99% confidence intervals from 100 modeling iterations.
All six curves in Figure 6a obtained a minimum at 3 × 3 window sampling. The improvement of using 3 × 3 pixels rather than 1 × 1 was 1.5-5.5% (pp), being largest with the input feature set S2-4 [CD]. Using four of the S2 image channels instead of ten and adding auxiliary features (CD or ABCD) resulted in greater improvement. Extending the S2 image sampling to 5 × 5 pixels did not improve the results any more. On the contrary, there was a systematical decrease in RMSE% performance with all the tested input feature combinations between 3 × 3 and 5 × 5 sampling schemes.
Increasing the S2 image sampling window size from 1 × 1 reduced the test set bias on average (Figure 6b Sampling the Sentinel-2 image data using 3 × 3 or 5 × 5 pixel windows thus clearly improved the model performance in comparison to pixel-wise sampling. As the usage of the 5 × 5 window did not bring any benefit to the model performance in comparison to 3 × 3 sampling, and on the other hand increased the model complexity along increased number of trainable weights, our choice for optimum image sampling window size was 3 × 3 pixels.

Effect of the Amount of Training Data
The performances of the deep neural networks and random forest were compared by training total stem volume prediction models using limited numbers of field plots.  The RMSE% values of both model types increased about linearly when the number of field plots in training was reduced from 990 to 300 (including validation set). With fewer than 300 field plots the RMSE% increased rapidly, showing almost exponential growth under 210 field plots. The variation between the ten iteration run results also increased heavily below 510 field plots. In general, there was no large difference between the performance of DNN and RF methods. DNN produced slightly more accurate stem volume estimations between 120 and 510 training samples. On the other hand, the RF method yielded lower BIAS% values with all sample counts. However, the confidence levels of both RMSE% and BIAS% (abs.) overlapped heavily, so no preference between these two methods could be made.
Our results indicate that a practical minimum number of field plots for DNN (or RF) model training is somewhere between 200 and 300, and amounts larger than 500 plots are preferred.

Transfer Learning Tests
We tested transfer learning with a DNN model (fully connected 118-121-24-12-6-1 network) pre-trained on data from the Central Ostrobothnia forestry district. We chose a model with enough layers to separate the possible low level features in the earlier hidden layers, from the more specific features in the last layers. On the other hand, we selected a network structure with close to optimal RMSE% performance ( Figure 5a).
To first investigate the best transfer learning configuration, we trained (refined) the pre-trained model by freezing the weights from 0 to 4 hidden layers using field plot data from the three other forestry districts (Table 1). We then computed the test set performance (BIAS% and BIAS% abs.) for the refined models (see Figure 8). The RMSE% increased significantly with 1 to 4 layer weights frozen. The bias did not show a monotonic increase, but the general trend was the same as with RMSE%. In general, we obtained the best performance by letting all weights learn from new data; thus, we used this principle in the subsequent transfer learning tests. We refined the pre-trained DNN model with different numbers of field plots. The RMSE% and BIAS% performances of the refined models are compared in Figure 9 with models having equal structure, and that were trained from scratch with the same field plots as the transfer learning network. The RMSE% performance of the transfer learning models and models trained from scratch were practically equivalent when 250 field plots or more were used in training. The mean RMSE% performance of the transfer learning models was better than the models trained from scratch with fewer than 250 field plots, but the variances overlapped heavily (except with the smallest number of field plots).  In addition, BIAS% performance with the transfer learning model was better with small numbers of training plots (<170). An interesting result was that with the largest number of nominal field plots (≥815), the RMSE% and BIAS% were higher for the transfer learning models than with the models trained from scratch, although the differences in RMSE% were small.
Our results indicated that transfer learning improved total stem volume prediction accuracy, if the number of training samples is limited (i.e., less than 250). With larger amounts of training data transfer, learning did not bring any benefit in comparison to training the models from scratch, as can be seen from Figure 9.

The Performance for a Wider Set of Forest Variables
To get a broader perspective, we produced prediction models and the performance figures for a wider set of forest variables with the proposed concept. The proposed models consist of a two-hidden-layer DNN with 3 × 3 image sampling window trained with image data and all auxiliary features except CHM (S2-4 [ABCD] w3 2L). We chose the four S2 image channels and all auxiliary variables as the model predictors as they produced the best RMSE% and bias accuracies of the tested input features (Figure 3). On the other hand the two-layer DNN produced the best performance when the four S2 channels were used as model predictors ( Figure 5).
We also produced predictions for the same set of variables with more simple models that represent a straightforward forest variable modeling using satellite image data without windowed image sampling, and any auxiliary features as predictive inputs. We used two hidden layers for the simple models as well (S2-4 [] w1 2L).
We produced models for four forest variables: stem volume (V), basal area (G), stem diameter (D), and tree height (H). We predicted both species-specific data and variable totals. We used (64-67-24-1) networks for the proposed models and (4-7-3-1) networks for the simple models. Relative root-mean-square error (RMSE%), bias (BIAS% abs.), and coefficient of determination (R 2 ) for the models are shown in Table 8. Table 8. Test set relative root-mean-square error (RMSE%), average of relative bias absolute values (BIAS% abs.), and coefficient of determination (R 2 ) for forest variables stem volume (V), basal area (G), stem diameter (D), and tree height (H), and their species-specific components (suffixes -pine for pine, -spr for spruce, and -bl for broadleaved). The proposed DNN models with two hidden layers using [ABCD] input variables and 3 × 3 window image sampling compared to models using pixel-wise S2 image data and two hidden layers. The best performance values in bold.

Proposed DNN Models
Simple The proposed DNN models produced better RMSE% and R 2 results than the simple models for all forest variables. The simple models produced lower BIAS% than the proposed models for 8 out of 16 forest variables. The relative improvements with proposed concept with respect to the simple models were from 2.3% to 28.3%), with the largest decrease obtained with stem volume and basal area variables (G-spr: 28.3% and V-bl: 16.9%, relative improvement). Figure 10 shows the total stem volume model response for the test set when using the proposed concept (a) and with the simple model (b). The proposed model reduced both the saturation at about 320 m 3 /ha stem volumes, and the overestimates below 200 m 3 /ha that were visible in the response of the simple model. Similar effects can be seen from the scatterplots of basal area, stem diameter, and tree height (Appendix A).   Figure 11b shows 10 km × 10 km extracts from Central Ostrobothnia (top) and North Savo (bottom). The difference in the forest area, and the higher stem volumes in North Savo, is clearly visible in the zoomed figures. Figure 12 (top row) shows the predicted stem volume distributions for the entire study area and the four forestry districts. The predictions were sampled from random locations, with minimum mutual distance of 250 m. The bottom row shows the reference data (training, validation, and test sets combined) histograms from corresponding areas. The distributions of the reference data and predictions show relatively similar behavior in all forestry districts, with the biggest differences in North Savo area. The differences in the stem volumes between the four areas can be seen from the histograms and the statistics printed on the graphs.

Sensitivity Analysis of Training Parameters
Our test results showed that there was a clear benefit in combining the auxiliary features with S2 image data that were sampled with 3 × 3 or 5 × 5 window instead of using a pixel-wise approach. We obtained a substantial improvement in total stem volume prediction performance: the best models using 3 × 3 image sampling, topography data (DEM and slope; input group D), and the sun and imaging angles (input group C) obtained about 8% (pp) lower RMSE% values than the model with only pixel-wise image data. The improvement of R 2 value from about 0.6 to 0.7 was observed between these models as well. The effect of the image acquisition date and the time to field data collection (feature group A) had a relatively small improving effect to the model RMSE performance (see Figure 3). This was no surprise as there was not much seasonal variation in the green vegetation during the imaging period (1 June 2017-6 September 2017), and as the time differences from imaging to field data collection were also small. Inclusion of the image pixel location (Lat, Lon; feature 3/B) caused slightly more improvement, which was also an expected result, as the forest characteristics and the topography vary significantly between the western and eastern parts of the study area.
The results also indicated that the S2 sun angles and terrain slope were the two sets of features that most improved the RMSE% when added alone to the S2 image channels (CHM excluded). When selecting the auxiliary predictive variables to the DNN models, we assumed that by including the topography data from the close spatial context of the target pixel, and combining it with S2 imaging and sun angles, we could at least partly compensate the effect of shadowing in the pixel reflectance values. The effect of these features can be seen in the saturation effect above 300 m 3 /ha stem volumes that correspond to the lowest image reflectance values (on forest area) ( Figure 10; the scatterplots of models with features S2-4 [CD] and S2-4 [ABCD] are almost identical). The windowed S2 image sampling combined with topography data (feature group D) and sun and imaging angles (feature group C) reduce this saturation, i.e., the DNN model is able to capture the correlations between sunny/shadowed terrain slopes and the image reflectance variations.
In our tests, the optimum Sentinel-2 image sampling window size was 3 × 3 pixels, and increasing to 5 × 5 pixels did not improve the total stem volume prediction accuracy. An obvious explanation is that the image sampling area of 3 × 3 pixels matches best the field plot area with a maximum radius of R = 12.6 m (mature stands), and the outer 16 pixels of the 5 × 5 pixel sampling window do not necessarily overlap the field plot area at all. In other words, the 5 × 5 px area contains data that may not correspond to the ground truth data, which is thus seen as noise in the model training process.
In addition, we observed the best performance using a subset of the ten S2 image channels, i.e., RGBNIR bands. We did not use any procedure for selecting the best predictive image features, but we used two different sets of S2 image channels: one with the 10 m resolution bands and the other with all the 10 m and 20 m bands. The systematic difference in model performance is clearly visible in Figure 5 with RMSE% about 2% (pp) lower for the models having fewer inputs. We hypothesize that the better results with fewer spectral bands are due to the better ratio of the amount of training data to the number of trainable weights in the network (see ratio (Nt/Nw in Table 6): the training algorithm has more weights to optimize with the models with more inputs that also bring additional noise components to the optimization task. The ratio (Nt/Nw may also partly explain the differences in the model accuracy between 3 × 3 px and 5 × 5 px image sampling windows. However, as the selection of the best predictive S2 bands vary for different forest variables, one should include a feature selection step for optimal performance, especially if the amount of good quality reference data for DNN model training is limited, as in this study. The benefit of adding layers to the neural network was marginal, and we obtained the best RMSE% performance already with networks having two hidden layers (Figure 5a). Using more than two hidden layers did not improve RMSE% and seemed to increase the variance in bias (Figure 5b). On the other hand, the DNNs with 3 to 12 hidden layers kept performing almost as well as the two-hidden-layer network.
The probability of the gradients to go to zero during DNN training seemed to increase rapidly with the depth of the network. As seen in our study, the selection of SELU activation instead of ReLU improved the situation but with the drawback of increased uncertainty of the model quality. Choosing a different weight initialization might improve the situation. Another way would be to use, e.g., a ResNet-like architecture with skip connections [57]. However, as the skip connections require the number of network elements not to change in between the skipped layers, this solution tends to increase the number of weights in the network. As there was no benefit seen from using more than two or three hidden layers in the DNN, this was not studied further.
As seen from the results with a limited number of field plots, the practical minimum number of field plots is somewhere between 200 and 330 plots in our experimental setting. On the other hand, the stem volume prediction performance (RMSE%) improved by less than 1% (pp) when increasing the number of field plots from 990 to 1467. Based on this observation, one could expect the marginal utility of adding more training data (i.e., field plots) with the present predictive inputs would be relatively small. This means that to gain more accuracy, one must include additional features that contain relevant information about the variable to be predicted. Alternatively, one could choose a completely different approach, e.g., the use of time series analysis and different DNN architecture.
Typically, the amount of training data is very large in deep learning applications, which enables DNN architectures with up to several millions of trainable parameters. When building models for forest variable estimation this is seldom the case, due to the costs of fieldwork and the size of data sets available for the model training (often consisting of only 200 to 500 up-to-date field plots). A widely used technique to increase the size of training set is data augmentation [58], in which the available relevant training data are multiplied with different techniques (e.g., image mirroring or rotation). It has to be noted that when using these data augmentation techniques the consistency with other related input variables must be preserved. In our case, this would mean that, in addition to image data, the sun and imaging azimuth angles as well as the slope data must be rotated/mirrored accordingly. The analysis for the impact of data augmentation in the current context will be the subject of a future study.

Comparison with Similar Studies
A few recent remote sensing studies estimating stem volume in boreal forests have used similar inputs or methods as the present study [43,[59][60][61]. A study concentrating on improving Finnish multi-source national forest inventory by 3D aerial imaging compared the accuracy and spatial characteristics of 2D satellite (Landsat 8) and aerial imagery as well as 3D airborne laser scanning (ALS) and photogrammetric remote sensing data in the estimation of forest inventory variables using the KNN method [43]. Volume estimations using Landsat 8 data or aerial imagery produced poor accuracies with RMSE% values of over 60%. Even their combination produced poor accuracy of approximately 58%. Estimations carried out in the present study using only 2D optical data produced clearly higher accuracies (Figure 3). This result was in line with earlier study by Astola et al. [32] that compared the use of Landsat 8 and Sentinel-2 data for forest variable estimation. The highest accuracy (RMSE = 27.80%) for the volume estimation in [43] was obtained using a combination of ALS and 2D aerial imagery, which had very similar accuracy to the best accuracy obtained in the present study (RMSE% = 28.6%, Figure 3). Thus, it can be noticed that our models with CHM features included as predictors in combination with S2 data achieved accuracies comparable to those reported by [43] using 3D ALS data and aerial imagery. This is encouraging to note and highlights the potential for using openly available Sentinel-2 data instead of aerial imagery for large area inventories, considering the increasing availability of ALS data.
In a study conducted in southern Finland the performance of multitemporal Sentinel-2 data was determined for estimating several forest variables using random forest [59]. In the study, the multitemporal results were compared to those of single date Sentinel-2 data, ALS, stereo-SAR, and high-resolution 3D satellite data. The study area was relatively small and consisted of only 74 plots (32 × 32 m 2 ). ALS produced the best estimation accuracy (RMSE%) for volume (14.17%), and the 3D photogrammetric data from the highresolution satellite data produced the second highest accuracy (17.33%). The multitemporal Sentinel-2 gave higher accuracy for volume than the stereo data from SAR (27.20% and 30.22%, respectively). These results highlight the benefit of 3D information when estimating forest structure variables. Our results support this, since the addition of canopy height information increased the volume estimation accuracy (Figure 3). In Sweden, a couple of recent studies have obtained similar results to ours when estimating stem volume.
For instance, Sentinel-2 and TanDEM-X data were combined in a study to estimate stem volume over large areas in Sweden using the KNN method (k = 5) [60]. The Level-1C data and 10 m and 20 m spatial resolution bands were used together with windowed radar data (3 × 3 and 5 × 5 pixels, similarly to our study). The obtained stand level accuracy (RMSE%) using the combination of Sentinel-2 and TanDEM-X was 30.2%. Using only Sentinel-2 and TanDEM-X the accuracies were 37.9% and 32.0%, respectively. The use of interferometric phase height from TanDEM-X provided the basis for better volume estimations. Sentinel-2 data were considered important for tree species mapping. Another study that was conducted in mid-and south Sweden developed linear regression models using only 3D information from stereophotogrammetry of aerial images and ALS (i.e., no 2D optical data) [61]. The estimations were validated on stand level. In mid Sweden, where forest types are more similar to Finland than in southern Sweden, the accuracy (RMSE%) for stem volume estimation was approximately 22% when using point cloud metrics from aerial images and ALS as predictor variables. In addition, these studies carried out in Sweden agree with the fact that 3D information is important for volumetric predictions in boreal forests, as shown by our results.
Another study using rather similar inputs or methods compared with the present study was carried out in the southeast of Poland, outside a boreal forest. In the study, growing stock volumes of Scots pine stands were estimated using Sentinel-2 images and airborne image-derived point clouds [62]. Multiple linear regression and random forest were used for plot level (400 m 2 ) estimations. It was investigated whether the inclusion of Sentinel-2 data improved the accuracy of models based on the 3D point cloud data. The estimation accuracy (RMSE%) with random forest using only Sentinel-2 data was approximately 36%. The accuracy was notably improved when using only the point cloud data or when combining Sentinel-2 and point cloud data. Accuracy for both was approximately 20% using random forest. Even though the study did not find any major benefits from combining Sentinel-2 data with 3D information in plot level estimations, the difference in accuracy was notable when compared with a situation where only satellite data were used, similarly to our results, e.g., in Figure 3.
A recent study in different forest conditions, but with similar model inputs to our study, was conducted in southwestern Germany [63]. In the study, multi-temporal Sentinel-2 data and 3D photogrammetric point clouds were combined and used to estimate timber volume. In the study, Sentinel-2 data were used to model the percentage of broadleaf tree volume. This was also used as one the metrics in timber volume estimations in addition to data from CHM that were derived from the photogrammetric point cloud. A non-linear logistic regression model was used to estimate timber volume, and the achieved accuracy was RMSE% = 31.7%, which is very similar to the accuracy produced by our model with combined 10 Sentinel-2 bands and CHM features (S2-10 [E] w3 3L, Figures 3 and 4).
Other studies using deep neural networks in different conditions, but with similar findings to our study regarding, e.g., Sentinel-2 sampling window size and the importance of Sentinel-2 spectral bands or the optimal DNN depth, were reported by [28,31].
In this study, the focus was on applying DNN to satellite data, and we did not make full use of the 3D information provided by ALS data, when computing statistical features of the 1 m resolution CHM data. A more sophisticated way would be to include the CHM at its original 1 m resolution or to use the ALS metrics from point clouds directly. This would most likely require a change to the DNN architecture, e.g., to select CNNs instead of fully connected dense layers [64,65]. Furthermore, as the number of trainable parameters would increase along with the number of predictive inputs, a larger amount of training data would probably be needed.

Comparison between DNNs and RF
In our study, the deep neural network models and random forest models produced equivalent test set performances, and we could not find any preference for selecting between these two methods for this kind of application. The hyper-parameter tuning was slightly more demanding for DNNs, but with models of the size used in our study, this is not an issue. One benefit of available DNN architectures is that they are able to extract meaningful features from image texture (e.g., CNN) or time series (e.g., LSTM), which must be produced separately in prediction systems based on other methods, as pointed out in [7].
Additionally, in some other studies, deep learning models have not been outstandingly better than RF models. For instance, [64] reported that, in general, their deep learning method for growing stock volume prediction produced the best results, when compared to other methods (e.g., RF or k-NN), but that the differences were small. Using the Landsat and ALS predictors, they reported RMSE% = 24.16%, BIAS% = −2.2%, and R 2 = 0.38 with deep learning, and RMSE% = 25.18%, BIAS% = −5.32%, and R 2 = 0.39 with RF model. Narine et al. [28] compared DNN to RF methods in three different scenarios: daytime, nighttime, and no noise. They reported equal performance in AGB prediction for both models, DNN being slightly more accurate for daytime and nighttime, and vice versa for the no noise scenario.
On the other hand, many other studies have reported results showing DNNs outperforming RF in remote sensing applications [7,26,27]. The Chimera model presented in [7] outperformed SVM and RF models with the same input data for all tested classification and regression tasks, including, e.g., basal area and quadratic mean diameter. In the study, data augmentation was used to train the Chimera RCNN (with more than 1.3 million parameters) with a relatively small amount of reference data (<10,000 field plots). The root-mean-square errors they obtained for basal area (7.1 m 2 /ha) and quadratic mean diameter (9.9 cm) were of the same order as in our study: 5.8 m 2 /ha for basal area and 5.4 cm for mean diameter at breast height. Shao et al. [26] estimated forest biomass in China. They obtained satisfactory estimation accuracies when using LiDAR-derived AGB as ground reference data. The best mapping accuracy was obtained using SSAE model that outperformed more traditional methods, such as KNN, SVM, and RF. Zhang et al. [27] also mapped AGB in China. Their study targeted at developing a novel approach for predicting AGB by integrating Landsat 8 imagery with LiDAR data through a deep learning-based workflow. SSAE was used as deep learning model, and it was found to be superior over the other prediction models (e.g., RF, KNN, SVM).

Transfer Learning Performance and Applicability
Transfer learning produced equal or better overall performance than the reference model when all the pre-trained model weights were refined with field plot data. Our tests showed that transfer learning is beneficial mainly with small training data sets (fewer than 250 field plots), not with larger amounts of training data.
In our tests, we observed the difficulty of the network to learn the characteristics of the new area, when even only the lowest layer weights of the pre-trained network were frozen, as seen also by [38]. However, their observation that even features transferred from a distant task would be better than random weights was not supported by our results: when enough training data were available (i.e., >250 field plots), the performance of the refined transfer learning model was not better than the model trained from scratch. However, in our case the model was much simpler; thus, the results are not necessarily comparable. Authors in [20] showed that training deep models from scratch on remote sensing data usually gave higher accuracy than transfer learning for tasks related to vegetation.

Conclusions
In this study we tested different combinations of model predictors and two predictive methods, i.e., deep neural network (DNN) and random forest (RF) algorithm, for forest variable estimation. The model inputs were selected from several data sources: Sentinel-2 image data and metadata, digital elevation model (DEM), terrain slope data, and canopy height model (CHM). We tested different factors influencing on the model predictive performance: the image sampling window size, the importance of different input features as model predictors, the DNN depth, and the number of field plots used for training. In addition, we tested the utility of transfer learning in Finnish boreal forest conditions.
The concept of using Sentinel-2 metadata features and terrain elevation and slope with the image reflectance data in combination with windowed image sampling scheme proved successful, and we were able to gain significant improvement (8% pp. in RMSE%) in total stem volume prediction accuracy. Increasing the number of hidden layers in DNN models improved the performance only marginally, with two or three hidden layers being the optimum depth in our tests.
Both DNN and RF models were suitable for the forest variable estimation in the proposed modeling setup, with approximately equal performance. The tuning of hyperparameters was slightly easier with RF models, which makes this method more appealing. When fewer than 300 field plots were available for model training, the errors with both methods increased rapidly along with decreasing amount of training data.
The advantage of deep neural network is the possibility to implement transfer learning easily. DNN transfer learning facilitates the refinement of an existing model to new areas and for new variables with a relatively small amount of new reference data. Thus, the major advantage of using transfer learning is that the expensive and time-consuming fieldwork may be reduced considerably, which may be found profitable in, e.g., regularly repeating forest inventories. The reduction in model performance with small amounts of reference data may also be compensated by using transfer learning.
As an outcome of the study we were able to improve the forest variable estimation performance with the proposed image sampling and input feature concept that is straightforward to implement with DNN or RF algorithms, and that has small computational impact. The developed deep learning concept with the transfer learning option will be implemented in Forestry Thematic Exploitation Platform Forestry-TEP (European Space Agency, https://f-tep.com/; accessed 17 June 2021) in the near future to be available for the remote sensing community.
Our research interests in the future include the potential improvement in model performance when using data augmentation techniques to multiply the amount of reference data, and to investigate the usage of multitask learning for forest variable regression. In addition, the utilization of ALS point clouds in combination with deep neural networks is an interesting topic to explore.

Data Availability Statement:
The data presented in this study are available upon reasonable request from the corresponding author. The data are not publicly available due to reference data license conditions.