Beyond Measurement: Extracting Vegetation Height from High Resolution Imagery with Deep Learning

Measuring and monitoring the height of vegetation provides important insights into forest age and habitat quality, which are essential for applications that rely on up-to-date and accurate vegetation data. Current vegetation sensing practices involve ground surveys, photogrammetry, synthetic aperture radar (SAR), and airborne light detection and ranging (LiDAR) sensors. While these methods provide high resolution and accuracy, their hardware and collection effort prohibit highly recurrent and widespread collection. In response to the limitations of current methods, we designed Y-NET, a novel deep learning model that generates high resolution models of vegetation from highly recurrent multispectral aerial imagery and elevation data. Y-NET's architecture uses convolutional layers to learn correlations between different input features and vegetation height, generating an accurate vegetation surface model (VSM) at 1 × 1 m resolution. We evaluated Y-NET on 235 km² of the East San Francisco Bay Area and find that Y-NET achieves low error relative to LiDAR when tested on new locations. Y-NET also achieves an R² of 0.83, and side-by-side visual comparisons show that it can effectively model complex vegetation. Furthermore, we show that Y-NET is able to identify instances of vegetation growth and mitigation by comparing aerial imagery and LiDAR collected at different times.


Introduction
Frequent and accurate assessments of vegetation are critical for managing forest biomass, designing effective mitigation strategies in anticipation of wildland fires, and managing vegetation in the wildland urban interface (WUI) [1][2][3][4]. Moreover, quality information about forest structure and biomass helps analysts develop smarter logging strategies, predict economic yield, and analyze the effects of vegetation in socio-technical systems. Vegetation, of all the common geographic information systems (GIS) data layers, is arguably the most dynamic due to constantly changing shape and density, a trend that appears to be accelerating with climate change [5]. An accurate model of vegetation height, which we refer to as a vegetation surface model (VSM), provides important insights for environmental planning, risk management, and economic benefits.
Current methods of remote sensing for detailed vegetation information include manual ground surveys, aerial photogrammetry, synthetic aperture radar (SAR), and light detection and ranging (LiDAR) [6][7][8][9]. While these methods are widely used, they are all subject to multiple limitations. Previous work has explored the possibility of extracting a VSM from low-resolution multispectral images (Figure 1c) [22][23][24]; however, current practical applications often require higher resolution data. For example, wildland fire suppression is most effective when direct attack methods are used, which involve firefighters working close to the fire perimeter. Current state-of-the-art fire models use 30 × 30 m spatial resolution imagery to predict fire spread [25,26], causing a spatial disconnect between models and firefighting strategies on the ground. Furthermore, since current methods have limited range and recurrence, up-to-date information from current practices would require premeditated scanning strategies shortly before the fire, when smoke occlusion is not yet an issue. This would require accurately predicting where fires will happen beforehand, which is not feasible. In contrast, highly recurrent aerial imagery would likely include a recent image prior to the fire.
We developed Y-NET, a novel deep learning model, to overcome the scanning recurrence and range limitations of current methods and to improve the spatial resolution of previous related work. Y-NET generates a VSM with 1 × 1 m resolution given multispectral 2D aerial imagery (Figure 1b) and terrain data. Y-NET's architecture is designed with a priori knowledge of the input data to learn from imagery and terrain inputs separately, combining the compressed data in latent space for accurate vegetation height estimations. To validate the efficacy of our approach, we use aerial imagery obtained from the United States National Agriculture Imagery Program (NAIP) [27], and digital elevation models (DEMs) and LiDAR scans from the USGS [21], all of which are freely available to the public. We evaluated Y-NET in the East San Francisco Bay Area, a highly-researched area at high risk of wildland fire, where mitigation strategists are actively looking for high resolution modeling through multi-million dollar government funded research grants [28][29][30].
Y-NET uses supervised learning, meaning the model trains on landscape A, for which LiDAR data are included as a ground truth measure. Y-NET is then evaluated on landscape B, where the model only receives imagery and terrain data to recreate the 3D structure of B. Regions in landscape A are separated into the commonly used, disjoint training and validation datasets. Landscape B constitutes the testing dataset, which is completely spatially separate from landscape A. To evaluate the performance of Y-NET, we compare the generated VSM for landscape B with the corresponding LiDAR and calculate the pixel-wise error in height, consistent with other work [14,31,32]. We find empirically that Y-NET achieves high accuracy and a high R² value with 1 × 1 m spatial resolution data by formalizing the problem similarly to semantic image segmentation (as in past work [33][34][35]).
While we evaluated Y-NET for spatial generalizability (training on A, testing on B), we expect that temporal generalizability may also be possible given available data (training on A_t, testing on A_{t+i}, where t and i represent time). However, a proper evaluation of temporal generalizability requires high-resolution multispectral imagery and LiDAR from two distinct times for the same location. Two LiDAR scanning missions covered our study site, one from 2007 and one from 2018. Since four-band multispectral NAIP imagery was not collected in our study site in 2007, we were unable to perform a proper temporal generalizability evaluation. However, in Section 4 we visually compare Y-NET's VSM generated from 2016 four-band NAIP imagery against the 2007 LiDAR in our study site, and show that Y-NET can effectively identify vegetation mitigation and growth given temporal differences. We leave a full in-depth analysis of temporal generalizability to future work, contingent on available data.
The contributions of our work are three-fold: (i) we developed Y-NET, a novel deep learning model capable of highly accurate VSM generation given four-band multispectral imagery and terrain data, (ii) we evaluated Y-NET on real data from a highly-studied region at high risk of wildland fire, both statistically and visually, for spatial generalizability, and (iii) we visually evaluated the potential for temporal generalizability of Y-NET and show that it can identify vegetation mitigation and growth.

Study Site
Our study was conducted in the East San Francisco Bay Area (Figure 2), chosen for its dense WUI, high vulnerability to wildland fire, and prominence in past vegetation studies [3,4,30,36,37]. Our study site covers approximately 235 km² and contains a broad range of vegetation heights, from 0 to 77 m, composed of grasslands, shrublands, woodlands, and forest. This region is well monitored and studied for remote sensing, environmental planning, and natural disaster research. In addition to having LiDAR available to benchmark Y-NET, our study site is located in a Mediterranean climate, which is known to be at high risk of wildland fire and requires strict vegetation mitigation strategies [38][39][40][41].
Supporting up-to-date vegetation monitoring for applications such as wildland fire modeling and vegetation mitigation, in areas where VSM information is crucial, is the main motivation of our work.

Figure 2. Our study site is located in the East San Francisco Bay Area, covering roughly 235 km² of the highly-researched wildland urban interface (WUI) that is at high risk of wildland fire. Projection/Datum: Contiguous USA Albers / NAD 1983 (2011).

Data Features and Processing
Y-NET utilizes nine layers of input data, divided into two groups: visual data (six layers) and terrain data (three layers). In practice, these features may be collected for any available date; however, to match the LiDAR used to train and evaluate our model, we selected data from 2018. The study site was divided into non-overlapping tiles of 64 × 64 pixels. The study area (235 km²) contains over 235 million 1 × 1 m pixels for each layer of input data, amounting to 57,373 tiles of size 64 × 64 × 9. We divided this dataset into spatially disjoint training, validation, and testing datasets, explained further in Section 2.5. The validation dataset was used to assist the training phase; we emphasize that the model was not evaluated on this dataset, and that LiDAR is not an input to the model during evaluation or future deployment.
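For concreteness, the tiling step can be implemented as in the minimal sketch below, which assumes the nine co-registered 1 × 1 m input layers have already been stacked into a single array; the array and function names are hypothetical.

```python
# Minimal tiling sketch: cut a stacked (H, W, 9) raster into 64 x 64 tiles.
# `stack` is a hypothetical numpy array of the nine co-registered layers.
import numpy as np

def make_tiles(stack, tile=64):
    h, w, _ = stack.shape
    tiles = []
    for i in range(0, h - h % tile, tile):      # drop partial tiles at the edges
        for j in range(0, w - w % tile, tile):
            tiles.append(stack[i:i + tile, j:j + tile, :])
    return np.stack(tiles)                      # shape: (num_tiles, 64, 64, 9)
```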

Visual Data
We collected 1 × 1 m resolution multispectral aerial imagery from a 23 July 2018 United States NAIP imaging mission and projected it to the NAD_1983_2011_Contiguous_USA_Albers coordinate system and Albers datum using the ArcGIS Project tool from ESRI [42]. NAIP provides red (0.6-0.7 µm), blue (0.4-0.5 µm), green (0.5-0.6 µm), and near-infrared (NIR, 0.8-0.9 µm) bands of imagery, which are commonly used to identify important vegetation signatures. In addition to the bands themselves, we combine the red and NIR bands to generate another input layer, the normalized difference vegetation index (NDVI), using ArcGIS's Raster Calculator tool [43,44]. NDVI represents the normalized vegetation health within each pixel so that NDVI ∈ [−1, 1], where a higher value corresponds to healthier vegetation.
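The NDVI layer can equally be computed directly from the red and NIR bands; a minimal sketch, assuming the bands are available as NumPy arrays, is:

```python
# NDVI = (NIR - Red) / (NIR + Red); a small epsilon avoids division by zero
# over water or no-data pixels.
import numpy as np

def ndvi(red, nir, eps=1e-9):
    red = red.astype(np.float32)
    nir = nir.astype(np.float32)
    return (nir - red) / (nir + red + eps)   # values fall in [-1, 1]
```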
To ensure that Y-NET was only exposed to pixels that contain vegetation, we included a layer to mask out structures and roads that may harm training of the model. We used the iterative self-organizing (ISO) cluster unsupervised classification tool in ArcGIS [45] to perform an unsupervised classification on the four NAIP raster bands, separating vegetation pixels from water, roads, and structures, similar to recent work [46]. As an additional safeguard, we combined the generated mask with road footprints delineated from a GPS-verified, ground-based street centerline survey, and with building footprints generated by the AI team at Microsoft using a popular machine learning segmentation model, U-NET, with satellite imagery [47]. We refer to the result as the footprints layer, in which each pixel holds a binary value indicating vegetation or non-vegetation.
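The combination of these masks reduces to a pixel-wise logical OR; an illustrative sketch, assuming boolean arrays in which True marks non-vegetation, follows.

```python
# Combine the ISO-cluster non-vegetation classes with road and building
# footprints into one binary footprints layer (1 = vegetation, 0 = not).
import numpy as np

def footprints_layer(iso_nonveg, roads, buildings):
    non_vegetation = iso_nonveg | roads | buildings
    return (~non_vegetation).astype(np.uint8)
```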
The four NAIP bands (red, blue, green, and NIR), NDVI derived from NAIP, and footprints layer detailed above make up the model inputs composed from aerial imagery, and are passed to the visual branch of Y-NET, discussed further in Section 2.5.

Terrain Data
The ground surface elevation and shape can have a major impact on the type and condition of vegetation; therefore, we denote the second class of input data to our model as terrain data. Our terrain data include a DEM obtained from the USGS, and slope and aspect layers derived from the DEM using the Slope and Aspect tools provided within ArcGIS [48,49]. DEMs are remotely sensed raster images with raw elevation values for each pixel, measured as the elevation of the ground surface above mean sea level, ignoring vegetation and human-built structures on the ground. The USGS DEM has 1 × 1 m spatial resolution, a non-vegetated vertical accuracy (NVA) of 0.086 m at the 95% confidence level, and a vegetated vertical accuracy (VVA) of 0.27 m at the 95% confidence level. Pixels in the aspect layer a represent the compass direction in degrees from north that the ground faces, so that a ∈ [0, 359]. Pixels in the slope layer s represent the degree to which the ground is tilted from flat within each pixel, so that s ∈ [0, 90].
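Slope and aspect can also be approximated directly from the DEM with finite differences; the sketch below is a rough stand-in for the ArcGIS Slope and Aspect tools we used, assuming the DEM is a NumPy array with rows running north to south and columns west to east.

```python
# Approximate slope (degrees from flat) and aspect (degrees clockwise from
# north, pointing downslope) from a 1 m DEM. This is not the ArcGIS algorithm.
import numpy as np

def slope_aspect(dem, cell_size=1.0):
    dz_drow, dz_dcol = np.gradient(dem.astype(np.float32), cell_size)
    slope = np.degrees(np.arctan(np.hypot(dz_dcol, dz_drow)))      # [0, 90]
    # Downslope east component = -dz_dcol, north component = +dz_drow
    # (row index increases southward under the stated orientation).
    aspect = np.degrees(np.arctan2(-dz_dcol, dz_drow)) % 360.0     # [0, 360)
    return slope, aspect
```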

LiDAR Derived Dataset Labels
While training a supervised deep machine learning model, the model must be exposed to labels defined as ground truth in order to learn correlations and a distribution over real data, ideally generalizing to produce labels for unseen instances. We used airborne LiDAR as the source of our ground truth, as it is the state-of-the-art 3D remote sensing technology. LiDAR lasers record reflections for every object they contact to create a point cloud in 3D space, the last surface generally being the Earth's surface. We collected LiDAR data for our study site from the USGS, scanned from December 2017 to April 2018 by Quantum Spatial, Inc. using a VQ1560i laser scanning system from Riegl of Horn, Austria [50,51], at approximately the same time as the NAIP imagery. The collected LiDAR data use the horizontal projection of Contiguous USA Albers, in meters, with the NAD 1983 (2011) datum, the vertical datum of NAVD88 GEOID 12B, in meters, and a nominal pulse spacing (NPS) of 0.71 m. Using the first pulse of the LiDAR, we generated a digital surface model (DSM) representing the elevation above mean sea level of the first object that the lasers contacted on their path to the ground. We calculated the DSM using the USGS LiDAR, which was classified according to the American Society for Photogrammetry and Remote Sensing (ASPRS) standard LiDAR point classification. To generate the DSM, we used natural neighbor interpolation, which is less susceptible to small changes in the triangulation of Voronoi neighbors, via ArcGIS's LAS to Raster tool [52,53]. This follows Sibson's [54] area-stealing interpolation and avoids peaks, pits, ridges, valleys, passes, and pales that are not already represented by the input samples [55]. We performed pixel-wise subtraction between the DSM and DEM to generate a layer in which each 1 × 1 m pixel value represents the height of the object within that pixel.
In theory, this should result in a perfect representation of vegetation height, although we identify a high amount of noise in the 2018 LiDAR dataset with unrealistic heights, seen in box b of Figure 3. In response, we deployed a data processing algorithm to eliminate LiDAR noise and better represent the true landscape. Our process first identifies pixels whose heights appear to be anomalies, usually appearing as small groups or single pixels with heights that differ by orders of magnitude from their neighbors. We transformed the pixels into points with coordinates (i, j, z), where i and j are the row and column of the pixel in the study site, and z is the height in meters. We then ran the DBSCAN clustering algorithm [56]; points labeled as "noise" represent potentially anomalous pixels. DBSCAN has two input arguments, eps ∈ R+ and min_samples ∈ N. In the point representation of our dataset, the algorithm labels a point as noise if it is not within distance eps of a neighborhood containing at least min_samples points. This proves to be a good identifier of anomalies, as such points are isolated in 3D space (their immediate x, y neighborhood has extremely different height values), while most other points have most neighbors close in 3D space (their immediate x, y neighborhood has similar height values). As we are only concerned with identifying anomalies in our dataset, we refer to [56] for a further description of how DBSCAN defines entire clusters. After identifying the anomalous points and their corresponding pixels, the median height of each pixel's 3 × 3 neighborhood was assigned as the new pixel height value.
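A minimal sketch of this noise-filtering step, assuming the DSM-minus-DEM layer for a tile or small region is available as a NumPy array (the eps and min_samples values shown are illustrative, not our tuned settings), is:

```python
# Flag anomalous height pixels with DBSCAN on (row, col, height) points and
# replace them with the median of their 3 x 3 neighborhood.
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.ndimage import median_filter

def filter_lidar_noise(height, eps=2.0, min_samples=5):
    rows, cols = np.indices(height.shape)
    points = np.column_stack([rows.ravel(), cols.ravel(), height.ravel()])

    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    noise = (labels == -1).reshape(height.shape)   # DBSCAN "noise" label

    cleaned = height.copy()
    cleaned[noise] = median_filter(height, size=3)[noise]
    return cleaned
```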
An example of the performance of this algorithm is shown in Figure 3, where we filtered out noise in the layer representing object heights and assigned the anomalous values as the median of their neighborhoods over the spatial layer. DBSCAN effectively removes random anomalies but preserves the power lines in the LiDAR data, because those clusters of pixels are similar in height. Known techniques such as the Hough transform have been shown to remove linearly shaped data from LiDAR [57]; we leave this further processing to future work and discuss the implications of not removing power lines in Section 4. We refer to this final layer as the normalized DSM (nDSM), and used it as the labeled data for supervised training of our model, as well as the ground truth height in our evaluation of Y-NET's performance. As discussed next, the labels were used to update the weight parameters of our neural network during the training phase through backpropagation, by calculating the difference between Y-NET's predictions and the corresponding labels at each iteration of training.

Deep Learning Background
The field of machine learning is built on the fundamental idea of learning directly from experience and data [58]. The hierarchy by which a computer identifies low-level visual representations (i.e., identifying edges first, then corners, then object parts in images) combines into a rich and complex understanding of an image; this hierarchical composition is the essence of deep learning [59]. Training a deep learning model is the process of approximating a theoretical true underlying distribution of observed data, P_data. Although a perfect representation of P_data is theoretically impossible, we can optimize the parameters of an approximating function P_model given observed data samples. For example, in least-squares linear regression we find the optimal parameters (slope, intercept) to fit a line (P_model) to observed samples, best approximating the true underlying distribution (P_data). Using a neural network with parameters θ as P_model, we approximate P_data by observing data (i.e., images, text, and numerical vectors) to learn important features and fit θ so that the model hopefully generalizes well to unseen data from the same distribution.
The process of training a neural network can be defined as an optimization problem which minimizes a defined loss function (a loss function is sometimes also called a cost function or objective function), visualized as a landscape representing how well θ of P_model approximates P_data. The global lowest point of the loss function corresponds to the best possible θ for estimating P_data. Minimizing this function is often done through gradient descent [60], which continuously makes small changes to θ to decrease future error via backpropagation [61]. Backpropagation calculates the derivative of the loss function with respect to θ and updates θ to give more accurate estimations in the future, equivalent to taking a step in the downward direction of the loss function according to the calculated gradient. There are a variety of loss functions available, but our model presented in Section 2.5 minimizes the mean squared error (MSE) loss function, defined over n estimations as

MSE = (1/n) ∑_{i=1}^{n} (ŷ_test,i − y_test,i)²,

where ŷ_test is the set of estimations from P_model and y_test is the set of ground truth labels.

Unlike feedforward regression or classification deep learning models, autoencoders are composed of two modules: an encoder that maps inputs to a compressed feature mapping h, and a decoder that maps h back to the original shape, shown in Figure 4. These modules compress input data while preserving important representations for a defined goal, such as image segmentation [62][63][64]. Autoencoders consist of any combination of fully connected layers, convolutional layers [65][66][67][68], and pooling or dropout layers [69,70]. While fully connected neural networks are used for single-dimension inputs, convolutional layers use a sliding kernel to preserve the spatial relationships of the input and are typically used for images, videos, or sensor data. Dropout layers ignore a subset of the network's parameters at each iteration of training, ultimately reducing the reliance between individual variables in θ. An example of using autoencoders for image segmentation is training a model φ on input data with α features so that φ : R^α → R, where α = 3 for a natural image (a natural image is one taken from a camera, i.e., how humans perceive the world, as opposed to a PDF or scanned document) (RGB), as seen in Figure 4. Further details of autoencoders and representation learning are beyond the scope of this paper, and we refer the reader to [59] for a deeper understanding.
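As a toy illustration of minimizing an MSE loss by gradient descent, consider the linear-regression example above; the data and step size below are made up.

```python
# Fit slope and intercept (theta) to noisy samples by repeatedly stepping
# against the gradient of the MSE loss.
import numpy as np

x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + np.random.normal(0, 1, x.size)   # samples from "P_data"

theta = np.zeros(2)          # [slope, intercept] of P_model
lr = 0.01                    # learning rate (step size)
for _ in range(2000):
    y_hat = theta[0] * x + theta[1]
    error = y_hat - y
    # d(MSE)/d(slope), d(MSE)/d(intercept)
    grad = np.array([2 * np.mean(error * x), 2 * np.mean(error)])
    theta -= lr * grad       # move downhill on the loss landscape
```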

U-NET
Popular applications for deep learning autoencoders include autonomous driving, object detection, and medical imaging. A popular model that achieves state-of-the-art performance for tumor identification is U-NET, named for the shape of the model's architecture [72]. U-NET is an extension of the classic convolutional network designed to learn from fewer training examples by using both context-aware upsampling and high-resolution data in the decoder. As mentioned in Section 2.2, U-NET was also used by Microsoft AI to identify the buildings we include in our footprints mask. U-NET uses convolutional and pooling layers to encode features into h, and convolutional and upsampling layers to decode h, with a final layer using a sigmoid activation function. The sigmoid saturates each pixel's estimation toward either 0 or 1, partitioning the output image into two sets, for example tumor and background. This can be thought of as performing a classification for every pixel in an input image.
The main difference in the U-NET architecture is the use of skip connections, which directly copy high-resolution data from the encoder to the decoder to be paired with context-aware upsampled data. Skip connections also provide a shortcut for gradient updates of θ to travel directly back to early layers of the network. Recent work has used U-NET for one-dimensional (1D) data anomaly detection, creating an abstraction called MU-NET for multivariate 1D sensor data [73]. MU-NET has a separate encoder for each variable and concatenates each h together before passing the data to a single decoder; however, their code is not public and they do not provide details of the concatenation or skip connection configuration coming from multiple encoders. To highlight the power of our model's architecture, we included U-NET in our evaluation and replaced its output activation function with a linear transformation to allow for height estimations greater than one.

Architecture and Training
Y-NET uses a novel architecture to generate a VSM from the multispectral aerial imagery and terrain data presented in Section 2.2. Y-NET uses an autoencoder framework, which disentangles important features of the input data and maps the output to the same spatial dimensions as the input. Whereas autoencoders are typically used for classification, we designed Y-NET to generate regression estimations of vegetation height. Figure 5 shows the Y-NET architecture, where the grey boxes represent convolutional layers, empty boxes represent dropout layers, plus signs represent concatenation, and the blue, red, and green arrows represent max pooling layers, upsampling layers, and skip connections, respectively. While convolutional layers, dropout layers, and skip connections were discussed in Section 2.3, we detail the functionality of concatenation, max pooling, and upsampling layers in this section. We designed Y-NET with two identical input encoder branches to leverage the unique properties of the visual and terrain data separately, similar to feature grouping in other work [74]. We list the output dimensions of each convolutional layer throughout the network to illustrate how the data are transformed. The model receives samples of terrain data of size 64 × 64 × 3 in the terrain branch, and samples of visual data of size 64 × 64 × 6 in the visual branch. Each encoder passes the data through two convolutional layers at a time, followed by a max pooling layer, which decreases the height and width dimensions of the data by assigning the maximum value within a 3 × 3 neighborhood. The convolutional layers increase the third dimension through nonlinear transformations, shown in each stage of the model.
The latent representations from each encoder both have size 4 × 4 × 1024, and are concatenated, or joined, along their third axis to create h, which then moves through the single decoder on the right of Figure 5. Skip connections copy information from the output of each encoding stage directly to the corresponding decoder stage, also providing a direct path for gradient update information to flow to early layers of the network during training. Y-NET generates a regression output of vegetation heights in R^1 (size n × m × 1) given input features in R^α (n × m × α), where α is an arbitrary number of input layers for each pixel; m = n = 64 pixels and α = 9 in our experiments.
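The sketch below illustrates the Y-shaped, two-encoder idea in Keras; the filter counts, number of stages, and other hyperparameters are illustrative and not the exact Y-NET configuration described above.

```python
# A minimal Keras sketch of a two-branch, Y-shaped encoder-decoder for
# 64 x 64 tiles with 6 visual and 3 terrain channels.
import tensorflow as tf
from tensorflow.keras import layers, Model

def encoder_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    skip = x                       # copied to the decoder via a skip connection
    x = layers.MaxPooling2D(2)(x)  # halve height and width
    return x, skip

def decoder_block(x, skips, filters):
    x = layers.UpSampling2D(2)(x)
    x = layers.Concatenate()([x] + skips)  # skips from BOTH encoder branches
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

visual_in = layers.Input((64, 64, 6))
terrain_in = layers.Input((64, 64, 3))

v, t = visual_in, terrain_in
v_skips, t_skips = [], []
for f in (64, 128, 256):
    v, vs = encoder_block(v, f)
    v_skips.append(vs)
    t, ts = encoder_block(t, f)
    t_skips.append(ts)

# Join the two latent representations along the channel axis to form h.
h = layers.Concatenate()([v, t])

x = h
for f, vs, ts in zip((256, 128, 64), reversed(v_skips), reversed(t_skips)):
    x = decoder_block(x, [vs, ts], f)

# Linear one-channel output: a height estimate for every pixel.
vsm = layers.Conv2D(1, 1, activation="linear")(x)

model = Model([visual_in, terrain_in], vsm)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
```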
We divided the dataset described in Section 2.2 into disjoint sets, using 56% for training, 14% for validation, and 30% for testing. For evaluation, the output of each tile in the testing dataset was compared with the corresponding nDSM, and we calculated the absolute error between the estimated height and the ground truth value. Again, we stress that the validation dataset was used only to determine the duration of training, and the evaluation uses only the testing dataset. As part of our underlying research, we asked vegetation researchers and fire experts to specify ranges of vegetation heights that are relevant to vegetation modeling and wildland firefighting. We show the defined ranges of interest and dataset sizes in pixels (1 × 1 m) in Table 1, and note that the ranges are also consistent with past studies using the imperial system of measurement [30,75]. Regarding the deep learning details of Y-NET, each convolutional layer uses the rectified linear unit (ReLU) activation function [76][77][78]. Three dropout layers with a dropout rate of 50% were included to reduce the reliance between individual parameters throughout the network. The model was trained using the Adam optimizer with a learning rate of 10e−4 and the MSE loss function, using the early-stopping method. Early stopping uses the validation dataset to assess the performance of the model after each training epoch, stopping training when the model begins to overfit to the training dataset. Y-NET was implemented using the TensorFlow deep learning library [79].
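An illustrative training call with early stopping, assuming the `model` from the earlier sketch and NumPy arrays for the spatially disjoint splits (all variable names are hypothetical), might look as follows.

```python
# Train with Adam + MSE, stopping when validation loss stops improving.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

model.fit(
    [train_visual, train_terrain], train_ndsm,                 # 56% split
    validation_data=([val_visual, val_terrain], val_ndsm),     # 14% split
    epochs=200, batch_size=32, callbacks=[early_stop],
)
```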

Motivation and Inspiration
While Y-NET uses an autoencoder framework, the actual architecture of the model is quite unique. The inspiration behind Y-NET's architecture came from the U-NET model, which is an extension of the so-called "fully convolutional network" [80] designed to learn important information from fewer training examples. Unlike three-band images, our input layers have far more than just red, green, and blue channels. We were also able to separate the input layers into two groups (visual and terrain data) that likely have little effect on one another. The kernel in convolutional layers encodes all input features together, and since we expected significant correlations between the visual and terrain input layers to be unlikely, we defined two separate encoder branches which feed into one decoder, forming the "Y" shape seen in Figure 5.
The higher resolution features from both encoders were combined with the context-aware upsampled output in the decoder through skip connections. Combining these data features together helps with feature localization in the eventual output. Incorporating the higher resolution from both encoders ensures that both visual and terrain input layers improve the localization of 3D features. Given our application domain, we expected that the culmination of dual encoders and using all skip connections would result in higher localization and feature estimation accuracy over other architectures.

Workflow & Implementation
The workflow of using Y-NET in practice includes minor GIS formatting of remotely sensed data for locations of interest, each defined by a rectangular polygon over a landscape. Currently, our workflow includes manually collecting the data from a desired source and clipping each layer to the exact same coordinates and dimensions over each location. We also used the ArcGIS Slope and Aspect tools to extract the slope and aspect layers from the DEM, and performed the unsupervised classification for the footprints layer, which must be visually validated by the user for accurate coverage, usually only for a subset of the location. The NDVI layer can be calculated using ArcGIS's Raster Calculator or with NumPy array calculations from the red and NIR NAIP bands. The size of the location only affects the time for the preprocessing algorithms to run, which is typically short. After preprocessing each location of interest, the user simply places the layers in a directory that Y-NET's code can access, and the remaining preprocessing steps of creating tiles and constructing the necessary datasets are completed automatically. When training a new Y-NET model, the user can choose to include all locations as a single input or to omit an entire location. Including all locations results in a random sampling of tiles for each dataset, while omitting a location results in that location becoming the entire testing dataset. Training the Y-NET model for our evaluation takes about 2 hours using an NVIDIA TITAN V GPU, and deploying it on the testing dataset takes only a few minutes. Once a Y-NET model is trained, it can be given any new location and generate a VSM in a few minutes, without retraining. The amount of time needed to generate a VSM depends on the size of the testing location; our evaluation completed in minutes. We acknowledge that an end-to-end workflow could include automatic data preprocessing given a user-defined polygon, speeding up the application workflow even more; however, we leave this implementation for future work.
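As a rough illustration of the deployment step, the sketch below tiles a preprocessed location, runs a trained model over each tile, and mosaics the per-tile estimates back into a full VSM; the stacked-layer array and the trained `model` are assumed from the earlier sketches, and the channel ordering (visual first, terrain last) is a hypothetical convention.

```python
# Tile a stacked (H, W, 9) location, predict per tile with a trained model,
# and mosaic the per-tile height estimates back into a single VSM raster.
import numpy as np

def generate_vsm(model, stack, tile=64):
    h, w, _ = stack.shape
    h, w = h - h % tile, w - w % tile            # crop to whole tiles
    vsm = np.zeros((h, w), dtype=np.float32)
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            patch = stack[i:i + tile, j:j + tile, :]
            visual, terrain = patch[..., :6], patch[..., 6:]
            pred = model.predict([visual[None], terrain[None]], verbose=0)
            vsm[i:i + tile, j:j + tile] = pred[0, ..., 0]
    return vsm
```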

Evaluation
As no previous work creates a VSM at 1 × 1 m resolution from multispectral imagery and terrain layers, our evaluation includes Y-NET, Y-NET without skip connections (Y-NET-noskip), the state-of-the-art segmentation U-NET model with a linear output layer, and a standard autoencoder. While photogrammetry and SAR have the ability to create a VSM, their collection processes require multiple special passes with specific measurements to be accurate, and they are themselves benchmarked against LiDAR for accuracy. Therefore, we compared each model against LiDAR (nDSM) and evaluated median absolute error in meters, mean absolute error in meters, root-mean-square error (RMSE) in meters, and R². Let Y represent the set of pixels in the testing dataset. R² is calculated as

R² = 1 − ∑_{i∈Y} (y_i − f_i)² / ∑_{i∈Y} (y_i − ȳ)²,

where y_i is the ground truth height of pixel i, ȳ = (1/|Y|) ∑_{i∈Y} y_i, and f_i is the height estimation from Y-NET for pixel i. An R² value closer to 1 suggests the model fits the ground truth landscape better, and negative values mean the model fits worse than simply predicting the mean height everywhere. Secondly, we plotted the cumulative density function (CDF) for Y-NET and for the separate height ranges from Table 1. A CDF details the percentage of estimations (y-axis) that are within a certain amount of error in meters (x-axis) from the ground truth, the nDSM in our case. For example, the CDF for an oracle with no error would show a perfectly vertical line ranging from 0.0 to 1.0 along the y-axis. We also conducted an extensive visual analysis to inspect the VSM from Y-NET on the testing dataset.

Table 2 shows the statistical results of every model in our evaluation, where Y-NET performs best for each metric. We find empirically that Y-NET achieves an R² value of 0.830, compared to 0.774 for Y-NET-noskip, and negative values for U-NET and the autoencoder. U-NET and the autoencoder fail to learn any important representations from the input data, despite numerous efforts to tweak their architectures. Both models make trivial predictions for every pixel in the dataset, showing that the standard autoencoder framework is unable to generalize toward estimating vegetation height. Figure 6 shows a colorized representation of the confusion matrices for Y-NET and Y-NET-noskip, normalized to improve visibility so that the darkest blue corresponds with the maximum value of each row, or the highest correlation between Y-NET and the nDSM. We included a 45° diagonal line to correspond with perfect estimations. Whereas Y-NET-noskip appears to under-estimate vegetation below 5 m and over-estimate vegetation from 5 to 15 m tall, Y-NET's estimations are noticeably more aligned with the diagonal and are less variable below 30 m, supporting Y-NET's higher R² value. We find it interesting that both models suffer from noisy estimations around nDSM = 60 m, which we attribute to LiDAR noise missed by the DBSCAN algorithm and to power lines, discussed further in Section 4. Turning to the CDFs in Figure 7, we highlight that the most important aspect of Figure 7b is that Y-NET effectively differentiates grasslands from other vegetation types. While Y-NET generally achieves low median estimation error, the long CDF tails are further evidence of anomalous estimation errors caused by data noise and power lines.
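The metrics above can be computed directly from the pixel-wise estimates; a minimal sketch, assuming flattened NumPy arrays of ground-truth nDSM heights and Y-NET estimates over the testing pixels, is:

```python
# Median/mean absolute error, RMSE, R^2, and the empirical error CDF.
import numpy as np

def metrics(y, f):
    err = np.abs(f - y)
    r2 = 1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return {"median_abs_error_m": float(np.median(err)),
            "mean_abs_error_m": float(np.mean(err)),
            "rmse_m": float(np.sqrt(np.mean((f - y) ** 2))),
            "r2": float(r2)}

def error_cdf(y, f):
    # x: sorted absolute error (m); y: fraction of estimations within that error.
    err = np.sort(np.abs(f - y))
    return err, np.arange(1, err.size + 1) / err.size
```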

Statistical Results
In summary, while U-NET and Autoencoder fail to learn effective representations, both versions of Y-NET learn associations between the input data layers and vegetation height to a certain degree. This result speaks to the importance of the data grouping architecture given a priori knowledge about the input features, and further to the skip connection configuration for localization and gradient flow within Y-NET. In particular, since the data layers from imagery are more similar than those of the terrain data, encoding them separately limits the amount of confusion when comparing dissimilar input features.

Visual Results
While statistical performance is important, a visual analysis of the VSM output from Y-NET can also reveal useful aspects of performance. We emphasize again that while Y-NET is trained by comparing against the nDSM for tile locations in the training dataset, Y-NET is never exposed to the nDSM for tiles in the testing dataset, which are entirely different physical locations than those in the training dataset. For example, the tiles shown in Figures 8 and 9 are examples from the testing dataset where Y-NET receives the visual and terrain data for a new location and generates a VSM.

Figure 8 shows six example tiles from the testing dataset, with rows representing data layers and columns representing different tile locations. Row 1 is the NAIP imagery, which we include for visual context as to what is located in each tile. Row 2 shows the nDSM, row 3 shows the VSM generated by Y-NET, and row 4 shows the difference between the two (rows 2-4 encoded using the color bar at the bottom of each column). Relating this to the workflow from Section 2, Y-NET receives multiple 2D layers of visual and terrain data similar to row 1 and generates the output in row 3, where each pixel value represents a height estimation. By calculating the difference between rows 2 and 3, we can assess the performance of the model on the testing dataset. We chose these samples from the testing dataset to help visualize the variance in the testing dataset distribution: low-forested areas (a, b), dense tall forest vegetation (c), and the wildland-urban interface (d, e, f). Analyzing each tile, we can see that Y-NET models vegetation height similarly to the nDSM for each location despite a wide distribution of landscape types, vegetation density, and shadowing present in the visual data. Perhaps the most important part of Figure 8 is row 4 for tiles a and f, where we notice that estimation error is often greatest around the perimeters of tree canopies. We discuss relief displacement as a potential cause of this in Section 4.

Figure 9 shows more examples from the testing dataset where we project the NAIP image over the 3D VSM generated by Y-NET and over the nDSM for each tile. We selected these tile locations to show 3D models with varying amounts of distribution variability, similar to Figure 8. Comparing each tile location, we can see that Y-NET generates an effective VSM for tiles with varying vegetation densities from the publicly available input data. This evaluation demonstrates the strong spatial generalizability of Y-NET: the model can train on locations A and generate highly accurate and representative VSMs for locations B. As only a fraction of the world has been measured even once with advanced methods such as LiDAR, strong spatial generalizability shows that Y-NET is capable of producing accurate VSMs over large areas of land which may never have been scanned with LiDAR. Widespread availability of high resolution 3D spatial data will help support important applications such as wildland firefighting and mitigation strategies, one motivation behind this research, among many others.

Discussion
Although we envision Y-NET being used with frequently collected multispectral imagery from satellites, we validated our model using NAIP imagery collected by aircraft because it is in the public domain. As NAIP imagery is obtained from lower elevation flights, relief displacement [82] is expected for tree canopies, radial from the center of the image, an imperfection not as prevalent in satellite imagery [83]. With relief displacement, tree canopies are projected slightly further away from the tree center, and the displacement increases with steeper topography. Although NAIP imagery is orthorectified to account for this, imperfections in the relief displacement correction may cause slight misalignment between NAIP and LiDAR in the same projection. We hypothesize this may be the cause of the error seen around the circumference of tree canopies in Figure 8 for tiles a and f, and that it could be minimized when using privately collected satellite imagery.
Furthermore, it is also important to note that while the LiDAR in our study was recorded from December 2017 to May 2018, the NAIP imagery was taken in July 2018. According to the Moraga-Orinda Fire District webpage, the government-mandated vegetation mitigation deadline within our study site is 15 June [81]. This means that the grasslands are expected to be taller in the LiDAR than in the NAIP image; therefore, we suspect the 80th percentile error of 0.24 m for the 0-0.6 m range to be slightly exaggerated. While collecting LiDAR and imagery on the same day would allow for an ideal performance analysis, the time consumption and cost of collecting LiDAR make this infeasible for large swaths of land.

Long CDF Tails
Perhaps more beneficial to our evaluation than observing overall performance metrics is analyzing the instances where Y-NET's predictions are the most statistically inaccurate. As shown by the long tails of the CDFs in Figure 7, a small number of estimations are extremely incorrect, skewing our statistical results in Table 2. We emphasize statistically inaccurate because we observe a small set of examples in our testing dataset with high error and find that they are likely due to sources of noise that were not removed by our DBSCAN algorithm.
We identify two main sources of noise across the 2018 LiDAR dataset for our study site, shown in Figure 10: LiDAR processing noise and power lines. We use a height color map ranging from blue (short) to yellow (tall) to highlight the anomalous variations more clearly than the overlaid NAIP image would. On the left of Figure 10, we observe random irregularities of extremely high values which are not characteristic of the landscape and cause high estimation error for our model. The second source of high estimation error is power lines, shown on the right of Figure 10. As power lines are physical objects above the Earth's surface, they are recorded as the first object the LiDAR laser contacted, and they have enough neighboring pixels at similar heights to avoid being identified as noise. Furthermore, power lines are too small in the NAIP imagery to be removed by our footprints masking layer, causing the nDSM height for these pixels to be extremely high. The calculated error for Y-NET in instances with power lines is enormous, despite Y-NET correctly modeling the vegetation surface below, as shown in Figure 10.

Vegetation Mitigation and Growth
While our evaluation shows the ability of Y-NET to generalize to spatially disjoint locations, we note that Y-NET could also be used in practice to identify temporal changes in vegetation, such as mitigation efforts to reduce fire risk and vegetation growth. Temporal generalizability involves training on a landscape at one time, A_t, and generating a VSM for the same landscape given future 1 × 1 m resolution multispectral images, A_{t+i}, where i is a difference in time. In order to properly evaluate temporal generalizability, LiDAR and multispectral imagery must be available for the same location at two different times. As these data do not exist for our study site, we are unable to statistically evaluate the efficacy of Y-NET for temporal generalizability.
As a result, we performed a secondary evaluation of Y-NET with 2016 NAIP imagery and visually compared the generated VSM with LiDAR (nDSM) from 2007 to draw temporal conclusions. The 2016 NAIP data also show that Y-NET can generalize to images collected under the different environmental and atmospheric conditions present during different imaging missions. Figure 11 contains three tile locations from this evaluation, including the red band of NAIP because it enhances tree canopies. Columns a and b show two instances of mitigation, highlighted by red boxes, where trees were removed between 2007 and 2016. We clearly see large differences in row 4 between Y-NET and the 2007 nDSM; however, the red band in row 1 shows no trees present in those locations in 2016.

Impact
We expect Y-NET to have a significant impact on how 3D data are obtained and to support extensive applications which benefit from highly recurrent and accurate 3D data. While we evaluated the spatial generalizability of Y-NET, we also visually validated that Y-NET can be used for temporal generalizability, although the data for a proper evaluation are not currently available. While LiDAR data are sparse in time, we have shown that once a Y-NET model is trained, it can generate an accurate VSM given spatially disjoint data sources and identify mitigation and growth given different imagery, despite new shadowing and environmental conditions. While image-based methods such as photogrammetry allow good 3D models to be generated more easily than LiDAR, they still require multiple photos with precise angular specifications, leading to higher collection costs when compared to Y-NET's input data. Furthermore, in the event of a natural disaster such as a wildland fire, photogrammetry would only be effective if the region had been mapped in advance, due to smoke occlusion at the time of a fire, whereas the aerial imagery that Y-NET uses is constantly collected over large swaths of land. Y-NET also allows for VSM recurrence at the timescale of multispectral imagery collection, enabling spatially widespread 3D modeling of vegetation with ever-increasing recurrence as more satellites are deployed. Applications that would directly benefit from constantly updated VSMs include wildfire mitigation and modeling, biomass and ecosystem health, and forest economic yield forecasts. Because Y-NET provides a higher frequency of measurements at lower cost than existing methods, we expect new, previously unfathomable applications to emerge as well.

Input Feature Ablation Study
Figure 12 details the results of an input feature ablation study with Y-NET, where we trained and evaluated Y-NET while removing a single input feature each time. Doing so reveals the contribution of each input layer to Y-NET's performance. We label each bar so that, for example, "!Red" represents NAIP's red band being removed from the input. All recorded values are differences from the performance of Y-NET with all layers included; therefore, downward facing bars mean the model yielded less error for that height range, suggesting those layers hinder performance when included and are less important to the output.

We note that the red imagery band and slope terrain layer are most important when determining the height of short vegetation under 1.8 m, though they adversely affect Y-NET for tall vegetation. The NIR imagery band is most important for medium height vegetation, while the DEM, aspect, green, NIR, and NDVI layers are all significant for the tallest vegetation. With regard to the overall statistical metrics not shown in Figure 12, removing the NIR band causes the largest increase in mean error and RMSE, and the largest decrease in R². Consistent with Figure 12, the red imagery and slope terrain layers have the greatest impact on the median error, since this value is consistently below 1.8 m. While removing layers from Y-NET occasionally leads to better outcomes for specific height ranges, removing an input feature never improves Y-NET's performance across all heights or evaluation metrics. Thus, we conclude that all input layers are beneficial for generating accurate VSMs, and the original Y-NET model with all layers is superior.

Future Work
While Y-NET is an important first step, we believe there is an abundance of future work involving highly recurrent VSMs and the intersection of artificial intelligence and remote sensing. Although Y-NET achieves good performance at high resolution, high resolution wildfire modeling software has been lacking [17], causing a spatial disconnect with actual firefighting efforts. Perhaps existing fire modeling software such as Farsite [26] or FireCast [25] could be repurposed to model fires at 1 × 1 m resolution and leverage Y-NET's VSM for improved accuracy where recent LiDAR is not available.
In this paper, we have only discussed supervised machine learning, although we believe WUI fire modeling and mitigation strategies would also benefit from reinforcement learning given an up-to-date VSM from Y-NET. The intersection of reinforcement learning and remote sensing is almost entirely unexplored, and we envision an abundance of interesting applications emerging from combining the generalizability of Y-NET with reinforcement learning algorithms. We refer the reader to [84] for more information on reinforcement learning. With data-driven approaches emerging as the dominant force in research and development, we hope a proper evaluation of Y-NET's temporal generalizability will be possible as future data become available to use as ground truth.

Conclusions
We propose Y-NET, a novel deep learning model to generate a high resolution VSM from readily available visual data and terrain data. The motivation behind our work stems from cost, complexity, and periodicity limitations of LiDAR for widespread remote sensing, and the need for current and future modeling software to transition towards higher resolution. Among many domains, wildland fire fighting and modeling is one that would directly benefit from up-to-date high resolution vegetation data, especially since Y-NET can generate a VSM given aerial imagery just before the time of a fire. We also expect the VSMs Y-NET generates to unlock a class of unprecedented applications which rely on up-to-date vegetation modeling.
Our method improves on past attempts to use deep learning for remote sensing by moving to 1 × 1 m resolution and achieving a high R² value with low error relative to LiDAR. We empirically show that grouping similar input features into separate encoder branches and including skip connections in Y-NET drastically improves estimation performance. By visually evaluating Y-NET, we validate the model's robustness on tiles with varying heights and densities of vegetation, ranging from single trees, to forests and grasslands, to the WUI. Furthermore, we identify instances of noise in existing LiDAR, deploy a DBSCAN clustering algorithm to mitigate that noise, and show that Y-NET can effectively model vegetation height and identify instances of vegetation growth and mitigation. Finally, we assert that Y-NET is a first step toward using deep learning for VSMs. The abundance of available spatial data is attractive for data-driven techniques such as neural networks, and we argue the intersection between deep learning and GIS is currently underutilized.