A Comparison of Multi-Temporal RGB and Multispectral UAS Imagery for Tree Species Classification in Heterogeneous New Hampshire Forests

Abstract: Unmanned aerial systems (UASs) have recently become an affordable means to map forests at the species level, but research into the performance of different classification methodologies and sensors is necessary so users can make informed choices that maximize accuracy. This study investigated whether multi-temporal UAS data improved the classification accuracy of 14 species, examined the optimal time-window for data collection, and compared the performance of a consumer-grade RGB sensor to that of a multispectral sensor. A time series of UAS data was collected from early spring to mid-summer and a sequence of mono-temporal and multi-temporal classifications was carried out. Kappa comparisons were conducted to ascertain whether the multi-temporal classifications significantly improved accuracy and whether there were significant differences between the RGB and multispectral classifications. The multi-temporal classification approach significantly improved accuracy; however, there was no significant benefit when more than three dates were used. Mid- to late spring imagery produced the highest accuracies, potentially due to high spectral heterogeneity between species and homogeneity within species during this time. The RGB sensor exhibited significantly higher accuracies, probably due to the blue band, which was found to be very important for classification accuracy and which was lacking in the multispectral sensor employed here.


Introduction
Detailed maps of forest composition are necessary for effective and efficient forest management [1,2]. Maps depicting species-level composition serve a number of applications, such as monitoring biodiversity [3,4], assessing forest health [5,6], conducting precision forestry [7,8], or providing inputs for species-specific allometric models [9]. Remotely sensed imagery has been used for decades as a quick and efficient means to produce continuous, large-area maps of forest types [10,11]. However, traditional remote sensing platforms, such as satellite or aerial imagery, are incapable of providing the temporal and/or spatial resolutions necessary for species-level mapping at an affordable cost [12,13]. Thanks to recent technological advancements, unmanned aerial systems (UASs) have become an affordable alternative, capable of providing the flexibility and resolution necessary to accurately map forest species composition [2,14].
Besides the significantly higher spatial resolution, the flexibility of the UAS platform is another major characteristic. For one, UAS platforms can be equipped with different sensors capable of acquiring information from different portions of the electromagnetic spectrum (EMS), like the visible bands (RGB), red edge, and near infrared (NIR) [14,32]. Typically, though, cost and payload weight limit the sensor used [14,16,33]. As a result, consumer-grade digital cameras are often employed in UAS studies [21,34–37]. The downside of employing these cameras, however, is that they ordinarily only capture reflectance in the visible range of the EMS (i.e., RGB cameras). Most land cover classifications, especially those involving vegetation, require multispectral sensors capable of sensing beyond the visible range of the EMS, frequently in the NIR spectrum, in order to improve the distinction between classes that are spectrally similar in the visible range [38,39]. Many studies have modified the spectral sensitivity of the bands in consumer-grade cameras by adding or removing filters from the camera lens, usually to capture NIR reflectance [6,18,22,40,41]. The modified cameras, however, are not perfect substitutes for real multispectral cameras. All three bands on a consumer-grade camera are sensitive to NIR energy, and thus removal of the filter that blocks NIR energy from reaching the sensor can cause redundant band sensitivity, or spectral overlap, between bands. This spectral overlap reduces the potential for discrimination between features. Several studies have found that RGB imagery performed better than color infrared (CIR) imagery from a modified camera [6,28,42] and have suggested that the redundant sensitivity between the bands, after modifying the camera, reduced the ability to discriminate between species using CIR imagery. Franklin et al. (2018) found the imagery collected by an actual multispectral camera outperformed the RGB imagery for tree species mapping. However, multispectral cameras can be more expensive [14,42] and thus more cost-effective methods of accurately generating this information would help make UASs more operationally feasible.
Taking advantage of UASs' temporal flexibility may help to overcome limitations in sensor spectral resolution [2]. The much higher temporal resolution of the UAS platform is considered one of its major advantages over other remote sensing platforms [1,12,33]. In a multi-temporal classification, multiple dates of imagery are used to create a single land cover map, taking advantage of the spectral differences within and between species during this period to improve the accuracy of the map [43]. With an appropriately timed series of images, multiple species can be differentiated [2]. In highly heterogeneous forests with many species of trees, like those characteristic of New England, spectral separability is crucial [43–45].
Several studies have demonstrated the advantages of a multi-temporal classification for mapping forest composition with moderate resolution satellite imagery. However, it should be noted that these studies are typically classifying species mixtures rather than singular species, since the spatial resolution is usually larger than most tree crowns [44,46,47]. The use of high spatial resolution imagery for multi-temporal species classification is uncommon [46–48] and the use of very high spatial resolution (sub-meter), non-UAS imagery is scarce, mainly due to the high costs for both [28]. While several studies have taken advantage of the temporal resolution of the UAS for other applications [34,37,49–51], few have done so for tree species classification [6,28].
As the availability of and access to high, and now very high, spatial resolution imagery have increased, there has been a shift away from traditional per-pixel image processing for detecting and mapping features of interest to an object-based approach [52,53]. Object-based image analysis was a move towards integrating more spatial information into the classification/feature detection process in an effort to mimic human photointerpretation [54]. As of late, improvements in computer hardware have made deep learning algorithms, like the popular convolutional neural networks (CNNs), a viable tool. Deep learning looks to train computers to think like humans and automatically identify features in an image [55]. Deep learning CNNs have performed well with very high resolution imagery but, as pointed out by Bhuiyan et al. [56], can only utilize three spectral bands. Users must typically choose a limited subset of all the available bands [56–58], which would limit the use of multi-temporal datasets, which contain numerous bands. Furthermore, deep learning approaches perform best with a large quantity of reference data and require substantial computing power [59]. Meanwhile, computationally efficient machine learning algorithms, such as random forest, are readily available in many coding languages, such as R and Python, and have been found to perform well with high-dimensional, multi-temporal datasets [6,28,60,61].
The integration of UASs into the field of remote sensing is very recent and, given the inherent differences between UASs and traditional remote sensing platforms/data, there is a need to explore how UASs perform in a variety of applications and environments to better inform end-users on how best to employ them. This study sought to investigate whether multi-temporal classification of RGB and multispectral UAS imagery improved the accuracy of species-level forest composition maps in a highly heterogeneous forest in New Hampshire, USA. Additionally, an optimal phenological window for data collection was investigated and the accuracy of the maps produced from RGB imagery was compared to that of the maps produced from the multispectral imagery. This study will inform users on data collection strategies that may help to optimize accuracy in these complex environments.

Study Area Description
This study was conducted at Kingman Farm in Madbury, NH, USA (Figure 1). The property is owned by the University of New Hampshire (UNH) and is comprised of both agricultural fields and research support buildings for the NH Agricultural Experiment Station, as well as 101 ha of forest which are managed by the UNH Office of Woodlands and Natural Areas for the purposes of education, research, and conservation. From this point forward, any reference to Kingman Farm, or just Kingman, will be used to indicate the forested lands on the property. The Kingman Farm forests are an example of a hemlock-beech-oak-pine forest community [62], dominated by white pine (Pinus strobus), eastern hemlock (Tsuga canadensis), red maple (Acer rubrum), red oak (Quercus rubra), and American beech (Fagus grandifolia). The land-use history of the property and surrounding region, combined with the ongoing management practices within the woodlot, has resulted in a considerable mix of species. A recent inventory of the property conducted in 2017 as part of the UNH Continuous Forest Inventory (CFI) Program detected 16 different species of trees on the property.

It is important to note several characteristics within the study site that may potentially affect the within-species spectral response. Hemlock woolly adelgid and beech bark disease are widespread throughout the study site. Infected eastern hemlock and American beech trees may exhibit differing spectral patterns compared to uninfected individuals. Additionally, the study site encompasses a range of hydrologic conditions, from dry uplands to permanently saturated swamps. Facultative species like red maple tend to exhibit wide variability in phenology due to their ability to tolerate a multitude of conditions [63].
In order to adhere to Part 107 of the U.S. Federal Aviation Administration Regulations (Small Unmanned Aircraft Systems, 14 C.F.R. Part 107) and to maintain the safety of the research team and others, only a portion of the Kingman Farm was covered by the UAS, as indicated in Figure 1. The far eastern half of the property is classified as Class E to Surface airspace belonging to the Pease International Airport and is off limits to UASs; it was thus removed from the study area. Additional limits were placed on the UAS mission area to ensure the pilot and visual observers could maintain visual line-of-sight as well as a constant radio connection with the UAS while flying.

UAS Data Collection
All flights were carried out with a Sensefly eBee X fixed-wing UAS and the eMotion 3 mission planning software [64]. Two sensors, the Sensefly Aeria X and the Parrot Sequoia, were flown to collect the RGB and multispectral imagery, respectively. The specifications for each camera are provided in Table 1. The Aeria X is a standard DSLR camera and employs a common APS-C sensor capable of capturing normal color (RGB) imagery. The Parrot Sequoia is a multispectral sensor specifically designed for vegetation mapping and monitoring. As such, it captures spectral information in the green, red, red edge, and NIR portions of the EMS. While the Sequoia camera does carry an additional RGB sensor, this sensor is not optimized for the generation of the orthomosaics and was not utilized [65].
Imagery was collected over Kingman Farm between April 2019 and June 2020. The goal was to fly bi-weekly from the very beginning of the growing season through to the end in order to capture the full phenology of the forest with both sensors. There was a preference to fly on cloudy days to maintain consistent illumination across all the images and to avoid shadows. When this was not possible, the imagery was collected under clear or nearly clear conditions and as close to solar noon as possible. All missions were undertaken 100 m above the trees (approximately 120 m above the ground) with an 80% latitudinal overlap and an 85% longitudinal overlap. The Sequoia requires an additional radiometric calibration prior to each flight using a calibration target with a known albedo. Table 2 shows the collection dates for both cameras with a seasonal descriptor. Due to weather, flight constraints, and equipment malfunctions, it was not possible to collect all the imagery within a single growing season. Within-sensor collections were largely within the same year (2019 for the Aeria X and 2020 for the Sequoia), with the exception of the first and last dates of collection for the Aeria X. Every effort was made to keep the between-sensor collections as close as possible in order to avoid large differences in phenology when comparing sensors. Weather conditions between 2019 and 2020 were similar. May and June 2020 were roughly two degrees warmer and June 2020 received two more inches of rain compared to June 2019. A visual inspection of the imagery did not show significant differences in phenology, however.
Table 2. Collection dates for each sensor with seasonal descriptions. The description is based on regional trends in phenology and not on any particular date ranges.

Imagery Pre-Processing and Orthomosaic Generation
Due to the high canopy cover in the study area, it was not possible to set ground control points (GCPs) across the woodlot to improve the positional accuracy of the orthomosaics. The eBee X, however, is real-time kinematic (RTK)-enabled and thus the raw GPS positions for each image could be post-processed. All the raw UAS imagery was pre-processed using the Sensefly Flight Data Manager built into the eMotion 3 software. The Flight Data Manager extracted the geotags for all the images stored in the mission flight logs and then used a post-processed kinematic (PPK) technique to correct the positions. A CORS station located approximately 3.85 km from the center of the study area (station ID: NHUN) was used for all PPK processing. Once the positions were corrected, the software geotagged the images with the corrected positions.
Each date of collection was processed in Agisoft Metashape Professional (formerly Agisoft Photoscan) [66]. Agisoft utilizes the structure from motion (SfM) and multi-view stereo (MVS) processes to generate a georeferenced orthomosaic, or ortho. Points representing different features within each image are detected and then matched across multiple overlapping images. The matched points, called tie points, are then utilized to estimate the interior and exterior orientation parameters for the camera for each image. The reprojection error for all models ranged between 0.448 and 1.28 px. The original point cloud, or sparse point cloud, from the tie points is densified by matching pixel windows between successive image pairs using the estimated camera orientations [67,68]. A digital surface model (DSM) is generated from the dense point cloud, which is then used to orthorectify the images. The rectified images are then mosaicked together to form the final orthomosaic. Specifically, within the Agisoft software, the Align Photos tool was run in the high accuracy mode with generic preselection, guided image matching, and adaptive camera model fitting turned on. The dense point cloud generation was run in high quality with mild filtering.
While all the missions were flown with the same parameters, the different focal lengths of the two sensors resulted in very different spatial resolutions for the resulting orthomosaics. The coarsest spatial resolutions of the Aeria X and Sequoia orthos were 2.7 cm and 11.9 cm, respectively. In order to eliminate spatial resolution as a factor when comparing the performance of the two sensors, all the orthos were exported at a 12 cm spatial resolution from Agisoft. They were then co-registered to improve the positional agreement between dates. The 27 June 2020 Aeria X orthomosaic was chosen as the base ortho. The remaining orthomosaics were registered to the base ortho using several well-dispersed structural features across the study site and rectified using an affine transformation and nearest neighbor resampling.

Reference Data Collection
The dense point cloud from the 27 June 2020 Aeria X imagery was exported and converted into a DSM with a 12 cm spatial resolution to match that of the orthomosaics. The DSM was then normalized using a digital terrain model (DTM) produced from a 2011 leaf-off LiDAR collection for coastal New Hampshire and downloaded from the GRANIT LiDAR Distribution site (https://lidar.unh.edu/map/, accessed on 2 July 2021) to produce the canopy height model (CHM). Due to the inability of photogrammetrically produced point clouds to accurately capture the ground, externally produced DTMs, typically from LiDAR, are commonly used to normalize those produced from imagery [34,35,69]. Based on the land-use history of the site, there was no concern about the age of the DTM. A 3 × 3 cell Gaussian filter was then applied to the CHM to reduce the noise in the original model [70]. Pixels with a height less than 5 m were considered non-forested and subsequently masked from the CHM.
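The normalization, smoothing, and masking steps described above can be sketched as follows. This is a minimal illustration on small in-memory grids, not the workflow used in the study: the function name `canopy_height_model` is hypothetical, the 3 × 3 binomial kernel is an assumed approximation of the Gaussian filter (the kernel weights are not given in the text), and masked cells are represented as `None`.

```python
def canopy_height_model(dsm, dtm, min_height=5.0):
    """Normalize a DSM by a DTM, smooth with a 3 x 3 kernel, and mask low cells.

    dsm, dtm: 2-D lists of elevations (m) on the same grid.
    Returns a 2-D list in which cells below min_height are None.
    """
    rows, cols = len(dsm), len(dsm[0])
    # Height above ground for every cell (DSM minus DTM).
    chm = [[dsm[r][c] - dtm[r][c] for c in range(cols)] for r in range(rows)]

    # Assumed binomial approximation of a 3 x 3 Gaussian kernel.
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    smoothed = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, weight = 0.0, 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        w = kernel[dr + 1][dc + 1]
                        total += w * chm[rr][cc]
                        weight += w
            # Renormalize where part of the window falls off the grid edge.
            smoothed[r][c] = total / weight

    # Cells lower than 5 m are treated as non-forest and masked.
    return [[h if h >= min_height else None for h in row] for row in smoothed]
```

Smoothing before masking follows the order given in the text; edge handling by renormalization is a choice made here for simplicity.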
A local maximum filter was used to generate points representing treetops for the entire study area [29,71]. Kingman Farm has a high stand density with highly variable crown widths. To ensure smaller crowns were appropriately captured, a 7 cell, or 84 cm wide, circular window was applied. This window size was chosen based on the smallest measured crown width from a 2017 CFI inventory of the Kingman Farm woodlot. While smaller window sizes will over-segment larger crowns [72], this is preferable to under-segmentation, which could result in the canopies of different species being grouped together, and has been found to improve classification accuracies [73,74].
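A local maximum filter with a circular window can be sketched as below. The function name `treetops` is hypothetical, and the handling of ties (a cell must be strictly higher than every neighbor in the window, so flat plateaus yield no treetop) is an assumption rather than a detail taken from the study. With 12 cm cells, the 7-cell window corresponds to 7 × 0.12 m = 0.84 m.

```python
def treetops(chm, window_cells=7):
    """Return (row, col) cells that are strict maxima within a circular window.

    chm: 2-D list of canopy heights; None marks masked (non-forest) cells.
    """
    radius = window_cells // 2  # a 7-cell window gives a 3-cell radius
    # Offsets that fall inside the circular footprint, excluding the center.
    offsets = [(dr, dc)
               for dr in range(-radius, radius + 1)
               for dc in range(-radius, radius + 1)
               if dr * dr + dc * dc <= radius * radius and (dr, dc) != (0, 0)]
    rows, cols = len(chm), len(chm[0])
    tops = []
    for r in range(rows):
        for c in range(cols):
            height = chm[r][c]
            if height is None:
                continue
            is_max = True
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    neighbor = chm[rr][cc]
                    if neighbor is not None and neighbor >= height:
                        is_max = False
                        break
            if is_max:
                tops.append((r, c))
    return tops
```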
An initial set of reference trees was selected from the 2017 CFI inventory. For each sampled tree, the distance and azimuth from the plot center to the center of the stem at breast height were recorded in addition to the tree species. This information was used to map the location of each sampled tree stem. Each mapped tree was first carefully inspected to determine whether the tree could be seen in the fully leaf-on imagery, and CFI trees that were obscured by taller trees were removed. Next, because the location of the center of the stem does not match that of the highest point of the crown for leaning trees, a visual inspection of the UAS imagery in Agisoft was used to select the local maximum for the remaining trees.
Based on the species represented in the chosen CFI trees, 14 were chosen for classification (Table 3). These species were determined to have a high enough occurrence within the study area to ensure that a representative number of reference samples could be gathered. To improve the efficiency of the reference data collection, a random forest (RF) classification [60] was performed, using the chosen CFI trees as training data. Each local maximum was assigned a preliminary classification based on the average spectral information from the 26 June 2020 Sequoia orthomosaic occurring within a 0.5 m buffer around each point. This preliminary classification was used to perform stratified random sampling. Each selected point was then carefully inspected using the high-resolution orthomosaics and adjusted as necessary. Field reconnaissance was carried out for those reference samples that were too difficult to photo interpret. One hundred samples per class (species) were collected per the recommendation of Congalton and Green [75]. These reference samples were then randomly divided into two independent groups, one for training the classification algorithm and the other for validation, with half the samples assigned to each.
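The division of the reference samples into training and validation halves can be illustrated with the short sketch below. The text states only that the samples were randomly divided in half; splitting within each species so that every class contributes equally to both groups is an assumption made here, and the function name `split_reference` and the sample identifiers are hypothetical.

```python
import random

def split_reference(samples, train_fraction=0.5, seed=0):
    """Randomly split (sample_id, species) pairs into training and validation.

    The split is done separately within each species (an assumption) so that
    both groups keep the same number of samples per class.
    """
    rng = random.Random(seed)
    by_species = {}
    for sample_id, species in samples:
        by_species.setdefault(species, []).append(sample_id)

    train, validation = [], []
    for species, ids in sorted(by_species.items()):
        ids = ids[:]          # copy so the caller's data is not reordered
        rng.shuffle(ids)
        cut = int(len(ids) * train_fraction)
        train += [(i, species) for i in ids[:cut]]
        validation += [(i, species) for i in ids[cut:]]
    return train, validation
```

With 14 species and 100 samples each, this yields 700 training and 700 validation samples with no sample shared between the two groups.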
A marker-controlled watershed (MCW) segmentation was performed to delineate individual tree crowns. In a traditional watershed segmentation for tree crown delineation, a single banded image, typically representing height, is treated as a topographic surface [52,72]. The values are inverted so that local maximums (i.e., potential treetops) become local minimums and the catchment basins (i.e., crown boundaries) around all the local minima within the image are delineated. MCW segmentation requires an additional input, markers or points representing the local minima of interest. The basins associated with non-marker minima are converted to plateaus within the image and not delineated. The result is a one-to-one relationship between markers and basins, which reduces oversegmentation. In this study, the local maximums representing the tree crowns in the study area were used as the markers and the CHM was used to define the crown boundary.
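The idea of marker-controlled flooding can be sketched with a small priority-queue implementation. This is an illustrative flooding variant written for this purpose, not the software or exact algorithm used in the study; the function name `marker_watershed`, the 4-neighbor connectivity, and the tie-breaking by insertion order are all assumptions. Markers are expected to lie on unmasked cells.

```python
import heapq

def marker_watershed(chm, markers):
    """Label each unmasked CHM cell with the marker whose basin floods it first.

    chm: 2-D list of heights (None = masked); markers: list of (row, col).
    Returns a 2-D list of labels (1..len(markers)); 0 means unlabeled or masked.
    """
    rows, cols = len(chm), len(chm[0])
    labels = [[0] * cols for _ in range(rows)]
    heap = []
    counter = 0  # tie-breaker: equal levels are processed in insertion order
    for label, (r, c) in enumerate(markers, start=1):
        labels[r][c] = label
        # Heights are negated so treetops become the lowest points.
        heapq.heappush(heap, (-chm[r][c], counter, r, c))
        counter += 1

    while heap:
        level, _, r, c = heapq.heappop(heap)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if (0 <= rr < rows and 0 <= cc < cols
                    and labels[rr][cc] == 0 and chm[rr][cc] is not None):
                labels[rr][cc] = labels[r][c]
                # A cell cannot flood earlier than the cell it was reached from.
                heapq.heappush(heap, (max(level, -chm[rr][cc]), counter, rr, cc))
                counter += 1
    return labels
```

Because flooding starts only from the supplied markers, every labeled basin corresponds to exactly one marker, which is the one-to-one relationship described above.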

Tree Species Classification
A series of mono-temporal (single date) and multi-temporal (multiple dates) classifications were carried out for each sensor using an object-based classification approach, whereby groupings of pixels (image objects) are classified instead of the individual pixels. An object-based approach performs better than a traditional pixel-based approach when classifying high spatial resolution imagery since it can better handle the higher intra-class spectral variability that occurs as the spatial resolution increases [53,76,77]. The previously created tree crown segments acted as the image objects for this study.
The RF classifier was employed for all classifications. RF is a robust, non-parametric algorithm that is often used for classification and has been employed in other multi-temporal species classification studies [6,23,78,79]. The per-band average spectral value of the training tree segments was used to train the RF classifier. Each RF model was grown using 500 trees, with the number of variables tried at each split set to the square root of the number of spectral bands included in the model, as described below. The resulting model was then applied to the independent validation tree segments to assess its accuracy.
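The per-band average spectral value of each crown segment, which forms the input to the classifier, can be computed as a simple zonal mean. The sketch below assumes the segments and bands are available as equally sized in-memory grids; the function name `segment_band_means` and the data layout are hypothetical.

```python
def segment_band_means(segments, bands):
    """Average each band over the pixels of every crown segment.

    segments: 2-D list of integer crown labels (0 = background).
    bands: dict mapping band name to a 2-D list of pixel values.
    Returns {label: {band_name: mean_value}}.
    """
    sums, counts = {}, {}
    for r, row in enumerate(segments):
        for c, label in enumerate(row):
            if label == 0:
                continue  # skip pixels outside any crown
            counts[label] = counts.get(label, 0) + 1
            band_sums = sums.setdefault(label, {name: 0.0 for name in bands})
            for name, grid in bands.items():
                band_sums[name] += grid[r][c]
    return {label: {name: total / counts[label]
                    for name, total in band_sums.items()}
            for label, band_sums in sums.items()}
```

For a multi-temporal stack, each date and band pair would simply be passed as a separate entry in `bands`.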
Each single date of imagery was classified alone (i.e., mono-temporal classification). Additionally, a series of multi-temporal image stacks were classified using varying combinations of the single-date orthomosaics for each sensor. Image stacks started with imagery for every combination of two dates. The number of dates included in the stack was then increased incrementally until all dates of imagery were included (i.e., three-date stack, four-date stack, five-date stack). In total, 62 combinations were generated, 31 per sensor (Table 4).
Table 4. All single- and multi-date image stacks for classification grouped by the number of dates included and indicated on the far-left. The index column is a unique identifier assigned to each combination within a sensor.
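The count of 31 stacks per sensor follows from taking every non-empty combination of five dates (5 + 10 + 10 + 5 + 1). The sketch below enumerates the stacks and the corresponding number of variables tried at each RF split. The date labels are placeholders, and flooring the square root is an assumption, since the rounding rule is not stated in the text.

```python
from itertools import combinations
from math import isqrt

def date_stacks(dates):
    """All non-empty combinations of collection dates, smallest stacks first."""
    return [combo
            for k in range(1, len(dates) + 1)
            for combo in combinations(dates, k)]

def features_per_split(n_dates, bands_per_date):
    """Square root of the number of bands in the stack (floored; an assumption)."""
    return max(1, isqrt(n_dates * bands_per_date))

dates = ["d1", "d2", "d3", "d4", "d5"]  # placeholder labels for the five dates
stacks = date_stacks(dates)
stacks_by_size = {k: sum(1 for s in stacks if len(s) == k) for k in range(1, 6)}
```

For example, a five-date Aeria stack has 5 × 3 = 15 bands and a five-date Sequoia stack has 5 × 4 = 20 bands, giving 3 and 4 variables per split under the flooring assumption.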

Accuracy Assessment
The accuracy of all the classifications was assessed using the validation tree segments and an error matrix approach [80]. The ground classification of each validation tree was compared to its respective map classification and the results tallied in a matrix with the columns and the rows of the matrix representing the sample's ground and map classification, respectively. For each matrix, the overall accuracy (OA) was calculated by dividing the sum of the major diagonal (total agreement) by the total number of samples. The accuracy of the individual classes was determined by calculating the user's (UA) and producer's (PA) accuracies [81]. The PA was calculated by dividing the number of correctly classified samples for each class by the total number of samples for that class. The UA was calculated by dividing the number of correctly classified samples for each class by the total number of samples classified as that class. UA and PA were then used to calculate an F-measure (F; Equation (1)) as a way to summarize the UA and PA in a single metric.
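The accuracy measures described above can be computed directly from an error matrix, as in the following sketch. Equation (1) is not reproduced in this text, so the F-measure is assumed here to take its usual form as the harmonic mean of UA and PA, F = 2 × UA × PA / (UA + PA). The function name `accuracy_metrics` is hypothetical; rows hold the map classification and columns the ground classification, as stated above.

```python
def accuracy_metrics(matrix):
    """Overall, user's, and producer's accuracy plus F for a square error matrix.

    matrix[i][j]: number of samples mapped as class i whose ground class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    diag = [matrix[i][i] for i in range(n)]
    row_totals = [sum(matrix[i]) for i in range(n)]                       # mapped as class i
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # ground class j

    oa = sum(diag) / total
    ua = [diag[i] / row_totals[i] if row_totals[i] else 0.0 for i in range(n)]
    pa = [diag[i] / col_totals[i] if col_totals[i] else 0.0 for i in range(n)]
    # Assumed form of Equation (1): harmonic mean of UA and PA.
    f = [2 * u * p / (u + p) if (u + p) else 0.0 for u, p in zip(ua, pa)]
    return oa, ua, pa, f
```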
Due to the randomization approach implemented by the RF classifier, the accuracy of no two RF models will be the same. To account for this, 30 RF models were generated for each date combination in Table 4. Each model was validated and the OA, UA, PA, and F calculated. These results were then averaged together to calculate a mean accuracy result for each combination.

Feature Importance
A feature importance investigation was carried out for both sensors. An RF classifier was trained using the training tree segments and all the bands for all dates of imagery and validated using the independent validation tree segments to establish a baseline accuracy. One at a time, each band included in the image stack was removed, the model retrained and validated, and the difference in overall accuracy taken as the measure of importance for that band.
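The leave-one-band-out procedure can be expressed generically as below. The `evaluate` argument is a hypothetical stand-in for training an RF model on the given bands and returning its overall accuracy on the validation segments; it is not part of any particular library.

```python
def drop_one_importance(bands, evaluate):
    """Importance of each band as the drop in overall accuracy when it is removed.

    bands: list of band names in the full image stack.
    evaluate: callable taking a list of band names and returning the overall
        accuracy of a model trained and validated with only those bands.
    Returns {band: baseline_accuracy - accuracy_without_band}.
    """
    baseline = evaluate(bands)
    importance = {}
    for band in bands:
        reduced = [b for b in bands if b != band]
        # Positive values mean accuracy fell when the band was left out.
        importance[band] = baseline - evaluate(reduced)
    return importance
```

Because each RF run is randomized, the differences would in practice be noisy; averaging `evaluate` over repeated runs is one way to stabilize them, although whether this was done for the importance analysis is not stated in the text.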

Statistical Comparisons
A kappa analysis was conducted to statistically compare the best single-date and multi-date classifications for each sensor. The kappa statistic, KHAT, is another measure of how well the classification agrees with the reference data that does not assume the land cover classes are independent and utilizes the information in the entire error matrix, not just the diagonal [80]. The KHAT statistic for two error matrices can be statistically compared to determine whether there is a significant difference between methodologies [75].
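A pairwise KHAT comparison can be sketched as follows. The formulas are not given in this text; the sketch assumes the commonly used large-sample (delta method) variance of KHAT and a Z test on the difference between two independent KHAT values, with |Z| > 1.96 taken as significant at the 95% level. The function names are hypothetical.

```python
from math import sqrt

def khat_and_variance(matrix):
    """KHAT and its approximate large-sample variance for a square error matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    rows = [sum(matrix[i]) for i in range(n)]
    cols = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

    t1 = sum(matrix[i][i] for i in range(n)) / total           # observed agreement
    t2 = sum(rows[i] * cols[i] for i in range(n)) / total ** 2  # chance agreement
    t3 = sum(matrix[i][i] * (rows[i] + cols[i]) for i in range(n)) / total ** 2
    t4 = sum(matrix[i][j] * (rows[j] + cols[i]) ** 2
             for i in range(n) for j in range(n)) / total ** 3

    khat = (t1 - t2) / (1 - t2)
    variance = (t1 * (1 - t1) / (1 - t2) ** 2
                + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
                + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4) / total
    return khat, variance

def z_statistic(matrix_a, matrix_b):
    """Z statistic for the difference between two independent KHAT values."""
    k_a, v_a = khat_and_variance(matrix_a)
    k_b, v_b = khat_and_variance(matrix_b)
    return abs(k_a - k_b) / sqrt(v_a + v_b)
```

For the 2 × 2 matrix [[45, 5], [5, 45]], observed agreement is 0.9 and chance agreement is 0.5, so KHAT = (0.9 - 0.5) / (1 - 0.5) = 0.8.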
Several KHAT comparisons were conducted. First, within each sensor, the best mono- and multi-temporal classifications were compared to determine not only whether a multi-temporal classification was significantly better than a single-date classification, but also whether there was a significant difference between how many dates were used. Next, between-sensor KHAT comparisons were conducted for each date of imagery to compare the classification performance of the RGB imagery to that of the multispectral imagery.

Within-Sensor General Classification Results
Overall classification accuracies (Tables A1 and A2) were highly varied, ranging from 24.8% to 61.1% for the Aeria and 27.0% to 55.5% for the Sequoia. Across the individual date groups, the mono-temporal classifications had the lowest overall accuracies, reaching a maximum OA of 37.3% and 36.2% for the Aeria and Sequoia, respectively. Generally, the inclusion of additional dates improved the accuracy of all classifications. However, there was a distinct leveling off in the OA as the number of dates included in the multi-temporal classification increased, with the peak OA reached for the five-date classification (Aeria) and for the four-date classification (Sequoia).
For the top performing combinations (Figure 2), the mid-spring and late spring imagery were consistently chosen. The best mono-temporal classification for both sensors also occurred at the end of May, in late spring. For the multi-temporal classifications, the best date combinations varied slightly between the sensors, but mid- and late spring imagery were frequently utilized, especially for the two- and three-date combinations, for which there were 10 combinations each.

Mono-Versus Multi-Temporal Classification
The results of the pairwise comparison between the mono- and multi-temporal classifications for both sensors are given in Table 5. For each pairing, the 30 individual classifications were compared and the number of significantly different classifications totaled. Both sensors exhibited the same trend in the number of significantly different classifications. The best two-date multi-temporal classification was always significantly better than the best mono-temporal classification. Between two and three dates, the number of significantly different classifications decreased considerably. After three dates of imagery, there was no significant difference in the classifications.

Per-Species Classification Result
The UA, PA, and F for all species and all classifications are presented in Figures 3 and 4 for the Aeria and Sequoia, respectively. The accuracies of eastern hemlock (eh) and white pine (wp), the only coniferous species in this study, were consistently better than those of the deciduous species across all combinations. The F of both was often >70%, peaking at 88% for eastern hemlock (Aeria) and 80% for white pine (Sequoia). White ash (wa), red maple (rm), and American beech (ab) were consistently poorly classified, never achieving Fs greater than 50%. The performance of the remaining species varied with the number of dates included and the specific dates in the combination for each sensor.
Figures 3 and 4. UA, PA, and F for all species and all classifications for the Aeria and Sequoia, respectively. The y-axis is the species abbreviation (see Table 2).

Between-Sensor Classification Results
The best performing Aeria and Sequoia classifications based on OA for each mono- and multi-temporal classification group were statistically compared. For each pairing, the 30 individual classifications were compared and the number of statistically significant results summarized. The results of the comparisons are shown in Figure 5. When compared to the Aeria, the Sequoia consistently under-performed in terms of OA. The smallest difference was seen in the mono-temporal classifications (OA difference of 1.1%) while the greatest occurred with the five-date classification (OA difference of 6.9%). None of the mono-temporal classifications were found to be significantly different. Each of the multi-temporal pairings had some significantly different results, the number of which increased with the number of added dates. Almost all of the five-date comparisons were found to be significantly different.

Feature Importance
The results of the feature importance analysis are presented in Figure 6. Feature importance here was measured as the decrease in overall accuracy relative to a baseline model (the five-date combination) when that feature or band was removed. Positive values indicate that the model accuracy decreased when the band was removed, while negative values indicate that the model accuracy improved. For the Aeria, the blue bands were considerably more important than the other spectral bands. Furthermore, the mid- and late spring imagery, regardless of the spectral band, were also important. For the Sequoia, numerous bands had a negative impact on performance. The mid- and early spring green and red bands were predominantly the most important. The red-edge and NIR bands were consistently the least important.

Discussion
This study sought to (1) investigate whether a multi-temporal approach improved the accuracy of species-level forest composition mapping with UAS imagery in a highly heterogeneous forest, and in doing so to determine whether there is an optimal phenological window within which to collect imagery; and (2) compare the performance of RGB imagery collected via a consumer-grade DSLR to that of a multispectral camera. A series of mono-temporal and multi-temporal classifications of 14 different species were carried out for both sensors and validated with an independent set of reference samples and error matrices.
Kappa comparisons were then conducted between the best performing mono- and multi-temporal classifications within each sensor and then between sensors to determine whether multi-temporal classifications were significantly better than mono-temporal classifications and whether there was any significant difference between the classifications produced by the RGB and multispectral sensors.
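A kappa comparison of this kind can be illustrated with the pairwise Z-test on two independent kappa estimates that is widely used in remote sensing accuracy assessment. The sketch below assumes that formulation, with the delta-method variance of kappa and a 1.96 critical value (95% confidence). The function names and the two 2 x 2 error matrices are invented for illustration and are not results from this study.

```python
import math
from typing import List, Tuple

Matrix = List[List[int]]


def kappa_and_variance(matrix: Matrix) -> Tuple[float, float]:
    """Return (kappa, delta-method variance of kappa) for a square error matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    t1 = sum(matrix[i][i] for i in range(k)) / n                # observed agreement
    t2 = sum(row_tot[i] * col_tot[i] for i in range(k)) / n**2  # chance agreement
    t3 = sum(matrix[i][i] * (row_tot[i] + col_tot[i]) for i in range(k)) / n**2
    t4 = sum(
        matrix[i][j] * (row_tot[j] + col_tot[i]) ** 2
        for i in range(k)
        for j in range(k)
    ) / n**3

    kappa = (t1 - t2) / (1 - t2)
    variance = (
        t1 * (1 - t1) / (1 - t2) ** 2
        + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
        + (1 - t1) ** 2 * (t4 - 4 * t2**2) / (1 - t2) ** 4
    ) / n
    return kappa, variance


def kappa_z(matrix_a: Matrix, matrix_b: Matrix) -> float:
    """Z statistic for the difference between two independent kappa values."""
    kappa_a, var_a = kappa_and_variance(matrix_a)
    kappa_b, var_b = kappa_and_variance(matrix_b)
    return abs(kappa_a - kappa_b) / math.sqrt(var_a + var_b)


# Invented error matrices (rows and columns are classes), for illustration only.
multi_temporal = [[40, 10], [10, 40]]
mono_temporal = [[30, 20], [20, 30]]

z = kappa_z(multi_temporal, mono_temporal)
significant = z > 1.96  # assumed two-sided test at the 95% confidence level
print(f"Z = {z:.2f}, significant = {significant}")
```

Repeating such a test over each of the paired classification iterations and counting the significant outcomes gives the kind of summary reported for the sensor and date comparisons.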
While the underlying goal of this study is to inform users on data collection strategies, it is important to note that this study was conducted in a single stand at one location. The results of this study should be interpreted within the context from which they were derived. Geographic variation in phenology aside, results may vary even between geographically close locations simply due to differences in site, lighting, and composition, most of which are difficult to control.

Tree Species Classification Accuracy
This study achieved maximum overall accuracies of 61.1% and 55.5% for the Aeria and Sequoia, respectively. These OAs are lower than those reported by comparable studies [6,28,82]. Both Lisein et al. [28] and Michez et al. [6] conducted multi-species forest mapping in mixed forest stands using both multi-temporal RGB and multispectral UAS imagery. These studies achieved maximum accuracies of 91.2% (based on RF out-of-bag errors) and 84.1%, respectively. It should be noted that these studies, while similar, differed from this one in two important ways. First, both studies only included five classes. Some were species while others were groupings representing specific genera (e.g., birches). This study included 14 individual species of trees. The greater number of species employed here led to greater spectral confusion, especially for species exhibiting similar phenology across the time period investigated [6]. This study chose to represent the diversity of the study site "as is", rather than choosing a subset of species exhibiting the best separation, thus extending the generalizability of these results to similar conditions [2,23].
Second, these studies employed additional derivative layers that were not utilized here, mainly spectral indices and textural metrics. Additional derivative information, especially texture, has been found to significantly improve the accuracy of forest classification in a number of settings [78,82,83] and in other vegetation mapping studies as well [84,85]. This study establishes a baseline for the performance of these two sensors based on spectral properties alone. Given the resolution these UAS sensors are capable of achieving, a great deal of information on crown texture can be extracted. The benefits of textural metrics for mapping stands such as the one investigated here are an interesting topic in need of additional research.

Mono- versus Multi-Temporal Classification
Both sensors employed here demonstrated a continuous increase in the overall classification accuracy as the number of dates included in the multi-temporal classification increased (Figure 2). This result falls in line with many other studies that have investigated the performance of multi-temporal classifications both with UAS [6,28,82] and non-UAS imagery [43,46,86,87]. Of interest in this study was the significance of the additional benefit incurred by adding more dates. The highest accuracy was achieved when using all five dates of imagery for the Aeria and four dates for the Sequoia. From a cost-benefit perspective, one would look to achieve the highest accuracy possible with the fewest collections. While the OA did increase with the number of dates utilized, the rate at which it increased for both sensors leveled off, indicating a diminishing return. The results of the mono- versus multi-temporal kappa comparisons support this conclusion (Table 5). The two-date classification for both sensors was significantly better than the mono-temporal classification for all iterations. There was only a minor benefit when a third date was included and, beyond three dates, there was no significant benefit. Weil et al. [23] similarly saw little improvement in classification accuracy after three dates of optimal near-surface imagery using the RF classifier. These results not only reinforce the benefits of multi-temporal classifications, but also suggest that there would be no need to collect more than three dates of optimally timed imagery.
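To make the cost of adding dates concrete, the short sketch below enumerates every candidate date combination for a five-date time series, grouped by the number of dates used. It assumes that every subset of the collection dates is a candidate; the date labels are placeholders rather than the actual collection dates, and `date_combinations` is a hypothetical helper name.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Placeholder labels for five collection dates (assumed, not the real dates).
DATES = ["early_spring", "mid_spring", "late_spring", "early_summer", "mid_summer"]


def date_combinations(dates: List[str]) -> Dict[int, List[Tuple[str, ...]]]:
    """Group every non-empty subset of dates by the number of dates it contains."""
    return {k: list(combinations(dates, k)) for k in range(1, len(dates) + 1)}


groups = date_combinations(DATES)
counts = {k: len(combos) for k, combos in groups.items()}
total = sum(counts.values())

for k, n_combos in counts.items():
    print(f"{k}-date classifications: {n_combos}")
print(f"total candidate classifications per sensor: {total}")
```

Each combination corresponds to one stack of imagery to classify, so restricting the search to three or fewer dates removes the four- and five-date groups from the workload.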

Timing of Aerial Collection
Based on the date combinations of the best performing mono- and multi-temporal classifications, the mid- and late spring imagery plays an important role in tree species classifications. The best mono-temporal collection date was found to be towards the end of May for both sensors. Similar studies investigating optimal phenological timing have also found the middle and end of spring to be important [23,28]. This runs counter to what one would expect, which is that the accuracy would be maximized at the point when the trees express their greatest phenological differences, either early spring or autumn [28]. Indeed, other studies have found autumn to be the optimal mono-temporal window for species mapping [23,46,86].
Lisein et al. [28] suggested that this period presents a balance between inter- and intra-species spectral variation, not only improving the separability between species but also the homogeneity within species. After this period, individual phenology starts to express the effects of differing microclimate, age, and even health [88–90]. It is at this point too that the spectral response of trees below the upper canopy is suppressed (full to almost full leaf-cover above), further reducing unwanted variability. This suggests that more focus should be placed on the intra-species variation when collecting phenology data for species classification.
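One simple way to express the balance between inter- and intra-species variation is the ratio of the variance between species means to the average variance within species, computed per band and date. The snippet below is a sketch of that idea only. The function name, the reflectance values, and the choice of this particular ratio are assumptions for illustration, not the analysis used in this study.

```python
from statistics import mean, pvariance
from typing import Dict, List


def separability_ratio(samples_by_species: Dict[str, List[float]]) -> float:
    """Variance between species means divided by mean variance within species."""
    species_means = [mean(values) for values in samples_by_species.values()]
    between = pvariance(species_means)
    within = mean(pvariance(values) for values in samples_by_species.values())
    return between / within


# Invented single-band reflectance samples for two species on two dates.
# Species means are identical on both dates; only the within-species spread differs.
homogeneous_date = {"red_maple": [0.10, 0.12], "white_ash": [0.30, 0.32]}
variable_date = {"red_maple": [0.01, 0.21], "white_ash": [0.21, 0.41]}

ratio_homogeneous = separability_ratio(homogeneous_date)
ratio_variable = separability_ratio(variable_date)
print(f"homogeneous date: {ratio_homogeneous:.1f}")
print(f"variable date:    {ratio_variable:.1f}")
```

With the species means held fixed, the date with tighter within-species values yields a much higher ratio, which is the effect attributed here to the mid- to late spring window.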
The results of the multi-temporal classifications still demonstrate that including periods with high inter-species variation is important for achieving high classification accuracies. The best performing two- and three-date classifications included those combinations with the mid-spring imagery and the late spring imagery. Many species experienced an increase in their individual accuracies for the date combinations containing both those dates (Figures 3 and 4). Visually, the mid-spring imagery collected here exhibited the greatest difference between species. Unfortunately, due to equipment difficulties, the full phenological profile of the study site was not captured. Based on the results of the previously mentioned studies, the inclusion of autumn imagery along with the mid- and late spring imagery could have significantly increased the accuracy of the three-date classifications, perhaps leading to greater significance when statistically compared to the optimal two-date classification.
While this study focused primarily on a global classification result, it is still important to investigate the accuracy of the individual species. There was a substantial difference in the performance for different species and combinations (Figures 3 and 4). Most notably, the two coniferous species were consistently well-classified compared to the deciduous species. Eastern hemlock exhibited accuracies >70% with only a single date of imagery. White pine performed better once there were two dates and then stabilized. White ash, American beech, and red maple performed consistently poorly, showing only a minor improvement with additional dates. Within-species variation, as noted, could have a significant impact on an individual species' performance. Red maple naturally exhibited great variability during the important mid-spring time period. Some trees were just starting to show the early red inflorescence while others had almost fully leafed out, expressing the influence of the wide variety of conditions red maple can tolerate [63,90]. American beech in the study area was much further ahead phenologically than most other species, almost completely leafed out by mid-spring, but, at this time, many of the beech trees in the stand are suffering from the effects of beech bark disease. The range of infestation is wide, from some beech trees only recently infected to others nearing mortality. This range would have caused large variability in the spectral response, not just because of the change in vegetation health, but also because of the change in the structure of the canopy [6]. Additionally, the time series collected here may not have been dense enough to capture the specific periods within which a species becomes distinct. For example, white ash had few if any leaves by mid-spring but was fully leafed out by late spring. An important window may have been missed. Far more spectrally unique species, for example the aspen trees, black oak, and black birch, performed well, even with just a few dates of imagery.

RGB versus Multispectral Sensors for Tree Species Classification
The multispectral sensor employed here was found to underperform compared to the consumer-grade RGB sensor. The statistical comparison between the two sensors (Figure 5) suggests that for a mono-temporal classification the RGB and multispectral sensors were not significantly different. However, the RGB sensor became significantly better with each additional date added to the classification. Both Lisein et al. [28] and Michez et al. [6] carried out comparisons between multi-temporal RGB imagery and color infrared (CIR) imagery (green, red, and near infrared sensitivity only) for the purpose of forest species classification and found that the RGB outperformed the CIR. Both studies suggested that the poor performance from the CIR was due to the redundant sensitivity to NIR across the three bands after modifying their cameras. Nijland et al. [42] concluded the same when comparing modified (i.e., NIR blocking filter removed) and unmodified RGB cameras for monitoring plant health and phenology. This study sought to overcome the redundant sensitivity problem by utilizing a multispectral sensor designed specifically for vegetation mapping and monitoring. Not only was each band specifically designed to avoid spectral overlap, but the sensor also included an additional band in the red-edge region of the EMS, which has been found to benefit the discrimination between species [91–93]. The results of the feature importance testing (Figure 6) suggest that the blue band, which is lacking in the Parrot Sequoia, is of high importance for mapping tree species. Key et al. [86] also found the blue band to be highly significant for species classification due to its sensitivity to chlorophyll and insensitivity to shadowing in canopies, a significant problem in many types of classification studies [2,86,94]. The most important bands for the Sequoia also happened to be in the visible range (red and green), while the red-edge and the NIR bands were found to be the least important bands. The visible bands should thus be considered highly important when conducting future classification studies [31].
This result has important implications in that users of the technology may not necessarily have to buy a more expensive multispectral sensor when in fact they could achieve better results with the RGB sensor alone. However, studies comparing the consumer-grade RGB sensor to multispectral sensors containing blue bands, such as the MicaSense RedEdge-MX (https://micasense.com) or the DJI P4 Multispectral (https://www.dji.com), should be carried out. Hyperspectral sensors with hundreds of bands covering visible to invisible wavelengths exist and could very well improve the accuracy of species classifications [29,31,95], but they will most likely remain cost prohibitive for some time.

Conclusions
With greater focus being placed on precision forestry, there is a growing need to improve our ability to generate species-level maps of forest communities. UASs, capable of achieving very high spatial and temporal resolutions, have recently become an affordable means of generating these species-level maps. Hardware limitations, mainly weight, have restricted the type of sensors that can be flown. Lower spectral resolution, consumer-grade RGB cameras are frequently being flown due to their lower weight and affordability, but they are not typically optimal for classifying vegetation down to the species level. While lightweight multispectral cameras exist, the costs of these sensors are potentially prohibitive. This study investigated whether taking advantage of UASs' higher temporal resolution to track tree phenology could help to improve the species-level classification accuracy with both RGB and multispectral imagery. Additionally, the optimal phenological timing for UAS data collection was investigated and a comparison between the performance of an RGB sensor and that of a multispectral sensor was carried out.
The results show that there was a considerable and statistically significant increase in accuracy when utilizing a multi-temporal classification compared to a mono-temporal classification. While accuracy increased with additional dates of imagery, there was no significant increase in accuracy beyond three dates of optimally timed imagery. Based on the accuracy of the best performing date combinations, mid- and late spring imagery were found to be crucial points in the growing season to capture, most likely due to the high inter-species spectral heterogeneity and intra-species homogeneity captured at these moments.
The multispectral sensor employed in this study consistently underperformed compared to the RGB sensor. The RGB sensor was found to perform the same as the multispectral sensor when employing a mono-temporal classification, but became statistically better as the number of dates of imagery increased. An analysis of feature importance suggests that the visible bands are important for species classification at this resolution, especially the blue band, and less significance can be placed on the non-visible bands.
This study was conducted in a highly heterogeneous forest; 14 separate species were classified. High inter-species spectral confusion was to be expected, especially where species exhibited similar phenology or were naturally highly variable due to growing conditions or health. Future research is needed to investigate the benefits of derivative layers, such as spectral indices and texture, on overall accuracy. Additionally, expansion of the UAS collection into the late summer/autumn months may present interesting results. Finally, further research is necessary on comparing consumer-grade RGB sensors to multispectral sensors that employ all the visible bands, if not more.
Author Contributions: H.G. conceived and designed the study, conducted the data collection and analysis, and wrote the paper. R.G.C. contributed to the development of the overall research design and analysis and aided in the writing of the paper. Both authors have read and agreed to the published version of the manuscript.