Quantifying Leaf Phenology of Individual Trees and Species in a Tropical Forest Using Unmanned Aerial Vehicle (UAV) Images

Tropical forests exhibit complex but poorly understood patterns of leaf phenology. Understanding species- and individual-level phenological patterns in tropical forests requires datasets covering large numbers of trees, which can be provided by Unmanned Aerial Vehicles (UAVs). In this paper, we test a workflow combining high-resolution RGB images (7 cm/pixel) acquired from UAVs with a machine learning algorithm to monitor tree and species leaf phenology in a tropical forest in Panama. We acquired images for 34 flight dates over a 12-month period. Crown boundaries were digitized in images and linked with forest inventory data to identify species. We evaluated predictions of leaf cover from different models that included up to 14 image features extracted for each crown on each date. The models were trained and tested with visual estimates of leaf cover from 2422 images from 85 crowns belonging to eight species spanning a range of phenological patterns. The best-performing model included both standard color metrics and texture metrics that quantify within-crown variation, with r2 of 0.84 and mean absolute error (MAE) of 7.8% in 10-fold cross-validation. In contrast, the model based only on the widely used Green Chromatic Coordinate (GCC) index performed relatively poorly (r2 = 0.52, MAE = 13.6%). These results highlight the utility of texture features for image analysis of tropical forest canopies, where illumination changes may diminish the utility of color indices such as GCC. The algorithm successfully predicted both individual-tree and species patterns, with mean r2 of 0.82 and 0.89 and mean MAE of 8.1% and 6.0% for individual- and species-level analyses, respectively. Our study is the first to develop and test methods for landscape-scale UAV monitoring of individual trees and species in diverse tropical forests. Our analyses revealed undescribed patterns of high intraspecific variation and complex leaf cover changes for some species.


Introduction
Plant phenology has long been recognized as a critical driver of ecosystem processes, and one heavily influenced by climate. Satellite observations and field data have provided high confidence that phenological shifts have been caused by climate change in recent decades in temperate and boreal ecosystems [1]. Compared to temperate and boreal ecosystems, however, relatively little is known about the leaf phenology of tropical forests, how it responds to climate variability, and how phenological shifts might impact tropical forest ecosystems [2,3]. Tropical forests have many more tree species than temperate forests, and tropical species exhibit a much greater variety of phenological patterns, including species that are entirely leafless for several months, species that partially lose their leaves for an extended time, species that drop and flush leaves quickly, species with continuous leaf turnover throughout the year, and species with two periods of leaf drop and leaf flush, with multiple types often present within a given site [4,5]. To better understand the potential drivers of variation in tropical tree leaf phenology, a critical first step is to quantify patterns in tropical forest phenology at individual and species levels. However, few datasets have the combination of high spatial and temporal resolution and comprehensive coverage needed to quantify intra- and interspecific variation in tropical leaf phenology, which requires individual tree-level monitoring for many species.
Commonly used field methods and satellite remote sensing have provided important insights into tropical forest phenology, but have difficulty capturing leaf phenology patterns among and within species on a landscape scale. Field observations, such as scoring tree crowns for leaf presence or cover [6–9], are useful for individual tree-level monitoring, but are very labor-intensive. In addition, it can be difficult for an observer on the ground to make accurate assessments of the upper canopy in tall, dense, multi-layered forests. Litter traps can provide long-term information on species- and stand-level leaf-fall [10,11], but this method provides limited information on intraspecific variation and no information about the timing of leaf flush. Satellite images, which have been used successfully in temperate and boreal areas to document phenology patterns and their shifts in response to climate, are more difficult to interpret in tropical areas. Frequent cloud cover, smoke from fires and haze, and sensor artifacts complicate satellite-based studies of tropical forest phenology [12], which has led to ongoing debate concerning the patterns of phenology in areas such as the Amazon [13–16]. Furthermore, the spatial scale of satellite images allows tracking of at best only the largest canopy trees, thus precluding individual- or species-level measurements for all but a few species [17,18].
Near-surface remote sensing, such as tower-mounted cameras ("phenocams"), has provided important next steps in quantifying tropical forest leaf phenology. Phenocams can quantify leaf phenology for multiple individual crowns with high frequency (every 30 min) without being obscured by clouds [15,19]. Analyses of phenocam images of a 4-ha area in the Amazon showed that the majority of crowns flush new leaves during the dry season, providing support for some satellite-based studies that indicate Amazon forests "green up" in the dry season [15]. The frequent temporal sampling of phenocams allows the selection of cloud-free images with consistent atmospheric conditions and fixed viewing geometry to quantify changes in the color of crowns related to leaf cover changes. However, in species-rich forests, the limited area covered by phenocams provides insufficient sample sizes for studying intra- and interspecific variation of leaf phenology. Furthermore, most individual crowns are not georeferenced in phenocam images, making it difficult to link these crowns to specific trees on the ground for which field data, including species identification, have been recorded [15,19].
Unmanned aerial vehicles (UAVs) are another form of near-surface remote sensing that can potentially provide important advances in monitoring tropical forest phenology. UAV images can cover areas ranging from many hectares to many square kilometers, thus encompassing enough trees to provide large sample sizes for many species. Since UAV images view crowns directly from above and can be georeferenced, trees in UAV images can be linked with trees on the ground, so that data on species identity, stem diameter, growth rate, physiological measurements, and micro-environmental conditions can be associated with each crown in the image. Phenological studies using UAVs in a temperate forest have used individual color-channel indices (Green and Blue Chromatic Coordinates, GCC and BCC) from image time-series, taken under cloud-free conditions, to quantify the contribution of individual species to landscape patterns of phenology [20,21]. Unfortunately, in many tropical forest regions, opportunities for cloud-free flights are too infrequent to develop full-year time-series under consistent atmospheric conditions, which may limit the utility of simple color metrics such as GCC. As with satellite data, near-surface images of tropical forests are affected by highly variable solar conditions, frequent cloud shadows, and mist that hangs over the canopy. Therefore, a key challenge for using UAV images to monitor tropical forest leaf phenology is to develop methods to extract accurate quantitative phenology estimates under variable lighting and viewing conditions.
An alternative approach to using simple color indices (e.g., GCC) is to capitalize on the wealth of detail in fine-resolution (e.g., <10 cm) UAV images, which can be used to quantify variation or "texture" within tree crowns [22]. Remote-sensing texture metrics have been used to analyze patterns in ecosystem characteristics such as species richness [23–25], but have not been used for phenological studies. Even under variable lighting conditions, a human interpreter can easily recognize changes in leaf cover of canopy trees in a fine-resolution UAV image time-series. Leafless crowns, which have exposed branches, shadows between branches, and upwelling radiation from understory trees beneath the upper canopy, have high within-crown variation in color. In contrast, fully-leaved crowns have mostly green leaves, little to no exposed branch area, and fewer shadows, and therefore less within-crown color variability. Texture-based image features that quantify within-crown color variation [22] may therefore be sensitive to changes in leaf cover, even under highly variable lighting conditions.
In this study, we propose and test methods for quantifying leaf phenology of individual tree crowns in tropical forests using UAV image time-series. Our study is the first to develop and test methods suitable for landscape-scale monitoring of individual trees and species in diverse tropical forests. We used percent leaf cover (or "leaf cover"), the two-dimensional coverage of green leaves, to quantify leaf phenology of tree crowns. Leaf cover has often been used in field-based studies of leaf phenology [6,9], but little used in remote sensing to this point. Percent leaf cover is similar to, but at a finer spatial scale than, percent canopy or crown cover [26,27], the spatial coverage of green tree canopies, which is more widely used in remote-sensing studies. Although percent leaf cover limits our study to the two-dimensional coverage of leaves in the upper canopy, assessing three-dimensional leaf distributions would require understanding the seasonal change of leaf position across the vertical structure of tree crowns, which is beyond the scope of this study. Our methodology uses a machine learning approach, in which observations (human-interpreted estimates of leaf cover in image time-series) for individual tree crowns are used to train statistical models that predict leaf cover from color and texture features. We quantify how well our method captures seasonal patterns of leaf cover for individual trees and species, and how model performance depends on the types of image features (color and/or texture indices) included in the model. We evaluate model performance based on prediction errors for out-of-sample observations (i.e., observations excluded from training data) for a variety of data subsets, including 10-fold cross-validation, out-of-sample individual trees, out-of-sample species, and out-of-sample time periods. We also evaluate the repeatability of human image interpretation. We show that in our study, visual estimates of leaf cover are largely repeatable among different human observers, and that model predictions are insensitive to modest differences among observers.

Study Site and Ground-Based Forest Inventory Data
This study was conducted in the 50-ha forest dynamics plot in lowland moist tropical forest on Barro Colorado Island (9°9′12.002″ N, 79°51′2.491″ W) in the Republic of Panama. The 50-ha plot (1000 m × 500 m) is the site of a long-term forest dynamics monitoring effort in which every tree ≥ 1 cm in DBH (diameter at breast height, 1.3 m) has been tagged, identified to species, and recensused every 5 years since 1985 [28]. As explained below, we matched individual crowns in UAV images to trees in the 50-ha plot, so that the species identity of each crown was known. Mean annual rainfall at the site was 2657 mm (SD = 478 mm) for the period 1929 to 2017, with a pronounced dry season from mid-December to mid-April when many trees lose some or all of their leaves [29].

Overview of UAV Image Acquisition and Analysis
We collected and processed one year of weekly to biweekly aerial images (34 image sets from 2 October 2014 to 30 September 2015) to quantify intra-annual patterns of variation in leaf cover of individual trees and species. We manually delineated boundaries of individual tree crowns based on the 2 October 2014 orthomosaic image, then used ground surveys to link each crown to a tagged tree in the 50-ha plot. After quality control, we created a training and testing dataset for a machine learning algorithm by visually interpreting leaf cover for 2422 images of 85 tree crowns on 34 dates. We used the machine learning algorithm to develop alternative models based on different sets of color and/or texture features extracted from each image. The best-performing model was then applied to all images for subsequent analyses, including evaluating model performance for individual dates, as well as annual patterns of leaf cover for individual trees and species. The workflow is presented in Figure 1.


UAV Flights and Image Processing
Here, we provide a brief description of image collection and processing methods; additional explanation and details are in Appendix A. UAV remote sensing was carried out with a consumer-grade multirotor aircraft equipped with the Pixhawk flight controller (3DRobotics Inc.; http://copter.ardupilot.com) and configured for automated waypoint flight. The multirotor design was based on previous work [30,31]. Missions were flown approximately every week from 2 October 2014 to 1 June 2015 and then approximately every two weeks from 1 June 2015 to 30 September 2015. The camera was calibrated for every flight to reduce effects from variable illumination within and among flights. Post-collection image processing to further reduce variability across color channels involved radiometric normalization of the image set using pseudo-invariant features and histogram equalization techniques. Radiometrically normalized image sets were processed into 7-cm resolution georeferenced orthomosaics using Agisoft Photoscan Professional (64-bit, v1.1.6, build 2038). Due to the relatively low accuracy of the UAV's on-board GPS, the orthomosaics showed large horizontal georeferencing errors (>5 m) relative to an existing LIDAR digital surface model of the area from 2009 [32]; horizontal georeferencing was therefore performed manually in ArcGIS against this high-resolution LIDAR digital surface model, reducing the errors to <1 m. An example of a full orthomosaic image is shown in Figure 2.

Manual Tree Crown Delineation
We linked delineated crowns in the images to stems in the 50-ha forest inventory plot using methods described in References [33,34], as follows. Within the 50-ha plot, we manually delineated boundaries of all individual tree crowns with an area greater than 50 m² (crown area based on the 2 October 2014 orthomosaic image), as well as some tree crowns smaller than 50 m² that were distinct enough in images to delineate. We then used ground surveys to link each crown to a tagged tree in the 50-ha plot. The crown polygon map, tree stem locations, and orthomosaic were loaded on a tablet, which was used in the field to match each crown polygon to a tagged tree. The total number of crowns digitized and linked to tagged trees was 2133, but we use only a subset of these crowns in this paper, as described below. Additional details on the digitizing and field procedures are described in Appendix A.


Quality Assessment of Images
For each crown, we created a time series of images by extracting the relevant area from each orthomosaic. We collated these images into a "montage" (Figure 3) to facilitate visual assessment of changes over time. In the context of our analyses, "image" hereafter refers to the area within a crown polygon on a given date. We performed a visual quality assessment of all crown montages to remove problematic images. We removed images that had visible flower cover (0.5% of the total images) because the present study focuses only on leaf phenology, and the presence of flowers tended to result in an underestimation of leaf cover. We also removed crown montages for which we could not visually estimate leaf cover because the images were too blurry or deeply shadowed (22% of crown montages; Figure A1).

Generating an Observation Dataset for the Machine Learning Algorithm
We used a subset of the quality-controlled crown montages to generate an observation dataset to train and test a machine learning algorithm. The subset analyzed in this paper included all quality-controlled images for 85 individuals (2422 images) belonging to eight tree species that are known to be deciduous (Table 1). We focused on deciduous species in order to provide a wide range of percent leaf cover for model training and testing. To create an observation dataset, human observers visually interpreted each image to estimate percent leaf cover (e.g., Figure 3). As with ground-based visual estimates of percent leaf cover that are widely used to study leaf phenology in the tropics [6–9], visual interpretation of images is prone to human error. However, because UAV images provide an unobstructed view of the canopy from above, we expect errors in UAV image interpretation to be smaller than those associated with ground-based visual estimates (particularly in dense, multi-layered tropical forests). We evaluated the repeatability of leaf cover estimates among observers, as well as the sensitivity of our analysis to observer differences (for details, see Appendix A). Because repeatability was high (see Results), analyses presented in this paper are based on a single observer (unless stated otherwise). In estimating percent leaf cover, we did not distinguish between liana and host tree leaf cover. Therefore, the estimates included any leaves and branches that were present in the crown, regardless of whether they originated from the host tree or a liana.


Extraction of Image Features
For each image, we calculated 14 color and texture metrics ("features"; Table 2), where "texture" can include a variety of metrics of variation within an image [22]. Because images for a given tree crown are not perfectly aligned across dates, we reduced each crown polygon by a 2-m buffer to ensure that features were derived only from the crown of interest. Below, we briefly describe the color and texture features. Additional details are in Appendix A. Color features included Red, Green, and Blue Chromatic Coordinates (RCC, GCC, and BCC), Excess Green (ExG), and canopy surface fractions of green vegetation (GV) and non-photosynthetic vegetation (NPV) from endmember analysis. We calculated crown-level means of each color index (hereafter, "RCCm", "GCCm", and "BCCm") for each image as in previous analyses of phenocam images [15].
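The crown-level mean color features can be sketched as follows. This is a minimal numpy illustration of the chromatic-coordinate and Excess Green calculations only (the endmember fractions GV and NPV are omitted); the function name and the black-pixel handling are our own assumptions, not the paper's implementation.

```python
import numpy as np

def color_features(pixels):
    """Crown-level mean color features from an (N, 3) array of RGB crown pixels.

    Sketch of the chromatic coordinates (channel / channel sum) and
    Excess Green (2G - R - B) described in the text.
    """
    rgb = np.asarray(pixels, dtype=float)
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    total = r + g + b
    total[total == 0] = np.nan           # skip pure-black pixels (assumption)
    return {
        "RCCm": np.nanmean(r / total),   # mean Red Chromatic Coordinate
        "GCCm": np.nanmean(g / total),   # mean Green Chromatic Coordinate
        "BCCm": np.nanmean(b / total),   # mean Blue Chromatic Coordinate
        "ExGm": np.mean(2 * g - r - b),  # mean Excess Green
    }
```

In practice the pixel array would come from masking the orthomosaic with the buffered crown polygon for a given date.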
Texture features included the standard deviations of the red, green, and blue channels and of ExG, NPV, and GV. We also derived two texture features from the Grey-Level Co-occurrence Matrix correlation (Gcor), which quantifies the similarity in grey levels among neighboring pixels within moving windows. We calculated Gcor within moving windows of 5 × 5 pixels (35 × 35 cm), and then calculated the median and standard deviation of Gcor within each image (Gcor_md and Gcor_sd, respectively).
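A simplified sketch of the Gcor features follows. The exact GLCM configuration (grey-level quantization, pixel offset, window handling) is not specified above, so the choices here (8 grey levels, horizontal offset of one pixel, non-overlapping rather than moving 5 × 5 windows) are illustrative assumptions.

```python
import numpy as np

def glcm_correlation(window, levels=8):
    """GLCM correlation for one window, horizontal offset of 1 pixel (assumed)."""
    # Quantize 0-255 grey values into `levels` bins.
    q = np.floor(np.asarray(window, float) / 256.0 * levels).astype(int)
    q = q.clip(0, levels - 1)
    glcm = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):  # neighbor pairs
        glcm[a, b] += 1
    p = glcm / glcm.sum()                                  # joint probabilities
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * p).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * p).sum())
    if sd_i == 0 or sd_j == 0:
        return 1.0                       # uniform window: perfectly correlated
    return (((i - mu_i) * (j - mu_j) * p).sum()) / (sd_i * sd_j)

def gcor_features(grey, win=5):
    """Median and SD of Gcor over win x win windows of a greyscale crown image
    (the paper uses moving windows; tiled windows here for brevity)."""
    h, w = grey.shape
    vals = [glcm_correlation(grey[r:r + win, c:c + win])
            for r in range(0, h - win + 1, win)
            for c in range(0, w - win + 1, win)]
    return {"Gcor_md": float(np.median(vals)), "Gcor_sd": float(np.std(vals))}
```

A library implementation (e.g., scikit-image's `graycomatrix`/`graycoprops`) would normally replace the hand-rolled GLCM loop.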

Model Training and Evaluation
To predict leaf cover of each crown polygon from different sets of the 14 color and texture features described above, we used stochastic gradient tree boosting (SGTB; [38]), a machine learning algorithm. Details on our implementation of SGTB are provided in Appendix A.
We compared alternative models incorporating some or all of the 14 crown features. We evaluated the performance of each model using K-fold cross-validation [39], which evaluates model performance based on "out-of-sample error" (i.e., how well the model fits validation data that were not used for training). Out-of-sample errors are particularly relevant for our study, because we are interested in developing and testing methodology that could in the future be applied to individuals, species, or time periods for which no training data are available. In contrast, AIC and related metrics are based on "in-sample error" (i.e., how well the model fits the training data) [39]. We used K = 10 (i.e., 10-fold cross-validation), which provides a good balance between variance (precision of predictions) and bias (accuracy of predictions) [39]. We implemented 10-fold cross-validation as follows: we randomly partitioned the observation data (2422 images) into 10 subsets, and used nine subsets (90% of observations) for training and one subset (10% of observations) for out-of-sample testing. For each model, we repeated this process 10 times, so that each subset was used nine times for model training and once for validation. This procedure provided one out-of-sample prediction for each observation.
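The 10-fold procedure can be sketched as below. To keep the example dependency-free, a least-squares linear model stands in for SGTB; in practice the same loop would train an SGTB model (e.g., scikit-learn's GradientBoostingRegressor with subsample < 1) on each training partition.

```python
import numpy as np

def kfold_oos_predictions(X, y, k=10, seed=0):
    """One out-of-sample prediction per observation via K-fold cross-validation.

    A linear least-squares fit is a placeholder for stochastic gradient
    tree boosting (SGTB); the CV bookkeeping is the point of the sketch.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                 # random partition of images
    folds = np.array_split(idx, k)
    pred = np.empty(len(y), dtype=float)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A = np.c_[X[train], np.ones(len(train))]  # design matrix + intercept
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred[test] = np.c_[X[test], np.ones(len(test))] @ coef
    return pred
```

Each observation is predicted exactly once, by a model that never saw it during training, which matches how the goodness-of-fit metrics below are computed.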
In addition to 10-fold cross-validation, we performed additional analyses to explore how the amount and timing of training data affect model performance. These additional analyses included using the first half of the observations (2 October 2014 to 25 February 2015) for training and the second half (4 March 2015 to 30 September 2015) for validation; using the second half for training and the first half for validation; and alternating sequential sampling dates for training/validation.
To evaluate model performance, we used three goodness-of-fit metrics for out-of-sample predictions (i.e., all metrics were based on predictions for observations not included in training data): r2, mean absolute error (MAE), and mean error (ME). r2 is the squared Pearson correlation between observed and predicted leaf cover, which indicates the proportion of variance in leaf cover explained by the model. MAE and ME are defined as:

MAE = (1/N) ∑_{i=1}^{N} |pred_i − obs_i|

ME = (1/N) ∑_{i=1}^{N} (pred_i − obs_i)

where N is the number of images, and obs_i and pred_i are observed and predicted leaf cover of image i.
Both MAE and ME have units of % leaf cover. MAE measures the magnitude of the deviations between predicted and observed leaf cover, whereas ME measures bias (systematic over- or under-prediction).
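The two error metrics translate directly to code; the sign convention assumed here is that positive ME indicates systematic over-prediction.

```python
import numpy as np

def mae_me(obs, pred):
    """Mean absolute error and mean error (bias), both in % leaf cover."""
    d = np.asarray(pred, dtype=float) - np.asarray(obs, dtype=float)
    return np.mean(np.abs(d)), np.mean(d)
```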
Because the "full model" (including all 14 features) had the best goodness-of-fit metrics (see Results), we used the full model for subsequent analyses described below. Note that because goodness-of-fit metrics were based on out-of-sample predictions, the full model was not guaranteed a priori to have the best goodness-of-fit.

Evaluating Model Predictions of Intra-Annual Variation of Leaf Cover
Using the full model, we evaluated model predictions of intra-annual variation in leaf cover for the mean of all trees in the observation dataset on a given date, for individual species, and for individual trees. For the analysis of mean leaf cover, we used the out-of-sample prediction for each image from the 10-fold cross-validation procedure described above. We then calculated r2 over time (34 sample dates) between the observed and predicted means. We also evaluated the r2 over time between the observed means and GCCm (averaged over all trees on a given date), because GCCm is the most commonly used metric for quantifying canopy phenology in near-surface remote sensing.
To evaluate how well the model captured intra-annual variation in leaf cover for a given species, we trained the model using observations for all species except the target (out-of-sample) species. We then generated species-level time-series of observed and predicted leaf cover by averaging over all crowns of the target species for each of the 34 sampling dates. We evaluated these species-level predictions using r2, MAE, and ME. We repeated this process for each of the eight species in the observed dataset.
To evaluate how well the model captured intra-annual variation in leaf cover for a given individual tree, we trained the model on all observations except those of the target (out-of-sample) tree, and then evaluated model predictions for the target tree for the 34 sampling dates. We evaluated these individual-level predictions using r2, MAE, and ME. We repeated this analysis for each of the 85 individual trees in our observed dataset. To explore how individual tree properties might influence model performance, we regressed each goodness-of-fit metric (r2 and MAE) on mean observed leaf cover (across the 34 sample dates for a given tree) and log-transformed crown area. The sample size for these regressions was the number of trees in our analysis (n = 85).
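The species- and individual-level evaluations share the same hold-one-group-out structure, sketched generically below. Here `fit` and `predict` are placeholders for any regression model (SGTB in this study), and the group labels would be tree tags or species codes; the function name is our own.

```python
import numpy as np

def leave_one_group_out(X, y, groups, fit, predict):
    """Out-of-sample predictions where each group (tree or species) is held
    out in turn and predicted by a model trained on all other groups."""
    pred = np.empty(len(y), dtype=float)
    for g in np.unique(groups):
        held = groups == g                  # all images of the target group
        model = fit(X[~held], y[~held])     # train on every other group
        pred[held] = predict(model, X[held])
    return pred
```

Averaging the held-out predictions over all crowns of a species on each date then gives the species-level time-series compared against observations.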

Evaluating Model Predictions
All model performance metrics (r2, MAE, and ME) presented in this paper are based on out-of-sample predictions (i.e., predictions for observations excluded from training data for a given analysis). The data subset used to evaluate out-of-sample predictions differed among analyses. For the 10-fold cross-validation used to compare the performance of models that include different predictor variables, out-of-sample observations were randomly selected (see details above). For split-dataset analyses (50% training, 50% validation), out-of-sample observations were from the first half of the study period, the second half, or alternate dates. For individual tree- or species-level tests, out-of-sample observations belonged to a single tree or species.

Comparing Models with Different Image Features
The best-performing model (based on 10-fold cross-validation) was the "full model" that included all 14 image features, including both color and texture features (r2 = 0.84; MAE = 7.8%; ME = −0.07%; Tables 3 and A1). Models based only on color features performed substantially worse than the full model. Using only crown-level average GCC (GCCm), the most commonly used measure for quantifying phenology in near-surface remote sensing, model performance was relatively poor (r2 = 0.52; MAE = 13.6%; ME = 0.10%; Table 3). The model including all six mean color features did considerably better than models based on any single color feature. Models based only on texture features outperformed models based only on color features. The best single-feature model used a texture feature, the standard deviation of the blue channel (Bsd; r2 = 0.63; MAE = 12.9%; ME = 0.02%; Tables 3 and A1). All texture features were negatively correlated with leaf cover; i.e., less leaf cover was associated with higher within-crown standard deviations in color metrics and in local measures of pixel variability (Table A2). The model using the standard deviations of all three color channels (Rsd, Gsd, and Bsd) had a slightly better fit (r2 = 0.69; MAE = 11.2%; ME = −0.06%) than the model using the means of all three color channels (RCCm, GCCm, and BCCm) (r2 = 0.62; MAE = 11.9%; ME = −0.07%; Table A1). Color features tended to be correlated with other color features, and texture features with other texture features. However, there was little correlation between color features and texture features, indicating that these two groups of features provide largely independent information (Table A2).

Model Performance in Capturing Intra-Annual Variation
Mean predicted leaf cover (averaged over all trees with observed leaf cover on a given date) closely tracked mean observed leaf cover over time (r2 = 0.94 for 10-fold cross-validation; Figure 4a, Table A3). In contrast, GCCm was only weakly correlated with mean observed leaf cover over time (r2 = 0.32; Figure 4b). Predictions derived from using only the first half (2 October 2014 to 25 February 2015) or second half (4 March 2015 to 30 September 2015) of observations for model training did not perform as well as predictions from 10-fold cross-validation (Figure A2; r2 = 0.75 for the first-half/second-half split). In contrast, when alternate image dates were used for model training (i.e., every other sample date used for training, with the alternate dates used for validation), predictions performed nearly as well as those from 10-fold cross-validation (Figure A2; r2 = 0.88 for alternating dates).

Intra-annual variation in species-level leaf cover (i.e., temporal variation in mean leaf cover of individuals of a given species) was well-predicted by the full model fit to other species, with mean (across species) r2 of 0.89 (range 0.74 to 0.97), mean MAE of 6.0% (range 3.7% to 8.2%), and mean ME of 0.2% (range −4.6% to 4.3%) (Figure 5; Table A4).

Intra-annual variation in individual-tree leaf cover was well-predicted by the full model fit to other individuals, with mean (across individuals) r2 of 0.82 (range 0.36 to 0.99), mean MAE of 8.1% (range 1.2% to 17.9%), and mean ME of −0.07% (range −12.5% to 12.7%) (Figure 6; Table A5; statistics are for out-of-sample individual-level predictions). MAE for intra-annual variation tended to decrease across individual trees with increasing mean annual leaf cover, but r2 was not significantly correlated with mean annual leaf cover (Figure A3). Neither MAE nor r2 for intra-annual variation was significantly correlated with crown size (Figure A3). Of the 15 crowns with r2 ≤ 0.70, 13 belonged to Cordia alliodora and Ceiba pentandra (Figure 6h).

Repeatability of Visual Leaf-Cover Estimates and Model Predictions
Estimates of percent leaf cover based on visual interpretation of images were largely repeatable among multiple observers (r2 = 0.88 and 0.78 for comparisons between pairs of human observers; Figure A4). Model predictions of leaf cover for individual images were insensitive to observer differences (r2 = 0.97 and 0.95 for comparisons between pairs of predictions derived from different observers' estimates; Figure A5a,b). Model predictions of leaf cover time-series for individual trees were also insensitive to observer differences (Figure A5c).

Discussion
Our framework for analyzing UAV images to quantify individual-level tree phenology in a tropical forest, combining field work, visual image interpretation, and a machine learning algorithm, performed well at predicting leaf cover of individual crowns on a given date and at quantifying intra-annual patterns of leaf phenology for individuals and species. The best-performing model used all 14 image features that we explored, including means and standard deviations of color channels, as well as more computationally intensive features (end members from a mixture model and a moving-window measure of pixel variability). A model using only color means and standard deviations performed nearly as well as the full model. A model based only on the mean of the green color channel, which has been widely used in phenological studies of temperate ecosystems, performed poorly at our tropical forest site. We suggest that texture features, which quantify variability within images, should be considered in future studies of canopy phenology, particularly in cases where mean color features do not perform well. Some texture features may be more robust than widely-used mean color features to highly variable illumination within and among sampling dates and to imperfect image calibration (as in our study).

Image Features for Quantifying Phenology
Image texture features, particularly the standard deviations of the individual color channels (Bsd, Gsd, and Rsd), were useful for quantifying leaf cover phenology in this tropical forest. Texture features capture variability in brightness and color, which is higher in leafless and partially leafless crowns due to exposed branches and shadows. Texture features were only weakly correlated with color features (Table A2), indicating that these two feature classes provide largely independent information. Texture features may be particularly useful when illumination is variable among images (as in our dataset), because much of the variability in color will be preserved even if overall illumination changes. The single most predictive image feature we tested was the standard deviation in blue color (Bsd; Tables 3 and A1). The blue channel is a chlorophyll absorption band, such that leaves and shadows appear dark and branches appear light. Therefore, within-crown variability in the blue channel increases as leaves are dropped, branches are exposed, and within-crown shadowing increases (Figure A6). Although the red channel also includes wavelengths with high chlorophyll absorption, the standard deviation in red color (Rsd) did not perform as well as the standard deviation in blue color (Bsd).
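A moving-window measure of local pixel variability, one of the texture features named above, can be approximated as the mean of local standard deviations over a sliding window. This is our simplified sketch; the window size and aggregation used in the original analysis may differ.

```python
from statistics import mean, pstdev

def local_variability(channel, rows, cols, win=3):
    """Mean of local (win x win) standard deviations over one color channel.

    channel is a flat, row-major list of pixel values for a crown image.
    Border pixels without a full window are skipped for simplicity.
    """
    h = win // 2
    sds = []
    for i in range(h, rows - h):
        for j in range(h, cols - h):
            window = [channel[(i + di) * cols + (j + dj)]
                      for di in range(-h, h + 1) for dj in range(-h, h + 1)]
            sds.append(pstdev(window))
    return mean(sds) if sds else 0.0
```

A smooth, fully leaved crown produces near-zero local variability, while a crown with bright branches against dark shadows produces high values, mirroring the behavior of the per-channel standard deviations.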
The most commonly used image metric in studies of leaf phenology in temperate forests, mean GCC (GCCm), had limited power to explain variation in leaf cover among crowns in our study, and color metrics were generally less useful than texture metrics. Multiple factors may have contributed to the limited ability of color metrics to predict leaf cover in our dataset. First, tropical forests have multi-layered canopies [33], and understory trees are less likely to be deciduous than canopy trees [40], such that an image from above a crown that is partially or fully deciduous often includes considerable green from understory trees. Second, our study site has high canopy tree species diversity, and leaf color can vary considerably within and among species with leaf traits and leaf age [41,42]. We can gain insight into the usefulness of different image features by considering only those tree crowns with 100% leaf cover on a given date. For these fully-leaved crowns, a robust image feature (e.g., one that is insensitive to variation in illumination due to sun angle and cloud cover) should be relatively constant throughout the year. For fully-leaved crowns, GCCm was highly variable over time, whereas the standard deviation of the blue color channel (Bsd) was more constant (Figure A7). This comparison helps explain why Bsd was a better predictor of leaf cover than GCCm in our analysis.
In general, the ability of color versus texture metrics to predict leaf cover (or other ecological features of a tree crown) will depend strongly on characteristics of the images, as well as on the study site. Higher pixel resolution will provide more pixels per tree crown, likely increasing the observable differences in texture. With lower pixel resolution, mean crown color may provide more information than texture. Similarly, blurry images will decrease the color variability among pixels and make texture measures less useful in comparison to mean color metrics. In contrast, more consistent illumination or better color normalization across dates will increase the utility of color metrics, as will similarity in leaf color across species and crowns within a dataset. Finally, our study focused on leaf cover, which is only one aspect of canopy phenology. Mean color metrics may be useful for quantifying other aspects of canopy phenology [15]; e.g., photosynthetic capacity depends not only on leaf cover, but also on leaf quality, which may be correlated with mean color metrics related to chlorophyll absorption bands (e.g., red and blue channels).

Sources of Prediction Error
Any errors in the human interpretation of the images would of course limit the extent to which such observations represent true leaf cover. However, observation errors are likely to be less common for our observations than for standard ground-based observations of leaf cover. The UAV images we analyzed provide an unobstructed view of the canopy from above (Figure 3), allowing for much easier human interpretation compared to observers on the ground looking up at the canopy (sometimes over 40 m tall) through what is often a dense, multi-layered understory. In our study, leaf cover estimates from visual interpretation of images by different observers were largely repeatable despite the lack of observer training (Figure A4), and model predictions were insensitive to modest differences among observers (Figure A5 and Table A6).
Beyond observation errors, the amount and characteristics of training data (e.g., the range of image features covered in the training dataset) could strongly affect model performance. In our study, model performance was more sensitive to the seasonal distribution of training data than to the amount of training data. Specifically, if only half of the observations were used for training, the model still performed well if the observations were distributed throughout the entire year (Figure A2c), whereas model performance was worse if training data included only the first half (or only the second half) of the observation time-series (Figure A2b). It is likely that the temporal distribution of training data is important because sun angle, cloud cover, atmospheric smoke and haze, or other image features changed over the one-year period of sampling, so that temporally-restricted training data do not include the full "image space" needed for robust predictions. This "image space" issue may also explain why out-of-sample predictions were relatively poor for some species and individuals (Figures 5 and 6), e.g., if certain species or individuals have image features that were not well-represented in the training data. Future community-level analyses that target additional species (beyond the eight species studied here) may therefore benefit from additional training data.

Advances in Phenological Observations of Tropical Forests
Our UAV-based methods precisely document intra-annual leaf cover changes in many individual canopy trees at fine temporal resolution, and thereby quantify interspecific and intraspecific variation in deciduousness within a tropical forest. These methods provide considerably more detailed information on leaf phenology than was previously available, even in this exceptionally well-studied tropical site. Previous studies of leaf phenology at this site involved qualitative descriptions of species leafing patterns [4], field observations of a few individuals of selected species at a few dates during the dry season [6], and analysis of litterfall patterns [11,43]. Litterfall data are perhaps the most widely available type of leaf phenology data, yet they are usually collected only at the stand scale, and do not in and of themselves indicate whether (or when) trees are deciduous because they provide no information on the timing of leaf production. No previous study at this tropical forest site provided comprehensive, quantitative information on intra-annual leaf cover changes within and among species. As discussed below, the methods we present here could be extended to quantify phenology of additional species at our study site (based on currently available imagery), as well as at other sites (if imagery is available).
Our results document subtle patterns, including the speed of leaf cover change, the degree of intraspecific variation in leaf phenology, and their variation among species. Some species (Cavanillesia platanifolia, Zanthoxylum ekmanii) show rapid declines in leaf cover, while other species (Cordia alliodora, Handroanthus guayacan) show more gradual declines. Some species (Cavanillesia platanifolia, Sterculia apetala, and Zanthoxylum ekmanii) exhibit highly synchronous leaf cover variation, while other species (Platypodium elegans, Cavanillesia platanifolia, Cordia alliodora) show a high level of intraspecific variation in leaf phenology (Figure A8). Some individuals of some species showed two periods of low leaf cover during the year, which had been previously documented for Tabebuia rosea [4] (Figure 5), but not for Ceiba pentandra (Figures 5 and 6e) or Platypodium elegans. The leaf cover patterns illustrated in this study can be further analyzed to estimate phenology metrics such as dates and rates of leaf cover increase and decrease. Variability among species and individuals in the timing, degree, and number of deciduous events per year makes defining and quantifying these phenology metrics complicated in tropical forests. This topic is beyond the scope of this paper and will be addressed in a future study.
The methods described in this paper could be used to document the phenology of many individual canopy plants over long time periods. Although our analyses here focused on only 85 crowns, our imagery includes time series for over 1600 delineated crowns, illustrating the potential for large-scale monitoring with these methods. Our analyses demonstrate that UAV images, including images collected under cloudy conditions and with variable illumination, can provide high-resolution and high-quality quantitative data on leaf cover, which should allow for long-term observations of large numbers of individuals and species. Long-term observations in tropical forests have been highlighted as a critical need for understanding drivers of phenology [2]. Given the recent advances in the reliability of UAVs, our methods could be easily applied at other sites and over larger areas. Other forest dynamics plots, where canopy trees have already been mapped and identified to species [44], would be obvious candidate sites to apply our methods. Even in areas where trees have not been mapped and identified on the ground, our methods could still be used to quantify canopy phenology at the stand scale and for individual crowns (even without data on species identity). Our methods performed well throughout both the dry and wet seasons (Figure 4a), suggesting they would work well in longer-term studies.
Quantitative information on inter- and intraspecific variation in canopy phenology constitutes a valuable resource for understanding the environmental drivers of tropical tree phenology, and thereby for improving models of forest carbon and water fluxes. Recent studies demonstrate that stand-level leaf phenology is the primary driver of photosynthetic phenology in Amazon forests [14], and that mechanistic models of leaf phenology incorporating interspecific variation can improve predictions of stand-level leaf phenology [45]. High-resolution, UAV-derived phenology data for multiple species and plant functional types could be used to calibrate or drive ecosystem process models, enabling improved predictions of tropical forest ecosystem function across time and space.

Conclusions
The methodology presented here, tested with eight canopy tree species, could be applied to monitor phenology of individuals and species at the landscape scale (e.g., thousands of trees) in tropical forests. Machine learning algorithms are becoming widely used to translate image data, including images from UAVs, into ecological information. Our study demonstrates three key components of a research program that combines UAV images and machine learning to quantify tropical forest phenology. The first key component is fieldwork that links tree crowns in UAV images to trees that are monitored on the ground (i.e., in forest inventory plots), thus allowing identification of species, as well as linkages to other data not used in the present study (e.g., growth rates and functional traits of individual trees). Although much of the methodology we present could be applied to monitor stand-level phenology in landscapes where ground data are not available, the link between UAV images (or other fine-spatial-resolution remote-sensing data) and ground data is critical for monitoring phenology at the species level. The second key component in our pipeline is generating training data for machine learning algorithms using visual estimates of leaf cover, which were largely repeatable among different human observers. UAV images provide a clear, unobstructed view of the canopy, which likely makes visual estimation of leaf cover less error-prone compared to traditional ground-based phenological studies in dense, multi-layered tropical forests. The third key component is including information on color variation ("texture") in image analysis. Models that included both color and texture features had better predictive ability than those based only on color features. We suspect that texture features are less sensitive than color features to variation in illumination within and among dates, and to imperfect image calibration. Our analysis of eight species showed contrasting patterns, with some species having nearly synchronous phenology among individuals, and other species having high levels of intraspecific variation. Landscape-scale applications of our methodology will allow for unprecedented community-level analyses of intra- and interspecific variation in tropical forest canopy phenology.
Funding: Financial support was provided by a Smithsonian Institution Scholarly Studies Grant to H.C.M.L. (for UAV data collection), Smithsonian Institution ForestGEO Global Earth Observatories (for field work), a University of Florida Biodiversity Institute fellowship award to J.Y.P., and sabbatical fellowship awards from sDiv (the Synthesis Centre of iDiv; DFG FZT 118) to S.A.B. and J.W.L.
indicates the depth of each regression tree. Shrinkage affects the learning rate of the model, and is known to yield better performance when it is slow.
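The roles of tree depth and shrinkage can be illustrated with a toy gradient-boosting loop over depth-one regression trees ("stumps"). This is a pedagogical sketch of the general technique, not the model configuration or code used in the study: each stump fits the current residuals, and its contribution is scaled down by the shrinkage (learning rate), so many small steps gradually reduce the error.

```python
def fit_stump(x, resid):
    """Find the threshold split minimizing squared error on the residuals.

    Returns (threshold, left_mean, right_mean); requires >= 2 unique x values.
    """
    best = None
    for t in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, resid) if xi < t]
        right = [r for xi, r in zip(x, resid) if xi >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def boost(x, y, n_trees=100, shrinkage=0.1):
    """Boosted regression stumps: each tree fits residuals, scaled by shrinkage."""
    f0 = sum(y) / len(y)           # initial prediction: the mean response
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, ml, mr = fit_stump(x, resid)
        stumps.append((t, ml, mr))
        # small shrinkage = slow learning; more trees needed, but typically
        # better generalization
        pred = [pi + shrinkage * (ml if xi < t else mr)
                for xi, pi in zip(x, pred)]
    return f0, stumps, pred
```

With a small shrinkage, each stump corrects only a fraction of the remaining residual, which is why slow learning rates paired with more trees tend to perform better.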

Appendix A.7. Repeatability of Leaf Cover Observations and Predictions
To explore the repeatability of leaf cover observations (human interpretation of images) and the sensitivity of model results to differences among observers, we analyzed datasets of observations from multiple observers. The first dataset includes leaf cover observations from a single observer (referred to here as 'Observer 1') of 2422 images belonging to 85 individual trees. These 'Observer 1 observations' are the data used for all analyses in this paper unless stated otherwise. The second dataset is the subset of the Observer 1 observations for which observations were also available from another observer; this subset includes 1780 images belonging to 58 trees. The third dataset combines observations from two observers ('Observers 2 and 3') and includes a total of 1747 images belonging to the same 58 trees as the Observer 1 subset. Although the second and third datasets target the same individual trees, the number of images differs because observers sometimes excluded different images (due to blurry or otherwise poor image quality). No images were interpreted by both Observers 2 and 3. The three observers interpreted images independently, without any training (e.g., practicing with a set of common images). Therefore, our assessment of repeatability is likely a 'worst case scenario' that could be improved by observer training.
Leaf cover estimates from visual interpretation of images by different observers were largely repeatable. For the 1135 images observed by both Observer 1 and Observer 2, the squared Pearson correlation between observers was r2 = 0.88 (Figure A4a). For the 612 images observed by both Observer 1 and Observer 3, the squared Pearson correlation was r2 = 0.78 (Figure A4b). In addition to these comparisons that pool all images, we also quantified the correlation over time between observers for time series of leaf cover for individual trees (examples for Observer 1 are shown in main text Figure 6h). For the 58 tree-level comparisons (i.e., 58 trees with observations recorded by Observer 1 and either Observer 2 or 3), the mean r2 was 0.84, with 83% of comparisons having r2 ≥ 0.75 (Figure A4c). In summary, these results suggest that human interpretation of UAV images yielded largely repeatable leaf cover estimates, even though different observers received no training to standardize their interpretations.
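The between-observer agreement statistics reported here are squared Pearson correlations, which can be computed directly from two paired series of leaf cover estimates:

```python
def r_squared(x, y):
    """Squared Pearson correlation between two paired leaf cover series
    (e.g., estimates of the same images from two observers)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)
```

Note that a constant offset between observers (e.g., one observer systematically estimating 5% higher) does not reduce r2, which is why r2 measures repeatability of relative patterns rather than absolute agreement.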
Model results were not sensitive to the modest differences among observers in their visual estimates of leaf cover (observer differences are described in the preceding paragraph). First, the relative performance of different models (i.e., models that included different image features) was very similar for the three observation datasets, with the rank order of the top three models being identical for all three datasets (Table A6; top three models: (1) full model, (2) mean chromatic coordinates and SDs of color channels, and (3) all texture measures). Second, out-of-sample predictions from 10-fold cross-validation were similar between observers, regardless of whether the Observer 1 full dataset or the Observer 1 subset was used for model training (r2 ≥ 0.95 for all comparisons between Observer 1 predictions and Observer 2 or 3 predictions; Figure A5a,b). Finally, predicted time-series of leaf cover for out-of-sample individual trees were very similar regardless of whether the model was trained on data from Observer 1 or Observers 2+3 (mean r2 = 0.95; Figure A5c). In summary, these results suggest that our model results are robust to the modest differences that occurred among observers in their interpretations of UAV images.
Table A1. Comparison of models that included different sets of image features. Goodness-of-fit statistics are means from 10-fold cross-validation (10 replicates, each with 90% training data and 10% validation data). r2 is the squared Pearson correlation between predictions and the validation observations. MAE and ME are, respectively, the mean absolute error and mean error between predictions and validation observations. Units for MAE and ME are percent leaf cover. Standard deviations of the normalized color indices (GCCsd, RCCsd, BCCsd) are marked with asterisks (*). These three features were excluded from the full model due to their low predictive ability. In contrast, standard deviations of the non-normalized color channels (Gsd, Bsd, Rsd) had relatively high predictive ability and were included with other features (14 in total) in the full model.

Table A5. r2 is the squared Pearson correlation between predictions and out-of-sample individual observations (n = 34 dates for each individual). MAE is the mean absolute error (% leaf cover) between predictions and validation observations. Species codes are defined in Table 1.

Figure A7. GCCm and Bsd over time for crowns with 100% observed leaf cover on a given date (sample sizes range from n = 9 to 65 trees per date). For these fully-leaved crowns, Bsd is less variable over time than GCCm, which suggests that Bsd is less sensitive to variation in illumination or other image features that are unrelated to leaf cover. This likely explains why Bsd is a better predictor of leaf cover than GCCm in our study (Table A1). Y-axes were scaled relative to each variable by setting the axis range equal to three standard deviations across all images (not only fully-leaved crowns).

Figure A8. Time series of observed (orange) and predicted (blue) leaf cover for individual tree crowns within different species. Orange dashed lines represent observed leaf cover (%) time series of individual trees, and blue dashed lines represent predicted leaf cover (%) time series of individual trees. Species abbreviations are defined in Table 1. Within each species, the observed and predicted time series span a similar range (see Figure 6 for r2 values of observations vs. predictions for individual time series). Species differ in the degree of intraspecific variation; e.g., TAB1RO and PLA2EL show a high level of intraspecific variation, whereas ZANTBE shows little intraspecific variation.

Figure 1. Schematic workflow for predicting leaf phenology patterns of individual crowns from UAV (unmanned aerial vehicle) images. See Materials and Methods for details.

Figure 2. UAV orthomosaic of the entire 50-ha plot (red rectangle). The orthomosaic is from 2 October 2014. A subset of the UAV image (orange square) with delineated tree crowns is shown on the right.

Figure 3. Example of a tree image time-series in the UAV dataset. Selected images from the time series for a single crown (Ceiba pentandra, tag 4092). The text box in each image gives the date (MM-DD-YY) and the observed/predicted (full model) leaf cover. 24 of 34 dates are shown; the 10 excluded dates have observed and predicted leaf cover near 100%.

Figure 4. (a) Predicted and observed mean leaf cover (average across all crowns on a given date) and (b) mean Green Chromatic Coordinate (GCCm). (a) Observed leaf cover (filled circles, solid line) and leaf cover predicted by the full model (open circles, dashed line). The r2 values are the squared Pearson correlation over time between (a) mean observed and predicted leaf cover, and (b) mean observed leaf cover and mean GCCm. Values do not represent stand-level means, because our analysis targeted deciduous species that are a non-random subset of the mixed evergreen/deciduous forest community. The number of crowns ranges from 40 to 85 per date (some crowns were excluded on a given date based on filtering criteria described in Materials and Methods). Vertical bars are standard errors.

Figure 5. Evaluation of model performance for out-of-sample species predictions. Observed (filled circles and solid lines) and predicted (open circles and dashed lines) time series for species-level means. Predictions are from the full model trained with a dataset that excluded the target (out-of-sample) species. r2 is the squared Pearson correlation over time between observed and predicted species means. Additional goodness-of-fit metrics are in Table A4. Species abbreviations are defined in Table 1.

Figure 6. Evaluation of model performance for out-of-sample individual predictions. (a-f) Observed (filled circles, solid lines) and predicted (open circles, dashed lines) time series for selected individual trees; r2 is the squared Pearson correlation over time between observed and predicted leaf cover. (g) Distribution of r2 values across all 85 trees in the observation dataset (mean r2 = 0.83; SD = 0.13). (h) Distribution of r2 values for each species in the observation dataset. Predictions are from the full model trained with a dataset that excluded the target (out-of-sample) individual. Additional goodness-of-fit metrics are in Table A5. Species abbreviations are defined in Table 1.

Figure A1. Example of a blurry crown montage (time series) that was excluded from our analysis (Tag = 712166; species Cecropia insignis).

Figure A2. Model performance is sensitive to the seasonal distribution of training data. Filled circles with solid line: mean observed leaf cover (averaged over all crowns on a given date); open circles with dashed line: mean predicted leaf cover. (a) 10-fold cross-validation (see Materials and Methods: 'Model Training and Evaluation'), where each prediction was from the full model trained with 90% of the data. Panel (a) is identical to Figure 4a in the Main Text and is reproduced here for comparison with panels (b,c). (b) Training and validation data split into first and second halves, indicated by shading. Each prediction was from the full model trained with 50% of the data. Predictions for the shaded period used training data from the non-shaded period, and vice versa. (c) Training and validation data split into alternating dates, indicated by shading. Each prediction was from the full model trained with 50% of the data. Predictions for the shaded dates used training data from the non-shaded dates, and vice versa.
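The alternating-date split in panel (c) assigns every other flight date to training and the remainder to validation. A sketch of such a split, with illustrative dates rather than the study's actual 34 flight dates:

```python
import numpy as np

def alternating_date_split(dates):
    """Split flight dates into two folds of alternating dates (sorted order)."""
    order = np.argsort(dates)                  # indices that sort the dates
    fold_a = [dates[i] for i in order[::2]]    # 1st, 3rd, 5th, ... dates
    fold_b = [dates[i] for i in order[1::2]]   # 2nd, 4th, 6th, ... dates
    return fold_a, fold_b

# Illustrative monthly flight dates (ISO strings sort chronologically)
dates = ["2014-10-01", "2014-11-01", "2014-12-01",
         "2015-01-01", "2015-02-01", "2015-03-01"]
train, val = alternating_date_split(dates)
```

Splitting by date (rather than by random image) keeps all crowns from one flight in the same fold, so the validation fold tests generalization to unseen dates.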

Figure A3. Variation in model performance among individual trees in the observation dataset in relation to an individual's mean leaf cover (averaged over the year) and crown area. Each point shows goodness-of-fit (r² or MAE) for an individual's leaf cover time series from the out-of-sample validation analysis (Figure 6 and Table A5). Species codes in the color legend are defined in Table 1. r² is the squared Pearson correlation between predictions and out-of-sample individual observations (n = 34 dates for each individual). MAE is the mean absolute error (% leaf cover) between predictions and validation observations.

Figure A4. Repeatability of leaf cover observations. (a,b) Comparison of observed leaf cover (%) of tree crowns in images that were examined by (a) Observer 1 and Observer 2 (n = 1135), or (b) Observer 1 and Observer 3 (n = 612). In (a,b), the solid line is the 1:1 line. (c) Distribution of r² values from 58 comparisons of individual-tree time series between observers (the red vertical line is the mean r²); each r² value in the histogram represents the correlation over time for a given tree between the Observer 1 time series of leaf cover and either the Observer 2 or Observer 3 time series for the same crown. P-value < 0.001 for both (a) and (b).

Figure A5. Repeatability of leaf cover predictions from models trained with different observer datasets. (a,b) Comparison of predicted leaf cover (%) of tree-crown images from models trained with observations from (a) Observers 1 and 2 (n = 1135), and (b) Observers 1 and 3 (n = 612). In (a,b), the solid line is the 1:1 line. (c) Distribution of r² values from 58 comparisons of individual-tree time series predicted from models trained on different observation datasets (the red vertical line is the mean r²); each r² value in the histogram represents the correlation over time for a given tree between predictions from the Observer 1-trained model and the Observer 2+3-trained model. Predictions are from the full model (Table A6) trained with a dataset (either Observer 1 or Observer 2+3) that excluded the target (out-of-sample) individual. P-value < 0.001 for both (a) and (b).

Figure A6. Red channel (a), green channel (b), blue channel (c), and full-color (d) image of a Dipteryx oleifera crown (Tag 1342) on 5 February 2015. The red boxes highlight regions in which branches are visibly more distinct in the blue channel.

Figure A7. Temporal variability of mean Green Chromatic Coordinate (GCCm) and standard deviation of the blue channel (Bsd) for fully-leaved trees. Values are means ± 2 s.e. across all crowns with 100% observed leaf cover on a given date (sample sizes range from n = 9 to 65 trees per date). For these fully-leaved crowns, Bsd is less variable over time than GCCm, which suggests that Bsd is less sensitive to variation in illumination or other image features that are unrelated to leaf cover. This likely explains why Bsd is a better predictor of leaf cover than GCCm in our study (Table A1). Y-axes were scaled relative to each variable by setting the axis range equal to three standard deviations across all images (not only fully-leaved crowns).
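A minimal sketch of how the two features contrasted here, GCCm (a color index) and Bsd (a texture metric), could be computed from an RGB crown image; the function name and toy pixel values are illustrative, not the paper's implementation:

```python
import numpy as np

def crown_features(rgb):
    """Mean Green Chromatic Coordinate (GCCm) and blue-channel standard
    deviation (Bsd) for an H x W x 3 crown image with 0-255 values."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = np.clip(r + g + b, 1e-9, None)  # avoid division by zero on black pixels
    gcc = g / total                         # per-pixel GCC = G / (R + G + B)
    return gcc.mean(), b.std()              # GCCm: mean GCC; Bsd: within-crown blue SD

# Toy 2x2 "crown": uniform red/green, spatially varying blue
img = np.zeros((2, 2, 3))
img[..., 0] = 60                      # red channel
img[..., 1] = 120                     # green channel
img[..., 2] = [[20, 40], [60, 80]]    # blue channel varies within the crown
gccm, bsd = crown_features(img)
```

GCCm summarizes crown color, whereas Bsd captures within-crown brightness variation (texture), which is the distinction the figure draws on.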

Figure A8. Time series of observed (orange) and predicted (blue) leaf cover for individual tree crowns within different species. Orange dashed lines represent observed leaf cover (%) time series of individual trees, and blue dashed lines represent predicted leaf cover (%) time series of individual trees. Species abbreviations are defined in Table 1. Within each species, the observed and predicted time series span a similar range (see Figure 6 for r² values of observations vs. predictions for individual time series). Species differ in the degree of intraspecific variation; e.g., TAB1RO and PLA2EL show a high level of intraspecific variation, whereas ZANTBE shows little intraspecific variation.

Table 1. Tropical tree species included in the training and testing dataset. N is the number of trees in the observation dataset for a given species.

Table 2. Color and texture features used to predict leaf cover of individual crowns. The full model included all 14 features; other models included subsets of these features.

Table 3. Comparison of models incorporating different combinations and numbers of features (additional models in Table A1). Goodness-of-fit statistics were calculated for out-of-sample data from 10-fold cross-validation. r² is the squared Pearson correlation. MAE and ME are the mean absolute error and mean error, respectively, between predictions and observations.
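The MAE and ME statistics reported in this table can be sketched as follows; the sign convention for ME (prediction minus observation, so positive values indicate overprediction) is an assumption, since the caption does not state it:

```python
import numpy as np

def mae_me(observed, predicted):
    """Mean absolute error and mean error of predictions vs observations.

    ME sign convention assumed here: predicted minus observed.
    """
    err = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return np.abs(err).mean(), err.mean()

# Toy example: leaf cover (%) for three crown-date observations
obs = [100, 80, 60]
pred = [90, 85, 65]
mae, me = mae_me(obs, pred)
```

MAE reflects the typical magnitude of errors, while ME reveals systematic bias: errors of opposite sign can cancel in ME (as in this toy example) but not in MAE.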

Table A3. Means and standard deviations (SD) of leaf cover on each sampling date for trees in the observation dataset.

Table A4. Goodness-of-fit statistics for out-of-sample species-level predictions (n = 34 dates). The model was trained on data for all trees except those belonging to the target (out-of-sample) species. Goodness-of-fit statistics were calculated by comparing mean predicted and observed leaf cover (averaged across individuals in the out-of-sample target species) for each sample date. r², MAE, and ME are as defined in Table A1.

Table A5. Goodness-of-fit statistics for out-of-sample individual tree-level predictions (n = 34 dates). The model was trained on data for all trees excluding the target (out-of-sample) tree. Goodness-of-fit statistics were calculated by comparing predicted and observed leaf cover for each tree on each sample date. These statistics were then pooled across individuals within each species to calculate the means and standard deviations (SD) reported in the table. r², MAE, and ME are as defined in Table A1.

Table A6. Comparison of models trained with different sets of observations. 'Observer 1 full' refers to the full set of observations used to obtain the Main Text results (2422 images belonging to 85 individual crowns); 'Observer 1 subset' is the subset of the full dataset (1780 images belonging to 58 individual crowns) corresponding to the same 58 crowns for which Observer 2 or 3 observations were available; 'Observer 2+3 subset' refers to observations from either Observer 2 or Observer 3 (1747 images belonging to 58 individual crowns for the combined Observer 2+3 dataset). Values in the columns for each observer dataset report the r² (squared Pearson correlation) between observations and out-of-sample predictions from 10-fold cross-validation.