Identifying Leaf Phenology of Deciduous Broadleaf Forests from PhenoCam Images Using a Convolutional Neural Network Regression Method

Abstract: Vegetation phenology plays a key role in influencing ecosystem processes and biosphere-atmosphere feedbacks. Digital cameras such as PhenoCam that monitor vegetation canopies in near real-time provide continuous images that record phenological and environmental changes. There is a need to develop methods for automated and effective detection of vegetation dynamics from PhenoCam images. Here we developed a method to predict leaf phenology of deciduous broadleaf forests from individual PhenoCam images using deep learning approaches. We tested four convolutional neural network regression (CNNR) networks on their ability to predict vegetation growing dates based on PhenoCam images at 56 sites in North America. In the one-site experiment, the predicted phenology dates after the leaf-out events agree well with the observed data, with a coefficient of determination (R2) of nearly 0.999, a root mean square error (RMSE) of up to 3.7 days, and a mean absolute error (MAE) of up to 2.1 days. The method achieved lower accuracies in the all-site experiment than in the one-site experiment, with an R2 of 0.843, an RMSE of 25.2 days, and an MAE of 9.3 days. The model accuracy increased when the deep networks used the region of interest images rather than the entire images as inputs. Compared to the existing methods that rely on time series of PhenoCam images for studying leaf phenology, we found that the deep learning method is a feasible solution to identify leaf phenology of deciduous broadleaf forests from individual PhenoCam images.


Introduction
Vegetation phenology denotes periodic phenomena affected by climate, hydrology, soil, and other environmental conditions, which accompanies key events such as germination, leaf expansion, flowering, leaf discoloration, and defoliation [1,2]. Changes in leaves reflect the high sensitivity of vegetation to seasonal weather and climate variability [3][4][5]. Leaf phenology alters physical characteristics of the land surface such as albedo and roughness [6], which also influences the biogeochemical processes such as photosynthesis and transpiration [7,8]. Monitoring leaf phenology is important for us to understand the interactive relationship between a changing climate and terrestrial ecosystems.
Available approaches to collect phenological data include field measurements, near-surface observations, and airborne and space-borne remote sensing. Field measurements date back over 2000 years [9] and still play a role today in recording phenology changes of individual trees such as flowering or budburst [10,11]. Field measurements of leaf phenology involve extensive fieldwork, making it difficult to conduct long-term and high-frequency experiments. Airborne and space-borne remote sensing is able to provide repetitive and synoptic land surface observations and offers opportunities to monitor vegetation phenology across large areas [12]. The use of satellite data has promoted macro-scale phenology studies in the context of global climate change [13,14]. Remote sensing observations, however, often have quality issues such as sensor degradation and cloud contamination, and are affected by platform revisit periods and sensor resolutions. The inversion algorithms, including their underlying assumptions and parameter settings, also contribute uncertainties in phenology metrics retrieved from time series of remote sensing data [15]. Near-surface observation from eddy covariance flux towers has been used to derive and study leaf phenology changes. Eddy covariance flux towers that continuously monitor vegetation canopies can provide observations of land-atmosphere fluxes on an hourly or half-hourly basis. Because phenological events typically occur only once or twice a year, the phenological metrics retrieved from eddy covariance flux towers are limited by the availability of the tower sites [16].
In recent years, digital photography has provided an automated and cost-effective approach to monitoring leaf dynamics at the canopy level [17]. The method, called PhenoCam, involves mounting digital cameras with visible-wavelength imaging sensors above vegetation canopies to capture images throughout the day [18]. Near-surface observations from PhenoCam provide time series of images that are ideal for tracking seasonal changes in leaf phenology [19]. As high-quality and low-cost digital cameras are now widely available, PhenoCam has been increasingly deployed for ecological studies. PhenoCam images have been used to extract leaf phenology metrics [20] and track biodiversity [21]. Zhang et al. monitored vegetation growth over time using both PhenoCam and remote sensing data [14]. Wang et al. estimated vegetation gross primary productivity on a daily basis using PhenoCam data to reduce modelling uncertainties [22]. Wu et al. studied the relationship between leaf ages and canopy characteristics using PhenoCam data [23]. These studies have demonstrated that PhenoCam images carry useful information for monitoring and understanding leaf phenology.
To date, most studies use indirect methods to identify phenology variation based on time series of PhenoCam images. The indirect methods track changes in images by deriving handcrafted features such as the green chromatic coordinate (GCC) [24,25] and red chromatic coordinate (RCC) [1] from PhenoCam images and then apply algorithms to derive the timing of phenological events, such as start-of-growing season (SOS) and end-of-growing season (EOS). The use of handcrafted features such as GCC or RCC overlooks high-level information in digital images. More importantly, these studies generally require at least two images, or even the whole time series, to obtain information related to leaf phenology. In this study, we aim to propose a method that can predict the daily leaf growth time from a single PhenoCam image. Specifically, leaf growth time refers to the period between SOS and EOS, and we predict the day of that period for individual PhenoCam images.
Deep learning, as a subset of machine learning, is able to automatically learn the relationships between features and tasks. The way deep learning extracts complex features from input data differs from traditional shallow learning: deep learning transforms the features of a sample in the original space into a new feature space. Deep learning methods have been widely used for image processing in fields such as computer vision, speech recognition, and natural language processing [26]. In remote sensing, deep learning approaches have been used for tasks such as image classification, segmentation, and object recognition. Compared with traditional classification methods, deep learning models often provide higher processing accuracy when large samples for model training and testing are available. Image data recorded by PhenoCam have accumulated in recent years, and it is of interest to further explore the recorded dataset. The development of deep learning approaches makes it possible to process and make predictions on individual images. Recent studies on facial age recognition and image segmentation using deep learning models inspire us to test the identification of leaf phenology from PhenoCam images.
The main goal of this study was to identify the leaf phenology of deciduous broadleaf forests from individual PhenoCam images using deep learning methods. In other words, we tested the deep learning models on predicting leaf growing dates after SOS in a year from a given PhenoCam image. Compared with traditional methods that predict leaf phenology with handcrafted features from time series data, the use of deep learning methods allows us to infer daily leaf phenology from individual PhenoCam images and can potentially improve image processing accuracy and reduce labor costs.

Study Materials
The PhenoCam network, established in 2008, initially provided automated monitoring of leaf phenology in forest ecosystems in the northeastern United States and neighbouring Canada [27], and later expanded to give long-term observations covering a variety of geographic sites. More than 600 site cameras in the PhenoCam network are now deployed across a wide range of ecological, climatic, and plant functional types in North America. Most sites upload one digital photo every 30 min from 04:00 am to 9:30 pm local time. These photos are stored on servers at the University of New Hampshire [28]. The archived dataset currently has approximately 15 million images requiring 6 TB of disk space and provides a record that can be used to determine the phenological state of leaves [27]. High-quality data is ensured by minimizing data discontinuity due to adverse weather conditions (e.g., rain, snow, and hail), adverse light conditions (e.g., cloud and aerosol), or short-term power outages [29]. The PhenoCam dataset is publicly available on the website of the PhenoCam project (http://phenocam.sr.unh.edu/; accessed on 18 September 2020).
In the PhenoCam dataset, there are three types of observation sites, namely Type I, II, and III sites. The Type I sites follow a standard protocol to ensure data quality and data continuity [27], whereas the Type II and III sites are not required to follow the standard protocol.
The key aspects of the standard protocol are as follows. Firstly, the camera is set to fix the white balance. Secondly, the camera is mounted at a safe point and tilted down at 20-40 degrees such that its field of view covers the landscape. Ideally, the acquired image mainly consists of vegetation and a small part of the sky. Thirdly, in the northern hemisphere, the camera is pointed to the north to reduce the lens flares, shadows, and forward scattering of the canopy. Basic information for each PhenoCam site, including site category, site location, the start and end dates of site images, the camera model, vegetation type, the climate, and site attributes, is included in the data record.
We used the data from 56 deciduous broadleaf forest sites in North America in the PhenoCam 1.0 version dataset. Figure 1 shows the spatial distribution of the studied PhenoCam sites, which are mainly distributed across the latitude range of 32°N-47°N with elevations up to 1550 m. Most sites are situated in the temperate continental climate zone. We chose the Type I sites of deciduous broadleaf forests that follow the standard protocol. We used the images taken within 11:30 am-1:30 pm local time every day, given that images acquired in the early morning and late afternoon were often affected by sunlight scattering. Detailed information on the studied sites is provided in Table A1.
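Selecting the midday images could be done by parsing the acquisition time from each filename. The sketch below assumes a filename pattern of the form `sitename_YYYY_MM_DD_HHMMSS.jpg` (a common PhenoCam convention, but verify it against the downloaded archive before relying on it):

```python
from datetime import datetime, time
from pathlib import Path

def in_midday_window(path: Path) -> bool:
    """True if the image timestamp falls within 11:30 am-1:30 pm local time.

    Assumes the filename ends with an HHMMSS timestamp, e.g.
    'dukehw_2016_06_01_120505.jpg' (hypothetical example).
    """
    hhmmss = path.stem.split("_")[-1]
    t = datetime.strptime(hhmmss, "%H%M%S").time()
    return time(11, 30) <= t <= time(13, 30)

# Keep only midday images from a local folder of downloaded PhenoCam data.
midday = [p for p in sorted(Path("images").glob("*.jpg")) if in_midday_window(p)]
```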


Methods
Our goal was to predict the leaf growing days of deciduous broadleaf forests in a growing season (from SOS to EOS) from each PhenoCam image using deep learning models. We determined the dates of leaf-out based on the time series of GCC and labelled the dates after leaf-out for each image. As our task was a regression problem rather than a classification problem, we modified the CNN-based models to fit the regression task. We trained and evaluated four different CNN-based models at three different temporal scales.
The architecture shown in Figure 2 illustrates the workflow of our study, which includes three main parts. The first part is data preprocessing, including data labelling and augmentation. The second part is leaf phenology prediction based on deep learning methods, in which two strategies were used. For the first strategy, we used CNNRs to predict leaf phenology from PhenoCam images directly; in other words, we fed the entire PhenoCam image into the CNNRs. For the second strategy, to investigate whether background information in the PhenoCam images such as the sky or other land covers influences the accuracy of leaf phenology, we first used semantic segmentation models to identify regions of interest (ROIs) in the PhenoCam images and then fed the ROI images into the CNNRs. An ROI denotes a subset of a PhenoCam image that includes only vegetation canopies and excludes backgrounds. ROIs in the PhenoCam dataset have already been manually labelled by camera maintainers and can be downloaded directly from the official website. As ROIs were labelled for only a few sites, we employed a semantic segmentation method based on deep learning to detect ROIs from the PhenoCam images for all sites. The last part is the leaf phenology evaluation. The detailed methodology is introduced in the following sections.

Figure 2. Workflow of the proposed method. Step 1 shows the preprocessing of the raw input data, including data augmentation and transformation; Step 2 displays the deep learning processing, including direct CNNRs (method 1) and regression after ROI detection (method 2); and Step 3 demonstrates the model assessment.

Data Preprocessing
To feed the PhenoCam images into the CNNRs for training and testing, we needed to perform a series of data preprocessing steps. Data preprocessing mainly included two steps, i.e., training and testing image data selection, and image labelling. As the PhenoCam images are often influenced by animals and weather (Figure A1), we needed to choose a suitable dataset for the experiment. In this study, we selected images that had no animal tracks and no weather contamination, such as rainfall, fog, and lens flare. In total, 14,453 PhenoCam images from 56 sites were chosen for study, which were downloaded from the PhenoCam website (https://phenocam.sr.unh.edu/webcam/tools/; accessed on 18 September 2020). Figure 3a illustrates phenology changes of the PhenoCam images at a deciduous broadleaf forest. We only show the PhenoCam images every eight days as examples, although there are PhenoCam images every day. The PhenoCam images generally consist of three visible bands (red, green, and blue) and occasionally contain a near-infrared band at a few sites. The image sizes of the raw data and the label data are not the same across the sites, ranging from 640 × 480 pixels at the Bartlett site to 4000 × 2500 pixels at the Coville site. We cropped the images to the same size of 224 × 224 pixels.
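One way to harmonize images of different native resolutions to the network input size is to resize the shorter side and then center-crop. The sketch below (a plausible implementation with Pillow, not the authors' exact code) shows this for the 224 × 224 target:

```python
from PIL import Image

def to_network_size(img: Image.Image, size: int = 224) -> Image.Image:
    """Resize the shorter side to `size`, then center-crop to size x size.

    One possible way to harmonize PhenoCam images of different native
    resolutions (640x480 up to 4000x2500) before feeding them to a CNN.
    """
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```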

We determined SOS and EOS from the GCC time series (Figure 4c); EOS is defined as the date when GCC first exceeds 90% of the seasonal amplitude in the declining stage (the blue vertical dotted line on the right in Figure 4c). SOS is the first day of leaf growth in a year, and EOS is the last day of leaf growth in a year. We labelled the leaf growing date for each PhenoCam image between SOS and EOS as the number of days since SOS, and labelled the images before SOS or after EOS in a year as 0 because leaf growth is in a dormancy stage. The dates associated with major changes in leaf colour characteristics, such as greenness rise and greenness decline, were derived from high-frequency images collected over the past ten years at daily scales [27]. The acquisition frequency of the PhenoCam data is important for studying temporal changes in leaf colour characteristics. For each phenology site there is a fully processed image data file, namely the summary product, which contains the calculated daily GCC time series. We used the daily product and eliminated outliers, as we aim to predict leaf phenology from each image on a daily scale. Moreover, we investigated the proposed leaf phenology prediction method at different time scales and selected 8-day and 80-day data for additional experiments.
In the additional experiments, the dates of leaf growth are divided into sessions, for example, days 1-8, days 9-16, and so on in the 8-day experiment.
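The binning of growing dates into coarser sessions can be sketched as follows (an illustrative helper, not from the paper; it assumes dormancy is kept as label 0, matching the labelling scheme above):

```python
def bin_growth_day(day: int, bin_size: int = 8) -> int:
    """Map a daily leaf growing date to its coarse session label.

    Day 0 (dormancy, i.e. before SOS or after EOS) stays 0; days 1-8 map
    to session 1, days 9-16 to session 2, and so on, mirroring the 8-day
    experiment. Pass bin_size=80 for the 80-day experiment.
    """
    if day <= 0:
        return 0
    return (day - 1) // bin_size + 1
```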
For each selected PhenoCam image, we determined its leaf growing date using GCC. Owing to noise originating from factors such as weather, sun illumination geometry, and exposure controls, the original PhenoCam images were rarely used directly in past phenology studies. Methods have been developed to convert the digital numbers of each pixel to chromatic indices such as GCC and RCC. Studies have found that the time series of GCC or RCC is representative of the seasonal trajectory of leaf growth and activity [17]. GCC in the summary product we used is calculated as follows (Equation (1)):

GCC = G_DN / (R_DN + G_DN + B_DN), (1)

where R_DN, G_DN, and B_DN denote the digital numbers in the red, green, and blue bands, respectively.
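Equation (1) translates directly into code. The sketch below computes GCC from the band sums of an RGB array (a generic implementation; the summary product computes it over a labelled ROI rather than the whole frame):

```python
import numpy as np

def gcc(image: np.ndarray) -> float:
    """Green chromatic coordinate of an RGB image (Equation (1)).

    `image` is an (H, W, 3) array of digital numbers in the red, green,
    and blue bands; apply an ROI mask beforehand to restrict the
    computation to vegetation canopy pixels.
    """
    img = image.astype(np.float64)
    r, g, b = img[..., 0].sum(), img[..., 1].sum(), img[..., 2].sum()
    return g / (r + g + b)
```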
We conducted quality control processes such as filtering and outlier detection on the GCC time series and inspected the processed time series visually. Missing data in the GCC time series were linearly interpolated based on the nearest neighbouring points. The least squares method and the Savitzky-Golay smoothing algorithm were employed to fit the interannual GCC curve. The hyperparameters used for the Savitzky-Golay smoothing algorithm were as follows: the window size was set to 53 and the polynomial order to 3. Figure 3c shows both the derived GCC time series and its fitted curve corresponding to the growing cycle of leaves.
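The interpolation and smoothing steps above can be sketched as follows (a minimal version using SciPy's Savitzky-Golay filter with the stated hyperparameters; the authors' full pipeline also includes outlier removal and least-squares fitting):

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_gcc(gcc_series) -> np.ndarray:
    """Fill gaps and smooth a daily GCC series.

    Gaps (NaNs) are filled by linear interpolation between the nearest
    valid neighbours, then a Savitzky-Golay filter with window size 53
    and polynomial order 3 (the hyperparameters above) is applied.
    """
    y = np.asarray(gcc_series, dtype=float)
    valid = ~np.isnan(y)
    filled = np.interp(np.arange(len(y)), np.flatnonzero(valid), y[valid])
    return savgol_filter(filled, window_length=53, polyorder=3)
```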

Leaf Phenology Prediction
In the first strategy, four modified convolutional neural network regression methods (CNNRs, Figure 4) were used to predict the leaf growing dates of deciduous broadleaf forests. The CNNRs were derived from commonly used convolutional neural network (CNN) structures. CNN models have proven advantageous over traditional machine learning methods in extracting low-level and high-level features from images. Alongside the development of hardware and public datasets, researchers have proposed various CNN models to improve image classification, among which the widely used ones include AlexNet [30], Visual Geometry Group (VggNet) [31], GoogleNet [32], and Deep Residual Network (ResNet) [33]. Here we use four networks, i.e., AlexNet, VGG, ResNet50, and ResNet101, as the backbones for predicting leaf phenology from individual PhenoCam images. The architectures of the four CNNRs are shown in Figure 4. AlexNet is an 8-layer deep CNN that mitigates over-fitting through techniques such as dropout and ReLU. VGG is an improvement on AlexNet; it uses a smaller convolution kernel and stride in the first convolution layer, which makes it suitable for multi-scale training and testing. Compared to both AlexNet and VGG, ResNet overcomes the problem of gradient explosion and vanishing [34]. The residual module in ResNet enables gradient transmission and low-level information retention through a linearly connected path. In our leaf phenology prediction task, we inherit the convolution and pooling layers of each network but use a regression loss function rather than the cross-entropy loss function. For ResNet, we tested two structures of different depths, 50 layers (ResNet50) and 101 layers (ResNet101). Finally, AlexNet-R, VGG-R, ResNet50-R, and ResNet101-R were used for leaf phenology prediction, and MSELoss was used to optimize the prediction task.
The MSELoss function is suitable for regression problems; it computes the mean of the squared differences between the predicted values and the ground truth.

Leaf Phenology Prediction Using Detected ROI Images
As mentioned above, we adopted a second strategy that predicts leaf growing dates based on ROIs detected from each PhenoCam image. Compared with the first strategy, representative ROIs must first be extracted from the PhenoCam images. An ROI contains only canopies, whereas the entire PhenoCam image includes background information such as the sky, buildings, lakes, and animals. An ROI masks out the background, which possibly reduces the influence of background factors on the prediction accuracy.
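Applying an ROI mask could look like the following sketch (an illustrative helper assuming a binary mask produced by the segmentation model, with 1 for canopy and 0 for background):

```python
import numpy as np

def apply_roi(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out everything outside the canopy ROI.

    `image` is an (H, W, 3) RGB array; `mask` is a binary (H, W) array,
    1 where the segmentation model labelled vegetation canopy and 0 for
    background (sky, buildings, lakes, and so on).
    """
    return image * mask[..., None]
```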
Semantic segmentation is a process that separates the regions of different objects and labels their categories for a given image [35]. Currently, various semantic segmentation networks are available to label each pixel in an image, such as FPN [36], Unet [37], DeeplabV3+ [38], and PSPNet [39]. These networks often have encoder-decoder architectures and use various up-sampling structures in the decoder part to restore the compressed features generated by the encoder part to the same size as the input image. In our previous work, we proposed BAPANet [40], a robust semantic segmentation architecture that combines a backbone network with a feature optimization module. BAPANet uses a lightweight ResNet-34 as the backbone and a bridging structure between the encoders and decoders of the backbone to enhance feature representation. The input of each bridge consists of the up-sampled features from the previous bridge and the corresponding encoder, which integrates hierarchical features well.
In the second strategy, we used the above-mentioned five semantic segmentation methods for detecting ROIs. The entire PhenoCam images, including red, green, and blue bands, were fed into each semantic segmentation model, and the model generated the ROI outputs. The labelled ROIs and the corresponding PhenoCam images served as the training and test samples. Recent studies have shown that data augmentation plays a crucial role in deep network training and helps to reduce the effect of overfitting. Accordingly, we conducted data augmentation for model training [40], including grayscale transformation (i.e., we changed the grayscale of the images to reduce noise), random flipping (including horizontal and diagonal flips), random scaling (i.e., we randomly scaled images by up to 10%), random offset (i.e., the images were randomly offset by up to 10%), and random stretching (i.e., we randomly stretched the images along either the vertical or horizontal direction by up to 10%). The CrossEntropyLoss function, which accounts for the proximity of two probability distributions, was employed for model optimization. We fed the detected ROIs into the four CNNRs for leaf phenology prediction.

Model Assessment
The deep learning networks require a uniform, fixed image size. Given the GPU memory of our hardware environment and the characteristics of our network architectures, we standardized the input image size to 224 × 224; all PhenoCam images were cropped to this size. In the ROI detection stage, we used 13,000 training samples and 1453 test samples with a batch size of 128. In the leaf phenology prediction stage, we used two datasets. The first, the one-site dataset, contained 7637 images acquired at the Dukehw site in 2016, of which 5600 and 1400 images were randomly selected for model training and validation, respectively, and the rest were used for the model test. The second, the all-site dataset, contained a total of 14,453 images from all 56 studied sites, of which 10,400 images (71.96%) were randomly selected for model training, 2600 images (17.99%) for model validation, and 1453 images (10.05%) for the model test. The experiments were carried out on an 11 GB NVIDIA GTX 1080Ti using PyTorch. The models were initialized from pre-trained models and then fine-tuned on the phenological datasets we built. The networks were optimized with the Adam algorithm during model training.
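The fine-tuning setup above can be sketched as a minimal PyTorch training loop (illustrative only; all names are our own, and the real experiment uses image datasets, pre-trained weights, and more epochs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, dataset, epochs=1, lr=1e-4, batch_size=128, device="cpu"):
    """Minimal fine-tuning loop matching the setup above:
    Adam optimizer, MSE loss, mini-batches of 128."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images).squeeze(1), targets)
            loss.backward()
            optimizer.step()
    return model
```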
We used the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE) to assess model performance [40]. R2 is defined as the ratio of the sum of regression squares to the sum of total deviation squares; the larger R2 is, the better the prediction, and its normal range is [0, 1]. RMSE is defined as the square root of the mean squared deviation between predicted and observed values; it describes the overall prediction error well but is susceptible to outliers. MAE is defined as the mean of the absolute differences between predicted and observed values.

Results

Figure 5 shows the leaf phenology prediction accuracies of the four deep learning methods on deciduous broadleaf forests at a daily scale. All four networks performed well on the one-site dataset, with the results close to the 1:1 line; PhenoCam images from the same site have similar scenes and the same field of view. For the all-site dataset, the accuracies of the four networks were reduced when using images from all 56 sites for leaf phenology prediction. The variety of colours in the figure indicates the magnitude of the prediction error, with red indicating the largest errors; the results for different PhenoCam images deviate considerably. In terms of model performance, ResNet101-R achieved the best accuracy on the one-site dataset, with an RMSE of 4.38 days and an MAE of 2.15 days, and ResNet50-R achieved higher accuracy than the other three networks on the all-site dataset, with an MAE of 9.77 days. Overall, ResNet is a robust architecture for predicting leaf growing dates among the tested models.

Figure 6 illustrates the prediction results for deciduous broadleaf forest leaf phenology dates on an 8-day scale. The number of leaf phenology date categories was reduced compared with the daily scale.
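The three assessment metrics (R2, RMSE, MAE) defined in the Model Assessment section can be computed in a few lines (a generic implementation, not the authors' code):

```python
import numpy as np

def evaluate(y_true, y_pred) -> dict:
    """R2, RMSE, and MAE between observed and predicted growing dates."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "MAE": float(np.mean(np.abs(resid))),
    }
```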
A total of 636 images were verified in the first row of Figure 6. AlexNet-R and VGG-R achieved similar accuracies, in terms of R2, on 8-day leaf phenology prediction as on daily prediction. Compared with the other tested methods, AlexNet-R shows the best prediction results on the one-site dataset. The two ResNet architectures achieved reasonable accuracies in predicting 8-day leaf growth on the one-site dataset and performed better on the all-site dataset; ResNet50-R achieved an RMSE roughly 10 days lower than VGG-R when using the all-site dataset. It seems that ResNet is suitable for large training datasets but has overfitting issues with small training data. In the 80-day experiment, the best prediction result in terms of all metrics had an RMSE of 11.81 days and an MAE of 5.01 days.
When using the ResNet101-R architecture, the accuracies for the one-site dataset decreased in the 80-day experiment compared with the 8-day experiment. For the all-site dataset, the accuracies of the two ResNet architectures improved compared with the one-site dataset; ResNet50-R achieved an RMSE of 32.39 days and an MAE of 10.62 days. Figure 8 shows the comparison of deciduous broadleaf forest ROIs detected by the five different semantic segmentation methods. Based on visual inspection, BAPANet performed the best among all the methods, with reasonable ROI detection results for deciduous broadleaf forests. Compared with the other four methods, BAPANet produced clearer boundaries and better segmentation details.

Predicting Leaf Phenology Using Detected ROI Images
Image segmentation metrics are presented in Table 1. The five networks achieved similar accuracies for ROI detection. The segmentation results of Unet and PSPNet are similar across the five metrics and slightly inferior to the other three methods; their accuracies in terms of Recall, F1, and Mean IoU are roughly 0.1 lower than those of DeepLabV3+ and FPN, and roughly 0.2 lower than those of BAPANet. As shown in Table 1, BAPANet was better than the other four methods in terms of the five metrics except for overall accuracy (0.981). In particular, BAPANet achieved a Recall of 0.961 and an F1 of 0.966.

Table 1. The accuracy of detected ROIs using different semantic segmentation models.

Figure 9 and Figures A2 and A3 show the leaf growth prediction results with the detected ROIs at the daily, 8-day, and 80-day time scales, respectively.
Firstly, we can see that the prediction accuracies at different time scales varied greatly: the daily predictions were more accurate than those at the 8-day and 80-day scales. On the daily scale, VGG-R performed much better than AlexNet-R and the two ResNet architectures for the one-site dataset, whereas the two ResNet architectures performed much better than AlexNet-R (Figure 9e) and VGG-R (Figure 9f) for the all-site dataset. For the 8-day scale (Figure A2), VGG-R achieved the best results on the one-site dataset, with an RMSE within five days, while the errors of the other methods were larger. On the all-site dataset, ResNet50-R had slightly better accuracy in terms of R2 (0.85) and RMSE (23.83 days); compared with the other methods, ResNet50-R achieved the best accuracy on the all-site dataset with the lowest RMSE. For the 80-day scale (Figure A3), VGG-R performed the best on the one-site dataset, followed by AlexNet-R and the two ResNet architectures; again, the ResNet architectures achieved better accuracies than the other deep networks on the all-site dataset. Compared with using the whole PhenoCam image, the VGG-R method produced better predictions with the detected ROIs as input, on both the one-site and all-site datasets. ResNet50-R has a slight advantage in daily and 8-day leaf growth prediction based on the detected ROIs, and similar results were obtained at the 80-day time scale. In general, the second strategy performed better than the first one, as ROIs remove the interference of other information (such as the sky or other land covers) in the PhenoCam image.

Discussion
Based on the results above, we found that the results of leaf phenology prediction vary greatly between the one-site and all-site datasets. By comparing the results obtained from the two datasets, we found that the convolutional neural network regression methods used in this paper performed well when predicting the results of a single site, but need improvement when applied to multiple sites. One possible reason is that images from different sites vary largely due to imaging conditions, making training and convergence of the deep learning models difficult.
The accuracy of leaf phenology prediction also depends on the training dataset. Imaging conditions and observation frequency often vary considerably from site to site, leading to variation in the data distribution. For most of the studied sites, vegetation leaves come out in spring and fall off in autumn, whereas at high-altitude sites vegetation leaves come out in summer or even autumn, with a much shorter growing period. These altitude and latitude conditions lead to wide variation in the prediction of leaf phenology from site to site and result in phenology prediction errors. From the presented results, we found that different network structures lead to large differences in the predicted results. In addition to the network architecture, training data is one important factor that influences the performance of deep learning models. Collecting observation datasets from sites with varying altitudes and latitudes is likely useful for improving the applicability of the deep learning models.
According to the results, the accuracies of leaf phenology prediction varied largely with the time scale. The experiments showed that the models performed better on the daily scale than on the 8-day or 80-day scales. In addition, the R2 values showed that the models performed better on the 8-day scale than on the 80-day scale, especially at multiple sites. For the 8-day and 80-day scales, VGG-R and AlexNet-R achieved better results on the one-site dataset than on the all-site dataset. For the daily time scale, ResNet50-R performed well on both the one-site and all-site datasets. In the experiments at different time scales using the all-site dataset, the two ResNet architectures performed relatively better than the other methods. Overall, the daily time scale is a reasonable scale for leaf phenology prediction; the date labels generated at the 8-day and 80-day scales are probably too coarse for accurate estimation of leaf growth between SOS and EOS. As the results in Figures 5 and 9 show, we predicted leaf phenology on the daily scale from the entire PhenoCam images as well as from the detected ROI images. Compared with the first strategy, the accuracy of the second strategy, which used ROIs as model inputs, improved slightly. The two ResNet-R structures achieved slightly better accuracies using ROIs than using the entire PhenoCam images to predict daily leaf phenology on the one-site dataset. Both AlexNet-R and VGG-R also reduced RMSE when using ROIs rather than entire PhenoCam images as model inputs. This implies that detecting ROI images helps phenology prediction, as the use of ROIs removes irrelevant but influential backgrounds other than vegetation canopies. Networks of different depths have different advantages in leaf growth prediction: shallow networks such as AlexNet-R achieved better accuracies on the 8-day and 80-day scales on the one-site dataset.
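Using detected ROIs as model inputs can be as simple as zeroing out pixels outside the segmentation mask before feeding the image to the regression network. A minimal numpy sketch of this masking step (the exact preprocessing pipeline is not specified in this section, so this is an illustrative assumption):

```python
import numpy as np

def apply_roi(image, roi_mask, fill=0):
    """Keep only pixels inside the detected ROI; set background (sky, other
    land covers) to a constant fill value."""
    image = np.asarray(image)
    mask = np.asarray(roi_mask, dtype=bool)   # 2D mask over image height x width
    out = np.full_like(image, fill)
    out[mask] = image[mask]                   # boolean indexing on the spatial dims
    return out

# Hypothetical 2x2 RGB image and ROI mask
img = np.arange(12).reshape(2, 2, 3)
mask = np.array([[True, False], [False, True]])
masked = apply_roi(img, mask)
```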
The VGG-R method using the detected ROIs has an advantage on the one-site dataset compared to the deeper networks. Deeper networks such as ResNet50-R and ResNet101-R performed better on the all-site dataset, and the ResNet with 50 layers performed better than that with 101 layers. This implies that deep networks are suitable for complex data, whereas the high-level features extracted by deeper networks likely lead to overfitting when the dataset comes from only one site. Overall, increasing the network depth does not always improve the accuracy of predicting leaf growing dates directly from PhenoCam images, and finding a suitable network is important for leaf phenology prediction.
In addition, some improvements can be made in future studies. For example, the ROI detection network and the phenology prediction network could be integrated to improve model performance and computational efficiency. Other indices (such as LAI and NDVI) could also be incorporated to improve the identification of leaf phenology.

Conclusions
This study investigated the ability of four deep learning models to identify leaf phenology from individual PhenoCam images. The deep learning models can extract high-level features from high-frequency PhenoCam images, making them a suitable solution for leaf phenology prediction. Compared to the existing methods that analyze leaf phenology using time series of PhenoCam images, our algorithm is able to predict leaf phenology from a single PhenoCam image. The accuracy in terms of MAE is about two days at a single site and about nine days at multiple sites. The two ResNet architectures performed better than the other two methods on the multiple-site test dataset. Although we took deciduous broadleaf forests as the studied plant functional type, the method developed can possibly be extended to other vegetation types such as crops, and crop leaf phenology prediction would thus provide efficient and effective possibilities for crop yield estimation. While challenges remain in the accurate identification of leaf phenology from digital camera photos, we provide a feasible solution to predict daily leaf phenology from individual PhenoCam images via deep learning.