1. Introduction
The monitoring and management of forest resources have become increasingly important in the context of global climate change. Individual tree detection and segmentation play a critical role in forest management, because information at the individual tree level, such as locations and crown sizes, can be used to model growth, yield and fire behaviour, and to understand the ecology and dynamics of forests in ways that are directly applicable to forest management [1,2].
Thus far, existing studies have primarily utilized satellite imagery, aerial imagery, unmanned aerial vehicle (UAV) imagery, and light detection and ranging (LiDAR) point cloud data to detect individual trees. Satellite imagery, although offering wide coverage, lacks the resolution necessary for individual tree detection [3,4]. LiDAR data can provide accurate three-dimensional information, but the equipment required to collect the data is usually expensive [5]. Furthermore, as LiDAR typically uses a single spectral band, it lacks spectral information, and only geometric information is employed to detect and segment individual trees. When the branches of neighbouring trees intertwine, it is difficult to separate the trees solely on the basis of geometric information. Compared to the other data sources, UAV imagery stands out as an important data source for individual tree detection and segmentation due to its ease of acquisition, abundant spectral information, and high spatial resolution [6,7,8].
By equipping a UAV with different camera sensors, red-green-blue (RGB), multispectral, or hyperspectral images can be obtained. In particular, UAV-based visible light imaging has gained popularity due to its ability to capture extremely high-resolution RGB images and the low cost of the equipment [9].
Existing research on individual tree crown detection (ITCD) using RGB imagery can be broadly classified into three categories: traditional image processing methods, machine learning-based methods, and deep learning-based methods. Traditional image processing methods include techniques such as local maxima filtering [10], watershed segmentation [11,12] and region growing [13]. These methods do not require the construction of a sample dataset for model training and are computationally efficient. However, they often rely on the assumption of high brightness at the top of the tree crown. Such an assumption may not hold in forests with significant crown overlap or a large proportion of broadleaf species [14]. Crown overlap often results in unclear crown boundaries, while broadleaf trees typically have irregular shapes and may have multiple treetops. Additionally, although UAV RGB imagery has high resolution, the complex texture within individual crowns, spectral heterogeneity, and noise make detection of individual tree crowns particularly challenging [15]. Machine learning-based methods, such as random forests and support vector machines, rely on hand-crafted features like colour, texture, and shape. Although these approaches can outperform traditional methods, they require extensive feature selection and are prone to overfitting without diverse training data [16,17]. Deep learning techniques have made significant progress in object detection from remote sensing imagery in recent years. Various deep learning networks, such as the YOLO series, Faster R-CNN and RetinaNet, have been developed for object detection, and the detection of individual tree crowns can be framed as an object detection task. Some studies have used deep learning techniques to detect individual trees [18,19,20]. Deep learning-based methods are considered more effective and more transferable than traditional machine learning-based approaches for detecting individual tree crowns, owing to their ability to learn hierarchical combinations of object-level image features and directly delineate objects of interest [21,22].
Existing approaches to individual tree detection primarily rely on single-seasonal images. In forests with large canopy gaps or distinct treetops, such approaches based on brightness or spectral information can provide high ITCD accuracies. However, in mixed-species forests with a high proportion of broadleaf species, the detection and delineation of individual tree crowns remain a challenge. Neighbouring trees of different species may have similar spectral responses [23]. The varying crown sizes and the intertwining branches of broadleaf trees further increase the difficulty of ITCD. One solution is to explore additional features that can facilitate the differentiation between neighbouring trees of different species. Some studies have used multi-seasonal images to classify tree species, based on the observation that the seasonal variation in the spectral response of tree crowns differs between species [24,25]. For instance, evergreen trees may show little seasonal variation in their spectral response, whereas the spectral response of deciduous trees may vary greatly due to defoliation and leaf regrowth. Although the use of multi-seasonal imagery helps improve the mapping of tree species [26], research on the application of multi-seasonal imagery to ITCD is rare. Berra [27] employed leaf-off imagery for digital terrain model (DTM) generation, while utilizing leaf-on imagery to derive a digital surface model (DSM). A canopy height model was generated from the DTM and the DSM, which was then used for individual tree delineation. However, they did not exploit the phenological signatures inherent in dual-seasonal datasets. Therefore, it is necessary to evaluate the extent to which the accuracy of ITCD can be improved by using dual- or multi-seasonal imagery. The selection of seasonal combinations for dual- or multi-seasonal imagery is another problem that requires systematic investigation.
To characterize the seasonal variations in the spectral responses of trees of different species, appropriate spectral, textural, and geometric features need to be extracted from dual- or multi-seasonal images and fused. However, feature selection has always been a complex problem in related studies [28]. To address this issue, deep learning techniques can be used to simplify the problem and automate the data processing [29]. Among various object detection architectures, the You Only Look Once (YOLO) series is well known for its computational efficiency and superior performance in object detection tasks, in particular its ability to handle small and irregularly shaped objects [30]. Compared to its predecessors (e.g., YOLOv5 and YOLOv7), YOLOv8 introduces a series of vital enhancements and offers significant advantages in both efficiency and adaptability [31,32]. Previous studies have used this model for ITCD from UAV imagery [33].
In this study, we aim to explore the feasibility of improving the accuracy of ITCD by detecting individual trees from dual-seasonal UAV-captured RGB imagery. For this purpose, we modified the YOLOv8 model so as to automatically extract and fuse features from dual-seasonal RGB imagery for the detection of individual tree crowns. To assess the benefits of utilizing dual-seasonal RGB imagery, the modified model was trained and tested across three test sites featuring different subtropical forest types: urban broadleaved forest, planted coniferous forest, and mixed coniferous and broadleaved forest. UAV-captured RGB images collected across three seasons (winter, spring and autumn for test sites 1 and 3; winter, spring and summer for test site 2) were grouped into dual-seasonal combinations and employed for ITCD. The ITCD results were compared with those derived from single-seasonal imagery at each test site. Comparative analyses were further conducted on various seasonal combinations to quantify the impact of data acquisition timing on ITCD performance. Finally, we evaluated our modified YOLOv8 model against the original YOLOv8 model, which employed channel-wise stacked dual-seasonal inputs, to demonstrate our model’s capability for feature extraction and fusion.
2. Study Area and Datasets
In this study, three test sites were selected to assess the application of dual-seasonal UAV-captured RGB imagery. Two are located in Lin’an District, Hangzhou City, Zhejiang Province, and the third is in Anji County, Huzhou City, Zhejiang Province (Figure 1). China’s administrative hierarchy primarily consists of the central government, provinces, prefecture-level cities, county-level administrative divisions (counties, county-level cities, districts, etc.), and township-level administrative divisions (towns, townships, subdistricts, etc.). RGB imagery was acquired using a DJI Phantom 4 RTK UAV, which is equipped with a 24 mm wide-angle camera with a 1-inch CMOS sensor. The operation area and flight route were set in the DJI GS RTK application, and the flight was carried out automatically along the set route. The flight parameters at each test site are detailed in Table 1. The details of all sample plots are summarized in Table 2.
Field datasets were acquired in each plot to facilitate subsequent tree crown labelling in images. A real-time kinematic (RTK) global navigation satellite system (GNSS) was used to determine stem locations. Other information such as tree species and diameter at breast height (DBH) was also documented. More details of the test sites and datasets are given as follows:
- (1)
Test site 1
As shown in Figure 1, test site 1 is located on the campus of Zhejiang A&F University in Lin’an District, Hangzhou, Zhejiang Province (30.1533° N, 119.4327° E). This area has a typical subtropical monsoon climate, with an average annual temperature of around 16 °C and annual rainfall of around 1500 mm. Two sample plots (Figure 2) were selected to generate the image sample dataset. The terrain within the sample plots is flat. Both sample plots have a high degree of canopy cover, with only a few shrubs. UAV RGB photos for test site 1 were collected in February (winter), May (spring), and November (autumn) 2023.
- (2)
Test site 2
Test site 2 is located in Changhua, Lin’an District, Hangzhou City, Zhejiang Province (30.1342° N, 119.4133° E). This area has a subtropical monsoon climate, with an average annual rainfall of around 1400 mm and an average annual temperature of 15.8 °C. One sample plot (Figure 3) was selected to generate the image sample dataset. The main species within the sample plot is Chinese fir. Some broadleaved trees are unevenly distributed in this area, accounting for roughly 20% of the total population. Most of the area has a simple stand structure and relatively large tree spacing, with dense shrubs and small trees growing beneath the upper canopy in summer. UAV RGB photos for test site 2 were collected in January (winter), April (spring), and August (summer) 2023.
- (3)
Test site 3
Test site 3 is located in Anji County, Huzhou City, Zhejiang Province (30.6382° N, 119.6820° E). This area has a subtropical monsoon climate characterized by a humid environment and dense vegetation, with annual rainfall of around 1600 mm and an average annual temperature of around 17 °C. One sample plot (Figure 4) was selected to generate the image sample dataset. The forest type within test site 3 is a coniferous and broadleaved mixed forest. The tree density in this area is high, and the forest structure is complex, with multiple vertical layers. UAV RGB photos for test site 3 were collected in January (winter), April (spring), and October (autumn) of 2023.
5. Discussion
5.1. Advantages of Dual-Seasonal Image Combination
In this study, the ITCD results derived from single-seasonal image datasets indicated that the season of UAV data collection had a significant effect on the ITCD result. There were considerable differences in the ITCD accuracy (F1, AP and AIoU) between data from different seasons. In the urban broadleaved forest (test site 1), the autumn data yielded the best results (F1 = 71.0%, AP = 76.1%, AIoU = 72.9%), whereas the spring data had the lowest accuracy (F1 = 51.5%, AP = 46.7%, AIoU = 66.3%). In contrast, the optimal data for the other two test sites were from the spring season. The crowns of different tree species have similar spectral reflectance in the spring and summer data, making it challenging for the model to distinguish between neighbouring tree crowns. This is especially problematic in dense forests. Furthermore, in summer, grass and shrubs grow taller, and their spectral reflectance is similar to that of tree crowns, making it difficult to distinguish between background (grass and shrubs) and canopy and increasing the difficulty of ITCD. In both autumn and winter data, the spectral reflectance of the crowns of different tree species is significantly different. The winter data also show considerable differences in textural characteristics as a result of defoliation. However, the spectral reflectance of deciduous trees in winter is similar to that of the ground, leading to misclassification between canopy and ground. In addition, the ITCD result is also influenced by data quality. Environmental conditions, the timing of UAV data collection, and the equipment used for data collection all affect data quality. For instance, the data collected in the morning or afternoon tends to be heavily shaded, causing the canopy to be misclassified as background and undetected. Excessive image noise caused by instrument imperfections or environmental factors can also affect the ITCD result. 
In summary, it is difficult to find an optimal season for data collection for all forests due to differences in forest conditions and data quality.
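The accuracy measures compared above (F1, AP, AIoU) build on standard detection quantities: a box-level intersection-over-union and counts of true/false positives. The following is an illustrative sketch of these definitions, not the paper's evaluation code; the 0.5 IoU matching threshold is a common convention rather than a confirmed detail of this study:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall over matched detections."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two unit-height boxes overlapping by half: intersection 50, union 150.
half_overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))  # ~0.333, below 0.5
# Hypothetical counts: 80 matched crowns, 20 spurious boxes, 30 misses.
print(round(f1_score(tp=80, fp=20, fn=30), 3))  # 0.762
```

AIoU then averages the IoU of matched detections, so it captures how tightly the boxes fit the crowns rather than how many crowns were found.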
With the use of dual-seasonal image data, the accuracy of ITCD improved to varying degrees at the three test sites compared to the use of single-seasonal data. Even when low ITCD accuracies were obtained using single-seasonal data, combining the data from two seasons could lead to significant improvements. For example, at test site 1, the F1 score was 51.5% for spring data and 61.3% for winter data. When the spring and winter data were combined, the F1 score rose to 72.5%. These results indicate that the features extracted from dual-seasonal images can effectively capture species-specific spectral reflectance variations in tree crowns across seasons, enabling enhanced crown delineation in mixed stands. In addition, the combination of dual-seasonal images can help mitigate the effects of background (e.g., grass and shrubs) and data quality issues (e.g., shadows) on the ITCD result. However, the extent of these gains depends on the approach to feature extraction and fusion. Simply concatenating dual-seasonal imagery into a six-channel input does not necessarily improve the accuracy of ITCD. This was evaluated in Section 4.4 through a comparison between the original YOLOv8 and our modified version (YOLOv8-DualFusion). Some of the accuracy values derived by the original YOLOv8 using dual-seasonal imagery were even slightly lower than those derived using single-seasonal imagery.
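The channel-wise stacking baseline referred to above amounts to early fusion at the input layer. A minimal sketch, with nested lists standing in for image arrays (illustrative only, not the actual implementation):

```python
def stack_channels(img_a, img_b):
    """Concatenate two RGB images pixel-wise along the channel axis.

    Each image is an H x W x 3 nested list; the result is H x W x 6.
    With this early-fusion input, the network sees both seasons at its
    first layer, with no season-specific branches or guided fusion.
    """
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

# One-pixel "images": winter and spring RGB values fused into 6 channels.
winter = [[[0.2, 0.3, 0.1]]]
spring = [[[0.1, 0.6, 0.2]]]
fused = stack_channels(winter, spring)
# fused[0][0] -> [0.2, 0.3, 0.1, 0.1, 0.6, 0.2]
```

The comparison in Section 4.4 suggests that where the two seasonal streams are fused, and how, matters as much as supplying both seasons at all.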
The improvement in accuracy gained by using dual-seasonal imagery was also demonstrated by applying the original Faster R-CNN and the modified version (Faster-DualFusion) to the single- and dual-seasonal image datasets, respectively (see Section 4.5). The consistent results between Faster R-CNN and YOLOv8 suggest that the improvement in accuracy achieved by using dual-seasonal RGB imagery is independent of the detector and primarily arises from the fusion of complementary seasonal features. However, the ITCD accuracy derived by the Faster-DualFusion model was lower than that derived by the YOLOv8-DualFusion model at all three test sites (see Figures S1–S3 in Supplementary Materials). This suggests that the accuracy of ITCD is influenced to some extent by model selection.
The benefits of using dual-seasonal image data have also been demonstrated in other studies addressing tasks such as tree species classification and diseased tree detection. For example, Veras et al. [48] used UAV-captured RGB imagery across four phenological stages (February, May, August, November) to train a tree species classification network, which demonstrated a 21.1% enhancement in classification accuracy over models based on single-seasonal data. However, they adopted a pixel-based rather than tree-level classification method. By integrating dual-seasonal satellite imagery (WorldView-3 for summer and Google Earth for winter), Guo et al. [49] achieved a 3% improvement in tree species classification accuracy (from 75.1% to 78.1%). That study employed a marker-controlled watershed segmentation algorithm to delineate individual tree crowns, subsequently generating training datasets for the deep learning model. More recently, Li et al. [36] fused UAV imagery acquired on two consecutive dates (18 and 19 October) within a YOLOv8-based framework for detecting trees affected by pine wood nematode disease, raising mAP50 from 71.9% to 78.5%. Although dual- or multi-seasonal remote sensing imagery has proven effective for diverse applications, its benefits for ITCD within a deep learning framework have not been systematically evaluated. Our study evaluated the advantages of using dual-seasonal imagery for ITCD in various types of forest and examined the impact of different seasonal combinations.
Although the optimal seasonal combination differed between test sites, the differences in F1 score and AP value between the three seasonal combinations were smaller than the differences between single-seasonal datasets. At test sites 1, 2 and 3, the largest difference in F1 score between the three single-seasonal datasets was 19.5%, 5.4% and 4.4%, respectively, while the largest difference between the three seasonal combinations was 2.9%, 3.4% and 5.4%, respectively. As for the AP value, the largest difference between the three single-seasonal datasets was 29.4%, 8.1% and 4.3%, respectively, while the largest difference between the three seasonal combinations was 2.9%, 6.1% and 2.7%, respectively. This comparative analysis shows that relatively consistent ITCD accuracy can be obtained by integrating dual-seasonal images from any two seasons. Therefore, data from any two seasons can be combined in practical applications, depending on data availability. However, this conclusion still needs to be further validated using more datasets spanning four seasons in different types of forest.
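The "largest difference" figures quoted above are simply the spread between the best- and worst-performing dataset. For instance, using the three single-seasonal F1 scores reported for test site 1 (spring 51.5%, winter 61.3%, autumn 71.0%):

```python
def largest_difference(scores):
    """Spread (max - min) of accuracy values across datasets."""
    return max(scores) - min(scores)

# Test site 1 single-seasonal F1 scores, in percent (spring, winter, autumn).
print(largest_difference([51.5, 61.3, 71.0]))  # 19.5
```

A smaller spread for the dual-seasonal combinations means the choice of which two seasons to pair is less critical than the choice of a single acquisition season.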
Notably, the enhanced ITCD performance enabled by dual-seasonal imagery facilitates more effective forest monitoring and management. Accurate detection of individual trees makes it possible to derive information such as crown widths and tree density within a specific area. It also allows tasks such as tree species classification and diseased tree detection to be conducted at the individual tree level. Seasonal variations in the spectral responses of individual tree crowns can facilitate these tasks. In future work, we will investigate the phenological signatures of broadleaved species inherent in dual- or multi-seasonal imagery and adapt our dual-seasonal framework for tree species classification based on individual tree detection.
5.2. Limitations
As indicated by the experimental results, the most significant improvement in accuracy was achieved at test site 3, which is covered by a dense mixed coniferous and broadleaved forest. At test site 3, the F1 score exhibited a range of 56.3% to 60.7% when utilizing different single-seasonal datasets. This range increased to 69.1%–74.5% when dual-seasonal datasets were used instead. Correspondingly, the AP value range increased from 57.2%–61.5% to 70.1%–72.8%. In contrast, less improvement in accuracy was observed at both test sites 1 and 2. In particular, at test site 2, which is dominated by a planted coniferous forest with relatively large tree spacing, the ITCD results from the single-seasonal image datasets already had high accuracies (F1 score and AP value above 70%). Therefore, ITCD based on dual-seasonal RGB imagery is more advantageous in dense mixed species forests than in forests with a simple structure.
It is worth noting that this study does not rely on a very large dataset. This is due to limitations in data collection: multi-seasonal RGB image datasets covering different forest types were required, as well as field-measured data to verify the tree crown labelling, and such datasets are difficult to collect over a large area. However, Bumbaca et al. [50] showed that YOLO models can reach benchmark performance with approximately 110–130 annotated training images. Everingham et al. [45] also utilized relatively small training datasets for object detection benchmarks; in their research, the number of annotated objects per class was in the hundreds. Zhang et al. [51] used only 2610 bounding boxes across three tree species for species classification. Sun et al. [52] trained a deep learning model using 2269 manually labelled samples of individual trees (340 of which were used for testing) and achieved strong performance in the segmentation of individual tree crowns and the extraction of crown width. Additionally, previous studies have shown that data augmentation can increase the size of training datasets and improve their quality, thereby enhancing model performance and generalization [53]. In the context of remote sensing, Hao et al. [54] compared a range of augmentation techniques and highlighted their effectiveness for small-sample training. In our study, we employed a range of augmentation techniques, including translation, scaling, rotation, and noise perturbation, to expand the training dataset. Nevertheless, more datasets covering different forest types should be acquired to further evaluate the advantages of dual-seasonal RGB imagery for ITCD and tree species classification, and to improve the deep learning model.
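As an illustration of one of the augmentation techniques listed above, a noise perturbation step might look like the following sketch; the noise level, the clipping to [0, 1], and the flat pixel list are illustrative assumptions, not the study's actual settings:

```python
import random

def add_noise(pixels, sigma=0.05, seed=0):
    """Perturb a flat list of pixel values in [0, 1] with Gaussian noise.

    A label-preserving augmentation: the crown annotations stay valid
    while the image is slightly altered. sigma, the clipping, and the
    seeded generator are illustrative choices only.
    """
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

noisy = add_noise([0.0, 0.5, 1.0])
# All perturbed values remain valid intensities in [0, 1].
```

Geometric augmentations (translation, scaling, rotation) work the same way in principle, except that the bounding-box labels must be transformed along with the image.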
The deep learning model employed in this study, i.e., YOLOv8-DualFusion, used rectangular boxes to approximate the extents of tree crowns, providing location information for individual tree crowns. However, the precise boundaries of the tree crowns were not extracted. On one hand, the focus of this study is to evaluate the advantages of using dual-seasonal images, rather than delineating precise crown boundaries. On the other hand, in dense forests with significant crown overlap, especially in broad-leaved forests, the boundaries between adjacent crowns are difficult to delineate. In this study, both test sites 1 and 3 are covered by dense forests with a large proportion of broad-leaved trees. In such forests, even manually delineating crown boundaries is challenging. Future research can combine instance segmentation networks with boundary optimization algorithms to achieve accurate crown boundary delineation. It should also be noted that the difficulty of delineating crown boundaries in dense forests poses a great challenge to crown labelling. Although we performed crown labelling based on both seasonal variations in spectral responses of crowns and field measurements, inaccurate crown boundaries may still degrade model accuracy.
6. Conclusions
This study assesses the application of dual-seasonal UAV-captured RGB imagery for individual tree crown detection (ITCD) and examines the potential accuracy improvements across various forest types. A modified YOLOv8 model (YOLOv8-DualFusion) was trained and tested at three sites with different forest types. It was compared with the original YOLOv8 model, which was applied to single-seasonal datasets, and with the original model using channel-wise stacked dual-seasonal imagery as inputs. For comparison, the original and modified Faster R-CNN models were also applied to the single- and dual-seasonal datasets, respectively. An ablation analysis was conducted to evaluate the contribution of the convolutional block attention module. The experimental results showed that using dual-seasonal imagery improved detection accuracy, particularly in dense mixed forests with complex structures. The use of single-seasonal datasets resulted in performance fluctuations influenced by acquisition season, data quality issues (e.g., shadow interference), and spectral confusion between canopy and background objects (e.g., grass/shrubs). In contrast, when using dual-seasonal image data, the detection accuracies obtained for different combinations of seasons showed relatively small variations. Model comparison revealed that YOLOv8-DualFusion achieved higher ITCD accuracy than other models at each test site. These results demonstrate that the improvement in ITCD accuracy was due not only to the complementary seasonal information, but also to the model structure and the way features extracted from dual-seasonal imagery were fused.
The findings of this study provide guidance on selecting between single-seasonal and dual-seasonal image datasets across various forest types and also inform the optimal data collection season or combination of seasons. However, we used a rectangular box to approximate the crown extent rather than trying to outline the crown boundary. Models that can perform instance segmentation should be considered, or post-processing can be implemented to extract accurate crown boundaries.