1. Introduction
Bamboo forests are widely distributed across the mountainous regions of central and southern Taiwan, where warm and humid climatic conditions promote rapid growth and high biomass accumulation [1]. The various bamboo species play a crucial ecological and economic role in material production, forest industries, and agricultural applications. However, unmanaged bamboo stands often experience excessive expansion, aging structures, and decreased biomass quality, thereby increasing ecological risks and compromising long-term sustainability. Extreme weather events and pest outbreaks further amplify these concerns and highlight the need for timely, high-resolution monitoring of bamboo forest conditions.
Despite the importance of these forest resources, effective monitoring remains challenging. Bamboo often grows in mixed stands, exhibits highly irregular spatial structures, and is influenced by complex mountainous terrain, all of which degrade the quality and reliability of remote sensing observations. These factors collectively hinder the precise characterization of bamboo spatial distribution and expansion dynamics when relying solely on traditional field-based or visual interpretation methods [2]. Moreover, the monitoring techniques themselves present inherent limitations. Traditional ground-based surveys are labor intensive, time-consuming, and constrained by terrain accessibility. Although field measurements and allometric models have been widely applied, their effectiveness is restricted by sampling dependencies, potential human error, and limited regional scalability. Even when aerial photographs are available, conventional visual interpretation still relies on expert judgment, preventing consistent and large-scale monitoring.
Recently, advancements in remote sensing, machine learning (ML), and deep learning (DL) technologies have driven forest monitoring toward higher resolution, larger spatial coverage, and increasing levels of automation [3,4,5,6]. Early studies mainly employed Landsat MSS, TM, and ETM+ imagery combined with classical classifiers such as Maximum Likelihood [4]. Although these methods improved upon traditional surveys, they were constrained by mixed spectral signatures, coarse spatial resolution, and crown-level occlusion. Subsequently, ML algorithms such as Random Forest enhanced classification accuracy but still struggled with complex vegetation structures [5]. With the rapid advancement of DL techniques, their application to remote sensing image analysis has become an important research trend in recent years. DL-based approaches for remote sensing imagery can generally be categorized into classification and segmentation models. While classification models are primarily used to assign class labels at the image or pixel level, segmentation models provide more detailed spatial information by explicitly delineating the location and shape of target objects. As a result, segmentation models are often preferred for vegetation monitoring and forest resource assessment tasks that require explicit spatial delineation, such as area mapping and boundary extraction, whereas classification models are more commonly used for species identification and label assignment at the pixel, patch, or image level.
Segmentation models can be further divided into semantic segmentation [6,7,8,9,10,11,12,13,14,15,16,17,18] and instance segmentation [19,20,21,22,23,24,25,26,27] according to their output representations. Semantic segmentation focuses on assigning a class label to each pixel in an image, enabling the extraction of continuous vegetation regions and overall spatial distributions. This type of model has been extensively applied to tasks such as forest cover mapping, land-cover classification, and large-scale vegetation distribution analysis. In contrast, instance segmentation not only distinguishes different classes but also separates individual objects within the same class, allowing the identification of distinct plant entities such as individual trees, bamboo clumps, or crop plants. Instance segmentation is therefore particularly suitable for fine-scale vegetation monitoring, structural analysis, and quantitative assessments at the individual-object level.
Recent advances in DL architectures have further enhanced the performance of both semantic and instance segmentation in complex vegetation scenes. Among semantic segmentation models, U-Net [7] and its variants [13,14,15,16] have been widely adopted in remote sensing and forest applications due to their encoder–decoder structure and effective multi-scale feature fusion, enabling accurate delineation of vegetation boundaries in high-resolution imagery. More recently, transformer-based semantic segmentation models, such as SegFormer [9], along with representative transformer backbones including Swin Transformer [17] and Pyramid Vision Transformer (PVT) [18], have attracted increasing attention by incorporating global contextual information through self-attention mechanisms, while maintaining computational efficiency. This characteristic makes these models particularly suitable for heterogeneous forest environments with complex textures and variable canopy structures.
For instance segmentation, two mainstream modeling paradigms have been commonly applied in vegetation analysis. Mask R-CNN [21] represents a two-stage framework that performs region proposal, object classification, and pixel-level mask prediction in a unified pipeline, and has been successfully used for individual tree and plant segmentation in forest and agricultural studies. In contrast, recent YOLO-based architectures [22,23,24,25], such as YOLOv8-Seg [24], adopt a single-stage detection and segmentation strategy, offering faster inference speed and competitive accuracy. These models provide complementary advantages for individual-object delineation and enable flexible deployment across different spatial scales and monitoring scenarios.
On the other hand, unmanned aerial vehicles (UAVs) have emerged as a transformative tool for forest monitoring, offering high-resolution imagery, flexible deployment, and reduced operational costs. Integrating UAV imagery with advanced DL models has significantly improved object detection and species classification performance in complex forest environments [28,29,30,31]. For example, Xiang et al. [28] utilized UAV imagery in combination with convolutional neural network (CNN) segmentation algorithms (such as U-Net and ResUNet) to develop a fully automated forest change detection system capable of accurately extracting forest change patches, demonstrating the advantages of UAVs in dynamic environmental monitoring. Huang et al. [29] used UAV-acquired RGB imagery and evaluated five deep learning models (e.g., ViT, EfficientNetB0, YOLOv5) for tree species classification, achieving an F1-score of 0.96 for summer imagery, while ViT performed best on autumn imagery, demonstrating consistently high classification capability across seasons. Onishi and Ise [30] proposed an automated tree-crown segmentation and classification system based on UAV RGB imagery and CNNs. Their method successfully identified seven tree categories—including broadleaf, coniferous species, and specific taxa such as Pinaceae and Chamaecyparis—achieving an overall classification accuracy exceeding 90%. Mohan et al. [31] employed a low-cost UAV equipped with a consumer-grade camera and utilized Structure-from-Motion (SfM) to generate point clouds and a canopy height model (CHM). Individual tree detection was then performed using a local-maxima algorithm. In experiments conducted in mixed conifer forests in Wyoming, the system was evaluated against 367 reference trees, achieving a detection accuracy of F1-score = 0.86. This study demonstrated that UAVs combined with SfM provide a cost-effective and efficient approach for monitoring and surveying individual trees in forest environments.
In bamboo forest research, related studies have also shown rapid technological advancements. Lv et al. [32] integrated UAV multispectral imagery with DL-based object detection models (such as Faster R-CNN, YOLOv5, and YOLOv7) to monitor and count individual moso bamboo clumps and bamboo culms, effectively improving object detection accuracy in complex forest environments. This approach has demonstrated strong potential for detecting bamboo clumps, estimating canopy conditions, and capturing fine-scale vegetation dynamics, thereby overcoming the limitations of both field surveys and satellite-based observations.
Studies integrating DL with UAV imagery for bamboo-related research remain limited, and no existing work has systematically explored the application of advanced segmentation models to delineate both bamboo forest regions and individual bamboo clumps from high-resolution UAV imagery. Therefore, this study aims to develop an integrated DL-based framework that combines semantic segmentation and instance segmentation to detect bamboo forest areas and individual bamboo clumps using UAV orthomosaic images. It should be emphasized that the objective of this work is not to develop new segmentation models, but to investigate the applicability and effectiveness of well-established and widely adopted models in the context of bamboo forest mapping. Accordingly, mature and representative architectures were deliberately selected to ensure methodological stability, reproducibility, and practical relevance.
The framework includes three main components: (1) the use of high-resolution UAV RGB orthomosaic imagery as the input data; (2) the application of CNN-based and transformer-based semantic segmentation models, namely U-Net and SegFormer, to extract bamboo forest regions; and (3) the adoption of two instance segmentation models, YOLOv8-Seg and Mask R-CNN, to detect and delineate individual bamboo clumps. This framework is designed to support more effective, scalable, and consistent monitoring of bamboo forest structure and dynamics.
2. Materials and Methods
2.1. The Overall Framework
Figure 1 illustrates the experimental workflow of this study. First, RGB UAV imagery was acquired and processed to generate orthomosaic images. The orthomosaics were then cropped into sub-images, which were organized into two independent datasets. Dataset I was used for the bamboo forest semantic segmentation experiments, while Dataset II was employed for the bamboo clump instance segmentation experiments. Both datasets were further divided into training, validation, and test subsets, and all images were annotated with corresponding ground truth labels. Finally, the results of both experiments were evaluated using four quantitative performance metrics.
The following subsections provide a step-by-step description of each component of the proposed workflow.
2.2. Study Area
The study area was located in Compartment 43 of the Qishan Working Circle in Kaohsiung City, as shown in Figure 2. The geographic extent of the study area ranges approximately from 120°25′10″ to 120°27′00″ E and from 22°50′10″ to 22°53′10″ N. The survey site covers approximately 3 km², containing bamboo stands of varying density. The terrain within the compartment exhibits relatively gentle variation, and the dominant species is Bambusa stenostachya, which occurs in dense stands with a characteristic radial expansion pattern. Scattered broadleaf species—such as Leucaena leucocephala, Acacia confusa, and Cassia siamea—are also present, contributing to a complex and competitive vegetation structure [33] well suited for research on bamboo spatial distribution and bamboo clump detection.
2.3. Bamboo Image Dataset
2.3.1. UAV Image Acquisition
The UAV imagery used in this study was collected in September 2024. A DJI M300-RTK drone (DJI Technology Co., Shenzhen, China) equipped with an RGB camera, GPS, and IMU modules was employed to acquire a continuous sequence of overlapping aerial photographs under stable weather conditions, thereby minimizing illumination inconsistencies across the dataset. All flights were conducted following a predefined flight plan with a flight altitude of 250 m, a flight speed of 7 m/s, and image overlaps of 80% forward overlap and 70% side overlap, ensuring sufficient redundancy for subsequent orthomosaic generation. Under this configuration, the native ground sampling distance (GSD) was approximately 3.9 cm. The orthomosaic imagery was resampled to a spatial resolution of 10 cm, which was deemed suitable for segmenting bamboo units related to bamboo production and carbon management [33,34]. These 10-cm resolution images served as the input data for all subsequent analyses in this study.
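The relationship between flight altitude and pixel footprint follows the standard ground-sampling-distance formula, GSD = (sensor width × altitude) / (focal length × image width). The sketch below illustrates the computation only; the sensor width, focal length, and image width are hypothetical values and are not the actual camera specifications used in this study.

```python
def ground_sampling_distance(sensor_width_mm, focal_length_mm,
                             image_width_px, altitude_m):
    """Return the GSD in meters per pixel:
    (sensor width x flight altitude) / (focal length x image width)."""
    return (sensor_width_mm / 1000.0) * altitude_m / (
        (focal_length_mm / 1000.0) * image_width_px)

# Hypothetical full-frame sensor parameters, for illustration only.
gsd_m = ground_sampling_distance(sensor_width_mm=35.9, focal_length_mm=35.0,
                                 image_width_px=8192, altitude_m=250.0)
print(f"GSD ~= {gsd_m * 100:.1f} cm/pixel")
```

With different (real) camera parameters, the same formula reproduces the ~3.9 cm native GSD reported above.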
2.3.2. Orthorectification
Because UAV imagery is easily affected by camera tilt, terrain variation, and fluctuations in flight altitude during acquisition, geometric distortions and scale inconsistencies often occur. Using such images directly may result in discrepancies between the recorded features and their actual geographic locations. To ensure measurement accuracy and the reliability of subsequent analyses, orthorectification is required to transform the imagery into a unified geographic coordinate system, eliminate distortions, and establish a consistent spatial reference.
Our orthomosaic generation was performed using Agisoft Metashape (version 2.1) through a four-step workflow. First, the UAV images were positioned using the onboard GPS coordinates. Subsequently, aerial triangulation and bundle adjustment [35] (Equations (1) and (2)) were performed by incorporating ground control points (GCPs), IMU data, and tie points extracted from image overlaps. This adjustment process yields refined estimates of the image positions, orientations, and a sparse point cloud.
$$x = x_0 - f\,\frac{r_{11}(X - X_S) + r_{12}(Y - Y_S) + r_{13}(Z - Z_S)}{r_{31}(X - X_S) + r_{32}(Y - Y_S) + r_{33}(Z - Z_S)} \qquad (1)$$

$$y = y_0 - f\,\frac{r_{21}(X - X_S) + r_{22}(Y - Y_S) + r_{23}(Z - Z_S)}{r_{31}(X - X_S) + r_{32}(Y - Y_S) + r_{33}(Z - Z_S)} \qquad (2)$$

where $x$ and $y$ denote the image coordinates of the point in the point cloud; $x_0$ and $y_0$ represent the image coordinates of the principal point; $f$ is the camera focal length; $(X, Y, Z)$ are the object-space coordinates of the corresponding ground point; $(X_S, Y_S, Z_S)$ are the object-space coordinates of the perspective center; and $\mathbf{R} = [r_{ij}]$ is the rotation matrix composed of the three orientation angles.
Second, dense stereo matching [36] was applied to derive depth maps from overlapping image pairs, which were merged into a high-density point cloud. The dense point cloud provides a high-resolution representation of the surface and serves as the core foundation for subsequent mesh construction and orthorectification.
Third, a triangular mesh model was generated from the dense point cloud via variational optimization and Min-cut refinement to improve surface smoothness and geometric continuity [37]. The Min-cut objective is defined as follows:

$$\mathrm{cut}(S, T) = \sum_{u \in S,\; v \in T} w(u, v) \qquad (3)$$

where $S$ and $T$ represent two disjoint subsets, $u$ and $v$ denote two nodes, and $w(u, v)$ indicates the weight of the edge connecting nodes $u$ and $v$.
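As a minimal illustration of the Min-cut quantity defined above, the following sketch sums the weights of edges crossing from subset S to subset T on a toy directed graph; the node names and edge weights are hypothetical and unrelated to the actual mesh data.

```python
def cut_weight(S, T, weights):
    """Total weight of edges crossing from subset S to subset T.

    weights: dict mapping (u, v) node pairs to edge weights.
    S and T are assumed to be disjoint node sets.
    """
    return sum(w for (u, v), w in weights.items() if u in S and v in T)

# Toy 4-node graph with hypothetical weights, for illustration only.
w = {("a", "b"): 3.0, ("a", "c"): 1.0, ("b", "d"): 2.0, ("c", "d"): 4.0}
print(cut_weight({"a", "b"}, {"c", "d"}, w))  # edges a-c and b-d cross the cut
```

Min-cut refinement seeks the partition minimizing this quantity; solvers typically use max-flow algorithms rather than enumerating partitions.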
Finally, the orthomosaic was produced by projecting the corrected mesh and camera parameters onto a unified coordinate system, followed by radiometric correction and color balancing to reduce shading variations. The resulting orthomosaic image served as the primary input for subsequent bamboo detection and segmentation analysis.
Figure 3 illustrates the images generated from each step of the workflow described above.
2.3.3. Sub-Image Generation
Figure 4 displays the two high-resolution orthomosaic images used in this study. The original image dimensions are 26,673 × 26,595 pixels and 28,211 × 18,788 pixels, respectively. To facilitate model training and inference, a non-overlapping sliding window technique was applied to crop the orthomosaics into multiple sub-images of 1024 × 1024 pixels. A total of 656 sub-images were generated from the first orthomosaic and 571 from the second, resulting in 1227 sub-images forming the initial bamboo dataset.
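The non-overlapping sliding-window cropping described above can be sketched as follows. How boundary remainders smaller than one tile were treated is not stated in the text, so discarding partial edge tiles here is an assumption.

```python
import numpy as np

def tile_image(img, tile=1024):
    """Crop an H x W x C array into non-overlapping tile x tile sub-images,
    discarding the partial tiles at the right and bottom edges (assumption)."""
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            tiles.append(img[y:y + tile, x:x + tile])
    return tiles

# Small synthetic example: a 2500 x 3100 "orthomosaic" yields 2 x 3 = 6 tiles.
demo = np.zeros((2500, 3100, 3), dtype=np.uint8)
print(len(tile_image(demo)))
```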
2.3.4. Image Annotation and Dataset Preparation
The first part of the experiment focuses on semantic segmentation of bamboo forest areas. From the initial set of sub-images, 52 images containing clearly identifiable bamboo forest regions were manually selected through visual inspection to form Dataset I. For ground truth annotation, a binary labeling scheme was used, where bamboo forest pixels were assigned a value of 1 and all other areas were assigned 0.
Figure 5a shows an example of the annotated image.
The second part of the experiment focuses on instance segmentation of bamboo clumps. Bamboo commonly forms distinct aggregation structures, and for labeling purposes, the clumps were grouped into two simplified categories. The first category is radiating cluster-type clumps, characterized by outward expansion from a central point, forming circular or semi-circular patterns with clear directional shadows. These structures are easily recognized in UAV images due to their dense canopy and radial texture. The second category is compact clumps, representing younger or smaller bamboo clusters. These appear as rounded or irregular green patches with less distinct directional texture but maintain consistent color and canopy density. In this study, manual annotation was limited to radiating bamboo clumps that can be reliably identified through visual interpretation. Bamboo clumps exhibiting irregular stem arrangements, indistinct individual boundaries, or compact structural characteristics were excluded from the annotation process.
From the initial set of sub-images, 150 images containing clearly identifiable bamboo clump characteristics with strong visual distinguishability were manually selected based on visual inspection to form Dataset II. Single bamboo clumps within these images were then annotated by hand.
Figure 5b shows an example of the annotated sub-image.
Finally, both datasets were randomly divided into training, validation, and test sets at the sub-image level using a 7:1:2 ratio.
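The 7:1:2 random split at the sub-image level can be sketched as follows; the `split_dataset` helper and the seed value are illustrative rather than the study's actual implementation.

```python
import random

def split_dataset(items, ratios=(0.7, 0.1, 0.2), seed=123):
    """Randomly split a list of sub-image IDs into train/val/test subsets."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Dataset II has 150 sub-images, giving a 105 / 15 / 30 split.
train, val, test = split_dataset(list(range(150)))
print(len(train), len(val), len(test))
```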
2.3.5. Data Augmentation
To improve the generalization capability and robustness of the proposed models under varying environmental conditions, several data augmentation strategies were applied during the training phase. Horizontal flipping was used to expose the model to left–right symmetric variations in bamboo clump arrangements, thereby enhancing its ability to learn spatial symmetry. Random rotation was performed within a range of −15° to +15° to simulate minor camera tilt during UAV image acquisition. In addition, image saturation was randomly adjusted by −25% to +25% to account for color variations caused by different weather conditions and sensor responses. Brightness adjustment was applied within a range of −15% to +15% to simulate changes in illumination intensity across different times of day. Finally, slight exposure adjustment within −10% to +10% was introduced to improve the model’s robustness to underexposed and overexposed imaging conditions.
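A simplified NumPy sketch of part of this augmentation policy is shown below. Rotation is omitted for brevity, and interpreting the exposure adjustment as an extra multiplicative gain is our assumption; in practice a library such as torchvision or albumentations would typically implement the full pipeline.

```python
import numpy as np

rng = np.random.default_rng(123)

def augment(img):
    """Apply a simplified version of the augmentation policy to one RGB
    image (uint8, H x W x 3). Rotation is omitted; exposure is modeled
    as a multiplicative gain (an assumption)."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:                                   # horizontal flip
        out = out[:, ::-1, :]
    gray = out.mean(axis=2, keepdims=True)                   # saturation +/-25%
    out = gray + (1.0 + rng.uniform(-0.25, 0.25)) * (out - gray)
    out *= 1.0 + rng.uniform(-0.15, 0.15)                    # brightness +/-15%
    out *= 1.0 + rng.uniform(-0.10, 0.10)                    # exposure +/-10%
    return np.clip(out, 0, 255).astype(np.uint8)

demo = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(augment(demo).shape)
```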
Table 1 summarizes the data split for training, validation, and testing in both datasets.
2.4. Segmentation Models
For the semantic segmentation task, two representative models were selected: U-Net and SegFormer, each implemented using its original architecture.
- (1) U-Net is a classical CNN-based segmentation architecture originally proposed by Ronneberger et al. [7] to address cell and organ segmentation in medical imaging. Its most distinctive feature is the symmetric encoder–decoder structure combined with skip connections that link corresponding layers between the encoder and decoder. This design enables the model to extract high-level features while preserving spatial and boundary information, thereby improving segmentation accuracy and detail.
- (2) SegFormer, proposed by Xie et al. [9], is a transformer-based semantic segmentation model and is considered a major advancement in applying transformer architectures to dense prediction tasks. SegFormer overcomes the limitations of traditional segmentation models regarding positional encoding and multi-scale representation. It offers advantages such as lightweight design, high accuracy, and the ability to operate without positional encodings, making it particularly suitable for semantic segmentation of high-resolution remote sensing imagery.
For the instance segmentation task, two representative models were adopted: Mask R-CNN and YOLOv8-Seg.
- (1) Mask R-CNN, proposed by He et al. [21], is an extension of the Faster R-CNN framework for instance segmentation. In addition to predicting bounding boxes and class labels, Mask R-CNN introduces a parallel branch that outputs pixel-level masks for each detected object. This architecture enables precise delineation of object shapes and has been widely adopted in various computer vision applications, including image analysis, scene understanding, and fine-grained object extraction. Its two-stage design allows for high segmentation accuracy, particularly in tasks that require detailed boundary representation.
- (2) YOLOv8-Seg (You Only Look Once, Version 8–Segmentation) [24,38], released by Ultralytics in 2023, is the instance segmentation extension of the YOLOv8 framework. Beyond object detection, it integrates a segmentation head that generates pixel-level masks for each detected object. With an improved backbone, enhanced feature pyramid, and optimized loss design, YOLOv8-Seg achieves fast inference and strong segmentation accuracy. Its real-time performance makes it well suited for large-scale UAV imagery analysis and other applications requiring efficient and accurate object-level segmentation.
2.5. Evaluation Metrics
To evaluate the accuracy and practical applicability of the semantic segmentation and instance segmentation models, four performance metrics commonly used in object detection and segmentation tasks were adopted: Precision, Recall, F1-score, and mean Average Precision (mAP). When comparing model predictions with manually annotated ground truth masks, an Intersection over Union (IoU) threshold of 0.5 was used as the matching criterion to determine whether a detection was considered valid. The definitions and formulations of the evaluation metrics are described as follows.
Precision measures the proportion of correctly predicted bamboo clumps among all clumps predicted by the model, reflecting the model’s tolerance to false positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

Recall represents the proportion of actual bamboo clumps that are correctly detected by the model, indicating its ability to identify all relevant targets and avoid missed detections:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

In Equations (4) and (5), TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

The F1-score is the harmonic mean of Precision and Recall, providing a balanced measure of both prediction accuracy and completeness. A higher F1-score indicates a better trade-off between precision and recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$

The mean Average Precision (mAP) represents the area under the precision–recall curve across different recall levels. In this study, since only a single class (i.e., bamboo clumps) was considered, the mAP corresponds to the Average Precision (AP) evaluated at an IoU threshold of 0.5, which is computed as:

$$\mathrm{AP} = \int_0^1 P(r)\, dr \qquad (7)$$

where $P(r)$ denotes the precision at a given recall level $r$.
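The counting-based metrics and the IoU-0.5 matching criterion described above can be sketched as follows; the TP/FP/FN counts in the usage example are hypothetical and do not correspond to any result reported in this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from true/false positive and false
    negative counts, with zero-division guarded."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes; a detection
    counts as a true positive when IoU >= 0.5 against the ground truth."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=84, fp=18, fn=16)
print(round(p, 4), round(r, 4), round(f1, 4))
```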
2.6. Experimental Setup and Parameter Settings
All experiments were conducted using the Kaggle online computing environment. The implementation was developed in Python (version 3.11.13) and executed with PyTorch (version 2.6.0+cu124). Model training and inference were accelerated using an NVIDIA Tesla T4 GPU (NVIDIA, Santa Clara, CA, USA).
Four DL models were implemented with task-specific parameter configurations. For U-Net, the Adam optimizer was employed with a learning rate of . Binary cross-entropy loss was used as the objective function, and the batch size was set to 8. For SegFormer, training was conducted using the Adam optimizer with a learning rate of , while cross-entropy loss was adopted, and the batch size was set to 4. To investigate the impact of different training epochs, U-Net and SegFormer were trained for 100, 200, and 300 epochs. For Mask R-CNN, stochastic gradient descent was used as the optimizer with an initial learning rate of 0.0025, a momentum of 0.9, and a weight decay of 0.0001. Due to the high computational cost of instance segmentation, the batch size was limited to 2. For YOLOv8-Seg, the RAdam optimizer was applied in combination with cross-entropy loss, and the batch size was set to 4 and 8, as preliminary experiments indicated that YOLOv8-Seg exhibits relatively high sensitivity to batch size selection. To ensure reproducibility, the random seed was fixed at 123.
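Fixing the random seed across libraries, as described above, can be sketched as follows; the exact set of libraries seeded in the study is not specified, so the guarded PyTorch calls are an assumption.

```python
import random
import numpy as np

def set_seed(seed=123):
    """Fix random seeds for Python, NumPy, and (if installed) PyTorch
    so that training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not available in this environment

set_seed(123)
print(np.random.rand())
```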
3. Results
3.1. Semantic Segmentation Results of Bamboo Forests
The quantitative results of U-Net are summarized in Table 2. As shown in Table 2, the U-Net model achieved its best overall performance with 200 epochs, yielding Recall, F1-score, and mAP values of 0.9487, 0.9538, and 0.9062, respectively. The performance was generally stable across epochs. The highest F1-score (0.9538) and highest mAP (0.9062) were obtained at 200 epochs, suggesting that moderate training duration is sufficient for achieving optimal performance, while further training to 300 epochs did not yield additional gains.
On the other hand, the quantitative results of SegFormer are summarized in Table 3. As shown in the table, SegFormer achieved its best overall performance at 300 epochs, attaining optimal values in Recall, F1-score, and mAP of 0.9582, 0.9569, and 0.9182, respectively. Notably, performance improvements from 100 to 300 epochs were modest, indicating that SegFormer converges reliably and benefits only marginally from extended training on this dataset.
Overall, SegFormer consistently outperformed U-Net by a small margin across the evaluated metrics, suggesting that transformer-based architectures have the potential to better capture contextual and structural information in high-resolution UAV imagery for bamboo forest delineation.
Figure 6 presents representative qualitative results of bamboo forest semantic segmentation. From left to right, each row shows the input UAV RGB image, the corresponding ground truth (GT), and the segmentation results produced by U-Net and SegFormer, respectively.
Overall, both models are able to effectively delineate the general spatial extent of bamboo forest areas, indicating that high-resolution UAV imagery provides sufficient visual cues for deep learning models to discriminate the target class (bamboo forest) from background categories such as broadleaf forest, grassland, and exposed rocky terrain. In terms of boundary representation, the predicted masks generated by both models exhibit relatively smooth edges compared with the GT, and several protruding boundaries reasonably correspond to actual bamboo clump or culm structures. Overall, SegFormer demonstrates more complete and consistent boundary delineation than U-Net.
With respect to misclassification patterns, the two models exhibit distinct characteristics. U-Net is more prone to producing localized false negatives (FNs) within large bamboo forest areas, particularly in regions with lower brightness or pronounced shadowing, and also generates sporadic false positives (FPs) in some background areas. In contrast, SegFormer shows higher sensitivity to fine-scale textures and contextual variations, enabling it to more completely capture certain irregular bamboo structures; however, a small number of FNs and FPs can still be observed along object boundaries. Overall, SegFormer achieves better semantic segmentation performance than U-Net in this study, a trend that is consistent with the quantitative metrics reported in Table 2 and Table 3.
It is worth noting that, although the aforementioned errors are visually noticeable, bamboo pixels account for a relatively large proportion of the image area in most test samples. Consequently, true positive pixels dominate the evaluation statistics, which partially explains why high overall accuracy and recall values can still be obtained despite localized boundary-level segmentation errors. This observation highlights the importance of jointly considering pixel-wise quantitative metrics and qualitative visual analysis when evaluating the performance of semantic segmentation models on high-resolution UAV imagery.
3.2. Instance Segmentation Results of Bamboo Clumps
Table 4 summarizes the quantitative performance of Mask R-CNN for bamboo clump instance segmentation. As shown in the table, the model achieved its best performance under the 12-epoch setting, with a Precision of 0.7343, a Recall of 0.8399, and a combined F1-score of 0.7835. The mAP also reached 0.8221, indicating that the model attained a good balance between prediction accuracy and completeness.
Table 5 summarizes the quantitative performance of YOLOv8-Seg for bamboo clump instance segmentation under different combinations of training epochs and batch sizes. Among the evaluated settings, the configuration with 100 epochs and a batch size of 8 achieved the highest mAP of 0.8024 and the highest precision of 0.8232, indicating strong localization accuracy and low false-positive rates. In contrast, configurations with a batch size of 4 generally yielded higher recall values, particularly at 100 and 300 epochs, suggesting a tendency to detect a larger proportion of bamboo clump instances at the cost of slightly reduced precision. The F1-score, which balances precision and recall, reached its highest value (0.7666) under the Epoch = 200, batch size 4 setting, reflecting a more balanced detection performance.
Figure 7 presents qualitative examples of bamboo clump instance segmentation results. From left to right, the columns represent (a) the input UAV test images, (b) the corresponding manually annotated ground truth, (c) the segmentation results generated by Mask R-CNN (Epoch = 12), and (d) those produced by YOLOv8-Seg (Epoch = 200, Batch size = 4). Panels (c) and (d) display both the instance segmentation masks (highlighted regions) and their associated bounding boxes.
From visual inspection, both models are capable of detecting individual bamboo clumps under complex canopy conditions. Mask R-CNN produces relatively smooth and coherent instance boundaries that generally enclose the full extent of each bamboo clump. In contrast, YOLOv8-Seg generates boundaries that more closely follow the shape of individual bamboo culms, resulting in instance masks that appear more consistent with the ground truth annotations.
However, YOLOv8-Seg exhibits a higher number of missed detections (false negatives), resulting in relatively lower recall. For example, in the second sample, YOLOv8-Seg fails to detect the bamboo clump located in the upper-left region, and in the fourth sample, three small bamboo clumps in the lower-right area are not identified. The relatively lower recall can be attributed to the model’s conservative prediction behavior, which prioritizes precision over detection completeness. Small or partially occluded bamboo clumps are more likely to be missed, especially under complex canopy conditions. In addition, the limited size of the training dataset and the conservative nature of ground truth annotations may further contribute to missed detections.
Conversely, Mask R-CNN tends to produce more false positives. For instance, in the first sample, Mask R-CNN detects bamboo clumps in the upper-left and lower-right regions that are not labeled in the ground truth (which may correspond to actual bamboo clumps omitted due to conservative annotation). Similar cases can be observed in the left regions of the third and fifth samples, where Mask R-CNN detects bamboo clumps absent from the ground truth annotations. This tendency toward false positives can be attributed to the region proposal mechanism, which prioritizes high recall by generating a large number of candidate regions. In complex forest environments where non-bamboo vegetation shares similar spectral and textural characteristics with bamboo clumps, such a design increases the likelihood of misclassifying visually similar background regions as bamboo.
4. Discussion
This work focuses on the application and evaluation of existing deep learning models rather than the development of new architectures. The aim is to investigate how representative semantic and instance segmentation models perform when applied to bamboo forest mapping tasks under practical training and computational constraints.
Accordingly, the following discussion interprets the observed results in terms of model characteristics, evaluation metrics, and operational considerations for UAV-based bamboo monitoring.
4.1. Training Behavior and Convergence Characteristics
The experimental results indicate that different model architectures exhibit distinct convergence behaviors in both semantic segmentation and instance segmentation tasks. In the semantic segmentation experiments, both U-Net and SegFormer achieved their optimal performance at 200 epochs. Based on additional exploratory experiments, further increasing the number of training epochs resulted in only marginal performance gains or even performance degradation. This suggests that, for bamboo forest delineation using high-resolution UAV imagery, a moderate training duration is sufficient to capture the dominant spatial and contextual features.
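The plateau behavior described above can be operationalized with a simple early-stopping check. The following sketch is illustrative only; the patience window, improvement threshold, and metric values are hypothetical and do not reproduce the study's actual training loop.

```python
# Minimal early-stopping sketch for a validation metric (e.g., mIoU).
# All values below are hypothetical, for illustration only.

def should_stop(history, patience=10, min_delta=1e-3):
    """Return True when the monitored validation metric has not improved
    by at least `min_delta` over the last `patience` epochs."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])   # best score before the window
    recent_best = max(history[-patience:])   # best score within the window
    return recent_best < best_before + min_delta

# Example: validation mIoU rises quickly, then plateaus.
miou = [0.60, 0.68, 0.72, 0.74, 0.745] + [0.746] * 12
print(should_stop(miou, patience=10))  # True: no meaningful gain in 10 epochs
```

A rule of this kind makes the "moderate training duration is sufficient" observation actionable: training halts once further epochs yield only marginal gains.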
In the instance segmentation experiments, Mask R-CNN reached stable performance within a relatively small number of epochs. This behavior can be attributed to its two-stage architecture and the use of a pretrained backbone, which reduce the learning difficulty and enable efficient refinement of region proposals. In contrast, YOLOv8-Seg exhibited greater sensitivity to training configurations, such as the number of epochs and batch size, requiring careful parameter adjustment to achieve optimal segmentation performance.
4.2. Trade-Offs Among Evaluation Metrics and Practical Implications
The results further highlight clear trade-offs among precision, recall, F1-score, and mAP across different training settings. For YOLOv8-Seg, configurations with larger batch sizes tended to favor higher precision and mAP, indicating stronger localization accuracy and fewer false positives, whereas smaller batch sizes generally resulted in higher recall and more balanced F1-scores. These findings suggest that no single parameter configuration uniformly optimizes all evaluation metrics.
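The trade-off can be made concrete with a toy computation. The detection counts below are hypothetical, chosen only to show how a conservative detector and a liberal detector can reach similar F1-scores through opposite precision/recall balances; they are not the study's actual results.

```python
# Precision, recall, and F1 from raw detection counts (toy numbers).

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Conservative detector (YOLOv8-Seg-like): few false positives, more misses.
p1, r1, f1_a = prf(tp=80, fp=5, fn=20)    # precision ≈ 0.94, recall = 0.80

# Liberal detector (Mask R-CNN-like): more detections, more false positives.
p2, r2, f1_b = prf(tp=95, fp=25, fn=5)    # precision ≈ 0.79, recall = 0.95

print(round(f1_a, 3), round(f1_b, 3))     # near-identical F1, opposite balance
```

Because F1 is the harmonic mean of precision and recall, two configurations with very different error profiles can appear equivalent under a single summary metric, which is why the application requirement, rather than one metric alone, should drive the choice.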
From a practical perspective, the choice of training configuration should be guided by application requirements. For tasks emphasizing accurate boundary delineation and reduced false alarms, such as detailed bamboo clump mapping, higher-precision settings may be preferable. Conversely, applications focused on comprehensive detection and inventory may benefit from configurations that favor higher recall.
4.3. Model Characteristics and Annotation-Related Effects
The qualitative results indicate that Mask R-CNN and YOLOv8-Seg exhibit complementary characteristics in the bamboo clump instance segmentation task. Mask R-CNN tends to detect a larger number of potential bamboo clump candidates, resulting in higher recall performance; however, this advantage is accompanied by an increased number of false positives, particularly in areas with complex vegetation composition where non-bamboo vegetation shares similar spectral and textural characteristics with bamboo clumps. In contrast, YOLOv8-Seg adopts a more conservative prediction strategy and produces cleaner instance masks with clearer boundaries, making it less prone to false alarms. Nevertheless, under dense canopy conditions or partial occlusion scenarios, small or incomplete bamboo clumps are more likely to be missed, leading to relatively lower recall but higher precision.
These qualitative observations are consistent with the quantitative evaluation metrics and further highlight the complementary strengths of the two instance segmentation approaches for bamboo clump detection in high-resolution UAV imagery. For example, when the application objective focuses on estimating the total number of bamboo clumps within a region and minimizing missed detections is a priority, models such as Mask R-CNN, which favor higher recall, are more suitable. Conversely, when the task requires precise delineation of individual bamboo clumps for subsequent analyses—such as estimating the number of bamboo culms within each clump—maintaining accurate instance boundaries that preserve culm-level geometry becomes critical. In such cases, YOLOv8-Seg is a more appropriate choice.
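The role of boundary quality in these comparisons can be illustrated with a minimal mask-IoU computation, the matching criterion underlying mAP-style instance evaluation. The masks and the 0.5 threshold below are illustrative, not drawn from the study's data.

```python
# IoU between a predicted instance mask and a ground-truth mask.
# Toy 10x10 boolean masks for illustration.
import numpy as np

def mask_iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True                        # ground-truth clump: 36 px

tight = np.zeros_like(gt)
tight[3:8, 3:8] = True                     # culm-tight prediction: 25 px
loose = np.zeros_like(gt)
loose[1:9, 1:9] = True                     # over-smooth prediction: 64 px

print(mask_iou(tight, gt))  # 25/36 ≈ 0.694 -> a match at IoU >= 0.5
print(mask_iou(loose, gt))  # 36/64 ≈ 0.562 -> also a match, but looser fit
```

Both predictions count as true positives at the common 0.5 threshold, yet only the tighter mask preserves culm-level geometry, which is why boundary fidelity matters for downstream culm counting even when detection-level metrics look similar.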
In addition, annotation uncertainty plays a non-negligible role in model evaluation. Bamboo clump annotation presents challenges that are fundamentally different from those associated with typical single-tree annotation tasks. A bamboo clump consists of multiple bamboo culms and can be regarded as a “multi-object” entity, whereas most conventional trees can be treated as individual objects. In dense bamboo forest imagery, clumps often grow in close proximity and exert mutual pressure during growth, resulting in irregular culm arrangements that do not form clear radial patterns. This substantially increases the difficulty of delineating bamboo clump boundaries during manual annotation. Without supporting field surveys, it is often difficult to determine whether certain groups of bamboo culms belong to the same clump, leading to the conservative annotation strategy adopted in this study.
Under such circumstances, some detections produced by Mask R-CNN that are categorized as false positives in the quantitative evaluation may in fact correspond to real bamboo clumps that were omitted during the manual annotation process. Therefore, model performance should not be assessed solely on the basis of quantitative metrics, as such an approach may underestimate the practical utility of certain models in real-world applications. Moreover, the conservative annotation strategy may indirectly lead to an insufficient number of training samples, which can further limit model learning and generalization performance. In such cases, future studies may consider adopting synthetic image generation or GAN-based data augmentation approaches to alleviate the scarcity of annotated training data [39]. By synthesizing representative bamboo clump samples with diverse spatial structures, occlusion conditions, and background compositions, such methods have the potential to enhance model robustness and reduce dependence on fully exhaustive manual annotations.
5. Conclusions
This study proposes a deep learning-based bamboo forest monitoring framework that integrates high-resolution UAV orthomosaic imagery for both bamboo forest semantic segmentation and bamboo clump instance segmentation. The framework consists of two independent components, each designed to address different monitoring tasks and operational requirements.
Experimental results demonstrate that both semantic segmentation and instance segmentation approaches can be effectively applied to bamboo feature extraction in complex forest environments. For bamboo forest semantic segmentation, both U-Net and SegFormer exhibit stable performance, with SegFormer achieving superior boundary delineation and better preservation of fine structural details, while U-Net produces more conservative segmentation results with relatively fewer false detections. For bamboo clump instance segmentation, Mask R-CNN and YOLOv8-Seg show complementary characteristics: Mask R-CNN detects a larger number of potential bamboo clumps and achieves higher recall, whereas YOLOv8-Seg provides more precise instance boundaries and attains higher precision.
Overall, the proposed framework validates the feasibility of integrating UAV imagery and deep learning techniques for bamboo forest mapping at the regional scale and bamboo clump detection at a finer scale. By treating bamboo forest extent identification and bamboo clump detection as parallel tasks, the framework offers flexible and scalable monitoring information to support diverse forest management and planning needs. More importantly, the segmentation results produced in this study can serve as a foundational step for downstream applications, including bamboo clump counting, structural analysis of individual clumps, and area-based assessments of bamboo forest resources. Important secondary information on the competition between bamboo and invasive naturalized species can be further derived based on a decompositional stand structure analysis approach [40]. Such information is critical for translating image-based segmentation outputs into actionable indicators for bamboo forest monitoring and management decision-making.
Future work will focus on developing bamboo-specific deep learning models and further integrating the semantic segmentation and instance segmentation components into a unified workflow, forming a two-stage architecture to enhance overall detection robustness. In addition, models for individual bamboo culm detection and quantitative estimation will be explored to provide more comprehensive bamboo forest monitoring capabilities and to support sustainable bamboo forest resource management.