1. Introduction
China stands as the world’s foremost producer and consumer of tea, with tea plantations spanning approximately 3.381 million hectares, about 62.16% of the global total. These plantations are predominantly situated in hilly terrain between 18° and 37° N latitude and 98° and 123° E longitude [1].
In 2023, China’s dry tea production reached 3.3395 million tons, with a total output value of 329.668 billion yuan, year-on-year increases of 4.98% and 3.65%, respectively [2]. Such growth is closely linked to the expansion of tea plantations, often at the expense of forestland and farmland converted to tea cultivation for short-term economic gains [3,4]. Although this expansion boosts production, it harms biodiversity and disrupts the ecological equilibrium, underscoring the ongoing conflict between agricultural expansion and environmental conservation [5,6]. Consequently, there is a compelling need for dynamic monitoring and sustainable management practices grounded in scientific methods.
Traditional monitoring techniques, predominantly based on field surveys, are costly, inefficient, and slow to update [7]. While UAV imagery improves accuracy at local scales, its high operational costs and limited coverage render it impractical for extensive areas [8,9]. Satellite remote sensing offers a superior alternative, providing extensive coverage, timely updates, and reduced costs [10]. Furthermore, integrating remote sensing with phenological observations enables tracking of the tea growth cycle, thereby supporting precision agriculture [11].
Extensive research has explored the correlation between crop types and various features, including spectral, temporal, texture, and phenological indices [12,13]. Machine learning techniques such as Support Vector Machine (SVM) [10,14], Random Forest (RF) [2,15,16], and Decision Tree (DT) [17] are commonly employed. These methods are straightforward to apply but depend heavily on expert knowledge for feature selection and integration. For instance, Dihkan et al. combined spectral and texture features with SVM to achieve an overall accuracy of 97.4% [18]. Similarly, Xiong et al. applied SVM with feature selection to tea plantations in Fujian Province, achieving a sampling-point accuracy of 94.65% [19]. However, such overall accuracy metrics are not always reliable, and the generalizability of sampling accuracy can be limited [20]. A significant limitation of machine learning in this context is its focus on pixel-level analysis, which overlooks spatial relationships and is susceptible to errors from spectral confusion, thereby diminishing the accuracy of feature extraction [21].
Deep learning has created new opportunities for tea plantation mapping [22,23,24]. Tang [25] first combined machine learning and deep learning, proposing an object-oriented CNN model that achieved 86% accuracy in Anxi County, Fujian. Wei [26] compared SVM, KNN, and ResNet in Yunnan tea plantations; the results demonstrated that deep learning significantly outperformed machine learning in accuracy, F1-score, and recall. Numerous studies have also incorporated multi-temporal imagery. Yao [27] developed an R-CNN model using multi-temporal Sentinel-2 data in Xinchang County, reaching an overall accuracy of 95.3% and underscoring the benefits of multi-temporal data. Multi-source data fusion has emerged as a further trend: Zhou et al. [23] integrated multi-temporal Sentinel-2 optical data, Sentinel-1 SAR, and a DEM in Anji County, Zhejiang, achieving an accuracy of 98.9%. However, tea plantations are often fragmented and irregular in shape, and low-resolution images frequently produce mixed pixels along edges, increasing error rates and diminishing reliability.
High-resolution imagery is instrumental in capturing the textural features of tea [25], while phenological data can mitigate spectral confusion [28,29]. Based on Gaofen-2 (GF-2) images, this study aims to improve the fine-grained segmentation of tea plantations in Hangzhou by integrating phenological dynamics into a dedicated deep-learning framework. We develop a phenology-aware segmentation model that leverages multi-temporal seasonal patterns to enhance the discrimination between tea and other vegetation types, thereby increasing mapping accuracy under complex landscape conditions. The proposed framework is further used to produce detailed distribution, structural, and density maps of tea plantations across Hangzhou, providing an important reference for regional plantation management and ecological assessment.
2. Materials and Methods
2.1. Study Area
This investigation centers on Hangzhou, Zhejiang Province, China (Figure 1). Positioned in southeastern China, within the southern wing of the Yangtze River Delta, Hangzhou spans 29°11′ N to 30°33′ N and 118°21′ E to 120°30′ E. Covering approximately 1.66 million hectares, it serves as Zhejiang’s hub for economic, cultural, and educational activity. Hangzhou is also a primary tea-producing region in China, often referred to as the “Tea Capital of China,” and home of the celebrated West Lake Longjing tea.
The topography of Hangzhou is varied, encompassing hills, mountains, and plains, with the terrain descending from southwest to northeast. The western, central, and southern regions are predominantly hilly, accounting for 65.6% of the total area, whereas plains are primarily situated in the northeast, comprising 26.4% of the land. Tea plantations are chiefly located in the hilly regions of the west and south. The climate is characterized by a subtropical monsoon, featuring four distinct seasons where rain and heat coincide. Average annual temperatures range between 16 °C and 17.5 °C, with annual rainfall between 1270 mm and 1450 mm, peaking during the plum rain season in June and the summer months. These geographical and climatic conditions collectively create favorable natural settings for tea cultivation.
2.2. Dataset
To obtain high-resolution spatial information, this study uses GF-2 imagery as the primary data source for detailed mapping; its high spatial resolution enables accurate identification of tea plantations and precise delineation of their boundaries. The study combines remote sensing data with field surveys. A total of 25 GF-2 scenes, providing 0.8 m panchromatic and 3.2 m multispectral resolution, were acquired. The original images were processed through dehazing enhancement, radiometric calibration, and atmospheric correction, resulting in a 0.8 m resolution dataset covering the entire Hangzhou region. These high-resolution data were used to identify plantations, delineate boundaries, and select validation samples. Additionally, Sentinel-2 multispectral data were used to capture the phenological variations that differentiate tea from other vegetation. On the Google Earth Engine (GEE) platform, all available Sentinel-2 images from 2023 covering Hangzhou were preprocessed: scenes with more than 30% cloud cover were excluded, remaining cloud-contaminated pixels were removed through bitwise masking on the QA60 band, and monthly cloud-free mosaics were generated by median compositing to ensure temporal consistency for phenological analysis. Leveraging these datasets, we developed the THSI, which captures the seasonal dynamics of tea cultivation and enhances its separability from other land covers.
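The QA60 masking and median-compositing steps can be sketched with numpy (a minimal illustration assuming the standard Sentinel-2 QA60 convention, in which bit 10 flags opaque clouds and bit 11 flags cirrus; the actual processing was run on GEE):

```python
import numpy as np

# Assumed QA60 bit positions: bit 10 = opaque cloud, bit 11 = cirrus.
CLOUD_BIT = 1 << 10
CIRRUS_BIT = 1 << 11

def clear_sky_mask(qa60: np.ndarray) -> np.ndarray:
    """Return True where a pixel is free of both opaque cloud and cirrus."""
    return (qa60 & (CLOUD_BIT | CIRRUS_BIT)) == 0

def monthly_median_composite(stack: np.ndarray, qa60_stack: np.ndarray) -> np.ndarray:
    """Median-composite a (scenes, H, W) reflectance stack for one month,
    ignoring cloud-flagged pixels so they do not contaminate the mosaic."""
    masked = np.where(clear_sky_mask(qa60_stack), stack, np.nan)
    return np.nanmedian(masked, axis=0)
```

The same bitwise test is what GEE's `updateMask` applies per pixel; median compositing then keeps the mosaic robust to residual outliers.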
Hangzhou features diverse landscapes, including the renowned Longjing tea hills, plains adjacent to urban areas, and mixed environments with built-up land. To ensure robustness and generalization, this diversity was considered during sampling. Annotations were made in ArcGIS 10.8 through visual interpretation and digitization, using GF-2 imagery, historical plantation maps, and field surveys as references. This process generated 45,486 image tiles of 512 × 512 pixels.
In an effort to balance the classes and enhance training efficiency, all 12,982 positive tiles containing tea plantations were retained. These samples encompass both extensive, continuous plantations and fragmented small plots. To prevent overrepresentation of negative features, a random selection strategy was employed, resulting in 6518 representative negative tiles. Visual inspection showed that these negative samples included forests, other vegetation types, built-up areas, bare land, and water bodies. The final dataset comprised 19,500 tiles, divided into training, validation, and test sets in an 8:1:1 ratio, respectively consisting of 15,600, 1950, and 1950 tiles. This distribution was designed to ensure effective training and reliable evaluation.
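The 8:1:1 division can be sketched as follows (an illustrative helper; the function name and fixed seed are assumptions, not the authors' code):

```python
import random

def split_tiles(tile_ids, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle tile IDs and split them into train/val/test lists by ratio."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)          # deterministic shuffle
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Applied to the 19,500 tiles, this yields the 15,600 / 1950 / 1950 partition reported above.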
2.3. Methods
2.3.1. Network Architecture
In this study, we introduced a segmentation network structured upon a multi-task learning framework, integrating a Swin Transformer encoder with a context fusion decoder. The proposed architecture comprises three concurrent pathways: the Swin Transformer encoder, a phenology context branch, and a multi-scale decoder equipped with a fusion module and dual output heads. This configuration aims to address two prevalent issues in tea plantation mapping: spectral confusion and blurred boundary delineation.
The encoder takes high-resolution multispectral imagery as input, preserving detailed spatial information throughout feature extraction. Built on the Swin Transformer, its shifted-window attention mechanism captures two key characteristics of tea plantations. First, it extracts the fine-grained textures created by their row-based planting pattern, which produces distinctive stripe-like canopy structures. Second, it models the large-scale spatial dependencies that reflect the typical distribution of tea plantations, which are commonly located in mountainous regions farther from urban areas. The encoder generates four feature maps at varying scales, and these high-resolution details are relayed to the decoder via skip connections.
Simultaneously, a lightweight CNN constitutes the phenology context branch, processing the phenology index map to extract semantic features across multiple scales. The primary objective of this branch is to distill temporal prior knowledge, rather than preserving spatial details, thereby preventing the direct amalgamation of low-resolution phenological data with high-resolution image features.
The decoder adopts a progressive up-sampling scheme and employs the Phenology-Guided Fusion Module (PGFM) rather than simple concatenation. This module combines the decoder and phenology context features to produce a spatial attention map that reweights the decoder features; the reweighted features are then merged with the high-resolution features from the skip connections. Finally, the network uses dual output heads, one for segmentation and one for edge prediction, establishing a robust multi-task framework. The edge prediction head compensates for the interference introduced when low-resolution phenological features are fused with high-resolution spatial data: by explicitly learning boundary-aware representations, it suppresses noise from coarse phenology cues and refines object contours for more accurate tea plantation delineation. The architecture of the network is depicted in Figure 2.
2.3.2. Fusion Module
The PGFM represents a pivotal innovation within this framework. It amalgamates three distinct types of features: the low-resolution semantic features from the decoder, the high-resolution spatial features from the encoder, and the prior knowledge derived from the phenology branch.
Contrary to traditional approaches that employ gated skip connections, the PGFM passes encoder features through unmodified. At each decoder stage, the fusion proceeds dynamically. First, decoder features are concatenated with phenology features. This combined set is processed by a small convolutional attention network, which generates a spatial attention map. Element-wise multiplication by this map reweights the decoder features. The weighted features are then concatenated with the high-resolution encoder features, and a convolutional block refines the result. This integration leverages phenological knowledge while preserving fine spatial detail, enabling precise segmentation in complex landscapes. The internal structure of the PGFM is illustrated in Figure 3.
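The fusion steps above can be sketched as a PyTorch module (channel widths, layer depths, and the normalization choice are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class PGFM(nn.Module):
    """Sketch of the Phenology-Guided Fusion Module described above."""

    def __init__(self, dec_ch, phen_ch, enc_ch, out_ch):
        super().__init__()
        # Small convolutional attention network: decoder + phenology features
        # -> single-channel spatial attention map in [0, 1].
        self.attn = nn.Sequential(
            nn.Conv2d(dec_ch + phen_ch, dec_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dec_ch, 1, 1),
            nn.Sigmoid(),
        )
        # Refinement block applied after concatenating the reweighted decoder
        # features with the unmodified high-resolution encoder skip features.
        self.refine = nn.Sequential(
            nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, dec_feat, phen_feat, enc_feat):
        attn_map = self.attn(torch.cat([dec_feat, phen_feat], dim=1))
        weighted = dec_feat * attn_map            # element-wise reweighting
        fused = torch.cat([weighted, enc_feat], dim=1)
        return self.refine(fused)
```

Note that the encoder skip features enter the module unweighted, consistent with the "unmitigated transfer" of encoder features described above; only the decoder stream is modulated by the phenology-derived attention.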
2.3.3. Phenology Index Construction
Tea plantations are characterized as evergreen crops that are densely planted and intensively managed. Their distinctive remote sensing signatures differ markedly from those of seasonal farmland and natural forests. To accurately capture these unique features, we developed a set of spectral indices.
Initially, we selected the two-band Enhanced Vegetation Index (EVI2) [30], which outperforms NDVI over dense vegetation canopies by avoiding signal saturation and thus more precisely reflects the high biomass characteristic of tea plantations. Next, we incorporated the Red-edge Chlorophyll Index (CIre) [31], which exploits the red-edge spectral band and is highly sensitive to chlorophyll content, effectively monitoring the physiological activity of tea shoots. Finally, we employed the Bare Soil Index (BSI) [32] to quantify the exposed soil between rows in tea plantations, distinguishing them from forests where the soil is typically fully covered. The calculations for these indices are specified in Equations (1)–(3).
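Assuming the commonly used formulations of these three indices (inputs are surface-reflectance values; Equations (1)–(3) give the authors' exact definitions), their computation can be sketched as:

```python
def evi2(nir, red):
    """Two-band Enhanced Vegetation Index (standard formulation)."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)

def cire(nir, red_edge):
    """Red-edge Chlorophyll Index: sensitive to canopy chlorophyll."""
    return nir / red_edge - 1.0

def bsi(swir, red, nir, blue):
    """Bare Soil Index: high where soil is exposed between tea rows."""
    return ((swir + red) - (nir + blue)) / ((swir + red) + (nir + blue))
```

For Sentinel-2, NIR, red, red-edge, blue, and SWIR would typically map to bands B8, B4, B5, B2, and B11, though the specific band choices here are an assumption.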
To synthesize these indices and emphasize the unique spectral response of tea during the spring harvest, we proposed the Vegetation Phenology Separation Index (VPSI), calculated as delineated in Equation (4).
The methodology for calculating VPSI is depicted in Figure 4.
To elucidate seasonal differences, we sampled five land-cover types across Hangzhou: tea plantations, forests, croplands, water bodies, and built-up areas. We then plotted the annual VPSI curves for these land covers. As illustrated in Figure 5, tea plantations exhibit a distinct seasonal pattern.
The VPSI curve for tea plantations decreases from the beginning of the year, reaching a nadir in May, and subsequently rises over the following months. This pattern coincides with intensive harvesting and pruning during April–May [33], when young shoots are removed, reducing vegetation greenness and chlorophyll content and increasing bare-soil exposure. These changes manifest as a pronounced “valley” in the VPSI curve.
To quantify this seasonal variation, we introduced the THSI, which measures the disparity between VPSI values during the growing and dormant seasons, as shown in Equation (5).
Here, VPSI_t represents the VPSI value for month t. This index aggregates approximately 10 months of time-series data, effectively distinguishing tea plantations by their characteristic harvest-recovery cycle. It also differentiates them from forests, which exhibit smoother phenological curves, and from seasonal farmland, which follows different temporal dynamics. The resultant THSI map serves as a critical input to the phenology attention module, providing essential spatial prior knowledge.
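One plausible realization of this growing-versus-dormant disparity can be sketched as follows (the month groupings and the aggregation by means are assumptions for illustration; Equation (5) gives the authors' exact definition):

```python
import numpy as np

def thsi(vpsi_by_month, growing=(3, 4, 5, 6, 7), dormant=(10, 11, 12, 1, 2)):
    """Illustrative THSI: mean growing-season VPSI minus mean dormant-season
    VPSI. `vpsi_by_month` maps month number (1-12) to a VPSI value; the month
    groupings shown are hypothetical."""
    grow = np.mean([vpsi_by_month[m] for m in growing])
    dorm = np.mean([vpsi_by_month[m] for m in dormant])
    return grow - dorm
```

For tea, the April–May harvest "valley" pulls the growing-season mean down relative to the flat dormant-season baseline, giving a magnitude of seasonal contrast that evergreen forests (flat curves) and seasonal crops (different timing) do not reproduce.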
2.3.4. Accuracy Evaluation
To assess the performance of the model, we employed multiple metrics, including Overall Accuracy (OA), Precision, Recall, F1-score, and Mean Intersection over Union (mIoU). The definitions of these metrics are provided in Equations (6)–(10).
OA quantifies the proportion of pixels accurately classified. Precision is the ratio of true positives to predicted positives, while Recall is the ratio of true positives to all actual positives. The F1-score is the harmonic mean of Precision and Recall. mIoU evaluates the degree of overlap between the prediction and the ground truth.
Within these equations, TP (True Positive) denotes correctly predicted tea pixels, FP (False Positive) signifies non-tea pixels incorrectly classified as tea, FN (False Negative) refers to tea pixels mistakenly identified as non-tea, and TN (True Negative) indicates non-tea pixels correctly classified. These metrics collectively provide a thorough assessment of the segmentation performance.
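From these four counts, the metrics can be computed as in the following sketch (treating mIoU as the average of the tea-class and background IoU, an assumption consistent with the binary setting of Equations (6)–(10)):

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Compute OA, Precision, Recall, F1, and binary mIoU from pixel counts."""
    oa = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_tea = tp / (tp + fp + fn)        # IoU of the tea class
    iou_bg = tn / (tn + fp + fn)         # IoU of the background class
    miou = (iou_tea + iou_bg) / 2
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "mIoU": miou}
```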
2.4. Experimental Setup
The experiments were conducted using the PyTorch 2.4.1 framework, with an NVIDIA GeForce RTX 3090 GPU used for both training and inference.
Regarding the loss function, two outputs were optimized independently. The primary segmentation head generated a binary mask distinguishing between tea and background. Given the class imbalance, we implemented Focal Loss, as delineated in Equation (11).
In this context, p_t represents the probability predicted by the model for the correct class, and γ is the focusing parameter. Focal Loss decreases the influence of easily classified samples, concentrating the model’s effort on more challenging ones. This adjustment enhances the segmentation of minority classes, such as tea plantations.
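The down-weighting behavior can be illustrated for a single pixel (γ = 2 is an assumed value for illustration; the paper's setting appears in Equation (11)):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one pixel: -(1 - p_t)**gamma * log(p_t), where p_t is
    the predicted probability of the correct class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

An easy pixel (p_t = 0.9) is damped by the (1 − p_t)² factor to a small fraction of its plain cross-entropy value, while a hard pixel (p_t = 0.1) retains most of its loss, so gradient mass shifts toward the hard, minority-class pixels.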
For boundary prediction, an auxiliary edge head was employed to produce boundary maps. This component was trained using the binary cross-entropy with logits loss (BCEWithLogitsLoss), as detailed in Equation (12).
Here, y denotes the ground-truth edge label, x represents the model output logits, and σ(·) is the sigmoid function. This mechanism provides explicit boundary supervision and mitigates the spatial blurring caused by low-resolution phenology data and patch-based Transformer features.
The total loss function combined both segmentation and edge losses, which were weighted according to Equation (13).
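The joint objective can be sketched in PyTorch as follows (the focal term recovered via exp(−BCE) and γ = 2.0 are assumptions; λ = 6.0 follows the sensitivity experiments reported in this section):

```python
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, seg_target, edge_logits, edge_target,
                   gamma=2.0, lam=6.0):
    """Sketch of the joint objective: focal segmentation loss plus
    lambda * BCE-with-logits edge loss (cf. Equations (11)-(13))."""
    # Focal loss on the binary segmentation head. Since per-pixel BCE equals
    # -log(p_t), exp(-bce) recovers p_t for the focal modulating factor.
    bce = F.binary_cross_entropy_with_logits(seg_logits, seg_target,
                                             reduction="none")
    p_t = torch.exp(-bce)
    seg_loss = ((1.0 - p_t) ** gamma * bce).mean()
    # BCE-with-logits on the auxiliary edge head.
    edge_loss = F.binary_cross_entropy_with_logits(edge_logits, edge_target)
    return seg_loss + lam * edge_loss
```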
To optimize the weight λ, which balances the segmentation and edge tasks, sensitivity experiments were performed. We tested multiple values, each replicated three times over 20 epochs, and recorded the average F1-score on the validation set. Optimal performance was achieved when λ = 6.0, as illustrated in Figure 6.
Through this multi-task joint optimization strategy, the model learns precise and continuous boundary representations while effectively segmenting regions. This approach significantly enhances both the quality and the accuracy of the final segmentation contours.
3. Results
3.1. Comparative Experiments
We compared PGSUNet with seven well-known semantic segmentation networks: DeepLabV3+ [34], HRNet [35], SegFormer [36], SegFormer-B2 [36], Swin-UNet [37], K-Net [38], and MaskFormer [39]. All models were trained under identical conditions using the same datasets, image resolutions, and hyperparameter settings to ensure fairness.
The quantitative results are presented in Table 1, ranked by F1-score. The best performance is highlighted in bold; the second-best is underlined. PGSUNet outperformed all other models across all metrics: its F1-score was 4.35% higher than that of the second-best model, and its precision increased by more than 6%. This improvement underscores the effectiveness of the phenology-guided fusion and edge supervision techniques.
The qualitative results are depicted in Figure 7. We selected three representative scenarios (forest areas, croplands, and wetlands) that are often confused with tea plantations. Comparing the classification results with the ground truth shows that even the best-performing networks exhibit limitations in these challenging scenes. The first case is tea plantations located in forest areas, where plantation boundaries are irregular and often fragmented by roads; here, DeepLabV3+ and MaskFormer show boundary adhesion and edge blurring. The second case is farmland, where ridge-planted vegetable seedlings exhibit spectral and textural characteristics similar to tea plantations; models such as Swin-UNet and HRNet tend to produce misclassifications. The third case is wetland, which contains abundant grasses and aquatic vegetation spectrally similar to tea; here, Swin-UNet and DeepLabV3+ also show obvious false detections.
In contrast, PGSUNet, through the use of the THSI index, captured the phenological rhythm and reduced spectral confusion. Moreover, it produced sharper boundaries due to edge supervision. These visual comparisons further validate the robustness of our proposed model.
3.2. Ablation Study
To validate the effectiveness of the phenology-guided fusion and edge supervision, we conducted ablation experiments. Four variants were tested: (1) Baseline (Swin-UNet with 4-band input), (2) 5-Band Concatenate (adding a phenology index as a fifth band directly), (3) Proposed model without edge supervision, (4) Full PGSUNet model.
The results, displayed in Table 2, show that the Baseline model performed poorly due to the absence of phenological information. The 5-Band Concatenate model improved on this, demonstrating the value of phenology indices; however, simple concatenation lost detail owing to the mismatch between the low-resolution phenology data and the high-resolution images. The proposed fusion strategy mitigated this limitation and enhanced accuracy, and adding edge supervision in the Full PGSUNet model yielded significantly better performance, with higher F1-scores and mIoU than the variant without edge supervision.
Both the comparative and ablation experiments confirm that our proposed method is highly effective in complex land-cover situations. By integrating temporal information from phenology with spatial details from high-resolution images, our method addresses the primary challenges faced in traditional remote sensing classification. It excels particularly when spectral features are similar and spatial textures are complex in agricultural scenes.
3.3. Large-Scale Extraction
We applied the trained PGSUNet to extract tea plantations across the entire Hangzhou area. The final extraction results are presented in Figure 8.
The distribution map indicates that tea plantations are concentrated in the West Lake area and extend to Yuhang and Fuyang. Smaller plantations were also detected in Jiande and Chun’an. The density map emphasizes the aggregation of plantations in the core West Lake production region, while those in outer counties are more dispersed.
According to official statistics, the total area of tea plantations in Hangzhou is approximately 37,156 hectares. Chun’an accounts for the largest area (12,943 ha), followed by Fuyang (4786 ha), and then Jiande, Yuhang, and Tonglu. Our extraction results align with these statistics, confirming both the concentration in core production areas and the presence of scattered plantations in peripheral counties. This alignment demonstrates the accuracy and reliability of our method for large-scale extraction.
When examining the plantation density map (bottom right of Figure 8), it is evident that the core production areas, led by the West Lake District, exhibit highly concentrated planting, whereas the outer regions are characterized by scattered, lower-density plantations. This spatial pattern partly reflects the historical formation and development of the traditional tea industry: the West Lake Longjing production area has long relied on favorable natural conditions and a strong cultural foundation to develop an intensive core production zone, while surrounding counties mainly engage in dispersed planting and serve as supplementary production areas.
5. Conclusions
This study addresses the significant challenge of extracting tea plantations, which often exhibit spectral similarity and irregular boundaries, by employing GF-2 images of Hangzhou. We proposed and validated a phenology-guided segmentation approach, yielding several key findings: (1) Tea plantations exhibit a distinct phenological dip during the spring harvest and pruning periods. (2) The newly proposed THSI index effectively diminishes the confusion between tea plantations and other land-cover types. (3) The PGSUNet model, which integrates phenology-guided fusion with edge supervision, enhances both edge delineation and overall segmentation accuracy. Relative to the second-best performing model, the PGSUNet improved the F1-score by 4.35% and increased precision by 6.09%. (4) The large-scale extraction results reveal the spatial distribution of tea plantations in Hangzhou, which correlates well with official statistics, thus confirming the potential of high-resolution remote sensing and PGSUNet in crop mapping. (5) Phenological variations associated with elevation lead to index variations, suggesting that future research should include DEM-based corrections to improve the model’s applicability in complex terrains.