Transformer-Based Semantic Segmentation of Japanese Knotweed in High-Resolution UAV Imagery Using Twins-SVT

Valicharla, Sruthi Keerthi; Karimzadeh, Roghaiyeh; Li, Xin; Park, Yong-Lak

doi:10.3390/info16090741

¹ Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA

² School of Natural Resources and the Environment, West Virginia University, Morgantown, WV 26506, USA

³ Department of Plant Protection, Faculty of Agriculture, University of Tabriz, Tabriz 5166616471, Iran

⁴ Department of Computer Science, University at Albany, Albany, NY 12222, USA

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Information 2025, 16(9), 741; https://doi.org/10.3390/info16090741

Submission received: 8 June 2025 / Revised: 18 July 2025 / Accepted: 25 August 2025 / Published: 28 August 2025

(This article belongs to the Special Issue Machine Learning and Artificial Intelligence with Applications)

Abstract

Japanese knotweed (Fallopia japonica) is a noxious invasive plant species that requires scalable and precise monitoring methods. Current visually based ground surveys are resource-intensive and inefficient for detecting Japanese knotweed in landscapes. This study presents a transformer-based semantic segmentation framework for the automated detection of Japanese knotweed patches using high-resolution RGB imagery acquired with unmanned aerial vehicles (UAVs). We used the Twins Spatially Separable Vision Transformer (Twins-SVT), which utilizes a hierarchical architecture with spatially separable self-attention to effectively model long-range dependencies and multiscale contextual features. The model was trained on 6945 annotated aerial images collected in three sites infested with Japanese knotweed in West Virginia, USA. The results of this study showed that the proposed framework achieved superior performance compared to other transformer-based baselines. The Twins-SVT model achieved a mean Intersection over Union (mIoU) of 94.94% and an Average Accuracy (AAcc) of 97.50%, outperforming SegFormer, Swin-T, and ViT. These findings highlight the model’s ability to accurately distinguish Japanese knotweed patches from surrounding vegetation. The method and protocol presented in this research provide a robust, scalable solution for mapping Japanese knotweed through aerial imagery and highlight the successful use of advanced vision transformers in ecological and geospatial information analysis.

Keywords: Japanese knotweed; invasive species; twins spatially separable vision transformer; unmanned aerial vehicle; drone

1. Introduction

Japanese knotweed (Fallopia japonica) is a clonal, herbaceous perennial plant belonging to the Polygonaceae family, recognized globally as one of the most aggressive and ecologically destructive invasive species in temperate regions [1]. Native to Japan, China, and Korea, it was first introduced to Europe in the 19th century through the horticultural trade for its ornamental appearance and potential in soil erosion control [2,3]. Since then, it has naturalized across large parts of the United Kingdom, continental Europe, and North America, where it proliferates rapidly and extensively in disturbed habitats, riparian zones, roadsides, and urban areas [4]. The invasiveness of Japanese knotweed is primarily attributed to its robust underground rhizome system, which enables it to regenerate from root fragments as small as 0.7 g [5]. These rhizomes can spread up to 7 m horizontally and 3 m vertically, allowing the plant to dominate entire landscapes by outcompeting native flora for light, space, and nutrients [5,6]. The dense stands it forms can lead to significant reductions in plant biodiversity, disruption of successional dynamics, and alteration of hydrological regimes. Furthermore, Japanese knotweed poses serious infrastructural challenges; its ability to exploit minute cracks and weak points in concrete and tarmac has been linked to structural damage in buildings, pavements, and flood defense systems, resulting in reduced land value and costly litigation [6].

Once Japanese knotweed is established, eradication or control is extremely difficult. Mechanical control strategies (e.g., excavation of roots and repeated cutting) frequently fail to eliminate belowground rhizomes [7]. Chemical control using herbicides is more effective in suppressing regrowth but requires persistent application over several growing seasons and can affect non-target plants. Biological control using natural enemy insects (e.g., Aphalara itadori) is available in the USA but has not yet demonstrated broad effectiveness across varied climatic regions [8]. Collectively, these control methods are labor-intensive, costly, and often ineffective when applied over large or inaccessible landscapes. Given these challenges, there is growing recognition that the successful management of Japanese knotweed requires scalable, efficient, and accurate approaches for both early detection and long-term monitoring. The traditional method for detecting and mapping Japanese knotweed infestations is the ground-based visual survey. Ground crews typically rely on systematic sampling or opportunistic observations; however, these methods often miss small or newly emerging knotweed patches, particularly in areas with dense vegetation.

Advances in remote sensing technologies have increasingly been used to overcome the spatial and temporal limitations of traditional invasive pest monitoring [9,10]. Unmanned Aerial Vehicles (UAVs), equipped with high-resolution RGB and multispectral sensors, offer a flexible platform for frequent, fine-scale data collection across diverse landscapes [11]. These aerial systems enable large-area surveillance with minimal human intervention and provide critical visual information for distinguishing invasive species from surrounding vegetation. Our previous study [12] demonstrated that morphological features of individual Japanese knotweed plants, such as alternate leaf arrangement and zigzag-shaped stems, could be identified in UAV-acquired RGB imagery during early and full vegetative stages. Detection probabilities reached 100% at flight heights up to 35 m above the canopy. Late-season imagery also enabled identification of flowers and seed clusters, with 100% detection at ≤20 m above the canopy. These findings indicate the feasibility of low-altitude UAV imagery for automated seasonal detection and support its integration into vision transformer-based monitoring pipelines for scalable ecological applications. However, detection of Japanese knotweed patches across large areas and varied environments has not yet been achieved.

From a computer vision perspective, detecting Japanese knotweed patches from aerial imagery constitutes a semantic segmentation problem where each pixel is classified into categories such as Japanese knotweed patches, non-Japanese knotweed patches, or background. Earlier segmentation methods relied on handcrafted features and rule-based algorithms, including edge detection, thresholding, and color histograms [13]. The advent of deep learning significantly advanced image segmentation capabilities, particularly through the introduction of Convolutional Neural Networks (CNNs), which automatically learn hierarchical spatial features and have become foundational in computer vision tasks [14]. Fully Convolutional Networks (FCNs) extended CNNs to enable end-to-end dense pixel-wise prediction for semantic segmentation [15], while U-Net further improved segmentation accuracy with its encoder–decoder architecture and skip connections, especially in biomedical imaging contexts [16]. DeepLab architectures introduced atrous convolutions and enhanced encoder–decoder frameworks to capture multiscale context [17]. Despite their success, CNNs inherently focus on local receptive fields and may have difficulty in modeling long-range spatial dependencies, which are crucial for interpreting high-resolution aerial images with complex contextual information [18]. To overcome these limitations, transformer-based models originally developed for natural language processing have been adapted for vision tasks. The transformer architecture employs self-attention mechanisms that capture global relationships within data [19]. Vision Transformers (ViTs) treat images as sequences of patches, enabling flexible and context-rich feature extraction beyond localized convolutional kernels [20].

This study was conducted to develop a computer vision tool for the detection of Japanese knotweed patches using UAV imagery. We adopted a semantic segmentation framework based on the Twins Transformer, which is well suited to delineating object boundaries and capturing spatial context in UAV imagery. By moving from patch-level classification to pixel-wise segmentation, this study sought to improve, and then test, the granularity and accuracy of automated detection of Japanese knotweed patches from UAV imagery.

2. Materials and Methods

2.1. Study Sites

The aerial imagery was obtained using UAVs at three sites in West Virginia, USA: Monongalia, Marshall, and Brooke counties (Figure 1). The site in Monongalia County is an organic farm with a history of at least 10 years of Japanese knotweed infestation. Mechanical and chemical control have been applied there over the past five years but have not succeeded in eradicating Japanese knotweed. The site in Marshall County is on the shoreline of the Ohio River, where Japanese knotweed patches were found along the river and at commercial boat docking facilities. The site in Brooke County is located along railroad trails, where Japanese knotweed invasion is commonly observed. Together, the three sites represent the diverse environmental conditions in which Japanese knotweed is commonly found in West Virginia.

2.2. UAV-Based Aerial Data Collection

High-resolution aerial imagery was acquired using rotary-wing UAVs: DJI Mavic 2 Enterprise Advanced and DJI Mavic 2 Pro (SZ DJI Technology Co., Shenzhen, China). The Mavic 2 Enterprise Advanced was equipped with a 48-megapixel RGB sensor and a 3.3-megapixel thermal camera, while the Mavic 2 Pro carried a 20-megapixel RGB camera. Both UAVs were capable of recording 4K videos, which facilitated detailed spatial data capture in the study sites. The UAVs were flown in each study site at a flight height of 50 m above the ground and recorded 4K videos to obtain aerial imagery data, ensuring that the key features of Japanese knotweed were visible in the imagery [12]. Eight aerial missions were flown, producing 10 video clips, each ranging from 5 to 21 min. Surveys were conducted on sunny days with clear skies, specifically scheduled between noon and 3 p.m. During these missions, the average temperature ranged from 20 to 26 °C, and the average wind speed remained below 10 km/h.

2.3. Dataset Construction and Preprocessing

From the 4K videos, every 100th frame was extracted, yielding a total of 463 RGB images at a resolution of 1920 × 1080 pixels. To address the computational constraints associated with processing full-resolution images while preserving sufficient spatial context, each image was subdivided into overlapping patches of size 640 × 480 pixels. This preprocessing step expanded the dataset to 6945 image patches, striking a balance between spatial detail and computational feasibility [21].
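The patch counts above imply 6945 / 463 = 15 patches per frame. The following minimal sketch shows one tiling that reproduces this count. The 5 × 3 grid and the resulting strides are assumptions chosen to match the arithmetic; the text does not state the actual overlap, and the function names are hypothetical.

```python
def tile_origins(length, patch, count):
    """Evenly spaced start offsets so that `count` patches of size `patch` span `length`."""
    if count == 1:
        return [0]
    stride = (length - patch) / (count - 1)
    return [round(i * stride) for i in range(count)]


def tile_boxes(width=1920, height=1080, patch_w=640, patch_h=480, cols=5, rows=3):
    """Return (left, top, right, bottom) boxes for an assumed cols x rows overlapping grid."""
    xs = tile_origins(width, patch_w, cols)   # horizontal stride of 320 px under these defaults
    ys = tile_origins(height, patch_h, rows)  # vertical stride of 300 px under these defaults
    return [(x, y, x + patch_w, y + patch_h) for y in ys for x in xs]


boxes = tile_boxes()
patches_per_frame = len(boxes)
total_patches = patches_per_frame * 463  # 463 extracted frames, as reported in the text
print(patches_per_frame, total_patches)
```

Under this assumed grid, neighbouring patches overlap by 320 pixels horizontally and 180 pixels vertically, and the last patch ends exactly at the frame border, so no padding is needed at the tiling step.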

Each image was manually annotated by systematically delineating knotweed patches. The resulting annotations were converted into binary segmentation masks, where pixels corresponding to knotweed were assigned a value of 1 and background pixels were assigned 0 (see Figure 2 for an example). To improve mask quality, morphological operations were applied to reduce noise and refine object boundaries [22]. This process yielded segmentation masks in a format compatible with established semantic segmentation datasets such as ADE20K [23].

The dataset was partitioned into training, validation, and test subsets in a 70:20:10 ratio, which is a standard approach in supervised machine learning to optimize model training, hyperparameter tuning, and unbiased evaluation [24]. This resulted in 4862 images for training, 1390 for validation, and 693 for testing. All UAV images were converted from BGR to RGB format and normalized using ImageNet mean [123.675, 116.28, 103.53] and standard deviation [58.395, 57.12, 57.375] statistics. Padding values of 0 for images and 255 for segmentation masks were applied to preserve spatial alignment during loss computation.
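As an illustration of the preprocessing just described, the sketch below reorders a BGR pixel to RGB, standardizes it with the stated ImageNet statistics, and pads a row with the stated fill values. The helper names are hypothetical, and treating label 255 as a value to be skipped by the loss is an assumption based on common practice rather than something the text states.

```python
IMAGENET_MEAN = (123.675, 116.28, 103.53)  # R, G, B order, as given in the text
IMAGENET_STD = (58.395, 57.12, 57.375)

IMAGE_PAD_VALUE = 0    # fill value for padded image pixels
MASK_PAD_VALUE = 255   # fill value for padded mask pixels (assumed to be ignored by the loss)


def normalize_pixel(bgr):
    """Reorder one BGR pixel to RGB and standardize each channel."""
    b, g, r = bgr
    rgb = (r, g, b)
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def pad_row(row, target_len, pad_value):
    """Right-pad a row of pixels or labels to a fixed length."""
    return list(row) + [pad_value] * (target_len - len(row))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel((103.53, 116.28, 123.675)))
# Mask rows are padded with 255 so padded positions can be told apart from classes 0 and 1.
print(pad_row([1, 0], 4, MASK_PAD_VALUE))
```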

2.4. Model Configuration and Optimization

The semantic segmentation pipeline was implemented within the MMSegmentation framework [25], employing an encoder–decoder architecture optimized for high-resolution UAV imagery. The backbone network utilized the Twins-SVT-Small architecture [26], paired with a UPerNet-based decode head [27], facilitating effective multi-scale feature aggregation and detailed segmentation.

2.4.1. Encoder: Twins-SVT-Small Transformer

The encoder is a hierarchical Vision Transformer designed to mitigate the computational burden associated with global self-attention. It employs spatially separable attention through two principal modules: Locally-grouped Self-Attention (LSA), which captures fine-grained local details within sub-windows, and Global Subsampled Attention (GSA), which efficiently models long-range contextual dependencies. The encoder consists of four stages with varying depths, embedding dimensions, and numbers of attention heads (Table 1).

Each encoder block consisted of Layer Normalization (LN), Multi-Head Self-Attention (MHSA), and a Feed Forward Network (FFN), with residual skip connections applied. The GSA module uses downsampled keys and values to reduce the attention complexity from O(N²) to approximately O(N√N), thereby enhancing scalability for high-resolution input data.
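To give a sense of scale for this reduction, the sketch below compares token-pair counts under a deliberately simplified cost model. The model is an assumption: local attention costs k·N for windows of k tokens, global attention over one summary per window costs N²/k, and channel dimensions and constants are ignored. The stride-4 token grid for a 640 × 480 patch is likewise assumed for illustration.

```python
import math


def full_attention_cost(n):
    """Token-pair count for global self-attention over n tokens."""
    return n * n


def separable_attention_cost(n, window_area):
    """Simplified cost of local attention plus subsampled global attention."""
    local_cost = window_area * n        # each token attends within its own window
    global_cost = n * n / window_area   # each token attends to one summary per window
    return local_cost + global_cost


n_tokens = (640 // 4) * (480 // 4)      # assumed stride-4 token grid: 160 x 120
best_window = math.sqrt(n_tokens)       # window area that minimizes the sum above

full = full_attention_cost(n_tokens)
separable = separable_attention_cost(n_tokens, best_window)
print(n_tokens, full, round(separable), round(full / separable, 1))
```

Under this model the separable cost at the optimal window area is 2N√N, which is where the O(N√N) scaling comes from.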

2.4.2. Decoder: UPerNet with Pyramid Pooling

The decoder module in the proposed segmentation pipeline was based on the Unified Perceptual Parsing Network (UPerNet) [27], designed to enable effective integration of multi-scale features for dense prediction tasks. This decoder is particularly suitable for high-resolution semantic segmentation in ecological data due to its robust multi-level feature fusion strategy that enhances both spatial detail and semantic understanding. UPerNet fuses hierarchical feature maps extracted from different stages of the encoder to recover fine spatial structures while leveraging deep contextual information. Specifically, it takes inputs from all four encoder stages (i.e., F₁, F₂, F₃, and F₄), where each F_i is a feature map with decreasing spatial resolution and increasing semantic richness. These feature maps were rescaled via bilinear upsampling to a common resolution and concatenated along the channel dimension to enable efficient spatial alignment and integration.

To further incorporate global context, a Pyramid Pooling Module (PPM) was applied to the output of the final encoder stage F₄. The PPM performs average pooling at multiple spatial scales {1 × 1, 2 × 2, 3 × 3, 6 × 6}, followed by 1 × 1 convolutions, and then upsamples all pooled features to the resolution of F₄. The fused output enhances the network’s ability to capture large-scale contextual dependencies crucial for distinguishing visually similar plant species from an aerial perspective. In addition to the main decoder head, an auxiliary segmentation head was attached to the third encoder stage F₃ (with 256 channels). This head consisted of a single convolutional layer followed by upsampling and provided intermediate supervision during training to improve gradient flow and facilitate feature learning in earlier layers.
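The pooling step of the PPM can be sketched in plain Python on a single-channel toy feature map. This is an illustration only: the 1 × 1 convolutions, upsampling, and concatenation are omitted, the 15 × 20 map size assumes a stride-32 final stage on a 480 × 640 input, and the floor/ceil bin boundaries are one common convention rather than a confirmed detail of the authors' implementation.

```python
import math


def adaptive_avg_pool2d(fmap, out_size):
    """Average-pool a 2D list into an out_size x out_size grid of bins."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(out_size):
        r0 = (i * h) // out_size
        r1 = math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            c0 = (j * w) // out_size
            c1 = math.ceil((j + 1) * w / out_size)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Toy stand-in for one channel of F4 (15 rows x 20 columns), filled with 0..299.
f4 = [[float(r * 20 + c) for c in range(20)] for r in range(15)]

# One pooled map per pyramid scale named in the text.
pyramid = {scale: adaptive_avg_pool2d(f4, scale) for scale in (1, 2, 3, 6)}
for scale, pooled in pyramid.items():
    print(scale, len(pooled), len(pooled[0]))
```

The 1 × 1 level reduces to the global mean of the map, while the 6 × 6 level retains coarse spatial layout; combining the levels is what gives the decoder both global and regional context.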

The segmentation network was trained using a multi-loss formulation. The main decoder output and the auxiliary output were both supervised using a standard cross-entropy loss over the pixel-wise class probabilities. The total training loss was computed as

L_total = L_main + λ · L_aux    (1)

where L_main and L_aux denote the cross-entropy losses for the main and auxiliary outputs, respectively, and λ is a weighting factor set to 0.4 in this study. The network outputs two classes: background and knotweed, encoded as 0 and 1, respectively. This decoder design effectively bridges low-level spatial features with high-level semantic context, enabling precise delineation of Japanese knotweed in high-resolution UAV imagery. The overall architecture of the proposed framework is illustrated in Figure 3.
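Equation (1) can be illustrated with a small pure-Python sketch. The per-pixel probabilities below are made-up toy values, the function names are hypothetical, and skipping pixels labeled 255 is an assumption that mirrors the mask padding value mentioned in the preprocessing section.

```python
import math

IGNORE_INDEX = 255  # assumed label for padded pixels that should not contribute
AUX_WEIGHT = 0.4    # lambda in Equation (1)


def cross_entropy(probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over pixels whose label is not ignored.

    probs: per-pixel [p_background, p_knotweed]; labels: per-pixel 0, 1, or 255.
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(main_probs, aux_probs, labels, aux_weight=AUX_WEIGHT):
    """L_total = L_main + lambda * L_aux."""
    return cross_entropy(main_probs, labels) + aux_weight * cross_entropy(aux_probs, labels)


labels = [1, 0, 255]                                  # knotweed, background, padded
main_probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]     # toy main-head outputs
aux_probs = [[0.3, 0.7], [0.6, 0.4], [0.5, 0.5]]      # toy auxiliary-head outputs

loss = total_loss(main_probs, aux_probs, labels)
print(round(loss, 4))
```

Because λ is below 1, the auxiliary head shapes the gradients reaching earlier stages without dominating the objective that drives the final prediction.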

2.4.3. Training Strategy and Environment

The model was initialized with pretrained weights from the Twins-SVT architecture trained on the ImageNet-1K dataset [28], enabling effective transfer learning and faster convergence on the downstream segmentation task. Optimization was performed using the AdamW optimizer [29], well-suited for transformer-based architectures due to its decoupled weight decay regularization. The initial learning rate was set to 0.0001, with a weight decay coefficient of 0.05 and momentum parameters (β₁ = 0.9 and β₂ = 0.999).

A polynomial learning rate decay schedule with a linear warm-up phase was adopted. The learning rate at iteration t was defined as follows:

η_t = η_0 (1 − t/T)^p    (2)

where η_0 is the initial learning rate, T is the total number of iterations, and p is the power factor, set to 1.0 in this study. A linear warm-up over the first 1500 iterations, starting from a factor of 1 × 10⁻⁶, was applied to prevent unstable updates during early training.
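A minimal sketch of this schedule follows. Two points are assumptions: the total iteration count T is not given in the text, so `TOTAL_ITERS` is a placeholder, and the warm-up is modeled as a multiplier on the polynomial rate that rises linearly from 1 × 10⁻⁶ to 1 over the first 1500 iterations, which may differ in detail from the framework's implementation.

```python
TOTAL_ITERS = 80_000  # placeholder for T; the actual value is not stated in the text


def poly_lr(t, total_iters, base_lr=1e-4, power=1.0):
    """Equation (2): polynomial decay of the learning rate."""
    return base_lr * (1 - t / total_iters) ** power


def lr_at(t, total_iters=TOTAL_ITERS, base_lr=1e-4, power=1.0,
          warmup_iters=1500, warmup_ratio=1e-6):
    """Polynomial decay with an assumed linear warm-up multiplier."""
    lr = poly_lr(t, total_iters, base_lr, power)
    if t < warmup_iters:
        # The multiplier grows linearly from warmup_ratio at t = 0 to 1 at t = warmup_iters.
        lr *= warmup_ratio + (1 - warmup_ratio) * t / warmup_iters
    return lr


for step in (0, 750, 1500, 40_000, 80_000):
    print(step, lr_at(step))
```

With p = 1.0 the decay after warm-up is simply linear, reaching zero at the final iteration.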

To promote generalization and reduce overfitting, several image augmentation techniques were applied during training, including random cropping, horizontal flipping, and photometric distortions (i.e., adjustments to brightness, contrast, and saturation). These augmentations were essential for training a robust model capable of handling intra-class variation, particularly under varying lighting and vegetation conditions. The augmentations were applied dynamically during training rather than by generating a fixed number of augmented samples beforehand. Therefore, no explicit augmentation ratio was calculated. Due to the high memory demands of transformer-based networks, the batch size was limited to 2. All training was conducted using the MMSegmentation framework (v2.25.1) [25], which provides a modular implementation of vision transformer models, multi-stage decoders, and optimized data pipelines for semantic segmentation.

2.4.4. Implementation Details

Model training and evaluation were performed on a high-performance computing workstation equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM), providing ample computational resources to handle high-resolution UAV imagery. The software environment included Python 3.10.4, PyTorch 1.12.1, CUDA 11.6, and cuDNN 8.5, supported by GCC 9.3 and running on Ubuntu 20.04.1 LTS. All dependencies were managed using conda and pip to ensure reproducibility across different systems.

The inference speed of the Twins-SVT + UPerNet model was approximately 30–40 milliseconds per 640 × 480 image (~25–33 frames per second), measured on an NVIDIA RTX 3090 GPU, with an estimated computational cost of ~88 GFLOPs.

To facilitate scalability and experimental reproducibility, all training scripts, configuration files, and evaluation protocols were implemented within the OpenMMLab MMSegmentation framework. This setup enabled seamless integration of the Twins-SVT encoder, UPerNet decoder, and associated data preprocessing modules.

2.4.5. Evaluation Metrics

To assess the segmentation performance of the proposed framework, four widely adopted metrics were utilized: Intersection over Union (IoU), mean Intersection over Union (mIoU), pixel-level Accuracy, and Average Accuracy (AAcc). IoU quantifies the overlap between predicted and ground-truth regions for each class. It is defined as the ratio of the intersection area to the union area of the predicted and true masks [30]. mIoU represents the arithmetic mean of IoU values across all classes [31]. In the binary setting of this study, it corresponds to the average of the IoUs for Japanese knotweed and background, and it provides a balanced measure of segmentation performance regardless of class imbalance. Pixel-level Accuracy measures the proportion of correctly classified pixels across the entire image [15]. AAcc is defined as the mean of per-class accuracies, where each per-class accuracy is the ratio of correctly predicted pixels to the total number of ground-truth pixels in that class [32]. This metric mitigates the effects of class imbalance by weighting all classes equally. Together, these metrics are well suited to binary semantic segmentation tasks, such as distinguishing Japanese knotweed patches from background vegetation in UAV imagery.
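The four metrics can be computed from flattened label arrays as in the sketch below, which follows the prose definitions given here (IoU as TP / (TP + FP + FN), per-class accuracy as TP / (TP + FN)). The function names and the ten-pixel toy example are illustrative, and the sketch assumes every class occurs at least once in the ground truth.

```python
def confusion_counts(pred, truth, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, truth))
    fp = sum(p == cls and t != cls for p, t in zip(pred, truth))
    fn = sum(p != cls and t == cls for p, t in zip(pred, truth))
    return tp, fp, fn


def segmentation_metrics(pred, truth, classes=(0, 1)):
    """Per-class IoU, mIoU, pixel accuracy, and average (per-class) accuracy."""
    ious, class_accs = [], []
    for cls in classes:
        tp, fp, fn = confusion_counts(pred, truth, cls)
        ious.append(tp / (tp + fp + fn))   # intersection over union
        class_accs.append(tp / (tp + fn))  # correct pixels / ground-truth pixels of the class
    pixel_acc = sum(p == t for p, t in zip(pred, truth)) / len(truth)
    return {
        "IoU": ious,
        "mIoU": sum(ious) / len(ious),
        "Acc": pixel_acc,
        "AAcc": sum(class_accs) / len(class_accs),
    }


truth = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 0 = background, 1 = knotweed
pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # one false positive and one false negative
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

In this toy case the pixel accuracy is 0.8 while the AAcc is slightly lower, because the smaller knotweed class carries the same weight as the background class in the average.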

3. Results

3.1. Quantitative Evaluation

Among the evaluated models, the Twins-SVT architecture consistently outperformed its counterparts in both classes, achieving the highest IoU and accuracy values for the knotweed class (Table 2). This performance improvement can be attributed to its hierarchical backbone design, which combines Locally grouped Self-Attention (LSA) and Global Subsampled Attention (GSA) mechanisms [26]. In contrast, SegFormer exhibited competitive results with a lightweight design but showed slight degradation in boundary localization. Swin-T achieved reasonable performance due to its shifted window attention, while ViT struggled to generalize to dense pixel-wise predictions, likely due to its non-hierarchical design optimized for classification.

To further evaluate the model’s robustness, overall segmentation performance was summarized using mIoU and AAcc, which provide a holistic view by averaging class-level results (Table 3). These aggregate metrics reaffirmed the superior performance of the proposed Twins-SVT architecture. Specifically, the model achieved the highest mIoU of 94.94% and AAcc of 97.50%, indicating its effectiveness in balancing precision and recall across class boundaries.

Due to class imbalance, mean Intersection over Union (mIoU) and Average Accuracy (AAcc) were selected as primary evaluation metrics. These metrics provide a balanced assessment of segmentation performance across classes without additional per-pixel statistical outputs. Although the observed differences in evaluation metrics between the Twins-SVT model and competing approaches are modest, the consistent performance improvements across multiple metrics indicate the efficacy of the proposed method.

3.2. Training Dynamics and Convergence Analysis

The training process was monitored through auxiliary and decoder branch metrics, specifically auxiliary accuracy (Aux Acc) and decoder accuracy (Decode Acc), alongside their respective cross-entropy losses (Aux Loss CE and Decode Loss CE). The auxiliary classifier was employed during training to provide intermediate supervision, which facilitated better gradient flow and stabilized optimization. This design choice helped to accelerate convergence and improved overall segmentation performance by guiding early layers more effectively. The decoder branch represented the primary output used for final predictions. The plots for training curves (Figure 4) showed consistent improvement in segmentation accuracy and corresponding reduction in loss, indicating stable convergence. Notably, the auxiliary branch metrics closely tracked those of the decoder branch, affirming the auxiliary supervision’s efficacy without impeding the main learning objective.

3.3. Qualitative Evaluation and Visualization

To assess model performance qualitatively, original RGB images were paired with the corresponding segmentation masks predicted by the proposed Twins-SVT + UPerNet architecture (Figure 5). This assessment showed that the model accurately detected Japanese knotweed.

These qualitative visualizations indicate the model’s robustness in accurately segmenting Japanese knotweed patches amid dense and heterogeneous vegetation. The integration of hierarchical attention mechanisms within Twins-SVT and multi-scale feature fusion via UPerNet facilitated clear boundary adherence and resilience to occlusions and texture variability.

Collectively, all the quantitative metrics, training dynamics, and qualitative results confirmed that the proposed Twins-SVT + UPerNet framework effectively addressed the challenges of high-resolution UAV-based segmentation tasks for Japanese knotweed detection, outperforming contemporary transformer-based methods in both accuracy and reliability.

4. Discussion

This study implemented a semantic segmentation framework designed to detect Japanese knotweed patches in UAV-acquired aerial imagery. Our approach demonstrated high spatial precision and contextual awareness in detecting Japanese knotweed patches. The segmentation network was constructed using the Twins-SVT transformer backbone, integrated into the MMSegmentation platform [25], an open-source toolkit developed by OpenMMLab. The architecture followed an encoder–decoder paradigm, where a multi-scale hierarchical transformer backbone extracted rich spatial features, and a UPerNet-inspired decoder generated detailed pixel-wise segmentation maps. This design effectively addressed challenges inherent to segmenting high-resolution, heterogeneous ecological datasets (i.e., Japanese knotweed patches in various environments). The superior performance of the proposed approach can be attributed to the architectural strengths of Twins-SVT, which combines locally grouped self-attention and global subsampled attention mechanisms [26], enabling effective modeling of both fine-scale texture and long-range contextual dependencies. When integrated with the UPerNet decoder [27], this configuration facilitated efficient multi-scale feature fusion, further improving semantic consistency in segmentation outputs. These results reinforce recent findings on the efficacy of hierarchical vision transformers in dense prediction tasks and demonstrate their applicability in ecological monitoring and invasive species detection. More broadly, our findings underscore the growing relevance of transformer-based architectures for remote sensing applications, particularly their ability to model spatial hierarchies and maintain contextual coherence in UAV-captured datasets characterized by variability in scale, texture, and illumination.

Our prior work [12] demonstrated the effectiveness of transformer-based architectures such as the Swin Transformer for multi-class knotweed classification at the individual plant level, achieving high precision and recall in controlled, low-altitude UAV surveys. In addition, Swin Transformer outperformed both CNNs and standard transformer variants. However, this approach required UAV flight heights below 25 m above the canopy to detect the key features of Japanese knotweed, which limited its scalability for landscape-scale monitoring. In contrast, the current Twins-SVT framework operates effectively at higher altitudes of 50 m while maintaining a high mIoU of 94.94% for patch detection. This enables efficient surveys of larger areas, making it far more practical for invasive pest managers. The shift from individual plant detection to patch-level segmentation is particularly well-suited for Japanese knotweed, which typically propagates via rhizomes to form contiguous patches. Patch-level analysis not only aligns more closely with the spatial patterns of knotweed infestations but also reduces computational demands by more than a third compared to pixel-wise Swin Transformer implementations. This methodological advancement is significant for the detection and management of invasive plants because eradication and containment programs are generally organized around the identification and treatment of patches rather than individual plants.

The improved detection capabilities offered by the Twins-SVT framework have substantial implications for site-specific management of Japanese knotweed. By enabling precise localization of knotweed patches, the framework supports spatially targeted interventions that are both effective and resource-efficient. For instance, UAV-mounted spray systems can deliver herbicides directly and precisely to detected patches, reducing chemical usage compared to blanket applications while maintaining high efficacy [37]. Similarly, the model’s outputs facilitate the targeted release of biological control agents [35,36,38] such as Aphalara itadori, as patch area and location data allow for optimized aerial deployment strategies. Early detection of small, satellite populations enables rapid intervention before these patches can expand, which is critical given the rapid lateral spread rate of Japanese knotweed. UAV-based detection also offers unique advantages over traditional ground surveys, including the ability to identify nascent patches during early growth stages when manual detection rates are low and to monitor infestations in challenging terrains such as riparian zones and steep slopes that are often inaccessible on foot. Furthermore, UAVs enable high-frequency monitoring compared to the much longer cycles typical of ground-based surveys [39]. Collectively, these advances support adaptive management frameworks in which segmentation outputs directly inform treatment schedules, ultimately reducing reinfestation rates and improving the long-term success of invasive species management.

This is the first study to report on the use of the Twins-SVT transformer for invasive species detection. The results clearly demonstrate the feasibility and effectiveness of Twins-SVT-based transformer architectures for fine-grained vegetation mapping from aerial imagery, providing a strong foundation for scalable, high-precision ecological monitoring systems. Specifically, the Twins-SVT transformer achieved more precise segmentation for detecting Japanese knotweed patches, especially in dense vegetation areas where accurate boundary delineation is critical. Despite these promising results, several challenges remain. First, future research should focus on field-level validation to assess the model’s generalizability across different ecological regions, vegetation types, and UAV platforms. Incorporating temporal data from multi-season UAV surveys may enhance the model’s ability to monitor phenological variations and support long-term invasive species management strategies. Our aerial surveys were conducted during the summer, and, thus, the model’s applicability might be limited to the detection of vegetative and flowering stages of Japanese knotweed. Second, manual annotation required considerable effort due to the difficulty of visually differentiating knotweed from surrounding vegetation, especially under dense or overlapping canopy conditions. Variability in vegetation height and shadowing further complicated accurate boundary delineation [25]. These constraints highlight the necessity for robust segmentation methods capable of managing complex and noisy data [40]. Lastly, there is a need to explore computational efficiency improvements, including model compression, efficient attention approximations, and deployment-oriented optimizations. Investigating domain adaptation and transfer learning techniques will also be essential to ensure robustness under varying environmental conditions and across diverse landscapes. 
In addition, it is necessary to test the robustness of deep learning models in fragmented habitats (e.g., urban edges, mixed agriculture, and other land-use types). Such testing will allow managers to prioritize management zones based on the size and distribution of Japanese knotweed patches.

Author Contributions

Conceptualization, X.L. and Y.-L.P.; methodology, S.K.V., R.K., X.L., and Y.-L.P.; software, S.K.V. and R.K.; validation, R.K. and S.K.V.; formal analysis, S.K.V. and R.K.; investigation, R.K., S.K.V., and Y.-L.P.; resources, Y.-L.P. and X.L.; data curation, R.K. and S.K.V.; writing—original draft preparation, S.K.V.; writing—review and editing, all authors; visualization, S.K.V.; supervision, Y.-L.P. and X.L.; project administration, Y.-L.P. and X.L.; funding acquisition, Y.-L.P. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the USDA Forest Service Biological Control for Invasive Forest Pests Program (23-DG-11094200-230), the USDA NIFA AFRI Foundational and Applied Science Program (2021-67014-33757), and the West Virginia University Agriculture and Forestry Experiment Station (WVA00785).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available upon request.

Acknowledgments

We thank Kushal Naharki at West Virginia University for his help with drone operations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bailey, J.P.; Conolly, C.A. Prize-winners to pariahs—A history of Japanese knotweed s.l. (Polygonaceae) in the British Isles. Watsonia 2001, 23, 93–110.
2. Child, L.E.; Wade, M.R. The Japanese Knotweed Manual: The Management and Control of an Invasive Alien Weed; Packard Publishing Ltd.: Chichester, UK, 2003.
3. Conolly, C.A. The distribution and history in the British Isles of some alien species of Polygonum and Reynoutria. Watsonia 1977, 11, 291–311.
4. Beerling, D.J. Biological flora of the British Isles: Fallopia japonica (Houtt.) Ronse Decraene. J. Ecol. 1991, 79, 1249–1272.
5. Soll, J. Controlling Knotweed (Polygonum cuspidatum) in the Pacific Northwest; The Nature Conservancy: Arlington, VA, USA, 2004.
6. Hocking, S.; Toop, T.; Jones, D.; Graham, I.; Eastwood, D. Assessing the relative impacts and economic costs of Japanese knotweed management methods. Sci. Rep. 2023, 13, 3872.
7. Powles, S.B.; Yu, Q. Control of Conyza spp. with glyphosate: A review of the situation in Europe. Weed Res. 2015, 55, 1–16.
8. Shaw, R.H.; Bryner, S.; Tanner, R. The life history and host range of the Japanese knotweed psyllid, Aphalara itadori. Biol. Control 2011, 58, 328–335.
9. Valicharla, S.K.; Li, X.; Greenleaf, J.; Turcotte, R.; Hayes, C.; Park, Y.-L. Precision detection and assessment of ash death and decline caused by the emerald ash borer using drones and deep learning. Plants 2023, 12, 798.
10. Park, Y.-L.; Naharki, K.; Karimzadeh, R.; Seo, B.Y.; Lee, G.S. Rapid assessment of insect pest outbreak using drones: A case study with Spodoptera exigua (Hübner) (Lepidoptera: Noctuidae) in soybean fields. Insects 2023, 14, 555.
11. Anderson, K.; Gaston, K. Lightweight unmanned aerial vehicles will revolutionize spatial ecology. Front. Ecol. Environ. 2016, 11, 138–146.
12. Valicharla, S.K.; Karimzadeh, R.; Naharki, K.; Li, X.; Park, Y.-L. Detection and multi-class classification of invasive knotweeds with drones and deep learning models. Drones 2024, 8, 293.
13. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905.
14. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
16. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
17. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 801–818.
18. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 19–23 June 2018; pp. 7794–7803.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
21. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Places: A 10 million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464.
22. Soille, P. Morphological Image Analysis: Principles and Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2013.
23. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through ADE20K Dataset. Int. J. Comput. Vis. 2017, 127, 302–321.
24. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
25. Pyšek, P.; Richardson, D.M. Invasive species, environmental change and management, and health. Annu. Rev. Environ. Resour. 2010, 35, 25–55.
26. Chu, X.; Tian, Z.; Xie, W.; Li, Y.; Jiang, Y.; Liu, Z.; Hu, H. Twins: Revisiting spatial attention design in vision transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021.
27. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 418–434.
28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
29. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
30. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC). Int. J. Comput. Vis. 2010, 88, 303–338.
31. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
32. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
33. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv 2021, arXiv:2105.15203.
34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
35. Park, Y.-L.; Gururajan, S.; Thistle, H.; Chandran, R.; Reardon, R. Aerial release of Rhinoncomimus latipes (Coleoptera: Curculionidae) to control Persicaria perfoliata (Polygonaceae) using an unmanned aerial system. Pest Manag. Sci. 2018, 74, 141–148.
36. Naharki, K.; Hayes, C.; Park, Y.-L. Aerial systems for releasing natural enemy insects of purple loosestrife using drones. Drones 2024, 8, 635.
37. Tsouros, D.C.; Bibi, S.; Sarigiannidis, P.G. A review on UAV-based applications for precision agriculture. Information 2019, 10, 349.
38. Kim, J.; Huebner, C.D.; Reardon, R.; Park, Y.-L. Spatially targeted biological control of mile-a-minute weed using Rhinoncomimus latipes (Coleoptera: Curculionidae) and an unmanned aircraft system. J. Econ. Entomol. 2021, 114, 1889–1895.
39. Karimzadeh, R.; Naharki, K.; Park, Y.-L. Detection of bean damage caused by Epilachna varivestis (Coleoptera: Coccinellidae) using drones, sensors, and image analysis. J. Econ. Entomol. 2024, 117, 2143–2150.
40. Dalponte, M.; Coomes, D.A. Deep learning for remote sensing data: A review. Remote Sens. Environ. 2018, 204, 207–223.

Figure 1. Locations of study sites: Morgantown (Monongalia County), Wellsburg (Brooke County), and Moundsville (Marshall County) in West Virginia, USA.

Figure 2. Example of a manually annotated UAV image (a) and its corresponding binary segmentation mask (b). The annotated image shows an expert-labeled Japanese knotweed patch, while the binary mask represents the ground truth used for model training.

Figure 3. Overall architecture highlighting the decoder. UPerNet integrates features from all encoder stages and applies a Pyramid Pooling Module (PPM) for multi-scale context aggregation. An auxiliary head from the third stage supports optimization through intermediate supervision.

Figure 4. Training dynamics illustrating accuracy and cross-entropy loss for both auxiliary (a,b) and decoder branches (c,d). Auxiliary supervision promotes stable convergence and effective gradient flow throughout training.

Figure 5. Qualitative segmentation examples from the UAV knotweed dataset. The model demonstrated precise delineation of Japanese knotweed patches under varying vegetation complexity and lighting conditions.
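The decoder ideas named in the captions of Figures 3 and 4 (multi-scale pyramid pooling and an auxiliary loss) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the configuration used in this study: the bin sizes `(1, 2, 3, 6)` and the auxiliary weight `0.4` are common defaults assumed here, nearest-neighbour upsampling stands in for the interpolation a real decoder would use, and the convolutions that normally follow each pooled branch are omitted. All function names are hypothetical.

```python
import math


def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2-D map (list of rows) down to bins x bins cells."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        r0 = (i * h) // bins                # floor(i * h / bins)
        r1 = -((-(i + 1) * h) // bins)      # ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = -((-(j + 1) * w) // bins)
            cell = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cell) / len(cell))
        pooled.append(row)
    return pooled


def upsample_nearest(fmap, h, w):
    """Resize a square pooled map back to h x w by nearest-neighbour lookup."""
    b = len(fmap)
    return [[fmap[(r * b) // h][(c * b) // w] for c in range(w)]
            for r in range(h)]


def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the input map plus one context map per pooling scale."""
    h, w = len(fmap), len(fmap[0])
    branches = [fmap]
    for b in bin_sizes:  # assumed bin sizes, see lead-in
        branches.append(upsample_nearest(adaptive_avg_pool(fmap, b), h, w))
    return branches


def cross_entropy(probs, target):
    """Cross-entropy of one pixel given its class probabilities."""
    return -math.log(probs[target])


def total_loss(decoder_probs, aux_probs, target, aux_weight=0.4):
    """Decoder loss plus down-weighted auxiliary loss (assumed weight)."""
    return (cross_entropy(decoder_probs, target)
            + aux_weight * cross_entropy(aux_probs, target))


# Toy 6 x 6 feature map with values 0..35.
feature_map = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
branches = pyramid_pooling(feature_map)
loss = total_loss((0.1, 0.9), (0.5, 0.5), target=1)
print(len(branches), branches[1][0][0], round(loss, 4))
```

In a full decoder the branches would be concatenated along the channel axis and fused by a convolution; the sketch stops at the pooled context maps to keep the multi-scale idea visible.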

Table 1. Encoder configuration across the four hierarchical stages of the vision transformer model used in this study. This setup is inspired by the Swin Transformer architecture [34].

Parameter	Stage 1	Stage 2	Stage 3	Stage 4
Embedding dimension	64	128	256	512
Attention heads	2	4	8	16
Depth	2	2	10	4
MLP expansion ratio	4	4	4	4
Window size	7 × 7	7 × 7	7 × 7	7 × 7
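The configuration in Table 1 can be written down as a small data structure, which also makes a few derived quantities easy to check. The sketch below is illustrative only: the class name `StageConfig` and its properties are hypothetical, and the derived values (per-head dimension, MLP hidden width, tokens per window) are computed from the table rather than quoted from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    embed_dim: int
    num_heads: int
    depth: int
    mlp_ratio: int
    window_size: int

    @property
    def head_dim(self) -> int:
        # Channels handled by each attention head.
        return self.embed_dim // self.num_heads

    @property
    def mlp_hidden_dim(self) -> int:
        # Width of the hidden layer in each transformer block's MLP.
        return self.embed_dim * self.mlp_ratio

    @property
    def tokens_per_window(self) -> int:
        # Number of tokens attended to inside one local window.
        return self.window_size ** 2


# Values copied from Table 1 (stages 1 to 4).
ENCODER_STAGES = (
    StageConfig(embed_dim=64, num_heads=2, depth=2, mlp_ratio=4, window_size=7),
    StageConfig(embed_dim=128, num_heads=4, depth=2, mlp_ratio=4, window_size=7),
    StageConfig(embed_dim=256, num_heads=8, depth=10, mlp_ratio=4, window_size=7),
    StageConfig(embed_dim=512, num_heads=16, depth=4, mlp_ratio=4, window_size=7),
)

total_depth = sum(stage.depth for stage in ENCODER_STAGES)
for index, stage in enumerate(ENCODER_STAGES, start=1):
    print(index, stage.head_dim, stage.mlp_hidden_dim, stage.tokens_per_window)
print("total transformer blocks:", total_depth)
```

With these values every stage uses 32 channels per head, so the number of heads doubles in step with the embedding dimension, and most of the 18 blocks sit in the third stage.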

Table 2. Class-wise IoU and pixel-level accuracy (%) of transformer-based segmentation models on the knotweed test set. All models were pre-trained on ImageNet-1K. The proposed Twins-SVT model demonstrates superior performance across both classes.

Model	Class	IoU (%)	Accuracy (%)	Pretrain
Twins-SVT [26]	Background	98.18	99.02	ImageNet-1K
	Knotweed	91.70	95.98	ImageNet-1K
SegFormer [33]	Background	97.85	98.86	ImageNet-1K
	Knotweed	90.23	95.10	ImageNet-1K
Swin-T [34]	Background	95.23	98.79	ImageNet-1K
	Knotweed	90.01	95.57	ImageNet-1K
ViT [20]	Background	97.63	98.23	ImageNet-1K
	Knotweed	89.65	88.92	ImageNet-1K
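For readers unfamiliar with the class-wise metrics in Table 2, the following sketch computes them for a toy prediction. It assumes the standard definitions IoU = TP / (TP + FP + FN) and per-class accuracy = TP / (TP + FN); whether the latter matches the exact accuracy definition used in this study is an assumption. The function name `class_metrics` and the toy masks are hypothetical.

```python
def class_metrics(pred, truth, cls):
    """Return (IoU %, accuracy %) for one class from two label masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == cls and t == cls:
                tp += 1          # correctly predicted pixel of this class
            elif p == cls:
                fp += 1          # predicted as this class, but is not
            elif t == cls:
                fn += 1          # belongs to this class, but was missed
    union = tp + fp + fn
    actual = tp + fn
    iou = 100.0 * tp / union if union else 0.0
    acc = 100.0 * tp / actual if actual else 0.0
    return iou, acc


# Toy 2 x 4 masks: 0 = background, 1 = knotweed.
truth = [[0, 0, 1, 1],
         [0, 1, 1, 1]]
pred = [[0, 1, 0, 1],
        [0, 1, 1, 1]]

knotweed_iou, knotweed_acc = class_metrics(pred, truth, cls=1)
background_iou, background_acc = class_metrics(pred, truth, cls=0)
print(round(knotweed_iou, 2), round(knotweed_acc, 2))
print(round(background_iou, 2), round(background_acc, 2))
```

Because false positives enter the IoU denominator but not the accuracy denominator, IoU penalizes over-segmentation that per-class accuracy alone would not reveal.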

Table 3. Overall mean Intersection over Union (mIoU) and Average Accuracy (AAcc) of transformer-based segmentation models on the knotweed test set. The proposed Twins-SVT model demonstrates the highest performance across both metrics.

Model	mIoU (%)	AAcc (%)
Twins-SVT [26]	94.94	97.50
SegFormer [33]	94.04	96.98
Swin-T [34]	92.62	96.18
ViT [20]	93.64	93.58
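The mIoU column of Table 3 is consistent with the unweighted mean of the two class-wise IoU values in Table 2, which the short check below illustrates. The dictionary name `CLASS_IOU` and the helper `mean_iou` are hypothetical, and treating mIoU as a simple two-class average is an assumption about how the values were aggregated.

```python
# Class-wise IoU (%) from Table 2: (background, knotweed).
CLASS_IOU = {
    "Twins-SVT": (98.18, 91.70),
    "SegFormer": (97.85, 90.23),
    "Swin-T": (95.23, 90.01),
    "ViT": (97.63, 89.65),
}


def mean_iou(class_ious):
    """Unweighted mean over classes, in percent."""
    return sum(class_ious) / len(class_ious)


miou = {model: mean_iou(values) for model, values in CLASS_IOU.items()}
for model, value in miou.items():
    print(f"{model}: {value:.2f}")
```

Because the mean is unweighted, the smaller knotweed class counts as much as the background class, so the ranking in Table 3 is driven largely by the knotweed IoU.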

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

