MBA-Former: A Boundary-Aware Transformer for Synergistic Multi-Modal Representation in Pine Wilt Disease Detection from High-Resolution Satellite Imagery

Hou, Rui; Zhou, Yantao; Wang, Ying; Huang, Zhiquan; Yao, Jing; Jiao, Quanjun; Huang, Wenjiang; Zhang, Biyao

doi:10.3390/f17050517


by Rui Hou ¹,², Yantao Zhou ³,⁴, Ying Wang ⁵, Zhiquan Huang ⁵, Jing Yao ¹, Quanjun Jiao ¹, Wenjiang Huang ¹,² and Biyao Zhang ¹,*

¹ State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China

² University of Chinese Academy of Sciences, Beijing 100049, China

³ Center for Biological Disaster Prevention and Control, National Forestry and Grassland Administration, Shenyang 110034, China

⁴ Key Laboratory of National Forestry and Grassland Administration on Forest and Grassland Pest Monitoring and Warning, Shenyang 110034, China

⁵ Beijing Academy of Forestry and Landscape Architecture, Beijing 100102, China

* Author to whom correspondence should be addressed.

Forests 2026, 17(5), 517; https://doi.org/10.3390/f17050517

Submission received: 23 March 2026 / Revised: 16 April 2026 / Accepted: 20 April 2026 / Published: 23 April 2026

(This article belongs to the Special Issue Forest Disturbance Monitoring by Remote Sensing: Advancements and Applications)


Abstract

Pine wilt disease (PWD) is a devastating biological forest disturbance, making its large-scale and high-precision remote sensing monitoring crucial for epidemic prevention and control. However, the performance of existing deep learning methods on high-resolution imagery is often limited by the confusion of spectral features among disparate ground objects and the complexity of forest boundaries. To address these challenges, this study proposes an end-to-end deep learning architecture termed the Multi-modal Boundary-Aware Transformer (MBA-Former). Built upon the Swin Transformer V2 backbone, the model integrates two adaptable functional modules: (1) a front-end fusion module designed to adaptively fuse heterogeneous features, and (2) a back-end boundary refinement module that refines segmentation contours via dual-task learning. To train and evaluate the model, fine-grained manual annotations were first performed on Gaofen-2 satellite imagery acquired from multiple typical epidemic areas across northern and southern China. Information-enhanced datasets were constructed by fusing the original spectral bands, typical vegetation indices, and texture features. A comprehensive performance evaluation was then conducted, specifically targeting typical challenging scenarios characterized by complex ground object boundaries. The experimental results demonstrate that MBA-Former significantly outperforms current state-of-the-art models. It achieved a mean Intersection over Union (mIoU) of 81.74%, an IoU of 77.58% for the most critical infected tree category, and a Boundary F1-Score of 78.62%. Compared to the best-performing baseline model, Swin-Unet, these three metrics exhibited improvements of 2.88%, 3.55%, and 4.46%, respectively. These findings indicate that MBA-Former provides an accurate and robust solution for the large-scale, automated remote sensing monitoring of forest diseases, which can help prevent economic losses and preserve forest ecosystem integrity.

Keywords:

pine wilt disease; high-resolution satellite imagery; deep learning; transformer; multi-modal fusion; boundary-aware learning

1. Introduction

Pine Wilt Disease (PWD), driven by the pine wood nematode (Bursaphelenchus xylophilus) and transmitted by cerambycid beetle vectors, represents one of the most devastating biological disturbances in global forest ecosystems [1]. Characterized by its rapid transmission, swift disease progression, and exceptionally high mortality rates [2], this severe forest disturbance inflicts persistent damage on the structural integrity and key functions of forest ecosystems, severely diminishing their service value and posing a serious challenge to regional sustainable development. The unprecedented scale of this disturbance was evidenced in China by the end of 2025, where PWD outbreaks were reported in 626 counties across 17 provinces, affecting an area of 0.99 million hectares and resulting in the mortality of 5.72 million pine trees in that year alone [3]. Consequently, the development of large-scale, high-efficiency disturbance monitoring technologies has become an urgent imperative for the construction of a precise forest epidemic prevention and control system. In practice, even seemingly minor improvements in detection precision can yield profound impacts. For instance, an increase of merely a few percentage points in intersection over union can translate to the accurate identification of thousands of additional potential diseased trees during million-hectare regional surveys. This level of precision is critical for reducing the erroneous felling of healthy pines, minimizing manual ground-survey costs, and effectively disrupting secondary transmission chains, thereby preventing billions of dollars in economic losses and safeguarding forest ecosystem integrity.

Traditional ground surveys, with their high labor costs and limited spatiotemporal coverage, are inadequate for managing the rapid spread of PWD [4]. This highlights the necessity of developing alternative monitoring methods, among which remote sensing technology, with its macroscopic, rapid, and periodic observation capabilities, has demonstrated unique application potential [5].

Previous research has established that the biotic stress induced by B. xylophilus causes significant physiological and structural changes within pine trees, leading to specific spectral responses [6,7,8]. As the infection progresses, the chlorophyll and water content within the pine needles decline significantly, a process that markedly alters the spectral reflectance characteristics of the plant in the visible and near-infrared bands. A quintessential spectral response is the “blue shift” of the red-edge position, which is caused by the diminished absorption of chlorophyll [9,10]. This well-defined physiological-spectral linkage provides the core theoretical basis for differentiating between healthy pines and infected trees using remote sensing techniques.

Based on this principle, Unmanned Aerial Vehicle (UAV) remote sensing platforms, capable of acquiring centimeter-level resolution imagery, have been widely applied in the precise identification of PWD at the individual tree level [11,12,13]. UAVs equipped with multispectral or hyperspectral payloads have been proven effective in capturing the subtle changes in infected tree crowns, enabling high-precision localization of diseased trees [14,15]. However, the limitations of UAVs are equally prominent: they have a limited operational range, higher flight costs, and are susceptible to meteorological conditions and airspace regulations. These factors constrain their application in large-area surveys. In contrast, satellite remote sensing offers extensive coverage and a fixed revisit cycle, making it an irreplaceable tool for achieving regional-scale, systematic monitoring [16,17]. Therefore, this study focuses on leveraging high-resolution satellite remote sensing technology to achieve large-scale, high-efficiency, and automated monitoring of PWD.

Early satellite-based remote sensing studies for PWD monitoring predominantly relied on medium- to low-resolution imagery, such as that from Landsat and MODIS. For example, Kim et al. [18] utilized Landsat 8 imagery to identify suspected forest areas by analyzing temporal anomalies in classic vegetation indices. With the increasing availability of high-spatial-resolution imagery, the research paradigm has shifted towards more refined classification strategies, including object-based image analysis and traditional machine learning algorithms [19,20,21]. While these methods have improved identification accuracy, their performance is highly dependent on complex, expert-driven manual feature engineering, which often compromises the stability and generalization capabilities of the models when applied to imagery from different geographical regions and temporal phases.

In recent years, deep learning (DL) has demonstrated exceptional performance in a wide range of remote sensing intelligent interpretation tasks [22,23,24]. Consequently, DL-based high-resolution satellite remote sensing monitoring is now widely regarded as the most promising avenue for achieving large-scale, high-precision, and automated identification of PWD. The evolution of DL methods for PWD and similar forest disease monitoring is closely intertwined with the advancements in mainstream computer vision architectures [25]. The first generation of these methods, represented by Convolutional Neural Networks (CNNs) such as U-Net [26] and DeepLabV3+ [27], achieved initial success by leveraging their powerful local feature extraction capabilities to identify the spectral and textural characteristics of infected tree crowns. For instance, Zhou et al. [28] employed high-resolution BJ-2 remote sensing imagery in conjunction with a CNN and bounding box tools to achieve intelligent identification of infected trees. Huang et al. [29] utilized GF-1 and GF-2 satellite imagery to construct a sample dataset for a deep CNN and applied transfer learning to obtain an optimized SqueezeNet model for detecting PWD outbreak areas. Ye et al. [30], based on Landsat 8 and UAV imagery, developed an automated PWD detection model by integrating a multi-scale attention mechanism into a U-Net architecture, achieving an average improvement in identification accuracy of approximately 10.2% compared to classic models. However, in the complex scenarios typical of PWD monitoring, the inherently local receptive field of CNNs limits their ability to comprehend large-scale contextual information, making it difficult to effectively distinguish isolated infected trees from other morphologically similar features at the forest edge.

To address these limitations, a second generation of methods, represented by Vision Transformers (ViTs), has been introduced. The Swin Transformer [31] and its variants effectively capture long-range dependencies within an image through a self-attention mechanism, thereby attaining a global receptive field. This capability has shown greater potential than CNNs in remote sensing applications where extensive contextual information is required for accurate classification. For example, Yang et al. [32] constructed a three-branch Swin Transformer Classification (TSTC) network to identify the severity of forest diseases. Amin et al. [33] introduced an external attention mechanism into the original Transformer model, which improved computational efficiency to some extent and increased the detection accuracy for PWD by approximately 5% compared to the best-performing baseline model.

Despite the powerful representation potential demonstrated by Transformer-based methods, their practical application for precise PWD detection from high-resolution satellite imagery remains fundamentally constrained by a compounding challenge: the severe confusion of multi-modal features and the extreme complexity of object boundaries in dense forest environments. On one hand, this feature confusion arises from ground object complexity. Disparate ground objects, such as infested pines, bare soil, and other discolored vegetation, often exhibit highly similar spectral signatures. Existing research typically addresses this by performing simple channel-stacking of ancillary features (e.g., vegetation indices and textures) with the original spectral bands. However, this fixed, non-data-driven fusion approach fails to enable the model to adaptively weigh the importance of different modalities based on the local context, resulting in persistent misclassification. On the other hand, this issue is exacerbated by insufficient localization accuracy at the boundaries. Within complex forest areas, individual tree crowns are irregular and mutually occlusive. Crucially, while modern attention-based models (like Vision Transformers) possess a global receptive field, their inherent patch-based tokenization mechanism and standard pixel-wise cross-entropy loss functions fundamentally treat all pixels equally. This means they lack explicit supervision for high-frequency spatial details and struggle to inherently resolve fine-grained contours, typically leading to smoothed, blurred, or adhered segmentation boundaries in complex forest-edge zones.

To systematically address these interlinked challenges, this study proposes a novel deep learning model termed Multi-modal Boundary-Aware Transformer (MBA-Former), utilizing the advanced Swin Transformer V2 as its core backbone. The primary novelty of MBA-Former lies in its synergistic, end-to-end architecture designed to simultaneously perform intelligent feature weighting to overcome spectral confusion and explicit contour refinement to resolve boundary blurring. The main objectives and contributions of this research are as follows: (1) To develop the innovative MBA-Former model for high-precision PWD detection. This framework is designed to systematically improve overall model performance through the synergistic integration of intelligent front-end feature fusion and back-end boundary refinement with the powerful Swin Transformer V2 backbone. (2) To design a Multi-modal Gating Fusion Module (MFM). As the front-end of the model, this module is engineered to adaptively learn the weights of spectral and textural features on a pixel-by-pixel basis. By intelligently fusing heterogeneous information, it aims to enhance the model’s ability to discriminate between spectrally similar ground objects in complex backgrounds. (3) To construct a Boundary-Aware Dual-Task Decoder (BA). As the back-end of the model, this module introduces a parallel, explicit boundary prediction task. This is intended to compel the model to focus on and learn the precise contours of objects, thereby improving the boundary quality of the final segmentation results. (4) To validate the identification performance of the MBA-Former model in typical challenging scenarios.

2. Materials

2.1. Overview of the Study Area

To comprehensively evaluate the performance, robustness, and geographical generalization capability of the proposed model (MBA-Former), three provinces in China exhibiting typical incidence patterns of PWD and significant geographical differences were selected as the study areas: Liaoning, Hunan, and Zhejiang (Figure 1). Among them, Liaoning Province, characterized by a temperate monsoon climate with cold winters, constitutes the northernmost outbreak boundary of PWD in China. In the south, Hunan Province is situated inland with widespread hills, experiencing high temperatures and high humidity in summer; meanwhile, Zhejiang Province is adjacent to the ocean, featuring a warm and humid climate. The climatic and topographic conditions of these two subtropical provinces collectively render them the most typical core regions with a high incidence of the disease in China. A total of six GF-2 satellite remote sensing images were selected for the experiment, including three scenes covering Fushun City in Liaoning Province, two covering Changsha City in Hunan Province, and one covering Hangzhou City in Zhejiang Province.

2.2. Experimental Data Acquisition and Preprocessing

2.2.1. High Resolution Satellite Remote Sensing Imagery

All imagery utilized in this study consists of Level-1A multispectral images acquired by the GaoFen-2 (GF-2) satellite (Table 1). All images were specifically selected during the peak period of PWD symptom expression, with cloud cover strictly below 5%. The GF-2 sensor is equipped with a panchromatic camera and a multispectral camera, providing multispectral data across four bands. Through the fusion of the panchromatic and multispectral data, multispectral imagery products with a spatial resolution of 0.8 m can be generated, providing sufficient detail to support fine-scale identification at the individual tree crown level. The Level-1A high-resolution satellite remote sensing imagery products used in this study were provided by the China Centre for Resources Satellite Data and Application (CRESDA), Beijing, China. All images were subjected to standardized preprocessing procedures, including radiometric calibration, atmospheric correction using the Quick Atmospheric Correction (QUAC) algorithm, orthorectification, and image fusion (pan-sharpening).

2.2.2. High-Resolution UAV Remote Sensing Imagery

To generate high-precision ground truth data, aerial surveys were conducted over the core plots using a DJI Mavic 2 Pro Unmanned Aerial Vehicle (UAV, Dajiang Innovation, Shenzhen, China) concurrently with the acquisition of the satellite imagery. The acquired UAV data were processed to generate orthomosaics with a spatial resolution better than 0.1 m. The ultra-high spatial resolution of the RGB UAV imagery provided exceptional visual clarity, enabling us to visually confirm and precisely delineate the crowns of mid-to-late stage discolored trees with high confidence. Subsequently, utilizing the UAV orthomosaics as a reference, high-precision geometric registration was performed on the preprocessed GF-2 images (Figure 2). By uniformly selecting more than 30 ground control points (GCPs) across each scene, the root mean square error (RMSE) of the registration was strictly constrained to within 0.5 pixels [34]. This ensured precise spatial correspondence between the satellite imagery and the high-resolution reference imagery, thereby laying a solid foundation for the subsequent fine-grained manual annotation.

3. Methods

The overall technical workflow of this study aims to construct an end-to-end solution for PWD detection from data processing to model prediction, with the novel MBA-Former model at its core (Figure 3). This section systematically elaborates on the details of dataset construction, the core architecture of MBA-Former, and the final loss functions and evaluation framework.

3.1. Construction of the Standardized Semantic Segmentation Dataset

3.1.1. Feature Enhancement

To provide the deep learning model with richer, information-enhanced inputs beyond the raw spectral data, a high-dimensional feature cube integrating spectral, physiological, and spatial structural information was constructed. The selection of these features is not arbitrary but grounded in the underlying mechanisms and prior knowledge of PWD remote sensing monitoring [6,10], aiming to enhance the separability between infected trees and background objects across multiple dimensions.

In addition to the four original spectral bands of the GF-2 imagery, two additional categories of features were introduced, resulting in an 8-channel high-dimensional feature cube:

Vegetation Indices (VIs): We selected indices highly sensitive to variations in vegetation health and chlorophyll content: (1) the Normalized Difference Vegetation Index (NDVI), a classic indicator for reflecting vegetation coverage and biomass; and (2) the Normalized Green-Red Difference Index (NGRDI), which is particularly sensitive to subtle changes in chlorophyll by maximizing the reflectance difference between the green and red bands. The formulas for calculating these indices are as follows [35]:

\mathrm{NDVI} = \frac{\mathrm{NIR} - R}{\mathrm{NIR} + R}

(1)

\mathrm{NGRDI} = \frac{G - R}{G + R}

(2)

where G, R, and NIR represent the reflectance of the green, red, and near-infrared bands, respectively.
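As a minimal sketch of Equations (1) and (2), the two indices can be computed per pixel with NumPy. The function name `vegetation_indices`, the small `eps` term that guards against division by zero, and the toy reflectance values are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def vegetation_indices(green, red, nir, eps=1e-6):
    """Return (NDVI, NGRDI) for reflectance arrays of identical shape."""
    green = np.asarray(green, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    # eps is an assumption added for numerical safety on very dark pixels.
    ndvi = (nir - red) / (nir + red + eps)        # Equation (1)
    ngrdi = (green - red) / (green + red + eps)   # Equation (2)
    return ndvi, ngrdi

# Toy reflectances (hypothetical): a green canopy pixel and a reddish one.
g = np.array([0.08, 0.06])
r = np.array([0.04, 0.12])
n = np.array([0.40, 0.20])
ndvi, ngrdi = vegetation_indices(g, r, n)
print(ndvi, ngrdi)
```

In this toy case the reddish pixel has a lower NDVI and a negative NGRDI, which is the direction of change the two indices are meant to capture.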

Textural Features (Tex): PWD causes pine needle defoliation and canopy thinning, which consequently alters the spatial texture of the trees in remote sensing imagery. In this study, two classic second-order statistical metrics based on the Gray-Level Co-occurrence Matrix (GLCM) were extracted to describe these structural changes: (1) Contrast, which measures the intensity of local variations in the image, as infected canopies may exhibit higher contrast due to sparse foliage and increased shadows; and (2) Homogeneity, which measures the local uniformity of the image texture, since healthy and dense canopies are typically more homogeneous than infected ones. The GLCM features were calculated using a 7 × 7 pixel window size, and the results were averaged across four orientation angles (0°, 45°, 90°, 135°) to ensure rotational invariance. The formulas for these textural features are as follows [36]:

\mathrm{Contrast} = \sum_{i, j} |i - j|^{2} P(i, j)

(3)

\mathrm{Homogeneity} = \sum_{i, j} \frac{P(i, j)}{1 + |i - j|}

(4)

where i and j represent the gray-level values of the reference pixel and the adjacent pixel, respectively, and P(i, j) is the normalized co-occurrence frequency of that gray-level pair in the GLCM.
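The following sketch illustrates Equations (3) and (4) for a single window, averaged over the four orientation angles. The pixel distance of 1, the 16 gray levels, and the helper name `glcm_features` are assumptions made for illustration; the text specifies only the 7 × 7 window and the four angles.

```python
import numpy as np

# (d_row, d_col) offsets for 0, 45, 90 and 135 degrees at distance 1 (assumed).
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm_features(window, levels=16):
    """Contrast and homogeneity of one integer-valued window, averaged over four angles."""
    window = np.asarray(window)
    rows, cols = window.shape
    i_idx, j_idx = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    diff = np.abs(i_idx - j_idx)
    contrast, homogeneity = [], []
    for d_row, d_col in OFFSETS.values():
        # Keep only reference pixels whose neighbour lies inside the window.
        r0, r1 = max(0, -d_row), rows - max(0, d_row)
        c0, c1 = max(0, -d_col), cols - max(0, d_col)
        ref = window[r0:r1, c0:c1]
        nbr = window[r0 + d_row:r1 + d_row, c0 + d_col:c1 + d_col]
        glcm = np.zeros((levels, levels), dtype=np.float64)
        np.add.at(glcm, (ref.ravel(), nbr.ravel()), 1.0)  # count gray-level pairs
        p = glcm / glcm.sum()                             # normalise to P(i, j)
        contrast.append(np.sum(diff ** 2 * p))            # Equation (3)
        homogeneity.append(np.sum(p / (1.0 + diff)))      # Equation (4)
    return float(np.mean(contrast)), float(np.mean(homogeneity))

flat = np.full((7, 7), 5)                                 # perfectly uniform window
checker = 3 * (np.indices((7, 7)).sum(axis=0) % 2)        # alternating 0 / 3 pattern
print(glcm_features(flat))
print(glcm_features(checker))
```

The uniform window gives zero contrast and a homogeneity of one, whereas the alternating pattern raises contrast and lowers homogeneity, consistent with the interpretation of the two metrics given above.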

3.1.2. Sample Dataset Generation and Partitioning

Based on the high-precision registered GF-2 imagery, fine-grained manual visual interpretation and annotation of ground objects were conducted using the UAV imagery as a high-resolution reference. The annotation classification system comprises four categories: background, healthy pine trees, PWD-infected trees, and others. The visual interpretation criteria for the different categories are as follows: the crowns of infected trees typically appear orange-red, bright red, or reddish-brown, with a circular or elliptical shape, and are generally sparsely distributed, occasionally forming clusters of multiple diseased trees. In contrast, healthy crowns are typically dark green with densely aggregated foliage. Typical interfering ground objects other than pines are annotated as “others,” while all remaining areas are annotated as “background.” All annotated vector polygons were subsequently rasterized into label maps corresponding pixel-by-pixel to the high-dimensional feature cube.

Subsequently, to generate samples suitable for training the deep learning model, a sliding window approach was employed to crop the entire high-dimensional feature cube and the corresponding label maps into 256 × 256 pixel patches. To strictly prevent spatial autocorrelation and data leakage, a scene-level (geospatial) allocation strategy was implemented prior to cropping. Specifically, the large satellite images were first spatially divided into completely non-overlapping training, validation, and test geographical zones. Only after this strict spatial isolation did we apply the sliding window approach (with a stride of 128 pixels) within each respective zone.
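The cropping step can be sketched as follows, assuming it is applied separately to each spatially isolated zone. The function names are hypothetical, and dropping partial windows at the zone edges is an assumption, since the text does not state how edges were handled.

```python
import numpy as np

def window_origins(length, patch=256, stride=128):
    """Top-left offsets of all full windows along one axis."""
    if length < patch:
        return []
    return list(range(0, length - patch + 1, stride))

def crop_patches(cube, labels, patch=256, stride=128):
    """Yield (feature_patch, label_patch) pairs from one geographical zone.

    cube:   (C, H, W) feature cube for the zone
    labels: (H, W) rasterized label map for the same zone
    """
    _, height, width = cube.shape
    for top in window_origins(height, patch, stride):
        for left in window_origins(width, patch, stride):
            yield (cube[:, top:top + patch, left:left + patch],
                   labels[top:top + patch, left:left + patch])

# Hypothetical zone of 512 x 640 pixels with 8 feature channels.
zone_cube = np.zeros((8, 512, 640), dtype=np.float32)
zone_labels = np.zeros((512, 640), dtype=np.int64)
patches = list(crop_patches(zone_cube, zone_labels))
print(len(patches), patches[0][0].shape)
```

Because the stride is half the patch size, neighbouring patches within a zone overlap by 50%, which is why the zones themselves must be separated before cropping.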

The construction of the dataset followed a rigorous procedure of screening, partitioning, and balancing. First, all sample patches generated via the sliding window underwent a strict quality screening. Samples wherein the valid annotated area (non-background) accounted for less than 1% of the pixels (fewer than approximately 650 valid pixels) were discarded to ensure that each sample contained sufficient effective information. Following this screening, a total of 18,756 high-quality samples were acquired, with examples illustrated in Figure 2. Subsequently, all high-quality samples were randomly partitioned into a training set, a validation set, and a test set at a ratio of 8:1:1. The actual acquisition details of the sample dataset are presented in Table 2. Herein, a “positive sample” refers to a 256 × 256 image patch containing at least one pixel of an infected tree, while a “negative sample” contains no infected tree pixels.
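A minimal sketch of the screening rule and the positive/negative definition is given below. The integer label codes are hypothetical, since the text names the four categories but not their encoding.

```python
import numpy as np

# Hypothetical label codes for the four annotation categories.
BACKGROUND, HEALTHY, INFECTED, OTHERS = 0, 1, 2, 3

def keep_patch(label_patch, min_valid_fraction=0.01):
    """True if non-background pixels cover at least the given fraction of the patch."""
    valid = np.count_nonzero(label_patch != BACKGROUND)
    return valid >= min_valid_fraction * label_patch.size

def is_positive(label_patch):
    """True if the patch contains at least one infected-tree pixel."""
    return bool(np.any(label_patch == INFECTED))

patch = np.zeros((256, 256), dtype=np.int64)
patch.flat[:655] = HEALTHY      # 655 of 65,536 pixels is just under 1%
print(keep_patch(patch), is_positive(patch))
patch.flat[655] = INFECTED      # one more valid pixel, and it is infected
print(keep_patch(patch), is_positive(patch))
```

For a 256 × 256 patch the 1% threshold corresponds to 655.36 pixels, which matches the figure of approximately 650 valid pixels quoted above.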

Considering the sparsity of infected tree samples in the real world, and to mitigate the negative impact of class imbalance on model training, data balancing was performed exclusively on the training set. For the 1384 positive samples in the training set, we applied a 3-fold oversampling strategy using data augmentation techniques (including typical geometric and color transformations), generating an additional 4152 positive samples and thereby effectively improving the ratio of positive to negative samples. The validation and test sets retained their original data distributions to ensure the objectivity and fairness of the model evaluation.

3.2. The MBA-Former Model

This study proposes an innovative end-to-end semantic segmentation network specifically for PWD identification, termed MBA-Former, the detailed architecture of which is illustrated in Figure 4. The core innovation of this model lies not in modifying the backbone network itself but in systematically optimizing the entire segmentation workflow by integrating two plug-and-play functional modules: a front-end Multi-modal Gating Fusion Module (MFM) and a back-end Boundary-Aware (BA) module. The MFM is designed to address spectral feature ambiguity through adaptive feature fusion, while the BA module refines segmentation contours by introducing explicit boundary supervision learning. These two modules, in conjunction with the powerful Swin Transformer V2 backbone, collectively constitute the complete architecture of MBA-Former.

The complete workflow of MBA-Former is as follows. The 8-channel input feature map X with dimensions of (B, 8, 256, 256), where B is the batch size, is first processed by the MFM, which performs the adaptive fusion and produces a feature map F_fused with dimensions of (B, 64, 256, 256). This fused feature map is then fed into the backbone encoder, which consists of four stages comprising a total of eight Swin Transformer V2 Blocks. Following four instances of 2-fold down-sampling, four hierarchical feature maps at different scales are generated: S1, S2, S3, and S4. Next, a U-Net-style decoder, composed of four Decoder-Blocks, progressively recovers the spatial resolution using skip connections from the feature maps S1–S4, ultimately outputting a feature map d₁ with dimensions of (B, 32, 256, 256). At the output stage, the BA module predicts in parallel the final segmentation logits map (B, N_cls, 256, 256) and a single-channel boundary logits map (B, 1, 256, 256), which are jointly used to calculate the combined loss.
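Reading this description literally, with one 2-fold down-sampling per encoder stage and one 2-fold up-sampling per Decoder-Block, the spatial sizes can be traced as in the sketch below. This is an interpretation rather than a confirmed detail of the implementation. The intermediate decoder names d4 to d2 are hypothetical, and channel widths are omitted because the text gives them only for F_fused and d₁.

```python
def spatial_trace(size=256, stages=4):
    """Trace the spatial side length through the encoder and decoder."""
    encoder, decoder = [], []
    side = size
    for k in range(1, stages + 1):
        side //= 2                      # one 2-fold down-sampling per stage (assumed)
        encoder.append((f"S{k}", side))
    for k in range(stages, 0, -1):
        side *= 2                       # one 2-fold up-sampling per Decoder-Block (assumed)
        decoder.append((f"d{k}", side))
    return encoder, decoder

encoder, decoder = spatial_trace()
print(encoder)
print(decoder)
```

Under these assumptions the deepest feature map S4 is 16 × 16, and the last decoder output d₁ returns to the 256 × 256 input resolution stated in the text.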

3.2.1. Multi-Modal Gating Fusion Module (MFM)

The front-end of the model is the Multi-modal Gating Fusion Module (MFM), designed to upgrade the traditional feature-stacking approach in PWD identification to an explicit, data-driven, and adaptive fusion process (Figure 5). The core mechanism of the MFM is as follows: for the input “spectral-physiological” modal features (F_sv (B, 6, H, W)) and the “spatial-structural” modal features (F_tex (B, 2, H, W)), a “gating network” first generates a dual-channel attention gate G (B, 2, H, W):

G = \mathrm{Softmax}(f_{gate}(\mathrm{Concat}(F_{sv}, F_{tex})))

(5)

where Concat(·) denotes the channel concatenation operation and f_gate denotes the gating network. The two channels of G, denoted g₁ and g₂, represent the weights assigned to the two modalities at each pixel location. Concurrently, F_sv and F_tex are mapped to the same target dimension C_out via their respective 1 × 1 convolutional projection functions, p_sv and p_tex, and the fused feature map is computed as:

F_{fused} = g_{1} ⊙ p_{sv}(F_{sv}) + g_{2} ⊙ p_{tex}(F_{tex})

(6)

where ⊙ denotes element-wise multiplication.
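Equations (5) and (6) can be illustrated with a framework-free NumPy sketch. Here the gating network f_gate is reduced to a single 1 × 1 convolution, which is an assumption, since the text does not specify its depth. The random weights, the helper names, and the small 16 × 16 spatial size are likewise illustrative only.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, weight, bias):
    """1x1 convolution: x (B, C_in, H, W), weight (C_out, C_in), bias (C_out,)."""
    return np.einsum("oc,bchw->bohw", weight, x) + bias[None, :, None, None]

def mfm_forward(f_sv, f_tex, params):
    """Gated fusion following Equations (5) and (6)."""
    stacked = np.concatenate([f_sv, f_tex], axis=1)             # (B, 8, H, W)
    gate = softmax(conv1x1(stacked, *params["gate"]), axis=1)   # (B, 2, H, W), Eq. (5)
    g1, g2 = gate[:, 0:1], gate[:, 1:2]                         # per-pixel modality weights
    fused = (g1 * conv1x1(f_sv, *params["p_sv"])
             + g2 * conv1x1(f_tex, *params["p_tex"]))           # (B, 64, H, W), Eq. (6)
    return fused, gate

rng = np.random.default_rng(0)

def init(c_out, c_in):
    # Random placeholder weights; a trained model would learn these.
    return rng.normal(scale=0.1, size=(c_out, c_in)), np.zeros(c_out)

params = {"gate": init(2, 8), "p_sv": init(64, 6), "p_tex": init(64, 2)}
f_sv = rng.normal(size=(2, 6, 16, 16))    # spectral-physiological modality
f_tex = rng.normal(size=(2, 2, 16, 16))   # spatial-structural modality
fused, gate = mfm_forward(f_sv, f_tex, params)
print(fused.shape, gate.shape)
```

Because the Softmax is taken over the two gate channels, g₁ and g₂ sum to one at every pixel, so the fusion is a per-pixel convex combination of the two projected modalities.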

This gating mechanism represents a critical departure from commonly used fusion methods such as simple concatenation or addition. Traditional methods inherently apply a spatially uniform, non-adaptive fusion across the entire image, ignoring the profound spatial heterogeneity of complex forest scenes where the reliability of different feature modalities varies significantly by location [29,30]. By contrast, the MFM acts as a dynamic, data-driven attention mechanism. It learns to adaptively weigh the importance of each modality on a pixel-by-pixel basis, intelligently emphasizing the most reliable features for any given local context and thereby effectively mitigating feature confusion.

3.2.2. Backbone Network

We utilize the Swin Transformer V2 as the core feature extractor of the model. Compared to traditional CNNs, the Swin Transformer V2, with its hierarchical design and core Shifted Window-based Self-Attention mechanism, can effectively capture both multi-scale features and long-range dependencies in an image while maintaining computational efficiency [37]. The powerful global context modeling capability of the Swin Transformer V2 is crucial for accurately identifying PWD-infected trees in complex backgrounds.

3.2.3. Boundary-Aware (BA) Module

The back-end of the model is the Boundary-Aware (BA) module, which is designed to enhance the boundary precision of the segmentation results, particularly for PWD-infected trees, through a dual-task learning paradigm (Figure 6). The motivation for this module is that standard pixel-wise classification loss functions treat all pixels equally, causing the model to prioritize the classification accuracy of large-area regions while relatively neglecting the boundary pixels, which are few in number but critical for segmentation quality. The BA module addresses this by introducing a dedicated, explicit boundary supervision signal that compels the model to focus on and learn precise contour information of diseased tree crowns, thereby improving boundary segmentation accuracy. It takes the feature map from the shallowest layer of the decoder, d₁ (B, C_d, H, W) (e.g., C_d = 32), as input, and passes it through two parallel branches. The Region Branch, through a classification head h_seg, predicts the class membership of each pixel, outputting the segmentation logits map (B, N_cls, H, W). The Boundary Branch, a more lightweight convolutional network h_b, performs a binary classification to determine whether each pixel lies on the boundary between different classes, outputting a single-channel boundary logits map L_b (B, 1, H, W). Finally, a Refinement Module h_refine concatenates d₁ with the predicted boundary probability map σ(L_b) and utilizes the learned boundary knowledge to correct and sharpen the contours, producing the final segmentation result:

L_{seg\_final} = h_{refine}(\mathrm{Concat}(d_{1}, σ(L_{b})))

(7)

where σ(·) denotes the Sigmoid activation function.
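The data flow of the BA module and Equation (7) can be sketched as follows. For brevity, each of h_seg, h_b, and h_refine is reduced to a single 1 × 1 convolution with random weights. This is a simplifying assumption, as the text describes h_b only as a lightweight convolutional network and does not detail h_refine. The value N_cls = 4 follows the four annotation categories.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: x (B, C_in, H, W), weight (C_out, C_in), bias (C_out,)."""
    return np.einsum("oc,bchw->bohw", weight, x) + bias[None, :, None, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ba_forward(d1, params):
    """Two parallel branches followed by boundary-guided refinement, Eq. (7)."""
    seg_logits = conv1x1(d1, *params["h_seg"])                 # (B, N_cls, H, W) Region Branch
    boundary_logits = conv1x1(d1, *params["h_b"])              # (B, 1, H, W) Boundary Branch
    refine_in = np.concatenate([d1, sigmoid(boundary_logits)], axis=1)  # (B, C_d + 1, H, W)
    seg_final = conv1x1(refine_in, *params["h_refine"])        # (B, N_cls, H, W)
    return seg_logits, boundary_logits, seg_final

rng = np.random.default_rng(0)
C_D, N_CLS = 32, 4   # C_d = 32 from the text; N_cls = 4 annotation categories

def init(c_out, c_in):
    # Random placeholder weights standing in for learned parameters.
    return rng.normal(scale=0.1, size=(c_out, c_in)), np.zeros(c_out)

params = {"h_seg": init(N_CLS, C_D), "h_b": init(1, C_D), "h_refine": init(N_CLS, C_D + 1)}
d1 = rng.normal(size=(2, C_D, 16, 16))
seg_logits, boundary_logits, seg_final = ba_forward(d1, params)
print(seg_logits.shape, boundary_logits.shape, seg_final.shape)
```

The point of the sketch is the wiring: the refinement head sees one extra input channel, the boundary probability σ(L_b), in addition to the decoder features d₁.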

3.2.4. The Joint Loss Function

To simultaneously supervise both the segmentation and boundary tasks, we designed a joint loss function, defined as follows:

L_{total} = L_{CE} + α \cdot L_{BCE}

(8)

where α is a hyperparameter that balances the weights of the two loss terms, L_CE is the standard Categorical Cross-Entropy Loss for region segmentation, and L_BCE is the Binary Cross-Entropy Loss for boundary prediction. Averaged over all pixels, the two loss terms are calculated as [38]:

L_{CE} = - \frac{1}{M} \sum_{i = 1}^{M} \sum_{c = 1}^{N_{cls}} y_{i, c} \log (p_{i, c})

(9)

L_{BCE} = - \frac{1}{M} \sum_{i = 1}^{M} [\hat{y}_{i} \log (\hat{p}_{i}) + (1 - \hat{y}_{i}) \log (1 - \hat{p}_{i})]

(10)

where M is the total number of pixels, N_cls is the number of classes, y_{i,c} is an indicator variable (1 if the true class of pixel i is c, and 0 otherwise), p_{i,c} is the model’s predicted probability that pixel i belongs to class c, \hat{y}_{i} is the true boundary label of pixel i (1 for a boundary, 0 for non-boundary), and \hat{p}_{i} is the model’s predicted probability that the pixel is a boundary. To ensure precise and continuous pixel-level contour supervision for calculating L_{BCE}, the ground truth boundary map (\hat{y}_{i}) must be generated meticulously. In this study, \hat{y}_{i} was dynamically derived from the fine-grained regional masks using morphological operations, specifically calculating the difference between morphological dilation and erosion with a 3 × 3 rectangular kernel. This approach was deliberately chosen over traditional edge detectors to prevent the introduction of discontinuous or noisy edge signals, which could otherwise degrade the training stability of the Boundary Branch. By minimizing this joint loss, the model is incentivized to simultaneously optimize both the accuracy of PWD-infected tree identification and the precision of boundary localization.
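The boundary label generation and the joint loss of Equations (8)–(10) can be sketched for a single image as follows. Applying gray-scale dilation and erosion directly to the class-index map, the edge padding, the clipping constant `eps`, and the default α = 1.0 are all assumptions; the text does not report the value of α or the padding scheme.

```python
import numpy as np

def boundary_from_labels(labels):
    """Binary boundary map: 3x3 dilation minus 3x3 erosion of an (H, W) label map."""
    padded = np.pad(labels, 1, mode="edge")   # edge padding is an assumption
    h, w = labels.shape
    shifted = np.stack([padded[r:r + h, c:c + w] for r in range(3) for c in range(3)])
    dilated, eroded = shifted.max(axis=0), shifted.min(axis=0)
    return (dilated - eroded > 0).astype(np.float64)

def joint_loss(seg_prob, labels, boundary_prob, boundary_gt, alpha=1.0, eps=1e-7):
    """L_total = L_CE + alpha * L_BCE, Equations (8)-(10), for one image.

    seg_prob:      (N_cls, H, W) softmax probabilities
    labels:        (H, W) integer class map
    boundary_prob: (H, W) sigmoid probabilities
    boundary_gt:   (H, W) binary boundary labels
    """
    rows, cols = np.indices(labels.shape)
    p_true = np.clip(seg_prob[labels, rows, cols], eps, 1.0)   # probability of the true class
    l_ce = -np.mean(np.log(p_true))                            # Equation (9)
    p_b = np.clip(boundary_prob, eps, 1.0 - eps)
    l_bce = -np.mean(boundary_gt * np.log(p_b)
                     + (1.0 - boundary_gt) * np.log(1.0 - p_b))  # Equation (10)
    return l_ce + alpha * l_bce                                # Equation (8)

# Toy 6 x 6 label map: left half class 0, right half class 2.
labels = np.zeros((6, 6), dtype=np.int64)
labels[:, 3:] = 2
boundary_gt = boundary_from_labels(labels)

# Uninformative predictions: uniform over 4 classes, 0.5 for every boundary pixel.
seg_prob = np.full((4, 6, 6), 0.25)
boundary_prob = np.full((6, 6), 0.5)
print(boundary_gt.sum(), joint_loss(seg_prob, labels, boundary_prob, boundary_gt))
```

In the toy example the boundary map marks the two pixel columns on either side of the class transition, giving a closed band that is two pixels wide rather than a thin, possibly broken edge.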

3.3. Evaluation Metrics

To comprehensively and quantitatively evaluate the model performance in PWD identification, a series of standard evaluation metrics was adopted. For a task comprising k categories, let p_ij denote the total number of pixels belonging to category i but predicted as category j. The core metric for semantic segmentation tasks is the Intersection over Union (IoU), which measures the degree of overlap between the predicted region and the ground truth. Its mean value (mIoU) summarizes the overall performance of the model. For category i, the IoU and the resulting mIoU are calculated as follows:

IoU_{i} = \frac{p_{ii}}{\sum_{j=0}^{k-1} p_{ij} + \sum_{j=0}^{k-1} p_{ji} - p_{ii}}

(11)

mIoU = \frac{1}{N_{fg}} \sum_{i \in Foreground} IoU_{i}

(12)

where N_fg is the number of foreground categories over which the IoU is averaged. Furthermore, we calculated the F1-Score, which combines precision and recall to provide a more balanced reflection of model performance, particularly for the class-imbalanced datasets typical of PWD monitoring. To specifically assess the boundary segmentation quality of the model on diseased tree crowns, we introduced the Boundary F1-Score. This metric measures boundary alignment by calculating the pixel-level agreement between the predicted boundary and the ground truth boundary, making it more sensitive to slight deviations in segmentation contours than the region-based IoU [39]. Additionally, Overall Accuracy (OA) and Mean Pixel Accuracy (MPA) metrics were incorporated to provide a comprehensive comparative reference.
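As a worked illustration of Equations (11) and (12), the sketch below builds the confusion counts p_ij from flattened label lists and derives the per-category IoU, the foreground mIoU, and the F1-Score. The helper names and the choice of which category indices count as foreground are assumptions made for the example; the same F1 computation could be applied to binary boundary maps to approximate the Boundary F1-Score, assuming exact pixel matching without a distance tolerance.

```python
def confusion_matrix(truth, pred, k):
    """p[i][j] = number of pixels of true category i predicted as category j."""
    p = [[0] * k for _ in range(k)]
    for t, q in zip(truth, pred):
        p[t][q] += 1
    return p

def iou(p, i):
    """Eq. (11): p_ii / (row sum + column sum - p_ii)."""
    k = len(p)
    row = sum(p[i][j] for j in range(k))   # all pixels truly in category i
    col = sum(p[j][i] for j in range(k))   # all pixels predicted as category i
    union = row + col - p[i][i]
    return p[i][i] / union if union else float("nan")

def mean_iou(p, foreground):
    """Eq. (12): average IoU over the listed foreground category indices."""
    return sum(iou(p, i) for i in foreground) / len(foreground)

def f1_score(p, i):
    """F1 for category i: 2*TP / (2*TP + FP + FN)."""
    k = len(p)
    tp = p[i][i]
    fp = sum(p[j][i] for j in range(k)) - tp
    fn = sum(p[i][j] for j in range(k)) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")

# Toy example with three categories; category 0 is treated as background here.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
p = confusion_matrix(truth, pred, k=3)
miou_fg = mean_iou(p, foreground=[1, 2])
```

In the toy example, category 1 has an IoU of 2/3 and an F1 of 0.8, consistent with the identity F1 = 2·IoU / (1 + IoU) for a single category.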

4. Results

4.1. Experimental Parameter Setup and Model Training

All experiments in this study were conducted in a unified hardware and software environment to ensure comparability and reproducibility. The hardware platform utilized a graphics workstation equipped with an NVIDIA RTX 2080 GPU (12 GB of VRAM). The software environment was based on the Windows 10 operating system, utilizing PyTorch 2.1 as the core deep learning framework and the timm (v0.9.12) library to construct the backbone networks.

To guarantee a strictly fair comparison and eliminate any performance discrepancies arising from optimization strategies, all baseline models and our proposed MBA-Former were trained from scratch using an identical hyperparameter configuration. The comprehensive parameter setup is detailed in Table 3.

To thoroughly analyze the dynamic behavior of the different models during the learning process, we recorded and compared their validation loss and mIoU trajectories over the course of training (Figure 7). The loss curves indicate that all models exhibited effective convergence trends. However, the proposed MBA-Former demonstrated distinct superiority in both convergence speed and the final convergence level (Figure 7a). In the early stages of training, the loss of MBA-Former decreased most rapidly, indicating that the front-end MFM significantly accelerated the initial learning process by providing high-quality fused features. In the middle to late stages, MBA-Former converged to a significantly lower loss level than all baseline models, while maintaining minimal fluctuation. This stability can be attributed to the boundary supervision introduced by the back-end BA module, which not only improved the final accuracy but also served as an effective regularization mechanism.

Simultaneously, the mIoU curves reveal that MBA-Former consistently outperformed the comparative models by a significant margin throughout the training process (Figure 7b). It was the first to surpass an mIoU of 80% at approximately 60 epochs, demonstrating exceptional learning efficiency. In contrast, although advanced baseline models such as Swin-Unet and ConvNeXt-Unet exhibited strong learning capabilities, their performance growth plateaued after reaching a bottleneck, resulting in a clear performance gap compared to MBA-Former. These curves provide intuitive and dynamic evidence that systematic front-end feature fusion combined with back-end boundary refinement yields a segmentation model that is faster to converge, higher-performing, and more robust than existing SOTA methods.

4.2. Ablation Study

4.2.1. Comparative Analysis of Different Feature Combinations

To verify the necessity of the constructed multi-dimensional feature cube for PWD detection, we kept the complete MBA-Former architecture fixed and modified only the composition of its input features. The experimental results are presented in Table 4. When only the four raw spectral bands (Base) were used as input, the model achieved an mIoU of 78.53%. Adding Vegetation Indices (VIs) or Textural Features (Tex) to the spectral data improved the mIoU by 2.28 and 1.39 percentage points, respectively, demonstrating that both VIs and Tex provide effective complementary information. When the full feature set containing all 8 channels was utilized, the model reached its best performance, with the mIoU rising to 81.74%, a gain of 3.21 percentage points over the spectral-only input. These results indicate that the VIs, by directly quantifying the physiological stress of the vegetation, provided the most critical discriminative information for distinguishing between healthy and PWD-infected trees, yielding the largest performance gain. The inclusion of Tex, in turn, provided independent supplementary evidence in terms of spatial structure, playing a crucial role in differentiating morphologically distinct targets. By effectively utilizing multi-modal information, MBA-Former further enhanced both identification performance and robustness in complex PWD monitoring scenarios.

4.2.2. Validation of the Effectiveness of Independent Innovation Modules

To evaluate the individual contributions of the front-end MFM and back-end BA modules in PWD identification, we constructed three comparative models based on the full feature set and the Swin-V2 backbone. The results are presented in Table 5. Baseline model (a), which utilized simple feature concatenation without boundary supervision, achieved an mIoU of 78.86%. Introducing the MFM alone increased the mIoU by 2.06 percentage points, demonstrating that the MFM’s adaptive fusion mechanism enhanced the model’s capacity to distinguish PWD-infected trees from background ground objects with highly similar spectral signatures. Introducing the BA module alone yielded a 1.29 percentage point increase in mIoU over the baseline, accompanied by a significant improvement in the Boundary F1-Score, confirming its effectiveness in refining the boundary precision of diseased tree crowns. The model reached its peak performance (mIoU of 81.74%, 2.88 percentage points above the baseline) when both modules were combined. This indicates that the front-end MFM provides high-quality inputs via adaptive feature fusion, while the back-end BA module refines the segmentation contours of PWD-infected trees via explicit boundary supervision. Together, they substantially enhance the overall detection and monitoring performance for this devastating forest disease.

4.3. Comparative Analysis of Performance with SOTA Models

To systematically evaluate the performance advantages of MBA-Former in complex PWD identification scenarios, we conducted a rigorous horizontal comparison with various mainstream semantic segmentation baseline models, encompassing traditional convolutional, modern convolutional, and Transformer architectures. To ensure fairness, all models received the identical 8-channel multi-modal input and were optimized on the same dataset using consistent hyperparameters and training strategies. The quantitative comparison results are detailed in Table 6.

The experimental results clearly illustrate the performance evolution trajectory from traditional CNNs to modern hybrid architectures. Due to its simple convolutional structure, the classic U-Net exhibited limited feature extraction capabilities when processing high-dimensional multi-modal inputs, achieving a mean Intersection over Union (mIoU) of only 73.58%. By introducing the Atrous Spatial Pyramid Pooling (ASPP) module, DeepLabV3+ enhanced its ability to capture multi-scale contextual information, increasing its mIoU to 75.05%. However, owing to the inherent limitations of their local receptive fields, these two traditional CNN-based models still struggled to effectively handle large-scale spectral confusion and the complex boundary scenarios characteristic of PWD-infected forest patches.

The introduction of modern architectures significantly broke through this bottleneck. Drawing on the macroscopic design philosophy of Transformers, ConvNeXt-Unet achieved superior feature representation within a purely convolutional framework, with its mIoU leaping to 78.23%. The best-performing baseline model, Swin-Unet V2, successfully established long-range dependencies among image pixels relying on its innovative shifted window self-attention mechanism. This powerful global context awareness enabled it to demonstrate strong baseline performance, achieving an mIoU of 78.86% and a core PWD-infected tree IoU of 74.03%.

Despite the strong performance of Swin-Unet V2, the proposed MBA-Former improved on it further. MBA-Former achieved the best results across all core metrics: its mIoU reached 81.74% ± 0.32%, the single-class IoU for PWD-infected trees reached 77.58% ± 0.35%, and the Boundary F1-Score reached 78.62% ± 0.31%. Furthermore, as presented in Table 6, MBA-Former also achieved the highest IoU (83.15%) on the “Others” category, indicating that its high overall mIoU reflects robust scene parsing capabilities rather than an artifact of class imbalance. Compared to the best baseline, Swin-Unet V2, MBA-Former gained 2.88, 3.55, and 4.46 percentage points in these three core metrics, respectively. To verify the statistical validity of this improvement, a paired-sample t-test was conducted on the pixel-level classification accuracy of the baseline Swin-Unet and MBA-Former across the test set. The test yielded a p-value < 0.05, indicating that the 2.88 percentage point improvement in mIoU is statistically significant and not merely due to random data fluctuations. The consistently small standard deviations across three independent runs further indicate that the proposed architecture is stable and reproducible under different random weight initializations.
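For readers unfamiliar with the paired-sample t-test, the following sketch shows how the statistic is formed from matched accuracy values. The per-patch pairing, the function name `paired_t`, and all numbers in the two lists are hypothetical and are not taken from the paper; the decision uses the tabulated two-sided 5% critical value of Student's t for the corresponding degrees of freedom rather than an exact p-value.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-sample t statistic and degrees of freedom for matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    d = [x - y for x, y in zip(a, b)]      # per-pair differences
    n = len(d)
    se = stdev(d) / math.sqrt(n)           # standard error of the mean difference
    return mean(d) / se, n - 1

# Hypothetical per-patch accuracies (%), NOT values from the paper.
mba = [91.2, 88.5, 93.1, 90.4, 89.8, 92.0]
swin = [89.0, 87.1, 90.2, 88.9, 88.0, 89.5]

t_stat, df = paired_t(mba, swin)
T_CRIT_DF5 = 2.571                         # two-sided 5% critical value, df = 5
significant = abs(t_stat) > T_CRIT_DF5
```

Because the test operates on within-pair differences, variation in difficulty between patches cancels out, which is why a paired design suits two models evaluated on the same test samples.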

Furthermore, to comprehensively assess the trade-offs between detection accuracy, computational complexity, and real-time applicability, we quantitatively compared the model parameters (Params), floating-point operations (FLOPs), and inference time across the different architectures (Table 7).

As shown in Table 7, MBA-Former maintains high efficiency (35 ms/patch) despite a minor increase in complexity (31.5 M params; 52.4 G FLOPs) relative to Swin-Unet. This overhead is justified by the gain in detection performance, specifically +2.88 percentage points in mIoU and +3.55 percentage points in infected-tree IoU. For large-scale regional surveys, which are typically processed offline, the enhanced boundary precision and error suppression of MBA-Former outweigh the marginal increase in computational requirements.
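To put the reported 35 ms/patch into operational terms, a back-of-the-envelope estimate of pure inference time can be sketched as follows. The scene dimensions and patch size are hypothetical values chosen only for illustration, the tiling is assumed to be non-overlapping, and data loading and mosaicking overheads are ignored.

```python
import math

def count_patches(height_px, width_px, patch_px):
    """Number of non-overlapping patches needed to tile a scene (edges padded)."""
    return math.ceil(height_px / patch_px) * math.ceil(width_px / patch_px)

def inference_seconds(n_patches, ms_per_patch=35.0):
    """Pure model time at the reported 35 ms/patch; I/O overhead is excluded."""
    return n_patches * ms_per_patch / 1000.0

# Hypothetical scene and patch size, chosen only for illustration.
n = count_patches(8192, 8192, 512)   # 16 x 16 = 256 patches
seconds = inference_seconds(n)       # 256 * 0.035 s = 8.96 s
```

Under these assumptions, the model processes roughly 28 patches per second, so inference time scales linearly with surveyed area and remains modest for offline processing.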

4.4. Visual Analysis

Because the ultimate application goal of this study is the identification of PWD-infected trees, we adopted a focused visualization strategy that maximizes the clarity of visual comparison, avoids the interference of redundant colors, and highlights the performance advantages of MBA-Former in addressing key challenges. Because spatial context is vital from a practical forestry perspective, we present the original high-resolution True-Color (RGB) imagery alongside the Ground Truth (GT) annotations in our comprehensive scene comparisons. For the model prediction results, all non-infected categories (i.e., healthy pine trees, other vegetation, etc.) were uniformly rendered as the background (black), with only the PWD-infected tree category highlighted in white. This binarized presentation makes the performance differences among models in detecting the core target directly visible. To provide a mechanistic justification for the improved performance and to demonstrate how the proposed modules address feature confusion and boundary adhesion, we present a detailed visual analysis consisting of three key aspects.
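The binarized rendering described above amounts to a one-line mapping from a multi-class prediction to a black-and-white image. The sketch below assumes an integer label map and a hypothetical class index `infected_id` for PWD-infected trees; the actual class indices used in the study are not stated here.

```python
def binarize_prediction(label_map, infected_id):
    """Render the infected-tree class as white (255) and all others as black (0)."""
    return [[255 if v == infected_id else 0 for v in row] for row in label_map]

# Toy 2 x 3 prediction with three classes; class 2 is assumed to be "infected".
toy = [[0, 2, 1],
       [2, 2, 0]]
rendered = binarize_prediction(toy, infected_id=2)
```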

4.4.1. Mitigating Feature Confusion via the MFM

To intuitively understand the contexts in which the gating network is activated, we visualized the internal attention maps generated by the MFM (Figure 8). As shown, the data-driven attention mechanism exhibits a clear spatial heterogeneity. In dense, homogeneous canopy areas, the gating network maintains a balanced or slightly higher weight (visualized in neutral white to light blue in Figure 8b) for the spectral-physiological modality, indicating its baseline reliability. Conversely, along complex forest edges, roads, or when encountering spectrally similar ground objects, the network adaptively shifts its focus, assigning dominant weights (visualized in deep red in Figure 8c, and correspondingly deep blue in Figure 8b) to the spatial-structural (textural) modality. This dynamic weighting visualization provides direct evidence that the MFM effectively mitigates feature confusion by intelligently leveraging complementary multi-modal features, relying on structural cues when spectral signatures become ambiguous.
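The weighting behaviour visible in Figure 8 can be illustrated with a minimal per-pixel gating sketch. It assumes that a single sigmoid gate g weights the spectral-physiological features and that the spatial-structural features receive the complementary weight 1 − g, which matches the mirrored colors described for Figure 8b,c but is not confirmed as the exact MFM formulation. In the actual module the gate value would come from a learned gating network; here `gate_logit` is simply passed in, and all names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(spec, tex, gate_logit):
    """Fuse two per-pixel feature vectors with a scalar gate.

    Returns (fused, g, 1 - g), where g weights the spectral-physiological
    features and 1 - g weights the spatial-structural (texture) features.
    """
    g = sigmoid(gate_logit)
    fused = [g * s + (1.0 - g) * t for s, t in zip(spec, tex)]
    return fused, g, 1.0 - g

# Homogeneous canopy: a neutral gate gives both modalities equal weight.
fused_mid, g_mid, _ = gated_fusion([1.0, 2.0], [3.0, 4.0], gate_logit=0.0)

# Ambiguous spectrum (e.g., a forest edge): a negative logit favours texture.
fused_edge, g_edge, w_tex = gated_fusion([1.0, 2.0], [3.0, 4.0], gate_logit=-4.0)
```

With a strongly negative logit, the fused vector approaches the texture features, which corresponds to the deep-red regions of the texture weight map along roads and forest edges.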

4.4.2. Refining Contours via the BA Module

To specifically illustrate the types of errors reduced by the Boundary-Aware module, we present high-resolution crop comparisons between the baseline model without boundary supervision and our final MBA-Former (Figure 9). As magnified in the visual comparison, the baseline model frequently suffers from boundary adhesion—erroneously merging adjacent infected tree crowns into a single, coarse polygon—and exhibits severe contour erosion. In stark contrast, MBA-Former, guided by explicit boundary supervision, successfully severs these false adhesions and restores the sharp, accurate geometric morphology of individual tree crowns. This visual improvement directly supports the substantial increase in the Boundary F1-Score.

4.4.3. Comprehensive Visual Comparison

Finally, we present the overall segmentation results across two representative challenging forest areas (Figure 10). The original RGB imagery (Figure 10a) allows the spatial locations of the target trees to be verified directly. As shown in Figure 10c, although the baseline Swin-Unet detected most PWD-infected tree targets, it exhibited two primary types of errors. First, minor but clearly visible false positives appeared in localized areas (highlighted by orange boxes), likely because other ground objects with spectral characteristics similar to those of PWD-infected trees confused the model. Second, in areas with densely clustered targets, the model suffered from severe boundary adhesion, erroneously merging multiple individual crowns into a single, coarse polygon (highlighted by red boxes).

In stark contrast, the proposed MBA-Former (Figure 10d) demonstrated significant superiority when addressing these typical scenarios. The false positives present in Swin-Unet were successfully suppressed in the MBA-Former results, as indicated by the empty orange boxes. This intuitively proves that the front-end MFM effectively enhances the model’s ability to discriminate spectrally similar but materially different objects by intelligently fusing multi-modal features, thereby reducing misclassifications caused by feature confusion. Furthermore, regarding boundary delineation capability, the adhesion phenomena were successfully resolved (as shown in the red boxes). The predicted contours of MBA-Former are highly consistent with the ground truth (Figure 10b); the patch edges are sharper and their shapes more regular. This fully confirms the effectiveness of the back-end BA module in refining spatial details and ensuring high-fidelity contour delineation.

In summary, although both models still exhibit certain limitations in capturing a few extremely tiny targets within this highly challenging, fragmented forest patch scenario, the qualitative comparison results are highly consistent with the quantitative metrics presented in Table 6. This visually provides strong evidence of the high precision and robustness of MBA-Former in complex scenarios.

5. Discussion

This study aimed to enhance the accuracy and robustness of automated PWD detection using high-resolution satellite imagery. To this end, we developed the innovative MBA-Former deep learning model. Experimental results demonstrate that the model achieved outstanding performance across all key metrics, significantly outperforming a range of baseline models, including U-Net, DeepLabV3+, and Swin-Unet V2. This section delves into the mechanisms underlying the model’s performance advantages, provides a comparison with existing research, and analyzes the limitations of the current method and future research directions.

5.1. The Importance of Synergistic Feature Representation and Boundary Refinement

The performance enhancement of this study primarily stems from targeted improvements to two critical bottlenecks in existing deep learning methods: feature fusion strategies and boundary processing mechanisms.

First, regarding feature fusion, previous remote sensing studies on PWD monitoring often employed simple channel-stacking to combine heterogeneous features such as spectra and textures [19,21]. This non-data-driven fusion approach is ill-equipped to handle the feature ambiguity caused by the “different objects, same spectrum” phenomenon in complex forest landscapes. To overcome this, rather than relying on traditional machine learning feature selection methods (e.g., Random Forest importance scores), which decouple the feature evaluation from the end-to-end learning process, we introduced the Multi-modal Gating Fusion Module (MFM). The MFM draws upon attention concepts from multi-modal learning, utilizing a gating network to adaptively weight the spectral-physiological and spatial-structural modalities. This enables the model to dynamically and intelligently balance the importance of different features based on the local context. As visualized in the internal attention maps (Figure 8), the MFM explicitly addresses feature confusion by assigning dominant weights to spatial-structural (textural) features in highly heterogeneous boundary zones or when encountering spectrally ambiguous objects, while relying on spectral signatures in homogeneous healthy canopies. This context-aware discrimination aligns with recent trends in remote sensing, demonstrating that explicit multi-modal attention mechanisms are fundamentally more effective than simple feature concatenation [40,41].

Second, concerning boundary processing, standard pixel-wise loss functions provide insufficient optimization for the minority of pixels that constitute boundaries, leading to blurred segmentation contours—a common problem in fine-scale forestry remote sensing mapping [29,33]. Our Boundary-Aware (BA) module addresses this by constructing a parallel boundary prediction branch and a joint loss function, thereby internalizing boundary learning within the model training process. By explicitly predicting boundary probabilities, the BA module effectively reduces prevalent spatial errors in standard attention-based models, particularly the “adhesion” of adjacent infected tree crowns and the “spillover” of segmentation contours. As shown in Table 5, the Boundary F1-Score of MBA-Former reached 78.62%, a 4.46% improvement over the best-performing baseline model, Swin-Unet V2. This convincingly demonstrates that an end-to-end dual-task learning paradigm is more effective in enhancing the spatial localization accuracy and contour fidelity of segmentation results than relying solely on the intrinsic feature extraction capabilities of the backbone network.

5.2. Performance Comparison with Previous Research

Compared to the MBA-Former proposed in this study, previous research on satellite-based remote sensing of PWD exhibits notable differences in both methodology and performance. While traditional feature engineering approaches, such as Object-Based Image Analysis (OBIA) or classical Machine Learning (ML), have been widely used, they typically rely on manually crafted thresholds and struggle to generalize across diverse geographic regions, often hitting performance ceilings around 75%–80% overall accuracy [20,42]. In contrast, deep learning methods automatically learn hierarchical representations.

For instance, Ye et al. [30] utilized Landsat 8 imagery and a multi-scale attention U-Net model based on fine-grained manual annotations to achieve automated detection of PWD-affected areas. However, due to the spatial resolution limitations of Landsat 8 imagery, severe mixed pixel effects made it difficult to effectively distinguish individual or small clusters of infected trees. Furthermore, the data used in that study lacked the near-infrared and red-edge information most sensitive to vegetation stress, preventing the model from capturing the typical spectral characteristics of the disease and resulting in low recall (57.38%) and overall performance. Wang et al. [43] employed a semi-supervised generative adversarial network and GF-2 imagery for PWD detection, achieving an overall identification accuracy of only 68.63%. However, this semi-supervised architecture suffers from inherent drawbacks, including optimization instability and limited model capacity. Its lack of a fine-grained classification strategy for highly heterogeneous backgrounds led to weak performance in morphological feature extraction, with outputs accompanied by significant spatial noise and a need for improved generalization capability.

In contrast, the proposed MBA-Former achieved a mean Intersection over Union (mIoU) of 81.74%, a core metric for segmentation quality. The model fully leverages a multi-head scaled dot-product self-attention mechanism, combined with an innovative shifted window strategy, enabling it to capture long-range dependencies in the image while maintaining computational efficiency. Building upon this foundation, our MFM and BA modules further unlocked the model’s potential, ultimately achieving a decisive performance advantage in the most critical task of infected tree identification (IoU: 77.58%). This systematic design, based on intelligent feature fusion and boundary refinement, effectively mitigates the impacts of spectral feature ambiguity and spatial noise, thereby endowing the model with exceptional generalization capability and robustness in complex backgrounds.

5.3. Limitations and Future Perspectives

Although the proposed MBA-Former achieved remarkable detection performance in complex forest landscapes, certain limitations persist at the practical application level, which also point to directions for future research.

First, the constraints of spatial resolution represent a primary limitation of this study. The 0.8 m resolution of GF-2 imagery is still susceptible to mixed pixel effects when dealing with the complex canopy structures of natural forests, complicating the localization of individual infected trees. Furthermore, we acknowledge that PWD is a disturbance causing a progressive, non-uniform decline in tree health. While our ultra-high spatial resolution (<0.1 m) RGB UAV imagery provided exceptional clarity for visually delineating mid-to-late stage discoloration, it fundamentally lacks the spectral bands necessary for detecting early-stage stress. The limitations of the spectral dimension are equally important. GF-2 imagery comprises only four standard multispectral bands and cannot capture the subtle biochemical signals of early vegetation stress, such as shifts in the red-edge position. Therefore, achieving robust monitoring during the early stages of PWD infection, before significant visual discoloration of the needles occurs, remains a critical challenge for current satellite remote sensing methods.

To address these challenges, future research should focus on the deep synergy and fusion of multi-source data. For instance, the incorporation of LiDAR point cloud data could be considered. Its penetrative capability can provide three-dimensional vertical structure information of the forest, which could effectively mitigate the interference of canopy overlap and shadows, thereby further improving segmentation accuracy at the individual tree level [44]. Additionally, building upon high-resolution satellite imagery, exploring the fusion of data with higher spectral resolution, such as hyperspectral data or Sun-Induced Chlorophyll Fluorescence (SIF) sensor data [45], holds significant promise for capturing the subtle biochemical anomalies of early-stage disease. Moreover, future ground-truth validation protocols should ideally incorporate multispectral UAV imagery to better align with satellite sensor bands and further enhance validation rigor for early-stage detection. Meanwhile, our model’s performance relies on high-quality manual annotations, which is a major bottleneck in forestry remote sensing. Exploring weakly supervised or semi-supervised learning paradigms to reduce annotation costs is an active research direction.

Finally, as a high-performance architecture, the exceptional segmentation accuracy of MBA-Former is inevitably accompanied by higher computational complexity. Although its current inference speed (~35 ms/patch) is highly acceptable for offline regional monitoring, future research should focus on the lightweight evolution of the algorithm. Exploring more efficient linear attention mechanisms could substantially reduce computational costs while maintaining robustness, meeting the operational deployment demands for national-level automated forest disease surveys.

6. Conclusions

This study addresses the critical challenge of high-precision Pine Wilt Disease (PWD) monitoring using high-resolution satellite imagery, a task fundamentally constrained by severe feature confusion and inaccurate boundary localization in complex forest environments. We successfully developed and evaluated the MBA-Former, an innovative end-to-end deep learning model designed to overcome these limitations.

Our core findings demonstrate that the synergistic integration of intelligent feature fusion and explicit boundary supervision significantly elevates segmentation performance beyond the capabilities of current state-of-the-art models. Specifically, the proposed Multi-modal Gating Fusion Module (MFM) dynamically mitigates severe feature confusion by adaptively weighting spectral and textural modalities based on local context. Concurrently, the Boundary-Aware (BA) module successfully resolves spatial localization inaccuracies, effectively suppressing the contour adhesion and erosion commonly observed in standard attention-based networks. This synergistic design ensures high-fidelity delineation of infested tree crowns even in highly heterogeneous mixed forests, culminating in a remarkable mean Intersection over Union (mIoU) of 81.74% and a Boundary F1-Score of 78.62% on our rigorous, multi-regional test set.

The exceptional accuracy and robust generalization capability exhibited by MBA-Former underscore its profound practical significance. By fully leveraging high-resolution satellite data, this model provides a highly reliable, cost-effective algorithmic engine for large-scale, automated forest disease surveys. Ultimately, the MBA-Former offers critical technical support for forestry management departments to rapidly locate epidemic centers, formulate precise eradication strategies, and safeguard the structural integrity of forest ecosystems.

Author Contributions

Conceptualization, R.H.; methodology, R.H., B.Z., Y.Z. and W.H.; software, R.H.; validation, R.H.; formal analysis, R.H.; investigation, R.H., B.Z., Y.W. and Z.H.; data curation, R.H. and Y.Z.; writing—original draft preparation, R.H.; writing—review and editing, R.H., B.Z. and W.H.; supervision, B.Z., Q.J., J.Y. and W.H.; funding acquisition, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Intergovernmental International Science and Technology Innovation Cooperation Program under National Key Research and Development Plan of China (2024YFE0198600), National Natural Science Foundation of China (42201355), Sino-UK Crop Pest and Disease Forecasting and Management Joint Laboratory (183611ZYLH20240009), SINO-EU Dragon 6 proposal (ID 95250), GEO-PDRS: Global Vegetation Pest and Disease Dynamic Remote Sensing Monitoring and Forecasting (Post-2025 GEO Work Programme) and the BioClima project funded by the European Union’s Horizon Europe research and innovation programme under the Grant Agreement no. 101181408.

Data Availability Statement

Gaofen-2 images can be requested from the China Centre for Resources Satellite Data and Application (CRESDA, https://data.cresda.cn/#/2dMap, accessed on 15 October 2025), and the high-resolution UAV visible remote sensing imagery and software program used in this study are publicly available at https://doi.org/10.17632/h5wxb8sd78.1.

Acknowledgments

We wish to thank the editorial office staff for their assistance with professional typesetting, as well as the anonymous reviewers for their insightful comments and constructive support, which significantly improved the quality of this work.

Conflicts of Interest

We wish to transparently disclose that Professor Biyao Zhang, the corresponding author of this manuscript, is currently serving as a Guest Editor for this Special Issue. To ensure a strictly independent, transparent, and objective peer-review process, we respectfully request that the handling of this submission be assigned to an independent Editorial Board Member or the Editor-in-Chief in accordance with MDPI’s Conflict of Interest policies.

Figure 1. Overview of the study area and data: (a) Liaoning Province, Hunan Province and Zhejiang Province; (b) Fushun City; (c) Changsha City; (d) Hangzhou City.

Figure 2. Comparison of satellite and UAV imagery for high-precision ground truth generation: (a) preprocessed GF-2 satellite imagery (0.8 m resolution); (b) ultra-high-resolution UAV RGB orthomosaic used as the visual reference for manual annotation.

Figure 3. Overall technical workflow of the study.

Figure 4. Architecture of the proposed MBA-Former model.

Figure 5. Detailed architecture of the Multi-modal Gating Fusion Module (MFM).

Figure 6. Detailed architecture of the Boundary-Aware (BA) module.

Figure 7. Training dynamics of the evaluated models: (a) Validation loss trajectories; (b) Mean Intersection over Union (mIoU) curves during the training process.

Figure 8. Visualization of the adaptive attention mechanism within the Multi-modal Gating Fusion Module (MFM): (a) original high-resolution GF-2 true-color image patch featuring a complex mix of dense pine forest, clear boundaries, and various ground objects; (b) learned attention weight map for the spectral-physiological modality; (c) learned attention weight map for the spatial-structural (textural) modality. In both heatmaps, warmer colors (red) indicate higher attention weights (approaching 1), while cooler colors (blue) indicate lower weights (approaching 0).

Figure 9. Effect of the Boundary-Aware (BA) module on segmentation contour refinement: (a) input RGB image; (b) ground truth; (c) segmentation results without the BA module; (d) segmentation results with the full MBA-Former. The red bounding boxes highlight areas where the BA module mitigates the adhesion of closely spaced infected tree crowns, yielding sharper and more accurate boundaries.

Figure 10. Visual comparison of segmentation results across two representative challenging forest scenes: (a) original high-resolution true-color (RGB) imagery; (b) ground truth (GT) annotations; (c) predictions by the best-performing baseline model, Swin-Unet; (d) predictions by the proposed MBA-Former. The orange boxes highlight typical false positives generated by the baseline, i.e., misclassifications of spectrally similar background objects. The red boxes indicate areas of severe boundary adhesion and erosion in the baseline results. Non-infected classes are uniformly rendered as a black background to maximize visual contrast with the core target (infected pines, shown in white).

Table 1. Specifications of GF-2 satellite remote sensing imagery.

Province	Number of Images	Spatial Resolution	Cloud Cover	Acquisition Date	Coverage Area (km²)
Hunan	2	0.8 m	<5%	28 September 2023	793.86
Zhejiang	1	0.8 m	<5%	21 October 2023	402.35
Liaoning	3	0.8 m	<5%	5 September 2022	1215.77

Table 2. Composition of the sample dataset.

Dataset	Total Samples (Before Data Balancing)	Positive Samples	Negative Samples	Final Samples (After Data Augmentation)
Training set	15,004	1384	13,620	19,156
Validation set	1876	169	1707	1876
Test set	1876	195	1681	1876
Total	18,756	1748	17,008	22,908
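
The counts in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the published numbers; the variable names are illustrative. The table does not say how the additional training samples were produced, so reading the increase as a fixed number of extra copies per positive patch is an inference from the arithmetic, not a reported procedure.

```python
# Sample counts transcribed from Table 2: (total, positive, negative, final)
splits = {
    "train": (15004, 1384, 13620, 19156),
    "val": (1876, 169, 1707, 1876),
    "test": (1876, 195, 1681, 1876),
}

for name, (total, pos, neg, final) in splits.items():
    # Positives and negatives should partition each split.
    assert pos + neg == total
    print(f"{name}: positive share = {pos / total:.1%}, added = {final - total}")

# Only the training split grows between the "total" and "final" columns.
added_train = splits["train"][3] - splits["train"][0]
# Ratio of added samples to original positive training samples (inference only).
copies_per_positive = added_train / splits["train"][1]
print(added_train, copies_per_positive)
```

The validation and test splits are unchanged between the two sample-count columns, so the reported metrics are computed on the original class distribution of roughly one positive patch in ten.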

Table 3. Unified hyperparameter configurations for all benchmark models.

Parameter	Configuration Value
Optimizer	AdamW
Initial Learning Rate	3 × 10⁻⁴
Learning Rate Scheduler	Cosine Annealing
Batch Size	64
Total Epochs	200
Weight Decay	1 × 10⁻⁵
Input Patch Size	256 × 256 pixels
Data Normalization	Z-score standardization
Boundary Loss Weight (α)	0.7
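
To make the learning-rate settings in Table 3 concrete, the following minimal sketch evaluates a plain cosine annealing schedule from the listed initial rate over the listed number of epochs. The table gives neither a minimum learning rate nor a warmup phase, so `MIN_LR = 0.0`, the absence of warmup, and per-epoch (rather than per-iteration) updates are all assumptions made for illustration.

```python
import math

BASE_LR = 3e-4      # initial learning rate (Table 3)
TOTAL_EPOCHS = 200  # total epochs (Table 3)
MIN_LR = 0.0        # assumption: the floor of the schedule is not reported


def cosine_lr(epoch: int) -> float:
    """Learning rate at a given epoch under plain cosine annealing, no warmup."""
    progress = epoch / TOTAL_EPOCHS
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, quarter points, and end of training.
for epoch in (0, 50, 100, 150, 200):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

Under these assumptions the rate halves at epoch 100 and reaches the floor at epoch 200; a non-zero floor or a warmup phase would change the early and late values but not the overall shape.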

Table 4. Results of the ablation study on different feature combinations.

Case	Input Feature Composition	Input Channels	mIoU (%)	F1-Score (%)	Boundary F1 (%)
1	Base (Spectral Only)	4	78.53	86.95	74.28
2	Base + VIs	6	80.81	88.31	77.15
3	Base + Tex	6	79.92	87.80	76.53
4	Base + VIs + Tex (Full)	8	81.74	89.68	78.62

Table 5. Results of the ablation study on different model modules.

ID	Model Configuration	MFM	BA	mIoU (%)	Infested Pine IoU (%)	Boundary F1 (%)
a	Swin-Unet V2			78.86	74.03	74.16
b	MFM only	✓		80.92	75.79	75.98
c	BA only		✓	80.15	74.61	78.23
d	MBA-Former (Full)	✓	✓	81.74	77.58	78.62
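
The module contributions in Table 5 are easiest to read as differences from the baseline row. The sketch below recomputes those differences from the published values; the dictionary keys are illustrative labels, and the differences are percentage points, not relative changes.

```python
# (mIoU, infested-pine IoU, boundary F1) in percent, transcribed from Table 5
results = {
    "baseline": (78.86, 74.03, 74.16),
    "mfm_only": (80.92, 75.79, 75.98),
    "ba_only": (80.15, 74.61, 78.23),
    "full": (81.74, 77.58, 78.62),
}


def gain(config: str) -> tuple:
    """Percentage-point gain of a configuration over the baseline, per metric."""
    base = results["baseline"]
    return tuple(round(v - b, 2) for v, b in zip(results[config], base))


for config in ("mfm_only", "ba_only", "full"):
    print(config, gain(config))
```

Read this way, the full model's gain in infested-pine IoU (3.55 points) is larger than the sum of the two single-module gains (1.76 + 0.58 = 2.34 points), whereas its mIoU gain (2.88 points) is smaller than the corresponding sum (2.06 + 1.29 = 3.35 points).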

Table 6. Performance comparison between MBA-Former and state-of-the-art (SOTA) models.

Model	Backbone	Model Type	mIoU (%)	Infested Pine IoU (%)	Others IoU (%)	Boundary F1 (%)
U-Net	Simple CNN	CNN-based	73.58 ± 0.52	69.15 ± 0.61	75.42 ± 0.48	68.46 ± 0.55
DeepLabV3+	ResNet-50	CNN-based	75.05 ± 0.45	71.28 ± 0.53	76.81 ± 0.44	70.73 ± 0.49
ConvNeXt-Unet	ConvNeXt-T	Modern CNN	78.23 ± 0.41	73.14 ± 0.47	80.15 ± 0.39	73.49 ± 0.42
Swin-Unet V2	Swin-T V2	Transformer-based	78.86 ± 0.38	74.03 ± 0.42	80.92 ± 0.35	74.16 ± 0.39
MBA-Former (Ours)	Swin-T V2	Transformer-based (Enhanced)	81.74 ± 0.32	77.58 ± 0.35	83.15 ± 0.28	78.62 ± 0.31

Table 7. Comparison of computational complexity and inference efficiency.

Model	Params (M)	FLOPs (G)	Inference Time (ms/patch)	mIoU (%)
U-Net	14.7	32.5	18	73.58
DeepLabV3+	41.2	75.3	42	75.05
ConvNeXt-Unet	28.5	40.1	29	78.23
Swin-Unet V2	27.2	45.1	31	78.86
MBA-Former (Ours)	31.5	52.4	35	81.74
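
The cost of the added modules can be expressed relative to the Swin-Unet V2 baseline using the Table 7 figures. This is a back-of-the-envelope sketch: the throughput estimate assumes patches are processed one at a time at the listed latency, which the table does not state, and it ignores any hardware or batching details.

```python
# (params in M, FLOPs in G, inference ms per patch, mIoU in %) from Table 7
swin_unet_v2 = (27.2, 45.1, 31, 78.86)
mba_former = (31.5, 52.4, 35, 81.74)

# Relative overhead of MBA-Former for the three cost columns.
labels = ("params", "FLOPs", "latency")
overhead = {
    label: (new - old) / old * 100.0
    for label, old, new in zip(labels, swin_unet_v2, mba_former)
}
for label, pct in overhead.items():
    print(f"{label}: +{pct:.1f}%")

# Assumption: one patch per forward pass at the listed latency.
patches_per_second = 1000.0 / mba_former[2]
miou_gain = mba_former[3] - swin_unet_v2[3]
print(f"{patches_per_second:.1f} patches/s, +{miou_gain:.2f} mIoU points")
```

On these numbers, MBA-Former costs roughly 13 to 16% more than the baseline in parameters, FLOPs, and latency in exchange for the 2.88-point mIoU gain.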
