Pine-YOLO: A Method for Detecting Pine Wilt Disease in Unmanned Aerial Vehicle Remote Sensing Images

Abstract: Pine wilt disease is a highly contagious forest quarantine disease that spreads rapidly. In this study, we designed a new Pine-YOLO model for pine wilt disease detection by incorporating Dynamic Snake Convolution (DSConv), the Multidimensional Collaborative Attention Mechanism (MCA), and the Wise-IoU v3 (WIoUv3) loss function into the YOLOv8 network.


Introduction
Pine trees are widely distributed worldwide and primarily grow in the temperate regions of the Northern Hemisphere, such as Russia, Canada, the United States, China, Japan, and Sweden. They are highly significant forest tree species that play a crucial role in protecting the ecosystem and maintaining the carbon balance. Nevertheless, pine trees are vulnerable to pests and diseases. Pine wilt disease is a highly contagious forest quarantine disease that spreads rapidly and causes significant mortality [1]. The disease is transmitted both artificially and naturally. Artificial transmission refers to the movement of contaminated wood and its products, as well as the use of infected wood as packaging in commercial transactions [2]. Natural transmission occurs mainly when Monochamus beetles and other insect vectors carry the infection to new pine trees [3]. The pine wood nematode that causes the disease is native to North America, but its distribution and damage range have expanded dramatically to many countries, such as China, Japan, Korea, Portugal, and Spain [4]. The pine wood nematode has been spreading in China as an invasive species since 1982, resulting in the death of large numbers of pine trees [5]. This continuously reduces forest carbon sinks and thus causes huge economic losses worldwide [6]. The primary strategies to prevent further spread of the pine wood nematode are physical control, chemical control, and biological control. The first two methods are the most efficient but harm the ecological environment, whereas biological control takes too long to become effective. Most importantly, all three methods require precise surveillance of infected trees so that they can be treated effectively and promptly.
There are three main ways to monitor pine wilt disease, namely, manual inspection, satellite or aerial remote sensing monitoring, and unmanned aerial vehicle (UAV) monitoring. Manual inspections are labor-intensive, inefficient, and costly, and are further constrained by the uneven geographical distribution of pine trees and the complex and variable environments in which they grow [7,8]. Satellite or aerial remote sensing monitoring can achieve large-area spatial coverage, but it is limited by sensor resolution and satellite revisit cycles. UAV remote sensing has been increasingly used in agriculture and forestry because it offers higher flexibility, lower cost, and higher resolution than satellite remote sensing [9,10]. Remote sensing therefore plays a crucial role in monitoring forest health.
To process the massive volume of remote sensing images of pine trees acquired by satellites or UAVs, machine learning (ML) has been extensively used and developed. Syifa et al. [11] divided the images based on GPS into classes such as trees indicating pine wilt disease (PWD), normal pine, buildings, and roads. Their study enhanced the ability to distinguish buildings or roads with colors similar to those of PWD-indicated trees. Oide et al. [12] combined visible color imagery and ML algorithms, but were not concerned with detecting the infection stages of individual trees. To detect PWD earlier, Iordache et al. [13] and Yu et al. [14] assigned infected trees to different classes and distinguished early infections from other stages. Wu et al. [15] identified the green attack stage as the key issue for early monitoring and compared the detection accuracy across different dates. Traditional machine learning requires manually constructing highly separable features and selecting appropriate classifiers, which is not suitable for training on large-scale data. It also struggles to adapt to diverse and complex scenarios, which limits its practicality. For instance, overlapping tree canopies make it difficult to distinguish the target from surrounding trees, and similarly colored backgrounds and other dead trees cause detection confusion.
Unlike traditional machine learning, deep learning algorithms are suited to training on large-scale raw datasets and to various complex scenarios. Deep learning has therefore been applied more widely to remote sensing, owing to its powerful automatic feature learning without human intervention. It is common practice to employ diverse methods to augment remote sensing images, such as mirroring, flipping, adding noise, rotating, and scaling [16][17][18]. Cai et al. [19] proposed an effective data augmentation method based on Sentinel-2 satellite data and UAV images to detect PWD efficiently. Zhang et al. [20] corrected 5-band multi-spectral images and visualized them as heat maps to propose a patch-based deep classification. Many researchers have also evaluated [21,22] or improved models [23][24][25][26][27][28], for example by optimizing neural networks. Deng et al. [29] improved Faster-RCNN based on RPN and added a geographic location module. Li et al. [30] proposed YOLOv4-Tiny-3Layers to filter out irrelevant images. Abdollahnejad et al. [31] used UAV images as reference data and combined high-resolution satellite platforms with time series data to evaluate and predict forest health status. Ren et al. [32] proposed a Global Multi-Scale Channel Adaptation Network based on circle sampling to better match the circular shape of the diseased trees. Zhang et al. [33] improved YOLOv5 with four attention mechanism modules to detect smaller infected trees in images that cover a large area. Qin et al. [34] designed SCANet (spatial-context-attention network) and Han et al. [35] proposed a multi-scale spatial supervision convolutional network (MSSCN) to reduce the loss of spatial information and detect trees in complex backgrounds.
In comparison, YOLO has greatly increased the speed of detection. Additionally, it excels in learning generalized features of detection targets, thereby reducing background recognition errors. Although extensive research has been conducted on applying YOLO to the identification of pine wilt disease, challenges remain, including accurate detection in complex backgrounds and the identification of subtle tubular structures in pine tree branches. Inspired by previous research, we adopted the highly precise and adaptable YOLOv8 model as our baseline network and developed a new detection model named Pine-YOLO. This model integrates DSConv (Dynamic Snake Convolution) and incorporates MCA (Multidimensional Collaborative Attention Mechanism) along with WIoUv3 (Wise-IoU v3). We acquired images of pine forests in the Weihai area via UAV remote sensing and then used these data to train and evaluate the new Pine-YOLO model. We found that this model can effectively extract fine and curved structures in a complex natural environment, reaching a high detection accuracy of about 90% (mAP@0.5).

Image Acquisition
In this work, we used images from Beihai Forest and Linhai Park, located in Weihai City (37°30′ N, 122°6′ E), Shandong Province, China. More than 70% of the vegetation in this area consists of pine trees, some of which are already infected by the nematode, as depicted in Figure 1. The research area has an extensive coastline, is situated in the north temperate zone, and is characterized by a monsoon continental climate, with black pine being the predominant species. The research area exhibits a rich age structure, encompassing growth stages from seedlings to mature trees. Vertically, the forest can be divided into three main layers: the uppermost layer consists of mature, tall pine trees forming the canopy; the middle layer is primarily composed of mid-height pine trees and a few other tree species, adding to the vertical complexity of the forest; the ground layer is dominated by shrubs and herbaceous plants. Horizontally, the trees are not uniformly distributed but form an alternating pattern of dense areas and open spaces, indicating a high level of structural complexity in the forest.
A DJI (DaJiang) UAV equipped with a DG2pro CMOS sensor was employed as the flying platform for data acquisition, conducting six flights at altitudes ranging from 180 m to 240 m. The sizes of the resulting orthorectified images are 128,601 × 62,669 pixels, 126,365 × 90,989 pixels, 48,236 × 47,168 pixels, and 57,653 × 48,979 pixels, amounting to a total coverage area of approximately 22.19 km². Using a custom program that we developed in PyCharm, the original images were segmented into 43,095 image slices of 1024 × 1024 pixels by a sliding window method with a 20% overlap, as demonstrated in Figure 2b. The geographic coordinates of all diseased pine tree samples were individually labeled and verified on-site. For training, the image dataset containing diseased pine trees was randomly divided into a training dataset and a validation dataset in a ratio of 8:2. The data were collected between 27 September and 8 October 2022, during which the weather was predominantly cloudy, with two days of rainfall. This period was chosen to mitigate the confounding effects of drought stress and of the discoloration observed in deciduous broad-leaved trees. This timing not only avoids the phenotypic variations commonly induced by environmental stressors, but also ensures the specificity of the phenotypic features associated with pine wilt disease (PWD) in our study. Collecting the dataset during this period further helped to minimize the inclusion of tubular, locally elongated, and curved branch structures that might arise from conditions other than PWD, thereby enhancing the detection accuracy of our improved method, which focuses on enhancing tubular structures.
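The sliding-window slicing and the 8:2 split described above can be sketched as follows. This is a minimal illustration rather than the custom program used in the study: the function names, the handling of the trailing image edge (the last tile is shifted back so that it stays inside the image), and the fixed random seed are all assumptions, so the tile count it produces need not match the 43,095 slices reported.

```python
import random


def tile_origins(length, tile=1024, overlap=0.2):
    """Start offsets of tiles along one axis (assumed edge handling: shift the last tile back)."""
    stride = int(tile * (1 - overlap))  # 819 px for 1024 px tiles with 20% overlap
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:  # cover the strip left over at the far edge
        origins.append(length - tile)
    return origins


def tile_boxes(width, height, tile=1024, overlap=0.2):
    """(left, top, right, bottom) pixel boxes for every tile of a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


def split_train_val(items, train_fraction=0.8, seed=0):
    """Random 8:2 split; the seed only makes this sketch reproducible."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    boxes = tile_boxes(48236, 47168)  # one of the orthomosaic sizes given in the text
    train, val = split_train_val(boxes)
    print(len(boxes), len(train), len(val))
```

Adjacent tiles share about 205 pixels, so a diseased crown cut by one tile border is usually seen whole in a neighboring tile.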

Image Pre-Processing
In this study, we adopt the mosaic data pre-processing technique integrated within the YOLOv8 framework to increase the variety of the training dataset, thereby strengthening the generalization ability of the model. This technique randomly combines four different images into a single image through random scaling, cropping, and arrangement, as illustrated in Figure 3. This approach not only adds new variations to the small-sized targets in the training samples, but also helps to achieve a balanced distribution between labeled and unlabeled diseased pine trees in the dataset.
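The idea behind mosaic augmentation can be illustrated with a deliberately simplified sketch. It is not the YOLOv8 implementation: images are plain nested lists, each quadrant is filled by cropping the top-left region of its source image, and the random rescaling and the matching shift of the bounding-box labels are omitted. The function name `mosaic` and its parameters are hypothetical.

```python
import random


def mosaic(images, size, rng):
    """Combine four size x size images (nested lists) into one size x size mosaic.

    Simplification: every quadrant is a top-left crop of its source image;
    random rescaling and label remapping are left out.
    """
    assert len(images) == 4
    # Random split point, kept away from the borders so every quadrant is non-empty.
    cx = rng.randint(size // 4, 3 * size // 4)
    cy = rng.randint(size // 4, 3 * size // 4)
    canvas = [[0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            # Quadrant index: 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right.
            quadrant = (0 if y < cy else 2) + (0 if x < cx else 1)
            # Coordinates relative to the quadrant's own top-left corner.
            sy = y if y < cy else y - cy
            sx = x if x < cx else x - cx
            canvas[y][x] = images[quadrant][sy][sx]
    return canvas


if __name__ == "__main__":
    size = 8
    sources = [[[value] * size for _ in range(size)] for value in (1, 2, 3, 4)]
    for row in mosaic(sources, size, random.Random(0)):
        print(row)
```

Because the split point changes for every sample, the same tree appears at different positions and next to different neighbors across epochs.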

Pine-YOLO Network Structure
YOLOv8 shows excellent detection speed and accuracy, and it is composed of four key components: the Input module, the Backbone module, the Neck module, and the Head module. The Input module primarily uses techniques such as mosaic data augmentation, dynamic anchor box computation, and grayscale augmentation. The Backbone module includes Conv, C2f, SPPF, and other components, where C2f learns residual features and enriches the gradient flow of the model through cross-layer branch connections. The Neck module still follows the PAN-FPN design to enhance the fusion of object features at different scales. The Head module adopts a decoupled head structure that calculates the confidence and location of the final detected target based on the enhanced features [36].
In the detection of pine wilt disease, the morphological features in the dataset are a critical factor in obtaining good recognition results. However, images of diseased pine trees taken via UAV remote sensing are particularly sensitive to variations in lighting and shadow conditions, and infected pine trees also differ from one another in color and texture. Therefore, when the standard YOLOv8 network is used to detect pine wilt disease in UAV remote sensing images, it usually outputs a significant number of false negatives and false positives. To address these disadvantages, we integrated the DSConv module into the YOLOv8 network, along with the MCA module and the WIoUv3 loss function, thus developing a new Pine-YOLO model. This model improves feature extraction from pine trees and recognition amidst background interference, thereby reducing false negatives and false positives throughout training and testing. The network structure of Pine-YOLO is shown in Figure 4.

Dynamic Snake Convolution (DSConv)
In the overhead view of the UAV remote sensing images, we noted that in addition to the obvious color features, the branches of diseased pine trees are topologically tubular, locally elongated, and curved.
As shown in Figure 5, adding DSConv enables YOLOv8 to learn geometric variations freely, because the perception of geometric structures is improved by adaptively focusing on the fragile and curved local features of the tubular structure. For fine tubular structures, this approach takes the serpentine morphology of the structure into account and uses constraints to complement the free learning process, enhancing the perception of fine tubular structures in discolored pine branches [37].
DSConv enhances target recognition using deformation offsets, which allow the convolutional kernel to focus flexibly on the complex and variable geometric features of the target. Additionally, an iterative strategy is employed to prevent the receptive field from drifting away from the target during the free learning of these deformation offsets. This strategy selects the next position to be observed in sequence, which ensures the continuity of attention and prevents the receptive field from spreading too far because of excessive deformation offsets.
DSConv defines a convolution kernel $G$ of size 9 in the x-axis and y-axis directions. Each grid in $G$ is represented as $G_{i\pm a} = (x_{i\pm a}, y_{i\pm a})$, where $a = \{0, 1, 2, 3, 4\}$ is the horizontal distance between the grid and the central position. The choice of each grid point $G_{i\pm a}$ in the kernel is a cumulative procedure. Starting from the central grid $G_i$, the position of each grid depends on the position of the previous grid: $G_{i+1}$ has an additional offset $\Delta = \{\delta \mid \delta \in [-1, 1]\}$ compared with $G_i$. The offsets therefore need to be accumulated ($\sum$) to ensure that the convolution kernel conforms to a linear structural shape. As shown in Figure 6, $G_{i\pm a}$ in the x-axis direction becomes:

$$G_{i\pm a} = \begin{cases} (x_{i+a}, y_{i+a}) = \left(x_i + a,\; y_i + \sum_{i}^{i+a} \Delta y\right) \\ (x_{i-a}, y_{i-a}) = \left(x_i - a,\; y_i + \sum_{i-a}^{i} \Delta y\right) \end{cases} \tag{1}$$

and $G_{j\pm a}$ in the y-axis direction becomes:

$$G_{j\pm a} = \begin{cases} (x_{j+a}, y_{j+a}) = \left(x_j + \sum_{j}^{j+a} \Delta x,\; y_j + a\right) \\ (x_{j-a}, y_{j-a}) = \left(x_j + \sum_{j-a}^{j} \Delta x,\; y_j - a\right) \end{cases} \tag{2}$$

The bilinear interpolation formula is written as follows:

$$G = \sum_{G'} D(G', G) \cdot G' \tag{3}$$

where $G$ is a fractional position given by Equations (1) and (2), $G'$ enumerates all integral spatial positions, and $D$ is the bilinear interpolation kernel, which is separated into two one-dimensional kernels:

$$D(G, G') = d(G_x, G'_x) \cdot d(G_y, G'_y) \tag{4}$$

As shown in Figure 6, owing to the two-dimensional (x-axis, y-axis) variations, DSConv covers a 9 × 9 range during deformation, which gives it better adaptability to slender tubular structures and improves the perception of key features.
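The cumulative offset rule and the bilinear sampling described above can be illustrated with a small pure-Python sketch. It is not the DSConv implementation: the offsets are passed in as plain numbers instead of being learned, only the x-axis case is shown, and the one-dimensional kernel is assumed to be the usual bilinear form max(0, 1 − |p − q|). The function names are hypothetical.

```python
import math


def snake_coords_x(xi, yi, dy_offsets, reach=4):
    """Kernel positions for the x-axis case: x moves in unit steps, y accumulates offsets.

    dy_offsets maps a signed step a (from -reach to +reach, excluding 0) to the
    offset applied at that step; each offset is expected to lie in [-1, 1].
    """
    coords = {0: (float(xi), float(yi))}
    for sign in (1, -1):
        y = float(yi)
        for a in range(1, reach + 1):
            y += dy_offsets[sign * a]  # cumulative sum keeps neighboring grids connected
            coords[sign * a] = (float(xi + sign * a), y)
    return [coords[a] for a in range(-reach, reach + 1)]


def bilinear_sample(feature, x, y):
    """Sample a 2D nested list at a fractional (x, y) position.

    Uses the separable kernel d(p, q) = max(0, 1 - |p - q|) in each direction;
    positions outside the map contribute zero.
    """
    height, width = len(feature), len(feature[0])
    total = 0.0
    for gy in (math.floor(y), math.floor(y) + 1):
        for gx in (math.floor(x), math.floor(x) + 1):
            if 0 <= gx < width and 0 <= gy < height:
                weight = max(0.0, 1 - abs(x - gx)) * max(0.0, 1 - abs(y - gy))
                total += weight * feature[gy][gx]
    return total


if __name__ == "__main__":
    offsets = {a: 0.5 for a in range(1, 5)}
    offsets.update({-a: -0.25 for a in range(1, 5)})
    for point in snake_coords_x(10, 10, offsets):
        print(point)
```

Because each offset is bounded by 1 and added to the previous position, two neighboring sampling points can never be more than one pixel apart in y, which is what keeps the kernel on a thin, continuous branch.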

Multidimensional Collaborative Attention Mechanism
The multidimensional collaborative attention mechanism (MCA) captures the interdependence of features in the spatial dimensions and between channels through its parallel branch structure, and thus enhances the YOLOv8 model's comprehension of the spatial properties of pine trees and their representations in images. At the same time, MCA strengthens the attention on the specific features of pine trees by fine-tuning the input feature maps. This can raise the accuracy of recognition, specifically in cases where the background is complex or the pine tree features are not obvious. This attention can also provide efficient performance gains by enhancing the network representation, improving the generalization of the model for pine tree recognition, which is valuable in variable natural environments.
As shown in Figure 7, the MCA module we used comprises three branches, each dedicated to modeling attention along one dimension: channel, width, or height. The squeeze transformation employs global mean pooling and standard deviation pooling to aggregate cross-dimensional feature responses. It also uses an adaptive combination mechanism to blend the mean-pooled and standard-deviation-pooled information, improving the representation of the feature descriptors. The excitation transformation of MCA resolves the trade-off between detection performance and computational overhead by dynamically capturing local feature interactions. The uppermost branch captures the interactions among features in the spatial dimension W. Similarly, the middle branch captures the relationships between features in the spatial dimension H. The lower branch captures the interactions among channels. In the first two branches, MCA uses permutation operations to capture the long-term dependencies between the channel dimension and either of the spatial dimensions. Finally, the results from the three branches are combined by simple averaging in the integration step. The symbol ⊗ in Figure 7 denotes broadcast element-wise multiplication, and ⊕ denotes broadcast element-wise summation. The overall design converts the input features into refined outputs of the same dimensions.
MCA can also be viewed as a computational unit that refines an input tensor into an output tensor of the same shape. Specifically, let $F$ denote the output of a convolutional layer that serves as the input feature map of the MCA module; the shape of $F$ is $C \times H \times W$, where $C$, $H$, and $W$ are the number of channels (filters) and the height and width of the spatial feature map, respectively. The MCA module feeds $F$ into each branch to refine its features. The transformation $T_{trans}$ is applied to $F$ on the three branches separately. Note that $F$ is rotated 90° anticlockwise along the H-axis and the W-axis in the first and second branches, respectively, while the original features are maintained after $T_{trans}$ in the third branch, generating the feature map denoted as $\overset{\frown}{F}$. Then, $\overset{\frown}{F}$ is input into the squeeze transformation to obtain the aggregated feature map $\hat{F}$, and $\hat{F}$ is passed into the excitation transformation to capture the feature interactions in the spatial dimensions and between channels, producing $\tilde{F}$. Next, $\tilde{F}$ is passed through the sigmoid activation function to obtain the attention weights $A$, and $A$ is applied to $\overset{\frown}{F}$ via element-wise multiplication to obtain the enhanced feature map $F'$. Finally, $F'$ is transformed back by the inverse of $T_{trans}$ to obtain $F''$. This process can be summarized in the following equations:

$$\overset{\frown}{F} = T_{trans}(F), \quad \hat{F} = T_{sq}(\overset{\frown}{F}), \quad \tilde{F} = T_{ex}(\hat{F}), \quad A = \sigma(\tilde{F}), \quad F' = A \otimes \overset{\frown}{F}, \quad F'' = T_{trans}^{-1}(F')$$

where $T_{trans}(\cdot)$ denotes the transformation of the input feature map, $T_{trans}^{-1}(\cdot)$ denotes the inverse transformation, $\sigma(\cdot)$ represents the sigmoid activation function, and $T_{sq}(\cdot)$ and $T_{ex}(\cdot)$ denote the squeeze and excitation transformations, respectively.
(1) Squeeze: A method for adaptively combining dual interaction information
In the squeeze module, effective interaction of features in the spatial and channel dimensions is achieved by combining mean pooling and standard deviation pooling [38]. High performance is maintained while the computational overhead is controlled. The process of the squeeze transformation is shown in Figure 8. The trainable floating-point parameters α and β, which weight the mean-pooled and standard-deviation-pooled features, must have values between zero and one. Because these weights are learned and applied dynamically, varying weights can be allocated to the mean-pooled and standard-deviation-pooled features at different stages of image feature extraction. This promotes the distinctiveness of the output feature descriptors.
(2) Excitation: A method for adaptively capturing local feature interactions
The excitation transformation is employed to capture the local interactions of features between channels, making full use of the dimension-related feature descriptors produced by the squeeze transformation [38].
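The squeeze step can be sketched in pure Python as below. The combination rule used here, 0.5·(mean + std) + α·mean + β·std, is an assumption taken from the general design of the MCA reference [38] and is not stated in this text; the population standard deviation and the function name `squeeze` are likewise choices made only for this illustration.

```python
import math


def squeeze(feature, alpha, beta):
    """Aggregate each channel of a C x H x W nested list into one descriptor.

    Assumed combination: 0.5 * (mean + std) + alpha * mean + beta * std,
    with alpha and beta expected to lie between zero and one.
    """
    descriptors = []
    for channel in feature:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        # Population standard deviation (divide by N); a simplifying assumption.
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        descriptors.append(0.5 * (mean + std) + alpha * mean + beta * std)
    return descriptors


if __name__ == "__main__":
    flat = [[1.0, 1.0], [1.0, 1.0]]       # no contrast: std is 0
    textured = [[0.0, 2.0], [0.0, 2.0]]   # same mean, std is 1
    print(squeeze([flat, textured], alpha=0.5, beta=0.5))
```

The two example channels have the same mean but different spread, so mean pooling alone could not tell them apart, while the combined descriptor can.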
As shown in Figure 9, the channel feature weight is obtained by taking the channel feature descriptor $\hat{F}$ as input via Equation (6). In this process, only the interactions between the m-th channel and its $K_C$ neighbors are considered. The channel feature weight $\tilde{F}$ can be computed by the following equation:

$$\tilde{F}_m = \sum_{n=1}^{K_C} W_C^{n} \hat{F}_m^{n}, \quad \hat{F}_m^{n} \in \Omega_m^{K_C}$$

where $\Omega_m^{K_C}$ represents the set of feature descriptors of the $K_C$ adjacent channels of the m-th channel, and $W_C$ represents the learnable parameters, which are shared by all channels rather than unique to any one of them. This transformation can be implemented by a 2D convolution with a kernel size of $(1, K_C)$, which can be expressed as:

$$\tilde{F} = \mathrm{Conv2D}_{1 \times K_C}(\hat{F})$$

The kernel size $K_C$ can then be approximately determined once $C$ is given.
(3) Integration: Triple focus collaboration
The augmented feature map $F''$ is obtained in each of the three branches, giving $F''_W$, $F''_H$, and $F''_C$, respectively. These are combined into the final refined feature map $F'''$ by simple averaging in the integration stage:

$$F''' = \frac{1}{3}\left(F''_W + F''_H + F''_C\right)$$
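The excitation and integration steps amount to a shared one-dimensional filter over neighboring descriptors followed by an average of the three branch outputs. The sketch below illustrates this with plain lists; the zero padding at the ends, the function names, and the example weights are assumptions made for illustration, and the rule that derives the kernel size from C is not reproduced here.

```python
import math


def excite(descriptors, weights):
    """Local interaction: one shared kernel of odd length K_C slides over the descriptors."""
    k = len(weights)
    pad = k // 2
    # Zero padding (an assumption) keeps one output per input descriptor.
    padded = [0.0] * pad + list(descriptors) + [0.0] * pad
    return [sum(w * padded[m + n] for n, w in enumerate(weights))
            for m in range(len(descriptors))]


def sigmoid(value):
    """Map an excitation output to an attention weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def integrate(f_w, f_h, f_c):
    """Element-wise average of the three refined branch outputs (flat lists of equal length)."""
    return [(a + b + c) / 3.0 for a, b, c in zip(f_w, f_h, f_c)]


if __name__ == "__main__":
    squeezed = [1.0, 2.0, 3.0, 4.0]
    attention = [sigmoid(v) for v in excite(squeezed, [0.25, 0.5, 0.25])]
    print(attention)
    print(integrate([3.0, 0.0], [6.0, 0.0], [9.0, 3.0]))
```

Sharing one small kernel across all positions keeps the number of extra parameters at K_C per branch, independent of the number of channels.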

WIoUv3 Loss Function
During the training of the Pine-YOLO model, a bounding box loss function is needed to guide the regression and reduce the deviation between the predicted frame and the real frame, which increases the efficiency of the detection model. The loss function of YOLOv8 combines three terms: the distribution focal loss, the class loss, and the bounding box regression loss. The bounding box regression loss of YOLOv8 employs the CIoU function with the following formula:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{16}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \quad \alpha = \frac{v}{(1 - IoU) + v}, \quad c^2 = w_c^2 + h_c^2$$

where $w^{gt}$, $h^{gt}$, $w$, and $h$ are the width and height of the real frame and the predicted frame, respectively. $\alpha$ is the weight function, $v$ represents the similarity of the width-to-height ratios, and $IoU$ is the intersection over union of the real frame and the predicted frame. $b^{gt}$ and $b$ denote the central points of the real frame and the predicted frame, respectively, and $\rho$ is the Euclidean distance between $b^{gt}$ and $b$. $c$ is the length of the diagonal of the smallest rectangle enclosing the real frame and the predicted frame, and $w_c$ and $h_c$ are the width and height of that rectangle.
The CIoU loss function has an obvious advantage over the traditional IoU in bounding box regression, because it considers the differences in position, size, and shape between the predicted frame and the real frame. Factors such as the regression distance, the centroid offset, and the overlap area make the bounding box regression converge better. However, in the formula for CIoU in Equation (16), the parameter v only evaluates the similarity of the aspect ratios and does not accurately represent the actual relationship between the widths and heights of the real frame and the predicted frame. This worsens the penalty for low-quality samples and weakens the generalization ability accordingly.
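For a single pair of boxes, the CIoU loss can be worked through with the short sketch below. It uses the standard CIoU definition with boxes given as (x1, y1, x2, y2) corners; the function names and the corner-based box format are assumptions made for illustration, not part of the training code used in the study.

```python
import math


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, gt):
    """1 - IoU + center-distance penalty + aspect-ratio penalty."""
    iou = box_iou(pred, gt)
    # Squared distance between the two box centers (rho squared).
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) / 2) ** 2 \
        + (((pred[1] + pred[3]) - (gt[1] + gt[3])) / 2) ** 2
    # Squared diagonal of the smallest rectangle enclosing both boxes (c squared).
    w_c = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_c = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = w_c ** 2 + h_c ** 2
    # Aspect-ratio term v and its weight alpha.
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    v = 4 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (1, 0, 3, 2)))   # same shape, shifted: v is 0
    print(ciou_loss((0, 0, 4, 1), (0, 0, 2, 2)))   # different aspect ratio: v > 0
```

In the first example the two boxes have identical aspect ratios, so v vanishes and only the overlap and center-distance terms contribute, which illustrates why v alone says little about how far the actual widths and heights are apart.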
To address the disadvantages of CIoU, we adopted WIoUv3 (Wise-IoU v3) as the bounding box loss function in this study. It incorporates a weighting coefficient to adjust the contribution of each predicted frame relative to the real frame, takes the quality of the samples into account, which CIoU does not, and evaluates the quality of the anchor frames through a dynamic non-monotonic focusing mechanism [39]. WIoUv3 is built upon WIoUv1, which is defined as follows:

$$L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}, \quad L_{IoU} = 1 - IoU$$

$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$$

Here, $R_{WIoU}$ is based on the normalized length of the line connecting the two centroids and represents the loss of a high-quality anchor frame. As shown in Figure 10, the blue and green rectangles represent the anchor frame and the target frame, respectively. $W_g$ and $H_g$ are the width and height of the smallest rectangle enclosing the anchor frame and the target frame. $x$ and $y$ are the coordinates of the centroid of the anchor frame, while $x_{gt}$ and $y_{gt}$ are the coordinates of the centroid of the target frame. Compared with WIoUv1, WIoUv3 introduces a gradient gain allocation method to reduce the influence of harmful gradients. This preserves the effect of high-quality anchor frames and strengthens the generalizability of the Pine-YOLO model. The WIoUv3 formulas are:

$$L_{WIoUv3} = r \cdot L_{WIoUv1}, \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}, \quad \beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)$$

where $\beta$ is the outlier degree, which measures the quality of the anchor frame; a smaller outlier degree indicates a better anchor frame. $r$ is a non-monotonic focusing factor that mitigates the large harmful gradients caused by low-quality samples. $L_{IoU}^{*}$ is a monotonic focusing factor, while $\overline{L_{IoU}}$ is a sliding average of $L_{IoU}$ with momentum $m$. $\alpha$ and $\delta$ are hyper-parameters. When $\beta = \delta$, $r = 1$. When $\beta$ equals a specific constant value, the anchor frame obtains the highest gradient gain.
Because $\overline{L_{IoU}}$ is a running statistic, $\beta$ changes with it, so the gradient gain of the anchor frame is adjusted continuously. According to the current quality of the anchor frame, the loss function can dynamically adjust the gradient gain allocation strategy.
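The behavior of the non-monotonic focusing factor is easier to see numerically. The sketch below evaluates WIoUv1 and WIoUv3 for single box pairs in plain Python, so no gradients are involved. The hyper-parameter values α = 1.9 and δ = 3, the momentum of 0.01, the initial running mean of 1.0, and all function and class names are assumptions made for illustration; the text does not state the values used for Pine-YOLO.

```python
import math


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def wiou_v1(anchor, target):
    """Return (L_WIoUv1, L_IoU) for one anchor/target pair."""
    l_iou = 1.0 - box_iou(anchor, target)
    dx = ((anchor[0] + anchor[2]) - (target[0] + target[2])) / 2
    dy = ((anchor[1] + anchor[3]) - (target[1] + target[3])) / 2
    w_g = max(anchor[2], target[2]) - min(anchor[0], target[0])
    h_g = max(anchor[3], target[3]) - min(anchor[1], target[1])
    r_wiou = math.exp((dx ** 2 + dy ** 2) / (w_g ** 2 + h_g ** 2))
    return r_wiou * l_iou, l_iou


def focusing_factor(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


class WIoUv3:
    """Keeps the sliding average of L_IoU that defines the outlier degree beta."""

    def __init__(self, alpha=1.9, delta=3.0, momentum=0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_l_iou = 1.0  # assumed starting value of the running mean

    def __call__(self, anchor, target):
        loss_v1, l_iou = wiou_v1(anchor, target)
        beta = l_iou / self.mean_l_iou
        gain = focusing_factor(beta, self.alpha, self.delta)
        # Update the running mean after using it, so beta tracks the training state.
        self.mean_l_iou = (1 - self.momentum) * self.mean_l_iou + self.momentum * l_iou
        return gain * loss_v1


if __name__ == "__main__":
    for beta in (0.5, 1.0, 1 / math.log(1.9), 3.0, 6.0):
        print(round(beta, 3), round(focusing_factor(beta), 4))
```

Setting the derivative of r with respect to β to zero shows that, for fixed α and δ, the gain peaks at β = 1/ln α, so anchors of intermediate quality receive the largest gain while both very good and very poor anchors are down-weighted.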

Parameter Settings and Evaluation Metrics
With our newly designed Pine-YOLO, we carried out a series of detection experiments on pine wilt disease using our UAV images collected in Weihai. The experiments were performed with an NVIDIA GeForce RTX 3060 GPU under the Windows 10 operating system, with PyTorch as the deep learning framework. An automatic optimizer was employed with the following parameters: an initial learning rate of 0.01, a learning rate decay factor of 0.01, a batch size of 16, a momentum value of 0.937, a weight decay coefficient of 0.0005, a non-maximum suppression threshold of 0.5, a patience value of 50, and a pre-trained YOLOv8 model.
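For reference, the reported settings can be collected in one place as below. The keys follow Ultralytics-style argument names, but this mapping is an assumption: the text does not list the exact arguments that were passed, and treating the decay factor as the ratio of final to initial learning rate is one possible reading.

```python
# Hyper-parameters reported in the text, keyed by Ultralytics-style argument names.
# The name mapping is an assumption; the text does not list the exact arguments used.
TRAIN_SETTINGS = {
    "optimizer": "auto",     # "automatic optimizer"
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.01,             # learning rate decay factor (read here as final LR / initial LR)
    "batch": 16,             # batch size
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "patience": 50,          # early-stopping patience
    "pretrained": True,      # start from pre-trained YOLOv8 weights
}
NMS_IOU_THRESHOLD = 0.5      # non-maximum suppression threshold


def final_learning_rate(settings):
    """Learning rate at the end of the schedule under the reading lr0 * lrf."""
    return settings["lr0"] * settings["lrf"]


if __name__ == "__main__":
    print(final_learning_rate(TRAIN_SETTINGS))
```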
The evaluation metrics employed for the detection and classification outcomes are mAP@0.5, mAP@0.5:0.95, Precision, Recall, and F1-score. The formulas to calculate these metrics are given below:

$$Precision = \frac{TP}{TP + FP}$$

Precision denotes the probability that a predicted target is correct. TP means that a diseased tree is correctly detected. FP means that a healthy pine tree is detected as diseased.

$$Recall = \frac{TP}{TP + FN}$$

Recall denotes the probability that a diseased pine tree is correctly predicted. FN means that a diseased tree is incorrectly predicted to be healthy.

$$AP = \int_0^1 P(R)\,dR, \quad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

AP is the area under the Precision-Recall curve, and mAP is the mean of AP over the N classes. mAP@0.5 is the mAP at an IoU threshold of 0.5. mAP@0.5:0.95 is the average of the mAP values at IoU thresholds from 0.5 to 0.95 with a step size of 0.05.

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

The F1-score provides a single metric that measures the balance between Precision and Recall.
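These metrics are simple enough to check by hand. The sketch below uses the standard definitions; the all-point interpolation used for AP and the function names are assumptions, since the text does not specify how the Precision-Recall curve is integrated. The example reproduces the F1-score from the Precision and Recall values reported for Pine-YOLO.

```python
def precision(tp, fp):
    """Fraction of detections that are truly diseased trees."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of diseased trees that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve; inputs sorted by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):  # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def map_50_95(ap_at_threshold):
    """Mean of AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # Precision and Recall reported for Pine-YOLO, in percent.
    print(round(f1_score(91.31, 85.72), 2))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than their arithmetic mean would.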

Detection Results from Alternative Methods
Figure 11a shows five images containing diseased trees and disease annotations from the testing dataset. Each image was taken under different conditions, such as different lighting and shading. These conditions directly affect the brightness and contrast of the images and may mask or emphasize certain features of the pine trees, thus affecting the recognition results. In addition, the diversity of ground conditions, such as bare soil, grass, rocks, and other surface features, may also interfere with the detection of pine trees. In particular, where the ground color is similar to that of a diseased pine tree, it is difficult for the detection algorithm to distinguish the pine tree from the background. At the same time, other background factors, such as vegetation, may cause pine trees to be confused with the background, further increasing the likelihood of detection errors. To address these problems, we performed detection with our newly designed Pine-YOLO model and show the results in Figure 11d. Table 1 summarizes the quantitative detection results of the Pine-YOLO method applied to the testing images. The Pine-YOLO model exhibits excellent performance, particularly in terms of mAP@0.5 (90.69%) and Precision (91.31%). This demonstrates its high reliability in locating and predicting pine trees accurately. Pine-YOLO also achieves a notable Recall of 85.72%, indicating its effectiveness in identifying most of the diseased pine trees. As the harmonic mean of Precision and Recall, the F1-score reaches a relatively high value of 88.43%, further confirming that Pine-YOLO strikes a good balance between detection accuracy and coverage. The symbol "*" denotes that the model and its data are sourced from other references in the same research field.
To make a clear comparison between our results and other deep learning algorithms, we also carried out detections using Faster R-CNN, RetinaNet, YOLOv5, YOLOX, and DETR. The detection results for all these algorithms are listed in Table 1, Figure 12, and Figure S1 in the Supplementary Materials. These data show that the other algorithms also perform well on some metrics; for example, YOLOv5 provides a relatively high mAP@0.5 of 82.72%. However, as a whole, they produced a considerable number of missed detections and false detections, resulting in lower evaluation metrics. Moreover, Pine-YOLO also demonstrated superior performance in the more stringent metric of mAP@0.5:0.95 (49.72%), indicating better localization across different IoU thresholds. That is, Pine-YOLO demonstrated an overall superior ability to identify diseased pine trees correctly, confirming the accuracy and robustness of this model under varying lighting conditions.
To further validate our findings, we compared the performance of Pine-YOLO with two previous methods reported in the literature within the same field, MA-Unet [40] and YOLOv5-PWD [41]. The data presented in Table 1 show that Pine-YOLO surpasses both MA-Unet and YOLOv5-PWD in terms of Precision and Recall, and therefore also in the F1-score. This comparison demonstrates the strong performance of Pine-YOLO in accurately identifying and classifying diseased pine trees.

Ablation Experiment
In this study, we introduced DSConv, MCA, and WIoUv3 into the YOLOv8 network to form the Pine-YOLO model.DSConv enhances the perception of the fine tubular structure of discolored pine branches by adaptively focusing on the fragile and curved local features of the tubular structure.MCA helps the model better understand the spatial properties of the pine trees and their representation in the image.It strengthens the attention of the model to the specific features of pine trees, and thus improves recognition accuracy and the generalization ability of the model for diseased pine tree recognition in variable natural environments.WIoUv3 uses a dynamic non-monotonic focusing mechanism (FM) to make the most realistic gradient gain allocation decision at each moment based on the current situation, improving the overall recognition accuracy and robustness.
The quantitative findings of the ablation experiments are presented in Table 2. All model variants show significant performance improvements on the testing dataset. The implementation of DSConv increases the mAP@0.5 from 78.46% to 85.34%. This substantial enhancement indicates that DSConv effectively improves the Precision in locating pine trees. The introduction of the MCA module alone boosts the Recall to 85.71%, underscoring its crucial role in capturing pine tree features. WIoUv3 achieves a competitive edge under the more comprehensive mAP@0.5:0.95 evaluation metric, with the mAP reaching 50.61%, demonstrating a robust capability. When these three components are integrated into the YOLOv8 network to construct the Pine-YOLO model, it achieves the best detection results, which are shown in detail in Table 2 and Figure 12. The data in Table 2 show that Pine-YOLO attains peak performance across all key metrics, notably a mAP@0.5 of 90.69% and an F1-score of 88.43%, which verifies the clear benefit of combining DSConv, MCA, and WIoUv3 for the detection of pine wilt disease. In Figure 12, none of the module combinations shows any missed detection in the testing images, although false detections remain. With the successive addition of these modules, the false detection rate gradually decreases. These observations indicate that the newly designed Pine-YOLO model extracts the target features effectively and thus significantly improves recognition accuracy and reliability. Specifically, the combined effect of the different modules enables the model to distinguish the target from the background more accurately when it encounters complex image content. That is, in scenes with complex or similar features, Pine-YOLO effectively reduces false recognition and ensures highly accurate target detection. Additional ablation experimental results are included in Figure S2 in the Supplementary Materials.

The symbol "√" denotes the inclusion of a module or classification network in the baseline network, whereas the symbol "-" signifies its absence.
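The F1-score quoted in the ablation results is the harmonic mean of Precision and Recall. The following is a minimal sketch of that relationship using the standard definitions; the function names and the detection counts are hypothetical and are not taken from Table 2.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and Recall from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of Precision and Recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not values from the paper).
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"Precision={p:.3f}, Recall={r:.3f}, F1={f1_score(p, r):.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1-score by improving only Precision or only Recall.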

Pine-YOLO Composite Indicator Assessment and Discussion
The validation losses of Pine-YOLO, which are depicted in Figure 13a, are also analyzed in this study. The gradual decrease in the different loss metrics reflects the progress of the model in learning to identify pine trees more accurately. A significant decrease in the Validation Box Loss is found in Figure 13a, which reveals the increasing accuracy in locating the bounding box of a diseased pine tree, a property that is critical for fine-tuning target detection in complex images. The continued decrease in the Validation Class Loss in Figure 13a indicates the increased accuracy of Pine-YOLO in identifying the class of the targeted pine tree. In addition, the decrease in the Validation Distribution Focal Loss (DFL) further confirms that Pine-YOLO continues to improve localization details, particularly the accuracy at the edges of pine trees. Consequently, these results demonstrate the effectiveness of the design and training strategy of Pine-YOLO in improving the accuracy of diseased pine tree detection. Meanwhile, when the Pine-YOLO model deals with new and different data, it shows a smooth Validation Loss curve, indicating better consistency and reliability, and thus better generalization, of this newly designed model. The convergence curves of the performance metrics for Pine-YOLO, which are shown in Figure 13b, reveal the detection performance at each training stage. Although these values, e.g., the Precision, Recall, mAP@0.5, and mAP@0.5:0.95, can reach a relatively high level, the volatility of the Precision and Recall values in this model reminds us that we need to balance the ability to recognize positive and negative samples during the training process.
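To make the DFL term more concrete, the following is a minimal sketch of the distribution focal loss for a single bounding-box side, following the commonly used formulation in which the continuous target distance is split between its two neighboring integer bins. The function name, the bin layout, and the example probabilities are assumptions for illustration; the training code used in this work operates on batched tensors and is not reproduced here.

```python
import math
from typing import List


def distribution_focal_loss(probs: List[float], target: float) -> float:
    """DFL for one box side.

    probs  : softmax probabilities over integer bins 0 .. n-1 (assumed layout)
    target : continuous target distance, expected in [0, n-1]
    """
    eps = 1e-12  # avoids log(0) for bins with zero probability
    left = min(int(math.floor(target)), len(probs) - 2)  # keep the right bin in range
    right = left + 1
    weight_left = right - target   # closer to the left bin -> larger left weight
    weight_right = target - left
    return -(weight_left * math.log(max(probs[left], eps))
             + weight_right * math.log(max(probs[right], eps)))


# Hypothetical predicted distributions for a target distance of 2.25 bins.
diffuse = [0.1, 0.2, 0.6, 0.1]
sharp = [0.0, 0.0, 0.75, 0.25]
print(distribution_focal_loss(diffuse, 2.25))
print(distribution_focal_loss(sharp, 2.25))
```

The loss is smaller when the predicted probability mass concentrates on the two bins adjacent to the target, which is why a falling DFL curve is read as sharper edge localization.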
Beyond the analyses of accuracy and detection performance presented above, the model also has a distinct advantage in terms of size. In our specific experimental environment configuration, the Pine-YOLO model is optimized to a compact size of 6.7 MB, whereas standard YOLOv5 models typically require around 18.0 MB. The enhanced versions of YOLOv5, exemplified by the YOLOv5s-CA and YOLOv5s-ECA models of Zhang et al. [33], have model sizes of 16.0 MB and 14.4 MB, respectively. These models demonstrate valuable improvements in detection accuracy through the integration of advanced attention mechanisms. The compact size of Pine-YOLO not only eases deployment but also indirectly boosts operational efficiency and processing speed, which are vital for real-time applications on resource-constrained platforms. Future research can focus on more effective image pre-processing techniques and on refining the model structure to enhance the robustness and Precision of the model across various contexts.
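The size comparison can be expressed as a relative reduction. The short sketch below only re-computes ratios from the model sizes quoted in the text; the helper name is hypothetical.

```python
def size_reduction_percent(reference_size: float, compact_size: float) -> float:
    """Percentage by which compact_size is smaller than reference_size (same unit)."""
    return (1.0 - compact_size / reference_size) * 100.0


PINE_YOLO_SIZE = 6.7  # model size quoted in the text
reported_sizes = {"YOLOv5": 18.0, "YOLOv5s-CA": 16.0, "YOLOv5s-ECA": 14.4}

for name, size in reported_sizes.items():
    reduction = size_reduction_percent(size, PINE_YOLO_SIZE)
    print(f"Pine-YOLO is {reduction:.1f}% smaller than {name}")
```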

Conclusions
In this paper, we incorporated DSConv, MCA, and WIoUv3 into a YOLOv8 network to construct a newly designed Pine-YOLO model for pine wilt disease detection. We utilized images captured exclusively in Weihai City to construct a dataset via a sliding window method, in which all diseased pine tree samples were individually labeled and verified on-site. Then, we used this dataset to train and test the detection performance of Pine-YOLO. The results show that the F1-score of this model is 88.43%, which is 14.85%, 14.35%, 13.67%, 12.14%, and 6.43% higher than that of Faster-RCNN, RetinaNet, YOLOv5, YOLOX, and DETR, respectively, suggesting excellent performance of our detection model. We also performed an ablation experiment to analyze the contribution of each new module introduced into the Pine-YOLO model. DSConv enhances the perception of geometrical structures by adaptively focusing on the fragile and curved local features of tubular structures in diseased pine tree branches. MCA strengthens the model's attention to pine tree-specific features in complex and changing natural environments. This mechanism efficiently captures feature interdependencies between spatial dimensions and channels through parallel branching structures. Moreover, the squeeze transform adaptively aggregates bi-dimensional feature responses, while the excitation transform adaptively captures local feature interactions, allowing for fine-tuning of the input feature map. WIoUv3 focuses on anchor frames of ordinary quality and promotes the overall recognition accuracy and robustness of the detector. Therefore, the newly designed Pine-YOLO model overcomes some disadvantages of conventional deep learning algorithms in the field of pine wilt disease detection. Hence, it can be used by forestry managers for rapid detection of pine wilt disease, which helps to maintain the health and stability of the ecological environment.
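The sliding window method used to build the dataset can be sketched as follows. This is an illustrative outline only: the tile size, the overlap, and the function name are assumptions, since the exact cropping parameters are not restated in this section.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def sliding_window_boxes(width: int, height: int, tile: int, overlap: int) -> List[Box]:
    """Return crop boxes that cover a width x height image with square tiles.

    Neighboring tiles share `overlap` pixels; overlap=0 gives a regular grid.
    The last tile in each direction is shifted back so it stays inside the image.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(length: int) -> List[int]:
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile + 1, stride))
        if positions[-1] != length - tile:
            positions.append(length - tile)  # final tile aligned to the image border
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


# Hypothetical image and tile sizes for illustration.
for box in sliding_window_boxes(width=1000, height=640, tile=640, overlap=128):
    print(box)
```

Using an overlap larger than the typical crown size reduces the chance that a diseased tree is cut in half at a tile border, at the cost of producing more tiles.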

Figure 1. Detailed geographical location of the investigated area in this work with corresponding remote sensing images.

Figure 2. Schematic diagram of image cropping with (a) regular grid and (b) overlapping region, respectively.

Figure 3. Graphical representation of the mosaic data pre-processing technique integrated within the YOLOv8 model.

Figure 4. Diagram of the Pine-YOLO model.

Figure 5. Illustration of the dynamic snake convolution added into YOLOv8 model, which uses input feature maps to learn deformations and selectively emphasizes local aspects of elongated zigzags.

Figure 6. Illustration depicting (a) the coordinate calculation process of DSConv and (b) visualization of the receptive field of DSConv.

Figure 8. Illustration of the squeeze transform in the MCA module, using an adaptive mechanism for aggregating global mean and standard deviation information.

Figure 8 illustrates that the input is the spatial information of the feature map, which is aggregated by global average pooling and standard deviation pooling. This process produces two distinct channel feature statistics, namely the average-pooled and standard-deviation-pooled feature descriptors. More precisely, the two pooling processes are computed separately for each channel.
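The two pooling operations can be written out using the standard definitions of global average pooling and global standard deviation pooling. The notation in this sketch is assumed and may differ from the original equations: $F_c(i, j)$ denotes the value of channel $c$ at spatial position $(i, j)$, $H$ and $W$ denote the height and width of the feature map, and the subscripts avg and std label the two descriptors.

```latex
\begin{align}
F_{\mathrm{avg}}(c) &= \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j), \\
F_{\mathrm{std}}(c) &= \sqrt{\frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W}
    \bigl( F_c(i, j) - F_{\mathrm{avg}}(c) \bigr)^{2}}
\end{align}
```

The average descriptor summarizes the overall response strength of a channel, while the standard deviation descriptor captures how unevenly that response is distributed across the spatial positions.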

Figure 9. Illustration of the designed excitation transform in the MCA module.

Figure 10. Schematic diagram of the anchor frame and the target frame.
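The overlap between an anchor frame and a target frame is commonly measured by the intersection over union (IoU), on which IoU-based regression losses are built. The following is a minimal sketch of plain IoU only, assuming boxes given as corner coordinates; it is not the WIoUv3 loss, whose additional weighting terms are not reproduced here, and the function name and example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(anchor: Box, target: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = anchor
    bx1, by1, bx2, by2 = target
    # Width and height of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical anchor and target frames for illustration.
print(iou((0.0, 0.0, 4.0, 4.0), (2.0, 2.0, 6.0, 6.0)))
```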

Figure 13. Convergence curves of (a) validation loss and (b) performance metrics for the Pine-YOLO model.

Author Contributions: J.Y.: Conceptualization, Methodology, Software. B.S.: Data curation, Writing-Original draft preparation. X.C. and M.Z.: Visualization, Investigation, Writing-Original draft preparation. X.D. and H.L.: Software, Validation. F.L.: Writing-Reviewing and Editing. L.Z. and Y.-B.L.: Conceptualization, Methodology, Supervision. C.X.: Data curation. R.K.: Remote Sensing Data Acquisition and Preprocessing. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the Open Project of Weihai Key Laboratory of Energy and Mineral Resources Investigation and Evaluation; New Liberal Arts Research and Reform Project of the Ministry of Education [grant number 2021140084] and Youth Opening Project of National Space Science Data Center [grant number NSSDC2302001]; National Key R&D Program of China [grant number 2022YFF0711400].

Table 1. Detection results obtained from different deep learning algorithms.

Table 2. Detection results obtained from quantitative ablation experiments. YOLOv8 serves as the benchmark network.